Occurrence data


Teaching: 10 min
Exercises: 10 min
  • download occurrence data through an API.

  • filter occurrence data.


2.1 API

~1 billion biodiversity records on GBIF.org


What does an API look like?

Paste this URL into a web browser: http://api.gbif.org/v1/occurrence/search?year=1800,1899

What is an API? (Application Programming Interface)
API stands for Application Programming Interface, a software intermediary that allows two applications to talk to each other. Each time you use an app like Facebook, send an instant message, or check the weather on your phone, you’re using an API.
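As a minimal sketch of how such a request URL is put together, the base-R snippet below assembles the query string from named parameters. The helper name `gbif_search_url` is hypothetical (it is not part of dismo or any GBIF package); `year` is the parameter used in the URL above, and `limit` is assumed here as a second search parameter for illustration. The code only builds the URL and does not contact the server.

```r
# Hypothetical helper: build a GBIF occurrence-search URL from named parameters.
gbif_search_url <- function(...) {
  params <- list(...)
  base <- "http://api.gbif.org/v1/occurrence/search"
  if (length(params) == 0) {
    return(base)
  }
  # Encode each value; reserved characters such as "," are left untouched
  values <- vapply(
    params,
    function(v) utils::URLencode(as.character(v)),
    character(1)
  )
  query <- paste(names(params), values, sep = "=", collapse = "&")
  paste0(base, "?", query)
}

# Same query as the browser example: records dated 1800 to 1899
url <- gbif_search_url(year = "1800,1899")
print(url)

# Several parameters are joined with "&"
url_limited <- gbif_search_url(year = "1800,1899", limit = 5)
print(url_limited)
```

Pasting the first printed URL into a browser should return the same JSON response as the link above.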

2.1 Download occurrence data

gbif() is a function in the dismo package that downloads occurrences directly through the GBIF API; here we query the number of records of the nine-banded armadillo without downloading them (download=FALSE)

[1] 7520

by setting download=TRUE, we can download all records


  occ_raw <- gbif(genus="Dasypus",species="novemcinctus",download=TRUE) 
  save(occ_raw,file = "data/occ_raw.rdata")

# to view the first few records of the occurrence dataset, use:
head( occ_raw )

2.2 List of biodiversity databases and their R packages

Table 1. List of biodiversity databases and their R packages.

Database     R package
BISON        rbison
eBird        rebird
GBIF         rgbif
iNaturalist  rinat
VertNet      rvertnet
iDigBio      ridigbio

Conveniently, you can query several of these databases at once using the spocc package, developed by rOpenSci.

2.3 Occurrence data in Darwin Core

Take a look at the columns of the GBIF occurrences.

names(occ_raw)[1:20]
 [1] "acceptedNameUsage"             "acceptedScientificName"       
 [3] "acceptedTaxonKey"              "accessRights"                 
 [5] "adm1"                          "adm2"                         
 [7] "associatedReferences"          "basisOfRecord"                
 [9] "behavior"                      "bibliographicCitation"        
[11] "catalogNumber"                 "class"                        
[13] "classKey"                      "cloc"                         
[15] "collectionCode"                "collectionID"                 
[17] "continent"                     "coordinatePrecision"          
[19] "coordinateUncertaintyInMeters" "country"                      

The meanings of these columns (terms) are defined by Darwin Core. Refer to the Darwin Core quick reference guide for more information.

A few columns to highlight:

2.4 Clean occurrence data

Since some of our records have missing or inappropriate coordinates, we need to remove them from our dataset. To do this, we create a new dataset named “occ_clean”, which is a subset of the “occ_raw” dataset with the records that lack latitude and/or longitude removed.

# here we remove erroneous coordinates, where either the latitude or longitude is missing
occ_clean <- subset(occ_raw,(!is.na(lat))&(!is.na(lon))) 
# "!" negates a logical value (TRUE becomes FALSE and vice versa)

#Show the number of records that are removed from the dataset.  
cat(nrow(occ_raw)-nrow(occ_clean), "records are removed")
2401 records are removed
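To see the filter on something small, here is a sketch that applies the same subset() call to a toy data frame. The values in `toy_raw` are made up for illustration and are not GBIF records; `complete.cases()` is shown as an equivalent base-R alternative.

```r
# Toy stand-in for occ_raw (made-up coordinates, two of them incomplete)
toy_raw <- data.frame(
  lon = c(-84.5, NA, -100.5, -90.9),
  lat = c(10.5, 19.1, NA, 16.2)
)

# Keep only rows where both latitude and longitude are present
toy_clean <- subset(toy_raw, !is.na(lat) & !is.na(lon))
cat(nrow(toy_raw) - nrow(toy_clean), "records are removed\n")

# Equivalent filter using complete.cases() on the two coordinate columns
toy_clean2 <- toy_raw[complete.cases(toy_raw[c("lat", "lon")]), ]
```

Rows 2 and 3 each miss one coordinate, so both filters keep only rows 1 and 4.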

Remove duplicated data based on latitude and longitude

dups <- duplicated( occ_clean[c("lat","lon")]  )
occ_unique <- occ_clean[!dups,]
cat(nrow(occ_clean)-nrow(occ_unique), "records are removed")
1472 records are removed
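A small sketch of how duplicated() behaves may help here. The data frame below is made up for illustration: rows 1 and 2 share both coordinates, while row 4 shares only the longitude.

```r
# Toy stand-in for occ_clean (made-up values)
toy_clean <- data.frame(
  lon  = c(-84.5, -84.5, -100.5, -84.5),
  lat  = c(10.5, 10.5, 31.3, 11.0),
  year = c(1960, 1975, 1980, 1990)
)

# TRUE marks a row whose (lat, lon) pair already appeared in an earlier row
dups <- duplicated(toy_clean[c("lat", "lon")])
print(dups)

# Drop the flagged rows; the first record at each location is the one kept
toy_unique <- toy_clean[!dups, ]
cat(nrow(toy_clean) - nrow(toy_unique), "records are removed\n")
```

Only row 2 is flagged. Because the first record at each location is retained, which record survives depends on the row order of the data.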

show the frequency table of “basisOfRecord”


                 13                2444                   1 
                 33                  27                 921 
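The call that produced this table is not shown. A frequency table of this kind is commonly built with table(); the sketch below does so on a made-up column, so the category values and counts are illustrative only and do not correspond to the numbers above.

```r
# Toy stand-in for occ_unique (made-up values)
toy_unique <- data.frame(
  basisOfRecord = c(
    "PRESERVED_SPECIMEN", "HUMAN_OBSERVATION", "HUMAN_OBSERVATION",
    "FOSSIL_SPECIMEN", "PRESERVED_SPECIMEN", "HUMAN_OBSERVATION"
  )
)

# Count how many records fall into each category
freq <- table(toy_unique$basisOfRecord)
print(freq)

# Look up the count for a single category by name
n_specimen <- freq[["PRESERVED_SPECIMEN"]]
```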

only keep records that are associated with a specimen

occ_unique_specimen <- subset(occ_unique, basisOfRecord=="PRESERVED_SPECIMEN")
cat(nrow(occ_unique_specimen), "out of ", nrow(occ_unique), "records are specimen")
921 out of  3647 records are specimen

show the histogram of “year”


plot of chunk clean_data5

to filter the species records by year, in this example 1950 to 2000:

occ_final <- subset(occ_unique_specimen, year>=1950 & year <=2000)

show a quick summary of years in the data

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1950    1965    1976    1977    1989    2000 
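One detail worth knowing about the year filter: subset() silently drops rows whose condition evaluates to NA, so records with a missing year are removed as well. The sketch below shows this on made-up values.

```r
# Toy stand-in for occ_unique_specimen (made-up values; record 5 has no year)
toy <- data.frame(
  id   = 1:5,
  year = c(1949, 1950, 1976, 2000, NA)
)

# Both bounds are inclusive; the NA row is dropped along with year 1949
toy_final <- subset(toy, year >= 1950 & year <= 2000)
print(toy_final$id)

# Quick summary of the years that remain
print(summary(toy_final$year))

# With "[" indexing, wrap the condition in which() to drop NA rows too
toy_final2 <- toy[which(toy$year >= 1950 & toy$year <= 2000), ]
```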

2.5 Make occurrence data spatial

make occ spatial, assign coordinate reference system to spatial points

occ_final_COPY <- occ_final
coordinates(occ_final) <- ~ lon + lat

Note that after making the data frame spatial, the object is no longer a data.frame; it has been converted into a SpatialPointsDataFrame

cat("the previous object is: ", class(occ_final_COPY),"\n")  
the previous object is:  data.frame 
cat("the new object is: ",class(occ_final),"\n" )
the new object is:  SpatialPointsDataFrame 

we could view the coordinates and the data that are associated with the spatial object

            lon      lat
3452  -84.55206 10.49557
3454 -104.51337 19.13245
3458 -100.51001 31.30495
3459 -103.90280 19.16453
3462  -90.88333 16.15000
3467  -94.82222 16.43611

read the CRS of the spatial object; it is NA because it has not been defined.

CRS arguments: NA 

now we define a CRS object

# Define the coordinate system that will be used. Here we show several examples:
myCRS1 <- CRS("+init=epsg:4326") # WGS 84
myCRS2 <- CRS("+init=epsg:4269") # NAD 83
myCRS3 <- CRS("+init=epsg:3857") # WGS 84 / Pseudo-Mercator (Web Mercator)
myCRS4 <- CRS("+init=epsg:3413") # WGS 84 / NSIDC Sea Ice Polar Stereographic North

You can find the full reference list on the spatialreference.org website.

assign the Coordinate Reference System (CRS) to our occ spatial object

crs(occ_final) <- myCRS1
CRS arguments:
 +init=epsg:4326 +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84

after defining the CRS, we can reproject the data into another CRS

occ_final_projected <- spTransform(occ_final, myCRS4)

plot of chunk occ-spatial6


plot of chunk occ-spatial6
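To give a sense of what a reprojection computes, here is a base-R sketch of the forward formula for spherical Web Mercator (EPSG:3857), one of the CRSs listed above. This is an illustration only: spTransform() relies on the PROJ library rather than on code like this, and the function name `lonlat_to_web_mercator` is hypothetical. The two input points are taken from the coordinates printed earlier.

```r
# Illustration only: forward formula for spherical Web Mercator (EPSG:3857)
lonlat_to_web_mercator <- function(lon, lat) {
  a <- 6378137  # sphere radius in metres (WGS 84 semi-major axis)
  x <- a * lon * pi / 180
  y <- a * log(tan(pi / 4 + lat * pi / 360))
  cbind(x = x, y = y)
}

# Two of the occurrence coordinates shown earlier (degrees)
pts <- lonlat_to_web_mercator(
  lon = c(-84.55206, -104.51337),
  lat = c(10.49557, 19.13245)
)
print(pts)

# Sanity check: the antimeridian on the Equator maps to the edge of the map
edge <- lonlat_to_web_mercator(180, 0)
print(edge)
```

Longitude and latitude in degrees become x and y in metres, which is why the axes of a projected plot have a very different range from those of the unprojected one.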

after we transform a dataframe into a spatial object, we can still subset it by column; for example, here we only keep occurrences north of the Equator

occ_north <- subset(occ_final, occ_final@coords[,2] >=0)

plot of chunk occ-spatial7

or we can subset by year

occ_1990 <- subset(occ_final, year ==1990)

plot of chunk occ-spatial8

2.6 Read/write shapefiles

loaded_shapefile <- shapefile("temp/occ_final.shp")

Challenge: Download occurrences from GBIF and filter data

–select your favorite species
–only keep specimen records
–only keep records that are collected between 2000 & 2018
–only keep records that have valid longitude & latitude
–make the occ spatial
–assign WGS84 as the crs of the occurrences
–save the spatial object as “myocc_final.shp” in folder “temp”



# download 
myocc <- gbif(genus="Dasypus",species="novemcinctus",download=TRUE) 
# filter 
myocc_final <- subset(myocc,basisOfRecord=="PRESERVED_SPECIMEN" &
                            year >= 2000 & year <= 2018 &
                            !is.na(lat) & !is.na(lon)    )

# show number of records that are removed 
nrow(myocc) - nrow(myocc_final)     

# make it spatial
coordinates(myocc_final) <- ~ lon + lat

# define CRS
myCRS1 <- CRS("+init=epsg:4326") # WGS 84

# assign CRS to your occ
crs(myocc_final) <- myCRS1

# write shapefile