Data quality control using R
Checking points on land
The obistools package has a check_onland() function to check whether coordinate pairs are located on land. By default, this function uses a web service, but it can optionally work offline (although this is less accurate).
First, fetch some Madrepora occurrences using the robis package:
library(robis)
mad <- occurrence("Madrepora")
leafletmap(mad)
Then run the check_onland() function. By default, it returns a data frame containing all records on land (another option is to return a data frame with errors):
library(obistools)
land <- check_onland(mad)
leafletmap(land)
In some cases it makes sense to apply a buffer when checking for records on land. In this case we add a 1000 m buffer zone:
land_buffer <- check_onland(mad, buffer = 1000)
leafletmap(land_buffer)
As expected, this returns fewer “wrong” records.
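To see which records the buffer removed, the two results can be compared. The following is a minimal sketch in base R that uses small hypothetical data frames in place of land and land_buffer. The id column and all coordinate values are assumptions made for illustration, not real OBIS records.

```r
# Hypothetical stand-in for the result of check_onland() without a buffer.
land_demo <- data.frame(
  id = c("a", "b", "c", "d"),
  decimalLongitude = c(2.10, 2.50, 3.05, 3.40),
  decimalLatitude = c(51.20, 51.00, 50.90, 51.30),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the result with a 1000 m buffer:
# only records "b" and "d" are still flagged.
land_buffer_demo <- land_demo[land_demo$id %in% c("b", "d"), ]

# Records flagged without a buffer but not with it, i.e. points on land
# that lie within the buffer distance of the shoreline.
near_shore <- land_demo[!(land_demo$id %in% land_buffer_demo$id), ]
print(near_shore$id)
```

With real data, the same comparison would be applied to land and land_buffer, provided the occurrence records carry a unique identifier column.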
Now create a map showing all suspicious records, in orange by default but in red when they are suspicious even with the 1000 m buffer zone:
library(ggplot2)
world <- map_data("world")
ggplot() +
  geom_polygon(data = world, aes(x = long, y = lat, group = group), fill = "#dddddd") +
  # all suspicious records, in orange
  geom_point(data = land, aes(x = decimalLongitude, y = decimalLatitude), color = "#ff9900") +
  # records still suspicious with the 1000 m buffer, drawn on top in red
  geom_point(data = land_buffer, aes(x = decimalLongitude, y = decimalLatitude), color = "#cc3300") +
  coord_fixed(1)
Taxon matching
The obistools package allows us to match taxa with the World Register of Marine Species (WoRMS) directly from our R environment. To demonstrate this functionality, we are going to use the Reef Life Survey example dataset, which is published on IPT.
First, make sure the finch package is installed and loaded:
library(devtools)
install_github("ropensci/finch")
library(finch)
Then read the Darwin Core Archive:
archive <- dwca_read("http://ipt.iobis.org/obis-env/archive.do?r=rls-subset", read = TRUE)
occurrence <- archive$data$occurrence.txt
Next, we can start the taxon matching procedure by passing our scientific names to the match_taxa() function:
library(obistools)
names <- match_taxa(occurrence$scientificName)
When the name matching has finished (this can take a while), a summary is displayed indicating how many names were matched and how many need to be resolved manually:
291 names, 0 without matches, 8 with multiple matches
Proceed to resolve names (y/n/info)?
Type info to see which names need manual action, y to start manual resolution, or n to skip manual resolution. After selecting y, several options will be presented for each name. Pick a number, or press enter to skip the name:
  AphiaID scientificname               authority     status match_type
1  346769       Apogonia Cressey & Cressey, 1990 unaccepted     near_1
2  125913         Apogon          Lacepède, 1801   accepted     near_2

Apogonid spp.
Multiple matches, pick a number or leave empty to skip:
After this procedure, you will end up with a data frame containing the matched name, the WoRMS LSID, and the type of match. Add the LSIDs to your source data as scientificNameID:
occurrence$scientificNameID <- names$scientificNameID
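Names that were skipped during manual resolution will have no LSID, so it can be useful to list them before going further. The sketch below uses hypothetical stand-ins for occurrence and names, built from the names shown in the prompt above. It assumes that unresolved names come back with NA in scientificNameID, which is an assumption about the returned data frame rather than something stated here.

```r
# Hypothetical stand-in for the source data.
occurrence_demo <- data.frame(
  scientificName = c("Apogon", "Apogonid spp.", "Apogonia", "Apogonid spp."),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the matching result, one row per input name.
# Assumption: skipped names have NA as their scientificNameID.
names_demo <- data.frame(
  scientificNameID = c(
    "urn:lsid:marinespecies.org:taxname:125913",
    NA,
    "urn:lsid:marinespecies.org:taxname:346769",
    NA
  ),
  stringsAsFactors = FALSE
)

# Add the LSIDs to the source data, as in the step above.
occurrence_demo$scientificNameID <- names_demo$scientificNameID

# List the distinct names that still need attention.
unresolved <- unique(
  occurrence_demo$scientificName[is.na(occurrence_demo$scientificNameID)]
)
print(unresolved)
```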
Checking depth values
The obistools package has a check_depth() function to check if there are any potential problems with the values in the minimumDepthInMeters and maximumDepthInMeters fields. This function uses a web service to fetch bathymetry information from various sources.
First download some occurrences from OBIS:
abrseg <- robis::occurrence("Abra segmentum")
Now run check_depth() with a depthmargin of 10 meters. This returns all records where depth values are 10 meters or more below the bottom depth returned by the web service:
library(obistools)
problems <- check_depth(abrseg, depthmargin = 10)
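The effect of the depth margin can be pictured as a simple comparison between sample depth and bottom depth. The sketch below is not the obistools implementation. It uses hypothetical depth and bathymetry values and follows the wording above, flagging records that are 10 meters or more below the bottom depth.

```r
# Hypothetical sample depths and bottom depths, both in meters.
depth_demo <- data.frame(
  maximumDepthInMeters = c(5, 12, 40, 25),
  bathymetry = c(10, 10, 20, 15)
)

depthmargin <- 10

# Flag records whose sample depth lies at least `depthmargin` meters
# below the bottom depth.
depth_demo$flagged <-
  depth_demo$maximumDepthInMeters >= depth_demo$bathymetry + depthmargin

print(depth_demo[depth_demo$flagged, ])
```

A larger margin tolerates more disagreement between reported depths and the bathymetry source, which may be appropriate where the bathymetry grid is coarse.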
To plot sample depth versus bottom depth, first use lookup_xy() to obtain bathymetry for our points:
bathymetry <- lookup_xy(problems, shoredistance = FALSE, grids = TRUE, areas = FALSE)$bathymetry
plot(bathymetry, problems$maximumDepthInMeters)
# dashed line: sample depth equal to bottom depth
abline(0, 1, lty = 2)
# red line: bottom depth plus the 10 m margin
abline(10, 1, col = "red")
plot_map_leaflet(problems)