Data quality control using R

Checking points on land

The obistools package has a check_onland() function to check if coordinate pairs are located on land. By default this function uses a web service, but it can optionally work offline (although this is less accurate).

First fetch some Madrepora occurrences using robis:

mad <- robis::occurrence("Madrepora")

Then run the check_onland() command. By default the function will return a data frame containing all records on land (another option is to return a data frame with errors):

land <- check_onland(mad)

In some cases it makes sense to apply a buffer when checking for records on land. In this case we add a 1000 m buffer zone:

land_buffer <- check_onland(mad, buffer = 1000)

As expected, this returns fewer “wrong” records.
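
To see which records the buffer removed from the result, the two data frames can be compared by record identifier. The sketch below is a minimal illustration that uses small hand-made stand-ins for land and land_buffer; the id column and all values are assumptions made for the example, not output from OBIS.

```r
# Toy stand-ins for the results of check_onland(); all values are made up.
land <- data.frame(
  id = c("a", "b", "c", "d"),
  decimalLongitude = c(2.10, 2.35, 5.00, 7.80),
  decimalLatitude = c(51.20, 51.10, 52.40, 53.00),
  stringsAsFactors = FALSE
)
# Records still flagged after applying the buffer (here a subset of land).
land_buffer <- land[land$id %in% c("c", "d"), ]

# Records flagged only without the buffer, i.e. close to the coastline.
near_shore <- land[!(land$id %in% land_buffer$id), ]

nrow(land)         # number of records flagged without the buffer
nrow(land_buffer)  # number of records flagged with the buffer
near_shore$id      # identifiers of the records the buffer let through
```

Records that appear only in the unbuffered result are often coastal samples with imprecise coordinates, so they may deserve a closer look rather than outright rejection.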

Now create a map showing all suspicious records, in orange by default but in red when they are suspicious even with the 1000 m buffer zone:

library(ggplot2)

world <- map_data("world")

ggplot() +
  geom_polygon(data = world, aes(x = long, y = lat, group = group), fill = "#dddddd") +
  # All records on land, in orange
  geom_point(data = land, aes(x = decimalLongitude, y = decimalLatitude), color = "#ff9900") +
  # Records still on land with the 1000 m buffer, drawn on top in red
  geom_point(data = land_buffer, aes(x = decimalLongitude, y = decimalLatitude), color = "#cc3300") +
  coord_fixed(1)

Taxon matching

The obistools package allows us to match taxa with the World Register of Marine Species directly from our R environment. To demonstrate this functionality, we are going to use the Reef Life Survey example dataset which is published on IPT here.

First, make sure the finch package is installed and loaded:

library(finch)

Then read the Darwin Core Archive:

archive <- dwca_read("", read = TRUE) 
occurrence <- archive$data$occurrence.txt

Next, we can start the taxon matching procedure by passing our scientific names to the match_taxa() function:

names <- match_taxa(occurrence$scientificName)

When the name matching has finished (this can take a while), a summary is displayed indicating how many names were matched and how many need to be resolved manually:

291 names, 0 without matches, 8 with multiple matches
Proceed to resolve names (y/n/info)?

Type info to see which names need manual action, y to start manual resolution, or n to skip manual resolution. After selecting y, several options will be presented for each name. Pick a number or press enter to skip the name:

  AphiaID scientificname               authority     status match_type
1  346769       Apogonia Cressey & Cressey, 1990 unaccepted     near_1
2  125913         Apogon          Lacepède, 1801   accepted     near_2
Apogonid spp.
Multiple matches, pick a number or leave empty to skip:

After this procedure, you will end up with a data frame containing the matched name, the WoRMS LSID, and the type of match. Add the LSIDs to your source data as scientificNameID.

occurrence$scientificNameID <- names$scientificNameID
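
Because the LSIDs are assigned by position, it can help to confirm that the result has one row per input name and to check which names remained unmatched. The following is a minimal sketch on hand-made data; the column names scientificName, scientificNameID and match_type follow the description above, while the names, match types and the pairing of names with identifiers are placeholders for illustration.

```r
# Toy source data and a toy matching result; all values are placeholders.
occurrence <- data.frame(
  scientificName = c("Apogon", "Apogonia", "Unknown sp."),
  stringsAsFactors = FALSE
)
names <- data.frame(
  scientificName = c("Apogon", "Apogonia", NA),
  scientificNameID = c(
    "urn:lsid:marinespecies.org:taxname:125913",
    "urn:lsid:marinespecies.org:taxname:346769",
    NA
  ),
  match_type = c("exact", "exact", NA),
  stringsAsFactors = FALSE
)

# The assignment is positional, so the row counts must agree.
stopifnot(nrow(names) == nrow(occurrence))

occurrence$scientificNameID <- names$scientificNameID

# Names that still lack an identifier need manual follow-up.
unmatched <- occurrence$scientificName[is.na(occurrence$scientificNameID)]
unmatched
```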

Checking depth values

The obistools package has a check_depth() function to check if there are any potential problems with the values in the minimumDepthInMeters and maximumDepthInMeters fields. This function uses a web service to fetch bathymetry information from various sources.

First download some occurrences from OBIS:

abrseg <- robis::occurrence("Abra segmentum")

Then use check_depth() with a depthmargin of 10 meters. This returns all records where depth values are 10 meters or more below the bottom depth returned by the web service:

problems <- check_depth(abrseg, depthmargin = 10)

To plot sample depth versus bottom depth, first use lookup_xy() to obtain bathymetry for our points:

bathymetry <- lookup_xy(problems, shoredistance = FALSE, grids = TRUE, areas = FALSE)$bathymetry
plot(bathymetry, problems$maximumDepthInMeters)
abline(0, 1, lty = 2)
abline(10, 1, col = "red")
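
The comparison behind the plot can also be written out directly. The sketch below uses made-up depths and assumes that bathymetry holds the bottom depth in meters, positive downward. It flags records whose maximum depth exceeds the bottom depth by the margin or more, which corresponds to points on or above the red line. It illustrates the rule described above and is not the internal implementation of check_depth().

```r
# Made-up example values, in meters (assumed positive downward).
records <- data.frame(
  maximumDepthInMeters = c(5, 30, 48, 120),
  bathymetry = c(12, 25, 38, 40)
)
depthmargin <- 10

# Flag records whose depth exceeds the bottom depth by the margin or more.
records$flagged <- records$maximumDepthInMeters - records$bathymetry >= depthmargin
records
```

Raising depthmargin makes the check more tolerant of coarse bathymetry grids, at the cost of missing smaller discrepancies.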


Working with OBIS-ENV-DATA