When constructing the backbone for sPlot 2.1, a list of 4,093 "weird" species names (consisting mainly of trivial names) was generated and corrected manually by [Jürgen Dengler](https://www.uni-bayreuth.de/de/forschung/profilfelder/advanced-fields/oekologie-und-umweltwissenschaften/mitwirkende/Dengler/index.php) (hereafter JD).
**This step is not performed anymore**
```
## String manipulation routines
## A-priori cleaning of names
Strip unwanted characters and abbreviations (such as hybrid markers) that would prevent name matching:
A total of `r nrow(spec.list.TRY.sPlot %>% filter(OriginalNames != Species))` species names were modified. Although substantially improved, the species list still contains many inconsistencies.
The total list submitted to TNRS contains `r length(unique(spec.list.TRY.sPlot$Species))` species names.
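The kind of a-priori cleaning described above can be sketched in base R as follows. This is a minimal illustration on toy data, not the routine used to build the backbone: the helper `clean_names()`, the toy data frame `toy` and the specific patterns (hybrid markers, `cf.`/`aff.` qualifiers, stray symbols) are assumptions chosen for the example. Only the column names `OriginalNames` and `Species` are taken from the text.

```r
# Hypothetical helper: strip characters and abbreviations that commonly
# prevent name matching. The patterns are illustrative assumptions.
clean_names <- function(x) {
  x <- gsub("\u00d7", " ", x, perl = TRUE)                # hybrid marker (multiplication sign)
  x <- gsub("\\s+x\\s+", " ", x, perl = TRUE)             # hybrid marker written as a lone "x"
  x <- gsub("\\b(cf|aff)\\.?\\s+", "", x, perl = TRUE)    # uncertainty qualifiers
  x <- gsub("[*?\"#]", "", x, perl = TRUE)                # stray symbols
  x <- gsub("\\s+", " ", x, perl = TRUE)                  # collapse repeated whitespace
  trimws(x)
}

# Toy species list (assumed data, for illustration only)
toy <- data.frame(
  OriginalNames = c("Quercus  x rosacea", "Carex cf. flava",
                    "Poa annua*", "Salix alba"),
  stringsAsFactors = FALSE
)
toy$Species <- clean_names(toy$OriginalNames)

# Number of names changed by the cleaning, and number of unique cleaned names
n_modified <- sum(toy$OriginalNames != toy$Species)
n_unique   <- length(unique(toy$Species))
```

The two counts at the end mirror the inline expressions used in the text: names whose cleaned form differs from the original, and unique names remaining for submission.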
# Match names against Taxonomic Name Resolution Service ([TNRS](http://tnrs.iplantcollaborative.org))
The csv-file of species names was submitted to the Taxonomic Name Resolution Service web application (Boyle et al. 2013; iPlant Collaborative 2015). TNRS version 4.0 was used, which became available in August 2015 (this version also included The Plant List version 1.1).
TNRS was queried on 27/07/2019.
## TNRS settings {#ID}
...
...
@@ -231,10 +237,10 @@ The initial TNRS name resolution run was based on the **five standard sources**
### Family Classification
Resolved names were assigned to families based on the [APGIII classification](http://onlinelibrary.wiley.com/doi/10.1111/j.1095-8339.2009.00996.x/abstract) [@Chase2009], the same classification system used by Tropicos.
### Retrieve results
Once the matching process was finished, results were retrieved from TNRS using the `Detailed Download` option, which includes the full name information (parsed components, warnings, links to sources, etc.). We retrieved all matches for each species, constrained by source (the TNRS default): the name in the first source was selected as the best match; if there was `no suitable match found` in that source, the match from the next lower-ranked source was selected, and so on until all sources were exhausted.
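The source-ranked selection described above can be illustrated with a small base R sketch. It only mimics the logic on toy data; TNRS applies this rule itself. The source labels (`source_A`, `source_B`), the `Source` column, the ranking vector and the helper `pick_best()` are hypothetical.

```r
# Toy TNRS-like result: one row per submitted name and source (assumed layout)
matches <- data.frame(
  Name_submitted = c("Abies alba", "Abies alba",
                     "Poa anua", "Poa anua",
                     "Xyz abc", "Xyz abc"),
  Source         = c("source_A", "source_B",
                     "source_A", "source_B",
                     "source_A", "source_B"),
  Name_matched   = c("Abies alba", "Abies alba",
                     "No suitable matches found.", "Poa annua",
                     "No suitable matches found.", "No suitable matches found."),
  stringsAsFactors = FALSE
)

# Hypothetical source ranking: lower number = higher priority
source_rank <- c(source_A = 1, source_B = 2)
no_match    <- "No suitable matches found."

# For one submitted name: take the match from the highest-ranked source
# that has a suitable match; if none has, keep the first (unmatched) row.
pick_best <- function(df) {
  df <- df[order(source_rank[df$Source]), ]
  ok <- df$Name_matched != no_match
  if (any(ok)) df[which(ok)[1], ] else df[1, ]
}

best <- do.call(rbind, lapply(split(matches, matches$Name_submitted), pick_best))
```

In this toy example, the second name falls through to the lower-ranked source, and the third name stays unmatched because no source offers a suitable match.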
# Manually inspecting name matching results {#ID}
### General procedure {#ID}
Manually inspect the TNRS results table in a spreadsheet application (e.g. LibreOffice or Excel), starting with the highest taxonomic rank considered (i.e. family). For instance, if manual checking of the TNRS output reveals that all accepted names or synonyms with accuracy scores >0.9 are correct taxon names, use the following selection procedure:
* Name_matched_rank (==Family)
...
...
@@ -243,13 +249,9 @@ Manually inspect the TNRS-results table in a spreadsheat application (i.e. Libre
Continue this selection procedure for entries that were matched at lower taxonomic ranks, i.e. genus, species, etc.
## Iteration 1 - Read and combine TNRS result files
After this first step, there are `r sum(tnrs.res$Name_matched=="No suitable matches found.")` records for which no match was found. Another `r sum(tnrs.res$Overall_score<0.9)` were unreliably matched (overall match score <0.9).
### General procedure {#ID}
1. Open `tnrs.res` in a spreadsheet program and sort according to `Name_matched_rank`, `Taxonomic_status` and `Family_score`, and select thresholds for selection.
2. Repeat selection for entries matched at lower taxonomic ranks, such as `Name_matched_rank` ==:
* forma
* genus
* infraspecies
* ...
3. Adjust accuracy score threshold values, e.g. use higher or lower values for infraspec., variety, ...
### Family level {#ID}
Manually inspect the sorted table and select all entries at the highest hierarchical level (family). Manually identify the family accuracy score threshold above which a name can be considered correct. In the following case, this corresponds to a score $>$0.88.
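Once a threshold has been chosen by eye, the selection itself is a simple filter. The sketch below uses base R on a toy table; the data, the variable `family_threshold` and the `certain`/`uncertain` split are illustrative assumptions, while the column names `Name_matched_rank` and `Family_score` follow the TNRS output referred to above.

```r
# Toy TNRS-like table (assumed data, for illustration only)
tnrs_toy <- data.frame(
  Name_submitted    = c("Poaceae", "Poacea", "Quercus", "Asteracee"),
  Name_matched_rank = c("family", "family", "genus", "family"),
  Family_score      = c(1.00, 0.91, 1.00, 0.50),
  stringsAsFactors  = FALSE
)

# Threshold identified by manual inspection (0.88 in the case described above)
family_threshold <- 0.88

is_family <- tnrs_toy$Name_matched_rank == "family"

# Family-level matches accepted as correct ...
certain   <- tnrs_toy[is_family & tnrs_toy$Family_score >  family_threshold, ]
# ... and those left for further checking
uncertain <- tnrs_toy[is_family & tnrs_toy$Family_score <= family_threshold, ]
```

Entries matched at other ranks are left untouched here; they would be handled in the subsequent rank-specific steps with their own thresholds.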
After matching the remaining genera with the Catalogue of Life, there are still `r nrow(Backbone %>% filter(is.na(Family_correct)))` records without family affiliation, for a total of `r nrow(Backbone %>% filter(is.na(Family_correct)) %>% dplyr::select(Genus_correct) %>% distinct())` genera.
Manually fix some residual, known issues:
```{r}
# Residual known issue: set the family of Coptidium by hand
# (word() is from stringr)
Backbone <- Backbone %>%
  mutate(Family_correct = replace(Family_correct,
                                  list = word(Name_short, 1) == "Coptidium",
                                  values = "Ranunculaceae"))
```
## Create Field `is_vascular_species`
Assign all families that belong to `Tracheophyta` to the category `is_vascular_species`, based on the Catalogue of Life.
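A minimal sketch of this assignment, assuming a family-to-phylum lookup has already been extracted from the Catalogue of Life. The lookup table `families`, its column names and the toy backbone are hypothetical. Only `Family_correct`, `Name_short` and `is_vascular_species` follow the names used in this document.

```r
# Hypothetical lookup of families and their phylum (assumed to come from
# the Catalogue of Life; toy values for illustration)
families <- data.frame(
  Family = c("Poaceae", "Pinaceae", "Sphagnaceae"),
  Phylum = c("Tracheophyta", "Tracheophyta", "Bryophyta"),
  stringsAsFactors = FALSE
)

# Toy backbone (assumed data)
backbone_toy <- data.frame(
  Name_short     = c("Poa annua", "Sphagnum fuscum", "Pinus sylvestris", "Unknown sp"),
  Family_correct = c("Poaceae", "Sphagnaceae", "Pinaceae", NA),
  stringsAsFactors = FALSE
)

# Families belonging to Tracheophyta
vascular_families <- families$Family[families$Phylum == "Tracheophyta"]

# Flag vascular plants; %in% returns FALSE for a missing family, so
# records without family affiliation are not flagged as vascular
backbone_toy$is_vascular_species <-
  backbone_toy$Family_correct %in% vascular_families
```

Whether records without a family should be `FALSE` or `NA` is a design choice; this sketch treats them as `FALSE`, which means they drop out when filtering on `is_vascular_species == TRUE`.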
The new backbone contains `r nrow(Backbone)` records. Backbone 2.1 contained `r nrow(backbone.splot2.1.try3)` records. The two backbones have `r incommon` records in common.
**Database affiliations (`sPlot 3.1`, `TRY 3.0`, and `Alpine`).**
```{r, eval = T, echo=F}
kable((table(Backbone$sPlot_TRY)), caption = "Number of (standardized) name entries
unique to, or shared between sPlot (S), TRY (T) and Alpine (A).") %>%
`r nrow(Backbone %>% filter(sPlot_TRY %in% c("S", "ST", "SA", "STA")))` of the total number of entries belong to sPlot. `r nrow(Backbone %>% filter(sPlot_TRY %in% c("T", "ST", "TA", "STA")))` name entries belong to TRY.
**Taxonomic ranks:**
```{r, eval = T, echo=F}
kable((table(Backbone$Rank_correct, exclude=NULL)), caption = "Number of (standardized) name entries per taxonomic rank.") %>%
Generate a version of the backbone that includes only the unique resolved names in `Name_short`; for non-unique names, keep the first row of each duplicated name:
```{r, eval = T}
Backbone.uni <- Backbone %>%
  distinct(Name_short, .keep_all = TRUE) %>%  # keep the first row of each duplicated name
  filter(!is.na(Name_short))                  # drop rows without a resolved name
nrow(Backbone.uni)
```
There are `r nrow(Backbone.uni)` unique taxon names in the backbone.
Exclude the non-vascular plant and non-matching taxon names:
```{r, eval = T}
Backbone.uni.vasc <- Backbone.uni %>%
  dplyr::filter(is_vascular_species == TRUE)
```
`df.uni` had two fewer names than `df.count`, as they were accidentally tagged as non-vascular species names. Resolve that issue:
**Now, run the stats for unique resolved names (excluding non-vascular and non-matching taxa):**
```{r, eval = T}
# nrow() on a single column returns NULL; count the rows of the data frame
nrow(Backbone.uni.vasc)
```
There are `r nrow(Backbone.uni.vasc)` unique (vascular plant) taxon names:
```{r, eval = T, echo=F}
kable((table(Backbone.uni.vasc$sPlot_TRY)), caption = "Number of (standardized) vascular plant taxon names unique to, or shared between, sPlot (S), TRY (T) and the Alpine (A) dataset.") %>%