Commit 2f427475 authored by Francesco Sabatini

Completed 03_Backbone - v1

parent 326b4b98
......@@ -31,7 +31,8 @@ urlcolor: blue
***
# Load packages
# Data preparation
## Load packages
```{r results="hide", message=F, warning=F}
library(reshape2)
......@@ -54,9 +55,9 @@ library(vegdata)
mushroom <- c("Mycena", "Boletus", "Russula","Calocybe","Collybia","Amanita","Amanitopsis","Coprinus",
"Galerina","Geoglossum","Hebeloma","Hydnum","Lactarius","Leucocarpia","Naucoria","Otidea","Polyporus",
"Sarcodom","Sarcoscyphus","Scleroderma","Stropharia","Tylopilus","Typhula", "Calyptella", "Chrysopsora", "Lacrymaria", "Dermoloma",
"Alnicola", "Amanitina", "Bovista", "Cheilymenia","Clavulinopsis", "Clitocybe", "Entoloma", "Geaster", "Inocybe",
"Agaricus","Alnicola", "Amanitina", "Bovista", "Cheilymenia","Clavulinopsis", "Clitocybe", "Entoloma", "Geaster", "Inocybe",
"Laccaria", "Laetiporus", "Lepista", "Macrolepiota", "Macrolepis", "Marasmius", "Panaeolus", "Psathyrella", "Psilocybe",
"Rickenella", "Sarcoscypha", "Vascellum")
"Rickenella", "Sarcoscypha", "Vascellum", "Ramaria")
```
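A minimal sketch of how a genus list such as `mushroom` can be used to drop fungal records. The data below are made up, `mushroom_demo` is only a short excerpt of the list, and base R is used here instead of the `dplyr`/`stringr` calls in the script:

```r
# Hypothetical example data; not taken from sPlot or TRY.
mushroom_demo <- c("Mycena", "Boletus", "Russula")
species_demo  <- c("Boletus edulis", "Quercus robur", "Russula sp.", "Poa annua")

# First word of each name (everything before the first space).
first_word <- sub(" .*$", "", species_demo)

# Keep only the names whose first word is not a listed fungal genus.
plants_only <- species_demo[!first_word %in% mushroom_demo]
print(plants_only)
```

Matching on the first word only means that a record is dropped whenever its genus appears in the list, whatever the rest of the name string contains.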
......@@ -78,10 +79,6 @@ DT0 <- readr::read_delim("../sPlot_data_export/sPlot_3_0_2_species.csv",
x_ = col_double()
)
)
## Exclude fungi
splot.species <- DT0 %>%
rename(Species.original=`Turboveg2 concept`, Matched.concept=`Matched concept`) %>%
......@@ -98,7 +95,7 @@ splot.species <- DT0 %>%
write_csv(splot.species, path = "../_derived/splot3.0.2.species.csv")
```
!!! Should I use the column from TRY using the full species name, or the column with only the two words from name strings?
!!! I used the column from TRY with the full species name, not the column containing only the two-word name string.
```{r, message=F}
splot.species <- read_csv("../_derived/splot3.0.2.species.csv")
......@@ -137,17 +134,27 @@ spec.list.TRY.sPlot <- splot.species %>%
mutate(Source=paste(S, T, A, sep="")) %>%
dplyr::select(-A, -S, -T)
length(spec.list.TRY.sPlot)
table(spec.list.TRY.sPlot$Source) #Number of species unique and in common across databases
#Number of species unique and in common across databases
```
The *total number of species* in the backbone is `r nrow(spec.list.TRY.sPlot)`.
```{r echo=F}
knitr::kable(spec.list.TRY.sPlot %>%
mutate(Source=factor(Source,
levels=c("S", "T", "A", "ST", "SA", "TA", "STA"),
labels=c("sPlot only", "TRY only", "Alpine only",
"sPlot + TRY", "sPlot + Alpine", "TRY + Alpine",
"sPlot + TRY + Alpine"))) %>%
group_by(Source) %>%
summarize(Num.taxa=n()),
caption="Number of taxa per database") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
latex_options = "basic",
full_width = F, position = "center")
# A-priori cleaning of names
## Manual cleaning
When constructing the backbone for sPlot 2.1, a list of 4,093 "weird" species names (consisting mainly of trivial names) was generated, and corrected manually by [Jürgen Dengler](https://www.uni-bayreuth.de/de/forschung/profilfelder/advanced-fields/oekologie-und-umweltwissenschaften/mitwirkende/Dengler/index.php) (hereafter JD).
**This step is not performed anymore**
```
## String manipulation routines
## A-priori cleaning of names
Strip unwanted characters and abbreviations (such as hybrid markers) that would prevent name matching:
```{r}
......@@ -176,8 +183,6 @@ spec.list.TRY.sPlot <- spec.list.TRY.sPlot %>%
mutate(Species=gsub('like ', '', Species, fixed=TRUE)) %>%
mutate(Species=gsub(',', '', Species, fixed=TRUE)) %>%
mutate(Species=gsub('_', ' ', Species))
```
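The last cleaning steps of the chunk above can be sketched as a small helper. `clean_name` is a hypothetical function introduced only for illustration, and the input strings are made up:

```r
# Hypothetical helper mirroring three of the gsub() steps; not part of the script.
clean_name <- function(x) {
  x <- gsub("like ", "", x, fixed = TRUE)  # drop the literal token "like "
  x <- gsub(",", "", x, fixed = TRUE)      # drop commas
  x <- gsub("_", " ", x, fixed = TRUE)     # underscores become spaces
  x
}

raw     <- c("Poa_annua", "Carex like nigra", "Salix alba, L.")
cleaned <- clean_name(raw)
print(cleaned)
```

With `fixed = TRUE` the pattern is treated as a literal string rather than a regular expression, which avoids surprises with characters such as `.` or `(`.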
For all names that consist of more than one word and have a number in their first word, remove that first word:
......@@ -188,8 +193,9 @@ spec.list.TRY.sPlot <- spec.list.TRY.sPlot %>%
mutate(firstWordWithNumbers=grepl('[0-9]', word(Species, 1))) %>%
mutate(numberOfWords= sapply(gregexpr("\\W+", Species), length) + 1) %>%
mutate(Species=ifelse((firstWordWithNumbers & numberOfWords > 1),
sapply(Species, function(x) substr(x, start=regexpr(pattern =' ', text=x)+1, stop=nchar(x))),
Species))
sapply(Species,
function(x) substr(x, start=regexpr(pattern =' ', text=x)+1,
stop=nchar(x))), Species))
```
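The rule applied in this chunk can be sketched in base R as follows. `drop_numeric_first_word` is a hypothetical name and the example strings are invented:

```r
# Hypothetical helper: remove the first word when it contains a digit,
# but only if the name has more than one word.
drop_numeric_first_word <- function(x) {
  first      <- sub(" .*$", "", x)            # first word of each name
  has_digit  <- grepl("[0-9]", first)         # does it contain a number?
  multi_word <- grepl(" ", x, fixed = TRUE)   # is there anything after it?
  ifelse(has_digit & multi_word, sub("^[^ ]+ +", "", x), x)
}

demo   <- c("1234 Poa annua", "Poa annua", "X99")
result <- drop_numeric_first_word(demo)
print(result)
```

Single-word entries such as `"X99"` are left untouched, so no name is reduced to an empty string.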
Correct some name abbreviations using `taxname.abbr` in `vegdata`:
......@@ -201,18 +207,18 @@ spec.list.TRY.sPlot <- spec.list.TRY.sPlot %>%
```
A total of `r nrow(spec.list.TRY.sPlot %>% filter(OriginalNames != Species))` species names were modified. Although substantially improved, the species list still has quite a lot of inconsistencies.
The total list submitted to TNRS containes `r length(unique(spec.list.TRY.sPlot$Species)` species names.
The total list submitted to TNRS contains `r length(unique(spec.list.TRY.sPlot$Species))` species names.
# Match names against Taxonomic Name Resolution Service ([TNRS](http://tnrs.iplantcollaborative.org))
Export species name list
```{r }
#Export species name list
write_csv(spec.list.TRY.sPlot %>% dplyr::select(Species) %>% distinct() ,
path = "../_derived/TNRS_submit/tnrs_submit_iter1.csv")
```
The csv-file of species names was submitted to Taxonomic Name Resolution Service web application (Boyle et al. 2013, iPlant Collaborative (2015)). TNRS version 4.0 was used, which became available in August 2015 (this version also included The Plant List version 1.1).
TNRS was queried on 27/07/2019.
## TNRS settings {#ID}
......@@ -231,10 +237,10 @@ The initial TNRS name resolution run was based on the **five standard sources**
### Family Classification
Resolved names were assigned to families based on the [APGIII classification](http://onlinelibrary.wiley.com/doi/10.1111/j.1095-8339.2009.00996.x/abstract) [@Chase2009], the same classification system used by Tropicos.
## Retrieve results
### Retrieve results
Once the matching process was finished, results were retrieved from TNRS using the `Detailed Download` option, which includes the full name information (parsed components, warnings, links to sources, etc.). We retrieved all the matches for each species, constrained by source (TNRS default): the name in the first source was selected as best match; if there was `no suitable match found` in that source, the match from the next lower-ranked source was selected, and so on until all sources were exhausted.
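The fallback rule described above can be sketched on made-up data. The ranking in `source_rank` is an assumption made for the example, and the code only illustrates the selection logic; it is not how TNRS itself is implemented:

```r
# Hypothetical match table: one row per submitted name and source.
matches <- data.frame(
  Name_submitted = c("Poa anua", "Poa anua", "Abies alba", "Abies alba"),
  Source         = c("tpl", "usda", "tpl", "usda"),
  Name_matched   = c("No suitable matches found.", "Poa annua",
                     "Abies alba", "Abies alba"),
  stringsAsFactors = FALSE
)

# Assumed priority of the sources, highest first.
source_rank <- c("tpl", "gcc", "ildis", "tropicos", "usda")

# Drop rows without a match, sort by name and source priority,
# then keep the first (highest-ranked) row per submitted name.
usable <- matches[matches$Name_matched != "No suitable matches found.", ]
usable <- usable[order(usable$Name_submitted,
                       match(usable$Source, source_rank)), ]
best   <- usable[!duplicated(usable$Name_submitted), ]
print(best)
```

In this toy example `"Poa anua"` falls back to the lower-ranked source because the top-ranked one returned no suitable match.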
# Manually inspecting name matching results {#ID}
### General procedure {#ID}
Manually inspect the TNRS results table in a spreadsheet application (e.g. LibreOffice or Excel), starting with the highest taxonomic rank considered (i.e. Family). For instance, if manual checking of the TNRS output reveals that all accepted names or synonyms that have accuracy scores >0.9 are correct taxon names, use the following selection procedure:
* Name_matched_rank (==Family)
......@@ -243,13 +249,9 @@ Manually inspect the TNRS-results table in a spreadsheat application (i.e. Libre
Continue this selection procedure for entries that were matched at lower taxonomic ranks, i.e. genus, species, etc.
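Once thresholds have been chosen by eye, the selection procedure can be expressed as a filter. The sketch below uses invented records and hypothetical per-rank thresholds in `thresholds`; the actual cut-offs have to come from manual inspection:

```r
# Hypothetical TNRS-like results.
res <- data.frame(
  Name_submitted    = c("Poaceae", "Poacae", "Carex", "Carx"),
  Name_matched_rank = c("family", "family", "genus", "genus"),
  Taxonomic_status  = c("Accepted", "Accepted", "Accepted", "No opinion"),
  Overall_score     = c(1.00, 0.85, 1.00, 0.80),
  stringsAsFactors  = FALSE
)

# Assumed score thresholds, one per taxonomic rank.
thresholds <- c(family = 0.9, genus = 0.9)

# A record counts as "certain" if its status is acceptable and its score
# exceeds the threshold for the rank at which it was matched.
is_certain <- res$Taxonomic_status %in% c("Accepted", "Synonym") &
  res$Overall_score > unname(thresholds[res$Name_matched_rank])

certain   <- res[is_certain, ]
uncertain <- res[!is_certain, ]
print(certain$Name_submitted)
```

Keeping the thresholds in a named vector makes it easy to adjust a single rank without touching the filter itself.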
## Iteration 1 - Read and combine TNRS result files
Read the downloaded TNRS files into `R`.
Read the files downloaded from TNRS into `R`.
```{r }
tnrs.res0 <- readr::read_delim("../_derived/TNRS_submit/tnrs_results_iter1.txt", delim="\t", locale = locale(encoding = 'UTF-8'),quote="",
col_type = cols(
......@@ -277,8 +279,11 @@ For each name submitted, only the record having the highest rank was retained.
```{r}
tnrs.res <- tnrs.res0 %>%
mutate(Name_matched_rank=factor(Name_matched_rank,
levels=c("variety", "subspecies", "species", "genus", "family", "section", "supersection", "infraspecies", "forma",
"race", "nothosubspecies", "proles", "monstr", "series"))) %>%
levels=c("variety", "subspecies", "species", "genus",
"family", "section", "supersection",
"infraspecies", "forma", "race",
"nothosubspecies", "proles", "monstr",
"series"))) %>%
mutate(Source=factor(Source, levels=c("tpl", #reorder priorities
"tpl;gcc",
"tpl;gcc;tropicos",
......@@ -319,18 +324,8 @@ tnrs.res <- tnrs.res0 %>%
slice(1)
```
After this first step, there are `r sum(tnrs.res$Name_matched=="No suitable matches found.")` for which no match was found. Another `r sum(tnrs.res$Overall_score<0.9)` were unreliably matched (overall match score <0.9).
### General procedure {#ID}
1. Open `tnrs.res` in a spreadsheet program and sort according to `Name_matched_rank`, `Taxonomic_status` and `Family_score`, and select thresholds for selection.
2. Repeat selection for entries matched at lower taxonomic ranks, such `Name_matched_rank` ==:
After this first step, there are `r sum(tnrs.res$Name_matched=="No suitable matches found.")` records for which no match was found. Another `r sum(tnrs.res$Overall_score<0.9)` were unreliably matched (overall match score <0.9).
* forma
* genus
* infraspecies
* ...
3. Adjust accuracy score threshold values, e.g. use higher or lower values for infraspec., variety, ...
### Family level {#ID}
Manually inspect sorted table and select all entries at the highest hierarchical level (family). Manually identify the family accuracy score threshold value above which a name can be considered a correct name. In the following case, this corresponds to a score $>$0.88.
......@@ -524,10 +519,10 @@ tnrs.submit.iter2 <- data.frame(old=tnrs.res.uncertain$Name_submitted) %>%
mutate(new=gsub('Abiesnordmannia', 'Abies nordmannia', new)) %>%
mutate(new=gsub('Alnus inca', 'Alnus incana', new)) %>%
mutate(new=gsub('Amalencier alnifolia', 'Amelanchier alnifolia', new)) %>%
mutate(new=gsub('"Antylis barba-jovis"', '"Anthyllis barba-jovis"', new))
mutate(new=gsub('Antylis barba-jovis', 'Anthyllis barba-jovis', new)) %>%
mutate(new=gsub('^Albizzia ', 'Albizia ', new))
# delete remaining records of mushroom species
tnrs.submit.iter2 <- tnrs.submit.iter2 %>%
filter(!word(new,1) %in% mushroom)
......@@ -547,8 +542,9 @@ tnrs.submit.iter2 <- tnrs.submit.iter2 %>%
mutate(Name_binomial=paste(word(new, c(1,2)), collapse=" ")) %>%
ungroup() %>%
mutate(Name_binomial=gsub(' NA$', '', Name_binomial))
#save species name list to be submitted to TNRS
```
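The reduction of a name string to its first two words, as done for `Name_binomial` above, can be sketched in base R. `to_binomial` is a hypothetical helper and the inputs are examples only:

```r
# Hypothetical helper: keep at most the first two words of each name.
to_binomial <- function(x) {
  parts <- strsplit(x, " +")
  vapply(parts, function(p) paste(head(p, 2), collapse = " "), character(1))
}

binomials <- to_binomial(c("Alnus incana subsp. rugosa", "Abies", "Poa annua"))
print(binomials)
```

Because `head(p, 2)` returns a single element for one-word names, this variant does not produce a trailing `NA` and needs no clean-up step afterwards.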
### Save species list to submit to TNRS for iteration 2
```{r}
write_csv(tnrs.submit.iter2 %>%
dplyr::select(Name_binomial) %>%
#After cleaning some names now match to those already resolved in iteration 1. Take them out
......@@ -576,15 +572,23 @@ tnrs.res.iter2.raw <- readr::read_delim("../_derived/TNRS_submit/tnrs_results_it
tnrs.res.iter2 <- tnrs.res.iter2.raw %>%
mutate(Name_matched_rank=factor(Name_matched_rank,
levels=c("variety", "subspecies", "species", "genus", "family", "section", "supersection", "infraspecies", "forma",
"race", "nothosubspecies", "proles", "monstr", "series"))) %>%
levels=c("variety", "subspecies", "species",
"genus", "family", "section",
"supersection", "infraspecies", "forma",
"race", "nothosubspecies", "proles",
"monstr", "series"))) %>%
mutate(Source=factor(Source, levels=c("tpl", #reorder priorities
"tpl;gcc", "tpl;gcc;tropicos", "tpl;gcc;tropicos;usda", "tpl;gcc;usda","tpl;ildis","tpl;ildis;tropicos",
"tpl;ildis;usda","tpl;tropicos","tpl;tropicos;usda","tpl;usda","gcc","gcc;tropicos","gcc;tropicos;usda",
"gcc;usda", "ildis", "ildis;tropicos","ildis;tropicos;usda","ildis;usda","tropicos","tropicos;gcc",
"tropicos;usda","usda" ))) %>%
"tpl;gcc", "tpl;gcc;tropicos", "tpl;gcc;tropicos;usda",
"tpl;gcc;usda","tpl;ildis","tpl;ildis;tropicos",
"tpl;ildis;usda","tpl;tropicos","tpl;tropicos;usda",
"tpl;usda","gcc","gcc;tropicos",
"gcc;tropicos;usda","gcc;usda","ildis",
"ildis;tropicos","ildis;tropicos;usda","ildis;usda",
"tropicos","tropicos;gcc","tropicos;usda","usda" ))) %>%
mutate(Taxonomic_status=factor(Taxonomic_status,
levels=c("Accepted","Synonym", "No opinion","Invalid","Illegitimate","Misapplied","Rejected name"))) %>%
levels=c("Accepted","Synonym", "No opinion",
"Invalid","Illegitimate","Misapplied",
"Rejected name"))) %>%
arrange(Name_number,
desc(Infraspecific_epithet_2_score),
desc(Infraspecific_epithet_score),
......@@ -675,14 +679,15 @@ save(tnrs.res.iter2.certain, tnrs.res.iter2.uncertain,
```
**Generate list of `uncertain` species that are still to be resolved on TNRS:**
### Save species list to submit to TNRS for iteration 3
```{r, eval = T}
write_csv(tnrs.res.iter2.uncertain[,2], path = "../_derived/TNRS_submit/tnrs_submit_iter3.csv")
```
This list was submitted to `TNRS`, but only selecting the `NCBI` database.
## Iteration 3 - Reimport resolved species names from `TNRS_NCBI`
In the last iteration, records were submitted to `TNRS NCBI`.
```{r}
tnrs.res.iter3.raw <- readr::read_delim("../_derived/TNRS_submit/tnrs_results_iter3.txt", delim="\t", locale = locale(encoding = 'UTF-8'),quote="",
......@@ -702,8 +707,11 @@ tnrs.res.iter3.raw <- readr::read_delim("../_derived/TNRS_submit/tnrs_results_it
tnrs.ncbi <- tnrs.res.iter3.raw %>%
mutate(Name_matched_rank=factor(Name_matched_rank,
levels=c("variety", "subspecies", "species", "genus", "family", "section", "supersection", "infraspecies", "forma",
"race", "nothosubspecies", "proles", "monstr", "series"))) %>%
levels=c("variety", "subspecies", "species",
"genus", "family", "section", "supersection",
"infraspecies", "forma", "race",
"nothosubspecies", "proles", "monstr",
"series"))) %>%
mutate(Source=factor(Source, levels=c("tpl", #reorder priorities
"tpl;gcc", "tpl;gcc;tropicos", "tpl;gcc;tropicos;usda",
"tpl;gcc;usda","tpl;ildis","tpl;ildis;tropicos",
......@@ -840,7 +848,7 @@ save(tpl.ncbi.certain, tpl.ncbi.uncertain, file="../_derived/TNRS_submit/tnrs.it
```
# Merge the resolved species lists
# Merge the resolved species lists into a Backbone
## Read files
```{r, eval = T}
......@@ -856,14 +864,14 @@ Combine the `certain` data sets:
Backbone <- spec.list.TRY.sPlot %>%
as.tbl() %>%
rename(Name_sPlot_TRY=OriginalNames,
Name_corrected1=Species) %>%
Name_string_corr1=Species) %>%
left_join(tnrs.submit.iter2 %>%
dplyr::select(-new) %>%
rename(Name_corrected1=old, Name_corrected2=Name_binomial),
by="Name_corrected1") %>%
mutate(Name_submitted=ifelse(!is.na(Name_corrected2), Name_corrected2, Name_corrected1)) %>%
dplyr::select(Name_sPlot_TRY, Name_corrected1, Name_corrected2, Source, Name_submitted) %>%
rename(sPlot_Try=Source) %>%
rename(Name_string_corr1=old, Name_string_corr2=Name_binomial),
by="Name_string_corr1") %>%
mutate(Name_submitted=ifelse(!is.na(Name_string_corr2), Name_string_corr2, Name_string_corr1)) %>%
dplyr::select(Name_sPlot_TRY, Name_string_corr1, Name_string_corr2, Source, Name_submitted) %>%
rename(sPlot_TRY=Source) %>%
left_join(tnrs.res.certain %>%
bind_rows(tnrs.res.iter2.certain) %>%
bind_rows(tnrs.ncbi.certain) %>%
......@@ -897,30 +905,392 @@ nrow(Backbone) == nrow(spec.list.TRY.sPlot)
```
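The row-count check above guards against joins that multiply records: a left join returns one row per matching key, so a duplicated key in the lookup table inflates the result. A minimal sketch for detecting such keys before joining, using an invented lookup table:

```r
# Hypothetical lookup table with one duplicated key.
lookup <- data.frame(
  Name_submitted = c("Poa annua", "Poa annua", "Abies alba"),
  Accepted_name  = c("Poa annua", "Poa annua L.", "Abies alba"),
  stringsAsFactors = FALSE
)

# Keys that occur more than once would duplicate rows in a left join.
dup_keys <- unique(lookup$Name_submitted[duplicated(lookup$Name_submitted)])

# One possible remedy: keep only the first record per key.
lookup_unique <- lookup[!duplicated(lookup$Name_submitted), ]
print(dup_keys)
```

Which record to keep for a duplicated key is a judgment call; keeping the first one is only the simplest option.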
## Tag names that could not be resolved
## Tag unresolved names and create output columns
Add four additional columns.
If names were not corrected, set `Taxonomic.status == ""`, and assign `No suitable matches found.` to the remaining species.
```{r, eval = T}
Backbone <- Backbone %>%
mutate(Status_correct=ifelse(!is.na(Taxonomic_status), Taxonomic_status, NA)) %>%
mutate(Status_correct=ifelse(Taxonomic_status %in% c("Accepted", "Synonym", "Unresolved"),
as.character(Taxonomic_status), NA)) %>% #as.character avoids integer codes if the column is a factor
mutate(Status_correct=replace(Status_correct,
list=is.na(Status_correct),
values="No suitable matches found.")) %>%
mutate(Status_correct=factor(Status_correct)) %>%
mutate(Name_correct=ifelse(!is.na(Accepted_name), Accepted_name, "No suitable matches found.")) %>%
mutate(Rank_correct=ifelse(!is.na(Name_matched_rank), as.character(Name_matched_rank), "higher")) %>%
mutate(Rank_correct=factor(Rank_correct)) %>%
#Create Name_correct field. Use Accepted names, if any. Otherwise matched names.
mutate(Name_correct=ifelse(!is.na(Accepted_name),
Accepted_name,
Name_matched)) %>%
mutate(Genus_correct=ifelse(!is.na(Name_correct) & (!Accepted_name_rank %in% c("family")),
word(Name_correct,1),
NA)) %>%
mutate(Name_correct=ifelse(!is.na(Name_correct),
Name_correct,
"No suitable matches found.")) %>%
mutate(Rank_correct=ifelse(!is.na(Name_matched_rank),
as.character(Name_matched_rank),
"higher")) %>%
mutate(Rank_correct=factor(Rank_correct, levels=c("higher", "family", "genus", "species",
"subspecies", "variety", "infraspecies",
"race", "forma")
)) %>%
mutate(Name_short=ifelse(!is.na(Accepted_name_species), Accepted_name_species, NA))
summary(Backbone$Status_correct)
summary(Backbone$Rank_correct)
```
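The fallback logic used for `Name_correct` (accepted name first, then matched name, then a placeholder) can be sketched on toy vectors. The values are invented:

```r
# Hypothetical accepted and matched names for three records.
accepted <- c("Poa annua", NA, NA)
matched  <- c("Poa anua", "Carex sp.", NA)

# Prefer the accepted name; otherwise fall back to the matched name.
name_correct <- ifelse(!is.na(accepted), accepted, matched)

# Records with neither are tagged explicitly.
name_correct[is.na(name_correct)] <- "No suitable matches found."
print(name_correct)
```

Note that `ifelse()` drops factor attributes, so factor columns should be converted with `as.character()` before being used this way.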
## Complete list of families
### Derive info from other species of the same Genera in the Backbone itself
Copy family info for taxa resolved at family level
```{r}
Backbone <- Backbone %>%
mutate(family.lev=str_extract(word(Name_correct,1), pattern='([^\\s]+aceae)')) %>% #match the full -aceae ending
mutate(Family_correct=ifelse(!is.na(Accepted_name_family),
Accepted_name_family,
family.lev)) %>%
dplyr::select(-family.lev)
#Records with missing family info
sum((is.na(Backbone$Family_correct)))
```
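The idea of recognising family-level names by their ending can be sketched as below. The example assumes that family names end in `-aceae`, which holds for most but not all names in use (e.g. the alternative names Compositae or Leguminosae):

```r
# Hypothetical first words of resolved names.
first_word <- c("Poaceae", "Carex", "Asteraceae")

# Treat a word as a family name only if it ends in "aceae".
family_lev <- ifelse(grepl("aceae$", first_word), first_word, NA_character_)
print(family_lev)
```

Anchoring the pattern with `$` avoids partial matches inside longer words.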
Derive family info from each genus in the backbone, and use this info to complement records from the same genera, but with missing family info.
```{r}
genera_families <- Backbone %>%
filter(Taxonomic_status=="Accepted") %>%
dplyr::select(Genus_correct, Family_correct) %>%
rename(family=Family_correct) %>%
# no distinct() here: duplicates are needed so that n counts occurrences
na.omit() %>%
#for some genera there are multiple families assigned
# (e.g. in case of unresolved species names )
# Extract the family names that occurs the most across each genus
group_by(Genus_correct, family) %>%
summarize(n=n()) %>%
arrange(desc(n)) %>%
slice(1) %>%
ungroup() %>%
dplyr::select(-n)
# Assign family derived from backbone to other records
Backbone <- Backbone %>%
left_join(genera_families, by="Genus_correct") %>%
mutate(Family_correct=ifelse( (is.na(Family_correct) & !is.na(family)),
family,
Family_correct)) %>%
dplyr::select(-family)
#Records with missing family info
sum(is.na(Backbone$Family_correct))
```
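Picking the most frequent family per genus amounts to a majority vote. A base R sketch on invented genus and family pairs follows; note that the counts have to be taken on the non-deduplicated pairs, otherwise every pair occurs exactly once:

```r
# Hypothetical genus-family pairs, one row per accepted record.
pairs <- data.frame(
  genus  = c("Acer", "Acer", "Acer", "Poa"),
  family = c("Sapindaceae", "Sapindaceae", "Aceraceae", "Poaceae"),
  stringsAsFactors = FALSE
)

# Count how often each family occurs within each genus.
counts <- as.data.frame(table(genus = pairs$genus, family = pairs$family),
                        stringsAsFactors = FALSE)
counts <- counts[counts$Freq > 0, ]

# Sort by genus and decreasing frequency, then keep the top family per genus.
counts       <- counts[order(counts$genus, -counts$Freq), ]
genus_family <- counts[!duplicated(counts$genus), c("genus", "family")]
print(genus_family)
```

Ties are resolved by whichever family sorts first, so genera with equally frequent families may still deserve a manual look.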
### Resolve genera with missing family info with `TNRS`
```{r}
Genera_submit <- Backbone %>%
filter(is.na(Family_correct)) %>%
dplyr::select(Genus_correct) %>%
distinct()
write_csv(Genera_submit, "../_derived/TNRS_submit/Genera_submit.csv")
```
Import the results from TNRS (settings: best match only, simple download).
```{r}
import.profile <- cols(
Name_submitted = col_character(),
Name_matched = col_character(),
Author_matched = col_logical(),
Overall_score = col_double(),
Taxonomic_status = col_character(),
Accepted_name = col_character(),
Accepted_author = col_character(),
Accepted_family = col_character(),
Source = col_character(),
Warnings = col_character(),
Accepted_name_lsid = col_character()
)
tnrs.genera <- read_delim("../_derived/TNRS_submit/tnrs_genera.txt", delim="\t",
locale = locale(encoding = 'UTF-8'),quote="",col_type = import.profile)
```
Attach resolved families to backbone
```{r}
Backbone <- Backbone %>%
left_join(tnrs.genera %>%
dplyr::select(Name_submitted, Accepted_family) %>%
rename(Genus_correct=Name_submitted, Family_import=Accepted_family),
by="Genus_correct") %>%
mutate(Family_correct=ifelse(is.na(Family_correct),
Family_import,
Family_correct)) %>%
dplyr::select(-Family_import)
#Records with missing family info
sum(is.na(Backbone$Family_correct))
```
### Complement with data from `The Catalogue of Life`.
```{r, eval=F}
#Download data from Catalogue of Life - 2019
download.file("http://www.catalogueoflife.org/DCA_Export/zip/archive-kingdom-plantae-bl3.zip",
destfile="/data/sPlot/users/Francesco/Ancillary_Data/Catalogue_of_Life/CatLife2019.zip")
unzip("/data/sPlot/users/Francesco/Ancillary_Data/Catalogue_of_Life/CatLife2019.zip", files="taxa.txt", exdir = "/data/sPlot/users/Francesco/Ancillary_Data/Catalogue_of_Life/")
```
```{r, message=F, warning=F}
cat.life <- read_delim("/data/sPlot/users/Francesco/Ancillary_Data/Catalogue_of_Life/taxa.txt",
delim="\t")
Genera_missing <- Backbone %>%
filter(is.na(Family_correct) & !is.na(Genus_correct)) %>%
dplyr::select(Genus_correct) %>%
distinct()
Backbone <- Backbone %>%
left_join(cat.life %>%
dplyr::select(genus, family) %>%
distinct() %>%
filter(family != "") %>%
filter(genus %in% Genera_missing$Genus_correct) %>%
rename(Genus_correct=genus),
by="Genus_correct") %>%
mutate(Family_correct=ifelse(is.na(Family_correct) & !is.na(family),
family,
Family_correct)) %>%
dplyr::select(-family)
#Records with missing family info
sum(is.na(Backbone$Family_correct))
```
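Filling in only the missing families from an external lookup, as done here with the Catalogue of Life, can be sketched with `match()`. Both tables below are invented and much simpler than the real ones:

```r
# Hypothetical backbone excerpt and external genus-family lookup.
backbone <- data.frame(genus  = c("Poa", "Carex", "Xus"),
                       family = c("Poaceae", NA, NA),
                       stringsAsFactors = FALSE)
lookup   <- data.frame(genus  = c("Carex", "Poa"),
                       family = c("Cyperaceae", "Gramineae"),
                       stringsAsFactors = FALSE)

# Position of each backbone genus in the lookup (NA if absent).
idx <- match(backbone$genus, lookup$genus)

# Only overwrite records whose family is missing and whose genus was found.
fill <- is.na(backbone$family) & !is.na(idx)
backbone$family[fill] <- lookup$family[idx[fill]]
print(backbone)
```

Existing family assignments are never overwritten, so the external source only complements the backbone.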
After matching the remaining genera with the Catalogue of Life, there are still `r nrow(Backbone %>% filter(is.na(Family_correct)))` records without family affiliation, for a total of `r nrow(Backbone %>% filter(is.na(Family_correct)) %>% dplyr::select(Genus_correct) %>% distinct())` genera.
Manually fix some residual, known issues:
```{r}
Backbone <- Backbone %>%
mutate(Family_correct=replace(Family_correct,
list=word(Name_short, 1)=="Coptidium",
values="Ranunculaceae"))
```
## Create Field `is_vascular_species`
Assign all families that belong to `Tracheophyta` to category `is_vascular_species`, based on The Catalogue of Life
```{r}
Backbone <- Backbone %>%
left_join(cat.life %>%
dplyr::select(phylum, family) %>%
distinct() %>%
na.omit() %>%
rename(Family_correct=family),
by="Family_correct") %>%
mutate(is_vascular_species=ifelse(phylum=="Tracheophyta", T, F))
table(Backbone$is_vascular_species, exclude=NULL)
```
## Export Backbone
```{r}
save(Backbone, file="../_output/Backbone3.0.RData")
```
# Statistics
## Statistics for backbone combining names in `sPlot3.0` and `TRY5.0`
### All taxon name entries
View(Backbone %>% mutate(n_words = stringr::str_count(Name_short, ' ') + 1) %>% filter(n_words>2))
```{r, eval = T}
load("../_output/Backbone3.0.RData")
```
**How many new entries are in the backbone 3.0 compared to the backbone 2.1? How many entries are in common?**
```{r, eval = T, echo=F}
load("/data/sPlot/releases/sPlot2.1/backbone.splot2.1.try3.is.vascular.Rdata")
incommon <- nrow(Backbone %>%
dplyr::select(Name_sPlot_TRY) %>%
inner_join(backbone.splot2.1.try3 %>%
dplyr::select(names.sPlot.TRY) %>%
rename(Name_sPlot_TRY=names.sPlot.TRY),
by="Name_sPlot_TRY"))
```
The new backbone contains `r nrow(Backbone)` records. The backbone 2.1 contained `r nrow(backbone.splot2.1.try3)` records. The two backbones have `r incommon` records in common.
**Database affiliations (`sPlot 3.1`, `TRY 3.0`, and `Alpine`).**
```{r, eval = T, echo=F}
kable((table(Backbone$sPlot_TRY)), caption = "Number of (standardized) name entries
unique to, or shared between sPlot (S), TRY (T) and Alpine (A).") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
latex_options = "basic",
full_width = F, position = "center")
```
`r nrow(Backbone %>% filter(sPlot_TRY %in% c("S", "ST", "SA", "STA")))` of the total number of entries belong to sPlot. `r nrow(Backbone %>% filter(sPlot_TRY %in% c("T", "ST", "TA", "STA")))` name entries belong to TRY.
**Taxonomic ranks:**
```{r, eval = T, echo=F}
kable((table(Backbone$Rank_correct, exclude=NULL)), caption = "Number of (standardized) name entries per taxonomic rank.") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
latex_options = "basic",
full_width = F, position = "center")
```
**Taxonomic status:**
```{r, eval = T, echo=F}
kable((table(Backbone$Taxonomic_status, exclude=NULL)), caption = "Number of (standardized) name entries for taxonomic status")%>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
latex_options = "basic",
full_width = F, position = "center")
```
**Total number of unique standardized taxon names and families:**
```{r, eval = T}
length(unique(Backbone$Name_short))-1 # minus 1 for NA
length(unique(Backbone$Family_correct))-1 # minus 1 for NA
```
**Number of entries corresponding to vascular plant species:**
```{r, eval = T}
table(Backbone$is_vascular_species, exclude=NULL)
```
**Number of duplicated entries after taxonomic standardization:**
Frequency of original (non-standardized) species names per resolved (standardized) name (excluding non-vascular and non-matched species).
```{r, eval = T}
df.count <- Backbone %>%
dplyr::filter(is_vascular_species == TRUE & !is.na(Name_correct)) %>%
dplyr::group_by(Name_correct) %>%
dplyr::summarise(n = n()) %>%
dplyr::arrange(desc(n))
```
```{r, echo=F}
kable(df.count[c(1:20), ], caption = "Number of unresolved, original name
entries per resolved name. (Only first 20 shown)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
latex_options = "basic",
full_width = F, position = "center")
```
### Based on `unique` standardized names
Generate a version of the backbone that only includes the unique resolved names in `Name_short`; for non-unique names, only the first row of each duplicated name is kept:
```{r, eval = T}
Backbone.uni <- Backbone %>%
distinct(Name_short, .keep_all = T) %>%
filter(!is.na(Name_short))
nrow(Backbone.uni)
```
There are `r nrow(Backbone.uni)` unique taxon names in the backbone.
Exclude the non-vascular plant and non-matching taxon names:
```{r, eval = T}
Backbone.uni.vasc <- Backbone.uni %>%
dplyr::filter(is_vascular_species == TRUE)
```
`df.uni` had two names fewer than `df.count`, as they were accidentally tagged as non-vascular species names. Resolve that issue:
**Now, run the stats for unique resolved names (excluding non-vascular and non-matching taxa):**
```{r, eval = T}
length(Backbone.uni.vasc$Name_short) # nrow() of a vector returns NULL
```
There are `r nrow(Backbone.uni.vasc)` unique (vascular plant) taxon names:
```{r, eval = T, echo=F}
kable((table(Backbone.uni.vasc$sPlot_TRY)), caption = "Number of (standardized) vascular plant taxon names unique to, and shared between, sPlot (S), TRY (T) and the Alpine (A) dataset.") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
latex_options = "basic",
full_width = F, position = "center")
```
**Taxonomic ranks:**
```{r, eval = T, echo=F}
kable((table(Backbone.uni.vasc$Rank_correct, exclude=NULL)), caption = "Number of (standardized) name entries per taxonomic rank.") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
latex_options = "basic",
full_width = F, position = "center")
```
**Taxonomic status:**
```{r, eval = T, echo=F}
kable((table(Backbone.uni.vasc$Status_correct, exclude=NULL)), caption = "Number of (standardized) name entries per taxonomic status")%>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
latex_options = "basic",
full_width = F, position = "center")
```
**Total number of unique standardized taxon names and families:**
```{r, eval = T}
length(unique(Backbone.uni.vasc$Name_short))-1 # minus 1 for NA
length(unique(Backbone.uni.vasc$Family_correct))-1
```
## Stats for the corrected names in `sPlot` only:
```{r, eval = T}
Backbone.uni.sPlot <- Backbone.uni.vasc %>%
filter(sPlot_TRY %in% c("S", "ST", "SA", "STA"))
```
There are `r nrow(Backbone.uni.sPlot %>% distinct(Name_correct))` unique, corrected names of vascular plants for sPlot species.
**Database affiliations**
```{r, eval = T, echo=F}
kable((table(Backbone.uni.sPlot$sPlot_TRY)), caption = "Number of (standardized) vascular
plant taxon names unique to sPlot (S), and shared with TRY (ST), the Alpine dataset (SA) or both (STA).")%>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
latex_options = "basic",
full_width = F, position = "center")
```
**Taxonomic ranks:**
```{r, eval = T, echo=F}
kable((table(Backbone.uni.sPlot$Rank_correct, exclude=NULL)), caption = "Number of (standardized) vascular plant taxon names per taxonomic rank.")%>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
latex_options = "basic",
full_width = F, position = "center")
```
**Taxonomic status:**
```{r, eval = T, echo=F}
kable((table(Backbone.uni.sPlot$Status_correct, exclude=NULL)), caption = "Number of (standardized) vascular plant taxon names that correspond to `Accepted`, `Synonyms` or Unresolved species, respectively.") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
latex_options = "basic",
full_width = F, position = "center")
```
**Number of families in sPlot**:
```{r, eval = T}
length(unique(na.omit(Backbone.uni.sPlot$Family_correct))) # column is Family_correct; nrow() of a vector returns NULL
```
**Done!**
--------------------
# `R`-settings
```{r, eval = T}
sessionInfo()
```