Check traits at the individual level. There are some traits with unexpected negative entries:
```{r}
try.species.names %>%
...
@@ -168,29 +169,41 @@ try.species.names %>%
  group_by(variable) %>%
  summarize(n=n())
```
According to Jens Kattge, the entries for `Leaf.delta.15N` are legitimate, while in the other cases, it may be due to bad predictions. He suggested to delete these negative records.
Similarly, there are records with impossible values for height. Some species were incorrectly predicted to have a height >100 m, and some herbs were predicted to have a height >10 m.
This results in the exclusion of `r length(unique(c(toexclude, toexclude2, toexclude3)))` individuals. The total number of species included in TRY is thereby reduced to `r try.individuals %>% distinct(Name_short) %>% nrow()`.
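The exclusion rules described above can be sketched on a toy table. This is only an illustration under stated assumptions: the table `toy.individuals`, its columns, and all values are invented, and the three `toy.exclude*` vectors merely mimic the role that `toexclude`, `toexclude2` and `toexclude3` appear to play in the report.

```r
library(dplyr)

# Toy stand-in for the individual-level trait table (all values invented)
toy.individuals <- tibble(
  ObsID = 1:6,
  GrowthForm = c("herb", "tree", "herb", "tree", "herb", "shrub"),
  Leaf.delta.15N = c(-2.1, 3.0, 1.2, -0.5, 0.8, 2.2),
  SLA = c(12.5, -3.0, 20.1, 8.7, 15.3, 9.9),
  PlantHeight = c(0.4, 35, 12, 120, 0.6, 2.5)
)

# 1) Negative entries in traits other than Leaf.delta.15N
#    (negative Leaf.delta.15N values are legitimate and are kept)
toy.exclude1 <- toy.individuals %>%
  filter(SLA < 0 | PlantHeight < 0) %>%
  pull(ObsID)

# 2) Any individual with a height >100 m
toy.exclude2 <- toy.individuals %>%
  filter(PlantHeight > 100) %>%
  pull(ObsID)

# 3) Herbs with a height >10 m
toy.exclude3 <- toy.individuals %>%
  filter(GrowthForm == "herb" & PlantHeight > 10) %>%
  pull(ObsID)

# Combine the three sets, counting each individual only once
toy.excluded <- unique(c(toy.exclude1, toy.exclude2, toy.exclude3))

toy.clean <- toy.individuals %>%
  filter(!ObsID %in% toy.excluded)
```

Taking `unique()` over the combined vectors avoids double-counting an individual that fails more than one check.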
## Calculate species and genus level trait means and sd
```{r}
## Calculate species level trait means and sd.
try.species.means <- try.individuals %>%
  group_by(Name_short) %>%
  #Add a field to indicate the number of observations per taxon
...
@@ -270,7 +283,7 @@ Merge vegetation layers, where necessary. Combine cover values across layers
## Classify plots in `is.forest` or `is.non.forest` based on species traits
sPlot has two independent systems for classifying plots into vegetation types. The first classifies plots into forest and non-forest, based on the share of trees and the layering of vegetation. The second classifies plots into broad habitat types and relies on the expert opinion of data contributors. Unfortunately, the second classification is not consistently available across plots: the large majority of classified plots are in Europe. These broad habitat types are coded using five non-mutually exclusive dummy variables:
1) Forest - F
2) Grassland - G
3) Shrubland - S
4) Sparse vegetation - B (Bare)
5) Wetland - W
A plot may belong to more than one formation, e.g. a Savannah is categorized as Forest + Grassland (FG).
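As an illustration of how five non-mutually exclusive dummy variables map to combined codes such as FG, consider the sketch below. The table `toy.habitat` and its column names are hypothetical and do not reflect the actual sPlot header fields.

```r
library(dplyr)

# Toy habitat table; column names and values are invented for illustration
toy.habitat <- tibble(
  PlotObservationID = 1:4,
  Forest            = c(1, 1, 0, 0),
  Grassland         = c(0, 1, 1, 0),
  Shrubland         = c(0, 0, 0, 0),
  Sparse.vegetation = c(0, 0, 0, 0),
  Wetland           = c(0, 0, 1, 0)
)

# Concatenate the letter of every formation a plot belongs to;
# plots with no formation at all get NA
toy.habitat <- toy.habitat %>%
  mutate(code = paste0(ifelse(Forest == 1, "F", ""),
                       ifelse(Grassland == 1, "G", ""),
                       ifelse(Shrubland == 1, "S", ""),
                       ifelse(Sparse.vegetation == 1, "B", ""),
                       ifelse(Wetland == 1, "W", "")),
         code = na_if(code, ""))
```

In this toy example the second plot is coded FG, matching the savanna case mentioned above.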
\newline\newline
Derive the `is.forest` and `is.non.forest` classification of plots.
### Derive species level information on Growth Forms.
We used different sources of information:
1) Data from the gap-filled trait matrix
2) Manual cleaning of the most common species for which growth form information is not available
3) Data from TRY (public dataset only) on all species with growth form info (Trait ID = 42)
4) Cross-match with species assigned to tree layer in DT table.
\newline\newline
Step 1: Attach growth form trait information to the DT table. Growth form information derives from TRY.
```{r}
DT.gf <- DT2 %>%
  filter(taxon_group=="Vascular plant") %>%
  #join with try names, using resolved species names as key
  left_join(try.species.names %>%
              dplyr::select(Name_short, GrowthForm) %>%
              rename(species=Name_short) %>%
              distinct(species, .keep_all=T),
            by="species") %>%
  left_join(try.species.means %>%
              dplyr::select(Name_short, PlantHeight_mean) %>%
              rename(species=Name_short),
            by="species")
# number of records without Growth Form info
sum(is.na(DT.gf$GrowthForm))
```
Step 2: Select most common species without growth-trait information to export and check manually
```{r, eval=F}
top.gf.nas <- DT.gf %>%
  filter(is.na(GrowthForm)) %>%
  group_by(species) %>%
  summarize(n=n()) %>%
  arrange(desc(n))
write_csv(top.gf.nas %>%
            filter(n>1000),
          path="../_derived/Species_missingGF.csv")
```
The `r nrow(top.gf.nas %>% filter(n>1000))` most common species account for `r sum(top.gf.nas %>% filter(n>1000) %>% pull(n))/sum(top.gf.nas$n)*100`% of the missing records. Assign growth forms manually, reimport, and coalesce into `DT.gf`.
After manual completion, the number of records without growth form information decreased to `r sum(is.na(DT.gf$GrowthForm))`.
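The reimport-and-coalesce step is not shown here. A minimal sketch of what it could look like follows, assuming the manually completed file has one row per species. The table `toy.manual` stands in for that file, and the column `GrowthForm.manual` is a hypothetical name.

```r
library(dplyr)

# Toy version of DT.gf with missing growth forms (values invented)
toy.DT.gf <- tibble(
  species    = c("Quercus robur", "Poa annua", "Carex nigra", "Poa annua"),
  GrowthForm = c("tree", NA, NA, NA)
)

# Stand-in for the manually completed file after reimport
toy.manual <- tibble(
  species           = "Poa annua",
  GrowthForm.manual = "herb"
)

# Keep the existing growth form where present,
# otherwise fall back on the manual assignment
toy.DT.gf <- toy.DT.gf %>%
  left_join(toy.manual, by = "species") %>%
  mutate(GrowthForm = coalesce(GrowthForm, GrowthForm.manual)) %>%
  select(-GrowthForm.manual)

# Records still lacking growth form information
sum(is.na(toy.DT.gf$GrowthForm))
```

Using `coalesce()` rather than overwriting means that manual entries never replace growth forms already obtained from TRY.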
\newline\newline
Step 3: Import additional data on growth form from TRY (accessed 10 March 2020). All public data on growth form were downloaded. First, take care of unmatched quotation marks in the txt file. Do this from the command line.
```{bash, eval=F}
# escape all unmatched quotation marks. Run in Linux terminal
```
Step 4: Cross-match. Assign all species occurring in the tree layer of at least one relevé as trees. Conservatively, do this only when the record is at species level (exclude records at genus\\family level).
```{r}
other.trees <- DT.gf %>%
  filter(Layer==1 & is.na(GrowthForm)) %>%
  filter(Rank_correct=="species") %>%
  distinct(species, Layer, GrowthForm) %>%
  pull(species)
DT.gf <- DT.gf %>%
  mutate(GrowthForm=replace(GrowthForm,
                            list=species %in% other.trees,
                            values="tree"))
```
After cross-matching, the number of records without growth form information decreased to `r sum(is.na(DT.gf$GrowthForm))`.
Classify species as tree or tall shrubs vs. other. Make a compact table of species growth forms and create fields `is.tree.or.tall.shrub` and `is.not.tree.and.small`.
Define a species as `is.tree.or.tall.shrub` when it is either defined as a tree, OR has a height >10 m.
Define a species as `is.not.tree.or.shrub.and.small` when it has a height <10 m, as long as it is not defined as a tree. When height is not available, it is sufficient that the species is classified as "herb" or "other".
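The two definitions above can be sketched as follows. The table `toy.GF` and its values are invented, the helper columns `is.tree` and `is.tall` are hypothetical, and the treatment of missing values is an assumption rather than the report's confirmed behaviour.

```r
library(dplyr)

# Toy species table (values invented); heights are in metres
toy.GF <- tibble(
  species          = c("sp.A", "sp.B", "sp.C", "sp.D", "sp.E"),
  GrowthForm       = c("tree", "shrub", "herb", "shrub", NA),
  PlantHeight_mean = c(25, 12, 0.5, NA, 3)
)

toy.GF <- toy.GF %>%
  mutate(
    # %in% returns FALSE (not NA) when GrowthForm is missing
    is.tree = GrowthForm %in% "tree",
    is.tall = !is.na(PlantHeight_mean) & PlantHeight_mean > 10,
    # tree by growth form, OR taller than 10 m
    is.tree.or.tall.shrub = is.tree | is.tall,
    # not a tree AND shorter than 10 m; without a height,
    # fall back on growth form being "herb" or "other"
    is.not.tree.or.shrub.and.small = !is.tree &
      ifelse(is.na(PlantHeight_mean),
             GrowthForm %in% c("herb", "other"),
             PlantHeight_mean < 10)
  )
```

Under these rules a shrub with no height information (`sp.D`) falls into neither class, so the two fields are not simple complements of each other.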
Combine classifications from the four criteria. Use majority vote to assign plots. In case of ties, a progressively lower priority is given from criterion 1 to criterion 4.
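A majority vote with priority-based tie-breaking could be implemented along these lines. This is a sketch only: the columns `crit1` to `crit4` are hypothetical placeholders for the individual criteria, and resolving ties by taking the highest-priority criterion that cast a vote is one possible reading of the rule, not necessarily the one used in the report.

```r
library(dplyr)

# Toy votes from four criteria (values invented); NA means no verdict
toy.votes <- tibble(
  PlotObservationID = 1:4,
  crit1 = c("forest", "forest", NA, NA),
  crit2 = c("forest", "non-forest", "non-forest", NA),
  crit3 = c("non-forest", "non-forest", NA, NA),
  crit4 = c(NA, "forest", "non-forest", NA)
)

# Count the votes for each class, ignoring missing verdicts
votes.matrix <- as.matrix(toy.votes[, c("crit1", "crit2", "crit3", "crit4")])
toy.votes <- toy.votes %>%
  mutate(
    n.forest     = rowSums(votes.matrix == "forest", na.rm = TRUE),
    n.non.forest = rowSums(votes.matrix == "non-forest", na.rm = TRUE),
    # Highest-priority criterion that cast a vote, used only to break ties
    first.vote   = coalesce(crit1, crit2, crit3, crit4),
    vote = case_when(
      n.forest > n.non.forest ~ "forest",
      n.non.forest > n.forest ~ "non-forest",
      TRUE ~ first.vote
    )
  )
```

The second toy plot has two votes each way, so the verdict of `crit1` decides. A plot with no votes at all stays unclassified.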
```{r}
#Build a confusion matrix to evaluate the comparison
u <- union(cross.check$isfor_isnonfor, cross.check$veg_type)
t <- table( factor(cross.check$isfor_isnonfor, u), factor(cross.check$veg_type, u))
confm <- caret::confusionMatrix(t)
```
```{r echo=F}
knitr::kable(confm$table, caption="Confusion matrix between sPlot's native classification of habitats (columns), and classification based on four criteria based on vegetation layers and growth forms (rows)")
```
Formulas of associated statistics are available on the help page of the [caret package](https://www.rdocumentation.org/packages/caret/versions/6.0-84/topics/confusionMatrix) and associated references.
The overall accuracy of the classification based on `is.forest`\\`is.non.forest`, when tested against sPlot's native habitat classification, is `r round(confm$overall[1],2)`; the Kappa statistic is `r round(confm$overall[2],2)`.
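For readers unfamiliar with these two statistics, the sketch below computes overall accuracy and Cohen's Kappa by hand in base R from a made-up 2 x 2 confusion matrix. The counts in `toy.cm` are invented and unrelated to the sPlot results.

```r
# Toy confusion matrix: rows = predicted class, columns = reference class
toy.cm <- matrix(c(40, 10, 5, 45), nrow = 2,
                 dimnames = list(predicted = c("forest", "non-forest"),
                                 reference = c("forest", "non-forest")))

n <- sum(toy.cm)

# Overall accuracy: share of plots on the diagonal
accuracy <- sum(diag(toy.cm)) / n

# Agreement expected by chance, from the row and column totals
expected <- sum(rowSums(toy.cm) * colSums(toy.cm)) / n^2

# Cohen's Kappa: agreement beyond chance, rescaled to a maximum of 1
kappa <- (accuracy - expected) / (1 - expected)
```

With these counts the accuracy is 0.85 and Kappa is 0.7, because half of the agreement would already be expected by chance.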
```{r echo=F}
knitr::kable((confm$byClass), caption="Associated statistics of confusion matrix by class")
```
Through the process described above, we managed to classify `r plot.vegtype %>% filter(is.forest==T | is.non.forest==T) %>% nrow()` plots, of which `r plot.vegtype %>% filter(is.forest==T) %>% nrow()` are forest and `r plot.vegtype %>% filter(is.non.forest==T) %>% nrow()` are non-forest.
\newline\newline
The total number of plots with attribution to forest\\non-forest (either coming from sPlot's native classification, or from the process above) is: `r header.vegtype %>% dplyr::select(-PlotObservationID) %>% filter(rowMeans(is.na(.)) < 1) %>% nrow()`.
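The inline expression above counts rows in which at least one classification field is not missing. A toy illustration of the `rowMeans(is.na(.)) < 1` idiom follows; the table `toy.header` and its columns are invented.

```r
library(dplyr)

# Toy header table: plot 2 has no classification at all
toy.header <- tibble(
  PlotObservationID = 1:3,
  is.forest     = c(TRUE, NA, NA),
  is.non.forest = c(FALSE, NA, TRUE)
)

# rowMeans(is.na(.)) is the share of missing fields per row;
# a value below 1 means at least one field is filled in
n.classified <- toy.header %>%
  select(-PlotObservationID) %>%
  filter(rowMeans(is.na(.)) < 1) %>%
  nrow()
```

Dropping `PlotObservationID` first matters, since the identifier is never missing and would otherwise make every row pass the filter.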