04_buildHeader.Rmd
title: "sPlot3.0 - Build Header"
author: "Francesco Maria Sabatini"
date: "2/4/2020"
output: html_document
\newline
Timestamp: r date()
Drafted: Francesco Maria Sabatini
Revised: Helge Bruelheide
Version: 1.2
This report documents the construction of the header file for sPlot 3.0. It is based on dataset sPlot_3.0.2, received on 24/07/2019 from Stephan Hennekens.
Changes in version 1.1
- Excluded plots from Canada, as recommended by Custodian
- Filled missing info from most of the ~2000 plots without country information from these datasets.
- Corrected mismatched sBiomes and ecoregions
Changes in version 1.2 - Reassigned coordinates to ~19.000 misplaced plots (mostly from SOPHY or in Hungary). Assigned country level centroids
- Corrected mismatched CONTINENTS & Countries
- Added graphs to check assignment to continents or countries
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(purrr)
library(viridis)
library(readr)
library(xlsx)
library(knitr)
library(kableExtra)
## Spatial packages
library(rgdal)
library(sp)
library(rgeos)
library(raster)
library(rworldmap)
library(elevatr)
library(sf)
library(rnaturalearth)
library(dggridR)
library(shotGroups) #minCircle
#save temporary files
write("TMPDIR = /data/sPlot/users/Francesco/_tmp", file=file.path(Sys.getenv('TMPDIR'), '.Renviron'))
write("R_USER = /data/sPlot/users/Francesco/_tmp", file=file.path(Sys.getenv('R_USER'), '.Renviron'))
rasterOptions(tmpdir="/data/sPlot/users/Francesco/_tmp")
1 Import data
Import header data. Clean header data from quotation and double quotation marks from linux terminal.
# escape all double quotation marks. Run in Linux terminal
#sed 's/"/\\"/g' sPlot_3_0_2_header.csv > sPlot_3_0_2_header_test.csv
#more general alternative in case some " are already escaped
##first removing \s before all "s, and then adding \ before all ":
#sed 's/\([^\\]\)"/\1\\\"/g; s/"/\\"/g'
Import cleaned header data.
header0 <- readr::read_delim("../sPlot_data_export/sPlot_3_0_2_header_test.csv",
locale = locale(encoding = 'UTF-8'),
delim="\t", col_types=cols(
PlotObservationID = col_double(),
PlotID = col_double(),
`TV2 relevé number` = col_double(),
Country = col_character(),
`Cover abundance scale` = col_factor(),
`Date of recording` = col_date(format="%d-%m-%Y"),
`Relevé area (m²)` = col_double(),
`Altitude (m)` = col_double(),
`Aspect (°)` = col_double(),
`Slope (°)` = col_double(),
`Cover total (%)` = col_double(),
`Cover tree layer (%)` = col_double(),
`Cover shrub layer (%)` = col_double(),
`Cover herb layer (%)` = col_double(),
`Cover moss layer (%)` = col_double(),
`Cover lichen layer (%)` = col_double(),
`Cover algae layer (%)` = col_double(),
`Cover litter layer (%)` = col_double(),
`Cover open water (%)` = col_double(),
`Cover bare rock (%)` = col_double(),
`Height (highest) trees (m)` = col_double(),
`Height lowest trees (m)` = col_double(),
`Height (highest) shrubs (m)` = col_double(),
`Height lowest shrubs (m)` = col_double(),
`Aver. height (high) herbs (cm)` = col_double(),
`Aver. height lowest herbs (cm)` = col_double(),
`Maximum height herbs (cm)` = col_double(),
`Maximum height cryptogams (mm)` = col_double(),
`Mosses identified (y/n)` = col_factor(),
`Lichens identified (y/n)` = col_factor(),
COMMUNITY = col_character(),
SUBSTRATE = col_character(),
Locality = col_character(),
ORIG_NUM = col_character(),
ALLIAN_REV = col_character(),
REV_AUTHOR = col_character(),
Forest = col_logical(),
Grassland = col_logical(),
Wetland = col_logical(),
`Sparse vegetation` = col_logical(),
Shrubland = col_logical(),
`Plants recorded` = col_factor(),
`Herbs identified (y/n)` = col_factor(),
Naturalness = col_factor(),
EUNIS = col_factor(),
Longitude = col_double(),
Latitude = col_double(),
`Location uncertainty (m)` = col_double(),
Dataset = col_factor(),
GUID = col_character()
)) %>%
rename(Sparse.vegetation=`Sparse vegetation`,
ESY=EUNIS) %>%
dplyr::select(-COMMUNITY, -ALLIAN_REV, -REV_AUTHOR, -SUBSTRATE) %>% #too sparse information to be useful
dplyr::select(-PlotID) #identical to PlotObservationID
The following column names occurred in the header of sPlot v2.1 and are currently missing from the header of v3.0
- Syntaxon
- Cover cryptogams (%)
- Cover bare soil (%)
- is.forest
- is.non.forest
- EVA
- Biome
- BiomeID
- CONTINENT
- POINT_X
- POINT_Y
~~ Columns #1 (closed), #2 (closed), #3 (closed), #10, #11 will be dropped. The others will be derived below.
1.1 Exclude unreliable plots
Some canadian plots need to be removed, on indication of Laura Boisvert-Marsh from GIVD NA-CA-004. The plots (and corresponding PlotObservationID
) are:
\newline
Fabot01 - 1707776
Fadum01, 02 & 03 - 1707779:1707781
Faers01 - 1707782
Pfe-f-08 - 1707849
Pfe-o-05- 1707854
header0 <- header0 %>%
filter(!PlotObservationID %in% c(1707776, 1707779:1707782, 1707849, 1707854)) %>%
filter(Dataset != "$Coastal_Borja") %>%
filter(Dataset != "$Coastal_Poland")
1.2 Solve spatial problems
There are 2020 plots in the Nile dataset without spatial coordinates. Assign manually with wide (90km) location uncertainty.
header <- header0 %>%
mutate(Latitude=replace(Latitude,
list=(is.na(Latitude) & Dataset=="Egypt Nile delta"),
values=30.917351)) %>%
mutate(Longitude=replace(Longitude,
list=(is.na(Longitude) & Dataset=="Egypt Nile delta"),
values=31.138534)) %>%
mutate(`Location uncertainty (m)`=replace(`Location uncertainty (m)`,
list=(is.na(`Location uncertainty (m)`) & Dataset=="Egypt Nile delta"),
values=-90000))
There are two plots in the Romania Grassland Databse
, ~4442 plots in the Japan
database, and a few in the European Weed Vegetation Database
whose lat\long are inverted. Correct.
toswap <- c(which(header$Dataset=="Japan" & header$Latitude>90),
which(header$Dataset=="Romania Grassland Database" & header$Longitude>40),
which(header$PlotObservationID==525283))
header[toswap, c("Latitude", "Longitude")] <- header[toswap, c("Longitude", "Latitude")]
nouncert <- nrow(header %>% filter(is.na(`Location uncertainty (m)`)))
There are r nouncert
plots without location uncertainty. As a first approximation, we assign the median of the respective dataset, as a negative value to indicate this is an estimation, rather than a measure.
header <- header %>%
left_join(header %>%
group_by(Dataset) %>%
summarize(loc.uncer.median=median(`Location uncertainty (m)`, na.rm=T)),
by="Dataset") %>%
mutate(`Location uncertainty (m)`=ifelse( is.na(`Location uncertainty (m)` & !is.na(Latitude)),
-abs(loc.uncer.median),
`Location uncertainty (m)`)) %>%
dplyr::select(-loc.uncer.median)
nouncert <- nrow(header %>% filter(is.na(`Location uncertainty (m)`)))
There are still r nouncert
plots with no estimation of location uncertainty.
\newline
Assign plot size to plots in the Patagonia dataset (input of Ana Cingolani)
header <- header %>%
mutate(`Relevé area (m²)`=ifelse( (Dataset=="Patagonia" & is.na(`Relevé area (m²)`)),
-900, `Relevé area (m²)`))
There are 518 plots from the dataset Germany_gvrd (EU-DE-014) having a location uncertainty equal to 2,147,483 km (!). These plots have a location reported. Replace with a more likely estimate (20 km)