Codebase Documentation
This repository implements a species distribution modeling comparison study for about 600 South American mammal species. Specifically, the
Project Structure
- R/: Contains all the R scripts, organized by workflow step.
- Symobio_modeling.Rproj: RStudio project file for easy navigation.
- README.md: High-level overview of the project.
- renv/: Manages package dependencies for reproducibility.
- renv.lock: Lockfile for renv to ensure consistent package versions.
Workflow Overview
The workflow is divided into several stages, each represented by scripts in the R/ directory. Below is a summary of the key steps:
1. Preparation of Geographic Data
- 01_01_range_map_preparation.R: Processes IUCN mammal range maps for South America, converting them to a standardized raster format with a consistent projection for downstream analysis.
- 01_02_raster_preparation.R: Prepares environmental predictor rasters (e.g., climate, elevation, land cover) by cropping them to the South America extent, resampling them to a consistent resolution, and applying any necessary transformations.
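The cropping and resampling step can be sketched as follows. This is a minimal illustration, not the code in 01_02_raster_preparation.R. It assumes the terra package, a rough bounding box for South America, a 0.5 degree target grid, and a hypothetical helper named prepare_predictor; the extent, resolution, and resampling method used in the project may differ.

```r
# Hypothetical helper: crop a predictor raster to the template's extent and
# resample it onto the template grid. Assumes the terra package is installed.
prepare_predictor <- function(r, template) {
  cropped <- terra::crop(r, terra::ext(template))          # restrict to study area
  terra::resample(cropped, template, method = "bilinear")  # align to template grid
}

if (requireNamespace("terra", quietly = TRUE)) {
  # Synthetic global raster at 1 degree resolution with placeholder values
  global <- terra::rast(nrows = 180, ncols = 360,
                        xmin = -180, xmax = 180, ymin = -90, ymax = 90,
                        crs = "EPSG:4326")
  terra::values(global) <- seq_len(terra::ncell(global))

  # Template grid: approximate South America bounding box (an assumption)
  template <- terra::rast(xmin = -82, xmax = -34, ymin = -56, ymax = 13,
                          resolution = 0.5, crs = "EPSG:4326")

  predictor <- prepare_predictor(global, template)

  # Check that the prepared raster shares the template's geometry
  print(terra::compareGeom(predictor, template))
}
```

Resampling every predictor onto one shared template grid keeps all layers cell-aligned, which later makes it straightforward to extract predictor values at occurrence points.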
2. Preparation of Complementary Species-Level Data
- 02_01_functional_group_assignment.R: Assigns mammal species to functional groups based on diet, locomotion, and body size characteristics, creating categorical variables for modeling.
- 02_02_functional_traits_preparation.R: Cleans and standardizes continuous trait data (body mass, diet breadth, etc.) for all study species, handling missing values through imputation where necessary.
- 02_03_phylo_preparation.R: Extracts phylogenetic information for target mammal species, computes phylogenetic distance matrices, and prepares the data for inclusion in models.
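As a rough illustration of the trait preparation and group assignment described above, the sketch below imputes a missing body mass and derives a categorical functional group in base R. The species names, trait values, median imputation, and size class break points are all placeholders chosen for the example; the project scripts may use different traits, thresholds, and imputation methods.

```r
# Toy trait table; species names and values are placeholders, not project data.
traits <- data.frame(
  species     = c("sp_a", "sp_b", "sp_c", "sp_d"),
  body_mass_g = c(25, 1200, NA, 45000),
  diet        = c("insectivore", "herbivore", "herbivore", "carnivore"),
  stringsAsFactors = FALSE
)

# Impute missing body mass with the median of observed values (assumed strategy).
traits$body_mass_g[is.na(traits$body_mass_g)] <-
  median(traits$body_mass_g, na.rm = TRUE)

# Bin body mass into size classes; the break points are illustrative only.
traits$size_class <- cut(
  traits$body_mass_g,
  breaks = c(0, 100, 5000, Inf),
  labels = c("small", "medium", "large")
)

# Combine diet and size class into a categorical functional group.
traits$functional_group <- factor(
  paste(traits$diet, traits$size_class, sep = "_")
)

print(traits)
```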
3. Preparation of Presence/Absence Data
- 03_01_presence_preparation.R: Processes occurrence records from GBIF and other sources, applies spatial filtering to reduce sampling bias, and aligns taxonomic nomenclature.
- 03_02_absence_preparation.R: Generates pseudo-absence points using a stratified random approach, with constraints based on environmental conditions and range map boundaries.
- 03_03_dataset_exploration.R: Produces descriptive statistics and visualizations of presence/absence data, environmental variables, and species coverage to assess data quality.
- 03_04_model_data_finalization.R: Merges all prepared datasets (occurrences, absences, predictors) into final modeling datasets, splits the data into training and testing sets, and applies any necessary scaling or transformations.
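The final split and scaling step might look like the following base R sketch. The data are simulated, and the 80/20 split ratio, the column names, and the choice of z-score scaling are assumptions made for the example rather than settings taken from 03_04_model_data_finalization.R.

```r
set.seed(42)  # for a reproducible split

# Toy modeling table: presence (1) / pseudo-absence (0) with two predictors.
model_data <- data.frame(
  present = rep(c(1, 0), each = 50),
  temp    = rnorm(100, mean = 22, sd = 4),
  precip  = rnorm(100, mean = 1500, sd = 300)
)

# Hold out 20% of rows for testing (the ratio is an assumption).
test_idx <- sample(nrow(model_data), size = round(0.2 * nrow(model_data)))
train <- model_data[-test_idx, ]
test  <- model_data[test_idx, ]

# Scale predictors using training-set statistics only, so that no
# information from the test set leaks into the fitted transformation.
predictors <- c("temp", "precip")
center <- colMeans(train[, predictors])
spread <- apply(train[, predictors], 2, sd)
train[, predictors] <- as.data.frame(
  scale(train[, predictors], center = center, scale = spread)
)
test[, predictors] <- as.data.frame(
  scale(test[, predictors], center = center, scale = spread)
)

str(train)
```

Reusing the training means and standard deviations for the test set keeps the evaluation honest: the test data are transformed exactly as new, unseen data would be.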
4. Modeling
- 04_01_modelling_ssdm.R: Implements traditional single-species distribution modeling (SSDM) approaches with selected algorithms (e.g., MaxEnt, random forests), including hyperparameter tuning.
- 04_02_modelling_msdm_embed.R: Develops multi-species distribution models using neural network approaches that embed species identities into a latent space, capturing inter-species relationships.
- 04_03_modelling_msdm_onehot.R: Implements multi-species distribution models using one-hot encoding for species identities, enabling joint prediction across all species simultaneously.
- 04_04_modelling_msdm_rf.R: Implements random forest-based multi-species distribution modeling, incorporating species identity as a predictor variable alongside environmental variables.
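To make the difference between the species encodings concrete, the base R sketch below builds a one-hot matrix and an integer species index from the same toy table. It only shows how the inputs could be shaped; the species names and the predictor are placeholders, and the network architecture and framework used in the modeling scripts are not assumed here.

```r
# Toy occurrence table; species names are placeholders.
obs <- data.frame(
  species = factor(c("sp_a", "sp_b", "sp_c", "sp_a")),
  temp    = c(21.5, 18.0, 25.3, 22.1)
)

# One-hot encoding: one indicator column per species, no intercept.
onehot <- model.matrix(~ species - 1, data = obs)

# Integer index per species: the typical input to an embedding layer,
# which maps each index to a learned dense vector.
species_index <- as.integer(obs$species)

# Input matrix for a one-hot multi-species model: predictor plus indicators.
x_onehot <- cbind(temp = obs$temp, onehot)
print(x_onehot)
print(species_index)
```

With one-hot encoding the input width grows by one column per species (roughly 600 here), whereas an embedding keeps a single index column and learns a low-dimensional vector per species.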
5. Analysis
- 05_01_performance_report.qmd: Generates comprehensive reports on model performance metrics (AUC, TSS, etc.) for all modeling approaches, with visualizations comparing performance across species and methods.
- 05_02_publication_analysis.qmd: Conducts advanced statistical analyses of model results, creates publication-quality figures, and summarizes findings for manuscript preparation.
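For reference, the two metrics named above can be computed in a few lines of base R. The functions below are a sketch written for this document: the names auc and tss and the 0.5 threshold are assumptions, and the reports may instead rely on package implementations or the custom metrics in utils.R.

```r
# AUC via the rank-sum (Mann-Whitney) formulation; ties get average ranks.
auc <- function(observed, predicted) {
  n_pos <- sum(observed == 1)
  n_neg <- sum(observed == 0)
  ranks <- rank(predicted)
  (sum(ranks[observed == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# True skill statistic: sensitivity + specificity - 1 at a given threshold.
tss <- function(observed, predicted, threshold = 0.5) {
  pred_class  <- as.integer(predicted >= threshold)
  sensitivity <- sum(pred_class == 1 & observed == 1) / sum(observed == 1)
  specificity <- sum(pred_class == 0 & observed == 0) / sum(observed == 0)
  sensitivity + specificity - 1
}

# Toy evaluation data: three presences, three absences.
observed  <- c(1, 1, 1, 0, 0, 0)
predicted <- c(0.9, 0.8, 0.4, 0.6, 0.2, 0.1)

auc(observed, predicted)  # 8/9, about 0.889
tss(observed, predicted)  # 1/3, about 0.333
```

AUC is threshold-independent, while TSS depends on the cutoff used to turn predicted probabilities into presences and absences, so the two can rank models differently.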
Miscellaneous
- utils.R: Contains utility functions used across multiple scripts, including data processing helpers, custom evaluation metrics, and visualization functions.
Getting Started
- Clone the repository and open the Symobio_modeling.Rproj file in RStudio.
- Restore the project environment using renv: renv::restore()
- Run the scripts in the R/ directory sequentially. Some scripts, especially those for model fitting, may take a long time to run and benefit from powerful hardware.
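One way to run the numbered scripts in order is sketched below. The helper ordered_scripts is hypothetical and relies only on the two-part numeric prefix of the file names listed above; it skips utils.R, and the .qmd reports are not covered because they need to be rendered rather than sourced.

```r
# Hypothetical helper: return numbered workflow scripts in execution order.
# Relies on the NN_NN_ file name prefix used in the R/ directory.
ordered_scripts <- function(files) {
  numbered <- grepl("^[0-9]{2}_[0-9]{2}_.*\\.R$", basename(files))
  selected <- files[numbered]
  selected[order(basename(selected))]
}

# Demonstration with a shuffled subset of the script names.
scripts <- ordered_scripts(c(
  "R/03_01_presence_preparation.R",
  "R/01_02_raster_preparation.R",
  "R/utils.R",
  "R/01_01_range_map_preparation.R"
))
print(scripts)

# In the project root, after renv::restore(), one could then run:
# for (s in ordered_scripts(list.files("R", full.names = TRUE))) source(s)
```

Running each stage manually remains the safer choice for the long model-fitting scripts, since a failure partway through a loop can be harder to diagnose.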
Additional Notes
- Ensure that all required input data (e.g., range maps, raster files) is available in the expected directories.
- Outputs from each script are typically saved to disk and used as inputs for subsequent scripts.
- Refer to the README.md file for any additional project-specific instructions.