Codebase Documentation

This repository implements a species distribution modeling comparison study for about 600 South American mammal species. Specifically, the

Project Structure

  • R/: Contains all the R scripts organized by workflow steps.
  • Symobio_modeling.Rproj: RStudio project file for easy navigation.
  • README.md: High-level overview of the project.
  • renv/: Manages package dependencies for reproducibility.
  • renv.lock: Lockfile for renv to ensure consistent package versions.

Workflow Overview

The workflow is divided into several stages, each represented by scripts in the R/ directory. Below is a summary of the key steps:

1. Preparation of Geographic Data

  • 01_01_range_map_preparation.R: Processes IUCN mammal range maps for South America, converting them to a standardized raster format with consistent projection for downstream analysis.
  • 01_02_raster_preparation.R: Prepares environmental predictor rasters (e.g., climate, elevation, land cover) by cropping them to the South American extent, resampling them to a consistent resolution, and applying any necessary transformations.

2. Preparation of Complementary Species-Level Data

  • 02_01_functional_group_assignment.R: Assigns mammal species to functional groups based on diet, locomotion, and body size characteristics, creating categorical variables for modeling.
  • 02_02_functional_traits_preparation.R: Cleans and standardizes continuous trait data (body mass, diet breadth, etc.) for all study species, handling missing values through imputation where necessary.
  • 02_03_phylo_preparation.R: Extracts phylogenetic information for target mammal species, computes phylogenetic distance matrices, and prepares the data for inclusion in models.
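
The trait cleaning step in 02_02 can be illustrated with a minimal base-R sketch. The column names, the example values, and the choice of median imputation are all assumptions made for illustration; the repository's scripts may use different traits and a different imputation method.

```r
# Hypothetical trait table; species names, columns, and values are illustrative only.
traits <- data.frame(
  species      = c("sp_a", "sp_b", "sp_c", "sp_d"),
  body_mass_g  = c(1200, NA, 35, 48000),
  diet_breadth = c(3, 2, NA, 1)
)

# Replace missing values with the column median (an assumed imputation strategy).
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Body mass spans several orders of magnitude, so log-transform it before imputing.
traits$log_body_mass <- impute_median(log10(traits$body_mass_g))
traits$diet_breadth  <- impute_median(traits$diet_breadth)

# Standardize both traits to zero mean and unit variance.
traits$log_body_mass_z <- as.numeric(scale(traits$log_body_mass))
traits$diet_breadth_z  <- as.numeric(scale(traits$diet_breadth))
```

Imputing on the log scale keeps a single very large species from dominating the imputed value.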

3. Preparation of Presence/Absence Data

  • 03_01_presence_preparation.R: Processes occurrence records from GBIF and other sources, applies spatial filtering to reduce sampling bias, and aligns taxonomic nomenclature.
  • 03_02_absence_preparation.R: Generates pseudo-absence points using a stratified random approach, with constraints based on environmental conditions and range map boundaries.
  • 03_03_dataset_exploration.R: Produces descriptive statistics and visualizations of presence/absence data, environmental variables, and species coverage to assess data quality.
  • 03_04_model_data_finalization.R: Merges all prepared datasets (occurrences, absences, predictors) into final modeling datasets, splits data into training/testing sets, and applies any necessary scaling or transformations.
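
As a rough illustration of the stratified pseudo-absence sampling described for 03_02, the sketch below draws an equal number of background cells from each quartile of one environmental variable and excludes cells inside the species' range. The grid, the `temp` variable, the `in_range` flag, and the per-stratum sample size are hypothetical; the actual script may define strata and constraints differently.

```r
set.seed(42)

# Hypothetical candidate grid cells with one environmental variable and a
# flag marking cells that fall inside the species' range map.
n_cells <- 1000
cells <- data.frame(
  cell_id  = seq_len(n_cells),
  temp     = runif(n_cells, min = 5, max = 30),
  in_range = runif(n_cells) < 0.2
)

# Constraint: pseudo-absences may only come from cells outside the range map.
candidates <- cells[!cells$in_range, ]

# Strata: quartiles of the environmental variable among the candidates.
breaks <- quantile(candidates$temp, probs = seq(0, 1, by = 0.25))
candidates$stratum <- cut(candidates$temp, breaks = breaks,
                          include.lowest = TRUE, labels = FALSE)

# Draw the same number of cells at random from every stratum.
n_per_stratum <- 25
sampled_ids <- unlist(lapply(
  split(candidates$cell_id, candidates$stratum),
  function(ids) ids[sample.int(length(ids), min(n_per_stratum, length(ids)))]
))

pseudo_absences <- candidates[candidates$cell_id %in% sampled_ids, ]
pseudo_absences$present <- 0L
```

Sampling per stratum keeps the pseudo-absences spread along the environmental gradient rather than concentrated in the most common conditions.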

4. Modeling

  • 04_01_modelling_ssdm.R: Implements traditional single-species distribution modeling (SSDM) approaches with selected algorithms (e.g., MaxEnt, random forests), including hyperparameter tuning.
  • 04_02_modelling_msdm_embed.R: Develops multi-species distribution models using neural network approaches that embed species identities into a latent space, capturing inter-species relationships.
  • 04_03_modelling_msdm_onehot.R: Implements multi-species distribution models using one-hot encoding for species identities, enabling joint prediction across all species simultaneously.
  • 04_04_modelling_msdm_rf.R: Implements random forest-based multi-species distribution modeling, incorporating species identity as a predictor variable alongside environmental variables.
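
The difference between the one-hot and embedding representations of species identity can be sketched in base R. This is a toy illustration, not the repository's model code: the data, the embedding dimension, and the random embedding values are assumptions, and in a real network the embedding matrix would be learned during training.

```r
# Hypothetical observations; species names and values are illustrative only.
obs <- data.frame(
  species = factor(c("sp_a", "sp_b", "sp_c", "sp_a")),
  temp    = c(21.5, 18.0, 25.3, 19.9),
  present = c(1L, 0L, 1L, 1L)
)

# One-hot encoding: one indicator column per species.
# "- 1" drops the intercept so that every factor level gets its own column.
onehot <- model.matrix(~ species - 1, data = obs)
colnames(onehot) <- levels(obs$species)
x_onehot <- cbind(onehot, temp = obs$temp)

# Embedding: a lookup table with one low-dimensional row per species.
# The values are random here; a neural network would learn them.
set.seed(1)
embed_dim <- 2
embedding <- matrix(
  rnorm(nlevels(obs$species) * embed_dim),
  nrow = nlevels(obs$species),
  dimnames = list(levels(obs$species), NULL)
)
x_embed <- cbind(embedding[as.integer(obs$species), , drop = FALSE],
                 temp = obs$temp)

# An embedding lookup is equivalent to multiplying the one-hot matrix by the
# embedding matrix, which is why the two approaches are closely related.
lookup_via_matmul <- onehot %*% embedding
```

With about 600 species, the one-hot input has one column per species, while the embedding keeps the species representation at a fixed, much smaller width.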

5. Analysis

  • 05_01_performance_report.qmd: Generates comprehensive reports on model performance metrics (AUC, TSS, etc.) for all modeling approaches, with visualizations comparing performance across species and methods.
  • 05_02_publication_analysis.qmd: Conducts advanced statistical analyses of model results, creates publication-quality figures, and summarizes findings for manuscript preparation.
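
For reference, the two metrics named above can be computed in a few lines of base R. The function names and the 0.5 threshold are assumptions for this sketch; the repository's own evaluation helpers may be implemented differently.

```r
# AUC via the rank-sum (Mann-Whitney) formulation: the probability that a
# randomly chosen presence receives a higher score than a randomly chosen absence.
auc <- function(obs, pred) {
  n_pos <- sum(obs == 1)
  n_neg <- sum(obs == 0)
  r <- rank(pred)  # ties receive average ranks
  (sum(r[obs == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# True skill statistic: sensitivity + specificity - 1 at a given threshold.
tss <- function(obs, pred, threshold = 0.5) {
  pred_bin <- as.integer(pred >= threshold)
  sensitivity <- sum(pred_bin == 1 & obs == 1) / sum(obs == 1)
  specificity <- sum(pred_bin == 0 & obs == 0) / sum(obs == 0)
  sensitivity + specificity - 1
}

# Toy example: three presences followed by three absences.
obs  <- c(1, 1, 1, 0, 0, 0)
pred <- c(0.9, 0.8, 0.4, 0.6, 0.3, 0.1)

auc_value <- auc(obs, pred)
tss_value <- tss(obs, pred)
```

AUC is threshold-independent, whereas TSS depends on the cutoff used to convert predicted probabilities into presences and absences.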

Miscellaneous

  • utils.R: Contains utility functions used across multiple scripts, including data processing helpers, custom evaluation metrics, and visualization functions.

Getting Started

  1. Clone the repository and open the Symobio_modeling.Rproj file in RStudio.
  2. Restore the project environment using renv:
    renv::restore()
  3. Run the scripts in the R/ directory sequentially. Some scripts, especially those for model fitting, may take a long time to run and benefit from powerful hardware.
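
One possible way to run the numbered scripts in order is sketched below. The helper `run_pipeline()` is hypothetical and not part of the repository. It assumes that the file-name prefixes (01_01, 01_02, ...) define the intended order and that each script can run in its own environment, which may not hold if scripts rely on objects left in the global environment.

```r
# Hypothetical helper: source every numbered .R script in a directory in
# file-name order. Files without a numeric prefix (e.g., utils.R) and the
# .qmd reports are skipped.
run_pipeline <- function(script_dir = "R") {
  scripts <- sort(list.files(
    script_dir,
    pattern = "^[0-9]{2}_[0-9]{2}_.*\\.R$",
    full.names = TRUE
  ))
  for (script in scripts) {
    message("Running ", basename(script))
    # Give each script a fresh environment so objects do not leak between steps.
    source(script, local = new.env())
  }
  invisible(basename(scripts))
}
```

Calling `run_pipeline("R")` from the project root would then execute the preparation and modeling scripts one after another. Given the long run times noted above, it is often more practical to run the stages individually.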

Additional Notes

  • Ensure that all required input data (e.g., range maps, raster files) is available in the expected directories.
  • Outputs from each script are typically saved to disk and used as inputs for subsequent scripts.
  • Refer to the README.md file for any additional project-specific instructions.