diff --git a/README.md b/README.md
index a26b7475873243a602c58480864a3895bfd9033c..35f71745bc6b5e4b05cdc2639c6b38e7d01dee01 100644
--- a/README.md
+++ b/README.md
@@ -1,51 +1,52 @@
 # Codebase Documentation

-This repository implements a species distribution modeling comparison study for about 600 South American mammal species. Specifically, the
+This repository implements a species distribution modeling comparison study for about 600 South American mammal species. Specifically, the study compares different modeling approaches for predicting species distributions.

 ## Project Structure
 - **`R/`**: Contains all the R scripts organized by workflow steps.
+- **`renv/`**: Manages package dependencies for reproducibility.
 - **`Symobio_modeling.Rproj`**: RStudio project file for easy navigation.
 - **`README.md`**: High-level overview of the project.
-- **`renv/`**: Manages package dependencies for reproducibility.
+- **`occurrences.png`**: Visualization or reference image for occurrences data.
+- **`.Rprofile`**: Custom R environment settings.
 - **`renv.lock`**: Lockfile for `renv` to ensure consistent package versions.

 ## Workflow Overview
 The workflow is divided into several stages, each represented by scripts in the `R/` directory. Below is a summary of the key steps:

-### 1. Preparation of Geographic Data
-- **`01_01_range_map_preparation.R`**: Processes IUCN mammal range maps for South America, converting them to a standardized raster format with consistent projection for downstream analysis.
-- **`01_02_raster_preparation.R`**: Prepares environmental predictor rasters (e.g., climate, elevation, land cover) by cropping to South America extent, resampling to consistent resolution, and performing any necessary transformations.
-
-### 2. Preparation of complementary species-level data
+### 1. Data Preparation
+Pre-processing of species-specific and environmental information for model fitting and results analysis.
+- **`01_01_range_preparation.R`**: Process species range maps and calculate range dissimilarity.
+- **`01_02_traits_preparation.R`**: Prepare species trait data and calculate functional distances.
+- **`01_03_phylo_preparation.R`**: Process phylogenetic information and calculate phylogenetic distances.
+- **`01_04_raster_preparation.R`**: Prepare environmental raster layers for modeling for data extraction.
-- **`02_01_functional_group_assignment.R`**: Assigns mammal species to functional groups based on diet, locomotion, and body size characteristics, creating categorical variables for modeling.
-- **`02_02_functional_traits_preparation.R`**: Cleans and standardizes continuous trait data (body mass, diet breadth, etc.) for all study species, handling missing values through imputation where necessary.
-- **`02_03_phylo_preparation.R`**: Extracts phylogenetic information for target mammal species, computes phylogenetic distance matrices, and prepares the data for inclusion in models.

+### 2. Presence/Absence Data Processing
+Querying of presence data from Symobio DB, sampling of absence data and initial exploration of the dataset.
-### 3. Preparation of Presence/Absence Data
-- **`03_01_presence_preparation.R`**: Processes occurrence records from GBIF and other sources, applies spatial filtering to reduce sampling bias, and aligns taxonomic nomenclature.
-- **`03_02_absence_preparation.R`**: Generates pseudo-absence points using a stratified random approach, with constraints based on environmental conditions and range map boundaries.
-- **`03_03_dataset_exploration.R`**: Produces descriptive statistics and visualizations of presence/absence data, environmental variables, and species coverage to assess data quality.
-- **`03_04_model_data_finalization.R`**: Merges all prepared datasets (occurrences, absences, predictors) into final modeling datasets, splits data into training/testing sets, and applies any necessary scaling or transformations.
+- **`02_01_presence_data_preparation.R`**: Query species occurrence data from Symobio DB, extract environmental variables from raster files.
+- **`02_02_absence_data_preparation.R`**: Sample pseudo-absence points, extract environmental variables from raster files.
+- **`02_03_model_data_finalization.R`**: Create final dataset for modeling.
+- **`02_04_dataset_exploration.R`**: Explore and visualize the dataset.

-### 4. Modeling
-- **`04_01_modelling_ssdm.R`**: Implements traditional single-species distribution modeling (SSDM) approaches with selected algorithms (e.g., MaxEnt, random forests), including hyperparameter tuning.
-- **`04_02_modelling_msdm_embed.R`**: Develops multi-species distribution models using neural network approaches that embed species identities into a latent space, capturing inter-species relationships.
-- **`04_03_modelling_msdm_onehot.R`**: Implements multi-species distribution models using one-hot encoding for species identities, enabling joint prediction across all species simultaneously.
-- **`04_04_modelling_msdm_rf.R`**: Implements random forest-based multi-species distribution modeling, incorporating species identity as a predictor variable alongside environmental variables.
+### 3. Modeling
+Scripts for model fitting
-### 5. Analysis
-- **`05_01_performance_report.qmd`**: Generates comprehensive reports on model performance metrics (AUC, TSS, etc.) for all modeling approaches, with visualizations comparing performance across species and methods.
-- **`05_02_publication_analysis.qmd`**: Conducts advanced statistical analyses of model results, creates publication-quality figures, and summarizes findings for manuscript preparation.
+- **`03_01_modelling_ssdm.R`**: Fit single-species distribution models (SSDM) based on different algorithms.
+- **`03_02_modelling_msdm_embed.R`**: Fit multi-species distribution model (MSDM) based on Deep Neural Network with species embeddings.
+- **`03_03_modelling_msdm_onehot.R`**: Fit multi-species distribution model (MSDM) based on Deep Neural Network with species identity as factor.
+- **`03_04_modelling_msdm_rf.R`**: Fit multi-species distribution model (MSDM) based on Random Forest with species identity as factor.

-## Miscellaneous
-- **`utils.R`**: Contains utility functions used across multiple scripts, including data processing helpers, custom evaluation metrics, and visualization functions.
+### 4. Analysis and Reporting
+- **`04_01_performance_report.qmd`**: Generate an interactive performance evaluation of implemented SDM algorithms.
+- **`04_02_publication_analysis.R`**: Explore results in depth, analyse

 ### Miscellaneous
-- **`utils.R`**:
+- **`utils.R`**: Contains utility functions used across multiple scripts.
+- **`_publish.yml`**: Configuration for publishing reports and analyses.

 ## Getting Started

@@ -59,4 +60,4 @@ The workflow is divided into several stages, each represented by scripts in the
 ## Additional Notes
 - Ensure that all required input data (e.g., range maps, raster files) is available in the expected directories.
 - Outputs from each script are typically saved to disk and used as inputs for subsequent scripts.
-- Refer to the README.md file for any additional project-specific instructions.
+- Refer to the README.md file for any additional project-specific instructions.
\ No newline at end of file