Skip to content
Snippets Groups Projects
Code owners
Assign users and groups as approvers for specific file changes. Learn more.
05_performance_analysis.qmd 19.13 KiB
---
title: "sSDM Performance analysis"
format: html
editor: visual
engine: knitr
---

```{r init, echo = FALSE, include = FALSE}
library(tidyverse)
library(sf)
library(plotly)
library(DT)


load("../data/r_objects/ssdm_results.RData")
load("../data/r_objects/msdm_fit.RData")
load("../data/r_objects/msdm_results.RData")
load("../data/r_objects/msdm_results_embedding_trained.RData.")
load("../data/r_objects/msdm_results_multiclass.RData.")
load("../data/r_objects/range_maps.RData")
load("../data/r_objects/range_maps_gridded.RData")
load("../data/r_objects/occs_final.RData")
load("../data/r_objects/functional_groups.RData")
sf::sf_use_s2(use_s2 = FALSE)
```

```{r globals, echo = FALSE, include = FALSE}
# Select metrics
focal_metrics = c("auc", "f1", "accuracy")  # There's a weird bug in plotly that scrambles up lines when using more than three groups

# Dropdown options
plotly_buttons = list()
for(metric in focal_metrics){
  plotly_buttons[[length(plotly_buttons) + 1]] = list(method = "restyle", args = list("transforms[0].value", metric), label = metric)
}

# Regression functions
asym_regression = function(x, y){
  nls_fit = nls(y ~ 1 - (1-b) * exp(-c * log(x)), start = list(b = 0.1, c = 0.1))
  new_x = exp(seq(log(min(x)), log(max(x)), length.out = 100))
  data.frame(
    x = new_x,
    fit = predict(nls_fit, newdata = data.frame(x = new_x))
  )
}

lin_regression = function(x, y){
  glm_fit = suppressWarnings(glm(y~x, family = "binomial"))
  new_x = seq(min(x), max(x), length.out = 100)
  data.frame(
    x = new_x,
    fit = predict(glm_fit, newdata = data.frame(x = new_x), type = "response")
  )
}

# Performance table
performance = bind_rows(ssdm_results, msdm_results, msdm_results_embedding_trained, msdm_results_multiclass) %>% 
  pivot_longer(c(auc, accuracy, kappa, precision, recall, f1), names_to = "metric") %>% 
  dplyr::filter(!is.na(value)) %>% 
  dplyr::mutate(
    metric = factor(metric, levels = c("auc", "kappa", "f1", "accuracy", "precision", "recall")),
    value = round(pmax(value, 0, na.rm = T), 3) # Fix one weird instance of f1 < 0
  ) %>% 
  dplyr::filter(metric %in% focal_metrics)
```

## Summary

This document summarizes the performance of different sSDM and mMSDM algorithms for `r I(length(unique(performance$species)))` South American mammal species. Model performance is evaluated on `r I(xfun::numbers_to_words(length(focal_metrics)))` metrics (`r I(paste(focal_metrics, collapse = ', '))`) and analyzed along five potential influence factors (number of records, range size, range coverage, range coverage bias, and functional group). The comparison of sSDM vs mSDM approaches is of particular interest.
Code can be found on [GitLab](https://git.idiv.de/ye87zine/symobio-modeling).

### Modeling overview:

#### General decisions

-   Randomly sampled pseudo-absences from expanded area of extent of occurrence records (×1.25)
-   Balanced presences and absences for each species
-   Predictors: all 19 CHELSA bioclim variables
-   70/30 Split of training vs. test data (except for NN models)

#### sSDM Algorithms

Random Forest (**SSDM_RF**)

-   Hyperparameter tuning of `mtry`
-   Spatial block cross-validation during training

Generalized boosted machine (**SSDM_GBM**)

-   Hyperparameter tuning across `n.trees` , `interaction.depth` , `shrinkage`, `n.minobsinnode`
-   Spatial block cross-validation during training

Generalized Linear Model (**SSDM_GLM**)

-   Logistic model with binomial link function
-   Spatial block cross-validation during training

Neural Netwok (**SSDM_NN**)

-   Three hidden layers, leaky ReLu activations, binomial loss
-   no spatial block cross-validation during training

#### mSDM Algorithms

Binary Neural Network with species embedding (**MSDM_embed**)

-   definition: presence \~ environment + embedding(species)
-   prediction: probability of occurrence given a set of (environmental) inputs and species identity
-   embedding initialized at random
-   three hidden layers, sigmoid + leaky ReLu activations, binomial loss

Binary Neural Network with trait-informed species embedding (**MSDM_embed_traits**)

-   definition: presence \~ environment + embedding(species)
-   prediction: probability of occurrence given a set of (environmental) inputs and species identity
-   embedding initialized using eigenvectors of functional distance matrix
-   three hidden layers, sigmoid + leaky ReLu activations, binomial loss

Multi-Class Neural Network (**MSDM_multiclass**)

-   definition: species identity \~ environment
-   prediction: probability distribution across all observed species given a set of (environmental) inputs
-   presence-only data in training
-   three hidden layers, leaky ReLu activations, softmax loss
-   Top-k based evaluation (k=10, P/A \~ target species in / not among top 10 predictions)

### Key findings:

-   sSDM algorithms (RF, GBM) outperformed mSDMs in most cases
-   mSDMs showed indications of better performance for rare species (\< 10-20 occurrences)
-   More occurrence records and larger range sizes tended to improve model performance
-   Higher range coverage correlated with better performance
-   Range coverage bias and functional group showed some impact but were less consistent
-   Convergence problems hampered NN sSDM performance

## Analysis

The table below shows the analysed modeling results.
```{r performance, echo = FALSE, message=FALSE, warnings=FALSE}
DT::datatable(performance)
```

### Number of records

-   Model performance was generally better for species with more observations
-   Very poor performance below 50-100 observations

```{r number_of_records, echo = FALSE, message=FALSE, warnings=FALSE}
df_plot = performance

# Calculate regression lines for each model and metric combination
suppressWarnings({
  regression_lines = df_plot %>%
    group_by(model, metric) %>%
    group_modify(~asym_regression(.x$obs, .x$value))
})

# Create base plot
plot <- plot_ly() %>% 
  layout(
    title = "Model Performance vs. Number of observations",
    xaxis = list(title = "Number of observations", type = "log"),
    yaxis = list(title = "Value"),
    legend = list(x = 1.1, y = 0.5),  # Move legend to the right of the plot
    margin = list(r = 150),  # Add right margin to accommodate legend
    hovermode = 'closest',
    updatemenus = list(
      list(
        type = "dropdown",
        active = 0,
        buttons = plotly_buttons
      )
    )
  )

# Points
for (model_name in unique(df_plot$model)) {
  plot = plot %>%
    add_markers(
      data = filter(df_plot, model == model_name),
      x = ~obs,
      y = ~value,
      color = model_name,  # Set color to match legendgroup
      legendgroup = model_name,
      opacity = 0.6,
      name = ~model,
      hoverinfo = 'text',
      text = ~paste("Species:", species, "<br>Observations:", obs, "<br>Value:", round(value, 3)),
      transforms = list(
        list(
          type = 'filter',
          target = ~metric,
          operation = '=',
          value = focal_metrics[1]
        )
      )
    )
}

# Add regression lines
for(model_name in unique(df_plot$model)){
  reg_data = dplyr::filter(regression_lines, model == model_name)
  plot = plot %>% 
    add_lines(
      data = reg_data,
      x = ~x,
      y = ~fit,
      color = model_name,  # Set color to match legendgroup
      legendgroup = model_name,
      name = paste(model_name, '(fit)'),
      showlegend = FALSE,
      transforms = list(
        list(
          type = 'filter',
          target = ~metric,
          operation = '=',
          value = focal_metrics[1]
        )
      )
    )
}

bslib::card(plot, full_screen = T)
```

### Range characteristics

#### Range size

Range size was calculated based on polygon layers from the IUCN Red List of Threatened Species (2016).

-   Model performance tended to be slightly higher for species with larger range size
-   Only RF shows continuous performance improvements beyond range sizes of \~5M km²

```{r range_size, echo = FALSE, message=FALSE, warnings=FALSE}

df_join = range_maps %>% 
  dplyr::mutate(range_size = as.numeric(st_area(range_maps) / 1000000)) %>%  # range in sqkm
  sf::st_drop_geometry()

df_plot = performance %>% 
  inner_join(df_join, by = c("species" = "name_matched"))

# Calculate regression lines for each model and metric combination
suppressWarnings({
  regression_lines = df_plot %>%
    group_by(model, metric) %>%
    group_modify(~asym_regression(.x$range_size, .x$value))
})

# Create base plot
plot <- plot_ly() %>% 
  layout(
    title = "Model Performance vs. Range size",
    xaxis = list(title = "Range size"),
    yaxis = list(title = "Value"),
    legend = list(x = 1.1, y = 0.5),  # Move legend to the right of the plot
    margin = list(r = 150),  # Add right margin to accommodate legend
    hovermode = 'closest',
    updatemenus = list(
      list(
        type = "dropdown",
        active = 0,
        buttons = plotly_buttons
      )
    )
  )

# Add regression lines and confidence intervals for each model
for (model_name in unique(df_plot$model)) {
  plot <- plot %>%
    add_markers(
      data = filter(df_plot, model == model_name),
      x = ~range_size,
      y = ~value,
      color = model_name,  # Set color to match legendgroup
      legendgroup = model_name,
      opacity = 0.6,
      name = ~model,
      hoverinfo = 'text',
      text = ~paste("Species:", species, "<br>Range size:", range_size, "<br>Value:", round(value, 3)),
      transforms = list(
        list(
          type = 'filter',
          target = ~metric,
          operation = '=',
          value = focal_metrics[1]
        )
      )
    )
}

for (model_name in unique(df_plot$model)) {
  reg_data = dplyr::filter(regression_lines, model == model_name)
  plot = plot %>% 
    add_lines(
      data = reg_data,
      x = ~x,
      y = ~fit,
      color = model_name,  # Set color to match legendgroup
      legendgroup = model_name,
      name = paste(model_name, '(fit)'),
      showlegend = FALSE,
      transforms = list(
        list(
          type = 'filter',
          target = ~metric,
          operation = '=',
          value = focal_metrics[1]
        )
      )
    )
}

bslib::card(plot, full_screen = T)
```

#### Range coverage

Species ranges were split into continuous hexagonal grid cells of 1 degree diameter. Range coverage was then calculated as the number of grid cells containing at least one occurrence record divided by the number of total grid cells.

$$
RangeCoverage = \frac{N_{cells\_occ}}{N_{cells\_total}}
$$

-   Models for species with higher range coverage showed slightly better performance

```{r range_coverage, echo = FALSE, message=FALSE, warnings=FALSE}
df_cells_total = range_maps_gridded %>%
  dplyr::rename("species" = name_matched) %>% 
  group_by(species) %>%
  summarise(cells_total = n()) %>% 
  st_drop_geometry()

df_cells_occ <- range_maps_gridded %>%
  st_join(occs_final, join = st_intersects) %>%
  filter(name_matched == species) %>%         # Filter only intersections of the same species
  group_by(species) %>%
  summarise(cells_occupied = n_distinct(geometry)) %>%
  st_drop_geometry()

df_join = df_cells_total %>% 
  dplyr::inner_join(df_cells_occ, by = "species") %>% 
  dplyr::mutate(range_cov = cells_occupied / cells_total) %>% 
  dplyr::select(species, range_cov)

df_plot = performance %>% 
  inner_join(df_join, by = "species")

# Calculate regression lines for each model and metric combination
suppressWarnings({
  regression_lines = df_plot %>%
    group_by(model, metric) %>%
    group_modify(~lin_regression(.x$range_cov, .x$value))
})

# Create base plot
plot <- plot_ly() %>% 
  layout(
    title = "Model Performance vs. Range coverage",
    xaxis = list(title = "Range coverage"),
    yaxis = list(title = "Value"),
    legend = list(x = 1.1, y = 0.5),  # Move legend to the right of the plot
    margin = list(r = 150),  # Add right margin to accommodate legend
    hovermode = 'closest',
    updatemenus = list(
      list(
        type = "dropdown",
        active = 0,
        buttons = plotly_buttons
      )
    )
  )

# Add regression lines and confidence intervals for each model
for (model_name in unique(df_plot$model)) {
  plot <- plot %>%
    add_markers(
      data = filter(df_plot, model == model_name),
      x = ~range_cov,
      y = ~value,
      color = model_name,  # Set color to match legendgroup
      legendgroup = model_name,
      opacity = 0.6,
      name = ~model,
      hoverinfo = 'text',
      text = ~paste("Species:", species, "<br>Range coverage:", range_cov, "<br>Value:", round(value, 3)),
      transforms = list(
        list(
          type = 'filter',
          target = ~metric,
          operation = '=',
          value = focal_metrics[1]
        )
      )
    )
}

for (model_name in unique(df_plot$model)) {
  reg_data = dplyr::filter(regression_lines, model == model_name)
  plot = plot %>% 
    add_lines(
      data = reg_data,
      x = ~x,
      y = ~fit,
      color = model_name,  # Set color to match legendgroup
      legendgroup = model_name,
      name = paste(model_name, '(fit)'),
      showlegend = FALSE,
      transforms = list(
        list(
          type = 'filter',
          target = ~metric,
          operation = '=',
          value = focal_metrics[1]
        )
      )
    )
}

bslib::card(plot, full_screen = T)
```

#### Range coverage bias

Range coverage bias was calculated as 1 minus the ratio of the actual range coverage and the hypothetical range coverage if all observations were maximally spread out across the range.

$$
RangeCoverageBias = 1 - \frac{RangeCoverage}{min({N_{obs\_total}} / {N_{cells\_total}}, 1)}
$$

Higher bias values indicate that occurrence records are spatially more clustered within the range of the species.

-   There was no strong relationship between range coverage bias and model performance

```{r range_coverage_bias, echo = FALSE, message=FALSE, warnings=FALSE}
df_occs_total = occs_final %>% 
  st_drop_geometry() %>% 
  group_by(species) %>% 
  summarise(occs_total = n())

df_join = df_occs_total %>% 
  dplyr::inner_join(df_cells_total, by = "species") %>% 
  dplyr::inner_join(df_cells_occ, by = "species") %>% 
  dplyr::mutate(range_bias = 1-((cells_occupied / cells_total) / pmin(occs_total / cells_total, 1)))

df_plot = performance %>% 
  inner_join(df_join, by = "species")

# Calculate regression lines for each model and metric combination
suppressWarnings({
  regression_lines = df_plot %>%
    group_by(model, metric) %>%
    group_modify(~lin_regression(.x$range_bias, .x$value))
})

# Create base plot
plot <- plot_ly() %>% 
  layout(
    title = "Model Performance vs. Range coverage bias",
    xaxis = list(title = "Range coverage bias"),
    yaxis = list(title = "Value"),
    legend = list(x = 1.1, y = 0.5),  # Move legend to the right of the plot
    margin = list(r = 150),  # Add right margin to accommodate legend
    hovermode = 'closest',
    updatemenus = list(
      list(
        type = "dropdown",
        active = 0,
        buttons = plotly_buttons
      )
    )
  )

# Add regression lines and confidence intervals for each model
for (model_name in unique(df_plot$model)) {
  plot <- plot %>%
    add_markers(
      data = filter(df_plot, model == model_name),
      x = ~range_bias,
      y = ~value,
      color = model_name,  # Set color to match legendgroup
      legendgroup = model_name,
      opacity = 0.6,
      name = ~model,
      hoverinfo = 'text',
      text = ~paste("Species:", species, "<br>Range coverage bias:", range_bias, "<br>Value:", round(value, 3)),
      transforms = list(
        list(
          type = 'filter',
          target = ~metric,
          operation = '=',
          value = focal_metrics[1]
        )
      )
    )
}

for (model_name in unique(df_plot$model)) {
  reg_data = dplyr::filter(regression_lines, model == model_name)
  plot = plot %>% 
    add_lines(
      data = reg_data,
      x = ~x,
      y = ~fit,
      color = model_name,  # Set color to match legendgroup
      legendgroup = model_name,
      name = paste(model_name, '(fit)'),
      showlegend = FALSE,
      transforms = list(
        list(
          type = 'filter',
          target = ~metric,
          operation = '=',
          value = focal_metrics[1]
        )
      )
    )
}


bslib::card(plot, full_screen = T)
```

### Functional group

Functional groups were assigned based on taxonomic order. The following groupings were used:

| Functional group      | Taxomic orders                                                        |
|-----------------------|-----------------------------------------------------------------------|
| large ground-dwelling | Carnivora, Artiodactyla, Cingulata, Perissodactyla                    |
| small ground-dwelling | Rodentia, Didelphimorphia, Soricomorpha, Paucituberculata, Lagomorpha |
| arboreal              | Primates, Pilosa                                                      |
| flying                | Chiroptera                                                            |

-   Models for bats tended to perform slightly worse than for other groups.

```{r functional_groups, echo = FALSE, message=FALSE, warnings=FALSE}
df_plot = performance %>% 
  dplyr::left_join(functional_groups, by = c("species" = "name_matched"))

plot <- plot_ly(
  data = df_plot,
  x = ~functional_group,
  y = ~value,
  color = ~model,
  type = 'box',
  boxpoints = "all",
  jitter = 1,
  pointpos = 0,
  hoverinfo = 'text',
  text = ~paste("Species:", species, "<br>Functional group:", functional_group, "<br>Value:", round(value, 3)),
  transforms = list(
    list(
      type = 'filter',
      target = ~metric,
      operation = '=',
      value = focal_metrics[1]  # default value
    )
  )
)

plot <- plot %>%
  layout(
    title = "Model Performance vs. Functional Group",
    xaxis = list(title = "Functional group"),
    yaxis = list(title = "Value"),
    legend = list(x = 1.1, y = 0.5),  # Move legend to the right of the plot
    margin = list(r = 150),  # Add right margin to accommodate legend
    hovermode = 'closest',
    boxmode = "group",
    updatemenus = list(
      list(
        type = "dropdown",
        active = 0,
        buttons = plotly_buttons
      )
    )
  )

bslib::card(plot, full_screen = T)
```

### Relative performance

The table below summarizes the relative performance of the models across different observation frequency ranges. The `rank` column indicates the model's performance rank compared to all other models for a given combination of model and metric. The subsequent columns, `(1,10]`, `(10,25]`, ..., `(5000, Inf]`, represent bins of observation frequency. The values in these columns show how many times the model's performance was ranked at the specified `rank` within the respective frequency range.

```{r msdm_vs_ssdm, echo = FALSE, message=FALSE, warnings=FALSE}
freq_thresholds = c(1, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, Inf)

df_print = performance %>%
  mutate(freq_class = cut(obs, freq_thresholds, dig.lab = 5)) %>%
  group_by(species, metric, freq_class) %>%
  dplyr::mutate(rank = order(value, decreasing = T)) %>% 
  group_by(model, metric, rank, freq_class) %>% 
  tally() %>% 
  pivot_wider(names_from = freq_class, values_from = n) %>% 
  dplyr::select("model", "metric", "rank", "(1,10]", "(10,25]", 
                "(25,50]", "(50,100]", "(100,250]", "(250,500]", 
                "(500,1000]", "(1000,2500]", "(2500,5000]", "(5000,Inf]") %>% 
  dplyr::arrange(metric, rank, model) %>% 
  replace(is.na(.), 0)

DT::datatable(df_print)
```