Simulation of Ecological Data: Estimating the posterior distribution of residuals

Everything in a Bayesian model has a distribution: there are distributions for the slope coeffiecients, distributions for the residual variance. The model makes a distribution of lines that form the average, and can produce a distribution of predicted values.

When we measure residuals, therefore, these have distributions too. This is useful because it lets us transmit model uncertainty into these uncertain values.

let’s look at bill characteristics of the GEntoo penguin

library(palmerpenguins)
suppressPackageStartupMessages(library(tidyverse))
library(patchwork)

gentoo <- penguins |> 
  filter(species == "Gentoo") |> 
  filter(!is.na(bill_depth_mm) & !is.na(bill_length_mm))

depth_fig <- gentoo |> 
  ggplot(aes(x = body_mass_g, y = bill_depth_mm)) + 
  geom_count(alpha = 0.4)

length_fig <- gentoo |> 
  ggplot(aes(x = body_mass_g, y = bill_length_mm)) + 
  geom_count(alpha = .4)

depth_fig + length_fig

Not surprisingly, both change with size. let’s model this, with a simple gaussian bayesian model

library(brms)

len_mass <- brm(bill_length_mm ~ body_mass_g, 
                data = gentoo, refresh = 0,
                backend = "cmdstanr",
                file = here::here("_posts",
                                  "2022-06-14-estimating-the-posterior-distribution-of-residuals",
                                  "len_mass.rds")) 


dep_mass <- brm(bill_depth_mm ~ body_mass_g,
                data = gentoo,
                refresh = 0,
                backend = "cmdstanr",
                file = here::here("_posts",
                                  "2022-06-14-estimating-the-posterior-distribution-of-residuals",
                                  "dep_mass.rds"))

The only essential things here are the model formulae. the arguments refresh = 0 means the chain progress is not printed to the screen. backend = cmdstanr means use cmdstan to run the model – the default Rstan would work the same way.

let’s look at these modesl

len_mass

 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: bill_length_mm ~ body_mass_g 
   Data: gentoo (Number of observations: 123) 
  Draws: 4 chains, each with iter = 1000; warmup = 0; thin = 1;
         total post-warmup draws = 4000

Population-Level Effects: 
            Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
Intercept      26.74      2.07    22.56    30.79 1.00     4547
body_mass_g     0.00      0.00     0.00     0.00 1.00     4579
            Tail_ESS
Intercept       2901
body_mass_g     2946

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     2.32      0.15     2.06     2.63 1.00     1668     1809

Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

dep_mass

 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: bill_depth_mm ~ body_mass_g 
   Data: gentoo (Number of observations: 123) 
  Draws: 4 chains, each with iter = 1000; warmup = 0; thin = 1;
         total post-warmup draws = 4000

Population-Level Effects: 
            Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
Intercept       7.88      0.65     6.59     9.15 1.00     3964
body_mass_g     0.00      0.00     0.00     0.00 1.00     4023
            Tail_ESS
Intercept       2628
body_mass_g     2516

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.69      0.05     0.61     0.79 1.00     1306     1286

Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

let’s draw lines with them

library(modelr)
library(tidybayes)

gentoo_lines <- gentoo |> 
  data_grid(body_mass_g = seq_range(body_mass_g, n = 40)) |> 
  add_epred_rvars(len_mass, value = ".epred_len") |> 
  add_epred_rvars(dep_mass, value = ".epred_dep")


dep_epred_fig <- depth_fig + 
  stat_dist_lineribbon(aes(x = body_mass_g, dist = .epred_dep), 
                       data = gentoo_lines, inherit.aes = FALSE) + 
  geom_count(pch = 21, col = "white", fill = "black", alpha = 1)

len_epred_fig <- length_fig + 
  stat_dist_lineribbon(aes(x = body_mass_g, dist = .epred_len), 
                       data = gentoo_lines, inherit.aes = FALSE) + 
  geom_count(pch = 21, col = "white", fill = "black", alpha = 1)

dep_epred_fig + len_epred_fig

Extract residuals

gentoo_len_resid <- gentoo |> 
  select(species, bill_length_mm, bill_depth_mm, body_mass_g) |> 
  add_residual_draws(len_mass, value = ".resid_len") |> 
  summarise(.resid_len_mean = mean (.resid_len), 
            .resid_len_sd = sd(.resid_len))

gentoo_dep_resid <- gentoo |> 
  select(species, bill_length_mm, bill_depth_mm, body_mass_g) |> 
  add_residual_draws(dep_mass, value = ".resid_dep") |> 
  summarise(.resid_dep_mean = mean (.resid_dep), 
            .resid_dep_sd    =   sd(.resid_dep))

We can visualize thse at the same time

gentoo_two_resid <- gentoo_len_resid |> 
  ungroup() |> 
  left_join(ungroup(gentoo_dep_resid), 
            by = c("species", "bill_length_mm", 
                   "bill_depth_mm", "body_mass_g", ".row"))

gentoo_two_resid |> 
  ggplot(aes(x = .resid_len_mean, y = .resid_dep_mean,
             xmin = .resid_len_mean - .resid_len_sd,
             xmax = .resid_len_mean + .resid_len_sd)) + 
  geom_pointrange() + 
  geom_linerange(aes(x = .resid_len_mean, y = .resid_dep_mean,
                     ymin = .resid_dep_mean - .resid_dep_sd,
                     ymax = .resid_dep_mean + .resid_dep_sd))

Before moving on, lets demonstrate that the posterior is well-described by a normal distribution

one_resid <-  gentoo[5,] |> 
  select(species, bill_length_mm, bill_depth_mm, body_mass_g) |> 
  add_residual_draws(dep_mass, value = ".resid_dep")

mean_draws <-  mean(one_resid$.resid_dep)
sd_draws <- sd(one_resid$.resid_dep)
one_resid |> 
  ggplot(aes(x = .resid_dep)) + 
  geom_histogram(aes(y = ..density..)) + 
  stat_function(fun = function(x) dnorm(x, mean = mean_draws, sd = sd_draws))

error-in-variables regression

error_in_var_regression <- brm(.resid_dep_mean | mi(.resid_dep_sd) ~ me(.resid_len_mean, .resid_len_sd),
                               data = gentoo_two_resid,
                               backend = "cmdstanr", file = here::here("_posts",
                                  "2022-06-14-estimating-the-posterior-distribution-of-residuals",
                                  "error_in_var_regression.rds"))