A short description of the post.
Everything in a Bayesian model has a distribution: there are distributions for the slope coeffiecients, distributions for the residual variance. The model makes a distribution of lines that form the average, and can produce a distribution of predicted values.
When we measure residuals, therefore, these have distributions too. This is useful because it lets us transmit model uncertainty into these uncertain values.
let’s look at bill characteristics of the GEntoo penguin
library(palmerpenguins)
suppressPackageStartupMessages(library(tidyverse))
library(patchwork)
gentoo <- penguins |>
filter(species == "Gentoo") |>
filter(!is.na(bill_depth_mm) & !is.na(bill_length_mm))
depth_fig <- gentoo |>
ggplot(aes(x = body_mass_g, y = bill_depth_mm)) +
geom_count(alpha = 0.4)
length_fig <- gentoo |>
ggplot(aes(x = body_mass_g, y = bill_length_mm)) +
geom_count(alpha = .4)
depth_fig + length_fig
Not surprisingly, both change with size. let’s model this, with a simple gaussian bayesian model
library(brms)
len_mass <- brm(bill_length_mm ~ body_mass_g,
data = gentoo, refresh = 0,
backend = "cmdstanr",
file = here::here("_posts",
"2022-06-14-estimating-the-posterior-distribution-of-residuals",
"len_mass.rds"))
dep_mass <- brm(bill_depth_mm ~ body_mass_g,
data = gentoo,
refresh = 0,
backend = "cmdstanr",
file = here::here("_posts",
"2022-06-14-estimating-the-posterior-distribution-of-residuals",
"dep_mass.rds"))
The only essential things here are the model formulae. the arguments refresh = 0
means the chain progress is not printed to the screen. backend = cmdstanr
means use cmdstan
to run the model – the default Rstan
would work the same way.
let’s look at these modesl
len_mass
Family: gaussian
Links: mu = identity; sigma = identity
Formula: bill_length_mm ~ body_mass_g
Data: gentoo (Number of observations: 123)
Draws: 4 chains, each with iter = 1000; warmup = 0; thin = 1;
total post-warmup draws = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
Intercept 26.74 2.07 22.56 30.79 1.00 4547
body_mass_g 0.00 0.00 0.00 0.00 1.00 4579
Tail_ESS
Intercept 2901
body_mass_g 2946
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 2.32 0.15 2.06 2.63 1.00 1668 1809
Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
dep_mass
Family: gaussian
Links: mu = identity; sigma = identity
Formula: bill_depth_mm ~ body_mass_g
Data: gentoo (Number of observations: 123)
Draws: 4 chains, each with iter = 1000; warmup = 0; thin = 1;
total post-warmup draws = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
Intercept 7.88 0.65 6.59 9.15 1.00 3964
body_mass_g 0.00 0.00 0.00 0.00 1.00 4023
Tail_ESS
Intercept 2628
body_mass_g 2516
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 0.69 0.05 0.61 0.79 1.00 1306 1286
Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
let’s draw lines with them
library(modelr)
library(tidybayes)
gentoo_lines <- gentoo |>
data_grid(body_mass_g = seq_range(body_mass_g, n = 40)) |>
add_epred_rvars(len_mass, value = ".epred_len") |>
add_epred_rvars(dep_mass, value = ".epred_dep")
dep_epred_fig <- depth_fig +
stat_dist_lineribbon(aes(x = body_mass_g, dist = .epred_dep),
data = gentoo_lines, inherit.aes = FALSE) +
geom_count(pch = 21, col = "white", fill = "black", alpha = 1)
len_epred_fig <- length_fig +
stat_dist_lineribbon(aes(x = body_mass_g, dist = .epred_len),
data = gentoo_lines, inherit.aes = FALSE) +
geom_count(pch = 21, col = "white", fill = "black", alpha = 1)
dep_epred_fig + len_epred_fig
gentoo_len_resid <- gentoo |>
select(species, bill_length_mm, bill_depth_mm, body_mass_g) |>
add_residual_draws(len_mass, value = ".resid_len") |>
summarise(.resid_len_mean = mean (.resid_len),
.resid_len_sd = sd(.resid_len))
gentoo_dep_resid <- gentoo |>
select(species, bill_length_mm, bill_depth_mm, body_mass_g) |>
add_residual_draws(dep_mass, value = ".resid_dep") |>
summarise(.resid_dep_mean = mean (.resid_dep),
.resid_dep_sd = sd(.resid_dep))
We can visualize thse at the same time
gentoo_two_resid <- gentoo_len_resid |>
ungroup() |>
left_join(ungroup(gentoo_dep_resid),
by = c("species", "bill_length_mm",
"bill_depth_mm", "body_mass_g", ".row"))
gentoo_two_resid |>
ggplot(aes(x = .resid_len_mean, y = .resid_dep_mean,
xmin = .resid_len_mean - .resid_len_sd,
xmax = .resid_len_mean + .resid_len_sd)) +
geom_pointrange() +
geom_linerange(aes(x = .resid_len_mean, y = .resid_dep_mean,
ymin = .resid_dep_mean - .resid_dep_sd,
ymax = .resid_dep_mean + .resid_dep_sd))
Before moving on, lets demonstrate that the posterior is well-described by a normal distribution
one_resid <- gentoo[5,] |>
select(species, bill_length_mm, bill_depth_mm, body_mass_g) |>
add_residual_draws(dep_mass, value = ".resid_dep")
mean_draws <- mean(one_resid$.resid_dep)
sd_draws <- sd(one_resid$.resid_dep)
one_resid |>
ggplot(aes(x = .resid_dep)) +
geom_histogram(aes(y = ..density..)) +
stat_function(fun = function(x) dnorm(x, mean = mean_draws, sd = sd_draws))