<- rnorm(1560, mean = 14, sd = 2.5)
some_numbers hist(some_numbers)
This is not a review or a proof of the Probability Integral Transform. A grad student asked me “what is this and why does it work” and I wanted to explore it with them.
Among the many wonderful plots that you can make with bayesplot
, you will find one called ppc_loo_pit_overlay
. You can read a lot more about it in this fabulous paper on Visualisation in Bayesian Workflow (Gabry et al. 2019)
So what is it, and how does it work?
Exploring with simulation
Take random numbers from a distribution
Then run them through that distribution’s CDF
<- pnorm(some_numbers, mean = 14, sd = 2.5)
some_pit hist(some_pit)
Sure enough we get a uniform shape!
What happens when you are wrong
let’s make some curves that don’t really match
library(tidyverse)
<- 4000
n tibble(meanval = seq(from = 1, to = 14, length.out = 6),
sd = 2.5) |>
expand_grid(x = seq(from = 0, to = 18, length.out = 30)) |>
mutate(normal_dist = dnorm(x, mean = meanval, sd = sd),
gamma_dist = dgamma(x,
shape = meanval^2/sd^2,
rate = meanval/sd^2)) |>
pivot_longer(ends_with("dist"),
names_to = "distribution",
values_to = "value") |>
ggplot(aes(x = x, y = value, colour = distribution)) +
geom_line() +
facet_wrap(~meanval)
We can see that the fit gets worse as the mean drops
let’s simulate data from the gamma and use the PIT assuming instead it is normal:
<- 4000
n tibble(meanval = seq(from = 1, to = 14, length.out = 6),
sd = 2.5) |>
rowwise() |>
mutate(normal_dist = list(rnorm(n, mean = meanval, sd = sd)),
gamma_dist = list(rgamma(n,
shape = meanval^2/sd^2,
rate = meanval/sd^2))) |>
pivot_longer(ends_with("dist"),
names_to = "distribution",
values_to = "samples") |>
rowwise() |>
mutate(pit_samples = list(pnorm(samples, mean = meanval, sd = sd))) |>
select(-samples) |>
# filter(distribution == "gamma_dist") |>
unnest(pit_samples) |>
ggplot(aes(x = pit_samples)) +
geom_histogram() +
facet_grid(distribution~meanval)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
let’s try it with just the gamma, but changing both moments and always using the normal:
<- 4000
n expand_grid(meanval = seq(from = 1, to = 14, length.out = 6),
sdval = seq(from = .2, to = 7, length.out = 4)) |>
rowwise() |>
mutate(gamma_dist = list(rgamma(n,
shape = meanval^2/sdval^2,
rate = meanval/sdval^2))) |>
rowwise() |>
mutate(pit_samples = list(
pnorm(gamma_dist,
mean = meanval,
sd = sdval))) |>
select(-gamma_dist) |>
# filter(distribution == "gamma_dist") |>
unnest(pit_samples) |>
ggplot(aes(x = pit_samples)) +
geom_histogram() +
facet_grid(sdval~meanval)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
and with the lognormal
<- 4000
n expand_grid(meanval = seq(from = 1,
to = 14,
length.out = 6),
sdval = seq(from = .2,
to = 7,
length.out = 4)) |>
rowwise() |>
mutate(
cf = log(sdval/meanval)^2 + 1,
lnorm_dist = list(rlnorm(n,
meanlog = log(meanval) - .5*cf,
sdlog = sqrt(cf))
)|>
)rowwise() |>
mutate(pit_samples = list(
pnorm(lnorm_dist,
mean = meanval,
sd = sdval)
# plnorm(lnorm_dist,
# meanlog = log(meanval) - .5*cf,
# sdlog = sqrt(cf))
|>
)) select(-lnorm_dist) |>
# filter(distribution == "gamma_dist") |>
unnest(pit_samples) |>
ggplot(aes(x = pit_samples)) +
geom_histogram() +
facet_grid(sdval~meanval)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
What did we learn here
The PIT is one of an arsenal of diagnostic tools. The idea here is to run data “backwards” through the distribution we’ve chosen. If the distribution we chose is something like realize, the result should look kind of flat. If we are very far wrong, it won’t be flat. So this can serve as a quick goodness-of-fit test for your response distribution