# Lecture 10 - Poisson regression
Lecture 10 - Generalized linear models
Example 1 - Poisson regression - Student awards
1 Packages
library(tidyverse); theme_set(theme_bw() + theme(text = element_text(size = 16)))
2 Data
The data is taken from https://stats.idre.ucla.edu/r/dae/poisson-regression/
2.1 Read csv-file
Read the comma-separated (csv) file:
award <- read.csv("Data/poisson_sim.csv")
glimpse(award)
Rows: 200
Columns: 4
$ id <int> 45, 108, 15, 67, 153, 51, 164, 133, 2, 53, 1, 128, 16, 106,…
$ num_awards <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ prog <int> 3, 1, 3, 3, 3, 1, 3, 3, 3, 3, 3, 2, 3, 3, 3, 1, 1, 3, 2, 3,…
$ math <int> 41, 41, 44, 42, 40, 42, 46, 40, 33, 46, 40, 38, 44, 37, 40,…
summary(award)
       id          num_awards        prog            math
Min. : 1.00 Min. :0.00 Min. :1.000 Min. :33.00
1st Qu.: 50.75 1st Qu.:0.00 1st Qu.:2.000 1st Qu.:45.00
Median :100.50 Median :0.00 Median :2.000 Median :52.00
Mean :100.50 Mean :0.63 Mean :2.025 Mean :52.65
3rd Qu.:150.25 3rd Qu.:1.00 3rd Qu.:2.250 3rd Qu.:59.00
Max. :200.00 Max. :6.00 Max. :3.000 Max. :75.00
2.2 Factor labels
Give the programs their proper factor labels instead of numbers; see the description in the link above.
award <- mutate(award,
prog = factor(prog, levels = c(1, 2, 3),
labels = c("General", "Academic", "Vocational")))
glimpse(award)
Rows: 200
Columns: 4
$ id <int> 45, 108, 15, 67, 153, 51, 164, 133, 2, 53, 1, 128, 16, 106,…
$ num_awards <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ prog <fct> Vocational, General, Vocational, Vocational, Vocational, Ge…
$ math <int> 41, 41, 44, 42, 40, 42, 46, 40, 33, 46, 40, 38, 44, 37, 40,…
2.3 Frequency table of the number of awards
Make a frequency table of the number of awards:
award_table <- award |> count(num_awards)
award_table
  num_awards   n
1 0 124
2 1 49
3 2 13
4 3 9
5 4 2
6 5 2
7 6 1
Add the observed proportions of the different numbers of awards:
award_table <- award_table |> mutate(observed = n/sum(n))
award_table
  num_awards   n observed
1 0 124 0.620
2 1 49 0.245
3 2 13 0.065
4 3 9 0.045
5 4 2 0.010
6 5 2 0.010
7 6 1 0.005
2.4 Average number of awards
Compare the size of the average and the variance.
mu <- mean(award$num_awards)
mu
[1] 0.63
var(award$num_awards)
[1] 1.108643
Note that the variance is larger than the mean. Some of this extra variability might be explained by the covariates.
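A quick numeric check of the Poisson assumption \(\text{Var}(Y) = \text{E}(Y)\), using the values computed above (a sketch; the ratio would be close to 1 for a Poisson sample):

```r
# Dispersion ratio: variance / mean of the raw counts.
# A Poisson variable has ratio 1; values well above 1 suggest overdispersion.
var(award$num_awards) / mean(award$num_awards)
# about 1.76, i.e. clearly more variability than a pure Poisson would give
```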
2.5 Expected frequencies
Calculate the expected proportions according to a Poisson distribution with mean (and variance) \(\mu = 0.63\). The probability mass function is given by the dpois() function.
award_table <- award_table |>
mutate(expected = dpois(num_awards, mu))
award_table
  num_awards   n observed     expected
1 0 124 0.620 5.325918e-01
2 1 49 0.245 3.355328e-01
3 2 13 0.065 1.056928e-01
4 3 9 0.045 2.219550e-02
5 4 2 0.010 3.495791e-03
6 5 2 0.010 4.404696e-04
7 6 1 0.005 4.624931e-05
Plot both the observed and expected proportions in the same bar chart.
The +0.2 and -0.2 offsets, together with width = 0.4, are an easy way of getting the bars side by side.
ggplot(award_table) +
geom_col(aes(x = num_awards + 0.2, y = observed, fill = "Observed"),
width = 0.4) +
geom_col(aes(x = num_awards - 0.2, y = expected, fill = "Expected"),
width = 0.4) +
labs(y = "Probability", x = "number of awards", fill = "",
       title = "Distribution of awards: Poisson?")
A more advanced way is to rearrange the data from wide format to long format, where all the proportions are collected in one variable while a new variable indicates which original variable they came from. This is done with the pivot_longer() function from the tidyr package, which is included in the tidyverse package.
award_long <- award_table |> pivot_longer(
cols = c("observed", "expected"),
values_to = "Proportion",
names_to = "variable")
award_long
# A tibble: 14 × 4
num_awards n variable Proportion
<int> <int> <chr> <dbl>
1 0 124 observed 0.62
2 0 124 expected 0.533
3 1 49 observed 0.245
4 1 49 expected 0.336
5 2 13 observed 0.065
6 2 13 expected 0.106
7 3 9 observed 0.045
8 3 9 expected 0.0222
9 4 2 observed 0.01
10 4 2 expected 0.00350
11 5 2 observed 0.01
12 5 2 expected 0.000440
13 6 1 observed 0.005
14 6 1 expected 0.0000462
We now use position = "dodge" in order to get the bars side by side.
ggplot(award_long, aes(x = num_awards, y = Proportion, fill = variable)) +
geom_col(position = "dodge") +
labs(y = "Probability", x = "number of awards", fill = "",
title = "Distribution of awards: Poisson?") +
  theme(text = element_text(size = 14))
3 Number of awards by program
3.1 Table by program
award |> count(prog, num_awards)
         prog num_awards  n
1 General 0 36
2 General 1 9
3 Academic 0 48
4 Academic 1 32
5 Academic 2 11
6 Academic 3 9
7 Academic 4 2
8 Academic 5 2
9 Academic 6 1
10 Vocational 0 40
11 Vocational 1 8
12 Vocational 2 2
We can use pivot_wider() from tidyr to rearrange this. The values_fill = 0 replaces any missing combination with the value 0.
award |> count(prog, num_awards) |> pivot_wider(
  id_cols = num_awards, names_from = prog, values_from = n, values_fill = 0)
# A tibble: 7 × 4
num_awards General Academic Vocational
<int> <int> <int> <int>
1 0 36 48 40
2 1 9 32 8
3 2 0 11 2
4 3 0 9 0
5 4 0 2 0
6 5 0 2 0
7 6 0 1 0
3.2 Mean and variance by program
award |> summarise(mu = mean(num_awards),
s2 = var(num_awards),
            .by = prog)
        prog   mu        s2
1 Vocational 0.24 0.2677551
2 General 0.20 0.1636364
3 Academic 1.00 1.6346154
The Academic program has a large variance, clearly larger than its mean. Perhaps the math covariate can explain some of this extra variability.
3.3 Poisson regression model
The glm() function with family = "poisson" fits a Poisson regression model.
award_glm <- glm(num_awards ~ prog + math, family = "poisson", data = award)
summary(award_glm)
Call:
glm(formula = num_awards ~ prog + math, family = "poisson", data = award)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.24712 0.65845 -7.969 1.60e-15 ***
progAcademic 1.08386 0.35825 3.025 0.00248 **
progVocational 0.36981 0.44107 0.838 0.40179
math 0.07015 0.01060 6.619 3.63e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 287.67 on 199 degrees of freedom
Residual deviance: 189.45 on 196 degrees of freedom
AIC: 373.5
Number of Fisher Scoring iterations: 6
Rate ratios and confidence intervals:
beta <- award_glm$coefficients
RR <- exp(beta)
ci_beta <- confint(award_glm)
ci_RR <- exp(ci_beta)
cbind(beta = beta, ci_beta, RR = RR, ci_RR) |> round(digits = 2)
                beta 2.5 % 97.5 %   RR 2.5 % 97.5 %
(Intercept) -5.25 -6.57 -3.99 0.01 0.00 0.02
progAcademic 1.08 0.44 1.86 2.96 1.55 6.40
progVocational 0.37 -0.49 1.27 1.45 0.61 3.55
math 0.07 0.05 0.09 1.07 1.05 1.10
Compared to the general program, and for a fixed final math exam score, students from the academic program get about 3 times as many awards.
For a fixed study program, a one-unit increase in the student's math score predicts a 7% increase (a multiplicative factor of 1.07) in the expected number of awards.
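These interpretations come straight from exponentiating the coefficients; a quick check using the estimates from the summary above:

```r
# Rate ratios are exponentiated coefficients from the Poisson log link:
exp(0.07015)       # ~1.07: +7% expected awards per extra math point
exp(10 * 0.07015)  # ~2.02: ten extra points about doubles the expected count
exp(1.08386)       # ~2.96: Academic vs General at a fixed math score
```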
3.4 Estimated \(\mu\) with confidence interval
As in logistic regression, we should start by constructing a confidence interval for the linear predictor, since it is closer to asymptotic normality.
award_pred <- cbind(
award,
xb = predict(award_glm, se.fit = TRUE))
head(award_pred)
  id num_awards       prog math     xb.fit xb.se.fit xb.residual.scale
1 45 0 Vocational 41 -2.001067 0.3139247 1
2 108 0 General 41 -2.370876 0.3589907 1
3 15 0 Vocational 44 -1.790610 0.3028454 1
4 67 0 Vocational 42 -1.930914 0.3099133 1
5 153 0 Vocational 40 -2.071219 0.3182388 1
6 51 0 General 42 -2.300724 0.3551921 1
award_pred <- award_pred |> mutate(
xb.residual.scale = NULL,
xb.lwr = xb.fit - 1.96*xb.se.fit,
xb.upr = xb.fit + 1.96*xb.se.fit,
muhat = exp(xb.fit),
mu.lwr = exp(xb.lwr),
mu.upr = exp(xb.upr))
glimpse(award_pred)
Rows: 200
Columns: 11
$ id <int> 45, 108, 15, 67, 153, 51, 164, 133, 2, 53, 1, 128, 16, 106,…
$ num_awards <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ prog <fct> Vocational, General, Vocational, Vocational, Vocational, Ge…
$ math <int> 41, 41, 44, 42, 40, 42, 46, 40, 33, 46, 40, 38, 44, 37, 40,…
$ xb.fit <dbl> -2.0010669, -2.3708761, -1.7906097, -1.9309145, -2.0712193,…
$ xb.se.fit <dbl> 0.3139247, 0.3589907, 0.3028454, 0.3099133, 0.3182388, 0.35…
$ xb.lwr <dbl> -2.616359, -3.074498, -2.384187, -2.538345, -2.694967, -2.9…
$ xb.upr <dbl> -1.3857744, -1.6672542, -1.1970328, -1.3234844, -1.4474713,…
$ muhat <dbl> 0.13519097, 0.09339886, 0.16685841, 0.14501552, 0.12603202,…
$ mu.lwr <dbl> 0.07306840, 0.04621282, 0.09216392, 0.07899707, 0.06754459,…
$ mu.upr <dbl> 0.2501300, 0.1887647, 0.3020892, 0.2662061, 0.2351642, 0.20…
Plot data, \(\hat{\mu}_i\), and its confidence interval.
ggplot(award_pred, aes(math, num_awards, color = prog)) +
geom_point() +
geom_line(aes(y = muhat), linewidth = 1) +
geom_ribbon(aes(ymin = mu.lwr, ymax = mu.upr), alpha = 0.1) +
labs(title = "Expected number of awards",
caption = "95% confidence interval",
color = "program") +
  facet_wrap(~ prog)
4 Influence measures
4.1 Leverage
Get the leverage values with the hatvalues() function:
award_pred <- award_pred |> mutate(
  v = hatvalues(award_glm))
Plot them against the math score, separately for each program, and add a reference line at \(2(p+1)/n\):
pplus1 <- length(award_glm$coefficients)
n <- nobs(award_glm)
ggplot(award_pred, aes(math, v, color = prog)) +
geom_point(size = 2) +
geom_hline(yintercept = 2*pplus1/n, linewidth = 1) +
labs(title = "Leverage",
color = "program", caption = "horizontal line = 2(p+1)/n") +
  facet_wrap(~ prog)
A student in the Vocational program has an unusually high math score for that program, which gives a large leverage.
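One way to list the flagged observations, as a sketch using `award_pred`, `pplus1` and `n` as defined above:

```r
# Observations with leverage above the 2(p+1)/n rule of thumb
award_pred |>
  filter(v > 2 * pplus1 / n) |>
  select(id, prog, math, num_awards, v) |>
  arrange(desc(v))
```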
4.2 Standardised deviance residuals
The deviance residuals (and Pearson residuals) are available through the influence() function:
award_infl <- influence(award_glm)
glimpse(award_infl)
List of 5
$ hat : Named num [1:200] 0.0133 0.012 0.0153 0.0139 0.0128 ...
..- attr(*, "names")= chr [1:200] "1" "2" "3" "4" ...
$ coefficients: num [1:200, 1:4] -0.0136 -0.025 -0.0125 -0.0133 -0.0137 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:200] "1" "2" "3" "4" ...
.. ..$ : chr [1:4] "(Intercept)" "progAcademic" "progVocational" "math"
$ sigma : Named num [1:200] 0.985 0.985 0.985 0.985 0.985 ...
..- attr(*, "names")= chr [1:200] "1" "2" "3" "4" ...
$ dev.res : Named num [1:200] -0.52 -0.432 -0.578 -0.539 -0.502 ...
..- attr(*, "names")= chr [1:200] "1" "2" "3" "4" ...
$ pear.res : Named num [1:200] -0.368 -0.306 -0.408 -0.381 -0.355 ...
..- attr(*, "names")= chr [1:200] "1" "2" "3" "4" ...
Extract the deviance residuals and calculate the standardized deviance residuals:
award_pred <- award_pred |> mutate(
devres = award_infl$dev.res,
std.devres = devres/sqrt(1 - v))
glimpse(award_pred)
Rows: 200
Columns: 14
$ id <int> 45, 108, 15, 67, 153, 51, 164, 133, 2, 53, 1, 128, 16, 106,…
$ num_awards <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ prog <fct> Vocational, General, Vocational, Vocational, Vocational, Ge…
$ math <int> 41, 41, 44, 42, 40, 42, 46, 40, 33, 46, 40, 38, 44, 37, 40,…
$ xb.fit <dbl> -2.0010669, -2.3708761, -1.7906097, -1.9309145, -2.0712193,…
$ xb.se.fit <dbl> 0.3139247, 0.3589907, 0.3028454, 0.3099133, 0.3182388, 0.35…
$ xb.lwr <dbl> -2.616359, -3.074498, -2.384187, -2.538345, -2.694967, -2.9…
$ xb.upr <dbl> -1.3857744, -1.6672542, -1.1970328, -1.3234844, -1.4474713,…
$ muhat <dbl> 0.13519097, 0.09339886, 0.16685841, 0.14501552, 0.12603202,…
$ mu.lwr <dbl> 0.07306840, 0.04621282, 0.09216392, 0.07899707, 0.06754459,…
$ mu.upr <dbl> 0.2501300, 0.1887647, 0.3020892, 0.2662061, 0.2351642, 0.20…
$ v <dbl> 0.013322901, 0.012036719, 0.015303471, 0.013928197, 0.01276…
$ devres <dbl> -0.5199826, -0.4322010, -0.5776823, -0.5385453, -0.5020598,…
$ std.devres <dbl> -0.5234815, -0.4348259, -0.5821539, -0.5423354, -0.5052950,…
Plot the standardized deviance residuals against the math score separately for each program, with reference lines at \(0,\, \pm 2, \pm 3\):
ggplot(award_pred, aes(math, std.devres, color = prog)) +
geom_point(size = 2) +
geom_hline(yintercept = c(-3, -2, 0, 2, 3),
linetype = rep(c("dotted", "dashed", "solid", "dashed", "dotted"), 3),
linewidth = 1) +
labs(title = "Standardized deviance residuals",
color = "program") +
  facet_wrap(~ prog)
There are no alarmingly large residuals, and the variance seems constant. The “patterns” are due to \(Y\) only taking a few different integer values.
4.3 Cook’s D
Extract Cook’s distance values with cooks.distance() and plot them in the same manner, with reference lines at \(4/n\) and \(1\).
award_pred <- award_pred |> mutate(D = cooks.distance(award_glm))
ggplot(award_pred, aes(math, D, color = prog)) +
geom_point(size = 2) +
geom_hline(yintercept = c(4/n, 1),
linetype = rep(c("solid", "dashed"), 3),
linewidth = 1) +
labs(title = "Cook's distance",
color = "program", caption = "horizontal lines = 4/n and 1") +
  facet_wrap(~ prog)
None of the Cook’s D values are anywhere near 1, so let’s remove that reference line:
ggplot(award_pred, aes(math, D, color = prog)) +
geom_point(size = 2) +
geom_hline(yintercept = c(4/n)) +
labs(title = "Cook's distance",
color = "program", caption = "horizontal line = 4/n") +
  facet_wrap(~ prog)
5 Deviance tests
5.1 Against the null model
As with logistic regression, the deviances and degrees of freedom needed for the test are available in the glm() output:
# Not run
glimpse(award_glm)
Calculate \(D_0 - D\) and the difference in the number of parameters.
D_diff <- award_glm$null.deviance - award_glm$deviance
D_diff
[1] 98.22261
df_diff <- award_glm$df.null - award_glm$df.residual
df_diff
[1] 3
A \(\chi^2\)-quantile to compare with:
qchisq(1 - 0.05, df_diff)
[1] 7.814728
or the P-value:
pchisq(D_diff, df_diff, lower.tail = FALSE)
[1] 3.746554e-21
We should reject H\(_0\): \(\beta_1 = \beta_2 = \beta_3 = 0\). At least one of the \(\beta\)-parameters is significantly different from zero.
Alternatively, we can just fit the null model and compare them using the anova()-function:
award_null <- glm(num_awards ~ 1, family = "poisson", data = award)
anova(award_null, award_glm)
Analysis of Deviance Table
Model 1: num_awards ~ 1
Model 2: num_awards ~ prog + math
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 199 287.67
2 196 189.45 3 98.223 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
5.2 Are there differences between the programs?
Update the model by removing prog. We can use the update() function:
award_red_glm <- update(award_glm, . ~ . - prog)
summary(award_red_glm)
Call:
glm(formula = num_awards ~ math, family = "poisson", data = award)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.333532 0.591261 -9.021 <2e-16 ***
math 0.086166 0.009679 8.902 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 287.67 on 199 degrees of freedom
Residual deviance: 204.02 on 198 degrees of freedom
AIC: 384.08
Number of Fisher Scoring iterations: 6
Calculate \(D_{\text{red}} - D_{\text{full}}\) and the difference in the number of parameters.
D_diff_red <- award_red_glm$deviance - award_glm$deviance
D_diff_red
[1] 14.57168
df_diff_red <- award_red_glm$df.residual - award_glm$df.residual
df_diff_red
[1] 2
Calculate the \(\chi^2\)-quantile and/or the P-value:
qchisq(1 - 0.05, df_diff_red)
[1] 5.991465
pchisq(D_diff_red, df_diff_red, lower.tail = FALSE)
[1] 0.0006851718
Again, it’s easier to use the anova() function.
anova(award_red_glm, award_glm)
Analysis of Deviance Table
Model 1: num_awards ~ math
Model 2: num_awards ~ prog + math
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 198 204.02
2 196 189.45 2 14.572 0.0006852 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We should reject H\(_0\): \(\beta_1 = \beta_2 = 0\). There are significant differences between at least two of the programs.
6 AIC, BIC and pseudo R2
Using the null model fitted above, collect AIC, BIC, the number of covariates \(p\), and the deviance \(D\) for the three models, together with the null deviance \(D_0\).
collect.AICetc <- data.frame(
AIC(award_null, award_red_glm, award_glm),
BIC(award_null, award_red_glm, award_glm),
D = c(award_null$deviance, award_red_glm$deviance, award_glm$deviance)) |>
mutate(df.1 = NULL,
D0 = award_null$deviance,
       p = df - 1)
Use the values to calculate pseudo R2-values:
collect.AICetc <- collect.AICetc |> mutate(
pseudoR2 = 1 - D/D0,
pseudoR2adj = 1 - (D + p)/D0)
collect.AICetc
              df      AIC      BIC        D       D0 p  pseudoR2 pseudoR2adj
award_null 1 465.7271 469.0254 287.6722 287.6722 0 0.0000000 0.0000000
award_red_glm 2 384.0762 390.6728 204.0213 287.6722 1 0.2907856 0.2873094
award_glm 4 373.5045 386.6978 189.4496 287.6722 3 0.3414393 0.3310108
The full model has the lowest AIC and BIC and the highest \(R^2_{adj}\), so all three criteria agree that it is the best one.
7 Prediction intervals using bootstrap
Pre-calculated prediction intervals are stored in the data.frame award.predint in the file award_predint.RData.
load("Data/award_predint.RData")
glimpse(award.predint)
Rows: 129
Columns: 4
$ math <int> 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 4…
$ prog <fct> General, General, General, General, General, General, General…
$ pred.lwr <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ pred.upr <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
You can add them to your confidence interval plot:
ggplot(award_pred, aes(math, num_awards, color = prog)) +
geom_point() +
geom_line(aes(y = muhat)) +
geom_ribbon(aes(ymin = mu.lwr, ymax = mu.upr), alpha = 0.1) +
geom_line(data = award.predint, aes(y = pred.lwr), linewidth = 1) +
geom_line(data = award.predint, aes(y = pred.upr), linewidth = 1) +
labs(title = "Expected number of awards",
caption = "95% conf.int. and 95% bootstrap pred.int.") +
  facet_wrap(~ prog)
The confidence interval for \(\mu\) has continuous limits, since the expected values are non-negative continuous values.
The prediction intervals are step functions, reflecting that \(Y\) only takes integer values.
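The actual code lives in boot_poisson.R; as a rough illustration of the parametric-bootstrap idea only (not that file's implementation, and with the response name num_awards hard-coded for this example), a sketch could look like:

```r
# Illustrative parametric bootstrap prediction interval (not boot_poisson.R):
# simulate new responses from the fitted model, refit, predict mu at x0,
# draw a new Poisson count, and take quantiles of the simulated counts.
boot_predint <- function(model, x0, B = 1000, level = 0.95) {
  dat <- model$data
  mu_hat <- fitted(model)
  ystar <- replicate(B, {
    dat$num_awards <- rpois(nrow(dat), mu_hat)  # simulate responses
    refit <- glm(formula(model), family = "poisson", data = dat)
    mu0 <- predict(refit, newdata = x0, type = "response")
    rpois(1, mu0)                               # simulate a new observation
  })
  quantile(ystar, probs = c((1 - level) / 2, 1 - (1 - level) / 2))
}

# e.g. boot_predint(award_glm, data.frame(math = 52, prog = "Academic"))
```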
If you want to produce your own bootstrapped Poisson regression prediction intervals, see lecture10_ex1_poisson-predint.R for how to generate them using an \(X_0\) data frame and a fitted Poisson regression model.
If you want to know the details, the R function that calculates the bootstrap intervals is available in boot_poisson.R.