# Lecture 10 - Poisson regression
Lecture 10 - Generalized linear models
Example 1 - Poisson regression - Student awards
1 Packages
library(tidyverse); theme_set(theme_bw() + theme(text = element_text(size = 16)))
2 Data
The data is taken from https://stats.idre.ucla.edu/r/dae/poisson-regression/
2.1 Read csv-file
Read the comma-separated (csv) file:
award <- read.csv("Data/poisson_sim.csv")
glimpse(award)
Rows: 200
Columns: 4
$ id <int> 45, 108, 15, 67, 153, 51, 164, 133, 2, 53, 1, 128, 16, 106,…
$ num_awards <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ prog <int> 3, 1, 3, 3, 3, 1, 3, 3, 3, 3, 3, 2, 3, 3, 3, 1, 1, 3, 2, 3,…
$ math <int> 41, 41, 44, 42, 40, 42, 46, 40, 33, 46, 40, 38, 44, 37, 40,…
summary(award)
       id          num_awards        prog            math
Min. : 1.00 Min. :0.00 Min. :1.000 Min. :33.00
1st Qu.: 50.75 1st Qu.:0.00 1st Qu.:2.000 1st Qu.:45.00
Median :100.50 Median :0.00 Median :2.000 Median :52.00
Mean :100.50 Mean :0.63 Mean :2.025 Mean :52.65
3rd Qu.:150.25 3rd Qu.:1.00 3rd Qu.:2.250 3rd Qu.:59.00
Max. :200.00 Max. :6.00 Max. :3.000 Max. :75.00
2.2 Factor labels
Give the programs their proper factor labels instead of numbers; see the description in the link above.
award <- mutate(award,
prog = factor(prog, levels = c(1, 2, 3),
labels = c("General", "Academic", "Vocational")))
glimpse(award)
Rows: 200
Columns: 4
$ id <int> 45, 108, 15, 67, 153, 51, 164, 133, 2, 53, 1, 128, 16, 106,…
$ num_awards <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ prog <fct> Vocational, General, Vocational, Vocational, Vocational, Ge…
$ math <int> 41, 41, 44, 42, 40, 42, 46, 40, 33, 46, 40, 38, 44, 37, 40,…
2.3 Frequency table of the number of awards
Make a frequency table of the number of awards:
award_table <- award |> count(num_awards)
award_table
  num_awards   n
1 0 124
2 1 49
3 2 13
4 3 9
5 4 2
6 5 2
7 6 1
Add the observed proportions of the different numbers of awards:
award_table <- award_table |> mutate(observed = n/sum(n))
award_table
  num_awards   n observed
1 0 124 0.620
2 1 49 0.245
3 2 13 0.065
4 3 9 0.045
5 4 2 0.010
6 5 2 0.010
7 6 1 0.005
2.4 Average number of awards
Compare the size of the average and the variance.
mu <- mean(award$num_awards)
mu
[1] 0.63
var(award$num_awards)
[1] 1.108643
Note that the variance is larger than the mean. Some of this extra variability might be explained by the covariates.
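A quick numeric check of the Poisson assumption \(\text{Var}(Y) = \text{E}(Y)\), using the values computed above (a sketch; the ratio would be close to 1 for a Poisson sample):

```r
# Dispersion ratio: variance / mean of the raw counts.
# A Poisson variable has ratio 1; values well above 1 suggest overdispersion.
var(award$num_awards) / mean(award$num_awards)
# about 1.76, i.e. clearly more variability than a pure Poisson would give
```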
2.5 Expected frequencies
Calculate the expected proportions according to a Poisson distribution with mean (and variance) \(\mu = 0.63\). The probability mass function is given by the dpois() function.
award_table <- award_table |>
mutate(expected = dpois(num_awards, mu))
award_table
  num_awards   n observed     expected
1 0 124 0.620 5.325918e-01
2 1 49 0.245 3.355328e-01
3 2 13 0.065 1.056928e-01
4 3 9 0.045 2.219550e-02
5 4 2 0.010 3.495791e-03
6 5 2 0.010 4.404696e-04
7 6 1 0.005 4.624931e-05
Plot both the observed and expected proportions in the same bar chart.
The +0.2 and -0.2 offsets, together with width = 0.4, are an easy way of getting the bars side by side.
ggplot(award_table) +
geom_col(aes(x = num_awards + 0.2, y = observed, fill = "Observed"),
width = 0.4) +
geom_col(aes(x = num_awards - 0.2, y = expected, fill = "Expected"),
width = 0.4) +
labs(y = "Probability", x = "number of awards", fill = "",
       title = "Distribution of awards: Poisson?")
A more advanced way is to rearrange the data from wide format to long format, where all the proportions are collected in one variable while a new variable indicates which original variable they came from. This is done with the pivot_longer() function from the tidyr package, which is included in the tidyverse package.
award_long <- award_table |> pivot_longer(
cols = c("observed", "expected"),
values_to = "Proportion",
names_to = "variable")
award_long
# A tibble: 14 × 4
num_awards n variable Proportion
<int> <int> <chr> <dbl>
1 0 124 observed 0.62
2 0 124 expected 0.533
3 1 49 observed 0.245
4 1 49 expected 0.336
5 2 13 observed 0.065
6 2 13 expected 0.106
7 3 9 observed 0.045
8 3 9 expected 0.0222
9 4 2 observed 0.01
10 4 2 expected 0.00350
11 5 2 observed 0.01
12 5 2 expected 0.000440
13 6 1 observed 0.005
14 6 1 expected 0.0000462
We now use position = "dodge" in order to get the bars side by side.
ggplot(award_long, aes(x = num_awards, y = Proportion, fill = variable)) +
geom_col(position = "dodge") +
labs(y = "Probability", x = "number of awards", fill = "",
title = "Distribution of awards: Poisson?") +
  theme(text = element_text(size = 14))
3 Number of awards by program
3.1 Table by program
award |> count(prog, num_awards)
         prog num_awards  n
1 General 0 36
2 General 1 9
3 Academic 0 48
4 Academic 1 32
5 Academic 2 11
6 Academic 3 9
7 Academic 4 2
8 Academic 5 2
9 Academic 6 1
10 Vocational 0 40
11 Vocational 1 8
12 Vocational 2 2
We can use pivot_wider() from tidyr to rearrange this. The values_fill = 0 replaces any missing combination with the value 0.
award |> count(prog, num_awards) |> pivot_wider(
  id_cols = num_awards, names_from = prog, values_from = n, values_fill = 0)
# A tibble: 7 × 4
num_awards General Academic Vocational
<int> <int> <int> <int>
1 0 36 48 40
2 1 9 32 8
3 2 0 11 2
4 3 0 9 0
5 4 0 2 0
6 5 0 2 0
7 6 0 1 0
3.2 Mean and variance by program
award |> summarise(mu = mean(num_awards),
s2 = var(num_awards),
            .by = prog)
        prog   mu        s2
1 Vocational 0.24 0.2677551
2 General 0.20 0.1636364
3 Academic 1.00 1.6346154
The Academic program has a large variance, clearly larger than its mean. Perhaps the math covariate can explain some of this extra variability.
3.3 Poisson regression model
The glm() function with family = "poisson" fits a Poisson regression model.
award_glm <- glm(num_awards ~ prog + math, family = "poisson", data = award)
summary(award_glm)
Call:
glm(formula = num_awards ~ prog + math, family = "poisson", data = award)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.24712 0.65845 -7.969 1.60e-15 ***
progAcademic 1.08386 0.35825 3.025 0.00248 **
progVocational 0.36981 0.44107 0.838 0.40179
math 0.07015 0.01060 6.619 3.63e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 287.67 on 199 degrees of freedom
Residual deviance: 189.45 on 196 degrees of freedom
AIC: 373.5
Number of Fisher Scoring iterations: 6
Rate ratios and confidence intervals:
beta <- award_glm$coefficients
RR <- exp(beta)
ci_beta <- confint(award_glm)
ci_RR <- exp(ci_beta)
cbind(beta = beta, ci_beta, RR = RR, ci_RR) |> round(digits = 2)
                beta 2.5 % 97.5 %   RR 2.5 % 97.5 %
(Intercept) -5.25 -6.57 -3.99 0.01 0.00 0.02
progAcademic 1.08 0.44 1.86 2.96 1.55 6.40
progVocational 0.37 -0.49 1.27 1.45 0.61 3.55
math 0.07 0.05 0.09 1.07 1.05 1.10
Compared to the general program, and for a fixed final math exam score, students from the academic program get about 3 times as many awards.
For a fixed study program, a one-unit increase in the student's math score predicts a 7% increase (a multiplicative factor of 1.07) in the expected number of awards.
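These interpretations come straight from exponentiating the coefficients; a quick check using the estimates from the summary above:

```r
# Rate ratios are exponentiated coefficients from the Poisson log link:
exp(0.07015)       # ~1.07: +7% expected awards per extra math point
exp(10 * 0.07015)  # ~2.02: ten extra points about doubles the expected count
exp(1.08386)       # ~2.96: Academic vs General at a fixed math score
```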
3.4 Estimated \(\mu\) with confidence interval
As in logistic regression, we should start by constructing a confidence interval for the linear predictor, since it is closer to asymptotic normality.
award_pred <- cbind(
award,
xb = predict(award_glm, se.fit = TRUE))
head(award_pred)
  id num_awards       prog math     xb.fit xb.se.fit xb.residual.scale
1 45 0 Vocational 41 -2.001067 0.3139247 1
2 108 0 General 41 -2.370876 0.3589907 1
3 15 0 Vocational 44 -1.790610 0.3028454 1
4 67 0 Vocational 42 -1.930914 0.3099133 1
5 153 0 Vocational 40 -2.071219 0.3182388 1
6 51 0 General 42 -2.300724 0.3551921 1
award_pred <- award_pred |> mutate(
xb.residual.scale = NULL,
xb.lwr = xb.fit - 1.96*xb.se.fit,
xb.upr = xb.fit + 1.96*xb.se.fit,
muhat = exp(xb.fit),
mu.lwr = exp(xb.lwr),
mu.upr = exp(xb.upr))
glimpse(award_pred)
Rows: 200
Columns: 11
$ id <int> 45, 108, 15, 67, 153, 51, 164, 133, 2, 53, 1, 128, 16, 106,…
$ num_awards <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ prog <fct> Vocational, General, Vocational, Vocational, Vocational, Ge…
$ math <int> 41, 41, 44, 42, 40, 42, 46, 40, 33, 46, 40, 38, 44, 37, 40,…
$ xb.fit <dbl> -2.0010669, -2.3708761, -1.7906097, -1.9309145, -2.0712193,…
$ xb.se.fit <dbl> 0.3139247, 0.3589907, 0.3028454, 0.3099133, 0.3182388, 0.35…
$ xb.lwr <dbl> -2.616359, -3.074498, -2.384187, -2.538345, -2.694967, -2.9…
$ xb.upr <dbl> -1.3857744, -1.6672542, -1.1970328, -1.3234844, -1.4474713,…
$ muhat <dbl> 0.13519097, 0.09339886, 0.16685841, 0.14501552, 0.12603202,…
$ mu.lwr <dbl> 0.07306840, 0.04621282, 0.09216392, 0.07899707, 0.06754459,…
$ mu.upr <dbl> 0.2501300, 0.1887647, 0.3020892, 0.2662061, 0.2351642, 0.20…
Plot data, \(\hat{\mu}_i\), and its confidence interval.
ggplot(award_pred, aes(math, num_awards, color = prog)) +
geom_point() +
geom_line(aes(y = muhat), linewidth = 1) +
geom_ribbon(aes(ymin = mu.lwr, ymax = mu.upr), alpha = 0.1) +
labs(title = "Expected number of awards",
caption = "95% confidence interval",
color = "program") +
  facet_wrap(~ prog)
4 Influence measures
4.1 Leverage
Get the leverage values with the hatvalues() function:
award_pred <- award_pred |> mutate(
  v = hatvalues(award_glm))
Plot them against the math score, separately for each program, and add a reference line at \(2(p+1)/n\):
pplus1 <- length(award_glm$coefficients)
n <- nobs(award_glm)
ggplot(award_pred, aes(math, v, color = prog)) +
geom_point(size = 2) +
geom_hline(yintercept = 2*pplus1/n, linewidth = 1) +
labs(title = "Leverage",
color = "program", caption = "horizontal line = 2(p+1)/n") +
  facet_wrap(~ prog)
A student in the Vocational program has an unusually high math score for that program, which gives a large leverage.
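One way to list the flagged observations, as a sketch using `award_pred`, `pplus1` and `n` as defined above:

```r
# Observations with leverage above the 2(p+1)/n rule of thumb
award_pred |>
  filter(v > 2 * pplus1 / n) |>
  select(id, prog, math, num_awards, v) |>
  arrange(desc(v))
```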
4.2 Standardised deviance residuals
The deviance residuals (and Pearson residuals) are available through the influence() function:
award_infl <- influence(award_glm)
glimpse(award_infl)
List of 5
$ hat : Named num [1:200] 0.0133 0.012 0.0153 0.0139 0.0128 ...
..- attr(*, "names")= chr [1:200] "1" "2" "3" "4" ...
$ coefficients: num [1:200, 1:4] -0.0136 -0.025 -0.0125 -0.0133 -0.0137 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:200] "1" "2" "3" "4" ...
.. ..$ : chr [1:4] "(Intercept)" "progAcademic" "progVocational" "math"
$ sigma : Named num [1:200] 0.985 0.985 0.985 0.985 0.985 ...
..- attr(*, "names")= chr [1:200] "1" "2" "3" "4" ...
$ dev.res : Named num [1:200] -0.52 -0.432 -0.578 -0.539 -0.502 ...
..- attr(*, "names")= chr [1:200] "1" "2" "3" "4" ...
$ pear.res : Named num [1:200] -0.368 -0.306 -0.408 -0.381 -0.355 ...
..- attr(*, "names")= chr [1:200] "1" "2" "3" "4" ...
Extract the deviance residuals and calculate the standardized deviance residuals:
award_pred <- award_pred |> mutate(
devres = award_infl$dev.res,
std.devres = devres/sqrt(1 - v))
glimpse(award_pred)
Rows: 200
Columns: 14
$ id <int> 45, 108, 15, 67, 153, 51, 164, 133, 2, 53, 1, 128, 16, 106,…
$ num_awards <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ prog <fct> Vocational, General, Vocational, Vocational, Vocational, Ge…
$ math <int> 41, 41, 44, 42, 40, 42, 46, 40, 33, 46, 40, 38, 44, 37, 40,…
$ xb.fit <dbl> -2.0010669, -2.3708761, -1.7906097, -1.9309145, -2.0712193,…
$ xb.se.fit <dbl> 0.3139247, 0.3589907, 0.3028454, 0.3099133, 0.3182388, 0.35…
$ xb.lwr <dbl> -2.616359, -3.074498, -2.384187, -2.538345, -2.694967, -2.9…
$ xb.upr <dbl> -1.3857744, -1.6672542, -1.1970328, -1.3234844, -1.4474713,…
$ muhat <dbl> 0.13519097, 0.09339886, 0.16685841, 0.14501552, 0.12603202,…
$ mu.lwr <dbl> 0.07306840, 0.04621282, 0.09216392, 0.07899707, 0.06754459,…
$ mu.upr <dbl> 0.2501300, 0.1887647, 0.3020892, 0.2662061, 0.2351642, 0.20…
$ v <dbl> 0.013322901, 0.012036719, 0.015303471, 0.013928197, 0.01276…
$ devres <dbl> -0.5199826, -0.4322010, -0.5776823, -0.5385453, -0.5020598,…
$ std.devres <dbl> -0.5234815, -0.4348259, -0.5821539, -0.5423354, -0.5052950,…
Plot the standardized deviance residuals against the math score separately for each program, with reference lines at \(0,\, \pm 2, \pm 3\):
ggplot(award_pred, aes(math, std.devres, color = prog)) +
geom_point(size = 2) +
geom_hline(yintercept = c(-3, -2, 0, 2, 3),
linetype = rep(c("dotted", "dashed", "solid", "dashed", "dotted"), 3),
linewidth = 1) +
labs(title = "Standardized deviance residuals",
color = "program") +
  facet_wrap(~ prog)
There are no alarmingly large residuals, and the variance seems constant. The “patterns” are due to \(Y\) only taking a few different integer values.
4.3 Cook’s D
Extract Cook’s distance values with cooks.distance() and plot them in the same manner, with reference lines at \(4/n\) and \(1\).
award_pred <- award_pred |> mutate(D = cooks.distance(award_glm))
ggplot(award_pred, aes(math, D, color = prog)) +
geom_point(size = 2) +
geom_hline(yintercept = c(4/n, 1),
linetype = rep(c("solid", "dashed"), 3),
linewidth = 1) +
labs(title = "Cook's distance",
color = "program", caption = "horizontal lines = 4/n and 1") +
  facet_wrap(~ prog)
None of the Cook’s D values are anywhere near 1, so let’s remove that reference line:
ggplot(award_pred, aes(math, D, color = prog)) +
geom_point(size = 2) +
geom_hline(yintercept = c(4/n)) +
labs(title = "Cook's distance",
color = "program", caption = "horizontal line = 4/n") +
  facet_wrap(~ prog)
5 Deviance tests
5.1 Against the null model
As with logistic regression, the deviances and degrees of freedom needed for the test are available in the glm() output:
# Not run
glimpse(award_glm)
Calculate \(D_0 - D\) and the difference in the number of parameters.
D_diff <- award_glm$null.deviance - award_glm$deviance
D_diff
[1] 98.22261
df_diff <- award_glm$df.null - award_glm$df.residual
df_diff
[1] 3
A \(\chi^2\)-quantile to compare with:
qchisq(1 - 0.05, df_diff)
[1] 7.814728
or the P-value:
pchisq(D_diff, df_diff, lower.tail = FALSE)
[1] 3.746554e-21
We should reject H\(_0\): \(\beta_1 = \beta_2 = \beta_3 = 0\). At least one of the \(\beta\)-parameters is significantly different from zero.
Alternatively, we can just fit the null model and compare them using the anova()-function:
award_null <- glm(num_awards ~ 1, family = "poisson", data = award)
anova(award_null, award_glm)
Analysis of Deviance Table
Model 1: num_awards ~ 1
Model 2: num_awards ~ prog + math
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 199 287.67
2 196 189.45 3 98.223 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
5.2 Are there differences between the programs?
Update the model by removing prog. We can use the update() function:
award_red_glm <- update(award_glm, . ~ . - prog)
summary(award_red_glm)
Call:
glm(formula = num_awards ~ math, family = "poisson", data = award)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.333532 0.591261 -9.021 <2e-16 ***
math 0.086166 0.009679 8.902 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 287.67 on 199 degrees of freedom
Residual deviance: 204.02 on 198 degrees of freedom
AIC: 384.08
Number of Fisher Scoring iterations: 6
Calculate \(D_{\text{red}} - D_{\text{full}}\) and the difference in the number of parameters.
D_diff_red <- award_red_glm$deviance - award_glm$deviance
D_diff_red
[1] 14.57168
df_diff_red <- award_red_glm$df.residual - award_glm$df.residual
df_diff_red
[1] 2
Calculate the \(\chi^2\)-quantile and/or the P-value:
qchisq(1 - 0.05, df_diff_red)
[1] 5.991465
pchisq(D_diff_red, df_diff_red, lower.tail = FALSE)
[1] 0.0006851718
Again, it’s easier to use the anova() function.
anova(award_red_glm, award_glm)
Analysis of Deviance Table
Model 1: num_awards ~ math
Model 2: num_awards ~ prog + math
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 198 204.02
2 196 189.45 2 14.572 0.0006852 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We should reject H\(_0\): \(\beta_1 = \beta_2 = 0\). There are significant differences between at least two of the programs.
6 AIC, BIC and pseudo R2
Using the null model fitted above, collect AIC, BIC, the number of covariates \(p\), and the deviance \(D\) for the three models, together with the null deviance \(D_0\).
collect.AICetc <- data.frame(
AIC(award_null, award_red_glm, award_glm),
BIC(award_null, award_red_glm, award_glm),
D = c(award_null$deviance, award_red_glm$deviance, award_glm$deviance)) |>
mutate(df.1 = NULL,
D0 = award_null$deviance,
       p = df - 1)
Use the values to calculate pseudo R2-values:
collect.AICetc <- collect.AICetc |> mutate(
pseudoR2 = 1 - D/D0,
pseudoR2adj = 1 - (D + p)/D0)
collect.AICetc
              df      AIC      BIC        D       D0 p  pseudoR2 pseudoR2adj
award_null 1 465.7271 469.0254 287.6722 287.6722 0 0.0000000 0.0000000
award_red_glm 2 384.0762 390.6728 204.0213 287.6722 1 0.2907856 0.2873094
award_glm 4 373.5045 386.6978 189.4496 287.6722 3 0.3414393 0.3310108
The full model has the lowest AIC and BIC and the highest \(R^2_{adj}\), so all three criteria agree that it is the best one.
7 Prediction intervals using bootstrap
Pre-calculated prediction intervals are stored in the data.frame award.predint in the file award_predint.RData.
load("Data/award_predint.RData")
glimpse(award.predint)
Rows: 129
Columns: 4
$ math <int> 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 4…
$ prog <fct> General, General, General, General, General, General, General…
$ pred.lwr <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ pred.upr <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
You can add them to your confidence interval plot:
ggplot(award_pred, aes(math, num_awards, color = prog)) +
geom_point() +
geom_line(aes(y = muhat)) +
geom_ribbon(aes(ymin = mu.lwr, ymax = mu.upr), alpha = 0.1) +
geom_line(data = award.predint, aes(y = pred.lwr), linewidth = 1) +
geom_line(data = award.predint, aes(y = pred.upr), linewidth = 1) +
labs(title = "Expected number of awards",
caption = "95% conf.int. and 95% bootstrap pred.int.") +
  facet_wrap(~ prog)
The confidence interval for \(\mu\) has continuous limits, since the expected values are non-negative continuous values.
The prediction intervals are step functions, reflecting that \(Y\) only takes integer values.
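The actual code lives in boot_poisson.R; as a rough illustration of the parametric-bootstrap idea only (not that file's implementation, and with the response name num_awards hard-coded for this example), a sketch could look like:

```r
# Illustrative parametric bootstrap prediction interval (not boot_poisson.R):
# simulate new responses from the fitted model, refit, predict mu at x0,
# draw a new Poisson count, and take quantiles of the simulated counts.
boot_predint <- function(model, x0, B = 1000, level = 0.95) {
  dat <- model$data
  mu_hat <- fitted(model)
  ystar <- replicate(B, {
    dat$num_awards <- rpois(nrow(dat), mu_hat)  # simulate responses
    refit <- glm(formula(model), family = "poisson", data = dat)
    mu0 <- predict(refit, newdata = x0, type = "response")
    rpois(1, mu0)                               # simulate a new observation
  })
  quantile(ystar, probs = c((1 - level) / 2, 1 - (1 - level) / 2))
}

# e.g. boot_predint(award_glm, data.frame(math = 52, prog = "Academic"))
```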
If you want to produce your own bootstrapped Poisson regression prediction intervals, see lecture10_ex1_poisson-predint.R for how to generate them using an \(X_0\) data frame and a fitted Poisson regression model.
If you want to know the details, the R function that calculates the bootstrap intervals is available in boot_poisson.R.