# Lecture 10 - Negative binomial regression
library(tidyverse); theme_set(theme_bw() + theme(text = element_text(size = 16)))
library(MASS)
Lecture 10 - Generalised linear models
Example 2 - Negative binomial regression - Student absence
1 Packages
We now need to activate the MASS package. MASS = Modern Applied Statistics with S (“S” is the statistical software that “R” is inspired by).
2 Number of days absent
The data is taken from https://stats.oarc.ucla.edu/r/dae/negative-binomial-regression/
2.1 Read the data
The data at UCLA is in Stata (another statistical software) format. The package foreign has functions for reading, i.a., SAS, Stata, Minitab, Octave, SPSS formats. Use it if you like.
#library(foreign)
#data.nb <- read.dta("https://stats.idre.ucla.edu/stat/stata/dae/nb_data.dta")
If not, I have made an R version of the data:
load("Data/nb_data.RData")
glimpse(data.nb)
Rows: 314
Columns: 5
$ id <dbl> 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010, 10…
$ gender <fct> male, male, female, female, female, female, female, male, male…
$ math <dbl> 63, 27, 20, 16, 2, 71, 63, 3, 51, 49, 31, 22, 73, 77, 10, 89, …
$ daysabs <dbl> 4, 4, 2, 3, 3, 13, 11, 7, 10, 9, 4, 5, 5, 6, 1, 0, 1, 0, 5, 24…
$ prog <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 2, 2, 1, 3, 2, 2, 2, 1, 2, 2,…
2.2 Set factor labels
See the description in the link above.
data.nb <- data.nb |> mutate(
prog = factor(prog,
levels = c(1, 2, 3),
labels = c("General", "Academic", "Vocational")))
glimpse(data.nb)
Rows: 314
Columns: 5
$ id <dbl> 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010, 10…
$ gender <fct> male, male, female, female, female, female, female, male, male…
$ math <dbl> 63, 27, 20, 16, 2, 71, 63, 3, 51, 49, 31, 22, 73, 77, 10, 89, …
$ daysabs <dbl> 4, 4, 2, 3, 3, 13, 11, 7, 10, 9, 4, 5, 5, 6, 1, 0, 1, 0, 5, 24…
$ prog <fct> Academic, Academic, Academic, Academic, Academic, Academic, Ac…
2.3 Plot vs math by program
Plot the number of days absent by math score and program
ggplot(data.nb, aes(math, daysabs, color = prog)) +
geom_point() +
facet_wrap(~ prog)
3 Fit a negbin model
The ordinary glm() function cannot estimate the extra parameter \(\theta\) of a negative binomial regression, so we have to use the glm.nb() (nb = negative binomial) function from the MASS package.
absence_glmnb <- glm.nb(daysabs ~ math + prog, data = data.nb)
absence_glmnb
Call: glm.nb(formula = daysabs ~ math + prog, data = data.nb, init.theta = 1.032713156,
link = log)
Coefficients:
(Intercept) math progAcademic progVocational
2.615265 -0.005993 -0.440760 -1.278651
Degrees of Freedom: 313 Total (i.e. Null); 310 Residual
Null Deviance: 427.5
Residual Deviance: 358.5 AIC: 1741
summary(absence_glmnb)
Call:
glm.nb(formula = daysabs ~ math + prog, data = data.nb, init.theta = 1.032713156,
link = log)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.615265 0.197460 13.245 < 2e-16 ***
math -0.005993 0.002505 -2.392 0.0167 *
progAcademic -0.440760 0.182610 -2.414 0.0158 *
progVocational -1.278651 0.200720 -6.370 1.89e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(1.0327) family taken to be 1)
Null deviance: 427.54 on 313 degrees of freedom
Residual deviance: 358.52 on 310 degrees of freedom
AIC: 1741.3
Number of Fisher Scoring iterations: 1
Theta: 1.033
Std. Err.: 0.106
2 x log-likelihood: -1731.258
The summary() also gives us an estimate of \(\theta\) and its standard error.
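As a quick sanity check (not part of the original output), an approximate 95 % Wald interval for \(\theta\) can be built from the estimate and standard error stored in the fit; theta and SE.theta are components of the glm.nb result object.

```r
# Sketch: approximate 95% Wald interval for theta from the stored
# estimate and standard error in the glm.nb fit
theta_hat <- absence_glmnb$theta      # about 1.033
theta_se  <- absence_glmnb$SE.theta   # about 0.106
theta_hat + c(-1.96, 1.96) * theta_se
```

Since \(\theta\) must be positive, an interval on the log scale is often preferable; this Wald version is only a rough check.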
3.1 Confidence intervals for \(\beta\)
Profile likelihood confidence intervals for the \(\beta\)-parameters are calculated with confint() as usual.
beta <- absence_glmnb$coefficients
ci_beta = confint(absence_glmnb)
cbind(beta = beta, ci_beta) |> round(digits = 2)
beta 2.5 % 97.5 %
(Intercept) 2.62 2.24 3.01
math -0.01 -0.01 0.00
progAcademic -0.44 -0.81 -0.09
progVocational -1.28 -1.68 -0.89
cbind(RR = exp(beta), exp(ci_beta)) |> round(digits = 2)
RR 2.5 % 97.5 %
(Intercept) 13.67 9.41 20.35
math 0.99 0.99 1.00
progAcademic 0.64 0.44 0.91
progVocational 0.28 0.19 0.41
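The exponentiated coefficients act as rate ratios and multiply along the linear predictor. For instance (a sketch, not in the original output), the effect of a 10-point increase in math score is obtained by scaling the coefficient before exponentiating:

```r
# Sketch: rate ratio for a 10-point increase in math score;
# coefficients are on the log scale, so effects multiply
exp(10 * coef(absence_glmnb)["math"])       # roughly 0.94, i.e. about 6% fewer days
exp(10 * confint(absence_glmnb)["math", ])  # the matching 95% profile interval
```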
3.2 Estimate means with confidence intervals
The linear predictor and its confidence interval is calculated with the help of predict(), as before:
absence_pred <- cbind(
data.nb,
xb = predict(absence_glmnb, se.fit = TRUE))
absence_pred <- absence_pred |>
mutate(
xb.residual.scale = NULL,
xb.lwr = xb.fit - 1.96*xb.se.fit,
xb.upr = xb.fit + 1.96*xb.se.fit,
mu.hat = exp(xb.fit),
mu.lwr = exp(xb.lwr),
mu.upr = exp(xb.upr))
Add the fitted line and confidence intervals to the plot:
ggplot(absence_pred, aes(math, daysabs, color = prog)) +
geom_point() +
geom_line(aes(y = mu.hat), linewidth = 1) +
geom_ribbon(aes(ymin = mu.lwr, ymax = mu.upr), alpha = 0.1) +
facet_wrap(~ prog)
4 Likelihood ratio tests
4.1 Against the null model
The \(D\) and \(D_0\) saved in the glm.nb result reflect what happens if we remove the x-variables while keeping the \(\theta\)-estimate fixed. But if we actually remove variables, \(\theta\) has to pick up the left-over variability and will change, so the null deviance will differ between models.
absence_null <- glm.nb(daysabs ~ 1, data = data.nb)
absence_null$null.deviance
[1] 357.4863
absence_glmnb$null.deviance
[1] 427.5401
We need to use a Likelihood-ratio test now. The easiest is to rely on the anova()-function that has this implemented.
anova(absence_null, absence_glmnb)
Likelihood ratio tests of Negative Binomial Models
Response: daysabs
Model theta Resid. df 2 x log-lik. Test df LR stat.
1 1 0.7999836 313 -1792.945
2 math + prog 1.0327132 310 -1731.258 1 vs 2 3 61.68695
Pr(Chi)
1
2 2.563505e-13
We should reject H\(_0\): \(\beta_1 = \beta_2 = \beta_3 = 0\). At least one of the \(\beta\)-parameters is significantly different from zero.
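The same test can be computed “by hand” from the log-likelihoods, which makes explicit what anova() is doing (a sketch; df = 3 because three \(\beta\)-parameters are tested):

```r
# Sketch: the LR test against the null model computed by hand
LR <- -2 * (logLik(absence_null)[1] - logLik(absence_glmnb)[1])
LR                                      # matches the LR stat. from anova()
pchisq(LR, df = 3, lower.tail = FALSE)  # 3 beta-parameters tested
```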
4.2 Are there differences between the programs?
We need to take into account that the null deviance changes when we change the model, since we also get a new \(\theta\)-estimate. Again, use the anova() function here.
Update the model by removing prog. We can use the update() function:
absence_red <- update(absence_glmnb, . ~ . - prog)
summary(absence_red)
Call:
glm.nb(formula = daysabs ~ math, data = data.nb, init.theta = 0.8558564931,
link = log)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.24663 0.13916 16.145 < 2e-16 ***
math -0.01034 0.00259 -3.991 6.58e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(0.8559) family taken to be 1)
Null deviance: 375.05 on 313 degrees of freedom
Residual deviance: 357.90 on 312 degrees of freedom
AIC: 1782.3
Number of Fisher Scoring iterations: 1
Theta: 0.8559
Std. Err.: 0.0829
2 x log-likelihood: -1776.3060
anova(absence_red, absence_glmnb)
Likelihood ratio tests of Negative Binomial Models
Response: daysabs
Model theta Resid. df 2 x log-lik. Test df LR stat.
1 math 0.8558565 312 -1776.306
2 math + prog 1.0327132 310 -1731.258 1 vs 2 2 45.04798
Pr(Chi)
1
2 1.65179e-10
We should reject H\(_0\): \(\beta_2 = \beta_3 = 0\). There are significant differences between at least two of the programs.
5 AIC, BIC and pseudo R2
Fit the null model as well, and collect AIC, BIC, \(p\), deviance \(D\) and null deviance \(D_0\) for the three models.
collect.AICetc <- data.frame(
AIC(absence_null, absence_red, absence_glmnb),
BIC(absence_null, absence_red, absence_glmnb),
D = c(absence_null$deviance, absence_red$deviance, absence_glmnb$deviance),
D0 = c(absence_null$null.deviance, absence_red$null.deviance,
absence_glmnb$null.deviance),
p = c(absence_null$df.null - absence_null$df.residual,
absence_red$df.null - absence_red$df.residual,
absence_glmnb$df.null - absence_glmnb$df.residual)) |>
mutate(df.1 = NULL)
collect.AICetc
df AIC BIC D D0 p
absence_null 2 1796.945 1804.444 357.4863 357.4863 0
absence_red 3 1782.306 1793.554 357.8987 375.0467 1
absence_glmnb 5 1741.258 1760.005 358.5193 427.5401 3
Note that the null deviance now changes with the model.
Calculate the pseudo \(R^2\) and adjusted \(R^2\) for each of the models.
collect.AICetc |> mutate(
R2 = 1 - D/D0,
R2.adj = 1 - (D + p)/D0) |> round(digits = 3)
df AIC BIC D D0 p R2 R2.adj
absence_null 2 1796.945 1804.444 357.486 357.486 0 0.000 0.000
absence_red 3 1782.306 1793.554 357.899 375.047 1 0.046 0.043
absence_glmnb 5 1741.258 1760.005 358.519 427.540 3 0.161 0.154
The full model has the lowest AIC and BIC and the highest \(R^2_{adj}\), so all three criteria agree that it is the best one.
6 Negbin or Poisson?
6.1 Using the standardized deviance residuals
Calculate the standardized deviance residuals for the negative binomial model.
absence_infl <- influence(absence_glmnb)
absence_pred <- absence_pred |> mutate(
v = hatvalues(absence_glmnb),
devres = absence_infl$dev.res,
std.devres = devres/sqrt(1 - v))
Fit a Poisson regression model to the data and calculate its standardized deviance residuals as well:
absence_pois_glm <- glm(daysabs ~ math + prog, family = "poisson",
data = data.nb)
absence_pois_infl <- influence(absence_pois_glm)
absence_pred <- absence_pred |> mutate(
xb_pois = predict(absence_pois_glm),
v_pois = hatvalues(absence_pois_glm),
devres_pois = absence_pois_infl$dev.res,
std.devres_pois = devres_pois/sqrt(1 - v_pois))
Checking the min and max values to get equal y-axes:
absence_pred |> summarise(
min_nb = min(std.devres),
max_nb = max(std.devres),
min_pois = min(std.devres_pois),
max_pois = max(std.devres_pois))
min_nb max_nb min_pois max_pois
1 -2.172154 2.539892 -4.300453 7.446744
A y-axis from \(-4.5\) to \(+7.5\) seems reasonable.
6.1.1 Residual plot for Poisson
ggplot(absence_pred, aes(x = xb_pois, color = prog)) +
geom_point(aes(y = std.devres_pois), size = 2) +
geom_hline(yintercept = c(-3, -2, 0, 2, 3),
linetype = c("dotted", "dashed", "solid", "dashed", "dotted"),
linewidth = 1) +
expand_limits(y = c(-4.5, 7.5)) +
labs(y = "std dev.res", x = "xb", color = "program",
title = "Absence: Poisson model")
6.1.2 Residual plot for negbin
ggplot(absence_pred, aes(x = xb.fit, color = prog)) +
geom_point(aes(y = std.devres), size = 2) +
geom_hline(yintercept = c(-3, -2, 0, 2, 3),
linetype = c("dotted", "dashed", "solid", "dashed", "dotted"),
linewidth = 1) +
expand_limits(y = c(-4.5, 7.5)) +
labs(y = "std dev.res", x = "xb", color = "program",
title = "Absence: Negbin model")
The Poisson residuals are too large. Use the negbin model.
6.2 With Likelihood ratio test
Do NOT use the anova() function here, since it assumes the two models follow the same distribution, which is not the case.
You have to compute everything “by hand” using the likelihood function, logLik().
This is not the same as the deviance, since the saturated models differ between the Poisson and the Negative binomial. We are not testing \(\beta\)-parameters here; instead, the Poisson distribution is nested within the Negative binomial as the special case \(1/\theta = 0\).
D_pois <- -2*logLik(absence_pois_glm)[1]
D_pois
[1] 2657.285
D_nb <- -2*logLik(absence_glmnb)[1]
D_nb
[1] 1731.258
D_diff <- D_pois - D_nb
D_diff
[1] 926.0272
qchisq(1 - 0.05, 1)
[1] 3.841459
pchisq(D_diff, 1, lower.tail = FALSE)
[1] 2.157298e-203
We should reject H\(_0\): \(1/\theta = 0\). The extra parameter that allows for the extra variability and turns a Poisson regression into a Negative binomial regression, is significant.
Poisson gives the same overall \(\mu\) as Negative binomial regression, but the fit does not account for increasing variability for increasing \(\mu\).
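To see how the two variance functions differ, Var(Y) \(= \mu\) for Poisson against Var(Y) \(= \mu + \mu^2/\theta\) for the negative binomial, we can tabulate them at a few means (a sketch using the fitted \(\theta\)):

```r
# Sketch: model variances at a few values of mu
mu <- c(1, 5, 10, 20)
theta_hat <- absence_glmnb$theta          # about 1.03
rbind(poisson = mu,                       # Var(Y) = mu
      negbin  = mu + mu^2 / theta_hat)    # Var(Y) = mu + mu^2/theta
```

With \(\theta \approx 1\) the negbin variance grows roughly quadratically in \(\mu\), which is exactly the extra spread visible in the data.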
7 Bootstrap prediction intervals
The bootstrap prediction intervals are pre-calculated and available in nb_predint.RData
load("Data/nb_predint.RData")
Add them to the plot. They are step functions, since \(Y\) takes non-negative integer values, but they are plotted here with geom_smooth() to avoid the raggedness.
ggplot(absence_pred, aes(math, daysabs, color = prog)) +
geom_point() +
geom_line(aes(y = mu.hat)) +
geom_ribbon(aes(ymin = mu.lwr, ymax = mu.upr), alpha = 0.1) +
geom_smooth(data = nb.predint, aes(y = pred.lwr), linewidth = 1, se = FALSE) +
geom_smooth(data = nb.predint, aes(y = pred.upr), linewidth = 1, se = FALSE) +
labs(title = "Expected number of days absent",
caption = "95% conf.int. and 95% bootstrap pred.int.") +
facet_wrap(~ prog)
If interested, see Lecture10_ex1_negbin-predint.R for calculating these yourself, using the function in boot_negbin.R.
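If you want a rough idea of what such a computation involves, here is a minimal parametric-bootstrap sketch (illustration only; the helper in boot_negbin.R may be structured differently): simulate responses from the fitted model, refit, and add negative binomial noise to the refitted means.

```r
# Minimal parametric-bootstrap sketch for prediction intervals
# (illustration only; boot_negbin.R may do this differently)
set.seed(1)
B <- 200                                   # number of bootstrap samples
mu_hat    <- predict(absence_glmnb, type = "response")
theta_hat <- absence_glmnb$theta
ypred <- replicate(B, {
  # simulate a new response vector from the fitted model
  ystar <- rnbinom(length(mu_hat), mu = mu_hat, size = theta_hat)
  # refit, to propagate the estimation uncertainty
  fitstar <- glm.nb(ystar ~ math + prog, data = data.nb)
  mustar <- predict(fitstar, type = "response")
  # add negbin noise around the refitted means
  rnbinom(length(mustar), mu = mustar, size = fitstar$theta)
})
# pointwise 95% prediction limits
pred_int <- t(apply(ypred, 1, quantile, probs = c(0.025, 0.975)))
head(pred_int)
```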
8 Are the awards negative binomial?
Refit the Poisson model from Example 1 and compare it with a negative binomial model.
award <- read.csv("Data/poisson_sim.csv")
award <- award |> mutate(
prog = factor(prog,
levels = c(1, 2, 3),
labels = c("General", "Academic", "Vocational")))
award_pois_glm <- glm(num_awards ~ prog + math, family = "poisson",
data = award)
award_glmnb <- glm.nb(num_awards ~ prog + math, data = award)
award_pred <- cbind(
award,
xb_pois = predict(award_pois_glm),
xb_nb = predict(award_glmnb))
award_pred <- award_pred |> mutate(
devres_pois = influence(award_pois_glm)$dev.res,
v_pois = hatvalues(award_pois_glm),
std.devres_pois = devres_pois/sqrt(1 - v_pois),
devres_nb = influence(award_glmnb)$dev.res,
v_nb = hatvalues(award_glmnb),
std.devres_nb = devres_nb/sqrt(1 - v_nb))
ggplot(award_pred, aes(x = xb_pois, color = prog)) +
geom_point(aes(y = std.devres_pois), size = 2) +
geom_hline(yintercept = c(-3, -2, 0, 2, 3),
linetype = c("dotted", "dashed", "solid", "dashed", "dotted"),
linewidth = 1) +
labs(y = "std dev.res", x = "xb", color = "program",
title = "Awards: Poisson model")
ggplot(award_pred, aes(x = xb_nb, color = prog)) +
geom_point(aes(y = std.devres_nb), size = 2) +
geom_hline(yintercept = c(-3, -2, 0, 2, 3),
linetype = c("dotted", "dashed", "solid", "dashed", "dotted"),
linewidth = 1) +
labs(y = "std dev.res", x = "xb", color = "program",
title = "Awards: Negbin model")
D_diff.award <- -2*(logLik(award_pois_glm)[1] - logLik(award_glmnb)[1])
D_diff.award
[1] 1.693609
qchisq(1 - 0.05, 1)
[1] 3.841459
pchisq(D_diff.award, 1, lower.tail = FALSE)
[1] 0.1931259
No, the number of awards does not need a negative binomial model; Poisson is OK.