Lecture 7 - Binary logistic regression

Example 1 - PM10 particles in Oslo

Author

Anna Lindgren

Published

April 28, 2026

1 Packages

# Lecture 7 - Oslo particles - logistic regression
library(tidyverse); theme_set(theme_bw() + theme(text = element_text(size = 16)))
library(readxl)
If your R version is older than 4.4.0:
  • update to a newer version (preferably), or
  • make sure that the package MASS is installed. It probably is, but look for it on the Packages tab in RStudio; it should be listed under the System Library heading. You do not have to activate it with library(MASS).
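Both conditions can be checked from the console. A minimal, optional sketch:

```r
# Check the installed R version (should be 4.4.0 or later)
getRversion()

# Check whether MASS is installed, without attaching it
"MASS" %in% rownames(installed.packages())
```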

2 Data

oslo <- read_excel("Data/oslo.xlsx")
glimpse(oslo)
Rows: 500
Columns: 7
$ cars      <dbl> 2308, 3084, 110, 1854, 2351, 2662, 2478, 2387, 984, 2159, 30…
$ windspeed <dbl> 4.2, 4.8, 4.3, 3.0, 5.6, 2.3, 1.9, 8.9, 2.0, 5.2, 3.4, 4.4, …
$ temp2m    <dbl> -4.4, -5.7, -13.5, 1.4, 4.1, 5.8, 2.7, 7.1, 4.1, 1.1, -9.0, …
$ tempdiff  <chr> "zero", "zero", "zero", "zero", "pos", "zero", "zero", "zero…
$ zerodiff  <chr> "zero", "zero", "zero", "zero", "nonzero", "zero", "zero", "…
$ windquad  <chr> "NE", "NE", "NE", "SE", "NW", "SW", "SW", "SW", "SW", "NE", …
$ highpm10  <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, …

There are some extra variables used in Lecture 8.

Plot highpm10 against number of cars, cars:

ggplot(oslo, aes(cars, highpm10)) +
  geom_point() +
  xlab("number of cars") +
  ylab("High PM10") +
  labs(title = "High PM10 (=1) or Not high PM10 (=0) vs number of cars") 

It’s hard to see any relationship, so let’s add a moving average with geom_smooth():

ggplot(oslo, aes(cars, highpm10)) +
  geom_point() +
  geom_smooth() +
  xlab("number of cars") +
  ylab("High PM10") +
  labs(title = "High PM10 (=1) or Not high PM10 (=0) vs number of cars") 

It seems that the proportion of high PM10 values increases as the number of cars increases.
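Before fitting a model, we can also tabulate the empirical proportions directly, assuming oslo has been read in as above. The breakpoints below are arbitrary illustration values, not part of the lecture:

```r
# Proportion of high PM10 days within (arbitrary) traffic intervals
oslo |>
  mutate(cars_group = cut(cars, breaks = c(0, 1000, 2000, 3000, Inf))) |>
  group_by(cars_group) |>
  summarise(n = n(), prop_high = mean(highpm10))
```

The proportion of 1s should increase from the lowest to the highest interval, in line with the moving average in the plot.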

3 Simple logistic regression model

3.1 Fit a logistic model

Instead of the lm() (Linear Model) function we now use the glm() (Generalized Linear Model) function. Since it can handle several different distributions, we have to specify family = "binomial":

glm(highpm10 ~ cars, family = "binomial", data = oslo)

Call:  glm(formula = highpm10 ~ cars, family = "binomial", data = oslo)

Coefficients:
(Intercept)         cars  
 -2.1002955    0.0004759  

Degrees of Freedom: 499 Total (i.e. Null);  498 Residual
Null Deviance:      536.8 
Residual Deviance: 511.2    AIC: 515.2

Note the small \(\hat{\beta}_1\) = the effect of one single car. It’s better to use thousands of cars, cars/1000, as the x-variable instead of cars, both for numerical reasons and for the interpretation of the parameters.

model_kcars_glm <- glm(highpm10 ~ I(cars/1000), 
                       family = "binomial", data = oslo)
model_kcars_glm

Call:  glm(formula = highpm10 ~ I(cars/1000), family = "binomial", data = oslo)

Coefficients:
 (Intercept)  I(cars/1000)  
     -2.1003        0.4759  

Degrees of Freedom: 499 Total (i.e. Null);  498 Residual
Null Deviance:      536.8 
Residual Deviance: 511.2    AIC: 515.2
Note

The \(\beta\)-estimate for I(cars/1000) is 1000 times the \(\beta\)-estimate for cars.
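The scaling relation can be verified numerically, assuming the first model has been saved as well (the name model_cars_glm is an assumption for this sketch):

```r
# Refit the unscaled model and compare slopes
model_cars_glm <- glm(highpm10 ~ cars, family = "binomial", data = oslo)

# 1000 * beta for cars should equal beta for cars/1000 (~0.4759)
1000 * coef(model_cars_glm)["cars"]
coef(model_kcars_glm)["I(cars/1000)"]
```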

3.2 Summary for logistic model

summary(model_kcars_glm)

Call:
glm(formula = highpm10 ~ I(cars/1000), family = "binomial", data = oslo)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -2.10030    0.22306  -9.416  < 2e-16 ***
I(cars/1000)  0.47592    0.09675   4.919 8.69e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 536.85  on 499  degrees of freedom
Residual deviance: 511.23  on 498  degrees of freedom
AIC: 515.23

Number of Fisher Scoring iterations: 4

Look, our friend AIC! But there is no R\(^2\), no residual standard error, and no F-test. Instead we have something called “deviance”: deviance = \(-2\cdot\)log-likelihood. More on this in the next lecture.
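For binary (Bernoulli) data the saturated model has log-likelihood 0, so the residual deviance equals \(-2\cdot\) the log-likelihood of the fitted model. A quick check, assuming model_kcars_glm from above:

```r
# Residual deviance and -2*log-likelihood should agree (~511.23)
deviance(model_kcars_glm)
-2 * as.numeric(logLik(model_kcars_glm))
```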

3.3 Wald test for beta-parameters

model_kcars_sum <- summary(model_kcars_glm)
model_kcars_sum$coefficients
               Estimate Std. Error   z value     Pr(>|z|)
(Intercept)  -2.1002955 0.22305548 -9.416023 4.684984e-21
I(cars/1000)  0.4759191 0.09674806  4.919159 8.691675e-07
b1 <- model_kcars_sum$coefficients[2, "Estimate"]
se.b1 <- model_kcars_sum$coefficients[2, "Std. Error"]
z.b1 <- model_kcars_sum$coefficients[2, "z value"]

Since \(|z| = \frac{|\hat{\beta}_1 - 0|}{d(\hat{\beta}_1)} = \frac{|0.48 - 0|}{0.1} = |4.92| > \lambda_{0.025} = 1.96\) we can reject H\(_0\): \(\beta_1 = 0\).

Alt. Since \(Pr(|N(0,1)| > 4.92) = 2\cdot Pr(N(0,1) > 4.92) = 8.7\cdot 10^{-7} < 0.05\) we can reject H\(_0\).

The number of cars (or, rather, the number of thousands of cars) has a significant impact on the probability of a high concentration of PM\(_{10}\)-particles.
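The z-value and P-value in the summary table can be reproduced by hand from b1, se.b1 and z.b1 extracted above:

```r
# Wald test of H0: beta1 = 0, by hand
z <- b1 / se.b1                              # ~4.92, same as z.b1
p <- 2 * pnorm(abs(z), lower.tail = FALSE)   # ~8.69e-07, same as Pr(>|z|)
c(z = z, P = p)
```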

4 Confidence intervals for beta parameters

4.1 Interval for beta (intercept log odds and log odds ratio)

The confint()-function calculates profile likelihood based confidence intervals for the \(\beta\)-parameters.

bhat <- model_kcars_glm$coefficients
ci.beta <- confint(model_kcars_glm)
cbind(beta = bhat, ci.beta) |> round(digits = 2)
              beta 2.5 % 97.5 %
(Intercept)  -2.10 -2.55  -1.68
I(cars/1000)  0.48  0.29   0.67

4.2 Interval for Odds and Odds Ratio, exp(beta)

# exp(beta0), exp(beta1)
or <- exp(bhat)
ci.or <- exp(ci.beta)
cbind(`exp(beta)` = or, ci.or) |> round(digits = 2)
             exp(beta) 2.5 % 97.5 %
(Intercept)       0.12  0.08   0.19
I(cars/1000)      1.61  1.34   1.95
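The odds ratio \(e^{\hat{\beta}_1} \approx 1.61\) is the factor by which the odds are multiplied for each additional 1000 cars. A sketch verifying this from the fitted model (the car counts 2000 and 3000 are arbitrary illustration values):

```r
# Log-odds at 2000 and 3000 cars; the difference is one unit of cars/1000
logodds <- predict(model_kcars_glm,
                   newdata = data.frame(cars = c(2000, 3000)))

# Odds(3000 cars) / Odds(2000 cars) = exp(beta1) ~ 1.61
exp(diff(logodds))
```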

4.3 Wald-based intervals by hand

se.bhat <- summary(model_kcars_glm)$coefficients[, "Std. Error"]
ci.wald <- cbind(lo = bhat - 1.96*se.bhat, hi = bhat + 1.96*se.bhat)
ci.wald |> round(digits = 2)
                lo    hi
(Intercept)  -2.54 -1.66
I(cars/1000)  0.29  0.67
exp(ci.wald) |> round(digits = 2)
               lo   hi
(Intercept)  0.08 0.19
I(cars/1000) 1.33 1.95

The differences between the two versions are small because the data set is large; Wald and profile likelihood intervals agree asymptotically.

5 Linear predictor and probabilities

5.1 Estimated probabilities

We can use the predict()-function in order to get either the linear predictor (log-odds), which is the default, or the estimated probabilities by supplying type = "response":

Note

We can remove the confidence band around the moving average by setting se = FALSE (se = standard error).

oslo_pred <- oslo |> 
  mutate(phat = predict(model_kcars_glm, type = "response"))

ggplot(oslo_pred, aes(cars, highpm10)) +
  geom_point() +
  geom_smooth(se = FALSE, linetype = "dashed") +
  geom_line(aes(y = phat), color = "red", linewidth = 1) +
  xlab("number of cars") +
  ylab("High PM10") +
  labs(title = "High PM10 (=1) or Not high PM10 (=0) vs number of cars",
       caption = "red = fitted line, blue dashed = moving average")

5.2 Confidence interval for probabilities

When we had a linear regression model from lm(), we could get confidence and prediction intervals directly with the predict() function. This is not possible for a logistic regression. Instead, we have to get the standard errors from predict(model, se.fit = TRUE) and then calculate Wald-type confidence intervals ourselves.

5.2.1 Step 1: Conf.int. for the linear predictor, logodds

Note

We have to use cbind() due to the residual.scale variable.

oslo_pred <- cbind(
  oslo_pred,
  logit = predict(model_kcars_glm, se.fit = TRUE))
glimpse(oslo_pred)
Rows: 500
Columns: 11
$ cars                 <dbl> 2308, 3084, 110, 1854, 2351, 2662, 2478, 2387, 98…
$ windspeed            <dbl> 4.2, 4.8, 4.3, 3.0, 5.6, 2.3, 1.9, 8.9, 2.0, 5.2,…
$ temp2m               <dbl> -4.4, -5.7, -13.5, 1.4, 4.1, 5.8, 2.7, 7.1, 4.1, …
$ tempdiff             <chr> "zero", "zero", "zero", "zero", "pos", "zero", "z…
$ zerodiff             <chr> "zero", "zero", "zero", "zero", "nonzero", "zero"…
$ windquad             <chr> "NE", "NE", "NE", "SE", "NW", "SW", "SW", "SW", "…
$ highpm10             <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0…
$ phat                 <dbl> 0.2685731, 0.3469301, 0.1142603, 0.2282989, 0.272…
$ logit.fit            <dbl> -1.0018742, -0.6325609, -2.0479444, -1.2179414, -…
$ logit.se.fit         <dbl> 0.1132430, 0.1510233, 0.2138470, 0.1104975, 0.114…
$ logit.residual.scale <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…

The residual.scale-variable is unnecessary. Let’s remove it:

oslo_pred <- oslo_pred |> mutate(logit.residual.scale = NULL)

Calculate the lower and upper confidence interval limits for the linear predictor.

Note

The Wald method is OK here since we do not intend to use the interval for any hypothesis testing. There is also no feasible alternative method!

lambda <- qnorm(1 - 0.05/2)
oslo_pred <- oslo_pred |> mutate(
  logit.lwr = logit.fit - lambda*logit.se.fit,
  logit.upr = logit.fit + lambda*logit.se.fit)
glimpse(oslo_pred)
Rows: 500
Columns: 12
$ cars         <dbl> 2308, 3084, 110, 1854, 2351, 2662, 2478, 2387, 984, 2159,…
$ windspeed    <dbl> 4.2, 4.8, 4.3, 3.0, 5.6, 2.3, 1.9, 8.9, 2.0, 5.2, 3.4, 4.…
$ temp2m       <dbl> -4.4, -5.7, -13.5, 1.4, 4.1, 5.8, 2.7, 7.1, 4.1, 1.1, -9.…
$ tempdiff     <chr> "zero", "zero", "zero", "zero", "pos", "zero", "zero", "z…
$ zerodiff     <chr> "zero", "zero", "zero", "zero", "nonzero", "zero", "zero"…
$ windquad     <chr> "NE", "NE", "NE", "SE", "NW", "SW", "SW", "SW", "SW", "NE…
$ highpm10     <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, …
$ phat         <dbl> 0.2685731, 0.3469301, 0.1142603, 0.2282989, 0.2726122, 0.…
$ logit.fit    <dbl> -1.0018742, -0.6325609, -2.0479444, -1.2179414, -0.981409…
$ logit.se.fit <dbl> 0.1132430, 0.1510233, 0.2138470, 0.1104975, 0.1143774, 0.…
$ logit.lwr    <dbl> -1.2238264, -0.9285611, -2.4670769, -1.4345126, -1.205585…
$ logit.upr    <dbl> -0.7799220, -0.3365607, -1.6288119, -1.0013702, -0.757234…

5.2.2 Step 2: Confidence interval for the odds

Transform the interval limits for the linear predictor (logodds) into limits for the odds:

oslo_pred <- oslo_pred |> mutate(
  odds.lwr = exp(logit.lwr),
  odds.upr = exp(logit.upr))
glimpse(oslo_pred)
Rows: 500
Columns: 14
$ cars         <dbl> 2308, 3084, 110, 1854, 2351, 2662, 2478, 2387, 984, 2159,…
$ windspeed    <dbl> 4.2, 4.8, 4.3, 3.0, 5.6, 2.3, 1.9, 8.9, 2.0, 5.2, 3.4, 4.…
$ temp2m       <dbl> -4.4, -5.7, -13.5, 1.4, 4.1, 5.8, 2.7, 7.1, 4.1, 1.1, -9.…
$ tempdiff     <chr> "zero", "zero", "zero", "zero", "pos", "zero", "zero", "z…
$ zerodiff     <chr> "zero", "zero", "zero", "zero", "nonzero", "zero", "zero"…
$ windquad     <chr> "NE", "NE", "NE", "SE", "NW", "SW", "SW", "SW", "SW", "NE…
$ highpm10     <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, …
$ phat         <dbl> 0.2685731, 0.3469301, 0.1142603, 0.2282989, 0.2726122, 0.…
$ logit.fit    <dbl> -1.0018742, -0.6325609, -2.0479444, -1.2179414, -0.981409…
$ logit.se.fit <dbl> 0.1132430, 0.1510233, 0.2138470, 0.1104975, 0.1143774, 0.…
$ logit.lwr    <dbl> -1.2238264, -0.9285611, -2.4670769, -1.4345126, -1.205585…
$ logit.upr    <dbl> -0.7799220, -0.3365607, -1.6288119, -1.0013702, -0.757234…
$ odds.lwr     <dbl> 0.29410267, 0.39512183, 0.08483247, 0.23823144, 0.2995166…
$ odds.upr     <dbl> 0.4584418, 0.7142225, 0.1961625, 0.3673757, 0.4689618, 0.…

5.2.3 Step 3: Confidence interval for the probabilities

Transform the interval limits for the odds into limits for the probabilities:

oslo_pred <- oslo_pred |> mutate(
  p.lwr = odds.lwr/(1 + odds.lwr),
  p.upr = odds.upr/(1 + odds.upr))
glimpse(oslo_pred)
Rows: 500
Columns: 16
$ cars         <dbl> 2308, 3084, 110, 1854, 2351, 2662, 2478, 2387, 984, 2159,…
$ windspeed    <dbl> 4.2, 4.8, 4.3, 3.0, 5.6, 2.3, 1.9, 8.9, 2.0, 5.2, 3.4, 4.…
$ temp2m       <dbl> -4.4, -5.7, -13.5, 1.4, 4.1, 5.8, 2.7, 7.1, 4.1, 1.1, -9.…
$ tempdiff     <chr> "zero", "zero", "zero", "zero", "pos", "zero", "zero", "z…
$ zerodiff     <chr> "zero", "zero", "zero", "zero", "nonzero", "zero", "zero"…
$ windquad     <chr> "NE", "NE", "NE", "SE", "NW", "SW", "SW", "SW", "SW", "NE…
$ highpm10     <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, …
$ phat         <dbl> 0.2685731, 0.3469301, 0.1142603, 0.2282989, 0.2726122, 0.…
$ logit.fit    <dbl> -1.0018742, -0.6325609, -2.0479444, -1.2179414, -0.981409…
$ logit.se.fit <dbl> 0.1132430, 0.1510233, 0.2138470, 0.1104975, 0.1143774, 0.…
$ logit.lwr    <dbl> -1.2238264, -0.9285611, -2.4670769, -1.4345126, -1.205585…
$ logit.upr    <dbl> -0.7799220, -0.3365607, -1.6288119, -1.0013702, -0.757234…
$ odds.lwr     <dbl> 0.29410267, 0.39512183, 0.08483247, 0.23823144, 0.2995166…
$ odds.upr     <dbl> 0.4584418, 0.7142225, 0.1961625, 0.3673757, 0.4689618, 0.…
$ p.lwr        <dbl> 0.22726378, 0.28321672, 0.07819869, 0.19239654, 0.2304831…
$ p.upr        <dbl> 0.3143367, 0.4166452, 0.1639932, 0.2686721, 0.3192471, 0.…
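Since the logistic function \(p = e^{x}/(1+e^{x})\) is increasing, steps 2 and 3 can also be combined by applying plogis() (the standard logistic CDF) directly to the limits for the linear predictor. A condensed sketch of the whole calculation, assuming model_kcars_glm from above:

```r
# One-step version: logit limits -> probability limits via plogis()
pred <- predict(model_kcars_glm, se.fit = TRUE)
lambda <- qnorm(1 - 0.05/2)
p.lwr2 <- plogis(pred$fit - lambda * pred$se.fit)
p.upr2 <- plogis(pred$fit + lambda * pred$se.fit)
# p.lwr2 and p.upr2 agree with p.lwr and p.upr computed step by step above
```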

5.3 Plot with confidence interval

ggplot(oslo_pred, aes(cars, highpm10)) +
  geom_point() +
  geom_line(aes(y = phat), color = "red", linewidth = 1) +
  geom_ribbon(aes(ymin = p.lwr, ymax = p.upr), alpha = 0.2) +
  xlab("number of cars") +
  ylab("High PM10") +
  labs(title = "High PM10 (=1) or Not high PM10 (=0) vs number of cars",
       caption = "red = fitted line, with 95% confidence interval")