Lecture 7 - Binary logistic regression
Example 1 - PM10 particles in Oslo

# Lecture 7 - Oslo particles - logistic regression
library(tidyverse); theme_set(theme_bw() + theme(text = element_text(size = 16)))
library(readxl)
1 Packages
- Update to a newer version (preferably) or
- make sure that the package MASS is installed. It probably is, but look for it on the Packages tab in RStudio; it should be under the System Library heading. You do not have to activate it with library(MASS).
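If you prefer the console to the Packages tab, a one-line check like the following works (a small aside, not part of the original instructions):

```r
# MASS ships with R, so this should normally print TRUE
# without any extra installation step.
"MASS" %in% rownames(installed.packages())
```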
2 Data
oslo <- read_excel("Data/oslo.xlsx")
glimpse(oslo)

Rows: 500
Columns: 7
$ cars <dbl> 2308, 3084, 110, 1854, 2351, 2662, 2478, 2387, 984, 2159, 30…
$ windspeed <dbl> 4.2, 4.8, 4.3, 3.0, 5.6, 2.3, 1.9, 8.9, 2.0, 5.2, 3.4, 4.4, …
$ temp2m <dbl> -4.4, -5.7, -13.5, 1.4, 4.1, 5.8, 2.7, 7.1, 4.1, 1.1, -9.0, …
$ tempdiff <chr> "zero", "zero", "zero", "zero", "pos", "zero", "zero", "zero…
$ zerodiff <chr> "zero", "zero", "zero", "zero", "nonzero", "zero", "zero", "…
$ windquad <chr> "NE", "NE", "NE", "SE", "NW", "SW", "SW", "SW", "SW", "NE", …
$ highpm10 <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, …
There are some extra variables used in Lecture 8.
Plot highpm10 against number of cars, cars:
ggplot(oslo, aes(cars, highpm10)) +
geom_point() +
xlab("number of cars") +
ylab("High PM10") +
labs(title = "High PM10 (=1) or Not high PM10 (=0) vs number of cars")

It's hard to see any relationship, so let's add a moving average with geom_smooth():
ggplot(oslo, aes(cars, highpm10)) +
geom_point() +
geom_smooth() +
xlab("number of cars") +
ylab("High PM10") +
labs(title = "High PM10 (=1) or Not high PM10 (=0) vs number of cars")

It seems like the proportion of high PM10 values increases as the number of cars increases.
3 Simple logistic regression model
3.1 Fit a logistic model
Instead of the lm() (Linear Model) function we now use the glm() (Generalized Linear Model) function. Since it can handle several different distributions, we have to specify family = "binomial":
glm(highpm10 ~ cars, family = "binomial", data = oslo)
Call: glm(formula = highpm10 ~ cars, family = "binomial", data = oslo)
Coefficients:
(Intercept) cars
-2.1002955 0.0004759
Degrees of Freedom: 499 Total (i.e. Null); 498 Residual
Null Deviance: 536.8
Residual Deviance: 511.2 AIC: 515.2
Note the small \(\hat{\beta}_1\) = the effect of one additional car. It is better to use thousands of cars, cars/1000, as the x-variable instead of cars, both for numerical reasons and for the interpretation of the parameters.
model_kcars_glm <- glm(highpm10 ~ I(cars/1000),
family = "binomial", data = oslo)
model_kcars_glm
Call: glm(formula = highpm10 ~ I(cars/1000), family = "binomial", data = oslo)
Coefficients:
(Intercept) I(cars/1000)
-2.1003 0.4759
Degrees of Freedom: 499 Total (i.e. Null); 498 Residual
Null Deviance: 536.8
Residual Deviance: 511.2 AIC: 515.2
The \(\beta\) for I(cars/1000) is 1000 times the \(\beta\) for cars.
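This scaling relation is exact: dividing the x-variable by 1000 multiplies its slope by 1000 while leaving the intercept and the fitted values unchanged. A quick sanity check with simulated data (since the Excel file is not bundled with these notes):

```r
set.seed(2)
x <- runif(300, 0, 4000)  # stand-in for a cars-like variable
y <- rbinom(300, size = 1, prob = plogis(-2 + 0.0005 * x))
b_raw <- coef(glm(y ~ x, family = "binomial"))["x"]
b_k <- coef(glm(y ~ I(x/1000), family = "binomial"))["I(x/1000)"]
unname(b_k - 1000 * b_raw)  # essentially zero
```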
3.2 Summary for logistic model
summary(model_kcars_glm)
Call:
glm(formula = highpm10 ~ I(cars/1000), family = "binomial", data = oslo)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.10030 0.22306 -9.416 < 2e-16 ***
I(cars/1000) 0.47592 0.09675 4.919 8.69e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 536.85 on 499 degrees of freedom
Residual deviance: 511.23 on 498 degrees of freedom
AIC: 515.23
Number of Fisher Scoring iterations: 4
Look: our friend AIC again. But there is no R\(^2\), residual standard error or F-test. Instead we have something called “deviance”: deviance = \(-2\cdot\)log-likelihood. More on this next lecture.
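For ungrouped 0/1 data the saturated model has log-likelihood 0, so the residual deviance reported by glm() equals \(-2\cdot\) the maximized log-likelihood. A self-contained sketch with simulated data:

```r
set.seed(1)
x <- runif(200)
y <- rbinom(200, size = 1, prob = plogis(-1 + 2 * x))
fit <- glm(y ~ x, family = "binomial")
deviance(fit)                 # residual deviance
-2 * as.numeric(logLik(fit))  # the same number
```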
3.3 Wald test for beta-parameters
model_kcars_sum <- summary(model_kcars_glm)
model_kcars_sum$coefficients

               Estimate Std. Error   z value     Pr(>|z|)
(Intercept) -2.1002955 0.22305548 -9.416023 4.684984e-21
I(cars/1000) 0.4759191 0.09674806 4.919159 8.691675e-07
b1 <- model_kcars_sum$coefficients[2, "Estimate"]
se.b1 <- model_kcars_sum$coefficients[2, "Std. Error"]
z.b1 <- model_kcars_sum$coefficients[2, "z value"]

Since \(|z| = \frac{|\hat{\beta}_1 - 0|}{d(\hat{\beta}_1)} = \frac{|0.48 - 0|}{0.1} = 4.92 > \lambda_{0.025} = 1.96\) we can reject H\(_0\): \(\beta_1 = 0\).
Alternatively: since \(Pr(|N(0,1)| > 4.92) = 2\cdot Pr(N(0,1) > 4.92) = 8.7\cdot 10^{-7} < 0.05\) we can reject H\(_0\).
The number of cars (or, rather, the number of thousands of cars) has a significant impact on the probability of a high concentration of PM\(_{10}\)-particles.
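The p-value in the summary table can be reproduced by hand from the z-value with pnorm(). A sketch, again on simulated data since oslo.xlsx is not bundled here:

```r
set.seed(1)
x <- runif(200)
y <- rbinom(200, size = 1, prob = plogis(-1 + 2 * x))
ctab <- summary(glm(y ~ x, family = "binomial"))$coefficients
z <- ctab[2, "Estimate"] / ctab[2, "Std. Error"]  # Wald z-statistic
p <- 2 * pnorm(abs(z), lower.tail = FALSE)        # two-sided p-value
c(z, ctab[2, "z value"])   # these two agree
c(p, ctab[2, "Pr(>|z|)"])  # and so do these
```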
4 Confidence intervals for beta parameters
4.1 Interval for beta (intercept log odds and log odds ratio)
The confint()-function calculates profile likelihood based confidence intervals for the \(\beta\)-parameters.
bhat <- model_kcars_glm$coefficients
ci.beta <- confint(model_kcars_glm)
cbind(beta = bhat, ci.beta) |> round(digits = 2)

              beta 2.5 % 97.5 %
(Intercept) -2.10 -2.55 -1.68
I(cars/1000) 0.48 0.29 0.67
4.2 Interval for Odds and Odds Ratio, exp(beta)
# exp(beta0), exp(beta1)
or <- exp(bhat)
ci.or <- exp(ci.beta)
cbind(`exp(beta)` = or, ci.or) |> round(digits = 2)

             exp(beta) 2.5 % 97.5 %
(Intercept) 0.12 0.08 0.19
I(cars/1000) 1.61 1.34 1.95
4.3 Wald-based intervals by hand
se.bhat <- summary(model_kcars_glm)$coefficients[, "Std. Error"]
ci.wald <- cbind(lo = bhat - 1.96*se.bhat, hi = bhat + 1.96*se.bhat)
ci.wald |> round(digits = 2)

                lo    hi
(Intercept) -2.54 -1.66
I(cars/1000) 0.29 0.67
exp(ci.wald) |> round(digits = 2)

               lo   hi
(Intercept) 0.08 0.19
I(cars/1000) 1.33 1.95
The differences between the two versions are small, thanks to the large data set.
5 Linear predictor and probabilities
5.1 Estimated probabilities
We can use the predict()-function in order to get either the linear predictor (log-odds), which is the default, or the estimated probabilities by supplying type = "response":
We can get rid of the confidence band around the moving average with se = FALSE (se = standard error).
oslo_pred <- oslo |>
mutate(phat = predict(model_kcars_glm, type = "response"))
ggplot(oslo_pred, aes(cars, highpm10)) +
geom_point() +
geom_smooth(se = FALSE, linetype = "dashed") +
geom_line(aes(y = phat), color = "red", size = 1) +
xlab("number of cars") +
ylab("High PM10") +
labs(title = "High PM10 (=1) or Not high PM10 (=0) vs number of cars",
caption = "red = fitted line, blue dashed = moving average")

5.2 Confidence interval for probabilities
When we had a linear regression model from lm(), we could get confidence and prediction intervals directly with the predict() function. This is not possible for a logistic regression. Instead, we have to get the standard errors from predict(model, se.fit = TRUE) and then calculate Wald-type confidence intervals.
5.2.1 Step 1: Conf.int. for the linear predictor, logodds
We have to use cbind() here since predict(se.fit = TRUE) returns a list (fit, se.fit and the residual.scale element) rather than a single vector.
oslo_pred <- cbind(
oslo_pred,
logit = predict(model_kcars_glm, se.fit = TRUE))
glimpse(oslo_pred)

Rows: 500
Columns: 11
$ cars <dbl> 2308, 3084, 110, 1854, 2351, 2662, 2478, 2387, 98…
$ windspeed <dbl> 4.2, 4.8, 4.3, 3.0, 5.6, 2.3, 1.9, 8.9, 2.0, 5.2,…
$ temp2m <dbl> -4.4, -5.7, -13.5, 1.4, 4.1, 5.8, 2.7, 7.1, 4.1, …
$ tempdiff <chr> "zero", "zero", "zero", "zero", "pos", "zero", "z…
$ zerodiff <chr> "zero", "zero", "zero", "zero", "nonzero", "zero"…
$ windquad <chr> "NE", "NE", "NE", "SE", "NW", "SW", "SW", "SW", "…
$ highpm10 <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0…
$ phat <dbl> 0.2685731, 0.3469301, 0.1142603, 0.2282989, 0.272…
$ logit.fit <dbl> -1.0018742, -0.6325609, -2.0479444, -1.2179414, -…
$ logit.se.fit <dbl> 0.1132430, 0.1510233, 0.2138470, 0.1104975, 0.114…
$ logit.residual.scale <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
The residual.scale-variable is unnecessary. Let’s remove it:
oslo_pred <- oslo_pred |> mutate(logit.residual.scale = NULL)

Calculate the lower and upper confidence interval limits for the linear predictor.
The Wald-method is OK here since we do not want to use the interval for any hypothesis testing. There is also no feasible alternative method!
lambda <- qnorm(1 - 0.05/2)
oslo_pred <- oslo_pred |> mutate(
logit.lwr = logit.fit - lambda*logit.se.fit,
logit.upr = logit.fit + lambda*logit.se.fit)
glimpse(oslo_pred)

Rows: 500
Columns: 12
$ cars <dbl> 2308, 3084, 110, 1854, 2351, 2662, 2478, 2387, 984, 2159,…
$ windspeed <dbl> 4.2, 4.8, 4.3, 3.0, 5.6, 2.3, 1.9, 8.9, 2.0, 5.2, 3.4, 4.…
$ temp2m <dbl> -4.4, -5.7, -13.5, 1.4, 4.1, 5.8, 2.7, 7.1, 4.1, 1.1, -9.…
$ tempdiff <chr> "zero", "zero", "zero", "zero", "pos", "zero", "zero", "z…
$ zerodiff <chr> "zero", "zero", "zero", "zero", "nonzero", "zero", "zero"…
$ windquad <chr> "NE", "NE", "NE", "SE", "NW", "SW", "SW", "SW", "SW", "NE…
$ highpm10 <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, …
$ phat <dbl> 0.2685731, 0.3469301, 0.1142603, 0.2282989, 0.2726122, 0.…
$ logit.fit <dbl> -1.0018742, -0.6325609, -2.0479444, -1.2179414, -0.981409…
$ logit.se.fit <dbl> 0.1132430, 0.1510233, 0.2138470, 0.1104975, 0.1143774, 0.…
$ logit.lwr <dbl> -1.2238264, -0.9285611, -2.4670769, -1.4345126, -1.205585…
$ logit.upr <dbl> -0.7799220, -0.3365607, -1.6288119, -1.0013702, -0.757234…
5.2.2 Step 2: Confidence interval for the odds
Transform the interval limits for the linear predictor (logodds) into limits for the odds:
oslo_pred <- oslo_pred |> mutate(
odds.lwr = exp(logit.lwr),
odds.upr = exp(logit.upr))
glimpse(oslo_pred)

Rows: 500
Columns: 14
$ cars <dbl> 2308, 3084, 110, 1854, 2351, 2662, 2478, 2387, 984, 2159,…
$ windspeed <dbl> 4.2, 4.8, 4.3, 3.0, 5.6, 2.3, 1.9, 8.9, 2.0, 5.2, 3.4, 4.…
$ temp2m <dbl> -4.4, -5.7, -13.5, 1.4, 4.1, 5.8, 2.7, 7.1, 4.1, 1.1, -9.…
$ tempdiff <chr> "zero", "zero", "zero", "zero", "pos", "zero", "zero", "z…
$ zerodiff <chr> "zero", "zero", "zero", "zero", "nonzero", "zero", "zero"…
$ windquad <chr> "NE", "NE", "NE", "SE", "NW", "SW", "SW", "SW", "SW", "NE…
$ highpm10 <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, …
$ phat <dbl> 0.2685731, 0.3469301, 0.1142603, 0.2282989, 0.2726122, 0.…
$ logit.fit <dbl> -1.0018742, -0.6325609, -2.0479444, -1.2179414, -0.981409…
$ logit.se.fit <dbl> 0.1132430, 0.1510233, 0.2138470, 0.1104975, 0.1143774, 0.…
$ logit.lwr <dbl> -1.2238264, -0.9285611, -2.4670769, -1.4345126, -1.205585…
$ logit.upr <dbl> -0.7799220, -0.3365607, -1.6288119, -1.0013702, -0.757234…
$ odds.lwr <dbl> 0.29410267, 0.39512183, 0.08483247, 0.23823144, 0.2995166…
$ odds.upr <dbl> 0.4584418, 0.7142225, 0.1961625, 0.3673757, 0.4689618, 0.…
5.2.3 Step 3: Confidence interval for the probabilities
Transform the interval limits for the odds into limits for the probabilities:
oslo_pred <- oslo_pred |> mutate(
p.lwr = odds.lwr/(1 + odds.lwr),
p.upr = odds.upr/(1 + odds.upr))
glimpse(oslo_pred)

Rows: 500
Columns: 16
$ cars <dbl> 2308, 3084, 110, 1854, 2351, 2662, 2478, 2387, 984, 2159,…
$ windspeed <dbl> 4.2, 4.8, 4.3, 3.0, 5.6, 2.3, 1.9, 8.9, 2.0, 5.2, 3.4, 4.…
$ temp2m <dbl> -4.4, -5.7, -13.5, 1.4, 4.1, 5.8, 2.7, 7.1, 4.1, 1.1, -9.…
$ tempdiff <chr> "zero", "zero", "zero", "zero", "pos", "zero", "zero", "z…
$ zerodiff <chr> "zero", "zero", "zero", "zero", "nonzero", "zero", "zero"…
$ windquad <chr> "NE", "NE", "NE", "SE", "NW", "SW", "SW", "SW", "SW", "NE…
$ highpm10 <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, …
$ phat <dbl> 0.2685731, 0.3469301, 0.1142603, 0.2282989, 0.2726122, 0.…
$ logit.fit <dbl> -1.0018742, -0.6325609, -2.0479444, -1.2179414, -0.981409…
$ logit.se.fit <dbl> 0.1132430, 0.1510233, 0.2138470, 0.1104975, 0.1143774, 0.…
$ logit.lwr <dbl> -1.2238264, -0.9285611, -2.4670769, -1.4345126, -1.205585…
$ logit.upr <dbl> -0.7799220, -0.3365607, -1.6288119, -1.0013702, -0.757234…
$ odds.lwr <dbl> 0.29410267, 0.39512183, 0.08483247, 0.23823144, 0.2995166…
$ odds.upr <dbl> 0.4584418, 0.7142225, 0.1961625, 0.3673757, 0.4689618, 0.…
$ p.lwr <dbl> 0.22726378, 0.28321672, 0.07819869, 0.19239654, 0.2304831…
$ p.upr <dbl> 0.3143367, 0.4166452, 0.1639932, 0.2686721, 0.3192471, 0.…
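As an aside (not one of the original steps), steps 2 and 3 can be collapsed into one: plogis() is the inverse logit, exp(x)/(1 + exp(x)), so it takes the log-odds limits straight to probability limits. A quick check with an arbitrary log-odds value:

```r
logit <- -1.22                   # an example log-odds value
odds <- exp(logit)               # step 2: log-odds -> odds
p_two_step <- odds / (1 + odds)  # step 3: odds -> probability
p_one_step <- plogis(logit)      # both steps at once
all.equal(p_two_step, p_one_step)  # TRUE
```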
5.3 Plot with confidence interval
ggplot(oslo_pred, aes(cars, highpm10)) +
geom_point() +
geom_line(aes(y = phat), color = "red", size = 1) +
geom_ribbon(aes(ymin = p.lwr, ymax = p.upr), alpha = 0.2) +
xlab("number of cars") +
ylab("High PM10") +
labs(title = "High PM10 (=1) or Not high PM10 (=0) vs number of cars",
caption = "red = fitted line, with 95% confidence interval")