Difference between margins/cplot and predict()/plogis() for glm predictions and CIs? #92

mark-williamson · Apr 6, 2018

Hi @leeper, I have a question about what margins::cplot() is doing under the hood and how it differs from the predicted values and CIs produced using the base functions plogis() and predict().

In particular, I'm trying to plot the predicted values and uncertainty from a logistic regression with an interaction between a binary and continuous independent variables.

library(ggplot2)
library(margins)
set.seed(5)

data <- data.frame(outcome = rbinom(100, 1, 0.5),
                   var1 = rbinom(100, 1, 0.1),
                   var2 = rnorm(100, 10, 7)
                   )

m <- glm(outcome ~ var1*var2, data = data, family = binomial(link = "logit"))

Using the regular predict() function requires creating hypothetical data and predicting values on the "link" response scale and then converting them to probabilities using plogis().

pred <- data.frame(expand.grid(var1 = c(0, 1),
                               var2 = seq(min(data$var2), max(data$var2), 0.5)))
pred <- cbind(pred, predict(m, newdata = pred,
                             type = "link", se.fit = TRUE))

pred$PP <- plogis(pred$fit)
pred$LL <- plogis(pred$fit - (1.96 * pred$se.fit))
pred$UL <- plogis(pred$fit + (1.96 * pred$se.fit))

ggplot(pred, aes(var2, PP, fill = factor(var1))) + 
  geom_line(aes(col = factor(var1))) + 
  geom_ribbon(aes(x = var2, ymin = LL, ymax = UL), alpha = 0.4)

When I attempt to create the same plot with cplot(), I get different results for the CIs. Notably, the output seems to produce CIs that can fall outside of [0,1] bounds of predicted probabilities, whereas the plogis transformation guarantees the output will be bounded at 0 and 1.

out1 <- cplot(m, x = "var2", dx = "var1", what = "prediction", 
              data = data[data$var1 == 0,],
              draw = FALSE)
out2 <- cplot(m, x = "var2", dx = "var1", what = "prediction", 
              data = data[data$var1 == 1,],
              draw = FALSE)
ggplot(out1) + 
  geom_hline(aes(yintercept = 0), lty = "dotted") +
  geom_hline(aes(yintercept = 1), lty = "dotted") +
  geom_line(aes(xvals, yvals, col = "0")) + 
  geom_ribbon(aes(x = xvals, ymin = lower, ymax = upper, fill = "0"), alpha = 0.4) + 
  geom_line(data = out2, aes(xvals, yvals, col = "1")) + 
  geom_ribbon(data = out2, 
              aes(x = xvals, ymin = lower, ymax = upper, fill = "1"), 
              alpha = 0.4)

I'm sorry in advance if this question is simply a result of me misunderstanding the nature of predicted values for logistic models. Is this simply a matter of using the original vs. hypothetical data?

I've really enjoyed using margins and I'm just hoping to clarify what exactly it is giving me as output. Any advice is appreciated!

Well, I guess I'm not sure what the "correct" behavior actually is here. prediction::prediction() (which is working under the hood for margins::cplot()) is not doing any any substantive transformation here, it's just wrapping base predict() with type = "response" so any output that you get from prediction() (and thus from cplot()) should - and does - match what you would get from just working directly with predict():

pred <- data.frame(expand.grid(var1 = c(0, 1), var2 = seq(min(data$var2), max(data$var2), 0.5)))
pred <- cbind(pred, predict(m, newdata = pred, type = "response", se.fit = TRUE))
pred$UL <- pred$fit + (1.96*pred$se.fit)
pred$LL <- pred$fit - (1.96*pred$se.fit)
ggplot(pred, aes(var2, fit, fill = factor(var1))) + 
  geom_line(aes(col = factor(var1))) + 
  geom_ribbon(aes(x = var2, ymin = LL, ymax = UL), alpha = 0.4)

This is also equivalent to Stata's behavior on these example data:

use data.dta, clear
quietly logit outcome c.var1##c.var2
quietly margins, at(var2 = (-8(0.5)28) var1 = (0 1))
marginsplot

I suppose it's reasonable to expect a different output (akin to what you describe using plogis() and some googling suggests a fair amount of debate on what these intervals are and what you might want in a given circumstance. I'm totally open to suggestions on better behavior but as written, prediction() simply tidies the output of predict() rather than substantively modifying that output. If cplot() should do something different from the base behavior, I'm open to creating options for that, including changing the default.

leeper · Apr 14, 2018

Well, I guess I'm not sure what the "correct" behavior actually is here. prediction::prediction() (which is working under the hood for margins::cplot()) is not doing any any substantive transformation here, it's just wrapping base predict() with type = "response" so any output that you get from prediction() (and thus from cplot()) should - and does - match what you would get from just working directly with predict():

pred <- data.frame(expand.grid(var1 = c(0, 1), var2 = seq(min(data$var2), max(data$var2), 0.5)))
pred <- cbind(pred, predict(m, newdata = pred, type = "response", se.fit = TRUE))
pred$UL <- pred$fit + (1.96*pred$se.fit)
pred$LL <- pred$fit - (1.96*pred$se.fit)
ggplot(pred, aes(var2, fit, fill = factor(var1))) + 
  geom_line(aes(col = factor(var1))) + 
  geom_ribbon(aes(x = var2, ymin = LL, ymax = UL), alpha = 0.4)

This is also equivalent to Stata's behavior on these example data:

use data.dta, clear
quietly logit outcome c.var1##c.var2
quietly margins, at(var2 = (-8(0.5)28) var1 = (0 1))
marginsplot

I suppose it's reasonable to expect a different output (akin to what you describe using plogis() and some googling suggests a fair amount of debate on what these intervals are and what you might want in a given circumstance. I'm totally open to suggestions on better behavior but as written, prediction() simply tidies the output of predict() rather than substantively modifying that output. If cplot() should do something different from the base behavior, I'm open to creating options for that, including changing the default.

I think the problem here is that type = "response" gives you the predicted probabilities, while type = "link" gives you the predictions on the log-odds scale (see ?predict.glm). Thus, the formula pred$fit +/- (1.96*pred$se.fit) won't work for predicted probabilities, as you need to back transform predictions and se, then do +/- 1.96*se and then back-back-transform. Or, you use type = "link", then you need to back-transform only once, after doing +/- 1.96*se. This ensures that predictions and CI are always within the correct (=possible) range.

strengejacke · Jun 25, 2018

I think the problem here is that type = "response" gives you the predicted probabilities, while type = "link" gives you the predictions on the log-odds scale (see ?predict.glm). Thus, the formula pred$fit +/- (1.96*pred$se.fit) won't work for predicted probabilities, as you need to back transform predictions and se, then do +/- 1.96*se and then back-back-transform. Or, you use type = "link", then you need to back-transform only once, after doing +/- 1.96*se. This ensures that predictions and CI are always within the correct (=possible) range.

See a similar discussion here: glmmTMB/glmmTMB#324 (comment)

strengejacke · Jun 28, 2018

See a similar discussion here: glmmTMB/glmmTMB#324 (comment)

mark-williamson closed this Apr 6, 2018

mark-williamson reopened this Apr 6, 2018

leeper added the question label Apr 13, 2018

leeper closed this in 734f4ec Aug 1, 2018

strengejacke referenced this issue Aug 3, 2018
Closed
Confidence intervals with survey weighted data #38

leeper/margins

Difference between margins/cplot and predict()/plogis() for glm predictions and CIs? #92

mark-williamson commented Apr 6, 2018 •

edited

Edited 3 times

mark-williamson edited Apr 6, 2018 (most recent)

mark-williamson edited Apr 6, 2018

mark-williamson edited Apr 6, 2018

mark-williamson closed this Apr 6, 2018

mark-williamson reopened this Apr 6, 2018

leeper added the question label Apr 13, 2018

This comment has been minimized.

leeper commented Apr 14, 2018

This comment has been minimized.

strengejacke commented Jun 25, 2018 •

edited

Edited 1 time

strengejacke edited Jun 25, 2018 (most recent)

strengejacke created Jun 25, 2018

This comment has been minimized.

strengejacke commented Jun 28, 2018

leeper closed this in `734f4ec` Aug 1, 2018

strengejacke referenced this issue Aug 3, 2018

Confidence intervals with survey weighted data #38

leeper/margins

Join GitHub today

Difference between margins/cplot and predict()/plogis() for glm predictions and CIs? #92

Comments

mark-williamson commented Apr 6, 2018 • edited Edited 3 times mark-williamson edited Apr 6, 2018 (most recent) mark-williamson edited Apr 6, 2018 mark-williamson edited Apr 6, 2018

mark-williamson closed this Apr 6, 2018

mark-williamson reopened this Apr 6, 2018

leeper added the question label Apr 13, 2018

This comment has been minimized.

leeper Apr 14, 2018

leeper commented Apr 14, 2018

This comment has been minimized.

strengejacke Jun 25, 2018

strengejacke commented Jun 25, 2018 • edited Edited 1 time strengejacke edited Jun 25, 2018 (most recent) strengejacke created Jun 25, 2018

This comment has been minimized.

strengejacke Jun 28, 2018

strengejacke commented Jun 28, 2018

leeper closed this in 734f4ec Aug 1, 2018

strengejacke referenced this issue Aug 3, 2018

Confidence intervals with survey weighted data #38

mark-williamson commented Apr 6, 2018 •

edited

Edited 3 times

mark-williamson edited Apr 6, 2018 (most recent)

mark-williamson edited Apr 6, 2018

mark-williamson edited Apr 6, 2018

strengejacke commented Jun 25, 2018 •

edited

Edited 1 time

strengejacke edited Jun 25, 2018 (most recent)

strengejacke created Jun 25, 2018

leeper closed this in `734f4ec` Aug 1, 2018