New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Difference between margins/cplot and predict()/plogis() for glm predictions and CIs? #92

Closed
mark-williamson opened this Issue Apr 6, 2018 · 3 comments

Comments

Projects
None yet
3 participants
@mark-williamson

mark-williamson commented Apr 6, 2018

Hi @leeper, I have a question about what margins::cplot() is doing under the hood and how it differs from the predicted values and CIs produced using the base functions plogis() and predict().

In particular, I'm trying to plot the predicted values and uncertainty from a logistic regression with an interaction between a binary and continuous independent variables.

library(ggplot2)
library(margins)
set.seed(5)

data <- data.frame(outcome = rbinom(100, 1, 0.5),
                   var1 = rbinom(100, 1, 0.1),
                   var2 = rnorm(100, 10, 7)
                   )

m <- glm(outcome ~ var1*var2, data = data, family = binomial(link = "logit"))

Using the regular predict() function requires creating hypothetical data and predicting values on the "link" response scale and then converting them to probabilities using plogis().

pred <- data.frame(expand.grid(var1 = c(0, 1),
                               var2 = seq(min(data$var2), max(data$var2), 0.5)))
pred <- cbind(pred, predict(m, newdata = pred,
                             type = "link", se.fit = TRUE))

pred$PP <- plogis(pred$fit)
pred$LL <- plogis(pred$fit - (1.96 * pred$se.fit))
pred$UL <- plogis(pred$fit + (1.96 * pred$se.fit))

ggplot(pred, aes(var2, PP, fill = factor(var1))) + 
  geom_line(aes(col = factor(var1))) + 
  geom_ribbon(aes(x = var2, ymin = LL, ymax = UL), alpha = 0.4)

image

When I attempt to create the same plot with cplot(), I get different results for the CIs. Notably, the output seems to produce CIs that can fall outside of [0,1] bounds of predicted probabilities, whereas the plogis transformation guarantees the output will be bounded at 0 and 1.

out1 <- cplot(m, x = "var2", dx = "var1", what = "prediction", 
              data = data[data$var1 == 0,],
              draw = FALSE)
out2 <- cplot(m, x = "var2", dx = "var1", what = "prediction", 
              data = data[data$var1 == 1,],
              draw = FALSE)
ggplot(out1) + 
  geom_hline(aes(yintercept = 0), lty = "dotted") +
  geom_hline(aes(yintercept = 1), lty = "dotted") +
  geom_line(aes(xvals, yvals, col = "0")) + 
  geom_ribbon(aes(x = xvals, ymin = lower, ymax = upper, fill = "0"), alpha = 0.4) + 
  geom_line(data = out2, aes(xvals, yvals, col = "1")) + 
  geom_ribbon(data = out2, 
              aes(x = xvals, ymin = lower, ymax = upper, fill = "1"), 
              alpha = 0.4)

image

I'm sorry in advance if this question is simply a result of me misunderstanding the nature of predicted values for logistic models. Is this simply a matter of using the original vs. hypothetical data?

I've really enjoyed using margins and I'm just hoping to clarify what exactly it is giving me as output. Any advice is appreciated!

@leeper

This comment has been minimized.

Show comment
Hide comment
@leeper

leeper Apr 14, 2018

Owner

Well, I guess I'm not sure what the "correct" behavior actually is here. prediction::prediction() (which is working under the hood for margins::cplot()) is not doing any any substantive transformation here, it's just wrapping base predict() with type = "response" so any output that you get from prediction() (and thus from cplot()) should - and does - match what you would get from just working directly with predict():

pred <- data.frame(expand.grid(var1 = c(0, 1), var2 = seq(min(data$var2), max(data$var2), 0.5)))
pred <- cbind(pred, predict(m, newdata = pred, type = "response", se.fit = TRUE))
pred$UL <- pred$fit + (1.96*pred$se.fit)
pred$LL <- pred$fit - (1.96*pred$se.fit)
ggplot(pred, aes(var2, fit, fill = factor(var1))) + 
  geom_line(aes(col = factor(var1))) + 
  geom_ribbon(aes(x = var2, ymin = LL, ymax = UL), alpha = 0.4)

r

This is also equivalent to Stata's behavior on these example data:

use data.dta, clear
quietly logit outcome c.var1##c.var2
quietly margins, at(var2 = (-8(0.5)28) var1 = (0 1))
marginsplot

stata

I suppose it's reasonable to expect a different output (akin to what you describe using plogis() and some googling suggests a fair amount of debate on what these intervals are and what you might want in a given circumstance. I'm totally open to suggestions on better behavior but as written, prediction() simply tidies the output of predict() rather than substantively modifying that output. If cplot() should do something different from the base behavior, I'm open to creating options for that, including changing the default.

Owner

leeper commented Apr 14, 2018

Well, I guess I'm not sure what the "correct" behavior actually is here. prediction::prediction() (which is working under the hood for margins::cplot()) is not doing any any substantive transformation here, it's just wrapping base predict() with type = "response" so any output that you get from prediction() (and thus from cplot()) should - and does - match what you would get from just working directly with predict():

pred <- data.frame(expand.grid(var1 = c(0, 1), var2 = seq(min(data$var2), max(data$var2), 0.5)))
pred <- cbind(pred, predict(m, newdata = pred, type = "response", se.fit = TRUE))
pred$UL <- pred$fit + (1.96*pred$se.fit)
pred$LL <- pred$fit - (1.96*pred$se.fit)
ggplot(pred, aes(var2, fit, fill = factor(var1))) + 
  geom_line(aes(col = factor(var1))) + 
  geom_ribbon(aes(x = var2, ymin = LL, ymax = UL), alpha = 0.4)

r

This is also equivalent to Stata's behavior on these example data:

use data.dta, clear
quietly logit outcome c.var1##c.var2
quietly margins, at(var2 = (-8(0.5)28) var1 = (0 1))
marginsplot

stata

I suppose it's reasonable to expect a different output (akin to what you describe using plogis() and some googling suggests a fair amount of debate on what these intervals are and what you might want in a given circumstance. I'm totally open to suggestions on better behavior but as written, prediction() simply tidies the output of predict() rather than substantively modifying that output. If cplot() should do something different from the base behavior, I'm open to creating options for that, including changing the default.

@strengejacke

This comment has been minimized.

Show comment
Hide comment
@strengejacke

strengejacke Jun 25, 2018

I think the problem here is that type = "response" gives you the predicted probabilities, while type = "link" gives you the predictions on the log-odds scale (see ?predict.glm). Thus, the formula pred$fit +/- (1.96*pred$se.fit) won't work for predicted probabilities, as you need to back transform predictions and se, then do +/- 1.96*se and then back-back-transform. Or, you use type = "link", then you need to back-transform only once, after doing +/- 1.96*se. This ensures that predictions and CI are always within the correct (=possible) range.

strengejacke commented Jun 25, 2018

I think the problem here is that type = "response" gives you the predicted probabilities, while type = "link" gives you the predictions on the log-odds scale (see ?predict.glm). Thus, the formula pred$fit +/- (1.96*pred$se.fit) won't work for predicted probabilities, as you need to back transform predictions and se, then do +/- 1.96*se and then back-back-transform. Or, you use type = "link", then you need to back-transform only once, after doing +/- 1.96*se. This ensures that predictions and CI are always within the correct (=possible) range.

@strengejacke

This comment has been minimized.

Show comment
Hide comment
@strengejacke

strengejacke commented Jun 28, 2018

See a similar discussion here: glmmTMB/glmmTMB#324 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment