Separate backend for tidy prediction #41
Comments
topepo
Aug 7, 2018
Owner
Returns a tidy tibble
That would be the most tidy thing to do but my first thought is that it should return a vector unless the prediction contains multiple columns (e.g. class probs, multivariate Y). Let me ponder this a bit.
Never drops missing data
I agree on this. It is something that can happen after prediction if people really don't want those data.
As a corollary, we should always return the same length/nrow as the input even when the prediction produces multiple columns. That may not be automatically 100% tidy, but I think that it is good practice.
Consistent naming of fitted values
We should think about the names though. Let's say that we choose pred as the name (just for argument). For numeric outputs, what about quantile regression, risk/probability estimates, and others that have a different context than a simple numeric prediction? Do we have different names and document them or should we keep a single label?
Uncertainty in predictions
Yes, but not by default (if that's what you mean) since 1) most models don't implement them and 2) they might be computationally taxing and, if you don't want them, why would you take the time to generate them? We should have a formal syntax for getting them though; I just disagree with the defaultedness (sp?)
Potentially we could use the prediction package by Thomas Leeper
Maybe but I don't want to automatically return the original data along with the prediction.
alexpghayes
Aug 7, 2018
That would be the most tidy thing to do but my first thought is that it should return a vector unless the prediction contains multiple columns (e.g. class probs, multivariate Y). Let me ponder this a bit.
The major argument for always having tibble predictions is type safety. I think it's the right move, especially since yardstick expects predictions to live in a tibble and we've been telling everyone to put everything in a tibble for some time now.
As a corollary, we should always return the same length/nrow as the input even when the prediction produces multiple columns. That may not be automatically 100% tidy, but I think that it is good practice.
This is something I'm running into with broom::augment() right now that isn't as straightforward as I would have hoped. Requiring that nrow(predictions) == nrow(data) makes sense for univariate prediction, but multivariate models sometimes have a more natural representation in long form. For example, consider the joineRML augment method:
library(joineRML)
#> Loading required package: nlme
#> Loading required package: survival
data(heart.valve)
hvd <- heart.valve[!is.na(heart.valve$log.grad) &
!is.na(heart.valve$log.lvmi) &
heart.valve$num <= 50, ]
fit <- mjoint(
formLongFixed = list(
"grad" = log.grad ~ time + sex + hs,
"lvmi" = log.lvmi ~ time + sex
),
formLongRandom = list(
"grad" = ~ 1 | num,
"lvmi" = ~ time | num
),
formSurv = Surv(fuyrs, status) ~ age,
data = hvd,
inits = list("gamma" = c(0.11, 1.51, 0.80)),
timeVar = "time"
)
#> Running multivariate LMM EM algorithm to establish initial parameters...
#> Finished multivariate LMM EM algorithm...
#> EM algorithm has converged!
#> Calculating post model fit statistics...
au <- broom::augment(fit)
colnames(au)
#> [1] "num" "sex" "age" "time"
#> [5] "fuyrs" "status" "grad" "log.grad"
#> [9] "lvmi" "log.lvmi" "ef" "bsa"
#> [13] "lvh" "prenyha" "redo" "size"
#> [17] "con.cabg" "creat" "dm" "acei"
#> [21] "lv" "emergenc" "hc" "sten.reg.mix"
#> [25] "hs" ".fitted_grad_0" ".fitted_lvmi_0" ".fitted_grad_1"
#> [29] ".fitted_lvmi_1" ".resid_grad_0" ".resid_lvmi_0" ".resid_grad_1"
#> [33] ".resid_lvmi_1"
Created on 2018-08-07 by the reprex package (v0.2.0).
Here it would make sense to have an id / outcome column and .fitted and .resid rather than .fitted_outcome_1, .resid_outcome_1, etc.

Dave pretty strongly believes that it is better for users to spread data than to gather it. I'm not entirely sure how I feel about this, but the nice thing about returning data in a long format is you have predictable columns and unpredictable rows, as opposed to predictable rows and unpredictable columns. I think this is better for other developers extending existing work, since they are less likely to make assumptions about which rows are present than which columns are present.
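For what it's worth, here is a quick sketch of that long-form reshaping using the au tibble from the reprex above (the reshaping call and the resulting column names are just an illustration, not a proposed standard; pivot_longer() is the current tidyr verb, gather() was the equivalent at the time):

```r
# Reshape the wide .fitted_* / .resid_* columns into long form: one row per
# observation-outcome pair, with predictable columns.
au_long <- tidyr::pivot_longer(
  au,
  cols = c(starts_with(".fitted"), starts_with(".resid")),
  names_to = c(".value", "outcome"),
  names_pattern = "(\\.fitted|\\.resid)_(.*)"
)
# au_long now has the original columns plus: outcome, .fitted, .resid
```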
We should think about the names though. Let's say that we choose pred as the name (just for argument). For numeric outputs, what about quantile regression, risk/probability estimates, and others that have a different context than a simple numeric prediction? Do we have different names and document them or should we keep a single label?
I have some brainstorming about this at tidymodels/broom#452. My impression from working with broom is that we should use the most specific name possible. In some sense having a statistic column is nice because of the consistency, but I pretty much always find myself wondering what statistic is actually in that column. The documentation is cleaner that way too, since there's less conceptual overloading of the same term.
I just disagree with the defaultedness (sp?)
I agree, uncertainty should not be reported by default. What I'm playing with in augment() right now is an se_fit argument that defaults to FALSE. Then for methods that don't support some measure of uncertainty, there just is no se_fit argument.
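A rough sketch of that opt-in pattern, written against plain lm() for concreteness (the function name and the .fitted / .se.fit columns follow broom's conventions; this is an illustration, not the actual augment() implementation):

```r
augment_lm <- function(x, newdata, se_fit = FALSE, ...) {
  # predict.lm() returns a list with $fit and $se.fit when se.fit = TRUE,
  # otherwise a plain numeric vector
  pred <- predict(x, newdata = newdata, se.fit = se_fit)
  out  <- tibble::as_tibble(newdata)
  if (se_fit) {
    out$.fitted <- pred$fit
    out$.se.fit <- pred$se.fit
  } else {
    out$.fitted <- pred
  }
  out
}

fit <- lm(mpg ~ wt, data = mtcars)
augment_lm(fit, mtcars)                 # no uncertainty columns
augment_lm(fit, mtcars, se_fit = TRUE)  # opt in to .se.fit
```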
alexpghayes
Aug 7, 2018
And I really do think that we want to export the tests for whatever specification we come up with.
Yes, absolutely.
alexpghayes
Aug 7, 2018
Other thoughts:
- Arguments should always default to NULL and missing arguments should error (see the sketch below)
- The default behavior should be to produce a prediction of the same type as the original response data
This should also all end up in principles eventually.
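A minimal sketch of the first point (the wrapper name is hypothetical):

```r
predict_tidy <- function(object, newdata = NULL, type = NULL, ...) {
  # arguments default to NULL; a missing newdata is an error, not a silent
  # fall-back to the training data
  if (is.null(newdata)) {
    stop("Please supply `newdata`.", call. = FALSE)
  }
  if (is.null(type)) {
    type <- "response"  # i.e. same type as the original response data
  }
  predict(object, newdata = newdata, type = type, ...)
}
```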
alexpghayes
Aug 7, 2018
Another decision (related to #37): should uncertainty just be reported as a se.fit column, or should it be reported as intervals?
leeper
Aug 8, 2018
Also, just to throw out edge cases:
- predictions from Bayesian models typically aren't single values
- sometimes features relevant to predictions are appended as attributes or other elements in the response lists from predict() methods

I'd say a data frame-like structure is the way to go as it's trivial to add columns. A difficult decision in prediction was whether to always include class probabilities and predicted class, or just one or the other. For my purposes, I decided to always do both, but in practice it means calling most predict() methods twice.
topepo
Aug 8, 2018
Owner
A difficult decision in prediction was whether to always include class probabilities and predicted class, or just one or the other.
Some models don't generate class probabilities. I keep them separate as a default but provide an API to get both at once (with a single call to get the probabilities).
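A hypothetical sketch of that kind of API, assuming a model whose predict() method returns one numeric probability column per class level when type = "prob" (function name and assumptions are mine, not an agreed standard):

```r
predict_class_and_prob <- function(object, newdata) {
  # single call to the underlying predict() method for the probabilities
  probs <- predict(object, newdata = newdata, type = "prob")
  lvls  <- colnames(probs)
  # derive the hard class from the probabilities instead of predicting twice
  cls <- lvls[max.col(as.matrix(probs))]
  dplyr::bind_cols(
    tibble::tibble(pred = factor(cls, levels = lvls)),
    tibble::as_tibble(probs)
  )
}
```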
leeper
Aug 8, 2018
I should also add that I'd be really happy to see broom and prediction integrated into one package if we can find a way to make an integrated version serve both our purposes. The key functionality for me is actually for the margins package, which requires the following:
- prediction::build_datalist() to get predictions over specified out-of-sample cases
- Some of the convenience functions for getting specific predictions, like seq_range(), mean_or_mode(), etc. that are currently in prediction but could easily live somewhere else
- A consistent data frame (or tibble) response from prediction() (or its equivalent)

I imagine it would be beneficial to the community to have one really good package doing this rather than various packages serving slightly different purposes.
topepo
Aug 8, 2018
Owner
If the scope is really big (and that sounds like the case), I don't think that one package is the answer. That's how caret got to where it is. Modular is better (generally).

parsnip is designed to be the unified interface to models (fitting, prediction, etc) but there is a lot that it doesn't do (e.g. resampling, tuning parameters, etc) and that is by design. This is not meant to imply that you should use it. Using it as an example, put the infrastructure in packages with reasonable scope, then create a package with a higher-level API that gives you what you want.

What you're saying is good; let's agree on some standards and develop from there so we don't obstruct each other (and we can import each other without reformatting output).
alexpghayes
Aug 8, 2018
I agree that a standalone prediction package is a good idea. prediction is almost it, but doesn't provide all the behavioral guarantees we'd like just yet.

Re: Bayesian predictions: tidybayes just hit CRAN. Developing an interface for manipulating posterior samples certainly is valuable, but it's not something I feel wildly compelled to do at the moment. For a wrapper predict method, I think a reasonable standard would be to provide a point estimate and standard errors (or bounds on a credible interval).
leeper
Aug 8, 2018
If you think it will work for you with modifications, let me know what's missing and I'll try to get it all implemented.
topepo
Aug 9, 2018
Owner
How does this sound for a prediction standard (pinging @hadley, @artichaud1, and @DavisVaughan for more input)?
Any code to produce predictions should follow these conventions:
- The return value is a tibble with the same number of rows as the data being predicted. This implies that na.action should not affect the shape of the outcome object.
- The return tibble can contain extra attributes for values relevant to the prediction (e.g. alpha for intervals) but care should be taken to make sure that these attributes are not destroyed when standard operations are applied to the tibble (e.g. arrange, filter, etc.).
- For univariate, numeric point estimates, the column should be named pred. For multivariate numeric predictions (excluding probabilities), the columns should be named {outcome name}_pred.
- For class probability predictions, the columns should be named the same as the factor levels (with back-ticks when they are not valid column names).
- If interval estimates are produced (e.g. prediction/confidence/credible), the column names should be lower and upper. If a standard error is produced, it should be named std_error.
- For predictions that are not simple scalars, such as distributions or non-rectangular structures, the pred column should be a list-column. Examples:
  - If a posterior distribution is returned for each sample, each element of the list column can be a tibble with as many rows as samples in the distribution.
  - When a predict method produces values over multiple tuning parameter values (e.g. glmnet), the list column elements have rows for every tuning parameter value combination that is being changed (e.g. lambda in glmnet).
  - In time-to-event models, where survivor probabilities are produced for values of time, the return tibble has a column for times.
  - When using a quantile regression, one might make the median the default that is predicted. If multiple percentiles are requested, then pred would be a tibble with a column for the predictions and another for the percentile.
- Class predictions should be factors with the same levels as the original outcome.
- By default, newdata should not be returned by the prediction function.
- The function to produce predictions should be a class-specific predict method with argument names object, newdata, and type. Other arguments, such as alpha, should be standardized and not named without discussion on RStudio Community.
- The main predict method can defer to separate, unexported functions (predict_class, etc).
- type should also come from a set of pre-defined values such as "response" (numeric values), "class", "prob", "link", and "raw". Other values should be assigned with consensus.
- If newdata is not supplied, an error should be thrown.
- The prediction code should work when newdata has multiple rows or a single row.
- If there is a strong need for producing output during the prediction phase, there should be a verbose option that defaults to no output.
<edit 1>
added "link"
<edit 2>
added note about single point predictions.
<edit 3>
another example for quantile regression.
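As an illustration of the conventions above, here is a hypothetical predict() method for a made-up model class (engine_predict() and the stored factor levels object$lvl are placeholders, not real parsnip code):

```r
predict.my_model <- function(object, newdata, type = "response", ...) {
  if (missing(newdata)) {
    stop("Please supply `newdata`.", call. = FALSE)
  }
  type <- match.arg(type, c("response", "class", "prob"))

  # underlying engine call; rows are never dropped, even for missing values
  raw <- engine_predict(object, newdata)

  res <- switch(
    type,
    response = tibble::tibble(pred = as.numeric(raw)),
    class    = tibble::tibble(pred = factor(raw, levels = object$lvl)),
    prob     = tibble::as_tibble(raw)   # one column per factor level
  )

  stopifnot(nrow(res) == nrow(newdata))  # same number of rows as newdata
  res
}
```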
leeper
Aug 9, 2018
Class predictions should be factors with the same levels as the original outcome.
Named?
type should also come from a set of pre-defined values such as "response" (numeric values), "class", "prob", and "raw". Other values should be assigned with consensus.
What about "link"?
If newdata is not supplied, an error should be thrown.
Very good!
If there is a strong need for producing output during the prediction phase, there should be a verbose option that defaults to no output.
How about getOption("verbose", FALSE)?
DavisVaughan
Aug 9, 2018
The return tibble can contain extra attributes for values relevant to the prediction (e.g. alpha for intervals) but care should be taken to make sure that these attributes are not destroyed when standard operations are applied to the tibble (e.g. arrange, filter, etc.).
Depending on the state of sloop::reconstruct() integration with dplyr, this could be really simple or a manual (but not difficult) process.
For class probability predictions, the columns should be named the same as the factor levels (with back-ticks when they are not valid column names).
Any reason to want to format like {factor_label}_prob instead? Just thinking about preemptively preventing any name clashes if you somehow joined the prob predictions to something that had the factor labels in a wide format already. Might also be a consistent companion to pred.
If interval estimates are produced (e.g. prediction/confidence/credible), the column names should be lower and upper. If a standard error is produced, it should be named std_error.
Should this include the confidence level? lower_95 or something similar. See the forecast pkg for examples:
library(forecast)
fit <- auto.arima(WWWusage)
forecast(fit, h=20, level=95)
Point Forecast Lo 95 Hi 95
101 218.8805 212.6840 225.0770
102 218.1524 203.3133 232.9915
103 217.6789 194.1786 241.1792
104 217.3709 185.6508 249.0910
...
topepo
Aug 9, 2018
Owner
Named?
Not sure what you mean.
How about getOption("verbose", FALSE)?

Maybe... there might be a lot of verbose globals that get made so it would have to be something like my_model_verbose. There are good arguments to be made that verbosity should be able to be set at different levels (e.g. model fit, feature selection, etc).
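Something like the following, where the option name is just the my_model_verbose placeholder suggested above:

```r
# package-specific verbosity option; FALSE unless the user opts in
verbose <- getOption("my_model_verbose", default = FALSE)
if (verbose) {
  message("Computing predictions...")
}
```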
topepo
Aug 9, 2018
Owner
Any reason to want to format like {factor_label}_prob instead? Just thinking about preemptively preventing any name clashes if you somehow joined the prob predictions to something that had the factor labels in a wide format already. Might also be a consistent companion to pred.
That's good for standardization but bad for consuming those values. So if I go to make a ggplot, then a lot of rename_at will be going on.
Let me think about that.
Should this include the confidence level?
No. Then you're encoding data into the name and you can't easily predict what the name will be. I'd rather set an attribute for the level (so that it is there) but not add it to the name or the value.
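A small sketch of the attribute idea, reusing the numbers from the forecast() output above (the column and attribute names are illustrative only):

```r
pred <- tibble::tibble(
  pred  = c(218.88, 218.15),
  lower = c(212.68, 203.31),
  upper = c(225.08, 232.99)
)
# the interval level travels with the tibble but stays out of the names/values
attr(pred, "level") <- 0.95
attr(pred, "level")
#> [1] 0.95
```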
topepo
Aug 9, 2018
Owner
Named?
~~Not sure what you mean.~~ Sorry, I get it now.

My inclination is to also call it pred but I'm torn between that and class_pred. I plan on making a new data structure with S3 class class_pred so maybe pred_class?

That could get out of hand easily though. Do we end up with pred_time, pred_post_mode and so on?
leeper
Aug 9, 2018
Agree it's messy but I think it's sensible to keep class predictions as a separate thing. For example, for margins, I use predictions for numerical derivatives and want to always be able to count on those being numeric.
alexpghayes
Aug 9, 2018
Naming: I'm not a huge fan of pred. See also some conventions the Stan crew just went with. I think full singular nouns are the way to go.
The return tibble can contain extra attributes for values relevant to the prediction (e.g. alpha for intervals) but care should be taken to make sure that these attributes are not destroyed when standard operations are applied to the tibble (e.g. arrange, filter, etc.).
I think that ideally as much information as possible should be contained in the tibble itself. Users are much more familiar working with tibbles than attributes. For example, for intervals, I'd strongly prefer a column width to an attribute. This would make working with multiple intervals at once nicer as well (for plotting, say).
For class probability predictions, the columns should be named the same as the factor levels (with back-ticks when they are not valid column names).
No indication that the columns are probabilities in the name?
If interval estimates are produced (e.g. prediction/confidence/credible), the column names should be lower and upper. If a standard error is produced, it should be named std_error.
Yes.
If a posterior distribution is returned for each sample, each element of the list column can be a tibble with as many rows as samples in the distribution.
Not sold on this. For a high level interface, we should not force users to interact with posterior samples. We should provide a default summary of those samples and only optionally return them.
When a predict method produces values over multiple tuning parameter values (e.g. glmnet), the list column elements have rows for every tuning parameter value combination that is being changed (e.g. lambda in glmnet).
I think we need to think very carefully if we want to do this. My take is that so far this entire thread has been defining a predict() method for a single fitted model object. When you have multiple hyperparameter combinations, you are fundamentally working with multiple fitted model objects. I'm not convinced it even makes sense to define predict() in this case. If we are going to define it at all, I think one sensible option is:
predict.many_fits <- function(object, ...) {
  best_fit <- extract_best_fit(object)
  predict(best_fit, ...)
}
To get predictions for different hyperparameter values, I'd much prefer a workflow like:
map(many_fits, predict) # many_fits basically a list of fitted objects
But I think we should be really careful here.
If newdata is not supplied, an error should be thrown.
Strongly second this, and came here to recommend it. This is an issue because models often drop rows with missing data silently, which leads to all sorts of nasty surprises and after the fact attempts to determine precisely which rows got dropped.
topepo
Aug 9, 2018
Owner
Naming: I'm not a huge fan of pred. See also some conventions the Stan crew just went with. I think full singular nouns is the way to go.
How about .pred, .pred_class, .pred_{class levels}? .predicted or .prediction is too long, especially when combined with class levels.
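For a three-class outcome, that naming would look something like this (values made up for illustration):

```r
tibble::tibble(
  .pred_class      = factor(c("setosa", "versicolor"),
                            levels = c("setosa", "versicolor", "virginica")),
  .pred_setosa     = c(0.97, 0.02),
  .pred_versicolor = c(0.02, 0.95),
  .pred_virginica  = c(0.01, 0.03)
)
```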
I think that ideally as much information as possible should be contained in the tibble itself.
I don't want to do that. It's replicating data that has a single value into a column that is potentially large.
Not sold on this. For a high level interface, we should not force users to interact with posterior samples. We should provide a default summary of those samples and only optionally return them.
That's one example of many where the prediction isn't a scalar. Even so, I've had applications where I've needed the full posterior distribution for each sample.
I think we need to think very carefully if we want to do this. My take is that so far this entire thread has been defining a predict() method for a single fitted model object. When you have multiple hyperparameter combinations, you are fundamentally working with multiple fitted model objects. I'm not convinced it even makes sense to define predict() in this case.
I'm only talking about doing it in the cases where the predict method makes simultaneous predictions across many parameters from a single fitted model object. That's the whole "submodel trick" that caret uses for efficiency, and it would be lost by repeated calling of predict for each parameter value using the same model object (when the underlying predict code gives them all to you at once).
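For reference, this is the kind of thing meant by the submodel trick: one fitted glmnet object, one predict() call, predictions for several penalty values at once (toy data, shown only to illustrate the shape of the output):

```r
library(glmnet)

x <- as.matrix(mtcars[, c("cyl", "disp", "hp", "wt")])
y <- mtcars$mpg

fit   <- glmnet(x, y)
preds <- predict(fit, newx = x, s = c(0.5, 0.1, 0.01))

dim(preds)  # one row per sample, one column per value of lambda
#> [1] 32  3
```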
artichaud1
Aug 10, 2018
I think this is better for other developers extending existing work, since they are less likely to make assumptions about which rows are present than which columns are present.
Not sure if this is still up for debate, and although I favored wide format in the past, more often than not I ended up going to long format for analysis, because it allowed me to standardize my workflow: filter, group_by, summarise -> send to ggplot.

In the same vein as @alexpghayes's point, I feel that it removes some of the weight in the decision making, i.e. it makes it easier to change ideas whenever, with a lower probability (or severity) of breaking stuff.
kevinykuo
Aug 10, 2018
Collaborator
This one pops up on my dashboard every day, so I'll join the party ;)
Regarding class probabilities, since we're using tibble anyway, why not just have the probabilities in a listcol? I think it's good to avoid wide tables by default. This would also allow .prediction as a column name, which is clearer than pred/.pred. Also, we might not always have names of factor levels.
topepo
Aug 10, 2018
Owner
Also, we might not always have names of factor levels.
spark may not, but that will be illegal in the modeling new world order
topepo
Aug 10, 2018
Owner
Regarding class probabilities, since we're using tibble anyway, why not just have the probabilities in a listcol? I think it's good to avoid wide tables by default.
They are probably going to end up wide in a lot of cases anyway.

In classification, people probably want the class prediction as well as the probabilities, so that's already > 1 column even if we make the probabilities a list. tidyr (or built in gather/spread methods) can solve going from wide -> long easily.
kevinykuo
Aug 10, 2018
Collaborator
In classification, people probably want the class prediction as well as the probabilities, so that's already > 1 column even if we make the probabilities a list.
These are definitely subjective points but imo
- 4-5 columns is one thing, but 1000 class probability columns doesn't feel as tidy, and
- It's easier to specify the contract/expectations of what you get if you have long rather than wide data.
topepo
Aug 10, 2018
Owner
4-5 columns is one thing, but 1000 class probability columns doesn't feel as tidy, and
True, but I don't want extreme cases to drive the bus.
It's easier to specify the contract/expectations of what you get if you have long rather than wide data.
Unless you are going to merge these results with the original data (which is pretty common). Plus, you already have wide data anyway with classes and probabilities. If you stack the probabilities then you have a class column that is replicated. Why duplicate data?
kevinykuo
Aug 10, 2018
Collaborator
Oh I didn't mean we should stack probabilities (although upon re-reading it does sound like it!). I'm thinking about appending .prediction, .predicted_class, and .class_probabilities to the dataset for classification. Having to write paste0("pred_", lvl) to get the right column feels brittle, although practically it likely won't matter. I guess going down this line of reasoning, we'd also have .prediction_interval and friends (and .prediction if we have multiple outputs) as list columns, and this may deviate too much from what folks are used to. So yeah, if we're going to do lower and upper, it would be consistent to have pred_setosa and pred_versicolor.
topepo
Aug 16, 2018
Owner
I think that parsnip will have functions to make predictions across submodels (maybe predict_submodel?). I think that I'm leaning towards restricting predict to produce a single model prediction per object.

Similarly, add predict_interval that will produce .pred_lower and .pred_upper and also add predict(object, newdata, type = "interval") that references it. I'll prototype it with lm, stan, and mars.

<edit> This would be in the service of predict being restricted to returning the same number of rows as newdata.
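A rough sketch of what that could look like, written against lm() only (the function name and .pred_* columns follow the proposal above; this is not the eventual parsnip implementation):

```r
predict_interval <- function(object, newdata, level = 0.95, ...) {
  if (missing(newdata)) {
    stop("Please supply `newdata`.", call. = FALSE)
  }
  # predict.lm() returns a matrix with columns fit, lwr, upr
  res <- predict(object, newdata = newdata, interval = "prediction", level = level)
  tibble::tibble(
    .pred       = res[, "fit"],
    .pred_lower = res[, "lwr"],
    .pred_upper = res[, "upr"]
  )
}

fit <- lm(mpg ~ wt + hp, data = mtcars)
predict_interval(fit, newdata = head(mtcars, 3))
```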
DavisVaughan
Aug 16, 2018
So for predict_boost_tree() would it have a type argument as well so you can specify class vs prob? And then it would just have a default type based off mode if no type is specified, like the current predict.model_fit() has?

And you could call predict(boost_fit, newdata, type = "whatever") which would just call predict_boost_tree()?
The separate prediction functions per submodel might be a nice way to modularize any nuances that might come up for the prediction methods of a specific model type, and would be pretty extensible.
topepo
Aug 16, 2018
Owner
So for predict_boost_tree() would it have a type argument as well so you can specify class vs prob? And then it would just have a default type based off mode if no type is specified, like the current predict.model_fit() has?
Close. There is no predict_boost_tree, just predict.model_spec. I'm about to merge in a devel branch that shows how predict will pass off to different functions. I should just merge that in.
And you could call predict(boost_fit, newdata, type = "whatever") which would just call predict_boost_tree()?
It will call predict.model_spec and use type to redirect.
alexpghayes commented Aug 7, 2018 (original issue, edited 2 times)
I've been reworking the augment() methods and it's rapidly becoming clear that dealing with idiosyncratic predict() methods is going to slow down progress immensely. In the end broom and parsnip are both going to want to wrap a bajillion predict methods, and we should report predictions in the same way for consistency's sake. I think we should move this functionality to a separate package. Potentially we could use the prediction package by Thomas Leeper, but we should decide on the behavior we want first.

If we define a new generic / series of generics, we can then test these behaviors in modeltests and allow other modelling package developers to guarantee that their predict() methods are sane and consistent.

What I want from a predict method:
- Returns a tidy tibble
- Never drops missing data (i.e. behaves like predict.lm(..., na.action = na.pass))
- Consistent naming of fitted values
- Uncertainty in predictions

I want all of these to be guaranteed, and for methods that cannot meet these guarantees, I want an informative error rather than a partially correct output.
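A rough idea of the kind of exportable check this implies (a hypothetical helper, not an existing modeltests function): predictions come back as a tibble and keep one row per row of newdata.

```r
check_predictions <- function(object, newdata) {
  preds <- predict(object, newdata = newdata)
  stopifnot(
    tibble::is_tibble(preds),
    nrow(preds) == nrow(newdata)
  )
  invisible(preds)
}

fit <- lm(mpg ~ wt, data = mtcars)
try(check_predictions(fit, mtcars))
# fails: predict.lm() returns a bare numeric vector, not a tibble
```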