Add panel data example to vignette? #577

leeper · Mar 19, 2019

It would be great to have an example with a mix of time-variant and time-invariant variables for gather(), pivot_long(), or both. Basically, what's the tidyr equivalent of:

d <- data.frame(
  x = 1:4,
  y1 = rnorm(4),
  y2 = rnorm(4),
  z1 = rep(3,4),
  z2 = rep(-2,4),
  a = c(1,1,0,0),
  b = c(0,1,1,1)
)

d
##   x         y1          y2 z1 z2 a b
## 1 1  0.7751885 -0.56351522  3 -2 1 0
## 2 2  0.1562380 -0.09576944  3 -2 1 1
## 3 3 -0.1208141  0.49756405  3 -2 0 1
## 4 4  0.6798801 -1.49171491  3 -2 0 1

reshape(
  d,
  varying = list(c("y1", "y2"), c("z1", "z2")),
  v.names = c("y", "z"),
  idvar = "x",
  direction = "long"
)
##     x a b time           y  z
## 1.1 1 1 0    1  0.77518846  3
## 2.1 2 1 1    1  0.15623801  3
## 3.1 3 0 1    1 -0.12081405  3
## 4.1 4 0 1    1  0.67988013  3
## 1.2 1 1 0    2 -0.56351522 -2
## 2.2 2 1 1    2 -0.09576944 -2
## 3.2 3 0 1    2  0.49756405 -2
## 4.2 4 0 1    2 -1.49171491 -2

DavisVaughan · Mar 20, 2019

This is very similar to the anscombe example! The key is overwriting the auto-generated .value column to hold multiple values. Scroll down a bit at the link below:
https://tidyr.tidyverse.org/dev/articles/pivot.html#multiple-value-columns

library(tidyr)
library(dplyr, warn.conflicts = FALSE)

d <- data.frame(
  x = 1:4,
  y1 = rnorm(4),
  y2 = rnorm(4),
  z1 = rep(3,4),
  z2 = rep(-2,4),
  a = c(1,1,0,0),
  b = c(0,1,1,1)
)

spec <- pivot_long_spec(d, c(y1, y2, z1, z2)) %>%
  separate(name, c(".value", "time"), 1, convert = TRUE)

pivot_long(d, spec = spec)
#> # A tibble: 8 x 6
#>       x     a     b  time      y     z
#>   <int> <dbl> <dbl> <int>  <dbl> <dbl>
#> 1     1     1     0     1 -0.808     3
#> 2     1     1     0     2  0.491    -2
#> 3     2     1     1     1 -1.61      3
#> 4     2     1     1     2 -1.04     -2
#> 5     3     0     1     1  0.809     3
#> 6     3     0     1     2  2.26     -2
#> 7     4     0     1     1 -0.389     3
#> 8     4     0     1     2 -1.06     -2

reshape(
  d,
  varying = list(c("y1", "y2"), c("z1", "z2")),
  v.names = c("y", "z"),
  idvar = "x",
  direction = "long"
) %>%
  as_tibble() %>%
  arrange(x, a, b)
#> # A tibble: 8 x 6
#>       x     a     b  time      y     z
#>   <int> <dbl> <dbl> <int>  <dbl> <dbl>
#> 1     1     1     0     1 -0.808     3
#> 2     1     1     0     2  0.491    -2
#> 3     2     1     1     1 -1.61      3
#> 4     2     1     1     2 -1.04     -2
#> 5     3     0     1     1  0.809     3
#> 6     3     0     1     2  2.26     -2
#> 7     4     0     1     1 -0.389     3
#> 8     4     0     1     2 -1.06     -2

Created on 2019-03-20 by the reprex package (v0.2.1.9000)

hadley · Mar 20, 2019

(BTW I think this example would be a little easier to understand if a and b were placed next to x in the input, and it's probably worth printing the spec. I think it's worth adding to the vignette - it's similar to Anscombe, but this idea is complicated enough that a couple of examples would be worthwhile.)

Given that separate() is often used in these examples, I wonder if it's worth having an additional argument to pivot_long() that would somehow let you supply the basic syntax? (That said, it would have to be a novel syntax, because I think we could add at most one more argument to pivot_long())

leeper · Mar 20, 2019

The need to use separate() is pretty unintuitive.

hadley · Mar 21, 2019

@leeper could you please write a couple of sentences on why you consider this "panel" data? (i.e. something I could use to introduce a section in the vignette)

I don't currently see a way to avoid separate() without encumbering pivot_long() with many extra arguments: you need to be able to distinguish between the case of x_1 vs x1 (i.e. separate by match or by position), you might need to use extract() instead, and there may be other variables caught up in the column headers.
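A minimal sketch of the three cases mentioned above, applied directly to small data frames of column headers. The header values here are made up for illustration and are not from the thread; only separate() and extract() from tidyr are used.

```r
library(tidyr)

# Hypothetical column headers, one data frame per naming style.
by_pos_names   <- data.frame(name = c("y1", "y2", "z1", "z2"))
by_match_names <- data.frame(name = c("x_1", "x_2"))
regex_names    <- data.frame(name = c("inc2019", "inc2020"))

# Separate by position: split after the first character (x1 style).
by_position <- separate(by_pos_names, name, c(".value", "time"), sep = 1)

# Separate by match: split on the underscore (x_1 style).
by_match <- separate(by_match_names, name, c(".value", "time"), sep = "_")

# extract(): capture groups handle prefixes of varying length.
by_regex <- tidyr::extract(
  regex_names, name, c(".value", "time"),
  regex = "^([a-z]+)([0-9]+)$"
)
```

Each result has a `.value` column and a `time` column (both character), which is the shape a pivot spec needs.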

hadley · Mar 21, 2019

But see discussion in #586.

hadley · Mar 22, 2019

@leeper to close the loop on this, the syntax is now:

pnl %>%
  pivot_longer(-c(x, a, b), names_to = c(".value", "time"), names_sep = 1)
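For a self-contained version of the call above, here is a sketch that assumes `pnl` is the data frame from the original post with `a` and `b` moved next to `x` (the thread does not show how `pnl` was defined, so that definition is an assumption):

```r
library(tidyr)

# Assumed definition of pnl: the original `d`, with the
# time-invariant columns a and b placed next to x.
pnl <- data.frame(
  x = 1:4,
  a = c(1, 1, 0, 0),
  b = c(0, 1, 1, 1),
  y1 = rnorm(4),
  y2 = rnorm(4),
  z1 = rep(3, 4),
  z2 = rep(-2, 4)
)

# names_sep = 1 splits each header after its first character:
# the first piece names the output column (.value), the second
# piece becomes the value of the time column.
long <- pivot_longer(
  pnl,
  -c(x, a, b),
  names_to = c(".value", "time"),
  names_sep = 1
)
```

The result has one row per `x` and `time` combination, with `a` and `b` repeated across time points. Note that `time` comes back as a character column here.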

leeper · Mar 23, 2019

Sorry, not sure if you still need it but I'd say something like "Panel data consist of multiple cases/units/observations observed at multiple points in time. They feature commonly in economic, sociological, and political datasets, such as cross-country, over-time datasets like Gapminder."

hadley · Mar 23, 2019

What’s the opposite of panel data? Your description just sounds like data to me 😉

hadley closed this in b04f4e7 Mar 21, 2019

tidyverse/tidyr
