Refine the organization of `get_file()`? #48

Open
wibeasley opened this issue Jan 11, 2020 · 3 comments
@wibeasley commented Jan 11, 2020

@adam3smith, @kuriwaki, @pdurbin, and anyone else,

Should get_file() be refactored into multiple child functions? It seems like we're asking it to do a lot of things, including

  • retrieve by file id or by file name
  • retrieve a single file or multiple files (see the usage sketch after this list)
  • in #35, @kuriwaki has the cool suggestion of returning a specific file format (like Stata)
  • in #35, I hijacked @kuriwaki's thread to suggest also returning R's native data.frame or tibble.
  • in PR #47, @adam3smith has a cool solution of downloading multiple files as a single zip (also discussed in #46)
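For context, here is roughly how the current, overloaded interface gets called. This is only a sketch: the argument names reflect my reading of the released package, and the file ids and dataset DOI are placeholders.

```r
library(dataverse)

# One file by numeric id, one file by name within a dataset, and several
# files at once -- all through the same get_file() entry point.
raw_by_id   <- get_file(1234567)
raw_by_name <- get_file("enrollment.tab", dataset = "doi:10.70122/FK2/EXAMPLE")
raw_list    <- get_file(c(1234567, 1234568))  # returns a list, not a single raw vector
```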

I like all these capabilities, and I want to discuss organizational ideas with everyone so the package structure is (a) easy for us to develop, test, & maintain, and (b) natural for users to learn and incorporate.

One possible approach:
A foundational function retrieves the file by ID; it is the workhorse that actually retrieves the file. A second function accepts the file name (not ID); it essentially wraps the first function after calling get_fileid(). Both of these functions deal with a single file at a time.

Another pair of functions deals with multiple files (one by name, one by id), but these return lists, not a single object. They're essentially lapply() loops around their respective single-file siblings described above.
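A rough sketch of that layering. The bodies are only illustrative: the access endpoint, argument names, and the get_fileid() call are my assumptions, not settled design.

```r
library(httr)

# Workhorse: retrieve exactly one file, by numeric id, as a raw vector.
get_file_by_id <- function(fileid,
                           server = Sys.getenv("DATAVERSE_SERVER"),
                           key    = Sys.getenv("DATAVERSE_KEY"),
                           ...) {
  url      <- sprintf("https://%s/api/access/datafile/%s", server, fileid)
  response <- httr::GET(url, httr::add_headers("X-Dataverse-key" = key), ...)
  httr::stop_for_status(response)
  httr::content(response, as = "raw")
}

# Thin wrapper: resolve the name to an id, then delegate to the workhorse.
# (The get_fileid() call is schematic; argument order may differ.)
get_file_by_name <- function(filename, dataset, ...) {
  fileid <- get_fileid(dataset, filename, ...)
  get_file_by_id(fileid, ...)
}

# Plural siblings: lapply() over the singular versions; always return a list.
get_files_by_id   <- function(fileids, ...) lapply(fileids, get_file_by_id, ...)
get_files_by_name <- function(filenames, dataset, ...)
  lapply(filenames, get_file_by_name, dataset = dataset, ...)
```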

To avoid breaking the package interface, maybe the existing get_file() keeps its current interface (which ambiguously accepts either file names or ids and returns either a single file or a list of files), but we soft-deprecate it and encourage new code to use the more explicit functions? The guts of the function are moved out into the four new functions (sketched after the name list below).

Maybe the function names are

  1. the existing get_file() with an unchanged interface
  2. get_file_by_id() (the workhorse)
  3. get_file_by_name()
  4. get_files_by_id()
  5. get_files_by_name()
  6. get_tibble_by_id()
  7. get_tibble_by_name()
  8. get_zip_by_id()
  9. get_zip_by_name()
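For the soft-deprecation idea, the existing entry point could simply dispatch to the explicit functions. A minimal sketch, where the dispatch rule and the message wording are only illustrative:

```r
get_file <- function(file, dataset = NULL, ...) {
  message("get_file() still works, but the explicit get_file_by_*()/",
          "get_files_by_*() functions are preferred for new code.")
  single <- length(file) == 1L
  if (is.numeric(file)) {
    if (single) get_file_by_id(file, ...) else get_files_by_id(file, ...)
  } else {
    if (single) get_file_by_name(file, dataset, ...) else get_files_by_name(file, dataset, ...)
  }
}
```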

I'm pretty sure it would be easier to write tests that isolate problems. The documentation becomes more verbose, but probably more straightforward.

You guys have more experience with Dataverse than I do, and a better sense of the use cases. Would this reorganization help users? If not, maybe we still split it into multiple functions, but keep functions 2-5 private.

Maybe I'm making this unnecessarily tedious, but I'm thinking these download functions are the ones R users call most, and they're certainly the ones new users call. So if they leave a bad impression, the package is less likely to be used.

@wibeasley self-assigned this Jan 11, 2020
@pdurbin commented Jan 11, 2020

To be honest, I don't have a strong opinion. I'd suggest maintaining backward compatibility if you can.

@adam3smith commented Jan 11, 2020

Yes, keeping get_file() for backward compatibility makes sense.
I like adding get_zip_by_id() and I like adding a separate function for by_name -- the logic in get_file() is indeed rather tortured for that.

On the other hand, I think keeping get_file* and get_files* together keeps things simple, especially if you agree with my re-write of the multi-file download to use sequential downloads of individual files.

get_zip_by_name() and get_files_by_name() don't make sense -- you can only request multiple files by id, I believe.

I don't have a view about the tibble/data frame.

So my suggestion then would be to reduce your list to:

  • the existing get_file() with an unchanged interface
  • get_file_by_id() (the workhorse)
  • get_file_by_name()
  • get_tibble_by_id()
  • get_tibble_by_name()
  • get_zip_by_id()

Given my re-write o

@adam3smith commented Jan 11, 2020

More thoughts after sleeping on it:

  1. I've come around to a separate get_files* function so that the type of output is predictable: get_file* always produces a raw vector, get_files* always produces a list of raw vectors. Easy enough to implement.
  2. As opposed to what I wrote above, while the API doesn't implement it, we could of course implement get_files and get_zip by name if we wanted to -- I'm not sure there's demand for that, and I think we shouldn't encourage it, so I'm still inclined not to do those.
  3. What we should implement, however, is get_file_by_doi -- one of the principal reasons for implementing file-level DOIs was reproducibility workflows.
  4. I wonder if we could provide a write_files() function. It's a bit tricky to write the file with its correct filename (especially when requesting the original), and it's especially tricky to get the folder structure written when you're requesting individual files, so this seems like it'd be a huge help (a rough sketch follows).
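On point 4, a minimal sketch of what a hypothetical write_files() could look like, assuming it receives a named list of raw vectors (as the proposed get_files_* functions would return) whose names hold the original file paths:

```r
# Hypothetical helper: write each raw vector to disk under its original path,
# recreating the dataset's folder structure below `dir`.
write_files <- function(contents, dir = ".") {
  stopifnot(is.list(contents), !is.null(names(contents)))
  for (path in names(contents)) {
    dest <- file.path(dir, path)
    dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
    writeBin(contents[[path]], dest)
  }
  invisible(file.path(dir, names(contents)))
}
```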