Discussion: Idiom for loading DataFrames #231

Open
lintool opened this issue May 21, 2018 · 4 comments

Comments

@lintool
Member

commented May 21, 2018

In my original implementation I wrote a DataFrameLoader, but it seems to have rapidly fallen out of use... We should decide on the idiom we want for loading DataFrames.

Current implementation:

val df = RecordLoader.loadArchives("example.arc.gz", sc).extractImageDetailsDF()
// alternatively, extractValidPagesDF, extractHyperlinksDF, etc.

The downside is that the user gets access to raw RDDs, since that is what loadArchives returns... which invites mixing RDDs and DFs in unpredictable ways.
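To make the concern concrete, here is a rough spark-shell sketch of how the two APIs can end up interleaved on the same handle. It assumes `sc` is already in scope and that `RecordLoader` and the `extract*DF` methods come from the `io.archivesunleashed` package; the variable names are just for illustration.

```scala
// Assumption: RecordLoader and the extract*DF methods live in this package.
import io.archivesunleashed._

// loadArchives hands back an RDD, so the caller now holds a raw RDD handle.
val records = RecordLoader.loadArchives("example.arc.gz", sc)

// DataFrame view derived from that RDD.
val pagesDF = records.extractValidPagesDF()

// Nothing stops the same script from going back to the RDD API directly,
// so RDD-level and DataFrame-level operations sit side by side.
val rawCount = records.count()   // RDD action
val pageCount = pagesDF.count()  // DataFrame action
```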

Another option would be to introduce a DF interface that does not give access to RDDs. Something like:

val df = DataFrameLoader.loadArchives("example.arc.gz", sc).images

The other nice feature is that we can have much shorter DF names like pages, links, images, image_links, etc. - don't need the DF part to disambiguate because DataFrameLoader makes this clear. One more nice feature is the ability to selectively reduce scope down the road and hide RDDs from the user, as we move completely over to DFs.
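A minimal sketch of what such a DF-only interface might look like, written as a thin wrapper over the existing calls. This is not the actual implementation: the `ArchiveDataFrames` class name is hypothetical, only three of the short names are wired up, and the `io.archivesunleashed` import is an assumption about where `RecordLoader` and the `extract*DF` methods live.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.DataFrame
// Assumption: RecordLoader and the extract*DF methods live in this package.
import io.archivesunleashed._

// Hypothetical holder type: exposes DataFrames only, keeps the RDD private.
final class ArchiveDataFrames(path: String, sc: SparkContext) {
  // The RDD is an implementation detail and is never returned to the caller.
  private lazy val records = RecordLoader.loadArchives(path, sc)

  // Short names; the DF suffix is unnecessary because the loader implies it.
  def pages: DataFrame = records.extractValidPagesDF()
  def links: DataFrame = records.extractHyperlinksDF()
  def images: DataFrame = records.extractImageDetailsDF()
}

object DataFrameLoader {
  // Mirrors the proposed call shape: DataFrameLoader.loadArchives(path, sc).images
  def loadArchives(path: String, sc: SparkContext): ArchiveDataFrames =
    new ArchiveDataFrames(path, sc)
}
```

With something like this, `DataFrameLoader.loadArchives("example.arc.gz", sc).images` returns a DataFrame and there is no public path back to the RDD, which is what would let us narrow scope later.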

I'm leaning towards this design, but would be happy to hear opinions from others...

@ianmilligan1

Member

commented May 22, 2018

The other nice feature is that we can have much shorter DF names like pages, links, images, image_links, etc. - don't need the DF part to disambiguate because DataFrameLoader makes this clear. One more nice feature is the ability to selectively reduce scope down the road and hide RDDs from the user, as we move completely over to DFs.

I'm generally agnostic, but this pushes me into the camp of having a DF-specific interface. The second syntax example you gave with the .images is very usable.

@ruebot

Member

commented May 22, 2018

Fine by me. I can see moving towards strict DataFrames helping out on the AUK side of things.

@JWZ2018

Contributor

commented May 22, 2018

+1 for strict dataframes and hiding away RDDs

@ruebot ruebot added this to In Progress in DataFrames and PySpark Aug 13, 2018

@ruebot ruebot added this to To Do in 1.0.0 Release of AUT Aug 13, 2018

@ruebot ruebot moved this from In Progress to ToDo in DataFrames and PySpark Aug 13, 2018

@ruebot ruebot added the discussion label Aug 20, 2018

@ruebot

Member

commented Aug 21, 2019

I think #350 hits this, and/or resolves it. I'll leave that to @lintool
