Discussion: Idiom for loading DataFrames #231
Comments
I'm generally agnostic, but this pushes me into the camp of having a DF-specific interface. The second syntax example you gave with the
Fine by me. I can see moving towards strict DataFrames helping out on the AUK side of things.
+1 for strict DataFrames and hiding away RDDs.
ruebot added this to In Progress in DataFrames and PySpark on Aug 13, 2018
ruebot added this to To Do in 1.0.0 Release of AUT on Aug 13, 2018
ruebot moved this from In Progress to ToDo in DataFrames and PySpark on Aug 13, 2018
ruebot added the discussion label on Aug 20, 2018
We can close this issue after #350 is merged.
Closed with e32ae17.
lintool commented May 21, 2018
In my original implementation I wrote a `DataFrameLoader`, but it seems to have rapidly fallen out of use... We should decide on the idiom we want for loading DataFrames.

Current implementation:

The downside of this is that the user has access to raw RDDs, which is what `loadArchives` returns... this is asking for trouble in mixing RDDs and DFs in unpredictable ways?

Another option would be to introduce a DF interface that does not give access to RDDs. Something like:
The other nice feature is that we can have much shorter DF names like `pages`, `links`, `images`, `image_links`, etc. We don't need the `DF` part to disambiguate because `DataFrameLoader` makes this clear. One more nice feature is the ability to selectively reduce scope down the road and hide RDDs from the user, as we move completely over to DFs.

I'm leaning towards this design, but would be happy to hear opinions from others...