Migration of all RDD functionality over to DataFrames #223
ruebot added this to In Progress in DataFrames and PySpark on May 21, 2018
TitusAn referenced this issue on May 24, 2018: "Data frame implementation of extractors. Also added cmd arguments to resolve #235" #236 (merged)
ruebot added this to To Do in 1.0.0 Release of AUT on Aug 13, 2018
ruebot moved this from In Progress to To Do in DataFrames and PySpark on Aug 13, 2018
ruebot added the enhancement and DataFrames labels on Aug 20, 2018
@jrwiebe I have a note here from our call to "Go through the scala dir and identify all the functions that take in RDD and do not take in DF, then create tickets for JWb." Do you want me to go to that granular a level, or do you want to use this issue to take care of it?
This is fine.
@ianmilligan1 @lintool does this look like the basic inventory?
lintool commented on May 15, 2018
We need to migrate all current RDD functionality over to DataFrames. This means porting all matchbox UDFs over to DF UDFs.
There are two possible ways to do this: we can simply take the matchbox UDFs and wrap them, or rewrite them from scratch. I suggest we revisit them one by one, which will give us an opportunity to refine the UDFs we actually want.
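To illustrate the wrapping approach, here is a minimal sketch, assuming the matchbox object can be invoked as `ExtractDomain(url)` (the actual matchbox signature may differ):

```scala
import org.apache.spark.sql.functions.{col, udf}
import io.archivesunleashed.matchbox.ExtractDomain

// Sketch: wrap an existing matchbox function as a DataFrame UDF.
// Assumes ExtractDomain can be applied as a String => String function;
// the real matchbox API may take additional arguments.
val extractDomainUDF = udf((url: String) => ExtractDomain(url))

// Hypothetical usage against a DataFrame with a "url" column:
// df.select(extractDomainUDF(col("url")).as("domain"))
```

The upside of wrapping is that the RDD and DF paths share one implementation; the downside is that any quirks of the matchbox version carry over unchanged.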
For example, the current RDD matchbox ExtractDomain is implemented a bit differently than the DF version we've been playing with... the DF version, for example, strips the www prefix, whereas the RDD impl doesn't. I like the newer implementation better, but open to discussion.
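A from-scratch sketch of the www-stripping behaviour described above; `extractDomainDF` and its internals are illustrative, not the actual implementation:

```scala
import org.apache.spark.sql.functions.udf

// Sketch: a rewritten DF UDF that normalizes the host by stripping
// a leading "www." — the behavioural difference noted above.
val extractDomainDF = udf((url: String) => {
  val host =
    try { new java.net.URL(url).getHost }
    catch { case _: Exception => "" } // malformed or null URLs yield ""
  host.stripPrefix("www.")
})
```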
Also, this is an issue we'll come across sooner or later: https://stackoverflow.com/questions/33664991/spark-udf-initialization

I have a general question we need to look into from the performance perspective: what is the lifecycle of a Spark DF UDF? In particular, if there's initialization like compiling a regexp, we don't want to do that over and over again... we want to have an init stage.
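One common workaround is to hoist expensive initialization into a singleton object, so the pattern is compiled once per executor JVM rather than once per row. A sketch, with hypothetical names (`UrlPatterns`, `extractUrlsUDF`):

```scala
import org.apache.spark.sql.functions.udf
import scala.util.matching.Regex

// Sketch: the regex lives in an object, so it is compiled once when the
// object initializes on each executor JVM — not on every row the UDF sees.
object UrlPatterns extends Serializable {
  val absoluteUrl: Regex = "https?://[^\\s]+".r
}

val extractUrlsUDF = udf((text: String) =>
  if (text == null) Seq.empty[String]
  else UrlPatterns.absoluteUrl.findAllIn(text).toSeq
)
```

This is only one answer to the lifecycle question; whether Spark offers a true per-partition init hook for SQL UDFs is exactly what the linked thread is about.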
@TitusAn let's start developing the DF versions of the apps in #222 in parallel and try to work this out?