Migration of all RDD functionality over to DataFrames #223

lintool · 2018-05-15T10:16:15Z

We need to migrate all current RDD functionality over to DataFrames. This means porting all matchbox UDFs over to DF UDFs.

There are two possible ways to do this - we can simply take matchbox UDFs and wrap them, or rewrite them from scratch. I suggest we revisit one by one, which will give us an opportunity to refine the UDF we actually want.

For example, the current RDD matchbox ExtractDomain is implemented a bit differently than the DF version we've been playing with: the DF version strips the `www` prefix, whereas the RDD implementation doesn't. I like the newer implementation better, but I'm open to discussion.
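To make the difference concrete, here is a minimal sketch of the two behaviors in plain Python (no Spark). It is not the actual matchbox or DF code; `extract_domain` and its `strip_www` flag are hypothetical names used only for illustration.

```python
from urllib.parse import urlparse


def extract_domain(url: str, strip_www: bool = True) -> str:
    """Return the host of a URL, optionally dropping a leading 'www.'.

    Hypothetical helper: strip_www=True mimics the DF-style behavior
    described above, strip_www=False mimics the RDD-style behavior.
    """
    # urlparse(...).hostname is None when the URL has no network location.
    host = urlparse(url).hostname or ""
    if strip_www and host.startswith("www."):
        host = host[len("www."):]
    return host
```

With this sketch, `extract_domain("http://www.archive.org/about")` gives `archive.org`, while passing `strip_www=False` keeps `www.archive.org`. Whichever behavior we settle on, both the RDD and DF versions should agree, otherwise domain counts will differ between the two APIs.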

Also, this is an issue we'll come across sooner or later:
https://stackoverflow.com/questions/33664991/spark-udf-initialization

I have a general question we need to look into from the performance perspective: what's the lifecycle of a Spark DF UDF? In particular, if there's initialization like compiling a regexp, we don't want to do that over and over again for every row. Do we want to have an init stage?

@TitusAn let's start developing in parallel the DF versions of the apps in #222 and try and work this out?

ruebot · 2019-02-01T18:37:30Z

@jrwiebe I have a note here from our call to "Go through the scala dir and identify all the functions that take in RDD and do not take in DF, then create tickets for JWb." Do you want me to work at that granular a level, or do you want to use this issue to take care of it?

jrwiebe · 2019-02-01T18:38:43Z

This is fine.

ruebot · 2019-07-17T16:31:13Z

@ianmilligan1 @lintool does this look like the basic inventory?

| RDD | DataFrame |
|---|---|
| `keepValidPages` | |
| `extractValidPages` | `extractValidPagesDF` |
| | `extractHyperlinksDF` |
| | `extractImageLinksDF` |
| | `extractImageDetailsDF` |
| `keepImages` | |
| `keepMimeTypes` | |
| `keepUrls` | |
| `keepUrlPatterns` | |
| `keepDomains` | |
| `keepLanguages` | |
| `keepContent` | |
| `discardMimeTypes` | |
| `discardDate` | |
| `discardUrls` | |
| `discardUrlPatterns` | |
| `discardDomains` | |
| `discardContent` | |
| `extractFromRecords` | |
| `extractFromScrapeText` | |
| `WriteGEXF` | |
| `ExtractPopularImages` | |
| `DomainFrequencyExtractor` | |
| `DomainGraphExtractor` | |
| `PlainTextExtractor` | |
| `ExtractGraphX` | |

lintool · 2019-08-21T08:47:12Z

I think we should just leave this as a "catch-all" issue, open.

IMO, this should be driven by the documentation update: go through the docs, and for everything we do with RDDs, make sure there's a corresponding DF code example. When the docs have everything in both RDD and DF, I think we're done.

ruebot · 2019-11-08T22:39:22Z

@SinghGursimran this one is tied to #372, and what needs to be done to close it should become a lot clearer once that is resolved. I think we're pretty close here.

ruebot added this to In Progress in DataFrames and PySpark May 21, 2018

TitusAn referenced this issue May 24, 2018

Data frame implementation of extractors. Also added cmd arguments to resolve #235 #236

Merged

ruebot added this to To Do in 1.0.0 Release of AUT Aug 13, 2018

ruebot moved this from In Progress to ToDo in DataFrames and PySpark Aug 13, 2018

ruebot added enhancement DataFrames labels Aug 20, 2018

ruebot referenced this issue Aug 17, 2019

DataFrame discussion: open thread #190

Open

ruebot referenced this issue Oct 19, 2019

Documentation reorg #2

Merged

ruebot referenced this issue Nov 5, 2019

Convert RecordLoader.loadArchives to a Spark Data Source #371

Open
