Migration of all RDD functionality over to DataFrames #223

Open
lintool opened this issue May 15, 2018 · 12 comments

@lintool (Member) commented May 15, 2018

We need to migrate all current RDD functionality over to DataFrames. This means porting all matchbox UDFs over to DF UDFs.

There are two possible ways to do this: we can simply take the matchbox UDFs and wrap them, or rewrite them from scratch. I suggest we revisit them one by one, which will give us an opportunity to refine each UDF into what we actually want.

For example, the current RDD matchbox ExtractDomain is implemented a bit differently from the DF version we've been playing with: the DF version, for example, strips the www prefix, whereas the RDD implementation doesn't. I like the newer implementation better, but I'm open to discussion.
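Roughly, the difference is something like the following sketch (illustrative only; these are not the actual matchbox signatures, and the real ExtractDomain also guards against malformed URLs):

import org.apache.spark.sql.functions.udf

// RDD-style helper: returns the host as-is.
def extractDomainRdd(url: String): String =
  new java.net.URL(url).getHost

// DF-style UDF: additionally strips a leading "www."
val extractDomainDf = udf((url: String) =>
  new java.net.URL(url).getHost.stripPrefix("www."))

// extractDomainRdd("http://www.example.com/page") returns "www.example.com";
// the DF version returns "example.com".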

Also, this is an issue we'll come across sooner or later:
https://stackoverflow.com/questions/33664991/spark-udf-initialization

I have a general question we need to look into from a performance perspective: what's the lifecycle of a Spark DF UDF? In particular, if there's initialization involved, like compiling a regexp, we don't want to do that over and over again... do we want an explicit init stage?
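One common approach is to build any expensive state outside the lambda, e.g. behind a lazy val in a singleton object, so it is created once per executor JVM rather than once per row. A minimal sketch (Patterns, isoDate, and extractIsoDate are hypothetical names):

import org.apache.spark.sql.functions.udf

object Patterns {
  // Compiled lazily, once per JVM that touches it -- not once per row.
  lazy val isoDate = raw"\d{4}-\d{2}-\d{2}".r
}

val extractIsoDate = udf((s: String) => Patterns.isoDate.findFirstIn(s).orNull)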

@TitusAn let's start developing the DF versions of the apps in #222 in parallel and try to work this out?

@ruebot (Member) commented Feb 1, 2019

@jrwiebe I have a note here from our call to "Go through the scala dir and identify all the functions that take in RDD and do not take in DF, then create tickets for JWb." Do you want me to go to that granular a level, or do you want to use this issue to take care of it?

@jrwiebe (Contributor) commented Feb 1, 2019

This is fine.

@ruebot (Member) commented Jul 17, 2019

@ianmilligan1 @lintool does this look like the basic inventory?

RDD | DataFrame
keepValidPages | -
extractValidPages | extractValidPagesDF
- | extractHyperlinksDF
- | extractImageLinksDF
- | extractImageDetailsDF
keepImages | -
keepMimeTypes | -
keepUrls | -
keepUrlPatterns | -
keepDomains | -
keepLanguages | -
keepContent | -
discardMimeTypes | -
discardDate | -
discardUrls | -
discardUrlPatterns | -
discardDomains | -
discardContent | -
extractFromRecords | -
extractFromScrapeText | -
WriteGEXF | -
ExtractPopularImages | -
DomainFrequencyExtractor | -
DomainGraphExtractor | -
PlainTextExtractor | -
ExtractGraphX | -
@lintool (Member, Author) commented Aug 21, 2019

I think we should just leave this as a "catch-all" issue, open.

IMO, this should be driven by the documentation update: go through the docs and, for everything we do with RDDs, make sure there's a corresponding DF code example. When the docs cover everything in both RDD and DF, I think we're done.

@ruebot (Member) commented Nov 8, 2019

@SinghGursimran this one is tied to #372; as that one progresses, it should become a lot clearer what needs to be done to close this one. I think we're pretty close here.

@ruebot ruebot moved this from ToDo to In Progress in DataFrames and PySpark Nov 14, 2019
ruebot added a commit that referenced this issue Nov 18, 2019
- Extend GetExtensionMime, ExtractImageLinks, ComputeMD5, and ComputeSHA1 to DataFrames
- Addresses #223
@ruebot (Member) commented Nov 19, 2019

From @lintool

Sorry to be late to the party... I seem to faintly recall that we were going to rename all our UDFs to MyFuncDF and MyFuncRDD to disambiguate. Were we still going to do that? Circle around in another PR?

This spawned from a Slack convo Jimmy and I had in a non-public channel that never made it to the ticket.

I have a branch working through this now.

ruebot added a commit that referenced this issue Nov 19, 2019
- Addresses #223
ianmilligan1 added a commit that referenced this issue Nov 19, 2019
- Addresses #223
ruebot added a commit that referenced this issue Nov 21, 2019
- Add `all()` DataFrame method 
- Refactor fixity DataFrame UDFs
- Add ComputeImageSize UDF
- Add Python implementation of `all()`
- Addresses #223
ruebot added a commit that referenced this issue Nov 21, 2019
- Add tests for ExtractPopularImagesDF
- Rename ExtractPopularImages to ExtractPopularImagesRDD
- Addresses #223
ruebot added a commit that referenced this issue Dec 5, 2019
- Add DetectLanguageDF
- Add ExtractBoilerpipeTextDF
- Add ExtractDateDF
- Update tests
- Rename existing ExtractDate, ExtractBoilerpipeText, DetectLanguage udfs by appending RDD
- Partially addresses #223
ruebot added a commit that referenced this issue Dec 17, 2019
- Add keepValidPagesDF
- Add HTTP status code column to all()
- Add test for keepValidPagesDF
- Addresses #223
ruebot added a commit that referenced this issue Dec 17, 2019
- add discardMimeTypesDF
- add discardDateDF
- add discardUrlsDF
- add discardDomainsDF
- update tests
- addresses #223
@SinghGursimran (Contributor) commented Dec 19, 2019

Current Inventory: Matchbox

RDD | DataFrame
ComputeImageSize | -
ComputeMD5RDD | ComputeMD5DF
ComputeSHA1RDD | ComputeSHA1DF
DetectLanguageRDD | DetectLanguageDF
DetectMimeTypeTika | DetectMimeTypeTikaDF
ExtractBoilerpipeTextRDD | ExtractBoilerpipeTextDF
ExtractDateRDD | ExtractDateDF
ExtractDomainRDD | ExtractDomainDF
ExtractImageDetails | -
ExtractImageLinksRDD | ExtractImageLinksDF
ExtractLinksRDD | ExtractLinksDF
ExtractTextFromPDFs | -
GetExtensionMimeRDD | GetExtensionMimeDF
RemoveHTMLRDD | RemoveHTMLDF
RemoveHTTPHeaderRDD | RemoveHTTPHeaderDF
NERClassifier | -
RemovePrefixWWW | RemovePrefixWWWDF
@SinghGursimran (Contributor) commented Dec 19, 2019

Current Inventory: App

Functionality | RDD | DataFrame
CommandLineApp | Yes | Yes
DomainFrequencyExtractor | Yes | Yes
DomainGraphExtractor | Yes | Yes
ExtractEntities | Yes | -
ExtractGraphX | Yes | No
ExtractPopularImages | Yes | Yes
NERCombinedJson | Yes | -
PlainTextExtractor | Yes | No
WriteGEXF | Yes | No
WriteGraph | Yes | No
WriteGraphML | Yes | No
ruebot added a commit that referenced this issue Dec 29, 2019
- Partially address #223 
- Add keepHttpStatusDF
- Add keepDateDF
- Add keepUrlsDF
- Add keepDomainsDF
- Add tests
ruebot added a commit that referenced this issue Jan 7, 2020
- Add discardHttpStatusDF
- Add keepMimeTypesDF
- Add keepMimeTypesTikaDF
- Update tests
@SinghGursimran (Contributor) commented Jan 10, 2020

Final Inventory: Serializable APIs

RDD | DataFrame
keepValidPages | keepValidPagesDF
keepMimeTypes | keepMimeTypesDF
keepMimeTypesTika | keepMimeTypesTikaDF
keepHttpStatus | keepHttpStatusDF
keepDate | keepDateDF
keepUrls | keepUrlsDF
keepUrlPatterns | keepUrlPatternsDF
keepImages | keepImagesDF
keepDomains | keepDomainsDF
keepLanguages | keepLanguagesDF
keepContent | keepContentDF
discardMimeTypes | discardMimeTypesDF
discardMimeTypesTika | discardMimeTypesDF
discardDate | discardDateDF
discardUrls | discardUrlsDF
discardHttpStatus | discardHttpStatusDF
discardUrlPatterns | discardUrlPatternsDF
discardDomains | discardDomainsDF
discardContent | discardContentDF
discardLanguages | discardLanguagesDF
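For reference, these DF filters are presumably chained on the all() DataFrame, something like the hypothetical sketch below (imports and argument types are guesses, not verified against the current signatures):

import io.archivesunleashed._
import io.archivesunleashed.df._

val pages = RecordLoader.loadArchives("example.arc.gz", sc)
  .all()
  .keepValidPagesDF()
  .keepMimeTypesDF(Set("text/html"))
  .discardDomainsDF(Set("example.com"))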
ruebot added a commit that referenced this issue Jan 10, 2020
- Partially addresses  #223 
- Add discardContentDF
- Add discardUrlPatternsDF
- Add discardLanguagesDF
- Add keepImagesDF
- Add keepContentDF
- Add keepUrlPatternsDF
- Add keepLanguagesDF
- Update tests
@ruebot (Member) commented Jan 13, 2020

@SinghGursimran we still need to do discardMimeTypesTikaDF. Looks like you have discardMimeTypesDF duplicated.

ruebot added a commit to archivesunleashed/aut-docs that referenced this issue Jan 13, 2020
ianmilligan1 added a commit to archivesunleashed/aut-docs that referenced this issue Jan 13, 2020
@ruebot (Member) commented Jan 13, 2020

@SinghGursimran nvm. I see it now :-)

/** Removes all data except records whose Tika-detected MIME type is in the specified set.
  *
  * @param mimeTypes a list of Tika MIME types to keep
  */
def keepMimeTypesTikaDF(mimeTypes: Set[String]): DataFrame = {
  val takeMimeTypeTika = udf((mimeTypeTika: String) => mimeTypes.contains(mimeTypeTika))
  df.filter(takeMimeTypeTika(DetectMimeTypeTikaDF($"bytes")))
}
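A hypothetical call site for that method, assuming the implicit DataFrame methods are in scope:

RecordLoader.loadArchives("example.arc.gz", sc)
  .all()
  .keepMimeTypesTikaDF(Set("text/html"))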

ruebot added a commit to archivesunleashed/aut-docs that referenced this issue Jan 13, 2020
ianmilligan1 added a commit to archivesunleashed/aut-docs that referenced this issue Jan 13, 2020
ruebot added a commit to archivesunleashed/aut-docs that referenced this issue Jan 13, 2020
ianmilligan1 added a commit to archivesunleashed/aut-docs that referenced this issue Jan 15, 2020
@ruebot (Member) commented Jan 17, 2020

Created three new issues that should cover the Python implementations of most of the work here.

Projects: DataFrames and PySpark (In Progress)
4 participants