Migration of all RDD functionality over to DataFrames #223

Open
lintool opened this issue May 15, 2018 · 12 comments

@lintool (Member) commented May 15, 2018

We need to migrate all current RDD functionality over to DataFrames. This means porting all matchbox UDFs over to DF UDFs.

There are two possible ways to do this: we can simply take the matchbox UDFs and wrap them, or rewrite them from scratch. I suggest we revisit them one by one, which will give us an opportunity to refine each UDF into what we actually want.

For example, the current RDD matchbox ExtractDomain is implemented a bit differently from the DF version we've been playing with: the DF version, for example, strips the www prefix, whereas the RDD implementation doesn't. I like the newer implementation better, but I'm open to discussion.
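Roughly, the difference is something like the following sketch (illustrative only; these are not the actual matchbox signatures, and the real ExtractDomain also guards against malformed URLs):

import org.apache.spark.sql.functions.udf

// RDD-style helper: returns the host as-is.
def extractDomainRdd(url: String): String =
  new java.net.URL(url).getHost

// DF-style UDF: additionally strips a leading "www."
val extractDomainDf = udf((url: String) =>
  new java.net.URL(url).getHost.stripPrefix("www."))

// extractDomainRdd("http://www.example.com/page") returns "www.example.com";
// the DF version returns "example.com".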

Also, this is an issue we'll come across sooner or later:
https://stackoverflow.com/questions/33664991/spark-udf-initialization

I have a general question we need to look into from a performance perspective: what's the lifecycle of a Spark DF UDF? In particular, if there's initialization involved, like compiling a regexp, we don't want to do that over and over again... do we want an explicit init stage?
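One common approach is to build any expensive state outside the lambda, e.g. behind a lazy val in a singleton object, so it is created once per executor JVM rather than once per row. A minimal sketch (Patterns, isoDate, and extractIsoDate are hypothetical names):

import org.apache.spark.sql.functions.udf

object Patterns {
  // Compiled lazily, once per JVM that touches it -- not once per row.
  lazy val isoDate = raw"\d{4}-\d{2}-\d{2}".r
}

val extractIsoDate = udf((s: String) => Patterns.isoDate.findFirstIn(s).orNull)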

@TitusAn let's start developing the DF versions of the apps in #222 in parallel and try to work this out?

@ruebot (Member) commented Feb 1, 2019

@jrwiebe I have a note here from our call to "Go through the scala dir and identify all the functions that take in RDD and do not take in DF, then create tickets for JWb." Do you want me to go to that granular a level, or do you want to use this issue to take care of it?

@jrwiebe (Contributor) commented Feb 1, 2019

This is fine.

@ruebot (Member) commented Jul 17, 2019

@ianmilligan1 @lintool does this look like the basic inventory?

RDD | DataFrame
keepValidPages | -
extractValidPages | extractValidPagesDF
- | extractHyperlinksDF
- | extractImageLinksDF
- | extractImageDetailsDF
keepImages | -
keepMimeTypes | -
keepUrls | -
keepUrlPatterns | -
keepDomains | -
keepLanguages | -
keepContent | -
discardMimeTypes | -
discardDate | -
discardUrls | -
discardUrlPatterns | -
discardDomains | -
discardContent | -
extractFromRecords | -
extractFromScrapeText | -
WriteGEXF | -
ExtractPopularImages | -
DomainFrequencyExtractor | -
DomainGraphExtractor | -
PlainTextExtractor | -
ExtractGraphX | -
@lintool (Member, Author) commented Aug 21, 2019

I think we should just leave this as a "catch-all" issue, open.

IMO, this should be driven by the documentation update: go through the docs and, for everything we do with RDDs, make sure there's a corresponding DF code example. When the docs cover everything in both RDD and DF, I think we're done.

@ruebot (Member) commented Nov 8, 2019

@SinghGursimran this one is tied to #372; as that one progresses, it should become a lot clearer what needs to be done to close this one. I think we're pretty close here.

@ruebot ruebot moved this from ToDo to In Progress in DataFrames and PySpark Nov 14, 2019
ruebot added a commit that referenced this issue Nov 18, 2019
- Extend GetExtensionMime, ExtractImageLinks, ComputeMD5, and ComputeSHA1 to DataFrames
- Addresses #223
@ruebot (Member) commented Nov 19, 2019

From @lintool

Sorry to be late to the party... I seem to faintly recall that we were going to rename all our UDFs to MyFuncDF and MyFuncRDD to disambiguate. Were we still going to do that? Circle around in another PR?

This spawned from a Slack convo Jimmy and I had in a non-public channel that never made it to the ticket.

I have a branch working through this now.

ruebot added a commit that referenced this issue Nov 19, 2019
- Addresses #223
ianmilligan1 added a commit that referenced this issue Nov 19, 2019
- Addresses #223
ruebot added a commit that referenced this issue Nov 21, 2019
- Add `all()` DataFrame method 
- Refactor fixity DataFrame UDFs
- Add ComputeImageSize UDF
- Add Python implementation of `all()`
- Addresses #223
ruebot added a commit that referenced this issue Nov 21, 2019
- Add tests for ExtractPopularImagesDF
- Rename ExtractPopularImages to ExtractPopularImagesRDD
- Addresses #223
ruebot added a commit that referenced this issue Dec 5, 2019
- Add DetectLanguageDF
- Add ExtractBoilerpipeTextDF
- Add ExtractDateDF
- Update tests
- Rename existing ExtractDate, ExtractBoilerpipeText, DetectLanguage udfs by appending RDD
- Partially addresses #223
ruebot added a commit that referenced this issue Dec 17, 2019
- Add keepValidPagesDF
- Add HTTP status code column to all()
- Add test for keepValidPagesDF
- Addresses #223
ruebot added a commit that referenced this issue Dec 17, 2019
- add discardMimeTypesDF
- add discardDateDF
- add discardUrlsDF
- add discardDomainsDF
- update tests
- addresses #223
@SinghGursimran (Contributor) commented Dec 19, 2019

Current Inventory: Matchbox

RDD | DataFrame
ComputeImageSize | -
ComputeMD5RDD | ComputeMD5DF
ComputeSHA1RDD | ComputeSHA1DF
DetectLanguageRDD | DetectLanguageDF
DetectMimeTypeTika | DetectMimeTypeTikaDF
ExtractBoilerpipeTextRDD | ExtractBoilerpipeTextDF
ExtractDateRDD | ExtractDateDF
ExtractDomainRDD | ExtractDomainDF
ExtractImageDetails | -
ExtractImageLinksRDD | ExtractImageLinksDF
ExtractLinksRDD | ExtractLinksDF
ExtractTextFromPDFs | -
GetExtensionMimeRDD | GetExtensionMimeDF
RemoveHTMLRDD | RemoveHTMLDF
RemoveHTTPHeaderRDD | RemoveHTTPHeaderDF
NERClassifier | -
RemovePrefixWWW | RemovePrefixWWWDF
@SinghGursimran (Contributor) commented Dec 19, 2019

Current Inventory: App

Functionality | RDD | DataFrame
CommandLineApp | Yes | Yes
DomainFrequencyExtractor | Yes | Yes
DomainGraphExtractor | Yes | Yes
ExtractEntities | Yes | -
ExtractGraphX | Yes | No
ExtractPopularImages | Yes | Yes
NERCombinedJson | Yes | -
PlainTextExtractor | Yes | No
WriteGEXF | Yes | No
WriteGraph | Yes | No
WriteGraphML | Yes | No
ruebot added a commit that referenced this issue Dec 29, 2019
- Partially address #223 
- Add keepHttpStatusDF
- Add keepDateDF
- Add keepUrlsDF
- Add keepDomainsDF
- Add tests
ruebot added a commit that referenced this issue Jan 7, 2020
- Add discardHttpStatusDF
- Add keepMimeTypesDF
- Add keepMimeTypesTikaDF
- Update tests
@SinghGursimran (Contributor) commented Jan 10, 2020

Final Inventory: Serializable APIs

RDD | DataFrame
keepValidPages | keepValidPagesDF
keepMimeTypes | keepMimeTypesDF
keepMimeTypesTika | keepMimeTypesTikaDF
keepHttpStatus | keepHttpStatusDF
keepDate | keepDateDF
keepUrls | keepUrlsDF
keepUrlPatterns | keepUrlPatternsDF
keepImages | keepImagesDF
keepDomains | keepDomainsDF
keepLanguages | keepLanguagesDF
keepContent | keepContentDF
discardMimeTypes | discardMimeTypesDF
discardMimeTypesTika | discardMimeTypesDF
discardDate | discardDateDF
discardUrls | discardUrlsDF
discardHttpStatus | discardHttpStatusDF
discardUrlPatterns | discardUrlPatternsDF
discardDomains | discardDomainsDF
discardContent | discardContentDF
discardLanguages | discardLanguagesDF
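For reference, these DF filters are presumably chained on the all() DataFrame, something like the hypothetical sketch below (imports and argument types are guesses, not verified against the current signatures):

import io.archivesunleashed._
import io.archivesunleashed.df._

val pages = RecordLoader.loadArchives("example.arc.gz", sc)
  .all()
  .keepValidPagesDF()
  .keepMimeTypesDF(Set("text/html"))
  .discardDomainsDF(Set("example.com"))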
ruebot added a commit that referenced this issue Jan 10, 2020
- Partially addresses  #223 
- Add discardContentDF
- Add discardUrlPatternsDF
- Add discardLanguagesDF
- Add keepImagesDF
- Add keepContentDF
- Add keepUrlPatternsDF
- Add keepLanguagesDF
- Update tests
@ruebot (Member) commented Jan 13, 2020

@SinghGursimran we still need to do discardMimeTypesTikaDF. Looks like you have discardMimeTypesDF duplicated.

ruebot added a commit to archivesunleashed/aut-docs that referenced this issue Jan 13, 2020
ianmilligan1 added a commit to archivesunleashed/aut-docs that referenced this issue Jan 13, 2020
@ruebot (Member) commented Jan 13, 2020

@SinghGursimran nvm. I see it now :-)

/** Removes all data except records whose Tika-detected MIME type is in the specified set.
  *
  * @param mimeTypes a list of Tika MIME types to keep
  */
def keepMimeTypesTikaDF(mimeTypes: Set[String]): DataFrame = {
  val takeMimeTypeTika = udf((mimeTypeTika: String) => mimeTypes.contains(mimeTypeTika))
  df.filter(takeMimeTypeTika(DetectMimeTypeTikaDF($"bytes")))
}
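A hypothetical call site for that method, assuming the implicit DataFrame methods are in scope:

RecordLoader.loadArchives("example.arc.gz", sc)
  .all()
  .keepMimeTypesTikaDF(Set("text/html"))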

ruebot added a commit to archivesunleashed/aut-docs that referenced this issue Jan 13, 2020
ianmilligan1 added a commit to archivesunleashed/aut-docs that referenced this issue Jan 13, 2020
ruebot added a commit to archivesunleashed/aut-docs that referenced this issue Jan 13, 2020
ianmilligan1 added a commit to archivesunleashed/aut-docs that referenced this issue Jan 15, 2020
@ruebot (Member) commented Jan 17, 2020

Created three new issues that should cover the Python implementations of most of the work here.

Projects: DataFrames and PySpark (In Progress)
4 participants