Discussion: Restyle UDFs in the context of DataFrames #425

lintool · 2020-02-11T13:56:43Z

Currently, we're doing something like this in DFs:

RecordLoader.loadArchives("./src/test/resources/arc/example.arc.gz", sc)
  .webgraph()
  .select($"src")
  .keepUrlPatternsDF(Set(".*index.*".r))
  .show(10, false)

This is a straightforward translation of what we've been doing in RDDs, so that's fine. However, in DF, something like this would be more fluent:

			.filter($"src".isInUrlPatterns(Set(".*index.*".r)))

This would require reimplementing all of our filters... let's discuss.
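
For illustration only, here is a minimal sketch of how a column method like this could be added through an implicit class. The names `FluentFilters`, `UrlColumnOps`, and `FluentDemo` are hypothetical, `isInUrlPatterns` is not an existing method, and the toy data stands in for real archive records. It uses Spark's `rlike`, which matches anywhere in the string, so it may differ from the current filter for patterns that are not wrapped in `.*`.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.lit
import scala.util.matching.Regex

object FluentFilters {
  // Hypothetical extension method; not part of Spark or the toolkit.
  implicit class UrlColumnOps(val column: Column) extends AnyVal {
    // True when the column value matches at least one of the patterns.
    def isInUrlPatterns(patterns: Set[Regex]): Column =
      patterns.toSeq
        .map(p => column.rlike(p.regex))
        .foldLeft(lit(false))(_ || _)
  }
}

object FluentDemo {
  import FluentFilters._

  // In spark-shell a session named `spark` already exists; this is for standalone use.
  val spark = SparkSession.builder().master("local[1]").appName("fluent-filters").getOrCreate()
  import spark.implicits._

  // Toy stand-in for archive data.
  val pages = Seq(
    "http://example.com/index.html",
    "http://example.com/about.html",
    "http://example.org/index.php"
  ).toDF("src")

  val indexPages = pages.filter($"src".isInUrlPatterns(Set(".*index.*".r)))
}
```

With this shape the filter reads as a plain Spark `filter` call, and the custom logic lives in one small method per filter.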

ruebot · 2020-02-11T14:00:01Z

Pulling this in from Slack:

Looking at all the RDD filters, they're all basically the same implementation; there's a field, do this custom filter on it. So, a DF and RDD re-implementation could be very similar. Basically what you proposed, the filter UDF taking in two parameters. So, we could do something like this for both RDD and DF:

.filter($"col".isInUrlPatterns(Set(".*index.*".r)))

...and, if we play our cards right, we could just have one implementation for both 🤷‍♂

lintool · 2020-02-11T14:05:20Z

we could just have one implementation for both

That would be great in the short term, but not necessary for the long term, IMO. Eventually, the DF functionality would be a superset of the RDD functionality, since we have no intention of backporting new DF features to RDD.

greebie · 2020-02-11T14:24:46Z

Seems like it would be helpful to have the a -> Bool tests regardless, and these could be implemented in the existing .keep functions if that's desired. Filter and FilterNot (does Scala have FilterNot?) are more canonical in both Python and Scala. Also, using filter suits FAAV.
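
As a rough sketch of the a -> Bool idea, the tests could be plain functions from a field value to a Boolean, which can then be passed to `filter` or `filterNot` (Scala collections and RDDs offer `filter`, and Scala collections also offer `filterNot`). The names `Predicates` and `PredicateDemo` and the sample URLs are made up for illustration.

```scala
import scala.util.matching.Regex

object Predicates {
  // True when the URL fully matches at least one pattern.
  def urlMatches(patterns: Set[Regex])(url: String): Boolean =
    patterns.exists(_.pattern.matcher(url).matches())

  // True when the language code is one of the requested languages.
  def hasLanguage(languages: Set[String])(language: String): Boolean =
    languages.contains(language)
}

object PredicateDemo {
  val urls = Seq("http://example.com/index.html", "http://example.com/about.html")

  // Fix the patterns once, leaving a String => Boolean test.
  val isIndex: String => Boolean = Predicates.urlMatches(Set(".*index.*".r))

  val kept = urls.filter(isIndex)         // the "keep" behaviour
  val discarded = urls.filterNot(isIndex) // the "discard" behaviour
}
```

The same predicate could back a keep function, a discard function, or a direct `filter` call, which is what would let one implementation serve several entry points.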


ruebot · 2020-02-14T13:16:15Z

Thinking about this more, I'm not seeing the use of moving in this direction since it appears to be a slightly more complicated version of just using filter.

For example:

import io.archivesunleashed._

RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocities",sc)
  .webpages()
  .count()

res0: Long = 125579

import io.archivesunleashed._

val languages = Set("th","de","ht")

RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocities",sc)
  .webpages()
  .keepLanguagesDF(languages)
  .count()

res1: Long = 3536

import io.archivesunleashed._

val languages = Set("th","de","ht")

RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocities",sc)
  .webpages()
  .filter($"language".isInCollection(languages))
  .count()

res7: Long = 3536

With that, I'd argue we either keep what we have now, or remove the DataFrame filters as they currently exist and resolve this issue by updating the documentation to use the pure Spark DF implementation of filters.
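
To make the second option concrete, here is a small sketch of keep and discard filters written with only built-in Spark column methods (`isInCollection`, `rlike`, and `!` for negation). The object name `PureSparkFilters`, the toy rows, and the column names `url` and `language` are assumptions for the example, not taken from real output.

```scala
import org.apache.spark.sql.SparkSession

object PureSparkFilters {
  // In spark-shell a session named `spark` already exists; this is for standalone use.
  val spark = SparkSession.builder().master("local[1]").appName("pure-spark-filters").getOrCreate()
  import spark.implicits._

  // Toy stand-in for the output of .webpages().
  val pages = Seq(
    ("http://example.com/index.html", "de"),
    ("http://example.com/about.html", "en"),
    ("http://example.org/index.php", "th")
  ).toDF("url", "language")

  val languages = Set("th", "de", "ht")

  // Keep rows whose language is in the set.
  val kept = pages.filter($"language".isInCollection(languages))

  // Discard them instead by negating the same expression.
  val discarded = pages.filter(!($"language".isInCollection(languages)))

  // URL pattern filter; rlike matches anywhere in the string.
  val indexPages = pages.filter($"url".rlike(".*index.*"))
}
```

Because these are ordinary column expressions, the equivalent PySpark calls (`isin`, `rlike`, `~`) would need no toolkit-specific code.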

ruebot · 2020-02-14T13:31:26Z

...and if we go with the latter, that'll solve a sizable chunk of the Python implementation 😃

lintool · 2020-02-14T16:33:26Z

Can I propose yet another alternative design?

import io.archivesunleashed._

RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocities",sc)
  .webpages()
  .filter(hasLanguage("th","de","ht"))
  .count()

This saves the scholar from having to know about the schema explicitly? The UDF should be able to figure it out...

And similarly, we can have:

import io.archivesunleashed._

RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocities",sc)
  .webpages()
  .filter(urlMatches("""regex"""))
  .count()

I'm just thinking that a humanities scholar might get confused/scared about the $ notation and would be unfamiliar with the . method notation being applied with weird $ thingys?
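
One possible way to sketch this, assuming the column names are stable: the helpers return a Spark `Column` expression that refers to a fixed column name, so the caller never writes `$`. The column names `language` and `url`, the object names, and the toy rows are assumptions made for this example, and the helpers would break on a DataFrame that names its columns differently.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.col

object SchemaAwareFilters {
  // Hypothetical helpers; they assume columns named "language" and "url".
  def hasLanguage(languages: String*): Column = col("language").isin(languages: _*)

  // rlike matches anywhere in the string.
  def urlMatches(regex: String): Column = col("url").rlike(regex)
}

object SchemaAwareDemo {
  import SchemaAwareFilters._

  // In spark-shell a session named `spark` already exists; this is for standalone use.
  val spark = SparkSession.builder().master("local[1]").appName("schema-aware-filters").getOrCreate()
  import spark.implicits._

  // Toy stand-in for the output of .webpages().
  val pages = Seq(
    ("http://example.com/index.html", "de"),
    ("http://example.com/about.html", "en"),
    ("http://example.org/index.php", "th")
  ).toDF("url", "language")

  val kept = pages.filter(hasLanguage("th", "de", "ht"))
  val discarded = pages.filter(!hasLanguage("th", "de", "ht"))
  val indexPages = pages.filter(urlMatches(".*index.*"))
}
```

The trade-off is that the schema knowledge moves from the scholar into the helper, so any column rename has to be mirrored there.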

ruebot · 2020-02-14T18:09:48Z

Yeah, I like that better @lintool. Then we should be able to get the negation with ! in Scala and ~ in Python/PySpark.

keepImagesDF -> hasImages
keepHttpStatusDF -> hasHTTPStatus
keepDateDF -> hasDates
keepUrlsDF -> hasUrls
keepDomainsDF -> hasDomains
keepMimeTypesTikaDF -> hasTikaMimeTypes
keepMimeTypesDF -> hasMimeTypes
keepContentDF -> contentMatches
keepUrlPatternsDF -> urlMatches
keepLanguagesDF -> hasLanguages

Feel free to suggest better names if you have them.

SinghGursimran · 2020-03-05T18:20:41Z

RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocities",sc)
  .webpages()
  .filter(hasLanguage("th","de","ht"))
  .count()

In this approach, the hasLanguage() function would need the column as well.
Something like:

filter(hasLanguage($"content", Set("th","de","ht")))

Is this fine?
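
A minimal sketch of what that signature could look like as a Spark UDF wrapper, under a few assumptions: the object names are hypothetical, the toy data uses a `language` column rather than `content`, and a real version would likely do more than a set lookup.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.udf

object ExplicitColumnFilters {
  // Hypothetical helper: the caller passes the column to test and the allowed values.
  def hasLanguage(column: Column, languages: Set[String]): Column = {
    // Null-safe membership test wrapped as a UDF.
    val inSet = udf((lang: String) => lang != null && languages.contains(lang))
    inSet(column)
  }
}

object ExplicitColumnDemo {
  import ExplicitColumnFilters._

  // In spark-shell a session named `spark` already exists; this is for standalone use.
  val spark = SparkSession.builder().master("local[1]").appName("explicit-column-filters").getOrCreate()
  import spark.implicits._

  // Toy stand-in for the output of .webpages().
  val pages = Seq(
    ("http://example.com/index.html", "de"),
    ("http://example.com/about.html", "en"),
    ("http://example.org/index.php", "th")
  ).toDF("url", "language")

  val languages = Set("th", "de", "ht")

  val kept = pages.filter(hasLanguage($"language", languages))
  // Negation with ! gives the discard behaviour.
  val discarded = pages.filter(!hasLanguage($"language", languages))
}
```

Since the helper returns a `Column`, it composes with `!`, `&&`, and `||` like any other filter expression.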

ruebot · 2020-03-05T19:26:34Z

Yeah, that makes sense to me. Work for you @lintool and @ianmilligan1?

ianmilligan1 · 2020-03-05T19:37:19Z

Works for me @ruebot and @SinghGursimran!


ruebot added DataFrames enhancement rdd Scala labels Feb 11, 2020

ruebot added this to ToDo in DataFrames and PySpark Feb 11, 2020

ruebot mentioned this issue Feb 11, 2020

update for 'src' column #424

Merged

SinghGursimran mentioned this issue Mar 6, 2020

Restyle UDFs in the context of DataFrames #427

Closed

ruebot added a commit that referenced this issue Mar 17, 2020

#425 cleanup, and build on @SinghGursimran's work.


7e34950

ruebot mentioned this issue Mar 17, 2020

Restyle keep/discard filter UDFs in the context of DataFrames #429

Merged

ruebot closed this in e1d908b Mar 18, 2020

DataFrames and PySpark automation moved this from ToDo to In review Mar 18, 2020
