Discussion: Restyle UDFs in the context of DataFrames #425

Closed
lintool opened this issue Feb 11, 2020 · 10 comments

Comments


@lintool lintool commented Feb 11, 2020

Currently, we're doing something like this in DFs:

RecordLoader.loadArchives("./src/test/resources/arc/example.arc.gz",sc)
			.webgraph()
			.select($"src")
			.keepUrlPatternsDF(Set(".*index.*".r))
			.show(10,false)

This is a straightforward translation of what we've been doing in RDDs, so that's fine. However, in DFs, something like this would be more fluent:

  .filter($"src".isInUrlPatterns(Set(".*index.*".r)))

This would require reimplementing all of our filters... let's discuss.
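For illustration, a minimal sketch of what one such reimplementation might look like, assuming we expose the predicate as a method on Column via an implicit class (the rlike-based body is an assumption, not the current implementation):

import org.apache.spark.sql.Column
import scala.util.matching.Regex

// Hypothetical sketch: make isInUrlPatterns available as a method on Column,
// so it can be chained inside .filter(...).
implicit class UrlPatternColumn(col: Column) {
  def isInUrlPatterns(patterns: Set[Regex]): Column =
    // True if the value matches any of the supplied patterns.
    patterns.map(p => col.rlike(p.toString)).reduce(_ || _)
}

// Usage:
//   .filter($"src".isInUrlPatterns(Set(".*index.*".r)))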


@ruebot ruebot commented Feb 11, 2020

Pulling this in from Slack:

Looking at all the RDD filters, they're all basically the same implementation: take a field, apply a custom filter to it. So, a DF and RDD re-implementation could be very similar. Basically what you proposed: the filter UDF taking in two parameters. So, we could do something like this for both RDD and DF:

.filter($"col".isInUrlPatterns(Set(".*index.*".r)))

...and, if we play our cards right, we could just have one implementation for both 🤷‍♂
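As a rough sketch of the "one implementation for both" idea (names and signatures here are hypothetical): a single plain Scala predicate could back the RDD filter directly and be wrapped in a UDF for the DF side.

import scala.util.matching.Regex
import org.apache.spark.sql.functions.udf

// One plain predicate, shared by both APIs.
def matchesUrlPatterns(url: String, patterns: Set[Regex]): Boolean =
  patterns.exists(p => p.pattern.matcher(url).matches())

// RDD side: pass the predicate straight to RDD.filter, e.g.
//   rdd.filter(r => matchesUrlPatterns(r.getUrl, Set(".*index.*".r)))

// DF side: wrap the same predicate in a UDF.
val patterns = Set(".*index.*".r)
val isInUrlPatterns = udf((url: String) => matchesUrlPatterns(url, patterns))
//   df.filter(isInUrlPatterns($"src"))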


@lintool lintool commented Feb 11, 2020

we could just have one implementation for both

That would be great in the short term, but not necessary for the long term, IMO. Eventually, the DF functionality would be a superset of the RDD functionality, since we have no intention of backporting new DF features to RDD.

@greebie greebie commented Feb 11, 2020

This comment has been minimized.


@ruebot ruebot commented Feb 14, 2020

Thinking about this more, I'm not seeing the value in moving in this direction, since it appears to be a slightly more complicated version of just using filter.

For example:

import io.archivesunleashed._

RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocities",sc)
  .webpages()
  .count()

res0: Long = 125579

import io.archivesunleashed._

val languages = Set("th","de","ht")

RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocities",sc)
  .webpages()
  .keepLanguagesDF(languages)
  .count()

res1: Long = 3536

import io.archivesunleashed._

val languages = Set("th","de","ht")

RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocities",sc)
  .webpages()
  .filter($"language".isinCollection(languages))
  .count()

res7: Long = 3536

With that, I'd argue we either keep what we have now, or remove all the DataFrame filters as they currently exist, and resolve this issue by updating the current documentation with the pure Spark DF implementation of filters.
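For reference, a hedged sketch of how a couple of the other keep*DF filters would map onto plain Spark Column operations (the column names here are assumptions about the webpages() schema):

import io.archivesunleashed._

// keepUrlPatternsDF(Set(".*index.*".r)) as plain Spark: rlike takes the regex as a string.
RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocities", sc)
  .webpages()
  .filter($"url".rlike(".*index.*"))
  .count()

// keepMimeTypesDF(Set("text/html")) as plain Spark: isInCollection again.
RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocities", sc)
  .webpages()
  .filter($"mime_type_web_server".isInCollection(Set("text/html")))
  .count()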


@ruebot ruebot commented Feb 14, 2020

...and if we go with the latter, that'll solve a sizable chunk of the Python implementation 😃


@lintool lintool commented Feb 14, 2020

Can I propose yet another alternative design?

import io.archivesunleashed._

RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocities",sc)
  .webpages()
  .filter(hasLanguage("th","de","ht"))
  .count()

This saves the scholar from having to know about the schema explicitly? The UDF should be able to figure it out...

And similarly, we can have:

import io.archivesunleashed._

RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocities",sc)
  .webpages()
  .filter(urlMatches("""regex"""))
  .count()

I'm just thinking that a humanities scholar might get confused or scared by the $ notation, and would be unfamiliar with the . method notation being applied to weird $ thingys?
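A minimal sketch of how that column-free style could work, assuming each helper returns a Column predicate bound to a fixed column name (column names and bodies here are assumptions):

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col

// Each helper hides the $"..." reference from the user.
def hasLanguage(languages: String*): Column =
  col("language").isInCollection(languages)

def urlMatches(pattern: String): Column =
  col("url").rlike(pattern)

// Usage:
//   .filter(hasLanguage("th", "de", "ht"))
//   .filter(urlMatches(""".*index.*"""))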


@ruebot ruebot commented Feb 14, 2020

Yeah, I like that better @lintool. Then we should be able to get the negation with ! in Scala and ~ in Python/PySpark (see the sketch after the list below).

  • keepImagesDF -> hasImages
  • keepHttpStatusDF -> hasHTTPStatus
  • keepDateDF -> hasDates
  • keepUrlsDF -> hasUrls
  • keepDomainsDF -> hasDomains
  • keepMimeTypesTikaDF -> hasTikaMimeTypes
  • keepMimeTypesDF -> hasMimeTypes
  • keepContentDF -> contentMatches
  • keepUrlPatternsDF -> urlMatches
  • keepLanguagesDF -> hasLanguages

Feel free to suggest better names if you have them.
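For example (a sketch assuming the renamed filters return Column predicates, as in the earlier sketches), the Scala negation would read:

import io.archivesunleashed._

// Column's unary ! turns each "keep" filter into its "discard" counterpart.
RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocities", sc)
  .webpages()
  .filter(!hasLanguages("th", "de", "ht"))
  .count()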


@SinghGursimran SinghGursimran commented Mar 5, 2020

RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocities",sc)
  .webpages()
  .filter(hasLanguage("th","de","ht"))
  .count()

In this approach, the hasLanguage() function would require the column as well. Something like:

filter(hasLanguage($"content", Set("th", "de", "ht")))

Is this fine?
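A quick sketch of that signature (the names, the Set type, and the column used in the usage line are assumptions; the body just reuses isInCollection):

import org.apache.spark.sql.Column

def hasLanguages(column: Column, languages: Set[String]): Column =
  column.isInCollection(languages)

// Usage, with the language column as one plausible choice:
//   .filter(hasLanguages($"language", Set("th", "de", "ht")))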


@ruebot ruebot commented Mar 5, 2020

Yeah, that makes sense to me. Work for you @lintool and @ianmilligan1?


@ianmilligan1 ianmilligan1 commented Mar 5, 2020

Works for me @ruebot and @SinghGursimran!

ruebot added a commit that referenced this issue Mar 17, 2020
@ruebot ruebot closed this in e1d908b Mar 18, 2020
DataFrames and PySpark automation moved this from ToDo to In review Mar 18, 2020