Implement Scala Matchbox UDFs in Python. #463

ruebot · 2020-05-13T20:10:44Z

GitHub issue(s):

What does this Pull Request do?

Implement Scala Matchbox UDFs in Python.

Resolves #408
Alphabetizes DataFrameLoader functions
Alphabetizes UDF functions
Move DataFrameLoader to df packages
Move UDFs out of df into their own package
Rename UDFs (no more DF tagged to the end).
Update tests as necessary
Partially addresses #410, #409
Supersedes #412.

How should this be tested?

TravisCI
Adapt this to local files.
archivesunleashed/aut-docs#62

Additional Notes:

I made a number of structural changes to the Scala side. @lintool, please let me know if you take strong issue with anything.
I'm going to punt on the hasX filters for right now, and loop back around to them. I hit a wall with trying to get them to run in PySpark, and part of me is tempted to just say that we should go with the natural PySpark (Python) implementation here. Basically:

from aut import *
from pyspark.sql.functions import col

languages = ["es", "fr"]

WebArchive(sc, sqlContext, "/data")\
  .webpages()\
  .filter(col("language").isin(languages))\
  .select("crawl_date")\
  .show(10, True)

or

from aut import *
from pyspark.sql.functions import col

languages = ["es", "fr"]

WebArchive(sc, sqlContext, "/data")\
  .webpages()\
  .filter(~col("language").isin(languages))\
  .select("crawl_date")\
  .show(10, True)

Instead of

from aut import *

languages = ["es", "fr"]

WebArchive(sc, sqlContext, "/data")\
  .webpages()\
  .filter(Udf.has_language("language", languages))\
  .select("crawl_date")\
  .show(10, True)

or

from aut import *

languages = ["es", "fr"]

WebArchive(sc, sqlContext, "/data")\
  .webpages()\
  .filter(~Udf.has_language("language", languages))\
  .select("crawl_date")\
  .show(10, True)

Basically, an argument I made in #425.
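
For anyone without a Spark session handy, here is a rough plain-Python sketch of which rows the two `isin` filters above select. The sample rows and the helper names (`keep_languages`, `drop_languages`) are made up for illustration; they are not part of aut or PySpark, and the real filters run on DataFrame columns rather than Python lists.

```python
# Hypothetical rows standing in for the output of .webpages(); only the
# two columns used in the examples above are modelled.
rows = [
    {"crawl_date": "20200101", "language": "es"},
    {"crawl_date": "20200102", "language": "en"},
    {"crawl_date": "20200103", "language": "fr"},
    {"crawl_date": "20200104", "language": "de"},
]

languages = ["es", "fr"]


def keep_languages(rows, languages):
    """Plain-Python analogue of .filter(col("language").isin(languages))."""
    wanted = set(languages)
    return [row for row in rows if row["language"] in wanted]


def drop_languages(rows, languages):
    """Plain-Python analogue of .filter(~col("language").isin(languages))."""
    unwanted = set(languages)
    return [row for row in rows if row["language"] not in unwanted]


# Analogue of .select("crawl_date") applied to each filtered result.
kept = [row["crawl_date"] for row in keep_languages(rows, languages)]
dropped = [row["crawl_date"] for row in drop_languages(rows, languages)]

print(kept)
print(dropped)
```

The point is only that the include and exclude cases are the same membership test with a negation in front, which is what `col(...).isin(...)` and `~col(...).isin(...)` already express without a dedicated UDF.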

codecov · 2020-05-13T20:34:19Z

Codecov Report

Merging #463 into master will decrease coverage by 0.05%.
The diff coverage is 94.54%.

@@            Coverage Diff             @@
##           master     #463      +/-   ##
==========================================
- Coverage   76.49%   76.43%   -0.06%     
==========================================
  Files          49       50       +1     
  Lines        1459     1460       +1     
  Branches      279      279              
==========================================
  Hits         1116     1116              
- Misses        213      214       +1     
  Partials      130      130



ianmilligan1

Works like a charm - tried on a CPP sample and it's all perfect.

On the language question, as noted in Slack and now seen firsthand while working through the notebook, regarding your point here:

part of me is tempted to just say that we should go with the natural PySpark (Python) implementation here

We should do this. The working language in the notebook, i.e.

languages = ["es", "fr"]

WebArchive(sc, sqlContext, data)\
  .webpages()\
  .filter(~col("language").isin(languages))\
  .select("crawl_date", Udf.extract_domain("url").alias("domain"), "url", "language")\
  .show(100, True)

Is intuitive and makes sense.

This notebook is great, too. We should host it somewhere (apart from perhaps making data a variable that's passed in, in lieu of your directory, it could plug and play quite nicely as part of a hands-on approach to learning about the new PySpark functionality 🤔).

ruebot · 2020-05-19T17:17:27Z

Yeah, I could clean that notebook up, and toss it in https://github.com/archivesunleashed/notebooks when we're done.

lintool

lgtm!

🎉


Documentation updates for https://github.com/archivesunleashed/aut/pu… (#62)
* Documentation updates for archivesunleashed/aut#463
- See archivesunleashed/aut#463 for more info.

ruebot requested review from lintool and ianmilligan1 May 13, 2020

ruebot mentioned this pull request May 13, 2020

Add some PySpark udfs #412

Closed

ruebot force-pushed the pyspark-imp branch from f013c7b to 9dc3ad7 May 19, 2020

ruebot changed the title ~~Load Scala UDFs from Scala to Python; supersedes #412.~~ Implement Scala Matchbox UDFs in Python. May 19, 2020

ruebot marked this pull request as ready for review May 19, 2020

ruebot mentioned this pull request May 19, 2020

Implement Python versions of Serializable APIs #410

Open

ianmilligan1 approved these changes May 19, 2020

lintool approved these changes May 19, 2020

ianmilligan1 deleted the pyspark-imp branch May 19, 2020
