
Implement Scala Matchbox UDFs in Python. #463

Merged: 1 commit into master from pyspark-imp on May 19, 2020

Conversation

@ruebot (Member) commented May 13, 2020

GitHub issue(s):

What does this Pull Request do?

Implement Scala Matchbox UDFs in Python.

  • Resolves #408
  • Alphabetizes DataFrameLoader functions
  • Alphabetizes UDF functions
  • Moves DataFrameLoader to the df package
  • Moves UDFs out of df into their own package
  • Renames UDFs (no more DF tagged to the end)
  • Updates tests as necessary
  • Partially addresses #410, #409
  • Supersedes #412.

How should this be tested?

Additional Notes:

  1. I made a number of structural changes to the Scala side. @lintool, please let me know if you take strong issue with anything.

  2. I'm going to punt on the hasX filters for now and loop back around to them. I hit a wall trying to get them running in PySpark, and part of me is tempted to say we should just go with the natural PySpark (Python) implementation here. Basically:

from aut import *

languages = ["es", "fr"]

WebArchive(sc, sqlContext, "/data")\
  .webpages()\
  .filter(col("language").isin(languages))\
  .select("crawl_date")\
  .show(10, True)

or

from aut import *

languages = ["es", "fr"]

WebArchive(sc, sqlContext, "/data")\
  .webpages()\
  .filter(~col("language").isin(languages))\
  .select("crawl_date")\
  .show(10, True)

Instead of

from aut import *

languages = ["es", "fr"]

WebArchive(sc, sqlContext, "/data")\
  .webpages()\
  .filter(Udf.has_language("language", languages))\
  .select("crawl_date")\
  .show(10, True)

or

from aut import *

languages = ["es", "fr"]

WebArchive(sc, sqlContext, "/data")\
  .webpages()\
  .filter(~Udf.has_language("language", languages))\
  .select("crawl_date")\
  .show(10, True)

Basically, an argument I made in #425.
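For readers comparing the two styles above, the include/exclude predicate logic can be sketched in plain Python. This is an illustration of the semantics only, not PySpark; the rows and values are made-up examples:

```python
# Plain-Python illustration of the isin / ~isin filter semantics
# shown above. Not PySpark; rows are hypothetical examples.
languages = ["es", "fr"]

webpages = [
    {"crawl_date": "20200513", "language": "es"},
    {"crawl_date": "20200514", "language": "en"},
    {"crawl_date": "20200515", "language": "fr"},
]

# Roughly what .filter(col("language").isin(languages)) selects:
matching = [row for row in webpages if row["language"] in languages]

# Roughly what .filter(~col("language").isin(languages)) selects:
non_matching = [row for row in webpages if row["language"] not in languages]

print([row["crawl_date"] for row in matching])      # ['20200513', '20200515']
print([row["crawl_date"] for row in non_matching])  # ['20200514']
```

The point being: membership tests are already idiomatic in both Python and the Spark Column API, so a dedicated has_language UDF buys little over `isin`.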

@ruebot ruebot requested review from lintool and ianmilligan1 May 13, 2020
@ruebot ruebot mentioned this pull request May 13, 2020
codecov bot commented May 13, 2020

Codecov Report

Merging #463 into master will decrease coverage by 0.05%.
The diff coverage is 94.54%.

@@            Coverage Diff             @@
##           master     #463      +/-   ##
==========================================
- Coverage   76.49%   76.43%   -0.06%     
==========================================
  Files          49       50       +1     
  Lines        1459     1460       +1     
  Branches      279      279              
==========================================
  Hits         1116     1116              
- Misses        213      214       +1     
  Partials      130      130              
ruebot added a commit to archivesunleashed/aut-docs that referenced this pull request May 14, 2020
@ruebot ruebot force-pushed the pyspark-imp branch from f013c7b to 9dc3ad7 May 19, 2020
@ruebot ruebot changed the title Load Scala UDFs from Scala to Python; supersedes #412. Implement Scala Matchbox UDFs in Python. May 19, 2020
@ruebot ruebot marked this pull request as ready for review May 19, 2020
ianmilligan1 (Member) left a comment

Works like a charm; tried it on a CPP sample and it's all perfect.

On the language question, as noted in Slack and now seen firsthand while working through the notebook, regarding your question here:

part of me is tempted to just say that we should go with the natural PySpark (Python) implementation here

We should do this. The working code in the notebook, i.e.

languages = ["es", "fr"]

WebArchive(sc, sqlContext, data)\
  .webpages()\
  .filter(~col("language").isin(languages))\
  .select("crawl_date", Udf.extract_domain("url").alias("domain"), "url", "language")\
  .show(100, True)

is intuitive and makes sense.
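(Aside for readers following along: the `Udf.extract_domain` call above is the Scala Matchbox UDF exposed to Python by this PR. Its effect can be approximated in plain Python with `urllib.parse`; this urllib-based version is only an illustration, not the actual Matchbox implementation.)

```python
from urllib.parse import urlparse

def extract_domain(url: str) -> str:
    """Rough approximation of the Matchbox extract_domain UDF:
    return the host portion of a URL, minus a leading 'www.'."""
    host = urlparse(url).netloc
    return host[4:] if host.startswith("www.") else host

print(extract_domain("https://www.archivesunleashed.org/about/"))
# archivesunleashed.org
```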

This notebook is great, too. We should host it somewhere; apart from perhaps making data a variable that's passed in instead of your hard-coded directory, it could plug and play quite nicely as part of a hands-on approach to learning the new PySpark functionality 🤔.

@ruebot (Member, Author) commented May 19, 2020

Yeah, I could clean that notebook up, and toss it in https://github.com/archivesunleashed/notebooks when we're done.

lintool (Member) left a comment

lgtm!

🎉

@ianmilligan1 ianmilligan1 merged commit 69007e2 into master May 19, 2020
2 of 3 checks passed
codecov/project: 76.43% (-0.06%) compared to 1d01571
codecov/patch: 94.54% of diff hit (target 76.49%)
continuous-integration/travis-ci/pr: The Travis CI build passed
@ianmilligan1 ianmilligan1 deleted the pyspark-imp branch May 19, 2020
ruebot added a commit to archivesunleashed/aut-docs that referenced this pull request May 19, 2020
Documentation updates for archivesunleashed/aut#463 (#62)

- See archivesunleashed/aut#463 for more info.