Implement Scala Matchbox UDFs in Python. #463
codecov bot commented May 13, 2020
Codecov Report
@@ Coverage Diff @@
## master #463 +/- ##
==========================================
- Coverage 76.49% 76.43% -0.06%
==========================================
Files 49 50 +1
Lines 1459 1460 +1
Branches 279 279
==========================================
Hits 1116 1116
- Misses 213 214 +1
Partials 130 130
Works like a charm - tried on a CPP sample and it's all perfect. For the language question, as noted in Slack and now seen firsthand when working through the notebook, I think re: your question here:
We should do this. The working language in the notebook, i.e.

languages = ["es", "fr"]
WebArchive(sc, sqlContext, data)\
  .webpages()\
  .filter(~col("language").isin(languages))\
  .select("crawl_date", Udf.extract_domain("url").alias("domain"), "url", "language")\
  .show(100, True)

is intuitive and makes sense. This notebook is great, too. We should host it somewhere (apart from perhaps making
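For readers who have not used the toolkit, the `Udf.extract_domain("url")` call in the snippet above turns a full URL into a domain column. The following is only a rough, standard-library sketch of that kind of transformation, not the aut implementation: the function body, the lowercasing, and the stripping of a leading "www." are all assumptions made for illustration.

```python
from urllib.parse import urlparse


def extract_domain(url: str) -> str:
    """Hypothetical stand-in: return the host of a URL without a leading "www."."""
    candidate = url.strip()
    # urlparse only populates the network location when a scheme or "//"
    # is present, so bare hosts such as "example.com/page" get a "//" prefix.
    if "://" not in candidate and not candidate.startswith("//"):
        candidate = "//" + candidate
    # .hostname is lowercased and excludes any port or credentials.
    host = urlparse(candidate).hostname or ""
    # Assumption: "www.example.com" and "example.com" should group together.
    if host.startswith("www."):
        host = host[len("www."):]
    return host


if __name__ == "__main__":
    for sample in ("https://www.archive.org/details/x", "example.com/page"):
        print(sample, "->", extract_domain(sample))
```

In a PySpark job a plain function like this would still need to be wrapped as a UDF before it could be used inside `.select(...)`; how aut actually wires this up is not shown in this thread.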
Yeah, I could clean that notebook up, and toss it in https://github.com/archivesunleashed/notebooks when we're done.
lgtm!
#62) * Documentation updates for archivesunleashed/aut#463 - See archivesunleashed/aut#463 for more info.
ruebot commented May 13, 2020 • edited
GitHub issue(s):
What does this Pull Request do?
Implement Scala Matchbox UDFs in Python.
How should this be tested?
Additional Notes:
I made a number of structural changes to the Scala side. @lintool, please let me know if you take strong issue with anything.
I'm going to punt on the hasX filters for right now, and loop back around to them. I hit a wall with trying to get them to run in PySpark, and part of me is tempted to just say that we should go with the natural PySpark (Python) implementation here. Basically:
or
Instead of
or
Basically, an argument I made in #425.