Add some PySpark udfs #412

Draft: wants to merge 2 commits into base: master

ruebot (Member) commented Jan 18, 2020

GitHub issue(s): #409

What does this Pull Request do?

  • Add remove_http_header, remove_prefix_www
  • Rename extract_domain_func to extract_domain
  • Formatting updates
  • Addresses #409
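
For readers who have not opened the diff, here is a minimal sketch of what helpers with these three names might look like. The behaviour shown is an assumption based on the function names, not a copy of the code in this PR: `extract_domain` is assumed to return the host of a URL, `remove_prefix_www` to strip a leading `www.`, and `remove_http_header` to drop everything up to the first blank line of an HTTP response. The `as_spark_udfs` helper is hypothetical and only illustrates wrapping plain functions with PySpark's `udf`.

```python
from urllib.parse import urlparse

# Assumption: HTTP headers end at the first blank line (CRLF CRLF).
HEADER_END = "\r\n\r\n"


def extract_domain(url):
    """Return the host of a URL, or an empty string if none can be parsed."""
    if not url:
        return ""
    return urlparse(url).hostname or ""


def remove_prefix_www(domain):
    """Strip a leading 'www.' from a domain, leaving other values untouched."""
    if domain is None:
        return None
    return domain[4:] if domain.startswith("www.") else domain


def remove_http_header(content):
    """Drop the HTTP header block from a raw response, if one is present."""
    if content is None:
        return None
    if content.startswith("HTTP/"):
        end = content.find(HEADER_END)
        if end != -1:
            return content[end + len(HEADER_END):]
    return content


def as_spark_udfs():
    """Hypothetical helper: wrap the plain functions as PySpark string UDFs."""
    # Imported lazily so the plain functions work without PySpark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    functions = (extract_domain, remove_prefix_www, remove_http_header)
    return {fn.__name__: udf(fn, StringType()) for fn in functions}
```

Keeping the logic in plain functions and wrapping them separately means they can be checked in a notebook or unit test without starting a Spark session.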

How should this be tested?

https://gist.github.com/ruebot/e50892b0b2b4a6abad8ddc7933cf79b2

Additional Notes:

  • We need to sort out how we'll bundle the dependencies that a requirements.txt would normally cover into the zip file; something like the Uberjar. Right now, we really don't have anything, but I imagine we'd want to pull in external libraries like Beautiful Soup, or tld-extractor.

  • I'll leave this as a draft, and push to it as I'm working on it. Others are welcome to push to it as well, since GitHub is now set up to give credit to every account that contributed to a PR when it is squashed down.
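
One possible direction for the bundling question in the first note, sketched under assumptions: install the pure-Python dependencies into a local directory (for example with `pip install --target`), then zip that directory together with the project's own Python sources and pass the result to `spark-submit --py-files`. The function name `build_py_bundle` and the directory layout in the usage note are hypothetical; this is not an existing build step in the project.

```python
import os
import zipfile


def build_py_bundle(source_dirs, out_zip):
    """Zip every file under each source directory into one archive.

    Paths inside the archive are relative to the directory they came from,
    so top-level packages stay importable from the zip.
    """
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as bundle:
        for root_dir in source_dirs:
            for folder, _subdirs, files in os.walk(root_dir):
                for name in sorted(files):
                    if name.endswith(".pyc"):
                        continue  # skip compiled bytecode caches
                    full_path = os.path.join(folder, name)
                    bundle.write(full_path, os.path.relpath(full_path, root_dir))
    return out_zip
```

A call such as `build_py_bundle(["src/main/python", "vendor"], "bundle.zip")` (both paths are placeholders) would produce a single archive. Note that packages with compiled extensions generally cannot be imported from a zip, so this approach would only cover pure-Python libraries.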

Interested parties

@SinghGursimran if you're sick of Scala, let me know, and I can give you access.

@lintool this approach fine? Naming convention fine?

@ianmilligan1 let me know if the notebook testing makes sense. Figured that'd be easy to test this stuff.


codecov bot commented Jan 18, 2020

Codecov Report

Merging #412 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #412   +/-   ##
=======================================
  Coverage   77.56%   77.56%           
=======================================
  Files          40       40           
  Lines        1542     1542           
  Branches      292      292           
=======================================
  Hits         1196     1196           
  Misses        218      218           
  Partials      128      128

SinghGursimran (Collaborator) commented Jan 18, 2020

@ruebot, yes, sure, please give me access.

ianmilligan1 (Member) commented Jan 18, 2020

> @ianmilligan1 let me know if the notebook testing makes sense. Figured that'd be easy to test this stuff.

Makes sense to me! Thanks @ruebot.
