
Add some PySpark udfs #412

Closed
wants to merge 31 commits into from

Conversation

@ruebot
Member

ruebot commented Jan 18, 2020

GitHub issue(s): #408

What does this Pull Request do?

  • Add remove_http_header, remove_prefix_www
  • Rename extract_domain_func to extract_domain
  • Formatting updates
  • Addresses #408
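For context, PySpark UDFs of this kind are usually plain Python functions wrapped with `pyspark.sql.functions.udf`. A minimal sketch of what `remove_prefix_www` and `extract_domain` might look like — the real implementations live in `src/main/python/aut/udfs.py`; the logic here is illustrative, not copied from the PR:

```python
import re
from urllib.parse import urlparse

def remove_prefix_www(url):
    """Strip a leading 'www.' from a hostname string."""
    return re.sub(r"^www\.", "", url or "")

def extract_domain(url):
    """Pull the hostname out of a full URL, dropping any 'www.' prefix."""
    host = urlparse(url).netloc
    return remove_prefix_www(host)

# In a Spark job these would be registered as UDFs, e.g.:
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import StringType
#   extract_domain_udf = udf(extract_domain, StringType())
#   df.withColumn("domain", extract_domain_udf(df.url))
```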

How should this be tested?

https://gist.github.com/ruebot/e50892b0b2b4a6abad8ddc7933cf79b2

Additional Notes:

  • We need to sort out how we'll bundle dependencies — everything a requirements.txt would cover — into the zip file; something like the Uberjar on the Scala side. Right now we don't really have anything, but I imagine we'd want to pull in external libraries like Beautiful Soup or tld-extractor.

  • I'll leave this as a draft and push to it as I'm working on it. Others are welcome to push to it as well, since GitHub is now set up to credit all accounts that contributed to a PR when it is squashed down.
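On the dependency-bundling note above: one common approach (a sketch only, not what this PR ships) is to pip-install the requirements into a directory, zip it up, and hand the archive to `spark-submit` via `--py-files`. File names and paths here are illustrative:

```shell
# Illustrative only: bundle Python dependencies for Spark, similar in
# spirit to the Scala uberjar. Assumes a requirements.txt exists.
pip install -r requirements.txt -t deps/
(cd deps && zip -r ../deps.zip .)

# Ship the archive alongside the aut Python package (hypothetical names):
spark-submit --py-files deps.zip,aut.zip your_job.py
```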

Interested parties

@SinghGursimran if you're sick of Scala, let me know, and I can give you access.

@lintool this approach fine? Naming convention fine?

@ianmilligan1 let me know if the notebook testing makes sense. Figured that'd be easy to test this stuff.

- Add remove_http_header, remove_prefix_www
- Rename extract_domain_func to extract_domain
- Formatting updates
- Addresses #409
@codecov

codecov bot commented Jan 18, 2020

Codecov Report

Merging #412 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #412   +/-   ##
=======================================
  Coverage   76.49%   76.49%           
=======================================
  Files          49       49           
  Lines        1459     1459           
  Branches      279      279           
=======================================
  Hits         1116     1116           
  Misses        213      213           
  Partials      130      130           
@SinghGursimran
Collaborator

SinghGursimran commented Jan 18, 2020

@ruebot, yes, sure, please give me access.

@ianmilligan1
Member

ianmilligan1 commented Jan 18, 2020

@ianmilligan1 let me know if the notebook testing makes sense. Figured that'd be easy to test this stuff.

Makes sense to me! Thanks @ruebot.

ruebot and others added 3 commits Jan 18, 2020
@ruebot

Member

ruebot commented on src/main/python/aut/udfs.py in 5830ab9 Jan 19, 2020

I have a version of this with BeautifulSoup on a local branch that I haven't pushed up yet. That might work better; we can test and see which method works best later. In the interim, I'm close to a packaging solution with dependencies on my branch that I'll hopefully be able to push up soon. Then we can pull in external libraries and make life a whole lot easier 😃

Collaborator

SinghGursimran replied Jan 19, 2020

Great!!

ruebot added 3 commits Jan 21, 2020
Add remove_html udf
Rename remove_http_header to remove_http_headers
…ue-409
- Get detect_language setup
- ComputeSHA1 and MD5 need some work?
@ruebot
Member Author

ruebot commented Jan 21, 2020

Updated the testing notebook. Hopefully it renders. It might be getting too big with content in it 🤷‍♂️

https://gist.github.com/ruebot/e50892b0b2b4a6abad8ddc7933cf79b2

src/main/python/aut/udfs.py (outdated review thread)
from textblob import TextBlob


def compute_MD5(bytes):

@ruebot

ruebot Jan 21, 2020

Author Member

I'm getting a TypeError: Unicode-objects must be encoded before hashing on this and compute_SHA1.
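That error means `hashlib` was handed a Python `str` where it expects bytes. A hedged sketch of the usual fix — not necessarily what landed in the PR — is to encode strings before hashing:

```python
import hashlib

def compute_MD5(data):
    # hashlib expects a bytes-like object; a Spark string column
    # arrives as str, so encode it first when needed.
    if isinstance(data, str):
        data = data.encode("utf-8")
    return hashlib.md5(data).hexdigest()

def compute_SHA1(data):
    if isinstance(data, str):
        data = data.encode("utf-8")
    return hashlib.sha1(data).hexdigest()
```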

@SinghGursimran

SinghGursimran Jan 21, 2020

Collaborator

Oh! It was working on my system...

@ruebot

ruebot Jan 21, 2020

Author Member

Interesting. What version of Python are you running? ...and you're on Windows too iirc?

@SinghGursimran

SinghGursimran Jan 21, 2020

Collaborator

I tested it on Linux, where I have Python 2.8. I will test on Windows as well; I have Python 3.6 there.

@ruebot

ruebot Jan 21, 2020

Author Member

Ah, ok. I'm on 3.7.3 on Linux. I wonder if that's it. (Using the Anaconda distribution of Python)

@SinghGursimran

SinghGursimran Jan 21, 2020

Collaborator

Oh ok, I will update python to 3.7 and test with it.

@SinghGursimran

SinghGursimran Jan 24, 2020

Collaborator

Hi @ruebot,
Did you try running this in the terminal? There's a timeout issue in the Jupyter notebook.

Something similar to what's mentioned here: https://www.idaima.org/topic/2447498/getting-error-caused-by-java-net-sockettimeoutexception-accept-timed-out/2

It is working in the terminal. I could not find any other Python library that would give the MD5 or SHA1 hashes. Currently, I am trying to make it work in Jupyter.

def extract_domain_func(url):

def detect_language(input):
text = TextBlob(input)

@ruebot

ruebot Jan 21, 2020

Author Member

This doesn't look like it's going to scale: "Language translation and detection powered by Google Translate"

I ran it twice in testing on the 10 Geocities files I use locally, and I got hit with a: urllib.error.HTTPError: HTTP Error 429: Too Many Requests.

This all raises a bigger question about porting a lot of these over to Python: do we need to? We already provide MD5, SHA1, and Language columns. Is there a use case for having them in Python? I can't think of a reason to run them on a column we already provide. @ianmilligan1 @lintool what do you think?

@ianmilligan1

ianmilligan1 Jan 21, 2020

Member

I can't think of a reason to run them on a column if we already provide them. @ianmilligan1 @lintool what do you think?

Agreed with you - makes sense to not port in these cases (and especially in the particular case here).

ruebot added 4 commits Jan 21, 2020
- TODO, make it object oriented
@ruebot ruebot added this to In Progress in DataFrames and PySpark Feb 5, 2020
@ruebot ruebot added this to In Progress in 1.0.0 Release of AUT Feb 5, 2020
ruebot added 10 commits Feb 6, 2020
…ue-409
ruebot added 9 commits Mar 26, 2020
…lementations.
ruebot added a commit that referenced this pull request May 13, 2020
@ruebot
Member Author

ruebot commented May 13, 2020

Superseded by #463

@ruebot ruebot closed this May 13, 2020
Projects
1.0.0 Release of AUT
  
In Progress
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants