
Add some PySpark udfs #412

Draft · wants to merge 8 commits into master

Conversation

ruebot (Member) commented Jan 18, 2020

GitHub issue(s): #409

What does this Pull Request do?

  • Add remove_http_header, remove_prefix_www
  • Rename extract_domain_func to extract_domain
  • Formatting updates
  • Addresses #409

How should this be tested?

https://gist.github.com/ruebot/e50892b0b2b4a6abad8ddc7933cf79b2

Additional Notes:

  • We need to sort out how we'll bundle everything a requirements.txt would cover into the zip file; something like the Uberjar on the Scala side. Right now we really don't have anything, but I imagine we'd want to pull in external libraries like Beautiful Soup, or tld-extractor. (A rough packaging sketch follows this list.)

  • I'll leave this as a draft and push to it as I'm working on it. Others are welcome to push to it as well, since GitHub is now set up to credit all of the accounts that contributed to a PR when it is squashed down.
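
As a rough sketch of that packaging approach (illustrative only; the deps.zip name and the build steps are assumptions, not anything in this PR), pure-Python dependencies installed into a flat directory and zipped up could be shipped to the executors with SparkContext.addPyFile:

# Build the bundle ahead of time, e.g.:
#   pip install -r requirements.txt -t deps/
#   cd deps && zip -r ../deps.zip . && cd ..
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("aut").getOrCreate()

# Ship deps.zip to every executor and add it to the Python path, so
# UDFs can import the bundled (pure-Python) libraries.
spark.sparkContext.addPyFile("deps.zip")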

Interested parties

@SinghGursimran if you're sick of Scala, let me know, and I can give you access.

@lintool this approach fine? Naming convention fine?

@ianmilligan1 let me know if the notebook testing makes sense. Figured that'd be easy to test this stuff.

codecov bot commented Jan 18, 2020

Codecov Report

Merging #412 into master will increase coverage by 0.14%.
The diff coverage is 100%.

@@            Coverage Diff            @@
##           master    #412      +/-   ##
=========================================
+ Coverage   77.56%   77.7%   +0.14%     
=========================================
  Files          40      40              
  Lines        1542    1552      +10     
  Branches      292     292              
=========================================
+ Hits         1196    1206      +10     
  Misses        218     218              
  Partials      128     128
SinghGursimran (Collaborator) commented Jan 18, 2020

@ruebot, yes, sure, please give me access.

ianmilligan1 (Member) commented Jan 18, 2020

> @ianmilligan1 let me know if the notebook testing makes sense. Figured that'd be easy to test this stuff.

Makes sense to me! Thanks @ruebot.

ruebot and others added 3 commits Jan 18, 2020

ruebot (Member) commented on src/main/python/aut/udfs.py in 5830ab9 Jan 19, 2020

I have a version of this with BeautifulSoup on my local branch that I haven't pushed up yet. That might work better; we can test and see which method works better later. In the interim, I'm close to a packaging solution with dependencies on my branch that I'll hopefully be able to push up soon. Then we can pull in external libraries and make life a whole lot easier 😃
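
For reference, a minimal sketch of what a BeautifulSoup-backed remove_html could look like (assuming bs4 can be shipped to the executors; an illustration, not the branch's actual code):

from bs4 import BeautifulSoup
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType


def remove_html(content):
    # Parse the markup and keep only the visible text nodes.
    return BeautifulSoup(content, "html.parser").get_text()


remove_html = udf(remove_html, StringType())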

SinghGursimran (Collaborator) replied Jan 19, 2020

Great!!

ruebot added 3 commits Jan 21, 2020

  • Add remove_html udf
  • Rename remove_http_header to remove_http_headers
  • …ue-409
      - Get detect_language setup
      - ComputeSHA1 and MD5 need some work?

ruebot (Member, Author) commented Jan 21, 2020

Updated the testing notebook. Hopefully it renders. It might be getting too big with content in it 🤷‍♂

https://gist.github.com/ruebot/e50892b0b2b4a6abad8ddc7933cf79b2

import re

def remove_html_no_external_lib(content):
    # str.replace treats the pattern literally; re.sub applies it as a regex.
    return re.sub(r"[\r\n]+", " ", content)


remove_html_no_external_lib = udf(remove_html_no_external_lib, StringType())

ruebot (Member, Author) commented Jan 21, 2020

I think we still end up with HTML in this one, after looking at the output. You cool if we remove it and go the BeautifulSoup route?

from textblob import TextBlob


def compute_MD5(bytes):

ruebot (Member, Author) commented Jan 21, 2020

I'm getting a TypeError: Unicode-objects must be encoded before hashing on this and compute_MD5.
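
For what it's worth, a minimal sketch of the usual fix for that TypeError (illustrative only; it also renames the bytes parameter so it doesn't shadow the builtin):

import hashlib

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType


def compute_MD5(content):
    # hashlib expects bytes, so a str column value must be encoded first.
    return hashlib.md5(content.encode("utf-8")).hexdigest()


compute_MD5 = udf(compute_MD5, StringType())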

def extract_domain_func(url):

def detect_language(input):
    text = TextBlob(input)

ruebot (Member, Author) commented Jan 21, 2020

This doesn't look like it's going to scale: "Language translation and detection powered by Google Translate"

I ran it twice in testing on the 10 GeoCities files I use locally, and I got hit with a urllib.error.HTTPError: HTTP Error 429: Too Many Requests.

This all raises a bigger question about porting a lot of these over to Python: do we need to? We already provide MD5, SHA1, and Language columns. Is there a use case for having them in Python? I can't think of a reason to run them on a column if we already provide them. @ianmilligan1 @lintool what do you think?
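
If a Python version of this is kept at all, one way around the rate limit would be a local detector. A sketch using the langdetect package (a suggested swap, not anything in this branch):

from langdetect import detect
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType


def detect_language(content):
    # langdetect is a local port of Google's language-detection library,
    # so there is no HTTP call to rate-limit.
    return detect(content)


detect_language = udf(detect_language, StringType())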
