Add some PySpark udfs #412
codecov bot commented Jan 18, 2020
Codecov Report
@@ Coverage Diff @@
## master #412 +/- ##
=======================================
Coverage 76.49% 76.49%
=======================================
Files 49 49
Lines 1459 1459
Branches 279 279
=======================================
Hits 1116 1116
Misses 213 213
Partials 130 130
@ruebot, yes, sure, please give me access.
Makes sense to me! Thanks @ruebot.
I have a version of this with BeautifulSoup on my local branch that I haven't pushed up yet. That might work better; we can test and see which method works better later. In the interim, I'm close to a packaging solution with dependencies on my branch that I'll hopefully be able to push up soon. Then we can pull in external libraries and make life a whole lot easier.
Great!!
Updated the testing notebook. Hopefully it renders. It might be getting too big with content in it: https://gist.github.com/ruebot/e50892b0b2b4a6abad8ddc7933cf79b2
from textblob import TextBlob


def compute_MD5(bytes):
ruebot (Author, Member), Jan 21, 2020
I'm getting a TypeError: Unicode-objects must be encoded before hashing on this and compute_MD5.
ruebot (Author, Member), Jan 21, 2020
Interesting. What version of Python are you running? ...and you're on Windows too iirc?
SinghGursimran (Collaborator), Jan 21, 2020
I tested it on Linux, where I have Python 2.8. I will test on Windows as well; I have Python 3.6 there.
ruebot (Author, Member), Jan 21, 2020
Ah, ok. I'm on 3.7.3 on Linux. I wonder if that's it. (Using the Anaconda distribution of Python)
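Editor's note: the TypeError reported in this thread is consistent with Python 3's hashlib, which accepts only bytes-like input, whereas a Python 2 str is already a byte string. The following is a minimal sketch of helpers that behave the same for str and bytes input. The names _to_bytes, md5_hex, and sha1_hex are hypothetical and are not the functions in this PR, and the UTF-8 default is an assumption.

```python
import hashlib


def _to_bytes(data, encoding="utf-8"):
    """Return data as bytes, encoding it first if it is a str.

    The UTF-8 default is an assumption; use whatever encoding the
    source content actually has.
    """
    if isinstance(data, str):
        return data.encode(encoding)
    # bytes, bytearray, and memoryview are passed through as bytes.
    return bytes(data)


def md5_hex(data):
    """Hex MD5 digest of a str or bytes-like value (hypothetical helper)."""
    return hashlib.md5(_to_bytes(data)).hexdigest()


def sha1_hex(data):
    """Hex SHA1 digest of a str or bytes-like value (hypothetical helper)."""
    return hashlib.sha1(_to_bytes(data)).hexdigest()
```

Helpers like these could presumably be wrapped with pyspark.sql.functions.udf and a StringType return type, though that wrapping is not exercised here.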
SinghGursimran (Collaborator), Jan 24, 2020
Hi @ruebot,
Did you try running this in the terminal? There's a timeout issue in Jupyter Notebook.
Something similar to what's mentioned here: https://www.idaima.org/topic/2447498/getting-error-caused-by-java-net-sockettimeoutexception-accept-timed-out/2
It works in the terminal. I could not find any other Python library that would give the MD5 or SHA1 hashes. Currently, I am trying to make it work in Jupyter.
def extract_domain_func(url):


def detect_language(input):
    text = TextBlob(input)
ruebot (Author, Member), Jan 21, 2020
This doesn't look like it's going to scale: "Language translation and detection powered by Google Translate"
I ran it twice in testing on the 10 Geocities files I use locally, and I got hit with a: urllib.error.HTTPError: HTTP Error 429: Too Many Requests.
This all raises a bigger question about porting a lot of these over to Python: do we need to? We provide an MD5, SHA1, and Language column now. Is there a use case for having them in Python? I can't think of a reason to run them on a column if we already provide them. @ianmilligan1 @lintool what do you think?
ianmilligan1 (Member), Jan 21, 2020
I can't think of a reason to run them on a column if we already provide them. @ianmilligan1 @lintool what do you think?
Agreed with you - makes sense to not port in these cases (and especially in the particular case here).
Superseded by #463
ruebot commented Jan 18, 2020 (edited)
GitHub issue(s): #408
What does this Pull Request do?
How should this be tested?
https://gist.github.com/ruebot/e50892b0b2b4a6abad8ddc7933cf79b2
Additional Notes:
We need to sort out how we'll bundle everything that requirements.txt would help with into the zip file; something like the Uberjar. Right now, we really don't have anything, but I imagine we'd want to pull in external libraries like Beautiful Soup, or tld-extractor.
I'll leave this as a draft, and push to it as I'm working on it. Others are welcome to push to it as well, since GitHub is now set up to provide credit to all those accounts who contributed to a PR when it is squashed down.
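Editor's note: one possible way to do the bundling described above is to install the dependencies into a directory, zip that directory, and ship the zip with the job. This is a minimal sketch and not the packaging approach this project settled on. build_deps_zip, the "deps" directory, and fake_dep are hypothetical names. It assumes pure-Python dependencies only; packages with compiled extensions generally cannot be imported from a zip.

```python
import shutil
import sys
import tempfile
from pathlib import Path


def build_deps_zip(package_dir, out_base):
    """Zip the contents of package_dir and return the archive path.

    package_dir is assumed to hold pure-Python packages, for example the
    result of `pip install -r requirements.txt --target <package_dir>`.
    """
    # make_archive appends ".zip" to out_base and returns the final path.
    return shutil.make_archive(str(out_base), "zip", root_dir=str(package_dir))


if __name__ == "__main__":
    work = Path(tempfile.mkdtemp())
    deps = work / "deps"
    deps.mkdir()
    # Stand-in for an installed dependency (hypothetical module).
    (deps / "fake_dep.py").write_text("VALUE = 42\n")

    archive = build_deps_zip(deps, work / "bundle")

    # Python can import directly from a zip file on sys.path.
    sys.path.insert(0, archive)
    import fake_dep

    print(archive, fake_dep.VALUE)
```

A zip built this way could then be handed to Spark with the --py-files option of spark-submit or pyspark, or at runtime with SparkContext.addPyFile.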
Interested parties
@SinghGursimran if you're sick of Scala, let me know, and I can give you access.
@lintool this approach fine? Naming convention fine?
@ianmilligan1 let me know if the notebook testing makes sense. Figured that'd be easy to test this stuff.