Matchbox-pyhton1

g285sing committed Jan 19, 2020
1 parent e310611 commit 5830ab96ae0e679f9ef84b0e35f512b18f6a76b2
Showing with 21 additions and 3 deletions.
  1. +2 −2 src/main/python/aut/__init__.py
  2. +19 −1 src/main/python/aut/udfs.py
src/main/python/aut/__init__.py
@@ -1,5 +1,5 @@
 from aut.common import WebArchive
-from aut.udfs import compute_MD5, extract_domain, remove_http_header, remove_prefix_www
+from aut.udfs import compute_MD5, compute_SHA1, extract_domain, remove_html, remove_http_header, remove_prefix_www

-__all__ = ["WebArchive", "compute_MD5", "extract_domain", "remove_prefix_www",
+__all__ = ["WebArchive", "compute_MD5", "compute_SHA1", "extract_domain", "remove_html", "remove_prefix_www",
            "remove_http_header"]
src/main/python/aut/udfs.py
@@ -1,5 +1,6 @@
 from pyspark.sql.functions import udf
 from pyspark.sql.types import StringType
+#from textblob import TextBlob
 import hashlib


@@ -37,4 +38,21 @@ def remove_http_header(content):
 def compute_MD5(bytes):
     return hashlib.md5(bytes).hexdigest()

-compute_MD5 = udf(compute_MD5, StringType())
+compute_MD5 = udf(compute_MD5, StringType())
+
+def compute_SHA1(bytes):
+    return hashlib.sha1(bytes).hexdigest()
+
+compute_SHA1 = udf(compute_SHA1, StringType())
+
+# def detect_language(input):
+# text = TextBlob(input)
+# return text.detect_language()
+
+# detect_language = udf(detect_language, StringType())
+
+def remove_html(content):

ruebot (Member) commented on Jan 19, 2020:

I have a version of this with BeautifulSoup on my local branch that I haven't pushed up yet. That might work better; we can test and see which method works better later. In the interim, I'm close to a packaging solution with dependencies on my branch that I'll hopefully be able to push up soon. Then we can pull in external libraries and make life a whole lot easier 😃

SinghGursimran (Collaborator) commented on Jan 19, 2020:

Great!!

+    return content.replace("[\\r\\n]+", " ")
+
+remove_html = udf(remove_html, StringType())
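
A note on the `remove_html` body above: Python's `str.replace` matches its first argument as literal text, so `"[\\r\\n]+"` is not treated as a regular expression and ordinary CR/LF characters are left in place. The sketch below contrasts the two behaviours using the standard-library `re` module. The helper names `literal_replace` and `collapse_newlines` and the sample string are hypothetical and not part of `aut`; this is one possible fix under those assumptions, not the project's chosen one.

```python
import re


def literal_replace(content):
    # Same call as in the diff: str.replace searches for the literal
    # text [\r\n]+ (brackets, backslashes and all), not a regex.
    return content.replace("[\\r\\n]+", " ")


def collapse_newlines(content):
    # Hypothetical alternative: re.sub interprets the pattern as a regex,
    # so each run of CR/LF characters becomes a single space.
    return re.sub(r"[\r\n]+", " ", content)


sample = "<p>first line</p>\r\n\r\n<p>second line</p>"
print(literal_replace(sample) == sample)  # the literal version changes nothing here
print(collapse_newlines(sample))
```

Neither variant strips HTML tags; both only deal with line breaks, which is where a parser-based approach such as the BeautifulSoup version mentioned in the review comment might differ.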

0 comments on commit 5830ab9
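
For reference, `compute_MD5` and `compute_SHA1` in this commit wrap `hashlib` hex digests of their input. A minimal sketch of the same calls outside Spark follows; the helper name `hex_digests` and the sample payload are hypothetical, and encoding `str` input as UTF-8 is an assumption rather than something the commit does.

```python
import hashlib


def hex_digests(payload):
    # hashlib expects a bytes-like object; a str is encoded first
    # (UTF-8 is an assumption made for this sketch).
    if isinstance(payload, str):
        payload = payload.encode("utf-8")
    return {
        "md5": hashlib.md5(payload).hexdigest(),
        "sha1": hashlib.sha1(payload).hexdigest(),
    }


digests = hex_digests(b"example record body")
print(digests["md5"])
print(digests["sha1"])
```

Naming the parameter `payload` rather than `bytes`, as the diff does, avoids shadowing the built-in `bytes` type inside the function.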
