Implement Python versions of Matchbox utilities #408

Closed
ruebot opened this issue Jan 17, 2020 · 0 comments

ruebot commented Jan 17, 2020

| RDD | Scala DF | Python DF |
|---|---|---|
| ComputeImageSize | | |
| ComputeMD5RDD | ComputeMD5DF | in progress |
| ComputeSHA1RDD | ComputeSHA1DF | in progress |
| DetectLanguageRDD | DetectLanguageDF | in progress |
| DetectMimeTypeTika | DetectMimeTypeTikaDF | |
| ExtractBoilerPipeTextRDD | ExtractBoilerPipeTextDF | |
| ExtractDateRDD | ExtractDateDF | |
| ExtractDomainRDD | ExtractDomainDF | ✔️ |
| ExtractImageDetails | | |
| ExtractImageLinksRDD | ExtractImageLinksDF | |
| ExtractLinksRDD | ExtractLinksDF | |
| ExtractTextFromPDFs | - | |
| GetExtensionMimeRDD | GetExtensionMimeDF | |
| RemoveHTMLRDD | RemoveHTMLDF | ✔️ |
| RemoveHTTPHeaderRDD | RemoveHTTPHeaderDF | ✔️ |
| NERClassifier | - | |
| RemovePrefixWWW | RemovePrefixWWWDF | ✔️ |

Stealing @SinghGursimran's very helpful tables here 😃
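
For readers unfamiliar with what a few of these utilities do, here is a minimal plain-Python sketch of the kind of logic involved in the hashing, domain, and `www` prefix helpers. The function names (`compute_md5`, `compute_sha1`, `remove_prefix_www`, `extract_domain`) are hypothetical and chosen only for illustration; they are not the project's API, and the exact behavior of the Scala originals is not specified in this issue.

```python
import hashlib
from urllib.parse import urlparse


def compute_md5(content: bytes) -> str:
    """Return the hex MD5 digest of raw bytes (hypothetical helper)."""
    return hashlib.md5(content).hexdigest()


def compute_sha1(content: bytes) -> str:
    """Return the hex SHA-1 digest of raw bytes (hypothetical helper)."""
    return hashlib.sha1(content).hexdigest()


def remove_prefix_www(host: str) -> str:
    """Strip a single leading 'www.' from a host name, if present.

    Assumption: only the literal 'www.' prefix is removed.
    """
    return host[4:] if host.startswith("www.") else host


def extract_domain(url: str) -> str:
    """Return the host part of a URL, or '' when none can be parsed.

    Assumption: 'domain' here means the URL's host name.
    """
    return urlparse(url).hostname or ""
```

One possible way to make helpers like these usable on DataFrames would be to wrap them with `pyspark.sql.functions.udf`; whether the project took that route is not stated in this issue.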

ruebot added this to ToDo in DataFrames and PySpark Feb 5, 2020
ruebot added a commit that referenced this issue May 19, 2020
- Resolves #408
- Alphabetizes DataFrameLoader functions
- Alphabetizes UDF functions
- Move DataFrameLoader to df packages
- Move UDFs out of df into their own package
- Rename UDFs (no more DF tagged to the end).
- Update tests as necessary
- Partially addresses #410, #409
- Supersedes #412.
DataFrames and PySpark automation moved this from ToDo to In review May 19, 2020
ianmilligan1 pushed a commit that referenced this issue May 19, 2020
ruebot moved this from In review to Done in DataFrames and PySpark May 19, 2020