Implement Python versions of Serializable APIs #410

Open
ruebot opened this issue Jan 17, 2020 · 4 comments

Comments

@ruebot (Member) commented Jan 17, 2020

| RDD | Scala DF | Python DF |
|---|---|---|
| keepValidPages | keepValidPagesDF | keepValidPagesDF |
| keepMimeTypes | keepMimeTypesDF | keepMimeTypesDF |
| keepMimeTypesTika | keepMimeTypesTikaDF | |
| keepHttpStatus | keepHttpStatusDF | |
| keepDate | keepDateDF | discardDateDF |
| keepUrls | keepUrlsDF | keepUrlsDF |
| keepUrlPatterns | keepUrlPatternsDF | keepUrlPatternsDF |
| keepImages | keepImagesDF | |
| keepDomains | keepDomainsDF | |
| keepLanguages | keepLanguagesDF | |
| keepContent | keepContentDF | keepContentDF |
| discardMimeTypes | discardMimeTypesDF | DiscardMimeTypesDF |
| discardMimeTypesTika | discardMimeTypesTIkaDF | |
| discardDate | discardDateDF | |
| discardUrls | discardUrlsDF | discardUrlsDF |
| discardHttpStatus | discardHttpStatusDF | |
| discardUrlPatterns | discardUrlPatternsDF | discardUrlPatternsDF |
| discardDomains | discardDomainsDF | |
| discardContent | discardContentDF | discardContentDF |
| discardLanguages | discardLanguagesDF | |

Stealing @SinghGursimran's very helpful tables here 😃

@SinghGursimran (Collaborator) commented Feb 1, 2020

https://gist.github.com/SinghGursimran/2e97126a966b4a5c7f4704e61f0eec82

I have added code for the serializable APIs here. A few are still left; I will add them soon. With Spark 2.4.4, it works in the terminal but not in a Jupyter notebook (there are connection timeout issues with Spark 2.4.4). With Spark 2.3, it works in both.

When the above issue is resolved, I will add the code to the project.

@ruebot ruebot added this to ToDo in DataFrames and PySpark Feb 5, 2020
ruebot added a commit that referenced this issue May 19, 2020
- Resolves #408
- Alphabetizes DataFrameLoader functions
- Alphabetizes UDFs functions
- Move DataFrameLoader to df packages
- Move UDFs out of df into their own package
- Rename UDFs (no more DF tagged to the end).
- Update tests as necessary
- Partially addresses #410, #409
- Supersedes #412.
@ruebot (Member, Author) commented May 19, 2020

I'm proposing that these should not be implemented as originally intended in #463, and should just be implemented via the normal `.filter(col("somecolumn").isin(somearray))` method.
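
For readers following along, here is a minimal sketch of what that filter-based approach could look like in PySpark. It is an illustration under stated assumptions, not code from the toolkit: the helper names `keep` and `discard` are hypothetical, and the column names in the usage comments are placeholders. The sketch selects the column with `df[column]` rather than `col(column)`, which does the same job for a single DataFrame and keeps the helpers free of imports.

```python
def keep(df, column, values):
    """Keep rows whose `column` value appears in `values`.

    `df` is assumed to be a PySpark DataFrame; `keep` is a hypothetical
    helper name, not part of any existing API.
    """
    return df.filter(df[column].isin(list(values)))


def discard(df, column, values):
    """Drop rows whose `column` value appears in `values` (hypothetical helper)."""
    # `~` negates the boolean Column produced by isin().
    return df.filter(~df[column].isin(list(values)))


# Hypothetical usage, assuming `df` has columns with these names:
#   keep(df, "mime_type_web_server", ["text/html"])   # in the spirit of keepMimeTypesDF
#   discard(df, "url", ["http://example.com/"])       # in the spirit of discardUrlsDF
```

One thing to keep in mind with the negated form: in Spark SQL a null compared with `isin` yields null, so rows where the column is null are dropped by both `keep` and `discard`.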

@ianmilligan1 (Member) commented May 19, 2020

As noted in #463, your proposal above has a 👍 from me, @ruebot.

@ruebot (Member, Author) commented May 19, 2020

Cool. We'll mark this closed once we get all the documentation updated.

ianmilligan1 pushed a commit that referenced this issue May 19, 2020
@ruebot ruebot moved this from ToDo to In Progress in DataFrames and PySpark May 19, 2020
@ruebot ruebot self-assigned this May 19, 2020