Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restyle keep/discard filter UDFs in the context of DataFrames #429

Merged
merged 15 commits into from Mar 18, 2020
Merged

Conversation

@ruebot
Copy link
Member

ruebot commented Mar 17, 2020

GitHub issue(s):

What does this Pull Request do?

Restyle keep/discard filter UDFs in the context of DataFrames

How should this be tested?

Additional Notes:

  • @SinghGursimran we still have a bit more to go here I think. I'll loop back around to this tomorrow morning, and double check. But, I think it's just hasImages, and hasMimeTypeTika.
  • I'll start a documentation PR for this tomorrow
g285sing and others added 10 commits Mar 6, 2020
g285sing
g285sing
g285sing
g285sing
g285sing
g285sing
g285sing
g285sing
@codecov

This comment has been minimized.

Copy link

codecov bot commented Mar 17, 2020

Codecov Report

Merging #429 into master will decrease coverage by 0.45%.
The diff coverage is 90.00%.

@@            Coverage Diff             @@
##           master     #429      +/-   ##
==========================================
- Coverage   78.15%   77.70%   -0.46%     
==========================================
  Files          41       41              
  Lines        1584     1534      -50     
  Branches      299      283      -16     
==========================================
- Hits         1238     1192      -46     
+ Misses        218      217       -1     
+ Partials      128      125       -3     
// scalastyle:on

/** Removes all non-html-based data (images, executables, etc.) from html text. */
def keepValidPagesDF(): DataFrame = {

This comment has been minimized.

Copy link
@ruebot

ruebot Mar 17, 2020

Author Member

We should probably keep some version of this. @lintool @ianmilligan1 in the spirit of #425, we should change this. Should it just be hasValidPages?

This comment has been minimized.

Copy link
@ianmilligan1

ianmilligan1 Mar 17, 2020

Member

That works for me!

This comment has been minimized.

Copy link
@ruebot

ruebot Mar 17, 2020

Author Member

Well, maybe that's not the right path. keepValidPagesDF() is different that all of the others. Maybe it should stay as is, but be moved from package.scala to df/package.scala, and rename to .keepValidPages(). But, we might run into problems there because of the existing keepValidPages() in package.scala. Is it worth moving stuff of functions in package.scala to a new rdd/package.scala?

This comment has been minimized.

Copy link
@ruebot

ruebot Mar 17, 2020

Author Member

...or, just restore this small class with keepValidPagesDF and say it's not in scope for issue/PR?

}

/** Removes all data except images. */
def keepImagesDF(): DataFrame = {

This comment has been minimized.

Copy link
@ruebot

ruebot Mar 17, 2020

Author Member

@ianmilligan1 @lintool hasImages() work for y'all?

This comment has been minimized.

Copy link
@ianmilligan1
ruebot added 3 commits Mar 17, 2020
@ruebot

This comment has been minimized.

Copy link
Member Author

ruebot commented Mar 18, 2020

@ianmilligan1 @lintool I got a branch on aut-docs I'm hacking on for documentation updates. Not 100% certain about the path I've taken, but it might work. I'll explain on our call this afternoon.

...and if y'all are good with this, I can squash and merge so we get it in right with all of @SinghGursimran's work here.

Copy link
Member

ianmilligan1 left a comment

Tested on sample data and looks great - fantastic work!

Screen Shot 2020-03-18 at 5 02 07 PM

@ruebot ruebot merged commit e1d908b into master Mar 18, 2020
2 of 3 checks passed
2 of 3 checks passed
codecov/project 77.70% (+-0.46%) compared to 9ce73a8
Details
codecov/patch 90.00% of diff hit (target 78.15%)
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@ruebot ruebot deleted the issue-425 branch Mar 18, 2020
ruebot added a commit to archivesunleashed/aut-docs that referenced this pull request Mar 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants
You can’t perform that action at this time.