Restyle keep/discard filter UDFs in the context of DataFrames #429
codecov bot commented Mar 17, 2020

Codecov Report

@@            Coverage Diff             @@
##           master     #429      +/-   ##
==========================================
- Coverage   78.15%   77.70%   -0.46%
==========================================
  Files          41       41
  Lines        1584     1534      -50
  Branches      299      283      -16
==========================================
- Hits         1238     1192      -46
+ Misses        218      217       -1
+ Partials      128      125       -3
// scalastyle:on

/** Removes all non-html-based data (images, executables, etc.) from html text. */
def keepValidPagesDF(): DataFrame = {
ruebot (Author, Member) commented Mar 17, 2020
We should probably keep some version of this. @lintool @ianmilligan1 in the spirit of #425, we should change this. Should it just be hasValidPages?
ruebot (Author, Member) commented Mar 17, 2020
Well, maybe that's not the right path. keepValidPagesDF() is different from all of the others. Maybe it should stay as is, but be moved from package.scala to df/package.scala and renamed to .keepValidPages(). But we might run into problems there because of the existing keepValidPages() in package.scala. Is it worth moving the functions in package.scala to a new rdd/package.scala?
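For discussion, here is a rough sketch of what the rdd/df split could look like. It is not the AUT code: FakeRDD, FakeDataFrame, Record, RddFilters, DfFilters, and the one-line isValidPage check are all hypothetical stand-ins so the example runs without Spark. It only illustrates that two enrichments can both expose keepValidPages() when their receiver types differ, since the compiler picks the implicit class by receiver type.

```scala
// Hypothetical stand-ins; the real code would use Spark's RDD and DataFrame.
final case class Record(url: String, mimeType: String)
final case class FakeRDD(records: Seq[Record])
final case class FakeDataFrame(rows: Seq[Record])

object Predicates {
  // Simplified validity check for illustration only (assumption, not AUT's rule).
  def isValidPage(r: Record): Boolean = r.mimeType == "text/html"
}

// Stand-in for the proposed rdd/package.scala.
object rdd {
  implicit class RddFilters(val self: FakeRDD) {
    def keepValidPages(): FakeRDD =
      FakeRDD(self.records.filter(Predicates.isValidPage))
  }
}

// Stand-in for df/package.scala.
object df {
  implicit class DfFilters(val self: FakeDataFrame) {
    def keepValidPages(): FakeDataFrame =
      FakeDataFrame(self.rows.filter(Predicates.isValidPage))
  }
}

object Demo {
  // Both enrichments are in scope; each call resolves by receiver type.
  import rdd._
  import df._

  val records: Seq[Record] = Seq(
    Record("http://a", "text/html"),
    Record("http://b/img.png", "image/png")
  )

  val keptRdd: FakeRDD = FakeRDD(records).keepValidPages()
  val keptDf: FakeDataFrame = FakeDataFrame(records).keepValidPages()
}
```

In this toy version the same method name works on both types without a clash. Whether that holds for the existing keepValidPages() in package.scala depends on how the real implicit classes are named and imported, which the sketch does not cover.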
ruebot (Author, Member) commented Mar 17, 2020
...or, just restore this small class with keepValidPagesDF and say it's not in scope for issue/PR?
}

/** Removes all data except images. */
def keepImagesDF(): DataFrame = {
@ianmilligan1 @lintool I've got a branch on aut-docs that I'm hacking on for documentation updates. Not 100% certain about the path I've taken, but it might work. I'll explain on our call this afternoon. ...and if y'all are good with this, I can squash and merge so we get it in right along with all of @SinghGursimran's work here.
ruebot commented Mar 17, 2020 (edited)
GitHub issue(s):
What does this Pull Request do?
Restyle keep/discard filter UDFs in the context of DataFrames
How should this be tested?
Additional Notes:
hasImages, and hasMimeTypeTika.