Various UDF implementation and cleanup for DF #370

Merged
merged 1 commit into from Nov 5, 2019

Conversation

@lintool
Member

lintool commented Oct 26, 2019

Addresses #367 #368 #369

This works:

import io.archivesunleashed._
import io.archivesunleashed.df._

RecordLoader.loadArchives("src/test/resources/warc/example.warc.gz", sc).extractValidPagesDF()
  .select($"url")
  .show(20, false)

RecordLoader.loadArchives("src/test/resources/warc/example.warc.gz", sc).extractValidPagesDF()
  .select(ExtractDomain($"url"))
  .show(20, false)

RecordLoader.loadArchives("src/test/resources/warc/example.warc.gz", sc).extractValidPagesDF()
  .select(RemovePrefixWWW(ExtractDomain($"url")))
  .show(20, false)

val pages = RecordLoader.loadArchives("src/test/resources/warc/example.warc.gz", sc).extractValidPagesDF()
  .select(RemoveHttpHeader($"content"))
  .head(10)

pages(1).getString(0)
// HTML, no header

val pages = RecordLoader.loadArchives("src/test/resources/warc/example.warc.gz", sc).extractValidPagesDF()
  .select(RemoveHTML($"content"))
  .head(10)

pages(1).getString(0)
// Plain text, no header
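
For readers without the test WARC at hand, the chained call in the third example can be approximated on a single value in plain Scala. This is only a minimal sketch under assumptions: `extractDomain` and `removePrefixWWW` are hypothetical stand-ins rather than the toolkit's implementations, and it assumes that "domain" means the URL's host and that the prefix to strip is a single leading `www.`.

```scala
import java.net.URL
import scala.util.Try

// Hypothetical stand-in for ExtractDomain: assumes "domain" means the URL's host.
// Returns an empty string when the URL cannot be parsed.
def extractDomain(url: String): String =
  Try(new URL(url).getHost).getOrElse("")

// Hypothetical stand-in for RemovePrefixWWW: drops one leading "www." if present.
def removePrefixWWW(domain: String): String =
  domain.stripPrefix("www.")

// Same composition as RemovePrefixWWW(ExtractDomain($"url")), applied to one value.
val cleaned = removePrefixWWW(extractDomain("http://www.archive.org/about/"))
println(cleaned) // prints "archive.org"
```

The DataFrame versions apply the same composition to every row of the `url` column.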

I will adjust the camel-casing issues based on the outcome of the discussion in #368.

@lintool lintool requested review from ruebot and ianmilligan1 Oct 26, 2019
@ruebot

Member

ruebot commented Oct 26, 2019

Are we not using the PR template?

@codecov


codecov bot commented Oct 26, 2019

Codecov Report

Merging #370 into master will increase coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #370      +/-   ##
==========================================
+ Coverage   76.24%   76.25%   +0.01%     
==========================================
  Files          40       40              
  Lines        1410     1411       +1     
  Branches      267      267              
==========================================
+ Hits         1075     1076       +1     
- Misses        218      219       +1     
+ Partials      117      116       -1
Member

ianmilligan1 left a comment

LGTM (built and tested locally), but yeah, as @ruebot notes it'd be good to get a bit more detail as we'll have to make sure to update docs, etc. accordingly.

@ianmilligan1

Member

ianmilligan1 commented Oct 28, 2019

When you have a chance, @lintool, can you update the PR description a bit so that we have a straightforward path to revising the docs?

@lintool

Member Author

lintool commented Nov 5, 2019

Discussed in a video chat with @ruebot and @ianmilligan1.

Specifically, this PR:

  • Closes #367 ExtractDomain or ExtractBaseDomain?
  • Punts on #368 UDF CaMeL cASe consistency issues (same for subsequent PR since I don't have time right now).
  • Closes #369 Bug in ArcTest? Why run RemoveHTML? - fixes the bug.
  • Wraps RemoveHttpHeader and RemoveHTML for use in data frames.
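
The last item, wrapping an existing string function for use in data frames, can be pictured with Spark's `udf` helper. This is a minimal sketch, not the code merged in this PR: `stripHttpHeader` and `StripHttpHeaderDF` are hypothetical names, and the rule used here (drop everything up to and including the first blank line) is an assumption about what such a function might do.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical plain function: keep only the body after the first CRLF CRLF.
// Input without that separator (or null) is returned unchanged.
def stripHttpHeader(content: String): String = {
  val sep = "\r\n\r\n"
  val idx = if (content == null) -1 else content.indexOf(sep)
  if (idx >= 0) content.substring(idx + sep.length) else content
}

// Wrapping the function with udf makes it usable on DataFrame columns.
val StripHttpHeaderDF = udf((content: String) => stripHttpHeader(content))

// In spark-shell, getOrCreate() returns the session that already exists.
val session = SparkSession.builder().master("local[1]").appName("udf-sketch").getOrCreate()

// One-row DataFrame with a made-up HTTP response in a "content" column.
val raw = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"
val df = session.createDataFrame(Seq(Tuple1(raw))).toDF("content")

df.select(StripHttpHeaderDF(col("content"))).show(false)
```

Keeping the plain function separate from its `udf` wrapper leaves the string logic testable without a Spark session.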

Pending sign off from @ruebot and we can merge.

@ruebot
ruebot approved these changes Nov 5, 2019
@ruebot ruebot merged commit 6686519 into master Nov 5, 2019
3 checks passed
codecov/patch 100% of diff hit (target 76.24%)
codecov/project 76.25% (+0.01%) compared to 4e8b41d
continuous-integration/travis-ci/pr The Travis CI build passed
@ruebot ruebot deleted the refactoring branch Nov 5, 2019
3 participants