Various UDF implementation and cleanup for DF #370

Open: wants to merge 1 commit into base: master

lintool commented Oct 26, 2019

Addresses #367 #368 #369

With this change, the following works:

import io.archivesunleashed._
import io.archivesunleashed.df._

RecordLoader.loadArchives("src/test/resources/warc/example.warc.gz", sc).extractValidPagesDF()
  .select($"url")
  .show(20, false)

RecordLoader.loadArchives("src/test/resources/warc/example.warc.gz", sc).extractValidPagesDF()
  .select(ExtractDomain($"url"))
  .show(20, false)

RecordLoader.loadArchives("src/test/resources/warc/example.warc.gz", sc).extractValidPagesDF()
  .select(RemovePrefixWWW(ExtractDomain($"url")))
  .show(20, false)

val pages = RecordLoader.loadArchives("src/test/resources/warc/example.warc.gz", sc).extractValidPagesDF()
  .select(RemoveHttpHeader($"content"))
  .head(10)

pages(1).getString(0)
// HTML, no header

val pages = RecordLoader.loadArchives("src/test/resources/warc/example.warc.gz", sc).extractValidPagesDF()
  .select(RemoveHTML($"content"))
  .head(10)

pages(1).getString(0)
// Plain text, no header

I will adjust the camel-casing issues based on the outcome of the discussion in #368.

lintool requested review from ruebot and ianmilligan1 on Oct 26, 2019

ruebot commented Oct 26, 2019

Are we not using the PR template?

codecov bot commented Oct 26, 2019

Codecov Report

Merging #370 into master will increase coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #370      +/-   ##
==========================================
+ Coverage   76.24%   76.25%   +0.01%     
==========================================
  Files          40       40              
  Lines        1410     1411       +1     
  Branches      267      267              
==========================================
+ Hits         1075     1076       +1     
- Misses        218      219       +1     
+ Partials      117      116       -1

ianmilligan1 left a comment

LGTM (built and tested locally), but yeah, as @ruebot notes, it'd be good to get a bit more detail, since we'll have to make sure to update the docs, etc. accordingly.
