Update PlainTextExtractor to output a single column; text. #453

Merged: 3 commits, Apr 22, 2020

ruebot commented Apr 21, 2020

GitHub issue(s): #452

What does this Pull Request do?

Update PlainTextExtractor to output a single column: text.

  • Resolves #452
  • PlainTextExtractor runs RemoveHTML and ExtractBoilerplate on
    content
  • Update test

How should this be tested?

  • TravisCI
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor PlainTextExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/452-test/plaintext
@ruebot ruebot requested review from lintool and ianmilligan1 Apr 21, 2020

codecov bot commented Apr 21, 2020

Codecov Report

Merging #453 into master will decrease coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #453      +/-   ##
==========================================
- Coverage   76.72%   76.70%   -0.02%     
==========================================
  Files          49       49              
  Lines        1422     1421       -1     
  Branches      264      264              
==========================================
- Hits         1091     1090       -1     
  Misses        215      215              
  Partials      116      116              
ruebot added a commit to archivesunleashed/aut-docs that referenced this pull request Apr 21, 2020
ruebot commented Apr 21, 2020

Documentation PR: archivesunleashed/aut-docs#58


ianmilligan1 left a comment

Something seems to have gone awry: I get the two part files when running on data, but all I'm seeing is row after row of "".

[Screenshot attached: Screen Shot 2020-04-22 at 2 03 45 PM]

I used this command in case I did something wrong:

bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /Users/ianmilligan1/dropbox/git/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor PlainTextExtractor --input /Users/ianmilligan1/dropbox/git/aut-resources/Sample-Data/*.gz --output /Users/ianmilligan1/desktop/results/plaintext
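
To put a number on "row after row of empty strings", a check along these lines could count the empty rows in the part files. This is a minimal sketch in plain Scala and not part of aut: `EmptyRowCheck`, `isEmptyRow`, and `countEmpty` are hypothetical names, and it assumes one record per line, so it would miscount if the extracted text contains embedded newlines.

```scala
// Hypothetical helper, not part of aut: counts empty rows in CSV output lines.
object EmptyRowCheck {
  // A row counts as empty when it is blank or just a pair of quotes ("").
  def isEmptyRow(line: String): Boolean = {
    val trimmed = line.trim
    trimmed.isEmpty || trimmed == "\"\""
  }

  // Returns (number of empty rows, total number of rows).
  def countEmpty(lines: Seq[String]): (Int, Int) =
    (lines.count(isEmptyRow), lines.size)
}
```

The lines of a single part file could be fed in with `scala.io.Source.fromFile(path).getLines().toSeq`, where `path` points at one of the files under the `--output` directory.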
ruebot commented Apr 22, 2020

That's right. If you're using the test warc, there should only be one line with text.

ianmilligan1 commented Apr 22, 2020

Oh, I'm using data from the CPP collection, and the output doesn't appear to have any data for the whole collection, whereas yes, there's like two records that come out in the text extractor. There are a lot of junk records you'd expect to see removed, but there are some legit URLs that aren't coming through. That is, I'm getting 46KB of text, whereas running the old WebPages gives 25MB (albeit with more columns).

Let me poke at this a bit - I am pretty sure I've run BoilerPipe on these WARCs before and the results have been bigger than this.

@@ -32,7 +32,6 @@ object PlainTextExtractor {
 // scalastyle:off
 import spark.implicits._
 // scalastyle:on
-d.select($"crawl_date", ExtractDomainDF($"url").as("domain"),
-$"url", RemoveHTMLDF(RemoveHTTPHeaderDF($"content")).as("text"))
+d.select(ExtractBoilerpipeTextDF(RemoveHTMLDF($"content")).as("content"))


ruebot Apr 22, 2020


def apply(input: String): String = {
  removeBoilerplate(RemoveHTTPHeaderRDD(input))
}

We really don't have any documentation on using that from what I can tell.

Maybe we shouldn't be calling RemoveHTMLDF before calling ExtractBoilerpipeTextDF here?

@ianmilligan1 @lintool


ianmilligan1 Apr 22, 2020


Yeah, I think you're right on that @ruebot. From comparing our docs, when we do regular text extract in DF, we use a command like:

  .select($"crawl_date", ExtractDomainDF($"url").as("domain"), $"url", RemoveHTMLDF(RemoveHTTPHeaderDF($"content")).as("content"))

so yes the headers are removed and then the HTML is removed. Whereas in the boilerpipe we use:

  .select($"crawl_date", ExtractDomainDF($"url"), $"url", ExtractBoilerpipeTextDF(RemoveHTTPHeaderDF($"content")))

So we omit the HTML step and run Boilerpipe on the HTTP-headerless content.
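
The ordering problem described above can be illustrated without Spark. The following is a minimal sketch in plain Scala using toy stand-ins: `removeHttpHeader`, `removeHtml`, and `extractMainText` are hypothetical functions written for this illustration, not the aut UDFs, and the "keep only `<p>` text" rule is a deliberately crude assumption that is far simpler than what Boilerpipe does. It only shows why a filter that relies on markup has little to work with once the markup has been stripped.

```scala
// Toy stand-ins for illustration only; these are NOT the aut or Boilerpipe implementations.
object PipelineOrderSketch {
  // Drop the HTTP header block: everything up to and including the first blank line.
  def removeHttpHeader(raw: String): String = {
    val idx = raw.indexOf("\r\n\r\n")
    if (idx >= 0) raw.substring(idx + 4) else raw
  }

  // Strip anything that looks like a tag and collapse whitespace.
  def removeHtml(html: String): String =
    html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim

  // Assumed, simplistic boilerplate filter: keep only text found inside <p>...</p>.
  private val Paragraph = "(?s)<p[^>]*>(.*?)</p>".r
  def extractMainText(html: String): String =
    Paragraph.findAllMatchIn(html).map(m => removeHtml(m.group(1))).mkString(" ")

  // A made-up record: HTTP headers, a navigation div, and one paragraph of content.
  val record: String =
    "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n" +
      "<html><body><div>Home | About</div><p>Archived page text.</p></body></html>"

  // HTML removed first: the filter no longer sees any markup and returns "".
  val htmlRemovedFirst: String = extractMainText(removeHtml(record))

  // Header removed only: the filter sees the markup and returns "Archived page text."
  val headerRemovedOnly: String = extractMainText(removeHttpHeader(record))
}
```

With these toy functions, the first ordering yields an empty string and the second keeps the paragraph while dropping the navigation text, which mirrors the rows of "" reported earlier in the thread.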


ruebot Apr 22, 2020


Cool. It should just be ExtractBoilerpipeTextDF then, since that calls ExtractBoilerpipeTextRDD which runs RemoveHTTPHeaderRDD before running removeBoilerplate.

I'll update this, the test, and push it up shortly.


ianmilligan1 Apr 22, 2020


Perfect! And I've got the output now from the shell comparator, so I can quickly see how it more or less lines up.


ianmilligan1 commented Apr 22, 2020

Tested with the DF script from here:

import io.archivesunleashed._
import io.archivesunleashed.df._

RecordLoader.loadArchives("/Users/ianmilligan1/dropbox/git/aut-resources/Sample-Data/*.gz", sc)
  .webpages()
  .select($"crawl_date", ExtractDomainDF($"url"), $"url", ExtractBoilerpipeTextDF(RemoveHTTPHeaderDF($"content")))
  .write.csv("plain-text-no-boilerplate-df-testing-453/")

And yes, the results are more robust (i.e., scrolling through the CSV, obvious boilerplate has been removed at a glance, but content is still there in many cases; far more than when using the PlainTextExtractor app).

ruebot added 2 commits Apr 22, 2020

ianmilligan1 left a comment

Works well now - thanks @ruebot!

@ianmilligan1 ianmilligan1 merged commit e91d01f into master Apr 22, 2020
1 check was pending
continuous-integration/travis-ci/pr The Travis CI build is in progress
@ianmilligan1 ianmilligan1 deleted the issue-452 branch Apr 22, 2020
ianmilligan1 pushed a commit to archivesunleashed/aut-docs that referenced this pull request Apr 22, 2020
#58)

* Documentation update for archivesunleashed/aut#453