Update PlainTextExtractor to output a single column; text. #453
Conversation
codecov bot commented Apr 21, 2020
Codecov Report
@@            Coverage Diff             @@
##           master     #453      +/-   ##
==========================================
- Coverage   76.72%   76.70%   -0.02%
==========================================
  Files          49       49
  Lines        1422     1421       -1
  Branches      264      264
==========================================
- Hits         1091     1090       -1
  Misses        215      215
  Partials      116      116
Documentation PR: archivesunleashed/aut-docs#58
Something seems to have gone awry - I get the two part files when running on data, but all I'm seeing is row after row of empty output. I used this command in case I did something wrong:

bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /Users/ianmilligan1/dropbox/git/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor PlainTextExtractor --input /Users/ianmilligan1/dropbox/git/aut-resources/Sample-Data/*.gz --output /Users/ianmilligan1/desktop/results/plaintext
That's right. If you're using the test WARC, there should only be one line with text.
https://github.com/archivesunleashed/aut/pull/453/files#diff-617068d8eb9f49b4cac9249793a2d409R48 - Line 35 in the output will have text.
Oh, I'm using data from the CPP collection - and it doesn't appear to have any data in the whole collection, whereas yes, there are like two records that come out in the text extractor. There are a lot of junk records you'd expect to see removed, but there are some legit URLs that aren't coming through; i.e. I'm getting 46KB of text, whereas running the old extractor gave much more. Let me poke at this a bit - I am pretty sure I've run BoilerPipe on these WARCs before and the results have been bigger than this.
@@ -32,7 +32,6 @@ object PlainTextExtractor {
     // scalastyle:off
     import spark.implicits._
     // scalastyle:on
-    d.select($"crawl_date", ExtractDomainDF($"url").as("domain"),
-      $"url", RemoveHTMLDF(RemoveHTTPHeaderDF($"content")).as("text"))
+    d.select(ExtractBoilerpipeTextDF(RemoveHTMLDF($"content")).as("content"))
ruebot (Author, Member) commented Apr 22, 2020
We really don't have any documentation on using that from what I can tell. Maybe we shouldn't be calling RemoveHTMLDF before calling ExtractBoilerpipeTextDF here?
ianmilligan1 (Member) commented Apr 22, 2020
Yeah, I think you're right on that @ruebot. From comparing our docs, when we do regular text extraction in the DataFrame version, we use a command like:

.select($"crawl_date", ExtractDomainDF($"url").as("domain"), $"url", RemoveHTMLDF(RemoveHTTPHeaderDF($"content")).as("content"))

So yes, the headers are removed and then the HTML is removed. Whereas for Boilerpipe we use:

.select($"crawl_date", ExtractDomainDF($"url"), $"url", ExtractBoilerpipeTextDF(RemoveHTTPHeaderDF($"content")))

So we omit the HTML step and run Boilerpipe on the HTTP-headerless content.
ruebot (Author, Member) commented Apr 22, 2020
Cool. It should just be ExtractBoilerpipeTextDF then, since that calls ExtractBoilerpipeTextRDD, which runs RemoveHTTPHeaderRDD before running removeBoilerplate. I'll update this, the test, and push it up shortly.
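In other words, a minimal sketch of what the simplified call could look like in spark-shell, assuming ExtractBoilerpipeTextDF strips HTTP headers internally as described above; the content alias follows the diff earlier in this conversation, and the paths are placeholders, so the merged code may differ in detail.

import io.archivesunleashed._
import io.archivesunleashed.df._

// ExtractBoilerpipeTextDF is applied directly to the raw content,
// with no RemoveHTMLDF or RemoveHTTPHeaderDF wrapping.
RecordLoader.loadArchives("/path/to/warcs/*.gz", sc)
  .webpages()
  .select(ExtractBoilerpipeTextDF($"content").as("content"))
  .write.csv("/path/to/output/")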
ianmilligan1 (Member) commented Apr 22, 2020
Perfect! And I've got the output now from the shell comparator, so I can quickly see how it more or less lines up.
Tested with the DF script from here:

import io.archivesunleashed._
import io.archivesunleashed.df._

RecordLoader.loadArchives("/Users/ianmilligan1/dropbox/git/aut-resources/Sample-Data/*.gz", sc)
  .webpages()
  .select($"crawl_date", ExtractDomainDF($"url"), $"url", ExtractBoilerpipeTextDF(RemoveHTTPHeaderDF($"content")))
  .write.csv("plain-text-no-boilerplate-df-testing-453/")

And yes, results are more robust (i.e. from scrolling through the CSV at a glance, obvious boilerplate has been removed but content is still there in many cases; exponentially more than when using the earlier version).
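As a quick sanity check, one could read the CSV output back in spark-shell to eyeball the extracted text and row count; this is a sketch only, with the path matching the .write.csv call in the script above.

// Read the extractor output back and spot-check a sample.
val results = spark.read.csv("plain-text-no-boilerplate-df-testing-453/")
results.show(10, 80)  // 10 rows, cells truncated to 80 characters
println(s"Rows extracted: ${results.count()}")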
Works well now - thanks @ruebot!
archivesunleashed/aut-docs#58: Documentation update for archivesunleashed/aut#453
ruebot commented Apr 21, 2020
GitHub issue(s): #452
What does this Pull Request do?
Update PlainTextExtractor to output a single text column, content.
How should this be tested?
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor PlainTextExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/452-test/plaintext