Update PlainTextExtractor to output a single column; text. #453
Conversation
codecov bot commented Apr 21, 2020
Codecov Report
@@            Coverage Diff             @@
##           master     #453      +/-   ##
==========================================
- Coverage   76.72%   76.70%   -0.02%
==========================================
  Files          49       49
  Lines        1422     1421       -1
  Branches      264      264
==========================================
- Hits         1091     1090       -1
  Misses        215      215
  Partials      116      116
Documentation PR: archivesunleashed/aut-docs#58
Something seems to have gone awry - I get the two part files when running on data, but all I'm seeing is row after row of empty output. I used this command in case I did something wrong:

bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /Users/ianmilligan1/dropbox/git/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor PlainTextExtractor --input /Users/ianmilligan1/dropbox/git/aut-resources/Sample-Data/*.gz --output /Users/ianmilligan1/desktop/results/plaintext
That's right. If you're using the test WARC, there should only be one line with text.
https://github.com/archivesunleashed/aut/pull/453/files#diff-617068d8eb9f49b4cac9249793a2d409R48 - Line 35 in the output will have text.
Oh, I'm using data from the CPP collection - and it doesn't appear to have any data in the whole collection, whereas yes, there are like two records that come out in the text extractor. There are a lot of junk records you'd expect to see removed, but there are some legit URLs that aren't coming through; i.e. I'm getting 46KB of text, whereas running the old extractor gave much more. Let me poke at this a bit - I am pretty sure I've run BoilerPipe on these WARCs before and the results have been bigger than this.
@@ -32,7 +32,6 @@ object PlainTextExtractor {
     // scalastyle:off
     import spark.implicits._
     // scalastyle:on
-    d.select($"crawl_date", ExtractDomainDF($"url").as("domain"),
-      $"url", RemoveHTMLDF(RemoveHTTPHeaderDF($"content")).as("text"))
+    d.select(ExtractBoilerpipeTextDF(RemoveHTMLDF($"content")).as("content"))
ruebot (Author, Member) commented Apr 22, 2020
We really don't have any documentation on using that from what I can tell. Maybe we shouldn't be calling RemoveHTMLDF before calling ExtractBoilerpipeTextDF here?
ianmilligan1 (Member) commented Apr 22, 2020
Yeah, I think you're right on that @ruebot. From comparing our docs, when we do regular text extraction in the DataFrame version, we use a command like:

.select($"crawl_date", ExtractDomainDF($"url").as("domain"), $"url", RemoveHTMLDF(RemoveHTTPHeaderDF($"content")).as("content"))

So yes, the headers are removed and then the HTML is removed. Whereas for Boilerpipe we use:

.select($"crawl_date", ExtractDomainDF($"url"), $"url", ExtractBoilerpipeTextDF(RemoveHTTPHeaderDF($"content")))

So we omit the HTML step and run Boilerpipe on the HTTP-headerless content.
ruebot (Author, Member) commented Apr 22, 2020
Cool. It should just be ExtractBoilerpipeTextDF then, since that calls ExtractBoilerpipeTextRDD, which runs RemoveHTTPHeaderRDD before running removeBoilerplate. I'll update this, the test, and push it up shortly.
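In other words, a minimal sketch of what the simplified call could look like in spark-shell, assuming ExtractBoilerpipeTextDF strips HTTP headers internally as described above; the content alias follows the diff earlier in this conversation, and the paths are placeholders, so the merged code may differ in detail.

import io.archivesunleashed._
import io.archivesunleashed.df._

// ExtractBoilerpipeTextDF is applied directly to the raw content,
// with no RemoveHTMLDF or RemoveHTTPHeaderDF wrapping.
RecordLoader.loadArchives("/path/to/warcs/*.gz", sc)
  .webpages()
  .select(ExtractBoilerpipeTextDF($"content").as("content"))
  .write.csv("/path/to/output/")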
ianmilligan1 (Member) commented Apr 22, 2020
Perfect! And I've got the output now from the shell comparator, so I can quickly see how it more or less lines up.
Tested with the DF script from here:

import io.archivesunleashed._
import io.archivesunleashed.df._

RecordLoader.loadArchives("/Users/ianmilligan1/dropbox/git/aut-resources/Sample-Data/*.gz", sc)
  .webpages()
  .select($"crawl_date", ExtractDomainDF($"url"), $"url", ExtractBoilerpipeTextDF(RemoveHTTPHeaderDF($"content")))
  .write.csv("plain-text-no-boilerplate-df-testing-453/")

And yes, results are more robust (i.e. from scrolling through the CSV at a glance, obvious boilerplate has been removed but content is still there in many cases; exponentially more than when using the earlier version).
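As a quick sanity check, one could read the CSV output back in spark-shell to eyeball the extracted text and row count; this is a sketch only, with the path matching the .write.csv call in the script above.

// Read the extractor output back and spot-check a sample.
val results = spark.read.csv("plain-text-no-boilerplate-df-testing-453/")
results.show(10, 80)  // 10 rows, cells truncated to 80 characters
println(s"Rows extracted: ${results.count()}")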
Works well now - thanks @ruebot!
archivesunleashed/aut-docs#58: Documentation update for archivesunleashed/aut#453
ruebot commented Apr 21, 2020
GitHub issue(s): #452
What does this Pull Request do?
Update PlainTextExtractor to output a single text column, content.
How should this be tested?
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.60.1-SNAPSHOT-fatjar.jar --extractor PlainTextExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/452-test/plaintext