Add office document binary extraction. #346

Merged
merged 16 commits into master from 303-304-305 on Aug 16, 2019

Conversation

ruebot (Member) commented Aug 15, 2019

GitHub issue(s): #303, #304, #305

What does this Pull Request do?

  • Add WordProcessor DF and binary extraction
  • Add Spreadsheets DF and binary extraction
  • Add Presentation Program DF and binary extraction
  • Add tests for new DF and binary extractions
  • Add test fixture for new DF and binary extractions
  • Resolves #303
  • Resolves #304
  • Resolves #305
  • Back out 39831c2 (We might not have to do this)

How should this be tested?

  • TravisCI
  • I tested on the 10 GeoCities WARCs; here is a whole bunch of info:
$ rm -rf ~/.m2/repository/* && mvn clean install && rm -rf ~/.ivy2/* && time ~/bin/spark-2.4.3-bin-hadoop2.7/bin/spark-shell --master local\[10\] --driver-memory 35g --conf spark.network.timeout=10000000 --conf spark.executor.heartbeatInterval=600s --conf spark.driver.maxResultSize=4g --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.shuffle.compress=true --conf spark.rdd.compress=true --packages io.archivesunleashed:aut:0.17.1-SNAPSHOT -i ~/office-document-extraction.scala
import io.archivesunleashed._
import io.archivesunleashed.df._

val df_ss = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc).extractSpreadsheetDetailsDF();
val res_ss = df_ss.select($"bytes", $"extension").saveToDisk("bytes", "/home/nruest/Projects/au/sample-data/306-307-test/spreadsheet", "extension")

val df_pp = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc).extractPresentationProgramDetailsDF();
val res_pp = df_pp.select($"bytes", $"extension").saveToDisk("bytes", "/home/nruest/Projects/au/sample-data/306-307-test/presentation", "extension")

val df_word = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc).extractWordProcessorDetailsDF();
val res_word = df_word.select($"bytes", $"extension").saveToDisk("bytes", "/home/nruest/Projects/au/sample-data/306-307-test/document", "extension")

sys.exit

Additional Notes:

  • Do we want to include plain text (txt) files in Word Processor?
  • Is it worth putting some NOT conditionals in these? Like ! r._1 == "text/html" or ! r._1.startsWith("image/") (see the sketch below). Or is it worth leaving some of this noise in there?
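For illustration only, here is a minimal sketch of what those NOT conditionals could look like in plain Scala, assuming records are (MIME type, URL) pairs as the r._1 usage above suggests (stand-in data, not the actual aut filter code):

// Sketch only: stand-in (mimeType, url) records, not aut's real record type.
val records = Seq(
  ("text/html", "http://example.com/page.html"),
  ("image/png", "http://example.com/logo.png"),
  ("application/vnd.oasis.opendocument.text", "http://example.com/report.odt")
)

// Keep records that are neither HTML nor images.
val filtered = records.filter { r =>
  r._1 != "text/html" && !r._1.startsWith("image/")
}
// filtered now holds only the OpenDocument text record.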
ruebot requested review from ianmilligan1 and jrwiebe Aug 15, 2019

codecov bot commented Aug 15, 2019

Codecov Report

Merging #346 into master will decrease coverage by 3.53%.
The diff coverage is 43.49%.

@@            Coverage Diff             @@
##           master     #346      +/-   ##
==========================================
- Coverage    75.2%   71.66%   -3.54%     
==========================================
  Files          39       38       -1     
  Lines        1230     1426     +196     
  Branches      224      330     +106     
==========================================
+ Hits          925     1022      +97     
- Misses        214      245      +31     
- Partials       91      159      +68
ruebot added 2 commits Aug 15, 2019
ianmilligan1 (Member) commented Aug 15, 2019

Will review in a few minutes!

Re:

Do we want to include plain text (txt) files in Word Processor?

I think it'd probably make sense to include txt files. That said, if so, maybe we should consider renaming extractWordProcessorDetailsDF to something like extractTextDocumentDetailsDF?

Or we could create a new one just for plain text files? 🤔

ruebot (Member Author) commented Aug 15, 2019

I think it would make sense to have a separate method for plain text files. I can add that here if there is consensus.

jrwiebe (Contributor) commented Aug 15, 2019

I agree re. a separate method for plain text files.

ianmilligan1 (Member) commented Aug 15, 2019

Separate method sounds good to me too. Thanks, @ruebot !

ianmilligan1 (Member) left a comment

Tested on sample data and worked really well. Word documents and presentations worked perfectly (doc, rtf, and pps). Some interesting finds already.


We did get one false positive spreadsheet which was actually an HTML document: gist here if you're curious.

ruebot (Member Author) commented Aug 15, 2019

@ianmilligan1 yeah, I got some js files and a couple of other odd file types listed in the gist I linked in the original comment. I could toss some NOT conditionals in there for some of this stuff.

jrwiebe (Contributor) commented Aug 15, 2019

This looks good. It looks pretty familiar, as I had a branch addressing #303 and/or #304, but I got stymied by the dependency issue and seem to have lost that work (I probably messed up some git stuff locally and re-cloned aut). We could remove most of the ors and replace them with something like documentMimes.contains(r._2), creating a corresponding list of mime types residing outside of the method definition, but I don't feel strongly about that.
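As a rough sketch of that refactor (hypothetical names, and assuming a record exposes its MIME type as r._2, per the line above; this is not the actual aut code):

// A shared set of document MIME types kept outside the method definition.
val documentMimes = Set(
  "application/msword",
  "application/rtf",
  "application/vnd.oasis.opendocument.text",
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
)

// r is assumed to be a (url, mimeType) tuple.
def isWordProcessorDoc(r: (String, String)): Boolean =
  documentMimes.contains(r._2)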

I removed the classifier from the pom and the build went fine. For the record, the reason it didn't work for @ruebot (he already knows this) is that the CDN used by JitPack to serve the jars cached an older version of the file we need, so his build was not getting the shaded artifact.

jrwiebe (Contributor) commented Aug 15, 2019

The false positives come from filtering on file extension. I think we could remove most of those, except for "tsv" and "csv".

ruebot (Member Author) commented Aug 15, 2019

@jrwiebe we actually need those. If I don't have them, those tests will fail. Tika doesn't identify a bunch of those files in the test fixture. I originally had the method you suggested, and then went with the extension/URL endsWith.

I'll create an enhancement ticket for breaking all those conditionals out after we get this merged. That's a good idea.

jrwiebe (Contributor) commented Aug 15, 2019

@ruebot You don't have a log of the failed tests, do you? I'm interested to know which files Tika failed to identify correctly, and what it identified them as (application/octet-stream?).

ruebot (Member Author) commented Aug 15, 2019

iirc application/zip

ruebot (Member Author) commented Aug 15, 2019

Ok, give that a test again. I haven't tested it other than unit tests since I have to run now. I'll put it through its paces again later this afternoon, and see if I can sort out the shading issue on my end again.

jrwiebe (Contributor) commented Aug 15, 2019

@ruebot The unit tests failed when you weren't filtering on file extension because:
a) the word processor extractor was filtering on mime "text/rtf" instead of "application/rtf", and
b) the spreadsheet extractor doesn't detect CSV or TSV (CSV detection might be in the works, see TIKA-2826).

I pushed a fix to (a), and removed the rest of the file extension filtering from the word processor document extractor. For (b), I left in the filtering on CSV and TSV extensions, but removed the other formats. Now the tests pass.

BTW, should we add the dot before extensions when using endsWith()? Currently our spreadsheet filter, for example, would evaluate to true for something like "http://example.com/fileformatsoftheages/csv". Also, URLs could contain uppercase; we should probably insert a .toLowerCase before .endsWith.
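A small sketch of that suggestion (illustrative helper name, not part of aut):

// Include the dot and lowercase the URL before the extension check.
def hasExtension(url: String, ext: String): Boolean =
  url.toLowerCase.endsWith("." + ext.toLowerCase)

hasExtension("http://example.com/budget.CSV", "csv")               // true
hasExtension("http://example.com/fileformatsoftheages/csv", "csv") // false: no dot before "csv"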

ruebot (Member Author) commented Aug 15, 2019

@jrwiebe good call on the extensions. I was going to add .txt, etc., in front of each one, but second-guessed myself. Do you want to do that, or want me to?

jrwiebe (Contributor) commented Aug 15, 2019

@ruebot Go for it. I'm done with this for today.

ianmilligan1 (Member) left a comment

While this worked when I originally reviewed, it's now failing on the previous test script with:

java.lang.NoSuchMethodError: org.apache.commons.compress.archivers.ArchiveStreamFactory.detect(Ljava/io/InputStream;)Ljava/lang/String;
	at org.apache.tika.parser.pkg.ZipContainerDetector.detectArchiveFormat(ZipContainerDetector.java:189)
	at org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:115)
	at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)
	at org.apache.tika.Tika.detect(Tika.java:156)
	at org.apache.tika.Tika.detect(Tika.java:203)
	at io.archivesunleashed.matchbox.DetectMimeTypeTika$.apply(DetectMimeTypeTika.scala:40)
etc. etc.

This is with

rm -rf ~/.m2/repository/* && mvn clean install

for build and

rm -rf ~/.ivy2/* && ./bin/spark-shell --driver-memory 8g --conf spark.network.timeout=10000000 --conf spark.executor.heartbeatInterval=600s --conf spark.driver.maxResultSize=4g --packages io.archivesunleashed:aut:0.17.1-SNAPSHOT

for shell. I can test again to see if this works on my machine once the classifier issues are sorted out?

ruebot (Member Author) commented Aug 15, 2019

@ianmilligan1 yeah, I'm still getting the same thing

jrwiebe (Contributor) commented Aug 15, 2019

@ianmilligan1 @ruebot Revert this change (i.e., put the classifier back in). You guys are getting the cached tika-parsers JAR (see my description of this problem). This might break use of --packages, though.

ruebot (Member Author) commented Aug 15, 2019

@jrwiebe yeah, I've been fiddling with the JitPack interface in the morning/afternoon and keep on getting that cached JAR. We'll kick this can down the road, and hopefully resolve it before the release 🤞

ruebot added 2 commits Aug 16, 2019
ruebot (Member Author) commented Aug 16, 2019

I'm getting significantly less output after 8028aa3.

  • Original output (w/o text files extraction)
  • 8028aa39aa53b8f27168c0ad49e9adf582cfda54 output (note, there are no presentation files extracted. This is what I was experiencing early on, and probably didn't explain well here)
  • Updated commit (we have presentation files back, better results, but a bit of noise. Not sure why I'm getting MIME types that I say not to give, but file --mime-type could have a different magic file than a web server and/or Tika 🤷‍♂) ...anyway, I have an idea that I tossed into #343 and we can clean things up better in that resolution.
jrwiebe (Contributor) commented Aug 16, 2019

Try running tika-app detect on those same files.

java -jar /path/to/tika-app-1.22.jar -d [files]
ruebot (Member Author) commented Aug 16, 2019

[nruest@rho:aut-test-fixtures]$ ls -1                                   
aut-test-fixtures.odp
aut-test-fixtures.pptx
aut-text
aut-text.txt
test-aut-fixture.ods
test-aut-fixtures.docx
'test-aut-fixture - Sheet1.csv'
'test-aut-fixture - Sheet1.tsv'
test-aut-fixtures.odt
test-aut-fixtures.rtf
test-aut-fixture.xlsx

[nruest@rho:aut-test-fixtures]$ java -jar ~/bin/tika-app-1.22.jar -d *  
Aug 15, 2019 9:21:03 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

Aug 15, 2019 9:21:04 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
application/vnd.oasis.opendocument.presentation
application/vnd.openxmlformats-officedocument.presentationml.presentation
text/plain
text/plain
application/vnd.oasis.opendocument.spreadsheet
application/vnd.openxmlformats-officedocument.wordprocessingml.document
text/csv
text/tab-separated-values
application/vnd.oasis.opendocument.text
application/rtf
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
ruebot (Member Author) commented Aug 16, 2019

hrm, if I test with that tika-parsers commit you added, things blow up after a while, and I get this error:

java.nio.file.FileSystemException: /tmp/apache-tika-4408340307210314122.tmp: Too many open files

...with a whole lot of apache-tika-adfasdfasdfasdfdasw.tmp files in my /tmp dir.

I also noticed that Spark behaved a little bit differently. I had stages with your commit, whereas with the original state of the pom in the PR, I don't have stages. It goes straight to tasks.

I'll try and gist everything up here before I go to bed.

ruebot (Member Author) commented Aug 16, 2019

Dealing with this Tika dependency thing over the last 6 months


jrwiebe (Contributor) commented Aug 16, 2019

Dealing with this Tika dependency thing over the last 6 months


Yup

ruebot (Member Author) commented Aug 16, 2019

rm -rf ~/.m2/repository/* && mvn clean install && rm -rf ~/.ivy2/* && time ~/bin/spark-2.4.3-bin-hadoop2.7/bin/spark-shell --master local\[10\] --driver-memory 35g --conf spark.network.timeout=10000000 --conf spark.executor.heartbeatInterval=600s --conf spark.driver.maxResultSize=4g --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.shuffle.compress=true --conf spark.rdd.compress=true --packages io.archivesunleashed:aut:0.17.1-SNAPSHOT -i ~/office-document-extraction.scala
jrwiebe (Contributor) commented Aug 16, 2019

hrm, if I test with that tika-parsers commit you added, things blow up after a while, and I get this error:

java.nio.file.FileSystemException: /tmp/apache-tika-4408340307210314122.tmp: Too many open files

...with a whole lot of apache-tika-adfasdfasdfasdfdasw.tmp files in my /tmp dir.

I probably should have been close()-ing each TikaInputStream in DetectMimeTypeTika. I'll push this fix in a sec assuming it passes the build tests.
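For context, the general pattern would be something like the sketch below (only an illustration of closing the stream in a finally block, not the actual DetectMimeTypeTika patch):

// Sketch only: always close the TikaInputStream so Tika's spooled
// /tmp/apache-tika-*.tmp files get cleaned up.
import java.io.ByteArrayInputStream
import org.apache.tika.Tika
import org.apache.tika.io.TikaInputStream

def detectMime(bytes: Array[Byte], tika: Tika): String = {
  val tis = TikaInputStream.get(new ByteArrayInputStream(bytes))
  try {
    tika.detect(tis)
  } finally {
    tis.close()
  }
}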

jrwiebe (Contributor) commented Aug 16, 2019

There shouldn't be any difference in behaviour between the two tika-parsers artifacts. The only difference besides where they are downloaded from is that one has a classifier as part of its versioning.

jrwiebe and others added 3 commits Aug 16, 2019
ruebot (Member Author) commented Aug 16, 2019

@ianmilligan1 ready for the big old rm -rf test process again. If it works for you, ping me, and I'll write up a big ol' commit message.

jrwiebe (Contributor) commented Aug 16, 2019

I ran a variant of the above office-document-extraction.scala last night, built from 8a124d3. The script was something I had around from testing yesterday, so it was missing extractTextFilesDetailsDF. I ran it on /tuna1/scratch/nruest/geocites/warcs/1/*.gz. Is this the same sample you were using, @ruebot?

This was on tuna, so the resources available were different, but it didn't blow up. It took about four hours.

Extraction stats:

jrwiebe@tuna:~$ ls office-extract-test/spreadsheet-*|wc -l
1291
jrwiebe@tuna:~$ ls office-extract-test/presentation-*|wc -l
2840
jrwiebe@tuna:~$ ls office-extract-test/document-*|wc -l
8732
jrwiebe@tuna:~$ ls office-extract-test/*|wc -l
12863
jrwiebe@tuna:~$ cat aut/office-document-extraction.time.txt 
real    168m19.665s
user    997m59.700s
sys 153m2.312s
ruebot (Member Author) commented Aug 16, 2019

@jrwiebe no, I just have 10 GeoCities WARCs on my desktop at home that I use for testing. But all those numbers look relatively in line with what I've seen on my end.

ianmilligan1 (Member) left a comment

Tested on sample data and this works nicely – great work both!

ianmilligan1 merged commit c824ad8 into master Aug 16, 2019
1 of 3 checks passed
codecov/patch 43.49% of diff hit (target 75.2%)
codecov/project 71.66% (-3.54%) compared to 39831c2
continuous-integration/travis-ci/pr The Travis CI build passed
ianmilligan1 deleted the 303-304-305 branch Aug 16, 2019