Add office document binary extraction. #346

ruebot · Aug 15, 2019

GitHub issue(s):

What does this Pull Request do?

Add WordProcessor DF and binary extraction
Add Spreadsheets DF and binary extraction
Add Presentation Program DF and binary extraction
Add tests for new DF and binary extractions
Add test fixture for new DF and binary extractions
Resolves #303
Resolves #304
Resolves #305
Back out 39831c2 (We might not have
to do this)

How should this be tested?

TravisCI
I tested on the 10 GeoCities WARCs, here is a whole bunch of info

$ rm -rf ~/.m2/repository/* && mvn clean install && rm -rf ~/.ivy2/* && time ~/bin/spark-2.4.3-bin-hadoop2.7/bin/spark-shell --master local\[10\] --driver-memory 35g --conf spark.network.timeout=10000000 --conf spark.executor.heartbeatInterval=600s --conf spark.driver.maxResultSize=4g --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.shuffle.compress=true --conf spark.rdd.compress=true --packages io.archivesunleashed:aut:0.17.1-SNAPSHOT -i ~/office-document-extraction.scala

import io.archivesunleashed._
import io.archivesunleashed.df._

val df_ss = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc).extractSpreadsheetDetailsDF();
val res_ss = df_ss.select($"bytes", $"extension").saveToDisk("bytes", "/home/nruest/Projects/au/sample-data/306-307-test/spreadsheet", "extension")

val df_pp = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc).extractPresentationProgramDetailsDF();
val res_pp = df_pp.select($"bytes", $"extension").saveToDisk("bytes", "/home/nruest/Projects/au/sample-data/306-307-test/presentation", "extension")

val df_word = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc).extractWordProcessorDetailsDF();
val res_word = df_word.select($"bytes", $"extension").saveToDisk("bytes", "/home/nruest/Projects/au/sample-data/306-307-test/document", "extension")

sys.exit

Additional Notes:

Do we want to include plain text (txt) files in Word Processor?
Is it work putting some NOT conditionals in these? Like ! r._1 == "text/html" or ! r._1.startsWith("image/"). Or, is it worth leaving some of this noise in there?

codecov · Aug 15, 2019

Codecov Report

Merging #346 into master will decrease coverage by 3.53%.
The diff coverage is 43.49%.

@@            Coverage Diff             @@
##           master     #346      +/-   ##
==========================================
- Coverage    75.2%   71.66%   -3.54%     
==========================================
  Files          39       38       -1     
  Lines        1230     1426     +196     
  Branches      224      330     +106     
==========================================
+ Hits          925     1022      +97     
- Misses        214      245      +31     
- Partials       91      159      +68

ianmilligan1 · Aug 15, 2019

Will review in a few minutes!

Re:

Do we want to include plain text (txt) files in Word Processor?

I think it'd probably make sense to include txt files. That said, if so, maybe we should consider renaming extractWordProcessorDetailsDF to something like extractTextDocumentDetailsDF?

Or we could create a new one just for plain text files? 🤔

ruebot · Aug 15, 2019

I think it would make since to have a separate method for plain text files. I can add that here if there is consensus.

jrwiebe · Aug 15, 2019

I agree re. a separate method for plain text files.

ianmilligan1 · Aug 15, 2019

Separate method sounds good to me too. Thanks, @ruebot !

ianmilligan1 · Aug 15, 2019

ianmilligan1 approved these changes Aug 15, 2019

View changes

Tested on sample data and worked really well. Word documents and presentations worked perfectly (doc, rtf, and pps). Some interesting finds already.

We did get one false positive spreadsheet which was actually an HTML document: gist here if you're curious.

ruebot · Aug 15, 2019

@ianmilligan1 yeah, I got some js files and a couple other odd files types listed in the gist I linked in the original comment. I toss some NOT conditionals in there for some of this stuff.

jrwiebe · Aug 15, 2019

This looks good. It looks pretty familiar, as I had a branch addressing #303 and/or #304, but I got stymied by the dependency issue and seem to have lost that work (I probably messed up some git stuff locally and re-cloned aut). We could remove most of the ors and replace them with something like documentMimes.contains(r._2), creating a corresponding list of mime types residing outside of the method definition, but I don't feel strongly about that.

I removed the classifier from the pom and the build went fine. For the record, the reason it didn't work for @ruebot (he already knows this) is that the CDN used by JitPack to serve the jars cached an older version of the file we need, so his build was not getting the shaded artifact.

jrwiebe · Aug 15, 2019

The false positives come from filtering on file extension. I think we could remove most of those, except for "tsv" and "csv".

ruebot · Aug 15, 2019

@jrwiebe we actually need those. If I do have them, those tests will fail. Tika doesn't identify a bunch of those files in the test fixture. I originally had the method you suggested, and then when with the extension/url ends with.

I'll create a enhancement ticket for breaking all those conditionals out after we get this merged. That's a good idea.

jrwiebe · Aug 15, 2019

@ruebot You don't have a log of the failed tests, do you? I'm interested to know which files Tika failed to identify correctly, and what it identified them as (application/octet-stream?).

ruebot · Aug 15, 2019

iirc application/zip

ruebot · Aug 15, 2019

Ok, give that a test again. I haven't tested it other than unit tests since I have to run now. I'll put it through it's paces again later this afternoon, and see if I can sort out the shading issue on my end again.

jrwiebe · Aug 15, 2019

@ruebot The unit tests failed when you weren't filtering on file extension because:
a) the word processor extractor was filtering on mime "text/rtf" instead of "application/rtf", and
b) the spreadsheet extractor doesn't detect CSV or TSV (CSV detection might be in the works, see TIKA-2826).

I pushed a fix to (a), and removed the rest of the file extension filtering from the word processor document extractor. For (b), I left in the filtering on CSV and TSV extensions, but removed the other formats. Now the tests pass.

BTW, should we add the dot before extensions when using endsWith()? Currently our spreadsheet filter, for example, would evaluate to true something like "http://example.com/fileformatsoftheages/csv". Also, URLs could contain uppercase; we should probably insert a .toLowerCase before .endsWith.

ruebot · Aug 15, 2019

@jrwiebe good call on the extensions. I was going to add .txt etc., in front of each one, but second guessed myself. You want to do that, or want me to?

jrwiebe · Aug 15, 2019

@ruebot Go for it. I'm done with this for today.

jrwiebe · Aug 15, 2019

jrwiebe approved these changes Aug 15, 2019

View changes

ianmilligan1 · Aug 15, 2019

ianmilligan1 requested changes Aug 15, 2019

View changes

While this worked when I originally reviewed, it's now failing on the previous test script with:

java.lang.NoSuchMethodError: org.apache.commons.compress.archivers.ArchiveStreamFactory.detect(Ljava/io/InputStream;)Ljava/lang/String;
	at org.apache.tika.parser.pkg.ZipContainerDetector.detectArchiveFormat(ZipContainerDetector.java:189)
	at org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:115)
	at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)
	at org.apache.tika.Tika.detect(Tika.java:156)
	at org.apache.tika.Tika.detect(Tika.java:203)
	at io.archivesunleashed.matchbox.DetectMimeTypeTika$.apply(DetectMimeTypeTika.scala:40)
etc. etc.

This is with

rm -rf ~/.m2/repository/* && mvn clean install

for build and

rm -rf ~/.ivy2/* && ./bin/spark-shell --driver-memory 8g --conf spark.network.timeout=10000000 --conf spark.executor.heartbeatInterval=600s --conf spark.driver.maxResultSize=4g --packages io.archivesunleashed:aut:0.17.1-SNAPSHOT

for shell. I can test again to see if this works on my machine once the classifier issues are sorted out?

ruebot · Aug 15, 2019

@ianmilligan1 yeah, I'm still getting the same thing

jrwiebe · Aug 15, 2019

@ianmilligan1 @ruebot Revert this change (i.e., put the classifier back in). You guys are getting the cached tika-parsers JAR (see my description of this problem). This might break use of --packages, though.

ruebot · Aug 15, 2019

@jrwiebe yeah, I've been fiddling with the JitPack interface in the morning/afternoon and keep on getting that cached JAR. We'll kick this can down the road, and hopefully resolve it before the release 🤞

ruebot · Aug 16, 2019

I'm getting significantly less output after 8028aa3.

Original output (w/o text files extraction)
8028aa39aa53b8f27168c0ad49e9adf582cfda54 output (note, there are no presentation files extracted. This is what I was experiencing early on, and probably didn't explain well here)
Updated commit (we have presentation files back, better results, but a bit of noise. Not sure why I'm getting Mime Types that I say not to give, but file --mime-types could have a different Magic file than a web server and/or Tika 🤷‍♂) ...anyway, I have an idea that I tossed into #343 and we can clean things up better in that resolution.

jrwiebe · Aug 16, 2019

Try running tika-app detect on those same files.

java -jar /path/to/tika-app-1.22.jar -d [files]

ruebot · Aug 16, 2019

[nruest@rho:aut-test-fixtures]$ ls -1                                   
aut-test-fixtures.odp
aut-test-fixtures.pptx
aut-text
aut-text.txt
test-aut-fixture.ods
test-aut-fixtures.docx
'test-aut-fixture - Sheet1.csv'
'test-aut-fixture - Sheet1.tsv'
test-aut-fixtures.odt
test-aut-fixtures.rtf
test-aut-fixture.xlsx

[nruest@rho:aut-test-fixtures]$ java -jar ~/bin/tika-app-1.22.jar -d *  
Aug 15, 2019 9:21:03 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

Aug 15, 2019 9:21:04 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
application/vnd.oasis.opendocument.presentation
application/vnd.openxmlformats-officedocument.presentationml.presentation
text/plain
text/plain
application/vnd.oasis.opendocument.spreadsheet
application/vnd.openxmlformats-officedocument.wordprocessingml.document
text/csv
text/tab-separated-values
application/vnd.oasis.opendocument.text
application/rtf
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet

ruebot · Aug 16, 2019

hrm, if I test with that tika-parsers commit you added, things blow up after a while, and I get this error:

java.nio.file.FileSystemException: /tmp/apache-tika-4408340307210314122.tmp: Too many open files

...with a whole lot of apache-tika-adfasdfasdfasdfdasw.tmp files in my /tmp dir.

I also noticed that Spark behaved a little bit differently. I had stages with your commit, whereas with the original state of the pom in the PR, I don't have stages. It goes to straight tasks.

I'll try and gist everything up here before I go to bed.

ruebot · Aug 16, 2019

Dealing with this Tika dependency thing over the last 6 months

jrwiebe · Aug 16, 2019

Dealing with this Tika dependency thing over the last 6 months

Yup

ruebot · Aug 16, 2019

rm -rf ~/.m2/repository/* && mvn clean install && rm -rf ~/.ivy2/* && time ~/bin/spark-2.4.3-bin-hadoop2.7/bin/spark-shell --master local\[10\] --driver-memory 35g --conf spark.network.timeout=10000000 --conf spark.executor.heartbeatInterval=600s --conf spark.driver.maxResultSize=4g --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.shuffle.compress=true --conf spark.rdd.compress=true --packages io.archivesunleashed:aut:0.17.1-SNAPSHOT -i ~/office-document-extraction.scala

jrwiebe · Aug 16, 2019

hrm, if I test with that tika-parsers commit you added, things blow up after a while, and I get this error:
java.nio.file.FileSystemException: /tmp/apache-tika-4408340307210314122.tmp: Too many open files
...with a whole lot of apache-tika-adfasdfasdfasdfdasw.tmp files in my /tmp dir.

I probably should have been close()-ing each TikaInputStream in DetectMimeTypeTika. I'll push this fix in a sec assuming it passes the build tests.

jrwiebe · Aug 16, 2019

There shouldn't be any difference in behaviour between the two tika-parsers artifacts. The only difference besides where they are downloaded from is that one has a classifier as part of its versioning.

ruebot requested review from ianmilligan1 and jrwiebe Aug 15, 2019

Remove tika-parser classifier.

4f7f8e8

Add text files extraction, and add NOT conditionals.

Verified

This commit was signed with a verified signature.

ruebot Nick Ruest

GPG key ID: 417FAF1A0E1080CD Learn about signing commits

Loading status checks…

d0fe0ae

jrwiebe added some commits Aug 15, 2019

Merge branch '303-304-305' of github.com:archivesunleashed/aut into 3…
…03-304-305

Loading status checks…

372d890

Remove unnecessary filtering on file extension, which might produce
false positives.

Loading status checks…

8028aa3

code review updates, and I guess I missed these tests in the last com…
…mit 😱

Verified

This commit was signed with a verified signature.

ruebot Nick Ruest

GPG key ID: 417FAF1A0E1080CD Learn about signing commits

Loading status checks…

8861574

Use AU GitHub repo to distribute our shaded tika-parsers 1.22.

Loading status checks…

3612f7f

ruebot referenced this pull request Aug 16, 2019
Open
Add method for unknown extensions in binary extractions #343

slim shady

Verified

This commit was signed with a verified signature.

ruebot Nick Ruest

GPG key ID: 417FAF1A0E1080CD Learn about signing commits

Loading status checks…

7edd25d

Close TikaInputStream

668985a

jrwiebe added some commits Aug 16, 2019

Merge remote-tracking branch 'origin/303-304-305' into 303-304-305

Loading status checks…

8a124d3

Add RDD filters on MimeTypeTika value

Loading status checks…

306015b

archivesunleashed/aut

Join GitHub today

Add office document binary extraction. #346

Conversation

ruebot commented Aug 15, 2019 • edited

What does this Pull Request do?

How should this be tested?

Additional Notes:

ruebot requested review from ianmilligan1 and jrwiebe Aug 15, 2019

This comment has been minimized.

codecov bot commented Aug 15, 2019 • edited

Codecov Report

ruebot added some commits Aug 15, 2019

This comment has been minimized.

ianmilligan1 commented Aug 15, 2019

This comment has been minimized.

ruebot commented Aug 15, 2019

This comment has been minimized.

jrwiebe commented Aug 15, 2019

This comment has been minimized.

ianmilligan1 commented Aug 15, 2019

ianmilligan1 approved these changes Aug 15, 2019 View changes

ianmilligan1 left a comment

This comment has been minimized.

ruebot commented Aug 15, 2019

This comment has been minimized.

jrwiebe commented Aug 15, 2019

This comment has been minimized.

jrwiebe commented Aug 15, 2019

This comment has been minimized.

ruebot commented Aug 15, 2019

This comment has been minimized.

jrwiebe commented Aug 15, 2019

This comment has been minimized.

ruebot commented Aug 15, 2019

This comment has been minimized.

ruebot commented Aug 15, 2019

jrwiebe added some commits Aug 15, 2019

This comment has been minimized.

jrwiebe commented Aug 15, 2019 • edited

This comment has been minimized.

ruebot commented Aug 15, 2019

This comment has been minimized.

jrwiebe commented Aug 15, 2019

jrwiebe approved these changes Aug 15, 2019 View changes

ianmilligan1 requested changes Aug 15, 2019 View changes

ianmilligan1 left a comment

This comment has been minimized.

ruebot commented Aug 15, 2019

This comment has been minimized.

jrwiebe commented Aug 15, 2019 • edited

This comment has been minimized.

ruebot commented Aug 15, 2019

ruebot referenced this pull request Aug 16, 2019

Add method for unknown extensions in binary extractions #343

ruebot added some commits Aug 16, 2019

This comment has been minimized.

ruebot commented Aug 16, 2019

This comment has been minimized.

jrwiebe commented Aug 16, 2019 • edited

This comment has been minimized.

ruebot commented Aug 16, 2019 • edited

This comment has been minimized.

ruebot commented Aug 16, 2019

This comment has been minimized.

ruebot commented Aug 16, 2019

This comment has been minimized.

jrwiebe commented Aug 16, 2019

This comment has been minimized.

ruebot commented Aug 16, 2019

This comment has been minimized.

jrwiebe commented Aug 16, 2019

This comment has been minimized.

jrwiebe commented Aug 16, 2019

jrwiebe added some commits Aug 16, 2019

ruebot commented Aug 15, 2019 •

edited

codecov bot commented Aug 15, 2019 •

edited

ianmilligan1 approved these changes Aug 15, 2019

View changes

jrwiebe commented Aug 15, 2019 •

edited

jrwiebe approved these changes Aug 15, 2019

View changes

ianmilligan1 requested changes Aug 15, 2019

View changes

jrwiebe commented Aug 15, 2019 •

edited

jrwiebe commented Aug 16, 2019 •

edited

ruebot commented Aug 16, 2019 •

edited