
Add office document binary extraction. #346

Merged: 16 commits merged into master from the 303-304-305 branch on Aug 16, 2019

Conversation

@ruebot (Member) commented Aug 15, 2019

GitHub issue(s):

What does this Pull Request do?

  • Add WordProcessor DF and binary extraction
  • Add Spreadsheets DF and binary extraction
  • Add Presentation Program DF and binary extraction
  • Add tests for new DF and binary extractions
  • Add test fixture for new DF and binary extractions
  • Resolves #303
  • Resolves #304
  • Resolves #305
  • Back out 39831c2 (We might not have to do this)

How should this be tested?

  • TravisCI
  • I tested on the 10 GeoCities WARCs; here is a whole bunch of info:
$ rm -rf ~/.m2/repository/* && mvn clean install && rm -rf ~/.ivy2/* && time ~/bin/spark-2.4.3-bin-hadoop2.7/bin/spark-shell --master local\[10\] --driver-memory 35g --conf spark.network.timeout=10000000 --conf spark.executor.heartbeatInterval=600s --conf spark.driver.maxResultSize=4g --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.shuffle.compress=true --conf spark.rdd.compress=true --packages io.archivesunleashed:aut:0.17.1-SNAPSHOT -i ~/office-document-extraction.scala
import io.archivesunleashed._
import io.archivesunleashed.df._

val df_ss = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc).extractSpreadsheetDetailsDF();
val res_ss = df_ss.select($"bytes", $"extension").saveToDisk("bytes", "/home/nruest/Projects/au/sample-data/306-307-test/spreadsheet", "extension")

val df_pp = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc).extractPresentationProgramDetailsDF();
val res_pp = df_pp.select($"bytes", $"extension").saveToDisk("bytes", "/home/nruest/Projects/au/sample-data/306-307-test/presentation", "extension")

val df_word = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc).extractWordProcessorDetailsDF();
val res_word = df_word.select($"bytes", $"extension").saveToDisk("bytes", "/home/nruest/Projects/au/sample-data/306-307-test/document", "extension")

sys.exit

Additional Notes:

  • Do we want to include plain text (txt) files in Word Processor?
  • Is it worth putting some NOT conditionals in these, like r._1 != "text/html" or !r._1.startsWith("image/") (see the sketch just below)? Or is it worth leaving some of this noise in there?
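
A rough sketch of what such NOT conditionals could look like. This is illustrative only, not the actual aut extractor code: records is a placeholder for the per-record pairs the extractor filters, with r._1 standing for the server-supplied MIME type and r._2 for the Tika-detected type.

// Hypothetical pre-filter: drop obvious noise before matching office formats.
val records = Seq(
  ("text/html", "text/html"),
  ("application/vnd.ms-excel", "application/vnd.ms-excel")
)
val filtered = records.filter(r =>
  r._1 != "text/html" &&
  !r._1.startsWith("image/") &&
  (r._2 == "application/vnd.ms-excel" ||
   r._2 == "application/vnd.oasis.opendocument.spreadsheet"))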

@ruebot requested review from ianmilligan1 and jrwiebe on Aug 15, 2019

@codecov commented Aug 15, 2019

Codecov Report

Merging #346 into master will decrease coverage by 3.53%.
The diff coverage is 43.49%.

@@            Coverage Diff             @@
##           master     #346      +/-   ##
==========================================
- Coverage    75.2%   71.66%   -3.54%     
==========================================
  Files          39       38       -1     
  Lines        1230     1426     +196     
  Branches      224      330     +106     
==========================================
+ Hits          925     1022      +97     
- Misses        214      245      +31     
- Partials       91      159      +68

ruebot added some commits Aug 15, 2019

@ianmilligan1 (Member) commented Aug 15, 2019

Will review in a few minutes!

Re:

Do we want to include plain text (txt) files in Word Processor?

I think it'd probably make sense to include txt files. That said, if so, maybe we should consider renaming extractWordProcessorDetailsDF to something like extractTextDocumentDetailsDF?

Or we could create a new one just for plain text files? 🤔

@ruebot (Member Author) commented Aug 15, 2019

I think it would make sense to have a separate method for plain text files. I can add that here if there is consensus.
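
For illustration, usage of such a separate method would presumably mirror the existing extractors in the test script above. A sketch, assuming a method named extractTextFilesDetailsDF (the name that surfaces later in this thread) and placeholder paths:

import io.archivesunleashed._
import io.archivesunleashed.df._

// Hypothetical usage, modelled on the spreadsheet/presentation/word processor examples above.
val df_txt = RecordLoader.loadArchives("/path/to/warcs/*gz", sc).extractTextFilesDetailsDF()
val res_txt = df_txt.select($"bytes", $"extension").saveToDisk("bytes", "/path/to/output/text", "extension")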

@jrwiebe (Contributor) commented Aug 15, 2019

I agree re. a separate method for plain text files.

@ianmilligan1 (Member) commented Aug 15, 2019

Separate method sounds good to me too. Thanks, @ruebot !

@ianmilligan1 (Member) left a comment

Tested on sample data and worked really well. Word documents and presentations worked perfectly (doc, rtf, and pps). Some interesting finds already.


We did get one false positive spreadsheet which was actually an HTML document: gist here if you're curious.

@ruebot (Member Author) commented Aug 15, 2019

@ianmilligan1 yeah, I got some js files and a couple of other odd file types listed in the gist I linked in the original comment. I'll toss some NOT conditionals in there for some of this stuff.

@jrwiebe (Contributor) commented Aug 15, 2019

This looks good. It looks pretty familiar, as I had a branch addressing #303 and/or #304, but I got stymied by the dependency issue and seem to have lost that work (I probably messed up some git stuff locally and re-cloned aut). We could remove most of the ORs and replace them with something like documentMimes.contains(r._2), creating a corresponding list of MIME types that lives outside the method definition (a rough sketch follows), but I don't feel strongly about that.
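
A rough sketch of that suggestion, using a hypothetical documentMimes list and the same placeholder records/r._2 convention as the earlier sketch; not the actual aut source:

// Hypothetical: keep the accepted MIME types in a list outside the method.
val documentMimes = List(
  "application/rtf",
  "application/vnd.oasis.opendocument.text",
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
)
val records = Seq(
  ("text/html", "text/html"),
  ("application/msword", "application/rtf")
)
val wordProcessorRecords = records.filter(r => documentMimes.contains(r._2))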

I removed the classifier from the pom and the build went fine. For the record, the reason it didn't work for @ruebot (he already knows this) is that the CDN used by JitPack to serve the jars cached an older version of the file we need, so his build was not getting the shaded artifact.

@jrwiebe (Contributor) commented Aug 15, 2019

The false positives come from filtering on file extension. I think we could remove most of those, except for "tsv" and "csv".

@ruebot (Member Author) commented Aug 15, 2019

@jrwiebe we actually need those. If I don't have them, those tests will fail: Tika doesn't identify a bunch of those files in the test fixture. I originally had the method you suggested, and then went with the extension/URL endsWith approach.

I'll create an enhancement ticket for breaking all those conditionals out after we get this merged. That's a good idea.

@jrwiebe (Contributor) commented Aug 15, 2019

@ruebot You don't have a log of the failed tests, do you? I'm interested to know which files Tika failed to identify correctly, and what it identified them as (application/octet-stream?).

@ruebot (Member Author) commented Aug 15, 2019

iirc application/zip

@ruebot (Member Author) commented Aug 15, 2019

Ok, give that a test again. I haven't tested it beyond the unit tests since I have to run now. I'll put it through its paces again later this afternoon, and see if I can sort out the shading issue on my end again.

@jrwiebe (Contributor) commented Aug 15, 2019

@ruebot The unit tests failed when you weren't filtering on file extension because:
a) the word processor extractor was filtering on mime "text/rtf" instead of "application/rtf", and
b) the spreadsheet extractor doesn't detect CSV or TSV (CSV detection might be in the works, see TIKA-2826).

I pushed a fix to (a), and removed the rest of the file extension filtering from the word processor document extractor. For (b), I left in the filtering on CSV and TSV extensions, but removed the other formats. Now the tests pass.

BTW, should we add the dot before extensions when using endsWith()? Currently our spreadsheet filter, for example, would evaluate to true for something like "http://example.com/fileformatsoftheages/csv". Also, URLs can contain uppercase, so we should probably insert a .toLowerCase before .endsWith.
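
A minimal sketch of that change, as a hypothetical helper rather than the actual aut filter:

// Match the extension including the dot, case-insensitively, so a URL ending
// in ".../fileformatsoftheages/csv" no longer matches.
def hasSpreadsheetExtension(url: String): Boolean = {
  val u = url.toLowerCase
  u.endsWith(".csv") || u.endsWith(".tsv")
}

hasSpreadsheetExtension("http://example.com/fileformatsoftheages/csv") // false
hasSpreadsheetExtension("http://example.com/DATA.CSV")                 // true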

@ruebot (Member Author) commented Aug 15, 2019

@jrwiebe good call on the extensions. I was going to add .txt etc., in front of each one, but second guessed myself. You want to do that, or want me to?

@jrwiebe (Contributor) commented Aug 15, 2019

@ruebot Go for it. I'm done with this for today.

@ianmilligan1 (Member) left a comment

While this worked when I originally reviewed, it's now failing on the previous test script with:

java.lang.NoSuchMethodError: org.apache.commons.compress.archivers.ArchiveStreamFactory.detect(Ljava/io/InputStream;)Ljava/lang/String;
	at org.apache.tika.parser.pkg.ZipContainerDetector.detectArchiveFormat(ZipContainerDetector.java:189)
	at org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:115)
	at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)
	at org.apache.tika.Tika.detect(Tika.java:156)
	at org.apache.tika.Tika.detect(Tika.java:203)
	at io.archivesunleashed.matchbox.DetectMimeTypeTika$.apply(DetectMimeTypeTika.scala:40)
etc. etc.

This is with

rm -rf ~/.m2/repository/* && mvn clean install

for build and

rm -rf ~/.ivy2/* && ./bin/spark-shell --driver-memory 8g --conf spark.network.timeout=10000000 --conf spark.executor.heartbeatInterval=600s --conf spark.driver.maxResultSize=4g --packages io.archivesunleashed:aut:0.17.1-SNAPSHOT

for shell. I can test again to see if this works on my machine once the classifier issues are sorted out?

@ruebot (Member Author) commented Aug 15, 2019

@ianmilligan1 yeah, I'm still getting the same thing

@jrwiebe (Contributor) commented Aug 15, 2019

@ianmilligan1 @ruebot Revert this change (i.e., put the classifier back in). You guys are getting the cached tika-parsers JAR (see my description of this problem). This might break use of --packages, though.

@ruebot (Member Author) commented Aug 15, 2019

@jrwiebe yeah, I've been fiddling with the JitPack interface all morning and afternoon and keep getting that cached JAR. We'll kick this can down the road, and hopefully resolve it before the release 🤞

ruebot added some commits Aug 16, 2019

@ruebot (Member Author) commented Aug 16, 2019

I'm getting significantly less output after 8028aa3.

  • Original output (w/o text files extraction)
  • 8028aa39aa53b8f27168c0ad49e9adf582cfda54 output (note: there are no presentation files extracted; this is what I was experiencing early on and probably didn't explain well here)
  • Updated commit (we have presentation files back and better results, but a bit of noise. Not sure why I'm getting MIME types that I say not to give, but file --mime-type could have a different magic file than a web server and/or Tika 🤷‍♂) ...anyway, I have an idea that I tossed into #343 and we can clean things up better in that resolution.
@jrwiebe (Contributor) commented Aug 16, 2019

Try running tika-app detect on those same files.

java -jar /path/to/tika-app-1.22.jar -d [files]
@ruebot (Member Author) commented Aug 16, 2019

[nruest@rho:aut-test-fixtures]$ ls -1                                   
aut-test-fixtures.odp
aut-test-fixtures.pptx
aut-text
aut-text.txt
test-aut-fixture.ods
test-aut-fixtures.docx
'test-aut-fixture - Sheet1.csv'
'test-aut-fixture - Sheet1.tsv'
test-aut-fixtures.odt
test-aut-fixtures.rtf
test-aut-fixture.xlsx

[nruest@rho:aut-test-fixtures]$ java -jar ~/bin/tika-app-1.22.jar -d *  
Aug 15, 2019 9:21:03 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

Aug 15, 2019 9:21:04 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
application/vnd.oasis.opendocument.presentation
application/vnd.openxmlformats-officedocument.presentationml.presentation
text/plain
text/plain
application/vnd.oasis.opendocument.spreadsheet
application/vnd.openxmlformats-officedocument.wordprocessingml.document
text/csv
text/tab-separated-values
application/vnd.oasis.opendocument.text
application/rtf
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
@ruebot (Member Author) commented Aug 16, 2019

hrm, if I test with that tika-parsers commit you added, things blow up after a while, and I get this error:

java.nio.file.FileSystemException: /tmp/apache-tika-4408340307210314122.tmp: Too many open files

...with a whole lot of apache-tika-adfasdfasdfasdfdasw.tmp files in my /tmp dir.

I also noticed that Spark behaved a little bit differently. I had stages with your commit, whereas with the original state of the pom in the PR, I don't have stages; it goes straight to tasks.

I'll try and gist everything up here before I go to bed.

@ruebot (Member Author) commented Aug 16, 2019

Dealing with this Tika dependency thing over the last 6 months


@jrwiebe (Contributor) commented Aug 16, 2019

Dealing with this Tika dependency thing over the last 6 months


Yup

@ruebot (Member Author) commented Aug 16, 2019

rm -rf ~/.m2/repository/* && mvn clean install && rm -rf ~/.ivy2/* && time ~/bin/spark-2.4.3-bin-hadoop2.7/bin/spark-shell --master local\[10\] --driver-memory 35g --conf spark.network.timeout=10000000 --conf spark.executor.heartbeatInterval=600s --conf spark.driver.maxResultSize=4g --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.shuffle.compress=true --conf spark.rdd.compress=true --packages io.archivesunleashed:aut:0.17.1-SNAPSHOT -i ~/office-document-extraction.scala
@jrwiebe (Contributor) commented Aug 16, 2019

hrm, if I test with that tika-parsers commit you added, things blow up after a while, and I get this error:

java.nio.file.FileSystemException: /tmp/apache-tika-4408340307210314122.tmp: Too many open files

...with a whole lot of apache-tika-adfasdfasdfasdfdasw.tmp files in my /tmp dir.

I probably should have been close()-ing each TikaInputStream in DetectMimeTypeTika. I'll push this fix in a sec assuming it passes the build tests.
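
A minimal sketch of that fix, assuming detection goes through a TikaInputStream as described; illustrative only, not the actual DetectMimeTypeTika source:

import java.io.ByteArrayInputStream
import org.apache.tika.Tika
import org.apache.tika.io.TikaInputStream

// Close the TikaInputStream after detection so the temp file Tika spools to
// disk gets cleaned up instead of accumulating under /tmp.
def detectMime(tika: Tika, bytes: Array[Byte]): String = {
  val tis = TikaInputStream.get(new ByteArrayInputStream(bytes))
  try tika.detect(tis)
  finally tis.close()
}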

@jrwiebe (Contributor) commented Aug 16, 2019

There shouldn't be any difference in behaviour between the two tika-parsers artifacts. The only difference besides where they are downloaded from is that one has a classifier as part of its versioning.

jrwiebe and others added some commits Aug 16, 2019

@ruebot (Member Author) commented Aug 16, 2019

@ianmilligan1 ready for the big old rm -rf test process again. If it works for you, ping me, and I'll write up a big ol' commit message.

@jrwiebe (Contributor) commented Aug 16, 2019

I ran a variant of the above office-document-extraction.scala last night, built from 8a124d3. The script was something I had around from testing yesterday, so it was missing extractTextFilesDetailsDF. I ran it on /tuna1/scratch/nruest/geocites/warcs/1/*.gz. Is this the same sample you were using, @ruebot?

This was on tuna, so the resources available were different, but it didn't blow up. It took about four hours.

Extraction stats:

jrwiebe@tuna:~$ ls office-extract-test/spreadsheet-*|wc -l
1291
jrwiebe@tuna:~$ ls office-extract-test/presentation-*|wc -l
2840
jrwiebe@tuna:~$ ls office-extract-test/document-*|wc -l
8732
jrwiebe@tuna:~$ ls office-extract-test/*|wc -l
12863
jrwiebe@tuna:~$ cat aut/office-document-extraction.time.txt 
real    168m19.665s
user    997m59.700s
sys 153m2.312s
@ruebot (Member Author) commented Aug 16, 2019

@jrwiebe no, I just have 10 GeoCities WARCs on my desktop at home that I use for testing. But all those numbers look relatively in line with what I've seen on my end.

@ianmilligan1 (Member) left a comment

Tested on sample data and this works nicely – great work both!

@ianmilligan1 merged commit c824ad8 into master on Aug 16, 2019

1 of 3 checks passed

codecov/patch: 43.49% of diff hit (target 75.2%)
codecov/project: 71.66% (-3.54%) compared to 39831c2
continuous-integration/travis-ci/pr: The Travis CI build passed

@ianmilligan1 deleted the 303-304-305 branch on Aug 16, 2019
