Add method for determining binary file extension #349

jrwiebe · Aug 17, 2019

GitHub issue(s):

#343

What does this Pull Request do?

This PR implements the strategy described in the discussion of the above issue to get an extension for a file described by a URL and a MIME type. It creates a GetExtensionMime object in the matchbox.

This PR also removes most of the filtering by URL from the image, audio, video, presentation, spreadsheet, and word processor document extraction methods, since these were returning false positives. (CSV and TSV files are a special case, since Tika detects them as "text/plain" based on content.)

Finally, I have inserted toLowerCase into the getUrl.endsWith() filter tests, which could possibly bring in some more CSV and TSV files.
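As a rough illustration of the case-insensitive CSV/TSV check, here is a minimal sketch. `CsvTsvFilterSketch` and `looksLikeCsvOrTsv` are hypothetical names, not code from this PR, and the sketch assumes the detected MIME type and the URL are both available as plain strings and that both conditions must hold.

```scala
object CsvTsvFilterSketch {
  // Hypothetical helper: Tika reports CSV/TSV content as "text/plain", so the
  // URL suffix is the only remaining hint. Lower-casing the URL first means
  // "DATA.CSV" and "data.csv" are treated the same way.
  def looksLikeCsvOrTsv(url: String, detectedMime: String): Boolean = {
    val lower = url.toLowerCase
    detectedMime == "text/plain" &&
      (lower.endsWith(".csv") || lower.endsWith(".tsv"))
  }
}
```

Combining the two conditions with AND rather than OR keeps an HTML page whose URL happens to end in .csv out of the results.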

How should this be tested?

Test by running something like the following script first on a build of master, then modify the output path and do the same on a build of get-extension. Depending on your input there may or may not be a difference between the sets of files that are extracted. If there is, the second run should have fewer files of all types except images, due to misidentification of files by URL in the first run (i.e., false positives), and they should all have extensions. Because extractImageDetailsDF was using the MIME type stored in the archive record and not the detected version, the first run might produce fewer image files than the second (i.e., master was producing false negatives); the master version's reliance on the URL extension could also produce false positives. Because we

(Tip: You can use the MD5 hash in the filenames to identify files with the same content.)

import io.archivesunleashed._
import io.archivesunleashed.df._

val warcs_path = "/home/jrwiebe/warcs/cpp10/*.gz"
val output_path = "/tuna1/scratch/jrwiebe/get-extension-test/master/"

val df_ss = RecordLoader.loadArchives(warcs_path, sc).extractSpreadsheetDetailsDF();
val res_ss = df_ss.select($"bytes", $"extension").saveToDisk("bytes", output_path+"spreadsheet", "extension")

val df_pp = RecordLoader.loadArchives(warcs_path, sc).extractPresentationProgramDetailsDF();
val res_pp = df_pp.select($"bytes", $"extension").saveToDisk("bytes", output_path+"presentation", "extension")

val df_word = RecordLoader.loadArchives(warcs_path, sc).extractWordProcessorDetailsDF();
val res_word = df_word.select($"bytes", $"extension").saveToDisk("bytes", output_path+"document", "extension")

val df_img = RecordLoader.loadArchives(warcs_path, sc).extractImageDetailsDF();
val res_img = df_img.select($"bytes", $"extension").saveToDisk("bytes", output_path+"image", "extension")

val df_aud = RecordLoader.loadArchives(warcs_path, sc).extractAudioDetailsDF();
val res_aud = df_aud.select($"bytes", $"extension").saveToDisk("bytes", output_path+"audio", "extension")

val df_vid = RecordLoader.loadArchives(warcs_path, sc).extractVideoDetailsDF();
val res_vid = df_vid.select($"bytes", $"extension").saveToDisk("bytes", output_path+"video", "extension")

sys.exit

Here are my results. For the document, spreadsheet, and presentation files I confirmed that files missing from the second run were files that had been misidentified in the first run (master branch).
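For anyone repeating this kind of comparison, a minimal sketch of hashing file contents and diffing the two runs might look like the following. `HashCompareSketch`, `md5Hex`, and `missingFromSecond` are hypothetical names; the sketch hashes raw bytes directly and makes no assumption about how saveToDisk names its output files.

```scala
import java.security.MessageDigest

object HashCompareSketch {
  // Hex-encoded MD5 digest of a byte array.
  def md5Hex(bytes: Array[Byte]): String =
    MessageDigest.getInstance("MD5")
      .digest(bytes)
      .map(b => "%02x".format(b & 0xff))
      .mkString

  // Hashes present in the first run but absent from the second.
  def missingFromSecond(first: Set[String], second: Set[String]): Set[String] =
    first.diff(second)
}
```

Collecting `md5Hex` for every file in each output directory and passing the two sets to `missingFromSecond` gives the list of files to inspect by hand.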

Admittedly mine wasn't a complete test, since it doesn't show how GetExtensionMime would handle a file with the wrong extension in the URL. @ruebot, since the tests you created recently reference actual files on your web server, maybe you could add a couple? To demonstrate how the method does work, see:

scala> import io.archivesunleashed.matchbox._
import io.archivesunleashed.matchbox._

scala> GetExtensionMime("http://ruebot.net/misnameddoc.exe", "application/vnd.openxmlformats-officedocument.wordprocessingml.document")
19/08/16 12:42:02 WARN PDFParser: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

19/08/16 12:42:02 WARN TesseractOCRParser: Tesseract OCR is installed and will be automatically applied to image files unless
you've excluded the TesseractOCRParser from the default parser.
Tesseract may dramatically slow down content extraction (TIKA-2359).
As of Tika 1.15 (and prior versions), Tesseract is automatically called.
In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig.
19/08/16 12:42:02 WARN SQLite3Parser: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
res0: String = docx

scala> GetExtensionMime("http://ruebot.net/this_is_an_mp3", "audio/mpeg")
res1: String = mpga

(mpga might be unexpected here, but it is the first in the list of extensions associated with the MIME type audio/mpeg. Oh, well.)
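To make the two results above easier to follow, here is a simplified sketch of the selection logic as I understand it from this thread. It is not the code in this PR: `ExtensionSketch`, `mimeExtensions`, `urlExtension`, and `chooseExtension` are hypothetical names, the lookup table is a small hand-written stand-in for Tika's MIME registry, and falling back to the URL extension for an unknown MIME type is an assumption.

```scala
import java.net.URI
import scala.util.Try

object ExtensionSketch {
  // Illustrative stand-in for Tika's registry; NOT Tika's actual data.
  // The first entry in each list is the one used as the default.
  val mimeExtensions: Map[String, List[String]] = Map(
    "audio/mpeg" -> List("mpga", "mp3"),
    "image/jpeg" -> List("jpg", "jpeg"),
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document" ->
      List("docx")
  )

  // Lower-cased extension of the last path segment, or "" if there is none.
  def urlExtension(url: String): String = {
    val path = Try(Option(new URI(url).getPath).getOrElse("")).getOrElse("")
    val name = path.substring(path.lastIndexOf('/') + 1)
    val dot = name.lastIndexOf('.')
    if (dot >= 0) name.substring(dot + 1).toLowerCase else ""
  }

  // Keep the URL extension only when it agrees with the MIME type;
  // otherwise use the first extension registered for that MIME type.
  def chooseExtension(url: String, mimeType: String): String = {
    val known = mimeExtensions.getOrElse(mimeType, Nil)
    val fromUrl = urlExtension(url)
    if (known.isEmpty) fromUrl // assumption: unknown MIME type, trust the URL
    else if (known.contains(fromUrl)) fromUrl
    else known.head
  }
}
```

With this table, a URL ending in .mp3 keeps mp3, while this_is_an_mp3 (no extension) falls through to the first listed entry, which is why mpga shows up.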

Additional notes

Part of the reason for false positives was that blocks like this should be ANDs, not ORs.
I left the !r.getUrl.endsWith("robots.txt") condition in keepImages, because removing it caused a few files to be found that looked like GIFs and JPEGs, but which were named /robots.txt and which were incomplete, causing saveImageToDisk to fail with a java.io.EOFException.
I don't know if we need to be using saveImageToDisk. We could simply use saveToDisk. This PR adds the "extension" (and "file") field to the DF returned by extractImageDetailsDF, so using the latter save method is now an option.

jrwiebe · Aug 17, 2019

If we deprecated saveImageToDisk in favour of simply using saveToDisk, we could safely remove the robots.txt check, since the generic save method does not read the binary bytes to ensure they represent a complete, well-formed file.

This isn't a big deal. I like removing the robots check to make the code more elegant. And theoretically a URL ending with "robots.txt" could actually be an image – though this is unlikely.
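A minimal sketch of what "does not read the binary bytes" means in practice, using only the Java standard library. `RawSaveSketch` and `saveRaw` are hypothetical names and this is not the saveToDisk implementation; it only shows that writing bytes verbatim cannot fail on a truncated image the way a decoding step can.

```scala
import java.nio.file.{Files, Path}

object RawSaveSketch {
  // Writes the bytes exactly as given. Nothing is decoded or validated,
  // so an incomplete GIF or JPEG is written without raising an exception.
  def saveRaw(bytes: Array[Byte], target: Path): Path =
    Files.write(target, bytes)
}
```

The trade-off is that incomplete files end up on disk, so any completeness check would have to happen downstream.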

codecov · Aug 17, 2019

Codecov Report

Merging #349 into master will increase coverage by 3.67%.
The diff coverage is 64.06%.

@@            Coverage Diff             @@
##           master     #349      +/-   ##
==========================================
+ Coverage    71.7%   75.38%   +3.67%     
==========================================
  Files          38       39       +1     
  Lines        1428     1373      -55     
  Branches      331      265      -66     
==========================================
+ Hits         1024     1035      +11     
+ Misses        245      221      -24     
+ Partials      159      117      -42

ruebot · Aug 17, 2019

@jrwiebe go for it! It makes sense to have a single saveToDisk method.

ruebot · Aug 17, 2019

since the tests you created recently reference actual files on your web server, maybe you could add a couple?

Sure! Let me know what you want, and I can add a new test WARC or replace one or a couple.

jrwiebe · Aug 17, 2019

@ruebot How about this_is_a_gif (no extension) and this_is_a_jpeg.mp3 (JPEG).

Edited: no need for something like real_png.png. Regular cases are getting tested already.

ruebot · Aug 17, 2019

@jrwiebe you want this_is_a_gif to be a gif, and no extension?

jrwiebe · Aug 17, 2019

@ruebot Yes

ruebot · Aug 17, 2019

@jrwiebe https://www.dropbox.com/s/tdegsqp4fjqcx8j/example.media.warc.gz -- that should do it. webrecorder.io just displayed all the binary characters when I hit the gif with no extension. We'll see what happens there WARC record-wise.

...it should have all the existing files in it too.

jrwiebe · Aug 17, 2019

@ruebot Would you mind replacing this_is_a_jpeg.mp3 with an actual JPEG file? I wanted to test the case where the Tika extension and the FilenameUtils one differ.
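For context on why a misnamed file is a useful test case, here is a toy sketch of content-based detection that ignores the URL entirely. It is a drastically simplified stand-in for what Tika does, not Tika's algorithm; `MagicBytesSketch` and `sniff` are hypothetical names, and only the well-known JPEG and GIF signatures are checked.

```scala
object MagicBytesSketch {
  // True when `bytes` begins with the given unsigned byte values.
  private def startsWith(bytes: Array[Byte], prefix: Array[Int]): Boolean =
    bytes.length >= prefix.length &&
      prefix.indices.forall(i => (bytes(i) & 0xff) == prefix(i))

  private val jpeg = Array(0xff, 0xd8, 0xff)
  private val gif87 = "GIF87a".toArray.map(_.toInt)
  private val gif89 = "GIF89a".toArray.map(_.toInt)

  // Guess a MIME type from the leading bytes alone.
  def sniff(bytes: Array[Byte]): Option[String] =
    if (startsWith(bytes, jpeg)) Some("image/jpeg")
    else if (startsWith(bytes, gif87) || startsWith(bytes, gif89)) Some("image/gif")
    else None
}
```

A JPEG served as this_is_a_jpeg.mp3 would be sniffed as image/jpeg here, which is the situation where the content-derived extension and the URL-derived one disagree.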

ruebot · Aug 17, 2019

...let's see what happens with this one: https://www.dropbox.com/s/lovjzrm9wkauzgc/temp-20190817230619.warc.gz

ruebot · Aug 17, 2019

I got a too many files open error on the most recent commit when I hit image extraction.

[Stage 3:>                                                        (0 + 10) / 10]19/08/17 19:05:52 ERROR Executor: Exception in task 5.0 in stage 3.0 (TID 35)
java.nio.file.FileSystemException: /tmp/apache-tika-1401590413822656748.tmp: Too many open files
	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
	at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
	at java.nio.file.Files.newByteChannel(Files.java:361)
	at java.nio.file.Files.createFile(Files.java:632)
	at java.nio.file.TempFileHelper.create(TempFileHelper.java:138)
	at java.nio.file.TempFileHelper.createTempFile(TempFileHelper.java:161)
	at java.nio.file.Files.createTempFile(Files.java:897)
	at org.apache.tika.io.TemporaryResources.createTempFile(TemporaryResources.java:80)
	at org.apache.tika.io.TikaInputStream.getPath(TikaInputStream.java:608)
	at org.apache.tika.parser.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:395)
	at org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:468)
	at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)
	at org.apache.tika.Tika.detect(Tika.java:156)
	at org.apache.tika.Tika.detect(Tika.java:203)
	at io.archivesunleashed.matchbox.DetectMimeTypeTika$.apply(DetectMimeTypeTika.scala:44)
	at io.archivesunleashed.package$WARecordRDD$$anonfun$keepImages$1.apply(package.scala:473)
	at io.archivesunleashed.package$WARecordRDD$$anonfun$keepImages$1.apply(package.scala:472)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:464)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at scala.collection.Iterator$class.foreach(Iterator.scala:891)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
	at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:927)
	at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:927)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Prior to that, I tested on 2c26dd0 and everything worked fine.

test script

import io.archivesunleashed._
import io.archivesunleashed.df._

val df_pdf = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc).extractPDFDetailsDF();
val res_pdf = df_pdf.select($"bytes", $"extension").saveToDisk("bytes", "/home/nruest/Projects/au/sample-data/306-307-test/pdf", "extension")

val df_audio = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc).extractAudioDetailsDF();
val res_audio = df_audio.select($"bytes", $"extension").saveToDisk("bytes", "/home/nruest/Projects/au/sample-data/306-307-test/audio", "extension")

val df_video = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc).extractVideoDetailsDF();
val res_video = df_video.select($"bytes", $"extension").saveToDisk("bytes", "/home/nruest/Projects/au/sample-data/306-307-test/video", "extension")

val df_image = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc).extractImageDetailsDF();
val res_image = df_image.select($"bytes", $"extension").saveToDisk("bytes", "/home/nruest/Projects/au/sample-data/306-307-test/image", "extension")

val df_ss = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc).extractSpreadsheetDetailsDF();
val res_ss = df_ss.select($"bytes", $"extension").saveToDisk("bytes", "/home/nruest/Projects/au/sample-data/306-307-test/spreadsheet", "extension")

val df_pp = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc).extractPresentationProgramDetailsDF();
val res_pp = df_pp.select($"bytes", $"extension").saveToDisk("bytes", "/home/nruest/Projects/au/sample-data/306-307-test/presentation", "extension")

val df_word = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc).extractWordProcessorDetailsDF();
val res_word = df_word.select($"bytes", $"extension").saveToDisk("bytes", "/home/nruest/Projects/au/sample-data/306-307-test/document", "extension")

val df_txt = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc).extractTextFilesDetailsDF();
val res_txt = df_txt.select($"bytes", $"extension").saveToDisk("bytes", "/home/nruest/Projects/au/sample-data/306-307-test/text", "extension")

sys.exit

jrwiebe · Aug 18, 2019

I didn't get that in my test, but my WARCs might contain fewer files. Try throwing a file.close() after this line.
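In general terms, the idea is to make sure every stream opened for detection is closed again, even when detection throws. A minimal sketch of that pattern follows; `CloseStreamSketch` and `withStream` are hypothetical names, and this is not the actual change made in the PR.

```scala
import java.io.InputStream

object CloseStreamSketch {
  // Loan pattern: run `use` on the stream, then close it whether or not
  // `use` throws, so file handles are released promptly.
  def withStream[A](in: InputStream)(use: InputStream => A): A =
    try use(in) finally in.close()
}
```

Wrapping each per-record detection call this way keeps the number of simultaneously open handles bounded, regardless of how many records a WARC contains.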

ruebot · Aug 18, 2019

Good to go again!

2c26dd0:

12320.01s user 631.90s system 693% cpu 31:06.43 total

248,226 files

86fb543:

11089.29s user 533.86s system 659% cpu 29:22.60 total

248,412 files

ruebot · Aug 18, 2019

@jrwiebe I can fix the tests and push up when I get some time tomorrow if you like. I just have to tweak the layout. If you're cool with that, once it turns green, I can squash and merge.

jrwiebe · Aug 18, 2019

@ruebot I fixed the tests, but if you want to tweak them that's fine. I think we're ready to go.

ruebot · 2019-08-18T03:26:19Z

ruebot approved these changes Aug 18, 2019

jrwiebe requested a review from ruebot Aug 17, 2019

Merge branch 'master' into get-extension

b5e9c2d

Remove saveImageToDisk and its test

2c26dd0

Remove robots.txt check and extraneous imports

fa4e858

Close files so we don't get too many files open again.

86fb543

jrwiebe added some commits Aug 18, 2019

Add GetExtensionMimeTest

9d788ad

Fix test

34a69d6

Fix test (I guess I should run the tests before committing!)

b6de1f2

ruebot merged commit 448601e into master Aug 18, 2019
1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed

ruebot deleted the get-extension branch Aug 18, 2019

archivesunleashed/aut

jrwiebe added some commits Aug 13, 2019
