Add Audio & Video binary extraction #341

ruebot · Aug 13, 2019

GitHub issue(s):

#306
#307

What does this Pull Request do?

Add Audio & Video binary extraction.
Updates saveToDisk to use file extension from Tika
Adds tests for Audio, PDF, and Video DF extraction
Add test fixtures for Audio, PDF, and Video DF extraction
Remove tweet cruft

How should this be tested?

Pull down the branch, and do something along these lines (I had memory issues, so I did my whole Spark config thing):

rm -rf ~/.m2/repository/* && mvn clean install && rm -rf ~/.ivy2/* && ~/bin/spark-2.4.3-bin-hadoop2.7/bin/spark-shell --master local\[10\] --driver-memory 35g --conf spark.network.timeout=10000000 --conf spark.executor.heartbeatInterval=600s --conf spark.driver.maxResultSize=4g --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.shuffle.compress=true --conf spark.rdd.compress=true --packages io.archivesunleashed:aut:0.17.1-SNAPSHOT -i ~/306-pdf-audio-video-extract.scala

306-pdf-audio-video-extract.scala

import io.archivesunleashed._
import io.archivesunleashed.df._

val df_pdf = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc).extractPDFDetailsDF();
val res_pdf = df_pdf.select($"bytes", $"extension").saveToDisk("bytes", "/home/nruest/Projects/au/sample-data/306-307-test/pdf", "extension")

val df_audio = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc).extractAudioDetailsDF();
val res_audio = df_audio.select($"bytes", $"extension").saveToDisk("bytes", "/home/nruest/Projects/au/sample-data/306-307-test/audio", "extension")

val df_video = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc).extractVideoDetailsDF();
val res_video = df_video.select($"bytes", $"extension").saveToDisk("bytes", "/home/nruest/Projects/au/sample-data/306-307-test/video", "extension")

sys.exit

I have a whole lot of audio, pdf, and video files.

Additional Notes:

Is this an ok implementation for getting the extension? It seems to have a really good success rate.
How do we want to handle when we don't or are not able to get an extension? Throw a conditional in the mix, and say if null/empty UNKNOWN? Do we want that stored in the dataframe, or done on the fly in saveToDisk?

@jrwiebe @lintool @ianmilligan1 let me know what you think.

ruebot · Aug 13, 2019

Tossing this in to see how CodeCov goes, and give a single place of discussion for implementation strategy on #306 and #307. When we're ready, I'll change the status to ready to review.

codecov · Aug 13, 2019

Codecov Report

Merging #341 into master will increase coverage by 1.42%.
The diff coverage is 67.53%.

@@            Coverage Diff             @@
##           master     #341      +/-   ##
==========================================
+ Coverage   73.49%   74.91%   +1.42%     
==========================================
  Files          39       39              
  Lines        1147     1216      +69     
  Branches      198      224      +26     
==========================================
+ Hits          843      911      +68     
+ Misses        237      214      -23     
- Partials       67       91      +24

Impacted Files	Coverage Δ
...c/main/scala/io/archivesunleashed/df/package.scala	`96.66% <100%> (+27.7%)`	⬆️
...rchivesunleashed/matchbox/DetectMimeTypeTika.scala	`77.77% <50%> (-2.23%)`	⬇️
src/main/scala/io/archivesunleashed/package.scala	`78.97% <66.66%> (+4.37%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 73981a7...1f7fff9. Read the comment docs.

ruebot · Aug 13, 2019

Cleaned up the filename and extension extraction, seems to work well most of the time on PDFs and audio files. Struggles more on video files. Might be worth exploring combining this method with @jrwiebe's work over on the this branch.

I also tried ripping out the big block of OR conditionals on the audio and video extraction, and hit some problems with my tests. I think those OR conditionals are what's actually doing all the work, and the DetectMimeTypeTika(r.getContentString).startsWith("audio/") lines aren't actually doing anything 😰

...I'll loop back around to that tomorrow or Wednesday.

ruebot · Aug 13, 2019

I think this is why we're having problems with DetectMimeTypeTika(r.getContentString).startsWith("audio/"). It looks like we're getting image, audio, and video binaries identified as generic binaries by Tika application/octet-stream.

I'm going to try some tests with getMimeType, and see what the number comparisons look like.

jrwiebe · Aug 13, 2019

If we're filtering for DetectMimeTypeTika(r.getContentString).startsWith("audio/") how is it that DetectMimeTypeTika(r.getContentString) == application/octet-stream?

ruebot · Aug 13, 2019

@jrwiebe it's why the test is failing if I remove all the OR conditionals. The test isn't getting the one match in the List() it is supposed to get because DetectMimeTypeTika is not getting a match on begins with "audio" since the MimeType being detected on the binary is application/octet-stream.

jrwiebe · Aug 13, 2019

But are the other assertions -- e.g., URL, MD5 hash -- evaluating as true?

ruebot · Aug 13, 2019

Yes

jrwiebe · Aug 13, 2019

Generic mime type results are what I was getting before I made DetectMimeTypeTika use org.apache.tika.io.TikaInputStream. I think this is happening because the TikaInputStream is consumed the first time tika.detect() is called by DetectMimeTypeTika. We need to reset the stream, or better, only call detect once.

Tika should be returning "video/mp4" when detecting that file:

jrwiebe@tuna:~$ wget https://ruebot.net/2018-11-12%2016.14.11.mp4
# ...

jrwiebe@tuna:~$ java -jar tika-app-1.22.jar -d 2018-11-12\ 16.14.11.mp4 2>/dev/null 
# (pipe suppresses extraneous stderr output
video/mp4

jrwiebe@tuna:~$ cat - > tika-config.xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <detectors><detector class="org.apache.tika.detect.OverrideDetector"/></detectors>
</properties>

jrwiebe@tuna:~$ java -jar tika-app-1.22.jar --config=tika-config.xml -d 2018-11-12\ 16.14.11.mp4 2>/dev/null
application/octet-stream

Obviously AUT isn't using org.apache.tika.detect.OverrideDetector. I'm just demonstrating that when detection fails, for some reason or other, Tika will still identify a binary file as such.

jrwiebe · Aug 13, 2019

Nope, ignore what I said, my theory is wrong. DetectMimeTypeTika wasn't working properly because conversion from raw bytes to string (getContentString) to bytes (getBytes) again was causing data loss.

Note the change in the number of bytes:

scala> import io.archivesunleashed._
import io.archivesunleashed._

scala> val rdd = RecordLoader.loadArchives("/home/jrwiebe/aut/target/test-classes/warc/example.media.warc.gz", sc)
rdd: org.apache.spark.rdd.RDD[io.archivesunleashed.ArchiveRecord] = MapPartitionsRDD[2] at map at package.scala:72

scala> val strToBytes = rdd.map(r => r.getContentString.getBytes).first
strToBytes: Array[Byte] = Array(72, 84, 84, 80, 47, 49, 46, 49, 32, 50, 48, 48, 32, 79, 75, 13, 10, 68, 97, 116, 101, 58, 32, 77, 111, 110, 44, 32, 49, 50, 32, 65, 117, 103, 32, 50, 48, 49, 57, 32, 50, 51, 58, 53, 48, 58, 48, 51, 32, 71, 77, 84, 13, 10, 83, 101, 114, 118, 101, 114, 58, 32, 65, 112, 97, 99, 104, 101, 47, 50, 46, 52, 46, 50, 57, 32, 40, 85, 98, 117, 110, 116, 117, 41, 13, 10, 76, 97, 115, 116, 45, 77, 111, 100, 105, 102, 105, 101, 100, 58, 32, 87, 101, 100, 44, 32, 50, 54, 32, 79, 99, 116, 32, 50, 48, 49, 49, 32, 48, 48, 58, 53, 56, 58, 50, 54, 32, 71, 77, 84, 13, 10, 69, 84, 97, 103, 58, 32, 34, 52, 53, 52, 98, 55, 45, 52, 98, 48, 50, 57, 50, 55, 52, 55, 98, 48, 56, 48, 34, 13, 10, 65, 99, 99, 101, 112, 116, 45, 82, 97, 110, 103, 101, 115, 58, 32, 98, 121, 116, 101, 115,...

scala> strToBytes.size
res1: Int = 503757

scala> val binaryBytes = rdd.map(r => r.getBinaryBytes).first
binaryBytes: Array[Byte] = Array(73, 68, 51, 4, 0, 64, 0, 0, 0, 28, 0, 0, 0, 12, 1, 32, 5, 13, 1, 66, 21, 98, 84, 80, 69, 49, 0, 0, 0, 6, 0, 0, 0, 114, 117, 101, 115, 116, -1, -5, -112, -60, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 73, 110, 102, 111, 0, 0, 0, 15, 0, 0, 2, -90, 0, 4, 84, -111, 0, 3, 5, 8, 10, 12, 15, 18, 21, 23, 25, 28, 30, 33, 36, 38, 41, 43, 46, 49, 51, 54, 56, 58, 61, 64, 67, 69, 71, 74, 77, 80, 82, 84, 87, 89, 92, 95, 97, 100, 102, 104, 107, 110, 113, 115, 117, 120, 123, 126, -128, -126, -123, -121, -118, -115, -113, -110, -108, -105, -102, -100, -97, -95, -93, -90, -87, -84, -82, -80, -77, -75, -71, -69, -67, -64, -62, -59, -56, -54, -51, -49, -47, -44, -41, -38, -36, -34, -31, -28, -25, -23, -21, -18, -16, -13, -10, -8, -5, -3, 0, 0, 0, 57, 76, 65, 77, 69...
scala> binaryBytes.size
res2: Int = 283831

I just pushed a fix to this. It's based on master, so it doesn't include the new extractMediaDetailsDF methods, but you can see what I did with extractPDFDetailsDF.

Update: I took the liberty of merging my fix with this branch.

jrwiebe · Aug 13, 2019

jrwiebe reviewed Aug 13, 2019

View changes

I approve this PR, but since I've contributed changes, @ruebot should review again too.

ianmilligan1 · Aug 13, 2019

I can test on sample data as well - should I test using the same scripts that @ruebot originally provided above?

jrwiebe · Aug 13, 2019

Yes. Those should work; it's worth verifying. The new unit tests @ruebot wrote are succeeding, so I'm pretty confident in this PR, but since Nick initiated it I just thought he should have a say on my changes.

ruebot · Aug 13, 2019

ruebot reviewed Aug 13, 2019

View changes

ruebot · Aug 13, 2019

pom.xml

@@ -655,7 +655,7 @@
      <!-- see issue #302
      <groupId>org.apache.tika</groupId>
      -->
-      <groupId>com.github.archivesunleashed.tika</groupId>
+      <groupId>com.github.jrwiebe.tika</groupId>


We can fix this after this is merged. We're working on sorting JitPack builds/releases now.

Also change the classifier to `shady'

ruebot · Aug 13, 2019

Testing looks good on my end. Here's the results for the three overall tests with 10 GeoCities WARCs:

Original state of PR
4665 audio files
644 PDF files
183 video files
5492 total files

r.getMimeType
4926 audio files
644 PDF files
234 video files
5804 total files

Jeremy's updates
4809 audio files
644 PDF files
232 video files
5685 total

@ianmilligan1 feel free to test on whatever data you have. If you want these 10 WARCs, let me know and I'll get them to you. I'll also open up a couple side issues once this is merged for file extensions for unknown, and adding additional MimeType columns. When you're good, let me know, and I'll write up a commit message for you to merge.

ianmilligan1 · Aug 13, 2019

ianmilligan1 approved these changes Aug 13, 2019

View changes

Tested on the sample data from aut-resources and successfully extracted PDFs, audio, and video. Still quite a few missing extensions, as seen below, which will be addressed in #343. The asps files all seem to be attached to videos (i.e. "add comments about what you think about Michael Ignatieff's video").

test tweaks

Verified

This commit was signed with a verified signature.

ruebot Nick Ruest

GPG key ID: 417FAF1A0E1080CD Learn about signing commits

Loading status checks…

575ffbb

jrwiebe added some commits Aug 13, 2019

Merge branch 'fix-detectmimetype' into issue-306-307

Loading status checks…

38aec52

Pass binary bytes instread of string to DetectMimeTypeTika.

Loading status checks…

910e22e

jrwiebe marked this pull request as ready for review Aug 13, 2019

Use tika-parsers 1.22-shady from AU repo.

Loading status checks…

1f7fff9

ianmilligan1 deleted the issue-306-307 branch Aug 13, 2019

archivesunleashed/aut

Join GitHub today

Add Audio & Video binary extraction #341

Conversation

ruebot commented Aug 13, 2019

What does this Pull Request do?

How should this be tested?

Additional Notes:

ruebot added some commits Aug 12, 2019

This comment has been minimized.

ruebot commented Aug 13, 2019

This comment has been minimized.

codecov bot commented Aug 13, 2019 • edited

Codecov Report

ruebot added some commits Aug 13, 2019

This comment has been minimized.

ruebot commented Aug 13, 2019

This comment has been minimized.

ruebot commented Aug 13, 2019

This comment has been minimized.

jrwiebe commented Aug 13, 2019 • edited

This comment has been minimized.

ruebot commented Aug 13, 2019

This comment has been minimized.

jrwiebe commented Aug 13, 2019

This comment has been minimized.

ruebot commented Aug 13, 2019

This comment has been minimized.

jrwiebe commented Aug 13, 2019 • edited

jrwiebe added some commits Aug 13, 2019

This comment has been minimized.

jrwiebe commented Aug 13, 2019 • edited

jrwiebe added some commits Aug 13, 2019

jrwiebe marked this pull request as ready for review Aug 13, 2019

jrwiebe reviewed Aug 13, 2019 View changes

jrwiebe left a comment

This comment has been minimized.

ianmilligan1 commented Aug 13, 2019

This comment has been minimized.

jrwiebe commented Aug 13, 2019 • edited

ruebot reviewed Aug 13, 2019 View changes

This comment has been minimized.

ruebot Aug 13, 2019

This comment has been minimized.

jrwiebe Aug 13, 2019

This comment has been minimized.

ruebot commented Aug 13, 2019

ianmilligan1 approved these changes Aug 13, 2019 View changes

ianmilligan1 left a comment

Hide details View details ianmilligan1 merged commit 54c0c3e into master Aug 13, 2019 2 of 3 checks passed

2 of 3 checks passed

ianmilligan1 deleted the issue-306-307 branch Aug 13, 2019

codecov bot commented Aug 13, 2019 •

edited

jrwiebe commented Aug 13, 2019 •

edited

jrwiebe commented Aug 13, 2019 •

edited

jrwiebe commented Aug 13, 2019 •

edited

jrwiebe reviewed Aug 13, 2019

View changes

jrwiebe commented Aug 13, 2019 •

edited

ruebot reviewed Aug 13, 2019

View changes

ianmilligan1 approved these changes Aug 13, 2019

View changes

ianmilligan1 merged commit `54c0c3e` into master Aug 13, 2019
2 of 3 checks passed