Video binary object extraction #306

ruebot · Jan 31, 2019

Using the image extraction process as a basis, our next set of binary object extractions will be documents. This issue is meant to focus specifically on video objects.

There may be some tweaks to this depending on the outcome of #298.

ruebot · Aug 12, 2019

Ok, I have a basic framework set up in the branch.

Pull down the branch, and do something along these lines (I had memory issues, so I did my whole Spark config thing):

rm -rf ~/.m2/repository/* && mvn clean install && rm -rf ~/.ivy2/* && ~/bin/spark-2.4.3-bin-hadoop2.7/bin/spark-shell --master local\[10\] --driver-memory 35g --conf spark.network.timeout=10000000 --conf spark.executor.heartbeatInterval=600s --conf spark.driver.maxResultSize=4g --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.shuffle.compress=true --conf spark.rdd.compress=true --packages io.archivesunleashed:aut:0.17.1-SNAPSHOT -i ~/306-pdf-audio-video-extract.scala

306-pdf-audio-video-extract.scala

import io.archivesunleashed._
import io.archivesunleashed.df._

val df_pdf = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc).extractPDFDetailsDF();
val res_pdf = df_pdf.select($"bytes", $"extension").saveToDisk("bytes", "/home/nruest/Projects/au/sample-data/306-307-test/pdf", "extension")

val df_audio = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc).extractAudioDetailsDF();
val res_audio = df_audio.select($"bytes", $"extension").saveToDisk("bytes", "/home/nruest/Projects/au/sample-data/306-307-test/audio", "extension")

val df_video = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc).extractVideoDetailsDF();
val res_video = df_video.select($"bytes", $"extension").saveToDisk("bytes", "/home/nruest/Projects/au/sample-data/306-307-test/video", "extension")

sys.exit

I have a whole lot of audio, pdf, and video files.

Considerations

We need tests. Should have had some for #302 🤷‍♂
Is this an ok implementation for getting the extension? It seems to have a really good success rate.
How do we want to handle when we don't or are not able to get an extension? Throw a conditional in the mix, and say if null/empty UNKNOWN? Do we want that stored in the dataframe, or done on the fly in saveToDisk?
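
One way the null/empty fallback asked about above could look, as a minimal sketch only. `ExtensionFallback` and `orUnknown` are hypothetical names and are not part of aut; whether this should run when the DataFrame is built or on the fly in saveToDisk is still the open question.

```scala
// Hypothetical helper, not part of aut: normalise a possibly missing extension.
object ExtensionFallback {
  // Placeholder label suggested in the question above.
  val Unknown = "UNKNOWN"

  // Returns the trimmed extension, or UNKNOWN when it is null, empty, or blank.
  def orUnknown(ext: String): String =
    Option(ext).map(_.trim).filter(_.nonEmpty).getOrElse(Unknown)
}
```

If it were applied in the DataFrame rather than at write time, it could presumably be wrapped in a Spark UDF and applied to the `extension` column before calling saveToDisk.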

@jrwiebe @lintool @ianmilligan1 let me know what you think.

ianmilligan1 · Aug 12, 2019

Woohoo, this is looking great. Congrats @ruebot! I've tested locally on our CPP Sample Data and all the extractors are working on the data. Some fun PDFs and lots of weird political talk radio and interview clips.

FWIW, at least on this weird CPP collection subset from 2009, it's having trouble getting any extensions for video (it found a few wmv and the rest are all sans extension).

I don't know the best route on your questions #2 and #3 so will leave that to the more qualified @jrwiebe and @lintool.

jrwiebe · Aug 13, 2019

I'm not sure about this method of getting the extension. I think the reason @ianmilligan1 was getting so many videos without an extension is that you're just getting the extension based on the URL. It's easy to think of examples of how audio or video files might be served from URLs that don't contain a file extension.

I think a better way to get the extension is based on the MIME type. There isn't a 1:1 mapping between MIME and extension, so perhaps we combine this with URL analysis.
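
A rough sketch of the idea of combining MIME type with URL analysis, not the code in the branch. Everything here is an assumption made for illustration: the `ExtensionGuess` object, its method names, and the small hand-written MIME table, which stands in for whatever full MIME registry a real implementation would consult.

```scala
import java.net.URI
import scala.util.Try

// Hypothetical helper, not part of aut.
object ExtensionGuess {
  // Illustrative and deliberately incomplete MIME-to-extension table.
  private val mimeToExt: Map[String, String] = Map(
    "video/mp4" -> "mp4",
    "video/x-ms-wmv" -> "wmv",
    "video/quicktime" -> "mov",
    "audio/mpeg" -> "mp3",
    "application/pdf" -> "pdf"
  )

  // Extension from the MIME type, ignoring parameters such as "; charset=...".
  def fromMime(mime: String): Option[String] =
    Option(mime)
      .map(_.takeWhile(_ != ';').trim.toLowerCase)
      .flatMap(mimeToExt.get)

  // Extension from the last path segment of the URL, if it has one.
  def fromUrl(url: String): Option[String] =
    Try(new URI(url).getPath).toOption
      .flatMap(Option(_))
      .map(path => path.substring(path.lastIndexOf('/') + 1))
      .filter(name => name.lastIndexOf('.') > 0 && !name.endsWith("."))
      .map(name => name.substring(name.lastIndexOf('.') + 1).toLowerCase)

  // Prefer the MIME type; fall back to the URL when the MIME type is unmapped.
  def guess(url: String, mime: String): Option[String] =
    fromMime(mime).orElse(fromUrl(url))
}
```

With this ordering, a video served from a URL like `http://example.com/watch?v=123` still gets an extension as long as its MIME type is in the table, and the URL is only consulted when it is not.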

I've created a branch that implements this method. I just finished running a test with it right now. It was working fine until the end, when it failed with java.lang.OutOfMemoryError. So some thought about resource use is warranted here.

(Aside: If you look at the branch's commit history you'll see I modified the POM. This is because the shading I did that somehow resolved #302 actually did not relocate (i.e., rename) commons-compress as intended. Evidently some other change allowed our tests to work, but I was getting that NoSuchMethodError related to commons-compress again when I tested my getExtension method. Now we're relocating for real, as verified by unzipping the JAR and seeing shaded/org/apache/tika/tika-parsers/ paths.)

ruebot added enhancement Scala feature DataFrames labels Jan 31, 2019

ruebot added this to To do in Binary object extraction Jan 31, 2019

ruebot added a commit that referenced this issue Jul 26, 2019

Rough pass at #306; extract video.

Verified

This commit was signed with a verified signature.

ruebot Nick Ruest

GPG key ID: 417FAF1A0E1080CD Learn about signing commits

ed269d6

jrwiebe referenced this issue Aug 2, 2019
Open
Spreadsheet binary object extraction #303

ruebot moved this from To do to In progress in Binary object extraction Aug 12, 2019

ruebot self-assigned this Aug 12, 2019

ruebot added a commit that referenced this issue Aug 12, 2019

Audio and Videao binary extraction; #306, #307.

0e46ca7

ruebot added a commit that referenced this issue Aug 12, 2019

audio and video binary extraction; #306, #307

81d97e6

ruebot referenced this issue Aug 13, 2019
Draft
Add Audio & Video binary extraction #341

jrwiebe referenced this issue Aug 13, 2019
Open
Use Tika's detected MIME type instead of ArchiveRecord getMimeType? #342

archivesunleashed/aut
