Use Tika's detected MIME type instead of ArchiveRecord getMimeType? #342

jrwiebe · Aug 13, 2019

I wasn't sure whether to add this as a comment in #306 (or one of the other extraction method threads), or to make it an issue.

I observed that while we're using DetectMimeTypeTika in methods like extractPDFDetailsDF, extractVideoDetailsDF, etc., we're using ArchiveRecord.getMimeType for reporting the MIME type. I think Tika's MIME type will tend to be more accurate than what is set by the web server. Do we want to go for accurate meta-description of the contents of our archive records, or do we want fidelity to what was collected from the web?

If we wanted to use what we're getting from Tika, a method like extractAudioDetailsDF() could look something like this:

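import java.security.MessageDigest
import java.util.Base64

// Package paths for the aut classes are assumed from the project layout.
import io.archivesunleashed.ArchiveRecord
import io.archivesunleashed.matchbox.DetectMimeTypeTika
import org.apache.commons.codec.binary.Hex
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

def extractAudioDetailsDF(rdd: RDD[ArchiveRecord]): DataFrame = {
  val records = rdd
    // Run Tika once per record and carry the result along: (record, mime_type) tuple
    .map(r => (r, DetectMimeTypeTika(r.getContentString)))
    .filter(r => r._2.startsWith("audio/")) // for example
    .map(r => {
      val bytes = r._1.getBinaryBytes
      val hash = new String(Hex.encodeHex(MessageDigest.getInstance("MD5").digest(bytes)))
      val encodedBytes = Base64.getEncoder.encodeToString(bytes)
      (r._1.getUrl, r._2, hash, encodedBytes)
    })
    .map(t => Row(t._1, t._2, t._3, t._4))

  val schema = new StructType()
    .add(StructField("url", StringType, true))
    .add(StructField("mime_type", StringType, true))
    .add(StructField("md5", StringType, true))
    .add(StructField("encodedBytes", StringType, true))

  val spark = SparkSession.builder().getOrCreate()
  spark.createDataFrame(records, schema)
}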

I'm not sure about the memory implications of the r => (r, DetectMimeTypeTika(r.getContentString)) mapping.

ruebot · Aug 13, 2019

I agree with moving to Tika for all MIME type identification, but I also think we should tweak the DF columns to be mime_type_tika and mime_type_web_server, since there might be a use case for comparing the two. A fun research study on how awful web servers are at identifying MIME types?
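
To make the two-column idea concrete, here is a minimal sketch in plain Scala (no Spark) of a row that carries both values, plus a filter that surfaces the records where the two disagree. The MediaRow case class, its field names, and the sample URLs are hypothetical and only for illustration; they are not part of aut.

```scala
object MimeTypeColumns {
  // Hypothetical row shape holding both proposed columns side by side.
  final case class MediaRow(url: String, mimeTypeWebServer: String, mimeTypeTika: String)

  // Keep only the rows where the server-reported and detected types differ.
  def disagreements(rows: Seq[MediaRow]): Seq[MediaRow] =
    rows.filter(r => r.mimeTypeWebServer != r.mimeTypeTika)

  def main(args: Array[String]): Unit = {
    // Made-up sample data, not taken from any real crawl.
    val rows = Seq(
      MediaRow("http://example.com/a.mp3", "audio/mpeg", "audio/mpeg"),
      MediaRow("http://example.com/b.mp3", "text/html", "audio/mpeg")
    )
    disagreements(rows).foreach(println)
  }
}
```

In a DataFrame the same comparison would presumably be a filter on the two columns; keeping both values is what makes that comparison possible at all.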

ruebot · Aug 13, 2019

Interesting: if I switch the MIME type columns in all the ExtractMediaDetails methods from r.getMimeType to DetectMimeTypeTika(r.getContentString), the tests fail because the identification seems to get worse:

Results :

Tests in error: 
  Image DF extraction(io.archivesunleashed.ExtractImageDetailsTest): "[image/gif]" did not equal "[application/octet-stream]"
  Audio DF extraction(io.archivesunleashed.ExtractAudioDetailsTest): "a[udio/mpeg]" did not equal "a[pplication/octet-stream]"
  Video DF extraction(io.archivesunleashed.ExtractVideoDetailsTest): "[video/mp4]" did not equal "[application/octet-stream]"
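
The thread does not diagnose why detection falls back to application/octet-stream. One thing that may be worth ruling out is whether the detector is being fed decoded text rather than raw bytes, since decoding binary content to a String can destroy the leading signature bytes that content sniffing relies on. The sketch below uses only the JDK and shows that a UTF-8 round trip is lossy for a non-ASCII signature such as the JPEG start-of-image marker. It is not a confirmed explanation of these failures: an ASCII signature like GIF89a survives the same round trip, so something else may be involved as well.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.util.Arrays

object LossyRoundTrip {
  // Decode bytes as UTF-8 text and encode them back, as happens when
  // binary content is passed around as a String.
  def roundTrip(bytes: Array[Byte]): Array[Byte] =
    new String(bytes, UTF_8).getBytes(UTF_8)

  // True when the bytes come back unchanged.
  def survives(bytes: Array[Byte]): Boolean =
    Arrays.equals(bytes, roundTrip(bytes))

  def main(args: Array[String]): Unit = {
    val jpegMagic = Array(0xFF, 0xD8, 0xFF, 0xE0).map(_.toByte)
    val gifMagic = "GIF89a".getBytes(UTF_8)
    println(survives(jpegMagic)) // false: invalid UTF-8 is replaced with U+FFFD
    println(survives(gifMagic))  // true: plain ASCII is preserved
  }
}
```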

ruebot · Aug 13, 2019

@jrwiebe is this what you're thinking? 54c1643

jrwiebe · Aug 13, 2019

extractPDFDetailsDF is what I was thinking, but I see that the audio and video methods don't use the same approach (i.e., the map(r => (r, DetectMimeTypeTika(r.getBinaryBytes)))). Is that intentional?

ruebot · Aug 14, 2019

Yeah intentional. I just picked one out of the three to implement before I got a 👍 or 👎 from you 😄

jrwiebe · Aug 14, 2019

👍

jrwiebe · Aug 14, 2019

I wish I knew more about how Spark runs this code. I wrote it this way to avoid calling Tika twice, but it's very possible the return value is cached and read from cache later on.
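
For what it's worth, Spark does not memoize the results of functions passed to map or filter; unless an RDD is explicitly persisted, each transformation simply runs its closure on the records it sees. The sketch below illustrates the difference with plain Scala collections rather than Spark. The detect function is a hypothetical stand-in for an expensive detector such as DetectMimeTypeTika and counts its own invocations; the "ID3" check is only a toy rule for the example.

```scala
import java.nio.charset.StandardCharsets.US_ASCII

object DetectOnce {
  // Counts how many times the (hypothetical) detector has run.
  var calls = 0

  // Toy stand-in for an expensive MIME detector.
  def detect(bytes: Array[Byte]): String = {
    calls += 1
    if (new String(bytes.take(3), US_ASCII) == "ID3") "audio/mpeg"
    else "application/octet-stream"
  }

  // Detector runs in the filter and again in the map for each survivor.
  def detectTwice(records: Seq[Array[Byte]]): Seq[(Int, String)] =
    records
      .filter(r => detect(r).startsWith("audio/"))
      .map(r => (r.length, detect(r)))

  // Detector runs once per record; the result travels in a tuple.
  def detectOnce(records: Seq[Array[Byte]]): Seq[(Int, String)] =
    records
      .map(r => (r, detect(r)))
      .filter(_._2.startsWith("audio/"))
      .map { case (r, mime) => (r.length, mime) }

  def main(args: Array[String]): Unit = {
    val records = Seq("ID3a", "ID3bb", "zzz").map(_.getBytes(US_ASCII))
    calls = 0; detectTwice(records); println(calls) // 5 (3 in filter + 2 in map)
    calls = 0; detectOnce(records); println(calls)  // 3 (one per record)
  }
}
```

The tuple version costs one extra small object per record while it is in flight, which is likely minor next to the record's own bytes, though that has not been measured here.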

ruebot · Aug 14, 2019

Cool, I'll update the code later on this morning or afternoon, and compare the output to the last job I ran on #341 testing.

ruebot · Aug 14, 2019

Same numbers. I'm going to do a time test now on HEAD on master, and on what I'll push up here in a second.

4809 audio files
644 PDF files
232 video files
5685 total

ruebot added DataFrames enhancement Scala labels Aug 13, 2019

ruebot referenced this issue Aug 13, 2019
Merged
Add Audio & Video binary extraction #341

ruebot added a commit that referenced this issue Aug 13, 2019

hacking on #342

54c1643

ruebot referenced this issue Aug 14, 2019
Merged
Use Tika's detected MIME type instead of ArchiveRecord getMimeType. #344

ianmilligan1 closed this in #344 Aug 14, 2019

archivesunleashed/aut
