
Use Tika's detected MIME type instead of ArchiveRecord getMimeType? #342

Open

jrwiebe opened this issue Aug 13, 2019 · 7 comments

@jrwiebe (Contributor) commented Aug 13, 2019

I wasn't sure whether to add this as a comment in #306 (or one of the other extraction method threads), or to make it an issue.

I observed that while we're using DetectMimeTypeTika in methods like extractPDFDetailsDF, extractVideoDetailsDF, etc., we're using ArchiveRecord.getMimeType for reporting the MIME type. I think Tika's MIME type will tend to be more accurate than what is set by the web server. Do we want to go for accurate meta-description of the contents of our archive records, or do we want fidelity to what was collected from the web?

If we wanted to use what we're getting from Tika, a method like extractAudioDetailsDF() could look something like this:

import java.security.MessageDigest
import java.util.Base64
import org.apache.commons.codec.binary.Hex
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

def extractAudioDetailsDF(rdd: RDD[ArchiveRecord]): DataFrame = {
  val records = rdd
    .map(r => (r, DetectMimeTypeTika(r.getContentString))) // <------- (record, mime_type) tuple
    .filter(r => r._2.startsWith("audio/")) // for example
    .map(r => {
      val bytes = r._1.getBinaryBytes
      val hash = new String(Hex.encodeHex(MessageDigest.getInstance("MD5").digest(bytes)))
      val encodedBytes = Base64.getEncoder.encodeToString(bytes)
      Row(r._1.getUrl, r._2, hash, encodedBytes)
    })

  val schema = new StructType()
    .add(StructField("url", StringType, true))
    .add(StructField("mime_type", StringType, true))
    .add(StructField("md5", StringType, true))
    .add(StructField("encodedBytes", StringType, true))

  val spark = SparkSession.builder().getOrCreate()
  spark.createDataFrame(records, schema)
}

I'm not sure about the memory implications of the r => (r, DetectMimeTypeTika(r.getContentString)) mapping.

@ruebot (Member) commented Aug 13, 2019

I agree with moving to Tika for all MIME type identification, but I also think we should tweak the DF columns to mime_type_tika and mime_type_web_server, since there might be a use case for comparing the two. Fun research study about how awful web servers are at identifying MIME types?
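A minimal sketch of that two-column idea, assuming DetectMimeTypeTika behaves as in the example above (the column names are the ones proposed here; everything else is illustrative):

val records = rdd
  .map(r => Row(
    r.getUrl,
    DetectMimeTypeTika(r.getContentString), // Tika's verdict
    r.getMimeType                           // what the web server claimed
  ))

val schema = new StructType()
  .add(StructField("url", StringType, true))
  .add(StructField("mime_type_tika", StringType, true))
  .add(StructField("mime_type_web_server", StringType, true))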

@ruebot (Member) commented Aug 13, 2019

Interesting. If I switch all the ExtractMediaDetails columns from r.getMimeType to DetectMimeTypeTika(r.getContentString), the tests fail because the identification seems to get worse:

Results :

Tests in error: 
  Image DF extraction(io.archivesunleashed.ExtractImageDetailsTest): "[image/gif]" did not equal "[application/octet-stream]"
  Audio DF extraction(io.archivesunleashed.ExtractAudioDetailsTest): "a[udio/mpeg]" did not equal "a[pplication/octet-stream]"
  Video DF extraction(io.archivesunleashed.ExtractVideoDetailsTest): "[video/mp4]" did not equal "[application/octet-stream]"
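(A plausible explanation, not confirmed in the thread: round-tripping a binary payload through a String corrupts the magic bytes Tika keys on, so detection falls back to application/octet-stream. A minimal sketch against the Tika API directly, assuming DetectMimeTypeTika wraps it:)

import org.apache.tika.Tika

val tika = new Tika()
val mp3Header = Array(0xFF, 0xFB, 0x90, 0x64).map(_.toByte) // hypothetical MP3 frame-sync bytes
tika.detect(mp3Header) // typically "audio/mpeg": the magic bytes are intact

// Decoding as UTF-8 replaces invalid sequences (0xFF is never valid UTF-8),
// so the re-encoded bytes no longer match any magic and detection degrades.
val roundTripped = new String(mp3Header, "UTF-8").getBytes("UTF-8")
tika.detect(roundTripped) // likely "application/octet-stream"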

ruebot added a commit that referenced this issue Aug 13, 2019

@ruebot (Member) commented Aug 13, 2019

@jrwiebe is this what you're thinking? 54c1643

@jrwiebe (Contributor, Author) commented Aug 13, 2019

extractPDFDetailsDF is what I was thinking, but I see that the audio and video methods don't use the same approach (i.e., the map(r => (r, DetectMimeTypeTika(r.getBinaryBytes)))). Is that intentional?
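(For reference, a sketch of that approach applied to the audio pipeline above, assuming DetectMimeTypeTika also accepts raw bytes — detection then keys on the binary payload rather than the decoded string:)

val records = rdd
  .map(r => (r, DetectMimeTypeTika(r.getBinaryBytes))) // detect from raw bytes
  .filter(r => r._2.startsWith("audio/"))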

@ruebot (Member) commented Aug 14, 2019

Yeah, intentional. I just picked one of the three to implement before I got a 👍 or 👎 from you 😄

@jrwiebe (Contributor, Author) commented Aug 14, 2019

👍

@jrwiebe (Contributor, Author) commented Aug 14, 2019

I wish I knew more about how Spark runs this code. I wrote it this way to avoid calling Tika twice, but it's very possible the return value is cached and read from cache later on.
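(For what it's worth, the usual Spark semantics: within a single action each record flows through the map once, so the tuple already avoids a second Tika call in that pass, but the RDD's lineage is recomputed for every additional action unless it's explicitly persisted. A sketch, with illustrative names:)

import org.apache.spark.storage.StorageLevel

val detected = rdd
  .map(r => (r, DetectMimeTypeTika(r.getContentString)))
  .persist(StorageLevel.MEMORY_AND_DISK) // recomputing would re-run Tika; spill to disk instead

// ... build and use the DataFrame from `detected` ...

detected.unpersist() // release the cached partitions when done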
