Use Tika's detected MIME type instead of ArchiveRecord getMimeType? #342
ruebot added the DataFrames, enhancement, and Scala labels Aug 13, 2019
I agree with moving to Tika for all the MimeType detection for identification, but I also think we should tweak the DF columns to be |
Interesting, if I switch all the ExtractMediaDetails from
|
ruebot added a commit that referenced this issue Aug 13, 2019
Yeah intentional. I just picked one out of the three to implement before I got a |
I wish I knew more about how Spark runs this code. I wrote it this way to avoid calling Tika twice, but it's very possible the return value is cached and read from cache later on.
Cool, I'll update the code later on this morning or afternoon, and compare the output to the last job I ran on #341 testing.
Same numbers. I'm going to do a time test now on HEAD on master, and on what I'll push up here in a second. 4809 audio files |
jrwiebe commented Aug 13, 2019
I wasn't sure whether to add this as a comment in #306 (or one of the other extraction method threads), or to make it an issue.
I observed that while we're using DetectMimeTypeTika in methods like extractPDFDetailsDF, extractVideoDetailsDF, etc., we're using ArchiveRecord.getMimeType for reporting the MIME type. I think Tika's MIME type will tend to be more accurate than what is set by the web server. Do we want to go for accurate meta-description of the contents of our archive records, or do we want fidelity to what was collected from the web?
If we wanted to use what we're getting from Tika, a method like extractAudioDetailsDF() could look something like this:
I'm not sure about the memory implications of the r => (r, DetectMimeTypeTika(r.getContentString)) mapping.