Use Tika's detected MIME type instead of ArchiveRecord getMimeType? #342
ruebot added the DataFrames, enhancement, and Scala labels Aug 13, 2019
I agree with moving to Tika for all the MimeType detection for identification, but I also think we should tweak the DF columns to be |
Interesting, if I switch all the ExtractMediaDetails from
|
ruebot added a commit that referenced this issue Aug 13, 2019
|
Yeah intentional. I just picked one out of the three to implement before I got a |
|
I wish I knew more about how Spark runs this code. I wrote it this way to avoid calling Tika twice, but it's very possible the return value is cached and read from cache later on. |
jrwiebe commented Aug 13, 2019
I wasn't sure whether to add this as a comment in #306 (or one of the other extraction method threads), or to make it an issue.
I observed that while we're using DetectMimeTypeTika in methods like extractPDFDetailsDF, extractVideoDetailsDF, etc., we're using ArchiveRecord.getMimeType for reporting the MIME type. I think Tika's MIME type will tend to be more accurate than what is set by the web server. Do we want to go for accurate meta-description of the contents of our archive records, or do we want fidelity to what was collected from the web?
If we wanted to use what we're getting from Tika, a method like extractAudioDetailsDF() could look something like this:
I'm not sure about the memory implications of the r => (r, DetectMimeTypeTika(r.getContentString)) mapping.
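For anyone following along, here is a rough, plain-Scala sketch of the pairing idea (detect once, carry the result next to the record, then filter and report on the carried value). It is not the project's code and it does not use Spark or Tika: `Record`, `detectMimeType`, `detectCalls`, and `extractAudioDetails` are all hypothetical stand-ins, and the toy detector only checks a couple of leading signatures. It reports both the server-supplied and the detected type, so either choice in the question above stays available.

```scala
// Plain-Scala sketch of the "detect once, carry the result" mapping.
// Assumption: no Spark or Tika here; every name below is hypothetical.
object MimePairingSketch {
  // Stand-in for an archive record: URL, server-reported MIME type, content.
  final case class Record(url: String, serverMimeType: String, content: String)

  // Counts detector invocations so the one-call-per-record property is visible.
  var detectCalls: Int = 0

  // Toy detector standing in for DetectMimeTypeTika; it only looks at
  // two leading signatures and falls back to a generic type.
  def detectMimeType(content: String): String = {
    detectCalls += 1
    if (content.startsWith("ID3")) "audio/mpeg"
    else if (content.startsWith("%PDF")) "application/pdf"
    else "application/octet-stream"
  }

  // Pair each record with its detected type, keep audio records, and emit
  // (url, server-reported type, detected type) without detecting again.
  def extractAudioDetails(records: Seq[Record]): Seq[(String, String, String)] =
    records
      .map(r => (r, detectMimeType(r.content)))
      .filter { case (_, detected) => detected.startsWith("audio/") }
      .map { case (r, detected) => (r.url, r.serverMimeType, detected) }
}
```

With a strict collection such as `List`, the detector runs exactly once per record. Whether the same holds in a Spark job depends on how the resulting RDD or DataFrame is reused across actions, which this sketch does not model; the tuple also keeps a reference to each record until the final map, which is where the memory question comes from.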