Add method for unknown extensions in binary extractions #343
Comments
ruebot added the enhancement, Scala, and DataFrames labels on Aug 13, 2019
I noticed when I was filtering for URLs ending in ".doc" that I was getting a lot of non-doc formats (HTML and text formats). I think it's less likely there will be such incorrect file extensions for the other binary formats we're targeting, but if we want a generic algorithm for determining the extension, I'd cover the .doc case by switching steps 1 and 2. I'd also use the plural method. We're already getting the MimeType with Tika, which is the only expensive operation in this process. My method:
I have an idea, since I noticed these methods as I'm hacking on #346. Use what you have above, and combine it there. Maybe do something consistent with what we have with
Not sure what you mean by this line, unless you're just saying to put the method I wrote above in that section of
@jrwiebe yep! That, plus potentially mimicking those two existing functions as well. Does that make sense?
Actually, now that I'm trying it I realize we don't want a
ruebot commented Aug 13, 2019
With the implementation of #302, #306, and #307, we will occasionally get extracted binaries that do not have file extensions. We should create a method/helper to account for this:
UNKNOWN or something else as the extension.
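To make the discussion concrete, here is a minimal Scala sketch of one way such a helper could behave. It is not the code from this thread. It assumes the ordering suggested in the comments (trust the detected MIME type before the URL extension) and the `UNKNOWN` fallback proposed in this issue. The names `ExtensionSketch`, `mimeToExtensions`, `urlExtension`, and `fileExtension` are hypothetical, and the small hard-coded map stands in for the Tika lookup, which is not called here.

```scala
import java.net.URI
import scala.util.Try

object ExtensionSketch {
  // Hypothetical stand-in for a Tika lookup of the extensions registered
  // for a MIME type. A real helper would query Tika instead of this map.
  val mimeToExtensions: Map[String, List[String]] = Map(
    "application/msword" -> List("doc", "dot"),
    "application/pdf"    -> List("pdf"),
    "text/html"          -> List("html", "htm")
  )

  // Fallback extension when nothing else is known (value taken from the issue text).
  val Fallback = "UNKNOWN"

  // Lowercased extension of the last path segment of the URL, or "" if there
  // is none or the URL cannot be parsed.
  def urlExtension(url: String): String = {
    val path = Try(Option(new URI(url).getPath)).toOption.flatten.getOrElse("")
    val fileName = path.substring(path.lastIndexOf('/') + 1)
    val dot = fileName.lastIndexOf('.')
    if (dot > 0 && dot < fileName.length - 1) fileName.substring(dot + 1).toLowerCase
    else ""
  }

  // MIME type first, URL extension second, fallback last.
  def fileExtension(url: String, mimeType: String): String = {
    val fromUrl = urlExtension(url)
    mimeToExtensions.getOrElse(mimeType, Nil) match {
      // The URL extension is one of the extensions valid for this MIME type.
      case exts if exts.contains(fromUrl) => fromUrl
      // URL extension is missing or misleading: use the MIME type's first extension.
      case first :: _ => first
      // MIME type unknown to the lookup: fall back to whatever the URL says.
      case Nil if fromUrl.nonEmpty => fromUrl
      // Nothing usable at all.
      case _ => Fallback
    }
  }
}
```

With this ordering, a URL ending in ".doc" whose content is detected as `text/html` is given an `html` extension, which addresses the mislabelled ".doc" case raised in the comments, and a binary with neither a recognised MIME type nor a URL extension ends up with `UNKNOWN`.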