Add method for unknown extensions in binary extractions #343
Comments
ruebot added the enhancement, Scala, and DataFrames labels on Aug 13, 2019
I noticed when I was filtering for URLs ending in ".doc" that I was getting a lot of non-doc formats (HTML and text formats). I think such incorrect file extensions are less likely for the other binary formats we're targeting, but if we want a generic algorithm for determining the extension, I'd cover the .doc case by switching steps 1 and 2. I'd also use the plural method. We're already getting the MimeType with Tika, which is the only expensive operation in this process. My method:
ruebot commented Aug 13, 2019
With the implementation of #302, #306, and #307, we will occasionally get binaries that are extracted but do not have file extensions on them. We should create a method/helper to account for this, using UNKNOWN or something else as the extension.
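A minimal sketch of what such a helper could look like, in plain Scala with no Tika dependency. The names `ExtensionFallback`, `urlExtension`, `resolve`, and the `mimeExtensions` parameter are hypothetical and not part of the project. It assumes the caller has already looked up candidate extensions for the detected MIME type, and it prefers those over the URL because, as noted in the thread, URL extensions such as ".doc" can be wrong. The placeholder value "unknown" is likewise just one possible choice.

```scala
import java.net.URI
import scala.util.Try

// Hypothetical helper, not the project's actual API.
object ExtensionFallback {
  // Placeholder returned when no extension can be determined (an assumption).
  val Unknown = "unknown"

  /** Lowercase extension of the last path segment of a URL, if there is one. */
  def urlExtension(url: String): Option[String] = {
    // getPath drops the query string and fragment; it is null for opaque URIs.
    val path = Try(new URI(url)).toOption.flatMap(u => Option(u.getPath)).getOrElse("")
    val lastSegment = path.substring(path.lastIndexOf('/') + 1)
    val dot = lastSegment.lastIndexOf('.')
    // Require a non-empty name before the dot and a non-empty extension after it.
    if (dot > 0 && dot < lastSegment.length - 1)
      Some(lastSegment.substring(dot + 1).toLowerCase)
    else None
  }

  /** Prefer a MIME-derived extension, then the URL's, then the placeholder.
    * `mimeExtensions` is assumed to hold values such as ".doc" supplied by the caller.
    */
  def resolve(url: String, mimeExtensions: Seq[String] = Seq.empty): String =
    mimeExtensions.headOption
      .map(_.stripPrefix(".").toLowerCase)
      .orElse(urlExtension(url))
      .getOrElse(Unknown)
}
```

For example, `ExtensionFallback.resolve("http://example.com/download?id=42")` returns `"unknown"`, while `ExtensionFallback.resolve("http://example.com/page.html", Seq(".doc"))` returns `"doc"` because the MIME-derived candidate takes precedence over the URL.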