Add method for unknown extensions in binary extractions #343

Open
ruebot opened this issue Aug 13, 2019 · 1 comment

Comments

@ruebot
Member

commented Aug 13, 2019

With the implementation of #302, #306, and #307, we will occasionally extract binaries that do not have file extensions. We should create a method/helper to account for this:

  1. Try existing method
  2. Try to guess the file extension from the MimeType (@jrwiebe is working on this in get-extension)
  3. If both fail, use UNKNOWN or something else as the extension.
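
The three steps above can be sketched roughly as follows. This is only an illustration of the fallback order, not the project's code: `ExtensionFallback`, `extFromUrl`, `guessExtension`, and the small `mimeToExt` map are hypothetical names, and the map stands in for whatever MIME-type lookup the real helper ends up using.

```scala
object ExtensionFallback {
  // Hypothetical stand-in for a MIME-type lookup; a real helper would ask Tika.
  val mimeToExt: Map[String, String] = Map(
    "application/pdf" -> "pdf",
    "image/png" -> "png"
  )

  // Step 1: take the extension from the last path segment of the URL, if any.
  def extFromUrl(url: String): Option[String] = {
    val path = url.takeWhile(c => c != '?' && c != '#') // drop query and fragment
    val name = path.substring(path.lastIndexOf('/') + 1)
    val dot = name.lastIndexOf('.')
    if (dot >= 0 && dot < name.length - 1) Some(name.substring(dot + 1).toLowerCase)
    else None
  }

  // Steps 1 to 3 in the order listed: URL first, then MIME type, then "unknown".
  def guessExtension(url: String, mimeType: String): String =
    extFromUrl(url)
      .orElse(mimeToExt.get(mimeType))
      .getOrElse("unknown")
}
```

With this ordering, a URL that carries a misleading extension wins over the MIME type, which is the case the reply below is concerned about.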
@jrwiebe

Contributor

commented Aug 13, 2019

I noticed when I was filtering for URLs ending in ".doc" that I was getting a lot of non-doc formats (HTML and text formats). I think such incorrect file extensions are less likely for the other binary formats we're targeting, but if we want a generic algorithm for determining the extension, I'd cover the .doc case by switching steps 1 and 2, so the MimeType is tried before the URL. I'd also use the plural method getExtensions: if the extension returned by FilenameUtils is contained in that list, I'd select that one.

We're already getting the MimeType with Tika, which is the only expensive operation in this process.

My method:

  import org.apache.commons.io.FilenameUtils

  // Choose a file extension, preferring the Tika MIME type over the URL.
  def getExt(mimeType: String, url: String): String = {
    val tikaExtensions = DetectMimeTypeTika.getExtensions(mimeType)
    var ext = "unknown"
    // Tika method: a single candidate is unambiguous
    if (tikaExtensions.size == 1) {
      ext = tikaExtensions(0).substring(1) // strip the leading dot
    } else {
      // FilenameUtils method
      val fnuExt = FilenameUtils.getExtension(url)
      if (fnuExt != null && !fnuExt.isEmpty) {
        // Reconcile Tika list and FilenameUtils extension
        if (tikaExtensions.size > 1) {
          // Tika entries carry a leading dot; FilenameUtils omits it
          if (tikaExtensions.contains("." + fnuExt)) {
            ext = fnuExt
          } else {
            ext = tikaExtensions(0).substring(1)
          }
        } else { // tikaExtensions.size == 0 && fnuExt exists
          ext = fnuExt
        }
      } // else => unknown
    }
    ext
  }
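
To make the branches easier to check, here is a self-contained sketch of the same decision logic that runs without Tika or Commons IO. It rests on stated assumptions: `GetExtDemo` and `stubExtensions` are hypothetical, the map stands in for `DetectMimeTypeTika.getExtensions` (entries keep a leading dot), and the second argument is the already-extracted URL extension rather than the full URL.

```scala
object GetExtDemo {
  // Hypothetical stand-in for the Tika lookup; entries keep the leading dot.
  val stubExtensions: Map[String, List[String]] = Map(
    "text/html" -> List(".html", ".htm"),
    "application/pdf" -> List(".pdf")
  )

  // urlExt is the extension taken from the URL, without a dot ("" if none).
  def getExt(mimeType: String, urlExt: String): String = {
    val candidates = stubExtensions.getOrElse(mimeType, Nil)
    candidates match {
      // Exactly one candidate: trust the MIME type.
      case single :: Nil => single.substring(1)
      // No candidates: fall back to the URL extension, if there is one.
      case Nil => if (urlExt.nonEmpty) urlExt else "unknown"
      // Several candidates: keep the URL extension only if the list contains it.
      case first :: _ =>
        if (urlExt.isEmpty) "unknown"
        else if (candidates.contains("." + urlExt)) urlExt
        else first.substring(1)
    }
  }
}
```

Under these assumptions, an HTML page served from a URL ending in ".doc" is reported as "html", while a URL ending in ".htm" keeps "htm" because that extension is in the candidate list.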