Add method for unknown extensions in binary extractions #343

ruebot · Aug 13, 2019

With #302, #306, and #307 implemented, we will occasionally extract binaries that have no file extension. We should create a method/helper to account for this:

1. Try the existing method.
2. Try to guess the file extension from the MimeType (@jrwiebe is working on this in get-extension).
3. If both fail, use UNKNOWN or something else as the extension.
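
The fallback order above could be sketched as a chain of Option values. This is only an illustration under stated assumptions: `extFromUrl`, `extFromMime`, and the small `mimeToExt` table are hypothetical stand-ins (a real lookup would come from Tika), not functions that exist in aut.

```scala
object ExtensionFallback {
  // Hypothetical stand-in for a MIME-type lookup; a real implementation would ask Tika.
  private val mimeToExt: Map[String, String] = Map(
    "application/pdf" -> "pdf",
    "application/msword" -> "doc"
  )

  // Step 1: take whatever follows the last dot in the final path segment, if anything.
  def extFromUrl(url: String): Option[String] = {
    val path = url.takeWhile(c => c != '?' && c != '#')
    val name = path.substring(path.lastIndexOf('/') + 1)
    val dot = name.lastIndexOf('.')
    if (dot >= 0 && dot < name.length - 1) Some(name.substring(dot + 1).toLowerCase) else None
  }

  // Step 2: guess the extension from the MIME type.
  def extFromMime(mimeType: String): Option[String] = mimeToExt.get(mimeType)

  // Step 3: fall back to "unknown" when neither step produced an extension.
  def guess(url: String, mimeType: String): String =
    extFromUrl(url).orElse(extFromMime(mimeType)).getOrElse("unknown")
}
```

For example, `ExtensionFallback.guess("http://example.com/download?id=7", "application/msword")` finds no extension in the URL and falls through to the MIME lookup.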

jrwiebe · Aug 13, 2019

I noticed when I was filtering for URLs ending in ".doc" that I was getting a lot of non-doc formats (HTML and text formats). I think it's less likely there will be such incorrect file extensions for the other binary formats we're targeting, but if we want a generic algorithm for determining the extension, I'd cover the .doc case by switching steps 1 and 2. I'd also use the plural method getExtensions and if the extension returned by FilenameUtils is contained in this list, I'd select that one.

We're already getting the MimeType with Tika, which is the only expensive operation in this process.

My method:

  import org.apache.commons.io.FilenameUtils

  def getExt(mimeType: String, url: String): String = {
    // Candidate extensions for this MIME type, each with a leading dot
    val tikaExtensions = DetectMimeTypeTika.getExtensions(mimeType)
    var ext = "unknown"
    // Tika method: a single candidate is unambiguous
    if (tikaExtensions.size == 1) {
      ext = tikaExtensions(0).substring(1)
    } else {
      // FilenameUtils method
      val fnuExt = FilenameUtils.getExtension(url)
      if (fnuExt != null && !fnuExt.isEmpty) {
        // Reconcile Tika list and FilenameUtils extension
        if (tikaExtensions.size > 1) {
          // Tika entries carry a leading dot (hence substring(1)), so add one before comparing
          if (tikaExtensions.contains("." + fnuExt)) {
            ext = fnuExt
          } else {
            ext = tikaExtensions(0).substring(1)
          }
        } else { // tikaExtensions.size == 0 && fnuExt exists
          ext = fnuExt
        }
      } // else => unknown
    }
    ext
  }
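
To see how the branches play out, here is a self-contained sketch of the same decision logic with the two library calls replaced by stand-ins. `tikaExtensionsFor` is a hypothetical hard-coded table in place of DetectMimeTypeTika.getExtensions, and `urlExtension` is a simplified replacement for FilenameUtils.getExtension. The extension lists are illustrative, not Tika's actual output. The sketch assumes Tika's entries include the leading dot (as the substring(1) calls suggest), so it prepends "." before the membership check.

```scala
object GetExtSketch {
  // Hypothetical stand-in for DetectMimeTypeTika.getExtensions; entries keep the leading dot.
  def tikaExtensionsFor(mimeType: String): List[String] = mimeType match {
    case "application/pdf" => List(".pdf")
    case "image/jpeg"      => List(".jpg", ".jpeg", ".jpe")
    case _                 => Nil
  }

  // Simplified stand-in for FilenameUtils.getExtension: text after the last dot
  // of the last path segment, or "" when there is none.
  def urlExtension(url: String): String = {
    val name = url.substring(url.lastIndexOf('/') + 1)
    val dot = name.lastIndexOf('.')
    if (dot >= 0) name.substring(dot + 1) else ""
  }

  def getExt(mimeType: String, url: String): String = {
    val tikaExts = tikaExtensionsFor(mimeType)
    val urlExt = urlExtension(url)
    tikaExts match {
      // Exactly one candidate from the MIME type: use it, ignoring the URL.
      case single :: Nil => single.substring(1)
      // Zero or several candidates and nothing usable in the URL.
      case _ if urlExt.isEmpty => "unknown"
      // No candidates from the MIME type: trust the URL.
      case Nil => urlExt
      // Several candidates: keep the URL's extension if listed, else the first candidate.
      case first :: _ =>
        if (tikaExts.contains("." + urlExt)) urlExt else first.substring(1)
    }
  }
}
```

With these stand-ins, a URL ending in ".doc" served as "application/pdf" resolves to "pdf", which is the mislabelled .doc case the reordering is meant to cover.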

ruebot · Aug 16, 2019

I have an idea since I noticed these methods as I'm hacking on #346

https://github.com/archivesunleashed/aut/blob/master/src/main/scala/io/archivesunleashed/package.scala#L307-L313

https://github.com/archivesunleashed/aut/blob/master/src/main/scala/io/archivesunleashed/package.scala#L374-L380

Use what you have above, and combine it there. Maybe do something consistent with what we have with getMimeType, where it uses the web server MIME type, and have keepTikaMimeTypes and discardTikaMimeTypes. It could clean up a whole lot of what we have put in there over the last couple of days.

jrwiebe · Aug 16, 2019

> Use what you have above, and combine it there.

Not sure what you mean by this line, unless you're just saying to put the method I wrote above in that section of package.scala. (Which I was about to do.)

ruebot · Aug 16, 2019

@jrwiebe yep! That, plus potentially mimicking those two existing functions as well. Does that make sense?

jrwiebe · Aug 16, 2019

Actually, now that I'm trying it I realize we don't want a getExtension method that applies to RDDs. I'm putting it in the matchbox.

jrwiebe · Aug 16, 2019

Did it.

@ruebot Want to integrate this into PR #346? Or I could make a separate one after that's merged.

ruebot added enhancement Scala DataFrames labels Aug 13, 2019

ruebot referenced this issue Aug 16, 2019
Open
Add office document binary extraction. #346

archivesunleashed/aut
