jrwiebe and ruebot Add PDF binary extraction. (#340)
Introduces the new extractPDFDetailsDF() method and makes our use of Tika's MIME type detection more efficient. Also updates the POM to use a shaded version of tika-parsers, which eliminates a long-troublesome dependency version conflict.

- Updates getImageBytes to getBinaryBytes
- Refactors the SaveImage class into the more general SaveBytes, and renames saveToDisk to saveImageToDisk
- Only instantiate Tika when the DetectMimeTypeTika singleton object is first referenced. See https://git.io/fj7g0.
- Use TikaInputStream to enable container-aware detection. Until now we were only using the default Mime Magic detection. See https://tika.apache.org/1.22/detection.html#Container_Aware_Detection.
- Adds a generic saveToDisk method to save a bytes column of a DataFrame to files
- Updates tests
- Resolves #302
- Further addresses #308
- Includes work by @ruebot, see #340 for all commits before squash
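
The lazy-instantiation change above (build the detector only when the singleton is first referenced) can be illustrated with a minimal sketch. It is written in Java rather than Scala and uses the initialization-on-demand holder idiom, which gives roughly the same "construct on first use" behaviour as a Scala singleton object. Every name here (ExpensiveDetector, LazyDetector, the toy "%PDF" check) is hypothetical and stands in for the real Tika-backed code, which this sketch does not reproduce.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for an expensive-to-construct detector; NOT the real Tika class. */
final class ExpensiveDetector {
    // Counts constructions so the laziness can be observed.
    static final AtomicInteger constructions = new AtomicInteger();

    ExpensiveDetector() {
        constructions.incrementAndGet();
    }

    /** Toy magic-byte check (assumption for illustration): recognises only a "%PDF" prefix. */
    String detect(byte[] bytes) {
        boolean pdf = bytes.length >= 4
                && bytes[0] == '%' && bytes[1] == 'P'
                && bytes[2] == 'D' && bytes[3] == 'F';
        return pdf ? "application/pdf" : "application/octet-stream";
    }
}

/** Hypothetical singleton facade: the detector is built once, on first use. */
final class LazyDetector {
    private LazyDetector() {}

    // The JVM initialises Holder only when Holder.INSTANCE is first read,
    // so nothing is constructed until detect() is called.
    private static final class Holder {
        static final ExpensiveDetector INSTANCE = new ExpensiveDetector();
    }

    static String detect(byte[] bytes) {
        return Holder.INSTANCE.detect(bytes);
    }
}

class LazyDetectorDemo {
    public static void main(String[] args) {
        byte[] pdfHeader = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(LazyDetector.detect(pdfHeader));
        System.out.println(LazyDetector.detect(new byte[] {1, 2}));
        // However many times detect() is called, the detector is constructed once.
        System.out.println("constructions: " + ExpensiveDetector.constructions.get());
    }
}
```

The holder idiom relies on JVM class-initialisation guarantees, so it needs no explicit locking; whether the actual change uses an equivalent mechanism is best checked against the linked commit.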
Latest commit 73981a7 Aug 11, 2019
Type Name Latest commit message Commit time
main Add PDF binary extraction. (#340) Aug 12, 2019
test Add PDF binary extraction. (#340) Aug 12, 2019