Add PDF binary extraction. (#340)
Introduces the new extractPDFDetailsDF() method and makes our use of Tika's MIME type detection more efficient. Also updates the POM to use a shaded version of tika-parsers, which eliminates a long-troublesome dependency version conflict.

- Updates getImageBytes to getBinaryBytes
- Refactors the SaveImage class to the more general SaveBytes, and saveToDisk to saveImageToDisk
- Only instantiates Tika when the DetectMimeTypeTika singleton object is first referenced. See https://git.io/fj7g0.
- Uses TikaInputStream to enable container-aware detection. Until now we were only using the default Mime Magic detection. See https://tika.apache.org/1.22/detection.html#Container_Aware_Detection.
- Adds a generic saveToDisk method to save a bytes column of a DataFrame to files
- Updates tests
- Resolves #302
- Further addresses #308
- Includes work by @ruebot, see #340 for all commits before squash
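The two Tika-related changes in this message can be illustrated with a minimal sketch. It is not the code from this commit: the object name `MimeDetectionSketch`, the method name `detect`, and the `"N/A"` placeholder for empty input are assumptions made for illustration. It relies only on `Tika.detect(InputStream)` and `TikaInputStream.get(Array[Byte])` from Tika 1.x, and container-aware detection additionally assumes tika-parsers is on the classpath.

```scala
import org.apache.tika.Tika
import org.apache.tika.io.TikaInputStream

// Hypothetical stand-in for a MIME detection helper; not the commit's code.
object MimeDetectionSketch {
  // A Scala object is initialized the first time it is referenced, so this
  // Tika instance is only constructed when detection is first needed.
  private val tika = new Tika()

  def detect(content: Array[Byte]): String = {
    if (content.isEmpty) {
      "N/A" // assumed placeholder for empty payloads
    } else {
      // Wrapping the bytes in a TikaInputStream lets container-aware
      // detectors (when available) inspect the payload, not just magic bytes.
      val stream = TikaInputStream.get(content)
      try tika.detect(stream)
      finally stream.close()
    }
  }
}
```

A call such as `MimeDetectionSketch.detect(recordBytes)`, where `recordBytes` is a hypothetical `Array[Byte]` holding a record's payload, would return the detected MIME type as a string.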
Showing 7 changed files with 93 additions and 25 deletions.
- +13 −1 pom.xml
- +2 −2 src/main/scala/io/archivesunleashed/ArchiveRecord.scala
- +1 −1 src/main/scala/io/archivesunleashed/app/ExtractPopularImages.scala
- +35 −7 src/main/scala/io/archivesunleashed/df/package.scala
- +6 −3 src/main/scala/io/archivesunleashed/matchbox/DetectMimeTypeTika.scala
- +33 −8 src/main/scala/io/archivesunleashed/package.scala
- +3 −3 src/test/scala/io/archivesunleashed/df/{SaveImageTest.scala → SaveBytesTest.scala}
0 comments on commit 73981a7