Tree: fb10f4eef9
-
Verified
This commit was signed with a verified signature.ruebot Nick RuestGPG key ID: 417FAF1A0E1080CD Learn about signing commits -
-
jrwiebe committed
Aug 16, 2019 -
ruebot committed
Aug 16, 2019
-
Remove unnecessary filtering on file extension, which might produce
false positives.
-
Remove tika-parser classifier.
jrwiebe committed
Aug 15, 2019 -
Add office document binary extraction.
- Add WordProcessor DF and binary extraction
- Add Spreadsheets DF and binary extraction
- Add Presentation Program DF and binary extraction
- Add tests for new DF and binary extractions
- Add test fixture for new DF and binary extractions
- Resolves #303
- Resolves #304
- Resolves #305
- Back out 39831c2 (We _might_ not have to do this)
-
Use version of tika-parsers without a classifier. (#345)
Ivy couldn't handle it, and specifying one for the custom tika-parsers artifact was unnecessary.
-
Add audio & video binary extraction (#341)
- Add Audio & Video binary extraction.
- Add filename and extension columns to audio, pdf, and video DF
- Pass binary bytes instead of string to DetectMimeTypeTika in DF (s/getContentString/getBinaryBytes)
- Updates saveToDisk to use file extension from DF column
- Adds tests for Audio, PDF, and Video DF extraction
- Add test fixtures for Audio, PDF, and Video DF extraction
- Rename SaveBytesTest to SaveImageBytes test
- Eliminate bytes->string->bytes conversion that was causing data loss in DetectMimeTypeTika
- Update tika-parsers dep from JitPack
- Remove tweet cruft
- Resolves #306
- Resolves #307
- Includes work by @jrwiebe, see #341 for all commits before squash
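The data loss from a bytes->string->bytes conversion can be reproduced outside the project. The following is a minimal Python sketch, not the project's Scala code: `round_trip` is a hypothetical helper, and it assumes a UTF-8 round trip with replacement of undecodable bytes, which may differ from the charset handling the original code used.

```python
def round_trip(raw: bytes, encoding: str = "utf-8") -> bytes:
    """Decode bytes to text and encode them back (hypothetical helper).

    Undecodable bytes are swapped for U+FFFD, so the original values are lost.
    """
    return raw.decode(encoding, errors="replace").encode(encoding)


# First four bytes of a JPEG file; 0xFF is never valid in UTF-8.
jpeg_header = bytes([0xFF, 0xD8, 0xFF, 0xE0])
ascii_text = b"plain ASCII survives"

damaged = round_trip(jpeg_header)
intact = round_trip(ascii_text)

print(damaged == jpeg_header)  # binary payload no longer matches
print(intact == ascii_text)    # pure ASCII is unaffected
```

Because MIME detection relies on the leading bytes of a payload, a round trip like this can hide the signature the detector needs, which is consistent with the switch to passing the binary bytes directly.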
-
Add PDF binary extraction. (#340)
Introduces the new extractPDFDetailsDF() method and brings in changes to make our use of Tika's MIME type detection more efficient, as well as POM updates to use a shaded version of tika-parsers in order to eliminate a dependency version conflict that has long been troublesome.
- Updates getImageBytes to getBinaryBytes
- Refactor SaveImage class to more general SaveBytes, and saveToDisk to saveImageToDisk
- Only instantiate Tika when the DetectMimeTypeTika singleton object is first referenced. See https://git.io/fj7g0.
- Use TikaInputStream to enable container-aware detection. Until now we were only using the default Mime Magic detection. See https://tika.apache.org/1.22/detection.html#Container_Aware_Detection.
- Added generic saveToDisk method to save a bytes column of a DataFrame to files
- Updates tests
- Resolves #302
- Further addresses #308
- Includes work by @ruebot, see #340 for all commits before squash
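The difference between magic-byte detection and container-aware detection can be illustrated with a small Python sketch. This is not Tika's implementation or API: the function names, the `OOXML_TYPES` table, and the in-memory test document are all assumptions made for illustration. The idea is that an Office Open XML file is a ZIP archive, so its leading bytes only say "ZIP", and the detector has to look inside the container to be more specific.

```python
import io
import zipfile

ZIP_MAGIC = b"PK\x03\x04"

# Hypothetical lookup table: top-level folder inside the ZIP -> MIME type.
OOXML_TYPES = {
    "word/": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "xl/": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "ppt/": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}


def detect_by_magic(data: bytes) -> str:
    """Guess a MIME type from the leading bytes only."""
    if data.startswith(ZIP_MAGIC):
        return "application/zip"
    if data.startswith(b"%PDF-"):
        return "application/pdf"
    return "application/octet-stream"


def detect_container_aware(data: bytes) -> str:
    """Refine a ZIP result by inspecting the archive's entry names."""
    mime = detect_by_magic(data)
    if mime != "application/zip":
        return mime
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            names = archive.namelist()
    except zipfile.BadZipFile:
        return mime  # looked like a ZIP but could not be opened
    for prefix, ooxml_type in OOXML_TYPES.items():
        if any(name.startswith(prefix) for name in names):
            return ooxml_type
    return mime


def make_fake_docx() -> bytes:
    """Build a tiny ZIP that mimics the layout of a .docx file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr("word/document.xml", "<document/>")
    return buffer.getvalue()


docx_bytes = make_fake_docx()
print(detect_by_magic(docx_bytes))
print(detect_container_aware(docx_bytes))
```

With magic bytes alone the sample document is reported as a generic ZIP; the container-aware pass reports a word-processing document, which is the kind of distinction office document extraction would depend on.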
-
More scalastyle work; addresses #196. (#339)
- Remove all underscore imports, except shapeless
- Address all scalastyle warnings
- Update scalastyle config for magic numbers, and null (only used in tests)
-
Update Tika to 1.22; address security alerts. (#337)
- Update Tika to 1.22
- pom.xml surgery to get aut to build again with --packages
-
Python formatting, and gitignore additions. (#326)
- Run black and isort on Python files.
- Move Spark config to example file.
- Update gitignore for 7a61f0e additions.
-
Makes ArchiveRecordImpl serializable by removing non-serializable ARCRecord and WARCRecord variables. Also removes unused headerResponseFormat variable. (#316)
-
Resolve cobertura-maven-plugin class issue; resolves #313. (#314)
- Exclude slf4j binding logback-classic (mojohaus/cobertura-maven-plugin#6 (comment))