Tree: 86fb5433b1
-
Remove saveImageToDisk and its test
jrwiebe committed Aug 17, 2019
-
Add keep and discard by http status. (#347)
- Add keep and discard by http status RecordLoader
- Add tests
- Clean up/add doc comments in RecordLoader
- Resolve #315
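The keep/discard pairing named in this commit can be pictured with plain Scala collections. This is only a minimal sketch, not the RecordLoader code from the commit: the `Record` case class, the `StatusFilters` object, and both method names are hypothetical, and status codes are assumed to be carried as strings.

```scala
// Hypothetical stand-in for an archive record; not an aut type.
final case class Record(url: String, httpStatus: String)

object StatusFilters {
  // Keep only the records whose HTTP status is in the given set.
  def keepByStatus(records: Seq[Record], statuses: Set[String]): Seq[Record] =
    records.filter(r => statuses.contains(r.httpStatus))

  // Drop the records whose HTTP status is in the given set.
  def discardByStatus(records: Seq[Record], statuses: Set[String]): Seq[Record] =
    records.filterNot(r => statuses.contains(r.httpStatus))
}
```

Writing the two filters as exact complements means that, for the same status set, every record ends up in one result or the other.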
-
jrwiebe committed Aug 17, 2019
-
Remove filtering on URL for images; add DF fields to image extraction
jrwiebe committed Aug 17, 2019
-
Remove filtering on URL for audio, video, and images.
jrwiebe committed Aug 16, 2019
-
Insert `toLowerCase` into `getUrl.endsWith()` calls in io.archivesunleashed.packages; apply to `FilenameUtils.getExtension` in `GetExtensionMime`.
jrwiebe committed Aug 16, 2019
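Lower-casing a URL before an `endsWith` check, as this commit title describes, makes extension matching case-insensitive. The sketch below illustrates the idea with the standard library only; `ExtensionMatch` and `hasExtension` are hypothetical names, not identifiers from the aut codebase, and the use of `Locale.ROOT` is an added assumption to keep the comparison locale-independent.

```scala
object ExtensionMatch {
  // Hypothetical helper: true when the URL ends with any of the given
  // extensions, ignoring case. Locale.ROOT avoids locale-specific case rules.
  def hasExtension(url: String, extensions: Set[String]): Boolean = {
    val lower = url.toLowerCase(java.util.Locale.ROOT)
    extensions.exists { ext =>
      lower.endsWith("." + ext.toLowerCase(java.util.Locale.ROOT))
    }
  }
}
```

With this shape, `photo.JPG` and `photo.jpg` are treated the same, which a case-sensitive `endsWith(".jpg")` would not do.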
-
Fix GetExtensionMime case where URL has no extension but a MIME type is detected
jrwiebe committed Aug 16, 2019
-
Remove most filtering by file extension from binary extraction methods; add CSV/TSV special cases.
jrwiebe committed Aug 16, 2019
-
Remove tika-parsers classifier
jrwiebe committed Aug 16, 2019
-
jrwiebe committed Aug 16, 2019
-
Add office document binary extraction. (#346)
- Add Word Processor DF and binary extraction
- Add Spreadsheets DF and binary extraction
- Add Presentation Program DF and binary extraction
- Add Text files DF and binary extraction
- Add tests for new DF and binary extractions
- Add test fixtures for new DF and binary extractions
- Resolves #303
- Resolves #304
- Resolves #305
- Use aut-resources repo to distribute our shaded tika-parsers 1.22
- Close TikaInputStream
- Add RDD filters on MimeTypeTika values
- Add CodeCov configuration yaml
- Includes work by @jrwiebe, see #346 for all commits before squash
-
Use GetExtensionMime for extraction methods; minor fixes.
jrwiebe committed Aug 16, 2019
-
Merge remote-tracking branch 'remotes/origin/master' into get-extension
jrwiebe committed Aug 16, 2019
# Conflicts:
# src/main/scala/io/archivesunleashed/matchbox/DetectMimeTypeTika.scala
-
Matchbox object to get extension of URL
jrwiebe committed Aug 16, 2019
-
Use version of tika-parsers without a classifier. (#345)
Ivy couldn't handle it, and specifying one for the custom tika-parsers artifact was unnecessary.
-
Add audio & video binary extraction (#341)
- Add Audio & Video binary extraction.
- Add filename and extension columns to audio, pdf, and video DF
- Pass binary bytes instead of string to DetectMimeTypeTika in DF (s/getContentString/getBinaryBytes)
- Updates saveToDisk to use file extension from DF column
- Adds tests for Audio, PDF, and Video DF extraction
- Add test fixtures for Audio, PDF, and Video DF extraction
- Rename SaveBytesTest to SaveImageBytes test
- Eliminate bytes->string->bytes conversion that was causing data loss in DetectMimeTypeTika
- Update tika-parsers dep from JitPack
- Remove tweet cruft
- Resolves #306
- Resolves #307
- Includes work by @jrwiebe, see #341 for all commits before squash
-
Add getExtensions method to DetectMimeTypeTika.
jrwiebe committed Aug 13, 2019
-
Adds method for getting a file extension from a MIME type.
jrwiebe committed Aug 13, 2019
-
Use fixed version of shaded tika-parsers
jrwiebe committed Aug 13, 2019
-
Use fixed version of shaded tika-parsers
jrwiebe committed Aug 13, 2019
-
Add PDF binary extraction. (#340)
Introduces the new extractPDFDetailsDF() method and brings in changes to make our use of Tika's MIME type detection more efficient, as well as POM updates to use a shaded version of tika-parsers in order to eliminate a dependency version conflict that has long been troublesome.
- Updates getImageBytes to getBinaryBytes
- Refactor SaveImage class to more general SaveBytes, and saveToDisk to saveImageToDisk
- Only instantiate Tika when the DetectMimeTypeTika singleton object is first referenced. See https://git.io/fj7g0.
- Use TikaInputStream to enable container-aware detection. Until now we were only using the default Mime Magic detection. See https://tika.apache.org/1.22/detection.html#Container_Aware_Detection.
- Added generic saveToDisk method to save a bytes column of a DataFrame to files
- Updates tests
- Resolves #302
- Further addresses #308
- Includes work by @ruebot, see #340 for all commits before squash
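The point about instantiating Tika only when the singleton is first referenced relies on a general Scala property: an `object` body runs on first access, not at program start. The sketch below demonstrates that property without Tika; `InitCounter`, `ExpensiveDetector`, and `DetectSketch` are hypothetical names, and the returned MIME strings are placeholders rather than real detection results.

```scala
// Counts constructions so the deferred initialization is observable.
object InitCounter { var count = 0 }

// Hypothetical stand-in for a detector that is costly to construct.
final class ExpensiveDetector {
  InitCounter.count += 1
  // Placeholder logic only; no real MIME detection happens here.
  def detect(bytes: Array[Byte]): String =
    if (bytes.isEmpty) "N/A" else "application/octet-stream"
}

// The detector is built once, when DetectSketch is first referenced,
// and then shared by every later call.
object DetectSketch {
  private val detector = new ExpensiveDetector
  def apply(bytes: Array[Byte]): String = detector.detect(bytes)
}
```

Holding the instance in a `val` inside the object means repeated calls reuse one detector instead of constructing a new one per record.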
-
More scalastyle work; addresses #196. (#339)
- Remove all underscore imports, except shapeless
- Address all scalastyle warnings
- Update scalastyle config for magic numbers, and null (only used in tests)
-
Update Tika to 1.22; address security alerts. (#337)
- Update Tika to 1.22
- pom.xml surgery to get aut to build again with --packages
-
Python formatting, and gitignore additions. (#326)
- Run black and isort on Python files.
- Move Spark config to example file.
- Update gitignore for 7a61f0e additions.