Add office document binary extraction. (#346)
- Add Word Processor DF and binary extraction
- Add Spreadsheets DF and binary extraction (see the sketch below)
- Add Presentation Program DF and binary extraction
- Add Text files DF and binary extraction
- Add tests for new DF and binary extractions
- Add test fixtures for new DF and binary extractions
- Resolves #303
- Resolves #304
- Resolves #305
- Use aut-resources repo to distribute our shaded tika-parsers 1.22
- Close TikaInputStream
- Add RDD filters on MimeTypeTika values
- Add CodeCov configuration yaml
- Includes work by @jrwiebe, see #346 for all commits before squash
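The office formats follow the DataFrame extraction pattern that #340 introduced for PDFs. A minimal sketch of the intended usage, assuming a spark-shell session with aut loaded; the method name extractSpreadsheetDetailsDF() and the mime_type_tika column are assumptions made by analogy, not names confirmed by this log:

```scala
import io.archivesunleashed._
import org.apache.spark.sql.functions.col

// Hypothetical: extractSpreadsheetDetailsDF() and "mime_type_tika" are
// assumed by analogy with extractPDFDetailsDF() from #340.
val spreadsheets = RecordLoader
  .loadArchives("/path/to/warcs/*.gz", sc)
  .extractSpreadsheetDetailsDF()

// Filter rows on the Tika-detected MIME type.
spreadsheets
  .filter(col("mime_type_tika") === "application/vnd.ms-excel")
  .show(10)
```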
-
Use version of tika-parsers without a classifier. (#345)
Ivy couldn't handle it, and specifying one for the custom tika-parsers artifact was unnecessary.
-
Add audio & video binary extraction (#341)
- Add Audio & Video binary extraction
- Add filename and extension columns to audio, pdf, and video DF
- Pass binary bytes instead of string to DetectMimeTypeTika in DF (s/getContentString/getBinaryBytes)
- Update saveToDisk to use the file extension from a DF column (a sketch follows the list)
- Add tests for Audio, PDF, and Video DF extraction
- Add test fixtures for Audio, PDF, and Video DF extraction
- Rename SaveBytesTest to SaveImageBytesTest
- Eliminate the bytes->string->bytes conversion that was causing data loss in DetectMimeTypeTika
- Update tika-parsers dep from JitPack
- Remove tweet cruft
- Resolves #306
- Resolves #307
- Includes work by @jrwiebe, see #341 for all commits before squash
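A sketch of the extension-aware save, again assuming a spark-shell session with aut loaded; extractAudioDetailsDF(), the column name, and the saveToDisk argument order are assumptions based on this entry, not confirmed signatures:

```scala
import io.archivesunleashed._

// Hypothetical method name, per this changelog entry.
val audio = RecordLoader
  .loadArchives("/path/to/warcs/*.gz", sc)
  .extractAudioDetailsDF()

// saveToDisk reads each row's file extension from a DataFrame column
// instead of guessing it; the signature and output naming are assumed.
audio.saveToDisk("bytes", "/tmp/audio/audio")
```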
-
Add PDF binary extraction. (#340)
Introduces the new extractPDFDetailsDF() method and brings in changes to make our use of Tika's MIME type detection more efficient, as well as POM updates to use a shaded version of tika-parsers in order to eliminate a dependency version conflict that has long been troublesome.
- Update getImageBytes to getBinaryBytes
- Refactor the SaveImage class to the more general SaveBytes, and saveToDisk to saveImageToDisk
- Only instantiate Tika when the DetectMimeTypeTika singleton object is first referenced. See https://git.io/fj7g0. (A sketch of this and the TikaInputStream change follows the list.)
- Use TikaInputStream to enable container-aware detection. Until now we were only using the default Mime Magic detection. See https://tika.apache.org/1.22/detection.html#Container_Aware_Detection.
- Add a generic saveToDisk method to save a bytes column of a DataFrame to files
- Update tests
- Resolves #302
- Further addresses #308
- Includes work by @ruebot, see #340 for all commits before squash
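Two of these changes are worth illustrating. In Scala, an object's body runs only when the object is first referenced, which gives the lazy instantiation for free, and wrapping input in a TikaInputStream turns on container-aware detection. A standalone sketch of both ideas, not aut's actual source:

```scala
import java.io.ByteArrayInputStream
import org.apache.tika.Tika
import org.apache.tika.io.TikaInputStream

// Sketch only. A Scala object initializes on first reference, so the
// Tika instance is not constructed until detection is actually needed.
object DetectMimeTypeTikaSketch {
  private lazy val tika = new Tika()

  // TikaInputStream enables container-aware detection (e.g. ZIP-based
  // office formats) on top of the default mime-magic detection.
  def detect(bytes: Array[Byte]): String = {
    val tis = TikaInputStream.get(new ByteArrayInputStream(bytes))
    try tika.detect(tis)
    finally tis.close() // #346 later notes the stream must be closed
  }
}
```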
-
More scalastyle work; addresses #196. (#339)
- Remove all underscore imports, except shapeless (an illustrative example follows the list)
- Address all scalastyle warnings
- Update scalastyle config for magic numbers and null (only used in tests)
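For illustration, the underscore-import rule amounts to replacing wildcard imports with explicit ones; this example is hypothetical, not taken from the diff:

```scala
// Before: wildcard (underscore) import, flagged by scalastyle
// import scala.collection.JavaConverters._

// After: import only the conversion that is actually used
import scala.collection.JavaConverters.asScalaBufferConverter
```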
-
Update Tika to 1.22; address security alerts. (#337)
- Update Tika to 1.22
- pom.xml surgery to get aut to build again with --packages
-
Python formatting, and gitignore additions. (#326)
- Run black and isort on Python files
- Move Spark config to example file
- Update gitignore for 7a61f0e additions
-
Makes ArchiveRecordImpl serializable by removing non-serializable ARCRecord and WARCRecord variables. Also removes unused headerResponseFormat variable. (#316)
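The fix follows a standard Spark pattern: copy the values you need out of the non-serializable record at construction time rather than holding the record itself. An illustrative sketch, not aut's actual code; the field choices are examples:

```scala
// In Scala, a constructor parameter used only during initialization is
// not stored as a field, so the non-serializable record never travels
// with the serialized object.
class SerializableRecordSketch(record: org.archive.io.ArchiveRecord)
  extends Serializable {
  val url: String = record.getHeader.getUrl
  val length: Long = record.getHeader.getLength
}
```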
-
Resolve cobertura-maven-plugin class issue; resolves #313. (#314)
- Exclude slf4j binding logback-classic (see the discussion on mojohaus/cobertura-maven-plugin#6)
-
Update spark-core_2.11 to 2.3.1. (#312)
- CVE-2018-8024: https://nvd.nist.gov/vuln/detail/CVE-2018-8024
- CVE-2018-1334: https://nvd.nist.gov/vuln/detail/CVE-2018-1334
- CVE-2018-17190: https://nvd.nist.gov/vuln/detail/CVE-2018-17190
- CVE-2018-11770: https://nvd.nist.gov/vuln/detail/CVE-2018-11770
-
Add .getHttpStatus and .getArchiveFile to ArchiveRecordImpl class; #198 & #164 (#292). A usage sketch follows the list.
- Resolves #198
- Resolves #164
- Add getHttpStatus to ArchiveRecord class & trait: add .getHttpStatus to potential outputs, add tests for .getHttpStatus calls, and improve ArchiveRecord testing overall
- Add .getArchiveFile feature to ArchiveRecordImpl: add getArchiveFile to the trait and to ArchiveRecordImpl, with tests for getArchiveFile
- Other code style fixes
- Include updates to tests
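A usage sketch for the two new accessors, assuming a spark-shell session with aut loaded; the path is a placeholder:

```scala
import io.archivesunleashed._

// List the source archive file, url, and HTTP status of the first few
// records; getArchiveFile and getHttpStatus are the accessors from #292.
RecordLoader
  .loadArchives("/path/to/warcs/*.gz", sc)
  .map(r => (r.getArchiveFile, r.getUrl, r.getHttpStatus))
  .take(10)
  .foreach(println)
```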
-
Change Id generation for graphs from using hashes for urls to using .zipWithUniqueIds() (#289). A sketch follows the list.
- Resolves #243
- Create GEXF with proper ids instead of hashes to avoid collisions
- Add WriteGEXF files
- Add WriteGraph file and test
- Add test for Graphml output
- Add xml escaping for edges
- Add test case for non-escaped edges
- Add additional tests to cover more potential cases of graphml and gexf files
- Coverage for null cases in urls
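For reference, the Spark RDD method is zipWithUniqueId() (singular). A minimal sketch of the collision-free scheme; the collect() is only for illustration on a toy set:

```scala
// Give each distinct url a unique Long id instead of a hash, so two
// different urls can never collide on the same id.
val urls = sc.parallelize(Seq("http://a.example/", "http://b.example/"))
val idLookup: Map[String, Long] = urls
  .distinct()
  .zipWithUniqueId() // RDD[(String, Long)]
  .collect()         // illustration only; keep it as an RDD at scale
  .toMap
```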