Tree: 448601eaf9
-
Add method for determining binary file extension. (#349)
This PR implements the strategy described in the discussion of the above issue to get an extension for a file described by a URL and a MIME type. It creates a GetExtensionMime object in the matchbox. This PR also removes most of the filtering by URL from the image, audio, video, presentation, spreadsheet, and word processor document extraction methods, since these were returning false positives. (CSV and TSV files are a special case, since Tika detects them as "text/plain" based on content.) Finally, I have inserted toLowerCase into the getUrl.endsWith() filter tests, which could possibly bring in some more CSV and TSV files * Adds method for getting a file extension from a MIME type. * Add getExtensions method to DetectMimeTypeTika. * Matchbox object to get extension of URL * Use GetExtensionMime for extraction methods; minor fixes. * Remove tika-parsers classifier * Remove most filtering by file extension from binary extraction methods; add CSV/TSV special cases. * Fix GetExtensionMime case where URL has no extension but a MIME type is detected * Insert `toLowerCase` into `getUrl.endsWith()` calls in io.archivesunleashed.packages; apply to `FilenameUtils.getExtension` in `GetExtensionMime`. * Remove filtering on URL for audio, video, and images. * Remove filtering on URL for images; add DF fields to image extraction * Remove saveImageToDisk and its test * Remove robots.txt check and extraneous imports * Close files so we don't get too many files open again. * Add GetExtensionMimeTest * Resolve #343
-
Add keep and discard by http status. (#347)
- Add keep and discard by http status RecordLoader - Add tests - Clean up/add doc comments in RecordLoader - Resolve #315
-
Add office document binary extraction. (#346)
- Add Word Processor DF and binary extraction - Add Spreadsheets DF and binary extraction - Add Presentation Program DF and binary extraction - Add Text files DF and binary extraction - Add tests for new DF and binary extractions - Add test fixtures for new DF and binary extractions - Resolves #303 - Resolves #304 - Resolves #305 - Use aut-resources repo to distribute our shaded tika-parsers 1.22 - Close TikaInputStream - Add RDD filters on MimeTypeTika values - Add CodeCov configuration yaml - Includes work by @jrwiebe, see #346 for all commits before squash
-
Use version of tika-parsers without a classifier. (#345)
Ivy couldn't handle it, and specifying one for the custom tika-parsers artifact was unnecessary.
-
Add audio & video binary extraction (#341)
- Add Audio & Video binary extraction. - Add filename, and extenstion column to audio, pdf, and video DF - Pass binary bytes instread of string to DetectMimeTypeTika in DF (s/getContentString/getBinaryBytes) - Updates saveToDisk to use file extension from DF column - Adds tests for Audio, PDF, and Video DF extraction - Add test fixtures for Audio, PDF, and Video DF extraction - Rename SaveBytesTest to SaveImageBytes test - Eliminate bytes->string->bytes conversion that was causing data loss in DetectMimeTypeTika - Update tika-parsers dep from JitPack - Remove tweet cruft - Resolves #306 - Resolves #307 - Includes work by @jrwiebe, see #341 for all commits before squash
-
Add PDF binary extraction. (#340)
Introduces the new extractPDFDetailsDF() method and brings in changes to make our use of Tika's MIME type detection more efficient, as well as POM updates to use a shaded version of tika-parsers in order to eliminate a dependency version conflict that has long been troublesome. - Updates getImageBytes to getBinaryBytes - Refactor SaveImage class to more general SaveBytes, and saveToDisk to saveImageToDisk - Only instantiate Tika when the DetectMimeTypeTika singleton object is first referenced. See https://git.io/fj7g0. - Use TikaInputStream to enabler container-aware detection. Until now we were only using the default Mime Magic detection. See https://tika.apache.org/1.22/detection.html#Container_Aware_Detection. - Added generic saveToDisk method to save a bytes column of a DataFrame to files - Updates tests - Resolves #302 - Further addresses #308 - Includes work by @ruebot, see #340 for all commits before squash
-
More scalastyle work; addresses #196. (#339)
- Remove all underscore imports, except shapeless - Address all scalastyle warnings - Update scalastyle config for magic numbers, and null (only used in tests)
-
-
Update Tika to 1.22; address security alerts. (#337)
- Update Tika to 1.22 - pom.xml surgery to get aut to build again with --packages
-
Python formatting, and gitignore additions. (#326)
- Run black and isort on Python files. - Move Spark config to example file. - Update gitignore for 7a61f0e additions.
-
Makes ArchiveRecordImpl serializable by removing non-serializable ARC…
…Record and WARCRecord variables. Also removes unused headerResponseFormat variable. (#316)
-
Resolve cobertura-maven-plugin class issue; resolves #313. (#314)
- Exclude slf4j binding logback-classic (mojohaus/cobertura-maven-plugin#6 (comment))
-
Update spark-core_2.11 to 2.3.1. (#312)
- CVE-2018-8024 https://nvd.nist.gov/vuln/detail/CVE-2018-8024 - CVE-2018-1334 https://nvd.nist.gov/vuln/detail/CVE-2018-1334 - CVE-2018-17190 https://nvd.nist.gov/vuln/detail/CVE-2018-17190 - CVE-2018-11770 https://nvd.nist.gov/vuln/detail/CVE-2018-11770
-
Add .getHttpStatus and .getArchiveFile to ArchiveRecordImpl class #198 …
…& #164 (#292) * Resolves #198 * Resolves #164 * Add getHttpStatus to ArchiveRecord class & trait - add .getHttpStatus to potential outputs - add tests for .getHttpStatus calls - improve ArchiveRecord testing overall. * Add .getArchiveFile feature to ArchiveRecordImpl. - add getArchiveFile to trait - add getArchiveFile for ArchiveRecordImpl - add tests for getArchiveFile. * Other code style fixes. * Include updates to tests.
-
-
Change Id generation for graphs from using hashes for urls to using .…
…zipWithUniqueIds() (#289) * Resolves #243 * Create GEXF with proper ids instead of hash to avoid collisions. * Add WriteGEXF files. * Add WriteGraph file and test. * Add test for Graphml output. * Add xml escaping for edges. * Add test case for non-escaped edges. * Add additional tests to cover for more potential cases of graphml and gexf files. * Coverage for null cases in urls.