Tree: 8eb43ff055
-
UDF API implementations for DataFrame (#391)
- Add discardMimeTypesDF
- Add discardDateDF
- Add discardUrlsDF
- Add discardDomainsDF
- Update tests
- Addresses #223
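These discard*DF helpers all reduce to the same shape: a predicate that rejects rows whose column value falls in a user-supplied set, wrapped as a Spark UDF. A minimal plain-Scala sketch of that predicate (the object and method names are hypothetical, not the toolkit's actual API):

```scala
// Sketch of the filtering logic behind a discardMimeTypesDF-style UDF.
// The toolkit registers logic like this as a Spark UDF; here it is a
// plain function so the idea stands alone.
object DiscardByValue {
  // Returns true when the row should be kept, i.e. its value is NOT
  // in the discard set.
  def keep(value: String, discard: Set[String]): Boolean =
    !discard.contains(value)
}
```

The same predicate serves discardMimeTypesDF, discardUrlsDF, and discardDomainsDF; only the column it is applied to changes.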
-
Add Serializable APIs for DataFrames (#389)
- Add keepValidPagesDF
- Add HTTP status code column to all()
- Add test for keepValidPagesDF
- Addresses #223
-
Add and update tests, resolve textFiles bug. (#388)
- Add ExtractDateDF test
- Fix conditional logic of the textFiles filter to resolve #390
- Add test for the conditional logic fix for #390
- Remove cruft ExtractUrls left over from the Twitter analysis removal (see: https://github.com/lintool/warcbase/blob/cab311ed8b0bceb666865fa76dd3bc5a86402e13/warcbase-core/src/test/scala/org/warcbase/spark/matchbox/ExtractUrlsTest.scala)
- Tweak null/nothing handling in a few tests
-
Add new DataFrame matchbox udfs (#387)
- Add DetectLanguageDF
- Add ExtractBoilerpipeTextDF
- Add ExtractDateDF
- Update tests
- Rename existing ExtractDate, ExtractBoilerpipeText, and DetectLanguage UDFs by appending RDD
- Partially addresses #223
-
Add "Extract popular images" DataFrame implementation (#382).
- Add tests for ExtractPopularImagesDF
- Rename ExtractPopularImages to ExtractPopularImagesRDD
- Addresses #223
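Conceptually, "popular images" means ranking image URLs by how often they occur across the collection. A small stand-in for that aggregation (names are illustrative; the real ExtractPopularImagesDF operates on Spark DataFrames):

```scala
object PopularImagesSketch {
  // Count occurrences of each image URL and return the n most frequent,
  // highest count first.
  def top(urls: Seq[String], n: Int): Seq[(String, Int)] =
    urls.groupBy(identity)
      .map { case (url, hits) => (url, hits.size) }
      .toSeq
      .sortBy { case (_, count) => -count }
      .take(n)
}
```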
-
Add all() method and refactor DF UDFs (#383).
- Add `all()` DataFrame method
- Refactor fixity DataFrame UDFs
- Add ComputeImageSize UDF
- Add Python implementation of `all()`
- Addresses #223
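The fixity UDFs (ComputeMD5, ComputeSHA1) boil down to hashing a record's payload bytes and hex-encoding the digest. A self-contained sketch using the JDK's MessageDigest (the helper name is illustrative, not the toolkit's):

```scala
import java.security.MessageDigest

object FixitySketch {
  // Hex-encoded digest of a payload; "MD5" and "SHA-1" cover the two
  // fixity UDFs mentioned above.
  def digest(bytes: Array[Byte], algorithm: String): String =
    MessageDigest.getInstance(algorithm)
      .digest(bytes)
      .map("%02x".format(_))
      .mkString
}
```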
-
Rename pages() to webpages(). (#384)
- Part of work on #233
-
Append UDF with RDD or DF. (#381)
- Addresses #223
-
Extend more Matchbox utilities to DataFrames (#380).
- Extend GetExtensionMime, ExtractImageLinks, ComputeMD5, and ComputeSHA1 to DataFrames
- Addresses #223
-
Finalize converting NER Classifier to WANE Format (#378).
- Fully resolves #297
- Overrides NER Classifier output: PERSON -> persons, LOCATION -> locations, ORGANIZATION -> organizations
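The override described above is essentially a relabeling of Stanford NER's entity classes to WANE's lower-case plural field names. Sketched as a plain lookup (the object name and fallback behavior are assumptions):

```scala
object WaneLabelsSketch {
  // WANE output uses lower-case plural field names for the three
  // entity classes the classifier emits.
  private val mapping = Map(
    "PERSON"       -> "persons",
    "LOCATION"     -> "locations",
    "ORGANIZATION" -> "organizations")

  // Fall back to the original label for any class outside the mapping.
  def toWane(nerLabel: String): String =
    mapping.getOrElse(nerLabel, nerLabel)
}
```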
-
Tweaks the style of the license badge to look consistent with the other badges.
-
Align NER output to WANE format; addresses #297 (#361)
- Update Stanford CoreNLP
- Format NER output in JSON
- Add getPayloadDigest to ArchiveRecord
- Add test for getPayloadDigest
- Add payload digest to NER output
- Remove extractFromScrapeText
- Remove extractFromScrapeText test
- TODO: PERSON -> persons, LOCATION -> locations, ORGANIZATION -> organizations (involves writing a new class or overriding NER output)
-
Various UDF implementation and cleanup for DF. (#370)
- Replace ExtractBaseDomain with ExtractDomain; closes #367
- Address bug in ArcTest: RemoveHTML -> RemoveHttpHeader; closes #369
- Wrap RemoveHttpHeader and RemoveHTML for use in DataFrames; partially addresses #238
- Update tests where necessary
- Punt on #368 UDF CaMeL cASe consistency issues
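ExtractDomain's job is to pull the host out of a URL string while degrading gracefully on malformed input. A minimal stand-in built on java.net.URL (the toolkit's real implementation may differ in edge-case handling):

```scala
import java.net.URL

object ExtractDomainSketch {
  // Host portion of a URL, or the empty string when the URL cannot
  // be parsed (e.g. missing protocol).
  def apply(url: String): String =
    try { new URL(url).getHost } catch { case _: Exception => "" }
}
```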
-
Update keepValidPages to include a filter on 200 OK. (#360)
- Add status code filter to keepValidPages
- Add MimeTypeTika to valid pages DF
- Update tests to reflect the expanded filtering
- Resolves #359
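With this change, a "valid page" must satisfy both conditions: a 200 OK status and an HTML MIME type. A sketch of the combined predicate (the parameter names and the exact MIME check are assumptions, not the toolkit's precise logic):

```scala
object ValidPagesSketch {
  // Keep a record only if the crawl recorded a 200 OK response and
  // Tika identified the content as HTML.
  def keep(httpStatus: String, mimeTypeTika: String): Boolean =
    httpStatus == "200" && mimeTypeTika == "text/html"
}
```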
-
[skip travis] Update links (#357)
ruebot committed Aug 27, 2019
-
Add discardLanguage filter to RecordLoader. (#353)
- Clean up doc comments
- Add test
- Resolves #352
-
- Add tests for a few more filters in RecordLoader
- Add binary extraction DataFrameLoader tests
-
[maven-release-plugin] prepare release aut-0.18.0
ruebot committed Aug 21, 2019
-
Update LICENSE and license headers. (#351)
- Update LICENSE file to the full Apache 2.0 license
- Reconfigure license-maven-plugin
- Update all license headers in Java and Scala files to include the copyright year and project name
- Move LICENSE_HEADER.txt to config
- Update scalastyle config
-
Add method for determining binary file extension. (#349)
This PR implements the strategy described in the discussion of the above issue: derive an extension for a file described by a URL and a MIME type. It creates a GetExtensionMime object in the matchbox. The PR also removes most of the filtering by URL from the image, audio, video, presentation, spreadsheet, and word processor document extraction methods, since these were returning false positives. (CSV and TSV files are a special case, since Tika detects them as "text/plain" based on content.) Finally, it inserts toLowerCase into the getUrl.endsWith() filter tests, which could bring in some more CSV and TSV files.

- Add method for getting a file extension from a MIME type
- Add getExtensions method to DetectMimeTypeTika
- Add matchbox object to get the extension of a URL
- Use GetExtensionMime for extraction methods; minor fixes
- Remove tika-parsers classifier
- Remove most filtering by file extension from binary extraction methods; add CSV/TSV special cases
- Fix GetExtensionMime case where a URL has no extension but a MIME type is detected
- Insert `toLowerCase` into `getUrl.endsWith()` calls in io.archivesunleashed.packages; apply to `FilenameUtils.getExtension` in `GetExtensionMime`
- Remove filtering on URL for audio, video, and images; add DF fields to image extraction
- Remove saveImageToDisk and its test
- Remove robots.txt check and extraneous imports
- Close files so we don't get too many open files again
- Add GetExtensionMimeTest
- Resolve #343
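The strategy above reads as: trust an extension present in the URL path first, and only fall back to mapping the detected MIME type to a default extension. A compact sketch of that decision (the MIME table is a tiny illustrative stand-in for Tika's registry, and the "unknown" fallback is an assumption):

```scala
object GetExtensionMimeSketch {
  // Tiny stand-in for Tika's MIME-type registry.
  private val mimeToExt = Map(
    "image/jpeg"      -> "jpg",
    "application/pdf" -> "pdf",
    "text/plain"      -> "txt")

  // Prefer the extension embedded in the URL path; otherwise derive
  // one from the detected MIME type.
  def apply(url: String, mimeType: String): String = {
    val path = url.takeWhile(c => c != '?' && c != '#')
    val name = path.substring(path.lastIndexOf('/') + 1)
    val dot  = name.lastIndexOf('.')
    if (dot >= 0 && dot < name.length - 1) name.substring(dot + 1).toLowerCase
    else mimeToExt.getOrElse(mimeType, "unknown")
  }
}
```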
-
Add keep and discard by HTTP status. (#347)
- Add keep and discard by HTTP status to RecordLoader
- Add tests
- Clean up and add doc comments in RecordLoader
- Resolves #315
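The keep and discard filters are complementary set-membership tests over the recorded status code. A plain-Scala sketch (the names mirror the described feature, but the signatures are assumptions):

```scala
object HttpStatusFilterSketch {
  // Predicate for a keep-by-status filter: true when the record survives.
  def keep(status: String, wanted: Set[String]): Boolean =
    wanted.contains(status)

  // Predicate for a discard-by-status filter: true when the record survives.
  def discard(status: String, unwanted: Set[String]): Boolean =
    !unwanted.contains(status)
}
```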
-
Add office document binary extraction. (#346)
- Add Word Processor DF and binary extraction
- Add Spreadsheets DF and binary extraction
- Add Presentation Program DF and binary extraction
- Add Text files DF and binary extraction
- Add tests for new DF and binary extractions
- Add test fixtures for new DF and binary extractions
- Resolves #303
- Resolves #304
- Resolves #305
- Use aut-resources repo to distribute our shaded tika-parsers 1.22
- Close TikaInputStream
- Add RDD filters on MimeTypeTika values
- Add CodeCov configuration yaml
- Includes work by @jrwiebe; see #346 for all commits before squash
-
Use version of tika-parsers without a classifier. (#345)
Ivy couldn't handle it, and specifying one for the custom tika-parsers artifact was unnecessary.