-
Clean up test descriptions, addresses #372. (#416)
- Clean up test descriptions - Rename typo filename
-
Add ExtractImageDetailsDF. (#415)
- Add test - Addresses #223
-
Add crawl_date to binary DataFrames and imageLinks. (#414)
- Resolves #413 - Update tests where necessary
-
Various DataFrame implementation updates for documentation clean-up; Addresses #372.
- Rename .all() column HttpStatus to http_status_code - Adds archive_filename to .all() - Significant README updates for setup - See also: archivesunleashed/aut-docs#39
-
Use https for maven repo. (#405)
- Looks like repos are forcing https to be used now: [WARNING] repository metadata for: 'artifact joda-time:joda-time' could not be retrieved from repository: maven due to an error: Failed to transfer file: http://repo.maven.apache.org/maven2/joda-time/joda-time/maven-metadata.xml. Return code is: 501 , ReasonPhrase:HTTPS Required.
-
- Clean up variable names in RecordDFTest.scala - Remove DOS line endings on a number of files
-
Add more DataFrame Implementation Serializable APIs (#401).
- Partially addresses #223 - Add discardContentDF - Add discardUrlPatternsDF - Add discardLanguagesDF - Add keepImagesDF - Add keepContentDF - Add keepUrlPatternsDF - Add keepLanguagesDF - Update tests
-
Add more DF implementations for #223. (#399)
- Add discardHttpStatusDF - Add keepMimeTypesDF - Add keepMimeTypesTikaDF - Update tests
-
Add more serializable APIs for DataFrames (#396)
- Partially address #223 - Add keepHttpStatusDF - Add keepDateDF - Add keepUrlsDF - Add keepDomainsDF - Add tests
-
udf API implementations for DataFrame (#391)
- add discardMimeTypesDF - add discardDateDF - add discardUrlsDF - add discardDomainsDF - update tests - addresses #223
-
Add Serializable APIs for DataFrames (#389)
- Add keepValidPagesDF - Add HTTP status code column to all() - Add test for keepValidPagesDF - Addresses #223
-
Add and update tests, resolve textFiles bug. (#388)
- Add ExtractDateDF test - Fix conditional logic of textFiles filter to resolve #390 - Add test for conditional logic fix for #390 - Remove cruft ExtractUrls, left over from Twitter analysis removal (see: https://github.com/lintool/warcbase/blob/cab311ed8b0bceb666865fa76dd3bc5a86402e13/warcbase-core/src/test/scala/org/warcbase/spark/matchbox/ExtractUrlsTest.scala) - Tweak null/nothing on a few tests
-
Add new DataFrame matchbox udfs (#387)
- Add DetectLanguageDF - Add ExtractBoilerpipeTextDF - Add ExtractDateDF - Update tests - Rename existing ExtractDate, ExtractBoilerpipeText, DetectLanguage udfs by appending RDD - Partially addresses #223
-
Add "Extract popular images" DataFrame implementation (#382).
- Add tests for ExtractPopularImagesDF - Rename ExtractPopularImages to ExtractPopularImagesRDD - Addresses #223
-
Add all() method and refactor DF UDFs (#383).
- Add `all()` DataFrame method - Refactor fixity DataFrame UDFs - Add ComputeImageSize UDF - Add Python implementation of `all()` - Addresses #223
-
Rename pages() to webpages(). (#384)
- Part of work on #233
-
Append UDF with RDD or DF. (#381)
- Addresses #223
-
Extend more Matchbox utilities to DataFrames (#380).
- Extend GetExtensionMime, ExtractImageLinks, ComputeMD5, and ComputeSHA1 to DataFrames - Addresses #223
-
Finalize converting NER Classifier to WANE Format (#378).
- Fully resolves #297 - Overrides NER Classifier output to PERSON -> persons, LOCATION -> locations, ORGANIZATION -> organizations
-
Tweaks the style of the license badge to look consistent with the other badges.
-
Align NER output to WANE format; addresses #297 (#361)
- Update Stanford core NLP - Format NER output in json - Add getPayloadDigest to ArchiveRecord - Add test for getPayloadDigest - Add payload digest to NER output - Remove extractFromScrapeText - Remove extractFromScrapeText test - TODO: PERSON -> persons, LOCATION -> locations, ORGANIZATION -> organizations (involves writing a new class or overriding NER output 🤢)
-
Various UDF implementation and cleanup for DF. (#370)
- Replace ExtractBaseDomain with ExtractDomain - Closes #367 - Address bug in ArcTest; RemoveHTML -> RemoveHttpHeader - Closes #369 - Wraps RemoveHttpHeader and RemoveHTML for use in data frames. - Partially addresses #238 - Updates tests where necessary - Punts on #368 UDF CaMeL cASe consistency issues