SinghGursimran / aut
forked from archivesunleashed/aut
Tree: 8f1a9f10e0
-
Add img alt text to imagegraph(); resolves archivesunleashed#420. (archivesunleashed#422)
ruebot committed Feb 10, 2020
- Update ExtractImageLinksRDD to grab alt text - Add alt_text column to imagegraph - Update tests
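For readers unfamiliar with the change, the idea of collecting alt text alongside each image link can be sketched as below. This is a minimal illustration using Python's standard `html.parser`, not the project's Scala code; `ImgAltCollector` and `extract_image_links` are hypothetical names introduced here, and recording a missing alt attribute as an empty string is an assumption.

```python
from html.parser import HTMLParser


class ImgAltCollector(HTMLParser):
    """Collects (src, alt) pairs from <img> tags (hypothetical helper, not part of aut)."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        src = attr_map.get("src")
        if src:
            # Assumption: a missing or valueless alt attribute becomes "".
            self.images.append((src, attr_map.get("alt") or ""))


def extract_image_links(html):
    """Return a list of (image URL, alt text) tuples found in the HTML string."""
    collector = ImgAltCollector()
    collector.feed(html)
    collector.close()
    return collector.images
```

Each tuple would correspond to one row, with the second element playing the role of the new alt_text column.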
-
Rename imageLinks to imagegraph; resolves archivesunleashed#419 (archivesunleashed#421)
ruebot committed Feb 10, 2020
-
Need --repositories flag with --packages. (archivesunleashed#417)
ruebot committed Feb 6, 2020
- Fully resolves this issue archivesunleashed/docker-aut#19 - archivesunleashed/docker-aut@37ce4e2 - archivesunleashed/docker-aut@082907a - archivesunleashed/docker-aut@baee431
-
[maven-release-plugin] prepare for next development iteration
ruebot committed Feb 5, 2020
-
[maven-release-plugin] prepare release aut-0.50.0
ruebot committed Feb 5, 2020
-
Clean up test descriptions, addresses archivesunleashed#372. (archivesunleashed#416)
- Clean up test descriptions - Rename typo filename
-
Add crawl_date to binary DataFrames and imageLinks. (archivesunleashed#414)
- Resolves archivesunleashed#413 - Update tests where necessary
-
Various DataFrame implementation updates for documentation clean-up; Addresses archivesunleashed#372.
- Rename .all() column HttpStatus to http_status_code - Adds archive_filename to .all() - Significant README updates for setup - See also: archivesunleashed/aut-docs#39
-
Use https for maven repo. (archivesunleashed#405)
- Looks like repos are forcing https to be used now: [WARNING] repository metadata for: 'artifact joda-time:joda-time' could not be retrieved from repository: maven due to an error: Failed to transfer file: http://repo.maven.apache.org/maven2/joda-time/joda-time/maven-metadata.xml. Return code is: 501 , ReasonPhrase:HTTPS Required.
-
Test clean-up. (archivesunleashed#404)
- Clean-up variable names in RecordDFTest.scala - Remove dos line endings on a number of files
-
Add more DataFrame Implementation Serializable APIs (archivesunleashed#401).
- Partially addresses archivesunleashed#223 - Add discardContentDF - Add discardUrlPatternsDF - Add discardLanguagesDF - Add keepImagesDF - Add keepContentDF - Add keepUrlPatternsDF - Add keepLanguagesDF - Update tests
-
Add more DF implementations for archivesunleashed#223. (archivesunleashed#399)
- Add discardHttpStatusDF - Add keepMimeTypesDF - Add keepMimeTypesTikaDF - Update tests
-
Add more serializable APIs for DataFrames (archivesunleashed#396)
- Partially address archivesunleashed#223 - Add keepHttpStatusDF - Add keepDateDF - Add keepUrlsDF - Add keepDomainsDF - Add tests
-
Add additional filters for textFiles; resolves archivesunleashed#362. (archivesunleashed#393)
- Add filedesc and dns filters (arc files) - Add test case
-
udf API implementations for DataFrame (archivesunleashed#391)
- add discardMimeTypesDF - add discardDateDF - add discardUrlsDF - add discardDomainsDF - update tests - addresses archivesunleashed#223
-
Add Serializable APIs for DataFrames (archivesunleashed#389)
- Add keepValidPagesDF - Add HTTP status code column to all() - Add test for keepValidPagesDF - Addresses archivesunleashed#223
-
Add and update tests, resolve textFiles bug. (archivesunleashed#388)
- Add ExtractDateDF test - Fix conditional logic of textFiles filter to resolve archivesunleashed#390 - Add test for conditional logic fix for archivesunleashed#390 - Remove cruft ExtractUrls, left over from Twitter analysis removal (see: https://github.com/lintool/warcbase/blob/cab311ed8b0bceb666865fa76dd3bc5a86402e13/warcbase-core/src/test/scala/org/warcbase/spark/matchbox/ExtractUrlsTest.scala) - Tweak null/nothing on a few tests
-
Add new DataFrame matchbox udfs (archivesunleashed#387)
- Add DetectLanguageDF - Add ExtractBoilerpipeTextDF - Add ExtractDateDF - Update tests - Rename existing ExtractDate, ExtractBoilerpipeText, DetectLanguage udfs by appending RDD - Partially addresses archivesunleashed#223
-
Add "Extract popular images" DataFrame implementation (archivesunleashed#382).
- Add tests for ExtractPopularImagesDF - Rename ExtractPopularImages to ExtractPopularImagesRDD - Addresses archivesunleashed#223
-
Add all() method and refactor DF UDFs (archivesunleashed#383).
- Add `all()` DataFrame method - Refactor fixity DataFrame UDFs - Add ComputeImageSize UDF - Add Python implementation of `all()` - Addresses archivesunleashed#223
-
Extend more Matchbox utilities to DataFrames (archivesunleashed#380).
- Extend GetExtensionMime, ExtractImageLinks, ComputeMD5, and ComputeSHA1 to DataFrames - Addresses archivesunleashed#223
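As a rough illustration of what the fixity helpers named above compute, the sketch below produces hex digests with Python's standard `hashlib`. The function names `compute_md5` and `compute_sha1` are hypothetical stand-ins chosen for this example; the project's own ComputeMD5 and ComputeSHA1 signatures are not shown in this log and may differ.

```python
import hashlib


def compute_md5(data: bytes) -> str:
    """Return the MD5 hex digest of raw record bytes (illustrative only)."""
    return hashlib.md5(data).hexdigest()


def compute_sha1(data: bytes) -> str:
    """Return the SHA-1 hex digest of raw record bytes (illustrative only)."""
    return hashlib.sha1(data).hexdigest()
```

Wrapped as column functions, helpers of this kind would let a DataFrame carry one checksum per binary record.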
-
Finalize converting NER Classifier to WANE Format (archivesunleashed#378).
- Fully resolves archivesunleashed#297 - Overrides NER Classifier output to PERSON -> persons, LOCATION -> locations, ORGANIZATION -> organizations
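The label override described above amounts to a small key mapping. The sketch below shows it in Python under the assumption that classifier output arrives as a dict of label to entity list; `WANE_KEYS` and `to_wane_keys` are hypothetical names, and passing unmapped labels through unchanged is an assumption rather than documented behavior.

```python
# Mapping taken from the commit message: NER label -> WANE key.
WANE_KEYS = {
    "PERSON": "persons",
    "LOCATION": "locations",
    "ORGANIZATION": "organizations",
}


def to_wane_keys(entities):
    """Rename NER labels to WANE keys; unmapped labels pass through (assumption)."""
    return {WANE_KEYS.get(label, label): values for label, values in entities.items()}
```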
-
Add df ExtractLinks udf; resolves archivesunleashed#238. (archivesunleashed#377)
- Add df ExtractLinks udf - Add test
-
Update README.md (archivesunleashed#376)
Tweaks the style of the license badge to look consistent with the other badges.