Tree: 8f1a9f10e0
Commits on Feb 10, 2020
  1. Add img alt text to imagegraph(); resolves archivesunleashed#420. (archivesunleashed#422)

    ruebot committed Feb 10, 2020
    
    - Update ExtractImageLinksRDD to grab alt text
    - Add alt_text column to imagegraph
    - Update tests
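As a rough illustration of what "grab alt text" involves, here is a minimal Python sketch that collects `(src, alt)` pairs from `<img>` tags using only the standard library. The names `ImgAltParser` and `extract_image_links` are hypothetical and chosen for this example; this is not the toolkit's ExtractImageLinksRDD implementation, and it assumes a missing `alt` attribute should be reported as an empty string.

```python
from html.parser import HTMLParser


class ImgAltParser(HTMLParser):
    """Collect (src, alt) pairs for every <img> tag that has a src."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags (<img ... />) are also routed here by default.
        if tag == "img":
            attr = dict(attrs)
            src = attr.get("src")
            if src:
                # Assumption: a missing or valueless alt becomes "".
                self.images.append((src, attr.get("alt") or ""))


def extract_image_links(html):
    """Hypothetical helper: return a list of (src, alt) tuples."""
    parser = ImgAltParser()
    parser.feed(html)
    return parser.images
```

Called on a page containing `<img src="a.png" alt="A cat">`, the helper would return one tuple pairing the image source with its alt text.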
  2. Rename imageLinks to imagegraph; resolves archivesunleashed#419 (archivesunleashed#421)

    ruebot committed Feb 10, 2020
    
Commits on Jan 21, 2020
  1. Clean up test descriptions, addresses archivesunleashed#372. (archivesunleashed#416)

    ruebot authored and ianmilligan1 committed Jan 21, 2020
    
    - Clean up test descriptions
    - Rename typo filename
  2. Add ExtractImageDetailsDF. (archivesunleashed#415)

    SinghGursimran authored and ruebot committed Jan 21, 2020
    - Add test
    - Addresses archivesunleashed#223
Commits on Jan 18, 2020
  1. Add crawl_date to binary DataFrames and imageLinks. (archivesunleashed#414)

    ruebot authored and ianmilligan1 committed Jan 18, 2020
    
    - Resolves archivesunleashed#413
    - Update tests where necessary
Commits on Jan 17, 2020
  1. Various DataFrame implementation updates for documentation clean-up; Addresses archivesunleashed#372.

    ruebot authored and ianmilligan1 committed Jan 17, 2020
    
    - Rename .all() column HttpStatus to http_status_code
    - Adds archive_filename to .all()
    - Significant README updates for setup
    - See also: archivesunleashed/aut-docs#39
Commits on Jan 16, 2020
  1. Use https for maven repo. (archivesunleashed#405)

    ruebot authored and ianmilligan1 committed Jan 16, 2020
    - Looks like repos are forcing https to be used now:
    [WARNING] repository metadata for: 'artifact joda-time:joda-time' could not be retrieved from repository: maven due to an error: Failed to transfer file: http://repo.maven.apache.org/maven2/joda-time/joda-time/maven-metadata.xml. Return code is: 501 , ReasonPhrase:HTTPS Required.
Commits on Jan 13, 2020
  1. Test clean-up. (archivesunleashed#404)

    ruebot authored and ianmilligan1 committed Jan 13, 2020
    - Clean-up variable names in RecordDFTest.scala
    - Remove dos line endings on a number of files
Commits on Jan 10, 2020
  1. Add more DataFrame Implementation Serializable APIs (archivesunleashed#401).

    SinghGursimran authored and ruebot committed Jan 10, 2020
    
    - Partially addresses archivesunleashed#223
    - Add discardContentDF
    - Add discardUrlPatternsDF
    - Add discardLanguagesDF
    - Add keepImagesDF
    - Add keepContentDF
    - Add keepUrlPatternsDF
    - Add keepLanguagesDF
    - Update tests
Commits on Jan 7, 2020
  1. Add more DF implementations for archivesunleashed#223. (archivesunleashed#399)

    SinghGursimran authored and ruebot committed Jan 7, 2020
    
    - Add discardHttpStatusDF
    - Add keepMimeTypesDF
    - Add keepMimeTypesTikaDF
    - Update tests
Commits on Dec 29, 2019
  1. Add more serializable APIs for DataFrames (archivesunleashed#396)

    SinghGursimran authored and ruebot committed Dec 29, 2019
    - Partially addresses archivesunleashed#223
    - Add keepHttpStatusDF
    - Add keepDateDF
    - Add keepUrlsDF
    - Add keepDomainsDF
    - Add tests
Commits on Dec 18, 2019
  1. Add additional filters for textFiles; resolves archivesunleashed#362. (archivesunleashed#393)

    ruebot authored and ianmilligan1 committed Dec 18, 2019
    
    - Add filedesc and dns filters (ARC files)
    - Add test case
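A filter of this kind can be pictured as a URL-prefix check. The Python sketch below is only an illustration under the assumption that the unwanted records are those whose URL begins with `filedesc:` or `dns:`; the names `EXCLUDED_PREFIXES` and `is_content_record` are hypothetical and do not come from the toolkit.

```python
# Assumption: ARC header records and DNS lookups are identified by URL prefix.
EXCLUDED_PREFIXES = ("filedesc:", "dns:")


def is_content_record(url: str) -> bool:
    """Return True when the URL does not look like a filedesc or dns record."""
    return not url.lower().startswith(EXCLUDED_PREFIXES)


def filter_urls(urls):
    """Keep only URLs that pass the prefix check, preserving order."""
    return [u for u in urls if is_content_record(u)]
```

Matching on a lowercased URL is a design choice made for this sketch so that `DNS:` and `dns:` are treated alike.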
Commits on Dec 17, 2019
  1. udf API implementations for DataFrame (archivesunleashed#391)

    SinghGursimran authored and ruebot committed Dec 17, 2019
    - Add discardMimeTypesDF
    - Add discardDateDF
    - Add discardUrlsDF
    - Add discardDomainsDF
    - Update tests
    - addresses archivesunleashed#223
  2. Add Serializable APIs for DataFrames (archivesunleashed#389)

    SinghGursimran authored and ruebot committed Dec 17, 2019
    - Add keepValidPagesDF
    - Add HTTP status code column to all()
    - Add test for keepValidPagesDF
    - Addresses archivesunleashed#223
  3. Add and update tests, resolve textFiles bug. (archivesunleashed#388)

    ruebot authored and ianmilligan1 committed Dec 17, 2019
    - Add ExtractDateDF test
    - Fix conditional logic of textFiles filter to resolve archivesunleashed#390
    - Add test for conditional logic fix for archivesunleashed#390
    - Remove cruft ExtractUrls, left over from Twitter analysis removal (see: https://github.com/lintool/warcbase/blob/cab311ed8b0bceb666865fa76dd3bc5a86402e13/warcbase-core/src/test/scala/org/warcbase/spark/matchbox/ExtractUrlsTest.scala)
    - Tweak null/nothing on a few tests
Commits on Dec 5, 2019
  1. Add new DataFrame matchbox udfs (archivesunleashed#387)

    SinghGursimran authored and ruebot committed Dec 5, 2019
    - Add DetectLanguageDF
    - Add ExtractBoilerpipeTextDF
    - Add ExtractDateDF
    - Update tests
    - Rename existing ExtractDate, ExtractBoilerpipeText, DetectLanguage udfs by appending RDD
    - Partially addresses archivesunleashed#223
Commits on Nov 21, 2019
  1. Add "Extract popular images" DataFrame implementation (archivesunleashed#382).

    SinghGursimran authored and ruebot committed Nov 21, 2019
    
    - Add tests for ExtractPopularImagesDF
    - Rename ExtractPopularImages to ExtractPopularImagesRDD
    - Addresses archivesunleashed#223
  2. Add all() method and refactor DF UDFs (archivesunleashed#383).

    SinghGursimran authored and ruebot committed Nov 21, 2019
    - Add `all()` DataFrame method 
    - Refactor fixity DataFrame UDFs
    - Add ComputeImageSize UDF
    - Add Python implementation of `all()`
    - Addresses archivesunleashed#223
Commits on Nov 18, 2019
  1. Extend more Matchbox utilities to DataFrames (archivesunleashed#380).

    SinghGursimran authored and ruebot committed Nov 18, 2019
    - Extend GetExtensionMime, ExtractImageLinks, ComputeMD5, and ComputeSHA1 to DataFrames
    - Addresses archivesunleashed#223
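For readers unfamiliar with fixity checksums, the following Python sketch shows what MD5 and SHA-1 digests of a record's bytes look like. It uses the standard `hashlib` module; the function names `compute_md5` and `compute_sha1` are hypothetical stand-ins for this example and are not the toolkit's ComputeMD5 or ComputeSHA1 code.

```python
import hashlib


def compute_md5(data: bytes) -> str:
    """Return the hexadecimal MD5 digest of the given bytes."""
    return hashlib.md5(data).hexdigest()


def compute_sha1(data: bytes) -> str:
    """Return the hexadecimal SHA-1 digest of the given bytes."""
    return hashlib.sha1(data).hexdigest()
```

Two files with identical bytes produce identical digests, which is what makes these values useful for spotting duplicate binaries in a collection.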
Commits on Nov 14, 2019
  1. Finalize converting NER Classifier to WANE Format (archivesunleashed#378).

    SinghGursimran authored and ruebot committed Nov 14, 2019
    
    - Fully resolves archivesunleashed#297 
    - Overrides NER Classifier output to PERSON -> persons, LOCATION -> locations, ORGANIZATION -> organizations
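The label override listed above amounts to a key rename. A minimal Python sketch, assuming the classifier output arrives as a dict keyed by upper-case label; `WANE_KEYS` and `to_wane` are hypothetical names used for illustration and are not part of the toolkit:

```python
# Hypothetical illustration of the label override; not the toolkit's code.
WANE_KEYS = {
    "PERSON": "persons",
    "LOCATION": "locations",
    "ORGANIZATION": "organizations",
}


def to_wane(entities):
    """Rename NER labels to WANE-style keys, dropping any other labels."""
    return {
        WANE_KEYS[label]: values
        for label, values in entities.items()
        if label in WANE_KEYS
    }
```

Dropping labels outside the three listed ones is an assumption of this sketch; the commit message does not say how other labels are handled.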
Commits on Nov 12, 2019
  1. Add df ExtractLinks udf; resolves archivesunleashed#238. (archivesunleashed#377)

    SinghGursimran authored and ruebot committed Nov 12, 2019
    
    - Add df ExtractLinks udf
    - Add test
Commits on Nov 10, 2019
  1. Update README.md (archivesunleashed#376)

    lintool authored and ruebot committed Nov 10, 2019
    Tweaks the style of the license badge to look consistent with the other badges.