Tree: 9ce73a894a
Commits on Feb 20, 2020
  1. Update Spark and Hadoop versions. (#426)

    ruebot committed Feb 20, 2020
    - Update Spark to 2.4.5
    - Update Hadoop to 2.7.4 (for RADOS/S3 support)
    - Tweak README
Commits on Feb 12, 2020
  1. Add logic so UDFs that filter on url also filter on src (#424).

    SinghGursimran and ruebot committed Feb 12, 2020
    - Resolves #418 
    - Update tests
    
    Co-authored-by: Nick Ruest <ruestn@gmail.com>
Commits on Feb 11, 2020
  1. [skip travis] Add pre-print link to README. (#423)

    ruebot committed Feb 11, 2020
    * [skip travis] Add pre-print link to README.
Commits on Feb 10, 2020
  1. Add img alt text to imagegraph(); resolves #420. (#422)

    ruebot committed Feb 10, 2020
    - Update ExtractImageLinksRDD to grab alt text
    - Add alt_text column to imagegraph
    - Update tests
  2. Rename imageLinks to imagegraph; resolves #419 (#421)

    ruebot committed Feb 10, 2020
    * Rename imageLinks to imagegraph; resolves #419
Commits on Feb 6, 2020
  1. [maven-release-plugin] prepare for next development iteration

    ruebot committed Feb 5, 2020
Commits on Feb 5, 2020
Commits on Jan 21, 2020
  1. Clean up test descriptions, addresses #372. (#416)

    ruebot authored and ianmilligan1 committed Jan 21, 2020
    - Clean up test descriptions
    - Rename file with typo in its name
  2. Add ExtractImageDetailsDF. (#415)

    SinghGursimran authored and ruebot committed Jan 21, 2020
    - Add test
    - Addresses #223
Commits on Jan 18, 2020
  1. Add crawl_date to binary DataFrames and imageLinks. (#414)

    ruebot authored and ianmilligan1 committed Jan 18, 2020
    - Resolves #413
    - Update tests where necessary
Commits on Jan 17, 2020
  1. Various DataFrame implementation updates for documentation clean-up; addresses #372.

    ruebot authored and ianmilligan1 committed Jan 17, 2020
    - Rename .all() column HttpStatus to http_status_code
    - Adds archive_filename to .all()
    - Significant README updates for setup
    - See also: archivesunleashed/aut-docs#39
Commits on Jan 16, 2020
  1. Use https for maven repo. (#405)

    ruebot authored and ianmilligan1 committed Jan 16, 2020
    - Looks like repos are forcing https to be used now:
    [WARNING] repository metadata for: 'artifact joda-time:joda-time' could not be retrieved from repository: maven due to an error: Failed to transfer file: http://repo.maven.apache.org/maven2/joda-time/joda-time/maven-metadata.xml. Return code is: 501 , ReasonPhrase:HTTPS Required.
Commits on Jan 13, 2020
  1. Test clean-up. (#404)

    ruebot authored and ianmilligan1 committed Jan 13, 2020
    - Clean-up variable names in RecordDFTest.scala
    - Remove dos line endings on a number of files
Commits on Jan 12, 2020
  1. Add language detection column to webpages. (#403)

    ruebot authored and ianmilligan1 committed Jan 12, 2020
    - Addresses #402
Commits on Jan 10, 2020
  1. Add more DataFrame Implementation Serializable APIs (#401).

    SinghGursimran authored and ruebot committed Jan 10, 2020
    - Partially addresses  #223 
    - Add discardContentDF
    - Add discardUrlPatternsDF
    - Add discardLanguagesDF
    - Add keepImagesDF
    - Add keepContentDF
    - Add keepUrlPatternsDF
    - Add keepLanguagesDF
    - Update tests
Commits on Jan 8, 2020
  1. Filter blank src/dest out of webgraph. (#400)

    ruebot authored and ianmilligan1 committed Jan 8, 2020
Commits on Jan 7, 2020
  1. Add more DF implementations for #223. (#399)

    SinghGursimran authored and ruebot committed Jan 7, 2020
    - Add discardHttpStatusDF
    - Add keepMimeTypesDF
    - Add keepMimeTypesTikaDF
    - Update tests
Commits on Jan 5, 2020
  1. Scala imports cleanup. (#398)

    ruebot authored and ianmilligan1 committed Jan 5, 2020
Commits on Dec 29, 2019
  1. Add more serializable APIs for DataFrames (#396)

    SinghGursimran authored and ruebot committed Dec 29, 2019
    - Partially address #223 
    - Add keepHttpStatusDF
    - Add keepDateDF
    - Add keepUrlsDF
    - Add keepDomainsDF
    - Add tests
Commits on Dec 19, 2019
  1. Remove redundant test; addresses #64. (#395)

    ruebot authored and ianmilligan1 committed Dec 19, 2019
Commits on Dec 18, 2019
  1. Add doc comments for webpages and webgraph; resolves #392. (#394)

    ruebot authored and ianmilligan1 committed Dec 18, 2019
  2. Add additional filters for textFiles; resolves #362. (#393)

    ruebot authored and ianmilligan1 committed Dec 18, 2019
    * Add additional filters for textFiles; resolves #362.
    
    - Add filedesc, and dns filter (arc files)
    - Add test case
Commits on Dec 17, 2019
  1. udf API implementations for DataFrame (#391)

    SinghGursimran authored and ruebot committed Dec 17, 2019
    - add discardMimeTypesDF
    - add discardDateDF
    - add discardUrlsDF
    - add discardDomainsDF
    - update tests
    - addresses #223
  2. Add Serializable APIs for DataFrames (#389)

    SinghGursimran authored and ruebot committed Dec 17, 2019
    - Add keepValidPagesDF
    - Add HTTP status code column to all()
    - Add test for keepValidPagesDF
    - Addresses #223
  3. Add and update tests, resolve textFiles bug. (#388)

    ruebot authored and ianmilligan1 committed Dec 17, 2019
    - Add ExtractDateDF test
    - Fix conditional logic of textFiles filter to resolve #390
    - Add test for conditional logic fix for #390
    - Remove cruft ExtractUrls, left over from Twitter analysis removal
    (see:
    https://github.com/lintool/warcbase/blob/cab311ed8b0bceb666865fa76dd3bc5a86402e13/warcbase-core/src/test/scala/org/warcbase/spark/matchbox/ExtractUrlsTest.scala)
    - Tweak null/nothing on a few tests
Commits on Dec 5, 2019
  1. Add new DataFrame matchbox udfs (#387)

    SinghGursimran authored and ruebot committed Dec 5, 2019
    - Add DetectLanguageDF
    - Add ExtractBoilerpipeTextDF
    - Add ExtractDateDF
    - Update tests
    - Rename existing ExtractDate, ExtractBoilerpipeText, DetectLanguage udfs by appending RDD
    - Partially addresses #223
Commits on Nov 28, 2019
  1. Clean-up underscore import, and scalastyle warnings. (#386)

    ruebot authored and ianmilligan1 committed Nov 28, 2019
Commits on Nov 21, 2019
  1. Add "Extract popular images" DataFrame implementation (#382).

    SinghGursimran authored and ruebot committed Nov 21, 2019
    - Add tests for ExtractPopularImagesDF
    - Rename ExtractPopularImages to ExtractPopularImagesRDD
    - Addresses #223
  2. Add all() method and refactor DF UDFs (#383).

    SinghGursimran authored and ruebot committed Nov 21, 2019
    - Add `all()` DataFrame method 
    - Refactor fixity DataFrame UDFs
    - Add ComputeImageSize UDF
    - Add Python implementation of `all()`
    - Addresses #223
  3. Rename pages() to webpages(). (#384)

    ruebot authored and ianmilligan1 committed Nov 21, 2019
    - Part of work on #233
Commits on Nov 19, 2019
  1. Append UDF with RDD or DF. (#381)

    ruebot authored and ianmilligan1 committed Nov 19, 2019
    - Addresses #223
Commits on Nov 18, 2019
  1. Extend more matchbox utilities to DataFrames (#380).

    SinghGursimran authored and ruebot committed Nov 18, 2019
    - Extend GetExtensionMime, ExtractImageLinks, ComputeMD5, and ComputeSHA1 to DataFrames
    - Addresses #223
Commits on Nov 17, 2019
  1. Rename DF functions to be consistent with Python DF functions. (#379)

    ruebot authored and ianmilligan1 committed Nov 17, 2019
    - Resolves #366
Commits on Nov 14, 2019
  1. Finalize converting NER Classifier to WANE Format (#378).

    SinghGursimran authored and ruebot committed Nov 14, 2019
    - Fully resolves #297 
    - Overrides NER Classifier output to PERSON -> persons, LOCATION -> locations, ORGANIZATION -> organizations