Tree: 9474a0996e
Commits on Feb 11, 2020
  1. [skip travis] Add pre-print link to README. (#423)

    ruebot committed Feb 11, 2020
    * [skip travis] Add pre-print link to README.
Commits on Feb 10, 2020
  1. Add img alt text to imagegraph(); resolves #420. (#422)

    ruebot committed Feb 10, 2020
    - Update ExtractImageLinksRDD to grab alt text
    - Add alt_text column to imagegraph
    - Update tests
  2. Rename imageLinks to imagegraph; resolves #419 (#421)

    ruebot committed Feb 10, 2020
    * Rename imageLinks to imagegraph; resolves #419
Commits on Feb 6, 2020
  1. [maven-release-plugin] prepare for next development iteration

    ruebot committed Feb 5, 2020
Commits on Feb 5, 2020
Commits on Jan 21, 2020
  1. Clean up test descriptions, addresses #372. (#416)

    ruebot authored and ianmilligan1 committed Jan 21, 2020
    - Clean up test descriptions
    - Rename typo filename
  2. Add ExtractImageDetailsDF. (#415)

    SinghGursimran authored and ruebot committed Jan 21, 2020
    - Add test
    - Addresses #223
Commits on Jan 18, 2020
  1. Add crawl_date to binary DataFrames and imageLinks. (#414)

    ruebot authored and ianmilligan1 committed Jan 18, 2020
    - Resolves #413
    - Update tests where necessary
Commits on Jan 17, 2020
  1. Various DataFrame implementation updates for documentation clean-up; Addresses #372.

    ruebot authored and ianmilligan1 committed Jan 17, 2020
    - Rename .all() column HttpStatus to http_status_code
    - Adds archive_filename to .all()
    - Significant README updates for setup
    - See also: archivesunleashed/aut-docs#39
Commits on Jan 16, 2020
  1. Use https for maven repo. (#405)

    ruebot authored and ianmilligan1 committed Jan 16, 2020
    - Looks like repos are forcing https to be used now:
    [WARNING] repository metadata for: 'artifact joda-time:joda-time' could not be retrieved from repository: maven due to an error: Failed to transfer file: http://repo.maven.apache.org/maven2/joda-time/joda-time/maven-metadata.xml. Return code is: 501 , ReasonPhrase:HTTPS Required.
Commits on Jan 13, 2020
  1. Test clean-up. (#404)

    ruebot authored and ianmilligan1 committed Jan 13, 2020
    - Clean-up variable names in RecordDFTest.scala
    - Remove dos line endings on a number of files
Commits on Jan 12, 2020
  1. Add language detection column to webpages. (#403)

    ruebot authored and ianmilligan1 committed Jan 12, 2020
    - Addresses #402
Commits on Jan 10, 2020
  1. Add more DataFrame Implementation Serializable APIs (#401).

    SinghGursimran authored and ruebot committed Jan 10, 2020
    - Partially addresses #223
    - Add discardContentDF
    - Add discardUrlPatternsDF
    - Add discardLanguagesDF
    - Add keepImagesDF
    - Add keepContentDF
    - Add keepUrlPatternsDF
    - Add keepLanguagesDF
    - Update tests
Commits on Jan 8, 2020
  1. Filter blank src/dest out of webgraph. (#400)

    ruebot authored and ianmilligan1 committed Jan 8, 2020
Commits on Jan 7, 2020
  1. Add more DF implementations for #223. (#399)

    SinghGursimran authored and ruebot committed Jan 7, 2020
    - Add discardHttpStatusDF
    - Add keepMimeTypesDF
    - Add keepMimeTypesTikaDF
    - Update tests
Commits on Jan 5, 2020
  1. Scala imports cleanup. (#398)

    ruebot authored and ianmilligan1 committed Jan 5, 2020
Commits on Dec 29, 2019
  1. Add more serializable APIs for DataFrames (#396)

    SinghGursimran authored and ruebot committed Dec 29, 2019
    - Partially addresses #223
    - Add keepHttpStatusDF
    - Add keepDateDF
    - Add keepUrlsDF
    - Add keepDomainsDF
    - Add tests
Commits on Dec 19, 2019
  1. Remove redundant test; addresses #64. (#395)

    ruebot authored and ianmilligan1 committed Dec 19, 2019
Commits on Dec 18, 2019
  1. Add doc comments for webpages and webgraph; resolves #392. (#394)

    ruebot authored and ianmilligan1 committed Dec 18, 2019
  2. Add additional filters for textFiles; resolves #362. (#393)

    ruebot authored and ianmilligan1 committed Dec 18, 2019
    * Add additional filters for textFiles; resolves #362.
    
    - Add filedesc, and dns filter (arc files)
    - Add test case
Commits on Dec 17, 2019
  1. udf API implementations for DataFrame (#391)

    SinghGursimran authored and ruebot committed Dec 17, 2019
    - add discardMimeTypesDF
    - add discardDateDF
    - add discardUrlsDF
    - add discardDomainsDF
    - update tests
    - addresses #223
  2. Add Serializable APIs for DataFrames (#389)

    SinghGursimran authored and ruebot committed Dec 17, 2019
    - Add keepValidPagesDF
    - Add HTTP status code column to all()
    - Add test for keepValidPagesDF
    - Addresses #223
  3. Add and update tests, resolve textFiles bug. (#388)

    ruebot authored and ianmilligan1 committed Dec 17, 2019
    - Add ExtractDateDF test
    - Fix conditional logic of textFiles filter to resolve #390
    - Add test for conditional logic fix for #390
    - Remove cruft ExtractUrls, left over from Twitter analysis removal
    (see:
    https://github.com/lintool/warcbase/blob/cab311ed8b0bceb666865fa76dd3bc5a86402e13/warcbase-core/src/test/scala/org/warcbase/spark/matchbox/ExtractUrlsTest.scala)
    - Tweak null/nothing on a few tests
Commits on Dec 5, 2019
  1. Add new DataFrame matchbox udfs (#387)

    SinghGursimran authored and ruebot committed Dec 5, 2019
    - Add DetectLanguageDF
    - Add ExtractBoilerpipeTextDF
    - Add ExtractDateDF
    - Update tests
    - Rename existing ExtractDate, ExtractBoilerpipeText, DetectLanguage udfs by appending RDD
    - Partially addresses #223
Commits on Nov 28, 2019
  1. Clean-up underscore import, and scalastyle warnings. (#386)

    ruebot authored and ianmilligan1 committed Nov 28, 2019
Commits on Nov 21, 2019
  1. Add "Extract popular images" DataFrame implementation (#382).

    SinghGursimran authored and ruebot committed Nov 21, 2019
    - Add tests for ExtractPopularImagesDF
    - Rename ExtractPopularImages to ExtractPopularImagesRDD
    - Addresses #223
  2. Add all() method and refactor DF UDFs (#383).

    SinghGursimran authored and ruebot committed Nov 21, 2019
    - Add `all()` DataFrame method 
    - Refactor fixity DataFrame UDFs
    - Add ComputeImageSize UDF
    - Add Python implementation of `all()`
    - Addresses #223
  3. Rename pages() to webpages(). (#384)

    ruebot authored and ianmilligan1 committed Nov 21, 2019
    - Part of work on #233
Commits on Nov 19, 2019
  1. Append UDF with RDD or DF. (#381)

    ruebot authored and ianmilligan1 committed Nov 19, 2019
    - Addresses #223
Commits on Nov 18, 2019
  1. Extend more Matchbox utilities to DataFrames (#380).

    SinghGursimran authored and ruebot committed Nov 18, 2019
    - Extend GetExtensionMime, ExtractImageLinks, ComputeMD5, and ComputeSHA1 to DataFrames
    - Addresses #223
Commits on Nov 17, 2019
  1. Rename DF functions to be consistent with Python DF functions. (#379)

    ruebot authored and ianmilligan1 committed Nov 17, 2019
    - Resolves #366
Commits on Nov 14, 2019
  1. Finalize converting NER Classifier to WANE Format (#378).

    SinghGursimran authored and ruebot committed Nov 14, 2019
    - Fully resolves #297 
    - Overrides NER Classifier output to PERSON -> persons, LOCATION -> locations, ORGANIZATION -> organizations
Commits on Nov 12, 2019
  1. Add df ExtractLinks udf; resolves #238. (#377)

    SinghGursimran authored and ruebot committed Nov 12, 2019
    - Add df ExtractLinks udf
    - Add test
Commits on Nov 10, 2019
  1. Update README.md (#376)

    lintool authored and ruebot committed Nov 10, 2019
    Tweaks the style of the license badge to look consistent with the other badges.