Tree: f237c351d8
Commits on Feb 10, 2020
  1. Python work check-in

    ruebot committed Feb 10, 2020
Commits on Jan 23, 2020
  1. Start adding filters; keep_valid_pages.

    ruebot committed Jan 23, 2020
    - TODO, make it object oriented
  2. clean-up

    ruebot committed Jan 23, 2020
Commits on Jan 22, 2020
  1. Merge branch 'master' into issue-409

    ruebot committed Jan 22, 2020
Commits on Jan 21, 2020
  1. Clean up test descriptions, addresses #372. (#416)

    ruebot authored and ianmilligan1 committed Jan 21, 2020
    - Clean up test descriptions
    - Rename typo filename
  2. Merge branch 'master' into issue-409

    ruebot committed Jan 21, 2020
  3. Add ExtractImageDetailsDF. (#415)

    SinghGursimran authored and ruebot committed Jan 21, 2020
    - Add test
    - Addresses #223
  4. Remove order udfs alphabetically

    ruebot committed Jan 21, 2020
    - Get detect_language setup
    - ComputeSHA1 and MD5 need some work?
  5. Setup external lib packaging for Python!!

    ruebot committed Jan 21, 2020
    - Add remove_html udf
    - Rename remove_http_header to remove_http_headers
Commits on Jan 19, 2020
  1. Matchbox-python1

    g285sing committed Jan 19, 2020
  2. computeMD5

    g285sing committed Jan 19, 2020
Commits on Jan 18, 2020
  1. Merge branch 'master' into issue-409

    ruebot committed Jan 18, 2020
  2. Add crawl_date to binary DataFrames and imageLinks. (#414)

    ruebot authored and ianmilligan1 committed Jan 18, 2020
    - Resolves #413
    - Update tests where necessary
  3. rename links to webgraph

    ruebot committed Jan 18, 2020
  4. Add some PySpark udfs

    ruebot committed Jan 18, 2020
    - Add remove_http_header, remove_prefix_www
    - Rename extract_domain_func to extract_domain
    - Formatting updates
    - Addresses #409
Commits on Jan 17, 2020
  1. Various DataFrame implementation updates for documentation clean-up; Addresses #372.

    ruebot authored and ianmilligan1 committed Jan 17, 2020
    - .all() column HttpStatus to http_status_code
    - Adds archive_filename to .all()
    - Significant README updates for setup
    - See also: archivesunleashed/aut-docs#39
Commits on Jan 16, 2020
  1. Use https for maven repo. (#405)

    ruebot authored and ianmilligan1 committed Jan 16, 2020
    - Looks like repos are forcing https to be used now:
    [WARNING] repository metadata for: 'artifact joda-time:joda-time' could not be retrieved from repository: maven due to an error: Failed to transfer file: http://repo.maven.apache.org/maven2/joda-time/joda-time/maven-metadata.xml. Return code is: 501 , ReasonPhrase:HTTPS Required.
Commits on Jan 13, 2020
  1. Test clean-up. (#404)

    ruebot authored and ianmilligan1 committed Jan 13, 2020
    - Clean-up variable names in RecordDFTest.scala
    - Remove dos line endings on a number of files
Commits on Jan 12, 2020
  1. Add language detection column to webpages. (#403)

    ruebot authored and ianmilligan1 committed Jan 12, 2020
    - Addresses #402
Commits on Jan 10, 2020
  1. Add more DataFrame Implementation Serializable APIs (#401).

    SinghGursimran authored and ruebot committed Jan 10, 2020
    - Partially addresses #223
    - Add discardContentDF
    - Add discardUrlPatternsDF
    - Add discardLanguagesDF
    - Add keepImagesDF
    - Add keepContentDF
    - Add keepUrlPatternsDF
    - Add keepLanguagesDF
    - Update tests
Commits on Jan 8, 2020
  1. Filter blank src/dest out of webgraph. (#400)

    ruebot authored and ianmilligan1 committed Jan 8, 2020
Commits on Jan 7, 2020
  1. Add more DF implementations for #223. (#399)

    SinghGursimran authored and ruebot committed Jan 7, 2020
    - Add discardHttpStatusDF
    - Add keepMimeTypesDF
    - Add keepMimeTypesTikaDF
    - Update tests
Commits on Jan 5, 2020
  1. Scala imports cleanup. (#398)

    ruebot authored and ianmilligan1 committed Jan 5, 2020
Commits on Dec 29, 2019
  1. Add more serializable APIs for DataFrames (#396)

    SinghGursimran authored and ruebot committed Dec 29, 2019
    - Partially address #223 
    - Add keepHttpStatusDF
    - Add keepDateDF
    - Add keepUrlsDF
    - Add keepDomainsDF
    - Add tests
Commits on Dec 19, 2019
  1. Remove redundant test; addresses #64. (#395)

    ruebot authored and ianmilligan1 committed Dec 19, 2019
Commits on Dec 18, 2019
  1. Add doc comments for webpages and webgraph; resolves #392. (#394)

    ruebot authored and ianmilligan1 committed Dec 18, 2019
  2. Add additional filters for textFiles; resolves #362. (#393)

    ruebot authored and ianmilligan1 committed Dec 18, 2019
    - Add filedesc, and dns filter (arc files)
    - Add test case
Commits on Dec 17, 2019
  1. udf API implementations for DataFrame (#391)

    SinghGursimran authored and ruebot committed Dec 17, 2019
    - add discardMimeTypesDF
    - add discardDateDF
    - add discardUrlsDF
    - add discardDomainsDF
    - update tests
    - addresses #223
  2. Add Serializable APIs for DataFrames (#389)

    SinghGursimran authored and ruebot committed Dec 17, 2019
    - Add keepValidPagesDF
    - Add HTTP status code column to all()
    - Add test for keepValidPagesDF
    - Addresses #223
  3. Add and update tests, resolve textFiles bug. (#388)

    ruebot authored and ianmilligan1 committed Dec 17, 2019
    - Add ExtractDateDF test
    - Fix conditional logic of textFiles filter to resolve #390
    - Add test for conditional logic fix for #390
    - Remove cruft ExtractUrls, left over from Twitter analysis removal
    (see:
    https://github.com/lintool/warcbase/blob/cab311ed8b0bceb666865fa76dd3bc5a86402e13/warcbase-core/src/test/scala/org/warcbase/spark/matchbox/ExtractUrlsTest.scala)
    - Tweak null/nothing on a few tests
Commits on Dec 5, 2019
  1. Add new DataFrame matchbox udfs (#387)

    SinghGursimran authored and ruebot committed Dec 5, 2019
    - Add DetectLanguageDF
    - Add ExtractBoilerpipeTextDF
    - Add ExtractDateDF
    - Update tests
    - Rename existing ExtractDate, ExtractBoilerpipeText, DetectLanguage udfs by appending RDD
    - Partially addresses #223
Commits on Nov 28, 2019
  1. Clean-up underscore import, and scalastyle warnings. (#386)

    ruebot authored and ianmilligan1 committed Nov 28, 2019
Commits on Nov 21, 2019
  1. Add "Extract popular images" DataFrame implementation (#382).

    SinghGursimran authored and ruebot committed Nov 21, 2019
    - Add tests for ExtractPopularImagesDF
    - Rename ExtractPopularImages to ExtractPopularImagesRDD
    - Addresses #223
  2. Add all() method and refactor DF UDFs (#383).

    SinghGursimran authored and ruebot committed Nov 21, 2019
    - Add `all()` DataFrame method 
    - Refactor fixity DataFrame UDFs
    - Add ComputeImageSize UDF
    - Add Python implementation of `all()`
    - Addresses #223