Tree: 99a4e8b283
Commits on Mar 26, 2020
  1. Alphabetical sort udfs

    ruebot committed Mar 26, 2020
Commits on Mar 25, 2020
  1. Merge branch 'master' into issue-409

    ruebot committed Mar 25, 2020
Commits on Mar 23, 2020
  1. Tweak hasDate to handle Seq. (#430)

    ruebot committed Mar 23, 2020
    Tweak hasDate to handle Seq.
    - Addresses #425
    - Add test for hasDate
Commits on Mar 19, 2020
  1. Add setup; might not need it in the end :shrug:

    ruebot committed Mar 19, 2020
Commits on Mar 18, 2020
  1. Restyle keep/discard filter UDFs in the context of DataFrames (#429)

    ruebot committed Mar 18, 2020
    Co-authored-by: g285sing <g285sing@student.cs.uwaterloo.ca> (@SinghGursimran)
    
    - Resolves #425
    - Replace all keep/discard DF udfs with `hasXYZ()`
    - Update tests
Commits on Feb 20, 2020
  1. Merge branch 'master' into issue-409

    ruebot committed Feb 20, 2020
  2. Update Spark and Hadoop versions. (#426)

    ruebot committed Feb 20, 2020
    - Update Spark to 2.4.5
    - Update Hadoop to 2.7.4 (for RADOS/S3 support)
    - Tweak README
Commits on Feb 18, 2020
  1. Merge branch 'master' into issue-409

    ruebot committed Feb 18, 2020
Commits on Feb 12, 2020
  1. Add logic so UDFs that filter on url should also filter on src (#424).

    SinghGursimran and ruebot committed Feb 12, 2020
    - Resolves #418 
    - Update tests
    
    Co-authored-by: Nick Ruest <ruestn@gmail.com>
Commits on Feb 11, 2020
  1. [skip travis] Add pre-print link to README. (#423)

    ruebot committed Feb 11, 2020
Commits on Feb 10, 2020
  1. Add img alt text to imagegraph(); resolves #420. (#422)

    ruebot committed Feb 10, 2020
    - Update ExtractImageLinksRDD to grab alt text
    - Add alt_text column to imagegraph
    - Update tests
  2. Rename imageLinks to imagegraph; resolves #419 (#421)

    ruebot committed Feb 10, 2020
  3. Python work check-in

    ruebot committed Feb 10, 2020
Commits on Feb 6, 2020
  1. Merge branch 'master' into issue-409

    ruebot committed Feb 6, 2020
  2. [maven-release-plugin] prepare for next development iteration

    ruebot committed Feb 5, 2020
Commits on Jan 23, 2020
  1. Start adding filters; keep_valid_pages.

    ruebot committed Jan 23, 2020
    - TODO, make it object oriented
  2. clean-up

    ruebot committed Jan 23, 2020
Commits on Jan 22, 2020
  1. Merge branch 'master' into issue-409

    ruebot committed Jan 22, 2020
Commits on Jan 21, 2020
  1. Clean up test descriptions, addresses #372. (#416)

    ruebot authored and ianmilligan1 committed Jan 21, 2020
    - Clean up test descriptions
    - Rename typo filename
  2. Merge branch 'master' into issue-409

    ruebot committed Jan 21, 2020
  3. Add ExtractImageDetailsDF. (#415)

    SinghGursimran authored and ruebot committed Jan 21, 2020
    - Add test
    - Addresses #223
  4. - Remove order udfs alphabetically

    ruebot committed Jan 21, 2020
    - Get detect_language setup
    - ComputeSHA1 and MD5 need some work?
  5. Setup external lib packaging for Python!!

    ruebot committed Jan 21, 2020
    Add remove_html udf
    Rename remove_http_header to remove_http_headers
Commits on Jan 19, 2020
  1. Matchbox-pyhton1

    g285sing committed Jan 19, 2020
  2. computeMD5

    g285sing committed Jan 19, 2020
Commits on Jan 18, 2020
  1. Merge branch 'master' into issue-409

    ruebot committed Jan 18, 2020
  2. Add crawl_date to binary DataFrames and imageLinks. (#414)

    ruebot authored and ianmilligan1 committed Jan 18, 2020
    - Resolves #413
    - Update tests where necessary
  3. rename links to webgraph

    ruebot committed Jan 18, 2020
  4. Add some PySpark udfs

    ruebot committed Jan 18, 2020
    - Add remove_http_header, remove_prefix_www
    - Rename extract_domain_func to extract_domain
    - Formatting updates
    - Addresses #409
Commits on Jan 17, 2020
  1. Various DataFrame implementation updates for documentation clean-up; Addresses #372.

    ruebot authored and ianmilligan1 committed Jan 17, 2020
    
    - .all() column HttpStatus to http_status_code
    - Adds archive_filename to .all()
    - Significant README updates for setup
    - See also: archivesunleashed/aut-docs#39