Tree: 3ab17bf017
Commits on Nov 26, 2019
  1. Fix copyright header

    ruebot committed Nov 26, 2019
  2. Merge branch 'issue-356' of github.com:archivesunleashed/aut into issue-329

    ruebot committed Nov 26, 2019
Commits on Nov 21, 2019
  1. Merge branch 'master' into issue-356

    ruebot committed Nov 21, 2019
  2. Add "Extract popular images" DataFrame implementation (#382).

    SinghGursimran authored and ruebot committed Nov 21, 2019
    - Add tests for ExtractPopularImagesDF
    - Rename ExtractPopularImages to ExtractPopularImagesRDD
    - Addresses #223
  3. Merge branch 'master' into issue-356

    ruebot committed Nov 21, 2019
  4. Add all() method and refactor DF UDFs (#383).

    SinghGursimran authored and ruebot committed Nov 21, 2019
    - Add `all()` DataFrame method 
    - Refactor fixity DataFrame UDFs
    - Add ComputeImageSize UDF
    - Add Python implementation of `all()`
    - Addresses #223
  5. Rename pages() to webpages(). (#384)

    ruebot authored and ianmilligan1 committed Nov 21, 2019
    - Part of work on #233
Commits on Nov 19, 2019
  1. Merge branch 'master' into issue-356

    ruebot committed Nov 19, 2019
  2. Append UDF with RDD or DF. (#381)

    ruebot authored and ianmilligan1 committed Nov 19, 2019
    - Addresses #223
Commits on Nov 18, 2019
  1. Merge branch 'master' into issue-356

    ruebot committed Nov 18, 2019
  2. Extend more Matchbook utilities to DataFrames (#380).

    SinghGursimran authored and ruebot committed Nov 18, 2019
    - Extend GetExtensionMime, ExtractImageLinks, ComputeMD5, and ComputeSHA1 to DataFrames
    - Addresses #223
Commits on Nov 17, 2019
  1. Rename DF functions to be consistent with Python DF functions. (#379)

    ruebot authored and ianmilligan1 committed Nov 17, 2019
    - Resolves #366
Commits on Nov 14, 2019
  1. Merge branch 'master' into issue-356

    ruebot committed Nov 14, 2019
  2. Finalize converting NER Classifier to WANE Format (#378).

    SinghGursimran authored and ruebot committed Nov 14, 2019
    - Fully resolves #297 
    - Overrides NER Classifier output to PERSON -> persons, LOCATION -> locations, ORGANIZATION -> organizations
Commits on Nov 12, 2019
  1. Add df ExtractLinks udf; resolves #238. (#377)

    SinghGursimran authored and ruebot committed Nov 12, 2019
    - Add df ExtractLinks udf
    - Add test
Commits on Nov 10, 2019
  1. Merge branch 'master' into issue-356

    ruebot committed Nov 10, 2019
  2. #356 add Java 11 and Spark 3.0.0+ to README

    ruebot committed Nov 10, 2019
  3. Update README.md (#376)

    lintool authored and ruebot committed Nov 10, 2019
    Tweaks the style of the license badge to look consistent with the other badges.
  4. #356 wave goodbye to Java 8.

    ruebot committed Nov 10, 2019
  5. #356 remove test line.

    ruebot committed Nov 10, 2019
  6. #356 -- forgot to remove cobertura TravisCI build.

    ruebot committed Nov 10, 2019
  7. #356 Replace Cobertura with JaCoCo, and allow Java 8 to fail for now.

    ruebot committed Nov 10, 2019
  8. #356 :facepalm: oops. uncommented that, and pushed.

    ruebot committed Nov 10, 2019
  9. #356 update TravisCI; don't allow failures on Java 11.

    ruebot committed Nov 10, 2019
  10. #356 verify-javadocs fixed -- make sure you have OpenJDK-11 fully installed :facepalm:, and a bunch more pom cleanup.

    ruebot committed Nov 10, 2019
  11. #356 scala updates

    ruebot committed Nov 10, 2019
Commits on Nov 9, 2019
  1. Update to Spark 3.0.0

    ruebot committed Nov 9, 2019
    - Some hacks to get a successful build
    - Definitely need to loop back and clean-up a whole lot!
    - Addresses #356
Commits on Nov 7, 2019
  1. Change RemoveHttpHeader to RemoveHTTPHeader. (#374)

    SinghGursimran authored and ruebot committed Nov 7, 2019
    Resolves #368.
Commits on Nov 6, 2019
  1. Updates description. See archivesunleashed/aut-docs#18 (#373)

    ruebot authored and ianmilligan1 committed Nov 6, 2019
Commits on Nov 5, 2019
  1. Align NER output to WANE format; addresses #297 (#361)

    ruebot authored and ianmilligan1 committed Nov 5, 2019
    - Update Stanford core NLP
    - Format NER output in json
    - Add getPayloadDigest to ArchiveRecord
    - Add test for getPayloadDigest
    - Add payload digest to NER output
    - Remove extractFromScrapeText
    - Remove extractFromScrapeText test
    - TODO: PERSON -> persons, LOCATION -> locations, ORGANIZATION -> organizations (involves writing a new class or overriding NER output 🤢)
  2. Various UDF implementation and cleanup for DF. (#370)

    lintool authored and ruebot committed Nov 5, 2019
    - Replace ExtractBaseDomain with ExtractDomain
    - Closes #367
    - Address bug in ArcTest; RemoveHTML -> RemoveHttpHeader
    - Closes #369
    - Wraps RemoveHttpHeader and RemoveHTML for use in data frames.
    - Partially addresses #238
    - Updates tests where necessary
    - Punts on #368 UDF CaMeL cASe consistency issues
Commits on Oct 14, 2019
  1. Update commons-compress to 1.19; CVE-2019-12402 (#365)

    ruebot authored and ianmilligan1 committed Oct 14, 2019
Commits on Oct 9, 2019
  1. Add ComputeSHA1 method; resolves #363. (#364)

    ruebot authored and ianmilligan1 committed Oct 9, 2019
    - Update tests where needed
    - Add SHA1 method to ExtractImageDetails
    - Add SHA1 to DataFrames binary extraction and analysis
Commits on Sep 11, 2019
  1. Update keepValidPages to include a filter on 200 OK. (#360)

    ruebot authored and ianmilligan1 committed Sep 11, 2019
    - Add status code filter to keepValidPages
    - Add MimeTypeTika to valid pages DF
    - Update tests since we filter more and better now 😄
    - Resolves #359