Tree: bc0d663fb4
Commits on Jan 12, 2020
  1. Add language detection column to webpages. (#403)

    ruebot authored and ianmilligan1 committed Jan 12, 2020
    - Addresses #402
Commits on Jan 10, 2020
  1. Add more DataFrame Implementation Serializable APIs (#401).

    SinghGursimran authored and ruebot committed Jan 10, 2020
    - Partially addresses  #223 
    - Add discardContentDF
    - Add discardUrlPatternsDF
    - Add discardLanguagesDF
    - Add keepImagesDF
    - Add keepContentDF
    - Add keepUrlPatternsDF
    - Add keepLanguagesDF
    - Update tests
Commits on Jan 8, 2020
  1. Filter blank src/dest out of webgraph. (#400)

    ruebot authored and ianmilligan1 committed Jan 8, 2020
Commits on Jan 7, 2020
  1. Add more DF implementations for #223. (#399)

    SinghGursimran authored and ruebot committed Jan 7, 2020
    - Add discardHttpStatusDF
    - Add keepMimeTypesDF
    - Add keepMimeTypesTikaDF
    - Update tests
Commits on Jan 5, 2020
  1. Scala imports cleanup. (#398)

    ruebot authored and ianmilligan1 committed Jan 5, 2020
Commits on Dec 29, 2019
  1. Add more serializable APIs for DataFrames (#396)

    SinghGursimran authored and ruebot committed Dec 29, 2019
    - Partially address #223 
    - Add keepHttpStatusDF
    - Add keepDateDF
    - Add keepUrlsDF
    - Add keepDomainsDF
    - Add tests
Commits on Dec 19, 2019
  1. Remove redundant test; addresses #64. (#395)

    ruebot authored and ianmilligan1 committed Dec 19, 2019
Commits on Dec 18, 2019
  1. Add doc comments for webpages and webgraph; resolves #392. (#394)

    ruebot authored and ianmilligan1 committed Dec 18, 2019
  2. Add additional filters for textFiles; resolves #362. (#393)

    ruebot authored and ianmilligan1 committed Dec 18, 2019
    * Add additional filters for textFiles; resolves #362.
    
    - Add filedesc, and dns filter (arc files)
    - Add test case
Commits on Dec 17, 2019
  1. udf API implementations for DataFrame (#391)

    SinghGursimran authored and ruebot committed Dec 17, 2019
    - add discardMimeTypesDF
    - add discardDateDF
    - add discardUrlsDF
    - add discardDomainsDF
    - update tests
    - addresses #223
  2. Add Serializable APIs for DataFrames (#389)

    SinghGursimran authored and ruebot committed Dec 17, 2019
    - Add keepValidPagesDF
    - Add HTTP status code column to all()
    - Add test for keepValidPagesDF
    - Addresses #223
  3. Add and update tests, resolve textFiles bug. (#388)

    ruebot authored and ianmilligan1 committed Dec 17, 2019
    - Add ExtractDateDF test
    - Fix conditional logic of textFiles filter to resolve #390
    - Add test for conditional logic fix for #390
    - Remove cruft ExtractUrls, left over from Twitter analysis removal
    (see:
    https://github.com/lintool/warcbase/blob/cab311ed8b0bceb666865fa76dd3bc5a86402e13/warcbase-core/src/test/scala/org/warcbase/spark/matchbox/ExtractUrlsTest.scala)
    - Tweak null/nothing on a few tests
Commits on Dec 5, 2019
  1. Add new DataFrame matchbox udfs (#387)

    SinghGursimran authored and ruebot committed Dec 5, 2019
    - Add DetectLanguageDF
    - Add ExtractBoilerpipeTextDF
    - Add ExtractDateDF
    - Update tests
    - Rename existing ExtractDate, ExtractBoilerpipeText, DetectLanguage udfs by appending RDD
    - Partially addresses #223
Commits on Nov 28, 2019
  1. Clean-up underscore import, and scalastyle warnings. (#386)

    ruebot authored and ianmilligan1 committed Nov 28, 2019
Commits on Nov 21, 2019
  1. Add "Extract popular images" DataFrame implementation (#382).

    SinghGursimran authored and ruebot committed Nov 21, 2019
    - Add tests for ExtractPopularImagesDF
    - Rename ExtractPopularImages to ExtractPopularImagesRDD
    - Addresses #223
  2. Add all() method and refactor DF UDFs (#383).

    SinghGursimran authored and ruebot committed Nov 21, 2019
    - Add `all()` DataFrame method 
    - Refactor fixity DataFrame UDFs
    - Add ComputeImageSize UDF
    - Add Python implementation of `all()`
    - Addresses #223
  3. Rename pages() to webpages(). (#384)

    ruebot authored and ianmilligan1 committed Nov 21, 2019
    - Part of work on #233
Commits on Nov 19, 2019
  1. Append UDF with RDD or DF. (#381)

    ruebot authored and ianmilligan1 committed Nov 19, 2019
    - Addresses #223
Commits on Nov 18, 2019
  1. Extend more Matchbox utilities to DataFrames (#380).

    SinghGursimran authored and ruebot committed Nov 18, 2019
    - Extend GetExtensionMime, ExtractImageLinks, ComputeMD5, and ComputeSHA1 to DataFrames
    - Addresses #223
Commits on Nov 17, 2019
  1. Rename DF functions to be consistent with Python DF functions. (#379)

    ruebot authored and ianmilligan1 committed Nov 17, 2019
    - Resolves #366
Commits on Nov 14, 2019
  1. Finalize converting NER Classifier to WANE Format (#378).

    SinghGursimran authored and ruebot committed Nov 14, 2019
    - Fully resolves #297 
    - Overrides NER Classifier output to PERSON -> persons, LOCATION -> locations, ORGANIZATION -> organizations
Commits on Nov 12, 2019
  1. Add df ExtractLinks udf; resolves #238. (#377)

    SinghGursimran authored and ruebot committed Nov 12, 2019
    - Add df ExtractLinks udf
    - Add test
Commits on Nov 10, 2019
  1. Update README.md (#376)

    lintool authored and ruebot committed Nov 10, 2019
    Tweaks the style of the license badge to look consistent with the other badges.
Commits on Nov 7, 2019
  1. Change RemoveHttpHeader to RemoveHTTPHeader. (#374)

    SinghGursimran authored and ruebot committed Nov 7, 2019
    Resolves #368.
Commits on Nov 6, 2019
  1. Updates description. See archivesunleashed/aut-docs#18 (#373)

    ruebot authored and ianmilligan1 committed Nov 6, 2019
Commits on Nov 5, 2019
  1. Align NER output to WANE format; addresses #297 (#361)

    ruebot authored and ianmilligan1 committed Nov 5, 2019
    - Update Stanford core NLP
    - Format NER output in json
    - Add getPayloadDigest to ArchiveRecord
    - Add test for getPayloadDigest
    - Add payload digest to NER output
    - Remove extractFromScrapeText
    - Remove extractFromScrapeText test
    - TODO: PERSON -> persons, LOCATION -> locations, ORGANIZATION -> organizations (involves writing a new class or overriding NER output) 🤢
  2. Various UDF implementation and cleanup for DF. (#370)

    lintool authored and ruebot committed Nov 5, 2019
    - Replace ExtractBaseDomain with ExtractDomain
    - Closes #367
    - Address bug in ArcTest; RemoveHTML -> RemoveHttpHeader
    - Closes #369
    - Wraps RemoveHttpHeader and RemoveHTML for use in data frames.
    - Partially addresses #238
    - Updates tests where necessary
    - Punts on #368 UDF CaMeL cASe consistency issues
Commits on Oct 14, 2019
  1. Update commons-compress to 1.19; CVE-2019-12402 (#365)

    ruebot authored and ianmilligan1 committed Oct 14, 2019
Commits on Oct 9, 2019
  1. Add ComputeSHA1 method; resolves #363. (#364)

    ruebot authored and ianmilligan1 committed Oct 9, 2019
    - Update tests where needed
    - Add SHA1 method to ExtractImageDetails
    - Add SHA1 to DataFrames binary extraction and analysis
Commits on Sep 11, 2019
  1. Update keepValidPages to include a filter on 200 OK. (#360)

    ruebot authored and ianmilligan1 committed Sep 11, 2019
    - Add status code filter to keepValidPages
    - Add MimeTypeTika to valid pages DF
    - Update tests since we filter more and better now 😄
    - Resolves #359
Commits on Sep 3, 2019
  1. Update to Spark 2.4.4 (#358)

    ruebot authored and ianmilligan1 committed Sep 3, 2019
Commits on Aug 27, 2019
Commits on Aug 23, 2019
  1. Add discardLanguage filter to RecordLoader. (#353)

    ruebot authored and ianmilligan1 committed Aug 23, 2019
    - Clean up doc comments
    - Add test
    - Resolves #352
Commits on Aug 22, 2019
  1. Improve test coverage. (#354)

    ruebot authored and ianmilligan1 committed Aug 22, 2019
    - Add tests for a few more filters in RecordLoader
    - Add binary extraction DataFrameLoader tests
Commits on Aug 21, 2019
  1. [maven-release-plugin] prepare for next development iteration

    ruebot committed Aug 21, 2019