Tree: b9f7a76750
Commits on Apr 22, 2019
  1. Update README and add LICENSE.txt

    ruebot committed Apr 22, 2019
    - Add badges to README
    - Update Markdown
    - Clean-up formatting
    - Add LICENSE.txt in root
    - Partially addresses internetarchive#233
Commits on Apr 15, 2019
  1. Merge pull request internetarchive#251 from nlevitt/trough-dedup

    adam-miller committed Apr 15, 2019
    fix some trough dedup bugs
Commits on Apr 10, 2019
  1. Merge pull request internetarchive#249 from ruebot/remove-suffix-crawler-bean

    nlevitt committed Apr 10, 2019
    Remove suffix from warcWriter since it is no longer used.
  2. Merge pull request internetarchive#253 from dvanduzer/master

    nlevitt committed Apr 10, 2019
    set of frontier management changes to support CrawlHQ module
  3. give me some space

    nlevitt committed Apr 10, 2019
  4. Merge branch 'master' into trough-dedup

    nlevitt committed Apr 10, 2019
    * master:
      replace System.err.println with logger.info
      Revert "Upgrade httpclient to 4.5.7 and handle cookies more compliantly"
      Removing outdated test.
      Disable questionable test.
      Avoid deprecated flag.
      Supply an iterator, for internetarchive#245
      Updated POM to use latest version.
      Update README.md
      Handle missing closing paren in srcset descriptor
      Teach jericho extractor srcset
      Don't run srcset test against jericho, it doesn't handle it
      Handle commas more compliantly when parsing srcset
      Ensure we start parsing full lines, for internetarchive#239.
Commits on Apr 8, 2019
  1. set of frontier management changes to support CrawlHQ module

    David Van Duzer committed Apr 8, 2019
    These changes come from a private fork of H3, originally made by Kenji
    Nagahashi, to create org.archive.crawler.frontier.PullingBdbFrontier,
    which we intend to merge into 'contrib' of the official version in the
    near future.
Commits on Apr 1, 2019
  1. fix some trough dedup bugs

    nlevitt committed Apr 1, 2019
    especially this:
    
    -    writeUrlCache.remove("segmentId");
    +    writeUrlCache.remove(segmentId);
    
    and some improvements and tweaks
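
    The one-character difference in the diff above is easy to miss, so here is
    a minimal sketch of why it matters. It assumes, purely for illustration,
    that writeUrlCache behaves like a Map keyed by segment id; the class name,
    method names, and the Map type are hypothetical and are not taken from the
    project source.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the cache touched by the fix; the real type of
// writeUrlCache is not shown in the commit message.
class CacheKeySketch {

    // Mirrors the removed line: the key is the literal string "segmentId",
    // so the entry stored under the actual segment id is never evicted.
    static boolean evictBuggy(Map<String, String> writeUrlCache, String segmentId) {
        return writeUrlCache.remove("segmentId") != null;
    }

    // Mirrors the added line: the key is the value of the segmentId variable.
    static boolean evictFixed(Map<String, String> writeUrlCache, String segmentId) {
        return writeUrlCache.remove(segmentId) != null;
    }

    // Builds a cache with a single made-up entry for demonstration.
    static Map<String, String> sampleCache() {
        Map<String, String> cache = new HashMap<>();
        cache.put("seg-1", "http://example.invalid/write");
        return cache;
    }
}
```

    With the quoted key the stale entry stays in the cache, so a reassigned
    segment would presumably keep being written to its old URL; with the
    variable the entry is dropped and can be looked up again.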
Commits on Mar 28, 2019
  1. Merge pull request internetarchive#248 from internetarchive/revert-246-upgrade-httpclient

    ato committed Mar 28, 2019
    Revert "Upgrade httpclient to 4.5.7 and handle cookies more compliantly"
Commits on Mar 22, 2019
  1. Merge pull request internetarchive#246 from ukwa/upgrade-httpclient

    nlevitt committed Mar 22, 2019
    Upgrade httpclient to 4.5.7 and handle cookies more compliantly
Commits on Mar 21, 2019
  1. Removing outdated test.

    anjackson committed Mar 21, 2019
Commits on Mar 20, 2019
  1. Disable questionable test.

    anjackson committed Mar 20, 2019
  2. Avoid deprecated flag.

    anjackson committed Mar 20, 2019
Commits on Mar 19, 2019
  1. Merge pull request internetarchive#243 from internetarchive/srcset

    adam-miller committed Mar 19, 2019
    Handle commas more compliantly when parsing srcset
  2. make TroughCrawlLogFeed use TroughClient, and...

    nlevitt committed Mar 19, 2019
    ... configure using rethinkdb url and segment id, instead of write url,
    which means it can work if the segment gets reassigned and so forth
    ***backward incompatible change***
  3. Merge pull request internetarchive#244 from mikeizbicki/patch-1

    ato committed Mar 19, 2019
    Update README.md
  4. Update README.md

    mikeizbicki committed Mar 19, 2019
    Fix typo in link
Commits on Mar 16, 2019
  1. Teach jericho extractor srcset

    ato committed Mar 16, 2019
  2. Handle commas more compliantly when parsing srcset

    ato committed Mar 16, 2019
    Commas are allowed if they're in the middle of the URL. Consequently:
    
        srcset="a,b,,c,"   => ["a,b,,c"]
        srcset="a, b,, c," => ["a", "b", "c"]
    
    They occur particularly commonly in data: URLs before the base64 value.
    
    Commas are also allowed in descriptors if they are enclosed by parens:
    
        srcset="a (b,c),d" => ["a", "d"]
    
    Spec: https://html.spec.whatwg.org/multipage/images.html#parsing-a-srcset-attribute
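
    The comma rules described above can be sketched as a small standalone
    parser. This is not the extractor code from this commit; it is a
    simplified reading of the WHATWG algorithm that keeps only the candidate
    URLs and discards descriptors. The class and method names are
    hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified srcset URL splitter, written as an illustration only.
class SrcsetSketch {

    // ASCII whitespace as defined by the HTML spec.
    private static boolean isSpace(char c) {
        return c == ' ' || c == '\t' || c == '\n' || c == '\f' || c == '\r';
    }

    /** Returns the candidate URLs of a srcset value, ignoring descriptors. */
    static List<String> parseUrls(String input) {
        List<String> urls = new ArrayList<>();
        int pos = 0;
        int len = input.length();
        while (true) {
            // Skip the whitespace and commas that separate candidates.
            while (pos < len && (isSpace(input.charAt(pos)) || input.charAt(pos) == ',')) {
                pos++;
            }
            if (pos >= len) {
                return urls;
            }

            // The URL is the next run of non-whitespace characters,
            // so commas in the middle of it are kept.
            int start = pos;
            while (pos < len && !isSpace(input.charAt(pos))) {
                pos++;
            }
            int end = pos;

            if (input.charAt(end - 1) == ',') {
                // Trailing commas end the candidate; it has no descriptors.
                while (end > start && input.charAt(end - 1) == ',') {
                    end--;
                }
            } else {
                // Skip descriptors up to the next comma outside parentheses.
                // An unclosed paren simply runs to the end of the input.
                boolean inParens = false;
                while (pos < len) {
                    char c = input.charAt(pos++);
                    if (inParens) {
                        if (c == ')') {
                            inParens = false;
                        }
                    } else if (c == '(') {
                        inParens = true;
                    } else if (c == ',') {
                        break;
                    }
                }
            }
            urls.add(input.substring(start, end));
        }
    }
}
```

    Calling SrcsetSketch.parseUrls on the three srcset values listed above
    should give the same URL lists as in the commit message.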