Add section about h3cc

anjackson committed May 29, 2018
1 parent 34f2b94 commit f7119781ee4ffcbb3de84d7529aa3499b03ba2cf
Showing with 22 additions and 0 deletions.
  1. +22 −0 README.md
@@ -99,3 +99,25 @@ Here's a quick script that builds, launches and unpauses a job using information
h.launch_job(name)
wait_for(h, name, 'unpause')
h.unpause_job(name)

## Command-line Interface

### h3cc - Heritrix3 Crawl Controller

A script for interacting with Heritrix directly to perform general crawler operations.

The ```info-xml``` command downloads the raw XML version of the job information page, which can then be filtered by other tools to extract information. For example, the Java heap status can be queried like this:

$ python agents/h3cc.py info-xml | xmlstarlet sel -t -v "//heapReport/usedBytes"

Similarly, the number of novel URLs stored in the WARCs can be determined from:

$ python agents/h3cc.py info-xml | xmlstarlet sel -t -v "//warcNovelUrls"
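If `xmlstarlet` is not available, the same filtering can be sketched in Python with the standard library. This is a minimal illustration, not part of `h3cc`: the `select_values` helper and the `SAMPLE` document are made up for the example, and the real `info-xml` output contains many more elements. Only the two element paths are taken from the commands above.

```python
import xml.etree.ElementTree as ET


def select_values(xml_text, path):
    """Return the text of every element matching an ElementTree path."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.findall(path)]


# Hypothetical, heavily trimmed stand-in for real `info-xml` output.
SAMPLE = """<job>
  <heapReport>
    <usedBytes>1024</usedBytes>
  </heapReport>
  <warcNovelUrls>42</warcNovelUrls>
</job>"""

# ElementTree uses ".//" where xmlstarlet uses "//".
print(select_values(SAMPLE, ".//heapReport/usedBytes"))  # ['1024']
print(select_values(SAMPLE, ".//warcNovelUrls"))         # ['42']
```

To use it on live data, read the output of `python agents/h3cc.py info-xml` (for example from `sys.stdin`) and pass that text in place of `SAMPLE`.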

You can query the frontier too. To see the URL queue for a given host, use a query-url corresponding to that host, e.g.

$ python agents/h3cc.py -H 192.168.99.100 -q "http://www.bbc.co.uk/" -l 5 pending-urls-from

This will show the first five URLs that are queued to be crawled next on that host. Similarly, you can ask for information about a specific URL:

$ python agents/h3cc.py -H 192.168.99.100 -q "http://www.bbc.co.uk/news" url-status
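When these frontier queries are driven from another script, it can help to assemble the argument list in one place. The sketch below is only an illustration: `build_h3cc_command` is a hypothetical helper, and it uses nothing beyond the `-H`, `-q` and `-l` options and the command names shown in the examples above.

```python
import shlex


def build_h3cc_command(host, query_url, command, limit=None):
    """Build the argument list for an h3cc frontier query (hypothetical helper)."""
    args = ["python", "agents/h3cc.py", "-H", host, "-q", query_url]
    if limit is not None:
        # -l caps the number of results, as in the pending-urls-from example.
        args += ["-l", str(limit)]
    args.append(command)
    return args


pending = build_h3cc_command(
    "192.168.99.100", "http://www.bbc.co.uk/", "pending-urls-from", limit=5
)
status = build_h3cc_command(
    "192.168.99.100", "http://www.bbc.co.uk/news", "url-status"
)

# shlex.join (Python 3.8+) renders a list as a shell-safe command line.
print(shlex.join(pending))
print(shlex.join(status))
```

The resulting lists can be passed straight to `subprocess.run`, which avoids shell quoting problems with URLs that contain characters such as `?` or `&`.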
