Sparkler

Spark-Crawler: Evolving Apache Nutch to run on Spark.

A web crawler is a bot that fetches resources from the web, typically to feed applications such as search engines and knowledge bases. Sparkler (a contraction of Spark-Crawler) is a new web crawler that draws on recent advances in distributed computing and information retrieval by combining several Apache projects, including Spark, Kafka, Lucene/Solr, and Tika, together with the pf4j plugin framework. Sparkler is an extensible, highly scalable, high-performance web crawler; it is an evolution of Apache Nutch that runs on an Apache Spark cluster.

NOTE:

Sparkler is being proposed to the Apache Incubator. Review the proposal document and provide your suggestions (proposal link to be added later).

Notable features of Sparkler:

  • Higher performance and fault tolerance: The crawl pipeline has been redesigned to take advantage of the caching and fault-tolerance capabilities of Apache Spark.
  • Complex and near-real-time analytics: The internal data structure is an indexed store powered by Apache Lucene, able to answer complex queries in near real time. Apache Solr (standalone mode for a quick start, cloud mode to scale horizontally) exposes the crawl analytics via an HTTP API. These analytics can be visualized with intuitive charts in the admin dashboard (coming soon).
  • Real-time content streaming: Optionally, Apache Kafka can be configured to stream out crawled content as soon as it becomes available (see the consumer sketch after this list).
  • JavaScript rendering: Executes the JavaScript in web pages to produce the final state of the page. Setup is easy and painless, and it scales by distributing the work across Spark. Sessions and cookies are preserved for subsequent requests to the same host.
  • Extensible plugin framework: Sparkler is designed to be modular and supports plugins to extend and customize its runtime behaviour.
  • Universal parser: Apache Tika, the popular content-detection and analysis toolkit that handles thousands of file formats, is used to discover links to outgoing web resources and to analyze fetched resources.
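
As a minimal sketch of the Kafka streaming feature above: once Kafka is enabled in Sparkler's configuration, you could tail the crawl output with Kafka's stock console consumer. The broker address and topic name below are illustrative assumptions, not documented Sparkler defaults:

# Hypothetical: watch crawled content as Sparkler publishes it.
# Adjust the broker address and topic name to match your Kafka setup;
# kafka-console-consumer.sh ships in the bin/ directory of a Kafka distribution.
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic sparkler_1 --from-beginning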

Quick Start: Running your first crawl job in minutes

To use Sparkler, install Docker and run the commands below:

# Step 0. Get this script
wget https://raw.githubusercontent.com/USCDataScience/sparkler/master/bin/dockler.sh
# Step 1. Run the script - it starts a docker container and forwards ports to the host
bash dockler.sh
# Step 2. Inject seed URLs
/data/sparkler/bin/sparkler.sh inject -id 1 -su 'http://www.bbc.com/news'
# Step 3. Start the crawl job (job id 1, top 100 URLs per iteration, 2 iterations)
/data/sparkler/bin/sparkler.sh crawl -id 1 -tn 100 -i 2
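
Since the crawl state and analytics live in Solr (the same instance that serves the dashboard on port 8983), you can also inspect a crawl directly over Solr's HTTP API. A minimal sketch with curl, assuming the crawl collection is named crawldb; verify the name in your Solr admin UI:

# Hypothetical query: fetch 5 documents from the crawl database.
# The collection name 'crawldb' is an assumption about the default setup.
curl 'http://localhost:8983/solr/crawldb/select?q=*:*&rows=5'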

Running Sparkler with a seed-URLs file:

1. Follow steps 0-1 above.
2. Create a file named seed-urls.txt, for example with the Emacs editor:
       a. emacs sparkler/bin/seed-urls.txt
       b. paste your URLs
       c. Ctrl+x Ctrl+s to save
       d. Ctrl+x Ctrl+c to quit the editor [Reference: http://mally.stanford.edu/~sr/computing/emacs.html]

* Note: You can also use the Vim or Nano editors, or simply run: echo -e "http://example1.com\nhttp://example2.com" >> seed-urls.txt

3. Inject the seed URLs with the following command:
/data/sparkler/bin/sparkler.sh inject -id 1 -sf seed-urls.txt
4. Start the crawl job as in step 3 of the quick start.

To crawl until all new URLs are exhausted, use -i -1. Example: /data/sparkler/bin/sparkler.sh crawl -id 1 -i -1
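
Putting the seed-file workflow together, here is a short sketch run inside the container started by dockler.sh. The paths, job id, and example URLs are taken from the steps above; everything else is illustrative:

# Hypothetical end-to-end run with a seed file.
echo -e "http://example1.com\nhttp://example2.com" > seed-urls.txt    # Step 2: create seeds
/data/sparkler/bin/sparkler.sh inject -id 1 -sf seed-urls.txt         # Step 3: inject them
/data/sparkler/bin/sparkler.sh crawl -id 1 -tn 100 -i -1              # Step 4: crawl until no new URLs remain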

Access the dashboard at http://localhost:8983/banana/ (the port is forwarded from the docker container). The dashboard should look like the one below:

[Dashboard screenshot]
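
If the dashboard does not load, a quick check that the port forward is working (assuming curl is available on the host):

# Expect an HTTP 200 response if Solr/Banana is reachable on the forwarded port.
curl -I http://localhost:8983/banana/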

Making Contributions:

Contact Us

Any questions or suggestions are welcome on our mailing list: irds-l@mymaillists.usc.edu. Alternatively, you may use the Slack channel to get help: http://irds.usc.edu/sparkler/#slack
