The Archives Unleashed Toolkit


The Archives Unleashed Toolkit is an open-source platform for analyzing web archives built on Apache Spark, which provides powerful tools for analytics and data processing. This toolkit is part of the Archives Unleashed Project.


Dependencies

Java

The Archives Unleashed Toolkit requires Java 8.

For macOS: we recommend OpenJDK. The easiest way to install it is with Homebrew:

brew cask install adoptopenjdk/openjdk/adoptopenjdk8

If you run into difficulties with Homebrew, installation instructions are available from AdoptOpenJDK.

On Debian-based systems, you can install Java using apt:

apt install openjdk-8-jdk

Before spark-shell can launch, JAVA_HOME must be set. If you receive an error that JAVA_HOME is not set, you need to point it to where Java is installed. On Linux, this might be export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 or on macOS it might be export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_74.jdk/Contents/Home.
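
To check your setup, something like the following should work (a minimal sketch; the java_home helper ships with macOS):

# Confirm Java 8 is on your path.
java -version

# macOS only: locate a Java 1.8 install and set JAVA_HOME from it.
export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)

# Confirm the variable is set.
echo $JAVA_HOME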

Python

If you would like to use the Archives Unleashed Toolkit with PySpark and Jupyter Notebooks, you'll need to have a modern version of Python installed. We recommend using the Anaconda Distribution. This should install Jupyter Notebook, as well as the PySpark bindings. If it doesn't, you can install either with conda install or pip install.
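
For example, if either is missing, the following should install them (a sketch; package names are the standard conda/pip ones):

# With Anaconda:
conda install pyspark jupyter

# Or with pip:
pip install pyspark jupyter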

Apache Spark

Download and unpack Apache Spark to a location of your choice.

curl -L "https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz" > spark-2.4.5-bin-hadoop2.7.tgz
tar -xvf spark-2.4.5-bin-hadoop2.7.tgz
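
To confirm Spark unpacked correctly, ask it for its version:

./spark-2.4.5-bin-hadoop2.7/bin/spark-submit --version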

Getting Started

Building Locally

Clone the repo:

git clone http://github.com/archivesunleashed/aut.git

You can then build the Archives Unleashed Toolkit:

mvn clean install
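
A successful build produces both the package jar and the UberJar ("fatjar") used in the examples below, in the target directory:

ls target/aut-*-fatjar.jar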

Archives Unleashed Toolkit with Spark Submit

The Toolkit offers a variety of extraction jobs with spark-submit. These extraction jobs have a few configuration options.

The extraction jobs follow this basic outline:

spark-submit --class io.archivesunleashed.app.CommandLineAppRunner PATH_TO_AUT_JAR --extractor EXTRACTOR --input INPUT_DIRECTORY --output OUTPUT_DIRECTORY

Additional flags include:

  • --output-format FORMAT (csv (default) or parquet; DomainGraphExtractor additionally supports graphml and gexf.)
  • --split (The extractor will put results for each input file in its own directory. Each directory name will be the name of the ARC/WARC file parsed.)
  • --partition N (The extractor will partition the RDD or DataFrame according to N before writing results. This is useful for combining all results into a single file.)

Available extraction jobs (a complete example invocation follows the list):

  • AudioInformationExtractor
  • DomainFrequencyExtractor
  • DomainGraphExtractor
  • ImageGraphExtractor
  • ImageInformationExtractor
  • PDFInformationExtractor
  • PlainTextExtractor
  • PresentationProgramInformationExtractor
  • SpreadsheetInformationExtractor
  • TextFilesInformationExtractor
  • VideoInformationExtractor
  • WebGraphExtractor
  • WebPagesExtractor
  • WordProcessorInformationExtractor
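
For example, a domain frequency extraction over a directory of WARCs might look like the following (a sketch; all paths are placeholders, and the fatjar name assumes a local build as described above):

spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /path/to/aut/target/aut-0.70.1-SNAPSHOT-fatjar.jar --extractor DomainFrequencyExtractor --input /path/to/warcs --output /path/to/output --output-format csv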

More documentation on using the Toolkit with spark-submit can be found in the Toolkit's user documentation.

Archives Unleashed Toolkit with Spark Shell

There are two options for loading the Archives Unleashed Toolkit. The advantages and disadvantages of each will depend on your setup (single machine vs. cluster):

spark-shell --help

  --jars JARS                 Comma-separated list of jars to include on the driver
                              and executor classpaths.
  --packages                  Comma-separated list of maven coordinates of jars to include
                              on the driver and executor classpaths. Will search the local
                              maven repo, then maven central and any additional remote
                              repositories given by --repositories. The format for the
                              coordinates should be groupId:artifactId:version.

As a package

Release version:

spark-shell --packages "io.archivesunleashed:aut:0.70.0"

HEAD (built locally):

spark-shell --packages "io.archivesunleashed:aut:0.70.1-SNAPSHOT"

With an UberJar

Release version:

spark-shell --jars /path/to/aut-0.70.0-fatjar.jar

HEAD (built locally):

spark-shell --jars /path/to/aut/target/aut-0.70.1-SNAPSHOT-fatjar.jar

Archives Unleashed Toolkit with PySpark

To run PySpark with the Archives Unleashed Toolkit loaded, you will need to provide PySpark with the Java/Scala package and the Python bindings. The Java/Scala package can be provided with --packages or --jars as described above. The Python bindings can be downloaded, or built locally (the zip file can be found in the target directory).

In each of the examples below, /path/to/python is listed. If you are unsure where your Python is installed, you can find it with which python.

As a package

Release version:

export PYSPARK_PYTHON=/path/to/python; export PYSPARK_DRIVER_PYTHON=/path/to/python; /path/to/spark/bin/pyspark --py-files aut-0.70.0.zip --packages "io.archivesunleashed:aut:0.70.0"

HEAD (built locally):

export PYSPARK_PYTHON=/path/to/python; export PYSPARK_DRIVER_PYTHON=/path/to/python; /path/to/spark/bin/pyspark --py-files /home/nruest/Projects/au/aut/target/aut.zip --packages "io.archivesunleashed:aut:0.70.1-SNAPSHOT"

With an UberJar

Release version:

export PYSPARK_PYTHON=/path/to/python; export PYSPARK_DRIVER_PYTHON=/path/to/python; /path/to/spark/bin/pyspark --py-files aut-0.70.0.zip --jars /path/to/aut-0.70.0-fatjar.jar

HEAD (built locally):

export PYSPARK_PYTHON=/path/to/python; export PYSPARK_DRIVER_PYTHON=/path/to/python; /path/to/spark/bin/pyspark --py-files /home/nruest/Projects/au/aut/target/aut.zip --jars /path/to/aut-0.70.1-SNAPSHOT-fatjar.jar
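
Once the PySpark shell starts, you can confirm the bindings loaded with a quick check (a minimal sketch mirroring the Jupyter example below; the sample path assumes your working directory is the aut repository):

from aut import *

# Load the sample web archives that ship with the repository.
archive = WebArchive(sc, sqlContext, "src/test/resources/warc/")

# Printing the schema confirms the Toolkit loaded correctly.
archive.webpages().printSchema()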

Archives Unleashed Toolkit with Jupyter

To run a Jupyter Notebook with the Archives Unleashed Toolkit loaded, you will need to provide PySpark with the Java/Scala package and the Python bindings. The Java/Scala package can be provided with --packages or --jars as described above. The Python bindings can be downloaded, or built locally (the zip file can be found in the target directory).

As a package

Release version:

export PYSPARK_DRIVER_PYTHON=jupyter; export PYSPARK_DRIVER_PYTHON_OPTS=notebook; /path/to/spark/bin/pyspark --py-files aut-0.70.0.zip --packages "io.archivesunleashed:aut:0.70.0"

HEAD (built locally):

export PYSPARK_DRIVER_PYTHON=jupyter; export PYSPARK_DRIVER_PYTHON_OPTS=notebook; /path/to/spark/bin/pyspark --py-files /home/nruest/Projects/au/aut/target/aut.zip --packages "io.archivesunleashed:aut:0.70.1-SNAPSHOT"

With an UberJar

Release version:

export PYSPARK_DRIVER_PYTHON=jupyter; export PYSPARK_DRIVER_PYTHON_OPTS=notebook; /path/to/spark/bin/pyspark --py-files aut-0.70.0.zip --jars /path/to/aut-0.70.0-fatjar.jar

HEAD (built locally):

export PYSPARK_DRIVER_PYTHON=jupyter; export PYSPARK_DRIVER_PYTHON_OPTS=notebook; /path/to/spark/bin/pyspark --py-files /home/nruest/Projects/au/aut/target/aut.zip --jars /path/to/aut-0.70.1-SNAPSHOT-fatjar.jar

A Jupyter Notebook should automatically load in your browser at http://localhost:8888. You may be asked for a token on first launch, which adds a bit of security. The token is printed in the launch output and will look something like this:

[I 19:18:30.893 NotebookApp] Writing notebook server cookie secret to /run/user/1001/jupyter/notebook_cookie_secret
[I 19:18:31.111 NotebookApp] JupyterLab extension loaded from /home/nruest/bin/anaconda3/lib/python3.7/site-packages/jupyterlab
[I 19:18:31.111 NotebookApp] JupyterLab application directory is /home/nruest/bin/anaconda3/share/jupyter/lab
[I 19:18:31.112 NotebookApp] Serving notebooks from local directory: /home/nruest/Projects/au/aut
[I 19:18:31.112 NotebookApp] The Jupyter Notebook is running at:
[I 19:18:31.112 NotebookApp] http://localhost:8888/?token=87e7a47c5a015cb2b846c368722ec05c1100988fd9dcfe04
[I 19:18:31.112 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 19:18:31.140 NotebookApp]

    To access the notebook, open this file in a browser:
        file:///run/user/1001/jupyter/nbserver-9702-open.html
    Or copy and paste one of these URLs:
        http://localhost:8888/?token=87e7a47c5a015cb2b846c368722ec05c1100988fd9dcfe04

Create a new notebook by clicking “New” (near the top right of the Jupyter homepage) and selecting “Python 3” from the drop-down list.

The notebook will open in a new window. In the first cell, enter:

from aut import *

archive = WebArchive(sc, sqlContext, "src/test/resources/warc/")

webpages = archive.webpages()
webpages.printSchema()

Then hit Shift+Enter, or press the play button.

If you receive no errors and the schema of the webpages DataFrame prints, you are ready to begin working with your web archives!
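
As a first analysis, you might count pages by domain. The sketch below assumes the extract_domain UDF exported by the Python bindings (part of the Matchbox UDF work noted above) and the url column of the webpages DataFrame:

from aut import *
from pyspark.sql.functions import desc

archive = WebArchive(sc, sqlContext, "src/test/resources/warc/")

# Group pages by domain and show the ten most frequent.
archive.webpages() \
  .groupBy(extract_domain("url").alias("domain")) \
  .count() \
  .sort(desc("count")) \
  .show(10)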

License

Licensed under the Apache License, Version 2.0.

Acknowledgments

This work is primarily supported by the Andrew W. Mellon Foundation. Other financial and in-kind support comes from the Social Sciences and Humanities Research Council, Compute Canada, the Ontario Ministry of Research, Innovation, and Science, York University Libraries, Start Smart Labs, and the Faculty of Arts and David R. Cheriton School of Computer Science at the University of Waterloo.

Any opinions, findings, and conclusions or recommendations expressed are those of the researchers and do not necessarily reflect the views of the sponsors.
