Various DataFrame implementation updates for documentation clean-up; Addresses #372.

- Renames .all() column HttpStatus to http_status_code
- Adds archive_filename to .all()
- Significant README updates for setup
- See also: archivesunleashed/aut-docs#39
ruebot authored and ianmilligan1 committed Jan 17, 2020
1 parent 4c6875d commit 9277e68f851741391e989035db50eeec7bd31a64
Showing with 180 additions and 26 deletions.
  1. +172 −20 README.md
  2. +1 −1 pom.xml
  3. +7 −5 src/main/scala/io/archivesunleashed/package.scala
README.md
@@ -2,8 +2,8 @@
[![Build Status](https://travis-ci.org/archivesunleashed/aut.svg?branch=master)](https://travis-ci.org/archivesunleashed/aut)
[![codecov](https://codecov.io/gh/archivesunleashed/aut/branch/master/graph/badge.svg)](https://codecov.io/gh/archivesunleashed/aut)
[![Maven Central](https://maven-badges.herokuapp.com/maven-central/io.archivesunleashed/aut/badge.svg)](https://maven-badges.herokuapp.com/maven-central/io.archivesunleashed/aut)
[![Javadoc](https://javadoc-badge.appspot.com/io.archivesunleashed/aut.svg?label=javadoc)](http://api.docs.archivesunleashed.io/0.18.0/apidocs/index.html)
[![Scaladoc](https://javadoc-badge.appspot.com/io.archivesunleashed/aut.svg?label=scaladoc)](http://api.docs.archivesunleashed.io/0.18.0/scaladocs/index.html)
[![Javadoc](https://javadoc-badge.appspot.com/io.archivesunleashed/aut.svg?label=javadoc)](http://api.docs.archivesunleashed.io/0.18.1/apidocs/index.html)
[![Scaladoc](https://javadoc-badge.appspot.com/io.archivesunleashed/aut.svg?label=scaladoc)](http://api.docs.archivesunleashed.io/0.18.1/scaladocs/index.html)
[![LICENSE](https://img.shields.io/badge/license-Apache-blue.svg?style=flat)](https://www.apache.org/licenses/LICENSE-2.0)
[![Contribution Guidelines](http://img.shields.io/badge/CONTRIBUTING-Guidelines-blue.svg)](./CONTRIBUTING.md)

@@ -13,53 +13,205 @@ The toolkit grew out of a previous project called [Warcbase](https://github.com/

+ Jimmy Lin, Ian Milligan, Jeremy Wiebe, and Alice Zhou. [Warcbase: Scalable Analytics Infrastructure for Exploring Web Archives](https://dl.acm.org/authorize.cfm?key=N46731). _ACM Journal on Computing and Cultural Heritage_, 10(4), Article 22, 2017.

## Dependencies

### Java

The Archives Unleashed Toolkit requires Java 8.

For macOS, you can find information on Java [here](https://java.com/en/download/help/mac_install.xml), or install it with [Homebrew](https://brew.sh) and then run:

```bash
brew cask install java8
```

On Debian-based systems, you can install Java using `apt`:

```bash
apt install openjdk-8-jdk
```

Before `spark-shell` can launch, `JAVA_HOME` must be set. If you receive an error that `JAVA_HOME` is not set, you need to point it to where Java is installed. On Linux, this might be `export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64`; on macOS, it might be `export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_74.jdk/Contents/Home`.

### Python

If you would like to use the Archives Unleashed Toolkit with PySpark and Jupyter Notebooks, you'll need to have a modern version of Python installed. We recommend using the [Anaconda Distribution](https://www.anaconda.com/distribution). This _should_ install Jupyter Notebook, as well as the PySpark bindings. If it doesn't, you can install either with `conda install` or `pip install`.

### Apache Spark

Download and unzip [Apache Spark](https://spark.apache.org) to a location of your choice.

```bash
curl -L "https://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz" > spark-2.4.4-bin-hadoop2.7.tgz
tar -xvf spark-2.4.4-bin-hadoop2.7.tgz
```

## Getting Started

### Building Locally

Clone the repo:

```shell
$ git clone http://github.com/archivesunleashed/aut.git
```

You can then build the Archives Unleashed Toolkit:

```shell
$ mvn clean install
```

### Archives Unleashed Toolkit with Spark Shell

There are two options for loading the Archives Unleashed Toolkit; the advantages and disadvantages of each will depend on your setup (single machine vs. cluster):

```shell
$ spark-shell --help
--jars JARS Comma-separated list of jars to include on the driver
and executor classpaths.
--packages Comma-separated list of maven coordinates of jars to include
on the driver and executor classpaths. Will search the local
maven repo, then maven central and any additional remote
repositories given by --repositories. The format for the
coordinates should be groupId:artifactId:version.
```

#### As a package

Release version:

```shell
$ spark-shell --packages "io.archivesunleashed:aut:0.18.1"
```

HEAD (built locally):

```shell
$ spark-shell --packages "io.archivesunleashed:aut:0.18.2-SNAPSHOT"
```

#### With an UberJar

Release version:

```shell
$ spark-shell --jars /path/to/aut-0.18.1-fatjar.jar
```

HEAD (built locally):

```shell
$ spark-shell --jars /path/to/aut/target/aut-0.18.2-SNAPSHOT-fatjar.jar
```
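
Once the shell launches with either option, you can confirm the toolkit is on the classpath by loading some web archives. The following is a minimal sketch rather than part of the official documentation: the path is a placeholder, and it assumes the `.all()` DataFrame method and column names described in this commit.

```scala
import io.archivesunleashed._

// Placeholder path: point this at a directory of ARC/WARC files.
val warcs = "/path/to/warcs/"

// Load the records and build the DataFrame returned by .all().
val df = RecordLoader.loadArchives(warcs, sc).all()

// A quick check that everything loaded; with the changes in this
// commit, the schema includes http_status_code and archive_filename.
df.printSchema()
```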

### Archives Unleashed Toolkit with PySpark

To run PySpark with the Archives Unleashed Toolkit loaded, you will need to provide PySpark with the Java/Scala package and the Python bindings. The Java/Scala package can be provided with `--packages` or `--jars` as described above. The Python bindings can be [downloaded](https://github.com/archivesunleashed/aut/releases/download/aut-0.18.1/aut-0.18.1.zip), or [built locally](#building-locally) (the zip file will be found in the `target` directory).

Each of the examples below uses `/path/to/python` as a placeholder. If you are unsure where your Python is installed, you can find it with `which python`.

#### As a package

Release version:

```shell
$ export PYSPARK_PYTHON=/path/to/python; export PYSPARK_DRIVER_PYTHON=/path/to/python; /path/to/spark/bin/pyspark --py-files aut-0.18.1.zip --packages "io.archivesunleashed:aut:0.18.1"
```

HEAD (built locally):

```shell
$ export PYSPARK_PYTHON=/path/to/python; export PYSPARK_DRIVER_PYTHON=/path/to/python; /path/to/spark/bin/pyspark --py-files /home/nruest/Projects/au/aut/target/aut.zip --packages "io.archivesunleashed:aut:0.18.2-SNAPSHOT"
```

#### With an UberJar

Release version:

```shell
$ export PYSPARK_PYTHON=/path/to/python; export PYSPARK_DRIVER_PYTHON=/path/to/python; /path/to/spark/bin/pyspark --py-files aut-0.18.1.zip --jars /path/to/aut-0.18.1-fatjar.jar
```

HEAD (built locally):

```shell
$ export PYSPARK_PYTHON=/path/to/python; export PYSPARK_DRIVER_PYTHON=/path/to/python; /path/to/spark/bin/pyspark --py-files /home/nruest/Projects/au/aut/target/aut.zip --jars /path/to/aut-0.18.2-SNAPSHOT-fatjar.jar
```

### Archives Unleashed Toolkit with Jupyter

To run a [Jupyter Notebook](https://jupyter.org/install) with the Archives Unleashed Toolkit loaded, you will need to provide PySpark with the Java/Scala package and the Python bindings. The Java/Scala package can be provided with `--packages` or `--jars` as described above. The Python bindings can be [downloaded](https://github.com/archivesunleashed/aut/releases/download/aut-0.18.1/aut-0.18.1.zip), or [built locally](#building-locally) (the zip file will be found in the `target` directory).

#### As a package

Release version:

```shell
$ export PYSPARK_DRIVER_PYTHON=jupyter; export PYSPARK_DRIVER_PYTHON_OPTS=notebook; /path/to/spark/bin/pyspark --py-files aut-0.18.1.zip --packages "io.archivesunleashed:aut:0.18.1"
```

HEAD (built locally):

```shell
$ export PYSPARK_DRIVER_PYTHON=jupyter; export PYSPARK_DRIVER_PYTHON_OPTS=notebook; /path/to/spark/bin/pyspark --py-files /home/nruest/Projects/au/aut/target/aut.zip --packages "io.archivesunleashed:aut:0.18.2-SNAPSHOT"
```

#### With an UberJar

Release version:

```shell
$ export PYSPARK_DRIVER_PYTHON=jupyter; export PYSPARK_DRIVER_PYTHON_OPTS=notebook; /path/to/spark/bin/pyspark --py-files aut-0.18.1.zip --jars /path/to/aut-0.18.1-fatjar.jar
```

HEAD (built locally):

```shell
$ export PYSPARK_DRIVER_PYTHON=jupyter; export PYSPARK_DRIVER_PYTHON_OPTS=notebook; /path/to/spark/bin/pyspark --py-files /home/nruest/Projects/au/aut/target/aut.zip --jars /path/to/aut-0.18.2-SNAPSHOT-fatjar.jar
```

A Jupyter Notebook _should_ automatically load in your browser at <http://localhost:8888>. You may be asked for a token on first launch (a basic security measure); it appears in the startup output and will look something like this:

```
[I 19:18:30.893 NotebookApp] Writing notebook server cookie secret to /run/user/1001/jupyter/notebook_cookie_secret
[I 19:18:31.111 NotebookApp] JupyterLab extension loaded from /home/nruest/bin/anaconda3/lib/python3.7/site-packages/jupyterlab
[I 19:18:31.111 NotebookApp] JupyterLab application directory is /home/nruest/bin/anaconda3/share/jupyter/lab
[I 19:18:31.112 NotebookApp] Serving notebooks from local directory: /home/nruest/Projects/au/aut
[I 19:18:31.112 NotebookApp] The Jupyter Notebook is running at:
[I 19:18:31.112 NotebookApp] http://localhost:8888/?token=87e7a47c5a015cb2b846c368722ec05c1100988fd9dcfe04
[I 19:18:31.112 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 19:18:31.140 NotebookApp]
To access the notebook, open this file in a browser:
file:///run/user/1001/jupyter/nbserver-9702-open.html
Or copy and paste one of these URLs:
http://localhost:8888/?token=87e7a47c5a015cb2b846c368722ec05c1100988fd9dcfe04
```

Create a new notebook by clicking “New” (near the top right of the Jupyter homepage) and selecting “Python 3” from the drop-down list.

The notebook will open in a new window. In the first cell enter:

```python
from aut import *
archive = WebArchive(sc, sqlContext, "src/test/resources/warc/")
webpages = archive.webpages()
webpages.printSchema()
```

Then hit <kbd>Shift</kbd>+<kbd>Enter</kbd>, or press the play button.

If you receive no errors and see the following, you are ready to begin working with your web archives!

![](https://user-images.githubusercontent.com/218561/63203995-42684080-c061-11e9-9361-f5e6177705ff.png)

## Documentation! Or, what can I do?

Once built or downloaded, you can follow the basic set of recipes and tutorials [here](https://github.com/archivesunleashed/aut-docs/tree/master/current#the-archives-unleashed-toolkit-latest-documentation).

# License

pom.xml

@@ -5,7 +5,7 @@
<groupId>io.archivesunleashed</groupId>
<artifactId>aut</artifactId>
<packaging>jar</packaging>
<version>0.18.1-SNAPSHOT</version>
<version>0.18.2-SNAPSHOT</version>
<name>Archives Unleashed Toolkit</name>
<description>An open-source toolkit for analyzing web archives.</description>
<url>https://github.com/archivesunleashed/aut</url>
src/main/scala/io/archivesunleashed/package.scala

@@ -110,7 +110,7 @@ package object archivesunleashed {
$"url".rlike("(?i).*html$")
)
)
.filter($"HttpStatus" === 200)
.filter($"http_status_code" === 200)
}

/** Filters ArchiveRecord MimeTypes (web server).
@@ -155,7 +155,7 @@ package object archivesunleashed {
*/
def discardHttpStatusDF(statusCodes: Set[String]): DataFrame = {
val filteredHttpStatus = udf((statusCode: String) => !statusCodes.contains(statusCode))
df.filter(filteredHttpStatus($"HttpStatus"))
df.filter(filteredHttpStatus($"http_status_code"))
}

/** Filters detected content (regex).
@@ -209,7 +209,7 @@ package object archivesunleashed {
*/
def keepHttpStatusDF(statusCodes: Set[String]): DataFrame = {
val takeHttpStatus = udf((statusCode: String) => statusCodes.contains(statusCode))
df.filter(takeHttpStatus($"HttpStatus"))
df.filter(takeHttpStatus($"http_status_code"))
}

/** Removes all data that does not have selected date.
@@ -309,7 +309,8 @@ package object archivesunleashed {
Call KeepImages OR KeepValidPages on RDD depending upon the requirement before calling this method */
def all(): DataFrame = {
val records = rdd.map(r => Row(r.getCrawlDate, r.getUrl, r.getMimeType,
DetectMimeTypeTika(r.getBinaryBytes), r.getContentString, r.getBinaryBytes, r.getHttpStatus))
DetectMimeTypeTika(r.getBinaryBytes), r.getContentString,
r.getBinaryBytes, r.getHttpStatus, r.getArchiveFilename))

val schema = new StructType()
.add(StructField("crawl_date", StringType, true))
@@ -318,7 +319,8 @@ package object archivesunleashed {
.add(StructField("mime_type_tika", StringType, true))
.add(StructField("content", StringType, true))
.add(StructField("bytes", BinaryType, true))
.add(StructField("HttpStatus", StringType, true))
.add(StructField("http_status_code", StringType, true))
.add(StructField("archive_filename", StringType, true))

val sqlContext = SparkSession.builder()
sqlContext.getOrCreate().createDataFrame(records, schema)
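
For scripts that consume these DataFrames, the practical effect of this change is that filters now reference `http_status_code` rather than `HttpStatus`, and each row carries the `archive_filename` of its source file. A small, hypothetical before/after sketch (the path is a placeholder):

```scala
import io.archivesunleashed._

val df = RecordLoader.loadArchives("/path/to/warcs/", sc).all()

// Before this commit, the status column was named HttpStatus:
//   df.filter(df("HttpStatus") === 200)

// After this commit, filter on http_status_code and, optionally,
// carry the new archive_filename column through to the output.
df.filter(df("http_status_code") === 200)
  .select("url", "archive_filename")
  .show(5, false)
```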
