Various DataFrame implementation updates for documentation clean-up; …

…Addresses #372.
- .all() column HttpStatus to http_status_code
- Adds archive_filename to .all()
- Significant README updates for setup
- See also: archivesunleashed/aut-docs#39
archivesunleashed · Jan 17, 2020 · 9277e68f851741391e989035db50eeec7bd31a64 · 9277e68
1 parent 4c6875d
commit 9277e68f851741391e989035db50eeec7bd31a64

Showing with 180 additions and 26 deletions.

+172 −20 README.md

+1 −1 pom.xml

+7 −5 src/main/scala/io/archivesunleashed/package.scala
diff --git a/README.md b/README.md
@@ -2,8 +2,8 @@
 [![Build Status](https://travis-ci.org/archivesunleashed/aut.svg?branch=master)](https://travis-ci.org/archivesunleashed/aut)
 [![codecov](https://codecov.io/gh/archivesunleashed/aut/branch/master/graph/badge.svg)](https://codecov.io/gh/archivesunleashed/aut)
 [![Maven Central](https://maven-badges.herokuapp.com/maven-central/io.archivesunleashed/aut/badge.svg)](https://maven-badges.herokuapp.com/maven-central/io.archivesunleashed/aut)
-[![Javadoc](https://javadoc-badge.appspot.com/io.archivesunleashed/aut.svg?label=javadoc)](http://api.docs.archivesunleashed.io/0.18.0/apidocs/index.html)
-[![Scaladoc](https://javadoc-badge.appspot.com/io.archivesunleashed/aut.svg?label=scaladoc)](http://api.docs.archivesunleashed.io/0.18.0/scaladocs/index.html)
+[![Javadoc](https://javadoc-badge.appspot.com/io.archivesunleashed/aut.svg?label=javadoc)](http://api.docs.archivesunleashed.io/0.18.1/apidocs/index.html)
+[![Scaladoc](https://javadoc-badge.appspot.com/io.archivesunleashed/aut.svg?label=scaladoc)](http://api.docs.archivesunleashed.io/0.18.1/scaladocs/index.html)
 [![LICENSE](https://img.shields.io/badge/license-Apache-blue.svg?style=flat)](https://www.apache.org/licenses/LICENSE-2.0)
 [![Contribution Guidelines](http://img.shields.io/badge/CONTRIBUTING-Guidelines-blue.svg)](./CONTRIBUTING.md)
 
@@ -13,53 +13,205 @@ The toolkit grew out of a previous project called [Warcbase](https://github.com/
 
 + Jimmy Lin, Ian Milligan, Jeremy Wiebe, and Alice Zhou. [Warcbase: Scalable Analytics Infrastructure for Exploring Web Archives](https://dl.acm.org/authorize.cfm?key=N46731). _ACM Journal on Computing and Cultural Heritage_, 10(4), Article 22, 2017.
 
-## Getting Started
+## Dependencies
+
+### Java
 
-### Easy
+The Archives Unleashed Toolkit requires Java 8.
 
-If you have Apache Spark ready to go, it's as easy as:
+For macOS: You can find information on installing Java [here](https://java.com/en/download/help/mac_install.xml), or install [Homebrew](https://brew.sh) and then run:
 
+```bash
+brew cask install java8
 ```
-$ spark-shell --packages "io.archivesunleashed:aut:0.18.0"
+
+On Debian-based systems, you can install Java using `apt`:
+
+```bash
+apt install openjdk-8-jdk
 ```
 
-### A little less easy
+Before `spark-shell` can launch, `JAVA_HOME` must be set. If you receive an error that `JAVA_HOME` is not set, you need to point it to where Java is installed. On Linux, this might be `export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64` or on macOS it might be `export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_74.jdk/Contents/Home`.
 
-You can download the [latest release here](https://github.com/archivesunleashed/aut/releases) and include it like so:
+### Python
 
-```
-$ spark-shell --jars /path/to/aut-0.18.0-fatjar.jar"
+If you would like to use the Archives Unleashed Toolkit with PySpark and Jupyter Notebooks, you'll need to have a modern version of Python installed. We recommend using the [Anaconda Distribution](https://www.anaconda.com/distribution). This _should_ install Jupyter Notebook, as well as the PySpark bindings. If it doesn't, you can install either with `conda install` or `pip install`.
+
+### Apache Spark
+
+Download and unzip [Apache Spark](https://spark.apache.org) to a location of your choice.
+
+```bash
+curl -L "https://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz" > spark-2.4.4-bin-hadoop2.7.tgz
+tar -xvf spark-2.4.4-bin-hadoop2.7.tgz
 ```
 
-### Even less easy
+## Getting Started
 
-Build it yourself as per the instructions below:
+### Building Locally
 
 Clone the repo:
 
-```
+```shell
 $ git clone http://github.com/archivesunleashed/aut.git
 ```
 
 You can then build The Archives Unleashed Toolkit.
 
-```
+```shell
 $ mvn clean install
 ```
 
-For the impatient, to skip tests:
+### Archives Unleashed Toolkit with Spark Shell
+
+There are two options for loading the Archives Unleashed Toolkit. The advantages and disadvantages of each option depend on your setup (single machine vs. cluster):
+
+```shell
+$ spark-shell --help
+
+  --jars JARS                 Comma-separated list of jars to include on the driver
+                              and executor classpaths.
+  --packages                  Comma-separated list of maven coordinates of jars to include
+                              on the driver and executor classpaths. Will search the local
+                              maven repo, then maven central and any additional remote
+                              repositories given by --repositories. The format for the
+                              coordinates should be groupId:artifactId:version.
+```
+
+#### As a package
+
+Release version:
+
+```shell
+$ spark-shell --packages "io.archivesunleashed:aut:0.18.1"
+```
+
+HEAD (built locally):
+
+```shell
+$ spark-shell --packages "io.archivesunleashed:aut:0.18.2-SNAPSHOT"
+```
+
+#### With an UberJar
+
+Release version:
+
+```shell
+$ spark-shell --jars /path/to/aut-0.18.1-fatjar.jar
+```
+
+HEAD (built locally):
+
+```shell
+$ spark-shell --jars /path/to/aut/target/aut-0.18.2-SNAPSHOT-fatjar.jar
+```
+
+### Archives Unleashed Toolkit with PySpark
+
+To run PySpark with the Archives Unleashed Toolkit loaded, you will need to provide PySpark with the Java/Scala package and the Python bindings. The Java/Scala packages can be provided with `--packages` or `--jars` as described above. The Python bindings can be [downloaded](https://github.com/archivesunleashed/aut/releases/download/aut-0.18.1/aut-0.18.1.zip), or [built locally](#building-locally) (the zip file will be found in the `target` directory).
+
+In each of the examples below, `/path/to/python` is listed. If you are unsure where your Python is, it can be found with `which python`.
+
+#### As a package
+
+Release version:
+
+```shell
+$ export PYSPARK_PYTHON=/path/to/python; export PYSPARK_DRIVER_PYTHON=/path/to/python; /path/to/spark/bin/pyspark --py-files aut-0.18.1.zip --packages "io.archivesunleashed:aut:0.18.1"
+```
+
+HEAD (built locally):
+
+```shell
+$ export PYSPARK_PYTHON=/path/to/python; export PYSPARK_DRIVER_PYTHON=/path/to/python; /path/to/spark/bin/pyspark --py-files /home/nruest/Projects/au/aut/target/aut.zip --packages "io.archivesunleashed:aut:0.18.2-SNAPSHOT"
+```
+
+#### With an UberJar
+
+Release version:
+
+```shell
+$ export PYSPARK_PYTHON=/path/to/python; export PYSPARK_DRIVER_PYTHON=/path/to/python; /path/to/spark/bin/pyspark --py-files aut-0.18.1.zip --jars /path/to/aut-0.18.1-fatjar.jar
+```
+
+HEAD (built locally):
 
+```shell
+$ export PYSPARK_PYTHON=/path/to/python; export PYSPARK_DRIVER_PYTHON=/path/to/python; /path/to/spark/bin/pyspark --py-files /home/nruest/Projects/au/aut/target/aut.zip --jars /path/to/aut-0.18.2-SNAPSHOT-fatjar.jar
 ```
-$ mvn clean install -DskipTests
+
+### Archives Unleashed Toolkit with Jupyter
+
+To run a [Jupyter Notebook](https://jupyter.org/install) with the Archives Unleashed Toolkit loaded, you will need to provide PySpark with the Java/Scala package and the Python bindings. The Java/Scala packages can be provided with `--packages` or `--jars` as described above. The Python bindings can be [downloaded](https://github.com/archivesunleashed/aut/releases/download/aut-0.18.1/aut-0.18.1.zip), or [built locally](#building-locally) (the zip file will be found in the `target` directory).
+
+#### As a package
+
+Release version:
+
+```shell
+$ export PYSPARK_DRIVER_PYTHON=jupyter; export PYSPARK_DRIVER_PYTHON_OPTS=notebook; /path/to/spark/bin/pyspark --py-files aut-0.18.1.zip --packages "io.archivesunleashed:aut:0.18.1"
 ```
 
-### I want to use Docker!
+HEAD (built locally):
+
+```shell
+$ export PYSPARK_DRIVER_PYTHON=jupyter; export PYSPARK_DRIVER_PYTHON_OPTS=notebook; /path/to/spark/bin/pyspark --py-files /home/nruest/Projects/au/aut/target/aut.zip --packages "io.archivesunleashed:aut:0.18.2-SNAPSHOT"
+```
+
+#### With an UberJar
+
+Release version:
+
+```shell
+$ export PYSPARK_DRIVER_PYTHON=jupyter; export PYSPARK_DRIVER_PYTHON_OPTS=notebook; /path/to/spark/bin/pyspark --py-files aut-0.18.1.zip --jars /path/to/aut-0.18.1-fatjar.jar
+```
+
+HEAD (built locally):
+
+```shell
+$ export PYSPARK_DRIVER_PYTHON=jupyter; export PYSPARK_DRIVER_PYTHON_OPTS=notebook; /path/to/spark/bin/pyspark --py-files /home/nruest/Projects/au/aut/target/aut.zip --jars /path/to/aut-0.18.2-SNAPSHOT-fatjar.jar
+```
+
+A Jupyter Notebook _should_ automatically load in your browser at <http://localhost:8888>. You may be asked for a token upon first launch, which offers a bit of security. The token is printed in the terminal output when the server starts and will look something like this:
+
+```
+[I 19:18:30.893 NotebookApp] Writing notebook server cookie secret to /run/user/1001/jupyter/notebook_cookie_secret
+[I 19:18:31.111 NotebookApp] JupyterLab extension loaded from /home/nruest/bin/anaconda3/lib/python3.7/site-packages/jupyterlab
+[I 19:18:31.111 NotebookApp] JupyterLab application directory is /home/nruest/bin/anaconda3/share/jupyter/lab
+[I 19:18:31.112 NotebookApp] Serving notebooks from local directory: /home/nruest/Projects/au/aut
+[I 19:18:31.112 NotebookApp] The Jupyter Notebook is running at:
+[I 19:18:31.112 NotebookApp] http://localhost:8888/?token=87e7a47c5a015cb2b846c368722ec05c1100988fd9dcfe04
+[I 19:18:31.112 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
+[C 19:18:31.140 NotebookApp]
+
+    To access the notebook, open this file in a browser:
+        file:///run/user/1001/jupyter/nbserver-9702-open.html
+    Or copy and paste one of these URLs:
+        http://localhost:8888/?token=87e7a47c5a015cb2b846c368722ec05c1100988fd9dcfe04
+```
+
+Create a new notebook by clicking “New” (near the top right of the Jupyter homepage) and selecting “Python 3” from the drop-down list.
+
+The notebook will open in a new window. In the first cell enter:
+
+```python
+from aut import *
+
+archive = WebArchive(sc, sqlContext, "src/test/resources/warc/")
+
+webpages = archive.webpages()
+webpages.printSchema()
+```
+
+Then hit <kbd>Shift</kbd>+<kbd>Enter</kbd>, or press the play button.
+
+If you receive no errors, and see the following, you are ready to begin working with your web archives!
 
-Ok! Take a quick spin with `aut` with [Docker](https://github.com/archivesunleashed/docker-aut#use).
+![](https://user-images.githubusercontent.com/218561/63203995-42684080-c061-11e9-9361-f5e6177705ff.png)
 
-## Documentation! Or, how do I use this?
+## Documentation! Or, what can I do?
 
-Once built or downloaded, you can follow the basic set of recipes and tutorials [here](https://github.com/archivesunleashed/aut/wiki/User-Documentation).
+Once built or downloaded, you can follow the basic set of recipes and tutorials [here](https://github.com/archivesunleashed/aut-docs/tree/master/current#the-archives-unleashed-toolkit-latest-documentation).
 
 # License
 

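The PySpark invocations added to the README all follow one pattern: `--py-files` for the Python bindings, plus either `--packages` or `--jars` for the Java/Scala side. A minimal Python sketch of that pattern follows. The helper name `build_pyspark_command` is hypothetical (it is not part of aut or Spark), and the "exactly one option" check is an assumption that mirrors the README's "two options" framing rather than a Spark requirement. The flag names and example paths come from the README text.

```python
import shlex


def build_pyspark_command(spark_home, py_files, packages=None, jars=None):
    """Assemble a pyspark argument list (hypothetical helper, illustration only)."""
    # The README presents --packages and --jars as two alternatives;
    # this sketch asks the caller to pick exactly one of them.
    if (packages is None) == (jars is None):
        raise ValueError("provide exactly one of packages or jars")
    command = [f"{spark_home}/bin/pyspark", "--py-files", py_files]
    if packages is not None:
        command += ["--packages", packages]
    else:
        command += ["--jars", jars]
    return command


# Release version, loaded as a package (values taken from the README).
release = build_pyspark_command(
    "/path/to/spark",
    "aut-0.18.1.zip",
    packages="io.archivesunleashed:aut:0.18.1",
)

# Quote each argument so the printed line can be pasted into a shell.
print(" ".join(shlex.quote(arg) for arg in release))
```

Returning a list rather than a single string keeps each argument separate, which avoids quoting mistakes if the list is later handed to something like `subprocess.run`.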
diff --git a/pom.xml b/pom.xml
@@ -5,7 +5,7 @@
   <groupId>io.archivesunleashed</groupId>
   <artifactId>aut</artifactId>
   <packaging>jar</packaging>
-  <version>0.18.1-SNAPSHOT</version>
+  <version>0.18.2-SNAPSHOT</version>
   <name>Archives Unleashed Toolkit</name>
   <description>An open-source toolkit for analyzing web archives.</description>
   <url>https://github.com/archivesunleashed/aut</url>

diff --git a/src/main/scala/io/archivesunleashed/package.scala b/src/main/scala/io/archivesunleashed/package.scala
@@ -110,7 +110,7 @@ package object archivesunleashed {
                     $"url".rlike("(?i).*html$")
                   )
                )
-        .filter($"HttpStatus" === 200)
+        .filter($"http_status_code" === 200)
     }
 
     /** Filters ArchiveRecord MimeTypes (web server).
@@ -155,7 +155,7 @@ package object archivesunleashed {
       */
     def discardHttpStatusDF(statusCodes: Set[String]): DataFrame = {
       val filteredHttpStatus = udf((statusCode: String) => !statusCodes.contains(statusCode))
-      df.filter(filteredHttpStatus($"HttpStatus"))
+      df.filter(filteredHttpStatus($"http_status_code"))
     }
 
     /** Filters detected content (regex).
@@ -209,7 +209,7 @@ package object archivesunleashed {
      */
     def keepHttpStatusDF(statusCodes: Set[String]): DataFrame = {
       val takeHttpStatus = udf((statusCode: String) => statusCodes.contains(statusCode))
-      df.filter(takeHttpStatus($"HttpStatus"))
+      df.filter(takeHttpStatus($"http_status_code"))
     }
 
     /** Removes all data that does not have selected date.
@@ -309,7 +309,8 @@ package object archivesunleashed {
        Call KeepImages OR KeepValidPages on RDD depending upon the requirement before calling this method */
     def all(): DataFrame = {
       val records = rdd.map(r => Row(r.getCrawlDate, r.getUrl, r.getMimeType,
-          DetectMimeTypeTika(r.getBinaryBytes), r.getContentString, r.getBinaryBytes, r.getHttpStatus))
+          DetectMimeTypeTika(r.getBinaryBytes), r.getContentString,
+          r.getBinaryBytes, r.getHttpStatus, r.getArchiveFilename))
 
       val schema = new StructType()
         .add(StructField("crawl_date", StringType, true))
@@ -318,7 +319,8 @@ package object archivesunleashed {
         .add(StructField("mime_type_tika", StringType, true))
         .add(StructField("content", StringType, true))
         .add(StructField("bytes", BinaryType, true))
-        .add(StructField("HttpStatus", StringType, true))
+        .add(StructField("http_status_code", StringType, true))
+        .add(StructField("archive_filename", StringType, true))
 
       val sqlContext = SparkSession.builder()
       sqlContext.getOrCreate().createDataFrame(records, schema)
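
The effect of the rename can be sketched in plain Python, with no Spark required. The schema declares `http_status_code` as a `StringType`, and `keepHttpStatusDF` / `discardHttpStatusDF` take a `Set[String]`, so status codes are matched as strings. In the sketch below, the function names `keep_http_status` and `discard_http_status` and the sample rows are hypothetical stand-ins, not the toolkit's API, and only column names visible in this diff are used.

```python
# Sample rows using only column names visible in the diff; all values are made up.
rows = [
    {"crawl_date": "20200117", "http_status_code": "200", "archive_filename": "example-a.warc.gz"},
    {"crawl_date": "20200117", "http_status_code": "404", "archive_filename": "example-a.warc.gz"},
    {"crawl_date": "20200118", "http_status_code": "200", "archive_filename": "example-b.warc.gz"},
]


def keep_http_status(records, status_codes):
    """Keep rows whose status code is in status_codes (analogous to keepHttpStatusDF)."""
    return [r for r in records if r["http_status_code"] in status_codes]


def discard_http_status(records, status_codes):
    """Drop rows whose status code is in status_codes (analogous to discardHttpStatusDF)."""
    return [r for r in records if r["http_status_code"] not in status_codes]


kept = keep_http_status(rows, {"200"})
without_404 = discard_http_status(rows, {"404"})

# In this sketch, status codes are strings, so a set of integers matches nothing.
mismatched = keep_http_status(rows, {200})

print(len(kept), len(without_404), len(mismatched))
```

Any downstream code that still selects the old `HttpStatus` column name would need to switch to `http_status_code` after this commit.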