Refactoring Documentation for Explanations and Consistent Structure (#5)
- Flesh out root README with a site-wide table of contents;
- Provide some basic introduction;
- Provide some context on RDD/DF; and
- Break the large "getting started and overview" document into at least two parts.
Showing with 666 additions and 439 deletions.
- +15 −0 README.md
- 0 r0.17.0/index.md → aut-0.17.0/README.md
- 0 r0.18.0/index.md → aut-0.18.0/README.md
- 0 { → current}/Cookbook.md
- 0 { → current}/Docker-Install.md
- 0 { → current}/Home.md
- +45 −0 current/README.md
- 0 { → current}/Release-Process.md
- 0 { → current}/Toolkit-Lesson.md
- 0 { → current}/User-Documentation.md
- 0 { → current}/Using-the-Archives-Unleashed-Toolkit-with-PySpark.md
- +1 −0 current/collection-analysis.md
- +229 −0 current/filters.md
- +25 −4 current/image-analysis.md
- +2 −418 current/index.md
- +143 −0 current/install.md
- +86 −8 current/link-analysis.md
- +120 −9 current/text-analysis.md
@@ -0,0 +1,15 @@
# Archives Unleashed Toolkit: Documentation

This repository contains the documentation for the Archives Unleashed Toolkit. You're most likely looking for the [most recent documentation](current/README.md).

Documentation from previous releases is also available:

+ [aut-0.18.0](aut-0.18.0/README.md)
+ [aut-0.17.0](aut-0.17.0/README.md)

## Acknowledgments

This work is primarily supported by the [Andrew W. Mellon Foundation](https://mellon.org/). Other financial and in-kind support comes from the [Social Sciences and Humanities Research Council](http://www.sshrc-crsh.gc.ca/), [Compute Canada](https://www.computecanada.ca/), the [Ontario Ministry of Research, Innovation, and Science](https://www.ontario.ca/page/ministry-research-innovation-and-science), [York University Libraries](https://www.library.yorku.ca/web/), [Start Smart Labs](http://www.startsmartlabs.com/), and the [Faculty of Arts](https://uwaterloo.ca/arts/) and [David R. Cheriton School of Computer Science](https://cs.uwaterloo.ca/) at the [University of Waterloo](https://uwaterloo.ca/).

Any opinions, findings, and conclusions or recommendations expressed are those of the researchers and do not necessarily reflect the views of the sponsors.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
@@ -0,0 +1,45 @@
# The Archives Unleashed Toolkit: Latest Documentation

The Archives Unleashed Toolkit is an open-source platform for analyzing web archives, built on [Hadoop](https://hadoop.apache.org/). Tight integration with Hadoop provides powerful tools for analytics and data processing via [Spark](http://spark.apache.org/).

Most of this documentation is built on [resilient distributed datasets (RDDs)](https://spark.apache.org/docs/latest/rdd-programming-guide.html). We are working on adding support for [DataFrames](https://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes). You can read more about this in our experimental [DataFrames section](#dataframes) and in our [Using the Archives Unleashed Toolkit with PySpark](Using-the-Archives-Unleashed-Toolkit-with-PySpark.md) tutorial.
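
Most recipes in this documentation follow the same pattern: load one or more web archive files with `RecordLoader`, then transform the resulting RDD. As a minimal sketch (assuming a placeholder archive file `example.warc.gz` and the `sc` Spark context provided by `spark-shell`):

```scala
import io.archivesunleashed._

// Load a web archive into an RDD of archive records and count them.
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.count()
```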

If you want to learn more about [Apache Spark](https://spark.apache.org/), we highly recommend [Spark: The Definitive Guide](http://shop.oreilly.com/product/0636920034957.do).

## Table of Contents

Our documentation is divided into several main sections, which cover the Archives Unleashed Toolkit workflow from analyzing collections to understanding and working with the results.

### Getting Started

- [Installing the Archives Unleashed Toolkit](https://github.com/archivesunleashed/aut-docs-new/blob/master/current/install.md)

### Generating Results

- **[Collection Analysis](collection-analysis.md)**: How do I... (see the short sketch after this list)
  - [List URLs](collection-analysis.md#list-urls)
  - [List Top-Level Domains](collection-analysis.md#list-top-level-domains)
  - [List Different Subdomains](collection-analysis.md#list-different-subdomains)
  - [List HTTP Status Codes](collection-analysis.md#list-http-status-codes)
  - [Get the Location of the Resource in ARCs and WARCs](collection-analysis.md#get-the-location-of-the-resource-in-arcs-and-warcs)
- **[Text Analysis](https://github.com/archivesunleashed/aut-docs-new/blob/master/current/text-analysis.md)**: How do I extract all plain text, or plain text without HTTP headers; filter by domain, URL pattern, date, language, or keyword; remove boilerplate; or extract raw HTML and named entities?
- **[Link Analysis](https://github.com/archivesunleashed/aut-docs-new/blob/master/current/link-analysis.md)**: How do I extract a simple site link structure or a raw URL link structure; organize links by URL pattern or crawl date; filter by URL; or export to TSV or a Gephi file?
- **[Image Analysis](https://github.com/archivesunleashed/aut-docs-new/blob/master/current/image-analysis.md)**: How do I find the most frequent images in a collection, by URL or MD5 hash?
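
For a flavour of what these recipes look like, here is a minimal, illustrative sketch of the first collection-analysis task (listing URLs). It assumes the placeholder archive file `example.warc.gz`, the `sc` context from `spark-shell`, and the `keepValidPages()` filter and `getUrl` accessor on archive records:

```scala
import io.archivesunleashed._

// List the first ten URLs of valid pages in a web archive.
RecordLoader.loadArchives("example.warc.gz", sc)
  .keepValidPages()
  .map(r => r.getUrl)
  .take(10)
```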

### Filtering Results

- **[Filters](https://github.com/archivesunleashed/aut-docs-new/blob/master/current/filters.md)**: A variety of ways to filter results.

### What to do with Results

- **[What to do with DataFrame Results](https://github.com/archivesunleashed/aut-docs-new/blob/master/current/df-results.md)**
- **[What to do with RDD Results](https://github.com/archivesunleashed/aut-docs-new/blob/master/current/rdd-results.md)**

## Further Reading

The toolkit grew out of a previous project called [Warcbase](https://github.com/lintool/warcbase). The following article provides a nice overview, much of which is still relevant:

Jimmy Lin, Ian Milligan, Jeremy Wiebe, and Alice Zhou. [Warcbase: Scalable Analytics Infrastructure for Exploring Web Archives](https://dl.acm.org/authorize.cfm?key=N46731). *ACM Journal on Computing and Cultural Heritage*, 10(4), Article 22, 2017.

## Acknowledgments

This work is primarily supported by the [Andrew W. Mellon Foundation](https://mellon.org/). Other financial and in-kind support comes from the [Social Sciences and Humanities Research Council](http://www.sshrc-crsh.gc.ca/), [Compute Canada](https://www.computecanada.ca/), the [Ontario Ministry of Research, Innovation, and Science](https://www.ontario.ca/page/ministry-research-innovation-and-science), [York University Libraries](https://www.library.yorku.ca/web/), [Start Smart Labs](http://www.startsmartlabs.com/), and the [Faculty of Arts](https://uwaterloo.ca/arts/) and [David R. Cheriton School of Computer Science](https://cs.uwaterloo.ca/) at the [University of Waterloo](https://uwaterloo.ca/).

Any opinions, findings, and conclusions or recommendations expressed are those of the researchers and do not necessarily reflect the views of the sponsors.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
@@ -0,0 +1,229 @@
# Filters

The following filters can be used on any `RecordLoader` DataFrames or RDDs. A combined example that chains several filters together appears at the end of this page.

## Keep Images

Removes all data except images.

```scala
import io.archivesunleashed._
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.keepImages()
```

## Keep MIME Types (web server)

Removes all data except the selected MIME types, as identified by the web server.

```scala
import io.archivesunleashed._
val mimetypes = Set("text/html", "text/plain")
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.keepMimeTypes(mimetypes)
```

## Keep MIME Types (Apache Tika)

Removes all data except the selected MIME types, as identified by [Apache Tika](https://tika.apache.org/).

```scala
import io.archivesunleashed._
val mimetypes = Set("text/html", "text/plain")
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.keepMimeTypesTika(mimetypes)
```

## Keep HTTP Status

Removes all data that does not have one of the selected HTTP status codes.

```scala
import io.archivesunleashed._
val statusCodes = Set("200", "404")
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.keepHttpStatus(statusCodes)
```

## Keep Dates

Removes all data that does not match one of the selected dates.

```scala
import io.archivesunleashed._
val dates = List("2008", "200908", "20070502")
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.keepDate(dates)
```

## Keep URLs

Removes all data except the selected exact URLs.

```scala
import io.archivesunleashed._
val urls = Set("archive.org", "uwaterloo.ca", "yorku.ca")
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.keepUrls(urls)
```

## Keep URL Patterns

Removes all data except URLs matching the selected patterns (regex).

```scala
import io.archivesunleashed._
val urls = Set(".*archive.*".r, ".*sloan.*".r)
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.keepUrlPatterns(urls)
```

## Keep Domains

Removes all data except the selected source domains.

```scala
import io.archivesunleashed._
val domains = Set("www.archive.org", "www.sloan.org")
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.keepDomains(domains)
```

## Keep Languages

Removes all data not in the selected languages ([ISO 639-2 codes](https://www.loc.gov/standards/iso639-2/php/code_list.php)).

```scala
import io.archivesunleashed._
val languages = Set("en", "fr")
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.keepLanguages(languages)
```

## Keep Content

Removes all content that does not pass the regular expression test.

```scala
import io.archivesunleashed._
val content = Set(raw"UNINTELLIBLEDFSJKLS".r)
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.keepContent(content)
```

## Discard MIME Types (web server)

Filters out the selected MIME types, as identified by the web server.

```scala
import io.archivesunleashed._
val mimetypes = Set("text/html", "text/plain")
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.discardMimeTypes(mimetypes)
```

## Discard MIME Types (Apache Tika)

Filters out the selected MIME types, as identified by [Apache Tika](https://tika.apache.org/).

```scala
import io.archivesunleashed._
val mimetypes = Set("text/html", "text/plain")
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.discardMimeTypesTika(mimetypes)
```

## Discard HTTP Status

Filters out the selected HTTP status codes.

```scala
import io.archivesunleashed._
val statusCodes = Set("200", "404")
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.discardHttpStatus(statusCodes)
```

## Discard Dates

Filters out the selected dates.

```scala
import io.archivesunleashed._
val dates = List("2008", "200908", "20070502")
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.discardDate(dates)
```

## Discard URLs

Filters out the selected URLs.

```scala
import io.archivesunleashed._
val urls = Set("archive.org", "uwaterloo.ca", "yorku.ca")
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.discardUrls(urls)
```

## Discard URL Patterns

Filters out URLs matching the selected patterns (regex).

```scala
import io.archivesunleashed._
val urls = Set(".*archive.*".r, ".*sloan.*".r)
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.discardUrlPatterns(urls)
```

## Discard Domains

Filters out the selected source domains.

```scala
import io.archivesunleashed._
val domains = Set("www.archive.org", "www.sloan.org")
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.discardDomains(domains)
```

## Discard Languages

Filters out the selected languages ([ISO 639-2 codes](https://www.loc.gov/standards/iso639-2/php/code_list.php)).

```scala
import io.archivesunleashed._
val languages = Set("en", "fr")
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.discardLanguages(languages)
```

## Discard Content

Filters out all content that passes the regular expression test.

```scala
import io.archivesunleashed._
val content = Set(raw"UNINTELLIBLEDFSJKLS".r)
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.discardContent(content)
```
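
Because each of the filters above returns a filtered set of archive records, they can be combined by ordinary chaining. As a minimal sketch (using the same placeholder `example.warc.gz` and `sc` as above), the following keeps only English-language HTML records from a single domain and counts what remains:

```scala
import io.archivesunleashed._

// Chain several filters: keep only English-language HTML records
// from www.archive.org, then count the matching records.
val filtered = RecordLoader.loadArchives("example.warc.gz", sc)
  .keepMimeTypes(Set("text/html"))
  .keepDomains(Set("www.archive.org"))
  .keepLanguages(Set("en"))
filtered.count()
```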