Refactoring Documentation for Explanations and Consistent Structure (#5)

- Flesh out root README with a site-wide table of contents;
- Provide some basic introduction;
- Provide some context on RDD/DF; and
- Break the large "getting started and overview" document into at least two parts.
ianmilligan1 authored and ruebot committed Oct 21, 2019
1 parent 200b4f1 commit 95a7559841fcea09cf428547550017deb1f6df63
@@ -0,0 +1,15 @@
# Archives Unleashed Toolkit: Documentation

This repository contains the documentation for the Archives Unleashed Toolkit.
You're most likely looking for the [most recent documentation](current/README.md).

Documentation from previous releases is also available:

- [aut-0.18.0](aut-0.18.0/README.md)
- [aut-0.17.0](aut-0.17.0/README.md)

## Acknowledgments

This work is primarily supported by the [Andrew W. Mellon Foundation](https://mellon.org/). Other financial and in-kind support comes from the [Social Sciences and Humanities Research Council](http://www.sshrc-crsh.gc.ca/), [Compute Canada](https://www.computecanada.ca/), the [Ontario Ministry of Research, Innovation, and Science](https://www.ontario.ca/page/ministry-research-innovation-and-science), [York University Libraries](https://www.library.yorku.ca/web/), [Start Smart Labs](http://www.startsmartlabs.com/), and the [Faculty of Arts](https://uwaterloo.ca/arts/) and [David R. Cheriton School of Computer Science](https://cs.uwaterloo.ca/) at the [University of Waterloo](https://uwaterloo.ca/).

Any opinions, findings, and conclusions or recommendations expressed are those of the researchers and do not necessarily reflect the views of the sponsors.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
@@ -0,0 +1,45 @@
# The Archives Unleashed Toolkit: Latest Documentation

The Archives Unleashed Toolkit is an open-source platform, built on [Hadoop](https://hadoop.apache.org/), for analyzing web archives. Tight integration with Hadoop provides powerful tools for analytics and data processing via [Spark](http://spark.apache.org/).

Most of this documentation is built on [resilient distributed datasets (RDDs)](https://spark.apache.org/docs/latest/rdd-programming-guide.html). We are working on adding support for [DataFrames](https://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes). You can read more about this in our experimental [DataFrames section](#dataframes), and in our *Using the Archives Unleashed Toolkit with PySpark* tutorial.
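
To illustrate the difference, here is a minimal sketch of the two styles (assuming an `example.warc.gz` in the working directory and the `sc` context provided by `spark-shell`):

```scala
import io.archivesunleashed._
import io.archivesunleashed.df._

// RDD style: records are transformed with map/filter operations.
val urlsRdd = RecordLoader.loadArchives("example.warc.gz", sc)
  .keepValidPages()
  .map(r => r.getUrl)

// DataFrame style: valid pages are exposed as named columns.
val urlsDf = RecordLoader.loadArchives("example.warc.gz", sc)
  .extractValidPagesDF()
  .select($"Url")
```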

If you want to learn more about [Apache Spark](https://spark.apache.org/), we highly recommend [Spark: The Definitive Guide](http://shop.oreilly.com/product/0636920034957.do).

## Table of Contents

Our documentation is divided into several main sections, which cover the Archives Unleashed Toolkit workflow from analyzing collections to understanding and working with the results.

### Getting Started

- [Installing the Archives Unleashed Toolkit](https://github.com/archivesunleashed/aut-docs-new/blob/master/current/install.md)

### Generating Results
- [**Collection Analysis**](collection-analysis.md): How do I...
  - [List URLs](collection-analysis.md#list-urls)
  - [List Top-Level Domains](collection-analysis.md#list-top-level-domains)
  - [List Different Subdomains](collection-analysis.md#list-different-subdomains)
  - [List HTTP Status Codes](collection-analysis.md#list-http-status-codes)
  - [Get the Location of the Resource in ARCs and WARCs](collection-analysis.md#get-the-location-of-the-resource-in-arcs-and-warcs)
- **[Text Analysis](https://github.com/archivesunleashed/aut-docs-new/blob/master/current/text-analysis.md)**: How do I extract all plain text, or plain text without HTTP headers; filter by domain, URL pattern, date, language, or keyword; remove boilerplate; or extract raw HTML and named entities?
- **[Link Analysis](https://github.com/archivesunleashed/aut-docs-new/blob/master/current/link-analysis.md)**: How do I extract a simple site link structure or a raw URL link structure, organize links by URL pattern or crawl date, filter by URL, or export to TSV or a Gephi file?
- **[Image Analysis](https://github.com/archivesunleashed/aut-docs-new/blob/master/current/image-analysis.md)**: How do I find the most frequent images in a collection by URL or MD5 hash?

### Filtering Results
- **[Filters](https://github.com/archivesunleashed/aut-docs-new/blob/master/current/filters.md)**: A variety of ways to filter results.

### What to do with Results
- **[What to do with DataFrame Results](https://github.com/archivesunleashed/aut-docs-new/blob/master/current/df-results.md)**
- **[What to do with RDD Results](https://github.com/archivesunleashed/aut-docs-new/blob/master/current/rdd-results.md)**

## Further Reading

The toolkit grew out of a previous project called [Warcbase](https://github.com/lintool/warcbase). The following article provides a nice overview, much of which is still relevant:

Jimmy Lin, Ian Milligan, Jeremy Wiebe, and Alice Zhou. [Warcbase: Scalable Analytics Infrastructure for Exploring Web Archives](https://dl.acm.org/authorize.cfm?key=N46731). *ACM Journal on Computing and Cultural Heritage*, 10(4), Article 22, 2017.

## Acknowledgments

This work is primarily supported by the [Andrew W. Mellon Foundation](https://mellon.org/). Other financial and in-kind support comes from the [Social Sciences and Humanities Research Council](http://www.sshrc-crsh.gc.ca/), [Compute Canada](https://www.computecanada.ca/), the [Ontario Ministry of Research, Innovation, and Science](https://www.ontario.ca/page/ministry-research-innovation-and-science), [York University Libraries](https://www.library.yorku.ca/web/), [Start Smart Labs](http://www.startsmartlabs.com/), and the [Faculty of Arts](https://uwaterloo.ca/arts/) and [David R. Cheriton School of Computer Science](https://cs.uwaterloo.ca/) at the [University of Waterloo](https://uwaterloo.ca/).

Any opinions, findings, and conclusions or recommendations expressed are those of the researchers and do not necessarily reflect the views of the sponsors.
File renamed without changes.
File renamed without changes.
File renamed without changes.
@@ -30,6 +30,7 @@ What do I do with the results? See [this guide](rdd-results.md)!

```scala
import io.archivesunleashed._
import io.archivesunleashed.df._
RecordLoader.loadArchives("src/test/resources/warc/example.warc.gz", sc).extractValidPagesDF()
  .select($"Url")
```
@@ -0,0 +1,229 @@
# Filters

The following filters can be used on any `RecordLoader` DataFrames or RDDs.

## Keep Images

Removes all data except images.

```scala
import io.archivesunleashed._
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.keepImages()
```

## Keep MIME Types (web server)

Removes all data except the selected MIME types (as identified by the web server).

```scala
import io.archivesunleashed._
val mimetypes = Set("text/html", "text/plain")
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.keepMimeTypes(mimetypes)
```

## Keep MIME Types (Apache Tika)

Removes all data except the selected MIME types (as identified by [Apache Tika](https://tika.apache.org/)).

```scala
import io.archivesunleashed._
val mimetypes = Set("text/html", "text/plain")
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.keepMimeTypesTika(mimetypes)
```

## Keep HTTP Status

Removes all data that does not have one of the selected HTTP status codes.

```scala
import io.archivesunleashed._
val statusCodes = Set("200", "404")
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.keepHttpStatus(statusCodes)
```

## Keep Dates

Removes all data that does not match one of the selected dates.

```scala
import io.archivesunleashed._
val dates = List("2008", "200908", "20070502") // YYYY, YYYYMM, or YYYYMMDD
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.keepDate(dates)
```

## Keep URLs

Removes all data except the selected exact URLs.

```scala
import io.archivesunleashed._
val urls = Set("archive.org", "uwaterloo.ca", "yorku.ca")
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.keepUrls(urls)
```

## Keep URL Patterns

Removes all data except URLs matching the selected patterns (regex).

```scala
import io.archivesunleashed._
// Regular expressions matched against each record's URL.
val urls = Set(".*archive.*".r, ".*sloan.*".r)
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.keepUrlPatterns(urls)
```

## Keep Domains

Removes all data except the selected source domains.

```scala
import io.archivesunleashed._
val domains = Set("www.archive.org", "www.sloan.org")
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.keepDomains(domains)
```

## Keep Languages

Removes all data not in the selected languages ([ISO 639-2 codes](https://www.loc.gov/standards/iso639-2/php/code_list.php)).

```scala
import io.archivesunleashed._
val languages = Set("en", "fr")
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.keepLanguages(languages)
```

## Keep Content

Removes all content that does not pass the regular expression test.

```scala
import io.archivesunleashed._
val content = Set(raw"UNINTELLIBLEDFSJKLS".r)
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.keepContent(content)
```

## Discard MIME Types (web server)

Filters out records with the selected MIME types (as identified by the web server).

```scala
import io.archivesunleashed._
val mimetypes = Set("text/html", "text/plain")
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.discardMimeTypes(mimetypes)
```

## Discard MIME Types (Apache Tika)

Filters out records with the selected MIME types (as identified by [Apache Tika](https://tika.apache.org/)).

```scala
import io.archivesunleashed._
val mimetypes = Set("text/html", "text/plain")
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.discardMimeTypesTika(mimetypes)
```

## Discard HTTP Status

Filters out records with the selected HTTP status codes.

```scala
import io.archivesunleashed._
val statusCodes = Set("200", "404")
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.discardHttpStatus(statusCodes)
```

## Discard Dates

Filters out records matching the selected dates.

```scala
import io.archivesunleashed._
val dates = List("2008", "200908", "20070502") // YYYY, YYYYMM, or YYYYMMDD
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.discardDate(dates)
```

## Discard URLs

Filters out the selected URLs.

```scala
import io.archivesunleashed._
val urls = Set("archive.org", "uwaterloo.ca", "yorku.ca")
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.discardUrls(urls)
```

## Discard URL Patterns

Filters out URLs matching the selected patterns (regex).

```scala
import io.archivesunleashed._
// Regular expressions matched against each record's URL.
val urls = Set(".*archive.*".r, ".*sloan.*".r)
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.discardUrlPatterns(urls)
```

## Discard Domains

Filters out the selected source domains.

```scala
import io.archivesunleashed._
val domains = Set("www.archive.org", "www.sloan.org")
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.discardDomains(domains)
```

## Discard Languages

Filters out the selected languages ([ISO 639-2 codes](https://www.loc.gov/standards/iso639-2/php/code_list.php)).

```scala
import io.archivesunleashed._
val languages = Set("en", "fr")
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.discardLanguages(languages)
```

## Discard Content

Filters out all content that passes the regular expression test.

```scala
import io.archivesunleashed._
val content = Set(raw"UNINTELLIBLEDFSJKLS".r)
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.discardContent(content)
```
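
Filters can also be combined by chaining them, since each filter returns a filtered set of records. Here is a minimal sketch (using only filters documented above; the values are illustrative) that keeps records from selected domains that returned HTTP 200:

```scala
import io.archivesunleashed._

val domains = Set("www.archive.org", "www.sloan.org")
val statusCodes = Set("200")

// Each filter returns a filtered RDD, so filters compose by chaining.
RecordLoader.loadArchives("example.warc.gz", sc)
  .keepDomains(domains)
  .keepHttpStatus(statusCodes)
```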
@@ -1,8 +1,13 @@
# Image Analysis

The Archives Unleashed Toolkit supports image analysis, a growing area of interest within web archives.

- [Most Frequent Image URLs](#most-frequent-image-urls)
- [Most Frequent Images MD5 Hash](#most-frequent-images-md5-hash)

## Most Frequent Image URLs

### Scala RDD

The following script:

@@ -32,7 +37,15 @@ To do analysis on all images, you could thus prepend `http://web.archive.org/web

For more information on `wget`, please consult [this lesson available on the Programming Historian website](http://programminghistorian.org/lessons/automated-downloading-with-wget).

### Scala DF

TODO

### Python DF

TODO

## Most Frequent Images MD5 Hash

Some images may be identical but have different URLs. This UDF finds popular images by computing the MD5 hash of each image and ranking images by how often each hash appears. This script:

@@ -46,3 +59,11 @@ ExtractPopularImages(r, 500, sc).saveAsTextFile("500-Popular-Images")
```

Will save the 500 most popular image URLs to an output directory.
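
For reference, here is a minimal end-to-end sketch of the call shown above (assuming `ExtractPopularImages` is provided by the `io.archivesunleashed.app` package; check your release if the import differs):

```scala
import io.archivesunleashed._
import io.archivesunleashed.app._

// Load the archive, rank images by the frequency of their MD5 hashes,
// and write the top 500 to an output directory.
val r = RecordLoader.loadArchives("example.warc.gz", sc)
ExtractPopularImages(r, 500, sc).saveAsTextFile("500-Popular-Images")
```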

### Scala DF

TODO

### Python DF

TODO
