Refactoring Documentation for Explanations and Consistent Structure (#5)
- Flesh out root README with a site-wide table of contents;
- Provide some basic introduction;
- Provide some context on RDD/DF; and
- Break the large "getting started and overview" document into at least two parts.
Showing with 666 additions and 439 deletions.
- +15 −0 README.md
- 0 r0.17.0/index.md → aut-0.17.0/README.md
- 0 r0.18.0/index.md → aut-0.18.0/README.md
- 0 { → current}/Cookbook.md
- 0 { → current}/Docker-Install.md
- 0 { → current}/Home.md
- +45 −0 current/README.md
- 0 { → current}/Release-Process.md
- 0 { → current}/Toolkit-Lesson.md
- 0 { → current}/User-Documentation.md
- 0 { → current}/Using-the-Archives-Unleashed-Toolkit-with-PySpark.md
- +1 −0 current/collection-analysis.md
- +229 −0 current/filters.md
- +25 −4 current/image-analysis.md
- +2 −418 current/index.md
- +143 −0 current/install.md
- +86 −8 current/link-analysis.md
- +120 −9 current/text-analysis.md
@@ -0,0 +1,15 @@
# Archives Unleashed Toolkit: Documentation

This repository contains the documentation for the Archives Unleashed Toolkit. You're most likely looking for the [most recent documentation](current/README.md).

Documentation from previous releases is also available:

+ [aut-0.18.0](aut-0.18.0/README.md)
+ [aut-0.17.0](aut-0.17.0/README.md)

## Acknowledgments

This work is primarily supported by the [Andrew W. Mellon Foundation](https://mellon.org/). Other financial and in-kind support comes from the [Social Sciences and Humanities Research Council](http://www.sshrc-crsh.gc.ca/), [Compute Canada](https://www.computecanada.ca/), the [Ontario Ministry of Research, Innovation, and Science](https://www.ontario.ca/page/ministry-research-innovation-and-science), [York University Libraries](https://www.library.yorku.ca/web/), [Start Smart Labs](http://www.startsmartlabs.com/), and the [Faculty of Arts](https://uwaterloo.ca/arts/) and [David R. Cheriton School of Computer Science](https://cs.uwaterloo.ca/) at the [University of Waterloo](https://uwaterloo.ca/).

Any opinions, findings, and conclusions or recommendations expressed are those of the researchers and do not necessarily reflect the views of the sponsors.

@@ -0,0 +1,45 @@
# The Archives Unleashed Toolkit: Latest Documentation

The Archives Unleashed Toolkit is an open-source platform for analyzing web archives, built on [Hadoop](https://hadoop.apache.org/). Tight integration with Hadoop provides powerful tools for analytics and data processing via [Spark](http://spark.apache.org/).

Most of this documentation is built around [resilient distributed datasets (RDDs)](https://spark.apache.org/docs/latest/rdd-programming-guide.html). We are working on adding support for [DataFrames](https://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes). You can read more about this in our experimental [DataFrames section](#dataframes), and in our [Using the Archives Unleashed Toolkit with PySpark](Using-the-Archives-Unleashed-Toolkit-with-PySpark.md) tutorial.
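
For orientation, here is a minimal sketch of the RDD workflow in the Spark shell. The archive path is a placeholder, and `sc` is the SparkContext that `spark-shell` provides:

```scala
import io.archivesunleashed._

// Load a web archive into an RDD of archive records
// (the path is illustrative; point it at your own ARC/WARC files).
val records = RecordLoader.loadArchives("example.warc.gz", sc)

// A first sanity check: count the records in the archive.
records.count()
```
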
If you want to learn more about [Apache Spark](https://spark.apache.org/), we highly recommend [Spark: The Definitive Guide](http://shop.oreilly.com/product/0636920034957.do).

## Table of Contents

Our documentation is divided into several main sections, which cover the Archives Unleashed Toolkit workflow from analyzing collections to understanding and working with the results.

### Getting Started

- [Installing the Archives Unleashed Toolkit](https://github.com/archivesunleashed/aut-docs-new/blob/master/current/install.md)

### Generating Results

- [**Collection Analysis**](collection-analysis.md): How do I...
  - [List URLs](collection-analysis.md#list-urls)
  - [List Top-Level Domains](collection-analysis.md#list-top-level-domains)
  - [List Different Subdomains](collection-analysis.md#list-different-subdomains)
  - [List HTTP Status Codes](collection-analysis.md#list-http-status-codes)
  - [Get the Location of the Resource in ARCs and WARCs](collection-analysis.md#get-the-location-of-the-resource-in-arcs-and-warcs)
- **[Text Analysis](https://github.com/archivesunleashed/aut-docs-new/blob/master/current/text-analysis.md)**: How do I extract all plain text, or plain text without HTTP headers; filter by domain, URL pattern, date, language, or keyword; remove boilerplate; or extract raw HTML or named entities?
- **[Link Analysis](https://github.com/archivesunleashed/aut-docs-new/blob/master/current/link-analysis.md)**: How do I extract a simple site link structure or a raw URL link structure; organize links by URL pattern or crawl date; filter by URL; or export to a TSV or Gephi file?
- **[Image Analysis](https://github.com/archivesunleashed/aut-docs-new/blob/master/current/image-analysis.md)**: How do I find the most frequent images in a collection, by URL or by MD5 hash?

### Filtering Results

- **[Filters](https://github.com/archivesunleashed/aut-docs-new/blob/master/current/filters.md)**: A variety of ways to filter results.

### What to do with Results

- **[What to do with DataFrame Results](https://github.com/archivesunleashed/aut-docs-new/blob/master/current/df-results.md)**
- **[What to do with RDD Results](https://github.com/archivesunleashed/aut-docs-new/blob/master/current/rdd-results.md)**

## Further Reading

The toolkit grew out of a previous project called [Warcbase](https://github.com/lintool/warcbase). The following article provides a nice overview, much of which is still relevant:

Jimmy Lin, Ian Milligan, Jeremy Wiebe, and Alice Zhou. [Warcbase: Scalable Analytics Infrastructure for Exploring Web Archives](https://dl.acm.org/authorize.cfm?key=N46731). *ACM Journal on Computing and Cultural Heritage*, 10(4), Article 22, 2017.

## Acknowledgments

This work is primarily supported by the [Andrew W. Mellon Foundation](https://mellon.org/). Other financial and in-kind support comes from the [Social Sciences and Humanities Research Council](http://www.sshrc-crsh.gc.ca/), [Compute Canada](https://www.computecanada.ca/), the [Ontario Ministry of Research, Innovation, and Science](https://www.ontario.ca/page/ministry-research-innovation-and-science), [York University Libraries](https://www.library.yorku.ca/web/), [Start Smart Labs](http://www.startsmartlabs.com/), and the [Faculty of Arts](https://uwaterloo.ca/arts/) and [David R. Cheriton School of Computer Science](https://cs.uwaterloo.ca/) at the [University of Waterloo](https://uwaterloo.ca/).

Any opinions, findings, and conclusions or recommendations expressed are those of the researchers and do not necessarily reflect the views of the sponsors.

@@ -0,0 +1,229 @@
# Filters

The following filters can be used on any `RecordLoader` DataFrames or RDDs.
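
Filters are typically chained on the loaded records before running an action. Here is a minimal sketch of that pattern, using the filters documented below (the archive path and domain are placeholders):

```scala
import io.archivesunleashed._

// Load a web archive; `sc` is the SparkContext provided by spark-shell.
val r = RecordLoader.loadArchives("example.warc.gz", sc)

// Chain filters: keep only HTML records from the selected domain,
// then count how many records survive both filters.
r.keepDomains(Set("www.archive.org"))
  .keepMimeTypes(Set("text/html"))
  .count()
```
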
## Keep Images

Removes all data except images.

```scala
import io.archivesunleashed._
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.keepImages()
```

## Keep MIME Types (web server)

Removes all data but selected MIME Types (identified by the web server).

```scala
import io.archivesunleashed._
val mimetypes = Set("text/html", "text/plain")
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.keepMimeTypes(mimetypes)
```

## Keep MIME Types (Apache Tika)

Removes all data but selected MIME Types (identified by [Apache Tika](https://tika.apache.org/)).

```scala
import io.archivesunleashed._
val mimetypes = Set("text/html", "text/plain")
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.keepMimeTypesTika(mimetypes)
```

## Keep HTTP Status

Removes all data that does not have the specified HTTP status codes.

```scala
import io.archivesunleashed._
val statusCodes = Set("200", "404")
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.keepHttpStatus(statusCodes)
```

## Keep Dates

Removes all data that does not have the specified date(s).

```scala
import io.archivesunleashed._
val dates = List("2008", "200908", "20070502")
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.keepDate(dates)
```

## Keep URLs

Removes all data but the selected exact URLs.

```scala
import io.archivesunleashed._
val urls = Set("archive.org", "uwaterloo.ca", "yorku.ca")
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.keepUrls(urls)
```

## Keep URL Patterns

Removes all data but URLs matching the selected patterns (regex).

```scala
import io.archivesunleashed._
// Regular expressions matched against each record's URL.
val urls = Set("archive".r, "sloan".r)
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.keepUrlPatterns(urls)
```

## Keep Domains

Removes all data but selected source domains.

```scala
import io.archivesunleashed._
val domains = Set("www.archive.org", "www.sloan.org")
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.keepDomains(domains)
```

## Keep Languages

Removes all data not in the selected languages ([ISO 639-2 codes](https://www.loc.gov/standards/iso639-2/php/code_list.php)).

```scala
import io.archivesunleashed._
val languages = Set("en", "fr")
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.keepLanguages(languages)
```

## Keep Content

Removes all data whose content does not match the given regular expressions.

```scala
import io.archivesunleashed._
// Records whose content matches any of these patterns are kept.
val content = Set(raw"UNINTELLIBLEDFSJKLS".r)
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.keepContent(content)
```

## Discard MIME Types (web server)

Filters out the selected MIME Types (identified by the web server).

```scala
import io.archivesunleashed._
val mimetypes = Set("text/html", "text/plain")
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.discardMimeTypes(mimetypes)
```

## Discard MIME Types (Apache Tika)

Filters out the selected MIME Types (identified by [Apache Tika](https://tika.apache.org/)).

```scala
import io.archivesunleashed._
val mimetypes = Set("text/html", "text/plain")
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.discardMimeTypesTika(mimetypes)
```

## Discard HTTP Status

Filters out the selected HTTP status codes.

```scala
import io.archivesunleashed._
val statusCodes = Set("200", "404")
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.discardHttpStatus(statusCodes)
```

## Discard Dates

Filters out the selected dates.

```scala
import io.archivesunleashed._
val dates = List("2008", "200908", "20070502")
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.discardDate(dates)
```

## Discard URLs

Filters out the selected URLs.

```scala
import io.archivesunleashed._
val urls = Set("archive.org", "uwaterloo.ca", "yorku.ca")
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.discardUrls(urls)
```

## Discard URL Patterns

Filters out URLs matching the selected patterns (regex).

```scala
import io.archivesunleashed._
// Regular expressions matched against each record's URL.
val urls = Set("archive".r, "sloan".r)
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.discardUrlPatterns(urls)
```

## Discard Domains

Filters out the selected source domains.

```scala
import io.archivesunleashed._
val domains = Set("www.archive.org", "www.sloan.org")
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.discardDomains(domains)
```

## Discard Languages

Filters out the selected languages ([ISO 639-2 codes](https://www.loc.gov/standards/iso639-2/php/code_list.php)).

```scala
import io.archivesunleashed._
val languages = Set("en", "fr")
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.discardLanguages(languages)
```

## Discard Content

Filters out all data whose content matches the given regular expressions.

```scala
import io.archivesunleashed._
// Records whose content matches any of these patterns are discarded.
val content = Set(raw"UNINTELLIBLEDFSJKLS".r)
val r = RecordLoader.loadArchives("example.warc.gz", sc)
r.discardContent(content)
```