The Archives Unleashed Toolkit: Latest Documentation

The Archives Unleashed Toolkit is an open-source platform, built on Hadoop, for analyzing web archives. Tight integration with Hadoop provides powerful tools for analytics and data processing via Spark.

Most of this documentation is built on resilient distributed datasets (RDDs). We are working on adding support for DataFrames. You can read more about this in our experimental DataFrames section and in our tutorial, Using the Archives Unleashed Toolkit with PySpark.

If you want to learn more about Apache Spark, we highly recommend Spark: The Definitive Guide.

Table of Contents

Our documentation is divided into several main sections, which cover the Archives Unleashed Toolkit workflow from analyzing collections to understanding and working with the results.

Getting Started

Generating Results

Filtering Results

  • Filters: A variety of ways to filter results.

What to do with Results

Further Reading

The toolkit grew out of a previous project called Warcbase. The following article provides a nice overview, much of which is still relevant:

Jimmy Lin, Ian Milligan, Jeremy Wiebe, and Alice Zhou. Warcbase: Scalable Analytics Infrastructure for Exploring Web Archives. ACM Journal on Computing and Cultural Heritage, 10(4), Article 22, 2017.

Acknowledgments

This work is primarily supported by the Andrew W. Mellon Foundation. Other financial and in-kind support comes from the Social Sciences and Humanities Research Council, Compute Canada, the Ontario Ministry of Research, Innovation, and Science, York University Libraries, Start Smart Labs, and the Faculty of Arts and David R. Cheriton School of Computer Science at the University of Waterloo.

Any opinions, findings, and conclusions or recommendations expressed are those of the researchers and do not necessarily reflect the views of the sponsors.
