The Archives Unleashed Toolkit: Latest Documentation
The Archives Unleashed Toolkit is an open-source platform for analyzing web archives. It is built on Apache Spark, which provides powerful tools for analytics and data processing.
Most of this documentation uses resilient distributed datasets (RDDs). We are working on adding support for DataFrames. You can read more about this in our experimental DataFrames section and in our [[Using the Archives Unleashed Toolkit with PySpark]] tutorial.
If you want to learn more about Apache Spark, we highly recommend Spark: The Definitive Guide.
Table of Contents
Our documentation is divided into several main sections, which cover the Archives Unleashed Toolkit workflow from analyzing collections to understanding and working with the results.
Getting Started
- Setting up the Archives Unleashed Toolkit
- Using the Archives Unleashed Toolkit at Scale
- Archives Unleashed Toolkit Walkthrough
- Cookbook (short scripts for reference)
Generating Results
- Collection Analysis: How do I...
- Text Analysis: How do I...
  - Extract All Plain Text
  - Extract Plain Text Without HTTP Headers
  - Extract Plain Text by Domain
  - Extract Plain Text by URL Pattern
  - Extract Plain Text Minus Boilerplate
  - Extract Plain Text Filtered by Date
  - Extract Plain Text Filtered by Language
  - Extract Plain Text Filtered by Keyword
  - Extract Raw HTML
  - Extract Named Entities
- Link Analysis: How do I...
- Image Analysis: How do I...
- Binary Analysis: How do I...
Filtering Results
- Filters: A variety of ways to filter results.
What to Do with Results
Further Reading
The toolkit grew out of a previous project called Warcbase. The following article provides a nice overview, much of which is still relevant:
Jimmy Lin, Ian Milligan, Jeremy Wiebe, and Alice Zhou. Warcbase: Scalable Analytics Infrastructure for Exploring Web Archives. ACM Journal on Computing and Cultural Heritage, 10(4), Article 22, 2017.
Acknowledgments
This work is primarily supported by the Andrew W. Mellon Foundation. Other financial and in-kind support comes from the Social Sciences and Humanities Research Council, Compute Canada, the Ontario Ministry of Research, Innovation, and Science, York University Libraries, Start Smart Labs, and the Faculty of Arts and David R. Cheriton School of Computer Science at the University of Waterloo.
Any opinions, findings, and conclusions or recommendations expressed are those of the researchers and do not necessarily reflect the views of the sponsors.