About the Archives Unleashed Toolkit

The Archives Unleashed Toolkit, or AUT, applies modern big data analytics infrastructure to the scholarly analysis of web archives. Built on Hadoop, it provides powerful tools for analytics and data processing via Apache Spark. AUT grew out of the Warcbase project. You can read about (and cite!) Warcbase in a recent ACM Journal on Computing and Cultural Heritage article.

AUT is built against CDH 5.4.1:

+ Hadoop version: 2.6.0-cdh5.4.1
+ Spark version: 1.3.0-cdh5.4.1

The Hadoop ecosystem is evolving rapidly, so there may be incompatibilities with other versions.

You are currently in our documentation.

Supporting files can be found in the aut-resources repository.

Project Team

The Archives Unleashed Toolkit is brought to you by a team of researchers at the University of Waterloo and York University. Originally called "warcbase", AUT has had major contributions from a number of people.

Licensed under the Apache License, Version 2.0.

Acknowledgments and Funding

This work is primarily supported by the Andrew W. Mellon Foundation. Additional funding for the Toolkit has come from the U.S. National Science Foundation, Columbia University Library's Mellon-funded Web Archiving Incentive Award, the Natural Sciences and Engineering Research Council of Canada, the Social Sciences and Humanities Research Council of Canada, and the Ontario Ministry of Research and Innovation's Early Researcher Award program. Any opinions, findings, and conclusions or recommendations expressed are those of the researchers and do not necessarily reflect the views of the sponsors.