About the Archives Unleashed Toolkit
The Archives Unleashed Toolkit, or AUT, applies modern big data analytics infrastructure to the scholarly analysis of web archives. Built on Hadoop, it provides powerful tools for analytics and data processing via Apache Spark. AUT grew out of the Warcbase project. You can read about (and cite!) Warcbase in a recent ACM Journal of Computing and Cultural Heritage article.
AUT is built against CDH 5.4.1:
- Hadoop version: 2.6.0-cdh5.4.1
- Spark version: 1.3.0-cdh5.4.1
The Hadoop ecosystem is evolving rapidly, so there may be incompatibilities with other versions.
You are currently in our documentation.
Supporting files can be found in the aut-resources repository.
The Archives Unleashed Toolkit is brought to you by a team of researchers at the University of Waterloo and York University. Originally called "warcbase", AUT has received major contributions from a number of people, including:
- Jimmy Lin, David R. Cheriton Chair, David R. Cheriton School of Computer Science, University of Waterloo
- Ian Milligan, Associate Professor, Department of History, University of Waterloo
- Nick Ruest, Digital Assets Librarian, York University
- Ryan Deschamps, Post-Doctoral Fellow, Department of History, University of Waterloo
- Alice Zhou, Undergraduate Research Assistant, David R. Cheriton School of Computer Science, University of Waterloo
- Jeremy Wiebe, PhD Candidate, Department of History, University of Waterloo
Licensed under the Apache License, Version 2.0.
Acknowledgments and Funding
This work is primarily supported by the Andrew W. Mellon Foundation. Additional funding for the Toolkit has come from the U.S. National Science Foundation, Columbia University Library's Mellon-funded Web Archiving Incentive Award, the Natural Sciences and Engineering Research Council of Canada, the Social Sciences and Humanities Research Council of Canada, and the Ontario Ministry of Research and Innovation's Early Researcher Award program. Any opinions, findings, and conclusions or recommendations expressed are those of the researchers and do not necessarily reflect the views of the sponsors.