Getting Started with Archives Unleashed Toolkit Tutorials

Downloading AUT

The Archives Unleashed Toolkit can be downloaded as a JAR file for easy use.

The following bash commands create a working directory, then download the JAR file and an example ARC file into it. You can also download the example ARC file here.

mkdir aut
cd aut
curl -L "https://github.com/archivesunleashed/aut/releases/download/aut-0.9.0/aut-0.9.0-fatjar.jar" > aut-0.9.0-fatjar.jar
# example arc file for testing
curl -L "https://raw.githubusercontent.com/archivesunleashed/aut/master/src/test/resources/arc/example.arc.gz" > example.arc.gz
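
If a download is interrupted, curl can leave a truncated file or an HTML error page behind. The sketch below is one way to sanity-check the example ARC file before going further; the helper name is_valid_gzip is hypothetical and not part of the toolkit. It relies only on gzip's standard -t (test) option.

```shell
# Hypothetical helper (not part of AUT): succeed only if the file exists,
# is non-empty, and passes gzip's built-in integrity test.
# "gzip -t" decompresses the file without writing any output.
is_valid_gzip() {
  [ -s "$1" ] && gzip -t "$1" 2>/dev/null
}

if is_valid_gzip example.arc.gz; then
  echo "example.arc.gz looks OK"
else
  echo "example.arc.gz is missing or corrupt; download it again"
fi
```

Run it from the aut directory created above.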

Installing the Spark Shell

Download and unzip the Spark shell from the Apache Spark website.

curl -L "http://d3kbcqa49mib13.cloudfront.net/spark-1.6.1-bin-hadoop2.6.tgz" > spark-1.6.1-bin-hadoop2.6.tgz
tar -xvf spark-1.6.1-bin-hadoop2.6.tgz
cd spark-1.6.1-bin-hadoop2.6
./bin/spark-shell --jars ../aut-0.9.0-fatjar.jar

If you get the error "Failed to initialize compiler: object scala.runtime in compiler mirror not found.", the .jar file probably did not download properly. Try downloading it directly from our releases page.

You should now have the Spark shell ready and running.
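
One quick way to tell whether the .jar file downloaded properly is to look at its first two bytes: a JAR is a ZIP archive, and ZIP files begin with the characters "PK". The sketch below assumes you are still in the Spark directory, so the jar sits one level up. The helper name looks_like_jar is hypothetical. This only catches obvious failures such as an empty file or a saved HTML error page, not every kind of corruption.

```shell
# Hypothetical helper (not part of AUT): succeed only if the file is
# non-empty and starts with the ZIP magic bytes "PK".
looks_like_jar() {
  [ -s "$1" ] && [ "$(head -c 2 "$1")" = "PK" ]
}

if looks_like_jar ../aut-0.9.0-fatjar.jar; then
  echo "jar looks OK"
else
  echo "jar is missing or corrupt; download it again"
fi
```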


Welcome to
  ____              __
 / __/__  ___ _____/ /__
 _\ \/ _ \/ _ `/ __/  '_/
/___/ .__/\_,_/_/ /_/\_\   version 1.6.1
   /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_72)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.

scala> 

If you recently upgraded Mac OS X, the Java version used in your terminal may be out of date. If so, change the path in your ~/.bash_profile file to point to the latest version.
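
As a rough sketch of that change, the lines below could go in ~/.bash_profile. They assume a Java 1.8 JDK is installed (matching the banner above) and use the macOS utility /usr/libexec/java_home to locate it. The helper name use_jdk is hypothetical, and on other systems the block does nothing.

```shell
# Hypothetical helper: point JAVA_HOME at a JDK and put its bin directory
# first on PATH so that "java" resolves to that JDK.
use_jdk() {
  export JAVA_HOME="$1"
  export PATH="$JAVA_HOME/bin:$PATH"
}

# macOS only: java_home prints the home directory of an installed JDK.
# Assumption: a 1.8 JDK is installed; if not, nothing is changed.
if [ -x /usr/libexec/java_home ] && jdk="$(/usr/libexec/java_home -v 1.8 2>/dev/null)"; then
  use_jdk "$jdk"
fi
```

After opening a new terminal, java -version shows which version is in use.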

Test the Archives Unleashed Toolkit

Type :p (short for :paste) at the scala prompt to enter paste mode.

Type or paste the following:

import io.archivesunleashed.spark.matchbox._
import io.archivesunleashed.spark.rdd.RecordRDD._

// Count the valid pages in the example archive by domain and show up to 10 results.
val r = RecordLoader.loadArchives("../example.arc.gz", sc)
  .keepValidPages()
  .map(r => ExtractDomain(r.getUrl))
  .countItems()
  .take(10)

Then press Ctrl+D to exit paste mode and run the script.

If you see:

r: Array[(String, Int)] = Array((www.archive.org,132), (deadlists.com,2), (www.hideout.com.br,1))

That means you're up and running!

You should now be able to try out the toolkit's many tutorials. As a starting point, we suggest our Filter-Analyze-Aggregate-Visualize cycle, an introductory walkthrough of how you can explore your data.

More Information