Several Basic Spark Commands
We have several basic commands that you will see again and again in these scripts.
This keeps only pages that have the MIME type text/html, end with an htm or html file extension, do not have a null crawl date, and are not a robots.txt file. When doing text or link analysis, you want to work with the HTML pages themselves.
This allows you to specify file types that you're interested in keeping. The opposite command is
This command allows you to specify the dates that you are interested in keeping. If you were dealing with a large number of WARCs and only wanted to keep files from October 10th, 2005, you would pass
.keepDate("20051010"). The opposite command is
This command allows you to specify the domains that you are interested in keeping. If you were dealing with a large number of WARCs and only wanted to keep pages from the Green Party of Canada's domain, you would pass
.keepDomains(Set("greenparty.ca")). The opposite command is
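To make the idea of set-based domain filtering concrete, here is a minimal sketch in plain Scala, with no Spark or toolkit dependency. The helpers `domainOf`, `keepByDomain`, and `dropByDomain` are hypothetical names invented for this illustration, and stripping a leading `www.` is an assumption about how domains are normalized, not a statement of how the toolkit does it.

```scala
import java.net.URI
import scala.util.Try

object DomainFilterSketch {
  // Hypothetical helper: host name of a URL with a leading "www." removed,
  // or the empty string if the URL cannot be parsed or has no host.
  def domainOf(url: String): String =
    Try(Option(new URI(url).getHost).getOrElse(""))
      .getOrElse("")
      .stripPrefix("www.")

  // Keep only URLs whose domain is in the given set.
  def keepByDomain(urls: Seq[String], domains: Set[String]): Seq[String] =
    urls.filter(u => domains.contains(domainOf(u)))

  // The complement: drop URLs whose domain is in the given set.
  def dropByDomain(urls: Seq[String], domains: Set[String]): Seq[String] =
    urls.filterNot(u => domains.contains(domainOf(u)))

  def main(args: Array[String]): Unit = {
    val urls = Seq(
      "http://www.greenparty.ca/en/platform",
      "http://greenparty.ca/",
      "http://www.liberal.ca/"
    )
    println(keepByDomain(urls, Set("greenparty.ca")))
    println(dropByDomain(urls, Set("greenparty.ca")))
  }
}
```

Because the filter takes a `Set`, adding more domains is a matter of listing them, and the keep and drop variants differ only in `filter` versus `filterNot`.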
This command is similar to the one above, but it filters on full URLs rather than just domains. The opposite command is
This command allows you to specify URL patterns for records you wish to keep. The patterns must be regular expression objects. You can generate a regular expression object by appending
.r to the end of a string. E.g.,
keepUrlPatterns(Set("http://www.archive.org/about/.*".r)) will keep all records with URLs beginning with
http://www.archive.org/about/. The opposite command is
.discardUrlPatterns(). (Remember that the dot has special meaning in a regular expression, so if you wished to keep all URLs beginning with
http://www. you would need to escape the dot by specifying
keepUrlPatterns(Set("http://www\\..*".r)).) If you want to make this case insensitive, use
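The effect of escaping the dot can be checked in plain Scala, without Spark. The sketch below is only an illustration: `keepMatching` is a hypothetical helper, not part of the toolkit, and it assumes a pattern must match the whole URL. The `(?i)` prefix is the standard Java regular expression inline flag for case-insensitive matching, which Scala's `.r` patterns accept; it is shown here as one possible approach.

```scala
import scala.util.matching.Regex

object UrlPatternSketch {
  // Hypothetical helper: keeps URLs that fully match at least one pattern.
  def keepMatching(urls: Seq[String], patterns: Set[Regex]): Seq[String] =
    urls.filter(u => patterns.exists(p => p.pattern.matcher(u).matches()))

  val urls = Seq(
    "http://www.archive.org/about/",
    "http://wwwXarchive.org/about/",
    "HTTP://WWW.ARCHIVE.ORG/about/"
  )

  // Unescaped dot after "www": matches any character, so "wwwX" slips through.
  val unescaped = keepMatching(urls, Set("http://www..*".r))

  // Escaped dot: only a literal "." after "www" is accepted.
  val escaped = keepMatching(urls, Set("http://www\\..*".r))

  // Inline (?i) flag: the same pattern, ignoring case.
  val caseInsensitive = keepMatching(urls, Set("(?i)http://www\\..*".r))

  def main(args: Array[String]): Unit = {
    println(unescaped)
    println(escaped)
    println(caseInsensitive)
  }
}
```

Note the double backslash: inside an ordinary Scala string literal, `\\.` produces the two characters `\.` that the regular expression engine reads as a literal dot.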
This allows you to keep only pages that are written in a specified language. It uses the two-letter ISO 639-1 language codes; currently it supports the following languages:
da, de, et, el, en, es, fi, fr, hu, is, it, lt, nl, no, pl, pt, ru, sv, th. If you wanted to keep only pages in French and German, you would do
.keepLanguages(Set("fr", "de")). Language detection is somewhat resource-intensive on a large collection, so run your other filters first.
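The advice to run cheaper filters first can be illustrated with ordinary Scala collections. This is a sketch under stated assumptions rather than toolkit code: `Page`, `onWantedDomain`, and `looksFrench` are hypothetical stand-ins, and the "detector" is a trivial string check that simply counts how often it is called.

```scala
object FilterOrderSketch {
  // Hypothetical record type for illustration only.
  final case class Page(url: String, text: String)

  val pages = Seq(
    Page("http://greenparty.ca/fr/accueil.html", "Bienvenue sur le site"),
    Page("http://greenparty.ca/en/home.html", "Welcome to the site"),
    Page("http://example.com/fr/index.html", "Voici le contenu"),
    Page("http://example.com/en/index.html", "Here is the content")
  )

  // Cheap check: a plain string test on the URL.
  def onWantedDomain(p: Page): Boolean = p.url.contains("greenparty.ca")

  // Applies both filters in the chosen order and reports how many times
  // the costly check was invoked.
  def run(expensiveFirst: Boolean): (Seq[Page], Int) = {
    var calls = 0
    // Stand-in for a costly language detector; counts its own invocations.
    def looksFrench(p: Page): Boolean = {
      calls += 1
      p.text.contains(" le ")
    }
    val kept =
      if (expensiveFirst) pages.filter(looksFrench).filter(onWantedDomain)
      else pages.filter(onWantedDomain).filter(looksFrench)
    (kept, calls)
  }

  def main(args: Array[String]): Unit = {
    println(run(expensiveFirst = true))
    println(run(expensiveFirst = false))
  }
}
```

Both orders keep the same pages, but putting the cheap domain check first means the costly check only sees the records that survived it.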
This command allows you to keep only pages that contain a given keyword. The opposite command is