The reality of any hands-on workshop is that things will break. We've tried our best to provide a robust environment that lets you walk through the basics of the Archives Unleashed Toolkit alongside us.
```
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/

Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)
```
Now cut and paste the following script:
* tells the program where it can find the data (in this case, the sample data that we have included in this Docker image);
* tells it only to keep the "[valid](https://github.com/archivesunleashed/aut-docs/blob/master/aut-0.50.0/filters.md#keep-valid-pages)" pages, in this case HTML data;
* tells it to `ExtractDomain`, or find the base domain of each URL: for `www.google.com/cats`, we are interested just in the domain, `www.google.com`;
* count them - how many times does `www.google.com` appear in this collection, for example;
* and display a DataFrame of the top ten!
Once it is pasted in, let's run it.
You run pasted scripts by pressing `ctrl` + `d`. Try that now.
**If you're using your own data, that's why the domain count was key!** Swap out the "liberal.ca" command above with the domain that you want to look at from your own data.
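Conceptually, the script you just ran extracts each URL's base domain and tallies how often each one appears. The same logic can be sketched in plain Scala, without Spark or the Toolkit, to make the steps concrete (the URLs below are invented sample data, not from the workshop collection):

```scala
import java.net.URI

// Hypothetical stand-ins for URLs found in a web archive.
val urls = Seq(
  "http://www.google.com/cats",
  "http://www.google.com/dogs",
  "http://www.liberal.ca/en/home"
)

// Like ExtractDomain: reduce each URL to just its host.
val domains = urls.map(u => new URI(u).getHost)

// Count how many times each domain appears, then take the most
// frequent -- the "count them and display the top ten" step.
val topDomains = domains
  .groupBy(identity)
  .map { case (domain, hits) => (domain, hits.size) }
  .toSeq
  .sortBy(-_._2)
  .take(10)

topDomains.foreach(println)
```

This is only a sketch of the idea; the Toolkit does the same work in parallel across every record in your WARC files.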
Try running the **exact same script** that you did above.
Take some time to explore the various options and variables that you can swap in and around the `.keepDomainsDF` line. Check out the [documentation](https://github.com/archivesunleashed/aut-docs/blob/master/aut-0.50.0/text-analysis.md) for some ideas.
Some options:
* **Keep URL Patterns**: Instead of domains, what if you wanted to have text relating to just a certain pattern? Substitute `.keepDomainsDF` for a command like: `.keepUrlPatternsDF(Set("(?i)http://geocities.com/EnchantedForest/.*".r))`
* **Filter by Date**: What if we just wanted data from 2006? You could add the following command after `.webpages()`: `.keepDateDF(List("2006"), "YYYY")`
* **Filter by Language**: What if you just want French-language pages? After `.keepDomainsDF` add a new line: `.keepLanguagesDF(Set("fr"))`.
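The URL pattern in the first option is an ordinary Scala regular expression, so you can sanity-check a pattern against a few sample URLs in the shell before running it over a whole collection. A small sketch (the URLs are made up for illustration):

```scala
// The same case-insensitive pattern used in the example above.
val pattern = "(?i)http://geocities.com/EnchantedForest/.*".r

// Fabricated URLs to test the pattern against.
val urls = Seq(
  "http://geocities.com/EnchantedForest/Dell/1234/",
  "HTTP://GEOCITIES.COM/enchantedforest/Glade/99/",
  "http://geocities.com/Heartland/5678/"
)

// Keep only URLs the whole pattern matches, which mirrors what a
// keep-by-URL-pattern filter does to archive records.
val kept = urls.filter(u => pattern.pattern.matcher(u).matches)

kept.foreach(println)
```

Note that `(?i)` makes the match case-insensitive, so the second URL is kept while the `Heartland` one is dropped.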
For example, if we just wanted the French-language Liberal pages, we would run:
Finally, if we want to remove the HTTP headers (say, to create some nice word clouds), we can add a final command: `RemoveHttpHeader`.
This will take a fair amount of time, even on a small amount of data. It is very computationally intensive! I often use it as an excuse to go make a cup of coffee.
When it is done, you will have results in the `/data` directory. The first line should look like:
```
{"timestamp":"20060622","url":"http://www.gca.ca/indexcms/?organizations&orgid=27","named_entities":{"persons":["Marie"],"organizations":["Green Communities Canada","Green Communities Canada News and Events Our Programs Join Green Communities Canada Downloads Privacy Policy Site Map GCA Clean North Kathie Brosemer"],"locations":["St. E. Sault","Canada"]},"digest":"sha1:3e3dc1e855b994d838564ac8d921451451a199d5"}
```
Here we can see that this website was probably talking about Sault Ste. Marie, Ontario.