Update to 0.50.0 and DataFrames where applicable. #44
Conversation
Fantastic - just two little things here. I think we should swap out one of the scripts (provided an alternative), and I'm not sure how best to signal to Docker in this walkthrough that they need to get a new image.

For example, if your files are in `/Users/ianmilligan1/desktop/data` you would run the above command like:

- `docker run --rm -it -v "/Users/ianmilligan1/desktop/data:/data" archivesunleashed/docker-aut:0.50.0`
+ `docker run --rm -it -v "/Users/ianmilligan1/desktop/data:/data" archivesunleashed/docker-aut:latest`
ianmilligan1
Feb 7, 2020
Member
When running this it pulled down an earlier version of AUT. Is there a command that they should run to force docker to find the most recent image? (When run with 0.50.0 it worked.)
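One standard way to force that - not from this thread, just stock Docker behaviour (a tag like `latest` is only re-resolved against the registry when you pull, not on every `docker run`) - is an explicit pull first:

`docker pull archivesunleashed/docker-aut:latest`

`docker run --rm -it -v "/Users/ianmilligan1/desktop/data:/data" archivesunleashed/docker-aut:latest`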
ruebot
Feb 7, 2020
Author
Member
(Though, for the datathon, they'll be using the aut-0.50.0 version of this document. So, they'll be using archivesunleashed/docker-aut:0.50.0.)
ruebot
Feb 7, 2020
Author
Member
Oh, I thought I had it set up. It's there, but it doesn't appear to be working anymore.
  .webpages()
  .keepDomainsDF(domains)
  .select($"crawl_date", ExtractDomainDF($"url").alias("domain"), $"url", RemoveHTMLDF($"content").alias("content"))
  .write.csv("/data/liberal-party-text")
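The hunk above cuts off the start of the script. A complete version could look like this sketch, which borrows the loader line from the full script quoted later in this thread and assumes `domains` is a `Set` of hostnames, as in the RDD API (the value `liberal.ca` is an illustrative guess, not quoted from the doc):

```scala
import io.archivesunleashed._
import io.archivesunleashed.df._

// Assumed example value; the hunk does not show how `domains` is defined.
val domains = Set("liberal.ca")

RecordLoader.loadArchives("/aut-resources/Sample-Data/*.gz", sc)
  .webpages()
  .keepDomainsDF(domains)
  .select($"crawl_date", ExtractDomainDF($"url").alias("domain"), $"url", RemoveHTMLDF($"content").alias("content"))
  .write.csv("/data/liberal-party-text")
```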
  .webpages()
  .keepDateDF(List("2006"), ExtractDateRDD.DateComponent.YYYY)
  .select($"crawl_date", ExtractDomainDF($"url").alias("domain"), $"url", RemoveHTMLDF($"content").alias("content"))
  .write.csv("/data/2006-text")
ianmilligan1
Feb 7, 2020
Member
This failed for me. However, this script works:
import io.archivesunleashed._
import io.archivesunleashed.df._

RecordLoader.loadArchives("/aut-resources/Sample-Data/*.gz", sc)
  .webpages()
  .keepDateDF(List("2006"), "YYYY")
  .select($"crawl_date", ExtractDomainDF($"url").as("domain"), $"url", RemoveHTMLDF(RemoveHTTPHeaderDF($"content")).as("content"))
  .write.csv("/data/2006-text")
Looks good to me. I’ll let @SamFritz review + merge - I’ve checked the code, Sam, so just flag if you see any other things in the doc.
Awesome walkthrough! Just three quick notes (that are the same for both sets of instructions):
* tells it to `ExtractDomain`, or find the base domain of each URL - i.e. for `www.google.com/cats` we are interested just in the domain, `www.google.com`;
* count them - how many times does `www.google.com` appear in this collection, for example;
- * and display the top ten!
+ * and display a DataFrame of the top ten!

Once it is pasted in, let's run it.
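The script those bullets describe is not quoted in this hunk. As a sketch only, using the `webpages()` and `ExtractDomainDF` calls that appear elsewhere in this diff plus plain Spark DataFrame operations, the count-and-display steps could look like:

```scala
import io.archivesunleashed._
import io.archivesunleashed.df._

RecordLoader.loadArchives("/aut-resources/Sample-Data/*.gz", sc)
  .webpages()
  // ExtractDomain: keep just the base domain of each URL
  // (www.google.com/cats -> www.google.com)
  .select(ExtractDomainDF($"url").alias("domain"))
  // count how many times each domain appears in the collection...
  .groupBy("domain")
  .count()
  // ...and display a DataFrame of the top ten
  .sort($"count".desc)
  .show(10)
```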
SamFritz
Feb 10, 2020
Member
In other places we've formatted as `ctrl` + `D` - makes it easier to see :)
@@ -145,13 +162,15 @@ To load this script, remember: type `paste`, copy-and-paste it into the shell, a…
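For anyone reading the thread without the doc open: `paste` here is the Spark shell's paste mode. At the `scala>` prompt you type `:paste`, the shell replies `// Entering paste mode (ctrl-D to finish)`, you paste in the whole script, and `ctrl` + `D` then compiles and runs it as a single block - which is what SamFritz's formatting note above refers to.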
@@ -262,7 +287,7 @@ This will take a fair amount of time, even on a small amount of data. It is very…
When it is done, in the /data file you will have results. The first line should look like:
SamFritz
Feb 10, 2020
Member
When it is done, in the /data file you will have results. -->
When it is done, the results will appear in the data file.
ruebot commented Feb 7, 2020
@ianmilligan1 @SamFritz here you go. Feel free to take your time. I'm on campus tomorrow.
...and if you want, feel free to push to the branch instead of waiting for me to make any changes you need.