Update to 0.50.0 and DataFrames where applicable. #44
Conversation
Fantastic - just two little things here. I think we should swap out one of the scripts (provided an alternative), and I'm not sure how best to signal to Docker in this walkthrough that they need to get a new image.

For example, if your files are in `/Users/ianmilligan1/desktop/data` you would run the above command like:

- `docker run --rm -it -v "/Users/ianmilligan1/desktop/data:/data" archivesunleashed/docker-aut:0.50.0`
+ `docker run --rm -it -v "/Users/ianmilligan1/desktop/data:/data" archivesunleashed/docker-aut:latest`
ianmilligan1
Feb 7, 2020
Member
When running this it pulled down an earlier version of AUT. Is there a command that they should run to force docker to find the most recent image? (When run with 0.50.0 it worked.)
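One standard way to force that - not from this thread, just stock Docker behaviour (a tag like `latest` is only re-resolved against the registry when you pull, not on every `docker run`) - is an explicit pull first:

`docker pull archivesunleashed/docker-aut:latest`

`docker run --rm -it -v "/Users/ianmilligan1/desktop/data:/data" archivesunleashed/docker-aut:latest`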
ruebot
Feb 7, 2020
Author
Member
(Though, for the datathon, they'll be using the aut-0.50.0 version of this document. So, they'll be using archivesunleashed/docker-aut:0.50.0.)
ruebot
Feb 7, 2020
Author
Member
Oh, I thought I had it set up. It's there, but it doesn't appear to be working anymore.
  .webpages()
  .keepDomainsDF(domains)
  .select($"crawl_date", ExtractDomainDF($"url").alias("domain"), $"url", RemoveHTMLDF($"content").alias("content"))
  .write.csv("/data/liberal-party-text")
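The hunk above cuts off the start of the script. A complete version could look like this sketch, which borrows the loader line from the full script quoted later in this thread and assumes `domains` is a `Set` of hostnames, as in the RDD API (the value `liberal.ca` is an illustrative guess, not quoted from the doc):

```scala
import io.archivesunleashed._
import io.archivesunleashed.df._

// Assumed example value; the hunk does not show how `domains` is defined.
val domains = Set("liberal.ca")

RecordLoader.loadArchives("/aut-resources/Sample-Data/*.gz", sc)
  .webpages()
  .keepDomainsDF(domains)
  .select($"crawl_date", ExtractDomainDF($"url").alias("domain"), $"url", RemoveHTMLDF($"content").alias("content"))
  .write.csv("/data/liberal-party-text")
```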
  .webpages()
  .keepDateDF(List("2006"), ExtractDateRDD.DateComponent.YYYY)
  .select($"crawl_date", ExtractDomainDF($"url").alias("domain"), $"url", RemoveHTMLDF($"content").alias("content"))
  .write.csv("/data/2006-text")
ianmilligan1
Feb 7, 2020
Member
This failed for me. However, this script works:
import io.archivesunleashed._
import io.archivesunleashed.df._

RecordLoader.loadArchives("/aut-resources/Sample-Data/*.gz", sc)
  .webpages()
  .keepDateDF(List("2006"), "YYYY")
  .select($"crawl_date", ExtractDomainDF($"url").as("domain"), $"url", RemoveHTMLDF(RemoveHTTPHeaderDF($"content")).as("content"))
  .write.csv("/data/2006-text")
Looks good to me. I’ll let @SamFritz review + merge - I’ve checked the code, Sam, so just flag if you see any other things in the doc.
Awesome walkthrough! Just three quick notes (that are the same for both sets of instructions):
* tells it to `ExtractDomain`, or find the base domain of each URL - i.e. for `www.google.com/cats` we are interested just in the domain, `www.google.com`;
* count them - how many times does `www.google.com` appear in this collection, for example;
- * and display the top ten!
+ * and display a DataFrame of the top ten!

Once it is pasted in, let's run it.
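The script those bullets describe is not quoted in this hunk. As a sketch only, using the `webpages()` and `ExtractDomainDF` calls that appear elsewhere in this diff plus plain Spark DataFrame operations, the count-and-display steps could look like:

```scala
import io.archivesunleashed._
import io.archivesunleashed.df._

RecordLoader.loadArchives("/aut-resources/Sample-Data/*.gz", sc)
  .webpages()
  // ExtractDomain: keep just the base domain of each URL
  // (www.google.com/cats -> www.google.com)
  .select(ExtractDomainDF($"url").alias("domain"))
  // count how many times each domain appears in the collection...
  .groupBy("domain")
  .count()
  // ...and display a DataFrame of the top ten
  .sort($"count".desc)
  .show(10)
```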
SamFritz
Feb 10, 2020
Member
In other places we've formatted as `ctrl` + `D` - makes it easier to see :)
@@ -145,13 +162,15 @@ To load this script, remember: type `paste`, copy-and-paste it into the shell, a…
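For anyone reading the thread without the doc open: `paste` here is the Spark shell's paste mode. At the `scala>` prompt you type `:paste`, the shell replies `// Entering paste mode (ctrl-D to finish)`, you paste in the whole script, and `ctrl` + `D` then compiles and runs it as a single block - which is what SamFritz's formatting note above refers to.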
@@ -262,7 +287,7 @@ This will take a fair amount of time, even on a small amount of data. It is very…
When it is done, in the /data file you will have results. The first line should look like:
SamFritz
Feb 10, 2020
Member
When it is done, in the /data file you will have results. -->
When it is done, the results will appear in the data file.
ruebot commented Feb 7, 2020
@ianmilligan1 @SamFritz here you go. Feel free to take your time. I'm on campus tomorrow.
...and if you want, feel free to push to the branch instead of waiting for me to make any changes you need.