
Update to 0.50.0 and DataFrames where applicable. #44

Merged
merged 3 commits into master from homework-update on Feb 10, 2020

Conversation

ruebot (Member) commented Feb 7, 2020

@ianmilligan1 @SamFritz here you go. Feel free to take your time. I'm on campus tomorrow.

...and if you want, feel free to push to the branch instead of waiting for me to make any changes you need.

ruebot requested review from ianmilligan1 and SamFritz on Feb 7, 2020
ianmilligan1 (Member) left a comment

Fantastic - just two little things here.

I think we should swap out one of the scripts (I've provided an alternative), and I'm not sure how best to signal to Docker in this walkthrough that it needs to pull a fresh `latest`. In my case it didn't get the 0.50.0 version but reused whatever version I last ran Docker with, if that makes sense.


For example, if your files are in `/Users/ianmilligan1/desktop/data` you would run the above command like:

`docker run --rm -it -v "/Users/ianmilligan1/desktop/data:/data" archivesunleashed/docker-aut:0.50.0`
`docker run --rm -it -v "/Users/ianmilligan1/desktop/data:/data" archivesunleashed/docker-aut:latest`

ianmilligan1 (Member) commented Feb 7, 2020

When running this, it pulled down an earlier version of AUT. Is there a command they should run to force Docker to fetch the most recent image? (When run with 0.50.0 it worked.)
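One way to handle this, offered as a sketch rather than something from the walkthrough itself: run an explicit pull before the run command, since `docker run` will reuse whatever copy of the `latest` tag is already cached locally.

`docker pull archivesunleashed/docker-aut:latest`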

ruebot (Author, Member) commented Feb 7, 2020

Let me see if I can set up a build trigger for aut.

ruebot (Author, Member) commented Feb 7, 2020

(Though, for the datathon, they'll be using the aut-0.50.0 version of this document. So, they'll be using archivesunleashed/docker-aut:0.50.0)

ruebot (Author, Member) commented Feb 7, 2020

Oh, I thought I had it set up. It's there, but it doesn't appear to be working anymore.


```scala
.webpages()
.keepDomainsDF(domains)
.select($"crawl_date", ExtractDomainDF($"url").alias("domain"), $"url", RemoveHTMLDF($"content").alias("content"))
.write.csv("/data/liberal-party-text")
```

ianmilligan1 (Member) commented Feb 7, 2020

The output is so much nicer with DataFrames, eh?

```scala
.webpages()
.keepDateDF(List("2006"), ExtractDateRDD.DateComponent.YYYY)
.select($"crawl_date", ExtractDomainDF($"url").alias("domain"), $"url", RemoveHTMLDF($"content").alias("content"))
.write.csv("/data/2006-text")
```

ianmilligan1 (Member) commented Feb 7, 2020

This failed for me. This script works however:

```scala
import io.archivesunleashed._
import io.archivesunleashed.df._

RecordLoader.loadArchives("/aut-resources/Sample-Data/*.gz", sc)
  .webpages()
  .keepDateDF(List("2006"), "YYYY")
  .select($"crawl_date", ExtractDomainDF($"url").as("domain"), $"url", RemoveHTMLDF(RemoveHTTPHeaderDF($"content")).as("content"))
  .write.csv("/data/2006-text")
```

ruebot (Author, Member) commented Feb 7, 2020

Oh, right. My bad.

ianmilligan1 (Member) commented Feb 8, 2020

Looks good to me. I'll let @SamFritz review and merge. I've checked the code, Sam, so just flag anything else you spot in the doc.

SamFritz (Member) left a comment

Awesome walkthrough! Just three quick notes (they're the same for both sets of instructions).

* tells it to `ExtractDomain`, or find the base domain of each URL - i.e. for `www.google.com/cats` we are interested just in the domain, `www.google.com`;
* count them - how many times does `www.google.com` appear in this collection, for example;
* and display the top ten!
* and display a DataFrame of the top ten!

Once it is pasted in, let's run it.
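For reference, a minimal sketch of what the DataFrame version of this domain-count script might look like, assembled only from functions that appear elsewhere in this thread (`webpages()`, `ExtractDomainDF`) plus standard Spark DataFrame operations; the authoritative script is the one in the walkthrough itself.

```scala
import io.archivesunleashed._
import io.archivesunleashed.df._

RecordLoader.loadArchives("/aut-resources/Sample-Data/*.gz", sc)
  .webpages()                                        // all crawled web pages
  .groupBy(ExtractDomainDF($"url").alias("domain"))  // reduce each URL to its base domain
  .count()                                           // how many times each domain appears
  .sort($"count".desc)                               // most frequent first
  .show(10, false)                                   // display the top ten
```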

SamFritz (Member) commented Feb 10, 2020

in other places we've formatted as ctrl + D - makes it easier to see :)

@@ -145,13 +162,15 @@ To load this script, remember: type `paste`, copy-and-paste it into the shell, a

SamFritz (Member) commented Feb 10, 2020

Just need to move the colon over so it reads `:paste`.

@@ -262,7 +287,7 @@ This will take a fair amount of time, even on a small amount of data. It is very
When it is done, in the /data file you will have results. The first line should look like:

SamFritz (Member) commented Feb 10, 2020

When it is done, in the /data file you will have results. -->

When it is done, the results will appear in the data file.

ianmilligan1 (Member) commented Feb 10, 2020

Maybe this should be data folder?

* tells it to `ExtractDomain`, or find the base domain of each URL - i.e. for `www.google.com/cats` we are interested just in the domain, `www.google.com`;
* count them - how many times does `www.google.com` appear in this collection, for example;
* and display the top ten!
* and display a DataFrame of the top ten!

Once it is pasted in, let's run it.

SamFritz (Member) commented Feb 10, 2020

ctrl + D

@@ -145,13 +162,15 @@ To load this script, remember: type `paste`, copy-and-paste it into the shell, a

SamFritz (Member) commented Feb 10, 2020

Just need to move the colon over so it reads `:paste`.

@@ -262,7 +287,7 @@ This will take a fair amount of time, even on a small amount of data. It is very
When it is done, in the /data file you will have results. The first line should look like:

SamFritz (Member) commented Feb 10, 2020

When it is done, in the /data file you will have results. -->

When it is done, the results will appear in the data file.

ianmilligan1 (Member) commented Feb 10, 2020

Yeah, maybe `/data` folder like above.

ianmilligan1 merged commit 928523f into master on Feb 10, 2020
ianmilligan1 deleted the homework-update branch on Feb 10, 2020