Address https://github.com/archivesunleashed/aut/issues/372 - DRAFT #39

Open · wants to merge 10 commits into base: master
Conversation

ruebot (Member) commented Jan 16, 2020

A whole bunch of updates for archivesunleashed/aut#372 (comment)

Depends on archivesunleashed/aut#406

Partially hits #29.

Resolves #22.

Needs eyes, and testing since I touched so much. I'm probably inconsistent, or have funny mess-ups. Let me know 😄

When y'all approve, I'll squash and merge with a sane commit message.

ruebot added 7 commits Jan 16, 2020
ianmilligan1 (Member) left a comment

Just did a prose review – caught a few things, @ruebot.

Obviously, in the few minutes this has been open, I haven't gone through all of the scripts to test them. Do you want me to do that? (It might take until next week, as things are a bit swamped right now.)

@@ -2,17 +2,16 @@

The Archives Unleashed Toolkit is an open-source platform for analyzing web archives built on [Apache Spark](http://spark.apache.org/), which provides powerful tools for analytics and data processing.

Most of this documentation is built on [resilient distributed datasets (RDD)](https://spark.apache.org/docs/latest/rdd-programming-guide.html). We are working on adding support for [DataFrames](https://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes). You can read more about this in our experimental [DataFrames section](#dataframes), and at our [[Using the Archives Unleashed Toolkit with PySpark]] tutorial.
This documentation is centred on a cookbook approach, providing a series of "recipes" for addressing a number of common analytics tasks, and inspiration for your own analysis. We generally provide examples for [resilient distributed datasets (RDD)](https://spark.apache.org/docs/latest/rdd-programming-guide.html) in Scala, and [DataFrames](https://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes) in both Scala and Python. We leave it up to you to choose Scala or Python flavours of Spark.

ianmilligan1 (Member) commented Jan 17, 2020:

"is centred on a cookbook approach" -> "is based on a cookbook approach"
"and inspiration for your own analysis" -> "to provide inspiration for your own analysis"

@@ -15,7 +15,7 @@ The Archives Unleashed Toolkit supports binary object types for analysis:

### Scala RDD

TODO
**Will not be implemented.**

ianmilligan1 left a comment (minimized).
```python
archive.write.csv("/path/to/export/directory/", header='true')
```

If you want to store the results with the intention to read the results back later for further processing, then use Parquet format:

ianmilligan1 (Member) commented Jan 17, 2020:

Is there a good link out on "Parquet format" to an overview of what that means for somebody who wants to dig in further?
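One reason the docs steer intermediate results toward Parquet is schema preservation: columnar formats like Parquet store each column's type, while CSV flattens every value to text. A minimal stdlib Python sketch (the `crawl_date`/`domain`/`count` columns are hypothetical, not from the aut docs) showing the type loss you get from a CSV round trip:

```python
import csv
import io

# Hypothetical analysis results: (crawl_date, domain, count) rows.
rows = [("20200115", "example.com", 42)]

# Write them out as CSV with a header row.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["crawl_date", "domain", "count"])
writer.writerows(rows)

# Read them back: every field, including the integer count,
# comes back as a string. Parquet would preserve the column type.
buf.seek(0)
record = next(csv.DictReader(buf))
print(type(record["count"]).__name__)  # -> str
```

With Spark, the same trade-off appears between `df.write.csv(...)` and `df.write.parquet(...)`: reading the Parquet output back recovers the original schema without re-inferring or re-casting types.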

@@ -4,7 +4,6 @@
- [Extract Raw URL Link Structure](#Extract-Raw-URL-Link-Structure)
- [Organize Links by URL Pattern](#Organize-Links-by-URL-Pattern)
- [Organize Links by Crawl Date](#Organize-Links-by-Crawl-Date)
- [Export as TSV](#Export-as-TSV)

ianmilligan1 left a comment (minimized).
@@ -49,3 +49,37 @@ Alternatively, if you want to save the results to disk, replace the `.take(10)`
Replace `/path/to/export/directory/` with your desired location.
Note that this is a _directory_, not a _file_.

You can also format your results to a format of your choice before passing them to `saveAsTextFile()`.

FOr example, rchive records are represented in Spark as [tuples](https://en.wikipedia.org/wiki/Tuple),

ianmilligan1 (Member) commented Jan 17, 2020:

"FOr example, rchive records" -> "For example, archive records"
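The diff above describes formatting tuple-shaped results before handing them to `saveAsTextFile()`. A small Python sketch of that idea (the tuple fields here are hypothetical, not from the aut docs); in Spark you would express the same transformation with `.map()` on the RDD before saving:

```python
# Hypothetical results shaped like (crawl_date, url, count) tuples.
results = [
    ("20200116", "http://example.com/", 12),
    ("20200117", "http://example.org/about", 3),
]

# Join each tuple's fields into a tab-separated line, the format
# you might want saveAsTextFile() to write out.
lines = ["\t".join(str(field) for field in row) for row in results]

for line in lines:
    print(line)
```
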

ruebot (Member, Author) commented Jan 17, 2020

@ianmilligan1 I tested most of them locally and just wrote the rest. They should be fine, but they definitely need to be tested. No big rush until we get closer to sending out homework for the NYC datathon.
