Address https://github.com/archivesunleashed/aut/issues/372 - DRAFT #39

Open · wants to merge 10 commits into base: master
Conversation

ruebot (Member) commented Jan 16, 2020

A whole bunch of updates for archivesunleashed/aut#372 (comment)

Depends on archivesunleashed/aut#406

Partially hits #29.

Resolves #22.

Needs eyes, and testing since I touched so much. I'm probably inconsistent, or have funny mess-ups. Let me know 😄

When y'all approve, I'll squash and merge with a sane commit message.

ruebot added 7 commits Jan 16, 2020
ianmilligan1 (Member) left a comment

Just did a prose review – caught a few things, @ruebot.

Obviously, in the few minutes this has been open, I haven't gone through all of the scripts to test them. Do you want me to do that? (It might take until next week, as things are a bit swamped right now.)

@@ -2,17 +2,16 @@

The Archives Unleashed Toolkit is an open-source platform for analyzing web archives built on [Apache Spark](http://spark.apache.org/), which provides powerful tools for analytics and data processing.

Most of this documentation is built on [resilient distributed datasets (RDD)](https://spark.apache.org/docs/latest/rdd-programming-guide.html). We are working on adding support for [DataFrames](https://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes). You can read more about this in our experimental [DataFrames section](#dataframes), and at our [[Using the Archives Unleashed Toolkit with PySpark]] tutorial.
This documentation is centred on a cookbook approach, providing a series of "recipes" for addressing a number of common analytics tasks, and inspiration for your own analysis. We generally provide examples for [resilient distributed datasets (RDD)](https://spark.apache.org/docs/latest/rdd-programming-guide.html) in Scala, and [DataFrames](https://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes) in both Scala and Python. We leave it up to you to choose Scala or Python flavours of Spark.

ianmilligan1 (Member) commented Jan 17, 2020:

"is centred on a cookbook approach" -> "is based on a cookbook approach"
"and inspiration for your own analysis" -> "to provide inspiration for your own analysis"

@@ -15,7 +15,7 @@ The Archives Unleashed Toolkit supports binary object types for analysis:

### Scala RDD

TODO
**Will not be implemented.**

ianmilligan1 left a comment (minimized).
```python
archive.write.csv("/path/to/export/directory/", header='true')
```

If you want to store the results with the intention to read the results back later for further processing, then use Parquet format:

ianmilligan1 (Member) commented Jan 17, 2020:

Is there a good link out on "Parquet format" to an overview of what that means for somebody who wants to dig in further?
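One reason the docs steer intermediate results toward Parquet is schema preservation: columnar formats like Parquet store each column's type, while CSV flattens every value to text. A minimal stdlib Python sketch (the `crawl_date`/`domain`/`count` columns are hypothetical, not from the aut docs) showing the type loss you get from a CSV round trip:

```python
import csv
import io

# Hypothetical analysis results: (crawl_date, domain, count) rows.
rows = [("20200115", "example.com", 42)]

# Write them out as CSV with a header row.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["crawl_date", "domain", "count"])
writer.writerows(rows)

# Read them back: every field, including the integer count,
# comes back as a string. Parquet would preserve the column type.
buf.seek(0)
record = next(csv.DictReader(buf))
print(type(record["count"]).__name__)  # -> str
```

With Spark, the same trade-off appears between `df.write.csv(...)` and `df.write.parquet(...)`: reading the Parquet output back recovers the original schema without re-inferring or re-casting types.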

@@ -4,7 +4,6 @@
- [Extract Raw URL Link Structure](#Extract-Raw-URL-Link-Structure)
- [Organize Links by URL Pattern](#Organize-Links-by-URL-Pattern)
- [Organize Links by Crawl Date](#Organize-Links-by-Crawl-Date)
- [Export as TSV](#Export-as-TSV)

ianmilligan1 left a comment (minimized).
@@ -49,3 +49,37 @@ Alternatively, if you want to save the results to disk, replace the `.take(10)`
Replace `/path/to/export/directory/` with your desired location.
Note that this is a _directory_, not a _file_.

You can also format your results to a format of your choice before passing them to `saveAsTextFile()`.

FOr example, rchive records are represented in Spark as [tuples](https://en.wikipedia.org/wiki/Tuple),

ianmilligan1 (Member) commented Jan 17, 2020:

"FOr example, rchive records" -> "For example, archive records"
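The diff above describes formatting tuple-shaped results before handing them to `saveAsTextFile()`. A small Python sketch of that idea (the tuple fields here are hypothetical, not from the aut docs); in Spark you would express the same transformation with `.map()` on the RDD before saving:

```python
# Hypothetical results shaped like (crawl_date, url, count) tuples.
results = [
    ("20200116", "http://example.com/", 12),
    ("20200117", "http://example.org/about", 3),
]

# Join each tuple's fields into a tab-separated line, the format
# you might want saveAsTextFile() to write out.
lines = ["\t".join(str(field) for field in row) for row in results]

for line in lines:
    print(line)
```
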

ruebot (Member, Author) commented Jan 17, 2020

@ianmilligan1 I tested most of them locally and just wrote the rest. They should be fine, but they definitely need to be tested. No big rush until we get closer to sending out homework for the NYC datathon.
