Address https://github.com/archivesunleashed/aut/issues/372 - DRAFT #39
Conversation
Just did a prose review – caught a few things, @ruebot. Obviously, in the few minutes this has been open, I haven't gone through all of the scripts to test them. Do you want me to do that? (It might take into next week, as things are a bit swamped right now.)
@@ -2,17 +2,16 @@

 The Archives Unleashed Toolkit is an open-source platform for analyzing web archives built on [Apache Spark](http://spark.apache.org/), which provides powerful tools for analytics and data processing.

-Most of this documentation is built on [resilient distributed datasets (RDD)](https://spark.apache.org/docs/latest/rdd-programming-guide.html). We are working on adding support for [DataFrames](https://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes). You can read more about this in our experimental [DataFrames section](#dataframes), and at our [[Using the Archives Unleashed Toolkit with PySpark]] tutorial.
+This documentation is centred on a cookbook approach, providing a series of "recipes" for addressing a number of common analytics tasks, and inspiration for your own analysis. We generally provide examples for [resilient distributed datasets (RDD)](https://spark.apache.org/docs/latest/rdd-programming-guide.html) in Scala, and [DataFrames](https://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes) in both Scala and Python. We leave it up to you to choose Scala or Python flavours of Spark.
This comment has been minimized.
ianmilligan1
Jan 17, 2020
Member
"is centred on a cookbook approach" -> "is based on a cookbook approach"
"and inspiration for your own analysis" -> "to provide inspiration for your own analysis"
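For readers skimming the diff: the Scala RDD flavour mentioned in the paragraph under review looks roughly like the sketch below. This is a hedged illustration rather than text from the PR – `example.arc.gz` is a placeholder path, and the calls shown (`RecordLoader.loadArchives`, `keepValidPages`, `getUrl`) follow the toolkit's documented RDD interface. It assumes a Spark shell where `sc` is already bound.

```scala
import io.archivesunleashed._

// Load a web archive file as an RDD of archive records, keep only
// valid HTML pages, and pull out the first ten page URLs.
RecordLoader.loadArchives("example.arc.gz", sc)
  .keepValidPages()
  .map(r => r.getUrl)
  .take(10)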
@@ -15,7 +15,7 @@ The Archives Unleashed Toolkit supports binary object types for analysis:

 ### Scala RDD

-TODO
+**Will not be implemented.**
This comment has been minimized.
archive.write.csv("/path/to/export/directory/", header='true')
```

If you want to store the results with the intention of reading them back later for further processing, then use Parquet format:
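A minimal sketch of that round trip, assuming a `SparkSession` named `spark` and the same `archive` DataFrame used in the CSV example above (the paths are placeholders):

```scala
// Write the results out as Parquet, a columnar format that
// preserves the DataFrame's schema alongside the data.
archive.write.parquet("/path/to/export/directory/")

// Later, read the results back for further processing; the
// schema is recovered automatically from the Parquet files.
val results = spark.read.parquet("/path/to/export/directory/")
results.printSchema()
```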
This comment has been minimized.
ianmilligan1
Jan 17, 2020
Member
Is there a good link out on "Parquet format" to an overview of what that means for somebody who wants to dig in further?
@@ -4,7 +4,6 @@
 - [Extract Raw URL Link Structure](#Extract-Raw-URL-Link-Structure)
 - [Organize Links by URL Pattern](#Organize-Links-by-URL-Pattern)
 - [Organize Links by Crawl Date](#Organize-Links-by-Crawl-Date)
-- [Export as TSV](#Export-as-TSV)
This comment has been minimized.
@@ -49,3 +49,37 @@ Alternatively, if you want to save the results to disk, replace the `.take(10)`
 Replace `/path/to/export/directory/` with your desired location.
 Note that this is a _directory_, not a _file_.

+You can also convert your results to a format of your choice before passing them to `saveAsTextFile()`.
+
+For example, archive records are represented in Spark as [tuples](https://en.wikipedia.org/wiki/Tuple),
This comment has been minimized.
@ianmilligan1 I tested most of them locally, and just wrote the rest of them. They should be fine, but definitely need to be tested. No big rush until we get closer to sending out homework for the NYC datathon.
ruebot commented Jan 16, 2020 • edited
A whole bunch of updates for archivesunleashed/aut#372 (comment)
Depends on archivesunleashed/aut#406
Partially hits #29.
Resolves #22.
Needs eyes and testing since I touched so much. I'm probably inconsistent, or have funny mess-ups. Let me know 😄
When y'all approve, I'll squash and merge with a sane commit message.