Address https://github.com/archivesunleashed/aut/issues/372 - DRAFT #39
Conversation
Just did a prose review – caught a few things, @ruebot. Obviously, in the few minutes this has been open, I haven't gone through all of the scripts to test them. Do you want me to do that? (It might take into next week, as things are a bit swamped right now.)
archive.write.csv("/path/to/export/directory/", header='true')
```

If you want to store the results with the intention of reading them back later for further processing, then use the Parquet format:
ianmilligan1
Jan 17, 2020
Member
Is there a good link out on "Parquet format" to an overview of what that means for somebody who wants to dig in further?
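For context, Apache Parquet is a columnar storage format that preserves a DataFrame's schema and column types, so results can be read straight back into Spark later without re-parsing CSV. A minimal sketch of the round trip, assuming the same `archive` DataFrame as in the CSV example above (the path and variable names are illustrative):

```scala
// Write the results out as Parquet; the schema travels with the data.
archive.write.parquet("/path/to/export/directory/")

// In a later session, read the results back into a DataFrame for further processing.
val results = spark.read.parquet("/path/to/export/directory/")
results.show(20, false)
```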
@ianmilligan1 I tested most of them locally, and just wrote the rest of them. They should be fine, but they definitely need to be tested. No big rush until we get closer to sending out homework for the NYC datathon.
Decided to bite the bullet and plow through this! Looks great, @ruebot – I've tested all the new scripts. A few errors, which I've put in the comments. Three quarters of the docs refer to
import io.archivesunleashed._
import io.archivesunleashed.df._
RecordLoader.loadArchives("example.warc.gz", sc)
.select($"crawl_date", ExtractDomainDF($"url"), $"url", $"language", RemoveHTMLDF(RemoveHTTPHeaderDF($"content"))) | ||
.filter($"language" == "fr") | ||
.write.csv("plain-text-fr-df/") | ||
``` |
ianmilligan1
Jan 17, 2020
Member
Leads to error:

```
<pastie>:111: error: overloaded method value filter with alternatives:
  (func: org.apache.spark.api.java.function.FilterFunction[org.apache.spark.sql.Row])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] <and>
  (func: org.apache.spark.sql.Row => Boolean)org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] <and>
  (conditionExpr: String)org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] <and>
  (condition: org.apache.spark.sql.Column)org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
 cannot be applied to (Boolean)
       .filter($"language" == "fr")
               ^
```
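The error comes from using Scala's `==`, which returns a plain `Boolean`, where Spark's `filter` expects a `Column` expression (or a string condition, per the overloads listed above). Column equality in the Scala DataFrame API is spelled `===`, so a minimal fix for that one line would be:

```scala
// Use Spark's Column equality operator (===) instead of Scala's ==,
// so the comparison produces a Column expression that filter() accepts.
.filter($"language" === "fr")

// Equivalent alternative, passing the condition as a SQL string:
// .filter("language = 'fr'")
```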
import io.archivesunleashed._
import io.archivesunleashed.df._
RecordLoader.loadArchives("example.warc.gz", sc)
Added in review; mostly just addresses quick and minor fixes. Note: while reviewing, line #s are provided, especially in cases where a comment was added below the section that needs addressing (because I wasn't able to see the blue + button). In some cases there are questions on formatting. My focus was on text rather than code pieces. Like @ianmilligan1, I'm happy to run through the code snippets for testing. The documentation is looking fantastic @ruebot!!

If you want to learn more about [Apache Spark](https://spark.apache.org/), we highly recommend [Spark: The Definitive Guide](http://shop.oreilly.com/product/0636920034957.do).

## Table of Contents

Our documentation is divided into several main sections, which cover the Archives Unleashed Toolkit workflow from analyzing collections to understanding and working with the results.
@@ -142,7 +142,7 @@ only showing top 20 rows

### Scala RDD

-TODO
+**Will not be implemented.**

### Scala DF

SamFritz
Jan 17, 2020
Member
I'm finding that in the Python DF we mention 'width' and 'height' will be extracted, but the example outputs don't have these columns - are the dimensions embedded in the columns that are shown?
@@ -168,7 +198,7 @@ RecordLoader.loadArchives("example.arc.gz", sc).keepValidPages()
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)
-  .saveAsTextFile("sitelinks-by-date/")
+  .saveAsTextFile("sitelinks-by-date-rdd/")
```
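For readers new to the RDD helpers: judging from how it is used here, `.countItems()` tallies how often each tuple occurs and sorts by that count, which is why the following `.filter(r => r._2 > 5)` keeps only tuples seen more than five times. A rough plain-Spark sketch of that behaviour (an assumption based on usage, not the aut source):

```scala
// Count occurrences of each item and sort by frequency, highest first (sketch).
rdd.map(item => (item, 1))
  .reduceByKey(_ + _)
  .sortBy(_._2, ascending = false)
```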

The format of this output is:
SamFritz
Jan 17, 2020
Member
question: line 205 "- Field one: Crawldate" – should it be `yyyyMMdd` or `yyyymmdd`?
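If that field documents the Java/Spark date pattern, `yyyyMMdd` is the correct spelling: in `SimpleDateFormat` patterns, uppercase `M` means month while lowercase `m` means minutes. A quick illustration:

```scala
import java.text.SimpleDateFormat
import java.util.Date

val asCrawlDate = new SimpleDateFormat("yyyyMMdd").format(new Date()) // e.g. "20200117"
val notADate    = new SimpleDateFormat("yyyymmdd").format(new Date()) // year + minutes + day, not what we want
```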
  .count()
  .filter($"count" > 5)
  .write.csv("sitelinks-details-df/")
```
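The `.count()` and `.filter($"count" > 5)` lines are the tail end of Spark's usual group-and-count idiom. A generic sketch of the full shape, with hypothetical column names since the grouping step isn't shown in this hunk:

```scala
// Group on the key columns, count each group, keep groups that occur
// more than five times, and write the result out as CSV.
df.groupBy($"crawl_date", $"src_domain", $"dest_domain")
  .count()
  .filter($"count" > 5)
  .write.csv("sitelinks-details-df/")
```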

### Python DF

@@ -28,12 +28,11 @@ import io.archivesunleashed.matchbox._
sc.setLogLevel("INFO")
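For reference, `sc.setLogLevel` accepts the standard log4j level names, so the line above can be tuned to make the shell noisier or quieter:

```scala
// Valid levels include ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN.
sc.setLogLevel("WARN") // quieter than INFO
```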
SamFritz
Jan 17, 2020
Member
line 15 - should we add the code command (inline) for concatenation, as the graph pass command is written directly below the paragraph?
@@ -67,9 +66,10 @@ TODO
How do I extract binary information of PDFs, audio files, video files, word processor files, spreadsheet files, presentation program files, and text files to a CSV file, or into the [Apache Parquet](https://parquet.apache.org/) format to [work with later](df-results.md#what-to-do-with-dataframe-results)?
SamFritz
Jan 17, 2020
Member
"How do I extract binary information " --> "How do I extract the binary information"
@@ -25,9 +25,10 @@ This script extracts the crawl date, domain, URL, and plain text from HTML files
import io.archivesunleashed._
…Addresses #372.
- .all() column HttpStatus to http_status_code
- Adds archive_filename to .all()
- Significant README updates for setup
- See also: archivesunleashed/aut-docs#39
@SamFritz @ianmilligan1 I think I hit everything raised.
Looks good to me - I'll wait for @SamFritz's thumbs up and then I'm happy to merge (or, after reading your PR, you can squash + merge too!).
ruebot commented Jan 16, 2020 (edited)
A whole bunch of updates for archivesunleashed/aut#372 (comment)
Depends on archivesunleashed/aut#406
Partially hits #29.
Resolves #22.
Needs eyes and testing, since I touched so much. I'm probably inconsistent, or have funny mess-ups. Let me know 😄
When y'all approve, I'll squash and merge with a sane commit message.