Address https://github.com/archivesunleashed/aut/issues/372 - DRAFT #39

Merged: 11 commits merged into master from aut-372 on Jan 20, 2020

Conversation

ruebot (Member) commented Jan 16, 2020:

A whole bunch of updates for archivesunleashed/aut#372 (comment)

Depends on archivesunleashed/aut#406

Partially hits #29.

Resolves #22.

Needs eyes and testing, since I touched so much. I'm probably inconsistent, or have some funny mess-ups. Let me know 😄

When y'all approve, I'll squash and merge with a sane commit message.

ruebot added 7 commits Jan 16, 2020
ianmilligan1 (Member) left a comment:

Just did a prose review – caught a few things, @ruebot.

Obviously, in the few minutes this has been open, I haven't gone through all of the scripts to test them. Do you want me to do that? (It might take into next week, as things are a bit swamped right now.)

current/README.md (resolved)
current/binary-analysis.md (resolved)
```python
archive.write.csv("/path/to/export/directory/", header='true')
```

If you want to store the results with the intention of reading them back later for further processing, use the Parquet format:

ianmilligan1 (Member) commented Jan 17, 2020:

Is there a good link out on "Parquet format" to an overview of what that means for somebody who wants to dig in further?

current/link-analysis.md (resolved)
current/rdd-results.md (resolved)
ruebot (Member, Author) commented Jan 17, 2020:

@ianmilligan1 I tested most of them locally, and just wrote the rest of them. They should be fine, but they definitely need to be tested. No big rush until we get closer to sending out homework for the NYC datathon.

ianmilligan1 (Member) left a comment:

Decided to bite the bullet and plow through this! Looks great, @ruebot - I've tested all the new scripts.

A few errors, which I've put in the comments. Three quarters of the docs refer to example.arc.gz and the other quarter to example.warc.gz; I'd be a fan of just using example.arc.gz, as you'll see.

current/collection-analysis.md (3 resolved threads)
current/link-analysis.md (2 resolved threads)
```scala
import io.archivesunleashed._
import io.archivesunleashed.df._

RecordLoader.loadArchives("example.warc.gz", sc)
```

ianmilligan1 (Member) commented Jan 17, 2020:

change to example.arc.gz?

ruebot (Author, Member) replied Jan 17, 2020:

IIRC, I use warc for a bunch of them so you actually get results.

```scala
  .select($"crawl_date", ExtractDomainDF($"url"), $"url", $"language", RemoveHTMLDF(RemoveHTTPHeaderDF($"content")))
  .filter($"language" == "fr")
  .write.csv("plain-text-fr-df/")
```

ianmilligan1 (Member) commented Jan 17, 2020:

Leads to error:

```
<pastie>:111: error: overloaded method value filter with alternatives:
  (func: org.apache.spark.api.java.function.FilterFunction[org.apache.spark.sql.Row])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] <and>
  (func: org.apache.spark.sql.Row => Boolean)org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] <and>
  (conditionExpr: String)org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] <and>
  (condition: org.apache.spark.sql.Column)org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
 cannot be applied to (Boolean)
          .filter($"language" == "fr")
           ^
```
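[Editor's note on the error above: in Scala, `==` is universal equality and always returns a Boolean, so Spark's Scala DataFrame API provides the separate `===` operator on `Column`, which returns a `Column` expression that `filter` accepts (`.filter($"language" === "fr")`); a string condition like `.filter("language = 'fr'")` also matches the `conditionExpr` overload listed in the error. Python, by contrast, lets `__eq__` return any object, which is how PySpark can overload plain `==`. A minimal sketch of that overloading idea, using hypothetical toy `Column`/`Expr` classes rather than the real PySpark ones:]

```python
# Toy illustration of why PySpark can use == where Scala Spark needs ===.
# These classes are hypothetical stand-ins, not pyspark.sql.Column.

class Expr:
    """A tiny 'column expression' wrapper holding a SQL-ish string."""
    def __init__(self, sql):
        self.sql = sql

class Column:
    def __init__(self, name):
        self.name = name

    # Python allows __eq__ to return any object, so == can build an
    # expression instead of a boolean. Scala's == cannot do this, which
    # is why the Scala API defines === instead.
    def __eq__(self, other):
        return Expr(f"({self.name} = {other!r})")

cond = Column("language") == "fr"
print(type(cond).__name__)  # Expr — an expression object, not a bool
print(cond.sql)             # (language = 'fr')
```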
```scala
import io.archivesunleashed._
import io.archivesunleashed.df._

RecordLoader.loadArchives("example.warc.gz", sc)
```

ianmilligan1 (Member) commented Jan 17, 2020:

recommend changing to arc for consistency

ruebot (Author, Member) replied Jan 17, 2020:

See above note.

current/text-analysis.md (2 resolved threads)
SamFritz (Member) left a comment:

Added a review; it mostly addresses quick and minor fixes. Note: while reviewing, line numbers are provided, especially in cases where a comment was added below the section that needs addressing (because I wasn't able to see the blue + button). In some cases there are questions on formatting.

My focus was on the text rather than the code pieces. Like @ianmilligan1, I'm happy to run through the code snippets for testing.

The documentation is looking fantastic @ruebot!!

current/README.md (2 resolved threads)

If you want to learn more about [Apache Spark](https://spark.apache.org/), we highly recommend [Spark: The Definitive Guide](http://shop.oreilly.com/product/0636920034957.do).

## Table of Contents

Our documentation is divided into several main sections, which cover the Archives Unleashed Toolkit workflow from analyzing collections to understanding and working with the results.

SamFritz (Member) commented Jan 17, 2020:

Delete the space preceding the paragraph.

current/README.md (resolved)
@@ -142,7 +142,7 @@ only showing top 20 rows

### Scala RDD

TODO
**Will not be implemented.**

### Scala DF

SamFritz (Member) commented Jan 17, 2020:

I'm finding that in the Python DF we mention 'width' and 'height' will be extracted, but the example outputs don't have these columns - are the dimensions embedded in the columns that are shown?

@@ -168,7 +198,7 @@ RecordLoader.loadArchives("example.arc.gz", sc).keepValidPages()
.filter(r => r._2 != "" && r._3 != "")
.countItems()
.filter(r => r._2 > 5)
.saveAsTextFile("sitelinks-by-date/")
.saveAsTextFile("sitelinks-by-date-rdd/")
```

The format of this output is:

SamFritz (Member) commented Jan 17, 2020:

question: line 205 "- Field one: Crawldate" should it be yyyyMMdd or yyyymmdd?

SamFritz (Member) commented Jan 17, 2020:

Also line 220, ExtractLinks --> ExtractLinks ?

ruebot (Author, Member) replied Jan 17, 2020:

yyyyMMdd
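[Editor's note: yyyyMMdd is correct because in the Java/Scala date patterns Spark uses (SimpleDateFormat style), uppercase MM means month while lowercase mm means minutes, so yyyymmdd would silently substitute minutes for the month. Python's strftime has the same case pitfall with %m (month) versus %M (minutes), which makes for a quick self-contained demonstration:]

```python
from datetime import datetime

# In SimpleDateFormat patterns, MM = month and mm = minutes.
# strftime mirrors this pitfall: %m = month, %M = minutes.
d = datetime(2020, 1, 17, 5, 7)  # Jan 17, 2020, 05:07
print(d.strftime("%Y%m%d"))  # 20200117 — the yyyyMMdd analogue (month)
print(d.strftime("%Y%M%d"))  # 20200717 — minutes, the yyyymmdd-style mistake
```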

```scala
  .count()
  .filter($"count" > 5)
  .write.csv("sitelinks-details-df/")
```

### Python DF

SamFritz (Member) commented Jan 17, 2020:

line 295: open-soure --> open-source

@@ -28,12 +28,11 @@ import io.archivesunleashed.matchbox._
sc.setLogLevel("INFO")

SamFritz (Member) commented Jan 17, 2020:

line 15 - should we add the code command (inline) for concatenation, as graph pass command is written directly below the paragraph?

@@ -67,9 +66,10 @@ TODO
How do I extract binary information of PDFs, audio files, video files, word processor files, spreadsheet files, presentation program files, and text files to a CSV file, or into the [Apache Parquet](https://parquet.apache.org/) format to [work with later](df-results.md#what-to-do-with-dataframe-results)?

SamFritz (Member) commented Jan 17, 2020:

"How do I extract binary information " --> "How do I extract the binary information"

@@ -25,9 +25,10 @@ This script extracts the crawl date, domain, URL, and plain text from HTML files
import io.archivesunleashed._

SamFritz (Member) commented Jan 17, 2020:

Line 12 --> capitalize Text (filtered by keyword)

ianmilligan1 added a commit to archivesunleashed/aut that referenced this pull request Jan 17, 2020
…Addresses #372.

- .all() column HttpStatus to http_status_code
- Adds archive_filename to .all()
- Significant README updates for setup
- See also: archivesunleashed/aut-docs#39
ruebot (Member, Author) commented Jan 17, 2020:

@SamFritz @ianmilligan1 I think I hit everything raised.

ianmilligan1 (Member) commented Jan 17, 2020:

Looks good to me - I'll wait for @SamFritz's thumbs up and then I'm happy to merge (or, after reading your PR, you can squash + merge too!). 😄

SamFritz (Member) commented Jan 20, 2020:

👍 good to go :)

@ruebot ruebot merged commit 4ce3cb5 into master Jan 20, 2020
@ruebot ruebot deleted the aut-372 branch Jan 20, 2020