Address https://github.com/archivesunleashed/aut/issues/372 - DRAFT #39
Conversation
Just did a prose review – caught a few things, @ruebot. Obviously, in the few minutes this has been open, I haven't gone through all of the scripts to test them. Do you want me to do that? (It might take into next week, as things are a bit swamped right now.)
archive.write.csv("/path/to/export/directory/", header='true')
```

If you want to store the results with the intention of reading them back later for further processing, then use the Parquet format:
ianmilligan1
Jan 17, 2020
Member
Is there a good link out on "Parquet format" to an overview of what that means for somebody who wants to dig in further?
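For context, Apache Parquet is a columnar storage format that preserves a DataFrame's schema and column types, so results can be read straight back into Spark later without re-parsing CSV. A minimal sketch of the round trip, assuming the same `archive` DataFrame as in the CSV example above (the path and variable names are illustrative):

```scala
// Write the results out as Parquet; the schema travels with the data.
archive.write.parquet("/path/to/export/directory/")

// In a later session, read the results back into a DataFrame for further processing.
val results = spark.read.parquet("/path/to/export/directory/")
results.show(20, false)
```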
@ianmilligan1 I tested most of them locally, and just wrote the rest of them. They should be fine, but they definitely need to be tested. No big rush until we get closer to sending out homework for the NYC datathon.
Decided to bite the bullet and plow through this! Looks great, @ruebot – I've tested all the new scripts. A few errors, which I've put in the comments. Three quarters of the docs refer to
import io.archivesunleashed._
import io.archivesunleashed.df._
RecordLoader.loadArchives("example.warc.gz", sc)
.select($"crawl_date", ExtractDomainDF($"url"), $"url", $"language", RemoveHTMLDF(RemoveHTTPHeaderDF($"content"))) | ||
.filter($"language" == "fr") | ||
.write.csv("plain-text-fr-df/") | ||
``` |
ianmilligan1
Jan 17, 2020
Member
Leads to error:

```
<pastie>:111: error: overloaded method value filter with alternatives:
  (func: org.apache.spark.api.java.function.FilterFunction[org.apache.spark.sql.Row])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] <and>
  (func: org.apache.spark.sql.Row => Boolean)org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] <and>
  (conditionExpr: String)org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] <and>
  (condition: org.apache.spark.sql.Column)org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
 cannot be applied to (Boolean)
       .filter($"language" == "fr")
               ^
```
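The error comes from using Scala's `==`, which returns a plain `Boolean`, where Spark's `filter` expects a `Column` expression (or a string condition, per the overloads listed above). Column equality in the Scala DataFrame API is spelled `===`, so a minimal fix for that one line would be:

```scala
// Use Spark's Column equality operator (===) instead of Scala's ==,
// so the comparison produces a Column expression that filter() accepts.
.filter($"language" === "fr")

// Equivalent alternative, passing the condition as a SQL string:
// .filter("language = 'fr'")
```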
import io.archivesunleashed._
import io.archivesunleashed.df._
RecordLoader.loadArchives("example.warc.gz", sc)
Added in review; mostly just addresses quick and minor fixes. Note: while reviewing, line #s are provided, especially in cases where a comment was added below the section that needs addressing (because I wasn't able to see the blue + button). In some cases there are questions on formatting. My focus was on text rather than code pieces. Like @ianmilligan1, I'm happy to run through the code snippets for testing. The documentation is looking fantastic @ruebot!!

If you want to learn more about [Apache Spark](https://spark.apache.org/), we highly recommend [Spark: The Definitive Guide](http://shop.oreilly.com/product/0636920034957.do).

## Table of Contents

Our documentation is divided into several main sections, which cover the Archives Unleashed Toolkit workflow from analyzing collections to understanding and working with the results.
@@ -142,7 +142,7 @@ only showing top 20 rows

### Scala RDD

-TODO
+**Will not be implemented.**

### Scala DF

SamFritz
Jan 17, 2020
Member
I'm finding that in the Python DF we mention 'width' and 'height' will be extracted, but the example outputs don't have these columns - are the dimensions embedded in the columns that are shown?
@@ -168,7 +198,7 @@ RecordLoader.loadArchives("example.arc.gz", sc).keepValidPages()
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)
-  .saveAsTextFile("sitelinks-by-date/")
+  .saveAsTextFile("sitelinks-by-date-rdd/")
```
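For readers new to the RDD helpers: judging from how it is used here, `.countItems()` tallies how often each tuple occurs and sorts by that count, which is why the following `.filter(r => r._2 > 5)` keeps only tuples seen more than five times. A rough plain-Spark sketch of that behaviour (an assumption based on usage, not the aut source):

```scala
// Count occurrences of each item and sort by frequency, highest first (sketch).
rdd.map(item => (item, 1))
  .reduceByKey(_ + _)
  .sortBy(_._2, ascending = false)
```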

The format of this output is:
SamFritz
Jan 17, 2020
Member
question: line 205 "- Field one: Crawldate" – should it be `yyyyMMdd` or `yyyymmdd`?
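If that field documents the Java/Spark date pattern, `yyyyMMdd` is the correct spelling: in `SimpleDateFormat` patterns, uppercase `M` means month while lowercase `m` means minutes. A quick illustration:

```scala
import java.text.SimpleDateFormat
import java.util.Date

val asCrawlDate = new SimpleDateFormat("yyyyMMdd").format(new Date()) // e.g. "20200117"
val notADate    = new SimpleDateFormat("yyyymmdd").format(new Date()) // year + minutes + day, not what we want
```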
  .count()
  .filter($"count" > 5)
  .write.csv("sitelinks-details-df/")
```
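The `.count()` and `.filter($"count" > 5)` lines are the tail end of Spark's usual group-and-count idiom. A generic sketch of the full shape, with hypothetical column names since the grouping step isn't shown in this hunk:

```scala
// Group on the key columns, count each group, keep groups that occur
// more than five times, and write the result out as CSV.
df.groupBy($"crawl_date", $"src_domain", $"dest_domain")
  .count()
  .filter($"count" > 5)
  .write.csv("sitelinks-details-df/")
```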

### Python DF

@@ -28,12 +28,11 @@ import io.archivesunleashed.matchbox._
sc.setLogLevel("INFO")
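For reference, `sc.setLogLevel` accepts the standard log4j level names, so the line above can be tuned to make the shell noisier or quieter:

```scala
// Valid levels include ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN.
sc.setLogLevel("WARN") // quieter than INFO
```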
SamFritz
Jan 17, 2020
Member
line 15 - should we add the code command (inline) for concatenation, as the graph pass command is written directly below the paragraph?
@@ -67,9 +66,10 @@ TODO
How do I extract binary information of PDFs, audio files, video files, word processor files, spreadsheet files, presentation program files, and text files to a CSV file, or into the [Apache Parquet](https://parquet.apache.org/) format to [work with later](df-results.md#what-to-do-with-dataframe-results)?
SamFritz
Jan 17, 2020
Member
"How do I extract binary information " --> "How do I extract the binary information"
@@ -25,9 +25,10 @@ This script extracts the crawl date, domain, URL, and plain text from HTML files
import io.archivesunleashed._
…Addresses #372.
- .all() column HttpStatus to http_status_code
- Adds archive_filename to .all()
- Significant README updates for setup
- See also: archivesunleashed/aut-docs#39
@SamFritz @ianmilligan1 I think I hit everything raised.
Looks good to me - I'll wait for @SamFritz's thumbs up and then I'm happy to merge (or, after reading your PR, you can squash + merge too!).
ruebot commented Jan 16, 2020 (edited)
A whole bunch of updates for archivesunleashed/aut#372 (comment)
Depends on archivesunleashed/aut#406
Partially hits #29.
Resolves #22.
Needs eyes and testing, since I touched so much. I'm probably inconsistent, or have funny mess-ups. Let me know 😄
When y'all approve, I'll squash and merge with a sane commit message.