Copy Editing for Documentation #71

SamFritz · 2020-06-01T18:25:40Z

lintool · 2020-06-01T18:29:58Z

hey @SamFritz all our DataFrame fields have been renamed to lowercase:
https://github.com/archivesunleashed/aut-docs/blob/master/current/dataframe-schemas.md

but in lots of cases, field names are still upper case, e.g., here:
https://github.com/archivesunleashed/aut-docs/blob/master/current/collection-analysis.md#extract-top-level-domains

import io.archivesunleashed._
import io.archivesunleashed.udfs._

RecordLoader.loadArchives("/path/to/warcs", sc).webpages()
  .select(extractDomain($"Url").as("Domain"))
  .groupBy("Domain").count().orderBy(desc("count"))
  .show(20, false)

hey @ruebot can you confirm that Url should be changed to url and Domain changed to domain?
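
One way to find the remaining capitalized field names across the docs is a small scan over the Markdown source. The following is a minimal sketch, not part of aut or aut-docs: the function name `find_capitalized_columns` and the list of Spark method names in the pattern are assumptions chosen for illustration, and the pattern only catches quoted names in the positions listed.

```python
import re

# Matches a quoted name that starts with an uppercase letter when it appears
# as a column reference: $"Url", .as("Domain"), .groupBy("Domain"), and so on.
# The method list is an assumption for illustration, not a full inventory.
COLUMN_REF = re.compile(
    r'(?:\$|\.(?:as|alias|groupBy|select|orderBy)\()"([A-Z][A-Za-z0-9_]*)"'
)


def find_capitalized_columns(text):
    """Return (line_number, column_name) pairs for capitalized column references."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for match in COLUMN_REF.finditer(line):
            hits.append((number, match.group(1)))
    return hits


if __name__ == "__main__":
    sample = "\n".join([
        'RecordLoader.loadArchives("/path/to/warcs", sc).webpages()',
        '  .select(extractDomain($"Url").as("Domain"))',
        '  .groupBy("Domain").count().orderBy(desc("count"))',
        '  .show(20, false)',
    ])
    for line_number, name in find_capitalized_columns(sample):
        print(f"line {line_number}: {name} -> {name.lower()}")
```

Run against the snippet above, this reports `Url` and `Domain` on the second line and `Domain` on the third. Whether each hit should actually be lowercased still needs a human check against the schema page.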

ruebot · 2020-06-01T18:37:11Z

@lintool if you're going to join in on reviewing, you'll want to review these docs: https://github.com/archivesunleashed/aut-docs/tree/docusaurus/docs

I'll check though for column name issues now.


        Updates for #71

SamFritz · 2020-06-01T18:50:48Z

Edits for @ruebot

AUT Documentation Suggested Changes

Home

Add . at the end of “The definitive Guide”
Remove space before : for Text analysis and Link Analysis subtitles
Suggestion: We should be descriptive on the “here”(s) for accessibility purposes?

Dependencies

Remove space before . after “Anaconda distribution”
Questions: Can we add “which java” to find out where it lives, just like python instructions?

Usage

“There are a two options for loading” —> “There are two options”
“single machine vs cluster” —> “vs.”
“drop down” —> “drop-down”

The Toolkit at Scale

“Apache Spark has a great a Configuration, and a Tuning” —> remove , after configuration
“example is using 12 threads on a 16-core machine” —> "the example is using"
“Reading Data from a S3-like Endpoint” —> “Reading Data from an S3-like Endpoint”
“you'll need an access key and secret, and additionally you'll need to define your endpoint.” —> “you’ll need an access key and secret key, and additionally, you will need to define your endpoint.”

DataFrame Schemas

“you can use .all() extract the overall content” —> “to extract”

Toolkit Walkthrough

“If it isn't you might” —> “If it isn’t, you might”
“Make a directory in your userspace, somewhere where you can find it: on your desktop, perhaps,” --> “somewhere where you can find it, on your desktop perhaps,”
“count them” —> “counts them”
“and display a DataFrame the top ten!” —> “and displays a DataFrame of the top ten!”
“We like to use this example to do two things:” —> “We like to use this example for two reasons:”
“To load this script, remember: type” —> “remember to type”
“Secondly, if there is time, we can begin to think about how” —> “Secondly, we can begin to think about how”

Collection Analysis

Question: are we leaving in “TODO: Add script for the case where I only want to know the location of one resource.”?


        Updates for #70 & #71.

ruebot · 2020-06-01T20:17:43Z

@SamFritz I think I got most of it. Just pushed things up locally, and have a preview here https://ruebot.github.io/aut-docs-redux/

ruebot · 2020-06-01T20:19:22Z

Suggestion: We should be descriptive on the “here”(s) for accessibility purposes?

I don't follow. Can you link to an example?

Questions: Can we add “which java” to find out where it lives, just like python instructions?

Maybe? @ianmilligan1, thoughts?

Question: are we leaving in “TODO: Add script for the case where I only want to know the location of one resource.”?

Yep, for now.

ianmilligan1 · 2020-06-01T20:41:26Z

Re: which java - my head kind of hurts when it comes to Java, as I have so many different Javas on my machine that I point to in my .bash_profile. So I probably wouldn't add which java here.

ianmilligan1 · 2020-06-01T21:20:27Z

A few catches:

usage.md:

"There are a two options" -> "There are two options"
"PySpark with the Java/Scala package, and the Python bindings" -> "PySpark with the Java/Scala package, as well as the Python bindings"

aut-at-scale.md:

"Apache Spark has a great Configuration and Tuning guides that are worth checking out" -> "Apache Spark has great configuration and tuning guides that are worth checking out."

The rest are one-offs:

text analysis - inconsistency in the RemoveHttpHeader in the doc wrapper vs the RemoveHTTPHeader in the script itself.
link analysis - in the intro text, maybe change "Though, we do provide one example below that provides raw data" -> "That said, we do provide one example below that provides raw data"

I also did get a page not found when clicking forward to https://ruebot.github.io/aut-docs-redux/docs/rdd-filters. Somewhere there's a pointer to rdd-filters rather than filter-rdds

Otherwise things are looking good to me!
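
Inconsistencies like RemoveHttpHeader vs RemoveHTTPHeader can be surfaced mechanically by grouping identifiers that differ only in letter case. This is a rough sketch under stated assumptions: the helper name `case_variants` is hypothetical, and the pattern only considers names with an internal capital letter so that ordinary prose ("The" vs "the") is not flagged.

```python
import re
from collections import defaultdict

# Identifiers with at least one uppercase letter after the first character,
# e.g. RemoveHttpHeader or PySpark. Plain words such as "The" do not match.
MIXED_CASE_NAME = re.compile(r"\b[A-Za-z_][a-z0-9_]*[A-Z][A-Za-z0-9_]*\b")


def case_variants(text):
    """Map a lowercased name to its spellings when more than one spelling occurs."""
    groups = defaultdict(set)
    for name in MIXED_CASE_NAME.findall(text):
        groups[name.lower()].add(name)
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}


if __name__ == "__main__":
    page = (
        "Use RemoveHttpHeader to strip headers.\n"
        "    .select(RemoveHTTPHeader(content))\n"
    )
    print(case_variants(page))
```

The script cannot tell which spelling is correct; it only lists the candidates so someone can check them against the library.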


        More updates for #70, and #71.


        More updates for #70, and #71.

lintool · 2020-06-02T12:53:01Z

Python line continuations with backslash should have a space, per PEP8:

from aut import *

WebArchive(sc, sqlContext, "/path/to/warcs")\
  .webpages() \
  .select("url") \
  .show(20, False)

So, slash on first line needs extra space; others are fine. Issue throughout docs.
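
Since the issue occurs throughout the docs, a line-based check may help locate the affected snippets. The sketch below is an assumption-laden illustration, not an existing tool: `find_tight_continuations` and `add_continuation_space` are hypothetical names, and the check works on raw text lines without parsing Python, so a backslash that ends a line inside a string literal would also be reported.

```python
import re

# A trailing backslash immediately preceded by a non-whitespace character.
TIGHT_CONTINUATION = re.compile(r"(?<=\S)\\$")


def find_tight_continuations(source):
    """Return 1-based line numbers whose trailing backslash lacks a leading space."""
    return [
        number
        for number, line in enumerate(source.splitlines(), start=1)
        if TIGHT_CONTINUATION.search(line.rstrip())
    ]


def add_continuation_space(source):
    """Insert one space before each tight trailing backslash."""
    return re.sub(r"(?<=\S)\\$", r" \\", source, flags=re.MULTILINE)


if __name__ == "__main__":
    sample = "\n".join([
        'WebArchive(sc, sqlContext, "/path/to/warcs")\\',
        "  .webpages() \\",
        '  .select("url") \\',
        "  .show(20, False)",
    ])
    print(find_tight_continuations(sample))
    print(add_continuation_space(sample))
```

On the example above, only the first line is reported, which matches the observation that the other lines are fine.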


        Adding Python line continuation spaces, re: #71


        Fixing minor script bracketing errors, as per #71

ianmilligan1 · 2020-06-02T18:07:23Z

Thanks @lintool - I fixed those brackets in 1b96f0f above.

ruebot · 2020-06-03T15:23:49Z

Only remaining TODO is #72.

ianmilligan1 · 2020-06-06T19:35:58Z

BTW I have an ongoing draft PR for typo fixes/clean ups etc. at #76.


        Fixing Documentation Errors (#76)

* Typo fixes to scripts * Removing erroneous import utils that led to error * Fixes two scripts on link-analysis page * Mapped files over to versioned_docs * Partially addresses #71

SamFritz · 2020-06-22T01:04:07Z

Updating thread here to include the final copy editing changes I've found (most are pretty minor, but I did raise a question about an occurrence I found throughout the Binary Analysis section).

@ruebot I know this final week for you is a bit busy, so I'm happy to help implement changes where you need support.

Noting the following copyedits below for documentation:

Generation Results

Text Analysis

“lines beginning with (201204, or April 2012.” —> 201204 (remove the bracket)

Link Analysis

That said,, we do provide one example below that provides raw data (remove extra , )
“Note how you can add filters are added.” —> “Note how you can add filters”

Image Analysis

“calculating the MD5 hash of each and presenting the most” —> presents

Binary Analysis

Extract Audio Information
- Under Python DF script extracted information —> remove width and height
Extract Presentation Program Files Information
- "If you wanted to work with all the PDF files" —> should this be presentation program files?
- Under Python DF script extracted information —> remove width and height
Extract Spreadsheet Information
- "If you wanted to work with all the PDF files" —> should this be spreadsheet files?
- Under Python DF script extracted information —> remove width and height
Extract Video Information
- "If you wanted to work with all the PDF files" —> should this be video files?
- Under Python DF script extracted information —> remove width and height
Extract Word Processor Files Information
- "If you wanted to work with all the PDF files" —> should this be word processor files?
- Under Python DF script extracted information —> remove width and height

Standard Derivatives

Extract Binaries to Disk
- "How do I all the binary files of PDFs" —> How do I extract
- "… further processing, then use Parquet format (a columnar storage format:" —> include closing bracket after “format”

Thanks for ushering in this amazing documentation, Nick, and for all the testing, @ianmilligan1!

ianmilligan1 · 2020-06-22T02:45:33Z

@SamFritz These are great catches! I can put these into a pull request tomorrow.

Commits that referenced this issue:

e66331d · ruebot · Jun 1, 2020 · Updates for #71
493e19a · ruebot · Jun 1, 2020 · Updates for #70 & #71.
f4ed39e · ruebot · Jun 1, 2020 · More updates for #70, and #71.
0e0ffc7 · ruebot · Jun 1, 2020 · More updates for #70, and #71.
1b96f0f · ianmilligan1 · Jun 2, 2020 · Adding Python line continuation spaces, re: #71
9c85897 · ianmilligan1 · Jun 2, 2020 · Fixing minor script bracketing errors, as per #71
