Copy Editing for Documentation #71

Closed
SamFritz opened this issue Jun 1, 2020 · 16 comments

@SamFritz SamFritz commented Jun 1, 2020

Going through the documentation for copy editing and prose suggestions/cleanup.

Areas for Review:

  • Home
  • The Toolkit
  • Getting Started
  • Dependencies
  • Usage
  • The Toolkit at Scale
  • DataFrame Schemas
  • Toolkit Walkthrough
  • Generating Results
  • Collection Analysis
  • Text Analysis
  • Link Analysis
  • Image Analysis
  • Binary Analysis
  • Filtering Results
  • RDD Filters
  • DataFrames Filters
  • Standard Derivatives
  • The Toolkit with spark-submit
  • AU Cloud Scholarly Derivatives
  • Extract Binary Info
  • Extract Binaries to Disk
  • What to do with Results
  • DataFrame Results
  • RDD Results

@lintool lintool commented Jun 1, 2020

hey @SamFritz all our DataFrame fields have been renamed to lowercase:
https://github.com/archivesunleashed/aut-docs/blob/master/current/dataframe-schemas.md

but in lots of cases, field names are still upper case, e.g., here:
https://github.com/archivesunleashed/aut-docs/blob/master/current/collection-analysis.md#extract-top-level-domains

import io.archivesunleashed._
import io.archivesunleashed.udfs._

RecordLoader.loadArchives("/path/to/warcs", sc).webpages()
  .select(extractDomain($"Url").as("Domain"))
  .groupBy("Domain").count().orderBy(desc("count"))
  .show(20, false)

hey @ruebot can you confirm that Url should be changed to url and Domain changed to domain?
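
One way to gauge how widespread the capitalized names are is to scan the Scala snippets for them. The following is a minimal sketch, assuming the lowercase renaming applies to every column; the name `capitalized_columns` and the regular expression are illustrative only and are not part of aut. The heuristic only looks at `$"..."` references, `.as(...)` aliases, and `groupBy(...)` keys, so it may miss other uses.

```python
import re

# Heuristic: a capitalized name used as a column reference ($"Url"),
# an alias (.as("Domain")), or a grouping key (groupBy("Domain")).
# This pattern is an assumption for illustration, not an aut convention.
CAPITALIZED_COLUMN = re.compile(r'(?:\$|\.as\(|groupBy\()"([A-Z]\w*)"')


def capitalized_columns(snippet: str) -> list[str]:
    """Return the distinct capitalized column names found in a Scala snippet."""
    return sorted(set(CAPITALIZED_COLUMN.findall(snippet)))


example = '''
RecordLoader.loadArchives("/path/to/warcs", sc).webpages()
  .select(extractDomain($"Url").as("Domain"))
  .groupBy("Domain").count().orderBy(desc("count"))
  .show(20, false)
'''

print(capitalized_columns(example))  # prints ['Domain', 'Url']
```

Running this over each docs page would give a rough list of names to check against the DataFrame schemas page.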

@ruebot ruebot commented Jun 1, 2020

@lintool if you're going to join in on reviewing, you'll want to review these docs: https://github.com/archivesunleashed/aut-docs/tree/docusaurus/docs

I'll check though for column name issues now.

ruebot added a commit that referenced this issue Jun 1, 2020

@SamFritz SamFritz commented Jun 1, 2020

Edits for @ruebot

AUT Documentation Suggested Changes

Home

  • Add . at the end of “The definitive Guide”
  • Remove space before : for Text analysis and Link Analysis subtitles
  • Suggestion: We should be descriptive on the “here”(s) for accessibility purposes?

Dependencies

  • Remove space before . after “Anaconda distribution”
  • Question: Can we add “which java” to find out where it lives, just like the Python instructions?

Usage

  • “There are a two options for loading” —> “There are two options”
  • “single machine vs cluster” —> “vs.”
  • “drop down” —> “drop-down”

The Toolkit at Scale

  • “Apache Spark has a great a Configuration, and a Tuning” —> remove , after configuration
  • “example is using 12 threads on a 16-core machine” —> "the example is using"
  • “Reading Data from a S3-like Endpoint” —> “Reading Data from an S3-like Endpoint”
  • “you'll need an access key and secret, and additionally you'll need to define your endpoint.” —> “you’ll need an access key and secret key, and additionally, you will need to define your endpoint.”

DataFrame Schemas

  • “you can use .all() extract the overall content” —> “to extract”

Toolkit Walkthrough

  • “If it isn't you might” —> “If it isn’t, you might”
  • “Make a directory in your userspace, somewhere where you can find it: on your desktop, perhaps,” --> “somewhere where you can find it, on your desktop perhaps,”
  • “count them” —> “counts them”
  • “and display a DataFrame the top ten!” —> “and displays a DataFrame of the top ten!”
  • “We like to use this example to do two things:” —> “We like to use this example for two reasons:”
  • “To load this script, remember: type” —> “remember to type”
  • “Secondly, if there is time, we can begin to think about how” —> “Secondly, we can begin to think about how”

Collection Analysis

  • Question: are we leaving in “TODO: Add script for the case where I only want to know the location of one resource.”?
ruebot added a commit that referenced this issue Jun 1, 2020

@ruebot ruebot commented Jun 1, 2020

@SamFritz I think I got most of it. Just pushed things up locally, and have a preview here https://ruebot.github.io/aut-docs-redux/

@ruebot ruebot commented Jun 1, 2020

Suggestion: We should be descriptive on the “here”(s) for accessibility purposes?

I don't follow. Can you link to an example?

Question: Can we add “which java” to find out where it lives, just like the Python instructions?

Maybe? @ianmilligan1, thoughts?

Question: are we leaving in “TODO: Add script for the case where I only want to know the location of one resource.”?

Yep, for now.

@ianmilligan1 ianmilligan1 commented Jun 1, 2020

Re: which java - my head kind of hurts when it comes to Java, as I have so many different Java installations on my machine that I point to in my .bash_profile. So I probably wouldn't add which java here.

@ianmilligan1 ianmilligan1 commented Jun 1, 2020

A few catches:

usage.md:

  • "There are a two options" -> "There are two options"
  • "PySpark with the Java/Scala package, and the Python bindings" -> "PySpark with the Java/Scala package, as well as the Python bindings"

aut-at-scale.md:

The rest are one-offs:

  • text analysis - inconsistency in the RemoveHttpHeader in the doc wrapper vs the RemoveHTTPHeader in the script itself.
  • link analysis - in the intro text, maybe change "Though, we do provide one example below that provides raw data" -> "That said, we do provide one example below that provides raw data"

I also got a page not found when clicking forward to https://ruebot.github.io/aut-docs-redux/docs/rdd-filters. Somewhere there's a pointer to rdd-filters rather than filter-rdds.

Otherwise things are looking good to me!

ruebot added a commit that referenced this issue Jun 1, 2020
ruebot added a commit that referenced this issue Jun 1, 2020

@lintool lintool commented Jun 2, 2020

Python line continuations with a backslash should have a space before the backslash, per PEP 8:

from aut import *

WebArchive(sc, sqlContext, "/path/to/warcs")\
  .webpages() \
  .select("url") \
  .show(20, False)

So, the backslash on the first line needs an extra space; the others are fine. This issue appears throughout the docs.
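
Since this shows up throughout the docs, a small script can locate the affected lines. The following is a minimal sketch, assuming the only pattern of interest is a trailing backslash with no whitespace before it; the function names `find_tight_continuations` and `add_missing_space` are hypothetical helpers, not part of aut or any linter.

```python
import re

# A trailing backslash whose preceding character is not whitespace.
TIGHT_CONTINUATION = re.compile(r"(\S)\\$", re.MULTILINE)


def find_tight_continuations(source: str) -> list[int]:
    """Return 1-based numbers of lines ending in a backslash with no space before it."""
    return [
        number
        for number, line in enumerate(source.splitlines(), start=1)
        if TIGHT_CONTINUATION.search(line)
    ]


def add_missing_space(source: str) -> str:
    """Insert a single space before each tight trailing backslash."""
    return TIGHT_CONTINUATION.sub(r"\1 \\", source)


# The snippet from the docs, kept as text so Spark and aut are not required.
snippet = r'''from aut import *

WebArchive(sc, sqlContext, "/path/to/warcs")\
  .webpages() \
  .select("url") \
  .show(20, False)
'''

print(find_tight_continuations(snippet))  # prints [3]
print(find_tight_continuations(add_missing_space(snippet)))  # prints []
```

Only line 3 is flagged, which matches the observation that the first continuation is the one missing its space.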

ianmilligan1 added a commit that referenced this issue Jun 2, 2020
ianmilligan1 added a commit that referenced this issue Jun 2, 2020

@ianmilligan1 ianmilligan1 commented Jun 2, 2020

Thanks @lintool - I fixed those brackets in 1b96f0f above.

@ruebot ruebot commented Jun 3, 2020

Only remaining TODO is #72.

@ianmilligan1 ianmilligan1 commented Jun 6, 2020

BTW I have an ongoing draft PR for typo fixes/clean ups etc. at #76.

ruebot pushed a commit that referenced this issue Jun 16, 2020
* Typo fixes to scripts
* Removing erroneous import utils that led to error
* Fixes two scripts on link-analysis page
* Mapped files over to versioned_docs
* Partially addresses #71

@SamFritz SamFritz commented Jun 22, 2020

Updating the thread here to include the final copy editing changes I've found (most are pretty minor, but they did raise a question about an occurrence I found throughout the Binary Analysis section).

@ruebot I know this final week for you is a bit busy, so I'm happy to help implement changes where you need support.

Noting the following copyedits below for documentation:

Generating Results

Text Analysis

  • “lines beginning with (201204, or April 2012.” —> 201204 (remove the bracket)

Link Analysis

  • That said,, we do provide one example below that provides raw data (remove extra , )
  • “Note how you can add filters are added.” —> “Note how you can add filters”

Image Analysis

  • “calculating the MD5 hash of each and presenting the most” —> presents

Binary Analysis

  • Extract Audio Information

    • Under Python DF script extracted information —> remove width and height
  • Extract Presentation Program Files Information

    • "If you wanted to work with all the PDF files" —> should this be presentation program files?
    • Under Python DF script extracted information —> remove width and height
  • Extract Spreadsheet Information

    • "If you wanted to work with all the PDF files" —> should this be spreadsheet files?
    • Under Python DF script extracted information —> remove width and height
  • Extract Video Information

    • "If you wanted to work with all the PDF files" —> should this be video files?
    • Under Python DF script extracted information —> remove width and height
  • Extract Word Processor Files Information

    • "If you wanted to work with all the PDF files" —> should this be word processor files?
    • Under Python DF script extracted information —> remove width and height

Standard Derivatives

  • Extract Binaries to Disk
    • "How do I all the binary files of PDFs" —> How do I extract
    • "… further processing, then use Parquet format (a columnar storage format:" —> include closing bracket after “format”

Thanks for ushering in this amazing documentation, Nick, and for all the testing, @ianmilligan1!

@ianmilligan1 ianmilligan1 commented Jun 22, 2020

@SamFritz These are great catches! I can put these into a pull request tomorrow.

@SamFritz SamFritz commented Jun 22, 2020

Thanks @ianmilligan1!

@ianmilligan1 ianmilligan1 commented Jun 22, 2020

Will implement all except

“lines beginning with (201204, or April 2012.” —> 201204 (remove the bracket)

The bracket is part of the output, so let's leave it in as a code snippet here. Otherwise just staging up the PR and will have it up momentarily.

@ianmilligan1 ianmilligan1 commented Jun 22, 2020

Awesome, this can be closed with the PR. Caught a few extra ones in the binary section!
