Copy Editing for Documentation #71

Open
SamFritz opened this issue Jun 1, 2020 · 11 comments

@SamFritz SamFritz commented Jun 1, 2020

Going through Documentation for copy editing and prose suggestions/clean up

Areas for Review:

  • Home
  • The Toolkit
  • Getting Started
  • Dependencies
  • Usage
  • The Toolkit at Scale
  • DataFrame Schemas
  • Toolkit Walkthrough
  • Generating Results
  • Collection Analysis
  • Text Analysis
  • Link Analysis
  • Image Analysis
  • Binary Analysis
  • Standard Derivatives
  • The Toolkit with spark-submit
  • AU Cloud Scholarly Derivatives
  • Extract Binary Info
  • Extract Binaries to Disk
  • What to do with Results
  • DataFrame Results
  • RDD Results

@lintool lintool commented Jun 1, 2020

hey @SamFritz all our DataFrame fields have been renamed to lowercase:
https://github.com/archivesunleashed/aut-docs/blob/master/current/dataframe-schemas.md

but in lots of cases, field names are still upper case, e.g., here:
https://github.com/archivesunleashed/aut-docs/blob/master/current/collection-analysis.md#extract-top-level-domains

import io.archivesunleashed._
import io.archivesunleashed.udfs._

RecordLoader.loadArchives("/path/to/warcs", sc).webpages()
  .select(extractDomain($"Url").as("Domain"))
  .groupBy("Domain").count().orderBy(desc("count"))
  .show(20, false)

hey @ruebot can you confirm that Url should be changed to url and Domain changed to domain?
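
For anyone checking the rest of the docs for the same problem, here is a minimal sketch of a scanner that flags capitalized column names in Scala snippets. It is an illustration only: the helper name `find_uppercase_columns` and the regular expression are assumptions made for this example, not part of the toolkit, and it only looks at `$"..."` references and `.as("...")` aliases, so plain string arguments such as `groupBy("Domain")` would still need a manual check.

```python
import re

# Matches Scala column references such as $"Url" and aliases such as .as("Domain").
# Hypothetical pattern for illustration; it does not cover every way a column can be named.
COLUMN_REF = re.compile(r'\$"(\w+)"|\.as\("(\w+)"\)')


def find_uppercase_columns(text):
    """Return (line_number, name) pairs for column names that are not all lowercase."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in COLUMN_REF.finditer(line):
            # Exactly one of the two groups is set, depending on which alternative matched.
            name = match.group(1) or match.group(2)
            if name != name.lower():
                hits.append((line_number, name))
    return hits


# The snippet quoted above, used here as sample input.
sample = '''RecordLoader.loadArchives("/path/to/warcs", sc).webpages()
  .select(extractDomain($"Url").as("Domain"))
  .groupBy("Domain").count().orderBy(desc("count"))'''

for line_number, name in find_uppercase_columns(sample):
    print(f"line {line_number}: {name} -> {name.lower()}")
```

Whether every flagged name should actually be lowercased still needs to be confirmed against the DataFrame schemas page linked above.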

@ruebot ruebot commented Jun 1, 2020

@lintool if you're going to join in on reviewing, you'll want to review these docs: https://github.com/archivesunleashed/aut-docs/tree/docusaurus/docs

I'll check though for column name issues now.

ruebot added a commit that referenced this issue Jun 1, 2020

@SamFritz SamFritz commented Jun 1, 2020

Edits for @ruebot

AUT Documentation Suggested Changes

Home

  • Add . at the end of “The definitive Guide”
  • Remove space before : for Text analysis and Link Analysis subtitles
  • Suggestion: We should be descriptive on the “here”(s) for accessibility purposes?

Dependencies

  • Remove space before . after “Anaconda distribution”
  • Questions: Can we add “which java” to find out where it lives, just like python instructions?

Usage

  • “There are a two options for loading” -> “There are two options”
  • “single machine vs cluster” -> “vs.”
  • “drop down” -> “drop-down”

The Toolkit at Scale

  • “Apache Spark has a great a Configuration, and a Tuning” -> remove , after configuration
  • “example is using 12 threads on a 16-core machine” -> “the example is using”
  • “Reading Data from a S3-like Endpoint” -> “Reading Data from an S3-like Endpoint”
  • “you'll need an access key and secret, and additionally you'll need to define your endpoint.” -> “you’ll need an access key and secret key, and additionally, you will need to define your endpoint.”

DataFrame Schemas

  • “you can use .all() extract the overall content” -> “to extract”

Toolkit Walkthrough

  • “If it isn't you might” -> “If it isn’t, you might”
  • “Make a directory in your userspace, somewhere where you can find it: on your desktop, perhaps,” -> “somewhere where you can find it, on your desktop perhaps,”
  • “count them” -> “counts them”
  • “and display a DataFrame the top ten!” -> “and displays a DataFrame of the top ten!”
  • “We like to use this example to do two things:” -> “We like to use this example for two reasons:”
  • “To load this script, remember: type” -> “remember to type”
  • “Secondly, if there is time, we can begin to think about how” -> “Secondly, we can begin to think about how”

Collection Analysis

  • Question: are we leaving in “TODO: Add script for the case where I only want to know the location of one resource.”?
ruebot added a commit that referenced this issue Jun 1, 2020

@ruebot ruebot commented Jun 1, 2020

@SamFritz I think I got most of it. Just pushed things up locally, and have a preview here https://ruebot.github.io/aut-docs-redux/

@ruebot ruebot commented Jun 1, 2020

Suggestion: We should be descriptive on the “here”(s) for accessibility purposes?

I don't follow. Can you link to an example?

Questions: Can we add “which java” to find out where it lives, just like python instructions?

Maybe? @ianmilligan1, thoughts?

Question: are we leaving in “TODO: Add script for the case where I only want to know the location of one resource.”?

Yep, for now.

@ianmilligan1 ianmilligan1 commented Jun 1, 2020

Re: which java - my head kind of hurts when it comes to Java, as I have so many different Java installations on my machine that I point to in my .bash_profile. So I probably wouldn't add which java here.

@ianmilligan1 ianmilligan1 commented Jun 1, 2020

A few catches:

usage.md:

  • "There are a two options" -> "There are two options"
  • "PySpark with the Java/Scala package, and the Python bindings" -> "PySpark with the Java/Scala package, as well as the Python bindings"

aut-at-scale.md:

The rest are one-offs:

  • text analysis - inconsistency in the RemoveHttpHeader in the doc wrapper vs the RemoveHTTPHeader in the script itself.
  • link analysis - in the intro text, maybe change "Though, we do provide one example below that provides raw data" -> "That said, we do provide one example below that provides raw data"

I also did get a page not found when clicking forward to https://ruebot.github.io/aut-docs-redux/docs/rdd-filters. Somewhere there's a pointer to rdd-filters rather than filter-rdds

Otherwise things are looking good to me!

ruebot added a commit that referenced this issue Jun 1, 2020
ruebot added a commit that referenced this issue Jun 1, 2020

@lintool lintool commented Jun 2, 2020

Python line continuations with backslash should have a space, per PEP8:

from aut import *

WebArchive(sc, sqlContext, "/path/to/warcs")\
  .webpages() \
  .select("url") \
  .show(20, False)

So, slash on first line needs extra space; others are fine. Issue throughout docs.
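
Since this shows up throughout the docs, a small script can help find the remaining cases. The following is a rough sketch, not an official tool: the function name `find_tight_continuations` and the regular expression are assumptions for illustration. It reports the line numbers where a trailing backslash is not preceded by whitespace.

```python
import re

# A trailing backslash immediately preceded by a non-whitespace character.
MISSING_SPACE = re.compile(r'\S\\$')


def find_tight_continuations(text):
    """Return 1-based line numbers whose trailing backslash has no space before it."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        # Strip trailing whitespace so a backslash followed by stray spaces is still seen.
        if MISSING_SPACE.search(line.rstrip()):
            hits.append(line_number)
    return hits


# The snippet quoted above, written with explicit escapes: "\\" is one backslash.
sample = (
    'WebArchive(sc, sqlContext, "/path/to/warcs")\\\n'
    '  .webpages() \\\n'
    '  .select("url") \\\n'
    '  .show(20, False)'
)

print(find_tight_continuations(sample))
```

On the sample, only the first line is reported, which matches the observation that the other continuation lines are fine.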

ianmilligan1 added a commit that referenced this issue Jun 2, 2020
ianmilligan1 added a commit that referenced this issue Jun 2, 2020

@ianmilligan1 ianmilligan1 commented Jun 2, 2020

Thanks @lintool - I fixed those brackets in 1b96f0f above.

@ruebot ruebot commented Jun 3, 2020

Only remaining TODO is #72.

@ianmilligan1 ianmilligan1 commented Jun 6, 2020

BTW I have an ongoing draft PR for typo fixes/clean ups etc. at #76.
