Documentation reorg #2

Merged
merged 21 commits into master from doc-reorg on Oct 20, 2019

Conversation

@lintool (Member) commented Oct 19, 2019

Take a look at this example: https://github.com/archivesunleashed/aut-docs-new/blob/doc-reorg/current/collection-analysis.md

Major changes:

  • Grouping analyses by type, each on its own page, so the heading levels don't get too deep to manage.
  • Rephrasing headers into tasks, in the form of "How do I..."
  • Every task has Scala RDD, Scala DF, and Python DF subsections, with TODO stubs for the latter two (see the sketch after this list).
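
A minimal sketch (mine, not taken from the linked page) of what the Scala RDD subsection for a task like "How do I find the most frequent domains in a collection?" might contain, assuming the aut 0.18-era RDD API; the Scala DF and Python DF subsections would start as TODO stubs:

```scala
import io.archivesunleashed._
import io.archivesunleashed.matchbox._

// "How do I find the most frequent domains in a collection?" -- Scala RDD flavour.
// `sc` is the SparkContext provided by spark-shell with aut loaded.
RecordLoader.loadArchives("example.arc.gz", sc)
  .keepValidPages()
  .map(r => ExtractDomain(r.getUrl))
  .countItems()
  .take(10)
```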

Let me know what you think...

lintool added 6 commits Oct 19, 2019
@lintool requested review from ruebot and ianmilligan1 Oct 19, 2019
lintool added 10 commits Oct 19, 2019
@lintool (Member, Author) commented Oct 19, 2019

Also, see https://github.com/archivesunleashed/aut-docs-new/blob/doc-reorg/current/index.md

I've broken down each type of analysis into its own page.

Member left a comment

Small changes, but overall I like the direction the structure is going in.

current/link-analysis.md: 3 resolved review comments (outdated)
@ruebot (Member) commented Oct 19, 2019

...it'll be easier to sort out what needs to be done on archivesunleashed/aut#223 with this new structure.

🤝 @lintool

@lintool (Member, Author) commented Oct 19, 2019

@ruebot I think those issues were in the previous version, but I fixed them anyway.

Note that I haven't reorganized text, link, and image analysis into the "How do I..." format.

@lintool (Member, Author) commented Oct 19, 2019

At some point in time: https://github.com/archivesunleashed/aut-docs-new/blob/doc-reorg/current/collection-analysis.md#to-take-or-to-save

"To Take or To Save" might get it's own page, as it's applicable to every script... and we can link to it. Same treatment with "Filters".

lintool added 3 commits Oct 19, 2019
Member left a comment

Just minor things, which can be taken or left!

current/collection-analysis.md: 2 resolved review comments (outdated)
@@ -0,0 +1,48 @@
## Image Analysis

AUT supports image analysis, a growing area of interest within web archives.

@ianmilligan1 (Member) commented Oct 19, 2019

I've been trying to move us away from acronyms all over the place, so maybe just The Toolkit



### Most frequent image URLs in a collection

@ianmilligan1 (Member) commented Oct 19, 2019

This script is so dated - and now that we can just write the images out directly rather than having to wget them, maybe stick with just that?
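
For reference, the "most frequent image URLs" script under discussion is along these lines (a sketch only; the exact `ExtractImageLinks` signature in matchbox may differ from what is assumed here):

```scala
import io.archivesunleashed._
import io.archivesunleashed.matchbox._

// Most frequent image URLs: pull image links out of each page, then count them.
// Assumes ExtractImageLinks(url, html) returns the image URLs found in a page.
RecordLoader.loadArchives("example.arc.gz", sc)
  .keepValidPages()
  .flatMap(r => ExtractImageLinks(r.getUrl, r.getContentString))
  .countItems()
  .take(10)
```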


### Extraction of Simple Site Link Structure

If your web archive does not have a temporal component, the following Spark script will generate the site-level link structure.

@ianmilligan1 (Member) commented Oct 19, 2019

Spark -> scala (just to keep things clear)?
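
For context, the non-temporal site-level link structure script being discussed looks roughly like this (a sketch based on the aut RDD examples of the period; helper signatures may differ slightly):

```scala
import io.archivesunleashed._
import io.archivesunleashed.matchbox._

// Site-level link structure without a temporal component:
// (source domain, target domain) pairs, counted, keeping pairs seen more than 5 times.
RecordLoader.loadArchives("example.arc.gz", sc)
  .keepValidPages()
  .flatMap(r => ExtractLinks(r.getUrl, r.getContentString))
  .map(link => (ExtractDomain(link._1).replaceAll("^\\s*www\\.", ""),
                ExtractDomain(link._2).replaceAll("^\\s*www\\.", "")))
  .filter(pair => pair._1 != "" && pair._2 != "")
  .countItems()
  .filter(counted => counted._2 > 5)
  .saveAsTextFile("sitelinks/")
```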

.saveAsTextFile("plain-text/")
```

If you wanted to use it on your own collection, you would change "src/test/resources/arc/example.arc.gz" to the directory with your own ARC or WARC files, and change "out/" on the last line to where you want to save your output data.

@ianmilligan1 (Member) commented Oct 19, 2019

This line of explanation might be superfluous now? In any case, we should change src/test/resources/arc/example.arc.gz to just example.arc.gz to reflect the script above (this was probably in the original!).
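
For readers of this thread, the full script whose last line appears in the snippet above is roughly the standard aut plain-text extraction (a sketch; paths are the example values discussed here):

```scala
import io.archivesunleashed._
import io.archivesunleashed.matchbox._

// Plain-text extraction: (crawl date, domain, URL, page text with HTML removed),
// written out to a new "plain-text/" directory.
RecordLoader.loadArchives("example.arc.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
  .saveAsTextFile("plain-text/")
```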



Note that this will create a new directory to store the output, which cannot already exist.

@ianmilligan1 (Member) commented Oct 19, 2019

Maybe these generic examples around manipulating code should just go in one place at the beginning of the docs?


### Plain text by domain

The following Spark script generates plain text renderings for all the web pages in a collection with a URL matching a filter string. In the example case, it will go through the collection and find all of the URLs within the "archive.org" domain.

@ianmilligan1 (Member) commented Oct 19, 2019

For all these, Spark -> scala?
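
And the "plain text by domain" variant in question is essentially the same script with a domain filter added (again a sketch; I'm assuming the `keepDomains` filter available in aut at the time):

```scala
import io.archivesunleashed._
import io.archivesunleashed.matchbox._

// Plain text restricted to pages whose URL falls under the archive.org domain.
RecordLoader.loadArchives("example.arc.gz", sc)
  .keepValidPages()
  .keepDomains(Set("archive.org"))
  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
  .saveAsTextFile("plain-text-domain/")
```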

@lintool (Member, Author) commented Oct 19, 2019

hey @ianmilligan1

I've left image-analysis.md and text-analysis.md alone for now... since they'll need to be rewritten later anyway.

Let's focus on collection-analysis.md and see if we're happy with it?

@ruebot approved these changes Oct 20, 2019
@ruebot merged commit 25d3310 into master Oct 20, 2019