Documentation reorg #2
Also, see https://github.com/archivesunleashed/aut-docs-new/blob/doc-reorg/current/index.md I've broken down each type of analysis into its own page.
Small changes, but overall I like the direction the structure is going in.
...it'll be easier to sort out what needs to be done on archivesunleashed/aut#223 with this new structure.
@ruebot I think those issues were in the previous version, but I fixed them anyway. Note that I haven't organized text, link, and image analysis in the "How do I..." format.
At some point in time: https://github.com/archivesunleashed/aut-docs-new/blob/doc-reorg/current/collection-analysis.md#to-take-or-to-save "To Take or To Save" might get its own page, as it's applicable to every script... and we can link to it. Same treatment with "Filters".
Just minor things, which can be taken or left!
@@ -0,0 +1,48 @@
## Image Analysis

AUT supports image analysis, a growing area of interest within web archives.
ianmilligan1
Oct 19, 2019
Member
I've been trying to move us away from acronyms all over the place, so maybe just "The Toolkit"?
### Most frequent image URLs in a collection
ianmilligan1
Oct 19, 2019
Member
This script is so dated - and now that we can just write the images out directly rather than having to wget them, maybe stick with just that?

### Extraction of Simple Site Link Structure

If your web archive does not have a temporal component, the following Spark script will generate the site-level link structure.
.saveAsTextFile("plain-text/")
```

If you wanted to use it on your own collection, you would change "src/test/resources/arc/example.arc.gz" to the directory with your own ARC or WARC files, and change "out/" on the last line to where you want to save your output data.
ianmilligan1
Oct 19, 2019
Member
This line of explanation might be superfluous now? In any case, we should change `src/test/resources/arc/example.arc.gz` to just `example.arc.gz` to reflect the script above (this was probably in the original!).
Note that this will create a new directory to store the output, which cannot already exist.
ianmilligan1
Oct 19, 2019
Member
Maybe these generic examples around manipulating code should just go in one place at the beginning of the docs?

### Plain text by domain

The following Spark script generates plain text renderings for all the web pages in a collection with a URL matching a filter string. In the example case, it will go through the collection and find all of the URLs within the "archive.org" domain.
hey @ianmilligan1 I've left Let's focus on
lintool commented Oct 19, 2019
Take a look at this example: https://github.com/archivesunleashed/aut-docs-new/blob/doc-reorg/current/collection-analysis.md
Major changes:
Let me know what you think...