Rearranging notebook; prose changes; adds README

ianmilligan1 committed Feb 22, 2019
1 parent 0ba9016 commit 31e640184d050d603d0cc4eabcf78e421afb872e
Showing with 291 additions and 120 deletions.
  1. +63 −38 AUK_Full-Text-collection_id.ipynb
  2. +119 −80 AUK_Full-Text_OUTPUT.ipynb
  3. +53 −0 CONTRIBUTING.md
  4. +11 −0 LICENSE.txt
  5. +45 −2 README.md
@@ -10,21 +10,28 @@
"\n",
"# Welcome\n",
"\n",
"Welcome to the Archives Unleashed Cloud Visualization Demo in Jupyter Notebook for your collection. This demonstration takes the main derivatives from the Cloud and uses Python to analyze and produce information about your collection.\n",
"Welcome to the Archives Unleashed Cloud Visualization Demo Jupyter Notebook. This demonstration takes the main derivatives from the Cloud and uses Python to analyze and produce information about your collection.\n",
"\n",
"This product is in beta, so if you encounter any issues, please post an [issue in our Github repository](https://github.com/archivesunleashed/auk/issues) to let us know about any bugs you encountered or features you would like to see included.\n",
"\n",
"If you have some basic Python coding experience, you can change the code we provided to suit your own needs.\n",
"If you have some basic Python coding experience, you can change the provided code to suit your own needs.\n",
"\n",
"Unfortunately, we cannot support code that you produced yourself. We recommend that you use `File > Make a Copy` first before changing the code in the repository. That way, you can always return to the basic visualizations we have offered here. Of course, you can also just re-download the Jupyter Notebook file from your Archives Unleashed Cloud account.\n",
"We recommend that you use `File > Make a Copy` first before changing the code in the repository. That way, you can always return to the basic visualizations we have offered here. Of course, you can also just re-download the Jupyter Notebook file from your Archives Unleashed Cloud account.\n",
"\n",
"### How Jupyter Notebooks Work:\n",
"\n",
"If you have no previous experience of Jupyter Notebooks, the most important thing to understand is that <Shift><Enter/Return> will run the Python code inside a window and output it to the site.\n",
" \n",
"The window titled `# RUN THIS FIRST` should be the first place you go. This will import all the libraries and set basic variables (e.g. where your derivative files are located) for the notebook. After that, everything else should be able to run on its own.\n",
"The cells that cover the required inputs, marked \"Setup\", need to be run before the rest of the notebook will work. These will import all the libraries and set basic variables (e.g. where your derivative files are located) for the notebook. After that, everything else should be able to run on its own.\n",
"\n",
"If you just want to see the results for your collection, use `Cell > Run All`.\n"
"If you just want to see all results for your collection, use `Cell > Run All`.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Setup"
]
},
{
@@ -37,6 +44,18 @@
"\n",
"from collections import Counter\n",
"import logging\n",
"import matplotlib.pyplot as plt\n",
"import pandas as pd\n",
"import numpy as np\n",
"import networkx as nx\n",
"import ggplot as ggp\n",
"from nltk.tokenize import word_tokenize, sent_tokenize\n",
"from nltk.draw.dispersion import dispersion_plot as dp\n",
"from nltk.classify import NaiveBayesClassifier\n",
"from nltk.corpus import subjectivity\n",
"from nltk.sentiment import SentimentAnalyzer\n",
"from nltk.sentiment.util import *\n",
"from nltk.sentiment.vader import SentimentIntensityAnalyzer\n",
"\n",
"coll_id = \"4656\"\n",
"auk_fp = \"./data/\"\n",
@@ -49,17 +68,9 @@
},
{
"cell_type": "markdown",
"metadata": {
"tags": [
"remove_cell"
]
},
"metadata": {},
"source": [
"# Text Analysis\n",
"\n",
"The following set of functions use the [Natural Language Toolkit](https://www.nltk.org) Python library to search for the top most used words in the collection, as well as facilitate breaking it down by name or domain.\n",
"\n",
"Set the variables below if you wish to make some changes."
"The following cell sets out some user-generated variables. Take a look here: are there any domains you are not interested in? How many words would you like to be shown? Do you want to filter out 404 results? Do you want to sample the data? Read the choices below carefully."
]
},
{
@@ -68,10 +79,6 @@
"metadata": {},
"outputs": [],
"source": [
"## CONSTANTS / CONFIGURATION\n",
"#\n",
"# If you wish to fine tune the outputs, you may change the following:\n",
"#\n",
"# maximum number of words to show in output.\n",
"# Jupyter will create an output error if the number is too high.\n",
"TOP_COUNT = 30 \n",
@@ -111,22 +118,22 @@
"FILTERED_DOMAINS = [] # e.g [\"google\", \"apple\", \"facebook\"]\n",
"\n",
"# List of words not to include in a corpus for text analysis\n",
"STOP_WORDS = ['this', 'that', 'with', 'from', 'your']\n",
"\n",
"## Toolkit imports\n",
"import matplotlib.pyplot as plt\n",
"import pandas as pd\n",
"import numpy as np\n",
"import networkx as nx\n",
"import ggplot as ggp\n",
"from nltk.tokenize import word_tokenize, sent_tokenize\n",
"from nltk.draw.dispersion import dispersion_plot as dp\n",
"from nltk.classify import NaiveBayesClassifier\n",
"from nltk.corpus import subjectivity\n",
"from nltk.sentiment import SentimentAnalyzer\n",
"from nltk.sentiment.util import *\n",
"from nltk.sentiment.vader import SentimentIntensityAnalyzer\n",
"\n",
"STOP_WORDS = ['this', 'that', 'with', 'from', 'your']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The below cell now sets up the functions that drive the analysis throughout this notebook. If you don't run it, you won't be able to work with the data. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def clean_domain(s):\n",
" \"\"\"Extracts the name from the domain (e.g. 'www.google.com' becomes 'google').\"\"\"\n",
" ret = \"\"\n",
@@ -349,8 +356,26 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that you have saved the above functions, you can now use them in a variety of ways. \n",
"As the domain derivative is relatively straightforward, there is not much else that we do with it. "
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": [
"remove_cell"
]
},
"source": [
"# Text Analysis\n",
"\n",
"The following set of functions use the [Natural Language Toolkit](https://www.nltk.org) Python library to search for the top most used words in the collection, as well as facilitate breaking it down by name or domain."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Text by Year"
]
},
@@ -387,7 +412,7 @@
"for i in year_results[:5]:\n",
" print(international(i)[:MAX_CHARACTERS]) # first 50 characters in output\n",
"\n",
"## Commenting out the following will write the results to a `output_filename\n",
"## Removing the # on the following line will write the results to a file entitled `output_filename`\n",
"\n",
"#write_output(output_filename, year_results)"
]
@@ -681,7 +706,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
"version": "3.5.6"
}
},
"nbformat": 4,
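
The diff above shows only the signature and docstring of `clean_domain`; its body is elided. As a rough illustration of the behaviour the docstring describes, and of how a list like `FILTERED_DOMAINS` might be applied, here is a minimal sketch. The names `clean_domain_sketch` and `keep_domain` are hypothetical and this is not the notebook's own implementation; it simply drops a leading `www` and takes the first remaining label.

```python
# Hypothetical sketch, NOT the notebook's clean_domain implementation.
# Assumption: the "name" is the first label after an optional leading "www".

FILTERED_DOMAINS = ["google", "apple", "facebook"]  # example values from the config comment


def clean_domain_sketch(domain):
    """Return the name part of a domain, e.g. 'www.google.com' -> 'google'."""
    labels = [label for label in domain.lower().split(".") if label]
    if labels and labels[0] == "www":
        labels = labels[1:]
    # Simplification: subdomains such as 'mail.google.com' would yield 'mail'.
    return labels[0] if labels else ""


def keep_domain(domain, filtered=FILTERED_DOMAINS):
    """Return True if the cleaned domain name is not in the filter list."""
    return clean_domain_sketch(domain) not in filtered
```

Under these assumptions, `keep_domain("www.google.com")` is false while `keep_domain("archive.org")` is true. Handling subdomains and multi-part suffixes would need more care than this sketch takes.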

Large diffs are not rendered by default.

@@ -0,0 +1,53 @@
# Welcome!

If you are reading this document then you are interested in contributing to the AUK-Notebooks repo. All contributions are welcome: use-cases, documentation, code, patches, bug reports, feature requests, etc. You do not need to be a programmer to speak up!

### Use cases

If you would like to submit a use case for these notebooks, please submit an issue [here](https://github.com/archivesunleashed/auk-notebooks/issues/new), and begin the issue title with "Use Case:".

### Documentation

You can contribute documentation in two different ways. One way is to create an issue [here](https://github.com/archivesunleashed/auk-notebooks/issues/new) and begin the issue title with "Documentation:".

### Request a new feature

To request a new feature you should [open an issue](https://github.com/archivesunleashed/auk-notebooks/issues/new) or create a use case (see the _Use cases_ section above), and summarize the desired functionality. Begin the issue title with "Enhancement:".

### Report a bug

To report a bug you should [open an issue](https://github.com/archivesunleashed/auk-notebooks/issues/new) that summarizes the bug, and begin the issue title with "Bug:".

In order to help us understand and fix the bug it would be great if you could provide us with:

1. The steps to reproduce the bug. This includes information such as the AUK version you were using.
2. The expected behavior.
3. The actual, incorrect behavior.

Feel free to search the issue queue for existing issues (aka tickets) that already describe the problem; if there is such a ticket please add your information as a comment.

### Contribute code

_If you are interested in contributing code to AUK but do not know where to begin:_

In this case you should [browse open issues](https://github.com/archivesunleashed/auk-notebooks/issues).

Contributions to the AUK codebase should be sent as GitHub pull requests. See the section _Create a pull request_ below for details. If there is any problem with the pull request we can work through it using the commenting features of GitHub.

* For _small patches_, feel free to submit pull requests directly for those patches.
* For _larger code contributions_, please use the following process. The idea behind this process is to prevent any wasted work and catch design issues early on.

1. [Open an issue](https://github.com/archivesunleashed/auk-notebooks/issues), if a similar issue does not exist already. If a similar issue does exist, then you may consider participating in the work on the existing issue.
2. Comment on the issue with your plan for implementing the issue. Explain what pieces of the codebase you are going to touch and how everything is going to fit together.
3. The repository committers will work with you on the design to make sure you are on the right track.
4. Implement your issue, create a pull request (see below), and iterate from there.

### Create a pull request

Take a look at [Creating a pull request](https://help.github.com/articles/creating-a-pull-request). In a nutshell you need to:

1. [Fork](https://help.github.com/articles/fork-a-repo) the AUK GitHub repository at [https://github.com/archivesunleashed/auk-notebooks](https://github.com/archivesunleashed/auk-notebooks) to your personal GitHub account.
2. Commit any changes to your fork.
3. Send a [pull request](https://help.github.com/articles/creating-a-pull-request) to the AUK GitHub repository that you forked in step 1. If your pull request is related to an existing issue -- for instance, because you reported a [bug/issue](https://github.com/archivesunleashed/auk-notebooks/issues) earlier -- prefix the title of your pull request with the corresponding issue number (e.g. `issue-123: ...`). Please also include a reference to the issue in the description of the pull request, using '#' plus the issue number (e.g. '#123'). Also try to pick an appropriate name for the branch from which you issue the pull request.

You may want to read [Syncing a fork](https://help.github.com/articles/syncing-a-fork) for instructions on how to keep your fork up to date with the latest changes of the upstream (official) `auk-notebooks` repository.
@@ -0,0 +1,11 @@
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
@@ -1,2 +1,45 @@
# auk-notebooks
Prototying auk notebooks.
# AUK; Jupyter Notebooks

![notebook screenshot](https://user-images.githubusercontent.com/3834704/53252943-1a89b880-368e-11e9-9a9a-31c43a045a55.png)

A prototype Jupyter notebook derivative for the Archives Unleashed Cloud.

## Requirements

Jupyter Notebook. Follow the instructions on [their website](https://jupyter.org).

Dependencies. Any version at or above those listed below should work:

* Python 3.7
* ggplot (0.11.5)
* matplotlib (1.15.1)
* numpy (0.23.4)
* pandas (0.23.4)
* networkx (2.2)
* nltk (3.4)
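
To see which of these dependencies are present in your environment, a small check along the following lines may help. It is a sketch only: the helper name `installed_versions` is hypothetical, the module names are taken from the list above, and it relies on each package exposing a `__version__` attribute, which most but not all packages do.

```python
import importlib

# Module names taken from the dependency list above.
REQUIRED_MODULES = ["ggplot", "matplotlib", "numpy", "pandas", "networkx", "nltk"]


def installed_versions(module_names):
    """Map each module name to its reported version, or None if it cannot be imported."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except Exception:
            # An import can fail for reasons other than a missing package,
            # so any failure is recorded as "not importable".
            versions[name] = None
            continue
        # Not every package exposes __version__, so fall back to "unknown".
        versions[name] = getattr(module, "__version__", "unknown")
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(REQUIRED_MODULES).items():
        print(f"{name}: {version if version is not None else 'not importable'}")
```

Compare the printed versions against the list above by eye; the sketch deliberately does not try to parse or compare version strings.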

## Installation

Download this notebook from the Archives Unleashed Cloud as a derivative (or from here). Place the Cloud derivatives in a directory labelled `data` in the directory that you are running the notebook from.

This repository comes with sample data; you can swap out the sample data with your own Cloud data.

To run this sample:

```
git clone https://github.com/archivesunleashed/auk-notebooks.git
cd auk-notebooks
jupyter notebook
```
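
Before running the notebook, you may want to confirm that your derivative files are where it expects them. The following is a minimal sketch, assuming the `./data/` location used by the notebook's `auk_fp` variable; the helper name `list_derivatives` is hypothetical, and the sketch makes no assumption about what the derivative files are called.

```python
from pathlib import Path


def list_derivatives(data_dir="./data/"):
    """Return the sorted names of the files in data_dir, or an empty list if it is missing."""
    path = Path(data_dir)
    if not path.is_dir():
        return []
    return sorted(p.name for p in path.iterdir() if p.is_file())


if __name__ == "__main__":
    files = list_derivatives()
    if files:
        print("Found files in ./data/:", ", ".join(files))
    else:
        print("No files found in ./data/; place your Cloud derivatives there.")
```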

## Contributing

Please see [contributing guidelines](https://github.com/archivesunleashed/auk/blob/master/CONTRIBUTING.md) for details.

## License

This application is available as open source under the terms of the [Apache License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0).

## Acknowledgments

This work is primarily supported by the [Andrew W. Mellon Foundation](https://uwaterloo.ca/arts/news/multidisciplinary-project-will-help-historians-unlock). Any opinions, findings, and conclusions or recommendations expressed are those of the researchers and do not necessarily reflect the views of the sponsors.
