Use NLTK stopwords, update README (#15)

    - Resolve #14
    - Partially address #13
    - Resolve #17 
    - Update notebooks to use NLTK stopwords
    - Add NLTK stopwords
ruebot authored and ianmilligan1 committed Mar 4, 2019
1 parent c1b1f7c commit d1088fa302aa3a0f157c3b8e731322bd651e377a
@@ -14,9 +14,6 @@ RUN pip install matplotlib==3.0.2 \
networkx==2.2 \
nltk==3.4

# Make things cleaner in Notebook.
RUN rm -rf $HOME/work

# Copy auk-notebook files over.
COPY data $HOME/data
COPY nltk_data $HOME/nltk_data
@@ -1,16 +1,20 @@
# Archives Unleashed Cloud: Jupyter Notebooks
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/archivesunleashed/auk-notebooks/master?filepath=auk-notebook-example.ipynb)
[![Docker Stars](https://img.shields.io/docker/stars/archivesunleashed/auk-notebooks.svg)](https://hub.docker.com/r/archivesunleashed/auk-notebooks/)
[![Docker Pulls](https://img.shields.io/docker/pulls/archivesunleashed/auk-notebooks.svg)](https://hub.docker.com/r/archivesunleashed/auk-notebooks/)
[![LICENSE](https://img.shields.io/badge/license-Apache-blue.svg?style=flat-square)](./LICENSE)
[![Contribution Guidelines](http://img.shields.io/badge/CONTRIBUTING-Guidelines-blue.svg)](./CONTRIBUTING.md)

[Jupyter](https://jupyter.org/) notebooks to assist in creating additional analysis and visualizations of Archives Unleashed Cloud derivatives.

![notebook screenshot](https://user-images.githubusercontent.com/3834704/53252943-1a89b880-368e-11e9-9a9a-31c43a045a55.png)

## Requirements

Jupyter Notebook. Follow the installation instructions on [their website](https://jupyter.org).
[Anaconda Distribution](https://www.anaconda.com/distribution/#download-section) is very helpful here.

Dependencies. Any version higher than below _should_ work:

* Python 3.7
* Python 3.7+
* [Jupyter Notebook](https://jupyter.org) (1.0.0)
* matplotlib (3.0.2)
* numpy (1.15.1)
* pandas (0.23.4)
@@ -19,34 +23,40 @@ Dependencies. Any version higher than below _should_ work:

## Usage

We suggest using [Docker](https://www.docker.com/get-started):
We suggest using [Docker](https://www.docker.com/get-started), or [Anaconda Distribution](https://www.anaconda.com/distribution).

### Docker Hub

```bash
git clone https://github.com/archivesunleashed/auk-notebooks.git
cd auk-notebooks
docker build -t auk-notebook .
docker run --rm -it -p 8888:8888 auk-notebook
docker run --rm -it -p 8888:8888 archivesunleashed/auk-notebooks
```

If you have the dependencies installed:
### Docker Locally

```bash
git clone https://github.com/archivesunleashed/auk-notebooks.git
cd auk-notebooks
jupyter notebook
docker build -t auk-notebook .
docker run --rm -it -p 8888:8888 auk-notebook
```

This repository comes with sample data, you can swap out the sample data with your own Cloud data.
This repository comes with sample data, you can swap out the sample data with your own Archives Unleashed Cloud data.

```bash
docker run --rm -it -p 8888:8888 -v "/path/to/own/data:/home/jovyan/data" auk-notebook
```

> [You must grant the within-container notebook user or group (NB_UID or NB_GID) write access to the host directory (e.g., sudo chown 1000 /some/host/folder/for/work).](https://jupyter-docker-stacks.readthedocs.io/en/latest/using/common.html#docker-options)
This repository also uses the [Jupyter Docker Stacks](https://jupyter-docker-stacks.readthedocs.io/en/latest/index.html), which provide [a lot of helpful options to take advantage of](https://jupyter-docker-stacks.readthedocs.io/en/latest/using/common.html#docker-options).

## Contributing
### Local (Anaconda)

Please see [contributing guidelines](https://github.com/archivesunleashed/auk-notebooks/blob/master/CONTRIBUTING.md) for details.
```bash
git clone https://github.com/archivesunleashed/auk-notebooks.git
cd auk-notebooks
jupyter notebook
```

## License

@@ -55,6 +55,7 @@
"from nltk.sentiment import SentimentAnalyzer\n",
"from nltk.sentiment.util import *\n",
"from nltk.sentiment.vader import SentimentIntensityAnalyzer\n",
"from nltk.corpus import stopwords\n",
"\n",
"# Add the collection id of your Archive-It collection:\n",
"coll_id = \"\"\n",
@@ -121,7 +122,7 @@
"FILTERED_DOMAINS = [] # e.g [\"google\", \"apple\", \"facebook\"]\n",
"\n",
"# List of words not to include in a corpus for text analysis\n",
"STOP_WORDS = ['this', 'that', 'with', 'from', 'your']"
"STOP_WORDS = set(stopwords.words('english'))"
]
},
{
@@ -748,10 +749,8 @@
"metadata": {},
"source": [
"# Bibliography\n",
"\n",
"Bird, Steven, Edward Loper and Ewan Klein (2009), *Natural Language Processing with Python*. O’Reilly Media Inc.\n",
"\n",
"Archives Unleashed Project. (2018). Archives Unleashed Toolkit (Version 0.17.0). Apache License, Version 2.0."
"- Archives Unleashed Project. (2018). Archives Unleashed Toolkit (Version 0.17.0). Apache License, Version 2.0.\n",
"- Bird, Steven, Edward Loper and Ewan Klein (2009), *Natural Language Processing with Python*. O’Reilly Media Inc.\n"
]
}
],
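Reviewer note on the notebook hunk above: the commit replaces a hand-written five-word `STOP_WORDS` list with `set(stopwords.words('english'))`. The sketch below illustrates what that change means for a simple word count. It is not code from the notebooks: `load_stop_words`, `top_words`, `FALLBACK_STOP_WORDS`, and the sample sentence are hypothetical names introduced here, and the fallback to the old five-word list exists only so the example still runs where NLTK or its stopwords corpus is not installed.

```python
import re
from collections import Counter

# The five-word list the notebook used before this commit.
FALLBACK_STOP_WORDS = {"this", "that", "with", "from", "your"}


def load_stop_words(language="english"):
    """Return NLTK's stop word set, or the old list if NLTK or its corpus is missing."""
    try:
        from nltk.corpus import stopwords
        return set(stopwords.words(language))
    except (ImportError, LookupError):
        # ImportError: NLTK is not installed.
        # LookupError: NLTK is installed but the stopwords corpus was not found.
        return set(FALLBACK_STOP_WORDS)


def top_words(text, stop_words, n=3):
    """Hypothetical helper: most frequent lowercase tokens that are not stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    kept = [token for token in tokens if token not in stop_words]
    return Counter(kept).most_common(n)


STOP_WORDS = load_stop_words()
sample = "Download this archive from your collection with care; the archive is large."
print(top_words(sample, STOP_WORDS, n=1))
```

Keeping the stop words in a `set`, as the commit does, makes each `token not in stop_words` check a constant-time lookup on average, which matters more once the list grows from five words to NLTK's full English list.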
Binary file not shown.
@@ -0,0 +1,32 @@
Stopwords Corpus

This corpus contains lists of stop words for several languages. These
are high-frequency grammatical words which are usually ignored in text
retrieval applications.

They were obtained from:
http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/snowball/stopwords/

The stop words for the Romanian language were obtained from:
http://arlc.ro/resources/

The English list has been augmented
https://github.com/nltk/nltk_data/issues/22

The German list has been corrected
https://github.com/nltk/nltk_data/pull/49

A Kazakh list has been added
https://github.com/nltk/nltk_data/pull/52

A Nepali list has been added
https://github.com/nltk/nltk_data/pull/83

An Azerbaijani list has been added
https://github.com/nltk/nltk_data/pull/100

A Greek list has been added
https://github.com/nltk/nltk_data/pull/103

An Indonesian list has been added
https://github.com/nltk/nltk_data/pull/112
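Reviewer note on the corpus files added in this commit: the stop word lists are plain text with one word per line, as the Arabic file below shows, and some entries repeat. A minimal sketch of reading such a file into a set without going through NLTK follows. `read_stop_word_file` is a hypothetical helper, the demo file is created on the fly rather than read from the repository, and UTF-8 encoding is an assumption.

```python
import tempfile
from pathlib import Path


def read_stop_word_file(path):
    """Read a one-word-per-line stop word file into a set, skipping blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


# Demo with a throwaway file; a real call would point at a file in the corpus directory.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo_stopwords"
    demo.write_text("this\nthat\n\nthat\n", encoding="utf-8")
    words = read_stop_word_file(demo)

print(sorted(words))
```

Because the result is a set, repeated entries in a list collapse to a single member.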
@@ -0,0 +1,248 @@
إذ
إذا
إذما
إذن
أف
أقل
أكثر
ألا
إلا
التي
الذي
الذين
اللاتي
اللائي
اللتان
اللتيا
اللتين
اللذان
اللذين
اللواتي
إلى
إليك
إليكم
إليكما
إليكن
أم
أما
أما
إما
أن
إن
إنا
أنا
أنت
أنتم
أنتما
أنتن
إنما
إنه
أنى
أنى
آه
آها
أو
أولاء
أولئك
أوه
آي
أي
أيها
إي
أين
أين
أينما
إيه
بخ
بس
بعد
بعض
بك
بكم
بكم
بكما
بكن
بل
بلى
بما
بماذا
بمن
بنا
به
بها
بهم
بهما
بهن
بي
بين
بيد
تلك
تلكم
تلكما
ته
تي
تين
تينك
ثم
ثمة
حاشا
حبذا
حتى
حيث
حيثما
حين
خلا
دون
ذا
ذات
ذاك
ذان
ذانك
ذلك
ذلكم
ذلكما
ذلكن
ذه
ذو
ذوا
ذواتا
ذواتي
ذي
ذين
ذينك
ريث
سوف
سوى
شتان
عدا
عسى
عل
على
عليك
عليه
عما
عن
عند
غير
فإذا
فإن
فلا
فمن
في
فيم
فيما
فيه
فيها
قد
كأن
كأنما
كأي
كأين
كذا
كذلك
كل
كلا
كلاهما
كلتا
كلما
كليكما
كليهما
كم
كم
كما
كي
كيت
كيف
كيفما
لا
لاسيما
لدى
لست
لستم
لستما
لستن
لسن
لسنا
لعل
لك
لكم
لكما
لكن
لكنما
لكي
لكيلا
لم
لما
لن
لنا
له
لها
لهم
لهما
لهن
لو
لولا
لوما
لي
لئن
ليت
ليس
ليسا
ليست
ليستا
ليسوا
ما
ماذا
متى
مذ
مع
مما
ممن
من
منه
منها
منذ
مه
مهما
نحن
نحو
نعم
ها
هاتان
هاته
هاتي
هاتين
هاك
هاهنا
هذا
هذان
هذه
هذي
هذين
هكذا
هل
هلا
هم
هما
هن
هنا
هناك
هنالك
هو
هؤلاء
هي
هيا
هيت
هيهات
والذي
والذين
وإذ
وإذا
وإن
ولا
ولكن
ولو
وما
ومن
وهو
يا