Make stopword list for hc report terms configurable #14

Open
shawnmjones opened this issue Jun 25, 2020 · 1 comment

@shawnmjones shawnmjones commented Jun 25, 2020

The stopwords for hc report terms are currently hardcoded. Even worse, they are hardcoded only in the sumgram code and not in the general n-gram code.

# TODO: load these from a file
added_stopwords = [
"associated press",
"com",
"donald trump",
"fox news",
"abc news",
"getty images",
"last month",
"last week",
"last year",
"pic",
"pinterest reddit",
"pm et",
"president donald",
"president donald trump",
"president trump",
"president trump's",
"print mail",
"reddit print",
"said statement",
"send whatsapp",
"sign up",
"trump administration",
"trump said",
"twitter",
"united states",
"washington post",
"white house",
"whatsapp pinterest",
"subscribe whatsapp",
"york times",
"privacy policy",
"terms use"
]
added_stopwords.append( "{} read".format(last_year) )
added_stopwords.append( "{} read".format(current_year) )
stopmonths = [
"january",
"february",
"march",
"april",
"may",
"june",
"july",
"august",
"september",
"october",
"november",
"december"
]
# add just the month to the stop words
added_stopwords.extend(stopmonths)
stopmonths_short = [
"jan",
"feb",
"mar",
"apr",
"may",
"jun",
"jul",
"aug",
"sep",
"oct",
"nov",
"dec"
]
added_stopwords.extend(stopmonths_short)
# add the day of the week, too
added_stopwords.extend([
"monday",
"tuesday",
"wednesday",
"thursday",
"friday",
"saturday",
"sunday"
])
added_stopwords.extend([
"mon",
"tue",
"wed",
"thu",
"fri",
"sat",
"sun"
])
# for i in range(1, 13):
#     added_stopwords.append(
#         datetime(current_year, i, current_date).strftime('%b %Y')
#     )
#     added_stopwords.append(
#         datetime(last_year, i, current_date).strftime('%b %Y')
#     )
# for i in range(1, 13):
#     added_stopwords.append(
#         datetime(current_year, i, current_date).strftime('%B %Y')
#     )
#     added_stopwords.append(
#         datetime(last_year, i, current_date).strftime('%B %Y')
#     )
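
As a rough sketch of what the "load these from a file" TODO could look like: the function below reads one stopword or stop phrase per line from a plain-text file. The function name `load_added_stopwords`, the file format, and the `{current_year}` / `{last_year}` placeholders are all assumptions made for illustration; nothing like this exists in the code yet.

```python
from datetime import date
from pathlib import Path


def load_added_stopwords(path, today=None):
    """Read one stopword or stop phrase per line (hypothetical file format).

    Blank lines and lines starting with '#' are skipped. The placeholders
    {current_year} and {last_year} are filled in, so a line such as
    "{current_year} read" replaces the hardcoded .format() calls.
    """
    today = today or date.today()
    substitutions = {
        "current_year": today.year,
        "last_year": today.year - 1,
    }

    stopwords = []
    for raw_line in Path(path).read_text(encoding="utf8").splitlines():
        entry = raw_line.strip().lower()
        if not entry or entry.startswith("#"):
            continue
        stopwords.append(entry.format(**substitutions))

    return stopwords
```

With something like this, the months, weekdays, and news-outlet phrases could all live in a user-editable file, and a command-line option could point the report at an alternative list.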

The generic terms report will need to accept the same stopword list at get_document_tokens:

def get_document_tokens(urim, cache_storage, ngram_length):

    from hypercane.utils import get_boilerplate_free_content
    from nltk.corpus import stopwords
    from nltk import word_tokenize, ngrams
    import string

    # TODO: stoplist based on language of the document
    stoplist = list(set(stopwords.words('english')))
    punctuation = [ i for i in string.punctuation ]
    additional_stopchars = [ '’', '‘', '“', '”', '•', '·', '—', '–', '›', '»']
    stop_numbers = [ str(i) for i in range(0, 11) ]
    allstop = stoplist + punctuation + additional_stopchars + stop_numbers

    content = get_boilerplate_free_content(urim, cache_storage=cache_storage)
    doc_tokens = word_tokenize(content.decode('utf8').lower())
    doc_tokens = [ token for token in doc_tokens if token not in allstop ]

    table = str.maketrans('', '', string.punctuation)
    doc_tokens = [ w.translate(table) for w in doc_tokens ]
    doc_tokens = [ w for w in doc_tokens if len(w) > 0 ]
    doc_ngrams = ngrams(doc_tokens, ngram_length)

    return list(doc_ngrams)
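
One possible way to apply the shared list here, sketched without the nltk and hypercane dependencies so it runs on its own. The helper name `ngrams_without_stopwords` is hypothetical. It assumes single-word entries should be removed before n-grams are built (as `allstop` does now) and that multi-word entries such as "associated press" should only drop n-grams that match them exactly.

```python
def ngrams_without_stopwords(doc_tokens, ngram_length, added_stopwords):
    """Build n-grams from doc_tokens, honoring a configurable stopword list.

    Hypothetical helper: single-word entries filter tokens, multi-word
    entries filter whole n-grams that match the phrase exactly.
    """
    stop_phrases = {phrase.strip().lower() for phrase in added_stopwords}
    stop_tokens = {phrase for phrase in stop_phrases if " " not in phrase}

    # Same idea as the allstop filter: drop stop tokens before building n-grams.
    tokens = [token for token in doc_tokens if token not in stop_tokens]

    # Sliding window over the remaining tokens, yielding tuples of length n.
    doc_ngrams = zip(*(tokens[i:] for i in range(ngram_length)))

    # Drop n-grams whose joined text is itself a stop phrase.
    return [ngram for ngram in doc_ngrams if " ".join(ngram) not in stop_phrases]
```

A stop phrase that is shorter than the requested n-gram length (a bigram inside a trigram, say) is not caught by this sketch; whether it should be is a design decision still to be made.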


@shawnmjones shawnmjones commented Jun 25, 2020

See "Automatically Building a Stopword List for an Information Retrieval System" for an idea of how we might compute stopwords automatically. I suspect that we need to apply stopwords elsewhere to improve the results of DSA1. Given this, we might want to give the problem a little more thought before just testing and releasing the recent code changes.
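
To make the idea of computed stopwords concrete, here is a minimal sketch of a much simpler heuristic than the one in the paper: treat any term that appears in a large fraction of the collection's documents as a stopword candidate. This is a plain document-frequency cutoff, not the paper's method, and the function name and the `min_doc_fraction` default are assumptions chosen for illustration.

```python
from collections import Counter


def candidate_stopwords(documents, min_doc_fraction=0.8):
    """Return terms found in at least min_doc_fraction of the documents.

    documents is a list of token lists, one per document. This is a simple
    document-frequency heuristic, not the approach from the cited paper.
    """
    if not documents:
        return []

    doc_freq = Counter()
    for tokens in documents:
        # Count each term once per document, however often it occurs there.
        doc_freq.update(set(tokens))

    threshold = min_doc_fraction * len(documents)
    return sorted(term for term, count in doc_freq.items() if count >= threshold)
```

For a collection about a single news event, a cutoff like this would likely flag the very names and outlets that are hardcoded above, though the threshold would need tuning per collection.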
