Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add user stop words Madlib. #45 #48

Merged
merged 5 commits into from Mar 27, 2019

Conversation

Projects
None yet
3 participants
@greebie
Copy link
Collaborator

greebie commented Mar 25, 2019

This PR gives the user the options to use NLTK stopwords and/or add their own. This was a common demand from the datathon.

I also removed the International () function call because it is not doing what I originally thought it was doing.

  • add set for user stopwords.
  • remove misleading 'international' call until it's fixed.
Add user stop words Madlib.
- add set for user stopwords.
- remove misleading 'international' call until it's fixed.
@ianmilligan1
Copy link
Member

ianmilligan1 left a comment

Diff was too large to provide inline comments.

Looks good. The suggestions I'd make are:

  • for the example STOP_WORDS_USER variable, let's make it a list so that they can get a feel for providing multiple variables. i.e. STOP_WORDS_USER = {'south','north'}
  • make explicit that the STOP_WORDS_USER are added to the NLTK stop words. This is clear from the code but the text comments could make it clear.
  • let's remind the user in the comment or text around the # Get a list of the top words in the collection (regardless of year). cell - i.e. that if you want to remove words from it, they can edit STOP_WORDS_USER. They may not know what stop words are and this would give them the context behind what it means.

Otherwise, works quite well.

I'm fine with removing the internationalization code here – it is straightforward and I'll defer to you and the datathon participants findings – but I think in the future we should try to keep PRs to one thing in scope.

@greebie

This comment has been minimized.

Copy link
Collaborator Author

greebie commented Mar 25, 2019

Okay. Will fix.

I knew you were going to say that about international(), but I just couldn't bear seeing it just be wrong and out there in the open for such a small change. Also, I knew that #42 was still in the issues, so the tracking is still available.

greebie added some commits Mar 25, 2019

@greebie

This comment has been minimized.

Copy link
Collaborator Author

greebie commented Mar 25, 2019

Latest commit should resolve all issues stated above.

@ianmilligan1
Copy link
Member

ianmilligan1 left a comment

Looking good!

I think we might want to highlight it a bit more than even now, as I worry that

vals = [x[1] for x in tokens if x[0] not in STOP_WORDS] # Filter based on STOP_WORDS (above).

might not totally make sense to a novice or even intermediate user.

How about something like this?

Screen Shot 2019-03-25 at 7 27 15 PM

@greebie

This comment has been minimized.

Copy link
Collaborator Author

greebie commented Mar 25, 2019

I think that's a different issue which might be fixed more generally in another branch, especially if we look at making a module to cover the functions.

Essentially we have general madlibs and more specific madlibs and it's not always clear which is which. I don't think that it's that helpful to tell people to scroll up to change something. Let's look at the overall explanations / docs / comments together as a separate issue. (Maybe after receiving feedback from the datathon as well.)

@ianmilligan1

This comment has been minimized.

Copy link
Member

ianmilligan1 commented Mar 25, 2019

OK. I still don't think

vals = [x[1] for x in tokens if x[0] not in STOP_WORDS] # Filter based on STOP_WORDS (above).

is clear? Filter what? Filter in? Filter out? Can you try to take a kick at the can to make it a bit more usable?

@ruebot

This comment has been minimized.

Copy link
Member

ruebot commented Mar 27, 2019

I think that's a different issue which might be fixed more generally in another branch, especially if we look at making a module to cover the functions.

No. That's in scope for this PR. Ian's suggestion makes sense. Can you update your PR to include that, and we can continue to iterate from there.

@ruebot
Copy link
Member

ruebot left a comment

@greebie

This comment has been minimized.

Copy link
Collaborator Author

greebie commented Mar 27, 2019

Now that I've been able to look at this more closely, I see your point. I have some follow-up work this morning, but will get this fixed in the afternoon.

Add wording for user to edit STOP_WORDS_USER.
Remove filter STOP_WORDS comment.
@@ -645,7 +645,7 @@
"source": [
"## Overall Collection Characteristics\n",
"\n",
"Change the variables in the following cell to manipulate the analysis you'll be running to understand overall collection characteristics. In this case, they mostly affect the visualization that we generate.\n"
"Change the variables in the following cell to manipulate the analysis you'll be running to understand overall collection characteristics. In this case, they mostly affect the visualization that we generate. You may wish to add or remove words to `STOP_WORDS_USER`in the user configuration cell to fine-tune the results.\n"

This comment has been minimized.

@ianmilligan1

ianmilligan1 Mar 27, 2019

Member

maybe fine-tune the results -> remove words that may be overwhelming your analysis

@ruebot

ruebot approved these changes Mar 27, 2019

@ruebot ruebot merged commit ab403c2 into master Mar 27, 2019

1 check passed

ci/dockercloud Your tests passed in Docker Cloud
Details

@ruebot ruebot deleted the issue-45 branch Mar 27, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.