Basic Token Frequency #4

Closed
ianmilligan1 opened this issue Nov 11, 2019 · 4 comments
Member

@ianmilligan1 ianmilligan1 commented Nov 11, 2019

Could we add some basic token frequency after tokens are generated? Most popular words, etc. If it could be broken down by date, that might also be interesting (most popular words in year1 vs year2). That could flow into the word cloud nicely, I think.
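A minimal sketch of the per-year breakdown described above, using only the standard library. It assumes each record pairs a `crawl_date` string in `YYYYMMDD` form with an already tokenized list of words; that record layout and the function name `top_tokens_by_year` are hypothetical and not taken from the project's notebooks.

```python
from collections import Counter, defaultdict


def top_tokens_by_year(records, n=10):
    """Return {year: [(token, count), ...]} with the n most frequent tokens per year.

    `records` is an iterable of (crawl_date, tokens) pairs (assumed layout).
    """
    counts = defaultdict(Counter)
    for crawl_date, tokens in records:
        year = crawl_date[:4]  # assumes crawl_date looks like "YYYYMMDD"
        counts[year].update(tokens)
    return {year: counter.most_common(n) for year, counter in counts.items()}


# Made-up sample data, for illustration only.
records = [
    ("20180105", ["archive", "web", "archive"]),
    ("20180720", ["web", "archive"]),
    ("20190311", ["crawl", "crawl", "web"]),
]
result = top_tokens_by_year(records, n=2)
```

The per-year lists could then be fed to a word cloud, one cloud per year, to compare year1 against year2.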

Member

@ruebot ruebot commented Jan 7, 2020

The tokenization we do now is per row, and each row includes a crawl_date column. So, are you thinking of adding another column with the most popular words, say 10-20 per row with stop words removed? Or a graph of some sort that combines word distribution over time? If so, @lintool, what would work well for that, graph-wise?
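As a rough illustration of the first option (an extra column holding each row's most popular words), here is a sketch under stated assumptions: rows are plain dictionaries with a `tokens` list next to `crawl_date`, and `STOP_WORDS`, `top_words`, and the `top_words` column name are hypothetical. The stop word list is deliberately tiny; a real one would be much longer.

```python
from collections import Counter

# Tiny illustrative stop word list (assumption); a real list would be much longer.
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is"}


def top_words(tokens, n=10, stop_words=STOP_WORDS):
    """Return the n most frequent lowercased tokens, skipping stop words."""
    filtered = [t.lower() for t in tokens if t.lower() not in stop_words]
    return [word for word, _ in Counter(filtered).most_common(n)]


# Made-up rows, for illustration only; each row keeps its crawl_date.
rows = [
    {"crawl_date": "20191111",
     "tokens": ["The", "archive", "of", "the", "web", "archive"]},
]
for row in rows:
    row["top_words"] = top_words(row["tokens"], n=2)  # new per-row column
```

Keeping the result next to `crawl_date` means the same rows could later be grouped by date for the graph option.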

Member Author

@ianmilligan1 ianmilligan1 commented Jan 7, 2020

> So, are you thinking adding another column with most popular words, say 10-20 per row with stop words removed

This is what I was thinking, but

> a graph of some sort that combines word distribution over time

might be more effective? Curious if @lintool has any suggestions...

Member

@ruebot ruebot commented Jan 12, 2020

How's this now? Does it cover the spirit of the issue?

https://github.com/archivesunleashed/notebooks/blob/master/parquet_text_analyis.ipynb

Member Author

@ianmilligan1 ianmilligan1 commented Jan 16, 2020

Looks good!
