Implementing Full-Text Filtering by Domains within AUK #197

Closed
ianmilligan1 opened this Issue Oct 1, 2018 · 2 comments

ianmilligan1 commented Oct 1, 2018

Is your feature request related to a problem? Please describe.

Right now, users can receive the full text of a web archival collection. Some of these are very big files, and more importantly, they often require a level of filtering to get to the useful content: filtering by date, for example, or filtering by domain.

At our datathons, several teams have compared the text of domains in a web archive collection to see what they say about a Supreme Court nominee, labour disruptions, pipelines, etc.

Right now we have documentation on how to filter these files, but this requires that a scholar know how to use the command line and grep.

Describe the solution you'd like

Ideally, it would be nice to bake some of these grep commands into the Archives Unleashed Cloud. See the screenshot below:

[Screenshot: the domains list in the Archives Unleashed Cloud, 2018-10-01]

The steps in this process would be:

  1. Next to each of the domains, a "download" button is populated;
  2. When that button is pressed, a background job starts that executes syntax similar to `grep ',www.khyber.ca,' 9481-fulltext.txt > 9481-www-khyber-ca-text.txt`. In general, it would be `grep ',DOMAIN,' COLLECTIONNUMBER-fulltext.txt > COLLECTIONNUMBER-domain-text.txt`;
  3. An e-mail is sent to the user notifying them that the file is ready for download. Perhaps a download icon would appear next to the "download" button, or something similar.

The actual look and feel of the buttons will probably be different.
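The background-job step above could be sketched roughly as follows. This is a minimal illustration, not the Cloud's actual code: the variable names, the sample input lines, and the assumption that the domain is the second comma-separated field of the full-text derivative are all mine.

```shell
# Hypothetical sketch of the per-domain filtering job.
COLLECTION=9481
DOMAIN=www.khyber.ca

# Tiny stand-in for the full-text derivative (crawl date, domain, URL, text).
printf '(20181001,www.khyber.ca,http://www.khyber.ca/,Welcome page text)\n' >  "${COLLECTION}-fulltext.txt"
printf '(20181001,example.com,http://example.com/,Some other site)\n'       >> "${COLLECTION}-fulltext.txt"

# Build the output filename by swapping dots for hyphens, as in the example above.
OUTPUT="${COLLECTION}-$(echo "$DOMAIN" | tr '.' '-')-text.txt"

# Fixed-string match (-F) on ",DOMAIN," so the dots are not treated as regex wildcards.
grep -F ",${DOMAIN}," "${COLLECTION}-fulltext.txt" > "$OUTPUT"
```

Using `-F` (fixed strings) avoids the subtle bug where `.` in a bare `grep` pattern matches any character, so `www.khyber.ca` would also match `wwwxkhyber.ca`.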

Describe alternatives you've considered

There are two main alternatives.

Option A: Using Spark

I'm suggesting a bash command here for two reasons:

  1. The Spark queue can get backed up, and until we have a full DataFrame implementation we don't want to pass over every single WARC just to get text from one domain;
  2. Most importantly, we delete the WARCs, and we don't want to hold on to them longer than necessary just in case somebody might run one of these jobs down the road. This approach runs off the full-text derivative, which makes more sense to me from a resource perspective.

Option B: Pointing Users to Documentation

This is functionality we already tell people how to do by hand here. Relying on that makes our service a bit more difficult to use. Building it in would let users just click a button, download the text file, and paste it into something like Voyant right away.

ruebot added a commit that referenced this issue Nov 2, 2018

ruebot commented Nov 2, 2018

@ianmilligan1 7d3f89f#diff-52f1a25071caad35c15ee1e5c34a1ff1R87

Let me know what you want to call that button, and what the tool-tip text should be.

ianmilligan1 commented Nov 2, 2018

Looking great!

How about button: Text by Domains

And tool-tip text: A zip file that contains the text of the top ten domains within a web archive, each in its own text file. Within each file you can find the crawl date, full URL, and the plain text of each page.
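Generating that derivative could look something like the sketch below. This is a hedged illustration only (the real implementation is in the commits referenced in this issue); the sample data and the assumption that the domain sits in the second comma-separated field are mine.

```shell
# Hypothetical sketch: one text file per top domain, later zipped for download.
COLLECTION=9481

# Tiny stand-in for the full-text derivative (crawl date, domain, URL, text).
printf '(20181001,www.khyber.ca,http://www.khyber.ca/,Welcome page text)\n' >  "${COLLECTION}-fulltext.txt"
printf '(20181001,example.com,http://example.com/,Some other site)\n'       >> "${COLLECTION}-fulltext.txt"

# Count occurrences of each domain (assumed to be field 2), keep the top ten,
# then filter the full text into one file per domain.
cut -d',' -f2 "${COLLECTION}-fulltext.txt" | sort | uniq -c | sort -rn \
  | head -10 | awk '{print $2}' \
  | while read -r domain; do
      grep -F ",${domain}," "${COLLECTION}-fulltext.txt" \
        > "${COLLECTION}-$(echo "$domain" | tr '.' '-')-text.txt"
    done

# The resulting per-domain files would then be bundled into the zip for download.
```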

@ruebot ruebot closed this in a0be875 Nov 2, 2018

ianmilligan1 added a commit that referenced this issue Nov 2, 2018

ruebot added a commit that referenced this issue Nov 9, 2018

Update Docs, About, and Learning Guides to Include Full-Text by Domain Derivative, Resolves #202 (#205)

* Adds language to text filter guide to reflect #197
* Documents _5th_ derivative file: text by domain
* Minor update to "about" page
* Adding new image with text by domain derivative