Implementing Full-Text Filtering by Domains within AUK #197

Closed
ianmilligan1 opened this Issue Oct 1, 2018 · 2 comments

ianmilligan1 commented Oct 1, 2018

Is your feature request related to a problem? Please describe.

Right now, users can receive the full text of a web archival collection. Some of these are very big files, and more importantly, they often require a level of filtering to get to the useful content: filtering by date, for example, or filtering by domain.

At our datathons, several teams have compared the text of domains in a web archive collection to see what they say about a Supreme Court nominee, labour disruptions, pipelines, etc.

Right now we have documentation on how to filter these files, but this requires that a scholar know how to use the command line and grep.

Describe the solution you'd like

Ideally, it would be nice to bake some of these grep commands into the Archives Unleashed Cloud. See the screenshot below:

[Screenshot: the domains list in the Archives Unleashed Cloud, 2018-10-01]

The steps in this process would be:

  1. Next to each of the domains, a "download" button is populated;
  2. When that button is pressed, a background job starts that executes syntax similar to `grep ',www.khyber.ca,' 9481-fulltext.txt > 9481-www-khyber-ca-text.txt`. In general, it would be `grep ',DOMAIN,' COLLECTIONNUMBER-fulltext.txt > COLLECTIONNUMBER-domain-text.txt`;
  3. An e-mail is sent to the user notifying them that the file is ready for download. Perhaps a download icon would appear next to the "download" button, or something similar.

The actual look and feel of the buttons will probably be different.
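The background-job step above could be sketched roughly as follows. This is a minimal illustration, not the Cloud's actual code: the variable names, the sample input lines, and the assumption that the domain is the second comma-separated field of the full-text derivative are all mine.

```shell
# Hypothetical sketch of the per-domain filtering job.
COLLECTION=9481
DOMAIN=www.khyber.ca

# Tiny stand-in for the full-text derivative (crawl date, domain, URL, text).
printf '(20181001,www.khyber.ca,http://www.khyber.ca/,Welcome page text)\n' >  "${COLLECTION}-fulltext.txt"
printf '(20181001,example.com,http://example.com/,Some other site)\n'       >> "${COLLECTION}-fulltext.txt"

# Build the output filename by swapping dots for hyphens, as in the example above.
OUTPUT="${COLLECTION}-$(echo "$DOMAIN" | tr '.' '-')-text.txt"

# Fixed-string match (-F) on ",DOMAIN," so the dots are not treated as regex wildcards.
grep -F ",${DOMAIN}," "${COLLECTION}-fulltext.txt" > "$OUTPUT"
```

Using `-F` (fixed strings) avoids the subtle bug where `.` in a bare `grep` pattern matches any character, so `www.khyber.ca` would also match `wwwxkhyber.ca`.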

Describe alternatives you've considered

There are two main alternatives.

Option A: Using Spark

I'm suggesting a bash command here for two reasons:

  1. The Spark queue can get backed up, and until we have a full DataFrame implementation we don't want to pass over every single WARC just to get text from one domain;
  2. Most importantly, we delete the WARCs, and we don't want to hold on to them longer than necessary just in case somebody might run one of these jobs down the road. This approach runs off the full-text derivative, which makes more sense to me from a resource perspective.

Option B: Pointing Users to Documentation

This is functionality we already tell people how to do by hand here. Relying on that makes our service a bit more difficult to use. Building it in would let users just click a button, download the text file, and paste it into something like Voyant right away.

ruebot added a commit that referenced this issue Nov 2, 2018

ruebot commented Nov 2, 2018

@ianmilligan1 7d3f89f#diff-52f1a25071caad35c15ee1e5c34a1ff1R87

Let me know what you want to call that button, and what the tool-tip text should be.

ianmilligan1 commented Nov 2, 2018

Looking great!

How about button: Text by Domains

And tool-tip text: A zip file that contains the text of the top ten domains within a web archive, each in its own text file. Within each file you can find the crawl date, full URL, and the plain text of each page.
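Generating that derivative could look something like the sketch below. This is a hedged illustration only (the real implementation is in the commits referenced in this issue); the sample data and the assumption that the domain sits in the second comma-separated field are mine.

```shell
# Hypothetical sketch: one text file per top domain, later zipped for download.
COLLECTION=9481

# Tiny stand-in for the full-text derivative (crawl date, domain, URL, text).
printf '(20181001,www.khyber.ca,http://www.khyber.ca/,Welcome page text)\n' >  "${COLLECTION}-fulltext.txt"
printf '(20181001,example.com,http://example.com/,Some other site)\n'       >> "${COLLECTION}-fulltext.txt"

# Count occurrences of each domain (assumed to be field 2), keep the top ten,
# then filter the full text into one file per domain.
cut -d',' -f2 "${COLLECTION}-fulltext.txt" | sort | uniq -c | sort -rn \
  | head -10 | awk '{print $2}' \
  | while read -r domain; do
      grep -F ",${domain}," "${COLLECTION}-fulltext.txt" \
        > "${COLLECTION}-$(echo "$domain" | tr '.' '-')-text.txt"
    done

# The resulting per-domain files would then be bundled into the zip for download.
```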

@ruebot ruebot closed this in a0be875 Nov 2, 2018

ianmilligan1 added a commit that referenced this issue Nov 2, 2018

ruebot added a commit that referenced this issue Nov 9, 2018

Update Docs, About, and Learning Guides to Include Full-Text by Domain Derivative, Resolves #202 (#205)

* Adds language to text filter guide to reflect #197
* Documents _5th_ derivative file: text by domain
* Minor update to "about" page
* Adding new image with text by domain derivative