Discussion: Implementing NER as a derivative #246
Comments
ruebot added the discussion label Jan 10, 2019
ruebot assigned greebie, lintool, ianmilligan1 and SamFritz Jan 10, 2019
This comment has been minimized.
Yeah, that dovetails with our experience at the datathon as well: it becomes almost ungainly with any collection over a few GB. Agreed that it shouldn't be part of the default Spark jobs. My own take is that, given the size of most of the collections, I'm not sure we even want this as a side job; a 30 GB collection would take roughly two days to process, and a 100 GB collection roughly a week. We could write a tutorial about how to do NER locally on your own full-text files, so a user could sample accordingly?
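A rough sketch of what such a tutorial could cover, using Stanford NER and treating each line of a full-text derivative as one document (both of those are assumptions on my part; the paths, sample rate, and line-per-document layout are placeholders to adjust to the actual derivative format):

```scala
import edu.stanford.nlp.ie.crf.CRFClassifier
import scala.io.Source
import scala.util.Random

object SampleNer {
  def main(args: Array[String]): Unit = {
    val fullTextPath   = args(0) // a full-text derivative file (hypothetical path)
    val classifierPath = args(1) // e.g. english.all.3class.distsim.crf.ser.gz
    val sampleRate     = 0.01    // keep ~1% of lines so large files stay tractable
    val maxDocs        = 1000    // hard cap on how many documents we tag

    val classifier = CRFClassifier.getClassifierNoExceptions(classifierPath)

    // Stream the file and sample lines rather than loading it all into memory.
    val sample = Source.fromFile(fullTextPath).getLines()
      .filter(_ => Random.nextDouble() < sampleRate)
      .take(maxDocs)

    // "inlineXML" wraps recognized entities in tags like <PERSON>...</PERSON>.
    sample.foreach(line =>
      println(classifier.classifyToString(line, "inlineXML", true)))
  }
}
```

Sampling like this keeps the runtime proportional to the sample size rather than the collection size, which is the whole point of pushing it out of the default jobs.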
This comment has been minimized.
Yeah, that'd work.
This comment has been minimized.
OK, let me open a ticket shortly to document NER as a new learning guide!
This comment has been minimized.
To consider: setting up a bunch of optional plug-ins with AUK for people who want to run their own instance (or for us to include on an optional basis).
This comment has been minimized.
OK, I think we can close this and work on implementing a learning guide. We could revisit this in the future if there's a magical breakthrough in NER?
ruebot commented Jan 10, 2019
I've wired up AUK to run an extract-entities job at the end of the Spark job we currently have set up, to see how long it takes. Using some small Dalhousie collections, and 10 cores with 30 GB of RAM on my desktop, this is what we get:
Bob Fournier: 1 file, 58.4 MB
#homenothondas: 4 files, 164 MB
Nova Scotia FOIPOP portal breach: 1 file, 326 MB
Planning in theory and practice : a research compendium: 1 file, 474 MB
Based on these four examples, it's pretty apparent that this does not scale well at all, and I'd suggest we not implement it within the current Spark job. Maybe this is something we could implement as a side job? We'd just have to really think through how we'd present the option to the user, and how we'd handle it on the back-end.
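For context, a minimal sketch of what an entity-extraction step appended to the Spark job might look like. The ExtractEntities import path and the extractFromRecords signature follow the aut documentation as I recall it, and every path below is a placeholder, so treat all of it as assumptions to verify against the aut version you're actually running:

```scala
// Sketch only: appended after the existing derivative steps in the Spark job.
import io.archivesunleashed.app.ExtractEntities

// Runs Stanford NER over the text of each record and writes the
// extracted entities out as another derivative.
ExtractEntities.extractFromRecords(
  "/path/to/english.all.3class.distsim.crf.ser.gz", // Stanford NER classifier model
  "/path/to/collection/warcs/*.gz",                 // input WARC/ARC files
  "/path/to/derivatives/entities",                  // output directory
  sc)                                               // SparkContext from spark-shell
```

Since the classifier has to run over the full text of every record, runtime grows roughly linearly with collection size, which is consistent with the 30 GB / two-day and 100 GB / one-week extrapolations above.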