Discussion: Implementing NER as a derivative #246

Closed · ruebot opened this issue Jan 10, 2019 · 5 comments · 5 participants
@ruebot
Member

ruebot commented Jan 10, 2019

I've wired up AUK to run an extract-entities job at the end of the Spark job we currently have set up, to see how long it takes. Using some small Dalhousie collections, and 10 cores with 30G of RAM on my desktop, this is what we get:

Bob Fournier: 1 file, 58.4 MB

  • Standard spark job: 15 Seconds
  • Standard spark job with NER: 11 Minutes and 8 Seconds

#homenothondas: 4 files, 164 MB

  • Standard spark job: 29 Seconds
  • Standard spark job with NER: 4 Minutes and 46 Seconds

Nova Scotia FOIPOP portal breach: 1 file, 326 MB

  • Standard spark job: 47 Seconds
  • Standard spark job with NER: 32 Minutes and 19 Seconds

Planning in theory and practice : a research compendium: 1 file, 474 MB

  • Standard spark job: 49 Seconds
  • Standard spark job with NER: 3030
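For a rough sense of scale, the slowdown implied by the first three collections works out to somewhere between ~10x and ~45x. A quick stdlib-only sketch (figures copied from the lists above):

```python
# Back-of-envelope slowdown and throughput from the timings reported above.
jobs = [
    # (collection, size in MB, standard job seconds, job-with-NER seconds)
    ("Bob Fournier", 58.4, 15, 11 * 60 + 8),
    ("#homenothondas", 164, 29, 4 * 60 + 46),
    ("Nova Scotia FOIPOP portal breach", 326, 47, 32 * 60 + 19),
]
for name, mb, base_s, ner_s in jobs:
    print(f"{name}: {ner_s / base_s:.1f}x slower, "
          f"{mb / (ner_s / 60):.2f} MB/min with NER")
```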

[chart: runtimes of the standard Spark job vs. the Spark job with NER for the four collections]


Based on these four examples, it looks pretty apparent that this does not scale well at all, and I'd suggest that we not implement it with the current Spark job. Maybe this is something we could implement as a side job? We'd just have to really think through how we'd present the option to the user, and take care of it in the back-end.

@ianmilligan1

Member

ianmilligan1 commented Jan 10, 2019

Yeah, that sort of dovetails with our experience at the datathon as well. It becomes almost ungainly with any collection over a few GBs, I think.

Agreed that it shouldn't be part of the default Spark jobs.

My own take is that given the size of most of the collections, I'm not sure we want this to be a side job either. A 30GB collection would take ~ two days to process, and a 100GB collection would take ~ a week. We could write a tutorial about how to do NER locally on your own full text files, so then a user could sample accordingly?
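Those estimates are consistent with the FOIPOP data point above (326 MB in 32m19s with NER); assuming throughput stays roughly constant, a quick extrapolation sketch gives ~2 days for 30 GB and ~7 days for 100 GB:

```python
# Extrapolate NER runtime from the FOIPOP collection timing above,
# assuming throughput stays roughly constant (a big assumption).
throughput_mb_s = 326 / (32 * 60 + 19)  # ~0.17 MB/s with NER

def days_for(gb):
    return gb * 1024 / throughput_mb_s / 86400  # seconds -> days

print(f"30 GB: ~{days_for(30):.1f} days, 100 GB: ~{days_for(100):.1f} days")
```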

@ruebot

Member

ruebot commented Jan 10, 2019

> We could write a tutorial about how to do NER locally on your own full text files, so then a user could sample accordingly?

Yeah, that'd work.
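The sampling half of such a tutorial is easy to sketch, since the full-text derivative can be arbitrarily large. A minimal stdlib-only reservoir sampler (the file path is hypothetical, and the NER step itself would use whatever toolkit the guide settles on):

```python
import random

def sample_lines(path, k, seed=0):
    """Reservoir-sample up to k lines uniformly from a (possibly huge) file."""
    rng = random.Random(seed)
    sample = []
    with open(path) as f:
        for i, line in enumerate(f):
            if i < k:
                sample.append(line)       # fill the reservoir first
            else:
                j = rng.randrange(i + 1)  # replace with probability k/(i+1)
                if j < k:
                    sample[j] = line
    return sample

# e.g. feed sample_lines("fulltext.txt", 1000) into a local NER tool
```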

@ianmilligan1

Member

ianmilligan1 commented Jan 11, 2019

Ok, let me open a ticket shortly to document NER as a new learning guide!

@greebie

Contributor

greebie commented Jan 11, 2019

To consider: setting up a bunch of optional plug-ins with AUK for people who want to run their own instance (or for us to include on an optional basis).

@ianmilligan1

Member

ianmilligan1 commented Jan 11, 2019

OK, I think we can close this and work on implementing a learning guide. We could revisit this in the future if there's a magical breakthrough in NER?

@ruebot ruebot closed this Jan 11, 2019

@ruebot ruebot added the wontfix label Jan 11, 2019
