Discussion: Implementing NER as a derivative #246
Comments
ruebot added the discussion label Jan 10, 2019
ruebot assigned greebie, lintool, ianmilligan1 and SamFritz Jan 10, 2019
This comment has been minimized.
Yeah, that dovetails with our experience at the datathon as well: it becomes almost ungainly with any collection over a few GB. Agreed that it shouldn't be part of the default Spark jobs. My own take is that, given the size of most of the collections, I'm not sure we even want this as a side job; a 30 GB collection would take roughly two days to process, and a 100 GB collection roughly a week. We could write a tutorial about how to do NER locally on your own full-text files, so a user could sample accordingly?
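A rough sketch of what such a tutorial could cover, using Stanford NER and treating each line of a full-text derivative as one document (both of those are assumptions on my part; the paths, sample rate, and line-per-document layout are placeholders to adjust to the actual derivative format):

```scala
import edu.stanford.nlp.ie.crf.CRFClassifier
import scala.io.Source
import scala.util.Random

object SampleNer {
  def main(args: Array[String]): Unit = {
    val fullTextPath   = args(0) // a full-text derivative file (hypothetical path)
    val classifierPath = args(1) // e.g. english.all.3class.distsim.crf.ser.gz
    val sampleRate     = 0.01    // keep ~1% of lines so large files stay tractable
    val maxDocs        = 1000    // hard cap on how many documents we tag

    val classifier = CRFClassifier.getClassifierNoExceptions(classifierPath)

    // Stream the file and sample lines rather than loading it all into memory.
    val sample = Source.fromFile(fullTextPath).getLines()
      .filter(_ => Random.nextDouble() < sampleRate)
      .take(maxDocs)

    // "inlineXML" wraps recognized entities in tags like <PERSON>...</PERSON>.
    sample.foreach(line =>
      println(classifier.classifyToString(line, "inlineXML", true)))
  }
}
```

Sampling like this keeps the runtime proportional to the sample size rather than the collection size, which is the whole point of pushing it out of the default jobs.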
This comment has been minimized.
Yeah, that'd work.
This comment has been minimized.
OK, let me open a ticket shortly to document NER as a new learning guide!
This comment has been minimized.
To consider: setting up a bunch of optional plug-ins with AUK for people who want to run their own instance (or for us to include on an optional basis).
This comment has been minimized.
OK, I think we can close this and work on implementing a learning guide. We could revisit this in the future if there's a magical breakthrough in NER?
ruebot commented Jan 10, 2019
I've wired up AUK to run an extract-entities job at the end of the Spark job we currently have set up, to see how long it takes. Using some small Dalhousie collections, and 10 cores with 30 GB of RAM on my desktop, this is what we get:
Bob Fournier: 1 file, 58.4 MB
#homenothondas: 4 files, 164 MB
Nova Scotia FOIPOP portal breach: 1 file, 326 MB
Planning in theory and practice : a research compendium: 1 file, 474 MB
Based on these four examples, it's pretty apparent that this does not scale well at all, and I'd suggest we not implement it within the current Spark job. Maybe this is something we could implement as a side job? We'd just have to really think through how we'd present the option to the user, and how we'd handle it on the back-end.
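For context, a minimal sketch of what an entity-extraction step appended to the Spark job might look like. The ExtractEntities import path and the extractFromRecords signature follow the aut documentation as I recall it, and every path below is a placeholder, so treat all of it as assumptions to verify against the aut version you're actually running:

```scala
// Sketch only: appended after the existing derivative steps in the Spark job.
import io.archivesunleashed.app.ExtractEntities

// Runs Stanford NER over the text of each record and writes the
// extracted entities out as another derivative.
ExtractEntities.extractFromRecords(
  "/path/to/english.all.3class.distsim.crf.ser.gz", // Stanford NER classifier model
  "/path/to/collection/warcs/*.gz",                 // input WARC/ARC files
  "/path/to/derivatives/entities",                  // output directory
  sc)                                               // SparkContext from spark-shell
```

Since the classifier has to run over the full text of every record, runtime grows roughly linearly with collection size, which is consistent with the 30 GB / two-day and 100 GB / one-week extrapolations above.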