Discussion: Should we align our Named Entity Recognition output with WANE format? #297
ruebot added the discussion label Jan 10, 2019
ruebot assigned greebie, lintool, ianmilligan1 and SamFritz Jan 10, 2019
I think this is a good idea. To my mind, there aren't any standardized named-entity formats out there, so if a format already exists, we might as well adopt it and try to encourage some standardization around tools down the road.
Seeing no objections, let's do it. I like the idea of saying we produce "WANE" files, which we can then point at the Archive-It page. Maybe we can start a trend towards a standardized NER format.
ruebot commented Jan 10, 2019
Adapting our example NER script:
Produces the following example output:
This is very similar to WANE output. Is it worth normalizing the output ExtractEntities produces to the documented WANE output?

JSON object per line: URL ("url"), timestamp ("timestamp"), content digest ("digest"), and the named entities ("named_entities") containing data arrays of "persons", "organizations", and "locations".
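
For illustration, the WANE layout quoted above could be sketched in Python roughly as follows. This is only a minimal sketch, not the toolkit's implementation: the helper name `to_wane_line` is hypothetical, and every value passed to it is a made-up placeholder. Only the field names ("url", "timestamp", "digest", "named_entities", "persons", "organizations", "locations") come from the description above.

```python
import json


def to_wane_line(url, timestamp, digest, persons, organizations, locations):
    """Serialize one record as a single-line JSON object in the WANE-style layout.

    Hypothetical helper for illustration; the field names follow the
    description quoted in this thread.
    """
    record = {
        "url": url,
        "timestamp": timestamp,
        "digest": digest,
        "named_entities": {
            "persons": list(persons),
            "organizations": list(organizations),
            "locations": list(locations),
        },
    }
    # json.dumps emits no newlines by default, so each record fits on one line.
    return json.dumps(record, ensure_ascii=False)


# Placeholder values only; none of these come from real crawl data.
line = to_wane_line(
    url="http://example.com/",
    timestamp="20190110000000",
    digest="sha1:EXAMPLEDIGEST",
    persons=["Jane Doe"],
    organizations=["Example Org"],
    locations=["Toronto"],
)
print(line)
```

Writing one such line per record would give a file that can be read back line by line with `json.loads`, which is what makes the "JSON object per line" layout easy to process downstream.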