Discussion: Should we align our Named Entity Recognition output with WANE format? #297
ruebot added the discussion label Jan 10, 2019
ruebot assigned greebie, lintool, ianmilligan1 and SamFritz Jan 10, 2019
I think this is a good idea – to my mind, there aren't any standardized named-entity formats out there, so if there's an existing format, we might as well try to encourage some standardization around tools down the road.
Seeing no objections, let's do it – I like the idea of saying we produce "WANE" files, which we can then point at the Archive-It page. Maybe we can start a trend towards a standardized NER format.
ruebot assigned ruebot and unassigned greebie, lintool, ianmilligan1 and SamFritz Jul 17, 2019
aut/src/main/scala/io/archivesunleashed/app/ExtractEntities.scala, Lines 45 to 63 in 9b3e025
Can I remove that function? The function above it solves the use case of parsing ARCs/WARCs; I'm not sure about parsing a relatively unknown and under-documented file format. It seems a bit out of scope for the toolkit, especially since we removed the Twitter analysis/parsing.
After that commit, I just need to sort out
ruebot commented Jan 10, 2019
Adapting our example NER script:
Produces the following example output:
This is very similar to WANE output. Is it worth normalizing the output that ExtractEntities produces to the documented WANE output?
The documented WANE format is one JSON object per line: URL ("url"), timestamp ("timestamp"), content digest ("digest"), and the named entities ("named_entities"), which contain arrays of "persons", "organizations", and "locations".
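For anyone who wants to see the shape concretely, here is a minimal Python sketch of one WANE-style line built from the field names quoted above. It is only an illustration, not the toolkit's Scala code: the helper name to_wane_line, the sample values, and the timestamp and digest formats are all assumptions.

```python
import json


def to_wane_line(url, timestamp, digest, persons, organizations, locations):
    """Serialize one record as a single WANE-style JSON line.

    Hypothetical helper: only the key names come from the WANE
    description quoted in this thread.
    """
    record = {
        "url": url,
        "timestamp": timestamp,
        "digest": digest,
        "named_entities": {
            "persons": list(persons),
            "organizations": list(organizations),
            "locations": list(locations),
        },
    }
    # json.dumps emits no newlines by default, so each record stays on one line.
    return json.dumps(record)


# Sample values are made up; the timestamp and digest formats are assumptions.
line = to_wane_line(
    url="http://example.com/",
    timestamp="20190110000000",
    digest="sha1:EXAMPLEDIGEST",
    persons=["Jane Doe"],
    organizations=["Example Org"],
    locations=["Toronto"],
)
print(line)
```

Writing one such line per record gives a file that can be read back line by line with json.loads.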