Discussion: Should we align our Named Entity Recognition output with WANE format? #297
ruebot added the discussion label on Jan 10, 2019
ruebot assigned greebie, lintool, ianmilligan1 and SamFritz on Jan 10, 2019
I think this is a good idea. To my mind, there aren't any standardized named-entity formats out there, so if a documented format exists, we might as well adopt it and encourage some standardization across tools down the road.
Seeing no objections, let's do it. I like the idea of saying we produce "WANE" files, for which we can then point at the Archive-It page. Maybe we can start a trend towards a standardized NER format.
ruebot assigned ruebot and unassigned greebie, lintool, ianmilligan1 and SamFritz on Jul 17, 2019
aut/src/main/scala/io/archivesunleashed/app/ExtractEntities.scala, lines 45 to 63 in 9b3e025
Can I remove that function? The function above it covers the use case of parsing ARCs/WARCs; I'm not sure about parsing a relatively unknown and under-documented file format. It seems a bit out of scope for the toolkit, especially since we removed the Twitter analysis/parsing.
After that commit, I just need to sort out
I'd completely forgotten about this! My vote would be to remove it in the PR. It's an outlier in that it takes derivative files as input and processes them. I think it makes sense to stick to ARC and WARC files as the only inputs, and to put the onus on users to work with the derivative files in notebooks or with their own solutions.
ruebot commented on Jan 10, 2019
Adapting our example NER script:
Produces the following example output:
This is very similar to WANE output. Is it worth normalizing the output that ExtractEntities produces to the documented WANE output?
WANE is documented as one JSON object per line: URL ("url"), timestamp ("timestamp"), content digest ("digest"), and the named entities ("named_entities"), which contain data arrays of "persons", "organizations", and "locations".
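To make the record layout described above concrete, here is a minimal Python sketch that serializes one record as a single JSON line with those keys. The function name `to_wane_line` and all sample values (URL, timestamp, digest, entity names) are made up for illustration. This is not the toolkit's `ExtractEntities` code, and details such as key order or entity de-duplication are assumptions rather than part of any confirmed specification.

```python
import json


def to_wane_line(url, timestamp, digest, persons, organizations, locations):
    """Serialize one record as a single WANE-style JSON line.

    Hypothetical helper: only the key names come from the format
    description in this thread; everything else is an assumption.
    """
    record = {
        "url": url,
        "timestamp": timestamp,
        "digest": digest,
        "named_entities": {
            "persons": list(persons),
            "organizations": list(organizations),
            "locations": list(locations),
        },
    }
    # json.dumps emits no newlines by default, so each record stays on one line.
    return json.dumps(record, ensure_ascii=False)


# Placeholder values, not taken from a real crawl.
line = to_wane_line(
    url="http://example.com/",
    timestamp="20190110000000",
    digest="sha1:PLACEHOLDER",
    persons=["Ada Lovelace"],
    organizations=["Example Org"],
    locations=["Toronto"],
)
print(line)
```

Writing one such line per page would give a file that can be read back line by line with `json.loads`, which is what makes a line-delimited layout convenient for notebook use.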