Discussion: Should we align our Named Entity Recognition output with WANE format? #297
Comments
I think this is a good idea – to my mind, there aren't any standardized named-entity formats out there, so if there's a format, we might as well adopt it and try to encourage some standardization around tools down the road.
Seeing no objections, let's do it – I like the idea of saying we produce "WANE" files, which we can then point to on the Archive-It page. Maybe we can start a trend towards a standardized NER format.
aut/src/main/scala/io/archivesunleashed/app/ExtractEntities.scala, lines 45 to 63 in 9b3e025
Can I remove that function? The function above it covers the use case of parsing ARCs/WARCs; I'm not sure about parsing a relatively unknown and under-documented file format. It seems a bit out of scope for the toolkit, especially since we removed the Twitter analysis/parsing.
After that commit, I just need to sort out |
I'd completely forgotten about this! My vote would be to remove this in the PR. It's an outlier in that it takes derivative files as input and processes them. I think it makes sense to stick to ARC and WARC files as the only inputs, and leave it to users to work with the derivative files, either in notebooks or with their own solutions.
Mostly resolved with 379cc68. Still need to do:
Helper note from @helgeho for when I (or somebody else) loop back around to this in the future:
@SinghGursimran this one has one last item to get to before it's done. If you're interested, or see an easy path, let me know.
- Update Stanford core NLP
- Format NER output in json
- Add getPayloadDigest to ArchiveRecord
- Add test for getPayloadDigest
- Add payload digest to NER output
- Remove extractFromScrapeText
- Remove extractFromScrapeText test
- TODO: PERSON -> persons, LOCATION -> locations, ORGANIZATION -> organizations (involves writing a new class or overriding NER output 🤢)
@ruebot May I simply replace the keywords PERSON -> persons, LOCATION -> locations, ORGANIZATION -> organizations in the NER output String? That would make it conform to WANE output.
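For anyone skimming the thread later, the string replacement proposed here could look roughly like the sketch below. This is not the code that landed in aut. `WaneKeys` and `renameKeys` are hypothetical names, and the sketch assumes the NER output is a compact JSON string whose keys appear literally as `"PERSON":`, `"LOCATION":` and `"ORGANIZATION":` with no whitespace before the colon.

```scala
// Hypothetical helper, not part of aut: renames NER class keys to WANE field names.
object WaneKeys {
  // Mapping taken from the TODO item in this thread.
  val labelToWane: Map[String, String] = Map(
    "PERSON" -> "persons",
    "LOCATION" -> "locations",
    "ORGANIZATION" -> "organizations"
  )

  // Assumption: keys are serialized as "LABEL": with no space before the colon.
  // Matching the closing quote plus colon leaves an entity *value* that
  // happens to be "PERSON" untouched, since values are followed by , or ].
  def renameKeys(nerJson: String): String =
    labelToWane.foldLeft(nerJson) { case (json, (label, wane)) =>
      json.replace("\"" + label + "\":", "\"" + wane + "\":")
    }
}
```

Calling `WaneKeys.renameKeys` on a string containing `"PERSON":[...]` would yield `"persons":[...]`. A plain replace is the simplest option; if the serializer ever emitted `"PERSON" :` with a space, the match would silently miss, so building the output from a typed structure would be more robust.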
@SinghGursimran sure, I'm good with that. Curious what you come up with.
Fantastic! Great work @SinghGursimran and @ruebot!
ruebot commented Jan 10, 2019
Adapting our example NER script:
Produces the following example output:
This is very similar to WANE output. Is it worth normalizing the output ExtractEntities produces to the documented WANE output?
JSON object per line: URL ("url"), timestamp ("timestamp"), content digest ("digest") and the named entities ("named_entities") containing data arrays of "persons", "organizations", and "locations".
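To make the documented layout concrete, here is a minimal sketch of emitting one WANE-style line from Scala. It follows only the field names quoted above. The `WaneRecord` object, its case classes, the field order, and the choice to treat timestamp and digest as plain strings are all assumptions for illustration, not aut's or Archive-It's implementation.

```scala
// Hypothetical model of one WANE line, based only on the field names quoted above.
object WaneRecord {
  final case class NamedEntities(
    persons: Seq[String],
    organizations: Seq[String],
    locations: Seq[String]
  )

  // Assumption: timestamp and digest are carried as opaque strings.
  final case class Record(url: String, timestamp: String, digest: String, entities: NamedEntities)

  // Minimal JSON string escaping so the sketch needs no JSON library.
  private def quote(s: String): String = {
    val sb = new StringBuilder("\"")
    s.foreach {
      case '"'          => sb.append("\\\"")
      case '\\'         => sb.append("\\\\")
      case '\n'         => sb.append("\\n")
      case '\r'         => sb.append("\\r")
      case '\t'         => sb.append("\\t")
      case c if c < ' ' => sb.append("\\u" + "%04x".format(c.toInt))
      case c            => sb.append(c)
    }
    sb.append('"').toString
  }

  private def array(xs: Seq[String]): String =
    xs.map(quote).mkString("[", ",", "]")

  // One JSON object per line, with the keys named in the WANE description.
  def toJsonLine(r: Record): String =
    "{\"url\":" + quote(r.url) +
      ",\"timestamp\":" + quote(r.timestamp) +
      ",\"digest\":" + quote(r.digest) +
      ",\"named_entities\":{" +
      "\"persons\":" + array(r.entities.persons) +
      ",\"organizations\":" + array(r.entities.organizations) +
      ",\"locations\":" + array(r.entities.locations) +
      "}}"
}
```

A record would be built with placeholder values such as `WaneRecord.Record("http://example.com/", "20190110000000", "sha1:EXAMPLE", ...)` and written one per line. In practice a JSON library would replace the hand-rolled escaping; it is spelled out here only to keep the sketch self-contained.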