Discussion: Should we align our Named Entity Recognition output with WANE format? #297

ruebot · Jan 10, 2019

Adapting our example NER script:

import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._

ExtractEntities.extractFromRecords("/home/ruestn/english.all.3class.distsim.crf.ser.gz", "/tuna1/scratch/nruest/geocites/warcs/1/GEOCITIES-20090723023506-00000-crawling08.us.archive.org.warc.gz", "/tuna1/scratch/nruest/geocites/ner/", sc)

Produces the following example output:

(20090723,http://uk.geocities.com/pendock@btinternet.com/index.htm,{"PERSON":["Frampton","Hardwicke","Hardwicke","Martin","Hardwicke","Hardwicke","Hardwicke","Hutchings","Hopkins","Saunders","Butler","Jones","Frampton","Frampton","Hardwicke","Mark Chapple","Mark Medland","Glos"],"ORGANIZATION":["Hardwicke Cricket Club","Hardwicke Cricket Club","Stroud District Cricket Association","EJ Taylor & Sons Eric Vick Transport Club"],"LOCATION":["China","Gloucester","Ireland","Gloucester"]})

This is very similar to WANE output. Is it worth normalizing the output ExtractEntities produces to the documented WANE output?

JSON object per line: URL ("url"), timestamp ("timestamp"), content digest ("digest") and the named entities ("named_entities") containing data arrays of "persons", "organizations", and "locations".
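To make the comparison concrete, here is a minimal sketch of what one WANE-style line could look like if built from the fields ExtractEntities already emits. The key names come from the description quoted above. Everything else is an assumption for illustration: `WaneSketch`, `WaneRecord`, `fromAut`, and `toJson` are hypothetical names, not part of aut. The digest value is a placeholder, since the current output does not carry one. The entities are assumed to be already parsed into a map keyed by `PERSON`, `ORGANIZATION`, and `LOCATION`. The current output also carries only an 8-digit crawl date, so the `timestamp` value here may not match what WANE files actually contain.

```scala
object WaneSketch {
  // Hypothetical record type; field names follow the WANE description quoted above.
  final case class WaneRecord(
      url: String,
      timestamp: String,
      digest: String,
      persons: Seq[String],
      organizations: Seq[String],
      locations: Seq[String])

  // Minimal JSON string escaping: quotes, backslashes, and control characters.
  def quote(s: String): String = {
    val sb = new StringBuilder("\"")
    s.foreach {
      case '"'  => sb.append("\\\"")
      case '\\' => sb.append("\\\\")
      case '\n' => sb.append("\\n")
      case '\r' => sb.append("\\r")
      case '\t' => sb.append("\\t")
      case c if c < ' ' =>
        sb.append('\\').append('u').append("%04x".format(c.toInt))
      case c => sb.append(c)
    }
    sb.append('"').toString
  }

  // Render a sequence of strings as a JSON array.
  def array(xs: Seq[String]): String = xs.map(quote).mkString("[", ",", "]")

  // Map the current (crawl date, URL, entities) shape onto the WANE-style fields.
  // Assumption: entities are already parsed into a Map keyed by the NER class names.
  def fromAut(
      crawlDate: String,
      url: String,
      digest: String,
      entities: Map[String, Seq[String]]): WaneRecord =
    WaneRecord(
      url,
      crawlDate,
      digest,
      entities.getOrElse("PERSON", Seq.empty),
      entities.getOrElse("ORGANIZATION", Seq.empty),
      entities.getOrElse("LOCATION", Seq.empty))

  // Serialize one record as a single JSON object, one per output line.
  def toJson(r: WaneRecord): String =
    "{" +
      "\"url\":" + quote(r.url) + "," +
      "\"timestamp\":" + quote(r.timestamp) + "," +
      "\"digest\":" + quote(r.digest) + "," +
      "\"named_entities\":{" +
      "\"persons\":" + array(r.persons) + "," +
      "\"organizations\":" + array(r.organizations) + "," +
      "\"locations\":" + array(r.locations) +
      "}}"

  def main(args: Array[String]): Unit = {
    // Entity values taken from the example output above; the digest is a placeholder.
    val record = fromAut(
      "20090723",
      "http://uk.geocities.com/pendock@btinternet.com/index.htm",
      "sha1:PLACEHOLDER",
      Map(
        "PERSON" -> Seq("Mark Chapple", "Mark Medland"),
        "ORGANIZATION" -> Seq("Hardwicke Cricket Club"),
        "LOCATION" -> Seq("China", "Gloucester")))
    println(toJson(record))
  }
}
```

The main differences from the current output are the lowercase plural keys, the nesting under "named_entities", and the extra "digest" field, which ExtractEntities would have to compute or carry through from the record.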

ianmilligan1 · Jan 10, 2019

I think this is a good idea – to my mind, there aren't any standardized named-entity formats out there, so if there's a format we might as well try to encourage some standardization around tools down the road?

ianmilligan1 · Jan 11, 2019

Seeing no objections, let's do it – I like the idea of saying we produce "WANE" files, which we can then point to on the Archive-It page. Maybe we can start a trend towards a standardized NER format. 😉

ruebot added the discussion label Jan 10, 2019

ruebot assigned greebie, lintool, ianmilligan1 and SamFritz Jan 10, 2019

archivesunleashed/aut
