
Discussion: Should we align our Named Entity Recognition output with WANE format? #297

Open
ruebot opened this issue Jan 10, 2019 · 6 comments

@ruebot commented Jan 10, 2019

Adapting our example NER script:

import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._

ExtractEntities.extractFromRecords("/home/ruestn/english.all.3class.distsim.crf.ser.gz", "/tuna1/scratch/nruest/geocites/warcs/1/GEOCITIES-20090723023506-00000-crawling08.us.archive.org.warc.gz", "/tuna1/scratch/nruest/geocites/ner/", sc)

Produces the following example output:

(20090723,http://uk.geocities.com/pendock@btinternet.com/index.htm,{"PERSON":["Frampton","Hardwicke","Hardwicke","Martin","Hardwicke","Hardwicke","Hardwicke","Hutchings","Hopkins","Saunders","Butler","Jones","Frampton","Frampton","Hardwicke","Mark Chapple","Mark Medland","Glos"],"ORGANIZATION":["Hardwicke Cricket Club","Hardwicke Cricket Club","Stroud District Cricket Association","EJ Taylor & Sons Eric Vick Transport Club"],"LOCATION":["China","Gloucester","Ireland","Gloucester"]})

This is very similar to WANE output. Is it worth normalizing the output that ExtractEntities produces to match the documented WANE output?

One JSON object per line: URL ("url"), timestamp ("timestamp"), content digest ("digest"), and the named entities ("named_entities") containing arrays of "persons", "organizations", and "locations".
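
For illustration, a single WANE-style line for the record above might look roughly like this (the digest value is a placeholder, entity lists are truncated, and the field names follow the description above rather than a verified spec):

{"url":"http://uk.geocities.com/pendock@btinternet.com/index.htm","timestamp":"20090723","digest":"sha1:EXAMPLEDIGESTPLACEHOLDER","named_entities":{"persons":["Frampton","Hardwicke","Martin"],"organizations":["Hardwicke Cricket Club","Stroud District Cricket Association"],"locations":["China","Gloucester","Ireland"]}}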

@ianmilligan1 commented Jan 10, 2019

I think this is a good idea – to my mind, there aren't any standardized named-entity formats out there, so if there's a format already in use we might as well adopt it and try to encourage some standardization across tools down the road.

@ianmilligan1 commented Jan 11, 2019

Seeing no objections, let's do it – I like the idea of saying we produce "WANE" files, which we can then point at the Archive-It page. Maybe we can start a trend towards a standardized NER format. 😉

ruebot added a commit that referenced this issue Sep 4, 2019
ruebot added a commit that referenced this issue Sep 4, 2019
ruebot added a commit that referenced this issue Sep 18, 2019

@ruebot commented Sep 18, 2019

@ianmilligan1 @lintool

/** Extracts named entities from tuple-formatted derivatives scraped from a website.
  *
  * @param iNerClassifierFile path of classifier file
  * @param inputFile path of file containing tuples (date: String, url: String, content: String)
  *                  from which to extract entities
  * @param outputFile path of output directory
  * @return an rdd with classification entities.
  */
def extractFromScrapeText(iNerClassifierFile: String, inputFile: String, outputFile: String, sc: SparkContext): RDD[(String, String, String)] = {
  val rdd = sc.textFile(inputFile)
    .map(line => {
      val ind1 = line.indexOf(",")
      val ind2 = line.indexOf(",", ind1 + 1)
      (line.substring(1, ind1),
        line.substring(ind1 + 1, ind2),
        line.substring(ind2 + 1, line.length - 1))
    })
  extractAndOutput(iNerClassifierFile, rdd, outputFile)
}
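
For context, the tuple-formatted input that function expects is one line per record, along these lines (values made up for illustration; the parser strips the surrounding parentheses and splits on the first two commas):

(20090723,http://example.com/page.html,Some plain text content scraped from the page)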

Can I remove that function? The function above it handles the use case of parsing ARCs/WARCs; I'm not sure about parsing a relatively unknown and under-documented file format. It seems a bit out of scope for the toolkit, especially since we removed the Twitter analysis/parsing.

ruebot added a commit that referenced this issue Sep 18, 2019

@ruebot commented Sep 18, 2019

After that commit, I just need to sort out PERSON -> persons, etc., and it should finally be done.

@ianmilligan1 commented Sep 23, 2019

Can I remove that function? The function above it handles the use case of parsing ARCs/WARCs; I'm not sure about parsing a relatively unknown and under-documented file format. It seems a bit out of scope for the toolkit, especially since we removed the Twitter analysis/parsing.

I'd completely forgotten about this! My vote would be to remove it in the PR. It's an outlier in that it takes derivative files as input and processes them. I think it makes sense to stick to ARC and WARC files as the only inputs, and put the emphasis on users to either use notebooks or their own solutions to work with the derivative files.

@ruebot commented Nov 5, 2019

Mostly resolved with 379cc68.

Still need to do: PERSON -> persons, LOCATION -> locations, ORGANIZATION -> organizations (involves writing a new class or overriding the NER output 🤢).

Helper note from @helgeho for when I (or somebody else) loops back around to this in the future:

We're not exactly overriding it, because the CoreNLP output is not JSON; we simply take the PERSON class and put it under a key that we call "persons".
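
For whoever loops back to this, here is a minimal sketch of that remapping (hypothetical names, not the toolkit's actual implementation; it assumes the classifier results can be reduced to (NER class, token) pairs):

// Hypothetical sketch: group (NER class, token) pairs under WANE-style keys.
val waneKeys = Map(
  "PERSON" -> "persons",
  "ORGANIZATION" -> "organizations",
  "LOCATION" -> "locations"
)

def toWaneEntities(entities: Seq[(String, String)]): Map[String, Seq[String]] =
  entities
    .groupBy { case (nerClass, _) => waneKeys.getOrElse(nerClass, nerClass.toLowerCase) }
    .map { case (key, pairs) => key -> pairs.map(_._2) }

// Example: toWaneEntities(Seq(("PERSON", "Frampton"), ("LOCATION", "Gloucester")))
// returns Map("persons" -> Seq("Frampton"), "locations" -> Seq("Gloucester"))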
