
Discussion: Should we align our Named Entity Recognition output with WANE format? #297

Open
ruebot opened this issue Jan 10, 2019 · 6 comments

@ruebot commented Jan 10, 2019

Adapting our example NER script:

import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._

ExtractEntities.extractFromRecords("/home/ruestn/english.all.3class.distsim.crf.ser.gz", "/tuna1/scratch/nruest/geocites/warcs/1/GEOCITIES-20090723023506-00000-crawling08.us.archive.org.warc.gz", "/tuna1/scratch/nruest/geocites/ner/", sc)

Produces the following example output:

(20090723,http://uk.geocities.com/pendock@btinternet.com/index.htm,{"PERSON":["Frampton","Hardwicke","Hardwicke","Martin","Hardwicke","Hardwicke","Hardwicke","Hutchings","Hopkins","Saunders","Butler","Jones","Frampton","Frampton","Hardwicke","Mark Chapple","Mark Medland","Glos"],"ORGANIZATION":["Hardwicke Cricket Club","Hardwicke Cricket Club","Stroud District Cricket Association","EJ Taylor & Sons Eric Vick Transport Club"],"LOCATION":["China","Gloucester","Ireland","Gloucester"]})

This is very similar to WANE output. Is it worth normalizing the output that ExtractEntities produces to match the documented WANE output?

One JSON object per line: URL ("url"), timestamp ("timestamp"), content digest ("digest"), and the named entities ("named_entities") containing arrays of "persons", "organizations", and "locations".
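
For illustration, a single WANE-style line for the record above might look roughly like this (the digest value is a placeholder, entity lists are truncated, and the field names follow the description above rather than a verified spec):

{"url":"http://uk.geocities.com/pendock@btinternet.com/index.htm","timestamp":"20090723","digest":"sha1:EXAMPLEDIGESTPLACEHOLDER","named_entities":{"persons":["Frampton","Hardwicke","Martin"],"organizations":["Hardwicke Cricket Club","Stroud District Cricket Association"],"locations":["China","Gloucester","Ireland"]}}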

@ianmilligan1 commented Jan 10, 2019

I think this is a good idea – to my mind, there aren't any standardized named-entity formats out there, so if there's a format already in use we might as well adopt it and try to encourage some standardization across tools down the road.

@ianmilligan1 commented Jan 11, 2019

Seeing no objections, let's do it – I like the idea of saying we produce "WANE" files, which we can then point at the Archive-It page. Maybe we can start a trend towards a standardized NER format. 😉

ruebot added a commit that referenced this issue Sep 4, 2019
ruebot added a commit that referenced this issue Sep 4, 2019
ruebot added a commit that referenced this issue Sep 18, 2019

@ruebot commented Sep 18, 2019

@ianmilligan1 @lintool

/** Extracts named entities from tuple-formatted derivatives scraped from a website.
  *
  * @param iNerClassifierFile path of classifier file
  * @param inputFile path of file containing tuples (date: String, url: String, content: String)
  *                  from which to extract entities
  * @param outputFile path of output directory
  * @return an rdd with classification entities.
  */
def extractFromScrapeText(iNerClassifierFile: String, inputFile: String, outputFile: String, sc: SparkContext): RDD[(String, String, String)] = {
  val rdd = sc.textFile(inputFile)
    .map(line => {
      val ind1 = line.indexOf(",")
      val ind2 = line.indexOf(",", ind1 + 1)
      (line.substring(1, ind1),
        line.substring(ind1 + 1, ind2),
        line.substring(ind2 + 1, line.length - 1))
    })
  extractAndOutput(iNerClassifierFile, rdd, outputFile)
}
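
For context, the tuple-formatted input that function expects is one line per record, along these lines (values made up for illustration; the parser strips the surrounding parentheses and splits on the first two commas):

(20090723,http://example.com/page.html,Some plain text content scraped from the page)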

Can I remove that function? The function above it handles the use case of parsing ARCs/WARCs; I'm not sure about parsing a relatively unknown and under-documented file format. It seems a bit out of scope for the toolkit, especially since we removed the Twitter analysis/parsing.

ruebot added a commit that referenced this issue Sep 18, 2019

@ruebot commented Sep 18, 2019

After that commit, I just need to sort out PERSON -> persons, etc., and it should finally be done.

@ianmilligan1 commented Sep 23, 2019

Can I remove that function? The function above it handles the use case of parsing ARCs/WARCs; I'm not sure about parsing a relatively unknown and under-documented file format. It seems a bit out of scope for the toolkit, especially since we removed the Twitter analysis/parsing.

I'd completely forgotten about this! My vote would be to remove it in the PR. It's an outlier in that it takes derivative files as input and processes them. I think it makes sense to stick to ARC and WARC files as the only inputs, and put the emphasis on users to either use notebooks or their own solutions to work with the derivative files.

@ruebot commented Nov 5, 2019

Mostly resolved with 379cc68.

Still need to do: PERSON -> persons, LOCATION -> locations, ORGANIZATION -> organizations (involves writing a new class or overriding the NER output 🤢).

Helper note from @helgeho for when I (or somebody else) loops back around to this in the future:

We're not exactly overriding it, because the CoreNLP output is not JSON; we simply take the PERSON class and put it under a key that we call "persons".
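
For whoever loops back to this, here is a minimal sketch of that remapping (hypothetical names, not the toolkit's actual implementation; it assumes the classifier results can be reduced to (NER class, token) pairs):

// Hypothetical sketch: group (NER class, token) pairs under WANE-style keys.
val waneKeys = Map(
  "PERSON" -> "persons",
  "ORGANIZATION" -> "organizations",
  "LOCATION" -> "locations"
)

def toWaneEntities(entities: Seq[(String, String)]): Map[String, Seq[String]] =
  entities
    .groupBy { case (nerClass, _) => waneKeys.getOrElse(nerClass, nerClass.toLowerCase) }
    .map { case (key, pairs) => key -> pairs.map(_._2) }

// Example: toWaneEntities(Seq(("PERSON", "Frampton"), ("LOCATION", "Gloucester")))
// returns Map("persons" -> Seq("Frampton"), "locations" -> Seq("Gloucester"))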
