Align NER output to WANE format #361

Merged
ianmilligan1 merged 11 commits into master from issue-297 on Nov 5, 2019

Conversation

@ruebot
Member

ruebot commented Sep 18, 2019

GitHub issue(s): #297

What does this Pull Request do?

  • I'll
  • Add
  • This when I'm out of draft.

How should this be tested?

  • TravisCI
  • Something like:
import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._

ExtractEntities.extractFromRecords("/home/nruest/Projects/au/aut-resources/NER/english.all.3class.distsim.crf.ser.gz", "/home/nruest/Projects/au/sample-data/geocites/1/GEOCITIES-20091027142649-00105-ia400111.us.archive.org.warc.gz", "/home/nruest/Projects/au/sample-data/issue-297/output-ner/", sc)

Should produce output like this:

Additional Notes:

  • Probably need to update documentation. I'll do a review before this gets merged.
ruebot added 7 commits Jul 22, 2019
@codecov

codecov bot commented Sep 18, 2019

Codecov Report

Merging #361 into master will increase coverage by 0.1%.
The diff coverage is 21.42%.

@@            Coverage Diff            @@
##           master     #361     +/-   ##
=========================================
+ Coverage   76.25%   76.36%   +0.1%     
=========================================
  Files          40       40             
  Lines        1411     1413      +2     
  Branches      267      268      +1     
=========================================
+ Hits         1076     1079      +3     
+ Misses        219      217      -2     
- Partials      116      117      +1
@ianmilligan1

Member

ianmilligan1 commented Sep 26, 2019

Looking good! I built this and kicked it around on some sample WARCs that I have on my system, and the output is looking great.

[Screenshot: Screen Shot 2019-09-26 at 9 41 49 AM]

@ruebot

Member Author

ruebot commented Nov 5, 2019

@ianmilligan1 it'll be a while before I get back to writing the class to take care of the last bit of the implementation. Should we just merge what we have now, and come back around to the last bit later?

@ianmilligan1

Member

ianmilligan1 commented Nov 5, 2019

> Should we just merge what we have now, and come back around to the last bit later?

Yep, that makes complete sense to me. I can test one last time once the branch is updated and marked ready for review.

@ruebot ruebot marked this pull request as ready for review Nov 5, 2019
@ruebot

Member Author

ruebot commented Nov 5, 2019

@ianmilligan1 awesome. when you're ready to merge, let me know and I'll clean up the commit message since it is really bad right now 😬

Member

ianmilligan1 left a comment

Builds nicely locally, and output looks great on a sample ARC. Do you want to provide a clean commit message and then I can merge, @ruebot?

[Screenshot: Screen Shot 2019-11-05 at 9 51 40 AM]

ruebot added 2 commits Nov 5, 2019
…ue-297
@ruebot

Member Author

ruebot commented Nov 5, 2019

Sorry, I had a local commit I never pushed.

Here's a commit message when you're ready:

Align NER output to WANE format; addresses #297 (#361)

- Update Stanford core NLP
- Format NER output in json
- Add getPayloadDigest to ArchiveRecord
- Add test for getPayloadDigest
- Add payload digest to NER output
- Remove extractFromScrapeText
- Remove extractFromScrapeText test
- TODO: PERSON -> persons, LOCATION -> locations, ORGANIZATION -> organizations (involves writing a new class or overriding NER output :nauseated_face:)
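
For reference, the key renaming in the TODO item could be sketched roughly as below. This is only an illustration, not the class the TODO refers to: `WaneKeys`, `labelToKey`, and `rename` are hypothetical names, and it assumes the entities are already available as a map from NER label to a list of strings, which may not match how the NER output is actually structured.

```scala
// Hypothetical helper for illustration only; not part of aut or Stanford CoreNLP.
object WaneKeys {
  // Label-to-key mapping taken from the TODO item:
  // PERSON -> persons, LOCATION -> locations, ORGANIZATION -> organizations.
  val labelToKey: Map[String, String] = Map(
    "PERSON" -> "persons",
    "LOCATION" -> "locations",
    "ORGANIZATION" -> "organizations"
  )

  // Assumes entities arrive as a map of NER label -> entity strings.
  // Labels without a mapping are passed through unchanged.
  def rename(entities: Map[String, Seq[String]]): Map[String, Seq[String]] =
    entities.map { case (label, values) =>
      labelToKey.getOrElse(label, label) -> values
    }
}
```

With that assumption, `WaneKeys.rename(Map("PERSON" -> Seq("Ada Lovelace")))` gives back a map keyed by `persons`. Doing the rename as a post-processing step would avoid overriding the NER output itself, though whether that fits the existing JSON serialization is untested here.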
@ianmilligan1 ianmilligan1 merged commit 379cc68 into master Nov 5, 2019
2 of 3 checks passed
codecov/patch 21.42% of diff hit (target 76.25%)
codecov/project 76.36% (+0.1%) compared to 6686519
continuous-integration/travis-ci/pr The Travis CI build passed
@ianmilligan1 ianmilligan1 deleted the issue-297 branch Nov 5, 2019
@ruebot

Member Author

ruebot commented Nov 5, 2019

@ianmilligan1 thanks! I'll have a PR for aut-docs-new shortly.

ruebot added a commit to archivesunleashed/aut-docs-new that referenced this pull request Nov 5, 2019
ianmilligan1 added a commit to archivesunleashed/aut-docs-new that referenced this pull request Nov 5, 2019