Create tests for ExtractEntities.scala #48

Closed
ruebot opened this issue Oct 2, 2017 · 10 comments

Comments

Member

@ruebot ruebot commented Oct 2, 2017

ExtractEntities.scala has no test coverage. We need to create tests for it.

@ruebot ruebot added the tests label Oct 2, 2017
@ruebot ruebot added this to To Do in Test coverage Oct 2, 2017
Contributor

@greebie greebie commented Oct 16, 2017

Tests exist for this, but are disabled because there is no NER classifier in the repo. Is it possible to include the NER classifier in-house, or is there a licensing issue with doing so?

Member Author

@ruebot ruebot commented Oct 16, 2017

Looks like I shouldn't have removed it here 0ec8ab1, and I don't see english.all.3class.distsim.crf.ser.gz in the Git history at all. But we have it over here with a notice. I'm happy to restore example.txt, but I'm unsure about how to handle english.all.3class.distsim.crf.ser.gz.

@lintool I assume english.all.3class.distsim.crf.ser.gz is over there and not in the aut or warcbase repo because of licensing? Is that correct? If so, how do you want to handle this?

Contributor

@greebie greebie commented Oct 16, 2017

It's interesting that NER seems to be included in Tika, which we already use for the PDFParser. I think I could take this on as a November project. It may mean some configuration.

I think NER is plain GPL, so I don't think it's a licensing issue.

Member

@ianmilligan1 ianmilligan1 commented Oct 16, 2017

Our thinking back when we designed the entity extractor was less about licensing and more about not wanting the repository to balloon in size. english.all.3class.distsim.crf.ser.gz is ~33 MB, and it's just one language, so the thought was that a user who needed it should download it from a separate resources repository.

It was a general aversion to having too much cruft build up within the repo. @lintool

Member

@lintool lintool commented Oct 16, 2017

Yes, that's exactly it. My usual treatment is to create a separate repo, e.g., aut-resources and then ask the user to wget the data...

Contributor

@greebie greebie commented Oct 16, 2017

We have tika-parsers in our pom.xml. There is a version of NER in there -- maybe we could use that? https://wiki.apache.org/tika/TikaAndNER

Member Author

@ruebot ruebot commented Oct 17, 2017

@greebie Maven isn't going to pull that classifier file, from what I can tell.

Contributor

@greebie greebie commented Oct 17, 2017

Okay -- back to square one. I think we are stuck not being able to test this one.

@greebie greebie added the wontfix label Oct 17, 2017
Contributor

@greebie greebie commented Oct 17, 2017

Although I may be able to mock an NERClassifier... let's keep this open until I can figure it out.

Contributor

@greebie greebie commented Oct 19, 2017

NER functions require a classifier file that is not included in AUT. Potential solutions:

- Develop a generic NER classifier for test purposes. See #53 and #52.
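
For illustration only, here is a minimal sketch of what a test-only classifier could look like, so that extraction logic can be exercised without the ~33 MB model file. Everything named here (`EntityClassifier`, `StubClassifier`, `EntityExtraction.entitiesByType`) is hypothetical and not part of AUT, warcbase, or Stanford NER. The sketch assumes the extraction code can be written against a small interface rather than directly against the serialized classifier. The only detail borrowed from Stanford NER is the convention of labelling non-entity tokens `O`.

```scala
// Hypothetical interface (an assumption, not an AUT or Stanford NER API).
trait EntityClassifier {
  // Returns (token, label) pairs for the given text.
  def classify(text: String): Seq[(String, String)]
}

// Test-only stub: labels tokens from a fixed lookup table, so no
// serialized classifier file has to be present in the repository.
class StubClassifier(known: Map[String, String]) extends EntityClassifier {
  def classify(text: String): Seq[(String, String)] =
    text.split("\\s+").toSeq.filter(_.nonEmpty).map { token =>
      // Tokens missing from the table get the non-entity label "O".
      token -> known.getOrElse(token, "O")
    }
}

// Hypothetical extraction helper that depends only on the interface.
object EntityExtraction {
  // Groups the recognized tokens by entity label, dropping "O" tokens.
  def entitiesByType(
      text: String,
      classifier: EntityClassifier
  ): Map[String, Seq[String]] =
    classifier
      .classify(text)
      .filter { case (_, label) => label != "O" }
      .groupBy { case (_, label) => label }
      .map { case (label, pairs) => label -> pairs.map(_._1) }
}
```

A test could then build `new StubClassifier(Map("Ian" -> "PERSON", "Waterloo" -> "LOCATION"))` and assert on the grouped output. This only checks the plumbing around the classifier, not the quality of real NER results.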

@ianmilligan1 ianmilligan1 removed the wontfix label Jan 6, 2018
@ruebot ruebot added the RA-Task label Feb 5, 2018
@ruebot ruebot added this to In Progress in 1.0.0 Release of AUT May 27, 2020
ruebot added a commit that referenced this issue May 27, 2020
- Resolves #48
- Resolves #52
- Resolves #53
- Resolves #469
- Remove all NER associated functionality
- Tweak pom.xml to handle the removal
1.0.0 Release of AUT automation moved this from In Progress to Done May 27, 2020
ianmilligan1 pushed a commit that referenced this issue May 27, 2020
- Resolves #48
- Resolves #52
- Resolves #53
- Resolves #469
- Remove all NER associated functionality
- Tweak pom.xml to handle the removal