DetectLanguage.scala: class LanguageIdentifier in package language is deprecated #286

Closed
ruebot opened this issue Oct 17, 2018 · 7 comments

@ruebot
Member

commented Oct 17, 2018

Follow-on to #285

[WARNING] /home/nruest/git/aut/src/main/scala/io/archivesunleashed/matchbox/DetectLanguage.scala:33: warning: class LanguageIdentifier in package language is deprecated: see corresponding Javadoc for more information.
[INFO]       new LanguageIdentifier(input).getLanguage
[INFO]           ^
[WARNING] one warning found

I believe we need to update DetectLanguage to use this method.
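
For illustration only, here is a minimal sketch of what a non-deprecated DetectLanguage could look like. It assumes the replacement is Tika's LanguageDetector API with the OptimaizeLangDetector implementation from the tika-langdetect 1.x artifact, which may or may not be the method linked above. The object name DetectLanguageSketch and the empty-input guard are hypothetical choices, not the project's actual code.

```scala
import org.apache.tika.langdetect.OptimaizeLangDetector
import org.apache.tika.language.detect.LanguageDetector

// Hypothetical replacement for the deprecated LanguageIdentifier call.
// Assumes tika-core and tika-langdetect (1.x) are on the classpath.
object DetectLanguageSketch {
  // Loading the language models is costly, so build the detector once, lazily.
  private lazy val detector: LanguageDetector =
    new OptimaizeLangDetector().loadModels()

  // Returns the detected language code, or "" for empty input.
  def apply(input: String): String =
    if (input.isEmpty) {
      ""
    } else {
      // Cautious assumption: detect(text) resets and refills internal state,
      // so calls on the shared detector are serialized here.
      detector.synchronized {
        detector.detect(input).getLanguage
      }
    }
}
```

Calling DetectLanguageSketch(text) would then stand in for new LanguageIdentifier(text).getLanguage. The two detectors may not return the same language for the same input, so existing test expectations might need revisiting.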

@ruebot

Member Author

commented Oct 17, 2018

@borislin if you have time, do you want to take this one on? It should be an easy one.

@borislin

Collaborator

commented Oct 18, 2018

@ruebot Sure, I'll work on this.

@borislin

Collaborator

commented Oct 19, 2018

Update:

Current code to fix this issue: https://github.com/archivesunleashed/aut/tree/refactor-detect-language

I can't test my code right now because introducing the tika-app and tika-langdetect dependencies causes a lot of dependency issues/errors in the pom file.

Maven log: build.log

After discussing with @ruebot, it turns out that it's more complicated than we thought and we need more time to sort out this dependency hell before pushing a PR for this issue.

@jrwiebe

Contributor

commented Jan 23, 2019

I just pushed a fix to the dependency errors. They were caused by a conflict between versions of Guava. Hadoop 2.6.5 is bringing in Guava 11, while tika-langdetect requires a more modern version (1.19.1 calls for 17.0).
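
As a rough way to confirm this kind of conflict, a small check like the following can report which jar a class was actually loaded from. This is only a sketch: the WhichJar object and the jarOf helper are hypothetical names, and com.google.common.base.Preconditions is used simply as a well-known Guava class.

```scala
import scala.util.Try

// Hypothetical diagnostic helper, not part of AUT.
object WhichJar {
  // Returns the location a class was loaded from, or None if the class is
  // not on the classpath or has no code source (e.g. JDK bootstrap classes).
  def jarOf(className: String): Option[String] =
    Try(Class.forName(className)).toOption
      .flatMap(c => Option(c.getProtectionDomain.getCodeSource))
      .flatMap(cs => Option(cs.getLocation))
      .map(_.toString)

  def main(args: Array[String]): Unit = {
    val name = "com.google.common.base.Preconditions"
    println(s"$name -> ${jarOf(name).getOrElse("not on classpath")}")
  }
}
```

Run on the same classpath as the failing build, the printed location should include the Guava version in the jar file name, which shows whether the Hadoop-provided or the Tika-required version won.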

I created a version of tika-langdetect that shades Guava, basically following what is described here. I pushed my changes to pom.xml in my fork of tika. mvn deploy builds the maven artifact. I published it to a personal repository I created called aut-artifacts. The updated AUT POM includes this repository, and adds a <classifier> specifying the shaded version of tika-langdetect.

The build is still failing, but now it's because two tests fail:

Tests in error: 
  detect language(io.archivesunleashed.ArcTest): 57 did not equal 135
  keep languages(io.archivesunleashed.RecordRDDTest): scala.this.Predef.refArrayOps[String](r2).sameElements[String](scala.this.Predef.wrapRefArray[String](r)) was false

I haven't looked into this yet.

This shading solution is obviously not ideal, but it might do in the short term since we should be using the updated tika. The long term solution would be to upgrade Hadoop and our other dependencies.

@ruebot

Member Author

commented Jan 23, 2019

I remember going down this rabbit hole, and had set up a bunch of exclusions on the Guava dependencies. Maybe it would be worth going down that path again? That said, the transitive dependencies on this project are not fun to sort out!

@ruebot

Member Author

commented Jan 24, 2019

Started digging into the test failures. I suspect Tika is returning more with this version, and we need to dig into that more. But, maybe we should update our implementation too? I hadn't noticed this example before in the API documentation.

@ruebot

Member Author

commented Jan 24, 2019

Boris was never able to build it, and ran out of time to finish it before he left, so that explains why it never got that far.

ruebot added a commit that referenced this issue Jul 4, 2019

Update to Spark 2.4.3 and update Tika to 1.20.
- Resolves #295
- Resolves #308
- Resolves #286
- Pulls in unfinished work by @jrwiebe and @borislin.