
Strategy to deal with conflict between application and Spark distribution dependencies #308

Closed
jrwiebe opened this issue Feb 18, 2019 · 4 comments

@jrwiebe commented Feb 18, 2019

There is a conflict between Tika's dependency on commons-compress and the version included in the Spark distribution, which, with a default spark-shell invocation, causes calls to DetectMimeTypeTika to fail with a java.lang.NoSuchMethodError. For example, running the following code:

import io.archivesunleashed._
import io.archivesunleashed.matchbox._

val r = RecordLoader.loadArchives("/path/to/example.warc.gz", sc)

r.keepValidPages().map(r => DetectMimeTypeTika(r.getContentString)).take(5)

results in:

2019-02-18 16:39:07 ERROR Executor:91 - Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.NoSuchMethodError: org.apache.commons.compress.archivers.ArchiveStreamFactory.detect(Ljava/io/InputStream;)Ljava/lang/String;
	at org.apache.tika.parser.pkg.ZipContainerDetector.detectArchiveFormat(ZipContainerDetector.java:160)
...

This is because current Spark distributions include old versions of commons-compress, and our code depends on version >= 1.14, which introduced the ArchiveStreamFactory.detect() method. Spark shell's classpath, which includes its jars/ directory, takes precedence over our dependencies.
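
One quick way to confirm which copy of commons-compress wins at runtime (a diagnostic sketch you can paste into spark-shell) is to ask where the class was loaded from:

import org.apache.commons.compress.archivers.ArchiveStreamFactory

// Prints the jar that actually supplied the class; with a stock Spark
// distribution this points into Spark's jars/ directory rather than our fatjar.
println(classOf[ArchiveStreamFactory].getProtectionDomain.getCodeSource.getLocation)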

I initially resolved this by adding an exclusion of poi-ooxml to the tika-parsers dependency in our POM, since poi-ooxml is the module that requires the newer commons-compress. This won't do, however, since we will want that module for detection of Microsoft Office formats.
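
For the record, the exclusion looked roughly like this (a sketch; the ${tika.version} property name is illustrative):

<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>${tika.version}</version>
  <exclusions>
    <exclusion>
      <groupId>org.apache.poi</groupId>
      <artifactId>poi-ooxml</artifactId>
    </exclusion>
  </exclusions>
</dependency>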

A better solution is to prepend the correct commons-compress JAR to the spark-shell classpath with the --driver-class-path argument (i.e., the spark.driver.extraClassPath property). E.g.,

$ spark-shell --jars target/aut-0.17.1-SNAPSHOT-fatjar.jar --driver-memory 4G --driver-class-path /home/jrwiebe/.m2/repository/org/apache/commons/commons-compress/1.18/commons-compress-1.18.jar
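
Equivalently, the same workaround can be expressed through the property itself:

$ spark-shell --jars target/aut-0.17.1-SNAPSHOT-fatjar.jar --driver-memory 4G --conf spark.driver.extraClassPath=/home/jrwiebe/.m2/repository/org/apache/commons/commons-compress/1.18/commons-compress-1.18.jar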

Obviously this is cumbersome, but until Spark is distributed with Hadoop 3, which shades Hadoop's dependencies, I don't see a better way; I've done quite a bit of research on this topic and haven't found any other solutions.

My question: Is there a better way? Perhaps some Maven magic, @lintool? Or is this just something we need to document?

@ruebot commented Feb 20, 2019

This would push us away from the convenience of using --packages. But, if it's really our only path forward on binary extraction, it appears to me that we need to take that path.

The transitive dependencies are definitely a tangled beast on this project. I've put a lot of hours into trying to untangle them, but I'm not sure I did any good. @lintool, if you catch this in your travels over the next week or so, let me know if you have any ideas you can point us at.

@amirhadad commented May 31, 2019

@ruebot and @lintool I am facing the same issue. The difference is that I am using spark-submit, and --driver-class-path only solves the conflict in client (local) mode. I am using Tika 1.20 and Spark 2.1.1.

@lintool commented May 31, 2019

> I initially resolved this by adding an exclusion of poi-ooxml to the tika-parsers dependency in our POM, since poi-ooxml is the module that requires the newer commons-compress. This won't do, however, since we will want that module for detection of Microsoft Office formats.

Maybe I'm missing something, but why can't we add an exclusion on the Spark end?
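
For concreteness, that would look something like this on the spark-core dependency (a sketch; the coordinates are illustrative, and it would only change what Maven resolves, not the jars/ directory that ships with a Spark distribution):

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>${spark.version}</version>
  <exclusions>
    <exclusion>
      <groupId>org.apache.commons</groupId>
      <artifactId>commons-compress</artifactId>
    </exclusion>
  </exclusions>
</dependency>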

@amirhadad commented Jun 3, 2019

@lintool and @ruebot an exclusion on the Spark end would be highly appreciated, since --driver-class-path and --jars don't work in cluster mode to force the newer version of commons-compress and resolve the conflict on org.apache.commons.compress.archivers.ArchiveStreamFactory. I tried building my fat JAR with Spark 2.4 and the issue persists.
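
For reference, what I tried in cluster mode was along these lines (a sketch; paths, class, and jar names are illustrative):

$ spark-submit --deploy-mode cluster \
    --jars /path/to/commons-compress-1.18.jar \
    --conf spark.driver.extraClassPath=commons-compress-1.18.jar \
    --conf spark.executor.extraClassPath=commons-compress-1.18.jar \
    --class MyApp my-app-fatjar.jar

On YARN, the files passed with --jars are localized into each container's working directory, which is why the extraClassPath entries reference bare file names.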

ruebot added a commit that referenced this issue Jul 4, 2019

Update to Spark 2.4.3 and update Tika to 1.20.
- Resolves #295
- Resolves #308
- Resolves #286
- Pulls in unfinished work by @jrwiebe and @borislin.