Strategy to deal with conflict between application and Spark distribution dependencies #308
Comments
This would push us away from the convenience of using
The transitive dependencies are definitely a tangled beast on this project. I've put a lot of hours into trying to untangle it, but not sure if I did any good. @lintool, if you catch this in your travels over the next week or so, let me know if you have any ideas you can point us at.
amirhadad commented May 31, 2019
Maybe I'm missing something, but why can't we add an exclusion on the Spark end?
amirhadad commented Jun 3, 2019
@lintool and @ruebot, an exclusion on the Spark end will be highly appreciated. As the
jrwiebe commented Feb 18, 2019
There is a conflict between Tika's dependency on `commons-compress` and the version that is included in the Spark distribution, which under normal parameters causes calls to `DetectMimeTypeTika` from spark-shell to fail with a `java.lang.NoSuchMethodError`.
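For illustration, a call of roughly this shape is enough to trigger it (a sketch assuming the aut RDD API and the string-based `DetectMimeTypeTika` of this era; the archive path is a placeholder):

```scala
import io.archivesunleashed._
import io.archivesunleashed.matchbox._

// Load a web archive in spark-shell and ask Tika to detect MIME types.
// Tika's detection path ends up calling ArchiveStreamFactory.detect(),
// which only exists in commons-compress >= 1.14.
RecordLoader.loadArchives("/path/to/example.warc.gz", sc)
  .map(r => DetectMimeTypeTika(r.getContentString))
  .take(10)
```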
This is because current Spark distributions include old versions of commons-compress, and our code depends on version >= 1.14, which introduced the `ArchiveStreamFactory.detect()` method. Spark shell's classpath, which includes its `jars/` directory, takes precedence over our dependencies.

I initially resolved this by adding an exclusion of poi-ooxml to the tika-parsers dependency in our POM, since poi-ooxml is the module that requires the newer commons-compress. This won't do, however, since we will want that module for detection of Microsoft Office formats.
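That kind of exclusion looks roughly like this in the POM (a sketch; the tika-parsers version property shown is illustrative):

```xml
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>${tika.version}</version>
  <exclusions>
    <!-- poi-ooxml is what pulls in the newer commons-compress -->
    <exclusion>
      <groupId>org.apache.poi</groupId>
      <artifactId>poi-ooxml</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```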
A better solution is to prepend the correct commons-compress JAR to the spark-shell classpath with the `--driver-class-path` argument (i.e., the `spark.driver.extraClassPath` property), as in the sketch below.
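For example, something along these lines, where the jar paths and the commons-compress version are placeholders for whatever matches your installation:

```bash
spark-shell \
  --driver-class-path /path/to/commons-compress-1.18.jar \
  --jars /path/to/aut-fatjar.jar
```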
Obviously this is cumbersome, but until Spark is distributed with Hadoop 3, which shades Hadoop's dependencies, I don't see a better way. I've done quite a bit of research on this topic and haven't found any other solutions.
My question: Is there a better way? Perhaps some Maven magic, @lintool? Or is this just something we need to document?