Strategy to deal with conflict between application and Spark distribution dependencies #308
Comments
This would push us away from the convenience of using
The transitive dependencies are definitely a tangled beast on this project. I've put a lot of hours into trying to untangle it, but not sure if I did any good. @lintool, if you catch this in your travels over the next week or so, let me know if you have any ideas you can point us at.
amirhadad commented May 31, 2019
Maybe I'm missing something, but why can't we add an exclusion on the Spark end?
amirhadad commented Jun 3, 2019
@lintool and @ruebot, an exclusion on the Spark end will be highly appreciated. As the
jrwiebe commented Feb 18, 2019
There is a conflict between Tika's dependency on `commons-compress` and the version that is included in the Spark distribution, which under normal parameters causes calls to `DetectMimeTypeTika` from spark-shell to fail with a `java.lang.NoSuchMethodError`.
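For illustration, a call of roughly this shape is enough to trigger it (a sketch assuming the aut RDD API and the string-based `DetectMimeTypeTika` of this era; the archive path is a placeholder):

```scala
import io.archivesunleashed._
import io.archivesunleashed.matchbox._

// Load a web archive in spark-shell and ask Tika to detect MIME types.
// Tika's detection path ends up calling ArchiveStreamFactory.detect(),
// which only exists in commons-compress >= 1.14.
RecordLoader.loadArchives("/path/to/example.warc.gz", sc)
  .map(r => DetectMimeTypeTika(r.getContentString))
  .take(10)
```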
This is because current Spark distributions include old versions of commons-compress, and our code depends on version >= 1.14, which introduced the `ArchiveStreamFactory.detect()` method. Spark shell's classpath, which includes its `jars/` directory, takes precedence over our dependencies.

I initially resolved this by adding an exclusion of poi-ooxml to the tika-parsers dependency in our POM, since poi-ooxml is the module that requires the newer commons-compress. This won't do, however, since we will want that module for detection of Microsoft Office formats.
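That kind of exclusion looks roughly like this in the POM (a sketch; the tika-parsers version property shown is illustrative):

```xml
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>${tika.version}</version>
  <exclusions>
    <!-- poi-ooxml is what pulls in the newer commons-compress -->
    <exclusion>
      <groupId>org.apache.poi</groupId>
      <artifactId>poi-ooxml</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```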
A better solution is to prepend the correct commons-compress JAR to the spark-shell classpath with the `--driver-class-path` argument (i.e., the `spark.driver.extraClassPath` property), as in the sketch below.
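For example, something along these lines, where the jar paths and the commons-compress version are placeholders for whatever matches your installation:

```bash
spark-shell \
  --driver-class-path /path/to/commons-compress-1.18.jar \
  --jars /path/to/aut-fatjar.jar
```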
Obviously this is cumbersome, but until Spark is distributed with Hadoop 3, which shades Hadoop's dependencies, I don't see a better way. I've done quite a bit of research on this topic and haven't found any other solutions.
My question: Is there a better way? Perhaps some Maven magic, @lintool? Or is this just something we need to document?