Use version of tika-parsers without a classifier #345

Merged
ruebot merged 1 commit into master from pom-update on Aug 14, 2019

@jrwiebe
Contributor

commented Aug 14, 2019

As @ruebot mentioned on Slack, running AUT with --packages produced error messages, e.g.:

19/08/14 19:35:28 ERROR SparkContext: Failed to add file:/root/.ivy2/jars/com.github.archivesunleashed.tika_tika-parsers-1.22.jar to Spark environment
java.io.FileNotFoundException: Jar /root/.ivy2/jars/com.github.archivesunleashed.tika_tika-parsers-1.22.jar not found
    at org.apache.spark.SparkContext.addJarFile$1(SparkContext.scala:1838)
...

This happens because Ivy, which Spark uses to resolve --packages coordinates, does not reliably locate dependencies that are declared with a classifier.

Since the classifier wasn't doing any useful work, I removed it from our fork of Tika and pushed a new release. This PR updates our POM to depend on that release.

How should this be tested?

Something like this:

jrwiebe@tuna:~/aut$ rm -rf ~/.m2/repository/* && mvn clean install && rm -rf ~/.ivy2/* && time ~/spark-2.4.3-bin-hadoop2.7/bin/spark-shell --packages io.archivesunleashed:aut:0.17.1-SNAPSHOT

There should be no errors.
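
Since the original failure was a jar missing from the Ivy cache, one way to make "no errors" concrete is to check that the expected jar file is actually on disk after the shell starts. The sketch below assumes the `<group>_<artifact>-<version>.jar` naming seen in the error message above and the default `~/.ivy2/jars` location; `check_ivy_jar` is a hypothetical helper, not part of AUT or Spark.

```shell
# Hypothetical helper: report whether a resolved artifact's jar exists in a jar directory.
# Usage: check_ivy_jar <jar_dir> <group> <artifact> <version>
check_ivy_jar() {
  # Assumed naming pattern, taken from the error message: <group>_<artifact>-<version>.jar
  jar="$1/$2_$3-$4.jar"
  if [ -f "$jar" ]; then
    echo "found: $jar"
  else
    echo "missing: $jar"
    return 1
  fi
}
```

For example, `check_ivy_jar ~/.ivy2/jars com.github.archivesunleashed.tika tika-parsers 1.22` should print a `found:` line once the dependency resolves correctly, and return a non-zero status otherwise.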

I tested by running this code:

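import io.archivesunleashed._
import io.archivesunleashed.df._

// Glob matching the test WARC files.
val warc_path = "/home/jrwiebe/warcs/*.gz"

// Load the archives and extract PDF records into a DataFrame; `sc` is the SparkContext that spark-shell provides.
val df_pdf = RecordLoader.loadArchives(warc_path, sc).extractPDFDetailsDF()
// Write the contents of the `bytes` column to disk under the given path prefix, using the value in the `extension` column.
val res_pdf = df_pdf.select($"bytes", $"extension").saveToDisk("bytes", "/home/jrwiebe/test/pdf", "extension")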

Use version of tika-parsers without a classifier, as ivy couldn't
handle it, and specifying one for the custom tika-parsers artifact
was unnecessary.

@jrwiebe jrwiebe requested a review from ruebot Aug 14, 2019

@ruebot

ruebot approved these changes Aug 14, 2019

Member

left a comment

Tested by tweaking docker-aut:

	com.github.archivesunleashed.tika#tika-core;1.22 from local-m2-cache in [default]
	com.github.archivesunleashed.tika#tika-parsers;1.22 from local-m2-cache in [default]

🤘 🤘

@codecov

commented Aug 14, 2019

Codecov Report

Merging #345 into master will not change coverage.
The diff coverage is n/a.

@@          Coverage Diff           @@
##           master    #345   +/-   ##
======================================
  Coverage    75.2%   75.2%           
======================================
  Files          39      39           
  Lines        1230    1230           
  Branches      224     224           
======================================
  Hits          925     925           
  Misses        214     214           
  Partials       91      91


@ruebot ruebot merged commit 39831c2 into master Aug 14, 2019

3 checks passed

codecov/patch Coverage not affected when comparing 01d12b4...27fde0b
codecov/project 75.2% remains the same compared to 01d12b4
continuous-integration/travis-ci/pr The Travis CI build passed

@ruebot ruebot deleted the pom-update branch Aug 14, 2019
