Update Tika to 1.22; address security alerts. #337

Merged

Conversation

@ruebot
Member

commented Aug 6, 2019

What does this Pull Request do?

  • Update Tika to 1.22
  • pom.xml surgery to get aut to build again with --packages
  • Addresses CVE-2019-10093, CVE-2019-10094, and CVE-2019-10088

How should this be tested?

  • TravisCI
  • rm -rf ~/.m2/repository/* && mvn clean install && rm -rf ~/.ivy2/* && ~/bin/spark-2.4.3-bin-hadoop2.7/bin/spark-shell --packages io.archivesunleashed:aut:0.17.1-SNAPSHOT -i ~/318-test-lang.scala
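
The one-line test command above can be read as four steps. Below is a minimal sketch that splits them out, assuming the same paths as in the PR description. The names `run`, `verify_build`, and `DRY_RUN` are hypothetical helpers introduced only for illustration and are not part of aut. Clearing both caches presumably ensures that `--packages` resolves the freshly installed SNAPSHOT rather than a previously cached artifact.

```shell
# Sketch only: the same four steps as the one-line test command, split out.
# run, verify_build, and DRY_RUN are hypothetical names, not part of aut.

# Paths taken from the PR description; override them via the environment.
SPARK_HOME="${SPARK_HOME:-$HOME/bin/spark-2.4.3-bin-hadoop2.7}"
TEST_SCRIPT="${TEST_SCRIPT:-$HOME/318-test-lang.scala}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Each step runs only if the previous one succeeded, as in the original && chain.
verify_build() {
  # 1. Empty the local Maven repository so every dependency is re-resolved.
  run rm -rf "$HOME"/.m2/repository/* &&
  # 2. Build aut and install the SNAPSHOT into the local Maven repository.
  run mvn clean install &&
  # 3. Empty the Ivy cache that spark-shell uses for --packages.
  run rm -rf "$HOME"/.ivy2/* &&
  # 4. Start spark-shell with the SNAPSHOT package and run the test script.
  run "$SPARK_HOME/bin/spark-shell" \
    --packages io.archivesunleashed:aut:0.17.1-SNAPSHOT \
    -i "$TEST_SCRIPT"
}
```

Calling `DRY_RUN=1 verify_build` prints the commands without running them, and calling `verify_build` on its own executes them. Note that steps 1 and 3 delete the local Maven and Ivy caches, exactly as the original command does.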

318-test-lang.scala

import io.archivesunleashed._
import io.archivesunleashed.matchbox._

RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc)
.keepValidPages()
.keepDomains(Set("geocities.com"))
.keepLanguages(Set("en"))
.map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
.saveAsTextFile("/tmp/plain-text-en/")

sys.exit

Additional Notes

@jrwiebe this might throw a wrench in our other work, but hopefully shouldn't.

Update Tika to 1.22; address security alerts.
- Update Tika to 1.22
- pom.xml surgery to get aut to build again with --packages

@ruebot ruebot requested a review from ianmilligan1 Aug 6, 2019

@ianmilligan1
Member

left a comment

Builds nicely locally. I tested language extraction on CPP data and it worked very well.

@codecov-io

commented Aug 6, 2019

Codecov Report

Merging #337 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #337   +/-   ##
=======================================
  Coverage   75.97%   75.97%           
=======================================
  Files          39       39           
  Lines        1124     1124           
  Branches      197      197           
=======================================
  Hits          854      854           
  Misses        205      205           
  Partials       65       65

Last update 605afcc...1a3d2b9.

@ianmilligan1 ianmilligan1 merged commit 2d14b92 into master Aug 6, 2019

3 checks passed

codecov/patch Coverage not affected when comparing 605afcc...1a3d2b9
codecov/project 75.97% remains the same compared to 605afcc
continuous-integration/travis-ci/pr The Travis CI build passed

@ianmilligan1 ianmilligan1 deleted the tika-CVE-2019-10093-CVE-2019-10094-CVE-2019-10088 branch Aug 6, 2019

@jrwiebe

Contributor

commented Aug 6, 2019

> @jrwiebe this might throw a wrench in our other work, but hopefully shouldn't.

If you're referring to our use of com.github.netarchivesuite.language-detector, we're good; it still works.

@ruebot

Member Author

commented Aug 6, 2019

Cool. I tested on my end too before I did the PR. Glad things are working on your end too!
