Extract PDF #340

Merged
ruebot merged 20 commits into master from extract-pdf on Aug 12, 2019

Conversation

@jrwiebe
Contributor

commented Aug 12, 2019

This PR introduces the new extractPDFDetailsDF() method. It also makes our use of Tika's MIME type detection more efficient, and updates the POM to use a shaded version of tika-parsers, which eliminates a dependency version conflict that has long been troublesome.


GitHub issue:

  • #302: PDF binary object extraction – a pretty basic feature request, which surfaced some difficult dependency issues, as documented. The discussion under this issue contains all the details we ask for in a PR.

How should this be tested?

Discussion in #302 describes tests run by @ruebot and @jrwiebe – e.g., this.

jrwiebe and others added some commits Feb 1, 2019

Generate MD5 for saveImageToDisk/saveToDisk file suffix from actual file contents instead of base64-encoded version.
@ruebot
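
To illustrate the commit above: hashing the base64 text and hashing the decoded bytes give different digests, so a suffix meant to identify a file should be computed from the decoded contents. The following is a minimal sketch using only the JDK. `Md5Suffix`, `md5Hex`, and `suffixFromBase64` are hypothetical names made up for this example; it is not the code from this PR.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest
import java.util.Base64

// Hypothetical helper for illustration only; not the project's implementation.
object Md5Suffix {
  /** Hex-encoded MD5 digest of a byte array. */
  def md5Hex(bytes: Array[Byte]): String =
    MessageDigest.getInstance("MD5").digest(bytes).map("%02x".format(_)).mkString

  /** Decode the base64 payload first, then hash the actual file contents. */
  def suffixFromBase64(encoded: String): String =
    md5Hex(Base64.getDecoder.decode(encoded))

  /** The earlier behaviour described in the commit message: hash the base64 text itself. */
  def suffixFromEncodedText(encoded: String): String =
    md5Hex(encoded.getBytes(StandardCharsets.UTF_8))
}
```

With this sketch, `Md5Suffix.suffixFromBase64(encoded)` matches what a tool such as `md5sum` would report for the saved file, whereas `Md5Suffix.suffixFromEncodedText(encoded)` does not.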

ruebot approved these changes Aug 12, 2019

@codecov

commented Aug 12, 2019

Codecov Report

Merging #340 into master will decrease coverage by 1.45%.
The diff coverage is 28.12%.

@@            Coverage Diff             @@
##           master     #340      +/-   ##
==========================================
- Coverage   74.95%   73.49%   -1.46%     
==========================================
  Files          39       39              
  Lines        1122     1147      +25     
  Branches      197      198       +1     
==========================================
+ Hits          841      843       +2     
- Misses        214      237      +23     
  Partials       67       67

| Impacted Files | Coverage Δ |
| ...ain/scala/io/archivesunleashed/ArchiveRecord.scala | 84.9% <100%> (ø) ⬆️ |
| ...rchivesunleashed/matchbox/DetectMimeTypeTika.scala | 80% <100%> (+5%) ⬆️ |
| ...o/archivesunleashed/app/ExtractPopularImages.scala | 100% <100%> (ø) ⬆️ |
| ...c/main/scala/io/archivesunleashed/df/package.scala | 68.96% <11.11%> (-26.28%) ⬇️ |
| src/main/scala/io/archivesunleashed/package.scala | 74.6% <6.25%> (-10.09%) ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update b2d7394...2b2627c.

@ruebot merged commit 73981a7 into master on Aug 12, 2019

1 of 3 checks passed

codecov/patch 28.12% of diff hit (target 74.95%)
codecov/project 73.49% (-1.46%) compared to b2d7394
continuous-integration/travis-ci/pr The Travis CI build passed

@ruebot deleted the extract-pdf branch on Aug 12, 2019
