Extract PDF #340

jrwiebe · Aug 12, 2019

This PR introduces the new `extractPDFDetailsDF()` method, makes our use of Tika's MIME type detection more efficient, and updates the POM to use a shaded version of tika-parsers, which eliminates a long-troublesome dependency version conflict.

GitHub issue:

#302: PDF binary object extraction – a pretty basic feature request, which surfaced some difficult dependency issues, as documented. The discussion under this issue contains all the details we ask for in a PR.

How should this be tested?

The discussion in #302 describes the tests run by @ruebot and @jrwiebe.

ruebot · Aug 12, 2019

ruebot approved these changes Aug 12, 2019

codecov · Aug 12, 2019

Codecov Report

Merging #340 into master will decrease coverage by 1.45%.
The diff coverage is 28.12%.

@@            Coverage Diff             @@
##           master     #340      +/-   ##
==========================================
- Coverage   74.95%   73.49%   -1.46%     
==========================================
  Files          39       39              
  Lines        1122     1147      +25     
  Branches      197      198       +1     
==========================================
+ Hits          841      843       +2     
- Misses        214      237      +23     
  Partials       67       67
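
The headline figure (1.45%) and the diff row (-1.46%) differ slightly. One reading consistent with the numbers is sketched below in Python. It assumes that coverage is hits divided by lines and that percentages are truncated rather than rounded to two decimals; this is inferred from the figures above, not confirmed Codecov behaviour, and `truncate_percent` is a hypothetical helper.

```python
from fractions import Fraction


def truncate_percent(ratio: Fraction) -> str:
    """Format a ratio as a percentage truncated (not rounded) to two decimals."""
    # Assumption: the report truncates; int() drops the fractional part.
    hundredths = int(ratio * 10000)
    return f"{hundredths // 100}.{hundredths % 100:02d}%"


before = Fraction(841, 1122)  # hits / lines on master, from the diff above
after = Fraction(843, 1147)   # hits / lines after merging #340

print(truncate_percent(before))          # 74.95%
print(truncate_percent(after))           # 73.49%
print(truncate_percent(before - after))  # 1.45%
```

Under this assumption, the -1.46% in the diff row would be the difference of the two already-truncated percentages (74.95 - 73.49), while the 1.45% headline would be the exact difference truncated once.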

| Impacted Files | Coverage Δ | |
|---|---|---|
| ...ain/scala/io/archivesunleashed/ArchiveRecord.scala | `84.9% <100%> (ø)` | ⬆️ |
| ...rchivesunleashed/matchbox/DetectMimeTypeTika.scala | `80% <100%> (+5%)` | ⬆️ |
| ...o/archivesunleashed/app/ExtractPopularImages.scala | `100% <100%> (ø)` | ⬆️ |
| ...c/main/scala/io/archivesunleashed/df/package.scala | `68.96% <11.11%> (-26.28%)` | ⬇️ |
| src/main/scala/io/archivesunleashed/package.scala | `74.6% <6.25%> (-10.09%)` | ⬇️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update b2d7394...2b2627c.

ruebot deleted the extract-pdf branch Aug 12, 2019

ianmilligan1 referenced this pull request Aug 12, 2019
Open
Document Binary Object Extraction #133

archivesunleashed/aut

jrwiebe and others added some commits Feb 1, 2019

ruebot merged commit `73981a7` into master Aug 12, 2019
1 of 3 checks passed
