Make ArchiveRecord.getContentBytes consistent, Resolve #334 #335

ianmilligan1 · Jul 30, 2019

GitHub issue(s):

What does this Pull Request do?

As noted in #334, @jrwiebe discovered that we are inconsistent on how we getContentBytes of ARC and WARC files. Currently, ARC files we just get the contents of the record (using getBodyContent) and WARC files we get everything (using getContent). The two should be consistent.

On the ticket itself, I discussed the various outputs, how they differ, and potential solutions. I think it is critical that we normalize the two approaches. Given the existence of RemoveHttpHeader, I think we should have both just using getContent. If people just want the body content, they can remove the header (which is what we've been doing with WARC files). I can imagine lots of diverse use cases for HTTP header information so it's better to have it in there and then remove it.

How should this be tested?

TravisCI should turn green. Note that I had to change the tests because now that we are returning headers in the ARC files, the tika and language counts are different. This is to be expected, to me at least.
We should do some sanity checking on the outputs.

Additional Notes:

We should probably foreground RemoveHttpHeader more in our documentation.

Interested parties

@ruebot @jrwiebe

codecov-io · Jul 30, 2019

Codecov Report

Merging #335 into master will decrease coverage by 0.97%.
The diff coverage is 100%.

@@           Coverage Diff            @@
##           master   #335      +/-   ##
========================================
- Coverage   75.97%    75%   -0.98%     
========================================
  Files          39     39              
  Lines        1124   1124              
  Branches      197    197              
========================================
- Hits          854    843      -11     
- Misses        205    214       +9     
- Partials       65     67       +2

Impacted Files	Coverage Δ
...ain/scala/io/archivesunleashed/ArchiveRecord.scala	`84.9% <100%> (ø)`	⬆️
...io/archivesunleashed/matchbox/DetectLanguage.scala	`66.66% <0%> (-33.34%)`	⬇️
...rchivesunleashed/matchbox/DetectMimeTypeTika.scala	`75% <0%> (-25%)`	⬇️
...java/io/archivesunleashed/data/ArcRecordUtils.java	`63.04% <0%> (-15.22%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 605afcc...40bc126. Read the comment docs.

Make ArchiveRecord.getContentBytes consistent,#334

Loading status checks…

f2bfa3b

ianmilligan1 requested a review from ruebot Jul 30, 2019

Merge branch 'master' into issue-334

Verified

This commit was created on GitHub.com and signed with a verified signature using GitHub’s key.

GPG key ID: 4AEE18F83AFDEB23 Learn about signing commits

Loading status checks…

40bc126

archivesunleashed/aut

Make ArchiveRecord.getContentBytes consistent, Resolve #334 #335

Make ArchiveRecord.getContentBytes consistent, Resolve #334 #335

ianmilligan1 commented Jul 30, 2019

ianmilligan1 requested a review from ruebot Jul 30, 2019

This comment has been minimized.

codecov-io commented Jul 30, 2019 •

edited

archivesunleashed/aut

Join GitHub today

Make ArchiveRecord.getContentBytes consistent, Resolve #334 #335

Conversation

ianmilligan1 commented Jul 30, 2019

What does this Pull Request do?

How should this be tested?

Additional Notes:

Interested parties

ianmilligan1 requested a review from ruebot Jul 30, 2019

This comment has been minimized.

codecov-io commented Jul 30, 2019 • edited

Codecov Report

codecov-io commented Jul 30, 2019 •

edited