Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make ArchiveRecord.getContentBytes consistent, Resolve #334 #335

Open
wants to merge 2 commits into
base: master
from

Conversation

@ianmilligan1
Copy link
Member

commented Jul 30, 2019

GitHub issue(s):

#334

What does this Pull Request do?

As noted in #334, @jrwiebe discovered that we are inconsistent on how we getContentBytes of ARC and WARC files. Currently, ARC files we just get the contents of the record (using getBodyContent) and WARC files we get everything (using getContent). The two should be consistent.

On the ticket itself, I discussed the various outputs, how they differ, and potential solutions. I think it is critical that we normalize the two approaches. Given the existence of RemoveHttpHeader, I think we should have both just using getContent. If people just want the body content, they can remove the header (which is what we've been doing with WARC files). I can imagine lots of diverse use cases for HTTP header information so it's better to have it in there and then remove it.

How should this be tested?

  • TravisCI should turn green. Note that I had to change the tests because now that we are returning headers in the ARC files, the tika and language counts are different. This is to be expected, to me at least.
  • We should do some sanity checking on the outputs.

Additional Notes:

We should probably foreground RemoveHttpHeader more in our documentation.

Interested parties

@ruebot @jrwiebe

@ianmilligan1 ianmilligan1 requested a review from ruebot Jul 30, 2019

@codecov-io

This comment has been minimized.

Copy link

commented Jul 30, 2019

Codecov Report

Merging #335 into master will decrease coverage by 0.97%.
The diff coverage is 100%.

Impacted file tree graph

@@           Coverage Diff            @@
##           master   #335      +/-   ##
========================================
- Coverage   75.97%    75%   -0.98%     
========================================
  Files          39     39              
  Lines        1124   1124              
  Branches      197    197              
========================================
- Hits          854    843      -11     
- Misses        205    214       +9     
- Partials       65     67       +2
Impacted Files Coverage Δ
...ain/scala/io/archivesunleashed/ArchiveRecord.scala 84.9% <100%> (ø) ⬆️
...io/archivesunleashed/matchbox/DetectLanguage.scala 66.66% <0%> (-33.34%) ⬇️
...rchivesunleashed/matchbox/DetectMimeTypeTika.scala 75% <0%> (-25%) ⬇️
...java/io/archivesunleashed/data/ArcRecordUtils.java 63.04% <0%> (-15.22%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 605afcc...40bc126. Read the comment docs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants
You can’t perform that action at this time.