Make ArchiveRecord.getContentBytes consistent, Resolve #334 #335
Conversation
ianmilligan1 requested a review from ruebot Jul 30, 2019
codecov-io commented Jul 30, 2019
Codecov Report
@@            Coverage Diff            @@
##           master    #335     +/-   ##
========================================
- Coverage   75.97%     75%   -0.98%
========================================
  Files          39      39
  Lines        1124    1124
  Branches      197     197
========================================
- Hits          854     843     -11
- Misses        205     214      +9
- Partials       65      67      +2
@ianmilligan1 I'm going to test with language and full text extraction. That the big stuff? Also, do you want to create a ticket for documentation so it doesn't get lost in the shuffle?
Plain text
Script
Results
Language
Script
Results
Anything else to test? Should I through
Yep!
Good idea. Behaviour should now always be consistent between ARCs + WARCs. I'll open up a documentation ticket.
ianmilligan1 referenced this pull request Aug 1, 2019
Open: Document RemoveHttpHeader and build into some of the documentation #130
Plain text
Script
Results
Language
Script
Results
Looks good to me. I'll wait for a thumbs up from @jrwiebe before I merge.
ianmilligan1 commented Jul 30, 2019
GitHub issue(s):
#334
What does this Pull Request do?
As noted in #334, @jrwiebe discovered that we are inconsistent in how we getContentBytes of ARC and WARC files. Currently, for ARC files we just get the contents of the record (using getBodyContent), and for WARC files we get everything (using getContent). The two should be consistent.

On the ticket itself, I discussed the various outputs, how they differ, and potential solutions. I think it is critical that we normalize the two approaches. Given the existence of RemoveHttpHeader, I think we should have both just use getContent. If people just want the body content, they can remove the header (which is what we've been doing with WARC files). I can imagine lots of diverse use cases for HTTP header information, so it's better to have it in there and then remove it.

How should this be tested?
Additional Notes:
We should probably foreground RemoveHttpHeader more in our documentation.

Interested parties
@ruebot @jrwiebe
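
To make the proposed normalization concrete, here is a rough sketch in Python of the idea described above: always return the full record content (headers included), and let callers strip the HTTP header afterwards if they only want the body. This is a toy model, not the toolkit's code. The function names get_content and remove_http_header, the sample record, and the choice to split on the first blank line are all assumptions made for illustration.

```python
# Illustrative only: a toy model of the behaviour proposed in this PR.
# get_content and remove_http_header are hypothetical names, not the
# toolkit's actual API.

# In an HTTP response, a blank line (CRLF CRLF) separates headers from body.
HEADER_BODY_SEPARATOR = b"\r\n\r\n"


def get_content(record: bytes) -> bytes:
    """Return the whole record payload, HTTP headers included."""
    return record


def remove_http_header(content: bytes) -> bytes:
    """Return only the body of an HTTP response.

    Assumption: content is stripped only when it starts with "HTTP/" and
    contains a header/body separator; anything else is returned unchanged.
    """
    if not content.startswith(b"HTTP/"):
        return content
    _head, sep, body = content.partition(HEADER_BODY_SEPARATOR)
    return body if sep else content


# A made-up record payload, the same shape whether it came from an ARC
# or a WARC file once both paths return the full content.
record = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<html><body>Hello</body></html>"
)

full = get_content(record)            # headers + body
body_only = remove_http_header(full)  # body only, on request
```

The design choice this illustrates is the one argued for above: keeping the header in the default output loses nothing, since removing it later is a cheap, explicit step, whereas a header that was dropped at read time cannot be recovered downstream.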