Make ArchiveRecord.getContentBytes consistent, Resolve #334 #335
ianmilligan1 requested a review from ruebot Jul 30, 2019
codecov-io commented Jul 30, 2019
Codecov Report
@@            Coverage Diff            @@
##           master    #335     +/-   ##
========================================
- Coverage   75.97%     75%   -0.98%
========================================
  Files          39      39
  Lines        1124    1124
  Branches      197     197
========================================
- Hits          854     843     -11
- Misses        205     214      +9
- Partials       65      67      +2
Continue to review full report at Codecov.
@ianmilligan1 I'm going to test with language and full text extraction. That the big stuff? Also, do you want to create a ticket for documentation so it doesn't get lost in the shuffle?
Plain text
Script
Results
Language
Script
Results
Anything else to test? Should I through
Yep!
Good idea. Behaviour should now always be consistent between ARCs + WARCs. I'll open up a documentation ticket.
ianmilligan1 referenced this pull request Aug 1, 2019
Open: Document RemoveHttpHeader and build into some of the documentation #130
Plain text
Script
Results
Language
Script
Results
Looks good to me. I'll wait for a thumbs up from @jrwiebe before I merge.
ruebot approved these changes Aug 1, 2019
@jrwiebe you good on this one?
A-OK
ianmilligan1 commented Jul 30, 2019
GitHub issue(s):
#334
What does this Pull Request do?
As noted in #334, @jrwiebe discovered that we are inconsistent in how we getContentBytes for ARC and WARC files. Currently, for ARC files we get just the contents of the record (using getBodyContent), and for WARC files we get everything (using getContent). The two should be consistent.
On the ticket itself, I discussed the various outputs, how they differ, and potential solutions. I think it is critical that we normalize the two approaches. Given the existence of RemoveHttpHeader, I think we should have both just use getContent. If people just want the body content, they can remove the header (which is what we've been doing with WARC files). I can imagine lots of diverse use cases for HTTP header information, so it's better to have it in there and then remove it.
How should this be tested?
Additional Notes:
We should probably foreground RemoveHttpHeader more in our documentation.
Interested parties
@ruebot @jrwiebe
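For readers following along, here is a rough sketch of the difference the description is getting at: full record content (HTTP headers plus body) versus body only, and how stripping the header recovers the latter from the former. It is written in Python purely for illustration and is not the toolkit's code. The function name remove_http_header and the sample record are hypothetical, and it assumes the content starts with an HTTP status line and that the headers end at the first blank line (CRLF CRLF).

```python
# Illustrative sketch only; NOT the toolkit's implementation.
# Assumption: headers end at the first blank line (CRLF CRLF).
HEADER_END = b"\r\n\r\n"


def remove_http_header(content: bytes) -> bytes:
    """Return the payload after the HTTP headers, or the input unchanged.

    Hypothetical helper mirroring the idea of stripping the header
    from full record content to get body-only content.
    """
    if not content.startswith(b"HTTP/"):
        # No HTTP status line, so there is nothing to strip.
        return content
    _head, sep, body = content.partition(HEADER_END)
    # If no blank line was found, leave the content untouched.
    return body if sep else content


# Made-up record content for demonstration.
record = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<html><body>Hello</body></html>"
)

full_content = record                   # headers + body
body_only = remove_http_header(record)  # body only
```

Returning everything and stripping on demand keeps the header information available for anyone who needs it, which is the trade-off argued for in the description.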