Function to get the status response code and headers of a warc response? #198
Comments
I am using this function at the moment:
If you tell me how you would prefer this code to be refactored to fit your library, I can make a pull request.
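The function mentioned in this comment is not preserved in the thread. Purely as an illustration of the idea, and not the commenter's code or anything from aut, here is a minimal sketch of pulling the status code and headers out of the raw HTTP payload of a WARC response record. The names `HttpResponseInfo`, `HttpResponseParser`, and `parseHttpResponse` are hypothetical, and the sketch assumes the payload is already available as a string that begins with the HTTP status line.

```scala
// Hypothetical helper for illustration only; not part of aut.
final case class HttpResponseInfo(status: Int, headers: Map[String, String])

object HttpResponseParser {
  // Assumes `payload` starts with an HTTP status line such as "HTTP/1.1 200 OK",
  // followed by header lines and a blank line before the body.
  def parseHttpResponse(payload: String): Option[HttpResponseInfo] = {
    // The header block ends at the first blank line.
    val headerBlock = payload.split("\r?\n\r?\n", 2).headOption.getOrElse("")
    headerBlock.split("\r?\n").toList match {
      case statusLine :: headerLines =>
        val parts = statusLine.trim.split("\\s+")
        val looksValid =
          parts.length >= 2 &&
            parts(0).startsWith("HTTP/") &&
            parts(1).length == 3 &&
            parts(1).forall(_.isDigit)
        if (looksValid) {
          // Header names are lower-cased because HTTP header names are case-insensitive.
          // Repeated headers keep only the last value in this simple sketch.
          val headers = headerLines.flatMap { line =>
            line.indexOf(':') match {
              case -1 => None
              case i  => Some(line.substring(0, i).trim.toLowerCase -> line.substring(i + 1).trim)
            }
          }.toMap
          Some(HttpResponseInfo(parts(1).toInt, headers))
        } else None
      case Nil => None
    }
  }
}
```

Returning an `Option` keeps malformed or non-HTTP payloads from throwing; whether that suits the library is a separate design question.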
However, there is an error parsing the following archive, generated by [...]. Do you have any idea about this?
ruebot added the question label Aug 20, 2018
@dportabella I was going to look at producing something to resolve this issue today. Are you still interested in providing a pull request?

With @lintool's and @ianmilligan1's okay, I would say that this function should be in the matchbox (https://github.com/archivesunleashed/aut/tree/master/src/main/scala/io/archivesunleashed/matchbox).

An alternative is to include access to the response code in the ArchiveRecord trait and ArchiveRecordImpl as .getHttpResponse, so that it is included in all ArchiveRecords by default. This seems the more user-friendly approach, but we would want to test the effect the function has on overall run time. (It should be nil due to laziness, but as a rule we should check run time any time we change the ArchiveRecord trait.) I think I prefer the second option.

If you are not able to prep a PR at this time, I can take a crack at it. Either way, I am not sure how, but we would want to be sure you are given appropriate credit for this idea and proposed implementation. Maybe @ruebot or @ianmilligan1 knows the best way to make this happen.

Thanks so much for your help!

Ryan
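To illustrate the laziness argument for the second option, here is a minimal sketch. `RecordSketch` and `RecordSketchImpl` are hypothetical stand-ins, not aut's actual ArchiveRecord trait or ArchiveRecordImpl class, and the empty-string default for an unparseable status line is an arbitrary choice made for this example.

```scala
// Hypothetical stand-ins for illustration; aut's real trait and class differ.
trait RecordSketch {
  def getUrl: String
  def getHttpStatus: String
}

// `rawHeader` is a by-name parameter, so reading the header is deferred
// until something actually asks for it.
final class RecordSketchImpl(url: String, rawHeader: => String) extends RecordSketch {
  val getUrl: String = url

  // lazy val: the status line is read and parsed only on first access,
  // then cached. Scripts that never call getHttpStatus pay nothing for it.
  lazy val getHttpStatus: String =
    rawHeader.trim.split("\\s+").toList match {
      case _ :: code :: _ => code
      case _              => "" // assumption: empty string when no status is found
    }
}

object LazyDemo {
  // Returns (header reads before access, parsed status, header reads after two accesses).
  def run(): (Int, String, Int) = {
    var reads = 0
    val rec = new RecordSketchImpl(
      "http://example.com/",
      { reads += 1; "HTTP/1.1 404 Not Found" }
    )
    val before = reads
    val status = rec.getHttpStatus
    rec.getHttpStatus // second access hits the cached value
    (before, status, reads)
  }
}
```

In this sketch the header is read zero times until `getHttpStatus` is first called and exactly once afterwards, which is the behaviour the run-time check would be expected to confirm.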
Branch issue-198 covers the response status code, but not the full headers, as I could not get the full header details at this stage.
The following compares run times using the same WARC collection:
17.0
17.1 (same script)
17.1 (add statusHeader) (note: the network script includes an additional map compared to the above)
17.1 (add fileName) (note: the network script includes an additional map compared to the above)
tl;dr - there is no effect on the ArchiveRecord / RecordLoader when not using .getHttpStatus or .getFilename, and there is minimal effect when using them.
This is the code I used to produce the above.
Original code:
Add .getHttpStatus:
produces: links: Array[((String, String, String), Int)] = Array(((200,nanaimodailynews.com,nanaimodailynews.com),445785), ((200,nanaimodailynews.com,blackpress.ca),188676), ((200,nanaimodailynews.com,bclocalnews.com),111400), ((200,nanaimodailynews.com,drivewaycanada.ca),53796), ((200,nanaimodailynews.com,facebook.com),53500), ((200,nanaimodailynews.com,bcclassified.com),52922), ((200,nanaimodailynews.com,usednanaimo.com),27067), ((200,nanaimodailynews.com,iservices.blackpress.ca),26953), ((200,nanaimodailynews.com,localworkbc.ca),26546), ((200,nanaimodailynews.com,twitter.com),24853))
produces
greebie referenced this issue Nov 22, 2018
Merged: Add .getHttpStatus and .getFilename to ArchiveRecordImpl class #198 & #164 #292
Cool, thanks!
dportabella commented Apr 20, 2018
Is there a function to get the status response code and headers of a WARC response?
such as...