Join GitHub today
GitHub is home to over 28 million developers working together to host and review code, manage projects, and build software together.
Sign upAdd .getHttpStatus and .getFilename to ArchiveRecordImpl class #198 & #164 #292
Conversation
greebie
added some commits
Nov 21, 2018
This comment has been minimized.
This comment has been minimized.
codecov-io
commented
Nov 22, 2018
•
Codecov Report
@@ Coverage Diff @@
## master #292 +/- ##
==========================================
+ Coverage 73.33% 73.39% +0.06%
==========================================
Files 42 42
Lines 1170 1184 +14
Branches 205 210 +5
==========================================
+ Hits 858 869 +11
Misses 244 244
- Partials 68 71 +3
Continue to review full report at Codecov.
|
This comment has been minimized.
This comment has been minimized.
Thanks @greebie – I can test. Is this PR complete or is there anything else you're hoping to add to it before we start the review process? |
This comment has been minimized.
This comment has been minimized.
Thanks. I'm pretty sure I learned from the last PR and did a very good job testing. fingers crossed I think it's ready to go. |
This comment has been minimized.
This comment has been minimized.
OK sg. I'll take first round testing, realistically probably tomorrow morning. |
@@ -37,17 +38,76 @@ class ArchiveRecordTest extends FunSuite with BeforeAndAfter { | ||
.setAppName(appName) | ||
conf.set("spark.driver.allowMultipleContexts", "true"); | ||
sc = new SparkContext(conf) | ||
This comment has been minimized.
This comment has been minimized.
/** Trait for a record in a web archive. */ | ||
trait ArchiveRecord extends Serializable { | ||
/** Returns the filename containing the Archive Records */ | ||
def getFilename: String |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
greebie
Nov 23, 2018
Contributor
Done. Just to confirm Resourcename (to mimic Filename) and not ResourceName (because resource name are two separate words)? I'm okay with either.
This comment has been minimized.
This comment has been minimized.
greebie
Nov 23, 2018
Contributor
Current commit uses Resourcename. I think that's fine - I'm probably just overthinking it.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
/** Trait for a record in a web archive. */ | ||
trait ArchiveRecord extends Serializable { | ||
/** Returns the filename containing the Archive Records */ |
This comment has been minimized.
This comment has been minimized.
@@ -64,6 +74,7 @@ class ArchiveRecordImpl(r: SerializableWritable[ArchiveRecordWritable]) extends | ||
var arcRecord: ARCRecord = null | ||
var warcRecord: WARCRecord = null | ||
// scalastyle:on null | ||
var headerResponseFormat: String = "US-ASCII" |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
ruebot
Nov 23, 2018
Member
Historically, HTTP has allowed field content with text in the
ISO-8859-1 charset [ISO-8859-1], supporting other charsets only
through use of [RFC2047] encoding. In practice, most HTTP header
field values use only a subset of the US-ASCII charset [USASCII].
Newly defined header fields SHOULD limit their field values to
US-ASCII octets. A recipient SHOULD treat other octets in field
content (obs-text) as opaque data.
I suppose it this? https://tools.ietf.org/html/rfc7230#section-3.2.4
This comment has been minimized.
This comment has been minimized.
greebie
Nov 23, 2018
Contributor
Yes - this part was a mimic of @dportabella 's suggestion. Seemed weird to me at first as well.
ianmilligan1
approved these changes
Nov 23, 2018
•
Tested with variations on
import io.archivesunleashed._
import io.archivesunleashed.matchbox._
RecordLoader.loadArchives("/tuna1/scratch/i2milligan/warcs.archive-it.org/cgi-bin/getarcs.pl/*.gz", sc)
.keepValidPages()
.map(r => (r.getFilename, r.getHttpStatus, r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
.saveAsTextFile("/tuna1/scratch/i2milligan/results/plain-text-http")
WARC file and response codes all check out, and work in a variety of scripts. Good to go from my POV (and @ruebot thanks for your review too!).
This comment has been minimized.
This comment has been minimized.
I would like to run one more test on this minus the |
This comment has been minimized.
This comment has been minimized.
So just i.e. import io.archivesunleashed._
import io.archivesunleashed.matchbox._
RecordLoader.loadArchives("/tuna1/scratch/i2milligan/warcs.archive-it.org/cgi-bin/getarcs.pl/*.gz", sc)
.map(r => (r.getFilename, r.getHttpStatus, r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
.saveAsTextFile("/tuna1/scratch/i2milligan/results/plain-text-http") I have a large body of WARCs to test on, so if that's what you mean I can test it here too @greebie |
This comment has been minimized.
This comment has been minimized.
Yes. Would like to see some statuses besides 200. :) Also if the header is munged, I'd like to be able to cover for it. Really appreciate it if you could. |
This comment has been minimized.
This comment has been minimized.
OK. Let me run it at scale minus the plain text, so we can easily look around at status codes too. |
This comment has been minimized.
This comment has been minimized.
Tested it without Am seeing other response codes - i.e. 404s, 301s, etc. i.e.
|
This comment has been minimized.
This comment has been minimized.
Awesome! I think doing a status code visualisation would make a decent Medium post. That might be my december project. |
ruebot
referenced this pull request
Nov 23, 2018
Open
Update aut documentation for https://github.com/archivesunleashed/aut/pull/292 #74
This comment has been minimized.
This comment has been minimized.
More relevant to #260 but related to why coverage shrunk in this PR - I think we need test cases for ArchiveRecords that are neither WARC or ARC Format. The java takes care of the error handling I think, but codecov notices that we did not test for that here in ArchiveRecord for some reason. I think this is something for a different PR, however. |
This comment has been minimized.
This comment has been minimized.
I think we have different interpretations of what Resource name is. This appears to be putting out the filename of the ARC/WARC. My understanding was that it was the resource name of the resource being parsed. i.e. |
This comment has been minimized.
This comment has been minimized.
We also need to decide if it is just the ARC/WARC filename name itself that the method returns, or the full path. |
assert(textSampleWarc.deep == Array("20080430", "20080430", "20080430").deep) | ||
} | ||
test("Domains") { |
This comment has been minimized.
This comment has been minimized.
ruebot
Nov 27, 2018
Member
There are a few extra test methods here. Are they scope for the issues posted in the original comment? Or do they cover other tickets?
This comment has been minimized.
This comment has been minimized.
greebie
Nov 27, 2018
Contributor
No other tickets. Basically, last PR I had good test coverage, but failed to test particular cases and my code passed Travis with bugs. I decided to include these additional tests in case that happened again (it was unlikely, but I wanted this PR to go more smoothly). Since ArchiveRecord is used widely across the tests, I did not expect the tests to improve coverage (as per #260).
This comment has been minimized.
This comment has been minimized.
Other than the above comments, functionality-wise, looks good to me:
Nice work @greebie |
This comment has been minimized.
This comment has been minimized.
I think archiveFilename works. The issue is that because the filename comes from the |
This comment has been minimized.
This comment has been minimized.
I suggest we keep the fullpaths but provide information accessing the filename in the docs. There's an example to extract just the filename in the tests. Another option is to include a util. |
This comment has been minimized.
This comment has been minimized.
Ok, let's go with |
This comment has been minimized.
This comment has been minimized.
Ok. Update this ticket as necessary after this gets merged. |
This comment has been minimized.
This comment has been minimized.
Good to go! Script:
Sample output:
|
greebie commentedNov 22, 2018
•
edited by ianmilligan1
GitHub issue(s):
#198, #164
What does this Pull Request do?
This PR includes three changes to the ArchiveRecord class:
.getHttpStatus
could be useful for studies that would like to compare status calls on crawls.Example:
Should produce something like
It might be more interesting to test this script without
.keepValidPages()
since most valid pages would likely return a200
. Where an ArchiveRecord fails to get a status code (via null or empty string) the record will return000
. This default value can be changed..getFilename
returns the fullpath (which may be a url) of the Warc that was consumed.should produce something like
To get just the filename, you could use
FilenameUtils.getName(x.getFilename)
(I have included the FilenameUtils import in the code above). I have not looked into whether FilenameUtils will get the filename from a url, but that is why I stuck with the fullpath. Also, I am open to feedback on whether .getFilename is the right call for this. IIPC calls it.getReaderIdentifier()
.I can think of a number of use cases for this, but it would be great as a way to detect problems with warcs (e.g. with wonky data). It might also be helpful for error responses down the road.
How should this be tested?
The above scripts should have expected outputs.
Both functions should work for both arc and warc formats.
Travis should pass.
Additional Notes:
I used some of @dportabella 's ideas to produce the code here, in particular for getting the HttpStatus from a Warc file. For Arc and the Warc filename, I tried to use the WARCRecord and ARCRecord classes instead.
I did not include the full header responses in this PR. It seems like the ArcRecord and WarcRecord responses are quite different and I haven't been able to produce an effective testing mechanism.
Interested parties
@ianmilligan1 @ruebot
Thanks in advance for your help with the Archives Unleashed Toolkit!