Join GitHub today
GitHub is home to over 36 million developers working together to host and review code, manage projects, and build software together.
Sign upInconsistency in ArchiveRecord.getContentBytes #334
Comments
This comment has been minimized.
This comment has been minimized.
Seems like an unintentional difference to me. Agreed that both should be the same... |
ianmilligan1
self-assigned this
Jul 30, 2019
This comment has been minimized.
This comment has been minimized.
This is a good find, @jrwiebe. We haven't run into this issue when working with the Archives Unleashed Cloud because we wrap our text extractor job there in Running Example Script on an ARC and WARCBut if you run the plain-text file on a combination of ARCs and WARCs (i.e. our ARC and WARC bundled as part of our aut-resources/sample-data, you get inconsistent results. When running this script: import io.archivesunleashed._
import io.archivesunleashed.matchbox._
RecordLoader.loadArchives("/home/i2millig/aut-resources/Sample-Data/*.gz", sc)
.keepValidPages()
.map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
.saveAsTextFile("plain-text/") We can see the different results. i.e. the ARC, which uses
and the WARC, which is just using
Adding RemoveHttpHeaderThat said, if you wrap the call in import io.archivesunleashed._
import io.archivesunleashed.matchbox._
RecordLoader.loadArchives("/Users/ianmilligan1/dropbox/git/aut-resources/Sample-Data/*.gz", sc)
.keepValidPages()
.map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(RemoveHttpHeader(r.getContentString))))
.saveAsTextFile("plain-text/") the results are now what they should be. WARC
ARC
Solution?We should definitely normalize these two approaches. Given the wide range of use cases and how people have found inventive ways to use the HTTP headers, my recommendation would be to normalize the approach along the lines of having both making the val getContentBytes: Array[Byte] = {
if (recordFormat == ArchiveRecordWritable.ArchiveFormat.ARC)
{
ArcRecordUtils.getContent(r.t.getRecord.asInstanceOf[ARCRecord])
} else {
WarcRecordUtils.getContent(r.t.getRecord.asInstanceOf[WARCRecord])
}
} Thoughts? |
This comment has been minimized.
This comment has been minimized.
Your solution sounds good to me, @ianmilligan1. Perhaps we should also replace |
This comment has been minimized.
This comment has been minimized.
I think I like |
ianmilligan1
added a commit
that referenced
this issue
Jul 30, 2019
ianmilligan1
referenced this issue
Jul 30, 2019
Merged
Make ArchiveRecord.getContentBytes consistent, Resolve #334 #335
ruebot
added a commit
that referenced
this issue
Aug 6, 2019
This comment has been minimized.
This comment has been minimized.
Resolved with 1818596 |
jrwiebe commentedJul 30, 2019
I noticed
ArchiveRecord.getContentBytes
behaves differently on ARC and WARC records. For an ARC it return just the contents of the record, usingArcRecordUtils.getBodyContent(arcRecord)
, but for a WARC it return the "raw" contents -- i.e., including HTTP headers -- usingWarcRecordUtils.getContent(warcRecord)
. I don't see a rationale for this difference; it seems unintentional.@ruebot @lintool