Inconsistency in ArchiveRecord.getContentBytes #334

jrwiebe · Jul 30, 2019

I noticed ArchiveRecord.getContentBytes behaves differently on ARC and WARC records. For an ARC it returns just the contents of the record, using ArcRecordUtils.getBodyContent(arcRecord), but for a WARC it returns the "raw" contents (i.e., including HTTP headers), using WarcRecordUtils.getContent(warcRecord). I don't see a rationale for this difference; it seems unintentional.

@ruebot @lintool

ianmilligan1 · Jul 30, 2019

Seems like an unintentional difference to me. Agreed that both should be the same...

ianmilligan1 · Jul 30, 2019

This is a good find, @jrwiebe. We haven't run into this issue when working with the Archives Unleashed Cloud because we wrap our text extractor job there in RemoveHttpHeader, which basically does the same thing.

Running Example Script on an ARC and WARC

But if you run the plain-text extraction on a combination of ARCs and WARCs (i.e. the ARC and WARC bundled as part of our aut-resources/sample-data), you get inconsistent results.

When running this script:

import io.archivesunleashed._
import io.archivesunleashed.matchbox._

RecordLoader.loadArchives("/home/i2millig/aut-resources/Sample-Data/*.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
  .saveAsTextFile("plain-text/")

We can see the different results, i.e. the ARC, which uses getBodyContent:

(20060622,www.gca.ca,http://www.gca.ca/indexcms/?organizations&orgid=27,Green Communities Canada | Our Member Organizations Home About Our Members Our Member Organizations Search Member Organizations About Green Communities Canada News and Events Our Programs Join Gre

and the WARC, which is just using getContent

(20091218,www.equalvoice.ca,http://www.equalvoice.ca/images/images/french/js/images/sponsors/enbridge.jpg,HTTP/1.1 200 OK Connection: close Date: Fri, 18 Dec 2009 23:17:29 GMT Server: Microsoft-IIS/6.0 PICS-Label: (PICS-1.0 "http://www.rsac.org/ratingsv01.html" l by "info@xynapse.ca" on "2004.10.12T16:51-0400" exp "2010.10.12T12:00-0400" r (v 0 s 0 n 0 l 0)) X-Powered-By: ASP.NET Content-Language: en-CA Content-Type: text/html; charset=UTF-8 Equal Voice Equal Voice HOME | ENGLISH | FRENCH RSS SUBSCRIBE About Us Mission...
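One quick way to confirm which records still carry headers is to test whether the content begins with an HTTP status line. The following is only a sketch: `HeaderCheck` and `hasHttpHeader` are hypothetical names, not part of aut, and the check assumes the header, when present, sits at the very start of the content.

```scala
// Hypothetical diagnostic helper, not part of aut.
object HeaderCheck {
  // Matches the start of a status line such as "HTTP/1.1 200 OK".
  private val StatusLine = """HTTP/\d\.\d \d{3}""".r

  // True when the content begins with an HTTP status line, i.e. when the
  // record content still includes the HTTP response header.
  def hasHttpHeader(content: String): Boolean =
    StatusLine.findPrefixOf(content).isDefined
}
```

Mapped over the records from the script above, for example with `.map(r => (r.getUrl, HeaderCheck.hasHttpHeader(r.getContentString)))`, this should flag the WARC records and not the ARC ones, given the behaviour described in this issue.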

Adding RemoveHttpHeader

That said, if you wrap the call in RemoveHttpHeader as per the following script

import io.archivesunleashed._
import io.archivesunleashed.matchbox._

RecordLoader.loadArchives("/Users/ianmilligan1/dropbox/git/aut-resources/Sample-Data/*.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(RemoveHttpHeader(r.getContentString))))
  .saveAsTextFile("plain-text/")

the results are now what they should be.

WARC

(20091218,www.equalvoice.ca,http://www.equalvoice.ca/images/images/french/js/images/sponsors/enbridge.jpg,Equal Voice Equal Voice HOME | ENGLISH | FRENCH RSS SUBSCRIBE About Us Mission...

ARC

(20060622,www.gca.ca,http://www.gca.ca/indexcms/?organizations&orgid=27,Green Communities Canada | Our Member Organizations Home About Our Members Our Member Organizations Search Member Organizations About Green Communities Canada News and Events Our Programs Join Gre
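The wrapper helps with mixed input because header stripping can be written so that content without a header passes through unchanged. A minimal sketch of that idea follows; `StripHeader` is a hypothetical stand-in, and aut's actual RemoveHttpHeader may be implemented differently.

```scala
// Hypothetical stand-in for the idea behind RemoveHttpHeader;
// not aut's actual implementation.
object StripHeader {
  // An HTTP header block ends at the first blank line (CRLF CRLF).
  private val HeaderEnd = "\r\n\r\n"

  def apply(content: String): String = {
    val end = content.indexOf(HeaderEnd)
    // Strip only when the content starts with an HTTP status line and a
    // header terminator was found; otherwise return the input unchanged.
    if (content.startsWith("HTTP/") && end >= 0) {
      content.substring(end + HeaderEnd.length)
    } else {
      content
    }
  }
}
```

Because the function leaves header-free content alone, applying it to both ARC and WARC records (or applying it twice) gives the same body text either way.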

Solution?

We should definitely normalize these two approaches.

Given the wide range of use cases and how people have found inventive ways to use the HTTP headers, my recommendation would be to normalize the approach by having both formats make the getContent call, i.e. changing ArchiveRecord (https://github.com/archivesunleashed/aut/blob/master/src/main/scala/io/archivesunleashed/ArchiveRecord.scala) to

  val getContentBytes: Array[Byte] = {
    if (recordFormat == ArchiveRecordWritable.ArchiveFormat.ARC)
    {
      ArcRecordUtils.getContent(r.t.getRecord.asInstanceOf[ARCRecord])
    } else {
      WarcRecordUtils.getContent(r.t.getRecord.asInstanceOf[WARCRecord])
    }
  }

Thoughts?

jrwiebe · Jul 30, 2019

Your solution sounds good to me, @ianmilligan1.

Perhaps we should also replace getImageBytes with something like getContentBodyBytes, which consists of the top part of the current method's if-block.
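To make the proposed split concrete, here is a rough model of a record that exposes the raw bytes and derives the body from them. It is a sketch under assumptions: `RawRecord` and `HttpMarkers` are hypothetical, only the two accessor names come from this thread, and the real method would presumably go through ArcRecordUtils and WarcRecordUtils instead of scanning bytes itself. It works on bytes because decoding binary content such as images to a string first could corrupt it.

```scala
import java.nio.charset.StandardCharsets

// Hypothetical constants for locating the HTTP header in raw bytes.
object HttpMarkers {
  val StatusPrefix: Seq[Byte] = "HTTP/".getBytes(StandardCharsets.US_ASCII).toSeq
  val HeaderEnd: Seq[Byte] = "\r\n\r\n".getBytes(StandardCharsets.US_ASCII).toSeq
}

// Hypothetical model of a record: getContentBytes is the raw content
// (headers included), getContentBodyBytes is the body only.
final class RawRecord(val getContentBytes: Array[Byte]) {
  lazy val getContentBodyBytes: Array[Byte] = {
    val bytes = getContentBytes.toSeq
    val end = bytes.indexOfSlice(HttpMarkers.HeaderEnd)
    // Drop the header only when the content starts with an HTTP status line
    // and the blank line ending the header was found.
    if (bytes.startsWith(HttpMarkers.StatusPrefix) && end >= 0) {
      getContentBytes.drop(end + HttpMarkers.HeaderEnd.length)
    } else {
      getContentBytes
    }
  }
}
```

With this shape, callers who want the HTTP headers use getContentBytes, and callers who want only the payload (for example image extraction) use getContentBodyBytes.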

ruebot · Jul 30, 2019

I think I like getContentBodyBytes over getBinaryBytes as a name 😄

f88c59a

ianmilligan1 self-assigned this Jul 30, 2019

ianmilligan1 added a commit that referenced this issue Jul 30, 2019

Make ArchiveRecord.getContentBytes consistent,#334

f2bfa3b

ianmilligan1 referenced this issue Jul 30, 2019
Open
Make ArchiveRecord.getContentBytes consistent, Resolve #334 #335

archivesunleashed/aut
