Inconsistency in ArchiveRecord.getContentBytes #334

Closed
jrwiebe opened this issue Jul 30, 2019 · 5 comments

Comments

@jrwiebe
Contributor

commented Jul 30, 2019

I noticed ArchiveRecord.getContentBytes behaves differently on ARC and WARC records. For an ARC it returns just the contents of the record, using ArcRecordUtils.getBodyContent(arcRecord), but for a WARC it returns the "raw" contents (i.e., including HTTP headers), using WarcRecordUtils.getContent(warcRecord). I don't see a rationale for this difference; it seems unintentional.

@ruebot @lintool

@ianmilligan1

Member

commented Jul 30, 2019

Seems like an unintentional difference to me. Agreed that both should be the same...

@ianmilligan1 ianmilligan1 self-assigned this Jul 30, 2019

@ianmilligan1

Member

commented Jul 30, 2019

This is a good find, @jrwiebe. We haven't run into this issue when working with the Archives Unleashed Cloud because we wrap our text extractor job there in RemoveHttpHeader, which basically does the same thing.

Running Example Script on an ARC and WARC

But if you run the plain-text extraction on a combination of ARCs and WARCs (i.e., the ARC and WARC bundled as part of our aut-resources/sample-data), you get inconsistent results.

When running this script:

import io.archivesunleashed._
import io.archivesunleashed.matchbox._

RecordLoader.loadArchives("/home/i2millig/aut-resources/Sample-Data/*.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
  .saveAsTextFile("plain-text/")

We can see the different results, i.e., the ARC, which uses getBodyContent:

(20060622,www.gca.ca,http://www.gca.ca/indexcms/?organizations&orgid=27,Green Communities Canada | Our Member Organizations   Home About Our Members Our Member Organizations Search Member Organizations About Green Communities Canada News and Events Our Programs Join Gre

and the WARC, which just uses getContent:

(20091218,www.equalvoice.ca,http://www.equalvoice.ca/images/images/french/js/images/sponsors/enbridge.jpg,HTTP/1.1 200 OK Connection: close Date: Fri, 18 Dec 2009 23:17:29 GMT Server: Microsoft-IIS/6.0 PICS-Label: (PICS-1.0 "http://www.rsac.org/ratingsv01.html" l by "info@xynapse.ca" on "2004.10.12T16:51-0400" exp "2010.10.12T12:00-0400" r (v 0 s 0 n 0 l 0)) X-Powered-By: ASP.NET Content-Language: en-CA Content-Type: text/html; charset=UTF-8 Equal Voice  Equal Voice HOME | ENGLISH | FRENCH RSS SUBSCRIBE About Us Mission...

Adding RemoveHttpHeader

That said, if you wrap the call in RemoveHttpHeader as per the following script

import io.archivesunleashed._
import io.archivesunleashed.matchbox._

RecordLoader.loadArchives("/Users/ianmilligan1/dropbox/git/aut-resources/Sample-Data/*.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(RemoveHttpHeader(r.getContentString))))
  .saveAsTextFile("plain-text/")

the results are now what they should be.

WARC

(20091218,www.equalvoice.ca,http://www.equalvoice.ca/images/images/french/js/images/sponsors/enbridge.jpg,Equal Voice  Equal Voice HOME | ENGLISH | FRENCH RSS SUBSCRIBE About Us Mission...

ARC

(20060622,www.gca.ca,http://www.gca.ca/indexcms/?organizations&orgid=27,Green Communities Canada | Our Member Organizations   Home About Our Members Our Member Organizations Search Member Organizations About Green Communities Canada News and Events Our Programs Join Gre
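
For anyone following along, here is a rough, self-contained sketch of the kind of header stripping that makes the two outputs line up. It is not the library's RemoveHttpHeader code: `HeaderSketch` and `stripHttpHeader` are hypothetical names, and the sketch assumes the header block ends at the first blank line (CRLF CRLF).

```scala
object HeaderSketch {
  // Assumption: an HTTP header block is terminated by an empty line (CRLF CRLF).
  private val HeaderEnd = "\r\n\r\n"

  // Hypothetical helper for illustration only; not the library's RemoveHttpHeader.
  // Drops the header block when the content starts with an HTTP status line,
  // and returns the content unchanged otherwise.
  def stripHttpHeader(content: String): String = {
    val end = content.indexOf(HeaderEnd)
    if (content.startsWith("HTTP/") && end >= 0) {
      content.substring(end + HeaderEnd.length)
    } else {
      content
    }
  }
}
```

Under these assumptions, content that already has no headers (the ARC case above) passes through untouched, which would explain why wrapping both record types gives consistent output.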

Solution?

We should definitely normalize these two approaches.

Given the wide range of use cases and how people have found inventive ways to use the HTTP headers, my recommendation would be to normalize the approach by having both formats make the getContent call, i.e. changing ArchiveRecord (https://github.com/archivesunleashed/aut/blob/master/src/main/scala/io/archivesunleashed/ArchiveRecord.scala) to

  val getContentBytes: Array[Byte] = {
    if (recordFormat == ArchiveRecordWritable.ArchiveFormat.ARC) {
      ArcRecordUtils.getContent(r.t.getRecord.asInstanceOf[ARCRecord])
    } else {
      WarcRecordUtils.getContent(r.t.getRecord.asInstanceOf[WARCRecord])
    }
  }

Thoughts?

@jrwiebe

Contributor Author

commented Jul 30, 2019

Your solution sounds good to me, @ianmilligan1.

Perhaps we should also replace getImageBytes with something like getContentBodyBytes, which consists of the top part of the current method's if-block.
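
To make the two-accessor idea concrete, here is a minimal sketch under stated assumptions. `SketchRecord` is a hypothetical stand-in, not the real ArchiveRecord, and the body split on the first CRLF CRLF is a simplification for illustration rather than what ArcRecordUtils or WarcRecordUtils actually do.

```scala
import java.nio.charset.StandardCharsets.ISO_8859_1

// Hypothetical stand-in for a record; the real class wraps ARCRecord/WARCRecord.
final class SketchRecord(val format: String, raw: Array[Byte]) {
  // Raw content, HTTP headers included, with no branching on the format.
  val getContentBytes: Array[Byte] = raw

  // Body only: everything after the first blank line (assumed CRLF CRLF).
  // Falls back to the raw bytes when no header terminator is found.
  val getContentBodyBytes: Array[Byte] = {
    val sep = "\r\n\r\n".getBytes(ISO_8859_1)
    val idx = raw.toSeq.indexOfSlice(sep.toSeq)
    if (idx >= 0) raw.drop(idx + sep.length) else raw
  }
}
```

With this shape, callers who want headers use getContentBytes and callers who want only the payload (images, for instance) use getContentBodyBytes, and both behave the same for either format.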

@ruebot

Member

commented Jul 30, 2019

I think I like getContentBodyBytes over getBinaryBytes as a name 😄

f88c59a

@ruebot

Member

commented Aug 6, 2019

Resolved with 1818596

@ruebot ruebot closed this Aug 6, 2019
