Function to get the status response code and headers of a warc response? #198

dportabella · Apr 20, 2018

Is there a function to get the status response code and headers of a WARC response?

Such as:

RecordLoader.loadArchives(warcFile, sc)
.filter(_.warcResponse.statusCode == 200)
.filter(_.warcResponse.headers.get("Server") == "Apache/2.4.6")

dportabella · Apr 20, 2018

I am using this function at the moment:

import java.io.ByteArrayInputStream
import io.archivesunleashed.spark.archive.io.ArchiveRecord
import org.apache.commons.httpclient.{Header, HttpParser, StatusLine}
import org.apache.commons.io.IOUtils

// Parsed HTTP response stored inside a WARC response record.
case class Response(archiveRecord: ArchiveRecord, statusLine: StatusLine, headers: List[Header], content: Array[Byte])

object WarcUtils {
  def parseResponse(r: ArchiveRecord): Response = {
    // The record content is the raw HTTP message: status line, headers, blank line, body.
    val response = new ByteArrayInputStream(r.getContentBytes)
    // The first line is the status line, e.g. "HTTP/1.1 200 OK".
    val line = HttpParser.readRawLine(response)
    val statusLine = new StatusLine(new String(line))
    // Header lines, up to the blank line that ends the header section.
    val headers = HttpParser.parseHeaders(response, "US-ASCII").toList
    // Whatever remains in the stream is the response body.
    val responseContent: Array[Byte] = IOUtils.toByteArray(response)
    Response(r, statusLine, headers, responseContent)
  }
}

import scala.util.Try

RecordLoader.loadArchives(warcFile, sc)
  .filter(FilterArchive.isHTML)
  .flatMap(r => Try(WarcUtils.parseResponse(r)).toOption)
  .filter(_.statusLine.getStatusCode == 200)
  .filter(_.headers.collectFirst { case h if h.getName == "Server" => h.getValue }.contains("Apache/2.4.6"))

If you tell me how you prefer to refactor this code to fit your library, I can make a pull request if you want.
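
One caveat for the Server filter above: HTTP header names are case-insensitive, so a record that stores the header as "server" would not match a comparison with ==. A minimal sketch of a case-insensitive lookup follows. It assumes the headers have already been converted to plain (name, value) pairs; HeaderLookup and first are hypothetical names, not part of aut or commons-httpclient.

```scala
object HeaderLookup {
  // Hypothetical helper: return the value of the first header whose name
  // matches `name`, ignoring case (HTTP header names are case-insensitive).
  def first(headers: Seq[(String, String)], name: String): Option[String] =
    headers.collectFirst { case (n, v) if n.equalsIgnoreCase(name) => v }
}
```

With the Response class above, the pairs could be built with headers.map(h => (h.getName, h.getValue)).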

dportabella · Apr 20, 2018

However, there is an error parsing the following archive, generated by:
$ wget --warc-file=test https://www.linkedin.com/

Do you know if this 1000 (just before the HTML content) is part of the response header? It looks quite strange to me. My parseResponse function detects this 1000 as part of the HTML content.

Any idea about this?

WARC/1.0
WARC-Type: response
WARC-Record-ID: <urn:uuid:5D495122-1DD9-472B-B619-A7EDADF03070>
WARC-Warcinfo-ID: <urn:uuid:65D43BDC-AC23-4515-A951-6DE201680871>
WARC-Concurrent-To: <urn:uuid:CB15E601-0128-4D95-9436-BDF271F23CBF>
WARC-Target-URI: <https://www.linkedin.com/>
WARC-Date: 2018-04-20T07:57:54Z
WARC-IP-Address: 185.63.145.1
WARC-Block-Digest: sha1:7OCQ4D4PIAWT2CDJ3K7P6HZ4QGBFAYMP
WARC-Payload-Digest: sha1:ANGMHLF4ZVXSWXMWJQLSCNUIRJEWBZLS
Content-Type: application/http;msgtype=response
Content-Length: 46408

HTTP/1.1 200 OK
Date: Fri, 20 Apr 2018 07:57:54 GMT
Content-Type: text/html; charset=utf-8
...
Set-Cookie: lidc="b=VGST04:g=802:u=1:i=1524211008:t=1524297408:s=AQHTwVE0xQkI0A3-ifWSgmSS3EeyGTjx"; Expires=Sat, 21 Apr 2018 07:56:48 GMT; domain=.linkedin.com; Path=/

1000
<!DOCTYPE html>
<!--[if lt IE 7]> <html lang="en" class="ie ie6 lte9 lte8 lte7 os-win"> <![endif]-->
...
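
Note: a likely explanation, assuming the elided headers include Transfer-Encoding: chunked, is that 1000 is a chunk-size line rather than a header. In chunked transfer encoding each chunk of the body is preceded by its length in hexadecimal (1000 hex is 4096 bytes), and the body ends with a zero-length chunk. The sketch below decodes such a body; ChunkedBody and decode are hypothetical names, and trailer headers after the last chunk are ignored.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, EOFException, InputStream}

object ChunkedBody {
  // Read one line as ASCII text, dropping the CR/LF line ending.
  private def readLine(in: InputStream): String = {
    val sb = new StringBuilder
    var b = in.read()
    while (b != -1 && b != '\n') {
      if (b != '\r') sb.append(b.toChar)
      b = in.read()
    }
    sb.toString
  }

  // Hypothetical helper: strip the chunk framing from a chunked HTTP body.
  def decode(body: Array[Byte]): Array[Byte] = {
    val in = new ByteArrayInputStream(body)
    val out = new ByteArrayOutputStream()
    var done = false
    while (!done) {
      // The chunk size is hexadecimal, optionally followed by ";extensions".
      val size = Integer.parseInt(readLine(in).takeWhile(_ != ';').trim, 16)
      if (size == 0) done = true
      else {
        val chunk = new Array[Byte](size)
        var read = 0
        while (read < size) {
          val n = in.read(chunk, read, size - read)
          if (n == -1) throw new EOFException("truncated chunk")
          read += n
        }
        out.write(chunk, 0, size)
        readLine(in) // consume the CRLF that follows the chunk data
      }
    }
    out.toByteArray
  }
}
```

Applied to the content field returned by parseResponse, this would yield the HTML without the 1000 line, but only for records whose headers actually declare chunked encoding.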

greebie · Nov 20, 2018

@dportabella I was going to look at producing something to resolve this issue today. Are you still interested in providing a pull request?

With @lintool & @ianmilligan1 's okay, I would say that this function should be in the matchbox (https://github.com/archivesunleashed/aut/tree/master/src/main/scala/io/archivesunleashed/matchbox) as ExtractHttpResponse or something similar and maybe make .parseResponse be the apply function.

An alternative is to include access to the response code in the ArchiveRecord trait and ArchiveRecordImpl as .getHttpResponse so that it is included in all ArchiveRecords by default. This seems the more user-friendly approach, but we would want to test the effect the function has on overall run time. (It should be nil due to laziness, but as a rule we should check run time anytime we change the ArchiveRecord trait).

I think I prefer the second option. If you are not able to prep a PR at this time, I can take a crack at it.
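
A rough sketch of how the two options could fit together, under stated assumptions: ExtractHttpStatus and HasHttpStatus are hypothetical names, the status is returned as Option[Int] purely for illustration, and getContentBytes is assumed to return the raw HTTP message as in the parseResponse code earlier in this thread. The lazy val is what keeps the cost at zero for jobs that never ask for the status.

```scala
// Option 1 (hypothetical): a matchbox-style object with an apply function.
object ExtractHttpStatus {
  // Matches a status line such as "HTTP/1.1 200 OK" and captures the code.
  private val StatusLine = """^HTTP/\d(?:\.\d)?\s+(\d{3})\b.*""".r

  def apply(content: Array[Byte]): Option[Int] = {
    // Only the first line of the raw HTTP message is needed.
    val firstLine = new String(content.takeWhile(b => b != '\r' && b != '\n'), "US-ASCII")
    firstLine match {
      case StatusLine(code) => Some(code.toInt)
      case _                => None
    }
  }
}

// Option 2 (hypothetical): expose the status on the record itself.
trait HasHttpStatus {
  def getContentBytes: Array[Byte]

  // Evaluated on first access only, so unused records pay nothing.
  lazy val getHttpStatus: Option[Int] = ExtractHttpStatus(getContentBytes)
}
```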

Either way, I am not sure how, but we would want to be sure you were given appropriate credit for this idea and proposed implementation. Maybe @ruebot or @ianmilligan1 knows the best way to make this happen.

Thanks so much for your help!

Ryan

greebie · Nov 21, 2018

Branch issue-198 covers the HTTP response status code, but not the full headers, as I could not get the full header details at this stage. The timings below show the differences in run time.

Using the same warc collection

17.0

Text	Network	Domain
4222	163173	unknown
292	167488	113007
297	164422	114284

17.1 (same script)

Text	Network	Domain
2569	177328	120834
237	160474	112580
227	168222	112792

17.1 (add statusHeader)

(note network script includes an additional map compared to above)

Text	Network	Domain
229	188917	160654
341	202115	124252
247	175594	116306

17.1 (add fileName)

(note network script includes an additional map compared to above)

Text	Network	Domain
230	213184	18282
249	165582	113008
239	180437	123404

tl;dr - there is no effect on the ArchiveRecord / RecordLoader when not using .getHttpStatus or .getFilename and there is minimal effect when using it.

greebie · Nov 21, 2018

This is the code I used to produce the above.

Original code

timed {
  import io.archivesunleashed._
  import io.archivesunleashed.matchbox._

  RecordLoader.loadArchives("/Users/ryandeschamps/warcs/*.gz", sc)
    .keepValidPages()
    .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
    .take(10)
}

timed {
  import io.archivesunleashed._
  import io.archivesunleashed.matchbox._
  import io.archivesunleashed.util._

  val links = RecordLoader.loadArchives("/Users/ryandeschamps/warcs/*gz", sc)
    .keepValidPages()
    .keepContent(Set("apple".r))
    .flatMap(r => ExtractLinks(r.getUrl, r.getContentString))
    .map(r => (ExtractDomain(r._1).removePrefixWWW(), ExtractDomain(r._2).removePrefixWWW()))
    .filter(r => r._1 != "" && r._2 != "")
    .countItems()
    .filter(r => r._2 > 5).take(10)
}

timed {
  import io.archivesunleashed._
  import io.archivesunleashed.matchbox._

  val r = RecordLoader.loadArchives("/Users/ryandeschamps/warcs/*gz", sc)
    .keepValidPages()
    .map(r => ExtractDomain(r.getUrl))
    .countItems()
    .take(10)
}
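
The timed helper used above is not defined anywhere in this thread. A minimal sketch, assuming it simply runs the block and prints the elapsed wall-clock time; the Timing object, the millisecond unit, and the output format are assumptions, not the helper that produced the numbers above.

```scala
object Timing {
  // Hypothetical helper: evaluate `block`, print how long it took, return its result.
  def timed[A](block: => A): A = {
    val start = System.nanoTime()
    val result = block
    val elapsedMs = (System.nanoTime() - start) / 1000000L
    println(s"Elapsed: $elapsedMs ms")
    result
  }
}
```

In a Spark shell, import Timing._ would make timed { ... } available in the form used above.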

add .getHttpStatus

  import io.archivesunleashed._
  import io.archivesunleashed.matchbox._
  import io.archivesunleashed.util._

  val links = RecordLoader.loadArchives("/Users/ryandeschamps/warcs/*gz", sc)
    .keepValidPages()
    .keepContent(Set("apple".r))
    .map(r => (r.getHttpStatus, (ExtractLinks(r.getUrl, r.getContentString))))
    .flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), 
      ExtractDomain(f._2).replaceAll("^\\s*www\\.", ""))))
    .filter(r => r._2 != "" && r._3 != "")
    .countItems()
    .filter(r => r._2 > 5).take(10)

produces

links: Array[((String, String, String), Int)] = Array(((200,nanaimodailynews.com,nanaimodailynews.com),445785),
((200,nanaimodailynews.com,blackpress.ca),188676),
((200,nanaimodailynews.com,bclocalnews.com),111400),
((200,nanaimodailynews.com,drivewaycanada.ca),53796),
((200,nanaimodailynews.com,facebook.com),53500),
((200,nanaimodailynews.com,bcclassified.com),52922),
((200,nanaimodailynews.com,usednanaimo.com),27067),
((200,nanaimodailynews.com,iservices.blackpress.ca),26953),
((200,nanaimodailynews.com,localworkbc.ca),26546),
((200,nanaimodailynews.com,twitter.com),24853))

add .getFilename

  import io.archivesunleashed._
  import io.archivesunleashed.matchbox._
  import io.archivesunleashed.util._

  val links = RecordLoader.loadArchives("/Users/ryandeschamps/warcs/*gz", sc)
    .keepValidPages()
    .keepContent(Set("apple".r))
    .map(r => (r.getFilename, (ExtractLinks(r.getUrl, r.getContentString))))
    .flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""),
      ExtractDomain(f._2).replaceAll("^\\s*www\\.", ""))))
    .filter(r => r._2 != "" && r._3 != "")
    .countItems()
    .filter(r => r._2 > 5).take(10)

produces

links: Array[((String, String, String), Int)] = Array(((file:/Users/ryandeschamps/warcs/ARCHIVEIT-4656-CRAWL_SELECTED_SEEDS-JOB193391-20160127222913427-00000.warc.gz,nanaimodailynews.com,nanaimodailynews.com),439503), 
((file:/Users/ryandeschamps/warcs/ARCHIVEIT-4656-CRAWL_SELECTED_SEEDS-JOB193391-20160127222913427-00000.warc.gz,nanaimodailynews.com,blackpress.ca),186028), 
((file:/Users/ryandeschamps/warcs/ARCHIVEIT-4656-CRAWL_SELECTED_SEEDS-JOB193391-20160127222913427-00000.warc.gz,nanaimodailynews.com,bclocalnews.com),106107), 
((file:/Users/ryandeschamps/warcs/ARCHIVEIT-4656-CRAWL_SELECTED_SEEDS-JOB193391-20160127222913427-00000.warc.gz,nanaimodailynews.com,drivewaycanada.ca),53040),
...

dportabella · Nov 23, 2018

Cool, thanks!

ruebot added the question label Aug 20, 2018

greebie referenced this issue Nov 22, 2018
Merged
Add .getHttpStatus and .getFilename to ArchiveRecordImpl class #198 & #164 #292

ruebot closed this in 7731b6d Nov 28, 2018

archivesunleashed/aut
