Cannot skip bad record while reading warc file #267
Comments
ruebot (Member) commented on Sep 5, 2018
Hi @akshedu, thanks for the report. Can you let us know a little bit more? There should have been a template to help tease out some more information. Can you update this ticket, and provide more context? It will help us get to the root of the issue.
This also sounds like it could be a duplicate of #246, and #258.
akshedu commented on Sep 5, 2018
Hi @ruebot, updated with more details. If you need the warc file I can share it as well.
ruebot (Member) commented on Sep 5, 2018
> Downloaded the aut-0.16.1-SNAPSHOT-fatjar.jar

From where? We don't push snapshot builds. Did you build aut locally, and use the --jars option?

As an aside, the template is there to capture the context that otherwise gets lost, as it has here. At the very least, can you provide your exact steps in this format:
**To Reproduce**
Steps to reproduce the behavior (e.g.):
1. Go to '...'
2. Click on '....'
3. Scroll down to '....'
4. See error
akshedu commented on Sep 5, 2018
Steps to reproduce:

1. Download the aut jar: https://github.com/archivesunleashed/aut/releases/download/aut-0.16.0/aut-0.16.0-fatjar.jar
2. Run spark-shell with the jar file:

   spark-shell --jars ~/Downloads/aut-0.16.0-fatjar.jar

3. Download the warc file: https://www.cse.iitb.ac.in/~soumen/tmp/cw09/00.warc.gz
4. Load the required modules:

   scala> import io.archivesunleashed._
   import io.archivesunleashed._
   scala> import io.archivesunleashed.matchbox._
   import io.archivesunleashed.matchbox._

5. Read the warc file:
scala> val r = RecordLoader.loadArchives("/Users/akshanshgupta/Workspace/00.warc.gz", sc)
r: org.apache.spark.rdd.RDD[io.archivesunleashed.ArchiveRecord] = MapPartitionsRDD[2] at map at package.scala:50
scala> r.take(1)
[Stage 0:> (0 + 1) / 1]2018-09-05 18:38:48 ERROR Executor:91 - Exception in task 0.0 in stage 0.0 (TID 0)
java.io.NotSerializableException: org.archive.io.warc.WARCRecord
Serialization stack:
- object not serializable (class: org.archive.io.warc.WARCRecord, value: org.archive.io.warc.WARCRecord@5dc9ca4a)
- field (class: io.archivesunleashed.ArchiveRecordImpl, name: warcRecord, type: class org.archive.io.warc.WARCRecord)
- object (class io.archivesunleashed.ArchiveRecordImpl, io.archivesunleashed.ArchiveRecordImpl@56d06a37)
- element of array (index: 0)
- array (class [Lio.archivesunleashed.ArchiveRecord;, size 1)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:393)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-09-05 18:38:48 ERROR TaskSetManager:70 - Task 0.0 in stage 0.0 (TID 0) had a not serializable result: org.archive.io.warc.WARCRecord
Serialization stack:
- object not serializable (class: org.archive.io.warc.WARCRecord, value: org.archive.io.warc.WARCRecord@5dc9ca4a)
- field (class: io.archivesunleashed.ArchiveRecordImpl, name: warcRecord, type: class org.archive.io.warc.WARCRecord)
- object (class io.archivesunleashed.ArchiveRecordImpl, io.archivesunleashed.ArchiveRecordImpl@56d06a37)
- element of array (index: 0)
- array (class [Lio.archivesunleashed.ArchiveRecord;, size 1); not retrying
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 in stage 0.0 (TID 0) had a not serializable result: org.archive.io.warc.WARCRecord
Serialization stack:
- object not serializable (class: org.archive.io.warc.WARCRecord, value: org.archive.io.warc.WARCRecord@5dc9ca4a)
- field (class: io.archivesunleashed.ArchiveRecordImpl, name: warcRecord, type: class org.archive.io.warc.WARCRecord)
- object (class io.archivesunleashed.ArchiveRecordImpl, io.archivesunleashed.ArchiveRecordImpl@56d06a37)
- element of array (index: 0)
- array (class [Lio.archivesunleashed.ArchiveRecord;, size 1)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1358)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.take(RDD.scala:1331)
... 53 elided
6. Uncompress the file and try reading again:
scala> val r = RecordLoader.loadArchives("/Users/akshanshgupta/Workspace/00.warc", sc)
r: org.apache.spark.rdd.RDD[io.archivesunleashed.ArchiveRecord] = MapPartitionsRDD[5] at map at package.scala:50
scala> r.take(1)
2018-09-05 18:39:45 WARN ArchiveReader$ArchiveRecordIterator:462 - Trying skip of failed record cleanup of {reader-identifier=file:/Users/akshanshgupta/Workspace/00.warc, absolute-offset=0, WARC-Date=2009-03-65T08:43:19-0800, Content-Length=219, WARC-Record-ID=<urn:uuid:993d3969-9643-4934-b1c6-68d4dbe55b83>, WARC-Type=warcinfo, Content-Type=application/warc-fields}: Unexpected character a(Expecting d)
java.io.IOException: Unexpected character a(Expecting d)
at org.archive.io.warc.WARCReader.readExpectedChar(WARCReader.java:80)
at org.archive.io.warc.WARCReader.gotoEOR(WARCReader.java:68)
at org.archive.io.ArchiveReader.cleanupCurrentRecord(ArchiveReader.java:176)
at org.archive.io.ArchiveReader$ArchiveRecordIterator.hasNext(ArchiveReader.java:449)
at io.archivesunleashed.data.ArchiveRecordInputFormat$ArchiveRecordReader.nextKeyValue(ArchiveRecordInputFormat.java:175)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:214)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:389)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at scala.collection.AbstractIterator.to(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$28.apply(RDD.scala:1358)
at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$28.apply(RDD.scala:1358)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-09-05 18:39:45 WARN ArchiveReader$ArchiveRecordIterator:462 - Trying skip of failed record cleanup of {reader-identifier=file:/Users/akshanshgupta/Workspace/00.warc, absolute-offset=0, WARC-Date=2009-03-65T08:43:19-0800, Content-Length=219, WARC-Record-ID=<urn:uuid:993d3969-9643-4934-b1c6-68d4dbe55b83>, WARC-Type=warcinfo, Content-Type=application/warc-fields}: Unexpected character 41(Expecting d)
java.io.IOException: Unexpected character 41(Expecting d)
at org.archive.io.warc.WARCReader.readExpectedChar(WARCReader.java:80)
at org.archive.io.warc.WARCReader.gotoEOR(WARCReader.java:68)
at org.archive.io.ArchiveReader.cleanupCurrentRecord(ArchiveReader.java:176)
at org.archive.io.ArchiveReader$ArchiveRecordIterator.hasNext(ArchiveReader.java:449)
at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:501)
at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:436)
at io.archivesunleashed.data.ArchiveRecordInputFormat$ArchiveRecordReader.nextKeyValue(ArchiveRecordInputFormat.java:186)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:214)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:389)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at scala.collection.AbstractIterator.to(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$28.apply(RDD.scala:1358)
at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$28.apply(RDD.scala:1358)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-09-05 18:39:45 WARN WARCReaderFactory$UncompressedWARCReader:502 - Bad Record. Trying skip (Record start 409): Unexpected character 57(Expecting d)
res1: Array[io.archivesunleashed.ArchiveRecord] = Array()
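One way to sidestep the first error, sketched here as an assumption rather than something confirmed in the thread: take(1) on the raw RDD has to serialize ArchiveRecordImpl, which wraps a non-serializable org.archive.io.warc.WARCRecord, so pulling out a plain field before collecting avoids shipping the wrapper to the driver. It does nothing about the malformed records that produce the empty result on the uncompressed file.

```scala
// Sketch only: map to a serializable field (a String) before collecting to the driver.
// This sidesteps the NotSerializableException raised by take(1) on raw ArchiveRecords;
// it does not repair the records that are being skipped as malformed.
val firstUrl = r.map(rec => rec.getUrl).take(1)
```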
ruebot (Member) commented on Sep 5, 2018
@akshedu can you try to reproduce with Apache Spark 2.1.3? The 0.16.0 release doesn't officially support Spark 2.3.1.
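A quick way to confirm which versions a given spark-shell session is actually running (a small sketch using standard APIs, not something from the thread):

```scala
// Print the Spark and Scala versions from inside spark-shell.
println(sc.version)                           // Spark version, e.g. 2.1.3
println(scala.util.Properties.versionString)  // Scala version, e.g. version 2.11.8
```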
ianmilligan1 (Member) commented on Sep 5, 2018
I just ran this on Spark 2.1.1. The following script:
import io.archivesunleashed._
import io.archivesunleashed.matchbox._
val r = RecordLoader.loadArchives("/mnt/vol1/data_sets/aut_debug/*.gz", sc)
  .keepValidPages()
  .map(r => ExtractDomain(r.getUrl))
  .countItems()
  .take(10)
led to this error message:
unexpected extra data after record org.archive.io.warc.WARCRecord@790869a4
which, FWIW, is tripped by this in WARCReaderFactory:
protected void gotoEOR(ArchiveRecord rec) throws IOException {
    long skipped = 0;
    while (getIn().read() > -1) {
        skipped++;
    }
    if (skipped > 4) {
        System.err.println("unexpected extra data after record " + rec);
    }
    return;
}
ruebot (Member) commented on Sep 5, 2018
Same here with 2.1.3
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.1.3
/_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_181)
Type in expressions to have them evaluated.
Type :help for more information.
scala> :paste
// Entering paste mode (ctrl-D to finish)
import io.archivesunleashed._
import io.archivesunleashed.matchbox._
val r = RecordLoader.loadArchives("/home/nruest/Downloads/00.warc.gz", sc)
  .keepValidPages()
  .map(r => ExtractDomain(r.getUrl))
  .countItems()
  .take(10)
// Exiting paste mode, now interpreting.
unexpected extra data after record org.archive.io.warc.WARCRecord@6b2cebc5
import io.archivesunleashed._
import io.archivesunleashed.matchbox._
r: Array[(String, Int)] = Array()
ianmilligan1 (Member) commented on Sep 5, 2018
@ruebot and I did a bit of digging into this, using JWAT-Tools and then manually looking at the WARC records themselves.
There are some issues with the WARC file itself. Here are the test results:
$ ./jwattools.sh -el test /mnt/vol1/data_sets/aut_debug/00.warc.gz
Showing errors: true
Validate digest: true
Using relaxed URI validation for ARC URL and WARC Target-URI.
Using 1 thread(s).
Output Thread started.
ThreadPool started.
Queued 1 file(s).
ThreadPool shut down.
Output Thread stopped.
#
# Job summary
#
GZip files: 0
+ Arc: 0
+ Warc: 1
Arc files: 0
Warc files: 0
Errors: 124544
Warnings: 17792
RuntimeErr: 0
Skipped: 0
Time: 00:01:02 (62324 ms.)
TotalBytes: 161.1 mb
AvgBytes: 2.5 mb/s
INVALID: 35582
INVALID_EXPECTED: 71166
REQUIRED_INVALID: 17796
'WARC-Date' header: 17792
'WARC-Date' value: 17792
'WARC-Target-URI' value: 8
'WARC-Warcinfo-ID' value: 35578
Data before WARC version: 17791
Empty lines before WARC version: 17791
Trailing newlines: 17792
We looked into the headers, and here's the WARC header for the broken file:
WARC/0.18
WARC-Type: warcinfo
WARC-Date: 2009-03-65T08:43:19-0800
WARC-Record-ID: <urn:uuid:993d3969-9643-4934-b1c6-68d4dbe55b83>
Content-Type: application/warc-fields
Content-Length: 219
software: Nutch 1.0-dev (modified for clueweb09)
isPartOf: clueweb09-en
description: clueweb09 crawl with WARC output
format: WARC file version 0.18
conformsTo: http://www.archive.org/documents/WarcFileFormat-0.18.html
and here's a working header:
WARC/1.0^M
WARC-Type: warcinfo^M
WARC-Date: 2009-12-18T23:17:27Z^M
WARC-Filename: ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz^M
WARC-Record-ID: <urn:uuid:7bef4ed8-86df-4bd5-a419-95e33c02a667>^M
Content-Type: application/warc-fields^M
Content-Length: 595^M
^M
I have carriage returns turned on here, so we can see that (a) the line endings differ and (b) the WARC-Date format is different. There are similar mismatches throughout the headers.
I'm not an expert on WARCs: I'm not sure if the specification changed dramatically between 0.18 and 1.0, or whether this is an artefact of Nutch or of being compressed/decompressed at some stage.
But since we rely on the webarchive-commons library, it might be worth opening an issue there if you want to continue poking at this; it's probably out of scope for AUT. I did see a similar issue there that might be of help.
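To make the header mismatch concrete, here is a small, self-contained check (an illustration only, not part of aut or webarchive-commons): the WARC-Date in the broken header is not a valid ISO-8601 timestamp, while the one in the working header is.

```scala
import java.time.OffsetDateTime
import scala.util.Try

// The broken header's date has a day of 65 and a zone offset without a colon;
// the working header's date parses cleanly.
val broken  = "2009-03-65T08:43:19-0800"
val working = "2009-12-18T23:17:27Z"

Seq(broken, working).foreach { d =>
  println(s"$d parses as ISO-8601 offset date-time: ${Try(OffsetDateTime.parse(d)).isSuccess}")
}
```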
ianmilligan1 (Member) commented on Sep 6, 2018
So, I did get this to work. Broken WARCs stick in my craw!
See the results of the top ten domains here:
r: Array[(String, Int)] = Array((directory.binarybiz.com,1473), (blog.pennlive.com,1037), (americanhistory.si.edu,931), (businessfinder.mlive.com,876), (bama.edebris.com,812), (basnect.info,754), (cbs5.com,665), (2modern.com,599), (clinicaltrials.gov,506), (dotwhat.net,439))
The WARC is basically all screwed up, with line-endings, etc. (see above)
If you do need to get it to work, however, I used jwattools to decompress and recompress; the recompressed warc.gz file is correct and now works with AUT. Here is the set of commands:
./jwattools.sh decompress /mnt/vol1/data_sets/aut_debug/00.warc.gz
./jwattools.sh compress /mnt/vol1/data_sets/aut_debug/00.warc
That then works, and the re-compression process has fixed the file. Not ideal, but I don't think this dataset is ideal from a WARC compliance standpoint.
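For completeness, a sketch of the follow-up run: the same top-domains recipe used earlier in this thread, pointed at the recompressed file (the path here is illustrative).

```scala
import io.archivesunleashed._
import io.archivesunleashed.matchbox._

// Re-run the top-domains recipe against the jwattools-recompressed file.
RecordLoader.loadArchives("/mnt/vol1/data_sets/aut_debug/00.warc.gz", sc)
  .keepValidPages()
  .map(r => ExtractDomain(r.getUrl))
  .countItems()
  .take(10)
```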
akshedu opened this issue on Sep 5, 2018 and edited it once the same day.
Trying to read a WARC file that has a warcinfo header record results in a read failure. I followed these steps:
Using Spark 2.3.1 with the Scala shell. Downloaded the aut-0.16.1-SNAPSHOT-fatjar.jar and used the --jars option with spark-shell to load the additional functions.
Loaded the required modules:
First tried the compressed file:
Got the following error:
Then tried the uncompressed file:
Got the following error:
Checked the warc file and it looked like this: