AUT exits/dies on java.util.zip.ZipException: invalid distance too far back #246

Closed
ruebot opened this Issue Jul 27, 2018 · 4 comments

ruebot (Member) commented Jul 27, 2018

Describe the bug
Came across this while processing a user's collection on cloud.archivesunleashed.org with aut-0.16.0. The collection appears to contain a couple of problematic ARC files, which throw this error:

2018-07-19 00:48:39,021 [Executor task launch worker for task 5771] INFO  NewHadoopRDD - Input split: file:/data/146/625/warcs/ARCHIVEIT-625-20090319153934-00276-crawling04.us.archive.org.arc.gz:0+103342436
2018-07-19 00:48:40,484 [Executor task launch worker for task 5770] ERROR Executor - Exception in task 1922.0 in stage 3.0 (TID 5770)
java.util.zip.ZipException: invalid distance too far back
	at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
	at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
	at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
	at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
	at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
	at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
	at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
	at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
	at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
	at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
	at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecordImpl.scala:66)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:50)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:50)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2018-07-19 00:48:40,485 [dispatcher-event-loop-2] INFO  TaskSetManager - Starting task 1924.0 in stage 3.0 (TID 5772, localhost, executor driver, partition 1924, PROCESS_LOCAL, 19609 bytes)
2018-07-19 00:48:40,485 [Executor task launch worker for task 5772] INFO  Executor - Running task 1924.0 in stage 3.0 (TID 5772)
2018-07-19 00:48:40,486 [task-result-getter-0] WARN  TaskSetManager - Lost task 1922.0 in stage 3.0 (TID 5770, localhost, executor driver): java.util.zip.ZipException: invalid distance too far back
	at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
	at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
	at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
	at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
	at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
	at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
	at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
	at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
	at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
	at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
	at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecordImpl.scala:66)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:50)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:50)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

2018-07-19 00:48:40,486 [task-result-getter-0] ERROR TaskSetManager - Task 1922 in stage 3.0 failed 1 times; aborting job
2018-07-19 00:48:40,486 [dag-scheduler-event-loop] INFO  TaskSchedulerImpl - Cancelling stage 3
2018-07-19 00:48:40,487 [dag-scheduler-event-loop] INFO  TaskSchedulerImpl - Stage 3 was cancelled
2018-07-19 00:48:40,487 [dispatcher-event-loop-14] INFO  Executor - Executor is trying to kill task 1924.0 in stage 3.0 (TID 5772)
2018-07-19 00:48:40,487 [dispatcher-event-loop-14] INFO  Executor - Executor is trying to kill task 1921.0 in stage 3.0 (TID 5769)
2018-07-19 00:48:40,487 [dispatcher-event-loop-14] INFO  Executor - Executor is trying to kill task 1914.0 in stage 3.0 (TID 5762)
2018-07-19 00:48:40,487 [dag-scheduler-event-loop] INFO  DAGScheduler - ShuffleMapStage 3 (map at package.scala:66) failed in 6445.786 s due to Job aborted due to stage failure: Task 1922 in stage 3.0 failed 1 times, most recent failure: Lost task 1922.0 in stage 3.0 (TID 5770, localhost, executor driver): java.util.zip.ZipException: invalid distance too far back
	at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
	at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
	at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
	at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
	at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
	at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
	at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
	at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
	at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
	at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
	at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecordImpl.scala:66)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:50)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:50)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
2018-07-19 00:48:40,487 [dispatcher-event-loop-14] INFO  Executor - Executor is trying to kill task 1893.0 in stage 3.0 (TID 5741)
2018-07-19 00:48:40,487 [dispatcher-event-loop-14] INFO  Executor - Executor is trying to kill task 1918.0 in stage 3.0 (TID 5766)
2018-07-19 00:48:40,487 [dispatcher-event-loop-14] INFO  Executor - Executor is trying to kill task 1915.0 in stage 3.0 (TID 5763)
2018-07-19 00:48:40,487 [dispatcher-event-loop-14] INFO  Executor - Executor is trying to kill task 1912.0 in stage 3.0 (TID 5760)
2018-07-19 00:48:40,487 [dispatcher-event-loop-14] INFO  Executor - Executor is trying to kill task 1919.0 in stage 3.0 (TID 5767)
2018-07-19 00:48:40,487 [dispatcher-event-loop-14] INFO  Executor - Executor is trying to kill task 1916.0 in stage 3.0 (TID 5764)
2018-07-19 00:48:40,487 [dispatcher-event-loop-14] INFO  Executor - Executor is trying to kill task 1872.0 in stage 3.0 (TID 5720)
2018-07-19 00:48:40,487 [dispatcher-event-loop-14] INFO  Executor - Executor is trying to kill task 1923.0 in stage 3.0 (TID 5771)
2018-07-19 00:48:40,487 [dispatcher-event-loop-14] INFO  Executor - Executor is trying to kill task 1917.0 in stage 3.0 (TID 5765)
2018-07-19 00:48:40,487 [main] INFO  DAGScheduler - Job 2 failed: sortBy at package.scala:68, took 6445.867506 s
2018-07-19 00:48:40,488 [Executor task launch worker for task 5772] INFO  NewHadoopRDD - Input split: file:/data/146/625/warcs/ARCHIVEIT-625-20090319170447-00329-crawling04.us.archive.org.warc.gz:0+100053886
2018-07-19 00:48:40,489 [Executor task launch worker for task 5763] INFO  Executor - Executor killed task 1915.0 in stage 3.0 (TID 5763)
2018-07-19 00:48:40,490 [task-result-getter-2] WARN  TaskSetManager - Lost task 1915.0 in stage 3.0 (TID 5763, localhost, executor driver): TaskKilled (killed intentionally)
2018-07-19 00:48:40,490 [Executor task launch worker for task 5764] INFO  Executor - Executor killed task 1916.0 in stage 3.0 (TID 5764)
2018-07-19 00:48:40,490 [task-result-getter-1] WARN  TaskSetManager - Lost task 1916.0 in stage 3.0 (TID 5764, localhost, executor driver): TaskKilled (killed intentionally)
2018-07-19 00:48:40,491 [Executor task launch worker for task 5765] INFO  Executor - Executor killed task 1917.0 in stage 3.0 (TID 5765)
2018-07-19 00:48:40,491 [task-result-getter-3] WARN  TaskSetManager - Lost task 1917.0 in stage 3.0 (TID 5765, localhost, executor driver): TaskKilled (killed intentionally)
2018-07-19 00:48:40,508 [Executor task launch worker for task 5720] INFO  Executor - Executor killed task 1872.0 in stage 3.0 (TID 5720)
2018-07-19 00:48:40,509 [task-result-getter-0] WARN  TaskSetManager - Lost task 1872.0 in stage 3.0 (TID 5720, localhost, executor driver): TaskKilled (killed intentionally)
2018-07-19 00:48:40,517 [Executor task launch worker for task 5772] INFO  Executor - Executor killed task 1924.0 in stage 3.0 (TID 5772)
2018-07-19 00:48:40,518 [task-result-getter-2] WARN  TaskSetManager - Lost task 1924.0 in stage 3.0 (TID 5772, localhost, executor driver): TaskKilled (killed intentionally)
2018-07-19 00:48:40,532 [Executor task launch worker for task 5771] INFO  Executor - Executor killed task 1923.0 in stage 3.0 (TID 5771)
2018-07-19 00:48:40,532 [task-result-getter-1] WARN  TaskSetManager - Lost task 1923.0 in stage 3.0 (TID 5771, localhost, executor driver): TaskKilled (killed intentionally)
2018-07-19 00:48:40,716 [Executor task launch worker for task 5767] INFO  Executor - Executor killed task 1919.0 in stage 3.0 (TID 5767)
2018-07-19 00:48:40,720 [task-result-getter-3] WARN  TaskSetManager - Lost task 1919.0 in stage 3.0 (TID 5767, localhost, executor driver): TaskKilled (killed intentionally)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1922 in stage 3.0 failed 1 times, most recent failure: Lost task 1922.0 in stage 3.0 (TID 5770, localhost, executor driver): java.util.zip.ZipException: invalid distance too far back
	at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
	at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
	at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
	at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
	at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
	at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
	at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
	at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
	at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
	at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
	at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
	at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecordImpl.scala:66)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:50)
	at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:50)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1925)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1938)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1951)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1965)
  at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
  at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
  at org.apache.spark.RangePartitioner$.sketch(Partitioner.scala:266)
  at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:128)
  at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:62)
  at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:61)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
  at org.apache.spark.rdd.OrderedRDDFunctions.sortByKey(OrderedRDDFunctions.scala:61)
  at org.apache.spark.rdd.RDD$$anonfun$sortBy$1.apply(RDD.scala:619)
  at org.apache.spark.rdd.RDD$$anonfun$sortBy$1.apply(RDD.scala:620)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
  at org.apache.spark.rdd.RDD.sortBy(RDD.scala:617)
  at io.archivesunleashed.package$CountableRDD.countItems(package.scala:68)
  ... 77 elided
Caused by: java.util.zip.ZipException: invalid distance too far back
  at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
  at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
  at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
  at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
  at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
  at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
  at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
  at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
  at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
  at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
  at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
  at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
  at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
  at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
  at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecordImpl.scala:66)
  at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:50)
  at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:50)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
  at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
  at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
  at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
  at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
  at org.apache.spark.scheduler.Task.run(Task.scala:99)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
2018-07-19 00:48:41,130 [Executor task launch worker for task 5766] INFO  Executor - Executor killed task 1918.0 in stage 3.0 (TID 5766)
2018-07-19 00:48:41,131 [task-result-getter-0] WARN  TaskSetManager - Lost task 1918.0 in stage 3.0 (TID 5766, localhost, executor driver): TaskKilled (killed intentionally)
<console>:33: error: not found: value links
             WriteGraphML(links, "/data/146/625/45/derivatives/gephi/625-gephi.graphml")
                          ^
2018-07-19 00:48:41,622 [Thread-1] INFO  SparkContext - Invoking stop() from shutdown hook
2018-07-19 00:48:41,637 [Thread-1] INFO  ServerConnector - Stopped Spark@5625daf1{HTTP/1.1}{0.0.0.0:4040}

To Reproduce
Steps to reproduce the behavior:

      import io.archivesunleashed._
      import io.archivesunleashed.app._
      import io.archivesunleashed.matchbox._
      sc.setLogLevel("INFO")
      // Domain frequency derivative
      RecordLoader.loadArchives("/data/146/625/warcs/*.gz", sc).keepValidPages().map(r => ExtractDomain(r.getUrl)).countItems().saveAsTextFile("/data/146/625/45/derivatives/all-domains/output")
      // Plain-text derivative
      RecordLoader.loadArchives("/data/146/625/warcs/*.gz", sc).keepValidPages().map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString))).saveAsTextFile("/data/146/625/45/derivatives/all-text/output")
      // Hyperlink graph derivative; this assignment failed in the run above,
      // hence the "not found: value links" console error
      val links = RecordLoader.loadArchives("/data/146/625/warcs/*.gz", sc).keepValidPages().map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString))).flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", "")))).filter(r => r._2 != "" && r._3 != "").countItems().filter(r => r._2 > 5)
      WriteGraphML(links, "/data/146/625/45/derivatives/gephi/625-gephi.graphml")
      sys.exit

Expected behavior
I think we should catch this error, log it, and move on.
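For illustration, a minimal sketch of that catch-log-skip behavior in Scala (the helper name is hypothetical, and per the stack trace the real guard belongs where the record bytes are inflated, i.e. in the loader or ArcRecordUtils):

      import java.util.zip.ZipException
      import scala.util.{Failure, Success, Try}

      // Hypothetical helper: evaluate `f`, log any ZipException, and return
      // None so that a flatMap drops the corrupt record instead of letting
      // the exception abort the whole Spark stage.
      def skippingZipErrors[T](f: => T): Option[T] =
        Try(f) match {
          case Success(v) => Some(v)
          case Failure(e: ZipException) =>
            System.err.println(s"Skipping corrupt record: ${e.getMessage}")
            None
          case Failure(e) => throw e
        }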

Additional context
I'll check with the user and see whether it's OK to use one of these files in a test.

tag: @lintool @ianmilligan1

ianmilligan1 (Member) commented Aug 1, 2018

Some more context from GitHub digging: the java.util.zip.ZipException: invalid distance code error was fixed for WarcRecordUtils.java in this commit. Here is the original issue, from back when AUT was Warcbase.

However, we never updated ArcRecordUtils.java to introduce similar error handling for ARC files. It would be great if ArcRecordUtils.java were updated to catch this error.
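For reference, the guard might look roughly like the following; this is sketched in Scala even though ArcRecordUtils itself is Java, and getContentSafely is an invented name, not the existing API:

      import java.util.zip.ZipException
      import org.apache.commons.io.IOUtils
      import org.archive.io.arc.ARCRecord

      // Hypothetical sketch: ARCRecord is an InputStream, so the body copy
      // that currently throws (see the trace) can be wrapped, logged, and
      // replaced with an empty payload so processing continues.
      def getContentSafely(record: ARCRecord): Array[Byte] =
        try {
          IOUtils.toByteArray(record)
        } catch {
          case e: ZipException =>
            System.err.println(
              s"Corrupt gzip data in ${record.getHeader.getUrl}: ${e.getMessage}")
            Array.emptyByteArray
        }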

greebie (Contributor) commented Sep 17, 2018

Possibly related: #249 removed some try-catch calls checking for IOExceptions, favoring an Option approach. The justification (discussed in #212) was that IOExceptions should be caught in the ArchiveRecord class instead of being managed inside every string manipulation function.

However, this is a ZipException, so I do not think the problems are the same. (You can see in #249 that I avoided adding Options to the ArchiveRecord class because it would require refactoring.)
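To make the contrast concrete, here is a hypothetical illustration of the two styles (the helper names are invented):

      import scala.util.Try

      // Pre-#249 style: swallow failures inside each string helper.
      def extractTitleOld(html: String): String =
        try html.split("<title>")(1).split("</title>")(0)
        catch { case _: Exception => "" }

      // #249 style: return an Option and let callers decide; stream-level
      // IOExceptions are left to the ArchiveRecord layer.
      def extractTitle(html: String): Option[String] =
        Try(html.split("<title>")(1).split("</title>")(0)).toOption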

ruebot (Member) commented Sep 17, 2018

@greebie #249 was merged well after 0.16.0 was released, and this ticket specifically notes that the error comes up in 0.16.0.

greebie (Contributor) commented Sep 17, 2018

Thanks @ruebot. Just wanted to be sure I did not break things.
