AUT exits/dies on java.util.zip.ZipException: too many length or distance symbols #271
Comments
ruebot added the bug and URA-Task labels on Sep 18, 2018
ruebot
Sep 18, 2018
Member
Same thing on text extraction now, at the same point (same file):
RecordLoader.loadArchives("/data/139/499/warcs/*.gz", sc).keepValidPages().map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString))).saveAsTextFile("/data/139/499/60/derivatives/all-text/output")
2018-09-18 22:45:30,236 [Executor task launch worker for task 34541] INFO NewHadoopRDD - Input split: file:/data/139/499/warcs/ARCHIVEIT-499-MONTLIB-MTGOV-20081223114925-01486-crawling10.us.archive.org.arc.gz:0+130144728
2018-09-18 22:45:30,305 [Executor task launch worker for task 34541] INFO FileOutputCommitter - File Output Committer Algorithm version is 1
2018-09-18 22:45:31,051 [Executor task launch worker for task 34541] ERROR Utils - Aborting task
java.util.zip.ZipException: too many length or distance symbols
at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecordImpl.scala:66)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:50)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:50)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply$mcV$sp(PairRDDFunctions.scala:1210)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply(PairRDDFunctions.scala:1210)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply(PairRDDFunctions.scala:1210)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1356)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1218)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1197)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:100)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-09-18 22:45:31,053 [Executor task launch worker for task 34541] ERROR Executor - Exception in task 17269.0 in stage 2.0 (TID 34541)
java.util.zip.ZipException: too many length or distance symbols
at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecordImpl.scala:66)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:50)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:50)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply$mcV$sp(PairRDDFunctions.scala:1210)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply(PairRDDFunctions.scala:1210)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply(PairRDDFunctions.scala:1210)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1356)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1218)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1197)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:100)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-09-18 22:45:31,061 [dispatcher-event-loop-8] INFO TaskSetManager - Starting task 17270.0 in stage 2.0 (TID 34542, localhost, executor driver, partition 17270, PROCESS_LOCAL, 19621 bytes)
2018-09-18 22:45:31,061 [Executor task launch worker for task 34542] INFO Executor - Running task 17270.0 in stage 2.0 (TID 34542)
2018-09-18 22:45:31,062 [task-result-getter-0] WARN TaskSetManager - Lost task 17269.0 in stage 2.0 (TID 34541, localhost, executor driver): java.util.zip.ZipException: too many length or distance symbols
at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecordImpl.scala:66)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:50)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:50)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply$mcV$sp(PairRDDFunctions.scala:1210)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply(PairRDDFunctions.scala:1210)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply(PairRDDFunctions.scala:1210)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1356)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1218)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1197)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:100)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-09-18 22:45:31,062 [task-result-getter-0] ERROR TaskSetManager - Task 17269 in stage 2.0 failed 1 times; aborting job
ianmilligan1
Sep 19, 2018
Member
Thanks for the update @ruebot! Could you put the file on rho - I’d love to poke at it tomorrow afternoon.
ianmilligan1
Sep 19, 2018
Member
Individually, the WARC is fine - have been able to extract domains and plain text from it. Hmm.
ianmilligan1
Sep 19, 2018
Member
I see in the second fail message that it's actually a different file: file:/data/139/499/warcs/ARCHIVEIT-499-MONTLIB-MTGOV-20081223114925-01486-crawling10.us.archive.org.arc.gz
borislin
Sep 27, 2018
Collaborator
@ruebot Could you help put the file on tuna? I'm looking into this issue.
ianmilligan1
Sep 27, 2018
Member
@borislin I put ARCHIVEIT-499-MONTLIB-MTGOV-webteam-www.20071208015549.arc.gz in /tuna1/scratch/i2milligan. I think you've got permissions to grab the file from there but let me know if you run into trouble.

@ruebot if you have the file handy could you move ARCHIVEIT-499-MONTLIB-MTGOV-20081223114925-01486-crawling10.us.archive.org.arc.gz to tuna as well?

That said, with ARCHIVEIT-499-MONTLIB-MTGOV-webteam-www.20071208015549.arc.gz I was not able to individually reproduce the problem. As noted above it's similar to #246 we think.
ruebot
Sep 27, 2018
Member
I'm moving all the problem files over right now. It's taking some time, and I'll let y'all know when I'm done.
ruebot
Sep 27, 2018
Member
/home/ruestn/499-issues

There are 54 files there. They are a mix of files, including:
- invalid gzip
- incomplete downloads
- files that cause the error raised in this issue
- files that cause the error raised in #246
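As an aside, a minimal pre-screening sketch (not aut functionality; the directory and helper name are only illustrative) could help separate the invalid or truncated gzips from the files that actually trigger this issue. It simply tries to inflate each *.gz to the end, so non-gzip files and early-truncated downloads fail here before ever reaching RecordLoader.loadArchives():

```scala
// Hypothetical pre-screen, not part of aut: fails on files that are not
// gzip at all or are truncated early in the stream.
import java.io.{BufferedInputStream, File, FileInputStream}
import java.util.zip.GZIPInputStream
import scala.util.Try

def gzipReadable(f: File): Boolean = Try {
  val in = new GZIPInputStream(new BufferedInputStream(new FileInputStream(f)))
  try {
    val buf = new Array[Byte](8192)
    while (in.read(buf) != -1) {} // ZipException/EOFException on bad or truncated data
  } finally {
    in.close()
  }
}.isSuccess

val files = new File("/home/ruestn/499-issues").listFiles.filter(_.getName.endsWith(".gz"))
val (ok, bad) = files.partition(gzipReadable)
bad.foreach(f => println(s"problem archive: ${f.getName}"))
```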
borislin
Sep 28, 2018
Collaborator
What are the known files that are causing this ZipException issue? Only ARCHIVEIT-499-MONTLIB-MTGOV-webteam-www.20071208015549.arc.gz and ARCHIVEIT-499-MONTLIB-MTGOV-20081223114925-01486-crawling10.us.archive.org.arc.gz?
ruebot
Sep 28, 2018
Member
Sounds right to me. Those are the ones in the error log. There is a link to a gist above too if you want to double check.
borislin
Sep 30, 2018
Collaborator
I also can't reproduce the error for ARCHIVEIT-499-MONTLIB-MTGOV-webteam-www.20071208015549.arc.gz. Do you still see the exception on your end when you run aut on this file?

For ARCHIVEIT-499-MONTLIB-MTGOV-20081223114925-01486-crawling10.us.archive.org.arc.gz, I got a warning instead of an error. No exception was actually thrown from https://github.com/archivesunleashed/aut/blob/master/src/main/java/io/archivesunleashed/data/ArchiveRecordInputFormat.java#L175 and https://github.com/archivesunleashed/aut/blob/master/src/main/java/io/archivesunleashed/data/ArchiveRecordInputFormat.java#L186. The logger just logged the exception as a warning and the program proceeded. Do you see the same behaviour on your end? My aut version is the current master branch.
The command I've used is:
b25lin@tuna:~/aut$ /home/b25lin/spark-2.1.3-bin-hadoop2.7/bin/spark-shell --master local[12] --driver-memory 105G --conf spark.network.timeout=100000000 --conf spark.executor.heartbeatInterval=6000s --conf spark.driver.maxResultSize=10G --jars "/tuna1/scratch/borislin/aut/target/aut-0.16.1-SNAPSHOT-fatjar.jar" -i /home/b25lin/spark_jobs/499.scala > 499.log
2018-09-30 16:15:10,647 [Executor task launch worker for task 4] WARN ARCReaderFactory$CompressedARCReader$1 - Trying skip of failed record cleanup of file:/home/ruestn/499-issues/ARCHIVEIT-499-MONTLIB-MTGOV-20081223114925-01486-crawling10.us.archive.org.arc.gz: {subject-uri=http://nris.mt.gov/nsdi/data/doqq/spc/tif/d48106/d4810666SW.tif, ip-address=161.7.9.212, origin=, length=41799444, absolute-offset=29096, creation-date=20081223114930, content-type=image/tiff, version=null}: too many length or distance symbols
java.util.zip.ZipException: too many length or distance symbols
at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
at org.archive.io.ArchiveRecord.skip(ArchiveRecord.java:262)
at org.archive.io.ArchiveRecord.skip(ArchiveRecord.java:248)
at org.archive.io.ArchiveRecord.close(ArchiveRecord.java:172)
at org.archive.io.ArchiveReader.cleanupCurrentRecord(ArchiveReader.java:175)
at org.archive.io.ArchiveReader$ArchiveRecordIterator.hasNext(ArchiveReader.java:449)
at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:501)
at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:436)
at io.archivesunleashed.data.ArchiveRecordInputFormat$ArchiveRecordReader.nextKeyValue(ArchiveRecordInputFormat.java:195)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:199)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:100)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
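For reference, the warning-instead-of-failure behaviour described above boils down to a catch-and-log pattern around record iteration. A simplified sketch of that idea follows; it is not the actual ArchiveRecordInputFormat code, and the function name and exception handling are only illustrative:

```scala
// Simplified sketch of the catch-and-warn idea; the real handling lives in
// io.archivesunleashed.data.ArchiveRecordInputFormat and differs in detail.
import java.util.zip.ZipException
import org.apache.log4j.Logger
import org.archive.io.ArchiveRecord

def nextReadableRecord(iter: java.util.Iterator[ArchiveRecord], log: Logger): Option[ArchiveRecord] =
  try {
    if (iter.hasNext) Some(iter.next()) else None
  } catch {
    case e: ZipException =>
      // A bad gzip member surfaces while iterating; warn and move on
      // instead of letting the exception abort the Spark task.
      log.warn("Skipping unreadable record: " + e.getMessage)
      None
    case e: RuntimeException =>
      log.warn("Skipping record after runtime failure: " + e.getMessage)
      None
  }
```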
@ruebot HEAD
ruebot
Sep 30, 2018
Member
Can you do the same testing with HEAD, but use all the files in /home/ruestn/499-issues? That should be a good test of this one and #246 on HEAD. If we get warnings on all of them, then things seem to have resolved themselves between the 0.16.0 release and now. And, if that's the case, I'll work on cutting a new release. If things die out, then we need to address those.
borislin
Oct 1, 2018
Collaborator
@ruebot I've done the same testing on all the files in /home/ruestn/499-issues. Only ARCHIVEIT-499-BIMONTHLY-5528-20131008090852109-00757-wbgrp-crawl053.us.archive.org-6443.warc.gz produces an EOFException, because that file is empty and we currently do not catch and handle this exception. All other files produce only warnings.
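Until that EOFException is handled inside aut, a small pre-check could set the empty archive aside before the job runs. A sketch under that assumption (the directory is the one from this thread; nothing below is aut API):

```scala
// Sketch: flag zero-length *.gz files up front so an empty archive never
// reaches the record reader and raises an EOFException mid-job.
import java.io.File

val (empty, usable) = new File("/home/ruestn/499-issues")
  .listFiles
  .filter(_.getName.endsWith(".gz"))
  .partition(_.length == 0L)

empty.foreach(f => println(s"empty archive, set aside: ${f.getName}"))

// Spark's newAPIHadoopFile (used by RecordLoader.loadArchives) accepts a
// comma-separated path list, so the usable files can then be passed directly:
// RecordLoader.loadArchives(usable.map(_.getAbsolutePath).mkString(","), sc)
```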
Thanks! I'll carve out some time today, and try and replicate.
ruebot
Oct 2, 2018
Member
@borislin can you gist up your output log?
This is what I just ran on my end with all the 499-issues arcs/warcs:
/home/nruest/bin/spark-2.3.1-bin-hadoop2.7/bin/spark-shell --master local\[10\] --driver-memory 30G --conf spark.network.timeout=100000000 --conf spark.executor.heartbeatInterval=6000s --conf spark.driver.maxResultSize=10G --jars /home/nruest/git/aut/target/aut-0.16.1-SNAPSHOT-fatjar.jar -i /home/nruest/Dropbox/499-issues/spark-jobs/499.scala | tee /home/nruest/Dropbox/499-issues/spark-jobs/499.scala.log
(I use zsh, so I have to escape those brackets.)
2018-10-02 09:42:43 WARN Utils:66 - Your hostname, wombat resolves to a loopback address: 127.0.1.1; using 10.0.1.44 instead (on interface enp0s31f6)
2018-10-02 09:42:43 WARN Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
2018-10-02 09:42:43 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://10.0.1.44:4040
Spark context available as 'sc' (master = local[10], app id = local-1538487767122).
Spark session available as 'spark'.
Loading /home/nruest/Dropbox/499-issues/spark-jobs/499.scala...
import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._
2018-10-02 09:42:49 INFO MemoryStore:54 - Block broadcast_0 stored as values in memory (estimated size 275.4 KB, free 15.8 GB)
2018-10-02 09:42:49 INFO MemoryStore:54 - Block broadcast_0_piece0 stored as bytes in memory (estimated size 23.0 KB, free 15.8 GB)
2018-10-02 09:42:49 INFO BlockManagerInfo:54 - Added broadcast_0_piece0 in memory on 10.0.1.44:42415 (size: 23.0 KB, free: 15.8 GB)
2018-10-02 09:42:49 INFO SparkContext:54 - Created broadcast 0 from newAPIHadoopFile at package.scala:51
2018-10-02 09:42:49 INFO FileInputFormat:283 - Total input paths to process : 54
2018-10-02 09:42:49 INFO SparkContext:54 - Starting job: sortBy at package.scala:74
2018-10-02 09:42:49 INFO DAGScheduler:54 - Registering RDD 5 (map at package.scala:72)
2018-10-02 09:42:49 INFO DAGScheduler:54 - Got job 0 (sortBy at package.scala:74) with 54 output partitions
2018-10-02 09:42:49 INFO DAGScheduler:54 - Final stage: ResultStage 1 (sortBy at package.scala:74)
2018-10-02 09:42:49 INFO DAGScheduler:54 - Parents of final stage: List(ShuffleMapStage 0)
2018-10-02 09:42:49 INFO DAGScheduler:54 - Missing parents: List(ShuffleMapStage 0)
2018-10-02 09:42:49 INFO DAGScheduler:54 - Submitting ShuffleMapStage 0 (MapPartitionsRDD[5] at map at package.scala:72), which has no missing parents
2018-10-02 09:42:49 INFO MemoryStore:54 - Block broadcast_1 stored as values in memory (estimated size 4.6 KB, free 15.8 GB)
2018-10-02 09:42:49 INFO MemoryStore:54 - Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.6 KB, free 15.8 GB)
2018-10-02 09:42:49 INFO BlockManagerInfo:54 - Added broadcast_1_piece0 in memory on 10.0.1.44:42415 (size: 2.6 KB, free: 15.8 GB)
2018-10-02 09:42:49 INFO SparkContext:54 - Created broadcast 1 from broadcast at DAGScheduler.scala:1039
2018-10-02 09:42:49 INFO DAGScheduler:54 - Submitting 54 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[5] at map at package.scala:72) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
2018-10-02 09:42:49 INFO TaskSchedulerImpl:54 - Adding task set 0.0 with 54 tasks
2018-10-02 09:42:49 INFO TaskSetManager:54 - Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 7978 bytes)
2018-10-02 09:42:49 INFO TaskSetManager:54 - Starting task 1.0 in stage 0.0 (TID 1, localhost, executor driver, partition 1, PROCESS_LOCAL, 8009 bytes)
2018-10-02 09:42:49 INFO TaskSetManager:54 - Starting task 2.0 in stage 0.0 (TID 2, localhost, executor driver, partition 2, PROCESS_LOCAL, 7976 bytes)
2018-10-02 09:42:49 INFO TaskSetManager:54 - Starting task 3.0 in stage 0.0 (TID 3, localhost, executor driver, partition 3, PROCESS_LOCAL, 7980 bytes)
2018-10-02 09:42:49 INFO TaskSetManager:54 - Starting task 4.0 in stage 0.0 (TID 4, localhost, executor driver, partition 4, PROCESS_LOCAL, 7996 bytes)
2018-10-02 09:42:49 INFO TaskSetManager:54 - Starting task 5.0 in stage 0.0 (TID 5, localhost, executor driver, partition 5, PROCESS_LOCAL, 7996 bytes)
2018-10-02 09:42:49 INFO TaskSetManager:54 - Starting task 6.0 in stage 0.0 (TID 6, localhost, executor driver, partition 6, PROCESS_LOCAL, 8004 bytes)
2018-10-02 09:42:49 INFO TaskSetManager:54 - Starting task 7.0 in stage 0.0 (TID 7, localhost, executor driver, partition 7, PROCESS_LOCAL, 7978 bytes)
2018-10-02 09:42:49 INFO TaskSetManager:54 - Starting task 8.0 in stage 0.0 (TID 8, localhost, executor driver, partition 8, PROCESS_LOCAL, 8009 bytes)
2018-10-02 09:42:49 INFO TaskSetManager:54 - Starting task 9.0 in stage 0.0 (TID 9, localhost, executor driver, partition 9, PROCESS_LOCAL, 7980 bytes)
2018-10-02 09:42:49 INFO Executor:54 - Running task 3.0 in stage 0.0 (TID 3)
2018-10-02 09:42:49 INFO Executor:54 - Running task 0.0 in stage 0.0 (TID 0)
2018-10-02 09:42:49 INFO Executor:54 - Running task 1.0 in stage 0.0 (TID 1)
2018-10-02 09:42:49 INFO Executor:54 - Running task 2.0 in stage 0.0 (TID 2)
2018-10-02 09:42:49 INFO Executor:54 - Running task 5.0 in stage 0.0 (TID 5)
2018-10-02 09:42:49 INFO Executor:54 - Running task 4.0 in stage 0.0 (TID 4)
2018-10-02 09:42:49 INFO Executor:54 - Running task 6.0 in stage 0.0 (TID 6)
2018-10-02 09:42:49 INFO Executor:54 - Running task 7.0 in stage 0.0 (TID 7)
2018-10-02 09:42:49 INFO Executor:54 - Running task 8.0 in stage 0.0 (TID 8)
2018-10-02 09:42:49 INFO Executor:54 - Running task 9.0 in stage 0.0 (TID 9)
2018-10-02 09:42:49 INFO Executor:54 - Fetching spark://10.0.1.44:46245/jars/aut-0.16.1-SNAPSHOT-fatjar.jar with timestamp 1538487767105
2018-10-02 09:42:49 INFO TransportClientFactory:267 - Successfully created connection to /10.0.1.44:46245 after 16 ms (0 ms spent in bootstraps)
2018-10-02 09:42:49 INFO Utils:54 - Fetching spark://10.0.1.44:46245/jars/aut-0.16.1-SNAPSHOT-fatjar.jar to /tmp/spark-ea2246da-4062-4046-b819-70932ad9d664/userFiles-81e47cb1-d944-40d7-9021-8e94f7a71310/fetchFileTemp8578759083499317273.tmp
2018-10-02 09:42:51 INFO Executor:54 - Adding file:/tmp/spark-ea2246da-4062-4046-b819-70932ad9d664/userFiles-81e47cb1-d944-40d7-9021-8e94f7a71310/aut-0.16.1-SNAPSHOT-fatjar.jar to class loader
2018-10-02 09:42:51 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTHLY-JOB311846-20170702213909334-00019.warc.gz:0+642314864
2018-10-02 09:42:51 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-WEEKLY-12073-20130503220356341-00010-wbgrp-crawl054.us.archive.org-6441.warc.gz:0+1437967814
2018-10-02 09:42:51 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTLIB-MTGOV-20080810110222-00723-crawling09.us.archive.org.arc.gz:0+191119896
2018-10-02 09:42:51 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-WEEKLY-12073-20130506091844567-00118-wbgrp-crawl054.us.archive.org-6441.warc.gz:0+1289997762
2018-10-02 09:42:51 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTLIB-MTGOV-20080505225931-00186-crawling09.us.archive.org.arc.gz:0+284417845
2018-10-02 09:42:51 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTHLY-JOB318434-20170802060019808-00010.warc.gz:0+745054600
2018-10-02 09:42:51 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTLIB-MTGOV-webteam-www.20071208015549.arc.gz:0+134559197
2018-10-02 09:42:51 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTHLY-CDDZXY-20111001180917-00033-crawling200.us.archive.org-6680.warc.gz:0+118039703
2018-10-02 09:42:51 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-QUARTERLY-JOB456196-20171001172127643-00001.warc.gz:0+209337866
2018-10-02 09:42:51 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-QUARTERLY-JOB311837-20170703002531023-00015.warc.gz:0+378164512
2018-10-02 09:42:58 ERROR Executor:91 - Exception in task 4.0 in stage 0.0 (TID 4)
java.util.zip.ZipException: invalid code lengths set
at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:42:58 INFO TaskSetManager:54 - Starting task 10.0 in stage 0.0 (TID 10, localhost, executor driver, partition 10, PROCESS_LOCAL, 8007 bytes)
2018-10-02 09:42:58 INFO Executor:54 - Running task 10.0 in stage 0.0 (TID 10)
2018-10-02 09:42:58 WARN TaskSetManager:66 - Lost task 4.0 in stage 0.0 (TID 4, localhost, executor driver): java.util.zip.ZipException: invalid code lengths set
at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:42:58 ERROR TaskSetManager:70 - Task 4 in stage 0.0 failed 1 times; aborting job
2018-10-02 09:42:58 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-BIMONTHLY-RPIOGJ-20120131001727-00013-crawling114.us.archive.org-6681.warc.gz:0+242453280
2018-10-02 09:42:58 INFO TaskSchedulerImpl:54 - Cancelling stage 0
2018-10-02 09:42:58 INFO Executor:54 - Executor is trying to kill task 0.0 in stage 0.0 (TID 0), reason: Stage cancelled
2018-10-02 09:42:58 INFO TaskSchedulerImpl:54 - Stage 0 was cancelled
2018-10-02 09:42:58 INFO Executor:54 - Executor is trying to kill task 9.0 in stage 0.0 (TID 9), reason: Stage cancelled
2018-10-02 09:42:58 INFO Executor:54 - Executor is trying to kill task 1.0 in stage 0.0 (TID 1), reason: Stage cancelled
2018-10-02 09:42:58 INFO Executor:54 - Executor is trying to kill task 5.0 in stage 0.0 (TID 5), reason: Stage cancelled
2018-10-02 09:42:58 INFO Executor:54 - Executor is trying to kill task 2.0 in stage 0.0 (TID 2), reason: Stage cancelled
2018-10-02 09:42:58 INFO Executor:54 - Executor is trying to kill task 6.0 in stage 0.0 (TID 6), reason: Stage cancelled
2018-10-02 09:42:58 INFO Executor:54 - Executor is trying to kill task 3.0 in stage 0.0 (TID 3), reason: Stage cancelled
2018-10-02 09:42:58 INFO Executor:54 - Executor is trying to kill task 10.0 in stage 0.0 (TID 10), reason: Stage cancelled
2018-10-02 09:42:58 INFO Executor:54 - Executor is trying to kill task 7.0 in stage 0.0 (TID 7), reason: Stage cancelled
2018-10-02 09:42:58 INFO Executor:54 - Executor is trying to kill task 8.0 in stage 0.0 (TID 8), reason: Stage cancelled
2018-10-02 09:42:58 INFO Executor:54 - Executor killed task 0.0 in stage 0.0 (TID 0), reason: Stage cancelled
2018-10-02 09:42:58 INFO Executor:54 - Executor killed task 9.0 in stage 0.0 (TID 9), reason: Stage cancelled
2018-10-02 09:42:58 WARN TaskSetManager:66 - Lost task 9.0 in stage 0.0 (TID 9, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:42:58 WARN TaskSetManager:66 - Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:42:58 INFO DAGScheduler:54 - ShuffleMapStage 0 (map at package.scala:72) failed in 9.132 s due to Job aborted due to stage failure: Task 4 in stage 0.0 failed 1 times, most recent failure: Lost task 4.0 in stage 0.0 (TID 4, localhost, executor driver): java.util.zip.ZipException: invalid code lengths set
at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
2018-10-02 09:42:58 INFO DAGScheduler:54 - Job 0 failed: sortBy at package.scala:74, took 9.294232 s
2018-10-02 09:42:58 INFO Executor:54 - Executor killed task 3.0 in stage 0.0 (TID 3), reason: Stage cancelled
2018-10-02 09:42:58 WARN TaskSetManager:66 - Lost task 3.0 in stage 0.0 (TID 3, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:42:58 INFO Executor:54 - Executor killed task 10.0 in stage 0.0 (TID 10), reason: Stage cancelled
2018-10-02 09:42:58 WARN TaskSetManager:66 - Lost task 10.0 in stage 0.0 (TID 10, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:42:58 INFO Executor:54 - Executor killed task 1.0 in stage 0.0 (TID 1), reason: Stage cancelled
2018-10-02 09:42:58 WARN TaskSetManager:66 - Lost task 1.0 in stage 0.0 (TID 1, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:42:58 INFO Executor:54 - Executor killed task 5.0 in stage 0.0 (TID 5), reason: Stage cancelled
2018-10-02 09:42:58 WARN TaskSetManager:66 - Lost task 5.0 in stage 0.0 (TID 5, localhost, executor driver): TaskKilled (Stage cancelled)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 0.0 failed 1 times, most recent failure: Lost task 4.0 in stage 0.0 (TID 4, localhost, executor driver): java.util.zip.ZipException: invalid code lengths set
at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:939)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:938)
at org.apache.spark.RangePartitioner$.sketch(Partitioner.scala:306)
at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:168)
at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:148)
at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:62)
at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:61)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.OrderedRDDFunctions.sortByKey(OrderedRDDFunctions.scala:61)
at org.apache.spark.rdd.RDD$$anonfun$sortBy$1.apply(RDD.scala:622)
at org.apache.spark.rdd.RDD$$anonfun$sortBy$1.apply(RDD.scala:623)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.sortBy(RDD.scala:620)
at io.archivesunleashed.package$CountableRDD.countItems(package.scala:74)
... 78 elided
Caused by: java.util.zip.ZipException: invalid code lengths set
at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:42:59 INFO Executor:54 - Executor killed task 2.0 in stage 0.0 (TID 2), reason: Stage cancelled
2018-10-02 09:42:59 WARN TaskSetManager:66 - Lost task 2.0 in stage 0.0 (TID 2, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:42:59 INFO MemoryStore:54 - Block broadcast_2 stored as values in memory (estimated size 275.4 KB, free 15.8 GB)
2018-10-02 09:42:59 INFO MemoryStore:54 - Block broadcast_2_piece0 stored as bytes in memory (estimated size 23.0 KB, free 15.8 GB)
2018-10-02 09:42:59 INFO BlockManagerInfo:54 - Added broadcast_2_piece0 in memory on 10.0.1.44:42415 (size: 23.0 KB, free: 15.8 GB)
2018-10-02 09:42:59 INFO SparkContext:54 - Created broadcast 2 from newAPIHadoopFile at package.scala:51
2018-10-02 09:42:59 INFO deprecation:1173 - mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
2018-10-02 09:42:59 INFO FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-10-02 09:42:59 INFO FileInputFormat:283 - Total input paths to process : 54
2018-10-02 09:42:59 INFO SparkContext:54 - Starting job: runJob at SparkHadoopWriter.scala:78
2018-10-02 09:42:59 INFO DAGScheduler:54 - Got job 1 (runJob at SparkHadoopWriter.scala:78) with 54 output partitions
2018-10-02 09:42:59 INFO DAGScheduler:54 - Final stage: ResultStage 2 (runJob at SparkHadoopWriter.scala:78)
2018-10-02 09:42:59 INFO DAGScheduler:54 - Parents of final stage: List()
2018-10-02 09:42:59 INFO DAGScheduler:54 - Missing parents: List()
2018-10-02 09:42:59 INFO DAGScheduler:54 - Submitting ResultStage 2 (MapPartitionsRDD[15] at saveAsTextFile at <console>:34), which has no missing parents
2018-10-02 09:42:59 INFO MemoryStore:54 - Block broadcast_3 stored as values in memory (estimated size 72.3 KB, free 15.8 GB)
2018-10-02 09:42:59 INFO MemoryStore:54 - Block broadcast_3_piece0 stored as bytes in memory (estimated size 25.9 KB, free 15.8 GB)
2018-10-02 09:42:59 INFO BlockManagerInfo:54 - Added broadcast_3_piece0 in memory on 10.0.1.44:42415 (size: 25.9 KB, free: 15.8 GB)
2018-10-02 09:42:59 INFO SparkContext:54 - Created broadcast 3 from broadcast at DAGScheduler.scala:1039
2018-10-02 09:42:59 INFO DAGScheduler:54 - Submitting 54 missing tasks from ResultStage 2 (MapPartitionsRDD[15] at saveAsTextFile at <console>:34) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
2018-10-02 09:42:59 INFO TaskSchedulerImpl:54 - Adding task set 2.0 with 54 tasks
2018-10-02 09:42:59 INFO TaskSetManager:54 - Starting task 0.0 in stage 2.0 (TID 11, localhost, executor driver, partition 0, PROCESS_LOCAL, 7989 bytes)
2018-10-02 09:42:59 INFO TaskSetManager:54 - Starting task 1.0 in stage 2.0 (TID 12, localhost, executor driver, partition 1, PROCESS_LOCAL, 8020 bytes)
2018-10-02 09:42:59 INFO TaskSetManager:54 - Starting task 2.0 in stage 2.0 (TID 13, localhost, executor driver, partition 2, PROCESS_LOCAL, 7987 bytes)
2018-10-02 09:42:59 INFO TaskSetManager:54 - Starting task 3.0 in stage 2.0 (TID 14, localhost, executor driver, partition 3, PROCESS_LOCAL, 7991 bytes)
2018-10-02 09:42:59 INFO TaskSetManager:54 - Starting task 4.0 in stage 2.0 (TID 15, localhost, executor driver, partition 4, PROCESS_LOCAL, 8007 bytes)
2018-10-02 09:42:59 INFO TaskSetManager:54 - Starting task 5.0 in stage 2.0 (TID 16, localhost, executor driver, partition 5, PROCESS_LOCAL, 8007 bytes)
2018-10-02 09:42:59 INFO TaskSetManager:54 - Starting task 6.0 in stage 2.0 (TID 17, localhost, executor driver, partition 6, PROCESS_LOCAL, 8015 bytes)
2018-10-02 09:42:59 INFO Executor:54 - Running task 3.0 in stage 2.0 (TID 14)
2018-10-02 09:42:59 INFO Executor:54 - Running task 6.0 in stage 2.0 (TID 17)
2018-10-02 09:42:59 INFO Executor:54 - Running task 5.0 in stage 2.0 (TID 16)
2018-10-02 09:42:59 INFO Executor:54 - Running task 2.0 in stage 2.0 (TID 13)
2018-10-02 09:42:59 INFO Executor:54 - Running task 1.0 in stage 2.0 (TID 12)
2018-10-02 09:42:59 INFO Executor:54 - Running task 4.0 in stage 2.0 (TID 15)
2018-10-02 09:42:59 INFO Executor:54 - Running task 0.0 in stage 2.0 (TID 11)
2018-10-02 09:42:59 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTLIB-MTGOV-20080810110222-00723-crawling09.us.archive.org.arc.gz:0+191119896
2018-10-02 09:42:59 INFO FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-10-02 09:42:59 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-WEEKLY-12073-20130503220356341-00010-wbgrp-crawl054.us.archive.org-6441.warc.gz:0+1437967814
2018-10-02 09:42:59 INFO FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-10-02 09:42:59 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTLIB-MTGOV-20080505225931-00186-crawling09.us.archive.org.arc.gz:0+284417845
2018-10-02 09:42:59 INFO FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-10-02 09:42:59 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTLIB-MTGOV-webteam-www.20071208015549.arc.gz:0+134559197
2018-10-02 09:42:59 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTHLY-JOB318434-20170802060019808-00010.warc.gz:0+745054600
2018-10-02 09:42:59 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-QUARTERLY-JOB311837-20170703002531023-00015.warc.gz:0+378164512
2018-10-02 09:42:59 INFO FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-10-02 09:42:59 INFO FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-10-02 09:42:59 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTHLY-CDDZXY-20111001180917-00033-crawling200.us.archive.org-6680.warc.gz:0+118039703
2018-10-02 09:42:59 INFO FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-10-02 09:42:59 INFO FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-10-02 09:43:00 INFO ContextCleaner:54 - Cleaned accumulator 17
2018-10-02 09:43:00 INFO ContextCleaner:54 - Cleaned accumulator 22
2018-10-02 09:43:00 INFO ContextCleaner:54 - Cleaned accumulator 18
2018-10-02 09:43:00 INFO ContextCleaner:54 - Cleaned accumulator 20
2018-10-02 09:43:00 INFO ContextCleaner:54 - Cleaned accumulator 6
2018-10-02 09:43:00 INFO ContextCleaner:54 - Cleaned accumulator 13
2018-10-02 09:43:00 INFO ContextCleaner:54 - Cleaned accumulator 7
2018-10-02 09:43:00 INFO ContextCleaner:54 - Cleaned accumulator 1
2018-10-02 09:43:00 INFO ContextCleaner:54 - Cleaned accumulator 4
2018-10-02 09:43:00 INFO ContextCleaner:54 - Cleaned accumulator 5
2018-10-02 09:43:00 INFO ContextCleaner:54 - Cleaned shuffle 0
2018-10-02 09:43:00 INFO ContextCleaner:54 - Cleaned accumulator 2
2018-10-02 09:43:00 INFO ContextCleaner:54 - Cleaned accumulator 0
2018-10-02 09:43:00 INFO ContextCleaner:54 - Cleaned accumulator 12
2018-10-02 09:43:00 INFO ContextCleaner:54 - Cleaned accumulator 14
2018-10-02 09:43:00 INFO ContextCleaner:54 - Cleaned accumulator 8
2018-10-02 09:43:00 INFO ContextCleaner:54 - Cleaned accumulator 19
2018-10-02 09:43:00 INFO ContextCleaner:54 - Cleaned accumulator 24
2018-10-02 09:43:00 INFO BlockManagerInfo:54 - Removed broadcast_0_piece0 on 10.0.1.44:42415 in memory (size: 23.0 KB, free: 15.8 GB)
2018-10-02 09:43:02 INFO Executor:54 - Executor killed task 8.0 in stage 0.0 (TID 8), reason: Stage cancelled
2018-10-02 09:43:02 INFO TaskSetManager:54 - Starting task 7.0 in stage 2.0 (TID 18, localhost, executor driver, partition 7, PROCESS_LOCAL, 7989 bytes)
2018-10-02 09:43:02 WARN TaskSetManager:66 - Lost task 8.0 in stage 0.0 (TID 8, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:43:02 INFO Executor:54 - Running task 7.0 in stage 2.0 (TID 18)
2018-10-02 09:43:02 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTHLY-JOB311846-20170702213909334-00019.warc.gz:0+642314864
2018-10-02 09:43:02 INFO FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-10-02 09:43:04 INFO Executor:54 - Executor killed task 6.0 in stage 0.0 (TID 6), reason: Stage cancelled
2018-10-02 09:43:04 INFO TaskSetManager:54 - Starting task 8.0 in stage 2.0 (TID 19, localhost, executor driver, partition 8, PROCESS_LOCAL, 8020 bytes)
2018-10-02 09:43:04 WARN TaskSetManager:66 - Lost task 6.0 in stage 0.0 (TID 6, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:43:04 INFO Executor:54 - Running task 8.0 in stage 2.0 (TID 19)
2018-10-02 09:43:04 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-WEEKLY-12073-20130506091844567-00118-wbgrp-crawl054.us.archive.org-6441.warc.gz:0+1289997762
2018-10-02 09:43:04 INFO FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-10-02 09:43:06 INFO Executor:54 - Executor killed task 7.0 in stage 0.0 (TID 7), reason: Stage cancelled
2018-10-02 09:43:06 INFO TaskSetManager:54 - Starting task 9.0 in stage 2.0 (TID 20, localhost, executor driver, partition 9, PROCESS_LOCAL, 7991 bytes)
2018-10-02 09:43:06 WARN TaskSetManager:66 - Lost task 7.0 in stage 0.0 (TID 7, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:43:06 INFO Executor:54 - Running task 9.0 in stage 2.0 (TID 20)
2018-10-02 09:43:06 INFO ContextCleaner:54 - Cleaned accumulator 11
2018-10-02 09:43:06 INFO ContextCleaner:54 - Cleaned accumulator 15
2018-10-02 09:43:06 INFO ContextCleaner:54 - Cleaned accumulator 3
2018-10-02 09:43:06 INFO ContextCleaner:54 - Cleaned accumulator 21
2018-10-02 09:43:06 INFO ContextCleaner:54 - Cleaned accumulator 16
2018-10-02 09:43:06 INFO ContextCleaner:54 - Cleaned accumulator 23
2018-10-02 09:43:06 INFO ContextCleaner:54 - Cleaned accumulator 9
2018-10-02 09:43:06 INFO ContextCleaner:54 - Cleaned accumulator 10
2018-10-02 09:43:06 INFO TaskSchedulerImpl:54 - Removed TaskSet 0.0, whose tasks have all completed, from pool
2018-10-02 09:43:06 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-QUARTERLY-JOB456196-20171001172127643-00001.warc.gz:0+209337866
2018-10-02 09:43:06 INFO FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-10-02 09:43:06 ERROR Utils:91 - Aborting task
java.util.zip.ZipException: invalid stored block lengths
at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:43:06 ERROR SparkHadoopWriter:70 - Task attempt_20181002094259_0015_m_000005_0 aborted.
2018-10-02 09:43:06 ERROR Executor:91 - Exception in task 5.0 in stage 2.0 (TID 16)
org.apache.spark.SparkException: Task failed while writing rows
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:151)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.zip.ZipException: invalid stored block lengths
at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
... 8 more
2018-10-02 09:43:06 INFO TaskSetManager:54 - Starting task 10.0 in stage 2.0 (TID 21, localhost, executor driver, partition 10, PROCESS_LOCAL, 8018 bytes)
2018-10-02 09:43:06 WARN TaskSetManager:66 - Lost task 5.0 in stage 2.0 (TID 16, localhost, executor driver): org.apache.spark.SparkException: Task failed while writing rows
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:151)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.zip.ZipException: invalid stored block lengths
at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
... 8 more
2018-10-02 09:43:06 ERROR TaskSetManager:70 - Task 5 in stage 2.0 failed 1 times; aborting job
2018-10-02 09:43:06 INFO TaskSchedulerImpl:54 - Cancelling stage 2
2018-10-02 09:43:06 INFO Executor:54 - Running task 10.0 in stage 2.0 (TID 21)
2018-10-02 09:43:06 INFO TaskSchedulerImpl:54 - Stage 2 was cancelled
2018-10-02 09:43:06 INFO DAGScheduler:54 - ResultStage 2 (runJob at SparkHadoopWriter.scala:78) failed in 7.273 s due to Job aborted due to stage failure: Task 5 in stage 2.0 failed 1 times, most recent failure: Lost task 5.0 in stage 2.0 (TID 16, localhost, executor driver): org.apache.spark.SparkException: Task failed while writing rows
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:151)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.zip.ZipException: invalid stored block lengths
at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
... 8 more
Driver stacktrace:
2018-10-02 09:43:06 INFO DAGScheduler:54 - Job 1 failed: runJob at SparkHadoopWriter.scala:78, took 7.276136 s
2018-10-02 09:43:06 INFO Executor:54 - Executor is trying to kill task 4.0 in stage 2.0 (TID 15), reason: Stage cancelled
2018-10-02 09:43:06 INFO Executor:54 - Executor is trying to kill task 1.0 in stage 2.0 (TID 12), reason: Stage cancelled
2018-10-02 09:43:06 INFO Executor:54 - Executor is trying to kill task 8.0 in stage 2.0 (TID 19), reason: Stage cancelled
2018-10-02 09:43:06 INFO Executor:54 - Executor is trying to kill task 2.0 in stage 2.0 (TID 13), reason: Stage cancelled
2018-10-02 09:43:06 INFO Executor:54 - Executor is trying to kill task 9.0 in stage 2.0 (TID 20), reason: Stage cancelled
2018-10-02 09:43:06 INFO Executor:54 - Executor is trying to kill task 6.0 in stage 2.0 (TID 17), reason: Stage cancelled
2018-10-02 09:43:06 INFO Executor:54 - Executor is trying to kill task 10.0 in stage 2.0 (TID 21), reason: Stage cancelled
2018-10-02 09:43:06 INFO Executor:54 - Executor is trying to kill task 7.0 in stage 2.0 (TID 18), reason: Stage cancelled
2018-10-02 09:43:06 INFO Executor:54 - Executor is trying to kill task 3.0 in stage 2.0 (TID 14), reason: Stage cancelled
2018-10-02 09:43:06 INFO Executor:54 - Executor is trying to kill task 0.0 in stage 2.0 (TID 11), reason: Stage cancelled
2018-10-02 09:43:06 ERROR SparkHadoopWriter:91 - Aborting job job_20181002094259_0015.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 2.0 failed 1 times, most recent failure: Lost task 5.0 in stage 2.0 (TID 16, localhost, executor driver): org.apache.spark.SparkException: Task failed while writing rows
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:151)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.zip.ZipException: invalid stored block lengths
at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
... 8 more
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:78)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1096)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1094)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:1067)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1032)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:958)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:957)
at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply$mcV$sp(RDD.scala:1493)
at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1472)
at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1472)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1472)
at $line21.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:34)
at $line21.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:39)
at $line21.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:41)
at $line21.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:43)
at $line21.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:45)
at $line21.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:47)
at $line21.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:49)
at $line21.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:51)
at $line21.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:53)
at $line21.$read$$iw$$iw$$iw$$iw$$iw.<init>(<console>:55)
at $line21.$read$$iw$$iw$$iw$$iw.<init>(<console>:57)
at $line21.$read$$iw$$iw$$iw.<init>(<console>:59)
at $line21.$read$$iw$$iw.<init>(<console>:61)
at $line21.$read$$iw.<init>(<console>:63)
at $line21.$read.<init>(<console>:65)
at $line21.$read$.<init>(<console>:69)
at $line21.$read$.<clinit>(<console>)
at $line21.$eval$.$print$lzycompute(<console>:7)
at $line21.$eval$.$print(<console>:6)
at $line21.$eval.$print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786)
at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047)
at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638)
at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637)
at scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
at scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:569)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565)
at scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:807)
at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:681)
at scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:395)
at scala.tools.nsc.interpreter.ILoop.loop(ILoop.scala:415)
at scala.tools.nsc.interpreter.ILoop$$anonfun$interpretAllFrom$1$$anonfun$apply$5$$anonfun$apply$6.apply(ILoop.scala:427)
at scala.tools.nsc.interpreter.ILoop$$anonfun$interpretAllFrom$1$$anonfun$apply$5$$anonfun$apply$6.apply(ILoop.scala:423)
at scala.reflect.io.Streamable$Chars$class.applyReader(Streamable.scala:111)
at scala.reflect.io.File.applyReader(File.scala:50)
at scala.tools.nsc.interpreter.ILoop$$anonfun$interpretAllFrom$1$$anonfun$apply$5.apply(ILoop.scala:423)
at scala.tools.nsc.interpreter.ILoop$$anonfun$interpretAllFrom$1$$anonfun$apply$5.apply(ILoop.scala:423)
at scala.tools.nsc.interpreter.ILoop.savingReplayStack(ILoop.scala:91)
at scala.tools.nsc.interpreter.ILoop$$anonfun$interpretAllFrom$1.apply(ILoop.scala:422)
at scala.tools.nsc.interpreter.ILoop$$anonfun$interpretAllFrom$1.apply(ILoop.scala:422)
at scala.tools.nsc.interpreter.ILoop.savingReader(ILoop.scala:96)
at scala.tools.nsc.interpreter.ILoop.interpretAllFrom(ILoop.scala:421)
at scala.tools.nsc.interpreter.ILoop$$anonfun$run$3$1.apply(ILoop.scala:577)
at scala.tools.nsc.interpreter.ILoop$$anonfun$run$3$1.apply(ILoop.scala:576)
at scala.tools.nsc.interpreter.ILoop.withFile(ILoop.scala:570)
at scala.tools.nsc.interpreter.ILoop.run$3(ILoop.scala:576)
at scala.tools.nsc.interpreter.ILoop.loadCommand(ILoop.scala:583)
at scala.tools.nsc.interpreter.ILoop$$anonfun$standardCommands$8.apply(ILoop.scala:207)
at scala.tools.nsc.interpreter.ILoop$$anonfun$standardCommands$8.apply(ILoop.scala:207)
at scala.tools.nsc.interpreter.LoopCommands$LineCmd.apply(LoopCommands.scala:62)
at scala.tools.nsc.interpreter.ILoop.colonCommand(ILoop.scala:688)
at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:679)
at scala.tools.nsc.interpreter.ILoop.loadFiles(ILoop.scala:835)
at org.apache.spark.repl.SparkILoop.loadFiles(SparkILoop.scala:111)
at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:920)
at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
at scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:97)
at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:909)
at org.apache.spark.repl.Main$.doMain(Main.scala:76)
at org.apache.spark.repl.Main$.main(Main.scala:56)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.spark.SparkException: Task failed while writing rows
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:151)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.zip.ZipException: invalid stored block lengths
at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
... 8 more
2018-10-02 09:43:06 ERROR Utils:91 - Aborting task
org.apache.spark.TaskKilledException
at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:151)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:43:06 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-BIMONTHLY-RPIOGJ-20120131001727-00013-crawling114.us.archive.org-6681.warc.gz:0+242453280
2018-10-02 09:43:06 ERROR Utils:91 - Aborting task
org.apache.spark.TaskKilledException
at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:151)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:43:06 INFO FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-10-02 09:43:06 ERROR SparkHadoopWriter:70 - Task attempt_20181002094259_0015_m_000000_0 aborted.
2018-10-02 09:43:06 ERROR SparkHadoopWriter:70 - Task attempt_20181002094259_0015_m_000007_0 aborted.
2018-10-02 09:43:06 INFO Executor:54 - Executor interrupted and killed task 7.0 in stage 2.0 (TID 18), reason: Stage cancelled
2018-10-02 09:43:06 INFO Executor:54 - Executor interrupted and killed task 0.0 in stage 2.0 (TID 11), reason: Stage cancelled
2018-10-02 09:43:06 WARN TaskSetManager:66 - Lost task 7.0 in stage 2.0 (TID 18, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:43:06 WARN TaskSetManager:66 - Lost task 0.0 in stage 2.0 (TID 11, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:43:06 INFO Executor:54 - Executor interrupted and killed task 10.0 in stage 2.0 (TID 21), reason: Stage cancelled
2018-10-02 09:43:06 WARN TaskSetManager:66 - Lost task 10.0 in stage 2.0 (TID 21, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:43:06 ERROR Utils:91 - Aborting task
org.apache.spark.TaskKilledException
at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:151)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:43:06 WARN FileOutputCommitter:569 - Could not delete file:/home/nruest/Dropbox/499-issues/output/all-text/output/_temporary/0/_temporary/attempt_20181002094259_0015_m_000004_0
2018-10-02 09:43:06 ERROR SparkHadoopWriter:70 - Task attempt_20181002094259_0015_m_000004_0 aborted.
2018-10-02 09:43:06 INFO Executor:54 - Executor interrupted and killed task 4.0 in stage 2.0 (TID 15), reason: Stage cancelled
2018-10-02 09:43:06 WARN TaskSetManager:66 - Lost task 4.0 in stage 2.0 (TID 15, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:43:06 ERROR Utils:91 - Aborting task
org.apache.spark.TaskKilledException
at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:151)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:43:06 WARN FileOutputCommitter:569 - Could not delete file:/home/nruest/Dropbox/499-issues/output/all-text/output/_temporary/0/_temporary/attempt_20181002094259_0015_m_000001_0
2018-10-02 09:43:06 ERROR SparkHadoopWriter:70 - Task attempt_20181002094259_0015_m_000001_0 aborted.
2018-10-02 09:43:06 INFO Executor:54 - Executor interrupted and killed task 1.0 in stage 2.0 (TID 12), reason: Stage cancelled
2018-10-02 09:43:06 WARN TaskSetManager:66 - Lost task 1.0 in stage 2.0 (TID 12, localhost, executor driver): TaskKilled (Stage cancelled)
org.apache.spark.SparkException: Job aborted.
at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:96)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1096)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1094)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:1067)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1032)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:958)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:957)
at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply$mcV$sp(RDD.scala:1493)
at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1472)
at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1472)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1472)
... 78 elided
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 2.0 failed 1 times, most recent failure: Lost task 5.0 in stage 2.0 (TID 16, localhost, executor driver): org.apache.spark.SparkException: Task failed while writing rows
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:151)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.zip.ZipException: invalid stored block lengths
at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
... 8 more
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:78)
... 106 more
Caused by: org.apache.spark.SparkException: Task failed while writing rows
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:151)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.zip.ZipException: invalid stored block lengths
at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
... 8 more
2018-10-02 09:43:06 ERROR Utils:91 - Aborting task
org.apache.spark.TaskKilledException
at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:151)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:43:06 WARN FileOutputCommitter:569 - Could not delete file:/home/nruest/Dropbox/499-issues/output/all-text/output/_temporary/0/_temporary/attempt_20181002094259_0015_m_000009_0
2018-10-02 09:43:06 ERROR SparkHadoopWriter:70 - Task attempt_20181002094259_0015_m_000009_0 aborted.
2018-10-02 09:43:06 INFO Executor:54 - Executor interrupted and killed task 9.0 in stage 2.0 (TID 20), reason: Stage cancelled
2018-10-02 09:43:06 WARN TaskSetManager:66 - Lost task 9.0 in stage 2.0 (TID 20, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:43:06 ERROR Utils:91 - Aborting task
org.apache.spark.TaskKilledException
at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:151)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:43:06 WARN FileOutputCommitter:569 - Could not delete file:/home/nruest/Dropbox/499-issues/output/all-text/output/_temporary/0/_temporary/attempt_20181002094259_0015_m_000003_0
2018-10-02 09:43:06 ERROR SparkHadoopWriter:70 - Task attempt_20181002094259_0015_m_000003_0 aborted.
2018-10-02 09:43:06 INFO Executor:54 - Executor interrupted and killed task 3.0 in stage 2.0 (TID 14), reason: Stage cancelled
2018-10-02 09:43:06 WARN TaskSetManager:66 - Lost task 3.0 in stage 2.0 (TID 14, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:43:06 INFO MemoryStore:54 - Block broadcast_4 stored as values in memory (estimated size 275.4 KB, free 15.8 GB)
2018-10-02 09:43:06 INFO MemoryStore:54 - Block broadcast_4_piece0 stored as bytes in memory (estimated size 23.0 KB, free 15.8 GB)
2018-10-02 09:43:06 INFO BlockManagerInfo:54 - Added broadcast_4_piece0 in memory on 10.0.1.44:42415 (size: 23.0 KB, free: 15.8 GB)
2018-10-02 09:43:06 INFO SparkContext:54 - Created broadcast 4 from newAPIHadoopFile at package.scala:51
2018-10-02 09:43:06 INFO FileInputFormat:283 - Total input paths to process : 54
2018-10-02 09:43:06 INFO SparkContext:54 - Starting job: sortBy at package.scala:74
2018-10-02 09:43:06 INFO DAGScheduler:54 - Registering RDD 23 (map at package.scala:72)
2018-10-02 09:43:06 INFO DAGScheduler:54 - Got job 2 (sortBy at package.scala:74) with 54 output partitions
2018-10-02 09:43:06 INFO DAGScheduler:54 - Final stage: ResultStage 4 (sortBy at package.scala:74)
2018-10-02 09:43:06 INFO DAGScheduler:54 - Parents of final stage: List(ShuffleMapStage 3)
2018-10-02 09:43:06 INFO DAGScheduler:54 - Missing parents: List(ShuffleMapStage 3)
2018-10-02 09:43:06 INFO DAGScheduler:54 - Submitting ShuffleMapStage 3 (MapPartitionsRDD[23] at map at package.scala:72), which has no missing parents
2018-10-02 09:43:06 INFO MemoryStore:54 - Block broadcast_5 stored as values in memory (estimated size 4.7 KB, free 15.8 GB)
2018-10-02 09:43:06 INFO MemoryStore:54 - Block broadcast_5_piece0 stored as bytes in memory (estimated size 2.4 KB, free 15.8 GB)
2018-10-02 09:43:06 INFO BlockManagerInfo:54 - Added broadcast_5_piece0 in memory on 10.0.1.44:42415 (size: 2.4 KB, free: 15.8 GB)
2018-10-02 09:43:06 INFO SparkContext:54 - Created broadcast 5 from broadcast at DAGScheduler.scala:1039
2018-10-02 09:43:06 INFO DAGScheduler:54 - Submitting 54 missing tasks from ShuffleMapStage 3 (MapPartitionsRDD[23] at map at package.scala:72) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
2018-10-02 09:43:06 INFO TaskSchedulerImpl:54 - Adding task set 3.0 with 54 tasks
2018-10-02 09:43:06 INFO TaskSetManager:54 - Starting task 0.0 in stage 3.0 (TID 22, localhost, executor driver, partition 0, PROCESS_LOCAL, 7978 bytes)
2018-10-02 09:43:06 INFO TaskSetManager:54 - Starting task 1.0 in stage 3.0 (TID 23, localhost, executor driver, partition 1, PROCESS_LOCAL, 8009 bytes)
2018-10-02 09:43:06 INFO TaskSetManager:54 - Starting task 2.0 in stage 3.0 (TID 24, localhost, executor driver, partition 2, PROCESS_LOCAL, 7976 bytes)
2018-10-02 09:43:06 INFO TaskSetManager:54 - Starting task 3.0 in stage 3.0 (TID 25, localhost, executor driver, partition 3, PROCESS_LOCAL, 7980 bytes)
2018-10-02 09:43:06 INFO TaskSetManager:54 - Starting task 4.0 in stage 3.0 (TID 26, localhost, executor driver, partition 4, PROCESS_LOCAL, 7996 bytes)
2018-10-02 09:43:06 INFO TaskSetManager:54 - Starting task 5.0 in stage 3.0 (TID 27, localhost, executor driver, partition 5, PROCESS_LOCAL, 7996 bytes)
2018-10-02 09:43:06 INFO TaskSetManager:54 - Starting task 6.0 in stage 3.0 (TID 28, localhost, executor driver, partition 6, PROCESS_LOCAL, 8004 bytes)
2018-10-02 09:43:06 INFO Executor:54 - Running task 0.0 in stage 3.0 (TID 22)
2018-10-02 09:43:06 INFO Executor:54 - Running task 5.0 in stage 3.0 (TID 27)
2018-10-02 09:43:06 INFO Executor:54 - Running task 2.0 in stage 3.0 (TID 24)
2018-10-02 09:43:06 INFO Executor:54 - Running task 1.0 in stage 3.0 (TID 23)
2018-10-02 09:43:06 INFO Executor:54 - Running task 6.0 in stage 3.0 (TID 28)
2018-10-02 09:43:06 INFO Executor:54 - Running task 3.0 in stage 3.0 (TID 25)
2018-10-02 09:43:06 INFO Executor:54 - Running task 4.0 in stage 3.0 (TID 26)
2018-10-02 09:43:06 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-WEEKLY-12073-20130503220356341-00010-wbgrp-crawl054.us.archive.org-6441.warc.gz:0+1437967814
2018-10-02 09:43:06 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTHLY-CDDZXY-20111001180917-00033-crawling200.us.archive.org-6680.warc.gz:0+118039703
2018-10-02 09:43:06 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTHLY-JOB318434-20170802060019808-00010.warc.gz:0+745054600
2018-10-02 09:43:06 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-QUARTERLY-JOB311837-20170703002531023-00015.warc.gz:0+378164512
2018-10-02 09:43:06 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTLIB-MTGOV-20080810110222-00723-crawling09.us.archive.org.arc.gz:0+191119896
2018-10-02 09:43:06 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTLIB-MTGOV-20080505225931-00186-crawling09.us.archive.org.arc.gz:0+284417845
2018-10-02 09:43:06 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTLIB-MTGOV-webteam-www.20071208015549.arc.gz:0+134559197
2018-10-02 09:43:08 INFO ContextCleaner:54 - Cleaned accumulator 47
2018-10-02 09:43:08 INFO ContextCleaner:54 - Cleaned accumulator 31
2018-10-02 09:43:08 INFO ContextCleaner:54 - Cleaned accumulator 28
2018-10-02 09:43:08 INFO ContextCleaner:54 - Cleaned accumulator 26
2018-10-02 09:43:08 INFO ContextCleaner:54 - Cleaned accumulator 33
2018-10-02 09:43:08 INFO ContextCleaner:54 - Cleaned accumulator 44
2018-10-02 09:43:08 INFO ContextCleaner:54 - Cleaned accumulator 34
2018-10-02 09:43:08 INFO ContextCleaner:54 - Cleaned accumulator 36
2018-10-02 09:43:08 INFO ContextCleaner:54 - Cleaned accumulator 35
2018-10-02 09:43:08 INFO ContextCleaner:54 - Cleaned accumulator 43
2018-10-02 09:43:08 INFO ContextCleaner:54 - Cleaned accumulator 37
2018-10-02 09:43:08 INFO ContextCleaner:54 - Cleaned accumulator 30
2018-10-02 09:43:08 INFO ContextCleaner:54 - Cleaned accumulator 49
2018-10-02 09:43:08 INFO ContextCleaner:54 - Cleaned accumulator 27
2018-10-02 09:43:08 INFO ContextCleaner:54 - Cleaned accumulator 48
2018-10-02 09:43:08 INFO ContextCleaner:54 - Cleaned accumulator 32
2018-10-02 09:43:08 INFO ContextCleaner:54 - Cleaned accumulator 39
2018-10-02 09:43:08 INFO ContextCleaner:54 - Cleaned accumulator 38
2018-10-02 09:43:08 INFO ContextCleaner:54 - Cleaned accumulator 29
2018-10-02 09:43:08 INFO ContextCleaner:54 - Cleaned accumulator 42
2018-10-02 09:43:08 INFO BlockManagerInfo:54 - Removed broadcast_1_piece0 on 10.0.1.44:42415 in memory (size: 2.6 KB, free: 15.8 GB)
2018-10-02 09:43:08 INFO ContextCleaner:54 - Cleaned accumulator 40
2018-10-02 09:43:08 INFO ContextCleaner:54 - Cleaned accumulator 46
2018-10-02 09:43:08 INFO ContextCleaner:54 - Cleaned accumulator 25
2018-10-02 09:43:08 INFO BlockManagerInfo:54 - Removed broadcast_2_piece0 on 10.0.1.44:42415 in memory (size: 23.0 KB, free: 15.8 GB)
2018-10-02 09:43:08 INFO ContextCleaner:54 - Cleaned accumulator 41
2018-10-02 09:43:08 INFO ContextCleaner:54 - Cleaned accumulator 45
2018-10-02 09:43:11 ERROR Utils:91 - Aborting task
org.apache.spark.TaskKilledException
at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:151)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:43:11 WARN FileOutputCommitter:569 - Could not delete file:/home/nruest/Dropbox/499-issues/output/all-text/output/_temporary/0/_temporary/attempt_20181002094259_0015_m_000006_0
2018-10-02 09:43:11 ERROR SparkHadoopWriter:70 - Task attempt_20181002094259_0015_m_000006_0 aborted.
2018-10-02 09:43:11 INFO Executor:54 - Executor interrupted and killed task 6.0 in stage 2.0 (TID 17), reason: Stage cancelled
2018-10-02 09:43:11 INFO TaskSetManager:54 - Starting task 7.0 in stage 3.0 (TID 29, localhost, executor driver, partition 7, PROCESS_LOCAL, 7978 bytes)
2018-10-02 09:43:11 WARN TaskSetManager:66 - Lost task 6.0 in stage 2.0 (TID 17, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:43:11 INFO Executor:54 - Running task 7.0 in stage 3.0 (TID 29)
2018-10-02 09:43:11 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTHLY-JOB311846-20170702213909334-00019.warc.gz:0+642314864
2018-10-02 09:43:13 ERROR Utils:91 - Aborting task
org.apache.spark.TaskKilledException
at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:151)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:43:13 WARN FileOutputCommitter:569 - Could not delete file:/home/nruest/Dropbox/499-issues/output/all-text/output/_temporary/0/_temporary/attempt_20181002094259_0015_m_000002_0
2018-10-02 09:43:13 ERROR SparkHadoopWriter:70 - Task attempt_20181002094259_0015_m_000002_0 aborted.
2018-10-02 09:43:13 INFO Executor:54 - Executor interrupted and killed task 2.0 in stage 2.0 (TID 13), reason: Stage cancelled
2018-10-02 09:43:13 INFO TaskSetManager:54 - Starting task 8.0 in stage 3.0 (TID 30, localhost, executor driver, partition 8, PROCESS_LOCAL, 8009 bytes)
2018-10-02 09:43:13 WARN TaskSetManager:66 - Lost task 2.0 in stage 2.0 (TID 13, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:43:13 INFO Executor:54 - Running task 8.0 in stage 3.0 (TID 30)
2018-10-02 09:43:13 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-WEEKLY-12073-20130506091844567-00118-wbgrp-crawl054.us.archive.org-6441.warc.gz:0+1289997762
2018-10-02 09:43:14 ERROR Utils:91 - Aborting task
org.apache.spark.TaskKilledException
at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:151)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:43:14 WARN FileOutputCommitter:569 - Could not delete file:/home/nruest/Dropbox/499-issues/output/all-text/output/_temporary/0/_temporary/attempt_20181002094259_0015_m_000008_0
2018-10-02 09:43:14 ERROR SparkHadoopWriter:70 - Task attempt_20181002094259_0015_m_000008_0 aborted.
2018-10-02 09:43:14 INFO Executor:54 - Executor interrupted and killed task 8.0 in stage 2.0 (TID 19), reason: Stage cancelled
2018-10-02 09:43:14 INFO TaskSetManager:54 - Starting task 9.0 in stage 3.0 (TID 31, localhost, executor driver, partition 9, PROCESS_LOCAL, 7980 bytes)
2018-10-02 09:43:14 WARN TaskSetManager:66 - Lost task 8.0 in stage 2.0 (TID 19, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:43:14 INFO TaskSchedulerImpl:54 - Removed TaskSet 2.0, whose tasks have all completed, from pool
2018-10-02 09:43:14 INFO Executor:54 - Running task 9.0 in stage 3.0 (TID 31)
2018-10-02 09:43:14 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-QUARTERLY-JOB456196-20171001172127643-00001.warc.gz:0+209337866
2018-10-02 09:43:15 ERROR Executor:91 - Exception in task 5.0 in stage 3.0 (TID 27)
java.util.zip.ZipException: invalid stored block lengths
at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:43:15 INFO TaskSetManager:54 - Starting task 10.0 in stage 3.0 (TID 32, localhost, executor driver, partition 10, PROCESS_LOCAL, 8007 bytes)
2018-10-02 09:43:15 WARN TaskSetManager:66 - Lost task 5.0 in stage 3.0 (TID 27, localhost, executor driver): java.util.zip.ZipException: invalid stored block lengths
at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:43:15 ERROR TaskSetManager:70 - Task 5 in stage 3.0 failed 1 times; aborting job
2018-10-02 09:43:15 INFO Executor:54 - Running task 10.0 in stage 3.0 (TID 32)
2018-10-02 09:43:15 INFO TaskSchedulerImpl:54 - Cancelling stage 3
2018-10-02 09:43:15 INFO TaskSchedulerImpl:54 - Stage 3 was cancelled
2018-10-02 09:43:15 INFO DAGScheduler:54 - ShuffleMapStage 3 (map at package.scala:72) failed in 8.441 s due to Job aborted due to stage failure: Task 5 in stage 3.0 failed 1 times, most recent failure: Lost task 5.0 in stage 3.0 (TID 27, localhost, executor driver): java.util.zip.ZipException: invalid stored block lengths
at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
2018-10-02 09:43:15 INFO Executor:54 - Executor is trying to kill task 8.0 in stage 3.0 (TID 30), reason: Stage cancelled
2018-10-02 09:43:15 INFO Executor:54 - Executor is trying to kill task 9.0 in stage 3.0 (TID 31), reason: Stage cancelled
2018-10-02 09:43:15 INFO Executor:54 - Executor is trying to kill task 10.0 in stage 3.0 (TID 32), reason: Stage cancelled
2018-10-02 09:43:15 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-BIMONTHLY-RPIOGJ-20120131001727-00013-crawling114.us.archive.org-6681.warc.gz:0+242453280
2018-10-02 09:43:15 INFO Executor:54 - Executor is trying to kill task 2.0 in stage 3.0 (TID 24), reason: Stage cancelled
2018-10-02 09:43:15 INFO Executor:54 - Executor is trying to kill task 6.0 in stage 3.0 (TID 28), reason: Stage cancelled
2018-10-02 09:43:15 INFO Executor:54 - Executor is trying to kill task 3.0 in stage 3.0 (TID 25), reason: Stage cancelled
2018-10-02 09:43:15 INFO Executor:54 - Executor is trying to kill task 0.0 in stage 3.0 (TID 22), reason: Stage cancelled
2018-10-02 09:43:15 INFO Executor:54 - Executor is trying to kill task 7.0 in stage 3.0 (TID 29), reason: Stage cancelled
2018-10-02 09:43:15 INFO Executor:54 - Executor is trying to kill task 4.0 in stage 3.0 (TID 26), reason: Stage cancelled
2018-10-02 09:43:15 INFO Executor:54 - Executor is trying to kill task 1.0 in stage 3.0 (TID 23), reason: Stage cancelled
2018-10-02 09:43:15 INFO Executor:54 - Executor killed task 10.0 in stage 3.0 (TID 32), reason: Stage cancelled
2018-10-02 09:43:15 INFO DAGScheduler:54 - Job 2 failed: sortBy at package.scala:74, took 8.449134 s
2018-10-02 09:43:15 WARN TaskSetManager:66 - Lost task 10.0 in stage 3.0 (TID 32, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:43:15 INFO Executor:54 - Executor killed task 9.0 in stage 3.0 (TID 31), reason: Stage cancelled
2018-10-02 09:43:15 INFO Executor:54 - Executor killed task 1.0 in stage 3.0 (TID 23), reason: Stage cancelled
2018-10-02 09:43:15 INFO Executor:54 - Executor killed task 3.0 in stage 3.0 (TID 25), reason: Stage cancelled
2018-10-02 09:43:15 WARN TaskSetManager:66 - Lost task 1.0 in stage 3.0 (TID 23, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:43:15 WARN TaskSetManager:66 - Lost task 9.0 in stage 3.0 (TID 31, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:43:15 WARN TaskSetManager:66 - Lost task 3.0 in stage 3.0 (TID 25, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:43:15 INFO Executor:54 - Executor killed task 0.0 in stage 3.0 (TID 22), reason: Stage cancelled
2018-10-02 09:43:15 WARN TaskSetManager:66 - Lost task 0.0 in stage 3.0 (TID 22, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:43:15 INFO Executor:54 - Executor interrupted and killed task 4.0 in stage 3.0 (TID 26), reason: Stage cancelled
2018-10-02 09:43:15 WARN TaskSetManager:66 - Lost task 4.0 in stage 3.0 (TID 26, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:43:15 INFO Executor:54 - Executor killed task 7.0 in stage 3.0 (TID 29), reason: Stage cancelled
2018-10-02 09:43:15 WARN TaskSetManager:66 - Lost task 7.0 in stage 3.0 (TID 29, localhost, executor driver): TaskKilled (Stage cancelled)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 3.0 failed 1 times, most recent failure: Lost task 5.0 in stage 3.0 (TID 27, localhost, executor driver): java.util.zip.ZipException: invalid stored block lengths
at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:939)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:938)
at org.apache.spark.RangePartitioner$.sketch(Partitioner.scala:306)
at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:168)
at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:148)
at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:62)
at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:61)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.OrderedRDDFunctions.sortByKey(OrderedRDDFunctions.scala:61)
at org.apache.spark.rdd.RDD$$anonfun$sortBy$1.apply(RDD.scala:622)
at org.apache.spark.rdd.RDD$$anonfun$sortBy$1.apply(RDD.scala:623)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.sortBy(RDD.scala:620)
at io.archivesunleashed.package$CountableRDD.countItems(package.scala:74)
... 78 elided
Caused by: java.util.zip.ZipException: invalid stored block lengths
at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
<console>:33: error: not found: value links
WriteGraphML(links, "/home/nruest/Dropbox/499-issues/output/gephi/499-gephi.graphml")
^
2018-10-02 09:43:15 INFO SparkContext:54 - Invoking stop() from shutdown hook
2018-10-02 09:43:15 INFO AbstractConnector:318 - Stopped Spark@14b83891{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2018-10-02 09:43:15 INFO SparkUI:54 - Stopped Spark web UI at http://10.0.1.44:4040
2018-10-02 09:43:15 INFO MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2018-10-02 09:43:15 INFO MemoryStore:54 - MemoryStore cleared
2018-10-02 09:43:15 INFO BlockManager:54 - BlockManager stopped
2018-10-02 09:43:15 INFO BlockManagerMaster:54 - BlockManagerMaster stopped
2018-10-02 09:43:15 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2018-10-02 09:43:15 INFO SparkContext:54 - Successfully stopped SparkContext
2018-10-02 09:43:15 INFO ShutdownHookManager:54 - Shutdown hook called
2018-10-02 09:43:15 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-b7418656-a101-4eca-acfd-fe8914c5f1dd
2018-10-02 09:43:15 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-ea2246da-4062-4046-b819-70932ad9d664
2018-10-02 09:43:15 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-ea2246da-4062-4046-b819-70932ad9d664/repl-25ef64b4-c115-4020-a5a3-2e57854c08cc
@borislin can you gist up your output log? This is what I just ran on my end with all the 499-issues arcs/warcs (I use zsh, so I have to escape those brackets).
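One note on the tail of that log: the `not found: value links` error is just a knock-on effect. countItems() is what triggers the sortBy at package.scala:74, and when that job aborts on the ZipException the `val links = ...` assignment in 499.scala never completes, so the later WriteGraphML(links, ...) line has nothing to reference. For context only, here is a minimal sketch of the usual aut link-structure recipe that would define links (the exact contents of 499.scala aren't shown in this thread, so the input glob and the count filter below are placeholders lifted from the public aut documentation rather than the script that was actually run):

import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._

// Hypothetical stand-in for the link portion of 499.scala; the glob and the > 5 threshold are placeholders.
val links = RecordLoader.loadArchives("/home/nruest/Dropbox/499-issues/*.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1,
    ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""),
    ExtractDomain(f._2).replaceAll("^\\s*www\\.", ""))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems() // this is the sortBy at package.scala:74 in the log above
  .filter(r => r._2 > 5)

// When the countItems job aborts, links is never bound in the REPL,
// so this line fails with "not found: value links".
WriteGraphML(links, "/home/nruest/Dropbox/499-issues/output/gephi/499-gephi.graphml")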
ruebot
Oct 2, 2018
Member
...and if I remove the empty file and run the same job with the other 53 problematic arcs/warcs, I am not able to replicate what you've come up with (this is all from building aut on HEAD this morning, after clearing ~/.ivy2 and ~/.m2/repository.)
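As a sanity check outside of Spark, the remaining archives can also be pre-screened to see which gzip members are actually corrupt. A minimal sketch, assuming java.util.zip's GZIPInputStream is enough to walk the concatenated gzip members that ARC/WARC files use (the directory is the one from the logs; nothing here goes through aut's own record parsing):

import java.io.{File, FileInputStream}
import java.util.zip.GZIPInputStream
import scala.util.{Failure, Success, Try}

// Hypothetical pre-screen: decompress every .gz end-to-end and report the ones that throw.
val dir = new File("/home/nruest/Dropbox/499-issues")
val buf = new Array[Byte](64 * 1024)

dir.listFiles.filter(_.getName.endsWith(".gz")).sortBy(_.getName).foreach { f =>
  Try {
    val in = new GZIPInputStream(new FileInputStream(f))
    try { while (in.read(buf) != -1) () } finally { in.close() }
  } match {
    case Success(_) => println(s"OK   ${f.getName}")
    case Failure(e) => println(s"BAD  ${f.getName}: ${e.getClass.getSimpleName}: ${e.getMessage}")
  }
}

Any file flagged BAD there should reproduce the "invalid stored block lengths" / "too many length or distance symbols" failures on its own, which separates bad archives from anything aut is doing. The output of the re-run over the other 53 arcs/warcs follows: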
2018-10-02 09:47:09 WARN Utils:66 - Your hostname, wombat resolves to a loopback address: 127.0.1.1; using 10.0.1.44 instead (on interface enp0s31f6)
2018-10-02 09:47:09 WARN Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
2018-10-02 09:47:09 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://10.0.1.44:4040
Spark context available as 'sc' (master = local[10], app id = local-1538488032542).
Spark session available as 'spark'.
Loading /home/nruest/Dropbox/499-issues/spark-jobs/499.scala...
import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._
2018-10-02 09:47:14 INFO MemoryStore:54 - Block broadcast_0 stored as values in memory (estimated size 275.4 KB, free 15.8 GB)
2018-10-02 09:47:14 INFO MemoryStore:54 - Block broadcast_0_piece0 stored as bytes in memory (estimated size 23.0 KB, free 15.8 GB)
2018-10-02 09:47:14 INFO BlockManagerInfo:54 - Added broadcast_0_piece0 in memory on 10.0.1.44:40719 (size: 23.0 KB, free: 15.8 GB)
2018-10-02 09:47:14 INFO SparkContext:54 - Created broadcast 0 from newAPIHadoopFile at package.scala:51
2018-10-02 09:47:14 INFO FileInputFormat:283 - Total input paths to process : 53
2018-10-02 09:47:14 INFO SparkContext:54 - Starting job: sortBy at package.scala:74
2018-10-02 09:47:14 INFO DAGScheduler:54 - Registering RDD 5 (map at package.scala:72)
2018-10-02 09:47:14 INFO DAGScheduler:54 - Got job 0 (sortBy at package.scala:74) with 53 output partitions
2018-10-02 09:47:14 INFO DAGScheduler:54 - Final stage: ResultStage 1 (sortBy at package.scala:74)
2018-10-02 09:47:14 INFO DAGScheduler:54 - Parents of final stage: List(ShuffleMapStage 0)
2018-10-02 09:47:14 INFO DAGScheduler:54 - Missing parents: List(ShuffleMapStage 0)
2018-10-02 09:47:14 INFO DAGScheduler:54 - Submitting ShuffleMapStage 0 (MapPartitionsRDD[5] at map at package.scala:72), which has no missing parents
2018-10-02 09:47:14 INFO MemoryStore:54 - Block broadcast_1 stored as values in memory (estimated size 4.6 KB, free 15.8 GB)
2018-10-02 09:47:14 INFO MemoryStore:54 - Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.6 KB, free 15.8 GB)
2018-10-02 09:47:14 INFO BlockManagerInfo:54 - Added broadcast_1_piece0 in memory on 10.0.1.44:40719 (size: 2.6 KB, free: 15.8 GB)
2018-10-02 09:47:14 INFO SparkContext:54 - Created broadcast 1 from broadcast at DAGScheduler.scala:1039
2018-10-02 09:47:14 INFO DAGScheduler:54 - Submitting 53 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[5] at map at package.scala:72) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
2018-10-02 09:47:14 INFO TaskSchedulerImpl:54 - Adding task set 0.0 with 53 tasks
2018-10-02 09:47:14 INFO TaskSetManager:54 - Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 7978 bytes)
2018-10-02 09:47:14 INFO TaskSetManager:54 - Starting task 1.0 in stage 0.0 (TID 1, localhost, executor driver, partition 1, PROCESS_LOCAL, 8009 bytes)
2018-10-02 09:47:14 INFO TaskSetManager:54 - Starting task 2.0 in stage 0.0 (TID 2, localhost, executor driver, partition 2, PROCESS_LOCAL, 7976 bytes)
2018-10-02 09:47:14 INFO TaskSetManager:54 - Starting task 3.0 in stage 0.0 (TID 3, localhost, executor driver, partition 3, PROCESS_LOCAL, 7980 bytes)
2018-10-02 09:47:14 INFO TaskSetManager:54 - Starting task 4.0 in stage 0.0 (TID 4, localhost, executor driver, partition 4, PROCESS_LOCAL, 7996 bytes)
2018-10-02 09:47:14 INFO TaskSetManager:54 - Starting task 5.0 in stage 0.0 (TID 5, localhost, executor driver, partition 5, PROCESS_LOCAL, 7996 bytes)
2018-10-02 09:47:14 INFO TaskSetManager:54 - Starting task 6.0 in stage 0.0 (TID 6, localhost, executor driver, partition 6, PROCESS_LOCAL, 8004 bytes)
2018-10-02 09:47:14 INFO TaskSetManager:54 - Starting task 7.0 in stage 0.0 (TID 7, localhost, executor driver, partition 7, PROCESS_LOCAL, 7978 bytes)
2018-10-02 09:47:14 INFO TaskSetManager:54 - Starting task 8.0 in stage 0.0 (TID 8, localhost, executor driver, partition 8, PROCESS_LOCAL, 8009 bytes)
2018-10-02 09:47:14 INFO TaskSetManager:54 - Starting task 9.0 in stage 0.0 (TID 9, localhost, executor driver, partition 9, PROCESS_LOCAL, 7980 bytes)
2018-10-02 09:47:14 INFO Executor:54 - Running task 0.0 in stage 0.0 (TID 0)
2018-10-02 09:47:14 INFO Executor:54 - Running task 1.0 in stage 0.0 (TID 1)
2018-10-02 09:47:14 INFO Executor:54 - Running task 2.0 in stage 0.0 (TID 2)
2018-10-02 09:47:14 INFO Executor:54 - Running task 4.0 in stage 0.0 (TID 4)
2018-10-02 09:47:14 INFO Executor:54 - Running task 8.0 in stage 0.0 (TID 8)
2018-10-02 09:47:14 INFO Executor:54 - Running task 9.0 in stage 0.0 (TID 9)
2018-10-02 09:47:14 INFO Executor:54 - Running task 6.0 in stage 0.0 (TID 6)
2018-10-02 09:47:14 INFO Executor:54 - Running task 7.0 in stage 0.0 (TID 7)
2018-10-02 09:47:14 INFO Executor:54 - Running task 5.0 in stage 0.0 (TID 5)
2018-10-02 09:47:14 INFO Executor:54 - Running task 3.0 in stage 0.0 (TID 3)
2018-10-02 09:47:14 INFO Executor:54 - Fetching spark://10.0.1.44:33593/jars/aut-0.16.1-SNAPSHOT-fatjar.jar with timestamp 1538488032531
2018-10-02 09:47:14 INFO TransportClientFactory:267 - Successfully created connection to /10.0.1.44:33593 after 15 ms (0 ms spent in bootstraps)
2018-10-02 09:47:14 INFO Utils:54 - Fetching spark://10.0.1.44:33593/jars/aut-0.16.1-SNAPSHOT-fatjar.jar to /tmp/spark-0782f27c-4461-4384-9191-997eb21b1d7e/userFiles-3f5716ce-710f-4264-aff7-e0b037b9cd99/fetchFileTemp5723790853042569991.tmp
2018-10-02 09:47:15 INFO Executor:54 - Adding file:/tmp/spark-0782f27c-4461-4384-9191-997eb21b1d7e/userFiles-3f5716ce-710f-4264-aff7-e0b037b9cd99/aut-0.16.1-SNAPSHOT-fatjar.jar to class loader
2018-10-02 09:47:15 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-QUARTERLY-JOB311837-20170703002531023-00015.warc.gz:0+378164512
2018-10-02 09:47:15 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-WEEKLY-12073-20130503220356341-00010-wbgrp-crawl054.us.archive.org-6441.warc.gz:0+1437967814
2018-10-02 09:47:15 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTLIB-MTGOV-20080505225931-00186-crawling09.us.archive.org.arc.gz:0+284417845
2018-10-02 09:47:15 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-QUARTERLY-JOB456196-20171001172127643-00001.warc.gz:0+209337866
2018-10-02 09:47:15 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTLIB-MTGOV-webteam-www.20071208015549.arc.gz:0+134559197
2018-10-02 09:47:15 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-WEEKLY-12073-20130506091844567-00118-wbgrp-crawl054.us.archive.org-6441.warc.gz:0+1289997762
2018-10-02 09:47:15 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTHLY-JOB311846-20170702213909334-00019.warc.gz:0+642314864
2018-10-02 09:47:15 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTHLY-JOB318434-20170802060019808-00010.warc.gz:0+745054600
2018-10-02 09:47:15 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTLIB-MTGOV-20080810110222-00723-crawling09.us.archive.org.arc.gz:0+191119896
2018-10-02 09:47:15 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTHLY-CDDZXY-20111001180917-00033-crawling200.us.archive.org-6680.warc.gz:0+118039703
2018-10-02 09:47:23 ERROR Executor:91 - Exception in task 5.0 in stage 0.0 (TID 5)
java.util.zip.ZipException: invalid stored block lengths
at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:47:23 INFO TaskSetManager:54 - Starting task 10.0 in stage 0.0 (TID 10, localhost, executor driver, partition 10, PROCESS_LOCAL, 8007 bytes)
2018-10-02 09:47:23 WARN TaskSetManager:66 - Lost task 5.0 in stage 0.0 (TID 5, localhost, executor driver): java.util.zip.ZipException: invalid stored block lengths
at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:47:23 INFO Executor:54 - Running task 10.0 in stage 0.0 (TID 10)
2018-10-02 09:47:23 ERROR TaskSetManager:70 - Task 5 in stage 0.0 failed 1 times; aborting job
2018-10-02 09:47:23 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-BIMONTHLY-RPIOGJ-20120131001727-00013-crawling114.us.archive.org-6681.warc.gz:0+242453280
2018-10-02 09:47:23 INFO TaskSchedulerImpl:54 - Cancelling stage 0
2018-10-02 09:47:23 INFO Executor:54 - Executor is trying to kill task 0.0 in stage 0.0 (TID 0), reason: Stage cancelled
2018-10-02 09:47:23 INFO Executor:54 - Executor is trying to kill task 9.0 in stage 0.0 (TID 9), reason: Stage cancelled
2018-10-02 09:47:23 INFO TaskSchedulerImpl:54 - Stage 0 was cancelled
2018-10-02 09:47:23 INFO Executor:54 - Executor is trying to kill task 1.0 in stage 0.0 (TID 1), reason: Stage cancelled
2018-10-02 09:47:23 INFO Executor:54 - Executor is trying to kill task 2.0 in stage 0.0 (TID 2), reason: Stage cancelled
2018-10-02 09:47:23 INFO Executor:54 - Executor is trying to kill task 6.0 in stage 0.0 (TID 6), reason: Stage cancelled
2018-10-02 09:47:23 INFO Executor:54 - Executor is trying to kill task 3.0 in stage 0.0 (TID 3), reason: Stage cancelled
2018-10-02 09:47:23 INFO Executor:54 - Executor is trying to kill task 10.0 in stage 0.0 (TID 10), reason: Stage cancelled
2018-10-02 09:47:23 INFO Executor:54 - Executor is trying to kill task 7.0 in stage 0.0 (TID 7), reason: Stage cancelled
2018-10-02 09:47:23 INFO Executor:54 - Executor is trying to kill task 4.0 in stage 0.0 (TID 4), reason: Stage cancelled
2018-10-02 09:47:23 INFO Executor:54 - Executor is trying to kill task 8.0 in stage 0.0 (TID 8), reason: Stage cancelled
2018-10-02 09:47:23 INFO DAGScheduler:54 - ShuffleMapStage 0 (map at package.scala:72) failed in 8.493 s due to Job aborted due to stage failure: Task 5 in stage 0.0 failed 1 times, most recent failure: Lost task 5.0 in stage 0.0 (TID 5, localhost, executor driver): java.util.zip.ZipException: invalid stored block lengths
at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
2018-10-02 09:47:23 INFO Executor:54 - Executor killed task 9.0 in stage 0.0 (TID 9), reason: Stage cancelled
2018-10-02 09:47:23 INFO Executor:54 - Executor killed task 0.0 in stage 0.0 (TID 0), reason: Stage cancelled
2018-10-02 09:47:23 WARN TaskSetManager:66 - Lost task 9.0 in stage 0.0 (TID 9, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:47:23 INFO DAGScheduler:54 - Job 0 failed: sortBy at package.scala:74, took 8.618676 s
2018-10-02 09:47:23 WARN TaskSetManager:66 - Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:47:23 INFO Executor:54 - Executor killed task 1.0 in stage 0.0 (TID 1), reason: Stage cancelled
2018-10-02 09:47:23 WARN TaskSetManager:66 - Lost task 1.0 in stage 0.0 (TID 1, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:47:23 INFO Executor:54 - Executor killed task 4.0 in stage 0.0 (TID 4), reason: Stage cancelled
2018-10-02 09:47:23 WARN TaskSetManager:66 - Lost task 4.0 in stage 0.0 (TID 4, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:47:23 INFO Executor:54 - Executor killed task 10.0 in stage 0.0 (TID 10), reason: Stage cancelled
2018-10-02 09:47:23 WARN TaskSetManager:66 - Lost task 10.0 in stage 0.0 (TID 10, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:47:23 INFO Executor:54 - Executor killed task 2.0 in stage 0.0 (TID 2), reason: Stage cancelled
2018-10-02 09:47:23 WARN TaskSetManager:66 - Lost task 2.0 in stage 0.0 (TID 2, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:47:23 INFO Executor:54 - Executor killed task 3.0 in stage 0.0 (TID 3), reason: Stage cancelled
2018-10-02 09:47:23 WARN TaskSetManager:66 - Lost task 3.0 in stage 0.0 (TID 3, localhost, executor driver): TaskKilled (Stage cancelled)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 0.0 failed 1 times, most recent failure: Lost task 5.0 in stage 0.0 (TID 5, localhost, executor driver): java.util.zip.ZipException: invalid stored block lengths
at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:939)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:938)
at org.apache.spark.RangePartitioner$.sketch(Partitioner.scala:306)
at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:168)
at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:148)
at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:62)
at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:61)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.OrderedRDDFunctions.sortByKey(OrderedRDDFunctions.scala:61)
at org.apache.spark.rdd.RDD$$anonfun$sortBy$1.apply(RDD.scala:622)
at org.apache.spark.rdd.RDD$$anonfun$sortBy$1.apply(RDD.scala:623)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.sortBy(RDD.scala:620)
at io.archivesunleashed.package$CountableRDD.countItems(package.scala:74)
... 78 elided
Caused by: java.util.zip.ZipException: invalid stored block lengths
at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
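(Side note, in case it helps with triage: the inflater errors in these traces — "invalid stored block lengths", "invalid code lengths set" — are thrown by the gzip layer while a record body is being inflated, so it can be worth confirming which .gz files in the collection are actually corrupt before digging further into the Spark side. Below is a minimal, standalone sketch that checks this with plain java.util.zip only; it is not AUT code, and the object name, buffer size, and default directory are placeholders — point it at the directory holding the arc.gz/warc.gz files in question.)

// CheckGzipMembers.scala -- standalone sanity check, not part of AUT.
// Inflates every *.gz file under a directory with java.util.zip.GZIPInputStream
// (which handles multi-member gzip) and reports the files that fail to inflate.
import java.io.{BufferedInputStream, FileInputStream}
import java.nio.file.{Files, Paths}
import java.util.zip.{GZIPInputStream, ZipException}
import scala.collection.JavaConverters._

object CheckGzipMembers {
  def main(args: Array[String]): Unit = {
    // Directory to scan; defaults to the current directory if none is given.
    val dir = Paths.get(args.headOption.getOrElse("."))
    val gzFiles = Files.list(dir).iterator().asScala
      .filter(_.toString.endsWith(".gz")).toSeq.sortBy(_.toString)

    val buf = new Array[Byte](1 << 16)
    for (path <- gzFiles) {
      val in = new GZIPInputStream(new BufferedInputStream(new FileInputStream(path.toFile)))
      var inflated = 0L // uncompressed bytes read before success or failure
      try {
        var n = in.read(buf)
        while (n != -1) { inflated += n; n = in.read(buf) }
        println(s"OK    $path ($inflated bytes inflated)")
      } catch {
        case e: ZipException =>
          println(s"BAD   $path after $inflated bytes: ${e.getMessage}")
        case e: java.io.EOFException =>
          println(s"TRUNC $path after $inflated bytes: ${e.getMessage}")
      } finally {
        in.close()
      }
    }
  }
}

Files reported as BAD or TRUNC are the ones to pull out of the input glob (or re-fetch from the crawl) while the exception handling on the toolkit side is being sorted out.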
2018-10-02 09:47:23 INFO MemoryStore:54 - Block broadcast_2 stored as values in memory (estimated size 275.4 KB, free 15.8 GB)
2018-10-02 09:47:23 INFO MemoryStore:54 - Block broadcast_2_piece0 stored as bytes in memory (estimated size 23.0 KB, free 15.8 GB)
2018-10-02 09:47:23 INFO BlockManagerInfo:54 - Added broadcast_2_piece0 in memory on 10.0.1.44:40719 (size: 23.0 KB, free: 15.8 GB)
2018-10-02 09:47:23 INFO SparkContext:54 - Created broadcast 2 from newAPIHadoopFile at package.scala:51
2018-10-02 09:47:23 INFO deprecation:1173 - mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
2018-10-02 09:47:23 INFO FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-10-02 09:47:23 INFO FileInputFormat:283 - Total input paths to process : 53
2018-10-02 09:47:23 INFO SparkContext:54 - Starting job: runJob at SparkHadoopWriter.scala:78
2018-10-02 09:47:23 INFO DAGScheduler:54 - Got job 1 (runJob at SparkHadoopWriter.scala:78) with 53 output partitions
2018-10-02 09:47:23 INFO DAGScheduler:54 - Final stage: ResultStage 2 (runJob at SparkHadoopWriter.scala:78)
2018-10-02 09:47:23 INFO DAGScheduler:54 - Parents of final stage: List()
2018-10-02 09:47:23 INFO DAGScheduler:54 - Missing parents: List()
2018-10-02 09:47:23 INFO DAGScheduler:54 - Submitting ResultStage 2 (MapPartitionsRDD[15] at saveAsTextFile at <console>:34), which has no missing parents
2018-10-02 09:47:23 INFO MemoryStore:54 - Block broadcast_3 stored as values in memory (estimated size 72.3 KB, free 15.8 GB)
2018-10-02 09:47:23 INFO MemoryStore:54 - Block broadcast_3_piece0 stored as bytes in memory (estimated size 25.9 KB, free 15.8 GB)
2018-10-02 09:47:23 INFO BlockManagerInfo:54 - Added broadcast_3_piece0 in memory on 10.0.1.44:40719 (size: 25.9 KB, free: 15.8 GB)
2018-10-02 09:47:23 INFO SparkContext:54 - Created broadcast 3 from broadcast at DAGScheduler.scala:1039
2018-10-02 09:47:23 INFO DAGScheduler:54 - Submitting 53 missing tasks from ResultStage 2 (MapPartitionsRDD[15] at saveAsTextFile at <console>:34) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
2018-10-02 09:47:23 INFO TaskSchedulerImpl:54 - Adding task set 2.0 with 53 tasks
2018-10-02 09:47:23 INFO TaskSetManager:54 - Starting task 0.0 in stage 2.0 (TID 11, localhost, executor driver, partition 0, PROCESS_LOCAL, 7989 bytes)
2018-10-02 09:47:23 INFO TaskSetManager:54 - Starting task 1.0 in stage 2.0 (TID 12, localhost, executor driver, partition 1, PROCESS_LOCAL, 8020 bytes)
2018-10-02 09:47:23 INFO TaskSetManager:54 - Starting task 2.0 in stage 2.0 (TID 13, localhost, executor driver, partition 2, PROCESS_LOCAL, 7987 bytes)
2018-10-02 09:47:23 INFO TaskSetManager:54 - Starting task 3.0 in stage 2.0 (TID 14, localhost, executor driver, partition 3, PROCESS_LOCAL, 7991 bytes)
2018-10-02 09:47:23 INFO TaskSetManager:54 - Starting task 4.0 in stage 2.0 (TID 15, localhost, executor driver, partition 4, PROCESS_LOCAL, 8007 bytes)
2018-10-02 09:47:23 INFO TaskSetManager:54 - Starting task 5.0 in stage 2.0 (TID 16, localhost, executor driver, partition 5, PROCESS_LOCAL, 8007 bytes)
2018-10-02 09:47:23 INFO TaskSetManager:54 - Starting task 6.0 in stage 2.0 (TID 17, localhost, executor driver, partition 6, PROCESS_LOCAL, 8015 bytes)
2018-10-02 09:47:23 INFO Executor:54 - Running task 2.0 in stage 2.0 (TID 13)
2018-10-02 09:47:23 INFO Executor:54 - Running task 3.0 in stage 2.0 (TID 14)
2018-10-02 09:47:23 INFO Executor:54 - Running task 4.0 in stage 2.0 (TID 15)
2018-10-02 09:47:23 INFO Executor:54 - Running task 0.0 in stage 2.0 (TID 11)
2018-10-02 09:47:23 INFO Executor:54 - Running task 1.0 in stage 2.0 (TID 12)
2018-10-02 09:47:23 INFO Executor:54 - Running task 6.0 in stage 2.0 (TID 17)
2018-10-02 09:47:23 INFO Executor:54 - Running task 5.0 in stage 2.0 (TID 16)
2018-10-02 09:47:23 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTHLY-JOB318434-20170802060019808-00010.warc.gz:0+745054600
2018-10-02 09:47:23 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTHLY-CDDZXY-20111001180917-00033-crawling200.us.archive.org-6680.warc.gz:0+118039703
2018-10-02 09:47:23 INFO FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-10-02 09:47:23 INFO FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-10-02 09:47:23 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTLIB-MTGOV-webteam-www.20071208015549.arc.gz:0+134559197
2018-10-02 09:47:23 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-WEEKLY-12073-20130503220356341-00010-wbgrp-crawl054.us.archive.org-6441.warc.gz:0+1437967814
2018-10-02 09:47:23 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTLIB-MTGOV-20080505225931-00186-crawling09.us.archive.org.arc.gz:0+284417845
2018-10-02 09:47:23 INFO FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-10-02 09:47:23 INFO FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-10-02 09:47:23 INFO FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-10-02 09:47:23 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTLIB-MTGOV-20080810110222-00723-crawling09.us.archive.org.arc.gz:0+191119896
2018-10-02 09:47:23 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-QUARTERLY-JOB311837-20170703002531023-00015.warc.gz:0+378164512
2018-10-02 09:47:23 INFO FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-10-02 09:47:23 INFO FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-10-02 09:47:25 INFO Executor:54 - Executor killed task 8.0 in stage 0.0 (TID 8), reason: Stage cancelled
2018-10-02 09:47:25 INFO TaskSetManager:54 - Starting task 7.0 in stage 2.0 (TID 18, localhost, executor driver, partition 7, PROCESS_LOCAL, 7989 bytes)
2018-10-02 09:47:25 WARN TaskSetManager:66 - Lost task 8.0 in stage 0.0 (TID 8, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:47:25 INFO Executor:54 - Running task 7.0 in stage 2.0 (TID 18)
2018-10-02 09:47:25 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTHLY-JOB311846-20170702213909334-00019.warc.gz:0+642314864
2018-10-02 09:47:25 INFO FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-10-02 09:47:26 INFO Executor:54 - Executor killed task 6.0 in stage 0.0 (TID 6), reason: Stage cancelled
2018-10-02 09:47:26 INFO TaskSetManager:54 - Starting task 8.0 in stage 2.0 (TID 19, localhost, executor driver, partition 8, PROCESS_LOCAL, 8020 bytes)
2018-10-02 09:47:26 WARN TaskSetManager:66 - Lost task 6.0 in stage 0.0 (TID 6, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:47:26 INFO Executor:54 - Running task 8.0 in stage 2.0 (TID 19)
2018-10-02 09:47:26 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-WEEKLY-12073-20130506091844567-00118-wbgrp-crawl054.us.archive.org-6441.warc.gz:0+1289997762
2018-10-02 09:47:26 INFO FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-10-02 09:47:28 INFO Executor:54 - Executor killed task 7.0 in stage 0.0 (TID 7), reason: Stage cancelled
2018-10-02 09:47:28 INFO TaskSetManager:54 - Starting task 9.0 in stage 2.0 (TID 20, localhost, executor driver, partition 9, PROCESS_LOCAL, 7991 bytes)
2018-10-02 09:47:28 WARN TaskSetManager:66 - Lost task 7.0 in stage 0.0 (TID 7, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:47:28 INFO Executor:54 - Running task 9.0 in stage 2.0 (TID 20)
2018-10-02 09:47:28 INFO TaskSchedulerImpl:54 - Removed TaskSet 0.0, whose tasks have all completed, from pool
2018-10-02 09:47:28 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-QUARTERLY-JOB456196-20171001172127643-00001.warc.gz:0+209337866
2018-10-02 09:47:28 INFO FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-10-02 09:47:29 INFO ContextCleaner:54 - Cleaned accumulator 17
2018-10-02 09:47:29 INFO ContextCleaner:54 - Cleaned accumulator 10
2018-10-02 09:47:29 INFO ContextCleaner:54 - Cleaned accumulator 18
2018-10-02 09:47:29 INFO ContextCleaner:54 - Cleaned accumulator 14
2018-10-02 09:47:29 INFO ContextCleaner:54 - Cleaned accumulator 22
2018-10-02 09:47:29 INFO BlockManagerInfo:54 - Removed broadcast_1_piece0 on 10.0.1.44:40719 in memory (size: 2.6 KB, free: 15.8 GB)
2018-10-02 09:47:29 INFO ContextCleaner:54 - Cleaned accumulator 6
2018-10-02 09:47:29 INFO ContextCleaner:54 - Cleaned accumulator 0
2018-10-02 09:47:29 INFO ContextCleaner:54 - Cleaned accumulator 19
2018-10-02 09:47:29 INFO ContextCleaner:54 - Cleaned accumulator 13
2018-10-02 09:47:29 INFO ContextCleaner:54 - Cleaned accumulator 16
2018-10-02 09:47:29 INFO ContextCleaner:54 - Cleaned accumulator 11
2018-10-02 09:47:29 INFO ContextCleaner:54 - Cleaned accumulator 21
2018-10-02 09:47:29 INFO ContextCleaner:54 - Cleaned shuffle 0
2018-10-02 09:47:29 INFO ContextCleaner:54 - Cleaned accumulator 24
2018-10-02 09:47:29 INFO ContextCleaner:54 - Cleaned accumulator 15
2018-10-02 09:47:29 INFO ContextCleaner:54 - Cleaned accumulator 8
2018-10-02 09:47:29 INFO ContextCleaner:54 - Cleaned accumulator 20
2018-10-02 09:47:29 INFO ContextCleaner:54 - Cleaned accumulator 7
2018-10-02 09:47:29 INFO BlockManagerInfo:54 - Removed broadcast_0_piece0 on 10.0.1.44:40719 in memory (size: 23.0 KB, free: 15.8 GB)
2018-10-02 09:47:29 INFO ContextCleaner:54 - Cleaned accumulator 12
2018-10-02 09:47:29 INFO ContextCleaner:54 - Cleaned accumulator 9
2018-10-02 09:47:29 INFO ContextCleaner:54 - Cleaned accumulator 23
2018-10-02 09:47:29 INFO ContextCleaner:54 - Cleaned accumulator 3
2018-10-02 09:47:29 INFO ContextCleaner:54 - Cleaned accumulator 4
2018-10-02 09:47:29 INFO ContextCleaner:54 - Cleaned accumulator 2
2018-10-02 09:47:29 INFO ContextCleaner:54 - Cleaned accumulator 5
2018-10-02 09:47:29 INFO ContextCleaner:54 - Cleaned accumulator 1
2018-10-02 09:47:30 ERROR Utils:91 - Aborting task
java.util.zip.ZipException: invalid code lengths set
at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:47:30 ERROR SparkHadoopWriter:70 - Task attempt_20181002094723_0015_m_000004_0 aborted.
2018-10-02 09:47:30 ERROR Executor:91 - Exception in task 4.0 in stage 2.0 (TID 15)
org.apache.spark.SparkException: Task failed while writing rows
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:151)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.zip.ZipException: invalid code lengths set
at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
... 8 more
2018-10-02 09:47:30 INFO TaskSetManager:54 - Starting task 10.0 in stage 2.0 (TID 21, localhost, executor driver, partition 10, PROCESS_LOCAL, 8018 bytes)
2018-10-02 09:47:30 WARN TaskSetManager:66 - Lost task 4.0 in stage 2.0 (TID 15, localhost, executor driver): org.apache.spark.SparkException: Task failed while writing rows
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:151)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.zip.ZipException: invalid code lengths set
at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
... 8 more
2018-10-02 09:47:30 ERROR TaskSetManager:70 - Task 4 in stage 2.0 failed 1 times; aborting job
2018-10-02 09:47:30 INFO Executor:54 - Running task 10.0 in stage 2.0 (TID 21)
2018-10-02 09:47:30 INFO TaskSchedulerImpl:54 - Cancelling stage 2
2018-10-02 09:47:30 INFO TaskSchedulerImpl:54 - Stage 2 was cancelled
2018-10-02 09:47:30 INFO Executor:54 - Executor is trying to kill task 1.0 in stage 2.0 (TID 12), reason: Stage cancelled
2018-10-02 09:47:30 INFO DAGScheduler:54 - ResultStage 2 (runJob at SparkHadoopWriter.scala:78) failed in 6.467 s due to Job aborted due to stage failure: Task 4 in stage 2.0 failed 1 times, most recent failure: Lost task 4.0 in stage 2.0 (TID 15, localhost, executor driver): org.apache.spark.SparkException: Task failed while writing rows
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:151)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.zip.ZipException: invalid code lengths set
at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
... 8 more
Driver stacktrace:
2018-10-02 09:47:30 INFO Executor:54 - Executor is trying to kill task 8.0 in stage 2.0 (TID 19), reason: Stage cancelled
2018-10-02 09:47:30 INFO Executor:54 - Executor is trying to kill task 5.0 in stage 2.0 (TID 16), reason: Stage cancelled
2018-10-02 09:47:30 INFO Executor:54 - Executor is trying to kill task 2.0 in stage 2.0 (TID 13), reason: Stage cancelled
2018-10-02 09:47:30 INFO DAGScheduler:54 - Job 1 failed: runJob at SparkHadoopWriter.scala:78, took 6.470853 s
2018-10-02 09:47:30 INFO Executor:54 - Executor is trying to kill task 9.0 in stage 2.0 (TID 20), reason: Stage cancelled
2018-10-02 09:47:30 ERROR SparkHadoopWriter:91 - Aborting job job_20181002094723_0015.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 2.0 failed 1 times, most recent failure: Lost task 4.0 in stage 2.0 (TID 15, localhost, executor driver): org.apache.spark.SparkException: Task failed while writing rows
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:151)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.zip.ZipException: invalid code lengths set
at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
... 8 more
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:78)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1096)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1094)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:1067)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1032)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:958)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:957)
at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply$mcV$sp(RDD.scala:1493)
at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1472)
at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1472)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1472)
at $line21.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:34)
at $line21.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:39)
at $line21.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:41)
at $line21.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:43)
at $line21.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:45)
at $line21.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:47)
at $line21.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:49)
at $line21.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:51)
at $line21.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:53)
at $line21.$read$$iw$$iw$$iw$$iw$$iw.<init>(<console>:55)
at $line21.$read$$iw$$iw$$iw$$iw.<init>(<console>:57)
at $line21.$read$$iw$$iw$$iw.<init>(<console>:59)
at $line21.$read$$iw$$iw.<init>(<console>:61)
at $line21.$read$$iw.<init>(<console>:63)
at $line21.$read.<init>(<console>:65)
at $line21.$read$.<init>(<console>:69)
at $line21.$read$.<clinit>(<console>)
at $line21.$eval$.$print$lzycompute(<console>:7)
at $line21.$eval$.$print(<console>:6)
at $line21.$eval.$print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786)
at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047)
at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638)
at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637)
at scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
at scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:569)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565)
at scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:807)
at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:681)
at scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:395)
at scala.tools.nsc.interpreter.ILoop.loop(ILoop.scala:415)
at scala.tools.nsc.interpreter.ILoop$$anonfun$interpretAllFrom$1$$anonfun$apply$5$$anonfun$apply$6.apply(ILoop.scala:427)
at scala.tools.nsc.interpreter.ILoop$$anonfun$interpretAllFrom$1$$anonfun$apply$5$$anonfun$apply$6.apply(ILoop.scala:423)
at scala.reflect.io.Streamable$Chars$class.applyReader(Streamable.scala:111)
at scala.reflect.io.File.applyReader(File.scala:50)
at scala.tools.nsc.interpreter.ILoop$$anonfun$interpretAllFrom$1$$anonfun$apply$5.apply(ILoop.scala:423)
at scala.tools.nsc.interpreter.ILoop$$anonfun$interpretAllFrom$1$$anonfun$apply$5.apply(ILoop.scala:423)
at scala.tools.nsc.interpreter.ILoop.savingReplayStack(ILoop.scala:91)
at scala.tools.nsc.interpreter.ILoop$$anonfun$interpretAllFrom$1.apply(ILoop.scala:422)
at scala.tools.nsc.interpreter.ILoop$$anonfun$interpretAllFrom$1.apply(ILoop.scala:422)
at scala.tools.nsc.interpreter.ILoop.savingReader(ILoop.scala:96)
at scala.tools.nsc.interpreter.ILoop.interpretAllFrom(ILoop.scala:421)
at scala.tools.nsc.interpreter.ILoop$$anonfun$run$3$1.apply(ILoop.scala:577)
at scala.tools.nsc.interpreter.ILoop$$anonfun$run$3$1.apply(ILoop.scala:576)
at scala.tools.nsc.interpreter.ILoop.withFile(ILoop.scala:570)
at scala.tools.nsc.interpreter.ILoop.run$3(ILoop.scala:576)
at scala.tools.nsc.interpreter.ILoop.loadCommand(ILoop.scala:583)
at scala.tools.nsc.interpreter.ILoop$$anonfun$standardCommands$8.apply(ILoop.scala:207)
at scala.tools.nsc.interpreter.ILoop$$anonfun$standardCommands$8.apply(ILoop.scala:207)
at scala.tools.nsc.interpreter.LoopCommands$LineCmd.apply(LoopCommands.scala:62)
at scala.tools.nsc.interpreter.ILoop.colonCommand(ILoop.scala:688)
at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:679)
at scala.tools.nsc.interpreter.ILoop.loadFiles(ILoop.scala:835)
at org.apache.spark.repl.SparkILoop.loadFiles(SparkILoop.scala:111)
at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:920)
at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
at scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:97)
at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:909)
at org.apache.spark.repl.Main$.doMain(Main.scala:76)
at org.apache.spark.repl.Main$.main(Main.scala:56)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.spark.SparkException: Task failed while writing rows
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:151)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.zip.ZipException: invalid code lengths set
at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
... 8 more
2018-10-02 09:47:30 INFO Executor:54 - Executor is trying to kill task 6.0 in stage 2.0 (TID 17), reason: Stage cancelled
2018-10-02 09:47:30 INFO Executor:54 - Executor is trying to kill task 10.0 in stage 2.0 (TID 21), reason: Stage cancelled
2018-10-02 09:47:30 INFO Executor:54 - Executor is trying to kill task 7.0 in stage 2.0 (TID 18), reason: Stage cancelled
2018-10-02 09:47:30 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-BIMONTHLY-RPIOGJ-20120131001727-00013-crawling114.us.archive.org-6681.warc.gz:0+242453280
2018-10-02 09:47:30 INFO Executor:54 - Executor is trying to kill task 3.0 in stage 2.0 (TID 14), reason: Stage cancelled
2018-10-02 09:47:30 INFO Executor:54 - Executor is trying to kill task 0.0 in stage 2.0 (TID 11), reason: Stage cancelled
2018-10-02 09:47:30 INFO FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-10-02 09:47:30 WARN FileUtil:187 - Failed to delete file or dir [/home/nruest/Dropbox/499-issues/output/all-text/output/_temporary/0/_temporary]: it still exists.
2018-10-02 09:47:30 ERROR Utils:91 - Aborting task
org.apache.spark.TaskKilledException
at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:151)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:47:30 WARN FileOutputCommitter:569 - Could not delete file:/home/nruest/Dropbox/499-issues/output/all-text/output/_temporary/0/_temporary/attempt_20181002094723_0015_m_000007_0
2018-10-02 09:47:30 ERROR SparkHadoopWriter:70 - Task attempt_20181002094723_0015_m_000007_0 aborted.
2018-10-02 09:47:30 INFO Executor:54 - Executor interrupted and killed task 7.0 in stage 2.0 (TID 18), reason: Stage cancelled
2018-10-02 09:47:30 ERROR Utils:91 - Aborting task
org.apache.spark.TaskKilledException
at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:151)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:47:30 WARN TaskSetManager:66 - Lost task 7.0 in stage 2.0 (TID 18, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:47:30 WARN FileOutputCommitter:569 - Could not delete file:/home/nruest/Dropbox/499-issues/output/all-text/output/_temporary/0/_temporary/attempt_20181002094723_0015_m_000000_0
2018-10-02 09:47:30 ERROR SparkHadoopWriter:70 - Task attempt_20181002094723_0015_m_000000_0 aborted.
2018-10-02 09:47:30 INFO Executor:54 - Executor interrupted and killed task 0.0 in stage 2.0 (TID 11), reason: Stage cancelled
2018-10-02 09:47:30 ERROR Utils:91 - Aborting task
org.apache.spark.TaskKilledException
at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:151)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:47:30 WARN FileOutputCommitter:569 - Could not delete file:/home/nruest/Dropbox/499-issues/output/all-text/output/_temporary/0/_temporary/attempt_20181002094723_0015_m_000002_0
2018-10-02 09:47:30 ERROR SparkHadoopWriter:70 - Task attempt_20181002094723_0015_m_000002_0 aborted.
2018-10-02 09:47:30 INFO Executor:54 - Executor interrupted and killed task 2.0 in stage 2.0 (TID 13), reason: Stage cancelled
2018-10-02 09:47:30 WARN TaskSetManager:66 - Lost task 0.0 in stage 2.0 (TID 11, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:47:30 ERROR Utils:91 - Aborting task
org.apache.spark.TaskKilledException
at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:151)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:47:30 WARN TaskSetManager:66 - Lost task 2.0 in stage 2.0 (TID 13, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:47:30 ERROR SparkHadoopWriter:70 - Task attempt_20181002094723_0015_m_000010_0 aborted.
2018-10-02 09:47:30 INFO Executor:54 - Executor interrupted and killed task 10.0 in stage 2.0 (TID 21), reason: Stage cancelled
2018-10-02 09:47:30 WARN TaskSetManager:66 - Lost task 10.0 in stage 2.0 (TID 21, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:47:30 ERROR Utils:91 - Aborting task
org.apache.spark.TaskKilledException
at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:151)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:47:30 WARN FileOutputCommitter:569 - Could not delete file:/home/nruest/Dropbox/499-issues/output/all-text/output/_temporary/0/_temporary/attempt_20181002094723_0015_m_000009_0
2018-10-02 09:47:30 ERROR SparkHadoopWriter:70 - Task attempt_20181002094723_0015_m_000009_0 aborted.
2018-10-02 09:47:30 INFO Executor:54 - Executor interrupted and killed task 9.0 in stage 2.0 (TID 20), reason: Stage cancelled
2018-10-02 09:47:30 WARN TaskSetManager:66 - Lost task 9.0 in stage 2.0 (TID 20, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:47:30 ERROR Utils:91 - Aborting task
org.apache.spark.TaskKilledException
at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:151)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:47:30 WARN FileOutputCommitter:569 - Could not delete file:/home/nruest/Dropbox/499-issues/output/all-text/output/_temporary/0/_temporary/attempt_20181002094723_0015_m_000001_0
2018-10-02 09:47:30 ERROR SparkHadoopWriter:70 - Task attempt_20181002094723_0015_m_000001_0 aborted.
2018-10-02 09:47:30 INFO Executor:54 - Executor interrupted and killed task 1.0 in stage 2.0 (TID 12), reason: Stage cancelled
2018-10-02 09:47:30 WARN TaskSetManager:66 - Lost task 1.0 in stage 2.0 (TID 12, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:47:30 ERROR Utils:91 - Aborting task
java.util.zip.ZipException: invalid stored block lengths
at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:47:30 WARN FileOutputCommitter:569 - Could not delete file:/home/nruest/Dropbox/499-issues/output/all-text/output/_temporary/0/_temporary/attempt_20181002094723_0015_m_000005_0
2018-10-02 09:47:30 ERROR SparkHadoopWriter:70 - Task attempt_20181002094723_0015_m_000005_0 aborted.
2018-10-02 09:47:30 INFO Executor:54 - Executor interrupted and killed task 5.0 in stage 2.0 (TID 16), reason: Stage cancelled
2018-10-02 09:47:30 WARN TaskSetManager:66 - Lost task 5.0 in stage 2.0 (TID 16, localhost, executor driver): TaskKilled (Stage cancelled)
org.apache.spark.SparkException: Job aborted.
at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:96)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1096)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1094)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:1067)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1032)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:958)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:957)
at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply$mcV$sp(RDD.scala:1493)
at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1472)
at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1472)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1472)
... 78 elided
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 2.0 failed 1 times, most recent failure: Lost task 4.0 in stage 2.0 (TID 15, localhost, executor driver): org.apache.spark.SparkException: Task failed while writing rows
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:151)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.zip.ZipException: invalid code lengths set
at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
... 8 more
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:78)
... 106 more
Caused by: org.apache.spark.SparkException: Task failed while writing rows
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:151)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.zip.ZipException: invalid code lengths set
at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
... 8 more
2018-10-02 09:47:30 ERROR Utils:91 - Aborting task
org.apache.spark.TaskKilledException
at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:151)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:47:30 WARN FileOutputCommitter:569 - Could not delete file:/home/nruest/Dropbox/499-issues/output/all-text/output/_temporary/0/_temporary/attempt_20181002094723_0015_m_000003_0
2018-10-02 09:47:30 ERROR SparkHadoopWriter:70 - Task attempt_20181002094723_0015_m_000003_0 aborted.
2018-10-02 09:47:30 INFO Executor:54 - Executor interrupted and killed task 3.0 in stage 2.0 (TID 14), reason: Stage cancelled
2018-10-02 09:47:30 WARN TaskSetManager:66 - Lost task 3.0 in stage 2.0 (TID 14, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:47:30 INFO MemoryStore:54 - Block broadcast_4 stored as values in memory (estimated size 275.4 KB, free 15.8 GB)
2018-10-02 09:47:30 INFO MemoryStore:54 - Block broadcast_4_piece0 stored as bytes in memory (estimated size 23.0 KB, free 15.8 GB)
2018-10-02 09:47:30 INFO BlockManagerInfo:54 - Added broadcast_4_piece0 in memory on 10.0.1.44:40719 (size: 23.0 KB, free: 15.8 GB)
2018-10-02 09:47:30 INFO SparkContext:54 - Created broadcast 4 from newAPIHadoopFile at package.scala:51
2018-10-02 09:47:30 INFO FileInputFormat:283 - Total input paths to process : 53
2018-10-02 09:47:30 INFO SparkContext:54 - Starting job: sortBy at package.scala:74
2018-10-02 09:47:30 INFO DAGScheduler:54 - Registering RDD 23 (map at package.scala:72)
2018-10-02 09:47:30 INFO DAGScheduler:54 - Got job 2 (sortBy at package.scala:74) with 53 output partitions
2018-10-02 09:47:30 INFO DAGScheduler:54 - Final stage: ResultStage 4 (sortBy at package.scala:74)
2018-10-02 09:47:30 INFO DAGScheduler:54 - Parents of final stage: List(ShuffleMapStage 3)
2018-10-02 09:47:30 INFO DAGScheduler:54 - Missing parents: List(ShuffleMapStage 3)
2018-10-02 09:47:30 INFO DAGScheduler:54 - Submitting ShuffleMapStage 3 (MapPartitionsRDD[23] at map at package.scala:72), which has no missing parents
2018-10-02 09:47:30 INFO MemoryStore:54 - Block broadcast_5 stored as values in memory (estimated size 4.7 KB, free 15.8 GB)
2018-10-02 09:47:30 INFO MemoryStore:54 - Block broadcast_5_piece0 stored as bytes in memory (estimated size 2.4 KB, free 15.8 GB)
2018-10-02 09:47:30 INFO BlockManagerInfo:54 - Added broadcast_5_piece0 in memory on 10.0.1.44:40719 (size: 2.4 KB, free: 15.8 GB)
2018-10-02 09:47:30 INFO SparkContext:54 - Created broadcast 5 from broadcast at DAGScheduler.scala:1039
2018-10-02 09:47:30 INFO DAGScheduler:54 - Submitting 53 missing tasks from ShuffleMapStage 3 (MapPartitionsRDD[23] at map at package.scala:72) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
2018-10-02 09:47:30 INFO TaskSchedulerImpl:54 - Adding task set 3.0 with 53 tasks
2018-10-02 09:47:30 INFO TaskSetManager:54 - Starting task 0.0 in stage 3.0 (TID 22, localhost, executor driver, partition 0, PROCESS_LOCAL, 7978 bytes)
2018-10-02 09:47:30 INFO TaskSetManager:54 - Starting task 1.0 in stage 3.0 (TID 23, localhost, executor driver, partition 1, PROCESS_LOCAL, 8009 bytes)
2018-10-02 09:47:30 INFO TaskSetManager:54 - Starting task 2.0 in stage 3.0 (TID 24, localhost, executor driver, partition 2, PROCESS_LOCAL, 7976 bytes)
2018-10-02 09:47:30 INFO TaskSetManager:54 - Starting task 3.0 in stage 3.0 (TID 25, localhost, executor driver, partition 3, PROCESS_LOCAL, 7980 bytes)
2018-10-02 09:47:30 INFO TaskSetManager:54 - Starting task 4.0 in stage 3.0 (TID 26, localhost, executor driver, partition 4, PROCESS_LOCAL, 7996 bytes)
2018-10-02 09:47:30 INFO TaskSetManager:54 - Starting task 5.0 in stage 3.0 (TID 27, localhost, executor driver, partition 5, PROCESS_LOCAL, 7996 bytes)
2018-10-02 09:47:30 INFO TaskSetManager:54 - Starting task 6.0 in stage 3.0 (TID 28, localhost, executor driver, partition 6, PROCESS_LOCAL, 8004 bytes)
2018-10-02 09:47:30 INFO TaskSetManager:54 - Starting task 7.0 in stage 3.0 (TID 29, localhost, executor driver, partition 7, PROCESS_LOCAL, 7978 bytes)
2018-10-02 09:47:30 INFO Executor:54 - Running task 1.0 in stage 3.0 (TID 23)
2018-10-02 09:47:30 INFO Executor:54 - Running task 7.0 in stage 3.0 (TID 29)
2018-10-02 09:47:30 INFO Executor:54 - Running task 5.0 in stage 3.0 (TID 27)
2018-10-02 09:47:30 INFO Executor:54 - Running task 6.0 in stage 3.0 (TID 28)
2018-10-02 09:47:30 INFO Executor:54 - Running task 4.0 in stage 3.0 (TID 26)
2018-10-02 09:47:30 INFO Executor:54 - Running task 3.0 in stage 3.0 (TID 25)
2018-10-02 09:47:30 INFO Executor:54 - Running task 2.0 in stage 3.0 (TID 24)
2018-10-02 09:47:30 INFO Executor:54 - Running task 0.0 in stage 3.0 (TID 22)
2018-10-02 09:47:30 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTHLY-JOB318434-20170802060019808-00010.warc.gz:0+745054600
2018-10-02 09:47:30 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-QUARTERLY-JOB311837-20170703002531023-00015.warc.gz:0+378164512
2018-10-02 09:47:30 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTHLY-JOB311846-20170702213909334-00019.warc.gz:0+642314864
2018-10-02 09:47:30 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTHLY-CDDZXY-20111001180917-00033-crawling200.us.archive.org-6680.warc.gz:0+118039703
2018-10-02 09:47:30 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTLIB-MTGOV-20080810110222-00723-crawling09.us.archive.org.arc.gz:0+191119896
2018-10-02 09:47:30 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTLIB-MTGOV-20080505225931-00186-crawling09.us.archive.org.arc.gz:0+284417845
2018-10-02 09:47:30 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTLIB-MTGOV-webteam-www.20071208015549.arc.gz:0+134559197
2018-10-02 09:47:30 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-WEEKLY-12073-20130503220356341-00010-wbgrp-crawl054.us.archive.org-6441.warc.gz:0+1437967814
2018-10-02 09:47:33 ERROR Utils:91 - Aborting task
org.apache.spark.TaskKilledException
at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:151)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:47:33 WARN FileOutputCommitter:569 - Could not delete file:/home/nruest/Dropbox/499-issues/output/all-text/output/_temporary/0/_temporary/attempt_20181002094723_0015_m_000006_0
2018-10-02 09:47:33 ERROR SparkHadoopWriter:70 - Task attempt_20181002094723_0015_m_000006_0 aborted.
2018-10-02 09:47:33 INFO Executor:54 - Executor interrupted and killed task 6.0 in stage 2.0 (TID 17), reason: Stage cancelled
2018-10-02 09:47:33 INFO TaskSetManager:54 - Starting task 8.0 in stage 3.0 (TID 30, localhost, executor driver, partition 8, PROCESS_LOCAL, 8009 bytes)
2018-10-02 09:47:33 WARN TaskSetManager:66 - Lost task 6.0 in stage 2.0 (TID 17, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:47:33 INFO Executor:54 - Running task 8.0 in stage 3.0 (TID 30)
2018-10-02 09:47:33 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-WEEKLY-12073-20130506091844567-00118-wbgrp-crawl054.us.archive.org-6441.warc.gz:0+1289997762
2018-10-02 09:47:34 ERROR Utils:91 - Aborting task
org.apache.spark.TaskKilledException
at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:151)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:124)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:123)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:135)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:79)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:47:34 WARN FileOutputCommitter:569 - Could not delete file:/home/nruest/Dropbox/499-issues/output/all-text/output/_temporary/0/_temporary/attempt_20181002094723_0015_m_000008_0
2018-10-02 09:47:34 ERROR SparkHadoopWriter:70 - Task attempt_20181002094723_0015_m_000008_0 aborted.
2018-10-02 09:47:34 INFO Executor:54 - Executor interrupted and killed task 8.0 in stage 2.0 (TID 19), reason: Stage cancelled
2018-10-02 09:47:34 INFO TaskSetManager:54 - Starting task 9.0 in stage 3.0 (TID 31, localhost, executor driver, partition 9, PROCESS_LOCAL, 7980 bytes)
2018-10-02 09:47:34 INFO Executor:54 - Running task 9.0 in stage 3.0 (TID 31)
2018-10-02 09:47:34 WARN TaskSetManager:66 - Lost task 8.0 in stage 2.0 (TID 19, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:47:34 INFO TaskSchedulerImpl:54 - Removed TaskSet 2.0, whose tasks have all completed, from pool
2018-10-02 09:47:34 INFO NewHadoopRDD:54 - Input split: file:/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-QUARTERLY-JOB456196-20171001172127643-00001.warc.gz:0+209337866
2018-10-02 09:47:36 ERROR Executor:91 - Exception in task 4.0 in stage 3.0 (TID 26)
java.util.zip.ZipException: invalid code lengths set
at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:47:36 INFO TaskSetManager:54 - Starting task 10.0 in stage 3.0 (TID 32, localhost, executor driver, partition 10, PROCESS_LOCAL, 8007 bytes)
2018-10-02 09:47:36 WARN TaskSetManager:66 - Lost task 4.0 in stage 3.0 (TID 26, localhost, executor driver): java.util.zip.ZipException: invalid code lengths set
at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-10-02 09:47:36 ERROR TaskSetManager:70 - Task 4 in stage 3.0 failed 1 times; aborting job
2018-10-02 09:47:36 INFO TaskSchedulerImpl:54 - Cancelling stage 3
2018-10-02 09:47:36 INFO Executor:54 - Executor is trying to kill task 8.0 in stage 3.0 (TID 30), reason: Stage cancelled
2018-10-02 09:47:36 INFO Executor:54 - Executor is trying to kill task 5.0 in stage 3.0 (TID 27), reason: Stage cancelled
2018-10-02 09:47:36 INFO Executor:54 - Executor is trying to kill task 9.0 in stage 3.0 (TID 31), reason: Stage cancelled
2018-10-02 09:47:36 INFO Executor:54 - Running task 10.0 in stage 3.0 (TID 32)
2018-10-02 09:47:36 INFO Executor:54 - Executor is trying to kill task 10.0 in stage 3.0 (TID 32), reason: Stage cancelled
2018-10-02 09:47:36 INFO Executor:54 - Executor is trying to kill task 2.0 in stage 3.0 (TID 24), reason: Stage cancelled
2018-10-02 09:47:36 INFO Executor:54 - Executor is trying to kill task 6.0 in stage 3.0 (TID 28), reason: Stage cancelled
2018-10-02 09:47:36 INFO Executor:54 - Executor is trying to kill task 3.0 in stage 3.0 (TID 25), reason: Stage cancelled
2018-10-02 09:47:36 INFO TaskSchedulerImpl:54 - Stage 3 was cancelled
2018-10-02 09:47:36 INFO Executor:54 - Executor is trying to kill task 0.0 in stage 3.0 (TID 22), reason: Stage cancelled
2018-10-02 09:47:36 INFO DAGScheduler:54 - ShuffleMapStage 3 (map at package.scala:72) failed in 6.424 s due to Job aborted due to stage failure: Task 4 in stage 3.0 failed 1 times, most recent failure: Lost task 4.0 in stage 3.0 (TID 26, localhost, executor driver): java.util.zip.ZipException: invalid code lengths set
at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
2018-10-02 09:47:36 INFO Executor:54 - Executor is trying to kill task 7.0 in stage 3.0 (TID 29), reason: Stage cancelled
2018-10-02 09:47:36 INFO Executor:54 - Executor is trying to kill task 1.0 in stage 3.0 (TID 23), reason: Stage cancelled
2018-10-02 09:47:36 INFO Executor:54 - Executor killed task 10.0 in stage 3.0 (TID 32), reason: Stage cancelled
2018-10-02 09:47:36 INFO DAGScheduler:54 - Job 2 failed: sortBy at package.scala:74, took 6.428172 s
2018-10-02 09:47:36 WARN TaskSetManager:66 - Lost task 10.0 in stage 3.0 (TID 32, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:47:36 INFO Executor:54 - Executor killed task 7.0 in stage 3.0 (TID 29), reason: Stage cancelled
2018-10-02 09:47:36 WARN TaskSetManager:66 - Lost task 7.0 in stage 3.0 (TID 29, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:47:36 INFO Executor:54 - Executor killed task 0.0 in stage 3.0 (TID 22), reason: Stage cancelled
2018-10-02 09:47:36 WARN TaskSetManager:66 - Lost task 0.0 in stage 3.0 (TID 22, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:47:37 INFO Executor:54 - Executor killed task 1.0 in stage 3.0 (TID 23), reason: Stage cancelled
2018-10-02 09:47:37 WARN TaskSetManager:66 - Lost task 1.0 in stage 3.0 (TID 23, localhost, executor driver): TaskKilled (Stage cancelled)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 3.0 failed 1 times, most recent failure: Lost task 4.0 in stage 3.0 (TID 26, localhost, executor driver): java.util.zip.ZipException: invalid code lengths set
at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:939)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:938)
at org.apache.spark.RangePartitioner$.sketch(Partitioner.scala:306)
at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:168)
at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:148)
at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:62)
at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:61)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.OrderedRDDFunctions.sortByKey(OrderedRDDFunctions.scala:61)
at org.apache.spark.rdd.RDD$$anonfun$sortBy$1.apply(RDD.scala:622)
at org.apache.spark.rdd.RDD$$anonfun$sortBy$1.apply(RDD.scala:623)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.sortBy(RDD.scala:620)
at io.archivesunleashed.package$CountableRDD.countItems(package.scala:74)
... 78 elided
Caused by: java.util.zip.ZipException: invalid code lengths set
at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
at org.archive.io.arc.ARCRecord.read(ARCRecord.java:799)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:121)
at org.apache.commons.io.input.BoundedInputStream.read(BoundedInputStream.java:103)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1792)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1769)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1744)
at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:462)
at io.archivesunleashed.data.ArcRecordUtils.copyToByteArray(ArcRecordUtils.java:161)
at io.archivesunleashed.data.ArcRecordUtils.getContent(ArcRecordUtils.java:117)
at io.archivesunleashed.data.ArcRecordUtils.getBodyContent(ArcRecordUtils.java:131)
at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:98)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:54)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
<console>:33: error: not found: value links
WriteGraphML(links, "/home/nruest/Dropbox/499-issues/output/gephi/499-gephi.graphml")
^
2018-10-02 09:47:37 INFO Executor:54 - Executor killed task 2.0 in stage 3.0 (TID 24), reason: Stage cancelled
2018-10-02 09:47:37 WARN TaskSetManager:66 - Lost task 2.0 in stage 3.0 (TID 24, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:47:37 INFO SparkContext:54 - Invoking stop() from shutdown hook
2018-10-02 09:47:37 INFO Executor:54 - Executor killed task 3.0 in stage 3.0 (TID 25), reason: Stage cancelled
2018-10-02 09:47:37 WARN TaskSetManager:66 - Lost task 3.0 in stage 3.0 (TID 25, localhost, executor driver): TaskKilled (Stage cancelled)
2018-10-02 09:47:37 INFO AbstractConnector:318 - Stopped Spark@26c2bd0c{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2018-10-02 09:47:37 INFO SparkUI:54 - Stopped Spark web UI at http://10.0.1.44:4040
2018-10-02 09:47:37 INFO MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2018-10-02 09:47:37 INFO MemoryStore:54 - MemoryStore cleared
2018-10-02 09:47:37 INFO BlockManager:54 - BlockManager stopped
2018-10-02 09:47:37 INFO BlockManagerMaster:54 - BlockManagerMaster stopped
2018-10-02 09:47:37 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2018-10-02 09:47:37 INFO SparkContext:54 - Successfully stopped SparkContext
2018-10-02 09:47:37 INFO ShutdownHookManager:54 - Shutdown hook called
2018-10-02 09:47:37 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-4cdcdab3-2015-4789-acbc-c3ef56e6e405
2018-10-02 09:47:37 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-0782f27c-4461-4384-9191-997eb21b1d7e/repl-6a0a3b1f-9a55-4268-a6d8-d76accaa0494
2018-10-02 09:47:37 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-0782f27c-4461-4384-9191-997eb21b1d7e
...and if I remove the empty file, and run the same job with the other 53 problematic arcs/warcs, I am not able to replicate what you've come up with (this is all from building aut on HEAD this morning, after clearing
ruebot
Oct 2, 2018
Member
Out of curiosity, I tried it with Apache Spark 2.3.2 (released September 24, 2018), and I'm getting the same thing.
ruebot
Oct 2, 2018
Member
Can you give all your steps? I am unable to replicate your success on 0.16.0 and HEAD, and I am sure @ianmilligan1 is as well. It's extremely helpful to share the exact steps when we're verifying something like this; saying "it works on my end" without steps to replicate isn't very helpful. So, can you please do this on tuna, or whatever machine you have these files on:
- Clean up your environment:
  - Remove everything in `~/.m2` and `~/.ivy2`
  - Remove `aut` from wherever you have it.
- Clone `aut` somewhere.
- Build `aut` on master, as of the latest commit: `mvn clean install`
- Create an output directory with sub-directories: `mkdir -p path/to/where/ever/you/can/write/output/all-text path/to/where/ever/you/can/write/output/all-domains path/to/where/ever/you/can/write/output/gephi path/to/where/ever/you/can/write/spark-jobs`
- Adapt the example script from above:
import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._
sc.setLogLevel("INFO")
RecordLoader.loadArchives("/home/nruest/Dropbox/499-issues/*.gz", sc).keepValidPages().map(r => ExtractDomain(r.getUrl)).countItems().saveAsTextFile("/home/nruest/Dropbox/499-issues/output/all-domains/output")
RecordLoader.loadArchives("/home/nruest/Dropbox/499-issues/*.gz", sc).keepValidPages().map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString))).saveAsTextFile("/home/nruest/Dropbox/499-issues/output/all-text/output")
val links = RecordLoader.loadArchives("/home/nruest/Dropbox/499-issues/*.gz", sc).keepValidPages().map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString))).flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", "")))).filter(r => r._2 != "" && r._3 != "").countItems().filter(r => r._2 > 5)
WriteGraphML(links, "/home/nruest/Dropbox/499-issues/output/gephi/499-gephi.graphml")
sys.exit
- Run the command from above, with adapted paths, using Apache Spark 2.1.3 or Apache Spark 2.3.2:
/home/nruest/bin/spark-2.1.3-bin-hadoop2.7/bin/spark-shell --master local[10] --driver-memory 30G --conf spark.network.timeout=100000000 --conf spark.executor.heartbeatInterval=6000s --conf spark.driver.maxResultSize=10G --jars /home/nruest/git/aut/target/aut-0.16.1-SNAPSHOT-fatjar.jar -i /home/nruest/Dropbox/499-issues/spark-jobs/499.scala | tee /home/nruest/Dropbox/499-issues/spark-jobs/499.scala.log
- Let us know what happened, tell us your steps, and share the output of the log.
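A small, purely optional addition (not part of the steps above; the lines below are only a hedged sketch): prepending a few println calls to the adapted 499.scala makes the tee'd log record the exact runtime versions, which helps when comparing runs across Spark 2.1.3 and 2.3.2.
// Hypothetical lines one might prepend to 499.scala; `sc` is the SparkContext provided by spark-shell.
println(s"Spark version: ${sc.version}")
println(s"Scala version: ${util.Properties.versionString}")
println(s"Java version: ${System.getProperty("java.version")}")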
ruebot commented Sep 18, 2018
Describe the bug
Came across this when processing a user's collection on cloud.archivesunleashed.org, using aut-0.16.0. The collection appears to have a couple problematic warcs, which throw this error:
Fuller log output available here.
To Reproduce
The error occurs at this specific point:
Environment information
- aut 0.16.0 loaded via `--packages`, using this spark-shell invocation:
/home/ubuntu/aut/spark-2.1.3-bin-hadoop2.7/bin/spark-shell --master local[12] --driver-memory 105G --conf spark.network.timeout=100000000 --conf spark.executor.heartbeatInterval=6000s --conf spark.driver.maxResultSize=10G --packages \"io.archivesunleashed:aut:0.16.0\" -i /data/139/499/60/spark_jobs/499.scala | tee /data/139/499/60/spark_jobs/499.scala.log
Additional context
- `ARCHIVEIT-499-MONTLIB-MTGOV-webteam-www.20071208015549.arc.gz` is on rho or tuna for further testing.
- `ARCHIVEIT-499-MONTLIB-MTGOV-webteam-www.20071208015549.arc.gz: OK`
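As a hedged aside (not part of the original report): since a whole-file check on that arc apparently reports OK while record-level decompression still fails, a minimal Scala sketch like the one below can help cross-check suspect files outside of aut. It uses only plain JDK streams; the helper name probeGzip and the example path are illustrative, and because the archive readers decompress records along a different path, this probe may well pass on files that still trip the ZipException above.
import java.io.{BufferedInputStream, FileInputStream}
import java.util.zip.{GZIPInputStream, ZipException}
// Hypothetical helper, not part of aut: decompress a .gz end to end with the
// plain JDK inflater and report how far it gets before any ZipException.
def probeGzip(path: String): Unit = {
  val in = new GZIPInputStream(new BufferedInputStream(new FileInputStream(path)))
  val buf = new Array[Byte](8192)
  var total = 0L
  try {
    var n = in.read(buf)
    while (n != -1) {
      total += n
      n = in.read(buf)
    }
    println(s"$path: OK ($total uncompressed bytes)")
  } catch {
    case e: ZipException =>
      println(s"$path: failed after $total uncompressed bytes: ${e.getMessage}")
  } finally {
    in.close()
  }
}
// Illustrative usage with one of the files named in this issue:
// probeGzip("/home/nruest/Dropbox/499-issues/ARCHIVEIT-499-MONTLIB-MTGOV-webteam-www.20071208015549.arc.gz")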