
Heap space!! java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOf(Arrays.java:3332) #317

Open · ruebot opened this issue Apr 24, 2019 · 9 comments · Labels: bug

ruebot (Member) commented Apr 24, 2019

Describe the bug

On a number of FDLP and Stanford collections, we run into this heap space error, and it kills the Spark job. Upon investigation, this does not seem to be a problem with a large ARC/WARC or large collection. This happens on small and large collections.

19/04/18 18:02:34 DEBUG BlockManager: Told master about block rdd_3_3
19/04/18 18:02:34 INFO ArchiveRecordInputFormat: Closed archive file file:/tuna1/scratch/nruest/auk_collection_testing/10689/warcs/ARCHIVEIT-10689-TEST-JOB677126-20180828211602861-00009.warc.gz
19/04/18 18:02:34 ERROR Executor: Exception in task 3.0 in stage 0.0 (TID 3)
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3332)
        at java.lang.StringCoding.safeTrim(StringCoding.java:89)
        at java.lang.StringCoding.access$100(StringCoding.java:50)
        at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:154)
        at java.lang.StringCoding.decode(StringCoding.java:193)
        at java.lang.StringCoding.decode(StringCoding.java:254)
        at java.lang.String.<init>(String.java:546)
        at java.lang.String.<init>(String.java:566)
        at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:117)
        at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:69)
        at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:69)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
        at org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:139)
        at org.apache.spark.serializer.SerializerManager.dataSerializeStream(SerializerManager.scala:174)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1$$anonfun$apply$10.apply(BlockManager.scala:1203)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1$$anonfun$apply$10.apply(BlockManager.scala:1201)
        at org.apache.spark.storage.DiskStore.put(DiskStore.scala:69)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1201)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
        at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
        at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
        at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
        at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
19/04/18 18:02:34 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_0.0, runningTasks: 25
19/04/18 18:02:34 INFO SparkContext: Invoking stop() from shutdown hook
19/04/18 18:02:34 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, localhost, executor driver): java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3332)
        at java.lang.StringCoding.safeTrim(StringCoding.java:89)
        at java.lang.StringCoding.access$100(StringCoding.java:50)
        at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:154)
        at java.lang.StringCoding.decode(StringCoding.java:193)
        at java.lang.StringCoding.decode(StringCoding.java:254)
        at java.lang.String.<init>(String.java:546)
        at java.lang.String.<init>(String.java:566)
        at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:117)
        at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:69)
        at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:69)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
        at org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:139)
        at org.apache.spark.serializer.SerializerManager.dataSerializeStream(SerializerManager.scala:174)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1$$anonfun$apply$10.apply(BlockManager.scala:1203)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1$$anonfun$apply$10.apply(BlockManager.scala:1201)
        at org.apache.spark.storage.DiskStore.put(DiskStore.scala:69)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1201)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
        at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
        at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
        at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
        at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)

To Reproduce
Steps to reproduce the behavior on tuna:

  1. Using /tuna1/scratch/nruest/auk_collection_testing/10689
  2. /home/ruestn/spark-2.4.1-bin-hadoop2.7/bin/spark-shell --master local[30] --driver-memory 105g --conf spark.network.timeout=10000000 --conf spark.executor.heartbeatInterval=600s --conf spark.driver.maxResultSize=4g --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.shuffle.compress=true --conf spark.rdd.compress=true -Djava.io.tmpdir=/tuna1/scratch/nruest/tmp --jars /home/ruestn/aut/target/aut-0.17.1-SNAPSHOT-fatjar.jar -i /tuna1/scratch/nruest/auk_collection_testing/10689/133/spark_jobs/10689.scala 2>&1 | tee /tuna1/scratch/nruest/auk_collection_testing/10689/133/spark_jobs/10689.scala-test.log
  3. The error should pop up around task 31 in stage 1.

Expected behavior
The job shouldn't hit the heap space error.

Environment information

  • AUT version: 0.17.0
  • OS: Ubuntu 16.04
  • Java version: Java 8
  • Apache Spark version: 2.3.1, 2.3.2, 2.4.0, 2.4.1
  • Apache Spark w/aut: --jars and --packages
  • Apache Spark command used to run AUT: /home/ruestn/spark-2.4.1-bin-hadoop2.7/bin/spark-shell --master local[30] --driver-memory 105g --conf spark.network.timeout=10000000 --conf spark.executor.heartbeatInterval=600s --conf spark.driver.maxResultSize=4g --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.shuffle.compress=true --conf spark.rdd.compress=true -Djava.io.tmpdir=/tuna1/scratch/nruest/tmp --jars /home/ruestn/aut/target/aut-0.17.1-SNAPSHOT-fatjar.jar -i /tuna1/scratch/nruest/auk_collection_testing/10689/133/spark_jobs/10689.scala 2>&1 | tee /tuna1/scratch/nruest/auk_collection_testing/10689/133/spark_jobs/10689.scala-test.log

Additional context

ruebot (Member, Author) commented Apr 24, 2019

@jrwiebe @lintool @ianmilligan1 finally wrote it up. Let me know if I missed anything, or if y'all want me to provide more info.

jrwiebe (Contributor) commented Apr 24, 2019

@ruebot, you say:

Upon investigation, this does not seem to be a problem with a large ARC/WARC or large collection.

Can you clarify this? Are you saying that jobs run with the same parameters against other large collections – in terms of total archive size – do not result in heap space errors? Because in my testing I didn't find this to be the case. It seemed like the error was rather a function of the total size of all the WARCs being operated on simultaneously. I think I was seeing jobs bail when they didn't have about 4 times the amount of memory as [# tasks] * [average WARC size].
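
To put rough numbers on that rule of thumb (the average WARC size here is only an illustrative assumption): with --master local[30] and WARCs averaging around 1 GB, it would call for roughly 30 * 1 GB * 4 = 120 GB of heap, which is more than the 105 GB of driver memory used in the report above.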

I found this garbage collection log analyzer useful for trying to understand how memory was being used, though I'm still puzzled that so much memory is needed. See, for example, how much memory is used by a job that simply chains RecordLoader.loadArchives(...).count(). There is some useful background on Java garbage collection here.
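
A minimal version of that count-only job looks something like this (the WARC path is a placeholder, and this assumes the AUT RecordLoader API referenced elsewhere in this issue):

import io.archivesunleashed._

// Placeholder path; point this at a directory of ARCs/WARCs.
RecordLoader.loadArchives("/path/to/warcs/*.gz", sc).count()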

To generate garbage collection logs you want to do something like this (the spark.driver.extraJavaOptions conf directive is the key):

/home/jrwiebe/spark-2.4.1-bin-hadoop2.7/bin/spark-shell --conf spark.driver.extraJavaOptions="-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:3260-16-16g-35bigwarcs.log -XX:+PrintHeapAtGC -XX:+PrintTenuringDistribution -XX:+UnlockExperimentalVMOptions -XX:G1LogLevel=finest -XX:+UnlockDiagnosticVMOptions -XX:+G1PrintRegionLivenessInfo" --master local[16] --driver-memory 16g --conf spark.network.timeout=10000000 --conf spark.executor.heartbeatInterval=600s --conf spark.driver.maxResultSize=4g --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.shuffle.compress=true --conf spark.rdd.compress=true --packages "io.archivesunleashed:aut:0.17.0" -i 3260.scala

For the record, I'm pretty certain KryoSerializer doesn't offer any benefits for this problem; I just kept using it because I copied @ruebot's command line. I ran tests with and without some of the other tuning options as well, and was seeing similar results.

ruebot (Member, Author) commented Apr 24, 2019

Sorry, yeah, that's kinda vague. I meant that, overall, we've analyzed 170T of collections ranging from under 1G of WARCs up to 12T, all with the same basic Spark settings.

I added the KryoSerializer during troubleshooting because it looked like it got further along in some cases, and reading up on it, it made sense at the time.

Thanks for putting in the GC info. I was hoping to tease it out by creating the issue 😄

ruebot (Member, Author) commented Apr 24, 2019

...and a single WARC as large as 80G.

ruebot (Member, Author) commented Apr 25, 2019

Reading the garbage collection post reminded me that Spark allocates (not sure if that is the right word) ~55-60% of memory as "storage" memory for executors. A smarter version of me would have bookmarked that Stack Overflow or Spark mailing list thread. But I can provide a screenshot of the Spark UI for a currently running job.

[Screenshot: Spark shell - Executors tab, 2019-04-24]

That's a job running with 30 threads and 105G of RAM allocated: --master local[30] --driver-memory 105g
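
If that ~55-60% figure refers to Spark's unified memory region, the relevant Spark 2.x settings would be spark.memory.fraction (default 0.6 of the heap minus a ~300 MB reserve) and spark.memory.storageFraction (default 0.5 of that region). As an experiment they could be adjusted on the spark-shell invocation above, e.g. --conf spark.memory.fraction=0.8 --conf spark.memory.storageFraction=0.3, though this is only a guess at the behaviour being described.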

ruebot (Member, Author) commented Aug 1, 2019

I'm going to start adding collection info on failed collections here:

id          size   Spark version   AUT version
593/12190   141G   2.4.3           0.17.0
769/4200    9.9T   2.4.4           0.18.0
713/5484    51G    2.4.4           0.18.0
lintool (Member) commented Oct 31, 2019

As a sanity check, I went to the data directory and did this:

$ gunzip -c *.gz | grep -a "Content-Length" | cut -d ' ' -f 2 | sort -n | less
2878090679
2878090400
2878090400
2247649420
2247649420
2247649180
2247649180

2878090679 = 2,878,090,679 ~ 2.9 GB. A bit silly to have records that big... dunno if this is the source of your errors though.

ianmilligan1 (Member) commented Oct 31, 2019

So an individual record that’s almost 3GB? I wonder if we could implement a default that skips records greater than, say, a GB? Or lower?

I can imagine users who might want to work with big files - video etc - but for the most part, and with our generic processing pipeline, we could do without.
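
A rough sketch of that cut-off at the RDD level might look like the following (the getContentBytes accessor and the 1 GB threshold are assumptions for illustration; since the record body has already been read into memory at this point, a real fix would probably need to skip oversized payloads inside the loader itself):

import io.archivesunleashed._

// Illustrative 1 GB cut-off, per the suggestion above.
val maxRecordBytes = 1024L * 1024 * 1024

val records = RecordLoader.loadArchives("/path/to/warcs/*.gz", sc)
  .filter(r => r.getContentBytes.length <= maxRecordBytes)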

ruebot (Member, Author) commented Nov 2, 2019

I'm not certain that record size is the issue. I've collected some data from a number of collections that fail and some that succeed, and large record size doesn't appear to be the sole factor. We have records well above the sizes @lintool listed that have been part of successful jobs. I'll take some time over the coming days, or weeks 😓, to try to isolate some of the WARCs that are actually causing the problems, and then we can dig in further from there if that makes sense. (Below are the five largest Content-Length values per collection, in bytes.)

PQ: Fails on extractVideoDetailsDF()

  1. 263,051,370
  2. 263,051,370
  3. 263,051,370
  4. 263,051,370
  5. 146,567,997

InternationalesQc: Success

  1. 585,726,559
  2. 148,428,760
  3. 148,428,512
  4. 135,553,646
  5. 135,553,230

10689: Fails on AUK

  1. 2,878,090,679
  2. 2,878,090,400
  3. 2,878,090,400
  4. 2,247,649,420
  5. 2,247,649,420

7235: Fails on AUK

  1. 65,608,807,060
  2. 65,608,807,060
  3. 65,608,807,060
  4. 65,608,807,060
  5. 65,608,807,060

AgricultureQc: Success

  1. 526,026,425
  2. 526,026,050
  3. 341,618,524
  4. 341,618,149
  5. 238,542,670

EnvironnementQc: Success

  1. 7,061,547,393
  2. 7,061,547,393
  3. 3,741,979,582
  4. 3,733,722,254
  5. 3,733,722,254

12190: Fails on AUK

  1. 3,273,574,468
  2. 3,273,574,204
  3. 2,580,347,094
  4. 2,580,346,830
  5. 2,185,762,748

TourismeQc-2006-2017: Success

  1. 50,182,078
  2. 26,795,182
  3. 24,213,112
  4. 23,125,967
  5. 22,184,611