Image extraction does not scale with number of WARCs #298

Open
ruebot opened this issue Jan 23, 2019 · 3 comments

ruebot (Member) commented Jan 23, 2019

Describe the bug

aut fails in a variety of ways as you increase the number of ARCs/WARCs you extract images from in a single job.

To Reproduce

Using this basic extraction script and scaling the number of WARCs per job:

import io.archivesunleashed._
import io.archivesunleashed.df._
val df = RecordLoader.loadArchives("/tuna1/scratch/nruest/geocites/warcs/9/test-NUMBER-OF-WARCS/*.gz", sc).extractImageDetailsDF();
val res = df.select($"bytes").orderBy(desc("bytes")).saveToDisk("bytes", "/tuna1/scratch/nruest/geocites/images/NUMBER-OF-WARCS/geocities-image")
sys.exit

10 WARCs

time (/home/ruestn/spark-2.4.0-bin-hadoop2.7/bin/spark-shell --master local[44] --driver-memory 500G --conf spark.network.timeout=10000000 --conf spark.executor.heartbeatInterval=600s --conf spark.driver.maxResultSize=250G --conf spark.local.dir=/tuna1/scratch/nruest/tmp --packages "io.archivesunleashed:aut:0.17.0" -i /home/ruestn/geocities-images-tests/test-10.scala | tee /tuna1/scratch/nruest/geocites/logs/test-10.log);

Results

  • Number of images: 240,424
  • Disk usage: 6.8G
  • Time:
-------------------
real    20m41.399s
user    190m50.356s
sys     29m14.584s
-------------------

100 WARCs

time (/home/ruestn/spark-2.4.0-bin-hadoop2.7/bin/spark-shell --master local[44] --driver-memory 500G --conf spark.network.timeout=10000000 --conf spark.executor.heartbeatInterval=600s  --conf spark.driver.maxResultSize=250G --conf spark.local.dir=/tuna1/scratch/nruest/tmp --packages "io.archivesunleashed:aut:0.17.0" -i /home/ruestn/geocities-images-tests/test-100.scala | tee /tuna1/scratch/nruest/geocites/logs/test-100.log);

Results

  • Number of images: 2,330,328
  • Disk usage: 66G
  • Time:
--------------------
real    230m21.423s
user    1885m4.936s
sys     270m5.440s
--------------------

(I accidentally ran it twice; the other run's timing is below.)

--------------------
real    726m33.194s
user    1565m17.936s
sys     285m35.924s
--------------------

200 WARCs

time (/home/ruestn/spark-2.4.0-bin-hadoop2.7/bin/spark-shell --master local[44] --driver-memory 500G --conf spark.network.timeout=10000000 --conf spark.executor.heartbeatInterval=600s  --conf spark.driver.maxResultSize=250G --conf spark.local.dir=/tuna1/scratch/nruest/tmp --packages "io.archivesunleashed:aut:0.17.0" -i /home/ruestn/geocities-images-tests/test-200.scala | tee /tuna1/scratch/nruest/geocites/logs/test-200.log);

Results

  • Number of images: FAILED
  • Disk usage: FAILED
  • Time: FAILED

500 WARCs

time (/home/ruestn/spark-2.4.0-bin-hadoop2.7/bin/spark-shell --master local[44] --driver-memory 500G --conf spark.network.timeout=10000000 --conf spark.executor.heartbeatInterval=600s  --conf spark.driver.maxResultSize=250G --conf spark.local.dir=/tuna1/scratch/nruest/tmp --packages "io.archivesunleashed:aut:0.17.0" -i /home/ruestn/geocities-images-tests/test-500.scala | tee /tuna1/scratch/nruest/geocites/logs/test-500.log);

Results

  • Number of images: Did not run
  • Disk usage: Did not run
  • Time: Did not run

Environment information

  • AUT version: 0.17.0
  • OS: Ubuntu 16.04
  • Java version: Java 8
  • Apache Spark version: 2.3.2, 2.4.0
  • Apache Spark w/aut: --packages
  • Apache Spark command used to run AUT: above

Additional context

  • We hit a ulimit problem on tuna with what I believe are the default settings (a quick way to check the limit from within the running shell is sketched after these traces). tuna is also using ZFS as its filesystem.
java.io.FileNotFoundException: /tuna1/scratch/nruest/geocites/images/geocities-image-a97c139a3a31467aeb4bbfa36edaa775.gif (Too many open files)
        at java.io.RandomAccessFile.open0(Native Method)
        at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
        at javax.imageio.stream.FileImageOutputStream.<init>(FileImageOutputStream.java:69)
        at com.sun.imageio.spi.FileImageOutputStreamSpi.createOutputStreamInstance(FileImageOutputStreamSpi.java:55)
        at javax.imageio.ImageIO.createImageOutputStream(ImageIO.java:419)
        at javax.imageio.ImageIO.write(ImageIO.java:1530)
        at io.archivesunleashed.df.package$SaveImage$$anonfun$saveToDisk$1.apply(package.scala:72)
        at io.archivesunleashed.df.package$SaveImage$$anonfun$saveToDisk$1.apply(package.scala:54)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:927)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:927)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:109)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
  • When run on rho over all the files, I was able to extract ~20M images (there should be about ~121M in total), but I ran into a lot of disk space issues even though I had plenty of free disk space and inodes. Might have been an ext4 issue?
java.io.FileNotFoundException: /mnt/vol1/data_sets/geocities/images-84553ea4b4737f20d6b20cbc16defc66.gif (No space left on device)
        at java.io.RandomAccessFile.open0(Native Method)
        at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
        at javax.imageio.stream.FileImageOutputStream.<init>(FileImageOutputStream.java:69)
        at com.sun.imageio.spi.FileImageOutputStreamSpi.createOutputStreamInstance(FileImageOutputStreamSpi.java:55)
        at javax.imageio.ImageIO.createImageOutputStream(ImageIO.java:419)
        at javax.imageio.ImageIO.write(ImageIO.java:1530)
        at io.archivesunleashed.df.package$SaveImage$$anonfun$saveToDisk$1.apply(package.scala:72)
        at io.archivesunleashed.df.package$SaveImage$$anonfun$saveToDisk$1.apply(package.scala:54)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:927)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:927)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:109)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
  • Other errors:
java.io.EOFException
        at javax.imageio.stream.ImageInputStreamImpl.readUnsignedByte(ImageInputStreamImpl.java:222)
        at com.sun.imageio.plugins.gif.GIFImageReader.read(GIFImageReader.java:916)
        at javax.imageio.ImageReader.read(ImageReader.java:939)
        at io.archivesunleashed.df.package$SaveImage$$anonfun$saveToDisk$1.apply(package.scala:66)
        at io.archivesunleashed.df.package$SaveImage$$anonfun$saveToDisk$1.apply(package.scala:54)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:927)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:927)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:109)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
java.io.EOFException
        at javax.imageio.stream.ImageInputStreamImpl.readUnsignedByte(ImageInputStreamImpl.java:222)
        at com.sun.imageio.plugins.gif.GIFImageReader.read(GIFImageReader.java:916)
        at javax.imageio.ImageReader.read(ImageReader.java:939)
        at io.archivesunleashed.df.package$SaveImage$$anonfun$saveToDisk$1.apply(package.scala:66)
        at io.archivesunleashed.df.package$SaveImage$$anonfun$saveToDisk$1.apply(package.scala:54)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:927)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:927)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:109)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
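
For the ulimit point above, a quick way to confirm the limit the tasks actually run under is to ask the JVM from the same spark-shell; with --master local[*] the driver and executors share one JVM, so this is the cap the saveToDisk writers see. (This assumes a HotSpot JVM on Linux, where the OperatingSystemMXBean can be cast to UnixOperatingSystemMXBean.)

import java.lang.management.ManagementFactory
import com.sun.management.UnixOperatingSystemMXBean

// Print how many file descriptors this JVM currently holds open vs. its hard cap.
val os = ManagementFactory.getOperatingSystemMXBean.asInstanceOf[UnixOperatingSystemMXBean]
println(s"open file descriptors: ${os.getOpenFileDescriptorCount} of ${os.getMaxFileDescriptorCount}")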

Expected behavior

I think we're hitting this because of our implementation and extraction script. I believe we iterate over the entire collection to identify all the images and load them into the DataFrame, then iterate back over that DataFrame to dump the images into a tmp dir, and finally move them to where they're supposed to end up. This requires a huge spark.driver.maxResultSize. We should examine the implementation and see whether we can stream images out to disk as we find them, which should require far less overhead.
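
A minimal sketch of what streaming writes could look like (not the current aut API): decode and write each image inside the task that holds it via foreachPartition, so nothing is funneled back through the driver. This assumes the bytes column holds base64-encoded image data (which is what the existing saveToDisk appears to decode before calling ImageIO.write); the output path and file naming here are placeholders.

import java.nio.file.{Files, Paths}
import java.security.MessageDigest
import java.util.Base64

// Write each image from within the executor task that produced it, so
// spark.driver.maxResultSize no longer has to hold the whole result set.
df.select($"bytes").foreachPartition { rows =>
  rows.foreach { row =>
    val decoded = Base64.getDecoder.decode(row.getAs[String]("bytes"))
    // Name files by content hash so concurrent tasks never collide.
    val md5 = MessageDigest.getInstance("MD5").digest(decoded).map("%02x".format(_)).mkString
    Files.write(Paths.get(s"/path/to/output/$md5"), decoded)  // output path is a placeholder
  }
}

The orderBy(desc("bytes")) in the original script also forces a full sort before any writing starts; the later test script below drops it.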

ruebot (Member Author) commented Jan 23, 2019

For reference, image extraction was implemented here: #234

ruebot (Member Author) commented Jul 17, 2019

I think I have this working on tuna right now. I'll drop in the settings, logs, and script when I'm done. It's probably a combination of a more recent version of Spark and some config settings on tuna that the sysadmin changed.

ruebot (Member Author) commented Jul 17, 2019

aut-298-df-split-test.scala

import io.archivesunleashed._
import io.archivesunleashed.df._

val df = RecordLoader.loadArchives("/tuna1/scratch/nruest/geocites/warcs/1/*.gz", sc).extractImageDetailsDF();
val res = df.select($"bytes").saveToDisk("bytes", "/tuna1/scratch/nruest/geocites/df-images-test/1/aut-298-test-")
sys.exit

command

/home/ruestn/spark-2.4.3-bin-hadoop2.7/bin/spark-shell --master local[75] --driver-memory 300g --conf spark.network.timeout=10000000 --conf spark.executor.heartbeatInterval=600s --conf spark.driver.maxResultSize=4g --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.shuffle.compress=true --conf spark.rdd.compress=true --packages "io.archivesunleashed:aut:0.17.1-SNAPSHOT" -i /home/ruestn/aut-298-df-split-test.scala 2>&1 | tee /home/ruestn/aut-298-df-split-test-01.log

output

ruestn@tuna:/tuna1/scratch/nruest/geocites/df-images-test/1$ ls | wc -l
1692092
ruestn@tuna:/tuna1/scratch/nruest/geocites/df-images-test/1$ ls | head
aut-298-test--0000111a3b953aec7f0cfe8c67d1ef13.gif
aut-298-test--0000149ccfc79a8d592f80fd4d0428ae.gif
aut-298-test--000017e324fdc571fa32b2dc2179f299.JPEG
aut-298-test--00002e984a2d1c22d55e8c7e03a8b283.JPEG
aut-298-test--00002eb38a357b394047a10c59ec4f95.JPEG
aut-298-test--000033e6d3a81a7c9e9c8681cc6b4801.gif
aut-298-test--00003a521ae0c8b1f1cbcab0c42cf296.gif
aut-298-test--000040ed7782c926413e5c1f46a06fbc.gif
aut-298-test--000052770ee09057c84c59d5c5d8a7cd.JPEG
aut-298-test--00006544537064d8d3072bb0d059c7ab.gif

job log

I think we're good. I'll run this on the rest of GeoCities, and close it if we're good to go.
