Image extraction does not scale with number of WARCs #298

ruebot · Jan 23, 2019

Describe the bug

aut fails in a variety of ways the more ARCs/WARCs you try to extract images from at a time.

To Reproduce

Using this basic extraction script and scaling the number of WARCs per job:

import io.archivesunleashed._
import io.archivesunleashed.df._
val df = RecordLoader.loadArchives("/tuna1/scratch/nruest/geocites/warcs/9/test-NUMBER-OF-WARCS/*.gz", sc).extractImageDetailsDF();
val res = df.select($"bytes").orderBy(desc("bytes")).saveToDisk("bytes", "/tuna1/scratch/nruest/geocites/images/NUMBER-OF-WARCS/geocities-image")
sys.exit

10 WARCs

time (/home/ruestn/spark-2.4.0-bin-hadoop2.7/bin/spark-shell --master local[44] --driver-memory 500G --conf spark.network.timeout=10000000 --conf spark.executor.heartbeatInterval=600s --conf spark.driver.maxResultSize=250G --conf spark.local.dir=/tuna1/scratch/nruest/tmp --packages "io.archivesunleashed:aut:0.17.0" -i /home/ruestn/geocities-images-tests/test-10.scala | tee /tuna1/scratch/nruest/geocites/logs/test-10.log);

Results

Number of images: 240,424
Disk usage: 6.8G
Time:

-------------------
real    20m41.399s
user    190m50.356s
sys     29m14.584s
-------------------

100 WARCs

time (/home/ruestn/spark-2.4.0-bin-hadoop2.7/bin/spark-shell --master local[44] --driver-memory 500G --conf spark.network.timeout=10000000 --conf spark.executor.heartbeatInterval=600s  --conf spark.driver.maxResultSize=250G --conf spark.local.dir=/tuna1/scratch/nruest/tmp --packages "io.archivesunleashed:aut:0.17.0" -i /home/ruestn/geocities-images-tests/test-100.scala | tee /tuna1/scratch/nruest/geocites/logs/test-100.log);

Results

Number of images: 2,330,328
Disk usage: 66G
Time:

--------------------
real    230m21.423s
user    1885m4.936s
sys     270m5.440s
--------------------

(I accidentally ran it twice.)

--------------------
real    726m33.194s
user    1565m17.936s
sys     285m35.924s
--------------------

200 WARCs

time (/home/ruestn/spark-2.4.0-bin-hadoop2.7/bin/spark-shell --master local[44] --driver-memory 500G --conf spark.network.timeout=10000000 --conf spark.executor.heartbeatInterval=600s  --conf spark.driver.maxResultSize=250G --conf spark.local.dir=/tuna1/scratch/nruest/tmp --packages "io.archivesunleashed:aut:0.17.0" -i /home/ruestn/geocities-images-tests/test-200.scala | tee /tuna1/scratch/nruest/geocites/logs/test-200.log);

Results

Number of images: FAILED
Disk usage: FAILED
Time: FAILED

500 WARCs

time (/home/ruestn/spark-2.4.0-bin-hadoop2.7/bin/spark-shell --master local[44] --driver-memory 500G --conf spark.network.timeout=10000000 --conf spark.executor.heartbeatInterval=600s  --conf spark.driver.maxResultSize=250G --conf spark.local.dir=/tuna1/scratch/nruest/tmp --packages "io.archivesunleashed:aut:0.17.0" -i /home/ruestn/geocities-images-tests/test-500.scala | tee /tuna1/scratch/nruest/geocites/logs/test-500.log);

Results

Number of images: Did not run
Disk usage: Did not run
Time: Did not run

Environment information

AUT version: 0.17.0
OS: Ubuntu 16.04
Java version: Java 8
Apache Spark version: 2.3.2, 2.4.0
Apache Spark w/aut: --packages
Apache Spark command used to run AUT: above

Additional context

We hit a ulimit problem on tuna with what I believe are the default setting. tuna is also using zfs as a filesystem.

java.io.FileNotFoundException: /tuna1/scratch/nruest/geocites/images/geocities-image-a97c139a3a31467aeb4bbfa36edaa775.gif (Too many open files)
        at java.io.RandomAccessFile.open0(Native Method)
        at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
        at javax.imageio.stream.FileImageOutputStream.<init>(FileImageOutputStream.java:69)
        at com.sun.imageio.spi.FileImageOutputStreamSpi.createOutputStreamInstance(FileImageOutputStreamSpi.java:55)
        at javax.imageio.ImageIO.createImageOutputStream(ImageIO.java:419)
        at javax.imageio.ImageIO.write(ImageIO.java:1530)
        at io.archivesunleashed.df.package$SaveImage$$anonfun$saveToDisk$1.apply(package.scala:72)
        at io.archivesunleashed.df.package$SaveImage$$anonfun$saveToDisk$1.apply(package.scala:54)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:927)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:927)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:109)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

When run on rho over all the files, I was able to extract ~20M images (there should be about ~121M in total), but I ran into a lot of disk space issues even though I had plenty of disk space free, inodes. Might have been an ext4 issue?

java.io.FileNotFoundException: /mnt/vol1/data_sets/geocities/images-84553ea4b4737f20d6b20cbc16defc66.gif (No space left on device)
        at java.io.RandomAccessFile.open0(Native Method)
        at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
        at javax.imageio.stream.FileImageOutputStream.<init>(FileImageOutputStream.java:69)
        at com.sun.imageio.spi.FileImageOutputStreamSpi.createOutputStreamInstance(FileImageOutputStreamSpi.java:55)
        at javax.imageio.ImageIO.createImageOutputStream(ImageIO.java:419)
        at javax.imageio.ImageIO.write(ImageIO.java:1530)
        at io.archivesunleashed.df.package$SaveImage$$anonfun$saveToDisk$1.apply(package.scala:72)
        at io.archivesunleashed.df.package$SaveImage$$anonfun$saveToDisk$1.apply(package.scala:54)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:927)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:927)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:109)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Other errors:

java.io.EOFException
        at javax.imageio.stream.ImageInputStreamImpl.readUnsignedByte(ImageInputStreamImpl.java:222)
        at com.sun.imageio.plugins.gif.GIFImageReader.read(GIFImageReader.java:916)
        at javax.imageio.ImageReader.read(ImageReader.java:939)
        at io.archivesunleashed.df.package$SaveImage$$anonfun$saveToDisk$1.apply(package.scala:66)
        at io.archivesunleashed.df.package$SaveImage$$anonfun$saveToDisk$1.apply(package.scala:54)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:927)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:927)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:109)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
java.io.EOFException
        at javax.imageio.stream.ImageInputStreamImpl.readUnsignedByte(ImageInputStreamImpl.java:222)
        at com.sun.imageio.plugins.gif.GIFImageReader.read(GIFImageReader.java:916)
        at javax.imageio.ImageReader.read(ImageReader.java:939)
        at io.archivesunleashed.df.package$SaveImage$$anonfun$saveToDisk$1.apply(package.scala:66)
        at io.archivesunleashed.df.package$SaveImage$$anonfun$saveToDisk$1.apply(package.scala:54)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:927)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:927)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:109)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

Expected behavior

I think we're hitting this because of our implementation and extraction script. I believe we're iterating over the entire collection and identifying all the images, and tossing them into the dataframe, then iterating back over that and dumping them out to a tmp dir, then moving it over to the actual place they're supposed to end up. This requires a huge setting for spark.driver.maxResultSize. We should examine our implementation and see if it is possible to stream images out as we find them. That should require less overhead.

ruebot · Jan 23, 2019

For reference, image extraction was implemented here: #234

ruebot · Jul 17, 2019

I think I have this working on tuna right now. I'll drop in the settings, logs, and script when I'm done. Pretty sure it might be a combination of a more recent version of Spark and some config settings on tuna that were changed by the sysadmin.

ruebot · Jul 17, 2019

aut-298-df-split-test.scala

import io.archivesunleashed._
import io.archivesunleashed.df._

val df = RecordLoader.loadArchives("/tuna1/scratch/nruest/geocites/warcs/1/*.gz", sc).extractImageDetailsDF();
val res = df.select($"bytes").saveToDisk("bytes", "/tuna1/scratch/nruest/geocites/df-images-test/1/aut-298-test-")
sys.exit

command

/home/ruestn/spark-2.4.3-bin-hadoop2.7/bin/spark-shell --master local[75] --driver-memory 300g --conf spark.network.timeout=10000000 --conf spark.executor.heartbeatInterval=600s --conf spark.driver.maxResultSize=4g --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.shuffle.compress=true --conf spark.rdd.compress=true --packages "io.archivesunleashed:aut:0.17.1-SNAPSHOT" -i /home/ruestn/aut-298-df-split-test.scala 2>&1 | tee /home/ruestn/aut-298-df-split-test-01.log

output

ruestn@tuna:/tuna1/scratch/nruest/geocites/df-images-test/1$ ls | wc -l
1692092
ruestn@tuna:/tuna1/scratch/nruest/geocites/df-images-test/1$ ls | head
aut-298-test--0000111a3b953aec7f0cfe8c67d1ef13.gif
aut-298-test--0000149ccfc79a8d592f80fd4d0428ae.gif
aut-298-test--000017e324fdc571fa32b2dc2179f299.JPEG
aut-298-test--00002e984a2d1c22d55e8c7e03a8b283.JPEG
aut-298-test--00002eb38a357b394047a10c59ec4f95.JPEG
aut-298-test--000033e6d3a81a7c9e9c8681cc6b4801.gif
aut-298-test--00003a521ae0c8b1f1cbcab0c42cf296.gif
aut-298-test--000040ed7782c926413e5c1f46a06fbc.gif
aut-298-test--000052770ee09057c84c59d5c5d8a7cd.JPEG
aut-298-test--00006544537064d8d3072bb0d059c7ab.gif

job log

I think we're good. I'll run this on the rest of GeoCities, and close it if we're good to go.

ruebot added bug optimization Scala DataFrames labels Jan 24, 2019

ruebot added this to To do in Binary object extraction Jan 31, 2019

This was referenced Jan 31, 2019

Open

PDF binary object extraction #302

Open

Spreadsheet binary object extraction #303

Open

Doc binary object extraction #304

Open

Powerpoint binary object extraction #305

Open

Video binary object extraction #306

Open

Audio binary object extraction #307

archivesunleashed/aut

Image extraction does not scale with number of WARCs #298

Image extraction does not scale with number of WARCs #298

ruebot commented Jan 23, 2019 •

edited

This comment has been minimized.

ruebot commented Jan 23, 2019

ruebot added bug optimization Scala DataFrames labels Jan 24, 2019

ruebot added this to To do in Binary object extraction Jan 31, 2019

This was referenced Jan 31, 2019

PDF binary object extraction #302

Spreadsheet binary object extraction #303

Doc binary object extraction #304

Powerpoint binary object extraction #305

Video binary object extraction #306

Audio binary object extraction #307

This comment has been minimized.

ruebot commented Jul 17, 2019

This comment has been minimized.

ruebot commented Jul 17, 2019

archivesunleashed/aut

Join GitHub today

Image extraction does not scale with number of WARCs #298

Comments

ruebot commented Jan 23, 2019 • edited

Describe the bug

To Reproduce

10 WARCs

Results

100 WARCs

Results

200 WARCs

Results

500 WARCs

Results

Environment information

Additional context

Expected behavior

This comment has been minimized.

ruebot commented Jan 23, 2019

ruebot added bug optimization Scala DataFrames labels Jan 24, 2019

ruebot added this to To do in Binary object extraction Jan 31, 2019

This was referenced Jan 31, 2019

PDF binary object extraction #302

Spreadsheet binary object extraction #303

Doc binary object extraction #304

Powerpoint binary object extraction #305

Video binary object extraction #306

Audio binary object extraction #307

This comment has been minimized.

ruebot commented Jul 17, 2019

This comment has been minimized.

ruebot commented Jul 17, 2019

ruebot commented Jan 23, 2019 •

edited