Image extraction does not scale with number of WARCs #298

Closed · ruebot opened this issue Jan 23, 2019 · 7 comments

@ruebot (Member) commented Jan 23, 2019

Describe the bug

aut fails in a variety of ways as the number of ARCs/WARCs you extract images from in a single job grows.

To Reproduce

Using this basic extraction script and scaling the number of WARCs per job:

import io.archivesunleashed._
import io.archivesunleashed.df._
val df = RecordLoader.loadArchives("/tuna1/scratch/nruest/geocites/warcs/9/test-NUMBER-OF-WARCS/*.gz", sc).extractImageDetailsDF();
val res = df.select($"bytes").orderBy(desc("bytes")).saveToDisk("bytes", "/tuna1/scratch/nruest/geocites/images/NUMBER-OF-WARCS/geocities-image")
sys.exit

10 WARCs

time (/home/ruestn/spark-2.4.0-bin-hadoop2.7/bin/spark-shell --master local[44] --driver-memory 500G --conf spark.network.timeout=10000000 --conf spark.executor.heartbeatInterval=600s --conf spark.driver.maxResultSize=250G --conf spark.local.dir=/tuna1/scratch/nruest/tmp --packages "io.archivesunleashed:aut:0.17.0" -i /home/ruestn/geocities-images-tests/test-10.scala | tee /tuna1/scratch/nruest/geocites/logs/test-10.log);

Results

  • Number of images: 240,424
  • Disk usage: 6.8G
  • Time:
-------------------
real    20m41.399s
user    190m50.356s
sys     29m14.584s
-------------------

100 WARCs

time (/home/ruestn/spark-2.4.0-bin-hadoop2.7/bin/spark-shell --master local[44] --driver-memory 500G --conf spark.network.timeout=10000000 --conf spark.executor.heartbeatInterval=600s  --conf spark.driver.maxResultSize=250G --conf spark.local.dir=/tuna1/scratch/nruest/tmp --packages "io.archivesunleashed:aut:0.17.0" -i /home/ruestn/geocities-images-tests/test-100.scala | tee /tuna1/scratch/nruest/geocites/logs/test-100.log);

Results

  • Number of images: 2,330,328
  • Disk usage: 66G
  • Time:
--------------------
real    230m21.423s
user    1885m4.936s
sys     270m5.440s
--------------------

(I accidentally ran it twice.)

--------------------
real    726m33.194s
user    1565m17.936s
sys     285m35.924s
--------------------

200 WARCs

time (/home/ruestn/spark-2.4.0-bin-hadoop2.7/bin/spark-shell --master local[44] --driver-memory 500G --conf spark.network.timeout=10000000 --conf spark.executor.heartbeatInterval=600s  --conf spark.driver.maxResultSize=250G --conf spark.local.dir=/tuna1/scratch/nruest/tmp --packages "io.archivesunleashed:aut:0.17.0" -i /home/ruestn/geocities-images-tests/test-200.scala | tee /tuna1/scratch/nruest/geocites/logs/test-200.log);

Results

  • Number of images: FAILED
  • Disk usage: FAILED
  • Time: FAILED

500 WARCs

time (/home/ruestn/spark-2.4.0-bin-hadoop2.7/bin/spark-shell --master local[44] --driver-memory 500G --conf spark.network.timeout=10000000 --conf spark.executor.heartbeatInterval=600s  --conf spark.driver.maxResultSize=250G --conf spark.local.dir=/tuna1/scratch/nruest/tmp --packages "io.archivesunleashed:aut:0.17.0" -i /home/ruestn/geocities-images-tests/test-500.scala | tee /tuna1/scratch/nruest/geocites/logs/test-500.log);

Results

  • Number of images: Did not run
  • Disk usage: Did not run
  • Time: Did not run

Environment information

  • AUT version: 0.17.0
  • OS: Ubuntu 16.04
  • Java version: Java 8
  • Apache Spark version: 2.3.2, 2.4.0
  • Apache Spark w/aut: --packages
  • Apache Spark command used to run AUT: above

Additional context

  • We hit a ulimit problem on tuna with what I believe are the default settings (a sketch for checking and raising the limit follows these bullets). tuna is also using ZFS as its filesystem.
java.io.FileNotFoundException: /tuna1/scratch/nruest/geocites/images/geocities-image-a97c139a3a31467aeb4bbfa36edaa775.gif (Too many open files)
        at java.io.RandomAccessFile.open0(Native Method)
        at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
        at javax.imageio.stream.FileImageOutputStream.<init>(FileImageOutputStream.java:69)
        at com.sun.imageio.spi.FileImageOutputStreamSpi.createOutputStreamInstance(FileImageOutputStreamSpi.java:55)
        at javax.imageio.ImageIO.createImageOutputStream(ImageIO.java:419)
        at javax.imageio.ImageIO.write(ImageIO.java:1530)
        at io.archivesunleashed.df.package$SaveImage$$anonfun$saveToDisk$1.apply(package.scala:72)
        at io.archivesunleashed.df.package$SaveImage$$anonfun$saveToDisk$1.apply(package.scala:54)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:927)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:927)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:109)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
  • When run on rho over all the files, I was able to extract ~20M images (there should be ~121M in total), but I ran into a lot of disk space errors even though I had plenty of free disk space and inodes. It might have been an ext4 issue?
java.io.FileNotFoundException: /mnt/vol1/data_sets/geocities/images-84553ea4b4737f20d6b20cbc16defc66.gif (No space left on device)
        at java.io.RandomAccessFile.open0(Native Method)
        at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
        at javax.imageio.stream.FileImageOutputStream.<init>(FileImageOutputStream.java:69)
        at com.sun.imageio.spi.FileImageOutputStreamSpi.createOutputStreamInstance(FileImageOutputStreamSpi.java:55)
        at javax.imageio.ImageIO.createImageOutputStream(ImageIO.java:419)
        at javax.imageio.ImageIO.write(ImageIO.java:1530)
        at io.archivesunleashed.df.package$SaveImage$$anonfun$saveToDisk$1.apply(package.scala:72)
        at io.archivesunleashed.df.package$SaveImage$$anonfun$saveToDisk$1.apply(package.scala:54)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:927)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:927)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:109)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
  • Other errors:
java.io.EOFException
        at javax.imageio.stream.ImageInputStreamImpl.readUnsignedByte(ImageInputStreamImpl.java:222)
        at com.sun.imageio.plugins.gif.GIFImageReader.read(GIFImageReader.java:916)
        at javax.imageio.ImageReader.read(ImageReader.java:939)
        at io.archivesunleashed.df.package$SaveImage$$anonfun$saveToDisk$1.apply(package.scala:66)
        at io.archivesunleashed.df.package$SaveImage$$anonfun$saveToDisk$1.apply(package.scala:54)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:927)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:927)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:109)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
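
As noted in the first bullet above, checking and raising the open-files limit is worth doing before launching a large job. A minimal sketch, assuming bash and permission to edit limits.conf (65536 is an illustrative value, not a tested setting):

# Check the current per-process open-files limit
ulimit -n

# Raise it for the current shell session before launching spark-shell
ulimit -n 65536

# To persist it across sessions, add lines like these to /etc/security/limits.conf:
#   ruestn  soft  nofile  65536
#   ruestn  hard  nofile  65536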

Expected behavior

I think we're hitting this because of our implementation and extraction script. I believe we iterate over the entire collection, identify all the images, and toss them into the data frame; then we iterate back over that, dump the images out to a tmp dir, and move them over to where they're actually supposed to end up. That requires a huge setting for spark.driver.maxResultSize. We should examine our implementation and see if it's possible to stream images out to disk as we find them, which should require much less overhead.
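
For illustration, a rough sketch of what streaming could look like — writing each image executor-side inside foreachPartition instead of collecting bytes back through the driver. The column names come from the scripts in this issue; the paths, the base64 assumption, and the filename scheme are mine, not aut's actual saveToDisk implementation:

import java.io.{BufferedOutputStream, FileOutputStream}
import java.util.Base64

import org.apache.spark.sql.Row
import io.archivesunleashed._
import io.archivesunleashed.df._

val df = RecordLoader.loadArchives("/path/to/warcs/*.gz", sc).extractImageDetailsDF()

// Each partition writes its own images straight to disk, so nothing is
// collected back to the driver and spark.driver.maxResultSize can stay small.
df.select($"bytes", $"md5", $"mime_type").foreachPartition { rows: Iterator[Row] =>
  rows.foreach { row =>
    // Assumption: the bytes column is a base64-encoded string; decode before writing.
    val image = Base64.getDecoder.decode(row.getAs[String]("bytes"))
    val ext = row.getAs[String]("mime_type").split("/").last // e.g. image/gif -> gif
    val out = new BufferedOutputStream(
      new FileOutputStream(s"/path/to/images/image-${row.getAs[String]("md5")}.$ext"))
    try out.write(image) finally out.close()
  }
}

Naming files by md5 also makes ordering irrelevant, so the orderBy(desc("bytes")) shuffle in the original script could be dropped entirely.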

@ruebot (Member Author) commented Jan 23, 2019

For reference, image extraction was implemented here: #234

@ruebot (Member Author) commented Jul 17, 2019

I think I have this working on tuna right now. I'll drop in the settings, logs, and script when I'm done. It's most likely a combination of a more recent version of Spark and some config settings on tuna that were changed by the sysadmin.

@ruebot (Member Author) commented Jul 17, 2019

aut-298-df-split-test.scala

import io.archivesunleashed._
import io.archivesunleashed.df._

val df = RecordLoader.loadArchives("/tuna1/scratch/nruest/geocites/warcs/1/*.gz", sc).extractImageDetailsDF();
val res = df.select($"bytes").saveToDisk("bytes", "/tuna1/scratch/nruest/geocites/df-images-test/1/aut-298-test-")
sys.exit

command

/home/ruestn/spark-2.4.3-bin-hadoop2.7/bin/spark-shell --master local[75] --driver-memory 300g --conf spark.network.timeout=10000000 --conf spark.executor.heartbeatInterval=600s --conf spark.driver.maxResultSize=4g --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.shuffle.compress=true --conf spark.rdd.compress=true --packages "io.archivesunleashed:aut:0.17.1-SNAPSHOT" -i /home/ruestn/aut-298-df-split-test.scala 2>&1 | tee /home/ruestn/aut-298-df-split-test-01.log

output

ruestn@tuna:/tuna1/scratch/nruest/geocites/df-images-test/1$ ls | wc -l
1692092
ruestn@tuna:/tuna1/scratch/nruest/geocites/df-images-test/1$ ls | head
aut-298-test--0000111a3b953aec7f0cfe8c67d1ef13.gif
aut-298-test--0000149ccfc79a8d592f80fd4d0428ae.gif
aut-298-test--000017e324fdc571fa32b2dc2179f299.JPEG
aut-298-test--00002e984a2d1c22d55e8c7e03a8b283.JPEG
aut-298-test--00002eb38a357b394047a10c59ec4f95.JPEG
aut-298-test--000033e6d3a81a7c9e9c8681cc6b4801.gif
aut-298-test--00003a521ae0c8b1f1cbcab0c42cf296.gif
aut-298-test--000040ed7782c926413e5c1f46a06fbc.gif
aut-298-test--000052770ee09057c84c59d5c5d8a7cd.JPEG
aut-298-test--00006544537064d8d3072bb0d059c7ab.gif

job log

I think we're good. I'll run this on the rest of GeoCities, and close the issue if everything checks out.

@ruebot (Member Author) commented Jul 23, 2019

Ran on the entire GeoCities dataset, split into 9 separate jobs:

Each of the data frames was exported as a CSV file (set 3 shown as an example):

import io.archivesunleashed._
import io.archivesunleashed.df._

val df = RecordLoader.loadArchives("/tuna1/scratch/nruest/geocites/warcs/3/*gz", sc).extractImageDetailsDF();
df.select($"url", $"mime_type", $"width", $"height", $"md5").orderBy(desc("md5")).write.csv("/home/ruestn/geocities-images/set-03")

sys.exit

Results:

ruestn@tuna:~$ head /tuna1/scratch/nruest/geocites/geocities-images-image-details.csv
http://it.geocities.com/grannoce/camere/thumb/camera_blu_001.jpg,image/jpeg,112,150,fffffef31a159782b97876b7a17eab92
http://ar.geocities.com/angeles_uno/PLAYMATES/1999/JUNIO/KIMBERLY_SPICER/06_small.jpg,image/jpeg,100,143,fffffd5fe6d986c04f028854bbd4a20a
http://in.geocities.com/nileshtx/images/DSC01219.jpg,image/jpeg,510,768,fffffc7244d39657dd286547fda3fd0d
http://kr.geocities.com/magicianclow/img/favor.gif,image/gif,71,20,fffff8a7566c250585fb4453594b9c3e
http://login.space2000.de/logo.gif,image/gif,168,49,fffff72ef7571cf00d0717ac96bfad07
http://91-143-80-250.blue.kundencontroller.de/logo.gif,image/gif,168,49,fffff72ef7571cf00d0717ac96bfad07
http://cf.geocities.com/rouquins/images/merlin0.jpg,image/jpeg,129,140,fffff077e30e213fa08cecc389a60bdb
http://ar.geocities.com/aliaga_fernandoo/ediciones/ed7/imagenes/menu/MENU7_r11_c21.jpg,image/jpeg,68,10,ffffe91beaf231ea8b5fc46a1c6b7f32
http://www.geocities.com/audy000/newspic1/qudes.jpg,image/jpeg,55,24,ffffd381a8c0ae2e6a7d63d8af6b893c
http://ca.geocities.com/brunette_george/holidays/dad_brendon_lighthouse.jpg,image/jpeg,300,226,ffffc83f77a1558222f40d7a44b1d464
ruestn@tuna:~/geocities-images$ wc -l *csv
    3457505 set-01.csv
    4891437 set-02.csv
    4110015 set-03.csv
    4189014 set-04.csv
    6202039 set-05.csv
   41925987 set-06.csv
   26969972 set-07.csv
   28244154 set-08.csv
   26486297 set-09.csv
  146476420 total

Our 9 directories of images (from above script):

1,692,092
2,581,462
2,489,849
1,973,500
3,972,130
23,516,218
17,568,023
19,238,377
7,998,676

Total of 81,030,327 unique images!
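
(The per-directory counts only deduplicate within each set, since filenames are md5-based. If we ever want to verify uniqueness across all nine sets, something like this over the CSVs should work — a sketch assuming md5 stays in the fifth column and no URL field contains an embedded comma:)

# md5 is the fifth CSV column; count distinct hashes across all nine sets
cut -d, -f5 set-0*.csv | sort -u | wc -l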

I think we're good to go.

ruebot closed this Jul 23, 2019

The Binary object extraction automation project board moved this from To do to Done on Jul 23, 2019.

@ianmilligan1 (Member) commented Jul 23, 2019

Fantastic, thanks @ruebot! Do you think any of this (new flags on the spark-shell command or tuning information more generally) should go into our documentation?

@ruebot (Member Author) commented Jul 23, 2019

Yeah, we might add a cautionary note to this section about file systems and flags. I can help flesh that out when the time comes.

@ianmilligan1 (Member) commented Jul 23, 2019

OK, awesome. I'll open an issue on the website so we remember.
