Image extraction does not scale with number of WARCs #298

Closed · ruebot opened this issue Jan 23, 2019 · 7 comments

@ruebot (Member) commented Jan 23, 2019

Describe the bug

aut fails in a variety of ways as the number of ARCs/WARCs you extract images from in a single job grows.

To Reproduce

Using this basic extraction script and scaling the number of WARCs per job:

import io.archivesunleashed._
import io.archivesunleashed.df._
val df = RecordLoader.loadArchives("/tuna1/scratch/nruest/geocites/warcs/9/test-NUMBER-OF-WARCS/*.gz", sc).extractImageDetailsDF();
val res = df.select($"bytes").orderBy(desc("bytes")).saveToDisk("bytes", "/tuna1/scratch/nruest/geocites/images/NUMBER-OF-WARCS/geocities-image")
sys.exit

10 WARCs

time (/home/ruestn/spark-2.4.0-bin-hadoop2.7/bin/spark-shell --master local[44] --driver-memory 500G --conf spark.network.timeout=10000000 --conf spark.executor.heartbeatInterval=600s --conf spark.driver.maxResultSize=250G --conf spark.local.dir=/tuna1/scratch/nruest/tmp --packages "io.archivesunleashed:aut:0.17.0" -i /home/ruestn/geocities-images-tests/test-10.scala | tee /tuna1/scratch/nruest/geocites/logs/test-10.log);

Results

  • Number of images: 240,424
  • Disk usage: 6.8G
  • Time:
-------------------
real    20m41.399s
user    190m50.356s
sys     29m14.584s
-------------------

100 WARCs

time (/home/ruestn/spark-2.4.0-bin-hadoop2.7/bin/spark-shell --master local[44] --driver-memory 500G --conf spark.network.timeout=10000000 --conf spark.executor.heartbeatInterval=600s  --conf spark.driver.maxResultSize=250G --conf spark.local.dir=/tuna1/scratch/nruest/tmp --packages "io.archivesunleashed:aut:0.17.0" -i /home/ruestn/geocities-images-tests/test-100.scala | tee /tuna1/scratch/nruest/geocites/logs/test-100.log);

Results

  • Number of images: 2,330,328
  • Disk usage: 66G
  • Time:
--------------------
real    230m21.423s
user    1885m4.936s
sys     270m5.440s
--------------------

(I accidentally ran it twice.)

--------------------
real    726m33.194s
user    1565m17.936s
sys     285m35.924s
--------------------

200 WARCs

time (/home/ruestn/spark-2.4.0-bin-hadoop2.7/bin/spark-shell --master local[44] --driver-memory 500G --conf spark.network.timeout=10000000 --conf spark.executor.heartbeatInterval=600s  --conf spark.driver.maxResultSize=250G --conf spark.local.dir=/tuna1/scratch/nruest/tmp --packages "io.archivesunleashed:aut:0.17.0" -i /home/ruestn/geocities-images-tests/test-200.scala | tee /tuna1/scratch/nruest/geocites/logs/test-200.log);

Results

  • Number of images: FAILED
  • Disk usage: FAILED
  • Time: FAILED

500 WARCs

time (/home/ruestn/spark-2.4.0-bin-hadoop2.7/bin/spark-shell --master local[44] --driver-memory 500G --conf spark.network.timeout=10000000 --conf spark.executor.heartbeatInterval=600s  --conf spark.driver.maxResultSize=250G --conf spark.local.dir=/tuna1/scratch/nruest/tmp --packages "io.archivesunleashed:aut:0.17.0" -i /home/ruestn/geocities-images-tests/test-500.scala | tee /tuna1/scratch/nruest/geocites/logs/test-500.log);

Results

  • Number of images: Did not run
  • Disk usage: Did not run
  • Time: Did not run

Environment information

  • AUT version: 0.17.0
  • OS: Ubuntu 16.04
  • Java version: Java 8
  • Apache Spark version: 2.3.2, 2.4.0
  • Apache Spark w/aut: --packages
  • Apache Spark command used to run AUT: above

Additional context

  • We hit a ulimit problem on tuna with what I believe are the default settings (a sketch for checking and raising the limit follows these bullets). tuna is also using ZFS as its filesystem.
java.io.FileNotFoundException: /tuna1/scratch/nruest/geocites/images/geocities-image-a97c139a3a31467aeb4bbfa36edaa775.gif (Too many open files)
        at java.io.RandomAccessFile.open0(Native Method)
        at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
        at javax.imageio.stream.FileImageOutputStream.<init>(FileImageOutputStream.java:69)
        at com.sun.imageio.spi.FileImageOutputStreamSpi.createOutputStreamInstance(FileImageOutputStreamSpi.java:55)
        at javax.imageio.ImageIO.createImageOutputStream(ImageIO.java:419)
        at javax.imageio.ImageIO.write(ImageIO.java:1530)
        at io.archivesunleashed.df.package$SaveImage$$anonfun$saveToDisk$1.apply(package.scala:72)
        at io.archivesunleashed.df.package$SaveImage$$anonfun$saveToDisk$1.apply(package.scala:54)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:927)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:927)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:109)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
  • When run on rho over all the files, I was able to extract ~20M images (there should be ~121M in total), but I ran into a lot of disk space errors even though I had plenty of free disk space and inodes. It might have been an ext4 issue?
java.io.FileNotFoundException: /mnt/vol1/data_sets/geocities/images-84553ea4b4737f20d6b20cbc16defc66.gif (No space left on device)
        at java.io.RandomAccessFile.open0(Native Method)
        at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
        at javax.imageio.stream.FileImageOutputStream.<init>(FileImageOutputStream.java:69)
        at com.sun.imageio.spi.FileImageOutputStreamSpi.createOutputStreamInstance(FileImageOutputStreamSpi.java:55)
        at javax.imageio.ImageIO.createImageOutputStream(ImageIO.java:419)
        at javax.imageio.ImageIO.write(ImageIO.java:1530)
        at io.archivesunleashed.df.package$SaveImage$$anonfun$saveToDisk$1.apply(package.scala:72)
        at io.archivesunleashed.df.package$SaveImage$$anonfun$saveToDisk$1.apply(package.scala:54)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:927)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:927)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:109)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
  • Other errors:
java.io.EOFException
        at javax.imageio.stream.ImageInputStreamImpl.readUnsignedByte(ImageInputStreamImpl.java:222)
        at com.sun.imageio.plugins.gif.GIFImageReader.read(GIFImageReader.java:916)
        at javax.imageio.ImageReader.read(ImageReader.java:939)
        at io.archivesunleashed.df.package$SaveImage$$anonfun$saveToDisk$1.apply(package.scala:66)
        at io.archivesunleashed.df.package$SaveImage$$anonfun$saveToDisk$1.apply(package.scala:54)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:927)
        at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:927)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:109)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
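
As noted in the first bullet above, checking and raising the open-files limit is worth doing before launching a large job. A minimal sketch, assuming bash and permission to edit limits.conf (65536 is an illustrative value, not a tested setting):

# Check the current per-process open-files limit
ulimit -n

# Raise it for the current shell session before launching spark-shell
ulimit -n 65536

# To persist it across sessions, add lines like these to /etc/security/limits.conf:
#   ruestn  soft  nofile  65536
#   ruestn  hard  nofile  65536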

Expected behavior

I think we're hitting this because of our implementation and extraction script. I believe we iterate over the entire collection, identify all the images, and toss them into the data frame; then we iterate back over that, dump the images out to a tmp dir, and move them over to where they're actually supposed to end up. That requires a huge setting for spark.driver.maxResultSize. We should examine our implementation and see if it's possible to stream images out to disk as we find them, which should require much less overhead.
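
For illustration, a rough sketch of what streaming could look like — writing each image executor-side inside foreachPartition instead of collecting bytes back through the driver. The column names come from the scripts in this issue; the paths, the base64 assumption, and the filename scheme are mine, not aut's actual saveToDisk implementation:

import java.io.{BufferedOutputStream, FileOutputStream}
import java.util.Base64

import org.apache.spark.sql.Row
import io.archivesunleashed._
import io.archivesunleashed.df._

val df = RecordLoader.loadArchives("/path/to/warcs/*.gz", sc).extractImageDetailsDF()

// Each partition writes its own images straight to disk, so nothing is
// collected back to the driver and spark.driver.maxResultSize can stay small.
df.select($"bytes", $"md5", $"mime_type").foreachPartition { rows: Iterator[Row] =>
  rows.foreach { row =>
    // Assumption: the bytes column is a base64-encoded string; decode before writing.
    val image = Base64.getDecoder.decode(row.getAs[String]("bytes"))
    val ext = row.getAs[String]("mime_type").split("/").last // e.g. image/gif -> gif
    val out = new BufferedOutputStream(
      new FileOutputStream(s"/path/to/images/image-${row.getAs[String]("md5")}.$ext"))
    try out.write(image) finally out.close()
  }
}

Naming files by md5 also makes ordering irrelevant, so the orderBy(desc("bytes")) shuffle in the original script could be dropped entirely.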

@ruebot (Member Author) commented Jan 23, 2019

For reference, image extraction was implemented here: #234

@ruebot (Member Author) commented Jul 17, 2019

I think I have this working on tuna right now. I'll drop in the settings, logs, and script when I'm done. It's most likely a combination of a more recent version of Spark and some config settings on tuna that were changed by the sysadmin.

@ruebot (Member Author) commented Jul 17, 2019

aut-298-df-split-test.scala

import io.archivesunleashed._
import io.archivesunleashed.df._

val df = RecordLoader.loadArchives("/tuna1/scratch/nruest/geocites/warcs/1/*.gz", sc).extractImageDetailsDF();
val res = df.select($"bytes").saveToDisk("bytes", "/tuna1/scratch/nruest/geocites/df-images-test/1/aut-298-test-")
sys.exit

command

/home/ruestn/spark-2.4.3-bin-hadoop2.7/bin/spark-shell --master local[75] --driver-memory 300g --conf spark.network.timeout=10000000 --conf spark.executor.heartbeatInterval=600s --conf spark.driver.maxResultSize=4g --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.shuffle.compress=true --conf spark.rdd.compress=true --packages "io.archivesunleashed:aut:0.17.1-SNAPSHOT" -i /home/ruestn/aut-298-df-split-test.scala 2>&1 | tee /home/ruestn/aut-298-df-split-test-01.log

output

ruestn@tuna:/tuna1/scratch/nruest/geocites/df-images-test/1$ ls | wc -l
1692092
ruestn@tuna:/tuna1/scratch/nruest/geocites/df-images-test/1$ ls | head
aut-298-test--0000111a3b953aec7f0cfe8c67d1ef13.gif
aut-298-test--0000149ccfc79a8d592f80fd4d0428ae.gif
aut-298-test--000017e324fdc571fa32b2dc2179f299.JPEG
aut-298-test--00002e984a2d1c22d55e8c7e03a8b283.JPEG
aut-298-test--00002eb38a357b394047a10c59ec4f95.JPEG
aut-298-test--000033e6d3a81a7c9e9c8681cc6b4801.gif
aut-298-test--00003a521ae0c8b1f1cbcab0c42cf296.gif
aut-298-test--000040ed7782c926413e5c1f46a06fbc.gif
aut-298-test--000052770ee09057c84c59d5c5d8a7cd.JPEG
aut-298-test--00006544537064d8d3072bb0d059c7ab.gif

job log

I think we're good. I'll run this on the rest of GeoCities, and close the issue if everything checks out.

@ruebot (Member Author) commented Jul 23, 2019

Ran on the entire GeoCities dataset, split into 9 separate jobs:

Each of the data frames was exported as a CSV file (set 3 shown as an example):

import io.archivesunleashed._
import io.archivesunleashed.df._

val df = RecordLoader.loadArchives("/tuna1/scratch/nruest/geocites/warcs/3/*gz", sc).extractImageDetailsDF();
df.select($"url", $"mime_type", $"width", $"height", $"md5").orderBy(desc("md5")).write.csv("/home/ruestn/geocities-images/set-03")

sys.exit

Results:

ruestn@tuna:~$ head /tuna1/scratch/nruest/geocites/geocities-images-image-details.csv
http://it.geocities.com/grannoce/camere/thumb/camera_blu_001.jpg,image/jpeg,112,150,fffffef31a159782b97876b7a17eab92
http://ar.geocities.com/angeles_uno/PLAYMATES/1999/JUNIO/KIMBERLY_SPICER/06_small.jpg,image/jpeg,100,143,fffffd5fe6d986c04f028854bbd4a20a
http://in.geocities.com/nileshtx/images/DSC01219.jpg,image/jpeg,510,768,fffffc7244d39657dd286547fda3fd0d
http://kr.geocities.com/magicianclow/img/favor.gif,image/gif,71,20,fffff8a7566c250585fb4453594b9c3e
http://login.space2000.de/logo.gif,image/gif,168,49,fffff72ef7571cf00d0717ac96bfad07
http://91-143-80-250.blue.kundencontroller.de/logo.gif,image/gif,168,49,fffff72ef7571cf00d0717ac96bfad07
http://cf.geocities.com/rouquins/images/merlin0.jpg,image/jpeg,129,140,fffff077e30e213fa08cecc389a60bdb
http://ar.geocities.com/aliaga_fernandoo/ediciones/ed7/imagenes/menu/MENU7_r11_c21.jpg,image/jpeg,68,10,ffffe91beaf231ea8b5fc46a1c6b7f32
http://www.geocities.com/audy000/newspic1/qudes.jpg,image/jpeg,55,24,ffffd381a8c0ae2e6a7d63d8af6b893c
http://ca.geocities.com/brunette_george/holidays/dad_brendon_lighthouse.jpg,image/jpeg,300,226,ffffc83f77a1558222f40d7a44b1d464
ruestn@tuna:~/geocities-images$ wc -l *csv
    3457505 set-01.csv
    4891437 set-02.csv
    4110015 set-03.csv
    4189014 set-04.csv
    6202039 set-05.csv
   41925987 set-06.csv
   26969972 set-07.csv
   28244154 set-08.csv
   26486297 set-09.csv
  146476420 total

Our 9 directories of images (from above script):

1,692,092
2,581,462
2,489,849
1,973,500
3,972,130
23,516,218
17,568,023
19,238,377
7,998,676

Total of 81,030,327 unique images!
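
(The per-directory counts only deduplicate within each set, since filenames are md5-based. If we ever want to verify uniqueness across all nine sets, something like this over the CSVs should work — a sketch assuming md5 stays in the fifth column and no URL field contains an embedded comma:)

# md5 is the fifth CSV column; count distinct hashes across all nine sets
cut -d, -f5 set-0*.csv | sort -u | wc -l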

I think we're good to go.

ruebot closed this Jul 23, 2019

The Binary object extraction automation project board moved this from To do to Done on Jul 23, 2019.

@ianmilligan1 (Member) commented Jul 23, 2019

Fantastic, thanks @ruebot! Do you think any of this (new flags on the spark-shell command or tuning information more generally) should go into our documentation?

@ruebot (Member Author) commented Jul 23, 2019

Yeah, we might add a cautionary note to this section about file systems and flags. I can help flesh that out when the time comes.

@ianmilligan1 (Member) commented Jul 23, 2019

OK, awesome. I'll open an issue on the website so we remember.
