Image extraction does not scale with number of WARCs #298
Comments
For reference, image extraction was implemented here: #234
ruebot added the bug, optimization, Scala, and DataFrames labels on Jan 24, 2019
ruebot added this to To do in Binary object extraction on Jan 31, 2019
I think I have this working on tuna right now. I'll drop in the settings, logs, and script when I'm done. Pretty sure it might be a combination of a more recent version of Spark and some config settings on tuna that were changed by the sysadmin.
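The tuna settings never made it into this thread, so the snippet below is purely illustrative of the kind of configuration involved; the values are assumptions, and in practice these would be passed to spark-shell via --conf flags rather than built in code:

```scala
// Illustrative sketch only: the actual Spark version and tuna settings were
// never posted in this thread. In practice these are passed to spark-shell
// via --conf; the SparkSession builder is used here purely for illustration.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("aut-image-extraction")            // assumed name
  .config("spark.driver.maxResultSize", "0")  // assumed value; 0 removes the limit
  .getOrCreate()
```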
aut-298-df-split-test.scala:

```scala
import io.archivesunleashed._
import io.archivesunleashed.df._

val df = RecordLoader.loadArchives("/tuna1/scratch/nruest/geocites/warcs/1/*.gz", sc).extractImageDetailsDF();
val res = df.select($"bytes").saveToDisk("bytes", "/tuna1/scratch/nruest/geocites/df-images-test/1/aut-298-test-")
sys.exit
```

(The spark-shell command and its output were attached as collapsed snippets.)
I think we're good. I'll run this on the rest of GeoCities, and close it if we're good to go.
Ran on the entire GeoCities dataset, split into 9 separate jobs.

Results: our 9 directories of images (from the above script) contain a total of 81,030,327 unique images! I think we're good to go.
ruebot closed this on Jul 23, 2019
Binary object extraction automation moved this from To do to Done on Jul 23, 2019
Fantastic, thanks @ruebot! Do you think any of this (new flags on the spark-shell command or tuning information more generally) should go into our documentation?
Yeah, we might add a cautionary note to this section about file systems and flags. I can help flesh that out when the time comes.
OK awesome. I will open up an issue on the website so we remember.
ruebot commented Jan 23, 2019 (edited)
Describe the bug
`aut` fails in a variety of ways the more ARCs/WARCs you try to extract images from at a time.

To Reproduce
Using this basic extraction script and scaling the number of WARCs per job:
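The attached script itself was not captured on this page; a plausible minimal sketch, assuming the same DataFrame image-extraction API used in the aut-298-df-split-test.scala script above, with placeholder paths:

```scala
// Sketch only: the original attached script was not captured here. Assumes
// the same DataFrame API (extractImageDetailsDF, saveToDisk) used in the
// aut-298-df-split-test.scala script above; paths are placeholders.
import io.archivesunleashed._
import io.archivesunleashed.df._

val df = RecordLoader.loadArchives("/path/to/warcs/*.gz", sc).extractImageDetailsDF()
df.select($"bytes").saveToDisk("bytes", "/path/to/output/image-")
sys.exit
```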
- 10 WARCs: results
- 100 WARCs: results (I accidentally ran it twice.)
- 200 WARCs: results
- 500 WARCs: results
Environment information
Additional context

- Possibly a `ulimit` problem on `tuna`, with what I believe are the default settings. `tuna` is also using ZFS as a filesystem.
- On `rho`, over all the files, I was able to extract ~20M images (there should be about ~121M in total), but I ran into a lot of disk space issues even though I had plenty of disk space free; possibly inode exhaustion. Might have been an ext4 issue?

Expected behavior
I think we're hitting this because of our implementation and extraction script. I believe we're iterating over the entire collection, identifying all the images and tossing them into the DataFrame, then iterating back over that and dumping them out to a tmp dir, then moving them over to the place they're actually supposed to end up. This requires a huge setting for `spark.driver.maxResultSize`. We should examine our implementation and see if it is possible to stream images out as we find them; that should require less overhead.
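A minimal sketch of that streaming idea, not aut's actual implementation: have each executor decode and write its partition's images directly, so the `bytes` column is never collected back to the driver. It assumes the base64-encoded `bytes` column produced by `extractImageDetailsDF`; the output directory and UUID-based file names are placeholders.

```scala
// Sketch only, not aut's implementation. Assumes a DataFrame from
// extractImageDetailsDF with a base64-encoded `bytes` column; the output
// directory and file names are placeholders, and files land on each
// executor's local filesystem.
import java.io.FileOutputStream
import java.util.{Base64, UUID}

import org.apache.spark.sql.Row

val outDir = "/path/to/images" // placeholder

df.select($"bytes").foreachPartition { rows: Iterator[Row] =>
  rows.foreach { row =>
    // Decode and write on the executor, so nothing accumulates on the driver
    // and spark.driver.maxResultSize no longer needs a huge value.
    val bytes = Base64.getDecoder.decode(row.getAs[String]("bytes"))
    val out = new FileOutputStream(s"$outDir/${UUID.randomUUID}")
    try out.write(bytes) finally out.close()
  }
}
```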