Empty Text and Domain Derivative File Despite Successful Job #200

ianmilligan1 · Oct 25, 2018

Describe the bug
We've now run into this problem twice. When generating the basic set of derivatives via AUK for a large 10TB collection of roughly 47,000 WARCs, the job reports success and the logs contain no errors or exceptions, yet the resulting derivative files are empty.

Expected behaviour
We would expect the text and domain derivative files to be populated when the job reports success.
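
One way to catch this failure mode is to check the derivative itself after the job reports success, rather than relying on the job status alone. A minimal sketch, assuming the derivative is a single file on local disk; `check_derivative` is a hypothetical helper name and not something AUK provides:

```shell
# Return non-zero when a derivative file is missing or zero bytes long.
# check_derivative is a hypothetical helper, not part of AUK.
check_derivative() {
  local file="$1"
  # -s is true only if the file exists and has a size greater than zero.
  if [ ! -s "$file" ]; then
    echo "derivative is missing or empty: $file" >&2
    return 1
  fi
}
```

Calling `check_derivative all-text.txt` after the job finishes would flag the empty output described above, though it cannot tell a legitimately empty result from a failed one.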

ianmilligan1 · Oct 25, 2018

My initial thought was that this could be caused by passing too many arguments to cat (in which case we could replace it with find). We should probably make that change in any case, but our systems should still be able to handle collections of this size.

i.e. this works on both rho and tuna:

i2millig@rho:~/test$ touch part-{00000..48269}
i2millig@rho:~/test$ ls -1 | wc -l
48271
i2millig@rho:~/test$ cat part* > all-text.txt
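
For comparison, a find-based concatenation avoids the argument-list limit altogether, because the shell never expands the full list of part files. A minimal sketch, assuming the Spark output is a flat directory of `part-*` files and that GNU-style `sort -z` and `xargs -0` are available; `concat_parts` is a hypothetical helper name, and this is not necessarily how AUK combines the files:

```shell
# Concatenate Spark part files without expanding them on one command line.
# concat_parts is a hypothetical helper, not part of AUK.
concat_parts() {
  local parts_dir="$1" out_file="$2"
  # find emits the part files NUL-delimited, sort -z restores partition order,
  # and xargs -0 splits them across as many cat invocations as needed.
  find "$parts_dir" -maxdepth 1 -type f -name 'part-*' -print0 \
    | sort -z \
    | xargs -0 cat > "$out_file"
}
```

For example, `concat_parts ~/test all-text.txt` would stand in for the `cat part* > all-text.txt` step above. The output file should not itself match `part-*` inside the source directory, or it would be picked up as input.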

ruebot · Oct 29, 2018

Copying the collection over to rho now to do some testing. Should be done in a few days.

ianmilligan1 · Nov 19, 2018

Testing on rho with:

import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._
sc.setLogLevel("INFO")

RecordLoader.loadArchives("/mnt/vol1/data_sets/auk_datasets/139/499/warcs/*.gz", sc).keepValidPages().map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(RemoveHttpHeader(r.getContentString)))).saveAsTextFile("/mnt/vol1/derivative_data/auk-montana/all-text")

Will keep the ticket posted.

That run failed, presumably because of a large WARC (there's a 16GB one in there), so I relaunched with:

/home/i2millig/spark-2.3.2-bin-hadoop2.7/bin/spark-shell --master local[8] --driver-memory 55G --conf spark.network.timeout=10000000 --packages "io.archivesunleashed:aut:0.17.0"
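
To see which WARCs might be large enough to cause this kind of failure, the collection can be scanned by size before launching the job. A rough sketch, assuming the WARCs are `.gz` files somewhere under one directory; `list_large_warcs` and the threshold argument are illustrative names, not part of AUK or aut:

```shell
# Print .gz files under a directory that are larger than a given size.
# list_large_warcs is a hypothetical helper; min_size uses find's -size
# units (for example 1k, 500M, 10G).
list_large_warcs() {
  local warc_dir="$1" min_size="$2"
  find "$warc_dir" -type f -name '*.gz' -size "+$min_size" -print
}
```

For example, `list_large_warcs /mnt/vol1/data_sets/auk_datasets/139/499/warcs 10G` would list any WARCs over 10GB in the collection tested above.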

ianmilligan1 · Nov 28, 2018

Just to update the ticket: we're still not quite sure what happened here. On rho the full-text derivative generated just fine, and we were able to cat the files together. Using find also worked on the production server.

ianmilligan1 added bug Background jobs labels Oct 25, 2018

ianmilligan1 self-assigned this Nov 10, 2018

ianmilligan1 closed this Jan 30, 2019

archivesunleashed/auk

ruebot added a commit that referenced this issue Nov 19, 2018
