Empty Text and Domain Derivative File Despite Successful Job #200

Closed
ianmilligan1 opened this Issue Oct 25, 2018 · 4 comments

ianmilligan1 commented Oct 25, 2018

Describe the bug
We've now run into this problem twice. When generating the basic set of derivatives via AUK for a large (10TB) collection of roughly 47,000 WARCs, the job reports success and the logs contain no errors or exceptions, yet the resulting derivative files are empty.

Expected behaviour
We would expect the text and domain derivative files to contain data.
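
Since the job reports success even when the output is empty, one option is a post-job guard that checks the files themselves. The following is a minimal sketch, not part of AUK: the function name `check_derivatives` and the idea of running it after the job are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical post-job check (not part of AUK): returns non-zero if any of
# the named derivative files is missing or has zero length.
check_derivatives() {
  local rc=0 f
  for f in "$@"; do
    # -s is true only when the file exists and its size is greater than zero
    if [ ! -s "$f" ]; then
      echo "empty or missing derivative: $f" >&2
      rc=1
    fi
  done
  return "$rc"
}
```

Called with the paths of the generated derivatives, it would let a wrapper script mark the job as failed instead of successful when a file comes out empty.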

ianmilligan1 commented Oct 25, 2018

My initial thought was that this could be related to passing too many arguments to cat (which we could replace with find). We should probably make that change in any case, but shouldn't our system still be able to handle collections of this size?

For example, this works on both rho and tuna:

i2millig@rho:~/test$ touch part-{00000..48269}
i2millig@rho:~/test$ ls -1 | wc -l
48271
i2millig@rho:~/test$ cat part* > all-text.txt
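
One caveat about the test above: the limit behind "Argument list too long" is on the total size of the arguments (plus the environment), not on the number of files. The test uses short relative names, so roughly 48,000 ten-character names come to only about half a megabyte. If a job expands a glob with a long directory prefix, every argument grows by the length of that prefix. The sketch below estimates that size; `arg_bytes` is a hypothetical helper, and the figure is approximate because the kernel also counts environment variables.

```shell
#!/usr/bin/env bash
# Hypothetical helper: estimate how many bytes a list of arguments would
# occupy on a command line. Shell functions are not exec'd, so calling this
# with a very large glob does not itself hit the limit.
arg_bytes() {
  local total=0 a
  for a in "$@"; do
    # ${#a} counts characters (equal to bytes for ASCII names);
    # +1 accounts for the NUL terminator of each argument
    total=$(( total + ${#a} + 1 ))
  done
  echo "$total"
}
```

For instance, `arg_bytes part-*` could be compared against the value printed by `getconf ARG_MAX` on the machine in question.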

ruebot commented Oct 29, 2018

Copying the collection over to rho now to do some testing. Should be done in a few days.

ianmilligan1 self-assigned this Nov 10, 2018

ianmilligan1 commented Nov 19, 2018

Testing on rho with:

import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._
sc.setLogLevel("INFO")

RecordLoader.loadArchives("/mnt/vol1/data_sets/auk_datasets/139/499/warcs/*.gz", sc).keepValidPages().map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(RemoveHttpHeader(r.getContentString)))).saveAsTextFile("/mnt/vol1/derivative_data/auk-montana/all-text")

Will keep the ticket updated.

That run failed, presumably because of a large WARC (there's a 16GB one in the collection), so I relaunched with:

/home/i2millig/spark-2.3.2-bin-hadoop2.7/bin/spark-shell --master local[8] --driver-memory 55G --conf spark.network.timeout=10000000 --packages "io.archivesunleashed:aut:0.17.0"

ruebot added a commit that referenced this issue Nov 19, 2018

Use find with cat; address #200.
- We may hit (and probably have with the mt.gov job) '-bash: /bin/cat: Argument list too long'
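
For readers who want to see the general idea, here is a minimal sketch of concatenating part files with find instead of a shell glob. It is not necessarily the command used in the commit: the function name `concat_parts`, the `part-*` pattern, and the choice to sort by name are assumptions, and `-maxdepth`, `-print0`, `sort -z` and `xargs -0` are GNU/BSD extensions rather than POSIX.

```shell
#!/usr/bin/env bash
# Hypothetical helper: concatenate every part-* file in a directory into one
# output file without putting all the file names on a single command line.
concat_parts() {
  local dir=$1 out=$2
  # find emits NUL-separated paths, sort -z orders them by name, and xargs
  # splits them into as many cat invocations as the argument limit requires.
  # The redirection applies to xargs, so all batches land in the same file.
  find "$dir" -maxdepth 1 -type f -name 'part-*' -print0 \
    | sort -z \
    | xargs -0 cat > "$out"
}
```

A call would look like `concat_parts /path/to/all-text all-text.txt`, where both paths are placeholders. The output file should not itself match `part-*` in the same directory, or it could be picked up as input.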

ianmilligan1 commented Nov 28, 2018

Just to update the ticket: we're still not quite sure what happened here. On rho the full-text derivative generated just fine, and we were able to cat the files together. The find-based approach also worked on the production server.
