Empty Text and Domain Derivative File Despite Successful Job #200
Comments
ianmilligan1 added the bug and Background jobs labels Oct 25, 2018
My initial thought was that this could be related to too many arguments with i.e. this works on both
Copying the collection over to
ianmilligan1 self-assigned this Nov 10, 2018
Testing on
import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._
sc.setLogLevel("INFO")
RecordLoader.loadArchives("/mnt/vol1/data_sets/auk_datasets/139/499/warcs/*.gz", sc).keepValidPages().map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(RemoveHttpHeader(r.getContentString)))).saveAsTextFile("/mnt/vol1/derivative_data/auk-montana/all-text")
Will keep the ticket posted.
Failed, presumably due to a large WARC (there's a 16GB one in there), so relaunched with:
/home/i2millig/spark-2.3.2-bin-hadoop2.7/bin/spark-shell --master local[8] --driver-memory 55G --conf spark.network.timeout=10000000 --packages "io.archivesunleashed:aut:0.17.0"
added a commit that referenced this issue Nov 19, 2018
Just to update the ticket: we're still not quite sure what happened here. On
ianmilligan1 commented Oct 25, 2018
Describe the bug
We've now run into this problem twice. When generating the basic set of derivatives via AUK for a large (10TB) collection of roughly 47,000 WARCs, the job reports success and the logs contain no errors or exceptions, yet the resulting derivative files are empty.
Expected behaviour
We would expect the derivative files to contain the extracted data rather than be empty.
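Since the job reports success either way, one way to catch this early might be to inspect the output directory after the job finishes. The following is a minimal sketch, assuming the derivatives are written with saveAsTextFile to a local filesystem path (so the directory holds part-* files and a _SUCCESS marker). DerivativeCheck, Summary, and summarise are hypothetical names for illustration and are not part of aut or AUK.

```scala
import java.nio.file.{Files, Path}
import scala.collection.JavaConverters._

// Hypothetical helper (not part of aut or AUK): summarises the part files
// that saveAsTextFile writes into a local output directory.
object DerivativeCheck {
  final case class Summary(
      partFiles: Int,
      emptyPartFiles: Int,
      totalBytes: Long,
      hasSuccessMarker: Boolean) {
    // The _SUCCESS marker can be present even when no data was written.
    def looksEmpty: Boolean = totalBytes == 0L
  }

  def summarise(outputDir: Path): Summary = {
    val stream = Files.list(outputDir)
    try {
      val entries = stream.iterator().asScala.toList
      // Hidden checksum files (.part-*.crc) do not start with "part-".
      val parts = entries.filter(_.getFileName.toString.startsWith("part-"))
      val sizes = parts.map(p => Files.size(p))
      Summary(
        partFiles = parts.size,
        emptyPartFiles = sizes.count(_ == 0L),
        totalBytes = sizes.sum,
        hasSuccessMarker = entries.exists(_.getFileName.toString == "_SUCCESS")
      )
    } finally stream.close()
  }
}
```

In spark-shell this could be called as DerivativeCheck.summarise(java.nio.file.Paths.get("/mnt/vol1/derivative_data/auk-montana/all-text")). A result where hasSuccessMarker is true and looksEmpty is also true would match the behaviour described in this ticket.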