Refactor loadArchives() function to limit size of individual record #275

Open · borislin wants to merge 1 commit into master from limit-record-size

@borislin
Collaborator

borislin commented Oct 12, 2018

This PR refactors the loadArchives() function so that users can limit the size of individual records, preventing Spark jobs from failing due to overly large records.


What does this Pull Request do?

This PR adds the ability to limit the size of individual records processed by loadArchives(), preventing Spark jobs from failing on records that are too large.
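
In essence, the cap acts as an optional filter on record size. Below is a minimal sketch under that assumption; capBySize is a hypothetical helper for illustration, not the PR's actual code:

import org.apache.spark.rdd.RDD

// Hypothetical helper, not the PR's actual code: keep only records whose
// size is within the optional cap; with no cap, pass everything through.
def capBySize[T](records: RDD[T], sizeOf: T => Long,
                 maxRecordSize: Option[Long]): RDD[T] =
  maxRecordSize.fold(records)(limit => records.filter(r => sizeOf(r) <= limit))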

How should this be tested?

  • git fetch --all
  • git checkout limit-record-size
  • mvn clean install
  • Create an output directory with sub-directories:
    mkdir -p path/to/where/ever/you/can/write/output/all-domains path/to/where/ever/you/can/write/spark-jobs
  • Adapt the script below:
import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._

sc.setLogLevel("INFO")

// Input: a single Archive-It WARC; uncomment the glob to run over all of them.
//val input = "/tuna1/scratch/aut-issue-271/warcs/*.gz"
val input = "/tuna1/scratch/aut-issue-271/warcs/ARCHIVEIT-499-MONTHLY-JOB311846-20170703003948338-00020.warc.gz"

val output1 = "/tuna1/scratch/aut-issue-271/derivatives/all-domains"

// Optional cap on individual record size, in bytes; records larger than
// this are skipped.
val maxRecordSize: Option[Long] = Some(174552L)

// Load the archives with the size cap, keep valid pages, and write the
// content lengths of the surviving records in descending order.
RecordLoader.loadArchives(input, sc, maxRecordSize)
  .keepValidPages()
  .map(r => r.getContentBytes.length)
  .sortBy(r => r, ascending = false)
  .saveAsTextFile(output1)

sys.exit
  • Run the following command, with adapted paths, using Apache Spark 2.1.3:
    /home/b25lin/spark-2.1.3-bin-hadoop2.7/bin/spark-shell --master local[10] --driver-memory 30G --conf spark.network.timeout=100000000 --conf spark.executor.heartbeatInterval=6000s --conf spark.driver.maxResultSize=10G --jars "/tuna1/scratch/borislin/aut/target/aut-0.17.1-SNAPSHOT-fatjar.jar" -i /tuna1/scratch/aut-issue-271/spark_jobs/499.scala | tee /tuna1/scratch/aut-issue-271/spark_jobs/499.scala.log

  • Change the value of the maxRecordSize variable in the above script, or remove it, and observe the different results. Only records whose size is less than or equal to the user-provided maxRecordSize are processed; larger records are filtered out. A quick sanity check is sketched below.
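
One way to sanity-check the cap from the same spark-shell session (a hypothetical snippet reusing the sc, output1, and maxRecordSize values from the script above): every length written to the output should be at most the cap.

// Hypothetical check: no surviving record should exceed the cap.
val lengths = sc.textFile(output1).map(_.trim.toLong)
assert(lengths.filter(_ > maxRecordSize.get).isEmpty())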

Additional Notes:

  • This PR doesn't affect our current use and documentation of the loadArchives() function, since maxRecordSize is an optional parameter. But we should update our documentation to note that users can now pass maxRecordSize to limit record size; see the example below.
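
For illustration, the two call forms side by side (the two-argument call is the current documented usage; the third argument is the new optional cap used in the test script above):

RecordLoader.loadArchives(input, sc)                // current usage, unchanged
RecordLoader.loadArchives(input, sc, Some(174552L)) // opting in to the new size cap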

Results

My results are in /tuna1/scratch/aut-issue-271/derivatives and the log is in /tuna1/scratch/aut-issue-271/spark_jobs/499.scala.log.

Interested parties

@lintool @ianmilligan1 @ruebot

borislin requested review from lintool and ruebot on Oct 12, 2018

@ruebot

Member

ruebot commented Oct 12, 2018

How does this resolve #247? Seems like a solution to something we don't have a ticket for.

@lintool

Member

lintool commented Oct 12, 2018

@ruebot - @ianmilligan1 and I discussed this as a general strategy for dealing with edge-case WARCs and ARCs. Previously, we've dealt with the issue by increasing timeouts, which is janky. This seems like a cleaner solution.

@borislin Are you sure something like this works? IIRC, by the time we parse the header, we've already committed to parsing the entire record, which defeats the whole point of filtering by size. I believe this size check needs to be pushed further down into the reader.

The way to test this is to go back and examine those IA ARCs we failed on, stored on tuna.
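
To make the distinction concrete (a hypothetical sketch; RecordHeader, declaredLength, and the reader loop are placeholders, not the actual aut or webarchive-commons API): a reader-level check would consult the length declared in the WARC/ARC header and skip the body outright, instead of filtering after the whole record has already been parsed.

// Hypothetical reader-level check: decide from the header alone, before
// committing to parsing the record body.
final case class RecordHeader(url: String, declaredLength: Long)

def shouldParseBody(header: RecordHeader, maxRecordSize: Option[Long]): Boolean =
  maxRecordSize.forall(limit => header.declaredLength <= limit)

// In the reader loop (pseudocode): if shouldParseBody(header, cap), parse
// the body; otherwise skip declaredLength bytes in the input stream.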

@ruebot

Member

ruebot commented Oct 12, 2018

@lintool Ok. The ticket should at least have been updated, or this discussed on a call. We're moving goalposts and only telling part of the team.

@ianmilligan1

Tested and works for me, both on sample files and a few other scripts. It'd be good to have @lintool and @ruebot take a more granular look at the code, as I'm just focused on the end results and use. 😄

@borislin

Collaborator

borislin commented Oct 17, 2018

@ianmilligan1 I'm running aut with this PR to see whether we still have those heartbeat issues in IA ARCs. I think I may need to update this PR after that. Will keep this PR updated once I have more results...it's very slow to validate all those ARCs.

@ianmilligan1

Member

ianmilligan1 commented Oct 17, 2018

OK - please keep us posted on that front. I'll make sure to re-test if the PR changes.
