Refactor loadArchives() function to limit size of individual record #275

Open · borislin wants to merge 1 commit into master from limit-record-size

@borislin
Collaborator

borislin commented Oct 12, 2018

This PR refactors the loadArchives() function so that users can limit the size of individual records, preventing Spark jobs from failing due to overly large records.


What does this Pull Request do?

This PR adds the ability to limit the size of individual records processed by loadArchives(), preventing Spark jobs from failing on records that are too large.
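
In essence, the cap acts as an optional filter on record size. Below is a minimal sketch under that assumption; capBySize is a hypothetical helper for illustration, not the PR's actual code:

import org.apache.spark.rdd.RDD

// Hypothetical helper, not the PR's actual code: keep only records whose
// size is within the optional cap; with no cap, pass everything through.
def capBySize[T](records: RDD[T], sizeOf: T => Long,
                 maxRecordSize: Option[Long]): RDD[T] =
  maxRecordSize.fold(records)(limit => records.filter(r => sizeOf(r) <= limit))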

How should this be tested?

  • git fetch --all
  • git checkout limit-record-size
  • mvn clean install
  • Create an output directory with sub-directories:
    mkdir -p path/to/where/ever/you/can/write/output/all-domains path/to/where/ever/you/can/write/spark-jobs
  • Adapt the script below:
import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._

sc.setLogLevel("INFO")

// Input: a single Archive-It WARC; uncomment the glob to run over all of them.
//val input = "/tuna1/scratch/aut-issue-271/warcs/*.gz"
val input = "/tuna1/scratch/aut-issue-271/warcs/ARCHIVEIT-499-MONTHLY-JOB311846-20170703003948338-00020.warc.gz"

val output1 = "/tuna1/scratch/aut-issue-271/derivatives/all-domains"

// Optional cap on individual record size, in bytes; records larger than
// this are skipped.
val maxRecordSize: Option[Long] = Some(174552L)

// Load the archives with the size cap, keep valid pages, and write the
// content lengths of the surviving records in descending order.
RecordLoader.loadArchives(input, sc, maxRecordSize)
  .keepValidPages()
  .map(r => r.getContentBytes.length)
  .sortBy(r => r, ascending = false)
  .saveAsTextFile(output1)

sys.exit
  • Run the following command, with adapted paths, using Apache Spark 2.1.3:
    /home/b25lin/spark-2.1.3-bin-hadoop2.7/bin/spark-shell --master local[10] --driver-memory 30G --conf spark.network.timeout=100000000 --conf spark.executor.heartbeatInterval=6000s --conf spark.driver.maxResultSize=10G --jars "/tuna1/scratch/borislin/aut/target/aut-0.17.1-SNAPSHOT-fatjar.jar" -i /tuna1/scratch/aut-issue-271/spark_jobs/499.scala | tee /tuna1/scratch/aut-issue-271/spark_jobs/499.scala.log

  • Change the value of the maxRecordSize variable in the above script, or remove it, and observe the different results. Only records whose size is less than or equal to the user-provided maxRecordSize are processed; larger records are filtered out. A quick sanity check is sketched below.
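
One way to sanity-check the cap from the same spark-shell session (a hypothetical snippet reusing the sc, output1, and maxRecordSize values from the script above): every length written to the output should be at most the cap.

// Hypothetical check: no surviving record should exceed the cap.
val lengths = sc.textFile(output1).map(_.trim.toLong)
assert(lengths.filter(_ > maxRecordSize.get).isEmpty())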

Additional Notes:

  • This PR doesn't affect our current use and documentation of the loadArchives() function, since maxRecordSize is an optional parameter. But we should update our documentation to note that users can now pass maxRecordSize to limit record size; see the example below.
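
For illustration, the two call forms side by side (the two-argument call is the current documented usage; the third argument is the new optional cap used in the test script above):

RecordLoader.loadArchives(input, sc)                // current usage, unchanged
RecordLoader.loadArchives(input, sc, Some(174552L)) // opting in to the new size cap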

Results

My results are in /tuna1/scratch/aut-issue-271/derivatives and the log is in /tuna1/scratch/aut-issue-271/spark_jobs/499.scala.log.

Interested parties

@lintool @ianmilligan1 @ruebot

borislin requested review from lintool and ruebot on Oct 12, 2018

@ruebot

Member

ruebot commented Oct 12, 2018

How does this resolve #247? Seems like a solution to something we don't have a ticket for.

@lintool

Member

lintool commented Oct 12, 2018

@ruebot - @ianmilligan1 and I discussed this as a general strategy for dealing with edge-case WARCs and ARCs. Previously, we've dealt with the issue by increasing timeouts, which is janky. This seems like a cleaner solution.

@borislin Are you sure something like this works? IIRC, by the time we parse the header, we've already committed to parsing the entire record, which defeats the whole point of filtering by size. I believe this size check needs to be pushed further down into the reader.

The way to test this is to go back and examine those IA ARCs we failed on, stored on tuna.
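
To make the distinction concrete (a hypothetical sketch; RecordHeader, declaredLength, and the reader loop are placeholders, not the actual aut or webarchive-commons API): a reader-level check would consult the length declared in the WARC/ARC header and skip the body outright, instead of filtering after the whole record has already been parsed.

// Hypothetical reader-level check: decide from the header alone, before
// committing to parsing the record body.
final case class RecordHeader(url: String, declaredLength: Long)

def shouldParseBody(header: RecordHeader, maxRecordSize: Option[Long]): Boolean =
  maxRecordSize.forall(limit => header.declaredLength <= limit)

// In the reader loop (pseudocode): if shouldParseBody(header, cap), parse
// the body; otherwise skip declaredLength bytes in the input stream.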

@ruebot

Member

ruebot commented Oct 12, 2018

@lintool Ok. The ticket should at least have been updated, or this discussed on a call. We're moving goalposts and only telling part of the team.

@ianmilligan1

Tested and works for me, both on sample files and a few other scripts. It'd be good to have @lintool and @ruebot take a more granular look at the code, as I'm just focused on the end results and use. 😄

@borislin

Collaborator

borislin commented Oct 17, 2018

@ianmilligan1 I'm running aut with this PR to see whether we still have those heartbeat issues in IA ARCs. I think I may need to update this PR after that. Will keep this PR updated once I have more results...it's very slow to validate all those ARCs.

@ianmilligan1

Member

ianmilligan1 commented Oct 17, 2018

OK - please keep us posted on that front. I'll make sure to re-test if the PR changes.
