Refactor loadArchives() function to limit size of individual record #275
Conversation
borislin requested review from lintool and ruebot Oct 12, 2018
ruebot
Oct 12, 2018
Member
How does this resolve #247? Seems like a solution to something we don't have a ticket for.
lintool
Oct 12, 2018
Member
@ruebot - @ianmilligan1 and I discussed this as a general strategy for dealing with edge-case WARCs and ARCs. Previously, we've dealt with the issue by increasing timeouts, which is janky. This seems like a cleaner solution.
@borislin Are you sure something like this works? IIRC, by the time we parse the header, we've already committed to parsing the entire record, which defeats the whole point of filtering by size. I believe this size check needs to be pushed further down into the reader.
The way to test this is to go back and examine those IA ARCs we failed on - stored on tuna.
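To make the concern concrete, a post-parse size filter looks roughly like the sketch below. This is a hedged illustration, not this PR's diff; the `getContentBytes` accessor and the variable names are assumed for the example.

```scala
import io.archivesunleashed._

// Assumed example: cap records at 100 MB after loading.
val maxRecordSize = 100 * 1024 * 1024

val records = RecordLoader.loadArchives("/path/to/arcs", sc)
  .filter(r => r.getContentBytes.length <= maxRecordSize)

// The concern raised above: by the time this filter runs, the reader has
// already materialized each record's full body, so an oversized record can
// still stall or blow up the job. Pushing the check into the reader lets
// it skip the body based on the declared record length instead.
```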
ruebot
Oct 12, 2018
Member
@lintool Ok. The ticket should at least have been updated, or this discussed on a call. We're moving goal posts, and only telling part of the team.
ianmilligan1 approved these changes Oct 17, 2018
borislin
Oct 17, 2018
Collaborator
@ianmilligan1 I'm running aut with this PR to see whether we still have those heartbeat issues in IA ARCs. I think I may need to update this PR after that. Will keep this PR updated once I have more results... it's very slow to validate all those ARCs.
ianmilligan1
Oct 17, 2018
Member
OK - please keep us posted on that front. I'll make sure to re-test if the PR changes.
borislin commented Oct 12, 2018
This PR refactors the loadArchives() function to let users limit the size of individual records, preventing Spark jobs from failing due to large records.
GitHub issue(s):
If you are responding to an issue, please mention their numbers below.
What does this Pull Request do?
This PR adds functionality to the loadArchives() function to limit the size of individual records to be processed, preventing Spark jobs from failing due to records that are too big.
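As a rough sketch of the call shape this adds: the parameter name comes from the description below, but its exact position and default in the merged signature are assumptions.

```scala
import io.archivesunleashed._

// Existing two-argument calls keep working because maxRecordSize is
// optional; the named-parameter form here is illustrative only.
val records = RecordLoader.loadArchives(
  "/path/to/warcs",
  sc,
  maxRecordSize = 100 * 1024 * 1024 // assumed 100 MB cap
)
```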
How should this be tested?
git fetch --all
git checkout limit-record-size
mvn clean install
mkdir -p path/to/where/ever/you/can/write/output/all-domains path/to/where/ever/you/can/write/spark-jobs
Run the following command, with paths adapted to your environment, using Apache Spark 2.1.3:
/home/b25lin/spark-2.1.3-bin-hadoop2.7/bin/spark-shell --master local[10] --driver-memory 30G --conf spark.network.timeout=100000000 --conf spark.executor.heartbeatInterval=6000s --conf spark.driver.maxResultSize=10G --jars "/tuna1/scratch/borislin/aut/target/aut-0.17.1-SNAPSHOT-fatjar.jar" -i /tuna1/scratch/aut-issue-271/spark_jobs/499.scala | tee /tuna1/scratch/aut-issue-271/spark_jobs/499.scala.log
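The driver script 499.scala isn't included in this PR; below is a hedged sketch of what such a script plausibly looks like, using aut's standard domain-count derivative to match the all-domains output directory created above. All paths, and the way maxRecordSize is wired in, are assumptions.

```scala
import io.archivesunleashed._
import io.archivesunleashed.matchbox._

// Assumed cap; change or remove this value to compare results, as the
// testing notes below describe.
val maxRecordSize = 100 * 1024 * 1024

RecordLoader.loadArchives("/path/to/ia-arcs", sc, maxRecordSize = maxRecordSize)
  .keepValidPages()                   // keep successfully crawled HTML pages
  .map(r => ExtractDomain(r.getUrl))  // pull the domain from each URL
  .countItems()                       // frequency count, descending
  .saveAsTextFile("/path/to/output/all-domains")

sys.exit // end the spark-shell session once the job finishes
```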
Change the value of the maxRecordSize variable in the above script to a different value, or remove it, to see different results being produced. Only records whose size is smaller than or equal to the user-provided maxRecordSize will be kept.
Additional Notes:
No changes are needed to existing calls of the loadArchives() function since maxRecordSize is an optional parameter. But we should update our documentation to reflect that users can now use this maxRecordSize variable to limit record size.
Results
My results are in /tuna1/scratch/aut-issue-271/derivatives and the log is in /tuna1/scratch/aut-issue-271/spark_jobs/499.scala.log
Interested parties
@lintool @ianmilligan1 @ruebot