Patch for #246 & #271: Fix exception error when processing corrupted ARC files #272
Conversation
borislin requested review from ruebot and lintool on Oct 3, 2018
Thanks @borislin! I'll try and get this tested tonight.
ruebot
Oct 3, 2018
Member
I'm not able to replicate with all the 499 files, and the two files from aut-resources.
```
rm -rf ~/.m2/repository/*
rm -rf ~/git/aut/target
git fetch --all
git checkout fix-exception
mvn clean install
```
- Run the following:

```
/home/nruest/bin/spark-2.1.3-bin-hadoop2.7/bin/spark-shell --master local\[10\] --driver-memory 30G --conf spark.network.timeout=100000000 --conf spark.executor.heartbeatInterval=6000s --conf spark.driver.maxResultSize=10G --jars /home/nruest/git/aut/target/aut-0.16.1-SNAPSHOT-fatjar.jar -i /home/nruest/Dropbox/499-issues/spark-jobs/499.scala | tee /home/nruest/Dropbox/499-issues/spark-jobs/499.scala.log
```

- Where `499.scala` is:
```scala
import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._
sc.setLogLevel("INFO")
RecordLoader.loadArchives("/home/nruest/Dropbox/499-issues/*.gz", sc).keepValidPages().map(r => ExtractDomain(r.getUrl)).countItems().saveAsTextFile("/home/nruest/Dropbox/499-issues/output/all-domains/output")
RecordLoader.loadArchives("/home/nruest/Dropbox/499-issues/*.gz", sc).keepValidPages().map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString))).saveAsTextFile("/home/nruest/Dropbox/499-issues/output/all-text/output")
val links = RecordLoader.loadArchives("/home/nruest/Dropbox/499-issues/*.gz", sc).keepValidPages().map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString))).flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", "")))).filter(r => r._2 != "" && r._3 != "").countItems().filter(r => r._2 > 5)
WriteGraphML(links, "/home/nruest/Dropbox/499-issues/output/gephi/499-gephi.graphml")
sys.exit
```
- It fails. Log here.
ianmilligan1
Oct 3, 2018
Member
Getting closer! If I remove the empty `ARCHIVEIT-499-BIMONTHLY-5528-20131008090852109-00757-wbgrp-crawl053.us.archive.org-6443.warc.gz`, my adapted version of @ruebot's script above works. If that empty file is there, however, it fails the same way that it does for @ruebot above.

Is something weird going on with how we handle empty files?
borislin
Oct 3, 2018
Collaborator
@ruebot @ianmilligan1 Yup, the culprit is the empty file. I'm testing some new code to handle the empty file.
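For reference, here is a minimal sketch of that kind of guard, assuming the check happens at file-listing time with Hadoop's `FileSystem` API (the helper name and placement are illustrative, not necessarily what the eventual patch does):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical guard: drop zero-byte archives before Spark ever tries to
// read them, so an empty *.warc.gz cannot take down the whole job.
def nonEmptyArchives(dir: String): Seq[String] = {
  val fs = FileSystem.get(new Configuration())
  fs.listStatus(new Path(dir))
    .filter(s => s.isFile && s.getLen > 0) // keep only non-empty regular files
    .map(_.getPath.toString)
    .toSeq
}
```

An alternative is to catch the failure per record rather than per file; either way the empty file stops being fatal.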
ruebot
Oct 4, 2018
Member
Can confirm that if I remove the empty file, we are successful.
The question is whether or not we want to take care of it in this PR, or in a separate one where we create a new issue. My preference would be to just get it done here, since all of this is an issue in production on cloud.archivesunleashed.org.
codecov-io
Oct 4, 2018
Codecov Report
Merging #272 into master will increase coverage by <.01%. The diff coverage is 80%.
```diff
@@            Coverage Diff            @@
##           master     #272     +/-   ##
==========================================
+ Coverage   70.35%   70.36%   +<.01%
==========================================
  Files          41       41
  Lines        1039     1046       +7
  Branches      191      192       +1
==========================================
+ Hits          731      736       +5
- Misses        242      244       +2
  Partials       66       66
```
| Impacted Files | Coverage Δ |
|---|---|
| src/main/scala/io/archivesunleashed/package.scala | 84.82% <100%> (+0.7%) |
| ...java/io/archivesunleashed/data/ArcRecordUtils.java | 78.26% <50%> (-3.56%) |
Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Powered by Codecov. Last update c95a51d...7ecdd9e. Read the comment docs.
ruebot
Oct 4, 2018
Member

```
[nruest@wombat:aut] (git)-[fix-exception]-$ /home/nruest/bin/spark-2.1.3-bin-hadoop2.7/bin/spark-shell --master local\[10\] --driver-memory 30G --conf spark.network.timeout=100000000 --conf spark.executor.heartbeatInterval=6000s --conf spark.driver.maxResultSize=10G --jars /home/nruest/git/aut/target/aut-0.16.1-SNAPSHOT-fatjar.jar -i /home/nruest/Dropbox/499-issues/spark-jobs/499.scala | tee /home/nruest/Dropbox/499-issues/spark-jobs/499.scala.log
2018-10-03 22:26:19,522 [main] WARN NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-10-03 22:26:19,630 [main] WARN Utils - Your hostname, wombat resolves to a loopback address: 127.0.1.1; using 10.0.1.44 instead (on interface enp0s31f6)
2018-10-03 22:26:19,630 [main] WARN Utils - Set SPARK_LOCAL_IP if you need to bind to another address
Spark context Web UI available at http://10.0.1.44:4040
Spark context available as 'sc' (master = local[10], app id = local-1538619980066).
Spark session available as 'spark'.
Loading /home/nruest/Dropbox/499-issues/spark-jobs/499.scala...
import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._
java.io.FileNotFoundException: File /home/nruest/Dropbox/499-issues/*.gz does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:431)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1517)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1557)
at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:674)
at io.archivesunleashed.package$RecordLoader$.getFiles(package.scala:52)
at io.archivesunleashed.package$RecordLoader$.loadArchives(package.scala:66)
... 77 elided
java.io.FileNotFoundException: File /home/nruest/Dropbox/499-issues/*.gz does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:431)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1517)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1557)
at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:674)
at io.archivesunleashed.package$RecordLoader$.getFiles(package.scala:52)
at io.archivesunleashed.package$RecordLoader$.loadArchives(package.scala:66)
... 77 elided
java.io.FileNotFoundException: File /home/nruest/Dropbox/499-issues/*.gz does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:431)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1517)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1557)
at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:674)
at io.archivesunleashed.package$RecordLoader$.getFiles(package.scala:52)
at io.archivesunleashed.package$RecordLoader$.loadArchives(package.scala:66)
... 77 elided
<console>:33: error: not found: value links
WriteGraphML(links, "/home/nruest/Dropbox/499-issues/output/gephi/499-gephi.graphml")
^
2018-10-03 22:26:22,651 [Thread-1] INFO SparkContext - Invoking stop() from shutdown hook
2018-10-03 22:26:22,655 [Thread-1] INFO ServerConnector - Stopped Spark@1b22bfe2{HTTP/1.1}{0.0.0.0:4040}
2018-10-03 22:26:22,656 [Thread-1] INFO ContextHandler - Stopped o.s.j.s.ServletContextHandler@2e7af36e{/stages/stage/kill,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO ContextHandler - Stopped o.s.j.s.ServletContextHandler@4b56b031{/jobs/job/kill,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO ContextHandler - Stopped o.s.j.s.ServletContextHandler@4a058df8{/api,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO ContextHandler - Stopped o.s.j.s.ServletContextHandler@6dca31eb{/,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO ContextHandler - Stopped o.s.j.s.ServletContextHandler@64502326{/static,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO ContextHandler - Stopped o.s.j.s.ServletContextHandler@2d258eff{/executors/threadDump/json,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO ContextHandler - Stopped o.s.j.s.ServletContextHandler@1d6751e3{/executors/threadDump,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO ContextHandler - Stopped o.s.j.s.ServletContextHandler@63fd4dda{/executors/json,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO ContextHandler - Stopped o.s.j.s.ServletContextHandler@335f5c69{/executors,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO ContextHandler - Stopped o.s.j.s.ServletContextHandler@22825e1e{/environment/json,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO ContextHandler - Stopped o.s.j.s.ServletContextHandler@55c57422{/environment,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO ContextHandler - Stopped o.s.j.s.ServletContextHandler@2c1f8dbd{/storage/rdd/json,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO ContextHandler - Stopped o.s.j.s.ServletContextHandler@4cdba2ed{/storage/rdd,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO ContextHandler - Stopped o.s.j.s.ServletContextHandler@72b0a004{/storage/json,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,657 [Thread-1] INFO ContextHandler - Stopped o.s.j.s.ServletContextHandler@171dc7c3{/storage,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,657 [Thread-1] INFO ContextHandler - Stopped o.s.j.s.ServletContextHandler@76f2dad9{/stages/pool/json,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,657 [Thread-1] INFO ContextHandler - Stopped o.s.j.s.ServletContextHandler@499ee966{/stages/pool,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,657 [Thread-1] INFO ContextHandler - Stopped o.s.j.s.ServletContextHandler@42e4e589{/stages/stage/json,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,657 [Thread-1] INFO ContextHandler - Stopped o.s.j.s.ServletContextHandler@43245559{/stages/stage,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,657 [Thread-1] INFO ContextHandler - Stopped o.s.j.s.ServletContextHandler@5d5c41e5{/stages/json,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,657 [Thread-1] INFO ContextHandler - Stopped o.s.j.s.ServletContextHandler@715f45c6{/stages,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,657 [Thread-1] INFO ContextHandler - Stopped o.s.j.s.ServletContextHandler@794f11cd{/jobs/job/json,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,657 [Thread-1] INFO ContextHandler - Stopped o.s.j.s.ServletContextHandler@562919fe{/jobs/job,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,657 [Thread-1] INFO ContextHandler - Stopped o.s.j.s.ServletContextHandler@1ac6dd3d{/jobs/json,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,657 [Thread-1] INFO ContextHandler - Stopped o.s.j.s.ServletContextHandler@36c7cbe1{/jobs,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,658 [Thread-1] INFO SparkUI - Stopped Spark web UI at http://10.0.1.44:4040
2018-10-03 22:26:22,663 [dispatcher-event-loop-2] INFO MapOutputTrackerMasterEndpoint - MapOutputTrackerMasterEndpoint stopped!
2018-10-03 22:26:22,669 [Thread-1] INFO MemoryStore - MemoryStore cleared
2018-10-03 22:26:22,669 [Thread-1] INFO BlockManager - BlockManager stopped
2018-10-03 22:26:22,672 [Thread-1] INFO BlockManagerMaster - BlockManagerMaster stopped
2018-10-03 22:26:22,674 [dispatcher-event-loop-7] INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint - OutputCommitCoordinator stopped!
2018-10-03 22:26:22,675 [Thread-1] INFO SparkContext - Successfully stopped SparkContext
2018-10-03 22:26:22,675 [Thread-1] INFO ShutdownHookManager - Shutdown hook called
2018-10-03 22:26:22,676 [Thread-1] INFO ShutdownHookManager - Deleting directory /tmp/spark-fbe65156-ee89-4f87-9bc6-6d2b19418130
2018-10-03 22:26:22,680 [Thread-1] INFO ShutdownHookManager - Deleting directory /tmp/spark-fbe65156-ee89-4f87-9bc6-6d2b19418130/repl-648f2e1b-6a8b-4fca-9468-142d2160b52a
```
borislin
Oct 4, 2018
Collaborator
@ruebot Oh, I forgot to mention. Remove the `*.gz` part from your `499.scala` script.
ianmilligan1
Oct 4, 2018
Member
Is it possible to tweak this to retain the `*.gz` syntax? I can imagine working in directories (and do once in a while) that have several non-archive files or directories in them.
borislin
Oct 4, 2018
Collaborator
@ianmilligan1 But our loadArchives() function already filters the input down to only ARC and WARC files.
ruebot
Oct 4, 2018
Member
That's a pretty big breaking change for production cloud.archivesunleashed.org and all of our documentation.
ianmilligan1
Oct 4, 2018
Member
Yeah, if possible it'd be nice to keep it.
In any case I'm still having trouble reproducing your success. I'm using:
```scala
import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._
sc.setLogLevel("INFO")
RecordLoader.loadArchives("/mnt/vol1/data_sets/aut_debug/issue-499/", sc).keepValidPages().map(r => ExtractDomain(r.getUrl)).countItems().saveAsTextFile("/mnt/vol1/data_sets/aut_debug/issue-499/results/domains")
RecordLoader.loadArchives("/mnt/vol1/data_sets/aut_debug/issue-499/", sc).keepValidPages().map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString))).saveAsTextFile("/mnt/vol1/data_sets/aut_debug/issue-499/results/test")
val links = RecordLoader.loadArchives("/mnt/vol1/data_sets/aut_debug/issue-499/", sc).keepValidPages().map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString))).flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", "")))).filter(r => r._2 != "" && r._3 != "").countItems().filter(r => r._2 > 5)
WriteGraphML(links, "/mnt/vol1/data_sets/aut_debug/issue-499/results/499-gephi.graphml")
sys.exit
```
and am getting this error log.
@ruebot @ianmilligan1 I see. Let me tweak it further.
borislin
Oct 4, 2018
Collaborator
@ruebot @ianmilligan1 Okay, I've fixed the archive files input path issue. Now it accepts the wildcard pattern `*.gz`. My results are in `/tuna1/scratch/aut-issue-271/spark_jobs/499.scala.log` and `/tuna1/scratch/aut-issue-271/derivatives`. Please take a look and also validate on your end.
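For anyone curious what a wildcard-friendly listing can look like, here is a rough sketch built on Hadoop's `globStatus` (the function name and exact filtering are illustrative, not necessarily the code in this PR):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}

// Hypothetical glob-aware listing: expand patterns like /data/*.gz with
// globStatus instead of handing the literal pattern to listStatus (which is
// what raised the FileNotFoundException earlier in this thread), then keep
// only ARC/WARC files.
def getArchiveFiles(pattern: String, conf: Configuration): Seq[String] = {
  val path = new Path(pattern)
  val fs = path.getFileSystem(conf)
  val statuses: Array[FileStatus] =
    if (pattern.contains("*")) Option(fs.globStatus(path)).getOrElse(Array.empty[FileStatus])
    else fs.listStatus(path)
  statuses
    .filter(_.isFile)
    .map(_.getPath.toString)
    .filter(p => p.endsWith(".arc.gz") || p.endsWith(".warc.gz") ||
                 p.endsWith(".arc") || p.endsWith(".warc"))
    .toSeq
}
```

With something along these lines, a bare directory and a `*.gz` pattern both resolve to the same filtered list of ARC and WARC files.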
ruebot requested changes on Oct 4, 2018
Two little things with comments, and I think we're good to go.
@borislin looks good now. Nice work!
@lintool can you give it a look before I merge?
lintool approved these changes on Oct 4, 2018
lgtm.

My only comment is that `getFiles` will be slow when the number of files gets too big... For example, each of the IA buckets has over 20k files.

Regardless,
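On the `getFiles` scaling point: one generic Hadoop option (a sketch under my own assumptions, not a claim about how aut implements it) is to stream the listing with `listFiles`, whose RemoteIterator fetches entries in batches on HDFS instead of materializing one huge array up front:

```scala
import scala.collection.mutable.ListBuffer

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Illustrative only: iterate the listing instead of materializing it,
// keeping just the ARC/WARC files as we go.
def listArchivesIncrementally(dir: String, conf: Configuration): Seq[String] = {
  val fs = new Path(dir).getFileSystem(conf)
  val it = fs.listFiles(new Path(dir), false) // non-recursive RemoteIterator
  val keep = ListBuffer.empty[String]
  while (it.hasNext) {
    val p = it.next().getPath
    val name = p.getName
    if (name.endsWith(".arc.gz") || name.endsWith(".warc.gz") ||
        name.endsWith(".arc") || name.endsWith(".warc")) {
      keep += p.toString
    }
  }
  keep.toList
}
```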
borislin commented Oct 3, 2018 (edited)
Patch for #246 & #271: Fix ZipException error when processing corrupted ARC files.
GitHub issue(s):
If you are responding to an issue, please mention their numbers below.
What does this Pull Request do?
This PR catches exceptions in `getContent()` in `ArcRecordUtils.java` when reading the content of corrupted ARC files, allowing Spark jobs to continue running.
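As a rough Scala-level illustration of the same 'catch and continue' idea (the helper below is hypothetical; the real change lives in the Java `ArcRecordUtils.getContent()` path), the body read can be wrapped so that a corrupted record yields an empty byte array instead of killing the job:

```scala
import java.util.zip.ZipException

import scala.util.{Failure, Success, Try}

// Illustrative only: read a record's body defensively so one corrupted
// (e.g. truncated gzip) ARC record does not abort the whole Spark job.
// `readBody` stands in for whatever actually decodes the record bytes.
def safeContent(readBody: () => Array[Byte]): Array[Byte] =
  Try(readBody()) match {
    case Success(bytes) => bytes
    case Failure(e: ZipException) =>
      // Corrupted or truncated compressed data: log it and move on with nothing.
      System.err.println(s"Skipping corrupted record: ${e.getMessage}")
      Array.empty[Byte]
    case Failure(e) => throw e // unrelated failures still surface
  }
```

Whether only ZipException or a broader set of exceptions should be swallowed is a judgement call for the actual diff; the sketch only shows the shape.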
How should this be tested?

Build aut on master, as of the latest commit:

```
mvn clean install
```

Create an output directory with sub-directories:

```
mkdir -p path/to/where/ever/you/can/write/output/all-text path/to/where/ever/you/can/write/output/all-domains path/to/where/ever/you/can/write/output/gephi path/to/where/ever/you/can/write/spark-jobs
```

Adapt the script below:

```
/home/b25lin/spark-2.1.3-bin-hadoop2.7/bin/spark-shell --master local[10] --driver-memory 30G --conf spark.network.timeout=100000000 --conf spark.executor.heartbeatInterval=6000s --conf spark.driver.maxResultSize=10G --jars "/tuna1/scratch/borislin/aut/target/aut-0.16.1-SNAPSHOT-fatjar.jar" -i /tuna1/scratch/aut-issue-271/spark_jobs/499.scala | tee /tuna1/scratch/aut-issue-271/spark_jobs/499.scala.log
```

Current results

- /tuna1/scratch/aut-issue-271/spark_jobs/499.scala.log
- /tuna1/scratch/aut-issue-271/derivatives
Interested parties
@lintool @ianmilligan1 @ruebot