Patch for #246 & #271: Fix exception error when processing corrupted ARC files #272

Merged
merged 4 commits into master from fix-exception on Oct 4, 2018

Conversation

@borislin
Collaborator

borislin commented Oct 3, 2018

Patch for #246 & #271: Fix ZipException error when processing corrupted ARC files.


GitHub issue(s): #246, #271

What does this Pull Request do?

This PR catches exceptions thrown by getContent() in ArcRecordUtils.java when reading the content of corrupted ARC files, so Spark jobs keep running instead of failing.
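
For illustration, the general shape of the fix is sketched below in Scala (the actual change is in ArcRecordUtils.java; the caller-side wrapper, its name, and the empty-byte-array fallback are assumptions for this sketch, not the exact patch):

import scala.util.{Failure, Success, Try}

// Hypothetical guard around reading a record body: if getContent throws
// (e.g. a ZipException from a corrupted ARC), log it and return an empty
// body so the surrounding Spark job keeps running.
def safeContent(record: org.archive.io.arc.ARCRecord): Array[Byte] =
  Try(io.archivesunleashed.data.ArcRecordUtils.getContent(record)) match {
    case Success(bytes) => bytes
    case Failure(e) =>
      System.err.println(s"Skipping corrupted ARC record: ${e.getMessage}")
      Array.empty[Byte]
  }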

How should this be tested?

  • Build aut on master, as of the latest commit: mvn clean install

  • Create an output directory with sub-directories:
    mkdir -p path/to/where/ever/you/can/write/output/all-text path/to/where/ever/you/can/write/output/all-domains path/to/where/ever/you/can/write/output/gephi path/to/where/ever/you/can/write/spark-jobs

  • Adapt the script below:

import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._
sc.setLogLevel("INFO")

val input = "/tuna1/scratch/aut-issue-271/warcs/*.gz"

val output1 = "/tuna1/scratch/aut-issue-271/derivatives/all-domains"
val output2 = "/tuna1/scratch/aut-issue-271/derivatives/all-text"
val output3 = "/tuna1/scratch/aut-issue-271/derivatives/gephi"

RecordLoader.loadArchives(input, sc).keepValidPages().map(r => ExtractDomain(r.getUrl)).countItems().saveAsTextFile(output1)

RecordLoader.loadArchives(input, sc).keepValidPages().map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString))).saveAsTextFile(output2)

val links = RecordLoader.loadArchives(input, sc).keepValidPages().map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString))).flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", "")))).filter(r => r._2 != "" && r._3 != "").countItems().filter(r => r._2 > 5)
WriteGraphML(links, output3)

sys.exit
  • Run the following command, with paths adapted, using Apache Spark 2.1.3:
    /home/b25lin/spark-2.1.3-bin-hadoop2.7/bin/spark-shell --master local[10] --driver-memory 30G --conf spark.network.timeout=100000000 --conf spark.executor.heartbeatInterval=6000s --conf spark.driver.maxResultSize=10G --jars "/tuna1/scratch/borislin/aut/target/aut-0.16.1-SNAPSHOT-fatjar.jar" -i /tuna1/scratch/aut-issue-271/spark_jobs/499.scala | tee /tuna1/scratch/aut-issue-271/spark_jobs/499.scala.log

Current results

/tuna1/scratch/aut-issue-271/spark_jobs/499.scala.log
/tuna1/scratch/aut-issue-271/derivatives

Interested parties

@lintool @ianmilligan1 @ruebot

@borislin borislin requested review from ruebot and lintool Oct 3, 2018

@ruebot

Member

ruebot commented Oct 3, 2018

Thanks @borislin! I'll try and get this tested tonight.

@ruebot

Member

ruebot commented Oct 3, 2018

I'm not able to replicate with all the 499 files, and the two files from aut-resources.

  1. rm -rf ~/.m2/repository/*
  2. rm -rf ~/git/aut/target
  3. git fetch --all
  4. git checkout fix-exception
  5. mvn clean install
  6. Run the following:
  • /home/nruest/bin/spark-2.1.3-bin-hadoop2.7/bin/spark-shell --master local\[10\] --driver-memory 30G --conf spark.network.timeout=100000000 --conf spark.executor.heartbeatInterval=6000s --conf spark.driver.maxResultSize=10G --jars /home/nruest/git/aut/target/aut-0.16.1-SNAPSHOT-fatjar.jar -i /home/nruest/Dropbox/499-issues/spark-jobs/499.scala | tee /home/nruest/Dropbox/499-issues/spark-jobs/499.scala.log
  • Where 499.scala is:
import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._
sc.setLogLevel("INFO")
RecordLoader.loadArchives("/home/nruest/Dropbox/499-issues/*.gz", sc).keepValidPages().map(r => ExtractDomain(r.getUrl)).countItems().saveAsTextFile("/home/nruest/Dropbox/499-issues/output/all-domains/output")
RecordLoader.loadArchives("/home/nruest/Dropbox/499-issues/*.gz", sc).keepValidPages().map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString))).saveAsTextFile("/home/nruest/Dropbox/499-issues/output/all-text/output")
val links = RecordLoader.loadArchives("/home/nruest/Dropbox/499-issues/*.gz", sc).keepValidPages().map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString))).flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", "")))).filter(r => r._2 != "" && r._3 != "").countItems().filter(r => r._2 > 5)
WriteGraphML(links, "/home/nruest/Dropbox/499-issues/output/gephi/499-gephi.graphml")
sys.exit
  7. It fails. Log here.
@ianmilligan1

Member

ianmilligan1 commented Oct 3, 2018

Getting closer! If I remove the empty ARCHIVEIT-499-BIMONTHLY-5528-20131008090852109-00757-wbgrp-crawl053.us.archive.org-6443.warc.gz my adapted version of @ruebot's script above works. If that empty file is there, however, it fails the same way that it does for @ruebot above.

Is something weird going on with how we handle empty files?
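
(For context, one way a loader could skip empty archive files is sketched below; this is a hypothetical illustration, not what this PR does, and the input path is a placeholder.)

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// List the input directory and keep only non-empty regular files, so a
// zero-byte .warc.gz never reaches the ARC/WARC reader.
val fs = FileSystem.get(new Configuration())
val nonEmptyInputs = fs
  .listStatus(new Path("/path/to/warcs"))
  .filter(status => status.isFile && status.getLen > 0)
  .map(_.getPath.toString)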

@borislin

Collaborator

borislin commented Oct 3, 2018

@ruebot @ianmilligan1 Yup, the culprit is the empty file. I'm testing some new code to handle the empty file.

@ruebot

Member

ruebot commented Oct 4, 2018

Can confirm that if I remove the empty file, we are successful.

The question is whether or not we want to take care of it in this PR, or in a separate one where we create a new issue. My preference would be to just get it done here, since all of this is an issue in production on cloud.archivesunleashed.org.

@codecov-io

codecov-io commented Oct 4, 2018

Codecov Report

Merging #272 into master will increase coverage by <.01%.
The diff coverage is 80%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #272      +/-   ##
==========================================
+ Coverage   70.35%   70.36%   +<.01%     
==========================================
  Files          41       41              
  Lines        1039     1046       +7     
  Branches      191      192       +1     
==========================================
+ Hits          731      736       +5     
- Misses        242      244       +2     
  Partials       66       66
Impacted Files Coverage Δ
src/main/scala/io/archivesunleashed/package.scala 84.82% <100%> (+0.7%) ⬆️
...java/io/archivesunleashed/data/ArcRecordUtils.java 78.26% <50%> (-3.56%) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c95a51d...7ecdd9e. Read the comment docs.

@ruebot

Member

ruebot commented Oct 4, 2018

[nruest@wombat:aut] (git)-[fix-exception]-$ /home/nruest/bin/spark-2.1.3-bin-hadoop2.7/bin/spark-shell --master local\[10\] --driver-memory 30G --conf spark.network.timeout=100000000 --conf spark.executor.heartbeatInterval=6000s --conf spark.driver.maxResultSize=10G --jars /home/nruest/git/aut/target/aut-0.16.1-SNAPSHOT-fatjar.jar -i /home/nruest/Dropbox/499-issues/spark-jobs/499.scala | tee /home/nruest/Dropbox/499-issues/spark-jobs/499.scala.log
2018-10-03 22:26:19,522 [main] WARN  NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-10-03 22:26:19,630 [main] WARN  Utils - Your hostname, wombat resolves to a loopback address: 127.0.1.1; using 10.0.1.44 instead (on interface enp0s31f6)
2018-10-03 22:26:19,630 [main] WARN  Utils - Set SPARK_LOCAL_IP if you need to bind to another address
Spark context Web UI available at http://10.0.1.44:4040
Spark context available as 'sc' (master = local[10], app id = local-1538619980066).
Spark session available as 'spark'.
Loading /home/nruest/Dropbox/499-issues/spark-jobs/499.scala...
import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._
java.io.FileNotFoundException: File /home/nruest/Dropbox/499-issues/*.gz does not exist
  at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:431)
  at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1517)
  at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1557)
  at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:674)
  at io.archivesunleashed.package$RecordLoader$.getFiles(package.scala:52)
  at io.archivesunleashed.package$RecordLoader$.loadArchives(package.scala:66)
  ... 77 elided
java.io.FileNotFoundException: File /home/nruest/Dropbox/499-issues/*.gz does not exist
  at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:431)
  at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1517)
  at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1557)
  at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:674)
  at io.archivesunleashed.package$RecordLoader$.getFiles(package.scala:52)
  at io.archivesunleashed.package$RecordLoader$.loadArchives(package.scala:66)
  ... 77 elided
java.io.FileNotFoundException: File /home/nruest/Dropbox/499-issues/*.gz does not exist
  at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:431)
  at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1517)
  at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1557)
  at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:674)
  at io.archivesunleashed.package$RecordLoader$.getFiles(package.scala:52)
  at io.archivesunleashed.package$RecordLoader$.loadArchives(package.scala:66)
  ... 77 elided
<console>:33: error: not found: value links
       WriteGraphML(links, "/home/nruest/Dropbox/499-issues/output/gephi/499-gephi.graphml")
                    ^
2018-10-03 22:26:22,651 [Thread-1] INFO  SparkContext - Invoking stop() from shutdown hook
2018-10-03 22:26:22,655 [Thread-1] INFO  ServerConnector - Stopped Spark@1b22bfe2{HTTP/1.1}{0.0.0.0:4040}
2018-10-03 22:26:22,656 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@2e7af36e{/stages/stage/kill,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@4b56b031{/jobs/job/kill,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@4a058df8{/api,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@6dca31eb{/,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@64502326{/static,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@2d258eff{/executors/threadDump/json,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@1d6751e3{/executors/threadDump,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@63fd4dda{/executors/json,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@335f5c69{/executors,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@22825e1e{/environment/json,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@55c57422{/environment,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@2c1f8dbd{/storage/rdd/json,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@4cdba2ed{/storage/rdd,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@72b0a004{/storage/json,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,657 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@171dc7c3{/storage,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,657 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@76f2dad9{/stages/pool/json,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,657 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@499ee966{/stages/pool,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,657 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@42e4e589{/stages/stage/json,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,657 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@43245559{/stages/stage,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,657 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@5d5c41e5{/stages/json,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,657 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@715f45c6{/stages,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,657 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@794f11cd{/jobs/job/json,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,657 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@562919fe{/jobs/job,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,657 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@1ac6dd3d{/jobs/json,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,657 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@36c7cbe1{/jobs,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,658 [Thread-1] INFO  SparkUI - Stopped Spark web UI at http://10.0.1.44:4040
2018-10-03 22:26:22,663 [dispatcher-event-loop-2] INFO  MapOutputTrackerMasterEndpoint - MapOutputTrackerMasterEndpoint stopped!
2018-10-03 22:26:22,669 [Thread-1] INFO  MemoryStore - MemoryStore cleared
2018-10-03 22:26:22,669 [Thread-1] INFO  BlockManager - BlockManager stopped
2018-10-03 22:26:22,672 [Thread-1] INFO  BlockManagerMaster - BlockManagerMaster stopped
2018-10-03 22:26:22,674 [dispatcher-event-loop-7] INFO  OutputCommitCoordinator$OutputCommitCoordinatorEndpoint - OutputCommitCoordinator stopped!
2018-10-03 22:26:22,675 [Thread-1] INFO  SparkContext - Successfully stopped SparkContext
2018-10-03 22:26:22,675 [Thread-1] INFO  ShutdownHookManager - Shutdown hook called
2018-10-03 22:26:22,676 [Thread-1] INFO  ShutdownHookManager - Deleting directory /tmp/spark-fbe65156-ee89-4f87-9bc6-6d2b19418130
2018-10-03 22:26:22,680 [Thread-1] INFO  ShutdownHookManager - Deleting directory /tmp/spark-fbe65156-ee89-4f87-9bc6-6d2b19418130/repl-648f2e1b-6a8b-4fca-9468-142d2160b52a
@borislin

Collaborator

borislin commented Oct 4, 2018

@ruebot Oh, I forgot to mention. Remove the *.gz part from your 499.scala script.

@ianmilligan1

Member

ianmilligan1 commented Oct 4, 2018

Is it possible to tweak this to retain the *.gz syntax? I can imagine working in directories (and do once in a while) that have several non-archive files or directories in them.

@borislin

Collaborator

borislin commented Oct 4, 2018

@ianmilligan1 But our loadArchives() function already filters the input down to only ARC and WARC files.
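
(As a rough illustration of that filtering, only file names that look like ARC or WARC archives would be kept; the suffix list and helper below are assumptions, not the exact aut code.)

// Hypothetical suffix filter over input file names.
val archiveSuffixes = Seq(".arc", ".arc.gz", ".warc", ".warc.gz")

def looksLikeArchive(fileName: String): Boolean =
  archiveSuffixes.exists(suffix => fileName.toLowerCase.endsWith(suffix))

// looksLikeArchive("example.warc.gz") => true
// looksLikeArchive("notes.txt")       => false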

@ruebot

Member

ruebot commented Oct 4, 2018

That's a pretty big breaking change for production cloud.archivesunleashed.org and all of our documentation.

@ianmilligan1

Member

ianmilligan1 commented Oct 4, 2018

Yeah, if possible it'd be nice to keep it.

In any case I'm still having trouble reproducing your success. I'm using:

import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._
sc.setLogLevel("INFO")
RecordLoader.loadArchives("/mnt/vol1/data_sets/aut_debug/issue-499/", sc).keepValidPages().map(r => ExtractDomain(r.getUrl)).countItems().saveAsTextFile("/mnt/vol1/data_sets/aut_debug/issue-499/results/domains")
RecordLoader.loadArchives("/mnt/vol1/data_sets/aut_debug/issue-499/", sc).keepValidPages().map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString))).saveAsTextFile("/mnt/vol1/data_sets/aut_debug/issue-499/results/test")
val links = RecordLoader.loadArchives("/mnt/vol1/data_sets/aut_debug/issue-499/", sc).keepValidPages().map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString))).flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", "")))).filter(r => r._2 != "" && r._3 != "").countItems().filter(r => r._2 > 5)
WriteGraphML(links, "/mnt/vol1/data_sets/aut_debug/issue-499/results/499-gephi.graphml")
sys.exit

and am getting this error log.

@borislin

Collaborator

borislin commented Oct 4, 2018

@ruebot @ianmilligan1 I see. Let me tweak it further.

@borislin

Collaborator

borislin commented Oct 4, 2018

@ruebot @ianmilligan1 Okay, I've fixed the archive file input path issue. Now it accepts the wildcard pattern *.gz. My results are in /tuna1/scratch/aut-issue-271/spark_jobs/499.scala.log and /tuna1/scratch/aut-issue-271/derivatives. Please take a look and also validate on your end.
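
(A minimal sketch of wildcard-friendly input resolution, assuming globs are expanded with Hadoop's globStatus; the resolveInputs helper and the zero-length-file filter are illustrative, not necessarily the merged code.)

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// Expand a pattern such as /path/to/warcs/*.gz with globStatus, or list a
// plain directory; drop empty files so they never reach the reader.
def resolveInputs(input: String): Seq[String] = {
  val fs = FileSystem.get(new Configuration())
  val path = new Path(input)
  val statuses: Array[FileStatus] =
    if (input.contains("*")) Option(fs.globStatus(path)).getOrElse(Array.empty[FileStatus])
    else fs.listStatus(path)
  statuses.filter(s => s.isFile && s.getLen > 0).map(_.getPath.toString).toSeq
}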

@ruebot

Two little things with comments, and I think we're good to go.

@ruebot

Member

ruebot commented Oct 4, 2018

@borislin looks good now. Nice work!

@ruebot

ruebot approved these changes Oct 4, 2018

@ruebot

Member

ruebot commented Oct 4, 2018

@lintool can you give it a look before I merge?

@ianmilligan1

Member

ianmilligan1 commented Oct 4, 2018

Just echoing @ruebot - this is great work, @borislin! 🎉

@lintool

lintool approved these changes Oct 4, 2018

lgtm.

My only comment is that getFiles will be slow when the number of files gets too big... For example, each of the IA buckets has over 20k files.

Regardless, 🚢 and we'll deal with issues later.

@ruebot ruebot merged commit b8e57ec into master Oct 4, 2018

4 checks passed

codecov/patch 80% of diff hit (target 70.35%)
codecov/project 70.36% (+<.01%) compared to c95a51d
continuous-integration/travis-ci/pr The Travis CI build passed
continuous-integration/travis-ci/push The Travis CI build passed

@ruebot ruebot deleted the fix-exception branch Oct 4, 2018
