Patch for #246 & #271: Fix exception error when processing corrupted ARC files #272

Merged
merged 4 commits into master from fix-exception on Oct 4, 2018

Conversation

@borislin
Collaborator

borislin commented Oct 3, 2018

Patch for #246 & #271: Fix ZipException error when processing corrupted ARC files.


GitHub issue(s): #246, #271

What does this Pull Request do?

This PR catches exceptions thrown by getContent() in ArcRecordUtils.java when reading the content of corrupted ARC files, so Spark jobs keep running instead of failing.
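
For illustration, the general shape of the fix is sketched below in Scala (the actual change is in ArcRecordUtils.java; the caller-side wrapper, its name, and the empty-byte-array fallback are assumptions for this sketch, not the exact patch):

import scala.util.{Failure, Success, Try}

// Hypothetical guard around reading a record body: if getContent throws
// (e.g. a ZipException from a corrupted ARC), log it and return an empty
// body so the surrounding Spark job keeps running.
def safeContent(record: org.archive.io.arc.ARCRecord): Array[Byte] =
  Try(io.archivesunleashed.data.ArcRecordUtils.getContent(record)) match {
    case Success(bytes) => bytes
    case Failure(e) =>
      System.err.println(s"Skipping corrupted ARC record: ${e.getMessage}")
      Array.empty[Byte]
  }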

How should this be tested?

  • Build aut on master, as of the latest commit: mvn clean install

  • Create an output directory with sub-directories:
    mkdir -p path/to/where/ever/you/can/write/output/all-text path/to/where/ever/you/can/write/output/all-domains path/to/where/ever/you/can/write/output/gephi path/to/where/ever/you/can/write/spark-jobs

  • Adapt the script below:

import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._
sc.setLogLevel("INFO")

val input = "/tuna1/scratch/aut-issue-271/warcs/*.gz"

val output1 = "/tuna1/scratch/aut-issue-271/derivatives/all-domains"
val output2 = "/tuna1/scratch/aut-issue-271/derivatives/all-text"
val output3 = "/tuna1/scratch/aut-issue-271/derivatives/gephi"

RecordLoader.loadArchives(input, sc).keepValidPages().map(r => ExtractDomain(r.getUrl)).countItems().saveAsTextFile(output1)

RecordLoader.loadArchives(input, sc).keepValidPages().map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString))).saveAsTextFile(output2)

val links = RecordLoader.loadArchives(input, sc).keepValidPages().map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString))).flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", "")))).filter(r => r._2 != "" && r._3 != "").countItems().filter(r => r._2 > 5)
WriteGraphML(links, output3)

sys.exit
  • Run the following command, with paths adapted, using Apache Spark 2.1.3:
    /home/b25lin/spark-2.1.3-bin-hadoop2.7/bin/spark-shell --master local[10] --driver-memory 30G --conf spark.network.timeout=100000000 --conf spark.executor.heartbeatInterval=6000s --conf spark.driver.maxResultSize=10G --jars "/tuna1/scratch/borislin/aut/target/aut-0.16.1-SNAPSHOT-fatjar.jar" -i /tuna1/scratch/aut-issue-271/spark_jobs/499.scala | tee /tuna1/scratch/aut-issue-271/spark_jobs/499.scala.log

Current results

/tuna1/scratch/aut-issue-271/spark_jobs/499.scala.log
/tuna1/scratch/aut-issue-271/derivatives

Interested parties

@lintool @ianmilligan1 @ruebot

@borislin borislin requested review from ruebot and lintool Oct 3, 2018

@ruebot

Member

ruebot commented Oct 3, 2018

Thanks @borislin! I'll try and get this tested tonight.

@ruebot

Member

ruebot commented Oct 3, 2018

I'm not able to replicate with all the 499 files, and the two files from aut-resources.

  1. rm -rf ~/.m2/repository/*
  2. rm -rf ~/git/aut/target
  3. git fetch --all
  4. git checkout fix-exception
  5. mvn clean install
  6. Run the following:
  • /home/nruest/bin/spark-2.1.3-bin-hadoop2.7/bin/spark-shell --master local\[10\] --driver-memory 30G --conf spark.network.timeout=100000000 --conf spark.executor.heartbeatInterval=6000s --conf spark.driver.maxResultSize=10G --jars /home/nruest/git/aut/target/aut-0.16.1-SNAPSHOT-fatjar.jar -i /home/nruest/Dropbox/499-issues/spark-jobs/499.scala | tee /home/nruest/Dropbox/499-issues/spark-jobs/499.scala.log
  • Where 499.scala is:
import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._
sc.setLogLevel("INFO")
RecordLoader.loadArchives("/home/nruest/Dropbox/499-issues/*.gz", sc).keepValidPages().map(r => ExtractDomain(r.getUrl)).countItems().saveAsTextFile("/home/nruest/Dropbox/499-issues/output/all-domains/output")
RecordLoader.loadArchives("/home/nruest/Dropbox/499-issues/*.gz", sc).keepValidPages().map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString))).saveAsTextFile("/home/nruest/Dropbox/499-issues/output/all-text/output")
val links = RecordLoader.loadArchives("/home/nruest/Dropbox/499-issues/*.gz", sc).keepValidPages().map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString))).flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", "")))).filter(r => r._2 != "" && r._3 != "").countItems().filter(r => r._2 > 5)
WriteGraphML(links, "/home/nruest/Dropbox/499-issues/output/gephi/499-gephi.graphml")
sys.exit
  7. It fails. Log here.
@ianmilligan1

Member

ianmilligan1 commented Oct 3, 2018

Getting closer! If I remove the empty ARCHIVEIT-499-BIMONTHLY-5528-20131008090852109-00757-wbgrp-crawl053.us.archive.org-6443.warc.gz my adapted version of @ruebot's script above works. If that empty file is there, however, it fails the same way that it does for @ruebot above.

Is something weird going on with how we handle empty files?
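
(For context, one way a loader could skip empty archive files is sketched below; this is a hypothetical illustration, not what this PR does, and the input path is a placeholder.)

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// List the input directory and keep only non-empty regular files, so a
// zero-byte .warc.gz never reaches the ARC/WARC reader.
val fs = FileSystem.get(new Configuration())
val nonEmptyInputs = fs
  .listStatus(new Path("/path/to/warcs"))
  .filter(status => status.isFile && status.getLen > 0)
  .map(_.getPath.toString)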

@borislin

Collaborator

borislin commented Oct 3, 2018

@ruebot @ianmilligan1 Yup, the culprit is the empty file. I'm testing some new code to handle the empty file.

@ruebot

Member

ruebot commented Oct 4, 2018

Can confirm that if I remove the empty file, we are successful.

The question is whether or not we want to take care of it in this PR, or in a separate one where we create a new issue. My preference would be to just get it done here, since all of this is an issue in production on cloud.archivesunleashed.org.

@codecov-io

codecov-io commented Oct 4, 2018

Codecov Report

Merging #272 into master will increase coverage by <.01%.
The diff coverage is 80%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #272      +/-   ##
==========================================
+ Coverage   70.35%   70.36%   +<.01%     
==========================================
  Files          41       41              
  Lines        1039     1046       +7     
  Branches      191      192       +1     
==========================================
+ Hits          731      736       +5     
- Misses        242      244       +2     
  Partials       66       66
Impacted Files Coverage Δ
src/main/scala/io/archivesunleashed/package.scala 84.82% <100%> (+0.7%) ⬆️
...java/io/archivesunleashed/data/ArcRecordUtils.java 78.26% <50%> (-3.56%) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c95a51d...7ecdd9e. Read the comment docs.

@ruebot

Member

ruebot commented Oct 4, 2018

[nruest@wombat:aut] (git)-[fix-exception]-$ /home/nruest/bin/spark-2.1.3-bin-hadoop2.7/bin/spark-shell --master local\[10\] --driver-memory 30G --conf spark.network.timeout=100000000 --conf spark.executor.heartbeatInterval=6000s --conf spark.driver.maxResultSize=10G --jars /home/nruest/git/aut/target/aut-0.16.1-SNAPSHOT-fatjar.jar -i /home/nruest/Dropbox/499-issues/spark-jobs/499.scala | tee /home/nruest/Dropbox/499-issues/spark-jobs/499.scala.log
2018-10-03 22:26:19,522 [main] WARN  NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-10-03 22:26:19,630 [main] WARN  Utils - Your hostname, wombat resolves to a loopback address: 127.0.1.1; using 10.0.1.44 instead (on interface enp0s31f6)
2018-10-03 22:26:19,630 [main] WARN  Utils - Set SPARK_LOCAL_IP if you need to bind to another address
Spark context Web UI available at http://10.0.1.44:4040
Spark context available as 'sc' (master = local[10], app id = local-1538619980066).
Spark session available as 'spark'.
Loading /home/nruest/Dropbox/499-issues/spark-jobs/499.scala...
import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._
java.io.FileNotFoundException: File /home/nruest/Dropbox/499-issues/*.gz does not exist
  at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:431)
  at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1517)
  at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1557)
  at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:674)
  at io.archivesunleashed.package$RecordLoader$.getFiles(package.scala:52)
  at io.archivesunleashed.package$RecordLoader$.loadArchives(package.scala:66)
  ... 77 elided
java.io.FileNotFoundException: File /home/nruest/Dropbox/499-issues/*.gz does not exist
  at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:431)
  at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1517)
  at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1557)
  at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:674)
  at io.archivesunleashed.package$RecordLoader$.getFiles(package.scala:52)
  at io.archivesunleashed.package$RecordLoader$.loadArchives(package.scala:66)
  ... 77 elided
java.io.FileNotFoundException: File /home/nruest/Dropbox/499-issues/*.gz does not exist
  at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:431)
  at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1517)
  at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1557)
  at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:674)
  at io.archivesunleashed.package$RecordLoader$.getFiles(package.scala:52)
  at io.archivesunleashed.package$RecordLoader$.loadArchives(package.scala:66)
  ... 77 elided
<console>:33: error: not found: value links
       WriteGraphML(links, "/home/nruest/Dropbox/499-issues/output/gephi/499-gephi.graphml")
                    ^
2018-10-03 22:26:22,651 [Thread-1] INFO  SparkContext - Invoking stop() from shutdown hook
2018-10-03 22:26:22,655 [Thread-1] INFO  ServerConnector - Stopped Spark@1b22bfe2{HTTP/1.1}{0.0.0.0:4040}
2018-10-03 22:26:22,656 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@2e7af36e{/stages/stage/kill,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@4b56b031{/jobs/job/kill,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@4a058df8{/api,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@6dca31eb{/,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@64502326{/static,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@2d258eff{/executors/threadDump/json,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@1d6751e3{/executors/threadDump,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@63fd4dda{/executors/json,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@335f5c69{/executors,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@22825e1e{/environment/json,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@55c57422{/environment,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@2c1f8dbd{/storage/rdd/json,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@4cdba2ed{/storage/rdd,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,656 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@72b0a004{/storage/json,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,657 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@171dc7c3{/storage,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,657 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@76f2dad9{/stages/pool/json,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,657 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@499ee966{/stages/pool,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,657 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@42e4e589{/stages/stage/json,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,657 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@43245559{/stages/stage,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,657 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@5d5c41e5{/stages/json,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,657 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@715f45c6{/stages,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,657 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@794f11cd{/jobs/job/json,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,657 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@562919fe{/jobs/job,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,657 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@1ac6dd3d{/jobs/json,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,657 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@36c7cbe1{/jobs,null,UNAVAILABLE,@Spark}
2018-10-03 22:26:22,658 [Thread-1] INFO  SparkUI - Stopped Spark web UI at http://10.0.1.44:4040
2018-10-03 22:26:22,663 [dispatcher-event-loop-2] INFO  MapOutputTrackerMasterEndpoint - MapOutputTrackerMasterEndpoint stopped!
2018-10-03 22:26:22,669 [Thread-1] INFO  MemoryStore - MemoryStore cleared
2018-10-03 22:26:22,669 [Thread-1] INFO  BlockManager - BlockManager stopped
2018-10-03 22:26:22,672 [Thread-1] INFO  BlockManagerMaster - BlockManagerMaster stopped
2018-10-03 22:26:22,674 [dispatcher-event-loop-7] INFO  OutputCommitCoordinator$OutputCommitCoordinatorEndpoint - OutputCommitCoordinator stopped!
2018-10-03 22:26:22,675 [Thread-1] INFO  SparkContext - Successfully stopped SparkContext
2018-10-03 22:26:22,675 [Thread-1] INFO  ShutdownHookManager - Shutdown hook called
2018-10-03 22:26:22,676 [Thread-1] INFO  ShutdownHookManager - Deleting directory /tmp/spark-fbe65156-ee89-4f87-9bc6-6d2b19418130
2018-10-03 22:26:22,680 [Thread-1] INFO  ShutdownHookManager - Deleting directory /tmp/spark-fbe65156-ee89-4f87-9bc6-6d2b19418130/repl-648f2e1b-6a8b-4fca-9468-142d2160b52a
@borislin

Collaborator

borislin commented Oct 4, 2018

@ruebot Oh, I forgot to mention. Remove the *.gz part from your 499.scala script.

@ianmilligan1

Member

ianmilligan1 commented Oct 4, 2018

Is it possible to tweak this to retain the *.gz syntax? I can imagine working in directories (and do once in a while) that have several non-archive files or directories in them.

@borislin

Collaborator

borislin commented Oct 4, 2018

@ianmilligan1 But our loadArchives() function already filters the input down to only ARC and WARC files.
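
(As a rough illustration of that filtering, only file names that look like ARC or WARC archives would be kept; the suffix list and helper below are assumptions, not the exact aut code.)

// Hypothetical suffix filter over input file names.
val archiveSuffixes = Seq(".arc", ".arc.gz", ".warc", ".warc.gz")

def looksLikeArchive(fileName: String): Boolean =
  archiveSuffixes.exists(suffix => fileName.toLowerCase.endsWith(suffix))

// looksLikeArchive("example.warc.gz") => true
// looksLikeArchive("notes.txt")       => false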

@ruebot

Member

ruebot commented Oct 4, 2018

That's a pretty big breaking change for production cloud.archivesunleashed.org and all of our documentation.

@ianmilligan1

Member

ianmilligan1 commented Oct 4, 2018

Yeah, if possible it'd be nice to keep it.

In any case I'm still having trouble reproducing your success. I'm using:

import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._
sc.setLogLevel("INFO")
RecordLoader.loadArchives("/mnt/vol1/data_sets/aut_debug/issue-499/", sc).keepValidPages().map(r => ExtractDomain(r.getUrl)).countItems().saveAsTextFile("/mnt/vol1/data_sets/aut_debug/issue-499/results/domains")
RecordLoader.loadArchives("/mnt/vol1/data_sets/aut_debug/issue-499/", sc).keepValidPages().map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString))).saveAsTextFile("/mnt/vol1/data_sets/aut_debug/issue-499/results/test")
val links = RecordLoader.loadArchives("/mnt/vol1/data_sets/aut_debug/issue-499/", sc).keepValidPages().map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString))).flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", "")))).filter(r => r._2 != "" && r._3 != "").countItems().filter(r => r._2 > 5)
WriteGraphML(links, "/mnt/vol1/data_sets/aut_debug/issue-499/results/499-gephi.graphml")
sys.exit

and am getting this error log.

@borislin

Collaborator

borislin commented Oct 4, 2018

@ruebot @ianmilligan1 I see. Let me tweak it further.

@borislin

Collaborator

borislin commented Oct 4, 2018

@ruebot @ianmilligan1 Okay, I've fixed the archive file input path issue. Now it accepts the wildcard pattern *.gz. My results are in /tuna1/scratch/aut-issue-271/spark_jobs/499.scala.log and /tuna1/scratch/aut-issue-271/derivatives. Please take a look and also validate on your end.
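
(A minimal sketch of wildcard-friendly input resolution, assuming globs are expanded with Hadoop's globStatus; the resolveInputs helper and the zero-length-file filter are illustrative, not necessarily the merged code.)

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// Expand a pattern such as /path/to/warcs/*.gz with globStatus, or list a
// plain directory; drop empty files so they never reach the reader.
def resolveInputs(input: String): Seq[String] = {
  val fs = FileSystem.get(new Configuration())
  val path = new Path(input)
  val statuses: Array[FileStatus] =
    if (input.contains("*")) Option(fs.globStatus(path)).getOrElse(Array.empty[FileStatus])
    else fs.listStatus(path)
  statuses.filter(s => s.isFile && s.getLen > 0).map(_.getPath.toString).toSeq
}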

@ruebot

Two little things with comments, and I think we're good to go.

@ruebot

Member

ruebot commented Oct 4, 2018

@borislin looks good now. Nice work!

@ruebot

ruebot approved these changes Oct 4, 2018

@ruebot

Member

ruebot commented Oct 4, 2018

@lintool can you give it a look before I merge?

@ianmilligan1

Member

ianmilligan1 commented Oct 4, 2018

Just echoing @ruebot - this is great work, @borislin! 🎉

@lintool

lintool approved these changes Oct 4, 2018

lgtm.

My only comment is that getFiles will be slow when the number of files gets too big... For example, each of the IA buckets has over 20k files.

Regardless, 🚢 and we'll deal with issues later.

@ruebot ruebot merged commit b8e57ec into master Oct 4, 2018

4 checks passed

codecov/patch 80% of diff hit (target 70.35%)
codecov/project 70.36% (+<.01%) compared to c95a51d
continuous-integration/travis-ci/pr The Travis CI build passed
continuous-integration/travis-ci/push The Travis CI build passed

@ruebot ruebot deleted the fix-exception branch Oct 4, 2018
