
Makes ArchiveRecordImpl serializable #316

Merged
merged 1 commit into master from serialize-ArchiveRecordImpl on Apr 22, 2019

Conversation

@jrwiebe (Contributor) commented Apr 18, 2019

What does this Pull Request do?

Makes class ArchiveRecordImpl serializable by removing non-serializable ARCRecord and WARCRecord variables. Also removes unused headerResponseFormat variable.
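For illustration, the pattern is roughly the following; the class and field names below are placeholders, not the actual ArchiveRecord.scala code:

// Sketch of the pattern only; names are illustrative, not the aut implementation.
//
// Before: holding a reference to the non-serializable record object makes the
// whole instance fail Java serialization with a NotSerializableException.
//
//   class RecordBefore(r: NonSerializableRecord) extends Serializable {
//     val record = r                      // non-serializable field
//     def getUrl: String = record.getUrl
//   }
//
// After: copy the values needed downstream into plain serializable fields at
// construction time and keep no reference to the underlying record object.
class RecordAfter(url: String, content: String) extends Serializable {
  def getUrl: String = url
  def getContentString: String = content
}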

How should this be tested?

Prior to this commit, the following code would fail with a NotSerializableException; with this change it works:

import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._
import org.apache.spark.storage.StorageLevel._

sc.setLogLevel("DEBUG")

val validPages = RecordLoader
                  .loadArchives("/path/to/warcs/*.warc.gz", sc)
                  .keepValidPages()
                  .persist(DISK_ONLY) // crucial line
                  .map(r => ExtractDomain(r.getUrl))
                  .countItems()
                  .saveAsTextFile("/writable/path/all-domains/output")
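
To check the property the DISK_ONLY persist depends on without running a full Spark job, a helper along these lines can be used (illustrative only, not part of aut; it exercises the default Java serializer, which is what threw the NotSerializableException above):

import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Returns true if the object (and everything reachable from it) survives
// standard Java serialization, which is what disk persistence uses here.
def isJavaSerializable(obj: AnyRef): Boolean =
  try {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    out.writeObject(obj)
    out.close()
    true
  } catch {
    case _: NotSerializableException => false
  }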

Additional Notes:

Caching RDDs to disk may be useful, but it is not a solution to out-of-memory issues discussed by @ruebot in #aut on Slack.
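
For reference, the storage levels involved (standard Spark API; shown only to illustrate the trade-off the note above is making):

import org.apache.spark.storage.StorageLevel

// DISK_ONLY writes serialized partitions to local disk and re-reads them on use,
// which is why the records must be serializable.
val onDisk = StorageLevel.DISK_ONLY

// MEMORY_ONLY keeps deserialized objects on the JVM heap and recomputes any
// partition that does not fit, so it does not by itself relieve heap pressure.
val inMemory = StorageLevel.MEMORY_ONLY

// MEMORY_AND_DISK keeps partitions in memory and spills the ones that do not fit.
val spillToDisk = StorageLevel.MEMORY_AND_DISK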

Interested parties

@ruebot

Commit: Makes ArchiveRecordImpl serializable by removing non-serializable ARCRecord and WARCRecord variables. Also removes unused headerResponseFormat variable.

@ruebot ruebot self-requested a review Apr 18, 2019

@codecov-io commented Apr 18, 2019

Codecov Report

Merging #316 into master will increase coverage by 0.11%.
The diff coverage is 78.26%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #316      +/-   ##
==========================================
+ Coverage   75.84%   75.95%   +0.11%     
==========================================
  Files          41       41              
  Lines        1151     1148       -3     
  Branches      202      200       -2     
==========================================
- Hits          873      872       -1     
  Misses        209      209              
+ Partials       69       67       -2
Impacted Files                                         Coverage Δ
...ain/scala/io/archivesunleashed/ArchiveRecord.scala  84.9% <78.26%> (+2.76%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8504190...01b8696. Read the comment docs.

@ruebot (Member) commented Apr 18, 2019

Still getting the heap space error.

Cleared out my ~/.m2, and built the serialize-ArchiveRecordImpl branch on tuna.

Ran the following:

/home/ruestn/spark-2.4.1-bin-hadoop2.7/bin/spark-shell --master local[30] --driver-memory 105g --conf spark.network.timeout=10000000 --conf spark.executor.heartbeatInterval=600s --conf spark.driver.maxResultSize=4g --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.shuffle.compress=true --conf spark.rdd.compress=true -Djava.io.tmpdir=/tuna1/scratch/nruest/tmp --jars /home/ruestn/aut/target/aut-0.17.1-SNAPSHOT-fatjar.jar -i /tuna1/scratch/nruest/auk_collection_testing/10689/133/spark_jobs/10689-cache-issue-316.scala 2>&1 | tee /tuna1/scratch/nruest/auk_collection_testing/10689/133/spark_jobs/10689.scala-tuna-pr-test.log

10689-cache-issue-316.scala is:

import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._
import org.apache.spark.storage.StorageLevel._

sc.setLogLevel("DEBUG")

val validPages = RecordLoader
                  .loadArchives("/tuna1/scratch/nruest/auk_collection_testing/10689/warcs/*.gz", sc)
                  .keepValidPages()
                  .persist(DISK_ONLY)

validPages
  .map(r => ExtractDomain(r.getUrl))
  .countItems()
  .saveAsTextFile("/tuna1/scratch/nruest/auk_collection_testing/10689/133/derivatives/all-domains/output")

validPages
  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(RemoveHttpHeader(r.getContentString))))
  .saveAsTextFile("/tuna1/scratch/nruest/auk_collection_testing/10689/133/derivatives/all-text/output")

val links = validPages
              .map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString)))
              .flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1)
              .replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2)
              .replaceAll("^\\s*www\\.", ""))))
              .filter(r => r._2 != "" && r._3 != "")
              .countItems()
              .filter(r => r._2 > 5)

WriteGraphML(links, "/tuna1/scratch/nruest/auk_collection_testing/10689/133/derivatives/gephi/10689-gephi.graphml")

sys.exit

Error:

19/04/18 18:02:34 ERROR Executor: Exception in task 3.0 in stage 0.0 (TID 3)
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3332)
        at java.lang.StringCoding.safeTrim(StringCoding.java:89)
        at java.lang.StringCoding.access$100(StringCoding.java:50)
        at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:154)
        at java.lang.StringCoding.decode(StringCoding.java:193)
        at java.lang.StringCoding.decode(StringCoding.java:254)
        at java.lang.String.<init>(String.java:546)
        at java.lang.String.<init>(String.java:566)
        at io.archivesunleashed.ArchiveRecordImpl.<init>(ArchiveRecord.scala:117)
        at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:69)
        at io.archivesunleashed.package$RecordLoader$$anonfun$loadArchives$2.apply(package.scala:69)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
        at org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:139)
        at org.apache.spark.serializer.SerializerManager.dataSerializeStream(SerializerManager.scala:174)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1$$anonfun$apply$10.apply(BlockManager.scala:1203)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1$$anonfun$apply$10.apply(BlockManager.scala:1201)
        at org.apache.spark.storage.DiskStore.put(DiskStore.scala:69)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1201)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
        at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
        at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
        at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
        at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)

If you want to check out the full log on tuna: /tuna1/scratch/nruest/auk_collection_testing/10689/133/spark_jobs/10689.scala-tuna-pr-test.log

@ruebot (Member) commented Apr 18, 2019

> Caching RDDs to disk may be useful, but it is not a solution to out-of-memory issues discussed by @ruebot in #aut on Slack.

🤦‍♂️

I should have read the PR closer. Sorry for the giant verbose dump above @jrwiebe

@jrwiebe (Contributor, Author) commented Apr 19, 2019

@ruebot You can lead a horse to water ...

@ruebot ruebot requested a review from lintool Apr 19, 2019

@ruebot approved these changes Apr 19, 2019

@lintool (Member) commented Apr 22, 2019

lgtm

@ruebot ruebot merged commit 5cb05f7 into master Apr 22, 2019

3 checks passed

codecov/patch: 78.26% of diff hit (target 75.84%)
codecov/project: 75.95% (+0.11%) compared to 8504190
continuous-integration/travis-ci/pr: The Travis CI build passed

@ruebot ruebot deleted the serialize-ArchiveRecordImpl branch Apr 22, 2019
