DataFrame error with text files: java.net.MalformedURLException: unknown protocol: filedesc #362

Closed
ruebot opened this issue Sep 23, 2019 · 0 comments · Fixed by #393

ruebot commented Sep 23, 2019

Describe the bug

19/09/23 21:18:44 ERROR Executor: Exception in task 17.0 in stage 22.0 (TID 12628)
java.net.MalformedURLException: unknown protocol: filedesc
        at java.net.URL.<init>(URL.java:607)
        at java.net.URL.<init>(URL.java:497)
        at java.net.URL.<init>(URL.java:446)
        at io.archivesunleashed.package$WARecordRDD$$anonfun$38.apply(package.scala:448)
        at io.archivesunleashed.package$WARecordRDD$$anonfun$38.apply(package.scala:444)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at org.apache.spark.util.random.SamplingUtils$.reservoirSampleAndCount(SamplingUtils.scala:41)
        at org.apache.spark.RangePartitioner$$anonfun$13.apply(Partitioner.scala:306)
        at org.apache.spark.RangePartitioner$$anonfun$13.apply(Partitioner.scala:304)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:123)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
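
For what it's worth, the exception is reproducible outside Spark: java.net.URL only accepts protocols it has a stream handler for (http, https, file, jar, etc.), so an ARC "filedesc:" record URL fails at construction time. A minimal sketch (the .arc filename below is made up):

import java.net.URL

// Constructing a URL with an unregistered scheme throws immediately:
// java.net.MalformedURLException: unknown protocol: filedesc
new URL("filedesc://example-00001.arc")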

To Reproduce

import io.archivesunleashed._
import io.archivesunleashed.df._

val df = RecordLoader.loadArchives("/data/banq-datathon/PQ-2012/warcs/*gz", sc).extractTextFilesDetailsDF();
val res = df.select($"url", $"filename", $"extension", $"mime_type_web_server", $"mime_type_tika", $"md5").orderBy(desc("md5")).write.csv("/data/banq-datathon/PQ-2012/derivatives/dataframes/text/pq-2012-text")

val df_txt = RecordLoader.loadArchives("/data/banq-datathon/PQ-2012/warcs/*gz", sc).extractTextFilesDetailsDF();
val res_txt = df_txt.select($"bytes", $"extension").saveToDisk("bytes", "/data/banq-datathon/PQ-2012/derivatives/binaries/text/pq-2012-text", "extension")

sys.exit

Expected behavior

We should probably just capture and log that error instead of letting it kill the task. ARC files carry "filedesc:" (and "dns:") records whose URL schemes java.net.URL can't parse, which is presumably what trips this up. I remember it coming up in testing with GeoCities, but it went away once all the Tika processing was in place.
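
Something like the following would do it, as a rough sketch; the helper name urlHostOrEmpty is made up here, and the real change would live wherever package.scala builds the java.net.URL (package.scala:444-448 in the trace above):

import java.net.URL
import scala.util.{Failure, Success, Try}

// Hypothetical helper, not the actual package.scala code: parse the record
// URL defensively so schemes java.net.URL cannot handle ("filedesc:", "dns:")
// get logged and skipped instead of failing the whole Spark task.
def urlHostOrEmpty(recordUrl: String): String =
  Try(new URL(recordUrl)) match {
    case Success(url) => url.getHost
    case Failure(e) =>
      System.err.println(s"Skipping unparsable record URL '$recordUrl': ${e.getMessage}")
      ""
  }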

Environment information

  • AUT version: 0.18.0
  • OS: Ubuntu 18.04
  • Java version: OpenJDK8
  • Apache Spark version: 2.4.4
  • Apache Spark w/aut: --packages
  • Apache Spark command used to run AUT: /home/ubuntu/aut/spark-2.4.4-bin-hadoop2.7/bin/spark-shell --master local[30] --driver-memory 105g --conf spark.network.timeout=100000000 --conf spark.executor.heartbeatInterval=6000s --conf spark.driver.maxResultSize=100g --conf spark.rdd.compress=true --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.shuffle.compress=true --conf spark.kryoserializer.buffer.max=2000m --packages "io.archivesunleashed:aut:0.18.0"
ruebot self-assigned this Sep 23, 2019
ruebot added this to To Do in 1.0.0 Release of AUT Nov 14, 2019
ruebot added a commit that referenced this issue Dec 18, 2019
- Add filedesc and dns filters (ARC files)
- Add test case
1.0.0 Release of AUT automation moved this from To Do to Done Dec 18, 2019
ianmilligan1 added a commit that referenced this issue Dec 18, 2019
* Add additional filters for textFiles; resolves #362.

- Add filedesc and dns filters (ARC files)
- Add test case
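
For reference, a conceptual sketch of the filter those commits describe, shown at the RDD level because it has to run before the library builds a java.net.URL from each record (this assumes getUrl on ArchiveRecord returns the record's URL string; the actual fix in #393 sits inside the library, not in user code):

import io.archivesunleashed._

// Drop ARC housekeeping records ("filedesc:") and DNS lookup records ("dns:")
// before anything downstream tries to parse the record URL.
val filtered = RecordLoader
  .loadArchives("/data/banq-datathon/PQ-2012/warcs/*gz", sc)
  .filter(r => !r.getUrl.startsWith("filedesc:") && !r.getUrl.startsWith("dns:"))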