DataFrame error with text files: java.net.MalformedURLException: unknown protocol: filedesc #362

Closed
ruebot opened this issue Sep 23, 2019 · 0 comments · Fixed by #393

ruebot commented Sep 23, 2019

Describe the bug

19/09/23 21:18:44 ERROR Executor: Exception in task 17.0 in stage 22.0 (TID 12628)
java.net.MalformedURLException: unknown protocol: filedesc
        at java.net.URL.<init>(URL.java:607)
        at java.net.URL.<init>(URL.java:497)
        at java.net.URL.<init>(URL.java:446)
        at io.archivesunleashed.package$WARecordRDD$$anonfun$38.apply(package.scala:448)
        at io.archivesunleashed.package$WARecordRDD$$anonfun$38.apply(package.scala:444)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at org.apache.spark.util.random.SamplingUtils$.reservoirSampleAndCount(SamplingUtils.scala:41)
        at org.apache.spark.RangePartitioner$$anonfun$13.apply(Partitioner.scala:306)
        at org.apache.spark.RangePartitioner$$anonfun$13.apply(Partitioner.scala:304)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:123)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
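
For what it's worth, the exception is reproducible outside Spark: java.net.URL only accepts protocols it has a stream handler for (http, https, file, jar, etc.), so an ARC "filedesc:" record URL fails at construction time. A minimal sketch (the .arc filename below is made up):

import java.net.URL

// Constructing a URL with an unregistered scheme throws immediately:
// java.net.MalformedURLException: unknown protocol: filedesc
new URL("filedesc://example-00001.arc")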

To Reproduce

import io.archivesunleashed._
import io.archivesunleashed.df._

val df = RecordLoader.loadArchives("/data/banq-datathon/PQ-2012/warcs/*gz", sc).extractTextFilesDetailsDF();
val res = df.select($"url", $"filename", $"extension", $"mime_type_web_server", $"mime_type_tika", $"md5").orderBy(desc("md5")).write.csv("/data/banq-datathon/PQ-2012/derivatives/dataframes/text/pq-2012-text")

val df_txt = RecordLoader.loadArchives("/data/banq-datathon/PQ-2012/warcs/*gz", sc).extractTextFilesDetailsDF();
val res_txt = df_txt.select($"bytes", $"extension").saveToDisk("bytes", "/data/banq-datathon/PQ-2012/derivatives/binaries/text/pq-2012-text", "extension")

sys.exit

Expected behavior

We should probably just capture and log that error instead of letting it kill the task. ARC files carry "filedesc:" (and "dns:") records whose URL schemes java.net.URL can't parse, which is presumably what trips this up. I remember it coming up in testing with GeoCities, but it went away once all the Tika processing was in place.
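
Something like the following would do it, as a rough sketch; the helper name urlHostOrEmpty is made up here, and the real change would live wherever package.scala builds the java.net.URL (package.scala:444-448 in the trace above):

import java.net.URL
import scala.util.{Failure, Success, Try}

// Hypothetical helper, not the actual package.scala code: parse the record
// URL defensively so schemes java.net.URL cannot handle ("filedesc:", "dns:")
// get logged and skipped instead of failing the whole Spark task.
def urlHostOrEmpty(recordUrl: String): String =
  Try(new URL(recordUrl)) match {
    case Success(url) => url.getHost
    case Failure(e) =>
      System.err.println(s"Skipping unparsable record URL '$recordUrl': ${e.getMessage}")
      ""
  }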

Environment information

  • AUT version: 0.18.0
  • OS: Ubuntu 18.04
  • Java version: OpenJDK8
  • Apache Spark version: 2.4.4
  • Apache Spark w/aut: --packages
  • Apache Spark command used to run AUT: /home/ubuntu/aut/spark-2.4.4-bin-hadoop2.7/bin/spark-shell --master local[30] --driver-memory 105g --conf spark.network.timeout=100000000 --conf spark.executor.heartbeatInterval=6000s --conf spark.driver.maxResultSize=100g --conf spark.rdd.compress=true --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.shuffle.compress=true --conf spark.kryoserializer.buffer.max=2000m --packages "io.archivesunleashed:aut:0.18.0"
ruebot self-assigned this Sep 23, 2019
ruebot added this to To Do in 1.0.0 Release of AUT Nov 14, 2019
ruebot added a commit that referenced this issue Dec 18, 2019
- Add filedesc and dns filters (ARC files)
- Add test case
1.0.0 Release of AUT automation moved this from To Do to Done Dec 18, 2019
ianmilligan1 added a commit that referenced this issue Dec 18, 2019
* Add additional filters for textFiles; resolves #362.

- Add filedesc and dns filters (ARC files)
- Add test case
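
For reference, a conceptual sketch of the filter those commits describe, shown at the RDD level because it has to run before the library builds a java.net.URL from each record (this assumes getUrl on ArchiveRecord returns the record's URL string; the actual fix in #393 sits inside the library, not in user code):

import io.archivesunleashed._

// Drop ARC housekeeping records ("filedesc:") and DNS lookup records ("dns:")
// before anything downstream tries to parse the record URL.
val filtered = RecordLoader
  .loadArchives("/data/banq-datathon/PQ-2012/warcs/*gz", sc)
  .filter(r => !r.getUrl.startsWith("filedesc:") && !r.getUrl.startsWith("dns:"))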