DataFrame commands throwing java.lang.NullPointerException on example data #320

Closed
ianmilligan1 opened this issue Jun 18, 2019 · 7 comments

@ianmilligan1 (Member) commented Jun 18, 2019

Right now on 0.17.0, using Docker, running any DataFrame command leads to a java.lang.NullPointerException.

For example,

scala> :paste
// Entering paste mode (ctrl-D to finish)

import io.archivesunleashed._
import io.archivesunleashed.df._

val df = RecordLoader.loadArchives("example.arc.gz", sc)
  .extractValidPagesDF()

df.printSchema()

leads to

// Exiting paste mode, now interpreting.

java.lang.NullPointerException
  at scala.collection.mutable.ArrayOps$ofRef$.newBuilder$extension(ArrayOps.scala:190)
  at scala.collection.mutable.ArrayOps$ofRef.newBuilder(ArrayOps.scala:186)
  at scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:246)
  at scala.collection.TraversableLike$class.filter(TraversableLike.scala:259)
  at scala.collection.mutable.ArrayOps$ofRef.filter(ArrayOps.scala:186)
  at io.archivesunleashed.package$RecordLoader$.getFiles(package.scala:53)
  at io.archivesunleashed.package$RecordLoader$.loadArchives(package.scala:66)
  ... 54 elided

We should get the DataFrame commands working out of the box on Docker (which I think they did before).
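
Based on the stack trace, a plausible culprit is Hadoop's glob API. Here's a minimal sketch of the suspected failure mode, assuming getFiles resolves the input path with FileSystem.globStatus (an assumption on my part; the filter frames in the trace are consistent with that call returning null):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// globStatus returns null (not an empty array) when a path with no
// glob characters matches nothing, e.g. when example.arc.gz isn't
// present in the container's working directory.
val path = new Path("example.arc.gz")
val fs = path.getFileSystem(new Configuration())
val statuses = fs.globStatus(path) // null here
statuses.filter(_.isFile)          // NullPointerException, as in the trace above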

ianmilligan1 added the bug label Jun 18, 2019

@ianmilligan1 (Member Author) commented Jun 18, 2019

Works when running natively with

alias aut45='/home/i2millig/spark-2.3.2-bin-hadoop2.7/bin/spark-shell --driver-memory 45G --packages "io.archivesunleashed:aut:0.17.0"'

but fails when running with

docker run --rm -it -v "/Users/ianmilligan1/desktop/data:/data" archivesunleashed/docker-aut:0.17.0

Apologies, this probably belongs in the docker repo.

@ianmilligan1 (Member Author) commented Jun 18, 2019

Works if we read in a directory, i.e.

import io.archivesunleashed._
import io.archivesunleashed.df._

val df = RecordLoader.loadArchives("/data/*", sc)
  .extractValidPagesDF()

df.printSchema()

@ruebot (Member) commented Jun 20, 2019

So, is it just a documentation issue on archivesunleashed.org/aut?

@ianmilligan1 (Member Author) commented Jun 20, 2019

No, it can't read example.arc.gz, and it doesn't seem to support *.gz wildcarding without throwing an error. For consistency, it'd be nice if it were always able to read example.arc.gz.

i.e., this doesn't work:

scala> :paste
// Entering paste mode (ctrl-D to finish)

import io.archivesunleashed._
import io.archivesunleashed.df._

val df = RecordLoader.loadArchives("*.gz", sc)
  .extractValidPagesDF()

df.printSchema()

Or we can just say not to use it with Docker?
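
For what it's worth, anchoring the glob at the bind-mounted directory works (as in the earlier comment), since a bare *.gz is resolved against the container's working directory. A sketch, assuming the -v mount from the docker run command above and that the archives in /data end in .gz:

import io.archivesunleashed._
import io.archivesunleashed.df._

// Absolute path into the bind mount, so the glob doesn't depend on
// the container's working directory.
val df = RecordLoader.loadArchives("/data/*.gz", sc)
  .extractValidPagesDF()

df.printSchema()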

@ruebot (Member) commented Jun 20, 2019

I can't reproduce it:

Standalone:

Spark context Web UI available at http://172.17.0.1:4040
Spark context available as 'sc' (master = local[*], app id = local-1561031350339).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.3
      /_/
         
Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)
Type in expressions to have them evaluated.
Type :help for more information.

scala> :paste
// Entering paste mode (ctrl-D to finish)

import io.archivesunleashed._
import io.archivesunleashed.df._

val df = RecordLoader.loadArchives("/home/nruest/Projects/au/aut-resources/Sample-Data/ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz", sc)
  .extractValidPagesDF()

df.printSchema()

// Exiting paste mode, now interpreting.

root
 |-- CrawlDate: string (nullable = true)
 |-- Url: string (nullable = true)
 |-- MimeType: string (nullable = true)
 |-- Content: string (nullable = true)

import io.archivesunleashed._
import io.archivesunleashed.df._
df: org.apache.spark.sql.DataFrame = [CrawlDate: string, Url: string ... 2 more fields]

scala> :paste
// Entering paste mode (ctrl-D to finish)

import io.archivesunleashed._
import io.archivesunleashed.df._

val df = RecordLoader.loadArchives("/home/nruest/Projects/au/aut-resources/Sample-Data/ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc.gz", sc)
  .extractValidPagesDF()

df.printSchema()

// Exiting paste mode, now interpreting.

root
 |-- CrawlDate: string (nullable = true)
 |-- Url: string (nullable = true)
 |-- MimeType: string (nullable = true)
 |-- Content: string (nullable = true)

import io.archivesunleashed._
import io.archivesunleashed.df._
df: org.apache.spark.sql.DataFrame = [CrawlDate: string, Url: string ... 2 more fields]

scala> :paste
// Entering paste mode (ctrl-D to finish)

import io.archivesunleashed._
import io.archivesunleashed.df._

val df = RecordLoader.loadArchives("/home/nruest/Projects/au/aut-resources/Sample-Data/*.gz", sc)
  .extractValidPagesDF()

df.printSchema()


// Exiting paste mode, now interpreting.

root
 |-- CrawlDate: string (nullable = true)
 |-- Url: string (nullable = true)
 |-- MimeType: string (nullable = true)
 |-- Content: string (nullable = true)

import io.archivesunleashed._
import io.archivesunleashed.df._
df: org.apache.spark.sql.DataFrame = [CrawlDate: string, Url: string ... 2 more fields]

Docker:

To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://9092d9b58a11:4040
Spark context available as 'sc' (master = local[*], app id = local-1561031732106).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.2
      /_/
         
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)
Type in expressions to have them evaluated.
Type :help for more information.

scala> :paste
// Entering paste mode (ctrl-D to finish)

import io.archivesunleashed._
import io.archivesunleashed.df._

val df = RecordLoader.loadArchives("/aut-resources/Sample-Data/*.gz", sc)
  .extractValidPagesDF()

df.printSchema()

// Exiting paste mode, now interpreting.

2019-06-20 11:56:05 WARN  ObjectStore:6666 - Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
2019-06-20 11:56:05 WARN  ObjectStore:568 - Failed to get database default, returning NoSuchObjectException
2019-06-20 11:56:06 WARN  ObjectStore:568 - Failed to get database global_temp, returning NoSuchObjectException
root
 |-- CrawlDate: string (nullable = true)
 |-- Url: string (nullable = true)
 |-- MimeType: string (nullable = true)
 |-- Content: string (nullable = true)

import io.archivesunleashed._
import io.archivesunleashed.df._
df: org.apache.spark.sql.DataFrame = [CrawlDate: string, Url: string ... 2 more fields]

scala> :paste
// Entering paste mode (ctrl-D to finish)

import io.archivesunleashed._
import io.archivesunleashed.df._

val df = RecordLoader.loadArchives("/aut-resources/Sample-Data/ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz", sc)
  .extractValidPagesDF()

df.printSchema()

// Exiting paste mode, now interpreting.

root
 |-- CrawlDate: string (nullable = true)
 |-- Url: string (nullable = true)
 |-- MimeType: string (nullable = true)
 |-- Content: string (nullable = true)

import io.archivesunleashed._
import io.archivesunleashed.df._
df: org.apache.spark.sql.DataFrame = [CrawlDate: string, Url: string ... 2 more fields]

scala> :paste
// Entering paste mode (ctrl-D to finish)

import io.archivesunleashed._
import io.archivesunleashed.df._

val df = RecordLoader.loadArchives("/aut-resources/Sample-Data/ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc.gz", sc)
  .extractValidPagesDF()

df.printSchema()

// Exiting paste mode, now interpreting.

root
 |-- CrawlDate: string (nullable = true)
 |-- Url: string (nullable = true)
 |-- MimeType: string (nullable = true)
 |-- Content: string (nullable = true)

import io.archivesunleashed._
import io.archivesunleashed.df._
df: org.apache.spark.sql.DataFrame = [CrawlDate: string, Url: string ... 2 more fields]

I'm certain it is a documentation issue, or a misreading of it. There is no example.arc.gz in docker-aut. There are the sample ARC and WARC files in /aut-resources/Sample-Data:

  • ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz
  • ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc.gz

All of the documentation here uses example.arc.gz as an example file, and the lesson we use with Docker doesn't have a DataFrame example in it.
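
If we also wanted the loader to fail more helpfully than with an NPE when a path matches nothing, a cheap guard might look like this (a hypothetical getFilesSafe, sketched against the assumed globStatus call, not aut's actual getFiles):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hypothetical helper: wrap the possibly-null globStatus result and
// raise a descriptive error instead of letting filter NPE.
def getFilesSafe(pattern: String): Array[String] = {
  val path = new Path(pattern)
  val fs = path.getFileSystem(new Configuration())
  Option(fs.globStatus(path)) match {
    case Some(statuses) => statuses.filter(_.isFile).map(_.getPath.toString)
    case None => throw new java.io.FileNotFoundException(s"No files matched: $pattern")
  }
}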

@ianmilligan1 (Member Author) commented Jun 20, 2019

🤦‍♂

Oh, of course. I'll close this with egg on my face. Sorry @ruebot.

@ruebot (Member) commented Jun 20, 2019

No worries! :-D
