Convert RecordLoader.loadArchives to a Spark Data Source #371

ruebot · 2019-11-05T23:12:18Z

Since we're pivoting to full DataFrame support (#223, #190), we should migrate RecordLoader.loadArchives, and any related functions, to a Spark Data Source. That way we could do things like:

spark.read.format("webArchive")
  .option("mode", "FAILFAST")
  .option("inferSchema", "true")
  .schema(someSchema)
  .load("/path/to/files")
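For reference, here is a rough sketch of what the provider side might look like using Spark's V1 data source traits (`RelationProvider`, `DataSourceRegister`, `TableScan`). The class names `WebArchiveSource` and `WebArchiveRelation`, the two-column schema, and the empty scan are all placeholders I made up for illustration; none of this exists in aut yet.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, DataSourceRegister, RelationProvider, TableScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical provider class: the name and the "webArchive" alias are assumptions.
class WebArchiveSource extends RelationProvider with DataSourceRegister {

  // Enables format("webArchive") once this class is listed in
  // META-INF/services/org.apache.spark.sql.sources.DataSourceRegister.
  override def shortName(): String = "webArchive"

  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation = {
    // The argument of load("/path/to/files") arrives as the "path" option.
    val path = parameters.getOrElse(
      "path", throw new IllegalArgumentException("webArchive requires a path"))
    new WebArchiveRelation(path)(sqlContext)
  }
}

// Hypothetical relation: a full table scan over the archives at `path`.
class WebArchiveRelation(val path: String)(@transient val sqlContext: SQLContext)
    extends BaseRelation with TableScan {

  // Placeholder schema, assumed for illustration only.
  override def schema: StructType = StructType(Seq(
    StructField("url", StringType, nullable = true),
    StructField("content", StringType, nullable = true)))

  // Placeholder scan that returns no rows. A real version would turn the
  // records produced by RecordLoader.loadArchives into Rows matching `schema`.
  override def buildScan(): RDD[Row] = sqlContext.sparkContext.emptyRDD[Row]
}
```

Until the short name is registered, the source should also be reachable by class name, e.g. `spark.read.format(classOf[WebArchiveSource].getName).load("/path/to/files")`. Accepting a user-supplied `.schema(...)` would additionally mean implementing `SchemaRelationProvider`, which this sketch leaves out.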

Then, since it's an open issue (#147), we could write WARCs that way too? 🤷‍♂

// df is any DataFrame of archive records; writes go through DataFrame.write
df.write.format("webArchive")
  .mode("overwrite")
  .save("/path/to/files")

These are the Spark core data sources:

CSV
JSON
Parquet
ORC
JDBC/ODBC
Plain-text
Avro

Community implemented data sources:

Cassandra
HBase
MongoDB
AWS Redshift
XML

ruebot added the Python, Java, Scala, feature, and DataFrames labels on Nov 5, 2019
