Trouble testing s3 connectivity #319

Open
obrienben opened this issue Apr 30, 2019 · 1 comment

@obrienben commented Apr 30, 2019

I'm just trying to test AUT connectivity to an s3 bucket (as per our conversation, @ruebot), and I'm not having any luck, so I thought I'd share what I've tried so far. Disclaimer: my Spark and Scala knowledge is limited.

I've set up an s3 bucket with some WARCs in it, which I can access through plain Python using boto3, so I know that my user and access credentials are working.

Based on Cloudera's guide to accessing S3 from Spark (https://www.cloudera.com/documentation/enterprise/6/6.2/topics/spark_s3.html), this was my test AUT script:

import io.archivesunleashed._
import io.archivesunleashed.matchbox._

// Pass the S3 credentials to Hadoop's s3a connector.
sc.hadoopConfiguration.set("fs.s3a.access.key", "<my-access-key>")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "<my-secret-key>")

// Load the WARCs from the bucket and take the top ten domains by count.
val r = RecordLoader.loadArchives("s3a://<my-bucket>/*.gz", sc)
  .keepValidPages()
  .map(r => ExtractDomain(r.getUrl))
  .countItems()
  .take(10)

Which gives me the following error:

java.lang.IllegalArgumentException: Wrong FS: s3a://ndha.prc.wod-aut-test/, expected: file:///
  at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:649)
  at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:82)
  at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:427)
  at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1517)
  at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1557)
  at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:674)
  at org.apache.hadoop.fs.Globber.listStatus(Globber.java:69)
  at org.apache.hadoop.fs.Globber.glob(Globber.java:217)
  at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1657)
  at io.archivesunleashed.package$RecordLoader$.getFiles(package.scala:52)
  at io.archivesunleashed.package$RecordLoader$.loadArchives(package.scala:66)
  ... 58 elided
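
As a sanity check (a sketch, reusing the <my-bucket> placeholder from above), one way to verify that the s3a connector and credentials work at all, independent of AUT, is to list the bucket directly through Hadoop's own FileSystem API from the same spark-shell:

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// Resolve the filesystem from the s3a URI itself, then list the bucket.
// If this prints the WARC paths, the connector and credentials are fine
// and the problem is in how AUT resolves the path.
val fs = FileSystem.get(new URI("s3a://<my-bucket>/"), sc.hadoopConfiguration)
fs.listStatus(new Path("s3a://<my-bucket>/")).foreach(s => println(s.getPath))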

Has anyone else had success supplying the toolkit with WARCs from S3?

@ruebot ruebot added the feature label Apr 30, 2019

@ianmilligan1 (Member) commented May 22, 2019

Sorry for the delay on this. The problem is that our RecordLoader expects things to come from the local file system rather than from S3. Going to take a look around to see if there's any kind of workaround.
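
To make that concrete: the stack trace above fails inside io.archivesunleashed.package$RecordLoader$.getFiles, and the "expected: file:///" message suggests the FileSystem there is built from the default configuration (which resolves to the local filesystem) rather than from the path's own URI. A hypothetical sketch of the kind of change that would address it follows; the actual getFiles signature in package.scala may differ:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// Derive the FileSystem from the path's scheme (file://, hdfs://, s3a://, ...)
// instead of defaulting to the local filesystem, so s3a globs hit S3.
def getFiles(dir: Path, conf: Configuration): Seq[String] = {
  val fs = FileSystem.get(dir.toUri, conf) // picks the FS matching the scheme
  // globStatus returns null when nothing matches, so guard against it
  Option(fs.globStatus(dir))
    .getOrElse(Array.empty[FileStatus])
    .map(_.getPath.toString)
    .toSeq
}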
