Trouble testing s3 connectivity #319

Open
obrienben opened this issue Apr 30, 2019 · 6 comments

@obrienben commented Apr 30, 2019

I'm just trying to test AUT connectivity to an S3 bucket (as per our conversation, @ruebot), and not having any luck. I thought I'd share what I've tried so far. Disclaimer: my Spark and Scala knowledge is limited.

I've set up an S3 bucket with some WARCs in it, which I can access through plain Python using boto3. So I know that my user and access credentials are working.

Based on the following Spark-to-S3 guide, https://www.cloudera.com/documentation/enterprise/6/6.2/topics/spark_s3.html, this was my test AUT script:

import io.archivesunleashed._
import io.archivesunleashed.matchbox._

sc.hadoopConfiguration.set("fs.s3a.access.key", "<my-access-key>")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "<my-secret-key>")

val r = RecordLoader.loadArchives("s3a://<my-bucket>/*.gz", sc)
.keepValidPages()
.map(r => ExtractDomain(r.getUrl))
.countItems()
.take(10)

Which gives me the following error:

java.lang.IllegalArgumentException: Wrong FS: s3a://ndha.prc.wod-aut-test/, expected: file:///
  at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:649)
  at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:82)
  at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:427)
  at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1517)
  at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1557)
  at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:674)
  at org.apache.hadoop.fs.Globber.listStatus(Globber.java:69)
  at org.apache.hadoop.fs.Globber.glob(Globber.java:217)
  at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1657)
  at io.archivesunleashed.package$RecordLoader$.getFiles(package.scala:52)
  at io.archivesunleashed.package$RecordLoader$.loadArchives(package.scala:66)
  ... 58 elided

Has anyone else had success supplying the toolkit with WARCs from S3?

@ruebot added the feature label Apr 30, 2019

@ianmilligan1 (Member) commented May 22, 2019

Sorry for the delay on this. The problem is that our RecordLoader expects files to come from the local file system rather than from S3. Going to take a look around to see if there's any kind of workaround.

@obrienben (Author) commented May 27, 2019

Great, thanks Ian.

@jrwiebe (Contributor) commented Jul 23, 2019

I just pushed a commit to branch s3 to allow access to data in S3. It was a matter of modifying the POM to include a dependency on hadoop-aws and adjusting the exclude rules.

@obrienben's example above works now, with one addition – the line:

sc.hadoopConfiguration.set("fs.defaultFS", "s3a://<my-bucket>") // UPDATE: unnecessary

I'll look into making this unnecessary by changing how FileSystem.get() is called.
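For context, a minimal sketch of the kind of change that would involve – resolving the FileSystem from the path's own scheme rather than from fs.defaultFS. The helper below is illustrative only, not the actual code in the s3 branch or in 5cab57b:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// Hypothetical sketch: resolve the FileSystem from the path's scheme
// (s3a://, hdfs://, file://, ...) instead of relying on fs.defaultFS.
def getFiles(dir: Path, conf: Configuration): Seq[String] = {
  val fs: FileSystem = dir.getFileSystem(conf) // was: FileSystem.get(conf)
  // globStatus returns null when nothing matches, so guard against that
  Option(fs.globStatus(dir))
    .getOrElse(Array.empty[FileStatus])
    .map(_.getPath.toString)
    .toSeq
}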

A few notes:

The version of hadoop-aws used must be identical to the version of hadoop-common. This limits our S3 functionality, as there have been important improvements to the AWS code since 2.6.5, which is our current Hadoop version. I came across two:

  • Some S3 endpoints only accept what's called Signature Version 4 for authentication, while others accept Versions 2 and 4. hadoop-aws 2.6.5 can only handle Version 2. If your S3 store is on an endpoint that only accepts Version 4, you're out of luck unless you change hadoop.version in pom.xml to a (recent?) 2.7 release (I tested 2.7.7). This builds successfully. I tried a 2.8 release, but it resulted in a runtime error.

If you need to use Version 4, you'll have to add these lines to your code (change the ca-central-1 string as appropriate) (credit):

System.setProperty("com.amazonaws.services.s3.enableV4", "true")
sc.hadoopConfiguration.set("com.amazonaws.services.s3.enableV4", "true")
sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3.ca-central-1.amazonaws.com")
  • Support for temporary credentials, as provided by an AWS Educate Starter Account, was apparently only introduced with Hadoop 2.8. Upgrading our Hadoop dependencies is nontrivial. When it eventually happens, using temporary credentials will involve setting fs.s3a.aws.credentials.provider to org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider and setting fs.s3a.session.token to the session token (see the sketch below).
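For reference, a hedged sketch of what that configuration might look like from the Spark shell, assuming a future build against Hadoop 2.8+; the property names are standard s3a settings, and the placeholder values are illustrative:

// Hypothetical sketch – only applicable once aut builds against Hadoop 2.8+
sc.hadoopConfiguration.set("fs.s3a.aws.credentials.provider",
  "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
sc.hadoopConfiguration.set("fs.s3a.access.key", "<temporary-access-key>")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "<temporary-secret-key>")
sc.hadoopConfiguration.set("fs.s3a.session.token", "<session-token>")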
@jrwiebe (Contributor) commented Jul 24, 2019

Updated above comment to reflect 5cab57b

@ruebot (Member) commented Jul 24, 2019

@obrienben can you pull down that branch, build it, and test it?

@ianmilligan1 (Member) commented Jul 24, 2019

Couldn't help myself and wanted to test. Worked flawlessly out of the box on us-west-2, @jrwiebe!

[Screenshot: Screen_Shot_2019-07-24_at_9_55_17_AM]

and then

import io.archivesunleashed._
import io.archivesunleashed.matchbox._
r: Array[(String, Int)] = Array((www.equalvoice.ca,4644), (www.liberal.ca,1968), (greenparty.ca,732), (www.policyalternatives.ca,601), (www.fairvote.ca,465), (www.ndp.ca,417), (www.davidsuzuki.org,396), (www.canadiancrc.com,90), (www.gca.ca,40), (communist-party.ca,39))

The right results! If this works on @obrienben's end, let's move to a PR?
