
Upgrade to Hadoop 3.x #329

Open
jrwiebe opened this issue Jul 24, 2019 · 3 comments

@jrwiebe
Contributor

commented Jul 24, 2019

AUT currently uses Hadoop 2.6.5. Though it is stable, at three years old it is beginning to show its age. I discovered this when testing S3 access (#319): hadoop-aws 2.6.5 cannot authenticate with temporary security credentials (probably an edge case) or with endpoints that require Signature Version 4 (many do). Upgrading to a current branch of Hadoop should mostly be a matter of bringing our other dependencies up to date, though that might not be simple.
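
For context, this is roughly the s3a setup a newer hadoop-aws would let us use -- a minimal sketch against Hadoop 2.8+/3.x, with the app name, endpoint, and environment-variable handling as placeholders. The temporary-credential pieces (`TemporaryAWSCredentialsProvider`, `fs.s3a.session.token`) simply don't exist in hadoop-aws 2.6.5:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch (assumes hadoop-aws 2.8+/3.x on the classpath); the app name,
// endpoint, and env vars below are placeholders, not AUT's actual config.
val spark = SparkSession.builder().appName("aut-s3-test").getOrCreate()
val hadoopConf = spark.sparkContext.hadoopConfiguration

// Temporary (STS) security credentials: requires the session-token provider,
// which hadoop-aws 2.6.5 does not ship.
hadoopConf.set("fs.s3a.aws.credentials.provider",
  "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
hadoopConf.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
hadoopConf.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
hadoopConf.set("fs.s3a.session.token", sys.env("AWS_SESSION_TOKEN"))

// Signature Version 4-only regions also want an explicit regional endpoint.
hadoopConf.set("fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com")
```

After that, `s3a://` paths should resolve like any other filesystem URI.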

At present I would suggest going with version 3.1.2, the latest 3.1.x release. Its docs say: "This release is generally available (GA), meaning that it represents a point of API stability and quality that we consider production-ready." Alternatively, 3.0.0 -- all of the Cloudera CDH 6 releases use that version, which is an indication of its stability and wide use.

I'm not sure of the implications of using a distribution of Spark built with an older Hadoop to run our code that depends on Hadoop 3 (Spark 2.4.3 uses Hadoop 2.6.5). How would the version conflicts be resolved if we included the Hadoop 3 dependencies in our fatjar (we currently exclude them) and ran it on a Spark built with Hadoop 2.6.5? I imagine including Hadoop in the fatjar should work if we instruct people to use the version of Spark built without Hadoop, but I think it's unreasonable to expect people to build Spark themselves.

@ruebot

Member

commented Jul 24, 2019

We'll have to sort out FileUtil.copyMerge, since it was deprecated and has been removed in Hadoop 3.x. Luckily it is only used here.

Preliminary StackOverflow searching says we can implement our own version in Scala. Is that something we'd want to do, or should we look for a better solution to combine all the part files?
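
For reference, a minimal sketch of what that re-implementation could look like (untested; assumes the part files all sit under one source directory and should be concatenated in name order, and uses only the public FileSystem/IOUtils APIs):

```scala
import java.io.IOException

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils

// Concatenate every file under srcDir into a single dstFile, optionally
// deleting the source directory afterwards -- roughly what the old
// FileUtil.copyMerge did for us.
def copyMerge(srcFS: FileSystem, srcDir: Path,
              dstFS: FileSystem, dstFile: Path,
              deleteSource: Boolean, conf: Configuration): Boolean = {
  if (dstFS.exists(dstFile)) {
    throw new IOException(s"Target $dstFile already exists")
  }
  if (!srcFS.getFileStatus(srcDir).isDirectory) {
    return false
  }
  val out = dstFS.create(dstFile)
  try {
    srcFS.listStatus(srcDir)
      .filter(_.isFile)
      .sortBy(_.getPath.getName) // keep part-00000, part-00001, ... in order
      .foreach { status =>
        val in = srcFS.open(status.getPath)
        try {
          IOUtils.copyBytes(in, out, conf, false)
        } finally {
          in.close()
        }
      }
  } finally {
    out.close()
  }
  if (deleteSource) srcFS.delete(srcDir, true) else true
}
```

The call site would look roughly like the old one, e.g. `copyMerge(fs, new Path(srcDir), fs, new Path(mergedFile), false, sc.hadoopConfiguration)` (names hypothetical).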

...and maybe there is a way to pull off an hdfs cat 🤷‍♂

Anyway, I'll keep digging and check out the Hadoop 3.1.2 API docs.

@jrwiebe

Contributor Author

commented Jul 24, 2019

The Scala re-implementation looks good. I'd use it.

@greebie

Contributor

commented Jul 24, 2019

Just dropping in to confirm that I've played around with the Scala re-implementation on 3.1.1 and it works fine.

@ruebot added the enhancement label Jul 25, 2019
