Spark 3.0.0 + Java 11 support. #375

Merged
merged 51 commits into from Jun 18, 2020

Conversation

@ruebot
Member

ruebot commented Nov 10, 2019

GitHub issue(s): #356

What does this Pull Request do?

Mostly pom.xml updates, plus a lot of cleanup. The big changes are the updates to Apache Spark 3.0.0 and Java 11.

How should this be tested?

  • TravisCI
  • Exhaustive regression testing; basically hit everything in the documentation, and make sure it works.
  • Testing gist (updating it as I go through)

Additional Notes:

  • BIG KICKER
    • I cannot get this to work with --packages. We hit an ugly dependency wall really quickly. If we exclude the offending dependency, all the tests fail badly. If I explicitly include it as a dependency, same thing: the tests fail badly.
  • I'm going to leave this as a draft, and we shouldn't merge until there is an official Spark 3.0.0 release, and we make a decision on --packages.
  • We'll squash this all down, and make a nice commit message when the time comes.
ruebot added 5 commits Aug 31, 2019
- Some hacks to get a successful build
- Definitely need to loop back and clean-up a whole lot!
- Addresses #356
…talled 🤦, and a bunch more pom cleanup.
@ruebot ruebot requested review from lintool and ianmilligan1 Nov 10, 2019
ruebot added 4 commits Nov 10, 2019
@codecov

codecov bot commented Nov 10, 2019

Codecov Report

Merging #375 into master will increase coverage by 4.50%.
The diff coverage is n/a.

@@             Coverage Diff              @@
##             master     #375      +/-   ##
============================================
+ Coverage     83.69%   88.20%   +4.50%     
- Complexity        0       57      +57     
============================================
  Files            43       43              
  Lines          1245      958     -287     
  Branches        239       86     -153     
============================================
- Hits           1042      845     -197     
+ Misses           80       74       -6     
+ Partials        123       39      -84     
@lintool
Member

lintool commented Nov 10, 2019

This is awesome! Did a quick check, everything looks sane to me.

ruebot added 3 commits Nov 10, 2019
@ianmilligan1
Member

ianmilligan1 commented Nov 10, 2019

Thanks for this @ruebot! Have built it locally, but will take my time to exhaustively run through the docs before giving it the thumbs up.

ruebot added 2 commits Nov 10, 2019
@ruebot ruebot added this to In Progress in 1.0.0 Release of AUT Nov 14, 2019
ruebot added 4 commits Nov 18, 2019
ruebot added 4 commits Nov 28, 2019
ruebot added 14 commits Feb 18, 2020
…ue-356
@ruebot ruebot marked this pull request as ready for review Jun 17, 2020
@ruebot
Member Author

ruebot commented Jun 17, 2020

These all work with this branch, and Spark 3.0.0 (Hadoop 2.7).

@ruebot
Member Author

ruebot commented Jun 17, 2020

I didn't test this one, but it is covered in all the others for the most part.

@ruebot
Member Author

ruebot commented Jun 17, 2020

I'll try and get some s3 smoke testing done later today or tomorrow.

@ruebot
Member Author

ruebot commented Jun 17, 2020

spark-shell smoke test:
Screenshot from 2020-06-17 15-36-19

pyspark smoke test:
Screenshot from 2020-06-17 15-37-04

Member

ianmilligan1 left a comment

Looks great! Tested in Spark shell

Screen Shot 2020-06-17 at 5 07 33 PM

and Python 3 notebook:

Screen Shot 2020-06-17 at 5 08 25 PM

I didn't test every single thing in the documentation, but did get broad representation across all the main commands and the most important functions that we run.

👏 👏 👏 @ruebot, this is a real achievement!

@SamFritz
Member

SamFritz commented Jun 17, 2020

Congratulations @ruebot 👏 👏! So much work has gone into this, and so excited to see you push it past the finish line with this wonderful achievement.

ruebot added a commit to archivesunleashed/aut-docs that referenced this pull request Jun 18, 2020
@ruebot
Member Author

ruebot commented Jun 18, 2020

Documentation update: archivesunleashed/aut-docs#83

@ruebot
Member Author

ruebot commented Jun 18, 2020

s3 smoke test good:

import io.archivesunleashed._

sc.hadoopConfiguration.set("fs.s3a.access.key", "<my-access-key>")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "<my-secret-key>")

RecordLoader.loadArchives("s3a://au-geocities/", sc).webgraph().show(10)

// Exiting paste mode, now interpreting.

+----------+--------------------+--------------------+--------------------+     
|crawl_date|                 src|                dest|              anchor|
+----------+--------------------+--------------------+--------------------+
|  20090801|http://geocities....|mailto:CindyFowle...|          E-MAIL ME!|
|  20090801|http://geocities....|mailto:cindyfowle...|          E-mail Me!|
|  20090801|http://geocities....|mailto:CindyFowle...|           E-Mail Me|
|  20090801|http://geocities....|mailto:CindyFowle...|          E-MAIL ME!|
|  20090801|http://geocities....|http://www.geocit...|Knitting Tips and...|
|  20090801|http://geocities....|http://knittingmi...|More Reader's Tip...|
|  20090801|http://geocities....|http://www.frugal...|Frugal Knitting H...|
|  20090801|http://geocities....|mailto:CindyFowle...|If you have quest...|
|  20090801|http://geocities....|http://geocities....| BACK TO MY HOMEPAGE|
|  20090801|http://geocities....|mailto:CindyFowle...|          E-MAIL ME!|
+----------+--------------------+--------------------+--------------------+
only showing top 10 rows

import io.archivesunleashed._

scala> :quit
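
The smoke test above passes the two `fs.s3a.*` values as string literals. A minimal sketch of one alternative, assuming the credentials are exported as the standard AWS environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`. `S3aSettings` and `fromEnv` are hypothetical names used only for illustration; they are not part of aut or Spark.

```scala
// Hypothetical helper for illustration; not part of aut or Spark.
object S3aSettings {
  // Environment variable name -> Hadoop s3a property set in the smoke test above.
  private val envToProperty: Seq[(String, String)] = Seq(
    "AWS_ACCESS_KEY_ID"     -> "fs.s3a.access.key",
    "AWS_SECRET_ACCESS_KEY" -> "fs.s3a.secret.key"
  )

  // Returns the property map, or a message naming every missing or empty variable.
  // The environment is a parameter (defaulting to the process environment)
  // so the function can be exercised without touching real credentials.
  def fromEnv(env: Map[String, String] = sys.env): Either[String, Map[String, String]] = {
    val missing = envToProperty.collect {
      case (name, _) if !env.get(name).exists(_.nonEmpty) => name
    }
    if (missing.nonEmpty)
      Left(s"Missing environment variables: ${missing.mkString(", ")}")
    else
      Right(envToProperty.map { case (name, prop) => prop -> env(name) }.toMap)
  }
}
```

In spark-shell, each entry of the resulting map could then be applied with the same `sc.hadoopConfiguration.set(key, value)` call used in the smoke test, so the keys never appear in a pasted transcript.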
@ianmilligan1 ianmilligan1 merged commit 59b1d4e into master Jun 18, 2020
3 checks passed
codecov/patch Coverage not affected when comparing cbff479...809d1e3
codecov/project 88.20% (+4.50%) compared to cbff479
continuous-integration/travis-ci/pr The Travis CI build passed
1.0.0 Release of AUT automation moved this from In Progress to Done Jun 18, 2020
@ianmilligan1 ianmilligan1 deleted the issue-356 branch Jun 18, 2020
ianmilligan1 pushed a commit to archivesunleashed/aut-docs that referenced this pull request Jun 18, 2020
#83)

* Documentation updates for archivesunleashed/aut#375
* review
4 participants