Describe how to start PySpark console in Docker container #22

Merged
merged 3 commits into from May 29, 2020

@sepastian
Contributor

sepastian commented May 29, 2020

No description provided.

sepastian added 2 commits May 29, 2020
Remove build dependencies (git, wget).
@ruebot
Member

ruebot commented May 29, 2020

@sepastian looks like you need to update your local fork. It appears that you have your previous commit in there from the PR earlier today.

It's super helpful if you do all this on a branch other than master too.

@sepastian
Contributor Author

sepastian commented May 29, 2020

You mean using a feature branch? Which name?

I edited the README on GitHub, not sure why it included the other PR again 🤔

Member

ruebot left a comment

Overall, solid section. It needs a few updates before I can merge.

Thanks!

$ docker run -it --rm archivesunleashed/docker-aut \
/spark/bin/pyspark \
--py-files /aut/target/aut.zip \
--packages "io.archivesunleashed:aut:0.70" # Download Java/Scala packages from maven central

@ruebot

ruebot May 29, 2020

Member

This needs to be updated for the master branch, since it builds on the master branch from aut. Or, this section should be moved to the 0.70.0 branch.

It is also possible to start an interactive PySpark console. This requires specifying Python bindings and Java/Scala packages, both of which are included in the Docker image under `/aut/target`.
```bash
$ docker run -it --rm archivesunleashed/docker-aut \

@ruebot

ruebot May 29, 2020

Member

I'd separate the command out. Then show the output like above.

>>>
```
The example above loads version `0.70.1` of the Java/Scala packages. Your build may have packages in another version; to see what is available and select the right files, run the following.

@ruebot

ruebot May 29, 2020

Member

There is no 0.70.1 release. That's a snapshot from master on the aut repo.

--packages "io.archivesunleashed:aut:0.70" # Download Java/Scala packages from maven central
```
See also https://github.com/archivesunleashed/aut#archives-unleashed-toolkit-with-pyspark.

@ruebot

ruebot May 29, 2020

Member

I'd change this to:

For more information, see the Archives Unleashed Toolkit with PySpark section of the Toolkit README.

@ruebot
Member

ruebot commented May 29, 2020

Looking good overall. This will be really great to have in here!

$ docker run --rm -it aut /spark/bin/pyspark --py-files /aut/target/aut.zip --jars /aut/target/aut-0.70.1-SNAPSHOT-fatjar.jar
Python 3.6.9 (default, Oct 17 2019, 11:10:22) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
20/05/29 14:43:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.5
      /_/

Using Python version 3.6.9 (default, Oct 17 2019 11:10:22)
SparkSession available as 'spark'.
>>> from aut import *
>>> from pyspark.sql.functions import col, desc
>>> 
>>> webgraph = WebArchive(sc, sqlContext, "/aut-resources/Sample-Data/*.gz").webgraph()
>>> webgraph.show()
+----------+--------------------+--------------------+--------------------+
|crawl_date|                 src|                dest|              anchor|
+----------+--------------------+--------------------+--------------------+
|  20091218|http://www.equalv...|http://www.equalv...|                    |
|  20091218|http://www.equalv...|http://www.equalv...|       RSS SUBSCRIBE|
|  20091218|http://www.equalv...|http://www.equalv...|Bulletin d’AVE - ...|
|  20091218|http://www.equalv...|http://www.equalv...|MORE ABOUT EV'S Y...|
|  20091218|http://www.equalv...|http://www.thesta...|Coyle: Honouring ...|
|  20091218|http://www.equalv...|http://gettingtot...|Getting to the Ga...|
|  20091218|http://www.equalv...|http://www.snapde...|                    |
|  20091218|http://www.libera...|http://www.libera...|Liberal Party of ...|
|  20091218|http://www.libera...|http://www.libera...|   Michael Ignatieff|
|  20091218|http://www.libera...|http://www.libera...|        Introduction|
|  20091218|http://www.libera...|http://www.libera...|           Biography|
|  20091218|http://www.libera...|http://www.libera...|            Speeches|
|  20091218|http://www.libera...|http://www.libera...|        Publications|
|  20091218|http://www.libera...|http://www.libera...|              Photos|
|  20091218|http://www.libera...|http://www.libera...|              Videos|
|  20091218|http://www.libera...|http://www.libera...|                Team|
|  20091218|http://www.libera...|http://www.libera...|Members of Parlia...|
|  20091218|http://www.libera...|http://www.libera...|  Opposition Critics|
|  20091218|http://www.libera...|http://www.libera...|            Senators|
|  20091218|http://www.libera...|http://www.libera...|   In Your Community|
+----------+--------------------+--------------------+--------------------+
only showing top 20 rows

>>> 
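
For anyone who wants to script the launch instead of typing it out, here is a minimal sketch that assembles the same `docker run` invocation as the session above. `pyspark_command` is a hypothetical helper written for illustration; it is not part of aut or docker-aut. The jar file name is passed in rather than guessed, since it depends on the aut version the image was built from, and the image name `aut` is simply the one used in the session above.

```python
import shlex


def pyspark_command(image, jar_name, target_dir="/aut/target"):
    """Build the `docker run` argument list for an interactive PySpark console.

    `jar_name` is the fat jar file inside `target_dir`. Its name depends on the
    aut version in the image, so the caller supplies it (assumption: the Python
    bindings are always called aut.zip, as in the session above).
    """
    return [
        "docker", "run", "--rm", "-it", image,
        "/spark/bin/pyspark",
        "--py-files", f"{target_dir}/aut.zip",
        "--jars", f"{target_dir}/{jar_name}",
    ]


if __name__ == "__main__":
    # Values copied from the session above; adjust them to match your build.
    cmd = pyspark_command("aut", "aut-0.70.1-SNAPSHOT-fatjar.jar")
    # shlex.join (Python 3.8+) renders the list as a copy-pasteable shell line.
    print(shlex.join(cmd))
```

Keeping the command as a list makes it easy to hand to `subprocess.run` later without worrying about shell quoting.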
@ruebot
Member

ruebot commented May 29, 2020

@sepastian I'll pull this down locally, clean it up, and get it merged in. Don't worry about doing updates.

@ruebot ruebot merged commit be7f5b1 into archivesunleashed:master May 29, 2020
1 check passed
ci/dockercloud Your tests passed in Docker Cloud
@ruebot
Member

ruebot commented May 29, 2020

@sepastian all updated!

Checkout:

I'll pull in a version of this into the next release branch for docker-aut.

Feel free to create an issue or PR with any more updates. These contributions are great, and super helpful!

2 participants