
Adding Apache Zeppelin to docker and documentation #259

Open
Natkeeran opened this issue Aug 13, 2018 · 7 comments

@Natkeeran commented Aug 13, 2018

Is your feature request related to a problem? Please describe.
AUT is a tool targeted at researchers with limited technical expertise, yet command-line interaction can be a barrier for potential users. Many users in this domain are familiar with Python notebooks, and Apache Zeppelin provides a similar workspace and interactive mode of analysis.

Describe the solution you'd like
Installing Apache Zeppelin for experimental use is quite straightforward. Adding the instructions to the documentation and including Zeppelin in the aut Docker image would be useful. Down the road, an Apache Zeppelin sandbox would be an easier way for users to evaluate AUT.

Describe alternatives you've considered

Additional context

Installing Apache Zeppelin (generally /opt/zeppelin):

$ wget http://mirrors.gigenet.com/apache/zeppelin/zeppelin-0.6.2/zeppelin-0.6.2-bin-all.tgz
$ sudo tar -zxf zeppelin-0.6.2-bin-all.tgz
$ cd zeppelin-0.6.2-bin-all
$ sudo bin/zeppelin-daemon.sh start

Copy the default template:

$ cd /opt/zeppelin/conf
$ cp zeppelin-env.sh.template zeppelin-env.sh

Edit zeppelin-env.sh and provide the following configuration parameters:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export SPARK_HOME=/home/aut/spark-2.1.1-bin-hadoop2.7
export SPARK_SUBMIT_OPTIONS="--packages io.archivesunleashed:aut:0.16.0"

You can add additional Spark packages that may be useful for your analysis to SPARK_SUBMIT_OPTIONS.
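For example, pulling an extra library in alongside aut might look like this (the spark-csv coordinates are purely illustrative - substitute whatever packages your analysis actually needs):

export SPARK_SUBMIT_OPTIONS="--packages io.archivesunleashed:aut:0.16.0,com.databricks:spark-csv_2.11:1.5.0"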

Restart Zeppelin:

$ sudo bin/zeppelin-daemon.sh restart

Go to `http://localhost:8080/`.
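If the page doesn't come up, the daemon script's status subcommand will report whether Zeppelin is actually running:

$ sudo bin/zeppelin-daemon.sh status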

@ruebot (Member) commented Aug 13, 2018

I'm almost certain Jupyter Notebook works fine on HEAD. Basic PySpark and Dataframe support was added back around April.

I think @ianmilligan1 has some basic documentation around this hanging around. We haven't published any of this on the site because we're waiting for our next release; once that happens, it will go up. But in the interim, all we really have is outlined in closed and open issues: 1.0.0 Release of AUT, DataFrames and PySpark.
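(For anyone who wants to experiment in the meantime, here is a rough sketch - not a tested recipe; the Spark path comes from earlier in this thread, and the --py-files/--jars values are placeholders for a locally built aut - that sets Jupyter as the PySpark driver and hands the aut artifacts to pyspark:

$ export PYSPARK_DRIVER_PYTHON=jupyter
$ export PYSPARK_DRIVER_PYTHON_OPTS=notebook
$ /home/aut/spark-2.1.1-bin-hadoop2.7/bin/pyspark --py-files /path/to/aut.zip --jars /path/to/aut-fatjar.jar

Once the notebook opens, sc and spark should be available as usual.)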

@ruebot (Member) commented Aug 20, 2018

@SamFritz can you add this to our August 22nd agenda?

@ianmilligan1 (Member) commented Aug 20, 2018

👍 Yep, I think it's worth talking about notebooks - we've got them up and running for PySpark, but as an intermediate step it might be worth documenting the use of notebooks like this. I'll find some time to follow these steps before our meeting.

@ianmilligan1 (Member) commented Aug 20, 2018

Thanks @Natkeeran - tested, and you can see how it's running on my end here:

[Screenshot taken 2018-08-20 at 9:20 am showing Zeppelin running]

A few quick notes:

  • I used the most recent build (i.e. wget http://www-eu.apache.org/dist/zeppelin/zeppelin-0.8.0/zeppelin-0.8.0-bin-all.tgz)
  • On Mac, I didn't have to tinker with anything in /opt but rather just edited the zeppelin-env.sh file in /conf

It's pretty straightforward except for editing the configuration file. We'll have to give some thought to how best to document that.

@ianmilligan1 (Member) commented Aug 20, 2018

Oh, and scripts need to be compressed onto one line (the "paragraph" style we use with :paste in shell doesn't work, out of the box at least - I haven't dug deep to see if there is a workaround).

@Natkeeran (Author) commented Aug 20, 2018

@ianmilligan1 Looks good. You can still use the paragraph style, but you need to wrap the whole expression in brackets (), as below - inside the parentheses, the Scala parser treats the chained calls as a single expression, so the line breaks are fine.

val domains_list = (RecordLoader.loadArchives("/home/aut/data/*.gz", sc)
.keepValidPages()
.map(r => ExtractDomain(r.getUrl))
.countItems())

ruebot added a commit that referenced this issue Aug 20, 2019

Add binary extraction DataFrames to PySpark.
- Address #190
- Address #259
- Address #302
- Address #303
- Address #304
- Address #305
- Address #306
- Address #307
@ruebot (Member) commented Aug 20, 2019

This will be resolved with updates to the user documentation (0.18.0 should be out next week) and Using AUT with PySpark (in progress).
