
Adding Apache Zeppelin to docker and documentation #259

Closed
Natkeeran opened this issue Aug 13, 2018 · 8 comments

Comments

@Natkeeran

commented Aug 13, 2018

Is your feature request related to a problem? Please describe.
AUT is a tool targeted at researchers who may have limited technical expertise, and command-line interaction can be a barrier for potential users. Many users in this domain are familiar with Python notebooks; Apache Zeppelin provides similar workspaces and an interactive mode of analysis.

Describe the solution you'd like
Installing Apache Zeppelin for experimental use is quite straightforward. Adding the instructions to the documentation and including Zeppelin in the AUT Docker image would be useful. Down the road, an Apache Zeppelin sandbox would be an easier way for users to evaluate AUT.

Describe alternatives you've considered

Additional context

Installing Apache Zeppelin (generally under /opt/zeppelin):

$ wget http://mirrors.gigenet.com/apache/zeppelin/zeppelin-0.6.2/zeppelin-0.6.2-bin-all.tgz
$ sudo tar -zxf zeppelin-0.6.2-bin-all.tgz
$ cd zeppelin-0.6.2-bin-all
$ sudo bin/zeppelin-daemon.sh start
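
To confirm the daemon came up, zeppelin-daemon.sh also has a status subcommand, and Zeppelin writes its logs under the logs/ directory (log file names vary by user and host):

$ sudo bin/zeppelin-daemon.sh status
$ tail logs/zeppelin-*.log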

Copy the default template:

$ cd /opt/zeppelin/conf
$ cp zeppelin-env.sh.template zeppelin-env.sh

Edit zeppelin-env.sh and provide the following configuration parameters:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export SPARK_HOME=/home/aut/spark-2.1.1-bin-hadoop2.7
export SPARK_SUBMIT_OPTIONS="--packages io.archivesunleashed:aut:0.16.0"

You can add additional Spark packages that may be useful for your analysis to SPARK_SUBMIT_OPTIONS, as sketched below.
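
For example, packages are passed to --packages as a comma-separated list of Maven coordinates; the graphframes coordinate below is purely illustrative, not part of the AUT setup:

export SPARK_SUBMIT_OPTIONS="--packages io.archivesunleashed:aut:0.16.0,graphframes:graphframes:0.5.0-spark2.1-s_2.11"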

Restart Zeppelin:

$ sudo bin/zeppelin-daemon.sh restart

Go to http://localhost:8080/.

@ruebot

Member

commented Aug 13, 2018

I'm almost certain Jupyter Notebook works fine on HEAD. Basic PySpark and DataFrame support was added back around April.

I think @ianmilligan1 has some basic documentation around this hanging around. We haven't published any of this on the site because we're waiting to get to our next release; once that happens, it will be. But in the interim, all we really have is outlined in closed and open issues: 1.0.0 Release of AUT, DataFrames and PySpark.
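
For reference, the usual way to put PySpark inside a Jupyter Notebook is to point the PySpark driver at Jupyter before launching; this is a minimal sketch, and the AUT artifact names are assumptions rather than a documented invocation:

# Use Jupyter Notebook as the PySpark driver front end
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook

# Launch PySpark with the AUT Python bindings and fat jar (file names are assumptions)
$ ./bin/pyspark --py-files aut.zip --jars /path/to/aut-fatjar.jar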

@ruebot

Member

commented Aug 20, 2018

@SamFritz can you add this to our August 22nd agenda?

@ianmilligan1

Member

commented Aug 20, 2018

👍 Yep, I think it's worth talking about notebooks - we've got them up and running for PySpark, but as an intermediate step it might be worth documenting the use of notebooks like this. I'll find some time to follow these steps before our meeting.

@ianmilligan1

Member

commented Aug 20, 2018

Thanks @Natkeeran - tested, and you can see how it's running on my end here:

[Screenshot: Zeppelin notebook running an AUT script, taken 2018-08-20]

A few quick notes:

  • I used the most recent build (i.e. wget http://www-eu.apache.org/dist/zeppelin/zeppelin-0.8.0/zeppelin-0.8.0-bin-all.tgz)
  • On Mac, I didn't have to tinker with anything in /opt; I just edited the zeppelin-env.sh file in /conf

It's pretty straightforward except for editing the configuration file. We'll have to give some thought to how best to document that.
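
For the Docker side of the request, here's a minimal sketch of what bundling Zeppelin into the aut image could look like; the base image tag, Zeppelin version, and paths are all assumptions:

# Hypothetical Dockerfile sketch - base image and paths are assumptions
FROM archivesunleashed/docker-aut:latest

# Fetch and unpack Zeppelin into /opt/zeppelin
RUN wget -q https://archive.apache.org/dist/zeppelin/zeppelin-0.8.0/zeppelin-0.8.0-bin-all.tgz \
    && tar -xzf zeppelin-0.8.0-bin-all.tgz -C /opt \
    && mv /opt/zeppelin-0.8.0-bin-all /opt/zeppelin \
    && rm zeppelin-0.8.0-bin-all.tgz

# Point Zeppelin at the JVM, Spark, and the AUT package (paths are assumptions)
RUN echo 'export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64' >> /opt/zeppelin/conf/zeppelin-env.sh \
    && echo 'export SPARK_HOME=/spark' >> /opt/zeppelin/conf/zeppelin-env.sh \
    && echo 'export SPARK_SUBMIT_OPTIONS="--packages io.archivesunleashed:aut:0.16.0"' >> /opt/zeppelin/conf/zeppelin-env.sh

# Zeppelin's web UI defaults to port 8080; zeppelin.sh runs in the foreground,
# which suits a container entrypoint
EXPOSE 8080
CMD ["/opt/zeppelin/bin/zeppelin.sh"]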

@ianmilligan1

Member

commented Aug 20, 2018

Oh, and scripts need to be compressed onto one line (the "paragraph" style we use with :paste in the shell doesn't work out of the box, at least - I haven't dug deep to see if there is a workaround).

@Natkeeran

Author

commented Aug 20, 2018

@ianmilligan1 Looks good. You can still use the paragraph style, but you would need to wrap the expression in parentheses, as below.

val domains_list = (RecordLoader.loadArchives("/home/aut/data/*.gz", sc)
.keepValidPages()
.map(r => ExtractDomain(r.getUrl))
.countItems())

ruebot added a commit that referenced this issue Aug 20, 2019

Add binary extraction DataFrames to PySpark.
- Address #190
- Address #259
- Address #302
- Address #303
- Address #304
- Address #305
- Address #306
- Address #307
@ruebot

Member

commented Aug 20, 2019

This will be resolved with updates to the user documentation (0.18.0 should be out next week), and Using AUT with PySpark (In progress).

ianmilligan1 added a commit that referenced this issue Aug 21, 2019

Add binary extraction DataFrames to PySpark. (#350)
* Add binary extraction DataFrames to PySpark.
- Address #190
- Address #259
- Address #302
- Address #303
- Address #304
- Address #305
- Address #306
- Address #307
- Resolves #350 
- Update README
@ruebot

Member

commented Aug 21, 2019

Docs have been reviewed. Closing this now.

@ruebot ruebot closed this Aug 21, 2019
