In this blog post, we explain how to set up Jupyter as a browser-based frontend to easily query and visualize your data.

Jupyter is a web application that allows you to create and share documents that contain live code, equations, visualizations, and explanatory text; see Project Jupyter.

This tutorial consists of two parts.

You are currently reading part one, which explains the basic steps to set up and configure Jupyter.

It is essential to complete part one before continuing with part two!

Part two demonstrates how to run queries in Python and how to visualize data using matplotlib.

Prerequisites

Before starting this tutorial, please make sure your cluster is up and running.

You should have started the Spark shell at least once and run some queries to test its functionality.

To complete part two of this tutorial, you need sample data, which can be downloaded here:

Dropbox – tpch_data.zip

This file contains TPC-H sample data at scale factor 0.001.

Please download the file and extract its content to your HDFS.

Alternatively, you may generate the sample data on your own by downloading and compiling DBGEN:

http://www.tpc.org/tpch/tools_download/dbgen-download-request.asp

Please do not use the Ambari web interface for uploading files, because it may corrupt them:

https://issues.apache.org/jira/browse/AMBARI-13773

Installation

To get started, we need to install several packages that should come bundled with your Linux distribution.

Please run the following commands on a RedHat-based machine:


sudo yum install python-pip
sudo yum install python-matplotlib
sudo yum install gcc-c++
sudo pip install --upgrade pip
sudo pip install jupyter

You may also install Jupyter on a jumpbox outside the cluster, for example, on an Ubuntu-based system.
In that case, the first three commands are slightly different:


sudo apt-get install python-pip
sudo apt-get install python-matplotlib
sudo apt-get install g++
sudo pip install --upgrade pip
sudo pip install jupyter

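After the installation, it may be worth checking that the Python packages can actually be imported. The following is a minimal sketch, not part of the original setup: the helper name missing_modules is made up for illustration, and it assumes that "notebook" is the module behind the "jupyter notebook" command.

```python
import importlib


def missing_modules(names):
    """Return the subset of `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Assumption: these are the modules installed by the commands above.
    missing = missing_modules(["matplotlib", "notebook"])
    print("missing: " + ", ".join(missing) if missing else "all modules found")
```

Run it with the same Python interpreter that pip installed the packages for; an empty result suggests the packages are in place.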

Environment

Next, we need to set some environment variables to inform Jupyter about our Spark and Python settings.

Please adjust the paths and version number below according to your local environment, then either run these commands on the shell as the “vora” user, or put them in your “.profile”, to have them loaded every time you log in:


export PYTHONPATH=/home/vora/vora/python:$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip
export ADD_JARS=/home/vora/vora/lib/spark-sap-datasources-<version>-assembly.jar
export SPARK_CLASSPATH=$ADD_JARS
export PYSPARK_SUBMIT_ARGS="--master yarn-client --jars $ADD_JARS pyspark-shell"

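A typo in one of these variables tends to show up only later, when the first notebook fails to start Spark. The sketch below is one possible sanity check, under stated assumptions: the helper find_problems is hypothetical, PYTHONPATH is assumed to be separated by the platform's path separator, and ADD_JARS is assumed to be a comma-separated list of jar files.

```python
import os

# The four variables exported above.
REQUIRED_VARIABLES = ("PYTHONPATH", "ADD_JARS", "SPARK_CLASSPATH", "PYSPARK_SUBMIT_ARGS")


def find_problems(env):
    """Return a list of human-readable problems found in the mapping `env`."""
    problems = []
    for name in REQUIRED_VARIABLES:
        if not env.get(name):
            problems.append("{} is not set".format(name))
    # Every PYTHONPATH entry and every jar in ADD_JARS should exist on disk.
    paths = env.get("PYTHONPATH", "").split(os.pathsep)
    paths += env.get("ADD_JARS", "").split(",")
    for path in paths:
        if path and not os.path.exists(path):
            problems.append("path does not exist: {}".format(path))
    return problems


if __name__ == "__main__":
    for problem in find_problems(os.environ):
        print(problem)
```

Running the script in the shell where the variables were exported prints one line per problem and nothing if everything looks consistent.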

Configure Jupyter

Please run this command as the user “vora” to generate the initial configuration for Jupyter:


jupyter notebook --generate-config

Now, open an editor and edit the file “~/.jupyter/jupyter_notebook_config.py”.

Since we are running on a remote machine with no window manager, we configure Jupyter not to open a web browser on startup.

Please uncomment the following line and make sure its value is False:


# c.NotebookApp.open_browser = False

Uncomment means removing the pound sign at the beginning of the line.

To be able to access Jupyter remotely, we need to uncomment the following line as well and make sure its value is '*':


# c.NotebookApp.ip = '*'

Notice: This will give everyone access to the Jupyter web interface.

In a production environment, you might want to set up access control.

Please refer to this guide on how to secure your Jupyter installation:

Securing a notebook server

After applying the above changes to the config file, please save your changes and close the editor.
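
As an alternative to editing the file by hand, the two changes could be applied by a small script. This is only a sketch: set_option is a hypothetical helper, not a Jupyter API, and it assumes each option appears at most once in the file, either commented out or active.

```python
import re


def set_option(config_text, option, value):
    """Uncomment `option` in the config text and set it to `value`.

    If no commented or active assignment is found, a new one is appended.
    """
    pattern = re.compile(
        r"^[ \t]*#?[ \t]*" + re.escape(option) + r"[ \t]*=.*$", re.MULTILINE
    )
    line = "{} = {!r}".format(option, value)
    if pattern.search(config_text):
        # Replace only the first match; a function avoids escape handling.
        return pattern.sub(lambda match: line, config_text, count=1)
    return config_text.rstrip("\n") + "\n" + line + "\n"


def update_config(path):
    """Apply both settings from this section to the file at `path`."""
    with open(path) as handle:
        text = handle.read()
    text = set_option(text, "c.NotebookApp.open_browser", False)
    text = set_option(text, "c.NotebookApp.ip", "*")
    with open(path, "w") as handle:
        handle.write(text)
```

Calling update_config with the expanded path of the config file rewrites it in place, so keeping a backup copy first is advisable.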

Notice:

Usually, cloud providers and IT departments are very restrictive and may block access to Jupyter’s TCP port (default: 8888).

Please make sure to include a rule in the firewall configuration allowing access to the port on the machine running Jupyter.

Consult the provider’s documentation or your IT department for details.
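
Once the server from the next section is running, a quick TCP connection test can show whether the port is reachable. The sketch below uses only the Python standard library; port_is_open is a hypothetical helper name, and the default host "localhost" is an assumption that only makes sense when the script runs on the Jupyter machine itself.

```python
import socket
import sys


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        connection = socket.create_connection((host, port), timeout=timeout)
    except (socket.error, socket.timeout):
        return False
    connection.close()
    return True


if __name__ == "__main__":
    # Pass the Jupyter host as the first argument when testing from a client.
    host = sys.argv[1] if len(sys.argv) > 1 else "localhost"
    print(port_is_open(host, 8888))
```

If the check succeeds on the Jupyter machine but fails from the client machine, a firewall rule between the two is a likely cause.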

Running Jupyter

To run Jupyter, first create an empty folder where you want to store your notebooks, and change into that folder.

Then start Jupyter as the user “vora”, e.g.:


mkdir notebooks
cd notebooks
jupyter notebook

This will start a Jupyter notebook server, listening on port 8888 for connections.

The console output will be similar to this:


[I 09:39:29.176 NotebookApp] Writing notebook server cookie secret to /run/user/1000/jupyter/notebook_cookie_secret
[W 09:39:29.200 NotebookApp] WARNING: The notebook server is listening on all IP addresses and not using encryption. This is not recommended.
[W 09:39:29.200 NotebookApp] WARNING: The notebook server is listening on all IP addresses and not using authentication. This is highly insecure and not recommended.
[I 09:39:29.204 NotebookApp] Serving notebooks from local directory: /home/d062985/notebooks
[I 09:39:29.204 NotebookApp] 0 active kernels
[I 09:39:29.204 NotebookApp] The IPython Notebook is running at: http://[all ip addresses on your system]:8888/
[I 09:39:29.204 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).

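The start-up steps above could also be wrapped in a small script, for example to start the server from a cron job or an init script. This is a sketch under assumptions: the function names are made up for illustration, and the --port and --no-browser command-line options are used here instead of relying on the config file.

```python
import os
import subprocess


def notebook_command(port=8888, open_browser=False):
    """Build the argument list for starting the notebook server."""
    command = ["jupyter", "notebook", "--port={}".format(port)]
    if not open_browser:
        command.append("--no-browser")
    return command


def start_notebook(folder, port=8888):
    """Create `folder` if needed and run the server inside it.

    The call blocks until the server is stopped and returns its exit code.
    """
    if not os.path.isdir(folder):
        os.makedirs(folder)
    return subprocess.call(notebook_command(port), cwd=folder)
```

Calling start_notebook("notebooks") corresponds to the three shell commands shown earlier; settings from the config file, such as the listening address, still apply.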

Now we can open a web browser on another machine and navigate to the URL of the host running Jupyter, e.g. http://jumpbox.yourcluster.internal:8888/

You should see a page like this:

/wp-content/uploads/2016/01/0_841932.png

By clicking New, you can start a new notebook that is waiting for your input:

/wp-content/uploads/2016/01/1_841933.png

After clicking, the empty notebook will open up:

/wp-content/uploads/2016/01/2_841976.png

Now we can start submitting queries by entering a query into a cell and clicking the play button at the top.
This executes the snippet in the background and returns the results to the web page.

Submitting queries and plotting data

The final part of this tutorial will take place in Jupyter.

Please download the attached Jupyter Notebook “PythonBindings.ipynb.zip”, unzip it, and copy it to the notebook folder on the machine running Jupyter.

Then, open the file in the Jupyter web interface in your web browser.
