This tutorial is offered as a quick start guide to installing a 3-node Hadoop cluster on the cloud. Most configurations are kept close to default, so this guide is best suited to development and testing environments. It is not recommended for production environments.
Due to some proxy constraints on my cluster, I will note workarounds or additional steps related to clusters behind a proxy with the label ‘Proxies Only’.
In code examples, > indicates a command run as a regular user and # indicates a command run as root.
- Nodes: 4 CPU / 16G RAM x 3
- OS: SUSE 12 SP01
- HortonWorks Ambari 2.6.0
- Setting up the cluster
- Ambari Installation
- HDP installation
1. Setting up the cluster
I will be using SAP’s cloud infrastructure in order to provision the servers where Hadoop will be installed.
Ambari will run on nodes with as little as 8 GB RAM, but more is recommended especially if the cluster will be used for testing.
Generally, three 4 CPU / 16 GB nodes are recommended.
All nodes are running SUSE 12 SP01. HDP 2.6.2 can support SUSE (64-bit) 11.3, 11.4, 12.1, and 12.2. Make sure all nodes are running the same operating system and patch level.
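One way to compare operating system and patch level across nodes is to read /etc/os-release on each of them. The sketch below assumes that file exists and defines ID and VERSION_ID in the usual shell-compatible format; the function name os_release_summary is only illustrative.

```shell
# Print "<ID> <VERSION_ID>" from an os-release file (default: /etc/os-release).
# Assumption: the file uses the usual shell-compatible KEY=value format.
os_release_summary() {
    file=${1:-/etc/os-release}
    # Source the file in a subshell so its variables do not leak into our shell.
    ( . "$file" && echo "$ID $VERSION_ID" )
}
```

Running os_release_summary on each node should print the same line everywhere; any difference points to a node that needs patching before the install.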
Connecting through PuTTY
We’ll be using PuTTY to connect to our nodes. You only need the putty.exe client, plus puttygen.exe to convert our private key to a format PuTTY can read.
From your local user folder, create a folder named ‘.ssh’. On Windows this will likely have to be done through the command line, as Explorer doesn’t like to create folders that begin with a period.
The new .ssh folder is where we’ll store our private key. Once saved, launch puttygen.exe.
Click on the Load button and select your saved private key (if your key isn’t listed, make sure All Files is selected in the loading screen.)
Once loaded, select Save Private Key, select Yes when prompted to save the key without a passphrase, and save it to a file in your .ssh directory.
This new file is saved as a .ppk file and is what we’ll use with PuTTY to connect to our servers.
Once saved, launch putty.exe. On the main page specify the hostname for one of the nodes.
Under Connection > Data specify your username under Auto-login username.
Under Connection > SSH > Auth make sure:
- Allow agent forwarding is checked
- Under Private Key for Authentication, browse to your puttygen-created private key file
Add any other customization you want (appearance, select behavior, etc.) then navigate back to the Session page, give a name to your profile and click Save.
Once the profile is saved, click Open to connect to the node.
Once the connection is successful, repeat the above process for the other two nodes.
Now that we can connect via SSH to all three nodes, we will do a quick update and create our administration user.
My instance users are pre-configured with passwordless sudo access for the sysadmin group on the server.
This means issuing a sudo su command will allow you to run as root. Passwordless sudo access is required for at least one user to allow HDP to install services on the cluster.
On all three nodes we’ll first do an update to make sure we’re running the latest version:
> sudo su
# zypper update -t patch
Since we already have a sysadmin group with passwordless sudo access, we only need to create a new user and make sure it is added to the sysadmin group (as well as any other groups you may need). For your platform, make sure the created user has passwordless sudo access. I’m naming my user cadmin (cluster admin):
# /usr/sbin/useradd -m -g users -G sysadmin cadmin
Create this user on all three nodes in the cluster.
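Before moving on, it can be worth confirming on each node that the new user really landed in the sudo-enabled group. A minimal sketch, assuming the user and group names used above (cadmin, sysadmin); user_in_group is a hypothetical helper, not part of any tool used in this guide.

```shell
# Succeed if USER is a member of GROUP (primary or supplementary).
# Usage: user_in_group USER GROUP
user_in_group() {
    # id -nG prints the names of every group the user belongs to.
    group_list=$(id -nG "$1" 2>/dev/null) || return 1
    for g in $group_list; do
        [ "$g" = "$2" ] && return 0
    done
    return 1
}
```

For example, user_in_group cadmin sysadmin && echo "cadmin is in sysadmin". To check the sudo side as well, running sudo -n true as cadmin exits with an error instead of prompting if a password would be required.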
We’ll use this user to connect our servers when installing our Ambari cluster.
From your primary node, we’ll create an RSA key for the new cadmin user to allow key-based SSH authentication.
Log in as the cadmin user and run ssh-keygen to create this RSA key:
> sudo su cadmin
> ssh-keygen -t rsa
[ENTER] [ENTER] [ENTER]
Two files are created in the user’s .ssh folder (/home/cadmin/.ssh/):
id_rsa: this is the private key. We’ll need it during the Ambari installation, so save it to notepad where we can access it quickly later.
id_rsa.pub: this is the public key. We’ll need to add it to an authorized_keys file on all nodes in order for cadmin to connect using the private key.
The authorized_keys file is located in the user’s .ssh folder and is read whenever that user is trying to connect to the server. As such we’ll first copy this to the file on our main node:
> cat ~/.ssh/id_rsa.pub > ~/.ssh/authorized_keys
Save this public key to notepad as well, since we will add it to the other two nodes under the cadmin home directory.
Connect to your other nodes via PuTTY and run:
# sudo su cadmin
> mkdir -p ~/.ssh && chmod 700 ~/.ssh
> echo "XXXXXX" > ~/.ssh/authorized_keys
Where “XXXXXX” is the cadmin public key.
You can run cat on the file to make sure it was written correctly:
> cat ~/.ssh/authorized_keys
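Two details are easy to trip over here: redirecting with > replaces any keys already in authorized_keys, and sshd (with its default StrictModes setting) ignores the file when its permissions are too open. The sketch below appends a key only if it is not already present and tightens permissions; add_authorized_key is a hypothetical helper, and the optional second argument exists only so the function can be tried against a scratch directory.

```shell
# Append a public key to authorized_keys without clobbering existing entries.
# Usage: add_authorized_key "PUBLIC-KEY-LINE" [SSH_DIR]
add_authorized_key() {
    pubkey=$1
    ssh_dir=${2:-$HOME/.ssh}
    auth_file=$ssh_dir/authorized_keys
    # Directory must be private to the user, file readable/writable by owner only.
    mkdir -p "$ssh_dir" && chmod 700 "$ssh_dir" || return 1
    touch "$auth_file" && chmod 600 "$auth_file" || return 1
    # -x matches whole lines, -F treats the key as a literal string.
    if ! grep -qxF -- "$pubkey" "$auth_file"; then
        printf '%s\n' "$pubkey" >> "$auth_file"
    fi
}
```

On the primary node this could be called as add_authorized_key "$(cat ~/.ssh/id_rsa.pub)"; on the other nodes, pass the pasted public key string instead.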
Now we can test to make sure cadmin is able to connect.
From your primary node (where you first generated the keypair) run:
> ssh cadmin@<NODE-2>
Where NODE-2 is the hostname of one of your worker nodes. You may get a prompt regarding the authenticity of the host; answer ‘yes’ and you should be connected.
If the key was not accepted or it prompts you for a password, double-check that the public key is listed in the authorized_keys file and try troubleshooting via this link.
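To repeat the same test against every node in one go, something like the following can help. It is only a sketch: the host names are whatever you pass in, check_ssh_nodes is a hypothetical helper, and the SSH_CMD variable is an invented override that exists purely so the loop can be exercised without a real cluster.

```shell
# Try a key-based login to each host and report the result.
# Usage: check_ssh_nodes USER HOST...
check_ssh_nodes() {
    user=$1; shift
    failed=0
    for host in "$@"; do
        # BatchMode=yes makes ssh fail instead of prompting for a password,
        # so a key that is not accepted shows up as FAILED.
        if ${SSH_CMD:-ssh} -o BatchMode=yes -o ConnectTimeout=5 "$user@$host" true; then
            echo "$host: OK"
        else
            echo "$host: FAILED"
            failed=1
        fi
    done
    return $failed
}
```

For example, check_ssh_nodes cadmin NODE-2 NODE-3, with your own host names substituted. Because BatchMode cannot answer the host authenticity prompt, a host you have never connected to will also be reported as FAILED until you have accepted its key once interactively.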
3. Ambari Installation
Assuming we now have a working cadmin user, in this section we’ll add the Ambari repository and install the cluster manager.
Ambari manager will only be installed on our primary (master) node, so the below steps only need to be applied once:
First, see the HortonWorks Ambari Repositories page and copy the “Repo File” link for your flavor of OS. In my case, for SLES 12.1, my link is:
http://public-repo-1.hortonworks.com/ambari/sles12/2.x/updates/2.6.0.0/ambari.repo
Connect via SSH to your primary node, if you aren’t already, and issue the following:
> sudo su
# cd /etc/zypp/repos.d
# wget http://public-repo-1.hortonworks.com/ambari/sles12/2.x/updates/2.6.0.0/ambari.repo -O ambari.repo
# zypper ref
This will add the Ambari repository to the zypper package manager and refresh the repository list. You should see a line after the refresh pulling packages from the ‘ambari Version – ambari-2.6.0.0’ repository.
Now we’ll install ambari server:
# zypper install ambari-server
Once installed, run the below to set up Ambari (as root):
# ambari-server setup
Defaults can be accepted for all prompts.
In some instances, ambari-server setup will be unable to get JDK 1.8 or the JCE policy files from the public internet. The easiest workaround for this is to kill the setup process (Ctrl-C) and manually use curl or wget to download and save the files to their respective directories.
The setup output will hang after a prompt similar to:
Downloading JDK from http://public-repo-1.hortonworks.com/ARTIFACTS/jdk-8u112-linux-x64.tar.gz to /var/lib/ambari-server/resources/jdk-8u112-linux-x64.tar.gz
In this case, after killing the process a simple wget command will use the correct OS proxy to obtain the file:
# wget http://public-repo-1.hortonworks.com/ARTIFACTS/jdk-8u112-linux-x64.tar.gz -O /var/lib/ambari-server/resources/jdk-8u112-linux-x64.tar.gz
And again for the JCE Policy file:
Downloading JCE Policy archive from http://public-repo-1.hortonworks.com/ARTIFACTS/jce_policy-8.zip to /var/lib/ambari-server/resources/jce_policy-8.zip
# wget http://public-repo-1.hortonworks.com/ARTIFACTS/jce_policy-8.zip -O /var/lib/ambari-server/resources/jce_policy-8.zip
Finally re-run the setup command and both files should be picked up.
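Since an interrupted or proxied download can leave an empty or truncated file behind, a quick sanity check before re-running setup may save a round trip. The sketch below only checks that each file is non-empty and, for .gz archives, that gzip can read it end to end; check_resources is a hypothetical helper and does not replace whatever validation ambari-server setup performs itself.

```shell
# Report files in DIR that are missing, empty, or (for *.gz) not valid gzip.
# Usage: check_resources DIR FILE...
check_resources() {
    dir=$1; shift
    bad=0
    for name in "$@"; do
        path=$dir/$name
        if [ ! -s "$path" ]; then
            echo "missing or empty: $path"
            bad=1
        else
            case $name in
                *.gz)
                    # gzip -t tests archive integrity without extracting it.
                    gzip -t "$path" 2>/dev/null || { echo "corrupt: $path"; bad=1; }
                    ;;
            esac
        fi
    done
    return $bad
}
```

For the two files above, that would be: check_resources /var/lib/ambari-server/resources jdk-8u112-linux-x64.tar.gz jce_policy-8.zip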
Once setup completes, restart ambari server and in the next section we will install Hadoop services:
# /usr/sbin/ambari-server restart
4. HDP Installation
Now that we have the Ambari manager running, we can access the UI from a web browser via port 8080 on the primary node (http://<PRIMARY-NODE>:8080).
Once the page loads, you can log in with the default credentials (admin / admin).
Once logged in, you can access the Users link on the left to change the admin password if desired.
Otherwise, click on Launch Install Wizard to begin creating your Hadoop cluster.
Enter a name for your cluster and click Next.
Make sure Use Public Repository is selected. If it is not, this may be due to a proxy issue; see below.
By default, Ambari won’t be able to read the public repository until we update the proxy.
Close the UI and stop the Ambari server:
> sudo ambari-server stop
We must add the proxy host and port to AMBARI_JVM_ARGS in /var/lib/ambari-server/ambari-env.sh.
To confirm your OS-level proxy, you can issue:
> echo $http_proxy
Which should provide the host and port to enter under AMBARI_JVM_ARGS.
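The host and port are typically passed to a JVM through the standard Java system properties http.proxyHost and http.proxyPort. Assuming those are the options wanted under AMBARI_JVM_ARGS, this sketch derives them from a proxy URL such as the value of $http_proxy; proxy_jvm_args is a hypothetical helper, and any credentials embedded in the URL are simply dropped.

```shell
# Turn a proxy URL into JVM proxy options.
# Usage: proxy_jvm_args [URL]   (defaults to $http_proxy)
proxy_jvm_args() {
    url=${1:-${http_proxy:-}}
    hostport=${url#*://}       # strip the scheme, if any
    hostport=${hostport%%/*}   # strip any trailing path
    hostport=${hostport##*@}   # strip user:password@, if present
    host=${hostport%:*}
    port=${hostport##*:}
    # Without a colon both expansions leave the string unchanged: no port given.
    if [ -z "$host" ] || [ "$host" = "$hostport" ]; then
        echo "no host:port found in '$url'" >&2
        return 1
    fi
    echo "-Dhttp.proxyHost=$host -Dhttp.proxyPort=$port"
}
```

With http_proxy set to http://proxy.example.com:8080 (a made-up address), proxy_jvm_args prints -Dhttp.proxyHost=proxy.example.com -Dhttp.proxyPort=8080, which can then be pasted into the AMBARI_JVM_ARGS line by hand.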
For more advanced proxy configurations or proxies that require authentication, see the HortonWorks documentation.
Once added, save the file and restart the ambari server:
> sudo ambari-server start
Under Select Version select your HDP version, in my case HDP 2.6, and click Next.
Under Install Options we need to enter the fully-qualified domain names of all three of our hosts as well as connectivity information (remember that cadmin private key I told you to save?).
Add all three fully-qualified domain names to the Target Hosts text box and copy/paste the cadmin private key under Host Registration Information. Make sure to update the user from root to cadmin as well.
Then click Register and Confirm to continue.
At this point, Ambari will connect to and provision the hosts in the cluster. If any errors occur, click the ‘Failed’ status to view the install log and troubleshoot further via the web.
In my case, registration failed with the error: <host> failed due to EOF occurred in violation of protocol (_ssl.c:661)
From a web search, I was able to fix the issue by adding a setting under the [security] section of /etc/ambari-agent/conf/ambari-agent.ini on all nodes.
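The setting most commonly cited for this error is force_https_protocol=PROTOCOL_TLSv1_2, which makes the agent use TLS 1.2 when talking to the server. Treat that key and value as an assumption here, not as a confirmed fix for every setup. The sketch below sets a key inside a given INI section without touching the rest of the file; set_ini_key is a hypothetical helper.

```shell
# Set KEY=VALUE inside [SECTION] of an INI file, replacing an existing KEY
# line or adding one at the end of the section (or a new section at the end).
# Usage: set_ini_key FILE SECTION KEY VALUE
set_ini_key() {
    file=$1 section=$2 key=$3 value=$4
    tmp=$(mktemp) || return 1
    awk -v section="[$section]" -v key="$key" -v value="$value" '
        # Write the key once, when leaving the target section without having seen it.
        function emit() { if (in_section && !done) { print key "=" value; done = 1 } }
        /^\[/ { emit(); in_section = ($0 == section); if (in_section) seen = 1 }
        in_section && ($0 ~ ("^" key "[ \t]*=")) {
            if (!done) { print key "=" value; done = 1 }
            next
        }
        { print }
        END {
            emit()
            if (!seen) { print section; print key "=" value }
        }
    ' "$file" > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return $status
}
```

As root on each node this could be run as set_ini_key /etc/ambari-agent/conf/ambari-agent.ini security force_https_protocol PROTOCOL_TLSv1_2, followed by ambari-agent restart so the agent picks up the change.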
Once all nodes succeed, you can see the results of all health checks and address any other warnings that may have been raised.
When finished, click Next.
This is where we choose the services for our Hadoop installation. Services chosen will differ depending on your needs.
HDFS and YARN are required for almost all Hadoop installations. In my case I am using this for Vora and Spark testing, so I’ve selected:
HDFS, YARN + MR2, Tez, Hive, ZooKeeper, Pig, Spark, and Ambari Metrics.
Any prerequisites that are needed for selected services will be automatically added.
Next, we can assign services to their respective nodes. In most situations these can remain the defaults.
On the next page we can assign slaves and clients to our nodes. Generally, it is a good idea to assign more rather than fewer. I assign Clients, Sparkservers, NodeManager, and Datanodes to all nodes.
Next we have to configure all the services. There will be errors that need addressing, indicated by red circles.
Most of these are easily fixed by taking out the directories starting with /home, with the exception of Hive.
Hive requires we set up a database. For this we’ll use PostgreSQL.
SSH to your master node and log in as root:
> sudo su
# zypper install postgresql-jdbc
# ls /usr/share/java/postgresql-jdbc.jar
# chmod 644 /usr/share/java/postgresql-jdbc.jar
# ambari-server setup --jdbc-db=postgres --jdbc-driver=/usr/share/java/postgresql-jdbc.jar
Now we need to log in to postgres, create our database and user / password. In this case we’re using ‘hive’ for all three:
# sudo su postgres
> psql
postgres=# create database hive;
postgres=# create user hive with password 'hive';
postgres=# grant all privileges on database hive to hive;
postgres=# \q
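If you would rather script this step than type it into an interactive session, the same three statements can be generated and piped into psql. This is a sketch only: hive_db_sql is a hypothetical helper, it assumes the database and user names are plain identifiers that need no quoting, and a password equal to the user name is only reasonable on a throwaway test cluster.

```shell
# Print the SQL that creates a database, a user, and grants access.
# Usage: hive_db_sql DBNAME USER PASSWORD
hive_db_sql() {
    db=$1 user=$2
    # Double any single quotes so the password stays a valid SQL string literal.
    pw=$(printf '%s' "$3" | sed "s/'/''/g")
    printf 'create database %s;\n' "$db"
    printf "create user %s with password '%s';\n" "$user" "$pw"
    printf 'grant all privileges on database %s to %s;\n' "$db" "$user"
}
```

Run as the postgres user, hive_db_sql hive hive hive | psql feeds the statements to psql on standard input.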
Now we just need to back up and update the pg_hba configuration file:
# cp /var/lib/pgsql/data/pg_hba.conf /var/lib/pgsql/data/pg_hba.conf_backup
# vi /var/lib/pgsql/data/pg_hba.conf
Add hive to the list of users at the bottom of the file (so it reads hive,ambari,mapred).
Save and exit with :wq
Then restart postgres:
> sudo service postgresql restart
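The pg_hba.conf edit can also be done without opening an editor. The sketch below assumes the relevant lines contain the user list exactly as ambari,mapred surrounded by whitespace, which is how the text above describes it; add_pg_hba_user is a hypothetical helper, and it keeps a copy of the original next to the file with a _backup suffix.

```shell
# Prepend a user to the "ambari,mapred" user list in a pg_hba.conf-style file.
# Usage: add_pg_hba_user FILE NEWUSER
add_pg_hba_user() {
    file=$1 user=$2
    # Nothing to do (and no backup to overwrite) if the user is already listed.
    if grep -q "$user,ambari,mapred" "$file"; then
        return 0
    fi
    cp -p "$file" "${file}_backup" || return 1
    # Rewrite from the backup so the original file is never read while written.
    sed "s/\([[:space:]]\)ambari,mapred\([[:space:]]\)/\1$user,ambari,mapred\2/" \
        "${file}_backup" > "$file"
}
```

For example, as root: add_pg_hba_user /var/lib/pgsql/data/pg_hba.conf hive. PostgreSQL still needs the restart shown above before the new entry takes effect.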
Now, back to the cluster setup, select “Existing PostgreSQL Database” and make sure hive is set for the DB name, username, and password.
Make sure the Database URL also correctly reflects the node where we installed and configured postgresql and test the database connection.
Once successful, click Next and the deployment should begin.
Similar to when we registered the hosts, the logs for any failures can be viewed by clicking on the respective “Failed” status.
Possible errors are too numerous to cover here, but web searches or searches of the Hortonworks forums will most likely provide answers.
Once all deployments are successful, click Next to access the Ambari dashboard and view your services. Any alerts can also be addressed and service customization can be configured at this point.
Congratulations! Your Hadoop cluster should now be up and running.