
Hadoop and predictive analytics are among the most exciting technologies for businesses today, but both are often seen as having a steep learning curve. While both are complex, getting started is simple: the Hortonworks Sandbox provides a ready-made Hadoop environment, and SAP InfiniteInsight makes predictive analytics intuitive for both data scientists and business users. In just 3 easy steps, you can set up your own Hadoop environment and tackle real predictive use cases!

1. Install

First you’ll need to install 3 components:

1. VirtualBox for the virtualization environment: https://www.virtualbox.org/wiki/Downloads

2. Hortonworks Sandbox with HDP 2.2 image: http://hortonworks.com/products/hortonworks-sandbox/ [Go to the ‘Download & Install’ tab and select either Mac or Windows for VirtualBox]

3. SAP InfiniteInsight 7.0: http://bit.ly/1t77brW [Trial]

Once you’ve installed VirtualBox, open the Hortonworks Sandbox .ova file and VirtualBox will import it automatically. Hit ‘Start’ and you now have a fully functional Hadoop environment!

/wp-content/uploads/2015/01/1_630831.jpg

2. Connect

Next, we connect SAP InfiniteInsight to Hadoop through an ODBC driver. Download and install the driver here: http://hortonworks.com/hdp/addons/.

After installation, open your ODBC Administrator. Under the System DSN tab, “Sample Hortonworks Hive DSN” is now available.

  /wp-content/uploads/2015/01/2_630832.jpg

Configure it with the IP address shown on the startup screen of your Hadoop environment, and fill in the remaining fields as shown below.

/wp-content/uploads/2015/01/3_630833.jpg

/wp-content/uploads/2015/01/4_630834.jpg

Test the connection and you have now successfully added Hadoop as a data source for InfiniteInsight.

TIP: Open <ip address>:8888 in your browser to reach the Hadoop homepage, where you can access Hive, HDFS, and more.
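
If the ODBC test times out, it can help to confirm that the sandbox is reachable from your machine before changing any driver settings. The sketch below is a minimal check and not part of the original setup. The IP address is a placeholder, `port_is_open` and `check_sandbox` are hypothetical helper names, and the ports are assumptions: 8888 is the homepage port from the tip above, and 10000 is the usual HiveServer2 default, which may differ in your environment.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def check_sandbox(host, ports=(8888, 10000)):
    """Map each port to True/False depending on whether it accepts connections."""
    return {port: port_is_open(host, port) for port in ports}


if __name__ == "__main__":
    # Placeholder address: replace with the IP shown on the sandbox startup screen.
    sandbox_ip = "127.0.0.1"
    for port, is_open in check_sandbox(sandbox_ip).items():
        print(f"{sandbox_ip}:{port} -> {'open' if is_open else 'unreachable'}")
```

If the browser port answers but the Hive port does not, the problem is more likely in the VM's network or port-forwarding settings than in the DSN itself.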

3. Predict

Now that everything is set up, you’re ready to do predictive analytics! Open InfiniteInsight and choose ‘Create a Clustering Model’ to build a model on the sample tables in Hadoop. Set ‘Data Type’ to ‘Database’ and select “default”.sample_07, which lists various job titles with their total number of employees and salaries.
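
If you’d like to peek at the table through the same DSN before modeling, something like the following may work. It is a minimal sketch under a few assumptions: the third-party `pyodbc` package is installed, the DSN keeps the name shown in the ODBC Administrator, and `build_preview_query` and `preview_table` are hypothetical helper names introduced here for illustration.

```python
def build_preview_query(table, limit=5):
    """Build a HiveQL query that returns the first few rows of a table."""
    limit = int(limit)
    if limit <= 0:
        raise ValueError("limit must be positive")
    return f"SELECT * FROM {table} LIMIT {limit}"


def preview_table(dsn, table, limit=5):
    """Fetch a few rows through ODBC; needs the third-party pyodbc package."""
    import pyodbc  # imported lazily so the query helper works without it

    # Assumption: autocommit is enabled because Hive ODBC drivers generally
    # do not support transactions.
    conn = pyodbc.connect(f"DSN={dsn}", autocommit=True)
    try:
        cursor = conn.cursor()
        cursor.execute(build_preview_query(table, limit))
        columns = [col[0] for col in cursor.description]
        return columns, cursor.fetchall()
    finally:
        conn.close()
```

With the sandbox running, `preview_table("Sample Hortonworks Hive DSN", "default.sample_07")` should return the column names and the first five rows of the sample table.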

TIP: Check out this great tutorial for uploading your own datasets into Hadoop: http://hortonworks.com/hadoop-tutorial/loading-data-into-the-hortonworks-sandbox/

/wp-content/uploads/2015/01/5_630835.jpg

On the next screen, hit the ‘Analyze’ icon, then continue with ‘Next’ and ‘Generate’, leaving the default settings, and voilà, we’ve done it!

/wp-content/uploads/2015/01/6_630836.jpg
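
For readers new to clustering, here is a rough illustration of the general idea: grouping similar records together. This is plain k-means on a single numeric column, not InfiniteInsight’s own algorithm, whose internals this post does not cover. The function name `kmeans_1d` is hypothetical and the salary figures are invented toy values, not rows from sample_07.

```python
def kmeans_1d(values, k, iterations=20):
    """Plain k-means on a list of numbers; returns (centroids, labels)."""
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and len(values)")
    ordered = sorted(values)
    # Deterministic start: pick k evenly spaced points from the sorted values.
    if k == 1:
        centroids = [ordered[len(ordered) // 2]]
    else:
        centroids = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [v for v, label in zip(values, labels) if label == c]
            new_centroids.append(sum(members) / len(members) if members else centroids[c])
        if new_centroids == centroids:
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return centroids, labels


if __name__ == "__main__":
    # Invented toy salaries, for illustration only.
    salaries = [21000, 23000, 25000, 60000, 62000, 110000, 115000]
    centroids, labels = kmeans_1d(salaries, k=3)
    print(centroids)  # [23000.0, 61000.0, 112500.0]
    print(labels)     # [0, 0, 0, 1, 1, 2, 2]
```

Each record ends up labeled with the group whose center it sits closest to, which is the kind of segmentation a clustering model reports back.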

We’ve set up our Hadoop environment and performed a clustering analysis on the fly with SAP InfiniteInsight in 3 easy steps. Give it a spin and please leave any feedback below.


4 Comments


  1. Rudolf Wenzler

    Hi Victor,

    Sounds like a great approach to get hands-on experience. Unfortunately, I’m only able to access it via the web interface, not via the ODBC driver.

    I’ve tried “Hive Server 1” without authentication leading to the following message:

    Driver Version: V1.4.14.1014

    Running connectivity tests…

    Attempting connection
    Failed to establish connection
    SQLSTATE: HY000[Hortonworks][HiveODBC] (68) Error returned trying to set default as the initial database: ETIMEDOUT; Also tried quoting the database name `default` but the query failed with the following error: ETIMEDOUT

    TESTS COMPLETED WITH ERROR.

    “Hive Server 2” with user or user/password authentication, using user ‘hue’ and pwd ‘1111’ respectively, returns:

    Driver Version: V1.4.14.1014

    Running connectivity tests…

    Attempting connection
    Failed to establish connection
    SQLSTATE: HY000[Hortonworks][HiveODBC] (34) Error from Hive: ETIMEDOUT.

    TESTS COMPLETED WITH ERROR.

    Do you have any idea how the configuration must be changed? I’d really appreciate any hint on how to proceed.

    Regards,

    Rudolf

    1. Victor Lu Post author

      Hi Rudolf,

      Apologies for the delayed response. Is your Hadoop instance fully up and running? Are you able to view the sample datasets in the browser? Feel free to shoot me an email.

      Best,

      Victor

      1. Rudolf Wenzler

        Hi Victor,

        Thanks for getting back to me. In the meantime, I managed to get an HDP 2.2 sandbox running on VMPlayer using the ODBC driver from Microsoft, which works well.

        Best,

        Rudolf
