The default installation of the SparkController for HANA in a Hadoop cluster allows either of the following, but not both with the same SparkController instance:

  • Archiving data to HDFS using DLM (Data Lifecycle Management), and accessing Hive and DLM-archived data from HANA.
  • Accessing Vora data from HANA.

Currently the SparkController can be configured to talk to either Hive or Vora; the 2 options are mutually exclusive.  Due to Ambari behaviour, if you install and configure 2 SparkControllers, both receive the same configuration settings whenever the SparkController services are started, unless you assign a separate configuration to one of the SparkController nodes.

This blog will show you how to alter the installation and use 2 (or more) SparkControllers in a Hortonworks Ambari cluster, so that DLM operations can be performed and both Hive and Vora data can be accessed by HANA using SDA.

First – the components:

  • Hortonworks Ambari cluster – HDP 2.4.2.0, Spark 1.6.1
  • HANA (I’m using 122.3)
  • Spark Controller 1.6 PL1
  • Vora 1.3 (latest build)

Next – Ambari configuration:

The standard Ambari installation is followed to set up the cluster.  The Vora and SparkController .tgz files are untarred to the ‘/var/lib/ambari-server/resources/stacks/HDP/2.4/services’ folder.  This creates a SparkController folder and a vora-manager folder.

>tar -zxvf SAPHanaVora-1.3.xx-ambari.tar.gz -C /var/lib/ambari-server/resources/stacks/HDP/2.4/services/.
>tar -zxvf controller.distribution-1.6.1-Ambari-Archive.tar.gz -C /var/lib/ambari-server/resources/stacks/HDP/2.4/services/.

You must restart the Ambari server to allow it to gather the resources in order to see the newly available services for installation via the Ambari console:

>ambari-server restart
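
If the two services do not appear in the wizard after the restart, it is worth confirming that both archives unpacked into the stack's services folder.  The following is a minimal sketch, assuming the folder names ‘SparkController’ and ‘vora-manager’ mentioned above; ‘check_service_dirs’ is a hypothetical helper name, not an Ambari command.

```shell
# check_service_dirs is a hypothetical helper, not part of Ambari.
# It reports whether the expected service folders exist under the
# services directory passed as the first argument.
check_service_dirs() {
  local services_dir="$1"
  local missing=0
  local name
  for name in SparkController vora-manager; do
    if [ -d "${services_dir}/${name}" ]; then
      echo "found: ${name}"
    else
      echo "missing: ${name}"
      missing=1
    fi
  done
  # Non-zero exit status when at least one folder is missing
  return "${missing}"
}

# Example call (path taken from the text above):
# check_service_dirs /var/lib/ambari-server/resources/stacks/HDP/2.4/services
```

A ‘missing’ line usually means the archive was extracted to a different path than the one Ambari scans.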

After the restart, log into the Ambari Server console (http://<host>:8080) and use the ‘Actions->+ Add Service’ button/menu to add the Vora Manager service and the SparkController service to the Ambari cluster with the ‘Add Service’ wizard.

(Your versions may vary).  I would recommend installing them one at a time to allow for proper selection of master nodes, clients, etc.

Adding the Vora Manager Service:

The Vora 1.3 installation itself is not covered here.  For now, when assigning Master node(s) with the ‘Add Service’ wizard, you must ensure that at least one master node is assigned as the ‘Vora Manager Master’.


When assigning Slaves and Clients, all other nodes will contain the Vora Worker and all nodes must have a Vora Client.

All other Vora services will be installed using the Vora Manager GUI once Vora Manager is started.  Follow the installation of Vora Manager and Vora Manager services using the Vora Manager GUI documented at https://help.sap.com/hana_vora_re.

Adding the SparkController Service:

For now, you will only assign and deploy a single Master for the SparkController service. This will be the ‘Hive’-configured SparkController instance.

The entry point for the SparkController documentation is ‘Using SAP HANA Spark Controller’. You should also refer to SAP Note 2344239 for the most up-to-date documentation on SparkController 1.6 PL1.

Adding the services to Ambari will automatically create the following users and directories:

  • For Vora: the ‘/etc/vora’ folder and the ‘vora’ user and group in /etc/group and /etc/passwd.
  • For SparkController: the ‘/usr/sap/spark/controller’ folder and the ‘sapsys’ group in /etc/group and ‘hanaes’ user in /etc/passwd.

Note: different GIDs and UIDs may be used for different Hadoop distros. Removing a service may remove the login and directories and subsequent upgrades (uninstall/reinstall) may change UIDs and GIDs. It is recommended to duplicate the SparkController UIDs and GIDs on all datanodes or nodes that will access the HDFS to ensure correct ownership and permissions are maintained in HDFS folders, as well as backup configurations.
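
One way to compare IDs across nodes is to print them in a fixed format on each node and compare the lines.  This is only a sketch: ‘print_ids’ is a hypothetical helper built on the standard ‘id’ command, and how you run it on every node (ssh loop, pdsh, etc.) is left open.

```shell
# print_ids is a hypothetical helper: it prints "<user> <uid> <gid>"
# for a local account so the line can be compared between nodes.
print_ids() {
  local user="$1"
  if id "${user}" >/dev/null 2>&1; then
    echo "${user} $(id -u "${user}") $(id -g "${user}")"
  else
    echo "${user} not-found"
    return 1
  fi
}

# Example: run on each node that accesses HDFS and compare the output
# print_ids hanaes
# print_ids vora
```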

Configuration steps

To simplify things, we will first configure our ‘default’ SparkController to access Hive databases and perform DLM work (we will call this the ‘Hive SparkController host’).  Later, we will copy the installation to a second instance.  (Installation of the DLM component itself in HANA is not covered in this blog.  See http://help.sap.com/hana_options_dwf?current=hana).

  1. Configure HDFS – add the following configuration settings to the HDFS ‘Custom core-site’.  This is a cluster-wide setting.
    hadoop.proxyuser.hanaes.groups=*
    hadoop.proxyuser.hanaes.hosts=*

    Restart HDFS as required.

  2. Configure YARN – add the following configuration setting to the YARN ‘Custom yarn-site’ settings:
    hdp.version=2.4.2.0-258

    (your version may vary – ‘ls’ the ‘/usr/hdp’ folder for the version you should use).

    Restart the YARN service as required.

  3. Configure the SparkController to speak to both Vora and Hive.  Under the SparkController ‘Config’ tab, make the following changes.
  • Edit the ‘Advanced hana_hadoop-env’ with the following (making necessary changes to paths to jar files for Spark and Vora locations).
    #!/bin/bash
    export HADOOP_CONF_DIR=/etc/hadoop/conf
    export HIVE_CONF_DIR=/etc/hive/conf
    export HANAES_LOG_DIR=/var/log/hanaes
    export HANA_SPARK_ASSEMBLY_JAR=/usr/hdp/current/spark-client/lib/spark-assembly-1.6.1.2.4.2.0-258-hadoop2.7.1.2.4.2.0-258.jar
    export HANA_SPARK_ADDITIONAL_JARS=/var/lib/ambari-agent/cache/stacks/HDP/2.4/services/vora-manager/package/lib/vora-spark/lib/spark-sap-datasources-1.3.99-assembly.jar
    #export HANAES_CONF_DIR=/etc/hanaes/conf
    
    
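    #use HANA_SPARK_ADDITIONAL_JARS for DATANUCLEUS path
    DATANUCLEUS_LIBS=""
    for jarFile in /usr/hdp/current/spark-client/lib/*datanucleus*
    do
      # skip the unexpanded pattern if no datanucleus jar is present
      [ -e "$jarFile" ] || continue
      DATANUCLEUS_LIBS="${DATANUCLEUS_LIBS}:${jarFile}"
    done
    # DATANUCLEUS_LIBS already starts with ':', so no extra separator is needed
    export HANA_SPARK_ADDITIONAL_JARS="${HANA_SPARK_ADDITIONAL_JARS}${DATANUCLEUS_LIBS}"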

    (datanucleus libraries come with the Ambari HDP stack and are required to access Hive).
    (You will need to alter the HANA_SPARK_ADDITIONAL_JARS variable to point to your version of the Vora spark-sap-datasources…jar file).

  • Leave ‘Advanced hanaes-site’ alone for now.
  • Add the following key/value properties to the ‘Custom hanaes-site’ to define the HDFS folder for storing DLM data from HANA.  Your settings may vary; just ensure that the folder specified below exists and is owned by ‘hanaes:hdfs’.  The best way to do this is to ‘ssh’ to the SparkController node and, as the ‘hdfs’ user, create the directories and change ownership using the ‘hadoop fs’ command.
    sap.hana.es.warehouse=/sap/hana/hanaes/warehouse
    sap.hana.hadoop.datastore=hive
    spark.sql.hive.metastore.sharedPrefixes=com.mysql.jdbc,org.postgresql,com.microsoft.sqlserver,oracle.jdbc,org.apache.hadoop
  4. Start the SparkController: from the ‘Summary’ tab, follow the link to the node running the controller and start it from there.
    Check the /var/log/hanaes/hana_controller.log for any startup errors.
  5. Test the connection to the SparkController from a HANA instance by creating a Remote Source from a SQL console in HANA Studio, pointing to the Hive SparkController host:
    CREATE REMOTE SOURCE "SparkHive" ADAPTER "sparksql"       
    CONFIGURATION 'port=7860;ssl_mode=disabled;server=<HiveSparkControllerServerDNS>'        
    WITH CREDENTIAL TYPE 'PASSWORD' USING 'user=hanaes;password=hanaes'

    Refreshing the ‘Remote Sources’ should allow you to see the default and any other Hive databases under the ‘SparkHive’ source.

     

    You can now create virtual tables using SDA to Hive database tables.  This is also the ‘Remote Source’ that will be used for DLM operations.

    create virtual table "<HANASchema>"."<HANATable>"
    at "SparkHive"."hive"."<HiveDb>"."<HiveTable>"
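
Two of the preparation steps above (finding the value for ‘hdp.version’ and creating the DLM warehouse folder in HDFS) can be sketched as small shell helpers.  Both function names are hypothetical.  The second one only prints the ‘hdfs dfs’ commands so they can be reviewed before being run as the ‘hdfs’ user; the warehouse path is the example value from ‘sap.hana.es.warehouse’.

```shell
# Hypothetical helpers for the preparation steps above; neither is part
# of HDP or the SparkController.

# Print the version folders found under an HDP root (normally /usr/hdp),
# ignoring the 'current' folder. The result is the hdp.version candidate.
list_hdp_versions() {
  local root="$1"
  local dir
  for dir in "${root}"/[0-9]*; do
    [ -d "${dir}" ] && basename "${dir}"
  done
  return 0
}

# Print (without running them) the HDFS commands that create the DLM
# warehouse folder and give ownership to hanaes:hdfs.
print_warehouse_commands() {
  local warehouse="$1"
  echo "hdfs dfs -mkdir -p ${warehouse}"
  echo "hdfs dfs -chown -R hanaes:hdfs ${warehouse}"
}

# Examples:
# list_hdp_versions /usr/hdp
# print_warehouse_commands /sap/hana/hanaes/warehouse
```

If more than one version folder is listed, use the one that the ‘current’ links point to.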

Now we will configure an additional SparkController on another node to work with Vora.

Before proceeding, ensure that Vora services have been configured according to the Vora Administration and Installation Guide.  A test table should be available for testing connectivity from HANA to Vora through the SparkController.

Because Ambari server copies the configuration to the other server, we will take advantage of an Ambari feature that allows multiple configurations.  This will allow us to maintain the 2 SparkController configurations, and start the SparkControllers with separate configurations.

    1. From the Dashboard, select the SparkController service, open the ‘Config’ tab and click on the ‘config groups’ link.
    2. Add a new group called VoraController, select a node that doesn’t have the just-installed SparkController on it (e.g. another edge node or secondary master), add it to the new group, then ‘Save’.
    3. Now navigate to the ‘Hosts’ tab of the Ambari server console and select the same node.  We will refer to this as the ‘Vora SparkController host’.
    4. Under ‘Components’, select ‘Add +’ and choose ‘SparkController’.
    5. Confirm and allow Ambari server to install the SparkController service on the new host node.
    6. Return to the ‘Dashboard’ and select the ‘SparkController’ service.  You will now see 2 controllers.  Select the ‘Config’ tab and choose the VoraController group.
    7. Open ‘Custom hanaes-site’ and select the ‘Override’ symbol beside the ‘sap.hana.hadoop.datastore’ value.
    8. Enter ‘vora’ in the override box that appears.
    9. Press ‘Save’ and provide a note.
    10. You will now have 2 configurations for the SparkController instances.
    11. Start the new instance: select the ‘Hosts’ tab from the Dashboard, select the node that now hosts the Vora controller and start the SparkController instance.

Both controllers should now be running, but you can start and stop them using the Ambari Dashboard or from each node separately as required.
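
To double-check that the VoraController group exists, the config groups can also be read through the Ambari REST API.  This is a sketch under assumptions: ‘config_groups_url’ is a hypothetical helper, the host and cluster names are placeholders, and the ‘config_groups’ resource path should be checked against the API documentation for your Ambari version.

```shell
# config_groups_url is a hypothetical helper that builds the Ambari REST
# URL for a cluster's config groups. Port 8080 is the console port used
# earlier in this blog; host and cluster name are placeholders.
config_groups_url() {
  local ambari_host="$1"
  local cluster="$2"
  echo "http://${ambari_host}:8080/api/v1/clusters/${cluster}/config_groups"
}

# Example (assumes an Ambari admin login; curl prompts for the password):
# curl -u admin -H 'X-Requested-By: ambari' "$(config_groups_url ambari.example.com MyCluster)"
```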

To start either controller manually, ‘ssh’ to its node as ‘root’, switch to the ‘hanaes’ user created by the original installation of the controller on that node, and start the SparkController:

>su - hanaes
>cd /usr/sap/spark/controller/bin
>./hanaes start

You can check the log at ‘/var/log/hanaes/hana_controller.log’ and also refresh the Remote Source in HANA Studio to confirm the connection.
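
Scanning the log by hand gets tedious with 2 controllers, so a small filter can help.  The following is a minimal sketch; ‘show_log_errors’ is a hypothetical helper, and the assumption that problems show up as lines containing ‘ERROR’ or ‘Exception’ may not match your log format.

```shell
# show_log_errors is a hypothetical helper. It prints the last lines of a
# log file that contain ERROR or Exception (an assumed pattern).
# Arguments: log file, optional maximum number of lines (default 20).
show_log_errors() {
  local logfile="$1"
  local max_lines="${2:-20}"
  if [ ! -r "${logfile}" ]; then
    echo "cannot read ${logfile}" >&2
    return 2
  fi
  grep -E 'ERROR|Exception' "${logfile}" | tail -n "${max_lines}"
}

# Example:
# show_log_errors /var/log/hanaes/hana_controller.log
```

No output means no matching lines were found, which is not proof of a clean start, so still confirm the connection from HANA.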

  1. Add a new Remote Source to the HANA server that points to the Vora SparkController host:
    CREATE REMOTE SOURCE "SparkVora" ADAPTER "sparksql"       
    CONFIGURATION 'port=7860;ssl_mode=disabled;server=<VoraSparkControllerServerDNS>'        
    WITH CREDENTIAL TYPE 'PASSWORD' USING 'user=hanaes;password=hanaes'

    Refresh the Remote Sources.

You should now be able to see 2 Remote Sources under the Provisioning folder in HANA Studio. Creating a virtual table on Vora is similar to Hive:

create virtual table "<HANASchema>"."<HANATable>"
at "SparkVora"."vora"."spark_vora"."<VoraTable>"
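
If you need many virtual tables, the statements can be generated rather than typed.  This is a sketch only: ‘virtual_table_sql’ is a hypothetical helper that follows the two statement patterns shown in this blog, and the schema and table names in the examples are placeholders.

```shell
# virtual_table_sql is a hypothetical helper that prints a
# CREATE VIRTUAL TABLE statement in the pattern shown above.
# Arguments: HANA schema, HANA table, remote source,
#            remote database, remote schema, remote table.
virtual_table_sql() {
  printf 'create virtual table "%s"."%s"\nat "%s"."%s"."%s"."%s";\n' \
    "$1" "$2" "$3" "$4" "$5" "$6"
}

# Examples (placeholder names):
# virtual_table_sql MYSCHEMA V_SALES SparkVora vora spark_vora sales
# virtual_table_sql MYSCHEMA H_SALES SparkHive hive default sales
```

The output can be pasted into a SQL console in HANA Studio.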

The ‘SparkHive’ source should show the various Hive databases and tables available under the source.  The Vora source will show a database/schema of ‘spark_vora’ and the various tables available under that source.

In a further blog, I will discuss some tuning of the controllers under Spark/YARN.
