Creating a HANA Workflow using HADOOP Oozie
There are several standard ways that HANA procedures can be scheduled, e.g.:
1) HANA SPS7 XS Job Scheduling (Job Scheduling | SAP HANA)
2) SAP Data Services (How to invoke SAP HANA stored procedures from D… | SAP HANA)
For those who use open-source HADOOP to manage Big Data, OOZIE can also be used to execute HANA procedures as part of a workflow.
For a good overview of HADOOP terms and definitions please refer to:
SAP HANA – Hadoop Integration # 1
A Big Data workflow integrating HADOOP and HANA might be:
Point c) on the diagram (using Sqoop for data transfer to HANA) is covered in more detail in another blog
Exporting and Importing DATA to HANA with HADOOP SQOOP
The focus of the remainder of this blog is simply to demonstrate how HANA server-side JavaScript (XSJS) can be used to execute HANA procedures [point d) in the diagram above] via an OOZIE workflow:
Oozie is currently described in Wikipedia as:
“a workflow scheduler system to manage Hadoop jobs. It is a server-based Workflow Engine specialized in running workflow jobs with actions that run Hadoop MapReduce and Pig jobs. Oozie is implemented as a Java Web-Application that runs in a Java Servlet-Container.
For the purposes of Oozie, a workflow is a collection of actions (e.g. Hadoop Map/Reduce jobs, Pig jobs) arranged in a control dependency DAG (Direct Acyclic Graph). A “control dependency” from one action to another means that the second action can’t run until the first action has completed. The workflow actions start jobs in remote systems (Hadoop or Pig). Upon action completion, the remote systems call back Oozie to notify the action completion; at this point Oozie proceeds to the next action in the workflow.
Oozie workflows contain control flow nodes and action nodes. Control flow nodes define the beginning and the end of a workflow (start, end and fail nodes) and provide a mechanism to control the workflow execution path (decision, fork and join nodes). Action nodes are the mechanism by which a workflow triggers the execution of a computation/processing task. Oozie provides support for different types of actions: Hadoop MapReduce, Hadoop file system, Pig, SSH, HTTP, eMail and Oozie sub-workflow. Oozie can be extended to support additional types of actions.
Oozie workflows can be parameterized (using variables like ${inputDir} within the workflow definition). When submitting a workflow job, values for the parameters must be provided. If properly parameterized (using different output directories), several identical workflow jobs can run concurrently.”
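To make those terms concrete, here is a minimal, purely illustrative workflow definition: start, kill and end are control flow nodes, the single action node does the work, and ${inputDir} is a parameter supplied when the job is submitted.

```xml
<workflow-app name="example-WF" xmlns="uri:oozie:workflow:0.4">
    <start to="first-action"/>
    <action name="first-action">
        <!-- any action type could go here; a trivial fs action is used for illustration -->
        <fs>
            <mkdir path="${inputDir}"/>
        </fs>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Workflow failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
```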
I think I’ve also read that Oozie was originally designed at Yahoo (whose Hadoop team later formed Hortonworks) for managing their complex HADOOP workflows.
It is open source and can be used with any HADOOP distribution (e.g. Cloudera, Hortonworks, etc.).
Oozie workflows can be defined in XML, or visually via the Hadoop User Interface (Hue – The UI for Apache Hadoop).
Below I will demonstrate a very simple example workflow of HANA XSJS being called to:
A) Delete the Contents of a Table in HANA
B) Insert a Single Record in the Table
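The procedures themselves are not the focus here; conceptually they are as trivial as the following SQLScript sketches. Note these are hypothetical: the schema, table and column names are placeholders, and only the procedure names match the ones used later in this blog.

```sql
-- Hypothetical sketches only: schema/table/column names are placeholders
CREATE PROCEDURE "deleteTestTable" LANGUAGE SQLSCRIPT AS
BEGIN
  DELETE FROM "OOZIE"."TEST_TABLE";
END;

CREATE PROCEDURE "create_record" (IN iParam1 NVARCHAR(10), IN iParam2 NVARCHAR(50))
LANGUAGE SQLSCRIPT AS
BEGIN
  INSERT INTO "OOZIE"."TEST_TABLE" ("ID", "FIELD1") VALUES (:iParam1, :iParam2);
END;
```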
To call procedures in HANA from HADOOP I created 2 small programs:
1) In HANA, a generic XSJS service for calling procedures (callProcedure.xsjs)
2) In HADOOP, a generic Java program for calling HANA XSJS (callHanaXSJS.java)
The HANA XSJS program has up to 7 input parameters:
iProcedure – is the procedure to be called (mandatory)
iTotalParameters – is the number of additional input parameters used by the Procedure (Optional – default 0)
iParam1 to iParam5 – are the input parameters of the procedure.
In the body of the response I provide the basic input and output info (including DB errors) in JSON format.
HANA: callProcedure.xsjs

```javascript
var maxParam = 5;
var iProcedure = $.request.parameters.get("iProcedure");
var iTotalParameters = $.request.parameters.get("iTotalParameters");
var iParam1 = $.request.parameters.get("iParam1");
var iParam2 = $.request.parameters.get("iParam2");
var iParam3 = $.request.parameters.get("iParam3");
var iParam4 = $.request.parameters.get("iParam4");
var iParam5 = $.request.parameters.get("iParam5");

// Echo the input parameters back in the response body
var output = {};
output.inputParameters = {};
output.inputParameters.iProcedure = iProcedure;
output.inputParameters.iTotalParameters = iTotalParameters;
output.inputParameters.iParam1 = iParam1;
output.inputParameters.iParam2 = iParam2;
output.inputParameters.iParam3 = iParam3;
output.inputParameters.iParam4 = iParam4;
output.inputParameters.iParam5 = iParam5;
output.Response = [];

var result = "";

// Check inputs
if (typeof iProcedure === 'undefined') {
    result = "ERROR: '&iProcedure=' Parameter is Mandatory";
    output.Response.push(result);
    $.response.status = $.net.http.INTERNAL_SERVER_ERROR;
    $.response.setBody(JSON.stringify(output));
} else {
    var conn = $.db.getConnection();
    var pstmt;

    if (typeof iTotalParameters === 'undefined') {
        iTotalParameters = 0;
    }

    // Build the CALL statement with one '?' placeholder per parameter
    var sql = "call \"_SYS_BIC\".\"" + iProcedure + "\"(";
    if (iTotalParameters > 0 && iTotalParameters <= maxParam) {
        var i;
        for (i = 0; i < iTotalParameters; i++) {
            if (i === 0) { sql += "?"; } else { sql += ",?"; }
        }
    } else {
        if (iTotalParameters !== 0) {
            result = "WARNING: '&iTotalParameters=' Parameter should be between 0 and " + maxParam;
            output.Response.push(result);
        }
    }
    sql += ")";
    output.inputParameters.sql = sql;

    try {
        pstmt = conn.prepareCall(sql);   // used for CALL (prepareStatement would be used for SELECT)
        if (iTotalParameters >= 1) { pstmt.setString(1, iParam1); }
        if (iTotalParameters >= 2) { pstmt.setString(2, iParam2); }
        if (iTotalParameters >= 3) { pstmt.setString(3, iParam3); }
        if (iTotalParameters >= 4) { pstmt.setString(4, iParam4); }
        if (iTotalParameters >= 5) { pstmt.setString(5, iParam5); }

        if (pstmt.execute()) {
            result = "OK:";
            result += JSON.stringify(pstmt.getResultSet());
        } else {
            result += "Failed to execute procedure";
        }
    } catch (e) {
        result += e.toString();
    }

    conn.commit();
    conn.close();

    output.Response.push(result);
    $.response.setBody(JSON.stringify(output));
}
```
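Once activated, the service can be smoke-tested directly from a browser or any HTTP client. For example, using the package path and parameter values that appear later in this blog (the host name is a placeholder):

```
http://<hana-host>:8000/OOZIE/OOZIE_EXAMPLE1/services/callProcedure.xsjs?iProcedure=OOZIE.OOZIE_EXAMPLE1.procedures/create_record&iTotalParameters=2&iParam1=10&iParam2=fromHADOOP
```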
The HADOOP Java program accepts a minimum of 4 input arguments:
arg[0] – URL of a HANA XSJS, accessible via the HADOOP cluster
arg[1] – HANA User name
arg[2] – HANA Password
arg[3] – HADOOP HDFS Output directory for storing response
arg[4 to n] – are used for the input parameters for the HANA XSJS called
HADOOP: callHanaXSJS.java

```java
package com.hanaIntegration.app;

/**
 * Calls a HANA server-side JavaScript (XSJS) service.
 * INPUTS:  (mandatory) HANA XSJS URL, HANA username, HANA password, HDFS output directory
 *          (optional)  n parameters appended to the XSJS URL as query parameters
 * OUTPUT:  writes the HANA XSJS response to a log file on HDFS
 */
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

import org.apache.commons.io.IOUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class callHanaXSJS {
    public static void main(String[] args) throws IOException {
        String sUrl = args[0];

        // Append XSJS command parameters (args[4..n]) as URL query parameters
        if (args.length > 4) {
            // append first parameter
            sUrl += "?" + args[4];
            // add subsequent parameters
            for (int i = 5; i < args.length; i++) {
                sUrl += "&" + args[i];
            }
        }
        System.out.println("HANA XSJS URL is: " + sUrl);

        URL url = new URL(sUrl);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();

        // Basic authentication: args[1] = user, args[2] = password
        String userpass = args[1] + ":" + args[2];
        String basicAuth = "Basic "
                + javax.xml.bind.DatatypeConverter.printBase64Binary(userpass.getBytes());
        conn.setRequestProperty("Authorization", basicAuth);
        conn.connect();

        InputStream connStream = conn.getInputStream();

        // Write the response to HDFS (args[3] = output directory)
        FileSystem hdfs = FileSystem.get(new Configuration());
        FSDataOutputStream outStream = hdfs.create(new Path(args[3], "HANAxsjsResponse.txt"));
        IOUtils.copy(connStream, outStream);

        outStream.close();
        connStream.close();
        conn.disconnect();
    }
}
```
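Before wiring it into the workflow, the JAR can be smoke-tested from a cluster node. Something along these lines should work (host, user and password are placeholders; the main class has to be named explicitly unless it is set in the JAR manifest, and commons-io must be on the classpath, which it normally is on a Hadoop node):

```
hadoop jar callHanaXSJS-1.0-SNAPSHOT.jar com.hanaIntegration.app.callHanaXSJS \
  "http://<hana-host>:8000/OOZIE/OOZIE_EXAMPLE1/services/callProcedure.xsjs" \
  <hana-user> <hana-password> \
  /apps/hanaIntegration/callHanaXSJS-log \
  iProcedure=OOZIE.OOZIE_EXAMPLE1.procedures/deleteTestTable
```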
NOTE: HADOOP Java programs are compiled as JARs and stored on HDFS prior to execution by OOZIE.
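For example, assuming the JAR has been built locally, copying it to the workflow's lib directory used below could be done with:

```
hadoop fs -mkdir -p /apps/hanaIntegration/callHanaXSJS-WF/lib
hadoop fs -put callHanaXSJS-1.0-SNAPSHOT.jar /apps/hanaIntegration/callHanaXSJS-WF/lib/
```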
With these small programs in place I will now show the setup in Oozie using Hue.
Below are screenshots from my small Hortonworks Hadoop HDP 2.0 cluster running on EC2.
(For setting up your own cluster or downloading a test virtual machine see HDP 2.0 – The complete Hadoop 2.0 distribution for the enterprise.)
Zooming in a bit on the 2 workflow tasks:
The definition of the first workflow task is:
The JAR I created was:
/apps/hanaIntegration/callHanaXSJS-WF/lib/callHanaXSJS-1.0-SNAPSHOT.jar
The arguments passed to call a delete procedure (no parameters) in HANA are:
${hanaXsjsUrl} ${user} ${password} ${outputDir} ${procedure_delete}
As this is the first task, I also delete and recreate the directory used to store the log files of each task.
This will store the JSON returned by the HANA XSJS.
The Second workflow task is:
The arguments passed to call an INSERT procedure (with 2 parameters) in HANA are:
${hanaXsjsUrl} ${user} ${password} ${outputDir} ${procedure_insert} ${insertTotalParams} ${insertId} ${insertField1}
The following XML workflow is then created at runtime:
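Since the generated file itself is not reproduced here, below is a rough sketch of what it contains: two chained java actions calling the same JAR, with a prepare step on the first action to recreate the log directory. The action names and schema version are illustrative; the real file is produced by Hue.

```xml
<workflow-app name="callHanaXSJS-WF" xmlns="uri:oozie:workflow:0.4">
    <start to="hana-delete"/>

    <!-- Task 1: call the HANA delete procedure -->
    <action name="hana-delete">
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <!-- recreate the log directory before the first task runs -->
            <prepare>
                <delete path="${outputDir}"/>
                <mkdir path="${outputDir}"/>
            </prepare>
            <main-class>com.hanaIntegration.app.callHanaXSJS</main-class>
            <arg>${hanaXsjsUrl}</arg>
            <arg>${user}</arg>
            <arg>${password}</arg>
            <arg>${outputDir}</arg>
            <arg>${procedure_delete}</arg>
        </java>
        <ok to="hana-insert"/>
        <error to="fail"/>
    </action>

    <!-- Task 2: call the HANA insert procedure -->
    <action name="hana-insert">
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <main-class>com.hanaIntegration.app.callHanaXSJS</main-class>
            <arg>${hanaXsjsUrl}</arg>
            <arg>${user}</arg>
            <arg>${password}</arg>
            <arg>${outputDir}</arg>
            <arg>${procedure_insert}</arg>
            <arg>${insertTotalParams}</arg>
            <arg>${insertId}</arg>
            <arg>${insertField1}</arg>
        </java>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>callHanaXSJS workflow failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```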
I can then submit/schedule the workflow:
In my Test I passed the following parameters to the workflow:
(NOTE: unfortunately the order of input parameters via Hue is currently messy. If manually creating the XML, this can be tidied up into a more logical order.)
In a more logical sequence these are:
Used by both tasks:
${hanaXsjsUrl} http://ec2-xx-xxx-xx-xxx.compute-1.amazonaws.com:8000/OOZIE/OOZIE_EXAMPLE1/services/callProcedure.xsjs
${user} HANAUserID
${password} HanaPAssword
${outputDir} hdfs://ip-xx-xxx-xx-xx.ec2.internal:8020/apps/hanaIntegration/callHanaXSJS-log
Used by Delete task
${procedure_delete} iProcedure=OOZIE.OOZIE_EXAMPLE1.procedures/deleteTestTable
Used by Insert task
${procedure_insert} iProcedure=OOZIE.OOZIE_EXAMPLE1.procedures/create_record
${insertTotalParams} iTotalParameters=2
${insertId} iParam1=10
${insertField1} iParam2=fromHADOOP
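For reference, if the same workflow were submitted from the Oozie command line rather than Hue, these values would typically be supplied in a job.properties file, roughly like this (the jobTracker/ResourceManager address and the application path are assumptions):

```
nameNode=hdfs://ip-xx-xxx-xx-xx.ec2.internal:8020
jobTracker=<resource-manager-host>:8050
oozie.wf.application.path=${nameNode}/apps/hanaIntegration/callHanaXSJS-WF

hanaXsjsUrl=http://ec2-xx-xxx-xx-xxx.compute-1.amazonaws.com:8000/OOZIE/OOZIE_EXAMPLE1/services/callProcedure.xsjs
user=HANAUserID
password=HanaPAssword
outputDir=${nameNode}/apps/hanaIntegration/callHanaXSJS-log
procedure_delete=iProcedure=OOZIE.OOZIE_EXAMPLE1.procedures/deleteTestTable
procedure_insert=iProcedure=OOZIE.OOZIE_EXAMPLE1.procedures/create_record
insertTotalParams=iTotalParameters=2
insertId=iParam1=10
insertField1=iParam2=fromHADOOP
```

The job would then be started with something like "oozie job -oozie http://<oozie-host>:11000/oozie -config job.properties -run".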
Once the Workflow runs we see the following:
The following log files were created by each task:
The Insert task Log file shows as:
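Based on the XSJS response structure shown earlier, the insert task's log file contains JSON along these lines (the exact Response text depends on how the result set object is serialized):

```json
{
  "inputParameters": {
    "iProcedure": "OOZIE.OOZIE_EXAMPLE1.procedures/create_record",
    "iTotalParameters": "2",
    "iParam1": "10",
    "iParam2": "fromHADOOP",
    "sql": "call \"_SYS_BIC\".\"OOZIE.OOZIE_EXAMPLE1.procedures/create_record\"(?,?)"
  },
  "Response": ["OK:{}"]
}
```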
Finally we can check in HANA to confirm the record has been created:
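For example, a simple query in HANA Studio (using the hypothetical table name from the procedure sketch earlier) confirms the new row:

```sql
SELECT * FROM "OOZIE"."TEST_TABLE";
```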
OK, for inserting one record this isn't very exciting and is a bit of overkill 😉, but conceptually this enables the power of HADOOP and HANA to be harnessed and combined in a single workflow.
The Java program for calling HANA XSJS is available at https://github.com/AronMacDonald/callHanaXSJS