
This blog post explains how we built a custom operator to incorporate the Python hivejdbc package in the SAP Data Intelligence Pipeline Modeler (version 3.1-2010) by building a Docker image that includes all the prerequisites for hivejdbc.


 

Background:


We got the opportunity to work with a large bank in the Philippines that was looking to meet model risk management and governance requirements. SAP proposed the solution below. Integrating with HIVE on Cloudera was a keystone of the project, and it had to use HTTPS and Kerberos authentication over a VPN tunnel. Since there is no standard operator, we explored how we could write Python code or build a custom operator to address this requirement, and then we came across the product team's Git repository: https://github.com/SAP-samples/datahub-integration-examples/tree/master/HiveOperator.


Architecture


We were very excited to find an already working operator with which we could quickly solve the customer's problem, but our excitement was short lived. Even though the product team had made the effort to build a working operator, it lacked the connection security required for deployment in a production environment. The customer was not willing to compromise on security and would only allow an SSL connection with Kerberos.

We used the product team's Custom Operator as a Base and started addressing the missing Security and SSL connection.

If you want to know more about the business requirement, the architecture, or the deployed solution as a whole, you can reach out to me or to Ankit Garg at ankit.garg03@sap.com.

Getting Started


The Journey Began !!!!

We explored multiple Python Hive implementations that support Kerberos and SSL and finally zeroed in on https://pypi.org/project/hivejdbc/.
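To give a feel for how hivejdbc is used, here is a minimal connection sketch. It assumes the package's DB-API style interface (connect() returning a connection that exposes cursor()); the host, principal, keytab, JDBC driver path, and especially the Kerberos keyword argument names are placeholders based on our reading of the hivejdbc documentation, so verify them against the library's README before use.

from hivejdbc import connect

# All values below are illustrative placeholders; the exact keyword
# argument names should be checked against the hivejdbc README.
conn = connect(
    host='hive.example.com',                              # HiveServer2 host (placeholder)
    port=10000,
    database='default',
    driver='/usr/local/hive-jdbc-2.1.1-standalone.jar',   # JDBC jar baked into the image
    principal='hive/hive.example.com@EXAMPLE.COM',        # Kerberos service principal (placeholder)
    krb_conf='/etc/krb5.conf',                            # Kerberos configuration file
    keytab='/keytabs/user.keytab',                        # keytab file
)
# SSL/truststore options (e.g. the uploaded truststore.jks) are passed as
# additional connect() arguments; see the hivejdbc documentation for the names.

cursor = conn.cursor()
cursor.execute('SELECT current_date()')                   # simple connectivity check
print(cursor.fetchall())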

Even though the documentation tries to simplify things, we had a tough time figuring out how to build and test it in SAP Data Intelligence. In particular, the Docker image needs to include many of the prerequisites mentioned in the documentation.

The Docker Image


The toughest part was getting a Docker image with all the prerequisites, especially the Java runtime.

We took the standard Python Docker image (more info on the standard Python Docker image can be found here: https://hub.docker.com/_/python) and started adding the missing prerequisites. Here is the final Docker image I came up with. Building a Docker image in SAP DI is a bit tricky, as there are not many logs to look at. I would recommend using Docker on your laptop to check the build; if the build works on your laptop, you can use the same build file in SAP DI.

Note: My laptop runs Windows; you can try this on Mac or Linux as well.

--------------------------------------------------------------------------------------------------------------------------------------
FROM python:3.6.4-slim-stretch

ENV DEBIAN_FRONTEND=noninteractive

RUN mkdir -p /usr/share/man/man1 && \
apt-get update && \
apt install -y python3-pip && \
apt-get install -y python3-dev && \
apt-get install -y krb5-user && \
apt-get install -y libsasl2-dev && \
apt-get install -y libsasl2-modules-gssapi-mit && \
apt-get install -y libsasl2-2 && \
apt-get install -y openjdk-8-jre-headless && \
apt-get install -y openjdk-8-jdk && \
apt-get install -y wget && \
apt-get install -y zip unzip && \
mkdir /hive_operator && \
mkdir /keytabs && \
mkdir /usr/local/hadoop && \
mkdir /usr/local/hive

RUN pip3 install pandas 'six==1.12.0' 'bit_array==0.1.0' 'thrift==0.9.3' 'thrift_sasl==0.2.1' 'sasl==0.2.1' 'hivejdbc' 'tornado'

RUN wget -P /usr/local/ https://repo1.maven.org/maven2/org/apache/hive/hive-jdbc/2.1.1/hive-jdbc-2.1.1-standalone.jar

RUN wget -P /usr/local/ https://github.com/timveil/hive-jdbc-uber-jar/releases/download/v1.8-2.6.3/hive-jdbc-uber-2.6.3.0-23...

# Install Beeline - this is just to test connectivity (not really required in the image)
RUN wget -P /usr/local/ https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz
RUN wget -P /usr/local/ https://archive.apache.org/dist/hive/hive-2.1.1/apache-hive-2.1.1-bin.tar.gz
RUN tar -xvzf /usr/local/hadoop-2.7.3.tar.gz -C /usr/local/hadoop/
RUN tar -xvzf /usr/local/apache-hive-2.1.1-bin.tar.gz -C /usr/local/hive/

ENV HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.3
ENV HIVE_HOME=/usr/local/hive/apache-hive-2.1.1-bin
ENV PATH=$PATH:$HIVE_HOME/bin

RUN groupadd -g 1972 vflow && useradd -g 1972 -u 1972 -m vflow
USER 1972:1972

ENV HOME=/home/vflow
ENV JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/
ENV CLASSPATH=/usr/local/*
--------------------------------------------------------------------------------------------------------------------------------------

Tags:

{
"hive_python": "",
"python36": "\"\"",
"tornado": "5.0.2"
}

Custom Operator


We used a Python-based operator as the base and built all the necessary interfaces to HIVE, using Kerberos authentication via the krb5 configuration.

Refer to the following link to create the custom operator.

https://help.sap.com/viewer/29ff74dc606c41acad117003f6034ac7/1.2.latest/en-US/049d2f3cc69c4281a3f457...

Here are some screenshots of the custom operator created using the base Python operator.


Operator-Ports


 


Operator-Tags



Operator-Config


 


Operator-Script


You can refer to the script at this link: https://github.com/SAP-samples/datahub-integration-examples/blob/master/HiveOperator/src/vrep/vflow/...
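Since the link above is truncated, here is a minimal sketch of what such a script can look like in a Python-based operator. It assumes the standard SAP Data Intelligence Python operator API (the runtime-injected api object with api.config, api.set_port_callback, and api.send) plus a hypothetical open_hive_connection() helper that wraps the hivejdbc connect call shown earlier; the port and configuration property names are illustrative, not the exact ones from the sample repository.

# Minimal operator-script sketch; 'api' is provided by the SAP Data Intelligence
# Python operator runtime and is not imported.

def open_hive_connection():
    # Hypothetical helper: opens a hivejdbc connection using configuration
    # parameters defined on the operator (property and keyword argument names
    # are assumptions; verify them against the hivejdbc documentation).
    from hivejdbc import connect
    return connect(host=api.config.host,
                   database=api.config.database,
                   driver=api.config.driver_jar,
                   principal=api.config.principal,
                   krb_conf=api.config.krb_conf,
                   keytab=api.config.keytab)

def on_input(msg):
    # The input port delivers the SQL statement typed into the Terminal operator.
    sql = msg.body if hasattr(msg, 'body') else msg
    conn = open_hive_connection()
    cursor = conn.cursor()
    cursor.execute(sql)
    rows = cursor.fetchall()
    # Send the result set to the output port (wired to the Terminal / file writer).
    api.send("output", str(rows))
    conn.close()

# "input" and "output" must match the port names defined on the operator.
api.set_port_callback("input", on_input)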

************************************************************************************************************

You need to upload the Kerberos krb5 configuration, the truststore.jks, and the keytab file using the operator's file upload option.

By default, the files are uploaded to the /vrep/vflow/subengines/com/sap/python36/operators/ubp/com/sap/python36/hive/ location in the container. You can move them to any other location using the Python script in the operator.
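As an illustration, a few lines like the following at the top of the operator script can pick up the uploaded files and make them visible to Kerberos. The target directory and file names are assumptions; KRB5_CONFIG is the standard MIT Kerberos environment variable for pointing at a non-default krb5.conf.

import os
import shutil

# Directory the operator files are uploaded to (default vrep location from above).
UPLOAD_DIR = "/vrep/vflow/subengines/com/sap/python36/operators/ubp/com/sap/python36/hive"
# Hypothetical writable target directory inside the container.
TARGET_DIR = "/tmp/hive_secrets"

os.makedirs(TARGET_DIR, exist_ok=True)
for name in ("krb5.conf", "truststore.jks", "user.keytab"):   # assumed file names
    src = os.path.join(UPLOAD_DIR, name)
    if os.path.exists(src):
        shutil.copy(src, os.path.join(TARGET_DIR, name))

# Point the Kerberos libraries at the copied configuration file.
os.environ["KRB5_CONFIG"] = os.path.join(TARGET_DIR, "krb5.conf")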


Operator-FileUploadOption



Test Graph


With the Docker image and the operator ready, we can now test the operator. The operator takes an SQL statement as input from the Terminal operator and sends the response from HIVE to be written to the Terminal and to a file location.


TestGraph


 

Observation



  • The script in the operator shared in this blog post is very minimal; it is just intended to test the connection to HIVE and to see whether HIVE responds to our SQL request.

  • Please note that the script does not check the SQL syntax, so the graph will fail if the syntax is incorrect.

  • You will not get a header row in the output. If you need the output with a header, you have to add some more logic to the script using DESCRIBE TableName and then add the column names to the result set (see the sketch after this list).

  • Never run SELECT * on a table with a huge volume of data without any filter; the query may not return and the graph might fail.

  • This operator and graph were primarily designed for reads from HIVE; however, they can be expanded to bulk-INSERT data from a file into HIVE. We made many enhancements to the Python code; you can reach out to me in the comments if you need more information.

  • One bug was observed: if there is a NULL value in the records, the SELECT output after the NULL value is set to NULL. For example, if the table had 100 records and there was a NULL value in one of the columns at record 50, then all records after record 50 were returned as NULL. We overcame this issue by using nvl(columnname, 0) in the SELECT statement, as shown in the sketch below.
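To illustrate the last two points, here is a small sketch of how the column names returned by DESCRIBE can be used both as a header row and to wrap each column in nvl(). The cursor object and table name are assumptions carried over from the earlier connection sketch, and nvl(col, 0) only makes sense for numeric columns; use an empty string for text columns.

def query_with_header(cursor, table_name, limit=100):
    """Fetch rows from a Hive table with a header row and NULLs replaced via nvl()."""
    # DESCRIBE returns one row per column: (col_name, data_type, comment).
    cursor.execute(f"DESCRIBE {table_name}")
    columns = [row[0] for row in cursor.fetchall()]

    # Wrap every column in nvl(...) to work around the NULL truncation issue.
    # nvl(col, 0) is appropriate for numeric columns; use nvl(col, '') for strings.
    select_list = ", ".join(f"nvl({col}, 0) AS {col}" for col in columns)
    cursor.execute(f"SELECT {select_list} FROM {table_name} LIMIT {limit}")

    # Prepend the column names as a header row to the result set.
    return [columns] + cursor.fetchall()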


To Conclude


SAP Data Intelligence Cloud, with its open technology support, was a great tool for building a working operator with all the necessary security aspects. It enabled us to integrate HIVE data with data from other sources and successfully address the critical business use case.