Bringing Machine Learning (TensorFlow) to the enterprise with SAP HANA
In this blog I aim to provide an introduction to TensorFlow and the SAP HANA integration, give you an understanding of the landscape and outline the process for using External Machine Learning with HANA.
There’s plenty of hype around Machine Learning, Deep Learning and of course Artificial Intelligence (AI), but understanding the benefits in an enterprise context can be more challenging. Being able to integrate the latest deep learning models into your enterprise via a high performance in-memory platform could provide a competitive advantage, or perhaps just help you keep up with the competition.
From HANA 2.0 SP2 onwards we have the ability to call TensorFlow (TF) models, or graphs as they are known. HANA now includes a method to call External Machine Learning (EML) models via a remote source. The EML integration is performed using a wrapper function, very similar to the Predictive Analysis Library (PAL) or Business Function Library (BFL). Like the PAL and BFL, the EML is table based, with tables storing the model metadata, parameters, input data and output results. At the lowest level EML models are created and accessed via SQL, which makes them a convenient building block for applications.
TensorFlow by itself is powerful, but embedding it within a business process or transaction could be a game changer. Linking it seamlessly to your enterprise data, and being able to use the same single source of data for transactions, analytics and deep learning without barriers, is no longer a dream. Having some control, and an audit trail of which models were used by whom, how many times, when they were executed and with what data, is likely to be a core enterprise requirement.
TensorFlow itself is a software library from Google that is accessed via Python. Many examples exist where TF has been used to process and classify both images and text. TF models work by feeding tensors through multiple layers. A tensor is just a set of numbers, stored in a multi-dimensional array, and such arrays can make up a layer. The final output of the model may lead to a prediction, such as a true/false classification. We use a typical supervised learning approach, i.e. the TensorFlow model first requires training data to learn from.
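To make the idea of tensors flowing through layers concrete, here is a minimal sketch in plain Python rather than TensorFlow. The feature values, weights and layer sizes are all invented for illustration, and a real TF model would learn its weights from training data instead of having them written by hand.

```python
import math


def dense(inputs, weights, biases):
    """One fully connected layer: output[j] = sum_i(inputs[i] * weights[i][j]) + biases[j]."""
    return [
        sum(x * row[j] for x, row in zip(inputs, weights)) + biases[j]
        for j in range(len(biases))
    ]


def relu(values):
    """Replace negative values with zero."""
    return [max(0.0, v) for v in values]


def sigmoid(value):
    """Squash a number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))


# A rank-1 tensor (a vector) of three input features. The numbers are made up.
features = [0.5, -1.0, 2.0]

# Hidden layer: a rank-2 tensor (3x2 matrix) of weights plus a bias vector.
hidden_w = [[0.2, -0.4], [0.1, 0.3], [0.5, 0.6]]
hidden_b = [0.0, 0.1]

# Output layer: two inputs feeding a single output.
out_w = [[1.0], [-1.0]]
out_b = [0.0]

hidden = relu(dense(features, hidden_w, hidden_b))
score = sigmoid(dense(hidden, out_w, out_b)[0])
prediction = score > 0.5  # the true/false classification
```

Each layer is only a multiplication of one array of numbers by another, followed by a simple function. Deep learning libraries do the same thing at much larger scale.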
A TF model is built up of many layers that feed into each other. We can train these models to identify almost anything, given the correct training data, and then integrate that identification within a business process. For example, we could pass in an image and ask the TensorFlow model to classify (identify) it, based on training data.
Equally, we could build a model that expects some unstructured text data. The model’s internals may be quite different, but the overall concept would be similar.
For text data and text classification, extensive research has been published, with Google releasing Word2Vec and Stanford releasing GloVe. Both provide vector representations of words. Pre-trained word vectors are available for download covering multiple languages.
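Word vectors are useful because related words end up close together, and closeness is commonly measured with cosine similarity. The sketch below uses three toy vectors invented for illustration. Real Word2Vec or GloVe vectors have far more dimensions and are learned from large text corpora.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional word vectors. These values are made up, not taken from a trained model.
vectors = {
    "invoice": [0.9, 0.1, 0.0],
    "payment": [0.8, 0.2, 0.1],
    "holiday": [0.0, 0.2, 0.9],
}

sim_related = cosine_similarity(vectors["invoice"], vectors["payment"])
sim_unrelated = cosine_similarity(vectors["invoice"], vectors["holiday"])
```

With these toy values, "invoice" scores much closer to "payment" than to "holiday", which is the kind of signal a text classification model can build on.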
HANA TensorFlow Landscape
With the SAP HANA TensorFlow integration, there are two distinct scenarios: model development/training and then model deployment. First you develop a model, then train, test and validate it with data where the outcome is known. In our case that environment is a Jupyter notebook. Finally, you would publish the model for TensorFlow Serving and make that model available via SAP HANA. During the development and training phase HANA would primarily be used as a data source.
Once a model has been published for scoring, the Jupyter notebook and Python are no longer used.
Model execution is performed by TensorFlow Serving, which loads up a trained model and waits for input data from SAP HANA.
Often, to productionise a TensorFlow model with TensorFlow Serving you would need to develop a client specifically to interact with that model. With the HANA EML, we have a metadata wrapper that resides in HANA to provide a common way to call multiple TensorFlow Serving models. With the HANA EML, TensorFlow models can now be easily integrated into enterprise applications and business processes.
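To give a feel for the model-specific client code that the EML wrapper saves you from writing, here is a minimal sketch that builds a prediction request for the REST endpoint offered by more recent TensorFlow Serving releases. The host, port, model name `my_model` and input values are assumptions made for illustration. The sketch does not reflect how the HANA EML itself communicates with TensorFlow Serving.

```python
import json
import urllib.request


def build_predict_request(host, port, model_name, instances):
    """Build (but do not send) a POST request for TensorFlow Serving's REST predict endpoint."""
    url = "http://{}:{}/v1/models/{}:predict".format(host, port, model_name)
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical server location, model name and input row.
request = build_predict_request("localhost", 8501, "my_model", [[0.5, -1.0, 2.0]])

# To call a running server you would then do something like:
# with urllib.request.urlopen(request) as response:
#     predictions = json.loads(response.read())["predictions"]
```

Every model needs its own knowledge of input shapes and names on the client side, which is exactly what a shared metadata wrapper avoids.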
Some Implementation Specifics
SAP HANA Python Connectivity
There are at least three options to consider: SAP’s pure Python library, hdbcli and pyodbc.
I went with the pure Python library from the official SAP GitHub, https://github.com/SAP, as it appears to be the simplest when moving platforms because it has the fewest dependencies. It does not yet support SSL or Kerberos, so it may not be suitable for production just yet. hdbcli is part of the HANA Client distribution and is the most comprehensive with the best performance, but requires a binary installation from a .sar file. pyodbc is a generic solution more suited to Windows-only scenarios.
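Whichever driver you choose, all three follow the Python DB-API 2.0 style of connections and cursors, so the code that reads training data can stay the same. The sketch below is written against that common interface and uses the standard library `sqlite3` module as a stand-in so that it runs without a HANA system. The table and column names are invented, and the commented hdbcli call uses placeholder connection details.

```python
import sqlite3


def fetch_rows(connection, query):
    """Run a query on any DB-API 2.0 connection and return all rows."""
    cursor = connection.cursor()
    try:
        cursor.execute(query)
        return cursor.fetchall()
    finally:
        cursor.close()


# With HANA you would open the connection with one of the drivers instead, e.g.:
#   from hdbcli import dbapi
#   connection = dbapi.connect(address="<host>", port=<port>, user="<user>", password="<password>")
# sqlite3 is used here purely as a stand-in.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE reviews (id INTEGER, review_text TEXT, label INTEGER)")
connection.executemany(
    "INSERT INTO reviews VALUES (?, ?, ?)",
    [(1, "great service", 1), (2, "late delivery", 0)],
)

rows = fetch_rows(connection, "SELECT review_text, label FROM reviews ORDER BY id")
```

Keeping the driver choice isolated to the line that opens the connection makes it easier to start with the pure Python library and switch to hdbcli later.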
Python 2.7 vs Python 3.6
Newer isn’t always better! Python 2.7 and Python 3.6 are not necessarily backwards or forwards compatible. You can spend time debugging examples that were written against a different version. Many examples you find don’t specify a version, and when they do it’s easy to overlook this important detail. We used Python 3.6, but found many examples had 2.7 syntax, which does not always work in 3.6. My advice is to always use the same Python version as any tutorials you are following. At the time of writing, the TensorFlow Serving repository binary is for Python 2.7, so you may need to compile it yourself for Python 3.6.
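One way to avoid this kind of debugging is to fail fast at the top of a notebook when the interpreter is older than the version the code was written for. The helper name `require_python` below is hypothetical, and the comments list two well-known differences that often break copied examples.

```python
import sys


def require_python(major, minor=0):
    """Raise early if the interpreter is older than the version the code expects."""
    if sys.version_info[:2] < (major, minor):
        raise RuntimeError(
            "Python {}.{}+ expected, found {}.{}".format(
                major, minor, sys.version_info[0], sys.version_info[1]
            )
        )


require_python(3, 6)

# Two differences that commonly break examples copied between versions:
# 1. Python 2 allows the statement  print "hello"  while Python 3 needs  print("hello").
# 2. Python 2 evaluates 7 / 2 as 3 (integer division); Python 3 evaluates it as 3.5.
true_division = 7 / 2
floor_division = 7 // 2  # 3 on both versions
```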
Jupyter Notebooks & Anaconda
Get familiar with Jupyter, as it is a great interactive development environment. Jupyter runs equally well on your laptop, in Amazon Web Services (AWS) or in Google Cloud Platform (GCP).
I began with Anaconda on my laptop, which provides applications (such as Jupyter), easy package management and environmental isolation. Jupyter notebooks are easy to move between local and AWS/GCP as long as the required Python libraries of the same version are installed on both platforms.
Keras is a Python library that provides higher level access to TensorFlow functions and even allows you to switch between alternative deep learning backends such as Theano, Google TensorFlow and Microsoft Cognitive Toolkit (CNTK). We tried Theano and TensorFlow, and apparently you can even deploy Theano models to TensorFlow Serving.
GPU > CPU
Once you are up and running with an example model you will find that it takes some time to train your models, even with modern CPUs. With deep learning models you train the model over a number of epochs, where an epoch is a complete pass over the data set. Each epoch builds on what the model learned in the previous one, and it’s common to run 16, 32, 64, 100 or even 1000 epochs. An epoch could take 30 seconds to run on a CPU and less than 1 second on a single GPU.
If you use a cloud platform, both GCP and AWS have specific instance types designed for this. On AWS, the G3 (g3.4xlarge), P2 (p2.xlarge) and P3 (p3.2xlarge) instance types are suited to deep learning, as they include GPUs. If using AWS I would recommend the P instance types, as these are the latest. When you reach this stage, to fully utilise the GPUs you may need to compile TensorFlow, or whichever deep learning framework you use, for your specific environment.
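The role of epochs is easiest to see in a tiny training loop. The sketch below is plain Python rather than TensorFlow, and it fits a straight line instead of a deep network. The data, learning rate and epoch count are made up for illustration.

```python
def train(samples, epochs, learning_rate=0.05):
    """Fit y = w * x + b by per-sample gradient descent and record the loss after each epoch."""
    w, b = 0.0, 0.0
    history = []
    for _ in range(epochs):
        # One epoch is one complete pass over the data set.
        for x, y in samples:
            error = (w * x + b) - y
            w -= learning_rate * error * x
            b -= learning_rate * error
        # Mean squared error over the data set after this epoch.
        loss = sum(((w * x + b) - y) ** 2 for x, y in samples) / len(samples)
        history.append(loss)
    return w, b, history


# Toy data generated from y = 2x + 1.
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b, history = train(samples, epochs=200)
```

The weights carry over from one epoch to the next, so the recorded loss shrinks as training progresses. With a deep model each pass involves far more arithmetic, which is where the GPU speed-up comes from.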
Serving the model
Once you are done with building, refining, training and testing your model you will then need to save it for TensorFlow Serving. Saving a model for serving is not necessarily easy! TensorFlow Serving wants a specific saved_model.pb. The saved models need to have a signature that defines inputs and outputs. When you have created your own model you will likely need to build a specific saved_model function. We will share some code snippets in a future blog.
TensorFlow Serving is cross-platform, but we found that Ubuntu seems to be Google’s preferred Linux distribution. The prerequisites are straightforward on Ubuntu, which was not the case with Amazon Linux, which evolved from Red Hat.
There are many blogs and GitHub repositories that provide free walk-through examples and training to get you started. Some of the good ones include the following.
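Before pointing TensorFlow Serving at an export, it can help to check the directory layout, since the server looks for saved_model.pb inside a numbered version directory under the model's base path. The helper below is a hypothetical sanity check, not part of TensorFlow. It only inspects the folder structure and does not validate the model signature. The temporary directory stands in for a real export.

```python
import os
import tempfile


def find_servable_versions(model_base_path):
    """Return the numeric version directories that contain a saved_model.pb file."""
    versions = []
    for name in os.listdir(model_base_path):
        version_dir = os.path.join(model_base_path, name)
        has_graph = os.path.isfile(os.path.join(version_dir, "saved_model.pb"))
        if name.isdigit() and has_graph:
            versions.append(int(name))
    return sorted(versions)


# Build a throwaway directory tree that mimics an exported model.
base = tempfile.mkdtemp()
os.makedirs(os.path.join(base, "1", "variables"))
open(os.path.join(base, "1", "saved_model.pb"), "wb").close()  # empty placeholder file
os.makedirs(os.path.join(base, "draft"))  # ignored: not numeric and no saved_model.pb

versions = find_servable_versions(base)
```

If the list comes back empty, the export most likely went to the wrong folder or was saved without a version subdirectory.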
HANA Academy External Machine Learning (EML) Library on YouTube
HANA Academy Code Snippets GitHub
Open SAP – Enterprise Deep Learning
SAP Help – SAP HANA External Machine Learning Library
Thanks for writing this out. If I've interpreted this correctly, what you describe here is a system set-up for developing, training and presumably retraining a model "in-database"? In other words you can create and consume your models without an ETL cycle where you first have to move your data to something like a Hadoop cluster?
Assuming that's the case, it appears that you can accomplish this by using one of the tools you mention in the HANA-Python connectivity section, but is there also a way to do the same in a full-Cloud set-up? In that scenario you'd either have a HANA database running in SCP or an on-premise HANA server connected to SCP through the Cloud Connector.
Is it possible, or perhaps on the road map, to bring the end-to-end management of your TensorFlow model into the Leonardo framework? As I understand it Leonardo so far only offers consumption of existing TF models through the EML? (With Python/R models planned for 2018.)
Finally, do you have any metrics you can share when it comes to HANA's performance with either classic machine learning or more complex deep learning models? I've discussed it with a few data scientists who are used to having HDFS clusters to work with and who tell me even an in-memory DB wouldn't be able to handle the IO load of machine learning dev/test.
Thank you for the feedback, it’s always great to receive.
The training and development of the models is not in-database as such. Yes, it can be done without an ETL cycle. During the training process queries would be executed and models would be trained via Python.
I haven’t tested with SCP yet (I am planning to). For my setup, I used AWS. In theory, the same should be possible with SCP, with an SCP Virtual Machine serving your TF models.
The approach I have used here would be an alternative to Leonardo, and both could provide similar outcomes. I believe “Bring your own TF model” is on the current Leonardo roadmap. Leonardo is not using the HANA EML. Instead it provides a framework that includes a REST API to TF models.
With regard to performance, my HANA system is small in size and was under no load when using TF. With the EML, the models are executed externally (External Machine Learning Library 🙂 ). The datasets I used are definitely not large, and GPUs are the usual answer to this type of problem, but SAP HANA didn’t even flinch. It would be interesting to understand the scenarios you are working with.
My apologies for the late reply, evidently email notifications weren't active on this account.
Our stance on Leonardo has indeed also evolved into looking at it more as an enablement framework giving you a bag of tools to plug into your (SCP) environment to innovate with.
As for performance, I too would be interested in understanding the scenarios and requirements that my data science colleagues have in mind. I suspect that HANA performance would be perhaps not great but adequate for most basic ML models and that reluctance to move models away from Hadoop clusters is more about people wanting to use what they know. 🙂
Do you know why TensorFlow was used for SAP Leonardo?
Why not one of the other frameworks like deeplearning4j, PyTorch, Caffe, etc.?
I am not as close to SAP Leonardo, so can only give my opinion.
For me TensorFlow is one of the most popular multi-purpose deep learning libraries and therefore a good choice. It is relatively mature and SAP has an existing partnership with Google.
Thanks for your quick reply 🙂
I am a little confused about SAP and machine learning. SAP Leonardo does not use the EML, so when is the EML used and what is the SAP Leonardo Machine Learning Foundation for? I thought the SAP Leonardo Machine Learning Foundation works with the EML.
The EML is within SAP HANA and connects to TensorFlow Serving.
SAP Leonardo includes Machine Learning that is based upon Google TensorFlow. The SAP Leonardo Machine Learning Foundation (MLF) exposes models as web services with a REST API.
They both could be used to solve similar problems, but for different reasons. The EML is there to bring that insight into SAP HANA. The Leonardo MLF would more likely be integrated into an application or even SAP Data Hub.
Thanks for sharing. A while back, I was able to integrate HANA with R using RODBC and RJDBC, so using your proposed architecture, implementing deep learning models in R with the keras library (for TensorFlow) should also be possible. It should also be possible to describe the features used for predictions by using an interpretable machine learning approach with lime.