Exploring Data Science with Python and SAP Cloud Platform
A step-by-step process for developing data science Python scripts using the SAP HANA database on SAP Cloud Platform.
Overview
SAP Cloud Platform is an open platform-as-a-service (PaaS) that delivers in-memory capabilities, core platform services, and unique micro services for building and extending intelligent, mobile-enabled cloud applications.
Data Science is the process of deriving knowledge and insights from a huge and diverse set of data through organizing, processing and analyzing the data.
Python is a dynamic, interpreted (byte-code-compiled) language. There are no type declarations of variables, parameters, functions, or methods in source code. This makes the code short and flexible, but you lose the compile-time type checking of the source code.
DISCLAIMER: Please note that the resources and the data used are for demonstration purposes only.
We will develop a simple Python script that presents data graphically, using data science packages such as pandas, matplotlib and pyhdb, after opening a database tunnel to SAP HANA Cloud Platform.
- PYHDB is a pure Python client package for the SAP HANA Database based on the SAP HANA Database SQL Command Network Protocol.
- MATPLOTLIB is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms.
- PANDAS is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
Prerequisites:
- You should have an SAP HCP developer (trial) account:
- Register yourself at https://account.hanatrial.ondemand.com
- Create a HANA MDC (<TRIAL>) database system. Configure the user for SHINE as well, so that you can experiment with the data later.
- If you run into any issues, please refer to https://blogs.sap.com/2017/05/31/steps-to-create-database-tables-in-sap-hana-cloud-platform-formerly-hcp
- You should have a Python IDE installed on your workstation, along with the packages mentioned above.
- The latest Python version can be downloaded from https://www.python.org/getit/
- Once it is installed, open a command prompt and enter:
python --version
- To install a package, open a command prompt and enter:
pip install pyhdb
- Use pip to install the following packages on your machine:
- PYHDB
- PANDAS
- MATPLOTLIB
- You should have the SDK for SAP Cloud Platform Tools:
- Download the SDK from https://tools.hana.ondemand.com/
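One way to confirm that the three Python packages from the prerequisites are importable before moving on is a short check like the one below. It is only a sketch: the helper name `missing_packages` is made up for this illustration, and it only tests that each package can be found, not that its version is suitable.

```python
import importlib.util

# Import names of the three packages listed in the prerequisites.
REQUIRED = ("pyhdb", "pandas", "matplotlib")


def missing_packages(names=REQUIRED):
    """Return the names that cannot be found by the current interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

Any name reported as missing can be installed with pip in the same way as pyhdb above.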
Let's start the development now:
- Open a database tunnel to the SAP HANA Cloud database
- Open a command prompt and change the current directory to the folder that contains the neo.sh file of the downloaded SDK. Replace username with your workstation user name.
cd C:\Users\username\Desktop\PY\SDK\tools
- Now enter the command below to open a database tunnel to the cloud. Replace username, databasename and password with your HANA trial account user name, database name and password.
neo open-db-tunnel -h hanatrial.ondemand.com -a usernametrial -u username -i databasename -p password
- Congratulations, you have successfully opened a database tunnel.
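Before running any Python code against the database, it can help to confirm that the tunnel is actually listening. The following is a minimal sketch, assuming the tunnel is exposed on localhost port 30015 (the host and port used by the script later in this post). `tunnel_is_open` is a hypothetical helper name, and if your tunnel uses a different port you would pass that instead.

```python
import socket


def tunnel_is_open(host="localhost", port=30015, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises an OSError subclass if nothing is listening.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if tunnel_is_open():
        print("Tunnel port is reachable.")
    else:
        print("Nothing is listening on the tunnel port; is the tunnel open?")
```

This only shows that something accepts connections on the port; it does not verify the database credentials.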
Let's upload sample data to HANA cloud using SAP HANA studio:
- I have downloaded historical NIFTY-50 data for the past year from the location below: https://www.nseindia.com/products/content/equities/indices/historical_index_data.htm
- Open HANA studio and import the data into a schema using the downloaded CSV file.
- Below is the table definition we created in schema SAP_HANA_DEMO:
It's time for Python development
- Open your Python IDE and create a new file.
- Below is the code for connecting to the database and analyzing the fetched data. Replace username and password with the database user name and password for your MDC database instance.
import pyhdb
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
connection = pyhdb.connect('localhost', 30015, 'username', 'password')
cursor = connection.cursor()
cursor.execute("SELECT top 20 DATE, HIGH FROM SAP_HANA_DEMO.NIFTY_50_DATA")
a = cursor.fetchall()
data = pd.DataFrame(a)
matplotlib.rcParams['axes.unicode_minus'] = False
fig, ax = plt.subplots()
ax.plot(data[1], data[0], 'o')
ax.set_title('NIFTY-50')
plt.show()
- The connection is established using the connect function from the pyhdb package, passing the host, port and database credentials.
- We fetch the top 20 records from table NIFTY_50_DATA and convert them into a dataframe using the DataFrame constructor from the pandas package.
- Finally, a scatter plot is displayed using the matplotlib package.
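A DataFrame built directly from `cursor.fetchall()` gets default integer column labels (0 and 1), and the column types depend on how the table was defined during import. The sketch below shows one way to name the columns and convert them explicitly before plotting. It uses a few made-up rows in place of a real query result and assumes the dates arrive as ISO-formatted text; `rows_to_frame` is a hypothetical helper, not part of pyhdb or pandas.

```python
import pandas as pd


def rows_to_frame(rows):
    """Build a DataFrame with named, typed columns from (date, high) tuples."""
    frame = pd.DataFrame(rows, columns=["DATE", "HIGH"])
    # Parse text dates into datetime64 values.
    frame["DATE"] = pd.to_datetime(frame["DATE"])
    # Convert text or Decimal prices into plain floats.
    frame["HIGH"] = frame["HIGH"].astype(float)
    # Return the rows in chronological order with a fresh index.
    return frame.sort_values("DATE").reset_index(drop=True)


# Made-up sample rows standing in for cursor.fetchall(); not real index values.
sample_rows = [
    ("2017-05-03", "9377.10"),
    ("2017-05-02", "9352.55"),
    ("2017-05-04", "9365.65"),
]
frame = rows_to_frame(sample_rows)
print(frame.dtypes)
```

If the dates in your table use another text format, `pd.to_datetime` accepts a `format` argument describing it.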
Let's test the developed script
- Run the Python script by pressing F5.
- The scatter plot below is generated, showing how the day's highest price varies with the date.
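For a plot that reads as price over time, one option is to put the date on the x-axis and label both axes. This is a sketch under assumptions rather than the script used for the plot in this post: the sample values are made up, the axis labels are my own wording, and `plot_daily_high` is a hypothetical helper.

```python
import pandas as pd
import matplotlib.pyplot as plt


def plot_daily_high(frame):
    """Scatter-plot HIGH against DATE and return the figure and axes."""
    fig, ax = plt.subplots()
    ax.plot(frame["DATE"], frame["HIGH"], "o")
    ax.set_title("NIFTY-50")
    ax.set_xlabel("Date")
    ax.set_ylabel("Day's high")
    fig.autofmt_xdate()  # tilt the date labels so they do not overlap
    return fig, ax


# Made-up sample values for illustration only; not real index data.
frame = pd.DataFrame({
    "DATE": pd.to_datetime(["2017-05-02", "2017-05-03", "2017-05-04"]),
    "HIGH": [9352.55, 9377.10, 9365.65],
})
fig, ax = plot_daily_high(frame)
# Call plt.show() here to display the window when running interactively.
```

Returning the figure and axes, instead of calling `plt.show()` inside the helper, leaves the caller free to display the plot or save it with `fig.savefig`.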
Congratulations, you have successfully visualized data in Python using SAP HANA Cloud Platform. Please note that we can develop more complex scripts for analyzing the data, based on the requirements.
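As one small illustration of further analysis, the sketch below adds a moving average of the daily high using pandas. The values are made up, the window of 3 is an arbitrary choice, and `add_moving_average` and the `HIGH_MA` column name are hypothetical.

```python
import pandas as pd


def add_moving_average(frame, window=3):
    """Return a copy of frame with HIGH_MA, the rolling mean of HIGH."""
    result = frame.copy()
    # The first (window - 1) rows have no full window, so they are NaN.
    result["HIGH_MA"] = result["HIGH"].rolling(window=window).mean()
    return result


# Made-up values for illustration only; not real index data.
frame = pd.DataFrame({"HIGH": [100.0, 102.0, 104.0, 103.0, 105.0]})
smoothed = add_moving_average(frame, window=3)
print(smoothed)
```

The helper works on a copy, so the original DataFrame fetched from the database stays unchanged.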
Hi,
I'm trying to do exactly the same thing. When I had the trial version, I was able to connect to the DB from pyhdb without explicitly opening a tunnel from the CLI; I just kept the DB open in the Eclipse application, and there were no issues with that.
But now I have an MDC version and am not able to open the tunnel.
Do you know of any way to connect to a non-trial version of HANA DB? When I try to open the DB tunnel, it throws an error saying "Database or schema '___________' not found."
Thank you.
Hi David,
Ideally, the connection string for connecting to a non-trial HANA MDC should be the same as for the trial.
I tried simulating the error you are facing. It seems you are passing either a blank or an incorrect schema name in the open-db-tunnel command.
neo open-db-tunnel -h hanatrial.ondemand.com -a usernametrial -u username -i schemaname/databasename -p password
Please replace the schemaname/databasename with your schemaID/DB.
However, if you are still facing issues, you might need to look into setting a proxy.
Hi,
Your tutorial just worked out of the box. All the way to the python script. I just had a string to float error on the date field. But I guess I just need to format before forwarding to matplotlib.
Kind Regards
Michael P.
Hi,
To resolve this issue, one solution is to change the data type of the Date field to NVARCHAR while importing the CSV file in SAP HANA Studio.