Performing Data Science jobs with SAP Data Intelligence
As an SAP BTP customer, you are likely aware that SAP Data Intelligence provides data management capabilities such as data integration and a data catalog. But have you seen how SAP Data Intelligence empowers a data scientist to perform end-to-end data science jobs? In this blog, we want to show you how you can enrich and transform data, and get data insights, before you move on to your machine learning jobs.
We will use bike sales data as an example. To begin with, you should have a high-level picture of what your workflow looks like. It can look something like the flow below.
This blog introduces tutorials available on SAP HANA Academy that include loading data from various supported SAP and non-SAP data sources.
Validate and Clean the Data:
Before we can use the dataset, we must ensure the data is clean and usable. You might want to ask, for example,
- Does the data contain missing values? If so, how do I rectify them?
- Are the data-types consistent for a given column? Do I have the information to fix them?
- Do I need to normalize the values so they can be used for prediction?
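As a sketch of how these checks might look in pandas (using a small hypothetical sample rather than the actual bike sales file, and a median fill as just one possible remedy):

```python
import numpy as np
import pandas as pd

# Hypothetical sample resembling the bike sales data
df = pd.DataFrame({
    'Units Sold': [120, np.nan, 300],
    'No Customer Meetings': ['5', 'Seven', '12'],
})

# Does the data contain missing values?
print(df.isna().sum())

# Are the data types consistent for a given column?
print(df.dtypes)

# One possible fix: fill missing numeric values with the column median
df['Units Sold'] = df['Units Sold'].fillna(df['Units Sold'].median())
```

Running checks like these first tells you which of the questions above actually apply to your dataset.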
To proceed, you can create a data pipeline using the operators provided in SAP Data Intelligence. The video tutorials at SAP HANA Academy mentioned above walk you through this.
If you’d like a stronger sense of control, in other words more flexibility for customization, SAP Data Intelligence offers an integrated JupyterLab environment where you can write your own code. It is available under the ML Scenario Manager tile of SAP Data Intelligence.
First, in your newly created notebook, install the required Python dependencies. You can do this by running the pip command, then import the libraries into your notebook. Next, we create a pandas DataFrame and load the data from SAP Data Lake into it. The code snippet is shown below:
```python
# Import the required libraries
import hdfs
import pandas as pd
import matplotlib.pyplot as plt

# Connect to our DATA_LAKE data source
client = hdfs.InsecureClient('http://datalake:50070')

# Read data from our CSV file
with client.read('/shared/Bike_SalesAnalysis.csv') as reader:
    df = pd.read_csv(reader)
```
In our bike sales dataset, the ‘No Customer Meetings’ field, which contains the number of customer meetings, should be numeric. If it contains string or null values, they need to be corrected. In our dataset, this field has some numbers entered as strings. Below is an example of how you can fix an incorrect entry using Python:
```python
df['No Customer Meetings'] = df['No Customer Meetings'].replace('Seven', '7')
```
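After replacing the spelled-out value, the column is still of string type. A follow-up step, sketched here on a small hypothetical sample of the column, could convert it to a numeric dtype:

```python
import pandas as pd

# Hypothetical sample of the 'No Customer Meetings' column
df = pd.DataFrame({'No Customer Meetings': ['5', 'Seven', '12']})

# Replace the spelled-out number, then convert the column to integers
df['No Customer Meetings'] = df['No Customer Meetings'].replace('Seven', '7')
df['No Customer Meetings'] = pd.to_numeric(df['No Customer Meetings'])
```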
You could also perform other pre-processing tasks, such as re-arranging the date format, normalizing values, or filtering the data, by adding your own Python code in this notebook.
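For example, a minimal sketch of two such tasks, assuming dates come in a 'DD-Mon-YYYY' format (the sample values here are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['05-Jan-2022', '17-Feb-2022'],
    'Units Sold': [100, 400],
})

# Re-arrange the date format to ISO (YYYY-MM-DD)
df['Date'] = pd.to_datetime(df['Date'], format='%d-%b-%Y').dt.strftime('%Y-%m-%d')

# Min-max normalize 'Units Sold' into the [0, 1] range
col = df['Units Sold']
df['Units Sold (norm)'] = (col - col.min()) / (col.max() - col.min())
```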
Analyze the Data:
Next, let’s look at a statistical summary of our dataset. Statistical information presents the data in an organized form, and as a data scientist you can derive contextual information from it. A sample code snippet is shown below.
```python
# Perform statistical data analysis
df.describe()
```
The statistical information will help you understand the distribution of the data across the dataset. It can also help you decide whether you need to normalize your data and/or find correlations between columns.
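For a quick numeric check of such correlations, pandas provides `DataFrame.corr()`. A sketch with hypothetical values for two of the columns:

```python
import pandas as pd

# Hypothetical values for two numeric columns
df = pd.DataFrame({
    'No Customer Meetings': [5, 10, 20, 30],
    'Units Sold': [50, 120, 260, 380],
})

# Pairwise Pearson correlation between numeric columns
corr = df.corr()
print(corr.loc['No Customer Meetings', 'Units Sold'])
```

A value close to 1 or -1 suggests a strong linear relationship worth visualizing, as we do later with a scatter plot.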
Similarly, you can view the top few rows of data, as shown below:
```python
# Get the top 5 records
df.head()
```
Ultimately, you perform data science jobs because you want to answer business questions.
For example, some of the questions for our dataset could be:
- What does the market segment distribution look like?
- How do the sales of a specific segment perform over time?
- Is there a correlation between the number of customer meetings and the number of units sold?
Answering these business questions will get you closer to your final data insight needs, for example, predicting next quarter’s sales.
- What does the market segment distribution look like?
```python
# Visualize by 'Customer Segment'
df_trend_by_customer_segment = (
    df[['Customer Segment', 'Units Sold']]
    .groupby('Customer Segment')['Units Sold']
    .sum()
    .sort_values(ascending=False)
)
plt.bar(x=df_trend_by_customer_segment.index,
        height=df_trend_by_customer_segment.values,
        tick_label=df_trend_by_customer_segment.index)
plt.xticks(rotation=75)
plt.title('Market share by Customer segment')
plt.show()
```
The above chart tells us that the wholesale segment has the highest volume of sales.
- Now I want to drill down into wholesale. So, how do the sales of the wholesale segment perform over time?
```python
# Sales trend for the 'Wholesale' segment by month in 2022
df_wholesale = df.loc[df['Customer Segment'] == 'Wholesale']
df_wholesale = df_wholesale[['Customer Segment', 'Units Sold', 'Date', 'Order Value', 'Month']]

month_index = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Map the month name inside the 'Date' string (assuming a 'DD-Mon-YYYY'
# format such as '15-Jan-2022') to a month number
for i in df_wholesale.index:
    for j in range(len(month_index)):
        if df_wholesale['Date'][i].split('-')[1] == month_index[j]:
            df_wholesale.loc[i, 'Month'] = str(j)
            break

df_wholesale['Month'] = df_wholesale['Month'].astype(int)
monthly_order_value = df_wholesale.groupby('Month')['Order Value'].sum()

x_values = []
y_values = []
for i in range(12):
    x_values.append(i)
    y_values.append(monthly_order_value[i])

plt.plot(x_values, y_values, 'bo-')
plt.xticks(x_values, month_index, rotation=75)
plt.title('Sales trend by Customer segment ("Wholesale") in 2022')
plt.show()
```
- Is there a correlation between the number of customer meetings and units sold?
```python
# Visualize correlation between 'No Customer Meetings' and 'Units Sold'
x_value = df['No Customer Meetings']
y_value = df['Units Sold']
plt.scatter(x_value, y_value, c='blue', s=1)
plt.title('Correlation Plot')
plt.xlabel('Number of Customer Meetings')
plt.ylabel('Units Sold')
plt.show()
```
Interestingly, when the number of customer meetings is roughly in the range of 5 to 30, units sold range between 1 and 400. As the number of meetings increases, more units are sold, apart from a few outliers. The lower density of data points at high numbers of customer meetings and units sold also shows that fewer customers are engaged at that level.
Now, will such patterns persist in the next quarter? It is time to move on and answer this question with machine-learning-based approaches. You can use SAP Data Intelligence to call SAP’s HANA ML library to get the job done. We will cover this in a separate blog.
The idea of this blog is not to deep dive into Python, but rather show you what’s possible with SAP Data Intelligence.
SAP Data Intelligence provides a strongly integrated environment that allows you to work on end-to-end scenarios in one place. Without it, you would likely have to switch among various standalone tools to do your job, and the constant context switches, along with the need to connect different systems and combine their outputs to answer business questions, can be a big pain.
Are you starting your journey with machine learning, or do you have a use case that you think machine learning can help with? Then here is a blog that will help you understand what makes a good machine learning problem. Need help getting started? Feel free to reach out to me or Jay Xiong, and we will be very happy to help.