Skip to Content
Technical Articles
Author's profile photo Nico Wong

Detecting Contextual Anomalies with SAP HANA ML

Today we are going to go over an example of Anomaly Detection using SAP HANA Python Client API for ML. Recently a client from a Multinational Automotive Manufacturing Corporation who has a HANA Enterprise License, wanted to explore the possibility of flagging anomalies within their audit data to increase the efficiency of the audit team. Most of the process is currently manual and time-consuming. Imagine going through billion over records of financial data trying to make sure the numbers check out. There has to be an easier way. Although there were a few articles on anomaly detection, none of them fit our specific use case. Hence, Andreas and I collaborated to produce this, so hopefully it might help some of you out there!

This blog was written by Nico Wong (Analyst, NIMBL) and Andreas Forster (Principal Data Scientist, SAP). Therefore, the term “we” in this blog post refers to ourselves, Nico and Andreas. There is no guarantee or support for any information or code in this posting. Please test any content that you may want to use yourself.

 

Introduction

What is an Anomaly?

The goal here is to detect outlier data points, which do not follow the collective common pattern of the majority of data points, hence can be easily separated from the group.

Some of the possible use cases here are: 

IT DevOps: Intrusion Detection system, monitoring network traffic

Banking: Fraudulent transactions, stock market analysis

Healthcare: Condition monitoring, tumor detection

 

Architecture

For this example, we will be using SAP HANA Python Client API for ML and calling the Automated Predictive Library (APL) with our preferred Python environment (I typically use Jupyter notebooks). If you are not familiar, this blog post has a great step-by-step tutorial on how to get it set up.

Please note that the Python API is only available from HANA 2.0 SPS 04 onwards.

https://blogs.sap.com/2020/07/27/hands-on-tutorial-automated-predictive-apl-in-sap-hana-cloud/

Time Series Anomaly Detection

For this POC, we will be conducting an anomaly detection within a time series data. What this means is we are trying to find data points that are not following the common collective trend or seasonal or cyclic pattern of the entire data. Example use cases here could be monitoring spikes in web traffic or spikes in global temperature. 

Are dates and the outcome variable all that we need? That rarely is the case. Most of the time, we will need to supplement our data with more context. For example, if we’re trying to detect temperature anomalies for all states around the US. We can’t just collect all temperature data and feed it to the model, because different states have varying temperatures on any given day. New York experiences drastic temperature changes over the year which is common, while Hawaii will be warm and sunny all year round. So our goal here is to find anomalies for the respective states. If we don’t include context (US states) into our model here, it is likely that the model might flag New York temperatures as anomalies when in fact the low temperatures are common during Nov-Mar. So depending on your use case, we should think about what other factors could affect our outcome to ensure that it is as accurate as possible. This is known as Contextual Anomalies.

 

The figure below is an example of how the contextual anomaly detection model should function for New York temperature data. 

 

Hypothetical Use Case: Flagging Abnormal Car Registrations for Countries

Now let’s jump right into our demo. 

The World Vehicle Association (WVA) needs to monitor and keep track of car registrations across the world. Car registrations vary across each country and time of the month but their patterns should be relatively consistent. If there are any sudden surges or drops in car registrations, the WVA needs to investigate and find out the causes and plan for next steps.

If there’s a surge, the WVA needs to make sure the country can accommodate the sudden increase in vehicles. Do they need to build more freeways, gas stations?

If there’s a drop, the WVA will need to investigate why? Is this just a one-off situation or are there consecutive drops for the next few months? Are people migrating out of the country, or are they using more public transportation?

This is what we are trying to figure out. We have a dataset of car registrations from different countries. It consists of 3 columns: Date, Country, Number of Car Registrations. Let’s get to it. 

 

  1. First, let’s establish a connection to our SAP HANA system. 
import hana_ml
print(hana_ml.__version__)
hana_address = '' 
hana_port = 443 # Adjust if needed / as advised
hana_user = '' 
hana_password = '' 
hana_encrypt = 'true' # Adjust if needed / as advised
hana_sslcertificate = 'false'
# Instantiate connection object
import hana_ml.dataframe as dataframe
conn = dataframe.ConnectionContext(address = hana_address,
                                   port = hana_port, 
                                   user = hana_user, 
                                   password = hana_password, 
                                   encrypt = hana_encrypt,
                                   sslValidateCertificate = hana_sslcertificate
                                  )

# Send basic SELECT statement and display the result
sql = 'SELECT 12345 FROM DUMMY'
df_remote = conn.sql(sql)
print(df_remote.collect())

Next, we load our data into our python environment and upload it as a table called “VEHICLEREGISTRATIONS_ORG” to our HANA database. 

import pandas as pd
df_data = pd.read_csv("VEHICLEREGISTRATIONS.txt", sep=";")
df_data.columns = map(str.upper, df_data.columns) 
df_data.MONTH = pd.to_datetime(df_data.MONTH)
df_data.head(5)

df_remote = dataframe.create_dataframe_from_pandas(connection_context = conn, 
                                                   pandas_df = df_data, 
                                                   table_name = 'VEHICLEREGISTRATIONS_ORG',
                                                   force = True,
                                                   replace = False)

Here df_remote is calling our HANA table that we just created.

df_remote = conn.table("VEHICLEREGISTRATIONS_ORG")

df_segments shows us a list of all the unique countries we have in our dataset.

df_segments = df_remote.select("COUNTRY").distinct().sort("COUNTRY", desc = False).collect()
df_segments

 

The following code is where the magic happens. The year we are trying to analyze is 2017 – this will be our audit year. The for loop basically takes the raw data up until our audit year (2017) for each respective country and runs it through the automated time series algorithm. This returns a prediction for the expected car registrations in 2017. We compare this against the actuals in 2017 and return those that fall out of the prediction interval. This threshold can be modified as you wish depending on your use case.

audit_year = 2017

import pandas as pd
from hana_ml.algorithms.apl.time_series import AutoTimeSeries
pd.set_option('expand_frame_repr', False) # Print all pandas column without line break

#for index, row in df_segments[:3].iterrows(): 
for index, row in df_segments.iterrows(): 
   
    # Get raw data of each country
    segment = row['COUNTRY']
    print("Now starting with: " + segment)
    df_remote_segment = conn.sql("SELECT COUNTRY, MONTH, REGISTRATIONS FROM VEHICLEREGISTRATIONS_ORG WHERE COUNTRY = '" + segment + "'")
    
    # Sort data ascending, a requirement of the Automated Predictive Library
    df_remote_segment = df_remote_segment.sort("MONTH", desc = False)
    
    # Training data and hold out 
    df_remote_train = df_remote_segment.filter("YEAR(MONTH) < " + str(audit_year))
    df_remote_holdout = df_remote_segment.filter("YEAR(MONTH) = " + str(audit_year))
    #print(df_remote_holdout.collect())

    # Predict known past
    tsapl = AutoTimeSeries(time_column_name = 'MONTH', target = 'REGISTRATIONS', horizon = 12)
    tsapl.fit(data = df_remote_train.drop("COUNTRY"))
    df_remote_aplforecast = tsapl.forecast().filter("YEAR(MONTH) = " + str(audit_year))
    #print(df_remote_aplforecast.collect())
    
    # Join hold out with predictions
    #df_remote_joined = df_remote_aplforecast.alias('L').join(df_remote_holdout.alias('R'), 'L.MONTH = R.MONTH')
    df_remote_joined = df_remote_aplforecast.join(df_remote_holdout.rename_columns({'MONTH': 'RMONTH'}), 'MONTH = RMONTH')
    df_remote_joined = df_remote_joined.drop(['ACTUAL', 'RMONTH'])
    df_remote_joined = df_remote_joined.to_head('COUNTRY')
    
    # Filter on true values outside the predition interval
    df_remote_joined = df_remote_joined.filter('REGISTRATIONS < LOWER_INT_95PCT OR REGISTRATIONS > UPPER_INT_95PCT')
    print(df_remote_joined.collect())

The output is as followed:

E.g. Two anomalies were detected for Austria where the actual registrations fall outside the lower/upper prediction intervals.

Predicted November Range= (22230, 28200) || November Actuals = 29547

We can save this dataframe as a new table ( back into HANA which can be connected to SAC for user consumption.

#saving results into HANA
df_remote_joined.save(where=('schema name', 'Car Registration Anomalies'), 
                       table_type = 'COLUMN', 
                       force = True)

And there you have it! We utilized our time series prediction model and tweaked it into an anomaly detection model. 

What’s Next?

After getting our results, we can analyze these anomalies to plan next steps. Why did these anomalies occur? What can we do about it? 

Say we are analyzing customer demand, and we detected a few anomalies that happened over the month. The anomalies show that less number of people shopped in the 4th week for the past 3 months. This has never happened in the past 5 years. Analytic teams can use this information to dig deeper. What specific product is most impacted? What’s the catalyst? Turns out a competitor has been getting new products a week earlier compared to us. So the business now can plan ahead for inventory, supply chain, etc. This saves teams a lot of time going into the data and trying to analyze trends on their own for 1000+ over products. 

Unfortunately, this model only works for time series use cases. There are going to be use cases where we will want to detect anomalies in non-time series data. For example, flagging products that are placed in the wrong product categories. Stay tuned for the next posting!

 

Assigned tags

      Be the first to leave a comment
      You must be Logged on to comment or reply to a post.