Machine learning behind the scenes of SAP RealSpend – An expense anomaly detection algorithm explained
Introduction to SAP RealSpend & Anomaly Detection
SAP RealSpend is an easy-to-use cloud application developed by our team at the SAP Innovation Center in Potsdam, Germany. It enables managers to track their actual, committed, approved, and requested expenses in real time. Based on the SAP Cloud Platform, SAP RealSpend connects directly to SAP S/4HANA without replicating financial data.
On top of this, we added machine learning functionality which automatically analyzes the data to highlight unusual postings like wrong or fraudulent bookings. We made reporting these anomalies as easy as clicking a button and sending an e-mail. Correctly attributing these expenses gives you better information for important decisions in the future. In this article, I will explain what we learned while creating an algorithm that detects anomalies from day one and leverages user feedback to tailor results to your organization.
SAP RealSpend with anomaly detection
If you want to learn more about how anomaly detection can help managers, I recommend you check out this helpful article by our product manager, Mathias Poehling.
Machine Learning Basics
How can you verify that an expense, for example a hotel bill, is legitimate and was entered correctly? For humans, that is pretty easy: check whether the amount is reasonable, whether the employee was actually on a business trip during the time of the stay, or simply ask the employee directly. But you can't tell this to a machine. It does not know what "reasonable" means, and it can't talk to the employee.
Machine learning may not give us sentient robots yet, but it helps us create algorithms that can answer these hard questions. The answers will not always be perfect, but with optimization such an algorithm is not only much faster than a human but can also give better results.
There is a wide spectrum of what may be called a machine learning algorithm. The easiest version is basically a regression model, which fits a function to past values and predicts future values based on this function. At the other end of the spectrum, there are neural networks (emulating how a brain would learn) and ensemble algorithms (combining several mathematical models into one).
We have to make a distinction between two types of algorithms: supervised learning trains on input data that already contains the correct value (the label) for the attribute we want to predict, while unsupervised learning finds structure in unlabeled data, for example by grouping similar records. The quality of either kind of machine learning algorithm is strongly connected to the quality of the input data. Good input data needs to be unbiased and should be well structured. But real-world data is rarely perfect and usually needs to be adjusted and optimized.
These adjustments and optimizations require the data scientist to have a good understanding of the algorithms and some domain knowledge to be able to evaluate the dataset and the algorithm. Despite this, it is possible for developers without prior knowledge to get started with machine learning as there are many tools and in-depth documentation available.
In our case, we used the Anaconda distribution for the initial data exploration and prototyping. Anaconda includes a Python runtime and many useful tools like:
- scikit-learn: a machine learning library featuring a big selection of common algorithms
- Jupyter Notebook: a web-based development environment combining code, documentation, and a REPL
- pandas: a data analysis and manipulation library
- NumPy: a high-performance math library
Development Environment using Jupyter Notebook
For the final productive implementation we leveraged the Predictive Analytics Library (PAL) in SAP S/4HANA. PAL offers many of the same algorithms but works directly on the HANA database. If you are interested in learning more, check out the blog by my colleague Frank Essenberger on how to get tailor-made machine learning in SAP S/4HANA.
The one requirement for every machine learning algorithm is access to data. The bigger, the better! Acquiring good data can be a challenge, especially in a sensitive area like finance. Many datasets are anonymized or scrambled, which has a negative impact on model performance and should be avoided if possible.
Once we got access to a dataset, we started by exploring the structure and content of the data using pandas. pandas can take data from almost any type of storage and put it into an in-memory data frame for fast and easy access. It also has a very good integration with numpy and matplotlib which allows for quick manipulation and graphing. Looking at the universal journal, also known as the ACDOCA table of SAP S/4HANA, we had to deal with hundreds of columns with shorthand names and partly unsanitized data. The table documentation was one of our most used documents as we developed our algorithm.
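A first exploration pass like the one described above can be sketched with a few pandas calls. The miniature data frame below is a hypothetical stand-in for an ACDOCA extract; the real table has hundreds of columns (RBUKRS is the company code, HSL an amount column, BUDAT the posting date):

```python
import pandas as pd

# Hypothetical miniature extract; the real ACDOCA table has hundreds of columns.
df = pd.DataFrame({
    "RBUKRS": ["1000", "1000", "2000"],                               # company code
    "HSL": [120.50, 89.99, 15000.00],                                 # amount
    "BUDAT": pd.to_datetime(["2018-05-01", "2018-05-03", "2018-05-07"]),  # posting date
})

print(df.dtypes)                    # how each column was parsed
print(df.describe())                # summary statistics for numeric columns
print(df["RBUKRS"].value_counts())  # distribution of a categorical column
```

`describe()` and `value_counts()` are usually enough to spot shorthand-named columns that are secretly categorical, constant, or unsanitized.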
Preprocessing, Normalization, and Feature Selection
In our case, the columns contain four distinct kinds of data. Most columns were parsed as strings, even though they did not necessarily hold free text. Many of them contained only a predefined set of options, which is referred to as categorical data. In addition, there is boolean data as well as numeric data (e.g. spent amounts, but also timestamps).
Most machine learning algorithms are based on mathematical models and expect a two-dimensional array of numeric data as input. Depending on the complexity of a given algorithm, the runtime typically scales well with the number of samples but much worse with a large number of features (columns).
To make a given column understandable to the algorithms, different preprocessing strategies are available. For categorical data, there are encoding strategies like one-hot encoding (create a boolean column for each option) or label encoding (create one numeric column and assign each option a number). Text data can be transformed using a count vectorizer (create a column per word and put the number of occurrences as value) or the more advanced term frequency–inverse document frequency (tf-idf).
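The two categorical encodings can be illustrated in a few lines of pandas. The `cost_type` column and its values are made up for the example:

```python
import pandas as pd

df = pd.DataFrame({"cost_type": ["travel", "admin", "travel", "it"]})

# One-hot encoding: one boolean column per option
one_hot = pd.get_dummies(df["cost_type"], prefix="cost_type")

# Label encoding: a single numeric column, each option mapped to an integer
labels = df["cost_type"].astype("category").cat.codes
```

scikit-learn ships equivalent transformers (`OneHotEncoder`, `LabelEncoder`) as well as `CountVectorizer` and `TfidfVectorizer` for the text strategies mentioned above.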
Example preprocessing pipeline
Once all columns were encoded as numeric values, we ran our first algorithms and got some discouraging results. The predictions did not match our expectations at all. After some investigation, we realized that due to the widely different value ranges, columns with large values effectively received a much higher weight than columns with small value ranges. This problem can be solved by scaling all features onto the same range. Once the features were normalized we saw big improvements, but using all columns would still not bring us the intended results.
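The scaling fix boils down to min-max normalization. A minimal NumPy sketch with two made-up features on very different scales:

```python
import numpy as np

# Two features on very different scales: an amount and a small numeric code
X = np.array([[15000.0, 1.0],
              [  200.0, 3.0],
              [ 9500.0, 2.0]])

# Min-max scaling maps each column onto [0, 1], so no column dominates
# the model purely because of its magnitude.
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
```

In practice you would use scikit-learn's `MinMaxScaler` (or `StandardScaler` for zero-mean/unit-variance scaling) so the same transformation can be reapplied to new data.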
In the case of the ACDOCA table, there are multiple amount columns storing the expenses in different currencies and several date columns which contain timestamps, booking dates, etc. Those columns are necessary for properly saving an expense in S/4HANA but are almost always strongly correlated. Despite our normalization efforts, this correlation artificially increases the weight of the underlying feature simply by having multiple identical or nearly identical columns. Due to this effect, it is important to select only relevant features. If this is not possible, you can apply Principal Component Analysis (PCA) to reduce similar features by combining and replacing them with an approximation.
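The currency-column effect can be reproduced with synthetic data: below, a made-up USD column is just the EUR column times a fixed exchange rate, and PCA collapses the two correlated amounts into essentially one component:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
amount_eur = rng.uniform(10, 1000, size=100)
amount_usd = amount_eur * 1.18        # nearly identical column (fixed exchange rate)
other = rng.normal(size=100)          # an unrelated feature

X = np.column_stack([amount_eur, amount_usd, other])

# Reduce three columns to two components; the correlated amount columns
# end up combined into (roughly) a single component.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
```

Note that PCA is itself scale-sensitive, so in a real pipeline it runs after the normalization step described above.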
Picking the right algorithm
As mentioned before, machine learning algorithms generally fall into one of two categories: supervised or unsupervised. The ACDOCA table does not have any column that could serve as an anomaly label, so we needed to either label our dataset by hand or create an unsupervised algorithm which would automatically label our data. We assumed that different companies would have vastly different expenses, so the algorithm needs to adjust to its context. In this situation, we expect the best approach to be an initial labeling by an unsupervised model generated for each customer, with these labels then combined with the users' feedback as input to a regularly retrained supervised learning algorithm.
Overview of clustering methods from the scikit-learn documentation
In our investigation of the data, we noticed that anomalies were not just different from regular expenses but also different from each other. This means that we could find a good approximation by using a clustering algorithm. So far we have not found major differences in quality between the clustering algorithms. K-Means and DBSCAN are common algorithms which offer good performance with many samples as well as a reasonable number of clusters, making both good options.
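As a sketch of the idea (with synthetic amounts, not real postings), K-Means can separate a bulk of ordinary values from a handful of extremes, and the smaller cluster can then be treated as the anomaly candidates:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(42)
# Bulk of regular postings around 100, plus three extreme amounts
regular = rng.normal(loc=100, scale=10, size=(200, 1))
extreme = np.array([[5000.0], [7500.0], [6200.0]])
X = np.vstack([regular, extreme])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Treat the (much) smaller cluster as the anomaly candidate set
values, counts = np.unique(kmeans.labels_, return_counts=True)
anomaly_label = values[np.argmin(counts)]
is_anomaly = kmeans.labels_ == anomaly_label
```

DBSCAN would instead mark low-density points as noise directly, which avoids choosing the number of clusters up front.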
With the generated labels, we can train a supervised classification algorithm such as nearest neighbors, decision trees, or support vector machines. This classifier provides the final result that is shown to the user.
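The handover from clustering to classification can be sketched as follows; the labels and feature values are synthetic and stand in for the output of the clustering step:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(1)
# Features for regular postings plus two postings labeled as anomalies
X_normal = rng.normal(loc=100, scale=10, size=(200, 2))
X_anom = np.array([[900.0, 950.0], [880.0, 910.0]])
X = np.vstack([X_normal, X_anom])
y = np.array([0] * 200 + [1] * 2)   # labels produced by the clustering step

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Classify new postings: one near the anomalies, one near the regular bulk
pred = clf.predict([[905.0, 940.0], [102.0, 98.0]])
```

Because the classifier is retrained regularly, user feedback ("this was not an anomaly") can simply flip labels in `y` before the next training run.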
Getting better results
The last step is to improve the results of the algorithm and optimize its runtime. Most of the following proposals are based on assumptions derived from our test dataset and will be validated once the anomaly detection is activated for our SAP RealSpend customers.
Big companies are likely to produce huge amounts of financial data, which can be leveraged to get better predictions. However, processing large amounts of data quickly becomes difficult as performance drops or algorithms run out of memory. A better approach is splitting the data into chunks which generate smaller models (e.g. one model for expenses related to travel and another for administrative expenses). These models are then combined into a single model using an ensemble algorithm, which runs an input through each model and combines the results.
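A minimal sketch of the chunking idea, assuming the data carries an expense-category tag (the categories and amounts here are invented): one small model is trained per category, and predictions are routed to the matching model.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical postings tagged with an expense category
categories = np.array(["travel", "travel", "admin", "admin", "travel", "admin"])
amounts = np.array([[120.0], [95.0], [40.0], [55.0], [4000.0], [48.0]])

# One small model per category instead of a single big model
models = {}
for cat in np.unique(categories):
    X_cat = amounts[categories == cat]
    models[cat] = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_cat)

# At prediction time, route each posting to the model for its category
def predict(category, amount):
    return models[category].predict([[amount]])[0]
```

When all sub-models should vote on the same input instead, scikit-learn's ensemble estimators (e.g. `VotingClassifier`) implement the result-combination step.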
A problem with detecting rare occurrences like anomalies is the strong imbalance between the class sizes. This often leads to a model with good overall accuracy simply because it always predicts "no anomaly". To prevent this, we can use oversampling or undersampling on our training data.
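Random oversampling can be done in plain NumPy by duplicating minority samples until the classes are balanced; the 500-vs-5 split below is an invented example of the imbalance:

```python
import numpy as np

rng = np.random.RandomState(0)
# 500 regular postings and only 5 anomalies: a strongly imbalanced training set
X = np.vstack([rng.normal(100, 10, size=(500, 1)),
               rng.normal(5000, 100, size=(5, 1))])
y = np.array([0] * 500 + [1] * 5)

# Random oversampling: duplicate minority samples until both classes are equal
minority = np.where(y == 1)[0]
extra = rng.choice(minority, size=500 - len(minority), replace=True)
X_balanced = np.vstack([X, X[extra]])
y_balanced = np.append(y, y[extra])
```

The imbalanced-learn library offers more refined variants such as `RandomOverSampler` and SMOTE, which synthesizes new minority samples instead of duplicating existing ones.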
The last technique I want to highlight is called hyper-parameter tuning. Most machine learning algorithms expose parameters like the number of clusters or the degree of a kernel function which improve the prediction when correctly adjusted. Since the right values depend on the dataset, which we don't know in advance, we can work around the problem by training a large number of models using different values and evaluating each model's performance with a scoring metric. The parameters that yield the best model are then used for the final algorithm.
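scikit-learn automates this train-many-models loop with `GridSearchCV`; a sketch on synthetic data (the parameter grid here is illustrative, not our production configuration):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic labels for illustration

# Train one model per parameter combination and score each with
# 5-fold cross-validation; the best combination is kept.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5, 10]},
    cv=5,
)
search.fit(X, y)
best_model = search.best_estimator_
```

For large grids, randomized search over the parameter space (`RandomizedSearchCV`) keeps the runtime manageable at a small cost in optimality.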
For more information on SAP RealSpend, check out the following links:
- See SAP RealSpend with the new machine learning functions at SAPPHIRE NOW 2018, here.
- Would you like to dive deeper? Sign up for ASUG user testing session 16.
- Want more information, or to try out SAP RealSpend? Visit us here.
Disclaimer: Anomaly detection in SAP RealSpend requires a connection to SAP S/4HANA Cloud 1808 or SAP S/4HANA On Premise 1809.