This blog is about some basic statistical methods that are sometimes used in fraud management. The application is very simple: given a list of, for example, expenses, we can check whether they look suspicious or not. Such tests can be used in analytical business rules. I am writing this blog entry as part of a blog series in which I describe how to use algorithms in HANA in business rules via AMDP or database proxies. In contrast to typical business rules, these rules deal with many line items. So far in this blog series I have discussed basic applications in time series forecasting and linear optimization.
The mathematical background of this blog is a fascinating observation called Benford’s law. Benford’s law holds in many datasets, and conformity to it can be tested with a so-called chi-square test. Such statistical tests can be used in SQLScript procedures in HANA, either with PAL or, of course, in R. In my blog series I am focusing on R. The reason is very simple: once you have learned R, designing algorithms is very easy, and in most cases you can find a very good implementation of an algorithm in an existing package.
So I will use this blog entry to give a short introduction to Benford’s law and show how to use it in R. I will also take the chance to introduce some aspects of this programming language.
In 1881 Simon Newcomb published a two-page article in the American Journal of Mathematics. At that time scientists used books of logarithms to perform complex calculations, and he observed that the pages of those books were dirtier at the beginning and got cleaner towards the end. Later this led to the observation by Frank Benford that in many datasets the first digit of a number is more often a small digit than a large one: the first digit has a bias towards lower numbers. He tested it with surface areas of rivers, population sizes, physical constants and much more. Today we observe the same when we count followers on Twitter, which is described here: https://www.r-bloggers.com/benfords-law/
If you are interested in the mathematical background of Benford’s law, I suggest you read the following paper: https://www.stat.auckland.ac.nz/~fewster/RFewster_Benford.pdf
Statisticians have applied the law in many areas, for example the analysis of purchase orders, balances, credit card transactions, stock items, accounts payable transactions and many more. Usually testing Benford’s law is only one aspect of fraud management, which typically requires much more work.
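For orientation, the law itself fits in one line: the probability that a number starts with the digit d is log10(1 + 1/d). The following is a minimal sketch in base R that computes these expected probabilities; the variable names are my own and not taken from any package.

```r
# Expected first-digit probabilities under Benford's law: P(d) = log10(1 + 1/d)
first_digit_p <- log10(1 + 1 / (1:9))
names(first_digit_p) <- 1:9
round(first_digit_p, 3)

# The same formula applied to the first two digits (10 to 99)
two_digit_p <- log10(1 + 1 / (10:99))
names(two_digit_p) <- 10:99
sum(two_digit_p)  # the 90 probabilities add up to 1
```

So roughly 30% of all numbers are expected to start with a 1, and fewer than 5% with a 9.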
The Benford object in R
As I mentioned before, there is already an implementation of Benford analysis in R: https://cran.r-project.org/web/packages/benford.analysis/benford.analysis.pdf As usual I recommend trying it out in RStudio on your PC before you implement those algorithms in HANA. The above-mentioned documentation is really good; I will go through it and try to give a little introduction to R and also RStudio. I hope this will help readers get familiar with R.
So first install the package and load it using the command
library(benford.analysis)
This package comes with some sample datasets. Just type in
data()
to get a list of all datasets. Then load the dataset using
data(corporate.payment)
This dataset contains the 2010 payments data of a division of a West Coast utility company. When I am working with R objects (and most APIs are objects and so are datasets), the first thing I do is get an overview of the structure. There are some commands that help me get the right information:
Now I know that it is quite a large dataset with 189470 rows. Moreover it is a data frame, a fundamental data structure in R that is defined in the base package. It consists of a list of columns, called statistical variables. One variable is “Amount”, which contains the amounts for certain vendors; the other variables are additional attributes. With head(corporate.payment) I get an overview of the first values.
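To make this inspection step concrete, here is a small sketch of base R commands I typically use for it. It runs on a tiny made-up data frame so that it works without any package; the data frame name payments and its columns are purely illustrative. The same calls can be applied to corporate.payment.

```r
# Hypothetical stand-in data; column names and values are illustrative only
payments <- data.frame(
  Vendor = c("V001", "V002", "V003", "V004"),
  Amount = c(1250.00, 89.90, 15300.75, -42.10)
)

class(payments)           # type of the object
dim(payments)             # number of rows and columns
names(payments)           # names of the statistical variables
str(payments)             # compact structure, one line per variable
summary(payments$Amount)  # basic statistics of one variable
head(payments, 2)         # first two rows
```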
With plot(corporate.payment$Amount[1:300]) I can get an overview of the first 300 amounts:
By the way, when I call R from HANA I always use data frames as the result structure. I don’t know which other possibilities exist, but a data frame is a good choice for me.
With the following command we can create a Benford object:
cp <- benford(corporate.payment$Amount, 2, sign="both")
This performs the Benford test on the first two digits and also analyzes negative values. You get a textual result by typing cp and the following graphic by typing plot(cp):
You can see here the expected values from Benford’s law (in red) and also the deviations. The red curve is sometimes called Benford’s curve and shows the distribution expected under Benford’s law.
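The comparison of observed and expected frequencies can also be reproduced by hand for the first digit. The following is only a sketch in base R, not how the benford.analysis package works internally. The helper first_digit and the synthetic amounts are my own assumptions: I generate numbers spread evenly on a logarithmic scale over four orders of magnitude, which should follow Benford’s law closely.

```r
# Leading digit of each non-zero number, ignoring the sign
first_digit <- function(x) {
  x <- abs(x[x != 0])
  floor(x / 10^floor(log10(x)))
}

set.seed(42)
# Synthetic amounts, log-uniform between 1 and 10000 (illustrative data only)
amounts <- 10^runif(10000, min = 0, max = 4)

observed <- tabulate(first_digit(amounts), nbins = 9) / length(amounts)
expected <- log10(1 + 1 / (1:9))

comparison <- data.frame(
  digit      = 1:9,
  observed   = round(observed, 3),
  expected   = round(expected, 3),
  difference = round(observed - expected, 3)
)
print(comparison)
```

Large values in the difference column correspond to the bars that stick out above or below the red curve in the plot.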
There are suspicious numbers starting with 5 and 1. You can use the function getSuspects, which returns the suspicious values:
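A call could look roughly like the following sketch. It assumes that the benford.analysis package is installed from CRAN. The arguments by and how.many are written out with the values I understand to be the package defaults, so please check the package documentation before relying on them.

```r
library(benford.analysis)   # assumes the package is installed from CRAN
data(corporate.payment)

cp <- benford(corporate.payment$Amount, 2, sign = "both")

# Rows whose leading two digits belong to the two digit groups with the
# largest absolute deviation from the Benford expectation
suspects <- getSuspects(cp, corporate.payment, by = "absolute.diff", how.many = 2)
head(suspects)

# Digit groups ranked by the same discrepancy measure
head(suspectsTable(cp, by = "absolute.diff"))
```

The result contains complete rows of the original dataset, so the vendor and the other attributes of each suspicious payment are available for follow-up checks.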
So far most of the blog entry was just a quick demonstration of R features and datasets that you can use out of the box once you have installed R. I thank Carlos Cinelli for providing a great Benford package together with good documentation. In this blog entry I wanted to present another use case of an analytical rule based on statistical tests. Due to the huge number of packages, R is the right tool even for complex statistical tests, for example in fraud management. You can do the same things in PAL and SQLScript, but I believe that the implementation will be much more complex. The interpretation of the results is far from trivial. Most likely you will have to read scientific books about the topic. And most fraud management techniques use the Benford test together with other tests and use variants of it.
I see much potential in using those techniques in automated processes. Just take expenses or other metric variables of a certain business partner and perform the test. I showed some basic commands to analyze the result objects. This is very useful, since in automated processes you will compare the results of the tests against thresholds and, if they are violated, return the suspicious values. These values can be used in a rule system implemented in BRFplus, for example.
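Such a threshold check can be sketched in a few lines of base R. Note that this sketch uses chisq.test from base R instead of the benford.analysis package, and that the function name violates_benford and the significance level of 0.01 are my own assumptions, not recommended values.

```r
# Leading digit of each non-zero number, ignoring the sign
first_digit <- function(x) {
  x <- abs(x[x != 0])
  floor(x / 10^floor(log10(x)))
}

# Hypothetical rule helper: TRUE if the first digits deviate from Benford's
# law at significance level alpha (the 0.01 default is an assumed threshold)
violates_benford <- function(amounts, alpha = 0.01) {
  observed <- tabulate(first_digit(amounts), nbins = 9)
  expected <- log10(1 + 1 / (1:9))
  test <- chisq.test(observed, p = expected)
  test$p.value < alpha
}

# Illustrative data: first digits in Benford proportions (1000 values)
conforming <- rep(1:9, times = round(1000 * log10(1 + 1 / (1:9))))
# Illustrative data: every first digit equally frequent (999 values)
invented <- rep(1:9, each = 111)

violates_benford(conforming)  # FALSE
violates_benford(invented)    # TRUE
```

The logical result is easy to hand over as a single flag per business partner. Which significance level makes sense depends on the data volume, because with very many line items even tiny deviations become statistically significant.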
In the last blog entries I showed use cases for analytical rules using HANA and R. I gave examples in the areas of forecasting, optimization and statistical testing. You can use them today in rule systems, but as I explained in my previous blog, SAP should start to improve the integration, the toolchain and the delivery.