Custom R Component – Normal Distribution Test
Many statistical and predictive methods assume that the data are normally distributed. The residuals of a linear regression, for example, should be normally distributed, as should the measures analysed in a t-test.
This component helps understand how closely a numerical variable follows a normal distribution.
There are several ways to assess this resemblance, and which ones to use is often a matter of personal preference. I personally prefer:
- A density plot, which visualises the distribution
- A QQ-Plot, which should show the data points on a straight line
- Skewness as a measure of asymmetry, which should be close to 0 for a normal distribution.
- Kurtosis (measured as excess kurtosis) as a measure of “peakedness”, as Wikipedia puts it, which should also be close to 0 for a normal distribution.
The component also calculates these normality tests:
- Anderson-Darling
- Shapiro-Wilk
- Lilliefors
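The checks listed above can be reproduced outside the component with a few lines of R. The following is a minimal sketch, not the component's actual code: the function name normality_summary and the simulated input are assumptions for illustration. Skewness and excess kurtosis are computed directly from the sample moments (skewness() and kurtosis() in e1071 use a slightly different small-sample formula by default, so values can differ marginally), and the two nortest tests run only if that package is installed.

```r
# Hypothetical helper: collects the measures discussed above for a numeric vector
normality_summary <- function(x) {
  x  <- x[!is.na(x)]             # drop missing values
  m  <- mean(x)
  m2 <- mean((x - m)^2)          # second central moment
  res <- list(
    skewness        = mean((x - m)^3) / m2^1.5,
    excess_kurtosis = mean((x - m)^4) / m2^2 - 3,
    shapiro_wilk_p  = shapiro.test(x)$p.value   # stats, ships with R
  )
  # Anderson-Darling and Lilliefors come from nortest, if available
  if (requireNamespace("nortest", quietly = TRUE)) {
    res$anderson_darling_p <- nortest::ad.test(x)$p.value
    res$lilliefors_p       <- nortest::lillie.test(x)$p.value
  }
  res
}

# Illustrative data only: 200 draws from a standard normal distribution
set.seed(42)
x <- rnorm(200)
summary_x <- normality_summary(x)
print(summary_x)

# The two visual checks: density plot and QQ-plot with reference line
plot(density(x), main = "Density plot")
qqnorm(x)
qqline(x)
```

For data that really are close to normal, skewness and excess kurtosis should be near 0, and the test p-values should generally not be small.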
Disclaimer
Please note that this component is not an official release by SAP and that it is provided as-is without any guarantee or support. Please test the component to ensure it works for your purposes.
Prerequisites
– The R libraries e1071, gplots and nortest must be installed. The stats library is also required, but it ships with R.
Limitations
Anderson-Darling is calculated when the dataset contains more than 7 values.
Lilliefors is calculated for datasets with more than 4 values.
Shapiro-Wilk is calculated for datasets between 3 and 5000 values.
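These sample-size limits can be expressed as simple guards. The sketch below uses a hypothetical helper, applicable_tests, which only reports which tests the limits above allow for a given number of values. It reads "between 3 and 5000" as inclusive, which is an assumption, and it is not taken from the component's source.

```r
# Hypothetical helper: which tests do the stated limits allow for n values?
applicable_tests <- function(n) {
  c(
    anderson_darling = n > 7,                 # more than 7 values
    lilliefors       = n > 4,                 # more than 4 values
    shapiro_wilk     = n >= 3 && n <= 5000    # 3 to 5000 values (assumed inclusive)
  )
}

# With 6 values, only Lilliefors and Shapiro-Wilk would be calculated
print(applicable_tests(6))
```

A check like this could be run before calling each test, so that small or very large datasets skip a test instead of raising an error.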
Usage
These parameters can be set by the user.
Parameter | Description |
---|---|
Variable to test for Normal Distribution | Your numerical variable. |
This component adds no output columns.
How to Implement
The component can be downloaded as a .spar file from GitHub and then deployed as described here. You just need to import it through the option “Import/Model Component”, which you will find by clicking on the plus sign at the bottom of the list of available algorithms.
Example
You can try out the component with your own data or with the file LondonOlympics2012Decathlon.csv, which lists the points collected by the athletes competing in the Decathlon at the London Olympics 2012. The total points collected, for instance, appear fairly close to a normal distribution (see column “Overall”). This is shown in the screenshot at the top of this article.
Is there a way to add the R visualization to the 'Visualize' and 'Compose' Tabs?
In addition, can you separate the two charts that you created into two separate objects at the bottom right of the screen, where you have 'Custom Chart'? That is, split the visualizations as the built-in algorithms do (the built-in 'Auto Clustering' has four objects under 'Cluster Representation').
I have answered on this duplicate thread
http://scn.sap.com/thread/3767855
Hi Michael,
Here is an example of outputting multiple charts in a custom component
http://scn.sap.com/docs/DOC-65217
The little sample function below might help implement this in your own code.
Just make sure to uncomment the pa.config line when adding it into expert mode.
The user guide of Expert Analytics also has a chapter on "Multiple Charts in Offline Mode Custom R Components"
http://help.sap.com/businessobject/product_guides/pa22/en/pa22_expert_user_en.pdf
# Example call outside SAP Predictive Analytics:
# my2Plot(iris, "Sepal.Width")
my2Plot <- function(myData, strColName)
{
  library(ggplot2) # load plot library

  # UNCOMMENT THIS LINE WHEN USED IN SAP PREDICTIVE ANALYTICS
  #pa.config("multiplot","true") # configure PA to use multiplot mode

  # Container for multiple charts
  myChartCollection <- list()

  # 1st chart: density plot of the chosen column
  mySubChart1 <- ggplot(myData, aes_string(x=strColName)) + geom_density(alpha=0.5)
  myChartCollection[[1]] <- list(chart=mySubChart1, type="line", name="Density Plot")

  # 2nd chart: dot plot of the same column
  mySubChart2 <- ggplot(myData, aes_string(x=strColName)) + geom_dotplot(alpha=0.5)
  myChartCollection[[2]] <- list(chart=mySubChart2, type="line", name="Dot Plot")

  # Return the initial data together with the chart collection
  return(list(out=myData, charts=myChartCollection))
}