Custom R Component - Stratified Sampling

AndreasForster · 10-07-2015

Stratified sampling creates a subset of the data whose distribution in a selected variable is similar to that of the full dataset. This component adds this functionality to SAP Predictive Analytics, Expert Mode.

Disclaimer

Please note that this component is not an official release by SAP and that it is provided as-is without any guarantee or support. Please test the component to ensure it works for your purposes.

Prerequisites

R library caTools must be installed.

Limitations

Please let me know should you encounter any limitations.

Usage

These parameters can be set by the user.

Parameter	Description
Desired Split (Percent or Count)	Specifies the size of the stratified subset. You can enter a proportion (e.g. 0.3 for 30%) or the absolute number of records (e.g. 2000).
Stratification Column	The categorical column whose distribution will be reproduced in the stratified subset.
Random Seed	Numerical value that makes the random samples reproducible.
Label 1st Subset	Label that identifies the stratified subset in a newly added column.
Label 2nd Subset	Label that identifies the remainder of the dataset.

Output column added by this component

Column	Description
SplitLabel	Identifies which subset each record belongs to. See "Label 1st Subset" and "Label 2nd Subset" above.
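
To make the interplay of these parameters more concrete, here is a minimal sketch of a stratified split in plain Python. It is not the component's R code, which relies on caTools. The function name stratified_labels, the default labels and the rule "values below 1 are a proportion, anything else is a record count" are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def stratified_labels(values, split, seed, label_1="Sample", label_2="Remainder"):
    """Return one label per record; records labelled label_1 mirror the
    category distribution of `values` (the stratification column)."""
    n = len(values)
    # Assumption: a split below 1 is a proportion, otherwise an absolute count.
    fraction = split if split < 1 else min(split / n, 1.0)
    rng = random.Random(seed)  # same seed -> same sample

    # Collect the row positions of each category.
    rows_by_category = defaultdict(list)
    for position, value in enumerate(values):
        rows_by_category[value].append(position)

    # Draw the same fraction from every category.
    labels = [label_2] * n
    for rows in rows_by_category.values():
        take = round(len(rows) * fraction)
        for position in rng.sample(rows, take):
            labels[position] = label_1
    return labels


# Toy data with illustrative category names, not taken from a real file.
relationship = ["Husband"] * 50 + ["Wife"] * 30 + ["Own-child"] * 20
split_label = stratified_labels(relationship, split=0.3, seed=42)
print(split_label.count("Sample"), "records in the stratified subset")
```

Because the fraction is rounded within each category, the size of the subset can differ by a few records from the requested count when the categories do not divide evenly.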

How to Implement

The component can be downloaded as a .spar file from GitHub and then deployed as described here. To import it, click the plus sign at the bottom of the list of available algorithms and choose the option "Import/Model Component".

Example

You can try the Stratified Sampling component on the Census01.csv file that is commonly used with Automated Mode. The file is installed automatically with SAP Predictive Analytics. In version 2.3 you will find it in "C:\Program Files\SAP Predictive Analytics\Desktop 2.3\Automated\Samples\Census". In this example, the component is configured to create a sample containing 30% of the records of the whole dataset. The stratification is based on the "relationship" column, so the sample will have a very similar distribution in this column to that of the whole dataset.
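
One way to check that claim is to compare the share of each category in the whole dataset with its share in the labelled subset. The following is a small Python sketch of that check, assuming the stratification column and the SplitLabel column have been exported as two lists of equal length. The helper names and the toy values are hypothetical and do not come from Census01.csv.

```python
from collections import Counter


def share_by_category(values):
    """Return the relative frequency of each category in `values`."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


def compare_distribution(values, labels, sample_label):
    """Pair each category's share in the whole dataset with its share
    among the records whose label equals `sample_label`."""
    whole = share_by_category(values)
    subset = [v for v, label in zip(values, labels) if label == sample_label]
    sample = share_by_category(subset)
    return {category: (whole[category], sample.get(category, 0.0))
            for category in whole}


# Toy data: stratification column and the matching SplitLabel values.
values = ["A", "A", "A", "A", "B", "B"]
labels = ["Sample", "Remainder", "Sample", "Remainder", "Sample", "Remainder"]

for category, (whole, sample) in compare_distribution(values, labels, "Sample").items():
    print(f"{category}: whole={whole:.2f} sample={sample:.2f}")
```

For a well-stratified sample, the two shares printed for each category should be close to each other.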
