Stratified sampling creates a subset of the data whose distribution in a selected variable is similar to that of the full dataset. This component adds this functionality to SAP Predictive Analytics, Expert Mode.
Please note that this component is not an official release by SAP and that it is provided as-is without any guarantee or support. Please test the component to ensure it works for your purposes.
The R library caTools must be installed.
Please let me know should you encounter any limitations.
These parameters can be set by the user.
|Desired Split (Percent or Count)||Specifies the size of the stratified subset. You can enter a percentage as a fraction (e.g. 0.3 for 30%) or the absolute number of records (e.g. 2000).|
|Stratification Column||The categorical column whose distribution will be reproduced in the stratified subset.|
|Random Seed||Numerical value that makes it possible to produce random but reproducible samples.|
|Label 1st Subset||Label that identifies the stratified subset in a newly added column.|
|Label 2nd Subset||Label that identifies the remainder of the dataset.|
Output column added by this component
Identifies which subset each record belongs to. See “Label 1st Subset” and “Label 2nd Subset” above.
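To make the interplay of these parameters more concrete, here is a minimal Python sketch of a stratified split. It is not the component's code (the component relies on the R library caTools); the function name `stratified_split`, the output column name `Subset` and the default labels are assumptions made up for this illustration. It also assumes that a split value below 1 is meant as a fraction and anything else as a record count.

```python
import random
from collections import defaultdict


def stratified_split(rows, column, split, seed=1,
                     label_first="Sample", label_second="Remainder",
                     out_column="Subset"):
    """Return copies of the rows with an added column naming their subset.

    All names here are hypothetical; this only illustrates the idea.
    """
    n = len(rows)
    # Assumption: a value below 1 is a fraction, otherwise a record count.
    fraction = split if split < 1 else split / n

    # A seeded generator makes the random draw reproducible.
    rng = random.Random(seed)

    # Group the row positions by the value of the stratification column.
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[row[column]].append(i)

    # Draw the same fraction from every category, so the subset keeps
    # roughly the same category proportions as the whole dataset.
    chosen = set()
    for indices in groups.values():
        k = round(len(indices) * fraction)
        chosen.update(rng.sample(indices, k))

    return [
        dict(row, **{out_column: label_first if i in chosen else label_second})
        for i, row in enumerate(rows)
    ]
```

Because the number of records is rounded within each category, the total size of the subset can differ slightly from the requested one. Calling the function twice with the same seed labels the same records.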
How to Implement
The component can be downloaded as a .spar file from GitHub. Then deploy it as described here. You just need to import it through the option “Import/Model Component”, which you will find by clicking on the plus sign at the bottom of the list of available algorithms.
You can try the Stratified Sampling on the common Census01.csv file from Automated Mode, for instance. The file is automatically installed with SAP Predictive Analytics. In Version 2.3 you will find it in “C:\Program Files\SAP Predictive Analytics\Desktop 2.3\Automated\Samples\Census”. The configuration below creates a sample with 30% of the records of the whole dataset. The stratification is based on the “relationship” column, so that the sample's distribution in this column will be very similar to that of the total dataset.
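If you want to check that a sample really follows the distribution of the whole dataset, you can compare the share of each category in both. The following Python sketch shows one way to do that outside of the tool. The helper names `category_shares` and `max_share_gap` are made up for this illustration, and the values in the usage lines are toy data, not taken from Census01.csv.

```python
from collections import Counter


def category_shares(values):
    """Return the share of each category, e.g. {"Husband": 0.6, "Wife": 0.4}."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def max_share_gap(all_values, sample_values):
    """Largest absolute difference in category share between dataset and sample."""
    whole = category_shares(all_values)
    sample = category_shares(sample_values)
    # A category missing from the sample counts as a share of 0.
    return max(abs(whole[c] - sample.get(c, 0.0)) for c in whole)


# Toy data for illustration only.
everything = ["Husband"] * 6 + ["Wife"] * 4
subset = ["Husband"] * 3 + ["Wife"] * 2
print(max_share_gap(everything, subset))
```

A gap close to zero means the sample reproduces the category proportions well. A simple random sample of the same size would typically show a larger gap, especially for rare categories.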