Stratified sampling creates a subset of the data whose distribution in a selected variable is similar to that of the full dataset. This component adds this functionality to SAP Predictive Analytics, Expert Mode.

Percentages.PNG

Disclaimer

Please note that this component is not an official release by SAP and that it is provided as-is without any guarantee or support. Please test the component to ensure it works for your purposes.

Prerequisites

The R library caTools must be installed.

Limitations

Please let me know should you encounter any limitations.

Usage

These parameters can be set by the user.

Parameter | Description
Desired Split (Percent or Count) | Specifies the size of the stratified subset. You can enter a fraction (e.g. 0.3 for 30%) or the absolute number of records (e.g. 2000).
Stratification Column | The categorical column whose distribution will be reproduced in the stratified subset.
Random Seed | Numerical value that makes the random sample reproducible.
Label 1st Subset | Label that identifies the stratified subset in a newly added column.
Label 2nd Subset | Label that identifies the remainder of the dataset.
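
As a rough illustration of how these parameters interact, here is a minimal Python sketch using only the standard library. It is not the component's code (the component relies on R and caTools). The function name `stratified_labels`, the default labels, and the rule that a split value below 1 is treated as a fraction are assumptions made for this example.

```python
import random
from collections import defaultdict


def stratified_labels(values, split, seed=1,
                      label_first="Sample", label_second="Remainder"):
    """Return one label per record so that the first subset roughly mirrors
    the category proportions found in `values` (hypothetical helper)."""
    n = len(values)
    # Assumption: a split below 1 is a fraction, anything else a record count.
    fraction = split if split < 1 else split / n
    fraction = min(1.0, fraction)  # never ask for more records than exist
    rng = random.Random(seed)  # fixed seed gives a reproducible sample

    # Group record positions by category of the stratification column.
    groups = defaultdict(list)
    for idx, value in enumerate(values):
        groups[value].append(idx)

    labels = [label_second] * n
    for indices in groups.values():
        quota = round(len(indices) * fraction)  # records to draw per category
        for idx in rng.sample(indices, quota):
            labels[idx] = label_first
    return labels


# Toy data, invented for illustration only.
relationship = ["Husband"] * 40 + ["Wife"] * 10 + ["Own-child"] * 50
labels = stratified_labels(relationship, 0.3, seed=42)
```

Because the quota is rounded separately for each category, the total size of the first subset can differ by a few records from the requested count on real data.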

Output column added by this component

Column | Description
SplitLabel | Identifies which subset the individual record belongs to. See above “Label 1st Subset” and “Label 2nd Subset”.

How to Implement

The component can be downloaded as a .spar file from GitHub and deployed as described here. To import it, click the plus sign at the bottom of the list of available algorithms and choose the option “Import/Model Component”.

Example

You can try the Stratified Sampling component on, for instance, the Census01.csv file that is commonly used with Automated Mode. The file is installed automatically with SAP Predictive Analytics. In version 2.3 you will find it in “C:\Program Files\SAP Predictive Analytics\Desktop 2.3\Automated\Samples\Census”. The configuration below creates a sample with 30% of the records of the whole dataset. The stratification is based on the “relationship” column, so the sample's distribution in this column will be very similar to that of the whole dataset.
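
To check that a sample really has a similar distribution in the stratification column, you can compare the share of each category in the sample with its share in the whole dataset. The Python sketch below shows one way to do that. The helper names `category_shares` and `max_share_gap` are hypothetical, and the data is a small invented stand-in, not the actual contents of Census01.csv.

```python
from collections import Counter


def category_shares(values):
    """Return the relative frequency of each category in `values`."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


def max_share_gap(full, sample):
    """Largest absolute difference in category share between the two sets."""
    full_shares = category_shares(full)
    sample_shares = category_shares(sample)
    return max(abs(share - sample_shares.get(category, 0.0))
               for category, share in full_shares.items())


# Invented toy values standing in for the "relationship" column.
full = ["Husband"] * 40 + ["Wife"] * 10 + ["Own-child"] * 50
sample = ["Husband"] * 12 + ["Wife"] * 3 + ["Own-child"] * 15

print(category_shares(full))
print(category_shares(sample))
print(max_share_gap(full, sample))
```

A gap close to zero indicates that the sample reproduces the distribution of the stratification column well. In practice you would build `sample` from the records carrying the first subset's value in the SplitLabel column.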

properties.PNG
