
This component identifies duplicate values in a column. The first occurrence of a value in the column is flagged as “First”. Should the same value occur again, it is flagged as “Duplicate”. To give meaning to the “first” occurrence, the data is sorted on a column specified by the user.

You can use the component in marketing, for instance, to avoid sending the same campaign to several people in the same household. Look for duplicates in the household ID with the data sorted by age. This way you can ensure that your campaign is sent only to the youngest (or oldest) member of each household.
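The logic described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the component's actual code: the function name `flag_duplicates`, its parameter names, and the household data are all made up for this sketch, and returning the rows in sorted order is an assumption.

```python
from typing import Any, Dict, List


def flag_duplicates(
    rows: List[Dict[str, Any]],
    duplicate_column: str,
    sort_column: str,
    descending: bool = False,
) -> List[Dict[str, Any]]:
    """Return sorted copies of the rows with a DuplicateIdentifier column added."""
    # Sort first so that "first occurrence" has a defined meaning.
    ordered = sorted(rows, key=lambda row: row[sort_column], reverse=descending)
    seen = set()
    flagged = []
    for row in ordered:
        value = row[duplicate_column]
        label = "Duplicate" if value in seen else "First"
        seen.add(value)
        # Copy the row so the caller's data is left untouched.
        flagged.append({**row, "DuplicateIdentifier": label})
    return flagged


# Hypothetical household data, invented for illustration only.
people = [
    {"name": "Ann", "household_id": 1, "age": 42},
    {"name": "Ben", "household_id": 1, "age": 17},
    {"name": "Cleo", "household_id": 2, "age": 65},
    {"name": "Dan", "household_id": 2, "age": 70},
]

# Ascending sort on age: the youngest member of each household is "First".
youngest_first = flag_duplicates(people, "household_id", "age")
targets = [p["name"] for p in youngest_first if p["DuplicateIdentifier"] == "First"]
print(targets)
```

Passing `descending=True` would instead flag the oldest member of each household as “First”.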

Disclaimer

Please note that this component is not an official release by SAP and that it is provided as-is without any guarantee or support. Please test the component to ensure it works for your purposes.

Prerequisites

No specific prerequisites. You can use the component on any dataset.

Limitations

Please let me know should you encounter any limitations.

Usage

These parameters can be set by the user.

Column with potential duplicates: The column which will be analysed for duplicates.
Column to sort the dataset on: The column on which the data will be sorted.
Sort direction: Specifies whether the sorting will be ascending or descending.

Output column added by this component

DuplicateIdentifier: The first occurrence of a value in the analysed column is flagged “First”. Any further occurrences of the same value are flagged “Duplicate”.


How to Implement

The component can be downloaded as a .spar file from GitHub. Then deploy it as described here. You just need to import it through the option “Import/Model Component”, which you will find by clicking on the plus sign at the bottom of the list of available algorithms.

Example

You can try this component on the dataset MarketingTargetList.csv, which is a subset of the Census01.csv file that comes with SAP Predictive Analytics. As a basic example, the following configuration flags the youngest person of each “nativecountry” as “First”. Similarly, you could identify duplicates in the dataset or first/last occurrences, i.e. the youngest or oldest person in a household.


Config.PNG
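To see what this example would produce, the same steps can be approximated outside the tool with Python's standard library. The rows below are invented and are not taken from MarketingTargetList.csv. The column name “age” and the ascending sort are assumptions inferred from the description of the example.

```python
import csv
import io

# Invented sample rows; the real MarketingTargetList.csv is not reproduced here.
# The column name "age" is an assumption about that file.
sample_csv = """age,nativecountry
39,United-States
25,Cuba
31,United-States
52,Cuba
28,India
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Sort ascending on age, converting to int so that 9 sorts before 10.
rows.sort(key=lambda row: int(row["age"]))

# Walk the sorted rows and label each country the first time it appears.
seen_countries = set()
for row in rows:
    country = row["nativecountry"]
    row["DuplicateIdentifier"] = "Duplicate" if country in seen_countries else "First"
    seen_countries.add(country)

for row in rows:
    print(row["age"], row["nativecountry"], row["DuplicateIdentifier"])
```

The CSV reader returns every value as text, so the sort column is converted to a number before sorting. To run the sketch against a real file, the `io.StringIO(...)` argument could be replaced with an opened file handle.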
