Custom R Component - Identify Duplicate Column Val...

AndreasForster · ‎10-12-2015

This component identifies duplicate values in a column. The first occurrence of a value in the column is flagged as "First". Should the same value occur again, it is labelled as a duplicate. To give "first" a defined meaning, the data is sorted on a column specified by the user.
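The logic described above can be sketched in a few lines. This is an illustration in Python only, not the R code shipped with the component; the function name `flag_duplicates`, the column names `household_id` and `age`, and the sample rows are all made up for this example. How the real component orders rows that tie on the sort column is not stated here, so the tie behaviour below (original order is kept) is an assumption.

```python
def flag_duplicates(rows, value_col, sort_col, descending=False):
    """Sort rows on sort_col, then label each value in value_col
    as "First" on its first appearance and "Duplicate" afterwards."""
    # Python's sort is stable, so rows that tie on sort_col keep
    # their original relative order (assumption, see lead-in).
    ordered = sorted(rows, key=lambda r: r[sort_col], reverse=descending)
    seen = set()
    flagged = []
    for row in ordered:
        label = "Duplicate" if row[value_col] in seen else "First"
        seen.add(row[value_col])
        # Copy the row and add the new output column.
        flagged.append({**row, "DuplicateIdentifier": label})
    return flagged


# Hypothetical sample data: two people share household H1.
rows = [
    {"household_id": "H1", "age": 52},
    {"household_id": "H2", "age": 31},
    {"household_id": "H1", "age": 19},
]

result = flag_duplicates(rows, "household_id", "age")
for r in result:
    print(r)
```

With an ascending sort on age, the 19-year-old in household H1 comes first and is flagged "First", while the 52-year-old in the same household is flagged "Duplicate".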

You can use the component in marketing, for instance, to avoid sending the same campaign to several people in the same household. Look for duplicates in the household ID with the data sorted by age. This way you can ensure your campaign is sent only to the youngest (or oldest) member of each household.

Disclaimer

Please note that this component is not an official release by SAP and that it is provided as-is without any guarantee or support. Please test the component to ensure it works for your purposes.

Prerequisites

No specific prerequisites. You can use the component on any dataset.

Limitations

Please let me know should you encounter any limitations.

Usage

These parameters can be set by the user.

Parameter	Description
Column with potential duplicates	The column which will be analysed for duplicates.
Column to sort the dataset on	The column on which the data will be sorted.
Sort direction	Specifies whether the sorting is ascending or descending.

Output column added by this component

Column	Description
DuplicateIdentifier	The first occurrence of a value in the analysed column is flagged "First". Any further occurrences of that value are flagged "Duplicate".

How to Implement

The component can be downloaded as a .spar file from GitHub. Then deploy it as described here. You just need to import it through the option "Import/Model Component", which you will find by clicking the plus sign at the bottom of the list of available algorithms.

Example

You can try this component on the dataset MarketingTargetList.csv, which is a subset of the Census01.csv file that comes with SAP Predictive Analytics. As a basic example, the following configuration flags the youngest person of each "nativecountry" as "First". Similarly, you could identify duplicates in the dataset or first/last occurrences, i.e. the youngest or oldest person in a household.
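To make the effect of the sort direction concrete, here is a small Python sketch that keeps only the rows that would be flagged "First". It is not the component's R code, and it does not read MarketingTargetList.csv: the four rows, the `name` column and the helper `first_per_group` are invented for illustration, and `age` is assumed to be the sort column.

```python
def first_per_group(rows, group_col, sort_col, descending=False):
    """Return the first row of each group after sorting on sort_col."""
    ordered = sorted(rows, key=lambda r: r[sort_col], reverse=descending)
    seen = set()
    firsts = []
    for row in ordered:
        if row[group_col] not in seen:  # first occurrence of this group
            seen.add(row[group_col])
            firsts.append(row)
    return firsts


# Hypothetical rows, not taken from the real dataset.
people = [
    {"name": "A", "age": 45, "nativecountry": "Germany"},
    {"name": "B", "age": 23, "nativecountry": "Germany"},
    {"name": "C", "age": 37, "nativecountry": "Mexico"},
    {"name": "D", "age": 61, "nativecountry": "Mexico"},
]

# Ascending sort: the youngest person per country comes first.
youngest = first_per_group(people, "nativecountry", "age")
# Descending sort: the oldest person per country comes first.
oldest = first_per_group(people, "nativecountry", "age", descending=True)

print([p["name"] for p in youngest])
print([p["name"] for p in oldest])
```

Switching only the sort direction changes which member of each group is treated as the first occurrence, which is the same choice the "Sort direction" parameter offers.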

Custom R Component - Identify Duplicate Column Values
