DISSIMILAR SIMILARITIES- THE CLUSTER ANALYSIS

Private_Member_9643 · ‎08-24-2005

Birds of a feather flock together,
But ever wondered why and how
A crow does not flock with the dove
Such is the nature’s law
Which enterprises are using without a flaw

Suppose that we have to place a number of automated teller machines (ATMs) in a given region while satisfying a number of constraints. These constraints could include the following:

Population density of the region
Technology feasibility in terms of last mile connectivity or access
Sound commercial premises for housing the ATM
Statistical distribution of the users (commercial establishments or residential setups)
Topography of the selected region or area, and so on

Households or places of work may be clustered so that typically one ATM is assigned per cluster. The clustering, however, may be constrained by the location of bridges, rivers, and highways, all of which affect ATM accessibility. Additional constraints may limit the number of ATMs per district within the region. Given such constraints, we can design our clusters and then use them for cluster analysis.

A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters. The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering.

Cluster analysis has wide applications including market or customer segmentation, pattern recognition, biological studies, spatial data analysis, Web document classification, and many others. Cluster analysis can be used as a standard data mining tool to gain insight into the data distribution, or serve as a preprocessing step for other data mining algorithms operating on the detected clusters.

Scenario:

An insurance company might want to identify the potential market for a few policies by segmenting its customer base according to attributes such as income, age, gender, risk categories, and policy types held.

Customer_id	Name	Age	Gender	Income	Risk Category	Policy Type
Cust001	Kamaljeet	25	M	300000	Medium	Silver
Cust002	Craig	30	M	1000000	Low	Gold
Cust003	Pooja	28	F	20000	High	Platinum
…	…	…	…	…	…	…
…	…	…	…	…	…	…
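
Before records like these can be clustered, the attributes usually have to be turned into comparable numbers. The following is a minimal sketch of one way to do that, not the method used by any particular data mining tool. The min-max scaling, the ordinal coding in `RISK_ORDER`, and the choice of which columns to include are all assumptions made for illustration; the three rows are copied from the table above.

```python
# Sample rows taken from the table above. Only three attributes are used here;
# which attributes to include is a modelling choice, not something fixed.
customers = [
    {"id": "Cust001", "age": 25, "income": 300000, "risk": "Medium"},
    {"id": "Cust002", "age": 30, "income": 1000000, "risk": "Low"},
    {"id": "Cust003", "age": 28, "income": 20000, "risk": "High"},
]

# Hypothetical ordinal coding for the risk category (an assumption).
RISK_ORDER = {"Low": 0, "Medium": 1, "High": 2}


def min_max_scale(values):
    """Rescale a list of numbers to the range 0..1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def encode(rows):
    """Turn customer records into numeric vectors: [age, income, risk]."""
    ages = min_max_scale([r["age"] for r in rows])
    incomes = min_max_scale([r["income"] for r in rows])
    risks = min_max_scale([RISK_ORDER[r["risk"]] for r in rows])
    return [list(t) for t in zip(ages, incomes, risks)]


vectors = encode(customers)
for row, vec in zip(customers, vectors):
    print(row["id"], [round(x, 3) for x in vec])
```

Scaling matters because income in the hundreds of thousands would otherwise swamp age in any distance calculation.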

To create any data mining model, we have to set the fields and parameters of that model.

MODEL FIELDS

Content Type defines the kind of data held in the model field. A field can be a key field, discrete, continuous, or ordered.

Parameter Values need to be defined for each model field. The general parameters are weight, default value, and binning intervals.

Values for a model field are defined on the basis of the content type selected. Here we generally define which values to ignore, how missing values are treated, and the valid ranges of values.
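
The field settings described above can be pictured with a small sketch. This is purely illustrative: the `ModelField` class, the `prepare` function, and the way bins are numbered are hypothetical names and choices, not the data structures of any actual data mining workbench.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ModelField:
    """Hypothetical container for the per-field settings described in the text."""
    name: str
    content_type: str                  # "key", "discrete", "continuous", or "ordered"
    weight: float = 1.0                # relative importance of the field
    default: Optional[float] = None    # substituted when a value is missing
    bin_edges: list = field(default_factory=list)  # upper edges of the binning intervals
    valid_range: Optional[tuple] = None            # (low, high), inclusive


def prepare(fld, raw):
    """Return the bin index for a raw value, or None if the value is ignored."""
    if raw is None:
        raw = fld.default              # fall back to the default for missing values
    if raw is None:
        return None                    # no default configured: ignore the value
    if fld.valid_range is not None:
        low, high = fld.valid_range
        if not (low <= raw <= high):
            return None                # outside the valid range: ignore the value
    for index, edge in enumerate(fld.bin_edges):
        if raw <= edge:
            return index
    return len(fld.bin_edges)          # above the last edge: final bin


# Example settings for the Age field; all numbers are made up for illustration.
age = ModelField("Age", "continuous", weight=2.0, default=30,
                 bin_edges=[25, 40, 60], valid_range=(18, 100))

print(prepare(age, 28), prepare(age, None), prepare(age, 150))
```

Here a missing age falls back to the default of 30, and an age of 150 is dropped because it lies outside the valid range.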

MODEL PARAMETERS

Model parameters are defined for the model as a whole. They include the number of clusters, the maximum number of distinct values allowed for attributes, and stopping conditions such as the maximum number of iterations and the minimum fraction of inter cluster loops.
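
To show how such parameters steer a clustering run, here is a minimal k-means sketch. K-means is used only as a familiar stand-in; the text does not say which algorithm the tool runs. Reading the fraction-based stopping condition as "stop when the share of records that changed cluster in an iteration falls below a threshold" is likewise an assumption, as are the names `k_means`, `max_iterations`, and `min_change_fraction`.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, num_clusters, max_iterations=20,
            min_change_fraction=0.01, seed=0):
    """Cluster points; return (labels, centers, iterations actually run)."""
    rng = random.Random(seed)
    # Start from randomly chosen data points as the initial centers.
    centers = [list(p) for p in rng.sample(points, num_clusters)]
    labels = [-1] * len(points)
    for iteration in range(1, max_iterations + 1):
        changed = 0
        # Assignment step: move each point to its nearest center.
        for i, p in enumerate(points):
            nearest = min(range(num_clusters),
                          key=lambda c: squared_distance(p, centers[c]))
            if nearest != labels[i]:
                labels[i] = nearest
                changed += 1
        # Update step: each center becomes the mean of its members.
        for c in range(num_clusters):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
        # Stopping condition: too few points changed cluster this round.
        if changed / len(points) < min_change_fraction:
            break
    return labels, centers, iteration


# Two well-separated groups of made-up points.
points = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
          [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
labels, centers, iterations = k_means(points, num_clusters=2)
print(labels, iterations)
```

Whichever condition is met first ends the run: the iteration cap guards against long runs, while the change fraction stops early once the clusters have settled.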

The last step in the clustering process is to use the clustering result and derive strategies from it. We can analyze the clustering output by integrating the data mining model into the Analysis Process Designer (APD). The output can be viewed as an influence chart, as a value distribution chart, or in PMML format.

The influence chart shows the relative importance of every attribute considered in the formation of clusters. The higher the index, the greater the attribute's influence in deciding which cluster an entity is assigned to.

Using the value distribution chart, we can see how the values of an attribute are distributed within a cluster and also across the various clusters.
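
The numbers behind a value distribution chart can be sketched in a few lines. The `assignments` list below is made-up clustering output, and `value_distribution` is a hypothetical helper; the sketch only shows the kind of per-cluster proportions such a chart would plot.

```python
from collections import Counter, defaultdict

# Hypothetical clustering output: (cluster label, policy type) per customer.
assignments = [
    (0, "Silver"), (0, "Silver"), (0, "Gold"),
    (1, "Gold"), (1, "Platinum"), (1, "Platinum"), (1, "Platinum"),
]


def value_distribution(pairs):
    """Return, per cluster, the share of each attribute value in that cluster."""
    counts = defaultdict(Counter)
    for cluster, value in pairs:
        counts[cluster][value] += 1
    return {
        cluster: {value: n / sum(counter.values())
                  for value, n in counter.items()}
        for cluster, counter in counts.items()
    }


distribution = value_distribution(assignments)
for cluster in sorted(distribution):
    print(cluster, distribution[cluster])
```

Comparing the rows shows which values set the clusters apart. In this made-up data, Platinum policies appear only in cluster 1.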

We can also display the clustering results in the PMML format. Predictive Model Markup Language (PMML) is an XML-based language that enables applications to define statistical and data mining models.
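
To give a feel for what a clustering model looks like in PMML, the sketch below builds a simplified fragment with Python's standard library. It uses element names from the public PMML specification (`ClusteringModel`, `ClusteringField`, `Cluster`, and so on), but it has not been validated against the schema, and the PMML version, model name, field names, and cluster centers are assumptions for illustration. The output of an actual data mining tool will differ.

```python
import xml.etree.ElementTree as ET


def clusters_to_pmml(field_names, centers):
    """Build a simplified, unvalidated PMML-style clustering document."""
    # The version and namespace are assumed; tools may emit a different version.
    pmml = ET.Element("PMML", version="4.4",
                      xmlns="http://www.dmg.org/PMML-4_4")
    ET.SubElement(pmml, "Header", description="Illustrative clustering model")

    # Declare the input fields.
    dictionary = ET.SubElement(pmml, "DataDictionary",
                               numberOfFields=str(len(field_names)))
    for name in field_names:
        ET.SubElement(dictionary, "DataField", name=name,
                      optype="continuous", dataType="double")

    # Describe the center-based clustering model itself.
    model = ET.SubElement(pmml, "ClusteringModel",
                          modelName="customer_segments",  # hypothetical name
                          functionName="clustering",
                          modelClass="centerBased",
                          numberOfClusters=str(len(centers)))
    schema = ET.SubElement(model, "MiningSchema")
    for name in field_names:
        ET.SubElement(schema, "MiningField", name=name)
    measure = ET.SubElement(model, "ComparisonMeasure", kind="distance")
    ET.SubElement(measure, "squaredEuclidean")
    for name in field_names:
        ET.SubElement(model, "ClusteringField", field=name)

    # One Cluster element per center, with its coordinates in an Array.
    for index, center in enumerate(centers):
        cluster = ET.SubElement(model, "Cluster", name=f"cluster_{index}")
        array = ET.SubElement(cluster, "Array", n=str(len(center)), type="real")
        array.text = " ".join(str(v) for v in center)

    return ET.tostring(pmml, encoding="unicode")


# Made-up scaled centers for two customer segments.
xml_text = clusters_to_pmml(["age", "income"], [[0.2, 0.1], [0.8, 0.9]])
print(xml_text)
```

Because the model is plain XML, another application can read the cluster centers and assign new customers to segments without rerunning the analysis.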

Conclusion

Thus we can see that cluster analysis is a powerful tool for recognizing patterns of usage or customer habits. More and more enterprises today use various forms of cluster analysis to segment their markets, customer base, or product offerings. Grouping data objects intelligently, so that similar traits fall together and dissimilar ones fall apart, helps companies understand how much differentiation exists within their customer base.

So the next time you see a McDonald’s opening in your neighborhood, which already has some eating joints, don’t predict doomsday for the Big Mac… probably some cluster analysis has gone into the decision.
