

Birds of a feather flock together,
But ever wondered why and how
A crow does not flock with the dove
Such is the nature’s law
Which enterprises are using without a flaw

Suppose that we have to allocate a number of automated teller machines (ATMs) in a given region so as to satisfy a number of constraints. These constraints could be of the following types:

  • Population density of the region
  • Technology feasibility in terms of last mile connectivity or access
  • Sound commercial premises for housing the ATM
  • Statistical distribution of the users: commercial establishments or residential setups
  • Topography of the selected region or area, etc.

Households or places of work may be clustered so that typically one ATM is assigned per cluster. The clustering, however, may be constrained by factors such as the location of bridges, rivers, and highways, which can affect ATM accessibility. Additional constraints may involve limits on the number of ATMs per district within the region. Given such constraints, we can design our clusters and then use them for cluster analysis.

A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters. The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering.
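To make the definition concrete, here is a minimal sketch in Python. The two groups of points are hypothetical toy data, and the function names are illustrative assumptions. It only shows what "similar within, dissimilar across" means in terms of distances.

```python
from itertools import combinations
from math import dist

# Hypothetical toy data: two hand-labelled groups of 2-D points.
clusters = {
    "A": [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5)],
    "B": [(8.0, 8.0), (8.5, 9.0), (9.0, 8.5)],
}

def mean_intra_distance(points):
    """Average distance between every pair of points in one cluster."""
    pairs = list(combinations(points, 2))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

def mean_inter_distance(points_a, points_b):
    """Average distance between points drawn from two different clusters."""
    total = sum(dist(p, q) for p in points_a for q in points_b)
    return total / (len(points_a) * len(points_b))

intra_a = mean_intra_distance(clusters["A"])
intra_b = mean_intra_distance(clusters["B"])
inter_ab = mean_inter_distance(clusters["A"], clusters["B"])

print(f"within A: {intra_a:.2f}, within B: {intra_b:.2f}, between A and B: {inter_ab:.2f}")
```

For a good clustering, the two "within" figures should be clearly smaller than the "between" figure.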

Cluster analysis has wide applications including market or customer segmentation, pattern recognition, biological studies, spatial data analysis, Web document classification, and many others. Cluster analysis can be used as a standard data mining tool to gain insight into the data distribution, or serve as a preprocessing step for other data mining algorithms operating on the detected clusters.


An insurance company might want to identify the potential market for a few policies by segmenting its customer base according to attributes such as income, age, sex, risk category, and policy type held.

Customer_id  Name       Age  Sex  Income   Risk Category  Policy Type
Cust001      Kamaljeet  25   M    300000   Medium         Silver
Cust002      Craig      30   M    1000000  Low            Gold
Cust003      Pooja      28   F    20000    High           Platinum

To create a data mining model, we have to define the fields and the parameters of that model.



Content Type defines the kind of data in a model field. The data can be a key field, discrete, continuous, or ordered.

Parameter Values need to be defined for each model field. The general parameters are weight, default value, and binning intervals.

Values for a model field are defined on the basis of the content type selected. Here we generally define which values to ignore, how to treat missing values, and the valid ranges of values.
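The field settings described above can be pictured with a small Python sketch. The `FIELDS` dictionary, its keys, and the `prepare` function are hypothetical and do not mirror the SAP BW modelling screens. Replacing an ignored or invalid value with the default is just one possible reading of these settings.

```python
# Hypothetical field settings; names and structure are illustrative only.
FIELDS = {
    "customer_id": {"content_type": "key"},
    "age": {
        "content_type": "continuous", "weight": 1.0,
        "valid_range": (18, 100), "default": 40,
        "bins": [30, 45, 60],  # upper bounds of the binning intervals
    },
    "risk_category": {
        "content_type": "discrete", "weight": 2.0,
        "ignore": {"Unknown"}, "default": "Medium",
    },
}

def bin_index(value, upper_bounds):
    """Return the index of the first interval whose upper bound exceeds value."""
    for i, bound in enumerate(upper_bounds):
        if value < bound:
            return i
    return len(upper_bounds)

def prepare(record):
    """Apply the content-type rules to one raw record."""
    prepared = {}
    for name, spec in FIELDS.items():
        if spec["content_type"] == "key":
            continue  # key fields identify a record; they are not clustered on
        value = record.get(name)
        if spec["content_type"] == "continuous":
            low, high = spec["valid_range"]
            if value is None or not (low <= value <= high):
                value = spec["default"]  # missing or out-of-range value
            value = bin_index(value, spec["bins"])
        elif value is None or value in spec["ignore"]:
            value = spec["default"]  # missing or ignored discrete value
        prepared[name] = (value, spec["weight"])
    return prepared

clean = prepare({"customer_id": "Cust001", "age": 25, "risk_category": "Medium"})
dirty = prepare({"customer_id": "Cust004", "age": 250, "risk_category": "Unknown"})
print(clean)
print(dirty)
```

The second record is invented to show the fallback path: its out-of-range age and ignored risk category are both replaced by the defaults before binning.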



Model parameters are defined for the whole model that we created. We define parameters such as the number of clusters, the maximum number of distinct values allowed for attributes, and stopping conditions such as the maximum number of iterations and the minimum fraction of inter-cluster loops.
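How such model parameters steer a run can be sketched with a plain k-means loop in Python. This is an assumption for illustration, not necessarily the algorithm SAP BW runs internally. The parameter names `k`, `max_iterations`, and `min_changed_fraction` are hypothetical, and the last one is my reading of the "minimum fraction" stopping condition as the share of records that change cluster in an iteration.

```python
from math import dist

def kmeans(points, k, max_iterations=10, min_changed_fraction=0.01):
    """Plain k-means; stops on an iteration cap or when few records move."""
    centers = [points[i] for i in range(k)]  # naive start: the first k points
    assignment = [None] * len(points)
    for iteration in range(1, max_iterations + 1):
        changed = 0
        for i, p in enumerate(points):
            nearest = min(range(k), key=lambda c: dist(p, centers[c]))
            if nearest != assignment[i]:
                assignment[i] = nearest
                changed += 1
        # Recompute each center as the mean of its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster emptied out
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
        if changed / len(points) < min_changed_fraction:
            break  # almost nobody switched cluster: treat as converged
    return assignment, centers, iteration

# Hypothetical, already scaled toy records.
points = [(1.0, 1.0), (5.0, 5.2), (1.2, 0.8), (5.1, 4.9), (0.9, 1.1), (4.8, 5.0)]
assignment, centers, iterations = kmeans(points, k=2)
print(assignment, iterations)
```

On this toy data the second pass moves no record, so the loop stops well before the iteration cap.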

The last step in the clustering process is to use the clustering result and derive strategies from this knowledge. We can analyze the clustering output by integrating the created data mining model into the Analysis Process Designer (APD). We can view the clustering output using influence charts, value distribution charts, or the PMML format.

The influence chart represents the relative importance of every attribute considered for clustering in the formation of clusters. The higher the index, the higher the influence in deciding which cluster an entity is assigned to.
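The exact formula behind the influence index is not given here, so the following Python sketch uses a stand-in measure of my own choosing: the share of a numeric attribute's variance that is explained by cluster membership. The function name and the sample values are hypothetical.

```python
def influence(values, labels):
    """Share of an attribute's variance explained by cluster membership (0 to 1)."""
    overall_mean = sum(values) / len(values)
    total_ss = sum((v - overall_mean) ** 2 for v in values)
    if total_ss == 0:
        return 0.0  # a constant attribute cannot separate clusters
    between_ss = 0.0
    for label in set(labels):
        members = [v for v, l in zip(values, labels) if l == label]
        cluster_mean = sum(members) / len(members)
        between_ss += len(members) * (cluster_mean - overall_mean) ** 2
    return between_ss / total_ss

# Hypothetical clustered customers: labels plus two attributes.
labels = [0, 0, 0, 1, 1, 1]
age = [25, 30, 28, 52, 47, 61]
income = [300000, 1000000, 20000, 400000, 900000, 50000]

age_influence = influence(age, labels)
income_influence = influence(income, labels)
print(f"age: {age_influence:.2f}, income: {income_influence:.2f}")
```

In this invented data the clusters differ mainly by age, so age gets the higher value. That is the kind of ranking an influence chart conveys.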

Using the value distribution chart, we can see the distribution of values for the attributes within a cluster and also across the various clusters.
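The numbers behind such a chart are simple counts. A minimal Python sketch with hypothetical clustered records (cluster label and risk category) could look like this:

```python
from collections import Counter, defaultdict

# Hypothetical clustered records: (cluster label, risk category).
records = [
    (1, "Medium"), (1, "Low"), (1, "Low"),
    (2, "High"), (2, "High"), (2, "Medium"),
]

# Distribution of values inside each cluster.
per_cluster = defaultdict(Counter)
for cluster, risk in records:
    per_cluster[cluster][risk] += 1

# Distribution of values across all clusters.
overall = Counter(risk for _, risk in records)

for cluster in sorted(per_cluster):
    size = sum(per_cluster[cluster].values())
    shares = {value: count / size for value, count in per_cluster[cluster].items()}
    print(f"cluster {cluster}: {shares}")
print(f"all clusters: {dict(overall)}")
```

Comparing the shares of one cluster with the overall counts shows which values are typical for that cluster.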

We can also display the clustering results in the PMML format. Predictive Model Markup Language (PMML) is an XML-based language that enables applications to define statistical and data mining models.
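To give an idea of what such a document contains, the Python sketch below builds a heavily abbreviated PMML clustering model with the standard library. The element names follow my understanding of the PMML `ClusteringModel` structure. The function name, model name, field names, and centers are hypothetical, and the output is neither validated against the PMML schema nor an actual SAP BW export.

```python
import xml.etree.ElementTree as ET

def clustering_pmml(field_names, centers):
    """Build an abbreviated PMML document for a center-based clustering model."""
    pmml = ET.Element("PMML", version="4.4", xmlns="http://www.dmg.org/PMML-4_4")
    ET.SubElement(pmml, "Header", description="Hypothetical clustering sketch")

    dictionary = ET.SubElement(pmml, "DataDictionary",
                               numberOfFields=str(len(field_names)))
    for name in field_names:
        ET.SubElement(dictionary, "DataField", name=name,
                      optype="continuous", dataType="double")

    model = ET.SubElement(pmml, "ClusteringModel",
                          modelName="customer_segments",  # hypothetical name
                          functionName="clustering",
                          modelClass="centerBased",
                          numberOfClusters=str(len(centers)))
    schema = ET.SubElement(model, "MiningSchema")
    for name in field_names:
        ET.SubElement(schema, "MiningField", name=name)

    measure = ET.SubElement(model, "ComparisonMeasure", kind="distance")
    ET.SubElement(measure, "squaredEuclidean")
    for name in field_names:
        ET.SubElement(model, "ClusteringField", field=name)

    # One Cluster element per center, holding the center coordinates.
    for index, center in enumerate(centers, start=1):
        cluster = ET.SubElement(model, "Cluster", name=f"cluster_{index}")
        array = ET.SubElement(cluster, "Array", n=str(len(center)), type="real")
        array.text = " ".join(str(x) for x in center)

    return ET.tostring(pmml, encoding="unicode")

xml_text = clustering_pmml(["age", "income"], [(27.7, 440000.0), (53.3, 450000.0)])
print(xml_text)
```

Because the model is plain XML, another application can read the cluster centers back without knowing anything about the tool that produced them.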


Thus we can see that cluster analysis is a powerful tool for recognizing patterns of usage or customer habits. More and more enterprises today use various forms of cluster analysis to segment their markets, customer base, or product offerings. Categorizing data objects intelligently into dissimilar groups of similar traits helps companies understand the level of differentiation that exists in their customer base.

So the next time you see a McDonald’s opening in your neighborhood, which already has some eating joints, don’t predict doomsday for the Big Mac… probably some cluster analysis has gone into the decision.

  • I have a concern regarding data mining on the BW server. My problem is that data mining often requires a lot of resources, meaning you need to restrict the users’ possibilities or they will overload the disks as well as the CPUs. Is it advisable to have an ”exploration” server to handle the mining tasks? Of course there will be a data synchronization issue, but in my experience data miners do not depend on up-to-date data, so I see this as a minor issue. Is there a best practice on how to handle miners in BW?
    With Kind Regards
    • Hi Kristian,

      Even though your question is not very clear to me, I will try to answer it the way I understand it, so please don’t hesitate to ask again.

      Data mining is generally used for analytical purposes, and the handling of miners depends on how good your data mining model is. We restrict the user at the modelling step only. If the model is built efficiently, then generally we don’t face any such problem. For example, if we create the data mining model as a decision tree, we can use pruning, which cuts the branches that do not affect the accuracy of the decision tree. Similarly, we can set the junk values, missing values, and all other parameters with which we can restrict our model.

      If it’s still not clear, please describe your problem in detail in the BI General forum; then I think we can come up with a better answer.


      • Thanks for your answer. But how do you restrict the user in the modelling phase? The modelling is often an iterative process where you generate several possible models and during the process you’ll create a lot of temporary datasets. Should the modelling take place on a separate server or can it live together with all the normal BI users?

        Best Regards

        • Hi Kristian,

          Sorry for intervening in your discussion. Data Mining is a component of SAP BW and runs on that server. To realize your business case, each BI user needs his own model, and preferably a BW administrator to enhance the data model where needed.

          Hope that helps.
          Best Regards,

          • Hi Kristian,

            First, thanks to Kamaljeet for bringing up this blog. To make it more interesting, you could also have included the results you derived from this clustering model, perhaps some major conclusions you reached by analyzing the results.

            Regarding your question, Klaus is right. The data mining solution is an integral part of the SAP BW system and uses the same resources as other BW functions such as cubes and queries.

            As regards the creation of temporary data sets, this is done during the runtime of the mining runs (training, prediction, etc.), and they are deleted immediately after those runs. Only the results of those runs are retained. Do have a look at the BI documentation space for more information. There is also a Performance Sizing Paper for data mining applications, which you can get in the Service Marketplace.


          • Thanks to Harish and Klaus for your answers and for giving your valuable time to read my weblog.

            Harish, I will keep your suggestion in mind for my future weblogs. Thanks for the update.


  • Hello Kamaljeet,

    Thank you for your description of clustering. Your example shows seven attributes. Could you please explain the difference between the classification attribute Risk Category and Income in your influence chart? These attributes do not seem to be discerning. Only the content types are different, but in your influence chart they should have the same influence on the cluster. Or were the content types essential for the influence?

    Best Regards,

    • Hi Kamaljeet,

      I will specify my problem. In my opinion, the attribute Risk Category could itself be the result of a classification (k-means clustering) of your customers into groups. Therefore you need a second continuous content type, e.g. insurance output or something similar.
      Then you will get the identification of your customers, e.g. a cluster with a strong influence of insurance output for age < 30, income < 1000000, etc.

      Now the knowledge gained is risk category = high for that cluster. The market potential lies in finding new customers who do not fall into this cluster.

      Best Regards,

      • Hi Klaus,

        Sorry for the delay in replying, as SDN was not working properly on my side, and thanks for taking the time to read my weblog.

        I agree with the solution you gave; it could be one of the possible scenarios. My intention in including risk_category as an attribute is to give the client multiple options for segmentation.

        Once you get the clusters, you can apply rules on the basis of that data and conclude your result. It is also possible to first derive some rules and then apply the clustering mechanism on that. So it all depends on the client’s requirement.

        But I really appreciate your solution, as it’s the more generic one. Thanks for that.


        • Hi Kamaljeet,

          Thank you for your answer. Sorry, but if you use a risk category, you should not need a data mining tool for that; a data warehouse is sufficient.

          In my opinion, the strength of SAP data mining is to use the results in your operative system, e.g. CRM. How could a client use SAP DM clustering results for his operative business process? Do you know of a business case?

          Best Regards,

          • Hi Klaus,

            I agree with your points. The solution I proposed could be one of the possible options (I never came across any such business case; it was my own solution), but if we want to use the strength of data mining effectively, your approach is more efficient.
            I am really thankful to you, as this is how we can learn something new from each other. Even though I was already aware of what you said, what matters is who caught it first.