Introduction

former_member492038 · ‎09-06-2017

Introduction

If you are in a situation where you have a multitude of reports within your Business Objects environment and would like to consolidate them, the first question that comes to mind is Which reports are very similar to one another? While the answer there may often be subjective, this post will provide you with a methodology that aims to measure similarity in as objective a manner as possible. While further review of individual reports will be required before doing any kind of consolidation, this methodology should at least help you identify a good starting point within your system.

This analysis assumes that you can get a listing of every report and its dimensions / measures, which you should be able to pull using the appropriate SDK for your environment. I've chose to only look at those two expressions for this example as slight variations in other expressions (say filters) may not justify having a separate report. However, you can extend this methodology to take filters and other expressions into consideration should you so choose.

Pre-Requisites

This article assumes some familiarity with Python, and the code used here is supported in Python 3.X. If you choose to run Python 2.X, you may need to alter some of the samples to get it to work. Additionally, you will need to install all of the following packages:

Pandas

NumPy

SciPy

If you are new or relatively inexperienced with Python and managing packages, you may want to just install Anaconda which will include all of the above packages.

As mentioned in the introduction, you will also need to pull a listing of reports and the expressions (i.e. dimensions/measures) that they are using. How to do this will depend on your environment, but generally you could either:

Leverage the RESTful Web Service SDK if in a 4.X environment OR

Leverage the Java Report Engine if in a 3.X environment

It's also worth nothing that there's nothing stopping you from comparing different types of reports. If your business is open to the idea that a Deski report could be consolidated with a Webi report with similar expressions, then there is nothing in the analysis below that would prevent you from being able to that, provided again that you can just get the listing of expressions used in both types of reports.

Sample Data

As mentioned in the introduction, the first thing you'll want to do is get a listing of every dimension and metric by report in your repository. For this blog entry, feel free to use the following data and save it as a CSV file (assumed to be "report_metadata.csv"):

Report,Expression

Report 1,Dimension_A

Report 1,Dimension_B

Report 1,Dimension_C

Report 1,Dimension_D

Report 1,Measure_A

Report 1,Measure_B

Report 1,Measure_C

Report 1,Measure_D

Report 2,Dimension_A

Report 2,Dimension_C

Report 2,Dimension_D

Report 2,Measure_A

Report 2,Measure_B

Report 2,Measure_C

Report 2,Measure_D

Report 3,Dimension_A

Report 3,Dimension_B

Report 3,Dimension_E

Report 3,Dimension_G

Report 3,Measure_A

Report 3,Measure_E

Report 3,Measure_F

Report 3,Measure_G

Report 4,Dimension_E

Report 4,Dimension_F

Report 4,Dimension_G

Report 4,Dimension_H

Report 4,Measure_E

Report 4,Measure_F

Report 4,Measure_G

Report 4,Measure_H

Report 5,Dimension_H

Report 5,Dimension_B

Report 5,Dimension_C

Report 5,Dimension_D

Report 5,Measure_H

Report 5,Measure_B

Report 5,Measure_E

Report 5,Measure_D

Report 6,Dimension_A

Report 6,Dimension_E

Report 6,Dimension_G

Report 6,Dimension_D

Report 6,Measure_G

Report 6,Measure_H

Report 6,Measure_I

Report 6,Measure_J

Analysis

With data in hand, go ahead and start an interactive Python session and kick things off by importing all of the dependent packages:

import pandas as pd

import numpy as np

from scipy.spatial.distance import squareform, pdist

With those packages imported, start by reading your data into a DataFrame.

df = pd.read_csv('YOUR_DATA.csv')

Call df.head() to inspect the first 5 rows of data. If everything is good so far your data should look something as follows:

     Report   Expression

0  Report 1  Dimension_A

1  Report 1  Dimension_B

2  Report 1  Dimension_C

3  Report 1  Dimension_D

4  Report 1    Measure_A

Next, we'll use the get_dummies function to generate columns for each categorical in the Expression column with a binary indicator showing whether or not that categorical is being referenced by a particular row.

 dummies = pd.get_dummies(df['Expression'])

Go ahead and run dummies.head() to see the data. You should notice columns called "Dimension_A", "Dimension_B", etc... and the value of each column should be a 0 or a 1, indicating whether or not that particular row in your DataFrame uses that expression.

   Dimension_A  Dimension_B  Dimension_C  Dimension_D  Dimension_E 

0            1            0            0            0            0   

1            0            1            0            0            0   

2            0            0            1            0            0   

3            0            0            0            1            0   

4            0            0            0            0            0

Go ahead and concat those records along the rows to your DataFrame:

 df = pd.concat([df, dummies], axis=1)

Next, group by the Report Name column and aggregate all of the metrics using the "sum" function.

grp = df.groupby('Report').sum()

Running grp.head() will show you that you've now created one unique entry per report, and all of the dummy categorical columns will be either 0 or 1, indicating if that report uses the given Dimension or Metric:

          Dimension_A  Dimension_B  Dimension_C  Dimension_D  Dimension_E  

Report                                                                      

Report 1            1            1            1            1            0   

Report 2            1            0            1            1            0   

Report 3            1            1            0            0            1   

Report 4            0            0            0            0            1   

Report 5            0            1            1            1            0

Now we can go ahead and calculate the "Jaccard distance" between each report. You can choose other binary distance measurements if you would like, but this particular measurement is popular when dealing with binary data because it does not take into consideration variables that are False for both items when computing their distance. So if you were only looking at Report 1 and Report 2 above, only considering Dimensions A through E, the distance between those two would be calculated as .25. That's because Dimension_E is 0 ("False") for both, and out of the remaining dimensions under consideration they only vary on one out of the 4, hence .25.

We'll use the pdist function to calculate those distances, and then use squareform to convert those distances into a nxn matrix, where n is the number of reports being analyzed and the values at each point within that matrix is the distance between those reports. Given the distance of any report to itself will always be 0, we'll also replace the diagonal with np.nan to differentiate from the rest of the data.

dist = pdist(grp, metric="jaccard")

s_dist = squareform(dist)

# Fill diagonals with nulls

np.fill_diagonal(s_dist, np.nan)

At this point, we have a matrix containing the distances of each report, but we are looking for the similarities. Thankfully, we can get the similarities by subtracting the distances from 1:

sim = np.subtract(1, s_dist)

Go ahead and convert this information back to a DataFrame, using the Report Names along the x and y axis for easy reference:

 sim_df = pd.DataFrame(sim, columns=grp.index, index=grp.index)

At this point, you can inspect your data and see the similarities of each report calculated against one another. The numbers here will be on a scale from 0 to 1, with 0 meaning they have nothing in common and 1 meaning they are identical:

Report    Report 1  Report 2  Report 3  Report 4  Report 5  Report 6

Report                                                              

Report 1       NaN  0.875000  0.230769  0.000000  0.454545  0.142857

Report 2  0.875000       NaN  0.153846  0.000000  0.363636  0.153846

Report 3  0.230769  0.153846       NaN  0.454545  0.142857  0.333333

Report 4  0.000000  0.000000  0.454545       NaN  0.230769  0.333333

Report 5  0.454545  0.363636  0.142857  0.230769       NaN  0.142857

Report 6  0.142857  0.153846  0.333333  0.333333  0.142857       NaN

With the sample data provided, you can see that reports 1 and 2 are the most similar, with an 87.5% overlap of dimensions and measures. On the other end of the extreme, report 4 has nothing in common with report 1 and report 2.

How to Measure Report Similarity Using Python

Introduction

Pre-Requisites

Sample Data

Analysis

SAP PI for Beginners

ABAP 7.40 Quick Reference

Fiori: technical installation and configuration of one app from A - Z