Skip to Content

How to Measure Report Similarity Using Python

Introduction

If you are in a situation where you have a multitude of reports within your Business Objects environment and would like to consolidate them, the first question that comes to mind is Which reports are very similar to one another? While the answer there may often be subjective, this post will provide you with a methodology that aims to measure similarity in as objective a manner as possible. While further review of individual reports will be required before doing any kind of consolidation, this methodology should at least help you identify a good starting point within your system.

This analysis assumes that you can get a listing of every report and its dimensions / measures, which you should be able to pull using the appropriate SDK for your environment. I’ve chose to only look at those two expressions for this example as slight variations in other expressions (say filters) may not justify having a separate report. However, you can extend this methodology to take filters and other expressions into consideration should you so choose.

Pre-Requisites

This article assumes some familiarity with Python, and the code used here is supported in Python 3.X. If you choose to run Python 2.X, you may need to alter some of the samples to get it to work. Additionally, you will need to install all of the following packages:

If you are new or relatively inexperienced with Python and managing packages, you may want to just install Anaconda which will include all of the above packages.

As mentioned in the introduction, you will also need to pull a listing of reports and the expressions (i.e. dimensions/measures) that they are using. How to do this will depend on your environment, but generally you could either:

It’s also worth nothing that there’s nothing stopping you from comparing different types of reports. If your business is open to the idea that a Deski report could be consolidated with a Webi report with similar expressions, then there is nothing in the analysis below that would prevent you from being able to that, provided again that you can just get the listing of expressions used in both types of reports.

Sample Data

As mentioned in the introduction, the first thing you’ll want to do is get a listing of every dimension and metric by report in your repository. For this blog entry, feel free to use the following data and save it as a CSV file (assumed to be “report_metadata.csv”):

Report,Expression
Report 1,Dimension_A
Report 1,Dimension_B
Report 1,Dimension_C
Report 1,Dimension_D
Report 1,Measure_A
Report 1,Measure_B
Report 1,Measure_C
Report 1,Measure_D
Report 2,Dimension_A
Report 2,Dimension_C
Report 2,Dimension_D
Report 2,Measure_A
Report 2,Measure_B
Report 2,Measure_C
Report 2,Measure_D
Report 3,Dimension_A
Report 3,Dimension_B
Report 3,Dimension_E
Report 3,Dimension_G
Report 3,Measure_A
Report 3,Measure_E
Report 3,Measure_F
Report 3,Measure_G
Report 4,Dimension_E
Report 4,Dimension_F
Report 4,Dimension_G
Report 4,Dimension_H
Report 4,Measure_E
Report 4,Measure_F
Report 4,Measure_G
Report 4,Measure_H
Report 5,Dimension_H
Report 5,Dimension_B
Report 5,Dimension_C
Report 5,Dimension_D
Report 5,Measure_H
Report 5,Measure_B
Report 5,Measure_E
Report 5,Measure_D
Report 6,Dimension_A
Report 6,Dimension_E
Report 6,Dimension_G
Report 6,Dimension_D
Report 6,Measure_G
Report 6,Measure_H
Report 6,Measure_I
Report 6,Measure_J

Analysis

With data in hand, go ahead and start an interactive Python session and kick things off by importing all of the dependent packages:

import pandas as pd
import numpy as np
from scipy.spatial.distance import squareform, pdist

With those packages imported, start by reading your data into a DataFrame.

df = pd.read_csv('YOUR_DATA.csv')

Call df.head() to inspect the first 5 rows of data. If everything is good so far your data should look something as follows:

     Report   Expression
0  Report 1  Dimension_A
1  Report 1  Dimension_B
2  Report 1  Dimension_C
3  Report 1  Dimension_D
4  Report 1    Measure_A

Next, we’ll use the get_dummies function to generate columns for each categorical in the Expression column with a binary indicator showing whether or not that categorical is being referenced by a particular row.

 dummies = pd.get_dummies(df['Expression'])

Go ahead and run dummies.head() to see the data. You should notice columns called “Dimension_A”, “Dimension_B”, etc… and the value of each column should be a 0 or a 1, indicating whether or not that particular row in your DataFrame uses that expression.

   Dimension_A  Dimension_B  Dimension_C  Dimension_D  Dimension_E 
0            1            0            0            0            0   
1            0            1            0            0            0   
2            0            0            1            0            0   
3            0            0            0            1            0   
4            0            0            0            0            0   

Go ahead and concat those records along the rows to your DataFrame:

 df = pd.concat([df, dummies], axis=1)

Next, group by the Report Name column and aggregate all of the metrics using the “sum” function.

grp = df.groupby('Report').sum()

Running grp.head() will show you that you’ve now created one unique entry per report, and all of the dummy categorical columns will be either 0 or 1, indicating if that report uses the given Dimension or Metric:

          Dimension_A  Dimension_B  Dimension_C  Dimension_D  Dimension_E  
Report                                                                      
Report 1            1            1            1            1            0   
Report 2            1            0            1            1            0   
Report 3            1            1            0            0            1   
Report 4            0            0            0            0            1   
Report 5            0            1            1            1            0 

Now we can go ahead and calculate the “Jaccard distance” between each report. You can choose other binary distance measurements if you would like, but this particular measurement is popular when dealing with binary data because it does not take into consideration variables that are False for both items when computing their distance. So if you were only looking at Report 1 and Report 2 above, only considering Dimensions A through E, the distance between those two would be calculated as .25. That’s because Dimension_E is 0 (“False”) for both, and out of the remaining dimensions under consideration they only vary on one out of the 4, hence .25.

We’ll use the pdist function to calculate those distances, and then use squareform to convert those distances into a nxn matrix, where n is the number of reports being analyzed and the values at each point within that matrix is the distance between those reports. Given the distance of any report to itself will always be 0, we’ll also replace the diagonal with np.nan to differentiate from the rest of the data.

dist = pdist(grp, metric="jaccard")
s_dist = squareform(dist)
# Fill diagonals with nulls
np.fill_diagonal(s_dist, np.nan)

At this point, we have a matrix containing the distances of each report, but we are looking for the similarities. Thankfully, we can get the similarities by subtracting the distances from 1:

sim = np.subtract(1, s_dist)

Go ahead and convert this information back to a DataFrame, using the Report Names along the x and y axis for easy reference:

 sim_df = pd.DataFrame(sim, columns=grp.index, index=grp.index)

At this point, you can inspect your data and see the similarities of each report calculated against one another. The numbers here will be on a scale from 0 to 1, with 0 meaning they have nothing in common and 1 meaning they are identical:

Report    Report 1  Report 2  Report 3  Report 4  Report 5  Report 6
Report                                                              
Report 1       NaN  0.875000  0.230769  0.000000  0.454545  0.142857
Report 2  0.875000       NaN  0.153846  0.000000  0.363636  0.153846
Report 3  0.230769  0.153846       NaN  0.454545  0.142857  0.333333
Report 4  0.000000  0.000000  0.454545       NaN  0.230769  0.333333
Report 5  0.454545  0.363636  0.142857  0.230769       NaN  0.142857
Report 6  0.142857  0.153846  0.333333  0.333333  0.142857       NaN

With the sample data provided, you can see that reports 1 and 2 are the most similar, with an 87.5% overlap of dimensions and measures. On the other end of the extreme, report 4 has nothing in common with report 1 and report 2.

1 Comment
You must be Logged on to comment or reply to a post.