How to Measure Report Similarity Using Python
If you are in a situation where you have a multitude of reports within your Business Objects environment and would like to consolidate them, the first question that comes to mind is Which reports are very similar to one another? While the answer there may often be subjective, this post will provide you with a methodology that aims to measure similarity in as objective a manner as possible. While further review of individual reports will be required before doing any kind of consolidation, this methodology should at least help you identify a good starting point within your system.
This analysis assumes that you can get a listing of every report and its dimensions / measures, which you should be able to pull using the appropriate SDK for your environment. I’ve chose to only look at those two expressions for this example as slight variations in other expressions (say filters) may not justify having a separate report. However, you can extend this methodology to take filters and other expressions into consideration should you so choose.
This article assumes some familiarity with Python, and the code used here is supported in Python 3.X. If you choose to run Python 2.X, you may need to alter some of the samples to get it to work. Additionally, you will need to install all of the following packages:
If you are new or relatively inexperienced with Python and managing packages, you may want to just install Anaconda which will include all of the above packages.
As mentioned in the introduction, you will also need to pull a listing of reports and the expressions (i.e. dimensions/measures) that they are using. How to do this will depend on your environment, but generally you could either:
- Leverage the RESTful Web Service SDK if in a 4.X environment OR
- Leverage the Java Report Engine if in a 3.X environment
It’s also worth nothing that there’s nothing stopping you from comparing different types of reports. If your business is open to the idea that a Deski report could be consolidated with a Webi report with similar expressions, then there is nothing in the analysis below that would prevent you from being able to that, provided again that you can just get the listing of expressions used in both types of reports.
As mentioned in the introduction, the first thing you’ll want to do is get a listing of every dimension and metric by report in your repository. For this blog entry, feel free to use the following data and save it as a CSV file (assumed to be “report_metadata.csv”):
Report,Expression Report 1,Dimension_A Report 1,Dimension_B Report 1,Dimension_C Report 1,Dimension_D Report 1,Measure_A Report 1,Measure_B Report 1,Measure_C Report 1,Measure_D Report 2,Dimension_A Report 2,Dimension_C Report 2,Dimension_D Report 2,Measure_A Report 2,Measure_B Report 2,Measure_C Report 2,Measure_D Report 3,Dimension_A Report 3,Dimension_B Report 3,Dimension_E Report 3,Dimension_G Report 3,Measure_A Report 3,Measure_E Report 3,Measure_F Report 3,Measure_G Report 4,Dimension_E Report 4,Dimension_F Report 4,Dimension_G Report 4,Dimension_H Report 4,Measure_E Report 4,Measure_F Report 4,Measure_G Report 4,Measure_H Report 5,Dimension_H Report 5,Dimension_B Report 5,Dimension_C Report 5,Dimension_D Report 5,Measure_H Report 5,Measure_B Report 5,Measure_E Report 5,Measure_D Report 6,Dimension_A Report 6,Dimension_E Report 6,Dimension_G Report 6,Dimension_D Report 6,Measure_G Report 6,Measure_H Report 6,Measure_I Report 6,Measure_J
With data in hand, go ahead and start an interactive Python session and kick things off by importing all of the dependent packages:
import pandas as pd import numpy as np from scipy.spatial.distance import squareform, pdist
With those packages imported, start by reading your data into a DataFrame.
df = pd.read_csv('YOUR_DATA.csv')
Call df.head() to inspect the first 5 rows of data. If everything is good so far your data should look something as follows:
Report Expression 0 Report 1 Dimension_A 1 Report 1 Dimension_B 2 Report 1 Dimension_C 3 Report 1 Dimension_D 4 Report 1 Measure_A
Next, we’ll use the get_dummies function to generate columns for each categorical in the Expression column with a binary indicator showing whether or not that categorical is being referenced by a particular row.
dummies = pd.get_dummies(df['Expression'])
Go ahead and run dummies.head() to see the data. You should notice columns called “Dimension_A”, “Dimension_B”, etc… and the value of each column should be a 0 or a 1, indicating whether or not that particular row in your DataFrame uses that expression.
Dimension_A Dimension_B Dimension_C Dimension_D Dimension_E 0 1 0 0 0 0 1 0 1 0 0 0 2 0 0 1 0 0 3 0 0 0 1 0 4 0 0 0 0 0
Go ahead and concat those records along the rows to your DataFrame:
df = pd.concat([df, dummies], axis=1)
Next, group by the Report Name column and aggregate all of the metrics using the “sum” function.
grp = df.groupby('Report').sum()
Running grp.head() will show you that you’ve now created one unique entry per report, and all of the dummy categorical columns will be either 0 or 1, indicating if that report uses the given Dimension or Metric:
Dimension_A Dimension_B Dimension_C Dimension_D Dimension_E Report Report 1 1 1 1 1 0 Report 2 1 0 1 1 0 Report 3 1 1 0 0 1 Report 4 0 0 0 0 1 Report 5 0 1 1 1 0
Now we can go ahead and calculate the “Jaccard distance” between each report. You can choose other binary distance measurements if you would like, but this particular measurement is popular when dealing with binary data because it does not take into consideration variables that are False for both items when computing their distance. So if you were only looking at Report 1 and Report 2 above, only considering Dimensions A through E, the distance between those two would be calculated as .25. That’s because Dimension_E is 0 (“False”) for both, and out of the remaining dimensions under consideration they only vary on one out of the 4, hence .25.
We’ll use the pdist function to calculate those distances, and then use squareform to convert those distances into a nxn matrix, where n is the number of reports being analyzed and the values at each point within that matrix is the distance between those reports. Given the distance of any report to itself will always be 0, we’ll also replace the diagonal with np.nan to differentiate from the rest of the data.
dist = pdist(grp, metric="jaccard") s_dist = squareform(dist) # Fill diagonals with nulls np.fill_diagonal(s_dist, np.nan)
At this point, we have a matrix containing the distances of each report, but we are looking for the similarities. Thankfully, we can get the similarities by subtracting the distances from 1:
sim = np.subtract(1, s_dist)
Go ahead and convert this information back to a DataFrame, using the Report Names along the x and y axis for easy reference:
sim_df = pd.DataFrame(sim, columns=grp.index, index=grp.index)
At this point, you can inspect your data and see the similarities of each report calculated against one another. The numbers here will be on a scale from 0 to 1, with 0 meaning they have nothing in common and 1 meaning they are identical:
Report Report 1 Report 2 Report 3 Report 4 Report 5 Report 6 Report Report 1 NaN 0.875000 0.230769 0.000000 0.454545 0.142857 Report 2 0.875000 NaN 0.153846 0.000000 0.363636 0.153846 Report 3 0.230769 0.153846 NaN 0.454545 0.142857 0.333333 Report 4 0.000000 0.000000 0.454545 NaN 0.230769 0.333333 Report 5 0.454545 0.363636 0.142857 0.230769 NaN 0.142857 Report 6 0.142857 0.153846 0.333333 0.333333 0.142857 NaN
With the sample data provided, you can see that reports 1 and 2 are the most similar, with an 87.5% overlap of dimensions and measures. On the other end of the extreme, report 4 has nothing in common with report 1 and report 2.