With the introduction of data blending in SAP Lumira 1.27, a business analyst now has the option to choose between performing joins on their data, or making use of the new feature. So a simple question that comes up is: Should I join or should I blend?
Let’s review what SAP Lumira is capable of in the context of data merging and joining. By definition, a join is an instruction (to a database) to combine data from more than one table. The result is one table that contains all the data, held together by the relationships that data has with other pieces of data. These relationships, or links, are what is important for a product like SAP Lumira to perform the join. These links are known as foreign keys.
Left outer and inner joins are supported in SAP Lumira today. A left outer join results in a set where all the rows from the first, or left hand side, table are preserved. Rows from the second, or right hand side, table only show up if there is a match with the rows of the first table. Where a left table row has no match in the right table, the right table’s columns show as null.
An inner join selects all rows from both tables where there is a match based on a join predicate. Each matched pair of rows from the two tables is combined into a single row in the result table.
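The difference between the two join types can be sketched with pandas. This is a minimal illustration using made-up tables (an orders table and a customer lookup), not data from SAP Lumira itself:

```python
import pandas as pd

# Hypothetical sample tables: orders and a customer lookup.
orders = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "amount": [100, 250, 75],
})
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "region": ["East", "West"],
})

# Left outer join: every order is preserved; customer 3 has no
# match, so its "region" column is filled with null (NaN).
left = orders.merge(customers, on="customer_id", how="left")

# Inner join: only rows with a match in both tables survive,
# so customer 3's order is dropped.
inner = orders.merge(customers, on="customer_id", how="inner")

print(len(left))   # 3 rows, one with a null region
print(len(inner))  # 2 rows
```

Note how the left outer join keeps all three orders and pads the missing region with null, while the inner join silently drops the unmatched row.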
In both merge cases, one resulting table is created, which can then be further tweaked with data manipulations or calculations in the Prepare Room, and then charts/tables are created on this one table in the Visualize Room.
Data merging also requires a unique key or keys to be used to join two datasets together. Therefore it is important to understand how and if the datasets are related in such a way that a key can be found to match the data together. If the match – calculated as a percentage in SAP Lumira – is not sufficient, we do not allow the merge to continue. This is not a deficiency in the product; rather, it is a limitation of how joins can be accomplished in SQL.
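The idea of a key-match percentage can be illustrated with a simple check: what fraction of the left table’s key values can actually be found in the right table? This is a hypothetical sketch of that kind of check, not SAP Lumira’s internal algorithm:

```python
import pandas as pd

# Hypothetical key columns from two datasets to be merged.
left_keys = pd.Series([101, 102, 103, 104])
right_keys = pd.Series([101, 102, 999])

# Fraction of left-side keys that find a match on the right side.
match_pct = left_keys.isin(right_keys).mean() * 100
print(match_pct)  # 50.0
```

A low percentage here means most rows in the join result would be padded with nulls or dropped, which is why a merge on weakly related keys is rarely useful.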
Where Blending Plays a Part
Data blending seeks to overcome some of these restrictions by recognizing that business level information sometimes doesn’t include ID values or columns that allow one to merge. Rather, it recognizes that in a dataset, there are some specific columns where the information is valuable for display in a chart or table, and that information will get aggregated or summarized, along with appropriate measures or calculations. It is at this aggregated level that data from disparate sources can be visualized together. These data sources simply have to have 1 or more common dimensions upon which to link their related data together, and nulls are used to pad columns or values that don’t exist in either dataset.
Blending is typically very useful in scenarios where the data is coming from different data sources. SAP Lumira 1.27 supports blending across all of our offline data source offerings, including SAP HANA, SAP BW extension, SAP Query Panel Universe extension, Microsoft Excel, CSV, and Freehand SQL sources. This allows end users to blend corporate data with standalone personal data sources, for example. A more obvious and common case is that data sometimes just doesn’t exist in the same database, e.g. when external or public data is compared with internal data or even some other external source.
Sometimes IT can be restrictive and slow in getting the requested data properly located in one place. In the time it takes for one to wait for an ETL process to complete, one may as well look into data blending to quickly discover insights in a matter of hours or days that would otherwise take weeks or months. Granted, from a performance perspective, if the heavy lifting can be pushed to a backend or datasource by IT, it’s always worth consideration so as not to depend on the client to handle all of it.
The general mechanics of blending have already been covered in a previous blog post, but in short, blending is achieved by aggregating the datasets on either side of a blend and joining the aggregated results together based on the blending context (i.e. which dimensions we link on). In SAP Lumira 1.27, the blending context is identified by which linked dimensions are present in the visualization (i.e. the dimensions each dataset is aggregated on). The joining of the aggregate tables is performed as a left outer join; therefore it is important to identify which table you wish to use as your primary dataset, or left table. Otherwise you could lose values that you care about. Typically, the dataset containing more of the dimension values you wish to see should be the primary dataset in your blend.
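The two-step mechanic described above (aggregate each side on the blending context, then left outer join the aggregates) can be sketched in pandas. The datasets and column names here are made up for illustration:

```python
import pandas as pd

# Hypothetical datasets sharing a "Region" dimension but with
# different granularities and otherwise unrelated columns.
sales = pd.DataFrame({
    "Region": ["East", "East", "West", "West", "North"],
    "Sales": [100, 150, 200, 50, 80],
})
targets = pd.DataFrame({
    "Region": ["East", "West", "South"],
    "Target": [300, 200, 100],
})

# Step 1: aggregate each side on the blending context (Region).
sales_agg = sales.groupby("Region", as_index=False)["Sales"].sum()
targets_agg = targets.groupby("Region", as_index=False)["Target"].sum()

# Step 2: left outer join the aggregates, with the sales dataset
# as the primary (left) table. "North" keeps its sales with a
# null target; "South" is dropped since it only exists on the
# right side.
blended = sales_agg.merge(targets_agg, on="Region", how="left")
print(blended)
```

This is why the choice of primary dataset matters: had `targets_agg` been the left table instead, “South” would survive and “North” would be dropped.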
One Example of Blending Preferred Over Joining – Comparing Measures in a One-to-Many Relationship
Imagine that you have a table of sales figures in one dataset, and sales quotas in another dataset. The question to answer is how the sum of all sales performs against the monthly or quarterly quotas. For this example, I will be using the Sales Commission sample that ships with SAP Lumira.
The Sales Deals Enterprise dataset consists of Order IDs, the Dates the Orders were posted, the Sales Rep IDs responsible for making the sales, and the Sales Deal Totals for each order. It is easy to group them by date and by rep to get a total for each rep per month. For example, if you filter on Sales Rep ID 228, you get 49 Orders from Jan 2010 to Jun 2014. Some months have one entry, while others, like Jun 2014, have three. Let’s look at this Sales Rep for Jun 2014 more closely. To make the screenshots easier to follow, I will filter specifically on Sales Rep 228, for the month of Jun 2014:
In the Sales Commissions dataset, I can look at the quota for Sales Rep 228, for the month of Jun 2014:
To answer my original business question, I could join the tables together on the Sales Rep ID field and the date field, after first extracting the Month and Year from the Date field. Because I start my join from the Sales Deals Enterprise dataset, the left outer join that merging supports gives me a quota value for every order transaction I have.
I immediately see a problem with this join. If I aggregate these values, because I want to see the total sales for the month, I will also aggregate the quota values, which in this example are now tripled. This is not correct. I could adjust the aggregation type for the quota value to a minimum to compensate, but I may still run into problems with this quota value if I decide to tweak this visualization to aggregate up to a different dimension (e.g. Sales Rep or Year).
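This fan-out effect can be reproduced with a small pandas sketch. The numbers below are made up to mirror the scenario (three orders for one rep in one month, against a single monthly quota row); they are not the actual values from the Sales Commission sample:

```python
import pandas as pd

# Hypothetical recreation of the one-to-many case: three orders
# for one rep in Jun 2014, but only one quota row for that month.
deals = pd.DataFrame({
    "rep_id": [228, 228, 228],
    "month": ["2014-06"] * 3,
    "deal_total": [10000, 12000, 8000],
})
quotas = pd.DataFrame({
    "rep_id": [228],
    "month": ["2014-06"],
    "quota": [25000],
})

# Record-level left outer join: the single quota row is repeated
# once for each matching order.
joined = deals.merge(quotas, on=["rep_id", "month"], how="left")

# Summing after the join inflates the quota threefold.
summed = joined.groupby(["rep_id", "month"]).sum()
print(summed["deal_total"].iloc[0])  # 30000 - correct
print(summed["quota"].iloc[0])       # 75000 - tripled; should be 25000
```

The sales total is right, but the quota has been triple-counted because the join duplicated it at record level before aggregation.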
With blending, I can retain the values from both datasets without worrying about these values being erroneously aggregated due to record-level duplication in a join. Starting with my Sales Deals Enterprise dataset again, and showing the Sales Rep, Date of Sales, and Sales Totals:
Then, by switching to the Sales Commission dataset, I can bring in the measure for the Quota (I created a measure from the Monthly Quota Lower Bound dimension). Once the visualization detects that this measure comes from a different dataset, it understands that I am performing a blend and seeks a common dimension or set of dimensions to link and blend on. Since I have not created any links yet, I will get the Dataset Linking dialog – this is where I can link on Sales Rep ID, Month, and Year from both datasets. Note that I do not have to pick these dimensions in any specific order. They just have to match between the Sales Deals Enterprise and Sales Commission datasets.
After clicking OK, the blended visualization is produced and I can now see my Sales Rep performance by Month and Year, against the quota for that particular time period.
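Continuing the pandas sketch from before (same made-up numbers), the blend’s aggregate-then-join order is what avoids the inflated quota:

```python
import pandas as pd

# Same hypothetical data as the join example above.
deals = pd.DataFrame({
    "rep_id": [228, 228, 228],
    "month": ["2014-06"] * 3,
    "deal_total": [10000, 12000, 8000],
})
quotas = pd.DataFrame({
    "rep_id": [228],
    "month": ["2014-06"],
    "quota": [25000],
})

# Blend order: aggregate each dataset on the linked dimensions
# FIRST, then left outer join the one-row-per-context results.
deals_agg = deals.groupby(["rep_id", "month"], as_index=False)["deal_total"].sum()
quotas_agg = quotas.groupby(["rep_id", "month"], as_index=False)["quota"].sum()

blended = deals_agg.merge(quotas_agg, on=["rep_id", "month"], how="left")
print(blended["deal_total"].iloc[0])  # 30000
print(blended["quota"].iloc[0])       # 25000 - no longer inflated
```

Because each side is reduced to one row per rep and month before the join, the quota is counted exactly once, no matter how many orders fall in that month.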
In my next post, I will spend some time on the topics of filters, ranks, and secondary dimensions in a blended visualization. And if there are topics you’d like me to cover, feel free to comment below!