How To: Dynamic Transposition in HANA
Transposition Use Case
There are cases in which “horizontal” data in HANA needs to be transposed into a “vertical” format. Reasons for this transformation include:
1) Data may need to be combined from multiple fact tables where one is “horizontally” structured and another is “vertically” structured. A common structure is required in order to combine them (typically via Union with Constant Values) in a shared data model.
2) “Vertical” data structures are much easier to work with for BI data modeling and reporting. (Less fields are required in the data model, and slicing/dicing/sorting on dimension values make much more sense than awkward display of different measure columns).
SAP FICO is a typical example of a system that has tables which should be transposed. COSS, for example, has separate measure columns for each month of the year. In other cases measure columns are defined by cost elements or cost element groups.
Problems with Physical Table Transposition
Data modelers have the option of transposing the physical tables via ETL transformations. Such an approach has several problems associated with it:
1) Extra work is required to implement the ETL process.
2) Vertical transposition can increase the number of rows of a table by a few orders of magnitude. This can lead to a table with more than 2 billion rows, forcing partitioning (which required additional maintenance and complexity).
3) Performance degradation given the orders of magnitude increase in table size.
3) The SAP HANA landscape can become very complex if additional ETL transformations are required on SAP source data coming through SLT. This is not recommended.
Benefits of Dynamic Transposition
Data modelers can also implement dynamic transposition, which pivots the data on-the-fly. This approach has the following benefits:
1) No required change to physical table structures.
2) Potential for automated maintenance of one additional table require for transposition.
3) Quick, easy, intuitive and maintainable data modeling.
4) High-performance data model.
Following is a simplified, step-by-step example of how to implement Dynamic Transposition on HANA. It demonstrates how to transpose tables on-the-fly via a graphical Calculation View. Special thanks goes to Tony Cheung, Adam Baryla and Imran Rashid of SAP America/Canada for sharing details of this approach with me on a prior project.
(There are other approaches to dynamic transposition, such as using a combination of filters and UNION/AGGREGATION nodes. Such approaches may have comparable performance, but the “hard-coded”, complex design makes them difficult to develop and maintain.)
Transposition Data Flow
The following shows the “Source” table that we want to transpose as well as the desired “Target” structure. Columns M1 through M3 represent aggregated measure values at a montly level (such as in COSS).
Implementing this transposition requires the following steps in a Calculation View:
1) Create an “Identity Matrix” table with any required DIMENSION columns and, optionally, and required SORT_ORDER columns.
2) Create constant value calculated columns in each table (in implementation these will be projection nodes in a Calculation View).
3) Cross Join the two tables by joining on constant value columns.
4) Create output measure (i.e. AMOUNT) as a calculated column.
5) Query required columns (aggregate AMOUNT, group by DIMENSION, sort by SORT_ORDER)
Step 1: Create an Identity Matrix table
Please note: If this table is large and/or changed frequently, consider generating and loading an associated CSV file in a programming language of your choice. This approach can be automated and easily maintained.
DROP TABLE MONTH_IDENT_MTRX;
CREATE COLUMN TABLE MONTH_IDENT_MTRX
PRIMARY KEY (MONTH)
INSERT INTO MONTH_IDENT_MTRX VALUES (‘JAN’, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
INSERT INTO MONTH_IDENT_MTRX VALUES (‘FEB’, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
INSERT INTO MONTH_IDENT_MTRX VALUES (‘MAR’, 3, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0);
INSERT INTO MONTH_IDENT_MTRX VALUES (‘APR’, 4, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0);
INSERT INTO MONTH_IDENT_MTRX VALUES (‘MAY’, 5, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0);
INSERT INTO MONTH_IDENT_MTRX VALUES (‘JUN’, 6, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0);
INSERT INTO MONTH_IDENT_MTRX VALUES (‘JUL’, 7, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0);
INSERT INTO MONTH_IDENT_MTRX VALUES (‘AUG’, 8, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0);
INSERT INTO MONTH_IDENT_MTRX VALUES (‘SEP’, 9, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0);
INSERT INTO MONTH_IDENT_MTRX VALUES (‘OCT’, 10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0);
INSERT INTO MONTH_IDENT_MTRX VALUES (‘NOV’, 11, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0);
INSERT INTO MONTH_IDENT_MTRX VALUES (‘DEC’, 12, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1);
For practice purposes, here is SQL for generating a small, sample fact table:
DROP TABLE SMALL_FACT;
CREATE COLUMN TABLE SMALL_FACT
PRIMARY KEY (K)
INSERT INTO SMALL_FACT VALUES (‘A’, 230, 232, 240, 300, 320, 280, 340, 290, 220, 242, 234, 320);
Step 2: Create constant value calculated columns in each table
a) Create a simple Analytic View that wraps SMALL_FACT called AN_FACT
b) Create a Calculation View called CA_TRANSPOSE_EX
c) Bring in AN_FACT and the table MONTH_IDENT_MTRX
d) Create projection nodes above AN_FACT and MONTH_IDENT_MTRX
e) Create the following calculated column in each projection node.
Step 3: Cross Join the two tables
Join both tables on calculated column ONE
Step 4: Create output measure (i.e. AMOUNT) as a calculated column.
(Please don’t literally copy the ellipsis below! The whole calculation goes from COL1/M1 to COL12/M12. Shortened below for display.)
Step 5: Query required columns (aggregate AMOUNT, group by DIMENSION, sort by SORT_ORDER)
A few comments
When examining the data model above, you might has some performance concerns. For example:
1) It’s generally recommended NOT to join on calculated columns in a Calculation View
2) Dynamic transposition can clearly still create very large intermediate result sets.
With regard to the join on calculated columns, you would be semi-correct. If you have the “ability” (i.e. if your data is loaded via ETL, or via SLT and you have the option of altering table structures (not always possible, i.e. if ABAP Accelerator expects same structure as source)), then you should of course consider creating the “ONE” column as a physical column. And even if this is not possible, “ONE” can still be created as a physical column on the “identity matrix” table.
However, in testing with large data sets on a previous project (hundred of millions of records in the fact table, hundreds of records in the identity matrix table), colleagues of mine (referenced at the beginning ) found very good performance, despite using calculated columns (and despite concerns regarding intermediate result set size). I suspect the very low cardinality of the identity matrix table, the simplicity of the calculation, and the fact that only certain columns are “multiplied out” lead to the great performance. As always, try various approaches for your scenario to meet an optimal functional/performance trade-off.
Thanks for reading, and I hope you find this How-To guide on Dynamic Transposition in HANA useful!
I'm really interested in this because this is definitely not how I would have approached it. Not saying that your method is wrong because there are always many ways to model this stuff in HANA and sometimes the non-obvious one is faster!
I would have created 12 projections in a calc view, each with the attributes and one measure. Then I would have created a Union with Constant Values that combines all 12. I would have a generated column in the UCV for month, with a constant value. I would follow this with an aggregation node to aggregate out the nulls union (you can often use the default output node for this)
There are a few obvious benefits to this:
- If you filter on a month, it will pass the filter down and only use one of the projections
- If you use all months, it will run all 12 projections in parallel
- You avoid joins which are definitely slower than UCV
- Any filter on any attribute is definitely passed down
The other thing I sometimes do is join on grouping attributes after the aggregation. This can reduce the amount of data you move between HANA engines.
The thing you have to be careful of in my example is that if you have an interim result set of more than 100m then you may see some performance degradation.
Interested in a performance test.
Good point on the UCV approach. I mentioned this briefly in my original posting, right before "Transposition Data Flow" section:
As you rightly point out, UCV can give much better performance, which may be worth a bit more development complexity. So, I think it would be helpful to discuss complexity and performance a bit more to give folks more to think about:
1) The original work that this approach was developed against was transposing cost element and cost element groups. I believe there were ~40 cost elements. I think the maintenance cost speaks for itself in this case (especially when additional logic, such as sort order columns - need to be maintained regularly).
2) In my current project, I am indeed transposing monthly data - for two sets of measures. So, there would be 24 project nodes - still a bit of a headache.
3) Graphical CalcViews are of course preferred, but in my use case I'm actually doing a SQLScript approach* - which means CE_UNION_ALL for each projection node (plus all upstream code). Again, maintenance nightmare.
All of the above complexity issues are solved with the dynamic transposition approach where a single identity matrix table needs to be maintained, and a single AMOUNT column needs to be maintained.
So, how about that performance test? 🙂
I grabbed one of my base tables for the test, COSS. It has 18 million records - pretty small, but reasonable enough for quick testing.
I wrapped it in an AN_VIEW with a handful of dimensions with pretty high granularity - result set of ~ 600,000 records.
Then I built my approach graphically (my original post was actually just manufactured screenshots to represent a section of my code ) with calculated columns, physical columns, a combination, and then with the UCV approach. Here are the results:
Open / CA_CALC: 2.4s
Open / CA_PHYSICAL: 5.4s
Open / CA_UNION: 1.8s
Filtered / CA_CALC: 1.2s
Filtered / CA_PHYSICAL: 1.2s
Filtered / CA_UNION: .77s
CA_CALC represents "dynamic transposition" with calculated join columns for cross join.
CA_PHYSICAL - same but with physical columns.
CA_UNION - the UCV approach.
"Open" represents COUNT(*) against the full model - no filters.
"Filtered" is filtered again MONTH = 'JAN'
1) UCV certainly wins in terms of performance.
2) In this case, the CalcView joins faster on the calculated column than the physical column. (If this isn't unintuitive, I don't know what is!!!)
Some other day I'll investigate 2) above.
In summary though - the tradeoff, as is often the case, is performance vs maintainability. In my particular use case, the data is also filtered once I get to the transposition, so any performance gains using UCV would be minimal.
UCV certainly has it's place - but I don't think it's the right approach given the above complexities.
* Of course, someone is going to want to know why I'm doing this in SQLScript. 🙂 Three reasons off the top of my head: 1) I have to do some clunky hierarchy traversal logic that can't be done graphically (and hopefully will be more elegant given SP7 hierarchy SQL extensions), 2) I have to "aggregate" a dimension field by concatenating each distinct value (also no elegant solution - except for some clever work with aforementioned hierarchy SQL), and 3) there is a lot of pedantic logic in this model (notably filters) which I find much more transparent in SQLScript rather than having to click through a bunch of projection nodes and dig out the filters themselves.
Hope that all makes a bit of sense!
Another observation that I forgot to mention - the MONTH filter is passed down against the identity matrix table before the join is executed, according to the plan visualization. (This is a rather trivial optimization in this case though, as the IM table is all of 12 records!)
Cool, great analysis. I love blogs/documents where the best stuff happens in the comments.
On point(2) you can push two measures for each month in one projection if they are in the same table. Should only need 12 legs.
My suspicion is that with higher data volumes (100x more), the different would be much more pronounced. In my instance I also modeled it in a similar way to you, and found a 10x difference in performance for big aggregations queries and a 300x difference when filters were applied.
As a point of interest, we quite often use this same concept to force parallelism and increase query response time. Pick a uniform distributed dimension, filter it 8 times and join them in UCV 🙂
For smaller data volumes it makes sense to be more concerned with maintainability than performance, especially when it's plenty fast enough as-is. On maintainability, I quite often create complex analytic views which I then re-use in calc views, even scripted calc views. It's not too bad to build out 12 months once. You get great maintainability and great reusability this way. Would this be possible in your example?
Working with distincts is a pain in HANA, especially with large volumes. I believe there is good news in SP07 there. Funny, I prefer filters in analytic views rather than CE Functions - the CE_CALC syntax is funky.
On your weird join point, that doesn't surprise me too much. HANA joins are fairly sensitive to the quality of the join predicate. A temporary node with a calc column as a join would be faster if it were against a primary key or good join predicate on the other table.
I am surprised that it correctly passes the month filter down. I have seen many examples where it materializes data for these joins, which is all bad!
Definitely agree, John. Nothing like hashing out best practices with other HANA aficionados!
Good point! Looks like I spoke before I thought. Wouldn't be the first time and won't be the last - so thanks for pointing that out.
Thanks for the additional testing. It further emphasizes the need for HANA developers to keep UCV front and center when it comes to optimization, which is further underscored by your next point:
This is a clever tactic! On my last project we were considering building a code generator that would essentially query distinct values of a particular dimension and then build a stored procedure "on-the-fly" (as part of a batch process) with CE functions that does the same thing. We were working on optimizing models against a table of 80 billion records and had in mind doing a UCV with potentially hundreds of sources for the Union node. In order to find out whether that moved forward, you'd have to ask Abani Pattanayak (guess there's not a way to tag him in this post, but he posts to SCN). I left the project to go independent, and he and Werner were hackin' away at that monster table.
Good point. I think it depends on the reporting use cases. For example, if the same star schemas were consumed in different ways (i.e. different filters and different calculations), it's also often a good idea to build "vanilla" analytic views with usage-specific filters, calculations and any other logic in downstream Calc Views.
Well, I'll stick with my point that maintaining the identity matrix and one calculated field is easier than 12 CE_UNION_ALL nodes. 🙂 Certainly easier to maintain graphically, though. So, could I do this aspect of my particular solution graphically? Yes. However, it would involve splitting one script into two scripts that lead and follow the graphical model, and I don't think it's worth that added complexity. More importantly - the data is filtered substantially before the transposition, so the performance improvement makes little difference (again, specific to my use case - 100% agree that UCV is very often a better approach).
Agreed. My only point was that wherever the filters are - in analytic views or projection nodes - you have to click around a bunch to figure out what's where. In a script, it's much more transparent. Compounded by the need to do some clunky logic for hierarchies and string concatenation - it seemed best to include all required logic in one scripted CalcView for my use case. (Left all these details out of my original post so as not to confuse folks too much.)
Would you mind explaining a bit on this? My insight into HANA join operations is somewhat limited. In my example above - both columns were calculated, which was the fastest approach (faster than a hybrid approach in either direction (one physical, one calculated column), and faster than physical columns on both tables. Keys weren't part of the join columns.)
Thanks again for all the detailed feedback John. I'm certainly learning a lot in this discussion!
Also, John - above I mentioned the project with the 80 billion record fact table. That was at a large grocery retailer, so I'm particularly bummed that I couldn't attend your webinar today on Retail Analysis using HANA. I look forward to checking out the recording - also to catch follow up details on your weather data analysis!
Transpose is an interesting topic and many thanks for sharing Jody.
I just created a single table and would like to know how we can transpose the below table:
Output of the information view should be:
I think you have a few options. What have you tried so far?
I have not tried yet.
Give it a shot and let me know what you come up with. Then I'll be happy to help.
Sure Jody. Will give a try 🙂
I given the try and written a separate document http://scn.sap.com/docs/DOC-50719
Cool! UCV approach certainly has its merits as well.
Great document Jody, seams to be very helpful.
Thanks for sharing.
Good Stuff Guys - Thanks!
Good document Jody
Thanks for sharing
Very interesting technique to solve a very common requirement. I have gone with the UCV approach with some good results, but I like seeing different ways to do everything, great write up!
I was particularly surprised by your performance results and the join on the calculated column being faster than joining on the physical column.
Keep up the awesome work.
Thanks for your approach..
We got a similar task and we went with a procedure using dynamic SQL using Case Statements to transpose. I agree that performance wise it might have not been a correct way.
Wished i had seen this blog a bit earlier.
Good document john 🙂 Thanks for sharing; Really helpful!!
Not to be pedantic, but my name isn't John. 🙂
Another one for the Hana cookbook John, erm Jody (sorry). Just worked through it, I like this elegant and low maintenance option as a tool in my kit. The added-value comments really do reflect both the challenges faced on implementations and an increasing range of solutions at the modeller's disposal that can be weighted on maintainability/performance.
It's frustrating I cannot add anything technically, yet grateful the HDEs continue to contribute these knowledge gems as many folks strive to make the transition from Hana student to consultant.
Sssssscccchhhhh now 😉
Unfortunately for me, I would like to do just the contrary, i.e. a dynamic reverse pivot on a table (or view ...).
The business case for this is , as an exemple, to analyse stocks in storage bins per batch characteristics. You will have plenty of characteristics values for each material, that you will want to present as columns.
unfortunately, unlike Oracle or MS SQL, HANA SQL does not support PIVOT statement (yet).
So far, the only docs I found was talking about BODS, which is not a dynamic solution.
Does any of you experts have a magic trick before I start sweating and crying (while trying to do it on my own) ?
Sounds like a good use case for Analysis for Office or Analysis for OLAP, building cross-tabs against whatever dimension you're interested in.
Alternatively, you could build restricted measures if you know beforehand what dimension values are required for analysis and pull these into any front-end tools.
Would either of those solutions work for you?
If I'm not mistaken, you are likely talking about the "classification" functionality in ECC? In this case, the association to the characteristic (say batch) is not totally transparent and needs additional work before it can be exposed from AUSP/CABN tables. Put more simply, this is not a case of just plugging a dimension into a fact, there is logic involved here.
As Jody touched on, as long as you know the specific characteristics you are targeting you should be able to build it out. I just completed a piece of development that works on denormalizing a specific set of Material classification characteristics, but ended up physically writing the table for performance reasons. I suppose you could do it dynamically, but it may impact performance and also I don't see a huge need to have this values updated in real time.
Essentially it would be a series of left outer joins to the base characteristic (batch), with each of these being a left outer join specific to the characteristic you are trying to grab.
You are not mistaken at all !
What I am laboriously trying to do is dynamically denormalize classification characteristics to unable material-related facts , such as stock analysis, by characteristic value (which can actually be text', as classification are often used as additional information fields for Material or batches.
I had a 1st try according to Vitalij Sadovitch's Thread, but as you say, it is limited to a certain, known number of characteristics.
And the I'm stuck because I would like to have it dynamic and generate as much columns as I have values, just like the 'PIVOT' statement does in oracle (or so I recall).
I know it can be challenging regarding performances, but I bet on the fact our ECC database will remain of quite a reasonable size.
And on the other side, managing a denormalized table also puts data consistency at risk, whenever the list of characteristics and values is moving, which will happen too often.
That's where I hoped to take benefit of HANA's capacity to expose data through views.
So I thank you for your hints, and I will try just that. If you have more information or ideas about it, I'll gladly take them also 🙂 .
Since we more or less already have the solution for the "fixed" characteristic approach, that is not of concern. You would like to dynamically populate the characteristics (columns). This would be just adding an additional two steps from TCLA->AUSP and CABN->AUSP.
For my fixed approach, which in the end persists the data, logic is like this implemented in SQLScript procedure
- Determine all distinct materials that have values for the specific characteristics we are interested in (AUSP->CABN)
- Left outer join each characteristic and the value, each time looking for the specific characteristic (AUSP->CABN)
- Create a text lookup table that joins the text values of characteristics (CABN/CAWN/CAWNT)
So modified, you would have to implement some dynamic SQL to
- Determine all the characteristics (columns) that need to be selected (TCLA->AUSP), for example all MCHA related ATINN's from AUSP.
- Determine all the ATNAMS from each ATINN (CABN)
- Proceed with the same steps as above, but instead need pass in dynamic SQL to build each column.
This dynamic approach even for persisting the data, I may take a crack at it if a I have time since I've already come this far 😉
Also see HANA Live view sap.hba.ecc.CharacteristicValueDesc for some hints on how to piece some of this together.
Great work around for your requirement. However, I was wondering if you any solution for this one. I hope you are aware of the employee compensation infotype table in SAP HCM; PA0008.
I am developing a report which is the transposed form of the table and the output fields are dynamic as well based on the number of wage codes which the input employees are having.
I would appreciate any inputs with regard to this. Thanks. Below, in the standard SAP Table in ECC, LGA* fields contain the wage code which an employee can have. BET* fields contain the amount for the respective wage code. An employee can have max of 40 wage codes. The report on the other hand has wage codes as field headers. Hope this sample gives you a clear picture of the requirement. Its almost the reverse of what you have done.
Just trying to follow these steps.
Can someone assist in providing the screenshots to Step 2 a, b, c, d. I need to see if I am doing this correctly. I am particularly new in SAP HANA as I have been in the SQL Server world most of my life. Thanks again.
Is it possible to Transpose Rows to columns, where as the columns might be dynamic.
And the output:
The Col2 values might increase dynamically.
Good samples - thanks.