Marc Koderer

Running SAP S/4HANA Cloud, private edition at the largest scale

Implementation with distributed database at Mahindra & Mahindra Limited


Running ERP at scale in the cloud with SAP S/4HANA

In the age of Cloud ERP, customers can focus on the business value they draw from their ERP systems. Customer IT, liberated from a large share of basis workload, becomes primarily an enabler of business innovation.

At the same time, many companies have built out their SAP S/4HANA and SAP ECC systems with adaptations of standard processes and a variety of customer applications as a means to differentiate themselves from their competitors.

SAP S/4HANA Cloud, private edition is also the solution even for some of the most demanding, largest cloud ERP systems in the world. Because it supports the database scale-out technology of SAP HANA, there is no limit on achievable scale compared to classical on-premises installations of SAP S/4HANA.

Google Cloud as key partner for SAP S/4HANA cloud infrastructure

Google Cloud provides a platform for SAP to build, deploy, and manage SAP S/4HANA Cloud, private edition for the smallest to the most demanding customers. Google Cloud's scalable infrastructure and high-bandwidth, low-latency network provide excellent performance and stability for running SAP applications and deliver high application performance for scale-up and scale-out SAP S/4HANA systems.


Mahindra & Mahindra Limited

Mahindra & Mahindra Ltd. is an Indian multinational automotive manufacturing corporation headquartered in Mumbai. Established in 1945 and part of the Mahindra Group, it is one of the largest vehicle manufacturers by production in India.

In 2019, Mahindra & Mahindra went live with SAP S/4HANA after an 18-month project to convert the existing SAP Business Suite (ECC 6.0) system to SAP S/4HANA 1709. The SAP application supports more than 25,000 users and 80+ company codes, running a broad range of SAP S/4HANA core modules such as FI/CO, MM, SD, PP, and Asset Management, and core processes such as record-to-report, order-to-cash, and procure-to-pay.

Move to SAP S/4HANA Cloud, private edition

In 2021, Mahindra & Mahindra decided to adopt SAP S/4HANA Cloud, private edition, choosing Google Cloud as the infrastructure provider. Project planning and safeguarding were provided through SAP MaxAttention services from the planning stage to go-live. Mahindra & Mahindra were also supported by the SAP S/4HANA customer care program, whose project coaches helped the project move ahead efficiently by clarifying new SAP S/4HANA product features and enabling quick issue resolution. Customer care also helped Mahindra & Mahindra quickly adopt innovations and introduce more than 90 standard as well as 200 custom SAP Fiori apps.

On the technical side, Mahindra & Mahindra expected the cloud system to match the existing on-premises system in terms of performance and throughput.

As the project progressed, intensive load testing showed that the originally foreseen single-node database infrastructure did not provide sufficient CPU capacity to sustain estimated peak workload.

Therefore, in August 2022, the option of implementing SAP HANA database scale-out was introduced into the project plan. Only three and a half months later, the system went live successfully as the first ever SAP S/4HANA cloud system operating on a distributed HANA database. The productive collaboration of Mahindra & Mahindra, SAP teams, and Google Cloud paved the way to this impressive achievement.

SAP S/4HANA Cloud, private edition

SAP S/4HANA Cloud, private edition enables companies to safeguard their existing SAP ERP investment while benefiting from a new level of flexibility. They can tailor the software to meet their specific needs, retain the company-specific configurations and customizations of the existing SAP ERP system, and access the latest capabilities that give them a competitive advantage.

SAP S/4HANA operating on HANA scale-out

SAP HANA scale-out is a HANA deployment option that allows the HANA database to be stretched across multiple physical database servers (typically referred to as "nodes"). It is based on a shared-nothing architecture, in which each node controls its own data volumes and, in general, a given data set is stored and managed by one particular node. To applications, the scale-out cluster appears as one single ACID-compliant database via the SAP HANA interfaces. Efficient connectivity to the optimal database node is ensured by client-side statement routing in the database client.


Figure 1: SAP HANA scale-out
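The routing behavior described above can be sketched as a small, purely illustrative model. The class, table names, and routing logic below are hypothetical simplifications, not the actual SAP HANA client implementation: the client keeps a map of which node owns which table and sends a statement directly to the owning node, falling back to the coordinator for cross-node statements.

```python
# Illustrative model of client-side statement routing in a scale-out
# database. All names here are invented; the real SAP HANA client
# resolves and caches table locations from server topology metadata.

class ScaleOutClient:
    def __init__(self, table_locations):
        # table name -> owning node, e.g. {"BKPF": "node2"}
        self.table_locations = dict(table_locations)

    def route(self, tables):
        """Return the node that should execute a statement touching the
        given tables, or the coordinator if execution must be distributed."""
        nodes = {self.table_locations.get(t, "coordinator") for t in tables}
        if len(nodes) == 1:
            return nodes.pop()   # single-node execution, no inter-node traffic
        return "coordinator"     # cross-node query: distributed execution

client = ScaleOutClient({"BKPF": "node2", "BSEG": "node2", "VBAK": "node1"})
print(client.route(["BKPF", "BSEG"]))   # both tables on node2
print(client.route(["BKPF", "VBAK"]))   # tables on different nodes
```

In the real database client this mapping is maintained automatically, so applications do not manage it themselves; the point of the model is only that statements touching tables on a single node avoid inter-node communication.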

Scale-out is used to scale both the available physical memory and the available CPU resources of the SAP HANA system. Though it adds some complexity, scale-out enables growth beyond single-node limits:

  • Total database sizes can exceed the capacity of the largest available single servers.
  • Application workload can be distributed across multiple database nodes, also enlarging the CPU capacity available to the application.
  • Scale-out allows for more flexibility: it makes it possible to react to strong system growth by adding scale-out hosts (within boundaries set by the applications).

For the core applications of SAP S/4HANA, a concept of scale-out by component placement has been developed. The main goals of this concept are:

  • Minimize the impact of inter-node communication for the core transactional workload, thus ensuring good response times.
  • Provide a table distribution that is stable over time, not requiring frequent movement of database objects between nodes.
  • Allow for a reasonable distribution of data volume and workload across the database nodes.

The fundamental design choices are:

  • Tables are clustered by application criteria, so that tables used within the same application context belong to the same table group. SAP provides a set of table groups as a starting point for projects; this initial proposal is adapted to each customer, including custom code and Z-tables.
  • OLTP-queries joining tables from different groups are avoided in the application standard.
  • All tables of a given group are kept on the same database node. Multiple groups may share the same node.
  • Selected master data or similar tables may be shared between all database nodes by means of a table replication concept.


Figure 2: S/4HANA Scale-out
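The design choices above can be sketched in a few lines. This is a hypothetical illustration: the group names, node assignments, and replicated tables are invented examples, not an SAP-delivered grouping.

```python
# Toy model of component placement: table groups are pinned to nodes,
# selected master-data tables are replicated to all nodes, and a query
# is "local" if its non-replicated tables all land on one node.
# Groups, nodes, and tables below are invented for illustration.

GROUP_OF = {"BKPF": "FIN", "BSEG": "FIN", "VBAK": "SD", "VBAP": "SD"}
NODE_OF_GROUP = {"FIN": "node2", "SD": "node1"}   # a group never spans nodes
REPLICATED = {"T001", "MARA"}                     # present on every node

def is_local(tables):
    """True if a statement touching these tables can run on a single
    node without inter-node communication."""
    nodes = {NODE_OF_GROUP[GROUP_OF[t]]
             for t in tables if t not in REPLICATED}
    return len(nodes) <= 1

print(is_local(["BSEG", "BKPF", "T001"]))  # FIN tables plus a replica
print(is_local(["VBAK", "BKPF"]))          # SD and FIN sit on different nodes
```

If such a check failed for a statement on the performance-critical path, the remedies described in this section would apply: merge the affected groups, place them on the same node, or replicate the smaller tables involved.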

The concept also allows flexibility to adjust to the actual situation in a given customer system:

  • Customers can adapt the SAP-provided set of table groups according to their specific needs, based on analysis of their core workload:
    • Merging multiple groups to one larger group
    • Adding additional tables (SAP- or customer-tables) to existing groups
    • Defining new groups
  • Customers can choose to replicate additional tables as required.

Since its introduction in 2017, this scale-out offering has been used in several large SAP S/4HANA installations. Mahindra & Mahindra is the first customer to have adopted scale-out in SAP S/4HANA Cloud, private edition.

Details of the project

Collaboration between Mahindra & Mahindra, SAP, and Google Cloud

Throughout the project phase, teams from Mahindra & Mahindra, Google Cloud, and SAP worked in a well-aligned setup to safeguard the project, convert findings into actions and steer towards successful go-live.

Mahindra & Mahindra's existing setup and processes for load and performance testing proved an invaluable asset. They enabled testing and fine-tuning of scale-out aspects such as table distribution and table replication to minimize the impact of cross-node queries on statement performance and query throughput. Similarly, they made it possible to find optimization potential in custom development objects with respect to the underlying data distribution.

With the load test system on production-identical hardware, the teams could also spot and address optimization potential in the technology stack, be it in database parameterization or physical infrastructure.

Mahindra & Mahindra’s test team was also quick to adjust the load test scenario where needed to reflect more closely the actual workload in the planned production system. This flexibility ensured that the test cycles provided highly meaningful results for understanding and safeguarding productive system behavior.

Google Cloud's technical infrastructure and SAP experts analyzed infrastructure KPIs and identified opportunities for VM deployment and configuration optimization as part of a Google Cloud safeguarding service. Google Cloud's PSO team was on standby during the go-live to resolve any issues that might arise, but the go-live completed without any incidents on the Google Cloud side.


Figure 3: Collaboration

Implementing SAP S/4HANA on scale-out with cluster protection

To enlarge compute capacity, there was a clear need to scale horizontally with HANA scale-out; vertical scaling would have required moving to Google Cloud Bare Metal HANA systems. Furthermore, adding a HANA node doubles the disk I/O bandwidth, which was also a very important aspect as it improves performance for backups and other operational tasks.

Due to the customer's high resilience requirements, the scale-out setup needed to be distributed across Google Cloud availability zones. To guard against failures, the system was protected with Pacemaker cluster software.

Figure 4: Please note: this setup is only available in RISE with SAP S/4HANA Cloud, private edition, tailored option with expert analysis and approval

Running scale-out clusters on Google Cloud

SAP HANA scale-out has been supported on Google Cloud since 2018, and Google Cloud provides standard best practices for setting up SAP HANA scale-out clusters with high availability across availability zones. The scale-out architecture consists of one master host, a number of worker hosts, and, optionally, one or more standby hosts. The hosts are interconnected through a network that supports data rates of up to 100 Gbps on selected machine types, using high-bandwidth networking with the lowest possible latency.

As workload demand increases, especially for OLAP, a multi-host scale-out architecture can distribute the load across all hosts.

The following features help ensure the high availability of an SAP HANA scale-out system:

  • Compute Engine live migration
  • Compute Engine automatic instance restart
  • SAP HANA host auto-failover with up to three SAP HANA standby hosts

For more information about high availability options on Google Cloud, see the SAP HANA high-availability planning guide.
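The host auto-failover feature in the list above can be illustrated with a toy model. This is an assumption-laden sketch, not SAP HANA code, and the host names are invented: when a worker host fails, a standby host takes over its role (and, conceptually, its persistence), so the system continues at full capacity.

```python
# Toy model of SAP HANA host auto-failover: a standby host inherits
# the role of a failed host. Host names and structure are invented
# for illustration only.

class ScaleOutSystem:
    def __init__(self, workers, standbys):
        self.active = {w: "worker" for w in workers}
        self.active["hana-master"] = "master"
        self.standbys = list(standbys)   # idle hosts waiting to take over

    def fail(self, host):
        """Simulate a host failure; return the standby that takes over,
        or None if no standby is left."""
        role = self.active.pop(host)
        if self.standbys:
            takeover = self.standbys.pop(0)
            self.active[takeover] = role   # standby inherits the failed role
            return takeover
        return None

system = ScaleOutSystem(["hana-w1", "hana-w2"], ["hana-sb1"])
print(system.fail("hana-w2"))   # hana-sb1 takes over the worker role
```

Since up to three standby hosts are supported in this deployment model, several successive host failures can be absorbed before capacity is reduced.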

Increasing disk I/O using Hyperdisk Extreme

Large SAP HANA systems can require very high peak I/O throughput, specifically for HANA operations such as delta merges, savepoints, and backups. Google Cloud provides linearly scalable persistent disks with up to 1,200 MB/s throughput and 40,000 read and 40,000 write IOPS per VM, which were used in the deployment of SAP HANA at Mahindra & Mahindra.

During stress testing, Mahindra & Mahindra's SAP system was observed to remain within these limits at peak throughput, so disk I/O was not a bottleneck for the go-live. Google Cloud has recently released a new disk type, Hyperdisk Extreme, a high-performing and scalable disk solution providing up to 5,000 MB/s throughput and up to 350,000 IOPS. Hyperdisk Extreme is certified for SAP HANA workloads and is available for RISE with SAP Private Cloud Edition customer deployments on request.
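To see why the disk throughput ceiling matters for operational tasks such as backups, a back-of-the-envelope calculation helps. The 10 TB volume below is an assumed example figure, not data from the Mahindra & Mahindra system; real backup times also depend on backup-channel configuration, network, and target media.

```python
# Rough, illustrative estimate of full-backup duration at the quoted
# sequential throughput limits. 10 TB is an assumed example size.

def backup_hours(volume_tb, throughput_mb_s):
    mb = volume_tb * 1024 * 1024          # TB -> MB (binary units)
    return mb / throughput_mb_s / 3600    # seconds -> hours

for name, mbps in [("Persistent Disk (1,200 MB/s)", 1200),
                   ("Hyperdisk Extreme (5,000 MB/s)", 5000)]:
    print(f"{name}: {backup_hours(10, mbps):.1f} h for 10 TB")
```

At the quoted limits, the assumed 10 TB full backup would take roughly 2.4 hours on the 1,200 MB/s persistent disks versus roughly 0.6 hours on Hyperdisk Extreme, which is why peak sequential throughput matters for backup windows.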


Successful go-live on scale-out after 3.5 months of project time.

Stable system operations, outperforming requirements with respect to performance and throughput.

In total, 12% faster response times after migration to the RISE on Google Cloud environment.


Figure 5: Response time comparison


      Prashant Asawa

      Good detailed article... We should also release learnings from this journey.

      Marc Koderer
      Blog Post Author

      Yes, we are planning to release more material on the matter soon.

      Prashant Asawa

      Thanks 👍

      Thorsten Staerk

      Happy to be part of the team 🙂 All I know about I/O benchmarking, I learned in the SAP LinuxLab. I summarized it here:


      Hope it helps a lot of people 🙂

      Jens Gleichmann

      Hi Marc,

      I really love to hear and read such success stories with architecture solutions, but for a good comparison story you have to tell all the details (you already covered some tech stuff of the target system):

      • which CPU type (number of cores) was used on prem? (hardware of 2017 with a going live in 2019 with S/4HANA 1709)
      • which VM on GCP was used (m2-ultramem-416?)
      • was there also a DB revision change (for sure)
      • SAP Kernel change? (for sure)
      • nothing about the update: S/4HANA 1709 compared to S/4HANA 2021? (1709 EoM was 31.12.2021 or extended support?)

      You can only compare apples to apples. Currently it sounds like you doubled the hardware and gained 12% performance. Maybe there was also some other activity, like the mentioned SQL tuning optimization, which gained the performance improvement and not (only) the hardware. I think solving everything with hardware sounds quite expensive. So comparisons to a baseline and optimized system iterations would describe the path to success, and even more so the hours of hard work throughout the journey.

      I would love to hear more about the hurdles and the tricky stuff and not only what was smooth, because S/4HANA scale-out needs a lot of good know-how regarding partitioning and grouping => table placement. How much time was spent to build the architecture and afterwards to optimize the system to reach the targets?

      However, great to hear that GCP solutions performing great with the S/4HANA load.




      Marc Koderer
      Blog Post Author

      Hello Jens,

      thanks for your comment.

      Currently it sounds like you doubled the hardware and gained 12% performance. 

      No, the two scale-out nodes are pretty much comparable on CPU cores / SAPS values with the one scale-up system on-prem. And yes, optimizations like SQL statement analysis and other tunings have been part of the overall process.

      Unfortunately, we are not able to share more details on the production system (like used instance types or revisions).

      Jens Gleichmann

      "No, the two scale-out nodes are pretty much comparable on CPU cores / SAPS values with the one scale-up system on-prem."

      "As the project progressed, intensive load testing showed that the originally foreseen single-node database infrastructure did not provide sufficient CPU capacity to sustain estimated peak workload."

      This means over-five-year-old hardware has the same SAPS value as the new system, with more requirements regarding system growth and scaling? Sounds kind of odd to me, especially given the need for better performance. But maybe the tuning was so good that you need less hardware 😉

      I understand that you cannot share too-detailed information, but you can make a statement about what exactly you compared in "Figure 5: Response time comparison". I mean, it is a result, but it is not really comparable.

      As already mentioned:

      • different HANA revision
      • different SAP Kernel
      • different S/4 release
      • different custom code

      In the end, nobody knows what exactly achieved the 12% performance gain: the newer HANA revision with other optimizer decisions, the SAP kernel with optimized FDA access, the S/4 release with different code, or the optimized custom code.

      What I want to say, please be careful with the message: change architecture and optimize the code => you will achieve >10% performance (without really comparable KPIs and costs)

      It should be more like be up-to-date with your system components, continuously optimize your coding, know your workload and how to scale it (may be also without scale-out). A lot of customers will compare the systems and their situations, but without the right KPIs and costs this is impossible.




      Avik Mazumder

      Good read! However, please clarify following statement/queries pertaining to the blog-

      The fundamental design choices

      • OLTP-queries joining tables from different groups are avoided in the application standard. - What does it mean for the standard SAP application codebase? How is it feasible to avoid table joins?

      1/ Is the scale-out option offered along with RISE with SAP licenses? When is it adopted in the project plan: during sizing or after the volume-testing phase?

      2/ What happened to their On premise S/4HANA system, is it sunset completely or running in side-car approach?

      3/ A 12% faster response time is a very low ROI from a business standpoint for such a huge IT-driven investment, as on-premise would be supported at least till 2040 per SAP's roadmap, and they wanted to mirror the on-prem system to the private cloud system, so we can safely assume they already had a stable and robust existing system even with peak workload. What was the main driver to undergo this IT transformation with 200 custom Fiori apps?





      Jens Gleichmann

      Hi Avik,


      referring to the grouping question/design choice:

      You can still join tables which are managed/distributed on different nodes, but a join will generate inter-node traffic, which is an overhead in the context of performance. So your custom code can still be used, but the performance of the join will suffer; avoid joining them in one SQL statement. This means a read of selective data into internal tables will avoid the inter-node traffic, but it will also kick out the code pushdown, because it is processed in the ABAP layer. It depends on the business case and the amount of data which solution makes more sense. Maybe temporary tables in a stored procedure can be used to work around this issue.




      Avik Mazumder

      Thanks for the response! I understand there is no 'one size fits all' approach in those scenarios, and we may need to deal with it application-wise in terms of the data volume involved, in discussion with business owners.

      Now, as you mentioned, the important essence of the code push-down technique might prove counterproductive depending on the underlying node structure/grouping/clustering, and we have to navigate the hurdles with SQL trace monitoring. What about CDS views in the VDM context, OData, and Fiori?

      There should be well-documented repositories of such node structures and their underlying DB table groupings as an application design guideline. All this for improved efficiency/less latency where, long story short, it essentially seems like a fancy name for 'increasing disk size': adding more parallel HDB resources alongside the already existing primary HDB multi-cores. Is it at all less expensive? Rather, extensive performance optimization using available resources/capacity could have been a game changer.

      Marc Koderer
      Blog Post Author

      Hi Avik,

      thanks for your comment.

      1/ Is the scale-out option offered along with RISE with SAP licenses? When is it adopted in the project plan: during sizing or after the volume-testing phase?

      Scale-out scenarios can be offered leveraging RISE private edition, tailored option (expert analysis required). Sizing and growth are an essential part of the conversation during the pre-sales phase and also during delivery.




      Richard Bremer

      Hello Avik,

      regarding your question about OLTP-queries and table joins:

      When operating S/4HANA on HANA scale-out, it's not our goal to avoid ALL joins between tables on different nodes, but to minimise such joins and to avoid them in expensive statements / in statements that are part of performance-critical workload.

      The approach that we are following to achieve this is twofold:

      • From the application side, we provide a starting point for table grouping. Here, we group tables belonging to one application area (e.g. finance, or SD). By far the majority of SQL queries in an S/4HANA system only include joins within a group, but not across group boundaries - at least in the SAP-delivered software.
      • Analysis of the individual customer workload to a) further fine-tune table grouping and b) determine groups that should be kept together on one HANA node because there is significant cross-talk between the groups. Where a customer is already live on S/4HANA (as was the case with Mahindra & Mahindra), it is rather simple to obtain the necessary execution statistics. But we have also had good experience performing this analysis based on customer load-test data, partially extrapolating from ECC-system data (if conversion/migration from ECC to S/4HANA on HANA scale-out is planned) and further information.
        This second item is critical and requires thorough work and deep expertise.

      Next to the grouping, we can use synchronous table replication for selected tables (master data and configuration data tables) to avoid distributed join queries. We are cautious not to replicate all such tables, but only those that will be beneficial to replicate in the given customer workload.

      Obviously, there will be scenarios that require cross-group and, in any distribution setup, cross-node queries. The relative overhead coming from distributed execution is often acceptable in analytical queries / OLAP workload. In OLTP queries, it depends on the context: if a query is executed as part of an end-user interaction and a few milliseconds are added, this may be measurable but not noticeable. If we add a few milliseconds to a query that is executed ten million times in a batch job, this would likely be significant. Customer requirements and expectations therefore are important when it comes to our optimisation goals.

      And regarding the scale-out option: in SAP S/4HANA, scale-out is supported (with certain additional technical restrictions such as specific selection of hardware / IaaS systems certified for OLTP scale-out). In RISE, it is offered as part of the tailored option.

      As for Question 2: this was a system migration, so the private-cloud scale-out system replaced the original on-premise S/4HANA system.

      And Question 3: the original on-premise infrastructure and the scale-out infrastructure cannot be compared 1:1 in the sense that one of the scale-out nodes would have the same CPU capacity as the original single-node server. See Marc's comment on Jens' question.

      Best regards,
      Richard (SAP S/4HANA product management)

      Avik Mazumder

      Thank you Richard for the detailed response. One follow-up question: when we create bespoke tables, let's say the 1st in P2P and the 2nd in R2R scope, in the application layer, and a 3rd CDS view for the O2C area (customer-specific in store operations), there might be 2 underlying groups in the HDB layer.

      How are all custom tables put into a certain group, and how are given groups kept on the same database node? Does this hold true in the CDS context? What instruction is to be passed from the application layer, or to the NetWeaver team, to enable this grouping sanctity behind the curtain?



      Richard Bremer

      Hello Avik,

      table grouping is not exposed in the ABAP layer, and grouping alone would not even be useful here - because the thing that counts is the actual table/group distribution across the physical database nodes. Before you ask: also this physical distribution is not exposed in the ABAP layer.

      Table grouping and table distribution needs to be done entirely within the database with the means provided by the database.

      In consequence, it is something that, in a customer project, needs to be managed as specific guidance for development teams. This may sound worse than it is, because we are not talking about thousands of relevant tables: in a typical S/4HANA scale-out system, there are tens of tables, at maximum a few hundred tables, that are not on the coordinator (master) node of the database.

      Developers of ABAP code, CDS views or other artefacts (AMDPs) should be aware of the fact that a small number of tables is on the second DB node, and that a further set of tables is replicated to all nodes; only in code involving these tables do they need to be cautious. If table grouping has been defined in a good way, it will often even be sufficient to know that, for example, it's the finance area that's on the second node (plus a few less critical things used for better resource balancing, e.g. ZARIX tables or the application log or similar). So developers would know that as long as they are developing within finance, they will be good; or as long as they are developing entirely outside of finance; but that they need to pay attention when combining finance plus other application components in one query (one CDS view, one AMDP).

      When it comes to defining new customer tables, by default (if the system is set up as recommended), they would be placed on the coordinator/master node - but can be added to table groups and moved to other nodes as required. Also, if reasonable, customer tables can be replicated (synchronously) to all nodes.

      Now two things are important: a) not all cross-node queries are evil, but some are; and b) test, test, test -> if production is on scale-out infrastructure, a pre-production system should also exist on scale-out with identical table distribution, identical or comparable data volume, and a reasonable setup for testing, including workload.

      Best regards,

      Avik Mazumder

      Awesome! Thank you for your erudite explanation.

      Gokhan Kirisgil

      Hi Marc,

      Thank you for sharing your experience.

      Can you give some details about your environment (size of database, number of transactions happening at peak hours, interface/batch loads, etc.)?

      Then we would be able to compare our systems more accurately against those values.

      Marc Koderer
      Blog Post Author

      Hi Gokhan,

      thanks for your comment.

      I understand your request but unfortunately we are not able to share more details on the production system of the customer.

      Kalyana Kollipara

      Thank you for this. Can you point me to standard guidance on migrating a large single MDC system (15+ TB) to scale-out, either on-prem to on-prem or on-prem to hyperscaler? If there is no standard guidance, can you highlight how this was achieved and whether there are any tools that take care of table placement/grouping and partitioning?

      Also, you mention vertical scaling on GCP involves Bare Metal HANA systems. Is there something to read between the lines regarding vertical scaling? Can you clarify?

      Sudhendu Pandey

      Great blog and thanks for sharing the knowledge from the project.

      I was looking for the size of the HANA DB supporting this scale-out configuration to understand how it is the "largest". Was it larger than 72 TB scale-out? That seems to be the largest scale-out S/4HANA system I have learned to be running in production in the cloud (not GCP) so far.