SAP Labs IT is committed to providing agile, experimental, global, development-specific IT infrastructure solutions that enable development teams to be more productive, creative, and innovative in creating new products or enhancing existing ones. This success story shows how the team lives by SAP’s mission statement.
The Perforce Landscape is part of the Platform Innovation HANA Platform Production Infrastructure unit. This landscape hosts 93 Perforce servers that are used as a centralized Software Configuration Management (SCM) system by 9000+ developers worldwide to check in and check out non-ABAP software code every day.
1. Situation & Critical Issue
The previous landscape had several challenges. Outages were a common occurrence from June 2011 onward and interfered with the development activities of the 9000+ developers, causing a user productivity loss of 5.31 million euros. The root cause could not be identified clearly because of the complexity of the cluster setup. The unplanned downtimes forced colleagues to look for immediate fixes rather than long-term solutions. The fixes that were implemented had little impact or, in some cases, made the landscape unstable and unpredictable. An in-depth analysis of the incidents pointed to a highly fragmented storage pool, which led to lower performance and likely instability of the cluster. The recovery process from backups had never been tested.
The landscape was overly complex in its setup and relied on outdated technology. The flexibility of the initial landscape had been achieved at the cost of increased complexity. The Change Management Process (filesystem extensions, storage requests) was not well established. Consequently, the significant growth of Perforce data volumes resulted in a high level of storage fragmentation. Very few colleagues had expert knowledge of the existing landscape. When it was conceived in 2010, the landscape was not designed to support 9000+ developers. Unclear processes caused the unit to miss support SLAs.
3. What we did
Labs IT formed a project team with experts from all IT support units. After careful consideration and deliberations, the project team decided to approach the solution in two phases.
The first phase was to simplify and stabilize the current landscape while the long-term solution was being implemented.
The second phase was to design, build, and test a stable and elastic landscape, based on the latest technologies, that would serve for the next five years.
The team reduced the complexity of the landscape by consolidating storage and changing cluster behavior. The team also automated critical tasks.
For storage consolidation, colleagues worked with the Perforce SMEs. They forecast future storage growth by analyzing trend data, past support tickets, and data from monitoring tools. Existing storage unit allocations, inventory, and performance data were also carefully analyzed. The plan was then fine-tuned further with expert advice from the storage team. To reduce downtime, the team developed a tool to migrate the data and worked over several weekends to implement the proposed solutions. Once the landscape was stable, the team was able to move on to the next, greater task: the redesign.
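As an illustration of the kind of trend-based forecast described above, here is a minimal sketch that fits a straight line to past storage usage and extrapolates it. The function name, the sample figures, and the choice of a simple least-squares line are assumptions made for this example; the story does not say which method or tooling the team actually used.

```python
def linear_forecast(history, periods_ahead):
    """Extrapolate a least-squares straight line fitted to `history`.

    `history` holds one usage value per period (for example TB per month),
    oldest first, and needs at least two entries.
    Returns the projected value `periods_ahead` periods after the last entry.
    """
    n = len(history)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(history) / n
    # Ordinary least-squares slope and intercept.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, history)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return intercept + slope * (n - 1 + periods_ahead)


# Hypothetical monthly usage in TB, not real data from the landscape.
sample_usage_tb = [10, 12, 14, 16]
print(linear_forecast(sample_usage_tb, periods_ahead=2))
```

A straight line is the simplest possible model. Real growth data would likely call for cross-checking against support tickets and monitoring data, as the team did.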
The project team conducted a workshop and brainstormed across the full spectrum of topics. They created a blueprint of the future landscape and prepared support processes. The team created various test cases and test environments to validate that the new landscape would deliver on its promises. The team also accommodated the additional challenge of migrating the storage to a new EMC VMAX 40K system.
SAP Labs IT and the customer team performed the migration of the Perforce Landscape over 20+ weekends. This activity consisted of migrating the existing 8-node clusters to the new landscape consisting of 6 individual clusters. Ninety-three Perforce services distributed across several cluster nodes were migrated seamlessly without major business impact.
4. Where we are
We are now able to achieve 99.8% availability. The causes of the initial issues were identified and mitigated, and we expect availability to increase further. The increased availability of the landscape raised productivity, a gain of 4.61 million euros. The charts below show the differences between the old and new landscapes.
The backup processes were further fine-tuned. After careful analysis of the RPO (Recovery Point Objective) and RTO (Recovery Time Objective), clear processes for Disaster Recovery have been defined. The data for every service will be restored in a test system once per quarter and validated for consistency. Regular fire drill exercises will be executed in the test landscape as part of testing before any major change to the landscape is implemented.
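To put the 99.8% figure in perspective, the short sketch below converts an availability percentage into the downtime it allows over a period. The one-year period of 8760 hours and the function name are assumptions chosen for illustration; the story does not state the measurement window behind the reported figure.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumed period)


def allowed_downtime_hours(availability_pct, period_hours=HOURS_PER_YEAR):
    """Return the downtime in hours implied by an availability percentage."""
    return period_hours * (1 - availability_pct / 100)


# 99.8% availability over one year leaves roughly 17.5 hours of downtime.
print(round(allowed_downtime_hours(99.8), 2))
```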
There is room for further improvements. We are working on further automation, enhancing the processes and bringing the support units together in the run team.
The journey continues...