Kubernetes to the Limit, 10 things we learned in our SAP Business Application Studio journey
Our story starts with the motivation to develop a new, modern, easy-to-manage, one-click, cloud-based development environment that is the evolution of the well-known SAP Web IDE. It ends with us running thousands of Kubernetes pods in production across multiple cloud providers.
If you are thinking about starting to use Kubernetes, you’ll want to hear about our Kubernetes journey. I will share with you the 10 most important challenges we had, how we handled them, and what we learned. Hopefully this will be both interesting and helpful for your journey.
But first, a little bit about our product and the challenges we had.
In a nutshell, our product, SAP Business Application Studio, is a modern end-to-end development environment that allows developers to easily develop and extend SAP solutions (in the cloud and on premise), seamlessly integrating SAP services, technologies (SAPUI5, etc.), and solutions. It provides a desktop-like experience similar to leading IDEs, with a command line, integrated debugging, and optimized code editors. In addition, it includes high-productivity development tools such as wizards and templates, graphical editors, quick deployment, and more.
At the heart of SAP Business Application Studio are the dev spaces, which are like isolated “virtual machines in the cloud” containing tailored tools and pre-installed runtimes per business scenario, such as SAP Fiori, SAP S/4HANA extensions, SAP Mobile, and more. The dev space covers the end-to-end needs of developers: project creation via wizard templates, efficient development with SAP technologies via code or graphical editors, connectivity to cloud and on-premise systems, local testing, debugging, building, and deployment to the SAP solution. This saves time in setting up the development environment and accelerates time-to-market in application development.
Now that we understand the challenge, we can focus on the technology we used to provide our users with the same user experience they would have on a local IDE.
If you read the blog headline, you already know we are using Kubernetes, but let’s dive a little bit into the technical side…
We are running multiple Kubernetes clusters (worldwide) using multiple IaaS services (compute, network, security) across multiple cloud providers (AWS, Azure, Ali Cloud). Each user dev space runs in a dedicated Kubernetes namespace, running a pod with multiple containers that work together to provide the specific dev space configuration (e.g. runtimes such as Java and Node.js, and SAP libraries such as SAPUI5).
The client side is based on Eclipse Theia (open source), while the remote pod provides the Theia server side and all the tools needed for the IDE (e.g. code compilation, build, and run configuration).
Each user can create multiple dev spaces, which means we run thousands of dev spaces in production.
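To make the "one namespace per dev space, one pod with several containers" layout more concrete, here is a minimal Python sketch that builds the two Kubernetes manifests as plain dictionaries. The naming scheme (`ws-<user>-<space>`), the labels, and the container images are assumptions made up for illustration; they are not the names our service actually uses.

```python
import json


def dev_space_manifests(user_id: str, space_id: str, containers: dict) -> list:
    """Build a Namespace and a multi-container Pod manifest for one dev space.

    The naming scheme, labels, and images are hypothetical placeholders.
    `containers` maps a container name to its image.
    """
    namespace = f"ws-{user_id}-{space_id}"
    labels = {"app": "dev-space", "owner": user_id}
    namespace_manifest = {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": namespace, "labels": labels},
    }
    pod_manifest = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "dev-space", "namespace": namespace, "labels": labels},
        "spec": {
            # One container per tool set (IDE server, runtimes, libraries).
            "containers": [
                {"name": name, "image": image} for name, image in containers.items()
            ]
        },
    }
    return [namespace_manifest, pod_manifest]


if __name__ == "__main__":
    manifests = dev_space_manifests(
        "u123",
        "fiori1",
        {
            "theia-server": "registry.example.com/theia-server:latest",
            "nodejs-tools": "registry.example.com/nodejs-tools:latest",
        },
    )
    print(json.dumps(manifests, indent=2))
```

Because every dev space gets its own namespace, anything scoped to a namespace (quotas, network policies, access rules) can be applied per dev space.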
So, without further ado, these are the 10 most challenging obstacles we had and what we learned:
#1 – Kubernetes-as-a-Service
How do you install a Kubernetes cluster across multiple cloud providers? How do you manage its versions? The node operating system?
We were looking for a tool that would assist us with the Kubernetes cluster installations. First, we thought about using each cloud provider's managed Kubernetes service (e.g. EKS, AKS) and managing a multi-environment installation ourselves. Then we considered using infrastructure-as-code tools (e.g. Terraform). In the end, we decided to go with the open source, SAP-managed Kubernetes service: project “Gardener”. Using project “Gardener” saves us a lot of effort in adjusting Kubernetes per IaaS. In the end (or in the beginning), you want to focus on writing your product’s code and not on managing clusters… project “Gardener” helps us with that. Another benefit is that we can maintain a single code line, with minimal variations of our product for the different IaaS providers.
- Identify your infrastructure requirements and their matching solutions (e.g. multi-IaaS or a single platform, private or public cloud)
- Choose a solution that will solve ALL your product use cases (e.g. dynamic scaling, easy and fast setup, configurable, version management, etc.)
- Production failures: choose a platform that is responsive and assists with problems (e.g. check the service’s SLA)
- Costs: how much will you pay for a good service and support
#2 – Costs
Lower production costs per single dev space (we allow each user to create multiple dev spaces.)
It is all about the money. You want to save money on the cloud services you consume in development and production. Reducing costs without hurting performance and service level was one of our greatest challenges. It was an ongoing effort to learn how to work efficiently in the cloud. You must understand the pricing model of the services you are using and regularly challenge the way you consume them. From the Kubernetes cluster perspective, you control the node resources, the cluster size (number of nodes), the pod size, and even how many pods each node can run.
- In the pay-as-you-go model, it is recommended to stop your development cluster when you are away (e.g. at night, on weekends).
- Do you really need a 1TB disk in a development cluster?!? Should it be an SSD disk?
- In our case, we found that limiting each cluster node to a fixed number of dev spaces gave us the best cost/value/performance ratio.
- Know all your costs and paid services
- Conduct an ongoing cost review
- Use a dedicated tool to analyze the costs
- Have a plan to reduce costs in production and in development clusters (the use cases are different)
- Be creative with cost saving and constantly challenge the way you consume payable services
- Challenge your CI/CD process to be cost effective
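The trade-off between node size, dev spaces per node, and running hours is simple arithmetic, so it is worth scripting. The following Python sketch is only an illustration: the prices, node sizes, and the 730-hour month are made-up assumptions, not our real numbers or any provider's price list.

```python
import math


def nodes_needed(dev_spaces: int, max_per_node: int) -> int:
    """Number of nodes required when each node hosts at most `max_per_node` dev spaces."""
    if max_per_node <= 0:
        raise ValueError("max_per_node must be positive")
    return math.ceil(dev_spaces / max_per_node)


def monthly_cost(
    dev_spaces: int,
    max_per_node: int,
    node_price_per_hour: float,
    hours_per_month: float = 730.0,
) -> float:
    """Pay-as-you-go compute cost for one month (hypothetical flat hourly price)."""
    return nodes_needed(dev_spaces, max_per_node) * node_price_per_hour * hours_per_month


if __name__ == "__main__":
    # Compare two hypothetical node types for 1000 dev spaces.
    small = monthly_cost(1000, max_per_node=10, node_price_per_hour=0.20)
    large = monthly_cost(1000, max_per_node=25, node_price_per_hour=0.40)
    print(f"small nodes: {small:.0f} per month, {small / 1000:.2f} per dev space")
    print(f"large nodes: {large:.0f} per month, {large / 1000:.2f} per dev space")

    # A development cluster that runs only 10 hours on 22 working days.
    always_on = monthly_cost(20, 10, 0.20)
    office_hours = monthly_cost(20, 10, 0.20, hours_per_month=10 * 22)
    print(f"dev cluster: {always_on:.0f} always on vs {office_hours:.0f} office hours")
```

A real review would also include disks, network traffic, and load balancers, which this sketch deliberately ignores.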
#3 – Security
Provide the user an isolated and secured remote dev space.
One of the most challenging, technical, and time-consuming topics is security. How to create a secured service in a public cloud? How to protect the user’s data? How to protect each dev space against potential threats?
First, we analyzed each component and use case to understand the threats. We built a top-down security model with potential threats. We identified 4 layers of threats: IaaS, Kubernetes, Pod, and Container.
- IaaS: Understand the shared responsibility model (the boundaries between your service and the cloud provider), and secure the cloud account and the compute services.
- Kubernetes: Protect the cluster resources and limit access to the cluster (e.g. view and admin access).
- Pod: Protect the pod using resource limits and the network policy.
- Container: Protect the container by isolating it in all stages, from build to execution. Prevent access from the container to other Kubernetes resources.
In addition to these 4 layers, we scan our code and dependencies using Black Duck (formerly Protecode), WhiteSource, and Checkmarx.
- Identify potential threats in all layers/components.
- Find the risks at an early stage to avoid unnecessary refactoring and security fixes (e.g. don’t commit plain text passwords to Git; it is much harder to protect them afterwards).
- Add security scans and tools as part of the CI process.
- Update the versions of all your software layers periodically (e.g. Linux, Kubernetes, code libraries, open sources).
- Protect your accounts as much as possible (e.g. 2FA, complex passwords).
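The pod and container layers described above (resource limits, a network policy, and no access from the container to other Kubernetes resources) can be sketched with standard Kubernetes fields. The Python below builds the manifests as dictionaries. It is a minimal illustration under assumptions, not our actual configuration: the concrete limits, names, and the choice of a default-deny ingress policy are hypothetical.

```python
def harden_container(container: dict, cpu: str, memory: str) -> dict:
    """Return a copy of a container spec with resource limits and a restrictive security context."""
    hardened = dict(container)
    hardened["resources"] = {
        "requests": {"cpu": cpu, "memory": memory},
        "limits": {"cpu": cpu, "memory": memory},
    }
    hardened["securityContext"] = {
        "allowPrivilegeEscalation": False,
        "runAsNonRoot": True,
    }
    return hardened


def harden_pod_spec(pod_spec: dict, cpu: str, memory: str) -> dict:
    """Harden every container and keep the service account token out of the pod."""
    hardened = dict(pod_spec)
    # Without a mounted token, code in the container cannot call the Kubernetes API with it.
    hardened["automountServiceAccountToken"] = False
    hardened["containers"] = [
        harden_container(c, cpu, memory) for c in pod_spec["containers"]
    ]
    return hardened


def default_deny_ingress(namespace: str) -> dict:
    """NetworkPolicy that selects all pods in the namespace and allows no ingress."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-ingress", "namespace": namespace},
        "spec": {"podSelector": {}, "policyTypes": ["Ingress"]},
    }


if __name__ == "__main__":
    spec = {"containers": [{"name": "theia-server", "image": "registry.example.com/theia:latest"}]}
    print(harden_pod_spec(spec, cpu="500m", memory="2Gi"))
    print(default_deny_ingress("ws-u123-fiori1"))
```

A deny-all policy is usually combined with narrow allow rules (for example, ingress only from the component that routes user traffic), and a network policy only takes effect if the cluster's network plugin enforces it.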
#4 – Performance Efficiency
Provide the users with a performance experience that is at least as good as on their local computer.
Running thousands of pods in production is tough; it really takes your service's performance capabilities to the limit. It forces your service to be flawless. You need to tune your resource consumption and design your system to be as efficient as possible. From the Kubernetes perspective, it means you must not only write efficient code, but also manage the memory and CPU resources you use in the cluster. The challenge of resource optimization includes questions like: Which disk are you using? How does the network bandwidth affect your use case? Should you over-provision your cluster for a better response time during busy periods? We handled all of these as part of the ongoing tuning of the system.
- Define your key performance indicators (KPIs) precisely, with real numbers.
- Set an indicator for every load level: busy time, average time, and idle time. A dynamic service can save you money.
- Run your performance tests as soon as possible; you want to know your bottlenecks as early as possible.
- Add performance tests to the done criteria of all relevant tasks, just like unit tests.
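"Real numbers" for a KPI usually means a percentile and a threshold that a test can check automatically. The Python sketch below shows one way to do that with a nearest-rank percentile. The metric (dev space start-up time), the sample values, and the 30-second target are hypothetical examples, not our actual KPIs.

```python
import math


def percentile(samples: list, pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def meets_kpi(samples: list, pct: float, threshold: float) -> bool:
    """True if the given percentile of the samples is within the threshold."""
    return percentile(samples, pct) <= threshold


if __name__ == "__main__":
    # Hypothetical dev space start-up times in seconds from one test run.
    startup_seconds = [12.1, 14.8, 11.3, 35.2, 13.9, 15.0, 12.7, 16.4, 13.3, 14.1]
    print("p50:", percentile(startup_seconds, 50))
    print("p95:", percentile(startup_seconds, 95))
    print("p95 within 30s:", meets_kpi(startup_seconds, 95, 30.0))
```

A check like `meets_kpi` can run in the pipeline next to the unit tests, so that a performance regression fails the build instead of surfacing in production.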
#5 – Operational Excellence
Run and monitor your service across multiple landscapes and at the same time provide business value and a good level of support.
This is one of the main challenges in every application. In the end, you want your service to run, run well, and recover fast when it fails. The road to get there includes managing your code versions, logging, monitoring, procedures for failure handling, and service traceability. To handle these challenges, we are using common open source tools such as Elasticsearch, Fluentd, and Prometheus, as well as project “Gardener” as our Kubernetes platform. In addition, we are using alerts and the Kubernetes self-healing capabilities.
- Make frequent, small, reversible changes to your code.
- Test your changes in internal environments first (e.g. CI, staging, canary).
- Use common tools for common tasks (e.g. logging, monitoring).
- Anticipate failures and learn from them.
- Avoid manual operations. Automation prevents human errors and is much faster.
- Understand that your application will change with time and your operational tools should adapt accordingly.
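Common logging tools work best when every component logs in the same machine-readable format. As an illustration, the Python sketch below writes one JSON object per log line using only the standard `logging` module. The field names and the `context` convention are assumptions for this example; the format our services actually emit is not described here.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Format each log record as a single JSON line."""

    def format(self, record: logging.LogRecord) -> str:
        # Optional per-message fields passed via extra={"context": {...}}.
        entry = dict(getattr(record, "context", {}))
        entry.update(
            {
                "time": self.formatTime(record),
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
            }
        )
        return json.dumps(entry)


def build_logger(name: str, stream=sys.stdout) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("dev-space-manager")
    # The cluster and dev space names are hypothetical.
    log.info("dev space started", extra={"context": {"cluster": "eu-1", "dev_space": "ws-u123-fiori1"}})
```

Writing to standard output keeps the application independent of the log collector, which can then pick up the container output and forward it to the search backend.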
#6 – Reliability
Run your service so that it performs its intended function correctly and consistently whenever it is expected to.
This challenge goes hand in hand with the operational excellence challenge. The service needs to be reliable and highly available, recover from any disruption, and dynamically acquire computing resources to meet demand (and release them during idle times, e.g. weekends). Kubernetes is perfect for that purpose. Its self-healing capabilities and dynamic scaling are some of the features we are using. In addition, we have a backup and recovery system. Resource capacity is one of the topics we handled as part of our performance tasks.
- Backup and Recovery is a must for every service. Plan and automate these processes as much as possible.
- Manage the cluster capacity carefully. How many nodes do you need? What resources does each pod consume?
- Adopt a zero-downtime approach as much as possible and design your components for that purpose.
- Test your system workload and take it to the limit to see how it behaves (e.g. stress tests and chaos testing).
- Monitor your system, alert on failures, and prepare a strategy for handling failures.
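The capacity questions above (how many nodes, and what each pod consumes) reduce to a small calculation once you know the pod's resource requests and the node's allocatable resources. The Python sketch below is a simplified model under assumptions: it ignores system pods, per-node pod limits, and scheduling fragmentation, and all numbers are invented for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    cpu_millicores: int
    memory_mib: int


def pods_per_node(node_allocatable: Resources, pod_request: Resources) -> int:
    """How many pods fit on one node; the scarcer resource (CPU or memory) decides."""
    return min(
        node_allocatable.cpu_millicores // pod_request.cpu_millicores,
        node_allocatable.memory_mib // pod_request.memory_mib,
    )


def nodes_for(
    pod_count: int,
    node_allocatable: Resources,
    pod_request: Resources,
    headroom_percent: int = 0,
) -> int:
    """Nodes needed for `pod_count` pods plus a spare-capacity buffer."""
    per_node = pods_per_node(node_allocatable, pod_request)
    if per_node == 0:
        raise ValueError("a single pod does not fit on this node type")
    target_pods = math.ceil(pod_count * (100 + headroom_percent) / 100)
    return math.ceil(target_pods / per_node)


if __name__ == "__main__":
    node = Resources(cpu_millicores=8000, memory_mib=32768)  # hypothetical node type
    pod = Resources(cpu_millicores=500, memory_mib=2048)     # hypothetical dev space request
    print("pods per node:", pods_per_node(node, pod))
    print("nodes for 1000 pods with 20% headroom:", nodes_for(1000, node, pod, 20))
```

The headroom parameter is where the over-provisioning decision becomes explicit: more headroom means faster start-up during busy periods and a higher bill during idle ones.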
#7 – User Experience
Provide our users with the best IDE experience: as close as possible to their current IDE, with the same features they have today, plus the business value our IDE adds.
- Talk to your users, understand what they want, and try to achieve it.
- Don’t reinvent the wheel. Cooperate with other tools and focus on your core product and added value.
#8 – DevOps
Build a robust CI/CD pipeline that supports a dozen production clusters worldwide (in multiple regions) across multiple cloud providers.
Since our service runs in production in multi-IaaS mode, the challenge here is a great one. For the CI side, we created separate pipelines, one per IaaS: Azure, AWS, and Ali Cloud. For the CD side, we decided to use Argo CD, a declarative, GitOps continuous-delivery tool for Kubernetes. We created a dedicated Git repository per production cluster. Combining the two allows us to control all versions easily and even add another production landscape with minor effort.
- Automate the CI/CD process as much as you can; manual steps are error prone.
- Add DevSecOps tools to your pipeline (e.g. vulnerability scans).
- Keep all teams on the same page. We are all responsible for our pipeline, not just the DevOps team.
- DevOps is a culture change; it takes time to adopt, but it is worth the effort.
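With one Git repository per production cluster, adding a landscape mostly means registering one more Argo CD `Application` that points at the new repository and cluster. The Python sketch below generates such manifests as dictionaries. The `Application` fields are standard Argo CD fields, but the cluster names, repository URLs, API server addresses, target namespace, and sync policy are hypothetical and do not describe our real setup.

```python
def argo_application(
    cluster_name: str,
    repo_url: str,
    api_server: str,
    path: str = ".",
    revision: str = "HEAD",
) -> dict:
    """Build an Argo CD Application that syncs one Git repository to one cluster."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": f"dev-spaces-{cluster_name}", "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {"repoURL": repo_url, "targetRevision": revision, "path": path},
            "destination": {"server": api_server, "namespace": "dev-spaces"},
            # Automated sync keeps the cluster aligned with what is committed in Git.
            "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
        },
    }


if __name__ == "__main__":
    # Hypothetical landscapes: cluster name -> (Git repository, cluster API server).
    landscapes = {
        "aws-eu": ("https://git.example.com/ops/aws-eu.git", "https://aws-eu.example.com"),
        "azure-us": ("https://git.example.com/ops/azure-us.git", "https://azure-us.example.com"),
    }
    for name, (repo, server) in landscapes.items():
        print(argo_application(name, repo, server))
```

Because the desired state of each cluster lives in its own repository, rolling back a landscape is a Git revert rather than a manual operation.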
#9 – Keeping Up with Technology
Moore’s law observes that the number of transistors on a chip, and with it computing power, doubles approximately every two years. This means that technology moves faster than you can adopt it. So how do you keep up?
In today’s world, the pace of technological progress is amazing. Every day, new open source projects emerge, cloud services change rapidly, and existing tools come out with new features. We learned that we must move forward alongside the technology and must not fall behind. This means that you should explore and learn all the time. Don’t hesitate to try out new tools. Constantly challenge your existing solutions for every component.
The Kubernetes community is vast, and you can find better solutions for almost any piece of code you wrote… maybe you should adopt them and throw away your code. This will take some effort on your part, but you’ll gain lots of features (present and future ones) for free (no coding or testing on your part).
One example from our service is storage. When we started our journey, the storage solutions for Kubernetes were not as advanced as they are today. We were looking for tools to improve and simplify our storage solution. Today, we are testing the latest available tools, such as Longhorn and Rook.
- Be curious, learn and try new tools.
- Challenge your micro-services and components to improve all the time.
- Don’t fall behind. Periodically upgrade the versions of all the tools you use.
- Don’t be afraid to throw away pieces of code and adopt other tools that do it better.
- Go to conferences, read blogs, share your ideas, and hear about others’ ideas.
#10 – Think First
The last one is, in my humble opinion, the most important one. It is relevant not only for coding and computers; it is a lesson for life: “First think, then act!” Understand the requirements, challenge them, and only then decide on the appropriate solution. Don’t rush into the solution, even when you need to deliver it yesterday.
Kubernetes is a great platform for running a microservice system; it is built exactly for that purpose. Yet, as its name implies (Kubernetes means helmsman or pilot in Greek), fine-tuning it is like sitting in a big ship’s cockpit: there are dozens of buttons you can press to steer it…
My last 2 cents: come with an open mind. There is a vast community that is willing to assist; you only need to learn how to benefit from it (…and how to give back to the community). I recommend you start at the CNCF (Cloud Native Computing Foundation).
I hope you enjoyed reading about our journey. We enjoyed going on it!