
Decentralized GitOps over multiple environments

 

1. Motivation

 

Before diving into the nitty-gritty details of how we wire build, deployment, and test setups together, let's quickly run over our motivation for this article. If you want to skip the history, feel free to jump ahead to section 1.3 (Takeaways) without remorse.

 

1.1. Where do we come from?

 

At SAP Artificial Intelligence, we run a huge set of Kubernetes clusters for development, testing and production across all major cloud-providers. The Kubernetes clusters themselves are powered by the wonderful Gardener project.

In the past we used a technology stack around Jenkins, Terraform and Helm to manage the configuration in our clusters. Even though such a technology stack is widely used and still has its merits, we found multiple issues managing a setup at the scale at which we operate. In particular, keeping state in sync between Kubernetes, Helm and Terraform can be challenging. Thus, we decided to onboard the GitOps way of managing all our Kubernetes clusters.

 

1.2. What do we want to solve here?

 

When we started looking into GitOps (as initially defined here), we found many great articles and blog posts about GitOps and how to use it in simple setups consisting of just one or a few clusters. However, we were missing some important implementation details when it comes to:

  • concise application configuration management of large environments
  • keeping multiple environments aligned
  • handling sensitive data such as secrets at scale
  • orchestrating environment and validation tests

While none of the above topics is in itself complex, all are required for an effective and efficient continuous delivery system (CD-system).

In this article, we want to provide a concise picture of how we wire all the above topics together to form our CD-system. In doing so, we hope to provide inspiration to other large projects and a basis for GitOps implementations at large scale. In addition, we hope to spark discussions around these large-scale GitOps approaches and future improvements to the setup.

 

1.3. Takeaways

 

In this article we present how we implement GitOps in our large-scale project at SAP Artificial Intelligence. In doing so, we provide our solutions to the following challenges:

  • How do we efficiently manage the configurations of many cluster instances in a GitOps configuration repository?
  • How do we efficiently manage the configurations of all applications in each cluster instance?
  • How do we incorporate sensitive configuration data into our GitOps configuration repository and how does lifecycle management of this data work?
  • How can we automatically test configuration changes in cluster-instances?

 

2. Boundary conditions

 

2.1. Prerequisites

 

Since we build on the foundations of GitOps, we assume for the rest of this article that the reader is familiar to some extent with the core concepts of GitOps, as well as the following open-source projects:

  • Kubernetes: the industry standard for container orchestration and machine management
  • Argo CD: an extremely versatile Kubernetes operator for handling cluster deployments in a GitOps-native way
  • Sealed Secrets: a Kubernetes operator for encrypting and decrypting sensitive data, which allows us to handle sensitive data without additional credential stores
  • Argo Workflows: the most advanced Kubernetes-native workflow execution engine

 

2.2. What we want

 

Before going into all the nitty-gritty details and complexities of our CD-system, here is a short list of requirements that we set for ourselves:

  • Highest importance
    1. Easy to use
    2. Secure
    3. Reproducible / Recoverable
    4. Fast
  • High importance
    1. Audit-trail included
    2. Easy to maintain
    3. Scalable across many clusters and cloud-providers
    4. Easy to reason about

Now this is of course just our particular choice and it certainly is not for everyone and every project, neither in content nor in order. Nevertheless, we took this list to guide the decisions around the CD-system that we want to build.

 

2.3. What we have

 

In addition to our wishes for a CD-system, we have some hard requirements stemming from our product's development and operational setup:

  • Around 20-30 development systems must be regularly updated and maintained
  • More than ten productive systems must be regularly updated and maintained
  • All systems shall offer a similar platform and may run on different cloud-providers

 

3. Mile high setup

 

In the spirit of DevOps, we want to make software deployment an engineering problem and create a CD-system that autonomously delivers desired changes to our various clusters while gathering human inputs where required. This in itself does not solve our requirements stated above, but it gives us the established tools of software engineering to achieve our goals. In particular, we can improve our CD-system iteratively to meet our goals, and we can start to shield human operators from technical complexities by automating these away in the CD-system where sensible.

For our Kubernetes CD-system, we heavily rely on the following three open-source projects:

  • Argo CD
  • Sealed Secrets
  • Argo Workflows

The former two give us the ability to specify our target cluster setup in an entirely declarative fashion in a git configuration repository, while Argo Workflows allows us to leverage the flexibility and modularity of imperative container-based workflows for testing our deployments.

In addition, we choose to run Argo CD in a decentralized setup, where we have one Argo CD instance running in each cluster, and this instance is solely responsible for the synchronization of the cluster it runs in. In contrast to the centralized setup depicted e.g. here, this decentralized approach yields better environment decoupling and removes a single point of failure (the deployment cluster) from the CD-system.

 

4. Branches, tags and other dilemmas

 

Before we touch on how we structure the content of our configuration repository, let's first discuss how we model multiple clusters inside this repository in an Argo CD-friendly way. In addition, we want to be able to easily add specific tweaks to individual environments while keeping the common bits between environments as concise as possible. In other words: how can we model multiple cluster configurations in a reasonably DRY ("don't repeat yourself") way?

Conceptually, we take a single-branch approach, where we store all configurations of all clusters inside a single main branch. Configuration changes to clusters appear as new commits to main, which allows us to roll out configuration changes to multiple environments in atomic commits. Environment-specific configurations are handled via environment-specific values files. While these changes happen in the repository atomically, we don't want them to appear in all affected clusters at the same time. We incorporate this by allowing each cluster to point to a specific commit along the main branch. Updates of clusters happen by moving this cluster-specific synchronization pointer to a later commit along the main branch.

 

4.1. Implementation

 

Apart from the conceptual perspective, there are some requirements that we cannot fulfil with a single-branch repository. Namely, our setup must solve the following requirements:

  1. Active configurations of productive clusters must be sealed behind tight access control.
  2. Productive and development clusters must have stable configurations.
  3. Proposed changes to the configuration repository must be testable in a scalable manner.
  4. The introduction of other branches than main must not lead to branch divergences (with respect to main).
  5. Update schedules can be configured for individual clusters.

To fulfil these needs, we introduce the following concepts:

  • branch-clusters (solves req. 2, 4 and 5)
    • For each branch-cluster there exists a dedicated cluster-branch in the configuration repository.
    • The Argo CD controller of a branch-cluster synchronizes the cluster to the HEAD of its dedicated cluster-branch.
    • The cluster-branch must point to a commit on main.
    • Cluster-branches have the naming convention cluster/cloud.region.clusterName.
  • tag-clusters (solves req. 3 and 5)
    • For each tag-cluster there exists a dedicated cluster-tag in the configuration repository.
    • The Argo CD controller of a tag-cluster synchronizes the cluster to the commit that is tagged with the cluster-tag.
    • Cluster-tags have the naming convention tag/clusterName.
  • 2 remotes (solves req. 1)
    • To ensure granular and tight access control, clusters synchronize to one of two remote repositories. Development and productive clusters synchronize to tags or branches on the dev and prod repository, respectively.
    • The main branch lives in the dev repository.

The described implementations are visualized in Fig. 1.


 

Fig.1: repository branch structure overview

 

4.2. Normal cluster updates

 

Since main holds the source of truth for the configurations of all clusters, a cluster-branch must not diverge from main in normal situations (this explicitly excludes hot-fixes). This means that, in general, all cluster-branches point to some commit on the commit-graph of main which, for most clusters, is behind the HEAD of main itself.

To update the HEAD of a cluster-branch to a new commit X on main, X is merged into the cluster-branch with a fast-forward only merge:

git merge "X" --ff-only

This type of merge effectively moves the cluster-branch along the commit-graph of main until it hits X (see the branch cluster/update.example in Fig. 1).

Tag-clusters can simply be updated by moving the cluster-tag to a new commit.
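Both update flows can be sketched in plain git. The snippet below builds a throwaway repository for demonstration; the cluster names follow the article's naming conventions, and the commits are dummies.

```shell
# Demo of both update flows in a throwaway repository.
set -eu
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git checkout -q -b main
git config user.email demo@example.com
git config user.name demo
echo a > config.yaml && git add . && git commit -qm "initial config"
git branch cluster/cloud1.region1.name1   # branch-cluster pins this commit
git tag tag/name2                         # tag-cluster pins it too
echo b > config.yaml && git commit -qam "new config on main"
X=$(git rev-parse main)                   # the commit we want to roll out

# Branch-cluster update: fast-forward only, so the branch can never diverge.
git checkout -q cluster/cloud1.region1.name1
git merge -q "$X" --ff-only

# Tag-cluster update: simply move the cluster-tag to the new commit.
git tag -f tag/name2 "$X"
```

In a real setup, the moved branch and tag would then be pushed to the remote that the respective cluster synchronizes against.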

 

4.3. Branch reconciliation and synchronization

 

In case a cluster-branch diverges from a target branch (e.g. main), a fast-forward-only merge is no longer possible. This is usually the situation after a hot-fix. We use the following process to reconcile the branches:

Reconciliation:

  1. Merge the cluster-branch into the target branch (main) via a normal (non-fast-forward) merge operation.
  2. Update the cluster-branch via a fast-forward-only merge operation.

Step 1 creates a path for the fast-forward merge in step 2.
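The reconciliation can be sketched in plain git as well. Again, the snippet operates on a throwaway repository, and the hot-fix and all names are illustrative.

```shell
# Demo of the two-step reconciliation after a hot-fix.
set -eu
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git checkout -q -b main
git config user.email demo@example.com
git config user.name demo
echo base > config.yaml && git add . && git commit -qm "base config"
git branch cluster/cloud1.region1.name1

# A hot-fix lands directly on the cluster-branch, so it diverges from main.
git checkout -q cluster/cloud1.region1.name1
echo fix >> config.yaml && git commit -qam "hot-fix"
git checkout -q main
echo change > other.yaml && git add . && git commit -qm "regular change on main"

# Step 1: merge the cluster-branch into main via a normal merge.
git merge -q --no-edit cluster/cloud1.region1.name1
# Step 2: fast-forward the cluster-branch onto the new merge commit.
git checkout -q cluster/cloud1.region1.name1
git merge -q main --ff-only
```

After step 2, the cluster-branch again points to a commit on the commit-graph of main, and normal fast-forward updates are possible once more.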

 

5. Who installs what now?

 

In the previous section, we focussed on the overall branch layout of our configuration repository. Now it is time to turn to the layout within the repository. As mentioned, we use Argo CD to synchronize our clusters to the desired cluster states described in a configuration repository. This is great for our run-of-the-mill applications and we can follow the official example applications to get started with application definitions in the configuration repository. However, there are two main corner cases that make the fully declarative application definition for a cluster complicated:

  • How do we define the complete list of applications that run in a cluster?
  • How do we fill an empty cluster and hand the management over to Argo CD?

 

5.1. Management hierarchy

 

The first question from the previous section is actually answered by the app of apps pattern of Argo CD itself. In this pattern, we use the cluster-list Helm chart to define one application-cr custom resource for each application that is managed by Argo CD. Since Argo CD installs all application-crs in the cluster-list and subsequently all resources defined in those CRs, the cluster-list serves as the synchronization seed for all applications managed by Argo CD. The cluster-list has the hierarchical structure depicted in Fig. 2:

 


 

Fig. 2: management hierarchy of Argo CD applications.


Root applications (shown in red) must exist in the cluster in order for the CD-system to work, while the remaining applications (shown in blue) capture all other applications that run in the cluster (e.g. services that hold the business logic).

The last 3 applications in the red category may vary depending on the specific cluster setup. We use cert-manager for dynamic certificate provisioning, external-dns for load-balancer registration with the cloud provider and an ingress-controller to expose the Argo CD UI.

The other 4 applications are the root-project, Argo CD, Sealed Secrets, and the cluster-list itself.

  • The root-project is the Argo CD project-cr for all root applications. This project is kept separate so that we can enforce a specific access-control on applications within it.
  • Argo CD is responsible for the synchronization with git and application installations in the cluster. Note that Argo CD in this setup manages the Argo CD application and hence its own configuration changes. To prevent misconfiguration disasters, we take the precautions described below.
  • Sealed Secrets is used to encrypt sensitive information so that we can store it in our configuration repository. Since Argo CD needs one piece of sensitive information to run (read access to the cluster configuration repository), Sealed Secrets must also be present in the root applications.
  • Finally, the cluster-list is also present in the cluster-list application definition. Argo CD can resolve this circular dependency, and it is handled similarly to the Argo CD application case: configuration changes to the cluster-list are picked up by Argo CD and applied directly. As in the Argo CD case, this circular dependency gives room for disastrous misconfigurations.
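For illustration, one application-cr as it might be templated in the cluster-list chart could look roughly like this; the repository URL, target revision, and paths are placeholders, not our actual values:

```yaml
# Sketch of an application-cr for external-dns; all concrete values are
# illustrative placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: external-dns
  namespace: argocd
spec:
  project: root-project
  source:
    repoURL: https://git.example.com/cluster-config.git
    targetRevision: cluster/cloud1.region1.name1   # the cluster-branch
    path: applications/root/external-dns/chart
    helm:
      valueFiles:
        - ../values/cloud1.region1.name1.yaml      # cluster-specific values
  destination:
    server: https://kubernetes.default.svc          # the local cluster
    namespace: external-dns
```

In our decentralized setup, the destination is always the local cluster that the Argo CD instance runs in.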

 

5.1.1. Accident prevention

 

In the previous section, we described that Argo CD not only actively synchronizes the configuration state of our business applications but also of all infrastructure applications, including all configuration around Argo CD itself. This of course opens the door to misconfigurations that could potentially tear down a whole cluster. To prevent such scenarios, we employ the following precautions:

  • All configuration changes to applications in the root-project require human operator approval before they are applied. This is achieved by setting the Argo CD sync policies to
    syncPolicy:
      automated:
        selfHeal: false
        prune: false
    
  • Critical resources are additionally annotated with argocd.argoproj.io/sync-options: Prune=false to prevent any pruning action from Argo CD. This applies to all resources that are tied to application state persistence, such as CRDs and namespaces.

Apart from this, Argo CD leaves underlying resources intact if applications are deleted without the cascade flag. In case of reinstallation, Argo CD also happily assumes control over existing resources if they fall under an application definition.
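As an example, the pruning guard on a state-carrying resource might look like this (the namespace name is illustrative):

```yaml
# A namespace protected against pruning; Argo CD will never delete it
# even if it disappears from the rendered manifests.
apiVersion: v1
kind: Namespace
metadata:
  name: service1
  annotations:
    argocd.argoproj.io/sync-options: Prune=false
```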

 

5.2. Repository layout

 

We map all applications running in all clusters to a particular folder layout in our GitOps configuration repository. This layout also holds cluster-specific additions as mentioned previously. Without further ado here is an extract of our repository layout:

+-- applications
|   +-- root
|   |   +-- argocd
|   |   |   +-- base
|   |   |   |   +-- ...
|   |   |   +-- overlays
|   |   |   |   +-- cloud1.region1.name1
|   |   |   |   +-- ...
|   |   |   +-- source (upstream kustomization folder)
|   |   |   |   +-- ...
|   |   |   +-- README.md (application readme with update information)
|   |   +-- cluster-list
|   |   |   +-- chart (helm chart folder)
|   |   |   |   +-- templates
|   |   |   |   |   +-- argocd.yaml
|   |   |   |   |   +-- cluster-list.yaml
|   |   |   |   |   +-- external-dns.yaml
|   |   |   |   |   +-- ...
|   |   |   |   +-- Chart.yaml
|   |   |   +-- README.md
|   |   +-- external-dns
|   |   |   +-- chart (upstream helm chart)
|   |   |   |   +-- templates
|   |   |   |   |   +-- ...
|   |   |   |   +-- values.yaml
|   |   |   |   +-- Chart.yaml
|   |   |   +-- values (cluster-specific value files)
|   |   |   |   +-- cloud1.region1.name1.yaml
|   |   |   |   +-- cloud2.region2.name2.yaml
|   |   |   |   +-- ...
|   |   |   +-- state (kustomization for state)
|   |   |   |   +-- base
|   |   |   |   |   +-- ...
|   |   |   |   +-- overlays
|   |   |   |   |   +-- cloud1.region1.name1
|   |   |   |   |   |   +-- sealedSecret-cloud-dns.yaml
|   |   |   |   |   |   +-- kustomization.yaml
|   |   |   |   |   |   +-- ...
|   |   |   |   |   +-- ...
|   |   |   +-- README.md (application readme with update information)
|   +-- service1
|   |   +-- chart
|   |   |   +-- templates
|   |   |   |   +-- ...
|   |   |   +-- values.yaml
|   |   |   +-- Chart.yaml
|   |   +-- values (cluster-specific value files)
|   |   |   +-- cloud1.region1.name1.yaml
|   |   |   +-- cloud2.region2.name2.yaml
|   |   |   +-- ...
.   .   .   .
.   .   .   .
.   .   .   .
+-- cluster-values
|   +-- cloud1.region1.name1.yaml
|   +-- cloud2.region2.name2.yaml
|   +-- ...
+-- tooling
|   +-- ...

For open-source applications, we keep the upstream files in our GitOps repository to pin the versions. In addition, keeping the upstream files separate (in a chart folder for Helm charts and in a source folder for Kustomize files) simplifies version updates.

For all applications we provide general configurations and cluster-specific inputs in dedicated folders and files. The application CRs in the cluster-list reference the general and cluster-specific inputs.

 

5.3. Bootstrapping

 

To easily onboard clusters to the configuration repository, we need to install the bare necessities into the cluster manually and then hand over to Argo CD for the remaining installations and the management of the cluster in general. Luckily, Argo CD is happy to assume control over existing resources without recreating them, so the transition from manual installation to Argo CD management comes for free. The manual installation concerns all applications in red in Fig. 2. In addition to these applications, sensitive data in the form of various access-keys must be provided for:

  • configuration repository read access (Argo CD),
  • DNS management access at the cloud provider (external DNS and cert-manager).

For the manual installation, we use a script that is manually triggered by cluster operators. For ease of use, this script is idempotent, and for traceability reasons all interactions with cluster-external systems rely on operator-specific access-keys. In pseudo code, this script executes the following idempotent steps:

  1. Read cluster_values
  2. Confirm that the operator wants to continue with the selected cluster
  3. Run file generation
  4. (if imperative) install Sealed Secrets
  5. Create secrets:
    • git_server_read_secret
    • docker_registry_secret
    • dns_secret
  6. (if selected) Dynamically update the cluster
  7. (if imperative) install remaining root applications:
    • argocd
    • cluster-list
    • external-dns
    • cert-manager
    • ingress-controller

Note that we install Sealed Secrets before we create any of the needed access-keys. This allows us to directly employ the Sealed Secrets operator to encrypt all sensitive data and store it in the form of SealedSecrets in the repository.

The core functionality of the install functions is actually fairly simple: render Kubernetes resources, then apply those resources to the cluster. To minimize friction during the hand-over to Argo CD, the resources should be rendered with the same binary versions that Argo CD uses internally. For Kustomize this means, in particular, that the stand-alone binary should be used, and for Helm it means that only the template functionality should be used.

 

6. How shall we dress our secrets?

 

Oh crap, all configuration also means all credentials, right?

As mentioned throughout this article, we want to store all configurations in our single configuration repository. For sensitive data this is of course a problem, since we do not want to litter our repository with access keys to various critical systems for everyone to see. Luckily, we have Sealed Secrets at our disposal which directly addresses this issue by providing an encryption mechanism for Kubernetes secrets.

We have a Sealed Secrets operator running in every cluster which stores public/private key pairs for encrypting and decrypting data locally in the cluster. We use the exposed public keys to transform our ordinary Kubernetes secrets into encrypted SealedSecrets which we can safely store in our configuration repository.
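The sealing step can be sketched with the kubeseal CLI; the secret name, namespace, controller location, and file name below are illustrative, and the key material comes from environment variables so that it never touches the shell history:

```
kubectl create secret generic cloudprovider-dns-access \
  --namespace external-dns \
  --from-literal=accessKeyId="$ACCESS_KEY_ID" \
  --from-literal=secretAccessKey="$SECRET_ACCESS_KEY" \
  --dry-run=client -o yaml \
| kubeseal --controller-namespace kube-system --format yaml \
> sealedSecret-cloud-dns.yaml
```

The resulting SealedSecret file can then be committed to the configuration repository; only the operator in the target cluster can decrypt it.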

This is however not the full story, since we also need the ability to update sensitive data regularly (e.g. for key rotations). If we only store the encrypted data in the configuration repository, we have no programmatic approach to perform such changes. Therefore, we enrich each SealedSecret with additional non-sensitive information in the form of annotations that uniquely identifies the origin of the data and provides all necessary information for a key rotation. For external-dns we have e.g.:

kind: SealedSecret
apiVersion: bitnami.com/v1alpha1
metadata:
  name: cloudprovider-dns-access
  namespace: external-dns
spec:
  template:
    metadata:
      name: cloudprovider-dns-access
      namespace: external-dns
      annotations:
        secret-manager/source: "aws"
        secret-manager/accessKeyId: "<access-key-id>"
        secret-manager/account: "<aws account id>"
        secret-manager/createDate: "<creation timestamp>"
        secret-manager/username: "<aws user name>"
        secret-manager/keyMappings: "[accessKeyId:AccessKeyId,secretAccessKey:SecretAccessKey]"
    type: Opaque
  encryptedData:
    accessKeyId: <encrypted data>
    secretAccessKey: <encrypted data>

With the annotations above, we can recreate the SealedSecret if we have access to the source system (AWS in this case). To achieve this, we developed a custom secret-manager tool to interact with all SealedSecrets that have annotations of the form secret-manager/.... This tool can perform the following tasks on SealedSecrets in the configuration repository:

  • list
  • update
  • delete

where each command can be executed on a filtered subset of all SealedSecrets in the repository. Filters include e.g. age of the credentials, source, or cluster.
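A hypothetical invocation could look as follows; the secret-manager is our internal tool, and these flags are illustrative of the filters described above, not its actual interface:

```
# list all AWS-sourced credentials older than 90 days in a given cluster
secret-manager list --source aws --older-than 90d --cluster cloud1.region1.name1
# rotate them in the source system and re-seal the affected SealedSecrets
secret-manager update --source aws --older-than 90d --cluster cloud1.region1.name1
```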

A zero-downtime key-rotation in our system involves 3 distinct steps:

  1. Create an additional new key in the source system and update the SealedSecret in the configuration repository that holds this key. This step must not delete the existing key that is active in the target cluster.
  2. Update the target cluster by publishing the configuration change (updated SealedSecret) to the cluster (either pushing to the cluster-branch or moving the cluster-tag).
  3. Remove the old key from the source system.

To update a given SealedSecret, the secret-manager evaluates the annotations as input:

  • source: source of the sensitive data
  • keyMappings: mapping between sensitive data provided by the source system and the keys that appear in the SealedSecret
  • additional source-specific information (e.g. accessKeyId, username and
    account)

The secret-manager employs admin-provided access-keys to interact with the identified source system. This allows us to keep an audit-trail of admin interactions with our source systems.

 

7. Go test yourself

 

Whenever we promote changes to various stages in our system, we need these systems to respond by testing the changes in question. For this, we use two components: reactions and Argo Workflows. The former is an operator that we developed; we use it to identify changes in the cluster configuration or our configuration repository and to trigger appropriate Argo Workflows as a result. The workflows then incorporate the specific actions (test suite execution or other) that should follow a given change.

 

7.1. Reactions

 

The reactions operator can be configured to monitor our configuration repository for certain events:

  • pull-request (PR) creation or update
  • approval of a PR

Upon observing such an event, a predefined reaction chain can be triggered. For PR validation clusters, for example, this chain contains:

  1. Update the cluster-tag to the HEAD of the PR branch
  2. Wait for deployment (reconciliation of Argo CD)
  3. Identify which component tests need to run based on the changes in the PR branch
  4. Execute the Argo Workflow test run

Alternatively, the reactions operator can monitor reconciliation updates by Argo CD (for branch-based clusters). When an update is observed, the operator waits for Argo CD to finish its synchronization cycle and subsequently triggers an Argo Workflow test run to confirm that the updated system still operates correctly.

 

7.2. Argo Workflow test runs

 

To test configuration changes in our clusters, we use Argo Workflows as an integration layer for all test suites provided by the application development teams. This allows us to minimize the interfaces between the global test setup and test suite-specific environment preparations. All component tests are provided in the form of a Docker container containing the test code and an Argo WorkflowTemplate containing the configuration details on how to execute the Docker container as part of the overall test pipeline.

We run a central orchestration workflow that takes care of cross-cutting aspects of the test suites:

  • Reporting test suite results to external systems (e.g. git, monitoring, task-tracking)
  • Gathering global results in the end

In addition, this central workflow triggers centrally defined sub-workflows that implement the plumbing for component-specific test suites by providing

  • a scratch space for tests
  • an object store to persist test results
  • timeout enforcement
  • execution of the WorkflowTemplate that implements the test suite

The WorkflowTemplate for the actual component-specific test suite fulfils the following interface requirements:

  • It provides a unified entry-point for the test suite
  • It accepts a predefined set of global parameters
  • It writes results to a predefined path

Only this last WorkflowTemplate must be provided by the team owning the component.

An example of such a WorkflowTemplate can look as follows:

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  annotations:
    workflows.argoproj.io/description: >-
      this workflow runs the example-service acceptance test.
    workflows.argoproj.io/maintainer: "@example-service-team"
  namespace: platform-tests
  name: at-example-service
spec:
  templates:
    - name: acceptance-tests
      serviceAccountName: "{{inputs.parameters.workflow-sa}}"
      inputs:
        parameters:
          - name: workdir
          - name: test-type
          - name: workflow-sa
          - name: debug
      container:
        image: "example-service-tests:v1.2.3"
        volumeMounts:
          - { name: workspace, mountPath: "{{inputs.parameters.workdir}}" }
        env:
          - { name: HELLO, value: "world" }

 

8. Summary

 

In this article we presented how we at SAP Artificial Intelligence approach GitOps in our large-scale project. Starting from our own history, we identified our core wishes for a CD-system and presented our take on how we get close to our desired CD-system.

We described how we manage many cluster instances in a single GitOps configuration repository and how we relate cluster configurations to git branches and tags. In addition, we provided a complete scheme of how configuration progression and configuration divergence in the configuration repository can be handled.

Subsequently we turned to the cluster level, where we described the management hierarchy of all applications that run in a cluster. This hierarchical system gives us a single entry-point to all configurations in each cluster. In addition, this simplifies the bootstrapping of new clusters.

After discussing the broad structure, we turned to the crucial topic of sensitive configuration data. We presented a mechanism with which this data can safely be stored in the GitOps configuration repository. Most crucially, we also described the process of updating sensitive credential data in a source system and, at the same time, in the GitOps repository in a zero-downtime key-rotation. With our secret-manager, we can perform this process in a programmatic and fully traceable fashion.

Finally, we discussed how we perform tests of configuration changes that we introduce in the GitOps repository. All test suites are encoded in the form of Argo Workflows, which split into central setup workflows and application-specific sub-workflows. For change detection, we rely on our reactions operator, which can identify untested pull-requests and updates to cluster configurations. Upon each change, this operator waits until the relevant cluster is prepared and then triggers a suitable test suite workflow.

The presented setup is very different from our previous CD solutions and hence the transition period was challenging at times. However, the resulting setup offers great benefits to us in terms of usability, security, reproducibility, and speed and we are very happy to have stepped into the world of GitOps.
