Develop your Agile SQL Data Warehouse on SAP HANA: GIT Repository & Containers
Last time we talked about the SAP Web IDE and the editors that you use for Data Warehousing scenarios. But we did not handle two important aspects, which are the main drivers for what is “agile” about this data warehouse approach: the GIT repository, and containers. Containers allow a developer to work in isolation, unobstructed by others. And that is supported by a repository called Git.
Classic DW challenges with one workspace and runtime
In a common best-of-breed data warehouse project, all developers work on the same version of the repository and the same run-time environment. Any change made by one developer and activated on the database, in the ETL tool, or other tooling, is immediately visible for all other developers. A logical result of working in a “shared workspace” and “shared runtime” is that it is hard to develop and test features or user stories isolated from other developers or development teams.
Figure 1: Classic DWH development: developers work in the same workspace and runtime, on the same version
Let’s list a few obstacles you can run into with the classic approach:
- Parallel development. If you work with several teams on a large DW, there are two ways you easily clash with another developer or team: by changing the data set that other work on or depend on, or by changing the objects that other teams work on or depend on.
- Hot- and bugfixes for productive release. When you finished a release on Development and transported this to Test, Acceptance or Production, and bugs are found, how to fix those if meanwhile on Development you started with the next release that touch the same objects? I have been in bad situations where we had to build back Development to “how it is at production” before producing a fix.
- Playing around. Official development and playing around gets mixed up: developers need a place to try out some code or certain ETL construct. You could do this in a sandbox, but sometimes there is none, and even if you have one, it’s data or prerequisite artifacts are often outdated. But playing around in the shared development environment can accidentally affect builds or create confusion.
Programmers fixed this problem largely by having developers run their code and tests locally. But that usually doesn’t work for Data Warehousing where you are dependent on external server connections, lots of data, and especially the lack of tool support for a distributed repository.
Agile DW and Containers
In the Web IDE, the earlier described problem is solved on the server. This benefits the DW use case. For each project, each developer works in her own container. In database terms that means that for each developer working on a project, a schema is created automatically when joining a project. In that schema, all database artefacts of a project are activated specially for one developer, including data, data definitions like tables, ETL flowgraphs, calculation views, analytical privileges, you name it. This is all taken care of by the Web IDE: once you create, or clone a project, and you press the build button, a schema is generated in the background and you get access to that schema automatically. You can do this for multiple projects, or multiple versions of a project: imagine that as a developer you might run seven containers in parallel, some for the same project where you work on different features.
Figure 2: For each developer, a schema is created at build of project
Now that each developer runs an isolated container, development can be done completely isolated from others: data can be removed and reloaded, objects changed, removed, or created, without affecting the work of others. But what if you need to work together? How can you get changes from other developers?
Web IDE & GIT: sync when you want to
In the WebIDE, all you create are design-time objects. Only when you press the build button they are activated into runtime objects in the database. The Web IDE stores the design-time objects in a repository called Git.
What is Git? Well, to quote Wikipedia: “Git is a version control system for tracking changes in computer files and coordinating work on those files among multiple people”. So, that’s how developers share their work. Even though each developer runs an individual container, all design time files are shared through the Git repository, and it is up to the developer to decide when to sync, and when to make a new local build of all changes.
Figure 3: The Git repository holds the design-time objects and is the point of synchronization between developers
An important feature of Git is branching. On a given repository, you can branch off to work on a separate version of the repository. This is how you can isolate the development of design-time objects. You could for example create a new branch for a new user story or feature that you are developing on your own or with your team. Only when the user story is complete, you would merge the changes back into a main branch. You can probably imagine that this branch functionality fits very well to the container architecture of the WebIDE: you can isolate both the design-time and the runtime objects.
Figure 4: Two developers working on one user story 1, without impacting user story 2
“But this is a Data Warehouse…”
The response I get most often when discussing Git and containers for Data Warehousing is “How does this approach make sense for developing a Data Warehouse? How do you create and work with all these copies of the data?”. The answer is that you don’t have to duplicate all your data, and you don’t have to run all your data flows to get each container filled. That would indeed take up too much space, unless you have perfectly sized test data for all your sources, which rarely is the case. It would also take too much work and time to run all these batch or replication loads for each container. Instead, the answer lies in a smart split of containers.
One way or another, you will probably end up running different containers for different parts of the DW. For example, you could store your raw data layer in one container. This is where you simply store data from the source untransformed. In another container, you design all virtualized logic on top of the raw data, and point it to the first container. As most complexity is in the logic in the second container, which doesn’t hold any data, those are the ones you will want to create different versions of and work on in parallel.
Figure 5: Working in parallel on two design-time branches of a project in WebIDE
Figure 6: Run-time containers of a developer for the two branches
Another response I get is that Git is regarded to be too complex of a repository. And while it is true that there is a learning curve, the complexity of Git is primarily determined by how you set it up: you can make this pretty simple, or you can make it very complex. If you are unfamiliar with Git, best is to read up first on a few blogs that treat different Git workflows, to get an understanding of the possibilities. I like these two: a successful Git branching model explains Git in short with a (by now) well known model, and Git Workflows That Work presents an overview of Git workflows.
What was hard to manage with classic DW tooling, has become a lot easier using the WebIDE for HANA and the XS Advanced platform. Creating a HANA SQL Data Warehouse with the WebIDE allows for isolation of design-time and runtime versions of a project, which in turn provides a lot of flexibility for working with different developers or different teams on the same Data Warehouse.
For more information, visit our landing page for SAP HANA SQL Data Warehousing.