Last year we worked on the response times for Basic Authentication. I wrote in my last blog post about the successes we achieved.
Right at the beginning of 2013, a number of new (cloud) applications were linked to SAP ID service to handle user authentication. Almost all of these new applications use SAML 2.0 to authenticate the user.
After a few weeks we had a look at the response times of our systems. What we saw was far from acceptable: the response time was somewhat less than 1 second on average for a complete cycle with SAML 2.0 authentication. The utilization of the systems in terms of RAM and CPU usage as well as disk I/O was almost at maximum.
We could not leave it that way.
Before I go into details, let me explain the flow of SAML 2.0 for SSO in the browser:
1. User requests a resource at a Service Provider (SP) in a browser
2. SP redirects to the SSO service at the Identity Provider (IdP)
3. Browser requests the SSO service at the IdP
4. User authenticates against the IdP (e.g. by username/password, certificate)
5. IdP responds with an assertion
6. Browser requests the assertion consumer service at the SP
7. SP issues a redirect to the original resource from step 1
8. Browser follows the redirect and requests the resource
9. SP responds with the requested resource
For the following considerations we took into account steps 1 to 6. Steps 7 to 9 are beyond the control of SAP ID service.
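To make step 2 a little more concrete, here is a minimal sketch of how an SP could pack an authentication request into the redirect URL. It assumes the SAML 2.0 HTTP-Redirect binding (raw DEFLATE, then Base64, then URL encoding) and leaves out signing. The function names, the IdP URL and the sample XML are made up for illustration and are not taken from SAP ID service.

```python
import base64
import urllib.parse
import zlib


def encode_redirect_request(authn_request_xml: str) -> str:
    """Raw-DEFLATE and Base64-encode a request (HTTP-Redirect binding, unsigned)."""
    compressor = zlib.compressobj(wbits=-15)  # negative wbits = raw DEFLATE, no zlib header
    deflated = compressor.compress(authn_request_xml.encode("utf-8")) + compressor.flush()
    return base64.b64encode(deflated).decode("ascii")


def decode_redirect_request(param: str) -> str:
    """Inverse of encode_redirect_request, as the IdP would apply it in step 3."""
    return zlib.decompress(base64.b64decode(param), -15).decode("utf-8")


def build_sso_redirect_url(sso_url: str, authn_request_xml: str, relay_state: str) -> str:
    """Build the URL the SP sends the browser to in step 2."""
    query = urllib.parse.urlencode(
        {
            "SAMLRequest": encode_redirect_request(authn_request_xml),
            "RelayState": relay_state,
        }
    )
    return f"{sso_url}?{query}"


if __name__ == "__main__":
    # Hypothetical, heavily shortened request; a real AuthnRequest carries more attributes.
    sample_request = '<samlp:AuthnRequest ID="_example" Version="2.0"/>'
    print(build_sso_redirect_url("https://idp.example.com/sso", sample_request, "state-1"))
```

Every one of these requests has to be inflated and parsed on the IdP side, which is part of the work that falls into steps 1 to 6.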
To achieve really good system behavior, it is essential to set up pro-active performance engineering as early as possible in the life cycle of a new application. I know the reality is usually different.
But once the system has reached its limits, it is much more difficult to implement modifications without breaking existing scenarios. The cost of running performance tests while an application is in development is relatively small. Especially in a virtualized infrastructure, such as the one we operate SAP ID service in, you should control the resource consumption of your application.
To arrive at good system behavior, we had to set up Performance Best Practices for SAP ID service. The steps are:
- Identification of key performance metrics
- Define metric threshold
- Set up pro-active reporting
Identification of key performance metrics
When we started, we realized that we already had a lot of metrics available. Some had also been defined as critical. However, we had to pick out of the multitude of existing metrics the ones that are really important. The most important metric is the “happy path”: the basic authentication flow of a user, regardless of the dozen special cases around it.
The existing load tests served as a basis, as they represent the “happy path”.
But even when writing load tests you have to make sure that the tests themselves are efficient. We had to revise our load tests, because even a minimal increase in the number of users/threads in our JMeter load tests had massive effects on the load generator itself. So we replaced XPath expressions with regular expressions, as they are handled more efficiently.
One of the key performance metrics is of course the response time. In addition we had to analyze the CPU utilization and RAM consumption. Connected to these metrics is the efficient use of existing objects and data structures.
We used the SAP JVM Profiler to analyze how efficiently objects are used and optimized our code based on the results. Furthermore, we used the SAP JVM Profiler to find the most time-consuming methods. Both methods, the allocation analysis and the performance hotspot analysis, showed us areas we had to work on. They also showed us that we had a problem with our disk I/O: the call frequency of one authentication method, combined with the logging inside this method, revealed a problem in the configuration of the connected storage system.
At the beginning of the year we had two systems in use to store users and system configurations: LDAP to store users and MongoDB for specific configurations that cannot be stored in the LDAP itself.
However, to get a complete user, two data stores had to be queried. This not only takes time, it is also very inefficient in the long run. Some information was stored multiple times. Some tests showed that we could handle all data in MongoDB much more efficiently, and faster. So by mid-year all users had been migrated during normal operation and we could turn off the LDAP servers. Reading from just one repository significantly improved the response time.
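The difference between the two extraction styles can be sketched outside JMeter. The snippet below is only an illustration in Python, not our test plan: the form markup, the URL and the function names are assumptions. Both functions pull the same hidden field out of a response, but the regular expression scans the text once, while the XPath variant has to parse the whole document into a tree first.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed response body as an IdP might return it in step 5.
SAMPLE_FORM = (
    '<html><body>'
    '<form method="post" action="https://sp.example.com/acs">'
    '<input type="hidden" name="SAMLResponse" value="PHNhbWxwOlJlc3BvbnNlPg=="/>'
    '<input type="hidden" name="RelayState" value="abc123"/>'
    '</form></body></html>'
)

# Compiled once and reused for every sample, as a load generator would do.
SAML_RESPONSE_RE = re.compile(r'name="SAMLResponse"\s+value="([^"]*)"')


def extract_with_regex(body: str):
    """Scan the raw text for the hidden field; no document tree is built."""
    match = SAML_RESPONSE_RE.search(body)
    return match.group(1) if match else None


def extract_with_xpath(body: str):
    """Parse the whole body into a tree, then select the field by path."""
    root = ET.fromstring(body)
    node = root.find(".//input[@name='SAMLResponse']")
    return node.get("value") if node is not None else None


if __name__ == "__main__":
    print(extract_with_regex(SAMPLE_FORM))
    print(extract_with_xpath(SAMPLE_FORM))
```

The regular expression depends on the attribute order in the markup, so it is less robust than XPath. For a load test against a response format you control, that trade-off is usually acceptable.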
A setback was the stronger encryption of sensitive data: with it we were much slower than when reading from two data stores. As encryption is purely CPU-bound, we followed the motto “the more the merrier” and moved the whole application to a newer virtualized landscape with more powerful CPUs. That brought us back to the response times we had before raising the security level.
Define metric threshold
Defining metric thresholds means finding answers to several questions. Which level of utilization can be considered good? Which level would be bad? When have we reached the balance between further optimization and a “good enough” response time?
You arrive at the answers to these questions through research and testing. In addition we had several discussions with the experts of our performance lab. But progress is very slow.
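One way to ground such a discussion is to look at the distribution of measured response times rather than only the average. The following is a minimal sketch, assuming response times in milliseconds collected from a load test. The sample values are invented and the nearest-rank percentile is just one common definition.

```python
def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p percent at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, avoiding floating point rounding.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize(samples):
    """Return the figures a threshold discussion typically starts from."""
    return {
        "mean": sum(samples) / len(samples),
        "p50": percentile(samples, 50),
        "p95": percentile(samples, 95),
    }


if __name__ == "__main__":
    # Invented response times in milliseconds; one slow outlier on purpose.
    response_times_ms = [120, 150, 180, 200, 950, 210, 170, 160, 190, 140]
    print(summarize(response_times_ms))
```

In this invented sample the single outlier pulls the mean well above the median, which is why a threshold on the average alone can be misleading.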
Set up pro-active reporting
We spent some time defining the right thresholds based on method execution time and response time. These thresholds need to be adjusted with every new feature or change that gets deployed to the system.
For the metrics we consider important we defined several thresholds (low, medium, high usage) and integrated them into our monitoring system.
The next step will be to rework the system into a more service-based architecture to handle the future load even better.
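The idea of graded thresholds can be sketched in a few lines. This is an illustration only: the metric names and limit values below are hypothetical and not the ones used in our monitoring system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Lower bounds of the 'medium' and 'high' bands for one metric."""
    medium: float
    high: float


# Hypothetical limits; real values come out of load tests and need regular review.
THRESHOLDS = {
    "response_time_ms": Thresholds(medium=300.0, high=800.0),
    "cpu_utilization_pct": Thresholds(medium=60.0, high=85.0),
}


def classify(metric: str, value: float) -> str:
    """Map a measured value to 'low', 'medium' or 'high' usage."""
    limits = THRESHOLDS[metric]
    if value >= limits.high:
        return "high"
    if value >= limits.medium:
        return "medium"
    return "low"


if __name__ == "__main__":
    for metric, value in [("response_time_ms", 950.0), ("cpu_utilization_pct", 42.0)]:
        print(metric, value, classify(metric, value))
```

Keeping the limits in one table like this makes it easier to adjust them whenever a new feature changes the expected load profile.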