Skip to Content

Microservices architecture and reflection of its principles that can be found in modern cloud applications face certain challenges that are not common for monolithic systems – one of them is traceability of end-to-end transaction that is executed in distributed environment and that can potentially involve invocation of large number of diverse applications and services. In case of performance degradation that is experienced in end-to-end solution, identification of bottleneck component – such as called service or combination of services that caused performance degradation or significantly contributed to increase of end-to-end execution time – can become complex and nontrivial task. There are various technical solutions that address this concern, and many of them are based on reusing common approach: as a part of every call that is passed between components and services involved in distributed transaction execution, a specific trace identifier is added, that represents an end-to-end tracer and can be used to uniquely determine the executed transaction among other transactions across the entire landscape. Form and implementation of the trace identifier may vary – for example, in some landscapes where predominant integration architecture is SOA based, such identifier is placed as an additional SOAP header, in other environments, where diverse HTTP based (SOAP, REST/OData, plain HTTP) integration techniques are in use, the identifier is placed as an HTTP header, integration approaches heavily based on message brokering solutions suggest addition of the identifier to the produced message header / property (for example, AMQP or JMS message property). Given microservices architecture strongly relies on services exposed via HTTP and primarily promotes REST architecture, commonly used distributed tracing solutions in context of microservices are based on injection of tracer information within additional HTTP headers of exchanged messages.

 

OpenTracing API and main principles of distributed tracing systems

Recognizing heterogeneous nature of complex landscapes (which displays itself in variety of used platforms, programming languages, runtimes and frameworks) and demand in usage of generic, extensible toolset that is developed on top of platform agnostic and vendor neutral APIs and that can be used for building truly efficient distributed tracing system, community worked out and published so-called OpenTracing API specification that manifests concept and principles of distributed tracing and defines APIs that can be implemented for various platforms, following ideas of interoperability and compatibility. Implementations of OpenTracing API can be found already for several languages and are published on GitHub. I recommend familiarizing yourself with the specification, as it also defines core elements of the distributed tracing system. Here, I don’t go into details of it, but only draw attention to essential components and elements of its model.

 

The entire end-to-end transaction is identified by trace ID that remains constant across all involved applications. Each involved application can create one or multiple spans (analogue of transaction steps) that are identified by span ID, where span ID is unique for every span and contains a reference to trace ID. Each span, except the root span (the initial span that starts the trace), also contains a reference to its parent span – that is the preceding span of the trace. Each span is described with its start and finish time (that is required to calculate its execution time), name (which is commonly inherited from operation name), and can be programmatically assigned additional items (such as, for example, tags – that are specific to a particular span, and baggage items – that can be passed across multiple spans) that facilitates encapsulation of extra information required for further analysis or helpful for understanding of span context. Relationship of spans that correlate to the same trace, can be represented and visualized as a graph (as theory suggests, directed acyclic graph) that depicts the entire transaction (trace) and combination of steps (spans) that it consists of.

 

The below diagram that originates from documentation for Spring Cloud Sleuth (one of implementations of this concept, that I will get back to later) and provides visually compelling illustration of the described concept and relationship of trace and spans:

Source: Spring Cloud Sleuth by Adrian Cole, Spencer Gibb, Marcin Grzejszczak, Dave Syer.

 

Assuming applications / services are instrumented with the library that supports this concept and can produce traces in the compatible format (distributed tracing system client), trace information (composed of span data) generated by individual applications can be collected, persisted and processed (aggregated, analyzed, visualized, rendered, etc.) by a central component (distributed tracing system server complemented with presentation layer logic). Transport means and protocols that are used for shipment of span data from individual applications to the central server may vary. In the most basic setup, this can take form of HTTP requests containing information about spans, that are sent by applications to the central server. In more advanced setup, shipment of span data can be mediated using message brokering solutions (such as AMQP brokers – for example, RabbitMQ), which will allow decoupling of span data producer (traced applications) from span data consumer (distributed tracing system server). Persistence of collected span data (span storage) on the central server can be achieved using in-memory solutions, or those based on classic relational databases, streaming platforms (such as Apache Kafka) or search and indexing engines (such as Elasticsearch). Subsequent analysis of collected span data can include visualization of the entire spans’ graph and indication of context and execution time that was spent on each step of the traced transaction. High-level overview of involved mentioned components put together to form more sophisticated and advanced distributed tracing solution is depicted below:

 

Demo: description and remarks

In this blog, I demonstrate how the mentioned approach can be applied to the landscape that consists of several Spring Boot applications that provision various microservices and that are deployed to SAP Cloud Platform (SCP). In the demo, I use SCP’s Cloud Foundry environment, but very similar technique and tooling can be applied to Neo environment.

 

The demo is based on usage of Zipkin distributed tracing system. Reason behind selection of Zipkin is due to its ever-growing use and popularity within Java community in general and Spring based applications in particular. Tooling for instrumentation of Spring applications and enablement of distributed traces generation is developed in scope of Spring Cloud Sleuth project and has native and seamless integration to Spring Boot applications. Direct integration of Zipking client (span data generator) to Zipkin server (span data collector) over HTTP is used here, although Zipkin supports alternative, stream based options of span data shipment. In sake of simplicity, Zipkin server used in the demo, relies on built-in in-memory span storage – please note that for large scale, high volume, production ready solutions, alternative persistence layer options shall be considered – for example, Zipkin server can be configured to use Apache Kafka or Elasticsearch for persistence of collected span data.

 

Although demo relies on the specific implementation of distributed tracing system, it can be adapted to usage of OpenTracing API for Java, should this be required to design and roll out solution for heterogeneous landscapes including not only Java applications and services.

Setup of more complex infrastructure that was mentioned earlier, is not in scope of this blog so that attention is not distracted from the main idea of the blog.

 

To demonstrate practical aspects of the described concept, let us develop and deploy three Spring Boot applications – two applications are bundled with Zepkin client and will represent invoked business applications that follow some microservice principles, the remaining application is bundled with Zipkin server and UI for it.

Interaction diagram of developed applications is depicted on the illustration below:

Applications involved in the demo are:

  • ‘trigger’. The application sends a synchronous request to the service of the application ‘randomizer’, and after response from that service is received, is deliberately forced to wait for another 1 second (placeholder for some lightweight application logic). The application exposes HTTP GET endpoint, and in the demo, is triggered by requests sent using Postman, but any other HTTP client can be used as an alternative.
  • ‘randomizer’. The application is deliberately forced to wait for random period of time within the defined range – by default, between 10 and 30 seconds (placeholder for some heavyweight application logic). The application exposes HTTP GET endpoint, and in the demo, is triggered by requests sent by the application ‘trigger’.
  • ‘tracing-server’. The application receives span data from other applications, persists it and provides web based user interface to access span data in a visualized form.

Applications ‘trigger’ and ‘randomizer’ represent business application layer and implement logic of the called distributed transaction. Each of these applications will produce trace data, that needs to be consolidated.

Application ‘tracing-server’ represents server of the distributed tracing system with the embedded UI. The application aims collection, processing and visualization of collected trace data.

 

Applications’ source code, configuration properties and Gradle build scripts can be found in GitHub repository. Applications’ configuration that is published in repository, allows local execution of the entire demo – in order to avoid conflicts, application ‘trigger’ is configured to use port 8080, application ‘randomizer’ – port 8081, and application ‘tracing-server’ – port 8090.

 

Demo: implementation

In all applications, configuration file ‘bootstrap.properties’ has been added. In the file, property ‘spring.application.name’ is specified with the value that corresponds to the application name (refer to the diagram of applications used in the demo and provided above). This is required to allow application self-identification in the landscape.

 

Applications ‘trigger’ and ‘randomizer’ were developed using Spring Boot 2.0.

In both applications, additional log entries are created to indicate entry and exit from corresponding controller method, as well as to capture and log some relevant details regarding trace and the individual span – such as trace ID, span ID of the current span and its parent span, tags and baggage items that are available in context of the span. The application ‘trigger’ additionally adds extra tag and baggage item to the span. All operations with the span are implemented using API of Spring Cloud Sleuth library.

Applications expose corresponding HTTP GET based endpoints using Spring MVC annotation ‘@RestController’. Service’s URL exposed by the application matches application name. The application ‘trigger’ that needs to invoke endpoint of the application ‘randomizer’, does so using Spring’s RestTemplate class, URL of the endpoint to be invoked is provided to the application as a configuration property.

To embed Zipkin client in built applications, following dependency should be added: artefact ID ‘spring-cloud-starter-zipkin’ of group ID ‘org.springframework.cloud’.

After this is done and applications are instrumented with Zipkin client, it is necessary to configure it – minimal configuration implies:

  • Specification of URL of Zipkin server (configuration property ‘spring.zipkin.baseUrl’).
  • Selection and configuration of sampler strategy, that controls what spans are exported. Zipkin client library provides several predefined sampler strategies, that can be configured either using Spring bean configuration approach, and, for some of sampler strategies, they can be configured using configuration properties. In the demo, all span data is exported (the default sampler strategy with export of 100% of spans data, is used).

 

Application ‘tracing-server’ was developed using Spring Boot 1.5.x (precisely, release 1.5.9), as Zipkin server doesn’t support Spring Boot 2.0 yet.

To embed Zipkin server and UI, following dependencies should be added:

  • Artefact ID ‘zipkin-server’ of group ID ‘io.zipkin.java’ (Zipkin server).
  • Artefact ID ‘zipkin-autoconfigure-ui’ of group ID ‘io.zipkin.java’ (built-in user interface for Zipkin server).

After this is done, annotation ‘@EnableZipkinServer’ should be added to the application’s main class to enable Zipkin server nature for it.

 

Other bits of development are not relevant to primary focus of the blog, and are omitted here. After development of applications is done, unsure built scripts are configured properly, and assemble applications as JAR executable applications.

 

Demo: deployment

All demo applications are deployed to Cloud Foundry environment of SCP using standard tools, and started after deployment:

 

Demo: execution and trace analysis

We are now ready to run a demo by sending a request to the application ‘trigger’:

As it can be seen, the request has been executed successfully and response received, end-to-end execution of the request took almost 18,9 seconds.

 

Now let’s analyze applications’ log files and pay attention to some relevant details.

 

Firstly, we will look into log of the application ‘trigger’, as this is the primary entry point for the entire chain of calls:

 

2017-11-21T22:05:05.513+0000 [APP/PROC/WEB/0] OUT 2017-11-21 22:05:05.513 INFO [trigger,ba85d10109df0dc8,ba85d10109df0dc8,true] 7 --- [nio-8080-exec-5] c.d.scp.demo.trigger.TriggerController : Entering trigger service
2017-11-21T22:05:05.513+0000 [APP/PROC/WEB/0] OUT 2017-11-21 22:05:05.513 INFO [trigger,ba85d10109df0dc8,ba85d10109df0dc8,true] 7 --- [nio-8080-exec-5] c.d.scp.demo.trigger.TriggerController : Trace ID: ba85d10109df0dc8
2017-11-21T22:05:05.513+0000 [APP/PROC/WEB/0] OUT 2017-11-21 22:05:05.513 INFO [trigger,ba85d10109df0dc8,ba85d10109df0dc8,true] 7 --- [nio-8080-exec-5] c.d.scp.demo.trigger.TriggerController : Span ID: ba85d10109df0dc8
2017-11-21T22:05:05.515+0000 [APP/PROC/WEB/0] OUT 2017-11-21 22:05:05.515 INFO [trigger,ba85d10109df0dc8,ba85d10109df0dc8,true] 7 --- [nio-8080-exec-5] c.d.scp.demo.trigger.TriggerController : Span tag: mvc.controller.class - TriggerController
2017-11-21T22:05:05.515+0000 [APP/PROC/WEB/0] OUT 2017-11-21 22:05:05.515 INFO [trigger,ba85d10109df0dc8,ba85d10109df0dc8,true] 7 --- [nio-8080-exec-5] c.d.scp.demo.trigger.TriggerController : Span tag: mvc.controller.method - getTrigger
2017-11-21T22:05:05.515+0000 [APP/PROC/WEB/0] OUT 2017-11-21 22:05:05.515 INFO [trigger,ba85d10109df0dc8,ba85d10109df0dc8,true] 7 --- [nio-8080-exec-5] c.d.scp.demo.trigger.TriggerController : Span tag: http.path - /trigger
2017-11-21T22:05:05.515+0000 [APP/PROC/WEB/0] OUT 2017-11-21 22:05:05.515 INFO [trigger,ba85d10109df0dc8,ba85d10109df0dc8,true] 7 --- [nio-8080-exec-5] c.d.scp.demo.trigger.TriggerController : Span tag: http.url - https://trigger.cfapps.eu10.hana.ondemand.com/trigger
2017-11-21T22:05:05.515+0000 [APP/PROC/WEB/0] OUT 2017-11-21 22:05:05.515 INFO [trigger,ba85d10109df0dc8,ba85d10109df0dc8,true] 7 --- [nio-8080-exec-5] c.d.scp.demo.trigger.TriggerController : Span tag: http.method - GET
2017-11-21T22:05:05.515+0000 [APP/PROC/WEB/0] OUT 2017-11-21 22:05:05.515 INFO [trigger,ba85d10109df0dc8,ba85d10109df0dc8,true] 7 --- [nio-8080-exec-5] c.d.scp.demo.trigger.TriggerController : Span tag: demo-tag - DemoTagValue
2017-11-21T22:05:05.515+0000 [APP/PROC/WEB/0] OUT 2017-11-21 22:05:05.515 INFO [trigger,ba85d10109df0dc8,ba85d10109df0dc8,true] 7 --- [nio-8080-exec-5] c.d.scp.demo.trigger.TriggerController : Span tag: http.host - trigger.cfapps.eu10.hana.ondemand.com
2017-11-21T22:05:05.516+0000 [APP/PROC/WEB/0] OUT 2017-11-21 22:05:05.515 INFO [trigger,ba85d10109df0dc8,ba85d10109df0dc8,true] 7 --- [nio-8080-exec-5] c.d.scp.demo.trigger.TriggerController : Span baggage item: demo-baggage-item - DemoBaggageItemValue
2017-11-21T22:05:05.516+0000 [APP/PROC/WEB/0] OUT 2017-11-21 22:05:05.516 INFO [trigger,ba85d10109df0dc8,ba85d10109df0dc8,true] 7 --- [nio-8080-exec-5] c.d.scp.demo.trigger.TriggerController : Calling randomizer service
2017-11-21T22:05:22.499+0000 [APP/PROC/WEB/0] OUT 2017-11-21 22:05:22.498 INFO [trigger,ba85d10109df0dc8,ba85d10109df0dc8,true] 7 --- [nio-8080-exec-5] c.d.scp.demo.trigger.TriggerController : Thread is put to sleep for 1000 ms
2017-11-21T22:05:23.499+0000 [APP/PROC/WEB/0] OUT 2017-11-21 22:05:23.499 INFO [trigger,ba85d10109df0dc8,ba85d10109df0dc8,true] 7 --- [nio-8080-exec-5] c.d.scp.demo.trigger.TriggerController : Exiting trigger service
2017-11-21T22:05:23.519+0000 [RTR/1] OUT trigger.cfapps.eu10.hana.ondemand.com - [2017-11-21T22:05:04.827+0000] "GET /trigger HTTP/1.1" 204 0 0 "-" "PostmanRuntime/6.4.1" "10.0.136.1:36488" "10.0.137.37:62354" x_forwarded_for:"81.109.143.244" x_forwarded_proto:"https" vcap_request_id:"aab3a79b-c7ec-409b-5c0e-221b4e2b5791" response_time:18.691844099 app_id:"6cab380d-b845-49cc-816e-c310a3fa067e" app_index:"0" x_b3_traceid:"ba85d10109df0dc8" x_b3_spanid:"ba85d10109df0dc8" x_b3_parentspanid:"-"

 

It can be observed that every produced log entry is extended with information that is valuable for analysis of traces and logs in distributed environments – namely, entry contains identification of the service / application that produced log entry (‘trigger’), trace ID (‘ba85d10109df0dc8’) and span ID (‘ba85d10109df0dc8’). Since the application ‘trigger’ is the initiator of a trace, trace ID and span ID share the same identifier.

Note that custom span baggage item (named ‘demo-baggage-item’ and assigned value ‘DemoBaggageItemValue’) and span tag (named ‘demo-tag’ and assigned value ‘DemoTagValue’) that have been added programmatically in the developed logic, are reflected in log. This is important observation to keep in mind when we get to log of the application ‘randomizer’ and have practical observation of what part of span context is transmitted across spans and what part remains local to the specific span.

Another point to take a note on is the way how trace ID and span ID is exchanged between involved applications – as it can be seen in the log (alternatively, can be seen in HTTP trace), each HTTP request that is sent by instrumented application, is enriched with additional HTTP headers (X-B3-* – such as X-B3-TraceId, X-B3-SpanId, X-B3-ParentSpanId) that contain corresponding IDs (‘x_b3_traceid:”ba85d10109df0dc8″ x_b3_spanid:”ba85d10109df0dc8″ x_b3_parentspanid:”-“‘). This is integral part of design of Zipkin client library that is helpful in case correlation of requests within the trace is going to be based on HTTP tracing.

 

Next, we will check log of the application ‘randomizer’:

 

2017-11-21T22:05:06.953+0000 [APP/PROC/WEB/0] OUT 2017-11-21 22:05:06.952 INFO [randomizer,ba85d10109df0dc8,b9f0a334a15d66c2,true] 7 --- [nio-8080-exec-4] c.d.s.d.randomizer.RandomizerController : Entering randomizer service
2017-11-21T22:05:06.953+0000 [APP/PROC/WEB/0] OUT 2017-11-21 22:05:06.953 INFO [randomizer,ba85d10109df0dc8,b9f0a334a15d66c2,true] 7 --- [nio-8080-exec-4] c.d.s.d.randomizer.RandomizerController : Trace ID: ba85d10109df0dc8
2017-11-21T22:05:06.953+0000 [APP/PROC/WEB/0] OUT 2017-11-21 22:05:06.953 INFO [randomizer,ba85d10109df0dc8,b9f0a334a15d66c2,true] 7 --- [nio-8080-exec-4] c.d.s.d.randomizer.RandomizerController : Span ID: b9f0a334a15d66c2
2017-11-21T22:05:06.954+0000 [APP/PROC/WEB/0] OUT 2017-11-21 22:05:06.954 INFO [randomizer,ba85d10109df0dc8,b9f0a334a15d66c2,true] 7 --- [nio-8080-exec-4] c.d.s.d.randomizer.RandomizerController : Parent span ID: ba85d10109df0dc8
2017-11-21T22:05:06.955+0000 [APP/PROC/WEB/0] OUT 2017-11-21 22:05:06.955 INFO [randomizer,ba85d10109df0dc8,b9f0a334a15d66c2,true] 7 --- [nio-8080-exec-4] c.d.s.d.randomizer.RandomizerController : Span tag: mvc.controller.method - getRandomizer
2017-11-21T22:05:06.955+0000 [APP/PROC/WEB/0] OUT 2017-11-21 22:05:06.955 INFO [randomizer,ba85d10109df0dc8,b9f0a334a15d66c2,true] 7 --- [nio-8080-exec-4] c.d.s.d.randomizer.RandomizerController : Span tag: mvc.controller.class - RandomizerController
2017-11-21T22:05:06.955+0000 [APP/PROC/WEB/0] OUT 2017-11-21 22:05:06.955 INFO [randomizer,ba85d10109df0dc8,b9f0a334a15d66c2,true] 7 --- [nio-8080-exec-4] c.d.s.d.randomizer.RandomizerController : Span baggage item: demo-baggage-item - DemoBaggageItemValue
2017-11-21T22:05:06.955+0000 [APP/PROC/WEB/0] OUT 2017-11-21 22:05:06.955 INFO [randomizer,ba85d10109df0dc8,b9f0a334a15d66c2,true] 7 --- [nio-8080-exec-4] c.d.s.d.randomizer.RandomizerController : Thread is put to sleep for 15000 ms
2017-11-21T22:05:21.956+0000 [APP/PROC/WEB/0] OUT 2017-11-21 22:05:21.955 INFO [randomizer,ba85d10109df0dc8,b9f0a334a15d66c2,true] 7 --- [nio-8080-exec-4] c.d.s.d.randomizer.RandomizerController : Exiting randomizer service
2017-11-21T22:05:22.208+0000 [RTR/0] OUT randomizer.cfapps.eu10.hana.ondemand.com - [2017-11-21T22:05:06.185+0000] "GET /randomizer HTTP/1.1" 204 0 0 "-" "Apache-HttpClient/4.5.3 (Java/1.8.0_152)" "10.0.72.3:41418" "10.0.137.86:61907" x_forwarded_for:"52.28.241.88" x_forwarded_proto:"https" vcap_request_id:"9f319b39-b4e8-45e7-7208-c9c81ebd2dcd" response_time:16.022582773 app_id:"9d3746ec-9cec-47a6-a6d9-448a84295a58" app_index:"0" x_b3_traceid:"ba85d10109df0dc8" x_b3_spanid:"b9f0a334a15d66c2" x_b3_parentspanid:"ba85d10109df0dc8"

 

As it can be seen, trace ID reflected in log, remains the same as it was in log of the application ‘trigger’ (‘ba85d10109df0dc8’), but span ID got changed (‘b9f0a334a15d66c2’), although it has reference to its parent span, which is earlier observed span generated by the application ‘trigger’ (‘ba85d10109df0dc8’).

Note that earlier set span baggage item ‘demo-baggage-item’ is still present in context of this span, but span tag ‘demo-tag’ is not there – that is good evidence of earlier mentioned characteristic of spans: baggage items that have been created and set in earlier generated spans, are sent across the entire trace to subsequent spans, whereas tags created and set in one span are not sent across and are not available in context of subsequent spans. Therefore, baggage items might become helpful in case it is required to transmit some additional information across the trace and make use of it later when analysing generated and collected trace.

 

Finally, after logs of individual applications have been browsed, let’s see how they got collected and correlated by Zipkin server. Navigate to the endpoint of the application ‘tracing-server’ and search for collected traces in its web based user interface – here is search result:

For every collected trace, we can get visualization of spans’ absolute duration time breakdown and relative sequence of their duration within the trace:

In the collected trace, although gross time of ‘trigger’ duration was almost 18,1 seconds, most of this time (16,8 seconds) was spent for calling ‘randomizer’ and waiting for response from it – which results in only 1,3 seconds of net duration time of ‘trigger’.

For every collected span, it is possible to get its full details – as it can be seen, both standard span properties as well as extra scenario specific attributes that has been set in this demo and observed earlier in log of involved applications (such as span baggage item and tag), are present here:
Information on trace in general and each span of the trace in particular can also be downloaded in JSON format and used in other analysis tools.

 

 

Generated log entries that were enriched with additional attributes by described distributed tracing system client library or its analogues, can also be directly shipped to centralized log management system (such as the one based on ELK stack – Logstash, Elasticsearch, Kibana – or its equivalents), and further correlation of log entries and reporting / visualization of analyzed data can be performed by means of that system. Reason why I mention Elasticsearch here is because SCP’s Cloud Foundry environment already provides corresponding service (‘application-logs’) that is based on Elastic stack of technologies and facilitates collection, analysis and visualization of collected log data, which might become handy if organization already uses technology stack or plans to move to it – otherwise, similar alternative solutions can be utilized.

To report this post you need to login first.

Be the first to leave a comment

You must be Logged on to comment or reply to a post.

Leave a Reply