Distributed Tracing with Sleuth and Zipkin for Rookies

So, you want to build a distributed system, but wonder how to stay in control of your flow, troubleshoot potential latency problems, and see what is actually going on when something breaks down? Tracing comes to the rescue, giving you the ability to correlate your requests and responses in the depths of network traffic. If your stack is based on Spring Boot, then creating a basic setup is as easy as catching jellyfish in Bikini Bottom! 🙃

Sleuth

Assuming you have a Spring Boot application, you only need to have Sleuth on the classpath to have it do all the tracing for you. It marks particular parts of your flow with a Trace Id and a Span Id, which are included (among other things) in message headers. Other information is provided as well, e.g. for putting traced requests in the proper order and for timing details.
A span is a basic unit of work, e.g. a single exchange of messages (like an HTTP request and response), and a trace is a set of spans forming a tree-like structure, e.g. all messages exchanged as part of the execution of a flow initiated by a single message.
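
To make the trace/span relationship a bit more concrete, here is a minimal Java sketch of how the IDs could travel from one service to the next. It is only an illustration, not Sleuth's actual implementation: the class and method names are made up for this post, and it assumes the B3 header names (X-B3-TraceId, X-B3-SpanId, X-B3-ParentSpanId, X-B3-Sampled) used by Zipkin-compatible tracers.

```java
import java.security.SecureRandom;
import java.util.LinkedHashMap;
import java.util.Map;

/** Simplified model of trace-context propagation; NOT Sleuth's real implementation. */
final class TracePropagationSketch {

    private static final SecureRandom RANDOM = new SecureRandom();

    /** Returns a random 64-bit ID as 16 lowercase hex characters. */
    static String newId() {
        return String.format("%016x", RANDOM.nextLong());
    }

    /** Headers for the first request of a flow: new trace, new span, no parent. */
    static Map<String, String> startTrace() {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("X-B3-TraceId", newId());
        headers.put("X-B3-SpanId", newId());
        headers.put("X-B3-Sampled", "1"); // assumption: this trace is chosen for export
        return headers;
    }

    /** Headers for a downstream call made while handling the incoming request. */
    static Map<String, String> childOf(Map<String, String> incoming) {
        Map<String, String> headers = new LinkedHashMap<>();
        // The trace ID is inherited unchanged, so all spans share it.
        headers.put("X-B3-TraceId", incoming.get("X-B3-TraceId"));
        // The caller's span becomes the parent, which is what builds the tree.
        headers.put("X-B3-ParentSpanId", incoming.get("X-B3-SpanId"));
        // Every new unit of work gets its own span ID.
        headers.put("X-B3-SpanId", newId());
        // The sampling decision, if present, is passed along as-is.
        if (incoming.containsKey("X-B3-Sampled")) {
            headers.put("X-B3-Sampled", incoming.get("X-B3-Sampled"));
        }
        return headers;
    }

    public static void main(String[] args) {
        Map<String, String> service1ToService2 = startTrace();
        Map<String, String> service2ToService3 = childOf(service1ToService2);
        System.out.println(service1ToService2);
        System.out.println(service2ToService3);
    }
}
```

The trace ID stays the same on every hop while the span ID changes, and that is exactly what lets the spans be stitched back into one tree later on.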

This visualization from Spring Cloud Sleuth documentation shows very well what traces and spans are:
Traces and spans

Trace Id and Span Id appear in relevant logs (if you do log things of course 🙂), e.g.:

service1.log:2017-01-19 15:12:02.545  INFO [service1,d5b291496f21b36e,c362f06faa315849,true] 36161 --- [nio-8081-exec-1] tech.viacom.service1.SampleApp   : Service1 called, calling service2
service2.log:2017-01-19 15:12:02.689  INFO [service2,d5b291496f21b36e,9ef980eb860bbd28,true] 36162 --- [nio-8082-exec-2] tech.viacom.service2.SampleApp   : Service2 called, calling service3
service3.log:2017-01-19 15:12:02.779  INFO [service3,d5b291496f21b36e,10b8d941e19f0848,true] 36163 --- [nio-8083-exec-3] tech.viacom.service3.SampleApp   : Service3 called
service2.log:2017-01-19 15:12:02.864  INFO [service2,d5b291496f21b36e,9ef980eb860bbd28,true] 36162 --- [nio-8082-exec-2] tech.viacom.service2.SampleApp   : Service2 got response from service3
service1.log:2017-01-19 15:12:02.977  INFO [service1,d5b291496f21b36e,c362f06faa315849,true] 36161 --- [nio-8081-exec-1] tech.viacom.service1.SampleApp   : Service1 got response from service2

where d5b291496f21b36e is the Trace Id, c362f06faa315849 is the Span Id, and true states whether the log should be exported to Zipkin or not (this is customizable). When you use org.springframework.cloud:spring-cloud-starter-zipkin, traces are recorded by Sleuth and exported to your Zipkin server with io.zipkin.reporter:zipkin-reporter.
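
Since that prefix has a fixed shape, pulling the IDs out of a log line is straightforward. The sketch below is a hypothetical helper (SleuthLogPrefixParser is not part of Sleuth) that assumes the [application,traceId,spanId,exportable] layout seen in the sample logs above.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that reads the "[app,traceId,spanId,exportable]" log prefix. */
final class SleuthLogPrefixParser {

    // Application name, two hex IDs and a boolean flag, separated by commas.
    private static final Pattern PREFIX =
            Pattern.compile("\\[([^,\\]]*),([0-9a-f]+),([0-9a-f]+),(true|false)\\]");

    /** Plain value holder for the four fields of the prefix. */
    static final class TraceInfo {
        final String service;
        final String traceId;
        final String spanId;
        final boolean exportable;

        TraceInfo(String service, String traceId, String spanId, boolean exportable) {
            this.service = service;
            this.traceId = traceId;
            this.spanId = spanId;
            this.exportable = exportable;
        }
    }

    /** Returns the parsed prefix, or an empty Optional if the line has none. */
    static Optional<TraceInfo> parse(String logLine) {
        Matcher m = PREFIX.matcher(logLine);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new TraceInfo(
                m.group(1), m.group(2), m.group(3), Boolean.parseBoolean(m.group(4))));
    }

    public static void main(String[] args) {
        String line = "2017-01-19 15:12:02.545  INFO "
                + "[service1,d5b291496f21b36e,c362f06faa315849,true] 36161 --- "
                + "[nio-8081-exec-1] tech.viacom.service1.SampleApp   : "
                + "Service1 called, calling service2";
        // Prints: service1 trace=d5b291496f21b36e span=c362f06faa315849
        parse(line).ifPresent(info -> System.out.println(
                info.service + " trace=" + info.traceId + " span=" + info.spanId));
    }
}
```

Grouping parsed lines by traceId gives you a poor man's view of a whole flow across services, which is essentially what Zipkin does for you with a much nicer UI.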

Zipkin

Zipkin manages the collection and lookup of reported (exported) data. It runs as a self-contained server with a web GUI. You can install and run one locally like this:

wget -O zipkin.jar 'https://search.maven.org/remote_content?g=io.zipkin.java&a=zipkin-server&v=LATEST&c=exec'
java -jar zipkin.jar

or run the latest Docker image:

docker run -d -p 9411:9411 openzipkin/zipkin

Now, you only need to tell your Spring Boot app where to report traces. You can do it by setting the spring.zipkin.baseUrl property. Then, you will have a very basic setup in place. Zipkin runs on port 9411 by default. Some other defaults:

  • The share of traced data that is sampled and reported by spring-cloud-starter-zipkin is 10%. This setting can be overridden with the spring.sleuth.sampler.percentage property.
  • Zipkin keeps data in memory but can be configured to use MySQL, Cassandra or Elasticsearch for storage (more info here).
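
Put together, the two properties mentioned above could look like this in application.properties. This is just a sketch that assumes Zipkin is running locally on its default port; the 1.0 sampling value (report everything) is a convenient choice for local experiments, not a recommendation for production.

```properties
# Where Sleuth should report traces (assumes a local Zipkin on the default port)
spring.zipkin.baseUrl=http://localhost:9411
# Fraction of requests to sample, between 0.0 and 1.0; the default is 0.1 (10%)
spring.sleuth.sampler.percentage=1.0
```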

Examples below show how Zipkin visualises received data:

Want to know more?

You might be interested in this paper about Google Dapper, which describes the core concepts behind the solutions mentioned above, and this blog post, which describes Sleuth and Zipkin in more detail.

Happy tracing!

Barbara Śliwińska

Software Engineer @ Viacom