Distributed Monitoring 101: the “Four Golden Signals”

Monitoring as a lever for development team empowerment

Vincent Gilles
ForePaaS

--

At ForePaaS, we have been experimenting with DevOps for a while. We started with a single team and have recently begun spreading the practice to the entire company.

The reason we did that is pretty simple: while we are still small, our organization has grown. We originally had a single, small, all-rounder team. They laid the foundations of our product’s architecture, design and security. Their strength lay in their ability to tackle problems very quickly. As we grew, that team was divided into multiple specialized teams, each focused on a specific aspect of the product: front-end, back-end development, operations…

We realized that the methods that used to work for us might become less efficient, and that we needed a change. Our main goal was to keep our velocity without sacrificing quality — and vice versa!

Until then, DevOps had been the name given to the Ops team, which happened to also develop part of our backend. Once a week, the other developers would tell the DevOps team which new services had to be deployed to production. This process sometimes created friction. The DevOps team had little visibility over the developers’ processes, and developers weren’t encouraged to feel accountable for their services.

A big part of our work within the DevOps team lately has been to empower dev teams to take responsibility. Responsibility for their services’ availability, reliability and code quality. We first needed to reduce the anxiety that comes with such duties. A good way to do so is to start by giving developers enough visibility to diagnose a problem when it occurs. A good way to give visibility over a system’s state is to implement monitoring.

In this article, we will discuss what it is and what it is used for, and introduce the “Four Golden Signals” of monitoring. We will also see how to leverage metrics and drill-down techniques to investigate ongoing issues.

Sample Grafana dashboard using the “Four Golden Signals” to monitor a service.

What is monitoring?

Monitoring is the generation, collection, aggregation, and usage of metrics giving information on a system’s state.

To monitor a system, we need to get information about its components, both software and hardware. To get such information, we need to generate metrics, either by using dedicated software or by instrumenting our own code.

Instrumenting code means modifying it so that we can measure its performance. We add code that doesn’t change our product’s features; it only computes metrics and exposes them. Let’s say we want to measure request latency: we will add code that computes the time our service takes to serve each request it receives.
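
As a small illustration, instrumentation can be as simple as timing the handler and recording the measurement alongside the feature code. The handler and the in-memory store below are hypothetical, just to show the idea:

```python
import time

# Hypothetical in-memory store for the measurements; a real service would
# expose them to a monitoring system instead (more on that below).
request_latencies = []

def handle_request(request):
    start = time.perf_counter()
    response = do_the_actual_work(request)   # the feature code is unchanged
    request_latencies.append(time.perf_counter() - start)
    return response

def do_the_actual_work(request):
    time.sleep(0.05)                         # stand-in for real work
    return "OK"
```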

Once we have generated our metrics, we need to collect and aggregate them. A popular way of doing that is to use Metricbeat for collection and Logstash to index the metrics in Elasticsearch. We can then use them however we need to take advantage of this information. One would most likely complete this stack with Kibana as a way to visualize the data indexed in Elasticsearch.

Why monitor?

There are various reasons to monitor a system. At ForePaaS, we use it to get an immediate status of our system and its variations. This allows us to both build alerting systems and display information in dashboards. We use dashboards to help us identify outage causes whenever we get an alert. Others might use monitoring to compare two service versions or analyze long-term trends.

What should we monitor?

We came across a relevant chapter about monitoring distributed systems in the Site Reliability Engineering book. Based on Google’s approach, it describes the value of keeping an eye on the “Four Golden Signals”.

Front cover of Google’s Site Reliability Engineering book
Beyer, B., Jones, C., Murphy, N. & Petoff, J. (2016). Site Reliability Engineering: How Google Runs Production Systems. O’Reilly. Free online version: https://landing.google.com/sre/sre-book/toc/index.html
  • Latency, which is the time it takes to serve a request. We should separate success latency from error latency and watch them both: success latency is what matters most, but a slow error is even more frustrating than a fast one.
  • Traffic, which is a high-level measure of how much demand our service is receiving. In the case of an API, this would be the number of queries received every second. For a music streaming service, it would be the amount of data streamed per second.
  • Errors. Another important indicator is our service’s error rate. These errors can be either explicit (a 500 error code, for example) or implicit. An implicit error would be a response that succeeds but returns the wrong content or takes too long.
  • Saturation, the last of the “Golden Signals”, is about how “full” our service is. How much more load can it handle? Services are usually either CPU-constrained or memory-constrained; in either case, you should watch whichever parameter constrains yours. For a database system, you will want to watch available storage space.

How to monitor?

First, let’s talk about our technical stack. We tend to use popular tools over custom solutions, and only write custom code when we are not satisfied with the already available options. We deploy most of our services in Kubernetes environments. We instrument our code to expose metrics about each of our custom services; to expose these metrics so that Prometheus can scrape them, we use one of Prometheus’ client libraries. There are client libraries for almost every popular language, and the documentation provides all the information you need to write your own.
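
For instance, with the Python client library, instrumenting a service for the four signals could look roughly like the sketch below. The metric and label names are ours, not a standard, and the handler only simulates traffic:

```python
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

# One possible mapping of the Four Golden Signals onto Prometheus metric types.
REQUESTS = Counter("api_requests_total",
                   "Requests received", ["code"])                  # traffic + errors
LATENCY = Histogram("api_request_duration_seconds",
                    "Time spent serving a request", ["outcome"])   # latency
IN_FLIGHT = Gauge("api_in_flight_requests",
                  "Requests currently being served")               # saturation

def handle_request(request):
    IN_FLIGHT.inc()
    start = time.perf_counter()
    code, outcome = "500", "error"            # assume the worst until we succeed
    try:
        time.sleep(0.05)                      # stand-in for real work
        if random.random() < 0.9:             # 90% of calls succeed in this demo
            code, outcome = "200", "success"
        return code
    finally:
        REQUESTS.labels(code=code).inc()
        LATENCY.labels(outcome=outcome).observe(time.perf_counter() - start)
        IN_FLIGHT.dec()

if __name__ == "__main__":
    start_http_server(8000)      # metrics are exposed on :8000/metrics
    while True:
        handle_request(None)     # simulated traffic for Prometheus to scrape
```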

For third-party open-source services, we usually rely on the community’s exporters. Exporters are additional pieces of software that collect metrics from a service and format them for Prometheus. They are typically used with services that weren’t designed to expose metrics in the Prometheus format.

We send metrics through our pipeline and store them as time series in Prometheus. We also use Kubernetes’ kube-state-metrics to collect and send system metrics to Prometheus. We can then build our Grafana dashboards and alerting systems by querying Prometheus. We aren’t going to get into too many technical details in this article, but feel free to go check these tools out: they are pretty easy to get started with, and their documentation was very helpful.

For the rest of the article, we will consider a simple API. It receives traffic and relies on other services to serve each request.

Latency

Latency corresponds to the time it takes to serve a request. As stated a few lines above, we separate success and error latencies: we don’t want errors to skew our success latency, but we still want to measure error latency.

A common mistake is to average latency: while it can work, it’s not always a good idea. You should look at the latency distribution instead, as it fits availability requirements better. In fact, a common Service Level Indicator (SLI) is the proportion of requests served faster than a threshold. The following is an example of a Service Level Objective (SLO) for that SLI:

“Over a 24-hour period, serve 99% of requests faster than 1 second”

An easy way to measure that is to store latency metrics as histogram time-series. We put our metrics into buckets that our exporters collect every minute. This way we can calculate the n-quantiles for our services’ latencies.

For 0 < n < 1 and a histogram containing q values, the n-quantile of that histogram is the value ranked n*q among those q values. This means the 0.5-quantile (i.e. the median) of a histogram with x entries is the value for which half of the x values are smaller or equal.
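
In PromQL, that estimation typically relies on histogram_quantile over the bucket series. Here is a hedged sketch of such a query, wrapped in a small script that calls Prometheus’ HTTP API; the metric name comes from the earlier instrumentation sketch and the Prometheus address is an assumption:

```python
import requests

# Estimated 99th-percentile success latency over 5-minute windows.
P99_QUERY = (
    'histogram_quantile(0.99, '
    'sum(rate(api_request_duration_seconds_bucket{outcome="success"}[5m])) by (le))'
)

resp = requests.get(
    "http://localhost:9090/api/v1/query",   # assumed Prometheus address
    params={"query": P99_QUERY},
    timeout=5,
)
for sample in resp.json()["data"]["result"]:
    # Each result carries a [timestamp, value] pair; compare it to the 1 s SLO.
    print("p99 latency:", sample["value"][1], "seconds")
```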

Latency graph for the API

On the above graph, we can see that most of the time, our API serves 99% of requests in under 1 second. Yet we see spikes at around 2 seconds, which would not be acceptable under the SLO stated earlier.

Now, because we are using Prometheus, we must be very careful with the bucket sizes we pick. Prometheus allows for both linear and exponential bucket sizes. It doesn’t really matter which of the two you choose, as long as you factor estimation errors into your choice.

Prometheus doesn’t give the exact value for a quantile. It detects which bucket the quantile falls into, then uses linear interpolation to give an estimated value.
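
Because of that interpolation, it helps to place a bucket boundary exactly at the SLO threshold (1 second here), so that the proportion of requests under the threshold never depends on an estimate. A sketch with the Python client, where the boundaries themselves are just an example:

```python
from prometheus_client import Histogram

def linear_buckets(start, width, count):
    """Evenly spaced boundaries: start, start + width, ..."""
    return [start + i * width for i in range(count)]

def exponential_buckets(start, factor, count):
    """Geometrically growing boundaries: start, start * factor, ..."""
    return [start * factor ** i for i in range(count)]

# Exponential buckets from 125 ms to 8 s, with a boundary exactly at 1 s.
LATENCY = Histogram(
    "api_request_duration_seconds",
    "Time spent serving a request",
    ["outcome"],
    buckets=exponential_buckets(0.125, 2, 7),   # 0.125, 0.25, ..., 8.0
)
```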

Traffic

To measure traffic for the API, we want to count how many requests it receives every second. Now, because we only get new metrics every minute, we can’t get an exact value for a given second. We use Prometheus’ rate and irate functions to work with averaged numbers of queries per second instead.
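
The two functions behave differently, which is worth keeping in mind. A hedged example of the kind of queries we might graph, reusing the hypothetical api_requests_total counter from the earlier sketch:

```python
# Queries per second averaged over the last 5 minutes: smooth, well suited
# to dashboards and alerting.
QPS_RATE = "sum(rate(api_requests_total[5m]))"

# Instant rate computed from the last two samples in the range: reacts much
# faster to sudden changes, but is also noisier.
QPS_IRATE = "sum(irate(api_requests_total[5m]))"
```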

To display this information, we use the Grafana SingleStat panel. It lets us show the current average of queries per second and display its trend in the background.

Example of a Grafana SingleStat panel displaying the number of queries our API receives each second

What we are looking for here is the ability to spot sudden changes in the number of queries per second. This way, we immediately know there is a problem when our traffic is cut in half within a few minutes.

Errors

The explicit error rate is pretty easy to compute: we only need to divide the number of HTTP 500 responses served by the total number of requests served. As with traffic, we use an averaged value.

The only thing we are careful about is using the same averaging interval as for traffic. This makes it easier to get an idea of the error traffic without duplicating panels.

For example, let’s say we have a 10% error rate and 200 queries per second over the last five minutes. It is now easy to deduce that we’ve had an average of 20 errors per second over the last 5 minutes.
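
That reasoning maps directly to a query and a quick sanity check; the counter name is again the hypothetical one from the instrumentation sketch:

```python
# Explicit error rate: HTTP 500 responses divided by all responses,
# averaged over the same 5-minute window as the traffic panel.
ERROR_RATE = (
    'sum(rate(api_requests_total{code="500"}[5m]))'
    ' / sum(rate(api_requests_total[5m]))'
)

# The worked example above: a 10% error rate at 200 queries per second.
error_rate, qps = 0.10, 200
print(error_rate * qps, "errors per second on average")   # 20.0
```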

Saturation

To watch a service’s saturation, we need to determine its constraining parameters. For this API, we started by measuring both CPU and memory usage, as we initially didn’t know which of the two we needed to watch. Kubernetes internals and kube-state-metrics allow us to get such metrics for containers.

Graph showing CPU usage over time for our API service as it oscillates between 30% and 40%.
CPU usage over time for our API

Measuring saturation is also useful for predicting outages and for capacity planning. Take a database’s storage, for instance: we can measure both the available disk space and its fill rate, which gives us an idea of when we will need to take action.
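
Prometheus’ predict_linear function is convenient for exactly this kind of capacity question. A hedged example, assuming filesystem metrics exposed by node_exporter:

```python
# Fire if, extrapolating the last 6 hours of fill rate, the filesystem is
# predicted to run out of space within the next 4 hours.
DISK_FULL_SOON = "predict_linear(node_filesystem_avail_bytes[6h], 4 * 3600) < 0"
```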

Using drill-down dashboards to monitor distributed services

Let’s now consider another service: say, a distributed API that acts as a proxy to other services. The API now has multiple instances, possibly spread across many regions. It also has a few different endpoints, each relying on a different set of services. Soon enough, it becomes quite difficult to read graphs with dozens of series. We need to monitor the system as a whole, but still be able to detect individual failures.

Graph showing the CPU usage for 12 instances of our API service. It is not easy to read as too many series are displayed.
CPU usage over time for 12 instances of our API

To do so, we use drill-down dashboards. Each panel of a given dashboard gives a global view of the system, and clicking on it opens a more detailed view. For saturation, instead of CPU and RAM usage graphs, we use simple colored rectangles. Whenever an API’s resource usage exceeds a predefined threshold, its rectangle turns orange.

CPU and memory usage indicators for our API instances

This way we only need to click on that rectangle to access a more detailed view. In said view, we display even more colored squares, each of them representing an API instance.

CPU usage indicators for our six API instances

If there is a problem with only a single instance, we can access an even more detailed view by clicking on its square. In that last view, we get information like the instance’s region, the queries it receives, etc.

Detailed view of an API instance’s state. Left to right, top to bottom: provider region, instance hostname, last restart date, queries per second, CPU usage, memory usage, cumulative queries per second per path and cumulative error rates per path.

We proceeded the same way with the error rate: clicking on it shows the error rate for each API endpoint. This allows us to know whether an issue comes from the API itself or from one of the services it relies on.

We also decided to do the same for success and error latencies, yet there are a few things to keep in mind. The main goal here is to ensure our service is doing ok on a global scale. The problem is that the API has many different endpoints, each of them relying on multiple other services. Take any two different endpoints: their latencies will be different. Traffic on these endpoints will be different as well.

Setting individual SLOs (and SLAs) on each of a service’s endpoints can be a little challenging. Some endpoints might have a much higher nominal latency than the others; in that case, there might be some refactoring to do. Whenever individual SLOs are required, it may be worth dividing our service into even smaller services. In fact, it might show that our service’s scope was too broad to begin with.

We decided our best option was to monitor our overall latency. The ability to drill down then lets us investigate an issue whenever the latency variation is big enough to pique our curiosity.

On a closing note

We have been using these methods to monitor our services for a while now, and have seen drastic improvements in our mean time to detect an issue and our mean time to recovery (MTTR). The ability to drill down to find the actual problem when we detect an issue on a global scale is a real game-changer.

Other dev teams have started implementing these methods, and it has been nothing but beneficial. Not only does it allow teams to take operational responsibility for their services; it also helps them take even more ownership of them. They can now visualize the impact their code changes have on a service’s behavior.

Using the Four Golden Signals doesn’t cover 100% of issues, yet it helps us a lot with the most recurring ones. With very little effort, we were able to significantly improve our monitoring coverage and reduce our MTTR. Add as many metrics as you deem necessary, but you can’t go wrong by at least using the “Four Golden Signals”.

Like what you read? Get in touch! Also, we’re hiring. 🚀
