
The RED Method: A New Approach to Monitoring Microservices

Feb 14th, 2018 12:21pm

Monitoring microservices effectively can still be a challenge, as many traditional performance monitoring techniques are ill-suited to providing the required granularity of system performance data. Now a former Google and Weave engineer has developed an approach, called the RED Method, that seems to be gaining favor with administrators.

RED “encourages you to come to some sort of consistency of monitoring,” explained Tom Wilkie, the originator of RED, and a founder of the new microservices monitoring company Kausal. Wilkie spoke at InfluxData‘s Influx Days user event held Tuesday in New York.

The most immediate benefit of instrumenting microservices along the channels described by RED is that it gives engineers who may not be familiar with a badly performing microservice a standard set of tools to diagnose and correct an issue. RED offers a “consistency across services [that] really helps reduce the cognitive load of your on-call people. It helps them be on call for more services, for services they didn’t write.”

Wilkie used this approach when he was a site reliability engineer (SRE) supporting Google Analytics.

“I didn’t write any of the Google Analytics services, but I was still able to be on call for them because for me, they were just black boxes. When something went wrong, I just had to traverse my little graph, figure out which one was throwing the errors, and then go and look at the logs, file a bug with developers, restart it, whatever,” he said.

RED came about because Wilkie was frustrated with the popular USE methodology of performance measurement. Created by Brendan Gregg, USE buckets system performance metrics into these groups:

  • Utilization (U): The percentage of time a resource is in use.
  • Saturation (S): The amount of work the resource has waiting to be done (the “queue” of work).
  • Errors (E): A count of errors.

System resources being measured can be CPUs, memory, I/O channels, and the like.
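
To make the USE categories concrete, here is a minimal sketch of how a service might expose USE-style metrics with the Prometheus Go client library. The worker pool, metric names, and port are hypothetical, chosen purely for illustration; they are not part of Gregg’s or Wilkie’s tooling.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical USE-style metrics for a worker-pool resource.
var (
	// Utilization: fraction of workers currently busy (0.0 to 1.0).
	poolUtilization = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "worker_pool_utilization_ratio",
		Help: "Fraction of workers currently busy.",
	})
	// Saturation: work queued up that the pool cannot yet service.
	poolSaturation = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "worker_pool_queue_length",
		Help: "Number of tasks waiting in the queue.",
	})
	// Errors: running count of failed tasks.
	poolErrors = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "worker_pool_errors_total",
		Help: "Total number of tasks that failed.",
	})
)

func main() {
	prometheus.MustRegister(poolUtilization, poolSaturation, poolErrors)

	// The (hypothetical) pool implementation would update these as it runs:
	//   poolUtilization.Set(float64(busyWorkers) / float64(totalWorkers))
	//   poolSaturation.Set(float64(queueLength))
	//   poolErrors.Inc()

	// Expose the metrics for Prometheus to scrape.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```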

“The nice thing about this kind of pattern is that it turns the guesswork of figuring out why things are slow into much more of a methodological approach,” Wilkie said. With the USE method, Kausal created a set of Grafana dashboards for monitoring Kubernetes infrastructure, with Prometheus as the backend.

The USE approach, however, has its limitations, Wilkie noted. For instance, it is difficult to measure the saturation of memory, or the amount of memory in use. Error counts can also be problematic, especially for I/O and memory bandwidth. “Linux, it turns out, is really bad at exposing error counts,” Wilkie said. USE is also more infrastructure-focused, whereas RED concentrates on end-user satisfaction.

As an alternative, Wilkie developed another easy-to-remember acronym, RED, when he was working at Weave. RED is based around requests, characterizing microservice performance as follows:

  • Rate (R): The number of requests per second.
  • Errors (E): The number of failed requests.
  • Duration (D): The amount of time to process a request.
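
To show what that looks like in practice, below is a minimal, hypothetical sketch of RED-style instrumentation using the Prometheus Go client: a small HTTP middleware counts requests by status code (covering both rate and errors) and records each request’s duration in a histogram. The handler, metric names, and port are assumptions for illustration, not Wilkie’s actual library.

```go
package main

import (
	"log"
	"net/http"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Rate and Errors: total requests, labeled by HTTP status code.
	requestsTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "http_requests_total",
		Help: "Total number of HTTP requests, by status code.",
	}, []string{"code"})

	// Duration: time spent handling each request, in seconds.
	requestDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name: "http_request_duration_seconds",
		Help: "Time taken to process each request.",
	})
)

// statusRecorder captures the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	code int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.code = code
	r.ResponseWriter.WriteHeader(code)
}

// instrument wraps a handler with RED-style instrumentation.
func instrument(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		start := time.Now()
		rec := &statusRecorder{ResponseWriter: w, code: http.StatusOK}
		next.ServeHTTP(rec, req)
		requestsTotal.WithLabelValues(strconv.Itoa(rec.code)).Inc()
		requestDuration.Observe(time.Since(start).Seconds())
	})
}

func main() {
	prometheus.MustRegister(requestsTotal, requestDuration)

	hello := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("hello"))
	})

	http.Handle("/", instrument(hello))
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

From those two series, the per-service request rate, error ratio, and latency distribution can all be derived with standard Prometheus queries.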

“The thing I like about RED is that it is microservice-focused, as opposed to the USE method, which is more about the infrastructure,” said Paul Dix, founder and CEO of InfluxData. Influx invited Wilkie to speak at the event, given that RED was a popular topic of conversation last year at such microservices-friendly conferences as Monitorama and KubeCon.

Wilkie said that RED is actually derived from another, lesser-known set of performance metrics that he learned as a site reliability engineer at Google, called the Four Golden Signals:

  • Latency: The time it takes to service a request.
  • Traffic: A measure of how much demand is being placed on the system.
  • Errors: The rate of failed requests.
  • Saturation: A measure of how “full” a service is, often measured by latency.

As with USE, Wilkie implemented the RED method as a client library for Prometheus. The open source InfluxDB time-series database, for instance, supports the Prometheus monitoring tool‘s remote read and write API. Prometheus can be used as a data collector, piping results into the database, and it can query data out of InfluxDB as well.
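
As a rough sketch of that integration, a Prometheus configuration pointing at a local InfluxDB 1.x instance might look something like the fragment below; the host, port, and database name are assumptions for illustration.

```yaml
# Hypothetical prometheus.yml fragment: write scraped samples into
# InfluxDB and allow Prometheus to read them back. Host, port, and
# database name ("prometheus") are illustrative assumptions.
remote_write:
  - url: "http://localhost:8086/api/v1/prom/write?db=prometheus"

remote_read:
  - url: "http://localhost:8086/api/v1/prom/read?db=prometheus"
```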
