PgBouncer on Kubernetes and how to achieve minimal latency

Experiments with connection poolers on Kubernetes for Postgres Operator

Dmitrii Dolgov

Software Engineer

Posted on Jun 24, 2020

Introduction

In the new Postgres Operator release 1.5 we have implemented a couple of interesting new features, including connection pooling support. Master Wq says there is "no greatest tool": to run something successfully in production, one needs to understand its pros and cons. Let's dig into the topic and take a look at the performance aspect of connection pooler support, mostly from a scaling perspective.

But first, a short introduction. Why do we so often need a connection pooler for PostgreSQL (and in fact for many other databases too)? There are several performance implications of having too many open connections to a database. They result from how a connection is served (PostgreSQL uses a "process per user" client/server model, in which too many connections mean too many processes fighting for resources and drowning in context switches and CPU migrations) and from how certain aspects of transaction handling are implemented (e.g. GetSnapshotData has O(connections) complexity). Having said that, there are three places where a connection pooler can be implemented:

  • on the database side, like proposed in this patch
  • as a separate component between the database and the application
  • on the application side

For Postgres Operator we have chosen the second approach. Although there are pros and cons to all of these options, either of the others would obviously require a lot of effort (an application-side connection pooler is not something under the operator's control, and a built-in connection pooler for PostgreSQL is a major feature that has yet to be developed). Another interesting choice to make is which connection pooling solution to use. At the moment there are a couple of options available for PostgreSQL (listed in no particular order):

PgBouncer is probably the most popular and the oldest solution. Pgpool-II can actually do much more than just connection pooling (e.g. load balancing), which makes it a bit more heavyweight than the others. Odyssey and pgagroal are much newer and try to be more performance-optimized and scalable than the alternatives.

Eventually we went for PgBouncer, but the current implementation allows us to switch to other solutions as long as they conform to a basic common standard. Now let's take a look at how PgBouncer performs in tests.
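
For reference, enabling the pooler for a cluster managed by the operator boils down to flipping a flag on the postgresql manifest. A minimal sketch, assuming the cluster is named acid-minimal-cluster as in the operator's example manifests:

# enable the connection pooler for an existing cluster; the operator then
# creates the pooler deployment and a corresponding service
$ kubectl patch postgresql acid-minimal-cluster --type merge \
    -p '{"spec": {"enableConnectionPooler": true}}'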

Setup

In fact, we did a significant amount of benchmarking with PgBouncer for different workloads on our Kubernetes clusters and learned a few interesting details. For example, I didn't know that a Kubernetes Service can distribute the workload in a not exactly uniform way, so that one can see something like this (output of kubectl top pods), where the third pod is only half utilized and in fact receives half as many queries as the others:

NAME                         CPU(cores)   MEMORY(bytes)
pool-test-7d8bfbc47f-6bbhr   977m         5Mi
pool-test-7d8bfbc47f-8jtnp   995m         6Mi
pool-test-7d8bfbc47f-ghvpn   585m         6Mi
pool-test-7d8bfbc47f-s945p   993m         6Mi

This can happen if kube-proxy works in iptables mode and selects a pod based on per-rule probabilities instead of strict round-robin.
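
For the curious, this is roughly what such rules look like in the nat table. The listing below is purely illustrative (the chain names are made up, and the probabilities correspond to a hypothetical Service with four endpoints): each endpoint is matched with a fixed probability rather than strictly in turn, so over a finite benchmark the split can end up visibly uneven.

# illustrative sketch: each KUBE-SEP-* chain corresponds to one pod, the
# last rule catches whatever the probabilistic rules above did not match
$ iptables -t nat -L KUBE-SVC-EXAMPLE -n
Chain KUBE-SVC-EXAMPLE (1 references)
target         prot opt source     destination
KUBE-SEP-AAAA  all  --  0.0.0.0/0  0.0.0.0/0   statistic mode random probability 0.25000000000
KUBE-SEP-BBBB  all  --  0.0.0.0/0  0.0.0.0/0   statistic mode random probability 0.33333333349
KUBE-SEP-CCCC  all  --  0.0.0.0/0  0.0.0.0/0   statistic mode random probability 0.50000000000
KUBE-SEP-DDDD  all  --  0.0.0.0/0  0.0.0.0/0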

But in this article I want to offer one example, produced in the more artificial environment of my laptop. That's mostly because there we can collect metrics that are interesting for this particular case but do not make sense to collect for all workloads. My original idea was to play around with CPU management policies and exclusive CPUs, to show what happens if PgBouncer runs with a fixed cpuset. But interestingly enough, another effect introduced an even bigger difference, so the following experiment will be more about scaling PgBouncer instances.

To simulate the networking part of our experiment, let's set up a separate network namespace, in which we will run PostgreSQL and PgBouncer, and connect it to the root namespace via a veth link.

# setup veth link with veth0/veth1 at the ends
$ ip link add veth0 type veth peer name veth1

# check that they're present
$ ip link show type veth

# add a new network namespace
$ ip netns add db

# move one end into the new namespace
$ ip link set veth1 netns db

# check that now only veth0 is visible
$ ip link show type veth

# check that veth1 is visible from the other namespace
$ ip netns exec db ip link show type veth

# add corresponding addresses and bring everything up
$ ip addr add 10.0.0.10/24 dev veth0
$ ip netns exec db ip addr add 10.0.0.1/24 dev veth1
$ ip link set veth0 up
$ ip netns exec db ip link set veth1 up
$ ip netns exec db ip link set lo up
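
Before adding any artificial delay it's worth confirming that the link actually works; a ping from the root namespace to the far end should succeed with sub-millisecond round trips.

# quick connectivity check across the veth pair
$ ping -c 3 10.0.0.1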

This link is going to be blazingly fast, so let's add a small delay to the veth interface, corresponding to the empirical network latency we observe in our Kubernetes clusters. The distribution parameter here is mostly to emphasize its presence, since it's normal by default anyway.

$ tc qdisc add dev veth0 root netem delay 1ms 0.1ms distribution normal
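
With the network in place, PostgreSQL and PgBouncer can be started inside the new namespace. This is only a sketch: the data directory, the config paths and the listen addresses are assumptions; the relevant part is the ip netns exec prefix and listening on 10.0.0.1.

# start PostgreSQL inside the db namespace, listening on the veth address
# (an already initialized data directory is assumed)
$ sudo ip netns exec db sudo -u postgres \
    pg_ctl -D /var/lib/postgresql/data -o "-h 10.0.0.1" start

# start PgBouncer in the same namespace as a daemon; listen_addr in
# pgbouncer.ini is assumed to point to 10.0.0.1 as well
$ sudo ip netns exec db sudo -u postgres pgbouncer -d /etc/pgbouncer/pgbouncer.ini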

In our experiment we will run a pgbench test with the query ;, which is the smallest SQL query one can come up with. The idea is to not load the database itself too much and to see how a PgBouncer instance handles many connections, in this case 1000 connections dispatched via 8 threads. A word of warning: use pgbench carefully, since in some cases it can itself become the bottleneck and produce confusing results. In our case we try to limit this risk by pinning all the components to separate cores, collecting performance counters to see where the time is spent, and staying alert to strange results. For a more diverse workload and a more holistic approach you can use oltpbench or benchmarksql.
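
The pgbench invocation then looks roughly like this (a sketch: the PgBouncer port, the database name and the test duration are assumptions):

# a custom script containing only the smallest possible query
$ echo ';' > minimal.sql

# 1000 client connections over 8 threads, going through PgBouncer on the
# veth address; -n skips vacuuming, -l writes a per-transaction log
$ pgbench -h 10.0.0.1 -p 6432 -U postgres -n -f minimal.sql \
    -c 1000 -j 8 -T 300 -l postgres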

The result will be a per-transaction execution log. Every component, namely:

  • PostgreSQL instance
  • Two PgBouncer instances
  • PgBench workload generator

is bound to a single CPU core, with Intel turbo disabled and the CPU scaling governor for all cores set to performance. The two PgBouncer instances run with the so_reuseport option, which is essentially the way to get PgBouncer to use more than one CPU core. The only degree of freedom we will investigate is their placement across cores: whether they each get a real separate core, or just a separate hyperthread.
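
Concretely, the frequency settings and the pinning were along these lines (a sketch: core numbers are illustrative and the sysfs path assumes the intel_pstate driver):

# disable turbo and switch all cores to the performance governor
$ echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
$ sudo cpupower frequency-set -g performance

# each component is started under taskset with its own core, e.g. the two
# PgBouncer instances on cores 2 and 3; both share one listen port thanks
# to so_reuseport = 1 in their pgbouncer.ini
$ sudo ip netns exec db taskset -c 2 sudo -u postgres pgbouncer -d bouncer1.ini
$ sudo ip netns exec db taskset -c 3 sudo -u postgres pgbouncer -d bouncer2.ini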

Benchmark

Here are the benchmark results, presenting rolling mean, 99th percentile latency and standard deviation values, executed on a rather modest setup with 2 physical cores, each with 2 hyperthreads, for three cases:

  • Only one instance of PgBouncer on an isolated real core
  • Two PgBouncers on isolated hyperthreads, but on the same physical core
  • Two PgBouncers on isolated cores (with potential noise from other components on the sibling hyperthread)

Hyper-Threading means that the two components still compete for CPU time, but share some execution resources and caches. Usually this results in more variance in latency, which we will keep in mind.
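
Which logical CPUs are hyperthread siblings of the same physical core can be checked up front, so the three placements are not guesswork:

# show the mapping of logical CPUs to physical cores
$ lscpu --extended=CPU,CORE

# or ask sysfs directly which logical CPUs share a core with CPU 2
$ cat /sys/devices/system/cpu/cpu2/topology/thread_siblings_list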

[Figure: benchmark results for the three placements (separate_cores); raw data available as tar.gz]

One nice property we can immediately see is that the results are relatively stable, which is good. Another interesting observation is that although we only changed the core placement of the components, we see a significant difference in latency. A single PgBouncer instance gives the lowest latency, while two PgBouncers on the same physical core show almost twice as much (with only a minimal increase in throughput). With two PgBouncers on different physical cores, even with potential competition for resources with another component (one with a different resource consumption pattern), the latency is somewhere in between (with the best throughput of the three). Why is that?

In the course of the investigation more and more puzzling measurements were collected, with perf sampling showing no significant difference either in PostgreSQL activity or between the two PgBouncer instances. Let's take a closer look at what PgBouncer is actually doing:

[Figure: perf profile showing where PgBouncer spends its time]

As expected, it spends a lot of its time doing networking. The kernel documentation says:

For interrupt handling, HT has shown no benefit in initial tests, so limit the number of queues to the number of CPU cores in the system.

This can be our working assumption. Network interrupts are probably not scaled very well between hyperthreads, so one needs to use a real core to scale them out. To get a bit more evidence, let's take a look at interrupt latencies in both cases, different cores and different hyperthreads. For that we can use the irq:softirq_entry and irq:softirq_exit tracepoints and a script from Brendan Gregg:

# one PgBouncer instance is running on a CPU2 with no other PgBouncer on the
# same physical core. We're interested only in NET_RX,NET_TX vectors.

$ perf record -e irq:softirq_entry,irq:softirq_exit \
    -a -C 2 --filter 'vec == 2 || vec == 3'
$ perf script | awk '{ gsub(/:/, "") } $5 ~ /entry/ { ts[$6, $10] = $4 }
    $5 ~ /exit/ { if (l = ts[$6, $9]) { printf "%.f %.f\n", $4 * 1000000,
    ($4 - l) * 1000000; ts[$6, $10] = 0 } }' > latencies.out

And the same for the other case, when one PgBouncer sits together with the other on the same physical core. Here is the 99th percentile of the resulting latencies:

[Figure: 99th percentile of NET_RX/NET_TX softirq latencies for both placements]

This indeed points in the direction of network interrupt handling being a bit slower when both PgBouncers share the same physical CPU. In practice it means that after adding more pods to a connection pooler deployment we can get surprising performance results, depending on where those new pods land: on an isolated CPU, or on a CPU whose sibling hyperthread is already busy. In view of these results it could be beneficial to configure the CPU manager in the cluster, so that this does not become an issue.
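
A minimal sketch of what that configuration amounts to (the exact wiring depends on how the kubelet is deployed in a given cluster): with the static CPU manager policy, containers in Guaranteed pods that request whole CPUs get exclusive CPUs, so a pooler pod does not silently end up on a busy hyperthread sibling.

# kubelet started with the static CPU manager policy; the policy requires
# some CPU to be reserved for system and Kubernetes daemons (remaining
# kubelet flags omitted)
$ kubelet --cpu-manager-policy=static --kube-reserved=cpu=500m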

Conclusion

Having said all of the above, I must admit it's just the tip of the iceberg. If there are already interesting complications in how to run a connection pooler within a single node, you can imagine what happens at a higher architectural level. We've spent a lot of time discussing different design possibilities for Postgres Operator, e.g. whether it should be a single "big" PgBouncer instance (with many processes reusing the same port) with an affinity to keep it close to the database, or multiple "small" instances equidistant from the database. Every design has its own trade-offs regarding network round trips and availability, but since we value simplicity (especially in view of such a complicated topic) we went for a rather straightforward approach relying on standard Kubernetes functionality:

  • Postgres Operator creates a single connection pooler deployment and exposes it via a new service.

  • Connection pooler pods are distributed between availability zones.

  • Due to the nature of connection pooling, the pods do CPU-intensive work with a minimal amount of memory (less than a hundred megabytes in a simple case), so it makes sense to create as many of them as needed to prevent resource saturation. Those pods can be scattered across multiple nodes and availability zones, which means latency variability.

  • For those cases when this variability cannot be tolerated, we would consider manually creating a single "big" pooler instance with an affinity that places it on the same node as the database, and adjusting the CPU manager to squeeze everything we can out of this setup. This instance would be the primary one to connect to, with another one providing HA.

This simplicity should not be confused with ignorance: it's based on an understanding of the proposed solution's limitations and of what can be adjusted beyond them. As in my other blog posts and talks, I would like to emphasize the importance of the described methodology: even if you have a system as complicated as Kubernetes in your hands, it's important to understand what happens underneath!


