Approximate Counting with Morris's Algorithm

Robert Morris's algorithm for counting large numbers using 8-bit registers is an early example of a sketch or data structure for efficiently processing a data stream. I introduce the algorithm and analyze its probabilistic behavior.

An approximate counting algorithm is a procedure for counting a large number of events using a small amount of memory. Approximate counting was invented by Robert Morris while he was working at Bell Labs (Morris, 1978). His motivation was that he wanted to count a large number of events in an 8-bit register. In general, an $n$-bit register can only count up to $2^n - 1$ events, and so an 8-bit register can only count 255 events. Since Morris was only interested in approximate counts, he developed a probabilistic algorithm with a quantifiable margin of error. This algorithm is an early example of a sketching algorithm, an algorithm that compresses a streaming data set for efficient querying.

Before we work through Morris’s algorithm, let’s follow his reasoning in the paper. A simple approach would be to count each event with probability $p$. Let $X_n$ be a random variable for the approximate count value after $n$ events; this is the value we store in our counter, not the real number of events. Then clearly

$$
\begin{aligned}
X_n &\sim \text{Binomial}(n, p)
\\
\mu &= \mathbb{E}[X_n] = np
\\
\sigma^2 &= \mathbb{V}[X_n] = np(1-p).
\end{aligned}
$$

Morris’s example is that for $p = 1/2$, after $400$ events, $\mu = 200$ and $\sigma = 10$, and our estimate of the true number of events is $2 X_n$. So 95% of the time, our estimate is within $2(2\sigma) = 40$ of the actual event count, or 10% error.
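To make this concrete, here is a small simulation sketch of the naive scheme; it is not from Morris’s paper, and the use of NumPy and the variable names are my own. It counts each of $n = 400$ events with probability $p = 1/2$ many times and checks the mean, standard deviation, and the 95% claim empirically.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, trials = 400, 0.5, 100_000

# Each trial simulates one run of the naive counter: each of the n events
# is counted independently with probability p, so the stored count is Binomial(n, p).
counts = rng.binomial(n, p, size=trials)
estimates = counts / p  # estimate of the true number of events (here 2 * X_n)

print(counts.mean(), counts.std())           # approximately 200 and 10
print(np.mean(np.abs(estimates - n) <= 40))  # approximately 0.95
```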

The problem with this approach is that it can have high relative error for small counts, where

$$
\text{Relative error} = \Big|1 - \frac{\mu_{\text{approx.}}}{\mu}\Big|.
$$

For example, with $p = 1/2$, the probability that none of the first three events is counted, so that our estimate is zero and the relative error is 100%, is $(1/2)^3 = 1/8$. The goal, then, is to develop an approximate counting algorithm whose relative error is nearly independent of $n$. This leads to Morris’s algorithm.

Morris’s algorithm

Imagine that the value $X_n$ stored in a counter is

$$
X_n = \log_2(1 + n)
$$

where $n$ is the number of events that have occurred so far. The plus one just ensures that $n = 0 \implies X_n = 0$. At any given time, our best estimate of $n$ is

$$
\hat{\theta}_n = 2^{X_n} - 1.
$$

To increment the counter, we compute

$$
X_{n+1} = \log_2(1 + 2^{X_n}).
$$

That is, our new estimate is $\hat{\theta}_{n+1} = \hat{\theta}_n + 1 = 2^{X_n}$. We offset this estimate by one and take the log. However, this value isn’t an integer in general. If we round the value, we might accumulate errors in our approximation. Instead, we can do the following. First, initialize $X_0 = 0$. Then, to process event $n + 1$, compute

$$
p = \frac{1}{2^{X_n}}, \qquad r \sim \text{Uniform}(0, 1).
$$

Then the value of $X_{n+1}$ is

$$
X_{n+1} =
\begin{cases}
X_n + 1 & \text{if $r < p$}
\\
X_n & \text{else}.
\end{cases}
$$
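In code, one update step of this rule might look like the following minimal sketch (the function name and the use of Python's `random` module are my own, not from Morris's paper):

```python
import random

def increment(x: int) -> int:
    """One Morris update: increment the register x with probability 1 / 2**x."""
    p = 1.0 / (2 ** x)
    r = random.random()  # r ~ Uniform(0, 1)
    return x + 1 if r < p else x
```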

Why is this a reasonable estimate? Because the expected value of $2^{X_n} - 1$ is $n$. We can prove this by induction, following Jelani Nelson’s lecture notes on sketching algorithms. The base case is $n = 0$: clearly $\mathbb{E}[2^{X_0} - 1] = 0$ by our initialization. Now assume that

$$
\mathbb{E}[\hat{\theta}_n] = n
$$

by our inductive hypothesis. Then

$$
\begin{aligned}
\mathbb{E}[\hat{\theta}_{n+1}]
&\triangleq \mathbb{E}[2^{X_{n+1}} - 1]
\\
&= \sum_{j=0}^{\infty} \mathbb{P}(X_n = j)\, \mathbb{E}[2^{X_{n+1}} \mid X_n = j] - 1
\\
&= \sum_{j=0}^{\infty} \mathbb{P}(X_n = j) \Bigg[ \underbrace{\frac{1}{2^j} \cdot 2^{j+1}}_{\text{increment}} + \underbrace{\Big(1 - \frac{1}{2^j}\Big) \cdot 2^j}_{\text{no change}} \Bigg] - 1
\\
&= \sum_{j=0}^{\infty} \mathbb{P}(X_n = j) \big(2^j + 1\big) - 1
\\
&= \sum_{j=0}^{\infty} \mathbb{P}(X_n = j)\, 2^j + \sum_{j=0}^{\infty} \mathbb{P}(X_n = j) - 1
\\
&= \mathbb{E}[2^{X_n}]
\\
&= n + 1,
\end{aligned}
$$

where the last equality uses the inductive hypothesis, since $\mathbb{E}[2^{X_n}] = \mathbb{E}[\hat{\theta}_n] + 1 = n + 1$.
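As a quick sanity check on small cases: since $X_0 = 0$, the first event is counted with probability $1/2^0 = 1$, so $X_1 = 1$ and $\hat{\theta}_1 = 2^1 - 1 = 1$. The second event is then counted with probability $1/2$, so

$$
\mathbb{E}[\hat{\theta}_2] = \tfrac{1}{2}\,(2^2 - 1) + \tfrac{1}{2}\,(2^1 - 1) = \tfrac{3}{2} + \tfrac{1}{2} = 2,
$$

as expected.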

This line of thinking led Morris to his algorithm:

  1. Initialize $X = 0$.
  2. For each event, increment $X$ with probability $1 / 2^X$.
  3. Output the estimate $2^X - 1$.
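Putting the steps together, here is a minimal end-to-end sketch of the counter (the class and method names are my own, not Morris’s):

```python
import random

class MorrisCounter:
    """Approximate counter that stores only the small register value X."""

    def __init__(self) -> None:
        self.x = 0  # step 1: initialize X = 0

    def observe(self) -> None:
        """Process one event: increment X with probability 1 / 2^X (step 2)."""
        if random.random() < 1.0 / (2 ** self.x):
            self.x += 1

    def estimate(self) -> int:
        """Return the approximate count 2^X - 1 (step 3)."""
        return 2 ** self.x - 1

# Example usage: count 1024 events.
counter = MorrisCounter()
for _ in range(1024):
    counter.observe()
print(counter.x, counter.estimate())  # X is typically around 10; the estimate is around 1024
```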

Deeper analysis and tighter bounds were later provided by Flajolet (1985). Let’s examine a few of Philippe Flajolet’s results in detail.

Analysis

Bounded variance

Morris’s algorithm has several nice properties. We just saw that $\hat{\theta}_n$ is an unbiased estimator of $n$. Furthermore, we claim

$$
\mathbb{V}[\hat{\theta}_n] \leq \frac{3n(n+1)}{2} + 1.
$$

To prove this, following Alexandr Andoni’s lecture notes, consider the following derivation:

$$
\begin{aligned}
\mathbb{V}[\hat{\theta}_n]
&= \mathbb{E}[(2^{X_n} - 1)^2] - \mathbb{E}[2^{X_n} - 1]^2
\\
&= \mathbb{E}[(2^{X_n} - 1)^2] - n^2
\\
&= \mathbb{E}[2^{2X_n} + 1 - 2(2^{X_n})] - n^2
\\
&= \mathbb{E}[2^{2X_n}] + \underbrace{1 - 2\,\mathbb{E}[2^{X_n}] - n^2}_{R}.
\end{aligned}
$$

Note that, since $\mathbb{E}[2^{X_n}] = n + 1$, the term labeled $R$ is equal to

$$
R = 1 - 2(n + 1) - n^2 = -(n^2 + 2n + 1) = -(n + 1)^2.
$$

Since $n \geq 0$, we have $R < 0$ and

$$
\mathbb{V}[\hat{\theta}_n] \leq \mathbb{E}[2^{2X_n}].
$$

Thus, to bound the variance of our estimator of $n$, we just need to compute $\mathbb{E}[2^{2X_n}]$. First, note that we can express the probability that $X_n$ takes value $i$ as

$$
\mathbb{P}(X_n = i) = \Big(\frac{1}{2^{i-1}}\Big) \mathbb{P}(X_{n-1} = i - 1) + \Big(1 - \frac{1}{2^i}\Big) \mathbb{P}(X_{n-1} = i).
$$

In words, this is just the probability that $X_{n-1}$ was $i - 1$ and was incremented, plus the probability that $X_{n-1}$ was $i$ but was not incremented. Then we can write the expectation above as

$$
\begin{aligned}
\mathbb{E}[2^{2X_n}]
&= \sum_{i=1}^{\infty} 2^{2i}\, \mathbb{P}(X_n = i)
\\
&= \sum_{i=1}^{\infty} 2^{2i} \Bigg[ \Big(\frac{1}{2^{i-1}}\Big) \mathbb{P}(X_{n-1} = i - 1) + \Big(1 - \frac{1}{2^i}\Big) \mathbb{P}(X_{n-1} = i) \Bigg]
\\
&= \sum_{i=1}^{\infty} 2^{i+1}\, \mathbb{P}(X_{n-1} = i - 1) + \sum_{i=1}^{\infty} 2^{2i}\, \mathbb{P}(X_{n-1} = i) - \sum_{i=1}^{\infty} 2^{i}\, \mathbb{P}(X_{n-1} = i)
\\
&= 4\, \mathbb{E}[2^{X_{n-1}}] + \mathbb{E}[2^{2X_{n-1}}] - \mathbb{E}[2^{X_{n-1}}]
\\
&= 3\, \mathbb{E}[2^{X_{n-1}}] + \mathbb{E}[2^{2X_{n-1}}]
\\
&= 3n + \mathbb{E}[2^{2X_{n-1}}].
\end{aligned}
$$

Here the first sum in the third line was re-indexed to give $4\,\mathbb{E}[2^{X_{n-1}}]$, and the last step uses the unbiasedness result $\mathbb{E}[2^{X_{n-1}}] = n$.

Note that this is a recurrence, which we can unroll:

$$
\begin{aligned}
\mathbb{E}[2^{2X_0}] &= 1
\\
\mathbb{E}[2^{2X_1}] &= 3 + \mathbb{E}[2^{2X_0}]
\\
\mathbb{E}[2^{2X_2}] &= 6 + \mathbb{E}[2^{2X_1}] = 6 + 3 + \mathbb{E}[2^{2X_0}]
\\
\mathbb{E}[2^{2X_3}] &= 9 + \mathbb{E}[2^{2X_2}] = 9 + 6 + 3 + \mathbb{E}[2^{2X_0}]
\\
&\;\;\vdots
\end{aligned}
$$

We can write this generally as

$$
\mathbb{E}[2^{2X_n}] = \Big( \sum_{i=1}^n 3i \Big) + 1.
$$

The sum is just a finite arithmetic series, and it’s easy to check that $\sum_{i=1}^n 3i = \frac{3}{2} n(n+1)$. Thus, we have proven our bound,

$$
\mathbb{V}[\hat{\theta}_n] \leq \frac{3n(n+1)}{2} + 1.
$$
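As a rough empirical check of the formula for $\mathbb{E}[2^{2X_n}]$, the following Monte Carlo sketch (pure Python; function and variable names are my own) estimates that second moment for a small $n$ and compares it to $\frac{3}{2}n(n+1) + 1$:

```python
import random

def run_counter(n: int) -> int:
    """Return X_n after processing n events with Morris's update rule."""
    x = 0
    for _ in range(n):
        if random.random() < 1.0 / (2 ** x):
            x += 1
    return x

n, trials = 100, 100_000
second_moment = sum(2 ** (2 * run_counter(n)) for _ in range(trials)) / trials
print(second_moment)            # empirically close to the value below
print(1.5 * n * (n + 1) + 1)    # 15151.0
```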

While this is not the tightest possible bound on the variance (the variance of our estimator increases as a function of $n$), we can use it to bound the probability that $2^{X_n} - 1$ deviates far from its expected value (Figure 1). I believe this is what Morris meant when he said that this algorithm bounds relative error.

Figure 1. Morris's algorithm on $n = 1024$ data points. The dotted line is the true count, while the black line is Morris's estimate $2^{X_n} - 1$.

Bounded tail probabilities

To bound the tail probabilities, we invoke Chebyshev’s inequality (Markov’s inequality applied to the squared deviation),

$$
\mathbb{P}[|\hat{\theta}_n - n| \geq \varepsilon n] \leq \frac{\mathbb{V}[\hat{\theta}_n]}{\varepsilon^2 n^2} \leq \frac{1}{2 \varepsilon^2}.
$$

The second inequality follows because the variance is in fact at most $n^2 / 2$, which we can see by computing it exactly using $\mathbb{E}[2^{2X_n}] = \frac{3}{2}n(n+1) + 1$ from above:

$$
\begin{aligned}
\mathbb{V}[\hat{\theta}_n]
&= \mathbb{E}[(\hat{\theta}_n - \mathbb{E}[\hat{\theta}_n])^2]
\\
&= \mathbb{E}[(2^{X_n} - 1 - n)^2]
\\
&= \mathbb{E}[(2^{X_n} - (n + 1))^2]
\\
&= \mathbb{E}[2^{2X_n}] - 2\,\mathbb{E}[2^{X_n}](n + 1) + (n + 1)^2
\\
&= \mathbb{E}[2^{2X_n}] - (n + 1)^2
\\
&= \frac{3}{2} n(n + 1) + 1 - (n + 1)^2
\\
&= \frac{3}{2} n^2 + \frac{3}{2} n + 1 - n^2 - 2n - 1
\\
&= \frac{1}{2} n^2 - \frac{1}{2} n
\\
&= \frac{n^2 - n}{2}
\\
&\leq \frac{n^2}{2}.
\end{aligned}
$$

In words, the probability that our estimate of $n$ deviates far from the true value of $n$ is bounded. Of course, the deviation is measured relative to $\varepsilon n$, so the absolute error can still grow with $n$. We can see in Figure 1 that as $n$ increases, the deviation from the true value also increases.
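As a concrete instance of the bound (a worked example, not a claim from Morris's paper): taking $\varepsilon = 2$, the probability that the estimate is off by more than twice the true count is at most

$$
\mathbb{P}[|\hat{\theta}_n - n| \geq 2n] \leq \frac{1}{2 \cdot 2^2} = \frac{1}{8},
$$

regardless of $n$. For small $\varepsilon$, however, the bound is vacuous; it exceeds one whenever $\varepsilon < 1/\sqrt{2}$.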

Space complexity

The number of bits required to store $X_n$ is roughly $\log_2(X_n)$. However, Morris’s algorithm only needs to increment $X_n$ an average of about $\log_2(2n + 1)$ times over $n$ events. Thus,

$$
\log_2(X_n) \leq \log_2 \log_2(2n + 1) = \mathcal{O}(\log \log n)
$$

on average.
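For a concrete sense of scale (my own back-of-the-envelope numbers, not Morris's): after $n = 10^9$ events we expect $X_n \approx \log_2(10^9 + 1) \approx 30$, which fits in about

$$
\lceil \log_2 30 \rceil = 5 \text{ bits}.
$$

Conversely, the 8-bit register that motivated Morris can hold $X \leq 255$, and so can represent approximate counts as large as $2^{255} - 1$.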

  1. Morris, R. (1978). Counting large numbers of events in small registers. Communications of the ACM, 21(10), 840–842.
  2. Flajolet, P. (1985). Approximate counting: a detailed analysis. BIT Numerical Mathematics, 25(1), 113–134.