### The Joy of Randomness

02Nov11

Ravi Kannan gave a distinguished lecture today at Georgia Tech covering a wide swath of algorithms, randomness, and high-dimensional geometry. There was one particularly excellent slide that gave a very intuitive explanation for a result I have seen, but not fully understood, before.

The result is in the field of streaming algorithms. Specifically, there is a “streaming” data set which consists of a large number m of elements, where each element is some b-bit string — actually it is more convenient to think of each element as integer between 1 and n := 2b. You only get to see one element from the stream at a time, do a computation and change your working memory, and then the next one arrives. We want to capture some statistical properties of this data set. Storing the entire data set could be done easily in either log n bits, or n log m bits. But what if this is much more memory than our computer has available? (E.g. if these are a continuous stream of data points from a sensor, or requests on a web server, or Google’s cache of the internet.) We would like to say something useful about the data set using only polylog(m, n) memory.

First, an easier puzzle. Let’s say that we don’t precisely know in advance the total number of items (including repeats) m that will arrive, and want to find m out. (For the puzzle, we don’t care about the items’ contents.) Evidently we can just count this in a streaming manner using log bits of memory.  Can we do it with less? More precisely, can we come up with a good estimate of m using only O(log log m) bits of memory? I give the answer at the end.

Part of a Stream

Back to Ravi’s talk. The particular statistic we’d like to count is the second moment. For each string i, let fi be the frequency of that string, or in other words the total number of times it appears in the data set. Then the second moment is defined as $\sum_{i} f^2_i$. Put differently, the second moment is the probability that two random elements from the stream are the same, up to a factor of n2. How can we estimate this value?

Ravi’s explanation of how you could imagine doing this was very nice. Think of f as a vector in high-dimensional space. What we want to measure is (the square of) the Euclidean length of f, but we don’t have the space to explicitly store all of the bits of f in memory. The streaming model amounts to discovering the vector f in unit steps. The key insight in his explanation is to use the projection of f on to a random d-dimensional subspace where d is logarithmic; analogous to the Johnson-Lindenstrauss lemma, the random length of the projection actually gives us a pretty good estimate of the original length. Moreover, the projection of the sum of the steps is the sum of the projections of the steps. So it suffices to keep a running (vector) total of the sum of the projections of the steps.

A possible problem in implementing this is that projecting on to a d-dimensional subspace amounts to multiplying each incoming item (n-dimensional unit basis “step” vector) by a n-by-d matrix, and we cannot store anything of length n. To get around this, we let our n-by-d matrix actually be pseudorandom — that is to say, there should be some small random “seed” so that a computer program can determine the n-by-d matrix from just the seed. In this way, we don’t need to store the whole matrix in memory, rather just the seed, so that we can just recompute the ith row each time an element i shows up. In this way we can keep a running count of the d-dimensional vector sum and complete the idea.

This result is due to Alon, Matias, and Szegedy, which I have heard called one of the first papers in streaming computation (1996). The paper is nice to read and gives a clean explanation where you project on to random ±1 vectors in a derandomizable way. But this intuition helps me follow the proof idea for second moments (section 2.2). I read it previously (without this intuition), back when I was working on my Master’s thesis, because I had looked at an earlier paper of Morris for estimating the first moment… and this brings me to the puzzle’s solution.

***

You may have already deduced that you would need randomness to solve the puzzle, because with only O(log log m) bits of memory, our counter could only have polylog(m) states, and so only count polylog(m) items reliably.

So, let’s say that we have a “log counter” instead. The log counter stores a value k, such that our running estimate of the number of items seen so far is 2k. Then, we maintain this counter using the probabilistic method: if the counter currently reads k, when the next item shows up, only increment the counter with probability 2-k. So, the counter reads 1 after the first item arrives; the expected time until it reads 2 is three items; in general the expected time until it reads j is 2j-1 items. You would need some more formal calculations to see exactly how good of an estimate it is, but this is the main idea.

The paper of Alon et al has a lot of other nice stuff. We saw small-space approximations for the first and second moments; they also formalize a previous idea for the 0th moment (counting distinct items) and show that some higher moments do not admit any small-space approximation. For recent work on this, not a bad place to start is to look at the work of Kane & Nelson et al.!