### A Power Law

10May13

One of my advisees this semester, George Okeowo, investigated the error messages produced by Python, using a corpus of 600000 buggy programs submitted by students to the Computer Science Circles website. The project was focused on writing beginner-understandable versions of error messages, focusing on the most common errors. A side-effect of the investigation was to look solely at the frequencies of each type of error.

The motivation for this is Zipf’s law, which says that in a book or other corpus of text, the frequency of the kth most popular word is proprtional to 1/k. Here is an example from Wikipedia showing Zipf’s law applied to the book Moby Dick:

Each circle is a word. The top-left is the most popular word the; the next circle is of, which appears about half as often as the. Zipf’s law is confirmed by the fact that the circles lie on a line of slope -1 when plotted on a log-log scale. (Also, the number of distinct words is very close to the maximum frequency, but I don’t think that’s anybody’s law.)

For George’s project, he filtered the error messages by type; for example the top type was SyntaxError, the next was NameError, etc. When plotted on a shifted log-log scale, the following graph results:

This is an amazingly good fit, although we intend to double-check using the more precise domain-specific fitting method of Clauset, Shalizi and Newman. But what should we make of the formula? It says that

frequency of kth most common error ∝ (k+34)-4.7

which is a formula that only a statistician could love. But, the fit is remarkably good.

Aaron Clauset was very helpful in my sorting through his paper, and in corresponding with him I learned there is a second, more generative model, for how power law frequencies like this can arise. Namely, let

D(1.21, 100000) := the truncated discrete power law distribution where Pr[x] ∝ x-1.21 for 1 ≤ x ≤ 100000.

Then, this second model says that

the frequency of each error is selected independently from D(1.21, 100000).

Again, this fits the data extremely well.

For either model, it is difficult to imagine how insight could be extracted from this observation. There are many models that can generate power law distributions, but Michael Mitzenmacher argues that these power-law models are like string theory: beautiful and seductively simple, but of questionable value unless we can apply them to gain some insight in testing a falsifiable hypothesis.

Is there some plausible explanation for why human errors would arise in this way? Or is there some aspect of the way that Python’s code reports errors that affects the shape of the distribution? I could imagine that an ideal language would have a distinct error message for every variation of every possible error, giving a completely flat power-law curve, while a terrible language would give only the message “Error” for every type of error. Is there some quantitative relationship between the power law exponent for any given language, and its suitability for beginners?

Or is a line sometimes just a line?