## A Power Law

One of my advisees this semester, George Okeowo, investigated the error messages produced by Python, using a corpus of 600000 buggy programs submitted by students to the Computer Science Circles website. The project was focused on writing beginner-understandable versions of error messages, focusing on the most common errors. A side-effect of the investigation was to look solely at the frequencies of each type of error.

The motivation for this is Zipf’s law, which says that in a book or other corpus of text, the frequency of the kth most popular word is proprtional to 1/k. Here is an example from Wikipedia showing Zipf’s law applied to the book Moby Dick:

Each circle is a word. The top-left is the most popular word the; the next circle is of, which appears about half as often as the. Zipf’s law is confirmed by the fact that the circles lie on a line of slope -1 when plotted on a log-log scale. (Also, the number of distinct words is very close to the maximum frequency, but I don’t think that’s anybody’s law.)

For George’s project, he filtered the error messages by type; for example the top type was SyntaxError, the next was NameError, etc. When plotted on a shifted log-log scale, the following graph results:

This is an amazingly good fit, although we intend to double-check using the more precise domain-specific fitting method of Clauset, Shalizi and Newman. But what should we make of the formula? It says that

frequency of kth most common error ∝ (k+34)-4.7

which is a formula that only a statistician could love. But, the fit is remarkably good.

Aaron Clauset was very helpful in my sorting through his paper, and in corresponding with him I learned there is a second, more generative model, for how power law frequencies like this can arise. Namely, let

D(1.21, 100000) := the truncated discrete power law distribution where Pr[x] ∝ x-1.21 for 1 ≤ x ≤ 100000.

Then, this second model says that

the frequency of each error is selected independently from D(1.21, 100000).

Again, this fits the data extremely well.

For either model, it is difficult to imagine how insight could be extracted from this observation. There are many models that can generate power law distributions, but Michael Mitzenmacher argues that these power-law models are like string theory: beautiful and seductively simple, but of questionable value unless we can apply them to gain some insight in testing a falsifiable hypothesis.

Is there some plausible explanation for why human errors would arise in this way? Or is there some aspect of the way that Python’s code reports errors that affects the shape of the distribution? I could imagine that an ideal language would have a distinct error message for every variation of every possible error, giving a completely flat power-law curve, while a terrible language would give only the message “Error” for every type of error. Is there some quantitative relationship between the power law exponent for any given language, and its suitability for beginners?

Or is a line sometimes just a line?

## Graphs and Bananas

Some fantastic things as of late. (Not more fantastic than previously but still not bad.)

• A puzzle. You have an integer array of length N in memory, indexed from 1 to N. Each of the integers that it contains is between 1 and N-1. By the pigeonhole principle, two entries are the same. If the array is write-only read-only, can you find two identical entries of the array in O(N) time, using O(log N) additional bits of read-write memory? Thanks to Dan Armendariz for telling me about this at SIGCSE.
• A game. Binary tree isomorphism? There’s an app for that. The game Splice is based on this at some level, where you have a limited number of moves to adjust one tree to make it into another. This is alone a nice kind of puzzle, especially since it’s NP-complete. But what is the most striking about this game is its beautiful user interface. Very responsive and fluid, behaves well with a touchscreen. Excellently themed. Getting started is rough since the instructions are tucked away in a corner. Probably my favourite Android game, although I have not found that many contenders.
• A book. In Pursuit of The Travelling Salesman by Bill Cook describes a classical problem in computer science and mathematics in a way that is both scientifically elegant and humanistically down-to-earth. This is probably my favourite book out of anything I could hand to an undergrad and get them to care about the P vs NP problem. The writing explains insights without being technical, and presents interesting stories about the researchers without turning into biographies.
• A movie. There is also a movie Travelling Salesman, which recently completed an IndieGogo campaign, dealing with the same problem. According to Bill Cook, the film almost had to cancel production in the middle of 2010 when an alleged proof of P != NP was circulated. (Although, aren’t a lot of good fictional movies based on false premises?) Perhaps my favourite thing about this movie is that the main character shares a name with a beloved Canadian hockey player-cum-coffee mascot.
• Muffins. Not sure why I like bananas only when they are a component in a baked good, but there you go.
Categories: computer science, food, tsp

## Puerto Rico Trip

For spring break, Sara and I took a voyage out of New Jersey and into Puerto Rico. Why Puerto Rico? Here are some photos to explain:

The waterfall above is from El Yunque national park, and so is the next vista:

This one, however, is sort of from the middle of nowhere, as we got lost one day but it turned out pretty nice:

Aside from fantastic scenery, we came across a lot of food.

• Left: This is a cat hanging out at the Luquillo food kiosks, where we got a barbequed pork skewer from a guy grilling outdoors,
• Bottom right: this was how excited Sara was when we returned back to our favourite bakery for a second trip

Below left: you can see a picture that explains clearly why both of us were so excited.

As you can deduce, we took a lot of photos on the trip!

 Sara taking a picture of the place we got lost Sara taking a picture of pigeons Sara taking a picture of cats Sara taking a picture of a waterfall Sara taking a picture of the rainforest I bet Sara is thinking about taking a picture

Finally, here is a picture of part of Luquillo beach where two waves were meeting each other orthogonally,

and a picture of a city square and church in central PR!

Last but not least, we got engaged! And we mastered the art of taking double self-portraits with an auto-timed camera, which is perhaps also a once-in-a-lifetime event.

Categories: photos, travel

## Happy Pi Day 2013

Pi Kitchen + Bar

I just got back from SIGCSE in Denver, which aside from Euclid Hall Bar and Kitchen, has a restaurant called Pi Kitchen and Bar. This would have been a fortuitous place to be today, Pi day (3/14).

We have a programming exam scheduled for today in COS 126. In case you are a student taking the exam, let me disappoint you already, and tell you that the exam question is not about computing the digits of π.

Nonetheless, everyone can celebrate. I’ll be celebrating (while proctoring) by listening to this playlist, which encodes Pi up to 36 digits after the decimal point.

Here are the tracks:

• The Bloodhound Gang – Three Point One Four
• Nine Inch Nails – Le Mer
• Rjd2 – Ghostwriter
• Buck 65 – Zombie Delight
• Medeski, Martin, Scofield and Wood – Jeep on 35°
• The Herbaliser f. Latyrx – 8pt Agenda
• Emerson, Lake & Palmer – Karn Evil 9
• Miles Davis – Seven Steps to Heaven
• Boards of Canada – M9
• Jimi Hendrix – Third Stone From The Sun
• Alpha Blondy – Psaume 23
• Boards of Canada – 84 Pontiac Dream
• Tortoise – Six Pack
• La Maison Tellier – Le second souffle
• The Beatles – When I’m Sixty-Four
• Ry Cooder – 3 Cool Cats
• The Tragically Hip – 38 Years Old
• Push Button Objects – 3 Doctors
• Grizzly Bear – Two Weeks
• Wu Tang Clan – 7th Chamber
• ATB – 9 PM
• Paul Simon – 50 Ways to Leave Your Lover
• Karkwa – 28 jours
• Eminem – 8 Mile Road
• Led Zeppelin – Four Sticks

Updates! Thanks Sean and Andrew! Combo points awarded for double digits.

• Three Dog Night – One
• Bryan Adams – Cloud Number 9
• Great Big Sea – The Seven Joys of Mary
• Bob Marley – One Love
• Bryan Adams – Summer of 69
• Eddie Cochran – Three Steps to Heaven
• Nena – 99 Luftballoons

Do you know any more songs to extend the list? The next digits, getting up to a precision of 10-50, are 19716939937510. There are definitely some promising digits in that subsequence.

The alternative is to save yourself an infinite amount of listening time and check out the song π by Monoton. Either way, hope you enjoy some circles on this day!

Categories: math, music, photos

## Bits of CS

• On the first day of the Intro to CS course, a written survey asked students a bunch of questions. The overall best answer, given to the question “Why are you here?”, was “Because it’s 12:30 on a Tuesday.”
• I had a student who couldn’t run Java because calls to Math.sqrt and the like would not compile. A typical error was
Error: The method sqrt(double) is undefined for the type Math

How does such an error arise? I have no idea. Why did import java.lang.Math; added to the top of the program fix it? I also have no idea. But I am glad it worked!

• Sara and I recently watched Source Code. I didn’t know anything but the title going in to the movie. While I was disappointed by the total lack of programming in the movie, it was pretty great overall. The logic makes sense locally, but doesn’t quite come together if you try to explain the whole thing at once. Sort of like a Klein Bottle of movies.
• You know a project is successful when it starts spawning secondary sub-industries. This seems to have happened with CS Circles: there’s a guy on YouTube giving demos/walkthroughs of some of our exercises!
Categories: Java, movies, programming

## COS 126 Wrap-up

The main course that I taught this term, the introduction to computer science (COS 126) at Princeton, just wrapped up. It’s a very well-designed course. Before coming here I had thought that Java had too much potential for confusion for beginners — it is a lot of confusing work to type out

public class { public static void main (String[] args) { ...

for each of your first 10 programs without having a clue what it means. But, once you get past this there are some advantages.

Is it important that Java has strongly typed variables, compared to say Python? Not sure, I can see pros and cons to both styles. But, it is a definite bonus that Java’s functions have typed parameters, return values, and explicit access modifiers (public/private). When we give students assignments requiring that they build methods of a particular type, this makes them think about the big-level ideas before starting to just code. Then, this also fits well with requiring students to submit more than one file in an assignment, since we can explicitly control which public information is shared in-between classes, forcing students to think carefully about information representation, and the different ways that the same information can be passed around in different forms.

Highlights of the final project grading:

former = latter;
current = next;
before = after;

for three parallel variables (e.g., an integer, a float, and a string) used in between iterations;

predator[j].distanceTo(hunter[k]);

for geometric point sets (no alien technology involved);

This upcoming term, I am hoping to make a web-based system for submitting tiny Java exercises — programs just big enough to test out new syntax elements and allow students to have their first bugs appear in a small setting de-coupled from any complications. Compared to CS Circles two of the big changes would be (1) multiple blanks for code entry in a single exercise and (2) syntax checking required to get the latter to work. Students rarely take the time to do the small exercises in the book, and even when they do they don’t seem to test them exhaustively; I hope this can help get students doing more tiny coding sessions from scratch.

Categories: Java, programming, teaching

## Wrapup: Math Software Course for Math Teachers

Stephen Tosh and I co-taught the online course Math 600 for the University of Waterloo last term, in September and October. The course is part of the Master’s of Mathematics for Teachers program, which is an online Master’s degree for high-school math teachers. The students were a very interesting mix of people from all over the globe, ranging from freshly-graduated to very experienced. I just finished the very last part of the course administrivia, which was reading the results of a survey we sent to them, and it’s a good time to reflect on it before the next term starts here at Princeton.

So what was this course, exactly? It serves to introduce three pieces of software: LaTeX which is a typesetting system, Maple which is a programming language oriented around math and algebra, and GeoGebra which is a interactive system for setting up geometric and algebraic objects in the plane. We spent three weeks on LaTeX, two on Maple and one on GeoGebra, and gave out either short exercises or an open-ended project each week. The general idea is that we choose these technologies because they are useful for teachers in the classroom and/or they will be useful for teachers in future MMT courses that they take.

We hosted the course content on a WordPress installation at

http://cemclinux1.math.uwaterloo.ca/~math600

and WordPress turned out to be a pretty great tool for hosting an online textbook. We didn’t do any interactive stuff, just reading material in a nice searchable, universally accessible format. Along the lines of CS Circles we tried putting exercises sprinkled throughout each lesson, rather than bunched-up in problem sets at the end, which seemed to work well. We used LaTeX-generated typography in WordPress three ways:

1. copying-and-pasting images
2. using the built-in LaTeX plugin for WordPress to render text as images
3. using MathJax to render text as MathML

Only the dumbest solution, #1, worked flawlessly for all students. But the other two approaches were also good: #2 only failed for one student behind the Great Firewall of China, and #3′s issues were limited to just a couple of students plus a couple of complex fraction formulas that were not rendered correctly.

It would be lovely, some day, to be able to automatically grade LaTeX, Maple, and GeoGebra in the same way that auto-graders currently exist for C++, Java, and Python. That day is a long way off, so I have to give a lot of credit to our teaching assistant Burcu Karabina’s excellent attention to detail in the manual grading process.

Here’s a couple of questions that I wanted to find out about in the end-of-course feedback:

• Did students use LaTeX on their own computer or “in the cloud?” We showed students that they could download LaTeX software either to their own computer, or use a web-based version. Sadly the fabulous ScribTeX stopped accepting registrations just before the term started, but thankfully ShareLaTeX was a good remaining option. The students were equally split 3 ways between using ShareLaTeX, using TeXWorks in Windows, and using Mac-based options. If I taught the course again I would pursue Sage as a replacement for Maple, since Sage also can be run in the cloud.
• How should you support students in an online course? We had forums built in to UW’s Blackboard Learn system, scheduled office hours for google/skype chats, and we allowed students to directly e-mail us. By far, the forums were the most utilized and successful option, and students indicated that we they would be happy if we stopped the office hours entirely.
• Can you get students more engaged with each other despite being so far away from one another? This is more challenging. Based on questions that we asked, the students are not very supportive of group projects or peer reviews. A great idea mentioned by a couple of students was to let students share their completed projects with the others, which I would have loved to do and skipped only for lack of time.

For the latter two findings, it might be relevant that we’re teaching teachers in particular; younger students without a full-time job might want more close support and more interactivity. At this point I don’t have enough experience to say for sure.

In terms of the material, it seems like UW’s chosen combination of LaTeX, Maple and GeoGebra was pretty good. People expressed a love-or-hate relationship with LaTeX… two students said

LaTeX is not, and will not, be the preferred method for high school teachers. Keep in mind that we using Word and Equation Editor, there is no use, to most of us, to put in the numerous hours to use/learn LaTeX at the level to create something we can do in minutes in Word.

versus

LaTeX felt like a really useful program to know given its versatility. It was a very intuitive program.

The general feedback was appreciative of the course, and I was happy to see that students said they enjoyed doing the homework, which I consider a very excellent compliment. We seemed to do a pretty good job of one of our main goals, which was to give teachers tools that they actually would find useful in the classroom. GeoGebra and LaTeX were singled out for this… I do think that using programming is harder in a math course due to the extensive training and teaching needed, and probably could sit inside of its own separate course. It might be interesting to explore the various tools and mini-applications that Maple has inside of it for doing math and science explanations, although I know very little about that.

What else can I say? Go check out the site and see if you can figure out the Easter eggs at the top of each page.

Categories: math, teaching, waterloo