What Is ASCII?

15Feb15

Disclaimer: I am not a historian. I am just a Computer Science educator who likes reading Wikipedia and printing funny characters. This explanation is a long optional appendix for a programming assignment in my course.

ASCII is the American Standard Code for Information Interchange. Developed starting in 1960, it was a way of standardizing means of communication between computers.

To explain, it’s useful to note that “8 bits” means two things from the perspective of a C++ programmer.

  • These days all computers have standardized on organizing memory in bytes. That is to say, each memory address contains 8 bits.
  • In C++, a char variable, representing a single character for communication, is also 8 bits wide.

Neither of these has been true forever. (Actually, some of the earliest computers did not use binary! But we’ll only be talking about bit-based computers here.)

  • There were machines where each memory address contained some other number of bits like 22, 12, 18, 25, etc; see this table on Wikipedia.
  • People used 5-bit Baudot codes for early communication. At 32 possible codes, that was just big enough for the English alphabet, but not lowercase letters or digits. This gradually expanded over the years to 6-bit, 7-bit and finally 8-bit codes.

However, 7 bits was sort of a sweet spot. At 128 possible codes, there was enough room for lower-case and upper-case letters, digits, and punctuation, with space left over for “control codes.” The development of ASCII in the 1960s was led by the American Standards Association. It proposed a standard meaning for all 128 values in a 7-bit code:

  • The first 32 codes (0-31) were specific “control characters.” They didn’t have a direct graphical representation but instead had a specific semantic meaning. Notable examples are #0 the null character, #8 backspace, #9 tab, #10 line feed, and #13 carriage return.
  • Character 32 was a space.
  • Characters 33-126 meant the following, in order (the short snippet after this list prints them from C++):
    #33-63: !"#$%&'()*+,-./0123456789:;<=>?
    #64-95: @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
    #96-126: `abcdefghijklmnopqrstuvwxyz{|}~
  • Character 127 meant “delete”. Here is a quote from Wikipedia that gives a nice explanation:

    This code was originally used to mark deleted characters on punched tape, since any character could be changed to all ones by punching holes everywhere. If a character was punched erroneously, punching out all seven bits caused this position to be ignored (or deleted).
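For reference, here is a tiny C++ snippet (my own addition, not part of the standard or the assignment) that prints codes 32 through 126 next to the characters they stand for, reproducing the table above:

#include <iostream>

int main() {
  // walk the printable ASCII range and show each code next to its character
  for (int code = 32; code <= 126; code++) {
    std::cout << code << " = '" << (char)code << "'" << std::endl;
  }
}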

Anyway, proposing ASCII as a standard was a pretty good idea. It meant that if you had two machines communicating with each other, they had a chance to understand one another.

ASCII is ubiquitous now, but that was not always the case. Quoting this CNN article,

[T]here was an 18-year gap between the completion of ASCII in 1963 and its common acceptance. … Until 1981, when IBM finally used ASCII in its first PC, the only ASCII computer was the Univac 1050, released in 1964 (although Teletype immediately made all of its new typewriter-like machines work in ASCII).

The most well-known system not compatible with ASCII was EBCDIC. EBCDIC was developed in 1963. It is deeply weird in the sense that the letters of the alphabet don’t all have contiguous codes. But there’s a reason for this; it has to do with the Hollerith card code. Hollerith worked for the US Census and founded a startup in 1896 that was the precursor to IBM. The common punchcard standard that eventually dominated had 12 rows, but not all 4096 combinations of punches were viable, since punching too many holes would destroy the card’s physical strength. So a typical character in a Hollerith code would punch at most 1 out of three “zone” rows plus at most 1 out of nine “numeric” rows. Encode this as decimal and you get 40 choices from “00” to “39”. Then encode that as binary and all of a sudden you have a gap, e.g. formerly-adjacent codes 19 and 20 are all of a sudden 011001 and 100000. (This oversimplifies a bit, but a summary of the main point is that the BCD in EBCDIC means binary-coded decimal.)
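To see the gap concretely, here is a toy reconstruction (my own, deliberately simplified; it is not the exact Hollerith or EBCDIC table) of packing a zone digit and a numeric digit into separate binary fields:

#include <cstdio>

int main() {
  for (int zone = 0; zone <= 3; zone++) {        // which zone row was punched, if any
    for (int digit = 0; digit <= 9; digit++) {   // which numeric row was punched, if any
      int label = 10 * zone + digit;             // the decimal label, "00" through "39"
      int code = (zone << 4) | digit;            // zone in the top bits, digit in BCD below
      printf("label %02d -> code ", label);
      for (int bit = 5; bit >= 0; bit--) printf("%d", (code >> bit) & 1);
      printf("\n");
    }
  }
  // Labels 19 and 20 come out as 011001 and 100000: the digit field never uses
  // binary patterns 1010 through 1111, so adjacent labels can jump across a gap.
}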

Also, EBCDIC did not contain all the punctuation characters that ASCII did. Some keyboards set up for EBCDIC did not have keys for characters like ^, { or }. See the 1972 international standard ISO 646, which is sort of a missing link between EBCDIC and ASCII. The presence of hardware and operating systems unable to support all characters is the reason that “C digraphs” and “C trigraphs” exist, e.g. that int x = 3 ??' 5 sets x equal to 6. (For applications of C di/trigraphs, see the IOCCC.)
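If you want to see a trigraph in action, here is a small example (my own; note that modern compilers disable or warn about trigraphs by default, g++ needs the -trigraphs flag, and C++17 removed them from the language entirely):

#include <iostream>

int main() {
  int x = 3 ??' 5;               // with trigraphs enabled, ??' is replaced by ^, so this is 3 ^ 5
  std::cout << x << std::endl;   // prints 6
}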

So, in a nutshell, EBCDIC existed because the evolution of hardware and software was gradual and intertwined.

There’s another acronym that you’ll see a lot in researching this subject. Specifically, you’ll sometimes see ANSI as a synonym for ASCII, or to mean a variety of related concepts. What actually happened is that the American Standards Association, the organization that developed ASCII, eventually renamed itself the American National Standards Institute (ANSI) in 1969. From this ASCII gained another name retroactively, the “ANSI X3.4” standard. (And ANSI released more standards later on.)

Towards 8 bits

Eventually, ASCII was adopted more and more, which is why you don’t have an EBCDIC table in the appendix of your textbook. To this day, the word ASCII still technically refers to the 128-character 7-bit system mentioned above.

Over time, computer architectures did eventually standardize on word sizes and addressable locations that used bytes (8 bits). The PDP-8, sold commercially between 1963 and 1979, was the last popular machine whose word size was not a power of 2 (it used 12-bit words).

Hence, there was some wiggle room. A file on your hard drive, or a character being transmitted from computer to computer, or a char variable, actually had 8 bits of space, but ASCII only used 7. If you were to use all 8 bits, you could encode 256 possibilities, which was an extra 128 characters! This was particularly useful for non-English speakers, since most languages use accents or different letters not present in English. Spanish users would be interested in having ñ get one of those values in the unused space from #128-255, while French and Turkish speakers could use ç, Germans would like to add ß, Russians could use И, etc. But, there wasn’t enough space to satisfy everyone.

Enter the code page. This system, which became particularly common with DOS 3.3 in 1987, meant that every user could make their own choice about what system of letters would appear in those extra 128 slots. The most common English code page, CP 437, used those slots for the following characters:

#128-159: ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜ¢£¥₧ƒ
#160-191: áíóúñѪº¿⌐¬½¼¡«»░▒▓│┤╡╢╖╕╣║╗╝╜╛┐
#192-223: └┴┬├─┼╞╟╚╔╩╦╠═╬╧╨╤╥╙╘╒╓╫╪┘┌█▄▌▐▀
#224-254: αßΓπΣσµτΦΘΩδ∞φε∩≡±≥≤⌠⌡÷≈°∙·√ⁿ²■

(The last character #255 is “non-breaking space,” another control character.)

Those lines and boxes were extremely useful in creating an entirely new domain, ASCII art. (Or maybe it is called ANSI art? Technically it should be called CP-437 Art but that name didn’t seem to take off.)

Of special mention is that on many machines of the DOS era, you could directly manipulate the screen’s contents. A program could set your monitor to text mode, for example 80×25, where there are 25 rows and 80 columns, each with room for a single letter. (Your VM looks like this when it boots up.) This is extremely similar to a bitmap: an 80-by-25 array of char values. In fact, the system assigned visual symbols or glyphs for all possible values from 0 to 255, even the non-printable control codes. So text-based games/art/applications of the day also had access to these symbols (in CP 437):

[Image: the glyphs that CP 437 assigns to the 32 control codes #0-31.]
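Here is a rough simulation (my own sketch, not actual DOS code; a real text-mode buffer lived at a fixed memory address and interleaved a color attribute byte with each character) of treating the screen as an 80-by-25 grid of char values:

#include <iostream>

const int ROWS = 25, COLS = 80;
unsigned char screen[ROWS][COLS];

int main() {
  // fill the "screen" with spaces, then draw a border of '#' characters
  for (int r = 0; r < ROWS; r++)
    for (int c = 0; c < COLS; c++)
      screen[r][c] = (r == 0 || r == ROWS - 1 || c == 0 || c == COLS - 1) ? '#' : ' ';

  // dump the buffer; text-mode hardware would render a buffer like this directly
  for (int r = 0; r < ROWS; r++) {
    for (int c = 0; c < COLS; c++) std::cout << screen[r][c];
    std::cout << "\n";
  }
}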

There were many other code pages. Almost all of them stuck to ASCII for the first 128 values (#0-127) and then added the 128 additional characters of local interest. E.g. here is what CP 866 contains at these positions:

#128-159: АБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ
#160-191: абвгдежзийклмноп░▒▓│┤╡╢╖╕╣║╗╝╜╛┐
#192-223: └┴┬├─┼╞╟╚╔╩╦╠═╬╧╨╤╥╙╘╒╓╫╪┘┌█▄▌▐▀
#224-255: рстуфхцчшщъыьэюяЁёЄєЇїЎў°∙·√№¤■

The picture is further complicated by right-left languages and joined-letter scripts like Arabic, and languages like traditional Chinese with way more than 128 writing symbols.

In English communication, the Windows-1252 codepage became dominant. Compared to CP 437, it lost the line art, but gained smart quotes, fancy dashes, and other useful punctuation. (This made sense because in Windows, they had real windows made out of pixels, rather than line art text mode graphics.)

For our Caesar Decipher assignment, the files will be given to you in the Windows-1252 format. (Technically, the internationally standardized Latin-1 subset.) But it shouldn’t matter, since your program will only do anything to those char values between ‘A’ and ‘z’, which are in the “official” ASCII range between 0 and 127.
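Here’s a minimal sketch of that guard (the names are made up by me, not the actual assignment starter code): letters in the plain-ASCII range get shifted, and every other byte, including any Windows-1252 “extras” above 127, passes through untouched.

#include <iostream>
#include <string>

std::string shiftLetters(const std::string& text, int shift) {
  std::string out = text;
  for (size_t i = 0; i < out.size(); i++) {
    char c = out[i];
    if (c >= 'A' && c <= 'Z') out[i] = 'A' + (c - 'A' + shift + 26) % 26;
    else if (c >= 'a' && c <= 'z') out[i] = 'a' + (c - 'a' + shift + 26) % 26;
    // everything else (digits, punctuation, bytes above 127) is left alone
  }
  return out;
}

int main() {
  std::cout << shiftLetters("Attack at dawn!", 3) << "\n";  // prints: Dwwdfn dw gdzq!
}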

Unicode

Most modern communications today, especially international ones, use a more recent system called Unicode.

One problem with the codepage system above is that you don’t really know what a file is saying just by looking at its bits. You can’t be sure if (char)161 means б, í, or something else. And there’s absolutely no way to represent a file that contains more than 256 distinct characters. (Such as this very article.)

This was eventually solved by Unicode, which began as an attempt to bring all of the code pages into a single system. Unicode is based on two principles:

  • One unified and fixed numbering of all possible symbols. For example, б is 1073 (“CYRILLIC SMALL LETTER BE”) and í is 237 (“LATIN SMALL LETTER I WITH ACUTE”). Each possible symbol is called a unicode code point.
  • A variety of different encodings, i.e. systems to encode a given sequence of code points as a series of bytes. (The most common encoding these days is UTF-8, a variable-width encoding.)
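To make the second bullet concrete, here is a small sketch (my own, not the official reference algorithm, and skipping error checks) of how UTF-8 turns a single code point into one to four bytes:

#include <iostream>
#include <string>

std::string toUTF8(unsigned int cp) {
  std::string out;
  if (cp < 0x80) {                      // 7 bits fit in one byte: 0xxxxxxx
    out += (char)cp;
  } else if (cp < 0x800) {              // two bytes: 110xxxxx 10xxxxxx
    out += (char)(0xC0 | (cp >> 6));
    out += (char)(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {            // three bytes: 1110xxxx 10xxxxxx 10xxxxxx
    out += (char)(0xE0 | (cp >> 12));
    out += (char)(0x80 | ((cp >> 6) & 0x3F));
    out += (char)(0x80 | (cp & 0x3F));
  } else {                              // four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    out += (char)(0xF0 | (cp >> 18));
    out += (char)(0x80 | ((cp >> 12) & 0x3F));
    out += (char)(0x80 | ((cp >> 6) & 0x3F));
    out += (char)(0x80 | (cp & 0x3F));
  }
  return out;
}

int main() {
  std::cout << toUTF8(1073) << "\n";  // CYRILLIC SMALL LETTER BE: bytes 0xD0 0xB1
  std::cout << toUTF8(9731) << "\n";  // SNOWMAN: bytes 0xE2 0x98 0x83
}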

This system still had its growing pains and we’ll just mention the most noteworthy miscalculation. Joe Becker said in 1988,

Unicode aims in the first instance at the characters published in modern text (e.g. in the union of all newspapers and magazines printed in the world in 1988), whose number is undoubtedly far below 2^14 = 16,384. Beyond those modern-use characters, all others may be defined to be obsolete or rare; these are better candidates for private-use registration than for congesting the public list of generally useful Unicodes.

The Java language was even designed based on this assumption: a Java char is a 16-bit variable. However, this didn’t last long, and as of this writing, the latest version of Unicode, version 7.0, contains 113,021 different characters (code points). In Java, this entailed the use of the UTF-16 encoding. In C++, you can read about the platform-dependent wchar_t type.
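For the curious, the arithmetic behind UTF-16’s workaround is small enough to show here (a sketch of my own, not Java’s actual internals): any code point above 0xFFFF gets split into a “surrogate pair” of two 16-bit units.

#include <cstdio>

int main() {
  unsigned int cp = 0x1F600;              // a code point that doesn't fit in 16 bits
  unsigned int v = cp - 0x10000;          // 20 bits remain after subtracting the offset
  unsigned int hi = 0xD800 | (v >> 10);   // high surrogate carries the top 10 bits
  unsigned int lo = 0xDC00 | (v & 0x3FF); // low surrogate carries the bottom 10 bits
  printf("U+%X becomes the pair 0x%04X 0x%04X\n", cp, hi, lo);
}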

For historical context, you might read Kiss your ASCII Goodbye, written in 1992.

One thing is clear: at some point, maybe related to surpassing the 16,384-character limit, the Unicode consortium changed their definition of the “modern-use characters” that could fit in the standard. For example, character 9731 is ☃ (SNOWMAN). Try copying

http://☃.net

into your browser address bar.


Over the last couple of weeks, I’ve been building a graphics/animation library to use in the class that I’m teaching at USC. It’s now in a stable state and has the features that I think I’ll need this semester.

A lot of the motivation comes from Princeton where I taught previously: the class there, taught in Java, uses the awesome book, assignments, and libraries of Kevin Wayne and Bob Sedgewick. StdDraw.java, their library, Just Works™. You don’t need to create any objects, or initialize any canvas, or create a hook. If you want a circle, just call StdDraw.circle(x, y, r). This makes it reasonable to write code for a beginner class that students will actually read, and reasonable for students to write their own clients. So I’ve tried to emulate that.

Of course, with Java, you have the benefit of native platform support, and their library is a wrapper over the java.awt and javax.swing libraries. In my case, after a lot of false starts, the only C++ library that

  • I could get working on my VM
  • I could get making graphics and animation

was “Qt.” More on that is below. But I’m pretty happy with how it turned out.

Here’s a little picture showing it in action:

If you’re interested, here is the:

Caveats

There were a few things that made life unnecessarily difficult, at least for some shlub with a Java/Python background like me.

  1. I have no idea if someone not using the course VM will be able to get it to run, since they’ll have to install Qt. And adding audio & .mid support is another order of magnitude more difficult for the student.
  2. Every function call causes a pretty long chain of events, and as a result there’s a ridiculous amount of boilerplate in draw.cpp: every function has to be declared 4 times in slightly different ways (a rough sketch of this chain follows the list):
    • The user calls a library function.
    • The library function, a static function, calls a public member function of a QObject.
    • That member function “emits” a protected “signal”.
    • It’s transmitted (through a cross-thread queue) to the “slot” of another object, the GUI’s “QWidget”.
    • That slot actually does the work.
  3. You need threading in order for the student’s calls to be non-blocking, which is important for animation. But clang++ doesn’t support threads properly. So we have to use g++, which is not our course compiler, and is not as portable to OSX.
  4. We really would like to run some stuff (wait for the window to close) after the student’s main() function, but there’s no exposed API for this that worked. So we use the preprocessor to rename their main() method to _main(). It’s awful because you need to transform both a no-argument main() and a main(int, char**) to the same thing. This really makes me feel dirty…
    // transform main() or main(int, char**) to _main(int, char**)
    #define main(...) vamain(__VA_ARGS__)
    #define vamain(...) vamainhelp(,##__VA_ARGS__, int, char**)
    #define vamainhelp(blank, first, second, ...) _main(first, second)
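For the curious, here is a rough sketch of the signal/slot chain from item 2 above. The names are invented for illustration and this is not the actual draw.cpp; real Qt code also needs the moc/qmake build step to expand Q_OBJECT.

#include <QObject>

class Forwarder : public QObject {   // lives in the student's thread
  Q_OBJECT
public:
  void circle(double x, double y, double r) { emit circleRequested(x, y, r); }
signals:
  void circleRequested(double x, double y, double r);
};

class Canvas : public QObject {      // lives in the GUI thread
  Q_OBJECT
public slots:
  void drawCircle(double x, double y, double r) { /* ... actually paint ... */ }
};

// During setup, a queued connection carries the emitted signal across threads:
//   QObject::connect(&forwarder, &Forwarder::circleRequested,
//                    &canvas, &Canvas::drawCircle, Qt::QueuedConnection);

// The student-facing library function just forwards the call:
namespace draw {
  Forwarder* theForwarder;   // created once at startup
  void circle(double x, double y, double r) { theForwarder->circle(x, y, r); }
}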

But, I would have to say that I learned a lot in the process. The library even works with gdb for debugging and stepping. It’s weird that Qt tries to bolt onto C++ a lot of machinery that other languages support natively: hooks (signals/slots), reflection, and multi-threading in particular.

We’ll see how it goes during the semester. Hopefully many of the undergrads working in my course are wizards enough at Qt, from their experience using it in the data structures course, that more people can configure it on whatever machine they like. My fingers are crossed!


Happy new year! As a way of celebrating (?) or making up for procrastinating (more likely), I have finished a long-overdue book review for The Mathematical Intelligencer. The book is called Combinatorics: Ancient & Modern.


This book, a collection of 16 chapters by different authors, spans the history of combinatorics from the dawn of civilization to present day. It is a cohesive book: it does not suffer the typical problem of compilations where the tone is inconsistent from chapter to chapter. Rather, it is consistently easy to read, well-illustrated, and it strikes a good balance between technical information and historical details.

The book’s organization is flexible, which lets each chapter speak for itself. The first third of the book deals with mathematics organized by culture and place of origin. This is followed by several chapters organized by time period and then chapters on particular topics within combinatorics (e.g., partitions).

I was amazed at the sheer diversity in applications of combinatorics throughout its early history. People studied ways to choose, arrange and combine Chinese syllables for divination, syllables in Indian, Latin and Greek poetry, pronunciation of Arabic words, Hebrew letters for seating charts, note sequences in songs, voicings of parts to singers, tastes of medicine, dice outcomes for gambling, virtues & vices, splitting inheritance, partitions of incense smells, humors in medicine, rhythm patterns, and chemical combinations; this is in addition to more well-known applications of combinatorics like supernatural protection (magic squares) and sheep farming (mutually orthogonal latin squares).

Two of the early applications I found particularly interesting. One from Jewish culture was the enumeration of words of a given length, with the motivation being that everything God made has a name, and therefore counting the number of names gives a bound on how many things God made. Another is in the introductory chapter. In 1615, Bauhuis wrote a poem which, in 1617, Puteanus tried to permute in all possible ways subject to certain pronunciation constraints. Through the years Leibniz, Prestis, Wallis, Jacob Bernoulli and the Mathematical Gazette all tried to determine the correct number of possibilities; Knuth reports that Bernoulli, who used a careful exhaustive backtracking algorithm, was the only one who was correct. Quoting Bernoulli,

Even the wisest and most prudent people often suffer from what Logicians call insufficient enumeration of cases.

Other algorithmic ideas are sprinkled throughout the book. The history of Indian combinatorics includes the ideas of binary encoding, repeated squaring, and error-correction (memorizing Vedas both backwards and forwards). In the middle of the book, after a wonderful visual proof of the pentagonal number theorem, it is shown that the theorem is not just a neat fact, but also leads to a faster way to compute the partition numbers. And towards the end of the book, we find that computer-based proofs were necessary to solve several longstanding open problems: the 4-color theorem posed in 1852 by De Morgan/Hamilton and solved in 1976 by Appel and Haken; the existence of a resolvable 15-point Steiner triple system, posed in 1850 by Sylvester and solved affirmatively in 1971 by Denniston; and the existence of an order-10 projective plane, open since Veblen’s early work in 1904 and resolved negatively by Lam in 1988. (Though, computers did not solve the longest-standing open problem in the book, Parker’s construction in 1959 of order-10 orthogonal latin squares, conjectured by Euler in 1782 not to exist.)

Another theme that recurs as the book works its way through the annals of history is mis-attribution. For example, “Pascal’s Triangle” was written about by other authors in North Africa and Europe hundreds of years before him, and in Asia over a thousand years before that. Chapter 7 hypothesizes that Pascal learned it from Mersenne, who may have learned it from Cardano and Tartaglia. However, Pascal did publicize it well, writing the first full book on the topic. And having died three years before its publication, he was not in the best position to correct the subsequent assumption (made in writing by de Montmort) that he was the originator.

The Middle Ages brought other new ideas. One is the study of gambling and probability, mentioned several times in the middle part of the book. Another is the frontispiece, the decorative front page of a book. Throughout, Combinatorics: Ancient & Modern has high quality illustrations, scans and diagrams, and the many frontispieces reproduced in the book are no exception. I particularly like page 140, showing a wonderfully busy scene including an angel carrying an icosahedron (the table of contents) from Caramuel’s 1670 Old and New Two-Headed Mathematics, and page 290, with at least a dozen fonts adorning Nicholson’s 1818 Essays on the Combinatorial Analysis.

The last third of the book is largely composed of histories/surveys of individual topics in modern combinatorics: partitions, designs, algebraic combinatorics, extremal combinatorics, and graphs. Chapter 12, on enumeration and algebraic combinatorics, included an intriguing story and equally intriguing mathematics. J. Howard Redfield trained originally as a civil engineer and linguist, but did pioneering outsider work in 1927 on using group theory in enumerative combinatorics. There is a brief explanation of his methods, illustrated by the problem of counting arrangements of 4 white and 4 black balls at the corners of a cube, modulo rotations. It is very interesting and leaves me wanting a slightly more detailed explanation: I have never seen the number 7 obtained in such a mysterious way!

Chapter 13, on combinatorial set theory and extremal combinatorics, presents a very nice history while also being thought-provoking. For example, it discusses Schur’s precursor work to Ramsey theory, and how Schur used it to give a nicer proof of Dickson’s result that Fermat’s Last Theorem does not hold if you take the integers modulo a prime. The only fault that I could find with the book is that it would be nice to include more proofs: Schur’s theorem and its corollary, for instance, have beautiful short proofs but they are not given. Regardless, this chapter (like the rest) surveys history nicely, ranging from a 1662 textbook use of the pigeonhole principle (to show that two people in Paris have the same number of hairs on their head) to the work of Erdős and his collaborators.

Overall, I recommend the book for anyone with interests in mathematics, culture, or history. The authors do a good job of presenting the mathematics in an authentic way while not assuming the reader has any specific background, and the variety of topics means there are a lot of opportunities to engage your interests. The book is a light read, though still mathematical, and can inspire further discovery. For me, the presentation of each culture separately was very novel and opened my eyes up to new ways of motivating combinatorics: it may be that for future assignments, rather than asking students about arranging cities in a tour or selecting outfits from a given set of clothing, I will ask them to permute the words of a poem.


We are a few months into our grand Southern California experiment. So, I have had the time to get my bike fixed and use it for commuting. Los Angeles is really an amazing city for biking, in the sense that the weather is perfect for it throughout the year. Through Central LA the roads are flat and wide, and although the bike paths are hit-or-miss, you’re generally not forced to ride in dangerous spots.

Here was my first attempt at a nice-looking path, which tried to pick up on that perfect diagonal through San Vicente and Venice Blvd, which even have official bike paths.

[map: the original route]

This didn’t quite work out as planned; there were a number of hills, long stretches of ugly asphalt, and some dodgy car-dominated intersections (actually San Vicente and Venice was one, below).

[photo: the intersection of San Vicente and Venice]

One day I picked the southernmost of the options presented to me by Google Maps, shown below:

[map: the southernmost route]

That one’s simple enough to memorize and reasonably short. However, just about where that red circle is (Rancho Cienega Recreation Center), there was a small problem. A very small, very spiky problem.

Wikipedia caption: “Thumbtack-like Tribulus terrestris nutlets are a hazard to bicycle tires.” Yup.

I didn’t notice it, but in that area there grew some Tribulus terrestris, also known as “puncturevine”, which drops nutlets like those shown above. By the time I noticed these things sticking in my tires, it was too late; both of my inner tubes were punctured. So I brought my bike on the metro nearby and called it a day. At least it was in my tires rather than my feet.

The following week, I brought my bike in to get new tubes. This is what it looked like pulling out the thorns from the tires.

[photo: pulling thorns out of the tire]

What was really surprising was the sheer number of thorns and fragments left in my wheels, even though I took out the major chunks the same day that I ran over them. This is my collection:

[photo: the collection of thorns and fragments]

Aside from this incident, actually the biking has been pretty good. The city has been pretty safe and clean. Here’s another path that I’ve been taking which has worked out well:

[map: the newer route]

Hopefully I won’t be looking up any other crazy plant names due to future incidents…


I am continuing to teach C++, which I am now convinced is a specialized language whose main feature is unchecked array access. However, my students discovered a cool application of said feature. I asked students to write some code to turn this elephant

[image: the original elephant bitmap]

into this grid of 4 elephants:

[image: a grid of 4 smaller elephants]

They have a primitive bitmap library, which reads a 256 pixel square grayscale bitmap into a 256-by-256 array of unsigned chars (bytes). Here is an attempted solution that several students stumbled upon:

#include "bmplib.h"

int main() {
  unsigned char src[256][256];
  unsigned char dest[256][256];
  readGSBMP("elephant.bmp", src);
  for (int i=0; i<256; i++) // row
    for (int j=0; j<256; j++) // column
      dest[i][j] = src[2*i][2*j];
  showGSBMP(dest);
}

To their surprise and mine, it generated the following picture, which was way cooler than the desired one:

[image: the surprising output]

The questions for the reader are:

  • why does the result look like this?
  • what needs to change to get the 4-elephant grid originally desired?

On C++

10Sep14

I’m teaching C++ this semester, for CSCI 103 at the University of Southern California. The students are great! However, the transition to this language is challenging. I am trying to build some tools to work around the language, with mixed success. As always, StackOverflow is very helpful, as is the DeathHandler tool on GitHub.

With C++ there are so many opportunities for a student to shoot themselves in the foot. But even with these tools it feels like trying to get them better shoes rather than getting rid of the gun.

</rant>



++

30Jul14

Hey RSS readers, and the greater internets! I’m married as of 4 days ago. Here’s a link to some pictures. We’re not on honeymoon technically, although we are both crashing on the same floor in the bride’s sister’s place, and I always thought Kitchener is pretty romantic. It was really a fantastic ceremony, an even better party, and it was better than anything I could have imagined to see all of our friends and family gathered at once. Here’s to another 40000 days!




