What a Character!
I am working with a colleague on an instructional programming website. One upcoming lesson is about strings and characters. Back in the day, one of the highlights for me after learning CHR in BASIC was to print out all 256 of the ASCII characters. You would see funky things with accents ö, mutant letters æ, mysterious stuff like þ, ▓, ╩. It seemed to me that some of these weirdos would make a good addition to the lesson.
I tried the Python code to print an ö:
print(chr(246))
which led to the error
UnicodeEncodeError: 'ascii' codec can't encode character '\xf6' in position 0: ordinal not in range(128)
Fair enough, Python, fair enough. This happened because the Linux output stream on the teaching website was piped and was therefore using the “real” ASCII, which has only 128 characters. The 256 characters I remembered come from 8-bit extensions of ASCII, such as code page 437 on DOS (the one with ▓ and ╩) or CP-1252, which Windows calls “ANSI”. But these days there are larger standards such as Unicode, with more than a hundred thousand characters. Computers are smarter than they used to be, so it should be easy to switch, no?
No! It took me hours to figure this out, and in the process I learned all about the various standards. The short answer is that the problem was fixed when I changed the environment variable “PYTHONIOENCODING” to “utf_8”. (UTF-8 is one of several encodings of Unicode; it is the most backwards-compatible with ASCII, and the de facto standard on webpages.) Then everything was just gravy: when I ran
print(chr(246), chr(9786), chr(9787))
I got the expected result,
ö ☺ ☻
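For anyone who wants to reproduce both the failure and the fix without touching a server, here is a minimal sketch. It assumes Python 3 and launches a child interpreter whose stdout is a pipe, once with PYTHONIOENCODING set to ascii and once with utf_8. The helper name run_piped is made up for this illustration, and forcing ascii is only a stand-in for whatever the website's environment was actually doing.

```python
import os
import subprocess
import sys

# The same one-liner as above, run in a child interpreter.
SNIPPET = "print(chr(246), chr(9786), chr(9787))"


def run_piped(io_encoding):
    """Run SNIPPET in a child Python whose stdout is a pipe, not a terminal.

    run_piped is a hypothetical helper for this sketch. PYTHONIOENCODING
    tells the child which encoding to use for its standard streams.
    """
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    return subprocess.run(
        [sys.executable, "-c", SNIPPET],
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )


ascii_run = run_piped("ascii")
utf8_run = run_piped("utf_8")

# With ascii the child dies with a traceback; show only its last line.
print("ascii exit code:", ascii_run.returncode)
print(ascii_run.stderr.decode("ascii", "replace").splitlines()[-1])

# With utf_8 the child succeeds; print the raw bytes it wrote to the pipe.
print("utf_8 exit code:", utf8_run.returncode)
print(utf8_run.stdout)
```

The parent prints the child's output as raw bytes on purpose, so the demo itself cannot trip over the very problem it is demonstrating.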
Unfortunately this came after finding lots of other online suggestions which did not work, including changing the environment variables LANG, LC_CTYPE and LC_ALL, somehow changing sys.stdout.encoding in Python, dancing a voodoo dance, and others. May Google expedite this information to anyone else in need of help! Once I got things working correctly, the following test program
import sys, locale, os
print(sys.stdout.encoding)
print(sys.stdout.isatty())
print(locale.getpreferredencoding())
print(sys.getfilesystemencoding())
print(os.environ["PYTHONIOENCODING"])
print(chr(246), chr(9786), chr(9787))
gave me the full output
utf_8
False
ANSI_X3.4-1968
ascii
utf_8
ö ☺ ☻
(Note: ANSI_X3.4-1968 is the formal name for ASCII. Confusing, no?)
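If the pile of names is confusing, Python's codec registry can untangle it. The sketch below uses the standard codecs.lookup function to ask which codec each spelling resolves to. The list of names is just the ones that showed up in this post, and the output comes from the registry on whatever Python 3 you run it with.

```python
import codecs

# Spellings that appeared in the output above, plus the usual "UTF-8".
names = ["utf_8", "UTF-8", "ascii", "ANSI_X3.4-1968"]

# codecs.lookup returns a CodecInfo object; .name is the canonical codec name.
canonical = {name: codecs.lookup(name).name for name in names}

for name, canon in canonical.items():
    print(f"{name!r:18} -> {canon!r}")
```

So utf_8 and UTF-8 are the same codec under two spellings, and the scary-looking ANSI_X3.4-1968 is plain old ascii.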
In the process I learned all sorts of factoids, including that Bush hid the facts, and that this o-with-umlaut ö takes either 1 or 2 bytes if you save it in Notepad, depending on whether the encoding is ANSI or Unicode. This Unicode happy face ☻, on the other hand, takes 2 or 3 bytes depending on which encoding of Unicode you use: 2 bytes in UTF-16 (which Notepad calls just “Unicode”) and 3 bytes in UTF-8. I guess it is good to learn this before I start writing a lesson filled with lies from the 1990s!
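The byte counts are easy to check from Python itself. This is a rough sketch, assuming that Notepad's “ANSI” corresponds to Python's cp1252 codec and that its “Unicode” corresponds to utf-16-le; a file saved by Notepad may also start with a byte-order mark, which is not counted here. The last lines replay the “Bush hid the facts” trick by decoding those 18 ASCII bytes as if they were UTF-16.

```python
# Characters are written as escapes so the script itself stays pure ASCII.
CHARS = {"o-umlaut": "\u00f6", "happy face": "\u263b"}
ENCODINGS = ["cp1252", "utf-16-le", "utf-8"]

sizes = {}
for label, ch in CHARS.items():
    for enc in ENCODINGS:
        try:
            sizes[(label, enc)] = len(ch.encode(enc))
        except UnicodeEncodeError:
            # The character does not exist in this encoding at all.
            sizes[(label, enc)] = None

for (label, enc), size in sizes.items():
    print(f"{label:10} {enc:10} {size}")

# 18 bytes of plain ASCII, misread as 9 two-byte UTF-16 code units.
misread = b"Bush hid the facts".decode("utf-16-le")
print(len(misread), ascii(misread))
```

The None entry is the interesting one: the happy face simply has no slot in CP-1252, which is why it needs a Unicode encoding in the first place.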