What a Character!

26Oct10

I am working with a colleague on an instructional programming website. One upcoming lesson is about strings and characters. Back in the day, one of the highlights for me after learning ASC and CHR in BASIC was to print out all 256 of the ASCII characters. You would see funky things with accents ö, mutant letters æ, mysterious stuff like þ, ▓, ╩. It seemed to me that some of these weirdos would make a good addition to the lesson.

I tried the Python code to print an ö:

print(chr(246))

which led to the error

UnicodeEncodeError: 'ascii' codec can't encode
character '\xf6' in position 0: ordinal not in range

Fair enough, Python, fair enough. This happened because the Linux output stream on the teaching website was piped, and therefore used the “real” ASCII, which has only 128 characters, in comparison to the 256 characters typically provided by 8-bit code pages such as CP-1252 (often called “ANSI”) on Windows or code page 437 on DOS. But these days there are larger standards such as Unicode, with more than a hundred thousand characters. Computers are smarter than they used to be, so it should be easy to switch, no?
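
To make the 128-versus-256 difference concrete, here is a minimal sketch (not the code from the teaching site) that tries to encode the same character with three codecs. The helper name try_encode is made up for this illustration.

```python
def try_encode(text, codec):
    """Return the encoded bytes, or None if the codec cannot represent the text."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None


o_umlaut = chr(246)  # the o-with-umlaut, U+00F6

# ASCII stops at code 127, so it has no slot for 246;
# CP-1252 stores it as a single byte, UTF-8 as two bytes.
for codec in ("ascii", "cp1252", "utf_8"):
    print(codec, try_encode(o_umlaut, codec))
```

The "ascii" row comes back as None, which is the same refusal that print() reported above as a UnicodeEncodeError.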

No! It took me hours to figure this out, and in the process I learned all about the various standards. The short answer is that the problem was fixed when I set the environment variable “PYTHONIOENCODING” to “utf_8”. (UTF-8 is one of several encodings of Unicode; it is the most backwards-compatible with ASCII, and the de facto standard on webpages.) Then everything was just gravy: when I ran

print(chr(246), chr(9786), chr(9787))

I got the expected result,

ö ☺ ☻

Unfortunately this came after finding lots of other online suggestions which did not work, including changing the environment variables LANG, LC_CTYPE, LC_ALL, somehow changing sys.stdout.encoding in Python, dancing a voodoo dance, and others. May Google expedite this information to anyone else in need of help! Once I got things working correctly, the following test program

import sys, locale, os
print(sys.stdout.encoding)
print(sys.stdout.isatty())
print(locale.getpreferredencoding())
print(sys.getfilesystemencoding())
print(os.environ["PYTHONIOENCODING"])
print(chr(246), chr(9786), chr(9787))

gave me the full output

utf_8
False
ANSI_X3.4-1968
ascii
utf_8
ö ☺ ☻

(Note: ANSI_X3.4-1968 is the formal name for ASCII. Confusing, no?)
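
The piped situation can be reproduced without a website. The sketch below is an assumption-laden stand-in for the real setup (which used a pipe created from PHP): it launches a child Python with its output piped and PYTHONIOENCODING set explicitly, once to "ascii" and once to "utf_8". The names CHILD_CODE and run_piped_child are invented for the example.

```python
import os
import subprocess
import sys

# The child prints its stdout encoding, then the three test characters.
CHILD_CODE = (
    "import sys; print(sys.stdout.encoding); "
    "print(chr(246), chr(9786), chr(9787))"
)


def run_piped_child(io_encoding):
    """Run a child Python with piped output and the given PYTHONIOENCODING."""
    env = dict(os.environ)
    env["PYTHONIOENCODING"] = io_encoding
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    return result.returncode, result.stdout, result.stderr


# With "ascii" the child dies with UnicodeEncodeError, as in the post.
ascii_code, ascii_out, ascii_err = run_piped_child("ascii")
print("ascii exit code:", ascii_code)

# With "utf_8" the child succeeds and the pipe carries UTF-8 bytes.
utf8_code, utf8_out, utf8_err = run_piped_child("utf_8")
print("utf_8 exit code:", utf8_code)
print("utf_8 raw bytes:", utf8_out)
```

The parent prints the raw bytes rather than the decoded text, so it runs even when the parent's own output stream is ASCII-only.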

In the process I learned all sorts of factoids, including that Bush hid the facts, and that this o-with-umlaut ö takes either 1 or 2 bytes if you save it in Notepad (depending on whether the encoding is ANSI or Unicode), whereas this Unicode happy face ☺ takes 2 or 3 bytes depending on which encoding of Unicode you use: 2 bytes in UTF-16 (which Notepad calls just “Unicode”), 3 bytes in UTF-8. I guess it is good to learn this before I start writing a lesson filled with lies from the 1990s!
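
Those byte counts can be checked from Python itself. This is a small sketch with a made-up helper, byte_length; it uses the "utf_16_le" codec so that the count excludes the 2-byte byte-order mark that the plain "utf_16" codec (and a saved Notepad file) would add.

```python
def byte_length(char, codec):
    """Return how many bytes the codec needs for char, or None if it cannot encode it."""
    try:
        return len(char.encode(codec))
    except UnicodeEncodeError:
        return None


samples = (("o-umlaut", chr(246)), ("happy face", chr(9786)))

for label, char in samples:
    for codec in ("cp1252", "utf_16_le", "utf_8"):
        print(label, codec, byte_length(char, codec))
```

The happy face has no CP-1252 ("ANSI") encoding at all, which is why that row reports None.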


17 Responses to “What a Character!”

  1. Actually the de facto standard for webpages is win-1252. It is ISO-8859-1 in the HTTP spec, but after much investigation into current practices it was changed in HTML5. This is because most pages that don’t specify a charset explicitly are using win-1252 for accented characters and the like. Browsers will usually detect most encodings correctly, but the default is win-1252. Another thing we can thank IE for. UTF-8 is better, of course, and what people *should* be using for Latin character set-based languages. For languages like Chinese other encodings are more efficient.

    And don’t forget the LE vs. BE variants of UTF-16!

    relevant link: http://whatwg.org/specs/web-apps/current-work/multipage/parsing.html#determining-the-character-encoding in particular note the tables to map encodings in 10.2.2.2

  2. I stand corrected! Also, why the hell does Unicode include a snowman? Is it just so someone could buy ☃.net?

  3. Yeah Unicode is full of weird things like that. Apparently U+2601 is a cloud character. That region in general seems to be full of weather symbols…

  4. 4 antony

    Why do you have sys.stdout.isatty()=False ?
    Are you redirecting your output to file or pipe?
    Can you print unicode character to a console without such redirection?
    I’m having the same problem and can’t figure it out so far.

    • In this case it was redirected to a pipe (which in turn was created by proc_open in php). At least for the PuTTY terminal that I use, I am under the vague impression that it’s absolutely not possible to send/receive/view characters higher than 255 (google would know better than I). I only ever actually view such characters because of a web interface whereby I see all the input and output in a browser.

  5. 6 ravi kiran

    hey dave,
    thanks for the post but i am a noob to python and working on it for my project.
    can u plz temme the exact command to set pythonioencoding to utf-8?

    • Environment variables are a part of the operating system so it depends on which operating system you use. In Linux, it also depends on the flavour of shell you use (e.g., ksh, bash, csh, etc). Look at Wikipedia or google for “set environment variable OS” where OS=windows/ksh/osx whatever or drop another line if you get stuck, saying what is your OS and shell.

  6. Which version of Python is this post based on? According to http://docs.python.org/library/functions.html#chr, for chr(i), i needs to be in the range[0..255], otherwise ValueError is raised. Should use unichr() instead?

  7. 10 Jim

    How to change the environment variable “PYTHONIOENCODING” to “utf_8”? I mean, what is the command to use, thanks.

    • @Jim: this is the same as ravi’s question I think? Check out the link there. The “environment variable” is actually something you set up in the operating system, before you even launch Python. In my Linux shell (bash) I have to do two lines

      PYTHONIOENCODING=utf_8
      export PYTHONIOENCODING

      and then once I start Python it works. For one other example of a different OS, here’s a link explaining for Windows 7 (see the bit after “To open this dialog box”). Hope something from this list helps?

      • 12 Simon

        In bash you can squash this:

        export PYTHONIOENCODING=utf_8

  8. 13 Anonymous

    Thank you so much, I was so desperate!
    This perfectly works !

  9. 14 Anonymous

    Much appreciated!

  10. 15 Steve

    Thanks so much for this!! I think that the only thing you can add to make this article complete is os.environ['PYTHONIOENCODING']='utf_8', this way you can set it in Python in an OS-independent way as opposed to relying on external bash scripts / Win env settings

    • I’m unable to get this to work, as it appears that the sys.stdout.encoding is set once and for all before my code executes. Did you get it to work?

      • In some situations it looks like one can (1) call sys.stdout.close() and then (2) reset sys.stdout = open(1, 'w', encoding='utf-8') but I also sometimes get inexplicable errors using this approach.
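
A gentler variant of that idea, sketched here as an assumption rather than something anyone in this thread confirmed, is to leave the underlying stream open and wrap its binary layer in a new text layer with io.TextIOWrapper. The function name utf8_writer is invented, and the demo writes to an in-memory buffer so the bytes can be inspected.

```python
import io


def utf8_writer(binary_stream):
    """Wrap a binary stream in a text layer that always encodes as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding="utf_8", line_buffering=True)


# Stand-in for sys.stdout.buffer, so the example has no side effects.
fake_stdout = io.BytesIO()
out = utf8_writer(fake_stdout)

print(chr(246), chr(9786), chr(9787), file=out)
out.flush()

# The buffer now holds the UTF-8 bytes for the three characters.
print(fake_stdout.getvalue())
```

In a real script the wrapper would be given sys.stdout.buffer instead of the in-memory buffer. On Python 3.7 and later, sys.stdout.reconfigure(encoding="utf-8") does the same job in one call.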

