PyZine
 


Issue 8 - Revision 4  /   October 20, 2005 


 


A Crash Course in Character Encoding

- - - - - - - - - - - -

By Michael Foord  | October 18, 2005

Character encodings and Unicode can seem like a black art, fraught with complexity and voodoo-like machinations. To anyone who has only ever used ordinary strings, the mere mention of the word "Unicode" can provoke a mind numbing headache.

Unfortunately, it's no longer possible to bury your head in the sand. Computer users span the globe, with participants from every culture, speaking many different languages. If your program can only cope with ASCII, or more likely has no concept of different character sets, you automatically exclude many people from using your programs. Worse, Unicode issues can cause inexplicable problems even for those who try never to think about them. [1]

Luckily, with only a very basic understanding of character encodings and the Unicode datatype, you can modify your programs to work on every corner of the globe. Despite rumors to the contrary, Python makes Unicode quite easy.

This article is a crash course in handling character encodings with Python. While it's not intended to be a complex technical reference on the subject, it should provide you with enough information to proudly proclaim "Unicode spoken here" in the programs you create.

What Is a Character Encoding?

The sad day has come at last: the comforting myth of "plain text" must be laid to rest once and for all. [2] There's no such thing as plain text, and the truth is, there never was. When you talk about plain text [3], or even an ordinary string, you probably mean text encoded with the ASCII encoding. If you live in Europe, you possibly mean the Latin-1 encoding [4].

Text is stored as numbers, and each number corresponds to a different character. An encoding can be thought of as "a scheme that defines what number relates to which character." (This isn't the place to go into a discussion of what a character is [5]. Unicode buffs call them glyphs or code points, depending on whether it's the visual representation or the logical unit, respectively. Glyphs and code points are not always the same thing. For example, the positioning of two characters next to each other (such as "f" and "i", or "f" and "l") can change the glyph, or there may not even be a one-to-one relationship between characters and symbols.)

The encoding that even the most xenophobic programmer has heard of is ASCII [6]. ASCII is a 7-bit encoding, yielding 128 possible characters. ASCII includes the lowercase (English language) alphabet (a-z), the uppercase alphabet (A-Z), digits (0-9), some symbols, and a few control codes (such as carriage return and line feed). For example, in the ASCII encoding, lowercase "a" is represented by the number 97 and the number 32 represents the Space character.
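
The mapping between characters and numbers can be checked directly with the built-in ord() and chr() functions. This is only a small sketch; the sample string is an arbitrary choice.

```python
# ASCII maps each character to a number below 128.
assert ord('a') == 97        # lowercase "a"
assert ord(' ') == 32        # the Space character
assert chr(97) == 'a'        # and back again

# Every character in this sample falls inside the 7-bit range.
sample = 'Hello, world'
codes = [ord(ch) for ch in sample]
assert max(codes) < 128
```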

ASCII cannot include any of the "funny" accented characters used by Europeans, so 8-bit encodings, which have 256 total possible characters, have been added to represent the accented characters. 8-bit encodings can represent more characters, yet still fit one character per byte. A few examples of 8-bit encodings include Latin-1, IBM500, Windows-1252, and Macroman.

Unfortunately, having multiple encodings is where problems start. You've almost certainly had the experience of opening a web page or a text file and seeing weird accented characters or even funny symbols where they shouldn't be. At times, programs cannot deduce the correct encoding and choose the wrong set of characters to display the otherwise numeric data. Worse, some alphabets cannot be fully represented in 256 characters (for example, think of the Asian-language alphabets) and have to use more than one byte for each character.

There is a clear rule: if you don't know the encoding of the text, then you don't know how to display it.
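
To see why, here is a minimal sketch that reads the same single byte through two different codecs. The byte value is an arbitrary choice, and the byte string is written with a b prefix, which assumes Python 2.6 or later.

```python
raw = b'\xe9'                     # one byte, value 233

as_latin1 = raw.decode('latin1')  # Latin-1 reads it as e with an acute accent
as_cp437 = raw.decode('cp437')    # the old DOS code page reads a Greek theta

assert as_latin1 == u'\xe9'
assert as_cp437 == u'\u0398'
assert as_latin1 != as_cp437      # same byte, different characters
```

Nothing in the byte itself says which reading is the right one; that information has to come from somewhere else.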

Unicode

Thankfully, you don't need to be aware of every possible character set or encoding. Unicode is the standard [7] that assigns a single number, called a code point, to every possible character [8].

Python supports Unicode by having a special datatype called the Unicode string. Once your text is stored as Unicode, it's trivial to convert it to any of the many character encodings that Python supports [9].

To be clear, Unicode is an internal representation only. If you want to save Unicode, you must choose an encoding. Conversely, if you have text stored in a file, you need to know what encoding to decode it with. Luckily, there are several character encodings that encompass the whole of the Unicode standard, which you'll see in a few paragraphs.

If this sounds like mumbo jumbo, don't worry. Let's look at what it means to encode strings and decode text.

Decoding and Encoding

Python has a datatype known as "the string." Often referred to as string literals or ordinary strings, it's probably more accurate to call the datatype a "byte string," since it's a sequence of bytes, just binary data.

Because text is often capable of being represented as a simple sequence of bytes, ordinary string objects have many methods that treat sequences of bytes as text, including strip(), lower(), isdigit(), and more.

In some sense, it might cause less confusion if you couldn't treat byte strings as text. Some character encodings use more than one byte (or even a variable number of bytes) to represent each character. Treating these as byte strings simply doesn't work. To properly handle text with an arbitrary character encoding, you need to turn it into a Unicode string first.
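
As a rough illustration of what goes wrong, the sketch below slices a UTF-8 byte string in the middle of a two-byte character. The sample word and variable names are only examples, and the b prefix assumes Python 2.6 or later.

```python
text = u'caf\xe9'                 # four characters
utf8_bytes = text.encode('utf8')  # five bytes: the last character takes two

assert len(text) == 4
assert len(utf8_bytes) == 5

# Slicing the bytes can cut a character in half...
broken = utf8_bytes[:4]
try:
    broken.decode('utf8')
    cut_cleanly = True
except UnicodeDecodeError:
    cut_cleanly = False
assert not cut_cleanly

# ...whereas slicing the Unicode string works on whole characters.
assert text[:4].encode('utf8') == utf8_bytes
```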

The process of transforming a byte string into a Unicode string is called decoding.

Decoding

When you read text from a file, the data is a byte string [10], just a sequence of numbers. If you want to turn that byte string into a Unicode string (a sequence of characters), you need to decode the byte string into Unicode. And to do that, you need to specify what encoding the string was encoded with.

There are basically two ways to turn a byte string into a Unicode string: the built-in unicode() function or the string method decode(). Both routines take the same arguments: the encoding used to create the string and an optional "errors" argument. [11] (Discussion of the "errors" argument is beyond the scope of this article; for now, use strict so that any problems are immediately apparent.)

Assuming you know the encoding used to create the string, turning it into a Unicode string is easy:

>>> aString = 'Some non-ascii characters - $5 < £5. äêïøü'
>>> uni_string_1 = unicode(aString, 'latin1')
>>> uni_string_2 = aString.decode('latin1')
>>> uni_string_1 == uni_string_2
True

The byte string aString is created using the Latin-1 encoding. It's then converted into a Unicode string twice, once with unicode() and once with decode(), and the two results compare equal.

As another example, here's a byte string that's unreadable until you properly decode it:

>>> aString = '\xff\xfeS\x00o\x00m\x00e\x00' + \
... ' \x00n\x00o\x00n\x00-\x00a\x00s\x00c\x00i\x00i\x00' + \
... ' \x00c\x00h\x00a\x00r\x00a\x00c\x00t\x00e\x00r\x00s\x00' + \
... ' \x00-\x00 \x00$\x005\x00 \x00<\x00 \x00\xa3\x005\x00.\x00' + \
... ' \x00\xe4\x00\xea\x00\xef\x00\xf8\x00\xfc\x00'
>>> print aString.decode('utf16')
Some non-ascii characters - $5 < £5. äêïøü

aString is a sequence of bytes that uses the utf16 encoding, which represents a character using two bytes. It makes little sense until the byte string is decoded using the proper encoding.

The UnicodeDecodeError

Let's see what happens when trying to decode a string into a Unicode string with the wrong encoding:

>>> aString = 'Some non-ascii characters - $5 < £5. äêïøü'
>>> aString.decode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 
0xa3 in position 33: ordinal not in range(128)
>>> 

When the ASCII encoder/decoder (called a "codec") reaches position 33, it finds a byte that doesn't make sense. Byte 0xa3 (the Latin-1 "£") doesn't represent any character in the ASCII character set, so decode() raises a UnicodeDecodeError.

If you begin working regularly in Unicode, UnicodeDecodeError will be a constant and frustrating companion. The error means either the text isn't in the encoding you think it is or the text is corrupt.
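
When the error does turn up, the exception object itself says where decoding stopped. The following is a minimal sketch, assuming Python 2.6 or later for the b prefix and the "except ... as" syntax; the variable names are only illustrative.

```python
data = b'Some non-ascii characters - $5 < \xa35.'   # Latin-1 bytes

try:
    data.decode('ascii')
except UnicodeDecodeError as exc:
    bad_position = exc.start                   # index of the first bad byte
    bad_bytes = exc.object[exc.start:exc.end]  # the offending byte(s)
    codec_name = exc.encoding                  # the codec that gave up

# Decoding with the encoding the bytes were written in succeeds.
text = data.decode('latin1')
```

Inspecting start, end, and object is often quicker than counting characters by hand when you need to find out which byte upset the codec.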

You can also see this error in surprising places, even when you aren't explicitly doing any decoding at all, because there are several situations that force Python to implicitly encode or decode strings. Understanding encoding helps you recognize these situations and not get caught unaware.

Encoding

Encoding a string is the opposite of decoding one (funny that!). Encoding turns a Unicode string into a byte string. This might be for displaying the text, writing a file, or transmitting the text via a socket.

During the encoding process, each Unicode "code point" (character) is turned into bytes using the appropriate codec. Depending on the encoding used, a character might take 1, 2, 3, or even 4 bytes.
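
As a small illustration, the sketch below encodes the same character with several codecs and compares the resulting bytes. The characters chosen are arbitrary, and the b prefix assumes Python 2.6 or later.

```python
e_acute = u'\xe9'

# One character, four different byte representations.
assert e_acute.encode('latin1') == b'\xe9'         # 1 byte
assert e_acute.encode('utf8') == b'\xc3\xa9'       # 2 bytes
assert e_acute.encode('utf-16-le') == b'\xe9\x00'  # 2 bytes, low byte first
assert e_acute.encode('utf-16-be') == b'\x00\xe9'  # 2 bytes, high byte first

# UTF-8 needs more bytes the further a character is from ASCII.
samples = [u'a', u'\xe9', u'\u20ac', u'\U00010348']
utf8_sizes = [len(ch.encode('utf8')) for ch in samples]
assert utf8_sizes == [1, 2, 3, 4]
```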

This example takes a Latin-1 encoded string and re-encodes it using the UTF-8 encoding:

aString = 'Some non-ascii characters - $5 < £5. äêïøü' # Latin-1 byte string
uni_string_1 = unicode(aString, 'latin1') # Unicode string
anotherString = uni_string_1.encode('utf8')  # UTF-8 byte string

The UnicodeEncodeError

The brother of UnicodeDecodeError is UnicodeEncodeError. A UnicodeDecodeError is raised when bytes in a string have no meaning in the character set used to decode them. Conversely, UnicodeEncodeError is raised when the character set doesn't have any way to represent some characters in your Unicode string:

>>> aString = 'Non-ascii characters - $5 < £5. äêïøü' # Latin-1 byte string
>>> uni_string_1 = unicode(aString, 'latin1') # Unicode string
>>> print uni_string_1.encode('ascii') # Try to convert to ASCII byte string

Traceback (most recent call last):
  File "<pyshell#2>", line 1, in -toplevel-
    print uni_string_1.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in 
position 28: ordinal not in range(128)
>>> 

The Default Encoding

Every operation that combines ordinary strings and Unicode strings yields a Unicode string. In that operation, the ordinary strings must first be turned into Unicode, and because you haven't specified what encoding the strings are in, Python uses the default encoding. (The alternative is to not be able to mix Unicode and ordinary strings, but there are many times when that would be an extremely tedious rule.)

The default encoding is established in a file called site.py when the interpreter starts up. You can tell what it is (but not change it) [12] through the function sys.getdefaultencoding:

>>> import sys
>>> sys.getdefaultencoding()
'ascii'

So, if you have an ordinary string with characters not in the default encoding, and Python has to coerce this string into Unicode, it goes badly:

>>> aString = 'Some non-ascii characters - $5 < £5. äêïøü'
>>> uni_string = u'A Unicode string.'
>>> print aString + uni_string

Traceback (most recent call last):
  File "<pyshell#9>", line 1, in -toplevel-
    print aString + uni_string
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa3 in 
position 33: ordinal not in range(128)
>>> 

The answer is to know that your string has to be coerced into Unicode and explicitly decode it first:

>>> aString = 'Some non-ascii characters - $5 != £5. äêïøü'
>>> uni_string1 = u'A Unicode string. '
>>> uni_string2 = aString.decode('latin1')
>>> print uni_string1 + uni_string2
A Unicode string. Some non-ascii characters - $5 != £5. äêïøü

Ordinary strings also have the encode() method. When called, Python encodes the string with the encoding you specify. To do this, it must first convert it to Unicode, again using the default encoding. If your string is actually in another encoding, this may fail. Hence, it's better to explicitly convert the string to Unicode, and then call the encode() method to produce the string you are after.
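
The two explicit steps fit comfortably in a one-line helper. This is a minimal sketch: the function name transcode is hypothetical, not part of any library, and the b prefix assumes Python 2.6 or later.

```python
def transcode(data, source_encoding, target_encoding):
    """Re-encode a byte string: decode explicitly, then encode."""
    return data.decode(source_encoding).encode(target_encoding)

latin1_bytes = b'\xa35'    # Latin-1: pound sign followed by "5"
utf8_bytes = transcode(latin1_bytes, 'latin1', 'utf8')
```

Because both encodings are named, the default encoding never gets involved.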

Printing Unicode

So far, you've seen how ordinary strings can be coerced into Unicode using the default encoding. The opposite happens when you print Unicode strings using the print statement. The print statement outputs on sys.stdout. This is a "stream," a file-like object. As of Python 2.3, the manual has this to say about the encoding attribute of file objects:

The encoding that this file uses. When Unicode strings are
written to a file, they will be converted to byte strings
using this encoding. In addition, when the file is connected
to a terminal, the attribute gives the encoding that the
terminal is likely to use (that information might be
incorrect if the user has misconfigured the terminal). The
attribute is read-only and may not be present on all
file-like objects. It may also be None, in which case the
file uses the system default encoding for converting Unicode
strings.

You can check the attribute from the interactive prompt:

>>> import sys
>>> print sys.stdout.encoding
cp1252
>>> 

A natural result of this is that if your Unicode string contains characters not supported by sys.stdout.encoding, it raises a UnicodeEncodeError when you print:

>>> print u'\N{Greek Small Letter Pi}'
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "C:\Python24\lib\encodings\cp850.py", line 18, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u03c0' in position
 0: character maps to <undefined>
>>>

Hence, writing any Unicode string to a file-like object involves an implicit encoding. As usual, it's far better to do the encoding explicitly.

A normal file object (as opposed to a "stream") will probably have an encoding attribute of None, meaning that the system default encoding is used if you attempt to write any Unicode strings to it.
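
One way to keep the default encoding out of the picture is to encode the text yourself and write the resulting bytes to a file opened in binary mode. The sketch below assumes UTF-8 as the target encoding and uses a temporary file purely for illustration.

```python
import os
import tempfile

text = u'\xa35 \u03c0'                # pound sign, "5", space, Greek pi

fd, path = tempfile.mkstemp()
os.close(fd)
try:
    outfile = open(path, 'wb')        # binary mode: we write bytes
    outfile.write(text.encode('utf8'))  # the encoding is chosen explicitly
    outfile.close()

    infile = open(path, 'rb')
    raw = infile.read()
    infile.close()
finally:
    os.remove(path)

# Reading back means decoding with the same encoding.
round_tripped = raw.decode('utf8')
```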

Useful Encodings

It's obvious that the many different languages and alphabets mean many possible character encodings. So how do you take this into account in Python?

There are a few possible approaches:

  • You can attempt to autodetect the encoding in use
  • You can allow (or force) the user to specify the encoding
  • You can use an encoding that covers the whole Unicode standard

The first option is covered shortly in the section Guessing the Encoding.

Here, let's cover the last option, a few useful encodings that cover the whole Unicode standard. Using such "global" encodings ensures that you can represent any Unicode string with these encodings without risk of the (dreaded) UnicodeEncodeError.

There are several encodings to choose from: UTF7, UTF8, UTF16, and UTF32. As you might guess, UTF7 is a 7-bit encoding, which makes it safe to transfer across systems designed to cope with 7-bit encodings, like email systems. UTF8 uses a sequence of 8-bit units to represent characters; UTF16 uses a sequence of 16-bit units; UTF32 is rare and not supported by Python as a standard encoding [13].

Where most of your text is within the ASCII character set, UTF8 can represent it using one byte per character. UTF8 uses the ASCII characters, so every valid ASCII character is also a valid UTF8 character. For characters that aren't in the ASCII character set, UTF8 uses a lead byte followed by one to three continuation bytes. Accented Latin, Greek, and Cyrillic letters take two bytes; most other characters, including Chinese, Japanese, and Korean text, take three; the rarest take four. Hence, text that is ASCII uses one byte per character, while mainly non-ASCII text needs two or three bytes for most characters.

UTF16 uses two bytes for almost every character. [14] This means UTF16 is more efficient at storing text made up mainly of characters that need three bytes in UTF8, and UTF8 is more efficient where the text is mainly ASCII characters. Be warned, though, that some applications (such as databases) may reserve three bytes per character if you specify UTF8. This means you lose the advantage of memory efficiency with UTF8.
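
The trade-off is easy to measure. The sketch below compares encoded sizes for three short samples; the sample strings and the sizes() helper are hypothetical, and 'utf-16-le' is used so that no byte order mark is counted.

```python
def sizes(text):
    """Return (bytes in UTF-8, bytes in UTF-16) for a Unicode string."""
    return len(text.encode('utf8')), len(text.encode('utf-16-le'))

english = u'plain ASCII text'           # 16 ASCII characters
greek = u'\u03b1\u03b2\u03b3\u03b4'     # four Greek letters
japanese = u'\u65e5\u672c\u8a9e'        # three kanji

assert sizes(english) == (16, 32)       # UTF-8 is half the size
assert sizes(greek) == (8, 8)           # a tie
assert sizes(japanese) == (9, 6)        # UTF-16 is smaller
```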

Another great advantage of UTF8 (and other 8-bit western encodings) is that an early part of the text can use ASCII characters to specify the encoding used. As ASCII is the default encoding on many systems and protocols, this is very convenient. For this reason, UTF8 is the only "full Unicode" encoding recognized by the HTTP protocol. This allows the encoding to be specified in a meta tag near the top of an HTML page. The section Encoding and the Web looks more closely at this.

BOM: Encoding Signatures

Both UTF-8 and UTF-16 have a technique that marks files stored in each respective encoding. This encoding signature is often referred to as the byte order mark, or BOM, and is a 2- or 3-byte marker placed at the start of a file.

For UTF-8, the phrase "byte order" has no meaning, so encoding signature is a better name. For UTF-16, the BOM not only marks the text as encoded with UTF-16, but also tells you which byte order was used to create the encoding.

Using the BOM can make things much easier: you can automatically detect text as being UTF-8 or UTF-16 encoded. The trouble is that not all applications use the BOM. Because it's application specific, Python leaves it to the programmer (you) to deal with it.

When reading data, the correct thing to do is to detect and remove the BOM before using the data. If your application writes files, it should remember if a BOM was used and preserve it when writing. [15]

The various BOMs are available to you in the codecs library module. Each BOM is a byte string rather than a Unicode object because you need to use it before you decode your strings.
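
For reference, this short sketch prints nothing but checks the values of the constants involved, and shows that the generic 'utf-16' codec writes a BOM of its own when encoding. The b prefix assumes Python 2.6 or later.

```python
import codecs

assert codecs.BOM_UTF8 == b'\xef\xbb\xbf'    # 3-byte UTF-8 signature
assert codecs.BOM_UTF16_LE == b'\xff\xfe'    # little endian
assert codecs.BOM_UTF16_BE == b'\xfe\xff'    # big endian
# BOM_UTF16 is whichever of the two matches this machine.
assert codecs.BOM_UTF16 in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# The generic codec prepends the native BOM and strips it on decode.
encoded = u'hi'.encode('utf-16')
has_bom = encoded.startswith(codecs.BOM_UTF16)
decoded = encoded.decode('utf-16')
```

The explicitly ordered codecs ('utf-16-le' and 'utf-16-be') add no BOM, which matters when you prepend one by hand.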

The following code detects the appropriate BOM for UTF-8 or UTF-16. It removes the BOM and decodes the string appropriately. It then re-encodes the string and puts the BOM back:

import codecs, sys

# filename is assumed to hold the path of the file to read
bomdict = { codecs.BOM_UTF8 : 'UTF8',
    codecs.BOM_UTF16_BE : 'UTF-16BE',
    codecs.BOM_UTF16_LE : 'UTF-16LE' }

the_text = open(filename, 'rb').read()
for bom, encoding in bomdict.items():
    if the_text.startswith(bom):
        the_text = the_text[len(bom):]
        break
else:
    print 'No BOM or BOM not recognised.'
    sys.exit()
unicode_text = the_text.decode(encoding)

if encoding.startswith('UTF-16'):
    # the 'UTF-16' codec writes its own BOM, in this machine's byte order
    output_text = unicode_text.encode('UTF-16')
else:
    output_text = bom + unicode_text.encode(encoding)

The codecs module has BOM values for UTF-8 and UTF-16 (there is no constant for UTF-7). UTF-16 encoding is done differently on "big endian" computers than on "little endian" [16] ones. Here, the code treats them differently when decoding the string. When encoding, the code just uses the UTF16 encoding, which will be the appropriate form for the 'endianness' of your computer. That codec adds the matching BOM itself, so the code must not prepend a second one.

After adding some additional methods for guessing the encoding, this code can become part of a general way of handling text.

Guessing the Encoding

Unicode is difficult because you often don't know the encoding used to produce the text you're given. However, in many situations, it ought to be obvious what encoding to try. For example, on a Mac, you'd probably guess macroman; in Korea, you'd guess a Korean encoding, and so on.

The way to access the information about the "normal" encoding used on the current computer is through the locale module. Before using locale to retrieve the information you want, you need to call locale.setlocale(locale.LC_ALL, ''). Because of the sensitivity of the underlying C locale module on some platforms, this should only be done once.

It can also change the default encodings for your program, which might have been unset before. This may break expected behavior for the code that's calling your routines. For these reasons, if you're writing a library or module rather than an application, don't call locale.setlocale(locale.LC_ALL, '') yourself. Instead, make it clear that the application calling your code needs to make the call if it wants you to be able to handle encodings.

The following code is adapted from the io.py file in docutils. It expects that locale.setlocale(locale.LC_ALL, '') has already been called. It uses various sources of information from the locale module to build a list of encodings to try. Not all of these exist on all platforms, so they're wrapped in try: ... except: blocks. utf-8 and latin-1 are added to the list as they're very common encodings. Once it has created the list, the code tries each encoding in turn until one succeeds, that is, until one decodes the text without raising an error. It returns the decoded text and the encoding used. If it fails to find an encoding that works, it raises a UnicodeError:

# adapted from io.py
# in the docutils extension module
# see http://docutils.sourceforge.net

import locale
def guess_encoding(data):
    """
    Given a byte string, attempt to decode it.
    Tries the standard 'UTF8' and 'latin-1' encodings,
    Plus several gathered from locale information.

    The calling program *must* first call 
        locale.setlocale(locale.LC_ALL, '')

    If successful it returns 
        (decoded_unicode, successful_encoding)
    If unsuccessful it raises a ``UnicodeError``
    """
    successful_encoding = None
    # we make 'utf-8' the first encoding
    encodings = ['utf-8']
    #
    # next we add anything we can learn from the locale
    try:
        encodings.append(locale.nl_langinfo(locale.CODESET))
    except AttributeError:
        pass
    try:
        encodings.append(locale.getlocale()[1])
    except (AttributeError, IndexError):
        pass
    try:
        encodings.append(locale.getdefaultlocale()[1])
    except (AttributeError, IndexError):
        pass
    #
    # we try 'latin-1' last
    encodings.append('latin-1')
    for enc in encodings:
        # some of the locale calls 
        # may have returned None
        if not enc:
            continue
        try:
            decoded = unicode(data, enc)
            successful_encoding = enc

        except (UnicodeError, LookupError):
            pass
        else:
            break
    if not successful_encoding:
        raise UnicodeError(
            'Unable to decode input data.  Tried the following encodings: %s.'
            % ', '.join([repr(enc) for enc in encodings if enc]))
    else:
        return (decoded, successful_encoding)

Caution!

This approach isn't foolproof. It relies on a codec failing to decode text with any encoding other than the right one. This isn't always the case.

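For example, it can be possible to decode text encoded with the cp1252 [17] encoding using the latin1 codec. If this happens, you may get garbage in your Unicode string. In fact, the latin1 codec assigns a character to every possible byte value, so it never fails, which is why guess_encoding tries it last.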
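
A short sketch of the problem, using the euro sign mentioned in the footnote. The sample bytes are made up for illustration, and the b prefix assumes Python 2.6 or later.

```python
raw = b'\x80100'                  # cp1252 bytes: euro sign, then "100"

right = raw.decode('cp1252')      # what the author of the text meant
wrong = raw.decode('latin1')      # no error, but not the same text

assert right == u'\u20ac100'      # EURO SIGN followed by "100"
assert wrong == u'\x80100'        # an invisible control character instead
assert right != wrong

# latin1 decodes all 256 byte values, so it can never raise an error.
every_byte = bytes(bytearray(range(256)))
assert len(every_byte.decode('latin1')) == 256
```

Neither call raises an exception, so only a human (or a smarter heuristic) can tell that the second result is garbage.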

If you combine this with the code that checks for the Unicode signature, you have a good general purpose technique for guessing the encoding of text [18]:

# uses the guess_encoding function from above
import codecs
import locale

bomdict = { codecs.BOM_UTF8 : 'UTF8',
            codecs.BOM_UTF16_BE : 'UTF-16BE',
            codecs.BOM_UTF16_LE : 'UTF-16LE' }

locale.setlocale(locale.LC_ALL, '')     # set the locale
# the_text is assumed to be a byte string, for example read from a file
# check if there is a Unicode signature
for bom, encoding in bomdict.items():
    if the_text.startswith(bom):
        the_text = the_text[len(bom):]
        break
else:
    bom = None
    encoding = None

if encoding is None:    # there was no BOM
    try:
        unicode_text, encoding = guess_encoding(the_text)
    except UnicodeError:
        print "Sorry - we can't work out the encoding."
        raise
else:
    # we found a BOM so we know the encoding
    unicode_text = the_text.decode(encoding)
# now you have your Unicode text.. and can do with it what you will

# now we want to re-encode it to a byte string
# so that we can write it back out
# we will reuse the original encoding, and preserve any BOM
if bom is not None and encoding.startswith('UTF-16'):
    # the 'UTF-16' codec uses the right 'endian-ness' for this machine
    # and writes the matching BOM itself
    byte_string = unicode_text.encode('UTF-16')
else:
    byte_string = unicode_text.encode(encoding)
    if bom is not None:
        byte_string = bom + byte_string
# now we have the text encoded as a byte string, ready to be saved to a file

Encoding and the Web

Another tricky situation is receiving form submissions from the internet. How do you know what character set was used to make the submission?

Normally, the browser will honor the character set the page is sent with [19]. The encoding used to send the page will be the encoding used to send the form submission. What if the user is Korean [20] though? If you send your page with a Western European encoding, like Latin-1, the user is likely to use a different character set when filling in the form. The browser may even change the encoding automatically.

There is a reliable way to tell what encoding is being used. It works with at least Mozilla and Internet Explorer, and possibly others. If you include a hidden form field named _charset_, the browser will fill in the value with the encoding used. Note the single underscores around "charset," not double underscores! The field should look like <input type="hidden" name="_charset_" />.

If you want to use an encoding capable of displaying all of the character sets, the choice is simple. For use on the web, the only one supported is UTF-8. If your page is sent using UTF-8, you will be able to easily receive form submissions from users with any native character set.

The following code snippet demonstrates how to receive a form submission (using CGI) and decode specific values. The code assumes that the page was sent using a UTF-8 encoding and the '_charset_' field was included in the form:

import cgi
theform = cgi.FieldStorage()
charset = theform.getvalue('_charset_', 'UTF8')
value1_unicode = theform.getvalue('value1', '').decode(charset)
value2_unicode = theform.getvalue('value2', '').decode(charset)
value3_unicode = theform.getvalue('value3', '').decode(charset)
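
The decoding step can also be separated from the CGI machinery. The sketch below is an assumption-laden illustration, not part of the cgi module: decode_form_values is a hypothetical helper that takes a plain dictionary of byte-string values and falls back to UTF-8 when the reported charset is missing or unknown to Python.

```python
import codecs

def decode_form_values(raw_fields, default_charset='utf-8'):
    """Decode every byte-string value in raw_fields to Unicode."""
    try:
        charset = raw_fields.get('_charset_', b'').decode('ascii')
        charset = charset or default_charset
        codecs.lookup(charset)          # raises LookupError if unknown
    except (UnicodeError, LookupError):
        charset = default_charset
    decoded = {}
    for name, value in raw_fields.items():
        if name != '_charset_':
            decoded[name] = value.decode(charset)
    return charset, decoded

# A submission that names its charset, and one with a bogus name.
named = decode_form_values({'_charset_': b'ISO-8859-1', 'value1': b'\xa35'})
bogus = decode_form_values({'_charset_': b'no-such-charset',
                            'value1': b'\xc2\xa3'})
```

Values that don't match the chosen charset still raise UnicodeDecodeError, which is usually what you want while debugging.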

Encodings in Source Code

Python source code files are text. If you want to include non-ASCII characters in your source code, you need to tell the interpreter what encoding is being used. If you don't specify the encoding, you (currently) incur a "deprecated" warning, but in future versions of Python, your code won't run at all. (For full details, see Encoding Declarations.)

The way to specify an encoding is to include a comment line like the following as the first or second line of your Python script:

# -*- coding: <encoding-name> -*-

The interpreter will also recognize the Unicode Signature for UTF-8, without you having to explicitly declare the encoding.

The main practical difference this makes is that Unicode strings in your script are decoded using the right encoding. That is to say that statements like uni_string = u"Some non-ascii characters - $5 < £5. äêïøü" have the byte string in the source code converted into a Unicode object, using the encoding you specified at the start of the file.

Bonus Encodings

If you check out the standard encodings page you might notice some encodings that aren't normal character set encodings. Instead they use the codec machinery to encode byte strings into different formats. Two of my personal favourites are the uu encoding and the zip encoding.

uu is used on Usenet newsgroups. It's a way of encoding binary data in ASCII characters. Using the uu codec, encoding to or from uu is a single-line operation:

binary_data = open(filename, 'rb').read()
uu_data = binary_data.encode('uu')
orig_data = uu_data.decode('uu')
assert orig_data == binary_data

There are two differences between this and the normal Unicode processes you've seen so far. First, all of the output of byte_string.encode('uu') is another byte string, not a Unicode string. The other difference is that normally calling encode() on a byte string causes an "implicit decode" to happen. If you called byte_string.encode('utf8'), then byte_string would first be decoded into Unicode using the default encoding. With binary data that wouldn't make sense, and the intermediate step of decoding into Unicode doesn't happen.

Another useful encoding is zip. This can compress (encode) or decompress (decode) strings of arbitrary data using the Zip algorithm. If you want to add data compression to your program, perhaps for storage or transmission, the zip encoding makes it ridiculously easy:

zipped_data = arbitrary_data.encode('zip')
arbitrary_data = zipped_data.decode('zip')
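
The same round trip can be written with the module-level codecs.encode() and codecs.decode() functions. This sketch assumes the codec is reachable under the name 'zlib' and compares its output with the zlib module directly; the sample data is made up, and the b prefix assumes Python 2.6 or later.

```python
import codecs
import zlib

arbitrary_data = b'spam and eggs ' * 100       # repetitive, so it shrinks well

zipped_data = codecs.encode(arbitrary_data, 'zlib')   # compress
restored = codecs.decode(zipped_data, 'zlib')         # decompress

assert restored == arbitrary_data
assert len(zipped_data) < len(arbitrary_data)
# The codec produces an ordinary zlib stream.
assert zlib.decompress(zipped_data) == arbitrary_data
```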

Summary

A quick summary of some of the main points about Unicode strings in Python.

  • Normal (byte) strings are always encoded with some character encoding.

  • To get Unicode strings you need to decode the string, using the right encoding.

  • Unicode is only an internal representation.

  • To display or save the string you'll need to encode it into a byte string, using an encoding.

  • In some situations, if you don't specify an encoding, Python uses a default one.

  • Implicit conversion can happen when:

    • Adding a byte string to a Unicode string
    • Printing a Unicode string
    • Writing a Unicode string to a file
    • Using the encode() or decode() string methods, or the unicode function, without specifying an encoding

  • If you encode or decode with the wrong encoding you're likely to see UnicodeDecodeError or UnicodeEncodeError.

  • UTF8 and UTF16 are useful encodings that can represent all the characters in the Unicode specification.

  • A Unicode signature, which can also act as a byte order mark (BOM), can help you recognize which encoding was used to create the text.

Footnotes
[1] Entering the '£' sign (a non-ascii character) in several open source tools can illustrate this.
[2] See this Joel on Software article about Unicode - http://www.joelonsoftware.com/articles/Unicode.html
[3] As in 'reading or writing a plain text file'.
[4] Or iso8859-1, or cp819, as it is variously known.
[5] For a discussion of glyphs, code points, etc. - http://www.cs.tut.fi/~jkorpela/chars.html#glyph or http://www.debian.org/doc/manuals/intro-i18n/ch-coding.en.html
[6] American Standard Code for Information Interchange - http://www.localcolorart.com/search/encyclopedia/ASCII/
[7] The canonical reference is http://www.Unicode.org/
[8] Well, nearly all of them anyway.
[9] See http://docs.python.org/lib/standard-encodings.html for all the available encodings.
[10] Using standard read methods, that is. There are stream readers that will do the decoding on the fly - but we'll leave those for the moment.
[11] In fact 'encoding' is optional too. If you leave it out, Python will use the default encoding.
[12] Guido alone knows why... - Changing the default encoding is a bad idea anyway.
[13] There is support for UTF32 encoding through iconv - see http://cjkpython.i18n.org/ and http://www.gnu.org/software/libiconv/
[14] Unicode has in fact now been expanded to more than 65536 characters. See PEP261 for how Python supports this.
[15] Note that when joining UTF files (manually or programmatically) it is generally considered to be an error to leave a BOM in the middle of the file.
[16] See http://www.cs.umass.edu/~verts/cs32/endian.html for a reference about endianness.
[17] cp1252 is the default Windows encoding. It's similar to latin1, but has some different symbols - including a symbol for the euro which is missing from latin1.
[18] More specific methods exist for (for example) telling the difference between similar oriental encodings. That's a specialist subject though.
[19] You do always specify a character set when sending web pages, don't you? Your pages can't be valid XHTML without one.
[20] Or Chinese, or Japanese; no reason why I picked on the Korean.

Michael Foord


Reproduction of material from any of PyZine's pages without prior written permission is strictly prohibited. Copyright 2003 - 2005 PyZine Zope/Plone hosting by Nidelven IT