Tuesday, June 4, 2013

Unicode & Python 2.x: a Love-Hate Relationship

I'm not the first person to rant about text encodings and Unicode. Joel Spolsky has a wonderful blog post about it titled "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)", which I recommend everyone read, and reread as necessary. My goal is to share my experience with handling text encodings and Unicode in the hope that there is some wisdom in it that can help others. The bulk of my experience is with Python, and the sample code here is in Python 2, but the principles apply to other languages and technologies.

Python 2 has two different types for dealing with text, the "str" type and the "unicode" type. Both inherit from the type "basestring", but they are very different. The "str" type is for encoded text data: strings of bytes in a particular encoding such as utf_8, latin1, cp1252, ascii, utf_16 and so on. The "unicode" type is for decoded text data in the form of Unicode code points. A character encoding maps its byte values to Unicode code points, so you can take a string encoded in that encoding, decode it to code points, and work with Unicode instead of encoded text.

Okay, big deal, how is this useful? Say you have text from a user in the USA coming in as utf_8 or cp1252, text from someone in Western Europe encoded in latin1, and text from a user in East Asia submitted as utf_16. Once decoded, all of this text can be handled and stored together as Unicode. Well, that is the vision anyway. Unfortunately, much like JavaScript implementations across different browsers, Unicode has suffered from differing opinions. I won't go into detail about which technologies do what, but beware that Unicode support is not consistent, nor does it mean the same thing, across the different technologies that deal with text.

An important misconception about Unicode is that you can always translate encoded text to another encoding by decoding from "encoding one" to Unicode and then re-encoding to "encoding two". You cannot, because encodings do not all use the same byte values or the same number of bytes for the same character, and not every encoding contains a value for every character in Unicode. Decode with the wrong encoding, or encode to one that lacks a character, and the text is corrupted or the conversion fails. This translation misconception is one reason why you see garbled text with "????" or other text corruption. One of my recent favorites is the Microsoft single quote/apostrophe appearing as the superscript "TM".
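
To make the decode and re-encode story concrete, here is a minimal sketch. The variable names are made up for illustration, and byte and unicode literals are used so that it should behave the same on Python 2.7 and Python 3. It shows three encodings decoding to the same code points, a re-encode that fails, and the "TM" apostrophe.

```python
# The same text ("Non c'e" with a grave accent on the e) as bytes in
# three different encodings.
as_utf8 = b"Non c'\xc3\xa8"
as_latin1 = b"Non c'\xe8"
as_utf16 = u"Non c'\xe8".encode("utf_16")

# Decoding each with its own encoding yields the same code points.
decoded = [
    as_utf8.decode("utf_8"),
    as_latin1.decode("latin1"),
    as_utf16.decode("utf_16"),
]
same_text = all(text == u"Non c'\xe8" for text in decoded)

# Re-encoding fails when the target encoding lacks the character:
# U+2019 (right single quotation mark) does not exist in latin1.
curly = u"don\u2019t"
try:
    curly.encode("latin1")
    reencode_failed = False
except UnicodeEncodeError:
    reencode_failed = True

# Decoding with the wrong encoding does not fail here; it silently
# produces garbage. UTF-8 bytes read as cp1252 turn the apostrophe into
# three characters, the last of which is the trademark sign (U+2122).
garbled = curly.encode("utf_8").decode("cp1252")
```

The last line is the quiet failure mode: nothing raises, and the corruption only shows up when someone reads the text.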

Before we get to some Python code examples I would like to address one more topic. How do you know what encoding text data is encoded in? That question doesn't have a reliable answer. Unless the encoding is explicitly identified in metadata about the incoming text, like a header of the form "# -*- coding: utf_8 -*-" in a text file, you can't know for sure, and even then there is the possibility that the metadata is incorrect. So this is where I will show you how I deal with things.
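
As a quick aside before the session that follows, this small sketch shows why the bytes alone cannot settle the question. It is a side illustration with made-up variable names, not part of the approach itself: the same byte string decodes without error under several encodings and gives different text.

```python
# One byte string, several encodings that all accept it.
raw = b"Non c'\xc3\xa8"

readings = {}
for encoding in ("utf_8", "latin1", "cp1252"):
    # None of these raises, yet the decoded text differs.
    readings[encoding] = raw.decode(encoding)

# utf_8 reads the last two bytes as one character (e with grave accent);
# latin1 and cp1252 read them as two characters (A with tilde, then a
# diaeresis).
distinct_texts = set(readings.values())
```

A successful decode is therefore only a plausible guess, never proof that the bytes were written in that encoding.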

Python 2.7.2 (default, Oct 11 2012, 20:14:37)
[GCC 4.2.1 Compatible Apple Clang 4.0 (tags/Apple/clang-418.0.60)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> class Message(object):
...     def __init__(self,text,encoding):
...         self.text = text.decode(encoding)
...         self.encoding = encoding
...     def __str__(self):
...         return self.text.encode(self.encoding)
...     def __unicode__(self):
...         return self.text
...
>>> my_message = Message("Hello world!","utf_8")
>>> str(my_message)
'Hello world!'
>>> unicode(my_message)
u'Hello world!'
>>>

Your types need to keep track of encodings and use them consistently to display your text data. If the wrong encoding is used and an issue arises, you need to know about it, so don't suppress exceptions. Let them rise and deal with them by informing the client.
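
One way "informing the client" might look is sketched here. The helper name "decode_or_report", the choice of ValueError, and the message wording are all hypothetical, not something the Message class above provides.

```python
def decode_or_report(raw, encoding):
    """Decode raw bytes, turning a failure into a client-facing error.

    Hypothetical helper: the ValueError and its message format are
    illustrative choices only.
    """
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as err:
        # err.start is the offset of the first byte that failed to decode.
        raise ValueError(
            "text is not valid %s (problem at byte %d); "
            "please resend it with the correct encoding"
            % (encoding, err.start)
        )


try:
    decode_or_report(b"Non c'\xc3\xa8", "ascii")
    report = None
except ValueError as err:
    report = str(err)
```

The exception is not swallowed. It is translated into something the sender can act on, and the original UnicodeDecodeError is still what triggers it.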

>>> your_message = Message("Non c'è","ascii")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 3, in __init__
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6: ordinal not in range(128)
>>> your_message = Message("Non c'è","utf_8")
>>> str(your_message)
"Non c'\xc3\xa8"
>>> print(str(your_message))
Non c'è
>>> print(unicode(your_message))
Non c'è
>>> unicode(your_message)
u"Non c'\xe8"
>>>

That way the client or user can resubmit the text data with the correct character encoding. This is a very simple example and there are many things that can go wrong, especially when using the "unicode()" builtin. Take a look at the following example.

>>> class Message(object):
...     def __init__(self,text,encoding):
...         self.text = text.decode(encoding)
...         self.encoding = encoding
...     def __str__(self):
...         return self.text.encode(self.encoding)
...
>>> my_string = Message("Hello world!","utf_8")
>>> str(my_string)
'Hello world!'
>>> unicode(my_string)
u'Hello world!'
>>> your_message = Message("Non c'è","utf_8")
>>> print(str(your_message))
Non c'è
>>> print(unicode(your_message))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6: ordinal not in range(128)
>>>

What happened!? We did not implement "__unicode__", and the default behavior of the "unicode()" builtin is then to take the object's "str" form and decode it as "ascii" text. The lesson here is to always implement "__unicode__" in your types and to be careful using "unicode()" on types you didn't implement.
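
One way to avoid forgetting "__unicode__" is to put both conversions in a shared base class. The sketch below assumes a hypothetical "TextMixin" and assumes subclasses set "self.text" and "self.encoding". The Python 3 branch is only there so the sketch also runs on Python 3, where "unicode()" no longer exists.

```python
import sys


class TextMixin(object):
    """Hypothetical mixin: subclasses supply self.text (decoded) and
    self.encoding; the mixin supplies both conversions."""

    def __unicode__(self):
        # Python 2 calls this for unicode(obj), so no implicit ASCII
        # decode of str(obj) takes place.
        return self.text

    def __bytes__(self):
        # Encoded form, using the encoding the object was created with.
        return self.text.encode(self.encoding)

    if sys.version_info[0] == 2:
        __str__ = __bytes__      # Python 2: str() returns encoded bytes
    else:
        __str__ = __unicode__    # Python 3: str() returns decoded text


class Message(TextMixin):
    def __init__(self, raw, encoding):
        self.text = raw.decode(encoding)
        self.encoding = encoding


message = Message(b"Non c'\xc3\xa8", "utf_8")
```

With this in place, "unicode(message)" on Python 2 goes through "__unicode__" instead of the ASCII fallback shown in the traceback above.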
