Friday, September 4, 2009

MS Excel and Unicode

Last week I was sent a screen shot by our customer service department, the screen shot was of the application my company developed for the online management of homework for quantitative subjects such as mathematics, and physics. The screen shot showed text from a description field which as loaded into the application via an Excel spreadsheet, and between each character was a black diamond and question mark similar to this; . Thanks to Joel Spolsky's article on the subject of Unicode and character encodings, I recognized that some kind of character encoding issue was at play here. What had me confused was that the issue was not global, it only effect certain description fields, not all the description fields that where loaded from the Excel spreadsheet.

The script that parsed the Excel file for upload into the application is written in Python, my first effort was to cast all strings coming from the Excel file as non-unicode strings, i.e. ASCII. Using the str() function in Python I did just that and expected one of two things to happen, an exception to erupt or the issue to be resolved (highly unlikely). The former is what happened, complaining that a certain character is beyond the ASCII character set, and python is nice enough to supply offending character's hexadecimal code; u2019. Popping the code into google produced an instant hit, this was the character that MS uses for the right single quote mark or apostrophe, the nearest equivalent is u0027 which will map to the ASCII character for the single quote and apostrophe. Using the replace() function I took care of that character, and worked my way through all the offending characters which also included unicode characters u201c and u201d, the left and right double quote characters that MS uses, I'll come back to these in a moment.

I then uploaded the newly parsed Excel data, and had a look in the web browser. The black diamond with question marks where all still there but now the single quote/apostrophe's and double quotes where displaying correctly instead of the boxes with the Unicode hexadecimal codes, which I had not noticed before because I was distracted by all the diamonds. Something was still amiss, and the characters MS uses were not the whole story. There was something else which was effecting every character, because remember that diamonds with question marks were between every single character, no exaggeration. It was like the application didn't know what to do with the last bit in every character... Ah ha! I first checked to make sure that the unicode strings coming from the Excel spreadsheet were uft-8 and not something like uft-16. That wasn't the issue so I started looking at all the other different encoding, and there was one for utf-7 which was compatible with older systems that only use 7-bit ASCII strings.

Could it be that my company had turned out an application that could only handle 7-bit ASCII characters? So I converted the uft-8 string to utf-7, and then cast them as ASCII and re-parsed the spreadsheets, uploaded the data and had a look. Tada! Everything looked great, all the odd characters where gone, and my script was set up to throw an exception the next time it encountered a Unicode character it couldn't cast to ASCII.

With my findings I went to the developer who created the application I was uploading the data for, told him about the problem I had encountered, how I had fixed it, and what it meant about our application. I got a shrug, a blank stare, and a 'Gee, I don't know anything about character sets, good thing you figured out how to fix it.' I'm still shacking my head about it, but you can only lead a horse to water.

Now to go back to those characters MS uses, and why only certain description fields where effected. It is one thing for MS to use characters outside the 127 characters in the ASCII character set for punctuation, but it is quite another to switch the encodings only when those characters are used. I have not figure out whether it is the library I am using to read the Excel spreadsheets (Python's xlrd module) or if it is Excel itself, but if the string contains no characters outside the ASCII 127 character range it is ASCII 7-bit, if the string contains characters outside the ASCII 127 range the string is Unicode 8-bit. The description from the same Excel spreadsheet that do not contain any characters that MS would use special characters for came through as ASCII 7-bit, but the other descriptions that do contain these special characters come through as Unicode 8-bit. That is not cool in my opinion, and when I figure out who is behind this behavior they are going to hear from me.

No comments:

Post a Comment