I’m working on projects for Threepress, and they [...]. I was surprised when a test failed on Ubuntu. In getting to the bottom of it, I learned some new things about Python and Unicode.

The test in question was trying to open a file by name, no big deal, right? Well, in this case, the filename had an accented character, so it was a big deal. On the Mac, this file can be opened by name; on Ubuntu, it can’t.

Looking into it, the filename we’re using and the filename it has are different:

    >>> fname = u"l\u00e9.

On the Mac, that filename will open that file:

    >>> open(fname)

On Ubuntu, it fails:

    IOError: No such file or directory: 'l\xc3\xa9.

What’s with the two different strings that seem to both represent the same text? Wasn’t Unicode supposed to get us out of character set hell by having everyone agree on how to store text? Turns out it doesn’t make everything simple: there are still multiple ways to store the same text.

In this case, the accented é is represented as two different UTF-8 strings: both as '\xc3\xa9' and as 'e\xcc\x81'. In pure Unicode terms, the first is a single code point, U+00E9, or LATIN SMALL LETTER E WITH ACUTE. The second is two code points: U+0065 (LATIN SMALL LETTER E) and U+0301 (COMBINING ACUTE ACCENT). Turns out Unicode has both a single combined code point for accented e, and also two code points that combine to produce the same accented e.

This demonstrates a complicated Unicode concept known as equivalence and normalization. Unicode defines complex rules that make it so that our two strings are equivalent. On the Mac, trying to open the file with either string works; on Ubuntu, you have to use the exact form the filename is stored in.

So to open the file reliably, we have to try a number of different Unicode normalization forms to be sure to open it. Python provides the unicodedata.normalize function, which can perform the normalizations for us:

    >>> import unicodedata
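As a rough illustration of the normalization idea in this post, here is a minimal sketch. It is written for Python 3, not the Python 2 session shown in the post, so strings are already Unicode and a missing file raises FileNotFoundError. The helper name open_normalized and its forms parameter are made up for this example and are not from the original code. It simply tries the path as given and then each normalization form until one opens.

```python
import unicodedata

# The two spellings of the accented e discussed in the post.
composed = "\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

# Different code points, and different UTF-8 bytes...
print(composed == decomposed)        # False
print(composed.encode("utf-8"))      # b'\xc3\xa9'
print(decomposed.encode("utf-8"))    # b'e\xcc\x81'

# ...but identical once both are normalized to the same form.
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
print(unicodedata.normalize("NFD", composed) == decomposed)   # True


def open_normalized(path, mode="r", forms=("NFC", "NFD", "NFKC", "NFKD")):
    """Hypothetical helper: open `path`, retrying with each normalization form.

    The path is tried exactly as given first, then once per form in `forms`.
    If no candidate exists, the last FileNotFoundError is re-raised.
    """
    candidates = [path] + [unicodedata.normalize(form, path) for form in forms]
    last_error = None
    for candidate in candidates:
        try:
            return open(candidate, mode)
        except FileNotFoundError as err:
            last_error = err
    raise last_error
```

On a filesystem that compares names byte for byte, a call such as open_normalized("l\u00e9.txt") would fall through to the NFD spelling if that is how the file was created. The filename here is only an example. Trying several forms is a blunt approach; if you control how the files are created, normalizing names once at creation time is probably simpler.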