Python doesn't interpret UTF8 correctly

2👍

✅

The big problem here is that you’re mixing up Python 2 and Python 3. In particular, you’ve written Python 3 code, and you’re trying to run it in Python 2.7. But there are a few other problems along the way. So, let me try to explain everything that’s going wrong.


I started compiling an SSCCE, and quickly found that the problem is only there if I try to print the value in a tuple. In other words, print(lines[0].strip()) works fine, but print(lines[0].strip(), lines[1].strip()) does not.

The first problem here is that the str of a tuple (or any other collection) includes the repr, not the str, of its elements. The simple way to solve this problem is to not print collections. In this case, there is really no reason to print a tuple at all; the only reason you have one is that you’ve built it for printing. Just do something like this:

print '({}, {})'.format(lines[0].strip(), lines[1].strip())

In cases where you already have a collection in a variable, and you want to print out the str of each element, you have to do that explicitly. You can print the repr of the str of each with this:

print tuple(map(str, my_tuple))


... or print the str of each directly with this:

print '({})'.format(', '.join(map(str, my_tuple)))
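To see the repr-versus-str difference without a Python 2 interpreter, here is a minimal Python 3 sketch. The sample values are made up for illustration, and ascii() is used only as a rough stand-in for what Python 2's repr does to a unicode object:

```python
# Python 3 sketch: str() of a tuple is built from repr() of each element.
items = ('\u00e5', '\u00df')  # hypothetical sample values: a-ring and sharp s

# The quotes in this result come from repr() of each element.
as_tuple = str(items)

# Joining the elements yourself uses the text of each element directly.
as_text = '({})'.format(', '.join(items))

# ascii() escapes non-ASCII characters, roughly like Python 2's repr()
# of a unicode object (minus the u prefix).
escaped = ascii(items[0])
```

Here as_tuple keeps the quotes around each element, as_text does not, and escaped is the backslash-escaped form that shows up when Python 2 prints a tuple of unicode objects.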

Notice that I’m using Python 2 syntax above. That’s because if you actually used Python 3, there would be no tuple in the first place, and there would also be no need to call str.


You’ve got a Unicode string. In Python 3, unicode and str are the same type. But in Python 2, it’s bytes and str that are the same type, and unicode is a different one. So, in 2.x, you don’t have a str yet, which is why you need to call str.

And Python 2 is also why print(lines[0].strip(), lines[1].strip()) prints a tuple. In Python 3, that’s a call to the print function with two strings as arguments, so it will print out two strings separated by a space. In Python 2, it’s a print statement with one argument, which is a tuple.
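A small Python 3 sketch of that difference, capturing the output in io.StringIO objects so it can be inspected. The names first and second are just stand-ins for the two stripped lines:

```python
import io

first, second = 'a', 'b'  # hypothetical stand-ins for the two stripped lines

# Python 3: print is a function called with two arguments,
# so the values are written separated by a space.
two_args = io.StringIO()
print(first, second, file=two_args)

# Passing a single tuple reproduces what the Python 2 print statement saw:
# one argument, whose str() is built from the repr() of its elements.
one_tuple = io.StringIO()
print((first, second), file=one_tuple)

print(repr(two_args.getvalue()))   # 'a b\n'
print(repr(one_tuple.getvalue()))  # "('a', 'b')\n"
```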

If you want to write code that works the same in both 2.x and 3.x, you either need to avoid ever printing more than one argument, or use a wrapper like six.print_, or do a from __future__ import print_function, or be very careful to do ugly things like adding in extra parentheses to make sure your tuples are tuples in both versions.


So, in 3.x, you’ve got str objects and you just print them out. In 2.x, you’ve got unicode objects, and you’re printing out their repr. You can change that to print out their str, or to avoid printing a tuple in the first place... but that still won’t help anything.

Why? Well, printing anything, in either version, just calls str on it and then passes it to sys.stdout.write. But in 3.x, str means unicode, and sys.stdout is a TextIOWrapper; in 2.x, str means bytes, and sys.stdout is a binary file.

So, the pseudocode for what ultimately happens is, in 3.x:

sys.stdout.wrapped_binary_file.write(s.encode(sys.stdout.encoding, sys.stdout.errors))

and in 2.x:

sys.stdout.write(s.encode(sys.getdefaultencoding()))

And, as you saw, those will do different things, because:

print(sys.getdefaultencoding(), sys.stdout.encoding, f.encoding) yields ('ascii', 'UTF-8', None)
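As a rough illustration of why those two encodings matter, this Python 3 sketch encodes the same character both ways. It only mimics the two code paths; it is not what either interpreter literally runs:

```python
text = '\u00e5'  # LATIN SMALL LETTER A WITH RING ABOVE

# 3.x-style path: encode with the stream's encoding (UTF-8 in the output above).
utf8_bytes = text.encode('utf-8')

# 2.x-style path: encode with the 'ascii' default, which cannot represent
# the character at all.
try:
    text.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

print(utf8_bytes)    # b'\xc3\xa5'
print(ascii_failed)  # True
```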

You can simulate Python 3 here by using an io.TextIOWrapper or codecs.StreamWriter and then using print >>f, ... or f.write(...) instead of print, or you can explicitly encode all your unicode objects like this:

print '({})'.format(', '.join(element.encode('utf-8') for element in my_tuple))
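For the wrapper option, here is a minimal sketch of what an io.TextIOWrapper does, written for Python 3 and using an in-memory io.BytesIO as a stand-in for a binary stdout. The variable names are illustrative only:

```python
import io

# An in-memory binary stream standing in for a binary stdout.
binary_file = io.BytesIO()

# The wrapper accepts text and encodes it on the way down to the binary layer.
wrapper = io.TextIOWrapper(binary_file, encoding='utf-8', errors='strict')

wrapper.write('(\u00e5, \u00df)')
wrapper.flush()  # push buffered text through to the binary stream

written = binary_file.getvalue()
print(written)  # b'(\xc3\xa5, \xc3\x9f)'
```

The point is that the encoding decision lives in one place, the wrapper, instead of being scattered across every print call.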

But really, the best way to deal with all of these problems is to run your existing Python 3 code in a Python 3 interpreter instead of a Python 2 interpreter.

If you want or need to use Python 2.7, that’s fine, but you have to write Python 2 code. If you want to write Python 3 code, that’s great, but you have to run Python 3.3. If you really want to write code that works properly in both, you can, but it’s extra work, and takes a lot more knowledge.

For further details, see What’s New In Python 3.0 (the “Print Is A Function” and “Text Vs. Data Instead Of Unicode Vs. 8-bit” sections), although that’s written from the point of view of explaining 3.x to 2.x users, which is backward from what you need. The 3.x and 2.x versions of the Unicode HOWTO may also help.

👤abarnert

0👍

For completeness: I’m reading from the files with lines = file.readlines() and printing with the standard print() function. No manual encoding or decoding happens at either end.

In Python 3.x, the standard print function just writes Unicode to sys.stdout. Since that’s an io.TextIOWrapper, its write method is equivalent to this:

self.wrapped_binary_file.write(s.encode(self.encoding, self.errors))

So one likely problem is that sys.stdout.encoding does not match your terminal’s actual encoding.


And of course another is that your shell’s encoding does not match your terminal window’s encoding.

For example, on OS X, I create a myscript.py like this:

print('\u00e5')

Then I fire up Terminal.app, create a session profile with encoding “Western (ISO Latin 1)”, create a tab with that session profile, and do this:

$ export LANG=en_US.UTF-8
$ python3 myscript.py


... and I get exactly the behavior you’re seeing.
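You can reproduce the same mismatch without touching any terminal settings. This Python 3 sketch encodes the character the way a UTF-8 locale would, then decodes the bytes the way a Latin-1 terminal would display them:

```python
text = '\u00e5'

# What Python writes when the locale says UTF-8.
sent = text.encode('utf-8')

# How a terminal set to ISO Latin 1 interprets those same two bytes:
# one character per byte instead of one character for the pair.
shown = sent.decode('latin-1')

print(ascii(sent))   # b'\xc3\xa5'
print(ascii(shown))  # '\xc3\xa5'
```

The single character comes out as two (U+00C3 followed by U+00A5), which is the classic sign that the bytes are fine and only the two ends disagree about the encoding.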

👤abarnert

0👍

It seems from your comment that you are using python-2 and not python-3.

If you are using python-3, it’s worth reading the unicode howto guide on reading/writing to understand what python is doing.

The basic flow of encoding is:

DECODE from encoding to unicode -> Processing -> Encode from unicode to encoding
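The flow above can be sketched in Python 3 like this, with a made-up byte string standing in for data read from a file:

```python
# UTF-8 bytes as they might arrive from a file or socket (sample data).
raw = b'caf\xc3\xa9'

text = raw.decode('utf-8')       # DECODE: bytes -> str
processed = text.upper()         # process as text, not as bytes
out = processed.encode('utf-8')  # ENCODE: str -> bytes

print(out)  # b'CAF\xc3\x89'
```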

In python3 the bytes are decoded to strings and strings are encoded to bytes.
The bytes to string decoding is handled for you with open().

[..] the built-in open() function can return a file-like object that
assumes the file’s contents are in a specified encoding and accepts
Unicode parameters for methods such as read() and write(). This works
through open()‘s encoding and errors parameters [..]

So to read in unicode from a utf-8 encoded file you should be doing this:

# python-3
with open('utf8.txt', mode='r', encoding='utf-8') as f:
    lines = f.readlines()  # returns a list of str (Unicode) lines

If you want similar functionality using python-2, you can use codecs.open():

# python-2
import codecs
with codecs.open('utf8.txt', mode='r', encoding='utf-8') as f:
    lines = f.readlines() # returns unicode 
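If the same script has to run under both versions, io.open is another option: it exists in Python 2.6+ and is the same function as the built-in open in Python 3. A self-contained round-trip sketch, using a temporary file rather than a real utf8.txt:

```python
import io
import os
import tempfile

# A temporary file so the sketch does not depend on an existing utf8.txt.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
try:
    # Write two lines of text; the file object encodes them as UTF-8.
    with io.open(path, mode='w', encoding='utf-8') as f:
        f.write(u'\u00e5\n\u00df\n')

    # Read them back; the file object decodes UTF-8 into text.
    with io.open(path, mode='r', encoding='utf-8') as f:
        lines = [line.strip() for line in f.readlines()]
finally:
    os.remove(path)

print(len(lines))  # 2
```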
👤monkut
