Usually you have to use both a file encoding declaration and unicode string literals, but they control very different things, and it is helpful to know the difference.
File Encoding
If you expect to write Unicode characters anywhere in your source code, such as in comments or literal strings, you need to declare the file's encoding so that the Python parser can read it. Declaring the wrong encoding will result in a SyntaxError exception. PEP 263 explains the problem in detail and how you can control the encoding used by the parser.
In Python 2.1, Unicode literals can only be written using the Latin-1
based encoding “unicode-escape”. This makes the programming
environment rather unfriendly to Python users who live and work in
non-Latin-1 locales such as many of the Asian countries.…
Python will default to ASCII as standard encoding if no other encoding hints are given.
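As a minimal sketch of the declaration PEP 263 describes, a source file can start with a coding comment on its first or second line. The variable name greeting is only illustrative, and the sketch assumes the file itself is saved as UTF-8.

```python
# -*- coding: utf-8 -*-
# The coding comment above must be on the first or second line of the file.
# It tells the parser how to decode the bytes of this source file.

# Non-ASCII characters are now accepted in comments (κόσμε) and in literals.
greeting = u"Γειά σου κόσμε"
print(len(greeting))
```

Without the first line, Python 2 rejects the file with a SyntaxError as soon as it meets the first non-ASCII byte.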
Unicode Literal Strings
Python 2 uses two different types for strings, unicode and str. When you define a literal string, the interpreter actually creates a new object of type str that holds this literal.
s = "A literal string"
print type(s)
<type 'str'>
TL;DR
If you want to change this behavior and instead create a unicode object every time an unprefixed string literal is defined, you can use:
from __future__ import unicode_literals
If you need to understand why this is useful, keep reading.
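A small sketch of what the import changes, written so that it should run on both Python 2 and Python 3. The names plain, raw_bytes and text_type are illustrative and not part of any API.

```python
from __future__ import print_function, unicode_literals

# With unicode_literals, an unprefixed literal is a unicode object on Python 2.
plain = "A literal string"

# The b prefix still gives a byte string (str on Python 2, bytes on Python 3).
raw_bytes = b"A literal string"

# type(u"") is unicode on Python 2 and str on Python 3.
text_type = type(u"")

print(type(plain) is text_type)
print(type(raw_bytes) is bytes)
```

Note that the import only affects literals in the module where it appears, and it must come before any other statement in that module.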
You can explicitly define a literal string as unicode using the u prefix. The interpreter will then create a unicode object for this literal.
s = u"A literal string"
print type(s)
<type 'unicode'>
For ASCII text, using the str type is sufficient, but if you intend to manipulate non-ASCII text it is important to use the unicode type for character-level operations to work correctly. The following example shows the difference in character-level interpretation between str and unicode for exactly the same literal.
# -*- coding: utf-8 -*-
def print_characters(s):
    print "String of type {}".format(type(s))
    print " Length: {} ".format(len(s))
    print " Characters: ",
    for c in s:
        print c,
    print
    print
u_lit = u"Γειά σου κόσμε"
s_lit = "Γειά σου κόσμε"
print_characters(u_lit)
print_characters(s_lit)
Output:
String of type <type 'unicode'>
Length: 14
Characters: Γ ε ι ά σ ο υ κ ό σ μ ε
String of type <type 'str'>
Length: 26
Characters: � � � � � � � � � � � � � � � � � � � � � � � �
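With str, len() erroneously reported a length of 26, because it counts the bytes of the UTF-8 encoded literal rather than its characters, and iterating over the string returned individual bytes that print as garbage. With unicode, on the other hand, everything worked as expected.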
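The two lengths can be reproduced by encoding the text explicitly. This is a minimal sketch, assuming UTF-8 as the file encoding as in the example above. The names text and data are illustrative.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

text = u"Γειά σου κόσμε"     # 12 Greek letters and 2 spaces
data = text.encode("utf-8")  # the bytes a Python 2 str literal would hold

print(len(text))  # 14 characters
print(len(data))  # 26 bytes: 2 per Greek letter, 1 per space

# Decoding the bytes with the same encoding recovers the original text.
print(data.decode("utf-8") == text)
```

Each Greek letter takes two bytes in UTF-8 and each space takes one, which gives 12 * 2 + 2 = 26.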
Setting sys.setdefaultencoding('utf8')
There is a nice answer on Stack Overflow about why we shouldn't use it 🙂