[Django]-What's the difference between "# -*- coding: utf-8 -*-", "from __future__ import unicode_literals" and "sys.setdefaultencoding("utf8")"


You usually need both a file encoding declaration and unicode string literals, but they control very different things, and it helps to know the difference.

File Encoding

If you write non-ASCII characters anywhere in your source code, including comments and string literals, you need to declare the file's encoding so that the Python parser can read it. The declaration is the # -*- coding: utf-8 -*- comment on the first or second line of the file. In Python 2, a missing or wrong declaration typically results in a SyntaxError. PEP 263 explains the problem in detail and how you can control the encoding the parser uses.

In Python 2.1, Unicode literals can only be written using the Latin-1
based encoding “unicode-escape”. This makes the programming
environment rather unfriendly to Python users who live and work in
non-Latin-1 locales such as many of the Asian countries.

If no encoding hint is given, Python 2 assumes ASCII as the source encoding. (Python 3 assumes UTF-8 instead.)
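
As a rough illustration of how the declaration is picked up, the standard library's tokenize.detect_encoding applies the PEP 263 rules to the first two lines of a source file. This sketch assumes Python 3, where the fallback is UTF-8 rather than the ASCII default of Python 2; the helper name source_encoding is made up for the example.

```python
import io
import tokenize


def source_encoding(source_bytes):
    """Return the encoding the tokenizer would use (hypothetical helper)."""
    # detect_encoding looks for a BOM or a PEP 263 coding comment in the
    # first two lines and returns (encoding, lines_read).
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


with_declaration = b"# -*- coding: latin-1 -*-\nx = 1\n"
without_declaration = b"x = 1\n"

print(source_encoding(with_declaration))
print(source_encoding(without_declaration))
```

The first call reports the declared encoding under its normalized name (iso-8859-1), and the second falls back to the Python 3 default.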

Unicode Literal Strings

Python 2 has two string types, unicode and str. By default, an unprefixed string literal creates an object of type str, which holds the literal as raw bytes.

s = "A literal string"
print type(s)

<type 'str'>

TL;DR

If you want to change this behavior so that every unprefixed string literal creates a unicode object instead, use from __future__ import unicode_literals at the top of the module.

If you want to understand why this is useful, keep reading.

You can also explicitly mark a single literal as unicode with the u prefix. The interpreter then creates a unicode object for that literal.

s = u"A literal string"
print type(s)

<type 'unicode'>
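
One way to see what from __future__ import unicode_literals changes, without editing a module's header, is to compile the same unprefixed literal with and without the compiler flag that the import switches on. This is only a sketch: SOURCE, plain and with_flag are names made up for illustration, and on Python 3 both results are the text type because every literal is already unicode there.

```python
import __future__

SOURCE = '"A literal string"'  # an unprefixed string literal

# dont_inherit=True keeps this file's own future imports out of the comparison.
plain = eval(compile(SOURCE, "<plain>", "eval", 0, True))

# This flag is what "from __future__ import unicode_literals" turns on.
flag = __future__.unicode_literals.compiler_flag
with_flag = eval(compile(SOURCE, "<with_flag>", "eval", flag, True))

print(type(plain))
print(type(with_flag))
```

On Python 2 this prints <type 'str'> followed by <type 'unicode'>.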

For ASCII text, the str type is sufficient, but if you intend to manipulate non-ASCII text, you need the unicode type for character-level operations to work correctly. The following example shows how str and unicode interpret exactly the same literal at the character level.

# -*- coding: utf-8 -*-

def print_characters(s):
    print "String of type {}".format(type(s))
    print "  Length: {} ".format(len(s))
    print "  Characters: " ,
    for c in s:
        print c,
    print
    print


u_lit = u"Γειά σου κόσμε"
s_lit = "Γειά σου κόσμε"

print_characters(u_lit)
print_characters(s_lit)

Output:

String of type <type 'unicode'>
  Length: 14 
  Characters:  Γ ε ι ά   σ ο υ   κ ό σ μ ε

String of type <type 'str'>
  Length: 26 
  Characters:  � � � � � � � �   � � � � � �   � � � � � � � � � �

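With str, the length is wrongly reported as 26 because len() counts the UTF-8 bytes rather than the characters (each of the 12 Greek letters takes 2 bytes, plus 2 spaces), and iterating yields individual bytes that print as garbage. With unicode, both operations work as expected.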
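
Python 3 keeps the same distinction under different names: str holds text (like Python 2's unicode) and bytes holds encoded data (like Python 2's str). A minimal Python 3 sketch of the length difference, with the greeting from the example written as escapes so the result does not depend on the file's encoding:

```python
# "Γειά σου κόσμε" spelled with escapes (12 Greek letters and 2 spaces).
text = "\u0393\u03b5\u03b9\u03ac \u03c3\u03bf\u03c5 \u03ba\u03cc\u03c3\u03bc\u03b5"
data = text.encode("utf-8")  # the bytes a UTF-8 source file would contain

print(len(text))  # counts characters
print(len(data))  # counts bytes; each Greek letter takes two
print(data.decode("utf-8") == text)  # decoding recovers the original text
```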

Setting sys.setdefaultencoding("utf8")

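sys.setdefaultencoding changes the encoding that Python 2 uses for implicit conversions between str and unicode, which is ASCII by default. The change applies to the whole process, so it can hide encoding bugs and break libraries that rely on the ASCII default. There is a nice answer on Stack Overflow about why we shouldn't use it 🙂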
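
The usual alternative to changing the process-wide default is to decode bytes explicitly at the point where they enter the program. A minimal sketch, assuming the input is UTF-8; to_text is a hypothetical helper, not part of Python or Django:

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding byte strings explicitly (hypothetical helper)."""
    if isinstance(value, bytes):
        # Name the encoding here instead of relying on the interpreter default.
        return value.decode(encoding)
    return value


# b"\xce\x93" is the UTF-8 encoding of the Greek capital letter gamma.
print(to_text(b"\xce\x93") == u"\u0393")
```

Since bytes is an alias for str in Python 2.6 and later, the same function also works on Python 2.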
