PEP: Python3 and UnicodeDecodeError
This is a PEP describing the behaviour of Python3 on UnicodeDecodeError. It’s a draft, don’t hesitate to comment on it. This document supposes that my patch to allow bytes filenames is accepted, which is not the case today.
While I was writing this document I found potential problems in Python3. So here is a TODO list (things to be checked):
- FIXME: When is bytearray accepted, and when is it not?
- FIXME: Allow bytes/str mix for shutil.copy*()? The ignore callback will get bytes or unicode?
Can anyone write a section about bytes encoding in Unicode using escape sequence?
What is the best tool to work on a PEP? I hate email threads, and I would prefer SVN / Mercurial / anything else.
Python3 and UnicodeDecodeError for the command line, environment variables and filenames
Introduction
Python3 does its best to give you text as valid unicode character strings. When it hits an invalid byte sequence (according to the charset in use), it has two choices: drop the value or raise a UnicodeDecodeError. This document presents the behaviour of Python3 for the command line, environment variables and filenames.
Example of an invalid byte sequence: ::

    >>> str(b'\xff', 'utf8')
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xff (...)

whereas the same byte sequence is valid in another charset like ISO-8859-1: ::

    >>> str(b'\xff', 'iso-8859-1')
    'ÿ'

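The two choices described above (fail, or lose data) correspond to the error handlers accepted by bytes.decode(). The following is a minimal sketch for illustration only; the variable name raw is an assumption, and ascii() is used so the output does not depend on the terminal encoding.

```python
# One byte, decoded with different codecs and error handlers.
raw = b'\xff'

try:
    raw.decode('utf-8')  # 'strict' is the default error handler
except UnicodeDecodeError as exc:
    print('strict UTF-8 decoding failed:', exc.reason)

# ISO-8859-1 maps every byte to a code point, so decoding never fails.
print(ascii(raw.decode('iso-8859-1')))

# 'replace' substitutes U+FFFD, 'ignore' silently drops the byte.
print(ascii(raw.decode('utf-8', 'replace')))
print(ascii(raw.decode('utf-8', 'ignore')))
```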
Default encoding
Python uses «UTF-8» as the default Unicode encoding. You can read the default charset using sys.getdefaultencoding(). The «default encoding» is used by PyUnicode_FromStringAndSize().
A function sys.setdefaultencoding() exists, but it raises a ValueError for any charset other than UTF-8, since the charset is hardcoded in PyUnicode_FromStringAndSize().
Command line
Python creates a nice list of unicode strings for sys.argv using mbstowcs(): ::

    $ ./python -c 'import sys; print(sys.argv)' 'Ho hé !'
    ['-c', 'Ho hé !']

On Linux, mbstowcs() uses the LC_CTYPE environment variable to choose the encoding. On an invalid byte sequence, Python quits immediately with exit code 1. Example with a UTF-8 locale: ::

    $ python3.0 $(echo -e 'invalid:\xff')
    Could not convert argument 1 to string

Environment variables
Python uses «_wenviron» on Windows, which contains unicode (UTF-16-LE) strings. On other OSes, it uses the «environ» variable and the UTF-8 charset. It drops a variable if its key or value is not convertible to unicode. Example: ::

    $ env -i HOME=/home/my PATH=$(echo -e "\xff") python
    >>> import os; list(os.environ.items())
    [('HOME', '/home/my')]

Both keys and values are unicode strings. An empty key and/or value is allowed.
Python ignores invalid variables, but their values still exist in memory. If you run a child process (e.g. using os.system()), the «invalid» variables will also be copied.
Filenames
Introduction
Python2 uses byte filenames everywhere, but it is also possible to use unicode filenames. Examples:
- os.getcwd() gives bytes whereas os.getcwdu() always returns unicode
- os.listdir(unicode) creates bytes or unicode filenames (fallback to bytes on UnicodeDecodeError), os.readlink() has the same behaviour
- glob.glob() converts the unicode pattern to bytes, and so creates bytes filenames
- open() supports bytes and unicode
Since listdir() mixes bytes and unicode, you cannot easily manipulate filenames: ::

    >>> path=u'.'
    >>> for name in os.listdir(path):
    ...     print repr(name)
    ...     print repr(os.path.join(path, name))
    ...
    u'valid'
    u'./valid'
    'invalid\xff'
    Traceback (most recent call last):
      ...
      File "/usr/lib/python2.5/posixpath.py", line 65, in join
        path += '/' + b
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xff (...)

Python3 supports both types, bytes and unicode, but disallows mixing them. If you ask for unicode, you will always get unicode, or an exception is raised.
You should only use unicode filenames, unless you are writing a program that fixes file system encodings or a backup tool, or your users are unable to fix their broken system.
Windows
Microsoft Windows since Windows 95 only uses Unicode (UTF-16-LE) filenames. So you should only use unicode filenames.
Non Windows (POSIX)
POSIX OSes like Linux use bytes for historical reasons. In the best case, all filenames are encoded as valid UTF-8 strings and Python creates valid unicode strings. But since system calls use bytes, the file system may return an invalid filename, or a program can create a file with an invalid filename.
An invalid filename is a string which cannot be decoded to unicode using the default file system encoding (which is UTF-8 most of the time).
A robust program will have to use only the bytes type to make sure that it can open / copy / remove any file or directory.
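As a rough illustration of the bytes-only approach, the sketch below lists a directory using byte filenames and decodes them only for display. The function name list_names is hypothetical, and os.fsencode() is an assumption: it is not mentioned in this document and only exists in later Python 3 releases.

```python
import os
import sys

def list_names(directory='.'):
    """Return (raw_name, printable_name) pairs for the entries of a directory.

    The raw name is bytes and is what should be passed back to the OS;
    the printable name is only meant for display.
    """
    encoding = sys.getfilesystemencoding() or 'utf-8'
    pairs = []
    # Passing bytes to os.listdir() makes it return bytes filenames.
    for raw in os.listdir(os.fsencode(directory)):
        pairs.append((raw, raw.decode(encoding, 'replace')))
    return pairs

for raw, shown in list_names('.'):
    # Operate on 'raw' (open, copy, remove); display 'shown'.
    print(ascii(shown))
```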
Filename encoding
Python uses:
- «mbcs» on Windows
- or «utf-8» on Mac OS X
- or nl_langinfo(CODESET) on OS supporting this function
- or UTF-8 by default
«mbcs» is not a valid charset name, it’s an internal charset saying that Python will use the function MultiByteToWideChar() to decode bytes to unicode. This function uses the current codepage to decode the byte string.
You can read the charset using sys.getfilesystemencoding(). The function may return None if Python is unable to determine the default encoding.
PyUnicode_DecodeFSDefaultAndSize() uses the default file system encoding, or UTF-8 if it is not set.
On UNIX (and other operating systems), it’s possible to mount different file systems using different charsets. sys.getfilesystemencoding() will be the same for the different file systems since this encoding is only used between Python and the Linux kernel, not between the kernel and the file system, which may use a different charset.
Display a filename
Example of a function formatting a filename to display it to human eyes: ::

    from sys import getfilesystemencoding

    def format_filename(filename):
        return str(filename, getfilesystemencoding(), 'replace')

Example: format_filename(b'r\xffport.doc') gives 'r�port.doc' with the UTF-8 encoding.
Functions producing filenames
Policy: for unicode arguments: drop invalid bytes filenames; for bytes arguments: return bytes
- os.listdir()
- glob.glob()
This behaviour (silently dropping invalid filenames) is motivated by the fact that, otherwise, if a directory of 1000 files contains only one invalid filename, listdir() fails for the whole directory. Likewise, if your directory contains 1000 Python scripts (.py) and just one other document with an invalid filename (e.g. r�port.doc), glob.glob('*.py') fails even though all .py scripts have valid filenames.
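The drop policy can be sketched in pure Python as follows. This only illustrates the policy as drafted here and is not the actual implementation of os.listdir(); the function name decodable_names is hypothetical.

```python
def decodable_names(raw_names, encoding='utf-8'):
    """Decode byte filenames, silently dropping the undecodable ones."""
    names = []
    for raw in raw_names:
        try:
            names.append(raw.decode(encoding))
        except UnicodeDecodeError:
            # Policy sketched above: skip the entry instead of failing
            # for the whole directory.
            continue
    return names

print(decodable_names([b'script.py', b'r\xffport.doc', b'setup.py']))
# ['script.py', 'setup.py']
```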
Policy: for a unicode argument: raise a UnicodeDecodeError on an invalid filename; for a bytes argument: return bytes
- os.readlink()
Policy: return a unicode directory name or raise a UnicodeDecodeError
- os.getcwd()
Policy: always return bytes
- os.getcwdb()
Functions for filename manipulation
Policy: raise TypeError on bytes/str mix
- os.path.*(), eg. os.path.join()
- fnmatch.*()
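A short sketch of this policy using os.path.join(): both arguments must be of the same type, and the result has that type. The exact separator in the result depends on the platform, so no output is shown.

```python
import os.path

# Mixing str and bytes is rejected.
try:
    os.path.join('dir', b'name')
except TypeError as exc:
    print('rejected:', type(exc).__name__)

# Homogeneous arguments are fine, and the result keeps their type.
print(type(os.path.join('dir', 'name')).__name__)
print(type(os.path.join(b'dir', b'name')).__name__)
```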
Functions accessing files
Policy: accept both bytes and str
- io.open()
- os.open()
- os.chdir()
- os.stat(), os.lstat()
- os.rename()
- os.unlink()
- shutil.*()
os.rename(), shutil.copy*() and shutil.move() allow using bytes for one argument and unicode for the other.
bytearray
In most cases, bytearray() can be used as bytes for a filename.
Unicode normalisation
Unicode characters can be normalized in 4 forms: NFC, NFD, NFKC or NFKD. Python never normalizes strings (nor filenames). No operating system normalizes filenames. So users using different normalization forms will be unable to retrieve their files. Don’t panic! All users use the same form.
Use unicodedata.normalize() to normalize a unicode string.
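To make the normalization issue concrete, here is a small sketch showing two strings that look identical but compare unequal until they are normalized to the same form. The variable names are illustrative.

```python
import unicodedata

composed = 'caf\u00e9'      # 'é' as a single code point (NFC form)
decomposed = 'cafe\u0301'   # 'e' followed by a combining acute accent (NFD form)

print(composed == decomposed)                                # False
print(len(composed), len(decomposed))                        # 4 5
print(unicodedata.normalize('NFC', decomposed) == composed)  # True
print(unicodedata.normalize('NFD', composed) == decomposed)  # True
```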
I’m reading and parsing an Amazon XML file and while the XML file shows a ‘ , when I try to print it I get the following error:
'ascii' codec can't encode character u'\u2019' in position 16: ordinal not in range(128)
From what I’ve read online thus far, the error is coming from the fact that the XML file is in UTF-8, but Python wants to handle it as an ASCII encoded character. Is there a simple way to make the error go away and have my program print the XML as it reads?
asked Jul 11, 2010 at 19:00
Likely, your problem is that you parsed it okay, and now you’re trying to print the contents of the XML and you can’t because there are some non-ASCII Unicode characters. Try to encode your unicode string as ascii first:
unicodeData.encode('ascii', 'ignore')
The 'ignore' part will tell it to just skip those characters. From the python docs:
>>> # Python 2: u = unichr(40960) + u'abcd' + unichr(1972)
>>> u = chr(40960) + u'abcd' + chr(1972)
>>> u.encode('utf-8')
b'\xea\x80\x80abcd\xde\xb4'
>>> u.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in position 0: ordinal not in range(128)
>>> u.encode('ascii', 'ignore')
b'abcd'
>>> u.encode('ascii', 'replace')
b'?abcd?'
>>> u.encode('ascii', 'xmlcharrefreplace')
b'&#40960;abcd&#1972;'
You might want to read this article: http://www.joelonsoftware.com/articles/Unicode.html, which I found very useful as a basic tutorial on what’s going on. After the read, you’ll stop feeling like you’re just guessing what commands to use (or at least that happened to me).
answered Jul 11, 2010 at 19:10
Scott Stafford
A better solution:
if type(value) == str:
    # Ignore errors even if the string is not proper UTF-8 or has
    # broken marker bytes.
    # Python built-in function unicode() can do this.
    value = unicode(value, "utf-8", errors="ignore")
else:
    # Assume the value object has proper __unicode__() method
    value = unicode(value)
If you would like to read more about why:
http://docs.plone.org/manage/troubleshooting/unicode.html#id1
answered Jan 9, 2014 at 20:24
Paxwell
Don’t hardcode the character encoding of your environment inside your script; print Unicode text directly instead:
assert isinstance(text, unicode) # or str on Python 3
print(text)
If your output is redirected to a file (or a pipe), you could use the PYTHONIOENCODING envvar to specify the character encoding:
$ PYTHONIOENCODING=utf-8 python your_script.py >output.utf8
Otherwise, python your_script.py should work as is: your locale settings are used to encode the text (on POSIX check the LC_ALL, LC_CTYPE and LANG envvars, and set LANG to a utf-8 locale if necessary).
To print Unicode on Windows, see this answer that shows how to print Unicode to Windows console, to a file, or using IDLE.
answered Jun 29, 2015 at 7:46
jfs
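The idea that the output stream, rather than the script, decides the encoding can be sketched in Python 3 with io.TextIOWrapper. This is an illustration only and not part of the answer above; write_text is a hypothetical helper, and a BytesIO buffer stands in for a redirected stdout.

```python
import io

def write_text(binary_stream, text, encoding):
    """Write text to a binary stream using the stream's chosen encoding."""
    wrapper = io.TextIOWrapper(binary_stream, encoding=encoding)
    wrapper.write(text)
    wrapper.flush()
    wrapper.detach()  # keep the underlying binary stream open

# The script only handles text; the encoding is a property of the output.
buf = io.BytesIO()
write_text(buf, 'it\u2019s', 'utf-8')
print(buf.getvalue())  # b'it\xe2\x80\x99s'
```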
Excellent post: http://www.carlosble.com/2010/12/understanding-python-and-unicode/
# -*- coding: utf-8 -*-
# Python 2 helpers

def __if_number_get_string(number):
    # Numbers are converted to str; anything else is returned unchanged.
    converted_str = number
    if isinstance(number, (int, float)):
        converted_str = str(number)
    return converted_str

def get_unicode(strOrUnicode, encoding='utf-8'):
    strOrUnicode = __if_number_get_string(strOrUnicode)
    if isinstance(strOrUnicode, unicode):
        return strOrUnicode
    return unicode(strOrUnicode, encoding, errors='ignore')

def get_string(strOrUnicode, encoding='utf-8'):
    strOrUnicode = __if_number_get_string(strOrUnicode)
    if isinstance(strOrUnicode, unicode):
        return strOrUnicode.encode(encoding)
    return strOrUnicode
answered Sep 13, 2016 at 18:31
Ranvijay Sachan
You can use something of the form
s.decode('utf-8')
which will convert a UTF-8 encoded bytestring into a Python Unicode string. But the exact procedure to use depends on exactly how you load and parse the XML file, e.g. if you don’t ever access the XML string directly, you might have to use a decoder object from the codecs module.
answered Jul 11, 2010 at 19:04
David Z
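One possible reading of "a decoder object from the codecs module" is an incremental decoder, which handles multi-byte characters split across chunks. The sketch below is written for Python 3 and is an assumption about what was meant; decode_chunks is a hypothetical helper.

```python
import codecs

def decode_chunks(chunks, encoding='utf-8'):
    """Decode an iterable of byte chunks into one text string."""
    decoder = codecs.getincrementaldecoder(encoding)(errors='strict')
    parts = [decoder.decode(chunk) for chunk in chunks]
    # final=True raises if the input ends in the middle of a character.
    parts.append(decoder.decode(b'', final=True))
    return ''.join(parts)

# U+2019 is b'\xe2\x80\x99' in UTF-8; here it is split across two chunks.
text = decode_chunks([b'it\xe2\x80', b'\x99s'])
print(ascii(text))  # 'it\u2019s'
```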
I wrote the following to fix the nuisance non-ascii quotes and force conversion to something usable.
unicodeToAsciiMap = {u'\u2019': "'", u'\u2018': "`"}

def unicodeToAscii(inStr):
    # Fast path: the whole string is already plain ASCII.
    try:
        return str(inStr)
    except UnicodeError:
        pass
    outStr = ""
    for i in inStr:
        try:
            outStr = outStr + str(i)
        except UnicodeError:
            if i in unicodeToAsciiMap:
                outStr = outStr + unicodeToAsciiMap[i]
            else:
                # Unknown character: report it and emit a placeholder.
                try:
                    print "unicodeToAscii: add to map:", i, repr(i), "(encoded as _)"
                except UnicodeError:
                    print "unicodeToAscii: unknown code (encoded as _)", repr(i)
                outStr = outStr + "_"
    return outStr
answered Sep 10, 2015 at 11:31
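For Python 3, the same idea can be sketched more compactly with str.translate(). This is not the code from the answer above; QUOTE_MAP and to_ascii are illustrative names, and the mapping only covers the two quote characters mentioned there.

```python
# Hypothetical mapping; extend it with the characters found in your data.
QUOTE_MAP = str.maketrans({'\u2019': "'", '\u2018': "`"})

def to_ascii(text):
    """Map known curly quotes to ASCII and replace other non-ASCII with '_'."""
    translated = text.translate(QUOTE_MAP)
    return ''.join(ch if ord(ch) < 128 else '_' for ch in translated)

print(to_ascii('it\u2019s caf\u00e9'))  # it's caf_
```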
Try adding the following line at the top of your python script.
# _*_ coding:utf-8 _*_
answered Jan 20, 2016 at 5:08
abnvanand
Python 3.5, 2018
If you don’t know what the encoding is but the unicode parser is having issues, you can open the file in Notepad++ and in the top bar select Encoding->Convert to ANSI. Then you can write your python like this:
with open('filepath', 'r', encoding='ANSI') as file:
    for word in file.read().split():
        print(word)
answered Oct 9, 2018 at 21:56
Atomar94
Python 2.7. Unicode Errors Simply Explained
I know I’m about 5 years late with this article, but people are still using Python 2.x, so I think this subject is still relevant.
Some facts first:
- Unicode is an international encoding standard for use with different languages and scripts
- In python-2.x, there are two types that deal with text: str is an 8-bit string, and unicode is for strings of unicode code points. A code point is a number that maps to a particular abstract character. It is written using the notation U+12ca to mean the character with value 0x12ca (4810 decimal)
- Encoding (noun) is a map of Unicode code points to a sequence of bytes. (Synonyms: character encoding, character set, codeset). Popular encodings: UTF-8, ASCII, Latin-1, etc.
- Encoding (verb) is a process of converting unicode to bytes of str, and decoding is the reverse operation.
- Python 2.x uses ASCII as a default encoding. (More about this later)
SyntaxError: Non-ASCII character
When you see something like this
SyntaxError: Non-ASCII character '\xd0' in file /tmp/p.py on line 2, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
you just need to define the encoding in the first or second line of your file.
All you need is to have the string coding=utf8 or coding: utf8 somewhere in your comments.
Python doesn’t care what goes before or after those strings, so the following will work fine too:
# -*- encoding: utf-8 -*-
Notice the dash in utf-8. Python has many aliases for UTF-8 encoding, so you should not worry about dashes or case sensitivity.
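The alias behaviour can be checked with codecs.lookup(), which resolves a name to the canonical codec. The snippet below is a small sketch run under Python 3; the ALIASES tuple is just an illustrative sample of spellings.

```python
import codecs

# A few spellings of the same codec; lookup ignores case and
# treats '-' and '_' alike.
ALIASES = ('utf8', 'UTF-8', 'utf_8', 'Utf8')

for alias in ALIASES:
    print(alias, '->', codecs.lookup(alias).name)
```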
UnicodeEncodeError
Explained
>>> str(u'café')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)
The str() function encodes a string. We passed a unicode string, and it tried to encode it using the default encoding, which is ASCII. Now the error makes sense because ASCII is a 7-bit encoding which doesn’t know how to represent characters outside of the range 0..127.
Here we called str() explicitly, but something in your code may call it implicitly and you will also get UnicodeEncodeError.
How to fix: encode the unicode string manually using .encode('utf8') before passing it to str()
UnicodeDecodeError
Explained
>>> utf_string = u'café'
>>> byte_string = utf_string.encode('utf8')
>>> unicode(byte_string)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
Let’s say we somehow obtained a byte string byte_string which contains encoded UTF-8 characters. We could get this by simply using a library that returns the str type.
Then we passed the string to a function that converts it to unicode. In this example we explicitly call unicode(), but some functions may call it implicitly and you’ll get the same error.
Now again, Python uses the ASCII encoding by default, so it tries to decode the bytes using the default encoding, ASCII. Since there is no ASCII symbol for the byte 0xc3 (195 decimal), it fails with UnicodeDecodeError.
How to fix: decode str manually using .decode('utf8') before passing it to your function.
Rule of Thumb
Make sure your code works only with Unicode strings internally, converting to a particular encoding on output, and decoding str on input.
Learn the libraries you are using, and find places where they return str. Decode str before the return value is passed further in your code.
I use this helper function in my code:
def force_to_unicode(text):
    "If text is unicode, it is returned as is. If it's str, convert it to Unicode using UTF-8 encoding"
    return text if isinstance(text, unicode) else text.decode('utf8')
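For readers on Python 3, where the text type is str and the byte type is bytes, the same rule of thumb might look like the sketch below. The names force_to_text and process are hypothetical and the upper-casing step merely stands in for real work.

```python
def force_to_text(value, encoding='utf-8'):
    """Return str unchanged; decode bytes using the given encoding."""
    return value if isinstance(value, str) else value.decode(encoding)

def process(raw_input_bytes):
    text = force_to_text(raw_input_bytes)  # decode at the input boundary
    result = text.upper()                  # work with text internally
    return result.encode('utf-8')          # encode at the output boundary

print(process(b'caf\xc3\xa9'))  # b'CAF\xc3\x89'
```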
Source: https://docs.python.org/2/howto/unicode.html