UTF-8 Python error - Fixing errors and finding optimal solutions to problems

PEP: Python3 and UnicodeDecodeError

This is a PEP describing the behaviour of Python3 on UnicodeDecodeError. It’s a draft, so don’t hesitate to comment on it. This document supposes that my patch to allow bytes filenames is accepted, which is not the case today.

While I was writing this document I found potential problems in Python3. So here is a TODO list (things to be checked):

FIXME: When is bytearray accepted, and when is it not?
FIXME: Allow bytes/str mix for shutil.copy*()? Will the ignore callback get bytes or unicode?

Can anyone write a section about encoding bytes in Unicode using escape sequences?

What is the best tool to work on a PEP? I hate email threads, and I would prefer SVN / Mercurial / anything else.

Python3 and UnicodeDecodeError for the command line, environment variables and filenames

Introduction

Python3 does its best to give you text as valid unicode character strings. When it hits an invalid byte sequence (according to the charset in use), it has two choices: drop the value or raise a UnicodeDecodeError. This document presents the behaviour of Python3 for the command line, environment variables and filenames.

Example of an invalid byte sequence: ::

>>> str(b'\xff', 'utf8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff (...)

whereas the same byte sequence is valid in another charset like ISO-8859-1: ::

>>> str(b'\xff', 'iso-8859-1')
'ÿ'
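
The difference between the two charsets can be checked with a small helper. This is only a sketch for Python 3: the name `try_decode` is hypothetical and is not part of the draft, and `ascii()` is used so the output stays printable on any terminal.

```python
def try_decode(data, encoding):
    """Return the decoded text, or None if data is invalid in this encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


raw = b'\xff'

# 0xff can never appear in valid UTF-8, so decoding fails.
print(try_decode(raw, 'utf-8'))

# Every byte value maps to a character in ISO-8859-1.
print(ascii(try_decode(raw, 'iso-8859-1')))

# With the 'replace' error handler the bad byte becomes U+FFFD.
print(ascii(raw.decode('utf-8', 'replace')))
```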

Default encoding

Python uses «UTF-8» as the default Unicode encoding. You can read the default charset using sys.getdefaultencoding(). The «default encoding» is used by PyUnicode_FromStringAndSize().

A function sys.setdefaultencoding() exists, but it raises a ValueError for any charset other than UTF-8, since the charset is hardcoded in PyUnicode_FromStringAndSize().
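
A quick way to inspect the default encoding from Python code is sketched below. It assumes a released Python 3 interpreter, which may differ from the development snapshot this draft describes: in released versions sys.setdefaultencoding() is not exposed at all.

```python
import sys

# In released Python 3 versions the default encoding is always UTF-8.
print(sys.getdefaultencoding())

# Assumption: a released Python 3, where setdefaultencoding() is absent.
print(hasattr(sys, 'setdefaultencoding'))

# The default encoding is what str.encode() and bytes.decode() use
# when no encoding argument is given.
default_encoded = 'h\u00e9'.encode()
print(default_encoded)
```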

Command line

Python creates a nice unicode list for sys.argv using mbstowcs(): ::

$ ./python -c 'import sys; print(sys.argv)' 'Ho hé !'
['-c', 'Ho hé !']

On Linux, mbstowcs() uses the LC_CTYPE environment variable to choose the encoding. On an invalid byte sequence, Python exits immediately with exit code 1. Example with a UTF-8 locale:

$ python3.0 $(echo -e 'invalid:\xff')
Could not convert argument 1 to string

Environment variables

Python uses «_wenviron» on Windows, which contains unicode (UTF-16-LE) strings. On other OSes, it uses the «environ» variable and the UTF-8 charset. It drops a variable if its key or value is not convertible to unicode. Example:

env -i HOME=/home/my PATH=$(echo -e "\xff") python
>>> import os; list(os.environ.items())
[('HOME', '/home/my')]

Both keys and values are unicode strings. An empty key and/or value is allowed.

Python ignores invalid variables, but their values still exist in memory. If you run a child process (e.g. using os.system()), the «invalid» variables will also be copied.

Filenames

Introduction

Python2 uses byte filenames everywhere, but it was also possible to use unicode filenames. Examples:

os.getcwd() gives bytes whereas os.getcwdu() always returns unicode
os.listdir(unicode) creates bytes or unicode filenames (fallback to bytes on UnicodeDecodeError), os.readlink() has the same behaviour
glob.glob() converts the unicode pattern to bytes, and so creates bytes filenames
open() supports bytes and unicode

Since listdir() mixes bytes and unicode, you cannot easily manipulate filenames:

>>> path=u'.'
>>> for name in os.listdir(path):
...  print repr(name)
...  print repr(os.path.join(path, name))
...
u'valid'
u'./valid'
'invalid\xff'
Traceback (most recent call last):
...
File "/usr/lib/python2.5/posixpath.py", line 65, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff (...)

Python3 supports both types, bytes and unicode, but disallows mixing them. If you ask for unicode, you will always get unicode, or an exception is raised.

You should only use unicode filenames, unless you are writing a program that fixes file system encoding or a backup tool, or your users are unable to fix their broken system.

Windows

Microsoft Windows since Windows 95 only uses Unicode (UTF-16-LE) filenames. So you should only use unicode filenames.

Non Windows (POSIX)

POSIX OSes like Linux use bytes for historical reasons. In the best case, all filenames will be encoded as valid UTF-8 strings and Python creates valid unicode strings. But since system calls use bytes, the file system may return an invalid filename, or a program can create a file with an invalid filename.

An invalid filename is a string which cannot be decoded to unicode using the default file system encoding (which is UTF-8 most of the time).

A robust program will have to use only the bytes type to make sure that it can open / copy / remove any file or directory.
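
A minimal sketch of the bytes-only approach, for Python 3. The helper name `list_entries_as_bytes` is hypothetical, and os.fsencode() comes from released Python 3 versions rather than from this draft. Because the directory is passed as bytes, os.listdir() returns bytes names and nothing has to be decoded.

```python
import os
import tempfile


def list_entries_as_bytes(directory):
    """List a directory using bytes paths so that no name needs decoding."""
    # str -> bytes using the file system encoding (bytes pass through as-is).
    directory = os.fsencode(directory)
    return [os.path.join(directory, name) for name in os.listdir(directory)]


# Demonstration on a temporary directory containing one file.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, 'valid.txt'), 'w').close()
    entries = list_entries_as_bytes(tmp)
    all_bytes = all(isinstance(entry, bytes) for entry in entries)

print(entries)
print(all_bytes)
```

The resulting bytes paths can be passed directly to the functions listed later under «Functions accessing files».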

Filename encoding

Python uses:

«mbcs» on Windows
or «utf-8» on Mac OS X
or nl_langinfo(CODESET) on OS supporting this function
or UTF-8 by default

«mbcs» is not a valid charset name, it’s an internal charset saying that Python will use the function MultiByteToWideChar() to decode bytes to unicode. This function uses the current codepage to decode byte strings.

You can read the charset using sys.getfilesystemencoding(). The function may return None if Python is unable to determine the default encoding.

PyUnicode_DecodeFSDefaultAndSize() uses the default file system encoding, or UTF-8 if it is not set.

On UNIX (and other operating systems), it’s possible to mount different file systems using different charsets. sys.getfilesystemencoding() will be the same for all of these file systems, since this encoding is only used between Python and the Linux kernel, not between the kernel and the file system, which may use a different charset.

Display a filename

Example of a function formatting a filename to display it to human eyes: ::

from sys import getfilesystemencoding
def format_filename(filename):
    return str(filename, getfilesystemencoding(), 'replace')

Example: format_filename(b'r\xffport.doc') gives 'r�port.doc' with the UTF-8 encoding.
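
The 'replace' handler is lossy: the original bytes cannot be recovered from the displayed string. Released Python 3 versions also provide a 'surrogateescape' error handler (PEP 383), which this draft does not cover. The sketch below compares the two handlers, using a hard-coded 'utf-8' encoding purely for illustration.

```python
raw_name = b'r\xffport.doc'   # not valid UTF-8

# For display: the undecodable byte becomes U+FFFD (lossy, not reversible).
shown = raw_name.decode('utf-8', 'replace')

# For round-tripping: the undecodable byte becomes a lone surrogate,
# which encodes back to the original byte with the same handler.
text_name = raw_name.decode('utf-8', 'surrogateescape')
restored = text_name.encode('utf-8', 'surrogateescape')

print(ascii(shown))
print(ascii(text_name))
print(restored == raw_name)
```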

Functions producing filenames

Policy: for unicode arguments: drop invalid bytes filenames; for bytes arguments: return bytes

os.listdir()
glob.glob()

This behaviour (silently dropping invalid filenames) is motivated by the fact that otherwise, if a directory of 1000 files contains only one invalid filename, listdir() fails for the whole directory. Likewise, if your directory contains 1000 Python scripts (.py) and just one other document with an invalid filename (e.g. r�port.doc), glob.glob('*.py') fails even though all .py scripts have valid filenames.

Policy: for a unicode argument: raise a UnicodeDecodeError on an invalid filename; for a bytes argument: return bytes

os.readlink()

Policy: return a unicode directory name or raise a UnicodeDecodeError

os.getcwd()

Policy: always return bytes

os.getcwdb()

Functions for filename manipulation

Policy: raise TypeError on bytes/str mix

os.path.*(), eg. os.path.join()
fnmatch.*()
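
The mixing rule can be observed directly with os.path.join(). This is a small Python 3 sketch; `join_or_explain` is a hypothetical helper used only to keep the example from stopping at the exception.

```python
import os.path


def join_or_explain(first, second):
    """Join two path components, or describe the TypeError that is raised."""
    try:
        return os.path.join(first, second)
    except TypeError as exc:
        return 'TypeError: {}'.format(exc)


print(join_or_explain('dir', 'name'))     # str + str: allowed
print(join_or_explain(b'dir', b'name'))   # bytes + bytes: allowed
print(join_or_explain('dir', b'name'))    # str + bytes: rejected
```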

Functions accessing files

Policy: accept both bytes and str

io.open()
os.open()
os.chdir()
os.stat(), os.lstat()
os.rename()
os.unlink()
shutil.*()

os.rename(), shutil.copy*() and shutil.move() allow bytes for one argument and unicode for the other.

bytearray

In most cases, bytearray() can be used as bytes for a filename.

Unicode normalisation

Unicode characters can be normalized in 4 forms: NFC, NFD, NFKC or NFKD. Python never normalizes strings (nor filenames). Most operating systems do not normalize filenames either (Mac OS X with HFS+, which stores names in a decomposed form, is a notable exception). So users using different normal forms will be unable to retrieve their files. Don’t panic! All users use the same form.

Use unicodedata.normalize() to normalize a unicode string.
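
As an illustration, the same visible character «é» can be written as one code point or as two, and the two strings only compare equal after normalization. A minimal Python 3 sketch:

```python
import unicodedata

composed = '\u00e9'      # LATIN SMALL LETTER E WITH ACUTE (NFC form)
decomposed = 'e\u0301'   # 'e' followed by COMBINING ACUTE ACCENT (NFD form)

# Python compares code points, so the two spellings differ.
print(composed == decomposed)

# After normalizing to a common form they compare equal.
print(unicodedata.normalize('NFC', decomposed) == composed)
print(unicodedata.normalize('NFD', composed) == decomposed)
```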

I’m reading and parsing an Amazon XML file, and while the XML file shows a ’ (right single quotation mark), when I try to print it I get the following error:

'ascii' codec can't encode character u'\u2019' in position 16: ordinal not in range(128)

From what I’ve read online thus far, the error is coming from the fact that the XML file is in UTF-8, but Python wants to handle it as an ASCII encoded character. Is there a simple way to make the error go away and have my program print the XML as it reads?

asked Jul 11, 2010 at 19:00

Likely, your problem is that you parsed it okay, and now you’re trying to print the contents of the XML and you can’t because there are some non-ASCII Unicode characters. Try to encode your unicode string as ascii first:

unicodeData.encode('ascii', 'ignore')

The 'ignore' part tells it to just skip those characters. From the Python docs:

>>> # Python 2: u = unichr(40960) + u'abcd' + unichr(1972)
>>> u = chr(40960) + u'abcd' + chr(1972)
>>> u.encode('utf-8')
b'\xea\x80\x80abcd\xde\xb4'
>>> u.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in position 0: ordinal not in range(128)
>>> u.encode('ascii', 'ignore')
b'abcd'
>>> u.encode('ascii', 'replace')
b'?abcd?'
>>> u.encode('ascii', 'xmlcharrefreplace')
b'&#40960;abcd&#1972;'

You might want to read this article: http://www.joelonsoftware.com/articles/Unicode.html, which I found very useful as a basic tutorial on what’s going on. After the read, you’ll stop feeling like you’re just guessing what commands to use (or at least that happened to me).

answered Jul 11, 2010 at 19:10

Scott Stafford

A better solution:

if type(value) == str:
    # Ignore errors even if the string is not proper UTF-8 or has
    # broken marker bytes.
    # Python built-in function unicode() can do this.
    value = unicode(value, "utf-8", errors="ignore")
else:
    # Assume the value object has proper __unicode__() method
    value = unicode(value)

If you would like to read more about why:

http://docs.plone.org/manage/troubleshooting/unicode.html#id1

answered Jan 9, 2014 at 20:24

Paxwell

Don’t hardcode the character encoding of your environment inside your script; print Unicode text directly instead:

assert isinstance(text, unicode) # or str on Python 3
print(text)

If your output is redirected to a file (or a pipe), you can use the PYTHONIOENCODING envvar to specify the character encoding:

$ PYTHONIOENCODING=utf-8 python your_script.py >output.utf8

Otherwise, python your_script.py should work as is — your locale settings are used to encode the text (on POSIX check: LC_ALL, LC_CTYPE, LANG envvars — set LANG to a utf-8 locale if necessary).

To print Unicode on Windows, see this answer that shows how to print Unicode to Windows console, to a file, or using IDLE.

answered Jun 29, 2015 at 7:46

jfs

Excellent post : http://www.carlosble.com/2010/12/understanding-python-and-unicode/

# -*- coding: utf-8 -*-

def __if_number_get_string(number):
    converted_str = number
    if isinstance(number, (int, float)):
        converted_str = str(number)
    return converted_str


def get_unicode(strOrUnicode, encoding='utf-8'):
    strOrUnicode = __if_number_get_string(strOrUnicode)
    if isinstance(strOrUnicode, unicode):
        return strOrUnicode
    return unicode(strOrUnicode, encoding, errors='ignore')


def get_string(strOrUnicode, encoding='utf-8'):
    strOrUnicode = __if_number_get_string(strOrUnicode)
    if isinstance(strOrUnicode, unicode):
        return strOrUnicode.encode(encoding)
    return strOrUnicode

answered Sep 13, 2016 at 18:31

Ranvijay Sachan

You can use something of the form

s.decode('utf-8')

which will convert a UTF-8 encoded bytestring into a Python Unicode string. But the exact procedure to use depends on exactly how you load and parse the XML file, e.g. if you don’t ever access the XML string directly, you might have to use a decoder object from the codecs module.
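
One case where a decoder object helps is when the data arrives in chunks and a multi-byte character is split across two of them. Below is a sketch for Python 3 using codecs.getincrementaldecoder(); the helper name `decode_chunks` and the sample data are made up for illustration.

```python
import codecs


def decode_chunks(chunks, encoding='utf-8'):
    """Decode an iterable of byte chunks that may split characters."""
    decoder = codecs.getincrementaldecoder(encoding)()
    parts = [decoder.decode(chunk) for chunk in chunks]
    # final=True makes the decoder complain about a truncated character.
    parts.append(decoder.decode(b'', final=True))
    return ''.join(parts)


data = 'it\u2019s'.encode('utf-8')   # U+2019 takes three bytes in UTF-8
chunks = [data[:3], data[3:]]        # the split falls inside U+2019

# Decoding the first chunk alone would raise UnicodeDecodeError;
# the incremental decoder buffers the partial character instead.
print(ascii(decode_chunks(chunks)))
```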

answered Jul 11, 2010 at 19:04

David Z

I wrote the following to fix the nuisance non-ascii quotes and force conversion to something usable.

unicodeToAsciiMap = {u'\u2019':"'", u'\u2018':"`", }

def unicodeToAscii(inStr):
    try:
        return str(inStr)
    except:
        pass
    outStr = ""
    for i in inStr:
        try:
            outStr = outStr + str(i)
        except:
            if unicodeToAsciiMap.has_key(i):
                outStr = outStr + unicodeToAsciiMap[i]
            else:
                try:
                    print "unicodeToAscii: add to map:", i, repr(i), "(encoded as _)"
                except:
                    print "unicodeToAscii: unknown code (encoded as _)", repr(i)
                outStr = outStr + "_"
    return outStr

answered Sep 10, 2015 at 11:31

Try adding the following line at the top of your python script.

# _*_ coding:utf-8 _*_

answered Jan 20, 2016 at 5:08

abnvanand

Python 3.5, 2018

If you don’t know what the encoding is but the unicode parser is having issues, you can open the file in Notepad++ and in the top bar select Encoding->Convert to ANSI. Then you can write your Python like this:

with open('filepath', 'r', encoding='ANSI') as file:
    for word in file.read().split():
        print(word)

answered Oct 9, 2018 at 21:56

Atomar94

Python 2.7. Unicode Errors Simply Explained

I know I’m about 5 years late with this article, but people are still using Python 2.x, so I think this subject is still relevant.

Some facts first:

Unicode is an international encoding standard for use with different languages and scripts
In python-2.x, there are two types that deal with text.
1. str is an 8-bit string.
2. unicode is for strings of unicode code points.
  A code point is a number that maps to a particular abstract character. It is written using the notation U+12ca to mean the character with value 0x12ca (4810 decimal)
Encoding (noun) is a map of Unicode code points to a sequence of bytes. (Synonyms: character encoding, character set, codeset). Popular encodings: UTF-8, ASCII, Latin-1, etc.
Encoding (verb) is the process of converting unicode to the bytes of a str, and decoding is the reverse operation.
Python 2.x uses ASCII as a default encoding. (More about this later)
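
The terms above can be tried out in a few lines. Note that this sketch uses Python 3 syntax, where the text type is `str` and the 8-bit type is `bytes`, rather than the Python 2 types this article discusses.

```python
# A code point is just a number; chr() and ord() convert between the two.
code_point = 0x12ca
character = chr(code_point)
print(ord(character))    # 4810, the decimal value of U+12CA

# Encoding maps code points to bytes; the result depends on the encoding.
text = 'caf\u00e9'
as_utf8 = text.encode('utf-8')
as_latin1 = text.encode('latin-1')
print(as_utf8)           # b'caf\xc3\xa9'
print(as_latin1)         # b'caf\xe9'

# Decoding is the reverse operation and must use the same encoding.
round_trip = as_utf8.decode('utf-8')
print(round_trip == text)
```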

SyntaxError: Non-ASCII character

When you see something like this

SyntaxError: Non-ASCII character '\xd0' in file /tmp/p.py on line 2, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

you just need to define the encoding in the first or second line of your file.
All you need is to have the string coding=utf8 or coding: utf8 somewhere in a comment on one of those lines.
Python doesn’t care what goes before or after that string, so the following will work fine too:

# -*- encoding: utf-8 -*-

Notice the dash in utf-8. Python has many aliases for UTF-8 encoding, so you should not worry about dashes or case sensitivity.
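
To check which encoding a given declaration yields, the standard library exposes the detection logic as tokenize.detect_encoding(). The sketch below runs on Python 3, where the fallback is UTF-8 rather than the ASCII default of Python 2; the helper name `declared_encoding` is hypothetical.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the source encoding Python detects for this file content."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _ = tokenize.detect_encoding(readline)
    return encoding


# The usual Emacs-style declaration.
print(declared_encoding(b'# -*- coding: utf-8 -*-\nprint("hi")\n'))

# Extra text around "coding=..." is ignored.
print(declared_encoding(b'# my file, encoding=utf-8, thanks\nprint("hi")\n'))

# No declaration: Python 3 falls back to UTF-8.
print(declared_encoding(b'print("no declaration")\n'))
```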

`UnicodeEncodeError` Explained

>>> str(u'café')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)

The str() function encodes a string. We passed a unicode string, and it tried to encode it using the default encoding, which is ASCII. Now the error makes sense, because ASCII is a 7-bit encoding which doesn’t know how to represent characters outside the range 0..127.
Here we called str() explicitly, but something in your code may call it implicitly and you will also get UnicodeEncodeError.

How to fix: encode unicode string manually using .encode('utf8') before passing to str()

`UnicodeDecodeError` Explained

>>> utf_string = u'café'
>>> byte_string = utf_string.encode('utf8')
>>> unicode(byte_string)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)

Let’s say we somehow obtained a byte string byte_string which contains encoded UTF-8 characters. We could get this by simply using a library that returns str type.
Then we passed the string to a function that converts it to unicode. In this example we explicitly call unicode(), but some functions may call it implicitly and you’ll get the same error.
Now again, Python uses ASCII encoding by default, so it tries to decode the bytes using the default ASCII encoding. Since the byte 0xc3 (195 decimal) is outside the ASCII range, it fails with UnicodeDecodeError.

How to fix: decode str manually using .decode('utf8') before passing to your function.

Rule of Thumb

Make sure your code works only with Unicode strings internally, converting to a particular encoding on output, and decoding str on input.
Learn the libraries you are using, and find the places where they return str. Decode str before the return value is passed further in your code.

I use this helper function in my code:

def force_to_unicode(text):
    "If text is unicode, it is returned as is. If it's str, convert it to Unicode using UTF-8 encoding"
    return text if isinstance(text, unicode) else text.decode('utf8')
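
For comparison, a rough Python 3 counterpart of this helper might look like the sketch below. The name `force_to_text` and the decision to reject other types are assumptions made for illustration, not part of the original article.

```python
def force_to_text(value, encoding='utf-8'):
    """Return value as str; decode it first if it is bytes or bytearray."""
    if isinstance(value, str):
        return value
    if isinstance(value, (bytes, bytearray)):
        return bytes(value).decode(encoding)
    # Assumption: anything else is a programming error, so fail loudly.
    raise TypeError('expected str or bytes, got {}'.format(type(value).__name__))


print(ascii(force_to_text('caf\u00e9')))        # already text, returned as is
print(ascii(force_to_text(b'caf\xc3\xa9')))     # UTF-8 bytes are decoded
```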

Source: https://docs.python.org/2/howto/unicode.html
