Unicodeencodeerror python как исправить

debian, python, unicode

0

2

Traceback (most recent call last): File "betfair.py", line 10, in <module>   print('u041fu0410u0420u0421u0418u0422u0421u042f ' + baseurl + parturl) UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-7: ordinal not in range(128)

Строка такая:

print('ПАРСИТСЯ ' + baseurl + parturl)

Почему так и как исправить?

На убунте 14.04 работает

Ссылка

Ответ на:

комментарий
от yars068 12.05.16 11:18:17 MSK

Сделал

msg = 'ПАРСИТСЯ ' + baseurl + parturl
msg = msg.encode('utf-8')
print(msg)

Теперь мне выводит в виде кодов юникода, а надо по-русски( Я с мобилки — все не очень удобно делать(
Что я делаю не так и почему на убунте работает?

Qwentor ★★★★★

(12.05.16 12:07:11 MSK)

Показать ответ
Ссылка

Ответ на:

комментарий
от Qwentor 12.05.16 12:07:11 MSK

Ответ на:

комментарий
от bvn13 12.05.16 15:33:20 MSK

Ответ на:

комментарий
от Qwentor 12.05.16 15:49:11 MSK

Ну так поставь себе юникодную локаль. sudo dpkg-reconfigure locales, и там нужные настройки.

Показать ответ
Ссылка

Ответ на:

комментарий
от proud_anon 12.05.16 16:05:48 MSK

Недавно наткнулся на такое. Решение — либо костылять свой print через sys.stdout.buffer, либо поправить локаль в переменной окружения на UTF-8. Например LC_ALL='ru_RU.UTF-8',LANG='ru_RU.UTF-8'.

PolarFox ★★★★★

(12.05.16 16:38:37 MSK)

Ссылка

Ответ на:

комментарий
от Qwentor 12.05.16 16:33:51 MSK

ru_RU.UTF-8

Не обязательно. Любая юникодная сойдёт. Или любая с русскими символами.

PolarFox ★★★★★

(12.05.16 16:39:31 MSK)

Ссылка

Ответ на:

комментарий
от Qwentor 12.05.16 16:33:51 MSK

Мне нужно установить ru_RU.UTF-8 по-умолчанию? Установил как вторую — без толку

Любую .UTF-8. Если ты почему-то не хочешь устанавливать UTF-8-локаль по умолчанию, то запускай скрипт как LANG=ru_RU.UTF betfair.py (или любую другую UTF-8-локаль).

Показать ответ
Ссылка

Ответ на:

комментарий
от proud_anon 12.05.16 16:44:50 MSK

Йоу, спасибище! Помогло!
Нажал бы плюсик, если б было)

Qwentor ★★★★★

(12.05.16 17:13:07 MSK)

Ссылка

Вы не можете добавлять комментарии в эту тему. Тема перемещена в архив.

Похожие темы

Форум
Снова юникод в python (2012)
Форум
Питон, юникод и конвеер (2015)
Форум
Python: как принудить к utf-8? (2009)
Форум
python && utf-8 (2005)
Форум
[conky][utf-8][python] Не работает вывод в conky (2010)

Форум
[python][unicode]jinja (2008)
Форум
Python, Informix и more (2008)
Форум
[python3.2][archlinux] (2011)
Форум
Python 3 и locale (2015)
Форум
pexpect (2015)

Источник

В очередной раз запустив в Windows свой скрипт-информер для СамИздат-а и увидев в консоли «загадочные символы» я сказал себе: «Да уже сделай, наконец, себе нормальный кросс-платформенный логгинг!»

Об этом, и о том, как раскрасить вывод лога наподобие Django-вского в Win32 я попробую рассказать под хабра-катом _{(Всё ниженаписанное применимо к Python 2.x ветке)}

Задача первая. Корректный вывод текста в консоль

Симптомы

До тех пор, пока мы не вносим каких-либо «поправок» в проинициализировавшуюся систему ввода-вывода и используем только оператор print с unicode строками, всё идёт более-менее нормально вне зависимости от ОС.

«Чудеса» начинаются дальше — если мы поменяли какие-либо кодировки (см. чуть дальше) или воспользовались модулем logging для вывода на экран. Вроде бы настроив ожидаемое поведение в Linux, в Windows получаешь «мусор» в utf-8. Начинаешь править под Win — вылезает 1251 в консоли…

Теоретический экскурс

За параметры преобразования символов и вывода их в консоль отвечают сразу несколько параметров:

Кодировка по-умолчанию модуля sys — sys.getdefaultencoding()
Предпочитаемая кодировка для текущей локали — locale.getpreferredencoding()
Кодировка стандартных потоков sys.stdout, sys.stderr
Кое-какую смуту вносит и кодировка самого файла, но договоримся, что у нас всё унифицировано и все файлы в utf-8 и содержат корректный заголовок
«Бонусом» идёт любовь стандартных потоков выдавать исключения, если, даже при корректно установленной кодировке, не найдётся нужного символа при печати unicode строки

Ищем решение

Очевидно, чтобы избавиться от всех этих проблем, надо как-то привести их к единообразию.
И вот тут начинается самое интересное:

# -*- coding: utf-8 -*-
>>> import sys
>>> import locale
>>> print sys.getdefaultencoding()
ascii
>>> print locale.getpreferredencoding() # linux
UTF-8
>>> print locale.getpreferredencoding() # win32/rus
cp1251
# и самое интересное:
>>> print sys.stdout.encoding # linux
UTF-8
>>> print sys.stdout.encoding # win32
cp866

Ага! Оказывается «система» у нас живёт вообще в ASCII. Как следствие — попытка по-простому работать с вводом/выводом заканчивается «любимым» исключением UnicodeEncodeError/UnicodeDecodeError.

Кроме того, как замечательно видно из примера, если в linux у нас везде utf-8, то в Windows — две разных кодировки — так называемая ANSI, она же cp1251, используемая для графической части и OEM, она же cp866, для вывода текста в консоли. OEM кодировка пришла к нам со времён DOS-а и, теоретически, может быть также перенастроена специальными командами, но на практике никто этого давно не делает.

До недавнего времени я пользовался распространённым способом исправить эту неприятность:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# ==============
#      Main script file
# ==============
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
# или
import locale
sys.setdefaultencoding(locale.getpreferredencoding())
# ...

И это, в общем-то, работало. Работало до тех пор, пока пользовался print-ом. При переходе к выводу на экран через logging всё сломалось.
Угу, подумал я, раз «оно» использует кодировку по-умолчанию, — выставлю-ка я ту же кодировку, что в консоли:

sys.setdefaultencoding(sys.stdout.encoding or sys.stderr.encoding)

Уже чуть лучше, но:

В Win32 текст печатается кракозябрами, явно напоминающими cp1251
При запуске с перенаправленным выводом опять получаем не то, что ожидалось
Периодически, при попытке напечатать текст, где есть преобразованный в unicode символ типа ① (①), «любезно» добавленный автором в какой-нибудь заголовок, снова получаем UnicodeEncodeError!

Присмотревшись к первому примеру, нетрудно заметить, что так желаемую кодировку «cp866» можно получить только проверив атрибут соответствующего потока. А он далеко не всегда оказывается доступен.
Вторая часть задачи — оставить системную кодировку в utf-8, но корректно настроить вывод в консоль.
Для индивидуальной настройки вывода надо переопределить обработку выходных потоков примерно так:

import sys
import codecs
sys.stdout = codecs.getwriter('cp866')(sys.stdout,'replace')

Этот код позволяет убить двух зайцев — выставить нужную кодировку и защититься от исключений при печати всяких умляутов и прочей типографики, отсутствующей в 255 символах cp866.
Осталось сделать этот код универсальным — откуда мне знать OEM кодировку на произвольном сферическом компе? Гугление на предмет готовой поддержки ANSI/OEM кодировок в python ничего разумного не дало, посему пришлось немного вспомнить WinAPI

UINT GetOEMCP(void); // Возвращает системную OEM кодовую страницу как число
UINT GetANSICP(void); // то же для ANSI кодовой странцы

… и собрать всё вместе:

# -*- coding: utf-8 -*-
import sys
import codecs

def setup_console(sys_enc="utf-8"):
    reload(sys)
    try:
        # для win32 вызываем системную библиотечную функцию
        if sys.platform.startswith("win"):
            import ctypes
            enc = "cp%d" % ctypes.windll.kernel32.GetOEMCP() #TODO: проверить на win64/python64
        else:
            # для Linux всё, кажется, есть и так
            enc = (sys.stdout.encoding if sys.stdout.isatty() else
                        sys.stderr.encoding if sys.stderr.isatty() else
                            sys.getfilesystemencoding() or sys_enc)

        # кодировка для sys
        sys.setdefaultencoding(sys_enc)

        # переопределяем стандартные потоки вывода, если они не перенаправлены
        if sys.stdout.isatty() and sys.stdout.encoding != enc:
            sys.stdout = codecs.getwriter(enc)(sys.stdout, 'replace')

        if sys.stderr.isatty() and sys.stderr.encoding != enc:
            sys.stderr = codecs.getwriter(enc)(sys.stderr, 'replace')

    except:
        pass # Ошибка? Всё равно какая - работаем по-старому...

Задача вторая. Раскрашиваем вывод

Насмотревшись на отладочный вывод Джанги в связке с werkzeug, захотелось чего-то подобного для себя. Гугление выдаёт несколько проектов разной степени проработки и удобности — от простейшего наследника logging.StreamHandler, до некоего набора, при импорте автоматически подменяющего стандартный StreamHandler.

Попробовав несколько из них, я, в итоге, воспользовался простейшим наследником StreamHandler, приведённом в одном из комментов на Stack Overflow и пока вполне доволен:

class ColoredHandler( logging.StreamHandler ):
    def emit( self, record ):
        # Need to make a actual copy of the record
        # to prevent altering the message for other loggers
        myrecord = copy.copy( record )
        levelno = myrecord.levelno
        if( levelno >= 50 ): # CRITICAL / FATAL
            color = 'x1b[31;1m' # red
        elif( levelno >= 40 ): # ERROR
            color = 'x1b[31m' # red
        elif( levelno >= 30 ): # WARNING
            color = 'x1b[33m' # yellow
        elif( levelno >= 20 ): # INFO
            color = 'x1b[32m' # green
        elif( levelno >= 10 ): # DEBUG
            color = 'x1b[35m' # pink
        else: # NOTSET and anything else
            color = 'x1b[0m' # normal
        myrecord.msg = (u"%s%s%s" % (color, myrecord.msg, 'x1b[0m')).encode('utf-8')  # normal
        logging.StreamHandler.emit( self, myrecord )

Однако, в Windows всё это работать, разумеется, отказалось. И если раньше можно было «включить» поддержку ansi-кодов в консоли добавлением «магического» ansi.dll из проекта symfony куда-то в недра системных папок винды, то, начиная (кажется) с Windows 7 данная возможность окончательно «выпилена» из системы. Да и заставлять юзера копировать какую-то dll в системную папку тоже как-то «не кошерно».

Снова обращаемся к гуглу и, снова, получаем несколько вариантов решения. Все варианты так или иначе сводятся к подмене вывода ANSI escape-последовательностей вызовом WinAPI для управления атрибутами консоли.

Побродив некоторое время по ссылкам, набрёл на проект colorama. Он как-то понравился мне больше остального. К плюсам именно этого проекта ст́оит отнести, что подменяется весь консольный вывод — можно выводить раскрашенный текст простым print u"x1b[31;40mЧто-то красное на чёрномx1b[0m" если вдруг захочется поизвращаться.

Сразу замечу, что текущая версия 0.1.18 содержит досадный баг, ломающий вывод unicode строк. Но простейшее решение я привёл там же при создании issue.

Собственно осталось объединить оба пожелания и начать пользоваться вместо традиционных «костылей»:

# -*- coding: utf-8 -*-
import sys
import codecs
import copy
import logging

#: Is ANSI printing available
ansi = not sys.platform.startswith("win")

def setup_console(sys_enc='utf-8', use_colorama=True):
    """
    Set sys.defaultencoding to `sys_enc` and update stdout/stderr writers to corresponding encoding

    .. note:: For Win32 the OEM console encoding will be used istead of `sys_enc`
    """
    global ansi
    reload(sys)
    try:
        if sys.platform.startswith("win"):
#... код, показанный выше
        if use_colorama and sys.platform.startswith("win"):
            try:
                # пробуем подключить colorama для винды и взводим флаг `ansi`, если всё получилось
                from colorama import init
                init()
                ansi = True
            except:
                pass

class ColoredHandler( logging.StreamHandler ):
    def emit( self, record ):
        # Need to make a actual copy of the record
        # to prevent altering the message for other loggers
        myrecord = copy.copy( record )
        levelno = myrecord.levelno
        if( levelno >= 50 ): # CRITICAL / FATAL
            color = 'x1b[31;1m' # red
        elif( levelno >= 40 ): # ERROR
            color = 'x1b[31m' # red
        elif( levelno >= 30 ): # WARNING
            color = 'x1b[33m' # yellow
        elif( levelno >= 20 ): # INFO
            color = 'x1b[32m' # green
        elif( levelno >= 10 ): # DEBUG
            color = 'x1b[35m' # pink
        else: # NOTSET and anything else
            color = 'x1b[0m' # normal
        myrecord.msg = (u"%s%s%s" % (color, myrecord.msg, 'x1b[0m')).encode('utf-8')  # normal
        logging.StreamHandler.emit( self, myrecord )

Дальше в своём проекте, в запускаемом файле пользуемся:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from setupcon import setup_console
setup_console('utf-8', False)
#...

# или если будем пользоваться раскрашиванием логов
import setupcon
setupcon.setup_console()
import logging
#...
if setupcon.ansi:
    logging.getLogger().addHandler(setupcon.ColoredHandler())

На этом всё. Из потенциальных доработок осталось проверить работоспособность под win64 python и, возможно, добаботать ColoredHandler чтобы проверял себя на isatty, как в более сложных примерах на том же StackOverflow.

Итоговый вариант получившегося модуля можно забрать на dumpz.org

Источник

Introduction to Python Unicode Error

In Python, Unicode is defined as a string type for representing the characters that allow the Python program to work with any type of different possible characters. For example, any path of the directory or any link address as a string. When we use such a string as a parameter to any function, there is a possibility of the occurrence of an error. Such error is known as Unicode error in Python. We get such an error because any character after the Unicode escape sequence (“ u ”) produces an error which is a typical error on windows.

Working of Unicode Error in Python with Examples

Unicode standard in Python is the representation of characters in code point format. These standards are made to avoid ambiguity between the characters specified, which may occur Unicode errors. For example, let us consider “ I ” as roman number one. It can be even considered the capital alphabet “ i ”; they both look the same, but they are two different characters with a different meaning to avoid such ambiguity; we use Unicode standards.

In Python, Unicode standards have two types of error: Unicode encodes error and Unicode decode error. In Python, it includes the concept of Unicode error handlers. These handlers are invoked whenever a problem or error occurs in the process of encoding or decoding the string or given text. To include Unicode characters in the Python program, we first use Unicode escape symbol u before any string, which can be considered as a Unicode-type variable.

Syntax:

Unicode characters in Python program can be written as follows:

“u dfskgfkdsg”

“U sakjhdxhj”

“u1232hgdsa”

In the above syntax, we can see 3 different ways of declaring Unicode characters. In the Python program, we can write Unicode literals with prefix either “u” or “U” followed by a string containing alphabets and numerical where we can see the above two syntax examples. At the end last syntax sample, we can also use the “u” Unicode escape sequence to declare Unicode characters in the program. In this, we have to note that using “u”, we can write a string containing any alphabet or numerical, but when we want to declare any hex value then we have to “x” escape sequence which takes two hex digits and for octal, it will take digit 777.

Example #1

Now let us see an example below for declaring Unicode characters in the program.

Code:

#!/usr/bin/env python
# -*- coding: latin-1 -*-
a= u'dfsfxacu1234'
print("The value of the above unicode literal is as follows:")
print(ord(a[-1]))

Output:

In the above program, we can see the sample of Unicode literals in the python program, but before that, we need to declare encoding, which is different in different versions of Python, and in this program, we can see in the first two lines of the program.

Now we will see the Unicode errors such as Unicode encoding Error and Unicode decoding errors, which are handled by Unicode error handlers, are invoked automatically whenever the errors are encountered. There are 3 typical errors in Python Unicode error handlers.

Strict error in Python raises UnicodeEncodeError and UnicodeDecodeError for encoding and decoding errors that are occurred, respectively.

Example #2

UnicodeEncodeError demonstration and its example.

In Python, it cannot detect Unicode characters, and therefore it throws an encoding error as it cannot encode the given Unicode string.

Code:

str(u'éducba')

Output:

In the above program, we can see we have passed the argument to the str() function, which is a Unicode string. But this function will use the default encoding process ASCII. As we can see in the above statement, we have not specified any encoding at the starting of this program, and therefore it throws an error, and the default encoding that is used is 7-bit encoding, and it cannot recognize the characters that are outside the 0 to 128 range. Therefore, we can see the error that is displayed in the above screenshot.

The above program can be fixed by encoding Unicode string manually, such as .encode(‘utf8’), before passing the Unicode string to the str() function.

Example #3

In this program, we have called the str() function explicitly, which may again throw an UnicodeEncodeError.

Code:

a = u'café'
b = a.encode('utf8')
r = str(b)
print("The unicode string after fixing the UnicodeEncodeError is as follows:")
print(r)

Output:

In the above, we can show how we can avoid UnicodeEncodeError manually by using .encode(‘utf8’) to the Unicode string.

Example #4

Now we will see the UnicodeDecodeError demonstration and its example and how to avoid it.

Code:

a = u'éducba' b = a.encode('utf8') unicode(b)

Output:

In the above program, we can see we are trying to print the Unicode characters by encoding first; then we are trying to convert the encoded string into Unicode characters, which mean decoding back to Unicode characters as given at the starting. In the above program, when we run, we get an error as UnicodeDecodeError. So to avoid this error, we have to manually decode the Unicode character “b”.

Decode

So we can fix it by using the below statement, and we can see it in the above screenshot.

b.decode(‘utf8’)

Conclusion

In this article, we conclude that in Python, Unicode literals are other types of string for representing different types of string. In this article, we saw different errors like UnicodeEncodeError and UnicodeDecodeError, which are used to encode and decode strings in the program, along with examples. In this article, we also saw how to fix these errors manually by passing the string to the function.

Похожие темы

Задача первая. Корректный вывод текста в консоль

Симптомы

Теоретический экскурс

Ищем решение

Задача вторая. Раскрашиваем вывод

Introduction to Python Unicode Error

Working of Unicode Error in Python with Examples

Example #1

Example #2

Example #3

Example #4

Conclusion

Recommended Articles

Читайте также: