If you need to store UTF8 data in your database, you need a database that accepts UTF8. You can check the encoding of your database in pgAdmin: just right-click the database and select "Properties".
But that error seems to be telling you there's some invalid UTF8 data in your source file. That means that the copy utility has detected or guessed that you're feeding it a UTF8 file.
If you're running under some variant of Unix, you can check the encoding (more or less) with the file utility.
$ file yourfilename
yourfilename: UTF-8 Unicode English text
(I think that will work on Macs in the terminal, too.) I'm not sure how to do that under Windows.
If you use that same utility on a file that came from Windows systems (that is, a file that’s not encoded in UTF8), it will probably show something like this:
$ file yourfilename
yourfilename: ASCII text, with CRLF line terminators
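If `file` isn't available (on Windows, for instance), iconv can serve as a rough validity check: asking it to convert a file from UTF-8 to UTF-8 succeeds only if every byte sequence in the file is valid UTF-8. A small sketch, using made-up sample files:

```shell
# A file of plain ASCII/UTF-8 text passes the round-trip check...
printf 'hello world\n' > good.txt
iconv -f UTF-8 -t UTF-8 good.txt > /dev/null && echo 'good.txt: valid UTF-8'

# ...while a file containing a stray Latin-1 byte (0xe9) does not.
printf 'caf\351\n' > bad.txt
iconv -f UTF-8 -t UTF-8 bad.txt > /dev/null 2>&1 || echo 'bad.txt: not valid UTF-8'
```

This only tells you whether the file *could* be UTF-8, not what encoding it actually is, but that's usually enough to decide whether a conversion step is needed.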
If things stay weird, you might try to convert your input data to a known encoding, to change your client’s encoding, or both. (We’re really stretching the limits of my knowledge about encodings.)
You can use the iconv utility to change the encoding of the input data.
iconv -f original_charset -t utf-8 originalfile > newfile
You can change the psql (client) encoding by following the instructions in Character Set Support. On that page, search for the phrase "To enable automatic character set conversion".
I've spent the last 8 hours trying to import the output of 'mysqldump --compatible=postgresql' into PostgreSQL 8.4.9, and I've read at least 20 different threads here and elsewhere already about this specific problem, but found no real usable answer that works.
MySQL 5.1.52 data dumped:
mysqldump -u root -p --compatible=postgresql --no-create-info --no-create-db --default-character-set=utf8 --skip-lock-tables rt3 > foo
PostgreSQL 8.4.9 server as destination
Loading the data with ‘psql -U rt_user -f foo’ is reporting (many of these, here’s one example):
psql:foo:29: ERROR: invalid byte sequence for encoding "UTF8": 0x00
HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
According to the following, there are no NULL (0x00) characters in the input file.
database-dumps:rcf-temp1# sed 's/x0/ /g' < foo > nonulls
database-dumps:rcf-temp1# sum foo nonulls
04730 2545610 foo
04730 2545610 nonulls
database-dumps:rcf-temp1# rm nonulls
Likewise, another check with Perl shows no NULLs:
database-dumps:rcf-temp1# perl -ne '/00/ and print;' foo
database-dumps:rcf-temp1#
As the "HINT" in the error mentions, I have tried every possible way to set 'client_encoding' to 'UTF8'. I succeed, but it has no effect toward solving my problem.
database-dumps:rcf-temp1# psql -U rt_user --variable=client_encoding=utf-8 -c "SHOW client_encoding;" rt3
client_encoding
-----------------
UTF8
(1 row)
database-dumps:rcf-temp1#
Perfect, yet:
database-dumps:rcf-temp1# psql -U rt_user -f foo --variable=client_encoding=utf-8 rt3
...
psql:foo:29: ERROR: invalid byte sequence for encoding "UTF8": 0x00
HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
...
Barring the "according to Hoyle" correct answer, which would be fantastic to hear, and knowing that I really don't care about preserving any non-ASCII characters for this seldom-referenced data, what suggestions do you have?
Update: I get the same error with an ASCII-only version of the same dump file at import time. Truly mind-boggling:
database-dumps:rcf-temp1# # convert any non-ASCII character to a space
database-dumps:rcf-temp1# perl -i.bk -pe 's/[^[:ascii:]]/ /g;' mysql5-dump.sql
database-dumps:rcf-temp1# sum mysql5-dump.sql mysql5-dump.sql.bk
41053 2545611 mysql5-dump.sql
50145 2545611 mysql5-dump.sql.bk
database-dumps:rcf-temp1# cmp mysql5-dump.sql mysql5-dump.sql.bk
mysql5-dump.sql mysql5-dump.sql.bk differ: byte 1304850, line 30
database-dumps:rcf-temp1# # GOOD!
database-dumps:rcf-temp1# psql -U postgres -f mysql5-dump.sql --variable=client_encoding=utf-8 rt3
...
INSERT 0 416
psql:mysql5-dump.sql:30: ERROR: invalid byte sequence for encoding "UTF8": 0x00
HINT: This error can also happen if the byte sequence does not match the encod.
INSERT 0 455
INSERT 0 424
INSERT 0 483
INSERT 0 447
INSERT 0 503
psql:mysql5-dump.sql:36: ERROR: invalid byte sequence for encoding "UTF8": 0x00
HINT: This error can also happen if the byte sequence does not match the encod.
INSERT 0 502
INSERT 0 507
INSERT 0 318
INSERT 0 284
psql:mysql5-dump.sql:41: ERROR: invalid byte sequence for encoding "UTF8": 0x00
HINT: This error can also happen if the byte sequence does not match the encod.
INSERT 0 382
INSERT 0 419
INSERT 0 247
psql:mysql5-dump.sql:45: ERROR: invalid byte sequence for encoding "UTF8": 0x00
HINT: This error can also happen if the byte sequence does not match the encod.
INSERT 0 267
INSERT 0 348
^C
One of the tables in question is defined as:
Table "public.attachments"
Column | Type | Modifiers
-----------------+-----------------------------+--------------------------------
id | integer | not null default nextval('atta)
transactionid | integer | not null
parent | integer | not null default 0
messageid | character varying(160) |
subject | character varying(255) |
filename | character varying(255) |
contenttype | character varying(80) |
contentencoding | character varying(80) |
content | text |
headers | text |
creator | integer | not null default 0
created | timestamp without time zone |
Indexes:
"attachments_pkey" PRIMARY KEY, btree (id)
"attachments1" btree (parent)
"attachments2" btree (transactionid)
"attachments3" btree (parent, transactionid)
I do not have the liberty to change the type for any part of the DB schema. Doing so would likely break future upgrades of the software, etc.
The likely problem column is 'content', of type 'text' (perhaps others in other tables as well). As I already know from previous research, PostgreSQL will not allow NULL (0x00) in 'text' values. However, please see above where both sed and Perl show no NULL characters, and then further down where I strip all non-ASCII characters from the entire dump file but it still barfs.
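For what it's worth, one explanation consistent with all of these symptoms (hedged, since I can't inspect the actual dump): mysqldump writes binary zeros inside string literals as the two-character escape sequence `\0`. The file then contains no raw 0x00 bytes, so the sed, Perl, and ASCII-only checks all pass, yet the server decodes `\0` back into a NUL byte at INSERT time and rejects it. A sketch of how to confirm and work around this, using a fabricated one-line stand-in for the real dump file:

```shell
# Fabricated stand-in for the real dump file.
printf '%s\n' "INSERT INTO t VALUES ('abc\0def');" > foo

# Count lines containing the literal two-character sequence \0.
grep -cF '\0' foo

# Workaround: strip the \0 escapes (use with care -- this also removes any
# legitimate backslash-zero text; re-dumping with mysqldump --hex-blob may be
# a cleaner fix if the NULs live in binary columns).
sed 's/\\0//g' foo > foo.clean
cat foo.clean
```

If the grep count on the real `foo` is non-zero, that would explain the "0x00" errors despite a byte-clean file.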
In this article, we will see how to fix the error 'invalid byte sequence for encoding UTF8' while restoring a PostgreSQL database. At work, I got a task to move DBs with ASCII encoding to UTF8 encoding. Let me first confess that the ASCII DBs were not created intentionally; someone created them by accident! Having an ASCII-encoded DB is very dangerous, and it should be moved to UTF8 encoding as soon as possible. So the initial plan was to create an archive dump of the DB with pg_dump, create a new DB with UTF8 encoding, and restore the dump to the new DB using pg_restore. The plan worked for most of the DBs, but failed for one DB with the error below.
DETAIL: Proceeding with relation creation anyway.
pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 35091; 0 2527787452 TABLE DATA my_table release
pg_restore: [archiver (db)] COPY failed for table "my_table": ERROR: invalid byte sequence for encoding "UTF8": 0xa5
CONTEXT: COPY my_table, line 41653
WARNING: errors ignored on res
As the error says, there are some invalid UTF8 characters in table "my_table" which prevent pg_restore from restoring that particular table. I did a lot of research and googling to see what to do, and I will list the steps I took.
Assume 'my_db' and 'my_table' are the database name and table name, respectively.
Step 1:
Dump the database, excluding the particular table 'my_table'. I would suggest dumping the database in archive format to save time and disk space.
pg_dump -Fc -T 'my_table' -p 1111 -f dbdump.pgd my_db
Step 2:
Create the new database with UTF8 encoding and restore the dump.
pg_restore -p 2222 -j 8 -d my_new_db dbdump.pgd
The restore should succeed, since we didn't restore the offending table.
Step 3:
Dump the offending table ‘my_table’ in plain text format.
pg_dump -Fp -t 'my_table' -p 1111 my_db > my_db_table_only.sql
Step 4:
Now we have the table data in plain text. Let's find the invalid UTF8 characters in the file by running the command below (make sure your locale is set to UTF-8).
# grep -naxv '.*' my_db_table_only.sql
102:2010-03-23 ��ԥ� data1 data2
Here � represents an invalid UTF8 character, present on line 102 of the file.
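To see why this grep invocation works: `-x` requires the pattern to match the whole line, and in a UTF-8 locale `.` only matches valid characters, so `-v` prints exactly the lines containing byte sequences that are not valid UTF-8 (`-a` treats the file as text, `-n` numbers the lines). A self-contained demonstration with a made-up file:

```shell
# Line 2 carries a lone ISO-8859-1 byte (0xa5), which is invalid as UTF-8.
printf 'good line\nbad \245 line\n' > demo.sql

# In a UTF-8 locale, only the invalid line is printed, with its line number.
LC_ALL=C.UTF-8 grep -naxv '.*' demo.sql
```

Note the LC_ALL setting: if the locale were plain C, `.` would match any byte and nothing would be flagged.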
Step 5:
Find which charset the invalid UTF8 characters belong to.
# grep -naxv '.*' my_db_table_only.sql > test.txt
# file -i test.txt
test.txt: text/plain; charset=iso-8859-1
As per the output, those characters belong to iso-8859-1. The charset may be different in your case.
Step 6:
Let's convert iso-8859-1 to UTF8 using the iconv command.
# grep -naxv '.*' my_db_table_only.sql | iconv --from-code=ISO-8859-1 --to-code=UTF-8
102:2010-03-23 ¥Êԥ¡ data1 data2
Now you have the characters in UTF8 encoding, so you can replace ��ԥ� with ¥Êԥ¡ on line 102 of the dump file (I used the nano editor to do this; I ran into issues with Vim).
I know that replacing characters manually can be a pain if there are a lot of invalid UTF8 characters. We can run iconv on the whole file as shown below.
iconv --from-code=ISO-8859-1 --to-code=UTF-8 my_db_table_only.sql > my_db_table_only_utf8.sql
But I won't recommend this, as it may change valid characters (e.g., Chinese characters) to something else. If you plan to run iconv on the whole file, make sure only invalid UTF8 characters were converted by taking a diff of both files.
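One way to take that diff safely, sketched on a hypothetical two-line dump: convert the whole file, compare it against the original, and only keep the converted copy if the changes are confined to the known-bad lines.

```shell
# Hypothetical dump file: line 2 holds an ISO-8859-1 byte (0xa5).
printf 'data0\ndata1 \245 data2\n' > my_dump.sql

# Convert everything, then diff to review exactly what changed.
iconv -f ISO-8859-1 -t UTF-8 my_dump.sql > my_dump_utf8.sql
diff my_dump.sql my_dump_utf8.sql || true

# Sanity check: the converted file should now round-trip as valid UTF-8.
iconv -f UTF-8 -t UTF-8 my_dump_utf8.sql > /dev/null && echo 'converted file is valid UTF-8'
```

If the diff touches lines other than the ones grep flagged in Step 4, fall back to editing just the flagged lines by hand.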
Step 7:
Once the characters are replaced, restore the table to the database.
psql -p 2222 -d my_new_db -f my_db_table_only.sql
No more "invalid byte sequence for encoding UTF8" error. Thanks for taking the time to read my blog.
Symptoms
- When migrating Stash's datastore to a PostgreSQL database, the following error is shown in the administration web interface:
  Stash could not be migrated to the new database. PostgreSQL does not allow null characters (U+0000) in text columns. See the following knowledge base to solve the problem: https://confluence.atlassian.com/x/OwOCKQ
- When restoring a backup to a Stash instance that uses a PostgreSQL database, the restore fails and the following error appears in the atlassian-stash.log:
  Caused by: org.postgresql.util.PSQLException: ERROR: invalid byte sequence for encoding "UTF8": 0x00
      at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2198) ~[postgresql-9.3-1102.jdbc41.jar:na]
      at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1927) ~[postgresql-9.3-1102.jdbc41.jar:na]
      at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:255) ~[postgresql-9.3-1102.jdbc41.jar:na]
      at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:561) ~[postgresql-9.3-1102.jdbc41.jar:na]
      at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:419) ~[postgresql-9.3-1102.jdbc41.jar:na]
      at org.postgresql.jdbc2.AbstractJdbc2Statement.executeUpdate(AbstractJdbc2Statement.java:365) ~[postgresql-9.3-1102.jdbc41.jar:na]
      at com.jolbox.bonecp.PreparedStatementHandle.executeUpdate(PreparedStatementHandle.java:203) ~[bonecp-0.7.1.RELEASE.jar:0.7.1.RELEASE]
      at com.atlassian.stash.internal.backup.liquibase.DefaultLiquibaseDao.insert(DefaultLiquibaseDao.java:272) ~[stash-dao-impl-3.6.0-SNAPSHOT.jar:na]
      ... 39 common frames omitted
Cause
This problem occurs because PostgreSQL does not allow null characters (U+0000) in its text data types. As a result, when migrating or restoring a backup to a PostgreSQL database, the operation can fail with the error above. This problem is restricted to PostgreSQL. Other databases supported by Stash are not affected by null characters.
Resolution
Follow the steps below to sanitize the source database and then re-run the migration or restore.
- Stop Stash.
- Find and remove the null characters (U+0000) in the source database text columns. Most likely candidates are comments (the sta_comment table) or plugin settings (the plugin_setting table). To remove the null characters on those tables, run the following SQL queries on the source database.
- If the source database is MySQL:
  SELECT * FROM sta_comment WHERE comment_text like concat('%', 0x00, '%');
  UPDATE sta_comment SET comment_text = replace(comment_text, 0x00, '') WHERE comment_text like concat('%', 0x00, '%');
  SELECT * FROM plugin_setting WHERE key_value like concat('%', 0x00, '%');
  UPDATE plugin_setting SET key_value = replace(key_value, 0x00, '') WHERE key_value like concat('%', 0x00, '%');
- If the source database is Oracle:
  SELECT * FROM sta_comment WHERE instr(comment_text, unistr('\0000')) > 0;
  UPDATE sta_comment SET comment_text = replace(comment_text, unistr('\0000')) WHERE instr(comment_text, unistr('\0000')) > 0;
  SELECT * FROM plugin_setting WHERE instr(key_value, unistr('\0000')) > 0;
  UPDATE plugin_setting SET key_value = replace(key_value, unistr('\0000')) WHERE instr(key_value, unistr('\0000')) > 0;
- If the source database is Microsoft SQL Server, execute the following T-SQL code (note that a custom function is used because the built-in REPLACE function cannot replace null characters):
  IF OBJECT_ID (N'dbo.removeNullCharacters', N'FN') IS NOT NULL
      DROP FUNCTION removeNullCharacters;
  GO
  CREATE FUNCTION dbo.removeNullCharacters(@s nvarchar(max))
  RETURNS nvarchar(max)
  AS
  BEGIN
      DECLARE @c nchar(1)
      DECLARE @p int
      DECLARE @ret nvarchar(max)
      IF @s IS NULL
          SET @ret = @s
      ELSE
      BEGIN
          SET @p = 0
          SET @ret = ''
          WHILE (@p <= LEN(@s))
          BEGIN
              SET @c = SUBSTRING(@s, @p, 1)
              IF @c <> nchar(0)
              BEGIN
                  SET @ret = @ret + @c
              END
              SET @p = @p + 1
          END
      END
      RETURN @ret
  END;
  SELECT * FROM sta_comment WHERE cast(comment_text AS varchar) like '%' + char(0) + '%';
  UPDATE sta_comment SET comment_text = dbo.removeNullCharacters(comment_text) WHERE cast(comment_text AS varchar) like '%' + char(0) + '%';
  SELECT * FROM plugin_setting WHERE cast(key_value AS varchar) like '%' + char(0) + '%';
  UPDATE plugin_setting SET key_value = dbo.removeNullCharacters(key_value) WHERE cast(key_value AS varchar) like '%' + char(0) + '%';
- If the source database is HSQLDB, either:
  - Migrate the database to an intermediate external database (such as MySQL), or
  - Find the problematic rows using the following queries and manually edit them to remove the null characters (U+0000):
    SELECT * FROM sta_comment WHERE comment_text like U&'%\0000%';
    SELECT * FROM plugin_setting WHERE key_value like U&'%\0000%';
Note: Before accessing Stash’s HSQLDB (internal database) with an external tool, ensure Stash is not running.
Note: Stash’s HSQLDB database (its internal database) can be opened by any database management tool that supports the JDBC protocol (such as DbVisualizer), using the following settings:
- Database driver: HSQLDB Server
- Database driver location: STASH_INSTALL/atlassian-stash/WEB-INF/lib/hsqldb-2.2.4.jar (where STASH_INSTALL is the path to the Stash installation directory)
- Database user: SA
- JDBC URL: jdbc:hsqldb:file:STASH_HOME/shared/data/db;shutdown=true;hsqldb.tx=mvlocks (where STASH_HOME is the path to the Stash home directory)
- Re-create the PostgreSQL database (using the settings highlighted here) used in the original migration if it is not empty (for safety reasons, Stash blocks any migration to a non-empty database).
- Start Stash.
- Initiate the migration or the restoration of the backup once more.
- If the migration or restoration still fails, use the following instructions to diagnose the cause:
  a. Turn on PostgreSQL statement logging.
  b. Recreate the target PostgreSQL database to ensure it is empty.
  c. Restart the migration or the backup restoration to trigger the error again.
  d. Consult the PostgreSQL statement log to determine which SQL INSERT failed. This indicates which table still contains null characters that have to be sanitized as described above.
  e. Restart from step (a) until the migration or restore succeeds.
Last modified on Feb 26, 2016
Hello @v-danhe-msft
Please help me. I have the same issue when I get data via ODBC or directly from PostgreSQL.
For some tables I can import the table into Power Query and see it, but I can't apply it to Power BI Desktop.
Other tables I can't even see in Power Query.
Failed to save modifications to the server. Error returned: 'OLE DB or ODBC error: [DataSource.Error] PostgreSQL: 22021: invalid byte sequence for encoding "UTF8": 0xc7 0x4f.'.
From ODBC:
DataSource.Error: ODBC: ERROR [22021] ERROR: invalid byte sequence for encoding "UTF8": 0xe9 0x20 0x41;
Error while executing the query
Details:
DataSourceKind=Odbc
DataSourcePath=dsn=PostgreSQL35W32b
OdbcErrors=[Table]
FROM POSTGRESQL:
DataSource.Error: PostgreSQL: 22021: invalid byte sequence for encoding "UTF8": 0xc7 0xc3
Details:
DataSourceKind=PostgreSQL
DataSourcePath=000.000.0.0;TESTE
Message=22021: invalid byte sequence for encoding "UTF8": 0xc7 0xc3
ErrorCode=-2147467259
MY DATABASE:
CREATE DATABASE "TESTE"
  WITH OWNER = postgres
  ENCODING = 'SQL_ASCII'
  TABLESPACE = pg_default
  LC_COLLATE = 'Portuguese_Brazil.1252'
  LC_CTYPE = 'Portuguese_Brazil.1252'
  CONNECTION LIMIT = -1;
ALTER DATABASE "TESTE" SET DateStyle = 'iso, mdy';
ALTER DATABASE "TESTE" SET bytea_output = 'escape';
ALTER DATABASE "TESTE" SET standard_conforming_strings = 'off';
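A hedged observation on the database definition above: ENCODING = 'SQL_ASCII' means PostgreSQL performs no encoding validation or conversion at all, so Windows-1252 bytes from Portuguese text get stored verbatim, and the failure only surfaces when a client (Power BI here) asks the server to convert to UTF8. The reported byte pairs even decode plausibly as Windows-1252 Portuguese text, which you can check with iconv:

```shell
# 0xc7 0x4f (octal 307 117) decodes as "ÇO" in Windows-1252.
printf '\307\117' | iconv -f WINDOWS-1252 -t UTF-8; echo

# 0xc7 0xc3 (octal 307 303) decodes as "ÇÃ", as in Portuguese words ending in -ção.
printf '\307\303' | iconv -f WINDOWS-1252 -t UTF-8; echo
```

If that guess is right, the usual fix is to dump the data and reload it into a database created with a real encoding (UTF8 or WIN1252); a SQL_ASCII database cannot simply be converted in place, because the server has no idea what encoding its bytes are in.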
MY ODBC Config:
datasource int-8 as default
extra op 0x0
byte as LO