I’m trying to import a .txt file into PostgreSQL. The txt file has 6 columns:
Laboratory_Name Laboratory_ID Facility ZIP_Code City State
And 213 rows.
I’m trying to use copy to put the contents of this file into a table called doe2 in PostgreSQL, using this command:
copy DOE2 FROM '/users/nathangroom/desktop/DOE_inventory5.txt' (DELIMITER(' '))
It gives me this error:
missing data for column "facility"
I’ve looked all around for what to do when encountering this error and nothing has helped. Has anyone else encountered this?
asked Nov 5, 2014 at 0:33
Three possible causes:
- One or more lines of your file have only 4 or fewer space characters (your delimiter).
- One or more space characters have been escaped (inadvertently), maybe with a backslash at the end of an unquoted value. For the (default) text format you are using, the manual explains:
Backslash characters (\) can be used in the COPY data to quote data characters that might otherwise be taken as row or column delimiters.
Output from COPY TO or pg_dump would not exhibit any of these faults when reading from a table with matching layout. But maybe your file has been edited or comes from a different, faulty source?
- You are not using the file you think you are using. The \copy meta-command of the psql command-line interface is a wrapper for COPY and reads files local to the client. If your file lives on the server, use the SQL command COPY instead.
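For example, a minimal client-side variant of the command from the question (assuming the file really does live on the desktop of the machine running psql) would be:
\copy doe2 FROM '/users/nathangroom/desktop/DOE_inventory5.txt' WITH (DELIMITER ' ')
The \copy meta-command streams the local file to the server over the existing connection, so no server-side file access is needed.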
answered Nov 5, 2014 at 1:30
Erwin Brandstetter
Check the file carefully. In my case, a blank line at the end of the file caused the ERROR: missing data for column. I deleted it, and it worked fine.
Printing the blank lines might reveal something interesting:
cat -e $filename
answered Oct 13, 2021 at 16:17
Nagev
I had a similar error. Check the version of pg_dump that was used to export the data and the version of the database you want to insert it into, and make sure they are the same. Also, if the COPY-based export fails, export the data as INSERT statements instead.
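A rough sketch of that fallback (the table and database names here are just placeholders): pg_dump can write plain INSERT statements instead of COPY data with its --inserts flag:
pg_dump --inserts --table=doe2 source_db > doe2_inserts.sql
The resulting file restores more slowly than COPY, but an error only loses the rows of the problematic INSERT rather than aborting the whole table load.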
answered Jan 2, 2022 at 21:20
aniefiok
Contents
- 1 History
- 2 Overview
- 3 COPY options
- 4 Example
- 4.1 error logging off
- 4.2 skip bad rows
- 4.3 turn error logging on (default logs in error_logging_table)
- 4.4 Redirect to another table with a specific label
- 4.5 Limit to 2 bad rows:
History
Error logging in COPY was a proposed feature developed by Aster Data against the PostgreSQL 9.0 code base. It was submitted and reviewed (1) but not accepted into the core product for that or any other version so far.
Overview
The purpose of error logging in COPY is to prevent the backend from erroring out if a malformed tuple is encountered during a COPY operation. Bad tuples can either be skipped or logged into an error logging table.
The format of the error logging table is as follows:
CREATE TABLE error_logging_table (
    tupletimestamp TIMESTAMP WITH TIME ZONE,
    targettable    VARCHAR,
    dmltype        CHAR(1),
    errmessage     VARCHAR,
    sqlerrcode     CHAR(5),
    label          VARCHAR,
    key            BIGINT,
    rawdata        BYTEA
);
The COPY command returns the number of successfully copied tuples only.
COPY options
Error logging is set by adding options to the COPY command. Here is the list of the available options:
Variable name | Description | Default value |
ERROR_LOGGING | Enables error handling for COPY commands (when set to true). | true |
ERROR_LOGGING_SKIP_BAD_ROWS | Enables the ability to skip malformed tuples that are encountered in COPY commands (when set to true). | true |
ERROR_LOGGING_MAX_ERRORS | Maximum number of bad rows to log before stopping the COPY operation (0 means unlimited). | 0 |
ERROR_LOGGING_SCHEMA_NAME | Schema name of the table where malformed tuples are inserted by the error logging module | ‘public’ |
ERROR_LOGGING_TABLE_NAME | Relation name where malformed tuples are inserted by the error logging module. The table is automatically created if it does not exist. | ‘error_table’ |
ERROR_LOGGING_LABEL | Optional label that is used to identify malformed tuples | COPY command text |
ERROR_LOGGING_KEY | Optional key to identify malformed tuples | Index of the tuple in the COPY stream |
Bad tuples can be rejected for a number of reasons (extra or missing column, constraint violation, …). The error table tries to capture as much context as possible about the error. If the table does not exist, it is created automatically; its format is the one shown above.
tupletimestamp stores the time at which the error occurred. targettable records the table into which the row was being inserted when the error occurred, and dmltype the kind of statement ('C' for COPY). The exact error message and SQL error code are recorded in errmessage and sqlerrcode, respectively. label and key hold the optional identifiers described in the options table above, and the original data of the row can be found in rawdata.
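Assuming the layout above (this table comes from the proposed patch, not from stock PostgreSQL), a simple query to review the most recent rejects could look like:
SELECT tupletimestamp, targettable, errmessage, sqlerrcode
FROM error_logging_table
ORDER BY tupletimestamp DESC
LIMIT 10;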
Example
CREATE TEMP TABLE foo (a bigint, b text);
— input_file.txt (tab-delimited) —
1	one
2
3	three	111
four	4
5	five
— end of input_file.txt —
error logging off
COPY foo FROM 'input_file.txt';
ERROR: missing data for column "b"
CONTEXT: COPY foo, line 2: "2"
skip bad rows
-- skip bad rows
COPY foo FROM 'input_file.txt' (ERROR_LOGGING, ERROR_LOGGING_SKIP_BAD_ROWS);
SELECT * from foo;

 a |  b
---+------
 1 | one
 5 | five
(2 rows)
turn error logging on (default logs in error_logging_table)
-- turn error logging on (default logs in error_logging_table)
COPY foo FROM 'input_file.txt' (ERROR_LOGGING);
SELECT * from foo;

 a |  b
---+------
 1 | one
 5 | five
(2 rows)
SELECT * FROM error_logging_table;
 key |           tupletimestamp            |              label              |  targettable  | dmltype |                errmessage                | sqlerrcode |         rawdata
-----+-------------------------------------+---------------------------------+---------------+---------+------------------------------------------+------------+--------------------------
   2 | Thu Sep 10 07:09:17.869521 2009 PDT | COPY foo FROM 'input_file.txt'; | pg_temp_2.foo | C       | missing data for column "b"              | 22P04      | \x32
   3 | Thu Sep 10 07:09:17.86953 2009 PDT  | COPY foo FROM 'input_file.txt'; | pg_temp_2.foo | C       | extra data after last expected column    | 22P04      | \x3309746872656509313131
   4 | Thu Sep 10 07:09:17.869538 2009 PDT | COPY foo FROM 'input_file.txt'; | pg_temp_2.foo | C       | invalid input syntax for integer: "four" | 22P02      | \x666f75720934
(3 rows)
Redirect to another table with a specific label
-- Redirect to another table with a specific label
COPY foo FROM 'input_file.txt' (ERROR_LOGGING, ERROR_LOGGING_SCHEMA_NAME 'error',
     ERROR_LOGGING_TABLE_NAME 'table1', ERROR_LOGGING_LABEL 'batch1');
SELECT * FROM error.table1;

 key |           tupletimestamp            | label  |  targettable  | dmltype |                errmessage                | sqlerrcode |         rawdata
-----+-------------------------------------+--------+---------------+---------+------------------------------------------+------------+--------------------------
   2 | Thu Sep 10 07:09:17.869521 2009 PDT | batch1 | pg_temp_2.foo | C       | missing data for column "b"              | 22P04      | \x32
   3 | Thu Sep 10 07:09:17.86953 2009 PDT  | batch1 | pg_temp_2.foo | C       | extra data after last expected column    | 22P04      | \x3309746872656509313131
   4 | Thu Sep 10 07:09:17.869538 2009 PDT | batch1 | pg_temp_2.foo | C       | invalid input syntax for integer: "four" | 22P02      | \x666f75720934
(3 rows)
Limit to 2 bad rows:
-- Limit to 2 bad rows:
COPY foo FROM 'input_file.txt' (ERROR_LOGGING, ERROR_LOGGING_MAX_ERRORS 2);

ERROR: invalid input syntax for integer: "four"
CONTEXT: COPY foo, line 4, column a: "four"

SELECT count(*) from error_logging_table;

 count
-------
     0
(1 row)
I’m working on items for migrating my database class from Oracle to PostgreSQL. I ran into an interesting limitation when I tried using the COPY command to read an external CSV file.
I had prepared the system by creating a new directory hierarchy owned by the postgres user on top of a /u01/app mount point. I set the ownership of the directories and files with the following command from the /u01/app mount point:
chown -R postgres:postgres postgres
After running the following command:
COPY transaction_upload FROM '/u01/app/upload/postgres/transaction_upload_postgres.csv' DELIMITERS ',' CSV;
The command raised the following error:
ERROR: must be superuser or a member of the pg_read_server_files role to COPY from a file
HINT: Anyone can COPY to stdout or from stdin. psql's \copy command also works for anyone.
The two options for fixing the problem are: changing the student user to a superuser, or granting the pg_read_server_files role to the student user. Changing the student user to a superuser isn’t really a practical option. So, I connected as the postgres superuser and granted the pg_read_server_files role to the student user. It is a system-level role, so the grant is not limited to only the videodb database.
As the postgres user, type the following command to grant the pg_read_server_files role to the student user:
GRANT pg_read_server_files TO student;
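To confirm the grant took effect, psql's \du meta-command lists role memberships; after the GRANT, the "Member of" column for student should include pg_read_server_files:
\du student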
After granting the role to the student user, I created a small test case. The test table definition is:
CREATE TABLE test
( id          INTEGER
, first_name  VARCHAR(20)
, last_name   VARCHAR(20));
I created a test.csv file in the /u01/app/upload/postgres directory, like:
1,Simon,Bolivar
2,Peter,Davenport
3,Michael,Swan
The test.csv file requires the following permissions and ownership:
-rw-r--r--. 1 postgres postgres 49 Nov 13 10:56 test.csv
The permissions are user read-write, group read, and others read. The ownership should be granted to postgres and the primary group for the postgres user, which should also be postgres.
You can then connect to psql as the student user with the database set to videodb and run the following copy command:
COPY test FROM '/u01/app/upload/postgres/test.csv' DELIMITERS ',' CSV;
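As a quick sanity check (using the three-row test.csv shown above), querying the table should return exactly those rows:
SELECT * FROM test;

 id | first_name | last_name
----+------------+------------
  1 | Simon      | Bolivar
  2 | Peter      | Davenport
  3 | Michael    | Swan
(3 rows)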
If you put a comma at the end of each line, like you would in MySQL, the trailing comma raises the following error:
ERROR: extra data after last expected column
If you forget a delimiting comma somewhere on a line, the copy command raises the following error:
ERROR: missing data for column "last_name"
CONTEXT: COPY tester, line 3: "3,Michael Swan"
The error names the column that is left without data, and the context reports the line number and displays the offending line.
You should take careful note that the copy command is an appending command. If you run it a second time, you insert a duplicate set of values in the target table.
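If you want the load to be repeatable, a simple pattern (sketched here for the test table) is to empty the target before copying again:
TRUNCATE TABLE test;
COPY test FROM '/u01/app/upload/postgres/test.csv' DELIMITERS ',' CSV;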
After experimenting, it’s time to fix my student instance. The transaction_upload_mysql.csv file has two critical errors that need to be fixed. They are:
- A comma terminates each line, which would raise an extra data after last expected column error.
- A comma terminates each line followed by some indefinite amount of whitespace, which would also raise an extra data after last expected column error.
Since I have students with little expertise in Unix or Linux commands, I must provide a single command that they can use to convert the file with problems to one without problems. However, they should copy the transaction_upload_mysql.csv file to ensure they don’t disable the equivalent functionality for the MySQL solution space.
They should copy two files as the root user from the mysql directory to the postgres directory, as follows:
cp /u01/app/mysql/upload/transaction_upload_mysql.csv /u01/app/postgres/upload/transaction_upload_postgres.csv
cp /u01/app/mysql/upload/transaction_upload2_mysql.csv /u01/app/postgres/upload/transaction_upload2_postgres.csv
As the root user in the /u01/app/upload/postgres directory, run the following command:
cat transaction_upload_postgres.csv | sed -e 's/,$//g' > x; cat x | sed -e 's/,[[:space:]]*$//g' > y; mv y transaction_upload_postgres.csv; rm x
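If GNU sed is available, the same cleanup can be done in place with a single command, because [[:space:]]* also matches the case with no trailing whitespace:
sed -i 's/,[[:space:]]*$//' transaction_upload_postgres.csv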
Please check the file permissions and ownership with the ll (long list) command. If the file isn’t like this:
-rw-r--r--. 1 postgres postgres 49 Nov 13 10:56 transaction_upload_postgres.csv
Then, they should be able to change it as the root user with these commands:
chown postgres:postgres transaction_upload_postgres.csv
chmod 644 transaction_upload_postgres.csv
Lastly, they should connect to psql as the student user, using the videodb database, and run the following command:
COPY transaction_upload FROM '/u01/app/postgres/upload/transaction_upload_postgres.csv' DELIMITERS ',' CSV;
A query of the import table with this:
SELECT COUNT(*) FROM transaction_upload;
should return:
 count
-------
 11520
(1 row)
As always, I hope this helps those looking for an explanation and example of the copy feature of PostgreSQL.
The PostgreSQL server COPY command is very simple and just aborts on a single failure. You might think that it could do far better (I know I do), but there’s a reason that the PostgreSQL codebase is so compact with respect to MySQL’s (by a factor of roughly 10:1).
However, there is the (very) nice pgloader programme, which compensates for this at the price of having to run a separate utility.
Of course, if you’re good at the PL/pgSQL language (internal to the server), then maybe you could explore that route, but why reinvent the wheel? Python and Perl also have internal PostgreSQL options. Then of course, there are all the languages under the sun external to the server.
From the manual:
PgLoader Reference Manual
pgloader loads data from various sources into PostgreSQL. It can
transform the data it reads on the fly and submit raw SQL before and
after the loading. It uses the COPY PostgreSQL protocol to stream the
data into the server, and manages errors by filling a pair of
reject.dat and reject.log files.
which appears to be right up your alley?
The way it works is: (sorry for the long quote)
TL;DR — pgloader loads a batch (configurable) at a time. On failure, it «marks the spot», uses COPY again up until that point, stops, then puts the bad record into a file and continues from bad-record + 1.
Batches And Retry Behaviour

To load data to PostgreSQL, pgloader uses the COPY streaming protocol. While this is the faster way to load data, COPY has an important drawback: as soon as PostgreSQL emits an error with any bit of data sent to it, whatever the problem is, the whole data set is rejected by PostgreSQL.

To work around that, pgloader cuts the data into batches of 25000 rows each, so that when a problem occurs it’s only impacting that many rows of data. Each batch is kept in memory while the COPY streaming happens, in order to be able to handle errors should some happen.

When PostgreSQL rejects the whole batch, pgloader logs the error message then isolates the bad row(s) from the accepted ones by retrying the batched rows in smaller batches. To do that, pgloader parses the CONTEXT error message from the failed COPY, as the message contains the line number where the error was found in the batch, as in the following example:

CONTEXT: COPY errors, line 3, column b: «2006-13-11»

Using that information, pgloader will reload all rows in the batch before the erroneous one, log the erroneous one as rejected, then try loading the remaining of the batch in a single attempt, which may or may not contain other erroneous data.

At the end of a load containing rejected rows, you will find two files in the root-dir location, under a directory named the same as the target database of your setup. The filenames are the target table, and their extensions are .dat for the rejected data and .log for the file containing the full PostgreSQL client side logs about the rejected data.
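For reference, a minimal pgloader command file for the CSV case discussed earlier might look roughly like this (the path, database, and table names are illustrative, and the exact option syntax should be checked against the pgloader manual for your version):
LOAD CSV
     FROM '/u01/app/postgres/upload/transaction_upload_postgres.csv'
     INTO postgresql:///videodb?transaction_upload
     WITH fields terminated by ',';
Rejected rows would then end up in the reject .dat and .log files described above instead of aborting the whole load.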