ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file


So I tried reading all the CSV files from a folder, concatenating them into one big CSV (all the files had the same structure), saving it, and reading it again. All of this was done using pandas. The error occurs while reading. I am attaching the code and the error below.

import pandas as pd
import glob

path = r'somePath'  # use your path
allFiles = glob.glob(path + "/*.csv")
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None, header=0)
    list_.append(df)
store = pd.concat(list_)
store.to_csv(r"C:\work\DATA\Raw_data\store.csv", sep=',', index=False)
store1 = pd.read_csv(r"C:\work\DATA\Raw_data\store.csv", sep=',')

Error:

CParserError                              Traceback (most recent call last)
<ipython-input-48-2983d97ccca6> in <module>()
----> 1 store1 = pd.read_csv(r"C:\work\DATA\Raw_data\store.csv", sep=',')

C:\Users\armsharm\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, na_fvalues, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, float_precision, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format, skip_blank_lines)
    472                     skip_blank_lines=skip_blank_lines)
    473 
--> 474         return _read(filepath_or_buffer, kwds)
    475 
    476     parser_f.__name__ = name

C:\Users\armsharm\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)
    258         return parser
    259 
--> 260     return parser.read()
    261 
    262 _parser_defaults = {

C:\Users\armsharm\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
    719                 raise ValueError('skip_footer not supported for iteration')
    720 
--> 721         ret = self._engine.read(nrows)
    722 
    723         if self.options.get('as_recarray'):

C:\Users\armsharm\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
   1168 
   1169         try:
-> 1170             data = self._reader.read(nrows)
   1171         except StopIteration:
   1172             if nrows is None:

pandas\parser.pyx in pandas.parser.TextReader.read (pandas\parser.c:7544)()

pandas\parser.pyx in pandas.parser.TextReader._read_low_memory (pandas\parser.c:7784)()

pandas\parser.pyx in pandas.parser.TextReader._read_rows (pandas\parser.c:8401)()

pandas\parser.pyx in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:8275)()

pandas\parser.pyx in pandas.parser.raise_parser_error (pandas\parser.c:20691)()

CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

I tried using the csv reader as well:

import csv
with open(r"C:\work\DATA\Raw_data\store.csv", 'rb') as f:
    reader = csv.reader(f)
    l = list(reader)

Error:

Error                                     Traceback (most recent call last)
<ipython-input-36-9249469f31a6> in <module>()
      1 with open(r'C:\work\DATA\Raw_data\store.csv', 'rb') as f:
      2     reader = csv.reader(f)
----> 3     l = list(reader)

Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
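
For reference, on Python 3 (where the old 'rU' universal-newline mode is gone), a rough workaround sketch for that csv error is to open the file in text mode with newline='', which lets the csv module handle embedded carriage returns itself:

import csv

# newline='' hands line-ending handling to the csv module,
# the Python 3 counterpart of the old Python 2 'rU' mode.
with open(r"C:\work\DATA\Raw_data\store.csv", "r", newline="") as f:
    reader = csv.reader(f)
    rows = list(reader)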

How to Solve Python Pandas Error Tokenizing Data Error?

While reading a CSV file, you may get the "Pandas Error Tokenizing Data" error. This mostly occurs due to incorrect data in the CSV file.

You can solve the pandas error tokenizing data by skipping the offending lines, using error_bad_lines=False (replaced by on_bad_lines='skip' in pandas 1.3+).

In this tutorial, you'll learn what causes the tokenizing error and how to solve it.

If You're in a Hurry

You can use the code snippet below to solve the tokenizing error, by ignoring the offending lines and suppressing errors.

Snippet

import pandas as pd

df = pd.read_csv('sample.csv', error_bad_lines=False, engine='python')

df
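
Note that error_bad_lines was deprecated in pandas 1.3 and removed in 2.0. On current pandas versions the equivalent snippet is:

import pandas as pd

# pandas >= 1.3: on_bad_lines replaces error_bad_lines/warn_bad_lines;
# 'skip' silently drops malformed rows, 'warn' skips them with a warning.
df = pd.read_csv('sample.csv', on_bad_lines='skip', engine='python')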

If You Want to Understand Details, Read on…

In this tutorial, you’ll learn the causes for the exception “Error Tokenizing Data” and how it can be solved.

Causes of the Problem

  • The CSV file has two header lines
  • A different separator is used
  • A carriage return (\r) is present in the column names, which makes the subsequent column names be read as the next line
  • Lines of the CSV file have an inconsistent number of columns

In the case of invalid rows with an inconsistent number of columns, you'll see an error such as Expected 1 field in line 12, saw 12. This means the parser expected only 1 field per line but saw 12 values after tokenizing line 12. Hence, it doesn't know how the tokenized values should be handled. You can solve such errors using one of the options below.
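
As a quick illustration (a made-up file, not from any of the examples above), a single row with extra fields is enough to trigger the error:

import pandas as pd

# bad.csv: the third line has 4 fields where the header declares 3.
with open('bad.csv', 'w') as f:
    f.write('a,b,c\n1,2,3\n4,5,6,7\n')

pd.read_csv('bad.csv')
# ParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 4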

Finding the Problematic Line (Optional)

If you want to identify the line that is creating the problem while reading, you can use the code snippet below.

It uses the csv reader, hence it behaves similarly to the read_csv() method.

Snippet

import csv

with open("sample.csv", 'r', newline='') as file_obj:
    reader = csv.reader(file_obj)
    line_no = 1
    try:
        for row in reader:
            line_no += 1
    except Exception as e:
        # e.message does not exist in Python 3; print the exception itself
        print("Error in line number %d: %s %s" % (line_no, type(e), e))

Using the Error_Bad_Lines Parameter

When there is insufficient data in any of the rows, the tokenizing error will occur.

You can skip such invalid rows by using the error_bad_lines parameter of the read_csv() method.

This parameter controls what is done when a bad line occurs in the file being read.

If it's set to,

  • False – errors will be suppressed for invalid lines
  • True – errors will be thrown for invalid lines

Use the snippet below to read the CSV file and ignore the invalid lines. Only a warning with the line number will be shown when an invalid line is found.

Snippet

import pandas as pd

df = pd.read_csv('sample.csv', error_bad_lines=False)

df

In this case, the offending lines are skipped, and a dataframe is created from the valid lines only.
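
On pandas 1.4 and newer, on_bad_lines also accepts a callable (python engine only), which lets you record the offending rows instead of silently dropping them. A minimal sketch:

import pandas as pd

bad_rows = []

def handle_bad(line):
    # Called once per malformed row (a list of string fields);
    # returning None drops the row from the result.
    bad_rows.append(line)
    return None

# Callable handlers require engine='python' (pandas >= 1.4).
df = pd.read_csv('sample.csv', engine='python', on_bad_lines=handle_bad)
print("Skipped %d malformed rows" % len(bad_rows))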

Using Python Engine

There are two engines supported for reading a CSV file: the C engine and the Python engine.

C Engine

  • Faster
  • Uses C language to parse the CSV file
  • Supports float_precision
  • Cannot automatically detect the separator
  • Doesn’t support skipping footer

Python Engine

  • Slower than the C engine, but feature-complete
  • Uses Python to parse the CSV file
  • Doesn't support float_precision (not needed with the Python engine)
  • Can automatically detect the separator
  • Supports skipping footer

Using the Python engine can solve some of the problems faced while parsing files.

For example, when you try to parse large CSV files, you may face Error tokenizing data. C error: out of memory. Using the Python engine can solve the memory issues while parsing such big CSV files with the read_csv() method.

Use the below snippet to use the Python engine for reading the CSV file.

Snippet

import pandas as pd

df = pd.read_csv('sample.csv', engine='python', error_bad_lines=False)

df

This is how you can use the python engine to parse the CSV file.

Optionally, this could also solve the error Error tokenizing data. c error out of memory when parsing the big CSV files.
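
Alternatively, if the file is simply too large for memory, reading it in chunks keeps the faster C engine while bounding memory use. A sketch (the filter and the 'value' column are made up for illustration):

import pandas as pd

# Process the CSV incrementally instead of loading it all at once.
parts = []
for chunk in pd.read_csv('sample.csv', chunksize=100_000):
    parts.append(chunk[chunk['value'] > 0])  # 'value' is a hypothetical column
df = pd.concat(parts, ignore_index=True)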

Using Proper Separator

CSV files can have different separators, such as a tab or another special character like ;. In this case, an error will be thrown when reading the CSV file if the default C engine is used with a mismatched separator.

You can parse the file successfully by specifying the separator explicitly using the sep parameter.

As an alternative, you can also use the python engine which will automatically detect the separator and parse the file accordingly.

Snippet

import pandas as pd

df = pd.read_csv('sample.csv', sep='\t')

df

This is how you can specify the separator explicitly which can solve the tokenizing errors while reading the CSV files.
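
If you don't know the separator up front, the Python engine can detect it for you when you pass sep=None:

import pandas as pd

# sep=None triggers delimiter sniffing via csv.Sniffer (python engine only).
df = pd.read_csv('sample.csv', sep=None, engine='python')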

Using Line Terminator

A CSV file can contain carriage returns (\r) separating lines instead of the line separator \n.

In this case, you'll face CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file when a line contains \r instead of \n.

You can solve this error by specifying the line terminator explicitly using the lineterminator parameter.

Snippet

df = pd.read_csv('sample.csv',
                 lineterminator='\n')

This is how you can use the lineterminator parameter to parse files that contain stray \r characters.
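
A quick illustration with a made-up file: once the terminator is pinned to '\n', a bare carriage return stays inside its field instead of starting a new row:

import pandas as pd

# Hypothetical file whose second field contains a bare \r.
with open('cr.csv', 'w', newline='') as f:
    f.write('a,b\n1,hello\rworld\n')

df = pd.read_csv('cr.csv', lineterminator='\n')
print(df)  # one row; 'hello\rworld' is kept as a single value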

Using header=None

CSV files can have incomplete headers which can cause tokenizing errors while parsing the file.

You can use header=None to ignore the first-line headers while reading the CSV file.

This will parse the CSV file without headers and create a data frame. You can then assign column names by passing the names parameter to read_csv(), or by setting df.columns afterwards.

Snippet

import pandas as pd

df = pd.read_csv('sample.csv', header=None, error_bad_lines=False)

df

This is how you can ignore the headers which are incomplete and cause problems while reading the file.
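
For example, to attach your own column names while ignoring the broken header (the names below are placeholders):

import pandas as pd

# Placeholder column names; replace them with your real ones.
df = pd.read_csv('sample.csv', header=None,
                 names=['col_a', 'col_b', 'col_c'])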

Using Skiprows

CSV files can have headers in more than one row. This can happen when data is grouped into different sections, and each group has a name and its own column headers.

In this case, you can ignore such rows by using the skiprows parameter. You can pass the number of rows to be skipped, and the data will be read after skipping that many rows.

Use the below snippet to skip the first two rows while reading the CSV file.

Snippet

import pandas as pd

df = pd.read_csv('sample.csv', header=None, skiprows=2, error_bad_lines=False)

df

This is how you can skip or ignore the erroneous headers while reading the CSV file.
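
skiprows also accepts a list of row indices or a callable, which helps when the junk rows are not simply at the top. A sketch:

import pandas as pd

# Skip specific 0-based row indices...
df = pd.read_csv('sample.csv', header=None, skiprows=[0, 1])

# ...or skip every row matching an arbitrary predicate (hypothetical rule).
df = pd.read_csv('sample.csv', header=None,
                 skiprows=lambda i: i % 100 == 0)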

Reading As Lines and Separating

In a CSV file, you may have a different number of columns in each row. This can occur when some of the columns in a row are considered optional. You may still need to parse such files without tokenizing problems.

In this case, you can read the file as lines and separate it later using the delimiter and create a dataframe out of it. This is helpful when you have varying lengths of rows.

In the below example,

  • the file is read as lines by specifying the separator as a new line using sep='\n'. The file will then be tokenized on each new line, and a single column will be available in the dataframe.
  • You can split the lines using the separator or regex and create different columns out of it.
  • expand=True expands the split string into multiple columns.

Use the below snippet to read the file as lines and separate it using the separator.

Snippet

import pandas as pd

df = pd.read_csv('sample.csv', header=None, sep='\n')

df = df[0].str.split(r'\s\|\s', expand=True)  # split on a pipe surrounded by whitespace

df

This is how you can read the file as lines and later separate it to avoid problems while parsing the lines with an inconsistent number of columns.

Conclusion

To summarize, you’ve learned the causes of the Python Pandas Error tokenizing data and the different methods to solve it in different scenarios.

Different errors while tokenizing data include:

  • Error tokenizing data. C error: Buffer overflow caught - possible malformed input file
  • ParserError: Expected n fields in line x, saw m
  • Error tokenizing data. C error: out of memory

You've also learned about the different engines available in the read_csv() method to parse CSV files, and their advantages and disadvantages.

You’ve also learned when to use the different methods appropriately.

If you have any questions, comment below.



read_csv C-engine CParserError: Error tokenizing data #11166

I have encountered a dataset where the C-engine read_csv has problems. I am unsure of the exact issue, but I have narrowed it down to a single row, which I have pickled and uploaded to Dropbox. If you obtain the pickle, try the following:

I get the following exception:

If you try and read the CSV using the python engine then no exception is thrown:

Suggesting that the issue is with read_csv and not to_csv. The versions I am using are:


Your second-to-last line includes a '\r' break. I think it's a bug, but one workaround is to open the file in universal-newline mode.

I’m encountering this error as well. Using the method suggested by @chris-b1 causes the following error:

I have also found this issue when reading a large csv file with the default engine. If I use engine='python' then it works fine.

I missed @alfonsomhc's answer because it just looked like a comment.

Had the same issue trying to read a folder, not a csv file.

Has anyone investigated this issue? It’s killing performance when using read_csv in a keras generator.

The original data provided is no longer available, so the issue is not reproducible. Closing as it's not clear what the issue is, but @dgrahn or anyone else, if you can provide a reproducible example we can reopen.

@WillAyd Let me know if you need additional info.

Since GitHub doesn't accept CSVs, I changed the extension to .txt.
Here's the code which will trigger the exception.

for chunk in pandas.read_csv('debug.csv', chunksize=1000, names=range(2504)):
    pass

Here's the file: debug.txt

Here's the exception from Windows 10, using Anaconda.

Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> for chunk in pandas.read_csv('debug.csv', chunksize=1000, names=range(2504)): pass
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\programs\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1007, in __next__
    return self.get_chunk()
  File "D:\programs\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1070, in get_chunk
    return self.read(nrows=size)
  File "D:\programs\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1036, in read
    ret = self._engine.read(nrows)
  File "D:\programs\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1848, in read
    data = self._reader.read(nrows)
  File "pandas\_libs\parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
  File "pandas\_libs\parsers.pyx", line 903, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas\_libs\parsers.pyx", line 945, in pandas._libs.parsers.TextReader._read_rows
  File "pandas\_libs\parsers.pyx", line 932, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas\_libs\parsers.pyx", line 2112, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

And the same on RedHat.

$ python3
Python 3.6.6 (default, Aug 13 2018, 18:24:23)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> for chunk in pandas.read_csv('debug.csv', chunksize=1000, names=range(2504)): pass
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1007, in __next__
    return self.get_chunk()
  File "/usr/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1070, in get_chunk
    return self.read(nrows=size)
  File "/usr/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1036, in read
    ret = self._engine.read(nrows)
  File "/usr/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1848, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 903, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 945, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 932, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2112, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

@dgrahn I have downloaded debug.txt and I get the following if I run pd.read_csv('debug.txt', header=None) on a Mac:

ParserError: Error tokenizing data. C error: Expected 204 fields in line 3, saw 2504

Which is different from the Buffer overflow caught error originally described.

I have inspected the debug.txt file: the first two lines have 204 columns, but the 3rd line has 2504 columns. This would make the file unparsable and explains why an error is thrown.

Is this expected? GitHub could be doing some implicit conversion in the background between newline types ("\r\n" and "\n") that is messing up the uploaded example.

@joshlk Did you use the names=range(2504) option as described in the comment above?

OK, I can now reproduce the error with pandas.read_csv('debug.csv', chunksize=1000, names=range(2504)).

It's good to note that pandas.read_csv('debug.csv', names=range(2504)) works fine, so it's unlikely to be related to the original bug, but it produces the same symptom.

@joshlk I could open a separate issue if that would be preferred.

pd.read_csv(open('test.csv', 'rU'), encoding='utf-8', engine='python')

Solved my problem.
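
Note that the 'U' open-mode flag was deprecated in Python 3 and removed in 3.11; a rough modern equivalent of that workaround is:

import pandas as pd

# newline='' gives universal-newline-style handling on Python 3.
df = pd.read_csv(open('test.csv', newline=''), encoding='utf-8', engine='python')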

I tried this approach and was able to load large data files. But when I checked the dimensions of the dataframe, I saw that the number of rows had increased. What could be the logical reasons for that?

@dheeman00: I am facing the same problem as you with changing sizes. I have a dataframe of shape (100000, 21), and after using engine='python' it gives me a dataframe of shape (100034, 21) (without engine='python', I get the same error as the OP). After comparing them, I figured the problem is with one of my columns that contains a text field, some entries with unknown chars, and some of them broken into two different rows (the second row, with the continuation of the text, has all other columns set to "NaN").
If you know your data well, playing with delimiters and maybe running a data-cleaning pass before saving as CSV would be helpful. In my case, the data was too messy and too big (it was a subset of a bigger csv file), so I switched to Spark for data cleaning.


How To Fix pandas.parser.CParserError: Error tokenizing data

Understanding why the error is raised and how to deal with it when reading CSV files in pandas

Introduction

Importing data from CSV files is probably the most common way of instantiating pandas DataFrames. However, this can be a little tricky when the data included in the file is not in the expected form. In these cases, the pandas parser may raise an error similar to the one reported below:

In today's short guide we will discuss why this error is raised in the first place and, additionally, a few ways that could eventually help you deal with it.

Reproducing the error

First, let’s try to reproduce the error using a small dataset I have prepared as part of this tutorial.

Now if we attempt to read in the file using read_csv:

we are going to get the following error

The error is pretty clear: it indicates that on the 4th line, 6 fields were observed instead of 4 (and, by the way, the same issue occurs on the last line as well).

By default, read_csv uses a comma (,) as the delimiter, but clearly two lines in the file have more separators than expected. The expected number in this case is 4, since our header (i.e. the first line of the file) contains 4 fields separated by commas.

Fixing the file manually

The most obvious solution to the problem is to fix the data file manually by removing the extra separators in the lines causing the trouble. This is actually the best solution (assuming that you have specified the right delimiters, headers, etc. when calling the read_csv function). However, it may be quite tricky and painful when you need to deal with large files containing thousands of lines.

Specifying line terminator

Another cause of this error may be related to carriage returns (i.e. '\r') in the data. On some occasions these are actually introduced by the pandas.DataFrame.to_csv() method: when writing pandas DataFrames to CSV files, a carriage return embedded in a column name makes the method write the subsequent column names onto the next line, so we end up with a different number of columns in the first rows.

If that's the case, then you can explicitly set the line terminator to '\n' using the corresponding parameter when calling read_csv():
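
A minimal sketch of that call (the filename is assumed):

import pandas as pd

# Pin the line terminator so stray '\r' characters no longer split rows.
df = pd.read_csv('data.csv', lineterminator='\n')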

Specifying the correct delimiter and headers

The error may also be related to the delimiters and/or headers (not) specified when calling read_csv. Make sure to pass both the correct separator and headers.

For example, the arguments below specify that ; is the delimiter used to separate columns (by default, commas are used as delimiters) and that the file does not contain a header row at all.
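
A sketch matching that description (the filename is assumed):

import pandas as pd

# ';' as the column delimiter, and no header row in the file.
df = pd.read_csv('data.csv', sep=';', header=None)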

Skipping rows

Skipping rows that are causing the error should be your last resort and I would personally discourage you from doing so, but I guess there are certain use cases where this may be acceptable.

If that's the case, then you can do so by setting error_bad_lines to False when calling the read_csv function:
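
A sketch (the filename is assumed; on pandas 1.3+ use on_bad_lines='skip' instead):

import pandas as pd

# Skip malformed lines instead of raising ParserError.
df = pd.read_csv('data.csv', error_bad_lines=False)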

With this option, the lines that were causing errors are simply skipped, and we can move on with whatever we'd like to do with our pandas DataFrame.

Final Thoughts

In today's short guide, we discussed a few cases where pandas.errors.ParserError: Error tokenizing data is raised by the pandas parser when reading CSV files into pandas DataFrames.

Additionally, we showcased how to deal with the error by fixing the errors or typos in the data file itself, or by specifying the appropriate line terminator. Finally, we also discussed how to skip lines causing errors, but keep in mind that in most cases this should be avoided.


Error Tokenizing Data C Error in Python

When working with data for any purpose, it is essential to clean it first, which means filling the null values and removing invalid entries, so that they don't affect the results and the program runs smoothly.

Furthermore, the ParserError: Error tokenizing data. C error can be caused by problematic data in the files, like mixed data, a differing number of columns per row, or several data files stored as a single file.

You can also encounter this error if you read a CSV file with read_csv but supply a different separator or line terminator than the file actually uses.

What Is the ParserError: Error tokenizing data. C error in Python

As discussed, the ParserError: Error tokenizing data. C error occurs when your Python program parses CSV data but encounters errors like invalid values, null values, unfilled columns, etc.

Let's say we have the following data in the data.csv file and we read it with the help of pandas, although it contains an error.

Name,Roll,Course,Marks,CGPA
Ali,1,SE,87,3
John,2,CS,78,
Maria,3,DS,13,,

Code example:

import pandas as pd
pd.read_csv('data.csv')

Output:

ParserError: Error tokenizing data. C error: Expected 5 fields in line 4, saw 6

As you can see, the above code has thrown a ParserError: Error tokenizing data. C error while reading data from the data.csv file, saying that the parser expected 5 fields in line 4 but saw 6 instead.

The error itself is self-explanatory; it indicates the exact point of the error and states the reason too, so we can fix it.

How to Fix the ParserError: Error tokenizing data. C error in Python

So far, we have understood the ParserError: Error tokenizing data. C error in Python; now let’s see how we can fix it.

It is always recommended to clean the data before analyzing it, because bad data may skew the results or cause your program to fail.

Data cleansing helps remove invalid data inputs, null values, and invalid entries; basically, it is the pre-processing stage of data analysis.

In Python, we have different functions and parameters that help clean the data and avoid errors.

Skip Rows to Fix the ParserError: Error tokenizing data. C error

This is one of the most common techniques: skip the rows causing the error. As you can see from the data above, the last line was causing the error.

Now, using the argument on_bad_lines='skip', the buggy row is ignored and the remaining rows are stored in the data frame df.

import pandas as pd
df = pd.read_csv('data.csv', on_bad_lines='skip')
df

Output:

	Name	Roll	Course	Marks	CGPA
0	Ali		1		SE		87		3.0
1	John	2		CS		78		NaN

The above code skips all the lines causing errors and prints the others; as you can see in the output, the last line is missing because it was causing the error.

But we are getting NaN values that need to be fixed; otherwise, they will affect the results of our statistical analysis.

Use the Correct Separator to Fix the ParserError: Error tokenizing data. C error

Using an invalid separator can also cause the ParserError, so it is important to use the correct separator for the data you provide.

Sometimes a tab or a space is used to separate the CSV data, so it is important to specify that separator in your program too.

import pandas as pd
pd.read_csv('data.csv', sep=',', on_bad_lines='skip', lineterminator='\n')

Output:

	Name	Roll	Course	Marks	CGPA\r
0	Ali		1		SE		87		3\r
1	John	2		CS		78		\r

The separator is ,, which is why we passed sep=','; and we set lineterminator='\n' because our lines end with \n. The trailing \r in the output shows that the file actually used \r\n line endings.

Use dropna() to Fix the ParserError: Error tokenizing data. C error

The dropna() function drops all rows that contain any null or NaN values.

import pandas as pd
df = pd.read_csv('data.csv', on_bad_lines='skip')
print("      **** Before dropna ****")
print(df)

print("\n      **** After dropna ****")
print(df.dropna())

Output:

      **** Before dropna ****
   Name  Roll Course  Marks  CGPA
0   Ali     1     SE     87   3.0
1  John     2     CS     78   NaN

      **** After dropna ****
  Name  Roll Course  Marks  CGPA
0  Ali     1     SE     87   3.0

Since we have only two rows, and one row has all the attributes while the second has a NaN value, the dropna() function drops that row and displays just a single row.

Use the fillna() Function to Fill Up the NaN Values

When you get NaN values in your data, you can use the fillna() function to replace them with another value; here we use the value 0.

Code Example:

import pandas as pd

print("      **** Before fillna ****")
df = pd.read_csv('data.csv', on_bad_lines='skip')
print(df, "\n\n")

print("      **** After fillna ****")
print(df.fillna(0))  # using 0 in place of NaN

Output:

      **** Before fillna ****
   Name  Roll Course  Marks  CGPA
0   Ali     1     SE     87   3.0
1  John     2     CS     78   NaN


      **** After fillna ****
   Name  Roll Course  Marks  CGPA
0   Ali     1     SE     87   3.0
1  John     2     CS     78   0.0

The fillna() call has replaced the NaN values with 0, so we can analyze the data properly.
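
fillna() can also take a per-column mapping when 0 is not appropriate everywhere. A sketch with hypothetical per-column defaults:

import pandas as pd

df = pd.read_csv('data.csv', on_bad_lines='skip')

# Per-column defaults: 0.0 for CGPA, 'N/A' for Name (illustrative choices).
df = df.fillna({'CGPA': 0.0, 'Name': 'N/A'})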
