How can I force a dtype on a column and ensure that any non-parseable data entries are filled with NaN? This is important when CSVs or database streams contain unpredictable data-entry errors that cannot be mapped to missing values a priori.

E.g.: below I want column 'a' to be parsed as np.float, but the erroneous 'Dog' entry causes an exception. Is there a way to tell read_csv() to force parsing column 'a' as np.float and fill all non-parseable entries with NaN?

import numpy as np
import pandas as pd
import StringIO  # Python 2; on Python 3 use io.StringIO

data = 'a,b,c\n1.1,2,3\nDog,5,6\n7.7,8,9.5'
df = pd.read_csv(StringIO.StringIO(data), dtype={'a': np.float})
df.dtypes

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-12-cd8b6f868aec> in <module>()
      1 data = 'a,b,c\n1.1,2,3\nDog,5,6\n7.7,8,9.5'
----> 2 df = pd.read_csv(StringIO.StringIO(data), dtype={'a': np.float})
      3 df.dtypes

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, nrows, iterator, chunksize, verbose, encoding, squeeze)
    389                     buffer_lines=buffer_lines)
    390 
--> 391         return _read(filepath_or_buffer, kwds)
    392 
    393     parser_f.__name__ = name

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)
    205         return parser
    206 
--> 207     return parser.read()
    208 
    209 _parser_defaults = {

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
    622             #     self._engine.set_error_bad_lines(False)
    623 
--> 624         ret = self._engine.read(nrows)
    625 
    626         if self.options.get('as_recarray'):

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
    943 
    944         try:
--> 945             data = self._reader.read(nrows)
    946         except StopIteration:
    947             if nrows is None:

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader.read (pandas\src\parser.c:5785)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._read_low_memory (pandas\src\parser.c:6002)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._read_rows (pandas\src\parser.c:6870)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._convert_column_data (pandas\src\parser.c:7919)()

AttributeError: 'NoneType' object has no attribute 'dtype'
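For reference, later pandas versions grew a clean post-read workaround: pd.to_numeric with errors='coerce' turns un-parseable entries into NaN after the fast parse. A minimal sketch, assuming Python 3 and pandas >= 0.17 (where pd.to_numeric was added):

import io
import pandas as pd

data = 'a,b,c\n1.1,2,3\nDog,5,6\n7.7,8,9.5'

# Read column 'a' as-is (it comes back as object), then coerce it;
# 'Dog' becomes NaN instead of raising.
df = pd.read_csv(io.StringIO(data))
df['a'] = pd.to_numeric(df['a'], errors='coerce')
df.dtypes  # 'a' is now float64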

Comment From: wesm

This would be a nice enhancement sometime. I don't know when I or someone else will get to it, though.

Comment From: jreback

same as #2779 ?

Comment From: dragoljub

Yes this is the same enhancement request.

Comment From: jreback

ok...why don't you close the other one then....

Comment From: dragoljub

Comments from closed issue #2779:

I have data for which I know what the data type should be for each column, e.g. float. Occasionally these columns will also contain spurious non-float string values due to erroneous data processing/pivoting upstream. I would like read_csv to help me ignore this (non-predictable) spurious data by parsing such entries as np.nan (missing values). This rules out the read_csv(na_values=) parameter, because I can't predict the bad values beforehand.

I wanted to use read_csv's supported converter functions to return an np.float where the value can be parsed, but that approach seems prohibitively slow: 4-10x slower in some examples I have tried.

I would love read_csv's quick parser to have a "force_dtype=True" option: no matter what string is in the column, the specified dtype is always returned, with np.nan used for any value that cannot be parsed (i.e. treated as missing).

import numpy as np
import pandas as pd
import StringIO

data = 'a,b,c\n1.1,2.2,3.3\n2.2,garb,4.4\n3.3,4.4,5.5'
data = data+data[5:]*3000

def converter(num):
    try:
        return np.float(num)
    except:
        return np.nan

%time df = pd.read_csv(StringIO.StringIO((data+data[5:]*30)))
df.dtypes

%time df2 = pd.read_csv(StringIO.StringIO((data+data[5:]*30)), converters={'b':converter})
df2.dtypes

%time df3 = pd.read_csv(StringIO.StringIO((data+data[5:]*30)), dtype=np.float)
df3.dtypes

In [7]: %time df = pd.read_csv(StringIO.StringIO((data+data[5:]*30)))
CPU times: user 0.12 s, sys: 0.00 s, total: 0.12 s
Wall time: 0.12 s <-- Quick native CSV parser

In [8]: df.dtypes <-- Column 'b' comes back as object (a mix of floats and strings), with no way to force a single type.
Out[8]:
a    float64
b     object
c    float64

In [9]: %time df2 = pd.read_csv(StringIO.StringIO((data+data[5:]*30)), converters={'b':converter})
CPU times: user 0.42 s, sys: 0.00 s, total: 0.42 s
Wall time: 0.42 s <-- Using a converter function slows parsing down by ~4x (and this is with a converter on only one column)

In [10]: df2.dtypes <-- Now only np.float types are returned; np.nan stands in for any spurious values in column 'b'.
Out[10]:
a    float64
b    float64
c    float64

In [11]: %time df3 = pd.read_csv(StringIO.StringIO((data+data[5:]*30)), dtype=np.float)
AttributeError: 'NoneType' object has no attribute 'dtype' <-- Trying to force np.float on the columns raises an exception because of the un-parseable values in column 'b' :(
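For comparison, a vectorized post-read coercion keeps the fast C parser and avoids the per-value Python converter overhead entirely; a sketch assuming pd.to_numeric from later pandas versions (Python 3, io.StringIO):

import io
import pandas as pd

data = 'a,b,c\n1.1,2.2,3.3\n2.2,garb,4.4\n3.3,4.4,5.5'
data = data + data[5:] * 3000

# Fast C parse first: column 'b' comes back as object ...
%time df = pd.read_csv(io.StringIO(data + data[5:] * 30))
# ... then one vectorized pass coerces the 'garb' entries to NaN.
%time df['b'] = pd.to_numeric(df['b'], errors='coerce')
df.dtypes  # a, b, c all float64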

Comment From: gravesee

Just wanted to add my 2 cents: the feature requested by @dragoljub is what SAS does by default when creating a data set. I have a similar desire for this functionality. In SAS, if I specify a column as numeric on import and a character field is parsed, it returns a NULL value.

Comment From: wesm

I'll have a look at this (unless someone beats me to it) because it's a somewhat straightforward addition and makes sense when data types are explicitly specified.

Comment From: jreback

@dragoljub quite straightforward after reading; I guess this is a request to push this down into read_csv (de facto when you specify a dtype):

In [5]: df = read_csv(StringIO(data))

In [6]: df
Out[6]: 
     a  b    c
0  1.1  2  3.0
1  Dog  5  6.0
2  7.7  8  9.5

In [8]: df['a'].convert_objects(convert_numeric='force')
Out[8]: 
0    1.1
1    NaN
2    7.7
dtype: float64

Comment From: jreback

@dragoljub maybe best to add this as a cookbook entry / or in docs (see my example at the end)

Comment From: dragoljub

This is a good lesson to have in the cookbook. BTW, does convert_objects() work on DataFrames too?

I guess some explicit pandas dtype post-processing is not so bad, given how fast the CSV parser reads all the data. :v:

Comment From: jreback

yep...convert_objects works on all NDFrames

want to do a quick PR for the cookbook?
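For a DataFrame-wide version in later pandas (convert_objects was eventually deprecated and removed), pd.to_numeric can be applied across all columns; a minimal sketch:

import io
import pandas as pd

data = 'a,b,c\n1.1,2,3\nDog,5,6\n7.7,8,9.5'
df = pd.read_csv(io.StringIO(data))

# apply() forwards the keyword to pd.to_numeric, so every column is
# coerced and any un-parseable string becomes NaN.
df_num = df.apply(pd.to_numeric, errors='coerce')
df_num.dtypes  # a, b, c all float64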

Comment From: lminer

If dtype is specified ahead of time it would be nice if conversion were forced for date types as well. I'd be willing to give this a shot...

Comment From: jreback

@lminer that would be excellent!

ideally implement for both the c and python parsers. lmk if you need help!

Comment From: adrivsh

Apparently there is already code that does this:

df.convert_objects(convert_numeric=True)

So it is only a matter of calling convert_objects when reading, right?

Comment From: gfyoung

@jreback : I thought I'd revisit this given our recent discussion on .infer_objects. I can imagine just adding a coerce_dtype parameter to the signature that would replace values that fail conversion with NaN (or the equivalent for other dtypes).

That being said, I'm concerned about adding more bloat to an already massive signature. Do you still think that implementing this internally (compared to, say, using .infer_objects after the call) is preferable?
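For context, .infer_objects only performs soft conversion and never coerces, so forced coercion still needs something like pd.to_numeric; a small sketch illustrating the difference:

import pandas as pd

s = pd.Series(['1.1', 'Dog', '7.7'], dtype=object)

# Soft conversion: infer_objects() does not parse strings at all,
# so the column stays object and nothing becomes NaN.
s.infer_objects().dtype            # dtype('O')

# Forced coercion: float64 with NaN for the bad entry.
pd.to_numeric(s, errors='coerce')  # [1.1, NaN, 7.7]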

Comment From: jreback

if you can find a way to fit this in with the existing dtype option, that would be preferable

maybe

dtype = {'foo' : (float, 'coerce')}

or we introduce a helper function

dtype = {'foo' : parser.coerce(float)}

where coerce just returns an instance
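Neither spelling exists in read_csv today; purely as an illustration, a hypothetical user-side helper that emulates the tuple spelling could split the spec and post-process (read_csv_coerce and the tuple convention are invented for this sketch):

import io
import pandas as pd

def read_csv_coerce(buf, dtype_spec):
    # Columns marked (dtype, 'coerce') are read as-is and coerced
    # afterwards; the rest go through read_csv's normal dtype path.
    coerce = [c for c, v in dtype_spec.items()
              if isinstance(v, tuple) and v[1] == 'coerce']
    plain = {c: v for c, v in dtype_spec.items() if c not in coerce}
    df = pd.read_csv(buf, dtype=plain)
    for c in coerce:
        df[c] = pd.to_numeric(df[c], errors='coerce')
    return df

data = 'a,b,c\n1.1,2,3\nDog,5,6\n7.7,8,9.5'
df = read_csv_coerce(io.StringIO(data), {'a': (float, 'coerce')})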

Comment From: gfyoung

@jreback : I like the first option; it seems more user-friendly IMO. I can investigate that and see where it takes me.

Comment From: grantroch

I went ahead and implemented the dtype argument with the tuple suggested above, but since so much has changed in read_csv since then, @jreback wanted to have an API discussion to settle on the preferred API. I think there is still value in this approach: the current workflow of reading the CSV and then coercing to float is quite time-consuming compared to coercing while reading the file. I think the changes in #44022 to coerce in the parsing code are at least a step in the right direction; it is more a matter of how we want to expose this ability to the user.

Comment From: mroeschke

In 1.5, on_bad_lines accepts a callable that allows custom handling of a bad line: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#handling-bad-lines
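For reference, a minimal sketch of that hook (assuming pandas >= 1.5 and engine='python'); note that the callable fires for rows with the wrong number of fields, not for rows that merely fail dtype conversion:

import io
import pandas as pd

bad = 'a,b,c\n1,2,3\n4,5,6,7\n8,9,10'

# The callable receives the offending row as a list of strings and
# returns a repaired row (or None to drop it).
df = pd.read_csv(io.StringIO(bad), engine='python',
                 on_bad_lines=lambda fields: fields[:3])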

Closing, as I think this satisfies the enhancement.