Code Sample, a copy-pastable example if possible
In pandas v0.20.2, the following code
import pandas as pd
from io import StringIO
data = "c1,c2\nfalse,1\n,1"
pd.read_csv(StringIO(data), dtype={'c1': 'float32'})['c1']
gives output
0 0.0
1 1.0
Name: c1, dtype: float32
Problem description
In this example, the column of boolean data contains a missing value. If I read the column as booleans (either explicitly via dtype or by allowing pandas to infer the type), then the missing value is given as NaN, as it should be. If I force the column type to be a float (or an integer) via the dtype argument to read_csv, then the missing value is given as 1.0, the same as True.
Expected Output
The output of
pd.read_csv(StringIO(data), dtype={'c1': 'float32'})['c1']
should be the same as the output of
pd.read_csv(StringIO(data))['c1'].astype('float32')
which is
0 0.0
1 NaN
Name: c1, dtype: float32
I.e., the missing value in the input CSV should be cast to NaN rather than 1.0.
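On affected versions, the comparison read above can also serve as a workaround: let pandas infer the column first, so the empty field is treated as missing, and only then cast to float32. The following is a minimal sketch of that idea; the helper name read_bool_column_as_float is hypothetical and not part of pandas.

```python
from io import StringIO

import numpy as np
import pandas as pd

data = "c1,c2\nfalse,1\n,1"


def read_bool_column_as_float(csv_text, column):
    # Hypothetical helper, not a pandas API.
    # Step 1: read without a forced dtype so the empty field is
    # recognised as a missing value during type inference.
    inferred = pd.read_csv(StringIO(csv_text))[column]
    # Step 2: cast afterwards; False becomes 0.0 and the missing
    # value stays NaN.
    return inferred.astype("float32")


result = read_bool_column_as_float(data, "c1")
print(result)
```

The cost is a second pass over the column, which is probably negligible for small files but worth keeping in mind for large ones.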
Output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.20.2
pytest: 3.0.7
pip: 9.0.1
setuptools: 33.1.1.post20170320
Cython: 0.25.2
numpy: 1.13.0
scipy: 0.19.0
xarray: None
IPython: 6.0.0
sphinx: 1.5.5
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.5.3
html5lib: 0.999
sqlalchemy: None
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.8
s3fs: None
pandas_gbq: None
pandas_datareader: None
Comment From: jreback
Yeah, this looks like a bug. You're welcome to open a PR to fix it.
Comment From: phofl
This was fixed in #44901, and tests cover this case.