File: sample.txt
import pandas as pd
# This does not work as expected
w = pd.read_csv("sample.txt", dtype={"a": str, "b": float}, sep="\t")
print(type(w["a"][0]))
print(type(w["a"][1]))
print(type(w["a"][2]))
print(type(w["b"][0]))
print(type(w["b"][1]))
print(type(w["b"][2]))
print(w)
print("======")
# This is a cumbersome work-around: read both columns as str, then coerce column b manually
w = pd.read_csv("sample.txt", dtype={"a": str, "b": str}, sep="\t", na_values=[], keep_default_na=False)
w.b = w.b.apply(pd.to_numeric, args=('coerce',))  # element-wise conversion with errors='coerce'
print(type(w["a"][0]))
print(type(w["a"][1]))
print(type(w["a"][2]))
print(type(w["b"][0]))
print(type(w["b"][1]))
print(type(w["b"][2]))
print(w)
Output
<class 'str'>
<class 'float'>
<class 'str'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
            a    b
0  sampletext  1.0
1         NaN  2.0
2  sampletext  NaN
3         NaN  NaN
======
<class 'str'>
<class 'str'>
<class 'str'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
            a    b
0  sampletext  1.0
1              2.0
2  sampletext  NaN
3         NaN  NaN
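As an aside, the coercion step in the work-around above can also be written as one vectorized call instead of a per-element apply; this is only a sketch, assuming the same tab-separated sample.txt as in the snippets above:

import pandas as pd

# read both columns as plain strings, keeping '' and the literal text "NaN"
w = pd.read_csv("sample.txt", dtype={"a": str, "b": str}, sep="\t",
                na_values=[], keep_default_na=False)
# coerce the whole column at once; unparseable entries ('' or "NaN") become NaN
w["b"] = pd.to_numeric(w["b"], errors="coerce")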
Problem description
Documentation states:
Use str or object to preserve and not interpret dtype.
This behaviour is somewhat confusing in itself, and it has consequences. One could expect that if pandas is told a column is string, it would coerce all values within that column to string. That is not the case.
Given that NA can only be a float, it is possible to end up with a mixed-type column (which is fun by itself).
Imagine that you have two columns, one str and one float. Both can contain empty strings and "NaN"s (and other such values). You want to preserve the empty strings and "NaN"s as strings in the first column, yet want the other column to have proper NAs.
The only solution here seems to be setting na_values to [] and coercing the float column manually. I work with relatively large tables (40 columns, half strings, half floats, ~2-10M rows), so manually casting everything to the proper types is time- and memory-consuming (floats stored as strings) and error-prone.
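For what it's worth, one possible way to avoid the manual cast, sketched under the assumption that sample.txt is the same tab-separated file with a string column "a" and a float column "b" as above, is to disable the default NA parsing and re-enable it only for the float column through a per-column na_values dict (the sentinel list for "b" is an assumption about what should count as missing there):

import pandas as pd

w = pd.read_csv(
    "sample.txt",
    sep="\t",
    dtype={"a": str, "b": float},
    keep_default_na=False,                # keep '' and the literal text "NaN" in column "a"
    na_values={"b": ["", "NaN", "nan"]},  # treat these as missing only in column "b"
)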
Expected behaviour
Everything works as stated in the documentation. However, this has unpleasant consequences. I cannot fathom a scenario where one would need a mixed-type column, strings mixed with float NAs. To me, it seems obvious to expect the column to contain strings. It also seems obvious that if there is a dtype parameter to force column types, I should be able to force a column to contain strings (but that is not the case, and it does not even throw a warning; why is it designed this way? It puzzles me).
Comment From: jreback
Well, since you haven't provided a reproducible example, I will speak generally.
NaN IS the missing value indicator for float/string columns, while NaT is for datetime-likes.
You are fighting pandas by trying to manipulate the options; this will for the most part just work.
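A tiny illustrative sketch of that point: pandas uses NaN as the missing-value marker in float and object (string) columns, and NaT in datetime-like columns.

import pandas as pd

f = pd.Series([1.0, None])                           # float column: None becomes NaN
d = pd.to_datetime(pd.Series(["2020-01-01", None]))  # datetime column: None becomes NaT
print(f.iloc[1], d.iloc[1])                          # nan NaT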
Comment From: rampeer
@jreback I have provided a reproducible example: code that loads a file, and the file itself (I moved the file link to the top to make it more visible).
Currently, I either have to live with a mixed-type column and loss of data (both empty strings and "NaN" get converted into float NaNs), or parse the float column myself.
Would you kindly review the code?
Comment From: jreback
You have odd data with empty strings AND NaN coming in, and you want to preserve the ''. This is quite atypical, so I'm not really sure what you are asking.
Comment From: rampeer
When possible NA values in a float column interfere with values in a string column, the string column's values get automatically converted to float NaNs.
A less complicated example:
import pandas as pd
import numpy as np

# a string column containing an empty string, and a float column containing a NaN
desired_table = pd.DataFrame({"a": ["Sample", ""], "b": [0, np.nan]})
desired_table.to_csv("sample.txt", index=False, na_rep="")
read_table = pd.read_csv("sample.txt")
print(read_table)
        a    b
0  Sample  0.0
1     NaN  NaN
Note the NaN in column "a". I expected an empty string there.
Is there any way to ensure column type consistency in this particular situation, other than manually fixing the NaNs in the string column?
Comment From: jreback
@rampeer you are missing the point
NaN IS the indicator for missing values in a string column. Pandas expects this.
You can .fillna('') if you really don't want that. You are talking about export, so that's up to you.
Exporting works as expected:
In [5]: result_table.to_csv()
Out[5]: ',a,b\n0,Sample,0.0\n1,,\n'
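A minimal sketch of that .fillna('') suggestion, assuming the read_table DataFrame from the round-trip example above:

import pandas as pd

read_table = pd.read_csv("sample.txt")
read_table["a"] = read_table["a"].fillna("")  # column "a" is back to ['Sample', ''], while "b" keeps its NaN
print(read_table)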
Comment From: rampeer
Thanks!
I was hoping there was a better way to do this, at least one that preserves the original string.
By the way, a simple warning message like "Columns have mixed types, NaNs found in string column" would have saved an hour of debugging.