File: sample.txt
import pandas as pd
# This does not work as expected
w = pd.read_csv("sample.txt", dtype={"a": str, "b": float}, sep="\t")
print(type(w["a"][0]))
print(type(w["a"][1]))
print(type(w["a"][2]))
print(type(w["b"][0]))
print(type(w["b"][1]))
print(type(w["b"][2]))
print(w)
print("======")
# This is a cumbersome work-around: read both columns as str, then coerce column b manually
w = pd.read_csv("sample.txt", dtype={"a": str, "b": str}, sep="\t", na_values=[], keep_default_na=False)
w.b = w.b.apply(pd.to_numeric, args=('coerce',))  # element-wise conversion with errors='coerce'
print(type(w["a"][0]))
print(type(w["a"][1]))
print(type(w["a"][2]))
print(type(w["b"][0]))
print(type(w["b"][1]))
print(type(w["b"][2]))
print(w)
Output
<class 'str'>
<class 'float'>
<class 'str'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
            a    b
0  sampletext  1.0
1         NaN  2.0
2  sampletext  NaN
3         NaN  NaN
======
<class 'str'>
<class 'str'>
<class 'str'>
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.float64'>
            a    b
0  sampletext  1.0
1              2.0
2  sampletext  NaN
3         NaN  NaN
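As an aside, the coercion step in the work-around above can also be written as one vectorized call instead of a per-element apply; this is only a sketch, assuming the same tab-separated sample.txt as in the snippets above:

import pandas as pd

# read both columns as plain strings, keeping '' and the literal text "NaN"
w = pd.read_csv("sample.txt", dtype={"a": str, "b": str}, sep="\t",
                na_values=[], keep_default_na=False)
# coerce the whole column at once; unparseable entries ('' or "NaN") become NaN
w["b"] = pd.to_numeric(w["b"], errors="coerce")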
Problem description
Documentation states:
Use str or object to preserve and not interpret dtype.
This behaviour is somewhat confusing in itself, and it has consequences. One could expect that if pandas is told a column is string, it would coerce all values within that column to string. That is not the case.
Given that NA can only be a float, it is possible to end up with a mixed-type column (which is fun by itself).
Imagine that you have two columns, one str and one float. Both can contain empty strings and "NaN"s (and other such values). You want to preserve the empty strings and "NaN"s as strings in the first column, yet want the other column to have proper NAs.
The only solution here seems to be setting na_values to [] and coercing the float column manually. I work with relatively large tables (40 columns, half strings, half floats, ~2-10M rows), so manually casting everything to the proper types is time- and memory-consuming (floats stored as strings) and error-prone.
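For what it's worth, one possible way to avoid the manual cast, sketched under the assumption that sample.txt is the same tab-separated file with a string column "a" and a float column "b" as above, is to disable the default NA parsing and re-enable it only for the float column through a per-column na_values dict (the sentinel list for "b" is an assumption about what should count as missing there):

import pandas as pd

w = pd.read_csv(
    "sample.txt",
    sep="\t",
    dtype={"a": str, "b": float},
    keep_default_na=False,                # keep '' and the literal text "NaN" in column "a"
    na_values={"b": ["", "NaN", "nan"]},  # treat these as missing only in column "b"
)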
Expected behaviour
Everything works as stated in the documentation. However, this has unpleasant consequences. I cannot fathom a scenario where one would need a mixed-type column, strings mixed with float NAs. To me, it seems obvious to expect the column to contain strings. It also seems obvious that if there is a dtype parameter to force column types, I should be able to force a column to contain strings (but that is not the case, and it does not even throw a warning; why is it designed this way? It puzzles me).
Comment From: jreback
Well, since you haven't provided a reproducible example, I will speak generally.
NaN IS the missing value indicator for float/string columns, while NaT is for datetime-likes.
You are fighting pandas by trying to manipulate the options; this will for the most part just work.
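A tiny illustrative sketch of that point: pandas uses NaN as the missing-value marker in float and object (string) columns, and NaT in datetime-like columns.

import pandas as pd

f = pd.Series([1.0, None])                           # float column: None becomes NaN
d = pd.to_datetime(pd.Series(["2020-01-01", None]))  # datetime column: None becomes NaT
print(f.iloc[1], d.iloc[1])                          # nan NaT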
Comment From: rampeer
@jreback I have provided a reproducible example: code that loads a file, and the file itself (I moved the file link to the top to make it more visible).
Currently, I either have to live with a mixed-type column and loss of data (both empty strings and "NaN" get converted into float NaNs), or parse the float column myself.
Would you kindly review the code?
Comment From: jreback
You have odd data with empty strings AND NaN coming in, and you want to preserve the ''. This is quite atypical, so I'm not really sure what you are asking.
Comment From: rampeer
When possible NA values in a float column interfere with values in a string column, the string column's values get automatically converted to float NaNs.
A less complicated example:
import pandas as pd
import numpy as np

# a string column containing an empty string, and a float column containing a NaN
desired_table = pd.DataFrame({"a": ["Sample", ""], "b": [0, np.nan]})
desired_table.to_csv("sample.txt", index=False, na_rep="")
read_table = pd.read_csv("sample.txt")
print(read_table)
        a    b
0  Sample  0.0
1     NaN  NaN
Note the NaN in column "a". I expected an empty string there.
Is there any way to ensure column type consistency in this particular situation, other than manually fixing the NaNs in the string column?
Comment From: jreback
@rampeer you are missing the point
NaN IS the indicator for missing values in a string column. Pandas expects this.
You can .fillna('') if you really don't want that. You are talking about export, so that's up to you.
Exporting works as expected:
In [5]: result_table.to_csv()
Out[5]: ',a,b\n0,Sample,0.0\n1,,\n'
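A minimal sketch of that .fillna('') suggestion, assuming the read_table DataFrame from the round-trip example above:

import pandas as pd

read_table = pd.read_csv("sample.txt")
read_table["a"] = read_table["a"].fillna("")  # column "a" is back to ['Sample', ''], while "b" keeps its NaN
print(read_table)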
Comment From: rampeer
Thanks!
I was hoping there was a better way to do this, at least one that preserves the original string.
By the way, a simple warning message like "Columns have mixed types, NaNs found in string column" would have saved an hour of debugging.