Pandas Inconsistency in CSV Import/Export

Pandas cannot export a matrix to a CSV file and re-import it while keeping the data consistent. With default arguments, to_csv adds an extra column for the index. Importing with default parameters then treats this column as data, not as the index.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10,10))
df.to_csv('np.csv')
print(pd.read_csv('np.csv'))


   Unnamed: 0         0         1         2         3         4         5  \
0           0  0.055663  0.492976  0.936424  0.931585  0.043748  0.931660   
1           1  0.946510  0.481707  0.935273  0.987895  0.982537  0.735273   
2           2  0.429818  0.090192  0.923747  0.973678  0.432166  0.318196   
3           3  0.579657  0.599554  0.794318  0.631867  0.700044  0.834421   
4           4  0.438074  0.747774  0.034653  0.113885  0.982059  0.736432   
5           5  0.379523  0.094214  0.435573  0.729742  0.778312  0.341792   
6           6  0.542644  0.175657  0.913459  0.532352  0.607791  0.369434   
7           7  0.132935  0.052179  0.145688  0.549158  0.127237  0.475737   
8           8  0.454960  0.872086  0.006616  0.444334  0.435469  0.435362   
9           9  0.141345  0.512531  0.900547  0.570482  0.366632  0.992289   

          6         7         8         9  
0  0.385482  0.432543  0.927187  0.408233  
1  0.385019  0.905481  0.852093  0.368507  
2  0.641478  0.966683  0.706884  0.229032  
3  0.592390  0.091528  0.969585  0.177480  
4  0.805170  0.585675  0.024259  0.961815  
5  0.818240  0.688166  0.175099  0.583955  
6  0.697869  0.202709  0.458018  0.546078  
7  0.597875  0.625422  0.055143  0.720858  
8  0.866318  0.348642  0.855215  0.689258  
9  0.723096  0.194654  0.681293  0.941478

Comment From: jreback

the inverse operation of .to_csv is from_csv

In [10]: df = pd.DataFrame(np.random.rand(10,10))

In [11]: df.to_csv('test.csv',mode='w')

In [12]: !cat test.csv
,0,1,2,3,4,5,6,7,8,9
0,0.410789548933,0.141882962291,0.481424012182,0.253145260533,0.349319258408,0.552969720747,0.457827171398,0.361762326267,0.00569519672086,0.623535751613
1,0.369638666467,0.322324774448,0.400265909069,0.642042275107,0.799972540147,0.359167258874,0.239007981282,0.812969158011,0.559582423368,0.00271466592636
2,0.717172031665,0.179713595564,0.956176942931,0.848912709056,0.91118300087,0.391446338563,0.708771850147,0.885832551406,0.708784751692,0.430181079966
3,0.0225329325896,0.190005393361,0.0194796447118,0.869802283448,0.430925947353,0.136011580077,0.529612719739,0.681007234468,0.115292421255,0.305482908184
4,0.289044376003,0.535503444011,0.212408295498,0.0542784302991,0.664277492374,0.357734952961,0.375739315655,0.831491303632,0.00554139533804,0.59147155945
5,0.317218866368,0.461190823521,0.0580049804076,0.539360261154,0.990320435889,0.430079077782,0.442252192586,0.286467160784,0.67520580223,0.358516637142
6,0.681700666131,0.468662142977,0.178406551592,0.627463561773,0.9228852801,0.956406234721,0.669339262005,0.0653954611576,0.187273735622,0.697836946507
7,0.00022882527549,0.00633811057126,0.147099077394,0.0305195112454,0.395283200237,0.163439056245,0.138368552052,0.999240657646,0.786156284675,0.94207117023
8,0.686420735795,0.634091772292,0.448123675745,0.960918481445,0.341246536191,0.349309821001,0.203070985042,0.520821277184,0.0863019780958,0.850411108284
9,0.403063746431,0.0217493935357,0.706866005935,0.19966875768,0.902210895494,0.360288312432,0.422414808927,0.721770768274,0.650247350901,0.436017563996

In [14]: DataFrame.from_csv('test.csv')
Out[14]: 
          0         1         2         3         4         5         6         7         8         9
0  0.410790  0.141883  0.481424  0.253145  0.349319  0.552970  0.457827  0.361762  0.005695  0.623536
1  0.369639  0.322325  0.400266  0.642042  0.799973  0.359167  0.239008  0.812969  0.559582  0.002715
2  0.717172  0.179714  0.956177  0.848913  0.911183  0.391446  0.708772  0.885833  0.708785  0.430181
3  0.022533  0.190005  0.019480  0.869802  0.430926  0.136012  0.529613  0.681007  0.115292  0.305483
4  0.289044  0.535503  0.212408  0.054278  0.664277  0.357735  0.375739  0.831491  0.005541  0.591472
5  0.317219  0.461191  0.058005  0.539360  0.990320  0.430079  0.442252  0.286467  0.675206  0.358517
6  0.681701  0.468662  0.178407  0.627464  0.922885  0.956406  0.669339  0.065395  0.187274  0.697837
7  0.000229  0.006338  0.147099  0.030520  0.395283  0.163439  0.138369  0.999241  0.786156  0.942071
8  0.686421  0.634092  0.448124  0.960918  0.341247  0.349310  0.203071  0.520821  0.086302  0.850411
9  0.403064  0.021749  0.706866  0.199669  0.902211  0.360288  0.422415  0.721771  0.650247  0.436018

Comment From: groakat

Sorry, I did not see that, because it does not seem to be exposed at the top level of the pandas namespace. So you have to do

df = pd.DataFrame.from_csv('text.csv')

rather than

df = pd.from_csv('text.csv')

Comment From: jreback

Round-tripping CSV is not nearly as common as simply reading CSVs.

You might want to simply use

to_csv(..., index=False)

which can then be read with the default arguments of pd.read_csv.

Or you can write the file as is and specify pd.read_csv(..., index_col=0), so that the first column is read back as the index.
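
A minimal sketch of both options, assuming a recent pandas and using an in-memory buffer instead of a file on disk (the buffer and the variable names are illustrative, not from the thread). The second option uses index_col=0, which selects the first column as the index. Note that the column labels come back as strings, and the float values are compared with a tolerance rather than for exact equality.

```python
import io

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(3, 3))

# Option 1: do not write the index, then read with default arguments.
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
no_index = pd.read_csv(buf)

# Option 2: write the index, then read the first column back as the index.
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf, index_col=0)

# Neither result contains an extra "Unnamed: 0" data column.
print(no_index.shape)
print(with_index.shape)

# The header is parsed as text, so the labels are '0', '1', '2'.
print(list(with_index.columns))

# Compare values with a tolerance rather than exact equality.
print(np.allclose(with_index.to_numpy(), df.to_numpy()))
```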

Comment From: johne13

FWIW, this does not hold for strings with a value of 'NA'. I imagine there are other exceptions as well. To be honest, I'm not sure exactly how much consistency is to be expected here. It can probably be expected for numbers, but not necessarily for less well-behaved strings, where quoting, non-standard characters, etc. could throw things off.

df=pd.DataFrame({ 'x':['NA','foo'] })

     x
0   NA
1  foo

df.to_csv('test.csv')
pd.DataFrame.from_csv('test.csv')

     x
0  NaN
1  foo
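
For reference, a hedged sketch of one way to keep the literal string when reading, using pd.read_csv with keep_default_na=False. This turns off all of the default missing-value markers, so empty fields are no longer converted to NaN either. It is a workaround under that assumption, not a general fix. The in-memory buffer is used only to keep the example self-contained.

```python
import io

import pandas as pd

df = pd.DataFrame({'x': ['NA', 'foo']})

buf = io.StringIO()
df.to_csv(buf)

# Default parsing: the string 'NA' is treated as a missing value.
buf.seek(0)
default = pd.read_csv(buf, index_col=0)

# With the default NA markers disabled, 'NA' survives as a string.
buf.seek(0)
literal = pd.read_csv(buf, index_col=0, keep_default_na=False)

print(default['x'].isna().tolist())
print(literal['x'].tolist())
```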

Comment From: jreback

@johne13 I don't think that was ever intended, and it is very much an edge case. Otherwise you would not have automatic NaN conversion on float dtypes, which is much more common.

You cannot have perfect fidelity in all situations with an inherently imperfect format; there simply is not enough metadata in CSV.
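
A small illustration of the missing metadata, as a sketch: a CSV file does not record column dtypes, so read_csv has to infer them. The column name 'code' and its values are made up for this example. Declaring the dtype on read restores the strings, but that information has to come from outside the file.

```python
import io

import pandas as pd

# Strings that happen to look like numbers.
df = pd.DataFrame({'code': ['001', '002']})

buf = io.StringIO()
df.to_csv(buf, index=False)

# Inferred dtype: the column is read as integers, leading zeros are lost.
buf.seek(0)
inferred = pd.read_csv(buf)

# Declared dtype: the caller supplies what the file itself cannot.
buf.seek(0)
declared = pd.read_csv(buf, dtype={'code': str})

print(inferred['code'].tolist())
print(declared['code'].tolist())
```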

HDF5 and msgpack have very nice fidelity, OTOH (in theory SQL does as well, but in practice it has issues).

Comment From: johne13

@jreback Right, my CSV use is always necessity, not choice. I'm not trying to promote CSV use! But it's a common necessity for me unfortunately.

FWIW, I've actually never come across "NA" or "NaN" in numerical columns in the data I work with (economic and/or survey data). It's always blank or . (period). But I get the point.