Code Sample, a copy-pastable example if possible
>>> from io import StringIO
>>> import pandas as pd
>>> import numpy as np
>>> data = """a,b,c
1,a,3.4
1,a,3.4
2,b,4.5"""
>>> frame = pd.read_csv(StringIO(data), dtype={0: 'category', 1: 'str', 2: 'float64'})
>>> np.asarray(frame.ix[:,0])
array(['1', '1', '2'], dtype=object)
>>> np.asarray(frame.ix[:,1])
array(['a', 'a', 'b'], dtype=object)
>>> np.asarray(frame.ix[:,2])
array([ 3.4, 3.4, 4.5])
Problem description
When loading CSV data, it does not seem possible to specify the internal dtype of a categorical column. I can specify that a column is categorical, but not that its values are integers.
Expected Output
>>> np.asarray(frame.ix[:,0])
array([1, 1, 2])
But if I do:
>>> series = pd.Series([1, 1, 2], dtype='category')
>>> np.asarray(series)
array([1, 1, 2])
It would be great if I could specify, at CSV reading time, both that the column should be categorical and that its values should be int.
(Using categorical and int is just for demo purposes.)
Or, on the other hand, is it guaranteed that the dtype will always be object when a categorical column is read from a CSV file and converted to numpy?
Comment From: chris-b1
You are correct: specifying the dtype of the categories when parsing is not currently supported, and the categories are guaranteed to be object dtype. See the docs here:
http://pandas.pydata.org/pandas-docs/stable/io.html#specifying-categorical-dtype
You can convert the categories after parsing as in the doc example:
frame['a'].cat.categories = pd.to_numeric(frame['a'].cat.categories)
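For reference, here is a minimal end-to-end sketch of that post-parse conversion, reusing the data from the report. It uses `rename_categories` rather than assigning to `.cat.categories`, on the assumption that newer pandas releases no longer allow setting that attribute directly. The column is addressed by its name `"a"` instead of by position, which is also an assumption made for the example.

```python
from io import StringIO

import numpy as np
import pandas as pd

data = """a,b,c
1,a,3.4
1,a,3.4
2,b,4.5"""

# Parse column "a" as categorical; its categories are parsed as strings.
frame = pd.read_csv(StringIO(data), dtype={"a": "category"})

# Convert the string categories to numbers, then swap them in.
# rename_categories keeps the codes and replaces only the category labels.
numeric_categories = pd.to_numeric(frame["a"].cat.categories)
frame["a"] = frame["a"].cat.rename_categories(numeric_categories)

# The numpy view of the column now holds integers instead of strings.
values = np.asarray(frame["a"])
print(values)
```

The column stays categorical after the conversion; only the dtype of its categories changes.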
Comment From: mitar
Thanks for the reply. It makes sense.