reference: https://stackoverflow.com/questions/34059704/pandas-groupby-aggregation-operation-changes-key-column-dtype
When I apply a groupby and aggregation to the original DataFrame, the dtype of the key column changes from int32 to int64. Here is a simple example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': np.array([3234, 332635, 325993]), 'amount': np.array([34, 43, 32])},
                  index=['a', 'b', 'c'],
                  dtype='int32')
print(df.info())
df = df.groupby('id', as_index=False).sum()
print(df.info())
The output is:
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, a to c
Data columns (total 2 columns):
amount    3 non-null int32
id        3 non-null int32
dtypes: int32(2)
memory usage: 48.0+ bytes
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns (total 2 columns):
id        3 non-null int64
amount    3 non-null int32
dtypes: int32(1), int64(1)
memory usage: 60.0 bytes
None
pd.show_versions():
INSTALLED VERSIONS
commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.20.3
pytest: 2.8.5
pip: 9.0.1
setuptools: 20.3
Cython: 0.23.4
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 4.1.2
sphinx: 1.3.5
patsy: 0.4.0
dateutil: 2.5.1
pytz: 2016.2
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
pandas_gbq: None
pandas_datareader: None
None
Comment From: gfyoung
@halfmoonhalf : Thanks for reporting this! IMO, I wouldn't be too scared by this type upcasting, but even so, I do agree that we should probably try and not do this if possible. Feel free to poke around and see where and how that is happening.
Comment From: jreback
Generally, for indexers we use int64, so this upcasting is unavoidable and expected.
Comment From: jreback
For example, we don't allow non int64 indexers in a result index.
In [25]: df.groupby('id').sum().index
Out[25]: Int64Index([3234, 325993, 332635], dtype='int64', name='id')
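If the original key dtype matters downstream (for example, for memory usage or a later merge), one possible workaround is to record the dtype before grouping and cast the key column back afterwards. This is only a sketch, not an official pandas recommendation; the variable names `key_dtype` and `result` are illustrative, and the frame mirrors the one from the report.

```python
import numpy as np
import pandas as pd

# Same data as in the report, with both columns stored as int32.
df = pd.DataFrame(
    {
        "id": np.array([3234, 332635, 325993], dtype="int32"),
        "amount": np.array([34, 43, 32], dtype="int32"),
    },
    index=["a", "b", "c"],
)

# Remember the key dtype before grouping (illustrative name).
key_dtype = df["id"].dtype

result = df.groupby("id", as_index=False).sum()

# Cast the key column back in case the groupby upcast it.
# If the dtype was preserved, this cast changes nothing.
result["id"] = result["id"].astype(key_dtype)

print(result.dtypes)
```

The cast is safe here because the key values came from an int32 column in the first place, so they are guaranteed to fit.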
Comment From: daragallagher
I know this is an old and closed issue but I was bitten by this issue recently.
And I found myself confused by the above comment ("we don't allow non int64 indexers in a result index"):
df = pd.DataFrame({
'a': np.array([1], dtype=np.int32),
'b': np.array([1.23], dtype=np.float64),
'c': np.array([3], dtype=np.uint8),
'd': np.array([7.8], dtype=np.float16),
'e': np.array([False], dtype=np.bool_)})
I can get a bunch of indexers:
In [4]: df.groupby('a').sum().index
Out[4]: Int64Index([1], dtype='int64', name='a')
In [5]: df.groupby('b').sum().index
Out[5]: Float64Index([1.23], dtype='float64', name='b')
In [6]: df.groupby('c').sum().index
Out[6]: UInt64Index([3], dtype='uint64', name='c')
In [7]: df.groupby('d').sum().index
Out[7]: Float64Index([7.80078125], dtype='float64', name='d')
In [8]: df.groupby('e').sum().index
Out[8]: Index([False], dtype='object', name='e')
I'm not sure how this behaviour should be expected by users as there's no mention of this behaviour in the groupby documentation.
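Since the documentation does not spell this out, and the result may depend on the installed pandas version, a small check like the following can show what actually happens locally. It is a rough sketch: `index_dtype_report` is a hypothetical helper, not a pandas function, it uses `size()` rather than `sum()` so that only the key column is involved, and the float16 column is left out to keep the example small.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": np.array([1], dtype=np.int32),
    "b": np.array([1.23], dtype=np.float64),
    "c": np.array([3], dtype=np.uint8),
    "e": np.array([False], dtype=np.bool_),
})


def index_dtype_report(frame):
    """Map each column name to (column dtype, dtype of the groupby result index)."""
    report = {}
    for col in frame.columns:
        # Group by one column at a time and inspect the resulting index.
        result_index = frame.groupby(col).size().index
        report[col] = (frame[col].dtype, result_index.dtype)
    return report


report = index_dtype_report(df)
for col, (before, after) in report.items():
    print("{}: column dtype {} -> index dtype {}".format(col, before, after))
```

Any row where the two dtypes differ is a key whose dtype was changed by the groupby.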
Comment From: rjafarau
Agreed with @daragallagher. This behavior is not obvious from the groupby documentation! @jreback