pandas groupby/aggregation operation changes 'key' column dtype

Reference: https://stackoverflow.com/questions/34059704/pandas-groupby-aggregation-operation-changes-key-column-dtype

When I run a groupby and aggregation on a data frame, the dtype of the 'key' column changes from int32 to int64. Here is a simple example:

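import numpy as np
import pandas as pd

df = pd.DataFrame({'id': np.array([3234, 332635, 325993]), 'amount': np.array([34, 43, 32])},
                  index=['a', 'b', 'c'],
                  dtype='int32')
# df.info() prints the summary itself and returns None, hence the "None" lines in the output
print(df.info())

df = df.groupby('id', as_index=False).sum()
print(df.info())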

The output is:

  <class 'pandas.core.frame.DataFrame'>
  Index: 3 entries, a to c
  Data columns (total 2 columns):
  amount    3 non-null int32
  id        3 non-null int32
  dtypes: int32(2)
  memory usage: 48.0+ bytes
  None

  <class 'pandas.core.frame.DataFrame'>
  Int64Index: 3 entries, 0 to 2
  Data columns (total 2 columns):
  id        3 non-null int64
  amount    3 non-null int32
  dtypes: int32(1), int64(1)
  memory usage: 60.0 bytes
  None

pd.show_versions():

INSTALLED VERSIONS

commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.20.3
pytest: 2.8.5
pip: 9.0.1
setuptools: 20.3
Cython: 0.23.4
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 4.1.2
sphinx: 1.3.5
patsy: 0.4.0
dateutil: 2.5.1
pytz: 2016.2
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
pandas_gbq: None
pandas_datareader: None
None

Comment From: gfyoung

@halfmoonhalf : Thanks for reporting this! IMO, I wouldn't be too scared by this type upcasting, but even so, I do agree that we should probably try and not do this if possible. Feel free to poke around and see where and how that is happening.

Comment From: jreback

Generally we use int64 for indexers, so this upcasting is unavoidable and expected.

Comment From: jreback

For example, we don't allow non int64 indexers in a result index.

In [25]: df.groupby('id').sum().index
Out[25]: Int64Index([3234, 325993, 332635], dtype='int64', name='id')
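
If the original key dtype matters downstream, one possible workaround is to remember it before grouping and cast the key column back afterwards. This is only a minimal sketch reusing the data from the report; the variable names `key_dtype` and `result` are illustrative, and on pandas versions that already preserve the key dtype the cast is simply a no-op.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "id": np.array([3234, 332635, 325993], dtype="int32"),
        "amount": np.array([34, 43, 32], dtype="int32"),
    },
    index=["a", "b", "c"],
)

# Remember the key dtype before grouping (int32 here).
key_dtype = df["id"].dtype

# Group and aggregate; depending on the pandas version, "id" may come back as int64.
result = df.groupby("id", as_index=False).sum()

# Cast the key column back explicitly so the dtype does not depend on the version.
result["id"] = result["id"].astype(key_dtype)

print(result.dtypes)
```

The cast is safe here because the grouped key values are the same values that already fit in the original dtype.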

Comment From: daragallagher

I know this is an old, closed issue, but I was bitten by it recently.

And I found myself confused by the above comment ("we don't allow non int64 indexers in a result index"):

df = pd.DataFrame({
    'a': np.array([1], dtype=np.int32),
    'b': np.array([1.23], dtype=np.float64),
    'c': np.array([3], dtype=np.uint8),
    'd': np.array([7.8], dtype=np.float16),
    'e': np.array([False], dtype=np.bool_)})

I can get result indexes of several other dtypes:

In [4]: df.groupby('a').sum().index
Out[4]: Int64Index([1], dtype='int64', name='a')

In [5]: df.groupby('b').sum().index
Out[5]: Float64Index([1.23], dtype='float64', name='b')

In [6]: df.groupby('c').sum().index
Out[6]: UInt64Index([3], dtype='uint64', name='c')

In [7]: df.groupby('d').sum().index
Out[7]: Float64Index([7.80078125], dtype='float64', name='d')

In [8]: df.groupby('e').sum().index
Out[8]: Index([False], dtype='object', name='e')

I'm not sure how users are supposed to expect this behaviour, as the groupby documentation doesn't mention it.
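
Since the result can differ between pandas versions, a quick way to see what an installed version does is to compare each key column's dtype with the dtype of the resulting index. The snippet below is a rough sketch of that check; the `report` dictionary is just an illustrative name, and the float16 column is left out on the assumption that not every pandas version accepts it as a group key.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": np.array([1], dtype=np.int32),
    "b": np.array([1.23], dtype=np.float64),
    "c": np.array([3], dtype=np.uint8),
    "e": np.array([False], dtype=bool),
})

# Map each key column to (dtype before groupby, dtype of the result index).
report = {}
for key in df.columns:
    index = df.groupby(key).sum().index
    report[key] = (str(df[key].dtype), str(index.dtype))
    print(key, report[key])
```

Any pair whose two entries differ marks a key that was cast during the groupby on that installation.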

Comment From: rjafarau

Agreed with @daragallagher. This behavior is not obvious from the groupby documentation! @jreback