Code sample
import datetime as dt
import pandas as pd
import numpy as np
df = pd.DataFrame(zip(np.random.rand(5).tolist(), [1]*5, [dt.date.today()]*5), columns=list('abc'))
print df
print df.groupby('c').a.apply(lambda x: x.max()).index
print df.groupby(['b', 'c']).a.apply(lambda x: x.max()).index.get_level_values(1)
print df.groupby(['b', 'c']).a.max().index.get_level_values(1)
The output is
a b c
0 0.896739 1 2017-09-24
1 0.473168 1 2017-09-24
2 0.100591 1 2017-09-24
3 0.870899 1 2017-09-24
4 0.716934 1 2017-09-24
Index([2017-09-24], dtype='object', name=u'c')
DatetimeIndex(['2017-09-24'], dtype='datetime64[ns]', name=u'c', freq=None)
Index([2017-09-24], dtype='object', name=u'c')
Problem description
Of the three group-by statements, only the second converts the date field (column 'c' in the code above) from object dtype to datetime64[ns]. Shouldn't the date-to-datetime conversion be consistent across all three cases?
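To compare the three cases side by side without reading the full index reprs, something like the sketch below could be used. It only reports the index class each group-by form produces; the helper name `index_kinds` and its labels are made up for illustration, and the result may differ between pandas versions.

```python
import datetime as dt

import numpy as np
import pandas as pd

# Same shape of data as in the code sample: column 'c' holds plain datetime.date objects.
df = pd.DataFrame({
    "a": np.random.rand(5),
    "b": [1] * 5,
    "c": [dt.date.today()] * 5,
})


def index_kinds(frame):
    """Return the index class name produced by each of the three group-by forms.

    Hypothetical helper, not part of pandas.
    """
    single_apply = frame.groupby("c")["a"].apply(lambda x: x.max()).index
    multi_apply = frame.groupby(["b", "c"])["a"].apply(lambda x: x.max()).index.get_level_values(1)
    multi_max = frame.groupby(["b", "c"])["a"].max().index.get_level_values(1)
    return {
        "apply, key 'c'": type(single_apply).__name__,
        "apply, keys ['b', 'c']": type(multi_apply).__name__,
        "max, keys ['b', 'c']": type(multi_max).__name__,
    }


kinds = index_kinds(df)
for label, kind in kinds.items():
    print(label, "->", kind)
```

If the three printed class names differ, the inconsistency described above is present in the installed version.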
Output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: None.None
pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 18.5
Cython: 0.23.4
numpy: 1.12.1
scipy: 0.13.0b1
statsmodels: 0.8.0
xarray: None
IPython: 5.4.1
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.1
openpyxl: 2.3.3
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.9.8
lxml: None
bs4: None
html5lib: 0.999999999
httplib2: 0.9.2
apiclient: 1.5.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.42.0
pandas_datareader: None
Comment From: jreback
datetime.date is not a first class citizen, don't use it; you can convert using pd.to_datetime().
In [7]: df = pd.DataFrame(list(zip(np.random.rand(5).tolist(), [1]*5, [dt.date.today()]*5)), columns=list('abc'))
...: df['c'] = pd.to_datetime(df['c'])
...: print(df)
...: print(df.groupby('c').a.apply(lambda x: x.max()).index)
...: print(df.groupby(['b', 'c']).a.apply(lambda x: x.max()).index.get_level_values(1))
...: print(df.groupby(['b', 'c']).a.max().index.get_level_values(1))
...:
a b c
0 0.858704 1 2017-09-23
1 0.965132 1 2017-09-23
2 0.803442 1 2017-09-23
3 0.818829 1 2017-09-23
4 0.424483 1 2017-09-23
DatetimeIndex(['2017-09-23'], dtype='datetime64[ns]', name='c', freq=None)
DatetimeIndex(['2017-09-23'], dtype='datetime64[ns]', name='c', freq=None)
DatetimeIndex(['2017-09-23'], dtype='datetime64[ns]', name='c', freq=None)
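The same check can be written as a small self-contained script rather than an IPython session. This is a minimal sketch, assuming a Python 3 environment with a reasonably recent pandas; the variable names `indexes` and `all_datetime` are illustrative only.

```python
import datetime as dt

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": np.random.rand(5),
    "b": [1] * 5,
    "c": [dt.date.today()] * 5,
})

# Convert the datetime.date column to a datetime64 column before grouping.
df["c"] = pd.to_datetime(df["c"])

# The three group-by forms from the original report.
indexes = [
    df.groupby("c")["a"].apply(lambda x: x.max()).index,
    df.groupby(["b", "c"])["a"].apply(lambda x: x.max()).index.get_level_values(1),
    df.groupby(["b", "c"])["a"].max().index.get_level_values(1),
]

# After the conversion, every form should yield a DatetimeIndex.
all_datetime = all(isinstance(idx, pd.DatetimeIndex) for idx in indexes)
print(all_datetime)
```

Checking the index class with `isinstance` rather than comparing the dtype string keeps the check independent of the datetime resolution a given pandas version picks.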