Code Sample, a copy-pastable example if possible
import pandas as pd
import pickle
from datetime import datetime, timedelta
import numpy as np

# function to group by day of year
def group_func(d):
    print("Inside grouping func")
    print("d={}".format(d))
    print("d.month = {}".format(d.month))
    print("d.year = {}".format(d.year))
    print(d.month, d.year)
    return datetime(2001, d.month, d.year)

dates = [datetime(2000, 1, 1) + timedelta(days=i) for i in range(10000)]
data = np.random.randn(len(dates))
df = pd.DataFrame(data=data, index=dates)
df = df.select(lambda d: not (d.month == 2 and d.day == 29))
groups = df.groupby(by=group_func)
Here I have a notebook that demonstrates the problem: https://github.com/guziy/PyNotebooks/blob/master/debug/weird_pandas_grouping.ipynb
Problem description
When I run the code above, I get an exception. The grouping function builds a datetime object and expects to receive individual index values, but it is called with the entire index instead. As a result, creating the datetime fails with:
TypeError: an integer is required (got type Int64Index)
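To make the two failure modes easier to tell apart, here is a minimal sketch that calls the grouping function directly, outside of groupby. The names `buggy`, `fixed`, `error_name` and the three-day index are illustrative assumptions, not part of the original report. It only shows what `datetime(...)` does with each kind of argument. Which of the two calls pandas makes internally when applying a grouping function may depend on the pandas version.

```python
from datetime import datetime

import pandas as pd

# A tiny index standing in for the issue's 10000-day index (illustrative only).
idx = pd.date_range("2000-01-01", periods=3, freq="D")


def buggy(d):
    # Same return statement as the issue's group_func: d.year is passed as the day.
    return datetime(2001, d.month, d.year)


def fixed(d):
    # Corrected version: the third argument is the day of the month.
    return datetime(2001, d.month, d.day)


def error_name(func, arg):
    """Return the class name of the exception raised by func(arg), or None."""
    try:
        func(arg)
    except Exception as exc:
        return type(exc).__name__
    return None


whole_index_error = error_name(buggy, idx)      # the whole index is passed in
single_value_error = error_name(buggy, idx[0])  # a single Timestamp is passed in
fixed_key = fixed(idx[0])                       # corrected function, single Timestamp

print(whole_index_error, single_value_error, fixed_key)
```

Passing the whole index gives a TypeError, because `datetime` needs plain integers. Passing a single timestamp to the buggy function gives a ValueError instead, because the year 2000 is not a valid day of the month.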
Output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.4.3.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_CA.UTF-8
LOCALE: en_CA.UTF-8
pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 28.5.0
Cython: 0.22
numpy: 1.13.1
scipy: 0.15.1
xarray: 0.8.2
IPython: 5.1.0
sphinx: 1.4.8
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.2
feather: None
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
pandas_gbq: None
pandas_datareader: None
Comment From: jreback
well, you have an error in your grouping function:
    return datetime(2001, d.month, d.year)
should be
    return datetime(2001, d.month, d.day)
in any event, you are better off doing something like this:
In [32]: df.groupby([df.index.month, df.index.day]).sum()
Out[32]:
                0
1  1     2.912116
   2    -1.814301
   3   -10.006528
   4    -9.808739
   5    -1.420640
   6     2.062087
...           ...
12 26    6.842057
   27    7.446606
   28    3.837242
   29    1.415139
   30    6.850449
   31   -7.354046

[365 rows x 1 columns]
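For anyone who wants to run the suggestion above end to end, here is a self-contained sketch. It makes a few assumptions that are not in the thread: the data is built with `pd.date_range` and a seeded generator so the run is repeatable, the Feb 29 rows are dropped with a boolean mask instead of `df.select` (which is not available in current pandas releases), and the level names `month` and `day` are made up for readability.

```python
import numpy as np
import pandas as pd

# Same shape of data as the issue: 10000 consecutive days starting 2000-01-01.
dates = pd.date_range("2000-01-01", periods=10000, freq="D")
rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
df = pd.DataFrame(rng.standard_normal(len(dates)), index=dates)

# Drop Feb 29 with a boolean mask (used here in place of df.select).
is_leap_day = (df.index.month == 2) & (df.index.day == 29)
df = df[~is_leap_day]

# Group on arrays derived from the index rather than a per-row Python function.
daily = df.groupby([df.index.month, df.index.day]).sum()
daily.index.names = ["month", "day"]  # hypothetical level names

print(daily.shape)
```

Because the keys are computed from the index in one step, no grouping function is called at all, so the question of what argument it receives does not come up.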
Comment From: guziy
Thanks @jreback: Cheers