Hi,
This behavior puzzled me, but I am unsure if it is my ignorance or really something wrong.
My pandas installation comes from the latest conda install.
tradf.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (2, 2016-09-09 00:00:00, 2016-09-14 00:00:00) to (26123, 2016-11-18 00:00:00, 2016-12-18 00:00:00)
Data columns (total 9 columns):
payment_method_id 100000 non-null int16
payment_plan_days 100000 non-null int16
plan_list_price 100000 non-null int16
actual_amount_paid 100000 non-null int16
is_auto_renew 100000 non-null bool
transaction_date 100000 non-null datetime64[ns]
membership_expire_date 100000 non-null datetime64[ns]
is_cancel 100000 non-null bool
dis 89012 non-null timedelta64[ns]
dtypes: bool(2), datetime64[ns](2), int16(4), timedelta64[ns](1)
memory usage: 4.3 MB
The index is a MultiIndex with levels idx (int), transaction_date_i (datetime64[ns]) and membership_expire_date_i (datetime64[ns]). The dataframe is sorted by idx, transaction_date_i and membership_expire_date_i, so the index is sorted.
print(tradf.loc[(165,"2017-02-01","2017-03-01"),:].membership_expire_date==pd.Timestamp("2017-03-01"))
print(tradf.loc[(165,"2017-02-01","2017-03-01"),:].membership_expire_date=="2017-03-01")
As expected, no casting is needed here:
idx transaction_date_i membership_expire_date_i
165 2017-02-01 2017-03-01 True
Name: membership_expire_date, dtype: bool
idx transaction_date_i membership_expire_date_i
165 2017-02-01 2017-03-01 True
Name: membership_expire_date, dtype: bool
However, this fails:
print(tradf.groupby(level="idx").apply(
lambda x : x.iloc[-1].membership_expire_date==pd.Timestamp("2017-03-01")).loc[165])
print(tradf.groupby(level="idx").apply(
lambda x : x.iloc[-1].membership_expire_date=="2017-03-01").loc[165])
True
False
In the second groupby call, the timestamp in string format is not interpreted. Casting it to Timestamp, as in the first call, works.
Shouldn't the behavior be the same?
Cheers
JC
Comment From: jreback
You need a copy-pastable example.
Comment From: littlegreenbean33
I am not very good at this, but the code below reproduces the behavior I described.
import pandas as pd
import numpy as np
idx=np.arange(50)
d1=pd.Series(pd.date_range("2017-01-01",periods=50))
d2=pd.Series(pd.date_range("2017-01-03",periods=50))
data1=np.random.randn(50)
data2=np.random.randn(50)
df=pd.DataFrame({"data1":data1,"data2":data2,"idx":idx,"d1":d1,"d2":d2})
df.set_index(["idx","d1","d2"],inplace=True,drop=False)
df.head()
d1 d2 data1 data2 idx
idx d1 d2
0 2017-01-01 2017-01-03 2017-01-01 2017-01-03 -0.044711 0.082410 0
1 2017-01-02 2017-01-04 2017-01-02 2017-01-04 -0.346293 -1.922916 1
2 2017-01-03 2017-01-05 2017-01-03 2017-01-05 1.481638 -2.369816 2
3 2017-01-04 2017-01-06 2017-01-04 2017-01-06 -1.980486 0.496076 3
4 2017-01-05 2017-01-07 2017-01-05 2017-01-07 -0.613333 0.442724 4
print(df.loc[(4,slice(None),"2017-01-07")].d2=="2017-01-07")
print(df.loc[(4,slice(None),"2017-01-07")].d2==pd.Timestamp("2017-01-07"))
idx d1 d2
4 2017-01-05 2017-01-07 True
Name: d2, dtype: bool
idx d1 d2
4 2017-01-05 2017-01-07 True
Name: d2, dtype: bool
#
print(df.groupby(level="idx").apply(
lambda x:x.iloc[-1].d2=="2017-01-07").iloc[4])
# the above is not what I expected: why is the comparison unequal?
print(df.groupby(level="idx").apply(
lambda x:x.iloc[-1].d2==pd.Timestamp("2017-01-07")).iloc[4])
False
True
Hope this helps.
Cheers.
JC
Comment From: jreback
Some tips:
- Using index level names that are the same as column names is plain confusing.
- Using opaque UDFs (IOW, whatever is inside the .apply) is non-performant and just hard to manage, so use pandas builtins.
E.g. you probably want
In [34]: df.groupby(level='idx').last().head()
Out[34]:
d1 d2 data1 data2 idx
idx
0 2017-01-01 2017-01-03 -0.065398 0.554164 0
1 2017-01-02 2017-01-04 0.157238 0.223270 1
2 2017-01-03 2017-01-05 0.471431 -0.332721 2
3 2017-01-04 2017-01-06 -1.258124 0.618444 3
4 2017-01-05 2017-01-07 -0.298601 0.015549 4
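A minimal sketch of how the builtin approach can replace the .apply for this kind of check. The frame below is made-up sample data (the names key, d2 and data1 and all values are assumptions for illustration, not the data from this issue). Because .last() returns a DataFrame, d2 stays a datetime64 column, and comparing it with a date string parses the string:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: three groups with two rows each.
df = pd.DataFrame(
    {
        "key": np.repeat(np.arange(3), 2),
        "d2": pd.date_range("2017-01-03", periods=6),
        "data1": np.arange(6, dtype=float),
    }
).set_index("key")

# Last row of every group, returned as a DataFrame (column dtypes are kept).
last_rows = df.groupby(level="key").last()

# Vectorized comparison of a datetime64 column against a date string.
matches = last_rows["d2"] == "2017-01-06"
print(matches)
```

Only the group whose last d2 value is 2017-01-06 compares as True, without casting the string to a Timestamp.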
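You are selecting a Series rather than a DataFrame, and that coerces all values to object dtype
when the column types are mixed (the case here). The element you then compare is a scalar Timestamp, not a datetime64 column, so the string is not parsed.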
In [25]: g = df.groupby(level="idx")
# get the first group
In [26]: g.get_group(0)
Out[26]:
d1 d2 data1 data2 idx
idx d1 d2
0 2017-01-01 2017-01-03 2017-01-01 2017-01-03 -0.065398 0.554164 0
In [27]: g.get_group(0).iloc[-1].d2=="2017-01-07"
Out[27]: False
# you are actually selecting a series, so all of the dtypes are pushed to object
In [28]: g.get_group(0).iloc[-1]
Out[28]:
d1 2017-01-01 00:00:00
d2 2017-01-03 00:00:00
data1 -0.0653979
data2 0.554164
idx 0
Name: (0, 2017-01-01 00:00:00, 2017-01-03 00:00:00), dtype: object
# you can select as a dataframe like this
# it's still False, though, because 2017-01-03 != 2017-01-07
In [30]: g.get_group(0).iloc[[-1]].d2=='2017-01-07'
Out[30]:
idx d1 d2
0 2017-01-01 2017-01-03 False
Name: d2, dtype: bool
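To make the difference concrete, here is a small sketch contrasting the two selections. The frame is made-up sample data (column names and values are assumptions for illustration). Selecting one row with .iloc[-1] gives an object-dtype Series whose d2 entry is a scalar Timestamp, and == between a Timestamp and a plain string is False. Selecting with .iloc[[-1]] keeps a DataFrame, so d2 is still a datetime64 column and the string is parsed:

```python
import pandas as pd

# Hypothetical sample data with mixed column dtypes (datetime64 and float64).
df = pd.DataFrame(
    {
        "idx": [0, 0, 1, 1],
        "d2": pd.to_datetime(
            ["2017-01-03", "2017-01-07", "2017-01-05", "2017-01-09"]
        ),
        "data1": [0.1, 0.2, 0.3, 0.4],
    }
).set_index("idx")

row = df.iloc[-1]      # one row as a Series: mixed dtypes collapse to object
frame = df.iloc[[-1]]  # one row as a DataFrame: column dtypes are preserved

row_dtype = row.dtype

# Scalar Timestamp against a string: the string is not parsed.
scalar_vs_str = row["d2"] == "2017-01-09"
# Scalar Timestamp against a Timestamp: compares as expected.
scalar_vs_ts = row["d2"] == pd.Timestamp("2017-01-09")
# datetime64 column against a string: the string is parsed.
column_vs_str = (frame["d2"] == "2017-01-09").iloc[0]

print(row_dtype, scalar_vs_str, scalar_vs_ts, column_vs_str)
```

Inside a groupby .apply, the same rule holds: returning x.iloc[[-1]] instead of x.iloc[-1], or casting the string with pd.Timestamp, avoids the silent False.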
Comment From: littlegreenbean33
Thanks! All clear.