Hi,

This behavior puzzled me, but I am unsure if it is my ignorance or really something wrong.

My pandas installation is as per the latest conda install.

tradf.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (2, 2016-09-09 00:00:00, 2016-09-14 00:00:00) to (26123, 2016-11-18 00:00:00, 2016-12-18 00:00:00)
Data columns (total 9 columns):
payment_method_id         100000 non-null int16
payment_plan_days         100000 non-null int16
plan_list_price           100000 non-null int16
actual_amount_paid        100000 non-null int16
is_auto_renew             100000 non-null bool
transaction_date          100000 non-null datetime64[ns]
membership_expire_date    100000 non-null datetime64[ns]
is_cancel                 100000 non-null bool
dis                       89012 non-null timedelta64[ns]
dtypes: bool(2), datetime64[ns](2), int16(4), timedelta64[ns](1)
memory usage: 4.3 MB

The index is a MultiIndex with levels idx (int), transaction_date_i (datetime64[ns]) and membership_expire_date_i (datetime64[ns]). The DataFrame is sorted by idx, transaction_date_i and membership_expire_date_i, so the index is sorted.

print(tradf.loc[(165,"2017-02-01","2017-03-01"),:].membership_expire_date==pd.Timestamp("2017-03-01"))
print(tradf.loc[(165,"2017-02-01","2017-03-01"),:].membership_expire_date=="2017-03-01")

As expected, no casting is needed here:

idx  transaction_date_i  membership_expire_date_i
165  2017-02-01          2017-03-01                  True
Name: membership_expire_date, dtype: bool
idx  transaction_date_i  membership_expire_date_i
165  2017-02-01          2017-03-01                  True
Name: membership_expire_date, dtype: bool

However, this fails:

print(tradf.groupby(level="idx").apply(
     lambda x :  x.iloc[-1].membership_expire_date==pd.Timestamp("2017-03-01")).loc[165])
print(tradf.groupby(level="idx").apply(
     lambda x :  x.iloc[-1].membership_expire_date=="2017-03-01").loc[165])

True
False

In the second groupby, the timestamp in string format is not interpreted. Casting it to a Timestamp, as in the first groupby, works.

Shouldn't the behavior be the same?

Cheers

JC

Comment From: jreback

You need a copy-pastable example.

Comment From: littlegreenbean33

I am not very good at this, but the example below reproduces the behavior I described:

import pandas as pd
import numpy as np

idx=np.arange(50)
d1=pd.Series(pd.date_range("2017-01-01",periods=50))
d2=pd.Series(pd.date_range("2017-01-03",periods=50))

data1=np.random.randn(50)
data2=np.random.randn(50)
df=pd.DataFrame({"data1":data1,"data2":data2,"idx":idx,"d1":d1,"d2":d2})
df.set_index(["idx","d1","d2"],inplace=True,drop=False)
df.head()

                               d1          d2         data1       data2     idx
idx     d1          d2                  
0   2017-01-01  2017-01-03  2017-01-01  2017-01-03  -0.044711   0.082410    0
1   2017-01-02  2017-01-04  2017-01-02  2017-01-04  -0.346293   -1.922916   1
2   2017-01-03  2017-01-05  2017-01-03  2017-01-05  1.481638    -2.369816   2
3   2017-01-04  2017-01-06  2017-01-04  2017-01-06  -1.980486   0.496076    3
4   2017-01-05  2017-01-07  2017-01-05  2017-01-07  -0.613333   0.442724    4

print(df.loc[(4,slice(None),"2017-01-07")].d2=="2017-01-07")
print(df.loc[(4,slice(None),"2017-01-07")].d2==pd.Timestamp("2017-01-07"))

    idx  d1          d2        
    4    2017-01-05  2017-01-07    True
    Name: d2, dtype: bool
    idx  d1          d2        
    4    2017-01-05  2017-01-07    True
    Name: d2, dtype: bool
#
print(df.groupby(level="idx").apply(
    lambda x:x.iloc[-1].d2=="2017-01-07").iloc[4])
# the above is not what I expected: why is the comparison unequal?


print(df.groupby(level="idx").apply(
    lambda x:x.iloc[-1].d2==pd.Timestamp("2017-01-07")).iloc[4])

    False
    True

Hope this helps.

Cheers.

JC

Comment From: jreback

Some tips:

  • Naming index levels the same as columns is plain confusing.
  • Opaque UDFs (IOW anything inside the .apply) are non-performant and just hard to manage, so use pandas builtins.

E.g. you probably want:

In [34]: df.groupby(level='idx').last().head()
Out[34]: 
            d1         d2     data1     data2  idx
idx                                               
0   2017-01-01 2017-01-03 -0.065398  0.554164    0
1   2017-01-02 2017-01-04  0.157238  0.223270    1
2   2017-01-03 2017-01-05  0.471431 -0.332721    2
3   2017-01-04 2017-01-06 -1.258124  0.618444    3
4   2017-01-05 2017-01-07 -0.298601  0.015549    4
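
For the comparison in the original question, a minimal sketch along the same lines, assuming data shaped like the reproducer above. The index level names `d1_i` and `d2_i` are hypothetical, chosen only so that levels and columns do not share names. Selecting the column before calling `.last()` keeps the datetime64 dtype, so the string is parsed as a date:

```python
import numpy as np
import pandas as pd

idx = np.arange(50)
d1 = pd.date_range("2017-01-01", periods=50)
d2 = pd.date_range("2017-01-03", periods=50)

df = pd.DataFrame({"d1": d1, "d2": d2, "data1": np.random.randn(50)})
# Hypothetical level names ("d1_i", "d2_i") that differ from the column names.
df.index = pd.MultiIndex.from_arrays([idx, d1, d2], names=["idx", "d1_i", "d2_i"])

# Last d2 per idx group, as a datetime64[ns] Series indexed by idx.
last_d2 = df.groupby(level="idx")["d2"].last()

# Comparing a datetime64 Series with a string parses the string as a date.
matches = last_d2 == "2017-01-07"
print(matches.loc[4])
```

Here `matches` is a boolean Series indexed by `idx`, so no `.apply` with a lambda is needed.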

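You are selecting a Series rather than a DataFrame, and that coerces all types to object when the column dtypes are mixed (the case here). The d2 element is then a scalar Timestamp, and comparing a scalar Timestamp with a string does not parse the string, so the result is False.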

In [25]: g = df.groupby(level="idx")

# get the first group
In [26]: g.get_group(0)
Out[26]: 
                                  d1         d2     data1     data2  idx
idx d1         d2                                                       
0   2017-01-01 2017-01-03 2017-01-01 2017-01-03 -0.065398  0.554164    0

In [27]: g.get_group(0).iloc[-1].d2=="2017-01-07"
Out[27]: False

# you are actually selecting a series, so all of the dtypes are pushed to object
In [28]: g.get_group(0).iloc[-1]
Out[28]: 
d1       2017-01-01 00:00:00
d2       2017-01-03 00:00:00
data1             -0.0653979
data2               0.554164
idx                        0
Name: (0, 2017-01-01 00:00:00, 2017-01-03 00:00:00), dtype: object


# you can select as a dataframe like this
# it's still False, though, because 2017-01-03 != 2017-01-07
In [30]: g.get_group(0).iloc[[-1]].d2=='2017-01-07'
Out[30]: 
idx  d1          d2        
0    2017-01-01  2017-01-03    False
Name: d2, dtype: bool
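
The difference between the two selections can be isolated in a small sketch. The two-row frame below is made up for illustration and is not the data from the issue; it just has one datetime column and one float column so that a single row has mixed types:

```python
import pandas as pd

# Illustrative frame: one datetime64 column and one float column.
df = pd.DataFrame({
    "d2": pd.to_datetime(["2017-01-03", "2017-01-07"]),
    "data1": [0.5, 1.5],
})

# .iloc[-1] returns one row as a Series; mixed column dtypes force object dtype.
row = df.iloc[-1]
scalar = row.d2  # a scalar pd.Timestamp

# A scalar Timestamp compared with a string is not parsed, so this is False.
scalar_vs_str = scalar == "2017-01-07"
# Comparing with an explicit Timestamp works as expected.
scalar_vs_ts = scalar == pd.Timestamp("2017-01-07")

# .iloc[[-1]] keeps a DataFrame, so d2 stays datetime64 and the string is parsed.
frame_vs_str = df.iloc[[-1]].d2 == "2017-01-07"

print(row.dtype, scalar_vs_str, scalar_vs_ts, frame_vs_str.iloc[0])
```

So inside an `.apply`, either compare against `pd.Timestamp(...)` explicitly or keep the selection as a DataFrame or datetime64 Series.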

Comment From: littlegreenbean33

Thanks ! All clear