Code Sample, a copy-pastable example if possible
import pandas as pd
import numpy as np
from datetime import datetime,timedelta
threshold = np.datetime64(datetime.today()+timedelta(weeks=3))
df[threshold < df['date']]
df[df['date'] < threshold]
Problem description
The two comparisons above should produce opposite results. Instead, both return the same result, as if df['date'] were always the first operand of the comparison.
Expected Output
The picture below illustrates the issue. It was expected that the line df[threshold < df['date']] would result in an empty DataFrame.
Output of pd.show_versions()
Comment From: cristianornelas
Same issue here.
Comment From: TomAugspurger
Can you make a complete example? Your df is undefined.
Comment From: gmatheus95
I cannot give you the actual dataframe I'm working with since it's private data, but here's a very naive executable code to illustrate the issue.
import pandas as pd
import numpy as np
from datetime import datetime,timedelta
today = datetime.today()
x = [today] * 10000
df = pd.DataFrame({'date':x})
threshold = np.datetime64(datetime.today()+timedelta(weeks=3))
#and then the comparisons:
df[threshold < df['date']]
df[df['date'] < threshold]
Thanks.
Comment From: TomAugspurger
Seems to be related to the numpy timestamp being microsecond precision:
In [111]: np.datetime64(today + timedelta(weeks=3)) < pd.Series([today])
Out[111]:
0 True
dtype: bool
In [112]: np.datetime64(today + timedelta(weeks=3)).astype("<M8[ns]") < pd.Series([today])
Out[112]:
0 False
dtype: bool
I'm not sure what the desired outcome is here. pandas only deals with nanosecond precision timestamps, so do we silently change the precision of the input, or raise an error?
Either way, we need to fix things to be consistent between Series and DataFrame here.
Comment From: gmatheus95
I'm sorry, I don't get it. Even after adding three weeks of delta, why do I still get True in Out[111] because of timestamp precision? I'm sorry if it's a naive question, I'm not really experienced in numpy.
Thank you anyway for addressing the issue so fast!
Comment From: TomAugspurger
> I'm sorry, I don't get it. Even after adding three weeks of delta, why do I still get True in Out[111] because of timestamp precision?
Sorry if I wasn't clear, it's definitely a bug. It should be False (or maybe an exception).
NumPy stores datetimes as int64s, where the datetime that a given integer represents depends on the resolution.
In [119]: np.datetime64(today).view('i8')
Out[119]: 1499262896667864
In [120]: np.datetime64(today).astype('<M8[ns]').view('i8')
Out[120]: 1499262896667864000
It's possible (haven't confirmed yet) that when you do threshold < df['date'], pandas looks at those integers without checking that they're at the same resolution. And since your threshold is in microseconds, its integer is going to be smaller than the pandas one (which is in nanoseconds). This is just a guess though.
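To make that guess concrete, here is a minimal sketch in plain NumPy. The fixed datetime and the variable names are made up for illustration, and the sketch only shows the effect of comparing raw integers; it does not confirm what pandas does internally.

```python
import numpy as np
from datetime import datetime, timedelta

# Fixed datetime so the numbers are reproducible (illustrative value only).
today = datetime(2017, 7, 5, 13, 54, 56, 667864)

# np.datetime64 built from a datetime object has microsecond resolution.
threshold = np.datetime64(today + timedelta(weeks=3))

# The same "today" at nanosecond resolution, the resolution pandas uses here.
today_ns = np.datetime64(today).astype("<M8[ns]")

# Raw int64 payloads: microseconds vs. nanoseconds since the epoch.
threshold_us_int = threshold.view("i8")
today_ns_int = today_ns.view("i8")

# Comparing the raw integers ignores the unit, so the later date looks "smaller".
raw_compare = threshold_us_int < today_ns_int

# Comparing after converting to a common unit gives the expected answer.
unit_aware_compare = threshold.astype("<M8[ns]") < today_ns

print(raw_compare, unit_aware_compare)
```

If the raw integers are what end up being compared, the microsecond value is three orders of magnitude smaller than the nanosecond one, which would explain why three weeks of delta make no difference.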
Comment From: TomAugspurger
Ah, indeed this seems to be a duplicate of https://github.com/pandas-dev/pandas/issues/7996.
For now, you can work around it by converting threshold to a pd.Timestamp, which will ensure that you have nanosecond-precision datetimes everywhere.
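A minimal sketch of that workaround, reusing the toy DataFrame from the example earlier in the thread (with a fixed date and fewer rows, both chosen only for illustration):

```python
import numpy as np
import pandas as pd
from datetime import datetime, timedelta

# Fixed datetime so the result is reproducible (illustrative value only).
today = datetime(2017, 7, 5, 13, 54, 56, 667864)
df = pd.DataFrame({"date": [today] * 5})

# The original threshold as a NumPy datetime, three weeks after "today".
np_threshold = np.datetime64(today + timedelta(weeks=3))

# Workaround: wrap it in pd.Timestamp before comparing against the column.
threshold = pd.Timestamp(np_threshold)

later = df[threshold < df["date"]]    # rows dated after the threshold
earlier = df[df["date"] < threshold]  # rows dated before the threshold

print(len(later), len(earlier))
```

With the threshold three weeks in the future, the first selection should come back empty and the second should contain every row.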
Comment From: gmatheus95
Oh now I get it, thanks again!!