I came across this very odd problem while trying to get the difference between two datetime64 columns in a DataFrame. The following simple example illustrates the problem:
import pandas as pd
df = pd.DataFrame({'A': pd.date_range('20150301', '20150304'),
                   'B': pd.date_range('20150303', '20150306')},
                  index=[1, 2, 3, 3])
df
           A          B
1 2015-03-01 2015-03-03
2 2015-03-02 2015-03-04
3 2015-03-03 2015-03-05
3 2015-03-04 2015-03-06
df.B - df.A
1 2 days
2 2 days
3 2 days
3 1 days
3 3 days
3 2 days
dtype: timedelta64[ns]
In general, for each duplicated index label the output contains one entry for every pair in the Cartesian product of the rows sharing that label. In the example above, the two rows labelled 3 produce four entries instead of two. I learned this the hard way on a DataFrame with 30K+ rows and only 50 distinct index values: the subtraction tried to build a Series with 385 million+ entries and quickly froze my computer.
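A possible workaround, as a minimal sketch: since both columns come from the same DataFrame, the rows already line up by position, so the underlying arrays can be subtracted directly and the index reattached afterwards. The names `has_duplicates` and `diff` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": pd.date_range("20150301", "20150304"),
     "B": pd.date_range("20150303", "20150306")},
    index=[1, 2, 3, 3],
)

# Duplicate labels are what make label-based alignment pair up extra rows.
has_duplicates = not df.index.is_unique

# Subtract the raw arrays position by position, then reattach the index.
# This assumes A and B are columns of the same frame, so row order matches.
diff = pd.Series(df["B"].values - df["A"].values, index=df.index)
print(diff)
```

This bypasses label alignment entirely, so the result has exactly one entry per row regardless of how many index labels repeat.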
On a related note, using the eval method to compute the difference throws the following error:
df.eval('B - A')
ValueError: unkown type timedelta64[ns]
Comment From: jorisvandenbossche
Which pandas version are you using (pd.__version__)? I get the correct result on 0.15.2 and with master.
Comment From: eoincondron
Ah yes, I forgot I was working with version 0.14.1. Should have tested on 0.15 first but forgot all about it. Thanks.
Comment From: jorisvandenbossche
@eoincondron Is there still a problem? (as you reopened the issue)
Comment From: eoincondron
No, that was a mistake, sorry.