related #6066 (on Float64Index too)

The 'KeyError' on slicing has been discussed multiple times, but I'm still not sure whether the issue below is a bug or just a misunderstanding on my part.

I'm experiencing a 'KeyError' from time to time when I try to slice my dataframes with datetime or Timestamp objects; however, slicing with strings works perfectly. I was unable to construct a synthetic example with pandas.date_range, so I uploaded a piece of real data where the issue appears: https://www.dropbox.com/s/ibzbwqs35tiydyc/tmp.h5

import pandas as pd

df = pd.read_hdf('tmp.h5', 'data')
print df.head()

When I try to slice it with pandas.Timestamp objects, it results in a 'KeyError':

tt1 = pd.Timestamp('2006-11-16 18:30:00')
tt2 = pd.Timestamp('2006-11-16 18:40:00')
print df.ix[tt1:tt2].head()

Same for datetime objects:

tt1 = pd.Timestamp('2006-11-16 18:30:00').to_datetime()
tt2 = pd.Timestamp('2006-11-16 18:40:00').to_datetime()
print df.ix[tt1:tt2].head()

However, this slicing works perfectly:

tt1 = pd.Timestamp('2006-11-16 18:30:00').strftime('%Y-%m-%d %H:%M:%S')
tt2 = pd.Timestamp('2006-11-16 18:40:00').strftime('%Y-%m-%d %H:%M:%S')
print df[tt1:tt2].head()

NumPy version 1.8.0, pandas version 0.13.0.

Comment From: jreback

Your index is not monotonic (i.e. sorted). This is more a case of an incorrect error report.

If you sort it, it works (with exact indexes or not):

In [14]: df.sort_index().ix[Timestamp('2006-11-16 18:30:04.101497'):Timestamp('2006-11-16 18:30:06.075980')]
Out[14]: 
                                  val
2006-11-16 18:30:04.101497+00:00    0
2006-11-16 18:30:04.101497+00:00    0
2006-11-16 18:30:04.101497+00:00    0
2006-11-16 18:30:06.075980+00:00    0
2006-11-16 18:30:06.075980+00:00    0

[5 rows x 1 columns]

In [15]: df.sort_index().ix[Timestamp('2006-11-16 18:30:04.101497'):Timestamp('2006-11-16 18:30:06')]
Out[15]: 
                                  val
2006-11-16 18:30:04.101497+00:00    0
2006-11-16 18:30:04.101497+00:00    0
2006-11-16 18:30:04.101497+00:00    0

[3 rows x 1 columns]

In [16]: df.sort_index().ix[Timestamp('2006-11-16 18:30:04'):Timestamp('2006-11-16 18:30:06')]
Out[16]: 
                                  val
2006-11-16 18:30:04.101497+00:00    0
2006-11-16 18:30:04.101497+00:00    0
2006-11-16 18:30:04.101497+00:00    0

[3 rows x 1 columns]

This should be a ValueError, I think; so it's a 'bug' in the error report:

In [17]: df.ix[Timestamp('2006-11-16 18:30:04.101497'):Timestamp('2006-11-16 18:30:06.075980')]
KeyError: 'cannot peform a slice operation on a non-unique non-monotonic index'

@jtratner agree?

Comment From: vfilimonov

@jreback thanks for pointing out the problem with my dataset! I was not aware of it.

Comment From: jreback

closing as not a bug

Comment From: KristianHolsheimer

Hi there, I know this issue is closed since it's not a bug. I would argue, though, that the error message could point the user in the right direction (I googled the error message and it led me here).

Comment From: leonsas

I second Kristian, and I'll provide a bit more context which may be helpful:

trimmed_series = series[event.startTime:event.endTime]

This throws KeyError: 1472641846000000000. Initially I thought that maybe event.endTime happens to be after the last index of series and that pandas isn't smart about dealing with out-of-bounds errors. Obviously pandas does deal with that; the real problem was the index not being sorted (which was a bug elsewhere in my code), which I found out here. But it definitely isn't intuitive.
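
For anyone landing here with the same error, a minimal sketch of how this situation can be reproduced and diagnosed (the timestamps here are made up, so the exact KeyError value will differ):

import pandas as pd

idx = pd.to_datetime(['2016-08-31 12:00', '2016-08-31 11:00', '2016-08-31 13:00'])
series = pd.Series([1, 2, 3], index=idx)

series.index.is_monotonic_increasing                      # False: the index is not sorted
# series[pd.Timestamp('2016-08-31 11:30'):]               # raises KeyError with the nanosecond repr of the label
series.sort_index()[pd.Timestamp('2016-08-31 11:30'):]    # works once the index is sorted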

Comment From: jreback

@leonsas @KristianHolsheimer

if you have a nice reproducible example, please open a new issue (and xref this one). This is well-defined behavior; I closed this because it's not a bug, though it could/should be a ValueError with an even more informative error message.

Comment From: kdebrab

As requested by @jreback, here is a reproducible example:

import pandas as pd
index = pd.date_range('2016-10-29 23:00', '2016-10-30 3:00', freq='15T', tz='UTC')
index = index.tz_convert('Europe/Brussels').tz_localize(None)
ts = pd.Series(1, index=index)
ts.truncate(before='2016-10-30 2:10')

which raises:

KeyError: 1477793400000000000L

In the above example, the index is not sorted because the naive local time spans a switch from summer to winter time.
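
For reference, the jump backwards (and the resulting duplicate labels) can be checked directly on the index, along these lines:

ts.index.is_monotonic_increasing      # False: the naive labels jump back an hour at the DST switch
ts.index[ts.index.duplicated()]       # the repeated 02:00-02:45 labels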

Slicing the data as:

ts[pd.Timestamp('2016-10-30 2:10'):]

returns the same KeyError.

Truncating with a time that is contained inside the index works just fine (unless you choose a non-unique label):

ts.truncate(before='2016-10-30 1:30')

Interestingly:

ts['2016-10-30 2:10':]

raises no error and returns the correct (non-sorted!) result, even when choosing a duplicate label!

All different operations work on the sorted series, e.g.:

ts.sort_index().truncate(before='2016-10-30 2:10')

Though now the result is sorted as well, of course.

I'm using pandas version 0.19.1

Comment From: jreback

@kdebrab the only thing that would be nicer is the actual KeyError message for a not-found label (it should show it as a Timestamp). So I would take a fix for that.
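
For reference (just to illustrate what the raw key means, not the proposed fix): the integer in the KeyError is the nanoseconds-since-epoch representation of the label, and converting it back gives the Timestamp that the message would ideally show:

import pandas as pd

pd.Timestamp(1477793400000000000)   # Timestamp('2016-10-30 02:10:00'), i.e. the 'before' label from the example above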

you can open a new issue, or push a PR if that works for you.

Comment From: jorisvandenbossche

truncate just does a slice under the hood, so @kdebrab's example becomes:

In [12]: ts[pd.Timestamp('2016-1-1'):]
...
KeyError: 1451606400000000000

So this raises an error because you are slicing with a non-present value on a non-sorted index. To give a simpler example of this:

In [33]: s = pd.Series(range(5), index=[1,2,3,2,4])

# works because '1' is present (and is unique), so unsortedness does not matter
In [34]: s.loc[1:]
Out[34]: 
1    0
2    1
3    2
2    3
4    4
dtype: int64

# fails because 0 is not a label; because of the unsortedness, pandas cannot know how to handle this
In [35]: s.loc[0:]
...
KeyError: 0

I think it would really be nice to have a better error message for this, as the difference between sorted and not-sorted can be very subtle, and with a sorted index, slicing with a non-present label is perfectly fine.
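
For contrast, a minimal check (continuing the toy example above) that on the sorted index the same non-present label slices without error:

s.sort_index().loc[0:]   # no KeyError: returns all five rows, since 0 falls before the first sorted label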

@jreback The only thing I don't directly find clear is why the same example works with strings:

In [13]: ts['2016-1-1':]
Out[13]: 
2016-10-30 01:00:00    1
2016-10-30 01:15:00    1
...
2016-10-30 03:45:00    1
2016-10-30 04:00:00    1
Freq: 15T, dtype: int64

Shouldn't this be equivalent to ts[pd.Timestamp('2016-1-1'):]? Or at least in the case of ts['2016-1-1 00:00:00':] (which also works), where it cannot be seen as a partial string?

Comment From: jreback

this works with strings because a string turns into partial timestamp indexing and thus becomes a slice, which works.

so maybe we ought to always make truncate a slice; then this will just work (internally it's a scalar being passed and NOT a slice), so this is pretty easy to 'fix'.

and I agree that this should work.

So let's create 2 new issues for this:

1) a better, datetime-aware error message (e.g. it should show the converted key, not the internal repr)
2) fix .truncate to always be slice-like (see the sketch below)
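
Roughly, for (2) this would mean (a sketch only, using the ts series from the example above) that the three spellings below end up on the same slicing path:

ts['2016-1-1':]                   # works today: partial string indexing -> slice
ts[pd.Timestamp('2016-1-1'):]     # raises KeyError today: the bound is looked up as an exact label
ts.truncate(before='2016-1-1')    # today a scalar Timestamp is passed internally (not a slice), so it hits the same error; after (2) it would behave like the string form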

Comment From: jorisvandenbossche

@jreback I thought partial strings only turned into a slice when the string has a lower resolution than the series. In the case of ts['2016-1-1 00:00:00':] this is certainly not the case, so shouldn't it be treated as a slice with an exact Timestamp?

Comment From: jreback

In [7]: ts.index.resolution
Out[7]: 'minute'

so I think it's always treated as a slice (in this example)

Comment From: DanielHabenicht

So this was never settled? I think I will create a new bug for it, as it just cost me almost an hour.