I deal with DateTime columns extensively. Using strings to slice and select items is extremely useful, but it requires a DatetimeIndex.
I think it would be great to include the same functionality for any column.
Consider this DF:
>>> a = pd.DataFrame([pd.to_datetime('2016-4-3 15:32:04'), pd.to_datetime('2016-4-9 15:32:04')], columns=['a_col'])
>>> print(a)
                a_col
0 2016-04-03 15:32:04
1 2016-04-09 15:32:04
This is what I usually do, but it's extremely cluttered, and fails when the index is already something useful.
>>> print(a.set_index('a_col')['2016-4-3'].reset_index())
                a_col
0 2016-04-03 15:32:04
I imagine this is the preferred way of doing the same thing (please correct me if I'm wrong). It requires a lot of typing, and it's not elegant at all (especially for choosing a single day, as in this example).
>>> print(a[(a.a_col >= pd.to_datetime('2016-4-3')) & (a.a_col < pd.to_datetime('2016-4-4'))])
                a_col
0 2016-04-03 15:32:04
I think something like the following ds indexer would be great:
>>> print(a.ds['a_col', '2016-4-3'])
The second element could be any argument accepted when indexing a DatetimeIndex.
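For concreteness, one could prototype such an indexer today as a custom accessor. This is only a hedged sketch: register_dataframe_accessor only exists in later pandas versions, and the ds name and (column, string) signature come from this proposal, not from any existing pandas API.

```python
import pandas as pd

# Hypothetical sketch of the proposed "ds" indexer as a custom accessor.
# The name "ds" and the (column, string) signature are assumptions from
# this proposal; only register_dataframe_accessor, Period and between
# are real pandas APIs.
@pd.api.extensions.register_dataframe_accessor("ds")
class DatetimeSelect:
    def __init__(self, df):
        self._df = df

    def __getitem__(self, key):
        col, spec = key
        # Interpret the string like partial string indexing does:
        # '2016-4' covers the whole month, '2016' the whole year, etc.
        period = pd.Period(spec)
        mask = self._df[col].between(period.start_time, period.end_time)
        return self._df[mask]

a = pd.DataFrame({'a_col': pd.to_datetime(['2016-4-3 15:32:04',
                                           '2016-4-9 15:32:04'])})
print(a.ds['a_col', '2016-4-3'])  # keeps only the April 3rd row
```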
What do you think? Am I missing something?
Comment From: jreback
this is just partial string indexing: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#datetimeindex-partial-string-indexing
you can do this on a particular MultiIndex level in 0.18.0 (some bugs are fixed in forthcoming 0.18.2)
so not sure what you are actually motivating here - your use case is well covered
pls add a more useful example
Comment From: bbirand
My point was to be able to do this type of slicing and indexing and any datetime columns (not just on DatetimeIndex).
My understanding from the documentation you sent is that this works only with DatetimeIndexes.
Comment From: jreback
so set the index
not sure you gain much by doing this with yet another indexer
Comment From: bbirand
First, setting an index, slicing, and then resetting the index is not very elegant. Correct me if I'm wrong, but I think this is the only instance for which slicing is considerably easier on an index than on a column.
The other reason is that if the index is already set to another column, this operation requires saving the previous index, setting the new one, slicing, and then restoring the original index, which is even more laborious.
Given that time slices are extremely common, I still maintain that this would be a worthy addition.
I feel like you are not convinced (perhaps because you don't work with datetime columns often?). Would adding such a new operator be too much work?
Comment From: bbirand
Perhaps another motivating example would be useful. Consider this Dataframe:
df = pd.DataFrame([[pd.to_datetime('2016-4-3 15:32:04'), pd.to_datetime('2016-9-9 15:32:04')],
                   [pd.to_datetime('2016-5-3 12:19:04'), pd.to_datetime('2016-12-9 19:12:04')]],
                  columns=['start_date', 'end_date'],
                  index=["a", "b"])
df
           start_date            end_date
a 2016-04-03 15:32:04 2016-09-09 15:32:04
b 2016-05-03 12:19:04 2016-12-09 19:12:04
...
Let's say I want to see the events that started in April '16, and that ended any time before December '16.
With the "hack" that you mention, this would look like this:
df.reset_index().set_index('start_date')['2016-4'].reset_index().set_index('end_date')[:'2016-12'].reset_index().set_index('index')
A lot of unnecessary index operations, but otherwise the date portions are very easy to see.
The preferred, "non-hack" version using datetimes would look like this (again, please do tell me if there is a better way):
df[(df['start_date'] >= pd.to_datetime('2016-4')) & (df['start_date'] < pd.to_datetime('2016-6')) & \
   (df['end_date'] <= pd.to_datetime('2016-12'))]
This is probably the worst. Since pd.to_datetime returns a Timestamp, month or year equality does not work. In order to get the results for a single month, I have to use inequalities and think about adding the next month. If I decide to look at a different month later, I have to re-parse the whole line, and then remember to also change the next-month bound.
My proposed method would involve using a new date indexer. Let's call it ds:
df.ds['start_date','2016-4'].ds['end_date',:'2016-12']
Comment From: jorisvandenbossche
@bbirand First, this issue is not specific to datetime columns. In general, you can easily slice on an index (e.g. with an integer index, df[3:10]), but not on a column (the slice example when the int index is a column would be df[(df['int_col'] >= 3) & (df['int_col'] <= 10)]).
Second, in those comparisons, you can also use strings (which makes the code a little bit less long). So your above line would look like:
df[(df['start_date'] >= '2016-4') & (df['start_date'] < '2016-6') & (df['end_date'] <= '2016-12')]
I feel like you are not convinced (perhaps because you don't work with datetime columns often?). Would adding such a new operator be too much work?
It is not necessarily a matter of 'too much work' (I don't think it would be much work). It is more a question of "is this functionality worth the addition of a new method to the API, and so worth the added complexity". Pandas dataframes already have very many (too many, depending on your opinion) methods, so we are very carefully with adding new ones. In any case, if we add something, I think it should be more generic.
Comment From: jorisvandenbossche
Something that also makes filtering rows on column values a bit easier is the query method. For example, you can do:
In [24]: df.query("'2016-04-01' < start_date < '2016-04-20'")
Out[24]:
           start_date            end_date
a 2016-04-03 15:32:04 2016-09-09 15:32:04
(in this case chained comparisons work and make it a bit shorter)
Comment From: jorisvandenbossche
And actually, what you ask for maybe already exists in the form of between?
In [30]: df[df['start_date'].between('2016-04', '2016-06') & df['end_date'].between('2016-01', '2016-12')]
Out[30]:
           start_date            end_date
a 2016-04-03 15:32:04 2016-09-09 15:32:04
This does not do the implicit datetime string slicing (e.g. '2013', which is interpreted as '2013-01-01':'2013-12-31'), but it is quite close to the ds indexer you proposed, I think.
In any case, my last posts also show that there are already many ways :-) That does of course not mean it is never worth adding yet another way if it is a clear improvement.
Comment From: bbirand
@jorisvandenbossche Thanks a lot for the replies! Your points are very well taken.
First off, I didn't know that one could use strings for Datetime comparisons:
df['start_date'] >= '2016-4'
I looked through the documentation again, but still couldn't find any mention of it (though I'm happy to accept the responsibility for my oversight). This is already a big step up from before.
I looked for the between() method, but couldn't find that either. I know there is also a truncate command, which I tried to use, but it never felt natural. Both of these seem to be limited in that they only work with inequalities.
What's really useful when doing time-series analysis is selecting a particular month or year. In essence, the ds selector I propose could replace both of these functions.
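As a concrete sketch of the month-selection case: one way that already works on a plain datetime64 column (using only existing pandas APIs, to_period and Period comparison) is to compare at month resolution, which gives a true month "equality" without inequalities:

```python
import pandas as pd

df = pd.DataFrame({'start_date': pd.to_datetime(['2016-04-03 15:32:04',
                                                 '2016-05-03 12:19:04']),
                   'end_date': pd.to_datetime(['2016-09-09 15:32:04',
                                               '2016-12-09 19:12:04'])},
                  index=['a', 'b'])

# Convert the column to monthly periods, then test equality against a
# single month; no "next month" boundary needs to be computed by hand.
april = df[df['start_date'].dt.to_period('M') == pd.Period('2016-04', freq='M')]
print(april)  # only row 'a'
```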
Out of the examples that you give, I think the one using query is closest to what I'd have in mind (besides adding a new operator/method). So maybe adding the = functionality found in partial strings to query would be an improvement. For instance, doing this to pick out the elements that are within that month:
df.query("start_date = '2016-04'")
Then my example would be
df.query("start_date = '2016-4'").query("end_date < '2016-12'")
In short:
1. Given the existence of between and truncate, it might make sense API-wise to create a single way of accessing datetimes (both inequality and equality).
2. Adding this functionality to query would also be useful.
Comment From: jreback
@bbirand so partial string indexing, which is the link I pointed to above, has some caveats:
- It is not 'meant' for production, meaning it is a convenience for interactively working with data.
- It works because the index itself is monotonic (meaning it's sorted), and so can be very efficient.
If it's NOT sorted, and generally a column is not, then boolean indexing is used and so it's the same as selecting (as you have above); this is true for .query as well.
This is why indexes are special and can have a special syntax. This is not applicable more generally. Further, I would be really, really hesitant to introduce more non-production oriented convenience features such as this.
In fact we are going to be moving away from unexpected things like df['2016'], though we'll still allow df.loc['2016'], as the first actually slices the ROWS even though it's typically a COLUMN indexer.
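To make the ROW/COLUMN ambiguity concrete, here is a small illustration on a monotonic DatetimeIndex. Only the explicit .loc spelling is shown, since the bare df['2016'] row-slicing form is version-dependent and was later deprecated:

```python
import pandas as pd

# A tiny frame indexed by month starts in 2016.
s = pd.DataFrame({'x': [1, 2, 3]},
                 index=pd.date_range('2016-01-01', periods=3, freq='MS'))

# .loc makes it unambiguous that the partial string selects ROWS:
rows_2016 = s.loc['2016']    # all three rows of 2016
feb = s.loc['2016-02']       # just the February row
print(feb)
```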
Comment From: bbirand
Hi @jreback ,
Thanks for the explanation. I see that you've closed the issue, so I won't drag this on any longer. But if I can humbly comment on one point, it would be this:
It is not 'meant' for production, meaning it is a convenience for interactively working with data.
I'd be very careful when defining what really constitutes "production". The vast, vast majority of people I know that use Pandas use it first and foremost as an interactive environment. So in that context, "production" features would be exactly the things that make it easy to deal with basic data analysis tasks, say via quick indexing.
It seems that your definition of production is vastly different. I realize that things like the safety of operations become important when code is meant to run non-interactively.
In order to deal with this vast discrepancy in usage, my 2c would be to have some kind of global flag. When I launch Jupyter Notebook and I'm playing around with data, I can use all sorts of shortcuts that made Pandas so attractive for data analysis work. When I have a good first version, I can change the flag, which may output warnings for "unsafe" operations, like indexing shortcuts, and ask the user to change these to more specific and safe versions.
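The workflow being proposed could look something like the following. This is purely a hypothetical sketch: no such flag, helper, or warning exists in pandas, and all the names here are made up for illustration.

```python
import warnings
import pandas as pd

# Hypothetical global switch: off while exploring in a notebook,
# flipped on once the code is meant to run unattended.
STRICT_MODE = False

def select_month(df, col, month):
    """Interactive shortcut: keep rows whose `col` falls in `month`.

    In "strict" (production) mode this warns, nudging the user toward
    explicit datetime comparisons instead of the convenience form.
    """
    if STRICT_MODE:
        warnings.warn("select_month is an interactive convenience; "
                      "use explicit datetime comparisons in production")
    p = pd.Period(month, freq='M')
    return df[df[col].between(p.start_time, p.end_time)]

df = pd.DataFrame({'d': pd.to_datetime(['2016-04-03', '2016-05-03'])})
print(select_month(df, 'd', '2016-04'))  # one row
```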
I realize this issue is not the best place to discuss this, but I figured I'd note this here and get your thoughts before creating any more issues. What do you think?
Comment From: jreback
@bbirand this certainly could be enabled via an option (and actually prob should be). That's exactly what I mean by 'production': a non-interactive run may or may not have the same semantics / usage as an interactive environment. This could be configurable.
If you are willing to submit a pull-request for this, great.
Comment From: bbirand
@jreback I'll definitely take a stab at it! But should I then start a new issue for this "production vs interactive configuration" topic?
Comment From: jreback
sure