At the moment, I don't see any way to have performant, timezone-aware columns. This seems rather surprising, since the timestamps in a column will usually all share the same timezone. Right now, as far as I can see, you have two options:
1) Columns that are basically just numpy datetime64[ns] types. These seem to be timezone unaware. If you make such a column timezone aware (e.g., with dataframe.time_column.dt.tz_localize('UTC')), it becomes a column of dtype object.
2) A DatetimeIndex, which keeps track of timezone information at seemingly the column level (which is laudable). However, this only seems to really work when used as an index. If I assign it to a column, it again gets converted to dtype object, and things get slow.
Am I missing something? Is there a really good reason why a DatetimeIndex can't just be used, as is, in a column, without the dtype=object conversion?
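For anyone who wants to reproduce the two options, here is a minimal sketch. The sample timestamps and the `from_index` column name are made up for illustration; `time_column` follows the naming used above. The dtypes printed depend on the installed pandas version; the behaviour described in this thread is `object` for the last two.

```python
import pandas as pd

# Hypothetical sample data; only the column name comes from the post above.
df = pd.DataFrame(
    {"time_column": pd.to_datetime(["2015-01-01 00:00", "2015-01-01 12:00"])}
)
naive_dtype = df["time_column"].dtype

# Option 1: localize the naive column through the .dt accessor.
localized = df["time_column"].dt.tz_localize("UTC")
localized_dtype = localized.dtype

# Option 2: build a timezone-aware DatetimeIndex, then assign it to a column.
aware_index = pd.DatetimeIndex(df["time_column"]).tz_localize("UTC")
df["from_index"] = aware_index
column_dtype = df["from_index"].dtype

print("naive column:       ", naive_dtype)
print("localized column:   ", localized_dtype)
print("aware index:        ", aware_index.dtype, "tz =", aware_index.tz)
print("index put in column:", column_dtype)
```

Comparing the last two lines of output shows whether the timezone information survives the move from an index to a column on a given installation.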
Comment From: jorisvandenbossche
Your analysis is generally correct. The problem is, as you pointed out, that numpy does not have support for time zones, and the data in columns are stored as numpy arrays.
The DatetimeIndex provides some workarounds to handle time zones at the level of the full index, and these workarounds are not (yet) available for columns ('blocks'). As far as I understand, it would be possible to do something similar for the DatetimeBlock, but @jreback can shed more light on this.
I think it is a rather big enhancement, but if you are interested in working on this, you are certainly welcome! The numpy developers are also looking for help with improving timezone support in numpy itself.
Comment From: quicknir
Thank you @jorisvandenbossche, very helpful. Handling it at the DatetimeBlock level seems a bit messier, e.g. different columns could have different timezone or precision information. I was thinking of a solution more like Categorical. As far as I can see, Categorical has all the right basic infrastructure in place, which could be duplicated to create a DatetimeColumn, but naturally this is a very superficial viewpoint.
Sure, I am interested in at least looking at what would be required. The timezone support on the numpy side I actually view as adequate, in the sense that I think the datetime64[ns] is perfectly adequate as a low-level data type, and has the enormous advantage of being just exactly a 64 bit integer and nothing else. This, I think, should be handled on the pandas side.
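To make the idea concrete, here is a rough, standard-library-only sketch of what I mean: the storage stays a flat buffer of 64-bit integers (UTC nanoseconds), and a single timezone is tracked at the column level. `TzAwareColumn` and its methods are hypothetical names for illustration, not anything that exists in pandas or numpy.

```python
from array import array
from datetime import datetime, timedelta, timezone

_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


class TzAwareColumn:
    """Hypothetical column: UTC nanoseconds as int64 plus one column-level tz."""

    def __init__(self, utc_nanos, tz):
        # 'q' is a signed 64-bit integer, so the buffer is plain integers only.
        self._values = array("q", utc_nanos)
        self.tz = tz

    def __len__(self):
        return len(self._values)

    def __getitem__(self, i):
        # datetime has microsecond resolution, so sub-microsecond digits
        # are dropped when an element is materialized.
        utc = _EPOCH + timedelta(microseconds=self._values[i] // 1000)
        return utc.astimezone(self.tz)

    def tz_convert(self, tz):
        # The stored integers are copied unchanged; only the label differs.
        return TzAwareColumn(self._values, tz)


utc_col = TzAwareColumn([1_420_070_400_000_000_000], timezone.utc)
est_col = utc_col.tz_convert(timezone(timedelta(hours=-5)))
print(utc_col[0].isoformat())  # 2015-01-01T00:00:00+00:00
print(est_col[0].isoformat())  # 2014-12-31T19:00:00-05:00
```

The point of the design is that converting between timezones never touches the integer data, which is why I think the timezone belongs on the column rather than on each element.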
Comment From: jreback
dupe of #8260
@quicknir you are welcome to give this a go. It's actually not that tricky: just inherit from DatetimeBlock and the NonConsolidatingBlock mixin (so these blocks are not combined with one another). This is by far the cleanest solution.
> Sure, I am interested in at least looking at what would be required. The timezone support on the numpy side I actually view as adequate, in the sense that I think the datetime64[ns] is perfectly adequate as a low-level data type, and has the enormous advantage of being just exactly a 64 bit integer and nothing else. This, I think, should be handled on the pandas side.
Support on numpy is non-existent, though there are some proposals. Just look through the pandas codebase and you will appreciate the enormity of what @wesm did with timezones.