Pandas Timezones silently dropped in parsing

TLDR: pandas should pass a tzinfos kwarg to the dateutil parser using sensible defaults.

dateutil has a bug that silently drops most timezones. That bug is inherited by pandas. The following is run on a machine located in US/Pacific:

>>> pd.Timestamp('2017-12-08 08:20 PM PST')     # <-- only parsed correctly because of locale
Timestamp('2017-12-08 20:20:00-0800', tz='tzlocal()')
>>> pd.Timestamp('2017-12-08 08:20 PM EST')     # <-- timezone silently dropped
Timestamp('2017-12-08 20:20:00')

There is a partial fix in progress over at dateutil, the most likely outcome of which is that these cases will raise in the future unless a tzinfos kwarg is explicitly passed to dateutil.parser.parse. The issue for pandas is then to decide on what tzinfos to pass (a suggestion to handle the most common use cases by default within dateutil went nowhere).

The tzinfos kwarg is a dictionary mapping a timezone abbreviation string to a tzinfo object, e.g.

import dateutil.tz

unambiguous_tzinfos = {
    'PDT': dateutil.tz.gettz('US/Pacific'),
    'PT': dateutil.tz.gettz('US/Pacific'),
    'MDT': dateutil.tz.gettz('US/Mountain'),
    'MT': dateutil.tz.gettz('US/Mountain'),
    'ET': dateutil.tz.gettz('US/Eastern'),
    'CET': dateutil.tz.gettz('Europe/Amsterdam'),
    'NZDT': dateutil.tz.gettz('Pacific/Auckland')}

This example includes only abbreviations for which there are no other alternatives listed here. For example, "CST" is excluded since it could also be "China Standard Time", and "EST" is excluded since it could refer to "Australian Eastern Standard Time". Note that this is only a subset of the unambiguous abbreviations.
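As a rough illustration of how such a mapping is consumed, the sketch below passes a two-entry dictionary straight to dateutil.parser.parse. The variable names are illustrative only, and the zones use the canonical tz database names (America/Los_Angeles, America/New_York) rather than the US/* aliases above; it assumes those zones are available to dateutil.

```python
from datetime import timedelta

from dateutil import parser, tz

# Illustrative subset of the mapping above, using canonical zone names
# (America/New_York is the tz database name behind US/Eastern).
tzinfos = {
    "PT": tz.gettz("America/Los_Angeles"),
    "ET": tz.gettz("America/New_York"),
}

# With tzinfos supplied, the abbreviation is resolved instead of dropped.
dt = parser.parse("2017-12-08 08:20 PM ET", tzinfos=tzinfos)

# Express the resolved UTC offset in hours for easy inspection.
offset_hours = dt.utcoffset() / timedelta(hours=1)
print(dt.isoformat(), offset_hours)
```

The result is timezone-aware regardless of the locale of the machine doing the parsing, which is the behaviour the PST example above only gets by accident.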

Comment From: jreback

hmm ok, I would rather hand off non-ISO 8601 parsing to dateutil directly, so this would qualify. note that this only applies when format is not passed, and in a very limited set of cases.

Comment From: jbrockmendel

I'd prefer that dateutil handle this internally too; my hope is that consensus will develop over there once more people report that it doesn't Just Work. But until then, it's still a nontrivial question of exactly what we want to recognize by default and whether/how to let users customize it.

I see two viable options:

1) The most convenient thing to do -- at least in my comfortably Anglo-centric seat -- would be to pass defaults for a) abbreviations that are unambiguous and b) abbreviations for the most common timezones, e.g. assume CDT means "Central Daylight Time" and not "Cuba Daylight Time". Users who want to override that would need to do the parsing step before passing to the Timestamp/to_datetime constructor.

2) Same as 1, but allow users a mechanism to override the tzinfos dict that pandas passes to dateutil.
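Option 2 could look roughly like the sketch below. DEFAULT_TZINFOS and parse_with_defaults are hypothetical names used only for illustration, not existing pandas or dateutil API, and the default entries are placeholders for whatever set ends up being agreed on.

```python
from datetime import timedelta

from dateutil import parser, tz

# Hypothetical defaults; which entries belong here is the open question.
DEFAULT_TZINFOS = {
    "PT": tz.gettz("America/Los_Angeles"),
    "CDT": tz.gettz("America/Chicago"),  # assumes US Central, not Cuba
}


def parse_with_defaults(text, tzinfos=None):
    """Parse text, letting caller-supplied tzinfos override the defaults."""
    merged = dict(DEFAULT_TZINFOS)
    if tzinfos:
        merged.update(tzinfos)
    return parser.parse(text, tzinfos=merged)


# Default behaviour: CDT is read as US Central Daylight Time.
central = parse_with_defaults("2017-07-01 12:00 CDT")

# A user who means Cuba Daylight Time (UTC-4) overrides that single entry.
cuba = parse_with_defaults(
    "2017-07-01 12:00 CDT",
    tzinfos={"CDT": tz.tzoffset("CDT", -4 * 3600)},
)

for label, parsed in (("central", central), ("cuba", cuba)):
    print(label, parsed.utcoffset() / timedelta(hours=1))
```

Merging into a copy keeps the defaults untouched between calls, so an override applies only to the call that passes it.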

Comment From: jreback

we shouldn’t be hard coding any time zones. i would think u can simply pull out the string and just try to localize

Comment From: jbrockmendel

> i would think u can simply pull out the string and just try to localize

Can you expand on that? Are you suggesting users should do this before passing to Timestamp/to_datetime?

Comment From: jreback

of course not

when parsing if u hit something that looks like a tz rather than an offset u can simply take the string and localize

Comment From: jbrockmendel

> of course not

Good. That seemed unlikely (and altogether silly).

> when parsing if u hit something that looks like a tz rather than an offset u can simply take the string and localize

It's the "simply" that I'm having trouble with here. This sounds like you're suggesting the parsing be done within pandas, which I thought was what we're trying to avoid. Can you give an example of what you have in mind?