Pandas Timedelta parse AM/PM - Nineya|java/go/python

I have some data that is roughly like "%H:%M:%S %P" (where %P is AM/PM). I'd like to store it as a Timedelta representing seconds since midnight.

Currently parsing that "succeeds", ignoring the AM / PM

In [3]: import pandas as pd

In [6]: raw = ['3:25:00 AM', '3:25:00 PM', '12:30:00 AM', '12:30:00 PM']
   ...: base = pd.to_timedelta(raw)
   ...: base
   ...:
Out[6]: TimedeltaIndex(['03:25:00', '03:25:00', '12:30:00', '12:30:00'], dtype='timedelta64[ns]', freq=None)

Here's my current workaround.

In [7]: split = pd.Series(raw).str.extract(r"(?P<base>\S+) (?P<am_pm>\w{2})", expand=True)
   ...: base = pd.to_timedelta(split['base'])
   ...: hour = base.dt.total_seconds() // (60 * 60)
   ...:
   ...: move_ahead = (split.am_pm == "PM") & (hour < 12)
   ...: move_back = (split.am_pm == "AM") & (hour == 12)
   ...:
   ...: base[move_ahead] += pd.Timedelta("12H")
   ...: base[move_back] -= pd.Timedelta("12H")
   ...: base
   ...:
Out[7]:
0   03:25:00
1   15:25:00
2   00:30:00
3   12:30:00
Name: base, dtype: timedelta64[ns]

That works, but is a bit tricky to get right (assuming I have gotten it right). I'd propose that we either

raise when we see am/pm in the data, and add this as cookbook recipe
Try to support parsing this kind of data directly

Comment From: chris-b1

The 'today' part of this is a hack, but here's a slightly simpler recipe


In [23]: raw = ['3:25:00 AM', '3:25:00 PM', '12:30:00 AM', '12:30:00 PM']

In [24]: pd.to_datetime(raw)
Out[24]: 
DatetimeIndex(['2017-12-15 03:25:00', '2017-12-15 15:25:00',
               '2017-12-15 00:30:00', '2017-12-15 12:30:00'],
              dtype='datetime64[ns]', freq=None)

In [25]: pd.to_datetime(raw) - pd.Timestamp('today').normalize()
Out[25]: TimedeltaIndex(['03:25:00', '15:25:00', '00:30:00', '12:30:00'], dtype='timedelta64[ns]', freq=None)

Comment From: TomAugspurger

Oh, that's clever :)

Comment From: WillAyd

If we wanted to do option 2 I don't think it would be all that difficult. I think we would just need to update parse_timedelta_string to detect AM/PM and add 1000000000L * 3600 * 12 to the result in case of the latter. The only exception would be if the hour is 12 - then we'd remove that amount from the result if the AM is detected and leave PM untouched.

Happy to give it a go in a PR

https://github.com/pandas-dev/pandas/blob/b5f1e716d89487423bb12d6cc4e6da2b39184531/pandas/_libs/tslibs/timedeltas.pyx#L238

Comment From: pganssle

I would think that it's inadvisable to include AM and PM in any sort of Timedelta parser. I would think parsing what is obviously a time to Timedelta is just the wrong behavior. I would probably throw an error on AM/PM and instead use @chris-b1's solution. The today thing can be made more explicit by specifying any default date (unless the "today" behavior is 100% known to be stable and not likely to be affected by DST changes or leap seconds or whatever) for the parse and then subtracting that off (or by adding first-class Time support that doesn't require a date portion).

Comment From: mihaicezar

I agree with @pganssle for the fact that Timedelta should not be used for parsing a time. Probably timedeltas.pyx should have always thrown an error on AM/PM. This solution means it's then on the external developer to deal with the problem (and @chris-b1's workaround is a bit risky).

However, if support for AM/PM is decided, @WillAyd's idea is a bit incomplete in the negative domain: While pandas.Timedelta('-1 days 12:59:59 AM') is covered, pandas.Timedelta('- 12:59:59 AM') should raise an error (it has literally no logical meaning).

So which way should this go?

Comment From: WillAyd

I think an error here makes sense so unless @TomAugspurger disagrees would certainly take a PR!

Comment From: TomAugspurger

Raising sounds fine.

Comment From: mihaicezar

@WillAyd @TomAugspurger Just to be sure, you mean raising error on any AM/PM, right?

Comment From: WillAyd

Yep!

Comment From: jbrockmendel

Looks like pd.core.tools.times.to_time does not handle the OP example.