Pandas ENH: Support parsing <Month Name> <Day number> e.g. Jan 1 in date utilities

I assumed that this was officially supported before. I haven't narrowed down when it changed beyond sometime between 0.16.2 and 0.17.0.

In [1]: pd.__version__
Out[1]: '0.16.2'

In [2]: pd.date_range("Jan 1", "March 31", name="date")
Out[2]:
DatetimeIndex(['2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04',
               '2015-01-05', '2015-01-06', '2015-01-07', '2015-01-08',
               '2015-01-09', '2015-01-10', '2015-01-11', '2015-01-12',
...

In [1]: pd.__version__
Out[1]: '0.17.0'

In [2]: pd.date_range("Jan 1", "March 31", name="date")
---------------------------------------------------------------------------
OutOfBoundsDatetime                       Traceback (most recent call last)
<ipython-input-2-8eaca08051ac> in <module>()
----> 1 pd.date_range("Jan 1", "March 31", name="date")

/Users/tom.augspurger/Envs/py3/lib/python3.5/site-packages/pandas/tseries/index.py in date_range(start, end, periods, freq, tz, normalize, name, closed)
   1912     return DatetimeIndex(start=start, end=end, periods=periods,
   1913                          freq=freq, tz=tz, normalize=normalize, name=name,
-> 1914                          closed=closed)
   1915
   1916

/Users/tom.augspurger/Envs/py3/lib/python3.5/site-packages/pandas/util/decorators.py in wrapper(*args, **kwargs)
     87                 else:
     88                     kwargs[new_arg_name] = new_arg_value
---> 89             return func(*args, **kwargs)
     90         return wrapper
     91     return _deprecate_kwarg

/Users/tom.augspurger/Envs/py3/lib/python3.5/site-packages/pandas/tseries/index.py in __new__(cls, data, freq, start, end, periods, copy, name, tz, verify_integrity, normalize, closed, ambiguous, dtype, **kwargs)
    234             return cls._generate(start, end, periods, name, freq,
    235                                  tz=tz, normalize=normalize, closed=closed,
--> 236                                  ambiguous=ambiguous)
    237
    238         if not isinstance(data, (np.ndarray, Index, ABCSeries)):

/Users/tom.augspurger/Envs/py3/lib/python3.5/site-packages/pandas/tseries/index.py in _generate(cls, start, end, periods, name, offset, tz, normalize, ambiguous, closed)
    383
    384         if start is not None:
--> 385             start = Timestamp(start)
    386
    387         if end is not None:

pandas/tslib.pyx in pandas.tslib.Timestamp.__new__ (pandas/tslib.c:8967)()

pandas/tslib.pyx in pandas.tslib.convert_to_tsobject (pandas/tslib.c:22303)()

pandas/tslib.pyx in pandas.tslib.convert_str_to_tsobject (pandas/tslib.c:24364)()

pandas/tslib.pyx in pandas.tslib.convert_to_tsobject (pandas/tslib.c:23344)()

pandas/tslib.pyx in pandas.tslib._check_dts_bounds (pandas/tslib.c:26590)()

OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1-01-01 00:00:00

Comment From: TomAugspurger

Never mind. Never documented.

Comment From: jorisvandenbossche

@jreback commented here https://github.com/mwaskom/seaborn/issues/702#issuecomment-139608714 it was never meant to be supported

Comment From: jorisvandenbossche

Further, @jreback, I thought it was maybe not a written rule, but still somewhat generally assumed that pandas did fall back to dateutil.parser.parse if it couldn't parse the datetime itself. So in that sense, it is somewhat surprising to me pd.to_datetime('Jan 1') no longer works

Comment From: jreback

this has to do with the change that, IIRC, @sinhrks made (e.g. #7599) to make all parsing consistent.

I think this did work (though not officially / undocumented / not tested). So we probably can support it. But there should be a real effort here.

Comment From: sinhrks

I think 0.17 behavior is consistent, but allowing to pass default datetime (today's date in most cases) for dateutil parsing may be useful? - https://github.com/pydata/pandas/blob/master/pandas/tslib.pyx#L1800

Comment From: jorisvandenbossche

Another related case (also no (full) date part provided in the string) from https://github.com/pandas-dev/pandas/issues/16074

In [20]: pd.to_datetime("4pm")
...
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1-01-01 16:00:00

In [21]: pd.to_datetime("16:00")
Out[21]: Timestamp('2017-04-21 16:00:00')

so depending on the format of the time string, this does or does not work, so there is at least some inconsistency.

But I think we mainly have to decide: if a part of the date (e.g. the year) or the full date is missing, do we fill it with 0001-01-01 (with the result that it raises an error) or with the current date? In the first case, it is probably better to raise a custom error message instead of the OutOfBounds error.

Comment From: mikedeltalima

@jreback thanks for linking the duplicate. Would it be reasonable to use the dateutil parser and allow the user to pass a default datetime?

Comment From: jreback

@mikedeltalima you can do that as a user. pandas doesn't and shouldn't have an API for that; giving back the current day as a non-explicit operation seems magical.

Comment From: mikedeltalima

@jreback I'm not sure I understand. You can specify the default in dateutil, but not pandas. Why shouldn't the user have that option? Also, dateutil chose the current date for their default default, but pandas could choose something else (Jan 1 of current year?).

Comment From: jreback

how is having a default date useful? sure, for creating a single date I suppose, but I'm not sure of the general utility

Comment From: mikedeltalima

Let's say I have a Series (a column in a DataFrame) that consists of dates like April 5, May 10 etc. None of them specify the year. If I want to capture that information, I need to convert the column using to_datetime, but datetimes need the year specified. Why force the user to implement a (costly?) transform after the fact?

Comment From: jreback

still not sure what you mean. this is a context problem, and pandas shouldn't be guessing what you mean
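
For reference, a minimal sketch of supplying that context explicitly. It assumes the caller knows the year (2017 here is made up) and that every entry looks like "<Month name> <day>"; the year is appended and an explicit format is passed, so nothing is guessed.

```python
import pandas as pd

# Hypothetical column of month/day strings without a year.
raw = pd.Series(["April 5", "May 10"])

# Assumption: the year is known from context outside the data.
year = 2017

# Append the year, then parse with an explicit format string.
parsed = pd.to_datetime(raw + f" {year}", format="%B %d %Y")
print(parsed)
```

This stays vectorized inside to_datetime, but it only works when all strings share one known format.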

Comment From: jorisvandenbossche

Just to be sure it is clear for everybody: dateutil by default fills in missing parts with the current date, and also has a keyword to change this behaviour:

In [8]: import datetime

In [9]: import dateutil

In [10]: dateutil.parser.parse('16:00')
Out[10]: datetime.datetime(2017, 4, 27, 16, 0)

In [11]: dateutil.parser.parse('16:00', default=datetime.datetime(2000, 1, 1))
Out[11]: datetime.datetime(2000, 1, 1, 16, 0)

In [13]: dateutil.parser.parse('April 3')
Out[13]: datetime.datetime(2017, 4, 3, 0, 0)

In [16]: dateutil.parser.parse('April 3', default=datetime.datetime(2000, 1, 1))
Out[16]: datetime.datetime(2000, 4, 3, 0, 0)

When it becomes more strange is when filling in the current day of the month:

In [12]: dateutil.parser.parse('Aug 2016')
Out[12]: datetime.datetime(2016, 8, 27, 0, 0)

(today is the 27th of April)

So we can discuss whether we should follow that behaviour in pandas or not. In https://github.com/pandas-dev/pandas/pull/7599 we decided to not follow that for at least the filling of the current day of the month (the last more strange example). The consequence is that we also do not follow the rule for filling the current year, at least for certain formats (this issue), a consequence which was not fully on purpose I think (or at least this is not discussed / tested in the original PR).

Filling with current year as dateutil does, has at least some usecase I think, but if you want this, you can always directly use the dateutil parser instead of to_datetime. (@mikedeltalima that is a workaround that you can use as well, instead of adapting the strings)

> giving back current day as a non explicit operation seems magical

Jeff, note that we currently actually still do that in specific cases, depending on the format of the string (see my example above https://github.com/pandas-dev/pandas/issues/11430#issuecomment-296122260)

In any case, the current situation is also not ideal, as certain cases still work while others raise, and I think the error message should not be an OutOfBounds error, but an error message indicating that the string could not be parsed because (part of) the date was missing.
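
To illustrate the kind of error meant here, a rough sketch built on dateutil rather than on pandas internals. The helper name `parse_requiring_full_date` and the two defaults are made up for the example; the idea is that any date field taken from the default will differ between two parses that use different defaults.

```python
import datetime
from dateutil import parser

# Two arbitrary defaults that differ in year, month and day.
_DEFAULT_A = datetime.datetime(2000, 1, 1)
_DEFAULT_B = datetime.datetime(2001, 2, 2)


def parse_requiring_full_date(text):
    """Hypothetical helper (not pandas API): parse `text`, but raise a
    descriptive error if the year, month or day had to be filled in."""
    a = parser.parse(text, default=_DEFAULT_A)
    b = parser.parse(text, default=_DEFAULT_B)
    # A date field copied from the default differs between the two parses.
    if a.date() != b.date():
        raise ValueError(
            f"could not parse {text!r}: part of the date is missing"
        )
    return a


print(parse_requiring_full_date("2017-04-03 16:00"))
for incomplete in ["April 3", "16:00", "Aug 2016"]:
    try:
        parse_requiring_full_date(incomplete)
    except ValueError as err:
        print(err)
```

Parsing twice doubles the cost, so this is only meant to show the error message, not how it should be implemented.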

Comment From: mikedeltalima

@jorisvandenbossche thanks for the explanation! You've really improved the readability of the conversation :)

Could you elaborate on the workaround? This is what I could do, but I wonder if there are good reasons to use to_datetime instead if I can fix it.

>>> pd.Series('april 5').apply(dateutil.parser.parse)
0   2017-04-05
dtype: datetime64[ns]
>>> pd.Series('april 5').apply(lambda x: dateutil.parser.parse(x, default=datetime.datetime(2017, 1, 1)))
0   2017-04-05
dtype: datetime64[ns]
>>> pd.Series('april 5').apply(lambda x: dateutil.parser.parse(x, default=datetime.datetime(1, 1, 1)))
0    0001-04-05 00:00:00
dtype: object

That said, @jreback I don't think my suggestion involves pandas guessing anything. It would simply allow the user to override a default (which pandas is already using). As you can see from the last two lines in my example above, there seems to be a bug that does not allow pandas to recognize some datetimes as the correct dtype. Looks like the cutoff is September 21, 1677. :)

>>> pd.Series(datetime.datetime(1677, 1, 1))
0    1677-01-01 00:00:00
dtype: object
>>> pd.Series(datetime.datetime(1678, 1, 1))
0   1678-01-01
dtype: datetime64[ns]
>>> pd.Series(datetime.datetime(1677, 9, 22))
0   1677-09-22
dtype: datetime64[ns]
>>> pd.Series(datetime.datetime(1677, 9, 21))
0    1677-09-21 00:00:00
dtype: object

Comment From: jorisvandenbossche

Regarding the 1677, the reason for that is very simple, once you know it: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#timeseries-timestamp-limits

When you try to parse a string outside of the supported range, you get an OutOfBounds error, but if you pass a datetime.datetime object outside that range, it gets stored as object dtype.
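
The limits in that link follow from storing timestamps as 64-bit integer nanoseconds counted from 1970-01-01. A back-of-the-envelope check using only the standard library (microsecond precision, so the trailing nanosecond digits are truncated):

```python
import datetime

EPOCH = datetime.datetime(1970, 1, 1)

# Largest signed 64-bit integer, read as nanoseconds and truncated to
# whole microseconds because datetime cannot represent anything finer.
INT64_MAX_US = (2**63 - 1) // 1000

lower = EPOCH - datetime.timedelta(microseconds=INT64_MAX_US)
upper = EPOCH + datetime.timedelta(microseconds=INT64_MAX_US)
print(lower, upper)

# Midnight on 1677-09-21 falls just before the lower bound,
# while midnight on 1677-09-22 is inside the range.
print(datetime.datetime(1677, 9, 21) < lower < datetime.datetime(1677, 9, 22))
```

That is why 1677-09-21 ends up as object dtype in the example above while 1677-09-22 does not.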

Comment From: mikedeltalima

@jorisvandenbossche so proposed fix: pick a default that is within the bounds of the timestamp limits (Jan 1 of current year up to pd.Timestamp.max) and allow users to pass a default (raise exception if out of bounds).
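
A user-level sketch of what that proposal could look like, written as a standalone function rather than a change to pandas. The name `to_datetime_with_default` and the `LOWER`/`UPPER` constants are made up for illustration; the bounds are deliberately conservative versions of the limits discussed above.

```python
import datetime

import pandas as pd
from dateutil import parser

# Conservative bounds inside the nanosecond timestamp range.
LOWER = datetime.datetime(1677, 9, 22)
UPPER = datetime.datetime(2262, 4, 11)


def to_datetime_with_default(values, default):
    """Hypothetical helper, not part of pandas: fill missing date parts
    from an explicit `default` instead of a hardcoded one."""
    if not (LOWER <= default <= UPPER):
        raise ValueError(f"default {default} is outside the supported range")
    # dateutil copies any field missing from the string out of `default`.
    parsed = [parser.parse(v, default=default) for v in values]
    return pd.Series(pd.to_datetime(parsed))


result = to_datetime_with_default(
    ["April 5", "May 10"], default=datetime.datetime(2000, 1, 1)
)
print(result)
```

Note that this goes through dateutil for every element, so it would be slow on large columns.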

Comment From: jreback

@mikedeltalima

so you want to add an argument to the constructor of Timestamp and DatetimeIndex of default=None.

Timestamp('4 pm', default=Timestamp(2000,1,1)) or whatever ?

I suppose we could implement logic, or simply pass thru to dateutil (which makes it really slow though, as it's pure python), but then again this is a convenience feature.

Comment From: mikedeltalima

@jreback is that necessary? I was only thinking of adding the argument to the to_datetime function. From what I can tell, the function is already implementing a default, it's just hardcoded as _DEFAULT_DATETIME (https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/tslib.pyx#L2171), so I don't think this would slow things down.

Comment From: jreback

> so I don't think this would slow things down.

by definition, when things get down to using dateutil they have already slowed down. this is a last-ditch effort (and rarely happens actually).

so making this more explicit with a default value makes sense.

Comment From: MarcoGorelli

strong -1 on adding extra arguments to datetime parsing, it's fine for this to error

Comment From: mroeschke

Agreed here. I think whatever dateutil can parse as a string is sufficient at this point so closing