I assume that this was officially supported before. Haven't narrowed it down any more than sometime between 0.16.2 and 0.17.0.
In [1]: pd.__version__
Out[1]: '0.16.2'
In [2]: pd.date_range("Jan 1", "March 31", name="date")
Out[2]:
DatetimeIndex(['2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04',
'2015-01-05', '2015-01-06', '2015-01-07', '2015-01-08',
'2015-01-09', '2015-01-10', '2015-01-11', '2015-01-12',
...
In [1]: pd.__version__
Out[1]: '0.17.0'
In [2]: pd.date_range("Jan 1", "March 31", name="date")
---------------------------------------------------------------------------
OutOfBoundsDatetime Traceback (most recent call last)
<ipython-input-2-8eaca08051ac> in <module>()
----> 1 pd.date_range("Jan 1", "March 31", name="date")
/Users/tom.augspurger/Envs/py3/lib/python3.5/site-packages/pandas/tseries/index.py in date_range(start, end, periods, freq, tz, normalize, name, closed)
1912 return DatetimeIndex(start=start, end=end, periods=periods,
1913 freq=freq, tz=tz, normalize=normalize, name=name,
-> 1914 closed=closed)
1915
1916
/Users/tom.augspurger/Envs/py3/lib/python3.5/site-packages/pandas/util/decorators.py in wrapper(*args, **kwargs)
87 else:
88 kwargs[new_arg_name] = new_arg_value
---> 89 return func(*args, **kwargs)
90 return wrapper
91 return _deprecate_kwarg
/Users/tom.augspurger/Envs/py3/lib/python3.5/site-packages/pandas/tseries/index.py in __new__(cls, data, freq, start, end, periods, copy, name, tz, verify_integrity, normalize, closed, ambiguous, dtype, **kwargs)
234 return cls._generate(start, end, periods, name, freq,
235 tz=tz, normalize=normalize, closed=closed,
--> 236 ambiguous=ambiguous)
237
238 if not isinstance(data, (np.ndarray, Index, ABCSeries)):
/Users/tom.augspurger/Envs/py3/lib/python3.5/site-packages/pandas/tseries/index.py in _generate(cls, start, end, periods, name, offset, tz, normalize, ambiguous, closed)
383
384 if start is not None:
--> 385 start = Timestamp(start)
386
387 if end is not None:
pandas/tslib.pyx in pandas.tslib.Timestamp.__new__ (pandas/tslib.c:8967)()
pandas/tslib.pyx in pandas.tslib.convert_to_tsobject (pandas/tslib.c:22303)()
pandas/tslib.pyx in pandas.tslib.convert_str_to_tsobject (pandas/tslib.c:24364)()
pandas/tslib.pyx in pandas.tslib.convert_to_tsobject (pandas/tslib.c:23344)()
pandas/tslib.pyx in pandas.tslib._check_dts_bounds (pandas/tslib.c:26590)()
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1-01-01 00:00:00
Comment From: TomAugspurger
Never mind. Never documented.
Comment From: jorisvandenbossche
@jreback commented here https://github.com/mwaskom/seaborn/issues/702#issuecomment-139608714 that it was never meant to be supported.
Comment From: jorisvandenbossche
Further, @jreback, I thought it was maybe not a written rule, but still somewhat generally assumed that pandas did fall back to dateutil.parser.parse if it couldn't parse the datetime itself. So in that sense, it is somewhat surprising to me that pd.to_datetime('Jan 1') no longer works.
Comment From: jreback
This has to do with the change that, IIRC, @sinhrks made (e.g. #7599) to make all parsing consistent.
I think this did work (though not officially: it was undocumented and untested). So we can probably support it, but there should be a real effort here.
Comment From: sinhrks
I think the 0.17 behavior is consistent, but allowing a default datetime (today's date in most cases) to be passed for dateutil parsing may be useful?
- https://github.com/pydata/pandas/blob/master/pandas/tslib.pyx#L1800
Comment From: jorisvandenbossche
Another related case (also no (full) date part provided in the string) from https://github.com/pandas-dev/pandas/issues/16074
In [20]: pd.to_datetime("4pm")
...
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1-01-01 16:00:00
In [21]: pd.to_datetime("16:00")
Out[21]: Timestamp('2017-04-21 16:00:00')
so depending on the format of the time string, this does or does not work, so there is at least some inconsistency.
But I think we mainly have to decide the following: if a part of the date (e.g. the year) or the full date is missing, do we fill it with 0001-01-01 (with the result that it raises an error) or with the current date? In the first case, it is probably better to raise a custom error message instead of the OutOfBounds error.
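To illustrate what such a custom error could look like, here is a minimal sketch that works outside of pandas. It assumes dateutil is installed; the helper name `parse_full_date` and the two default dates are made up for this example and are not part of pandas or dateutil. The idea is to parse the string twice with two different defaults: if the resulting dates differ, some part of the date was filled in from the default and was therefore missing in the string.

```python
import datetime

from dateutil import parser

# Two arbitrary defaults that differ in year, month and day.
# Both are leap years so that a string like "Feb 29" does not fail for
# reasons unrelated to this check.
_DEFAULT_A = datetime.datetime(2000, 1, 1)
_DEFAULT_B = datetime.datetime(2004, 3, 2)


def parse_full_date(text):
    """Hypothetical helper: parse `text`, refusing strings with missing date parts."""
    first = parser.parse(text, default=_DEFAULT_A)
    second = parser.parse(text, default=_DEFAULT_B)
    # Any date field taken from the default differs between the two parses.
    if first.date() != second.date():
        raise ValueError(
            f"could not parse {text!r}: (part of) the date is missing"
        )
    return first
```

With this sketch, a complete string such as "2017-04-03 16:00" parses normally, while "April 3", "16:00" and "Aug 2016" all raise the descriptive ValueError rather than an out-of-bounds error.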
Comment From: mikedeltalima
@jreback thanks for linking the duplicate. Would it be reasonable to use the dateutil parser and allow the user to pass a default datetime?
Comment From: jreback
@mikedeltalima you can do that as a user. pandas doesn't and shouldn't have an API for that; giving back the current day as a non-explicit operation seems magical.
Comment From: mikedeltalima
@jreback I'm not sure I understand. You can specify the default in dateutil, but not pandas. Why shouldn't the user have that option? Also, dateutil chose the current date for their default default, but pandas could choose something else (Jan 1 of current year?).
Comment From: jreback
How is having a default date useful? Sure, for creating a single date I suppose, but I'm not sure of the general utility.
Comment From: mikedeltalima
Let's say I have a Series (a column in a DataFrame) that consists of dates like April 5, May 10 etc. None of them specify the year. If I want to capture that information, I need to convert the column using to_datetime, but datetimes need the year specified. Why force the user to implement a (costly?) transform after the fact?
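For reference, one way to handle such a column today is to make the year explicit in the strings before parsing. This is only a sketch: it assumes the values follow a "Month day" pattern and that 2017 is the intended year, both of which are assumptions for illustration.

```python
import pandas as pd

# Example column with month and day but no year (made-up data).
dates = pd.Series(["April 5", "May 10"])

# Append the assumed year and parse with an explicit format, so nothing
# is left for the parser to guess.
assumed_year = 2017
result = pd.to_datetime(dates + f" {assumed_year}", format="%B %d %Y")
```

Because the format is given explicitly, this is a vectorized string concatenation followed by a single parse, not a per-element Python call.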
Comment From: jreback
Still not sure what you mean. This is a context problem, and pandas shouldn't be guessing what you mean.
Comment From: jorisvandenbossche
Just to be sure it is clear for everybody: dateutil by default fills in missing parts with the current date, and also has a keyword to change this behaviour:
In [8]: import datetime
In [9]: import dateutil
In [10]: dateutil.parser.parse('16:00')
Out[10]: datetime.datetime(2017, 4, 27, 16, 0)
In [11]: dateutil.parser.parse('16:00', default=datetime.datetime(2000, 1, 1))
Out[11]: datetime.datetime(2000, 1, 1, 16, 0)
In [13]: dateutil.parser.parse('April 3')
Out[13]: datetime.datetime(2017, 4, 3, 0, 0)
In [16]: dateutil.parser.parse('April 3', default=datetime.datetime(2000, 1, 1))
Out[16]: datetime.datetime(2000, 4, 3, 0, 0)
Where it becomes stranger is when it fills in the current day of the month:
In [12]: dateutil.parser.parse('Aug 2016')
Out[12]: datetime.datetime(2016, 8, 27, 0, 0)
(today is the 27th of April)
So we can discuss whether we should follow that behaviour in pandas or not. In https://github.com/pandas-dev/pandas/pull/7599 we decided to not follow that for at least the filling of the current day of the month (the last more strange example). The consequence is that we also do not follow the rule for filling the current year, at least for certain formats (this issue), a consequence which was not fully on purpose I think (or at least this is not discussed / tested in the original PR).
Filling with the current year as dateutil does has at least some use case I think, but if you want this, you can always directly use the dateutil parser instead of to_datetime. (@mikedeltalima that is a workaround that you can use as well, instead of adapting the strings.)
"giving back current day as a non explicit operation seems magical"
Jeff, note that we currently actually still do that in specific cases, depending on the format of the string (see my example above https://github.com/pandas-dev/pandas/issues/11430#issuecomment-296122260)
In any case, the current situation is also not ideal, as certain cases still work, and I think the error message should not be an OutOfBounds error, but an error message indicating that the string could not be parsed because (part of) the date was missing.
Comment From: mikedeltalima
@jorisvandenbossche thanks for the explanation! You've really improved the readability of the conversation :)
Could you elaborate on the workaround? This is what I could do, but I wonder if there are good reasons to use to_datetime instead if I can fix it.
>>> import datetime
>>> import dateutil.parser
>>> import pandas as pd
>>> pd.Series('april 5').apply(dateutil.parser.parse)
0 2017-04-05
dtype: datetime64[ns]
>>> pd.Series('april 5').apply(lambda x: dateutil.parser.parse(x, default=datetime.datetime(2017, 1, 1)))
0 2017-04-05
dtype: datetime64[ns]
>>> pd.Series('april 5').apply(lambda x: dateutil.parser.parse(x, default=datetime.datetime(1, 1, 1)))
0 0001-04-05 00:00:00
dtype: object
That said, @jreback I don't think my suggestion involves pandas guessing anything. It would simply allow the user to override a default (which pandas is already using). As you can see from the last two lines in my example above, there seems to be a bug that does not allow pandas to recognize some datetimes as the correct dtype. Looks like the cutoff is September 21, 1677. :)
>>> pd.Series(datetime.datetime(1677, 1, 1))
0 1677-01-01 00:00:00
dtype: object
>>> pd.Series(datetime.datetime(1678, 1, 1))
0 1678-01-01
dtype: datetime64[ns]
>>> pd.Series(datetime.datetime(1677, 9, 22))
0 1677-09-22
dtype: datetime64[ns]
>>> pd.Series(datetime.datetime(1677, 9, 21))
0 1677-09-21 00:00:00
dtype: object
Comment From: jorisvandenbossche
Regarding the 1677, the reason for that is very simple, once you know it: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#timeseries-timestamp-limits
When you try to parse a string outside of the supported range, you get an OutOfBounds error, but if you pass a datetime.datetime object outside that range, it gets stored as object dtype.
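The September 21, 1677 cutoff follows from storing timestamps as signed 64-bit nanoseconds since the Unix epoch. As a rough sketch using only the standard library (the names `LOWER`, `UPPER` and `fits_in_datetime64_ns` are made up for this example), the limits can be reproduced like this:

```python
import datetime

EPOCH = datetime.datetime(1970, 1, 1)

# Largest magnitude representable as int64 nanoseconds. The most negative
# int64 value itself is not usable as a timestamp (pandas uses it for NaT).
MAX_NS = 2**63 - 1

# datetime.datetime only has microsecond resolution, so truncate toward zero.
LOWER = EPOCH - datetime.timedelta(microseconds=MAX_NS // 1000)
UPPER = EPOCH + datetime.timedelta(microseconds=MAX_NS // 1000)


def fits_in_datetime64_ns(value):
    """Return True if a naive datetime fits in nanosecond-resolution int64."""
    return LOWER <= value <= UPPER


print(LOWER)  # 1677-09-21 00:12:43.145225
print(UPPER)  # 2262-04-11 23:47:16.854775
```

This is why midnight on 1677-09-21 falls just outside the range while 1677-09-22 is inside it.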
Comment From: mikedeltalima
@jorisvandenbossche so proposed fix: pick a default that is within the bounds of the timestamp limits (Jan 1 of current year up to pd.Timestamp.max) and allow users to pass a default (raise exception if out of bounds).
Comment From: jreback
@mikedeltalima
So you want to add an argument default=None to the constructors of Timestamp and DatetimeIndex, so
Timestamp('4 pm', default=Timestamp(2000,1,1))
or whatever?
I suppose we could implement the logic, or simply pass through to dateutil (which makes it really slow though, as it's pure Python), but then again this is a convenience feature.
Comment From: mikedeltalima
@jreback is that necessary? I was only thinking of adding the argument to the to_datetime function. From what I can tell, the function is already implementing a default, it's just hardcoded as _DEFAULT_DATETIME (https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/tslib.pyx#L2171), so I don't think this would slow things down.
Comment From: jreback
"so I don't think this would slow things down."
By definition, when things get down to using dateutil, they have already slowed down. This is a last-ditch effort (and rarely happens, actually).
So making this more explicit with a default value makes sense.
Comment From: MarcoGorelli
strong -1 on adding extra arguments to datetime parsing, it's fine for this to error
Comment From: mroeschke
Agreed here. I think whatever dateutil can parse as a string is sufficient at this point, so closing.