Code Sample, a copy-pastable example if possible
import pandas as pd
import datetime as dt
print(dt.datetime.strptime('2012.01.01 1', '%Y.%m.%d %H'))
# -> 2012-01-01 01:00:00
print(pd.to_datetime('2012.01.01 1', format='%Y.%m.%d %H'))
# -> ValueError: time data '2012.01.01 1' doesn't match format specified
Problem description
When using to_datetime to parse a date that only includes an hour component, but not minutes and seconds, with a format that is otherwise similar to ISO8601 (such as the format '%Y.%m.%d %H'), a ValueError is raised (see above). This behavior is unexpected as strptime can parse the same date without any problem, using the same format string (see above).
I suspect that the problem is in the _format_is_iso function of pandas._libs.tslibs.parsing, which only checks whether the ISO format starts with the given format, so this format is recognized as being ISO-like. In this case, the format passed to to_datetime is ignored and the tslib.array_to_datetime function is used to parse the date instead, which doesn't seem to be able to handle this kind of format.
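To illustrate the suspected prefix check, here is a simplified sketch. It is not the actual pandas source: the name looks_iso_like, the template string, and the separator lists are assumptions made for illustration only.

```python
# Simplified, hypothetical re-creation of a prefix-based "is this format ISO-like?" check.
# The names, template, and separator lists are illustrative assumptions, not pandas internals.
ISO_TEMPLATE = "%Y{date_sep}%m{date_sep}%d{time_sep}%H:%M:%S.%f"


def looks_iso_like(fmt: str) -> bool:
    """Return True if some ISO-style template starts with ``fmt``."""
    for date_sep in ("-", "/", ".", ""):
        for time_sep in (" ", "T"):
            template = ISO_TEMPLATE.format(date_sep=date_sep, time_sep=time_sep)
            if template.startswith(fmt):
                return True
    return False


# The hour-only format is a prefix of an ISO-style template, so it is treated as ISO-like.
print(looks_iso_like("%Y.%m.%d %H"))  # True
# A day-first format is not a prefix of any template.
print(looks_iso_like("%d.%m.%Y %H"))  # False
```

With a check like this, any truncated ISO-style format passes, including one that stops right after the hour.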
My current workaround is to modify the dates to also have a minutes component (append ':00' to every string) so that they can be parsed.
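A minimal sketch of that workaround, using only the standard library to check the result. The helper name add_minutes and the format '%Y.%m.%d %H:%M' are assumptions; the report does not spell out the exact format used after padding.

```python
import datetime as dt


def add_minutes(value: str) -> str:
    """Append ':00' so an hour-only timestamp also carries a minutes component."""
    return value + ":00"


raw = ["2012.01.01 1", "2012.01.01 13"]
padded = [add_minutes(v) for v in raw]
print(padded)  # ['2012.01.01 1:00', '2012.01.01 13:00']

# Sanity check with the standard library: the padded strings match '%Y.%m.%d %H:%M'.
parsed = [dt.datetime.strptime(v, "%Y.%m.%d %H:%M") for v in padded]
print(parsed[0])  # 2012-01-01 01:00:00
```

The padded strings can then be handed to pd.to_datetime with a format that includes the minutes, which is what the workaround relies on.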
Expected Output
2012-01-01 01:00:00
(same as when using strptime)
Output of pd.show_versions()
Comment From: chris-b1
Thanks for the report - issue seems to be here - raising an error if we reach the end of the string and the hour wasn't 2 digits, which looks unnecessary.
https://github.com/pandas-dev/pandas/blob/7000b899038e9d6559ce80d3c018ec0ad5412efe/pandas/_libs/src/datetime/np_datetime_strings.c#L257
PR to fix welcome - to start with, I think you could just delete this conditional and see if any tests break (as well as adding new tests)
Comment From: uds5501
@chris-b1 deleting this conditional does solve @johan12345's issue, but it fails multiple tests. I will see if I can modify those tests.
Comment From: WillAyd
@jbrockmendel brings up a good point in questioning this in the associated PR. While I understand that the datetime module is parsing that single-digit hour for the user, I'm not sure that can be guaranteed across all platforms. In fact, the Python documentation, which follows C89, states that "%H" represents "Hour (24-hour clock) as a zero-padded decimal number", which the example is not.
I think if pandas tries to make guarantees about how this gets parsed for non-standard directives that we'd be opening up a can of worms, so I'd vote for no action here.
Source:
https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior
Comment From: chris-b1
It appears that in the POSIX standard, strptime will handle missing leading 0s for %H - we already do the same for %m, %d, etc., so I think it's reasonable to expand our parser.
http://pubs.opengroup.org/onlinepubs/009695399/functions/strptime.html
I really only worry about ambiguous cases: is there any risk that "2016.05.06 1" is ambiguous with something other than pd.Timestamp('2016-05-06 01:00:00')? (honestly asking)
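As a quick illustration of the leniency being discussed, Python's own datetime.strptime (not the POSIX C function itself) accepts values without leading zeros for %m, %d and %H. This is a sketch to show the behavior, not a statement about how pandas parses these strings.

```python
import datetime as dt

fmt = "%Y.%m.%d %H"

# Zero-padded and unpadded inputs parse to the same datetime with the same format.
padded = dt.datetime.strptime("2016.05.06 01", fmt)
unpadded = dt.datetime.strptime("2016.5.6 1", fmt)

print(unpadded)            # 2016-05-06 01:00:00
print(padded == unpadded)  # True
```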
Comment From: jbrockmendel
FWIW this would be breaking with both numpy and dateutil.
Comment From: chris-b1
we broke with numpy a while back, and its parser is limited so not worried about that, but the dateutil one is more interesting, didn't know that.
from dateutil.parser import parser
parser().parse("2016.05.06 1")
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-10-f30f9e318c90> in <module>()
----> 1 parser().parse("2016.05.06 1")
~\AppData\Local\Continuum\Anaconda3\envs\py36\lib\site-packages\dateutil\parser.py in parse(self, timestr, default, ignoretz, tzinfos, **kwargs)
557
558 if res is None:
--> 559 raise ValueError("Unknown string format")
560
561 if len(res) == 0:
ValueError: Unknown string format
parser().parse("2016.05.06 01")
Out[11]: datetime.datetime(2016, 5, 6, 1, 0)
Comment From: WillAyd
@chris-b1 good info. Any concerns with Windows? Not sure where to find that information
Comment From: chris-b1
Windows being Windows doesn't actually have a strptime in its C runtime, but it looks like std::get_time, which it would have, follows roughly the same spec with respect to leading 0s.
http://en.cppreference.com/w/cpp/io/manip/get_time
Comment From: uds5501
@chris-b1 Is it really worth messing with the present parser? As @WillAyd mentioned, the bugs that may follow can be tricky to tackle, considering we would be taking a different path from numpy and the dateutil parser (which isn't that helpful either, I reckon)
Comment From: chris-b1
To me it's only a question of whether we consider "2016.05.06 1" an ISO 8601 (loosely as we use it) formatted string, i.e. parses even if no format is passed.
Even if the answer to that is no, we need to match python if an explicit format spec is passed, so some change is necessary either way. It's a bug for pd.to_datetime(..., format='format') not to match datetime.strptime(..., 'format')
Comment From: johan12345
@chris-b1 Right - It is okay that the date is not parseable without specifying an explicit format, but when specifying the actual format, the behavior should be the same as with strptime.
Thanks everyone for the quick reactions, even before I had time to look into it in detail myself!
Comment From: mnjenga2
Hi there. I need your help, guys: I have a series of dates in '%Y-%m-%d' and I need them in '%Y%m', like 2018-12-04 to 201812. I will appreciate any assistance. Thanks
Comment From: yogeshkkolte
If x is a datetime object: pandas.datetime.strftime(x, '%Y%m')
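For a whole series of date strings, a minimal sketch using the .dt accessor may be more convenient than formatting one object at a time. The sample values are made up for illustration, and as far as I know the pandas.datetime alias is no longer available in recent pandas releases, so this avoids it.

```python
import pandas as pd

# Hypothetical sample data in '%Y-%m-%d' form.
dates = pd.Series(["2018-12-04", "2019-01-15"])

# Parse the strings, then format each timestamp as year followed by month.
parsed = pd.to_datetime(dates, format="%Y-%m-%d")
year_month = parsed.dt.strftime("%Y%m")

print(year_month.tolist())  # ['201812', '201901']
```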
Comment From: MarcoGorelli
This'll be addressed in https://github.com/pandas-dev/pandas/pull/50242
thanks for the report!