Code Sample, a copy-pastable example if possible

import pandas as pd
import datetime as dt
print(dt.datetime.strptime('2012.01.01 1', '%Y.%m.%d %H')) 
# -> 2012-01-01 01:00:00
print(pd.to_datetime('2012.01.01 1', format='%Y.%m.%d %H')) 
# -> ValueError: time data '2012.01.01 1' doesn't match format specified

Problem description

When using to_datetime to parse a date that only includes an hour component, but not minutes and seconds, with a format that is otherwise similar to ISO8601 (such as the format '%Y.%m.%d %H'), a ValueError is raised (see above). This behavior is unexpected as strptime can parse the same date without any problem, using the same format string (see above).

I suspect that the problem is in the _format_is_iso function of pandas._libs.tslibs.parsing, where it is just checked if the ISO format starts with the format given - so this format is recognized as being ISO-like. In this case, the format passed to to_datetime is ignored and tslib.array_to_datetime function is used to parse the date instead, which doesn't seem to be able to handle this kind of format.

My current workaround is to modify the dates to also have a minutes component (append ':00' to every string) so that they can be parsed.

Expected Output

2012-01-01 01:00:00 (same as when using strptime)

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.5.3.final.0 python-bits: 64 OS: Linux OS-release: 4.9.0-6-amd64 machine: x86_64 processor: byteorder: little LC_ALL: None LANG: de_DE.UTF-8 LOCALE: de_DE.UTF-8 pandas: 0.22.0 pytest: None pip: 9.0.1 setuptools: 38.5.2 Cython: None numpy: 1.14.2 scipy: 0.19.0 pyarrow: None xarray: None IPython: None sphinx: None patsy: None dateutil: 2.7.2 pytz: 2018.3 blosc: None bottleneck: None tables: None numexpr: None feather: None matplotlib: 2.2.0 openpyxl: 2.5.0 xlrd: None xlwt: None xlsxwriter: None lxml: None bs4: None html5lib: None sqlalchemy: None pymysql: None psycopg2: None jinja2: None s3fs: None fastparquet: None pandas_gbq: None pandas_datareader: None

Comment From: chris-b1

Thanks for the report - issue seems to be here - raising an error if we reach the end of the string and the hour wasn't 2 digits, which looks unnecessary.

https://github.com/pandas-dev/pandas/blob/7000b899038e9d6559ce80d3c018ec0ad5412efe/pandas/_libs/src/datetime/np_datetime_strings.c#L257

PR to fix welcome - to start with I think could just delete this conditional and see if any tests break (as well as adding new tests)

Comment From: uds5501

@chris-b1 deleting this conditional does solve @johan12345 's issue. But fails multiple tests. I will see if can modify those tests.

Comment From: WillAyd

@jbrockmendel brings up a good point in questioning this in the associated PR. While I understand that the datetime module is parsing that single digit hour for the user, I'm not sure that can be guaranteed across all platforms. In fact, the Python documentation on "%H" from C89 states that it represents a "Hour (24-hour clock) as a zero-padded decimal number", which the example is not.

I think if pandas tries to make guarantees about how this gets parsed for non-standard directives that we'd be opening up a can of worms, so I'd vote for no action here.

Source:

https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior

Comment From: chris-b1

It appears in POSIX std, strptime will handle missing leading 0s for %H - we already do the same for %m, %d, etc, so I think it's reasonable to expand our parser.

http://pubs.opengroup.org/onlinepubs/009695399/functions/strptime.html

I really only worry about ambigous cases, is there is any risk "2016.05.06 1" is ambiguous with something other than pd.Timestamp('2016-05-05 01:00:00')? (honestly asking)

Comment From: jbrockmendel

FWIW this would be breaking with both numpy and dateutil.

Comment From: chris-b1

we broke with numpy a while back, and its parser is limited so not worried about that, but the dateutil one is more interesting, didn't know that.

from dateutil.parser import parser

parser().parse("2016.05.06 1")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-10-f30f9e318c90> in <module>()
----> 1 parser().parse("2016.05.06 1")

~\AppData\Local\Continuum\Anaconda3\envs\py36\lib\site-packages\dateutil\parser.py in parse(self, timestr, default, ignoretz, tzinfos, **kwargs)
    557 
    558         if res is None:
--> 559             raise ValueError("Unknown string format")
    560 
    561         if len(res) == 0:

ValueError: Unknown string format

parser().parse("2016.05.06 01")
Out[11]: datetime.datetime(2016, 5, 6, 1, 0)

Comment From: WillAyd

@chris-b1 good info. Any concerns with Windows? Not sure where to find that information

Comment From: chris-b1

Windows being windows doesn't actually have a strptime in its c-runtime, but looks like std::get_time which it would have follows roughly the same spec with respect to leadings 0s. http://en.cppreference.com/w/cpp/io/manip/get_time

Comment From: uds5501

@chris-b1 Is it really worth messing with present parser? As @WillAyd mentioned, the bugs that may follow can be tricky to tackle with considering following a different path from numpy and dateutil parser (which isn't that helpful either, I reckon)

Comment From: chris-b1

To me it's only a question of whether we consider "2016.05.06 1" an ISO 8601 (loosely as we use it) formatted string, i.e. parses even if no format is passed.

Even if the answer to that is no, we need to match python if an explicit format spec is passed, so some change is necessary either way. It's a bug for pd.to_datetime(..., format='format') not to match datetime.strptime(..., 'format')

Comment From: johan12345

@chris-b1 Right - It is okay that the date is not parseable without specifying an explicit format, but when specifying the actual format, the behavior should be the same as with strptime.

Thanks everyone for the quick reactions, even before I had time to look into it in detail myself!

Comment From: mnjenga2

Hi there. I need your help guys, I have a series of dates in '%Y-%m-%d' and I need them in '%Y%m'. like 2018-12-04 to 201804. I will appreciate any assistance. Thanks

Comment From: yogeshkkolte

If x is date time object pandas.datetime.strftime(x,'%Y%m')

Comment From: MarcoGorelli

This'll be addressed in https://github.com/pandas-dev/pandas/pull/50242

thanks for the report!