A date_format keyword could take a single format, a dict mapping columns to formats, or a list of formats (and could then obviate the need for parse_dates).

Sometimes months and days get mixed up.

E.g. test.csv:

a,b
27.03.2003 14:55:00.000,1
03.08.2003 15:20:00.000,2

read_csv("/home/john/Documents/test.csv",index_col=0, parse_dates=True) b a
2003-03-27 14:55:00 1 2003-03-08 15:20:00 2

There doesn't appear to be any consistency in the date parsing across the rows. As well as meaning values can easily get switched around, this makes date parsing VERY slow. Once you know the format, using the datetime constructor with string slicing as a parser makes read_csv 20x faster on my machine.
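For illustration, a minimal sketch of the string-slicing parser described above, assuming the fixed DD.MM.YYYY HH:MM:SS.fff layout of test.csv (the hard-coded slice positions and the parse_fixed helper are assumptions for this example, not code from the report):

from datetime import datetime
import pandas as pd

def parse_fixed(s):
    # "27.03.2003 14:55:00.000" -> datetime(2003, 3, 27, 14, 55, 0, 0)
    return datetime(int(s[6:10]), int(s[3:5]), int(s[0:2]),
                    int(s[11:13]), int(s[14:16]), int(s[17:19]),
                    int(s[20:23]) * 1000)

# date_parser may be handed a whole array of strings, so map over it explicitly.
df = pd.read_csv("test.csv", index_col=0, parse_dates=True,
                 date_parser=lambda col: [parse_fixed(v) for v in col])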

I think there need to be some more parameters for specifying date formats, seeing as in the general case a string of dates can be ambiguous (see above).

A possible approach: Have a few default formats to choose from, as well as a more general format string approach. Obviously the defaults could use the datetime constructor with string slicing, which is very fast.

Perhaps have a dayfirst and yearfirst flag that gets passed to dateutil.parser.parse to solve ambiguities when using automatic parsing.
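For reference, this is the ambiguity that dayfirst is meant to resolve when dateutil guesses the format (a quick illustration, not part of the original report):

from dateutil import parser

parser.parse("03.08.2003")                 # datetime(2003, 3, 8, 0, 0)  -- month first by default
parser.parse("03.08.2003", dayfirst=True)  # datetime(2003, 8, 3, 0, 0)  -- day first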

Comment From: qwhelan

Try using the dayfirst parameter (new in v0.10.0):

In [4]: read_csv('test.csv', index_col=0, parse_dates=True, dayfirst=True)
Out[4]: 
                     b
a                     
2003-03-27 14:55:00  1
2003-08-03 15:20:00  2

If you want to specify the datetime format you can use the date_parser parameter:

In [5]: read_csv('test.csv', index_col=0, date_parser=lambda x: datetime.strptime(x, '%d.%m.%Y %H:%M:%S.%f'))
Out[5]: 
                     b
a                     
2003-03-27 14:55:00  1
2003-08-03 15:20:00  2

Comment From: John-Colvin

sorry, my bad for not looking at the API reference closely enough to spot the dayfirst flag.

The date_parser parameter is what I was using to test the datetime-constructor approach. In my experience strptime is very slow compared to using the constructor and string slicing. Having some default formats to choose from that then call an optimised parser seems like a good idea.

It would speed up parsing for those formats while also allowing the user to be strict with the format, without them having to write their own, potentially slow, parsing function.

It seems worthwhile: with the recent efforts to make file parsing much faster, date parsing really dominates the runtime of read_csv.
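A sketch of the "few default formats, each backed by an optimised parser" idea; the format names and slice positions here are made up purely for illustration:

from datetime import datetime

# Hypothetical registry of named default formats mapped to fast slice-based parsers.
FAST_PARSERS = {
    "ISO_DATE":   lambda s: datetime(int(s[0:4]), int(s[5:7]), int(s[8:10])),   # YYYY-MM-DD
    "DMY_DOTTED": lambda s: datetime(int(s[6:10]), int(s[3:5]), int(s[0:2])),   # DD.MM.YYYY
}

def parse_column(values, fmt_name):
    parse = FAST_PARSERS[fmt_name]       # pick the pre-built parser once
    return [parse(v) for v in values]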

Comment From: qwhelan

Here are the benchmarks I'm getting (50k rows in the format %d/%m/%Y):

In [68]: %timeit read_csv('dates.csv', parse_dates=[0], dayfirst=True)
1 loops, best of 3: 4.6 s per loop

In [69]: %timeit read_csv('dates.csv', parse_dates=[0], date_parser=lambda x: datetime.strptime(x, '%d/%m/%Y'), dayfirst=True)
1 loops, best of 3: 1.18 s per loop

And for passing directly to the constructor:

In [70]: %timeit data = read_csv('dates.csv')
10 loops, best of 3: 21.1 ms per loop

In [71]: %time data.Date = map(lambda y, m, d: datetime(y, m, d), data.Date.map(lambda a: int(a[6:])), data.Date.map(lambda a: int(a[3:5])), data.Date.map(lambda a: int(a[:2])))
CPU times: user 0.35 s, sys: 0.00 s, total: 0.35 s
Wall time: 0.35 s

So there's a ~3x speedup by specifying the format, and another ~3x from manually passing to the constructor.

I don't think the user should be forced to pass a date format to get faster performance, especially when there might be multiple columns with different formats. We're currently assuming that a date column is not of a homogeneous format (by calling parse() independently on every value). All those independent calls probably mean we're seeing a lot of object creation/destruction (i.e., compiled regexes that are used once).

If we have a homogeneity flag, we could call parse() on the first row, get the pattern it settled on (may not be currently available from parse()), and then construct an array of datetimes.
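A rough sketch of that "infer the format once, then reuse it" idea, assuming a homogeneous column; note that guess_datetime_format only became public API much later (pandas.tseries.api, pandas 2.0), so this is an after-the-fact illustration rather than what was available at the time:

from datetime import datetime
from pandas.tseries.api import guess_datetime_format

def parse_homogeneous(values):
    fmt = guess_datetime_format(values[0])        # inspect the first value only
    return [datetime.strptime(v, fmt) for v in values]

parse_homogeneous(["2003-03-27 14:55:00", "2003-08-03 15:20:00"])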

Comment From: wesm

I actually added a format argument to to_datetime -- it might be worth adding to read_csv, maybe. Alternatively, you can parse without using parse_dates and use to_datetime yourself, and it should be reasonably fast.
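Sketched out, that workaround looks something like this (the %d.%m.%Y %H:%M:%S.%f format is taken from the test.csv example at the top of the thread):

import pandas as pd

df = pd.read_csv("test.csv", index_col=0)      # leave the 'a' column as strings
df.index = pd.to_datetime(df.index, format="%d.%m.%Y %H:%M:%S.%f")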

Comment From: dragoljub

A 'format' argument would be a great feature in read_csv(). It would save you from writing the 'fast' parser lambda function with strptime() while also handling the special non-parseable case where you want to just fill parsing errors with np.nan or NaT. I'm assuming this would use datetime's format strings.

Comment From: jreback

the new infer_datetime_format flag solves this problem (though leaving this open because ideally you'd specify the format directly)

Comment From: maxgrenderjones

Small point, but just to give an example of a similar use case that should be handled - I have a csv in which some columns are dates and some are datetimes. I need to be able to provide a different format for each column that I want to parse (as per the OP's suggested specification).

Comment From: jreback

@maxgrenderjones interested in doing a PR for this?

as an aside, try the infer_datetime_format kw (it's in 0.13.1); it should be able to figure out most common formats.
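Applied to the example from the top of the thread, that suggestion would look roughly like this (pandas >= 0.13.1; the keyword guesses the format from the first value and reuses it for the whole column):

import pandas as pd

pd.read_csv("test.csv", index_col=0, parse_dates=True,
            dayfirst=True, infer_datetime_format=True)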

Comment From: jbrockmendel

@mroeschke has this been addressed? Closeable?

Comment From: mroeschke

Doesn't appear so. read_csv has a date_parser argument that accepts a generic parsing function, but I think the ask is for a format argument like pd.to_datetime(..., format='...').

But given that you can also just pass pd.read_csv(..., date_parser=lambda x: pd.to_datetime(x, format='...')) as a suitable alternative, I am not sure how on board I am with adding a separate format argument to read_csv. It'd be cleaner, but it would grow the already large number of kwargs.

Comment From: jorisvandenbossche

Yep, agreed that given the huge number of keywords, we should maybe not add yet another one (although I would personally find a format keyword more useful than the current date_parser)

Comment From: WillAyd

I still think this is useful. Having to provide a lambda in the current state is rather wonky from an end user perspective, and an explicit format keyword could probably enable much better performance on larger datasets

Comment From: alexbk66

Just had this problem in my code -

dayfirst = True,
infer_datetime_format = True

still gets "2010-10-01" for "10/01/2010"

Comment From: jorisvandenbossche

@alexbk66 that seems like a bug (maybe dayfirst is not passed through correctly somewhere), as this combination works fine with to_datetime:

In [81]: pd.to_datetime("10/01/2010", infer_datetime_format=True)                                                                                                                                                   
Out[81]: Timestamp('2010-10-01 00:00:00')

In [82]: pd.to_datetime("10/01/2010", infer_datetime_format=True, dayfirst=True)                                                                                                                                    
Out[82]: Timestamp('2010-01-10 00:00:00')

Comment From: gfyoung

What do you guys think about expanding date_parser to be the following:

  • callable: custom date parser (existing)
  • kwargs: parameters to pass to default date converter (proposed)

This would address the need for greater customized keyword arguments without bloating or requiring wonky lambda arguments. Also, if we allow date_parser to be kwargs, we could then deprecate / remove dayfirst and infer_datetime_format from the signature and require them to be passed as a dict via date_parser (i.e. less API bloat).
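Purely hypothetical, to illustrate the shape of that proposal (this dict form of date_parser is NOT an existing pandas API):

import pandas as pd

# Proposed: pass to_datetime keyword arguments as a dict instead of a callable.
pd.read_csv("test.csv", index_col=0, parse_dates=True,
            date_parser={"dayfirst": True, "infer_datetime_format": True})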

Comment From: jorisvandenbossche

That sounds like a nice idea to me! It's of course not as user-friendly as a date_format keyword, but I would say it's better than a lambda function that calls to_datetime.

Comment From: gfyoung

It's of course not as user-friendly as a date_format keyword, but I would say it's better than a lambda function that calls to_datetime.

Well, who says that we couldn't accept a string (i.e. date format) for date_parser ? 😉 I think that could be a good enhancement after the refactor.

Comment From: WillAyd

Is the only downside to having a dedicated format argument that we already have enough keywords? I still think a dedicated keyword is best here, especially given we already have one in to_datetime

Comment From: gfyoung

API bloat isn't the only reason. Another benefit of this consolidation is that the read_csv API becomes agnostic to the signature of to_datetime. We could have 20 keyword arguments for to_datetime, but we wouldn't have to change the signature of read_csv to accommodate. That's what we're running into now. Adding keyword arguments as has been suggested is only a band-aid IMO.

I also might add that adapting date_parser to perform different functions based on type would be consistent with other keywords such as usecols, skiprows, and parse_dates.

Comment From: MarcoGorelli

Solved as of PDEP4, let's close


In [53]: data = """a,b
    ...: 27.03.2003 14:55:00.000,1
    ...: 03.08.2003 15:20:00.000,2"""

In [54]: read_csv(io.StringIO(data),index_col=0, parse_dates=True)
Out[54]: 
                     b
a                     
2003-03-27 14:55:00  1
2003-03-08 15:20:00  2

Comment From: MarcoGorelli

Sorry, I was too fast

This is wrong, and is a serious bug. It should've been addressed by PDEP4, but guess_datetime_format errors on this input https://github.com/pandas-dev/pandas/issues/50317

I'm working on a fix, but in any case, we shouldn't be falling back to dateutil here


#50319 will fix this, but I'll make a separate PR to remove this fallback and ensure this doesn't happen again

https://github.com/pandas-dev/pandas/blob/7b0d4dd2da6a5565e7fffdeec3a58750640c7f2f/pandas/io/parsers/base_parser.py#L1122-L1134

Comment From: MarcoGorelli

Only took 10 years and 5 days, but this is actually fixed now:

In [1]: data = """a,b
   ...:     27.03.2003 14:55:00.000,1
   ...:     03.08.2003 15:20:00.000,2"""

In [2]: read_csv(io.StringIO(data),index_col=0, parse_dates=True)
<ipython-input-2-1174b5f5763d>:1: UserWarning: Parsing dates in     %d.%m.%Y %H:%M:%S.%f format when dayfirst=False was specified. Pass `dayfirst=True` or specify a format to silence this warning.
  read_csv(io.StringIO(data),index_col=0, parse_dates=True)
Out[2]: 
                     b
a                     
2003-03-27 14:55:00  1
2003-08-03 15:20:00  2

As the saying goes - "in open source, bugs get fixed...eventually"

🥳

Comment From: jorisvandenbossche

While the original use case in the top post (specific example dates that got silently and inconsistently parsed day-first vs month-first) has been resolved (we are now strict, i.e. PDEP-4), the general discussion here was about how we can expose a format keyword (or a more general mechanism to expose to_datetime keywords in read_csv), i.e. the "add date_format to read_csv" part of the title. And I don't think that is resolved yet?

Of course, with the fact that with PDEP-4 we now try to guess the format by default, there is probably less need to pass custom to_datetime keywords. But for the subset where you still need to specify a keyword, having a mechanism to pass it through could still be useful.

For example, for the specific case here, the new warning message says "... or specify a format to silence this warning". But you can't actually specify a format through pd.read_csv

(if we still want to consider this feature request, it might be better to open a new, dedicated issue about it though, and keep this one closed)

Comment From: MarcoGorelli

I read this as "here's a date parsing mistake, and a possible solution would be to add a date_format argument" - as the parsing mistake is addressed, that's why I considered it closed

You can already pass date_parser = lambda x: to_datetime(x, format=fmt), not sure there needs to be a separate argument

(if we still want to consider this feature request, it might be better to open a new, dedicated issue about it though, and keep this one closed)

I think that'd be better, the conversation here has already got quite long and many comments are now outdated - I've opened https://github.com/pandas-dev/pandas/issues/50528

Comment From: MarcoGorelli

I've looked into this further, and I'll take back

You can already pass date_parser = lambda x: to_datetime(x, format=fmt), not sure there needs to be a separate argument

As mentioned in https://github.com/pandas-dev/pandas/issues/50601, date_parser is actually a net negative (if anyone has a counter-example, please do bring it up) - so, I'd be totally onboard with adding date_format, and deprecating date_parser

Let's take the conversation over to #50601 then - closing again, sorry for the open/close ping pong here