A `date_format` keyword could take the format, a dict mapping columns to formats, or a list of formats (and could then obviate the need for `parse_dates`)
Sometimes months and days get mixed up.
E.g. test.csv:
a,b
27.03.2003 14:55:00.000,1
03.08.2003 15:20:00.000,2
read_csv("/home/john/Documents/test.csv",index_col=0, parse_dates=True)
b
a
2003-03-27 14:55:00 1
2003-03-08 15:20:00 2
There doesn't appear to be any continuity in the date parsing over the rows. As well as meaning things can easily get switched around, this makes date parsing VERY slow. Once you know the format, using the datetime constructor with string slicing as a parser makes read_csv 20x faster on my machine.
I think there need to be some more parameters for specifying date formats, seeing as, in the general case, a string of dates can be ambiguous (see above).
A possible approach: Have a few default formats to choose from, as well as a more general format string approach. Obviously the defaults could use the datetime constructor with string slicing, which is very fast.
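Such a slicing-based default could look roughly like this (a sketch for the dd.mm.yyyy format from the example above; the function name is made up):

```python
from datetime import datetime

def parse_fixed(s):
    # Fixed-position parser for "dd.mm.yyyy HH:MM:SS.mmm".
    # No format inference per value, so it is fast and unambiguous.
    return datetime(
        int(s[6:10]),          # year
        int(s[3:5]),           # month
        int(s[0:2]),           # day
        int(s[11:13]),         # hour
        int(s[14:16]),         # minute
        int(s[17:19]),         # second
        int(s[20:23]) * 1000,  # milliseconds -> microseconds
    )
```

For example, `parse_fixed("27.03.2003 14:55:00.000")` gives `datetime(2003, 3, 27, 14, 55)` with no chance of the day and month being swapped.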
Perhaps have a dayfirst and yearfirst flag that gets passed to dateutil.parser.parse to solve ambiguities when using automatic parsing.
Comment From: qwhelan
Try using the `dayfirst` parameter (new in v0.10.0):
In [4]: read_csv('test.csv', index_col=0, parse_dates=True, dayfirst=True)
Out[4]:
b
a
2003-03-27 14:55:00 1
2003-08-03 15:20:00 2
If you want to specify the datetime format you can use the `date_parser` parameter:
In [5]: read_csv('test.csv', index_col=0, date_parser=lambda x: datetime.strptime(x, '%d.%m.%Y %H:%M:%S.%f'))
Out[5]:
b
a
2003-03-27 14:55:00 1
2003-08-03 15:20:00 2
Comment From: John-Colvin
Sorry, my bad for not looking at the API reference closely enough to spot the `dayfirst` flag.
The `date_parser` parameter is what I was using to test the datetime constructor. In my experience strptime is very slow compared to using the constructor with string slicing. Having some default formats to choose from that then call an optimised parser seems a good idea: it would speed up parsing for those formats while also allowing the user to be strict about the format, without them having to write their own, potentially slow, parsing function.
It seems worthwhile: with the recent efforts to make file parsing much faster, date parsing now dominates the runtime of read_csv.
Comment From: qwhelan
Here are the benchmarks I'm getting (50k rows in the format %d/%m/%Y):
In [68]: %timeit read_csv('dates.csv', parse_dates=[0], dayfirst=True)
1 loops, best of 3: 4.6 s per loop
In [69]: %timeit read_csv('dates.csv', parse_dates=[0], date_parser=lambda x: datetime.strptime(x, '%d/%m/%Y'), dayfirst=True)
1 loops, best of 3: 1.18 s per loop
And for passing directly to the constructor:
In [70]: %timeit data = read_csv('dates.csv')
10 loops, best of 3: 21.1 ms per loop
In [71]: %time data.Date = map(lambda x, y, z: datetime(x, y, z), data.Date.map(lambda a: int(a[6:])), data.Date.map(lambda a: int(a[3:5])), data.Date.map(lambda a: int(a[:2])))
CPU times: user 0.35 s, sys: 0.00 s, total: 0.35 s
Wall time: 0.35 s
So there's a ~3x speedup by specifying the format, and another ~3x from manually passing to the constructor.
I don't think the user should be forced to pass a date format to get faster performance, especially when there might be multiple columns with different formats. We're currently assuming that a date column is not of a homogeneous format (by calling `parse()` independently on every value). All those independent calls probably mean we're seeing a lot of object creation/destruction (i.e. compiled regexes that are used once).
If we had a homogeneity flag, we could call `parse()` on the first row, get the pattern it settled on (which may not currently be retrievable from `parse()`), and then construct an array of datetimes.
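That homogeneity idea could be sketched roughly as follows (the candidate formats and the helper name are invented for illustration):

```python
from datetime import datetime

CANDIDATE_FORMATS = ("%d.%m.%Y %H:%M:%S.%f", "%d/%m/%Y", "%Y-%m-%d")

def parse_homogeneous(values):
    # Settle on a format using the first value only, then apply that
    # one format to the whole column instead of re-inferring per value.
    for fmt in CANDIDATE_FORMATS:
        try:
            datetime.strptime(values[0], fmt)
        except ValueError:
            continue
        return [datetime.strptime(v, fmt) for v in values]
    raise ValueError("no candidate format matched %r" % (values[0],))
```

This trades per-value flexibility for speed: inference happens once, and a mixed-format column raises instead of silently switching interpretations between rows.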
Comment From: wesm
I actually added a `format` argument to `to_datetime` -- it might be worth adding to `read_csv`, maybe. Alternately you can parse without using `parse_dates` and use `to_datetime` yourself, and it should be reasonably fast.
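That suggestion (read the column as strings, then convert with `to_datetime` and an explicit format) looks like this on the example data from the top post:

```python
import io

import pandas as pd

data = "a,b\n27.03.2003 14:55:00.000,1\n03.08.2003 15:20:00.000,2\n"
df = pd.read_csv(io.StringIO(data))
# Convert after the fact with an explicit, unambiguous format.
df["a"] = pd.to_datetime(df["a"], format="%d.%m.%Y %H:%M:%S.%f")
```

With the format pinned down, both rows parse day-first, so the second row comes out as August 3rd rather than March 8th.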
Comment From: dragoljub
A 'format' argument would be a great feature in read_csv(). It would save you from writing the 'fast' parser lambda with strptime() while also handling the non-parseable case where you want to just fill parsing errors with np.nan or NaT. I'm assuming this would use datetime's format strings.
Comment From: jreback
the new `infer_datetime_format` flag solves this problem (though leaving this open, because ideally you'd be able to specify the format directly)
Comment From: maxgrenderjones
Small point, but just to give an example of a similar use case that should be handled - I have a csv in which some columns are dates and some are datetimes. I need to be able to provide a different format for each column that I want to parse (as per the OP's suggested specification).
Comment From: jreback
@maxgrenderjones interested in doing a PR for this?
As an aside, try the `infer_datetime_format` keyword (it's in 0.13.1); it should be able to figure out most common formats.
Comment From: jbrockmendel
@mroeschke has this been addressed? Closeable?
Comment From: mroeschke
Doesn't appear so. `read_csv` has a `date_parser` argument that accepts a generic parsing function, but I think the ask is for a `format` argument like `pd.to_datetime(..., format='...')`.
But given that you can also just pass `pd.read_csv(..., date_parser=lambda x: pd.to_datetime(x, format='...'))` as a suitable alternative, I am not sure how on board I am with adding a separate `format` argument to `read_csv`. It would be cleaner, but it would grow the already large number of kwargs.
Comment From: jorisvandenbossche
Yep, agreed that given the huge number of keywords, we should maybe not add yet another one (although I would personally find a `format` keyword more useful than the current `date_parser`).
Comment From: WillAyd
I still think this is useful. Having to provide a lambda in the current state is rather wonky from an end-user perspective, and an explicit format keyword could probably enable much better performance on larger datasets.
Comment From: alexbk66
Just had this problem in my code: with `dayfirst=True` and `infer_datetime_format=True`, I still get "2010-10-01" for "10/01/2010".
Comment From: jorisvandenbossche
@alexbk66 that seems like a bug (maybe `dayfirst` is not passed through correctly somewhere), as this combination works fine with `to_datetime`:
In [81]: pd.to_datetime("10/01/2010", infer_datetime_format=True)
Out[81]: Timestamp('2010-10-01 00:00:00')
In [82]: pd.to_datetime("10/01/2010", infer_datetime_format=True, dayfirst=True)
Out[82]: Timestamp('2010-01-10 00:00:00')
Comment From: gfyoung
What do you guys think about expanding `date_parser` to accept the following:
- callable: custom date parser (existing)
- kwargs: parameters to pass to the default date converter (proposed)
This would address the need for greater keyword customization without bloating the signature or requiring wonky lambda arguments. Also, if we allow `date_parser` to be kwargs, we could then deprecate / remove `dayfirst` and `infer_datetime_format` from the signature and require them to be passed as a dict via `date_parser` (i.e. less API bloat).
Comment From: jorisvandenbossche
That sounds like a nice idea to me!
It's of course not as user-friendly as a `date_format` keyword, but I would say better than a lambda function that calls `to_datetime`.
Comment From: gfyoung
It's of course not as user-friendly as a date_format keyword, but I would say better than a lambda function that calls to_datetime.
Well, who says that we couldn't accept a string (i.e. a date format) for `date_parser`? 😉 I think that could be a good enhancement after the refactor.
Comment From: WillAyd
Is the only downside to having a dedicated `format` argument that we already have enough keywords? I still think a dedicated keyword is best here, especially given we already have one in `to_datetime`.
Comment From: gfyoung
API bloat isn't the only reason. Another benefit of this consolidation is that the `read_csv` API becomes agnostic to the signature of `to_datetime`. We could have 20 keyword arguments for `to_datetime`, but we wouldn't have to change the signature of `read_csv` to accommodate them. That's what we're running into now. Adding keyword arguments as has been suggested is only a band-aid IMO.
I might also add that adapting `date_parser` to perform different functions based on type would be consistent with other keywords such as `usecols`, `skiprows`, and `parse_dates`.
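For reference, `usecols` is one keyword that already dispatches on type in this way (list vs. callable):

```python
import io

import pandas as pd

data = "a,b,c\n1,2,3\n4,5,6\n"
# A list selects columns by name...
by_list = pd.read_csv(io.StringIO(data), usecols=["a", "b"])
# ...while a callable is applied to each column name as a predicate.
by_func = pd.read_csv(io.StringIO(data), usecols=lambda name: name != "c")
```

Both calls select the same two columns; the keyword's meaning changes with the type of the argument.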
Comment From: MarcoGorelli
Solved as of PDEP4, let's close
In [53]: data = """a,b
...: 27.03.2003 14:55:00.000,1
...: 03.08.2003 15:20:00.000,2"""
In [54]: read_csv(io.StringIO(data),index_col=0, parse_dates=True)
Out[54]:
b
a
2003-03-27 14:55:00 1
2003-03-08 15:20:00 2
Comment From: MarcoGorelli
Sorry, I was too fast.
This is wrong, and is a serious bug. It should've been addressed by PDEP4, but `guess_datetime_format` errors on this input: https://github.com/pandas-dev/pandas/issues/50317
I'm working on a fix, but in any case we shouldn't be falling back to dateutil here.
#50319 will fix this, but I'll make a separate PR to remove this fallback and ensure this doesn't happen again:
https://github.com/pandas-dev/pandas/blob/7b0d4dd2da6a5565e7fffdeec3a58750640c7f2f/pandas/io/parsers/base_parser.py#L1122-L1134
Comment From: MarcoGorelli
Only took 10 years and 5 days, but this is actually fixed now:
In [1]: data = """a,b
...: 27.03.2003 14:55:00.000,1
...: 03.08.2003 15:20:00.000,2"""
In [2]: read_csv(io.StringIO(data),index_col=0, parse_dates=True)
<ipython-input-2-1174b5f5763d>:1: UserWarning: Parsing dates in %d.%m.%Y %H:%M:%S.%f format when dayfirst=False was specified. Pass `dayfirst=True` or specify a format to silence this warning.
read_csv(io.StringIO(data),index_col=0, parse_dates=True)
Out[2]:
b
a
2003-03-27 14:55:00 1
2003-08-03 15:20:00 2
As the saying goes - "in open source, bugs get fixed...eventually"
🥳
Comment From: jorisvandenbossche
While the original use case in the top post (specific example dates that got silently, inconsistently parsed day-first vs. month-first) has been resolved (we are now strict, i.e. PDEP-4), the general discussion here was about how we can expose a `format` keyword (or a more general mechanism to expose `to_datetime` keywords in `read_csv`), i.e. the "add date_format to read_csv" part of the title. And I don't think that is resolved yet?
Of course, given that with PDEP-4 we now try to guess the format by default, there is probably less need to pass custom `to_datetime` keywords. But for the subset of cases where you still need to specify a keyword, having a mechanism to pass it through could still be useful.
For example, for the specific case here, the new warning message says "... or specify a format to silence this warning". But you can't actually specify a format through `pd.read_csv`.
(If we still want to consider this feature request, it might be better to open a new, dedicated issue about it, though, and keep this one closed.)
Comment From: MarcoGorelli
I read this as "here's a date parsing mistake, and a possible solution would be to add a `date_format` argument" - as the parsing mistake is addressed, that's why I considered it closed.
You can already pass `date_parser=lambda x: to_datetime(x, format=fmt)`, so I'm not sure there needs to be a separate argument.
(if we still want to consider this feature request, it might be better to open a new, dedicated issue about it though, and keep this one closed)
I think that'd be better; the conversation here has already got quite long and many comments are now outdated - I've opened https://github.com/pandas-dev/pandas/issues/50528
Comment From: MarcoGorelli
I've looked into this further, and I'll take back:
You can already pass date_parser = lambda x: to_datetime(x, format=fmt), not sure there needs to be a separate argument
As mentioned in https://github.com/pandas-dev/pandas/issues/50601, `date_parser` is actually a net negative (if anyone has a counter-example, please do bring it up) - so I'd be totally on board with adding `date_format` and deprecating `date_parser`.
Let's take the conversation over to #50601 then - closing again, sorry for the open/close ping-pong here