Feature Type
- [ ] Adding new functionality to pandas
- [X] Changing existing functionality in pandas
- [ ] Removing existing functionality in pandas
Problem Description
If you run pd.to_datetime on the following Series:
"11-12-2029",
"02-12-2012",
"11-09-2012",
"13-02-2000",
"10-11-2001"
pandas (>= 2.0) will infer the datetime format from the first non-missing example (%m-%d-%Y), try to apply this format to the whole Series, fail on 13-02-2000, and raise an error (before version 2.0, this would silently parse the rows with inconsistent formats). I wish pandas could infer the right format from such a Series, where only one format works for all rows.
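To make the ambiguity concrete, here is a minimal sketch that checks the two candidate formats against the values above. It uses the standard library's datetime.strptime as a stand-in for pandas' parser, and parses_all is a hypothetical helper name, not a pandas function.

```python
from datetime import datetime

values = ["11-12-2029", "02-12-2012", "11-09-2012", "13-02-2000", "10-11-2001"]


def parses_all(values, fmt):
    """Return True if every string in `values` matches `fmt` exactly."""
    for value in values:
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            return False
    return True


# Month-first fails because "13-02-2000" would need a 13th month.
month_first_ok = parses_all(values, "%m-%d-%Y")
# Day-first is valid for every row.
day_first_ok = parses_all(values, "%d-%m-%Y")
```

Only the day-first format fits all five rows, which is the format this request would like pandas to infer.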
Feature Description
Pseudo code
If dayfirst=True and dayfirst=False don't give the same format for guess_datetime_format on the first non-missing example (i.e. both work):
- Try both formats on the Series (probably on a random subset for speed).
- If one works for all rows, return this format.
- If both work, trust the dayfirst parameter (and maybe raise a warning).
- If neither works and errors="raise", raise an error. If errors="coerce" or errors="ignore", one could either trust the dayfirst parameter, or see which dayfirst value leads to the smallest number of non-parsed values.
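The pseudo code above could be sketched roughly as follows. This is an illustration under assumptions, not pandas code: choose_format and _count_parsed are hypothetical names, the two candidate formats are passed in rather than guessed, the values are assumed to be non-null strings, and datetime.strptime stands in for pandas' strict parser. For errors="coerce" or errors="ignore" it takes the second option (fewest non-parsed values) and falls back to dayfirst on a tie.

```python
import warnings
from datetime import datetime


def _count_parsed(values, fmt):
    """Count how many strings in `values` match `fmt` exactly."""
    count = 0
    for value in values:
        try:
            datetime.strptime(value, fmt)
            count += 1
        except ValueError:
            pass
    return count


def choose_format(values, dayfirst_fmt, monthfirst_fmt, dayfirst=False, errors="raise"):
    """Hypothetical helper: pick one of two candidate formats for `values`."""
    preferred = dayfirst_fmt if dayfirst else monthfirst_fmt
    if dayfirst_fmt == monthfirst_fmt:
        return preferred  # both guesses agree, nothing to disambiguate

    total = len(values)
    n_day = _count_parsed(values, dayfirst_fmt)
    n_month = _count_parsed(values, monthfirst_fmt)

    if n_day == total and n_month == total:
        # Still ambiguous: trust the dayfirst parameter, but tell the user.
        warnings.warn("Both formats parse every row; using the dayfirst setting.")
        return preferred
    if n_day == total:
        return dayfirst_fmt
    if n_month == total:
        return monthfirst_fmt

    if errors == "raise":
        raise ValueError("No single candidate format parses every row.")
    # errors="coerce" or "ignore": keep the format with the fewest failures.
    if n_day == n_month:
        return preferred
    return dayfirst_fmt if n_day > n_month else monthfirst_fmt
```

With the Series from the problem description, choose_format(values, "%d-%m-%Y", "%m-%d-%Y") returns the day-first format even though dayfirst defaults to False, because it is the only candidate that parses every row.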
Implementation
Change function _guess_datetime_format_for_array (in pandas.core.tools.datetimes) so that it tries both dayfirst=True and dayfirst=False on the first non-null example. In the same function, if the two options give different formats, try array_strptime with both formats on a random subset of the array (100?) with strict error handling, and check that one of the attempts doesn't fail.
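The random-subset check might look something like the sketch below. It does not call pandas internals: datetime.strptime stands in for array_strptime, format_fits_sample is a hypothetical name, and the sample_size and seed parameters are assumptions added for illustration.

```python
import random
from datetime import datetime


def format_fits_sample(values, fmt, sample_size=100, seed=0):
    """Strictly test `fmt` on a random subset of the non-null values."""
    non_null = [v for v in values if v is not None]
    if len(non_null) <= sample_size:
        sample = non_null  # small input: check everything
    else:
        sample = random.Random(seed).sample(non_null, sample_size)
    for value in sample:
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            return False  # strict: a single failure rejects the format
    return True
```

One trade-off to keep in mind: if the only disambiguating rows fall outside the sample, both formats pass the check and the full parse can still fail later.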
Alternative Solutions
I don't know.
Additional Context
No response
Comment From: MarcoGorelli
thanks @LeoGrin for the suggestion
yeah I think we could improve the inference, for example by trying the first n non-null rows or taking a random sample, and then taking a majority vote
would you be interested in trying this out and submitting a PR?
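For reference, one possible reading of the majority-vote idea, as a rough sketch only: each of the first n non-null rows votes for every candidate format that parses it, and the format with the most votes wins. majority_format is a hypothetical name, the candidate formats are supplied by the caller, and datetime.strptime stands in for pandas' parser.

```python
from collections import Counter
from datetime import datetime


def majority_format(values, candidate_formats, n=10):
    """Return the candidate that parses the most of the first `n` non-null rows."""
    first_rows = [v for v in values if v is not None][:n]
    votes = Counter()
    for value in first_rows:
        for fmt in candidate_formats:
            try:
                datetime.strptime(value, fmt)
            except ValueError:
                continue
            votes[fmt] += 1
    if not votes:
        return None  # no candidate parsed any row
    # Ties go to the candidate listed first.
    return max(candidate_formats, key=lambda fmt: votes[fmt])
```

On the Series from the issue, the month-first format gets four votes and the day-first format gets five, so the day-first format is chosen.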
Comment From: LeoGrin
Thanks for the feedback! Yes I would :)
Comment From: LeoGrin
take