A small, complete example of the issue
import pandas as pd
df = pd.read_csv('greedy.txt', parse_dates=['EuroDate'])
print(df)
with "greedy.txt" containing
EuroDate
10/9/2016
30/9/2016
Expected Output
EuroDate
0 2016-09-10
1 2016-09-30
Actual Output
EuroDate
0 2016-10-09
1 2016-09-30
So read_csv() has interpreted the first line as a US-format date, then realised that the second line cannot be a US-format date, and so switched to European format. But it has not gone back and re-evaluated the first line in light of the new information. The resulting data is therefore inconsistent, and pandas knows this.
Obviously, I appreciate that:
- CSV files are a disaster.
- This code is asking Pandas to infer the dates.
- Going back and re-evaluating previous data in light of new information is slow and annoying.
- It won't always be possible to do anything except interpret a field as a string if there is inconsistent data.
- Some datasets will include dates in multiple formats (e.g. if humans have entered them free-form), and in those cases it might just be useful for Pandas to take its best guess on a row-by-row basis.
However, I contend that in this case the behaviour is incorrect (because there is a consistent interpretation of the column as a date, which is in fact clear by the second record). Even if some people don't regard this as a bug, I contend that it is at the very least dangerous and likely to cause serious (and sometimes baffling) errors. In my view, it would be much better to go back and reinterpret the data according to the information now available or to fail. If even this is considered too much, at the very least Pandas should issue a prominent warning that it has interpreted different rows in the column using different date formats.
Output of pd.show_versions()
Comment From: chris-b1
xref #12585, this is essentially a symptom of it.
Note that there is a dayfirst= argument on read_csv for this exact case.
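A minimal sketch of that suggestion, assuming the two rows from the report are inlined with io.StringIO instead of being read from greedy.txt (the in-memory buffer is only there to make the snippet self-contained):

```python
import io

import pandas as pd

# Same contents as greedy.txt in the report, held in memory.
data = io.StringIO("EuroDate\n10/9/2016\n30/9/2016\n")

# dayfirst=True asks the parser to read ambiguous dates as DD/MM/YYYY,
# so the first row is no longer guessed as a US-format date.
df = pd.read_csv(data, parse_dates=["EuroDate"], dayfirst=True)
print(df)
```

With the hint in place, both rows should come out as September dates, matching the expected output in the report.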
Comment From: jreback
yes this is a duplicate of that issue.
@njr0 your comments are appreciated and a pull-request is welcome! (For now it is probably best to raise an error if an inconsistency exists and let the user be more explicit.)
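To illustrate the "one format for the whole column, or fail" idea, here is a rough sketch using only the standard library. It is not how pandas parses dates; the function name parse_column_consistently and the two candidate formats are hypothetical, chosen just for this example:

```python
from datetime import datetime

# Hypothetical candidate formats: US (month first), then European (day first).
CANDIDATE_FORMATS = ("%m/%d/%Y", "%d/%m/%Y")


def parse_column_consistently(values, formats=CANDIDATE_FORMATS):
    """Parse every value with a single shared format, or raise ValueError."""
    for fmt in formats:
        try:
            # The whole column must parse under this one format.
            return [datetime.strptime(value, fmt) for value in values]
        except ValueError:
            # At least one value did not fit; try the next candidate.
            continue
    raise ValueError("no single date format fits every value in the column")


dates = parse_column_consistently(["10/9/2016", "30/9/2016"])
print(dates)
```

Here the US format is rejected as soon as it hits 30/9/2016, so the whole column, including the first row, is parsed as day first. A column that no single candidate fits raises an error instead of being parsed row by row.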