Code Sample, a copy-pastable example if possible
import pandas as pd
d = {'start_date': ['37:04.0', '55:20.0', '44:05.3', '43:41.7', '40:35.0', '26:14.4', '57:00.7', '18:53.8', '12:23.2', '48:33.2', '36:38.1', '29:35.7', '56:02.4', '27:41.4', '11:49.9', '17:36.6', '50:44.2', '36:36.1', '21:14.7']}
data = pd.DataFrame(data=d)
data['customer_created_at'] = pd.to_datetime(data['start_date'], errors='coerce')
print (data['customer_created_at'])
Problem description
When coercing the list of time values I expect all of them to return NaT. However, a few of these come back with today's date. I tried yesterday and the returned value was:
0 NaT
1 NaT
2 NaT
3 NaT
4 NaT
5 NaT
6 NaT
7 2018-10-31 18:53:48
8 2018-10-31 12:23:12
9 NaT
10 NaT
11 NaT
12 NaT
13 NaT
14 2018-10-31 11:49:54
15 2018-10-31 17:36:36
16 NaT
17 NaT
18 2018-10-31 21:14:42
Today I get:
0 NaT
1 NaT
2 NaT
3 NaT
4 NaT
5 NaT
6 NaT
7 2018-11-01 18:53:48
8 2018-11-01 12:23:12
9 NaT
10 NaT
11 NaT
12 NaT
13 NaT
14 2018-11-01 11:49:54
15 2018-11-01 17:36:36
16 NaT
17 NaT
18 2018-11-01 21:14:42
Expected Output
All time values should return NaT when coerced
Output of pd.show_versions()
Comment From: mroeschke
I am guessing this is hitting our datutil parser as it will interpret these as times corresponding to the current date:
In [1]: from dateutil import parser
In [2]: parser.parse('18:53:48')
Out[2]: datetime.datetime(2018, 11, 1, 18, 53, 48)
In general we try to stay consistent with dateutil, but I can see how this is a surprising (undocumented) behavior. Could use some discussion on whether we should just document this case (which I am in more favor of) or should coerce these to NaT
.
Comment From: nickwoelk
Thanks for the update. It's strange that it is only happening for certain rows. I've created a workaround on my end using an if statement to prevent the to.datetime from affecting rows without a date.
Comment From: mroeschke
Well the rows that have "time looking" values are getting parsed like this. As you can see, rows where the first value is < 23 (i.e an hour looking value) are getting parsed and the rest to NaT
Comment From: MarcoGorelli
This now throws a warning making it clear that each row will be parsed by dateutil
:
In [29]: import pandas as pd
...:
...: d = {'start_date': ['37:04.0', '55:20.0', '44:05.3', '43:41.7', '40:35.0', '26:14.4', '57:00.7', '18:53.8', '12:23.2', '48:33.2', '36:38.1', '29:35.7', '56:02.4', '27:41.4',
...: '11:49.9', '17:36.6', '50:44.2', '36:36.1', '21:14.7']}
...: data = pd.DataFrame(data=d)
...:
...: data['customer_created_at'] = pd.to_datetime(data['start_date'], errors='coerce')
...:
...: print (data['customer_created_at'])
<ipython-input-29-c6a01a69a58d>:6: UserWarning: Could not infer format, so each element will be parsed individually by `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
data['customer_created_at'] = pd.to_datetime(data['start_date'], errors='coerce')
0 NaT
1 NaT
2 NaT
3 NaT
4 NaT
5 NaT
6 NaT
7 2022-12-17 18:53:48
8 2022-12-17 12:23:12
9 NaT
10 NaT
11 NaT
12 NaT
13 NaT
14 2022-12-17 11:49:54
15 2022-12-17 17:36:36
16 NaT
17 NaT
18 2022-12-17 21:14:42
Name: customer_created_at, dtype: datetime64[ns]
I think we can close then, but thanks for the report!