Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
df = pd.DataFrame(
    {"date": pd.date_range("2023-01-01", periods=5, unit="s"), "values": 1}
)
# datetime64[s]
print(df["date"].dtype)
grouped = df.groupby("date", as_index=False).sum()
# datetime64[ns]
print(grouped["date"].dtype)
concatted = pd.concat([df, grouped])["date"]
# object
print(concatted.dtype)
Issue Description
I stumbled upon this behavior when test-driving 2.0rc0 with my code base.
Some data frames now contain date columns of type datetime64[s]; I did not investigate yet why. The problem occurs when working with these columns: after a groupby, the dtype changes to datetime64[ns].
When then concatenating a datetime64[s] column to a datetime64[ns] one, things get even worse and I end up with an object column.
Expected Behavior
As I started with a datetime column, I'd expect the datetime dtype not to change magically. If a later pd.merge had not complained, it would have been hard to track down what was going on.
Installed Versions
Comment From: MarcoGorelli
thanks @aberres for the report (and for having tried out the RC)
cc @jbrockmendel (though I'll try to take a look myself this week)
Comment From: aberres
I looked at where the initial datetime64[s] comes from.
Similar to this:
import datetime as dt
import numpy as np
import pandas as pd

d = dt.date.today()
np_array = np.arange(
    np.datetime64(d), np.datetime64(d + dt.timedelta(days=1))
)  # dtype='datetime64[D]'
df = pd.DataFrame({"date": np_array})  # the "date" column gets dtype='datetime64[s]'
Comment From: MarcoGorelli
The supported resolutions for pandas datetimes are now:
- s
- ms
- us
- ns

and so numpy's 'D' resolution gets converted to pandas' 's' resolution. Previously, the only supported resolution in pandas was 'ns' (nanosecond).
Non-nanosecond resolution in pandas is pretty new, and admittedly poorly documented.
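A minimal sketch of that conversion (the output assumes pandas 2.0, where non-nanosecond resolutions are supported):

import numpy as np
import pandas as pd

arr = np.array(["2023-01-01", "2023-01-02"], dtype="datetime64[D]")
ser = pd.Series(arr)
# 'D' is not a supported pandas resolution, so it is upcast to the closest supported unit
print(ser.dtype)  # datetime64[s]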
I'd say that the bug here is that groupby didn't preserve the second resolution.
Comment From: aberres
> I'd say that the bug here is that groupby didn't preserve the second resolution

Yes, definitely. I just wanted to double-check that the appearance of the s resolution is not happening by accident.
Comment From: aberres
FWIW: Now I am explicitly using .astype('timedelta64[ns]') and everything is fine on my side.
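For reference, a minimal sketch of that workaround applied to the reproducible example above (this assumes the intended cast for the date column is to datetime64[ns]):

import pandas as pd

df = pd.DataFrame(
    {"date": pd.date_range("2023-01-01", periods=5, unit="s"), "values": 1}
)
# Normalize to nanosecond resolution up front so groupby and concat see a single dtype.
df["date"] = df["date"].astype("datetime64[ns]")
grouped = df.groupby("date", as_index=False).sum()
print(pd.concat([df, grouped])["date"].dtype)  # datetime64[ns]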
Comment From: MarcoGorelli
I think this is the issue:
(Pdb) self.grouping_vector
<DatetimeArray>
['2023-01-01 00:00:00', '2023-01-02 00:00:00', '2023-01-03 00:00:00',
'2023-01-04 00:00:00', '2023-01-05 00:00:00']
Length: 5, dtype: datetime64[s]
(Pdb) algorithms.factorize(self.grouping_vector, sort=self._sort, use_na_sentinel=self._dropna)
(array([0, 1, 2, 3, 4]), array(['2023-01-01T00:00:00.000000000', '2023-01-02T00:00:00.000000000',
'2023-01-03T00:00:00.000000000', '2023-01-04T00:00:00.000000000',
'2023-01-05T00:00:00.000000000'], dtype='datetime64[ns]'))
from https://github.com/pandas-dev/pandas/blob/4af57562b7d96cabd2383acfd80d8b083c7944f3/pandas/core/groupby/grouper.py#L797-L799
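A standalone sketch of the same loss of resolution, assuming pd.factorize hits the same pandas.core.algorithms.factorize path as the grouper call above (the datetime64[ns] result reflects the 2.0rc0 behavior being discussed):

import pandas as pd

arr = pd.array(pd.date_range("2023-01-01", periods=5, unit="s"))
print(arr.dtype)  # datetime64[s]
codes, uniques = pd.factorize(arr, sort=True)
# On 2.0rc0 the uniques come back as datetime64[ns]: the second resolution is lost here.
print(uniques.dtype)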
Comment From: jbrockmendel
Hah, Marco is moments ahead of me in tracking this down. I had only gotten as far as group_arraylike when I saw his post.
Comment From: jbrockmendel
There are a few potential ways of fixing this, with different cleanliness/quickness tradeoffs.
Things go wrong in algorithms.factorize because we go through a path that casts non-nano to nano (in _ensure_data and _reconstruct_data). My attempts to get rid of that casting (#48670) have turned into yak shaving (most recently #50960, which was broken by the change in the C code to use a capsule, xref #51952, #51951).
Shorter-term, we could patch algorithms.factorize to dispatch on ExtensionArray rather than on is_extension_array_dtype. Update: that seems to raise an AssertionError, yikes.
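For context, a minimal sketch of the distinction between those two checks (pandas 2.0 assumed; the array here is just for illustration):

import pandas as pd
from pandas.api.extensions import ExtensionArray
from pandas.api.types import is_extension_array_dtype

# A non-nano DatetimeArray is an ExtensionArray instance, but its dtype is a
# plain numpy dtype, so a dtype-based check does not recognize it.
arr = pd.array(pd.date_range("2023-01-01", periods=3, unit="s"))
print(type(arr).__name__)               # DatetimeArray
print(isinstance(arr, ExtensionArray))  # True
print(is_extension_array_dtype(arr))    # False: datetime64[s] is a numpy dtype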