Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [ ] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

In [1]: import pandas as pd

In [2]: pdf = pd.DataFrame({"a": [32424324, None, None], "b": [24242342, 3432434234, 23434234]})

In [3]: pdf = pdf.astype("datetime64[s]")

In [4]: pdf
Out[4]: 
                    a                   b
0 1971-01-11 06:45:24 1970-10-08 13:59:02
1                 NaT 2078-10-08 05:57:14
2                 NaT 1970-09-29 05:30:34

In [5]: pdf.dtypes
Out[5]: 
a    datetime64[s]
b    datetime64[s]
dtype: object

In [6]: pdf['a'] - pdf['b']
Out[6]: 
0   94 days 16:46:22
1                NaT
2                NaT
dtype: timedelta64[s]

In [7]: def udf(row):
   ...:     return operator.sub(row['a'], row['b'])
   ...: 

In [8]: import operator

In [10]: result = pdf.apply(udf, axis=1)

In [11]: result
Out[11]: 
0   94 days 16:46:22
1                NaT
2                NaT
dtype: timedelta64[ns]          # BUG: Should be `timedelta64[s]`, to be consistent with the rest of the calls

In [13]: operator.sub(pdf['a'], pdf['b'])
Out[13]: 
0   94 days 16:46:22
1                NaT
2                NaT
dtype: timedelta64[s]

In [14]: pdf['a'] - pdf['b']
Out[14]: 
0   94 days 16:46:22
1                NaT
2                NaT
dtype: timedelta64[s]

In [15]: pdf['a'][0] - pdf['b'][0]
Out[15]: Timedelta('94 days 16:46:22')

In [18]: (pdf['a'][0] - pdf['b'][0]).unit
Out[18]: 's'

Issue Description

A binop performed inside a UDF passed to DataFrame.apply seems to always return an ns dtype instead of the dtype of the inputs (s in the above case).

Expected Behavior

In [14]: pdf.apply(udf, axis=1)
Out[14]: 
0   94 days 16:46:22
1                NaT
2                NaT
dtype: timedelta64[s]

Installed Versions

INSTALLED VERSIONS
------------------
commit : c2a7f1ae753737e589617ebaaff673070036d653
python : 3.10.10.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-76-generic
Version : #86-Ubuntu SMP Fri Jan 17 17:24:28 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.0.0rc1
numpy : 1.23.5
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.6.1
pip : 23.0.1
Cython : 0.29.33
pytest : 7.2.2
hypothesis : 6.70.1
sphinx : 5.3.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.12.0
pandas_datareader: None
bs4 : 4.12.0
bottleneck : None
brotli :
fastparquet : None
fsspec : 2023.3.0
gcsfs : None
matplotlib : None
numba : 0.56.4
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : 2023.3.0
scipy : 1.10.1
snappy :
sqlalchemy : 1.4.46
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

Comment From: mroeschke

It appears the core issue here is that apply returns scalars internally, and when the result is constructed, the units of those scalars are not taken into account. This is similar to:

In [6]: ser = pd.Series([pd.Timedelta(1).as_unit("s")])

In [7]: ser
Out[7]: 
0   0 days
dtype: timedelta64[ns]

@jbrockmendel should the above be able to preserve the "s" unit from the scalar and return timedelta64[s] type?

Comment From: jbrockmendel

Yes, for the pd.Series constructor example we need to get unit inference working in array_to_datetime/maybe_convert_objects/infer_dtype. The UDF case might be something we can handle independently inside maybe_cast_pointwise_result (which I'm guessing the OP case goes through; I haven't checked).

Comment From: mroeschke

The OP case goes through FrameApply.apply_series_generator to compose a dict of results, which is then passed to FrameApply.wrap_results to construct a Series from that dict, so I think unit inference would be the way to solve this case.
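
To make the path described above easier to follow, here is a minimal sketch that mimics it by hand. It is not the pandas implementation: the loop and the names row_results, rebuilt and explicit are illustrative stand-ins for "collect one scalar per row in a dict, then build a Series from it". It assumes pandas 2.0 or later, and the frame is built from NumPy datetime64[s] arrays holding the same values as the report.

```python
import numpy as np
import pandas as pd

# Same values as the report, built directly as second-resolution arrays.
a = np.array(["1971-01-11T06:45:24", "NaT", "NaT"], dtype="datetime64[s]")
b = np.array(
    ["1970-10-08T13:59:02", "2078-10-08T05:57:14", "1970-09-29T05:30:34"],
    dtype="datetime64[s]",
)
pdf = pd.DataFrame({"a": a, "b": b})

# Vectorized subtraction keeps the unit of its inputs.
vectorized = pdf["a"] - pdf["b"]
print(vectorized.dtype)

# Hand-rolled stand-in for the row-wise path: one scalar per row, keyed by
# row label, followed by a Series built from the dict.
row_results = {}
for label in pdf.index:
    row = pdf.loc[label]
    row_results[label] = row["a"] - row["b"]

# The non-null scalar still carries its unit. getattr is used defensively
# because the NaT rows may not expose a unit attribute.
print([getattr(value, "unit", None) for value in row_results.values()])

# Whether this dtype matches the scalar unit depends on the constructor's
# unit inference in the installed pandas version.
rebuilt = pd.Series(row_results)
print(rebuilt.dtype)

# Possible stopgap: cast explicitly to the resolution you expect.
explicit = rebuilt.astype("timedelta64[s]")
print(explicit.dtype)
```

The explicit astype at the end is only a workaround on the caller's side; it does not address the inference question discussed in this thread.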

Comment From: luke396

take

Comment From: mroeschke

It looks like maybe_convert_objects needs to be patched for this case.

A few open questions:

  1. Should Series([unsupported_unit]) return the closest supported unit (second) for day and lower resolutions and raise for picosecond and higher resolutions?
  2. Should Series([supported_unit1, supported_unit2]) resolve to the highest supported resolution or raise?
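
As an illustration of the "resolve to the highest supported resolution" option in question 2, here is a small sketch. The helper finest_unit and the SUPPORTED_UNITS list are hypothetical and not part of pandas. The sketch reads "highest resolution" as "finest unit", which is an assumption, and it only shows what that choice would mean for scalars, not how maybe_convert_objects would implement it.

```python
import pandas as pd

# Ordered from coarsest to finest; the four resolutions pandas 2.x supports.
SUPPORTED_UNITS = ["s", "ms", "us", "ns"]


def finest_unit(scalars):
    """Hypothetical helper: return the finest unit among the given scalars."""
    units = [scalar.unit for scalar in scalars]
    unknown = set(units) - set(SUPPORTED_UNITS)
    if unknown:
        raise ValueError(f"unsupported units: {sorted(unknown)}")
    # A later position in SUPPORTED_UNITS means a finer resolution.
    return max(units, key=SUPPORTED_UNITS.index)


td_s = pd.Timedelta(seconds=1).as_unit("s")
td_ms = pd.Timedelta(milliseconds=1500).as_unit("ms")

target = finest_unit([td_s, td_ms])
# Converting to the finer unit loses no information.
aligned = [td.as_unit(target) for td in (td_s, td_ms)]
print(target, [td.unit for td in aligned])
```

Resolving to the finest unit never truncates values, at the cost of a narrower representable range; raising instead would avoid silently changing the unit.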

Comment From: luke396

I've been working on this for a while now, but unfortunately I haven't made much progress yet. (This is more complex than I thought; time handling and its formats seem to have a lot of significant issues that will need fixing in the future.)

On a positive note, I did find something interesting while working on this.

pd.Timedelta(1).as_unit('s').unit
Out[4]: 's'
pd.Timedelta(1, unit="s").unit
Out[5]: 'ns'

The question that arises is whether this is really what we want, or if it's a part of the issue we mentioned earlier?
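
For context on the two calls above, here is a short sketch of how they differ in value as well as in unit. It relies on the documented meaning of the unit keyword, which describes how an integer input is interpreted (a bare integer defaults to nanoseconds), whereas as_unit converts the resolution of an existing Timedelta. The variable names are illustrative, and the final print is left without an expected output because the stored unit of the second value may depend on the pandas version.

```python
import pandas as pd

# One nanosecond, converted to second resolution (rounds down to zero).
one_ns_as_s = pd.Timedelta(1).as_unit("s")

# The integer 1 interpreted as one second.
one_second = pd.Timedelta(1, unit="s")

print(one_ns_as_s == pd.Timedelta(0))
print(one_second == pd.Timedelta(seconds=1))

# The resolution each value is stored with.
print(one_ns_as_s.unit, one_second.unit)
```

So the two expressions are not two spellings of the same thing: the first changes how an existing value is stored, and the second changes how the input number is read.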