Should #49737 be reverted?
This was brought up in the dev call yesterday. It's great that non-nanosecond support is coming along, but it's still not complete and there are still some bugs. Hence, it probably shouldn't happen by default yet, but rather should be opt-in.
Arguably:
In [2]: Timestamp(np.datetime64('2000', 's')).unit
Out[2]: 's'
counts as opt-in, because the user had to go through a numpy object which already had a set resolution. But perhaps that's not the case for
In [7]: Timestamp('2000').unit
Out[7]: 's'
where users would have been used to receiving 'ns'.
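(Illustrative sketch, not pandas internals: the point is that a numpy datetime64 carries its resolution in its dtype, so the unit is explicit in the input, whereas a plain string carries no resolution and any unit would have to be inferred.)

```python
import numpy as np

# A np.datetime64 value carries its resolution as part of its dtype,
# so constructing a Timestamp from it arguably counts as opting in
# to that resolution.
dt_s = np.datetime64("2000", "s")
dt_ns = np.datetime64("2000", "ns")
print(dt_s.dtype)   # datetime64[s]
print(dt_ns.dtype)  # datetime64[ns]

# A plain string like '2000' has no attached resolution, so whatever
# unit pandas picks for Timestamp('2000') is inference, not user choice.
```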
This would be fine once non-nano support is complete, but currently it would introduce bugs - e.g. one involving a fairly common operation that I came across this week: #50656. I think @jorisvandenbossche also had an example of this breaking the arrow CI (though from a quick look I couldn't find it - could you please share it?)
Comment From: phofl
cc @jbrockmendel
Comment From: jbrockmendel
Is it the Timestamp object or the Series etc having non-nano dtype causing troubles?
If we revert inference on strings, do we also revert support for strings that are out of bounds for nanos? Or do inference only in those cases? IIRC Joris and I agreed a while back that the latter was undesirable.
xref #49076 which might alleviate some of the changiness for the Timestamp object. (not a First Issue; I have a branch that stalled out)
Comment From: jorisvandenbossche
The example from the pyarrow CI where I noticed this is the change in the return value of pd.Timestamp("2022-01-01").value, which is now a number of seconds instead of a number of nanoseconds, i.e. the https://github.com/pandas-dev/pandas/issues/49076 issue you linked.
(This was only in a test checking the expected result, so that is easy to fix on our side (although we also use .value in actual conversion code; I have to check whether that is affected), but in general this is a breaking change.)
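(For context, a stdlib-only sketch, assuming UTC, of the historical contract: .value has meant the integer number of nanoseconds since the Unix epoch, so downstream code that relies on that would silently be off by a factor of a billion if the unit changed to seconds.)

```python
from datetime import datetime, timezone

# .value historically meant nanoseconds since the Unix epoch; conversion
# code that treats it that way breaks if it silently becomes seconds.
ts = datetime(2022, 1, 1, tzinfo=timezone.utc)
expected_s = int(ts.timestamp())
expected_ns = expected_s * 10**9
print(expected_s)   # 1640995200
print(expected_ns)  # 1640995200000000000

# The two differ by a factor of 10**9, which is why a silent change of
# unit in .value is breaking for downstream consumers like pyarrow.
```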
Is it the Timestamp object or the Series etc having non-nano dtype causing troubles?
I think both. Although I suppose the only aspect of the Timestamp object that might affect users is the .value attribute? (https://github.com/pandas-dev/pandas/issues/49076)
But, if this causes your Series to get a non-nano dtype, that does have a wider impact.
Comment From: MarcoGorelli
I only noticed issues when it's part of a DatetimeIndex - and that wouldn't be affected; its reso would still be 'ns':
In [2]: ts = Timestamp('2000')
In [3]: ts.unit
Out[3]: 's'
In [4]: to_datetime([ts])
Out[4]: DatetimeIndex(['2000-01-01'], dtype='datetime64[ns]', freq=None)
If the only issue with Timestamp is .value, then I'm not sure that this is a strong enough reason to revert - the suggestion in https://github.com/pandas-dev/pandas/issues/49076#issuecomment-1279265403 looks fine
Comment From: jorisvandenbossche
But, if this causes your Series to get a non-nano dtype, that does have a wider impact.
Ah, I was assuming that we would use the unit of the Timestamp object to determine the dtype when creating a Series or column. But if the above is not the case, then it's indeed only .value that has user-facing impact at the moment.
Comment From: jbrockmendel
Ah, I was assuming that we would use the unit of the Timestamp object to determine the dtype when creating a Series or column
In the relevant cases (object dtype or sequence-without-dtype) we go through array_to_datetime, which is the last big place to implement non-nano support. Not coincidentally, it is the most difficult. It is straightforward to add a reso keyword to array_to_datetime, but doing inference there is tough because inferred resos might not be homogeneous.
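(A minimal sketch of the heterogeneity problem, using numpy's own string-precision inference rather than pandas internals; the unit-ordering list and casting strategy here are illustrative assumptions, not what array_to_datetime would actually do.)

```python
import numpy as np

# Two strings with different implied precision: second vs millisecond.
values = ["2000-01-01 00:00:00", "2000-01-01 00:00:00.123"]

# np.datetime64 infers a unit from each string's precision, so the
# per-element inferred resolutions are not homogeneous.
inferred = [np.datetime64(v) for v in values]
units = [np.datetime_data(v.dtype)[0] for v in inferred]
print(units)  # ['s', 'ms']

# One possible strategy: cast everything to the finest unit present
# so the resulting array has a single, homogeneous resolution.
order = ["s", "ms", "us", "ns"]
finest = max(units, key=order.index)
homogeneous = np.array(values, dtype=f"datetime64[{finest}]")
print(homogeneous.dtype)  # datetime64[ms]
```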
Comment From: MarcoGorelli
resolved in https://github.com/pandas-dev/pandas/issues/49076