Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
>>> import pandas as pd
>>> pd.Timestamp(str(pd.Timestamp("0001-01-01")))
Timestamp('2001-01-01 00:00:00')
>>> pd.Timestamp("0001-01-01").isoformat()
'1-01-01T00:00:00'
Issue Description
With https://github.com/pandas-dev/pandas/pull/49737 it is now possible to create Timestamp objects without four-digit years. Their default and ISO-formatted string representations do not zero-pad the year, which can lead to ambiguous date strings (https://github.com/pandas-dev/pandas/issues/37071) and improper round-tripping between str and Timestamp, as illustrated in the example.
cc: @keewis
Expected Behavior
>>> import pandas as pd
>>> pd.Timestamp(str(pd.Timestamp("0001-01-01")))
Timestamp('0001-01-01 00:00:00')
>>> pd.Timestamp("0001-01-01").isoformat()
'0001-01-01T00:00:00'
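For comparison, here is a small sketch of the same round trip using only the standard library's datetime, which zero-pads the year to four digits. It is meant as a reference point for the expected behavior, not as a statement about how pandas formats Timestamp internally; the variable names are just for illustration.

```python
from datetime import datetime

# Year 1, the same date used in the example above.
first_day = datetime(1, 1, 1)

# Both the ISO form and the default str() form zero-pad the year.
iso_text = first_day.isoformat()
str_text = str(first_day)

# Parsing either string gives back the original value.
from_iso = datetime.fromisoformat(iso_text)
from_str = datetime.fromisoformat(str_text)

print(iso_text)  # 0001-01-01T00:00:00
print(str_text)  # 0001-01-01 00:00:00
print(from_iso == first_day and from_str == first_day)  # True
```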
Installed Versions
Comment From: keewis
fixing this seems pretty easy: we'd only have to change https://github.com/pandas-dev/pandas/blob/25c1942c394ecbd3281240382728006e50bc28dd/pandas/_libs/tslibs/timestamps.pyx#L1018 to
base = f"{self.year:04d}-" + base.split("-", 1)[1]
but that causes pandas/tests/scalar/timestamp/test_constructors.py::TestTimestampConstructors::test_out_of_bounds_string_consistency to fail, and I have no idea how to fix that.
Edit: actually, maybe it's the failing tests that need to change? It doesn't seem helpful to look at 1-01-01 00:00:00 without knowing which time it represents (001-01-01 and 0001-01-01 should be the same date, but 01-01-01 and 1-01-01 are interpreted as 2001-01-01).
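To make the proposed one-line change concrete, here is a minimal standalone sketch of the same string manipulation. The helper name pad_year and its arguments are hypothetical; in the actual suggestion the year comes from self.year and base is the string already built inside timestamps.pyx.

```python
def pad_year(year: int, base: str) -> str:
    """Swap the leading year of 'Y-MM-DD...' for a zero-padded one.

    Hypothetical helper mirroring the one-liner suggested above.
    """
    # Drop everything up to the first "-" (the unpadded year) and
    # prepend the year formatted with at least four digits.
    return f"{year:04d}-" + base.split("-", 1)[1]


print(pad_year(1, "1-01-01T00:00:00"))        # 0001-01-01T00:00:00
print(pad_year(2001, "2001-01-01T00:00:00"))  # 2001-01-01T00:00:00
```

Splitting on the first "-" only works because the year is non-negative here; a string with a leading minus sign would need different handling.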
Comment From: spencerkclark
Thanks for looking into this @keewis! That's encouraging that it might be that easy.
I agree -- I think that test only fails because the error message changes in the way we would expect. I don't read anything more into the original expected error message than that it was consistent with how f"Cannot cast {self} to unit='{unit}' without overflow." appeared before for pd.Timestamp("0001-01-01").as_unit("ns").
Comment From: keewis
the reason I was confused is that immediately below that check there's this (mostly because of the comment): https://github.com/pandas-dev/pandas/blob/6c50f70650faf0f8f192725c78c53c86eb40a4a3/pandas/tests/scalar/timestamp/test_constructors.py#L605-L610 where pd.Timestamp("001-01-01") == pd.Timestamp("0001-01-01"), so unless I'm missing something both should get a precision of "s", while "01-01-01" and "1-01-01" should have a precision of "ns".
In any case, I'll try to send in a PR tomorrow.
Comment From: spencerkclark
Sounds good!
I see, yeah, I agree that comment is a little confusing and perhaps that part of the test could also be updated. For what it's worth (and maybe you've noticed this already) it looks like pd.Timestamp("001-01-01") in fact does get assigned second precision on the current main branch:
>>> import pandas as pd
>>> pd.Timestamp("001-01-01").unit
's'
It could be that there was a brief time during recent development when it did not, and the second of the expected error messages occurred for "001-01-01" (raised by the Timestamp constructor rather than the as_unit call). Maybe it can be discussed in your PR. The good news is that I do not think it should block this formatting change one way or the other.