Pandas version checks
- [X] I have checked that this issue has not already been reported.
- Query: is:issue normalize overflow
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Issue description
I was trying to get the minimum date, so I tried pd.Timestamp.min.normalize() before realizing that would be out of range, but what's weird is that it overflows without warning:
>>> pd.Timestamp.min
Timestamp('1677-09-21 00:12:43.145224193')
>>> pd.Timestamp.min.normalize()
Timestamp('2262-04-11 23:34:33.709551616')
For reference:
>>> pd.Timestamp.max
Timestamp('2262-04-11 23:47:16.854775807')
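The wrapped value is consistent with signed 64-bit overflow. Below is a minimal sketch in plain Python, assuming the timestamp is stored as an int64 count of nanoseconds since 1970-01-01 and that normalize() subtracts the time of day without a bounds check (an assumption about the internals, not something confirmed in this report). The helpers `wrap_int64` and `format_ns` are hypothetical and exist only for this illustration.

```python
from datetime import date, timedelta

INT64_MIN = -(2**63)
NS_PER_DAY = 86_400 * 10**9


def wrap_int64(n: int) -> int:
    """Reduce a Python int to the signed 64-bit range (two's-complement wraparound)."""
    return (n - INT64_MIN) % 2**64 + INT64_MIN


def format_ns(ns: int) -> str:
    """Render nanoseconds since 1970-01-01 as 'YYYY-MM-DD HH:MM:SS.nnnnnnnnn'."""
    days, rem = divmod(ns, NS_PER_DAY)
    secs, nanos = divmod(rem, 10**9)
    hours, rem_secs = divmod(secs, 3600)
    minutes, seconds = divmod(rem_secs, 60)
    day = date(1970, 1, 1) + timedelta(days=days)
    return f"{day.isoformat()} {hours:02d}:{minutes:02d}:{seconds:02d}.{nanos:09d}"


# Assumed integer value behind pd.Timestamp.min (one above INT64_MIN).
ts_min_ns = INT64_MIN + 1

# True midnight of that day; this integer lies below INT64_MIN.
midnight_ns = ts_min_ns - ts_min_ns % NS_PER_DAY

# What an unchecked int64 subtraction would store instead.
wrapped_ns = wrap_int64(midnight_ns)

print(format_ns(ts_min_ns))   # 1677-09-21 00:12:43.145224193
print(format_ns(wrapped_ns))  # 2262-04-11 23:34:33.709551616
```

The second printed value matches the surprising normalize() result above, which suggests a silent wraparound rather than a deliberate clamp.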
Compare that to .floor('D'), which I thought would be equivalent, but it raises:
>>> pd.Timestamp.min.floor('D')
OverflowError: value too large
The above exception was the direct cause of the following exception:
OutOfBoundsDatetime: Cannot round 1677-09-21 00:12:43.145224193 to freq=<Day> without overflow
as well as .round('D') (same error).
Sidenote: What I really wanted was .ceil('D'), which gets the first whole day:
>>> pd.Timestamp.min.ceil('D')
Timestamp('1677-09-22 00:00:00')
Installed Versions
INSTALLED VERSIONS
------------------
commit : d41884b2dd0823dc6288ab65d06650302e903c6b
python : 3.12.8
python-bits : 64
OS : Linux
OS-release : 5.4.0-202-generic
Version : #222-Ubuntu SMP Fri Nov 8 14:45:04 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : fr_CA.UTF-8
LOCALE : fr_CA.UTF-8
pandas : 3.0.0.dev0+1777.gd41884b2dd
numpy : 2.1.0.dev0+git20240403.e59c074
dateutil : 2.9.0.post0
pip : 24.3.1
Cython : None
sphinx : None
IPython : 8.22.2
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
blosc : None
bottleneck : None
fastparquet : None
fsspec : None
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : None
lxml.etree : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
psycopg2 : None
pymysql : None
pyarrow : None
pyreadstat : None
pytest : None
python-calamine : None
pytz : 2024.1
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlsxwriter : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None
Comment From: shashankrushiya
I don't think the behavior described above is a bug, for the following reason.
The normalize() method sets the time part of the timestamp to 00:00:00. However, pd.Timestamp.min is already extremely close to the lower boundary of supported datetimes. And talking about floor() and round() methods, both are more strict in their handling. They basically check whether the operation would cause an overflow and raise an OverflowError if it would.
If you consider ceil('D'), it works because this method rounds up to the next day, which is still within the valid range for pd.Timestamp.
Comment From: wjandrea
@shashankrushiya I'm not following. How does that mean it's not a bug?
"talking about floor() and round() methods, both are more strict in their handling. They basically check whether the operation would cause an overflow and raise an OverflowError if it would."
So then why doesn't .normalize() check? Is there a good reason or is it just an oversight? (i.e. a bug, or maybe not a bug per se but a potential enhancement)
BTW, note that pd.Timestamp.min is the lower boundary, not "extremely close" to the boundary.
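For illustration, a check of this kind can be approximated from user code. This is only a sketch: `checked_normalize` is a hypothetical wrapper, not part of pandas, and it relies on the assumption that midnight of the same day can never be later than the input, so a later result indicates wraparound.

```python
import pandas as pd


def checked_normalize(ts: pd.Timestamp) -> pd.Timestamp:
    """Hypothetical wrapper: normalize, but reject a result that wrapped around."""
    result = ts.normalize()
    # Normalizing moves a timestamp back to midnight, so the result should
    # never be later than the input. If it is, the stored integer overflowed.
    if result > ts:
        raise OverflowError(f"Cannot normalize {ts} without overflow")
    return result


print(checked_normalize(pd.Timestamp("2024-05-06 12:34:56")))
```

On a pandas build where normalize() already raises for this input, the library's own exception would surface before the wrapper's check runs.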
Comment From: rhshadrach
Thanks for the report! Agreed normalize should raise an OverflowError here. PRs to fix are welcome!
"@shashankrushiya Sorry, is this text AI-generated? It's not very helpful."
I view this as a violation of pandas' Code of Conduct and ask you to refrain from making such statements in the future. We'd like to make pandas a welcoming place for all contributors, and this type of comment works against that goal.
Comment From: wjandrea
@rhshadrach My bad. I've edited my above comment to be respectful and more constructive. Sorry @shashankrushiya!
Comment From: anishkarki
take
Comment From: Anurag-Varma
take
Comment From: Anurag-Varma
@MarcoGorelli @WillAyd
So the min value of Timestamp is 1677-09-21 00:12:43.145224193:
import pandas as pd
pd.Timestamp.min
# Output:
# Timestamp('1677-09-21 00:12:43.145224193')
When I run the following snippets, I get correct outputs:
pd.Timestamp('1677-09-21').normalize()
# Output (day precision):
# Timestamp('1677-09-21 00:00:00')
pd.Timestamp('1677-09-21 00:12:43').normalize()
# Output (second precision):
# Timestamp('1677-09-21 00:00:00')
pd.Timestamp('1677-09-21 00:12:43.145').normalize()
# Output (millisecond precision):
# Timestamp('1677-09-21 00:00:00')
pd.Timestamp('1677-09-21 00:12:43.145224').normalize()
# Output (microsecond precision):
# Timestamp('1677-09-21 00:00:00')
The issue appears when I use nanosecond precision with a value equal to or later than 1677-09-21 00:12:43.145224193; it gives the wrong output:
pd.Timestamp('1677-09-21 00:12:43.145224193').normalize()
# Output (nanosecond precision):
# Timestamp('2262-04-11 23:34:33.709551616')
But if nanosecond precision is not specified, it works fine.
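One possible explanation, offered as a hedged sketch: in pandas 2.0 and later, a Timestamp parsed from a string may be stored at a coarser resolution than nanoseconds, which would give it a much wider valid range. The snippet below assumes pandas 2.0 or later (where the `unit` attribute exists) and only inspects the stored resolution of each input. It makes no claim about which exact unit each string receives.

```python
import pandas as pd

inputs = [
    "1677-09-21",
    "1677-09-21 00:12:43",
    "1677-09-21 00:12:43.145",
    "1677-09-21 00:12:43.145224",
    "1677-09-21 00:12:43.145224193",
]

# Map each input string to the resolution pandas chose for it.
units = {text: pd.Timestamp(text).unit for text in inputs}

for text, unit in units.items():
    print(f"{text!r:35} unit={unit}")
```

If only the last input reports `ns`, that would line up with the observation that the overflow shows up only at nanosecond precision.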
Comment From: Anurag-Varma
Also, should functions like floor, ceil, round, and normalize work for Timestamps before Timestamp('1677-09-21 00:12:43.145224193'), even though this is the minimum value?
Should they work for any Timestamp, or give an OverflowError if it's below the min value?
For example, in the year 1000, they all work fine, as shown below:
pd.Timestamp('1000-09-21 23:01:01.111111').ceil('D')
# Output of ceil:
# Timestamp('1000-09-22 00:00:00')
pd.Timestamp('1000-09-21 23:01:01.111111').floor('D')
# Output of floor:
# Timestamp('1000-09-21 00:00:00')
pd.Timestamp('1000-09-21 23:01:01.111111').round('D')
# Output of round:
# Timestamp('1000-09-22 00:00:00')
pd.Timestamp('1000-09-21 23:01:01.111111').normalize()
# Output of normalize:
# Timestamp('1000-09-21 00:00:00')
# Timestamp('1000-09-21 00:00:00')
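A rough back-of-the-envelope sketch of why year 1000 can be representable at all, assuming the timestamp is an int64 count of some unit since 1970-01-01. The function `approx_span_years` is hypothetical, and the figures are only an upper bound on what the integer can hold, not necessarily what pandas accepts for each unit.

```python
INT64_MAX = 2**63 - 1
UNITS_PER_SECOND = {"s": 1, "ms": 10**3, "us": 10**6, "ns": 10**9}
SECONDS_PER_YEAR = 365.2425 * 86_400  # mean Gregorian year


def approx_span_years(unit: str) -> float:
    """Approximate number of years an int64 can cover on each side of 1970."""
    return INT64_MAX / UNITS_PER_SECOND[unit] / SECONDS_PER_YEAR


for unit in UNITS_PER_SECOND:
    print(f"{unit:>2}: about {approx_span_years(unit):,.0f} years either side of 1970")
```

At nanosecond resolution the span is roughly 292 years, which matches the 1677 to 2262 range of pd.Timestamp.min and pd.Timestamp.max. At microsecond resolution the integer can reach far earlier than year 1000.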