The Pandas datetime arithmetic module is very useful, and it could serve archeological research as well. Unfortunately, the limitation that a Timedelta object cannot span more than roughly 292 years is a serious blemish on an otherwise excellent data science library. With this limitation, it is impossible to go back thousands of years in history, or even just a few hundred.

>>> import pandas as pd
>>> pd.to_timedelta('1Y')*5
Timedelta('1826 days 05:06:00')
>>> pd.to_timedelta('1Y')*292
Timedelta('106650 days 19:26:24')
>>> pd.to_timedelta('1Y')*293
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/_libs/tslibs/timedeltas.pyx", line 1347, in pandas._libs.tslibs.timedeltas.Timedelta.__mul__
  File "pandas/_libs/tslibs/timedeltas.pyx", line 1230, in pandas._libs.tslibs.timedeltas.Timedelta.__new__
  File "pandas/_libs/tslibs/timedeltas.pyx", line 180, in pandas._libs.tslibs.timedeltas.convert_to_timedelta64
  File "pandas/_libs/tslibs/timedeltas.pyx", line 308, in pandas._libs.tslibs.timedeltas.cast_from_unit
OverflowError: Python int too large to convert to C long

In practical terms, time resolution and time span always pull against each other. Quantum physics routinely works with times at the nano- (10⁻⁹), pico- (10⁻¹²), or even femto-second (10⁻¹⁵) scale, while archeology deals with spans of thousands, millions, or even billions of years. If I remember correctly, pandas fixes the time counter's base unit at nanoseconds, which is why the span is so short. One way to cater for both high resolution and a large span is to use a floating-point value, rather than a large integer, as the time counter. Floating point is slightly slower, but on modern Intel CPUs its arithmetic throughput is close to that of integers, especially when SIMD is used.
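As a minimal sketch of the dynamic-range argument (plain Python floats counted in seconds; purely illustrative, not an existing pandas type):

>>> femto = 1e-15                     # 1 femtosecond, expressed in seconds
>>> gigayear = 1e9 * 365.25 * 86400   # ~1 billion years, expressed in seconds
>>> femto, gigayear                   # both extremes fit in an ordinary float64
(1e-15, 3.15576e+16)
>>> gigayear * 1000                   # a trillion years still fits comfortably
3.15576e+19

The tradeoff, of course, is that the absolute resolution degrades as the magnitude grows.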

Comment From: xuancong84

Criticism is good, but the last sentence goes beyond fair criticism imho... So may I ask you in return: why hadn't you thought about that earlier, considering how far-sighted you are?

I have removed criticism and proposed a solution. Please take a look!

Comment From: jreback

timedeltas are backed by an int64, which gives a reasonable tradeoff between resolution (ns) and time range. The same issue exists w.r.t. Timestamp ranges, where there have been many discussions (search the repo).

If someone wants to implement a timedelta64[ms] extension array (and/or [s]) then this limitation can be avoided. This is a non-trivial project and would require some effort.

But to re-iterate, I find this limitation for timedeltas really no big deal. What exactly is the usecase?

Comment From: xuancong84

But to re-iterate, I find this limitation for timedeltas really no big deal. What exactly is the usecase?

I was doing automated carbon dating at an archeological site: for every item the machine collects, the computer estimates its age by measuring the carbon-14 isotope, so the range can go from a few years to a few thousand years back from now ("now" is different for different runs). Most people in the field still use MS Excel to do the computation manually. Honestly speaking, I do not mind that timedelta has a limited span, but a span of only 292 years is really too small for many scientific works.

Comment From: bashtage

IMO a long-term solution would be to move to a default 128-bit (or 96-bit) type, which could then have both very precise resolution and a long time span. This is likely even more challenging than what @jreback suggested. FWIW this is not unprecedented: MATLAB has gone down this route.

Comment From: bashtage

But to re-iterate, I find this limitation for timedeltas really no big deal. What exactly is the usecase?

Astronomy, archeology, and geology are three places where the natural limit will be regularly hit.

Comment From: venaturum

I also run into this issue in my package staircase, which provides functionality for working with mathematical step functions.

An example application is representing the number of "active objects" over time, e.g. buses or website users, with a step function. Integrating the step function (to find the area underneath) gives the total active time, e.g. total bus hours or total time a website was viewed, and the answer is a timedelta. Due to the nature of the integration it is easy to reach large timedeltas. For example, when computing the total active time over a year, if the average value of the function (i.e. the average number of active objects) is 293 or more, then the result is >= 365 days * 293 = 106945 days, which is beyond the current timedelta limit.
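A minimal reproduction of that arithmetic against the default nanosecond-backed Timedelta (the precise exception raised may differ between pandas versions):

>>> import pandas as pd
>>> pd.Timedelta('365D') * 292   # 106580 days, still inside the ~106752-day limit
Timedelta('106580 days 00:00:00')
>>> pd.Timedelta('365D') * 293   # 106945 days, beyond the int64 nanosecond range
Traceback (most recent call last):
  ...
OverflowError: ...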

Comment From: jbrockmendel

In 2.0 we support non-nanosecond Timedeltas, though depending on how you call the constructor you may still need to explicitly cast to the lower resolution:

>>> pd.to_timedelta("365D").as_unit("s") * 10**6
Timedelta('365000000 days 00:00:00')

@xuancong84 does this handle your use case?

Comment From: xuancong84

@jbrockmendel Yes, this handles the case. However, it is too mechanical and manual. A more elegant implementation should be able to automatically switch units among ns/us/ms/s depending on the time scale.

That is why a high-precision floating-point implementation is recommended rather than a large integer: the IEEE floating-point format is designed precisely to handle this kind of dynamic-range problem. If you don't use floating point, you have to handle the scaling manually, which is troublesome and bug-prone. Of course, the drawback is that when you add two durations of vastly different scales, e.g. 1 nanosecond to 1000 years, the addition underflows and produces no increment. But such a situation is rare; can anyone think of a case where you need to add a very small duration (such as a ns) to a very large one (>1000 years)?
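A small illustration of that underflow, again using plain float64 seconds rather than any pandas type:

>>> thousand_years = 1000 * 365.25 * 86400   # about 3.16e10 seconds
>>> thousand_years + 1e-9 == thousand_years  # the 1 ns increment is rounded away
True

At that magnitude the spacing between adjacent float64 values is already a few microseconds, so anything smaller than that simply disappears.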

In conclusion, the most elegant solution in my opinion is to switch between a floating-point timedelta and a large-integer timedelta, rather than switching among fixed units such as ns/us/ms/s. Otherwise, what if you want to scale to trillions of years? Is a unit of 1 second enough? And what if you need to work at the femtosecond level? Can you even aggregate up to a few minutes?

Comment From: jbrockmendel

You're welcome to implement a float-based datetime dtype. I don't expect pandas to implement one internally. Closing as complete.

Comment From: bashtage

I would add that with the changes made available in 2.0 it should be relatively straightforward to write an extension type, as a stand-alone package, that supports ns resolution plus a wider range.

I agree that acceptance into pandas would be a long shot. That said, the only way forward for something like this would be to have it as a stand-alone package that can be shown to be very popular, and to let it mature outside pandas.