Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
```python
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'a': pd.Series([232132231, 1233213123232, 21331123213123], dtype='datetime64[ms]')})
In [3]: df
Out[3]:
a
0 1970-01-03 16:28:52.231
1 2009-01-29 07:12:03.232
2 2645-12-16 00:00:13.123
In [4]: df.dtypes
Out[4]:
a datetime64[ms]
dtype: object
In [5]: df.to_parquet("abc")
In [6]: df = pd.read_parquet("abc")
---------------------------------------------------------------------------
ArrowInvalid Traceback (most recent call last)
Cell In[6], line 1
----> 1 df = pd.read_parquet("abc")
File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/pandas/io/parquet.py:532, in read_parquet(path, engine, columns, storage_options, use_nullable_dtypes, dtype_backend, **kwargs)
529 use_nullable_dtypes = False
530 check_dtype_backend(dtype_backend)
--> 532 return impl.read(
533 path,
534 columns=columns,
535 storage_options=storage_options,
536 use_nullable_dtypes=use_nullable_dtypes,
537 dtype_backend=dtype_backend,
538 **kwargs,
539 )
File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/pandas/io/parquet.py:253, in PyArrowImpl.read(self, path, columns, use_nullable_dtypes, dtype_backend, storage_options, **kwargs)
249 try:
250 pa_table = self.api.parquet.read_table(
251 path_or_handle, columns=columns, **kwargs
252 )
--> 253 result = pa_table.to_pandas(**to_pandas_kwargs)
255 if manager == "array":
256 result = result._as_manager("array", copy=False)
File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/pyarrow/array.pxi:830, in pyarrow.lib._PandasConvertible.to_pandas()
File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/pyarrow/table.pxi:3990, in pyarrow.lib.Table._to_pandas()
File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/pyarrow/pandas_compat.py:820, in table_to_blockmanager(options, table, categories, ignore_metadata, types_mapper)
818 _check_data_column_metadata_consistency(all_columns)
819 columns = _deserialize_column_index(table, all_columns, column_indexes)
--> 820 blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
822 axes = [columns, index]
823 return BlockManager(blocks, axes)
File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/pyarrow/pandas_compat.py:1169, in _table_to_blocks(options, block_table, categories, extension_columns)
1164 def _table_to_blocks(options, block_table, categories, extension_columns):
1165 # Part of table_to_blockmanager
1166
1167 # Convert an arrow table to Block from the internal pandas API
1168 columns = block_table.column_names
-> 1169 result = pa.lib.table_to_blocks(options, block_table, categories,
1170 list(extension_columns.keys()))
1171 return [_reconstruct_block(item, columns, extension_columns)
1172 for item in result]
File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/pyarrow/table.pxi:2646, in pyarrow.lib.table_to_blocks()
File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/pyarrow/error.pxi:100, in pyarrow.lib.check_status()
ArrowInvalid: Casting from timestamp[ms] to timestamp[ns] would result in out of bounds timestamp: 21331123213123
# Even when the values are small enough to fit, the parquet reader silently
# defaults to ns resolution instead of preserving ms:
In [7]: df = pd.DataFrame({'a': pd.Series([232132231, 123321312, 213311232], dtype='datetime64[ms]')})
In [8]: df
Out[8]:
a
0 1970-01-03 16:28:52.231
1 1970-01-02 10:15:21.312
2 1970-01-03 11:15:11.232
In [9]: df.dtypes
Out[9]:
a datetime64[ms]
dtype: object
In [10]: df.to_parquet("abc")
In [11]: pd.read_parquet("abc")
Out[11]:
a
0 1970-01-03 16:28:52.231
1 1970-01-02 10:15:21.312
2 1970-01-03 11:15:11.232
In [12]: pd.read_parquet("abc").dtypes
Out[12]:
a datetime64[ns]
dtype: object
```
Issue Description
With the introduction of the `s`, `ms`, and `us` time resolutions in pandas 2.0, we seem to be unable to properly round-trip this data to and from parquet. When reading the parquet file back, the `ns` time resolution always appears to be used as the default, which results in overflow errors for values that are valid at the coarser resolutions but too large to represent in nanoseconds.
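To see why the cast overflows (an illustration added for context, not part of the original report): `datetime64[ns]` counts nanoseconds since the epoch in a signed 64-bit integer, topping out at 2262-04-11, so the year-2645 millisecond value from the traceback cannot be represented once scaled up. A quick arithmetic check:

```python
# datetime64[ns] stores nanoseconds since the epoch in a signed 64-bit integer.
max_ns = 2**63 - 1          # 9_223_372_036_854_775_807, i.e. 2262-04-11

# The value the traceback complains about, in milliseconds (year 2645):
val_ms = 21331123213123
val_ns = val_ms * 10**6     # 21_331_123_213_123_000_000

print(val_ns > max_ns)      # True -> ArrowInvalid on the cast to timestamp[ns]
```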
Expected Behavior
```python
In [8]: pd.read_parquet("abc")
Out[8]:
a
0 1970-01-03 16:28:52.231
1 1970-01-02 10:15:21.312
2 1970-01-03 11:15:11.232
In [9]: pd.read_parquet("abc").dtypes
Out[9]:
a datetime64[ms]
dtype: object
```
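Until this is fixed, one possible workaround (a sketch, not part of the original report) is to opt into pyarrow-backed dtypes on read via the `dtype_backend` parameter visible in the traceback above; this should avoid the cast to `ns` and preserve the millisecond resolution:

```python
import pandas as pd

# dtype_backend="pyarrow" keeps the column as a pyarrow-backed extension
# dtype, so no conversion to datetime64[ns] (and no overflow) takes place.
df = pd.read_parquet("abc", dtype_backend="pyarrow")
print(df.dtypes)  # a    timestamp[ms][pyarrow]
```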
Installed Versions
```
INSTALLED VERSIONS
------------------
commit : c2a7f1ae753737e589617ebaaff673070036d653
python : 3.10.10.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-76-generic
Version : #86-Ubuntu SMP Fri Jan 17 17:24:28 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.0.0rc1
numpy : 1.23.5
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.6.1
pip : 23.0.1
Cython : 0.29.33
pytest : 7.2.2
hypothesis : 6.70.1
sphinx : 5.3.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.12.0
pandas_datareader: None
bs4 : 4.12.0
bottleneck : None
brotli :
fastparquet : None
fsspec : 2023.3.0
gcsfs : None
matplotlib : None
numba : 0.56.4
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : 2023.3.0
scipy : 1.10.1
snappy :
sqlalchemy : 1.4.46
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None
```
Comment From: phofl
Thanks for your report. This is an upstream issue: https://github.com/apache/arrow/issues/33321
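A related workaround, sketched here under the assumption that pyarrow 11 is available (as in the environment above), is to read the table with pyarrow directly and map its types to pandas extension dtypes via `types_mapper`, sidestepping the implicit cast to `ns`:

```python
import pandas as pd
import pyarrow.parquet as pq

table = pq.read_table("abc")
# Passing pd.ArrowDtype as types_mapper wraps each Arrow column in a
# pandas extension dtype instead of casting timestamps to datetime64[ns].
df = table.to_pandas(types_mapper=pd.ArrowDtype)
print(df.dtypes)  # a    timestamp[ms][pyarrow]
```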