Code Sample, a copy-pastable example if possible
from datetime import datetime, date
import pandas as pd
df = pd.DataFrame([(1, "x", date.today(), datetime(2018, 1, 2, 18, 53))],
columns=["number", "string", "date", "datetime"])
df.dtypes
# number int64
# string object
# date object
# datetime datetime64[ns]
# dtype: object
df["date"][0]
# datetime.date(2018, 3, 9)
df.to_parquet("test.gz.parquet", compression='gzip', engine='pyarrow')
dfp = pd.read_parquet("test.gz.parquet")
dfp.dtypes
# number int64
# string object
# date datetime64[ns] <- !!!! Type change
# datetime datetime64[ns]
# dtype: object
However if I do:
import pyarrow.parquet as pq
pq.read_table("test.gz.parquet")
I get:
pyarrow.Table
number: int64
string: string
date: date32[day]
datetime: timestamp[ms]
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
b' [{"name": "number", "field_name": "number", "pandas_type": "int'
b'64", "numpy_type": "int64", "metadata": null}, {"name": "string"'
b', "field_name": "string", "pandas_type": "unicode", "numpy_type"'
b': "object", "metadata": null}, {"name": "date", "field_name": "d'
b'ate", "pandas_type": "date", "numpy_type": "object", "metadata":'
b' null}, {"name": "datetime", "field_name": "datetime", "pandas_t'
b'ype": "datetime", "numpy_type": "datetime64[ns]", "metadata": nu'
b'll}, {"name": null, "field_name": "__index_level_0__", "pandas_t'
b'ype": "int64", "numpy_type": "int64", "metadata": null}], "panda'
b's_version": "0.22.0"}'}
Problem description
Pandas does not preserve the date type when reading back parquet.
When pandas reads a dataframe that originally had a date-typed column, it converts it to Timestamp type. It appears to be a pandas problem, since reading the same file with pyarrow returns the original, correct type.
Expected Output
Output of pd.show_versions()
Comment From: jreback
datetime.date are not first class datatypes in pandas, so this is as expected. The conversion is actually done in pyarrow.
Comment From: jreback
you could open an issue on the arrow tracker, but likely the answer is going to be the same.
Comment From: JivanRoquet
datetime.date are not first class datatypes in pandas, so this is as expected. The conversion is actually done in pyarrow.
I believe the whole point of this issue is about providing first-class support for datetime.date in pandas.
Comment From: simonjayhawkins
we have #32473 for adding a date type
Comment From: yohplala
Hello, there is another bug that your code demonstrates. It is linked, and may be considered the same. When saving to parquet, datetimes are saved in 'ms', not 'ns' (this is normal, due to a pyarrow limitation), so there is a mistake in the metadata pandas writes.
from datetime import datetime, date
import pandas as pd
df = pd.DataFrame([(1, "x", date.today(), datetime(2018, 1, 2, 18, 53))],
columns=["number", "string", "date", "datetime"])
df.dtypes
# number int64
# string object
# date object
# datetime datetime64[ns] # <- here
# dtype: object
df.to_parquet("test.parquet", engine='pyarrow')
import pyarrow.parquet as pq
pq.read_table("test.parquet")
pyarrow.Table
number: int64
string: string
date: date32[day]
datetime: timestamp[ms] # <- Here, yes, normal
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
b' [{"name": "number", "field_name": "number", "pandas_type": "int'
b'64", "numpy_type": "int64", "metadata": null}, {"name": "string"'
b', "field_name": "string", "pandas_type": "unicode", "numpy_type"'
b': "object", "metadata": null}, {"name": "date", "field_name": "d'
b'ate", "pandas_type": "date", "numpy_type": "object", "metadata":'
b' null}, {"name": "datetime", "field_name": "datetime", "pandas_t'
b'ype": "datetime", "numpy_type": "datetime64[ns]", "metadata": nu' # <- numpy_type: [ns]: NO
b'll}, {"name": null, "field_name": "__index_level_0__", "pandas_t'
b'ype": "int64", "numpy_type": "int64", "metadata": null}], "panda'
b's_version": "0.22.0"}'}
Should I open a new ticket, or can this ticket be re-opened? Thanks, best regards
Comment From: anderl80
@yohplala we're struggling with the same issue here, is this still open/resolved?
Comment From: yohplala
@yohplala we're struggling with the same issue here, is this still open/resolved?
Hi @anderl80, I have switched to using fastparquet directly instead of pandas to write parquet files (fastparquet offers additional options in terms of append modes).
Following your question, I had a look at the pyarrow docs, and they now state: Timestamps with nanoseconds can be stored without casting when using the more recent Parquet format version 2.0
So maybe the bug has been naturally solved, I cannot say.