Code Sample, a copy-pastable example if possible
from datetime import datetime, date
import pandas as pd
df = pd.DataFrame([(1, "x", date.today(), datetime(2018, 1, 2, 18, 53))],
columns=["number", "string", "date", "datetime"])
df.dtypes
# number int64
# string object
# date object
# datetime datetime64[ns]
# dtype: object
df["date"][0]
# datetime.date(2018, 3, 9)
df.to_parquet("test.gz.parquet", compression='gzip', engine='pyarrow')
dfp = pd.read_parquet("test.gz.parquet")
dfp.dtypes
# number int64
# string object
# date datetime64[ns] <- !!!! Type change
# datetime datetime64[ns]
# dtype: object
However if I do:
import pyarrow.parquet as pq
pq.read_table("test.gz.parquet")
I get:
pyarrow.Table
number: int64
string: string
date: date32[day]
datetime: timestamp[ms]
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
b' [{"name": "number", "field_name": "number", "pandas_type": "int'
b'64", "numpy_type": "int64", "metadata": null}, {"name": "string"'
b', "field_name": "string", "pandas_type": "unicode", "numpy_type"'
b': "object", "metadata": null}, {"name": "date", "field_name": "d'
b'ate", "pandas_type": "date", "numpy_type": "object", "metadata":'
b' null}, {"name": "datetime", "field_name": "datetime", "pandas_t'
b'ype": "datetime", "numpy_type": "datetime64[ns]", "metadata": nu'
b'll}, {"name": null, "field_name": "__index_level_0__", "pandas_t'
b'ype": "int64", "numpy_type": "int64", "metadata": null}], "panda'
b's_version": "0.22.0"}'}
Problem description
Pandas does not preserve the date type when reading back parquet.
When pandas reads a dataframe that originally had a date-typed column, it converts it to Timestamp type. It appears to be a pandas problem, since reading the same file with pyarrow returns the original, correct type.
Expected Output
Output of pd.show_versions()
Comment From: jreback
datetime.date are not first class datatypes in pandas, so this is as expected. The conversion is actually done in pyarrow.
Comment From: jreback
you could open an issue on the arrow tracker, but likely the answer is going to be the same.
Comment From: JivanRoquet
datetime.date are not first class datatypes in pandas, so this is as expected. The conversion is actually done in pyarrow.
I believe the whole point of this issue is about providing first-class support for datetime.date in pandas.
Comment From: simonjayhawkins
we have #32473 for adding a date type
Comment From: yohplala
Hello, there is another bug that your code demonstrates. It is linked, and may be considered the same. When saving to parquet, datetimes are saved in 'ms', not 'ns' (this is normal, due to a pyarrow limitation), so there is a mistake in the metadata pandas writes.
from datetime import datetime, date
import pandas as pd
df = pd.DataFrame([(1, "x", date.today(), datetime(2018, 1, 2, 18, 53))],
columns=["number", "string", "date", "datetime"])
df.dtypes
# number int64
# string object
# date object
# datetime datetime64[ns] # <- here
# dtype: object
df.to_parquet("test.parquet", engine='pyarrow')
import pyarrow.parquet as pq
pq.read_table("test.parquet")
pyarrow.Table
number: int64
string: string
date: date32[day]
datetime: timestamp[ms] # <- Here, yes, normal
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
b' [{"name": "number", "field_name": "number", "pandas_type": "int'
b'64", "numpy_type": "int64", "metadata": null}, {"name": "string"'
b', "field_name": "string", "pandas_type": "unicode", "numpy_type"'
b': "object", "metadata": null}, {"name": "date", "field_name": "d'
b'ate", "pandas_type": "date", "numpy_type": "object", "metadata":'
b' null}, {"name": "datetime", "field_name": "datetime", "pandas_t'
b'ype": "datetime", "numpy_type": "datetime64[ns]", "metadata": nu' # <- numpy_type: [ns]: NO
b'll}, {"name": null, "field_name": "__index_level_0__", "pandas_t'
b'ype": "int64", "numpy_type": "int64", "metadata": null}], "panda'
b's_version": "0.22.0"}'}
Should I open a new ticket, or can this ticket be re-opened? Thanks, best regards
Comment From: anderl80
@yohplala we're struggling with the same issue here, is this still open/resolved?
Comment From: yohplala
@yohplala we're struggling with the same issue here, is this still open/resolved?
Hi @anderl80, I have switched to using fastparquet directly instead of pandas to write parquet files (fastparquet offers additional options in terms of append modes).
Following your question, I had a look at the pyarrow docs, and they now state: Timestamps with nanoseconds can be stored without casting when using the more recent Parquet format version 2.0
So maybe the bug has been naturally solved, I cannot say.