Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [X] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import tempfile

# str_list is a column of lists of strings
df = pd.DataFrame(
    {
        "id": [1,2,3,4,5],
        "str_list": [
            ["a", "aa", "aaa"],
            ["b", "bb","bbb", "bbbb"],
            ["c", "cc"],
            ["d"],
            ["e","ee","eee"]
        ]
    })

# will output list
type(df['str_list'][0])

# Write to a feather file and read back the feather file
df.to_feather(tempfile.gettempdir()+"/df.feather")
df2 = pd.read_feather(tempfile.gettempdir()+"/df.feather")

# will output numpy.ndarray
type(df2['str_list'][0])

Issue Description

When a DataFrame with a column of lists is written to the feather or parquet format and then read back, the column of lists becomes a column of numpy.ndarray.

A column of lists is especially useful for NLP tasks where the corpus needs to be a list of lists of strings. Passing a column of ndarrays breaks this and requires an intermediate step that converts each ndarray in the column back to a list.
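The intermediate conversion step mentioned above could look roughly like the sketch below. It is only a workaround, not a fix for the round-trip behavior. To stay runnable without a feather file, it builds a stand-in frame whose cells already hold ndarrays, which is an assumption about what the read-back frame looks like; `ndarrays_to_lists` is a hypothetical helper name, not a pandas function.

```python
import numpy as np
import pandas as pd

# Stand-in for the frame returned by pd.read_feather / pd.read_parquet
# (assumption): each cell of "str_list" holds a numpy.ndarray, not a list.
df2 = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "str_list": [
            np.array(["a", "aa", "aaa"], dtype=object),
            np.array(["b", "bb"], dtype=object),
            np.array(["c"], dtype=object),
        ],
    }
)


def ndarrays_to_lists(column: pd.Series) -> pd.Series:
    """Hypothetical helper: turn ndarray cells back into plain lists."""
    return column.map(lambda v: v.tolist() if isinstance(v, np.ndarray) else v)


df2["str_list"] = ndarrays_to_lists(df2["str_list"])

# The column can now be handed to code that expects a list of lists of strings.
corpus = df2["str_list"].tolist()
```

Cells that are not ndarrays are passed through unchanged, so the helper can be applied to a column that already contains lists.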

Expected Behavior

If a column of lists is written to a feather/parquet file, it should be read back as a column of lists and not as a column of numpy.ndarray.

FWIW, to_pickle followed by read_pickle does not have this issue, as the code below shows:

df.to_pickle(tempfile.gettempdir()+"/df.pickle")
df3 = pd.read_pickle(tempfile.gettempdir()+"/df.pickle")
type(df3['str_list'][0])

Installed Versions

list
numpy.ndarray

INSTALLED VERSIONS
------------------
commit           : 91111fd99898d9dcaa6bf6bedb662db4108da6e6
python           : 3.9.13.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.19.16-200.fc36.x86_64
Version          : #1 SMP PREEMPT_DYNAMIC Sun Oct 16 22:50:04 UTC 2022
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.5.1
numpy            : 1.23.4
pytz             : 2022.6
dateutil         : 2.8.2
setuptools       : 65.5.1
pip              : 22.3.1
Cython           : 0.29.32
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.9.1
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 3.1.2
IPython          : 8.6.0
pandas_datareader: None
bs4              : 4.11.1
bottleneck       : 1.3.5
brotli           : None
fastparquet      : None
fsspec           : 2022.10.0
gcsfs            : None
matplotlib       : 3.6.2
numba            : 0.56.4
numexpr          : 2.8.4
odfpy            : None
openpyxl         : 3.0.10
pandas_gbq       : None
pyarrow          : 10.0.0
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : 1.9.3
snappy           : None
sqlalchemy       : None
tables           : None
tabulate         : 0.9.0
xarray           : 2022.11.0
xlrd             : None
xlwt             : None
zstandard        : None
tzdata           : None

Comment From: eshanja1n

take

Comment From: kostyafarber

I'm not able to reproduce this

TypeError: Argument 'table' has incorrect type (expected pyarrow.lib.Table, got DataFrame)