Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
import tempfile
# str_list is a column of lists of strings
df = pd.DataFrame(
    {
        "id": [1, 2, 3, 4, 5],
        "str_list": [
            ["a", "aa", "aaa"],
            ["b", "bb", "bbb", "bbbb"],
            ["c", "cc"],
            ["d"],
            ["e", "ee", "eee"],
        ],
    }
)
# will output list
type(df['str_list'][0])
# Write to a feather file and read back the feather file
df.to_feather(tempfile.gettempdir()+"/df.feather")
df2 = pd.read_feather(tempfile.gettempdir()+"/df.feather")
# will output numpy.ndarray
type(df2['str_list'][0])
Issue Description
When a DataFrame with a column of lists is written to feather or parquet format and then read back, the column of lists becomes a column of numpy.ndarray.
A column of lists is especially useful for NLP tasks where the corpus needs to be a list of lists of strings. Passing a column of ndarrays breaks this and requires an intermediate step of converting each ndarray in the column to a list.
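For reference, here is a minimal sketch of that intermediate conversion step. It does not touch feather or parquet at all: the frame is built by hand with ndarray cells as a stand-in for what the round trip returns (an assumption for illustration, so that the snippet runs without pyarrow), and the shortened data is made up for the example.

```python
import numpy as np
import pandas as pd

# Stand-in for the frame returned by read_feather/read_parquet:
# each cell of str_list is a numpy.ndarray instead of a list.
# (Built by hand here; this is an assumption for illustration.)
df2 = pd.DataFrame(
    {
        "id": [1, 2],
        "str_list": [
            np.array(["a", "aa", "aaa"], dtype=object),
            np.array(["b", "bb"], dtype=object),
        ],
    }
)

# Intermediate step: convert every ndarray cell back to a plain Python list.
df2["str_list"] = df2["str_list"].apply(lambda arr: arr.tolist())

print(type(df2["str_list"][0]))
```

This works, but it is an extra pass over the column that would not be needed if the round trip preserved lists.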
Expected Behavior
If a column of lists is written to feather/parquet files, it should be read back as a column of lists, not numpy.ndarray.
FWIW - to_pickle and a subsequent read_pickle don't have this issue, as can be seen in the code below:
df.to_pickle(tempfile.gettempdir()+"/df.pickle")
df3 = pd.read_pickle(tempfile.gettempdir()+"/df.pickle")
type(df3['str_list'][0])
Installed Versions
Comment From: eshanja1n
take
Comment From: kostyafarber
I'm not able to reproduce this
TypeError: Argument 'table' has incorrect type (expected pyarrow.lib.Table, got DataFrame)