Code Sample, a copy-pastable example if possible
import pandas as pd
df = pd.DataFrame([1,2,3])
df['col1'] = pd.DataFrame([1,2,3], dtype='category')
df['col2'] = pd.Series([1,2,3], dtype='category')
df.dtypes
Problem description
This returns
0          int64
col1       int64
col2    category
dtype: object
Expected Output
0          int64
col1    category
col2    category
dtype: object
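A possible workaround on affected versions, sketched under the assumption that the right-hand side is a single-column DataFrame whose only column is labelled 0 (as in the sample above): select that column first, so that a categorical Series is assigned. This mirrors the `col2` case, which already keeps its dtype.

```python
import pandas as pd

df = pd.DataFrame([1, 2, 3])
value = pd.DataFrame([1, 2, 3], dtype="category")

# Selecting the single column yields a Series that carries the
# categorical dtype, so the assignment behaves like the col2 case.
df["col1"] = value[0]

print(df.dtypes)
```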
Output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit : d9f028dba97e9a3f21077ceeb96ca6552909c3b3
python : 3.7.6.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-74-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8
pandas : 0.26.0.dev0+2031.gd9f028dba.dirty
numpy : 1.17.3
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 45.0.0.post20200113
Cython : 0.29.14
pytest : 5.3.3
hypothesis : 5.1.5
sphinx : 2.3.1
blosc : None
feather : None
xlsxwriter : 1.2.7
lxml.etree : 4.4.2
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : 4.8.2
bottleneck : 1.3.1
fastparquet : 0.3.2
gcsfs : None
lxml.etree : 4.4.2
matplotlib : 3.1.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.1
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
pytest : 5.3.3
pyxlsb : None
s3fs : 0.4.0
scipy : 1.4.1
sqlalchemy : 1.3.12
tables : 3.6.1
tabulate : 0.8.6
xarray : 0.14.1
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.7
numba : 0.47.0
Comment From: DanielBustillos
Hi, I have checked the error, can I work on it?
Comment From: MarcoGorelli
Hi @DanielBustillos - sure, pull requests are welcome! See the contributing guide for how to get started.
Comment From: jbrockmendel
Is it obvious that this should be allowed? I would almost expect setting a DataFrame into a single column to raise. cc @phofl
If this is something we want to allow, someone can salvage this from an old branch I'm trimming:
+++ b/pandas/core/frame.py
@@ -4076,7 +4076,10 @@ class DataFrame(NDFrame, OpsMixin):
                 f"column {key}"
             )
-        self[key] = value[value.columns[0]]
+        # now align rows
+        # TODO: could lose dtypes if multi-column xref #31581?
+        arraylike = _reindex_for_setitem(value, self.index)
+        self._set_item_mgr(key, arraylike)
     def _iset_item_mgr(
         self, loc: int | slice | np.ndarray, value, inplace: bool = False
@@ -11576,6 +11579,10 @@ def _from_nested_dict(data) -> collections.defaultdict:
 def _reindex_for_setitem(value: DataFrame | Series, index: Index) -> ArrayLike:
     # reindex if necessary
+    if isinstance(value, DataFrame) and value.shape[1] == 1:
+        # GH#31581 avoid losing dtype for EAs
+        value = value._ixs(0, axis=1)
+
     if value.index.equals(index) or not len(index):
         return value._values.copy()
+++ b/pandas/tests/frame/indexing/test_setitem.py
@@ -44,6 +44,17 @@ from pandas.tseries.offsets import BDay
 class TestDataFrameSetItem:
+    def test_setitem_categorical_frame(self):
+        # GH#31581 don't lose categorical dtype when setting column
+        df = DataFrame([1, 2, 3])
+
+        value = pd.DataFrame([1, 2, 3], dtype="category")
+
+        df["col1"] = value
+
+        expected = DataFrame({0: [1, 2, 3], "col1": value[0]})
+        tm.assert_frame_equal(df, expected)
+
     def test_setitem_str_subclass(self):
Alternatively, we could make DataFrame._values return a 2D categorical.
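For context on the "should this raise?" question: the guard visible in the diff context above rejects a multi-column DataFrame assigned to a single new column. A minimal sketch of that case, using only public API; the frame `two_cols` and the flag `raised` are illustrative names, and the exact error message may differ between pandas versions.

```python
import pandas as pd

df = pd.DataFrame([1, 2, 3])
# Hypothetical two-column right-hand side, for illustration only.
two_cols = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

try:
    # One target column, two source columns: ambiguous assignment.
    df["col1"] = two_cols
except ValueError as err:
    raised = True
    print(f"ValueError: {err}")
else:
    raised = False
    print(df.dtypes)
```

The single-column case discussed in this issue is the one place where the assignment is unambiguous, which is presumably why it is allowed rather than rejected.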
Comment From: phofl
I think this could work (from a user's point of view).
It actually does on main: the resulting dtype is category.
Comment From: phofl
0          int64
col1    category
col2    category
dtype: object
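To check which behaviour an installed pandas has, something like the following can be run. It repeats the original sample and prints the version next to the resulting dtypes; the helper name `setitem_dtypes` is made up for this sketch.

```python
import pandas as pd


def setitem_dtypes() -> pd.Series:
    """Repeat the sample from the issue and return the resulting dtypes."""
    df = pd.DataFrame([1, 2, 3])
    df["col1"] = pd.DataFrame([1, 2, 3], dtype="category")
    df["col2"] = pd.Series([1, 2, 3], dtype="category")
    return df.dtypes


print(pd.__version__)
result = setitem_dtypes()
print(result)
```

If `col1` shows `category`, the installed version already preserves the dtype; if it shows `int64`, it still has the behaviour reported in this issue.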