Code Sample, a copy-pastable example if possible
import pandas as pd
# list length (2) does NOT match the number of DataFrame columns (3), works as expected:
df = pd.DataFrame()
df['A'] = [[1, 2], [1, 2], [1, 2]]
df['B'] = [[3, 4], [3, 4], [3, 4]]
df['C'] = [[4, 6], [4, 6], [4, 6]]
print(df)
#         A       B       C
# 0  [1, 2]  [3, 4]  [4, 6]
# 1  [1, 2]  [3, 4]  [4, 6]
# 2  [1, 2]  [3, 4]  [4, 6]
D = df.apply(lambda row: [a+b for a, b in zip(row.A, row.B)], axis=1)
print(D)
# 0 [4, 6]
# 1 [4, 6]
# 2 [4, 6]
# dtype: object
df['D'] = D
print(df)
#         A       B       C       D
# 0  [1, 2]  [3, 4]  [4, 6]  [4, 6]
# 1  [1, 2]  [3, 4]  [4, 6]  [4, 6]
# 2  [1, 2]  [3, 4]  [4, 6]  [4, 6]
# list length (3) DOES match the number of DataFrame columns (3), does not work as expected:
df = pd.DataFrame()
df['A'] = [[1, 2, 3], [1, 2, 3], [1, 2, 3]]
df['B'] = [[3, 4, 5], [3, 4, 5], [3, 4, 5]]
df['C'] = [[4, 6, 8], [4, 6, 8], [4, 6, 8]]
print(df)
#            A          B          C
# 0  [1, 2, 3]  [3, 4, 5]  [4, 6, 8]
# 1  [1, 2, 3]  [3, 4, 5]  [4, 6, 8]
# 2  [1, 2, 3]  [3, 4, 5]  [4, 6, 8]
# unwanted result: apply() returns a DataFrame instead of a Series of lists
D = df.apply(lambda row: [a+b for a, b in zip(row.A, row.B)], axis=1)
print(D)
#    A  B  C
# 0  4  6  8
# 1  4  6  8
# 2  4  6  8
# correct result, built with a list comprehension instead of apply():
df['D'] = [[a+b for a, b in zip(row.A, row.B)] for row in df.itertuples()]
print(df)
#            A          B          C          D
# 0  [1, 2, 3]  [3, 4, 5]  [4, 6, 8]  [4, 6, 8]
# 1  [1, 2, 3]  [3, 4, 5]  [4, 6, 8]  [4, 6, 8]
# 2  [1, 2, 3]  [3, 4, 5]  [4, 6, 8]  [4, 6, 8]
Problem description
This behavior came up in a larger project, when the length of a stored list happened to match the DataFrame's dimension (here, its number of columns), which resulted in an exception being raised. I didn't expect apply() to return a DataFrame at all. I looked for a parameter to change this behavior, but didn't find one.
May be related to https://github.com/pandas-dev/pandas/issues/5299
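A note for readers on newer pandas than the 0.19.2 reported here: to the best of my knowledge, DataFrame.apply() gained a result_type parameter in pandas 0.23, and result_type='reduce' asks for a Series rather than an expanded DataFrame. The following is only a minimal sketch under that assumption. It mirrors the failing case above and does not apply to 0.19.2.

```python
import pandas as pd

# Same shape as the failing case: lists of length 3 in a 3-column frame.
df = pd.DataFrame({
    "A": [[1, 2, 3]] * 3,
    "B": [[3, 4, 5]] * 3,
    "C": [[4, 6, 8]] * 3,
})

# Assumption: pandas >= 0.23, where apply() accepts result_type.
# 'reduce' requests a Series of results instead of expanding each
# returned list into columns.
D = df.apply(
    lambda row: [a + b for a, b in zip(row["A"], row["B"])],
    axis=1,
    result_type="reduce",
)

print(type(D).__name__)
print(D.tolist())
```

If this holds for the installed version, D can be assigned directly with df['D'] = D, as in the first example.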
Expected Output
see above
Output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.11-1-ARCH
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 34.3.2
Cython: None
numpy: 1.12.1
scipy: 0.19.0
statsmodels: 0.8.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.3
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 2.0.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.9.5
boto: None
pandas_datareader: None
Comment From: jreback
xref https://github.com/pandas-dev/pandas/issues/14370 (and some linked issues).
You are on your own if you have lists inside a cell. This is not idiomatic (and certainly not performant in any way). If you really, really want to do this, then return tuples.
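A minimal sketch of what the tuple suggestion could look like, using the failing frame from the report. The only change is that the applied function builds a tuple instead of a list. This was written against a recent pandas and has not been checked against 0.19.2, so treat the result on that version as an assumption.

```python
import pandas as pd

# Same data as the failing case in the report.
df = pd.DataFrame({
    "A": [[1, 2, 3]] * 3,
    "B": [[3, 4, 5]] * 3,
    "C": [[4, 6, 8]] * 3,
})

# Return a tuple rather than a list from the row-wise function.
D = df.apply(
    lambda row: tuple(a + b for a, b in zip(row["A"], row["B"])),
    axis=1,
)

# Store the per-row tuples as a new object column.
df["D"] = D
print(df)
```

Each cell in column D then holds a tuple, so downstream code that mutates the stored lists in place would need adjusting.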