Pandas Apply with dataframes as return values of function

Code Sample, a copy-pastable example if possible

import pandas as pd
import sys

parent = pd.DataFrame({"value1": [1, 2], "value2": [3, 4]}, index=["id0", "id1"])

def test_01(row):
    print(row)
    df = pd.DataFrame({"a": [1], "b": [2]}) * row.value1
    return [df]

def test_02(row):
    print(row)
    df = pd.DataFrame({"a": [1], "b": [2]}) * row.value1
    return df

try:
    result_01 = parent.apply(test_01, axis=1).apply(lambda x: x[0])
    print("test_01 succeeded\n")
except Exception:
    print(sys.exc_info()[:2])
    print("test_01 failed\n")

try:
    result_02 = parent.apply(test_02, axis=1)
    print("test_02 succeeded\n")
except Exception:
    print(sys.exc_info()[:2])
    print("test_02 failed\n")

Problem description

I ran into an issue with DataFrame.apply. In short, I want to apply a function row-wise to a pandas.DataFrame using DataFrame.apply, where the function itself returns a DataFrame, so that each element of the result contains a DataFrame (see "test_01" in the example above). However, if I return the DataFrame directly, I get errors. Instead, I have to return the DataFrame wrapped in a one-element list and then, in a second "apply" pass, extract the first element from each entry of the resulting Series object.

The overall question boils down to the minimal example above: why does "test_02" fail? I guess that "Series.apply" somehow handles return values differently from "DataFrame.apply", but from the docs [http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.apply.html] and [http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html] I can't figure out what the actual difference is.

Expected Output

The expected output is result_01, while the calculation of result_02 produces an error.

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: 0.9999999
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.9.4
boto: 2.42.0
pandas_datareader: None

Comment From: jorisvandenbossche

Storing DataFrames inside a Series/DataFrame is not really supported, and not guaranteed to work. So as somebody also answered on the mailing list, you should try to avoid this.

If all you want is the ability to get the correct subframe using indexing, you can also store the frames in a dictionary. The df = all_data.loc[measurementID, "measurement_data"] would then just become df = all_data_dict[measurementID]["measurement_data"]. And the apply could look like all_data_dict = {key: func(parent.loc[key]) for key in parent.index}.
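A minimal sketch of the dictionary approach, reusing the `parent` frame from the report. The name `func` is only a stand-in for the reporter's row function, and the final `pd.concat` step is an optional extra that is not part of the suggestion above:

```python
import pandas as pd

parent = pd.DataFrame({"value1": [1, 2], "value2": [3, 4]}, index=["id0", "id1"])

def func(row):
    # Stand-in for the row function; same computation as test_02 in the report.
    return pd.DataFrame({"a": [1], "b": [2]}) * row.value1

# One sub-frame per row of `parent`, keyed by the row label.
all_data_dict = {key: func(parent.loc[key]) for key in parent.index}

# Look up the sub-frame for a given id.
df = all_data_dict["id1"]
print(df)

# Optional (an assumption about what may be useful, not from the thread):
# stack the sub-frames into one frame whose outer index level is the row label.
combined = pd.concat(all_data_dict)
print(combined.loc["id1"])
```

This keeps every sub-frame an ordinary DataFrame and avoids nesting frames inside a Series.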

To your question on why the second function fails and the first one does not: pandas tries to fit the returned value into the resulting dataframe (e.g. if the function returns a series, its entries are used as the different columns of the resulting frame). This fails for a DataFrame because the dimensions don't match.
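To illustrate the case that does fit, here is a small sketch in which the row function returns a Series; the name `as_series` is hypothetical and only used for this illustration. Each returned Series becomes one row of the result, with its labels as the columns:

```python
import pandas as pd

parent = pd.DataFrame({"value1": [1, 2], "value2": [3, 4]}, index=["id0", "id1"])

def as_series(row):
    # Returning a Series: its labels ("a", "b") become columns of the result.
    return pd.Series({"a": 1, "b": 2}) * row.value1

expanded = parent.apply(as_series, axis=1)
print(expanded)
```

A one-dimensional return value maps onto one row of the output; a two-dimensional DataFrame has no such natural slot, which is the mismatch described above.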

Comment From: jreback

This is non-idiomatic pandas and confusing at best. This is a user error as described above.
