Pandas Default iteration behavior of the DataFrame

Code Sample, a copy-pastable example if possible

for x in my_data_frame:
    print(repr(x))  # x is a column label, not a row

Problem description

The current behavior is to yield the column labels during normal iteration over a DataFrame. However, most of the time when actually using a DataFrame, we want to iterate over the rows. How useful in data analysis is it to iterate over the columns? In my nearly two years of using pandas, I have not once had to do this. If one needs this, my_data_frame.columns works just fine.

In addition, there is no my_data_frame.rows to work on. The only good way to iterate over rows is to use iterrows:

for index, row in my_data_frame.iterrows():
    print(row)

This seems like something that should be much more obvious than it is. To even find that answer, one pretty much needs to Google it and find the Stack Overflow answer.

Is there any way we could iterate over the rows when using the Python `in` syntax? I think that for most users this is a much more obvious thing to do during iteration than going through the columns.

Expected Output

Row output when iterating with the Python `in` keyword.

Output of `pd.show_versions()`

loaded rc file /Users/jadolfbr/.matplotlib/matplotlibrc
matplotlib version 1.5.1
verbose.level helpful
interactive is False
platform is darwin

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 9.0.1
setuptools: 20.3.1
Cython: None
numpy: 1.11.1
scipy: 0.13.0b1
statsmodels: 0.6.1
xarray: None
IPython: 4.1.2
sphinx: None
patsy: 0.4.0
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

Comment From: TomAugspurger

Docs are here: http://pandas-docs.github.io/pandas-docs-travis/basics.html#iteration

DataFrame.__iter__ behaves like iterating over a dictionary: the columns play the role of the keys.
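
To make the dictionary analogy concrete, here is a minimal sketch. The frame `df` and its column names are made up for illustration and are not from the original report.

```python
import pandas as pd

# Hypothetical example data; the column names are assumptions for illustration.
data = {"a": [1, 2], "b": [3.0, 4.0]}
df = pd.DataFrame(data)

# Iterating a dict yields its keys ...
dict_keys = list(data)

# ... and iterating a DataFrame yields its column labels in the same way.
labels = list(df)

print(dict_keys)
print(labels)
```

Under this view, `for x in df` is shorthand for `for x in df.columns`, which is why the loop in the report prints labels rather than rows.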

You might not want .iterrows() either, since it converts each row to a Series. If you have heterogeneous data, this will force object dtype. In practice, I find .itertuples() most useful for row-wise iteration.
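
A small sketch of the difference, assuming a hypothetical frame with one string, one integer, and one float column (the names `name`, `qty`, and `score` are invented for this example):

```python
import pandas as pd

# Hypothetical mixed-type frame; column names are assumptions for illustration.
df = pd.DataFrame({"name": ["x", "y"], "qty": [1, 2], "score": [0.5, 1.5]})

# iterrows(): each row is packed into a single Series, so the mixed
# string/int/float values have to share one dtype, which ends up as object.
_, first_row = next(df.iterrows())
row_dtype = first_row.dtype
print(row_dtype)

# itertuples(): each row is a namedtuple whose fields are the column names;
# index=False leaves the index out of the tuple.
first_tuple = next(df.itertuples(index=False))
print(first_tuple.name, first_tuple.qty, first_tuple.score)
```

Because no per-row Series is built, .itertuples() also tends to be noticeably faster than .iterrows() on larger frames, though the exact difference depends on the data.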

Even if we wanted to, this isn't something we could change without breaking a lot of people's code.

Comment From: jadolfbr

Thanks for the .itertuples() suggestion - will definitely look into this.

I get the analogy, but I still think it is not very useful as the default iteration. The same analogy could be used for rows, i.e. each row being an element of a list, like a list of dicts or named tuples. I code Rosetta, so I also get that people's code would break from a change this large, but I've had my code break from pandas changes many times so far as well...
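
For anyone who wants that row-oriented view today without changing the default, a minimal sketch follows. The frame and its column names are hypothetical; `to_dict("records")` and `itertuples` are existing pandas methods.

```python
import pandas as pd

# Hypothetical example data; the column names are assumptions for illustration.
df = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})

# Rows as a list of dicts, one dict per row keyed by column label.
records = df.to_dict("records")

# Rows as a list of named tuples (index left out).
tuples = list(df.itertuples(index=False))

for record in records:
    print(record)
```

Both forms build every row up front, so they suit small to medium frames better than very large ones.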
