Pandas Resetting Index on slice

Code Sample, a copy-pastable example if possible

# Your code here
df = data_index[data_index['ntorsions'] == 2]

Problem description

When slicing a dataframe, the index is not reset by default. This becomes an issue if you want to output that dataframe, combine that dataframe with other dataframes (good luck with that), or output the dataframe without two index columns.

Fixing this will not break code in the wild.

Expected Output

Index being correct - without the need to manually call reset_index over and over again. This is much more intuitive to end users.

-> At end of slice, call reset_index(drop = True) on the returned dataframe or current dataframe if you are slicing in-place.

Output of `pd.show_versions()`

loaded rc file /Users/jadolfbr/.matplotlib/matplotlibrc
matplotlib version 1.5.1
verbose.level helpful
interactive is False
platform is darwin

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 9.0.1
setuptools: 20.3.1
Cython: None
numpy: 1.11.1
scipy: 0.13.0b1
statsmodels: 0.6.1
xarray: None
IPython: 4.1.2
sphinx: None
patsy: 0.4.0
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

Comment From: TomAugspurger

Can you make a full copy-pastable example (including constructing data_index).

Comment From: jadolfbr

Sure.

import pandas
data_index = pandas.read_table("data_index.tsv")
# The column in data_index.tsv is named 'ntorsions'
df = data_index[data_index['ntorsions'] == 2]
print(df)

Top Contents of data_index.tsv (sugar molecule linkages):

name    fname   ntorsions   torsion_bins
b-D-GlcpNAc ->4)-D-GlcpNAc  b-D-GlcpNAc_-_4_-D-GlcpNAc.tsv  2   2.3
a-D-Glcp ->4)-D-Glcp    a-D-Glcp_-_4_-D-Glcp.tsv    2   2.2
b-D-Manp ->4)-D-GlcpNAc b-D-Manp_-_4_-D-GlcpNAc.tsv 2   5.5
a-L-Fucp ->3)-D-GlcpNAc a-L-Fucp_-_3_-D-GlcpNAc.tsv 2   2.3
b-D-Xylp ->2)-D-Manp    b-D-Xylp_-_2_-D-Manp.tsv    2   3.4
a-D-Manp ->3)-D-Manp    a-D-Manp_-_3_-D-Manp.tsv    2   4.3
a-D-Manp ->6)-D-Manp    a-D-Manp_-_6_-D-Manp.tsv    3   2.4.4

Comment From: jorisvandenbossche

As you mention yourself, you can use reset_index(drop = True) to get the desired result.
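As a minimal sketch of what that looks like, assuming a small made-up frame in place of data_index.tsv (the values below are illustrative only, not taken from the real file):

```python
import pandas as pd

# Hypothetical stand-in for data_index.tsv; values are illustrative only.
data_index = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "ntorsions": [2, 3, 2, 3],
})

# Boolean filtering keeps the original row labels of the matching rows.
subset = data_index[data_index["ntorsions"] == 2]

# reset_index(drop=True) discards the old labels and renumbers from 0,
# without adding the old index as an extra column.
renumbered = subset.reset_index(drop=True)

print(list(subset.index))      # labels carried over from data_index
print(list(renumbered.index))  # fresh 0..n-1 labels
```

The filtered frame keeps labels 0 and 2, while the renumbered copy starts again at 0.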

Changing this (resetting the index automatically after a slice) would fundamentally change how pandas currently works, and it would certainly break a huge amount of code (much pandas code relies on the index holding certain values, especially with specific indexes like a DatetimeIndex).

There are some ideas for a future release of pandas to allow dataframes without an explicit index (see https://github.com/pandas-dev/pandas2/issues/17), but that is currently just a discussion, with no code or commitment that this will actually happen.

Comment From: jreback

@jadolfbr this is fundamental pandas behavior. The index is preserved thru virtually all operations. That's the point.

And combining is actually quite easy with .join, or assignment. Again, that's the fundamental thing that pandas does: it aligns on the index.
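A rough illustration of that alignment, using two made-up frames (the names `left` and `right` and all values are hypothetical):

```python
import pandas as pd

# Two hypothetical frames whose index labels overlap only partly
# and appear in a different order.
left = pd.DataFrame({"a": [10, 20, 30]}, index=[0, 2, 5])
right = pd.DataFrame({"b": [1.0, 2.0]}, index=[5, 0])

# join matches rows by index label, not by position.
joined = left.join(right)

# Assigning a Series also aligns on labels; labels missing from
# the right-hand side become NaN.
left["b"] = right["b"]

print(joined)
```

Row 0 picks up the value stored under label 0 in `right` even though it is listed second there, and label 2, which has no partner, is filled with NaN.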

Comment From: jadolfbr

I think options to allow dataframes without indexes would be great. They are extremely unwieldy without resetting that index. Maybe if you have multiple indexes with layers, etc, they would be good. However, these easily run into tons of problems in pandas as it stands now, so most of us in our lab shy away from that.

Here is a simple example. Maybe you can suggest a better way and say that I'm using pandas wrong. That's fine too. This is just to divide two values that are different experiments (and yes, in this case the row order does matter):

length_data['length_rr'] = (rr_data[rr_data['exp'] == 'mw']['length_rr'].reset_index(drop=True)\
                            /rr_data[rr_data['exp'] == 'mo']['length_rr'].reset_index(drop=True))

length_data2['length_rr'] = (rr_data[rr_data['exp'] == 'rmw']['length_rr'].reset_index(drop=True)\
                             /rr_data[rr_data['exp'] == 'rmo']['length_rr'].reset_index(drop=True))

length_enrich = pandas.concat([length_data, length_data2]).reset_index(drop=True)

Note that for the concat, pandas throws a duplicate index error if you do not reset the index with drop=True.
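For reference, pandas.concat also accepts ignore_index=True, which renumbers the result in one step. A small sketch with made-up stand-ins for length_data and length_data2 (values are illustrative only); in this sketch a plain concat keeps the repeated labels rather than raising:

```python
import pandas as pd

# Hypothetical stand-ins for the two frames above.
length_data = pd.DataFrame({"length_rr": [1.0, 2.0]})
length_data2 = pd.DataFrame({"length_rr": [3.0, 4.0]})

# A plain concat keeps each frame's own labels, so labels repeat.
stacked = pd.concat([length_data, length_data2])

# ignore_index=True discards the input labels and renumbers from 0.
length_enrich = pd.concat([length_data, length_data2], ignore_index=True)

print(list(stacked.index))
print(list(length_enrich.index))
```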

For joining, etc. the indexes can again get in the way. Many times you want to be joining based on some operation of the data, so we use merge. But I guess that might be preference.


Comment From: jreback

you are not using pandas power at all. you are in fact making a big assumption that the data that you are dividing is exactly the same length and perfectly lines up. maybe that's always true for you.

I would probably do something like this. In fact this is quite general and deals with missing labeled data.

In [34]: df = DataFrame({'l': [1, 1, 1, 2, 2, 2], 'obs': [1, 2, 3, 1, 2, 3], 'value': [1, 2, 3, 4, 5, 6]})

In [35]: df
Out[35]:
   l  obs  value
0  1    1      1
1  1    2      2
2  1    3      3
3  2    1      4
4  2    2      5
5  2    3      6

In [36]: df = df.set_index(['l', 'obs'])

In [37]: df
Out[37]:
       value
l obs
1 1        1
  2        2
  3        3
2 1        4
  2        5
  3        6

In [38]: df.value.loc[1] / df.value.loc[2]
Out[38]:
obs
1    0.25
2    0.40
3    0.50
Name: value, dtype: float64


This is a slightly different and IMHO better way of organizing things.

In [39]: df.unstack()
Out[39]:
    value
obs     1  2  3
l
1       1  2  3
2       4  5  6

In [40]: u = df.unstack()

In [41]: u.loc[1] / u.loc[2]
Out[41]:
       obs
value  1      0.25
       2      0.40
       3      0.50
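Carried over to the rr_data layout from the earlier comment, the same idea might look roughly like this. It assumes rr_data has some column that identifies matching rows across experiments; the `obs` column and all values below are hypothetical, since the real data was not posted:

```python
import pandas as pd

# Hypothetical rr_data: 'obs' is an assumed row identifier that says
# which 'mw' row pairs with which 'mo' row.
rr_data = pd.DataFrame({
    "exp": ["mo", "mo", "mw", "mw"],
    "obs": [1, 2, 1, 2],
    "length_rr": [4.0, 3.0, 2.0, 9.0],
})

# Index by experiment and observation, then select the value column.
by_exp = rr_data.set_index(["exp", "obs"])["length_rr"]

# Each .loc[...] returns a Series indexed by 'obs', so the division
# pairs rows by label and needs no reset_index.
ratio = by_exp.loc["mw"] / by_exp.loc["mo"]

print(ratio)
```

If an observation is missing from one experiment, the division yields NaN for that label instead of silently pairing the wrong rows.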

Comment From: jadolfbr

Thanks for the suggestion. Yes, this seems much better than what I was trying to do - use the indexes instead of fighting with them and trying to go around them. Makes sense. I guess this would make joining a whole lot more straightforward too. Awesome. Thanks for taking the time to write back.
