Code Sample, a copy-pastable example if possible
# Your code here
df = data_index[data_index['ntorsions'] == 2]
Problem description
When slicing a dataframe, the index is not reset by default. This becomes an issue if you want to output that dataframe, combine that dataframe with other dataframes (good luck with that), or output the dataframe without two index columns.
Fixing this will not break code in the wild.
Expected Output
Index being correct - without the need to manually call reset_index over and over again. This is much more intuitive to end users.
-> At end of slice, call reset_index(drop = True) on the returned dataframe or current dataframe if you are slicing in-place.
Output of pd.show_versions()
Comment From: TomAugspurger
Can you make a full copy-pastable example (including constructing data_index)?
Comment From: jadolfbr
Sure.
import pandas
data_index = pandas.read_table("data_index.tsv")
df = data_index[data_index['ntorsions'] == 2]
print(df)
Top contents of data_index.tsv (sugar molecule linkages):
name fname ntorsions torsion_bins
b-D-GlcpNAc ->4)-D-GlcpNAc b-D-GlcpNAc_-_4_-D-GlcpNAc.tsv 2 2.3
a-D-Glcp ->4)-D-Glcp a-D-Glcp_-_4_-D-Glcp.tsv 2 2.2
b-D-Manp ->4)-D-GlcpNAc b-D-Manp_-_4_-D-GlcpNAc.tsv 2 5.5
a-L-Fucp ->3)-D-GlcpNAc a-L-Fucp_-_3_-D-GlcpNAc.tsv 2 2.3
b-D-Xylp ->2)-D-Manp b-D-Xylp_-_2_-D-Manp.tsv 2 3.4
a-D-Manp ->3)-D-Manp a-D-Manp_-_3_-D-Manp.tsv 2 4.3
a-D-Manp ->6)-D-Manp a-D-Manp_-_6_-D-Manp.tsv 3 2.4.4
Comment From: jorisvandenbossche
As you mention yourself, you can use reset_index(drop=True) to get the desired result.
Changing this (resetting the index automatically on every slice) would fundamentally change how pandas currently works, and would certainly break a huge amount of code: a lot of pandas code relies on the index holding specific values, especially with specialized indexes such as a DatetimeIndex.
There are some ideas for a future release of pandas to allow dataframes without an explicit index (see https://github.com/pandas-dev/pandas2/issues/17), but that is currently just a discussion, with no code or commitment that this will actually happen.
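To make the current behavior concrete, here is a minimal sketch. The frame below is a made-up stand-in for data_index.tsv (only the ntorsions column name comes from the report; the values are invented), and it assumes a default integer index:

```python
import pandas as pd

# Hypothetical stand-in for data_index.tsv (values invented for illustration).
data_index = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "ntorsions": [2, 3, 2, 3],
})

# Boolean filtering keeps the original row labels of the selected rows.
filtered = data_index[data_index["ntorsions"] == 2]
print(filtered.index.tolist())    # [0, 2]

# Renumbering to 0..n-1 is an explicit, opt-in step.
renumbered = filtered.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1]
```

The preserved labels are what let a filtered result be matched back to the rows it came from.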
Comment From: jreback
@jadolfbr this is fundamental pandas behavior. The index is preserved through virtually all operations. That's the point.
And combining is actually quite easy with .join or assignment. Again, that's the fundamental thing that pandas does: it aligns on the index.
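A small sketch of what "aligns on the index" means in practice. The frames and column names here (left, right, other, a, b, c) are hypothetical and chosen only to show that rows are matched by label rather than by position:

```python
import pandas as pd

left = pd.DataFrame({"a": [10, 20, 30]}, index=[0, 2, 4])
# Same labels as `left`, but stored in a different order.
right = pd.Series([1.0, 2.0, 3.0], index=[4, 2, 0], name="b")

# Column assignment aligns on index labels, not on row position.
left["b"] = right
print(left["b"].tolist())  # [3.0, 2.0, 1.0]

# join also matches rows by index label; labels missing from `other` give NaN.
other = pd.DataFrame({"c": ["x", "y"]}, index=[2, 4])
joined = left.join(other)
print(joined)
```

Had the index been reset on either side first, the values would have been paired by position instead, and the mismatch in ordering would have gone unnoticed.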
Comment From: jadolfbr
I think options to allow dataframes without indexes would be great. They are extremely unwieldy without resetting that index. Maybe if you have multiple indexes with layers, etc, they would be good. However, these easily run into tons of problems in pandas as it stands now, so most of us in our lab shy away from that.
Here is a simple example. Maybe you can suggest a better way and say that I'm using pandas wrong. That's fine too. This is just to divide two values that come from different experiments (and yes, in this case the row order does matter):
length_data['length_rr'] = (rr_data[rr_data['exp'] == 'mw']['length_rr'].reset_index(drop=True)\
/rr_data[rr_data['exp'] == 'mo']['length_rr'].reset_index(drop=True))
length_data2['length_rr'] = (rr_data[rr_data['exp'] == 'rmw']['length_rr'].reset_index(drop=True)\
/rr_data[rr_data['exp'] == 'rmo']['length_rr'].reset_index(drop=True))
length_enrich = pandas.concat([length_data, length_data2]).reset_index(drop=True)
Note that for the concat, pandas throws a duplicate index error if you do not reset the index with drop.
For joining, etc. the indexes can again get in the way. Many times you want to be joining based on some operation of the data, so we use merge. But I guess that might be preference.
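For reference, a hedged sketch of two calls that touch on these points: pandas.concat accepts ignore_index=True, which renumbers the result in one step, and DataFrame.merge matches rows on column values instead of the index. The frames below are invented stand-ins (the label column and its values are hypothetical), not the lab's actual data:

```python
import pandas as pd

# Hypothetical frames standing in for length_data and length_data2.
first = pd.DataFrame({"length_rr": [1.5, 2.0]})
second = pd.DataFrame({"length_rr": [0.5, 4.0]})

# ignore_index=True discards the incoming labels and renumbers 0..n-1.
combined = pd.concat([first, second], ignore_index=True)
print(combined.index.tolist())  # [0, 1, 2, 3]

# merge matches rows on a column value rather than on the index.
lengths = pd.DataFrame({"exp": ["mw", "mo"], "length_rr": [3.0, 6.0]})
labels = pd.DataFrame({"exp": ["mo", "mw"], "label": ["control", "treated"]})
merged = lengths.merge(labels, on="exp")
print(merged)
```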
Comment From: jreback
You are not using pandas' power at all. You are in fact making a big assumption that the data you are dividing is exactly the same length and lines up perfectly. Maybe that's always true for you.
I would probably do something like this. In fact this is quite general and deals with missing labeled data.
In [34]: df = DataFrame({'l': [1, 1, 1, 2, 2, 2], 'obs': [1, 2, 3, 1, 2, 3], 'value': [1, 2, 3, 4, 5, 6]})
In [35]: df
Out[35]:
   l  obs  value
0  1    1      1
1  1    2      2
2  1    3      3
3  2    1      4
4  2    2      5
5  2    3      6
In [36]: df = df.set_index(['l', 'obs'])
In [37]: df
Out[37]:
       value
l obs
1 1        1
  2        2
  3        3
2 1        4
  2        5
  3        6
In [38]: df.value.loc[1] / df.value.loc[2]
Out[38]:
obs
1    0.25
2    0.40
3    0.50
Name: value, dtype: float64
This is a slightly different and IMHO better way of organizing things.
In [39]: df.unstack()
Out[39]:
    value
obs     1  2  3
l
1       1  2  3
2       4  5  6
In [40]: u = df.unstack()
In [41]: u.loc[1] / u.loc[2]
Out[41]:
       obs
value  1      0.25
       2      0.40
       3      0.50
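As a rough illustration of the "deals with missing labeled data" point, the same pattern can be sketched on a frame shaped like rr_data. The obs column and all values are assumptions made up for this example; only the exp and length_rr names come from the thread:

```python
import pandas as pd

# Hypothetical data: experiment 'mo' has no observation 3.
rr_data = pd.DataFrame({
    "exp": ["mo", "mo", "mw", "mw", "mw"],
    "obs": [1, 2, 1, 2, 3],
    "length_rr": [4.0, 5.0, 1.0, 2.0, 3.0],
})

# Label each value by (experiment, observation).
lengths = rr_data.set_index(["exp", "obs"])["length_rr"]

# Division aligns on 'obs'; a label present on only one side yields NaN
# instead of silently pairing values from different observations.
ratio = lengths.loc["mw"] / lengths.loc["mo"]
print(ratio)
```

With positional division after reset_index(drop=True), a gap like this would shift every later row by one without any warning.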
Comment From: jadolfbr
Thanks for the suggestion. Yes, this seems much better than what I was trying to do - use the indexes instead of fighting with them and trying to go around them. Makes sense. I guess this would make joining a whole lot more straightforward too. Awesome. Thanks for taking the time to write back.