Feature Type

  • [X] Adding new functionality to pandas

  • [ ] Changing existing functionality in pandas

  • [ ] Removing existing functionality in pandas

Problem Description

Series.corr(Series) can return surprising results when the two indexes do not match, because the values are aligned on index labels before the correlation is computed. An ignore_index option would be very useful.

Feature Description

e.g., s1.corr(s2, ignore_index=True)

Alternative Solutions

None

Additional Context

No response

Comment From: phofl

Hi, thanks for your report. Should the keyword ignore the index during the operation, rather than in the result? In that case we need a different name, since ignore_index already has a different meaning in pandas. Can you show an example?

Comment From: blazespinnaker

print(n1)
print(n2)
print(type(n1), type(n2))
print(scipy.stats.spearmanr(n1, n2))
print(n1.corr(n2, method="spearman"))
0    2317.0
1    2293.0
2    1190.0
3     972.0
4    1391.0
Name: r6000, dtype: float64
0.0    2317.0
1.0    2293.0
3.0    1190.0
4.0     972.0
5.0    1391.0
Name: 6000, dtype: float64
<class 'pandas.core.series.Series'> <class 'pandas.core.series.Series'>
SpearmanrResult(correlation=0.9999999999999999, pvalue=1.4042654220543672e-24)
0.7999999999999999

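Here's an example of the issue: both Series hold the same values, but their index labels differ, so corr reports roughly 0.8 while scipy reports roughly 1.0. Another approach might be to emit a warning, though that can get noisy. The benefit of a parameter that disables the automatic alignment is that it also serves as a warning that alignment happens. A final approach would be just a note in the documentation.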
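
A minimal sketch of what appears to happen in the example above, assuming the two Series are rebuilt from the printed output. Series.align(join="inner") is used here only to make the label matching visible, and Spearman is computed as Pearson on the ranks (valid here because there are no ties) so the sketch does not depend on scipy. The names left, right, and aligned_spearman are illustrative.

```python
import pandas as pd

# Rebuilt from the printed output above: same values, different index labels.
values = [2317.0, 2293.0, 1190.0, 972.0, 1391.0]
n1 = pd.Series(values, index=[0, 1, 2, 3, 4], name="r6000")
n2 = pd.Series(values, index=[0.0, 1.0, 3.0, 4.0, 5.0], name="6000")

# Keep only the labels present in both Series, as an inner join would.
left, right = n1.align(n2, join="inner")
print(left.tolist())   # values of n1 at labels 0, 1, 3, 4
print(right.tolist())  # values of n2 at labels 0, 1, 3, 4

# Spearman correlation is the Pearson correlation of the ranks (no ties here).
aligned_spearman = left.rank().corr(right.rank())
print(aligned_spearman)  # approximately 0.8, in line with the corr output above
```

Labels 2 and 5 each exist in only one Series, so those rows are dropped and the remaining values are paired by label rather than by position.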

Comment From: jreback

All operations in pandas align; if you don't want to align, then use .to_numpy().

-1 on changing anything - this would be a very special case

Comment From: blazespinnaker

I think reset_index() is actually the right answer here. to_numpy() would require scipy.
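
A hedged sketch of both workarounds, assuming the same two Series as in the earlier example. Note that Series.reset_index() without drop=True returns a DataFrame, so drop=True is used here to stay with Series. For Pearson, NumPy's corrcoef is enough on the arrays; a rank correlation on plain arrays would need scipy.stats or manual ranking. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

values = [2317.0, 2293.0, 1190.0, 972.0, 1391.0]
n1 = pd.Series(values, index=[0, 1, 2, 3, 4], name="r6000")
n2 = pd.Series(values, index=[0.0, 1.0, 3.0, 4.0, 5.0], name="6000")

# Option 1: replace both indexes with 0..n-1 so alignment matches by position.
p1 = n1.reset_index(drop=True)
p2 = n2.reset_index(drop=True)
positional_pearson = p1.corr(p2)
# Spearman as Pearson on ranks (no ties in this data).
positional_spearman = p1.rank().corr(p2.rank())

# Option 2: drop to NumPy arrays, which carry no labels to align on.
numpy_pearson = np.corrcoef(n1.to_numpy(), n2.to_numpy())[0, 1]

print(positional_pearson, positional_spearman, numpy_pearson)
```

All three results are approximately 1.0 here, because the values are identical once they are compared by position.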

Your point is taken, but I guess what got me was that all operations do appear to sort of align but in rather arbitrary ways, leaving things quite confusing.

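e.g.:

import pandas as pd

# Same values, but n2 has index label 5 where n1 has 2.
n1 = pd.DataFrame([(1, 2), (2, 4), (3, 5)], columns=['ind', 'val']).set_index('ind')
n2 = pd.DataFrame([(1, 2), (5, 4), (3, 5)], columns=['ind', 'val']).set_index('ind')
display(n1 + n2)                # display() is the Jupyter/IPython helper
display(n1['val'] + n2['val'])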

If it were consistent with corr, then n1+n2 would be a new DataFrame containing only the index labels that both share; instead, the unmatched labels get NaN.

Conversely, if corr followed the same logic as addition, it would return a NaN correlation. Because I saw a seemingly valid correlation, I assumed everything was correct, which of course was a mistake.

A short note in the docs describing the assumptions made would at least help a bit.
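
The difference described above can be sketched as follows, using the same two frames. This only illustrates the observed behavior (addition keeps every label, corr keeps only the shared ones); the names total, shared, and r are illustrative.

```python
import pandas as pd

n1 = pd.DataFrame([(1, 2), (2, 4), (3, 5)], columns=["ind", "val"]).set_index("ind")
n2 = pd.DataFrame([(1, 2), (5, 4), (3, 5)], columns=["ind", "val"]).set_index("ind")

# Addition keeps every label from either side; unmatched labels become NaN.
total = n1["val"] + n2["val"]
print(total)

# corr only uses the labels found in both Series (1 and 3 here).
shared = n1["val"].index.intersection(n2["val"].index)
r = n1["val"].corr(n2["val"])
print(list(shared), r)
```

With only two shared labels the Pearson result is trivially 1.0, which is the kind of seemingly valid number that can hide a label mismatch.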

Comment From: blazespinnaker

Here are some other examples of inconsistent data-alignment behavior. I appreciate the tradeoffs made, and even why they might be optimal, but a little extra documentation here and there would help educate users like me.

https://github.com/pandas-dev/pandas/issues/20831 https://github.com/pandas-dev/pandas/issues/47554

Comment From: MarcoGorelli

@blazespinnaker you might be interested in PDEP5, which may (and I do mean may: it has not been accepted yet and is still being ironed out) let you avoid thinking about alignment if you don't want to.

Closing then, as I don't think there's anything actionable here. Regarding clarifying the docs, PRs to improve them are welcome; feel free to submit one: https://pandas.pydata.org/docs/dev/development/contributing.html