Feature Type
- [X] Adding new functionality to pandas
- [ ] Changing existing functionality in pandas
- [ ] Removing existing functionality in pandas
Problem Description
Series.corr(Series) can give pretty strange results when the two indexes don't match, because the Series are aligned on their index labels before the correlation is computed. An ignore_index option would be very useful.
Feature Description
e.g., s1.corr(s2, ignore_index=True)
Alternative Solutions
None
Additional Context
No response
Comment From: phofl
Hi, thanks for your report. So the keyword should ignore the index during the operation, not in the result? In that case we need a different name, since ignore_index already has a different meaning in pandas. Can you show an example?
Comment From: blazespinnaker
print(n1)
print(n2)
print(type(n1), type(n2))
print(scipy.stats.spearmanr(n1, n2))
print(n1.corr(n2, method="spearman"))
0 2317.0
1 2293.0
2 1190.0
3 972.0
4 1391.0
Name: r6000, dtype: float64
0.0 2317.0
1.0 2293.0
3.0 1190.0
4.0 972.0
5.0 1391.0
Name: 6000, dtype: float64
<class 'pandas.core.series.Series'> <class 'pandas.core.series.Series'>
SpearmanrResult(correlation=0.9999999999999999, pvalue=1.4042654220543672e-24)
0.7999999999999999
Here's an example of the issue: scipy compares the values by position, while Series.corr aligns on the index labels first, so the two results differ. Another approach might be to emit a warning, though that can get noisy. The benefit of a parameter that turns off the automatic alignment is that it also serves as a warning. A final approach would just be a note in the documentation.
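To make the mismatch concrete, here is a minimal sketch that rebuilds the two Series from the printed output above. The construction itself is an assumption, since the code that produced n1 and n2 was not posted. It performs the label alignment explicitly with Series.align(join="inner") and computes Spearman as the Pearson correlation of the ranks, so it runs without scipy.

```python
import pandas as pd

values = [2317.0, 2293.0, 1190.0, 972.0, 1391.0]

# Assumed reconstruction of the two Series printed above.
n1 = pd.Series(values, name="r6000")                        # labels 0, 1, 2, 3, 4
n2 = pd.Series(values, index=[0.0, 1.0, 3.0, 4.0, 5.0],     # label 2 missing, 5 extra
               name=6000)

# Keep only the labels present in both Series, pairing values by label.
a, b = n1.align(n2, join="inner")
shared_labels = sorted(a.index)

# Spearman correlation computed as the Pearson correlation of the ranks.
aligned_rho = a.rank().corr(b.rank())

# Positional comparison: discard both indexes first, then rank and correlate.
p1 = n1.reset_index(drop=True)
p2 = n2.reset_index(drop=True)
positional_rho = p1.rank().corr(p2.rank())

print(shared_labels)
print(aligned_rho, positional_rho)
```

Only four of the five rows survive the alignment, and the values paired under labels 3 and 4 come from different positions, which is why the label-aligned figure is lower than the positional one.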
Comment From: jreback
All operations in pandas align; if you don't want to align, then use .to_numpy()
-1 on changing anything - this would be a very special case
Comment From: blazespinnaker
I think reset_index(drop=True) is actually the right answer here. to_numpy() would mean going through scipy for the Spearman correlation.
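A minimal sketch of the two workarounds mentioned in this thread. The helper name corr_by_position is hypothetical and not part of pandas; it simply drops both indexes before calling Series.corr. The sample data is made up for illustration. For the default Pearson method, the .to_numpy() route needs only NumPy.

```python
import numpy as np
import pandas as pd


def corr_by_position(s1: pd.Series, s2: pd.Series, **kwargs) -> float:
    """Hypothetical helper (not a pandas API): correlate by position, ignoring labels."""
    if len(s1) != len(s2):
        raise ValueError("positional correlation needs equal-length Series")
    # drop=True discards the old index instead of turning it into a column.
    left = s1.reset_index(drop=True)
    right = s2.reset_index(drop=True)
    return left.corr(right, **kwargs)


# Illustrative data: same length, but the labels only partly overlap.
s1 = pd.Series([1.0, 2.0, 3.0, 4.0], index=[0, 1, 2, 3])
s2 = pd.Series([2.0, 4.0, 6.0, 9.0], index=[0, 1, 3, 4])

by_position = corr_by_position(s1, s2)

# The .to_numpy() route: plain arrays carry no labels, so nothing is aligned.
via_numpy = np.corrcoef(s1.to_numpy(), s2.to_numpy())[0, 1]

# Default behaviour: align on labels first, then correlate the shared rows.
by_label = s1.corr(s2)

print(by_position, via_numpy, by_label)
```

The length check is a design choice for this sketch: once the labels are discarded, a silent length mismatch would reintroduce the same kind of surprise the helper is meant to avoid.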
Your point is taken, but I guess what got me was that all operations do appear to sort of align but in rather arbitrary ways, leaving things quite confusing.
eg:
import pandas as pd
n1 = pd.DataFrame([(1, 2), (2, 4), (3, 5)], columns=['ind', 'val']).set_index('ind')
n2 = pd.DataFrame([(1, 2), (5, 4), (3, 5)], columns=['ind', 'val']).set_index('ind')
display(n1 + n2)  # display() is available in Jupyter/IPython
display(n1['val'] + n2['val'])
If it were consistent with corr, then n1+n2 should be a new DF containing only the index labels common to both, but instead the missing val is NaN.
If corr followed the same logic consistently, it would give a NaN correlation. When I saw a seemingly valid correlation I assumed everything was correct, which of course was mistaken.
A short note in the docs describing the assumptions made would at least help a bit.
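The difference described in this comment can be checked directly. The sketch below reuses the two frames from the example; it assumes nothing beyond pandas itself. Addition keeps the union of the labels and fills unmatched rows with NaN, while Series.corr keeps only the labels found in both Series.

```python
import pandas as pd

n1 = pd.DataFrame([(1, 2), (2, 4), (3, 5)], columns=["ind", "val"]).set_index("ind")
n2 = pd.DataFrame([(1, 2), (5, 4), (3, 5)], columns=["ind", "val"]).set_index("ind")

# Arithmetic: union of labels; labels found on one side only become NaN.
total = n1["val"] + n2["val"]
unmatched = sorted(total[total.isna()].index)

# Correlation: intersection of labels; unmatched rows are dropped silently.
rho = n1["val"].corr(n2["val"])

print(total)
print(unmatched, rho)
```

Here the correlation is computed from the two shared labels only, so it comes out as a clean-looking number with no hint that a row was discarded on each side.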
Comment From: blazespinnaker
Some other examples of inconsistent behavior with data alignment. I appreciate the tradeoffs made and even why they might be optimal, but a little bit of extra documentation here and there would help educate users like me better I think.
https://github.com/pandas-dev/pandas/issues/20831
https://github.com/pandas-dev/pandas/issues/47554
Comment From: MarcoGorelli
@blazespinnaker you might be interested in PDEP5, which may (mind you I said may, it's not yet been accepted, and it's still being ironed out) allow you to not need to think about alignment if you don't want to
Closing then, as I don't think there's anything actionable here. Regarding clarifying the docs, PRs to improve them are welcome, so feel free to submit one: https://pandas.pydata.org/docs/dev/development/contributing.html