Pandas pd.stats.api.ols inconsistent estimates

I am running into an issue trying to run OLS using pandas 0.13.1.

Here is a simple example: I want to regress a variable on itself, in this case excess returns. The intercept should be 0, and the coefficient should be 1. pandas provides the wrong estimates, while statsmodels gives the correct estimates.

This is not due to the silly regression specification, as I have noticed the pandas.ols estimates are inconsistent for other specifications as well.

Has anyone else encountered this problem?

import pandas as pd
import statsmodels.formula.api as sm

In [1]: pd.ols(y=test.exret,x=test.exret).beta
Out[1]: 
x            0.003107
intercept    0.006438
dtype: float64

In [2]: sm.ols(formula="exret ~ exret", data=test).fit().params
Out[2]: 
Intercept   -3.469447e-18
exret        1.000000e+00
dtype: float64

In [3]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.11.0-19-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.13.1
Cython: 0.20.1
numpy: 1.8.0
scipy: 0.13.3
statsmodels: 0.5.0
IPython: 2.0.0-dev
sphinx: 1.2.1
patsy: 0.2.1
scikits.timeseries: None
dateutil: 1.5
pytz: 2013b
bottleneck: None
tables: 3.1.0
numexpr: 2.3.1
matplotlib: 1.3.1
openpyxl: 1.8.2
xlrd: 0.9.2
xlwt: 0.7.5
xlsxwriter: 0.5.2
sqlalchemy: 0.9.2
lxml: 3.3.1
bs4: 4.3.1
html5lib: None
bq: None
apiclient: None

Comment From: jreback

seems ok to me

In [7]: x = Series(np.random.randn(100))

In [8]: pd.ols(y=x,x=x)
Out[8]: 

-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <x> + <intercept>

Number of Observations:         100
Number of Degrees of Freedom:   2

R-squared:         1.0000
Adj R-squared:     1.0000

Rmse:              0.0000

F-stat (1, 98):        inf, p-value:     0.0000

Degrees of Freedom: model 1, resid 98

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             x     1.0000     0.0000 45335499035463352.00     0.0000     1.0000     1.0000
     intercept     0.0000     0.0000       0.65     0.5184    -0.0000     0.0000
---------------------------------End of Summary---------------------------------

In [9]: pd.ols(y=x,x=x).beta
Out[9]: 
x            1.000000e+00
intercept    1.277919e-17
dtype: float64

In [12]: sm.ols(formula="x ~ x", data=x).fit().params
Out[12]: 
Intercept    3.237783e-17
x            1.000000e+00
dtype: float64

Comment From: edwinhu

I see the same thing using your example. However, the issue seems to occur when the Series has row labels (here, repeated ones).

Is this the expected behavior?

In [1]: a = pd.Series(np.random.randn(5),index=['a','a','a','a','a'])
   ...: pd.ols(y=a,x=a)
Out [1]: 

-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <x> + <intercept>

Number of Observations:         25
Number of Degrees of Freedom:   2

R-squared:        -0.0000
Adj R-squared:    -0.0435

Rmse:              0.5372

F-stat (1, 23):    -0.0000, p-value:     1.0000

Degrees of Freedom: model 1, resid 23

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             x     0.0000     0.2085       0.00     1.0000    -0.4087     0.4087
     intercept     0.0832     0.1088       0.76     0.4523    -0.1301     0.2965
---------------------------------End of Summary---------------------------------

In [2]: b = a.reset_index()

In [3]: pd.ols(y=b[0],x=b[0])
Out [3]: 

-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <x> + <intercept>

Number of Observations:         5
Number of Degrees of Freedom:   2

R-squared:         1.0000
Adj R-squared:     1.0000

Rmse:              0.0000

F-stat (1, 3):        inf, p-value:     0.0000

Degrees of Freedom: model 1, resid 3

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             x     1.0000     0.0000 11982830741228190.00     0.0000     1.0000     1.0000
     intercept     0.0000     0.0000       0.64     0.5693    -0.0000     0.0000
---------------------------------End of Summary---------------------------------

Comment From: jreback

Having duplicate labels rarely makes sense. How would you expect it to align the data?

It should probably raise an error when the index has duplicates.

I don't know what statsmodels does in this case
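
For illustration, here is a minimal sketch of why duplicate labels are a problem for label-based alignment. It assumes pd.ols aligns y and x on their index labels the way a join does. That is an assumption, though it is consistent with the 25 observations reported above for a 5-row Series. The names paired, slope and intercept are only illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = pd.Series(rng.standard_normal(5), index=["a"] * 5)

# Joining on a non-unique index pairs every "a" row of y with every
# "a" row of x, i.e. a cross product of the duplicated label.
paired = a.to_frame("y").join(a.to_frame("x"))
print(len(paired))  # 5 * 5 rows instead of 5

# Plain least squares on the cross-joined pairs (numpy only, no pd.ols).
slope, intercept = np.polyfit(paired["x"].to_numpy(), paired["y"].to_numpy(), 1)
print(slope, intercept, a.mean())
```

Every y value ends up paired with every x value, so the two columns are uncorrelated by construction. The fitted slope is essentially 0 and the intercept is the mean of the Series, which is the pattern in the summary above.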

Comment From: jreback

@jseabold does patsy/sm align on the index?

Comment From: edwinhu

I noticed this issue when applying pd.ols to a GroupBy object built from an indexed DataFrame.

Each GroupBy split has "duplicate" row labels.

It seems that sm correctly ignores the duplicate row labels.

Comment From: jreback

It would be helpful to show some code. In general, groupby won't produce a frame with a duplicate index.

Comment From: jseabold

We check for alignment.

https://github.com/statsmodels/statsmodels/blob/master/statsmodels/base/data.py#L308

Comment From: edwinhu

Sure. My data is organized by id and date. I have the dataframe indexed by id. It looks something like this (without the date column):

a = pd.Series(np.random.randn(5),index=['a','a','a','a','a'])
b = pd.Series(np.random.randn(5),index=['b','b','b','b','b'])

x = a.append(b)

grp = x.groupby(level=0)

grp.apply(lambda x: pd.ols(y=x,x=x).beta)

a  x            0.000000e+00
   intercept    9.327435e-02
b  x           -8.673617e-17
   intercept    3.037757e-01
dtype: float64

sm.ols(formula="x ~ x",data=x).fit().params

Intercept    6.245005e-17
x            1.000000e+00
dtype: float64

Comment From: jreback

Referring you to statsmodels, as this functionality is not supported (though not deprecated as of yet).

Comment From: rsdenijs

What is the status of this issue? Duplicate indices silently mess up the number of observations in the resulting model. I guess a check for df.index.is_unique would solve this?
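
A minimal sketch of what such a check could look like on the caller's side. The helper require_unique_index is hypothetical and not part of pandas. It only uses Index.is_unique and Index.duplicated.

```python
import numpy as np
import pandas as pd

def require_unique_index(obj, name="input"):
    """Hypothetical guard: fail loudly rather than let duplicate labels
    multiply rows during index alignment."""
    if not obj.index.is_unique:
        dupes = obj.index[obj.index.duplicated()].unique().tolist()
        raise ValueError(f"{name} has duplicate index labels: {dupes}")
    return obj

a = pd.Series(np.random.randn(5), index=["a"] * 5)

try:
    require_unique_index(a, name="y")
except ValueError as exc:
    print(exc)

# Dropping the labels gives a unique positional index that passes the check.
b = require_unique_index(a.reset_index(drop=True), name="y")
```

Calling it on both y and x before fitting would turn the silent row inflation into an explicit error.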

Comment From: jorisvandenbossche

@rsdenijs As @jreback pointed out in his last comment, this is not supported anymore in pandas (these functions will also be effectively deprecated in the coming release, see https://github.com/pydata/pandas/pull/11898). So the status of this issue is that we do not plan to take any action on this.

Can you use statsmodels for your use case? (For OLS, everything should be in statsmodels; for the other functions in pandas, there are still some things missing in statsmodels: https://github.com/statsmodels/statsmodels/issues/2745)
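
As a rough sketch of the statsmodels route for the per-group case discussed earlier in the thread: the data, the column names x and y, and the helper fit_group are made up for illustration. Only groupby(...).apply and statsmodels.formula.api.ols are real API.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Hypothetical data: two ids with five observations each, indexed by id,
# so the row labels repeat within each group.
idx = ["a"] * 5 + ["b"] * 5
x = rng.standard_normal(10)
df = pd.DataFrame({"x": x, "y": 1.0 + 2.0 * x}, index=idx)

def fit_group(g):
    # y and x are columns of the same frame, so they are already matched
    # row by row; nothing is joined on the repeated labels.
    return smf.ols("y ~ x", data=g).fit().params

params = df.groupby(level=0).apply(fit_group)
print(params)
```

Here params has one row per id, with the fitted Intercept and x coefficient as columns.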