I am running into an issue trying to run OLS using pandas 0.13.1.
Here is a simple example: I want to regress a variable on itself, in this case excess returns. The intercept should be 0, and the coefficient should be 1. pandas provides the wrong estimates, while statsmodels gives the correct estimates.
This is not due to the silly regression specification, as I have noticed the pandas.ols estimates are inconsistent for other specifications as well.
Has anyone else encountered this problem?
import pandas as pd
import statsmodels.formula.api as sm
In [1]: pd.ols(y=test.exret,x=test.exret).beta
Out[1]:
x 0.003107
intercept 0.006438
dtype: float64
In [2]: sm.ols(formula="exret ~ exret", data=test).fit().params
Out[2]:
Intercept -3.469447e-18
exret 1.000000e+00
dtype: float64
In [3]: pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.11.0-19-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.13.1
Cython: 0.20.1
numpy: 1.8.0
scipy: 0.13.3
statsmodels: 0.5.0
IPython: 2.0.0-dev
sphinx: 1.2.1
patsy: 0.2.1
scikits.timeseries: None
dateutil: 1.5
pytz: 2013b
bottleneck: None
tables: 3.1.0
numexpr: 2.3.1
matplotlib: 1.3.1
openpyxl: 1.8.2
xlrd: 0.9.2
xlwt: 0.7.5
xlsxwriter: 0.5.2
sqlalchemy: 0.9.2
lxml: 3.3.1
bs4: 4.3.1
html5lib: None
bq: None
apiclient: None
Comment From: jreback
seems ok to me
In [7]: x = Series(np.random.randn(100))
In [8]: pd.ols(y=x,x=x)
Out[8]:
-------------------------Summary of Regression Analysis-------------------------
Formula: Y ~ <x> + <intercept>
Number of Observations: 100
Number of Degrees of Freedom: 2
R-squared: 1.0000
Adj R-squared: 1.0000
Rmse: 0.0000
F-stat (1, 98): inf, p-value: 0.0000
Degrees of Freedom: model 1, resid 98
-----------------------Summary of Estimated Coefficients------------------------
Variable Coef Std Err t-stat p-value CI 2.5% CI 97.5%
--------------------------------------------------------------------------------
x 1.0000 0.0000 45335499035463352.00 0.0000 1.0000 1.0000
intercept 0.0000 0.0000 0.65 0.5184 -0.0000 0.0000
---------------------------------End of Summary---------------------------------
In [9]: pd.ols(y=x,x=x).beta
Out[9]:
x 1.000000e+00
intercept 1.277919e-17
dtype: float64
In [12]: sm.ols(formula="x ~ x", data=x).fit().params
Out[12]:
Intercept 3.237783e-17
x 1.000000e+00
dtype: float64
Comment From: edwinhu
I see the same thing using your example. However, the issue seems to occur when the Series has row labels.
Is this the expected behavior?
In [1]: a = pd.Series(np.random.randn(5),index=['a','a','a','a','a']); pd.ols(y=a,x=a)
Out [1]:
-------------------------Summary of Regression Analysis-------------------------
Formula: Y ~ <x> + <intercept>
Number of Observations: 25
Number of Degrees of Freedom: 2
R-squared: -0.0000
Adj R-squared: -0.0435
Rmse: 0.5372
F-stat (1, 23): -0.0000, p-value: 1.0000
Degrees of Freedom: model 1, resid 23
-----------------------Summary of Estimated Coefficients------------------------
Variable Coef Std Err t-stat p-value CI 2.5% CI 97.5%
--------------------------------------------------------------------------------
x 0.0000 0.2085 0.00 1.0000 -0.4087 0.4087
intercept 0.0832 0.1088 0.76 0.4523 -0.1301 0.2965
---------------------------------End of Summary---------------------------------
In [2]: b = a.reset_index()
In [3]: pd.ols(y=b[0],x=b[0])
Out [3]:
-------------------------Summary of Regression Analysis-------------------------
Formula: Y ~ <x> + <intercept>
Number of Observations: 5
Number of Degrees of Freedom: 2
R-squared: 1.0000
Adj R-squared: 1.0000
Rmse: 0.0000
F-stat (1, 3): inf, p-value: 0.0000
Degrees of Freedom: model 1, resid 3
-----------------------Summary of Estimated Coefficients------------------------
Variable Coef Std Err t-stat p-value CI 2.5% CI 97.5%
--------------------------------------------------------------------------------
x 1.0000 0.0000 11982830741228190.00 0.0000 1.0000 1.0000
intercept 0.0000 0.0000 0.64 0.5693 -0.0000 0.0000
---------------------------------End of Summary---------------------------------
Comment From: jreback
Having duplicate labels rarely makes sense. How would you expect it to align the data?
This should probably raise an error on a duplicate index.
I don't know what statsmodels does in this case
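One way to see where the 25 observations above could come from: when both inputs share a duplicated key, a key-based join pairs every left row with every matching right row. The sketch below uses `pd.merge` on a made-up `key` column purely to illustrate that arithmetic. It is an assumption, not something confirmed in this thread, that `pd.ols` aligned its inputs in a comparable way. The data and column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: five rows that all carry the same label "a".
values = np.arange(5, dtype=float)
left = pd.DataFrame({"key": ["a"] * 5, "y": values})
right = pd.DataFrame({"key": ["a"] * 5, "x": values})

# Joining on a duplicated key pairs every left row with every right row.
joined = pd.merge(left, right, on="key")
n_joined = len(joined)

# Pairing rows by position keeps one observation per original row.
positional = pd.DataFrame({"y": left["y"].to_numpy(), "x": right["x"].to_numpy()})
n_positional = len(positional)

# Straight-line fits (np.polyfit returns [slope, intercept] for degree 1).
slope_joined, intercept_joined = np.polyfit(joined["x"], joined["y"], 1)
slope_pos, intercept_pos = np.polyfit(positional["x"], positional["y"], 1)

print(n_joined, n_positional)
print(slope_joined, intercept_joined)
print(slope_pos, intercept_pos)
```

In the joined frame every `y` is paired with every `x`, so the two columns are uncorrelated: the fitted slope collapses to roughly zero and the intercept to the mean of `y`, which resembles the pattern in the summary above.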
Comment From: jreback
@jseabold does patsy/sm align on the index?
Comment From: edwinhu
I noticed this issue when applying pd.ols to a GroupBy object on an indexed DataFrame.
GroupBy splits have "duplicate" row labels.
It seems that sm correctly ignores the duplicate row labels.
Comment From: jreback
It would be helpful to show some code; groupby in general won't produce a frame with a duplicate index.
Comment From: jseabold
We check for alignment.
https://github.com/statsmodels/statsmodels/blob/master/statsmodels/base/data.py#L308
Comment From: edwinhu
Sure. My data is organized by id and date. I have the dataframe indexed by id. It looks something like this (without the date column):
a = pd.Series(np.random.randn(5),index=['a','a','a','a','a'])
b = pd.Series(np.random.randn(5),index=['b','b','b','b','b'])
x = a.append(b)
grp = x.groupby(level=0)
grp.apply(lambda x: pd.ols(y=x,x=x).beta)
a x 0.000000e+00
intercept 9.327435e-02
b x -8.673617e-17
intercept 3.037757e-01
dtype: float64
sm.ols(formula="x ~ x",data=x).fit().params
Intercept 6.245005e-17
x 1.000000e+00
dtype: float64
Comment From: jreback
Referring this to statsmodels, as this functionality is not supported in pandas (though not deprecated as of yet).
Comment From: rsdenijs
What is the status of this issue? Duplicate indices silently mess up the number of observations in the resulting model. I guess a check for df.index.is_unique would solve this?
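Such a check could look something like the following. This is only a sketch: `require_unique_index` is a hypothetical helper, not part of pandas, and it relies on the existing `Index.is_unique` property and `Index.duplicated` method.

```python
import pandas as pd


def require_unique_index(obj, name="input"):
    """Raise ValueError if a Series/DataFrame has duplicate index labels.

    Hypothetical helper for illustration; not a pandas API.
    """
    if not obj.index.is_unique:
        # Collect each duplicated label once for the error message.
        dupes = obj.index[obj.index.duplicated()].unique().tolist()
        raise ValueError(f"{name} has duplicate index labels: {dupes}")
    return obj


ok = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
bad = pd.Series([1.0, 2.0, 3.0], index=["a", "a", "b"])

require_unique_index(ok, "ok")  # unique labels: returns the object unchanged

try:
    require_unique_index(bad, "bad")
    raised = False
    message = ""
except ValueError as exc:
    raised = True
    message = str(exc)

print(raised, message)
```

Calling a guard like this before fitting would turn the silent change in the number of observations into an explicit error.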
Comment From: jorisvandenbossche
@rsdenijs As @jreback pointed out in his last comment, this functionality is not supported anymore in pandas (it will also be effectively deprecated in the coming release, see https://github.com/pydata/pandas/pull/11898). So the status of this issue is that we do not plan to take any action on this.
Can you use statsmodels for your use case? (For OLS, everything should be in statsmodels; for the other functions in pandas, some things are still missing in statsmodels: https://github.com/statsmodels/statsmodels/issues/2745)
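For the per-group use case discussed earlier in the thread, a statsmodels version might look like the sketch below. The column names (`id`, `x`, `y`) and the data are invented for illustration; `statsmodels.formula.api.ols` and `DataFrame.groupby` are the only library calls assumed. Each group's DataFrame is passed to the formula API as a whole, so `y` and `x` come from the same rows even though the index labels repeat.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Invented data: two ids, each with an exact linear relation y = c + m * x.
x_vals = [0.0, 1.0, 2.0, 3.0, 4.0]
y_a = [1.0 + 2.0 * v for v in x_vals]   # id "a": intercept 1, slope 2
y_b = [-1.0 + 0.5 * v for v in x_vals]  # id "b": intercept -1, slope 0.5

df = pd.DataFrame(
    {"id": ["a"] * 5 + ["b"] * 5, "x": x_vals * 2, "y": y_a + y_b}
).set_index("id")  # the index now has duplicate labels, as in the thread

# Fit one OLS model per id and collect the coefficients, one row per id.
params = pd.DataFrame(
    {
        key: smf.ols("y ~ x", data=group).fit().params
        for key, group in df.groupby(level=0)
    }
).T

print(params)
```

Iterating over the groups explicitly, rather than using `groupby.apply`, keeps the example independent of how `apply` combines results across pandas versions.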