Is there a function which standardizes data in the columns of a dataframe? If not, can we introduce it?
Through standardization, one can re-scale the data to have a mean of zero and a standard deviation of one. I use standardization regularly for two purposes:
- To facilitate interpretation of regression estimates. Here is a related discussion on Cross Validated.
- To facilitate inspection of time-series plots. Especially when I want to understand whether different series co-move over time, plotting them on the same scale helps.
So far, I use a simple function:
def standardize(df, label):
    """
    Standardize the series named ``label`` within the pd.DataFrame ``df``.
    """
    # Work on a copy so the caller's frame is left untouched.
    df = df.copy(deep=True)
    series = df.loc[:, label]
    avg = series.mean()
    stdv = series.std()  # pandas default: sample standard deviation (ddof=1)
    series_standardized = (series - avg) / stdv
    return series_standardized
I wondered whether there could be a standardize method that is used similarly to rolling, such as df.standardize().
Comment From: gfyoung
@FabianSchuetze : Thanks for the proposal! My only concern is what happens when the DataFrame does not contain numeric values. In that case, standardize doesn't mean much.
I'm hesitant to add this because the method would only be applicable to the numeric case. In addition, the implementation isn't that hard to write oneself, as you demonstrated.
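One way to sidestep the non-numeric concern in user code is to standardize only the numeric columns and pass everything else through. The following is a minimal sketch, not a pandas API: the helper name standardize_numeric and the example column names are hypothetical.

```python
import pandas as pd


def standardize_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: z-score numeric columns, leave the rest as-is."""
    out = df.copy()
    # select_dtypes(include="number") picks integer and float columns only.
    numeric = out.select_dtypes(include="number")
    standardized = (numeric - numeric.mean()) / numeric.std()
    for col in standardized.columns:
        out[col] = standardized[col]
    return out


df = pd.DataFrame(
    {"x": [1.0, 2.0, 3.0], "y": [10, 20, 40], "name": ["a", "b", "c"]}
)
result = standardize_numeric(df)
```

The string column name is returned unchanged, while x and y are re-scaled to mean zero and (sample) standard deviation one.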
Comment From: jorisvandenbossche
Personally, given that this is a rather simple function to write yourself, I don't feel this warrants adding a new method to DataFrame/Series.
Comment From: jreback
-0 on this. This just adds to an already very large API. There are many definitions of this; which name shall we pick? IIRC we rejected zscore in the past.
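One concrete instance of the "many definitions" problem is the choice of ddof. Series.std() defaults to the sample standard deviation (ddof=1), whereas np.std and scipy.stats.zscore default to the population version (ddof=0). A small sketch of how the two conventions diverge on the same data (the variable names are only illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Sample convention: divide by the ddof=1 standard deviation (pandas default).
z_sample = (s - s.mean()) / s.std()

# Population convention: divide by the ddof=0 standard deviation
# (the default of np.std and scipy.stats.zscore).
z_population = (s - s.mean()) / s.std(ddof=0)

# The two results differ by a constant factor of sqrt(n / (n - 1)).
ratio = (z_population / z_sample).dropna()
```

Any built-in method would have to pick one of these as its default, which is part of what makes naming and defining it contentious.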
Comment From: FabianSchuetze
Thank you for all your replies! I agree very much that the function itself is simple (albeit useful), and if you rejected a function called zscore in the past, standardize shouldn't be included either.
Comment From: mwaskom
> IIRC we rejected zscore in the past
Indeed you have, which remains a pain when trying to do "piped" processing of a series (cf. #12515).
Comment From: jreback
as I wrote before here
In [13]: standardize = lambda x: (x-x.mean()) / x.std()

In [14]: s = pd.Series(np.random.rand(10))
    ...:
    ...: (s-s.mean())/s.std()
    ...:
Out[14]:
0    0.395159
1    0.611805
2   -1.976001
3    0.512755
4    0.954300
5   -0.873228
6   -0.988174
7   -0.099802
8    0.196835
9    1.266350
dtype: float64

In [15]: standardize = lambda x: (x-x.mean()) / x.std()

In [16]: s.pipe(standardize)
Out[16]:
0    0.395159
1    0.611805
2   -1.976001
3    0.512755
4    0.954300
5   -0.873228
6   -0.988174
7   -0.099802
8    0.196835
9    1.266350
dtype: float64
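Since pipe forwards extra arguments to the function it calls, a named function can expose options that a one-line lambda hides. A minimal sketch, assuming a hypothetical helper with a ddof keyword (neither the helper nor the keyword is part of pandas; the column names are made up):

```python
import numpy as np
import pandas as pd


def zscore_series(s: pd.Series, ddof: int = 1) -> pd.Series:
    """Hypothetical helper: z-score a Series with a configurable ddof."""
    return (s - s.mean()) / s.std(ddof=ddof)


rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.random(10), "b": rng.random(10) * 100})

# Series.pipe passes keyword arguments through to the function.
z_a = df["a"].pipe(zscore_series, ddof=0)

# DataFrame.apply runs the same function on each column in turn.
z_df = df.apply(zscore_series)
```

The result keeps the original index in both cases, so it can be chained with further Series or DataFrame methods.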
Comment From: mwaskom
For the record, the fact that pandas doesn't handle using scipy.stats.zscore properly here, so that the user needs to write a lambda (for an extremely common operation), remains incredibly annoying.