Is there a function which standardizes data in the columns of a dataframe? If not, can we introduce it?

Standardization rescales data to have a mean of zero and a standard deviation of one. I use it regularly for two purposes:

  • To facilitate interpretation of regression estimates. Here is a related discussion on Cross Validated.
  • To facilitate inspection of time-series plots. Especially when I want to see whether different series co-move over time, plotting them on the same scale helps.

So far, I use a simple function:

def standardize(df, label):
    """
    Standardize the series named ``label`` within the pd.DataFrame ``df``.
    """
    series = df.loc[:, label]
    avg = series.mean()
    stdv = series.std()  # sample standard deviation (pandas default, ddof=1)
    # The arithmetic below returns a new Series, so df itself is not modified.
    series_standardized = (series - avg) / stdv
    return series_standardized

I wondered whether there could be a standardize method that can be used similarly to rolling, such as df.standardize().

Comment From: gfyoung

@FabianSchuetze : Thanks for the proposal! My only concern is: what happens when the DataFrame does not contain numeric values? Suddenly, standardize doesn't mean much.

I'm hesitant to add this only because this method would only be applicable for the numeric case. In addition, the implementation isn't that hard to do oneself, as you have demonstrated.
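
To make the non-numeric concern concrete, here is a minimal sketch of one way a user-side helper could deal with mixed dtypes: standardize only the numeric columns and pass everything else through. The name `standardize_numeric` and the sample data are hypothetical, and this is not a proposed pandas API.

```python
import pandas as pd


def standardize_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: z-score numeric columns, leave the rest untouched."""
    out = df.copy()
    # select_dtypes(include="number") picks integer and float columns only.
    for col in out.select_dtypes(include="number").columns:
        out[col] = (df[col] - df[col].mean()) / df[col].std()
    return out


# Illustrative data: two numeric columns and one string column.
df = pd.DataFrame(
    {"x": [1.0, 2.0, 3.0], "y": [10, 20, 30], "name": ["a", "b", "c"]}
)
result = standardize_numeric(df)
```

Whether silently skipping non-numeric columns is the right behavior is itself a design choice; raising an error would be an equally defensible alternative.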

Comment From: jorisvandenbossche

Personally, given that this is a rather simple function to write yourself, I don't feel it warrants adding a new method to DataFrame/Series.

Comment From: jreback

-0 on this. This just adds to an already very large API. There are many definitions of this, and which name shall we pick? IIRC we rejected zscore in the past.
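
As an illustration of the "many definitions" point, the sketch below contrasts two common conventions that differ only in the degrees of freedom used for the standard deviation. pandas' `Series.std()` defaults to `ddof=1`, while `scipy.stats.zscore` defaults to `ddof=0`; the sample values are made up for the example.

```python
import pandas as pd

# Illustrative data with mean 5 and population standard deviation 2.
s = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Sample convention: divide by the ddof=1 standard deviation (pandas default).
z_sample = (s - s.mean()) / s.std()

# Population convention: divide by the ddof=0 standard deviation.
z_population = (s - s.mean()) / s.std(ddof=0)

# The two results differ by a constant factor of sqrt((n - 1) / n).
ratio = (z_sample / z_population).dropna().iloc[0]
```

Both results have mean zero, but each has unit standard deviation only under its own convention, so a built-in method would have to pick one or expose `ddof` as a parameter.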

Comment From: FabianSchuetze

Thank you for all your replies! I agree very much that the function itself is simple (albeit useful), and if you rejected a function called zscore in the past, standardize shouldn't be included either.

Comment From: mwaskom

IIRC we rejected zscore in the past

Indeed you have, which remains a pain when trying to do "piped" processing of a series (cf #12515)

Comment From: jreback

as I wrote before here

In [13]: standardize = lambda x: (x-x.mean()) / x.std()

In [14]: s = pd.Series(np.random.rand(10))
    ...: 
    ...: (s-s.mean())/s.std()
    ...: 
Out[14]: 
0    0.395159
1    0.611805
2   -1.976001
3    0.512755
4    0.954300
5   -0.873228
6   -0.988174
7   -0.099802
8    0.196835
9    1.266350
dtype: float64

In [15]: standardize = lambda x: (x-x.mean()) / x.std()

In [16]: s.pipe(standardize)
Out[16]: 
0    0.395159
1    0.611805
2   -1.976001
3    0.512755
4    0.954300
5   -0.873228
6   -0.988174
7   -0.099802
8    0.196835
9    1.266350
dtype: float64
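
For use inside a longer method chain, the same idea can be written as a named function instead of a lambda. This is only a sketch: the `ddof` parameter and the sample data are assumptions added for illustration, and `pipe` simply forwards extra keyword arguments to the function.

```python
import pandas as pd


def standardize(x, ddof=1):
    """Subtract the mean and divide by the standard deviation."""
    return (x - x.mean()) / x.std(ddof=ddof)


# Illustrative series with a missing value that is dropped before scaling.
s = pd.Series([1.0, None, 2.0, 3.0])

result = s.dropna().pipe(standardize)

# Extra keyword arguments are passed through by pipe.
result_population = s.dropna().pipe(standardize, ddof=0)
```

Because `DataFrame.mean` and `DataFrame.std` operate column-wise by default, the same function can also be passed to `DataFrame.pipe` when all columns are numeric.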

Comment From: mwaskom

For the record, the fact that pandas doesn't handle using scipy.stats.zscore properly here, so that the user needs to write a lambda (for an extremely common operation), remains incredibly annoying.