In R, the summary(data) function reports counts of the most frequent values for categorical columns. For datetime and numeric columns, it gives the mean, quartiles and range.
The describe function in pandas is similar, but its output for heterogeneous DataFrames is much harder to read. Since this output is meant primarily for people to read, we could add a new summary() function with a tabular layout designed solely for that purpose.
Examples of the R summary function
Heterogeneous columns from input data below:
A,1
B,2
C,3
A,5
summary(d)
 V1         V2
 A:2   Min.   :1.00
 B:1   1st Qu.:1.75
 C:1   Median :2.50
       Mean   :2.75
       3rd Qu.:3.50
       Max.   :5.00
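For comparison, here is a minimal sketch of how the same numbers can be computed column by column in pandas today. It assumes the numeric column holds 1, 2, 3 and 5, which are the values consistent with the statistics R printed above. The column names V1 and V2 simply mirror R's defaults, and the labels are copied from R's output rather than taken from any pandas API.

```python
import pandas as pd

# Four rows matching the R example; V2 = 1, 2, 3, 5 is an assumption
# inferred from the printed statistics (mean 2.75, max 5.00).
d = pd.DataFrame({"V1": ["A", "B", "C", "A"], "V2": [1, 2, 3, 5]})

# Categorical column: count of each distinct value, most frequent first.
v1_counts = d["V1"].value_counts()

# Numeric column: the six statistics R reports, in R's order.
# pandas' default linear interpolation agrees with R's default quantiles here.
v2 = d["V2"]
v2_stats = pd.Series(
    {
        "Min.": v2.min(),
        "1st Qu.": v2.quantile(0.25),
        "Median": v2.median(),
        "Mean": v2.mean(),
        "3rd Qu.": v2.quantile(0.75),
        "Max.": v2.max(),
    }
)

print(v1_counts)
print(v2_stats)
```

This yields two separate objects, one per column, which is exactly the part R's summary() hides by laying the per-column results out side by side.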
Datetime example:
    datetime                         data
 Min.   :2012-12-22 12:00:00   Min.   :-1.392
 1st Qu.:2012-12-22 18:00:00   1st Qu.: 7.906
 Median :2012-12-23 00:00:00   Median :10.836
 Mean   :2012-12-23 00:00:00   Mean   :11.271
 3rd Qu.:2012-12-23 06:00:00   3rd Qu.:16.362
 Max.   :2012-12-23 12:00:00   Max.   :21.097
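The datetime statistics can be sketched in pandas as well, since Series.quantile and Series.mean accept datetime64 data. The input below is an assumption: the original data is not shown, so this uses 25 hourly timestamps spanning the same range, which happens to give the same quartiles as the R output.

```python
import pandas as pd

# Assumed input: 25 hourly timestamps covering the range in the R output.
ts = pd.Series(
    pd.date_range("2012-12-22 12:00:00", periods=25, freq=pd.Timedelta(hours=1))
)

# The same six statistics R reports for a datetime column.
dt_stats = pd.Series(
    {
        "Min.": ts.min(),
        "1st Qu.": ts.quantile(0.25),
        "Median": ts.quantile(0.5),
        "Mean": ts.mean(),
        "3rd Qu.": ts.quantile(0.75),
        "Max.": ts.max(),
    }
)

print(dt_stats)
```

The mean is computed in floating point, so it may differ from the exact midpoint by a tiny fraction of a second.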
I discussed this with @jreback, and he suggested starting with a new function rather than changing describe.
This is how describe currently handles heterogeneous columns, which shows how hard the output is to read:
In [3]: df = pd.DataFrame({'A' : list('aab'), 'B' : range(3)})
In [4]: df
Out[4]:
   A  B
0  a  0
1  a  1
2  b  2
In [5]: df.describe()
Out[5]:
         B
count  3.0
mean   1.0
std    1.0
min    0.0
25%    0.5
50%    1.0
75%    1.5
max    2.0
In [7]: df.describe(include='all')
Out[7]:
          A    B
count     3  3.0
unique    2  NaN
top       a  NaN
freq      2  NaN
mean    NaN  1.0
std     NaN  1.0
min     NaN  0.0
25%     NaN  0.5
50%     NaN  1.0
75%     NaN  1.5
max     NaN  2.0
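To make the proposal concrete, here is a rough sketch of what a standalone summary() helper could look like. Everything about it is hypothetical: the name, the max_levels parameter, and the "label: value" cell format are illustrative choices, not an existing pandas API or an agreed design. It builds one column of text cells per input column and pads the shorter ones with empty strings, so no NaN filler appears.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def summary(df: pd.DataFrame, max_levels: int = 6) -> pd.DataFrame:
    """Hypothetical R-style summary: one column of text cells per input column."""
    cells = {}
    for name in df.columns:
        col = df[name]
        if is_numeric_dtype(col) and not is_bool_dtype(col):
            # Numeric column: the six statistics R prints, in R's order.
            stats = {
                "Min.": col.min(),
                "1st Qu.": col.quantile(0.25),
                "Median": col.median(),
                "Mean": col.mean(),
                "3rd Qu.": col.quantile(0.75),
                "Max.": col.max(),
            }
            cells[name] = [
                f"{label:<8}: {float(value):.2f}" for label, value in stats.items()
            ]
        else:
            # Anything else: counts of the most frequent values.
            counts = col.value_counts().head(max_levels)
            cells[name] = [f"{value}: {count}" for value, count in counts.items()]

    # Pad shorter columns with empty strings instead of NaN.
    height = max(len(lines) for lines in cells.values())
    padded = {
        name: lines + [""] * (height - len(lines)) for name, lines in cells.items()
    }
    return pd.DataFrame(padded)


df = pd.DataFrame({"A": list("aab"), "B": range(3)})
result = summary(df)
print(result.to_string(index=False))
```

Returning a DataFrame of strings keeps the sketch short, but the result is only good for display. A real implementation would need to decide how to treat datetime, boolean and empty columns, none of which this sketch handles specially.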
Comment From: mroeschke
Thanks for the suggestion, but it appears there hasn't been much appetite for this issue over the years, so closing. Happy to reopen if there's renewed interest.