Feature request: can we add "sum" to the pandas.describe() method? Not every dataset cares about sum, of course, but enough datasets do care about it that it seems worth adding to the default describe() method. (For instance, financial applications where you have multiple columns representing different scenarios, and you want to see how much profit would be earned in each scenario.)
Comment From: jreback
xref https://github.com/pandas-dev/pandas/issues/7014
I am a bit - on this ATM, as `.aggregate`/`.agg` will easily be able to do this pretty generically.
Comment From: jorisvandenbossche
I'm also leaning toward "not needed to add it" (you already have many statistics; adding one more makes the output more complex).
(For reference, R's `summary` also does not include `sum`; not that we cannot deviate from that, of course.)
Comment From: darindillon
Well yes, you can always call .describe() to get some of the statistics, call .agg() to get other statistics, call something else to get still others, and then merge them all together; but of course that's more work. Obviously, .describe() can't contain everything; it should contain the "commonly used" statistics, and what's "commonly used" for you may be different than for me. But I would argue that (especially for financial applications) "sum" is such a common metric that it's worth including. (If this were an exotic metric like "kurtosis" that most people don't understand, we wouldn't include it; but "sum" is useful to lots of people.)
If I did all the work to add this (including test cases, etc) and submitted a PR, would you be willing to consider including it?
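For illustration, the manual merge I'm describing could look something like this (just a sketch; the example values are invented):

```python
import pandas as pd

# Hypothetical profit figures for one scenario; values are made up.
s = pd.Series([100.0, 250.0, -75.0, 420.0], name="scenario_a_profit")

# describe() gives the standard statistics; agg() supplies the sum.
# Stacking the two with pd.concat yields one describe-like readout.
merged = pd.concat([s.describe(), s.agg(['sum'])])
print(merged)
```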
Comment From: jorisvandenbossche
It is difficult to assess, as I am also biased by the type of data I am working with, but I think `sum` is less generally applicable than the others.
For example, I work with time series data of concentrations. For such a case, all of the current statistics are useful (or at least mean something), while a sum makes no sense. And I have the feeling that this can be generalized: the current (distribution) statistics almost always make some sense, while for `sum` it depends much more on the type of data.
Comment From: TomAugspurger
I'm also -0 on this. I think focusing on `DataFrame.agg` would be more useful than modifying `describe`.
Comment From: jorisvandenbossche
@darindillon I am going to close this issue, as there does not seem to be much enthusiasm for adding this. But have a look at #14668 as a possible way to more easily define a custom aggregation function.
Comment From: palewire
Thank you for considering this addition. I've drafted an implementation with a number of expanded tests in #14986.
I would argue for including `sum` in `describe` because it is one of the most common descriptive statistics.
I use it routinely in my work as a data journalist, and I can testify from my years as a trainer in the field that it is a routine tool for my peers. I now find myself adding an additional line to calculate the total every time I run `describe`. Adding `sum` to the mix would remove that requirement and make life easier for myself and others.
But you don't have to take my word for it. A sum is included in the default set of descriptive statistics in:
- Microsoft Excel (surely the most popular statistics software)
- LibreOffice's Calc (the most popular open-source Excel variant)
- Perl's `Statistics::Descriptive` library
- Python's agate library
- SAS's `PROC UNIVARIATE` procedure
I would add that R's `summary` not including a sum can be looked at differently: by including it here, pandas would have what I would regard as a competitive edge.
I see evidence that my view is not alone within the pandas community. The fact that `sum` features so prominently in the examples given for the new `agg` method in #14668 suggests to me there is an appetite to have it closer at hand.
Comment From: jreback
@palewire don't read into the docs of `.agg` using `sum` a lot; that's just happenstance (and copy-pasting) :>
Comment From: jreback
```python
In [7]: s = Series(np.arange(100000))

In [8]: s.describe()
Out[8]:
count    100000.000000
mean      49999.500000
std       28867.657797
min           0.000000
25%       24999.750000
50%       49999.500000
75%       74999.250000
max       99999.000000
dtype: float64

In [9]: s.agg(['count', 'mean', 'std', 'min', 'median', 'max'])
Out[9]:
count     100000.000000
mean       49999.500000
std        28867.657797
min            0.000000
median     49999.500000
max        99999.000000
dtype: float64

In [10]: s.agg(['count', 'mean', 'std', 'min', 'median', 'max', 'sum'])
Out[10]:
count     1.000000e+05
mean      4.999950e+04
std       2.886766e+04
min       0.000000e+00
median    4.999950e+04
max       9.999900e+04
sum       4.999950e+09
dtype: float64
```
The biggest issue with including `.sum` is that it is generally on a different scale than the other numbers, and so can easily throw off the display (it works, but you have to resort to something like the following):
```python
In [20]: pd.options.display.float_format = lambda x: "%.4f" % x

In [21]: s.agg(['count', 'mean', 'std', 'min', 'median', 'max', 'sum'])
Out[21]:
count         100000.0000
mean           49999.5000
std            28867.6578
min                0.0000
median         49999.5000
max            99999.0000
sum       4999950000.0000
dtype: float64
```
Comment From: palewire
I hadn't considered that limitation. Thanks for sharing it. Couldn't that issue come up with other data series, regardless of whether the sum is included?
Comment From: jreback
@palewire sure, that can always happen. The difference here is trying to make this the default.
All of the stats are on the scale of a value in the dataset; `.sum` is not. Now whether sum is an interesting statistic or not depends on your data. It might be. But that's a user question, and not a question of changing the pandas default.
Comment From: palewire
As with most "sensible defaults," I believe the answer to this question is ultimately subjective.
In my view, if the point of the `describe` method is to create a humanized readout that satisfies a wide array of users, I would advocate for a "big tent" that meets your needs, my needs and those of others as well.
From that point of view, the scale issue above would be viewed as a bug in `describe` that we should endeavor to repair rather than a justification for rejecting additions to the method.
Comment From: jreback
> From that point of view, the scale issue above would be viewed as a bug in `describe` that we should endeavor to repair rather than a justification for rejecting additions to the method.
how so?
Comment From: palewire
I mean: totally independent of the question of including `sum`, wouldn't it be great if `describe` were "smart" enough to solve those scale issues rather than relying on users to change a display setting? I would venture that many newbies don't know those display settings exist.
Comment From: jreback
And how would you 'solve' the scale issue?
To be honest, `count` and `mean` already solve this.
Comment From: palewire
I can't say I have the solution to your issue immediately at hand, but it seems simple to generate the same problem with `count`. All it appears to take is a large `Series` filled with small values.
```python
>>> s = pd.Series(np.random.sample(size=1000*1000))
>>> s.describe()
count    1.000000e+06
mean     5.001840e-01
std      2.886875e-01
min      5.431750e-07
25%      2.501076e-01
50%      5.003766e-01
75%      7.503524e-01
max      9.999982e-01
dtype: float64
>>> pd.options.display.float_format = lambda x: "%.4f" % x
>>> s.describe()
count    1000000.0000
mean           0.5002
std            0.2887
min            0.0000
25%            0.2501
50%            0.5004
75%            0.7504
max            1.0000
dtype: float64
```
Could there be a way to have `describe` set a different default for its representations? Or perhaps to set them on a value-by-value basis?
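For instance, here is a rough sketch of one possible value-by-value approach, using Python's general-purpose "%g" format. This is just an illustration, not a worked-out proposal:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.random.sample(size=1000 * 1000))

# "{:g}" picks fixed or scientific notation per value, so the large
# count no longer forces every statistic into the same notation.
print(s.describe().map("{:g}".format))
```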
Comment From: jreback
@palewire I can't repro that example. In any event, I still don't see any good reason to add `sum` as a default. I believe @TomAugspurger and @jorisvandenbossche already commented.
Comment From: jorisvandenbossche
I certainly acknowledge that in certain (many) cases `sum` is an interesting statistic for a dataset, but, as I said above, there are also many cases where it does not make sense (e.g. certain measurements like concentrations, or a set of observations like the age, weight, length, ... of study subjects), while the current statistics (min, max, mean, ...) make sense in almost any case, I think.
So the question is whether we want a 'meaningless' statistic included by default for certain cases, or rather an interesting statistic missing for other cases. And it's never possible to have a default that satisfies all cases. Personally I would go with the 'safe' option of not including it, but I am also certainly biased by the type of data I typically work with at the moment.
Comment From: palewire
Alrighty. That's a totally fair decision. One more thing I'd like to argue for, though.
How would you feel about sharpening up the definition for this function?
Currently the documentation for `describe` simply reads: "Generate various summary statistics, excluding NaN value."
Would you object to me submitting a future pull request that aims to make that more specific and, forgive the pun, descriptive?
Comment From: jreback
We would absolutely take a PR for a better docstring (with examples!)
Comment From: darindillon
Ok, since we all agree some but not all people would find "sum" useful, and there are also other stats that would be useful to some but not all people (maybe you want kurtosis, maybe someone else wants skew, etc.), how about we leave the default as is but add a keyword arg to request additional non-default columns?

```python
series.describe(extra_cols=['sum', 'kurtosis', 'etc'])
```

To start, we'd only implement "sum", but as time permits we'd add additional calcs to that list. It's very flexible and would handle lots of people's use cases without harming anyone who doesn't need those fields. Do you think everybody could be happy with that?
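As a sketch of the idea (`describe_plus` and the `extra` argument are made-up names here, not a real API):

```python
import pandas as pd

def describe_plus(series, extra=None):
    """Hypothetical helper mimicking the proposed extra_cols keyword:
    append extra aggregations below the standard describe() output."""
    out = series.describe()
    if extra:
        out = pd.concat([out, series.agg(extra)])
    return out

s = pd.Series(range(100))
print(describe_plus(s, extra=['sum', 'skew']))
```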
Comment From: underthecurve
"useful" is certainly subjective. I use R a lot and I'm a fan of the inclusion of how many NA's are in the dataset in the summary
feature (see example below, using Population data from the World Bank).
Would the number of missing items in a data frame be a statistic that "makes sense in almost any case"?
```r
> library(WDI)
> pop <- WDI(indicator = c('SP.POP.TOTL'))
> summary(pop$SP.POP.TOTL)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's
9.694e+03 1.335e+06 9.133e+06 2.709e+08 5.077e+07 7.007e+09         7
```
Comment From: jreback
```python
In [9]: s = Series(range(5))

In [10]: s.iloc[2] = np.nan

In [11]: s.describe()
Out[11]:
count    4.000000
mean     2.000000
std      1.825742
min      0.000000
25%      0.750000
50%      2.000000
75%      3.250000
max      4.000000
dtype: float64

In [12]: len(s)
Out[12]: 5
```
`count` already describes the number of non-NaNs.
However, you can see that `len` (or `size`) is 5 here. So I would not object to adding `len` (or `size`), and maybe `nnulls` (`len - count`).
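FWIW these are already one-liners today (a quick sketch; `nnulls` is just the name suggested above):

```python
import numpy as np
import pandas as pd

s = pd.Series(range(5), dtype=float)
s.iloc[2] = np.nan

size = len(s)            # 5: total length, NaNs included
count = s.count()        # 4: non-NaN values, what describe() reports
nnulls = size - count    # 1: the suggested len - count statistic
# Equivalent one-liner: s.isnull().sum()
```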
Comment From: TomAugspurger
I'm still -0 on adding sum, and I don't have a strong opinion on including the length or number of nulls. That information is included in `DataFrame.info()`.
Changing the output of `describe` would be an API-breaking change, I think, as people could have been relying on the position of each statistic. I'd say the cost of breaking the API isn't worth it in this case, but I am willing to be convinced otherwise.
Comment From: palewire
With an eye towards capturing the sentiments of the voices in this thread, I took a run at rewriting the documentation for the `describe` method to be more precise.
That effort was recently merged into the master branch and you can read it here.
The top-line change is to shift the definition of the method's mission from the broad "Generate various summary statistics, excluding NaN value" to something more specific.
Aiming for something that fits the current output and sets down a boundary that would rationally exclude `sum` and any number of arbitrary statistics, the new definition is below. (My emphasis.)
"Generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values."
My hope is that it can provide a guide for evaluating future additions to the method that isn't based on the use cases of different users. Take a look and let me know what you think.
To me, if we accept this as the definition, we have a pretty strong case for excluding `sum`, but it raises for me the question of whether it might make sense to add common statistics that do match the definition, like `mode`, to the mix. If we appended them to the bottom of the current output, I think we could avoid the backwards-breaking changes that @TomAugspurger rightly warns of.
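As a rough sketch of what appending `mode` could look like (taking the first mode on ties is my own arbitrary choice here, not settled behavior):

```python
import pandas as pd

s = pd.Series([1, 2, 2, 3, 4])

# mode() can return several values on ties; take the first for display
# and stack it below the standard describe() output.
extended = pd.concat([s.describe(), pd.Series({'mode': s.mode().iloc[0]})])
print(extended)
```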