As mentioned in issue #6897, working with data that carries error ranges is nearly universal in science, as well as in many other fields. There are Python packages, such as uncertainties, for working with this sort of data. However, pandas has no built-in tools for creating or working with data with error ranges, leaving users to create their own error columns or a separate pandas object to hold the error ranges (see, e.g., #5638), or to manually create and use uncertainties objects.
I think it would be very helpful if there were an aggregation method that collapses data into values with an error range (such as an uncertainties array). By default it could use mean for the center values and sem (standard error of the mean) or std for the error ranges, but users should also be able to supply their own functions for computing the center values and/or the error ranges.
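A minimal sketch of what such a helper might look like (agg_with_error is a hypothetical name, not an existing pandas method; it assumes the scipy and uncertainties packages):

import numpy as np
import pandas as pd
from scipy import stats
from uncertainties import ufloat

def agg_with_error(df, center=np.mean, error=stats.sem, axis=1):
    # hypothetical helper: collapse one axis into ufloat values that
    # combine a center estimate with an error range
    return df.apply(lambda x: ufloat(center(x), error(x)), axis=axis)

df = pd.DataFrame(np.random.randn(6, 100))
res = agg_with_error(df)                    # mean +/- sem, one ufloat per row
res_std = agg_with_error(df, error=np.std)  # use std for the error range instead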
Comment From: jreback
can u show a sample use case and implementation (even if slow)?
Comment From: toddrjen
So, say someone has 15 experiments (multi-index level 1), each with 20 experimental conditions (multi-index level 2), and they record 100 trials for each condition (columns). The person wants to publish this data, so they need the mean and error range. That means collapsing the trials dimension, getting both the mean and standard error for each experiment/condition pair. With this approach, the user could then use experiment as the index and condition as the column.
Here is a simple example (using a lambda for the implementation):
import pandas as pd
import numpy as np
from scipy import stats
from uncertainties import ufloat

# 15 experiments x 20 conditions as rows, 100 trials as columns
ind = pd.MultiIndex.from_product([np.arange(15), np.arange(20)])
df = pd.DataFrame(np.random.randn(15*20, 100), index=ind, columns=np.arange(100))

# collapse the trials into mean +/- sem, then move condition to the columns
res = df.apply(lambda x: ufloat(np.mean(x), stats.sem(x)), axis=1).unstack()
This becomes much more important for more complicated analyses. Doing manipulations of data with many-level multi-indexes becomes much, much harder if you also have to manage a second error table, column, or index. I can give an example for that as well, but it will be longer.
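Even a small sketch shows the extra bookkeeping a separate error table forces (assuming a plain values/errors split into two frames):

import numpy as np
import pandas as pd
from uncertainties import unumpy

ind = pd.MultiIndex.from_product([np.arange(3), np.arange(4)])
values = pd.DataFrame(np.random.randn(12, 2), index=ind)
errors = pd.DataFrame(np.abs(np.random.randn(12, 2)), index=ind)

# with a separate error table, every selection/reshape must be applied to
# both frames and kept in sync by hand
sub_v = values.xs(1, level=0)
sub_e = errors.xs(1, level=0)

# with ufloat-valued cells the same slice is a single operation
combined = pd.DataFrame(unumpy.uarray(values.to_numpy(), errors.to_numpy()), index=ind)
sub = combined.xs(1, level=0)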
Comment From: jreback
Using uncertainties makes all of your data object dtype, negating pretty much all of pandas' efficiencies. Instead, something like this would work (the multi-level slicing requires master/0.14, coming soon):
In [20]: ind = pd.MultiIndex.from_product([np.arange(5), np.arange(2)])
In [21]: cols = pd.MultiIndex.from_product([np.arange(5), ['value','error']])
In [22]: df = pd.DataFrame(np.random.randn(5*2,10), index=ind, columns=cols).sortlevel().sortlevel(axis=1)
In [25]: df
Out[25]:
0 1 2 3 4
error value error value error value error value error value
0 0 1.684247 -0.768990 1.745643 -0.460112 0.547230 1.204622 -0.645565 0.767882 1.038075 -0.004924
1 -1.038735 1.268667 0.288511 -0.056458 0.052893 -0.181397 -0.416198 -0.117648 1.092671 -0.085161
1 0 -1.027876 -0.504794 1.145330 0.149904 -1.735783 -1.292422 0.111824 1.213310 -0.165664 -1.644664
1 0.356636 1.076804 -2.442231 -0.694032 -0.531767 -0.177785 0.911135 -0.477786 0.677379 1.758926
2 0 1.720729 0.170775 0.348073 -1.441842 1.377164 -1.434962 -1.332751 -0.681837 -0.169488 -0.847964
1 -1.260312 -0.000384 0.333589 0.338253 -0.871582 -0.813060 -0.056995 -0.653637 -0.937449 1.143176
3 0 -1.457335 -1.102507 0.691152 -2.469394 0.615936 1.310255 1.306816 -0.035045 0.435257 1.455832
1 1.855440 0.923589 -1.061110 0.995526 0.126394 -0.579312 -1.445212 -1.391565 1.575050 0.071588
4 0 -0.155716 0.917270 -0.257610 -1.180983 1.356626 -0.077675 0.973249 -0.418510 -0.607244 -0.927557
1 -1.305623 0.737657 -0.891516 0.893158 1.387652 -1.825456 1.406268 -0.827154 0.147286 -1.361848
[10 rows x 10 columns]
In [23]: df.loc[:,(slice(None),'error')]
Out[23]:
0 1 2 3 4
error error error error error
0 0 1.684247 1.745643 0.547230 -0.645565 1.038075
1 -1.038735 0.288511 0.052893 -0.416198 1.092671
1 0 -1.027876 1.145330 -1.735783 0.111824 -0.165664
1 0.356636 -2.442231 -0.531767 0.911135 0.677379
2 0 1.720729 0.348073 1.377164 -1.332751 -0.169488
1 -1.260312 0.333589 -0.871582 -0.056995 -0.937449
3 0 -1.457335 0.691152 0.615936 1.306816 0.435257
1 1.855440 -1.061110 0.126394 -1.445212 1.575050
4 0 -0.155716 -0.257610 1.356626 0.973249 -0.607244
1 -1.305623 -0.891516 1.387652 1.406268 0.147286
[10 rows x 5 columns]
In [24]: df.loc[:,0]
Out[24]:
error value
0 0 1.684247 -0.768990
1 -1.038735 1.268667
1 0 -1.027876 -0.504794
1 0.356636 1.076804
2 0 1.720729 0.170775
1 -1.260312 -0.000384
3 0 -1.457335 -1.102507
1 1.855440 0.923589
4 0 -0.155716 0.917270
1 -1.305623 0.737657
[10 rows x 2 columns]
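For reference, a value/error column layout like the one above could be built from raw per-trial data along these lines (a sketch using mean/sem, with the current sort_index in place of the 0.14-era sortlevel):

import numpy as np
import pandas as pd
from scipy import stats

# raw trials: (experiment, condition) rows, one column per trial
ind = pd.MultiIndex.from_product([np.arange(5), np.arange(2)])
raw = pd.DataFrame(np.random.randn(5 * 2, 100), index=ind)

value = raw.mean(axis=1).unstack()                # experiment x condition
error = raw.apply(stats.sem, axis=1).unstack()

# combine into (condition, value/error) columns like the frame above
df = pd.concat({'value': value, 'error': error}, axis=1)
df.columns = df.columns.swaplevel(0, 1)           # condition on the outer level
df = df.sort_index(axis=1)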
Comment From: toddrjen
Yes, that is the problem with the current situation. The idea of this issue is to improve the current situation by creating values with uncertainties in a more integrated, reliable, and useful way.
Your proposal works fine for simple situations at the end of an analysis. But if you want to do manipulations, it becomes much more difficult, and with a many-level MultiIndex it becomes extremely difficult. With uncertainties-backed values, these manipulations would be no harder than they are for plain scalar values.
If you want to do mathematics, such as adding or multiplying two DataFrames, your proposal is also far more difficult. Mathematical operations on means and on standard errors are not the same: there are rules for combining them, called error propagation, which the uncertainties package handles automatically but which, under your proposal, would have to be looked up and coded explicitly. With uncertainties, the arithmetic is just an operation on the DataFrame; under your proposal you would need to split out the mean and error columns, apply different operations to each, and then recombine them. That is possible in pandas, but it is much harder than simply writing df1*df2.
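For example, a small sketch of the difference for a product of two quantities (the explicit propagation rule shown is the standard one for independent errors):

import numpy as np
import pandas as pd
from uncertainties import ufloat

# two tiny frames of ufloat values (object dtype)
df1 = pd.DataFrame([[ufloat(1.0, 0.1), ufloat(2.0, 0.2)]])
df2 = pd.DataFrame([[ufloat(3.0, 0.3), ufloat(4.0, 0.4)]])

# error propagation is handled automatically by the ufloat objects
prod = df1 * df2

# with separate value/error frames, the rule for a product of independent
# quantities has to be looked up and coded explicitly
v1, e1 = pd.DataFrame([[1.0, 2.0]]), pd.DataFrame([[0.1, 0.2]])
v2, e2 = pd.DataFrame([[3.0, 4.0]]), pd.DataFrame([[0.3, 0.4]])
val = v1 * v2
err = np.abs(val) * np.sqrt((e1 / v1) ** 2 + (e2 / v2) ** 2)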
Since working with errors is almost universal in science, I think having strong, built-in support for it in pandas is important.
Comment From: jreback
@toddrjen it's a nice idea.
Not sure how efficiently the uncertainties package handles these kinds of operations. These values are going to be represented as object dtype by pandas/numpy, so I'm not sure how efficient this would be; you might want to ask the author / investigate this.
If this could be integrated as a pseudo-dtype into numpy (or perhaps by cythonizing some hotspots) that might help.
So this would need some performance tests to determine feasibility.
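A rough sketch of the kind of comparison such a test might run (building an object-dtype Series with the unumpy helper from the uncertainties package):

import timeit
import numpy as np
import pandas as pd
from uncertainties import unumpy

n = 10_000
vals = np.random.randn(n)
errs = np.abs(np.random.randn(n))

float_s = pd.Series(vals)                        # native float64 dtype
ufloat_s = pd.Series(unumpy.uarray(vals, errs))  # object dtype of ufloats

# compare elementwise arithmetic on float64 vs object (ufloat) data
print(timeit.timeit(lambda: float_s * float_s, number=10))
print(timeit.timeit(lambda: ufloat_s * ufloat_s, number=10))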
Comment From: mroeschke
Yeah, this would be best implemented by a 3rd-party library exposing uncertainties as an ExtensionArray (EA) dtype. Closing