Pandas ENH: unit of measurement / physical quantities

quantities related xref #2494 xref #1071

custom meta-data xref #2485

It would be very convenient if unit support could be integrated into pandas. Idea: pandas checks for the presence of a unit attribute on columns and, if present, uses it
- with 'print', to show the units, e.g. below the column names
- to calculate 'under the hood' with these units, similar to the example below

For my example I use the module pint and add an attribute 'unit' to columns (and a 'title'...).

Example:

from pandas import DataFrame as DF
from pint import UnitRegistry
units = UnitRegistry()

class ColumnDescription():
    '''Column description with additional attributes.

    The idea is to use this description to be able to add unit and title
    attributes to a column description in one step.

    A list of ColumnDescriptions is then used as argument to DataFrame()
    with unit support.
    '''

    def __init__(self, name, data, title = None, unit = None):
        '''
        Args:
            name (str): Name of the column.
            data (list): List of the column data.
            title (str): Title of the column. Defaults to None.
            unit (str): Unit of the column (see documentation of module pint).
                Defaults to None.

        '''

        self.data = data 
        '''(list): List of the column data.'''

        self.name = name
        '''(str): Name of the column, naming convention similar to python variables.

        Used to access the column with pandas syntax, e.g. df['column'] or df.column.
        '''

        self.title = title 
        '''(str): Title of the column. 

        More human readable than the 'name'. E.g.:
        Title: 'This is a column title'.
        name: 'column_title'.
        '''

        self.unit = unit
        '''Unit of the column (see module pint).

        Intended to be used in calculations involving different columns.
        '''

class DataFrame(DF):
    '''Data Frame with support for ColumnDescriptions (e.g. unit support).

    1. See documentation of pandas.DataFrame.
    2. When used with ColumnDescriptions supports additional column attributes
    like title and unit.
    '''

    def __init__(self, data, title = None):
        '''
        Args:
            data (list or dict):
                1. Dict, as in documentation of DataFrame
                2. List of the column data (of type ColumnDescription).
            title (str): Title of the data frame. Defaults to None.
        '''

        if isinstance(data, list):
            if isinstance(data[0], ColumnDescription):
                d = {}

                for column in data:
                    d[column.name] = column.data

                super(DataFrame, self).__init__(d)

                for column in data:
                    self[column.name].title = column.title
                    self[column.name].unit = column.unit

                self.title = title

        else:
            super(DataFrame, self).__init__(data)

if __name__ == '__main__':

    data = [ ColumnDescription('length',
                               [1, 10],
                               title = 'Length in meter',
                               unit = 'meter'),
             ColumnDescription('time',
                               [10, 1],
                               title = 'Time in s',
                               unit = 's') ]

    d = {'length':[1, 10],
         'time': [10, 1]}
    df = DataFrame(d)
    print('standard df')
    print(df)

    df = DataFrame(data)
    print('\n' + 'new df')
    print(df)

    ####use of dimensions####
    # pint works with numpy arrays
    # df[name] does not currently work with pint; I think it would be a
    # real enhancement if it did.
    # (.values replaces as_matrix(), which newer pandas versions removed)
    test = df[['length']].values * units(df['length'].unit) / \
           (df[['time']].values * units(df['time'].unit))
    print('\n' + 'unit test')
    print(test)
    print('\n' + 'magnitude')
    print(test.magnitude)
    print('\n' + 'dimensionality')
    print(test.dimensionality)

Comment From: jreback

see #2485

The tricky thing with this is how to actually propagate this meta-data. I think this could work if it was attached to the index itself (as an optional additional array of meta data). If this were achieved, then it should be straightforward to have operations work on it (though to be honest that is a bit out of scope for main pandas; perhaps a sub-class / other library would be better).

Comment From: mdk73

Thanks for your comment. I am not sure what you mean by attaching metadata to the index, and why this is important.

Maybe the proposed way of adding an attribute 'unit' to the columns is not the best way, but hopefully units are significantly less difficult than arbitrary metadata. Personally I do not think that an attribute 'unit' needs to support all kinds of data; 'str' could be enough.

I think pint (there are other modules, but I do not know them, sorry) is capable of taking care of the units itself (also throwing errors when misused), so this would not be a pandas issue.

Here is a small snippet that demonstrates how a new unit could be created if two columns are multiplied:

#prototype column1, omitting the name and index
value1 = [1]
unit1 = 'meter'
# column1: representation of value and unit
column1 = value1 * units(unit1)
# column2: representation of value and unit
column2 = [2] * units('meter')
# creating a new column: column1 * column2
column12 = column1 * column2
print('column12: {}'.format(column12))
# value could go to a new column of a DataFrame
print('value of column12: {}'.format(column12.magnitude))
# str(column12.units) could serve as the unit-attribute for the new column
print('unit of column12: {}'.format(column12.units))

output:

column12: [2] meter ** 2
value of column12: [2]
unit of column12: meter ** 2

Comment From: jreback

@mdk73 as I said, this could be done, but there are lots and lots of test cases and behaviors that are needed, e.g.

x = DataFrame with some quantities
y = DataFrame with no quantities
z = DataFrame with different quantities

so what are

x * x
x * y
x * z

These may seem completely obvious, and for the most part they are, but you have to propagate things very carefully. As I said, this is a natural attachment for the Index: a new property that can be examined (kind of how .name works).

The way to investigate is to add the property and write a suite of tests that ensure correct propagation on the Index object, e.g. things like: .reindex, .union, .intersection, __init__ etc.
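
To make the x * x, x * y, x * z question concrete, here is a minimal sketch in plain Python. It is not pandas or pint behaviour: the function name `multiply_units` and the dict-of-exponents representation are hypothetical, and treating a missing unit as dimensionless is just one possible rule (raising an error would be another).

```python
def multiply_units(left, right):
    """Combine two unit descriptions under multiplication.

    A unit is a dict mapping a base unit name to its exponent, e.g.
    {"meter": 1, "second": -1}. None means "no unit attached".
    (Hypothetical representation, chosen only for this illustration.)
    """
    if left is None and right is None:
        return None  # nothing to propagate
    combined = dict(left or {})
    for name, exponent in (right or {}).items():
        combined[name] = combined.get(name, 0) + exponent
    # Drop base units whose exponents cancelled out.
    return {name: exp for name, exp in combined.items() if exp != 0}


x = {"meter": 1}     # frame with some quantities
y = None             # frame with no quantities
z = {"second": -1}   # frame with different quantities

print(multiply_units(x, x))
print(multiply_units(x, y))
print(multiply_units(x, z))
```

Every operation that returns a new object (reindexing, unions, arithmetic, ...) would need a rule of this kind plus a test pinning it down, which is where most of the work lies.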

Comment From: shoyer

Unit-aware arrays would indeed be extremely valuable for some use cases, but it's difficult to see how they could be integrated into the core of pandas in a way that is agnostic about the particular implementation. We definitely do not want to duplicate the work of pint or other similar packages in pandas, nor even pick a preferred units library. Instead, we want to define a useful set of extension points, e.g., similar to __numpy_ufunc__. So, this won't be an easy change, and possibly is something best reserved for thinking about in the design of "pandas 2.0".

Comment From: blalterman

What about having the user define a dictionary containing any units she or he uses via pd.set_option? Whenever pandas does a calculation, it checks all objects in the calculation. It then takes all units and combines them just as the function would (e.g. pass all units through the function?). If an object has no units, take its units as 1. At the end of the computation, you can then specify a new unit for the result and pandas will divide out the units accordingly. Alternatively, whenever pandas does a calculation, it can just multiply any values (perhaps excluding a user-defined flag value) by their conversion factors and then run the calculation, converting out at the end. This is how I run a lot of my calculations. Why not do something like the lines using to_SI below?

def traditional(b_mag, rho_vals, fill=-9999):
    """Calculate the Alfven speed."""

    # I store all of my physical constants in `_pccv`.
    mu0 = _pccv.misc['mu0'] #.physical_constants

    # Have pandas do this to every value before a computation.
    b_to_SI   = _pccv.to_SI['b']
    rho_to_SI = _pccv.to_SI['rho']
    v_to_SI   = _pccv.to_SI['v']    

    b = b_mag.copy() * b_to_SI
    rho = rho_vals.copy() * rho_to_SI

    if rho.ndim > 1: rho = rho.sum(axis=_argmin(rho.shape))

    Ca_denominator = _sqrt(mu0 * rho, dtype=_float64)
    Ca_calc = _divide(b, Ca_denominator, dtype=_float64)

    # At the end of your computation, specify the output unit and the 
    # following line would be run automatically.
    Ca_kms = Ca_calc / v_to_SI    

    return Ca_kms
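
The function above depends on private helpers (`_pccv`, `_sqrt`, `_divide`), so here is a self-contained sketch of the same convert-in, compute, convert-out pattern using only the standard library. The `TO_SI` table, its input units (nanotesla, proton masses per cm^3, km/s) and the function name are assumptions made for this illustration, not part of the original code.

```python
import math

# Hypothetical conversion factors from the user's working units to SI.
TO_SI = {
    "b": 1e-9,           # nanotesla -> tesla
    "rho": 1.67262e-21,  # proton masses per cm^3 -> kg per m^3
    "v": 1e3,            # km/s -> m/s
}
MU0 = 4e-7 * math.pi  # vacuum permeability in SI units (classical value)


def alfven_speed_kms(b_nt, rho_protons_cm3):
    """Alfven speed in km/s from B in nT and mass density in m_p / cm^3."""
    # Step 1: convert every input to SI before computing.
    b = b_nt * TO_SI["b"]
    rho = rho_protons_cm3 * TO_SI["rho"]
    # Step 2: run the calculation entirely in SI.
    speed_si = b / math.sqrt(MU0 * rho)
    # Step 3: divide out the requested output unit.
    return speed_si / TO_SI["v"]


print(alfven_speed_kms(5.0, 5.0))
```

The proposal amounts to pandas performing steps 1 and 3 automatically from a user-supplied table, leaving only step 2 to the caller.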

Comment From: shoyer

There are several existing approaches to units in Python -- notably pint and astropy.units. We should definitely be careful before reinventing the wheel here.

Comment From: den-run-ai

+1 on units, especially for plots with multiple axes:

http://matplotlib.org/examples/axes_grid/demo_parasite_axes2.html

Comment From: mikofski

Similar to #2494

Comment From: VelizarVESSELINOV

:+1: Unit awareness of the column as a string is a good enough first step: being able to store the unit associated with the column. It would be nice for read_csv to be able to capture the unit line and store it. Other metadata that would be nice to store are a description for each column, or even a history if some operations are done on the column.

I think the community will not be able to align on a unit naming convention, so this should be managed outside pandas; the conversion factors can also be managed outside pandas.

Unit name challenge: there are a lot of unit aliases and, in some cases, conflicts. Many unit spellings are used for different purposes: is B bytes, bits or bels (https://en.wikipedia.org/wiki/Decibel#bel)? Is S seconds or siemens (https://en.wikipedia.org/wiki/Siemens_(unit))?

In my domain, UOM from Energistics (http://www.energistics.org/asset-data-management/unit-of-measure-standard) covers most of my needs, but I agree that for people who deal more with digital storage units or date-time units this may be out of scope.

Comment From: jreback

I think a very straightforward way of doing this (though it will get a bit of flak from @shoyer, @wesm, @njsmith, @teoliphant for not doing this in C :<) is to simply define an 'extension pandas dtype' along the lines of DatetimeTZDtype.

E.g. you would have a float64-like dtype with a unit parameter (which could be a value from one of the units libraries, so pandas is basically agnostic).

This would then need some modification to the ops routines to handle the interactions.
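
A rough sketch of what a parametrized dtype of this kind might look like, in plain Python only. `UnitDtype`, its string form `float64[meter]` and `result_dtype_for_add` are hypothetical names invented for illustration; this is not the pandas extension interface, just the shape of the idea (DatetimeTZDtype similarly carries its parameters in a string such as `datetime64[ns, UTC]`).

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitDtype:
    """Hypothetical float64-like dtype that carries a unit parameter."""

    unit: str
    base: str = "float64"

    @property
    def name(self):
        # String form, e.g. "float64[meter]".
        return f"{self.base}[{self.unit}]"

    @classmethod
    def from_string(cls, text):
        match = re.fullmatch(r"(\w+)\[(.+)\]", text)
        if match is None:
            raise TypeError(f"cannot parse {text!r} as a unit dtype")
        base, unit = match.groups()
        return cls(unit=unit, base=base)


def result_dtype_for_add(left, right):
    """One possible ops rule: addition requires identical unit dtypes."""
    if left != right:
        raise TypeError(f"cannot add {left.name} and {right.name}")
    return left


meters = UnitDtype.from_string("float64[meter]")
print(meters.name)
```

In a real implementation the unit string would be handed to whichever units library the user picked, which would also decide whether, say, meter and kilometer can be added after conversion.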

Comment From: tomchor

Just to make things more explicit, this same discussion is happening in a pint issue (which is actually referenced here).

I think there should be an exchange of information from both sides to make a robust solution and to avoid "reinventing the wheel", but IMHO the actual implementation should come from pint, with pandas only providing a good base for it (as some comments here have already said).

Comment From: Bernhard10

I tried to follow @jreback's idea of adding an additional dtype. My pull request is not ready to merge, but it is an outline of how it could work.

@tomchor I started to write this pull request yesterday, before you commented that you would prefer to implement this in pint instead, that's why I post it here.

Comment From: mikofski

@Bernhard10 any reason you chose not to use Pint or Quantities or another established, mature, tested, robust, popular units package?

Comment From: tomchor

@Bernhard10 I think the additional dtype can work. I'm happy someone's working on it.

About implementing it in Pint, unfortunately I'm not the man to do it (at least right now). I still have a lot to learn about Pint and I have some other urgent priorities to take care of.

@mikofski I guess Pint looks like a better candidate (at least for me) because it seems more intuitive and simpler. But I guess there would be no strong argument against using Quantities. I think the point of providing a general basis for the implementation in Pandas (such as the dtype idea) is that it can be implemented by whatever units package independently. So people using Pint could easily develop support for Pandas, as could people using Quantities.

Comment From: mikofski

@tomchor, I do think a backend approach that allows the units package to be swappable is the best approach. Also I agree, Pint is easier and more popular IMHO than Quantities right now, although before Pint, Quantities was definitely the most popular, and is still very good.

@bernhard10, if you are implementing the dtype approach, maybe look at Quantities first and talk to their maintainers, because Quantities also uses dtype, so this may save you a lot of time and testing. Also please consider making your pandas units abstract, defaulting to your version but allowing any other suitable backend to be used as long as it implements the abstract API.
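
One way to picture the "abstract API with a swappable backend" idea is the sketch below. Everything in it is hypothetical: `UnitsBackend`, `StringBackend`, `set_backend` and `get_backend` are illustrative names, and the toy default backend only knows three length units. A Pint- or Quantities-based backend would implement the same two methods.

```python
from abc import ABC, abstractmethod


class UnitsBackend(ABC):
    """Hypothetical minimal interface a units package would implement."""

    @abstractmethod
    def multiply(self, left_unit, right_unit):
        """Return the unit of a product."""

    @abstractmethod
    def convert(self, value, from_unit, to_unit):
        """Return value expressed in to_unit."""


class StringBackend(UnitsBackend):
    """Toy default backend: string units, a few length conversions."""

    _TO_METER = {"meter": 1.0, "kilometer": 1000.0, "centimeter": 0.01}

    def multiply(self, left_unit, right_unit):
        return f"{left_unit} * {right_unit}"

    def convert(self, value, from_unit, to_unit):
        try:
            factor = self._TO_METER[from_unit] / self._TO_METER[to_unit]
        except KeyError as exc:
            raise ValueError(f"unknown unit: {exc.args[0]}") from None
        return value * factor


_backend = StringBackend()


def set_backend(backend):
    """Swap in any object that implements the abstract API."""
    global _backend
    if not isinstance(backend, UnitsBackend):
        raise TypeError("backend must implement UnitsBackend")
    _backend = backend


def get_backend():
    return _backend


print(get_backend().convert(2.0, "kilometer", "meter"))
```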

Comment From: Bernhard10

@mikofski I am currently testing with pint, but the idea of the dtype approach would be to make the units package swappable.

Comment From: dopplershift

@mikofski Can you point to which "Quantities" package you're referring to? I'm aware of a few unit packages that use numpy array sub-classes, but I'm not aware of any that use custom dtypes.

Comment From: mikofski

You're right, my bad, https://pypi.python.org/pypi/quantities overloads the ndarray __new__ method not dtype sorry

Comment From: JoElfner

Is there any new information on this topic? Unit handling, at least by having a simple field to store a unit description in for each column, would be really helpful. I really like pandas, but the lack of support for meta-data makes me fall back to dictionaries more and more.

Comment From: jreback

this is certainly possible via ExtensionArray, which is getting lots of support lately: http://pandas-docs.github.io/pandas-docs-travis/extending.html#extension-types

but would need a community contribution to bootstrap

Comment From: JoElfner

Yeah, this looks really nice from the view of a programmer, but I guess the main user of pandas is some kind of data scientist, who just wants an easy solution which can be implemented in a short amount of time and does not want to delve too deep into the backend of the modules. So without any community contribution, I guess most people won't be able to use the functionality of ExtensionArray.

Comment From: znicholls

@jreback I've started to have a go with an ExtensionArray. Unfortunately I'm a complete newbie when it comes to pandas and pint (and coding, really) so I don't actually know what I'm doing. I've got it to the point where a DataFrame can store a pint array without automatically converting it to a numpy array, but am lost from there. Would you mind having a 5 minute look and potentially giving a pointer or two, as I have no idea what sensible next steps are? https://github.com/hgrecco/pint/pull/671

Comment From: UniqASL

I think the issue of dealing with units in pandas should really be at the top of the agenda. Most scientific calculations have to deal with units. I am currently using pint and its module pint-pandas, which indeed offers very nice possibilities. It is very practical to have an automated way to deal with unit conversions when making calculations with large dataframes. It also brings more safety to the calculations, as it avoids "multiplying apples and oranges". pint-pandas, however, has a lot of issues in its current state and is quite a pain to deal with. Wouldn't it make sense to integrate all the nice work already done there directly into pandas, so that issues can be fixed more easily?

pandas can deal very well with time series and I am very thankful for that because it is a very useful feature. Dealing with units should be in my opinion another core feature of pandas (especially as part of NumFOCUS).

Thanks in advance for your feedback.

Comment From: 5igno

Hi there, it would indeed be great to be able to store the measurement unit for the columns of a DataFrame, or as some kind of metadata. Given the breadth of use of pandas, of which only a small subset cares deeply about units, I am of the opinion that unit conversion could be handled externally by packages like pint or astropy. However, one thing that both approaches need is that values be stored with a unit in the dataframe. As suggested before, this can also be a simple string metadata field.

Is there a suggested way to simply store measurement units in a Pandas DataFrame column?
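
One low-tech possibility, sketched here under the assumption of a pandas version that provides `DataFrame.attrs` (documented as experimental): keep a plain dict of unit strings as frame-level metadata. The `"units"` key and the `unit_of` helper are made up for this example, and whether `attrs` survives a given operation should be checked rather than assumed.

```python
import pandas as pd

df = pd.DataFrame({"length": [1, 10], "time": [10, 1]})

# Store one unit string per column as frame-level metadata.
df.attrs["units"] = {"length": "meter", "time": "second"}


def unit_of(frame, column):
    """Return the stored unit string for a column, or None if absent."""
    return frame.attrs.get("units", {}).get(column)


print(unit_of(df, "length"))
```

This only stores the strings; any conversion would still be delegated to an external package such as pint or astropy, as suggested above.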

Comment From: wkerzendorf

I think I have an idea of how marrying astropy quantities with pandas could work, using the extension part of pandas.

Extending a pandas DataFrame:

import pandas as pd

class QDataFrame(pd.DataFrame):
    # normal properties
    _metadata = ["units"]
    def __init__(self, *args, units=None, **kwargs):
        super(QDataFrame, self).__init__(*args, **kwargs)
        # avoid sharing one mutable default list between instances
        self.units = [] if units is None else units
    @property
    def _constructor(self):
        return QDataFrame

Then an accessor like this for the series.

@pd.api.extensions.register_series_accessor("quantity")
class QuantityAccessor():
    def __init__(self, pandas_obj):
        self._obj = pandas_obj

    def to(self, unit):
        return (self._obj.values * self._obj.unit).to(unit)

I'm struggling with how to propagate the list of units to a single column via _constructor_sliced, as I do not seem to know which column it is accessing. Any ideas, @jreback?

Comment From: znicholls

As an update on efforts with pint-pandas, we have hit a wall with https://github.com/pandas-dev/pandas/issues/35131. The crux is how to handle this scalar/array ambiguity without potentially destroying performance elsewhere.

The first attempt at a fix was https://github.com/pandas-dev/pandas/pull/35127, which was superseded by #39790, but that went stale.

The latest person to raise this (as far as I'm aware) is https://github.com/pandas-dev/pandas/issues/43196.

It seems this is still an issue, but not one which is high enough priority to get over the tricky hurdles required to allow things to move forward.

Comment From: demisjohn

Is there a way to make an extra "Dimension" of the Database (not just Row & Column)? So you could have "Unit" strings (or any other data), associated with each Column, in addition to the Column "name". Or, is there a "relational DB" method that could accomplish the same, with pandas? Whichever is syntactically and cognitively simpler would be best for most scientists.

I am thinking of a solution that may already exist within Pandas - and a documentation suggestion/example may show users a simple implementation.
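
One existing pandas feature that comes close to an extra "dimension" on the columns is a MultiIndex with a second level holding the unit strings. The sketch below is only an illustration of that idea; the level names `name` and `unit` are arbitrary choices, and pandas does not interpret the unit strings in any way.

```python
import pandas as pd

# Two-level column index: first level is the column name, second the unit.
columns = pd.MultiIndex.from_tuples(
    [("length", "meter"), ("time", "second")], names=["name", "unit"]
)
df = pd.DataFrame([[1, 10], [10, 1]], columns=columns)

# Look up the unit belonging to each column name.
units = dict(
    zip(df.columns.get_level_values("name"), df.columns.get_level_values("unit"))
)

# Drop the unit level to get an ordinary frame for calculations.
values = df.droplevel("unit", axis=1)

print(df)
print(units)
```

The units show up below the column names when the frame is printed, but arithmetic between columns will not update them.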

Comment From: mroeschke

Looks like pint-pandas (https://github.com/hgrecco/pint-pandas) is a 3rd party library that provides this support and is mentioned in our ecosystem docs, so closing.