Pandas DOC/WIP: doc page for the layout of the internals of pandas

this would be useful for reference purposes and also so that @jreback doesn't have to fix almost every non-trivial bug that pops up :). i would be happy to start writing this (with the help of others who know these things better than i do), i think it would be an excellent way to gain a deeper understanding of the internals.

Comment From: jreback

sure...all for that!

Comment From: jtratner

:+1: though maybe it would be better to add the documentation into the code (e.g., module level docstrings or comments at the top of modules) as opposed to putting it into documentation elsewhere -- might make it easier to keep them updated as changes occur.

Comment From: cpcloud

yeah i think a doc page might be better than the wiki for this

Comment From: jreback

maybe in this case a description at the top of core/internals would be useful......

Comment From: clham

Is this still alive and kicking? Perhaps adding a subhead to contributing to pandas titled code layout? I'm envisioning a paragraph/bulleted style doc with what calls what when you (For example) make a DataFrame, and how the major parts and pieces interact.

Comment From: jreback

I think some progress has been made in groupby, internals, and index to document more with some top-level comments

not sure this is really for public consumption and better documented in the modules themselves

Comment From: jreback

on second thought

this might be nice if it's included in the docs so can be updated when the code is updated (and as an rst might be easier)

maybe you want to take a stab at some things that might be useful in this page? (and I can fill them in a bit)

Comment From: jreback

and there is a section on index internals at the end of indexing.rst which should be moved to internals as well

Comment From: clham

Sure! I'll put together a PR with a TOC and some headings, then start muddling through the code.

Comment From: jreback

document internal attributes of DataFrameGroupby and friends: http://stackoverflow.com/questions/24806601/convert-groupby-to-dataframe-join-the-groups-again/24807309#24807309

Comment From: immerrr

After reinventing several cythonized routines and hitting my head against the wall of pytables io code I was thinking along the lines of actually generating a separate developer doc (with its own conf.py): separation would help keeping the scope and build time of public doc down, and one could use cross-references where necessary.

Comment From: clham

That is a much cleaner solution than the disaster I've been trying to cook up.

Comment From: jreback

little tidbits that need docs (see end of this): https://github.com/pydata/pandas/pull/7790 e.g how to compare tz with 'UTC'

Comment From: sinhrks

I think the guide is really useful for contributors (including me). I prepared a rough summary for internal docs for discussion.

Data Layers

Explanation of the internal data layers, from the outermost container down to the raw arrays:

- Series, DataFrame and Panel: Contain the internal data in a BlockManager.
- BlockManager: Handles multiple Blocks.
- Block: Represents the data for one internal data type.
- pandas raw data: Represents internal data types which don't exist in numpy. Currently, Categorical and Sparse. Dtypes that numpy already provides don't have this layer.
- numpy.array: All the internal data is finally mapped to numpy.array.

ToDo: Explain which ops are (basically) defined in which layers, such as slicing and numeric ops.
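To make the layering concrete, here is a toy model in plain Python. It is not how pandas implements anything: the `ToyFrame`, `ToyBlockManager` and `ToyBlock` names are hypothetical and exist only for this illustration, and the grouping rule (the Python type of each column's first value) is a simplifying assumption. It only shows the idea that a frame delegates to a manager, which groups same-typed columns into one block.

```python
from collections import defaultdict


class ToyBlock:
    """Hypothetical block: holds every column that shares one element type."""

    def __init__(self, type_name, names, columns):
        self.type_name = type_name
        self.names = names
        self.columns = columns


class ToyBlockManager:
    """Hypothetical manager: groups columns into one block per element type."""

    def __init__(self, data):
        grouped = defaultdict(list)
        for name, values in data.items():
            # Assumption: columns are non-empty and homogeneous, so the
            # first value is enough to decide the type.
            grouped[type(values[0]).__name__].append(name)
        self.blocks = [
            ToyBlock(type_name, names, [data[n] for n in names])
            for type_name, names in grouped.items()
        ]

    def get(self, name):
        # Find the block that owns the column, then the column inside it.
        for block in self.blocks:
            if name in block.names:
                return block.columns[block.names.index(name)]
        raise KeyError(name)


class ToyFrame:
    """Hypothetical user-facing container that delegates to the manager."""

    def __init__(self, data):
        self._data = ToyBlockManager(data)

    def __getitem__(self, name):
        return self._data.get(name)


frame = ToyFrame({"a": [1, 2], "b": [1.1, 2.1], "c": [3, 4]})
block_layout = {b.type_name: b.names for b in frame._data.blocks}
print(block_layout)
# {'int': ['a', 'c'], 'float': ['b']}
```

The two integer columns end up in the same block even though they are not adjacent in the frame, which is the property the Block layer description is getting at.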

Internal Data Access

Assume the following DataFrame.

import pandas as pd
df = pd.DataFrame({'int': [1, 2],
                   'float': [1.1, 2.1],
                   'complex': [1+1j, 1+2j],
                   'bool': [True, False],
                   'object': ['A', 'B'],
                   'category (object)': pd.Categorical(['A', 'B']),
                   'datetime': [pd.Timestamp('2015-01-01'), pd.Timestamp('2015-02-01')],
                   'timedelta': [pd.Timedelta('1 day'), pd.Timedelta('2 day')],
                   'sparse': pd.SparseSeries([1, 0], fill_value=0),
                  }, columns=['int', 'float', 'complex', 'bool', 'object',
                              'category (object)', 'datetime', 'timedelta', 'sparse'])
df
#    int  float  complex   bool object category (object)   datetime  timedelta  \
# 0    1    1.1   (1+1j)   True      A                 A 2015-01-01     1 days   
# 1    2    2.1   (1+2j)  False      B                 B 2015-02-01     2 days   
# 
#    sparse  
# 0       1  
# 1       0

Access to `BlockManager` and `Block`

DataFrame._data (and likewise Series._data) holds the object's internal BlockManager. A BlockManager has a blocks attribute which stores its internal Blocks. Blocks are separated by type.

for c, s in df.iteritems():
    for block in s._data.blocks:
        print(c, type(block), block.dtype, block.dtype.type)
# ('int', <class 'pandas.core.internals.IntBlock'>, dtype('int64'), <type 'numpy.int64'>)
# ('float', <class 'pandas.core.internals.FloatBlock'>, dtype('float64'), <type 'numpy.float64'>)
# ('complex', <class 'pandas.core.internals.ComplexBlock'>, dtype('complex128'), <type 'numpy.complex128'>)
# ('bool', <class 'pandas.core.internals.BoolBlock'>, dtype('bool'), <type 'numpy.bool_'>)
# ('object', <class 'pandas.core.internals.ObjectBlock'>, dtype('O'), <type 'numpy.object_'>)
# ('category (object)', <class 'pandas.core.internals.CategoricalBlock'>, category, <class 'pandas.core.common.CategoricalDtypeType'>)
# ('datetime', <class 'pandas.core.internals.DatetimeBlock'>, dtype('<M8[ns]'), <type 'numpy.datetime64'>)
# ('timedelta', <class 'pandas.core.internals.TimeDeltaBlock'>, dtype('<m8[ns]'), <type 'numpy.timedelta64'>)
# ('sparse', <class 'pandas.core.internals.SparseBlock'>, dtype('float64'), <type 'numpy.float64'>)
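For anyone trying this on a newer pandas: as far as I know the private attribute was later renamed from `_data` to `_mgr`, and `iteritems()` was removed in 2.0 in favour of `items()`. The sketch below assumes one of the two private attributes exists on the installed version. Both are private, so this may break without notice, and the block class names it prints differ between releases.

```python
import pandas as pd

df = pd.DataFrame({"int": [1, 2],
                   "float": [1.1, 2.1],
                   "bool": [True, False]})

# Private API, may change: `_mgr` on newer releases, `_data` on older ones.
# Assumption: at least one of the two attributes is present.
mgr = df._mgr if hasattr(df, "_mgr") else df._data

# Each block reports the dtype of the columns it stores.
for block in mgr.blocks:
    print(type(block).__name__, block.dtype)

# Collect the dtypes seen across all blocks for comparison with df.dtypes.
block_dtypes = sorted(str(block.dtype) for block in mgr.blocks)
column_dtypes = sorted(str(dtype) for dtype in df.dtypes)
```

With one column per dtype, the block dtypes and the column dtypes should line up one to one.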

Series.values or Block.values returns the pandas raw data: a numpy.ndarray for dtypes numpy already provides, and a Categorical or SparseArray otherwise.

# values
for c, s in df.iteritems():
    v = s.values
    print(c, type(v), v.dtype, v.dtype.type)
# ('int', <type 'numpy.ndarray'>, dtype('int64'), <type 'numpy.int64'>)
# ('float', <type 'numpy.ndarray'>, dtype('float64'), <type 'numpy.float64'>)
# ('complex', <type 'numpy.ndarray'>, dtype('complex128'), <type 'numpy.complex128'>)
# ('bool', <type 'numpy.ndarray'>, dtype('bool'), <type 'numpy.bool_'>)
# ('object', <type 'numpy.ndarray'>, dtype('O'), <type 'numpy.object_'>)
# ('category (object)', <class 'pandas.core.categorical.Categorical'>, category, <class 'pandas.core.common.CategoricalDtypeType'>)
# ('datetime', <type 'numpy.ndarray'>, dtype('<M8[ns]'), <type 'numpy.datetime64'>)
# ('timedelta', <type 'numpy.ndarray'>, dtype('<m8[ns]'), <type 'numpy.timedelta64'>)
# ('sparse', <class 'pandas.sparse.array.SparseArray'>, dtype('float64'), <type 'numpy.float64'>)

DataFrame.get_values() or Block.get_values() returns a numpy.array. All data, including Categorical and Sparse, is mapped to a numpy.array based on its internal data type.

# get_values
for c, s in df.iteritems():
    v = s.get_values()
    print(c, type(v), v.dtype, v.dtype.type)
# ('int', <type 'numpy.ndarray'>, dtype('int64'), <type 'numpy.int64'>)
# ('float', <type 'numpy.ndarray'>, dtype('float64'), <type 'numpy.float64'>)
# ('complex', <type 'numpy.ndarray'>, dtype('complex128'), <type 'numpy.complex128'>)
# ('bool', <type 'numpy.ndarray'>, dtype('bool'), <type 'numpy.bool_'>)
# ('object', <type 'numpy.ndarray'>, dtype('O'), <type 'numpy.object_'>)
# ('category (object)', <type 'numpy.ndarray'>, dtype('O'), <type 'numpy.object_'>)
# ('datetime', <type 'numpy.ndarray'>, dtype('<M8[ns]'), <type 'numpy.datetime64'>)
# ('timedelta', <type 'numpy.ndarray'>, dtype('<m8[ns]'), <type 'numpy.timedelta64'>)
# ('sparse', <type 'numpy.ndarray'>, dtype('float64'), <type 'numpy.float64'>)
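A note for newer pandas versions: to my knowledge `get_values()` and `SparseSeries` were removed in pandas 1.0. The public way to see the same two layers is `Series.array` (the pandas-level container) versus `Series.to_numpy()` (always a numpy.ndarray). A minimal sketch, assuming pandas 0.24 or later:

```python
import numpy as np
import pandas as pd

s_int = pd.Series([1, 2])
s_cat = pd.Series(pd.Categorical(["A", "B"]))

# .array keeps the pandas-level container (the "pandas raw data" layer).
cat_array = s_cat.array

# .to_numpy() maps everything down to a plain numpy.ndarray.
cat_numpy = s_cat.to_numpy()
int_numpy = s_int.to_numpy()

for name, value in [("cat .array", cat_array),
                    ("cat .to_numpy()", cat_numpy),
                    ("int .to_numpy()", int_numpy)]:
    print(name, type(value).__name__, value.dtype)
```

The Categorical survives at the `.array` level and only becomes a plain ndarray once `to_numpy()` is called, which mirrors the values / get_values split shown above.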

ToDo: It may be useful to draw conversion maps between the layers.

Comment From: mroeschke

Looks like we have https://pandas.pydata.org/docs/development/internals.html as a start so I think we can close in favor of issues noting what aspects we are missing
