this would be useful for reference purposes and also so that @jreback doesn't have to fix almost every non-trivial bug that pops up :). i would be happy to start writing this (with the help of others who know these things better than i do), i think it would be an excellent way to gain a deeper understanding of the internals.
Comment From: jreback
sure...all for that!
Comment From: jtratner
:+1: though maybe it would be better to add the documentation into the code (e.g., module level docstrings or comments at the top of modules) as opposed to putting it into documentation elsewhere -- might make it easier to keep them updated as changes occur.
Comment From: cpcloud
yeah i think a doc page might be better than the wiki for this
Comment From: jreback
maybe in this case a description at the top of core/internals would be useful......
Comment From: clham
Is this still alive and kicking? Perhaps adding a subhead to contributing to pandas titled code layout? I'm envisioning a paragraph/bulleted style doc with what calls what when you (for example) make a DataFrame, and how the major parts and pieces interact.
Comment From: jreback
I think some progress has been made in groupby, internals, index to document more with some top level comments
not sure this is really for public consumption; it may be better documented in the modules themselves
Comment From: jreback
on second thought
this might be nice if it's included in the docs so it can be updated when the code is updated (and as an rst might be easier)
maybe you want to take a stab at some things that might be useful in this page? (and I can fill them in a bit)
Comment From: jreback
and there is a section on index internals at the end of indexing.rst which should be moved to internals as well
Comment From: clham
Sure! I'll put together a PR with a TOC and some headings, then start muddling through the code.
Comment From: jreback
document internal attributes of DataFrameGroupBy and friends: http://stackoverflow.com/questions/24806601/convert-groupby-to-dataframe-join-the-groups-again/24807309#24807309
Comment From: immerrr
After reinventing several cythonized routines and hitting my head against the wall of pytables io code I was thinking along the lines of actually generating a separate developer doc (with its own conf.py): separation would help keeping the scope and build time of public doc down, and one could use cross-references where necessary.
Comment From: clham
That is a much cleaner solution than the disaster I've been trying to cook up.
Comment From: jreback
little tidbits that need docs (see end of this): https://github.com/pydata/pandas/pull/7790 e.g. how to compare tz with 'UTC'
Comment From: sinhrks
I think the guide is really useful for contributors (including me). I prepared a rough summary for internal docs for discussion.
Data Layers
Explanation of the internal data layers. They consist of the following levels.
- Series, DataFrame and Panel: Contain internal data in a BlockManager.
- BlockManager: Handles multiple Blocks.
- Block: Represents data for each internal data type.
- pandas raw data: Represents internal data types which don't exist in numpy. Currently, Categorical and Sparse. Dtypes which already exist in numpy don't have this layer.
- numpy.array: All the internal data are finally mapped to numpy.array.
ToDo: Explain what ops are (basically) defined in what layers, such as slicing and numeric ops.
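To make the layering above a bit more concrete, here is a minimal pure-Python sketch. It is not how pandas implements any of this: ToyFrame, ToyBlockManager and ToyBlock are hypothetical names made up for illustration, plain lists stand in for numpy arrays, and the data type of a column is simply taken from its first element (so columns are assumed to be non-empty and homogeneous). It only shows the idea that the user-facing container delegates storage to a manager, which groups columns into one block per data type.

```python
from collections import OrderedDict


class ToyBlock:
    """Hypothetical stand-in for a Block: columns sharing one data type."""

    def __init__(self, dtype, names, columns):
        self.dtype = dtype      # Python type shared by every column in the block
        self.names = names      # labels of the columns stored in this block
        self.values = columns   # one list of values per column


class ToyBlockManager:
    """Hypothetical stand-in for a BlockManager: one ToyBlock per data type."""

    def __init__(self, data):
        grouped = OrderedDict()
        for name, column in data.items():
            # Assumption: the first element decides the type of the whole column.
            dtype = type(column[0])
            grouped.setdefault(dtype, []).append(name)
        self.blocks = [
            ToyBlock(dtype, names, [data[n] for n in names])
            for dtype, names in grouped.items()
        ]


class ToyFrame:
    """Hypothetical user-facing container; storage lives in the manager."""

    def __init__(self, data):
        self._data = ToyBlockManager(data)


frame = ToyFrame(OrderedDict([
    ("int", [1, 2]),
    ("float", [1.1, 2.1]),
    ("int2", [3, 4]),
    ("object", ["A", "B"]),
]))

for block in frame._data.blocks:
    print(block.dtype.__name__, block.names)
# int ['int', 'int2']
# float ['float']
# str ['object']
```

The two integer columns end up in the same block even though they are not adjacent in the frame, which is the sense in which blocks are separated by type rather than by column position.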
Internal Data Access
Assume the following DataFrame.
import pandas as pd
df = pd.DataFrame({'int': [1, 2],
                   'float': [1.1, 2.1],
                   'complex': [1+1j, 1+2j],
                   'bool': [True, False],
                   'object': ['A', 'B'],
                   'category (object)': pd.Categorical(['A', 'B']),
                   'datetime': [pd.Timestamp('2015-01-01'), pd.Timestamp('2015-02-01')],
                   'timedelta': [pd.Timedelta('1 day'), pd.Timedelta('2 day')],
                   'sparse': pd.SparseSeries([1, 0], fill_value=0),
                   }, columns=['int', 'float', 'complex', 'bool', 'object',
                               'category (object)', 'datetime', 'timedelta', 'sparse'])
df
# int float complex bool object category (object) datetime timedelta \
# 0 1 1.1 (1+1j) True A A 2015-01-01 1 days
# 1 2 2.1 (1+2j) False B B 2015-02-01 2 days
#
# sparse
# 0 1
# 1 0
Access to BlockManager and Block
DataFrame._data contains its internal BlockManager. BlockManager has a blocks attribute which stores its internal Blocks. Blocks are separated based on their types.
for c, s in df.iteritems():
    for block in s._data.blocks:
        print(c, type(block), block.dtype, block.dtype.type)
# ('int', <class 'pandas.core.internals.IntBlock'>, dtype('int64'), <type 'numpy.int64'>)
# ('float', <class 'pandas.core.internals.FloatBlock'>, dtype('float64'), <type 'numpy.float64'>)
# ('complex', <class 'pandas.core.internals.ComplexBlock'>, dtype('complex128'), <type 'numpy.complex128'>)
# ('bool', <class 'pandas.core.internals.BoolBlock'>, dtype('bool'), <type 'numpy.bool_'>)
# ('object', <class 'pandas.core.internals.ObjectBlock'>, dtype('O'), <type 'numpy.object_'>)
# ('category (object)', <class 'pandas.core.internals.CategoricalBlock'>, category, <class 'pandas.core.common.CategoricalDtypeType'>)
# ('datetime', <class 'pandas.core.internals.DatetimeBlock'>, dtype('<M8[ns]'), <type 'numpy.datetime64'>)
# ('timedelta', <class 'pandas.core.internals.TimeDeltaBlock'>, dtype('<m8[ns]'), <type 'numpy.timedelta64'>)
# ('sparse', <class 'pandas.core.internals.SparseBlock'>, dtype('float64'), <type 'numpy.float64'>)
DataFrame.values or Block.values returns pandas raw data.
# values
for c, s in df.iteritems():
    v = s.values
    print(c, type(v), v.dtype, v.dtype.type)
# ('int', <type 'numpy.ndarray'>, dtype('int64'), <type 'numpy.int64'>)
# ('float', <type 'numpy.ndarray'>, dtype('float64'), <type 'numpy.float64'>)
# ('complex', <type 'numpy.ndarray'>, dtype('complex128'), <type 'numpy.complex128'>)
# ('bool', <type 'numpy.ndarray'>, dtype('bool'), <type 'numpy.bool_'>)
# ('object', <type 'numpy.ndarray'>, dtype('O'), <type 'numpy.object_'>)
# ('category (object)', <class 'pandas.core.categorical.Categorical'>, category, <class 'pandas.core.common.CategoricalDtypeType'>)
# ('datetime', <type 'numpy.ndarray'>, dtype('<M8[ns]'), <type 'numpy.datetime64'>)
# ('timedelta', <type 'numpy.ndarray'>, dtype('<m8[ns]'), <type 'numpy.timedelta64'>)
# ('sparse', <class 'pandas.sparse.array.SparseArray'>, dtype('float64'), <type 'numpy.float64'>)
DataFrame.get_values() or Block.get_values() returns numpy.array. All data, including Categorical and Sparse, are mapped to numpy.array based on their internal data types.
# get_values
for c, s in df.iteritems():
    v = s.get_values()
    print(c, type(v), v.dtype, v.dtype.type)
# ('int', <type 'numpy.ndarray'>, dtype('int64'), <type 'numpy.int64'>)
# ('float', <type 'numpy.ndarray'>, dtype('float64'), <type 'numpy.float64'>)
# ('complex', <type 'numpy.ndarray'>, dtype('complex128'), <type 'numpy.complex128'>)
# ('bool', <type 'numpy.ndarray'>, dtype('bool'), <type 'numpy.bool_'>)
# ('object', <type 'numpy.ndarray'>, dtype('O'), <type 'numpy.object_'>)
# ('category (object)', <type 'numpy.ndarray'>, dtype('O'), <type 'numpy.object_'>)
# ('datetime', <type 'numpy.ndarray'>, dtype('<M8[ns]'), <type 'numpy.datetime64'>)
# ('timedelta', <type 'numpy.ndarray'>, dtype('<m8[ns]'), <type 'numpy.timedelta64'>)
# ('sparse', <type 'numpy.ndarray'>, dtype('float64'), <type 'numpy.float64'>)
ToDo: It may be useful to draw conversion maps between the layers.
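As a rough illustration of the difference between the pandas raw data layer and the numpy.array layer, the sketch below uses only public API rather than the internal attributes shown above. It assumes a pandas version of 1.0 or later, where pd.arrays.SparseArray is available; np.asarray is used here as a stand-in for get_values(), which is an assumption of this sketch and not something taken from the summary above.

```python
import numpy as np
import pandas as pd

# Two columns whose raw data is a pandas-level container, not a numpy array.
cat = pd.Series(pd.Categorical(["A", "B"]))
sparse = pd.Series(pd.arrays.SparseArray([1, 0], fill_value=0))

for name, s in [("category", cat), ("sparse", sparse)]:
    raw = s.values          # pandas raw data layer (Categorical / SparseArray)
    dense = np.asarray(s)   # numpy layer: always a plain numpy.ndarray
    print(name, type(raw).__name__, type(dense).__name__)
```

For columns with a dtype that numpy already supports, such as int64 or float64, both expressions should give a numpy.ndarray, which matches the note above that those dtypes have no separate raw data layer.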
Comment From: mroeschke
Looks like we have https://pandas.pydata.org/docs/development/internals.html as a start so I think we can close in favor of issues noting what aspects we are missing