I am trying to create a pandas DataFrame from a bunch of numpy ndarrays without copying anything. The overall objective is to share a huge DataFrame between many processes through shared memory.

Some information on my environment:

  • pandas 0.18.0
  • numpy 1.10.4
  • python 3.5
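
For context, the shared buffers mentioned above could for example be allocated through multiprocessing shared memory. The following is only an illustrative sketch of how the source ndarrays might be obtained, not necessarily the exact setup used here:

import numpy as np
from multiprocessing import RawArray

# Illustration only: back an ndarray by a shared-memory RawArray so that
# child processes can map the same buffer without copying it.
shared_buf = RawArray('d', 3)                        # 3 float64 slots in shared memory
data = np.frombuffer(shared_buf, dtype=np.float64)   # zero-copy view over the buffer
data[:] = [1.23, 4.56, 7.89]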

The function I use to retrieve the underlying data pointer of my ndarrays:

def ndarray_data_ptr(a):
    return hex(a.__array_interface__['data'][0])
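
Two arrays that share the same buffer report the same pointer, which is the check used throughout this report:

>>> import numpy as np
>>> a = np.arange(3)
>>> b = a[:]  # a full slice is a view over the same buffer
>>> ndarray_data_ptr(a) == ndarray_data_ptr(b)
True
>>> ndarray_data_ptr(a) == ndarray_data_ptr(a.copy())
False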

Problem 1

It seems that pandas does not like int16 indexes: it silently converts them to int64 and makes a copy, even though I specifically ask for zero-copy.

>>> import pandas as pd
>>> import numpy as np

>>> data = np.array([1.23, 4.56, 7.89], dtype=np.dtype(np.float64))
>>> index = np.array([1, 2, 3], dtype=np.dtype(np.int16))

>>> df = pd.DataFrame(
...     data=data,
...     index=pd.Index(index, copy=False),
...     copy=False)

>>> print('data', ndarray_data_ptr(data))
data 0x2322790

>>> print('index', ndarray_data_ptr(index))
index 0x228d680

>>> print('df data', ndarray_data_ptr(df.values))
df data 0x2322790

>>> print('df index', ndarray_data_ptr(df.index.values))
df index 0x218a100

I managed to overcome this issue by explicitly providing the desired dtype when creating the pd.Index. It now works as expected, even though the behavior of pandas is not really obvious. I would have expected the dtype= argument of the constructor to default to None and re-use the dtype of the provided ndarray. Instead, the default dtype= is object and the resulting pd.Index is coerced to int64.

>>> df = pd.DataFrame(
...     data=data,
...     index=pd.Index(index, copy=False, dtype=index.dtype),
...     copy=False)

>>> print('data', ndarray_data_ptr(data))
data 0x2403780

>>> print('index', ndarray_data_ptr(index))
index 0x1abe930

>>> print('df data', ndarray_data_ptr(df.values))
df data 0x2403780

>>> print('df index', ndarray_data_ptr(df.index.values))
df index 0x1abe930

Problem 2

The previous example was just a starter. The real DataFrame that I want to build actually uses a pd.MultiIndex.

>>> data = np.array([1.23, 4.56, 7.89], dtype=np.dtype(np.float64))

>>> index_levels = [
...     np.array([1, 2, 3], dtype=np.dtype(np.int16)),
...     np.array([4, 5, 6], dtype=np.dtype(np.int16)),
... ]

>>> index_labels = [
...     np.array([0, 1, 2], dtype=np.dtype(np.int16)),
...     np.array([0, 1, 2], dtype=np.dtype(np.int16)),
... ]

>>> df = pd.DataFrame(
...     data=data,
...     index=pd.MultiIndex(
...         levels=[
...             pd.Index(index_levels[0], dtype=index_levels[0].dtype, copy=False),
...             pd.Index(index_levels[1], dtype=index_levels[1].dtype, copy=False),
...         ],
...         labels=[
...             pd.core.base.FrozenNDArray(index_labels[0], dtype=index_labels[0].dtype, copy=False),
...             pd.core.base.FrozenNDArray(index_labels[1], dtype=index_labels[1].dtype, copy=False),
...         ],
...         copy=False,
...     ),
...     copy=False)

>>> print('index level 0', ndarray_data_ptr(index_levels[0]))
index level 0 0x24adf70

>>> print('index level 1', ndarray_data_ptr(index_levels[1]))
index level 1 0x24a7d90

>>> print('index label 0', ndarray_data_ptr(index_labels[0]))
index label 0 0x1400b10

>>> print('index label 1', ndarray_data_ptr(index_labels[1]))
index label 1 0x25d34f0

>>> print('df index level 0', ndarray_data_ptr(df.index.levels[0].values))
df index level 0 0x24adf70

>>> print('df index level 1', ndarray_data_ptr(df.index.levels[1].values))
df index level 1 0x24a7d90

>>> print('df index label 0', ndarray_data_ptr(df.index.labels[0]))
df index label 0 0x24c0ad0

>>> print('df index label 1', ndarray_data_ptr(df.index.labels[1]))
df index label 1 0x24c0ad0

The index levels were properly re-used without a copy, whereas the index labels were copied. I noticed that the labels are not pd.Index instances; pandas uses pd.core.base.FrozenNDArray for them instead, so I tried the same trick as in Problem 1, but it is apparently not enough.

The dtypes of the FrozenNDArray labels were even coerced to int8 when building the pd.MultiIndex.

>>> df.index.levels[0]
Int64Index([1, 2, 3], dtype='int16')

>>> df.index.levels[1]
Int64Index([4, 5, 6], dtype='int16')

>>> df.index.labels[0]
FrozenNDArray([0, 1, 2], dtype='int8')

>>> df.index.labels[1]
FrozenNDArray([0, 1, 2], dtype='int8')

I spent quite a lot of time trying to prevent this dtype coercion in the case of pd.MultiIndex, but I still cannot figure out how.

In my real world problem, the overall DataFrame is ~40GB and the index labels account for ~300MB. I would like to prevent this unnecessary per-process memory overhead.

From my understanding, this is an issue with the way pd.MultiIndex creates its labels.

Comment From: NewbiZ

It seems to me that during the creation of the labels (_set_labels()) the call to _ensure_frozen always makes a copy while coercing the dtype, even if copy=False:

>>> level = np.array([12, 34, 56], dtype=np.dtype(np.int16))
>>> label = np.array([0, 1, 2], dtype=np.dtype(np.int16))
>>> pd.indexes.base._ensure_frozen(label, level, copy=False)
FrozenNDArray([0, 1, 2], dtype='int8')

The implementation of _ensure_frozen will unconditionally call _coerce_indexer_dtype which will change the dtype and make a copy:

>>> pd.core.common._coerce_indexer_dtype(label, level)
array([0, 1, 2], dtype=int8)
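
For reference, the coercion picks the smallest signed integer dtype able to address every category, which is why three levels end up as int8. Here is a rough reconstruction of that logic (a sketch, not the actual pandas code); casting to a different dtype necessarily allocates a new buffer, hence the copy:

import numpy as np

def coerce_indexer_dtype_sketch(indexer, categories):
    # Rough equivalent of _coerce_indexer_dtype: select the smallest signed
    # integer dtype that can hold an index into `categories`.
    n = len(categories)
    if n < np.iinfo(np.int8).max:
        dtype = np.int8
    elif n < np.iinfo(np.int16).max:
        dtype = np.int16
    elif n < np.iinfo(np.int32).max:
        dtype = np.int32
    else:
        dtype = np.int64
    # asarray with a different dtype returns a new array, i.e. a copy
    return np.asarray(indexer, dtype=dtype)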

Comment From: NewbiZ

Following is the implementation of _ensure_frozen:

def _ensure_frozen(array_like, categories, copy=False):
    array_like = com._coerce_indexer_dtype(array_like, categories)
    array_like = array_like.view(FrozenNDArray)
    if copy:
        array_like = array_like.copy()
    return array_like

I don't understand why the first line, which calls _coerce_indexer_dtype, is required here. This call should only be done when copy=True.
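
A hypothetical variant reflecting that suggestion could look like this (illustration only, reusing the internal names com and FrozenNDArray from the snippet above; this is not actual pandas code):

def _ensure_frozen_suggested(array_like, categories, copy=False):
    # Only coerce the dtype (which implies an allocation) when the caller
    # explicitly allows a copy; otherwise just re-view the existing buffer.
    if copy:
        array_like = com._coerce_indexer_dtype(array_like, categories)
        array_like = array_like.copy()
    return array_like.view(FrozenNDArray)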

Comment From: NewbiZ

I checked that this code path is still present in the master branch.

Comment From: jreback

None of this will be supported in the current version of pandas. You should check out pyarrow, which implements zero-copy construction. Further, it has the Plasma store to share objects in shared memory without copying.

I don't know if MultiIndexes are supported.

Comment From: NewbiZ

This is not a matter of "supporting" a feature or not. Pandas has an API; "copy=False" is part of it and should be fixed. You are basically saying "there's a bug, I don't care, use another library"...

Comment From: pmuller

I have the same issue on my side. Users should be able to trust the API. Why shouldn't this be fixed? While pyarrow looks nice, it's a different project. It's not easy to migrate existing code to it without breaking stuff.

Comment From: jreback

You are welcome to open a PR if you think you can ‘fix’ this. There are a number of underlying structural issues which make this a general problem, not solvable in the current version of pandas.

pyarrow is going to form the basis for pandas 2