I am trying to create a pandas DataFrame from a bunch of numpy ndarrays without copying anything. The overall objective is to share a huge DataFrame between many processes through shared memory.
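For context, here is a sketch of how such a shared-memory-backed ndarray could be obtained (this uses multiprocessing.RawArray purely as an illustration; the exact sharing mechanism is not essential to the report):
>>> import numpy as np
>>> from multiprocessing import RawArray
>>> shared_buf = RawArray('d', 3)  # 3 float64 slots in a shared-memory buffer
>>> data = np.frombuffer(shared_buf, dtype=np.float64)  # wraps the buffer, no copy
>>> data[:] = [1.23, 4.56, 7.89]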
Some information on my environment:
pandas 0.18.0
numpy 1.10.4
python 3.5
The function I use to retrieve the underlying data pointer of my ndarrays:
def ndarray_data_ptr(a):
    return hex(a.__array_interface__['data'][0])
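As a complementary check (my own convenience, not required for the report), numpy can also compare buffers directly; np.may_share_memory is a bounds-based test that avoids reading raw pointers:
>>> a = np.arange(3)
>>> np.may_share_memory(a, a[:2])
True
>>> np.may_share_memory(a, a.copy())
False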
Problem 1
It seems that pandas does not like an int16 index. It will silently convert it to int64 and make a copy, even though I specifically ask for zero-copy.
>>> import pandas as pd
>>> import numpy as np
>>> data = np.array([1.23, 4.56, 7.89], dtype=np.dtype(np.float64))
>>> index = np.array([1, 2, 3], dtype=np.dtype(np.int16))
>>> df = pd.DataFrame(
... data=data,
... index=pd.Index(index, copy=False),
... copy=False)
>>> print('data', ndarray_data_ptr(data))
data 0x2322790
>>> print('index', ndarray_data_ptr(index))
index 0x228d680
>>> print('df data', ndarray_data_ptr(df.values))
df data 0x2322790
>>> print('df index', ndarray_data_ptr(df.index.values))
df index 0x218a100
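The pointers already show the copy; as a cross-check (my own addition), a memory-overlap test should agree:
>>> np.may_share_memory(index, df.index.values)
False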
I managed to overcome this issue by explicitly providing the desired dtype when creating the pd.Index. It now seems to work as expected, even though the behavior of pandas is not really obvious. I would have expected the default value for dtype= of the constructor to be None and to re-use the dtype of the provided ndarray. Instead, the default dtype= is object and the resulting pd.Index is coerced to int64.
>>> df = pd.DataFrame(
... data=data,
... index=pd.Index(index, copy=False, dtype=index.dtype),
... copy=False)
>>> print('data', ndarray_data_ptr(data))
data 0x2403780
>>> print('index', ndarray_data_ptr(index))
index 0x1abe930
>>> print('df data', ndarray_data_ptr(df.values))
df data 0x2403780
>>> print('df index', ndarray_data_ptr(df.index.values))
df index 0x1abe930
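Given the matching pointers, I would expect an overlap test to confirm that both the data and the index re-use the original buffers:
>>> np.may_share_memory(data, df.values)
True
>>> np.may_share_memory(index, df.index.values)
True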
Problem 2
The previous example was just a starter. The real DataFrame that I want to build actually uses a pd.MultiIndex.
>>> data = np.array([1.23, 4.56, 7.89], dtype=np.dtype(np.float64))
>>> index_levels = [
...     np.array([1, 2, 3], dtype=np.dtype(np.int16)),
...     np.array([4, 5, 6], dtype=np.dtype(np.int16)),
... ]
>>> index_labels = [
...     np.array([0, 1, 2], dtype=np.dtype(np.int16)),
...     np.array([0, 1, 2], dtype=np.dtype(np.int16)),
... ]
>>> df = pd.DataFrame(
...     data=data,
...     index=pd.MultiIndex(
...         levels=[
...             pd.Index(index_levels[0], dtype=index_levels[0].dtype, copy=False),
...             pd.Index(index_levels[1], dtype=index_levels[1].dtype, copy=False),
...         ],
...         labels=[
...             pd.core.base.FrozenNDArray(index_labels[0], dtype=index_labels[0].dtype, copy=False),
...             pd.core.base.FrozenNDArray(index_labels[1], dtype=index_labels[1].dtype, copy=False),
...         ],
...         copy=False,
...     ),
...     copy=False)
>>> print('index level 0', ndarray_data_ptr(index_levels[0]))
index level 0 0x24adf70
>>> print('index level 1', ndarray_data_ptr(index_levels[1]))
index level 1 0x24a7d90
>>> print('index label 0', ndarray_data_ptr(index_labels[0]))
index label 0 0x1400b10
>>> print('index label 1', ndarray_data_ptr(index_labels[1]))
index label 1 0x25d34f0
>>> print('df index level 0', ndarray_data_ptr(df.index.levels[0].values))
df index level 0 0x24adf70
>>> print('df index level 1', ndarray_data_ptr(df.index.levels[1].values))
df index level 1 0x24a7d90
>>> print('df index label 0', ndarray_data_ptr(df.index.labels[0]))
df index label 0 0x24c0ad0
>>> print('df index label 1', ndarray_data_ptr(df.index.labels[1]))
df index label 1 0x24c0ad0
The index levels were properly re-used without a copy, whereas the index labels were copied. Actually, I noticed that the labels are not pd.Index instances; pandas instead uses pd.core.base.FrozenNDArray, so I tried to trick them with the same technique used in Problem 1, but this is apparently not enough. Even the dtype of the FrozenNDArray was coerced to int8 when building the pd.MultiIndex.
>>> df.index.levels[0]
Int64Index([1, 2, 3], dtype='int16')
>>> df.index.levels[1]
Int64Index([4, 5, 6], dtype='int16')
>>> df.index.labels[0]
FrozenNDArray([0, 1, 2], dtype='int8')
>>> df.index.labels[1]
FrozenNDArray([0, 1, 2], dtype='int8')
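If I understand it correctly, the copy is a direct consequence of the dtype coercion: converting an int16 array to int8 cannot re-use the original buffer. A minimal illustration of my own:
>>> label = np.array([0, 1, 2], dtype=np.int16)
>>> np.may_share_memory(label, label.astype(np.int8))
False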
I spent quite a lot of time trying to prevent this dtype coercion in the case of pd.MultiIndex, but I still cannot figure out how.
In my real world problem, the overall DataFrame is ~40GB and the index labels account for ~300MB. I would like to prevent this unnecessary per-process memory overhead.
From my understanding, this is an issue with the way pd.MultiIndex creates its labels.
Comment From: NewbiZ
It seems to me that during the creation of the labels (_set_labels()), the call to _ensure_frozen always makes a copy while coercing the dtype, even if copy=False:
>>> level = np.array([12, 34, 56], dtype=np.dtype(np.int16))
>>> label = np.array([0, 1, 2], dtype=np.dtype(np.int16))
>>> pd.indexes.base._ensure_frozen(label, level, copy=False)
FrozenNDArray([0, 1, 2], dtype='int8')
The implementation of _ensure_frozen will unconditionally call _coerce_indexer_dtype, which will change the dtype and make a copy:
>>> pd.core.common._coerce_indexer_dtype(label, level)
array([0, 1, 2], dtype=int8)
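If I read the behavior correctly, the coercion simply picks the smallest signed integer dtype able to index the number of categories, roughly along these lines (my own paraphrase, not the actual pandas source):
def coerce_indexer_dtype_sketch(indexer, categories):
    # Pick the smallest signed integer dtype that can index len(categories)
    # values; with 3 categories this is int8, which forces a copy of the
    # int16 labels regardless of copy=False.
    n = len(categories)
    if n < np.iinfo(np.int8).max:
        return indexer.astype(np.int8)
    elif n < np.iinfo(np.int16).max:
        return indexer.astype(np.int16)
    elif n < np.iinfo(np.int32).max:
        return indexer.astype(np.int32)
    return indexer.astype(np.int64)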
Comment From: NewbiZ
Following is the implementation of _ensure_frozen:
def _ensure_frozen(array_like, categories, copy=False):
    array_like = com._coerce_indexer_dtype(array_like, categories)
    array_like = array_like.view(FrozenNDArray)
    if copy:
        array_like = array_like.copy()
    return array_like
I don't understand why the first line calling _coerce_indexer_dtype is required here. This call should only be done if copy=True.
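Something along these lines is what I would have expected; this is only a sketch of the idea, not a tested patch (the np.issubdtype guard is my own addition):
def _ensure_frozen(array_like, categories, copy=False):
    # Sketch: re-use the caller's buffer when copy=False and the provided
    # dtype is already an integer type; only coerce (and therefore copy)
    # when a copy is allowed anyway.
    if copy or not np.issubdtype(array_like.dtype, np.integer):
        array_like = com._coerce_indexer_dtype(array_like, categories)
    array_like = array_like.view(FrozenNDArray)
    if copy:
        array_like = array_like.copy()
    return array_like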
Comment From: NewbiZ
I checked that this code path is still present in the master branch.
Comment From: jreback
None of this will be supported in the current version of pandas. You should check out pyarrow, which implements zero-copy construction. Further, it has the Plasma store to zero-copy share objects in shared memory.
I don't know if MultiIndexes are supported.
Comment From: NewbiZ
This is not a matter of "supporting" a feature or not. Pandas has an API; copy=False is part of it, and it should be fixed. You are basically saying "there's a bug, I don't care, use another library"...
Comment From: pmuller
I have the same issue on my side. Users should be able to trust the API. Why shouldn't this be fixed? While pyarrow looks nice, it's a different project. It's not easy to migrate existing code to it without breaking stuff.
Comment From: jreback
You are welcome to open a PR if you think you can "fix" this. There are a number of underlying structural issues which make this a general problem that is not solvable in the current version of pandas.
pyarrow is going to form the basis for pandas 2.