Pandas ValueError and IndexError for pivot_table

I've found two examples of pivot_table that fail unexpectedly, each with a different error:

import numpy as np
import pandas as pd

NUM_ROWS = 51364452
NUM_INDEX = 34262015
NUM_COLUMNS = 1732
df = pd.DataFrame({'A' : np.random.randint(NUM_INDEX, size=NUM_ROWS),
                   'B' : np.random.randint(NUM_COLUMNS, size=NUM_ROWS),
                   'C' : np.random.randn(NUM_ROWS)})

df_pivoted = df.pivot_table(index='A', columns='B', values='C', margins=False)

Error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Anaconda\lib\site-packages\pandas\tools\pivot.py", line 117, in pivot_table
    table = agged.unstack(to_unstack)
  File "C:\Anaconda\lib\site-packages\pandas\core\frame.py", line 3605, in unstack
    return unstack(self, level)
  File "C:\Anaconda\lib\site-packages\pandas\core\reshape.py", line 394, in unstack
    return _unstack_multiple(obj, level)
  File "C:\Anaconda\lib\site-packages\pandas\core\reshape.py", line 294, in _unstack_multiple
    unstacked = dummy.unstack('__placeholder__')
  File "C:\Anaconda\lib\site-packages\pandas\core\frame.py", line 3605, in unstack
    return unstack(self, level)
  File "C:\Anaconda\lib\site-packages\pandas\core\reshape.py", line 398, in unstack
    return _unstack_frame(obj, level)
  File "C:\Anaconda\lib\site-packages\pandas\core\reshape.py", line 438, in _unstack_frame
    value_columns=obj.columns)
  File "C:\Anaconda\lib\site-packages\pandas\core\reshape.py", line 98, in __init__
    self._make_selectors()
  File "C:\Anaconda\lib\site-packages\pandas\core\reshape.py", line 132, in _make_selectors
    mask = np.zeros(np.prod(self.full_shape), dtype=bool)
ValueError: negative dimensions are not allowed

Second example (I have just changed the input values):

# Cast to int: recent NumPy versions reject float values for size
NUM_ROWS = int(5e7)
NUM_INDEX = int(3e7)
NUM_COLUMNS = int(2e3)
df = pd.DataFrame({'A' : np.random.randint(NUM_INDEX, size=NUM_ROWS),
                   'B' : np.random.randint(NUM_COLUMNS, size=NUM_ROWS),
                   'C' : np.random.randn(NUM_ROWS)})

df_pivoted = df.pivot_table(index='A', columns='B', values='C', margins=False)

Error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Anaconda\lib\site-packages\pandas\tools\pivot.py", line 117, in pivot_table
    table = agged.unstack(to_unstack)
  File "C:\Anaconda\lib\site-packages\pandas\core\frame.py", line 3605, in unstack
    return unstack(self, level)
  File "C:\Anaconda\lib\site-packages\pandas\core\reshape.py", line 394, in unstack
    return _unstack_multiple(obj, level)
  File "C:\Anaconda\lib\site-packages\pandas\core\reshape.py", line 294, in _unstack_multiple
    unstacked = dummy.unstack('__placeholder__')
  File "C:\Anaconda\lib\site-packages\pandas\core\frame.py", line 3605, in unstack
    return unstack(self, level)
  File "C:\Anaconda\lib\site-packages\pandas\core\reshape.py", line 398, in unstack
    return _unstack_frame(obj, level)
  File "C:\Anaconda\lib\site-packages\pandas\core\reshape.py", line 438, in _unstack_frame
    value_columns=obj.columns)
  File "C:\Anaconda\lib\site-packages\pandas\core\reshape.py", line 98, in __init__
    self._make_selectors()
  File "C:\Anaconda\lib\site-packages\pandas\core\reshape.py", line 133, in _make_selectors
    mask.put(selector, True)
IndexError: index 1421936250 is out of bounds for axis 0 with size 1421935744

Both examples work fine if you reduce the input values.

pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.5.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 45 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.16.2
nose: 1.3.7
Cython: 0.22.1
numpy: 1.9.2
scipy: 0.15.1
statsmodels: 0.6.1
IPython: 3.2.0
sphinx: 1.3.1
patsy: 0.3.0
dateutil: 2.4.2
pytz: 2015.4
bottleneck: 1.0.0
tables: 3.2.0
numexpr: 2.4.3
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 1.0.0
xlsxwriter: 0.7.3
lxml: 3.4.4
bs4: 4.3.2
html5lib: None
httplib2: 0.9
apiclient: None
sqlalchemy: 1.0.5
pymysql: None
psycopg2: None

Comment From: jreback

You are running out of memory. Pivoting like this creates gigantic structures.
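To put a rough number on "gigantic", here is a back-of-the-envelope sketch of the dense result size. It assumes every index value and every column value actually occurs (so it is an upper bound) and that each cell is a float64; `dense_pivot_bytes` is a hypothetical helper, not a pandas function.

```python
def dense_pivot_bytes(n_index, n_columns, itemsize=8):
    """Upper bound on the bytes needed for a dense n_index x n_columns table.

    itemsize=8 assumes float64 cells, as produced by averaging column 'C'.
    """
    return n_index * n_columns * itemsize


# Shapes taken from the two examples in the report (upper bounds only,
# since not every random index value is guaranteed to occur).
first_bytes = dense_pivot_bytes(34262015, 1732)
second_bytes = dense_pivot_bytes(int(3e7), int(2e3))

print("first example:  about %.0f GB" % (first_bytes / 1e9))
print("second example: about %.0f GB" % (second_bytes / 1e9))
```

Either way the dense table would need hundreds of gigabytes, before counting the boolean mask and other intermediates that unstack allocates.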

Comment From: joshlk

Why does it throw different errors? A memory error would be more useful.

Comment From: jreback

It's probably running out of memory in different places.
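A possible alternative reading, not confirmed anywhere in this thread: the size in the second traceback is consistent with the product of the table's dimensions wrapping around a signed 32-bit integer. The default NumPy integer was 32-bit on Windows before NumPy 2.0, so `np.prod(self.full_shape)` could overflow there. The sketch below only checks the arithmetic; `wrap_to_int32` is a hypothetical helper and the row count is inferred, not reported.

```python
import math


def wrap_to_int32(n):
    """Value of n after wrapping into the signed 32-bit range."""
    return (n + 2 ** 31) % 2 ** 32 - 2 ** 31


reported_size = 1421935744   # size from the IndexError in the second example
n_columns = 2000             # NUM_COLUMNS in the second example
max_rows = int(3e7)          # NUM_INDEX in the second example

# Which row counts r <= max_rows satisfy r * n_columns == reported_size mod 2**32?
candidates = []
for k in range(max_rows * n_columns // 2 ** 32 + 1):
    true_size = reported_size + k * 2 ** 32
    if true_size % n_columns == 0 and true_size // n_columns <= max_rows:
        candidates.append(true_size // n_columns)

# Expected number of distinct values when drawing 5e7 times from 3e7 values.
expected_distinct = 3e7 * (1 - math.exp(-5e7 / 3e7))

print(candidates, round(expected_distinct))

# Upper bound for the first example (assumes every index value occurs):
# the wrapped product is negative, which would match
# "negative dimensions are not allowed".
first_wrapped_upper_bound = wrap_to_int32(34262015 * 1732)
print(first_wrapped_upper_bound)
```

The only row count that fits is very close to the number of distinct index values one would expect from the random draw, which suggests (but does not prove) an overflow rather than two different out-of-memory conditions.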

You really need a sparse pivot (as these are likely quite sparse), but that doesn't exist at the moment; it would need some work.
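Until something like that exists, a workaround can be sketched with scipy.sparse. This is a minimal sketch under assumptions, not a pandas feature: `sparse_pivot_mean` is a hypothetical helper, it hard-codes the mean (the default aggregation of pivot_table), and cells with no data come out as 0 rather than NaN.

```python
import pandas as pd
from scipy import sparse


def sparse_pivot_mean(df, index, columns, values):
    """Mean-aggregate `values` per (index, columns) pair into a CSR matrix.

    Returns (matrix, row_labels, col_labels). Missing cells are implicit
    zeros, unlike pivot_table, which would fill them with NaN.
    """
    # Aggregate first so each (row, column) pair appears exactly once.
    agg = df.groupby([index, columns])[values].mean().reset_index()
    # Map labels to dense integer positions, sorted like pivot_table's axes.
    row_codes, row_labels = pd.factorize(agg[index], sort=True)
    col_codes, col_labels = pd.factorize(agg[columns], sort=True)
    matrix = sparse.coo_matrix(
        (agg[values].to_numpy(), (row_codes, col_codes)),
        shape=(len(row_labels), len(col_labels)),
    ).tocsr()
    return matrix, row_labels, col_labels


# Small usage example with the same column names as the report.
small = pd.DataFrame({'A': [0, 0, 2, 2, 5],
                      'B': [1, 1, 0, 3, 3],
                      'C': [1.0, 3.0, 4.0, 5.0, 6.0]})
matrix, row_labels, col_labels = sparse_pivot_mean(small, 'A', 'B', 'C')
print(matrix.toarray())
print(list(row_labels), list(col_labels))
```

Memory then scales with the number of observed (A, B) pairs instead of the full index-by-columns grid.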

Comment From: joshlk

OK, thanks. A memory error would be more useful and less confusing.

Comment From: jreback

see #10554

Comment From: jreback

Though the second part might be somewhat related to the hashtable implementation.

Comment From: jreback

cc @behzadnouri cc @sinhrks

IIRC you looked at some of this code before

Comment From: usnishmukherjee

@jreback This IndexError still exists, right? I am getting the same error when calling df.pivot_table(). Here are the details of my DataFrame:

RangeIndex: 25000095 entries, 0 to 25000094
Data columns (total 4 columns):
 #   Column     Dtype
 0   userId     int64
 1   movieId    int64
 2   rating     float64
 3   timestamp  int64
dtypes: float64(1), int64(3)
memory usage: 762.9 MB

Error:

IndexError: index 1007624404 is out of bounds for axis 0 with size 1007623835

Please let me know if there is any update.