I've found two examples of pivot_table that fail unexpectedly, each for a different reason:
import numpy as np
import pandas as pd

NUM_ROWS = 51364452
NUM_INDEX = 34262015
NUM_COLUMNS = 1732
df = pd.DataFrame({'A': np.random.randint(NUM_INDEX, size=NUM_ROWS),
                   'B': np.random.randint(NUM_COLUMNS, size=NUM_ROWS),
                   'C': np.random.randn(NUM_ROWS)})
df_pivoted = df.pivot_table(index='A', columns='B', values='C', margins=False)
Error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Anaconda\lib\site-packages\pandas\tools\pivot.py", line 117, in pivot_table
    table = agged.unstack(to_unstack)
  File "C:\Anaconda\lib\site-packages\pandas\core\frame.py", line 3605, in unstack
    return unstack(self, level)
  File "C:\Anaconda\lib\site-packages\pandas\core\reshape.py", line 394, in unstack
    return _unstack_multiple(obj, level)
  File "C:\Anaconda\lib\site-packages\pandas\core\reshape.py", line 294, in _unstack_multiple
    unstacked = dummy.unstack('__placeholder__')
  File "C:\Anaconda\lib\site-packages\pandas\core\frame.py", line 3605, in unstack
    return unstack(self, level)
  File "C:\Anaconda\lib\site-packages\pandas\core\reshape.py", line 398, in unstack
    return _unstack_frame(obj, level)
  File "C:\Anaconda\lib\site-packages\pandas\core\reshape.py", line 438, in _unstack_frame
    value_columns=obj.columns)
  File "C:\Anaconda\lib\site-packages\pandas\core\reshape.py", line 98, in __init__
    self._make_selectors()
  File "C:\Anaconda\lib\site-packages\pandas\core\reshape.py", line 132, in _make_selectors
    mask = np.zeros(np.prod(self.full_shape), dtype=bool)
ValueError: negative dimensions are not allowed
Second example (I have just changed the input values):
# Cast to int: newer NumPy versions reject a float size
NUM_ROWS = int(5e7)
NUM_INDEX = int(3e7)
NUM_COLUMNS = int(2e3)
df = pd.DataFrame({'A': np.random.randint(NUM_INDEX, size=NUM_ROWS),
                   'B': np.random.randint(NUM_COLUMNS, size=NUM_ROWS),
                   'C': np.random.randn(NUM_ROWS)})
df_pivoted = df.pivot_table(index='A', columns='B', values='C', margins=False)
Error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Anaconda\lib\site-packages\pandas\tools\pivot.py", line 117, in pivot_table
    table = agged.unstack(to_unstack)
  File "C:\Anaconda\lib\site-packages\pandas\core\frame.py", line 3605, in unstack
    return unstack(self, level)
  File "C:\Anaconda\lib\site-packages\pandas\core\reshape.py", line 394, in unstack
    return _unstack_multiple(obj, level)
  File "C:\Anaconda\lib\site-packages\pandas\core\reshape.py", line 294, in _unstack_multiple
    unstacked = dummy.unstack('__placeholder__')
  File "C:\Anaconda\lib\site-packages\pandas\core\frame.py", line 3605, in unstack
    return unstack(self, level)
  File "C:\Anaconda\lib\site-packages\pandas\core\reshape.py", line 398, in unstack
    return _unstack_frame(obj, level)
  File "C:\Anaconda\lib\site-packages\pandas\core\reshape.py", line 438, in _unstack_frame
    value_columns=obj.columns)
  File "C:\Anaconda\lib\site-packages\pandas\core\reshape.py", line 98, in __init__
    self._make_selectors()
  File "C:\Anaconda\lib\site-packages\pandas\core\reshape.py", line 133, in _make_selectors
    mask.put(selector, True)
IndexError: index 1421936250 is out of bounds for axis 0 with size 1421935744
The example works fine if you reduce the input values.
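A note on the sizes in the two tracebacks: both look consistent with the product of the result's dimensions wrapping around a 32-bit signed integer, which was NumPy's default integer type on Windows at the time. This is a reading of the numbers, not something confirmed in this thread. The sketch below is plain Python arithmetic under that assumption. `wrap_int32` and `dense_cell_count` are hypothetical helper names, and the inputs are the upper bounds NUM_INDEX and NUM_COLUMNS from the first example, since the report does not give the actual number of distinct values.

```python
def wrap_int32(value):
    """Reduce an exact integer to what a 32-bit signed integer would hold."""
    return (value + 2**31) % 2**32 - 2**31


def dense_cell_count(n_index, n_columns):
    """Return (exact cell count, cell count after 32-bit wraparound)."""
    exact = n_index * n_columns  # Python ints never overflow
    return exact, wrap_int32(exact)


# Upper bound for the first example: assumes every index value and every
# column value actually occurs in the random data.
exact, wrapped = dense_cell_count(34262015, 1732)
print(exact, wrapped)  # the wrapped value is negative
```

Under this assumption, a negative wrapped product would match "negative dimensions are not allowed", and a positive but too-small one would produce a mask shorter than the largest selector index, as in the second traceback.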
pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.5.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 45 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
pandas: 0.16.2
nose: 1.3.7
Cython: 0.22.1
numpy: 1.9.2
scipy: 0.15.1
statsmodels: 0.6.1
IPython: 3.2.0
sphinx: 1.3.1
patsy: 0.3.0
dateutil: 2.4.2
pytz: 2015.4
bottleneck: 1.0.0
tables: 3.2.0
numexpr: 2.4.3
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 1.0.0
xlsxwriter: 0.7.3
lxml: 3.4.4
bs4: 4.3.2
html5lib: None
httplib2: 0.9
apiclient: None
sqlalchemy: 1.0.5
pymysql: None
psycopg2: None
Comment From: jreback
You are running out of memory. Pivoting like this creates gigantic structures.
Comment From: joshlk
Why does it throw two different errors? A memory error would be more useful.
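Until pandas raises a clearer error itself, a size check before pivoting can produce the memory error asked for here. The following is a minimal sketch, not part of pandas: `check_dense_pivot_size` is a hypothetical helper, and the 8-byte item size (float64) and the 8 GiB default limit are assumptions to adjust for your machine.

```python
import pandas as pd


def check_dense_pivot_size(df, index, columns, itemsize=8, max_bytes=8 * 1024**3):
    """Estimate the dense pivot result and raise MemoryError if it is too large.

    Returns (n_rows, n_cols, n_bytes) when the estimate fits within max_bytes.
    """
    # int() gives exact Python integers, so the product cannot overflow
    n_rows = int(df[index].nunique())
    n_cols = int(df[columns].nunique())
    n_bytes = n_rows * n_cols * itemsize
    if n_bytes > max_bytes:
        raise MemoryError(
            "dense pivot would need about %d bytes for a %d x %d result"
            % (n_bytes, n_rows, n_cols)
        )
    return n_rows, n_cols, n_bytes
```

Calling `check_dense_pivot_size(df, 'A', 'B')` before `df.pivot_table(...)` would fail early with a readable message instead of one of the errors above.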
Comment From: jreback
It's probably running out of memory in different places.
You really need a sparse pivot (as these are likely quite sparse), but that doesn't exist at the moment; it would need some work.
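As a rough workaround, a sparse result can be built outside pandas with SciPy. The sketch below is one possible approach, not a pandas feature: `sparse_pivot_mean` is a hypothetical helper that aggregates with groupby (mean, which is pivot_table's default) and then places the aggregated values into a `scipy.sparse` matrix. Note that missing cells come out as implicit zeros rather than NaN, so a real mean of 0 cannot be told apart from a missing cell.

```python
import pandas as pd
from scipy import sparse


def sparse_pivot_mean(df, index, columns, values):
    """Pivot df into a CSR matrix of per-cell means.

    Returns (matrix, row_labels, col_labels); matrix[i, j] is the mean of
    `values` where df[index] == row_labels[i] and df[columns] == col_labels[j].
    """
    # One value per (index, columns) pair, so no duplicate coordinates remain
    agg = df.groupby([index, columns], sort=True)[values].mean()
    row_codes, row_labels = pd.factorize(agg.index.get_level_values(index), sort=True)
    col_codes, col_labels = pd.factorize(agg.index.get_level_values(columns), sort=True)
    matrix = sparse.coo_matrix(
        (agg.to_numpy(), (row_codes, col_codes)),
        shape=(len(row_labels), len(col_labels)),
    ).tocsr()
    return matrix, row_labels, col_labels
```

For the frames in this issue the call would be `sparse_pivot_mean(df, 'A', 'B', 'C')`. Memory then scales with the number of distinct (A, B) pairs instead of with the full rows-by-columns grid.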
Comment From: joshlk
OK, thanks. A memory error would be more useful and less confusing.
Comment From: jreback
see #10554
Comment From: jreback
Though the second error might be somewhat related to the hashtable implementation.
Comment From: jreback
cc @behzadnouri cc @sinhrks
IIRC you looked at some of this code before
Comment From: usnishmukherjee
@jreback
This IndexError still exists, right?
I am getting the same error when calling df.pivot_table().
Here are the details of my data frame:
0 userId int64
1 movieId int64
2 rating float64
3 timestamp int64
dtypes: float64(1), int64(3)
memory usage: 762.9 MB
Error:
IndexError: index 1007624404 is out of bounds for axis 0 with size 1007623835
Please let me know if there is any update.