Code Sample
In [29]: df = pd.DataFrame([[1, 2, 11111111111111111]], columns=['index', 'type', 'value'])

In [30]: df.dtypes
Out[30]:
index    int64
type     int64
value    int64
dtype: object

In [31]: df.pivot_table(index='index', columns='type', values='value')
Out[31]:
type                  2
index
1     11111111111111112
Problem description
Since value is a 64-bit integer, we should be able to represent it with zero precision loss.
Expected Output
In [31]: df.pivot_table(index='index', columns='type', values='value')
Out[31]:
type                  2
index
1     11111111111111111
Output of pd.show_versions()
Comment From: jreback
This is passing through float:

In [5]: np.int64(11111111111111111).astype(float).astype('int64')
Out[5]: 11111111111111112

Very similar to this:

In [7]: df.groupby('index').value.mean()
Out[7]:
index
1    11111111111111112
Name: value, dtype: int64

So even though it's cast back, it will lose precision. Not really sure of a great way to prevent this and simultaneously deal with overflow.
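To make the mechanism concrete, here is a small sketch (not part of the original thread) of why the round trip loses precision: float64 carries a 53-bit significand, so only integers up to 2**53 survive a trip through float exactly.

```python
import numpy as np

# float64 has a 53-bit significand, so integers above 2**53 can no
# longer all be represented exactly.
x = np.int64(11111111111111111)        # ~2**53.3, beyond the exact range
roundtrip = np.int64(np.float64(x))
print(roundtrip)                       # 11111111111111112

# The boundary sits exactly at 2**53:
assert int(np.float64(2**53)) == 2**53        # still exact
assert int(np.float64(2**53 + 1)) == 2**53    # first value that rounds away
```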
Comment From: jorisvandenbossche
You can use pivot instead of pivot_table for now to circumvent this:
In [6]: df.pivot(index='index', columns='type', values='value')
Out[6]:
type                  2
index
1     11111111111111111
The thing is that pivot_table takes a mean, and does not just represent the original value, so this is the reason you see the precision error.
In [10]: df['value'].mean()
Out[10]: 1.1111111111111112e+16
So I'm not sure there is something to do about it.
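If an exact integer aggregation is really needed, one workaround (a sketch, not a pandas feature) is to aggregate with Python's arbitrary-precision ints before reshaping, so nothing passes through float64:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 11111111111111111]], columns=['index', 'type', 'value'])

# Python ints are arbitrary precision, so summing them never overflows
# or rounds; integer division then gives an exact floor-mean.
exact_sum = sum(int(v) for v in df['value'])
exact_mean = exact_sum // len(df)
print(exact_mean)   # 11111111111111111
```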
Comment From: jreback
xref to https://github.com/pandas-dev/pandas/issues/3707
Comment From: jorisvandenbossche
@jreback Do we have a general issue about the precision of int64 algos like mean? As this is not specific to pivot_table, I would close this. Although it is specific to pivot_table/groupby that we cast back to int instead of leaving it as a float (which masks the reason for the precision error), but that is covered here: https://github.com/pandas-dev/pandas/issues/11199#issuecomment-156911364
Comment From: jreback
oh that whole issue is better anyhow #11199
Comment From: mstanichenko
"You can use pivot instead of pivot_table for now"

Unfortunately, pivot is not able to handle a multiindex. I deliberately simplified the example to present the point.
Comment From: jorisvandenbossche
@mstanichenko pivot is specifically defined not to handle aggregations (it is a pure reshaping function), but otherwise there should not be more restrictions in comparison to pivot_table (or otherwise this may be a bug). Can you give a small example of how you want to use multiindex? (if it is still without aggregation, of course)
Comment From: mstanichenko
@jorisvandenbossche, I don't really need a multiindex. You are right that I just need to reshape a dataframe. Unfortunately, pivot didn't work for me.

"Can you give a small example of how you want to use multiindex?"

I have a set of components and subcomponents which regularly take timestamps. The timestamps can be of various types. Therefore the dataframe looks like the following.
In [41]: df_ts.columns
Out[41]: Index(['component_id', 'subcomponent_id', 'timestamp_type', 'timestamp_value'], dtype='object')
All I want to do is

df = df_ts.pivot(index=['component_id', 'subcomponent_id'], columns='timestamp_type', values='timestamp_value').reset_index()
Comment From: jorisvandenbossche
So with a dummy example, something like this?
In [73]: df = pd.DataFrame({'A':[0,0,0,1,1,1], 'B':[0,1,0,1,0,1], 'C':[0,0,1,1,1,0], 'D':[1,2,3,4,5,6]})
In [74]: df
Out[74]:
   A  B  C  D
0  0  0  0  1
1  0  1  0  2
2  0  0  1  3
3  1  1  1  4
4  1  0  1  5
5  1  1  0  6
In [75]: df.pivot(index=['A', 'B'], columns='C', values='D')
...
ValueError: Wrong number of items passed 6, placement implies 2
In [76]: df.pivot_table(index=['A', 'B'], columns='C', values='D')
Out[76]:
C      0    1
A B
0 0  1.0  3.0
  1  2.0  NaN
1 0  NaN  5.0
  1  6.0  4.0
where you would expect the output of pivot to be the same as pivot_table?
Comment From: mstanichenko
@jorisvandenbossche, sorry, but I lost the point. Your dummy example looks fine to me. I need the output that pivot_table gives us:
In [48]: df = pd.DataFrame({'A':[0,0,0,1,1,1], 'B':[0,1,0,1,0,1], 'C':[0,0,1,1,1,0], 'D':[1,2,3,4,5,6]})
In [49]: df.pivot_table(index=['A', 'B'], columns='C', values='D').reset_index()
Out[49]:
C  A  B    0    1
0  0  0  1.0  3.0
1  0  1  2.0  NaN
2  1  0  NaN  5.0
3  1  1  6.0  4.0
I never claimed that pivot and pivot_table are the same. I just don't understand how one can come up with the desired output with pivot.
Comment From: jorisvandenbossche
My point is that pivot should (maybe) also be able to do this (so give the same result as pivot_table in this case). As my original comment said: if you don't do any aggregation but a pure reshape, you can use pivot instead of pivot_table (which didn't work for you because of the above).

BTW, you can mimic pivot with a combination of set_index and unstack:
In [95]: df.set_index(['index', 'type']).unstack()
Out[95]:
                  value
type                  2
index
1     11111111111111111
But whether that will help against the precision loss depends on whether there are NaNs in the result or not (as the column is cast to float anyway in that case).
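To illustrate that caveat with a contrived sketch (assuming one missing index/column combination): as soon as the reshaped result has holes, the column becomes float64 and the large integer is rounded.

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 1], 'B': [0, 1],
                   'D': [11111111111111111, 5]})
wide = df.set_index(['A', 'B'])['D'].unstack()

# The (0, 1) and (1, 0) cells are missing, so both columns are cast to
# float64 to hold NaN, and the big integer rounds to the nearest float.
print(wide.dtypes.unique())   # [dtype('float64')]
print(int(wide.loc[0, 0]))    # 11111111111111112
```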
Comment From: mstanichenko
Could you please provide a dataframe for which pivot would work? Although I got your point that pivot just reshapes the data, it does not even work for the simplest case:
In [53]: df = pd.DataFrame({'A':[0], 'B':[0], 'C':[1], 'D':[2]})
In [54]: df
Out[54]:
   A  B  C  D
0  0  0  1  2
In [55]: df.pivot(index=['A', 'B'], columns='C', values='D')
IndexError: index 1 is out of bounds for size 1
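As an aside for anyone reading this thread later: on recent pandas versions (1.1 and up, which is an assumption about your environment), pivot accepts list-likes for index/columns, so the call from [55]-style examples works as a pure reshape:

```python
import pandas as pd

# Assumes pandas >= 1.1, where pivot() accepts list-likes for index/columns.
df = pd.DataFrame({'A': [0, 0, 1, 1], 'B': [0, 1, 0, 1],
                   'C': [0, 1, 1, 0], 'D': [1, 2, 3, 4]})
wide = df.pivot(index=['A', 'B'], columns='C', values='D')

# Each (A, B) pair appears once per C value, so this is a pure reshape;
# the missing combinations still force a cast to float64.
print(wide.loc[(0, 0), 0])   # 1.0
```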
Comment From: jreback
In [28]: df.set_index(['A', 'B', 'C']).unstack()
Out[28]:
     D
C    1
A B
0 0  2

What would you expect [55] to yield?
Comment From: mstanichenko
@jreback, I expect [55] to give the output [28] of your example.
@jreback, @jorisvandenbossche, thank you very much for the useful tips. I managed to mimic pivot_table behavior without mangling the underlying data. Nevertheless, it's rather difficult to spot that the data were actually changed by pandas, as there is no warning that might catch one's attention. In the meantime I managed to bump into another (although very similar) issue (#8596).

From my limited experience, data types are the most frequent source of issues and the most annoying thing for me :-)

Again, thank you very much for your assistance :-)