Code Sample
In [29]: df = pd.DataFrame([[1, 2, 11111111111111111]], columns=['index', 'type', 'value'])

In [30]: df.dtypes
Out[30]:
index    int64
type     int64
value    int64
dtype: object

In [31]: df.pivot_table(index='index', columns='type', values='value')
Out[31]:
type                  2
index
1     11111111111111112
Problem description
Since value is a 64-bit integer, we should be able to represent it with zero precision loss.
Expected Output
In [31]: df.pivot_table(index='index', columns='type', values='value')
Out[31]:
type                  2
index
1     11111111111111111
Output of pd.show_versions()
Comment From: jreback
This is passing through float:

In [5]: np.int64(11111111111111111).astype(float).astype('int64')
Out[5]: 11111111111111112

Very similar to this:

In [7]: df.groupby('index').value.mean()
Out[7]:
index
1    11111111111111112
Name: value, dtype: int64

So even though it's cast back, it will lose precision. Not really sure of a great way to prevent this and simultaneously deal with overflow.
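To make the mechanism concrete, here is a small sketch (not part of the original thread) of why the round trip loses precision: float64 carries a 53-bit significand, so only integers up to 2**53 survive a trip through float exactly.

```python
import numpy as np

# float64 has a 53-bit significand, so integers above 2**53 can no
# longer all be represented exactly.
x = np.int64(11111111111111111)        # ~2**53.3, beyond the exact range
roundtrip = np.int64(np.float64(x))
print(roundtrip)                       # 11111111111111112

# The boundary sits exactly at 2**53:
assert int(np.float64(2**53)) == 2**53        # still exact
assert int(np.float64(2**53 + 1)) == 2**53    # first value that rounds away
```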
Comment From: jorisvandenbossche
You can use pivot instead of pivot_table for now to circumvent this:
In [6]: df.pivot(index='index', columns='type', values='value')
Out[6]:
type                  2
index
1     11111111111111111
The thing is that pivot_table takes a mean, and does not just represent the original value, so this is the reason you see the precision error.
In [10]: df['value'].mean()
Out[10]: 1.1111111111111112e+16
So I'm not sure there is something to do about it.
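If an exact integer aggregation is really needed, one workaround (a sketch, not a pandas feature) is to aggregate with Python's arbitrary-precision ints before reshaping, so nothing passes through float64:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 11111111111111111]], columns=['index', 'type', 'value'])

# Python ints are arbitrary precision, so summing them never overflows
# or rounds; integer division then gives an exact floor-mean.
exact_sum = sum(int(v) for v in df['value'])
exact_mean = exact_sum // len(df)
print(exact_mean)   # 11111111111111111
```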
Comment From: jreback
xref to https://github.com/pandas-dev/pandas/issues/3707
Comment From: jorisvandenbossche
@jreback Do we have a general issue about the precision of int64 algos like mean? As this is not specific to pivot_table, I would close this. Although it is specific to pivot_table/groupby that we cast back to int instead of leaving it as a float (which masks the reason for the precision error), but that is covered here: https://github.com/pandas-dev/pandas/issues/11199#issuecomment-156911364
Comment From: jreback
oh that whole issue is better anyhow #11199
Comment From: mstanichenko
"You can use pivot instead of pivot_table for now"

Unfortunately, pivot is not able to handle a multiindex. I deliberately simplified the example to present the point.
Comment From: jorisvandenbossche
@mstanichenko pivot is specifically defined not to handle aggregations (it is a pure reshaping function), but otherwise there should not be more restrictions in comparison to pivot_table (or otherwise this may be a bug). Can you give a small example of how you want to use multiindex? (if it is still without aggregation, of course)
Comment From: mstanichenko
@jorisvandenbossche, I don't really need a multiindex. You are right that I just need to reshape a dataframe. Unfortunately, pivot didn't work for me.

"Can you give a small example of how you want to use multiindex?"

I have a set of components and subcomponents which regularly take timestamps. The timestamps can be of various types. Therefore the dataframe looks like the following.
In [41]: df_ts.columns
Out[41]: Index(['component_id', 'subcomponent_id', 'timestamp_type', 'timestamp_value'], dtype='object')
All I want to do is

df = df_ts.pivot(index=['component_id', 'subcomponent_id'], columns='timestamp_type', values='timestamp_value').reset_index()
Comment From: jorisvandenbossche
So with a dummy example, something like this?
In [73]: df = pd.DataFrame({'A':[0,0,0,1,1,1], 'B':[0,1,0,1,0,1], 'C':[0,0,1,1,1,0], 'D':[1,2,3,4,5,6]})
In [74]: df
Out[74]:
   A  B  C  D
0  0  0  0  1
1  0  1  0  2
2  0  0  1  3
3  1  1  1  4
4  1  0  1  5
5  1  1  0  6
In [75]: df.pivot(index=['A', 'B'], columns='C', values='D')
...
ValueError: Wrong number of items passed 6, placement implies 2
In [76]: df.pivot_table(index=['A', 'B'], columns='C', values='D')
Out[76]:
C      0    1
A B
0 0  1.0  3.0
  1  2.0  NaN
1 0  NaN  5.0
  1  6.0  4.0
where you would expect the output of pivot to be the same as pivot_table?
Comment From: mstanichenko
@jorisvandenbossche, sorry, but I lost the point. Your dummy example looks fine to me. I need the output that pivot_table gives us:
In [48]: df = pd.DataFrame({'A':[0,0,0,1,1,1], 'B':[0,1,0,1,0,1], 'C':[0,0,1,1,1,0], 'D':[1,2,3,4,5,6]})
In [49]: df.pivot_table(index=['A', 'B'], columns='C', values='D').reset_index()
Out[49]:
C  A  B    0    1
0  0  0  1.0  3.0
1  0  1  2.0  NaN
2  1  0  NaN  5.0
3  1  1  6.0  4.0
I never claimed that pivot and pivot_table are the same. I just don't understand how one can come up with the desired output with pivot.
Comment From: jorisvandenbossche
My point is that pivot should (maybe) also be able to do this (so give the same result as pivot_table in this case). As my original comment said: if you don't do any aggregation but a pure reshape, you can use pivot instead of pivot_table (which didn't work for you because of the above).

BTW, you can mimic pivot with a combination of set_index and unstack:
In [95]: df.set_index(['index', 'type']).unstack()
Out[95]:
                  value
type                  2
index
1     11111111111111111
But whether that will help against the precision loss depends on whether there are NaNs in the result or not (as the column is cast to float anyway in that case).
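To illustrate that caveat with a contrived sketch (assuming one missing index/column combination): as soon as the reshaped result has holes, the column becomes float64 and the large integer is rounded.

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 1], 'B': [0, 1],
                   'D': [11111111111111111, 5]})
wide = df.set_index(['A', 'B'])['D'].unstack()

# The (0, 1) and (1, 0) cells are missing, so both columns are cast to
# float64 to hold NaN, and the big integer rounds to the nearest float.
print(wide.dtypes.unique())   # [dtype('float64')]
print(int(wide.loc[0, 0]))    # 11111111111111112
```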
Comment From: mstanichenko
Could you please provide a dataframe for which pivot would work? Although I got your point that pivot just reshapes the data, it does not even work for the simplest case:
In [53]: df = pd.DataFrame({'A':[0], 'B':[0], 'C':[1], 'D':[2]})
In [54]: df
Out[54]:
   A  B  C  D
0  0  0  1  2
In [55]: df.pivot(index=['A', 'B'], columns='C', values='D')
IndexError: index 1 is out of bounds for size 1
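As an aside for anyone reading this thread later: on recent pandas versions (1.1 and up, which is an assumption about your environment), pivot accepts list-likes for index/columns, so the call from [55]-style examples works as a pure reshape:

```python
import pandas as pd

# Assumes pandas >= 1.1, where pivot() accepts list-likes for index/columns.
df = pd.DataFrame({'A': [0, 0, 1, 1], 'B': [0, 1, 0, 1],
                   'C': [0, 1, 1, 0], 'D': [1, 2, 3, 4]})
wide = df.pivot(index=['A', 'B'], columns='C', values='D')

# Each (A, B) pair appears once per C value, so this is a pure reshape;
# the missing combinations still force a cast to float64.
print(wide.loc[(0, 0), 0])   # 1.0
```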
Comment From: jreback
In [28]: df.set_index(['A', 'B', 'C']).unstack()
Out[28]:
     D
C    1
A B
0 0  2

What would you expect [55] to yield?
Comment From: mstanichenko
@jreback, I expect [55] to give the output [28] of your example.
@jreback, @jorisvandenbossche, thank you very much for the useful tips. I managed to mimic pivot_table behavior without mangling the underlying data. Nevertheless, it's rather difficult to spot that the data were actually changed by pandas, as there is no warning that might catch one's attention. In the meantime I managed to bump into another (although very similar) issue (#8596).

From my limited experience, data types are the most frequent source of issues and the most annoying thing for me :-)

Again, thank you very much for your assistance :-)