Code Sample, a copy-pastable example if possible
import pandas as pd

df1 = pd.DataFrame({"A": pd.SparseArray([0, 0, 0]), "B": [1, 2, 3]})
# df1_filtered will have NAs in column A
df1_filtered = df1.loc[df1["B"] != 2]

df2 = pd.DataFrame({"A": pd.SparseArray([0, 1, 0]), "B": [1, 2, 3]})
# df2_filtered has no NAs in column A
df2_filtered = df2.loc[df2["B"] != 2]
where df1_filtered will look like

     A  B
0  NaN  1
2  NaN  3

and df2_filtered like

   A  B
0  0  1
2  0  3
Problem description
Filtering a dataframe with an all-zero sparse column can lead to NAs in the sparse column.
Expected Output
Both data frames should be the same, as filtering a dataframe with non-missing data should not lead to missing data.
Output of pd.show_versions()
Comment From: jorisvandenbossche
@stelsemeyer Thanks for the report! This seems to be a regression compared to 0.23 (compared to SparseSeries).
This is a bug in the SparseArray.take implementation: when allow_fill=True, it uses the wrong fill value (NaN instead of 0):
In [3]: df1.A.array.take([0, 2])
Out[3]:
[0, 0]
Fill: 0
BlockIndex
Block locations: array([], dtype=int32)
Block lengths: array([], dtype=int32)
In [4]: df1.A.array.take([0, 2], allow_fill=True)
Out[4]:
[nan, nan]
Fill: 0
IntIndex
Indices: array([0, 1], dtype=int32)
(Both of the above should give the same result.)
You're always welcome to take a look to see how it could be fixed!
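As a side note, passing the sparse fill value explicitly makes the two calls agree even on affected versions; this is a small sketch of mine, not from the thread (pd.arrays.SparseArray is the current spelling of pd.SparseArray):

```python
import pandas as pd

# Same all-zero sparse column as df1["A"] in the report
arr = pd.arrays.SparseArray([0, 0, 0])

# take without allow_fill keeps the sparse fill value ...
a = arr.take([0, 2])

# ... and passing fill_value explicitly sidesteps the NaN default
# that affected versions used when allow_fill=True.
b = arr.take([0, 2], allow_fill=True, fill_value=0)

print(list(a.to_dense()), list(b.to_dense()))
```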
Comment From: stelsemeyer
@jorisvandenbossche: Thanks for investigating!
I checked the SparseArray; a naive solution would be to use self.fill_value if fill_value is None in _take_with_fill, here: https://github.com/pandas-dev/pandas/blob/d1accd032b648c9affd6dce1f81feb9c99422483/pandas/core/arrays/sparse.py#L1173-L1174
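Sketched from the outside, the proposal amounts to defaulting the fill value to the array's own fill_value before delegating to take; take_with_array_fill below is a hypothetical helper of mine, not pandas API:

```python
import pandas as pd

def take_with_array_fill(arr, indices, allow_fill=True, fill_value=None):
    """Hypothetical helper mirroring the proposed _take_with_fill change:
    if no fill_value is given, fall back to the array's own fill_value
    instead of NaN."""
    if fill_value is None:
        fill_value = arr.fill_value
    return arr.take(indices, allow_fill=allow_fill, fill_value=fill_value)

arr = pd.arrays.SparseArray([0, 0, 0])
result = take_with_array_fill(arr, [0, 2])
print(result.to_dense())  # all zeros, no NaN
```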
Comment From: TomAugspurger
Removing from the 0.25.1 milestone, but if anyone is working on this LMK and we can probably get it in.
@stelsemeyer your proposal looks reasonable.
Comment From: scottgigante
I think this is a related issue:
>>> import pandas as pd
>>> X = pd.DataFrame([[0, 1, 0], [1, 0, 0], [1, 1, 0]]).astype(pd.SparseDtype(float, fill_value=0.0))
>>> X
     0    1    2
0  0.0  1.0  0.0
1  1.0  0.0  0.0
2  1.0  1.0  0.0
>>> X.loc[0]
0    0.0
1    1.0
2    0.0
Name: 0, dtype: Sparse[float64, 0.0]
>>> X.loc[[0, 1]]
     0    1   2
0  0.0  1.0 NaN
1  1.0  0.0 NaN
>>> X.iloc[[0, 1]]
     0    1   2
0  0.0  1.0 NaN
1  1.0  0.0 NaN
Comment From: scottgigante
I edited the proposed line, but to no avail. The error in @jorisvandenbossche's answer is resolved:
>>> import pandas as pd
>>> df1 = pd.DataFrame({"A": pd.SparseArray([0, 0, 0]), 'B': [1,2,3]})
>>> df1.A.array.take([0, 2])
[0, 0]
Fill: 0
BlockIndex
Block locations: array([], dtype=int32)
Block lengths: array([], dtype=int32)
>>> df1.A.array.take([0, 2], allow_fill=True)
[0, 0]
Fill: 0
IntIndex
Indices: array([], dtype=int32)
but my and @stelsemeyer's issues remain.
>>> df1.loc[df1['B'] != 2]
     A  B
0  NaN  1
2  NaN  3
>>>
>>> X = pd.DataFrame([[0, 1, 0], [1, 0, 0], [1, 1, 0]]).astype(pd.SparseDtype(float, fill_value=0.0))
>>> X.loc[[0, 1]]
     0    1   2
0  0.0  1.0 NaN
1  1.0  0.0 NaN
Seems to me that there is another problem here:
https://github.com/pandas-dev/pandas/blob/a45760fd45b434caf9107bb19f1536636cc3fbd8/pandas/core/internals/managers.py#L1262
>>> import numpy as np
>>> blk = X._data.blocks[2]
>>> blk.take_nd(indexer=np.array([0, 1]), axis=1).values
[0.0, 0.0]
Fill: 0.0
IntIndex
Indices: array([], dtype=int32)
>>> blk.take_nd(indexer=np.array([0,1]), axis=1, fill_tuple=(blk.fill_value,)).values
[nan, nan]
Fill: 0.0
IntIndex
Indices: array([0, 1], dtype=int32)
which is because of a discrepancy between blk.fill_value and blk.dtype.fill_value:

>>> blk.fill_value
nan
>>> blk.dtype.fill_value
0.0

I don't know if we should a) reference blk.dtype.fill_value at the call site, or b) make blk.fill_value consistent with blk.dtype.fill_value.
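The same discrepancy is visible from the public dtype alone, without touching block internals (a minimal sketch of mine, assuming a current pandas version):

```python
import numpy as np
import pandas as pd

dtype = pd.SparseDtype(float, fill_value=0.0)

# The sparse dtype's own fill value is 0.0 ...
print(dtype.fill_value)

# ... while the generic ExtensionDtype na_value (what the block
# property returns via self.values.dtype.na_value) is NaN.
print(dtype.na_value, np.isnan(dtype.na_value))
```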
Comment From: scottgigante
@TomAugspurger any thoughts on this? I'm happy to write the PR, just need some guidance.
Comment From: TomAugspurger
Mmm, I'm not sure I understand the issue. But note that doing a .take which introduces missing values via a -1 in the indices should result in a NaN in the output, regardless of the fill value. Not sure if that helps or not.
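That semantic can be illustrated directly (a sketch of mine, assuming a pandas version where take behaves as intended): a -1 in the indexer with allow_fill=True marks a missing position, which comes back as NaN even though the sparse fill value is 0.0.

```python
import numpy as np
import pandas as pd

arr = pd.arrays.SparseArray([0.0, 1.0, 0.0], fill_value=0.0)

# -1 with allow_fill=True means "missing", so position 1 is NaN
# regardless of the sparse fill_value ...
taken = arr.take([0, -1], allow_fill=True)
print(taken.to_dense())

# ... whereas without allow_fill, -1 is ordinary end-relative indexing
# and returns the last element (0.0).
print(arr.take([0, -1]).to_dense())
```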
Comment From: scottgigante
Can you give an example of how/why that would happen? I don't quite understand how we should get a NaN when the fill_value is not NaN.
Comment From: scottgigante
Here's my proposed solution: replace

@property
def fill_value(self):
    # Used in reindex_indexer
    return self.values.dtype.na_value

from https://github.com/pandas-dev/pandas/blob/master/pandas/core/internals/blocks.py#L1730 with

@property
def fill_value(self):
    # Used in reindex_indexer
    try:
        return self.values.dtype.fill_value
    except AttributeError:
        return self.values.dtype.na_value

Thoughts?
Comment From: akdor1154
Pretty sure my dataset is showing this. Seems to apply to some columns but not all. strange_test.pickle.gz
If anyone wants a large real-world set to test this on:
import pandas as pd
test = pd.read_pickle('strange_test.pickle.gz')
any(test.DistinctSKUs_SEPB.isna())
> False
any(test.loc[lambda _: _.IsBTS].DistinctSKUs_SEPB.isna())
> True # !?!
Comment From: scottgigante
@akdor1154 can you try this monkey patch and see if it solves your issue?
def fill_value(self):
    # Used in reindex_indexer
    try:
        return self.values.dtype.fill_value
    except AttributeError:
        return self.values.dtype.na_value

from pandas.core.internals.blocks import ExtensionBlock
setattr(ExtensionBlock, "fill_value", property(fill_value))
Comment From: mroeschke
As mentioned https://github.com/pandas-dev/pandas/issues/29321#issuecomment-551251530, it may be an issue as well when the sparse series matches the fill value
Comment From: scottgigante
Pretty sure my monkey patch works. I can write a PR if I can get approval from @TomAugspurger or @jorisvandenbossche
>>> import pandas as pd
>>> df1 = pd.DataFrame({"A": pd.arrays.SparseArray([0, 0, 0]), 'B': [1, 2, 3]})
>>> df1.loc[df1['B'] != 2]
     A  B
0  NaN  1
2  NaN  3
>>> def fill_value(self):
...     # Used in reindex_indexer
...     try:
...         return self.values.dtype.fill_value
...     except AttributeError:
...         return self.values.dtype.na_value
...
>>> from pandas.core.internals.blocks import ExtensionBlock
>>>
>>> setattr(ExtensionBlock, "fill_value", property(fill_value))
>>> df1.loc[df1['B'] != 2]
   A  B
0  0  1
2  0  3
>>> import pandas as pd
>>> X = pd.DataFrame([[0, 1, 0], [1, 0, 0], [1, 1, 0]]).astype(
...     pd.SparseDtype(float, fill_value=0.0))
>>> X.loc[[0, 1]]
     0    1   2
0  0.0  1.0 NaN
1  1.0  0.0 NaN
>>> def fill_value(self):
...     # Used in reindex_indexer
...     try:
...         return self.values.dtype.fill_value
...     except AttributeError:
...         return self.values.dtype.na_value
...
>>> from pandas.core.internals.blocks import ExtensionBlock
>>>
>>> setattr(ExtensionBlock, "fill_value", property(fill_value))
>>> X.loc[[0, 1]]
     0    1    2
0  0.0  1.0  0.0
1  1.0  0.0  0.0
>>> import pandas as pd
>>> test = pd.read_pickle('strange_test.pickle.gz')
>>> any(test.DistinctSKUs_SEPB.isna())
False
>>> any(test.loc[lambda _: _.IsBTS].DistinctSKUs_SEPB.isna())
True
>>> def fill_value(self):
...     # Used in reindex_indexer
...     try:
...         return self.values.dtype.fill_value
...     except AttributeError:
...         return self.values.dtype.na_value
...
>>> from pandas.core.internals.blocks import ExtensionBlock
>>>
>>> setattr(ExtensionBlock, "fill_value", property(fill_value))
>>> any(test.loc[lambda _: _.IsBTS].DistinctSKUs_SEPB.isna())
False
Comment From: akdor1154
@scottgigante sorry, from memory I tested it at the time and it worked, thanks.
Comment From: connesy
@scottgigante Just tested your monkey patch; it works for me in pandas 1.0.2.
Comment From: TomAugspurger
The fix here is being reverted in https://github.com/pandas-dev/pandas/pull/35287. Some discussion of a potential fix at https://github.com/pandas-dev/pandas/issues/35286#issuecomment-658788801.
cc @scottgigante if you want to take another shot :)
Comment From: mroeschke
This looks fixed on master. Could use a test.

In [4]: df1_filtered
Out[4]:
   A  B
0  0  1
2  0  3

In [5]: df2_filtered
Out[5]:
   A  B
0  0  1
2  0  3
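A sketch of what such a regression test might look like (the test name is mine, not the one eventually merged): filtering must leave no NaNs in the sparse column, and the two frames from the original report must agree.

```python
import pandas as pd
import pandas.testing as tm

def test_loc_filter_keeps_sparse_fill_value():
    # Regression test for this issue: filtering a frame with an
    # all-fill-value sparse column must not introduce NaNs.
    df1 = pd.DataFrame({"A": pd.arrays.SparseArray([0, 0, 0]), "B": [1, 2, 3]})
    df2 = pd.DataFrame({"A": pd.arrays.SparseArray([0, 1, 0]), "B": [1, 2, 3]})

    result1 = df1.loc[df1["B"] != 2]
    result2 = df2.loc[df2["B"] != 2]

    assert not result1["A"].isna().any()
    tm.assert_frame_equal(result1, result2)

test_loc_filter_keeps_sparse_fill_value()
```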
Comment From: EnerH
What about this:

df1_filtered['A'] = df1_filtered['A'].fillna(0)

Similarly, if column 'A' of df2_filtered contained NaN values, you could change them to 0 with the same method:

df2_filtered['A'] = df2_filtered['A'].fillna(0)
Comment From: ConnorMcKinley
Take