Code Sample, a copy-pastable example if possible
import pandas as pd

df1 = pd.DataFrame({"A": pd.SparseArray([0, 0, 0]), "B": [1, 2, 3]})
# df1_filtered will have NAs in column A
df1_filtered = df1.loc[df1["B"] != 2]

df2 = pd.DataFrame({"A": pd.SparseArray([0, 1, 0]), "B": [1, 2, 3]})
# df2_filtered has no NAs in column A
df2_filtered = df2.loc[df2["B"] != 2]
where df1_filtered will look like

     A  B
0  NaN  1
2  NaN  3

and df2_filtered like

   A  B
0  0  1
2  0  3
Problem description
Filtering a dataframe with an all-zero sparse column can lead to NAs in the sparse column.
Expected Output
Both data frames should be the same, as filtering a dataframe with non-missing data should not lead to missing data.
Output of pd.show_versions()
Comment From: jorisvandenbossche
@stelsemeyer Thanks for the report! This seems to be a regression compared to 0.23 (compared to SparseSeries).
This is a bug in the SparseArray.take implementation: when allow_fill=True, it uses the wrong fill value (NaN instead of 0):
In [3]: df1.A.array.take([0, 2])
Out[3]:
[0, 0]
Fill: 0
BlockIndex
Block locations: array([], dtype=int32)
Block lengths: array([], dtype=int32)
In [4]: df1.A.array.take([0, 2], allow_fill=True)
Out[4]:
[nan, nan]
Fill: 0
IntIndex
Indices: array([0, 1], dtype=int32)
(Both of the above should give the same result.)
You're always welcome to take a look to see how it could be fixed!
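As a side note, passing the sparse fill value explicitly makes the two calls agree even on affected versions; this is a small sketch of mine, not from the thread (pd.arrays.SparseArray is the current spelling of pd.SparseArray):

```python
import pandas as pd

# Same all-zero sparse column as df1["A"] in the report
arr = pd.arrays.SparseArray([0, 0, 0])

# take without allow_fill keeps the sparse fill value ...
a = arr.take([0, 2])

# ... and passing fill_value explicitly sidesteps the NaN default
# that affected versions used when allow_fill=True.
b = arr.take([0, 2], allow_fill=True, fill_value=0)

print(list(a.to_dense()), list(b.to_dense()))
```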
Comment From: stelsemeyer
@jorisvandenbossche: Thanks for investigating!
I checked the SparseArray; a naive solution would be to use self.fill_value if fill_value is None in _take_with_fill, here: https://github.com/pandas-dev/pandas/blob/d1accd032b648c9affd6dce1f81feb9c99422483/pandas/core/arrays/sparse.py#L1173-L1174
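Sketched from the outside, the proposal amounts to defaulting the fill value to the array's own fill_value before delegating to take; take_with_array_fill below is a hypothetical helper of mine, not pandas API:

```python
import pandas as pd

def take_with_array_fill(arr, indices, allow_fill=True, fill_value=None):
    """Hypothetical helper mirroring the proposed _take_with_fill change:
    if no fill_value is given, fall back to the array's own fill_value
    instead of NaN."""
    if fill_value is None:
        fill_value = arr.fill_value
    return arr.take(indices, allow_fill=allow_fill, fill_value=fill_value)

arr = pd.arrays.SparseArray([0, 0, 0])
result = take_with_array_fill(arr, [0, 2])
print(result.to_dense())  # all zeros, no NaN
```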
Comment From: TomAugspurger
Removing from the 0.25.1 milestone, but if anyone is working on this LMK and we can probably get it in.
@stelsemeyer your proposal looks reasonable.
Comment From: scottgigante
I think this is a related issue:
>>> import pandas as pd
>>> X = pd.DataFrame([[0, 1, 0], [1, 0, 0], [1, 1, 0]]).astype(pd.SparseDtype(float, fill_value=0.0))
>>> X
     0    1    2
0  0.0  1.0  0.0
1  1.0  0.0  0.0
2  1.0  1.0  0.0
>>> X.loc[0]
0    0.0
1    1.0
2    0.0
Name: 0, dtype: Sparse[float64, 0.0]
>>> X.loc[[0, 1]]
     0    1   2
0  0.0  1.0 NaN
1  1.0  0.0 NaN
>>> X.iloc[[0, 1]]
     0    1   2
0  0.0  1.0 NaN
1  1.0  0.0 NaN
Comment From: scottgigante
I edited the proposed line, but to no avail. The error in @jorisvandenbossche's answer is resolved:
>>> import pandas as pd
>>> df1 = pd.DataFrame({"A": pd.SparseArray([0, 0, 0]), 'B': [1,2,3]})
>>> df1.A.array.take([0, 2])
[0, 0]
Fill: 0
BlockIndex
Block locations: array([], dtype=int32)
Block lengths: array([], dtype=int32)
>>> df1.A.array.take([0, 2], allow_fill=True)
[0, 0]
Fill: 0
IntIndex
Indices: array([], dtype=int32)
but my and @stelsemeyer's issues remain.
>>> df1.loc[df1['B'] != 2]
     A  B
0  NaN  1
2  NaN  3
>>>
>>> X = pd.DataFrame([[0, 1, 0], [1, 0, 0], [1, 1, 0]]).astype(pd.SparseDtype(float, fill_value=0.0))
>>> X.loc[[0, 1]]
     0    1   2
0  0.0  1.0 NaN
1  1.0  0.0 NaN
Seems to me that there is another problem here:
https://github.com/pandas-dev/pandas/blob/a45760fd45b434caf9107bb19f1536636cc3fbd8/pandas/core/internals/managers.py#L1262
>>> import numpy as np
>>> blk = X._data.blocks[2]
>>> blk.take_nd(indexer=np.array([0, 1]), axis=1).values
[0.0, 0.0]
Fill: 0.0
IntIndex
Indices: array([], dtype=int32)
>>> blk.take_nd(indexer=np.array([0,1]), axis=1, fill_tuple=(blk.fill_value,)).values
[nan, nan]
Fill: 0.0
IntIndex
Indices: array([0, 1], dtype=int32)
which is because of a discrepancy between blk.fill_value and blk.dtype.fill_value:

>>> blk.fill_value
nan
>>> blk.dtype.fill_value
0.0

I don't know if we should a) reference blk.dtype.fill_value at the call site, or b) make blk.fill_value consistent with blk.dtype.fill_value.
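The same discrepancy is visible from the public dtype alone, without touching block internals (a minimal sketch of mine, assuming a current pandas version):

```python
import numpy as np
import pandas as pd

dtype = pd.SparseDtype(float, fill_value=0.0)

# The sparse dtype's own fill value is 0.0 ...
print(dtype.fill_value)

# ... while the generic ExtensionDtype na_value (what the block
# property returns via self.values.dtype.na_value) is NaN.
print(dtype.na_value, np.isnan(dtype.na_value))
```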
Comment From: scottgigante
@TomAugspurger any thoughts on this? I'm happy to write the PR, just need some guidance.
Comment From: TomAugspurger
Mmm, I'm not sure I understand the issue. But note that doing a .take which introduces missing values via a -1 in the indices should result in a NaN in the output, regardless of the fill value. Not sure if that helps or not.
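That semantic can be illustrated directly (a sketch of mine, assuming a pandas version where take behaves as intended): a -1 in the indexer with allow_fill=True marks a missing position, which comes back as NaN even though the sparse fill value is 0.0.

```python
import numpy as np
import pandas as pd

arr = pd.arrays.SparseArray([0.0, 1.0, 0.0], fill_value=0.0)

# -1 with allow_fill=True means "missing", so position 1 is NaN
# regardless of the sparse fill_value ...
taken = arr.take([0, -1], allow_fill=True)
print(taken.to_dense())

# ... whereas without allow_fill, -1 is ordinary end-relative indexing
# and returns the last element (0.0).
print(arr.take([0, -1]).to_dense())
```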
Comment From: scottgigante
Can you give an example of how/why that would happen? I don't quite understand how we should get a NaN when the fill_value is not NaN.
Comment From: scottgigante
Here's my proposed solution: replace

@property
def fill_value(self):
    # Used in reindex_indexer
    return self.values.dtype.na_value

from https://github.com/pandas-dev/pandas/blob/master/pandas/core/internals/blocks.py#L1730 with

@property
def fill_value(self):
    # Used in reindex_indexer
    try:
        return self.values.dtype.fill_value
    except AttributeError:
        return self.values.dtype.na_value

Thoughts?
Comment From: akdor1154
Pretty sure my dataset is showing this. Seems to apply to some columns but not all. strange_test.pickle.gz
If anyone wants a large real-world set to test this on:
import pandas as pd
test = pd.read_pickle('strange_test.pickle.gz')
any(test.DistinctSKUs_SEPB.isna())
> False
any(test.loc[lambda _: _.IsBTS].DistinctSKUs_SEPB.isna())
> True # !?!
Comment From: scottgigante
@akdor1154 can you try this monkey patch and see if it solves your issue?
def fill_value(self):
    # Used in reindex_indexer
    try:
        return self.values.dtype.fill_value
    except AttributeError:
        return self.values.dtype.na_value

from pandas.core.internals.blocks import ExtensionBlock
setattr(ExtensionBlock, "fill_value", property(fill_value))
Comment From: mroeschke
As mentioned https://github.com/pandas-dev/pandas/issues/29321#issuecomment-551251530, it may be an issue as well when the sparse series matches the fill value
Comment From: scottgigante
Pretty sure my monkey patch works. I can write a PR if I can get approval from @TomAugspurger or @jorisvandenbossche
>>> import pandas as pd
>>> df1 = pd.DataFrame({"A": pd.arrays.SparseArray([0, 0, 0]), 'B': [1, 2, 3]})
>>> df1.loc[df1['B'] != 2]
     A  B
0  NaN  1
2  NaN  3
>>> def fill_value(self):
...     # Used in reindex_indexer
...     try:
...         return self.values.dtype.fill_value
...     except AttributeError:
...         return self.values.dtype.na_value
...
>>> from pandas.core.internals.blocks import ExtensionBlock
>>>
>>> setattr(ExtensionBlock, "fill_value", property(fill_value))
>>> df1.loc[df1['B'] != 2]
   A  B
0  0  1
2  0  3
>>> import pandas as pd
>>> X = pd.DataFrame([[0, 1, 0], [1, 0, 0], [1, 1, 0]]).astype(
...     pd.SparseDtype(float, fill_value=0.0))
>>> X.loc[[0, 1]]
     0    1   2
0  0.0  1.0 NaN
1  1.0  0.0 NaN
>>> def fill_value(self):
...     # Used in reindex_indexer
...     try:
...         return self.values.dtype.fill_value
...     except AttributeError:
...         return self.values.dtype.na_value
...
>>> from pandas.core.internals.blocks import ExtensionBlock
>>>
>>> setattr(ExtensionBlock, "fill_value", property(fill_value))
>>> X.loc[[0, 1]]
     0    1    2
0  0.0  1.0  0.0
1  1.0  0.0  0.0
>>> import pandas as pd
>>> test = pd.read_pickle('strange_test.pickle.gz')
>>> any(test.DistinctSKUs_SEPB.isna())
False
>>> any(test.loc[lambda _: _.IsBTS].DistinctSKUs_SEPB.isna())
True
>>> def fill_value(self):
...     # Used in reindex_indexer
...     try:
...         return self.values.dtype.fill_value
...     except AttributeError:
...         return self.values.dtype.na_value
...
>>> from pandas.core.internals.blocks import ExtensionBlock
>>>
>>> setattr(ExtensionBlock, "fill_value", property(fill_value))
>>> any(test.loc[lambda _: _.IsBTS].DistinctSKUs_SEPB.isna())
False
Comment From: akdor1154
@scottgigante sorry, from memory I tested it at the time and it worked, thanks.
Comment From: connesy
@scottgigante Just tested your monkey patch; it works for me in pandas 1.0.2.
Comment From: TomAugspurger
The fix here is being reverted in https://github.com/pandas-dev/pandas/pull/35287. Some discussion of a potential fix at https://github.com/pandas-dev/pandas/issues/35286#issuecomment-658788801.
cc @scottgigante if you want to take another shot :)
Comment From: mroeschke
This looks fixed on master. Could use a test.

In [4]: df1_filtered
Out[4]:
   A  B
0  0  1
2  0  3

In [5]: df2_filtered
Out[5]:
   A  B
0  0  1
2  0  3
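A sketch of what such a regression test might look like (the test name is mine, not the one eventually merged): filtering must leave no NaNs in the sparse column, and the two frames from the original report must agree.

```python
import pandas as pd
import pandas.testing as tm

def test_loc_filter_keeps_sparse_fill_value():
    # Regression test for this issue: filtering a frame with an
    # all-fill-value sparse column must not introduce NaNs.
    df1 = pd.DataFrame({"A": pd.arrays.SparseArray([0, 0, 0]), "B": [1, 2, 3]})
    df2 = pd.DataFrame({"A": pd.arrays.SparseArray([0, 1, 0]), "B": [1, 2, 3]})

    result1 = df1.loc[df1["B"] != 2]
    result2 = df2.loc[df2["B"] != 2]

    assert not result1["A"].isna().any()
    tm.assert_frame_equal(result1, result2)

test_loc_filter_keeps_sparse_fill_value()
```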
Comment From: EnerH
What about this:

df1_filtered['A'] = df1_filtered['A'].fillna(0)

Similarly, if column 'A' of df2_filtered contained NaN values, you could change them to 0 with the same method:

df2_filtered['A'] = df2_filtered['A'].fillna(0)
Comment From: ConnorMcKinley
Take