Code Sample

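import timeit
import numpy
from pandas import DataFrame

df = DataFrame(numpy.zeros((10000, 10000)))
random_fill_df(df, num_elements=20)  # reporter's helper, not included in this report
df = df.to_sparse(fill_value=0)
timeit.timeit('df.loc[[23, 45, 65, 67],:]', globals=globals(), number=10)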

Problem description

Row slicing takes so long because a sparse dataframe is just a collection of sparse series, one per column. Column slicing is several orders of magnitude faster than row slicing. The sparse dataframe also doesn't take advantage of the scipy sparse matrix library, which is faster still for both column and row slicing.
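To illustrate why a per-column layout makes row selection expensive, here is a toy model. It is only a sketch and not how pandas stores sparse data internally: `make_column_store` and `take_rows` are hypothetical helpers written for this illustration. The point is that picking a handful of rows still requires one visit per column.

```python
import numpy as np

def make_column_store(dense):
    """Store each column separately as (row_indices, values) of its non-zeros."""
    store = []
    for j in range(dense.shape[1]):
        rows = np.flatnonzero(dense[:, j])      # sorted row positions of non-zeros
        store.append((rows, dense[rows, j]))
    return store

def take_rows(store, row_ids):
    """Select rows from the column store; the loop touches every column."""
    row_ids = np.asarray(row_ids)
    out = np.zeros((len(row_ids), len(store)))
    for j, (rows, vals) in enumerate(store):    # one visit per column
        hit = np.isin(row_ids, rows)            # requested rows present in this column
        if hit.any():
            pos = np.searchsorted(rows, row_ids[hit])
            out[hit, j] = vals[pos]
    return out

# Small stand-in for the 10000 x 10000 frame in the report
dense = np.zeros((100, 50))
dense[23, 4] = 1.5
dense[67, 10] = -2.0

store = make_column_store(dense)
picked = take_rows(store, [23, 45, 65, 67])
```

With 10000 columns, the loop in `take_rows` runs 10000 times no matter how few rows are requested, whereas selecting a column is a single lookup in `store`.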

Expected Output

If the data were also stored as a scipy sparse matrix inside the dataframe object, slicing operations could be improved by several orders of magnitude.

I propose that data be stored as a sparse matrix as well in the dataframe object.
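A rough sketch of what I mean, not a proposed pandas API: `SparseFrameSketch` and `loc_rows` are hypothetical names made up for this example. The container keeps a scipy CSR matrix next to the row and column labels, translates labels to positions, and lets scipy do the slicing. It assumes the row labels are unique.

```python
import numpy as np
import pandas as pd
from scipy import sparse

class SparseFrameSketch:
    """Hypothetical container: a CSR matrix plus row and column labels."""

    def __init__(self, matrix, index, columns):
        self.matrix = sparse.csr_matrix(matrix)
        self.index = pd.Index(index)
        self.columns = pd.Index(columns)

    def loc_rows(self, labels):
        """Select rows by label; the slicing itself is delegated to scipy."""
        positions = self.index.get_indexer(labels)   # -1 marks a missing label
        if (positions < 0).any():
            raise KeyError("some row labels were not found")
        return SparseFrameSketch(
            self.matrix[positions, :], self.index[positions], self.columns
        )

# Random sparse data as a stand-in for a real dataset
data = sparse.random(1000, 1000, density=0.001, format="csr", random_state=0)
frame = SparseFrameSketch(data, index=range(1000), columns=range(1000))
rows = frame.loc_rows([23, 45, 65, 67])
```

The result stays sparse, so nothing is densified unless the caller asks for it with `rows.matrix.toarray()`.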

Output of `pd.show_versions()`

pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 36.2.0
Cython: None
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: None
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None

Comment From: jreback

duplicate of https://github.com/pandas-dev/pandas/issues/14310

Your example is also not reproducible. See that issue for comments.

Comment From: nesdis

pandas sparse dataframes are ridiculously slow for most operations compared with scipy sparse matrices:

Sparse matrix instantiation vs sparse dataframe instantiation from a numpy 2D array:

import timeit
import numpy
import pandas
from scipy.sparse import coo_matrix

d = numpy.zeros((10000,10000))
d[1,2] = 3

timeit.timeit('m = coo_matrix(d)', globals=globals(), number=1)
0.7182237296299819

timeit.timeit('df = pandas.DataFrame(d).to_sparse(0)', globals=globals(), number=1)
206.5695096827077

Sparse dataframe instantiation is roughly 290 times slower than sparse matrix instantiation (206.57 s vs 0.72 s).

Sparse matrix slicing vs Sparse dataframe slicing

r = coo_matrix(d).tocsr()  # rebuilt here: the m assigned inside the timeit statement is local to it
timeit.timeit('r[:5,:].toarray()', globals=globals(), number=1)
0.0005268476787705367

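df = pandas.DataFrame(d).to_sparse(0)  # rebuilt here: the df assigned inside the timeit statement is local to it
timeit.timeit('df.iloc[:5,:]', globals=globals(), number=1)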

'''
MEMORY EXCEPTION!!
python ended up consuming 6GB of my RAM
'''
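For comparison, here is a sketch of a scipy-only route to the same five rows. It builds the matrix from coordinates, so the dense 10000 x 10000 array is never allocated, and it only densifies the small slice. The variable names (`values`, `row_idx`, `col_idx`, `first_rows`, `small_df`) are mine; this is a workaround, not a fix for the sparse dataframe itself.

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix

shape = (10000, 10000)

# Same single non-zero entry as above (d[1, 2] = 3), given as coordinates
values = np.array([3.0])
row_idx = np.array([1])
col_idx = np.array([2])

m = coo_matrix((values, (row_idx, col_idx)), shape=shape)
r = m.tocsr()                  # CSR supports efficient row slicing

first_rows = r[:5, :]          # still sparse, shape (5, 10000)
small_df = pd.DataFrame(first_rows.toarray())  # densify only the small slice
```

Only the 5 x 10000 slice is ever held in dense form, which is about 400 KB of float64 instead of the roughly 800 MB needed for the full array.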

I don't understand why this bug (Row slicing of a sparse dataframe is too slow, #17408) is a duplicate of #14310.

#14310 is about multi-row indexing being slow. #17408 is about sparse dataframes being buggy and slow overall.

Pandas Row slicing of a sparse dataframe is too slow