Code Sample
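import timeit
import numpy
import pandas
df = pandas.DataFrame(numpy.zeros((10000, 10000)))
# random_fill_df is a user-defined helper whose definition is not shown
random_fill_df(df, num_elements=20)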
df = df.to_sparse(fill_value=0)
timeit.timeit('df.loc[[23, 45, 65, 67],:]', globals=globals(), number=10)
Problem description
Row slicing takes so long because a sparse dataframe is a collection of sparse series, one per column. Column slicing is several orders of magnitude faster than row slicing, which performs very poorly. The sparse dataframe doesn't take advantage of the scipy sparse matrix library, which is even faster for both column and row slicing.
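To make the asymmetry concrete, here is a minimal pure-Python sketch of column-wise sparse storage. It is a toy model only, not how pandas is implemented; `ColumnSparseTable` and its methods are hypothetical names chosen for illustration.

```python
# Hypothetical toy model, not pandas internals: each column is stored as its
# own sparse mapping {row_index: value}; missing entries mean the fill value 0.
class ColumnSparseTable:
    def __init__(self, n_rows, n_cols):
        self.n_rows = n_rows
        self.columns = [dict() for _ in range(n_cols)]

    def set(self, row, col, value):
        self.columns[col][row] = value

    def get_column(self, col):
        # Column access touches exactly one sparse container.
        return dict(self.columns[col])

    def get_row(self, row):
        # Row access must probe every column, so the cost grows with the
        # number of columns even when the row is almost entirely empty.
        probes = 0
        found = {}
        for col, column in enumerate(self.columns):
            probes += 1
            if row in column:
                found[col] = column[row]
        return found, probes


table = ColumnSparseTable(n_rows=10000, n_cols=10000)
table.set(23, 7, 1.5)
table.set(45, 7, 2.5)

column_7 = table.get_column(7)      # one container visited
row_23, probes = table.get_row(23)  # every column container visited
```

In this model, fetching one column costs a single lookup, while fetching one row costs one probe per column, which is the same shape of cost described above.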
Expected Output
If the data were also stored as a scipy sparse matrix inside the dataframe object, slicing operations could be faster by several orders of magnitude.
I propose that the dataframe object also store its data as a sparse matrix.
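A rough sketch of what such a container could look like, assuming a CSR matrix is kept next to the row and column labels. `SparseFrameSketch` and `loc_rows` are hypothetical names for illustration and are not part of pandas; only the scipy and numpy calls are real APIs.

```python
import numpy as np
from scipy.sparse import coo_matrix


class SparseFrameSketch:
    """Hypothetical container: a CSR matrix plus row and column labels."""

    def __init__(self, matrix, index, columns):
        self.matrix = matrix.tocsr()  # CSR supports efficient row slicing
        self.index = list(index)
        self.columns = list(columns)
        # Map each row label to its integer position in the matrix.
        self._row_pos = {label: pos for pos, label in enumerate(self.index)}

    def loc_rows(self, labels):
        # Translate labels to positions, then let scipy slice the rows.
        positions = [self._row_pos[label] for label in labels]
        return SparseFrameSketch(self.matrix[positions, :], labels, self.columns)


# Up to 20 nonzero values scattered over a 10000 x 10000 matrix.
rng = np.random.default_rng(0)
rows = rng.integers(0, 10000, size=20)
cols = rng.integers(0, 10000, size=20)
vals = np.ones(20)
frame = SparseFrameSketch(
    coo_matrix((vals, (rows, cols)), shape=(10000, 10000)),
    index=range(10000),
    columns=range(10000),
)

subset = frame.loc_rows([23, 45, 65, 67])
print(subset.matrix.shape)  # (4, 10000)
```

The sketch only covers label-based row selection; a real implementation would also need column selection, dtype handling per column, and alignment with the rest of the dataframe API.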
Output of pd.show_versions()
Comment From: jreback
duplicate of https://github.com/pandas-dev/pandas/issues/14310
your example is also not reproducible. See the issue for comments.
Comment From: nesdis
pandas sparse dataframes are ridiculously slow at most operations compared with scipy sparse matrices:
Sparse matrix instantiation vs Sparse dataframe instantiation from numpy 2darray:
import timeit
import numpy
import pandas
from scipy.sparse import coo_matrix
d = numpy.zeros((10000, 10000))
d[1,2] = 3
timeit.timeit('m = coo_matrix(d)', globals=globals(), number=1)
0.7182237296299819
timeit.timeit('df = pandas.DataFrame(d).to_sparse(0)', globals=globals(), number=1)
206.5695096827077
Sparse dataframe instantiation is about 290 times slower than sparse matrix instantiation (206.57 s vs 0.72 s).
Sparse matrix slicing vs Sparse dataframe slicing
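# names assigned inside a timeit statement are local to the timed function, so build the matrix again here
r = coo_matrix(d).tocsr()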
timeit.timeit('r[:5,:].toarray()', globals=globals(), number=1)
0.0005268476787705367
timeit.timeit('df.iloc[:5,:]', globals=globals(), number=1)
'''
MEMORY EXCEPTION!!
python ended up consuming 6GB of my RAM
'''
I don't understand why this bug is a duplicate of Row slicing of a sparse dataframe is too slow #17408.