Code Sample, a copy-pastable example if possible
To run this code you'll need a somewhat large matrix. You can get my expression.tsv
from here. Then run this:
import numpy as np
import pandas as pd

expr = pd.read_csv('expression.tsv', sep='\t', index_col=0).T
expr.loc['Tulip', ['Sst', 'Pvalb', 'Npy', 'Calm2']]  # <-- this is very fast

labels = np.array(['filler' for i in range(len(expr))])
df = expr.assign(celltype=labels)
df.loc['Tulip', ['Sst', 'Pvalb', 'Npy', 'Calm2']]  # <-- this is insanely slow and allocates lots of RAM
Problem description
For some reason, indexing into a dataframe is sometimes really slow and triggers a large memory allocation, even when only a single row and a few columns are requested.
Expected Output
I feel like it shouldn't be necessary to allocate a huge amount of memory every time you want to index a dataframe.
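To make the allocation measurable, here is a small sketch using the standard-library tracemalloc module (sizes are reduced from the real data so it runs quickly, and exact numbers will vary by pandas/numpy version):
import tracemalloc

import numpy as np
import pandas as pd

# Float frame plus one string column, mirroring the report above.
df = pd.DataFrame(np.random.randn(5000, 2000))
df['celltype'] = 'filler'

tracemalloc.start()
df.loc[0, [1, 2, 3]]  # one row, three numeric columns
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print("peak allocation during .loc: %.1f MB" % (peak / 1e6))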
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-112-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 38.5.1
Cython: 0.25.2
numpy: 1.14.2
scipy: 1.0.0
pyarrow: 0.9.0
xarray: None
IPython: 6.2.1
sphinx: None
patsy: 0.4.1
dateutil: 2.7.0
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.4
feather: 0.4.0
matplotlib: 2.2.2
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.5.3
html5lib: 0.9999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
Comment From: TomAugspurger
Do you have a self-contained example? Perhaps the data could be generated randomly.
Comment From: jacksonloper
Here is a totally self-contained example:
import time

import numpy as np
import numpy.random as npr
import pandas as pd

N1 = 20000
N2 = 3000

# non-unique random labels, plus a few known ones to look up later
N1_labels = npr.randint(0, 10, size=N1 - 4).astype('S10')
N1_labels = np.r_[N1_labels, ['Sst', 'Pvalb', 'Npy', 'Calm2']]
N2_labels = npr.randint(0, 10, size=N2 - 1).astype('S10')
N2_labels = np.r_[N2_labels, ['Tulip']]

mm = npr.randn(N1, N2)
df = pd.DataFrame(data=mm, index=N1_labels, columns=N2_labels).T

t = time.time()
df.loc['Tulip', ['Sst', 'Pvalb', 'Npy', 'Calm2']]  # <-- fast
print("fast:", time.time() - t)

# adding a string column makes the frame mixed-dtype
foo = np.array(['chhh' for i in range(len(df))])
df2 = df.assign(newcolumn=foo)

for i in range(3):
    t = time.time()
    df2.loc['Tulip', ['Sst', 'Pvalb', 'Npy', 'Calm2']]  # <-- crazy slow
    print("Slow:", time.time() - t)
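As a side note on what changed structurally between df and df2: pandas groups a frame's columns into blocks by dtype, and one can peek at that layout through the internal _data attribute. This is a private, version-dependent API, shown purely as an illustration, continuing from the example above:
# Private API (pandas 0.22-era BlockManager): one block per dtype.
# df holds a single float block; df2 gains an object block for the strings.
print(df._data.blocks)   # (FloatBlock,)
print(df2._data.blocks)  # (FloatBlock, ObjectBlock)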
Comment From: jorisvandenbossche
Thanks for the runnable example!
The reason for this slowdown is that you now have mixed-dtype columns (by adding the string column): pandas converts the values to a common dtype (in the _interleave method in internals.py), which means converting the float values to object, and that conversion alone already takes more than a second:
In [3]: %time df.values.astype(object)
CPU times: user 756 ms, sys: 508 ms, total: 1.26 s
Wall time: 1.26 s
Now, the above explains why it is currently so slow and memory-inefficient. But of course, given that you only select a few columns and a single row (and not even the string column), all this work to convert the full dataframe to a single array is unnecessary. So there is certainly room to improve this if someone wants to dive into that part of the code (although that might be substantial work).
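Until that is improved, one practical workaround should be to slice out the needed (homogeneous) columns before the row lookup, so the object column never takes part in the conversion. A minimal sketch of the idea, with made-up column names and reduced sizes:
import numpy as np
import pandas as pd

# Mixed-dtype frame: floats plus one string column.
df = pd.DataFrame(np.random.randn(3000, 2000),
                  columns=['g%d' % i for i in range(2000)])
df['newcolumn'] = 'chhh'

# Selecting the numeric columns first yields a homogeneous float frame,
# so the subsequent .loc never needs to interleave with the object block.
wanted = ['g0', 'g1', 'g2', 'g3']
row = df[wanted].loc[0]
print(row)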
Comment From: jacksonloper
A large dataframe with mixed dtypes is not by itself sufficient to trigger this kind of slowdown. Here is an almost identical example that also has mixed dtypes, yet stays fast.
Non-unique indices seem to be an essential part of the problem: if I use unique columns and a unique index, it is pretty fast again:
import time

import numpy as np
import numpy.random as npr
import pandas as pd

N1 = 20000
N2 = 3000

# unique labels this time, plus the known ones to look up
N1_labels = np.r_[0:N1 - 4, ['Sst', 'Pvalb', 'Npy', 'Calm2']]
N2_labels = np.r_[0:N2 - 1, ['Tulip']]

mm = npr.randn(N2, N1)
df = pd.DataFrame(data=mm, columns=N1_labels, index=N2_labels)

t = time.time()
df.loc['Tulip', ['Sst', 'Pvalb', 'Npy', 'Calm2']]  # <-- fast
print("fast:", time.time() - t)

# mixed dtypes again, but with unique labels on both axes
foo = np.array(['chhh' for i in range(len(df))])
df2 = df.assign(newcolumn=foo)

for i in range(3):
    t = time.time()
    df2.loc['Tulip', ['Sst', 'Pvalb', 'Npy', 'Calm2']]  # <-- fast now, despite mixed dtypes
    print("Fast:", time.time() - t)
which returns
fast: 0.0027163028717041016
Fast: 0.011390924453735352
Fast: 0.009278535842895508
Fast: 0.00907135009765625
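If non-unique labels are indeed the trigger, the risk can be checked up front with the public is_unique attribute on each axis, e.g.:
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], index=['a', 'a'], columns=['x', 'y'])
print(df.index.is_unique)    # False -> .loc may take a slower duplicate-handling path
print(df.columns.is_unique)  # True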
Comment From: jbrockmendel
The "crazy slow" paths now take about 0.01 seconds for me. The df.values.astype(object) conversion mentioned in https://github.com/pandas-dev/pandas/issues/21005#issuecomment-388549324 still takes 4.5s, so I'm inferring that we have stopped doing a .values conversion since the OP. Closing as complete.