Code Sample, a copy-pastable example if possible
I'm trying to locate a specific row and change the data in that row in pandas. I'm doing that in the following way.
df.set_index(['row_1', 'row_2'], inplace=True)
for entry_1, entry_2 in data:
    try:
        df.loc[(entry_1, entry_2)]  # if it doesn't exist throws keyerror
        df.loc[(entry_1, entry_2), 'my_column'] = 'new unique value'
    except KeyError:
        pass  # ignore as value wasn't in df from the beginning
df.reset_index(inplace=True)
Problem description
When the dataframe contains ~12k entries, it checks roughly 900 entries per second. But when I shrink the dataframe to ~5k entries, it checks only about 10 entries per second. This is counterintuitive; I would expect the opposite. What determines the speed of a lookup in the index?
Any ideas on what's going on here?
There is another index-related issue that appears when the dataframe has ~12k entries and I'm setting the index to 5 rows. When I try to get a specific entry based on the index, it's not able to find the entry. However, when I reduce the number of entries in the dataframe, it does find the entry.
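The thread does not identify the cause. As a hedged starting point for diagnosis: pandas label lookups on a MultiIndex can take slower paths when the index is unsorted or contains duplicate keys, so checking both is cheap. The sketch below uses a made-up four-row frame (the data and the names `indexed`, `key`, and `found` are illustrative, not from the original report) and assumes a reasonably recent pandas.

```python
import pandas as pd

# Hypothetical stand-in for the reporter's dataframe.
df = pd.DataFrame({
    "row_1": ["b", "a", "c", "a"],
    "row_2": [1, 2, 3, 1],
    "my_column": ["w", "x", "y", "z"],
})
indexed = df.set_index(["row_1", "row_2"])

# Two properties worth checking before timing lookups.
is_unique = indexed.index.is_unique
is_sorted_before = indexed.index.is_monotonic_increasing

# Sorting the index is a common first step before repeated .loc lookups.
indexed = indexed.sort_index()
is_sorted_after = indexed.index.is_monotonic_increasing

# Membership test that avoids the try/except around .loc.
key = ("a", 2)
found = key in indexed.index
```

If the real index turns out to be unsorted or non-unique, comparing timings before and after `sort_index()` would show whether that is a factor here.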
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Darwin
OS-release: 16.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.21.1
pytest: None
pip: 9.0.1
setuptools: 34.3.3
Cython: None
numpy: 1.13.1
scipy: None
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.6.5
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.9
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: None
lxml: 4.0.0
bs4: 4.6.0
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
Comment From: jreback
This is quite non-idiomatic and therefore inefficient.
Use df.where() or df.isin() to index and do a vectorized set. See the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html
If you still have questions like this, Stack Overflow is a good forum.
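One way the vectorized set suggested above might look, as a minimal sketch rather than the exact code the comment had in mind: build a MultiIndex from the two key columns, test membership against all pairs at once with `isin`, and assign through the resulting boolean mask. The sample frame and the contents of `data` are invented for illustration, and the sketch assumes every matching row receives the same value, as in the original loop.

```python
import pandas as pd

# Hypothetical sample data; column names follow the original snippet.
df = pd.DataFrame({
    "row_1": ["a", "a", "b", "c"],
    "row_2": [1, 2, 1, 3],
    "my_column": ["old", "old", "old", "old"],
})
data = [("a", 2), ("c", 3), ("z", 9)]  # ("z", 9) is not present in df

# Combine the key columns into a MultiIndex without touching df's own index.
keys = pd.MultiIndex.from_arrays([df["row_1"], df["row_2"]])

# Boolean mask: True where the (row_1, row_2) pair appears in data.
mask = keys.isin(data)

# Single vectorized assignment; pairs missing from df are simply ignored.
df.loc[mask, "my_column"] = "new unique value"
```

This replaces the per-entry `try`/`except` with one membership test and one assignment, and it needs no `set_index`/`reset_index` round trip.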
Comment From: jonathan-s
Thanks for the reply! I'll look into it 👍. Also thanks for your work on Pandas!