we already have too many public indexers.....
Comment From: bluss
Is there a fast replacement for `set_value`? Using `loc` in its place seems to be slow.
```
In [328]: columns = list("abcdef")

In [329]: dx = 0.01; xs = np.arange(0, 1, step=dx);

In [330]: df = pd.DataFrame(index=xs)

In [331]: %%timeit
     ...: for x in xs:
     ...:     for c in columns:
     ...:         df.set_value(x, c, 1)
     ...:
The slowest run took 8.40 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 2.24 ms per loop

In [333]: df = pd.DataFrame(index=xs)

In [334]: %%timeit
     ...: for x in xs:
     ...:     for c in columns:
     ...:         df.loc[x, c] = 1
     ...:
10 loops, best of 3: 96.4 ms per loop
```
Comment From: jorisvandenbossche
In your example, `df.at[x, c] = 1` is also faster than `loc`.
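For reference, a minimal sketch of the `.at` version of the same loop (assuming the `xs` and `columns` from the session above; no timings are claimed here, since they depend on machine and pandas version):

```python
# Same benchmark loop, but using the .at scalar accessor instead of set_value/loc.
df = pd.DataFrame(index=xs)
for x in xs:
    for c in columns:
        df.at[x, c] = 1   # label-based scalar set; falls back to .loc to enlarge for new labels
```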
Comment From: jorisvandenbossche
Sparked by https://github.com/pandas-dev/pandas/issues/15268
Personally, I never use those functions, so I would not miss them (and would welcome the namespace cleanup).
But are there genuine cases where these methods are useful compared to the other indexing methods? Why were they added in the first place?
On StackOverflow they seem to be mentioned mainly for their speed (e.g. http://stackoverflow.com/questions/13842088/set-value-for-particular-cell-in-pandas-dataframe/24517695#24517695).
cc @pandas-dev/pandas-core
Comment From: jreback
IIRC these were always there :> (e.g. way before `.iloc`/`.loc` even existed).
The well-supported replacements are `.iat`/`.at`, which internally do almost what these functions do; they should be performant, yet support all of the pandas types. The `set_value`/`get_value` routines are 'raw' and don't do any validation / checking.
They are also not idiomatic (calling functions to set/get values), and confusing to beginners.
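A minimal sketch of the correspondence (illustrative only; the frame, labels, and values here are made up):

```python
# Label-based vs positional scalar access with the .at / .iat accessors.
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

df.at[0, "a"] = 10.0     # replaces df.set_value(0, "a", 10.0)
value = df.at[0, "a"]    # replaces df.get_value(0, "a")

df.iat[1, 1] = 20.0      # positional scalar set
value = df.iat[1, 1]     # positional scalar get
```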
Comment From: wesm
Once upon a time, I spent a lot of time making get_value and set_value extremely fast. Over time the performance has degraded significantly.
In the course of rewriting the scalar value access code paths for pandas 2, things will get fast again, so I'm not sure how to proceed, given that this will happen at some point in the future.
Comment From: bluss
`.at` is indeed fast, but it still slows down that loop example by 2-3x. `.at` is barely documented, too.
Comment From: jreback
@bluss you know you can just do this, right?

```
In [7]: %timeit pd.DataFrame(1.0, index=xs, columns=list('abcde'))
10000 loops, best of 3: 165 µs per loop
```
Comment From: bluss
Yes, the actual use case does not write the same value everywhere, but a value specific to the current index and column. I've noticed that if I do it the wrong way, building the DataFrame overshadows all the other computation, but with `.set_value` it is fine (and with `.at` too: slower, but not dominating).
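If the per-cell value can be expressed as a function of the index and column, the frame can still be built in one shot instead of cell-by-cell. A sketch, where `cell_value` is a hypothetical stand-in for the real computation:

```python
import numpy as np
import pandas as pd

xs = np.arange(0, 1, step=0.01)
columns = list("abcdef")

def cell_value(x, c):
    # hypothetical per-cell computation
    return x * (ord(c) - ord("a"))

# Build each column as a list, then construct the DataFrame once.
df = pd.DataFrame({c: [cell_value(x, c) for x in xs] for c in columns}, index=xs)
```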
Comment From: dov
I was referred to this issue from #17256, where chris-b1 suggested that I use `loc` instead of `set_value`. I thought I would share a small test I made comparing various ways of building a DataFrame. Indeed, I switched from `loc` to `set_value` because of a performance problem with `loc`. Consider the following test program, which shows the huge difference in efficiency:
```python
import pandas as pd
import numpy
import time

pd.options.display.float_format = '{:,.0f}'.format

df = pd.DataFrame(numpy.random.rand(1000, 100) * 100)
df.loc[:, 'A'] = None
df.loc[:, 'B'] = None
df.loc[:, 'C'] = None

t0 = time.time()
for idx, row in df.iterrows():
    row.loc[('A', 'B', 'C')] = (100 + idx, 200 + idx, 300 + idx)
    df.loc[idx] = row
print 'First: ', time.time() - t0

t0 = time.time()
for idx, row in df.iterrows():
    row.loc[('A', 'B', 'C')] = (100 + idx, 200 + idx, 300 + idx)
print 'Second: ', time.time() - t0

t0 = time.time()
for idx, row in df.iterrows():
    df.set_value(idx, 'A', 100 + idx)
    df.set_value(idx, 'B', 200 + idx)
    df.set_value(idx, 'C', 300 + idx)
print 'Third: ', time.time() - t0

t0 = time.time()
for idx, row in df.iterrows():
    df.loc[idx, 'A'] = 100 + idx
    df.loc[idx, 'B'] = 200 + idx
    df.loc[idx, 'C'] = 300 + idx
print 'Fourth: ', time.time() - t0

t0 = time.time()
for idx, row in df.iterrows():
    df.loc[idx, ('A', 'B', 'C')] = (100 + idx, 200 + idx, 300 + idx)
print 'Fifth: ', time.time() - t0
```
On my home box this gives the output:

```
First: 13.1010141373
Second: 0.216350078583
Third: 0.0418322086334
Fourth: 0.731967926025
Fifth: 0.692142963409
```
Using `set_value` in the third example is more than 15 times faster than using `loc` in the fourth and fifth examples! In a real-world problem I reduced my run time from 54 seconds to 6, just by making these kinds of changes. Needless to say, until these problems are solved, I will continue to use `set_value`.
Comment From: jreback
@dov you should simply use `.at` and `.iat`. `.set_value` / `.get_value` are going to be deprecated. Iteratively setting values is an anti-pattern.
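As a sketch of the point about iterative setting: in the benchmark above, the values depend only on the row label, so each of the three columns can be assigned in one step, with no loop at all (no timings claimed here):

```python
# Vectorized equivalent of the loop-based benchmarks above.
df['A'] = 100 + df.index
df['B'] = 200 + df.index
df['C'] = 300 + df.index
```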
Comment From: dov
Thanks @jreback, I missed the `.at` accessor. There are just too many different options! Adding the following to my script above:
```python
t0 = time.time()
for idx, row in df.iterrows():
    df.at[idx, 'A'] = 100 + idx
    df.at[idx, 'B'] = 200 + idx
    df.at[idx, 'C'] = 300 + idx
print 'Sixth: ', time.time() - t0

AIdx = len(df.columns) - 3
BIdx = len(df.columns) - 2
CIdx = len(df.columns) - 1

t0 = time.time()
for idx, row in df.iterrows():
    df.iat[idx, AIdx] = 100 + idx
    df.iat[idx, BIdx] = 200 + idx
    df.iat[idx, CIdx] = 300 + idx
print 'Seventh: ', time.time() - t0
```
gives the additional output:

```
Sixth: 0.0538790225983
Seventh: 0.0564727783203
```
I.e. it is "only" about 20% slower than the `set_value()` calls. I can live with that, but I still think it is strange to deprecate the fastest option.
Regarding the loop being an anti-pattern, I agree; it can just be considered a stress test run a thousand times. On the other hand, I often do something non-pandas related (e.g. image processing) and just want to store the result in an existing DataFrame.
Comment From: jreback
> I.e. it is "only" about 20% slower than the set_value() calls. I can live with that, but I still think it is strange to deprecate the fastest option.

@dov well, you can have correct, or slightly faster. I would always take correct.

> (e.g. image processing)

This most certainly is an anti-pattern; there are vectorized methods in other libraries.
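For the "store an external result per row" case, one sketch (where `process` is a hypothetical stand-in for the non-pandas computation, e.g. per-image processing): run the external work in a plain loop, collect the results, and assign the whole column at once.

```python
import numpy as np
import pandas as pd

def process(seed):
    # hypothetical stand-in for the real non-pandas work
    return float(np.sqrt(seed))

df = pd.DataFrame({"image_id": range(1000)})

# Collect the results outside pandas, then do a single column assignment.
df["result"] = [process(i) for i in df["image_id"]]
```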
Comment From: PointyShinyBurning
What's the right way to set a single value in a method chain given this deprecation?
There's got to be something faster/more readable than:
```
%timeit df.set_value(1, "a", 1).mean()
__main__:1: FutureWarning: set_value is deprecated and will be removed in a future release. Please use .at[] or .iat[] accessors instead
68.6 µs ± 1.11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit df.assign(a=lambda f: f.a.mask(f.a.index == 1, 1)).mean()
632 µs ± 2.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
Comment From: jreback
The deprecation warning is pretty explicit.
Comment From: PointyShinyBurning
Yes, but `.at[1, "a"] = 1` returns None, so I have to break the method chain and give the intermediate variable a name. I have a lot of code patterned like https://tomaugspurger.github.io/method-chaining.html, which I thought was a style pandas aimed to support?
I can roll my own, but I still lose quite a bit of speed/readability:

```python
def set_value(df, index, col, val):
    new_df = df.copy()
    new_df.at[index, col] = val   # set on the copy, not the original
    return new_df
```

```
%timeit df.pipe(set_value, 1, "a", 1).mean()
160 µs ± 1.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
Comment From: jreback
You are missing the point: setting a single item at a time is non-performant and an anti-pattern.
Sure, you can do it, but it doesn't make any sense in a method chain.
Comment From: PointyShinyBurning
I understand that it's an anti-pattern to construct a DataFrame by doing it 2000 times. I admit I don't really understand why it's wrong to correct, say, the first value in my data because the instrument it comes from has a warm-up period.
(There also isn't a good pattern for setting on a slice in a method chain either, of course, but that's a separate issue from removing functionality that already exists.)
Comment From: LinuxIsCool
@PointyShinyBurning has a really good point and I think this issue should be re-opened.