Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this issue exists on the latest version of pandas.
- [ ] I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
Series.value_counts() is slower than collections.Counter(), especially when counting the values of a small iterable.
Case 1: value_counts() for counting small iterables
The picture below shows the result of simulation 1, in which I compared the speed of Series.value_counts() and collections.Counter() when counting the values of iterables of varying size ( $10^1 \le N \le 10^7$ ). It shows that Series.value_counts() is the slower of the two when $N < 10^3$.
Source code for simulation 1:
from collections import Counter

import numpy as np
import pandas as pd

# `timing` is the author's timing decorator; its definition is not included in the issue.

@timing
def value_count_collections_x(n, sort=False):
    x = np.random.randint(0, 1000, (n))
    vc = Counter(x)
    if sort:
        # Order by count, descending, to match value_counts(sort=True).
        vc = dict(sorted(vc.items(), key=lambda item: item[1], reverse=True))
    return vc

@timing
def value_count_series_x(n, sort=False):
    x = np.random.randint(0, 1000, (n))
    vc = pd.Series(x).value_counts(sort=sort)
    return vc

for _ in range(5):
    for n in [10, 100, 1000, 10_000, 100_000, 1_000_000, 10_000_000]:
        value_count_collections_x(n)
        value_count_series_x(n)
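The `timing` decorator used above is not shown in the issue. A minimal sketch of what it might look like, assuming it simply prints the wall-clock time of each call (the name is from the issue; the body is a hypothetical stand-in, not the author's code):

```python
import functools
import time


def timing(func):
    """Hypothetical stand-in for the `timing` decorator used in the issue."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        # Report the wall-clock time of this single call.
        print(f"{func.__name__}{args}: {elapsed:.6f} s")
        return result

    return wrapper
```

It would have to be defined before the decorated functions for the simulation to run.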
Case 2: calling value_counts() repeatedly in aggregation functions
This also becomes an issue when value_counts() is used inside an aggregation function. I ran a second simulation (simulation 2) to compare the execution time of Series.value_counts() and collections.Counter(), this time calling each counting method from inside an aggregate function. It also shows that Series.value_counts() is about 5 times slower than collections.Counter().
Source code for simulation 2:
@timing
def collection_counter(n, sort=False):
    def agg_top20(x):
        counter = Counter(x)
        top20 = dict(counter.most_common(20)).keys()
        return " ".join(map(str, top20))

    df = pd.DataFrame(
        pd.DataFrame(np.random.randint(0, 1000, (n, 100))).stack().reset_index()
    )
    return df.groupby("level_0")[0].agg(agg_top20)

@timing
def series_value_count(n, sort=False):
    def agg_top20(x):
        counter = x.value_counts()
        top20 = counter.iloc[:20].index
        return " ".join(map(str, top20))

    df = pd.DataFrame(
        pd.DataFrame(np.random.randint(0, 1000, (n, 100))).stack().reset_index()
    )
    return df.groupby("level_0")[0].agg(agg_top20)

for _ in range(5):
    for n in [1000, 2_000, 5_000, 10_000]:
        collection_counter(n)
        series_value_count(n)
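Both functions in simulation 2 build the DataFrame inside the timed region, so the measured times include data generation as well as the aggregation. A rough sketch of how the aggregation alone could be timed, assuming the frame is built once and `timeit` is used for the measurement (`build_frame`, `time_aggregations`, and the renamed aggregation helpers are illustrative names, not part of the original report):

```python
import timeit
from collections import Counter

import numpy as np
import pandas as pd


def agg_top20_counter(x):
    # Top 20 most frequent values via collections.Counter.
    return " ".join(str(k) for k, _ in Counter(x).most_common(20))


def agg_top20_value_counts(x):
    # Top 20 most frequent values via Series.value_counts
    # (sorted by count, descending, by default).
    return " ".join(map(str, x.value_counts().iloc[:20].index))


def build_frame(n, seed=0):
    # Same shape of data as simulation 2: n groups of 100 random integers.
    rng = np.random.default_rng(seed)
    wide = pd.DataFrame(rng.integers(0, 1000, (n, 100)))
    return wide.stack().reset_index()


def time_aggregations(n, repeat=3):
    # Build the data once, then time only the groupby aggregation.
    grouped = build_frame(n).groupby("level_0")[0]
    return {
        func.__name__: min(
            timeit.repeat(lambda: grouped.agg(func), number=1, repeat=repeat)
        )
        for func in (agg_top20_counter, agg_top20_value_counts)
    }


if __name__ == "__main__":
    for n in [1000, 2_000, 5_000, 10_000]:
        print(n, time_aggregations(n))
```

Taking the minimum over several repeats is a common way to reduce noise from other processes; the absolute numbers will still depend on the machine and the pandas version.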
Is this the expected performance for value_counts()? If not, a performance improvement should be considered.
Installed Versions
Prior Performance
No response
Comment From: mroeschke
Is this an expected performance for value_counts()? If it isn't performance improvement should be cared.
Possibly. pandas.Series and collections.Counter are different objects at the end of the day. So while each has "value counts"-like functionality, Series also has multiple value counts options, NaN handling, etc. Therefore, due to the difference in scope, I expect there to be some difference in performance between the two.
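To make the difference in scope concrete, here is a small illustration of two `value_counts` options (`dropna` and `normalize`) next to how `Counter` treats NaN. The sample data is made up for the example:

```python
from collections import Counter

import numpy as np
import pandas as pd

s = pd.Series([1.0, 1.0, 2.0, np.nan, np.nan])

# NaN is dropped by default (dropna=True).
default_counts = s.value_counts()

# With dropna=False, all NaN entries are counted together as one value.
with_nan = s.value_counts(dropna=False)

# normalize=True returns relative frequencies instead of counts.
fractions = s.value_counts(normalize=True)

# Counter keys on equality and identity; two separately created NaN
# objects are neither equal nor identical, so they become separate keys.
nan_keys = Counter([float("nan"), float("nan")])
```

This extra handling is part of what `value_counts` does on every call, which is one plausible source of fixed overhead for small inputs.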
Thanks for the analysis, but I don't think this is entirely actionable as of now unless you also have some profiling results for Series.value_counts and suggestions of bottlenecks where performance could be improved.
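One way such profiling results could be gathered is with the standard-library `cProfile` module. The following is only a sketch: `profile_value_counts` and its parameters are illustrative, and the small default `n` is chosen to match the range where the report saw the largest gap.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd


def profile_value_counts(n=100, calls=1000, top=15):
    # Profile repeated value_counts() calls on a small Series so that
    # per-call overhead dominates over the actual counting work.
    x = pd.Series(np.random.randint(0, 1000, n))
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(calls):
        x.value_counts()
    profiler.disable()

    # Collect the report into a string, sorted by cumulative time.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


if __name__ == "__main__":
    print(profile_value_counts())
```

The functions with the highest cumulative time in the report would be the first candidates to inspect for bottlenecks.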
Comment From: bilzard
I see.
Since I have a workaround now (collections.Counter), I’m not taking this seriously. I’m not sure I can find the bottleneck, so I’m closing this issue.
Thank you for the quick response.