Pandas PERF: Series.value_counts() in aggregation function is slower than collections.Counter()

Pandas version checks

[X] I have checked that this issue has not already been reported.
[X] I have confirmed this issue exists on the latest version of pandas.
[ ] I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

Series.value_counts() are slower especially if it counts for a small size of iterable elements.

Case1: value_counts() for counting small iterable elements

The picture below is the result of simulation ¹ where I compared the speed of counting values for Series.value_counts() and collections.Counter(). In this simulation, I measured time for counting iterables of variable sizes ( $N \in 10^1 - 10^7$ ) using Series.value_counts() and collections.Counter(). It shows Series.value_counts() comes to be slower if the $N < 10^3$.

Source code for simulation 1:

@timing
def value_count_collections_x(n, sort=False):
    x = np.random.randint(0, 1000, (n))
    vc = Counter(x)
    if sort:
        vc = dict(sorted(vc.items(), key=lambda x: x[1], reverse=True))
    return vc


@timing
def value_count_series_x(n, sort=False):
    x = np.random.randint(0, 1000, (n))
    vc = pd.Series(x).value_counts(sort=sort)
    return vc


for _ in range(5):
    for n in [10, 100, 1000, 10_000, 100_000, 1_000_000, 10_000_000]:
        value_count_collections_x(n)
        value_count_series_x(n)

Case2: repeatedly calls value_counts() in aggregation functions

That also comes to be issue when using in aggregation functions. I run another simulation to compare the execution time for Series.value_counts() and collections.Counter() ². This time I used counting methods in an aggregate function. It also shows Series.value_counts() is 5 times slower than collections.Counter() method.

source code for simulation 2:

@timing
def collection_counter(n, sort=False):
    def agg_top20(x):
        counter = Counter(x)
        top20 = dict(counter.most_common(20)).keys()
        return " ".join(map(str, top20))

    df = pd.DataFrame(
        pd.DataFrame(np.random.randint(0, 1000, (n, 100))).stack().reset_index()
    )
    return df.groupby("level_0")[0].agg(agg_top20)


@timing
def series_value_count(n, sort=False):
    def agg_top20(x):
        counter = x.value_counts()
        top20 = counter.iloc[:20].index
        return " ".join(map(str, top20))

    df = pd.DataFrame(
        pd.DataFrame(np.random.randint(0, 1000, (n, 100))).stack().reset_index()
    )
    return df.groupby("level_0")[0].agg(agg_top20)


for _ in range(5):
    for n in [1000, 2_000, 5_000, 10_000]:
        collection_counter(n)
        series_value_count(n)

Is this an expected performance for value_counts()? If it isn't performance improvement should be cared.

Installed Versions

INSTALLED VERSIONS ------------------ commit : 91111fd99898d9dcaa6bf6bedb662db4108da6e6 python : 3.10.6.final.0 python-bits : 64 OS : Linux OS-release : 5.15.0-52-generic Version : #58-Ubuntu SMP Thu Oct 13 08:03:55 UTC 2022 machine : x86_64 processor : x86_64 byteorder : little LC_ALL : en_US.UTF-8 LANG : en_US.UTF-8 LOCALE : en_US.UTF-8 pandas : 1.5.1 numpy : 1.23.4 pytz : 2022.1 dateutil : 2.8.2 setuptools : 59.6.0 pip : 22.0.2 Cython : None pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : None jinja2 : 3.0.3 IPython : 8.6.0 pandas_datareader: None bs4 : 4.11.1 bottleneck : None brotli : None fastparquet : 0.8.3 fsspec : 2022.10.0 gcsfs : None matplotlib : 3.6.0 numba : None numexpr : None odfpy : None openpyxl : None pandas_gbq : None pyarrow : None pyreadstat : None pyxlsb : None s3fs : None scipy : None snappy : None sqlalchemy : None tables : None tabulate : None xarray : None xlrd : None xlwt : None zstandard : None tzdata : None

Prior Performance

No response

Comment From: mroeschke

Is this an expected performance for value_counts()? If it isn't performance improvement should be cared.

Possibly. pandas.Series and collection.Counter are different objects at the end of the day. So while each has a "value counts" like functionality, Series also has multiple value counts options, nan handling, etc. Therefore due to difference in scope, I expect there to be some difference in performance between the two.

Thanks for the analysis but I don't think this is entirely actionable as of now unless you also had some profiling results of Series.value_counts and suggestion of bottlenecks where performance could be improved

Comment From: bilzard

I see.

Since I have a workaround now (collections.Counter), I’m not taking this seriously. I’m not sure I can find the bottleneck, so I’m closing this issue.

Thank you for the quick response.

On Thu, Nov 17, 2022 at 2:25 Matthew Roeschke @.***> wrote:

Is this an expected performance for value_counts()? If it isn't performance improvement should be cared.

Possibly. pandas.Series and collection.Counter are different objects at the end of the day. So while each has a "value counts" like functionality, Series also has multiple value counts options, nan handling, etc. Therefore due to difference in scope, I expect there to be some difference in performance between the two.

Thanks for the analysis but I don't think this is entirely actionable as of now unless you also had some profiling results of Series.value_counts and suggestion of bottlenecks where performance could be improved

— Reply to this email directly, view it on GitHub https://github.com/pandas-dev/pandas/issues/49728#issuecomment-1317386818, or unsubscribe https://github.com/notifications/unsubscribe-auth/AIW6IKXXNRZXCRN2JWTVTYDWIUKJFANCNFSM6AAAAAASB7PO5I . You are receiving this because you authored the thread.Message ID: @.***>

https://gist.github.com/bilzard/558f07a7d13f62b4222a92446704cbe5 ↩
https://gist.github.com/bilzard/08eb013b7a65e35e7bc8f0768c4bd147 ↩