Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [X] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import numpy as np
import pandas as pd

# Build a single-column DataFrame
frame = pd.DataFrame(np.array([1, 2, 3, 4, 5, 100]))

# Call describe with a single percentile and show the result
print(frame.describe(percentiles=[0.25]))

Issue Description

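When a single percentile is passed to the DataFrame describe function through percentiles, the output also includes the 50th percentile (median), which was not requested. In the outputs below, the 50% row appears both for values below 0.5 (0.25, 0.35) and for a value above it (0.51); only its position relative to the requested percentile changes.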

# using the DataFrame from the example above
>>> frame.describe(percentiles = [0.25])
                0
count    6.000000
mean    19.166667
std     39.625329
min      1.000000
25%      2.250000
50%      3.500000
max    100.000000
>>> frame.describe(percentiles = [0.35])
                0
count    6.000000
mean    19.166667
std     39.625329
min      1.000000
35%      2.750000
50%      3.500000
max    100.000000
>>> frame.describe(percentiles = [0.51])
                0
count    6.000000
mean    19.166667
std     39.625329
min      1.000000
50%      3.500000
51%      3.550000
max    100.000000
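
The row labels in these three outputs are consistent with the median being added to the requested percentiles before they are sorted. The following is only a sketch of that observed pattern: `with_median` is a hypothetical helper written for illustration, not pandas' internal code.

```python
def with_median(percentiles):
    """Hypothetical helper: add 0.5 when it is missing, then sort."""
    values = list(percentiles)
    if 0.5 not in values:
        values.append(0.5)
    return sorted(values)


# Reproduce the percentile row labels seen in the outputs above.
for requested in ([0.25], [0.35], [0.51]):
    labels = [f"{p:.0%}" for p in with_median(requested)]
    print(requested, "->", labels)
```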

Expected Behavior

describe should return only the requested percentile, without adding the 50% row.
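
Until the behavior changes, one possible workaround is to filter the result after the fact. This is a minimal sketch, assuming the requested percentiles are whole-number percentages so that their labels can be rebuilt with `f"{p:.0%}"`; the names `requested`, `wanted` and `trimmed` are illustrative only.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.array([1, 2, 3, 4, 5, 100]))
requested = [0.25]

summary = frame.describe(percentiles=requested)

# Labels of the percentile rows that were actually asked for, e.g. "25%".
wanted = {f"{p:.0%}" for p in requested}

# Keep every non-percentile row, plus only the requested percentile rows.
keep = [
    label
    for label in summary.index
    if not label.endswith("%") or label in wanted
]
trimmed = summary.loc[keep]
print(trimmed)
```

The filter leaves count, mean, std, min and max untouched and drops any percentile row that was not in `requested`, so it gives the same rows whether or not the median is added automatically.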

Installed Versions

INSTALLED VERSIONS
------------------
commit                : 0691c5cf90477d3503834d983f69350f250a6ff7
python                : 3.12.4
python-bits           : 64
OS                    : Windows
OS-release            : 10
Version               : 10.0.19045
machine               : AMD64
processor             : Intel64 Family 6 Model 140 Stepping 2, GenuineIntel
byteorder             : little
LC_ALL                : None
LANG                  : None
LOCALE                : English_India.1252

pandas                : 2.2.3
numpy                 : 2.2.0
pytz                  : 2024.2
dateutil              : 2.9.0.post0
pip                   : 24.1
Cython                : None
sphinx                : None
IPython               : None
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : None
blosc                 : None
bottleneck            : None
dataframe-api-compat  : None
fastparquet           : None
fsspec                : None
html5lib              : None
hypothesis            : None
gcsfs                 : None
jinja2                : None
lxml.etree            : None
matplotlib            : None
numba                 : None
numexpr               : None
odfpy                 : None
openpyxl              : None
pandas_gbq            : None
psycopg2              : None
pymysql               : None
pyarrow               : None
pyreadstat            : None
pytest                : None
python-calamine       : None
pyxlsb                : None
s3fs                  : None
scipy                 : None
sqlalchemy            : None
tables                : None
tabulate              : None
xarray                : None
xlrd                  : None
xlsxwriter            : None
zstandard             : None
tzdata                : 2024.2
qtpy                  : None
pyqt5                 : None

Comment From: rhshadrach

Thanks for the report. This goes back to https://github.com/ivanovmg/pandas/commit/843aa600c8ed359f828c5a25d5e7130d6df1e08a and is indeed intentional.

But it is certainly not well documented, and I'm supportive of removing the behavior where we always include 0.5.

Comment From: kevkle

take

Comment From: yanweiSu

take

Comment From: Pushkar3232

Hi, I'm a first-time contributor and working on this issue. Could you please help me by pointing to the specific file or path where the .describe() function is implemented? I'm having trouble navigating the large repository and would appreciate some guidance on where to start. Thanks in advance for your help!

Comment From: rhshadrach

@Pushkar3232 - thanks for your interest, but this issue already has a PR up resolving it.

Comment From: Abhibhav2003

take

Comment From: Abhibhav2003

Refer to pull request #61024.

Comment From: preet545

take

Comment From: xaris96

take

Comment From: MartinBraquet

@rhshadrach and other maintainers, I have read differing opinions on the usefulness of the change requested here. In #60557, some comments advise preserving the median even when percentiles is not None.

If we remove the median when percentiles is not None, perhaps it then raises the question as to why we don't also remove the other default percentiles, namely the min and max. Indeed, the default percentiles rendered by .describe are the 0th, 25th, 50th, 75th and 100th ones. Keeping the 0th and 100th ones while discarding the 50th one -- when percentiles is passed -- might seem arbitrary in the eyes of some users. In some sense, this would make the judgment that the min and max are more important / valuable than the median. To me, this decision potentially makes sense; I am simply wondering if such a judgment is also widely agreed upon by the users.
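
To make the point about min and max concrete, here is a small sketch using the data from the report. It only checks that the 0th, 50th and 100th quantiles coincide with the min, median and max; it says nothing about how describe computes them internally.

```python
import pandas as pd

series = pd.Series([1, 2, 3, 4, 5, 100])

# The five default rows of describe, expressed as quantiles.
quantiles = series.quantile([0.0, 0.25, 0.5, 0.75, 1.0])

# min, median and max are the 0th, 50th and 100th percentiles.
assert quantiles.loc[0.0] == series.min()
assert quantiles.loc[0.5] == series.median()
assert quantiles.loc[1.0] == series.max()
print(quantiles)
```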

I took the liberty of handling this issue in #61158, since all related activity had been stale for weeks. If the utility of the change is reconsidered in light of my comment above, I'll update my PR; I was asked to add more test coverage in any case.

Comment From: MartinBraquet

take