Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [ ] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np

class MyFrame(pd.DataFrame): 
    def __init__(self, *args, **kwargs): 
        super().__init__(*args, **kwargs)
        for col in self.columns:
            if self.dtypes[col] == "O":
                self[col] = pd.to_numeric(self[col], errors='ignore')
    @property
    def _constructor(self): 
        return type(self)

def get_frame(N): 
    return MyFrame(
        data=np.vstack(
            [np.where(np.random.rand(N) > 0.36, np.random.rand(N), np.nan) for _ in range(10)]
        ).T, 
        columns=[f"col{i}" for i in range(10)]
    )

# When N is smallish, no issue
frame = get_frame(5000)
frame.dropna(subset=["col0", "col1"])
print("5000 passed")

# When N is largeish, `dropna` recurses in the `__init__` through `self.dtypes[col]` access
frame = get_frame(5000000)
frame.dropna(subset=["col0", "col1"])
print("5000000 passed")

Modifying the class __init__ to (remove self.dtypes[col]):

class MyFrame(pd.DataFrame):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        for col, dt in zip(self.columns, self.dtypes):
            if dt == "O":
                self[col] = pd.to_numeric(self[col], errors='ignore')
    @property
    def _constructor(self):
        return type(self)

Issue Description

I think there has been a regression with access to .dtypes property in inherited DataFrame constructors, as noted in the MRE.

We noticed this on pandas 1.5.2 when upgrading our production environment , but reproduced with pandas 1.4.4, 1.4.0. The code works as expected going back to 1.3.5.

As far as what should be done, perhaps more notes about what can/can't/should not be called/done in subclass __init__ routines when inheriting from pd.DataFrame?

Expected Behavior

No infinite loop?

Installed Versions

In [2]: pd.show_versions()
C:\Users\user\Python\lib\site-packages\_distutils_hack\__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")

INSTALLED VERSIONS
------------------
commit           : 8dab54d6573f7186ff0c3b6364d5e4dd635ff3e7
python           : 3.10.8.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
Version          : 10.0.22621
machine          : AMD64
processor        : AMD64 Family 25 Model 33 Stepping 2, AuthenticAMD
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : English_United States.1252

pandas           : 1.5.2
numpy            : 1.21.6
pytz             : 2022.7
dateutil         : 2.8.2
setuptools       : 65.6.3
pip              : 22.3.1
Cython           : 0.29.33
pytest           : 6.2.5
hypothesis       : 6.62.0
sphinx           : 5.3.0
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.9.2
html5lib         : None
pymysql          : None
psycopg2         : 2.9.3
jinja2           : 3.1.2
IPython          : 8.8.0
pandas_datareader: None
bs4              : 4.11.1
bottleneck       : None
brotli           :
fastparquet      : None
fsspec           : 2022.11.0
gcsfs            : None
matplotlib       : 3.5.3
numba            : 0.56.4
numexpr          : 2.8.3
odfpy            : None
openpyxl         : 3.0.10
pandas_gbq       : None
pyarrow          : 10.0.1
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : 1.10.0
snappy           : None
sqlalchemy       : 1.4.46
tables           : 3.7.0
tabulate         : None
xarray           : None
xlrd             : 2.0.1
xlwt             : None
zstandard        : 0.19.0
tzdata           : None

Comment From: phofl

This works as expected for me on main, can reproduce on 1.5.2. Could you double check on our nightly builds?

Comment From: natmokval

take

Comment From: MarcoGorelli

From https://www.kaggle.com/code/marcogorelli/pandas-regression-example?scriptVersionId=117662725 I'm seeing #49551 as the commit that fixed this, cc @rhshadrach

doesn't really seem plausible though?

EDIT: it totally is plausible, because the numeric_only determined whether a transpose was called

Comment From: MarcoGorelli

Looks like this has crept back in

Am doing a bisect to see what brought it back, maybe that'll shed some light into what's going on

Comment From: MarcoGorelli

51335 brought it back:

git checkout b836a88f81c575e86a67b47208b1b5a1067b6b40
. compile-c-extensions.sh 
python myt.py  # hangs indefinitely
git checkout b836a88f81c575e86a67b47208b1b5a1067b6b40~1
. compile-c-extensions.sh 
python myt.py  # runs nearly instantly

myt.py contains:

import pandas as pd
import numpy as np
from pandas import *


class MyFrame(pd.DataFrame):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        for col in self.columns:
            if self.dtypes[col] == "O":
                self[col] = pd.to_numeric(self[col], errors='ignore')
    @property
    def _constructor(self):
        return type(self)

def get_frame(N):
    return MyFrame(
        data=np.vstack(
            [np.where(np.random.rand(N) > 0.36, np.random.rand(N), np.nan) for _ in range(10)]
        ).T,
        columns=[f"col{i}" for i in range(10)]
    )

def long_running_function(n):
    get_frame(n).dropna()


long_running_function(1000000)

Comment From: MarcoGorelli

Investigations:

dropna goes here:

https://github.com/pandas-dev/pandas/blob/b836a88f81c575e86a67b47208b1b5a1067b6b40/pandas/core/frame.py#L6398

which then goes to

https://github.com/pandas-dev/pandas/blob/b836a88f81c575e86a67b47208b1b5a1067b6b40/pandas/core/frame.py#L10482

The transpose calls the constructor, which here is overwritten and involves a Python for-loop which goes through millions of elements. So, that's why it's become a lot slower

Before: - numeric_only and axis=1 would end up calling df.T

Now: - axis=1 goes to df.T regardless of numeric_only

@rhshadrach reckon this is a cause for concern or that anything should be done here? Not sure pandas can support arbitrary subclasses anyway, might be OK to just close as out-of-scope?

Comment From: rhshadrach

52250 would fix again.

Comment From: MarcoGorelli

ok thanks

if we do want to make a PR for the sake of this, then I'd really suggest getting a test like the one in https://github.com/pandas-dev/pandas/pull/50751 merged, else this'll happen again

Comment From: MarcoGorelli

52250 would fix again.

Looks like that one's been closed

Is there still scope to fix this one? Or should we just close as not-supported?

Comment From: rhshadrach

@MarcoGorelli - I plan to look into this.