Pandas BUG: select_dtypes returns unexpected results after assigning a column by addition or subtraction on a large DataFrame

Problem description and code example

This is similar to the issue I mentioned earlier. In this case, though, select_dtypes does not return a column newly assigned by addition or subtraction on a large DataFrame, even though I do not use eval. The same code works on a small DataFrame. Details below:

In [12]: df = pd.DataFrame({'col1': np.random.randint(500, size=1800000),
    ...:                    'col2': np.random.randint(1, size=1800000),
    ...:                    'col3':np.random.randint(4, size=1800000),
    ...:                    'col4':np.random.randint(2, size=1800000),
    ...:                   })
    ...:
    ...: df['test'] = df['col1'] + df['col2'] + df['col3'] + df['col4']
    ...:

# `df['test'] = df['col1']  -  df['col2'] - df['col3'] - df['col4']`, same for subtraction.

In [13]: df.select_dtypes(include=['int64'])
Out[13]:
         col1  col2  col3  col4
0         453     0     2     1
1          74     0     1     0
2         468     0     0     0
3          82     0     3     1
4         288     0     2     1
5         474     0     0     0
6         167     0     2     1
7         182     0     0     1
8          26     0     3     1
(Remaining rows omitted because the output is long.)

In [14]: df.select_dtypes(include=['int64'])
Out[14]:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...]

[1800000 rows x 0 columns]

Note: I do not show a small DataFrame, such as the iris data from sklearn, because that case is easy to reproduce.

Expected Output

select_dtypes(include=['int64']) should return the newly assigned column along with the original ones, no matter how many times it is run.

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: zh_CN.UTF-8
LOCALE: zh_CN.UTF-8

pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 36.5.0
Cython: None
numpy: 1.13.1
scipy: 0.19.0
xarray: None
IPython: 6.2.1
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.0.0
openpyxl: None
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

Comment From: jreback

Show df.dtypes. It's possible you are starting with int32s to begin with (though in 0.20.3 this should only happen on Windows, where the native int dtype is int32).
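A minimal sketch of the dtype check being asked for, on a small illustrative frame rather than the reporter's data. It prints the dtype that `np.random.randint` produces on the current platform, then shows that selecting on the generic `np.integer` kind does not depend on whether the columns are int32 or int64.

```python
import numpy as np
import pandas as pd

# The default integer width of randint is platform-dependent,
# so print it rather than assume it.
sample = np.random.randint(500, size=5)
print(sample.dtype)

df = pd.DataFrame({"col1": sample})
df["test"] = df["col1"] + df["col1"]
print(df.dtypes)

# np.integer matches any fixed-width integer dtype (int32, int64, ...),
# unlike include=['int64'], which matches only that exact width.
ints = df.select_dtypes(include=[np.integer])
print(list(ints.columns))
```

If `df.dtypes` reports int32 here, then `include=['int64']` returning no columns would be expected rather than a bug.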

Comment From: jreback

Looks fine on macOS on master, and on 0.20.3 with Python 3.5 and 3.6.

In [9]: pd.options.display.max_rows=12

In [10]: df = pd.DataFrame({'col1': np.random.randint(500, size=1800000),
    ...:                    'col2': np.random.randint(1, size=1800000),
    ...:                    'col3': np.random.randint(4, size=1800000),
    ...:                    'col4': np.random.randint(2, size=1800000),
    ...:                   })
    ...:
    ...: df['test'] = df['col1'] + df['col2'] + df['col3'] + df['col4']
    ...:

In [11]: df
Out[11]: 
         col1  col2  col3  col4  test
0          42     0     3     0    45
1         418     0     0     1   419
2         163     0     1     1   165
3         435     0     1     0   436
4          52     0     2     1    55
5         141     0     3     1   145
...       ...   ...   ...   ...   ...
1799994    87     0     2     1    90
1799995   243     0     1     0   244
1799996   211     0     1     0   212
1799997    29     0     1     1    31
1799998   311     0     0     0   311
1799999   405     0     3     0   408

[1800000 rows x 5 columns]

In [13]: df.dtypes
Out[13]: 
col1    int64
col2    int64
col3    int64
col4    int64
test    int64
dtype: object

In [14]: df.select_dtypes(include=['int64'])
Out[14]: 
         col1  col2  col3  col4
0          42     0     3     0
1         418     0     0     1
2         163     0     1     1
3         435     0     1     0
4          52     0     2     1
5         141     0     3     1
...       ...   ...   ...   ...
1799994    87     0     2     1
1799995   243     0     1     0
1799996   211     0     1     0
1799997    29     0     1     1
1799998   311     0     0     0
1799999   405     0     3     0

[1800000 rows x 4 columns]

Comment From: zero-jack

@jreback df.select_dtypes(include=['int64']) does not return the test column, and if I rerun the same command, it returns only the index.
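One way to make this symptom concrete is to compare what select_dtypes returns against a mask built directly from `df.dtypes`. The sketch below uses a small made-up frame (the `label` column is hypothetical and is there only to give the mask something to exclude). It is a cross-check for diagnosis, not a fix, and it has not been run against the affected 0.20.3 setup.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": np.arange(6, dtype=np.int64),
    "col2": np.zeros(6, dtype=np.int64),
    "label": list("abcdef"),  # non-integer column, should be excluded
})
df["test"] = df["col1"] + df["col2"]

# Reference answer computed straight from the dtypes Series.
expected = list(df.columns[df.dtypes == "int64"])

# Run the same select_dtypes call twice, as in the report.
first = list(df.select_dtypes(include=["int64"]).columns)
second = list(df.select_dtypes(include=["int64"]).columns)

print(expected)
print(first == expected, second == expected)
```

A mismatch between `expected` and either call, or between the two calls, would show the reported behavior without having to read through the full frame output.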

Comment From: jreback

hmm, ok then this is the same issue as #17945

Comment From: gfyoung

In light of your comment, @jreback, closing.
