Problem description and code example
This is similar to the issue I mentioned earlier. In this case, however, `select_dtypes` returns nothing for a column newly assigned by addition or subtraction on a large DataFrame, even though I do not use eval.
It works for a small DataFrame. Details below:
In [12]: df = pd.DataFrame({'col1': np.random.randint(500, size=1800000),
...: 'col2': np.random.randint(1, size=1800000),
...: 'col3':np.random.randint(4, size=1800000),
...: 'col4':np.random.randint(2, size=1800000),
...: })
...:
...: df['test'] = df['col1'] + df['col2'] + df['col3'] + df['col4']
...:
# `df['test'] = df['col1'] - df['col2'] - df['col3'] - df['col4']`, same for subtraction.
In [13]: df.select_dtypes(include=['int64'])
Out[13]:
col1 col2 col3 col4
0 453 0 2 1
1 74 0 1 0
2 468 0 0 0
3 82 0 3 1
4 288 0 2 1
5 474 0 0 0
6 167 0 2 1
7 182 0 0 1
8 26 0 3 1
(Remaining rows omitted because the output is long.)
In [14]: df.select_dtypes(include=['int64'])
Out[14]:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...]
[1800000 rows x 0 columns]
Note: I do not show a small DataFrame, such as the iris data from sklearn,
because that case is easy to reproduce.
Expected Output
select_dtypes(include=['int64']) should return the newly assigned column along with the original ones, no matter how many times it is run.
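The expectation above can be written as a small check. This is only a sketch: the helper name `int64_columns` is hypothetical, the frame is much smaller than the 1,800,000 rows in the report, and the explicit `dtype=np.int64` is an assumption added so that the columns are int64 on every platform. It states the expected behavior rather than reproducing the failure.

```python
import numpy as np
import pandas as pd


def int64_columns(df):
    # Hypothetical helper: names of the columns select_dtypes reports as int64.
    return list(df.select_dtypes(include=["int64"]).columns)


n = 10_000  # far smaller than the 1,800,000 rows in the report
df = pd.DataFrame({
    "col1": np.random.randint(500, size=n, dtype=np.int64),
    "col3": np.random.randint(4, size=n, dtype=np.int64),
})
df["test"] = df["col1"] + df["col3"]

# Call select_dtypes several times; every call should list the same columns,
# including the newly assigned "test".
results = [int64_columns(df) for _ in range(3)]
print(results)
```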
Output of pd.show_versions()
Comment From: jreback
Show df.dtypes.
It's possible you are starting with int32s (though in 0.20.3 this should only happen on Windows, where the native int dtype is int32).
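A minimal sketch of that check, assuming a small frame for brevity. It prints the dtype that `np.random.randint` produces by default on the current platform, then selects on `np.integer`, which matches both int32 and int64 columns. The explicit `dtype=np.int64` on `col2` is an addition for illustration, not part of the original report.

```python
import numpy as np
import pandas as pd

n = 1_000  # small size for illustration only

# Default integer dtype from randint depends on the platform and NumPy version.
print(np.random.randint(4, size=n).dtype)

df = pd.DataFrame({
    "col1": np.random.randint(500, size=n),                  # platform default
    "col2": np.random.randint(1, size=n, dtype=np.int64),    # forced to int64
})
df["test"] = df["col1"] + df["col2"]

# Inspect what each column actually holds.
print(df.dtypes)

# np.integer matches every integer width, so this does not depend on
# whether the platform default is int32 or int64.
ints = df.select_dtypes(include=[np.integer])
print(list(ints.columns))
```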
Comment From: jreback
looks fine on macosx on master, on 0.20.3 on 3.5 and 3.6
In [9]: pd.options.display.max_rows=12
In [10]: df = pd.DataFrame({'col1': np.random.randint(500, size=1800000),
...: 'col2': np.random.randint(1, size=1800000),
...: 'col3':np.random.randint(4, size=1800000),
...: 'col4':np.random.randint(2, size=1800000),
...: })
...:
...: df['test'] = df['col1'] + df['col2'] + df['col3'] + df['col4']
...:
In [11]: df
Out[11]:
col1 col2 col3 col4 test
0 42 0 3 0 45
1 418 0 0 1 419
2 163 0 1 1 165
3 435 0 1 0 436
4 52 0 2 1 55
5 141 0 3 1 145
... ... ... ... ... ...
1799994 87 0 2 1 90
1799995 243 0 1 0 244
1799996 211 0 1 0 212
1799997 29 0 1 1 31
1799998 311 0 0 0 311
1799999 405 0 3 0 408
[1800000 rows x 5 columns]
In [13]: df.dtypes
Out[13]:
col1 int64
col2 int64
col3 int64
col4 int64
test int64
dtype: object
In [14]: df.select_dtypes(include=['int64'])
Out[14]:
col1 col2 col3 col4
0 42 0 3 0
1 418 0 0 1
2 163 0 1 1
3 435 0 1 0
4 52 0 2 1
5 141 0 3 1
... ... ... ... ...
1799994 87 0 2 1
1799995 243 0 1 0
1799996 211 0 1 0
1799997 29 0 1 1
1799998 311 0 0 0
1799999 405 0 3 0
[1800000 rows x 4 columns]
Comment From: zero-jack
@jreback df.select_dtypes(include=['int64']) does not show the test column,
and if I rerun the same command, it returns only the index.
Comment From: jreback
hmm, ok then this is the same issue as #17945
Comment From: gfyoung
In light of your comment, @jreback, closing.