Code Sample, a copy-pastable example if possible
import pandas as pd
import numpy as np
from time import time
import sys

# Build a 20-million-row frame with a random integer key and float columns.
df_data = pd.DataFrame(np.random.randint(0, int(1e6), int(20e6)), columns=['pop_id'])
df_data['PL_dB'] = 50 + np.random.random(df_data.shape[0]) * 100
df_data['Rx_dBm'] = 23 - df_data.PL_dB
df_data['noise_mW'] = (10.**(df_data.Rx_dBm / 10.)).astype('float32')

# Time the two-key sort (ascending pop_id, descending Rx_dBm) plus the index reset.
start = time()
df_data.sort_values(by=['pop_id', 'Rx_dBm'], ascending=[True, False], inplace=True)
df_data.reset_index(drop=True, inplace=True)

print("Sort took {:0.2f} seconds".format(time() - start))
print('Python version ' + sys.version)
print('pandas version ' + pd.__version__)
output of pd.show_versions()
For Python 2.7
INSTALLED VERSIONS
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 69 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 25.1.6
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.0
statsmodels: 0.6.1
xarray: 0.8.2
IPython: 5.1.0
sphinx: 1.4.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.4
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.40.0
pandas_datareader: None
For Python 3.5
INSTALLED VERSIONS
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 69 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.1
nose: None
pip: 8.1.2
setuptools: 25.1.6
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.0
statsmodels: None
xarray: 0.8.2
IPython: 5.1.0
sphinx: 1.4.1
patsy: None
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: None
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None
Results with Python 2.7
Sort took 40.91 seconds
Python version 2.7.12 |Anaconda custom (64-bit)| (default, Jun 29 2016, 11:07:13) [MSC v.1500 64 bit (AMD64)]
pandas version 0.18.1
Results with Python 3.5
Sort took 81.30 seconds
Python version 3.5.2 |Continuum Analytics, Inc.| (default, Jul 5 2016, 11:41:13) [MSC v.1900 64 bit (AMD64)]
pandas version 0.18.1
Comment From: jreback
This looks the same as the issue fixed by https://github.com/pydata/pandas/pull/13436. Could someone confirm?
Comment From: jreback
Note that using inplace is pretty non-idiomatic, as it promotes less readable and more error-prone code.
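For illustration, here is a minimal sketch of the non-inplace form of the sort from the report, written as a method chain. The frame is a much smaller stand-in for the original 20-million-row data, and the seed, row count, and the name `df_sorted` are assumptions made for this example only.

```python
import numpy as np
import pandas as pd

# Small hypothetical stand-in for the frame in the report (sizes reduced).
rng = np.random.RandomState(0)
df_data = pd.DataFrame({'pop_id': rng.randint(0, 100, 1000)})
df_data['PL_dB'] = 50 + rng.random_sample(len(df_data)) * 100
df_data['Rx_dBm'] = 23 - df_data.PL_dB

# Chained, non-inplace form: each call returns a new DataFrame,
# so df_data itself is left untouched.
df_sorted = (
    df_data
    .sort_values(by=['pop_id', 'Rx_dBm'], ascending=[True, False])
    .reset_index(drop=True)
)
```

The result should match what the two `inplace=True` calls produce, without mutating the original frame along the way.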
2.7
In [2]: import pandas as pd
...: import numpy as np
...: from time import time
...: import sys
...:
...: df_data = pd.DataFrame(np.random.randint(0,int(1e6),int(20e5)), columns=['pop_id'])
...: df_data['PL_dB'] = 50 + np.random.random(df_data.shape[0]) * 100
...: df_data['Rx_dBm'] = 23 - df_data.PL_dB
...: df_data['noise_mW'] = (10.**(df_data.Rx_dBm / 10.)).astype('float32')
In [3]: %timeit df_data.sort_values(by=['pop_id', 'Rx_dBm'], ascending=[True, False])
1 loop, best of 3: 1.86 s per loop
In [4]: pd.__version__
Out[4]: '0.18.1+403.ga0151a7'
In [5]: sys.version
Out[5]: '2.7.11 |Continuum Analytics, Inc.| (default, Dec 6 2015, 18:57:58) \n[GCC 4.2.1 (Apple Inc. build 5577)]'
3.5
In [2]: %timeit df_data.sort_values(by=['pop_id', 'Rx_dBm'], ascending=[True, False])
1 loop, best of 3: 1.76 s per loop
In [3]: pd.__version__
...:
Out[3]: '0.18.1+403.ga0151a7'
In [4]: sys.version
...:
Out[4]: '3.5.1 |Continuum Analytics, Inc.| (default, Dec 7 2015, 11:24:55) \n[GCC 4.2.1 (Apple Inc. build 5577)]'
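Outside IPython, a measurement similar to the `%timeit` calls above can be approximated with the standard-library `timeit` module, which makes it easier to run the same script under both interpreters. This is only a sketch: the reduced row count `n_rows` and the helper `sort_once` are assumptions for the example, not part of the original report, and the absolute timings will differ from the numbers quoted in this thread.

```python
import timeit

import numpy as np
import pandas as pd

n_rows = int(2e5)  # assumption: reduced size so the script finishes quickly

df_data = pd.DataFrame(np.random.randint(0, int(1e6), n_rows), columns=['pop_id'])
df_data['PL_dB'] = 50 + np.random.random(n_rows) * 100
df_data['Rx_dBm'] = 23 - df_data.PL_dB


def sort_once():
    # Non-inplace sort, so every repetition starts from the same unsorted frame.
    return df_data.sort_values(by=['pop_id', 'Rx_dBm'], ascending=[True, False])


# Three repetitions of a single call each; report the fastest, as %timeit does.
timings = timeit.repeat(sort_once, repeat=3, number=1)
best = min(timings)
print("best of 3: {:0.3f} s per loop".format(best))
```

Because the sort is not done in place, repeated runs do not end up timing an already-sorted frame.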