# data.csv contains 10000 rows x 10 columns of records, depicted as follows:
# 1,1,1,1,1,1,1,1,1,1
# 2,2,2,2,2,2,2,2,2,2
# ...
# 10000,10000,10000,10000,10000,10000,10000,10000,10000,10000
import pandas as pd
import numpy as np
import time

data = pd.read_csv('data.csv', header=None)  # the file has no header row
data_matrix = data.values  # .as_matrix() is deprecated; .values gives the underlying array

s = time.time()
for i in range(10000):
    v = data.iloc[i]
e = time.time()
print('Performance with DataFrame: ' + str(e - s))

s = time.time()
for i in range(10000):
    v = data_matrix[i]
e = time.time()
print('Performance with array:     ' + str(e - s))


# result:
# Performance with DataFrame: 3.964857816696167
# Performance with array:     0.015623092651367188

Problem description

As shown in the code above, locating an element by its index takes far longer in a DataFrame than in a raw array.

Common sense says it should take hardly any time to locate an element in a raw array or list, since the index of that element is given and there is no reason for a sequential search.

However, compared to the array, performance is much lower when retrieving a specific row from a DataFrame. Since the row index is already given, the DataFrame should be able to locate the row directly; could it be that the DataFrame actually performs a sequential search?

Are there any alternative methods that work like locating an element in an array?
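A minimal sketch of two alternatives (the toy 5x3 frame and column names are illustrative, not from the report): indexing the underlying NumPy array directly, or iterating with `itertuples()`. Both avoid constructing a pandas Series for every row, which is where most of the per-row cost goes.

```python
import pandas as pd

# Hypothetical small frame standing in for data.csv
df = pd.DataFrame({'a': range(5), 'b': range(5), 'c': range(5)})

# Option 1: convert to the underlying NumPy array once, then use
# plain O(1) NumPy indexing for each row.
arr = df.values
row = arr[3]  # -> array([3, 3, 3])

# Option 2: iterate with itertuples(), which yields lightweight
# namedtuples instead of building a Series per row.
rows = list(df.itertuples(index=False))
```

The one-time `.values` conversion is the pattern that matches the array timing in the benchmark above.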

Output of pd.show_versions()

INSTALLED VERSIONS
commit: None
python: 3.4.4.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 37 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.1
nose: None
pip: 9.0.1
setuptools: 32.0.0
Cython: None
numpy: 1.10.2
scipy: 0.16.1
statsmodels: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None

Comment From: discort

http://stackoverflow.com/a/16476974/3960038

Comment From: jreback

Once you account for the construction of a Series for each row, plus a small amount of overhead for the additional checking that .iloc does, they are about the same:

In [12]: %timeit Series(df.values[10])
10000 loops, best of 3: 62.8 µs per loop

In [13]: %timeit df.iloc[10]
10000 loops, best of 3: 74.4 µs per loop

In general, iterative looping is not recommended (in numpy or pandas) unless absolutely necessary.