# data.csv contains a 10000 x 10 table of records, depicted as follows:
# 1,1,1,1,1,1,1,1,1,1
# 2,2,2,2,2,2,2,2,2,2
# ...
# 10000,10000,10000,10000,10000,10000,10000,10000,10000,10000
import pandas as pd
import numpy as np
import time
data = pd.read_csv('data.csv')
data_matrix = data.to_numpy()  # .as_matrix() was removed from pandas; .to_numpy() is the replacement

s = time.time()
for i in range(10000):
    v = data.iloc[i]
e = time.time()
print('Performance with DataFrame: ' + str(e - s))

s = time.time()
for i in range(10000):
    v = data_matrix[i]
e = time.time()
print('Performance with Array: ' + str(e - s))
# Result:
# Performance with DataFrame: 3.964857816696167
# Performance with Array: 0.015623092651367188
Problem description
As shown in the code above, locating an element by index takes far longer in a DataFrame than in a raw array.
Common sense says that locating an element in a raw array or list takes almost no time when its index is known, since there is no reason to perform a sequential search.
Compared with the array, however, retrieving a specific row from a DataFrame is much slower. The row index is already given, so the DataFrame should be able to locate the row directly. Does the DataFrame actually perform a sequential search?
Are there any alternative methods that work like indexing into an array?
Output of pd.show_versions()
Comment From: discort
http://stackoverflow.com/a/16476974/3960038
Comment From: jreback
Once you account for the construction of a Series for each row, plus a small amount of overhead for the additional checking that .iloc does, they are about the same:
In [12]: %timeit Series(df.values[10])
10000 loops, best of 3: 62.8 µs per loop
In [13]: %timeit df.iloc[10]
10000 loops, best of 3: 74.4 µs per loop
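To see that the comparison above is apples-to-apples, here is a quick sketch (using a small made-up frame, not the issue's data.csv) checking that the two timed expressions return the same row values:

```python
import numpy as np
import pandas as pd

# Small frame with default integer column labels (an illustrative
# assumption; any numeric frame behaves the same way).
df = pd.DataFrame(np.arange(50).reshape(10, 5))

# .iloc builds a Series indexed by the frame's columns, while
# wrapping the raw row gives a Series with a default RangeIndex.
via_iloc = df.iloc[3]
via_values = pd.Series(df.values[3])

# The underlying row values are identical; only the index labels
# can differ in general.
same_values = bool((via_iloc.to_numpy() == via_values.to_numpy()).all())
```

So the extra cost of `.iloc` is bookkeeping (Series construction and index alignment), not a different lookup algorithm.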
In general, iterative looping is not recommended (in numpy or pandas) unless absolutely necessary.
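A minimal sketch of the recommended pattern, using synthetic data in place of data.csv (the row layout below is an assumption matching the description at the top of the issue): extract the ndarray once, then index or operate on it directly rather than looping with .iloc.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for data.csv: 10000 rows x 10 columns,
# where row i holds the value i+1 in every column.
df = pd.DataFrame(np.tile(np.arange(1, 10001).reshape(-1, 1), (1, 10)))

# Extract the underlying ndarray once; indexing it afterwards skips
# the per-row Series construction that .iloc performs on every call.
matrix = df.to_numpy()
row = matrix[10]  # plain ndarray row, no Series built

# Better still, vectorized operations replace the Python loop
# entirely, e.g. summing every column in one call:
col_sums = matrix.sum(axis=0)
```

The one-time `to_numpy()` conversion is exactly what the benchmark's `data_matrix` does; the point is to pay that cost once instead of paying Series-construction overhead 10000 times.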