Context
I'm writing a text mining toolkit and decided to try pandas because it offers convenient row-and-column labelled access. NumPy doesn't, and structured NumPy arrays only have named columns.
Performance Problem
Indexing text, and manipulating dataframes of word co-occurrence/relevance appeared to be very slow, so I profiled my code using %time, cProfile and pprofile.
The following two blog posts detail this work:
http://makeyourowntextminingtoolkit.blogspot.co.uk/2016/10/profiling-indexing-relevance-and-co.html
http://makeyourowntextminingtoolkit.blogspot.co.uk/2016/10/profiling-indexing-relevance-and-co_16.html
Key Findings
It seems that some operations are really slow:

- growing a pandas dataframe using df.ix[a,b] = value ... 96% of a profiled function's time is in indexing.py setitem
- performing cell calculations, e.g. df.ix[a,b] += something ... one example saw 60% of a function's time spent doing an in-place increment!
- performing calculations on rows, e.g. df.loc[a] = something ... 37% of a function is spent doing df.loc[a] = df.loc[a] * factor
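A minimal sketch of the pattern behind these findings (not the author's actual code; the 50-word vocabulary is made up for illustration, and .at is used as the modern scalar equivalent of the long-deprecated .ix). Cell-at-a-time updates pay pandas' per-call indexing overhead on every cell, while doing the arithmetic on one NumPy block and labelling it once at the end pays it never:

```python
import numpy as np
import pandas as pd

words = ["w%d" % i for i in range(50)]

# Slow pattern from the findings above: update a labelled frame one
# cell at a time, hitting the indexing machinery on every call.
df_slow = pd.DataFrame(0.0, index=words, columns=words)
for a in words:
    for b in words:
        df_slow.at[a, b] += 1.0

# Faster pattern: do the arithmetic on a whole NumPy array, then
# wrap it in a labelled DataFrame once at the end.
counts = np.zeros((len(words), len(words)))
counts += 1.0  # one vectorised operation instead of 2500 scalar ones
df_fast = pd.DataFrame(counts, index=words, columns=words)
```

Both frames end up identical; only the number of trips through pandas' indexing code differs.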
Solution?
I know the optimal use case for pandas is working with whole arrays and columns at a time ... but this performance seems pretty bad.
Does the pandas project consider it a priority to improve this?
My initial testing with pure numpy and also h5py (vs pandas' HDF5 support) seems to show massive improvements ... but at a loss of convenience.
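The gap described here is easy to reproduce with a small timing sketch (hypothetical shapes and factors, not the toolkit's real data): scaling one row through df.loc goes through pandas' label-based indexing machinery, while the equivalent NumPy row assignment is a single C-level operation.

```python
import timeit

import numpy as np
import pandas as pd

n = 200
arr = np.random.rand(n, n)
df = pd.DataFrame(arr.copy())


def scale_row_pandas():
    # The slow pattern from the findings: df.loc[a] = df.loc[a] * factor
    df.loc[0] = df.loc[0] * 0.5


def scale_row_numpy():
    # The same row update done directly on the NumPy array
    arr[0] = arr[0] * 0.5


t_pandas = timeit.timeit(scale_row_pandas, number=1000)
t_numpy = timeit.timeit(scale_row_numpy, number=1000)
print("pandas row update: %.4fs, numpy row update: %.4fs" % (t_pandas, t_numpy))
```

The exact ratio depends on the machine and pandas version, but the NumPy version is reliably far faster, consistent with the ~20x overall boost reported below.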
Comment From: myyc
(unrelated to the main point you're making, just FYI) having a quick look at the first post: you report an 11000x speedup in the regex vs join comparison where it should be just 10x (5.1ms is 5100μs)...
Comment From: jreback
@makeyourowntextminingtoolkit pls show specific examples as well as a frame that replicates the structure.
Comment From: makeyourowntextminingtoolkit
thanks @myyc - fixed the blog
Comment From: makeyourowntextminingtoolkit
@jreback - not sure how to help here. The data is big, not something I can copy and paste here.
Anyway - I'm starting to rewrite using pure numpy and getting an overall 20x performance boost: http://makeyourowntextminingtoolkit.blogspot.co.uk/2016/10/pandas-vs-numpy-performance.html
Comment From: jreback
I was pretty clear
Comment From: wesm
@makeyourowntextminingtoolkit you can provide a proxy for your use case by getting the dimensions and data types right
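Such a proxy could look like the sketch below: a random word-by-word frame with the same shape and dtype as the real (unshareable) data. The vocabulary size and labels here are placeholders, not the toolkit's actual dimensions:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real co-occurrence data: same
# structure (square, string-labelled, float64), random contents.
rng = np.random.default_rng(0)
n_words = 1000
words = ["word_%d" % i for i in range(n_words)]
proxy = pd.DataFrame(rng.random((n_words, n_words)),
                     index=words, columns=words)
```

Anyone can then run the slow operations against this frame and reproduce the profile without access to the original corpus.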
Comment From: shoyer
This is about what I would expect.
Growing DataFrames is extremely inefficient -- pandas builds on NumPy arrays, which means it does a complete copy of the DataFrame for each new row.
Likewise, indexing in pandas involves a fair amount of pure Python code. In NumPy, it's all done in C. This part is likely to be faster in pandas 2.0.
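The growth cost described above suggests a simple workaround (a generic sketch, not code from this issue): accumulate rows in plain Python structures and build the DataFrame once at the end, instead of enlarging it row by row.

```python
import pandas as pd

# Slow: enlarging the frame one row at a time; each enlargement can
# reallocate and copy the underlying arrays.
df = pd.DataFrame(columns=["word", "count"])
for i in range(100):
    df.loc[len(df)] = ["word_%d" % i, i]

# Fast: collect rows in a plain list, construct the DataFrame once.
rows = [{"word": "word_%d" % i, "count": i} for i in range(100)]
df2 = pd.DataFrame(rows)
```

For 100 rows the difference is negligible; for large co-occurrence matrices built cell by cell, the repeated copying dominates, which matches the profiles above.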
Comment From: jreback
@makeyourowntextminingtoolkit if you have specific examples pls post them.
Comment From: makeyourowntextminingtoolkit
I've written better code which gives me both good performance and also the readability of pandas vs messier numpy code.
http://makeyourowntextminingtoolkit.blogspot.co.uk/2016/10/fixed-faster-indexing-with-pandas.html