Context

I'm writing a text mining toolkit, and decided to try pandas because it offers convenient row-and-column labelled access. NumPy doesn't, and structured NumPy arrays only have named columns.

Performance Problem

Indexing text and manipulating dataframes of word co-occurrence/relevance appeared to be very slow, so I profiled my code using %time, cProfile and pprofile.

The following 2 blog posts detail this work:

http://makeyourowntextminingtoolkit.blogspot.co.uk/2016/10/profiling-indexing-relevance-and-co.html
http://makeyourowntextminingtoolkit.blogspot.co.uk/2016/10/profiling-indexing-relevance-and-co_16.html

Key Findings

It seems that some operations are really slow:

- growing a pandas dataframe using df.ix[a,b] = value ... 96% of the time in one profiled function is spent in indexing.py setitem
- performing cell calculations, e.g. df.ix[a,b] += something ... one example saw 60% of a function's time in an in-place increment
- performing calculations on rows, e.g. df.loc[a] = something ... 37% of a function's time is spent in df.loc[a] = df.loc[a] * factor
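To make the patterns above concrete, here is a minimal sketch (the word and document labels are invented for illustration) contrasting cell-at-a-time updates with a whole-frame equivalent; .at is used for scalar access since .ix is deprecated, but it goes through the same Python-level indexing machinery:

```python
import pandas as pd

words = ["cat", "dog", "fish"]
docs = ["d1", "d2"]

# Slow pattern: update the frame one cell at a time, each access
# passing through pandas' Python-level indexing code.
slow = pd.DataFrame(0.0, index=words, columns=docs)
for w in words:
    for d in docs:
        slow.at[w, d] += 1.0

# Vectorized equivalent: one whole-frame operation, no per-cell overhead.
fast = pd.DataFrame(0.0, index=words, columns=docs) + 1.0

assert slow.equals(fast)
```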

Solution?

I know the optimal use-case for pandas is working with whole frames and columns at a time ... but this performance seems pretty bad.

Does the pandas project consider it a priority to improve this?

My initial testing with pure numpy and also h5py (vs pandas' HDF5 support) seems to show massive improvements ... but at a loss of convenience.
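For comparison, a minimal sketch of what the pure-numpy equivalent of labelled access looks like (names here are illustrative): the label-to-position lookup dicts have to be maintained by hand, which is exactly the convenience that is lost:

```python
import numpy as np

words = ["cat", "dog", "fish"]
docs = ["d1", "d2"]

# Labels must be tracked by hand -- this bookkeeping is what pandas
# otherwise provides for free.
w_idx = {w: i for i, w in enumerate(words)}
d_idx = {d: j for j, d in enumerate(docs)}

counts = np.zeros((len(words), len(docs)))

# Scalar access on the ndarray happens at C speed.
counts[w_idx["dog"], d_idx["d2"]] += 1.0
```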

Comment From: myyc

(unrelated to the main point you're making, just FYI) having a quick look at the first post: you claim an 11000x speedup for regex vs join where it would be just 10x (5.1ms is 5100μs)...

Comment From: jreback

@makeyourowntextminingtoolkit pls show specific examples as well as a frame that replicates the structure.

Comment From: makeyourowntextminingtoolkit

thanks @myyc - fixed the blog

Comment From: makeyourowntextminingtoolkit

@jreback - not sure how to help here. The data is big, not something I can copy and paste here.

Anyway - I'm starting to rewrite using pure numpy, and getting an overall 20x performance boost: http://makeyourowntextminingtoolkit.blogspot.co.uk/2016/10/pandas-vs-numpy-performance.html

Comment From: jreback

I was pretty clear

Comment From: wesm

@makeyourowntextminingtoolkit you can provide a proxy for your use case by getting the dimensions and data types right

Comment From: shoyer

This is about what I would expect.

Growing DataFrames is extremely inefficient -- pandas builds on NumPy arrays, which means it does a complete copy of the DataFrame for each new row.
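A short sketch of the point above, together with the usual fix: collect plain Python rows first and construct the frame once at the end (the row contents are invented for illustration):

```python
import pandas as pd

rows = [{"word": "cat", "count": 3}, {"word": "dog", "count": 5}]

# Quadratic: each concat copies everything accumulated so far,
# because the underlying NumPy arrays cannot grow in place.
grown = pd.DataFrame()
for r in rows:
    grown = pd.concat([grown, pd.DataFrame([r])], ignore_index=True)

# Linear: accumulate plain dicts, build the frame in one shot.
built = pd.DataFrame(rows)

assert grown.equals(built)
```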

Likewise, indexing in pandas involves a fair amount of pure Python code. In NumPy, it's all done in C. This part is likely to be faster in pandas 2.0.
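One way to reach the C-speed path today while keeping labels is to drop to the underlying ndarray for the hot update and re-wrap the result afterwards; a minimal sketch (the labels and scale factor are illustrative), matching the df.loc[a] = df.loc[a] * factor pattern profiled earlier:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((3, 2)), index=["cat", "dog", "fish"],
                  columns=["d1", "d2"])

# Label-based row update goes through pandas' Python-level indexing:
#   df.loc["dog"] = df.loc["dog"] * 2.0
# The same update on the underlying ndarray stays in C:
arr = df.values
arr[df.index.get_loc("dog"), :] *= 2.0

# Re-wrap with the original labels (correct whether .values
# returned a view or a copy).
df = pd.DataFrame(arr, index=df.index, columns=df.columns)
```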

Comment From: jreback

@makeyourowntextminingtoolkit if you have specific examples pls post them.

Comment From: makeyourowntextminingtoolkit

I've written better code which gives me both good performance and the readability of pandas, versus messier numpy code.

http://makeyourowntextminingtoolkit.blogspot.co.uk/2016/10/fixed-faster-indexing-with-pandas.html
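One hybrid pattern along these lines (an illustrative sketch, not necessarily the approach in the linked post): accumulate co-occurrence counts in a plain Counter at C-speed hashing, then build the labelled DataFrame once at the end, so the heavy loop never touches pandas indexing:

```python
from collections import Counter

import pandas as pd

# Word-pair stream (invented for illustration).
pairs = [("cat", "dog"), ("cat", "dog"), ("dog", "fish")]

# Fast accumulation in a plain dict subclass.
counts = Counter(pairs)

# One-shot conversion: tuple keys become a MultiIndex, and unstack
# pivots the second word into labelled columns.
cooc = pd.Series(counts).unstack(fill_value=0)
```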