Location of the documentation

https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html#expression-evaluation-via-eval

Documentation problem

It seems that eval and query are no longer faster than boolean indexing in recent versions of pandas. See here for an example; my own experience is the same.
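
For concreteness, here is a minimal sketch of the kind of comparison I mean (the synthetic DataFrame and column name are made up, not the linked example):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"duration": rng.integers(0, 120, size=1_000_000)})

# In my runs the plain boolean mask is the faster of the two:
# %timeit df[df["duration"] > 60]
# %timeit df.query("duration > 60")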

Suggested fix for documentation

Remove this section from "Enhancing performance".

Comment From: Liam3851

It certainly can still be faster; it depends on the size of your DataFrame and the complexity of your query. For example, while I can reproduce your linked metrics, if I add one more column to the filter you tried, I get .query being faster:

In [31]: %timeit df2[(df2['Trip Duration Minutes'] > 60) & (df2.Year > 2015) & (df2['Checkout Date'] == '11/14/2019')]
56 ms ± 481 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [32]: %timeit df2.query("`Trip Duration Minutes` > 60 & Year > 2015 & `Checkout Date` == '11/14/2019'")
25 ms ± 102 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

As the doc currently says: "You should not use eval() for simple expressions or for expressions involving small DataFrames."

Maybe the latter point should be emphasized more. Right now the doc mentions the 10,000-20,000 row threshold and shows a graph to this effect, but the complexity matters a great deal, and while it is mentioned, it isn't demonstrated the way the row threshold is. A sketch of what such a demonstration could look like follows.
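
Roughly (synthetic data, placeholder column names): each & in plain boolean indexing allocates an intermediate boolean array, while numexpr evaluates the whole predicate in one fused pass, so query gains ground as conditions are added.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000_000
df = pd.DataFrame({
    "minutes": rng.integers(0, 180, size=n),
    "year": rng.integers(2010, 2020, size=n),
    "station": rng.integers(0, 50, size=n),
})

# One condition: boolean indexing typically wins.
# %timeit df[df["minutes"] > 60]
# %timeit df.query("minutes > 60")

# Three conditions: query tends to pull ahead.
# %timeit df[(df["minutes"] > 60) & (df["year"] > 2015) & (df["station"] == 7)]
# %timeit df.query("minutes > 60 & year > 2015 & station == 7")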

Comment From: tebeka

I stand corrected, thanks @Liam3851

Since I'm not the only one who missed that, maybe combine the first and second notes in the documentation so it'll be more "in your face" when you're reading it?

I'm OK if you want to close this issue.

Comment From: rhshadrach

Looking at the documentation, both the size and arithmetic-complexity aspects are mentioned three separate times "above the fold"; I'd hazard a guess that combining the notes won't have much impact here. That said, in the example you posted on Google Groups, the DataFrame had 1 million rows. From reading the docs, I can see how you might expect query to produce a speedup with numexpr there. In particular, this line:

A good rule of thumb is to only use eval() when you have a DataFrame with more than 10,000 rows.

might be amended to also mention having sufficiently many operations. I don't know what a good rule of thumb is here, but my expectation is that query and eval will always be slower for a single operation. I'm not entirely certain that this is correct.
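
If a fixed per-call cost is what dominates, one quick way to see it in isolation is to time a trivial expression, where parsing and dispatch are essentially all that happens (this is my guess at the mechanism, not something the docs state):

import pandas as pd

# pd.eval must tokenize, parse, and dispatch the expression string on
# every call; for a trivial expression that overhead is the whole cost.
# %timeit pd.eval("1 + 1")
# For comparison, the equivalent native expression:
# %timeit 1 + 1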

Comment From: jbrockmendel

Looks like DataFrame.eval (with numexpr installed) performs better both with more rows and with more complex expressions.

In [3]: arr = np.random.randint(0, 5, size=(10_000_000, 5))
In [4]: df = pd.DataFrame(arr, columns=list("ABCDE"))

In [5]: %timeit (df["A"] + df["B"]) * df["C"] + df["D"] - df["E"]
145 ms ± 1.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [6]: %timeit df.eval("(A+B)*C + D - E")
85.6 ms ± 6.98 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [13]: %timeit df["A"] + df["B"]
35.1 ms ± 759 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [14]: %timeit df.eval("A + B")
82.6 ms ± 921 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [15]: df2 = df.iloc[:100_000]

In [16]: %timeit (df2["A"] + df2["B"]) * df2["C"] + df2["D"] - df2["E"]
658 µs ± 7.42 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [17]: %timeit df2.eval("(A + B) * C + D - E")
1.69 ms ± 15.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
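
Since these timings hinge on numexpr actually being used, it may also be worth the docs noting how to check. pandas exposes the compute.use_numexpr option and an engine argument on eval/query (whether this belongs in the same doc section is a judgment call):

import pandas as pd

# True when numexpr is installed and enabled for eval/query:
print(pd.get_option("compute.use_numexpr"))

# The engine can also be selected per call:
# df.eval("(A + B) * C + D - E", engine="numexpr")
# df.eval("(A + B) * C + D - E", engine="python")  # pure-Python fallback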

PR improving the docs for this would be welcome.