I have had issues with DataFrame.query() producing erroneous results on machines running Windows 8.1, Windows Server 2012 R2 and Windows 10.

The issue can be reproduced with the code below, which creates a simple data frame, queries it repeatedly and checks the results. The issue appears to be intermittent: running the script multiple times shows the query failing on different tests. It also seems to depend on the data frame size: the issue does not seem to exist for data frames below ~1,000 rows, data frames with ~1,000 to 100,000 rows return the wrong results, and data frames with more than 100,000 rows commonly return no results.

import numpy as np
import pandas as pd

# create a data frame to test with
num_groups = 100
rows_per_group = 1000

# set the evaluation engine: True forces engine='python', False uses the default engine (numexpr)
use_python_engine = True

df = pd.DataFrame({
    'grp_col': np.arange(num_groups).repeat(rows_per_group),
    'val_col': np.random.rand(num_groups * rows_per_group)
    })

# check counts via group by
print(df.groupby('grp_col').size())

# look for problems in DataFrame.query
tests_per_group = 1000

print('-------------------------')
print(' expected rows per group: ' + str(rows_per_group))
print('-------------------------')

bad_result = []

for i in range(num_groups):

    print('group: ' + str(i))
    query_str = 'grp_col == ' + str(i)

    for j in range(tests_per_group):

        if use_python_engine:
            result = df.query(query_str, engine='python')
        else:
            result = df.query(query_str)
        min_grp = result.grp_col.min()
        max_grp = result.grp_col.max()

        # a correct result has exactly rows_per_group rows, all belonging to group i
        if len(result) != rows_per_group or i != min_grp or i != max_grp:
            bad_result.append(result)
            print(' bad query result on test #{}, num rows: {}'.format(j, len(result)))
            print(' min group: ' + str(min_grp))
            print(' max group: ' + str(max_grp))

I believe the problem lies with the numexpr evaluation. In the script above, switching the evaluation engine to python produces no errors. The script below also demonstrates that the issue occurs when using numexpr by itself:

import numpy as np
import numexpr as ne

# create an array to test with
num_items = 10000
arr = np.random.randint(0, 100, num_items)

# get a value to query from the 1st entry in the array
query_value = arr[0]
print('query value: ' + str(query_value))

# get the expected query result count using plain NumPy
count = len(arr[arr == query_value])
print('query result count: ' + str(count))
print('')

num_tests = 100

for i in range(num_tests):
    curr_test = len(arr[ne.evaluate("arr == " + str(query_value))])

    if curr_test != count:
        print("***Query problem:")
        print("   on test: " + str(i))
        print("   returned count: " + str(curr_test))

Comment From: jreback

dupe of #12023 (and a couple of others)

simply upgrade to numexpr 2.4.6
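
To confirm which numexpr version is installed before and after upgrading, a minimal sketch along these lines may help. The helper names `version_tuple` and `needs_upgrade` are hypothetical (not part of numexpr or pandas), the 2.4.6 threshold is taken from the comment above, and the comparison only looks at the leading digits of the first three version components.

```python
import re


def version_tuple(version):
    """Hypothetical helper: '2.4.6' -> (2, 4, 6), using leading digits only."""
    parts = []
    for piece in version.split('.')[:3]:
        match = re.match(r'\d+', piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


def needs_upgrade(installed, minimum='2.4.6'):
    """Hypothetical helper: True if `installed` is older than `minimum`."""
    return version_tuple(installed) < version_tuple(minimum)


try:
    import numexpr as ne
    installed = ne.__version__
except ImportError:
    # numexpr is optional for pandas; without it, query() cannot use that engine
    installed = None

if installed is None:
    print('numexpr is not installed')
elif needs_upgrade(installed):
    print('numexpr ' + installed + ' is older than 2.4.6, consider upgrading')
else:
    print('numexpr ' + installed + ' is 2.4.6 or newer')
```

If the check reports an older version, passing engine='python' to DataFrame.query(), as in the first script, is the interim workaround described in the original report.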

Comment From: bridwell

Awesome. Thanks.