Pandas DataFrame.query() could be vulnerable to SQL injections like attacks if the user doesn't know what they are doing

Currently DataFrame.query() takes a string to query a database in a way that's quite similar to how an SQL query reads data from a database. If a user writes code that uses pythons standard string replacement or that uses regex, the code is vunerable in a similar way to how SQL queries that are build with string replacement and regex are vulnerable.

Pythons sqlite3 writes in it's documentation:

# Never do this -- insecure!
symbol = 'RHAT'
c.execute("SELECT * FROM stocks WHERE symbol = '%s'" % symbol)

# Do this instead
t = ('RHAT',)
c.execute('SELECT * FROM stocks WHERE symbol=?', t)

I think it would be valuable to provide a similar API for DataFrame.query(). In most cases DataFrame.query() likely will be used in scientific computing context where injection attacks aren't a concern but sooner or later someone will write code with DataFrame.query() where injection attacks are a concern.

Comment From: chris-b1

There is already a good degree of sanitation to the inputs, in general you can't run arbitrary python, e.g. something like this:

In [17]: df.query('del a')
NotImplementedError                       Traceback (most recent call last)
<ipython-input-17-17a3d89bb7a9> in <module>()
----> 1 df.query('del a')
<snip>
NotImplementedError: 'Delete' nodes are not implemented

What would be an example "injection?"

Comment From: ChristianKleineidam

Example code:

import pandas as pd

ids = (1,2,3)
name = ("alice", "bob", "carol")
password = ("secret", "very secret", "uber-secret")
email_address = ("alice@gmail.com", "bob@gmail.com", "carol@gmail.com")
age = (21, 24, 23)

data = pd.DataFrame({
    "ids": ids, 
    "name" : name, 
    "password": password,
    "email_address": email_address,
    "age":  age})

def login_get_id(name:str, password:str):
    return data.query("name == '%s' and password == '%s'" %(name,password))[["ids"]]

login_get_id("alice", "secret")
login_get_id("alice", "X'  or name == 'alice")
login_get_id("X", "X'  or email_address == 'alice@gmail.com")

All three of those queries give the same result but a naive user of pandas could think that login_get_id() only returns an id of an user when it's given a correct password. The naive user also wouldn't think that login_get_id() allows accessing information about the email address.

Comment From: TomAugspurger

Maybe I'm missing something, but I don't see how this could cause some kind of injection. That just looks like a poorly defined function :)

You would need to show how something like DataFrame.query/eval(string) executes arbitrary code. Something like remove all the files in the current directory. It might be possible, but imports & call nodes aren't allowed in query strings (at the moment) so it'll be difficult.

Comment From: ChristianKleineidam

It doesn't execute arbitary code but, if an attacker can control the input he can access all the information in the dataframe.

If the user wants to know the age in this case he can also run queries like

login_get_id("X", "X'  or age>20 and name == 'alice")
login_get_id("X", "X'  or age>25 and name == 'alice")

Do you think that nobody is going to use DataFrame.query() with string replace with user submitted data, where they don't want the user to know all the data that's contained in a given DataFrame?

I think in common usage of the term SQL injection, it's called a SQL injection if a website loses it's user data to an attacker even if the attacker isn't able to alter any data.

Comment From: jreback

@ChristianKleineidam

I suppose you could add some docs but this is really out of scope for pandas

Comment From: alexmojaki

I've found a way to call exec and thus run arbitrary Python code. I think it's best that I don't post it here for anyone to see. Is there a procedure for reporting security issues here?

Comment From: TomAugspurger

https://github.com/pandas-dev/pandas/security/policy.

But I'll reiterate that we don't make any claims about eval being safe on untrusted input. This is the wrong layer for that.

Comment From: do-me

@alexmojaki: Any news on this? I think that's quite an important matter.
Does your exploit only apply to DataFrame.query() or the "normal" pandas indexing as well? E.g. df.loc[df['some_column'] == 'evil_string'].

In case you found some general measures to prevent it, it might be a good idea to share them here.

Might be a good idea to reopen this issue in case.

Comment From: alexmojaki

The exploit I had in mind was resolved within pandas. It only applied to query/eval, not indexing.

Comment From: do-me

Thanks for your immediate reply!

How about the query variable substitution, e.g.

var1 = "evil string"
df.query("col1 == @var1") # engine='python')

Is there room for your former exploit or similar (also with respect to different engines)?

Maybe this might lead to a general discussion about SQL-like parameter binding. I would suggest adding some side note in the documentation about security under DataFrame.query().

Comment From: alexmojaki

If var1 is an actual string, that's safe. If users can pass arbitrary objects with an overridden __eq__ then it might be a problem, I haven't checked.

@ already seems like the SQL-like parameter binding you want. The problem as described originally is if you use something like %s instead which makes it easy to inject code into the string passed to query.