I would like to propose a potentially useful feature: evaluating a regular expression over a pandas data frame. For users, it is similar to evaluating an arithmetic expression with pandas.eval, but using a regular expression instead. I have implemented a prototype in my personal repo. Since I am a newcomer to pandas (and to large open-source projects in general), a discussion and some brainstorming here would improve the quality of the implementation and hopefully make it a new pandas feature.
The following regular operations are supported: ; for concatenation, | for union, and *, +, ? for zero-or-more, one-or-more, and zero-or-one repetitions, respectively. Regular expressions are defined over the column names (or predicates over column names) of a pandas data frame. An example explains it best, so consider a data frame of Boolean values with columns named p1 and p2.
import numpy as np
import pandas as pd
from reelay.pandas import regexp # My prototype
d = {
    'p1': [False, True, False, True, False, False, True, True, True, True],
    'p2': [False, False, True, False, True, False, False, False, True, True]
}
df = pd.DataFrame(data=d)
Then we write a regular expression such as
expression = 'p1;p2;p1'
that describes a pattern in which p1 is True at a row, p2 is True at the next row, and p1 is True again at the row after that. When the pattern is matched, we return True for that row and False otherwise. So this is classical pattern matching over pandas data frames.
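To make these semantics concrete without the prototype, the match for 'p1;p2;p1' can be reproduced with plain pandas shifts. This is a hand-written sketch for this one fixed pattern (not the prototype's generated code), reporting a match at the row where the pattern ends:

```python
import pandas as pd

d = {
    'p1': [False, True, False, True, False, False, True, True, True, True],
    'p2': [False, False, True, False, True, False, False, False, True, True]
}
df = pd.DataFrame(data=d)

# 'p1;p2;p1': p1 two rows back, p2 one row back, and p1 at the current row
match = (df['p1'].shift(2, fill_value=False)
         & df['p2'].shift(1, fill_value=False)
         & df['p1'])

print(match.tolist())  # True only at rows 3 and 9
```

Of course, the point of the feature is that the library would generate this kind of computation automatically from the expression, including for patterns with |, *, +, and ? that cannot be unrolled into a fixed number of shifts.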
Optionally, you can give a name to the expression
named_expression = 'out = p1;p2;p1'
and evaluate it using regexp.eval as follows:
df = regexp.eval(df, named_expression, inplace=True)
The evaluation returns the data frame below.
p1 p2 out
0 False False False
1 True False False
2 False True False
3 True False True
4 False True False
5 False False False
6 True False False
7 True False False
8 True True False
9 True True True
Now you can check that out is True at rows 3 and 9, since segments 1-3 and 7-9 match the expression.
Moreover, you can define custom predicates over Boolean and numerical variables. For that, you need to pass your predicate as a local dictionary as follows.
def my_pred(a, b):
    return (a or b)
df = regexp.eval(df, 'my_pred(p1, p2); p2', inplace=True, local_dict={'my_pred' : my_pred})
This expression matches a sequence where p1 or p2 holds at a row and then p2 holds at the next row.
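For this particular expression, the result can again be sketched with plain pandas operations (hand-written for this pattern, not generated from the expression): evaluate the predicate elementwise, then require p2 at the following row:

```python
import pandas as pd

d = {
    'p1': [False, True, False, True, False, False, True, True, True, True],
    'p2': [False, False, True, False, True, False, False, False, True, True]
}
df = pd.DataFrame(data=d)

# 'my_pred(p1, p2); p2': (p1 or p2) at a row, then p2 at the next row;
# the match is reported at the row where the pattern ends
pred = df['p1'] | df['p2']
match = pred.shift(1, fill_value=False) & df['p2']
```

Here the Python predicate is replaced by its vectorized equivalent (elementwise |); the prototype presumably applies the user's scalar predicate row by row instead.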
The implementation uses dynamic Python code generation and compilation with respect to the given expression. I am not sure that this particular Pythonic implementation is the most efficient approach. My C++ tests show performance comparable to Google's RE2 engine, but that is another matter. I used the antlr4 parser generator to parse the expression. This is one reason why I didn't fork pandas and add this feature directly: it would create another dependency for the project, and I do not know the policy on such things. But I think it is a useful feature, so tell me your opinion!
Bonus: I can do the same thing for temporal logic specifications (e.g. LTL, if you are familiar with it), but regular expressions are surely my favorite.
Comment From: TomAugspurger
Seems interesting, but I think outside the scope of pandas. I don't think we have much of an appetite for DSLs beyond the little bit in .query.
An entry in http://pandas-docs.github.io/pandas-docs-travis/ecosystem.html would certainly be welcome.
Comment From: doganulus
I would hardly call regular expressions a DSL (they are not domain-specific), but you have a point that the ecosystem page (I didn't know about it before) is an easier option. Thanks!
Comment From: mroeschke
Thanks for the suggestion, but it seems this functionality hasn't gained much interest from the community or core team in a while, so closing.