I would like to propose a potentially useful feature: evaluating a regular expression over a pandas data frame. For users, it is similar to evaluating an arithmetic expression with pandas.eval, but using a regular expression instead. I have implemented a prototype in my personal repo. Since I am a newcomer to pandas (and to large open-source projects in general), a discussion and some brainstorming here would improve the quality of the implementation and hopefully make it a new pandas feature.
The following regular operations are supported: ; for concatenation, | for union, and *, +, ? for zero-or-more, one-or-more, and zero-or-one repetitions, respectively. Regular expressions are defined over the column names (or predicates over column names) of a pandas data frame. An example explains it best, so consider a data frame of Boolean values with columns named p1 and p2.
import numpy as np
import pandas as pd
from reelay.pandas import regexp # My prototype
d = {
    'p1': [False, True, False, True, False, False, True, True, True, True],
    'p2': [False, False, True, False, True, False, False, False, True, True]
}
df = pd.DataFrame(data=d)
Then we write a regular expression such as
expression = 'p1;p2;p1'
that describes a pattern in which p1 is True at a row, p2 is True at the next row, and p1 is True again at the row after that. When the pattern is matched, we return True for that row and False otherwise. So this is classical pattern matching over pandas data frames.
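To make these semantics concrete without the prototype, the match for 'p1;p2;p1' can be reproduced with plain pandas shifts. This is a hand-written sketch for this one fixed pattern (not the prototype's generated code), reporting a match at the row where the pattern ends:

```python
import pandas as pd

d = {
    'p1': [False, True, False, True, False, False, True, True, True, True],
    'p2': [False, False, True, False, True, False, False, False, True, True]
}
df = pd.DataFrame(data=d)

# 'p1;p2;p1': p1 two rows back, p2 one row back, and p1 at the current row
match = (df['p1'].shift(2, fill_value=False)
         & df['p2'].shift(1, fill_value=False)
         & df['p1'])

print(match.tolist())  # True only at rows 3 and 9
```

Of course, the point of the feature is that the library would generate this kind of computation automatically from the expression, including for patterns with |, *, +, and ? that cannot be unrolled into a fixed number of shifts.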
Optionally, you can give a name to the expression
named_expression = 'out = p1;p2;p1'
and evaluate it using regexp.eval as follows:
df = regexp.eval(df, named_expression, inplace=True)
The evaluation returns the data frame below.
p1 p2 out
0 False False False
1 True False False
2 False True False
3 True False True
4 False True False
5 False False False
6 True False False
7 True False False
8 True True False
9 True True True
Now you can check that out is True at rows 3 and 9, since segments 1-3 and 7-9 match the expression.
Moreover, you can define custom predicates over Boolean and numerical variables. For that, you need to pass your predicate as a local dictionary as follows.
def my_pred(a, b):
    return (a or b)
df = regexp.eval(df, 'my_pred(p1, p2); p2', inplace=True, local_dict={'my_pred' : my_pred})
This expression matches a sequence where p1 or p2 holds at a row and then p2 holds at the next row.
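For this particular expression, the result can again be sketched with plain pandas operations (hand-written for this pattern, not generated from the expression): evaluate the predicate elementwise, then require p2 at the following row:

```python
import pandas as pd

d = {
    'p1': [False, True, False, True, False, False, True, True, True, True],
    'p2': [False, False, True, False, True, False, False, False, True, True]
}
df = pd.DataFrame(data=d)

# 'my_pred(p1, p2); p2': (p1 or p2) at a row, then p2 at the next row;
# the match is reported at the row where the pattern ends
pred = df['p1'] | df['p2']
match = pred.shift(1, fill_value=False) & df['p2']
```

Here the Python predicate is replaced by its vectorized equivalent (elementwise |); the prototype presumably applies the user's scalar predicate row by row instead.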
The implementation uses dynamic Python code generation and compilation with respect to the given expression. I am not sure that this particular Pythonic implementation is the most efficient approach. My C++ tests show performance comparable to Google's RE2 engine, but that is another matter. I used the antlr4 parser generator to parse the expression. This is one reason why I didn't fork pandas and add this feature directly: it would create another dependency for the project, and I do not know the policy on such things. But I think it is a useful feature, so tell me your opinion!
Bonus: I can do the same thing for temporal logic specifications (e.g. LTL, if you are familiar with it), but regular expressions are surely my favorite.
Comment From: TomAugspurger
Seems interesting, but I think outside the scope of pandas. I don't think we have much of an appetite for DSLs beyond the little bit in .query.
An entry in http://pandas-docs.github.io/pandas-docs-travis/ecosystem.html would certainly be welcome.
Comment From: doganulus
I would hardly call regular expressions a DSL (they are not domain-specific), but you have a point that the ecosystem page (I didn't know about it before) is an easier option. Thanks!
Comment From: mroeschke
Thanks for the suggestion, but it seems this functionality hasn't gained much interest from the community or core team in a while, so closing.