Feature Type

  • [ ] Adding new functionality to pandas

  • [X] Changing existing functionality in pandas

  • [ ] Removing existing functionality in pandas

Problem Description

pandas.Series.str.contains accepts the parameter regex, a boolean defaulting to True, which means the provided pattern is treated as a regular expression. It is useful that this behaviour can be turned off, for instance when matching a string that contains parentheses (which would otherwise be interpreted as regex groups).

Example:

pd.Series.str.match('AB(CD EF)') would be interpreted as a regex with groups

...same for pd.Series.str.fullmatch('AB(CD EF)'). This is not always wanted. For example, I would like to do a fullmatch on the literal string 'AB(CD EF)'.

It would work with: pd.Series.str.contains('AB(CD EF)', regex=False), but then of course the matching is less strict.
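To make the difference concrete, here is a minimal sketch. The sample values in the series are made up for illustration; only the pattern 'AB(CD EF)' comes from the description above.

```python
import pandas as pd

# Hypothetical sample data, chosen to show both failure modes.
s = pd.Series(["AB(CD EF)", "ABCD EF", "xxAB(CD EF)xx"])

# The parentheses are parsed as a regex group, so this pattern
# effectively matches the text "ABCD EF", not the literal string.
as_regex = s.str.fullmatch("AB(CD EF)")

# regex=False treats the pattern literally, but contains is only a
# substring test, so the padded third value matches as well.
as_literal_substring = s.str.contains("AB(CD EF)", regex=False)

print(as_regex.tolist())
print(as_literal_substring.tolist())
```

Neither call returns True for the first element alone, which is the behaviour I am after.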

Feature Description

I don't know if it's possible this way, but perhaps copy how pd.Series.str.contains handles regex=False in pandas/core/strings/accessor.py.

Alternative Solutions

An alternative might be to escape the relevant characters so they are not interpreted as regex control symbols. But this becomes impractical if the match string is not entered directly as a literal but comes from a variable whose contents are not fixed, or when it's used within a loop.

Additional Context

No response

Comment From: tomaarsen

Regarding your alternative solutions, you can rely on re.escape(...) to programmatically escape a string in preparation for an exact regex match.
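A minimal sketch of that approach, assuming the literal arrives in a variable (the names and sample values here are only illustrative):

```python
import re

import pandas as pd

# Hypothetical sample data.
s = pd.Series(["AB(CD EF)", "ABCD EF", "AB(CD EF) tail"])

literal = "AB(CD EF)"          # could come from user input or a loop
pattern = re.escape(literal)   # escapes the parentheses (and other specials)

# Whole-string match against the literal text.
exact = s.str.fullmatch(pattern)

# Match anchored at the start of the string only.
prefix = s.str.match(pattern)

print(exact.tolist())
print(prefix.tolist())
```

Because re.escape works on any string, the same call can be used inside a loop over changing match strings.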

A regex parameter does not make much sense on the match and fullmatch methods, which are clearly named after the regular expression functions re.match and re.fullmatch and exist to match a regular expression pattern.

Furthermore, I don't exactly see what functionality is missing currently. You are looking for an exact match of the series element, but that is also possible like so:

>>> series = pd.Series(list("abcdefghij"))
>>> series
0    a
1    b
2    c
3    d
4    e
5    f
6    g
7    h
8    i
9    j
dtype: object
>>> series == "c"
0    False
1    False
2     True
3    False
4    False
5    False
6    False
7    False
8    False
9    False
dtype: bool

This would be equivalent to your idea of pd.Series.str.fullmatch("c", regex=False). Furthermore, the following should be equivalent to your pd.Series.str.match("c", regex=False):

>>> series.str.startswith("c")
0    False
1    False
2     True
3    False
4    False
5    False
6    False
7    False
8    False
9    False
dtype: bool
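The same two checks applied to a literal with parentheses, as a rough sketch (the sample values are made up for illustration):

```python
import pandas as pd

# Hypothetical sample data.
s = pd.Series(["AB(CD EF)", "ABCD EF", "AB(CD EF) tail"])
literal = "AB(CD EF)"

# Plain equality: no regex involved, so the parentheses are just characters.
exact = s == literal

# startswith also treats its argument as a literal string.
prefix = s.str.startswith(literal)

print(exact.tolist())
print(prefix.tolist())
```

One difference that may be worth checking if the data has missing values: the comparison with == yields False for them, whereas the .str methods propagate the missing value.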

Or perhaps I'm misunderstanding.

Comment From: goeb86

I agree with @tomaarsen, unless I am missing something. I would just use pd.Series.str.find, or re.escape.

I'm not sure how strict the match needs to be, but either approach should work fine.
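A small sketch of the str.find option, with made-up sample values. str.find does a literal (non-regex) search and returns the lowest index of the substring, or -1 if it is absent:

```python
import pandas as pd

# Hypothetical sample data.
s = pd.Series(["AB(CD EF)", "ABCD EF", "xxAB(CD EF)xx"])
literal = "AB(CD EF)"

# Position of the literal in each element, -1 when not found.
positions = s.str.find(literal)

# Index 0 means the element starts with the literal.
starts_with_literal = positions == 0

print(positions.tolist())
print(starts_with_literal.tolist())
```

Comparing against 0 gives a prefix check; comparing against -1 gives a substring check.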

Comment From: robna

The idea was that people might not know that pd.Series.str.match and pd.Series.str.fullmatch are primarily intended for use with regex, and may assume they can apply them to non-regex strings the same way they can with pd.Series.str.contains. But for me, using re.escape is simple enough. I will close this issue; maybe it will help some others too.

Comment From: MMK-IBSEN

I ran into the same issue as @robna. It took a while to troubleshoot, but in the end I found out that the parentheses in my match string were the issue. If regex=False had been implemented in this function, it would have saved me a lot of time and worry.

The suggestion from @tomaarsen is of course a simpler way of achieving the same thing, and I will be using that method from now on. Lesson learned the hard way, I guess.