Pandas Filter based on lambda or some function

Code Sample, a copy-pastable example if possible

dataset[dataset.Column1 == "abc123"]  # this works fine
dataset.loc[dataset.Column1.startswith("abc")]  # this fails

Problem description

Traceback (most recent call last):
  File "", line 1, in
  File "/Library/Python/2.7/site-packages/pandas/core/generic.py", line 2970, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'startswith'

Overall intent: To use a filtering function or a lambda to select a row or reject it.

Expected Output

Just as == is defined on a Series to check each element, startswith and other functions such as re.match could also be defined to work element-wise.

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Darwin
OS-release: 16.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.20.1
pytest: None
pip: 9.0.1
setuptools: 18.5
Cython: None
numpy: 1.8.0rc1
scipy: 0.13.0b1
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 1.5
pytz: 2013.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 1.3.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None

Comment From: chris-b1

Please make this a reproducible example, thanks.

Comment From: jorisvandenbossche

See your error message: a pandas Series has no method startswith. There is however a .str.startswith() method, see http://pandas.pydata.org/pandas-docs/stable/text.html. So you can do: dataset.Column1.str.startswith("abc")

But maybe that's not your point, and you want to raise a more general issue. But then indeed, please make a reproducible example with what you want to work and the desired output.
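
For reference, a minimal sketch of the `.str.startswith` approach suggested above. The DataFrame contents are made up for illustration; only the column name `Column1` and the `"abc"` prefix come from the original post.

```python
import pandas as pd

# Hypothetical data; only the column name comes from the original post.
dataset = pd.DataFrame({"Column1": ["abc123", "abc456", "xyz789"]})

# A Series has no .startswith, but the .str accessor applies it
# element-wise and returns a boolean Series usable as a row mask.
mask = dataset.Column1.str.startswith("abc")
filtered = dataset.loc[mask]
print(filtered)
```

The mask can also be passed directly, as in `dataset[dataset.Column1.str.startswith("abc")]`.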

Comment From: amitavmohanty01

As per your suggestion, dataset.loc[dataset.Column1.str.startswith("abc")] works fine. How about considering dataset[re.search("^[A-Z]*", dataset.Column1.str).group(0) == "IAD"] for the more general case that I want to achieve?

Comment From: jorisvandenbossche

> How about considering dataset[re.search("^[A-Z]*", dataset.Column1.str).group(0) == "IAD"] for the more general case that I want to achieve?

re.search does not accept pandas Series objects, and that is not something we can change.

Closing this, as I don't think there is issue here for pandas. But feel free to further clarify if you disagree with that.

Comment From: amitavmohanty01

Thanks for the help and direction. I was able to use the apply function as follows to get close to what I want. dataset[dataset.Column1.apply(lambda x : re.search("^[A-Z]*", x).group(0)) == "xYz"]. However, dataset[dataset.Column1.apply(lambda x : re.search("^[A-Z]*", x).group(0)) == dataset.Column2.apply(lambda x : re.search("^[A-Z]*", x).group(0))] did not work out.
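
A rough sketch of the apply-based filter described above, written with a named helper instead of an inline lambda. The sample data and the helper name `leading_caps` are hypothetical, and the sketch assumes both columns hold plain strings with no missing values. The thread does not say why the two-column version failed, so this only shows one way it can be written under those assumptions.

```python
import re

import pandas as pd

# Hypothetical data; the thread does not show the real dataset.
dataset = pd.DataFrame({
    "Column1": ["IAD123", "SEA9", "JFK77"],
    "Column2": ["IAD-x", "LAX-y", "JFK-z"],
})


def leading_caps(value):
    """Return the run of capital letters at the start of a string.

    "^[A-Z]*" always matches (possibly the empty string), so
    .group(0) is safe as long as `value` is a str.
    """
    return re.search("^[A-Z]*", value).group(0)


# apply() runs the helper on each element and returns a new Series.
prefix1 = dataset.Column1.apply(leading_caps)
prefix2 = dataset.Column2.apply(leading_caps)

# Compare against a constant, or against another column row by row.
by_constant = dataset[prefix1 == "IAD"]
by_column = dataset[prefix1 == prefix2]
print(by_column)
```

Computing each prefix Series once and reusing it keeps the filter expression short and makes it easier to inspect the intermediate values when a comparison does not behave as expected.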

Comment From: jorisvandenbossche

Note that there is a Series.str.extract method, which basically does a re.search on each element (the pattern needs a capture group). So I think you should be able to do something like:

dataset[dataset.Column1.str.extract("^([A-Z]*)") == "xYz"]

instead of using apply.

Comment From: amitavmohanty01

The intent is to select rows that meet a criterion. The criterion is re(column1).group(n) == re(column2).group(m). Am I approaching the problem incorrectly?

Comment From: amitavmohanty01

same = dataset[dataset.Column1.str.extract("^([A-Z]*)") == dataset.Column2.str.extract("^([A-Z]*)")] this solved my purpose. Thanks for the help.
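
A self-contained sketch of this final approach, using made-up data. One assumption is added here that is not in the thread: `expand=False` is passed explicitly, because in recent pandas versions `str.extract` returns a DataFrame by default, whereas this filter needs a Series from each side.

```python
import pandas as pd

# Hypothetical data; the thread does not show the real dataset.
dataset = pd.DataFrame({
    "Column1": ["IAD123", "SEA9", "JFK77"],
    "Column2": ["IAD-x", "LAX-y", "JFK-z"],
})

# The parentheses form the capture group that str.extract requires.
# expand=False (an addition, see above) returns a Series per column.
left = dataset.Column1.str.extract("^([A-Z]*)", expand=False)
right = dataset.Column2.str.extract("^([A-Z]*)", expand=False)

# Element-wise comparison gives a boolean mask over the rows.
same = dataset[left == right]
print(same)
```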
