Pandas loc with regex
The loc indexer in Pandas selects rows from a DataFrame by label or by a boolean condition. Combined with regular expressions, it lets you filter rows by matching patterns in a string column.
Syntax
df.loc[df[column_name].str.contains(regex_pattern)]
Explanation
The loc indexer filters rows of the DataFrame based on a condition. The str.contains() method searches each value in a string column for a pattern.
column_name is the name of the column you want to apply the regex pattern to, and regex_pattern is the regular expression to match.
str.contains() returns a boolean Series indicating whether each element in the column contains a match for the pattern. By default the pattern is treated as a regular expression (regex=True) and matching is case-sensitive. Passing this boolean Series to loc keeps only the rows where the value is True.
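The two steps described above (build the boolean mask, then pass it to loc) can be sketched separately. The sample data and the pattern here are made up for illustration. The na=False argument is an addition: it is a standard str.contains() parameter that treats missing values as non-matches, which keeps the mask strictly boolean.

```python
import pandas as pd

# Hypothetical sample data; the None entry stands in for a missing value
df = pd.DataFrame({"Text": ["red apple", "green pear", None, "apple pie"]})

# Step 1: build the boolean mask.
# ^apple matches only values that start with "apple".
# na=False makes missing values count as False instead of NaN.
mask = df["Text"].str.contains(r"^apple", na=False)

# Step 2: pass the mask to loc to keep only the rows where it is True
result = df.loc[mask]

print(mask.tolist())  # prints [False, False, False, True]
print(result)
```

Without na=False, the mask would contain a missing value for the third row, and boolean indexing with such a mask typically fails, so setting it is a reasonable default when a column may contain gaps.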
Example
Let’s say we have a DataFrame called df with a column named "Text", and we want to select all the rows where the text contains the string "apple":
import pandas as pd
# Sample data
data = {'Text': ['I like apples', 'Only oranges here', 'Applesauce is great', 'I prefer grapes']}
df = pd.DataFrame(data)
# Apply the regex pattern to the "Text" column and filter with loc.
# str.contains() is case-sensitive by default, so "Applesauce" does not match 'apple'.
filtered_df = df.loc[df['Text'].str.contains('apple')]
print(filtered_df)
# Output:
#             Text
# 0  I like apples
In the example above, we first import the Pandas library and create a DataFrame called df with some sample data. We then apply the str.contains() method to the "Text" column, using df['Text'].str.contains('apple') as the condition.
This condition returns a boolean Series, where True indicates that the pattern 'apple' is found in the corresponding row. We pass this boolean Series to the loc indexer as df.loc[condition], which filters the DataFrame and returns only the rows where the condition is True.
The final result is a DataFrame called filtered_df that contains only the rows where the text contains the lowercase string "apple".
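Because str.contains() is case-sensitive and matches substrings, two common variations are worth a short sketch: ignoring case, and matching "apple" only as a whole word. The variable names any_case and whole_word are illustrative. The case parameter belongs to str.contains(), and \b is the standard regex word-boundary token.

```python
import pandas as pd

df = pd.DataFrame({'Text': ['I like apples', 'Only oranges here',
                            'Applesauce is great', 'I prefer grapes']})

# case=False ignores letter case, so "Applesauce" now matches too
any_case = df.loc[df['Text'].str.contains('apple', case=False)]

# \b marks a word boundary and "s?" allows an optional plural "s",
# so "apples" matches as a whole word but "Applesauce" does not
whole_word = df.loc[df['Text'].str.contains(r'\bapples?\b', case=False)]

print(any_case.index.tolist())    # prints [0, 2]
print(whole_word.index.tolist())  # prints [0]
```

Writing the pattern as a raw string (the r prefix) keeps Python from interpreting \b as a backspace character before the regex engine sees it.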