Raaga Priya Madhan

Posted on Jun 12

Python Regex Explained Simply — Extract Anything From Text

#python #regex #beginners #datascience

Regex sounds intimidating. It is not. Once you understand 5 core patterns and 3 functions, you can pull the most common kinds of data (emails, numbers, IDs) out of text with a few lines of code. Here is what you need to get started.

What is regex?

Regex (short for regular expression) is a pattern language. You describe what you are looking for using special characters, and Python's built-in re module finds it for you in a block of text of any size.

Real example: your client sends you a document with 500 customer records mixed with random text. They need all email addresses extracted into Excel. Without regex this takes hours. With regex it takes 3 lines.

import re

text = "Contact john@gmail.com or sales@company.com for details"
emails = re.findall(r'[\w.-]+@[\w.-]+\.\w+', text)
print(emails)
# ['john@gmail.com', 'sales@company.com']
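To get from the list of matches to something a spreadsheet can open, one option is to write the matches out as CSV. The following is only a minimal sketch: the sample records and the `emails_to_csv` helper name are made up for illustration, and it assumes all the text fits in a single string.

```python
import csv
import io
import re

EMAIL_PATTERN = r'[\w.-]+@[\w.-]+\.\w+'  # same pattern as in the example above

def emails_to_csv(text):
    """Return CSV text with one unique email per row (hypothetical helper)."""
    emails = re.findall(EMAIL_PATTERN, text)
    unique = list(dict.fromkeys(emails))  # drop duplicates, keep first-seen order
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(['email'])  # header row
    for email in unique:
        writer.writerow([email])
    return buffer.getvalue()

# Invented sample data standing in for the client's document
sample = """Record 1: Asha, asha@example.com, Bangalore
Record 2: call Ravi at ravi.k@example.org
Record 3: Asha again, asha@example.com"""

csv_text = emails_to_csv(sample)
print(csv_text)
```

Saving `csv_text` to a file with a `.csv` extension gives a file that Excel can open directly. Note that this simple pattern is a practical approximation and does not validate every legal email address.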

The 5 patterns you need to know

1. \d — any digit

re.findall(r'\d', 'abc123def456')
# ['1', '2', '3', '4', '5', '6']

2. \w — any word character (letter, digit, underscore)

re.findall(r'\w+', 'hello world_123')
# ['hello', 'world_123']

3. + — one or more of the previous

re.findall(r'\d+', 'price is 45000 and tax is 8100')
# ['45000', '8100']

4. [] — any character in this set

re.findall(r'[aeiou]', 'hello world')
# ['e', 'o', 'o']

5. . — any single character (except a newline)

re.findall(r'c.t', 'cat cut cot bat')
# ['cat', 'cut', 'cot']
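These pieces are meant to be combined. As a small sketch (the invoice text and the ID format are invented for illustration), the first pattern below joins a character set, `+` and `\d`. The last two calls show that `.` skips newlines unless the `re.DOTALL` flag is passed.

```python
import re

text = "Invoices INV-1042 and INV-77 are paid; ref-9 is not."

# [A-Z]+ : one or more capital letters
# -      : a literal hyphen
# \d+    : one or more digits
invoice_ids = re.findall(r'[A-Z]+-\d+', text)
print(invoice_ids)  # ['INV-1042', 'INV-77']

# By default . matches any character except a newline
default_dot = re.findall(r'a.b', 'a-b a\nb')
print(default_dot)  # ['a-b']

# With re.DOTALL, . matches a newline as well
dotall_dot = re.findall(r'a.b', 'a-b a\nb', flags=re.DOTALL)
print(dotall_dot)  # ['a-b', 'a\nb']
```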

The 3 functions you will use constantly

re.findall — find all matches

Returns a list of everything that matches the pattern.

text = "Prices: ₹45,000 and ₹12,500 and ₹8,750"
prices = re.findall(r'[\d,]+', text)
print(prices)
# ['45,000', '12,500', '8,750']
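The matches come back as strings, so a common next step is converting them to numbers. A possible sketch, reusing the same sample text: putting part of the pattern in parentheses makes `re.findall` return only that captured part, and anchoring on the ₹ sign avoids a side effect of `[\d,]+`, which would also match a lone comma elsewhere in the text.

```python
import re

text = "Prices: ₹45,000 and ₹12,500 and ₹8,750"

# The parentheses capture only the digits and commas after the rupee sign,
# so findall returns that part without the ₹ itself.
raw_prices = re.findall(r'₹([\d,]+)', text)

# Strip the thousands separators and convert each match to an integer
amounts = [int(p.replace(',', '')) for p in raw_prices]
print(amounts)       # [45000, 12500, 8750]
print(sum(amounts))  # 66250
```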

re.sub — find and replace

Replaces every match with something else.

messy = "phone: 98-765-43210"
clean = re.sub(r'\D', '', messy)  # remove all non-digits
print(clean)
# 9876543210
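Two other `re.sub` uses come up often in cleaning work: collapsing runs of whitespace, and rearranging captured groups. The sketch below uses invented sample strings. In the replacement, `\1`, `\2` and `\3` refer back to the parenthesised groups in the pattern.

```python
import re

# Collapse any run of spaces or tabs into a single space, then trim the ends
messy_name = "  Asha   K \t Rao  "
tidy_name = re.sub(r'\s+', ' ', messy_name).strip()
print(tidy_name)  # Asha K Rao

# Reorder a day-month-year date into year-month-day using group references
date = "12-06-2024"
iso_date = re.sub(r'(\d+)-(\d+)-(\d+)', r'\3-\2-\1', date)
print(iso_date)  # 2024-06-12
```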

re.search — find first match

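Returns a match object for the first match, which holds the matched text and its position. If nothing matches it returns None, which is why the example checks `if match:` before using the result.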

text = "Order #A12345 placed successfully"
match = re.search(r'#(\w+)', text)
if match:
    print(match.group(1))  # A12345
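To see what the match object holds beyond `group(1)`, here is a short sketch on the same order text. The second string is invented to show the no-match case.

```python
import re

text = "Order #A12345 placed successfully"
match = re.search(r'#(\w+)', text)
if match:
    print(match.group(0))  # #A12345  (the whole match, including #)
    print(match.group(1))  # A12345   (only the part in parentheses)
    print(match.span())    # (6, 13)  (start and end index in the text)

# When the pattern is not found, re.search returns None
no_match = re.search(r'#(\w+)', "No order number here")
print(no_match)  # None
```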

A real data cleaning example

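Client problem: they have a spreadsheet with phone numbers in several different formats. They need them all standardised to 10 digits. Anything that cannot be reduced to 10 digits comes back as None so it can be checked by hand.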

import pandas as pd
import re

df = pd.DataFrame({
    'Phone': ['9876543210', '+91-9876543210', 
              '(080) 4567-8901', '91 98765 43210']
})

def clean_phone(phone):
    digits = re.sub(r'\D', '', phone)
    if len(digits) == 10:
        return digits
    elif len(digits) == 12 and digits.startswith('91'):
        return digits[2:]
    return None

df['Clean'] = df['Phone'].apply(clean_phone)
print(df)

Output:
             Phone       Clean
0       9876543210  9876543210
1   +91-9876543210  9876543210
2  (080) 4567-8901        None
3   91 98765 43210  9876543210
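The landline in row 2 has 11 digits once the punctuation is removed, so it falls through to None. One way to deal with such rows is to collect them separately for manual review instead of guessing. A minimal sketch in plain Python, without pandas: `clean_phone` is copied from the example above, while the `cleaned` and `needs_review` names are hypothetical.

```python
import re

def clean_phone(phone):
    """Same logic as the example above: keep 10 digits, drop a leading 91."""
    digits = re.sub(r'\D', '', phone)
    if len(digits) == 10:
        return digits
    if len(digits) == 12 and digits.startswith('91'):
        return digits[2:]
    return None

phones = ['9876543210', '+91-9876543210', '(080) 4567-8901', '91 98765 43210']

cleaned = []       # (original, standardised) pairs
needs_review = []  # originals that could not be standardised

for raw in phones:
    result = clean_phone(raw)
    if result is None:
        needs_review.append(raw)
    else:
        cleaned.append((raw, result))

print(len(cleaned))   # 3
print(needs_review)   # ['(080) 4567-8901']
```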

The one-line summary

Regex is a pattern language: you describe what you are looking for, and Python finds every instance of it in your text.

Learn these 5 patterns and 3 functions and you can handle most everyday data extraction and cleaning tasks.

Written by Raaga Priya Madhan — CSE student, Bangalore. I build Python automation and data extraction scripts. See my work on GitHub and connect on LinkedIn
