Regex is hard. But you only need 5 patterns.
After 3 years of building scrapers and data pipelines, I use the same 5 regex patterns in almost every project.
1. Extract Email Addresses
import re
text = 'Contact us at hello@company.com or support@company.co.uk'
emails = re.findall(r'[\w.+-]+@[\w-]+\.[\w.]+', text)
# ['hello@company.com', 'support@company.co.uk']
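In scraped text an address often ends a sentence, and because the domain part `[\w.]+` allows dots, the match would then keep the trailing period. Below is a minimal sketch of one way to tidy the results, assuming you want lowercased, de-duplicated addresses. `extract_emails` is a hypothetical helper name used only for illustration.

```python
import re

# Same pattern as above, compiled once for reuse.
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+\.[\w.]+')

def extract_emails(text):
    """Return unique, lowercased addresses in order of first appearance."""
    seen = []
    for match in EMAIL_RE.findall(text):
        # Assumption: a trailing '.' is sentence punctuation, not part of the address.
        email = match.rstrip('.').lower()
        if email not in seen:
            seen.append(email)
    return seen

print(extract_emails('Mail Hello@Company.com or hello@company.com.'))
# ['hello@company.com']
```

This pattern is a pragmatic filter for scraping, not a full address validator.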
2. Extract URLs
text = 'Visit https://example.com/page?q=test or http://api.site.io/v2'
urls = re.findall(r'https?://[\w.-]+(?:/[\w./?=&%-]*)?', text)
# ['https://example.com/page?q=test', 'http://api.site.io/v2']
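Once the URLs are out of the text, you usually want the host or the query parameters rather than the raw string. One possible approach, sketched here, is to let the regex do the finding and hand each match to the standard library's `urllib.parse` for the splitting. `extract_url_parts` is a hypothetical helper name.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Same pattern as above, compiled once for reuse.
URL_RE = re.compile(r'https?://[\w.-]+(?:/[\w./?=&%-]*)?')

def extract_url_parts(text):
    """Find URLs with the regex, then split each one with urllib."""
    parts = []
    for url in URL_RE.findall(text):
        split = urlsplit(url)
        parts.append({
            'host': split.netloc,
            'path': split.path,
            'query': parse_qs(split.query),  # dict of name -> list of values
        })
    return parts

for part in extract_url_parts('Visit https://example.com/page?q=test or http://api.site.io/v2'):
    print(part)
# {'host': 'example.com', 'path': '/page', 'query': {'q': ['test']}}
# {'host': 'api.site.io', 'path': '/v2', 'query': {}}
```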
3. Extract Numbers (including decimals and negatives)
text = 'Price: $29.99, discount: -5.50, items: 3'
numbers = re.findall(r'-?\d+(?:\.\d+)?', text)
# ['29.99', '-5.50', '3']
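`re.findall` returns strings, so the matches usually need converting before you can do arithmetic on them. The sketch below uses a pattern that requires digits after the decimal point, so a sentence-ending period is not swallowed into the number. `extract_floats` is a hypothetical helper name.

```python
import re

# Optional minus sign, digits, then an optional ".digits" fraction.
NUMBER_RE = re.compile(r'-?\d+(?:\.\d+)?')

def extract_floats(text):
    """Return every number found in the text as a float."""
    return [float(n) for n in NUMBER_RE.findall(text)]

print(extract_floats('Total: 3 items at 29.99, refund -5.50.'))
# [3.0, 29.99, -5.5]
```

Note that this does not handle thousands separators such as `1,299.00`; that would need its own pattern.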
4. Clean Whitespace (multiple spaces, tabs, newlines → single space)
text = 'Too many\n\n spaces here\t\ttabs'
clean = re.sub(r'\s+', ' ', text).strip()
# 'Too many spaces here tabs'
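Collapsing everything to single spaces also flattens paragraph breaks, which is not always what you want for article text. A possible variant, assuming paragraphs are separated by one or more blank lines, is sketched below. `clean_keep_paragraphs` is a hypothetical helper name.

```python
import re

def clean_keep_paragraphs(text):
    """Collapse whitespace inside each paragraph but keep paragraph breaks."""
    # Assumption: a blank line (possibly containing spaces) separates paragraphs.
    paragraphs = re.split(r'\n\s*\n', text)
    cleaned = [re.sub(r'\s+', ' ', p).strip() for p in paragraphs]
    # Drop paragraphs that were only whitespace.
    return '\n\n'.join(p for p in cleaned if p)

raw = 'First  line\tof text\ncontinues\n\n\nSecond   paragraph '
print(repr(clean_keep_paragraphs(raw)))
# 'First line of text continues\n\nSecond paragraph'
```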
5. Extract Content Between Tags/Delimiters
html = '<title>My Page Title</title>'
title = re.search(r'<title>(.*?)</title>', html).group(1)
# 'My Page Title'
# The same approach works for other simple delimiters, such as a quoted value:
json_str = '{"key": "value"}'
value = re.search(r'"key":\s*"(.*?)"', json_str).group(1)
# 'value'
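Two things tend to bite with this pattern on real pages: `re.search` returns `None` when nothing matches, so calling `.group(1)` directly raises `AttributeError`, and `.` does not match newlines unless `re.DOTALL` is set. Here is a minimal sketch that guards against both. `extract_between` is a hypothetical helper name, and stripping the surrounding whitespace is an added assumption about what you want back.

```python
import re

def extract_between(text, start, end):
    """Return the first substring between two literal delimiters, or None."""
    # re.escape makes the delimiters safe to embed in the pattern.
    pattern = re.escape(start) + r'(.*?)' + re.escape(end)
    # re.DOTALL lets '.' span line breaks inside the delimited content.
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1).strip() if match else None

print(extract_between('<title>\n  My Page Title\n</title>', '<title>', '</title>'))
# My Page Title
print(extract_between('<p>no title here</p>', '<title>', '</title>'))
# None
```

For anything beyond a single known tag or key, an HTML or JSON parser is usually the more robust choice.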
Cheat sheet
| Pattern | Matches |
|---|---|
| `\d+` | One or more digits |
| `\w+` | Word characters (letters, digits, _) |
| `\s+` | Whitespace |
| `.*?` | Anything (non-greedy) |
| `[^\s]+` | Non-whitespace |
| `(?:...)` | Non-capturing group |
| `(?P<name>...)` | Named group |
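The last two rows are the only ones not used in the patterns above, so here is a small sketch that combines them. It assumes a made-up line format such as `Widget: $29.99`; `PRICE_RE` and `parse_price` are hypothetical names.

```python
import re

# (?P<name>...) labels a capture; (?:...) groups alternatives without capturing.
PRICE_RE = re.compile(r'(?P<item>\w+):\s+(?:\$|USD\s)(?P<price>\d+\.\d+)')

def parse_price(line):
    """Return the item name and price from one line, or None if it doesn't match."""
    match = PRICE_RE.search(line)
    if match is None:
        return None
    return {'item': match.group('item'), 'price': float(match.group('price'))}

print(parse_price('Widget: $29.99'))
# {'item': 'Widget', 'price': 29.99}
print(parse_price('Gadget: USD 4.50'))
# {'item': 'Gadget', 'price': 4.5}
```

Named groups keep the code readable when a pattern grows, since `match.group('price')` survives reordering in a way that `match.group(2)` does not.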
What regex pattern do you use most?
I've built 77 web scrapers, and regex is my daily bread. Follow for more practical patterns.