Regex is one of the most useful text-processing tools available. It lets us find patterns rather than exact words or phrases in a text, and well-optimized regex engines are fast, too.
Yet the difficult part is defining a pattern. Experienced programmers may write one on the fly, but most developers have to spend time googling and reading through documentation.
And regardless of experience, everyone finds it hard to read a pattern someone else wrote.
This is the problem PRegEx solves.
PRegEx is a Python library that makes regex patterns more elegant and readable. It’s now one of my favorite libraries for cleaner Python code.
You can install it from the PyPI repository.
pip install pregex
If you're using Poetry to manage your dependencies:
poetry add pregex
Start writing more readable regex.
Here’s an example that shows how cool PRegEx is.
Extracting (US) zip codes from addresses is a common task. It’s not difficult if the addresses are standardized. Otherwise, we need some clever techniques to extract them.
United States zip codes are usually five-digit numbers. Some zip codes also have a four-digit extension separated by a hyphen.
For instance, 88310 is a postal code in New Mexico. Some prefer to add the geographic segment as an extension, as in 88310-7241.
Here’s the typical approach (using the re module) to find patterns of this kind.
import re
pattern = r"\d{5}(-\d{4})?"
address = "730 S White Sands Blvd, Alamogordo, NM 88310, United States"
zip_code = re.search(pattern, address).group()
print(zip_code)
# 88310
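One caveat with the snippet above: re.search returns None when nothing matches, so calling .group() on the result would raise an AttributeError. Here is a minimal sketch of a guarded version. The helper name find_zip_code is my own invention for illustration, not part of any library.

```python
import re

# Same pattern as above, compiled once so it can be reused.
ZIP_PATTERN = re.compile(r"\d{5}(-\d{4})?")


def find_zip_code(address):
    """Return the first zip code found in the address, or None if there is none."""
    match = ZIP_PATTERN.search(address)
    return match.group() if match else None


print(find_zip_code("730 S White Sands Blvd, Alamogordo, NM 88310, United States"))
# 88310
print(find_zip_code("No postal code here"))
# None
```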
The steps may seem straightforward. However, if you had to explain how you defined the pattern to a novice programmer, you’d have to give an hour-long lecture.
I’m not going to explain it either, because we have PRegEx. Here’s the PRegEx version.
# PRegEx 2.x import paths; on 1.x, drop the ".core" segment.
from pregex.core.classes import AnyDigit
from pregex.core.quantifiers import Exactly, Optional
pattern = Exactly(AnyDigit(), 5) + Optional("-" + Exactly(AnyDigit(), 4))
address1 = "730 S White Sands Blvd, Alamogordo, NM 88310, United States"
address2 = "730 S White Sands Blvd, Alamogordo, NM 88310-7421, United States"
pattern.get_matches(address1)
# ['88310']
pattern.get_matches(address2)
# ['88310-7421']
As you can see, this code is simple both to define and to understand.
The pattern has two segments. The first must have exactly five digits. The second is optional and, if present, consists of a hyphen followed by four digits.
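For comparison, the same two segments can be spelled out with the standard re module in verbose mode, where each piece gets its own comment. This is only a sketch of the raw-regex counterpart, not what PRegEx generates internally. The variable name zip_pattern and the shortened sample addresses are mine.

```python
import re

zip_pattern = re.compile(
    r"""
    \d{5}          # first segment: exactly five digits
    (?:-\d{4})?    # optional second segment: a hyphen and four digits
    """,
    re.VERBOSE,
)

# With no capturing group, findall returns the full matches.
print(zip_pattern.findall("Alamogordo, NM 88310, United States"))
# ['88310']
print(zip_pattern.findall("Alamogordo, NM 88310-7421, United States"))
# ['88310-7421']
```

The second segment is written as a non-capturing group so that findall returns whole zip codes rather than just the extension.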
Understand the submodules to create more exciting regex patterns.
Here we used two submodules of the PRegEx library: classes and quantifiers. The ‘classes’ submodule determines what to match, and the ‘quantifiers’ submodule specifies how many repetitions to match.
You could use other classes such as AnyButDigit to match non-digit characters or AnyLowercaseLetter to match lowercase letters. To create more complex regex patterns, you could also use other quantifiers such as OneOrMore, AtLeast, AtMost, or Indefinite.
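If you already know raw regex, it may help to see roughly what these names correspond to. The mapping below is my own assumption about their meaning, based on the class names. It is not output produced by PRegEx, and it uses only the standard re module. The helper fully_matches is hypothetical.

```python
import re

# Assumed raw-regex counterparts of the PRegEx names mentioned above.
equivalents = {
    "AnyButDigit()": r"\D",
    "AnyLowercaseLetter()": r"[a-z]",
    "OneOrMore(AnyDigit())": r"\d+",
    "Indefinite(AnyDigit())": r"\d*",
    "AtLeast(AnyDigit(), 2)": r"\d{2,}",
    "AtMost(AnyDigit(), 3)": r"\d{0,3}",
}


def fully_matches(raw_pattern, text):
    """Return True if the whole text matches the raw pattern."""
    return re.fullmatch(raw_pattern, text) is not None


print(fully_matches(equivalents["AtLeast(AnyDigit(), 2)"], "7"))     # False
print(fully_matches(equivalents["AtLeast(AnyDigit(), 2)"], "2023"))  # True
print(fully_matches(equivalents["AtMost(AnyDigit(), 3)"], "2023"))   # False
print(fully_matches(equivalents["Indefinite(AnyDigit())"], ""))      # True
```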
Here’s another example with more exciting matches. We need to find email addresses in a text. That’s simple. But besides matching the addresses, we also want to capture their domains.
# PRegEx 2.x import paths; on 1.x, drop the ".core" segment.
from pregex.core.classes import AnyButWhitespace
from pregex.core.groups import Capture
from pregex.core.quantifiers import OneOrMore, AtLeastAtMost
pattern = (
OneOrMore(AnyButWhitespace())
+ "@"
+ Capture(
OneOrMore(AnyButWhitespace()) + "." + AtLeastAtMost(AnyButWhitespace(), 2, 3)
)
)
text = """My name is Alice. I live in Wonderland. You can mail me: alice@wonderland.com.
In case I can't reply, please mail my friend the White Rabbit: whiterabbit@wonderland.com.
But for more serious issues, you should mail Tony Stark at tony@stark.org.
"""
# Get everything you captured.
pattern.get_captures(text)
# [('wonderland.com',), ('wonderland.com',), ('stark.org',)]
# Get all your matches.
pattern.get_matches(text)
# ['alice@wonderland.com', 'whiterabbit@wonderland.com', 'tony@stark.org']
We’ve used the Capture class from the ‘groups’ submodule in the above example. It lets you collect segments within a match so that you don’t have to do any post-processing to extract them.
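To see what capturing buys you, here is a rough standard-library counterpart using a regex capture group. It is a sketch with my own variable names and a shortened sample text, not the pattern PRegEx compiles to.

```python
import re

# The parentheses form a capture group around the domain part.
email_pattern = re.compile(r"\S+@(\S+\.\S{2,3})")

text = "You can mail me: alice@wonderland.com. For serious issues, try tony@stark.org."

# With exactly one group, findall returns the captured text of each match.
domains = email_pattern.findall(text)
print(domains)
# ['wonderland.com', 'stark.org']

# group(0) gives the whole match instead of the captured segment.
emails = [match.group(0) for match in email_pattern.finditer(text)]
print(emails)
# ['alice@wonderland.com', 'tony@stark.org']
```

Note one difference in shape: re.findall returns plain strings when there is a single group, while the PRegEx output shown above wraps each capture in a tuple.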
Another submodule you’ll often need is ‘operators’. It helps you concatenate patterns or choose between a set of alternatives.
Here’s a slightly modified version of the same example above.
# PRegEx 2.x import paths; on 1.x, drop the ".core" segment.
from pregex.core.classes import AnyButWhitespace
from pregex.core.groups import Capture
from pregex.core.operators import Either
from pregex.core.quantifiers import OneOrMore
pattern = (
OneOrMore(AnyButWhitespace())
+ "@"
+ Capture(OneOrMore(AnyButWhitespace()) + Either(".com", ".org"))
)
text = """My name is Alice. I live in Wonderland. You can mail me: alice@wonderland.com.
In case I can't reply, please mail my friend the White Rabbit: whiterabbit@wonderland.com.
But for more serious issues, you should mail Tony Stark at tony@stark.org.
Please don't message thanos@wierdland.err
"""
pattern.get_captures(text)
# [('wonderland.com',), ('wonderland.com',), ('stark.org',)]
In the above example, we’ve restricted the top-level domain to either ‘.com’ or ‘.org’. We’ve used the ‘Either’ class from the ‘operators’ submodule to build this pattern. As you can see, it didn’t match thanos@wierdland.err because its top-level domain is ‘.err’, not ‘.com’ or ‘.org’.
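In raw regex, this kind of choice is written as an alternation with the | operator. Here is a minimal standard-library sketch of the same idea. The variable names and the shortened sample text are my own.

```python
import re

# (?:\.com|\.org) is a non-capturing alternation: either ".com" or ".org".
restricted_pattern = re.compile(r"\S+@(\S+(?:\.com|\.org))")

text = "Write to alice@wonderland.com or tony@stark.org, but not thanos@wierdland.err"

print(restricted_pattern.findall(text))
# ['wonderland.com', 'stark.org']
```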
Final thoughts
Defining regex may not be a massive task for experienced developers. But even for them, reading and understanding a pattern created by someone else is difficult. For beginners, both can be daunting.
Besides, regex is an excellent tool for text mining. Any developer or data scientist will almost certainly come across regex usage.
If you’re a Python programmer, PRegEx has the complex parts covered.