DEV Community

DoriDoro

Python regular expression methods re.match() and re.sub()

Introduction

Let's go over two functions from Python's re module, re.sub() and re.match(), with examples.

1. re.sub():

The re.sub() function is used for substituting occurrences of a pattern in a string. It takes three main arguments:

  • The pattern you want to replace (a regular expression).
  • The replacement string (what you want to replace it with).
  • The original string in which you want to replace the occurrences of the pattern.

Syntax:

re.sub(pattern, replacement, string, count=0, flags=0)
  • pattern: The regex pattern to search for.
  • replacement: The string that replaces each match (it can also be a function that is called for each match).
  • string: The input string where the replacement will occur.
  • count: (Optional) The maximum number of replacements. The default, 0, replaces all occurrences.
  • flags: (Optional) Allows modification of matching behavior (like case-insensitivity).
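
To see the two optional parameters in action, here is a minimal sketch. The sample sentence and variable names are made up for illustration; count and flags are passed by keyword so they cannot be mixed up.

```python
import re

# Hypothetical sample text with the same word in three different casings.
text = "Error: disk full. error: retry. ERROR: abort."

# count=1 stops after the first replacement; re.IGNORECASE ignores letter case.
first_only = re.sub(r"error", "warning", text, count=1, flags=re.IGNORECASE)

# The default count=0 replaces every match.
all_matches = re.sub(r"error", "warning", text, flags=re.IGNORECASE)

print(first_only)   # warning: disk full. error: retry. ERROR: abort.
print(all_matches)  # warning: disk full. warning: retry. warning: abort.
```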

Example:

Let's replace every run of digits in a string with the word NUM.

import re

text = "The price is 123 dollars and 45 cents."
new_text = re.sub(r'\d+', 'NUM', text)

print(new_text)

Output:

The price is NUM dollars and NUM cents.

Here, \d+ is the regex pattern that matches one or more digits. The re.sub() function replaces all occurrences of this pattern with the string 'NUM'.
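
The + quantifier matters here. As a small comparison sketch (the variable names are only for illustration), this is what happens with and without it:

```python
import re

text = "The price is 123 dollars and 45 cents."

# Without +, every single digit is a separate match and is replaced on its own.
per_digit = re.sub(r"\d", "NUM", text)

# With +, each run of consecutive digits is one match and is replaced once.
per_run = re.sub(r"\d+", "NUM", text)

print(per_digit)  # The price is NUMNUMNUM dollars and NUMNUM cents.
print(per_run)    # The price is NUM dollars and NUM cents.
```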


2. re.match():

The re.match() function checks for a match only at the beginning of the string. If the match is found at the start of the string, it returns a match object. Otherwise, it returns None.

Syntax:

re.match(pattern, string, flags=0)
  • pattern: The regex pattern to match.
  • string: The input string to be searched.
  • flags: (Optional) Allows modification of matching behavior.
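
Here is a short sketch of how the flags argument changes the result, and of what the returned match object offers. The input string is an invented example.

```python
import re

text = "PRICE123 is the total cost."

# Without flags the lowercase pattern does not match the uppercase text.
case_sensitive = re.match(r"price\d+", text)

# re.IGNORECASE makes the same pattern match regardless of letter case.
case_insensitive = re.match(r"price\d+", text, flags=re.IGNORECASE)

print(case_sensitive)            # None
print(case_insensitive.group())  # PRICE123
print(case_insensitive.span())   # (0, 8)
```

Because re.match() returns None when nothing matches, check the result before calling group() on it.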

Example:

Let's check whether a string starts with word characters followed by digits.

import re

text = "Price123 is the total cost."
match = re.match(r'\w+\d+', text)

if match:
    print(f"Matched: {match.group()}")
else:
    print("No match found")

Output:

Matched: Price123

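Here, \w+ matches one or more word characters (letters, digits, and underscores), and \d+ matches one or more digits. Because \w also matches digits, \w+ first consumes all of "Price123" and then gives back the final digit so that \d+ can match. The overall match is still "Price123", which is what gets printed.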
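
To make the overlap between \w and \d visible, this sketch adds capturing groups to the same pattern. The second pattern, which uses [A-Za-z]+ instead of \w+, is an alternative chosen for illustration and is not from the example above.

```python
import re

text = "Price123 is the total cost."

# \w+ is greedy and also matches digits, so it keeps all but the last digit.
greedy = re.match(r"(\w+)(\d+)", text)
print(greedy.groups())  # ('Price12', '3')

# Restricting the first part to letters splits the text where you would expect.
letters_then_digits = re.match(r"([A-Za-z]+)(\d+)", text)
print(letters_then_digits.groups())  # ('Price', '123')

# A string that starts with digits only also satisfies \w+\d+.
digits_only = re.match(r"\w+\d+", "2024 report")
print(digits_only.group())  # 2024
```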


Key Differences:

  • re.sub() is used for substitution and replaces matches wherever they occur in the string.
  • re.match() only tries to match at the very beginning of the string. It returns a single match object or None, and it never looks for matches further along in the string.
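
The difference is easy to see when both functions run on the same input. This is a minimal sketch with an invented sentence:

```python
import re

text = "Order 66 shipped, order 67 pending."

# re.sub() finds and replaces the pattern everywhere in the string.
replaced = re.sub(r"\d+", "#", text)
print(replaced)  # Order # shipped, order # pending.

# re.match() only looks at position 0, where there is no digit.
digits_at_start = re.match(r"\d+", text)
print(digits_at_start)  # None

# A pattern that does fit the start of the string matches as usual.
word_at_start = re.match(r"\w+", text)
print(word_at_start.group())  # Order
```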


Let’s dive deeper into re.sub() and re.match() with more advanced examples and explanations of regular expression (regex) patterns.

re.sub() Advanced Example:

Suppose we want to reformat phone numbers. They are written like 123-456-7890, and we want them to look like (123) 456-7890.

Example:

import re

text = "Contact me at 123-456-7890 or 987-654-3210."
formatted_text = re.sub(r'(\d{3})-(\d{3})-(\d{4})', r'(\1) \2-\3', text)

print(formatted_text)

Explanation:

  • \d{3}: This matches exactly 3 digits.
  • (\d{3}): Parentheses () are used for capturing groups. In this case, we're capturing the first three digits as one group.
  • r'(\1) \2-\3': This is the replacement string. It uses \1, \2, and \3 to refer to the captured groups (the area code, the next three digits, and the last four digits, respectively).
  • So, this example finds phone numbers in the 123-456-7890 format and converts them to (123) 456-7890.

Output:

Contact me at (123) 456-7890 or (987) 654-3210.
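
Numbered references can get hard to read as patterns grow. As an alternative sketch, the same replacement can be written with named groups and \g<name> references. The group names area, prefix, and line are made up for this example.

```python
import re

text = "Contact me at 123-456-7890 or 987-654-3210."

# (?P<name>...) defines a named group; the names are illustrative only.
pattern = r"(?P<area>\d{3})-(?P<prefix>\d{3})-(?P<line>\d{4})"

# \g<name> in the replacement refers to the group with that name.
formatted = re.sub(pattern, r"(\g<area>) \g<prefix>-\g<line>", text)

print(formatted)  # Contact me at (123) 456-7890 or (987) 654-3210.
```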

re.match() Advanced Example:

Let's now look at how we can use re.match() with more complex patterns. Suppose we want to check whether a string starts with something that looks like an email address, without validating the rest of the string.

Example:

import re

email = "someone@example.com sent you a message."

# Basic email pattern matching the start of a string
pattern = r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+'

match = re.match(pattern, email)

if match:
    print(f"Valid email found: {match.group()}")
else:
    print("No valid email at the start")

Explanation:

  • ^[a-zA-Z0-9_.+-]+: This part matches one or more alphanumeric characters, dots (.), underscores (_), plus signs (+), or hyphens (-). The ^ anchors the match to the beginning of the string. re.match() already does that, so the ^ is redundant here but harmless.
  • @[a-zA-Z0-9-]+: This matches the @ symbol followed by one or more alphanumeric characters or hyphens (the first part of the domain name).
  • \.[a-zA-Z0-9-.]+: Matches a dot (.) followed by one or more alphanumeric characters, hyphens, or additional dots (the rest of the domain, including the top-level domain).

This is a simplified pattern. It matches common email-like text at the beginning of the string, but it does not prove that the address is valid.

Output:

Valid email found: someone@example.com
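
If the goal is to check that an entire string has this email-like shape, rather than just its beginning, re.fullmatch() is one option. The sketch below assumes the same simplified pattern (with the hyphen moved to the end of the last character class, which is equivalent); looks_like_email is a hypothetical helper name.

```python
import re

# Same simplified pattern as above; it is not a complete email validator.
EMAIL_PATTERN = r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+"

def looks_like_email(candidate):
    """Return True only if the whole string fits the simplified pattern."""
    return re.fullmatch(EMAIL_PATTERN, candidate) is not None

print(looks_like_email("someone@example.com"))                      # True
print(looks_like_email("someone@example.com sent you a message."))  # False
print(looks_like_email("not-an-email"))                             # False
```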

Explaining Common Regex Patterns:

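  1. \d: Matches any decimal digit. For str patterns this includes Unicode digits; it is equivalent to [0-9] only with the re.ASCII flag.
  2. \w: Matches any word character (letters, digits, and underscore). For str patterns this includes Unicode letters and digits; it is equivalent to [a-zA-Z0-9_] only with the re.ASCII flag.
  3. +: Matches 1 or more occurrences of the preceding character or group.
  4. *: Matches 0 or more occurrences of the preceding character or group.
  5. .: Matches any character except newline (with the re.DOTALL flag it matches newlines too).
  6. ^: Anchors the pattern to the start of the string (and to the start of each line with the re.MULTILINE flag).
  7. $: Anchors the pattern to the end of the string, or just before a trailing newline (and to the end of each line with the re.MULTILINE flag).
  8. {m,n}: Matches between m and n occurrences of the preceding character or group.
  9. [ ]: Used to define a character set. For example, [a-z] matches any lowercase ASCII letter.
  10. (): Used for capturing groups, allowing us to extract parts of the match and reference them later (like in re.sub()).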
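
A quick way to get a feel for these tokens is to try each one on a small string. The sketch below uses re.findall(), which returns all non-overlapping matches as a list; it has not been covered above and is used here only to display results. All sample strings are invented.

```python
import re

digits = re.findall(r"\d", "a1b22")                      # ['1', '2', '2']
words = re.findall(r"\w+", "snake_case and CamelCase")   # ['snake_case', 'and', 'CamelCase']
star = re.findall(r"ab*", "a ab abbb")                   # ['a', 'ab', 'abbb']
plus = re.findall(r"ab+", "a ab abbb")                   # ['ab', 'abbb']
dot = re.findall(r"c.t", "cat cot c\nt")                 # ['cat', 'cot']
start = re.findall(r"^\w+", "first second")              # ['first']
end = re.findall(r"\w+$", "first second")                # ['second']
counted = re.findall(r"\d{2,3}", "1 22 333 4444")        # ['22', '333', '444']
lower = re.findall(r"[a-z]+", "abc DEF ghi")             # ['abc', 'ghi']
groups = re.match(r"(\d+)-(\d+)", "10-20").groups()      # ('10', '20')

# \w is Unicode-aware for str patterns unless re.ASCII is given.
unicode_words = re.findall(r"\w+", "na\u00efve caf\u00e9")
ascii_words = re.findall(r"\w+", "na\u00efve caf\u00e9", flags=re.ASCII)

for name, value in [("digits", digits), ("words", words), ("star", star),
                    ("plus", plus), ("dot", dot), ("start", start),
                    ("end", end), ("counted", counted), ("lower", lower),
                    ("groups", groups), ("unicode_words", unicode_words),
                    ("ascii_words", ascii_words)]:
    print(f"{name}: {value}")
```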

Combining re.sub() with Functions:

You can also use a function as the replacement in re.sub() if you want more dynamic behavior. Let’s see how.

Example: Capitalize every word in a sentence.

import re

text = "this is a test sentence."

def capitalize(match):
    return match.group(0).capitalize()

new_text = re.sub(r'\b\w+\b', capitalize, text)

print(new_text)

Explanation:

  • \b: Word boundary.
  • \w+: Matches one or more word characters.
  • The capitalize() function is called once for each match and returns the replacement text. It relies on str.capitalize(), which uppercases the first letter of the word and lowercases the rest.

Output:

This Is A Test Sentence.
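
A replacement function is most useful when the new text has to be computed from the match. As a sketch (the sentence and the helper name double are invented for illustration), here is one that doubles every number, followed by a lambda version of the capitalize example on mixed-case input:

```python
import re

text = "Apples cost 3 dollars and pears cost 12 dollars."

def double(match):
    # match.group(0) is the matched text; the function must return a string.
    return str(int(match.group(0)) * 2)

doubled = re.sub(r"\d+", double, text)
print(doubled)  # Apples cost 6 dollars and pears cost 24 dollars.

# str.capitalize() also lowercases everything after the first letter.
capitalized = re.sub(r"\b\w+\b", lambda m: m.group(0).capitalize(), "hELLO wORLD")
print(capitalized)  # Hello World
```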

re.match() vs re.search():

If you want to search for a pattern anywhere in the string (not just at the beginning), you should use re.search() instead of re.match().

Example using re.search():

import re

text = "This is my email someone@example.com"

# Search for an email pattern anywhere in the string
pattern = r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+'

search = re.search(pattern, text)

if search:
    print(f"Email found: {search.group()}")
else:
    print("No email found")

Output:

Email found: someone@example.com

Here, re.search() looks for the pattern anywhere in the string, unlike re.match(), which only checks the start.
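
One detail that is easy to miss: re.MULTILINE changes what ^ means for re.search(), but re.match() still only tries the very start of the string. A small sketch with an invented two-line string:

```python
import re

text = "first line\nsecond line"

# re.match() only tries position 0, so it cannot find "second".
match_plain = re.match(r"second", text)

# re.search() scans the whole string and finds it on the second line.
search_plain = re.search(r"second", text)

# By default ^ means "start of the string", so this search fails...
search_anchored = re.search(r"^second", text)

# ...but with re.MULTILINE, ^ also matches right after each newline.
search_multiline = re.search(r"^second", text, flags=re.MULTILINE)

# re.MULTILINE does not change where re.match() starts looking.
match_multiline = re.match(r"second", text, flags=re.MULTILINE)

print(match_plain)              # None
print(search_plain.span())      # (11, 17)
print(search_anchored)          # None
print(search_multiline.span())  # (11, 17)
print(match_multiline)          # None
```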

Summary:

  • re.sub(): Replaces matches of a pattern within a string. Can use captured groups for dynamic replacements or even a function.
  • re.match(): Checks for a match at the beginning of a string. Useful for validation or checking the start of a string.
  • re.search(): Searches for a pattern anywhere in the string, not limited to the start.

These examples should give you a more comprehensive understanding of how regex works in Python!
