DEV Community

Paulo GP


Working with Regular Expressions in Python

In this post, we'll explore some common operations on regular expressions in Python, using examples from the world of astronomy.

Regular expressions are a powerful tool for pattern matching and text processing. Python's re module provides several functions for working with regular expressions, including search(), match(), findall(), and sub().

The search() function scans a string for a pattern and returns a match object for the first location where the pattern is found, or None if there is no match. The match() function is similar to search(), but it only matches at the beginning of the string. The findall() function returns a list of all non-overlapping matches of a pattern in a string. The sub() function replaces all occurrences of a pattern in a string with a specified replacement string.

Here are some examples of using these functions:

import re

text = "The Andromeda Galaxy is a spiral galaxy approximately 2.5 million light-years away from Earth."
# Search for a pattern
match = re.search(pattern="spiral", string=text)
if match:
    print(f"Found: {match.group()}")

# Output: Found: spiral
# Match at the beginning of the string
match = re.match(pattern=r"The", string=text)
if match:
    print(f"Found: {match.group()}")

# Output: Found: The
# Find all occurrences of a pattern (here, every word of exactly five characters)
matches = re.findall(pattern=r"\b\w{5}\b", string=text)
print(matches)

# Output: ['light', 'years', 'Earth']
# Replace all occurrences of a pattern
new_text = re.sub(pattern=r"\d", repl="#", string=text)
print(new_text)

# Output: The Andromeda Galaxy is a spiral galaxy approximately #.# million light-years away from Earth.
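
The difference between search() and match() is easiest to see when the pattern does not sit at the start of the string. The following is a minimal sketch using the same sentence; the variable names found_by_search and found_by_match are only illustrative.

```python
import re

text = "The Andromeda Galaxy is a spiral galaxy approximately 2.5 million light-years away from Earth."

# search() scans the whole string, so it finds "spiral" in the middle.
found_by_search = re.search(pattern=r"spiral", string=text)

# match() only tries at the very start of the string, so the same
# pattern returns None here.
found_by_match = re.match(pattern=r"spiral", string=text)

print(found_by_search.group())  # spiral
print(found_by_match)           # None
```

Because both functions return None when nothing matches, it is worth checking the result before calling group() on it.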

Regular expressions can also be used to extract specific information from a text. Here are some examples:

text = "The Andromeda Galaxy is a spiral galaxy approximately 2.5 million light-years away from Earth."
# Extract the first two words from a text
match = re.search(pattern=r"^(\w+)\s+(\w+)", string=text)
if match:
    print(f"First word: {match.group(1)}")  # Output: First word: The
    print(f"Second word: {match.group(2)}")  # Output: Second word: Andromeda
# Extract a leading number, but only if the text starts with 10 digits
match = re.search(pattern=r"^\d{10}", string=text)
if match:
    print(f"Found: {match.group()}")

# Output: (nothing is printed, because the text does not start with 10 digits)
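
Since the sample sentence does not begin with a digit, the search above finds nothing. Here is a small sketch of a case where it does match. The record string and its 10-digit value are made up for illustration, and the trailing \b is an addition that is not in the pattern above: it stops the pattern from accepting only the first 10 digits of a longer number.

```python
import re

# Hypothetical sample line; the 10-digit value is invented for this example.
record = "2460000123 observation of the Andromeda Galaxy"

# \b after the 10 digits requires the number to end there.
match = re.search(pattern=r"^\d{10}\b", string=record)
leading_number = match.group() if match else None
print(leading_number)  # 2460000123

# With 11 leading digits, the \b check fails and there is no match.
too_long = re.search(pattern=r"^\d{10}\b", string="24600001234 observation")
print(too_long)  # None
```
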
# Separate a number into units and decimals
match = re.search(pattern=r"(\d+)\.(\d+)", string=text)
if match:
    print(f"Units: {match.group(1)}")  # Output: Units: 2
    print(f"Decimals: {match.group(2)}")  # Output: Decimals: 5
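
Numbered groups work well for short patterns, but named groups can be easier to read. The following sketch shows the same extraction with the (?P<name>...) syntax; the group names units and decimals are chosen here to mirror the labels used above.

```python
import re

text = "The Andromeda Galaxy is a spiral galaxy approximately 2.5 million light-years away from Earth."

# Same pattern as above, but each group has a name instead of a number.
match = re.search(pattern=r"(?P<units>\d+)\.(?P<decimals>\d+)", string=text)
if match:
    print(match.group("units"))     # 2
    print(match.group("decimals"))  # 5
    print(match.groupdict())        # {'units': '2', 'decimals': '5'}
```
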
# Separate text into words using space characters as reference
words = re.split(pattern=r"\s+", string=text)
print(words)

# Output: ['The', 'Andromeda', 'Galaxy', 'is', 'a', 'spiral', 'galaxy', 'approximately', '2.5', 'million', 'light-years', 'away', 'from', 'Earth.']
# Use a regex to mimic the str.strip() method
stripped_text = re.sub(pattern=r"^\s+|\s+$", repl="", string=text)
print(stripped_text)

# Output: The Andromeda Galaxy is a spiral galaxy approximately 2.5 million light-years away from Earth.
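
The sample sentence has no leading or trailing whitespace, so the output above is identical to the input. A short sketch with a padded string (a made-up value, used only for illustration) makes the effect visible and compares it with str.strip():

```python
import re

# Hypothetical input with spaces, a tab, and a newline around the text.
padded = "  \tAndromeda Galaxy \n"

# ^\s+ removes leading whitespace, \s+$ removes trailing whitespace.
stripped = re.sub(pattern=r"^\s+|\s+$", repl="", string=padded)

print(repr(stripped))              # 'Andromeda Galaxy'
print(stripped == padded.strip())  # True
```
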
# Remove symbols from a filename except the dot character
filename = "image-of-the-andromeda-galaxy.jpg"
new_filename = re.sub(pattern=r"[^\w\.]", repl="", string=filename)
print(new_filename)

# Output: imageoftheandromedagalaxy.jpg
# Use regex to split a text into a list of words and get the frequency for the list of words
words = re.findall(pattern=r"\b\w+\b", string=text)
word_counts = {}
for word in words:
    word_counts[word] = word_counts.get(word, 0) + 1
print(word_counts)

# Output: {'The': 1, 'Andromeda': 1, 'Galaxy': 1, 'is': 1, 'a': 1, 'spiral': 1, 'galaxy': 1, 'approximately': 1, '2': 1, '5': 1, 'million': 1, 'light': 1, 'years': 1, 'away': 1, 'from': 1, 'Earth': 1}
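
In the result above, 'Galaxy' and 'galaxy' are counted as different words because the match is case-sensitive. One possible variation, sketched here, lowercases the text first and uses collections.Counter from the standard library instead of a manual dictionary:

```python
import re
from collections import Counter

text = "The Andromeda Galaxy is a spiral galaxy approximately 2.5 million light-years away from Earth."

# Lowercase the text so "Galaxy" and "galaxy" are counted together.
words = re.findall(pattern=r"\b\w+\b", string=text.lower())
word_counts = Counter(words)

print(word_counts["galaxy"])       # 2
print(word_counts.most_common(1))  # [('galaxy', 2)]
```
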
# Use regex to split the text into sentences and get the frequency for each sentence
sentences = re.split(pattern=r"\.\s+", string=text)
sentence_counts = {}
for sentence in sentences:
    sentence_counts[sentence] = sentence_counts.get(sentence, 0) + 1
print(sentence_counts)

# Output: {'The Andromeda Galaxy is a spiral galaxy approximately 2.5 million light-years away from Earth.': 1}
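
The sample text is a single sentence whose final period is not followed by whitespace, so the split above returns it unchanged. Here is a sketch with a two-sentence text (rearranged from the example sentence for illustration). It uses a lookbehind, (?<=\.), which is a variation on the pattern above that keeps the period attached to each sentence:

```python
import re

# Hypothetical two-sentence text built from the example sentence.
text = "The Andromeda Galaxy is a spiral galaxy. It is approximately 2.5 million light-years away from Earth."

# Split on whitespace that directly follows a period. The "." in "2.5"
# is followed by a digit, not whitespace, so the number stays intact.
sentences = re.split(pattern=r"(?<=\.)\s+", string=text)
print(sentences)
# ['The Andromeda Galaxy is a spiral galaxy.', 'It is approximately 2.5 million light-years away from Earth.']
```

This simple rule would still split after abbreviations such as "approx. 2.5", so it is only a starting point for real sentence segmentation.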

These are just a few examples of the many powerful ways that regular expressions can be used to process and manipulate text in Python. With a little practice, you'll be able to use regular expressions to solve a wide variety of text-processing problems.
