Working with Regular Expressions in Python

#python #programming #beginners

In this post, we'll explore some common operations on regular expressions in Python, using examples from the world of astronomy.

Regular expressions are a powerful tool for pattern matching and text processing. Python's re module provides several functions for working with regular expressions, including search(), match(), findall(), and sub().

The search() function scans a string for the first place where a pattern matches and returns a match object, or None if the pattern is not found. The match() function is similar to search(), but only matches at the beginning of the string. The findall() function returns a list of all non-overlapping matches of a pattern in a string. The sub() function replaces all occurrences of a pattern in a string with a specified replacement string.

Here are some examples of using these functions:

import re

text = "The Andromeda Galaxy is a spiral galaxy approximately 2.5 million light-years away from Earth."

# Search for a pattern
match = re.search(pattern="spiral", string=text)
if match:
    print(f"Found: {match.group()}")

# Output: Found: spiral

# Match at the beginning of the string
match = re.match(pattern=r"The", string=text)
if match:
    print(f"Found: {match.group()}")

# Output: Found: The

# Find all occurrences of a pattern (here, every five-character word)
matches = re.findall(pattern=r"\b\w{5}\b", string=text)
print(matches)

# Output: ['light', 'years', 'Earth']

# Replace all occurrences of a pattern
new_text = re.sub(pattern=r"\d", repl="#", string=text)
print(new_text)

# Output: The Andromeda Galaxy is a spiral galaxy approximately #.# million light-years away from Earth.
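
The difference between search() and match() is easiest to see when the pattern sits in the middle of the string. The following is a minimal sketch using the same text; the variable names found_anywhere and found_at_start are made up for illustration.

```python
import re

text = "The Andromeda Galaxy is a spiral galaxy approximately 2.5 million light-years away from Earth."

# search() scans the whole string, so it finds "spiral" in the middle.
found_anywhere = re.search(r"spiral", text)

# match() only tries the start of the string, so the same pattern gives None.
found_at_start = re.match(r"spiral", text)

print(found_anywhere is not None)  # True
print(found_at_start is None)      # True
```

Because both functions return None when nothing matches, checking the result with an if statement before calling group() avoids an AttributeError.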

Regular expressions can also be used to extract specific information from a text. Here are some examples:

text = "The Andromeda Galaxy is a spiral galaxy approximately 2.5 million light-years away from Earth."

# Extract the first two words from a text
match = re.search(pattern=r"^(\w+)\s+(\w+)", string=text)
if match:
    print(f"First word: {match.group(1)}")  # Output: First word: The
    print(f"Second word: {match.group(2)}")  # Output: Second word: Andromeda

# Extract the first 10 digits, but only if the text starts with them
match = re.search(pattern=r"^\d{10}", string=text)
if match:
    print(f"Found: {match.group()}")

# Output: (nothing is printed, because the text does not start with 10 digits)

# Separate a number into units and decimals
match = re.search(pattern=r"(\d+)\.(\d+)", string=text)
if match:
    print(f"Units: {match.group(1)}")  # Output: Units: 2
    print(f"Decimals: {match.group(2)}")  # Output: Decimals: 5

# Split text into words, using runs of whitespace as the separator
words = re.split(pattern=r"\s+", string=text)
print(words)

# Output: ['The', 'Andromeda', 'Galaxy', 'is', 'a', 'spiral', 'galaxy', 'approximately', '2.5', 'million', 'light-years', 'away', 'from', 'Earth.']

# Use a regex to mimic str.strip() by removing leading and trailing whitespace
stripped_text = re.sub(pattern=r"^\s+|\s+$", repl="", string=text)
print(stripped_text)

# Output: The Andromeda Galaxy is a spiral galaxy approximately 2.5 million light-years away from Earth.

# Remove every character from a filename except letters, digits, underscores, and dots
filename = "image-of-the-andromeda-galaxy.jpg"
new_filename = re.sub(pattern=r"[^\w\.]", repl="", string=filename)
print(new_filename)

# Output: imageoftheandromedagalaxy.jpg

# Use a regex to split a text into words and count how often each word occurs
words = re.findall(pattern=r"\b\w+\b", string=text)
word_counts = {}
for word in words:
    word_counts[word] = word_counts.get(word, 0) + 1
print(word_counts)

# Output: {'The': 1, 'Andromeda': 1, 'Galaxy': 1, 'is': 1, 'a': 1, 'spiral': 1, 'galaxy': 1, 'approximately': 1, '2': 1, '5': 1, 'million': 1, 'light': 1, 'years': 1, 'away': 1, 'from': 1, 'Earth': 1}
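
The count above treats "Galaxy" and "galaxy" as different words. If that is not what you want, one option is to lowercase the text first and let collections.Counter from the standard library do the counting. This is a sketch of that variation, not a replacement for the loop above.

```python
import re
from collections import Counter

text = "The Andromeda Galaxy is a spiral galaxy approximately 2.5 million light-years away from Earth."

# Lowercase first so that "Galaxy" and "galaxy" are counted together.
words = re.findall(r"\b\w+\b", text.lower())
word_counts = Counter(words)

print(word_counts["galaxy"])       # 2
print(word_counts.most_common(1))  # [('galaxy', 2)]
```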

# Use a regex to split the text into sentences and count how often each sentence occurs
sentences = re.split(pattern=r"\.\s+", string=text)
sentence_counts = {}
for sentence in sentences:
    sentence_counts[sentence] = sentence_counts.get(sentence, 0) + 1
print(sentence_counts)

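# Output: {'The Andromeda Galaxy is a spiral galaxy approximately 2.5 million light-years away from Earth.': 1}
# The text is a single sentence and its final period is not followed by whitespace, so nothing is split.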
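
The example text holds only one sentence, so the split above has nothing to separate. The sketch below uses a made-up three-sentence string (the sample variable is hypothetical) to show the counting at work. It also swaps in a slightly different pattern: splitting on r"\.\s+" removes the period from every sentence except the last, so two identical sentences could end up counted separately. A lookbehind keeps the punctuation attached to each sentence.

```python
import re
from collections import Counter

# Hypothetical multi-sentence text, invented for this sketch.
sample = (
    "Andromeda is a spiral galaxy. "
    "It is 2.5 million light-years away. "
    "Andromeda is a spiral galaxy."
)

# Split on whitespace that follows ., ! or ? and keep the punctuation.
# The dot in "2.5" is followed by a digit, so no split happens there.
sentences = re.split(r"(?<=[.!?])\s+", sample)
sentence_counts = Counter(sentences)

for sentence, count in sentence_counts.items():
    print(f"{count} x {sentence}")
```

This pattern is still a rough heuristic: abbreviations such as "approx. 2.5" would be split as if they ended a sentence.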

These are just a few examples of the many powerful ways that regular expressions can be used to process and manipulate text in Python. With a little practice, you'll be able to use regular expressions to solve a wide variety of text-processing problems.
