Day 38: Python Domain Extractor, Pull Unique Domains from Emails with Validation and Order Preservation

#python #programming #challenge #beginners

Welcome to Day 38 of the #80DaysOfChallenges journey! This beginner-to-intermediate task extracts the unique domains from a list of emails, skipping invalid entries and ignoring case while keeping first-seen order. It relies on string splitting plus a set for deduplication. It's good practice for parsing, lowercasing for consistency, and ordered uniqueness without sorting, all of which show up in data cleaning and email analytics. If you're building keys for uniqueness or filtering lists by a condition, this "Python extract domains" guide covers a function that's simple and easy to tweak for stricter validation or subdomain handling.

💡 Key Takeaways from Day 38: Domain Extraction Function

This challenge builds a function that normalizes each email, validates it, extracts the domain, and collects the unique domains in order of appearance. It's a filter with deduplication: lowercase for case-insensitivity, check for '@', split to get the domain, and track what has been seen in a set. We'll break it down into three parts: the function setup with a seen set and a result list, the loop that normalizes and splits, and an example that skips invalid entries.

1. Function Design: Set for Dedupe, List for Order

The extract_domains function takes a list of emails and returns a list of unique domains, processed in a way that ignores case differences and preserves the order of first appearance:

def extract_domains(emails):
    """
    Extract unique domains from a list of emails (case-insensitive, ordered).
    """
    # track seen domains to avoid duplicates while preserving output order
    seen = set()
    result = []   # store domains in the order they first appear

The set seen provides average O(1) membership checks, so testing whether a domain has already been encountered stays cheap. The list result maintains insertion order: a domain is appended only the first time it is seen, in the sequence it appears in the input. The pair scales linearly with the input and avoids reaching for a structure like OrderedDict, while still achieving ordered uniqueness without sorting the output.
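The set-plus-list pattern is not specific to emails. As a minimal sketch, here it is on its own, next to a dict-based equivalent for comparison. The helper name unique_in_order is hypothetical and not part of the challenge code; the comparison relies on plain dicts keeping insertion order, which Python guarantees from 3.7 onward.

```python
def unique_in_order(items):
    """Return items with duplicates removed, keeping first-seen order."""
    seen = set()      # membership checks are O(1) on average
    result = []       # remembers the order of first appearance
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


domains = ["gmail.com", "yahoo.com", "gmail.com", "outlook.com", "yahoo.com"]

# dict keys keep insertion order (Python 3.7+), so this gives the same list
assert unique_in_order(domains) == list(dict.fromkeys(domains))
print(unique_in_order(domains))  # ['gmail.com', 'yahoo.com', 'outlook.com']
```

The explicit loop is a little longer than dict.fromkeys, but it leaves room for the per-item normalization and validation that the domain extractor needs.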

2. Loop Processing: Normalize, Validate, Extract

Inside the function, a loop handles each email individually for normalization, validation, and extraction:

    for email in emails:
        # normalize email for consistent comparison
        email = email.lower()
        # skip entries that don't have a valid '@' separator
        if "@" not in email:
            continue
        # split at the first '@' and take everything after it as the domain
        domain = email.split("@", 1)[1]
        # add domain only if it hasn't been seen before
        if domain not in seen:
            seen.add(domain)        # mark domain as seen
            result.append(domain)   # append domain to output list
    # return unique domains in original order
    return result

Lowercasing the entire email ensures case-insensitive matching, so "Gmail.com" and "gmail.com" are treated as the same domain. The '@' check acts as basic validation: entries without an '@' are skipped, which also prevents an IndexError when indexing the split result. split("@", 1) splits only once, at the first '@', and takes everything after it as the domain. An address with more than one '@' (rare, and only valid when the extra '@' sits in a quoted local part) therefore yields a domain that still contains an '@'. If the domain hasn't been seen, it's added to both the set and the list, preserving order. The loop is O(n) overall, since each set operation takes constant time on average.
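The '@' check is deliberately loose, so a few malformed entries still get through. The sketch below first shows two of those cases, then a slightly stricter per-email helper. The name extract_domain_strict and its rules (strip whitespace, split at the last '@', require a non-empty local part and domain) are assumptions for illustration, not part of the original challenge.

```python
# What the basic split produces for two edge cases
assert "user@".split("@", 1)[1] == ""                      # empty "domain"
assert "a@b@example.com".split("@", 1)[1] == "b@example.com"


def extract_domain_strict(email):
    """Hypothetical helper: return the lowercased domain, or None if malformed."""
    email = email.strip().lower()
    # rpartition splits at the LAST '@'; sep is '' when there is no '@' at all
    local, sep, domain = email.rpartition("@")
    if not sep or not local or not domain:
        return None
    return domain


assert extract_domain_strict("  User@Gmail.com ") == "gmail.com"
assert extract_domain_strict("a@b@example.com") == "example.com"
assert extract_domain_strict("user@") is None
assert extract_domain_strict("@gmail.com") is None
assert extract_domain_strict("invalid-email") is None
```

Swapping this helper into the loop would mean skipping any email for which it returns None. It is still far from full address validation; it only closes the most obvious gaps.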

3. Example Usage: Real List Test

To demonstrate, the function is tested with a sample list that includes mixed cases, duplicates, and an invalid entry:

sample = [
    "User@Gmail.com",
    "admin@yahoo.com",
    "test@gmail.com",
    "invalid-email",
    "hello@Outlook.com"
]
print(extract_domains(sample))

This outputs ['gmail.com', 'yahoo.com', 'outlook.com'], correctly extracting domains while skipping the invalid "invalid-email" (no '@'), deduping the repeated "gmail.com" appearances, and normalizing cases like "Gmail.com" to "gmail.com". The order reflects the first occurrence: gmail first from "User@Gmail.com", then yahoo, then outlook, ignoring later gmail.
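To check that result without eyeballing printed output, the two fragments above can be joined into one function and the example turned into assertions. This is a sketch of one way to test it; the two extra cases (an empty list and a list with only invalid entries) are added here as assumptions about sensible behavior.

```python
def extract_domains(emails):
    """Extract unique domains from a list of emails (case-insensitive, ordered)."""
    seen = set()
    result = []
    for email in emails:
        email = email.lower()
        if "@" not in email:
            continue
        domain = email.split("@", 1)[1]
        if domain not in seen:
            seen.add(domain)
            result.append(domain)
    return result


sample = [
    "User@Gmail.com",
    "admin@yahoo.com",
    "test@gmail.com",
    "invalid-email",
    "hello@Outlook.com",
]

assert extract_domains(sample) == ["gmail.com", "yahoo.com", "outlook.com"]
assert extract_domains([]) == []                  # nothing in, nothing out
assert extract_domains(["invalid-email"]) == []   # only invalid entries
```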

🎯 Summary and Reflections

This domain extractor combines parsing, ordering, and uniqueness using a set-list pair. It reinforced:

Case handling: Lowercase before comparing so matches stay consistent.
Validation simplicity: A quick '@' check filters out the most obviously malformed entries.
Dedupe efficiency: A set for O(1) checks, a list for order.

It's a handy utility for data cleanup. To take it further, add fuller email validation.

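Advanced Alternatives: Use a list comprehension with a seen set (Python 3.8+ for the walrus operator): seen = set(); [d for e in emails if '@' in e and (d := e.lower().split('@', 1)[1]) not in seen and not seen.add(d)]. Because seen.add(d) returns None, not seen.add(d) is always true and serves only to record the domain. Or use a regex to capture the domain. What's your parsing trick? Share it!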
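For the regex route, a minimal sketch could look like the following. The pattern and the names DOMAIN_RE and extract_domains_regex are assumptions chosen for illustration: the pattern captures whatever follows the last '@', as long as it is non-empty and contains no whitespace or further '@'. It is a loose check, not RFC-grade validation, and it does not require a local part.

```python
import re

# Assumed pattern: capture the run of non-'@', non-space characters
# between the last '@' and the end of the string.
DOMAIN_RE = re.compile(r"@([^@\s]+)$")


def extract_domains_regex(emails):
    """Hypothetical regex-based variant: unique domains in first-seen order."""
    seen = set()
    result = []
    for email in emails:
        match = DOMAIN_RE.search(email.strip().lower())
        if match is None:
            continue  # no '@', or nothing usable after it
        domain = match.group(1)
        if domain not in seen:
            seen.add(domain)
            result.append(domain)
    return result


emails = ["User@Gmail.com", "admin@yahoo.com", "test@gmail.com",
          "invalid-email", "user@", "hello@Outlook.com"]
print(extract_domains_regex(emails))  # ['gmail.com', 'yahoo.com', 'outlook.com']
```

Compared with the split-based version, this one also rejects entries such as "user@" that have nothing after the '@'.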

🚀 Next Steps and Resources

Day 38 took list processing a step further, which is good preparation for data-cleaning tasks. Following along with #80DaysOfChallenges? Added subdomain handling? Post your version!

Source Code for Challenge #38: scripts/extract_domains.py
Main Repository: 80-days-of-challenges
Daily Updates: Twitter/X (@Shahrouzlogs)
