German Yamil

Posted on May 14 • Edited on May 16

Python Sets: Fast Lookups, Deduplication, and Set Operations

#python #tutorial #codenewbie #beginners

If you have ever written a loop to check whether a value already exists in a list, or used a list just to collect unique items, you have been solving a problem that Python sets handle natively and far more efficiently. Sets give you average-case O(1) membership testing, automatic deduplication, and a suite of mathematical operations that make filtering and comparing collections of data straightforward.

🎁 Free: AI Publishing Checklist — 7 steps in Python · Full pipeline: germy5.gumroad.com/l/xhxkzz (pay what you want, min $9.99)

Why Sets? O(1) Lookup vs O(n) List

The core reason to reach for a set instead of a list is performance. When you write value in my_list, Python scans the elements from the beginning until it finds a match, which is O(n) time. When you write value in my_set, Python hashes the value and probes the hash table directly, which is O(1) time on average, whether the set has 10 items or 10 million.

import time

data_list = list(range(1_000_000))
data_set = set(data_list)

# List: scans up to 1,000,000 elements
start = time.perf_counter()
999_999 in data_list
print(f"List lookup: {time.perf_counter() - start:.6f}s")

# Set: one hash lookup
start = time.perf_counter()
999_999 in data_set
print(f"Set lookup:  {time.perf_counter() - start:.6f}s")

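On a typical machine the set lookup is typically thousands of times faster for this worst-case search (the target is the last element of the list), and the gap grows as the collection grows. A single timed call is noisy, so treat the printed numbers as a rough indication rather than a precise measurement.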
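For a steadier comparison, the same idea can be sketched with the standard-library timeit module, which repeats each lookup many times. The collection size, repeat counts, and variable names below are arbitrary choices for illustration, and the absolute numbers will vary by machine.

```python
import timeit

data_list = list(range(100_000))
data_set = set(data_list)
target = 99_999  # worst case for the list: the last element

# Run each lookup 100 times per measurement and keep the best of 3 measurements.
list_time = min(timeit.repeat(lambda: target in data_list, number=100, repeat=3))
set_time = min(timeit.repeat(lambda: target in data_set, number=100, repeat=3))

print(f"List: {list_time / 100:.9f}s per lookup")
print(f"Set:  {set_time / 100:.9f}s per lookup")
```

Taking the minimum of several runs is a common way to reduce noise from other processes on the machine.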

Sets also deduplicate automatically. Converting a list to a set is the idiomatic one-liner for removing duplicates when the original order does not matter:

tags = ["python", "tutorial", "python", "beginners", "tutorial"]
unique_tags = list(set(tags))
# ['beginners', 'python', 'tutorial']  — order not guaranteed
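When the original order does matter, one common alternative is dict.fromkeys(), which relies on dicts preserving insertion order (guaranteed since Python 3.7). This is a small sketch using the same tags list, not something the pipeline itself is known to use.

```python
tags = ["python", "tutorial", "python", "beginners", "tutorial"]

# dict keys are unique and keep insertion order (Python 3.7+),
# so this removes duplicates while preserving first-seen order.
unique_in_order = list(dict.fromkeys(tags))
print(unique_in_order)
# ['python', 'tutorial', 'beginners']

# If any consistent order is enough, sorting the set also works.
unique_sorted = sorted(set(tags))
print(unique_sorted)
# ['beginners', 'python', 'tutorial']
```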

Creating Sets

There are three basic ways to create a set:

# 1. Literal syntax — curly braces with values
languages = {"python", "rust", "go"}

# 2. set() constructor — from any iterable
from_list = set([1, 2, 3, 2, 1])   # {1, 2, 3}
from_str  = set("hello")            # {'h', 'e', 'l', 'o'}

# 3. Empty set — MUST use set(), not {}
empty = set()    # {} creates an empty dict, not a set

One trap beginners hit: {} creates an empty dict. Always use set() for an empty set.
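The trap is easy to confirm in an interpreter. The variable names here are only for illustration.

```python
empty_braces = {}
empty_set = set()

print(type(empty_braces).__name__)  # dict
print(type(empty_set).__name__)     # set

# Braces with at least one value do produce a set.
one_item = {"python"}
print(type(one_item).__name__)      # set

# An empty set prints as set(), not {}.
print(repr(empty_set))              # set()
```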

Set Operations: Union, Intersection, Difference, Symmetric Difference

Python sets support the full algebra of mathematical sets through readable operators:

published = {"intro-python", "python-lists", "python-dicts", "python-sets"}
drafted   = {"python-sets", "python-generators", "python-typing"}

# Union — all slugs in either collection
all_slugs = published | drafted
# {'intro-python', 'python-lists', 'python-dicts',
#  'python-sets', 'python-generators', 'python-typing'}

# Intersection — slugs in both (already published AND drafted)
overlap = published & drafted
# {'python-sets'}

# Difference — drafted but NOT yet published
needs_review = drafted - published
# {'python-generators', 'python-typing'}

# Symmetric difference — in one but not both
unique_to_each = published ^ drafted
# {'intro-python', 'python-lists', 'python-dicts',
#  'python-generators', 'python-typing'}

Each operator has a method equivalent (union(), intersection(), difference(), symmetric_difference()). The operators require both operands to be sets or frozensets, while the methods accept any iterable, which is useful when your source data hasn't been converted yet.
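A short sketch of the difference between the operator and method forms, using a plain list as the second operand. The incoming name is made up for this example.

```python
published = {"intro-python", "python-lists"}
incoming = ["python-lists", "python-sets"]  # a plain list, not a set

# Methods accept any iterable.
combined = published.union(incoming)
common = published.intersection(incoming)
print(sorted(combined))  # ['intro-python', 'python-lists', 'python-sets']
print(sorted(common))    # ['python-lists']

# Operators insist on set operands and raise TypeError otherwise.
operator_failed = False
try:
    published | incoming
except TypeError as exc:
    operator_failed = True
    print(f"TypeError: {exc}")
```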

Membership Testing with `in`

The in operator is where sets shine in everyday code:

STOP_WORDS = {"the", "a", "an", "is", "it", "in", "on", "at"}

def extract_keywords(text: str) -> list[str]:
    words = text.lower().split()
    return [w for w in words if w not in STOP_WORDS]

print(extract_keywords("Python is a great language for automation"))
# ['python', 'great', 'language', 'for', 'automation']

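Making STOP_WORDS a set instead of a list costs nothing in readability but makes each in check a constant-time hash lookup rather than a linear scan. With eight words the difference is negligible; it becomes significant as the collection grows.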
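The same membership check underlies the common "seen set" pattern for filtering a stream of items as they arrive. The sketch below is one possible shape for it; the function name, the example URLs, and the normalization rule (lowercasing and stripping a trailing slash) are assumptions made for illustration.

```python
from collections.abc import Iterable, Iterator


def first_occurrences(urls: Iterable[str]) -> Iterator[str]:
    """Yield each URL the first time it appears, ignoring case and trailing slashes."""
    seen: set[str] = set()
    for url in urls:
        key = url.lower().rstrip("/")  # illustrative normalization only
        if key not in seen:            # average-case O(1) check
            seen.add(key)
            yield url


urls = [
    "https://example.com/a",
    "https://example.com/b",
    "https://EXAMPLE.com/a/",
    "https://example.com/b",
]
result = list(first_occurrences(urls))
print(result)
# ['https://example.com/a', 'https://example.com/b']
```

Because it is a generator, this works on inputs that are too large or too slow to load into a list first.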

Set Methods: add, remove, discard, pop, clear

Sets are mutable and come with a practical set of mutation methods:

seen = {"slug-a", "slug-b"}

seen.add("slug-c")        # adds slug-c; no-op if already present

seen.remove("slug-a")     # removes slug-a; raises KeyError if missing
seen.discard("slug-z")    # removes if present; silent no-op if missing

item = seen.pop()         # removes and returns an arbitrary element
seen.clear()              # empties the set in place

The distinction between remove() and discard() matters in pipelines: use discard() when absence is not an error, remove() when you want to catch unexpected state.
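A brief demonstration of the failure modes, as a sketch: remove() on a missing element and pop() on an empty set both raise KeyError, while discard() never does. The errors list exists only to record what happened.

```python
seen = {"slug-a"}
errors = []

seen.discard("slug-z")       # missing element: nothing happens

try:
    seen.remove("slug-z")    # missing element: KeyError
except KeyError as exc:
    errors.append("remove")
    print(f"remove() raised KeyError for {exc}")

seen.clear()
try:
    seen.pop()               # pop() on an empty set also raises KeyError
except KeyError:
    errors.append("pop")
    print("pop() raised KeyError on an empty set")
```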

Frozen Sets: Immutable and Hashable

A frozenset is an immutable version of a set. Once created it cannot be modified, which is what allows it to be hashable. This unlocks two capabilities that regular sets lack: it can be used as a dictionary key and as an element inside another set.

# frozenset as a dict key — useful for caching combination results
cache: dict[frozenset, str] = {}

def get_combined_tag_slug(tags: set[str]) -> str:
    key = frozenset(tags)
    if key not in cache:
        cache[key] = "-".join(sorted(tags))
    return cache[key]

print(get_combined_tag_slug({"python", "tutorial"}))
# 'python-tutorial'
print(get_combined_tag_slug({"tutorial", "python"}))   # cache hit
# 'python-tutorial'

# frozenset as an element of another set
pipelines = {
    frozenset({"fetch", "parse", "publish"}),
    frozenset({"fetch", "validate", "upload"}),
}

frozenset supports all the read-only set operations (union, intersection, in, etc.) but none of the mutation methods.
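These properties can be checked directly. A minimal sketch, with variable names chosen for illustration:

```python
tags = frozenset({"python", "tutorial"})

# Read-only operations work; with a frozenset on the left, the result is a frozenset.
more = tags | {"beginners"}
print(type(more).__name__)        # frozenset
print("python" in tags)           # True

# Mutation methods do not exist on frozenset.
print(hasattr(tags, "add"))       # False

# A regular set is unhashable, so it cannot be nested inside a set.
nesting_failed = False
try:
    nested = {{"fetch", "parse"}}
except TypeError:
    nesting_failed = True
    print("regular sets cannot be set elements")
```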

Set Comprehensions

Just like list and dict comprehensions, Python has set comprehensions:

slugs = [
    "python-lists",
    "Python-Lists",   # duplicate after normalization
    "python-sets",
    "PYTHON-SETS",
]

normalized = {slug.lower() for slug in slugs}
# {'python-lists', 'python-sets'}

Set comprehensions are a clean way to transform and deduplicate in one expression. You can also add a filter condition:

# Only slugs that start with "python"
python_slugs = {s.lower() for s in slugs if s.lower().startswith("python")}

Real Pattern: Deduplicating Published Article Slugs

Here is a realistic use case from an automated publishing pipeline. You have a JSON file of published slugs and an incoming batch of candidate articles. You want to skip anything already published without scanning a list on every check:

import json
from pathlib import Path

def load_published_slugs(path: Path) -> set[str]:
    if not path.exists():
        return set()
    data = json.loads(path.read_text())
    return set(data.get("published", []))

def filter_new_articles(
    candidates: list[dict],
    published: set[str],
) -> list[dict]:
    return [a for a in candidates if a["slug"] not in published]

def mark_published(slug: str, path: Path) -> None:
    data = json.loads(path.read_text()) if path.exists() else {"published": []}
    published = set(data.get("published", []))  # tolerate a file without the key
    published.add(slug)
    data["published"] = sorted(published)
    path.write_text(json.dumps(data, indent=2))

# Usage
queue_path = Path("publish_queue.json")
published_slugs = load_published_slugs(queue_path)

candidates = [
    {"slug": "python-sets",        "title": "Python Sets"},
    {"slug": "python-generators",  "title": "Python Generators"},
    {"slug": "intro-python",       "title": "Intro to Python"},  # already published
]

to_publish = filter_new_articles(candidates, published_slugs)
for article in to_publish:
    # ... publish article ...
    mark_published(article["slug"], queue_path)

The published_slugs set makes not in published a single hash lookup for every candidate, so the filtering step stays fast as the number of tracked slugs grows.
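Note that mark_published() reads and rewrites the JSON file once per article. If that becomes a concern, one option is a batch variant that reads once, merges with set.update(), and writes once. The function name mark_many_published is hypothetical, and the temporary directory is used only to keep the sketch self-contained.

```python
import json
import tempfile
from pathlib import Path


def mark_many_published(slugs: list[str], path: Path) -> None:
    """Hypothetical batch variant: read once, update the set, write once."""
    data = json.loads(path.read_text()) if path.exists() else {"published": []}
    published = set(data.get("published", []))
    published.update(slugs)  # set.update() adds every element of an iterable
    data["published"] = sorted(published)
    path.write_text(json.dumps(data, indent=2))


with tempfile.TemporaryDirectory() as tmp:
    queue_path = Path(tmp) / "publish_queue.json"
    mark_many_published(["python-sets", "python-generators"], queue_path)
    mark_many_published(["python-sets"], queue_path)  # duplicate is absorbed
    stored = json.loads(queue_path.read_text())["published"]

print(stored)
# ['python-generators', 'python-sets']
```

The trade-off is that a crash partway through a batch would leave none of that batch recorded, whereas the per-article version records each slug as soon as it is published.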

When NOT to Use Sets

Sets are powerful but not universal. Avoid them when:

Order matters. Sets are unordered; you cannot rely on iteration order or index into a set with my_set[0]. Use a list, a plain dict (insertion-ordered since Python 3.7), or collections.OrderedDict if sequence is meaningful.
You need duplicates. Sets silently discard repeated values. If multiplicity matters (a shopping cart with qty > 1, a word-frequency count), use a list or collections.Counter.
Elements are unhashable. Sets can only hold hashable objects. Lists and dicts cannot be set elements; frozensets can, and so can tuples as long as everything inside them is hashable.
You need the nth item. There is no slicing on sets. Convert to a sorted list first if you need positional access.
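The cases above can be illustrated in a few lines. This is a sketch with made-up sample data, showing one reasonable alternative for each situation rather than the only one.

```python
from collections import Counter

words = ["python", "sets", "python", "lists", "python"]

# Duplicates matter: Counter keeps multiplicity, a set would not.
counts = Counter(words)
print(counts["python"])        # 3
print(len(set(words)))         # 3 unique words

# Unhashable elements: a list cannot go in a set, a tuple of strings can.
list_rejected = False
try:
    {["python", "sets"]}
except TypeError:
    list_rejected = True
    print("lists are unhashable")
pairs = {("python", "sets"), ("python", "lists")}

# Positional access: sort into a list first.
ordered = sorted(set(words))
print(ordered[0])              # lists
```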

A quick decision rule: if your primary operations are "does this item exist?" and "give me the unique items", reach for a set. If you need order, indexing, or duplicates, reach for a list.

Automating a content pipeline — tracking published slugs, filtering already-processed URLs, deduplicating API responses — is exactly where sets pay off most. If you want to see how these patterns fit together in a full Python publishing workflow, the pipeline guide linked above walks through the complete system.
