If you have ever written a loop to check whether a value already exists in a list, or used a list just to collect unique items, you have been solving a problem that Python sets handle natively, and far more efficiently. Sets give you average-case O(1) membership testing, automatic deduplication, and a suite of mathematical operations that make filtering and comparing collections of data almost trivial.
Free: AI Publishing Checklist, 7 steps in Python · Full pipeline: germy5.gumroad.com/l/xhxkzz (pay what you want, min $9.99)
Why Sets? O(1) Lookup vs O(n) List
The core reason to reach for a set instead of a list is performance. When you write value in my_list, Python scans the elements from the beginning until it finds a match, which is O(n) time. When you write value in my_set, Python hashes the value and probes the hash table directly, which is O(1) time on average, regardless of whether the set has 10 items or 10 million.
import time
data_list = list(range(1_000_000))
data_set = set(data_list)
# List: scans up to 1,000,000 elements
start = time.perf_counter()
999_999 in data_list
print(f"List lookup: {time.perf_counter() - start:.6f}s")
# Set: one hash lookup
start = time.perf_counter()
999_999 in data_set
print(f"Set lookup: {time.perf_counter() - start:.6f}s")
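On a typical machine the set lookup is orders of magnitude faster for a collection of this size, and the gap grows as the collection grows. A single timing like this is noisy, so treat the printed numbers as a rough indication rather than a benchmark.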
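For a steadier comparison, the same measurement can be repeated with the standard-library timeit module. This is a minimal sketch: the collection size and the repeat count are arbitrary values chosen for illustration, and absolute timings will differ between machines.

```python
import timeit

# Arbitrary size chosen for illustration; larger collections widen the gap.
N = 100_000
data_list = list(range(N))
data_set = set(data_list)
target = N - 1  # worst case for the list: the last element

# Repeat each lookup many times so timer overhead and noise average out.
list_time = timeit.timeit(lambda: target in data_list, number=200)
set_time = timeit.timeit(lambda: target in data_set, number=200)

print(f"List: {list_time:.6f}s for 200 lookups")
print(f"Set:  {set_time:.6f}s for 200 lookups")
```

Searching for the last element is deliberately the worst case for the list; a target near the front would narrow the difference.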
Sets also deduplicate automatically. Converting a list to a set is the idiomatic one-liner for removing duplicates:
tags = ["python", "tutorial", "python", "beginners", "tutorial"]
unique_tags = list(set(tags))
# e.g. ['beginners', 'python', 'tutorial'] (order not guaranteed)
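When the original order matters, a common alternative is to deduplicate through dict keys, which keep insertion order (guaranteed since Python 3.7). The sketch below reuses the tags list from the example; the variable names ordered_unique and sorted_unique are only illustrative.

```python
tags = ["python", "tutorial", "python", "beginners", "tutorial"]

# dict keys are unique and keep insertion order, so this removes
# duplicates while keeping the first occurrence of each tag.
ordered_unique = list(dict.fromkeys(tags))
print(ordered_unique)  # ['python', 'tutorial', 'beginners']

# sorted() is an alternative when any deterministic order will do.
sorted_unique = sorted(set(tags))
print(sorted_unique)  # ['beginners', 'python', 'tutorial']
```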
Creating Sets
There are three ways to create a set:
# 1. Literal syntax: curly braces with values
languages = {"python", "rust", "go"}

# 2. set() constructor: from any iterable
from_list = set([1, 2, 3, 2, 1])  # {1, 2, 3}
from_str = set("hello")           # {'h', 'e', 'l', 'o'}

# 3. Empty set: MUST use set(), not {}
empty = set()  # {} creates an empty dict, not a set
One trap beginners hit: {} creates an empty dict. Always use set() for an empty set.
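A quick way to confirm this in an interpreter is to inspect the types directly. The variable names below are only illustrative.

```python
empty_braces = {}
empty_set = set()
one_item = {"go"}

print(type(empty_braces).__name__)  # dict
print(type(empty_set).__name__)     # set
print(type(one_item).__name__)      # set

# An empty set prints as set(), since {} already means an empty dict.
print(empty_set)                    # set()
```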
Set Operations: Union, Intersection, Difference, Symmetric Difference
Python sets support the full algebra of mathematical sets through readable operators:
published = {"intro-python", "python-lists", "python-dicts", "python-sets"}
drafted = {"python-sets", "python-generators", "python-typing"}
# Union: all slugs in either collection
all_slugs = published | drafted
# {'intro-python', 'python-lists', 'python-dicts',
#  'python-sets', 'python-generators', 'python-typing'}

# Intersection: slugs in both (already published AND drafted)
overlap = published & drafted
# {'python-sets'}

# Difference: drafted but NOT yet published
needs_review = drafted - published
# {'python-generators', 'python-typing'}

# Symmetric difference: in one but not both
unique_to_each = published ^ drafted
# {'intro-python', 'python-lists', 'python-dicts',
#  'python-generators', 'python-typing'}
Each operator has a method equivalent (union(), intersection(), difference(), symmetric_difference()) that accepts any iterable, not just other sets, which is useful when your source data hasn't been converted yet.
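The difference between the two forms shows up as soon as one operand is not a set. In this small sketch the incoming list is made-up sample data: the method call succeeds, while the operator raises TypeError.

```python
published = {"intro-python", "python-lists"}
incoming = ["python-lists", "python-sets"]  # a plain list, not a set

# The method form accepts any iterable.
combined = published.union(incoming)
print(sorted(combined))  # ['intro-python', 'python-lists', 'python-sets']

# The operator form requires both operands to be sets.
operator_failed = False
try:
    published | incoming
except TypeError as exc:
    operator_failed = True
    print(f"TypeError: {exc}")
```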
Membership Testing with in
The in operator is where sets shine in everyday code:
STOP_WORDS = {"the", "a", "an", "is", "it", "in", "on", "at"}
def extract_keywords(text: str) -> list[str]:
    words = text.lower().split()
    return [w for w in words if w not in STOP_WORDS]

print(extract_keywords("Python is a great language for automation"))
# ['python', 'great', 'language', 'for', 'automation']
Making STOP_WORDS a set instead of a list costs nothing in readability but makes each in check instant rather than linear.
Set Methods: add, remove, discard, pop, clear
Sets are mutable and come with a practical set of mutation methods:
seen = {"slug-a", "slug-b"}
seen.add("slug-c") # adds slug-c; no-op if already present
seen.remove("slug-a") # removes slug-a; raises KeyError if missing
seen.discard("slug-z") # removes if present; silent no-op if missing
item = seen.pop() # removes and returns an arbitrary element
seen.clear() # empties the set in place
The distinction between remove() and discard() matters in pipelines: use discard() when absence is not an error, remove() when you want to catch unexpected state.
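A short sketch of that distinction, using a made-up slug that was never added to the set:

```python
seen = {"slug-a", "slug-b"}

# discard(): absence is fine, nothing happens.
seen.discard("slug-z")

# remove(): absence is treated as a bug worth surfacing.
missing = None
try:
    seen.remove("slug-z")
except KeyError as exc:
    missing = exc.args[0]
    print(f"Unexpected state: {missing!r} was never tracked")

print(sorted(seen))  # ['slug-a', 'slug-b']
```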
Frozen Sets: Immutable and Hashable
A frozenset is an immutable version of a set. Once created it cannot be modified. This unlocks two capabilities that regular sets lack: it can be used as a dictionary key and as an element inside another set.
# frozenset as a dict key: useful for caching combination results
cache: dict[frozenset, str] = {}

def get_combined_tag_slug(tags: set[str]) -> str:
    key = frozenset(tags)
    if key not in cache:
        cache[key] = "-".join(sorted(tags))
    return cache[key]

print(get_combined_tag_slug({"python", "tutorial"}))
# 'python-tutorial'
print(get_combined_tag_slug({"tutorial", "python"})) # cache hit
# 'python-tutorial'
# frozenset as an element of another set
pipelines = {
    frozenset({"fetch", "parse", "publish"}),
    frozenset({"fetch", "validate", "upload"}),
}
frozenset supports all the read-only set operations (union, intersection, in, etc.) but none of the mutation methods.
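The following sketch illustrates both halves of that statement, plus the reason a regular set cannot stand in as a dict key. The variable names are only illustrative.

```python
base = frozenset({"fetch", "parse"})

# Read-only operations work and return a new frozenset.
extended = base | {"publish"}
print(type(extended).__name__)  # frozenset
print("fetch" in base)          # True

# Mutation methods simply do not exist on frozenset.
print(hasattr(base, "add"))     # False

# A regular set is unhashable, so it cannot be a dict key.
set_is_hashable = True
try:
    {set(["fetch"]): "value"}
except TypeError:
    set_is_hashable = False
print(set_is_hashable)          # False
```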
Set Comprehensions
Just like list and dict comprehensions, Python has set comprehensions:
slugs = [
    "python-lists",
    "Python-Lists",  # duplicate after normalization
    "python-sets",
    "PYTHON-SETS",
]
normalized = {slug.lower() for slug in slugs}
# {'python-lists', 'python-sets'}
Set comprehensions are a clean way to transform and deduplicate in one expression. You can also add a filter condition:
# Only slugs that start with "python"
python_slugs = {s.lower() for s in slugs if s.lower().startswith("python")}
Real Pattern: Deduplicating Published Article Slugs
Here is a realistic use case from an automated publishing pipeline. You have a JSON file of published slugs and an incoming batch of candidate articles. You want to skip anything already published without scanning a list on every check:
import json
from pathlib import Path

def load_published_slugs(path: Path) -> set[str]:
    if not path.exists():
        return set()
    data = json.loads(path.read_text())
    return set(data.get("published", []))

def filter_new_articles(
    candidates: list[dict],
    published: set[str],
) -> list[dict]:
    return [a for a in candidates if a["slug"] not in published]

def mark_published(slug: str, path: Path) -> None:
    data = json.loads(path.read_text()) if path.exists() else {"published": []}
    published = set(data["published"])
    published.add(slug)
    data["published"] = sorted(published)
    path.write_text(json.dumps(data, indent=2))

# Usage
queue_path = Path("publish_queue.json")
published_slugs = load_published_slugs(queue_path)
candidates = [
    {"slug": "python-sets", "title": "Python Sets"},
    {"slug": "python-generators", "title": "Python Generators"},
    {"slug": "intro-python", "title": "Intro to Python"},  # already published
]
to_publish = filter_new_articles(candidates, published_slugs)
for article in to_publish:
    # ... publish article ...
    mark_published(article["slug"], queue_path)
The published_slugs set makes not in published a single hash lookup for every candidate, keeping the pipeline fast even with hundreds of tracked slugs.
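One case the filter above does not cover is the same slug appearing twice within a single batch. A possible extension, sketched here with a hypothetical helper named filter_new_unique and made-up sample data, tracks slugs seen during the pass in a working copy of the set:

```python
published = {"intro-python", "python-lists"}
candidates = [
    {"slug": "python-sets", "title": "Python Sets"},
    {"slug": "intro-python", "title": "Intro to Python"},
    {"slug": "python-sets", "title": "Python Sets (duplicate entry)"},
]

def filter_new_unique(candidates: list[dict], published: set[str]) -> list[dict]:
    """Hypothetical helper: skip published slugs and repeats within the batch."""
    seen = set(published)  # copy, so the caller's set is not mutated
    fresh = []
    for article in candidates:
        slug = article["slug"]
        if slug not in seen:
            seen.add(slug)
            fresh.append(article)
    return fresh

to_publish = filter_new_unique(candidates, published)
print([a["slug"] for a in to_publish])  # ['python-sets']
```

Copying the published set keeps the helper free of side effects, at the cost of one extra pass over the tracked slugs per batch.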
When NOT to Use Sets
Sets are powerful but not universal. Avoid them when:
- Order matters. Sets are unordered; you cannot rely on iteration order or index into a set with my_set[0]. Use a list or collections.OrderedDict if sequence is meaningful.
- You need duplicates. Sets silently discard repeated values. If multiplicity matters (a shopping cart with qty > 1, a word-frequency count), use a list or collections.Counter.
- Elements are unhashable. Sets can only hold hashable objects. Lists and dicts cannot be set elements; frozenset and tuple can (a tuple only if everything inside it is hashable).
- You need the nth item. There is no indexing or slicing on sets. Convert to a sorted list first if you need positional access.
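The cases above can be seen in a few lines. This is an illustrative sketch with made-up sample data, not a complete guide to the alternatives.

```python
from collections import Counter

words = ["set", "list", "set", "dict", "set"]

# A set keeps only which words occurred.
print(sorted(set(words)))  # ['dict', 'list', 'set']

# A Counter also keeps how often each one occurred.
counts = Counter(words)
print(counts["set"])       # 3

# Positional access needs an ordered sequence, e.g. a sorted list.
ordered = sorted(set(words))
print(ordered[0])          # dict

# Unhashable elements such as lists are rejected outright.
list_rejected = False
try:
    {["a", "b"]}
except TypeError:
    list_rejected = True
print(list_rejected)       # True
```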
A quick decision rule: if your primary operations are "does this item exist?" and "give me the unique items", reach for a set. If you need order, indexing, or duplicates, reach for a list.
Automating a content pipeline (tracking published slugs, filtering already-processed URLs, deduplicating API responses) is exactly where sets pay off most. If you want to see how these patterns fit together in a full Python publishing workflow, the pipeline guide linked above walks through the complete system.