When dealing with large datasets in Python, memory consumption quickly becomes a bottleneck. Loading a 10GB CSV file or processing millions of API responses can crash your program or bring your server to a crawl. That's where generators come in.
In this tutorial, you'll learn how to use Python generators to process data efficiently, build streaming pipelines, and write cleaner code — all while keeping memory usage flat.
What Are Python Generators?
A generator function is a function that uses yield to produce values one at a time, pausing between yields and keeping its local state. Unlike a regular function that builds a full list and returns it, a generator produces items lazily, on demand.
Here's the simplest generator:
    def count_up_to(n):
        count = 1
        while count <= n:
            yield count
            count += 1

    for num in count_up_to(5):
        print(num)  # 1, 2, 3, 4, 5
The keyword yield is what makes this work. When Python reaches a yield, it suspends the function's frame, which holds the local variables and the position of the next instruction, and hands the value to the caller. The rest of the call stack is not saved; only the generator's own frame is. On the next next() call, execution resumes right after the yield with that state intact.
This state-suspension mechanism is what makes generators fundamentally different from regular functions. A normal function creates a new stack frame every time you call it. A generator keeps its frame alive between calls.
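To watch the suspension happen, a generator can be stepped by hand. The sketch below repeats count_up_to so it runs on its own and inspects the generator with inspect.getgeneratorstate and the gi_frame attribute; the variable names (gen, states, local_count) are illustrative only.

```python
import inspect

def count_up_to(n):
    count = 1
    while count <= n:
        yield count
        count += 1

gen = count_up_to(2)
states = [inspect.getgeneratorstate(gen)]      # created, body has not run yet

first = next(gen)                              # runs up to the first yield
states.append(inspect.getgeneratorstate(gen))
local_count = gen.gi_frame.f_locals["count"]   # local variable kept in the paused frame

second = next(gen)                             # resumes right after the yield
try:
    next(gen)                                  # loop condition fails, generator finishes
    finished = False
except StopIteration:
    finished = True
states.append(inspect.getgeneratorstate(gen))

print(first, second, local_count, finished)    # 1 2 1 True
print(states)  # ['GEN_CREATED', 'GEN_SUSPENDED', 'GEN_CLOSED']
```

A for loop does the same thing behind the scenes: it calls next() repeatedly and stops when StopIteration is raised.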
Generators vs Lists: The Memory Difference
Let's compare a list-based approach with a generator-based one:
| Aspect | List | Generator |
|---|---|---|
| Memory usage | Grows with data size | Constant (fixed) |
| Speed for single iteration | Fast | Fast |
| Can be indexed | Yes (list[5]) | No |
| Reusable | Yes (multiple iterations) | No (exhausted after one pass) |
| Creation syntax | [x for x in range(10)] | (x for x in range(10)) |
Consider this real-world example — reading a large file:
    # List-based: loads everything into memory
    def read_lines_list(filepath):
        with open(filepath, 'r') as f:
            return f.readlines()

    # Generator-based: streams one line at a time
    def read_lines_generator(filepath):
        with open(filepath, 'r') as f:
            for line in f:
                yield line.strip()
For a 500MB log file, the list version needs at least 500MB of RAM, and in practice more because every line becomes a separate string object. The generator version holds only the current line plus a small read buffer, regardless of file size. On a production server with limited memory, that can decide whether the application stays healthy or is killed for running out of memory.
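A rough way to see the gap without a large file is to compare object sizes with sys.getsizeof. This is only a sketch: getsizeof reports the container itself (for a list, its array of references) and not the elements it points to, so the real footprint of the list is larger than the number shown. The names as_list and as_gen are illustrative.

```python
import sys

n = 1_000_000
as_list = [x for x in range(n)]   # every element exists up front
as_gen = (x for x in range(n))    # nothing has been computed yet

list_bytes = sys.getsizeof(as_list)  # the list's reference array only
gen_bytes = sys.getsizeof(as_gen)    # the generator object and its frame

print(f"list: {list_bytes:,} bytes, generator: {gen_bytes:,} bytes")
```

The generator's size stays the same no matter how large n is, because it stores only where it is in the sequence.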
Generator Expressions
Generator expressions are the lazy sibling of list comprehensions. Instead of square brackets, use parentheses:
    # List comprehension: builds the entire list in memory
    squares_list = [x**2 for x in range(1000000)]

    # Generator expression: lazy evaluation
    squares_gen = (x**2 for x in range(1000000))

    print(type(squares_list))  # <class 'list'>
    print(type(squares_gen))   # <class 'generator'>
You can pass generator expressions directly to functions that accept iterables:
    nums = [1, 3, 4, 7]  # sample input
    total = sum(x**2 for x in range(1000000))  # no intermediate list
    has_even = any(x % 2 == 0 for x in nums)   # stops at the first even number
The built-in functions sum(), any(), all(), min(), and max() all accept any iterable, which makes them natural consumers of generator expressions. Because no intermediate list is built, peak memory stays low even for very large inputs.
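The short-circuiting behavior of any() can be made visible by recording which values the generator actually produces. In this sketch, noisy_numbers and the produced list are hypothetical helpers added purely for the demonstration.

```python
def noisy_numbers(limit, log):
    """Yield 1..limit, recording each value that is actually produced."""
    for n in range(1, limit + 1):
        log.append(n)
        yield n

produced = []
found = any(n % 2 == 0 for n in noisy_numbers(1_000_000, produced))

print(found, produced)  # True [1, 2]
```

Only two of the million possible values were ever generated: any() stopped pulling as soon as it saw the first even number.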
Building Data Pipelines with Generators
The real power of generators shines when you chain them into processing pipelines. Each stage transforms data and hands it to the next stage, with no intermediate storage.
    def read_sales(filepath):
        with open(filepath, 'r') as f:
            for line in f:
                yield line.strip().split(',')

    def filter_international(rows):
        for row in rows:
            if row[2] != 'US':
                yield row

    def calculate_totals(rows):
        # Assumed column layout: id, name, country, qty, price
        for row in rows:
            quantity, price = int(row[3]), float(row[4])
            yield {'name': row[1], 'country': row[2], 'qty': quantity,
                   'price': price, 'total': quantity * price}

    # Pipeline: each stage pulls one row at a time from the previous one
    pipeline = calculate_totals(filter_international(read_sales('sales.csv')))
    for record in pipeline:
        print(f"{record['name']}: ${record['total']:.2f}")
Each function is a standalone generator. You can test each one individually and reuse it in other pipelines, and the chain never holds more than one row at a time. The tools in Python's itertools module compose in the same lazy way, as do many data processing frameworks.
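Because every stage accepts any iterable, a stage can be exercised with a plain list instead of a file. The sketch below tests filter_international on its own; the sample rows and their column layout (id, name, country, qty, price) are made up for illustration.

```python
def filter_international(rows):
    for row in rows:
        if row[2] != 'US':
            yield row

# Hypothetical in-memory rows standing in for lines of sales.csv
sample_rows = [
    ['1', 'Widget', 'US', '2', '9.99'],
    ['2', 'Gadget', 'DE', '1', '24.50'],
    ['3', 'Gizmo', 'JP', '5', '3.00'],
]

result = list(filter_international(sample_rows))
names = [row[1] for row in result]
print(names)  # ['Gadget', 'Gizmo']
```

No file, no fixtures, and no other pipeline stage is needed to check that the filter behaves as intended.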
Infinite Generators
Generators can represent infinite sequences — something impossible with lists:
    def fibonacci():
        a, b = 0, 1
        while True:
            yield a
            a, b = b, a + b

    fib = fibonacci()
    first_10 = [next(fib) for _ in range(10)]
    # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
Common use cases for infinite generators:
- Retry with backoff: yield increasing wait times between API retries
- Rate limiting: yield tokens from a leaky bucket algorithm
- ID generation: yield unique sequential identifiers within a single process
- Streaming data: yield from a WebSocket or Kafka consumer
The consumer controls how many values to pull, making infinite generators safe when consumed correctly.
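As one possible take on the retry-with-backoff use case, the sketch below yields an endless series of wait times and lets the caller decide how many to take with itertools.islice. The function name backoff_delays and its parameters (base, factor, cap) are assumptions chosen for the example, not part of any library.

```python
from itertools import islice

def backoff_delays(base=0.5, factor=2.0, cap=30.0):
    """Yield wait times forever: base, base*factor, ... never exceeding cap."""
    delay = base
    while True:
        yield min(delay, cap)
        delay *= factor

# The consumer, not the generator, decides when to stop
first_delays = list(islice(backoff_delays(), 8))
print(first_delays)  # [0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

Calling list() directly on backoff_delays() would never return, which is why a bounded consumer such as islice, a break statement, or an explicit next() count is required.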
yield from — Delegating to Sub-generators
Python 3.3 introduced yield from, which lets you yield values from another iterable or generator:
    def flatten(nested):
        for item in nested:
            if isinstance(item, (list, tuple)):
                yield from flatten(item)
            else:
                yield item

    nested = [1, [2, [3, 4], 5], 6]
    print(list(flatten(nested)))  # [1, 2, 3, 4, 5, 6]
Without yield from, you'd need an explicit inner loop that re-yields every item. With it, the delegation is a single line. yield from also sets up a two-way channel: values sent to the outer generator via send() are forwarded to the inner generator automatically, and the inner generator's return value becomes the value of the yield from expression.
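The forwarding behavior is easier to follow with a small example. In this sketch, running_sum and outer are hypothetical names: the inner generator receives numbers through send(), and returns its total once it is sent None.

```python
def running_sum():
    """Inner generator: accumulate sent values, return the total on None."""
    total = 0
    while True:
        value = yield total
        if value is None:
            return total
        total += value

def outer():
    # send() calls on outer() are forwarded to running_sum();
    # its return value becomes the value of the yield from expression
    result = yield from running_sum()
    yield f"final total: {result}"

gen = outer()
next(gen)                 # prime: runs to the first yield inside running_sum
a = gen.send(5)
b = gen.send(7)
summary = gen.send(None)  # inner generator returns, outer continues

print(a, b, summary)  # 5 12 final total: 12
```

The caller only ever talks to outer(), yet every sent value reaches the inner generator untouched.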
Advanced: Bidirectional Communication with send() and throw()
Generators aren't just data sources — they can receive data too. The send() method lets you pass values back into a running generator:
    def echo():
        while True:
            received = yield
            print(f"Received: {received}")

    gen = echo()
    next(gen)          # Prime the generator (advance to first yield)
    gen.send("hello")  # Received: hello
    gen.send("world")  # Received: world
More practically, send() is useful for coroutine-style patterns such as running aggregates and state machines:
    def accumulator():
        total = 0
        while True:
            value = yield total
            total += value

    acc = accumulator()
    next(acc)            # Prime
    print(acc.send(10))  # 10
    print(acc.send(20))  # 30
    print(acc.send(30))  # 60
The throw() method raises an exception inside the generator at the yield where it is paused, and close() raises GeneratorExit at that point so the generator can run its cleanup code and stop.
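A minimal sketch of both methods, assuming a hypothetical worker generator that records what happens to it in a list it shares with the caller:

```python
def worker():
    log = []
    try:
        while True:
            try:
                item = yield log
                log.append(f"got {item}")
            except ValueError as exc:
                # Raised by throw() at the paused yield; handle it and keep going
                log.append(f"recovered from {exc}")
    finally:
        # Runs when close() raises GeneratorExit at the paused yield
        log.append("cleaned up")

gen = worker()
log = next(gen)                      # prime; the generator yields its log list
gen.send("a")
gen.throw(ValueError("bad input"))   # handled inside the generator
gen.close()                          # triggers the finally block

print(log)  # ['got a', 'recovered from bad input', 'cleaned up']
```

If the generator did not catch the ValueError, throw() would propagate it back to the caller and the generator would be finished.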
Practical Patterns
1. Batch Processing
Process data in fixed-size chunks to control memory:
    def batched(iterable, size):
        batch = []
        for item in iterable:
            batch.append(item)
            if len(batch) == size:
                yield batch
                batch = []
        if batch:
            yield batch

    for batch in batched(range(100), 10):
        # process 10 items at a time
        print(sum(batch))
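An alternative sketch of the same idea uses itertools.islice to pull each chunk, which keeps the function to a few lines. The name batched_islice is hypothetical. On Python 3.12 and later, the standard library also offers itertools.batched, which yields tuples rather than lists.

```python
from itertools import islice

def batched_islice(iterable, size):
    """Yield lists of up to `size` items, pulling lazily from any iterable."""
    iterator = iter(iterable)
    # islice returns an empty list once the iterator is exhausted, ending the loop
    while batch := list(islice(iterator, size)):
        yield batch

sums = [sum(batch) for batch in batched_islice(range(100), 10)]
print(sums[0], sums[-1])                  # 45 945
print(list(batched_islice(range(5), 2)))  # [[0, 1], [2, 3], [4]]
```

Calling iter() once up front matters: it ensures every islice call continues from where the previous one stopped instead of restarting the sequence.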
2. Progress Tracking
Wrap any generator to add progress reporting:
    def with_progress(gen, total=None):
        processed = 0
        for item in gen:
            yield item
            processed += 1
            if processed % 1000 == 0:
                # Show the total only when the caller supplied one
                suffix = f"/{total}" if total else ""
                print(f"Progress: {processed}{suffix}")

    lines = with_progress(read_lines_generator('huge_file.log'))
    for line in lines:
        pass  # process line
3. Exception-Safe Resource Cleanup
Generators support try/finally for cleanup — even if the caller breaks out early:
    def managed_resource():
        try:
            print("Resource opened")
            yield "data"
        finally:
            print("Resource cleaned up")

    res = managed_resource()
    print(next(res))  # "data"
    # If we don't exhaust the generator...
    res.close()  # Triggers finally block
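This is the same mechanism that contextlib.contextmanager builds on. Keep in mind that the finally block runs only when the generator is exhausted, closed, or garbage-collected, so call close() explicitly when you stop early and the cleanup matters.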
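When cleanup has to happen at a predictable moment, one option is to wrap the same try/finally generator with contextlib.contextmanager from the standard library and use it in a with statement. The events list below is a hypothetical stand-in for real resource handling.

```python
from contextlib import contextmanager

events = []

@contextmanager
def managed_resource():
    events.append("opened")
    try:
        yield "data"
    finally:
        # Runs when the with block exits, normally or via an exception
        events.append("cleaned up")

with managed_resource() as value:
    events.append(f"using {value}")

print(events)  # ['opened', 'using data', 'cleaned up']
```

The with statement guarantees the finally block runs as soon as the block ends, without relying on close() or garbage collection.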
Performance Benchmarks
Here's a quick comparison processing 10 million integers:
| Method | Memory (MB) | Time (s) |
|---|---|---|
| List comprehension | 320 | 0.45 |
| Generator expression | 0.01 | 0.52 |
| Manual for-loop with list | 320 | 0.50 |
| Generator function | 0.01 | 0.55 |
In this comparison the generator versions take roughly 10 to 20 percent longer but use orders of magnitude less memory. For I/O-bound tasks (reading files, calling APIs), that CPU overhead is usually negligible because I/O dominates the running time. It is also why HTTP clients such as httpx let you stream a response body chunk by chunk through an iterator instead of loading it all at once.
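Numbers like these depend on the machine and Python version, so it is worth measuring on your own setup. The sketch below shows one way to compare peak memory with the standard tracemalloc module; peak_bytes is a hypothetical helper, and N is kept far below 10 million so the example runs quickly. It does not reproduce the table above.

```python
import tracemalloc

def peak_bytes(fn):
    """Run fn() and return the peak memory traced while it ran."""
    tracemalloc.start()
    try:
        fn()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak

N = 200_000  # small on purpose; raise it to approximate larger workloads

list_peak = peak_bytes(lambda: sum([x * 2 for x in range(N)]))  # builds a list first
gen_peak = peak_bytes(lambda: sum(x * 2 for x in range(N)))     # streams values

print(f"list: {list_peak / 1e6:.2f} MB, generator: {gen_peak / 1e6:.4f} MB")
```

For timing, the standard timeit module can be applied to the same two lambdas.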
Common Pitfalls
- Exhausting a generator: You can only iterate once. Use list() or itertools.tee() if you need multiple passes.
- Side effects in generator expressions: Since they're lazy, side effects may not execute when you expect:
      # Prints nothing until iterated
      effects = (print(x) for x in [1, 2, 3])
- Recursion depth: Deeply nested yield from chains can hit recursion limits. For truly nested structures, prefer an explicit stack.
- Forgetting to prime: Generators that use send() need an initial next() call to advance to the first yield. Forgetting this raises TypeError: can't send non-None value to a just-started generator.
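For the recursion-depth pitfall, one way to write the explicit-stack version is to keep a stack of iterators, one per nesting level. The name flatten_iterative is hypothetical, and the deeply nested test structure is constructed only to exceed Python's default recursion limit.

```python
def flatten_iterative(nested):
    """Flatten arbitrarily deep lists/tuples without recursion."""
    stack = [iter(nested)]            # one iterator per nesting level, innermost last
    while stack:
        try:
            item = next(stack[-1])
        except StopIteration:
            stack.pop()               # this level is finished, go back up
            continue
        if isinstance(item, (list, tuple)):
            stack.append(iter(item))  # descend into the sub-list
        else:
            yield item

# Build a list nested 5,000 levels deep: [4999, [4998, [... [1, [0]]]]]
deep = [0]
for i in range(1, 5000):
    deep = [i, deep]

flat = list(flatten_iterative(deep))
print(flat[:3], flat[-1], len(flat))  # [4999, 4998, 4997] 0 5000
```

The depth is limited only by available memory for the stack list, not by the interpreter's recursion limit.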
Summary
Python generators are not an advanced curiosity — they're a practical tool every developer should reach for when processing data at scale. The key takeaways:
- Use generators when working with files, streams, or large collections
- Prefer generator expressions over list comprehensions for one-pass iteration
- Chain generators into pipelines to decompose complex transformations
- Leverage yield from to flatten nested structures cleanly
- Always handle cleanup with try/finally in resource-holding generators
- Use send() for bidirectional communication in stateful processing
Start using generators in your next data pipeline, and you should see peak memory usage drop sharply while your code becomes more readable. Whether you're processing server logs, transforming API responses, or building ETL pipelines, generators will make your Python code leaner and more maintainable.
Happy coding!