Python Generators Deep Dive Part 1: Lazy Evaluation & Memory Optimization 🚀
The Problem: Memory Bloat in Data Processing
You've hit this before: processing a large dataset crashes your application with MemoryError. The culprit? Loading everything into memory at once.
# Processing a 50GB log file
def analyze_logs(filename):
    with open(filename) as f:
        lines = f.readlines()  # Loads entire 50GB into RAM
        return [line for line in lines if 'ERROR' in line]

# Result: MemoryError (or system freeze)
Root cause: Eager evaluation - computing all values before you need them.
Solution: Lazy evaluation with generators.
What Are Generators?
Generators are iterators that produce values on-demand using lazy evaluation. Instead of computing all values upfront, they:
- Compute one value at a time
- Pause execution after yielding
- Preserve local state between calls
- Resume from the exact pause point
Memory footprint: Constant O(1), regardless of data size.
Eager vs Lazy Evaluation
Eager (Lists)
def squares(n):
    return [i**2 for i in range(n)]

data = squares(1_000_000)
# - Computes all 1M values immediately
# - Stores all 1M values in memory
# - Memory: ~8.4 MB
# - Time to first value: only after the whole list has been built
Lazy (Generators)
def squares(n):
    for i in range(n):
        yield i**2

data = squares(1_000_000)
# - Computes values on next() call
# - Stores only current state
# - Memory: 112 bytes
# - Time to first value: Microseconds
Key difference: yield instead of return
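The time-to-first-value comments above are easy to check yourself. A rough measurement sketch (the two squares versions are renamed here so both fit in one snippet; absolute numbers depend on your machine and Python version):

import time

def eager_squares(n):
    return [i**2 for i in range(n)]

def lazy_squares(n):
    for i in range(n):
        yield i**2

start = time.perf_counter()
first = eager_squares(1_000_000)[0]    # the whole list is built before we see anything
print(f"list:      first value after {time.perf_counter() - start:.4f} s")

start = time.perf_counter()
first = next(lazy_squares(1_000_000))  # only one value is ever computed
print(f"generator: first value after {time.perf_counter() - start:.6f} s")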
The yield Keyword Explained
yield transforms a function into a generator factory. When called, it returns a generator object without executing the function body.
def countdown(n):
    while n > 0:
        yield n    # Pause here, return n, save state
        n -= 1

gen = countdown(3)   # No execution yet
type(gen)            # <class 'generator'>
next(gen)            # 3 - Executes until first yield, then pauses
next(gen)            # 2 - Resumes, executes until next yield
next(gen)            # 1 - Resumes again
next(gen)            # StopIteration - No more yields
Execution model:
Call countdown(3):
  → Creates generator object
  → Saves reference to function code
  → Does NOT execute function body

Call next(gen):
  → Starts/resumes execution
  → Runs until yield statement
  → Saves execution state (locals, instruction pointer)
  → Returns yielded value
  → Pauses (GEN_SUSPENDED state)

Call next(gen) again:
  → Resumes from saved instruction pointer
  → Continues execution with preserved locals
  → Repeat process
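These states are observable with inspect.getgeneratorstate, which reports GEN_CREATED, GEN_RUNNING, GEN_SUSPENDED, or GEN_CLOSED. A small sketch reusing the countdown generator:

import inspect

def countdown(n):
    while n > 0:
        yield n
        n -= 1

gen = countdown(3)
print(inspect.getgeneratorstate(gen))  # GEN_CREATED   - body has not started yet
next(gen)
print(inspect.getgeneratorstate(gen))  # GEN_SUSPENDED - paused at a yield
list(gen)                              # drain the remaining values (2, 1)
print(inspect.getgeneratorstate(gen))  # GEN_CLOSED    - body has finished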
Generator State Management
When a generator pauses, Python stores:
def example():
    x = 0            # Stored in frame locals
    while x < 3:
        y = x ** 2
        yield y      # Pause point
        x += 1

gen = example()
next(gen)  # Returns 0
# Internal state at this point:
# {
#     'x': 0,   # Still 0 - the x += 1 after the yield only runs on resume
#     'y': 0,   # Last computed value
# }
# plus the saved instruction pointer (bytecode offset of the paused yield)
Memory breakdown:
- Generator object: ~112 bytes
- Local variables: Stored in frame (f_locals)
- Instruction pointer: Stored in frame (f_lasti)
- No data values stored (only state)
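You can verify this yourself: CPython exposes the paused frame through the generator's gi_frame attribute. Treat this as an inspection sketch (these are implementation details, not a public API):

def example():
    x = 0
    while x < 3:
        y = x ** 2
        yield y
        x += 1

gen = example()
next(gen)                      # returns 0, pauses at the yield
print(gen.gi_frame.f_locals)   # {'x': 0, 'y': 0} - locals preserved across the pause
print(gen.gi_frame.f_lasti)    # bytecode offset of the paused yield
next(gen)                      # resumes: x += 1 runs, y becomes 1, pauses again
print(gen.gi_frame.f_locals)   # {'x': 1, 'y': 1}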
Memory Comparison: Real Numbers
import sys
# List comprehension
numbers = [i for i in range(1_000_000)]
print(sys.getsizeof(numbers)) # 8,448,728 bytes (~8.4 MB)
# Generator expression
numbers = (i for i in range(1_000_000))
print(sys.getsizeof(numbers)) # 112 bytes
# Reduction: 75,000x
Scaling:
| Items | List Memory | Generator Memory | Ratio |
|---|---|---|---|
| 1K | 8 KB | 112 B | 73x |
| 1M | 8.4 MB | 112 B | 75,000x |
| 1B | 8.4 GB | 112 B | 75,000,000x |
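The exact byte counts vary a little between Python versions, but the shape of the table is easy to reproduce. A minimal measurement sketch (1B is left out because building that list would itself need roughly 8 GB of RAM):

import sys

for n in (1_000, 1_000_000):
    as_list = [i for i in range(n)]
    as_gen = (i for i in range(n))
    print(f"n={n:>9,}: list={sys.getsizeof(as_list):>10,} bytes, "
          f"generator={sys.getsizeof(as_gen)} bytes")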
Generator Expressions
Syntactic sugar for simple generators:
# List comprehension (eager)
squares = [x**2 for x in range(1000)]
# Generator expression (lazy)
squares = (x**2 for x in range(1000))
# Usage in functions
total = sum(x**2 for x in range(1_000_000)) # Memory efficient
# Better than: sum([x**2 for x in range(1_000_000)])
Rule: Use () instead of [] for lazy evaluation.
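The same rule pays off with short-circuiting built-ins like any() and all(): fed a generator expression, they stop pulling values as soon as the answer is known, while a list comprehension computes every element up front. A small illustration:

# Stops as soon as the first match is found - only ~10,000 values are computed
found = any(n > 0 and n % 9_999 == 0 for n in range(10_000_000))

# Builds and stores all 10M booleans before any() even starts
found = any([n > 0 and n % 9_999 == 0 for n in range(10_000_000)])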
Practical Example: Log File Streaming
Problem: Analyzing 50GB Log File
Naive approach (crashes):
def find_errors(filename):
    with open(filename) as f:
        lines = f.readlines()  # 50GB in RAM
        return [l for l in lines if 'ERROR' in l]
Generator approach (constant memory):
def find_errors(filename):
    with open(filename) as f:
        for line in f:          # File objects are lazy iterators - one line at a time
            if 'ERROR' in line:
                yield line.strip()

# Usage
for error in find_errors('app.log'):
    process(error)  # Only one line in memory at a time
Benefits:
- ✅ Memory: O(1) - constant
- ✅ Starts immediately (no upfront loading)
- ✅ Handles files of any size
- ✅ Can terminate early (no wasted computation)
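That last point is worth seeing concretely: because find_errors is lazy, stopping after a few matches means the rest of the file is never read at all. A sketch using itertools.islice with the same find_errors and 'app.log' from above:

from itertools import islice

# Look at only the first 10 errors; the remaining gigabytes are never touched
for error in islice(find_errors('app.log'), 10):
    print(error)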
Iteration Protocol
Generators implement the iterator protocol:
gen = (x for x in range(3))

# Manual iteration
try:
    while True:
        value = next(gen)
        print(value)
except StopIteration:
    pass

# Automatic iteration (preferred)
gen = (x for x in range(3))  # recreate - the manual loop above exhausted it
for value in gen:
    print(value)  # for loop handles StopIteration for you
Under the hood:
# This:
for item in generator:
    process(item)

# Is equivalent to:
iterator = iter(generator)  # Generators are their own iterators
while True:
    try:
        item = next(iterator)
        process(item)
    except StopIteration:
        break
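One detail worth confirming: a generator is its own iterator, which is why the iter() call in the expansion above is effectively a no-op:

gen = (x for x in range(3))
print(iter(gen) is gen)  # True - iter() hands back the generator itself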
Generator Exhaustion
Important: Generators are single-use iterators.
gen = (x**2 for x in range(5))
# First iteration
list(gen) # [0, 1, 4, 9, 16]
# Second iteration
list(gen) # [] - Generator exhausted!
# Must recreate
gen = (x**2 for x in range(5))
list(gen) # [0, 1, 4, 9, 16] - Works again
Design implication: Use generators when you need one pass. Use lists for multiple iterations.
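If you do need several passes over generated values, the two usual options are to materialize once with list() or to recreate the generator from its function each time. A quick sketch:

def squares(n):
    for i in range(n):
        yield i**2

# Option 1: materialize once, iterate as often as you like
values = list(squares(5))
total, largest = sum(values), max(values)

# Option 2: call the generator function again for each pass
total = sum(squares(5))
largest = max(squares(5))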
Performance Characteristics
Time Complexity
| Operation | List | Generator |
|---|---|---|
| Creation | O(n) | O(1) |
| Iteration | O(n) | O(n) |
| Random access | O(1) | N/A |
| Multiple passes | O(n) each | Recreate each time |
Space Complexity
| Operation | List | Generator |
|---|---|---|
| Storage | O(n) | O(1) |
| Iteration | O(n) | O(1) |
Conclusion: Generators trade random access and re-iterability for memory efficiency.
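The O(1) versus O(n) creation cost shows up clearly with timeit (a rough sketch; the numbers themselves will vary by machine and Python version):

import timeit

# Creating the list computes all 1M values up front
print(timeit.timeit("[i**2 for i in range(1_000_000)]", number=10))  # total for 10 runs: on the order of a second

# Creating the generator computes nothing yet
print(timeit.timeit("(i**2 for i in range(1_000_000))", number=10))  # total for 10 runs: on the order of microseconds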
When to Use Generators
✅ Use Generators For:
- Large datasets that don't fit in memory
- Streaming data (logs, network streams, sensors)
- Single-pass processing (filter → transform → aggregate)
- Infinite sequences (Fibonacci, primes, etc.)
- Pipeline processing (chaining transformations)
- Data that's expensive to compute (defer computation)
❌ Use Lists For:
- Small datasets (< 10,000 items typically)
- Multiple iterations required
- Random access needed (items[42])
- Length required (len(items))
- Slicing needed (items[10:20])
- Sorting/reversing in-place
Practical Examples
Example 1: Infinite Sequence
def fibonacci():
    """Infinite Fibonacci generator"""
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

# Usage
from itertools import islice
first_10 = list(islice(fibonacci(), 10))
# [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
Example 2: Data Pipeline
def read_file(path):
    with open(path) as f:
        for line in f:
            yield line.strip()

def filter_comments(lines):
    for line in lines:
        if not line.startswith('#'):
            yield line

def to_uppercase(lines):
    for line in lines:
        yield line.upper()

# Chain generators
pipeline = to_uppercase(filter_comments(read_file('data.txt')))
# Memory: 3 × 112 bytes = 336 bytes
# (vs loading entire file into memory)

for line in pipeline:
    process(line)
Example 3: Batch Processing
def batch(iterable, size):
    """Yield items from iterable in lists of at most size items"""
    chunk = []                 # named 'chunk' so it doesn't shadow the function name
    for item in iterable:
        chunk.append(item)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:                  # emit the final, possibly short, batch
        yield chunk

# Process database records in batches
for user_batch in batch(query_all_users(), 1000):
    bulk_update(user_batch)
Common Pitfalls
Pitfall 1: Assuming Multiple Iterations
gen = (x for x in range(5))
sum(gen) # 10
sum(gen) # 0 - exhausted!
Fix: Convert to list if you need multiple passes.
Pitfall 2: Debugging Difficulties
gen = (expensive_computation(x) for x in data)
# What's in gen? Can't easily inspect without consuming it
Fix: Use lists during development if you need to inspect intermediate values.
Pitfall 3: Delayed Execution Side Effects
# This doesn't execute immediately!
gen = (print(x) for x in range(5)) # No output yet
# Must iterate to trigger execution
list(gen) # Now prints 0, 1, 2, 3, 4
Best Practices
- Default to generator expressions for transformations
# Good
total = sum(x**2 for x in data)
# Bad (unless you need the list)
total = sum([x**2 for x in data])
- Use itertools for common patterns
from itertools import islice, chain
# Take first N items
first_100 = list(islice(large_generator(), 100))
# Chain multiple generators
combined = chain(gen1(), gen2(), gen3())
- Close generators that manage resources (see the cleanup sketch after this list)
gen = file_reader('data.txt')
try:
    process(gen)
finally:
    gen.close()  # Triggers cleanup
- Document if your function returns a generator
def read_logs(path):
    """
    Read log file line by line.

    Yields:
        str: Each log line
    """
    with open(path) as f:
        for line in f:
            yield line.strip()
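What "triggers cleanup" in practice 3 means: gen.close() raises GeneratorExit inside the paused generator, so any finally block (or with statement) wrapped around the yield gets a chance to run. Here is one possible definition of the file_reader used above, sketched to show that behavior:

def file_reader(path):
    f = open(path)
    try:
        for line in f:
            yield line.strip()
    finally:
        f.close()        # runs on normal exhaustion AND when gen.close() is called
        print("file closed")

gen = file_reader('data.txt')
print(next(gen))         # read just one line
gen.close()              # GeneratorExit is raised at the paused yield -> finally runs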
Summary
🎯 Key Concepts:
- Generators use lazy evaluation (compute on-demand)
- yield pauses execution and preserves state
- Memory: O(1) constant, regardless of data size
- Single-use - exhaust after one iteration
- Perfect for streaming large or infinite data
🔑 Remember:
- Lists: Eager, multiple access, O(n) memory
- Generators: Lazy, single access, O(1) memory
- Use generators by default for large data processing
Quick Reference
# Generator function
def gen_func():
    yield value

# Generator expression
gen_expr = (x for x in iterable)

# Iteration
for item in generator:
    process(item)

# Manual control
value = next(generator)

# Convert to list (materializes all values)
items = list(generator)
What's Next?
In Part 2, we'll explore:
- 🔬 Generator internals (frames, bytecode, state storage)
- 📡 send() method (bidirectional communication)
- 🛑 close() and throw() methods
- 🔗 yield from delegation
- ⚙️ Advanced generator patterns
Next: Python Generators Part 2: How They Actually Work (The Magic Revealed)
You now understand what generators are and why they exist. Part 2 reveals how they work under the hood.