Python Iterators: Clearing the Lazy/Eager Confusion
As a backend developer, I sometimes help companies evaluate candidates by reviewing their recorded technical interviews. However, over time, Iโve noticed a deeply ingrained misconception. When discussing memory management or data streaming, many developers explicitly state:
"Iterators in Python are inherently eager. If you want true lazy loading or lazy evaluation, you have to use generators and the yield keyword."
This misconception is common. Many popular bootcamps and online courses introduce lazy evaluation exclusively through generators. Custom class-based iterators are usually skipped or dismissed as boilerplate-heavy OOP theory rarely used in production Python.
This confusion is further reinforced by two common educational simplifications:
- The List vs. Generator Expression Analogy: Beginners are taught that square brackets
[...](list comprehensions) are eager and take up memory, while parentheses(...)(generator expressions) are lazy. This often creates a false binary mental model: "generators = lazy, everything else = eager." - Standard "Textbook" Examples: When courses demonstrate a custom iterator, they usually write a basic class that accepts an already fully loaded list in its
__init__and simply increments an index in__next__. While this is valid for in-memory data, it leads developers to assume that custom iterators inherently require loading all data upfront.
In reality, generators are a specialized language feature designed to implement the iterator protocol automatically. They comply with the exact same interface (__iter__ and __next__). A generator is lazy not because of some magical property of the yield keyword, but simply because it adheres to this underlying contract.
To show that custom iterators can be lazy without using any generators or yield keywords, Iโve put together a lightweight and reproducible benchmark.
๐งช The Experiment: Proving Lazy Loading with Custom Iterators
Suppose we need to read a database export file (test_users_db.txt) containing 1_000_000 rows of user data. We want to iterate over these users but stop immediately once we find a specific record (at HALF_ROWS=500_000). If custom iterators were truly "eager," our lazy implementation would significantly increase RAM consumption.
๐ก Please note: Iโm using a local text file here as a simplified data source. However, in a production system, this exact same pattern is used to stream rows from a heavy SQL database cursor, fetch paginated data from an external HTTP API, or consume messages from a queue.
Let's look at the code:
from __future__ import annotations
import tracemalloc
from dataclasses import dataclass
from pathlib import Path
FILE_PATH = Path("test_users_db.txt")
TOTAL_ROWS = 1_000_000
HALF_ROWS = TOTAL_ROWS // 2
def generate_test_file(path: Path, rows: int = TOTAL_ROWS) -> None:
if not path.exists():
with open(path, "w", encoding="utf-8") as f:
for i in range(1, rows + 1):
f.write(f"{i},User_Name_{i}\n")
@dataclass(frozen=True)
class UserDTO:
id: int
name: str
# --- LAZY IMPLEMENTATION ---
class LazyUserGateway:
def __init__(self, file_object):
self.file = file_object
def __iter__(self) -> LazyUserIterator:
return LazyUserIterator(self.file)
class LazyUserIterator:
def __init__(self, file_object):
self.file = file_object
def __iter__(self) -> LazyUserIterator:
return self
def __next__(self) -> UserDTO:
# Step-by-step stream: reading only one line into memory at a time
line = self.file.readline()
if not line:
raise StopIteration
user_id, name = line.strip().split(",")
return UserDTO(id=int(user_id), name=name)
# --- EAGER IMPLEMENTATION ---
class EagerUserGateway:
def __init__(self, file_object):
self.file = file_object
def __iter__(self) -> EagerUserIterator:
return EagerUserIterator(self.file)
class EagerUserIterator:
def __init__(self, file_object):
# The readlines() call forces Python to build a massive list of strings in RAM upfront
self.lines = file_object.readlines()
self.index = 0
def __iter__(self) -> EagerUserIterator:
return self
def __next__(self) -> UserDTO:
if self.index >= len(self.lines):
raise StopIteration
line = self.lines[self.index]
self.index += 1
user_id, name = line.strip().split(",")
return UserDTO(id=int(user_id), name=name)
# --- BENCHMARKS ---
def test_eager_gateway() -> float:
tracemalloc.start()
with open(FILE_PATH, "r", encoding="utf-8") as f:
gateway = EagerUserGateway(f)
for user in gateway:
if user.id == HALF_ROWS:
print(f"[Eager] Final reached: {user}")
break
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
return peak / (1024 * 1024)
def test_lazy_gateway() -> float:
tracemalloc.start()
with open(FILE_PATH, "r", encoding="utf-8") as f:
gateway = LazyUserGateway(f)
for user in gateway:
if user.id == HALF_ROWS:
print(f"[Lazy] Final reached: {user}")
break
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
return peak / (1024 * 1024)
if __name__ == "__main__":
generate_test_file(FILE_PATH)
print("\n--- Memory Benchmark Running ---")
peak_eager = test_eager_gateway()
print(f"-> Peak Memory (EagerUserGateway): {peak_eager:.2f} MB\n")
peak_lazy = test_lazy_gateway()
print(f"-> Peak Memory (LazyUserGateway): {peak_lazy:.2f} MB\n")
if FILE_PATH.exists():
FILE_PATH.unlink()
๐ Benchmark Results
Running the script yields the following terminal output:
--- Memory Benchmark Running ---
[Eager] Final reached: UserDTO(id=500000, name='User_Name_500000')
-> Peak Memory (EagerUserGateway): 69.97 MB
[Lazy] Final reached: UserDTO(id=500000, name='User_Name_500000')
-> Peak Memory (LazyUserGateway): 0.15 MB
This represents a ~500x reduction in RAM consumption (dropping from ~69.97MB to 0.15MB). While the eager implementation scales linearly with the dataset size, the lazy iterator maintains a flat, near-zero memory footprint regardless of the row count.
๐ง Iterators vs. Generators: Contract vs. Convenience
In practice, generators are the industry standard for a reason. In most daily tasks, writing a quick generator function with yield is much faster, safer, and cleaner than writing custom iterator classes. The generator object itself handles the boilerplate, implicitly saves execution state, and natively facilitates resource cleanup.
But as engineers, we shouldn't confuse implementation-specific capabilities with a structural interface contract, especially when designing architectural boundaries:
- An Iterator is just a contract (interface). It guarantees that data can be traversed step-by-step via
__next__. How that data is managed โ whether the iterator eagerly fetches it upfront, lazily streams it from an API, or simply acts as a pointer to data passed to it from the outside โ is entirely up to the developer who implements it. An iterator doesn't even need to own or fetch data; it just drives the traversal logic. - A Generator is a specialized language feature that auto-implements this protocol while supporting additional lifecycle and communication methods (via
.send(),.throw(), and.close()). For basic data streaming, you don't need this additional complexity. A custom iterator class gives you a clean, one-way contract strictly limited to data traversal, keeping your architectural boundaries well-defined.
Custom class-based iterators deliver the exact same polymorphic interface. They aren't meant to replace generators everywhere. Instead, they prove that lazy evaluation is an architectural design choice, not a magic trick powered strictly by yield.
๐ Conclusion
The "iterators are eager" take is just a byproduct of simplified educational content. Diving deeper into core protocols helps us write much better, more predictable architectural layers.
๐ฌ What about you? Have you ever heard that custom Python iterators are inherently eager?
Top comments (0)