Kunwar Jhamat

Posted on • Originally published at decyon.com

Iterator Patterns: How to Process Millions of Records Without Running Out of Memory

You have 100,000 rows in your database. You need to process each one. The obvious approach loads everything into an array, loops through it, and writes the results. This works fine with 1,000 records. With 100,000, your server hits the memory limit and crashes.

This is not just a tutorial on PHP iterators. This is about understanding what actually happens in memory and why iterator patterns keep memory usage constant regardless of dataset size.

What Actually Happens in Memory

Let us start with the simplest case: three database records, each 1KB in size.

The Array Approach

// Assuming $database is a PDO instance or similar wrapper
$rows = $database->query("SELECT * FROM users")->fetchAll();
// At this point: ALL rows are loaded into memory

Step-by-step memory allocation:

  • Step 1: fetchAll() is called → PHP allocates memory for an empty array (~200 bytes).
  • Step 2: First row arrives (1KB) → Memory: ~1.2KB.
  • Step 3: Second row arrives (1KB) → Memory: ~2.2KB.
  • Step 4: Third row arrives (1KB) → Memory: ~3.2KB.

With 100,000 rows at 1KB each, that is 100MB. If your PHP memory_limit (e.g., in php.ini) is 64MB, the process crashes before you even process the first row.

Key insight: fetchAll() waits until every single row is loaded from the result set before returning control to your code.

Approach             Memory (1K Rows)   Memory (100K Rows)   Memory (1M Rows)
Array (fetchAll)     ~1 MB              ~100 MB              ~1 GB (Crash)
Iterator (cursor)    ~1 KB              ~1 KB                ~1 KB
Batched Iterator     ~1 MB              ~1 MB                ~1 MB

The Iterator Approach

Iterators use a "cursor" concept. They load one record, hand it to you, and then move on, minimizing the "resident" memory required.

// $cursor is typically an unbuffered PDOStatement or an iterator
$cursor = $database->query("SELECT * FROM users");
foreach ($cursor as $row) {
    // Do your work here — transform, validate, insert, etc.
    $name = trim($row['name']);
    $database->insert('clean_users', ['name' => $name, 'email' => $row['email']]); // hypothetical wrapper method
    // Memory used is just for this ONE row
}

The Step-by-Step:

  • Step 1: PHP asks the cursor/stream for the first row → 1KB allocated.
  • Step 2: Your code processes the row (transform, insert, etc.) → Memory remains ~1.2KB.
  • Step 3: Loop moves to the next iteration → PHP overwrites/releases row 1 and fetches row 2.
  • Step 4: Memory remains ~1.2KB.

Whether you have 3 rows or 3 million, your memory footprint stays flat.
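One caveat worth flagging: with PDO's MySQL driver, result sets are buffered client-side by default, which quietly reinstates the fetchAll() memory profile even inside a foreach. A configuration sketch for a true server-side cursor (the DSN and credentials are placeholders; other drivers handle this differently):

```php
// Hypothetical connection details — adjust for your environment.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'secret');

// MySQL-specific: disable client-side buffering so the statement
// behaves as a real cursor, fetching one row at a time from the server.
$pdo->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);

$stmt = $pdo->query("SELECT * FROM users");
foreach ($stmt as $row) {
    // Only this one row is resident in PHP memory.
}
```

Note that an unbuffered connection cannot run a second query until the current result set is fully consumed, which is a common trade-off of this mode.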

What yield Actually Does Internally

In PHP, generators use the yield keyword. Understanding its internal mechanics explains the magic. A generator function looks like a standard function but behaves like an iterator.

// A simple generator function
function getRows($pdo) {
    // We assume the PDO driver is configured for unbuffered queries
    $stmt = $pdo->query("SELECT * FROM users");
    while ($row = $stmt->fetch()) {
        yield $row; // Execution pauses HERE
    }
}

Under the hood, yield performs three main actions:

  1. Returns the value: The current $row is sent back to the loop/caller (e.g., foreach ($rows as $row)).
  2. Pauses execution: PHP "freezes" the function, remembering exactly which line it was on and the state of all local variables ($stmt, $row).
  3. Waits for next(): Execution stays paused until the loop asks for another item (the next iteration). No CPU is consumed while waiting, and only the small state context is kept in memory.

When the loop moves to the next iteration, PHP resumes the function immediately after the yield, continues the loop, fetches the next $row, and yields again.
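You can watch this pause/resume behavior without a database. The helper below is an invented stand-in for a cursor, used purely to illustrate the mechanics:

```php
// Minimal stand-in for a DB cursor: yields one fake row at a time.
function getFakeRows(int $count): Generator {
    for ($i = 1; $i <= $count; $i++) {
        // Execution pauses on this yield until the caller asks again.
        yield ['id' => $i, 'name' => "user_$i"];
    }
}

$gen = getFakeRows(3);
echo $gen->current()['id'], "\n"; // 1 — the function ran up to the first yield
$gen->next();                     // resumes just after yield, loops, yields again
echo $gen->current()['id'], "\n"; // 2
```

Nothing past the first yield executes until next() is called — the loop body, local variables, and position are all frozen in the generator's state.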

Combining Iterators with Batch Processing

Processing one record at a time is memory-safe but can be slow due to the overhead of many sequential operations, like database single-inserts. The "pro" move is batch processing with bounded memory.

/**
 * Processes data in memory-efficient batches.
 *
 * @param iterable $iterator A Generator or Iterator of records.
 * @param int $batchSize How many records to process at once.
 */
function processBatched(iterable $iterator, int $batchSize = 1000) {
    $batch = [];
    foreach ($iterator as $record) {
        $batch[] = $record;

        // When the batch is full, process and clear it
        if (count($batch) >= $batchSize) {
            bulkInsert($batch); // One DB call for 1000 records
            $batch = [];        // Crucial: Release the memory
        }
    }

    // Process any remaining records (the last partial batch)
    if (!empty($batch)) {
        bulkInsert($batch);
    }
}

Maximum memory usage is always (batch size x average record size). With 1,000 records at 1KB each, you never use more than ~1MB, whether processing 10,000 or 10,000,000 records.
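To see the batching arithmetic concretely, here is a self-contained variant of processBatched. It takes the insert function as a parameter (a testing convenience, not part of the original signature) and is fed by a fake in-memory cursor:

```php
// Fake cursor: simulates a large result set without a database.
function records(int $n): Generator {
    for ($i = 0; $i < $n; $i++) {
        yield ['id' => $i];
    }
}

function processBatched(iterable $iterator, int $batchSize, callable $bulkInsert): void {
    $batch = [];
    foreach ($iterator as $record) {
        $batch[] = $record;
        if (count($batch) >= $batchSize) {
            $bulkInsert($batch); // flush a full batch
            $batch = [];         // release the memory
        }
    }
    if ($batch !== []) {
        $bulkInsert($batch);     // flush the last partial batch
    }
}

$calls = 0;
processBatched(records(2500), 1000, function (array $batch) use (&$calls) {
    $calls++;
});
echo $calls, "\n"; // 3 bulk calls: 1000 + 1000 + 500
```

2,500 records with a batch size of 1,000 produces exactly three flushes, and at no point are more than 1,000 records held in the array.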

Choosing the Right Batch Size

Batch Size   Memory   DB Efficiency   Error Recovery       Best For
100          ~100KB   Moderate        Easy (small scope)   Strict memory limits
1,000        ~1MB     Good            Acceptable           General-purpose ETL
5,000        ~5MB     Excellent       Coarse               Fast networks / large records

Pro Tip: High batch sizes (e.g., >10,000) often show diminishing returns in database performance and make error recovery harder. If a batch of 10,000 fails, finding the single "bad" record that caused the constraint violation is like finding a needle in a haystack.
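One way to soften that recovery problem is a fallback strategy: attempt the bulk insert, and only if it fails retry the batch record-by-record to isolate the bad rows. A sketch — insertWithFallback and its callables are illustrative names, not a library API:

```php
/**
 * Hypothetical recovery strategy: try the whole batch in one call;
 * on failure, retry record-by-record to pinpoint the offending rows.
 * Returns the records that could not be inserted individually.
 */
function insertWithFallback(array $batch, callable $bulkInsert, callable $singleInsert): array {
    $failed = [];
    try {
        $bulkInsert($batch);
    } catch (Exception $e) {
        foreach ($batch as $record) {
            try {
                $singleInsert($record);
            } catch (Exception $e) {
                $failed[] = $record; // now we know exactly which row broke
            }
        }
    }
    return $failed;
}
```

The happy path still gets one DB call per batch; the slow record-by-record pass only runs for the rare batch that actually fails.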

Chaining Transformations (Lazy Evaluation)

What happens when you need to normalize, filter, and enrich the data?

// (Pseudocode concept)
$stream = getRows($pdo); // Generator (0 memory)

$processedStream = $stream
    ->map(fn($row) => normalize($row))
    ->filter(fn($row) => $row['active'])
    ->map(fn($row) => enrich($row));

Each operation does not run yet; it returns a new iterator that wraps the previous one. This is lazy evaluation. The chain is a blueprint, not execution. When you finally iterate on $processedStream, each record flows through all transformations one at a time. Chaining ten transformations together still only holds ONE active record in memory at any given point in the pipeline.
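PHP generators do not actually ship with ->map()/->filter() methods — the fluent chain above assumes a collection library. The same lazy pipeline can be built with two tiny generator functions; a minimal self-contained sketch with invented sample data:

```php
// Lazy map/filter: each wraps an iterable in a new generator,
// so nothing runs until the final foreach pulls a record through.
function lazyMap(iterable $rows, callable $fn): Generator {
    foreach ($rows as $row) {
        yield $fn($row);
    }
}

function lazyFilter(iterable $rows, callable $fn): Generator {
    foreach ($rows as $row) {
        if ($fn($row)) {
            yield $row;
        }
    }
}

// Stand-in for getRows(): two fake records.
$rows = (function () {
    yield ['name' => '  Ada ', 'active' => true];
    yield ['name' => 'Bob',    'active' => false];
})();

$pipeline = lazyMap(
    lazyFilter(
        lazyMap($rows, fn($r) => ['name' => trim($r['name'])] + $r), // normalize
        fn($r) => $r['active']                                       // filter
    ),
    fn($r) => $r['name']                                             // enrich/project
);

$result = [];
foreach ($pipeline as $name) {
    $result[] = $name; // only "Ada" survives the filter
}
```

Each record flows through all three stages before the next one is fetched — exactly one active record in the pipeline at any time.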

Common Anti-Patterns

  • iterator_to_array(): converts the efficient stream back into a giant, memory-crashing array. Solution: stay inside the foreach loop.
  • Logging every record: if you store detailed logs in a local array, that array grows unbounded. Solution: log summaries per batch or stream directly to a file/service.
  • Accumulating errors: storing every "failed" row in a $failedRows array for later processing grows unbounded too. Solution: write errors to a dedicated file or 'error' database table immediately.

When Iterator Patterns "Break"

Some operations require state — meaning they need to see the whole dataset to work. You cannot stream these.

  • Sorting: You cannot know which item sorts first until you have seen the very last one. (Solution: Push this to the database with ORDER BY).
  • Deduplication: To know if record #1,000,000 is a duplicate, you have to remember the unique keys of all previous 999,999 records. (Solution: Use database constraints or a bloom filter/LRU cache for very large sets).
  • Aggregations: Sums, averages, and counts usually need the full set. (Solution: Use database functions like SUM() or COUNT()).

The general rule: if an operation needs to see more than one record at a time to complete, it belongs in the database layer.
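That said, when the number of distinct keys is known to be moderate, streaming deduplication can work with a bounded in-memory set. A sketch — with the explicit caveat that the $seen array grows with the number of distinct keys, so this is only safe when that count is small:

```php
// Streaming dedup: remembers only the keys seen so far, not the rows.
// Memory grows with DISTINCT keys, not with total rows — safe only
// when the key space is moderate; otherwise use a DB unique constraint.
function dedupe(iterable $rows, callable $key): Generator {
    $seen = [];
    foreach ($rows as $row) {
        $k = $key($row);
        if (!isset($seen[$k])) {
            $seen[$k] = true;
            yield $row;
        }
    }
}
```

For example, deduplicating a stream on $row['id'] passes each id through only once while still processing one row at a time.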

Key Takeaways

  1. Never load all data at once: Stop using fetchAll() for large datasets; switch to unbuffered queries/cursors or generators.
  2. Understand yield internally: It returns one value, pauses the function, and resumes only when the caller asks for the next item.
  3. Batch your writes: Stream data one record at a time to keep memory safe, but execute expensive operations (like DB inserts) in chunks of 500-2,000.
  4. Watch for anti-patterns: iterator_to_array() and error accumulation silently undo all your memory savings.
  5. Push stateful operations: Let the database handle sorting, deduplication, and aggregation (SQL).

This is not just a PHP thing. Whether it is Python Generators, Java Streams, or Go Channels, the principle is identical: process and release, process and release.


This is part of the ETL Pipeline Series on DECYON — real engineering patterns from 20+ years of building production systems.

Read the full version with interactive diagrams and benchmarks at decyon.com
