Kunwar Jhamat

Posted on • Originally published at decyon.com

Iterator Patterns: How to Process Millions of Records Without Running Out of Memory

You have 100,000 rows in your database. You need to process each one. The obvious approach loads everything into an array, loops through it, and writes the results. This works fine with 1,000 records. With 100,000, your server hits the memory limit and crashes.

This is not just a tutorial on PHP iterators. This is about understanding what actually happens in memory and why iterator patterns keep memory usage constant regardless of dataset size.

What Actually Happens in Memory

Let us start with the simplest case: three database records, each 1KB in size.

The Array Approach

// Assuming $database is a PDO instance or similar wrapper
$rows = $database->query("SELECT * FROM users")->fetchAll();
// At this point: ALL rows are loaded into memory

Step-by-step memory allocation:

  • Step 1: fetchAll() is called → PHP allocates memory for an empty array (~200 bytes).
  • Step 2: First row arrives (1KB) → Memory: ~1.2KB.
  • Step 3: Second row arrives (1KB) → Memory: ~2.2KB.
  • Step 4: Third row arrives (1KB) → Memory: ~3.2KB.

With 100,000 rows at 1KB each, that is 100MB. If your PHP memory_limit (e.g., in php.ini) is 64MB, the process crashes before you even process the first row.

Key insight: fetchAll() waits until every single row is loaded from the result set before returning control to your code.

Approach             Memory (1K Rows)   Memory (100K Rows)   Memory (1M Rows)
Array (fetchAll)     ~1 MB              ~100 MB              ~1 GB (Crash)
Iterator (cursor)    ~1 KB              ~1 KB                ~1 KB
Batched Iterator     ~1 MB              ~1 MB                ~1 MB

The Iterator Approach

Iterators use a "cursor" concept. They load one record, hand it to you, and then move on, minimizing the "resident" memory required.

// $cursor is typically an unbuffered PDOStatement or an iterator
$cursor = $database->query("SELECT * FROM users");
foreach ($cursor as $row) {
    // Do your work here — transform, validate, insert, etc.
    $name = trim($row['name']);
    $database->insert('clean_users', ['name' => $name, 'email' => $row['email']]); // hypothetical wrapper method
    // Memory used is just for this ONE row
}

The Step-by-Step:

  • Step 1: PHP asks the cursor/stream for the first row → 1KB allocated.
  • Step 2: Your code processes the row (transform, insert, etc.) → Memory remains ~1.2KB.
  • Step 3: Loop moves to the next iteration → PHP overwrites/releases row 1 and fetches row 2.
  • Step 4: Memory remains ~1.2KB.

Whether you have 3 rows or 3 million, your memory footprint stays flat.
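One caveat worth flagging: with PDO's MySQL driver, result sets are buffered client-side by default, which quietly reinstates the fetchAll() memory profile even inside a foreach. A configuration sketch for a true server-side cursor (the DSN and credentials are placeholders; other drivers handle this differently):

```php
// Hypothetical connection details — adjust for your environment.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'secret');

// MySQL-specific: disable client-side buffering so the statement
// behaves as a real cursor, fetching one row at a time from the server.
$pdo->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);

$stmt = $pdo->query("SELECT * FROM users");
foreach ($stmt as $row) {
    // Only this one row is resident in PHP memory.
}
```

Note that an unbuffered connection cannot run a second query until the current result set is fully consumed, which is a common trade-off of this mode.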

What yield Actually Does Internally

In PHP, generators use the yield keyword. Understanding its internal mechanics explains the magic. A generator function looks like a standard function but behaves like an iterator.

// A simple generator function
function getRows($pdo) {
    // We assume the PDO driver is configured for unbuffered queries
    $stmt = $pdo->query("SELECT * FROM users");
    while ($row = $stmt->fetch()) {
        yield $row; // Execution pauses HERE
    }
}

Under the hood, yield performs three main actions:

  1. Returns the value: The current $row is sent back to the loop/caller (e.g., foreach ($rows as $row)).
  2. Pauses execution: PHP "freezes" the function, remembering exactly which line it was on and the state of all local variables ($stmt, $row).
  3. Waits for next(): Execution stays paused until the loop asks for another item (the next iteration). No CPU is consumed while waiting, and only the small state context is kept in memory.

When the loop moves to the next iteration, PHP resumes the function immediately after the yield, continues the loop, fetches the next $row, and yields again.
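You can watch this pause/resume behavior without a database. The helper below is an invented stand-in for a cursor, used purely to illustrate the mechanics:

```php
// Minimal stand-in for a DB cursor: yields one fake row at a time.
function getFakeRows(int $count): Generator {
    for ($i = 1; $i <= $count; $i++) {
        // Execution pauses on this yield until the caller asks again.
        yield ['id' => $i, 'name' => "user_$i"];
    }
}

$gen = getFakeRows(3);
echo $gen->current()['id'], "\n"; // 1 — the function ran up to the first yield
$gen->next();                     // resumes just after yield, loops, yields again
echo $gen->current()['id'], "\n"; // 2
```

Nothing past the first yield executes until next() is called — the loop body, local variables, and position are all frozen in the generator's state.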

Combining Iterators with Batch Processing

Processing one record at a time is memory-safe but can be slow due to the overhead of many sequential operations, like database single-inserts. The "pro" move is batch processing with bounded memory.

/**
 * Processes data in memory-efficient batches.
 *
 * @param iterable $iterator A Generator or Iterator of records.
 * @param int $batchSize How many records to process at once.
 */
function processBatched(iterable $iterator, int $batchSize = 1000) {
    $batch = [];
    foreach ($iterator as $record) {
        $batch[] = $record;

        // When the batch is full, process and clear it
        if (count($batch) >= $batchSize) {
            bulkInsert($batch); // One DB call for 1000 records
            $batch = [];        // Crucial: Release the memory
        }
    }

    // Process any remaining records (the last partial batch)
    if (!empty($batch)) {
        bulkInsert($batch);
    }
}

Maximum memory usage is always (batch size x average record size). With 1,000 records at 1KB each, you never use more than ~1MB, whether processing 10,000 or 10,000,000 records.
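To see the batching arithmetic concretely, here is a self-contained variant of processBatched. It takes the insert function as a parameter (a testing convenience, not part of the original signature) and is fed by a fake in-memory cursor:

```php
// Fake cursor: simulates a large result set without a database.
function records(int $n): Generator {
    for ($i = 0; $i < $n; $i++) {
        yield ['id' => $i];
    }
}

function processBatched(iterable $iterator, int $batchSize, callable $bulkInsert): void {
    $batch = [];
    foreach ($iterator as $record) {
        $batch[] = $record;
        if (count($batch) >= $batchSize) {
            $bulkInsert($batch); // flush a full batch
            $batch = [];         // release the memory
        }
    }
    if ($batch !== []) {
        $bulkInsert($batch);     // flush the last partial batch
    }
}

$calls = 0;
processBatched(records(2500), 1000, function (array $batch) use (&$calls) {
    $calls++;
});
echo $calls, "\n"; // 3 bulk calls: 1000 + 1000 + 500
```

2,500 records with a batch size of 1,000 produces exactly three flushes, and at no point are more than 1,000 records held in the array.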

Choosing the Right Batch Size

Batch Size   Memory   DB Efficiency   Error Recovery       Best For
100          ~100KB   Moderate        Easy (small scope)   Strict memory limits
1,000        ~1MB     Good            Acceptable           General-purpose ETL
5,000        ~5MB     Excellent       Coarse               Fast networks / large records

Pro Tip: High batch sizes (e.g., >10,000) often show diminishing returns in database performance and make error recovery harder. If a batch of 10,000 fails, finding the single "bad" record that caused the constraint violation is like finding a needle in a haystack.
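One way to soften that recovery problem is a fallback strategy: attempt the bulk insert, and only if it fails retry the batch record-by-record to isolate the bad rows. A sketch — insertWithFallback and its callables are illustrative names, not a library API:

```php
/**
 * Hypothetical recovery strategy: try the whole batch in one call;
 * on failure, retry record-by-record to pinpoint the offending rows.
 * Returns the records that could not be inserted individually.
 */
function insertWithFallback(array $batch, callable $bulkInsert, callable $singleInsert): array {
    $failed = [];
    try {
        $bulkInsert($batch);
    } catch (Exception $e) {
        foreach ($batch as $record) {
            try {
                $singleInsert($record);
            } catch (Exception $e) {
                $failed[] = $record; // now we know exactly which row broke
            }
        }
    }
    return $failed;
}
```

The happy path still gets one DB call per batch; the slow record-by-record pass only runs for the rare batch that actually fails.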

Chaining Transformations (Lazy Evaluation)

What happens when you need to normalize, filter, and enrich the data?

// (Pseudocode concept)
$stream = getRows($pdo); // Generator (0 memory)

$processedStream = $stream
    ->map(fn($row) => normalize($row))
    ->filter(fn($row) => $row['active'])
    ->map(fn($row) => enrich($row));

Each operation does not run yet; it returns a new iterator that wraps the previous one. This is lazy evaluation. The chain is a blueprint, not execution. When you finally iterate on $processedStream, each record flows through all transformations one at a time. Chaining ten transformations together still only holds ONE active record in memory at any given point in the pipeline.
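PHP generators do not actually ship with ->map()/->filter() methods — the fluent chain above assumes a collection library. The same lazy pipeline can be built with two tiny generator functions; a minimal self-contained sketch with invented sample data:

```php
// Lazy map/filter: each wraps an iterable in a new generator,
// so nothing runs until the final foreach pulls a record through.
function lazyMap(iterable $rows, callable $fn): Generator {
    foreach ($rows as $row) {
        yield $fn($row);
    }
}

function lazyFilter(iterable $rows, callable $fn): Generator {
    foreach ($rows as $row) {
        if ($fn($row)) {
            yield $row;
        }
    }
}

// Stand-in for getRows(): two fake records.
$rows = (function () {
    yield ['name' => '  Ada ', 'active' => true];
    yield ['name' => 'Bob',    'active' => false];
})();

$pipeline = lazyMap(
    lazyFilter(
        lazyMap($rows, fn($r) => ['name' => trim($r['name'])] + $r), // normalize
        fn($r) => $r['active']                                       // filter
    ),
    fn($r) => $r['name']                                             // enrich/project
);

$result = [];
foreach ($pipeline as $name) {
    $result[] = $name; // only "Ada" survives the filter
}
```

Each record flows through all three stages before the next one is fetched — exactly one active record in the pipeline at any time.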

Common Anti-Patterns

  • iterator_to_array(): converts the efficient stream back into a giant, memory-crashing array. Solution: stay inside the foreach loop.
  • Logging every record: if you store detailed logs in a local array, that array grows unbounded. Solution: log summaries per batch or stream directly to a file/service.
  • Accumulating errors: storing every "failed" row in a $failedRows array for later processing grows unbounded too. Solution: write errors to a dedicated file or 'error' database table immediately.

When Iterator Patterns "Break"

Some operations require state — meaning they need to see the whole dataset to work. You cannot stream these.

  • Sorting: You cannot know which item sorts first until you have seen the very last one. (Solution: Push this to the database with ORDER BY).
  • Deduplication: To know if record #1,000,000 is a duplicate, you have to remember the unique keys of all previous 999,999 records. (Solution: Use database constraints or a bloom filter/LRU cache for very large sets).
  • Aggregations: Sums, averages, and counts usually need the full set. (Solution: Use database functions like SUM() or COUNT()).

The general rule: if an operation needs to see more than one record at a time to complete, it belongs in the database layer.
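That said, when the number of distinct keys is known to be moderate, streaming deduplication can work with a bounded in-memory set. A sketch — with the explicit caveat that the $seen array grows with the number of distinct keys, so this is only safe when that count is small:

```php
// Streaming dedup: remembers only the keys seen so far, not the rows.
// Memory grows with DISTINCT keys, not with total rows — safe only
// when the key space is moderate; otherwise use a DB unique constraint.
function dedupe(iterable $rows, callable $key): Generator {
    $seen = [];
    foreach ($rows as $row) {
        $k = $key($row);
        if (!isset($seen[$k])) {
            $seen[$k] = true;
            yield $row;
        }
    }
}
```

For example, deduplicating a stream on $row['id'] passes each id through only once while still processing one row at a time.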

Key Takeaways

  1. Never load all data at once: Stop using fetchAll() for large datasets; switch to unbuffered queries/cursors or generators.
  2. Understand yield internally: It returns one value, pauses the function, and resumes only when the caller asks for the next item.
  3. Batch your writes: Stream data one record at a time to keep memory safe, but execute expensive operations (like DB inserts) in chunks of 500-2,000.
  4. Watch for anti-patterns: iterator_to_array() and error accumulation silently undo all your memory savings.
  5. Push stateful operations: Let the database handle sorting, deduplication, and aggregation (SQL).

This is not just a PHP thing. Whether it is Python Generators, Java Streams, or Go Channels, the principle is identical: process and release, process and release.


This is part of the ETL Pipeline Series on DECYON — real engineering patterns from 20+ years of building production systems.

Read the full version with interactive diagrams and benchmarks at decyon.com
