DEV Community

Cover image for High-Performance Bulk Insert with relationships
Anant Parmar
Anant Parmar

Posted on • Originally published at anantparmar.com

High-Performance Bulk Insert with relationships

Ever tried to bulk insert thousands of records that depend on each other?

If you have, you probably know the pain.

If you haven't, we're about to dive into a problem that'll make you question your life choices, and then show you a neat trick to solve it.

The Problem

Picture this: You're building a project management system and the templating feature requires the ability to save project templates and then create project instances for all clients.

Your project structure looks like this:

ProjectPhasesTasksSubtasks

We store a similar structure for templates, so we can reuse them across different clients - make sense?

Template ProjectTemplate PhasesTemplate TasksTemplate Subtasks

For simplicity, let’s assume - A Project template has about 5 phases, 10 tasks per phase, and 3 subtasks per task.

We have 100 clients, and are we’ll create a project for each of them from our project template.

So you're looking at generating:

  • 100 projects × 5 phases × 10 tasks × 3 subtasks = 15,000 related records

Sounds simple, right? Wrong. Here's the catch:

  • You need the Project ID to create Phases

  • You need the Phase ID to create Tasks

  • You need the Task ID to create Subtasks

  • You want to do this FAST (because reasons)

  • You want to avoid 15,000+ individual database queries

We’ll explore a scenario with Laravel, but the pattern can apply to any other language or framework.

The "Normal" Approaches (And Why They Hurt)

Approach 1: Individual Inserts

foreach ($clients as $client) {
    $project = Project::create([
        'name' => 'Website Development - ' . $client->name,
        'client_id' => $client->id,
        'start_date' => $startDate,
    ]);

    foreach ($template->phases as $phaseTemplate) {
        $phase = Phase::create([
            'project_id' => $project->id,
            'name' => $phaseTemplate->name,
            'order' => $phaseTemplate->order,
        ]);

        foreach ($phaseTemplate->tasks as $taskTemplate) {
            $task = Task::create([
                'phase_id' => $phase->id,
                'name' => $taskTemplate->name,
                'estimated_hours' => $taskTemplate->estimated_hours,
            ]);

            foreach ($taskTemplate->subtasks as $subtaskTemplate) {
                Subtask::create([
                    'task_id' => $task->id,
                    'name' => $subtaskTemplate->name,
                    'description' => $subtaskTemplate->description,
                ]);
            }
        }
    }
}
Enter fullscreen mode Exit fullscreen mode

Pros: Simple to understand and write

Cons: You're looking at 15,000+ database queries


Approach 2: Memory Mapping

// Step 1: Bulk insert all projects and keep references
$projectData = [];
foreach ($clients as $client) {
    $projectData[] = [
        'name' => 'Website Development - ' . $client->name,
        'client_id' => $client->id,
        'start_date' => $startDate,
        'created_at' => now(),
    ];
}

// Bulk insert projects
Project::insert($projectData);

// Get the inserted projects (this is where it gets tricky)
$insertedProjects = Project::where('name', 'LIKE', 'Website Development%')
    ->where('created_at', '>=', now()->subMinute())
    ->orderBy('id')
    ->get();

// Step 2: Prepare phases with real project IDs
$phaseData = [];
foreach ($insertedProjects as $index => $project) {
    foreach ($template->phases as $phaseTemplate) {
        $phaseData[] = [
            'project_id' => $project->id,
            'name' => $phaseTemplate->name,
            'order' => $phaseTemplate->order,
            'created_at' => now(),
        ];
    }
}

// Bulk insert phases
Phase::insert($phaseData);

// Get inserted phases and map them back...
$insertedPhases = Phase::whereIn('project_id', $insertedProjects->pluck('id'))
    ->where('created_at', '>=', now()->subMinute())
    ->orderBy('project_id', 'order')
    ->get();

// Step 3: Continue this pattern for tasks and subtasks...
// (This gets increasingly complex and error-prone)
Enter fullscreen mode Exit fullscreen mode

Pros:

  • Uses bulk inserts for better performance than individual creates

Cons:

  • Complex ID mapping logic that's prone to race conditions

  • Requires careful ordering and filtering to match inserted records

  • Gets exponentially more complex with each relationship level


Our Solution: The temp_identifier Trick

Here's where we got creative.

What if we could get the benefits of bulk inserts AND handle relationships properly?

The idea is simple:

  1. Add a temporary column to your tables

  2. Use it to track relationships during bulk operations

  3. Clean it up when you're done

Step 1: Add Temporary Columns

ALTER TABLE projects ADD COLUMN temp_identifier BIGINT NULL, ADD COLUMN batch_id VARCHAR(255) NULL;
ALTER TABLE phases ADD COLUMN temp_identifier BIGINT NULL, ADD COLUMN batch_id VARCHAR(255) NULL;  
ALTER TABLE tasks ADD COLUMN temp_identifier BIGINT NULL, ADD COLUMN batch_id VARCHAR(255) NULL;
ALTER TABLE subtasks ADD COLUMN batch_id VARCHAR(255) NULL;
Enter fullscreen mode Exit fullscreen mode

Step 2: Prepare Data with Temporary IDs

$batchId = uniqid(); // Unique identifier for this batch to prevent concurrency issues
$projectData = [];
$phaseData = [];
$taskData = [];
$subtaskData = [];
$projectCounter = 0;
$phaseCounter = 0;
$taskCounter = 0;

foreach ($clients as $client) {
    $tempProjectId = $projectCounter++;

    // Prepare project data
    $projectData[] = [
        'temp_identifier' => $tempProjectId,
        'batch_id' => $batchId,
        'name' => 'Recurring Task - ' . $client->name,
        'client_id' => $client->id,
        'start_date' => $startDate,
        'created_at' => now(),
        'updated_at' => now(),
    ];

    foreach ($template->phases as $phaseTemplate) {
        $tempPhaseId = $phaseCounter++;

        // Prepare phase data
        $phaseData[] = [
            'temp_identifier' => $tempPhaseId,
            'batch_id' => $batchId,
            'project_temp_identifier' => $tempProjectId, // Reference to project's temp_identifier
            'name' => $phaseTemplate->name,
            'order' => $phaseTemplate->order,
            'created_at' => now(),
            'updated_at' => now(),
        ];

        foreach ($phaseTemplate->tasks as $taskTemplate) {
            $tempTaskId = $taskCounter++;

            // Prepare task data
            $taskData[] = [
                'temp_identifier' => $tempTaskId,
                'batch_id' => $batchId,
                'phase_temp_identifier' => $tempPhaseId, // Reference to phase's temp_identifier
                'name' => $taskTemplate->name,
                'estimated_hours' => $taskTemplate->estimated_hours,
                'created_at' => now(),
                'updated_at' => now(),
            ];

            foreach ($taskTemplate->subtasks as $subtaskTemplate) {
                $subtaskData[] = [
                    'batch_id' => $batchId,
                    'task_temp_identifier' => $tempTaskId, // Reference to task's temp_identifier
                    'name' => $subtaskTemplate->name,
                    'description' => $subtaskTemplate->description,
                    'created_at' => now(),
                    'updated_at' => now(),
                ];
            }
        }
    }
}
Enter fullscreen mode Exit fullscreen mode

Step 3: Bulk Insert and Map IDs

// Bulk insert projects
    Project::insert($projectData);

    // Get the mapping of temp_identifier to real ID
    $projectMapping = Project::where('batch_id', $batchId)
        ->whereNotNull('temp_identifier')
        ->pluck('id', 'temp_identifier')
        ->toArray();

    // Update phase data with real project IDs
    foreach ($phaseData as &$phase) {
        $phase['project_id'] = $projectMapping[$phase['project_temp_identifier']] 
            ?? throw new Exception('Project temp_identifier not found');
        unset($phase['project_temp_identifier']);
    }

    // Bulk insert phases
    Phase::insert($phaseData);

    // Get phase mapping
    $phaseMapping = Phase::where('batch_id', $batchId)
        ->whereNotNull('temp_identifier')
        ->pluck('id', 'temp_identifier')
        ->toArray();

    // Update task data with real phase IDs
    foreach ($taskData as &$task) {
        $task['phase_id'] = $phaseMapping[$task['phase_temp_identifier']] 
            ?? throw new Exception('Phase temp_identifier not found');
        unset($task['phase_temp_identifier']);
    }

    // Bulk insert tasks
    Task::insert($taskData);

    // Get task mapping
    $taskMapping = Task::where('batch_id', $batchId)
        ->whereNotNull('temp_identifier')
        ->pluck('id', 'temp_identifier')
        ->toArray();

    // Update subtask data with real task IDs
    foreach ($subtaskData as &$subtask) {
        $subtask['task_id'] = $taskMapping[$subtask['task_temp_identifier']] 
            ?? throw new Exception('Task temp_identifier not found');
        unset($subtask['task_temp_identifier']);
    }

    // Bulk insert subtasks
    Subtask::insert($subtaskData);
Enter fullscreen mode Exit fullscreen mode

This should be wrapped in a DB::transaction with temporarily disabling foriegn key checks if required.

DB::transaction(function () use ($projectData, $phaseData, $taskData, $subtaskData, $batchId) {
    // Temporarily disable foreign key checks for bulk inserts
    DB::statement('SET FOREIGN_KEY_CHECKS=0;');


    // Bulk Operation

    // Re-enable foreign key checks
    DB::statement('SET FOREIGN_KEY_CHECKS=1;');
});
Enter fullscreen mode Exit fullscreen mode

Step 4: Clean Up

Project::where('batch_id', $batchId)->update(['temp_identifier' => null, 'batch_id' => null]);
Phase::where('batch_id', $batchId)->update(['temp_identifier' => null, 'batch_id' => null]);
Task::where('batch_id', $batchId)->update(['temp_identifier' => null, 'batch_id' => null]);
Subtask::where('batch_id', $batchId)->update(['batch_id' => null]);
Enter fullscreen mode Exit fullscreen mode

Pros:

  • Achieves true bulk insert performance with clean relationship handling

  • Simple integer lookups eliminate complex field matching

  • More predictable than approaches relying on insertion order or field matching

Cons:

  • Requires temporary database columns and schema changes

  • More complex setup than simpler approaches

  • Cleanup step needed after completion

When Should You Use This?

This approach works best when:

  • You're dealing with thousands of related records - multi-level relationships

Don't use it when:

  • You're inserting just a few hundred records (the overhead isn't worth it)

  • Your relationships are simple (just use insertGetId loops)

  • Your team isn't comfortable with the added complexity

Comparison

Let's say you're generating projects for 100 clients from a single template, each with 5 phases, 10 tasks per phase, and 3 subtasks per task:

Approach Database Queries Memory Usage Reliability
Individual Inserts ~20,600 Low High
Memory Mapping ~7 High Medium
temp_identifier ~10 High High

The next time you're staring at a bulk insert problem that seems too complex, remember: sometimes the answer is as simple as adding a temporary column and getting creative with how you use it.


I hope you find this valuable, if you did - share it with your folks who'll find it useful too.

If you have any suggestions/comments please feel free.

Happy coding!

Top comments (0)