Anant Parmar

Posted on Jul 29 • Originally published at anantparmar.com

High-Performance Bulk Insert with relationships

#laravel #database #webdev

Ever tried to bulk insert thousands of records that depend on each other?

If you have, you probably know the pain.

If you haven't, we're about to dive into a problem that'll make you question your life choices, and then show you a neat trick to solve it.

The Problem

Picture this: You're building a project management system and the templating feature requires the ability to save project templates and then create project instances for all clients.

Your project structure looks like this:

Project → Phases → Tasks → Subtasks

We store a similar structure for templates, so we can reuse them across different clients - make sense?

Template Project → Template Phases → Template Tasks → Template Subtasks

For simplicity, let’s assume - A Project template has about 5 phases, 10 tasks per phase, and 3 subtasks per task.

We have 100 clients, and are we’ll create a project for each of them from our project template.

So you're looking at generating:

100 projects × 5 phases × 10 tasks × 3 subtasks = 15,000 related records

Sounds simple, right? Wrong. Here's the catch:

You need the Project ID to create Phases
You need the Phase ID to create Tasks
You need the Task ID to create Subtasks
You want to do this FAST (because reasons)
You want to avoid 15,000+ individual database queries

We’ll explore a scenario with Laravel, but the pattern can apply to any other language or framework.

The "Normal" Approaches (And Why They Hurt)

Approach 1: Individual Inserts

foreach ($clients as $client) {
    $project = Project::create([
        'name' => 'Website Development - ' . $client->name,
        'client_id' => $client->id,
        'start_date' => $startDate,
    ]);

    foreach ($template->phases as $phaseTemplate) {
        $phase = Phase::create([
            'project_id' => $project->id,
            'name' => $phaseTemplate->name,
            'order' => $phaseTemplate->order,
        ]);

        foreach ($phaseTemplate->tasks as $taskTemplate) {
            $task = Task::create([
                'phase_id' => $phase->id,
                'name' => $taskTemplate->name,
                'estimated_hours' => $taskTemplate->estimated_hours,
            ]);

            foreach ($taskTemplate->subtasks as $subtaskTemplate) {
                Subtask::create([
                    'task_id' => $task->id,
                    'name' => $subtaskTemplate->name,
                    'description' => $subtaskTemplate->description,
                ]);
            }
        }
    }
}

Pros: Simple to understand and write

Cons: You're looking at 15,000+ database queries

Approach 2: Memory Mapping

// Step 1: Bulk insert all projects and keep references
$projectData = [];
foreach ($clients as $client) {
    $projectData[] = [
        'name' => 'Website Development - ' . $client->name,
        'client_id' => $client->id,
        'start_date' => $startDate,
        'created_at' => now(),
    ];
}

// Bulk insert projects
Project::insert($projectData);

// Get the inserted projects (this is where it gets tricky)
$insertedProjects = Project::where('name', 'LIKE', 'Website Development%')
    ->where('created_at', '>=', now()->subMinute())
    ->orderBy('id')
    ->get();

// Step 2: Prepare phases with real project IDs
$phaseData = [];
foreach ($insertedProjects as $index => $project) {
    foreach ($template->phases as $phaseTemplate) {
        $phaseData[] = [
            'project_id' => $project->id,
            'name' => $phaseTemplate->name,
            'order' => $phaseTemplate->order,
            'created_at' => now(),
        ];
    }
}

// Bulk insert phases
Phase::insert($phaseData);

// Get inserted phases and map them back...
$insertedPhases = Phase::whereIn('project_id', $insertedProjects->pluck('id'))
    ->where('created_at', '>=', now()->subMinute())
    ->orderBy('project_id', 'order')
    ->get();

// Step 3: Continue this pattern for tasks and subtasks...
// (This gets increasingly complex and error-prone)

Pros:

Uses bulk inserts for better performance than individual creates

Cons:

Complex ID mapping logic that's prone to race conditions
Requires careful ordering and filtering to match inserted records
Gets exponentially more complex with each relationship level

Our Solution: The `temp_identifier` Trick

Here's where we got creative.

What if we could get the benefits of bulk inserts AND handle relationships properly?

The idea is simple:

Add a temporary column to your tables
Use it to track relationships during bulk operations
Clean it up when you're done

Step 1: Add Temporary Columns

ALTER TABLE projects ADD COLUMN temp_identifier BIGINT NULL, ADD COLUMN batch_id VARCHAR(255) NULL;
ALTER TABLE phases ADD COLUMN temp_identifier BIGINT NULL, ADD COLUMN batch_id VARCHAR(255) NULL;  
ALTER TABLE tasks ADD COLUMN temp_identifier BIGINT NULL, ADD COLUMN batch_id VARCHAR(255) NULL;
ALTER TABLE subtasks ADD COLUMN batch_id VARCHAR(255) NULL;

Step 2: Prepare Data with Temporary IDs

$batchId = uniqid(); // Unique identifier for this batch to prevent concurrency issues
$projectData = [];
$phaseData = [];
$taskData = [];
$subtaskData = [];
$projectCounter = 0;
$phaseCounter = 0;
$taskCounter = 0;

foreach ($clients as $client) {
    $tempProjectId = $projectCounter++;

    // Prepare project data
    $projectData[] = [
        'temp_identifier' => $tempProjectId,
        'batch_id' => $batchId,
        'name' => 'Recurring Task - ' . $client->name,
        'client_id' => $client->id,
        'start_date' => $startDate,
        'created_at' => now(),
        'updated_at' => now(),
    ];

    foreach ($template->phases as $phaseTemplate) {
        $tempPhaseId = $phaseCounter++;

        // Prepare phase data
        $phaseData[] = [
            'temp_identifier' => $tempPhaseId,
            'batch_id' => $batchId,
            'project_temp_identifier' => $tempProjectId, // Reference to project's temp_identifier
            'name' => $phaseTemplate->name,
            'order' => $phaseTemplate->order,
            'created_at' => now(),
            'updated_at' => now(),
        ];

        foreach ($phaseTemplate->tasks as $taskTemplate) {
            $tempTaskId = $taskCounter++;

            // Prepare task data
            $taskData[] = [
                'temp_identifier' => $tempTaskId,
                'batch_id' => $batchId,
                'phase_temp_identifier' => $tempPhaseId, // Reference to phase's temp_identifier
                'name' => $taskTemplate->name,
                'estimated_hours' => $taskTemplate->estimated_hours,
                'created_at' => now(),
                'updated_at' => now(),
            ];

            foreach ($taskTemplate->subtasks as $subtaskTemplate) {
                $subtaskData[] = [
                    'batch_id' => $batchId,
                    'task_temp_identifier' => $tempTaskId, // Reference to task's temp_identifier
                    'name' => $subtaskTemplate->name,
                    'description' => $subtaskTemplate->description,
                    'created_at' => now(),
                    'updated_at' => now(),
                ];
            }
        }
    }
}

Step 3: Bulk Insert and Map IDs

// Bulk insert projects
    Project::insert($projectData);

    // Get the mapping of temp_identifier to real ID
    $projectMapping = Project::where('batch_id', $batchId)
        ->whereNotNull('temp_identifier')
        ->pluck('id', 'temp_identifier')
        ->toArray();

    // Update phase data with real project IDs
    foreach ($phaseData as &$phase) {
        $phase['project_id'] = $projectMapping[$phase['project_temp_identifier']] 
            ?? throw new Exception('Project temp_identifier not found');
        unset($phase['project_temp_identifier']);
    }

    // Bulk insert phases
    Phase::insert($phaseData);

    // Get phase mapping
    $phaseMapping = Phase::where('batch_id', $batchId)
        ->whereNotNull('temp_identifier')
        ->pluck('id', 'temp_identifier')
        ->toArray();

    // Update task data with real phase IDs
    foreach ($taskData as &$task) {
        $task['phase_id'] = $phaseMapping[$task['phase_temp_identifier']] 
            ?? throw new Exception('Phase temp_identifier not found');
        unset($task['phase_temp_identifier']);
    }

    // Bulk insert tasks
    Task::insert($taskData);

    // Get task mapping
    $taskMapping = Task::where('batch_id', $batchId)
        ->whereNotNull('temp_identifier')
        ->pluck('id', 'temp_identifier')
        ->toArray();

    // Update subtask data with real task IDs
    foreach ($subtaskData as &$subtask) {
        $subtask['task_id'] = $taskMapping[$subtask['task_temp_identifier']] 
            ?? throw new Exception('Task temp_identifier not found');
        unset($subtask['task_temp_identifier']);
    }

    // Bulk insert subtasks
    Subtask::insert($subtaskData);

This should be wrapped in a DB::transaction with temporarily disabling foriegn key checks if required.

DB::transaction(function () use ($projectData, $phaseData, $taskData, $subtaskData, $batchId) {
    // Temporarily disable foreign key checks for bulk inserts
    DB::statement('SET FOREIGN_KEY_CHECKS=0;');


    // Bulk Operation

    // Re-enable foreign key checks
    DB::statement('SET FOREIGN_KEY_CHECKS=1;');
});

Step 4: Clean Up

Project::where('batch_id', $batchId)->update(['temp_identifier' => null, 'batch_id' => null]);
Phase::where('batch_id', $batchId)->update(['temp_identifier' => null, 'batch_id' => null]);
Task::where('batch_id', $batchId)->update(['temp_identifier' => null, 'batch_id' => null]);
Subtask::where('batch_id', $batchId)->update(['batch_id' => null]);

Pros:

Achieves true bulk insert performance with clean relationship handling
Simple integer lookups eliminate complex field matching
More predictable than approaches relying on insertion order or field matching

Cons:

Requires temporary database columns and schema changes
More complex setup than simpler approaches
Cleanup step needed after completion

When Should You Use This?

This approach works best when:

You're dealing with thousands of related records - multi-level relationships

Don't use it when:

You're inserting just a few hundred records (the overhead isn't worth it)
Your relationships are simple (just use insertGetId loops)
Your team isn't comfortable with the added complexity

Comparison

Let's say you're generating projects for 100 clients from a single template, each with 5 phases, 10 tasks per phase, and 3 subtasks per task:

Approach	Database Queries	Memory Usage	Reliability
Individual Inserts	~20,600	Low	High
Memory Mapping	~7	High	Medium
temp_identifier	~10	High	High

The next time you're staring at a bulk insert problem that seems too complex, remember: sometimes the answer is as simple as adding a temporary column and getting creative with how you use it.

I hope you find this valuable, if you did - share it with your folks who'll find it useful too.

If you have any suggestions/comments please feel free.

Happy coding!

DEV Community

High-Performance Bulk Insert with relationships

The Problem

The "Normal" Approaches (And Why They Hurt)

Approach 1: Individual Inserts

Approach 2: Memory Mapping

Our Solution: The `temp_identifier` Trick

Step 1: Add Temporary Columns

Step 2: Prepare Data with Temporary IDs

Step 3: Bulk Insert and Map IDs

Step 4: Clean Up

When Should You Use This?

Comparison

Top comments (0)

The Problem

The "Normal" Approaches (And Why They Hurt)

Approach 1: Individual Inserts

Approach 2: Memory Mapping

Our Solution: The temp_identifier Trick

Step 1: Add Temporary Columns

Step 2: Prepare Data with Temporary IDs

Step 3: Bulk Insert and Map IDs

Step 4: Clean Up

When Should You Use This?

Comparison

Our Solution: The `temp_identifier` Trick