Babacar Cisse DIA

Posted on Sep 25, 2024

Optimizing Sorting and Filtering on JSON Columns in Laravel with Indexed Virtual Columns

#databaseoptimization #mysql #phpdevelopment #jsonprocessing

Originally posted on bcd.dev

Outline

JSON Column Performance Challenges
- Monitoring and Execution Time Issues
- Sorting and Filtering with JSON Columns
Introducing Virtual Columns
- Why Virtual Columns Work Better
- How to Implement Virtual Columns
Benchmarking the Results
- Before: Nested JSON Columns
- After: Virtual Column + Index
How to Optimize JSON Queries with Virtual Columns
Conclusion and Best Practices

JSON Column Performance Challenges

Working with large datasets stored in JSON columns presents significant performance issues, especially when filtering and sorting. In my experience, these challenges became evident while monitoring PHP processes and managing large volumes of records, leading to execution time limits being hit.

Monitoring and Execution Time Issues

As part of my regular monitoring duties, I encountered max execution times of 30 seconds while querying JSON columns in a 580k record dataset. JSON columns, though flexible, are prone to performance bottlenecks, particularly without proper indexing.

Sorting and Filtering with JSON Columns

The first major issue appeared when working on a Filament list record page, which had default sorting applied to a JSON attribute. The absence of indexing on this attribute resulted in a significant slowdown, especially when processing over 10,000 records. Without an index, querying and sorting through nested JSON attributes can cause execution delays and inefficiencies in retrieving results, pushing PHP processes beyond acceptable limits.

Introducing Virtual Columns

When faced with performance issues from sorting and filtering large JSON columns, I revisited an old solution: virtual columns from my friend Rob Fonseca. Virtual columns in MySQL allow me to create an indexed, computed column from JSON data, making queries more efficient without duplicating data.

Why Virtual Columns Work Better

Unlike standard JSON columns, virtual columns are calculated automatically from existing data but can be indexed, making them faster for querying. This improves sorting and filtering performance significantly, especially in large datasets where execution time is critical.

How to Implement Virtual Columns

I implemented virtual columns by adding a migration that created a new indexed column for filtering and sorting. This virtual column extracted and indexed specific JSON attributes, drastically improving query performance. Here's an example migration:

$table->string('approved_at')
    ->nullable()
    ->virtualAs("json_unquote(json_extract(data, '$.latest_approval_date'))");
$table->index('approved_at');

By indexing this virtual column, I was able to reduce query times and improve overall efficiency, especially when filtering and sorting large datasets.

Benchmarking the Results

Once I implemented the virtual columns, I needed to ensure the performance gains were real. Benchmarking provided concrete data, comparing the execution times of filtering, sorting, and paginating large datasets using both the original nested JSON column and the new virtual column with indexing.

Before: Nested JSON Columns

With over 580k records, queries on the nested JSON column were slow:

Sorting a page of 100 records took over 5,000ms.
Filtering + sorting + paginating took nearly 2,000ms.

Benchmark::dd([
    'count' => fn () => Document::count(),
    'paginate' => fn () => Document::paginate(100),
    'filter + paginate' => fn () => Document::where('data->latest_approval_date', '>', '2024-09-05')->paginate(100),
    'sort + paginate' => fn () => Document::orderBy('data->latest_approval_date')->paginate(100),
    'filter + sort + paginate' => fn () => Document::where('data->latest_approval_date', '>', '2024-09-05')->orderBy('data->latest_approval_date')->paginate(100),
], iterations: 100);

After: Virtual Column + Index

After indexing the virtual column, the improvements were substantial:

Sorting the same page of 100 records dropped to 750ms (7.5x faster).
Filtering + sorting + paginating improved to just 53ms (36x faster).

These benchmarks confirmed the effectiveness of virtual columns in optimizing query performance.

Benchmark::dd([
    'count' => fn () => Document::count(),
    'paginate' => fn () => Document::paginate(100),
    'filter + paginate' => fn () => Document::where('approved_at', '>', '2024-09-05')->paginate(100),
    'sort + paginate' => fn () => Document::orderBy('approved_at')->paginate(100),
    'filter + sort + paginate' => fn () => Document::where('approved_at', '>', '2024-09-05')->orderBy('approved_at')->paginate(100),
], iterations: 100);

Steps

1. Add a Virtual Column with Migration

To improve performance, we’ll start by adding a virtual column for the approved_at field. This column extracts and indexes the JSON attribute for better query performance.

use Illuminate\Database\Migrations\Migration;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

return new class extends Migration {
    public function up(): void {
        Schema::table('documents', function (Blueprint $table) {
            $table->string('approved_at')
                ->nullable()
                ->virtualAs("json_unquote(json_extract(data, '$.latest_approval_date'))");
            $table->index('approved_at');
        });
    }

    public function down(): void {
        Schema::table('documents', function (Blueprint $table) {
            $table->dropColumn('approved_at');
        });
    }
};

2. Create a Trait for Virtual Fields

We’ll create a HasVirtualFields trait to ensure that virtual fields are not mistakenly saved.

namespace App\Models\Concerns;

trait HasVirtualFields {
    public function save(array $options = []) {
        if (isset($this->virtualFields)) {
            $this->attributes = array_diff_key($this->attributes, array_flip($this->virtualFields));
        }
        return parent::save($options);
    }
}

3. Add the Trait and Virtual Column Property to Your Model

In the model, include the trait and define the virtual fields. This ensures that any virtual columns are properly managed.

use App\Models\Concerns\HasVirtualFields;

class Document extends Model {
    use HasVirtualFields;

    protected array $virtualFields = [
        'approved_at',
    ];
}

4. Testing Environment

To test the performance improvements, we’ll generate fake data and benchmark the queries before and after using virtual columns. Use the following provisioning script:

$count = 500 * 1000;
for ($i = 0; $i < 250; $i++) {
    Document::factory()->count(1000)->create();
}

6. Wrapping Up with Unit Tests

Write tests to verify that the virtual column works as expected. Here’s an example test suite:

namespace Tests\Feature\Models;

use Tests\TestCase;
use App\Models\Document;

class DocumentTest extends TestCase {
    public function testApprovedAt() {
        $date = fake()->dateTimeBetween()->format(DATE_ATOM);
        $document = Document::factory()->create([
            'data' => [
                'latest_approval_date' => $date,
            ],
        ]);
        $document->refresh();
        $this->assertEquals($date, $document->approved_at);
    }
}

This complete solution ensures that your JSON columns can be optimized for performance, particularly for large datasets.

Conclusion and Best Practices

Using virtual columns with indexing can dramatically improve performance when working with large datasets and JSON columns. By transitioning from nested JSON queries to indexed virtual columns, I was able to reduce query times by up to 36x.

Best Practices:

Use virtual columns to index frequently queried JSON attributes.
Always benchmark before and after implementing changes to measure real performance improvements.
Ensure your database structure evolves with your data as it scales, especially with JSON-heavy models.