Mufthi Ryanda

Scaling to 300K+ Records Daily: How We Handle High Volume Data Processing with Lumen & MySQL

Building a lean, mean data processing machine that handles 100 I/O operations per second without breaking a sweat

When your application suddenly needs to process hundreds of thousands of records daily, with peak loads hitting 100 I/O operations per second, you quickly learn that standard CRUD operations won't cut it. Here's how we transformed our Lumen application into a high-performance data-processing powerhouse.

The Challenge

Our monitoring system processes 300,000+ records daily, generating complex reports and exports while maintaining sub-second response times. The system handles everything from real-time aggregations to massive CSV exports, all while keeping memory usage under control.

Strategy 1: Database Schema Optimization

JSON Columns with Generated Virtual Columns

Instead of creating multiple tables with complex joins, we leveraged MySQL's JSON capabilities with a twist:

// Migration: Create virtual columns for frequently queried JSON fields
Schema::table('data_xxx', function (Blueprint $table) {
    $table->string('feedback_extracted')->virtualAs(
        "JSON_UNQUOTE(JSON_EXTRACT(content_data, '$.feedback'))"
    )->index();

    $table->decimal('amount_extracted', 15, 2)->virtualAs(
        "CAST(JSON_EXTRACT(content_data, '$.amount') AS DECIMAL(15,2))"
    )->index();
});

Why this works:

  • Virtual columns are computed on the fly but can still be indexed
  • Eliminates the need for complex joins
  • Maintains data flexibility while enabling fast queries (see the example query below)
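
For example, filtering on the extracted value hits the index directly instead of parsing JSON row by row. A quick sketch (the feedback value and amount threshold are made up for illustration):

// Both predicates use the indexes on the virtual columns,
// so MySQL never has to re-parse content_data per row.
$flagged = DB::table('data_xxx')
    ->where('feedback_extracted', 'NEGATIVE')
    ->where('amount_extracted', '>', 1000)
    ->get();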

Strategic Indexing

// Composite indexes for common query patterns
Schema::table('data_xxx', function (Blueprint $table) {
    $table->index(['branch', 'visit_date', 'visit_type']);
    $table->index(['personnel_id', 'visit_date', 'status']);
    $table->index(['visit_type', 'status', 'feedback_extracted']);
});
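
Column order matters here: MySQL walks a composite index left to right, so equality-filtered columns should come before range-filtered ones. A hedged example of a query the first index serves well (the branch code and visit type values are made up):

// Equality on branch plus the range on visit_date resolve against the
// ['branch', 'visit_date', 'visit_type'] index via its leftmost prefix.
$visits = DB::table('data_xxx')
    ->where('branch', 'JAKARTA-01')
    ->whereBetween('visit_date', ['2024-01-01', '2024-01-31'])
    ->where('visit_type', 'ONSITE')
    ->get();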

Strategy 2: Query Optimization Patterns

Avoiding N+1 with Smart Aggregation

Instead of loading relations, we aggregate at the database level:

public function getDataSummary($filters)
{
    return DB::table('data_xxx')
        ->select([
            'branch',
            DB::raw('SUM(CASE WHEN status = "COMPLETED" THEN 1 ELSE 0 END) as completed'),
            DB::raw('SUM(CASE WHEN status = "PLANNED" THEN 1 ELSE 0 END) as planned'),
            DB::raw('AVG(CAST(JSON_EXTRACT(content_data, "$.score") AS DECIMAL)) as avg_score')
        ])
        ->where('visit_date', $filters['date'])
        ->groupBy('branch')
        ->get();
}

Generator-Powered Data Processing

For large datasets, we use PHP generators to maintain constant memory usage:

public function processLargeDataset($filters): \Generator
{
    $query = DB::table('data_xxx')
        ->where('visit_date', '>=', $filters['start_date'])
        ->where('visit_date', '<=', $filters['end_date'])
        ->orderBy('id');

    foreach ($query->lazy(2000) as $record) {
        yield $this->transformRecord($record);
    }
}

// Usage
foreach ($this->processLargeDataset($filters) as $processedRecord) {
    // Memory stays constant regardless of dataset size
    $this->handleRecord($processedRecord);
}

Strategy 3: High-Performance Export System

Memory-Efficient CSV Generation

Our export system handles massive datasets while keeping memory usage under 50MB:

public function exportToCSV($filters): string
{
    // Create temporary file
    $tempFile = tmpfile();
    $tempPath = stream_get_meta_data($tempFile)['uri'];

    // Write headers
    fputcsv($tempFile, ['Date', 'Branch', 'Personnel', 'Customer', 'Result']);

    // Stream data in chunks
    foreach ($this->getExportData($filters) as $record) {
        fputcsv($tempFile, [
            $record['visit_date'],
            $record['branch_name'],
            $record['personnel_name'],
            $record['customer_name'],
            $record['visit_result']
        ]);
    }

    // Flush buffered writes before reading the temp file back
    fflush($tempFile);

    // Upload to storage as a stream
    $finalPath = "exports/data_" . date('Y-m-d_H-i-s') . ".csv";
    Storage::put($finalPath, fopen($tempPath, 'r'));

    fclose($tempFile); // tmpfile() is removed automatically on close
    return $finalPath;
}

private function getExportData($filters): \Generator
{
    $query = DB::table('data_xxx')
        ->select([
            'visit_date', 'branch_name', 'personnel_name', 
            'customer_name', 'visit_result'
        ])
        ->where('visit_date', $filters['date'])
        ->orderBy('id');

    foreach ($query->lazy(2000) as $record) {
        yield (array) $record;
    }
}
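
If the export is requested synchronously and never needs to land in storage, the same generator can feed the HTTP response directly and skip the temporary file. A minimal sketch using Symfony's StreamedResponse (which Lumen's HTTP layer is built on); treat it as an alternative path rather than part of the flow above:

use Symfony\Component\HttpFoundation\StreamedResponse;

public function streamCSV($filters): StreamedResponse
{
    return new StreamedResponse(function () use ($filters) {
        $out = fopen('php://output', 'w');
        fputcsv($out, ['Date', 'Branch', 'Personnel', 'Customer', 'Result']);

        // Rows are written straight to the output buffer, one at a time
        foreach ($this->getExportData($filters) as $record) {
            fputcsv($out, [
                $record['visit_date'],
                $record['branch_name'],
                $record['personnel_name'],
                $record['customer_name'],
                $record['visit_result'],
            ]);
        }

        fclose($out);
    }, 200, [
        'Content-Type' => 'text/csv',
        'Content-Disposition' => 'attachment; filename="data_export.csv"',
    ]);
}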

Background Processing with Chunked Operations

For time-intensive operations, we use job queues with intelligent chunking:

public function processInBackground($requestData)
{
    // Create tracking record
    $exportLog = $this->createExportLog($requestData);

    // Queue the processing job
    Queue::push(new ProcessDataExport($exportLog->id, $requestData));

    return $exportLog;
}

// In the job class
public function handle()
{
    $startTime = microtime(true);

    foreach ($this->getDataInChunks() as $chunk) {
        $this->processChunk($chunk);

        // Prevent memory leaks and timeouts
        if (microtime(true) - $startTime > 300) { // 5 minutes
            Queue::push(new ProcessDataExport($this->logId, $this->remainingData));
            return;
        }
    }

    $this->markAsCompleted();
}
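
getDataInChunks() is glossed over above; one way to implement it is an id-keyed cursor, so a re-queued job can resume where the previous run stopped. A sketch, assuming a lastProcessedId property that is carried over when the job re-queues itself:

// Hypothetical helper: yields fixed-size chunks ordered by id so the
// job can be re-queued and resume from the last processed id.
private function getDataInChunks(int $chunkSize = 2000): \Generator
{
    $lastId = $this->lastProcessedId ?? 0;

    do {
        $chunk = DB::table('data_xxx')
            ->where('id', '>', $lastId)
            ->orderBy('id')
            ->limit($chunkSize)
            ->get();

        if ($chunk->isNotEmpty()) {
            $lastId = $chunk->last()->id;
            $this->lastProcessedId = $lastId; // carried over on re-queue
            yield $chunk;
        }
    } while ($chunk->count() === $chunkSize);
}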

Strategy 4: Caching and Optimization

Smart Cache Invalidation

public function getCachedSummary($filters)
{
    $cacheKey = 'summary_' . md5(serialize($filters));

    // For today's data, cache for 30 minutes
    // For historical data, cache for 24 hours
    $ttl = $filters['date'] === date('Y-m-d') ? 1800 : 86400;

    return Cache::remember($cacheKey, $ttl, function () use ($filters) {
        return $this->generateSummary($filters);
    });
}
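
The TTLs above handle staleness passively; the "invalidation" half can be as simple as a version stamp baked into the key and bumped whenever bulk writes land. A sketch rather than the original implementation (the summary_version key and the invalidateSummaries() hook are assumptions):

// Version-stamped keys: bumping the version makes every stale summary key
// unreachable without enumerating or deleting individual entries.
public function getCachedSummaryVersioned($filters)
{
    $version  = Cache::get('summary_version', 1);
    $cacheKey = "summary_v{$version}_" . md5(serialize($filters));

    $ttl = $filters['date'] === date('Y-m-d') ? 1800 : 86400;

    return Cache::remember($cacheKey, $ttl, function () use ($filters) {
        return $this->generateSummary($filters);
    });
}

public function invalidateSummaries(): void
{
    // Call after bulk imports or status updates
    Cache::forever('summary_version', Cache::get('summary_version', 1) + 1);
}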

Performance Results

Before optimization:

  • Memory usage: 500MB+ for large exports
  • Export time: 5+ minutes for 100K records
  • Database CPU: 80%+ during peak hours

After optimization:

  • Memory usage: <50MB consistently
  • Export time: 30 seconds for 100K records
  • Database CPU: <30% during peak hours
  • Response time: <200ms for most queries

Key Takeaways

  1. JSON columns + virtual indexes eliminate complex joins while maintaining query performance
  2. PHP generators keep memory usage constant regardless of dataset size
  3. Strategic chunking prevents timeouts and resource exhaustion
  4. Proper indexing strategy is crucial for high volume operations
  5. Stream processing beats loading everything into memory

The beauty of this approach is its simplicity: no complex technology, no exotic databases, just well-optimized PHP and MySQL doing what they do best.
