Andrew Huntingdon

Performance Testing Restaurant Software: Simulating Rush Hour Traffic

How we test BistroBee's performance under real-world restaurant conditions, and why generic load testing misses the point

At 6:47 PM on a Friday evening, BistroBee's response times suddenly spiked. Our monitoring showed database query times jumping from 50ms to 3 seconds. API endpoints that normally returned in under 200ms were taking 5+ seconds.

The cause wasn't a code bug or infrastructure failure. It was success.

We'd just onboarded a large restaurant group with 30 locations. That Friday evening, all 30 restaurants hit their dinner rush simultaneously. Hundreds of staff members were clocking in, managers were checking schedules, servers were swapping shifts, and the system was experiencing exactly the kind of concentrated, bursty traffic pattern that restaurants create.

Our standard load tests had shown BistroBee could handle 10,000 concurrent users. But those tests simulated steady, distributed traffic, not the reality of restaurant operations where everyone in an entire city tries to clock in at 5:00 PM sharp.

That evening taught us that performance testing restaurant software requires understanding restaurant operations, not just running generic load tests. This post shares what we learned about simulating realistic restaurant traffic patterns and building performance tests that actually matter.

Why Restaurant Traffic Is Different

Before diving into our testing approach, it's worth understanding why restaurant software creates unique performance challenges.

The Rush Hour Problem

Most web applications experience gradual traffic increases and decreases throughout the day. Restaurant software experiences sharp, predictable spikes:

Typical SaaS traffic pattern:

Traffic throughout day: ~~~~∿~~~∿~~~~∿~~~∿~~~~
Gradual peaks and valleys, relatively smooth

Restaurant software traffic pattern:

Traffic throughout day: _____|‾‾‾‾|_____|‾‾‾‾|___
Sharp spikes at meal services, nearly idle between

These spikes correspond to shift changes, meal service starts, and other predictable restaurant events.

The Thundering Herd Effect

When a restaurant's dinner shift starts at 5:00 PM, it's not just one person logging in: the entire serving staff, kitchen crew, and management team all access the system within a 5-minute window.

Multiply this across dozens or hundreds of restaurant locations in the same time zone, and you get concentrated load spikes that dwarf the steady-state traffic levels.

The Zero-Tolerance Reality

Performance degradation in restaurant software has immediate, visible impacts:

  • Staff can't clock in, backing up at the time clock
  • Managers can't access schedules during shift changes
  • Real-time updates lag during busy service periods
  • Mobile apps become unusable exactly when most needed

Unlike many SaaS applications where slight slowdowns are annoying but tolerable, restaurant software performance issues create operational chaos during the business's most critical moments.

Our Performance Testing Philosophy

Based on these realities, we developed performance testing principles specific to restaurant operations:

Test Real Patterns, Not Uniform Load

Generic load testing tools simulate uniform traffic: X users steadily making Y requests. This doesn't match restaurant reality.

We simulate actual restaurant behavior:

  • Shift change login spikes
  • Hourly schedule checks during service
  • End-of-shift activity bursts
  • Late-night schedule adjustments

Test the Right Metrics

Response time averages can be misleading. We focus on:

P95 and P99 response times: The worst 5% and 1% of requests matter most. These are what users actually experience during peak load.

Error rates under load: When load spikes, do requests fail or just slow down? Failures are worse than slowness.

Recovery time: How quickly does the system return to normal after a spike passes?

Database connection pool saturation: Are we running out of database connections during peaks?
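To make the p95/p99 focus concrete, here's a small standalone sketch (nearest-rank method, with illustrative latency numbers) showing how a single slow outlier barely moves the average but dominates the tail:

```javascript
// Nearest-rank percentile: sort ascending, take the ceil(p/100 * n)-th sample.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Hypothetical response times (ms) captured during a shift-change spike
const latencies = [95, 98, 101, 105, 110, 115, 120, 130, 480, 2300];

const mean = latencies.reduce((sum, ms) => sum + ms, 0) / latencies.length;
console.log(mean);                      // 365.4 — looks tolerable
console.log(percentile(latencies, 50)); // 110  — median looks healthy
console.log(percentile(latencies, 95)); // 2300 — what unlucky users actually see
```

An average of ~365ms hides the fact that the slowest staff member waited over two seconds at the time clock.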

Test Failure Scenarios

Restaurant software can't just perform well under ideal conditions; it needs to handle:

  • Database slowdowns
  • Network issues
  • Dependency failures
  • Partial outages

Our testing includes deliberate failure injection to verify graceful degradation.

Building Realistic Test Scenarios

Here's how we model actual restaurant traffic patterns in our performance tests.

Scenario 1: The Dinner Shift Login Storm

Real-world pattern: At 5:00 PM, 50 staff members need to clock in for dinner service at a single location.

Test implementation:

// k6 load testing script
import http from 'k6/http';
import { check, sleep } from 'k6';
import { Rate } from 'k6/metrics';

const failureRate = new Rate('failed_requests');

export const options = {
  scenarios: {
    dinner_shift_login: {
      executor: 'ramping-arrival-rate',
      startRate: 0,
      timeUnit: '1m',
      preAllocatedVUs: 100,
      stages: [
        // Ramp from 0 to 50 logins per minute over 2 minutes
        { target: 50, duration: '2m' },
        // Hold at 50 per minute for 3 minutes (peak)
        { target: 50, duration: '3m' },
        // Ramp down over 5 minutes (stragglers)
        { target: 5, duration: '5m' },
      ],
    },
  },
  thresholds: {
    'http_req_duration{scenario:dinner_shift_login}': [
      'p(95)<500', // 95% of requests complete in under 500ms
      'p(99)<1000', // 99% complete in under 1s
    ],
    'failed_requests': ['rate<0.01'], // Less than 1% error rate
  },
};

export default function () {
  // Simulate staff login
  const loginResponse = http.post(
    'https://api.bistrobee.com/auth/login',
    JSON.stringify({
      email: `staff${__VU}@testrestaurant.com`,
      password: 'testpass123',
      restaurantId: 'test-restaurant-1',
    }),
    { headers: { 'Content-Type': 'application/json' } }
  );

  const loginSuccess = check(loginResponse, {
    'login successful': (r) => r.status === 200,
    'has auth token': (r) => r.json('token') !== undefined,
  });

  failureRate.add(!loginSuccess);

  if (loginSuccess) {
    const token = loginResponse.json('token');

    // Immediately check today's schedule (typical behavior)
    const scheduleResponse = http.get(
      'https://api.bistrobee.com/schedule/today',
      {
        headers: { Authorization: `Bearer ${token}` },
      }
    );

    check(scheduleResponse, {
      'schedule loaded': (r) => r.status === 200,
      'schedule response time acceptable': (r) => r.timings.duration < 500,
    });
  }

  // Random think time between 5-15 seconds
  sleep(Math.random() * 10 + 5);
}

This test simulates the concentrated login spike that happens at shift changes, measuring how the system handles authentication and immediate schedule lookups under peak load.

Scenario 2: Multi-Location Rush Hour

Real-world pattern: 50 restaurant locations across a region all hit dinner rush between 5:00-7:00 PM.

Test implementation:

export const options = {
  scenarios: {
    // East Coast restaurants (5 PM ET = 2 PM PT)
    east_coast_rush: {
      executor: 'ramping-arrival-rate',
      startTime: '0s',
      startRate: 0,
      timeUnit: '1m',
      preAllocatedVUs: 500,
      stages: [
        { target: 250, duration: '5m' }, // Rush builds
        { target: 250, duration: '15m' }, // Peak
        { target: 50, duration: '10m' }, // Dies down
      ],
    },
    // Central restaurants (6 PM CT = 4 PM PT, 2 hours after the East Coast rush)
    central_rush: {
      executor: 'ramping-arrival-rate',
      startTime: '120m', // Start 2 hours after east coast
      startRate: 0,
      timeUnit: '1m',
      preAllocatedVUs: 300,
      stages: [
        { target: 150, duration: '5m' },
        { target: 150, duration: '15m' },
        { target: 30, duration: '10m' },
      ],
    },
    // West Coast restaurants (5 PM PT)
    west_coast_rush: {
      executor: 'ramping-arrival-rate',
      startTime: '180m', // Start 3 hours after east coast
      startRate: 0,
      timeUnit: '1m',
      preAllocatedVUs: 400,
      stages: [
        { target: 200, duration: '5m' },
        { target: 200, duration: '15m' },
        { target: 40, duration: '10m' },
      ],
    },
  },
};

import exec from 'k6/execution';

export default function () {
  // k6 exposes the active scenario's name via the execution API
  const location = exec.scenario.name; // east_coast_rush, central_rush, west_coast_rush

  // Different behaviors based on time zone
  const restaurantId = `test-restaurant-${location}-${__VU % 50}`;

  // Simulate typical rush hour activities
  performLoginAndScheduleCheck(restaurantId);
  sleep(Math.random() * 30);

  performScheduleUpdates(restaurantId);
  sleep(Math.random() * 60);

  performEndOfShiftActivities(restaurantId);
}

This test simulates the reality of serving customers across multiple time zones, each creating load spikes at their local peak hours.

Scenario 3: The Weekend Chaos Pattern

Real-world pattern: Weekend dinner service is 2-3x busier than weekdays, with more concurrent activity.

Test implementation:

export const options = {
  scenarios: {
    weekend_peak: {
      executor: 'ramping-vus',
      startVUs: 0,
      stages: [
        // Friday happy hour (4-7 PM)
        { duration: '30m', target: 500 },
        { duration: '3h', target: 500 },
        // Late night wind down
        { duration: '2h', target: 100 },
        // Saturday brunch build (9 AM - 2 PM)
        { duration: '1h', target: 400 },
        { duration: '5h', target: 400 },
        // Brief afternoon lull
        { duration: '2h', target: 150 },
        // Saturday dinner rush (5-10 PM)
        { duration: '1h', target: 800 },
        { duration: '5h', target: 800 },
        // Late night
        { duration: '2h', target: 200 },
      ],
      gracefulRampDown: '30m',
    },
  },
};

// Mix of activities during peak hours
export default function () {
  const activities = [
    scheduleChecks,
    shiftSwapRequests,
    timeClockPunches,
    breakScheduling,
    staffMessaging,
    managerDashboardViews,
  ];

  // Randomly select activities weighted by real usage patterns
  const activity = weightedRandom(activities, [30, 15, 25, 10, 15, 5]);
  activity();

  sleep(Math.random() * 45 + 15); // 15-60 second think time
}

This marathon test runs for hours, simulating sustained weekend traffic patterns including multiple peak periods.
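The `weightedRandom` helper referenced in the snippet isn't shown in the post; a minimal version (plain JS, usable inside a k6 script) might look like:

```javascript
// Pick one item, with probability proportional to its weight.
// Weights don't need to sum to 100 — only their relative sizes matter.
function weightedRandom(items, weights) {
  const total = weights.reduce((sum, w) => sum + w, 0);
  let roll = Math.random() * total;
  for (let i = 0; i < items.length; i++) {
    roll -= weights[i];
    if (roll < 0) return items[i];
  }
  return items[items.length - 1]; // guard against floating-point rounding
}
```

With the weights above ([30, 15, 25, 10, 15, 5]), roughly 30% of iterations become schedule checks and only 5% manager dashboard views, mirroring observed usage.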

Scenario 4: The Emergency Scenario

Real-world pattern: A key staff member calls out 30 minutes before shift, triggering frantic scheduling activity.

Test implementation:

export const options = {
  scenarios: {
    normal_operations: {
      executor: 'constant-arrival-rate',
      rate: 50,
      timeUnit: '1m',
      duration: '30m',
      preAllocatedVUs: 100,
    },
    emergency_spike: {
      executor: 'ramping-arrival-rate',
      startTime: '15m', // Emergency happens 15 minutes in
      startRate: 0,
      timeUnit: '1m',
      preAllocatedVUs: 200,
      stages: [
        { target: 150, duration: '2m' }, // Managers check coverage
        { target: 150, duration: '5m' }, // Frantically contacting staff
        { target: 50, duration: '5m' }, // Resolution
      ],
    },
  },
};

import exec from 'k6/execution';

export default function () {
  if (exec.scenario.name === 'emergency_spike') {
    // Emergency behavior: rapid-fire schedule checks and updates
    checkStaffAvailability();
    sendCoverageRequests();
    checkForResponses();
    updateSchedule();
    // Minimal think time during emergencies
    sleep(Math.random() * 5 + 2);
  } else {
    // Normal operations
    normalSchedulingActivity();
    sleep(Math.random() * 60 + 30);
  }
}

This test verifies the system can handle sudden load spikes on top of baseline traffic, a critical real-world scenario.

Database Performance Under Load

Restaurant software is database-intensive. Much of our performance testing focuses on database behavior under load.

Connection Pool Monitoring

We monitor database connection pool utilization during load tests:

// k6 script that polls an internal metrics endpoint to track database health
import http from 'k6/http';
import { Counter, Gauge } from 'k6/metrics';

const dbConnections = new Gauge('db_connections_active');
const dbQueuedRequests = new Counter('db_requests_queued');

export function setup() {
  // Start monitoring database connection pool
  return {
    dbMonitoringEndpoint: 'https://api.bistrobee.com/internal/metrics',
  };
}

export default function (data) {
  // Normal test activities
  performTestActions();

  // Periodically check database health
  if (Math.random() < 0.1) { // 10% of iterations
    const metrics = http.get(data.dbMonitoringEndpoint).json();
    dbConnections.add(metrics.database.connectionsActive);
    if (metrics.database.requestsQueued > 0) {
      dbQueuedRequests.add(metrics.database.requestsQueued);
    }
  }
}

If we see connection pool saturation (queued requests), we know we need to optimize queries or scale database resources.

Query Performance Analysis

During load tests, we capture slow query logs:

-- PostgreSQL slow query logging
ALTER SYSTEM SET log_min_duration_statement = 100; -- Log queries > 100ms
ALTER SYSTEM SET log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h ';
SELECT pg_reload_conf();

Then analyze which queries degrade under load:

# Analyze slow queries during load test
pgbadger postgresql.log --outfile=/tmp/report.html --jobs 4

# Common issues we find:
# - Missing indexes on frequently joined tables
# - N+1 queries in ORM code
# - Inefficient aggregations
# - Lock contention on hot tables

Database Optimization Example

Here's a real optimization we made based on load testing:

Problem: During shift changes, fetching available staff for a shift was slow.

Original query:

-- Took 2-3 seconds under load
SELECT s.* FROM staff s
WHERE s.restaurant_id = $1
AND s.active = true
AND NOT EXISTS (
  SELECT 1 FROM shifts sh
  WHERE sh.staff_id = s.id
  AND sh.start_time <= $3
  AND sh.end_time >= $2
);

Optimized query:

-- Takes 50-100ms under load
SELECT s.* FROM staff s
LEFT JOIN shifts sh ON (
  sh.staff_id = s.id
  AND sh.start_time <= $3
  AND sh.end_time >= $2
)
WHERE s.restaurant_id = $1
AND s.active = true
AND sh.id IS NULL;

-- Added index (note: a partial index can't use CURRENT_DATE in its
-- predicate, since index predicates must use only immutable expressions):
CREATE INDEX idx_shifts_staff_time ON shifts (staff_id, start_time, end_time);

This 40x improvement came from load testing revealing the bottleneck.

Caching Strategy Testing

Caching is crucial for handling rush hour load. We test our caching strategy's effectiveness:

Redis Cache Hit Rate Monitoring

import http from 'k6/http';
import { check } from 'k6';
import { Trend } from 'k6/metrics';

const cacheHitRate = new Trend('cache_hit_rate');
const token = __ENV.AUTH_TOKEN; // auth token for the test user, supplied via environment

export default function () {
  const scheduleResponse = http.get(
    'https://api.bistrobee.com/schedule/today',
    { headers: { Authorization: `Bearer ${token}` } }
  );

  // Check if response came from cache
  const fromCache = scheduleResponse.headers['X-Cache-Status'] === 'HIT';
  cacheHitRate.add(fromCache ? 100 : 0);

  check(scheduleResponse, {
    'cached response fast': (r) =>
      r.headers['X-Cache-Status'] === 'HIT' ? r.timings.duration < 100 : true,
  });
}

We aim for 80%+ cache hit rates during steady-state operations.

Cache Invalidation Testing

The tricky part of caching is invalidation. We test scenarios that should clear cached data:

// Test cache invalidation behavior
export function testCacheInvalidation() {
  const token = login();

  // Fetch schedule (should cache)
  const schedule1 = getSchedule(token);
  check(schedule1.headers, { 'initial fetch is a cache miss': (h) => h['X-Cache-Status'] === 'MISS' });

  // Fetch again (should hit cache)
  const schedule2 = getSchedule(token);
  check(schedule2.headers, { 'second fetch from cache': (h) => h['X-Cache-Status'] === 'HIT' });

  // Update schedule
  updateSchedule(token, { shiftId: 'shift-123', startTime: '18:00' });

  // Fetch again (cache should be invalidated)
  const schedule3 = getSchedule(token);
  check(schedule3.headers, { 'cache invalidated after update': (h) => h['X-Cache-Status'] === 'MISS' });

  // Verify schedule reflects update
  check(schedule3.json(), {
    'updated data returned': (data) => 
      data.shifts.find(s => s.id === 'shift-123').startTime === '18:00',
  });
}

Stale cache data is worse than no cache, so we extensively test invalidation logic.
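The pattern the test above exercises can be sketched in a few lines: a cache-aside read with delete-on-write invalidation (a Map stands in for Redis here; the keys and shapes are illustrative):

```javascript
// Source of truth (stand-in for the database)
const db = new Map([
  ['schedule:today', { shifts: [{ id: 'shift-123', startTime: '17:00' }] }],
]);
const cache = new Map(); // stand-in for Redis

function getSchedule(key) {
  if (cache.has(key)) return { status: 'HIT', data: cache.get(key) };
  const data = db.get(key); // cache miss: read through and populate
  cache.set(key, data);
  return { status: 'MISS', data };
}

function updateSchedule(key, shiftId, startTime) {
  const record = db.get(key);
  record.shifts.find((s) => s.id === shiftId).startTime = startTime;
  cache.delete(key); // invalidate so the next read repopulates from the source of truth
}
```

Deleting on write, rather than updating the cached value in place, keeps the invalidation path simple at the cost of one extra miss per update.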

Failure Mode Testing

Performance isn't just about speed; it's about graceful degradation when things go wrong.

Database Slowdown Simulation

We use Toxiproxy to inject latency into database connections:

# Add 500ms latency to the database proxy's downstream connections
toxiproxy-cli toxic add -t latency -a latency=500 bistrobee-db

# Run load test while database is slow
k6 run --vus 200 --duration 10m load-test.js

# Remove the latency toxic (the default toxic name is latency_downstream)
toxiproxy-cli toxic remove -n latency_downstream bistrobee-db

We verify that:

  • Request timeout handling works correctly
  • Connection pools don't exhaust
  • Users see appropriate error messages
  • System recovers when database performance returns
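One building block behind the first bullet, a hard timeout around slow dependency calls, can be sketched like this (an illustrative helper, not BistroBee's actual code):

```javascript
// Race the real call against a timer so a sluggish database can't
// pin request handlers indefinitely.
function withTimeout(promise, ms, label) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  // Clear the timer either way so the process can exit cleanly.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

A handler can then call `withTimeout(db.query(sql), 2000, 'schedule query')` and return a clear error after 2 seconds instead of hanging while the connection pool drains.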

Dependency Failure Testing

BistroBee integrates with external services (email, SMS, payment processors). We test what happens when they fail during peak load:

export const options = {
  scenarios: {
    normal_load: {
      executor: 'constant-vus',
      vus: 100,
      duration: '20m',
    },
  },
  thresholds: {
    // Core operations should succeed even if dependencies fail
    'http_req_duration{endpoint:schedule}': ['p(95)<500'],
    'http_req_duration{endpoint:shifts}': ['p(95)<500'],
    // Dependent operations may degrade but shouldn't bring down core
    'http_req_duration{endpoint:notifications}': ['p(95)<2000'],
  },
};

export default function () {
  // Core operations should always work
  const schedule = getSchedule();
  check(schedule, { 'schedule loads': (r) => r.status === 200 });

  // Try to send notification (might fail if SMS service down)
  const notification = sendShiftReminder();
  // Don't fail test if notification fails - just log it
  if (notification.status !== 200) {
    console.warn('Notification failed - acceptable during dependency outage');
  }
}

Circuit breakers should prevent cascading failures:

// Circuit breaker implementation
class CircuitBreaker {
  constructor(service, threshold = 5, timeout = 60000) {
    this.service = service;
    this.failureThreshold = threshold;
    this.timeout = timeout;
    this.failureCount = 0;
    this.state = 'CLOSED'; // CLOSED, OPEN, HALF_OPEN
    this.nextAttempt = Date.now();
  }

  async execute(operation) {
    if (this.state === 'OPEN') {
      if (Date.now() < this.nextAttempt) {
        throw new Error(`Circuit breaker OPEN for ${this.service}`);
      }
      // Try half-open
      this.state = 'HALF_OPEN';
    }

    try {
      const result = await operation();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  onSuccess() {
    this.failureCount = 0;
    this.state = 'CLOSED';
  }

  onFailure() {
    this.failureCount++;
    if (this.failureCount >= this.failureThreshold) {
      this.state = 'OPEN';
      this.nextAttempt = Date.now() + this.timeout;
    }
  }
}
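A quick usage sketch of a breaker like the one above, wrapping a stubbed, always-failing SMS dependency (the class is condensed here so the snippet runs standalone; the SMS client is hypothetical):

```javascript
class CircuitBreaker {
  constructor(service, threshold = 5, timeout = 60000) {
    Object.assign(this, { service, threshold, timeout, failures: 0, state: 'CLOSED', nextAttempt: 0 });
  }
  async execute(operation) {
    if (this.state === 'OPEN') {
      if (Date.now() < this.nextAttempt) throw new Error(`Circuit OPEN for ${this.service}`);
      this.state = 'HALF_OPEN'; // probe with a single request
    }
    try {
      const result = await operation();
      this.failures = 0;
      this.state = 'CLOSED';
      return result;
    } catch (err) {
      if (++this.failures >= this.threshold) {
        this.state = 'OPEN';
        this.nextAttempt = Date.now() + this.timeout;
      }
      throw err;
    }
  }
}

// Stub for a failing SMS provider (hypothetical dependency)
const smsClient = { async send() { throw new Error('provider timeout'); } };
const smsBreaker = new CircuitBreaker('sms-provider', 3, 60000);

// Degrade gracefully: queue the reminder instead of cascading the failure
async function sendShiftReminder(payload) {
  try {
    return await smsBreaker.execute(() => smsClient.send(payload));
  } catch (err) {
    return { queued: true, reason: err.message };
  }
}
```

After three consecutive failures the breaker opens, and further reminders are queued immediately instead of waiting on a dead provider.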

Continuous Performance Monitoring

Load testing isn't just a pre-release activity; we run tests continuously:

Scheduled Performance Tests

# GitHub Actions workflow
name: Performance Tests
on:
  schedule:
    - cron: '0 2 * * *' # 2 AM daily
  workflow_dispatch: # Manual trigger

jobs:
  load-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install k6
        uses: grafana/setup-k6-action@v1

      - name: Run load tests
        run: |
          k6 run --out json=results.json load-tests/dinner-rush.js

      - name: Check performance thresholds
        run: |
          python scripts/analyze-performance.py results.json

      - name: Alert on degradation
        if: failure()
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": "Performance regression detected in nightly tests"
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
          SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK

Production Performance Monitoring

We monitor real production traffic to catch performance issues early:

// Application performance monitoring
const apm = require('elastic-apm-node').start({
  serviceName: 'bistrobee-api',
  serverUrl: process.env.APM_SERVER_URL,
  transactionSampleRate: 0.1, // Sample 10% of transactions
});

// Track custom metrics
app.use((req, res, next) => {
  const transaction = apm.currentTransaction;
  if (transaction) {
    transaction.setLabel('restaurant_id', req.restaurantId);
    transaction.setLabel('user_role', req.user.role);
    transaction.setLabel('time_of_day', new Date().getHours());
  }
  next();
});

This lets us see performance patterns by restaurant, user role, and time of day, identifying issues before customers complain.

Performance Budgets

We maintain performance budgets for critical operations:

Operation          P95 Target   P99 Limit   Under Load
Login              200ms        500ms       500ms
Schedule fetch     150ms        300ms       400ms
Schedule update    250ms        500ms       750ms
Shift swap         300ms        600ms       1000ms
Time clock punch   100ms        200ms       300ms

Automated tests fail if any operation exceeds its budget during load testing.
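These budgets map directly onto k6 thresholds. A sketch of the config for the normal-load targets (the endpoint tag names are illustrative; a separate load-test profile would relax p95 to the Under Load column):

```javascript
// k6 threshold config derived from the budget table (normal-load targets)
export const options = {
  thresholds: {
    'http_req_duration{endpoint:login}':     ['p(95)<200', 'p(99)<500'],
    'http_req_duration{endpoint:schedule}':  ['p(95)<150', 'p(99)<300'],
    'http_req_duration{endpoint:update}':    ['p(95)<250', 'p(99)<500'],
    'http_req_duration{endpoint:swap}':      ['p(95)<300', 'p(99)<600'],
    'http_req_duration{endpoint:timeclock}': ['p(95)<100', 'p(99)<200'],
  },
};
```

k6 exits non-zero when any threshold fails, which is what lets CI turn a budget breach into a red build.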

Lessons Learned

Building effective performance tests for BistroBee taught us:

Generic Load Tests Are Insufficient

Tools like Apache Bench or wrk can generate traffic, but they don't simulate realistic restaurant usage patterns. We needed custom test scenarios based on actual behavior.

Peak Performance Matters Most

Average performance during low load is irrelevant. What matters is how the system performs during Friday dinner rush when 500 restaurants are simultaneously active.

Database Is Usually the Bottleneck

Almost every performance problem we've found traces back to database queries, connection pools, or locking issues. Database optimization is continuous work.

Caching Is Essential but Tricky

Aggressive caching handles load spikes effectively, but cache invalidation bugs create data consistency issues. Testing both cache performance and correctness is critical.

Real-World Failure Testing Matters

Perfect conditions don't exist in production. Testing how the system degrades when dependencies fail prevents cascading outages during actual incidents.

The Bottom Line

Performance testing restaurant software requires understanding restaurant operations, not just running generic load tests.

Rush hour traffic patterns, thundering herd effects during shift changes, and zero-tolerance for slowdowns during service create unique requirements that standard load testing approaches miss.

By simulating realistic traffic patterns, focusing on peak performance rather than averages, extensively testing database performance under load, and validating graceful degradation during failures, we ensure BistroBee performs when it matters most: during the dinner rush, when restaurants can't afford slowdowns.

The Friday evening incident that started this post? It led to database query optimizations, improved connection pooling, and more aggressive caching. Those changes came from performance tests that simulated actual restaurant rush hour patterns, not from generic load testing.

Sometimes the best performance tests aren't the ones that generate the most traffic; they're the ones that generate the right kind of traffic.


Questions about performance testing patterns for domain-specific applications? Want to share your own approaches to realistic load testing? Drop a comment; we love talking about performance engineering.
