The Mystery
Our SQL Server service was processing approximately 200 updates per second. Performance was critical. Everything seemed fine—until it wasn't.
DataDog logs revealed something strange: a 1-second delay during updates. Not consistently, but occasionally. In a system where throughput is everything, this was unacceptable.
Received: 2024-01-15 10:23:45.123
DB Stored: 2024-01-15 10:23:46.123 ← 1 second gap
The timestamps told us what was happening. But they couldn't tell us why.
The Investigation
I did what any developer would do: I started with the obvious suspects.
Thread locks? I reviewed the source code carefully. No threading issues. No potential deadlocks.
Performance bottleneck? We had load testing data. The system handled 1,000 records per second without breaking a sweat. We were only seeing 200/sec in production.
The database itself? No locks, no blocking queries, no resource contention.
The code was clean. The performance was proven. Yet the 1-second delays kept appearing.
The Blind Spot
Here's what I didn't realize at the time: our load testing was lying to us.
Not intentionally. But load tests create artificial conditions:
- Sustained high load: Constant stream of 1,000 req/sec
- Steady state: No traffic variations
- Predictable patterns: Same conditions throughout the test
Production was different:
- Variable traffic: 200 req/sec average with peaks and valleys
- Burst patterns: Sudden spikes after quiet periods
- Real-world chaos: Traffic patterns that no load test captures
The problem wasn't visible under sustained load. It only appeared during traffic variations.
AI's Perspective
Here's the interesting part: I later learned that other engineers had tried feeding similar logs and source code into AI systems to diagnose this issue. None of them reached this conclusion.
The difference wasn't the AI. It was the question.
Most people asked: "What's wrong with the code?"
I asked: "What changed in the system state before each delay?"
I fed the AI three things:
- Time-series logs with reception timestamps
- Source code for the SQL connection handling
- Database storage timestamps
I asked it to find the correlation.
Interestingly, the AI initially pointed to resource contention and general performance bottlenecks, which is a reasonable first hypothesis for most latency issues.
So I showed it our load testing results: the system handled 1,000 requests per second without degradation.
That single piece of evidence forced the AI to abandon the "insufficient resources" hypothesis.
Only then did it start looking for state-dependent behaviors instead of capacity limits.
What the AI saw was a pattern I had missed:
Time   Traffic   Connection Pool State
10:20  High      Pool fully utilized
10:21  Low       Pool partially released (idle cleanup)
10:22  Burst     Reconnection cost → 1-second delay
The AI recognized something critical: the traffic volume changed before each 1-second delay.
Then it connected this pattern to its knowledge of SqlClient behavior:
"SqlClient's connection pool performs periodic cleanup of idle connections. When traffic drops, connections are released. When traffic suddenly spikes, new connections must be established, incurring connection overhead."
That was it. The missing piece.
The Root Cause
SqlClient has a connection pool management feature that:
- Monitors idle connections
- Releases them after a certain period (typically 4-8 minutes)
- Creates new connections when needed
Under sustained load, the pool stays warm. Connections are always ready.
Under variable load:
- Traffic drops → Connections marked as idle → Pool cleanup triggered
- Traffic spikes → New connections needed → Connection establishment cost
That connection establishment cost? About 1 second in our environment.
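You can see this cost directly by timing SqlConnection.Open() against a cold pool versus a warm one. A minimal sketch, assuming Microsoft.Data.SqlClient (the same idea applies to System.Data.SqlClient) and a placeholder connection string:

using System;
using System.Diagnostics;
using Microsoft.Data.SqlClient;

class PoolWarmupDemo
{
    static void Main()
    {
        // Placeholder connection string; replace with your own.
        const string cs = "Server=myserver;Database=mydb;Integrated Security=true;";

        // First open against an empty pool: a physical connection must be
        // established (TCP handshake, TLS, login). This is the cost we were paying
        // after idle cleanup.
        var sw = Stopwatch.StartNew();
        using (var conn = new SqlConnection(cs)) { conn.Open(); }
        Console.WriteLine($"Cold open: {sw.ElapsedMilliseconds} ms");

        // Second open: the pooled physical connection is reused, so Open() is nearly free.
        sw.Restart();
        using (var conn = new SqlConnection(cs)) { conn.Open(); }
        Console.WriteLine($"Warm open: {sw.ElapsedMilliseconds} ms");
    }
}

The gap between those two numbers is exactly the reconnection cost that idle cleanup quietly re-introduces.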
The Solution
The fix was straightforward:
// using Microsoft.Data.SqlClient; (or System.Data.SqlClient, depending on your provider)
SqlConnectionStringBuilder builder = new SqlConnectionStringBuilder
{
    // ... other settings
    MinPoolSize = 10 // Keep minimum connections alive
};
By setting MinPoolSize, we ensured that a baseline number of connections stayed warm, even during low-traffic periods. When bursts occurred, connections were ready.
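If you build the connection string by hand rather than through the builder, the equivalent keyword is Min Pool Size (for example, "...;Min Pool Size=10;"). The trade-off is a handful of connections held open even when idle, which is usually cheap compared to paying the reconnection cost at the worst possible moment: the start of a burst.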
The 1-second delays disappeared.
Lessons Learned
1. Load Testing Has Blind Spots
Sustained high-load testing won't catch issues that only appear during traffic variations. You need:
- Variable load patterns
- Burst testing after idle periods (see the sketch after this list)
- Real traffic replay when possible
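Here's a minimal sketch of that kind of burst test, using a hypothetical SendUpdateAsync stand-in for one of our update calls (replace it with a real request against a test environment):

using System;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

class BurstTest
{
    // Hypothetical stand-in for one production-style update call;
    // swap in a real call against a test environment.
    static Task SendUpdateAsync() => Task.Delay(10);

    static async Task Main()
    {
        // 1. Warm the system up, then go quiet long enough for idle-connection
        //    cleanup to run (minutes, not seconds).
        for (int i = 0; i < 100; i++) await SendUpdateAsync();
        await Task.Delay(TimeSpan.FromMinutes(10));

        // 2. Fire a sudden burst and time each request, so the first (cold) ones
        //    stand out against the median.
        var timings = new long[50];
        for (int i = 0; i < timings.Length; i++)
        {
            var sw = Stopwatch.StartNew();
            await SendUpdateAsync();
            timings[i] = sw.ElapsedMilliseconds;
        }

        var sorted = timings.OrderBy(t => t).ToArray();
        Console.WriteLine($"First request: {timings[0]} ms, median: {sorted[sorted.Length / 2]} ms");
    }
}

Sustained-load tools report averages; it's the first few requests after the quiet period that tell you whether you have this class of problem.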
2. AI Sees Different Patterns
Humans investigate by looking for "suspicious code." We think in terms of:
- "What could go wrong here?"
- "Where are the known failure points?"
AI investigates by looking for data correlations. It thinks in terms of:
- "What patterns exist in the time series?"
- "What library behaviors match these patterns?"
These are complementary approaches, not competing ones.
3. The Right Information Matters
The breakthrough came from providing AI with:
- Time-series data (not just isolated incidents)
- Source code (for context)
- Multiple timestamps (to see the delta)
Without the correlation between reception logs and storage timestamps, the pattern wouldn't have been visible.
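For illustration, the correlation itself is only a few lines once the data is in one place. A minimal sketch, assuming a hypothetical LogEntry record that pairs the reception and storage timestamps:

using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape of one correlated record: when the update arrived
// and when it was actually written to the database.
record LogEntry(DateTime Received, DateTime Stored);

static class DelayCorrelation
{
    public static void Report(IReadOnlyList<LogEntry> entries)
    {
        // Flag writes where the receive-to-store delta is suspiciously large.
        var delayed = entries.Where(e => e.Stored - e.Received >= TimeSpan.FromSeconds(1));

        foreach (var d in delayed)
        {
            // Compare traffic in the minute before the delay with the minute before that:
            // a drop followed by a spike is the state change we were looking for.
            int justBefore = entries.Count(e => e.Received >= d.Received.AddMinutes(-1) && e.Received < d.Received);
            int earlier = entries.Count(e => e.Received >= d.Received.AddMinutes(-2) && e.Received < d.Received.AddMinutes(-1));

            Console.WriteLine($"{d.Received:HH:mm:ss} delayed {(d.Stored - d.Received).TotalMilliseconds:F0} ms; " +
                              $"traffic in the two preceding minutes: {earlier} -> {justBefore}");
        }
    }
}

The point isn't the code; it's that the question "what does traffic look like just before each delay?" only becomes answerable once both timestamps live in the same dataset.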
4. External Knowledge Is Valuable
I knew SqlClient. I had read the documentation. But I hadn't connected that specific behavior to this specific problem.
AI's advantage wasn't just pattern recognition—it was the ability to cross-reference observed patterns with library specifications that I had read but not internalized.
5. The Question Shapes the Answer
This isn't about prompt engineering—crafting the perfect sentence. It's about problem framing.
"What's wrong with the code?" → AI looks for bugs, anti-patterns, potential errors
"What changed in the system state before each delay?" → AI looks for correlations, patterns, state transitions
The same data, the same AI, different outcomes. The question determines where the AI looks and what patterns it recognizes.
This is why I call my approach the "Amagi Protocol"—it's not about how you talk to AI, it's about how you structure the investigation itself.
6. AI Needs Constraints, Not Just Data
The AI's initial hypothesis was wrong. It defaulted to the most common explanation: resource limits.
What changed its direction wasn't more data—it was contradictory evidence. The load testing results didn't just add information; they eliminated an entire category of possible causes.
This is human-AI collaboration: the AI generates hypotheses quickly, but humans guide the search space by providing constraints and counter-examples.
Practical Takeaway
When debugging performance issues:
- Collect time-series data, not just error logs
- Look for patterns in traffic variations, not just absolute load
- Consider library-level behaviors, not just your code
- Use AI to correlate data patterns with specification knowledge
- Frame your question around system state changes, not just code correctness
- Provide counter-evidence to guide AI's reasoning, not just more raw data
The 1-second mystery taught me that the most insidious problems aren't in your code at all. They're in the interaction between your traffic patterns and library behaviors you thought you understood.