The Mystery
Our SQL Server service was processing approximately 200 updates per second. Performance was critical. Everything seemed fine—until it wasn't.
DataDog logs revealed something strange: a 1-second delay during updates. Not consistently, but occasionally. In a system where throughput is everything, this was unacceptable.
Received: 2024-01-15 10:23:45.123
DB Stored: 2024-01-15 10:23:46.123 ← 1 second gap
The timestamps told us what was happening. But they couldn't tell us why.
The Investigation
I did what any developer would do: I started with the obvious suspects.
Thread locks? I reviewed the source code carefully. No threading issues. No potential deadlocks.
Performance bottleneck? We had load testing data. The system handled 1,000 records per second without breaking a sweat. We were only seeing 200/sec in production.
The database itself? No locks, no blocking queries, no resource contention.
The code was clean. The performance was proven. Yet the 1-second delays kept appearing.
The Blind Spot
Here's what I didn't realize at the time: our load testing was lying to us.
Not intentionally. But load tests create artificial conditions:
- Sustained high load: Constant stream of 1,000 req/sec
- Steady state: No traffic variations
- Predictable patterns: Same conditions throughout the test
Production was different:
- Variable traffic: 200 req/sec average with peaks and valleys
- Burst patterns: Sudden spikes after quiet periods
- Real-world chaos: Traffic patterns that no load test captures
The problem wasn't visible under sustained load. It only appeared during traffic variations.
AI's Perspective
Here's the interesting part: I later learned that other engineers had tried feeding similar logs and source code into AI systems to diagnose this issue. None of them reached this conclusion.
The difference wasn't the AI. It was the question.
Most people asked: "What's wrong with the code?"
I asked: "What changed in the system state before each delay?"
I fed the AI three things:
- Time-series logs with reception timestamps
- Source code for the SQL connection handling
- Database storage timestamps
I asked it to find the correlation.
Interestingly, the AI initially pointed to resource contention and general performance bottlenecks, which is a reasonable first hypothesis for most latency issues.
So I showed it our load testing results: the system handled 1,000 requests per second without degradation.
That single piece of evidence forced the AI to abandon the "insufficient resources" hypothesis.
Only then did it start looking for state-dependent behaviors instead of capacity limits.
What the AI saw was a pattern I had missed:
Time   Traffic   Connection Pool State
10:20  High      Pool fully utilized
10:21  Low       Pool partially released (idle cleanup)
10:22  Burst     Reconnection cost → 1-second delay
The AI recognized something critical: the traffic volume changed before each 1-second delay.
Then it connected this pattern to its knowledge of SqlClient behavior:
"SqlClient's connection pool performs periodic cleanup of idle connections. When traffic drops, connections are released. When traffic suddenly spikes, new connections must be established, incurring connection overhead."
That was it. The missing piece.
The Root Cause
SqlClient has a connection pool management feature that:
- Monitors idle connections
- Releases them after a certain period (typically 4-8 minutes)
- Creates new connections when needed
Under sustained load, the pool stays warm. Connections are always ready.
Under variable load:
- Traffic drops → Connections marked as idle → Pool cleanup triggered
- Traffic spikes → New connections needed → Connection establishment cost
That connection establishment cost? About 1 second in our environment.
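You can see this cost directly by timing SqlConnection.Open() against a cold pool versus a warm one. A minimal sketch, assuming Microsoft.Data.SqlClient (the same idea applies to System.Data.SqlClient) and a placeholder connection string:

using System;
using System.Diagnostics;
using Microsoft.Data.SqlClient;

class PoolWarmupDemo
{
    static void Main()
    {
        // Placeholder connection string; replace with your own.
        const string cs = "Server=myserver;Database=mydb;Integrated Security=true;";

        // First open against an empty pool: a physical connection must be
        // established (TCP handshake, TLS, login). This is the cost we were paying
        // after idle cleanup.
        var sw = Stopwatch.StartNew();
        using (var conn = new SqlConnection(cs)) { conn.Open(); }
        Console.WriteLine($"Cold open: {sw.ElapsedMilliseconds} ms");

        // Second open: the pooled physical connection is reused, so Open() is nearly free.
        sw.Restart();
        using (var conn = new SqlConnection(cs)) { conn.Open(); }
        Console.WriteLine($"Warm open: {sw.ElapsedMilliseconds} ms");
    }
}

The gap between those two numbers is exactly the reconnection cost that idle cleanup quietly re-introduces.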
The Solution
The fix was straightforward:
// using Microsoft.Data.SqlClient; (or System.Data.SqlClient, depending on your provider)
SqlConnectionStringBuilder builder = new SqlConnectionStringBuilder
{
    // ... other settings
    MinPoolSize = 10 // Keep minimum connections alive
};
By setting MinPoolSize, we ensured that a baseline number of connections stayed warm, even during low-traffic periods. When bursts occurred, connections were ready.
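If you build the connection string by hand rather than through the builder, the equivalent keyword is Min Pool Size (for example, "...;Min Pool Size=10;"). The trade-off is a handful of connections held open even when idle, which is usually cheap compared to paying the reconnection cost at the worst possible moment: the start of a burst.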
The 1-second delays disappeared.
Lessons Learned
1. Load Testing Has Blind Spots
Sustained high-load testing won't catch issues that only appear during traffic variations. You need:
- Variable load patterns
- Burst testing after idle periods (see the sketch after this list)
- Real traffic replay when possible
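Here's a minimal sketch of that kind of burst test, using a hypothetical SendUpdateAsync stand-in for one of our update calls (replace it with a real request against a test environment):

using System;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

class BurstTest
{
    // Hypothetical stand-in for one production-style update call;
    // swap in a real call against a test environment.
    static Task SendUpdateAsync() => Task.Delay(10);

    static async Task Main()
    {
        // 1. Warm the system up, then go quiet long enough for idle-connection
        //    cleanup to run (minutes, not seconds).
        for (int i = 0; i < 100; i++) await SendUpdateAsync();
        await Task.Delay(TimeSpan.FromMinutes(10));

        // 2. Fire a sudden burst and time each request, so the first (cold) ones
        //    stand out against the median.
        var timings = new long[50];
        for (int i = 0; i < timings.Length; i++)
        {
            var sw = Stopwatch.StartNew();
            await SendUpdateAsync();
            timings[i] = sw.ElapsedMilliseconds;
        }

        var sorted = timings.OrderBy(t => t).ToArray();
        Console.WriteLine($"First request: {timings[0]} ms, median: {sorted[sorted.Length / 2]} ms");
    }
}

Sustained-load tools report averages; it's the first few requests after the quiet period that tell you whether you have this class of problem.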
2. AI Sees Different Patterns
Humans investigate by looking for "suspicious code." We think in terms of:
- "What could go wrong here?"
- "Where are the known failure points?"
AI investigates by looking for data correlations. It thinks in terms of:
- "What patterns exist in the time series?"
- "What library behaviors match these patterns?"
These are complementary approaches, not competing ones.
3. The Right Information Matters
The breakthrough came from providing AI with:
- Time-series data (not just isolated incidents)
- Source code (for context)
- Multiple timestamps (to see the delta)
Without the correlation between reception logs and storage timestamps, the pattern wouldn't have been visible.
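For illustration, the correlation itself is only a few lines once the data is in one place. A minimal sketch, assuming a hypothetical LogEntry record that pairs the reception and storage timestamps:

using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape of one correlated record: when the update arrived
// and when it was actually written to the database.
record LogEntry(DateTime Received, DateTime Stored);

static class DelayCorrelation
{
    public static void Report(IReadOnlyList<LogEntry> entries)
    {
        // Flag writes where the receive-to-store delta is suspiciously large.
        var delayed = entries.Where(e => e.Stored - e.Received >= TimeSpan.FromSeconds(1));

        foreach (var d in delayed)
        {
            // Compare traffic in the minute before the delay with the minute before that:
            // a drop followed by a spike is the state change we were looking for.
            int justBefore = entries.Count(e => e.Received >= d.Received.AddMinutes(-1) && e.Received < d.Received);
            int earlier = entries.Count(e => e.Received >= d.Received.AddMinutes(-2) && e.Received < d.Received.AddMinutes(-1));

            Console.WriteLine($"{d.Received:HH:mm:ss} delayed {(d.Stored - d.Received).TotalMilliseconds:F0} ms; " +
                              $"traffic in the two preceding minutes: {earlier} -> {justBefore}");
        }
    }
}

The point isn't the code; it's that the question "what does traffic look like just before each delay?" only becomes answerable once both timestamps live in the same dataset.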
4. External Knowledge Is Valuable
I knew SqlClient. I had read the documentation. But I hadn't connected that specific behavior to this specific problem.
AI's advantage wasn't just pattern recognition—it was the ability to cross-reference observed patterns with library specifications that I had read but not internalized.
5. The Question Shapes the Answer
This isn't about prompt engineering—crafting the perfect sentence. It's about problem framing.
"What's wrong with the code?" → AI looks for bugs, anti-patterns, potential errors
"What changed in the system state before each delay?" → AI looks for correlations, patterns, state transitions
The same data, the same AI, different outcomes. The question determines where the AI looks and what patterns it recognizes.
This is why I call my approach the "Amagi Protocol"—it's not about how you talk to AI, it's about how you structure the investigation itself.
6. AI Needs Constraints, Not Just Data
The AI's initial hypothesis was wrong. It defaulted to the most common explanation: resource limits.
What changed its direction wasn't more data—it was contradictory evidence. The load testing results didn't just add information; they eliminated an entire category of possible causes.
This is human-AI collaboration: the AI generates hypotheses quickly, but humans guide the search space by providing constraints and counter-examples.
Practical Takeaway
When debugging performance issues:
- Collect time-series data, not just error logs
- Look for patterns in traffic variations, not just absolute load
- Consider library-level behaviors, not just your code
- Use AI to correlate data patterns with specification knowledge
- Frame your question around system state changes, not just code correctness
- Provide counter-evidence to guide AI's reasoning, not just more raw data
The 1-second mystery taught me that the most insidious problems aren't in your code at all. They're in the interaction between your traffic patterns and library behaviors you thought you understood.