<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: David Campbell</title>
    <description>The latest articles on DEV Community by David Campbell (@davidcampbelldc).</description>
    <link>https://dev.to/davidcampbelldc</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3889415%2Fa2d1bb05-72e2-4c31-9f05-14008ba8b9b0.png</url>
      <title>DEV Community: David Campbell</title>
      <link>https://dev.to/davidcampbelldc</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/davidcampbelldc"/>
    <language>en</language>
    <item>
      <title>Why AI Correlation Is Harder Than You Think (And What 25 Years of Pain Taught Me)</title>
      <dc:creator>David Campbell</dc:creator>
      <pubDate>Mon, 20 Apr 2026 17:23:50 +0000</pubDate>
      <link>https://dev.to/davidcampbelldc/why-ai-correlation-is-harder-than-you-think-and-what-20-years-of-pain-taught-me-16o4</link>
      <guid>https://dev.to/davidcampbelldc/why-ai-correlation-is-harder-than-you-think-and-what-20-years-of-pain-taught-me-16o4</guid>
      <description>&lt;p&gt;Every performance tester knows the feeling. You record a user journey, hit replay, and watch your script crash within seconds. The culprit is almost always the same: dynamic data. Session tokens, CSRF values, authentication keys – they all change between requests. If your script replays the values it recorded rather than extracting fresh ones from server responses, the test is dead on arrival.&lt;/p&gt;

&lt;p&gt;The process of fixing this – identifying dynamic values, finding their origin in a previous server response, and extracting them for reuse – is called &lt;strong&gt;correlation&lt;/strong&gt;. It is the single most time-consuming and frustrating part of performance test preparation. A simple script might need a handful of correlations. A complex enterprise application (Salesforce, SAP, a modern microservices checkout) can require dozens or even hundreds.&lt;/p&gt;
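&lt;p&gt;As a minimal sketch of what a single correlation involves (the response body, field name, and token value below are invented for illustration, not from any real application):&lt;/p&gt;

```python
import re

# A recorded login response (hypothetical) containing a dynamic CSRF token.
login_response_body = '{"user": "demo", "csrfToken": "abc123"}'

# The next recorded request still carries the token captured at record time.
# Replaying it verbatim fails once the server issues a fresh token.
recorded_next_request = {"headers": {"X-CSRF-Token": "abc123"}}

# Correlation step 1: extract the fresh value from the live response.
match = re.search(r'"csrfToken":\s*"([^"]+)"', login_response_body)
fresh_token = match.group(1) if match else None

# Correlation step 2: replace the hardcoded recorded value with the variable.
recorded_next_request["headers"]["X-CSRF-Token"] = fresh_token
```

&lt;p&gt;On a real replay the extracted value differs from the recorded one on every run; the point is that the request now references whatever the server just issued.&lt;/p&gt;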

&lt;p&gt;I have spent most of my career wrestling with this problem. First as a tester, then as a consultant helping teams dig out of correlation backlogs, and now as the founder of a platform built to solve it. Along the way I built a framework for thinking about the different approaches the industry has tried.&lt;/p&gt;

&lt;p&gt;I call it the Correlation Spectrum: five levels of capability, from fully manual through to fully autonomous. Understanding these levels is not academic. It determines whether your performance testing programme is viable, efficient, or too expensive to maintain.&lt;/p&gt;

&lt;h2&gt;Level 1: Manual Correlation (With AI Hints)&lt;/h2&gt;

&lt;p&gt;The engineer opens a recorded script containing hundreds of HTTP requests. They find a failing request, compare the recorded response to the replayed response, spot a value that changed, then manually search backward through prior responses to find where it first appeared. They write an extraction rule, insert it after the originating response, and replace the hardcoded value with a variable reference.&lt;/p&gt;
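&lt;p&gt;The "search backward" step is simple to state but tedious at scale. A toy version of that detective work, with illustrative data:&lt;/p&gt;

```python
# Minimal sketch of the manual detective work: given a value that changed on
# replay, walk the earlier responses to find where it first appeared, because
# the extraction rule must be inserted after that originating response.

def find_origin(responses, dynamic_value):
    """Return the index of the earliest response containing the value."""
    for index, body in enumerate(responses):
        if dynamic_value in body:
            return index
    return -1  # never appeared: the value may be client-generated

recorded_responses = [
    '{"status": "ok"}',
    '{"sessionId": "sess-9f2", "next": "/home"}',
    '{"items": []}',
]

origin = find_origin(recorded_responses, "sess-9f2")
print(origin)  # 1 -- the extraction rule belongs after response 1
```

&lt;p&gt;A human does this by eyeballing hundreds of responses per dynamic value, which is exactly why the effort explodes on large scripts.&lt;/p&gt;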

&lt;p&gt;Modern tools at this level may use lightweight AI to suggest "this looks dynamic" or auto-generate a regex once the engineer has identified the target. But the detective work remains human-driven.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The problem is scale.&lt;/strong&gt; Manual correlation effort does not grow linearly with complexity – it grows exponentially. Each additional dynamic value increases the search space and the likelihood of cascading errors, where fixing one correlation breaks another. For scripts with thirty or more correlation candidates, manual effort can hit 40+ hours per script. At that point, teams abandon scripts rather than maintain them.&lt;/p&gt;

&lt;p&gt;I call this the "Script Museum": test assets that sit unused because they are too expensive to keep current.&lt;/p&gt;

&lt;h2&gt;Level 2: Rules-Based Frameworks&lt;/h2&gt;

&lt;p&gt;The tool ships with a library of predefined rules organised by framework. Record against a .NET application and it auto-detects &lt;code&gt;__VIEWSTATE&lt;/code&gt;, &lt;code&gt;__EVENTVALIDATION&lt;/code&gt;, and ASP.NET session IDs. All the leading commercial tools (LoadRunner, BlazeMeter, NeoLoad, OctoPerf) implement some version of this, with success rates of 60-90% on well-matched stacks.&lt;/p&gt;
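&lt;p&gt;Conceptually a rules engine is just a pattern library keyed by framework. The patterns below are illustrative, not any vendor's actual rule set:&lt;/p&gt;

```python
import re

# Illustrative rule library: extraction patterns the vendor has seen before,
# keyed by detected framework. Anything not in the library is simply missed.
CORRELATION_RULES = {
    "aspnet": [
        r'name="__VIEWSTATE" id="__VIEWSTATE" value="([^"]+)"',
        r'name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="([^"]+)"',
    ],
}

def apply_rules(framework, response_body):
    """Return every value a known rule extracts from this response."""
    hits = []
    for pattern in CORRELATION_RULES.get(framework, []):
        hits.extend(re.findall(pattern, response_body))
    return hits

body = 'name="__VIEWSTATE" id="__VIEWSTATE" value="dDwtMTA3"'
print(apply_rules("aspnet", body))  # ['dDwtMTA3']
```

&lt;p&gt;A bespoke token in a custom auth header matches nothing in the library and sails straight through undetected, which is the limitation described above.&lt;/p&gt;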

&lt;p&gt;The limitation is inherent: rules only work for patterns the vendor has already seen. Custom frameworks, bespoke auth mechanisms, anything not in the library – missed. And that last 10-40% often represents the &lt;em&gt;hardest&lt;/em&gt; correlations.&lt;/p&gt;

&lt;h2&gt;Level 3: Algorithmic Diffing&lt;/h2&gt;

&lt;p&gt;Compare recordings, spot values that change, generate extractors. Framework-agnostic. Scales better than manual work. But algorithms are smart, not intelligent. They tell you &lt;em&gt;what&lt;/em&gt; changed but struggle with &lt;em&gt;why&lt;/em&gt; it changed and whether it matters. False positives require review. There is no learning between sessions — each new script starts from scratch.&lt;/p&gt;
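&lt;p&gt;A stripped-down version of the diffing idea, with invented data, shows both the strength and the false-positive problem in a few lines:&lt;/p&gt;

```python
# Sketch of algorithmic diffing: record the same journey twice, compare the
# responses field by field, and flag whatever differs. Note the false
# positive: the server timestamp changes between runs but needs no
# correlation, and a pure diff cannot tell the difference.

recording_a = {"token": "abc123", "serverTime": "2026-04-20T17:00:00Z", "theme": "dark"}
recording_b = {"token": "qrs789", "serverTime": "2026-04-20T17:05:00Z", "theme": "dark"}

candidates = sorted(
    key for key in recording_a
    if recording_a[key] != recording_b[key]
)
print(candidates)  # ['serverTime', 'token'] -- both flagged; only one matters
```

&lt;p&gt;Every flagged-but-irrelevant value is review work for a human, which is where the "smart, not intelligent" ceiling bites.&lt;/p&gt;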

&lt;h2&gt;Level 4: General-Purpose AI&lt;/h2&gt;

&lt;p&gt;LLMs enter the workflow. Feed the recorded traffic to GPT/Claude/Gemini and let it reason about what needs correlating. The AI understands that &lt;code&gt;"csrfToken": "abc123"&lt;/code&gt; is a security token. It can trace authentication flows and reason about error messages.&lt;/p&gt;

&lt;p&gt;But general-purpose AI was not built for this problem, and it shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Context window limits&lt;/strong&gt;: A complex recording has thousands of requests. Feeding it all in exceeds limits or forces summarisation that loses critical detail.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No accumulated knowledge&lt;/strong&gt;: Each session starts fresh. The AI re-discovers patterns it solved last week.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool-specific blindness&lt;/strong&gt;: Generating a valid regex is one thing. Generating a regex safe for JMeter's ORO engine, which has specific syntax quirks and boundary handling, is another.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hallucination risk&lt;/strong&gt;: Plausible-looking but incorrect JSONPaths and regex patterns propagate without warning.&lt;/li&gt;
&lt;/ul&gt;
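&lt;p&gt;The hallucination risk in particular has a cheap mitigation worth sketching: never accept an AI-suggested extractor until it has been proved against the recorded traffic. The function name and sample data here are illustrative:&lt;/p&gt;

```python
import re

def validate_extractor(pattern, response_body, expected_value):
    """Accept the pattern only if it recovers exactly the recorded value."""
    match = re.search(pattern, response_body)
    return match is not None and match.group(1) == expected_value

body = '{"auth": {"csrfToken": "tok-42"}}'

# A plausible-looking AI suggestion that silently targets the wrong field.
bad = validate_extractor(r'"sessionToken":\s*"([^"]+)"', body, "tok-42")
good = validate_extractor(r'"csrfToken":\s*"([^"]+)"', body, "tok-42")
print(bad, good)  # False True
```

&lt;p&gt;Without a check like this, the plausible-but-wrong pattern propagates into the script and fails only at replay time, if it visibly fails at all.&lt;/p&gt;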

&lt;p&gt;I ran a stopwatch study on this exact scenario. Same test plan, same HAR recording. Manual correlation with ChatGPT as a coding assistant versus automated correlation.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Automated&lt;/th&gt;
&lt;th&gt;Manual + ChatGPT&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total time&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;75 seconds&lt;/td&gt;
&lt;td&gt;25 min 20 sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context switches&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Human errors&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Candidates found&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;6 of 6&lt;/td&gt;
&lt;td&gt;2 of 6&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The headline is 20x faster. But the coverage gap matters more: the manual run missed four out of six dynamic values. The "finished" script was sending stale data to the server. A test that looks correct but sends hardcoded tokens is worse than no test at all — it creates false confidence.&lt;/p&gt;

&lt;h2&gt;Level 5: Where It Gets Interesting – Specialised AI With Persistent Knowledge&lt;/h2&gt;

&lt;p&gt;This is the level I have been building toward. The key insight: &lt;strong&gt;correlation is not one problem. It is three.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Observe&lt;/strong&gt;: Scan every value in every response against every subsequent request. Flag what changes. Detect encodings. Fingerprint frameworks. This is mechanical work — no intelligence required, just thoroughness. Machines do this in milliseconds.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Decide&lt;/strong&gt;: Which values &lt;em&gt;need&lt;/em&gt; extraction? What type of extractor? Where does it go? How does it interact with other correlations? This requires understanding, not pattern matching.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Prove&lt;/strong&gt;: Run the test. Did the extractor capture the right value? Did the server accept the request? This is ground truth from a real execution.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
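&lt;p&gt;The three steps above can be sketched as three separate functions, which is the whole architectural point: a failure can then be attributed to a specific layer. Everything here is a toy model of the idea, not a real tool's API:&lt;/p&gt;

```python
import re

def observe(response_body, later_request):
    """Mechanical scan: which response fields reappear in a later request?"""
    pairs = re.findall(r'"([A-Za-z]+)":\s*"([^"]+)"', response_body)
    return [field for field, value in pairs if value in later_request]

def decide(fields):
    """Judgement: pick an extractor per flagged field (regex here)."""
    return {f: r'"%s":\s*"([^"]+)"' % f for f in fields}

def prove(extractors, fresh_response_body):
    """Ground truth: run each extractor against a real replayed response."""
    return {f: re.search(p, fresh_response_body) is not None
            for f, p in extractors.items()}

recorded = '{"csrfToken": "abc123", "theme": "dark"}'
request = 'POST /submit X-CSRF-Token: abc123'
fresh = '{"csrfToken": "zzz999", "theme": "dark"}'

fields = observe(recorded, request)          # only csrfToken reappears
results = prove(decide(fields), fresh)
print(fields, results)  # ['csrfToken'] {'csrfToken': True}
```

&lt;p&gt;Note that &lt;code&gt;theme&lt;/code&gt; never reaches the decision layer: observation filtered it out mechanically, so no intelligence was spent on it.&lt;/p&gt;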

&lt;p&gt;Most tools collapse all three into a single operation. When they break, you cannot tell which step failed. A broken extractor might mean the pattern was wrong (bad observation), the extraction strategy was wrong (bad decision), or the application changed (invalid proof). Single-layer tools show you a red result and leave you guessing.&lt;/p&gt;

&lt;h3&gt;Why separation enables self-healing&lt;/h3&gt;

&lt;p&gt;When the three layers are separate, self-healing becomes diagnostic:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Application update detected:
  - Observation layer: "Token format changed from opaque to JWT"
  - Decision layer: "Extraction strategy (regex on refresh_token) still correct,
    but response structure moved from flat JSON to nested auth.tokens object"
  - Proof layer: "Updated JSONPath works. Stamping golden baseline."

Result: 3 minutes, one targeted fix
vs. 20 minutes re-running the entire pipeline and hoping
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The system builds a &lt;strong&gt;Golden Map&lt;/strong&gt; – a snapshot of the world view at the moment everything was proven to work. When something breaks, the first response is not AI investigation. It is a deterministic diff against the Golden Map and a restore of what changed. Faster than an LLM call, cheaper (no API tokens), more predictable (same input, same output every time).&lt;/p&gt;

&lt;p&gt;AI agents only get involved when the restore fails – meaning the &lt;em&gt;application itself&lt;/em&gt; changed, not just the test configuration.&lt;/p&gt;
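&lt;p&gt;A deterministic diff-and-restore is almost embarrassingly simple to express, which is the point: it costs no tokens and always behaves the same way. The Golden Map structure below is invented for illustration:&lt;/p&gt;

```python
# Sketch of the deterministic first response: diff the current configuration
# against the stored golden baseline and restore any drifted entries, before
# any AI agent is consulted.

golden_map = {
    "token_extractor": {"type": "jsonpath", "path": "$.auth.tokens.refresh_token"},
    "think_time_ms": 500,
}

current_config = {
    "token_extractor": {"type": "jsonpath", "path": "$.refresh_token"},  # drifted
    "think_time_ms": 500,
}

def restore_from_golden(current, golden):
    """Overwrite drifted keys with proven values; report what drifted."""
    drifted = [k for k in golden if current.get(k) != golden[k]]
    for k in drifted:
        current[k] = golden[k]
    return drifted

drifted = restore_from_golden(current_config, golden_map)
print(drifted)  # ['token_extractor']
```

&lt;p&gt;Only when restoring a proven value still fails does the escalation to an AI agent happen, because at that point the application, not the configuration, must have changed.&lt;/p&gt;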

&lt;h3&gt;The compounding effect&lt;/h3&gt;

&lt;p&gt;The real power is what happens over time. Each layer feeds the next cycle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The observation layer builds a world view. First import: sparse. Tenth import: the system already knows where tokens live.&lt;/li&gt;
&lt;li&gt;The decision layer accumulates proven strategies. A JSONPath that worked for this Salesforce token last month still applies today.&lt;/li&gt;
&lt;li&gt;The proof layer builds a golden baseline. Drift is detected and investigated, not discovered during a production test run.&lt;/li&gt;
&lt;/ul&gt;
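&lt;p&gt;The accumulation described above amounts to a persistent store of proven strategies, keyed by something that identifies the application. The fingerprint scheme and storage shape here are invented for illustration:&lt;/p&gt;

```python
# Sketch of accumulated knowledge: proven extraction strategies keyed by an
# application fingerprint, consulted before any fresh analysis is attempted.

knowledge = {}  # fingerprint -> {field name: proven strategy}

def remember(fingerprint, field, strategy):
    knowledge.setdefault(fingerprint, {})[field] = strategy

def recall(fingerprint, field):
    return knowledge.get(fingerprint, {}).get(field)

# First session: a strategy is discovered, proven, then stored.
remember("salesforce-lightning", "sid",
         {"type": "regex", "pattern": r'"sid":"([^"]+)"'})

# Tenth session: the system recalls it instead of rediscovering it.
print(recall("salesforce-lightning", "sid")["type"])  # regex
```

&lt;p&gt;A rules library, a diffing algorithm, or a stateless LLM session starts from zero every time; this store is what makes the gap widen with every session.&lt;/p&gt;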

&lt;p&gt;The gap between specialised AI and every other approach widens with every session:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Profile&lt;/th&gt;
&lt;th&gt;Manual Estimate&lt;/th&gt;
&lt;th&gt;Automated Estimate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Simple login flow&lt;/strong&gt; (measured)&lt;/td&gt;
&lt;td&gt;25 min&lt;/td&gt;
&lt;td&gt;75 sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Standard commercial app&lt;/strong&gt; (projected)&lt;/td&gt;
&lt;td&gt;3-5 hours&lt;/td&gt;
&lt;td&gt;~10 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Complex financial system&lt;/strong&gt; (projected)&lt;/td&gt;
&lt;td&gt;~1 week&lt;/td&gt;
&lt;td&gt;~1-2 hours&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Heavy enterprise (SAP-scale)&lt;/strong&gt; (projected)&lt;/td&gt;
&lt;td&gt;8-10 weeks&lt;/td&gt;
&lt;td&gt;~1-2 days&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Manual costs accelerate (more candidates = more cascading errors). Automated costs stay near-linear.&lt;/p&gt;

&lt;h2&gt;The Question to Ask Your Tooling&lt;/h2&gt;

&lt;p&gt;Whatever correlation approach you use today, ask this: &lt;strong&gt;when a test breaks, can the system tell you whether the observation was wrong, the decision was wrong, or the application changed?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If the answer is no, you are working with a single-layer tool. It may work at small scale. But as applications grow and change accelerates, you will spend more time on diagnostic work that a well-separated architecture handles by design.&lt;/p&gt;

&lt;p&gt;Correlation improves through architecture, not through better pattern matching or bigger language models.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article is adapted from &lt;a href="https://leanpub.com/ai-performance-engineering" rel="noopener noreferrer"&gt;AI Performance Engineering: How Agentic AI Is Transforming Load Testing&lt;/a&gt; by David Campbell. The book covers the full Correlation Spectrum framework, the three-layer architecture, a minute-by-minute time-motion study, self-healing tests, and a step-by-step guide to building your own AI testing pipeline. Available on Leanpub.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you want to see the automated correlation workflow in action, &lt;a href="https://loadmagic.ai" rel="noopener noreferrer"&gt;LoadMagic&lt;/a&gt; has a free tier.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>performance</category>
      <category>testing</category>
    </item>
  </channel>
</rss>
