Most outbound machines are dumb loops. The SDR sends emails, some get opened, a few get replied to, and at the end of the week a sales manager asks how it went. The SDR says fine or not fine. Adjustments get made based on gut feel. The same mistakes get made next week because nobody was measuring what specifically changed and why.
This happens not because sales leaders do not care about data, but because the data collection, the filtering, the week-over-week comparison, and the learning capture all require manual work that nobody has time to do rigorously. So it does not get done. The loop stays dumb.
Over the past two months I have been building a version of this loop that learns. It is not one agent — it is two, working in sequence. One handles the research and prospecting quality. The other reviews what happened, measures what changed, and checks whether the team is applying the lessons. Together they produced results I can now put numbers to.
First-touch outbound volume went from 9.7 emails per business day to 23.2 — a 139% increase. Bounce rate dropped from 2.76% (above our alert threshold) to 1.18%. Our SDR went from running a single generic sequence to seven personalised entry-point variants. And for the first time since we started measuring against our new benchmark, we hit a GREEN day — clearing both the daily email target and the unique-companies-reached target in the same session.
This is the story of how that happened, what the system actually looks like, and what it took to make it work.
The two-agent system
The outbound machine has two moving parts.
The first is the Outbound Campaign Agent. It runs before a prospecting push. It researches each prospect, identifies buyer intent signals, pulls CRM data on dormant and re-engagement opportunities, and produces a ranked prospect list with personalised opening angles for each contact. When the SDR sits down to start a sequence, they are not writing generic emails to a cold list — they are sending specifically calibrated outreach based on what the agent found. The quality of the research feeds directly into the quality of the opening line, which feeds directly into the reply rate.
The second is the SDR Debrief Agent. It runs every Friday. It reviews what happened the previous week — which emails were sent, which sequences they belonged to, which touch stage each contact was at, how the engagement metrics moved — and produces a structured report before the weekly team debrief. It also maintains a learning loop: when a pattern proves durable enough to become a playbook rule, the agent tracks whether the SDRs are actually applying it week over week.
Neither agent alone produces the results we have seen. The Campaign Agent improves the input quality. The Debrief Agent closes the learning loop. The combination is what compounds.
Why I built the debrief agent when I did
The honest reason is pressure. Our inbound pipeline was underperforming and we had decided to make outbound work harder. When we looked closely at what was actually happening, we found that each SDR was running their own sequences in their own way, making changes without structure, and lessons from one campaign were not carrying over to the next.
We had dashboards that showed raw numbers, but they blended things that should have been kept apart: outbound prospecting with customer correspondence, follow-up steps fired automatically by the sequence engine with genuine new first-touch sends, and active sequences with dormant ones. We were flying blind with an instrument panel we thought we could trust.
The first thing I had to fix before writing a single line of the agent was the data. This took longer than the agent itself. The CRM stores many types of outbound emails under the same object type, and pulling only genuine first-touch prospecting required mapping each sequence template ID to its correct step, building exclusion filters for customer and internal correspondence, and running manual verification samples until the output was clean.
The lesson: the data problem is always harder than the agent problem. Anyone building something similar should budget more time here than they think they need.
What the debrief agent actually does
With the data foundation solid, the agent runs in five stages every Friday morning.
Stage 1 — Data pull and filtering. The agent pulls all outbound emails sent by the SDRs during the target week and applies layered filters to isolate genuine prospecting emails from everything else. A random sample of five emails is surfaced for manual verification every run — the ongoing calibration step that keeps the filter accurate as the team introduces new sequences.
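A minimal sketch of what the Stage 1 filter and verification sample could look like. The field names, template IDs, and excluded domains are all hypothetical placeholders, not the actual CRM schema; the real filter has more layers than the two shown here.

```python
import random
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Email:
    email_id: str
    recipient_domain: str
    template_id: Optional[str]  # None for manually written emails

# Hypothetical mapping from sequence template ID to its step number.
TEMPLATE_STEP = {"tpl-intro-a": 1, "tpl-intro-b": 1, "tpl-followup-1": 2}

# Hypothetical exclusion list for customer and internal correspondence.
EXCLUDED_DOMAINS = {"ourcompany.example", "customer.example"}

def is_first_touch(email: Email) -> bool:
    """Keep only step-1 sequence emails sent to non-excluded domains."""
    if email.recipient_domain in EXCLUDED_DOMAINS:
        return False
    return TEMPLATE_STEP.get(email.template_id) == 1

def filter_first_touch(emails):
    return [e for e in emails if is_first_touch(e)]

def verification_sample(emails, k=5, seed=None):
    """Pick up to k emails at random for manual review."""
    rng = random.Random(seed)
    return rng.sample(emails, min(k, len(emails)))
```

Emails with an unmapped template ID are dropped rather than guessed at, which is one reason the manual sample matters: a new sequence that has not been mapped yet shows up as missing volume.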
Stage 2 — Cohort assignment. Every email is assigned to the ISO calendar week it was sent. This is the foundation of honest measurement. Because replies and meetings arrive days or weeks after an email is sent, you cannot evaluate engagement on current-week sends. Each cohort gets measured again as the data matures: seven days for open rates, twenty-eight days for reply rates and meetings booked.
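The cohort logic can be sketched in a few lines. The seven-day and twenty-eight-day windows come from the description above; measuring a cohort's age from the Sunday that closes its ISO week is my assumption for the sketch, and the function names are illustrative.

```python
from datetime import date

OPEN_WINDOW_DAYS = 7     # open rates
REPLY_WINDOW_DAYS = 28   # reply rates and meetings booked

def cohort_key(sent_on: date) -> str:
    """Label an email with the ISO year and week it was sent in."""
    iso = sent_on.isocalendar()
    return f"{iso[0]}-W{iso[1]:02d}"

def cohort_end(sent_on: date) -> date:
    """Sunday that closes the ISO week containing sent_on."""
    iso = sent_on.isocalendar()
    return date.fromisocalendar(iso[0], iso[1], 7)

def maturity(sent_on: date, today: date) -> dict:
    """Report which metrics are old enough to evaluate for this cohort."""
    age_days = (today - cohort_end(sent_on)).days
    return {
        "opens_mature": age_days >= OPEN_WINDOW_DAYS,
        "replies_mature": age_days >= REPLY_WINDOW_DAYS,
    }
```

Using the ISO year rather than the calendar year matters at year boundaries: an email sent on 30 December 2024 belongs to cohort 2025-W01, not to a week of 2024.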
Stage 3 — Metric computation. Engagement is measured per touch stage, per sequence, per SDR, per country, and per recipient role. The north-star metric is always meetings booked from outbound — not sends, not opens. The daily activity bar tracks first-touch emails per day and unique companies reached per day against the agreed targets. Crucially, only genuine first-touch sends count toward the bar — not the sequence engine's automated follow-ups.
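As an illustration of the daily activity bar, here is a rough sketch that counts first-touch emails and unique companies per day and marks a day GREEN only when both targets are met. The input shape and function name are assumptions, and the targets are passed in as parameters rather than hard-coded.

```python
from collections import defaultdict
from datetime import date

def daily_bar(first_touch_sends, email_target, company_target):
    """Summarise each day against the two activity targets.

    first_touch_sends: iterable of (sent_on, company) pairs that have
    already been filtered down to genuine first-touch emails.
    """
    emails = defaultdict(int)
    companies = defaultdict(set)
    for sent_on, company in first_touch_sends:
        emails[sent_on] += 1
        companies[sent_on].add(company)

    report = {}
    for day in sorted(emails):
        n_emails = emails[day]
        n_companies = len(companies[day])
        report[day] = {
            "emails": n_emails,
            "companies": n_companies,
            # GREEN requires clearing both targets on the same day.
            "green": n_emails >= email_target and n_companies >= company_target,
        }
    return report
```

Counting companies as a set is what keeps three emails to the same account from looking like three accounts reached.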
Stage 4 — Rolling baseline comparison. Each metric is compared against a four-week rolling average. A movement qualifies as significant only if it passes both a standard deviation test and a minimum threshold: five percentage points for rates, twenty percent relative change for counts. Below that bar, it is noise.
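The two-part significance check might look something like the sketch below. The five-point and twenty-percent thresholds are the ones stated above. The exact form of the standard deviation test is not spelled out here, so the cutoff of two sample standard deviations is an assumption of mine, as are the function and constant names.

```python
from statistics import mean, stdev

RATE_MIN_POINTS = 5.0      # percentage points, for rates
COUNT_MIN_RELATIVE = 0.20  # relative change, for counts
SD_CUTOFF = 2.0            # assumed: movement must exceed 2 standard deviations

def is_significant(current, previous_four, kind):
    """Compare this week's value with the prior four weeks.

    kind is "rate" (values in percent) or "count".
    """
    baseline = mean(previous_four)
    spread = stdev(previous_four)
    delta = current - baseline

    passes_sd = abs(delta) > SD_CUTOFF * spread
    if kind == "rate":
        passes_min = abs(delta) >= RATE_MIN_POINTS
    elif kind == "count":
        passes_min = baseline > 0 and abs(delta) / baseline >= COUNT_MIN_RELATIVE
    else:
        raise ValueError(f"unknown metric kind: {kind}")

    # Both tests must pass; otherwise the movement is treated as noise.
    return passes_sd and passes_min
```

Requiring both tests guards against two different failure modes: a small shift in a very stable metric passes the deviation test but not the minimum threshold, and a large swing in a volatile metric does the opposite.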
Stage 5 — Candidate lesson generation. When a pattern is significant and large enough to generalise, it surfaces as a candidate — not a recommendation. The human reviews it, discusses it with the team, and decides whether to apply it. Only then does it enter the playbook.
The two design choices that determined whether this was useful
Anyone with CRM access and a capable model could build a version of this in a few days. What separates a useful version from a useless one is two decisions.
Cohort Maturity Windows. The temptation, every week, is to measure engagement on emails sent this week. The data is right there. It looks meaningful. It is not. Reply rates and meeting-booked rates need twenty-eight days to stabilise. If you draw conclusions from immature data, you will change things that were working and hold on to things that were not. The agent only surfaces candidate lessons from cohorts at least twenty-eight days old. Current-week data is shown as in-flight: visible but not actionable.
This felt conservative in the early weeks. By week five, when the first statistically valid recommendations started arriving, it felt correct.
The Feedback Compliance Loop. Once a lesson is confirmed and shared with the SDRs, the agent does not forget it. It adds the lesson to a structured file with a machine-checkable rule and measures compliance every subsequent week. The sales coaching failure mode — rep agrees in the meeting, changes nothing in practice — becomes visible as data rather than a vague feeling. If compliance drops below target for three consecutive weeks, it surfaces as an escalation. The conversation shifts from "I think you might not be doing this" to "compliance has been at 40% for three weeks — let's talk about what's blocking it."
This was the design choice that surprised me most by how much it mattered. The feedback loop without the compliance check is just a journal. The compliance check is what makes it a coaching system.
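To make the compliance check concrete, here is a small sketch. It treats a lesson's machine-checkable rule as a predicate over an email record and escalates after three consecutive weeks below target. The record shape, the example rule, and the function names are hypothetical.

```python
def compliance_rate(emails, rule):
    """Share of emails that satisfy the rule, or None if there were no sends."""
    if not emails:
        return None
    return sum(1 for e in emails if rule(e)) / len(emails)

def needs_escalation(weekly_rates, target, streak=3):
    """True when the most recent `streak` weeks were all below target."""
    recent = weekly_rates[-streak:]
    return len(recent) == streak and all(rate < target for rate in recent)

# Hypothetical rule: the opening line must reference a research signal.
def uses_signal(email):
    return bool(email.get("signal"))
```

For example, weekly rates of 0.9, 0.4, 0.4, 0.4 against a target of 0.8 would trigger an escalation, while 0.4, 0.4, 0.9 would not, because the most recent week recovered.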
What changed in the first 30 days
The numbers I shared at the top did not come from the Debrief Agent alone. It is worth being precise about what drove what.
The 139% increase in first-touch volume came primarily from measurement clarity. Before the Debrief Agent, the daily activity bar counted all outbound emails, including the sequence engine's automatic follow-ups, customer correspondence, and manual thread replies. The SDR appeared to be hitting targets. What was actually happening was the engine doing most of the sends while genuine new prospecting sat at 9-10 emails per day, well below the 25-30 target. Once the bar was reframed to count only genuine first-touch sends, the real baseline became visible. The improvement, to 23 per day over the last two weeks and 27.6 in the most recent week, came from the SDRs adjusting their actual new-prospecting behaviour once they could see the real number.
The sequence sophistication — from one generic opening angle to seven personalised entry-point variants — came from the Campaign Agent improving the research quality. The SDRs were not just sending more emails. They were sending better ones, anchored to specific signals: a company scaling its team, a new ERP rollout being managed alongside new headcount, a multi-entity structure with project tracking complexity. This is what happens when the prospecting research is done by an agent that knows what signals to look for, rather than by a rep with thirty minutes before the call.
The 57% reduction in bounce rate came from deliverability cleanup that the Debrief Agent made visible. At 2.76%, we were above the alert threshold that most email deliverability guides cite as the point where domain reputation starts to suffer. Once that figure surfaced weekly in the debrief with a clean breakdown of which sends were bouncing and why, it became a priority to fix rather than background noise. Within two weeks it was down to 1.18%.
The step ratio improvement — from 1:1.16 to 1:0.93 — is the metric I find most structurally meaningful.
A quick explanation of what it measures. Your sequence engine automatically fires follow-up steps on contacts already in a sequence — step 2, step 3, step 4 — without the SDR doing anything. Genuine first-touch emails are the opposite: the SDR starting a brand new sequence with a brand new contact. The step ratio compares those two numbers: first-touch sends vs. automated follow-up steps fired.
At 1:1.16, the engine was firing 1.16 automated steps for every 1 new contact being added. Consumption was outpacing replenishment. Like a bank account where withdrawals slightly exceed deposits — you don't notice immediately, but the active prospect pool slowly shrinks. In practice this means the SDR can look busy (lots of activity from the engine) while barely adding anyone new.
At 1:0.93, that flipped. New prospecting is now ahead of consumption. The pool is growing.
Most outbound dashboards never surface this. They show opens, replies, and meetings, none of which tell you whether the pipeline is being built or quietly depleted. The step ratio catches that early. When the follow-up side of the ratio rises above 1.0, it is not a crisis today; it is a pipeline problem arriving in three to four weeks when the current sequences expire. The Debrief Agent tracks it as a structural health signal alongside the engagement metrics.
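The step ratio is simple enough to sketch directly. The version below expresses it as automated follow-up steps per first-touch send, so 1:1.16 becomes 1.16 and 1:0.93 becomes 0.93. The function names and the trend labels are mine, chosen for illustration.

```python
def step_ratio(first_touch_sends: int, automated_followups: int) -> float:
    """Automated follow-up steps fired per genuine first-touch send."""
    if first_touch_sends <= 0:
        raise ValueError("need at least one first-touch send to compute a ratio")
    return automated_followups / first_touch_sends

def pool_trend(ratio: float) -> str:
    """Above 1.0 the engine consumes prospects faster than they are added."""
    if ratio > 1.0:
        return "shrinking"
    if ratio < 1.0:
        return "growing"
    return "flat"

def format_ratio(ratio: float) -> str:
    """Render the ratio in the 1:x form used in the weekly report."""
    return f"1:{ratio:.2f}"
```

With 100 first-touch sends and 116 automated steps, the ratio formats as 1:1.16 and the pool is shrinking; with 93 automated steps it formats as 1:0.93 and the pool is growing.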
What comes next
The first fully mature cohort — twenty-eight days old, with complete reply and meeting data — arrives in early June. That is when the recommendation engine fires for the first time with statistically valid input. Until then, we have been in the baseline-establishment phase.
I will update this article at sixty and ninety days with the reply rate and meeting-booked data. If those metrics follow the same trajectory as volume and deliverability, the story compounds. If they do not, that will also be worth writing about honestly.
Results in this article reflect approximately 30 days of deployment. Updated results at 60 and 90 days will be published in the newsletter.
The deeper lesson
The agent is not the moat. The loop is.
Anyone can prompt a model to summarise last week's emails. The moat is the structured, compounding system that makes each week's data more valuable than the last — because it is being compared against a rolling baseline, interpreted through a consistent framework, and translated into lessons that get measured for compliance. That loop, once it is running, does not forget. It does not have bad weeks. And it does not let the team slide back to the habits they had before someone started measuring.
The 139% volume increase is a real number. But the more important number is week five — when the first statistically valid recommendation arrives, grounded in cohorts that are old enough to trust. That is the moment the loop stops being a reporting tool and starts being a learning system.
We are not there yet. But the trajectory is pointing in the right direction, and this time we have the data to prove it.