Herbert Cuba Garcia

Posted on • Originally published at cubagarcia.com

APEX: Agentic Production Execution

Pronounced "AY-pecks" — a production-grade operating model for teams where humans design and verify while agents execute and iterate.


☝️ This is a reference document, not a quick read. Bookmark it. Return to it when you're configuring a new team, troubleshooting a system that isn't improving, or explaining to a stakeholder why agentic production requires more than choosing a model.


Table of Contents

  1. What Is APEX and Why It Matters
  2. Who This Guide Is For
  3. Built on Proven Foundations
  4. Core Principles
  5. The APEX Cycle
  6. Phase 1: Strategic
  7. Phase 2: Execution
  8. Phase 3: Reflection
  9. The /goal Inner Loop
  10. Measuring Success: Six Metrics
  11. Getting Started
  12. Use Case: Code Production
  13. Use Case: Content Production
  14. Use Case: Data & Research Pipeline
  15. One Organization, Multiple Fleets
  16. Conclusion
  17. Learn More
  18. Appendix A: The Nine Domains Reference
  19. Appendix B: Use Case Matrix
  20. About the Author

What Is APEX and Why It Matters

APEX stands for Agentic Production Execution. It is a framework for organizing how humans and agents work together to produce reliable output at scale — not the output of a single session, but the sustained output of a team running agentic systems across consecutive work cycles.

A few years into working with agents in production, I found myself running ten agents across three different projects simultaneously. Code agents, content agents, research agents. They were producing work. Some of it was good. Some of it was quietly terrible, and I didn't catch it until it had already shipped downstream. The problem wasn't the agents. The problem was me. I had no system for who decides what, when agents should iterate on their own, and when a human needs to step in.

APEX is my attempt to build that system into something a team can actually run.

What APEX IS

  • A three-phase operating cycle: Strategic → Execution → Reflection
  • An organizational scaffold with three areas and nine named domains, each with clear ownership
  • A measurement framework with six metrics that make calibration data-driven
  • A methodology wrapper that sits around whatever tools, harnesses, or methodologies you're already using
  • A team operating model — not an individual practitioner's workflow

What APEX IS NOT

  • A prompt engineering guide
  • A replacement for your existing delivery methodology (Scrum, Kanban, SAFe — APEX wraps around all of them)
  • A prescription for specific tools or models
  • A guarantee of output quality — it's a structure that makes quality improvable
  • An individual playbook that scales by adding more individuals

💡 The gap between "one person using agents well" and "a team of five running agentic production" is where most organizations stall. APEX is designed to close that gap.

The most important conviction behind APEX: outsource execution, keep strategy human. The moment you let agents decide what to build rather than how to build it, you've lost the thread. Every design decision in this framework follows from that.


Who This Guide Is For

Tech Directors and Engineering Leads who are responsible for how their teams adopt agentic systems and need an organizational model, not just tooling advice.

Product Managers who need to understand why spec quality is the primary variable in agentic output quality — and what that means for their role.

QA Leads and Editors who own quality definitions and need a framework that elevates that ownership rather than automates it away.

AI Engineers who configure agent systems and need a vocabulary for communicating with the rest of the organization about what they're building.

Anyone running agents in production who has experienced the gap between a promising demo and reliable team-wide output, and wants a structured approach to closing it.


Built on Proven Foundations

APEX didn't emerge from theory. It emerged from practice — and it builds on research that confirms the patterns practitioners keep discovering independently.

Anthropic's harness design research established the foundational insight that this entire framework depends on: separating the agent that generates work from the agent that evaluates it produces dramatically better output than self-assessment. APEX formalizes this structural separation through its two-tier QA system: QA Strategic (humans define quality) and QA Operational (agents enforce it).

The convergence of Judge-Evaluated Continuation — the pattern that OpenAI's Codex CLI /goal, Anthropic's Claude Code /goal, and other agentic platforms independently shipped within weeks of each other — confirms that the Execution loop APEX describes is discoverable. What isn't standardized is the organizational scaffolding around it, which is exactly what APEX provides.

DORA metrics and the DevOps research tradition inform the APEX Metrics design. The principle is the same: a small set of named, measurable indicators that cover the full cycle and make data-driven improvement possible.


Core Principles

These ten principles govern how APEX operates.

① 🏗️ Harness First
Your runtime choice sets all constraints. Decide the harness before you configure anything else.

② 👤 Human in Control of Outcome
Humans own the outcome — not every step, but the result. They design the system, verify the output, and decide what to change.

③ 📥 Quality In = Quality Out
The output of any agentic system is a direct function of what goes in — specs, context, configuration, criteria.

④ 🤖 Agents Review Agents First
All work passes through agent-to-agent review before a human sees it. This is how you get speed without sacrificing quality.

⑤ 🎯 Domain-Mapped Ownership
Quality gates map to expertise, not generic reviews. Nine domains across three areas, each with a clear owner.

⑥ 🔄 Iterate Often, Iterate Fast
Agent-to-agent iteration loops compress what used to take days into hours.

⑦ 🔒 Least Privilege
Every agent gets only the access it needs. No more.

⑧ 📈 Calibrate the System, Not Just the Output
Reflections improve the system itself, not just the current deliverable.

⑨ 📊 Data-Driven Reflections
Agents report metrics. Humans decide based on data, not gut feelings.

⑩ 🔭 Think Big, Scale Back
Design the whole system first. Remove what's premature. Keep the architecture.


The APEX Cycle

Everything in APEX revolves around three phases that repeat continuously:

The APEX Framework — Full Cycle

Strategic (2–3× velocity)
    ↓  Human-First: design, specify, configure
Execution (10–20× velocity)
    ↓  Agent-First: execute, review, iterate
Reflection (1–2× velocity)
    ↓  Data-First: evaluate, reflect, calibrate
    ↑
    └─── feeds back into next Strategic phase

☝️ The phase most teams cut under delivery pressure: Reflection. They ship, move on, start the next sprint. The result: the same problems repeat, iteration depth stays flat, and first-pass acceptance doesn't improve.


Phase 1: Strategic

Strategic is where humans do all the thinking that agents will later act on. It organizes that work into three areas containing nine domains, and all nine live in this phase.

Area 1: Platform

📡 Domain 1 — Infrastructure

Owns the runtime, compute, and harness selection. The most consequential decision in the entire framework is harness selection. Document it in a Harness Decision Record.

| Harness Type | Examples | Best For |
|---|---|---|
| General Purpose | Claude Code, Codex CLI, Cursor | Exploratory work, human available |
| Specialized | Devin, Harvey | Regulated domains, auditability |
| Autonomous | OpenClaw, CrewAI | Scheduled, unattended workflows |
| Hierarchical | AutoGen, MetaGPT | Large projects with sub-task decomposition |
| DAG-Based | LangGraph, Prefect, Dagster | Fixed-shape pipelines |
| Hybrid | Combination | Where most mature setups end up |
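
A Harness Decision Record does not need special tooling. As a minimal sketch, it could be a small structured object like the one below. The class and field names (`HarnessDecisionRecord`, `rationale`, `alternatives_considered`, `revisit_when`) are illustrative assumptions, not a format APEX prescribes; only the harness type names come from the table above.

```python
from dataclasses import dataclass, field
from enum import Enum


class HarnessType(Enum):
    # Values mirror the harness types listed in the table above.
    GENERAL_PURPOSE = "General Purpose"
    SPECIALIZED = "Specialized"
    AUTONOMOUS = "Autonomous"
    HIERARCHICAL = "Hierarchical"
    DAG_BASED = "DAG-Based"
    HYBRID = "Hybrid"


@dataclass
class HarnessDecisionRecord:
    """Hypothetical record shape; every field name here is an assumption."""
    fleet: str
    harness_type: HarnessType
    rationale: str
    alternatives_considered: list[HarnessType] = field(default_factory=list)
    revisit_when: str = ""

    def summary(self) -> str:
        # One-line form suitable for a dashboard or a changelog entry.
        rejected = ", ".join(h.value for h in self.alternatives_considered) or "none"
        return f"{self.fleet}: {self.harness_type.value} (rejected: {rejected})"


record = HarnessDecisionRecord(
    fleet="research",
    harness_type=HarnessType.DAG_BASED,
    rationale="Fixed-shape daily pipeline that must be auditable end to end.",
    alternatives_considered=[HarnessType.AUTONOMOUS],
    revisit_when="The pipeline shape stops being fixed.",
)
```

Recording the rejected alternatives and a revisit condition keeps the decision reviewable during Reflection instead of becoming an unexamined default.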

🖥️ Domain 2 — Operational Tooling

Dashboards, metrics pipelines, agent activity monitors. Without this, Reflection degrades into guesswork.

🔐 Domain 3 — Security & Compliance

Permissions, data flows, regulatory constraints.

Area 2: Spec

📚 Domain 4 — Business Context

The "why" and the "for whom." Brand understanding, personas, competitive landscape.

📝 Domain 5 — Spec Engineering

The translation layer between strategic thinking and executable work. PRDs, user stories with acceptance criteria, editorial briefs.

💡 Vague specs produce vague output. Not because of the model — because of the spec.

✅ Domain 6 — QA Strategic

Where humans define quality. What "done" means. How output is evaluated. What data feeds into Reflections.

Area 3: Config

🤖 Domain 7 — Agent Design

Agent identities, behavior configuration, skills, instructions, memory. The richer the identity file, the less the agent needs to infer.

🔀 Domain 8 — Orchestration Design

Routing rules, delegation chains, handoff protocols, workflow maps.

🔎 Domain 9 — QA Operational

Agent-to-agent review criteria, quality gates within iteration cycles. The generator never grades its own homework.


Phase 2: Execution

The inner loop where agents do the actual work.

Spec → Execute → Review → Iterate → Verify
  1. Spec — a human writes a spec
  2. Execute — an agent executes against that spec
  3. Review — a separate review agent evaluates output against QA Operational criteria
  4. Iterate — if issues are found, the executing agent revises
  5. Verify — when the agent loop passes, a human verifies against QA Strategic criteria
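
The five steps above can be sketched as a small control loop. This is a harness-agnostic illustration, not the API of any specific tool: `execute` and `review` stand in for whatever agents you run, and the `max_iterations` budget is an assumption added so the loop always terminates.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Review:
    passed: bool
    feedback: str = ""


def run_execution_loop(
    spec: str,
    execute: Callable[[str, str], str],      # (spec, feedback) -> draft
    review: Callable[[str, str], Review],    # (spec, draft) -> Review
    max_iterations: int = 5,                 # assumed budget, not an APEX rule
) -> tuple[str, int, bool]:
    """Return (draft, iterations_used, ready_for_human_verification)."""
    feedback = ""
    draft = ""
    for iteration in range(1, max_iterations + 1):
        draft = execute(spec, feedback)      # Execute: agent works from the spec
        result = review(spec, draft)         # Review: a separate agent evaluates
        if result.passed:
            return draft, iteration, True    # Verify: hand off to a human
        feedback = result.feedback           # Iterate: revise using the feedback
    return draft, max_iterations, False      # budget spent: escalate instead


# Stub agents so the sketch runs without any model behind it.
def stub_execute(spec: str, feedback: str) -> str:
    return f"{spec} [fixed]" if feedback else spec


def stub_review(spec: str, draft: str) -> Review:
    return Review(passed=draft.endswith("[fixed]"), feedback="error state mismatch")


draft, iterations, ready = run_execution_loop("login form", stub_execute, stub_review)
```

The `iterations` value returned here is the raw input for the Iteration Depth metric, and the `ready` flag marks the designed verification point where a human takes over.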

🤖 If you find yourself constantly stepping into agent loops to clarify or redirect, the problem is not the agents. The problem is in your Strategic configuration.


Phase 3: Reflection

The rhythm that keeps the system alive.

Step 1 — Evaluate

Review the actual output against the original intent, and polish it where it falls short.

Step 2 — Reflect

Agents report metrics. Humans identify patterns.

Step 3 — Calibrate

Implement changes across whichever area needs them. These changes flow into the next Strategic phase.

This is what separates APEX from a static pipeline. A pipeline runs the same way forever. APEX evolves.


The /goal Inner Loop

Between April 16 and May 11, 2026, three independent platforms shipped the same agentic execution pattern. OpenAI's Codex CLI launched /goal, and Anthropic's Claude Code followed. Three independent teams. One pattern. Less than a month apart.

The mechanics: you give the agent a goal. An internal Judge assesses each turn against the goal. If not met and budget remains, the loop continues. Call this Judge-Evaluated Continuation.
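
To make the mechanics concrete, here is a minimal sketch of Judge-Evaluated Continuation. It is an illustration of the pattern as described above, not the implementation used by Codex CLI, Claude Code, or any other platform. The names `take_turn`, `judge`, and `turn_budget` are assumptions, and the budget is counted in turns for simplicity.

```python
from typing import Callable


def judge_evaluated_continuation(
    goal: str,
    take_turn: Callable[[str, list[str]], str],   # (goal, history) -> new output
    judge: Callable[[str, list[str]], bool],      # (goal, history) -> goal met?
    turn_budget: int,
) -> tuple[list[str], str]:
    """Run turns until the judge says the goal is met or the budget runs out."""
    history: list[str] = []
    while turn_budget > 0:
        history.append(take_turn(goal, history))  # the agent takes one turn
        turn_budget -= 1
        if judge(goal, history):                  # the judge assesses that turn
            return history, "goal_met"
    return history, "budget_exhausted"


# Stubs: the "agent" emits numbered steps, the "judge" wants three of them.
def stub_turn(goal: str, history: list[str]) -> str:
    return f"step {len(history) + 1}"


def stub_judge(goal: str, history: list[str]) -> bool:
    return len(history) >= 3


met_history, met_status = judge_evaluated_continuation("ship it", stub_turn, stub_judge, 5)
short_history, short_status = judge_evaluated_continuation("ship it", stub_turn, stub_judge, 2)
```

Note that the judge only sees the goal and the work so far. Whatever "met" means has to be expressed in the goal itself, which is why the quality of the criteria matters so much.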

How /goal Fits Inside APEX

/goal covers the Execution Phase of APEX. But there are two things it doesn't address:

No Strategic Phase. /goal takes a goal as input. It doesn't help you figure out what goal to set or define what "good output" means.

No Reflection Phase. Without Reflection, every run starts from the same baseline. The agent configurations don't improve. The specifications don't sharpen.

💡 The Judge's evaluation is only as good as the criteria it's evaluating against. When an agent produces output that technically satisfies the goal but misses actual intent, the gap is almost always in the Strategic setup.

In APEX terms: /goal is the Execution Phase. APEX is the operating model around it.


Measuring Success: Six Metrics

📊 Metric 1 — First-Pass Acceptance Rate

Percentage of deliverables accepted at human verification without being sent back. Signals: spec quality.

📊 Metric 2 — Iteration Depth

Average agent-to-agent iterations per task before human verification. Signals: spec quality and agent capability. The trend matters more than the number.

📊 Metric 3 — Human Touch Rate

Percentage of tasks requiring human intervention during Execution outside designed verification points. Should decrease over time.

📊 Metric 4 — Calibration Impact

Change in the other five metrics from one cycle to the next. The meta-metric. If it's flat, the ceremony is happening but the learning isn't.

📊 Metric 5 — Cycle Time

Elapsed time from spec entering the loop to verified delivery. Signals: end-to-end system maturity.

📊 Metric 6 — Cost Per Task

Total compute and API cost per verified deliverable. Signals: efficiency and ROI. Track per deliverable type — a complex feature and a social media post have fundamentally different cost profiles. The goal isn't minimizing cost; it's understanding what you're paying per unit of verified output so you can make informed decisions about model selection, iteration budgets, and harness configuration. Declining cost per task at stable quality is the clearest sign your system is becoming efficient, not just fast.
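
As a rough sketch, the metrics above can be computed from one record per verified deliverable. The `TaskRecord` fields are hypothetical names chosen for illustration, and using a plain average for iteration depth, cycle time, and cost is an assumption; medians or per-deliverable-type breakdowns may serve you better. Calibration Impact is shown as the change in each of the other metrics between two cycles.

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    accepted_first_pass: bool   # accepted at human verification, not sent back
    iterations: int             # agent-to-agent iterations before verification
    human_intervened: bool      # unplanned human step-in during Execution
    cycle_hours: float          # spec entering the loop -> verified delivery
    cost: float                 # compute + API cost, in any consistent unit


def cycle_metrics(tasks: list[TaskRecord]) -> dict[str, float]:
    n = len(tasks)
    if n == 0:
        raise ValueError("need at least one task record")
    return {
        "first_pass_acceptance_rate": sum(t.accepted_first_pass for t in tasks) / n,
        "iteration_depth": sum(t.iterations for t in tasks) / n,
        "human_touch_rate": sum(t.human_intervened for t in tasks) / n,
        "cycle_time_hours": sum(t.cycle_hours for t in tasks) / n,
        "cost_per_task": sum(t.cost for t in tasks) / n,
    }


def calibration_impact(previous: dict[str, float], current: dict[str, float]) -> dict[str, float]:
    """Change in each metric from one cycle to the next."""
    return {name: current[name] - previous[name] for name in current}


cycle_1 = cycle_metrics([
    TaskRecord(True, 4, False, 8.0, 2.0),
    TaskRecord(False, 2, True, 4.0, 1.0),
    TaskRecord(False, 3, False, 6.0, 1.5),
    TaskRecord(True, 3, False, 6.0, 1.5),
])
cycle_2 = cycle_metrics([
    TaskRecord(True, 2, False, 4.0, 1.0),
    TaskRecord(True, 2, False, 4.0, 1.0),
    TaskRecord(False, 2, False, 4.0, 1.0),
    TaskRecord(True, 2, False, 4.0, 1.0),
])
impact = calibration_impact(cycle_1, cycle_2)
```

In this made-up data, acceptance rises and iteration depth falls between cycles, which is the direction a working Reflection phase should produce.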


Getting Started

Week 1: Map the nine domains to people. Match domain to expertise.

Week 2: Set up Platform. Harness Decision Record. Basic dashboard.

Week 3: Build the Spec Area. Business Context, Spec Engineering, QA Strategic.

Week 4: Configure agents. Agent Design, Orchestration Design, QA Operational. Run a cycle.

After the cycle: Run Reflection. Measure. Calibrate. Run another cycle.

In my experience, the second cycle is meaningfully better than the first. The fifth is dramatically better than the second.


Use Case: Code Production

This is the use case I've spent the most time with — running agentic software development in production.

The People

Seven experts, each owning specific domains:

  • CTO / Tech Director — owns Infrastructure. The harness decision, the model strategy, the compute architecture.
  • Tech Lead — owns Orchestration Design and co-owns Agent Design. They know the codebase, they design how work flows between agents.
  • Product Manager — owns Business Context and Spec Engineering. Their specs are what agents execute against.
  • AI Engineer — owns Agent Design and co-owns Infrastructure. Configures agent identities, skills, memory, and tool access.
  • Developer — owns Operational Tooling. Dashboards, metrics pipelines, context-generation tools.
  • QA Lead — owns QA Strategic and QA Operational. Defines what "done" means and translates those definitions into automated agent-level checks.
  • Security Engineer — owns Security & Compliance. Permissions, audit trails, access boundaries.

Every person maps to specific domains. Nobody owns everything. Nobody owns nothing.
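
That last rule is easy to check mechanically. The sketch below encodes the ownership list above, with co-ownership treated the same as ownership, and reports any gaps. The function name `ownership_gaps` and the shape of its result are assumptions made for illustration.

```python
DOMAINS = [
    "Infrastructure", "Operational Tooling", "Security & Compliance",
    "Business Context", "Spec Engineering", "QA Strategic",
    "Agent Design", "Orchestration Design", "QA Operational",
]

# Owners and co-owners as described in the list above.
OWNERS: dict[str, set[str]] = {
    "CTO / Tech Director": {"Infrastructure"},
    "Tech Lead": {"Orchestration Design", "Agent Design"},
    "Product Manager": {"Business Context", "Spec Engineering"},
    "AI Engineer": {"Agent Design", "Infrastructure"},
    "Developer": {"Operational Tooling"},
    "QA Lead": {"QA Strategic", "QA Operational"},
    "Security Engineer": {"Security & Compliance"},
}


def ownership_gaps(owners: dict[str, set[str]], domains: list[str]) -> dict[str, list[str]]:
    """Report domains without an owner and people who own nothing or everything."""
    all_domains = set(domains)
    covered = set().union(*owners.values())
    return {
        "unowned_domains": sorted(all_domains - covered),
        "unknown_domains": sorted(covered - all_domains),   # typos, stale names
        "owns_nothing": sorted(p for p, d in owners.items() if not d),
        "owns_everything": sorted(p for p, d in owners.items() if d >= all_domains),
    }


gaps = ownership_gaps(OWNERS, DOMAINS)
```

Running a check like this whenever the team changes is a cheap way to notice when a domain has quietly lost its owner.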

Strategic Phase

Infrastructure. The CTO chooses an autonomous orchestration harness — specialized agents spawned for their respective domains. Frontend work routes to a frontend agent, integrations to an integration agent, QA to a QA agent. Model strategy is tiered: premium model for code review (where nuance matters most), mid-tier for code generation (balancing quality and speed), fast models for linting and formatting validation.

Business Context. The PM populates the workspace with: product vision doc, competitive analysis, user personas — all markdown, all in the fleet workspace where agents reference them.

Spec Engineering. Feature PRD with specific, testable acceptance criteria — not "the feature works well" but "API response time under 200ms, error states render the correct component, validation messages match the copy doc." Vague specs produce vague output.

Agent Design. Four agents:

  • Architect agent — reads the PRD, decomposes features into implementable tasks, reviews output with architectural judgment. Identity file contains architecture decision records and technical standards.
  • Frontend Developer agent — identity includes UI component standards, design system references, accessibility requirements. Skills for running dev server and component tests.
  • Integrator agent — focused on backend connections, API integrations, data flows. Identity includes API contracts, auth patterns, and infrastructure docs.
  • QA Engineer agent — writes regression and e2e tests for completed work. Reads acceptance criteria, reviews implementation, produces test suites. Runs after human verification — codifying correctness into automated tests.

Orchestration. Architect decomposes the PRD → assigns tasks to Frontend Developer and Integrator (parallel where possible) → reviews output against architectural standards → passes or sends back with feedback → human verifies → QA Engineer builds tests. Routing is explicit — no agent decides on its own where work goes next.
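
One way to keep routing explicit is a lookup table that is the only place where "what happens next" is defined. The sketch below encodes the flow just described; the step and outcome names are hypothetical labels, and it only includes transitions the text mentions. Anything else is an error rather than an agent's judgment call.

```python
# (current step, outcome) -> next step. Labels are illustrative assumptions.
ROUTES: dict[tuple[str, str], str] = {
    ("architect_decompose", "tasks_ready"): "developer_agents_build",
    ("developer_agents_build", "submitted"): "architect_review",
    ("architect_review", "pass"): "human_verify",
    ("architect_review", "send_back"): "developer_agents_build",
    ("human_verify", "approved"): "qa_engineer_tests",
}


def next_step(current: str, outcome: str) -> str:
    """Look up the next step; undefined transitions fail loudly."""
    try:
        return ROUTES[(current, outcome)]
    except KeyError:
        raise ValueError(f"no route defined from {current!r} on {outcome!r}") from None


after_rejection = next_step("architect_review", "send_back")
after_approval = next_step("human_verify", "approved")
```

Because the table is data, it can live in the Orchestration Design artifacts and be changed during Calibration without touching any agent's instructions.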

Execution Phase

The PM writes the feature spec. The Architect decomposes it into five tasks — three frontend, two integration. Frontend Developer and Integrator work in parallel on independent tasks.

Frontend Developer completes task one. Architect reviews: follows design system? Components structured correctly? Accessibility handled? First pass: error state component doesn't match the design system. Frontend Developer fixes it. Second pass: approved. Surfaces for human verification.

The Developer verifies — not checking CSS (the Architect handled that), but checking intent. Does this capture what the PM specified? Is this the right abstraction?

Once approved, the QA Engineer writes regression tests that codify the verified correctness. Meanwhile, the Integrator finishes the API task. Same review loop. Same verification. Same test generation.

The velocity gain is real. Developers review pre-validated output instead of raw pull requests. QA tests are auto-generated from verified implementations. A week's cycle compresses into two days of agent execution plus verification touchpoints.

Reflection Phase

Five features implemented, reviewed, verified, merged. The dashboard shows:

  • Feature A: 4 iterations average. Feature B: only 2.
  • First-pass acceptance: 62%.
  • Human touch rate: 8% — one Developer stepped in to clarify an architectural decision mid-loop.

The Integrator agent's logs reveal a pattern: API integration tasks consistently required more review passes. The Architect's reports show the Integrator kept misinterpreting the auth flow because Business Context didn't include auth architecture patterns.

Calibration actions:

  • PM adds auth architecture patterns document to Business Context
  • AI Engineer updates Integrator's identity file to reference the new auth doc
  • AI Engineer updates QA Engineer's test-generation context with auth flow patterns
  • QA Lead adds auth-pattern-specific check to QA Operational

Next cycle, the Integrator won't struggle with auth integration because the context exists. Iteration depth on API tasks should drop. First-pass acceptance should improve. That's the hypothesis — the metrics will confirm or deny it.

The agents aren't magically smarter. The context they operate within is smarter.


Use Case: Content Production

Same framework, different world.

The People

  • Editorial Lead — owns Business Context and QA Strategic
  • Content Strategist — owns Spec Engineering
  • Copywriter / Editor — contributes to QA Operational criteria
  • AI Engineer — owns Infrastructure and Agent Design
  • Developer — owns Operational Tooling
  • Brand Manager — contributes to Business Context and Security & Compliance

Strategic Phase

Infrastructure. Autonomous harness — agents run on schedule, producing content batches without a human present for every execution. Model strategy: premium for writing (voice and nuance), mid-tier for review (evaluating against criteria), fast models for research (speed over depth).

Business Context. Brand voice document, audience personas, editorial calendar, competitive positioning. These aren't nice-to-haves — they're the foundation every content agent references.

Agent Design. Three agents:

  • Research agent — searches, evaluates source credibility, compiles data and quotes
  • Writer agent — brand voice document embedded in its identity file. It writes in the brand's voice because it carries the brand's voice
  • Review agent — configured to be skeptical. Scores independently: brand voice consistency (1–10), SEO optimization (1–10), factual accuracy (1–10), audience relevance (1–10). Not "this is good" — a scorecard.

Execution Phase

Content Strategist writes a brief. Research agent gathers sources. Writer produces a draft. Review agent evaluates:

  • Brand voice: 8/10 — one paragraph drifts formal
  • SEO: 6/10 — secondary keywords underrepresented
  • Factual accuracy: 9/10 — one statistic needs a more recent citation
  • Audience relevance: 8/10 — angle is practical as specified

Three issues flagged. Writer revises. Second review: all dimensions pass. Article surfaces for the Editorial Lead.
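
A scorecard like the one above is straightforward to represent as data. In this sketch, a dimension is flagged when the reviewer attached an issue or when its score falls below a threshold. The threshold of 7 and all names (`DimensionScore`, `flagged`) are assumptions for illustration; the scores and notes are the ones from the review above.

```python
from dataclasses import dataclass


@dataclass
class DimensionScore:
    name: str
    score: int          # 1-10, as scored by the Review agent
    issue: str = ""     # empty when the reviewer has nothing to fix


PASS_THRESHOLD = 7      # hypothetical cut-off, set in QA Operational


def flagged(scorecard: list[DimensionScore], threshold: int = PASS_THRESHOLD) -> list[DimensionScore]:
    """Dimensions the Writer must address before the next review."""
    return [d for d in scorecard if d.issue or d.score < threshold]


first_review = [
    DimensionScore("brand_voice", 8, "one paragraph drifts formal"),
    DimensionScore("seo", 6, "secondary keywords underrepresented"),
    DimensionScore("factual_accuracy", 9, "one statistic needs a more recent citation"),
    DimensionScore("audience_relevance", 8),
]

to_fix = flagged(first_review)
```

Keeping the issue text next to the score gives the Writer agent specific feedback to act on, and gives Reflection a record of which dimensions fail most often.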

The Editorial Lead reads the final draft. Not checking grammar or keyword density — agents handled that. Checking: Does this serve our audience? Would I put the team's name on this?

Five briefs go in Monday morning. By Monday afternoon, three articles pass the loop. By Tuesday, all five are verified and scheduled.

Reflection Phase

Blog posts hit 75% first-pass acceptance. Social content only 40%. LinkedIn posts average 3.2 iterations. Twitter/X posts average 4.1.

The diagnosis: social content briefs don't include platform-specific guidance. The Writer uses the same voice for LinkedIn as for Twitter — but they're fundamentally different platforms.

Calibration:

  • Content Strategist adds platform tone appendix to editorial briefs
  • AI Engineer creates platform-specific skills for the Writer agent
  • Copywriter / Editor adds platform-specific criteria to QA Operational

Next cycle: social acceptance improved. The agents didn't get smarter. The context they operate within got smarter.


Use Case: Data & Research Pipeline

This walkthrough changes the harness entirely.

Product and content both ran on flexible autonomous setups. A financial research team running a daily market analysis pipeline is a different animal. The shape is known. The steps are fixed. The auditability requirements are non-negotiable. That calls for a DAG-based harness like LangGraph.

Why a Different Harness

For product work, I wouldn't choose a DAG — features surface unknowns during implementation, and you want the system to route work back for another pass. For content, same — editorial iteration is the whole point. For a daily analysis pipeline, the fixed shape is exactly what you want. Every run does the same thing. Every run needs to be auditable end to end.

The Agents (DAG Nodes)

  • Market Scanner — fetches price data, news, sentiment from approved sources (fast model)
  • Fundamental Analyst — reads overnight earnings, balance sheets, filings (mid-tier model)
  • Technical Analyst — chart patterns, indicators, volatility analysis (mid-tier)
  • Correlator — synthesizes all three upstream outputs, flags cross-stream patterns (premium model — this node benefits most from reasoning depth)
  • Report Writer — formats the daily brief in house style
  • Compliance Checker — verifies no forbidden data sources were touched, generates audit trace

Orchestration

Scanner, Fundamental, and Technical run in parallel. All three converge on Correlator → Report Writer → Compliance Checker → human review. One directed graph, forward-only edges. No iteration between agents — if a node degrades, the downstream node compensates or flags it.
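
The graph described above is small enough to write down directly. The sketch below uses Python's standard-library `graphlib` (3.9+) rather than LangGraph, purely to show the shape: each node lists the upstream nodes it depends on, and the sorter yields stages that can run in parallel. Node names are illustrative labels for the agents listed earlier.

```python
from graphlib import TopologicalSorter

# node -> set of upstream nodes it depends on (forward-only edges)
PIPELINE: dict[str, set[str]] = {
    "market_scanner": set(),
    "fundamental_analyst": set(),
    "technical_analyst": set(),
    "correlator": {"market_scanner", "fundamental_analyst", "technical_analyst"},
    "report_writer": {"correlator"},
    "compliance_checker": {"report_writer"},
    "human_review": {"compliance_checker"},
}


def execution_stages(graph: dict[str, set[str]]) -> list[list[str]]:
    """Group nodes into stages; nodes within a stage have no mutual dependencies."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()                      # raises CycleError if an edge loops back
    stages: list[list[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)               # mark the stage finished, unlock the next
    return stages


stages = execution_stages(PIPELINE)
```

The first stage contains the three analysts that run in parallel, and every later stage has exactly one node. The `prepare()` call rejects any graph with a cycle, which enforces the "no iteration between agents" rule at configuration time.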

Reflection

Weekly review of daily runs. 70% of flagged signals led to actual market moves. 30% false positives. The Fundamental Analyst showed lower confidence on energy sector filings — the energy sector uses terminology not in the Business Context docs.

Calibration: Added energy sector vocabulary guide. Tightened QA thresholds for energy signals. Added per-sector confidence breakdown to dashboard.


One Organization, Multiple Fleets

APEX scales by instantiation, not by making one instance bigger. Each team — product, content, research — runs its own fleet with its own agents, cadence, and artifacts. Same framework, different configurations.

The same people can participate in multiple fleets wearing different hats. An AI Engineer might configure coding agents in the product fleet and writing agents in the content fleet. The underlying skill — understanding how agents consume context and where drift happens — transfers. The domain content is completely different.

💡 Cross-fleet learning: An AI Engineer who discovers in the product fleet that agents produce better output when identity files reference specific documents will bring the same principle to other fleets. The pattern transfers even though the content is different.

Each fleet runs on its own clock. Product cycles weekly. Content cycles daily. Research runs daily execution with weekly Reflection. Different work, different rhythms — forcing them into the same cadence is an anti-pattern.


Conclusion

I tried to map out a problem I kept running into: the gap between "one person using agents well" and "a team running agentic production reliably."

Your existing experts don't become obsolete in an agentic system. They become more valuable. A tech lead who spent ten years understanding architecture doesn't get replaced — they own Orchestration Design. A QA lead doesn't disappear — they design the quality criteria that agents enforce at scale. The job changes. The value concentrates.

Start with one fleet. One cycle. Measure, reflect, calibrate. The system will teach you what it needs next.


Learn More

The APEX Framework was first published as a series of articles exploring each dimension in depth:


— End of Guide —


Appendix A: The Nine Domains Reference

| # | Domain | Area | Owns | Key Artifacts |
|---|---|---|---|---|
| 1 | Infrastructure | Platform | Runtime, harness, model strategy | Harness Decision Record, Model Strategy |
| 2 | Operational Tooling | Platform | Dashboards, metrics pipelines | Tooling Registry, Dashboard Configs |
| 3 | Security & Compliance | Platform | Permissions, audit trails | Permission Map, Compliance Registry |
| 4 | Business Context | Spec | Brand, personas, competitive landscape | Brand Voice Doc, Persona Library |
| 5 | Spec Engineering | Spec | Requirements, acceptance criteria, briefs | PRD, User Stories, Editorial Briefs |
| 6 | QA Strategic | Spec | Quality definitions, evaluation criteria | Review Criteria Docs, Measurement Plan |
| 7 | Agent Design | Config | Agent identities, skills, memory | Agent Roster, Identity Files |
| 8 | Orchestration Design | Config | Routing rules, workflow maps | Workflow Maps, Routing Rules |
| 9 | QA Operational | Config | Agent-level review criteria, quality gates | Agent Review Criteria, Quality Gate Specs |

Appendix B: Use Case Matrix

| Dimension | Software Dev | Content Production | Data & Research |
|---|---|---|---|
| Harness | Hierarchical / Autonomous | Autonomous | DAG-based |
| Cycle Cadence | Weekly | Daily | Daily exec, weekly Reflection |
| Heaviest Area | Spec | Config | Platform |
| Primary QA Signal | Test coverage, arch conformance | Brand voice, factual accuracy | Signal hit rate, compliance |
| First-Pass Benchmark | 65–75% | 70–80% (blog), 40–60% (social) | Context-dependent |
| Biggest Anti-Pattern | Promptless Agent | Set-and-Forget System | Unlimited Agent |

About the Author

Herbert Cuba Garcia is a Tech Director working at the intersection of AI systems and organizational design. He writes about what it actually looks like to run agentic teams in production — the organizational structures, the failure modes, and the calibration discipline that separates systems that improve from systems that degrade.

Read more at cubagarcia.com


APEX Framework — Reference Version 1.0 | First published April 2026 | Last updated June 2026
