Herbert Cuba Garcia

Posted on Jun 1 • Edited on Jun 22 • Originally published at cubagarcia.com

APEX: Agentic Production Execution

#ai #programming #productivity #devops

Pronounced "AY-pecks" — a production-grade operating model for teams where humans design and verify while agents execute and iterate.

☝️ This is a reference document, not a quick read. Bookmark it. Return to it when you're configuring a new team, troubleshooting a system that isn't improving, or explaining to a stakeholder why agentic production requires more than choosing a model.

What Is APEX and Why It Matters
Who This Guide Is For
Built on Proven Foundations
Core Principles
The APEX Cycle
Phase 1: Strategic
Phase 2: Execution
Phase 3: Reflection
The /goal Inner Loop
Measuring Success: Six Metrics
Getting Started
Use Case: Code Production
Use Case: Content Production
Use Case: Data & Research Pipeline
One Organization, Multiple Fleets
Conclusion
Appendix A: The Nine Domains Reference
Appendix B: Use Case Matrix
About the Author

What Is APEX and Why It Matters

APEX stands for Agentic Production Execution. It is a framework for organizing how humans and agents work together to produce reliable output at scale — not the output of a single session, but the sustained output of a team running agentic systems across consecutive work cycles.

What APEX IS

A three-phase operating cycle: Strategic → Execution → Reflection
An organizational scaffold with three areas and nine named domains, each with clear ownership
A measurement framework with six metrics that make calibration data-driven
A methodology wrapper that sits around whatever tools, harnesses, or methodologies you're already using
A team operating model — not an individual practitioner's workflow

What APEX IS NOT

A prompt engineering guide
A replacement for your existing delivery methodology (Scrum, Kanban, SAFe — APEX wraps around all of them)
A prescription for specific tools or models
A guarantee of output quality — it's a structure that makes quality improvable
An individual playbook that scales by adding more individuals

💡 The gap between "one person using agents well" and "a team of five running agentic production" is where most organizations stall. APEX is designed to close that gap.

The most important conviction behind APEX: outsource execution, keep strategy human. The moment you let agents decide what to build rather than how to build it, you've lost the thread. Every design decision in this framework follows from that.

Who This Guide Is For

Tech Directors and Engineering Leads who are responsible for how their teams adopt agentic systems and need an organizational model, not just tooling advice.

Product Managers who need to understand why spec quality is the primary variable in agentic output quality — and what that means for their role.

QA Leads and Editors who own quality definitions and need a framework that elevates that ownership rather than automates it away.

AI Engineers who configure agent systems and need a vocabulary for communicating with the rest of the organization about what they're building.

Anyone running agents in production who has experienced the gap between a promising demo and reliable team-wide output, and wants a structured approach to closing it.

Built on Proven Foundations

APEX didn't emerge from theory. It emerged from practice — and it builds on research that confirms the patterns practitioners keep discovering independently.

Anthropic's harness design research established the foundational insight that this entire framework depends on: separating the agent that generates work from the agent that evaluates it produces dramatically better output than self-assessment. APEX formalizes this structural separation through its two-tier QA system: QA Strategic (humans define quality) and QA Operational (agents enforce it).

The convergence of Judge-Evaluated Continuation — the pattern that OpenAI's Codex CLI /goal, Anthropic's Claude Code /goal, and other agentic platforms independently shipped within weeks of each other — confirms that the Execution loop APEX describes is discoverable. What isn't standardized is the organizational scaffolding around it, which is exactly what APEX provides.

DORA metrics and the DevOps research tradition inform the APEX Metrics design. The principle is the same: a small set of named, measurable indicators that cover the full cycle and make data-driven improvement possible.

Core Principles

These ten principles govern how APEX operates.

① 🏗️ Harness First
Your runtime choice sets all constraints. Decide the harness before you configure anything else.

② 👤 Human in Control of Outcome
Humans own the outcome — not every step, but the result. They design the system, verify the output, and decide what to change.

③ 📥 Quality In = Quality Out
The output of any agentic system is a direct function of what goes in — specs, context, configuration, criteria.

④ 🤖 Agents Review Agents First
All work passes through agent-to-agent review before a human sees it. This is how you get speed without sacrificing quality.

⑤ 🎯 Domain-Mapped Ownership
Quality gates map to expertise, not generic reviews. Nine domains across three areas, each with a clear owner.

⑥ 🔄 Iterate Often, Iterate Fast
Agent-to-agent iteration loops compress what used to take days into hours.

⑦ 🔒 Least Privilege
Every agent gets only the access it needs. No more.

⑧ 📈 Calibrate the System, Not Just the Output
Reflections improve the system itself, not just the current deliverable.

⑨ 📊 Data-Driven Reflections
Agents report metrics. Humans decide based on data, not gut feelings.

⑩ 🔭 Think Big, Scale Back
Design the whole system first. Remove what's premature. Keep the architecture.

The APEX Cycle

Everything in APEX revolves around three phases that repeat continuously:

Strategic (2–3× velocity)
    ↓  Human-First: design, specify, configure
Execution (10–20× velocity)
    ↓  Agent-First: execute, review, iterate
Reflection (1–2× velocity)
    ↓  Data-First: evaluate, reflect, calibrate
    ↑
    └─── feeds back into next Strategic phase

☝️ The phase most teams cut under delivery pressure: Reflection. They ship, move on, start the next sprint. The result: the same problems repeat, iteration depth stays flat, and first-pass acceptance doesn't improve.

Phase 1: Strategic

Strategic is where humans do all the thinking that agents will later act on. It organizes that work into three areas containing nine domains, and all nine live here.

Area 1: Platform

📡 Domain 1 — Infrastructure

Owns the runtime, compute, and harness selection. The most consequential decision in the entire framework is harness selection. Document it in a Harness Decision Record.

Harness Type	Examples	Best For
General Purpose	Claude Code, Codex CLI, Cursor	Exploratory work, human available
Specialized	Devin, Harvey	Regulated domains, auditability
Autonomous	OpenClaw, CrewAI	Scheduled, unattended workflows
Hierarchical	AutoGen, MetaGPT	Large projects with sub-task decomposition
DAG-Based	LangGraph, Prefect, Dagster	Fixed-shape pipelines
Hybrid	Combination	Where most mature setups end up

🖥️ Domain 2 — Operational Tooling

Dashboards, metrics pipelines, agent activity monitors. Without this, Reflection degrades into guesswork.

🔐 Domain 3 — Security and Compliance

Permissions, data flows, regulatory constraints.
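A minimal sketch of how the Least Privilege principle could show up in this domain: a deny-by-default allow-list consulted before any tool call. The agent names, tool names, and the `is_allowed` helper are all hypothetical illustrations, not part of APEX or of any particular harness.

```python
# Hypothetical permission map: each agent lists only the tools it needs.
PERMISSION_MAP: dict[str, frozenset[str]] = {
    "research_agent": frozenset({"web_search", "read_workspace"}),
    "writer_agent": frozenset({"read_workspace", "write_draft"}),
    "review_agent": frozenset({"read_workspace", "read_draft"}),
}


def is_allowed(agent: str, tool: str) -> bool:
    """Deny by default: unknown agents and unlisted tools are rejected."""
    return tool in PERMISSION_MAP.get(agent, frozenset())


# The reviewer can read drafts but cannot modify them.
can_read = is_allowed("review_agent", "read_draft")
can_write = is_allowed("review_agent", "write_draft")
```

Keeping the map as data rather than scattered conditionals means the Permission Map artifact and the enforced rules can be the same file.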

Area 2: Spec

📚 Domain 4 — Business Context

The "why" and the "for whom." Brand understanding, personas, competitive landscape.

📝 Domain 5 — Spec Engineering

The translation layer between strategic thinking and executable work. PRDs, user stories with acceptance criteria, editorial briefs.

💡 Vague specs produce vague output. Not because of the model — because of the spec.

✅ Domain 6 — QA Strategic

Where humans define quality. What "done" means. How output is evaluated. What data feeds into Reflections.

Area 3: Config

🤖 Domain 7 — Agent Design

Agent identities, behavior configuration, skills, instructions, memory. The richer the identity file, the less the agent needs to infer.

🔀 Domain 8 — Orchestration Design

Routing rules, delegation chains, handoff protocols, workflow maps.

🔎 Domain 9 — QA Operational

Agent-to-agent review criteria, quality gates within iteration cycles. The generator never grades its own homework.

Phase 2: Execution

The inner loop where agents do the actual work.

Spec → Execute → Review → Iterate → Verify

Spec — a human writes a spec
Execute — an agent executes against that spec
Review — a separate review agent evaluates output against QA Operational criteria
Iterate — if issues are found, the executing agent revises
Verify — when the agent loop passes, a human verifies against QA Strategic criteria
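The five steps above can be sketched as a small loop. This is an illustration under stated assumptions, not a harness implementation: `execute` and `review` stand in for calls to two separate agents, and the `max_iterations` cap is an assumed safeguard that the framework text does not prescribe.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Review:
    passed: bool
    feedback: str = ""


def run_execution_loop(
    spec: str,
    execute: Callable[[str, str], str],    # (spec, feedback) -> output
    review: Callable[[str, str], Review],  # (spec, output) -> Review
    max_iterations: int = 5,               # assumed cap, not from APEX
) -> tuple[str, int, bool]:
    """Return (output, iterations_used, ready_for_human_verification)."""
    feedback = ""
    output = ""
    for iteration in range(1, max_iterations + 1):
        output = execute(spec, feedback)
        result = review(spec, output)  # a separate reviewer, never the generator
        if result.passed:
            return output, iteration, True
        feedback = result.feedback
    # Cap reached without a pass: surface to a human instead of looping on.
    return output, max_iterations, False


# Stub agents for illustration only.
def stub_execute(spec: str, feedback: str) -> str:
    return "draft v2" if feedback else "draft v1"


def stub_review(spec: str, output: str) -> Review:
    return Review(passed=(output == "draft v2"), feedback="error state mismatch")


output, iterations, ready = run_execution_loop("feature spec", stub_execute, stub_review)
```

The returned iteration count is the raw input for the Iteration Depth metric, and the final flag marks the handoff to human verification.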

🤖 If you find yourself constantly stepping into agent loops to clarify or redirect, the problem is not the agents. The problem is in your Strategic configuration.

Phase 3: Reflection

The rhythm that keeps the system alive.

Step 1 — Evaluate

Review and polish actual output against original intent.

Step 2 — Reflect

Agents report metrics. Humans identify patterns.

Step 3 — Calibrate

Implement changes across whichever area needs them. These changes flow into the next Strategic phase.

This is what separates APEX from a static pipeline. A pipeline runs the same way forever. APEX evolves.

The /goal Inner Loop

Between April 16 and May 11, 2026, three independent platforms shipped the same agentic execution pattern. OpenAI's Codex CLI launched /goal. Anthropic's Claude Code followed. Three independent teams. One pattern. Weeks apart.

The mechanics: you give the agent a goal. An internal Judge assesses each turn against the goal. If not met and budget remains, the loop continues. Call this Judge-Evaluated Continuation.
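The mechanics can be sketched in a few lines. This is a generic illustration of the pattern as described here, not how Codex CLI or Claude Code implement /goal; the function names and the budget unit are assumptions.

```python
from typing import Callable


def judge_evaluated_continuation(
    goal: str,
    take_turn: Callable[[str, list[str]], str],  # (goal, history) -> turn output
    judge: Callable[[str, list[str]], bool],     # (goal, history) -> goal met?
    turn_cost: Callable[[str], int],             # cost of a turn in budget units
    budget: int,
) -> tuple[list[str], str]:
    """Run turns until the judge says the goal is met or the budget runs out."""
    history: list[str] = []
    remaining = budget
    while remaining > 0:
        turn = take_turn(goal, history)
        history.append(turn)
        remaining -= turn_cost(turn)
        if judge(goal, history):
            return history, "goal_met"
    return history, "budget_exhausted"


# Stubs for illustration: the goal counts as met after three turns.
def stub_turn(goal: str, history: list[str]) -> str:
    return f"step {len(history) + 1}"


def stub_judge(goal: str, history: list[str]) -> bool:
    return len(history) >= 3


met_history, met_status = judge_evaluated_continuation("ship it", stub_turn, stub_judge, lambda t: 1, budget=5)
short_history, short_status = judge_evaluated_continuation("ship it", stub_turn, stub_judge, lambda t: 1, budget=2)
```

Note that the judge only sees the goal and the history. Whatever criteria it applies have to come from somewhere outside the loop.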

How /goal Fits Inside APEX

/goal covers the Execution Phase of APEX. But there are two things it doesn't address:

No Strategic Phase. /goal takes a goal as input. It doesn't help you figure out what goal to set or define what "good output" means.

No Reflection Phase. Without Reflection, every run starts from the same baseline. The agent configurations don't improve. The specifications don't sharpen.

💡 The Judge's evaluation is only as good as the criteria it's evaluating against. When an agent produces output that technically satisfies the goal but misses actual intent, the gap is almost always in the Strategic setup.

In APEX terms: /goal is the Execution Phase. APEX is the operating model around it.

Measuring Success: Six Metrics

📊 Metric 1 — First-Pass Acceptance Rate

Percentage of deliverables accepted at human verification without being sent back. Signals: spec quality.

📊 Metric 2 — Iteration Depth

Average agent-to-agent iterations per task before human verification. Signals: spec quality and agent capability. The trend matters more than the number.

📊 Metric 3 — Human Touch Rate

Percentage of tasks requiring human intervention during Execution outside designed verification points. Should decrease over time.

📊 Metric 4 — Calibration Impact

Change in the other metrics from one cycle to the next. The meta-metric. If it's flat, the ceremony is happening but the learning isn't.
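One way to make Calibration Impact concrete is a cycle-over-cycle delta per metric. The sketch below assumes each cycle's metrics are available as a plain dictionary; the metric keys, the "current" values, and the flatness tolerance are made up for illustration.

```python
def calibration_impact(previous: dict[str, float], current: dict[str, float]) -> dict[str, float]:
    """Cycle-over-cycle change for every metric present in both cycles."""
    return {
        name: round(current[name] - previous[name], 4)
        for name in previous
        if name in current
    }


def is_flat(impact: dict[str, float], tolerance: float = 0.01) -> bool:
    """True when no metric moved by more than the (assumed) tolerance."""
    return all(abs(delta) <= tolerance for delta in impact.values())


# Illustrative numbers only.
previous_cycle = {"first_pass_acceptance": 0.62, "iteration_depth": 3.0, "human_touch_rate": 0.08}
current_cycle = {"first_pass_acceptance": 0.70, "iteration_depth": 2.4, "human_touch_rate": 0.05}

impact = calibration_impact(previous_cycle, current_cycle)
```

A flat result across several cycles is the signal described above: Reflection is being held, but nothing is changing as a result.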

📊 Metric 5 — Cycle Time

Elapsed time from spec entering the loop to verified delivery. Signals: end-to-end system maturity.

📊 Metric 6 — Cost Per Task

Total compute and API cost per verified deliverable. Signals: efficiency and ROI. Track per deliverable type — a complex feature and a social media post have fundamentally different cost profiles. The goal isn't minimizing cost; it's understanding what you're paying per unit of verified output so you can make informed decisions about model selection, iteration budgets, and harness configuration. Declining cost per task at stable quality is the clearest sign your system is becoming efficient, not just fast.
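The per-cycle metrics can be computed from a simple log of verified tasks. The sketch below is one possible shape, assuming a hypothetical `TaskRecord` with one row per verified deliverable; the field names and sample values are illustrative, and Calibration Impact is left out because it compares two cycles rather than summarizing one.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskRecord:
    accepted_first_pass: bool     # accepted at human verification without rework
    iterations: int               # agent-to-agent iterations before verification
    unplanned_human_touches: int  # interventions outside designed verification points
    cycle_hours: float            # spec entering the loop -> verified delivery
    cost_usd: float               # compute and API cost for this deliverable


def cycle_metrics(tasks: list[TaskRecord]) -> dict[str, float]:
    if not tasks:
        raise ValueError("need at least one verified task")
    n = len(tasks)
    return {
        "first_pass_acceptance": sum(t.accepted_first_pass for t in tasks) / n,
        "iteration_depth": mean(t.iterations for t in tasks),
        "human_touch_rate": sum(t.unplanned_human_touches > 0 for t in tasks) / n,
        "cycle_time_hours": mean(t.cycle_hours for t in tasks),
        "cost_per_task_usd": sum(t.cost_usd for t in tasks) / n,
    }


# Made-up sample data for one cycle.
sample = [
    TaskRecord(True, 2, 0, 6.0, 1.5),
    TaskRecord(False, 4, 1, 10.0, 3.0),
    TaskRecord(True, 1, 0, 4.0, 1.0),
    TaskRecord(True, 3, 0, 8.0, 2.5),
]
metrics = cycle_metrics(sample)
```

Since cost profiles differ by deliverable type, a real version would likely group records by type before averaging.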

Getting Started

Week 1: Map the nine domains to people. Match domain to expertise.

Week 2: Set up Platform. Harness Decision Record. Basic dashboard.

Week 3: Build the Spec Area. Business Context, Spec Engineering, QA Strategic.

Week 4: Configure agents. Agent Design, Orchestration Design, QA Operational. Run a cycle.

After the cycle: Run Reflection. Measure. Calibrate. Run another cycle.

In my experience, the second cycle is meaningfully better than the first. The fifth is dramatically better than the second.

Use Case: Code Production

This is the use case I've spent the most time with — running agentic software development in production.

The People

Seven experts, each owning specific domains:

CTO / Tech Director — owns Infrastructure. The harness decision, the model strategy, the compute architecture.
Tech Lead — owns Orchestration Design and co-owns Agent Design. They know the codebase, they design how work flows between agents.
Product Manager — owns Business Context and Spec Engineering. Their specs are what agents execute against.
AI Engineer — owns Agent Design and co-owns Infrastructure. Configures agent identities, skills, memory, and tool access.
Developer — owns Operational Tooling. Dashboards, metrics pipelines, context-generation tools.
QA Lead — owns QA Strategic and QA Operational. Defines what "done" means and translates those definitions into automated agent-level checks.
Security Engineer — owns Security & Compliance. Permissions, audit trails, access boundaries.

Every person maps to specific domains. Nobody owns everything. Nobody owns nothing.

Strategic Phase

Infrastructure. The CTO chooses an autonomous orchestration harness — specialized agents spawned for their respective domains. Frontend work routes to a frontend agent, integrations to an integration agent, QA to a QA agent. Model strategy is tiered: premium model for code review (where nuance matters most), mid-tier for code generation (balancing quality and speed), fast models for linting and formatting validation.
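The tiered model strategy can be written down as a lookup table. This is a sketch only: the task-type keys and tier labels are hypothetical placeholders, and no specific model names are implied.

```python
# Hypothetical mapping from task type to model tier, mirroring the strategy above.
MODEL_TIER_BY_TASK: dict[str, str] = {
    "code_review": "premium",
    "code_generation": "mid_tier",
    "linting": "fast",
    "formatting_validation": "fast",
}


def tier_for(task_type: str) -> str:
    """Fail loudly on unconfigured task types instead of guessing a tier."""
    try:
        return MODEL_TIER_BY_TASK[task_type]
    except KeyError:
        raise ValueError(f"no model tier configured for task type {task_type!r}") from None


review_tier = tier_for("code_review")
```

Raising on unknown task types keeps the model strategy an explicit decision rather than a silent default.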

Business Context. The PM populates the workspace with: product vision doc, competitive analysis, user personas — all markdown, all in the fleet workspace where agents reference them.

Spec Engineering. Feature PRD with specific, testable acceptance criteria — not "the feature works well" but "API response time under 200ms, error states render the correct component, validation messages match the copy doc." Vague specs produce vague output.
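"Testable" can be taken literally: each acceptance criterion becomes a named check over observed behavior. The sketch below encodes the three example criteria from the paragraph above. The `Criterion` type, the observation keys, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Criterion:
    name: str
    check: Callable[[dict], bool]


ACCEPTANCE_CRITERIA = [
    Criterion("API response time under 200ms",
              lambda o: o["response_ms"] < 200),
    Criterion("error state renders the correct component",
              lambda o: o["error_component"] == o["expected_error_component"]),
    Criterion("validation messages match the copy doc",
              lambda o: o["validation_messages"] == o["copy_doc_messages"]),
]


def failed_criteria(observation: dict) -> list[str]:
    """Names of criteria the observed behavior does not satisfy."""
    return [c.name for c in ACCEPTANCE_CRITERIA if not c.check(observation)]


# Illustrative observation: everything matches except latency.
observation = {
    "response_ms": 240,
    "error_component": "InlineError",
    "expected_error_component": "InlineError",
    "validation_messages": ["Email is required"],
    "copy_doc_messages": ["Email is required"],
}
failures = failed_criteria(observation)
```

A criterion like "the feature works well" cannot be written in this form, which is a quick test for whether a spec is specific enough.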

Agent Design. Four agents:

Architect agent — reads the PRD, decomposes features into implementable tasks, reviews output with architectural judgment. Identity file contains architecture decision records and technical standards.
Frontend Developer agent — identity includes UI component standards, design system references, accessibility requirements. Skills for running dev server and component tests.
Integrator agent — focused on backend connections, API integrations, data flows. Identity includes API contracts, auth patterns, and infrastructure docs.
QA Engineer agent — writes regression and e2e tests for completed work. Reads acceptance criteria, reviews implementation, produces test suites. Runs after human verification — codifying correctness into automated tests.

Orchestration. Architect decomposes the PRD → assigns tasks to Frontend Developer and Integrator (parallel where possible) → reviews output against architectural standards → passes or sends back with feedback → human verifies → QA Engineer builds tests. Routing is explicit — no agent decides on its own where work goes next.
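Explicit routing means the next step is looked up, not decided by an agent. A minimal sketch of the flow above as a transition table follows; the stage and outcome names are hypothetical labels, and only the transitions described in the text are included.

```python
# (current stage, outcome) -> next stage. Anything not listed is an error.
ROUTES: dict[tuple[str, str], str] = {
    ("architect_decompose", "done"): "implement",
    ("implement", "done"): "architect_review",
    ("architect_review", "pass"): "human_verify",
    ("architect_review", "fail"): "implement",  # sent back with feedback
    ("human_verify", "approved"): "qa_tests",
    ("qa_tests", "done"): "complete",
}


def next_stage(stage: str, outcome: str) -> str:
    """Look up the next stage; unknown transitions raise instead of improvising."""
    try:
        return ROUTES[(stage, outcome)]
    except KeyError:
        raise ValueError(f"no route from {stage!r} on outcome {outcome!r}") from None


# Walk one task through a failed first review and a passing second one.
stage = "architect_decompose"
path = [stage]
for outcome in ["done", "done", "fail", "done", "pass", "approved", "done"]:
    stage = next_stage(stage, outcome)
    path.append(stage)
```

What happens when a human rejects the work is not specified above, so this table deliberately has no route for it.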

Execution Phase

The PM writes the feature spec. The Architect decomposes it into five tasks — three frontend, two integration. Frontend Developer and Integrator work in parallel on independent tasks.

Frontend Developer completes task one. Architect reviews: follows design system? Components structured correctly? Accessibility handled? First pass: error state component doesn't match the design system. Frontend Developer fixes it. Second pass: approved. Surfaces for human verification.

The Developer verifies — not checking CSS (the Architect handled that), but checking intent. Does this capture what the PM specified? Is this the right abstraction?

Once approved, the QA Engineer writes regression tests that codify the verified correctness. Meanwhile, the Integrator finishes the API task. Same review loop. Same verification. Same test generation.

The velocity gain is real. Developers review pre-validated output instead of raw pull requests. QA tests are auto-generated from verified implementations. A week's cycle compresses into two days of agent execution plus verification touchpoints.

Reflection Phase

Five features implemented, reviewed, verified, merged. The dashboard shows:

Feature A: 4 iterations average. Feature B: only 2.
First-pass acceptance: 62%.
Human touch rate: 8% — one Developer stepped in to clarify an architectural decision mid-loop.

The Integrator agent's logs reveal a pattern: API integration tasks consistently required more review passes. The Architect's reports show the Integrator kept misinterpreting the auth flow because Business Context didn't include auth architecture patterns.

Calibration actions:

PM adds auth architecture patterns document to Business Context
AI Engineer updates Integrator's identity file to reference the new auth doc
AI Engineer updates QA Engineer's test-generation context with auth flow patterns
QA Lead adds auth-pattern-specific check to QA Operational

Next cycle, the Integrator won't struggle with auth integration because the context exists. Iteration depth on API tasks should drop. First-pass acceptance should improve. That's the hypothesis — the metrics will confirm or deny it.

The agents aren't magically smarter. The context they operate within is smarter.

Use Case: Content Production

Same framework, different world.

The People

Editorial Lead — owns Business Context and QA Strategic
Content Strategist — owns Spec Engineering
Copywriter / Editor — contributes to QA Operational criteria
AI Engineer — owns Infrastructure and Agent Design
Developer — owns Operational Tooling
Brand Manager — contributes to Business Context and Security & Compliance

Strategic Phase

Infrastructure. Autonomous harness — agents run on schedule, producing content batches without a human present for every execution. Model strategy: premium for writing (voice and nuance), mid-tier for review (evaluating against criteria), fast models for research (speed over depth).

Business Context. Brand voice document, audience personas, editorial calendar, competitive positioning. These aren't nice-to-haves — they're the foundation every content agent references.

Agent Design. Three agents:

Research agent — searches, evaluates source credibility, compiles data and quotes
Writer agent — brand voice document embedded in its identity file. It writes in the brand's voice because it carries the brand's voice
Review agent — configured to be skeptical. Scores independently: brand voice consistency (1–10), SEO optimization (1–10), factual accuracy (1–10), audience relevance (1–10). Not "this is good" — a scorecard.

Execution Phase

Content Strategist writes a brief. Research agent gathers sources. Writer produces a draft. Review agent evaluates:

Brand voice: 8/10 — one paragraph drifts formal
SEO: 6/10 — secondary keywords underrepresented
Factual accuracy: 9/10 — one statistic needs a more recent citation
Audience relevance: 8/10 — angle is practical as specified

Three issues flagged. Writer revises. Second review: all dimensions pass. Article surfaces for the Editorial Lead.
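The review above can be represented as structured data rather than free text. In this sketch the first-review scores and notes come from the example above, while the `DimensionScore` type, the pass threshold of 8, and the second-review scores are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DimensionScore:
    dimension: str
    score: int                   # 1-10
    issue: Optional[str] = None  # None when nothing needs revision


def flagged_issues(scorecard: list[DimensionScore]) -> list[str]:
    return [f"{d.dimension}: {d.issue}" for d in scorecard if d.issue]


def passes(scorecard: list[DimensionScore], minimum: int = 8) -> bool:
    """Assumed rule: every score meets the minimum and no issue is open."""
    return all(d.score >= minimum for d in scorecard) and not flagged_issues(scorecard)


first_review = [
    DimensionScore("Brand voice", 8, "one paragraph drifts formal"),
    DimensionScore("SEO", 6, "secondary keywords underrepresented"),
    DimensionScore("Factual accuracy", 9, "one statistic needs a more recent citation"),
    DimensionScore("Audience relevance", 8),
]
# Hypothetical scores after the Writer's revision.
second_review = [
    DimensionScore("Brand voice", 9),
    DimensionScore("SEO", 8),
    DimensionScore("Factual accuracy", 9),
    DimensionScore("Audience relevance", 8),
]

issues = flagged_issues(first_review)
```

The list of flagged issues is what goes back to the Writer, so the revision targets specific findings instead of a general "make it better".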

The Editorial Lead reads the final draft. Not checking grammar or keyword density — agents handled that. Checking: Does this serve our audience? Would I put the team's name on this?

Five briefs go in Monday morning. By Monday afternoon, three articles pass the loop. By Tuesday, all five are verified and scheduled.

Reflection Phase

Blog posts hit 75% first-pass acceptance. Social content only 40%. LinkedIn posts average 3.2 iterations. Twitter/X posts average 4.1.

The diagnosis: social content briefs don't include platform-specific guidance. The Writer uses the same voice for LinkedIn as for Twitter — but they're fundamentally different platforms.

Calibration:

Content Strategist adds platform tone appendix to editorial briefs
AI Engineer creates platform-specific skills for the Writer agent
QA Lead adds platform-specific criteria to QA Operational

Next cycle: social acceptance improved. The agents didn't get smarter. The context they operate within got smarter.

Use Case: Data & Research Pipeline

This walkthrough changes the harness entirely.

Product and content both ran on flexible autonomous setups. A financial research team running a daily market analysis pipeline is a different animal. The shape is known. The steps are fixed. The auditability requirements are non-negotiable. That calls for a DAG-based harness like LangGraph.

Why a Different Harness

For product work, I wouldn't choose a DAG — features surface unknowns during implementation, and you want the system to route work back for another pass. For content, same — editorial iteration is the whole point. For a daily analysis pipeline, the fixed shape is exactly what you want. Every run does the same thing. Every run needs to be auditable end to end.

The Agents (DAG Nodes)

Market Scanner — fetches price data, news, sentiment from approved sources (fast model)
Fundamental Analyst — reads overnight earnings, balance sheets, filings (mid-tier model)
Technical Analyst — chart patterns, indicators, volatility analysis (mid-tier)
Correlator — synthesizes all three upstream outputs, flags cross-stream patterns (premium model — this node benefits most from reasoning depth)
Report Writer — formats the daily brief in house style
Compliance Checker — verifies no forbidden data sources were touched, generates audit trace

Orchestration

Scanner, Fundamental, and Technical run in parallel. All three converge on Correlator → Report Writer → Compliance Checker → human review. One directed graph, forward-only edges. No iteration between agents — if a node degrades, the downstream node compensates or flags it.
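The graph described above can be expressed with Python's standard-library `graphlib`, used here only to illustrate the shape. This is not LangGraph code, and the node identifiers are hypothetical names for the agents listed earlier.

```python
from graphlib import TopologicalSorter

# Each node maps to the set of upstream nodes it depends on.
PIPELINE: dict[str, set[str]] = {
    "market_scanner": set(),
    "fundamental_analyst": set(),
    "technical_analyst": set(),
    "correlator": {"market_scanner", "fundamental_analyst", "technical_analyst"},
    "report_writer": {"correlator"},
    "compliance_checker": {"report_writer"},
    "human_review": {"compliance_checker"},
}


def execution_stages(graph: dict[str, set[str]]) -> list[list[str]]:
    """Group nodes into stages; nodes in the same stage can run in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if any edge loops back
    stages: list[list[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    return stages


stages = execution_stages(PIPELINE)
# stages[0] holds the three parallel analysts; every later stage has one node.
```

Because `prepare()` rejects cycles, the forward-only property is checked up front rather than assumed.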

Reflection

Weekly review of daily runs. 70% of flagged signals led to actual market moves. 30% false positives. The Fundamental Analyst showed lower confidence on energy sector filings — the energy sector uses terminology not in the Business Context docs.

Calibration: Added energy sector vocabulary guide. Tightened QA thresholds for energy signals. Added per-sector confidence breakdown to dashboard.
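A per-sector breakdown is a small aggregation over the week's flagged signals. The sketch below computes hit rate by sector as one possible dashboard view, assuming each signal is logged with its sector and whether it led to an actual market move. The function name and the sample data are made up.

```python
from collections import defaultdict


def hit_rate_by_sector(signals: list[tuple[str, bool]]) -> dict[str, float]:
    """Share of flagged signals per sector that led to an actual market move."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for sector, led_to_move in signals:
        totals[sector] += 1
        hits[sector] += int(led_to_move)
    return {sector: hits[sector] / totals[sector] for sector in totals}


# Illustrative week of signals, not real data.
week = [
    ("energy", True), ("energy", False),
    ("tech", True), ("tech", True),
]
rates = hit_rate_by_sector(week)
```

An aggregate hit rate can hide a weak sector, which is why the breakdown belongs on the dashboard alongside the overall number.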

One Organization, Multiple Fleets

APEX scales by instantiation, not by making one instance bigger. Each team — product, content, research — runs its own fleet with its own agents, cadence, and artifacts. Same framework, different configurations.

The same people can participate in multiple fleets wearing different hats. An AI Engineer might configure coding agents in the product fleet and writing agents in the content fleet. The underlying skill — understanding how agents consume context and where drift happens — transfers. The domain content is completely different.

💡 Cross-fleet learning: An AI Engineer who discovers that agents produce better output when identity files reference specific documents in the product fleet will bring the same principle to other fleets. The pattern transfers even though the content is different.

Each fleet runs on its own clock. Product cycles weekly. Content cycles daily. Research runs daily execution with weekly Reflection. Different work, different rhythms — forcing them into the same cadence is an anti-pattern.

Conclusion

I tried to map out a problem I kept running into: the gap between "one person using agents well" and "a team running agentic production reliably."

Your existing experts don't become obsolete in an agentic system. They become more valuable. A tech lead who spent ten years understanding architecture doesn't get replaced — they own Orchestration Design. A QA lead doesn't disappear — they design the quality criteria that agents enforce at scale. The job changes. The value concentrates.

Start with one fleet. One cycle. Measure, reflect, calibrate. The system will teach you what it needs next.

Learn More

The APEX Framework was first published as a series of articles exploring each dimension in depth:

The APEX Framework: Agentic Production Execution — the full framework specification with all nine domains, ten principles, and anti-patterns
APEX: Three Use Cases, One Framework — deep dives into software development, content production, and client delivery
/goal Is the Inner Loop. APEX Is the Operating Model. — how Judge-Evaluated Continuation fits inside the APEX cycle

— End of Guide —

Appendix A: The Nine Domains Reference

#	Domain	Area	Owns	Key Artifacts
1	Infrastructure	Platform	Runtime, harness, model strategy	Harness Decision Record, Model Strategy
2	Operational Tooling	Platform	Dashboards, metrics pipelines	Tooling Registry, Dashboard Configs
3	Security & Compliance	Platform	Permissions, audit trails	Permission Map, Compliance Registry
4	Business Context	Spec	Brand, personas, competitive landscape	Brand Voice Doc, Persona Library
5	Spec Engineering	Spec	Requirements, acceptance criteria, briefs	PRD, User Stories, Editorial Briefs
6	QA Strategic	Spec	Quality definitions, evaluation criteria	Review Criteria Docs, Measurement Plan
7	Agent Design	Config	Agent identities, skills, memory	Agent Roster, Identity Files
8	Orchestration Design	Config	Routing rules, workflow maps	Workflow Maps, Routing Rules
9	QA Operational	Config	Agent-level review criteria, quality gates	Agent Review Criteria, Quality Gate Specs

Appendix B: Use Case Matrix

Dimension	Software Dev	Content Production	Client Delivery
Harness	Hierarchical / Autonomous	Autonomous	DAG-based
Cycle Cadence	Weekly	Daily	Daily exec, weekly Reflection
Heaviest Area	Spec	Config	Platform
Primary QA Signal	Test coverage, arch conformance	Brand voice, factual accuracy	Signal hit rate, compliance
First-Pass Benchmark	65–75%	70–80% (blog), 40–60% (social)	Context-dependent
Biggest Anti-Pattern	Promptless Agent	Set-and-Forget System	Unlimited Agent

About the Author

Herbert Cuba Garcia is a Tech Director working at the intersection of AI systems and organizational design. He writes about what it actually looks like to run agentic teams in production — the organizational structures, the failure modes, and the calibration discipline that separates systems that improve from systems that degrade.
