DEV Community

Tilak Raj

Building AI Agents That Actually Work in Production: My Technical Approach

Building an AI agent that works in a demo is easy. Building one that works reliably in production is a completely different engineering challenge.

Production systems must handle real users, real data, and real consequences when things fail.

This is the production agent architecture I use across Brainfy AI and Navlyt, along with real code patterns and failure modes I design around.


What Makes Production Agents Different From Demo Agents

Demo agents optimize for the happy path.

Production agents must handle:

  • Real data variance
    Production inputs are messy, ambiguous, and full of edge cases.

  • Concurrent executions
    Multiple agent instances running simultaneously with shared state.

  • Long-running tasks
    Agents may run for minutes or hours, requiring durable execution state.

  • Cost management
    Confused agents making unnecessary tool calls can become expensive quickly.

  • Observability
    You must understand exactly what the agent decided and why.


The Core Architecture: Durable Agent State

The most important production decision:

Keep agent state in a database — not in memory.

In-memory state:

  • Dies with the server
  • Cannot scale horizontally
  • Cannot be audited

Database state:

  • Survives restarts
  • Enables horizontal scaling
  • Provides observability
  • Enables debugging

Example schema:

```sql
-- Agent execution state table
CREATE TABLE agent_executions (
  id UUID DEFAULT gen_random_uuid() PRIMARY KEY,
  user_id UUID REFERENCES auth.users NOT NULL,
  agent_type TEXT NOT NULL,
  status TEXT NOT NULL DEFAULT 'pending',
  CONSTRAINT valid_status CHECK (
    status IN (
      'pending',
      'running',
      'completed',
      'failed',
      'cancelled',
      'awaiting_review'
    )
  ),
  input_data JSONB NOT NULL,
  state JSONB DEFAULT '{}',
  result JSONB,
  error TEXT,
  step_count INTEGER DEFAULT 0,
  token_count INTEGER DEFAULT 0,
  created_at TIMESTAMPTZ DEFAULT NOW(),
  updated_at TIMESTAMPTZ DEFAULT NOW(),
  completed_at TIMESTAMPTZ
);

-- Tool call log for observability
CREATE TABLE agent_tool_calls (
  id UUID DEFAULT gen_random_uuid() PRIMARY KEY,
  execution_id UUID REFERENCES agent_executions NOT NULL,
  step_number INTEGER NOT NULL,
  tool_name TEXT NOT NULL,
  tool_input JSONB NOT NULL,
  tool_output JSONB,
  status TEXT NOT NULL DEFAULT 'pending',
  latency_ms INTEGER,
  error TEXT,
  called_at TIMESTAMPTZ DEFAULT NOW()
);
```
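
One thing the `valid_status` CHECK constraint does not enforce is *which* transitions between statuses are legal. A minimal sketch of enforcing that in application code — the transition map and the `canTransition` name are my own assumptions, not part of the schema:

```typescript
// Statuses mirror the valid_status CHECK constraint above.
type AgentStatus =
  | 'pending' | 'running' | 'completed'
  | 'failed' | 'cancelled' | 'awaiting_review'

// Assumed workflow: terminal states allow no further transitions,
// and awaiting_review can resume or be cancelled. Adjust to taste.
const ALLOWED_TRANSITIONS: Record<AgentStatus, AgentStatus[]> = {
  pending: ['running', 'cancelled'],
  running: ['completed', 'failed', 'cancelled', 'awaiting_review'],
  awaiting_review: ['running', 'cancelled'],
  completed: [],
  failed: [],
  cancelled: []
}

function canTransition(from: AgentStatus, to: AgentStatus): boolean {
  return ALLOWED_TRANSITIONS[from].includes(to)
}
```

Checking transitions before every status update catches bugs like a worker marking a cancelled execution as running.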

The Agent Loop With Production Safeguards

Production agents need hard limits.

Example safeguards:

  • Step limits
  • Token limits
  • Timeout limits
  • Failure conditions

Example TypeScript loop:

```typescript
// lib/agents/production-agent.ts
// Helpers (loadExecution, updateStatus, failWithReason, buildMessages,
// withTimeout, callModel, persistState) are elided for brevity.

const AGENT_LIMITS = {
  maxSteps: 25,
  maxTokens: 50_000,
  stepTimeoutMs: 30_000,
  totalTimeoutMs: 300_000
}

export async function runAgent(
  executionId: string,
  supabase: SupabaseClient
): Promise<void> {
  const startTime = Date.now()

  const execution = await loadExecution(executionId, supabase)
  const messages = buildMessages(execution.input_data)

  await updateStatus(executionId, 'running', supabase)

  while (true) {
    const elapsed = Date.now() - startTime

    if (execution.step_count >= AGENT_LIMITS.maxSteps) {
      await failWithReason(executionId, 'MAX_STEPS_EXCEEDED', supabase)
      return
    }

    if (execution.token_count >= AGENT_LIMITS.maxTokens) {
      await failWithReason(executionId, 'MAX_TOKENS_EXCEEDED', supabase)
      return
    }

    if (elapsed >= AGENT_LIMITS.totalTimeoutMs) {
      await failWithReason(executionId, 'TOTAL_TIMEOUT', supabase)
      return
    }

    // Each individual model call is bounded by stepTimeoutMs
    const response = await withTimeout(
      callModel(messages, TOOLS),
      AGENT_LIMITS.stepTimeoutMs
    )

    execution.step_count++
    execution.token_count += response.usage?.total_tokens ?? 0

    await persistState(executionId, execution, supabase)

    // No tool calls means the model has produced a final answer
    if (!response.tool_calls?.length) {
      await updateStatus(executionId, 'completed', supabase)
      return
    }
  }
}
```
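
The guard clauses in the loop are easy to test if you pull them out into a pure function. A sketch of that refactor — `checkLimits` and `LimitState` are names I'm introducing, not part of the original code:

```typescript
// Pure limit check: returns a failure reason, or null if within limits.
interface LimitState {
  stepCount: number
  tokenCount: number
  elapsedMs: number
}

const LIMITS = {
  maxSteps: 25,
  maxTokens: 50_000,
  totalTimeoutMs: 300_000
}

function checkLimits(s: LimitState): string | null {
  if (s.stepCount >= LIMITS.maxSteps) return 'MAX_STEPS_EXCEEDED'
  if (s.tokenCount >= LIMITS.maxTokens) return 'MAX_TOKENS_EXCEEDED'
  if (s.elapsedMs >= LIMITS.totalTimeoutMs) return 'TOTAL_TIMEOUT'
  return null
}
```

The loop body then reduces to one call: fail with the returned reason if it is non-null, otherwise continue.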

The Human-in-the-Loop Gate

For actions that are difficult to reverse, I require human approval.

The agent:

  • Prepares the action
  • Sets status to awaiting_review
  • Stops execution
  • Waits for approval

Example:

```typescript
const APPROVAL_REQUIRED_TOOLS = [
  'send_email',
  'update_customer_record',
  'generate_compliance_document',
  'submit_to_regulator'
]

async function executeToolCall(
  toolCall: ToolCall,
  executionId: string,
  supabase: SupabaseClient
) {
  const { name, args } = toolCall

  if (APPROVAL_REQUIRED_TOOLS.includes(name)) {
    await updateStatus(executionId, 'awaiting_review', supabase)
    throw new AgentPausedError('Human approval required')
  }

  return await callTool(name, args)
}
```
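
The other half of the gate is resuming after a reviewer decides. A minimal sketch of the decision logic — `ReviewDecision` and `resolveReview` are illustrative names, not part of the original code:

```typescript
// Map a reviewer's decision to the next execution status and whether
// the pending tool call should actually run.
type ReviewDecision = 'approve' | 'reject'

function resolveReview(decision: ReviewDecision): {
  nextStatus: 'running' | 'cancelled'
  executeTool: boolean
} {
  return decision === 'approve'
    ? { nextStatus: 'running', executeTool: true }
    : { nextStatus: 'cancelled', executeTool: false }
}
```

On approval, a worker re-enters the agent loop, executes the held tool call, and continues; on rejection, the execution terminates without side effects.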

Monitoring: What I Track in Production

Metrics I monitor:

  • Step efficiency
  • Tool success rate
  • Human review escalation rate
  • Token cost per completion
  • Completion rate

Example health query:

```typescript
const { data } = await supabase.rpc('agent_health_metrics', {
  agent_type: 'compliance_document_generator',
  since: new Date(
    Date.now() - 7 * 24 * 60 * 60 * 1000
  ).toISOString()
})
```
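
If you'd rather skip a database RPC, the same metrics can be computed client-side from raw `agent_executions` rows. A sketch under that assumption (the `healthMetrics` function and `ExecutionRow` shape are mine):

```typescript
// Row shape mirrors the relevant columns of agent_executions.
interface ExecutionRow {
  status: string
  step_count: number
  token_count: number
}

function healthMetrics(rows: ExecutionRow[]) {
  const total = rows.length
  if (total === 0) {
    return { completionRate: 0, reviewRate: 0, avgSteps: 0 }
  }
  const completed = rows.filter(r => r.status === 'completed').length
  const inReview = rows.filter(r => r.status === 'awaiting_review').length
  const avgSteps = rows.reduce((sum, r) => sum + r.step_count, 0) / total
  return {
    completionRate: completed / total,
    reviewRate: inReview / total,
    avgSteps
  }
}
```

For large tables the RPC is the better choice, since it aggregates in the database instead of shipping every row to the client.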

Typical results:

  • Completion rate: 94%
  • Avg steps: 8.3
  • Human review rate: 3.1%

Key Lessons

Production agents require:

  • Durable state
  • Hard execution limits
  • Observability
  • Cost controls
  • Human approval gates

Most failures come from missing safeguards, not model quality.


About the Author

Tilak Raj
Founder & CEO — Brainfy AI

Building vertical AI SaaS across compliance, real estate, agriculture, and aviation.

Website: https://www.tilakraj.info

Projects: https://www.tilakraj.info/projects


Questions about production agents? Drop a comment — I reply to all of them.
