Seven Lambda functions. Seven isolated AI tools that couldn't talk to each other.
Each function worked perfectly alone - document summarization, content extraction, classification. But when I tried to chain them together for complex workflows, the entire architecture crumbled. Timeouts cascaded. State management became impossible. Users got no feedback until everything finished or failed.
I learned the hard way that Lambdas are tools, not orchestrators.
The Lambda Limitation Wall
Here's what I was trying to build: an AI research agent that could take a topic, find relevant documents, extract key information, summarize findings, and present a coherent report. Simple enough, right?
Wrong. With Lambda functions, this became a nightmare of nested API calls, state management, and timeout issues. Each function was capped at Lambda's hard 15-minute timeout, but my research workflow needed 45 minutes for complex topics. And I had no way to maintain conversation context or stream partial results back to users.
The breaking point came when a user asked: "Can you research climate change impacts on agriculture and give me updates as you find things?"
With Lambdas, my answer was: "No, wait 15 minutes and I'll dump everything on you at once."
That's when I knew I needed a different approach.
Enter ECS Fargate: The Orchestrator
ECS Fargate solved every problem I had with Lambda orchestration:
- No timeout limits - tasks can run for hours if needed
- Streaming capabilities - real-time updates to users
- Persistent connections - maintain state across tool calls
- Still serverless-ish - pay per second, auto-scaling
The key realization: use Lambda functions as specialized tools, but orchestrate them with ECS Fargate running AI agents.
The ReAct Pattern: Think, Act, Observe, Repeat
The breakthrough came when I implemented the ReAct pattern (Reasoning and Acting). It's how humans solve complex problems:
- Think - analyze the current situation and plan next steps
- Act - use a tool to gather information or perform an action
- Observe - examine the results and update understanding
- Repeat - continue until the goal is achieved
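Before the class itself, here's a hedged sketch of the supporting types it imports from `./types`. These aren't shown in the excerpt below, so the exact shapes are my assumptions, reverse-engineered from how the Agent uses them:

```typescript
// Assumed shapes for the ./types module referenced by the Agent class.
// Reconstructed from usage - the real definitions may differ.
export interface ToolResult {
  success: boolean;
  data: unknown;
  error: string | null;
}

export interface ToolCall {
  id: string;
  toolName: string;
  parameters: Record<string, unknown>;
  timestamp: Date;
}

export interface Tool {
  name: string;
  description: string;
  parameters: Record<string, unknown>;
  execute(params: Record<string, unknown>): Promise<ToolResult>;
}

export interface AgentMessage {
  role: 'user' | 'assistant' | 'tool';
  content: string;
  toolCall?: ToolCall;      // present on assistant messages that invoked a tool
  toolResult?: ToolResult;  // present on tool messages
  timestamp: Date;
}

export interface AgentMemory {
  addMessage(msg: AgentMessage): Promise<void>;
  getHistory(): Promise<AgentMessage[]>;
}

export interface AgentConfig {
  name: string;
  description: string;
  model: string;
  maxIterations?: number;
  tools: Tool[];
  memory: AgentMemory;
  onToolCall?: (call: ToolCall) => Promise<boolean>; // human-approval hook
}

export interface AgentResult {
  success: boolean;
  output: string;
  toolCalls: ToolCall[];
  iterations: number;
  error?: string;
}
```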
Here's my implementation of the Agent class using the ReAct pattern:
```typescript
import type { AIGateway } from '@ai-platform-aws/sdk';
import type { Tool, ToolCall, ToolResult, AgentConfig, AgentResult } from './types';

export class Agent {
  private config: AgentConfig;
  private gateway: AIGateway;

  constructor(config: AgentConfig, gateway: AIGateway) {
    this.config = config;
    this.gateway = gateway;
  }

  async run(task: string): Promise<AgentResult> {
    const maxIterations = this.config.maxIterations || 10;
    const toolCalls: ToolCall[] = [];
    let iterations = 0;

    const systemPrompt = this.buildSystemPrompt();
    const memory = this.config.memory;

    // Add the user task to memory
    await memory.addMessage({
      role: 'user',
      content: task,
      timestamp: new Date()
    });

    while (iterations < maxIterations) {
      iterations++;

      // Get the conversation history and format it for the LLM
      const history = await memory.getHistory();
      const messages = this.formatMessages(systemPrompt, history);

      // Call the LLM for the reasoning step
      const response = await this.gateway.complete({
        model: this.config.model,
        messages
      });

      const content = response.content || '';

      // Check whether the response contains a tool call or a final answer
      const parsedToolCall = this.parseToolCall(content);

      if (!parsedToolCall) {
        // No tool call found - this is the final answer
        await memory.addMessage({
          role: 'assistant',
          content,
          timestamp: new Date()
        });

        return {
          success: true,
          output: this.extractFinalAnswer(content),
          toolCalls,
          iterations
        };
      }

      // Look up the requested tool
      const { toolName, parameters } = parsedToolCall;
      const tool = this.config.tools.find(t => t.name === toolName);

      if (!tool) {
        await memory.addMessage({
          role: 'tool',
          content: `Error: Tool "${toolName}" not found`,
          timestamp: new Date()
        });
        continue;
      }

      // Build the call record first, so the approval hook can inspect it
      const call: ToolCall = {
        id: `call_${Date.now()}_${Math.random().toString(36).slice(2, 8)}`,
        toolName,
        parameters,
        timestamp: new Date()
      };

      // Human approval check for sensitive operations
      if (this.config.onToolCall) {
        const approved = await this.config.onToolCall(call);
        if (!approved) {
          await memory.addMessage({
            role: 'tool',
            content: `Tool call "${toolName}" rejected by human operator`,
            timestamp: new Date()
          });
          continue;
        }
      }

      // Execute the tool, recording the call even if execution fails
      let result: ToolResult;
      try {
        result = await tool.execute(parameters);
      } catch (err) {
        result = {
          success: false,
          data: null,
          error: `Tool execution error: ${(err as Error).message}`
        };
      }
      toolCalls.push(call);

      // Add both the thought and the tool result to memory
      await memory.addMessage({
        role: 'assistant',
        content,
        toolCall: call,
        timestamp: new Date()
      });

      await memory.addMessage({
        role: 'tool',
        content: JSON.stringify(result, null, 2),
        toolResult: result,
        timestamp: new Date()
      });
    }

    return {
      success: false,
      output: 'Agent reached maximum iterations without finding answer',
      toolCalls,
      iterations,
      error: `Max iterations (${maxIterations}) reached`
    };
  }

  private parseToolCall(content: string): { toolName: string; parameters: Record<string, unknown> } | null {
    // Look for the pattern: "Action: toolname\nAction Input: {...}"
    const actionMatch = content.match(/Action:\s*(\w+)\s*\n\s*Action Input:\s*(\{[\s\S]*?\})\s*$/m);
    if (actionMatch) {
      try {
        return {
          toolName: actionMatch[1],
          parameters: JSON.parse(actionMatch[2])
        };
      } catch {
        return null; // Malformed JSON - treat as no tool call
      }
    }
    return null;
  }

  private buildSystemPrompt(): string {
    const toolDescriptions = this.config.tools
      .map(t => `- **${t.name}**: ${t.description}\n  Parameters: ${JSON.stringify(t.parameters)}`)
      .join('\n');

    return `You are ${this.config.name}: ${this.config.description}

Available tools:
${toolDescriptions}

When you need to use a tool, respond with:
Thought: [your reasoning]
Action: [tool_name]
Action Input: [JSON parameters]

When you have enough information to answer, respond with:
Thought: [final reasoning]
Final Answer: [complete answer]`;
  }
}
```
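To make the text protocol concrete, here's what a tool-calling turn from the model looks like and how the `parseToolCall` regex picks it apart. The sample text is illustrative; note that the lazy `\{[\s\S]*?\}` stops at the first closing brace followed by a line end, so deeply nested JSON arguments would need a more robust parser:

```typescript
// A typical ReAct-formatted model response (illustrative sample)
const sample = [
  'Thought: I should search for recent documents first.',
  'Action: search_documents',
  'Action Input: {"query": "renewable energy trends 2024", "limit": 10}'
].join('\n');

// The same pattern the Agent uses in parseToolCall
const match = sample.match(/Action:\s*(\w+)\s*\n\s*Action Input:\s*(\{[\s\S]*?\})\s*$/m);

if (match) {
  const toolName = match[1];                // "search_documents"
  const parameters = JSON.parse(match[2]);  // { query: "...", limit: 10 }
  console.log(toolName, parameters.limit);  // search_documents 10
}
```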
Connecting Lambda Functions as Tools
The key move was turning my existing Lambda functions into agent tools. Each Lambda becomes a specialized capability the agent can invoke:
```typescript
import { LambdaClient, InvokeCommand } from '@aws-sdk/client-lambda';
import type { Tool, ToolResult } from './types';

export class LambdaTool implements Tool {
  constructor(
    public name: string,
    public description: string,
    public parameters: Record<string, unknown>,
    private lambdaClient: LambdaClient,
    private functionName: string
  ) {}

  async execute(params: Record<string, unknown>): Promise<ToolResult> {
    try {
      const command = new InvokeCommand({
        FunctionName: this.functionName,
        Payload: JSON.stringify(params)
      });

      const response = await this.lambdaClient.send(command);

      if (!response.Payload) {
        return { success: false, data: null, error: 'No response from Lambda' };
      }

      const result = JSON.parse(Buffer.from(response.Payload).toString());

      // Unhandled Lambda errors surface as an errorMessage field in the payload
      if (result.errorMessage) {
        return { success: false, data: null, error: result.errorMessage };
      }

      return { success: true, data: result, error: null };
    } catch (error) {
      return {
        success: false,
        data: null,
        error: `Lambda invocation failed: ${(error as Error).message}`
      };
    }
  }
}
```
```typescript
// Configure the agent with Lambda tools
const lambdaClient = new LambdaClient({ region: 'us-east-1' });

const agent = new Agent({
  name: 'ResearchAgent',
  description: 'Analyzes documents and extracts insights',
  model: 'gpt-4',
  maxIterations: 10,
  memory: conversationMemory, // any store implementing addMessage/getHistory
  tools: [
    new LambdaTool(
      'summarize_document',
      'Summarize document text content with specified length',
      { text: 'string', max_length: 'number' },
      lambdaClient,
      'ai-platform-summarize'
    ),
    new LambdaTool(
      'extract_entities',
      'Extract named entities from text',
      { text: 'string', entity_types: 'string[]' },
      lambdaClient,
      'ai-platform-extract-entities'
    ),
    new LambdaTool(
      'classify_content',
      'Classify text into predefined categories',
      { text: 'string', categories: 'string[]' },
      lambdaClient,
      'ai-platform-classify'
    )
  ]
}, gateway);
```
Real Example: Research Agent in Action
Here's what happens when I ask the agent to research renewable energy trends:
Iteration 1:
- Think: "I need to find recent information about renewable energy trends. Let me search for relevant documents first."
- Act: `search_documents({ query: "renewable energy trends 2024", limit: 10 })`
- Observe: Found 8 relevant documents

Iteration 2:
- Think: "I found several documents. Let me extract key information from the most recent one."
- Act: `extract_entities({ text: document_content, entity_types: ["technologies", "statistics", "companies"] })`
- Observe: Extracted solar, wind, and battery storage technologies with growth statistics

Iteration 3:
- Think: "Now I have some data points. Let me summarize the key findings and get more specific information about growth rates."
- Act: `summarize_document({ text: combined_content, max_length: 500 })`
- Observe: Generated a concise summary with key trends
Final Answer: Comprehensive report on renewable energy trends with specific data points and growth projections.
Lambda vs ECS: Side-by-Side Comparison
What this research task looks like with Lambda:
```typescript
// Impossible to implement cleanly
export const handler = async (event) => {
  // Step 1: Search (works)
  const documents = await searchLambda.invoke({...});

  // Step 2: Process each document (timeout risk)
  const summaries = [];
  for (const doc of documents) {
    const summary = await summarizeLambda.invoke({...});
    summaries.push(summary);
    // What if this takes 20 minutes total?
  }

  // Step 3: Synthesize (might not even get here)
  const final = await synthesizeLambda.invoke({...});

  // No streaming, no progress updates, all-or-nothing
  return final;
};
```
What it looks like with ECS:
```typescript
// Clean, maintainable, observable
export class ResearchAgent extends ReActAgent {
  async researchTopic(topic: string): Promise<AgentResult> {
    this.onProgress = (update) => this.streamToClient(update);
    return this.run(`Research ${topic} and provide comprehensive analysis`);
  }

  private streamToClient(update: string): void {
    // WebSocket or SSE to the client
    this.websocket.send(JSON.stringify({
      type: 'progress',
      message: update,
      timestamp: new Date().toISOString()
    }));
  }
}
```
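The `streamToClient` method assumes a socket already exists. If you'd rather expose the agent over plain HTTP, a minimal Server-Sent Events wrapper is enough. This is a sketch under my own assumptions - the `/research` route, the `sseFrame` helper, and the inline agent wiring are hypothetical, not part of the platform's actual code:

```typescript
import { createServer } from 'node:http';

// SSE frames are just "event:"/"data:" lines terminated by a blank line
export function sseFrame(event: string, data: unknown): string {
  return `event: ${event}\ndata: ${JSON.stringify(data)}\n\n`;
}

const server = createServer((req, res) => {
  if (req.url?.startsWith('/research')) {
    res.writeHead(200, {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache',
      Connection: 'keep-alive'
    });

    // Hypothetical wiring: pipe agent progress into the response stream
    // agent.onProgress = (update) => res.write(sseFrame('progress', { message: update }));
    // agent.run(topic).then(result => { res.write(sseFrame('result', result)); res.end(); });
    res.write(sseFrame('progress', { message: 'Starting research…' }));
    res.end();
  } else {
    res.writeHead(404);
    res.end();
  }
});

// server.listen(8080);
```

Browsers reconnect to SSE endpoints automatically, which makes this a forgiving default for long-running research tasks.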
Human-in-the-Loop for High-Stakes Actions
Not all actions should be automated. For high-stakes operations, I implemented a human approval pattern:
```typescript
class ApprovalTool implements Tool {
  name = 'request_approval';
  description = 'Request human approval for sensitive actions';
  parameters = { action: 'string', reasoning: 'string', risk_level: 'string' };

  // Inject whatever channel reaches a human (Slack, email, dashboard)
  constructor(private notificationService: NotificationService) {}

  async execute(params: any): Promise<any> {
    const approval = await this.notificationService.requestApproval({
      message: `Agent wants to: ${params.action}`,
      reasoning: params.reasoning,
      riskLevel: params.risk_level,
      timeout: 300000 // 5 minutes
    });

    if (!approval.approved) {
      throw new Error(`Action rejected: ${approval.reason}`);
    }

    return { approved: true, conditions: approval.conditions };
  }
}
```
When the agent wants to delete files, send emails, or make API calls with financial implications, it asks for permission first. The user gets a notification and can approve/reject with reasoning.
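A minimal in-memory version of that notification service can be sketched with a pending-promise map and a timeout race. The class name and shapes here are my own illustration, not the platform's actual service:

```typescript
type ApprovalDecision = { approved: boolean; reason?: string; conditions?: string[] };

// Hypothetical in-memory approval broker: the agent awaits a decision,
// and a human (via UI webhook, Slack action, etc.) resolves it by id.
export class InMemoryApprovalService {
  private pending = new Map<string, (decision: ApprovalDecision) => void>();

  requestApproval(id: string, timeoutMs: number): Promise<ApprovalDecision> {
    const decision = new Promise<ApprovalDecision>(resolve => {
      this.pending.set(id, resolve);
    });
    // If no human responds in time, fail closed: treat silence as rejection
    const timeout = new Promise<ApprovalDecision>(resolve =>
      setTimeout(() => resolve({ approved: false, reason: 'Approval timed out' }), timeoutMs)
    );
    return Promise.race([decision, timeout]);
  }

  // Called from the human-facing side (e.g. a webhook handler)
  decide(id: string, decision: ApprovalDecision): void {
    this.pending.get(id)?.(decision);
    this.pending.delete(id);
  }
}
```

Failing closed on timeout is the important design choice: an unanswered approval request should never silently become a yes.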
Performance and Cost Reality
After running this architecture for 3 months in production:
ECS Fargate costs:
- Average task: 1 vCPU, 2GB RAM
- Runtime: 5-15 minutes per complex query
- Cost: $0.04-0.12 per research task
- Monthly infrastructure: ~$25 for moderate usage
Lambda costs remained the same:
- $0.01-0.05 per tool invocation
- No change to individual function costs
- Better utilization since agents batch calls efficiently
Performance improvements:
- 40% reduction in total execution time (parallel tool calls)
- 90% improvement in user experience (streaming updates)
- Zero timeout failures
- 60% reduction in support tickets ("is it still working?")
The Architecture Today
My current setup runs 3 ECS tasks:
- Research Agent - multi-tool workflows like the example above
- Content Agent - writing, editing, formatting workflows
- Analysis Agent - data processing and reporting workflows
Each agent can call any of the 7 Lambda tools as needed. The agents scale independently based on demand, and I can deploy new tools without touching the orchestration logic.
What's Next
The complete code is available in my repositories:
- Agent implementation: ai-platform-aws/packages/agents
- Lambda tools example: 03-lambda-ai-tool
- ECS agent orchestrator: 04-ecs-agent-orchestrator
Next up: Part 6 covers the TypeScript SDK that makes this platform actually enjoyable for developers to use. No more raw HTTP calls or manual error handling.
Part 5 of 8 in the series "Building an AI Platform on AWS from Scratch". Everything I learned building ai-platform-aws - including the expensive mistakes.