Tyson Cung
We Started with Lambdas. Here's What Broke.

Lambdas seemed perfect for AI workloads. Single-purpose functions, automatic scaling, pay only for what you use. We built 7 of them before realizing our mistake.

Here's our first Lambda - a document summarizer for our asset management platform:

import { APIGatewayProxyHandler } from 'aws-lambda';
import OpenAI from 'openai';

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

export const handler: APIGatewayProxyHandler = async (event) => {
  try {
    const { document } = JSON.parse(event.body || '{}');

    const response = await openai.chat.completions.create({
      model: 'gpt-4',
      messages: [
        {
          role: 'system',
          content: 'Summarize the following document in 2-3 sentences.'
        },
        {
          role: 'user',
          content: document
        }
      ],
      max_tokens: 150
    });

    return {
      statusCode: 200,
      body: JSON.stringify({
        summary: response.choices[0].message.content
      })
    };
  } catch (error) {
    // `error` is `unknown` in TypeScript, so narrow before reading `.message`
    const message = error instanceof Error ? error.message : 'Unknown error';
    return {
      statusCode: 500,
      body: JSON.stringify({ error: message })
    };
  }
};

Clean. Simple. It worked great... until it didn't.

The 29-Second Wall

Our first major problem hit when we built an agent that could analyze complex documents. The agent needed to:

  1. Extract text from the document
  2. Analyze for key themes
  3. Generate tags
  4. Create a summary
  5. Suggest related assets

Each step took 3-7 seconds. Total runtime: ~25 seconds. Within Lambda's 15-minute limit, right?

Wrong.

2024-02-15 14:32:18 START RequestId: abc-123-def
2024-02-15 14:32:18 Calling OpenAI for document analysis...
2024-02-15 14:32:25 Analysis complete, generating tags...
2024-02-15 14:32:32 OpenAI inference still running...
2024-02-15 14:32:47 ERROR Task timed out after 29.00 seconds

API Gateway has a 29-second timeout. Not Lambda - API Gateway. Your Lambda can run for 15 minutes, but if you're exposing it through API Gateway (which you probably are), you hit the wall at 29 seconds.

When this timeout hits, here's what happens:

  • The client gets a 504 Gateway Timeout
  • Lambda keeps running and burning money
  • OpenAI or Bedrock calls complete but results are lost
  • Users see failed requests
  • You get charged for the full Lambda execution time

We lost 30% of our complex agent requests to timeouts. Users thought our AI was broken. It wasn't - it was just slow.
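If you do stay on Lambda behind API Gateway, one mitigation is to budget time explicitly and bail out before the gateway does. This is a hypothetical sketch, not our production code; the safety margin and per-step estimates are assumed values:

```typescript
// API Gateway's hard integration timeout for REST APIs.
const GATEWAY_LIMIT_MS = 29_000;
// Margin for response serialization and network overhead (an assumed value).
const SAFETY_MARGIN_MS = 2_000;

// Decide whether another agent step fits in the remaining budget, so the
// handler can return partial results instead of letting the client see a 504.
export function canRunNextStep(elapsedMs: number, estimatedStepMs: number): boolean {
  return elapsedMs + estimatedStepMs + SAFETY_MARGIN_MS <= GATEWAY_LIMIT_MS;
}
```

Inside a handler you would check this before each step (elapsed time from `Date.now()`, or derived from `context.getRemainingTimeInMillis()`) and respond early with a 202 and a job ID when the budget runs out. It doesn't make slow workflows fast, but it turns silent timeouts into explicit partial responses.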

Streaming? Not from Lambda

Our users wanted real-time chat responses. They'd seen ChatGPT's streaming interface and expected the same. So we tried to implement streaming:

export const handler: APIGatewayProxyHandler = async (event) => {
  const stream = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [/* ... */],
    stream: true
  });

  // This is where it breaks
  for await (const chunk of stream) {
    // How do you stream through API Gateway?
    // You can't.
  }
};

API Gateway buffers the entire Lambda response before sending it to the client, so there's no way to stream partial responses through it. (Lambda does support response streaming through Function URLs, but not behind a REST API Gateway.) Even if your Lambda generates data incrementally, the client won't see anything until the function completes.

The workaround? WebSockets. But that means:

  • Separate WebSocket API Gateway
  • Connection management
  • Message routing
  • State tracking
  • Way more complexity

We tried it. The code ballooned to 3x the size for a simple streaming response.
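For the record, the relay loop itself is small; the complexity lives in the connection plumbing around it. Here's a hedged sketch of the core, with the actual `PostToConnectionCommand` call from `@aws-sdk/client-apigatewaymanagementapi` abstracted behind a `send` callback (names are illustrative, not our production code):

```typescript
// Relay streamed model output to a WebSocket client, one chunk at a time.
// In the real setup, `send` wraps PostToConnectionCommand with the client's
// connection ID; here it is injected so the loop is testable in isolation.
export async function relayStream(
  chunks: AsyncIterable<string>,
  send: (data: string) => Promise<void>
): Promise<number> {
  let sent = 0;
  for await (const text of chunks) {
    if (text.length === 0) continue; // some providers emit empty deltas
    await send(text);
    sent += 1;
  }
  return sent;
}
```

Everything this leaves out ($connect/$disconnect routes, persisting connection IDs, handling GoneException for dropped clients) is where the 3x code growth came from.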

Cold Starts from Hell

AI SDKs are heavy. Here's what we imported:

import OpenAI from 'openai';                    // 2.1 MB
import { BedrockRuntimeClient } from '@aws-sdk/client-bedrock-runtime'; // 1.8 MB  
import Anthropic from '@anthropic-ai/sdk';     // 1.9 MB
import { DynamoDBClient } from '@aws-sdk/client-dynamodb'; // 1.2 MB
import PDFParse from 'pdf-parse';              // 900 KB

Total bundle size: ~8 MB. Cold start time: 8-12 seconds.

When a Lambda hasn't run for 5+ minutes, AWS creates a new container. Container startup + code initialization = your users wait 10+ seconds for the first response.

Here's the real kicker - this happened constantly because our AI functions were used sporadically:

  • Document analysis: maybe 20 requests per hour
  • Image classification: 5-10 requests per hour
  • Content generation: 1-2 requests per hour

Each function went cold multiple times per day. Users would upload a document, wait 12 seconds, and think our platform was broken.

We tried Lambda provisioned concurrency. It helped but cost $50/month per function just to keep them warm. For 7 functions, that's $350/month before processing a single request.
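One partial mitigation that doesn't cost $50/month is lazy initialization: construct heavy SDK clients on first use instead of at module load, which helps when a handler imports several SDKs but most requests touch only one. A generic sketch (the helper is ours for illustration, not an AWS API):

```typescript
// Memoize an async factory so an expensive client (e.g. an OpenAI or Bedrock
// SDK instance) is built once, on the first request that actually needs it.
export function lazy<T>(factory: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | undefined;
  return () => {
    if (!cached) cached = factory(); // subsequent callers share the promise
    return cached;
  };
}

// Usage sketch (hypothetical):
// const getOpenAI = lazy(async () =>
//   new (await import('openai')).default({ apiKey: process.env.OPENAI_API_KEY }));
```

This moves client setup off the cold path for requests that don't need it, but it doesn't shrink the bundle itself; for that, the only real fix is trimming dependencies.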

No Shared State

Multi-turn conversations were impossible. Here's what we tried:

// Turn 1: User asks about a document
export const chatHandler: APIGatewayProxyHandler = async (event) => {
  const { message, conversationId } = JSON.parse(event.body || '{}');

  // Get conversation history from DynamoDB
  const history = await getConversationHistory(conversationId);

  const response = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [
      ...history,
      { role: 'user', content: message }
    ]
  });

  // Save new message to DynamoDB
  await saveMessage(conversationId, 'user', message);
  await saveMessage(conversationId, 'assistant', response.choices[0].message.content);

  return {
    statusCode: 200,
    body: JSON.stringify({ response: response.choices[0].message.content })
  };
};

Every request required:

  1. DynamoDB read to get conversation history
  2. AI inference
  3. Two DynamoDB writes to save the exchange

For a 3-turn conversation, that's 3 reads + 6 writes. DynamoDB costs added up, and latency increased with conversation length.

Worse, there was no way to maintain context between function calls. If the agent needed to use tools or make multiple API calls, each call was isolated. No shared memory, no persistent connections.
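One mitigation worth sketching for the history-growth problem (hypothetical, not from our codebase) is capping the history sent per turn, so latency and token cost stop growing without bound. Character counts here are a crude stand-in for real token counting:

```typescript
interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

// Keep the most recent messages that fit in a budget, walking newest-first
// so recent turns survive. A real version would count tokens, not characters.
export function trimHistory(history: ChatMessage[], maxChars: number): ChatMessage[] {
  const kept: ChatMessage[] = [];
  let total = 0;
  for (let i = history.length - 1; i >= 0; i--) {
    total += history[i].content.length;
    if (total > maxChars) break; // oldest messages fall off first
    kept.unshift(history[i]);
  }
  return kept;
}
```

This caps per-turn cost, but it's a band-aid: the fundamental problem of reloading state from DynamoDB on every invocation remains.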

Cost Spikes That Hurt

Lambda billing is per-millisecond, but AI inference has unpredictable latency:

  • Simple questions: 2-3 seconds
  • Complex analysis: 15-25 seconds
  • Code generation: 10-30 seconds
  • Image analysis: 5-20 seconds

Here's our cost breakdown for one expensive month:

Document Summarizer:    1,200 requests x 8s avg  =  2.7 hours = $180
Image Classifier:         800 requests x 12s avg =  2.7 hours = $180  
Content Generator:        400 requests x 18s avg =  2.0 hours = $135
Chat Agent:             2,000 requests x 15s avg =  8.3 hours = $560
Tag Suggester:          3,000 requests x 5s avg  =  4.2 hours = $280
PDF Analyzer:             200 requests x 22s avg =  1.2 hours = $80
Report Builder:           100 requests x 35s avg =  1.0 hour  = $65
                                                   Total: $1,480

We were paying Lambda compute costs for AI thinking time. A 20-second GPT-4 call that actually uses 50ms of CPU still costs you for 20 full seconds of Lambda runtime.
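The billing mechanics are easy to reproduce. Here's a back-of-the-envelope calculator using the published on-demand rate for x86 Lambda in us-east-1; the memory size and durations below are illustrative, and a real bill also adds per-request charges, provisioned concurrency, and downstream services:

```typescript
// Published on-demand Lambda compute price (x86, us-east-1), per GB-second.
const RATE_PER_GB_SECOND = 0.0000166667;

// Compute-only cost of invocations that spend avgSeconds waiting on inference.
export function lambdaComputeCostUSD(
  requests: number,
  avgSeconds: number,
  memoryGB: number
): number {
  return requests * avgSeconds * memoryGB * RATE_PER_GB_SECOND;
}
```

The point holds regardless of the exact inputs: duration is billed wall-clock, so a 20-second call that idles on I/O costs 400x what a 50ms call does at the same memory size.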

Compare that to a long-running container that can handle multiple requests while one AI call is processing. Much better cost efficiency.

The worst part? Peak usage amplified the cost problem. During business hours, we'd have 50+ concurrent Lambda executions waiting for AI responses. Each one burning money while the actual compute was happening on OpenAI's servers. It felt like paying for a taxi that's stuck in traffic - you're paying for time, not progress.

Multi-Turn Agent Loops

The final straw was building an agent that could help users organize their assets. The workflow:

  1. User: "Help me organize my product photos"
  2. Agent: Analyzes available photos, asks clarifying questions
  3. User: Provides criteria
  4. Agent: Suggests folder structure
  5. User: Approves or requests changes
  6. Agent: Executes the organization

Each step was a separate Lambda invocation. The state management looked like this:

// Step 1: Initial request
await saveToDynamoDB(sessionId, {
  step: 'analyzing',
  photos: userPhotos,
  status: 'in_progress'
});

// Step 2: Agent response  
const session = await getFromDynamoDB(sessionId);
const result = await openai.chat.completions.create(/* ... */);
await saveToDynamoDB(sessionId, {
  ...session,
  step: 'awaiting_criteria',
  analysis: result
});

// Step 3: User provides criteria (a separate invocation, so re-read state)
const session = await getFromDynamoDB(sessionId);
// ... and so on

By step 6, we had 12+ DynamoDB operations, 6 Lambda invocations, and a conversation context that was getting expensive to load each time.
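The session handling above amounts to a small state machine, which is worth seeing in one place. A sketch with illustrative step names (our real steps lived as strings in DynamoDB items):

```typescript
type Step =
  | 'analyzing'
  | 'awaiting_criteria'
  | 'proposing_structure'
  | 'awaiting_approval'
  | 'executing'
  | 'done';

// Linear happy path of the organize-photos agent. In the Lambda design, every
// single transition cost a fresh invocation plus DynamoDB reads and writes.
const NEXT_STEP: Record<Step, Step> = {
  analyzing: 'awaiting_criteria',
  awaiting_criteria: 'proposing_structure',
  proposing_structure: 'awaiting_approval',
  awaiting_approval: 'executing',
  executing: 'done',
  done: 'done',
};

export function advance(step: Step): Step {
  return NEXT_STEP[step];
}
```

Laid out like this, the mismatch is obvious: the logic is trivial, and all the cost sits in rehydrating state around each transition.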

The user experience was clunky because each step required a new HTTP request. No persistent connection, no real-time updates, no streaming. Just request-response cycles that felt broken compared to ChatGPT.

I remember showing this to our head of product. He tried the workflow once and said, "This feels like software from 2010." He wasn't wrong.

The Breaking Point

Our Lambda-based AI platform had fundamental problems:

  1. 29-second timeout killed complex workflows
  2. No streaming made chat feel broken
  3. Cold starts created 10+ second delays
  4. Cost inefficiency from paying for AI wait time
  5. State management complexity made agents painful
  6. Integration sprawl across 7 different functions

We were spending more time fighting infrastructure than building features. Our users complained about slow responses. Our AWS bill kept climbing.

Lambdas Are Perfect AI Tools, Terrible AI Agents

Here's what I learned: Lambdas are perfect for AI tools but terrible for AI agents.

Tools are single-purpose, stateless, and fast:

  • Classify this image
  • Summarize this document
  • Extract text from PDF
  • Generate alt text

Agents are multi-turn, stateful, and complex:

  • Help me organize photos
  • Analyze this data and create a report
  • Chat about my documents
  • Build a workflow based on conversation

For tools, Lambda is ideal. For agents, you need persistent connections, shared state, and streaming. Lambda fights you every step of the way.

What We Built Instead

So we built a gateway instead. A single API endpoint that could handle both tools and agents, with proper streaming, state management, and vendor flexibility.

The architecture is simple: API Gateway routes to Lambda for the gateway logic, but the gateway proxies to long-running containers for actual AI processing. Best of both worlds - serverless scaling for the API layer, persistent connections for AI workloads.

In the next article, I'll walk you through the gateway pattern and show you how we unified 7 different AI Lambdas into one clean API that works with any model provider.


This is part 2 of an 8-part series on building a production AI platform. You can find the complete code examples at https://github.com/tysoncung/ai-platform-aws-examples.
