Tyson Cung
We Started with Lambdas. Here's What Broke.

Lambdas seemed perfect for AI workloads. Single-purpose functions, automatic scaling, pay only for what you use. We built 7 of them before realizing our mistake.

Here's our first Lambda - a document summarizer for our asset management platform:

import { APIGatewayProxyHandler } from 'aws-lambda';
import OpenAI from 'openai';

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

export const handler: APIGatewayProxyHandler = async (event) => {
  try {
    const { document } = JSON.parse(event.body || '{}');

    const response = await openai.chat.completions.create({
      model: 'gpt-4',
      messages: [
        {
          role: 'system',
          content: 'Summarize the following document in 2-3 sentences.'
        },
        {
          role: 'user',
          content: document
        }
      ],
      max_tokens: 150
    });

    return {
      statusCode: 200,
      body: JSON.stringify({
        summary: response.choices[0].message.content
      })
    };
  } catch (error) {
    // `error` is `unknown` in TypeScript, so narrow before reading `.message`
    const message = error instanceof Error ? error.message : 'Unknown error';
    return {
      statusCode: 500,
      body: JSON.stringify({ error: message })
    };
  }
};

Clean. Simple. It worked great... until it didn't.

The 29-Second Wall

Our first major problem hit when we built an agent that could analyze complex documents. The agent needed to:

  1. Extract text from the document
  2. Analyze for key themes
  3. Generate tags
  4. Create a summary
  5. Suggest related assets

Each step took 3-7 seconds. Total runtime: ~25 seconds. Within Lambda's 15-minute limit, right?

Wrong.

2024-02-15 14:32:18 START RequestId: abc-123-def
2024-02-15 14:32:18 Calling OpenAI for document analysis...
2024-02-15 14:32:25 Analysis complete, generating tags...
2024-02-15 14:32:32 OpenAI inference still running...
2024-02-15 14:32:47 ERROR Task timed out after 29.00 seconds

API Gateway has a 29-second timeout. Not Lambda - API Gateway. Your Lambda can run for 15 minutes, but if you're exposing it through API Gateway (which you probably are), you hit the wall at 29 seconds.

When this timeout hits, here's what happens:

  • The client gets a 504 Gateway Timeout
  • Lambda keeps running and burning money
  • OpenAI or Bedrock calls complete but results are lost
  • Users see failed requests
  • You get charged for the full Lambda execution time

We lost 30% of our complex agent requests to timeouts. Users thought our AI was broken. It wasn't - it was just slow.
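If you do stay on Lambda behind API Gateway, one mitigation is to budget time explicitly and bail out before the gateway does. This is a hypothetical sketch, not our production code; the safety margin and per-step estimates are assumed values:

```typescript
// API Gateway's hard integration timeout for REST APIs.
const GATEWAY_LIMIT_MS = 29_000;
// Margin for response serialization and network overhead (an assumed value).
const SAFETY_MARGIN_MS = 2_000;

// Decide whether another agent step fits in the remaining budget, so the
// handler can return partial results instead of letting the client see a 504.
export function canRunNextStep(elapsedMs: number, estimatedStepMs: number): boolean {
  return elapsedMs + estimatedStepMs + SAFETY_MARGIN_MS <= GATEWAY_LIMIT_MS;
}
```

Inside a handler you would check this before each step (elapsed time from `Date.now()`, or derived from `context.getRemainingTimeInMillis()`) and respond early with a 202 and a job ID when the budget runs out. It doesn't make slow workflows fast, but it turns silent timeouts into explicit partial responses.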

Streaming? Not from Lambda

Our users wanted real-time chat responses. They'd seen ChatGPT's streaming interface and expected the same. So we tried to implement streaming:

export const handler: APIGatewayProxyHandler = async (event) => {
  const stream = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [/* ... */],
    stream: true
  });

  // This is where it breaks
  for await (const chunk of stream) {
    // How do you stream through API Gateway?
    // You can't.
  }
};

API Gateway buffers the entire Lambda response before sending it to the client, so there's no way to stream partial responses through it. (Lambda does support response streaming through Function URLs, but not behind a REST API Gateway.) Even if your Lambda generates data incrementally, the client won't see anything until the function completes.

The workaround? WebSockets. But that means:

  • Separate WebSocket API Gateway
  • Connection management
  • Message routing
  • State tracking
  • Way more complexity

We tried it. The code ballooned to 3x the size for a simple streaming response.
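For the record, the relay loop itself is small; the complexity lives in the connection plumbing around it. Here's a hedged sketch of the core, with the actual `PostToConnectionCommand` call from `@aws-sdk/client-apigatewaymanagementapi` abstracted behind a `send` callback (names are illustrative, not our production code):

```typescript
// Relay streamed model output to a WebSocket client, one chunk at a time.
// In the real setup, `send` wraps PostToConnectionCommand with the client's
// connection ID; here it is injected so the loop is testable in isolation.
export async function relayStream(
  chunks: AsyncIterable<string>,
  send: (data: string) => Promise<void>
): Promise<number> {
  let sent = 0;
  for await (const text of chunks) {
    if (text.length === 0) continue; // some providers emit empty deltas
    await send(text);
    sent += 1;
  }
  return sent;
}
```

Everything this leaves out ($connect/$disconnect routes, persisting connection IDs, handling GoneException for dropped clients) is where the 3x code growth came from.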

Cold Starts from Hell

AI SDKs are heavy. Here's what we imported:

import OpenAI from 'openai';                    // 2.1 MB
import { BedrockRuntimeClient } from '@aws-sdk/client-bedrock-runtime'; // 1.8 MB  
import Anthropic from '@anthropic-ai/sdk';     // 1.9 MB
import { DynamoDBClient } from '@aws-sdk/client-dynamodb'; // 1.2 MB
import PDFParse from 'pdf-parse';              // 900 KB

Total bundle size: ~8 MB. Cold start time: 8-12 seconds.

When a Lambda hasn't run for 5+ minutes, AWS creates a new container. Container startup + code initialization = your users wait 10+ seconds for the first response.

Here's the real kicker - this happened constantly because our AI functions were used sporadically:

  • Document analysis: maybe 20 requests per hour
  • Image classification: 5-10 requests per hour
  • Content generation: 1-2 requests per hour

Each function went cold multiple times per day. Users would upload a document, wait 12 seconds, and think our platform was broken.

We tried Lambda provisioned concurrency. It helped but cost $50/month per function just to keep them warm. For 7 functions, that's $350/month before processing a single request.
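One partial mitigation that doesn't cost $50/month is lazy initialization: construct heavy SDK clients on first use instead of at module load, which helps when a handler imports several SDKs but most requests touch only one. A generic sketch (the helper is ours for illustration, not an AWS API):

```typescript
// Memoize an async factory so an expensive client (e.g. an OpenAI or Bedrock
// SDK instance) is built once, on the first request that actually needs it.
export function lazy<T>(factory: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | undefined;
  return () => {
    if (!cached) cached = factory(); // subsequent callers share the promise
    return cached;
  };
}

// Usage sketch (hypothetical):
// const getOpenAI = lazy(async () =>
//   new (await import('openai')).default({ apiKey: process.env.OPENAI_API_KEY }));
```

This moves client setup off the cold path for requests that don't need it, but it doesn't shrink the bundle itself; for that, the only real fix is trimming dependencies.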

No Shared State

Multi-turn conversations were impossible. Here's what we tried:

// Turn 1: User asks about a document
export const chatHandler: APIGatewayProxyHandler = async (event) => {
  const { message, conversationId } = JSON.parse(event.body || '{}');

  // Get conversation history from DynamoDB
  const history = await getConversationHistory(conversationId);

  const response = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [
      ...history,
      { role: 'user', content: message }
    ]
  });

  // Save new message to DynamoDB
  await saveMessage(conversationId, 'user', message);
  await saveMessage(conversationId, 'assistant', response.choices[0].message.content);

  return {
    statusCode: 200,
    body: JSON.stringify({ response: response.choices[0].message.content })
  };
};

Every request required:

  1. DynamoDB read to get conversation history
  2. AI inference
  3. Two DynamoDB writes to save the exchange

For a 3-turn conversation, that's 3 reads + 6 writes. DynamoDB costs added up, and latency increased with conversation length.

Worse, there was no way to maintain context between function calls. If the agent needed to use tools or make multiple API calls, each call was isolated. No shared memory, no persistent connections.
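One mitigation worth sketching for the history-growth problem (hypothetical, not from our codebase) is capping the history sent per turn, so latency and token cost stop growing without bound. Character counts here are a crude stand-in for real token counting:

```typescript
interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

// Keep the most recent messages that fit in a budget, walking newest-first
// so recent turns survive. A real version would count tokens, not characters.
export function trimHistory(history: ChatMessage[], maxChars: number): ChatMessage[] {
  const kept: ChatMessage[] = [];
  let total = 0;
  for (let i = history.length - 1; i >= 0; i--) {
    total += history[i].content.length;
    if (total > maxChars) break; // oldest messages fall off first
    kept.unshift(history[i]);
  }
  return kept;
}
```

This caps per-turn cost, but it's a band-aid: the fundamental problem of reloading state from DynamoDB on every invocation remains.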

Cost Spikes That Hurt

Lambda billing is per-millisecond, but AI inference has unpredictable latency:

  • Simple questions: 2-3 seconds
  • Complex analysis: 15-25 seconds
  • Code generation: 10-30 seconds
  • Image analysis: 5-20 seconds

Here's our cost breakdown for one expensive month:

Document Summarizer:    1,200 requests x 8s avg  =  2.7 hours = $180
Image Classifier:         800 requests x 12s avg =  2.7 hours = $180  
Content Generator:        400 requests x 18s avg =  2.0 hours = $135
Chat Agent:             2,000 requests x 15s avg =  8.3 hours = $560
Tag Suggester:          3,000 requests x 5s avg  =  4.2 hours = $280
PDF Analyzer:             200 requests x 22s avg =  1.2 hours = $80
Report Builder:           100 requests x 35s avg =  1.0 hour  = $65
                                                   Total: $1,480

We were paying Lambda compute costs for AI thinking time. A 20-second GPT-4 call that actually uses 50ms of CPU still costs you for 20 full seconds of Lambda runtime.
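The billing mechanics are easy to reproduce. Here's a back-of-the-envelope calculator using the published on-demand rate for x86 Lambda in us-east-1; the memory size and durations below are illustrative, and a real bill also adds per-request charges, provisioned concurrency, and downstream services:

```typescript
// Published on-demand Lambda compute price (x86, us-east-1), per GB-second.
const RATE_PER_GB_SECOND = 0.0000166667;

// Compute-only cost of invocations that spend avgSeconds waiting on inference.
export function lambdaComputeCostUSD(
  requests: number,
  avgSeconds: number,
  memoryGB: number
): number {
  return requests * avgSeconds * memoryGB * RATE_PER_GB_SECOND;
}
```

The point holds regardless of the exact inputs: duration is billed wall-clock, so a 20-second call that idles on I/O costs 400x what a 50ms call does at the same memory size.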

Compare that to a long-running container that can handle multiple requests while one AI call is processing. Much better cost efficiency.

The worst part? Peak usage amplified the cost problem. During business hours, we'd have 50+ concurrent Lambda executions waiting for AI responses. Each one burning money while the actual compute was happening on OpenAI's servers. It felt like paying for a taxi that's stuck in traffic - you're paying for time, not progress.

Multi-Turn Agent Loops

The final straw was building an agent that could help users organize their assets. The workflow:

  1. User: "Help me organize my product photos"
  2. Agent: Analyzes available photos, asks clarifying questions
  3. User: Provides criteria
  4. Agent: Suggests folder structure
  5. User: Approves or requests changes
  6. Agent: Executes the organization

Each step was a separate Lambda invocation. The state management looked like this:

// Step 1: Initial request
await saveToDynamoDB(sessionId, {
  step: 'analyzing',
  photos: userPhotos,
  status: 'in_progress'
});

// Step 2: Agent response  
const session = await getFromDynamoDB(sessionId);
const result = await openai.chat.completions.create(/* ... */);
await saveToDynamoDB(sessionId, {
  ...session,
  step: 'awaiting_criteria',
  analysis: result
});

// Step 3: User provides criteria (a separate invocation, so re-read state)
const session = await getFromDynamoDB(sessionId);
// ... and so on

By step 6, we had 12+ DynamoDB operations, 6 Lambda invocations, and a conversation context that was getting expensive to load each time.
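The session handling above amounts to a small state machine, which is worth seeing in one place. A sketch with illustrative step names (our real steps lived as strings in DynamoDB items):

```typescript
type Step =
  | 'analyzing'
  | 'awaiting_criteria'
  | 'proposing_structure'
  | 'awaiting_approval'
  | 'executing'
  | 'done';

// Linear happy path of the organize-photos agent. In the Lambda design, every
// single transition cost a fresh invocation plus DynamoDB reads and writes.
const NEXT_STEP: Record<Step, Step> = {
  analyzing: 'awaiting_criteria',
  awaiting_criteria: 'proposing_structure',
  proposing_structure: 'awaiting_approval',
  awaiting_approval: 'executing',
  executing: 'done',
  done: 'done',
};

export function advance(step: Step): Step {
  return NEXT_STEP[step];
}
```

Laid out like this, the mismatch is obvious: the logic is trivial, and all the cost sits in rehydrating state around each transition.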

The user experience was clunky because each step required a new HTTP request. No persistent connection, no real-time updates, no streaming. Just request-response cycles that felt broken compared to ChatGPT.

I remember showing this to our head of product. He tried the workflow once and said, "This feels like software from 2010." He wasn't wrong.

The Breaking Point

Our Lambda-based AI platform had fundamental problems:

  1. 29-second timeout killed complex workflows
  2. No streaming made chat feel broken
  3. Cold starts created 10+ second delays
  4. Cost inefficiency from paying for AI wait time
  5. State management complexity made agents painful
  6. Integration sprawl across 7 different functions

We were spending more time fighting infrastructure than building features. Our users complained about slow responses. Our AWS bill kept climbing.

Lambdas Are Perfect AI Tools, Terrible AI Agents

Here's what I learned: Lambdas are perfect for AI tools but terrible for AI agents.

Tools are single-purpose, stateless, and fast:

  • Classify this image
  • Summarize this document
  • Extract text from PDF
  • Generate alt text

Agents are multi-turn, stateful, and complex:

  • Help me organize photos
  • Analyze this data and create a report
  • Chat about my documents
  • Build a workflow based on conversation

For tools, Lambda is ideal. For agents, you need persistent connections, shared state, and streaming. Lambda fights you every step of the way.

What We Built Instead

So we built a gateway instead. A single API endpoint that could handle both tools and agents, with proper streaming, state management, and vendor flexibility.

The architecture is simple: API Gateway routes to Lambda for the gateway logic, but the gateway proxies to long-running containers for actual AI processing. Best of both worlds - serverless scaling for the API layer, persistent connections for AI workloads.

In the next article, I'll walk you through the gateway pattern and show you how we unified 7 different AI Lambdas into one clean API that works with any model provider.


This is part 2 of an 8-part series on building a production AI platform. You can find the complete code examples at https://github.com/tysoncung/ai-platform-aws-examples.
