Sabarish Sathasivan

Beyond the 29-Second Limit: 4 Patterns for Serverless GenAI on AWS

When teams start building GenAI-powered APIs on AWS, the initial architecture often looks straightforward:

[Diagram: Serverless GenAI API]

It works well for demos and early prototypes. But as soon as prompts grow larger, models get heavier, or agent-style workflows are introduced, many teams hit the same invisible wall: the 29-second API Gateway integration timeout.

Over the past year, AWS has introduced several ways to address this problem. This article walks through those options, based on what actually works when you’re trying to keep GenAI APIs stable, scalable, and usable.

Option 1: Increasing the API Gateway integration timeout

In mid-2024, AWS finally allowed REST API integration timeouts to be increased beyond the long-standing 29-second limit (see the announcement: "Amazon API Gateway integration timeout limit increase beyond 29 seconds").

This sounds like the obvious fix. It requires no code changes and keeps the synchronous request-response model intact.
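
For reference, here's one way to apply the change programmatically with the AWS SDK for JavaScript (the console or your IaC tool works just as well); the API ID, resource ID, and timeout value below are placeholders:

```typescript
import {
  APIGatewayClient,
  UpdateIntegrationCommand,
} from "@aws-sdk/client-api-gateway";

const client = new APIGatewayClient({});

// Raise an existing integration's timeout to 120 seconds.
// restApiId and resourceId are placeholders for your own API.
export async function raiseIntegrationTimeout(): Promise<void> {
  await client.send(
    new UpdateIntegrationCommand({
      restApiId: "abc123def",
      resourceId: "xyz789",
      httpMethod: "POST",
      patchOperations: [
        { op: "replace", path: "/timeoutInMillis", value: "120000" },
      ],
    })
  );
}
```

As with any REST API change, the new timeout only takes effect once the API is deployed to a stage.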

The Trade-offs:

  • The "Spinner" Problem: From a user experience perspective, you’re simply extending how long someone stares at a loading spinner.
  • Availability: Works only for Regional REST APIs and private REST APIs.
  • Throttling: Raising the timeout beyond 29 seconds can come with a reduction in your account-level throttle quota limit.

Option 2: The Asynchronous "Job" Pattern (Polling)

Sometimes, streaming isn't the right fit. If your GenAI application is generating images, creating PDFs, or running complex "Agentic" workflows that involve 2 minutes of silent "reasoning" before producing an answer, streaming text chunks provides no value.

In this pattern, API Gateway acts as a dispatcher rather than a waiter; a code sketch of the dispatch step follows the steps below.

  • Request: The client sends a POST request.

  • Dispatch: API Gateway triggers an asynchronous process (via SQS or AWS Step Functions) and immediately returns a 202 Accepted response with a jobId.

  • Processing: The backend (Lambda/Bedrock) processes the request offline, unaffected by API Gateway timeouts.

  • Retrieval: The client polls a status endpoint (GET /jobs/{jobId}) every few seconds to check if the work is complete.
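
A minimal sketch of the dispatch step, assuming an SQS queue for the work and a DynamoDB table for job status (the environment variable names and item schema are placeholders):

```typescript
import { randomUUID } from "node:crypto";
import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";
import { DynamoDBClient, PutItemCommand } from "@aws-sdk/client-dynamodb";
import type { APIGatewayProxyEvent, APIGatewayProxyResult } from "aws-lambda";

const sqs = new SQSClient({});
const ddb = new DynamoDBClient({});

export const handler = async (
  event: APIGatewayProxyEvent
): Promise<APIGatewayProxyResult> => {
  const jobId = randomUUID();

  // Record the job as PENDING so GET /jobs/{jobId} has something to return.
  await ddb.send(
    new PutItemCommand({
      TableName: process.env.JOBS_TABLE,
      Item: { jobId: { S: jobId }, status: { S: "PENDING" } },
    })
  );

  // Hand the prompt to the worker; the worker updates the row when it finishes.
  await sqs.send(
    new SendMessageCommand({
      QueueUrl: process.env.JOBS_QUEUE_URL,
      MessageBody: JSON.stringify({ jobId, prompt: event.body }),
    })
  );

  // Return immediately; the 29-second clock never comes into play.
  return { statusCode: 202, body: JSON.stringify({ jobId }) };
};
```

The `GET /jobs/{jobId}` handler simply reads the same table, while the SQS-triggered worker writes the result and flips the status to COMPLETE.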

The Trade-offs:

  • Client Complexity: The frontend client must implement polling logic (e.g., "check status every 3 seconds"); see the client sketch after these trade-offs.

  • Latency: There is inherently a small delay between the job actually finishing and the client's next poll interval catching it.

  • Cost: You pay for the initial request plus every subsequent polling request, which can add up if thousands of users are polling frequently.
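
On the client side, the polling loop itself is small, but it still has to be written and maintained; a rough sketch (the endpoint path, status values, and 3-second interval are assumptions):

```typescript
// Poll GET /jobs/{jobId} until the job leaves the PENDING state.
async function waitForJob(baseUrl: string, jobId: string): Promise<unknown> {
  while (true) {
    const res = await fetch(`${baseUrl}/jobs/${jobId}`);
    const job = await res.json();
    if (job.status === "COMPLETE") return job.result;
    if (job.status === "FAILED") throw new Error(job.error);
    await new Promise((resolve) => setTimeout(resolve, 3000));
  }
}
```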

Option 3: Use API Gateway WebSocket APIs

API Gateway WebSocket APIs remove the request-response timeout entirely by switching to a persistent, stateful connection. However, there are a few things to consider.

WebSockets work best when the system is designed to communicate progress (e.g., "Thinking...", "Searching knowledge base..."), not just deliver a final answer.
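
Progress updates are pushed back over the open connection using the API Gateway Management API. A minimal sketch, assuming the connectionId was captured on the $connect route and the management endpoint (your WebSocket API's HTTPS endpoint plus stage) is supplied via an environment variable:

```typescript
import {
  ApiGatewayManagementApiClient,
  PostToConnectionCommand,
} from "@aws-sdk/client-apigatewaymanagementapi";

const client = new ApiGatewayManagementApiClient({
  // e.g. https://{api-id}.execute-api.{region}.amazonaws.com/{stage}
  endpoint: process.env.WS_MANAGEMENT_ENDPOINT,
});

// Push a progress (or heartbeat) frame to a connected client.
async function sendProgress(connectionId: string, message: string): Promise<void> {
  await client.send(
    new PostToConnectionCommand({
      ConnectionId: connectionId,
      Data: Buffer.from(JSON.stringify({ type: "progress", message })),
    })
  );
}

// Inside a long-running agent workflow:
// await sendProgress(connectionId, "Searching knowledge base...");
```

Sending a frame like this periodically also doubles as a heartbeat that keeps the connection inside the idle-timeout window mentioned below.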

The Trade-offs:

  • Idle Timeout: There is still a 10-minute idle timeout on WebSocket connections. If no data is sent during that window, the connection is closed.
  • Complexity: They introduce additional complexity in connection management, retries, and client-side state handling.

Option 4: REST API response streaming

AWS added response streaming for REST APIs in late 2025 (see the announcement: "Amazon API Gateway now supports response streaming for REST APIs").

This allows Lambda to stream chunks of data back to the client as soon as they are available, rather than waiting for the entire response to be generated.

Why this is the preferred pattern for LLMs: Users see immediate feedback instead of a blank screen, even if the total execution time remains the same. This drastically improves the "Time to First Token" (TTFT) metric.
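
A minimal sketch of the Lambda side, using the Node.js runtime's streamifyResponse wrapper together with Bedrock's Converse streaming API (the model ID and event shape are assumptions, and exposing the function through API Gateway's REST response streaming is separate configuration):

```typescript
import {
  BedrockRuntimeClient,
  ConverseStreamCommand,
} from "@aws-sdk/client-bedrock-runtime";

// Global injected by the Node.js Lambda runtime when response streaming is enabled.
declare const awslambda: {
  streamifyResponse(
    handler: (event: any, responseStream: NodeJS.WritableStream) => Promise<void>
  ): unknown;
};

const bedrock = new BedrockRuntimeClient({});

export const handler = awslambda.streamifyResponse(async (event, responseStream) => {
  const { prompt = "Hello" } = JSON.parse(event.body ?? "{}");

  const response = await bedrock.send(
    new ConverseStreamCommand({
      modelId: "anthropic.claude-3-haiku-20240307-v1:0", // example model only
      messages: [{ role: "user", content: [{ text: prompt }] }],
    })
  );

  // Forward each token to the client as soon as Bedrock produces it.
  for await (const chunk of response.stream ?? []) {
    const text = chunk.contentBlockDelta?.delta?.text;
    if (text) responseStream.write(text);
  }
  responseStream.end();
});
```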

There’s a great walkthrough of this feature in Serverless Office Hours.

The Trade-offs:

  • Runtime Restrictions: Native streaming support is currently strongest in Node.js managed runtimes. If you use Python, Java, or .NET, you often need to implement a custom runtime or use the AWS Lambda Web Adapter to proxy the stream.

  • Lost API Gateway Features: Because API Gateway no longer buffers the response, it cannot modify it. You lose support for Endpoint Caching, Content Encoding (automatic GZIP compression), and VTL transformations on the response body.

  • Error Handling Complexity: Once your Lambda sends the first byte (usually a 200 OK header), you cannot change the status code. If the LLM hallucinates or crashes mid-stream, the API will still report a successful HTTP status, so your client must be smart enough to parse error messages inside the text stream (see the sketch after these trade-offs).

  • Bandwidth Throttling: For very large responses, roughly the first 6 MB bursts at full bandwidth, after which the remaining data is throttled (around 2 MB/s).
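
One way to cope with the mid-stream error problem described above is to agree on an in-band error marker between the Lambda and the client; a rough sketch (the __error field is a made-up convention, not a standard):

```typescript
// Server side: wrap the token loop so a mid-stream failure becomes an
// in-band error marker instead of a silently truncated "successful" response.
async function pipeWithErrorMarker(
  tokens: AsyncIterable<string>,
  responseStream: NodeJS.WritableStream
): Promise<void> {
  try {
    for await (const token of tokens) {
      responseStream.write(token);
    }
  } catch (err) {
    responseStream.write(JSON.stringify({ __error: (err as Error).message }));
  } finally {
    responseStream.end();
  }
}

// Client side: the HTTP status is already 200, so inspect the payload itself.
function isStreamError(chunk: string): boolean {
  return chunk.includes('"__error"');
}
```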

Summary: Which Pattern Should You Choose?

| Pattern | Complexity | Best Use Case | Key Constraint |
| --- | --- | --- | --- |
| Timeout Increase | Low | Internal tools / MVPs | Users stare at a loading spinner (high TTFT). |
| Asynchronous "Job" Pattern (Polling) | Medium | Image gen / silent agents | Polling cost & delayed completion. |
| WebSockets | High | Bi-directional agents | Requires managing connection state & heartbeats. |
| Response Streaming | Medium | Chatbots & text gen | Node.js preferred; no API caching or VTL. |
