DEV Community

Gunnar Grosch

Streaming Bedrock Responses Through API Gateway and Lambda

If you're building applications that call Amazon Bedrock through API Gateway and Lambda, your users are probably staring at a spinner. The model generates tokens progressively, but the standard Lambda integration buffers the entire response before sending anything back. For a typical LLM response, that's 8-10 seconds of nothing, then everything at once.

API Gateway response streaming fixes this. Tokens flow from Bedrock through Lambda and API Gateway to the client as they're generated. The first token arrives in ~500ms. The total generation time stays the same. The difference is entirely in when the user starts seeing output. Beyond latency, streaming also lifts two constraints that matter for larger workloads: the 10 MB response payload limit and the 29-second default integration timeout. Streaming responses can run for up to 15 minutes and exceed 10 MB.

I put together a demo that runs both approaches side by side so you can see the difference for yourself. Two Lambda functions, same model, same prompt, same API Gateway. One streams, one buffers. The streaming panel fills up token by token while the buffered panel sits there waiting.

| Metric | Streaming | Buffered |
|---|---|---|
| Time to first byte | ~500 ms | ~8-10 s |
| Total time | ~8-10 s | ~8-10 s |
| User experience | Progressive, real-time | Waiting, then all at once |

How It Works

Streaming requires changes in two places: the API Gateway configuration and the Lambda handler. Neither is complicated on its own. The part that trips people up is getting them to work together.

The API Gateway side

In the OpenAPI spec, the streaming endpoint needs two things that the standard endpoint doesn't:

  1. A different Lambda invocation URI path: /response-streaming-invocations instead of /invocations
  2. A responseTransferMode: STREAM property on the integration

Here's what that looks like:

/streaming:
  post:
    x-amazon-apigateway-integration:
      type: AWS_PROXY
      httpMethod: POST
      uri:
        Fn::Sub: "arn:aws:apigateway:${AWS::Region}:lambda:path/2021-11-15/functions/${StreamingFunction.Arn}/response-streaming-invocations"
      responseTransferMode: STREAM
      passthroughBehavior: when_no_match

Compare that to the standard endpoint:

/non-streaming:
  post:
    x-amazon-apigateway-integration:
      type: AWS_PROXY
      httpMethod: POST
      uri:
        Fn::Sub: "arn:aws:apigateway:${AWS::Region}:lambda:path/2015-03-31/functions/${NonStreamingFunction.Arn}/invocations"
      passthroughBehavior: when_no_match

The URI path version changes from 2015-03-31 to 2021-11-15, and response-streaming-invocations replaces invocations. Under the hood, this tells API Gateway to call Lambda's InvokeWithResponseStream API instead of the standard Invoke. Miss either of those details and your streaming endpoint silently falls back to buffered behavior. No error, just a longer wait.

The Lambda side

The streaming handler wraps its logic in awslambda.streamifyResponse(). Instead of returning a response object, the handler receives a writable response stream as its second argument; awslambda.HttpResponseStream.from() then attaches the status code and headers to it:

import { BedrockRuntimeClient, ConverseStreamCommand } from '@aws-sdk/client-bedrock-runtime';
import type { APIGatewayProxyEvent } from 'aws-lambda';

const client = new BedrockRuntimeClient({ region: process.env.AWS_REGION });

const streamingHandler = async (
  event: APIGatewayProxyEvent,
  responseStream: NodeJS.WritableStream
): Promise<void> => {
  const httpResponseStream = awslambda.HttpResponseStream.from(responseStream, {
    statusCode: 200,
    headers: {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache',
      'Connection': 'keep-alive',
      'Access-Control-Allow-Origin': '*',
    },
  });

  const body = JSON.parse(event.body || '{}');
  const command = new ConverseStreamCommand({
    modelId: process.env.BEDROCK_MODEL_ID,
    messages: [{ role: 'user', content: [{ text: body.prompt }] }],
    inferenceConfig: { maxTokens: 2048 },
  });

  const response = await client.send(command);

  for await (const chunk of response.stream!) {
    if (chunk.contentBlockDelta?.delta?.text) {
      const token = chunk.contentBlockDelta.delta.text;
      httpResponseStream.write(`data: ${JSON.stringify({ token })}\n\n`);
    }
  }

  httpResponseStream.write('data: [DONE]\n\n');
  httpResponseStream.end();
};

export const handler = awslambda.streamifyResponse(streamingHandler);

Each token gets written as a Server-Sent Event the moment Bedrock generates it. The data: [DONE] sentinel tells the client the stream is complete.
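On the client side you'd typically read these frames from a fetch() ReadableStream and parse them as they arrive. The parsing step can be isolated into a small pure function; this is a sketch (the `parseSSEChunk` helper is mine, not part of the demo), matching the `data: {"token": ...}` frames the handler above writes:

```typescript
// Parse a chunk of SSE text into tokens, stopping at the [DONE] sentinel.
// A real client would call this with decoded chunks from a fetch()
// ReadableStream; here it works on any string of complete frames.
function parseSSEChunk(chunk: string): { tokens: string[]; done: boolean } {
  const tokens: string[] = [];
  let done = false;
  for (const line of chunk.split('\n')) {
    if (!line.startsWith('data: ')) continue;
    const payload = line.slice('data: '.length);
    if (payload === '[DONE]') {
      done = true;
      break;
    }
    tokens.push(JSON.parse(payload).token as string);
  }
  return { tokens, done };
}
```

In a browser you'd pipe `response.body` through a `TextDecoder`, feed each chunk to a parser like this, and append the tokens to the UI as they come in.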

The HttpResponseStream.from() call is doing something important behind the scenes: it writes the response metadata (status code, headers) as a JSON object followed by an 8-null-byte delimiter that API Gateway uses to separate metadata from the response body. If you're not using HttpResponseStream.from(), you're responsible for writing that delimiter yourself.
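To make the framing concrete, here's a rough sketch of what that wire format looks like, based on the description above. This is illustrative only (the `frameStreamingResponse` helper is mine, not the runtime's internal code):

```typescript
// Sketch of the framing HttpResponseStream.from() produces: a JSON metadata
// prelude, an 8-null-byte delimiter, then the raw response body.
function frameStreamingResponse(
  metadata: { statusCode: number; headers: Record<string, string> },
  body: string
): Buffer {
  const prelude = Buffer.from(JSON.stringify(metadata), 'utf8');
  const delimiter = Buffer.alloc(8); // eight 0x00 bytes separate metadata from body
  return Buffer.concat([prelude, delimiter, Buffer.from(body, 'utf8')]);
}
```

API Gateway scans for that null-byte run, parses everything before it as response metadata, and streams everything after it to the client.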

The buffered handler does the same Bedrock call but accumulates everything in memory:

const response = await client.send(command);
let fullResponse = '';

for await (const chunk of response.stream!) {
  if (chunk.contentBlockDelta?.delta?.text) {
    fullResponse += chunk.contentBlockDelta.delta.text;
  }
}

return {
  statusCode: 200,
  headers: corsHeaders,
  body: JSON.stringify({ response: fullResponse }),
};

Same model, same prompt, same token generation speed. The only difference is when the client sees the output.

The awslambda global

One thing worth noting: awslambda is a global object injected by the Lambda runtime. It's not in any npm package, and @types/aws-lambda doesn't include it either. In TypeScript, you need a type declaration for it:

declare const awslambda: {
  HttpResponseStream: {
    from: (
      responseStream: NodeJS.WritableStream,
      metadata: { statusCode: number; headers: Record<string, string> }
    ) => NodeJS.WritableStream;
  };
  streamifyResponse: (
    handler: (event: APIGatewayProxyEvent, responseStream: NodeJS.WritableStream) => Promise<void>
  ) => (event: APIGatewayProxyEvent) => Promise<void>;
};

This is the kind of detail that's easy to miss. Without this declaration, TypeScript will reject your handler at compile time.

Deploy and Try It

You'll need:

Clone and deploy:

git clone https://github.com/gunnargrosch/apigw-lambda-streaming.git
cd apigw-lambda-streaming
cd functions && npm install && cd ..
sam build
sam deploy --guided

SAM outputs the API Gateway base URL when deployment completes. Copy it.

The interactive demo

Open demo.html in a browser and paste your API Gateway URL. Click Run Comparison (or press Cmd+Enter on macOS, Ctrl+Enter on Windows/Linux). Both endpoints fire simultaneously. The streaming panel fills up token by token while the buffered panel shows a spinner.

The results panel at the bottom shows Time to First Byte for both approaches and the speedup factor. For longer responses (a few paragraphs or more), streaming TTFB is typically 10-20x faster than buffered. Shorter responses show a smaller gap since the buffered endpoint finishes sooner.

Testing with curl

# Streaming: tokens appear progressively as SSE events
curl -N -X POST https://<api-id>.execute-api.<region>.amazonaws.com/demo/streaming \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Write a short story about serverless computing"}'

# Buffered: waits for the complete response
curl -X POST https://<api-id>.execute-api.<region>.amazonaws.com/demo/non-streaming \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Write a short story about serverless computing"}'

The -N flag on the streaming curl disables output buffering so you see tokens as they arrive.

The Infrastructure

The SAM template defines two Lambda functions sharing an execution role with bedrock:InvokeModelWithResponseStream and bedrock:InvokeModel permissions. Both functions use Node.js 20.x, 256 MB memory, and a 120-second timeout. The Bedrock model ID is configurable via a SAM parameter:

Parameters:
  BedrockModelId:
    Type: String
    Default: us.anthropic.claude-sonnet-4-5-20250929-v1:0
    Description: Bedrock model ID

Override it during deployment to use a different model:

sam deploy --parameter-overrides BedrockModelId=us.anthropic.claude-haiku-4-5-20251001-v1:0

API Gateway is configured with an inline OpenAPI spec via AWS::Include, which keeps the streaming-specific integration properties in a separate openapi.yaml file rather than buried in the SAM template.
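In a SAM template that wiring might look something like this (a sketch; the resource name and file path are assumptions, not taken from the demo repo):

```yaml
# Sketch: pulling the OpenAPI spec into the API definition with AWS::Include,
# so the streaming integration properties live in openapi.yaml.
DemoApi:
  Type: AWS::Serverless::Api
  Properties:
    StageName: demo
    DefinitionBody:
      Fn::Transform:
        Name: AWS::Include
        Parameters:
          Location: ./openapi.yaml
```

Keeping the spec in its own file also makes the streaming vs. non-streaming integration difference easy to diff side by side.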

Things to Know

A few operational details worth being aware of before you ship this to production:

  • Idle timeouts: Regional and Private API endpoints have a 5-minute idle timeout on streaming responses. Edge-optimized endpoints have a 30-second idle timeout. For LLM token streams this is rarely an issue since tokens arrive continuously, but if you're calling a slower model or one that pauses during longer reasoning chains, keep this in mind.
  • Bandwidth throttling: The first 10 MB of a streaming response has no bandwidth restrictions. After that, data is throttled to 2 MB/s. Not an issue for LLM token streams, but worth knowing if you're streaming larger payloads.
  • Pricing: Each 10 MB of streamed response data (rounded up to the nearest 10 MB) is billed as a single API request. For typical LLM responses, this means one request per call, same as buffered.
  • Not supported with streaming: VTL response transformation, integration response caching, and content encoding. If you rely on any of these, you'll need to handle them differently.
  • Observability: API Gateway adds three new access log variables for streaming: $context.integration.responseTransferMode (BUFFERED or STREAMED), $context.integration.timeToAllHeaders, and $context.integration.timeToFirstContent. Useful for monitoring TTFB at the API Gateway level.
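Those variables slot into a JSON access log format like any other $context field. A sketch, assuming a SAM AccessLogSetting and a hypothetical AccessLogGroup resource:

```yaml
# Sketch (log group resource assumed): access log format surfacing the
# streaming-specific variables alongside the request ID.
AccessLogSetting:
  DestinationArn: !GetAtt AccessLogGroup.Arn
  Format: >-
    {"requestId":"$context.requestId",
    "transferMode":"$context.integration.responseTransferMode",
    "timeToAllHeaders":"$context.integration.timeToAllHeaders",
    "timeToFirstContent":"$context.integration.timeToFirstContent"}
```

Graphing timeToFirstContent per route gives you the same TTFB comparison the demo shows, but from production traffic.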

When to Use This

Response streaming makes the biggest difference for LLM applications where users are waiting for generated text: chatbots, content generation, code assistants, summarization tools. The total time doesn't change, but the perceived latency drops significantly.

This demo focuses on Bedrock, but response streaming works with any Lambda or HTTP proxy integration. A few other use cases where it helps:

  • Large file downloads: Streaming lets responses exceed the standard 10 MB payload limit, so you can serve large datasets, reports, or media files directly through API Gateway without routing through S3 pre-signed URLs.
  • Long-running operations with progress updates: An endpoint that runs a multi-step workflow can stream progress events back to the client as each step completes, instead of forcing the client to poll a separate status endpoint.
  • Web and mobile TTFB optimization: Any API response that takes more than a second or two to fully compute benefits from streaming partial results early. Server-side rendering, search results, or aggregation queries can send the first chunk while the backend continues processing.

A few situations where streaming matters less:

  • Batch processing: No user is watching. Buffer the response and process it when it's complete.
  • Short responses: If the backend returns in under a second, streaming adds complexity without a noticeable UX improvement.
  • Structured output you need to parse as a whole: If your application needs the complete JSON response before it can do anything useful, streaming partial data doesn't help.

Clean Up

sam delete

Additional Resources

Have you added response streaming to your Bedrock applications? I'd like to hear about your experience in the comments.
