After our Lambda approach fell apart, I needed a new architecture. Something that could handle any AI provider through one clean API. Something that could stream properly. Something that wouldn't fight us at every turn.
The solution was the gateway pattern.
One API endpoint that normalizes requests across all providers. Send the same JSON payload whether you're using OpenAI, Claude, or Bedrock. The gateway handles provider-specific formatting, retries, fallbacks, and streaming.
## Why Not Just Use LiteLLM?
Before building our own, I seriously considered LiteLLM. It's a clever proxy that does exactly this - normalize API calls across providers.
But we had specific needs:
- TypeScript-first: Our frontend team lives in TypeScript. I needed strong typing across the entire stack.
- AWS CDK deployment: Everything deploys through CDK. I needed infrastructure as code.
- Cost tracking: Built-in tracking per request, per model, per application, tied into our own reporting pipeline.
- Custom auth: Integration with our existing auth system and user management.
- Streaming through API Gateway: LiteLLM runs as a separate service. I needed streaming that worked with our existing infrastructure.
LiteLLM is great, but it's designed as a general proxy. We needed something purpose-built for our AWS-native architecture.
## The Gateway Architecture
Here's the high-level architecture:
```
Client Request
      |
API Gateway (with CORS)
      |
Lambda Gateway (routing + auth)
      |
Provider Adapter (OpenAI | Anthropic | Bedrock)
      |
AI Service
      |
Streaming Response (SSE)
```
The key insight: Lambda is perfect for the gateway logic. It's a short-lived proxy that routes requests and formats responses. The actual AI processing happens in the providers' infrastructure, not in Lambda.
## Provider Adapter Pattern
Every AI provider has different request/response formats. The adapter pattern lets us normalize them behind a common interface:
```typescript
// Base interface all providers must implement
interface AIProvider {
  name: string;
  supportsStreaming: boolean;

  createChatCompletion(
    request: NormalizedChatRequest
  ): Promise<NormalizedChatResponse>;

  createStreamingCompletion(
    request: NormalizedChatRequest
  ): AsyncGenerator<NormalizedStreamChunk>;
}

// Normalized request format (what clients send)
interface NormalizedChatRequest {
  model: string;
  messages: Array<{
    role: 'user' | 'assistant' | 'system';
    content: string;
  }>;
  maxTokens?: number;
  temperature?: number;
  stream?: boolean;
}

// Normalized response format (what clients receive)
interface NormalizedChatResponse {
  id: string;
  provider: string;
  model: string;
  content: string;
  usage: {
    promptTokens: number;
    completionTokens: number;
    totalTokens: number;
  };
  cost: number;
}
```
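The adapters also yield a `NormalizedStreamChunk`, which isn't defined above. Its shape can be inferred from how the adapters use it; a minimal version, plus a small helper that frames a chunk as an SSE event (the helper is our addition for illustration), might look like:

```typescript
// Inferred shape of NormalizedStreamChunk, based on how the adapters use it
interface NormalizedStreamChunk {
  id: string;
  content: string;
  finished: boolean;
}

// Frame a chunk as a server-sent event: a `data:` line followed by a blank line
function toSSE(chunk: NormalizedStreamChunk): string {
  return `data: ${JSON.stringify(chunk)}\n\n`;
}
```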
Now each provider implements this interface:
```typescript
import OpenAI from 'openai';

// OpenAI adapter
export class OpenAIProvider implements AIProvider {
  name = 'openai';
  supportsStreaming = true;

  constructor(private apiKey: string) {}

  async createChatCompletion(request: NormalizedChatRequest): Promise<NormalizedChatResponse> {
    const openai = new OpenAI({ apiKey: this.apiKey });
    const response = await openai.chat.completions.create({
      model: request.model,
      messages: request.messages,
      max_tokens: request.maxTokens,
      temperature: request.temperature,
    });

    return {
      id: response.id,
      provider: 'openai',
      model: response.model,
      content: response.choices[0].message.content || '',
      usage: {
        promptTokens: response.usage?.prompt_tokens || 0,
        completionTokens: response.usage?.completion_tokens || 0,
        totalTokens: response.usage?.total_tokens || 0,
      },
      cost: this.calculateCost(request.model, response.usage),
    };
  }

  async *createStreamingCompletion(request: NormalizedChatRequest): AsyncGenerator<NormalizedStreamChunk> {
    const openai = new OpenAI({ apiKey: this.apiKey });
    const stream = await openai.chat.completions.create({
      model: request.model,
      messages: request.messages,
      max_tokens: request.maxTokens,
      temperature: request.temperature,
      stream: true,
    });

    for await (const chunk of stream) {
      const content = chunk.choices[0]?.delta?.content || '';
      if (content) {
        yield {
          id: chunk.id,
          content,
          // finish_reason is null until the final chunk; != null also
          // covers the undefined case from optional chaining
          finished: chunk.choices[0]?.finish_reason != null,
        };
      }
    }
  }

  private calculateCost(model: string, usage?: { prompt_tokens: number; completion_tokens: number }): number {
    if (!usage) return 0;
    // USD per 1K tokens
    const rates: Record<string, { prompt: number; completion: number }> = {
      'gpt-4': { prompt: 0.03, completion: 0.06 },
      'gpt-3.5-turbo': { prompt: 0.001, completion: 0.002 },
    };
    const rate = rates[model] || rates['gpt-3.5-turbo'];
    return (usage.prompt_tokens * rate.prompt + usage.completion_tokens * rate.completion) / 1000;
  }
}
```
The Bedrock adapter looks similar but handles AWS SDK authentication and different model naming conventions:
```typescript
import { BedrockRuntimeClient, InvokeModelCommand } from '@aws-sdk/client-bedrock-runtime';

// Bedrock adapter
export class BedrockProvider implements AIProvider {
  name = 'bedrock';
  supportsStreaming = true;

  constructor(private region: string = 'us-east-1') {}

  async createChatCompletion(request: NormalizedChatRequest): Promise<NormalizedChatResponse> {
    const client = new BedrockRuntimeClient({ region: this.region });

    // Bedrock has different request formats per model family
    const modelId = this.mapModelName(request.model);
    const body = this.formatBedrockRequest(modelId, request);

    const command = new InvokeModelCommand({
      modelId,
      body: JSON.stringify(body),
    });

    const response = await client.send(command);
    const result = JSON.parse(new TextDecoder().decode(response.body));

    return this.formatBedrockResponse(result, modelId);
  }

  private mapModelName(model: string): string {
    const modelMap: Record<string, string> = {
      'claude-3-sonnet': 'anthropic.claude-3-sonnet-20240229-v1:0',
      'claude-3-haiku': 'anthropic.claude-3-haiku-20240307-v1:0',
      'claude-3.5-sonnet': 'anthropic.claude-3-5-sonnet-20240620-v1:0',
    };
    return modelMap[model] || model;
  }

  private formatBedrockRequest(modelId: string, request: NormalizedChatRequest): any {
    if (modelId.includes('anthropic')) {
      // Anthropic models on Bedrock expect system prompts in a top-level
      // `system` field, not in the messages array
      const system = request.messages
        .filter((m) => m.role === 'system')
        .map((m) => m.content)
        .join('\n');
      return {
        anthropic_version: 'bedrock-2023-05-31',
        max_tokens: request.maxTokens || 4096,
        // ?? instead of ||, so an explicit temperature of 0 is respected
        temperature: request.temperature ?? 0.7,
        ...(system ? { system } : {}),
        messages: request.messages.filter((m) => m.role !== 'system'),
      };
    }
    // Handle other model families...
    throw new Error(`Unsupported model: ${modelId}`);
  }
}
```
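The `formatBedrockResponse` helper is elided above. A sketch for the Anthropic family, assuming the Bedrock Claude Messages response shape (`content` blocks plus `usage.input_tokens`/`output_tokens`) — the type names here are ours, and the cost is left as a placeholder:

```typescript
interface BedrockUsage { input_tokens: number; output_tokens: number; }
interface BedrockClaudeResult {
  id: string;
  content: Array<{ type: string; text?: string }>;
  usage: BedrockUsage;
}

// Map a Bedrock Claude response into the gateway's normalized shape
function formatBedrockResponse(result: BedrockClaudeResult, modelId: string) {
  const text = result.content
    .filter((block) => block.type === 'text')
    .map((block) => block.text || '')
    .join('');
  const { input_tokens, output_tokens } = result.usage;
  return {
    id: result.id,
    provider: 'bedrock',
    model: modelId,
    content: text,
    usage: {
      promptTokens: input_tokens,
      completionTokens: output_tokens,
      totalTokens: input_tokens + output_tokens,
    },
    cost: 0, // fill in from a per-model rate table, as in the OpenAI adapter
  };
}
```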
## The Gateway Lambda
The main Lambda function routes requests to the appropriate provider:
```typescript
import { APIGatewayProxyHandler } from 'aws-lambda';
import type { AIProvider } from './providers/types';
import { OpenAIProvider } from './providers/openai';
import { BedrockProvider } from './providers/bedrock';
import { AnthropicProvider } from './providers/anthropic';

const providers = {
  openai: new OpenAIProvider(process.env.OPENAI_API_KEY!),
  bedrock: new BedrockProvider(process.env.AWS_REGION!),
  anthropic: new AnthropicProvider(process.env.ANTHROPIC_API_KEY!),
};

export const handler: APIGatewayProxyHandler = async (event) => {
  try {
    const request = JSON.parse(event.body || '{}');
    const provider = getProvider(request.model);

    if (request.stream) {
      return handleStreamingRequest(request, provider);
    }
    return handleNormalRequest(request, provider);
  } catch (error) {
    return {
      statusCode: 500,
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        error: error instanceof Error ? error.message : String(error),
      }),
    };
  }
};

function getProvider(model: string): AIProvider {
  // Model name routing logic
  if (model.startsWith('gpt-') || model.startsWith('text-')) {
    return providers.openai;
  } else if (model.startsWith('claude-')) {
    // Anthropic first; the fallback chain covers Bedrock
    return providers.anthropic;
  } else if (model.includes('bedrock') || model.includes('titan')) {
    return providers.bedrock;
  }
  // Default to OpenAI
  return providers.openai;
}

async function handleNormalRequest(request: any, provider: AIProvider) {
  const response = await provider.createChatCompletion(request);
  return {
    statusCode: 200,
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(response),
  };
}
```
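As providers multiply, the if/else chain in `getProvider` can be collapsed into an ordered rule table carrying the same logic — first match wins. A sketch (the function and type names here are ours, not from the repo):

```typescript
type ProviderKey = 'openai' | 'anthropic' | 'bedrock';

// Ordered rule table mirroring the if/else chain; first match wins
const routes: Array<{ test: (m: string) => boolean; key: ProviderKey }> = [
  { test: (m) => m.startsWith('gpt-') || m.startsWith('text-'), key: 'openai' },
  { test: (m) => m.startsWith('claude-'), key: 'anthropic' },
  { test: (m) => m.includes('bedrock') || m.includes('titan'), key: 'bedrock' },
];

function routeModel(model: string): ProviderKey {
  // Default to OpenAI, as in the if/else version
  return routes.find((r) => r.test(model))?.key ?? 'openai';
}
```

Adding a provider then means appending one row instead of extending a conditional.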
## Fallback Chain Implementation
One of the most powerful features is automatic fallback. If one provider fails, we try the next:
```typescript
async function handleRequestWithFallback(request: NormalizedChatRequest): Promise<NormalizedChatResponse> {
  const fallbackChain = [
    { provider: providers.anthropic, models: ['claude-3.5-sonnet', 'claude-3-sonnet'] },
    { provider: providers.openai, models: ['gpt-4', 'gpt-3.5-turbo'] },
    { provider: providers.bedrock, models: ['claude-3-haiku'] },
  ];

  for (const { provider, models } of fallbackChain) {
    for (const model of models) {
      try {
        console.log(`Trying ${provider.name} with model ${model}`);
        const fallbackRequest = { ...request, model };
        const response = await provider.createChatCompletion(fallbackRequest);
        console.log(`Success with ${provider.name}/${model}`);
        return response;
      } catch (error) {
        console.log(`Failed with ${provider.name}/${model}: ${error instanceof Error ? error.message : error}`);
        // Continue to next option
      }
    }
  }

  throw new Error('All providers failed');
}
```
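Under the hood, the nested loop is just walking an ordered attempt list. Pulling that flattening out into a pure function makes the order easy to unit-test without calling any provider — a sketch, with provider entries reduced to names for illustration:

```typescript
interface ChainEntry { provider: { name: string }; models: string[]; }

// Flatten a fallback chain into the ordered (provider, model) attempts
// the nested loop would make
function attemptOrder(chain: ChainEntry[]): Array<{ provider: string; model: string }> {
  return chain.flatMap(({ provider, models }) =>
    models.map((model) => ({ provider: provider.name, model }))
  );
}
```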
This saved us multiple times when OpenAI had outages or rate limits. Requests automatically failed over to Claude or Bedrock without any client changes.
## Streaming with Server-Sent Events
The streaming implementation was tricky but crucial. A classic API Gateway proxy integration buffers Lambda responses, so true server-sent events (SSE) need Lambda response streaming through a Function URL — and either way, the events have to be formatted correctly:
```typescript
async function handleStreamingRequest(request: any, provider: AIProvider): Promise<any> {
  if (!provider.supportsStreaming) {
    // Fall back to non-streaming
    return handleNormalRequest(request, provider);
  }

  const generator = provider.createStreamingCompletion(request);
  let fullContent = '';

  return {
    statusCode: 200,
    headers: {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache',
      'Connection': 'keep-alive',
    },
    // Note: an async-iterable body only works with Lambda response
    // streaming; a buffered proxy integration expects a plain string
    body: (async function* () {
      try {
        for await (const chunk of generator) {
          fullContent += chunk.content;
          // SSE format: a `data:` line followed by a blank line
          yield `data: ${JSON.stringify(chunk)}\n\n`;
          if (chunk.finished) {
            break;
          }
        }
        // Final usage statistics
        yield `data: ${JSON.stringify({
          type: 'complete',
          usage: { totalTokens: estimateTokens(fullContent) },
        })}\n\n`;
      } catch (error) {
        yield `data: ${JSON.stringify({
          type: 'error',
          error: error instanceof Error ? error.message : String(error),
        })}\n\n`;
      }
    })(),
    isBase64Encoded: false,
  };
}
```
## Configuration-Driven Provider Selection
The best part: you can switch providers with just configuration. No code changes:
```typescript
// Configuration in DynamoDB or environment variables
const config = {
  defaultProvider: 'anthropic',
  modelMapping: {
    chat: 'claude-3.5-sonnet',
    summary: 'gpt-3.5-turbo',
    analysis: 'claude-3-sonnet',
  },
  fallbackChains: {
    'claude-3.5-sonnet': ['gpt-4', 'claude-3-sonnet'],
    'gpt-4': ['claude-3.5-sonnet', 'gpt-3.5-turbo'],
  },
  costLimits: {
    daily: 100,    // $100 per day
    monthly: 2000, // $2000 per month
  },
};
```
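Resolving a request against this config is just a pure lookup. A hedged sketch of how a task name could resolve to a primary model plus its fallback list (`resolveModels` and `GatewayConfig` are our names for illustration, not from the repo):

```typescript
interface GatewayConfig {
  modelMapping: Record<string, string>;
  fallbackChains: Record<string, string[]>;
}

// Resolve a task name to [primary model, ...fallbacks] using the config above
function resolveModels(config: GatewayConfig, task: string): string[] {
  const primary = config.modelMapping[task];
  if (!primary) throw new Error(`No model mapped for task: ${task}`);
  return [primary, ...(config.fallbackChains[primary] ?? [])];
}
```

Because the lookup is data-driven, repointing "chat" at a different model is a config write, not a deploy.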
When OpenAI raised prices, we updated the config to prefer Claude. When Bedrock added new models, we added them to the fallback chain. Zero downtime, zero code changes.
## Cost Tracking Built-In
Every request gets tracked automatically:
```typescript
import { DynamoDBClient, PutItemCommand } from '@aws-sdk/client-dynamodb';
import { marshall } from '@aws-sdk/util-dynamodb';

const dynamoClient = new DynamoDBClient({});

interface CostRecord {
  requestId: string;
  timestamp: number;
  provider: string;
  model: string;
  promptTokens: number;
  completionTokens: number;
  cost: number;
  userId?: string;
  application?: string;
}

async function logCost(record: CostRecord) {
  await dynamoClient.send(new PutItemCommand({
    TableName: 'ai-costs',
    // removeUndefinedValues: marshall throws on the optional fields otherwise
    Item: marshall(record, { removeUndefinedValues: true }),
  }));
}
```
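On the reporting side, the same records roll up by whatever dimension you need. A small in-memory sketch of a per-provider/model rollup — in production this would be a DynamoDB query or an Athena job, and `rollupCosts` is our illustrative name:

```typescript
interface CostRow { provider: string; model: string; cost: number; }

// Sum spend per provider/model pair from a batch of cost records
function rollupCosts(rows: CostRow[]): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const row of rows) {
    const key = `${row.provider}/${row.model}`;
    totals[key] = (totals[key] ?? 0) + row.cost;
  }
  return totals;
}
```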
This gives us real-time visibility into AI spending. We can track costs per user, per feature, per model. When Claude released Haiku (their cheaper model), we could immediately see the cost savings.
## Real-World Usage
Here's how clients use the gateway:
```typescript
// Same API call works with any provider
const response = await fetch('/ai/chat', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'claude-3.5-sonnet', // or 'gpt-4', 'gpt-3.5-turbo', etc.
    messages: [
      { role: 'user', content: 'Summarize this document...' }
    ],
    maxTokens: 150,
    stream: true // or false
  })
});

// Streaming response
if (response.body) {
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    // Buffer partial events: a network chunk can end mid-line
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop() || '';

    for (const line of lines) {
      if (line.startsWith('data: ')) {
        const data = JSON.parse(line.slice(6));
        if (data.content) {
          console.log(data.content); // Stream content to UI
        }
      }
    }
  }
}
```
## What This Solved
The gateway pattern eliminated our major pain points:
- Vendor flexibility: Switch providers with config changes
- Unified API: One integration instead of 7 different patterns
- Automatic fallbacks: Reliability through redundancy
- Streaming support: Real-time responses through SSE
- Cost transparency: Track spending per request
- TypeScript-first: Strong typing across the stack
Most importantly, it let us focus on building features instead of fighting integration complexity.
The real test came during OpenAI's major outage in December. Our gateway automatically failed over to Claude for all requests. Users didn't even notice. That's when I knew we'd built something valuable.
## Example Implementation
You can see the complete gateway implementation in our examples repo at https://github.com/tysoncung/ai-platform-aws-examples/tree/main/01-multi-provider-gateway. It includes:
- Full provider adapters for OpenAI, Anthropic, and Bedrock
- CDK deployment code
- TypeScript SDK for clients
- Cost tracking and monitoring
- Streaming and non-streaming examples
In the next article, we'll dive into RAG (Retrieval Augmented Generation) and show you how to build a document search pipeline that actually works in production. Most RAG tutorials use toy examples that break on real documents - we'll show you how to handle the edge cases.
This is part 3 of an 8-part series on building a production AI platform. Find the complete code examples at https://github.com/tysoncung/ai-platform-aws-examples.