A step-by-step guide with the AI SDK, AWS, and Terraform
Introduction
In this post I revisit the implementation of an AI Agent, this time adding the ability to return responses tied to a specific context.
TL;DR
All code used in this article is available in the repository.
Context
As usual, I rely on serverless services to launch the experiment quickly. With the new capabilities AWS is rolling out, the agent will stream responses in real time as they are generated.
Below is a high level diagram representing the target architecture.
Scope
For this project I will implement a RAG system using serverless services.
We can outline the following requirements for a low volume of 1000 requests per month.
Functional requirements
- The system should allow sending messages and receiving a response in real time as it is generated.
- The system should restrict access only through the predefined domain.
- The system should respond only when the question is related to the previously provided content.
Non-functional requirements
As usual we can rely on serverless capabilities and add a few additional characteristics.
- The system should be highly available.
- The system should scale automatically to handle variable traffic patterns.
- The system should remain secure, with domain-level access control and encrypted traffic.
Out of Scope
This article does not cover data ingestion for the context that the system will use when answering.
However, notebooks are available for testing.
It is also worth mentioning that just because this is possible does not mean it is the ideal way to expose an agent.
With that out of the way, let’s move on.
Cost breakdown
We are running a serverless RAG application with the components listed below. Considering a volume of 1,000 requests per month, we can estimate the following costs.
Bedrock API Calls
We will use the most cost effective models, Nova Micro and Titan Embed v2, for language inference and embeddings.
- Nova Micro input (500K tokens × 0.00035 USD) plus output (500K tokens × 0.0014 USD)
- Titan Embed v2 (500K tokens × 0.0001 USD) plus vector search operations
This gives us an approximate total of 1.05 USD per month.
AWS Lambda
We will configure a function with 512 MB and a 60 second timeout. This results in roughly 0.20 USD per month.
CloudFront
We will use PriceClass_100 (US and Europe) assuming minimal data transfer and HTTPS requests.
This results in about 0.085 USD per month.
Container Registry
We need it to store the Lambda function image, estimating 1 GB for the Docker image and keeping all traffic within the same region.
This results in about 0.10 USD per month.
Route53
We will use a hosted zone, which is billed regardless of traffic.
This results in about 0.50 USD per month.
CloudWatch
We use it for logging with seven-day retention.
This results in about 0.01 USD per month.
This brings us to a total of approximately 1.95 USD per month, which aligns with the low-volume requirements we defined earlier.
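To double-check the arithmetic, here is a tiny sketch that rolls the line items up; the figures are the estimates above, not live prices.
// Monthly cost roll-up for ~1,000 requests; figures are the rough estimates above.
const monthlyCosts = {
  bedrock: 1.05,     // Nova Micro + Titan Embed v2
  lambda: 0.20,      // 512 MB, 60-second timeout
  cloudfront: 0.085, // PriceClass_100, minimal transfer
  ecr: 0.10,         // ~1 GB image storage
  route53: 0.50,     // hosted zone flat fee
  cloudwatch: 0.01,  // 7-day log retention
};
const total = Object.values(monthlyCosts).reduce((sum, cost) => sum + cost, 0);
console.log(`≈ ${total.toFixed(2)} USD per month`);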
Let’s continue.
Implementation
S3 Vectors
Terraform does not currently support S3 Vectors resources, so everything must be created using the AWS CLI.
First, create the bucket.
aws s3vectors create-vector-bucket \
  --vector-bucket-name "$VECTOR_BUCKET_NAME" \
  --region "$REGION" \
  --profile "$PROFILE"
Next, we will create the index inside the bucket. Note that the dimension of 1024 matches the default output size of the Titan Embed v2 model we will use for embeddings.
aws s3vectors create-index \
  --vector-bucket-name "$VECTOR_BUCKET_NAME" \
  --index-name "$INDEX_NAME" \
  --data-type float32 \
  --dimension 1024 \
  --distance-metric cosine \
  --metadata-configuration nonFilterableMetadataKeys=id,chunk \
  --region "$REGION" \
  --profile "$PROFILE"
To continue, we will need the index ARN.
aws s3vectors list-indexes \
  --vector-bucket-name "$VECTOR_BUCKET_NAME" \
  --region "$REGION" \
  --profile "$PROFILE" \
  --query "Indexes[?IndexName=='$INDEX_NAME'].IndexArn" \
  --output text 2>/dev/null || echo "Unable to retrieve ARN"
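If you prefer to resolve the ARN from code rather than the CLI, the JavaScript SDK exposes the same operation. Below is a minimal sketch; it assumes the @aws-sdk/client-s3vectors package and that the response fields mirror the CLI output shown above.
// Sketch: look up the index ARN with the SDK (assumes @aws-sdk/client-s3vectors).
import { S3VectorsClient, ListIndexesCommand } from '@aws-sdk/client-s3vectors';

const client = new S3VectorsClient({ region: process.env.AWS_REGION });

export const getIndexArn = async (bucketName: string, indexName: string) => {
  const { indexes } = await client.send(
    new ListIndexesCommand({ vectorBucketName: bucketName })
  );
  const match = indexes?.find((index) => index.indexName === indexName);
  if (!match?.indexArn) {
    throw new Error(`Index ${indexName} not found in bucket ${bucketName}`);
  }
  return match.indexArn;
};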
With these resources in place we can move forward using Terraform.
Unlike my previous article, this version will stream responses as they are generated.
Lambda creation
To support streamed responses we need to wrap the handler with awslambda.streamifyResponse and pipe the output of streamText into the response stream, with @types/aws-lambda providing the event typings.
The following is the final result.
import { pipeline } from 'node:stream/promises';
import { config } from "./bedrock/config";
import { streamText } from 'ai';
import { bedrock } from './bedrock/model';
import { findRelevantContent } from './bedrock/query-vector';
import type { APIGatewayProxyEventV2 } from 'aws-lambda';

exports.handler = awslambda.streamifyResponse<APIGatewayProxyEventV2>(
  async (event, responseStream, _context) => {
    try {
      responseStream.setContentType('text/plain; charset=utf-8');

      let prompt = '';
      if (event.body) {
        try {
          const body = JSON.parse(event.body);
          if (body.prompt) prompt = body.prompt;
        } catch (e) {
          console.warn('Failed to parse request body:', e);
        }
      }

      if (!prompt) {
        responseStream.write('Error: No prompt provided');
        responseStream.end();
        return;
      }

      const { modelId } = config({});
      const similarDocuments = await findRelevantContent(prompt);
      const context = similarDocuments.map(doc => doc.chunk).join('\n\n');

      const systemPrompt = `You are a helpful assistant. Answer the user's question using ONLY the context provided below.
If the answer is not in the context, say "I don't know" or "The provided context does not contain the answer."
Do not hallucinate or use outside knowledge.

Context:
${context}`;

      const result = await streamText({
        model: bedrock(modelId),
        system: systemPrompt,
        messages: [
          { role: "user", content: prompt },
        ],
      });

      await pipeline(result.textStream, responseStream);
      return;
    } catch (error) {
      console.error("Error in handler:", error);
      responseStream.write('Error');
      responseStream.end();
    }
  }
);
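Two of the imports above, ./bedrock/model and ./bedrock/config, are not shown in this post. The sketch below is my assumption of their shape based on how they are used; the @ai-sdk/amazon-bedrock provider package and the fallback model ID are guesses, so check the repository for the real code.
// ./bedrock/model.ts (assumed shape): the AI SDK Bedrock provider used by
// streamText() here and by textEmbeddingModel() in the next section.
import { createAmazonBedrock } from '@ai-sdk/amazon-bedrock';

export const bedrock = createAmazonBedrock({
  region: process.env.AWS_REGION ?? 'us-east-1',
});

// ./bedrock/config.ts (assumed shape): returns the model id, with an optional override,
// falling back to the AWS_BEDROCK_MODEL variable configured on the Lambda function.
export const config = ({ modelId }: { modelId?: string }) => ({
  modelId: modelId ?? process.env.AWS_BEDROCK_MODEL ?? 'amazon.nova-micro-v1:0',
});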
With the general logic ready, we can move on to implementing semantic search inside our Vector Index.
Semantic search using S3 Vectors
We need to take the user message and generate an embedding using the amazon.titan-embed-text-v2:0 model.
Once we get the query response, we parse the result according to our needs. In this case we know each vector's metadata contains the raw text chunk that we will return as relevant content.
With this simple flow we return additional context to the model so it can produce a response.
import { embed } from "ai";
import { bedrock } from "./model";
import { env } from "../env";
import { QueryVectorsCommand, S3VectorsClient } from '@aws-sdk/client-s3vectors'
import { config } from "./config";

export interface VectorMetadata {
  id: string
  chunk: string
}

const s3Vectors = new S3VectorsClient({})

export const findRelevantContent = async (query: string) => {
  const text = query.replaceAll('\\n', ' ');
  const { modelId } = config({ modelId: 'amazon.titan-embed-text-v2:0' })

  const { embedding: userQueryEmbedded } = await embed({
    model: bedrock.textEmbeddingModel(modelId),
    value: text,
    providerOptions: {
      bedrock: {
        normalize: true
      }
    },
  });

  const input = {
    indexArn: env.AWS_VECTOR_BUCKET_INDEX_ARN!,
    queryVector: {
      float32: userQueryEmbedded,
    },
    topK: 5,
    returnMetadata: true,
    returnDistance: true,
  }

  const result = await s3Vectors.send(new QueryVectorsCommand(input));

  if (!result.vectors || result.vectors.length === 0) {
    console.log('No vectors found for the query')
    return []
  }

  return result.vectors.map(v => {
    const metadata = v.metadata as VectorMetadata | undefined
    if (!metadata) {
      throw new Error('Metadata is required in the vector response')
    }
    if (!metadata.id || !metadata.chunk) {
      throw new Error('Vector metadata must contain id and chunk fields')
    }
    return {
      id: metadata.id,
      chunk: metadata.chunk,
      key: v.key,
      distance: v.distance,
    }
  })
}
It is important to note that the retrieved results do not always represent the right context. To get higher quality responses we need to spend more time preparing the data we process.
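As a rough illustration of what that preparation can involve, here is a naive chunking helper. It is only a sketch; the notebooks in the repository may split content very differently.
// Naive chunking sketch: fixed-size windows with overlap, applied before embedding.
// Real ingestion usually splits on sentence or semantic boundaries instead.
export const chunkText = (text: string, chunkSize = 800, overlap = 100): string[] => {
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    chunks.push(text.slice(start, start + chunkSize).trim());
  }
  return chunks.filter((chunk) => chunk.length > 0);
};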
Docker
To package our solution we define the Lambda.
# ---- Build Stage ----
FROM public.ecr.aws/lambda/nodejs:22 AS builder
WORKDIR /usr/src/app
RUN corepack enable
COPY package.json pnpm-lock.yaml* ./
RUN pnpm install --frozen-lockfile
COPY . .
RUN pnpm run build
# ---- Runtime Stage ----
FROM public.ecr.aws/lambda/nodejs:22
WORKDIR ${LAMBDA_TASK_ROOT}
COPY --from=builder /usr/src/app/dist/ ./
COPY --from=builder /usr/src/app/node_modules ./node_modules
ENTRYPOINT [ "/lambda-entrypoint.sh", "index.handler" ]
With this in place we can move on to creating the cloud resources.
Infrastructure
Container registry
We need a centralized location to host our Docker images.
resource "aws_ecr_repository" "warike_development_ecr" {
name = "ecr-${local.project_name}"
image_tag_mutability = "IMMUTABLE_WITH_EXCLUSION"
encryption_configuration {
encryption_type = "AES256"
}
image_scanning_configuration {
scan_on_push = true
}
image_tag_mutability_exclusion_filter {
filter = "*latest"
filter_type = "WILDCARD"
}
force_delete = true
}
Lambda module
For deploying our Lambda function using Docker we need to set the invocation mode to RESPONSE_STREAM, enable create_lambda_function_url, and use AWS_IAM as the authorization type.
locals {
  lambda_chat = {
    name        = "chat-${basename(path.cwd)}"
    image       = "${aws_ecr_repository.warike_development_ecr.repository_url}:chat-latest"
    description = "Lambda chat function for ${local.project_name}"
    memory_size = 512
    timeout     = 60
    env_vars = {
      AWS_BEARER_TOKEN_BEDROCK    = var.aws_bearer_token_bedrock
      AWS_BEDROCK_MODEL           = var.aws_bedrock_model
      AWS_VECTOR_BUCKET_INDEX_ARN = var.vector_bucket_index_arn
      AWS_BEDROCK_MODEL_EMBEDDING = var.aws_bedrock_model_embedding
    }
  }
}

## Lambda Chat
module "warike_development_lambda_chat" {
  source  = "terraform-aws-modules/lambda/aws"
  version = "~> 8.1.2"

  ## Configuration
  function_name = local.lambda_chat.name
  description   = local.lambda_chat.description
  memory_size   = local.lambda_chat.memory_size
  timeout       = local.lambda_chat.timeout

  ## Package
  create_package = false
  package_type   = "Image"
  image_uri      = local.lambda_chat.image

  environment_variables = merge(
    local.lambda_chat.env_vars,
    {}
  )

  ## API Gateway
  create_current_version_allowed_triggers = false

  ## Permissions
  create_role = false
  lambda_role = aws_iam_role.warike_development_lambda_chat_role.arn

  ## Logging
  use_existing_cloudwatch_log_group = true
  logging_log_group                 = aws_cloudwatch_log_group.warike_development_lambda_chat_logs.name
  logging_log_format                = "JSON"
  logging_application_log_level     = "INFO"
  logging_system_log_level          = "WARN"

  ## Response Streaming
  invoke_mode = "RESPONSE_STREAM"

  ## Lambda Function URL
  create_lambda_function_url = true
  authorization_type         = "AWS_IAM"
  cors = {
    allow_credentials = true
    allow_origins     = ["*"]
    allow_methods     = ["*"]
    allow_headers     = ["*"]
    expose_headers    = ["*"]
    max_age           = 86400
  }

  tags = merge(local.tags, { Name = local.lambda_chat.name })

  depends_on = [
    aws_cloudwatch_log_group.warike_development_lambda_chat_logs,
    aws_ecr_repository.warike_development_ecr,
    null_resource.warike_development_build_image_seed
  ]
}
ACM and Route 53
Next, to access our Lambda function through CloudFront, we define the DNS records and a certificate.
## Amazon Certificate Manager
module "warike_development_acm" {
  source  = "terraform-aws-modules/acm/aws"
  version = "~> 6.1.0"

  domain_name               = local.domain_name
  zone_id                   = data.aws_route53_zone.warike_development_warike_tech.id
  subject_alternative_names = ["*.${local.domain_name}"]
  validation_method         = "DNS"

  tags = local.tags
}

## Route 53 - Hosted Zone
data "aws_route53_zone" "warike_development_warike_tech" {
  name = local.domain_name
}

## Route 53 - Apex Record
resource "aws_route53_record" "warike_development_apex_record" {
  zone_id = data.aws_route53_zone.warike_development_warike_tech.zone_id
  name    = local.domain_name
  type    = "A"

  alias {
    name                   = module.warike_development_cloudfront.cloudfront_distribution_domain_name
    zone_id                = module.warike_development_cloudfront.cloudfront_distribution_hosted_zone_id
    evaluate_target_health = false
  }

  depends_on = [
    module.warike_development_cloudfront,
  ]
}
With everything above in place we can start configuring CloudFront.
CloudFront
For CloudFront, an Origin Access Control (OAC) will be created so the distribution can sign requests to the Lambda Function URL, which we reference as the origin.
module "warike_development_cloudfront" {
source = "terraform-aws-modules/cloudfront/aws"
version = "~> 5.0.1"
## Configuration
enabled = true
price_class = "PriceClass_100"
retain_on_delete = false
wait_for_deployment = true
is_ipv6_enabled = true
create_monitoring_subscription = true
## Extra CNAMEs
aliases = ["${local.domain_name}"]
comment = "Chat CloudFront Distribution"
## Origin access control
create_origin_access_control = true
origin_access_control = {
"chat_lambda_function_url" = {
description = "CloudFront access to Lambda Function URL"
origin_type = "lambda"
signing_behavior = "always"
signing_protocol = "sigv4"
}
}
origin = {
"chat_lambda_function_url" = {
domain_name = trimsuffix(replace(module.warike_development_lambda_chat.lambda_function_url, "https://", ""), "/")
origin_access_control = local.cloudfront_oac_lambda_function_url
custom_origin_config = {
http_port = 80
https_port = 443
origin_protocol_policy = "match-viewer"
origin_ssl_protocols = ["TLSv1", "TLSv1.1", "TLSv1.2"]
}
}
}
default_cache_behavior = {
target_origin_id = local.cloudfront_oac_lambda_function_url
viewer_protocol_policy = "redirect-to-https"
allowed_methods = ["HEAD", "DELETE", "POST", "GET", "OPTIONS", "PUT", "PATCH"]
cached_methods = ["GET", "HEAD", "OPTIONS"]
## Cache policy disabled
cache_policy_id = "4135ea2d-6df8-44a3-9df3-4b5a84be39ad"
## Forwarded values disabled
use_forwarded_values = false
## TTL settings
min_ttl = 0
default_ttl = 0
max_ttl = 0
compress = true
function_association = {
viewer-request = {
function_arn = aws_cloudfront_function.warike_development_restrict_domain.arn
}
}
}
viewer_certificate = {
acm_certificate_arn = module.warike_development_acm.acm_certificate_arn
ssl_support_method = "sni-only"
}
tags = merge(local.tags, { Name = local.cloudfront_oac_lambda_function_url })
depends_on = [
module.warike_development_acm,
module.warike_development_lambda_chat,
]
}
resource "aws_cloudfront_function" "warike_development_restrict_domain" {
name = "restrict-domain-${local.project_name}"
runtime = "cloudfront-js-1.0"
comment = "Restrict access to custom domain only"
publish = true
code = file("${path.module}/functions/auth.js")
}
CloudFront function
Additionally, I added a lightweight CloudFront function to ensure requests are directed to the predefined domain.
function handler(event) {
  var request = event.request;
  var host = request.headers.host.value;
  var allowedDomain = 'dev.zaistev.com';

  if (host !== allowedDomain) {
    return {
      statusCode: 403,
      statusDescription: 'Forbidden',
      headers: {
        "content-type": { "value": "text/plain" }
      },
      body: { "encoding": "text", "value": "Access denied. Please use the custom domain." }
    };
  }

  return request;
}
With the main resources in place we can move on to the actual cloud deployment.
Deployment
I created a Makefile with a set of targets that automate the deployment.
.PHONY: deploy compile-server push-image terraform-apply test

# Extract current version from terraform file (e.g., chat-v1)
CURRENT_VERSION := $(shell grep -o 'chat-v[0-9]*' infra/lambda-chat.tf | head -n 1)
# Extract the number part (e.g., 1)
VERSION_NUM := $(shell echo $(CURRENT_VERSION) | sed 's/chat-v//')
# Increment the number
NEXT_VERSION_NUM := $(shell echo $$(($(VERSION_NUM) + 1)))
# Form the new version string (e.g., chat-v2)
NEXT_VERSION := chat-v$(NEXT_VERSION_NUM)

deploy: compile-server push-image terraform-apply test

compile-server:
	@echo "Compiling server app..."
	cd apps/server && pnpm install && pnpm run build

push-image:
	@echo "Current version: $(CURRENT_VERSION)"
	@echo "Next version: $(NEXT_VERSION)"
	@echo "Building and pushing Docker image..."
	./infra/push_chat_image.sh $(NEXT_VERSION)
	@echo "Updating Terraform configuration..."
	# Update the version in the terraform file
	sed -i '' 's/$(CURRENT_VERSION)/$(NEXT_VERSION)/g' infra/lambda-chat.tf

terraform-apply:
	@echo "Applying Terraform changes..."
	cd infra && terraform apply -auto-approve

test:
	@echo "Running integration tests..."
	./infra/test_stream.sh
Therefore, to get it running we need to execute
make deploy
Testing
At this point we don't have any context data available. This can be addressed by manually inserting data using Python.
def insert_vector(text, bucket_name, index_name):
    """
    Insert a single text chunk into the S3 Vector Index.
    """
    if not s3_vectors_client:
        print("❌ S3 Vectors client is not initialized.")
        return False

    if not bucket_name or not index_name:
        print("⚠️ S3_VECTOR_BUCKET_NAME or S3_VECTOR_INDEX_NAME is not set.")
        return False

    # Generate unique ID
    unique_id = generate_nanoid()

    # Generate embedding
    embedding = get_embedding(text)
    if not embedding:
        print(f"❌ Failed to generate embedding for text: {text[:50]}...")
        return False

    # Create metadata
    metadata = {
        "id": unique_id,
        "chunk": text
    }

    try:
        # Insert vector into S3 Vector Index
        response = s3_vectors_client.put_vectors(
            vectorBucketName=bucket_name,
            indexName=index_name,
            vectors=[
                {
                    "key": unique_id,
                    "data": {"float32": embedding},
                    "metadata": metadata
                }
            ]
        )
        print(f"✅ Inserted vector with ID: {unique_id}")
        print(f"   Text preview: {text[:80]}...")
        return True
    except Exception as e:
        print(f"❌ Error inserting vector: {e}")
        return False


# Insert all predefined texts
print("\n🚀 Starting vector insertion...\n")

success_count = 0
fail_count = 0

for i, text in enumerate(predefined_texts, 1):
    print(f"\n[{i}/{len(predefined_texts)}] Processing...")
    if insert_vector(text, s3_vector_bucket_name, s3_vector_index_name):
        success_count += 1
    else:
        fail_count += 1

print(f"\n\n📊 Insertion Summary:")
print(f"   ✅ Successful: {success_count}")
print(f"   ❌ Failed: {fail_count}")
print(f"   📝 Total: {len(predefined_texts)}")
Later, if we test with available context, we can get a result like the following.
🔍 Testing with query: 'Explain how EventBridge works in LLM Workflow context?'
✅ Found 3 similar vectors:
1. ID: HlgHN84M1CUmWH3q5tzHl
Distance: 0.4293
Text: An application emits an event (for example, {"type": "orderCreated", "priority": "high"}). Amazon EventBridge evaluates the event against its routing rules. Based on an event's attributes, the system dynamically dispatches to the following: HighPriorityOrderProcessor (service A), StandardOrderProcessor (service B), UpdateOrderProcessor (service C). This pattern supports loose coupling, domain-based specialization, and runtime extensibility. This allows systems to respond intelligently to changing requirements and event semantics.
2. ID: Pv2Y1r0zYKxdFRrToZ0fo
Distance: 0.5672
Text: LLM-based routing: In agentic systems, routing also performs dynamic task delegation - but instead of Amazon EventBridge rules or metadata filters, the LLM classifies and interprets the user's intent through natural language. The result is a flexible, semantic, and adaptive form of dispatching.
3. ID: Cp98hjhcis6SSNooIuhZ8
Distance: 0.6386
Text: Agent router workflow: A user submits a natural language request through an SDK. An Amazon Bedrock agent uses an LLM to classify the task (for example, legal, technical, or scheduling). The agent dynamically routes the task through an action group to invoke the required agent: Domain-specific agent, Specialized tool chain, Custom prompt configuration. The selected handler processes the task and returns a tailored response.
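With context in place, we can also hit the deployed endpoint directly and watch the answer stream back. This is a small client sketch assuming Node 18+ and the custom domain configured in the CloudFront function above.
// Minimal streaming client sketch: POST a prompt and print tokens as they arrive.
const response = await fetch('https://dev.zaistev.com', {
  method: 'POST',
  headers: { 'content-type': 'application/json' },
  body: JSON.stringify({ prompt: 'Explain how EventBridge works in LLM Workflow context?' }),
});

if (!response.ok || !response.body) {
  throw new Error(`Request failed with status ${response.status}`);
}

// The Lambda streams text/plain; decode and print each chunk as it is flushed.
const decoder = new TextDecoder();
for await (const chunk of response.body) {
  process.stdout.write(decoder.decode(chunk as Uint8Array, { stream: true }));
}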
Conclusions
Building this context-aware agent proves that serverless RAG is both practical and cost-efficient. By pairing the AI SDK with AWS Lambda, we achieved real-time streaming without the operational overhead of a more complex implementation (and yes, I mean WebSockets on Lambda).
We had to rely on the AWS CLI for S3 Vectors since native Terraform support is currently missing. While this adds a manual step, it effectively removes the need for a dedicated and expensive vector database.
Ultimately, this architecture provides a secure and scalable starting point. With CloudFront and Docker in place, you have a system that keeps costs remarkably low.
References
https://aws.plainenglish.io/uploading-documents-to-s3-vector-buckets-f418594feca3
https://dev.to/aws-builders/ai-sdk-streaming-text-from-lambda-cfd
https://docs.aws.amazon.com/lambda/latest/dg/configuration-response-streaming.html