Sergio Esteban
Launching your RAG system on AWS: CloudFront, Lambda, Bedrock & S3 Vectors

A step-by-step guide with the AI SDK, AWS, and Terraform

Introduction

In this post I revisit the implementation of an AI Agent, this time adding the ability to return responses tied to a specific context.

TL;DR

All code used in this article is available in the repository.

Context

As usual, I rely on Serverless services to launch the experiment quickly. With the new capabilities AWS is rolling out, the agent will stream responses in real time as they are generated.

Below is a high level diagram representing the target architecture.

Scope

For this project I will implement a RAG system using serverless services.

We can outline the following requirements, assuming a low volume of 1,000 requests per month.

Functional requirements

  • The system should allow sending messages and receiving a response in real time as it is generated.
  • The system should restrict access only through the predefined domain.
  • The system should respond only when the question is related to the previously provided content.

Non functional requirements

As usual we can rely on serverless capabilities and add a few additional characteristics.

  • The system should be highly available.
  • The system should scale automatically to handle variable traffic patterns.
  • The system should remain secure with domain-level access control and encrypted traffic.

Out of Scope

This article does not cover data ingestion for the context that the system will use when answering.

However, notebooks are available for testing.

It is also worth mentioning that just because this is possible does not mean it is the ideal way to expose an Agent.

With that out of the way, let’s move on.

Cost breakdown

We are running a serverless RAG application with the components listed below.

Considering a volume of 1K requests per month, the costs break down as follows.

Bedrock API Calls

We will use the most cost-effective models, Nova Micro and Titan Embed v2, for language inference and embeddings respectively.

  • Nova Micro: 500K input tokens at 0.00035 USD per 1K tokens plus 500K output tokens at 0.0014 USD per 1K tokens
  • Titan Embed v2: 500K tokens at 0.0001 USD per 1K tokens, plus vector search operations

This gives us an approximate total of 1.05 USD per month.

AWS Lambda

We will configure a function with 512 MB and a 60 second timeout. This results in roughly 0.20 USD per month.

CloudFront

We will use PriceClass_100 (North America and Europe), assuming minimal data transfer and HTTPS requests.

This results in about 0.085 USD per month.

Container Registry

We store the Lambda function image here, estimating 1 GB for the Docker image and keeping all traffic within the same region.

This results in about 0.10 USD per month.

Route53

A hosted zone has a fixed monthly cost regardless of traffic.

This results in about 0.50 USD per month.

CloudWatch

Logs are kept with a seven-day retention period.

This results in about 0.01 USD per month.

This brings us to a total of approximately 1.95 USD per month, which aligns with the low volume we defined for model responses.

Let’s continue.

Implementation

S3 Vectors

Terraform does not currently support S3 Vectors resources, so these must be created using the AWS CLI.

First, create the bucket.

aws s3vectors create-vector-bucket \
  --vector-bucket-name "$VECTOR_BUCKET_NAME" \
  --region "$REGION" \
  --profile "$PROFILE"

Next, we will create the Index inside the bucket.

aws s3vectors create-index \
  --vector-bucket-name "$VECTOR_BUCKET_NAME" \
  --index-name "$INDEX_NAME" \
  --data-type float32 \
  --dimension 1024 \
  --distance-metric cosine \
  --metadata-configuration nonFilterableMetadataKeys=id,chunk \
  --region "$REGION" \
  --profile "$PROFILE"

To continue, we will need the index ARN.

aws s3vectors list-indexes \
    --vector-bucket-name "$VECTOR_BUCKET_NAME" \
    --region "$REGION" \
    --profile "$PROFILE" \
    --query "Indexes[?IndexName=='$INDEX_NAME'].IndexArn" \
    --output text 2>/dev/null || echo "Unable to retrieve ARN"

With these resources in place we can move forward using Terraform.

Unlike my previous article, this version will stream responses as they are generated.

Lambda creation

To support streamed responses we use streamText from the AI SDK together with Lambda's awslambda.streamifyResponse wrapper, with request types coming from @types/aws-lambda.

The following is the final result.

import { pipeline } from 'node:stream/promises';
import { config } from "./bedrock/config";
import { streamText } from 'ai';
import { bedrock } from './bedrock/model';
import { findRelevantContent } from './bedrock/query-vector';
import type { APIGatewayProxyEventV2 } from 'aws-lambda';

exports.handler = awslambda.streamifyResponse<APIGatewayProxyEventV2>(
    async (event, responseStream, _context) => {
        try {
            responseStream.setContentType('text/plain; charset=utf-8');
            let prompt = '';
            if (event.body) {
                try {
                    const body = JSON.parse(event.body);
                    if (body.prompt) prompt = body.prompt;
                } catch (e) {
                    console.warn('Failed to parse request body:', e);
                }
            }

            if (!prompt) {
                responseStream.write('Error: No prompt provided');
                responseStream.end();
                return;
            }

            const { modelId } = config({});
            const similarDocuments = await findRelevantContent(prompt);
            const context = similarDocuments.map(doc => doc.chunk).join('\n\n');

            const systemPrompt = `You are a helpful assistant. Answer the user's question using ONLY the context provided below. 
            If the answer is not in the context, say "I don't know" or "The provided context does not contain the answer."
            Do not hallucinate or use outside knowledge.

            Context:
            ${context}`;

            const result = await streamText({
                model: bedrock(modelId),
                system: systemPrompt,
                messages: [
                    { role: "user", content: prompt },
                ],
            });
            await pipeline(result.textStream, responseStream);
            return;

        } catch (error) {
            console.error("Error in handler:", error);
            responseStream.write('Error');
            responseStream.end();
        }
    }
);
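The awslambda global used above is injected by the Lambda Node.js runtime at execution time, so TypeScript does not know about it out of the box. If your version of @types/aws-lambda does not cover it, an ambient declaration along these lines keeps the compiler happy; the shapes below are an assumption limited to what the handler actually uses.

// global.d.ts - ambient declaration sketch for the `awslambda` global injected by the
// Lambda Node.js runtime. Only needed if your toolchain does not already provide these types.
import type { Writable } from 'node:stream';
import type { Context } from 'aws-lambda';

declare global {
    namespace awslambda {
        // Writable response stream plus the helper used to set the content type.
        interface HttpResponseStream extends Writable {
            setContentType(contentType: string): void;
        }
        // Wraps a streaming handler so Lambda invokes it in RESPONSE_STREAM mode.
        function streamifyResponse<TEvent>(
            handler: (
                event: TEvent,
                responseStream: HttpResponseStream,
                context: Context
            ) => Promise<void>
        ): unknown;
    }
}

export {};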

With the general logic ready, we can move on to implementing semantic search inside our Vector Index.

Semantic search using S3 Vectors

We need to take the user message and generate embeddings using the amazon.titan-embed-text-v2:0 model.

Once we get the query results, we parse them according to our needs. In this case each vector's metadata contains the raw chunk text we will return as relevant content.

With this simple flow we return additional context to the model so it can produce a response.

import { embed } from "ai";
import { bedrock } from "./model";
import { env } from "../env";
import { QueryVectorsCommand, S3VectorsClient } from '@aws-sdk/client-s3vectors'
import { config } from "./config";

export interface VectorMetadata {
    id: string
    chunk: string
}

const s3Vectors = new S3VectorsClient({})

export const findRelevantContent = async (query: string) => {
    const text = query.replaceAll('\\n', ' ');
    const { modelId } = config({ modelId: 'amazon.titan-embed-text-v2:0' })
    const { embedding: userQueryEmbedded } = await embed({
        model: bedrock.textEmbeddingModel(modelId),
        value: text,
        providerOptions: {
            bedrock: {
                normalize: true
            }
        },
    });
    const input = {
        indexArn: env.AWS_VECTOR_BUCKET_INDEX_ARN!,
        queryVector: {
            float32: userQueryEmbedded,
        },
        topK: 5,
        returnMetadata: true,
        returnDistance: true,
    }
    const result = await s3Vectors.send(new QueryVectorsCommand(input));

    if (!result.vectors || result.vectors.length === 0) {
        console.log('No vectors found for the query')
        return []
    }

    return result.vectors.map(v => {
        const metadata = v.metadata as VectorMetadata | undefined
        if (!metadata) {
            throw new Error('Metadata is required in the vector response')
        }

        if (!metadata.id || !metadata.chunk) {
            throw new Error('Vector metadata must contain id and chunk fields')
        }

        return {
            id: metadata.id,
            chunk: metadata.chunk,
            key: v.key,
            distance: v.distance,
        }
    })

}
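The ./bedrock/model and ./bedrock/config helpers imported above are not shown in this post. What follows is a minimal sketch of what they might look like with the AI SDK's Amazon Bedrock provider; the environment variable names mirror the Terraform env_vars defined later, and the exact shape is an assumption rather than the repository's actual code.

// bedrock/model.ts + bedrock/config.ts - hypothetical sketch, not the repository code.
import { createAmazonBedrock } from '@ai-sdk/amazon-bedrock';

// Single Bedrock provider instance shared by the chat model and the embedding model.
export const bedrock = createAmazonBedrock({
    region: process.env.AWS_REGION ?? 'us-east-1',
});

// Resolve the model id, falling back to the value injected through Terraform.
export const config = ({ modelId }: { modelId?: string }) => ({
    modelId: modelId ?? process.env.AWS_BEDROCK_MODEL ?? 'amazon.nova-micro-v1:0',
});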

It is important to note that the retrieved results do not always represent the right context. To get higher quality responses we need to spend more time preparing the data we process.
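One low-effort improvement is to filter out weak matches before they reach the prompt. The sketch below drops results above an assumed cosine-distance cutoff of 0.6; the right threshold depends on your data.

import { findRelevantContent } from './bedrock/query-vector';

// Assumed cutoff - tune against your own corpus.
const MAX_DISTANCE = 0.6;

export const findFilteredContent = async (query: string) => {
    const matches = await findRelevantContent(query);
    // Keep only chunks whose cosine distance suggests a reasonably close match.
    return matches.filter(m => (m.distance ?? 1) <= MAX_DISTANCE);
};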

Docker

To package our solution we define the Lambda container image.

# ---- Build Stage ----
FROM public.ecr.aws/lambda/nodejs:22 AS builder

WORKDIR /usr/src/app
RUN corepack enable

COPY package.json pnpm-lock.yaml* ./

RUN pnpm install --frozen-lockfile

COPY . .

RUN pnpm run build

# ---- Runtime Stage ----
FROM public.ecr.aws/lambda/nodejs:22

WORKDIR ${LAMBDA_TASK_ROOT}

COPY --from=builder /usr/src/app/dist/ ./
COPY --from=builder /usr/src/app/node_modules ./node_modules

ENTRYPOINT [ "/lambda-entrypoint.sh", "index.handler" ]

With this in place we can move on to creating the cloud resources.

Infrastructure

Container registry

We need a centralized location to host our Docker images.

resource "aws_ecr_repository" "warike_development_ecr" {
  name                 = "ecr-${local.project_name}"
  image_tag_mutability = "IMMUTABLE_WITH_EXCLUSION"

  encryption_configuration {
    encryption_type = "AES256"
  }

  image_scanning_configuration {
    scan_on_push = true
  }

  image_tag_mutability_exclusion_filter {
    filter      = "*latest"
    filter_type = "WILDCARD"
  }

  force_delete = true
}

Lambda module

For deploying our Lambda function using Docker we need to set the invocation mode to RESPONSE_STREAM, enable create_lambda_function_url, and use AWS_IAM as the authorization type.

locals {
  lambda_chat = {
    name        = "chat-${basename(path.cwd)}"
    image       = "${aws_ecr_repository.warike_development_ecr.repository_url}:chat-latest"
    description = "Lambda chat function for ${local.project_name}"
    memory_size = 512
    timeout     = 60
    env_vars = {
      AWS_BEARER_TOKEN_BEDROCK    = var.aws_bearer_token_bedrock
      AWS_BEDROCK_MODEL           = var.aws_bedrock_model
      AWS_VECTOR_BUCKET_INDEX_ARN = var.vector_bucket_index_arn
      AWS_BEDROCK_MODEL_EMBEDDING = var.aws_bedrock_model_embedding
    }
  }
}

## Lambda Chat
module "warike_development_lambda_chat" {
  source  = "terraform-aws-modules/lambda/aws"
  version = "~> 8.1.2"

  ## Configuration
  function_name = local.lambda_chat.name
  description   = local.lambda_chat.description
  memory_size   = local.lambda_chat.memory_size
  timeout       = local.lambda_chat.timeout

  ## Package
  create_package = false
  package_type   = "Image"
  image_uri      = local.lambda_chat.image
  environment_variables = merge(
    local.lambda_chat.env_vars,
    {}
  )

  ## API Gateway
  create_current_version_allowed_triggers = false

  ## Permissions
  create_role = false
  lambda_role = aws_iam_role.warike_development_lambda_chat_role.arn

  ## Logging
  use_existing_cloudwatch_log_group = true
  logging_log_group                 = aws_cloudwatch_log_group.warike_development_lambda_chat_logs.name
  logging_log_format                = "JSON"
  logging_application_log_level     = "INFO"
  logging_system_log_level          = "WARN"

  ## Response Streaming
  invoke_mode = "RESPONSE_STREAM"

  ## Lambda Function URL
  create_lambda_function_url = true
  authorization_type         = "AWS_IAM"

  cors = {
    allow_credentials = true
    allow_origins     = ["*"]
    allow_methods     = ["*"]
    allow_headers     = ["*"]
    expose_headers    = ["*"]
    max_age           = 86400
  }

  tags = merge(local.tags, { Name = local.lambda_chat.name })

  depends_on = [
    aws_cloudwatch_log_group.warike_development_lambda_chat_logs,
    aws_ecr_repository.warike_development_ecr,
    null_resource.warike_development_build_image_seed
  ]
}


ACM and Route 53

Next, to access our Lambda function through CloudFront, we define the DNS records and a certificate.

## Amazon Certificate Manager
module "warike_development_acm" {
  source  = "terraform-aws-modules/acm/aws"
  version = "~> 6.1.0"

  domain_name               = local.domain_name
  zone_id                   = data.aws_route53_zone.warike_development_warike_tech.id
  subject_alternative_names = ["*.${local.domain_name}"]

  validation_method = "DNS"

  tags = local.tags
}

## Route 53 - Hosted Zone
data "aws_route53_zone" "warike_development_warike_tech" {
  name = local.domain_name
}

## Route 53 - Apex Record
resource "aws_route53_record" "warike_development_apex_record" {
  zone_id = data.aws_route53_zone.warike_development_warike_tech.zone_id
  name    = local.domain_name
  type    = "A"

  alias {
    name                   = module.warike_development_cloudfront.cloudfront_distribution_domain_name
    zone_id                = module.warike_development_cloudfront.cloudfront_distribution_hosted_zone_id
    evaluate_target_health = false
  }

  depends_on = [
    module.warike_development_cloudfront,
  ]
}


With everything above in place we can start configuring CloudFront.

CloudFront

For CloudFront we need to consider that an Origin Access Control will be created for the Lambda origin.

We will reference the lambda_function_url output of the Lambda module as the origin domain.

module "warike_development_cloudfront" {
  source  = "terraform-aws-modules/cloudfront/aws"
  version = "~> 5.0.1"

  ## Configuration
  enabled                        = true
  price_class                    = "PriceClass_100"
  retain_on_delete               = false
  wait_for_deployment            = true
  is_ipv6_enabled                = true
  create_monitoring_subscription = true

  ## Extra CNAMEs
  aliases = ["${local.domain_name}"]
  comment = "Chat CloudFront Distribution"

  ## Origin access control
  create_origin_access_control = true

  origin_access_control = {
    "chat_lambda_function_url" = {
      description      = "CloudFront access to Lambda Function URL"
      origin_type      = "lambda"
      signing_behavior = "always"
      signing_protocol = "sigv4"
    }
  }

  origin = {
    "chat_lambda_function_url" = {
      domain_name           = trimsuffix(replace(module.warike_development_lambda_chat.lambda_function_url, "https://", ""), "/")
      origin_access_control = local.cloudfront_oac_lambda_function_url
      custom_origin_config = {
        http_port              = 80
        https_port             = 443
        origin_protocol_policy = "match-viewer"
        origin_ssl_protocols   = ["TLSv1", "TLSv1.1", "TLSv1.2"]
      }
    }
  }

  default_cache_behavior = {
    target_origin_id       = local.cloudfront_oac_lambda_function_url
    viewer_protocol_policy = "redirect-to-https"
    allowed_methods        = ["HEAD", "DELETE", "POST", "GET", "OPTIONS", "PUT", "PATCH"]
    cached_methods         = ["GET", "HEAD", "OPTIONS"]

    ## Cache policy disabled
    cache_policy_id = "4135ea2d-6df8-44a3-9df3-4b5a84be39ad"

    ## Forwarded values disabled
    use_forwarded_values = false

    ## TTL settings
    min_ttl     = 0
    default_ttl = 0
    max_ttl     = 0
    compress    = true

    function_association = {
      viewer-request = {
        function_arn = aws_cloudfront_function.warike_development_restrict_domain.arn
      }
    }
  }

  viewer_certificate = {
    acm_certificate_arn = module.warike_development_acm.acm_certificate_arn
    ssl_support_method  = "sni-only"
  }

  tags = merge(local.tags, { Name = local.cloudfront_oac_lambda_function_url })

  depends_on = [
    module.warike_development_acm,
    module.warike_development_lambda_chat,
  ]
}

resource "aws_cloudfront_function" "warike_development_restrict_domain" {
  name    = "restrict-domain-${local.project_name}"
  runtime = "cloudfront-js-1.0"
  comment = "Restrict access to custom domain only"
  publish = true
  code    = file("${path.module}/functions/auth.js")

}


CloudFront function

Additionally, I added a lightweight CloudFront function to ensure requests are directed to the predefined domain.

function handler(event) {
    var request = event.request;
    var host = request.headers.host.value;
    var allowedDomain = 'dev.zaistev.com';

    if (host !== allowedDomain) {
        return {
            statusCode: 403,
            statusDescription: 'Forbidden',
            headers: {
                "content-type": { "value": "text/plain" }
            },
            body: { "encoding": "text", "value": "Access denied. Please use the custom domain." }
        };
    }
    return request;
}


With the main resources in place we can move on to the actual cloud deployment.

Deployment

I created a Makefile with a set of targets that automate the deployment.

.PHONY: deploy compile-server push-image terraform-apply

# Extract current version from terraform file (e.g., chat-v1)
CURRENT_VERSION := $(shell grep -o 'chat-v[0-9]*' infra/lambda-chat.tf | head -n 1)
# Extract the number part (e.g., 1)
VERSION_NUM := $(shell echo $(CURRENT_VERSION) | sed 's/chat-v//')
# Increment the number
NEXT_VERSION_NUM := $(shell echo $$(($(VERSION_NUM) + 1)))
# Form the new version string (e.g., chat-v2)
NEXT_VERSION := chat-v$(NEXT_VERSION_NUM)

deploy: compile-server push-image terraform-apply test

compile-server:
    @echo "Compiling server app..."
    cd apps/server && pnpm install && pnpm run build

push-image:
    @echo "Current version: $(CURRENT_VERSION)"
    @echo "Next version: $(NEXT_VERSION)"
    @echo "Building and pushing Docker image..."
    ./infra/push_chat_image.sh $(NEXT_VERSION)
    @echo "Updating Terraform configuration..."
    # Update the version in the terraform file
    sed -i '' 's/$(CURRENT_VERSION)/$(NEXT_VERSION)/g' infra/lambda-chat.tf

terraform-apply:
    @echo "Applying Terraform changes..."
    cd infra && terraform apply -auto-approve

test:
    @echo "Running integration tests..."
    ./infra/test_stream.sh


Therefore, to get it running we need to execute

make deploy
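The test target calls ./infra/test_stream.sh, which is not included in this post. A rough TypeScript equivalent of that smoke test might look like the following; the domain is the example one from the CloudFront function, and the x-amz-content-sha256 header is included because OAC-signed POST requests to a Lambda Function URL generally expect the viewer to supply the payload hash.

// Hypothetical smoke test - run with Node 18+ as an ES module; replace the domain with your own.
import { createHash } from 'node:crypto';

const body = JSON.stringify({ prompt: 'Explain how EventBridge works in LLM Workflow context?' });

const response = await fetch('https://dev.zaistev.com/', {
    method: 'POST',
    headers: {
        'content-type': 'application/json',
        // OAC signs the origin request, but requests with a body need the payload hash
        // from the viewer so the SigV4 signature can be completed.
        'x-amz-content-sha256': createHash('sha256').update(body).digest('hex'),
    },
    body,
});

// Print the text/plain stream as the model generates it.
const reader = response.body!.pipeThrough(new TextDecoderStream()).getReader();
for (;;) {
    const { value, done } = await reader.read();
    if (done) break;
    process.stdout.write(value);
}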

Testing

At this point we don’t have any context data available. This can be addressed by manually inserting data using Python.

def insert_vector(text, bucket_name, index_name):
    """
    Insert a single text chunk into the S3 Vector Index.
    """
    if not s3_vectors_client:
        print("❌ S3 Vectors client is not initialized.")
        return False

    if not bucket_name or not index_name:
        print("⚠️ S3_VECTOR_BUCKET_NAME or S3_VECTOR_INDEX_NAME is not set.")
        return False

    # Generate unique ID
    unique_id = generate_nanoid()

    # Generate embedding
    embedding = get_embedding(text)
    if not embedding:
        print(f"❌ Failed to generate embedding for text: {text[:50]}...")
        return False

    # Create metadata
    metadata = {
        "id": unique_id,
        "chunk": text
    }

    try:
        # Insert vector into S3 Vector Index
        response = s3_vectors_client.put_vectors(
            vectorBucketName=bucket_name,
            indexName=index_name,
            vectors=[
                {
                    "key": unique_id,
                    "data": {"float32": embedding},
                    "metadata": metadata
                }
            ]
        )

        print(f"✅ Inserted vector with ID: {unique_id}")
        print(f"   Text preview: {text[:80]}...")
        return True

    except Exception as e:
        print(f"❌ Error inserting vector: {e}")
        return False

# Insert all predefined texts
print("\n🚀 Starting vector insertion...\n")
success_count = 0
fail_count = 0

for i, text in enumerate(predefined_texts, 1):
    print(f"\n[{i}/{len(predefined_texts)}] Processing...")
    if insert_vector(text, s3_vector_bucket_name, s3_vector_index_name):
        success_count += 1
    else:
        fail_count += 1

print(f"\n\n📊 Insertion Summary:")
print(f"   ✅ Successful: {success_count}")
print(f"   ❌ Failed: {fail_count}")
print(f"   📝 Total: {len(predefined_texts)}")

Later, if we test with available context, we can get a result like the following.

🔍 Testing with query: 'Explain how EventBridge works in LLM Workflow context?'

✅ Found 3 similar vectors:

1. ID: HlgHN84M1CUmWH3q5tzHl
   Distance: 0.4293
   Text: An application emits an event (for example, {"type": "orderCreated", "priority": "high"}). Amazon EventBridge evaluates the event against its routing rules. Based on an event's attributes, the system dynamically dispatches to the following: HighPriorityOrderProcessor (service A), StandardOrderProcessor (service B), UpdateOrderProcessor (service C). This pattern supports loose coupling, domain-based specialization, and runtime extensibility. This allows systems to respond intelligently to changing requirements and event semantics.

2. ID: Pv2Y1r0zYKxdFRrToZ0fo
   Distance: 0.5672
   Text: LLM-based routing: In agentic systems, routing also performs dynamic task delegation - but instead of Amazon EventBridge rules or metadata filters, the LLM classifies and interprets the user's intent through natural language. The result is a flexible, semantic, and adaptive form of dispatching.

3. ID: Cp98hjhcis6SSNooIuhZ8
   Distance: 0.6386
   Text: Agent router workflow: A user submits a natural language request through an SDK. An Amazon Bedrock agent uses an LLM to classify the task (for example, legal, technical, or scheduling). The agent dynamically routes the task through an action group to invoke the required agent: Domain-specific agent, Specialized tool chain, Custom prompt configuration. The selected handler processes the task and returns a tailored response.

Conclusions

Building this context-aware Agent proves that serverless RAG is both practical and cost-efficient. By pairing the AI SDK with AWS Lambda, we achieved real-time streaming without the operational overhead of a more complex implementation (yes, I mean WebSockets on Lambda).

We had to rely on the AWS CLI for S3 Vectors since native Terraform support is currently missing. While this adds a manual step, it effectively removes the need for a dedicated and expensive vector database.

Ultimately, this architecture provides a secure and scalable starting point. With CloudFront and Docker in place, you have a system that keeps costs remarkably low.

References

https://aws.plainenglish.io/uploading-documents-to-s3-vector-buckets-f418594feca3
https://dev.to/aws-builders/ai-sdk-streaming-text-from-lambda-cfd
https://docs.aws.amazon.com/lambda/latest/dg/configuration-response-streaming.html

