In the previous post, I focused on how to build a Retrieval-Augmented Generation (RAG) pipeline on AWS using DynamoDB as a low-cost vector store. The goal there was simple: prove that you don’t need heavy infrastructure or a dedicated vector database to get something useful working.
If you haven’t read it, you can check it out by clicking on the link here.
This post picks up where that one left off. We will connect the RAG pipeline we built to a frontend application that uses our backend, turning it into a full-stack app you can use to upload documents and get quicker access to the information you need from your PDFs. We will also see how the budget RAG pipeline behaves with real users and traffic.
The link to the GitHub repository is the same as from the previous post, it’s just located on a branch called full-stack-implementation — access the repo by clicking here.
What the Application Does
The frontend application is written with the following technologies:
- Next.js - React framework with App Router for server-side rendering and routing
- Tailwind CSS 4 - Utility-first CSS framework for styling
- shadcn/ui - Component library built on Radix UI primitives (New York style)
- NextAuth.js - Authentication solution for Next.js
- Lucide React - Icon library
When the application loads, the user is prompted with signup and login forms, which use the Authentication stack on the backend.
The Authentication stack relies on AWS Cognito to keep user management simple.
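To make the login flow concrete, here is a minimal sketch of what the Login Lambda could look like using the AWS SDK v3 Cognito client. The handler shape, environment variable name, and response format are assumptions for illustration, not the exact code from the repo.

```typescript
// Hypothetical Login Lambda: validates the payload and exchanges the
// credentials for Cognito tokens. Names and env vars are illustrative.
import {
  CognitoIdentityProviderClient,
  InitiateAuthCommand,
} from "@aws-sdk/client-cognito-identity-provider";
import type { APIGatewayProxyEvent, APIGatewayProxyResult } from "aws-lambda";

const cognito = new CognitoIdentityProviderClient({});

export const handler = async (
  event: APIGatewayProxyEvent
): Promise<APIGatewayProxyResult> => {
  const { email, password } = JSON.parse(event.body ?? "{}");
  if (!email || !password) {
    return {
      statusCode: 400,
      body: JSON.stringify({ message: "email and password are required" }),
    };
  }

  // USER_PASSWORD_AUTH must be enabled on the Cognito app client for this flow.
  const result = await cognito.send(
    new InitiateAuthCommand({
      AuthFlow: "USER_PASSWORD_AUTH",
      ClientId: process.env.USER_POOL_CLIENT_ID, // assumed env var name
      AuthParameters: { USERNAME: email, PASSWORD: password },
    })
  );

  return {
    statusCode: 200,
    body: JSON.stringify({
      idToken: result.AuthenticationResult?.IdToken,
      accessToken: result.AuthenticationResult?.AccessToken,
    }),
  };
};
```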
After the user signs up and logs in, they can upload PDF documents to the platform, wait for them to be indexed, and then ask questions that are answered using only the content of those documents.
There’s no attempt to hide latency, no background magic, and no assumption of massive scale. This is designed for early-stage usage: experimentation, internal tools, or MVPs where validating the feature matters more than optimizing for peak performance.
I’m going to show you two happy paths: one for document upload and one for asking questions.
Here is a GIF showing how the user can upload documents to the platform. As an example, I’ve exported my old blog posts as PDFs and uploaded them.
After the documents are indexed by the backend and saved to the DynamoDB table, they show up inside the Chat UI. You can choose which documents you want to ask questions about: the backend takes the IDs of the selected documents, retrieves their data from the database, and answers your question with a Bedrock LLM.
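As a rough sketch of that last step, assuming the relevant chunks have already been pulled from DynamoDB, the answer generation with the Bedrock Runtime Converse API could look something like this. The model ID and prompt format are assumptions, not necessarily what the pipeline from the previous post uses.

```typescript
// Hypothetical answer step: stuff the retrieved chunks into the prompt and
// ask a Bedrock model to answer using only that context.
import {
  BedrockRuntimeClient,
  ConverseCommand,
} from "@aws-sdk/client-bedrock-runtime";

const bedrock = new BedrockRuntimeClient({});

export async function answerFromChunks(
  question: string,
  chunks: string[]
): Promise<string> {
  const prompt = [
    "Answer the question using only the context below.",
    "Context:",
    ...chunks,
    `Question: ${question}`,
  ].join("\n\n");

  const response = await bedrock.send(
    new ConverseCommand({
      modelId: "anthropic.claude-3-haiku-20240307-v1:0", // assumed model ID
      messages: [{ role: "user", content: [{ text: prompt }] }],
    })
  );

  return response.output?.message?.content?.[0]?.text ?? "";
}
```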
High-Level Architecture
The simple frontend talks to an API layer backed by AWS Lambda.
There are two stacks: one for authentication and user management, and one for documents. Separating authentication from document processing keeps the ingestion and query paths independent. In addition, if you ever expand this application and the document stack’s API Gateway needs to be protected, you can simply import the Authentication stack’s User Pool ID and protect your endpoints.
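As a hedged CDK sketch of what that cross-stack protection could look like (construct names and the way the Lambda is passed in are assumptions):

```typescript
// Hypothetical CDK snippet: import the Authentication stack's User Pool and
// use it to protect a Document stack endpoint with a Cognito authorizer.
import { Stack, StackProps } from "aws-cdk-lib";
import * as apigateway from "aws-cdk-lib/aws-apigateway";
import * as cognito from "aws-cdk-lib/aws-cognito";
import * as lambda from "aws-cdk-lib/aws-lambda";
import { Construct } from "constructs";

interface DocumentStackProps extends StackProps {
  userPoolId: string;              // exported by the Authentication stack
  askQuestionFn: lambda.IFunction; // the Ask Questions Lambda
}

export class DocumentStack extends Stack {
  constructor(scope: Construct, id: string, props: DocumentStackProps) {
    super(scope, id, props);

    // Reference the existing User Pool instead of creating a new one.
    const userPool = cognito.UserPool.fromUserPoolId(
      this,
      "ImportedUserPool",
      props.userPoolId
    );

    const authorizer = new apigateway.CognitoUserPoolsAuthorizer(this, "DocsAuthorizer", {
      cognitoUserPools: [userPool],
    });

    // Every method attached with this authorizer now requires a valid Cognito token.
    const api = new apigateway.RestApi(this, "DocumentsApi");
    api.root.addResource("ask-question").addMethod(
      "POST",
      new apigateway.LambdaIntegration(props.askQuestionFn),
      { authorizer, authorizationType: apigateway.AuthorizationType.COGNITO }
    );
  }
}
```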
As mentioned, authentication and user management are handled entirely by AWS Cognito, together with Login and Register Lambdas that validate the incoming payloads.
Regarding the Document stack, documents are stored in S3 via presigned URLs, embeddings and metadata live in DynamoDB, and the LLM and embedding models are accessed via AWS Bedrock. There are no long-running services, so there are no fixed monthly costs.
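The upload path is a good example of how little compute this needs: a small Lambda only hands the browser a presigned URL, and S3 does the heavy lifting. A minimal sketch, assuming a bucket env var and key layout that may differ from the repo:

```typescript
// Hypothetical presigned-upload helper: returns a short-lived PUT URL so the
// browser uploads the PDF straight to S3, keeping Lambda out of the data path.
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";
import { getSignedUrl } from "@aws-sdk/s3-request-presigner";

const s3 = new S3Client({});

export async function createUploadUrl(userId: string, fileName: string): Promise<string> {
  const command = new PutObjectCommand({
    Bucket: process.env.DOCUMENTS_BUCKET,       // assumed env var name
    Key: `${userId}/${Date.now()}-${fileName}`, // assumed key layout
    ContentType: "application/pdf",
  });

  // The URL stays valid for 5 minutes; adjust as needed.
  return getSignedUrl(s3, command, { expiresIn: 300 });
}
```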
Every architectural choice here optimizes for one thing: minimizing cost and operational overhead while still providing the RAG pipeline development experience and its features.
API and Lambda Performance with Real Traffic
The following image shows the execution duration of the Lambda that indexes incoming documents and writes the extracted information to the DynamoDB table. Recall the architecture from the previous post: when a document is uploaded to S3, it triggers an EventBridge rule that puts the uploaded document’s information into an SQS queue, which in turn triggers the document indexing Lambda. This is done to reuse Lambda execution environments and reduce the number of invocations, lowering cost.
The first data points are Lambda invocations where a single ~400 KB PDF is uploaded; the later points are invocations where two PDFs of roughly 200 KB each are uploaded.
Indexing two small PDFs in just over 8 seconds means document indexing stays comfortably asynchronous. Users aren’t blocked, and indexing cost remains predictable.
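For reference, here is a rough sketch of how that SQS-batched indexing handler could look. The EventBridge event shape matches the flow described above, while the helper function is an assumption standing in for the chunking, embedding, and DynamoDB writes covered in the previous post.

```typescript
// Hypothetical indexing handler: one invocation can drain several SQS records,
// each wrapping an EventBridge "Object Created" event for an uploaded PDF.
import type { SQSEvent } from "aws-lambda";

// indexDocument is assumed to exist: it downloads the PDF, chunks it, embeds
// the chunks via Bedrock, and writes the vectors and metadata to DynamoDB.
import { indexDocument } from "./indexDocument";

export const handler = async (event: SQSEvent): Promise<void> => {
  for (const record of event.Records) {
    // The SQS body carries the EventBridge event; its detail holds the S3 object info.
    const s3Event = JSON.parse(record.body);
    const { bucket, object } = s3Event.detail;

    await indexDocument(bucket.name, object.key);
  }
};
```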
Next, you can list indexed documents, inspect their state, and see exactly when they become available for querying. In the meantime, I’ve uploaded the same documents plus other blog PDFs of similar size, reaching 20+ documents in the database, with very similar results: around 4 seconds per document to index it and store it in the database.
Now, to fetch all available documents for the user, we need to call the get-documents API. The screenshots below show the get-documents endpoint in action, along with the execution duration of the document indexing Lambda. For testing, I used Postman and got the following results when querying all of my user’s documents in the database. Notice that the first API call is slower than the rest? That’s the Lambda cold start. It’s important to note that an average user of the application won’t experience cold starts often, especially if there are other users on the platform.
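Under the hood, the get-documents Lambda is little more than a DynamoDB query scoped to the caller. A hedged sketch, assuming the table is partitioned by user ID (the actual table layout from the previous post may differ):

```typescript
// Hypothetical get-documents query: lists the caller's document metadata items.
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, QueryCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

export async function getDocumentsForUser(userId: string) {
  const result = await ddb.send(
    new QueryCommand({
      TableName: process.env.DOCUMENTS_TABLE,     // assumed env var name
      KeyConditionExpression: "userId = :userId", // assumed partition key
      ExpressionAttributeValues: { ":userId": userId },
    })
  );

  return result.Items ?? [];
}
```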
Finally, there is the Ask Questions Lambda, which the user invokes when they choose the documents they want to ask about. The payload is simple: the selected document IDs are sent to the Lambda, together with a question.
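The request body might look roughly like this (field names are illustrative, not the exact contract):

```typescript
// Hypothetical shape of the ask-question request body.
interface AskQuestionRequest {
  documentIds: string[]; // IDs of the indexed documents to search
  question: string;      // the user's natural-language question
}

const examplePayload: AskQuestionRequest = {
  documentIds: ["doc-123", "doc-456"],
  question: "Give me a summary of these documents.",
};
```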
The following graph shows the execution duration of that Lambda. It was tested first for cold start duration, then for warm start duration. The first couple of executions asked about only one document; the question was:
What is this document about?
The last execution in the graph had 4 documents selected, with the question:
Give me a summary of these 4 documents.
As you can see, when inquiring about one document, the cold start duration is around the 3 second mark, while when the Lambda is warm, the execution time is under 1.5 seconds.
On the other hand, the cold start duration when the Lambda is asked about 4 documents is just above 4 seconds, while the warm start duration is just under 4 seconds.
As mentioned before, an average user won’t experience many Lambda cold starts. For a budget RAG system that uses very few AWS services to keep complexity minimal, this performance is acceptable.
Conclusion
This application confirms what the first blog post theorized: a DynamoDB-based RAG system is not just cheap but genuinely usable, and if you understand its limits, its performance is acceptable.
As we discussed in the previous post, the trade-offs made are obvious and expected: Lambda cold starts are real, and document query latency behaves as expected, increasing as the document count grows.
For early-stage applications, internal tools, experiments, or even a hackathon where you don’t want to break the bank but still need a working RAG system, this setup does its job well. It lets you ship the product, lets your users give you feedback on the application, and delays expensive architectural decisions until data forces your hand.