DEV Community

Fernando
Fernando

Posted on

RAG Architecture with n8n + PostgreSQL (pgvector) + Ollama Gemma4 on AWS EC2

Gemma 4 Challenge: Write about Gemma 4 Submission

This is a submission for the Gemma 4 Challenge: Write About Gemma 4

A Complete Enterprise AI Knowledge Retrieval Architecture for Private Document Intelligence

Summary
This article explains a Retrieval-Augmented Generation (RAG) architecture using n8n, PostgreSQL with pgvector, Ollama, and Gemma4 running on AWS EC2. The platform automatically ingests emails and PDFs, creates embeddings, stores vectors in PostgreSQL, and retrieves contextual information to generate AI answers grounded in enterprise data.

Content
You can view a video of this Gemma4 architecture here:
https://www.youtube.com/watch?v=bTP-sNKlsxc

RAG architectures combine vector search with large language models. In this solution, n8n orchestrates ingestion workflows and query processing. Emails and PDF documents are read automatically, text is extracted and cleaned, then split into semantic chunks. The chunks are embedded using the nomic-embed-text model and stored in PostgreSQL pgvector. When users ask questions, the question is embedded and compared against stored vectors to retrieve the most relevant chunks. Gemma4 then generates a final response using retrieved context.
The architecture uses two AWS EC2 instances. The first server hosts n8n, PostgreSQL, Docker, and orchestration services. The second server hosts Ollama, Gemma4, and the embedding model. This separation improves scalability and isolates AI workloads from orchestration tasks.
Docker containers simplify deployment and maintenance. PostgreSQL with pgvector enables semantic similarity search directly inside the relational database. This architecture is modular and can evolve with future embedding models and LLM technologies.

Business Applications
1.Customer Support AI
Support teams can query manuals, troubleshooting guides, and ticket histories using natural language to accelerate customer assistance.
2.Enterprise Knowledge Management
Organizations can centralize contracts, policies, reports, and procedures into an intelligent AI search platform.
Financial Analytics
Executives can ask natural language questions about sales trends, ERP reports, invoices, and operational metrics.
Technical Details
Infrastructure Requirements:

  • EC2 #1: t3.large or t3.xlarge with PostgreSQL, pgvector, Docker, and n8n.
  • EC2 #2: GPU-enabled instance such as g6e.2xlarge for Ollama and Gemma4.
  • Ubuntu 22.04 recommended.
  • SSD storage and private VPC networking.

Implementation Details:
n8n automates email ingestion, PDF extraction, chunking, embedding generation, and vector storage. Chunk sizes around 1000 characters with overlap improve semantic retrieval. PostgreSQL pgvector performs cosine similarity searches. Gemma4 receives contextual prompts generated from retrieved chunks.
Security and Networking:
Use HTTPS with reverse proxies, encrypted EBS volumes, private networking between EC2 instances, and restricted security groups to protect sensitive enterprise data.
Estimated Costs:

  • EC2 #1 daily cost: approximately USD $2–$3.
  • EC2 #1 monthly cost: approximately USD $60–$90.
  • EC2 #2 GPU server daily cost: approximately USD $25–$40.
  • EC2 #2 monthly cost: approximately USD $750–$1,200.
  • Total monthly infrastructure: approximately USD $850–$1,400 depending on workload.

Figure 1: Architecture of two AWS EC2, one running gemma4 and the other N8N
Figure 1: Architecture of two AWS EC2, one running gemma4 and the other N8N

COMPLETE STEP-BY-STEP FLOW
STEP 1 — User Sends Data Into the System

According to the infographic:

Emails with PDFs are read on EC2 #1.

This is the beginning of the Ingestion Workflow.

The company may receive:

invoices
manuals
contracts
reports
customer emails
technical documentation
support tickets

n8n automatically monitors:

IMAP mailboxes
folders
APIs
SharePoint
Google Drive
CRMs
ERPs
What n8n Does

n8n acts as the automation orchestrator.

Example:

New email arrives
n8n detects it
Downloads PDF attachment
Starts the AI pipeline automatically

No human intervention is required.

*STEP 2 — PDF Text Extraction
*

The infographic shows:

Extract PDF Text

At this stage:

PDFs are parsed
text is extracted
metadata is collected

Metadata may include:

sender
date
filename
document type
department
customer ID
Why This Matters

LLMs cannot directly understand PDFs.

The system must convert documents into raw text before AI processing.

Example:

A 200-page manual becomes machine-readable text.

_STEP 3 — Clean and Normalize Text
_
The infographic shows:

Clean & Normalize Text

Raw PDF extraction is usually messy.

Problems include:

broken lines
duplicated spaces
headers/footers
page numbers
encoding problems
tables split incorrectly

n8n cleans the content using scripts or functions.

Example

Before cleaning:

Invoice #2939

Customer:
ACME Corp

Page 1

After cleaning:

Invoice #2939 Customer: ACME Corp
STEP 4 — Chunking the Text

The infographic shows:

Chunk Text

This is one of the MOST important steps in RAG.

Why Chunking Is Necessary

LLMs have token limits.

A 500-page document cannot be sent entirely to the model.

So the document is split into smaller pieces called:

Chunks

Example chunk size:

1000 characters
200 overlap
What Overlap Means

Suppose chunk #1 ends with:

The warranty expires after...

Chunk #2 begins with:

...after 24 months of operation.

Overlap preserves semantic continuity.

Without overlap:

information can be lost
meaning breaks between chunks
STEP 5 — Create Embeddings

The infographic shows:

Prepare Embedding Payload
Call Embedding Server

This is where semantic AI begins.

What Is an Embedding?

An embedding converts text into mathematical vectors.

The embedding model understands meaning.

Example:

"car"

and

"vehicle"

generate similar vectors.

Embedding Process

Chunk text is sent from EC2 #1 to EC2 #2.

The embedding server uses:

nomic-embed-text

to transform text into vectors.

Example:

[0.023, -0.991, 0.224, ...]

These vectors may contain:

768 dimensions
1024 dimensions
1536 dimensions

depending on the model.

Why Embeddings Are Powerful

Traditional search uses keywords.

Embeddings use:

Semantic Meaning

This means users can ask:

How long is the warranty?

Even if the document says:

Coverage remains valid for 24 months.

The system still finds the answer.

_STEP 6 — Store Embeddings in PostgreSQL pgvector
_
The infographic shows:

Store Embeddings in PostgreSQL (pgvector)

Now the vectors are saved in PostgreSQL.

What Is pgvector?

pgvector is an extension for PostgreSQL that adds:

vector storage
similarity search
AI search capabilities

Example table:

id chunk_text embedding
1 warranty info [0.12, ...]
Why PostgreSQL Is Used

Advantages:

mature database
ACID compliance
reliability
backups
indexing
SQL support
enterprise-ready

Instead of needing a separate vector DB like Pinecone or Weaviate, pgvector keeps everything inside PostgreSQL.

_STEP 7 — User Asks a Question
_

The infographic says:

User sends question (HTTPS POST /ask)

A user opens the web interface and types:

What is the warranty period for industrial pumps?

The question goes to EC2 #1.

STEP 8 — Create Embedding for the Question

The infographic shows:

Create Embedding for Question

The question itself is transformed into a vector using the SAME embedding model.

This is critical.

If documents and questions use different embedding models:

similarity breaks
retrieval quality drops
*STEP 9 — Vector Similarity Search
*

The infographic shows:

Vector Search in PostgreSQL

This is the core of RAG.

How Similarity Search Works

The question vector is compared against ALL stored chunk vectors.

Using:

cosine similarity
Euclidean distance
inner product

PostgreSQL finds the chunks mathematically closest in meaning.

Example

User asks:

How many vacation days do employees receive?

The database may retrieve:

Employees are entitled to 15 annual leave days.

even without keyword matching.

STEP 10 — Build Context

The infographic shows:

Build Context

The best matching chunks are combined together.

Example:

Chunk 1: vacation policy
Chunk 2: HR policy
Chunk 3: employment handbook

The system assembles them into context.

Why Context Is Critical

LLMs hallucinate when lacking information.

RAG prevents hallucinations by giving:

Relevant Ground Truth Data

The model answers from company knowledge.

_STEP 11 — Build Prompt for Gemma4
_

The infographic shows:

Build Prompt for Gemma4

A structured prompt is generated.

Example:

You are an enterprise assistant.

Answer ONLY using the provided context.

Context:
[retrieved chunks]

Question:
How many vacation days do employees receive?

This prompt engineering layer is extremely important.

*STEP 12 — Send to Gemma4 via Ollama
*

The infographic shows:

Send to Gemma4 (EC2 #2)

The prompt is sent to Ollama.

Ollama exposes APIs like:

/v1/chat/completions

Gemma4 processes:

context
instructions
user question

Then generates the final response.

Why Ollama Is Important

Ollama simplifies:

local LLM serving
model management
GPU usage
API exposure

Without Ollama:

running LLMs locally is much harder.

STEP 13 — Return Answer to User

The infographic ends with:

Answer is returned to the user

The final answer travels back:

Gemma4 → EC2 #1 → Web Client

The user receives a grounded response.

Example:

Employees receive 15 vacation days annually after completing one year of employment.
Why This Architecture Is Powerful

This architecture creates:

Private AI

Data stays inside AWS infrastructure.

Semantic Search

Searches by meaning, not keywords.

Scalable AI

You can scale:

database
workflows
LLM server
embedding server

independently.

Conclusions
This RAG architecture demonstrates how organizations can build private and scalable AI systems using open-source technologies. The combination of n8n, PostgreSQL pgvector, Ollama, and Gemma4 enables intelligent retrieval of enterprise knowledge while maintaining full infrastructure control. The modular design supports future scalability, model upgrades, and advanced AI workflows.

Top comments (0)