This is a submission for the Gemma 4 Challenge: Write About Gemma 4
A Complete Enterprise AI Knowledge Retrieval Architecture for Private Document Intelligence
Summary
This article explains a Retrieval-Augmented Generation (RAG) architecture using n8n, PostgreSQL with pgvector, Ollama, and Gemma4 running on AWS EC2. The platform automatically ingests emails and PDFs, creates embeddings, stores vectors in PostgreSQL, and retrieves contextual information to generate AI answers grounded in enterprise data.
Content
You can view a video of this Gemma4 architecture here:
https://www.youtube.com/watch?v=bTP-sNKlsxc
RAG architectures combine vector search with large language models. In this solution, n8n orchestrates ingestion workflows and query processing. Emails and PDF documents are read automatically, text is extracted and cleaned, then split into semantic chunks. The chunks are embedded using the nomic-embed-text model and stored in PostgreSQL pgvector. When users ask questions, the question is embedded and compared against stored vectors to retrieve the most relevant chunks. Gemma4 then generates a final response using retrieved context.
The architecture uses two AWS EC2 instances. The first server hosts n8n, PostgreSQL, Docker, and orchestration services. The second server hosts Ollama, Gemma4, and the embedding model. This separation improves scalability and isolates AI workloads from orchestration tasks.
Docker containers simplify deployment and maintenance. PostgreSQL with pgvector enables semantic similarity search directly inside the relational database. This architecture is modular and can evolve with future embedding models and LLM technologies.
Business Applications
1.Customer Support AI
Support teams can query manuals, troubleshooting guides, and ticket histories using natural language to accelerate customer assistance.
2.Enterprise Knowledge Management
Organizations can centralize contracts, policies, reports, and procedures into an intelligent AI search platform.
Financial Analytics
Executives can ask natural language questions about sales trends, ERP reports, invoices, and operational metrics.
Technical Details
Infrastructure Requirements:
- EC2 #1: t3.large or t3.xlarge with PostgreSQL, pgvector, Docker, and n8n.
- EC2 #2: GPU-enabled instance such as g6e.2xlarge for Ollama and Gemma4.
- Ubuntu 22.04 recommended.
- SSD storage and private VPC networking.
Implementation Details:
n8n automates email ingestion, PDF extraction, chunking, embedding generation, and vector storage. Chunk sizes around 1000 characters with overlap improve semantic retrieval. PostgreSQL pgvector performs cosine similarity searches. Gemma4 receives contextual prompts generated from retrieved chunks.
Security and Networking:
Use HTTPS with reverse proxies, encrypted EBS volumes, private networking between EC2 instances, and restricted security groups to protect sensitive enterprise data.
Estimated Costs:
- EC2 #1 daily cost: approximately USD $2–$3.
- EC2 #1 monthly cost: approximately USD $60–$90.
- EC2 #2 GPU server daily cost: approximately USD $25–$40.
- EC2 #2 monthly cost: approximately USD $750–$1,200.
- Total monthly infrastructure: approximately USD $850–$1,400 depending on workload.

Figure 1: Architecture of two AWS EC2, one running gemma4 and the other N8N
COMPLETE STEP-BY-STEP FLOW
STEP 1 — User Sends Data Into the System
According to the infographic:
Emails with PDFs are read on EC2 #1.
This is the beginning of the Ingestion Workflow.
The company may receive:
invoices
manuals
contracts
reports
customer emails
technical documentation
support tickets
n8n automatically monitors:
IMAP mailboxes
folders
APIs
SharePoint
Google Drive
CRMs
ERPs
What n8n Does
n8n acts as the automation orchestrator.
Example:
New email arrives
n8n detects it
Downloads PDF attachment
Starts the AI pipeline automatically
No human intervention is required.
*STEP 2 — PDF Text Extraction
*
The infographic shows:
Extract PDF Text
At this stage:
PDFs are parsed
text is extracted
metadata is collected
Metadata may include:
sender
date
filename
document type
department
customer ID
Why This Matters
LLMs cannot directly understand PDFs.
The system must convert documents into raw text before AI processing.
Example:
A 200-page manual becomes machine-readable text.
_STEP 3 — Clean and Normalize Text
_
The infographic shows:
Clean & Normalize Text
Raw PDF extraction is usually messy.
Problems include:
broken lines
duplicated spaces
headers/footers
page numbers
encoding problems
tables split incorrectly
n8n cleans the content using scripts or functions.
Example
Before cleaning:
Invoice #2939
Customer:
ACME Corp
Page 1
After cleaning:
Invoice #2939 Customer: ACME Corp
STEP 4 — Chunking the Text
The infographic shows:
Chunk Text
This is one of the MOST important steps in RAG.
Why Chunking Is Necessary
LLMs have token limits.
A 500-page document cannot be sent entirely to the model.
So the document is split into smaller pieces called:
Chunks
Example chunk size:
1000 characters
200 overlap
What Overlap Means
Suppose chunk #1 ends with:
The warranty expires after...
Chunk #2 begins with:
...after 24 months of operation.
Overlap preserves semantic continuity.
Without overlap:
information can be lost
meaning breaks between chunks
STEP 5 — Create Embeddings
The infographic shows:
Prepare Embedding Payload
Call Embedding Server
This is where semantic AI begins.
What Is an Embedding?
An embedding converts text into mathematical vectors.
The embedding model understands meaning.
Example:
"car"
and
"vehicle"
generate similar vectors.
Embedding Process
Chunk text is sent from EC2 #1 to EC2 #2.
The embedding server uses:
nomic-embed-text
to transform text into vectors.
Example:
[0.023, -0.991, 0.224, ...]
These vectors may contain:
768 dimensions
1024 dimensions
1536 dimensions
depending on the model.
Why Embeddings Are Powerful
Traditional search uses keywords.
Embeddings use:
Semantic Meaning
This means users can ask:
How long is the warranty?
Even if the document says:
Coverage remains valid for 24 months.
The system still finds the answer.
_STEP 6 — Store Embeddings in PostgreSQL pgvector
_
The infographic shows:
Store Embeddings in PostgreSQL (pgvector)
Now the vectors are saved in PostgreSQL.
What Is pgvector?
pgvector is an extension for PostgreSQL that adds:
vector storage
similarity search
AI search capabilities
Example table:
id chunk_text embedding
1 warranty info [0.12, ...]
Why PostgreSQL Is Used
Advantages:
mature database
ACID compliance
reliability
backups
indexing
SQL support
enterprise-ready
Instead of needing a separate vector DB like Pinecone or Weaviate, pgvector keeps everything inside PostgreSQL.
_STEP 7 — User Asks a Question
_
The infographic says:
User sends question (HTTPS POST /ask)
A user opens the web interface and types:
What is the warranty period for industrial pumps?
The question goes to EC2 #1.
STEP 8 — Create Embedding for the Question
The infographic shows:
Create Embedding for Question
The question itself is transformed into a vector using the SAME embedding model.
This is critical.
If documents and questions use different embedding models:
similarity breaks
retrieval quality drops
*STEP 9 — Vector Similarity Search
*
The infographic shows:
Vector Search in PostgreSQL
This is the core of RAG.
How Similarity Search Works
The question vector is compared against ALL stored chunk vectors.
Using:
cosine similarity
Euclidean distance
inner product
PostgreSQL finds the chunks mathematically closest in meaning.
Example
User asks:
How many vacation days do employees receive?
The database may retrieve:
Employees are entitled to 15 annual leave days.
even without keyword matching.
STEP 10 — Build Context
The infographic shows:
Build Context
The best matching chunks are combined together.
Example:
Chunk 1: vacation policy
Chunk 2: HR policy
Chunk 3: employment handbook
The system assembles them into context.
Why Context Is Critical
LLMs hallucinate when lacking information.
RAG prevents hallucinations by giving:
Relevant Ground Truth Data
The model answers from company knowledge.
_STEP 11 — Build Prompt for Gemma4
_
The infographic shows:
Build Prompt for Gemma4
A structured prompt is generated.
Example:
You are an enterprise assistant.
Answer ONLY using the provided context.
Context:
[retrieved chunks]
Question:
How many vacation days do employees receive?
This prompt engineering layer is extremely important.
*STEP 12 — Send to Gemma4 via Ollama
*
The infographic shows:
Send to Gemma4 (EC2 #2)
The prompt is sent to Ollama.
Ollama exposes APIs like:
/v1/chat/completions
Gemma4 processes:
context
instructions
user question
Then generates the final response.
Why Ollama Is Important
Ollama simplifies:
local LLM serving
model management
GPU usage
API exposure
Without Ollama:
running LLMs locally is much harder.
STEP 13 — Return Answer to User
The infographic ends with:
Answer is returned to the user
The final answer travels back:
Gemma4 → EC2 #1 → Web Client
The user receives a grounded response.
Example:
Employees receive 15 vacation days annually after completing one year of employment.
Why This Architecture Is Powerful
This architecture creates:
Private AI
Data stays inside AWS infrastructure.
Semantic Search
Searches by meaning, not keywords.
Scalable AI
You can scale:
database
workflows
LLM server
embedding server
independently.
Conclusions
This RAG architecture demonstrates how organizations can build private and scalable AI systems using open-source technologies. The combination of n8n, PostgreSQL pgvector, Ollama, and Gemma4 enables intelligent retrieval of enterprise knowledge while maintaining full infrastructure control. The modular design supports future scalability, model upgrades, and advanced AI workflows.
Top comments (0)