Abhijith
Building a RAG Powered Assistant with Spring AI and LM Studio

How to Create an Intelligent Document Q&A System Using Spring AI, PostgreSQL, and LM Studio


Imagine having an AI assistant that can instantly answer questions about hundreds of financial documents (quarterly reports, market analyses, policy papers) without you having to manually search through pages of text. That's exactly what Retrieval Augmented Generation (RAG) enables, and in this tutorial, we'll build one from scratch using Spring Boot.

By the end of this guide, you'll have a fully functional application that:

  • Ingests PDF documents and extracts their content
  • Converts text into semantic embeddings using AI models
  • Stores embeddings in a PostgreSQL vector database
  • Answers natural language queries with contextual accuracy

What is RAG and Why Does It Matter?

Retrieval Augmented Generation (RAG) is a technique that enhances Large Language Models (LLMs) by providing them with relevant context from external knowledge sources. Instead of relying solely on the model's training data, RAG systems:

  • Retrieve relevant documents based on semantic similarity
  • Augment the LLM's prompt with retrieved context
  • Generate accurate, grounded answers

This approach is particularly powerful for:

  • Enterprise knowledge bases with proprietary information
  • Financial document analysis and compliance
  • Customer support systems with extensive documentation
  • Research paper exploration and literature reviews

Architecture Overview

Our FinanceRag application follows a straightforward yet powerful architecture:

1. Document Ingestion Pipeline

PDF documents are read from the classpath and processed by Spring AI's ParagraphPdfDocumentReader, which extracts text while preserving paragraph structure.

2. Text Chunking

The TokenTextSplitter divides the extracted text into manageable chunks (800 tokens each). This is crucial because:

  • Embedding models have token limits
  • Smaller chunks provide more precise semantic matching
  • Context windows in LLMs benefit from focused, relevant information

3. Vector Embedding Generation

Each text chunk is converted into a high-dimensional vector (embedding) using the nomic-embed-text model. These embeddings capture semantic meaning: similar concepts cluster together in vector space.
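To make "clustering in vector space" concrete, here is a toy illustration of cosine similarity, the measure pgvector uses to compare embeddings. The three-dimensional vectors are made up for demonstration; real nomic-embed-text embeddings have 768 dimensions:

```java
// Toy demonstration of cosine similarity: vectors pointing in similar
// directions score close to 1.0, unrelated vectors score much lower.
public class CosineSimilarity {

    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] cat    = {0.90, 0.10, 0.20};  // hypothetical "cat" embedding
        double[] kitten = {0.85, 0.15, 0.25};  // semantically close to "cat"
        double[] bond   = {0.10, 0.90, 0.80};  // unrelated financial term

        System.out.println(cosine(cat, kitten)); // high, near 1.0
        System.out.println(cosine(cat, bond));   // much lower
    }
}
```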

4. Vector Storage with pgvector

Embeddings are persisted in PostgreSQL using the pgvector extension, which enables efficient similarity searches. We use HNSW indexing for fast approximate nearest neighbor (ANN) queries.

5. Query Processing

When a user asks a question:

  1. The question is embedded using the same model
  2. Vector similarity search retrieves the most relevant document chunks
  3. The QuestionAnswerAdvisor augments the LLM prompt with this context
  4. The LLM generates a contextual answer

Building the Application: Step by Step

Prerequisites

Before diving into code, ensure you have:

  • Java 17+ installed
  • PostgreSQL 12+ with pgvector extension
  • LM Studio (or another OpenAI-compatible LLM endpoint)
  • Maven 3+ for dependency management

Setting Up PostgreSQL with pgvector

You have two options for setting up PostgreSQL:

Option 1: Using Docker (Recommended for Quick Start)

The repository includes a compose.yaml file for easy setup:

services:
  postgres:
    image: pgvector/pgvector:pg16
    ports:
      - "55419:5432"
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: finance
    volumes:
      - pgdata:/var/lib/postgresql/data

volumes:
  pgdata:

Simply run:

docker-compose up -d

This spins up PostgreSQL with pgvector pre-installed on port 55419.

Option 2: Manual Installation

First, create a database and enable the vector extension:

CREATE DATABASE finance;
\c finance
CREATE EXTENSION IF NOT EXISTS vector;

The pgvector extension adds a new vector data type to PostgreSQL, enabling efficient storage and querying of high-dimensional vectors.
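To see what this buys us, here is a hand-written sketch of the kind of table and index Spring AI sets up (the actual schema generated by initialize-schema=true may differ in names and details):

```sql
-- Sketch of a pgvector-backed store; vector(768) matches nomic-embed-text.
CREATE TABLE IF NOT EXISTS vector_store (
    id        uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    content   text,
    metadata  json,
    embedding vector(768)
);

-- HNSW index for fast approximate nearest-neighbor search (cosine distance).
CREATE INDEX ON vector_store USING hnsw (embedding vector_cosine_ops);

-- The <=> operator computes cosine distance; lower means more similar.
-- (768-element vector literal abbreviated here.)
SELECT content
FROM vector_store
ORDER BY embedding <=> '[0.1, 0.2, ...]'::vector
LIMIT 5;
```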

Configuring Spring Boot

The application.properties file should include:

# Application Name
spring.application.name=financeRag

# Database Configuration (Docker setup)
spring.datasource.url=jdbc:postgresql://localhost:55419/finance
spring.datasource.username=postgres
spring.datasource.password=postgres

# LLM Configuration (LM Studio)
spring.ai.openai.base-url=http://localhost:1234/
spring.ai.openai.api-key=dummy

# Embedding Model
spring.ai.openai.embedding.options.model=nomic-embed-text

# Chat Model
spring.ai.openai.chat.options.model=google/gemma-3-4b

# Vector Store Configuration
spring.ai.vectorstore.pgvector.initialize-schema=true

# Ingestion Control (IMPORTANT!)
financerag.ingest.enabled=true

Key Configuration Notes:

  • Port 55419 matches the Docker Compose setup
  • The initialize-schema=true automatically creates the vector store table
  • nomic-embed-text is a lightweight, high-quality embedding model
  • google/gemma-3-4b is the chat model served by LM Studio

**Important: Ingestion Control**

The financerag.ingest.enabled property is a smart optimization:

First Run (Initial Setup):

financerag.ingest.enabled=true

This processes your PDFs and populates the vector store.

Subsequent Runs:

financerag.ingest.enabled=false

This skips ingestion and starts the application immediately. The embeddings are already in PostgreSQL, so there's no need to re-process documents every time!

This design prevents:

  • Duplicate embeddings in the database
  • Slow startup times on every restart
  • Unnecessary LLM API calls

Setting Up LM Studio

  1. Download and Install LM Studio from lmstudio.ai

  2. Download the Required Models:

    • Embedding Model: Search for "nomic-embed-text" in LM Studio and download it
    • Chat Model: Search for "google/gemma-3-4b" (or similar) and download it
  3. Start the Local Server:

    • Open LM Studio
    • Go to the "Local Server" tab
    • Select your chat model (gemma-3-4b)
    • Click "Start Server" (it will run on http://localhost:1234 by default)
    • Ensure the embedding model is also loaded
  4. Verify the Connection:

   curl http://localhost:1234/v1/models

You should see your loaded models listed.

The Ingestion Service

The heart of our document processing pipeline is the IngestionService. Here's how it works:

@Component
@ConditionalOnProperty(
    name = "financerag.ingest.enabled",
    havingValue = "true",
    matchIfMissing = false
)
public class IngestionService implements CommandLineRunner {
    private static final Logger logger = LoggerFactory.getLogger(IngestionService.class);

    private final VectorStore vectorStore;

    @Value("classpath:/docs/article.pdf")
    private Resource pdfResource;

    public IngestionService(VectorStore vectorStore) {
        this.vectorStore = vectorStore;
    }

    @Override
    public void run(String... args) throws Exception {
        logger.info("Starting data ingestion process...");

        // 1. Read PDF using paragraph-based reader
        var pdfReader = new ParagraphPdfDocumentReader(pdfResource);

        // 2. Split text into chunks
        TextSplitter splitter = new TokenTextSplitter();

        // 3. Process and store in vector database
        vectorStore.accept(splitter.apply(pdfReader.get()));

        logger.info("Vector store updated with PDF content.");
    }
}

Key Implementation Insights:

1. Conditional Ingestion
The @ConditionalOnProperty annotation ensures ingestion runs only when you explicitly enable it:

# Enable ingestion on first run
financerag.ingest.enabled=true

# Disable after initial setup to avoid re-ingesting
financerag.ingest.enabled=false

This prevents re-processing documents on every application restart!

2. CommandLineRunner Interface
By implementing CommandLineRunner, the ingestion happens automatically after Spring Boot starts, but before the application begins serving requests.

3. ParagraphPdfDocumentReader vs PagePdfDocumentReader
The code uses ParagraphPdfDocumentReader, which:

  • Preserves document structure better by respecting paragraph boundaries
  • Creates more semantically meaningful chunks
  • Better suited for financial documents with structured content

4. Simplified API
The vectorStore.accept() method elegantly handles:

  • Embedding generation for each chunk
  • Batch insertion into PostgreSQL
  • All the complexity hidden behind a clean API

The Chat Controller

Now let's expose a REST endpoint for queries:

@RestController
public class ChatController {

    private final ChatClient chatClient;

    public ChatController(ChatClient.Builder chatClient, PgVectorStore vectorStore) {
        this.chatClient = chatClient
            .defaultAdvisors(QuestionAnswerAdvisor.builder(vectorStore).build())
            .build();
    }

    @GetMapping("/chat")
    public String chat(@RequestParam String question) {
        return chatClient.prompt()
            .user(question)
            .call()
            .content();
    }
}

The Magic of QuestionAnswerAdvisor:

The QuestionAnswerAdvisor is where RAG happens. Behind the scenes, it:

  • Converts the user's question into an embedding
  • Performs a similarity search against the vector store
  • Injects the most relevant document chunks into the prompt
  • Sends the augmented prompt to the LLM

Key Implementation Details:

  • The advisor is built using the builder pattern: QuestionAnswerAdvisor.builder(vectorStore).build()
  • Spring AI automatically handles the vector search and context injection
  • The controller method stays simple: just pass the question through the chat client
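If the defaults don't fit your documents, the advisor's retrieval step can be tuned. A hedged sketch, assuming a recent Spring AI 1.0.x builder API (older milestones used SearchRequest.defaults() with wither methods instead); the topK and similarityThreshold values are illustrative starting points:

```java
// Sketch: tune how many chunks are retrieved and how similar they must be.
var advisor = QuestionAnswerAdvisor.builder(vectorStore)
        .searchRequest(SearchRequest.builder()
                .topK(5)                    // retrieve the 5 closest chunks
                .similarityThreshold(0.7)   // drop weakly related matches
                .build())
        .build();

this.chatClient = chatClientBuilder
        .defaultAdvisors(advisor)
        .build();
```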

Real World Considerations

Choosing the Right Chunk Size

The 800-token chunk size is a starting point. Consider:

  • Smaller chunks (200-400 tokens): Better precision, but may lose context
  • Larger chunks (1000-1500 tokens): More context, but less precise matching

Experiment with your specific use case. Financial reports might need larger chunks to preserve numerical context, while FAQs work better with smaller, focused chunks.
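Chunk size is set when constructing the splitter. A sketch assuming Spring AI's TokenTextSplitter constructor, whose arguments are (in order) chunk size in tokens, minimum chunk size in characters, minimum chunk length to embed, maximum number of chunks per document, and whether to keep separators; the values below are illustrative, not recommendations:

```java
// Sketch: a splitter tuned for larger, context-preserving chunks,
// e.g. for financial reports where numbers need surrounding context.
TextSplitter splitter = new TokenTextSplitter(1200, 350, 5, 10000, true);
```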

Scaling to Production

For production deployments, consider:

  • Async ingestion: Move document processing to background jobs
  • Caching: Cache embeddings for frequently accessed documents
  • Metadata filtering: Add tags (date, category, source) to narrow searches
  • Monitoring: Track query latency and similarity scores

Hybrid Search Strategies

Pure vector search isn't always optimal. Combine it with:

  • Full-text search: For exact keyword matches
  • BM25 ranking: Traditional relevance scoring
  • Re-ranking: Use a cross-encoder model to refine top results

Testing Your RAG System

Start the application and test with curl:

curl "http://localhost:8080/chat?question=What%20were%20the%20key%20trends%20in%20Q4%20earnings?"

You should see an answer grounded in your ingested documents. Compare responses with and without RAG to appreciate the difference in accuracy and relevance.

Common Pitfalls and Solutions

1. Embedding Dimension Mismatch

Problem: Embeddings fail to store with dimension errors.

Solution: Ensure spring.ai.vectorstore.pgvector.dimensions matches your embedding model. For nomic-embed-text, use 768.
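For example, in application.properties:

```properties
# Must match the embedding model's output dimension (768 for nomic-embed-text)
spring.ai.vectorstore.pgvector.dimensions=768
```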

2. Poor Retrieval Quality

Problem: Answers don't align with document content.

Solution: Adjust chunk size, increase topK, or lower the similarity threshold. Also verify your embedding model is appropriate for your domain.

3. Memory Issues During Ingestion

Problem: Application crashes with OutOfMemoryError.

Solution: Process documents in batches, increase JVM heap size (-Xmx4g), or limit the maxNumChunks parameter.

Extending FinanceRag: Ideas for Enhancement

This project is a foundation. Here are some powerful extensions:

Multi Document Support

Instead of hardcoding a single PDF, scan a directory or accept uploads via REST API. Add metadata (filename, upload date) to enable filtered searches.
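A minimal sketch of directory-based ingestion inside the existing run method, using Spring's PathMatchingResourcePatternResolver (the docs/*.pdf pattern and the "source" metadata key are illustrative choices, not part of the original project):

```java
// Sketch: ingest every PDF under /docs and tag each chunk with its source file.
var resolver = new PathMatchingResourcePatternResolver();
TextSplitter splitter = new TokenTextSplitter();

for (Resource pdf : resolver.getResources("classpath:/docs/*.pdf")) {
    var reader = new ParagraphPdfDocumentReader(pdf);
    List<Document> chunks = splitter.apply(reader.get());
    // Attach metadata so queries can later be filtered by source document.
    chunks.forEach(doc -> doc.getMetadata().put("source", pdf.getFilename()));
    vectorStore.accept(chunks);
}
```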

Conversational Memory

Implement session based chat history so users can ask follow up questions without repeating context. Spring AI supports this with MessageChatMemoryAdvisor.
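A hedged sketch of wiring memory alongside RAG in the controller's constructor, assuming the Spring AI 1.0 API (class and builder names may differ in older milestones); the 20-message window is an illustrative value:

```java
// Sketch: combine RAG with per-conversation memory.
ChatMemory chatMemory = MessageWindowChatMemory.builder()
        .maxMessages(20)   // keep only the last 20 messages per conversation
        .build();

this.chatClient = builder
        .defaultAdvisors(
                MessageChatMemoryAdvisor.builder(chatMemory).build(),
                QuestionAnswerAdvisor.builder(vectorStore).build())
        .build();
```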

Source Attribution

Return not just the answer but citations showing which document chunks were used. This builds trust and allows users to verify information.

Advanced Analytics

Track which documents are queried most frequently, average similarity scores, and query patterns to identify knowledge gaps.

Conclusion

You've now built a working RAG system that can intelligently answer questions about your documents. With the production considerations above, this architecture can scale to thousands of documents and be adapted for countless use cases: customer support, legal document analysis, medical research, and more.

The beauty of Spring AI is how it abstracts the complexity of embeddings, vector stores, and LLM orchestration, letting you focus on business logic. With just three components (IngestionService, ChatController, and pgvector), we've created a powerful AI assistant.

The full source code for FinanceRag is available on GitHub. Clone it, experiment with different models and chunk sizes, and adapt it to your domain. The future of enterprise AI is built on foundations like these, combining the power of LLMs with your organization's proprietary knowledge.

Special Thanks: This project was inspired by the excellent Spring AI content from Dan Vega, whose tutorials have helped countless developers understand the power of RAG architectures.

Happy coding, and may your AI assistants always retrieve the right context!


About the Author

This tutorial is brought to you by Abhijith Rajesh.


Top comments (2)

Merlin Varghese

Great work! πŸ™Œ

Baby Susan

Really helpfulπŸ‘πŸ‘