Piyush Kumar Singh

Posted on May 18 • Originally published at Medium

Spring AI Explained — ChatClient, RAG, Advisors, and Every Core Component

#webdev #ai #java #springboot

Most Spring AI tutorials jump straight to code. You copy the dependency, paste the config, call ChatClient, and something works. But when you need to actually build something — a chatbot that remembers conversations, an API that answers questions from your own documents — you hit a wall. Because you don't know what's actually doing what. Friend’s Link

What Spring AI actually is — in one sentence

Spring AI is an abstraction layer that lets you wire LLMs into your Spring Boot app without hardcoding any particular AI provider.

That last part matters. OpenAI, Google Gemini, Anthropic Claude, and Ollama are running locally on your machine — Spring AI talks to all of them through the same API. Swap providers without touching your business logic. That’s the entire value proposition, and everything else is built on top of it.

Spring AI Components

ChatClient — the front door
ChatClient is the component you'll interact with the most. It's the fluent API that sits at the top of the stack and handles the actual request-response cycle with the LLM.

Think of it like a RestTemplate or a WebClient— but instead of calling a REST endpoint, you're sending a prompt and getting a response back. It handles all the low-level connection details, request formatting, and response parsing so you don't have to.

What makes ChatClient genuinely well-designed is its fluent builder style. You don't configure it once globally and hope for the best. Each call is composable — you can set the system prompt, attach advisors, pass user input, and control the output format all in one readable chain.

It also separates two things that often get conflated: the default configuration you set at startup (your system prompt, default advisors, model parameters) and the per-request configuration you apply at call time. That separation matters in production, where different endpoints need different behaviours from the same underlying client.

PromptTemplate — how you talk to the LLM properly

A raw string shoved into an LLM is not a prompt. A prompt is a structured piece of text with placeholders, context, and instructions — and this PromptTemplate is how Spring AI handles that.

The idea is simple: you define a template with variables, and at runtime, you fill those variables in. Instead of building prompt strings with Java string concatenation — which gets messy fast — you define the shape of the prompt separately from the data that goes into it.

This matters for three reasons. First, it keeps prompts readable and maintainable. Second, it separates the “what to ask” from the “what data to inject” which is the same separation concerns you apply everywhere else in your codebase. Third, it makes prompt versioning possible. When your prompt needs tweaking, you’re editing a template, not hunting through business logic.

PromptTemplate also gives you a proper Prompt object that carries both the human message and the system message. That distinction — system prompt (the instructions) vs user prompt (the question) — is one of the most important things to understand when working with LLMs, and Spring AI models it explicitly.

EmbeddingModel — the piece that makes search smart

An EmbeddingModel takes text and converts it into a vector — a list of floating point numbers that represents the meaning of that text in multi-dimensional space.

That sounds abstract. Here’s the concrete thing to grasp: two pieces of text that mean similar things will produce vectors that are close to each other mathematically. “What’s your return policy?” and “How do I get a refund?” are different strings, but their vectors will be very close — because semantically, they’re the same question.

This is what makes semantic search possible. Traditional search matches keywords. Embedding-based search matches meaning. A user asking “how do I cancel my order” will find a document titled “Order cancellation policy” even if the words don’t overlap, because the meanings are geometrically close in vector space.

In Spring AI, EmbeddingModel is the interface that abstracts over whatever embedding service you're using — OpenAI's text-embedding-ada-002, Gemini's embedding API, or a local model via Ollama. The abstraction is consistent regardless of provider, which means your RAG pipeline doesn't break if you switch models.

VectorStore — where embeddings live

VectorStore is the database for embeddings. You put vectors in, and you query them by similarity — "give me the top 5 stored vectors that are closest to this query vector."

It’s worth understanding that this is a fundamentally different kind of database from what you’re used to. You don’t query it with SQL. You don’t look things up by ID. You ask: which stored content is most semantically similar to this input? And it returns the matches ranked by similarity score.

Spring AI’s VectorStore interface abstracts over the actual storage engine underneath. In development, you might use SimpleVectorStore an in-memory implementation. In production, you'd swap to Pinecone, Weaviate, pgvector on top of Postgres, or Elasticsearch. The interface stays identical. Your code doesn't change.

The VectorStore is also responsible for handling the metadata that travels alongside each vector — the document title, page number, source URL, whatever you stored at ingestion time. When it returns matching chunks, that metadata comes with it, so your prompt builder knows where the information came from.

Advisors — the middleware nobody talks about enough

This is the component most tutorials skip, and it’s arguably the most powerful part of the whole framework.

An Advisor in Spring AI is a piece of middleware that wraps around every ChatClient request. Before the request goes to the LLM, advisors can intercept it and modify it — add context, inject memory, apply safety rules, log the conversation, filter the input. After the response comes back, they can post-process it too.

The important thing to understand is that advisors form a chain. Each one wraps the next, like servlet filters in a web application. You configure which advisors run in which order, and each one has a defined responsibility.

QuestionAnswerAdvisor is the one you'll use for RAG. Before your question reaches the LLM, this advisor takes that question, queries VectorStore for the most relevant chunks, and injects them into the prompt automatically. From ChatClient's perspective, you just asked a question. Internally, your question has been enriched with your own data before the LLM ever sees it.

MessageChatMemoryAdvisor is what makes conversations persistent. Without it, every call to ChatClient starts fresh — no memory of what was said before. With it, previous turns from ChatMemory are injected into each new request so the LLM has context.

You can write your own advisors too. Any cross-cutting concern that applies to every LLM call — rate limiting, PII detection, response caching, A/B testing between prompts — belongs in an advisor, not in your business logic.

ChatMemory — giving the LLM a memory

LLMs are stateless. Every API call is completely independent. Ask an LLM “what’s the capital of France,” then ask “what did I just ask you,” and it has no idea — because, from its perspective, that second request is the first thing you’ve ever said.

ChatMemory is how Spring AI solves this. It's a storage abstraction for conversation history. After each exchange, the message — both the user's question and the LLM's response — gets saved. On the next request, that history gets loaded and injected into the prompt so the LLM has context.

InMemoryChatMemory is the default — history lives in your application's heap and disappears on restart. That's fine for development and short stateless sessions. For production chatbots that need to remember users across sessions, you'd implement a persistent ChatMemory backed by Redis or a database.

There’s a real constraint here worth knowing upfront: every message you inject into the conversation history costs tokens. LLMs have a context window limit — usually somewhere between 8K and 128K tokens, depending on the model. If a conversation goes on long enough, the accumulated history will either exceed the limit and fail, or you’ll need to implement a summarisation strategy to compress older messages.

This is not a Spring AI problem — it’s a fundamental LLM constraint. But ChatMemory is where you manage it.

RAG Flow

How RAG brings it all together
RAG — Retrieval-Augmented Generation — is the pattern that makes Spring AI genuinely production-useful. The diagram above shows both phases. Here’s the thinking behind it.

The core problem: your LLM knows nothing about your company. It doesn’t know your product documentation, your internal policies, your customer data. Fine-tuning a model on your data is expensive, slow, and goes stale every time the data changes.

RAG is the pragmatic answer. Instead of teaching the model your data, you just hand it the relevant pages at the moment it needs them. Like giving a contractor a specific clause from the contract rather than asking them to memorise the whole thing.

The ingestion phase runs once, or whenever your data changes. Your documents are loaded, split into manageable chunks, embedded into vectors, and stored in a VectorStore. This is how your data gets indexed for semantic retrieval.

The query phase runs on every request. The user’s question is embedded into a vector. That vector is used to query the VectorStore for the closest matching chunks. Those chunks — plus the original question — get injected into the prompt. The LLM reads them as context and answers based on what it finds there.

The LLM never “learned” your data. It reads it fresh on each request, like an open-book exam. That framing matters because it sets the right expectations: if the relevant information isn’t in the retrieved chunks, the model will still try to answer — and that’s when hallucinations happen. RAG reduces hallucinations by providing grounding. It doesn’t eliminate them.

The part that controls retrieval quality isn’t the LLM and isn’t the vector database — it’s the chunking strategy. How you split your documents determines what gets retrieved. A chunk that’s too large buries the relevant detail in noise. A chunk too small loses the surrounding context that makes it meaningful. Getting chunking right is usually where the real tuning work happens.

The one-line mental model for each component

ChatClient — you talk to the LLM through this. PromptTemplate — You structure what you say. EmbeddingModel — converts meaning into math. VectorStore — stores and searches that math. Advisors — middleware that enriches every request automatically. ChatMemory — gives the conversation a past. Together, they’re the full stack for building LLM features that actually behave like software — predictable, configurable, and debuggable.

Top comments (1)

Piyush Kumar Singh • May 18

Curious — are any of you already using Spring AI in prod, or still experimenting? I'm about to write Part 2 on Tool Calling and want to know what people are hitting in real projects.