Standard Retrieval-Augmented Generation (RAG) follows a simple, linear path: take a user query, find similar documents, and send them to the LLM. While effective for basic FAQs, this “one-shot” approach fails when faced with complex, real-world problems.
This guide explores Multistep RAG, a sophisticated pattern where the AI system functions as an agent — reasoning, searching, and refining its answers through multiple iterations.
1. When is Multistep RAG Required?
In a production environment, you should transition from simple RAG to a multistep architecture when your system encounters:
- Multi-Hop Reasoning: When an answer requires connecting two unrelated facts (e.g., “How does the CEO’s bonus in our 2023 report compare to the market average for tech firms in 2024?”). One search won’t find both pieces of data.
- Missing Context (The “Web Bridge”): When internal documents are outdated or incomplete. The system must recognize it lacks info and “step out” to a web search tool.
- Query Ambiguity: When the user’s initial question is too broad. The system needs a “Query Transformation” step to break one question into three specific sub-queries for the vector database.
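Spring AI ships a building block for that transformation step. Below is a minimal sketch of query expansion using MultiQueryExpander from the RAG module; the demo class and method are illustrative, and the expander uses the LLM itself to generate the sub-query variants.

    import java.util.List;

    import org.springframework.ai.chat.client.ChatClient;
    import org.springframework.ai.rag.Query;
    import org.springframework.ai.rag.preretrieval.query.expansion.MultiQueryExpander;

    // Expands one broad question into several focused sub-queries,
    // each of which can then be sent to the vector store separately.
    public class QueryTransformationDemo {

        public static List<Query> expand(ChatClient.Builder builder, String question) {
            MultiQueryExpander expander = MultiQueryExpander.builder()
                    .chatClientBuilder(builder)
                    .numberOfQueries(3) // one broad question -> three specific ones
                    .build();
            return expander.expand(new Query(question));
        }
    }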
2. Project Configuration (pom.xml)
As of late 2025, Spring AI 1.1.x provides the most robust support for these patterns; this guide pins version 1.1.1 via the BOM. Ensure you have the following dependencies:
    <dependencies>
        <dependency>
            <groupId>org.springframework.ai</groupId>
            <artifactId>spring-ai-starter-model-openai</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.ai</groupId>
            <artifactId>spring-ai-starter-vector-store-pgvector</artifactId>
        </dependency>
        <!-- Web search integration; verify this artifact name against your Spring AI version -->
        <dependency>
            <groupId>org.springframework.ai</groupId>
            <artifactId>spring-ai-tavily-ai-spring-boot-starter</artifactId>
        </dependency>
    </dependencies>

    <dependencyManagement>
        <dependencies>
            <dependency>
                <groupId>org.springframework.ai</groupId>
                <artifactId>spring-ai-bom</artifactId>
                <version>1.1.1</version>
                <type>pom</type>
                <scope>import</scope>
            </dependency>
        </dependencies>
    </dependencyManagement>
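With the BOM in place, point the starters at your model and database. A minimal application.properties sketch (the datasource values are placeholders, and the Tavily key property is omitted because its name depends on the integration you use):

    spring.ai.openai.api-key=${OPENAI_API_KEY}

    spring.datasource.url=jdbc:postgresql://localhost:5432/ragdb
    spring.datasource.username=postgres
    spring.datasource.password=postgres

    spring.ai.vectorstore.pgvector.initialize-schema=true
    spring.ai.vectorstore.pgvector.dimensions=1536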
3. Implementation: Multistep RAG with Web Search
In this architecture, the ChatClient uses a RetrievalAugmentationAdvisor to fetch internal data and a registered webSearch tool for external web searches, which the model invokes when it determines the internal context is insufficient.
Step 1: Define the Web Search Tool
First, we expose a web search function as a Spring Bean. The LLM will “see” this tool and its description.
    import java.util.function.Function;

    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.context.annotation.Description;

    @Configuration
    public class AiToolsConfig {

        @Bean
        @Description("Search the internet for real-time news, current events, or missing technical data.")
        public Function<SearchRequest, String> webSearch(TavilyAiApi tavilyApi) {
            // TavilyAiApi stands in for whatever client class your Tavily starter provides
            return request -> {
                var response = tavilyApi.search(new TavilyAiApi.SearchRequest(request.query()));
                return response.results().toString();
            };
        }

        public record SearchRequest(String query) {}
    }
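Note that the bean name (webSearch) is the handle used to register the tool on the ChatClient later, while the @Description text is what the model reads when deciding whether to call it. Keep the description specific; a vague one leads the model to either ignore the tool or overuse it.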
Step 2: The Service with Re-ranking logic
To ensure the model isn’t overwhelmed by “noise” from the web or internal docs, we implement a DocumentPostProcessor for re-ranking.
    import java.util.List;

    import org.springframework.ai.chat.client.ChatClient;
    import org.springframework.ai.rag.advisor.RetrievalAugmentationAdvisor;
    import org.springframework.ai.rag.retrieval.search.VectorStoreDocumentRetriever;
    import org.springframework.ai.vectorstore.VectorStore;
    import org.springframework.stereotype.Service;

    @Service
    public class AdvancedRagService {

        private final ChatClient chatClient;

        public AdvancedRagService(ChatClient.Builder builder, VectorStore vectorStore) {
            // 1. Set up the multistep retriever with a re-ranker
            var retrievalAdvisor = RetrievalAugmentationAdvisor.builder()
                    .documentRetriever(VectorStoreDocumentRetriever.builder()
                            .vectorStore(vectorStore)
                            .topK(15) // get a wide pool first
                            .build())
                    .documentPostProcessors(List.of((query, docs) -> {
                        // Here you would integrate a model like Cohere Rerank.
                        // For now, we simulate a two-stage filter.
                        return docs.stream().limit(5).toList();
                    }))
                    .build();

            // 2. Build the agentic ChatClient: internal RAG via the advisor,
            //    external web search via the "webSearch" function bean
            this.chatClient = builder
                    .defaultAdvisors(retrievalAdvisor)
                    .defaultToolNames("webSearch")
                    .build();
        }

        public String execute(String userPrompt) {
            return this.chatClient.prompt().user(userPrompt).call().content();
        }
    }
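Exposing the service is then a one-liner. The controller below is a hypothetical wiring example, not part of the advisor setup:

    import org.springframework.web.bind.annotation.GetMapping;
    import org.springframework.web.bind.annotation.RequestParam;
    import org.springframework.web.bind.annotation.RestController;

    @RestController
    public class RagController {

        private final AdvancedRagService ragService;

        public RagController(AdvancedRagService ragService) {
            this.ragService = ragService;
        }

        @GetMapping("/ask")
        public String ask(@RequestParam String question) {
            return ragService.execute(question);
        }
    }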
4. The Re-ranking Step (Two-Stage Retrieval)
The most common reason for RAG failure is that the “top 3” documents found by vector math aren’t actually the best ones. Re-ranking solves this by taking a larger set (e.g., top 20) and running a more expensive “relevance score” on them.
When implementing a custom re-ranker, you effectively compute a score $S = f(Query, Document)$ for every retrieved chunk, ensuring that the documents with the highest semantic signal are placed at the beginning of the prompt.
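To make the two-stage idea concrete, here is a minimal DocumentPostProcessor sketch. The lexical-overlap scoring is a deliberately cheap stand-in for a real cross-encoder or a hosted re-rank API; only the interface wiring reflects Spring AI, while the scoring logic itself is illustrative:

    import java.util.Arrays;
    import java.util.Comparator;
    import java.util.List;
    import java.util.Set;
    import java.util.stream.Collectors;

    import org.springframework.ai.document.Document;
    import org.springframework.ai.rag.Query;
    import org.springframework.ai.rag.postretrieval.document.DocumentPostProcessor;

    public class LexicalReranker implements DocumentPostProcessor {

        private static final int TOP_N = 5;

        @Override
        public List<Document> process(Query query, List<Document> documents) {
            Set<String> queryTerms = tokenize(query.text());
            // Score every chunk against the query and keep only the best TOP_N
            return documents.stream()
                    .sorted(Comparator.comparingDouble(
                            (Document doc) -> score(queryTerms, doc)).reversed())
                    .limit(TOP_N)
                    .toList();
        }

        // S = f(Query, Document): here, the fraction of query terms found in the chunk
        private double score(Set<String> queryTerms, Document doc) {
            Set<String> docTerms = tokenize(doc.getText());
            long hits = queryTerms.stream().filter(docTerms::contains).count();
            return queryTerms.isEmpty() ? 0.0 : (double) hits / queryTerms.size();
        }

        private Set<String> tokenize(String text) {
            return Arrays.stream(text.toLowerCase().split("\\W+"))
                    .filter(token -> !token.isBlank())
                    .collect(Collectors.toSet());
        }
    }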
5. Potential Pitfalls to Avoid
Creating a multistep system introduces “Agentic” risks that standard RAG does not face:
- The Infinite Search Loop: If the LLM is unsatisfied with search results, it may call the web search tool repeatedly.
- Solution: Cap the number of tool-call iterations (for example, via a call counter enforced inside the tool itself, or a maximum-iterations option if your Spring AI version exposes one).
- Context Drift: In a 3-step search, the prompt grows significantly as each step adds more text. This can cause the model to lose the original user intent.
- Solution: Use a query transformer such as Spring AI's CompressionQueryTransformer to compress the accumulated conversation into a compact, standalone query before the final generation.
- Latency vs. Accuracy: Every “hop” adds 1–3 seconds of delay.
- Solution: Only trigger the web tool if the internal vector search similarity score is below a certain threshold (e.g., < 0.7); a minimal sketch of this gate follows the list.
- Hallucination in the “Reasoning” Step: The model might invent a “fact” during Step 1 that it then uses to search the web in Step 2.
- Solution: Use an evaluator (e.g., Spring AI's FactCheckingEvaluator) to check whether the tool output contradicts the initially retrieved internal context.
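For the latency point above, the similarity-threshold gate can look like the following sketch. The method names follow the Spring AI 1.x API but should be treated as assumptions to verify against your version; the 0.7 threshold and the "webSearch" tool name come from the setup earlier in this article:

    import java.util.List;

    import org.springframework.ai.chat.client.ChatClient;
    import org.springframework.ai.document.Document;
    import org.springframework.ai.vectorstore.SearchRequest;
    import org.springframework.ai.vectorstore.VectorStore;

    // Gates the web tool behind an internal-confidence check: probe the vector
    // store first and only expose "webSearch" when the best internal match is weak.
    public class GatedWebSearch {

        private static final double CONFIDENCE_THRESHOLD = 0.7;

        private final ChatClient chatClient;
        private final VectorStore vectorStore;

        public GatedWebSearch(ChatClient chatClient, VectorStore vectorStore) {
            this.chatClient = chatClient;
            this.vectorStore = vectorStore;
        }

        public String answer(String userPrompt) {
            List<Document> probe = vectorStore.similaritySearch(
                    SearchRequest.builder().query(userPrompt).topK(1).build());

            double bestScore = probe.isEmpty() || probe.get(0).getScore() == null
                    ? 0.0 : probe.get(0).getScore();

            var spec = chatClient.prompt().user(userPrompt);
            if (bestScore < CONFIDENCE_THRESHOLD) {
                // Weak internal evidence: pay the extra hop and allow the web tool
                spec = spec.toolNames("webSearch");
            }
            return spec.call().content();
        }
    }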
