Our Experience With Conversational AI Use-cases For Agriculture And Finance Domains
Authors (Affiliation: IBM Research, India)
- Padmanabha V. Seshadri
- Rudra Murthy
- Arkadeep Acharya
- Jaydeep Sen
- Kushagra Bhushan
- Yatin Nandwani
- Praveen Jayachandran
- Ashok Pon Kumar
RAG pipelines are the go-to framework for supporting Conversational AI with domain-specific customization. A typical system uses a set of documents with domain-specific knowledge as the source of knowledge: when an end-user places a query, the system retrieves query-relevant chunks from these documents and infuses them as context along with the query.
The intent is that the LLM powering the system, given the chunks and the query as input, should generate the response using the contextual chunks.
However, as with any retriever system, the retrieval of chunks is not foolproof. The retrieved chunks may appear relevant to the retriever while not actually being relevant to the query.
In addition to retriever noise, the expected ground-truth answer could be expressed in a paraphrased yet correct manner that is perfectly acceptable to an end-user in a conversational interaction. To help the LLM handle such paraphrasing, there is a need to infuse answer variations into the LLM.
To address these two challenges, we conducted an ablation study to identify a suitable fine-tuning data recipe that mitigates retriever noise and teaches answer paraphrasing, using IBM's Granite 4 hybrid models and BharatGen's sovereign models on real-world use cases in agriculture and finance.
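For readers less familiar with this setup, below is a minimal sketch of the query flow described above. It is purely illustrative and not our production stack: the vector DB is replaced by a TF-IDF similarity search, the generator LLM call is omitted (only the context-infused prompt is assembled), and the chunk contents and prompt template are invented examples.

```python
# Minimal RAG query flow: retrieve query-relevant chunks, then infuse them as
# context alongside the query before calling the generator LLM (omitted here).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Domain chunks (invented examples standing in for document-extracted content).
chunks = [
    "Wheat should be sown between late October and mid November in north India.",
    "Crop insurance claims must be filed within 72 hours of localized damage.",
    "Drip irrigation reduces water usage compared to flood irrigation.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the top-k chunks most similar to the query (TF-IDF stand-in for a vector DB)."""
    vectorizer = TfidfVectorizer().fit(chunks + [query])
    scores = cosine_similarity(vectorizer.transform([query]), vectorizer.transform(chunks))[0]
    return [chunks[i] for i in scores.argsort()[::-1][:k]]

def build_prompt(query: str, context: list[str]) -> str:
    """Assemble the context-infused prompt that would be sent to the generator LLM."""
    context_block = "\n".join(f"- {c}" for c in context)
    return f"Answer using only the context below.\n\nContext:\n{context_block}\n\nQuestion: {query}\nAnswer:"

query = "When should wheat be sown in north India?"
print(build_prompt(query, retrieve(query)))  # the generator LLM would consume this prompt
```

In our experiments, the generator consuming such a prompt is one of the fine-tuned Granite or BharatGen models described later.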
Domain-specific data recipe: An Overview
Figure 1 illustrates the steps involved in the data recipe. The input to the pipeline is a set of documents. There are two main stages in processing these documents and generating a dataset:
- Documents-to-samples: Converts the documents into question-and-answer (QA) pairs.
- Sample augmentation: The QA pairs are then augmented, first by generating distractors (this step alone constitutes the RAFT [3] method), followed by answer paraphrasing (distractors plus paraphrasing constitutes the PA-RAG [4] method).

Figure 1: Illustration of end-to-end domain-specific data generation
The steps are elaborated in the sections below.
Documents-To-Samples
We convert the documents into chunks using tools like docling and then generate Q&A pairs from the chunked data. This forms the first step in creating a training set for domain customization, and it uses a synthetic data generation (SDG) approach to produce the raw training set for Q&A tasks in the agriculture and finance domains. Figure 2 illustrates the SDG flow, and a simplified code sketch of this stage follows the figure. The flow involves the following steps:
- Chunking & Embedding → Break documents into chunks, store in VectorDB.
- Synthetic Q&A Generation →
- Use LLMs to create question-answer pairs using SDGHub[5] framework. Following LLMs were used for this purpose. The data obtained from both models, was mixed before use:
- An LLM-based scorer is used to measure the answerability and faithfulness of the Q&A, which is then used to filter answers with these two characteristics as shown below. Following are the definitions:
- Answerability tries to determine using an LLM scorer, whether a query can be answered from the information present in the document/passage.
- Faithfulness criterion attempts to quantify if the answer is faithful to the document/passage.

Figure 2: Flow of synthetic data generation
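To make the flow concrete, here is a simplified, self-contained sketch of the documents-to-samples stage. It is an assumption-heavy illustration rather than our actual pipeline: the docling-based chunker, the SDGHub-driven Q&A generation, and the LLM scorer are replaced by toy stand-in functions, and the 0.7 filtering threshold is a hypothetical value.

```python
# Sketch of the documents-to-samples stage (illustrative; the real pipeline uses
# docling for chunking and SDGHub-driven LLM calls, which are stubbed out here).
from dataclasses import dataclass

@dataclass
class QASample:
    question: str
    answer: str
    chunk_id: int
    answerability: float
    faithfulness: float

def chunk_document(text: str, max_chars: int = 400) -> list[str]:
    """Naive fixed-size chunker; stands in for docling-based chunking."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def generate_qa(chunk: str) -> tuple[str, str]:
    """Stand-in for an SDGHub-driven LLM call that produces a Q&A pair from a chunk."""
    return (f"What does the passage say about: {chunk[:40]}...?", chunk[:120])

def score_qa(question: str, answer: str, chunk: str) -> tuple[float, float]:
    """Stand-in for the LLM-based scorer returning (answerability, faithfulness) in [0, 1]."""
    return (1.0 if question else 0.0, 1.0 if answer in chunk else 0.5)

def documents_to_samples(docs: list[str], threshold: float = 0.7) -> list[QASample]:
    samples = []
    chunk_id = 0
    for doc in docs:
        for chunk in chunk_document(doc):
            q, a = generate_qa(chunk)
            ans_score, faith_score = score_qa(q, a, chunk)
            # Keep only Q&A pairs that are both answerable from and faithful to the chunk.
            if ans_score >= threshold and faith_score >= threshold:
                samples.append(QASample(q, a, chunk_id, ans_score, faith_score))
            chunk_id += 1
    return samples

print(len(documents_to_samples(["Paddy requires standing water during early growth stages. " * 20])))
```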
Sample Augmentation
The synthetically generated training set is further augmented using distractor and paraphrasing strategies to help the generator become resilient to retriever noise. This involves two components, and a simplified code sketch of the augmentation follows this list:
- Creating distractors with RAFT [3]: We leverage the Retrieval Augmented Fine-Tuning (RAFT) [3] post-training recipe to create distractors for each Q&A pair. If N is the number of samples and p is the proportion of samples that keep the gold chunk alongside distractors, then p·N samples contain the gold chunk plus distractors and the remaining (1−p)·N samples contain distractors only. The distractors are generated with the following simple steps:
  - We use the query to retrieve the top-k matching chunks from the VectorDB. Chunks that do not match the gold chunk are treated as distractors.
  - The distractor chunks are added to the sample's context list, with or without the gold chunk, based on the sampling proportion p.
- Paraphrasing with PA-RAG [4]: This component augments the RAFT dataset with answer paraphrasing. For each training question, multiple answers are generated synthetically, controlled by a paraphrasing degree (e.g., three paraphrases per answer). Fine-tuning with multiple paraphrased answers for the same question helps the LLM internalize domain knowledge.
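The sketch below illustrates the augmentation logic of both components. The vector-DB retrieval and the paraphrasing LLM are stubbed out, and p = 0.8, k = 4, and a paraphrasing degree of 3 are placeholder values, not necessarily the settings used in our experiments.

```python
# Sketch of the augmentation stage: RAFT-style distractor injection followed by
# PA-RAG-style answer paraphrasing (retriever and paraphraser LLM are stubbed).
import random

def retrieve_top_k(question: str, all_chunks: list[str], k: int = 4) -> list[str]:
    """Stand-in for a vector-DB top-k retrieval call."""
    return random.sample(all_chunks, min(k, len(all_chunks)))

def paraphrase(answer: str, degree: int = 3) -> list[str]:
    """Stand-in for LLM-generated paraphrases of the gold answer."""
    return [f"{answer} (paraphrase {i + 1})" for i in range(degree)]

def augment(qa_pairs: list[dict], all_chunks: list[str], p: float = 0.8, degree: int = 3) -> list[dict]:
    """RAFT: with probability p keep the gold chunk alongside distractors, otherwise drop it.
    PA-RAG: expand each sample with paraphrased answers."""
    augmented = []
    for qa in qa_pairs:
        distractors = [c for c in retrieve_top_k(qa["question"], all_chunks) if c != qa["gold_chunk"]]
        context = distractors + ([qa["gold_chunk"]] if random.random() < p else [])
        random.shuffle(context)
        for ans in [qa["answer"]] + paraphrase(qa["answer"], degree):
            augmented.append({"question": qa["question"], "context": context, "answer": ans})
    return augmented

chunks = ["gold chunk on crop insurance", "chunk on irrigation", "chunk on loans", "chunk on seeds"]
qa = [{"question": "How do I claim crop insurance?", "answer": "File within 72 hours.", "gold_chunk": chunks[0]}]
print(len(augment(qa, chunks)))  # 1 question -> 4 samples (original answer + 3 paraphrases)
```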
Tuning configuration
Once the dataset is generated, we prepare the tuning configuration as follows:
- Evaluation dataset preparation: 2,000+ evaluation samples were sampled from the training dataset such that each evaluation sample has at least one training sample that uses the same gold chunk. This ensures the training dataset stays relevant to the knowledge on which it is evaluated. We use the chunk ID of each sample to enforce this coverage constraint.
- Training configuration: The tuning was performed using the fms-hf-tuning [6] stack. The key tuning configuration is as follows (an illustrative mapping of these hyperparameters to standard Hugging Face training arguments appears after this list):
  - Learning rate: 1e-5
  - Warm-up ratio: 0.01
  - Gradient accumulation steps: 1
  - Number of epochs: 3
  - Optimizer: AdamW with beta1=0.9, beta2=0.98, epsilon=1e-10
  - Hardware: 1 node × 8 A100 80GB GPUs for the fine-tuning experiments
- Models and baselines:
  - Agriculture use-case: We compare the various recipes by fine-tuning the Granite 4.0 Tiny hybrid model [1], which combines a mixture-of-experts (MoE) design with Mamba state-space and transformer layers. With 7B parameters it is light on resources, while also being trained to handle multi-turn messages in RAG systems.
  - Finance use-case: For the finance use case, we use BharatGen's FinanceParam sovereign model [2] and compare the performance of the various recipes. It is derived by fine-tuning BharatGen's Param-1-2.9B-Instruct for the finance domain.
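As a rough illustration of the training configuration listed above: fms-hf-tuning builds on the Hugging Face training stack, so the same hyperparameters can be expressed as standard TrainingArguments. The snippet below is a hedged sketch, not our exact invocation; the output path and the bf16 flag are assumptions.

```python
# Hedged illustration: the listed hyperparameters expressed as Hugging Face
# TrainingArguments (the exact fms-hf-tuning invocation and any additional
# flags used in our experiments are not reproduced here).
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./granite-agri-raft",  # hypothetical output path
    learning_rate=1e-5,
    warmup_ratio=0.01,
    gradient_accumulation_steps=1,
    num_train_epochs=3,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-10,
    bf16=True,  # assumption: mixed precision on A100 GPUs
)
```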
Results
Evaluation metrics
We use two evaluation metrics for evaluating the performance of the baselines and tuned models:
- Rouge-L [7]: Rewards exact overlap, measured via the longest common subsequence between the machine response and the gold response. However, it does not account for paraphrased wording.
- LLM-as-a-Judge: We use the meta-llama/llama-3-3-70b-instruct model as the judge. The prompt instructs the judge to compare the gold and generated responses, given the question, in terms of meaning, correctness, and completeness relative to the gold answer. The judge scores on a scale from 0 (entirely incorrect) to 5 (full match), with intermediate criteria corresponding to the intermediate scores. For each metric, the average score over all evaluation samples is reported; these numbers are discussed below. A small illustration of how both metrics can be computed follows.
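The snippet below is a small, self-contained illustration of how the two metrics can be computed: Rouge-L via the rouge-score package, and a mean judge score with a stubbed judge call. The example texts are invented, and our actual judge prompt and the llama-3-3-70b-instruct invocation are not reproduced.

```python
# Sketch of the two metrics: Rouge-L via the rouge-score package, and a mean
# LLM-as-a-judge score (the judge call is a stub returning a hypothetical rating).
from rouge_score import rouge_scorer

gold = ["File the crop insurance claim within 72 hours of damage."]
generated = ["You must submit the insurance claim within 72 hours after the damage occurs."]

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = sum(scorer.score(g, p)["rougeL"].fmeasure for g, p in zip(gold, generated)) / len(gold)

def judge_score(question: str, gold_answer: str, generated_answer: str) -> int:
    """Stand-in for the LLM-as-a-judge call; returns an integer score in [0, 5]."""
    return 4  # hypothetical rating

judge = sum(judge_score("How soon must claims be filed?", g, p) for g, p in zip(gold, generated)) / len(gold)
print(f"Rouge-L: {rouge_l:.3f}, Judge: {judge:.2f}/5")
```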
Agriculture use-case

Table 1: Results for Agriculture use case
Key insights from results in Table 1:
- RAFT excels in retrieval robustness, giving the best Rouge-L score.
- PA-RAG leads in overall answer quality, as shown by the highest Judge Score.
- Both advanced recipes outperform Base and SFT significantly, making them ideal for domain-specific RAG tasks.
Finance use-case

Table 2: Results for Finance use case
Key insights from Table 2: As in the agriculture use-case, the PA-RAG recipe shows higher judge performance but a marginally lower Rouge-L score compared to RAFT. This is due to the answer-paraphrasing effect, which produces correct answers that are worded differently from the ground truth.
Conclusion
Our experiments demonstrate that mitigating retriever noise and handling paraphrased answers are helpful in building robust domain-specific RAG pipelines. Among the fine-tuning strategies evaluated, RAFT significantly improves retrieval robustness, while PA-RAG enhances overall answer quality by incorporating paraphrasing diversity. These results hold consistently across agriculture and finance domains, validating the importance of combining retrieval-aware and paraphrase-aware augmentation in real-world conversational AI systems.
Going forward, integrating these approaches with scalable synthetic data generation and lightweight hybrid models like Granite 4 and BharatGen sovereign models can help build domain-customized AI solutions. This not only improves accuracy and user satisfaction but also sets the foundation for resilient, context-aware conversational systems tailored to specialized industry needs.
References
- [1] Granite 4 models, https://www.ibm.com/new/announcements/ibm-granite-4-0-hyper-efficient-high-performance-hybrid-models
- [2] BharatGen models, https://huggingface.co/bharatgenai
- [3] RAFT: Adapting Language Model to Domain Specific RAG, https://arxiv.org/abs/2403.10131
- [4] Systematic Knowledge Injection into Large Language Models via Diverse Augmentation for Domain-Specific RAG, https://aclanthology.org/2025.findings-naacl.329.pdf
- [5] SDGHub, https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub
- [6] fms-hf-tuning, https://github.com/foundation-model-stack/fms-hf-tuning.git
- [7] Lin, Chin-Yew. "ROUGE: A Package for Automatic Evaluation of Summaries." Text Summarization Branches Out, pp. 74-81, 2004. https://aclanthology.org/W04-1013.pdf