Conversational AI hasn’t just evolved in the past few years — it’s quietly staged a revolution.
We’ve gone from basic intent classification and rigid, rule‑based scripts to Large Language Models (LLMs) that can hold conversations with fluency.
And yet — in real‑world production environments where accuracy, trustworthiness, and scalability matter — LLMs alone aren’t enough. The real magic happens when you blend their creativity with structured, reliable systems.
In this post, we’ll explore how Rasa, LLMs, and RAG (Retrieval‑Augmented Generation) can work together to build chatbots that are natural, reliable, and grounded.
Bridging the Gap: Why Mix Rasa, LLMs, and RAG?
LLMs are brilliant at generating smooth, coherent replies, but they can "hallucinate" when they don't know the answer, especially in regulated or domain-specific contexts. No single component fixes this on its own; each layer of the stack covers a different weakness:
Rasa ⟶ gives you predictable control: intent classification, slot filling, API calls, and workflow logic.
RAG (Retrieval‑Augmented Generation) ⟶ injects facts by retrieving trusted answers from your documentation, FAQs, or knowledge base.
LLM ⟶ uses both user input and retrieved context to generate natural, grounded responses.
The Architecture at a Glance
You can think of this trio as a human‑like conversation team inside your chatbot:
- Rasa is the executive planner — listening to the user, interpreting their intent, and deciding which pathway to follow.
- RAG plays the research assistant — diving into the knowledge base and coming back with the most relevant facts.
- LLM is the storyteller — presenting those facts in a friendly, human‑like tone.
A typical workflow looks like this:
- The user asks a question.
- Rasa analyses the intent and determines whether this is a standard flow (e.g., “Book me a meeting”) or a factual query (“What’s your refund policy?”).
- For factual queries, Rasa triggers the RAG pipeline.
- The query is converted into an embedding vector, and similar content is fetched from a FAISS index or other vector store.
- Retrieved data is inserted into an LLM prompt so the answer is grounded in the right context.
- The LLM generates a polished response, and Rasa sends it back to the user.
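The grounding step in this workflow can be sketched as follows. The `build_grounded_prompt` helper and its template are illustrative assumptions, not part of Rasa or any particular LLM SDK; the idea is simply that retrieved passages are pinned into the prompt before the LLM sees it:

```python
# Sketch: assemble a grounded prompt from retrieved passages.
# `retrieved_passages` would come from the RAG retrieval step; the
# helper name and prompt wording are illustrative assumptions.
from typing import List


def build_grounded_prompt(user_query: str, retrieved_passages: List[str]) -> str:
    """Combine the user's question with trusted context for the LLM."""
    context = "\n".join(f"- {p}" for p in retrieved_passages)
    return (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {user_query}\nAnswer:"
    )


prompt = build_grounded_prompt(
    "What's your refund policy?",
    ["Refunds are issued within 14 days of purchase."],
)
```

Because the instructions tell the model to stay inside the supplied context, off-topic or unknown questions degrade to "I don't know" rather than a hallucinated answer.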
Production Benefits
By combining these components, you create a balanced conversational system: Rasa provides structured, predictable dialogue control; RAG contributes accurate, context‑aware information; and the LLM adds natural, human‑like interaction. This blend minimizes misleading responses, handles complex queries with context retention, scales gracefully as your knowledge base grows, and allows new features to be introduced without disrupting existing behavior.
Minimal Working Example (Rasa + FAISS RAG Custom Action)
# Rasa custom action: answer factual queries via FAISS retrieval
from typing import Any, Dict, List, Text

import faiss
import numpy as np
from rasa_sdk import Action, Tracker
from rasa_sdk.executor import CollectingDispatcher
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
faiss_index = faiss.read_index("kb.index")  # built from the company knowledge base
with open("kb_answers.txt") as f:
    kb_answers = f.read().splitlines()  # row i matches index vector i

class ActionRAGLookup(Action):
    def name(self) -> Text:
        return "action_rag_lookup"

    def run(self, dispatcher: CollectingDispatcher, tracker: Tracker,
            domain: Dict[Text, Any]) -> List[Dict[Text, Any]]:
        user_query = tracker.latest_message.get("text", "")
        query_vector = embedding_model.encode([user_query]).astype(np.float32)
        distances, indices = faiss_index.search(query_vector, 1)  # top-1 match
        best_answer = kb_answers[indices[0][0]]
        dispatcher.utter_message(text=best_answer)
        return []
Step‑by‑Step Implementation
1. Set up Rasa
pip install rasa
rasa init
This creates a default bot with intents, domain, and sample actions.
2. Prepare your Knowledge Base
- Gather Q&A pairs from your docs.
- Generate embeddings (all-MiniLM-L6-v2).
- Build and save the FAISS index (kb.index) and answers file.
3. Create the Custom RAG Action
- Add retrieval logic like in ActionRAGLookup.
- Encode the query, search the index, return the top match.
4. Update domain.yml
- Add action_rag_lookup under actions.
- Declare the needed intents (e.g., ask_question).
- Optional: add utter_fallback for low‑confidence cases.
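Taken together, the domain additions for this step might look like the fragment below. The intent and action names follow this article's examples; the fallback response text is an illustrative assumption:

```yaml
# domain.yml — additions for the RAG action
intents:
  - ask_question

actions:
  - action_rag_lookup

responses:
  utter_fallback:
    - text: "Sorry, I'm not sure about that. Could you rephrase?"
```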
5. Train and Run
rasa train
rasa run # Terminal 1 — starts bot server
rasa run actions # Terminal 2 — starts action server
6. Test
rasa shell
Type a query → Rasa processes it → RAG retrieves matching answer → LLM formats and sends it back.
Domain — The Blueprint of Your Bot
File: domain.yml
Think of the domain as your chatbot’s ID card + skill list.
It tells Rasa:
- Who you are (list of possible responses the bot knows)
- What you can do (custom actions, forms)
- What you understand (intents & entities you’re trained to recognize)
- What you remember (slots that hold values through a conversation)
A typical structure:
intents:
  - greet
  - ask_price

entities:
  - product_name

slots:
  product_name:
    type: text

responses:
  utter_greet:
    - text: "Hi there! How can I help you?"
  utter_ask_price:
    - text: "How much does this product cost?"

actions:
  - action_rag_lookup
The domain is like the API surface of your bot. If your bot’s ability or response isn’t declared here, Rasa won’t touch it — even if it’s coded elsewhere.
Stories — Example Conversations Rasa Learns From
File: stories.yml
Stories are like training scripts for actors — they’re examples of how a conversation might go from start to finish.
- They teach Rasa’s dialogue management how to react over multiple turns.
- Based on these examples, Rasa learns patterns for fluid, non‑rigid conversations.
Example:
stories:
  - story: user asks a factual question
    steps:
      - intent: ask_question
      - action: action_rag_lookup
Here, we’re saying:
“If the user intent is ask_question, the next step in the conversation is to run our RAG retrieval action.”
The more varied but plausible your stories, the smarter Rasa becomes at guessing what to do when a new but similar situation happens.
Rules — Fixed Conversation Logic
File: rules.yml
Rules are if‑this‑then‑that triggers, for when you want absolute determinism.
Unlike stories, rules don’t let Rasa “improvise” — they always fire exactly as defined.
Example:
rules:
  - rule: Handle fallback
    steps:
      - intent: nlu_fallback
      - action: utter_fallback
Meaning: "Every time my NLU confidence is low, trigger utter_fallback."
You use rules for:
- Mandatory flows (always confirm a user’s email before proceeding)
- Hard fallbacks (don’t let the bot guess)
- Compliance scripts (legal disclaimers, medical advice warnings)
What happens when you run rasa shell:
- It launches an interactive CLI where you type messages as a user.
- Rasa processes them through your NLU pipeline, then, if your logic routes to the RAG action, it will:
- Embed your text
- Retrieve the top match from FAISS
- Send that as the bot’s reply.