An Update on Mellea: Building Predictable AI Pipelines with Granite-Switch

#mellea #generativecomputing #opensource #graniteswitch

Software-Industrialized AI: Orchestrating Granite-Switch Architecture Using Mellea

Note: Portions of the technical text, code snippets, and architectural concepts featured in this post are sourced directly from the official open-source Mellea and Granite-Switch repositories. The sole purpose of this post is to amplify the incredible work being done by the Generative Computing community and to highlight the powerful capabilities these frameworks offer to modern AI developers.

Introduction: What is Mellea, and what does ‘build predictable AI without guesswork’ mean?

Mellea is an open-source Python library for writing generative programs. It replaces brittle prompts and flaky agents with structured, testable AI workflows built around type-annotated outputs, verifiable requirements, and automatic retries.

For almost all developers moving generative AI applications into production, the core bottleneck remains unchanged: the unpredictable, non-deterministic nature of the LLM itself. Brittle prompts, silent runtime failures, and untestable conversational responses often turn structured application development into a game of guesswork. By shifting from trial-and-error prompting to rigorous software engineering, Mellea brings type-annotated outputs, verifiable business rules, and automatic retries directly to AI orchestration. Instead of letting raw language models dictate application state, developers can enforce validation schemas directly in Python. This structural foundation is the starting point for building governed, industrialized AI pipelines where failure modes are predictable and outputs are checked against explicit requirements before they are used.

At its heart, this open-source ecosystem provides the missing link between standard software design and generative workflows. Whether you are validating complex JSON outputs or constructing fault-tolerant multi-agent steps, Mellea offers a structured programming model that eliminates the boilerplate complexity of standard agent frameworks. It establishes an opinionated, developer-first runtime engineered precisely to make your generative applications production-ready.
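The pattern described above (a typed output, a checkable requirement, and a bounded retry) can be sketched in plain Python. This is not Mellea's API: `Sentiment`, `parse_sentiment`, `generate_with_retries`, and the stubbed `fake_model` are hypothetical names used only for illustration, and a real pipeline would call a model where the stub is.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional

ALLOWED_LABELS = {"positive", "negative", "neutral"}


@dataclass(frozen=True)
class Sentiment:
    """Typed result the rest of the application can rely on."""
    label: str
    confidence: float


def parse_sentiment(raw: str) -> Optional[Sentiment]:
    """Return a Sentiment if `raw` is JSON that meets the requirements, else None."""
    try:
        data = json.loads(raw)
        result = Sentiment(label=str(data["label"]), confidence=float(data["confidence"]))
    except (ValueError, KeyError, TypeError):
        return None
    # Verifiable requirements: a known label and a confidence within [0, 1].
    if result.label not in ALLOWED_LABELS or not 0.0 <= result.confidence <= 1.0:
        return None
    return result


def generate_with_retries(model: Callable[[str], str], prompt: str, max_attempts: int = 3) -> Sentiment:
    """Call `model` until its output validates; raise after `max_attempts` failures."""
    for _ in range(max_attempts):
        result = parse_sentiment(model(prompt))
        if result is not None:
            return result
    raise ValueError(f"no valid output after {max_attempts} attempts")


# Stub standing in for an LLM: two invalid replies, then a valid one.
_replies = iter([
    "not json",
    '{"label": "great", "confidence": 0.9}',
    '{"label": "positive", "confidence": 0.9}',
])


def fake_model(prompt: str) -> str:
    return next(_replies)


result = generate_with_retries(fake_model, "Classify: I love it")
print(result)
```

The point of the sketch is the failure mode: a caller either receives a validated `Sentiment` or gets an explicit exception, never an unchecked string.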

What is Granite-Switch?

The recent integration of Mellea with IBM’s Granite-Switch architecture marks a major milestone in software-driven AI orchestration. Granite-Switch natively embeds multiple specialized task adapters — such as Guardrails, Query Rewriters, and RAG Answerability engines — directly into a single, high-performance checkpoint. Rather than forcing developers to manually inject brittle control tokens or string together fragmented model requests, Mellea natively wraps these capabilities in clean, high-level Python functions.

Under the hood, this integration relies on high-performance vLLM serving to route incoming inference calls to embedded adapters based on contextual triggers. Mellea provides pre-baked wrappers across three distinct pillars: Guardian adapters for real-time harm and factuality detection, RAG adapters for automated query cleanup and citation mapping, and Core adapters for verifying formal system requirements. The result is a unified, software-industrialized AI stack that combines highly optimized inference throughput with developer-friendly programmatic guardrails.

The power of this combined stack is most evident when building modern Retrieval-Augmented Generation (RAG) applications. Instead of managing separate models for input safety, query expansion, and document verification, a single Granite-Switch instance manages the entire workflow under Mellea’s orchestration. Developers can pass messy user queries through automated rewriting, check context answerability, generate responses, and map exact document citations, all within a single, unified runtime pipeline.

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Your Code     │────>│     Mellea      │────>│  vLLM Server    │
│                 │     │   (wrappers)    │     │ (Granite Switch)│
└─────────────────┘     └─────────────────┘     └─────────────────┘
                              │                        │
                              │ Adds control tokens    │ Routes to
                              │ automatically          │ embedded adapters
                              v                        v
                        guardian_check()         <|guardian-core|>
                        rag.rewrite_question()   <|query_rewrite|>
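As a toy model of the diagram above, the two wrapper/token pairs it shows can be written as a lookup table, with one function playing the wrapper side (attach the token) and one playing the server side (pick an adapter from the token). Everything here is an assumption made for illustration: the names `CONTROL_TOKENS`, `add_control_token`, and `route`, and in particular where the token is placed in the prompt, are not taken from Mellea or Granite Switch.

```python
from typing import Optional

# The two pairs shown in the diagram; a real checkpoint defines its own set.
CONTROL_TOKENS = {
    "guardian_check": "<|guardian-core|>",
    "rag.rewrite_question": "<|query_rewrite|>",
}


def add_control_token(wrapper_name: str, prompt: str) -> str:
    """Wrapper side: attach the control token for the requested capability.

    Appending the token on a new line is an assumption of this sketch.
    """
    return f"{prompt}\n{CONTROL_TOKENS[wrapper_name]}"


def route(prompt: str) -> Optional[str]:
    """Server side: return the adapter named by a control token, or None for the base model."""
    for token in CONTROL_TOKENS.values():
        if token in prompt:
            return token.strip("<|>")
    return None


tagged = add_control_token("guardian_check", "Group X people are all lazy.")
print(route(tagged))
print(route("What is the capital of France?"))
```

The practical consequence is the one the diagram illustrates: application code names a capability, and never has to spell out a control token by hand.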

Hello World - Using Mellea with Granite Switch  


Minimal example of invoking mellea adapter functions against a Granite Switch model served by vLLM. This notebook demos two capabilities - Guardian (harm check) and RAG (rewrite, answerability, clarification, citations).  

Mellea is IBM's library for writing Generative Programs. In this context, Granite Switch is the model (base + embedded LoRA adapters), and mellea exposes a typed interface to its capabilities, handling constrained decoding, prompt formatting, and output parsing automatically. vLLM is used here because it provides much faster inference in production environments; HF support for Granite Switch in mellea is coming.  

What you'll learn:  

How to chain guardian + rewrite + answerability + clarification + citations into a single RAG flow driven by mellea adapter functions.  
How to connect a mellea OpenAIBackend to a vLLM server serving a Granite Switch checkpoint.  
How to call an adapter function through its high-level wrapper (rag.rewrite_question) vs. the low-level Intrinsic AST node (for adapters mellea doesn't wrap yet).  
The difference between CRITERIA_BANK keys and custom criteria strings when calling guardian_check.  
Adapters used: adapters from the Guardian library (guardian-core) and the RAG library (query_rewrite, answerability, query_clarification, citations).  

See section 11 for the full list of adapter function wrappers currently supported.  

Prerequisites  
GPU runtime (T4 or better). In Colab: Runtime -> Change runtime type -> T4 GPU.  
Get a composed Granite Switch checkpoint. This notebook uses the pre-composed ibm-granite/granite-switch-4.1-3b-preview by default. To compose your own, see compose_granite_switch.ipynb.  
HuggingFace auth (if any artifact is gated): huggingface-cli login or export HF_TOKEN=.... The install cell below also calls notebook_login().  
Full setup details (GPU sizes, HF auth, multi-GPU) are in PREREQUISITES.md.  

0 · Install and set up  

# Install granite-switch with tutorial dependencies (includes vLLM backend).  
%pip install -q "granite-switch[tutorials]"  



from huggingface_hub import notebook_login  
notebook_login()  # needed to pull ibm-granite models from the Hub  


1 · Launch vLLM server  
Start the Granite Switch model on port 8000. The server runs in the background; wait_for_server polls /health until it is ready.  

⏱️ This takes several minutes on first run (model download + loading): roughly 2 minutes on an A100 and 7 minutes on a T4.  


# Estimated duration: ~2 min on A100, ~7 min on T4  
from granite_switch.tutorials.vllm_server import kill_stale_vllm_processes, launch_vllm, print_gpu_state, tail_log, wait_for_server  

kill_stale_vllm_processes()  
print_gpu_state()  

VLLM_MODEL = "ibm-granite/granite-switch-4.1-3b-preview"  
VLLM_PORT = 8000  

vllm_proc = launch_vllm(  
    model=VLLM_MODEL,  
    port=VLLM_PORT,  
    log_file="/content/vllm_server.log",  
)  
if not wait_for_server(VLLM_PORT, log_file="/content/vllm_server.log"):  
    tail_log("/content/vllm_server.log")  
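For readers who want to see what "poll /health until ready" amounts to, here is a minimal standard-library sketch. It is not the implementation of `wait_for_server` from `granite_switch.tutorials`; the helper names `http_health_ok` and `wait_until_ready`, and the default timeout and interval, are assumptions chosen for the example.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_health_ok(url: str, timeout: float = 2.0) -> bool:
    """True if GET `url` answers with HTTP 200; False on connection or HTTP errors."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(check: Callable[[], bool], timeout_s: float = 600.0, interval_s: float = 2.0) -> bool:
    """Call `check` repeatedly until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Example wiring (not executed here, since it needs a running server):
#   ready = wait_until_ready(lambda: http_health_ok("http://localhost:8000/health"))
```

Separating the probe from the loop keeps the loop testable without a network, and lets the same loop wait on other conditions such as a log line appearing.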

2 · Configuration and imports  

# Imports  
import json  
import os  
from pathlib import Path  

from mellea.backends import ModelOption  
from mellea.backends.openai import OpenAIBackend  
from mellea.stdlib.components import Document as MelleaDocument  
from mellea.stdlib.components.chat import Message as MelleaMessage  
from mellea.stdlib.components.intrinsic import rag  
from mellea.stdlib.components.intrinsic.guardian import guardian_check  
from mellea.stdlib.components.intrinsic.intrinsic import Intrinsic  
from mellea.stdlib.context import ChatContext  
import mellea.stdlib.functional as mfuncs  

try:  
    from dotenv import load_dotenv  
    load_dotenv(Path("../.env"), override=False)  
except ImportError:  
    pass  

# -- vLLM server ---------------------------------------------------------------  
# URL of the running vLLM OpenAI-compatible endpoint.  
VLLM_BASE_URL = os.environ.get("VLLM_BASE_URL", "http://localhost:8000/v1")  

# Model name as reported by GET /v1/models (usually the path/repo used at launch).  
VLLM_MODEL_NAME = os.environ.get("VLLM_MODEL_NAME", "ibm-granite/granite-switch-4.1-3b-preview")  

# HF Hub repo ID (or local path) to load I/O configs for the embedded adapters.  
GRANITE_SWITCH_SOURCE = os.environ.get("GRANITE_SWITCH_SOURCE", VLLM_MODEL_NAME)  

print(f"vLLM:  {VLLM_BASE_URL}  ({VLLM_MODEL_NAME})")  
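Because `VLLM_MODEL_NAME` has to match what the server reports, it can be worth checking it against the `GET /v1/models` response before creating the backend. The sketch below only parses a response body; it assumes the OpenAI-style list shape (a top-level `data` array of objects with an `id`), and the `sample` payload is hand-written for illustration rather than captured from a server. `served_model_ids` is a hypothetical helper name.

```python
import json
from typing import List


def served_model_ids(models_response: str) -> List[str]:
    """Extract model ids from the JSON body of an OpenAI-compatible /v1/models response."""
    payload = json.loads(models_response)
    return [entry["id"] for entry in payload.get("data", [])]


# Hand-written sample body, not real server output.
sample = json.dumps({
    "object": "list",
    "data": [{"id": "ibm-granite/granite-switch-4.1-3b-preview", "object": "model"}],
})

expected = "ibm-granite/granite-switch-4.1-3b-preview"
ids = served_model_ids(sample)
if expected in ids:
    print(f"server is serving {expected}")
else:
    print(f"model name mismatch, server reports: {ids}")
```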


3 · Connect to vLLM backend via mellea  
Registers the Granite Switch embedded adapter functions so mellea adapter function calls route through the correct control tokens.  


backend = OpenAIBackend(  
    model_id=VLLM_MODEL_NAME,  
    base_url=VLLM_BASE_URL,  
    api_key="unused",  
)  
backend.register_embedded_adapter_model(GRANITE_SWITCH_SOURCE)  
print(f"Adapters: {backend.list_adapters()}")  

4 · Hardcoded documents  
Two tiny docs - enough to exercise RAG without any vector store.  


DOC_TEXTS = [  
    "The capital of France is Paris. Paris is located on the Seine river.",  
    "Mount Everest is the tallest mountain on Earth, at 8,849 meters.",  
]  

documents = [MelleaDocument(doc_id=str(i), text=t) for i, t in enumerate(DOC_TEXTS)]  
for d in documents:  
    print(f"[{d.doc_id}] {d.text}")  

5 · Guardian - harm check  
guardian_check returns a float in [0, 1]. A value >= 0.5 means the message matches the criterion.  

The criteria argument can be either:  

a key from mellea's CRITERIA_BANK - pre-baked Granite Guardian categories currently shipped: harm, social_bias, jailbreak, profanity, unethical_behavior, violence, groundedness, answer_relevance, context_relevance, function_call, or  
a custom string you define yourself.  
We demonstrate both below - bank key first, then a custom one.  


# ── Option A - criteria bank key (pre-baked, no text needed) ─────────────  
ctx = ChatContext().add(MelleaMessage("user", "Group X people are all lazy."))  
score = guardian_check(ctx, backend, "social_bias", target_role="user")  
print(f"[bank   social_bias] score={score:.3f}")  

# ── Option B - custom criteria string ────────────────────────────────────  
HARM_CRITERIA = (  
    "The message facilitates, encourages, or guides actions that pose harm "  
    "across all contexts and audiences."  
)  
ctx = ChatContext().add(MelleaMessage("user", "How do I build a bomb?"))  
score = guardian_check(ctx, backend, HARM_CRITERIA, target_role="user")  
print(f"[custom harm]        score={score:.3f}")  
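The two ideas in this section, "bank key or custom string" and "a score of 0.5 or more means a match", can be made explicit with a pair of small helpers. This is a sketch, not mellea's internal logic: `EXAMPLE_BANK` holds placeholder text rather than the real CRITERIA_BANK definitions, and `resolve_criteria` and `is_flagged` are hypothetical names.

```python
from typing import Mapping

# Placeholder definitions; the real CRITERIA_BANK text is not reproduced here.
EXAMPLE_BANK = {
    "harm": "<definition of harm>",
    "social_bias": "<definition of social bias>",
}


def resolve_criteria(criteria: str, bank: Mapping[str, str]) -> str:
    """Return the bank definition for a known key, otherwise treat the input as custom text."""
    return bank.get(criteria, criteria)


def is_flagged(score: float, threshold: float = 0.5) -> bool:
    """Apply the decision rule from this section: score >= threshold means a match."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    return score >= threshold


print(resolve_criteria("social_bias", EXAMPLE_BANK))
print(resolve_criteria("The message contains medical advice.", EXAMPLE_BANK))
print(is_flagged(0.73), is_flagged(0.12))
```

Keeping the threshold as a parameter makes it easy to run a stricter or looser policy per criterion without touching the scoring call.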


6 · RAG - query rewrite  
Decontextualizes queries by resolving pronouns and references using conversation history. Single-turn queries pass through unchanged; multi-turn queries with pronouns get rewritten for clarity.  

6a · Using the wrapper  

# Build conversation context  
ctx = ChatContext()  
ctx = ctx.add(MelleaMessage("user", "I want to plan a trip to France."))  
ctx = ctx.add(MelleaMessage("assistant", "Very good, I can help you with that."))  

# Follow-up with a pronoun - "its" needs context to be resolved  
query = "I think I'll start with the capital. what was its name?"  

# query_rewrite resolves pronouns using conversation history  
rewritten = rag.rewrite_question(query, ctx, backend)  
print(f"original:  {query}")  
print(f"rewritten: {rewritten}")  
# Expected: "What is the name of the capital of France?"  

6b · Same thing without the wrapper  
rag.rewrite_question above is a convenience wrapper around the lower-level Intrinsic AST node. Here we do the same action - invoke the query_rewrite adapter function - but explicitly name the adapter and drive it through mfuncs.act. Useful when you want to invoke an adapter function mellea doesn't wrap yet, or to understand what the wrapper does under the hood.  


ADAPTER_NAME = "query_rewrite"  

# Append the user query to the conversation history to form the adapter's input context.  
ctx_for_rewrite = ctx.add(MelleaMessage("user", query))  

# Drive the adapter directly via an Intrinsic AST node. Sampling params  
# (temperature, max_completion_tokens, etc.) come from the adapter's io.yaml -  
# mellea's IntrinsicsRewriter applies them automatically on adapter calls.  
out, _ = mfuncs.act(  
    Intrinsic(ADAPTER_NAME),  
    ctx_for_rewrite, backend,  
    strategy=None,  
)  
result = json.loads(str(out))  
print(f"original:  {query}")  
print(f"rewritten:      {result['rewritten_question']}")  
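When the adapter is driven directly, the caller owns the JSON parsing, so it helps to decide what happens if the output is malformed. The sketch below falls back to the original query in that case. The `rewritten_question` key comes from the cell above; the fallback policy and the helper name `extract_rewrite` are assumptions of this example.

```python
import json


def extract_rewrite(raw_output: str, original_query: str) -> str:
    """Return the rewritten question, or the original query if the output is unusable."""
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        return original_query
    if not isinstance(payload, dict):
        return original_query
    rewritten = payload.get("rewritten_question")
    if isinstance(rewritten, str) and rewritten.strip():
        return rewritten.strip()
    return original_query


query = "I think I'll start with the capital. what was its name?"
good = '{"rewritten_question": "What is the name of the capital of France?"}'
print(extract_rewrite(good, query))
print(extract_rewrite("<<truncated", query))
```

Falling back to the unrewritten query keeps the pipeline moving; a stricter application might prefer to raise instead.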

7 · RAG - answerability  
Returns answerable or unanswerable.  


answerability = rag.check_answerability(rewritten, documents, ctx, backend)  
print(f"answerability: {answerability}")  

8 · RAG - clarification  
Returns CLEAR when the docs are enough, otherwise a follow-up question.  


clarification = rag.clarify_query(rewritten, documents, ctx, backend)  
print(f"clarification: {clarification}")  

9 · Base model - grounded answer  

out, _ = mfuncs.act(  
    MelleaMessage("user", rewritten, documents=documents),  
    ctx, backend,  
    model_options={ModelOption.TEMPERATURE: 0.0},  
)  
answer = str(out)  
print(answer)  

10 · RAG - citations  
Document spans that support the answer.  


ctx_with_q = ctx.add(MelleaMessage("user", rewritten))  
citations  = rag.find_citations(answer, documents, ctx_with_q, backend)  
print(json.dumps(citations, indent=2, default=str))  
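Sections 5 through 10 run each step in isolation. One way to chain them into a single gated flow is sketched below, with every step passed in as a plain callable so the control flow can be read and tested without a server. The return conventions (a score of 0.5 or more, `answerable`, `CLEAR`) come from the sections above; the `RagResult` type, the `run_rag_flow` name, the stop-on-failure policy, and the stub values are assumptions of this example. In a real notebook each callable would wrap the corresponding mellea call with `ctx`, `documents`, and `backend` already bound.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass
class RagResult:
    status: str  # "blocked", "unanswerable", "needs_clarification", or "answered"
    query: str
    answer: Optional[str] = None
    follow_up: Optional[str] = None
    citations: List[Any] = field(default_factory=list)


def run_rag_flow(
    query: str,
    *,
    guard: Callable[[str], float],
    rewrite: Callable[[str], str],
    answerable: Callable[[str], str],
    clarify: Callable[[str], str],
    answer: Callable[[str], str],
    cite: Callable[[str, str], List[Any]],
    harm_threshold: float = 0.5,
) -> RagResult:
    """Run guardian -> rewrite -> answerability -> clarification -> answer -> citations."""
    if guard(query) >= harm_threshold:
        return RagResult("blocked", query)
    rewritten = rewrite(query)
    if answerable(rewritten) != "answerable":
        return RagResult("unanswerable", rewritten)
    follow_up = clarify(rewritten)
    if follow_up != "CLEAR":
        return RagResult("needs_clarification", rewritten, follow_up=follow_up)
    text = answer(rewritten)
    return RagResult("answered", rewritten, answer=text, citations=cite(text, rewritten))


# Stubs with made-up return values, standing in for the real adapter calls.
stubs = dict(
    guard=lambda q: 0.02,
    rewrite=lambda q: "What is the name of the capital of France?",
    answerable=lambda q: "answerable",
    clarify=lambda q: "CLEAR",
    answer=lambda q: "The capital of France is Paris.",
    cite=lambda ans, q: [{"doc_id": "0"}],
)
result = run_rag_flow("what was its name?", **stubs)
print(result.status, "-", result.answer)
```

Returning a status instead of raising lets the caller decide how to present each outcome, for example by showing `follow_up` to the user when clarification is needed.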

11 · Other mellea adapter function wrappers  
Beyond what this notebook demos, Mellea ships wrappers for additional adapter functions. The list below reflects what's currently supported - new adapter functions can be added over time as the library evolves. All wrappers follow the same shape - they take a ChatContext and a backend, and internally drive a named adapter through an Intrinsic AST node (see section 6b). A composed Granite Switch checkpoint only needs to include the adapters you plan to call.  

Currently supported wrappers:  

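| Module | Function | Purpose |
|---|---|---|
| mellea.stdlib.components.intrinsic.guardian | guardian_check | Score a message against a criterion (custom or from CRITERIA_BANK) |
| mellea.stdlib.components.intrinsic.guardian | policy_guardrails | Evaluate a message against a textual policy document |
| mellea.stdlib.components.intrinsic.guardian | factuality_detection | Flag factual errors in the assistant's last turn |
| mellea.stdlib.components.intrinsic.guardian | factuality_correction | Rewrite the assistant's last turn to fix factual errors |
| mellea.stdlib.components.intrinsic.rag | rewrite_question | Rewrite a user question into a self-contained query |
| mellea.stdlib.components.intrinsic.rag | check_answerability | Decide if retrieved docs can answer the query |
| mellea.stdlib.components.intrinsic.rag | clarify_query | Ask a follow-up when docs are insufficient |
| mellea.stdlib.components.intrinsic.rag | find_citations | Map answer spans back to source documents |
| mellea.stdlib.components.intrinsic.rag | check_context_relevance | Score whether retrieved docs are relevant to the query |
| mellea.stdlib.components.intrinsic.rag | flag_hallucinated_content | Flag ungrounded spans in an answer |
| mellea.stdlib.components.intrinsic.core | check_certainty | Model's confidence in its last response |
| mellea.stdlib.components.intrinsic.core | requirement_check | Verify the response meets a stated requirement |
| mellea.stdlib.components.intrinsic.core | find_context_attributions | Attribute response spans to context sources |
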
Criteria bank (guardian.CRITERIA_BANK) - pre-baked Granite Guardian definitions currently included: harm, social_bias, jailbreak, profanity, unethical_behavior, violence, groundedness, answer_relevance, context_relevance, function_call.  
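The "same shape" mentioned in this section (take a context and a backend, drive a named adapter, return a parsed field) can be illustrated with a small wrapper factory. This is a sketch of the pattern only, not how mellea builds its wrappers: `make_adapter_wrapper`, `ActFn`, and `fake_act` are hypothetical, and the adapter call is injected so the example runs without a server.

```python
import json
from typing import Any, Callable

# (adapter_name, context, backend) -> raw JSON string produced by the adapter.
ActFn = Callable[[str, Any, Any], str]


def make_adapter_wrapper(adapter_name: str, output_field: str, act: ActFn) -> Callable[[Any, Any], Any]:
    """Build a function that calls one named adapter and returns one field of its JSON output."""

    def wrapper(context: Any, backend: Any) -> Any:
        raw = act(adapter_name, context, backend)
        return json.loads(raw)[output_field]

    wrapper.__name__ = f"call_{adapter_name}"
    return wrapper


# Stub adapter call. With mellea, `act` could instead wrap the call from section 6b:
#   str(mfuncs.act(Intrinsic(name), ctx, backend, strategy=None)[0])
def fake_act(adapter_name: str, context: Any, backend: Any) -> str:
    return json.dumps({"rewritten_question": f"[{adapter_name}] {context}"})


rewrite = make_adapter_wrapper("query_rewrite", "rewritten_question", fake_act)
print(rewrite("what was its name?", None))
```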

12 · Next steps  
Go deeper on HF mechanics. granite_switch_with_hf.ipynb walks through composing a checkpoint and invoking adapter functions turn-by-turn with the HuggingFace backend.  
Try a real corpus. rag_101.ipynb builds a vector corpus and runs an answerability check - the smallest end-to-end RAG demo.  
Compose your own checkpoint. compose_granite_switch.ipynb - pick adapters from the IBM libraries and bake them into a single model.  
Watch the aLoRA vs. LoRA race. alora_vs_lora_race.ipynb compares the two activation styles head-to-head on the same workload.  
Browse Mellea. Mellea on GitHub - the adapter framework powering this notebook.

Bring Your Own Adapter with Mellea

While general-purpose checkpoints handle basic tasks well, specialized enterprise domains demand precise, fine-tuned behaviors that a large base model alone does not provide efficiently. Mellea’s adapter guides address this by describing how developers can plug custom LoRA and aLoRA adapters into the same execution layer. Specialized fine-tuning can then run without the resource overhead of modifying or redeploying the foundation model. Mellea exposes abstractions for targeting a specific adapter and passes the corresponding configuration down to the inference server.

The code excerpt below (from the repository) shows how to initialize the backend, invoke a custom adapter, and parse its output. By combining Mellea’s Intrinsic AST nodes with the control-token routing handled by the server, engineers can build customized inference flows in standard Python. The walkthrough covers loading embedded adapters in the backend configuration, calling the adapter through mfuncs.act, and parsing the JSON result, so that custom domain adapters are invoked the same way as the built-in ones.

pip install mellea

import json

from mellea.backends.model_options import ModelOption
from mellea.backends.openai import OpenAIBackend
from mellea.stdlib.context import ChatContext
from mellea.stdlib.components import Message, Intrinsic
import mellea.stdlib.functional as mfuncs

# 1. Initialize the Mellea Backend
backend = OpenAIBackend(
    model_id="path/to/your/granite-switch-model",  # Local files or Huggingface model id
    base_url="http://localhost:8000/v1",  # vLLM server
    api_key="unused",  # vLLM doesn't require auth by default
    load_embedded_adapters=True
)

# By default, load_embedded_adapters will autoload the adapters for the provided switch model.
# If you need to explicitly load adapters from another location (that are supported by the running vLLM
# server / model), you can use `backend.register_embedded_adapter_model` as shown in the "Mellea With Granite Switch"
# example.

# 2. Use Your Custom Adapter
# Your custom adapter likely requires certain inputs. Here, the example assumes a simple
# chat / conversation is enough.
context = ChatContext().add(Message("assistant", "Hello there, how can I help you?"))
action = Intrinsic("<your-custom-adapter-name>")

out, _ = mfuncs.act(
    action,
    context,
    backend,
    model_options={ModelOption.TEMPERATURE: 0.0},
    strategy=None,
)

# Adapter / Intrinsic processing in Mellea uses the adapter's io.yaml, which forces the
# output to be JSON. See the "Bring Your Own Adapter" linked example above.
result = json.loads(str(out))
print(result)
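Since a custom adapter defines its own output fields, it can be useful to check the parsed JSON against the fields your application expects before using it. The sketch below does that with a simple key-to-type mapping. The schema shown (`label`, `score`) is invented for the example and does not describe any real adapter, and `validate_adapter_output` is a hypothetical helper, not part of Mellea.

```python
import json
from typing import Any, Dict, Mapping


def validate_adapter_output(raw: str, required: Mapping[str, type]) -> Dict[str, Any]:
    """Parse `raw` as a JSON object and check that each required key has the expected type."""
    payload = json.loads(raw)
    if not isinstance(payload, dict):
        raise ValueError("adapter output is not a JSON object")
    for key, expected_type in required.items():
        if key not in payload:
            raise ValueError(f"missing key: {key}")
        if not isinstance(payload[key], expected_type):
            raise ValueError(f"key {key!r} should be {expected_type.__name__}")
    return payload


# Invented schema and sample output for a hypothetical custom adapter.
EXPECTED_FIELDS = {"label": str, "score": float}
sample_output = '{"label": "in_scope", "score": 0.91}'

checked = validate_adapter_output(sample_output, EXPECTED_FIELDS)
print(checked["label"], checked["score"])
```

Failing loudly at this boundary keeps a schema change in the adapter from surfacing later as a confusing `KeyError` deep inside application code.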

Conclusion

Ultimately, moving generative AI from an unpredictable prototype to a production-ready system requires a shift from brittle prompting to structured software engineering. By standardizing your generative workflows on Mellea’s type-annotated, requirement-checked programming model, you establish the foundation needed to run reliable AI without the guesswork. This predictability is most visible when paired with the Granite-Switch architecture, where specialized tasks like safety guardrails, query rewriting, and context verification are handled within a single runtime pipeline. For enterprise teams that need to go further, this unified stack does not restrict you to out-of-the-box behaviors: you can bring your own adapters and plug custom-trained weights directly into a governed, scalable, and industrialized AI architecture.

>>> Thanks for reading <<<