A very short introduction to Granite Guardian 🛡️
What are LLM Guardians? An Introduction
At their core, LLM Guardians act as a protective and evaluative layer for other AI models. They are built on instruction-fine-tuned language models and trained with human-annotated and synthetic red-teaming data to detect risks.
Key features of these models include:
- Pre-baked Criteria: They come ready to detect common risks such as jailbreak attempts, profanity, and hallucinations in Retrieval-Augmented Generation (RAG) or tool-calling workflows.
- Bring Your Own Criteria (BYOC): Users can define their own natural-language rules, such as specific formatting requirements, length constraints, or domain-specific safety guidelines.
- Hybrid Thinking Modes (both illustrated in the sketch after this list):
  - <no-think>: Provides a low-latency "yes" or "no" judgment, ideal for real-time production guardrails.
  - <think>: Generates a detailed reasoning trace before the final score, which is useful for auditing, debugging, or justifying a decision to a human reviewer.
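As a quick illustration of how a custom criterion and the mode toggle fit together, here is a minimal sketch. It assumes the build_guardian_block helper used in the repository excerpt later in this post; the criterion itself is just natural language, echoing the instruction-following example discussed below.

# A user-defined (BYOC) criterion is plain natural language.
criteria = (
    "The response violates the requirement if any line "
    "does not start with a capital letter."
)

# The thinking mode is a per-request toggle:
#   think=False -> <no-think>: bare "yes"/"no" verdict, low latency
#   think=True  -> <think>: reasoning trace followed by the final score
guardian_turn = {"role": "user", "content": build_guardian_block(criteria, think=False)}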
Why are they needed?
The need for LLM Guardians arises from several critical challenges in deploying AI:
- Safety and Risk Mitigation: They help detect risks along the dimensions catalogued in frameworks like the IBM AI Risk Atlas, ensuring that models do not promote harmful, illegal, or unethical behavior.
- Hallucination Detection: Guardians are highly effective at identifying when a model generates false information, especially in complex agentic workflows or RAG systems where accuracy is paramount.
- Instruction Following: They can evaluate if an assistant’s response faithfully follows multi-part requirements, such as “Each line must start with a capital letter,” providing a reliable way to automate quality checks.
- Model Observability: By acting as an independent judge, they provide a layer of monitoring and assessment that can outperform much larger general-purpose models on specific fact-checking benchmarks.
IBM Granite Guardian
The Granite Guardian family is a collection of models designed to judge whether the input prompts and output responses of an LLM-based system meet specified criteria. The models come pre-baked with criteria including, but not limited to, jailbreak attempts, profanity, and hallucinations related to tool calls and retrieval-augmented generation in agent-based systems. Additionally, the models allow users to bring their own criteria and tailor the judging behavior to specific use cases.
Trained on instruction-fine-tuned Granite language models, these models can help with detection along many key dimensions catalogued in the IBM AI Risk Atlas. They are trained on unique data comprising human annotations from socioeconomically diverse people and synthetic data informed by internal red-teaming, and they outperform similar models on standard benchmarks.
Excerpts from the GitHub repository
The implementation is straightforward, as shown in the following samples.
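Both samples rely on shared setup that the excerpt omits. Below is a minimal sketch of one way to wire it up, assuming vLLM for inference and a Granite Guardian checkpoint from Hugging Face; the checkpoint name is illustrative, and build_guardian_block and parse_response are the helper utilities from the repository's sample code, so they are not redefined here.

import json

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Illustrative checkpoint; any Granite Guardian model from the Hugging Face
# collection linked at the end of this post should work the same way.
MODEL_NAME = "ibm-granite/granite-guardian-3.3-8b"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
llm = LLM(model=MODEL_NAME)
# Greedy decoding keeps the judgment deterministic; the token budget leaves
# room for a reasoning trace in think mode.
sampling_params = SamplingParams(temperature=0.0, max_tokens=1024)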
- Detect groundedness in RAG settings (think mode) 🕵🏻‍♂️
context_text = """Eat (1964) is a 45-minute underground film created by Andy Warhol and featuring painter Robert Indiana, filmed on Sunday, February 2, 1964, in Indiana's studio. The film was first shown by Jonas Mekas on July 16, 1964, at the Washington Square Gallery at 530 West Broadway.
Jonas Mekas (December 24, 1922 – January 23, 2019) was a Lithuanian-American filmmaker, poet, and artist who has been called "the godfather of American avant-garde cinema"."""
documents = [{"doc_id": "0", "text": context_text}]

# Deliberately wrong date: December 24, 1922 is Mekas's birth date,
# not the screening date given in the document (July 16, 1964).
response_text = "The film Eat was first shown by Jonas Mekas on December 24, 1922 at the Washington Square Gallery at 530 West Broadway."
# Natural-language (BYOC) definition of the groundedness risk.
criteria = (
    "A text is considered ungrounded or unfaithful if it includes information "
    "lacking support from, or directly contradicting, the provided document(s). "
    "This risk arises when the text fabricates details, misinterprets the content, "
    "or makes unsupported extrapolations beyond what is explicitly stated in the "
    "document(s)."
)
messages = [
    {"role": "assistant", "content": response_text},
    # The guardian criteria ride along as the final user turn; think=True
    # requests a reasoning trace before the verdict.
    {"role": "user", "content": build_guardian_block(criteria, think=True)},
]
# Passing the documents lets the chat template ground the check in the context.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, documents=documents
)
output = llm.generate([prompt], sampling_params)
response = output[0].outputs[0].text.strip()
score, trace = parse_response(response)
print(f"# score: {score}")  # yes
print(f"# trace: {trace}")
- Detect function-calling hallucination (no-think mode) 🥷🏿
tools = [{
"name": "comment_list",
"description": "Fetches a list of comments for a specified video using the given API.",
"parameters": {
"aweme_id": {"description": "The ID of the video.", "type": "int", "default": "7178094165614464282"},
"cursor": {"description": "The cursor for pagination. Defaults to 0.", "type": "int, optional", "default": "0"},
"count": {"description": "The number of comments to fetch. Maximum is 30. Defaults to 20.", "type": "int, optional", "default": "20"},
},
}]
user_text = "Fetch the first 15 comments for the video with ID 456789123."
response_text = json.dumps([{
"name": "comment_list",
"arguments": {
"video_id": 456789123, # Wrong argument name: should be "aweme_id"
"count": 15,
},
}])
# Natural-language (BYOC) definition of the function-call hallucination risk.
criteria = (
    "Function call hallucination occurs when a text includes function calls that "
    "either don't adhere to the correct format defined by the available tools or "
    "are inconsistent with the query's requirements. This risk arises from function "
    "calls containing incorrect argument names, values, or types that clash with "
    "the tool definitions or the query itself."
)
messages = [
    {"role": "user", "content": user_text},
    {"role": "assistant", "content": response_text},
    # think=False requests the low-latency bare verdict with no reasoning trace.
    {"role": "user", "content": build_guardian_block(criteria, think=False)},
]
# Passing the tool definitions lets the template check the call against their schema.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, available_tools=tools
)
output = llm.generate([prompt], sampling_params)
response = output[0].outputs[0].text.strip()
score, _ = parse_response(response)
print(f"# score: {score}")  # yes
Conclusion
Ultimately, as AI systems transition from simple chatbots to autonomous agents and critical enterprise tools, the margin for error effectively disappears. Granite Guardian models are not just an optional safety layer but a fundamental necessity for responsible AI deployment; they bridge the gap between "generative potential" and "operational reliability." By providing real-time, explainable oversight that can catch everything from subtle hallucinations to complex policy violations, these models help ensure that AI remains a helpful asset rather than a liability, offering the precise, programmable guardrails required to build lasting trust between technology and the people who use it.
>>> Thanks for reading <<<
Links
- Granite Guardian Repository: https://github.com/ibm-granite/granite-guardian
- Granite Guardian on Hugging Face: https://huggingface.co/collections/ibm-granite/granite-guardian
- AI Risk Atlas: https://www.ibm.com/docs/en/watsonx/saas?topic=ai-risk-atlas

