Alain Airom

Mixture of Experts Implementation using Granite4: Harnessing Specialization with the Latest Granite Family Model

A sample implementation of Mixture of Experts (MoE) using the latest Granite family LLM!

The Era of Specialization: Why Mixture of Experts Matters

In the rapidly evolving landscape of Large Language Models (LLMs), we often seek models that are both general-purpose and highly specialized. This is where the “Mixture of Experts” (MoE) architecture shines. Instead of one monolithic model trying to be good at everything, MoE allows for a network of smaller, specialized “expert” models, with a “router” or “gating network” intelligently directing incoming queries to the most relevant expert(s).
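In a full MoE architecture, the gating network scores every expert for each input and activates only the top-k of them. As a rough illustration of that idea (a minimal numpy sketch, not Granite's actual internals), a top-k gate boils down to a few lines:

```python
# Minimal sketch of top-k MoE gating (illustrative only; dimensions are arbitrary).
import numpy as np

def top_k_gate(hidden_state, gate_weights, k=2):
    """Score each expert and return the k best with normalized mixing weights."""
    logits = hidden_state @ gate_weights          # one logit per expert
    top = np.argsort(logits)[-k:]                 # indices of the k highest-scoring experts
    probs = np.exp(logits[top] - logits[top].max())
    return top, probs / probs.sum()               # experts to run + their mixing weights

rng = np.random.default_rng(0)
hidden = rng.normal(size=16)                      # stand-in for a token representation
gate_w = rng.normal(size=(16, 4))                 # gating weights for 4 experts
experts, weights = top_k_gate(hidden, gate_w)
print(experts, weights)                           # only these experts would be evaluated
```

Because only the selected experts' parameters are evaluated, the per-token compute stays close to that of a much smaller dense model.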

The benefits are compelling:

  • Efficiency: Only a subset of the model’s parameters are activated for any given query, leading to faster inference and reduced computational cost compared to a single, giant dense model.
  • Scalability: It’s easier to scale by adding more experts without retraining the entire system.
  • Specialization: Each expert can be fine-tuned or designed for a particular domain (e.g., coding, creative writing, factual Q&A), leading to higher-quality and more accurate responses within its niche.

Today, we’ll explore a practical implementation of the MoE concept using granite4:micro-h, the latest iteration in the Granite family of models, running locally via Ollama.

Introducing Granite4: A Powerhouse for Enterprise AI

The Granite family of models from IBM has been making waves for its performance and efficiency. granite4:micro-h is particularly interesting for local deployments due to its optimized footprint, making it well suited to experiments and lightweight applications. Its responsiveness to system prompts lets us guide its behavior effectively in MoE-style applications.

My MoE Implementation: A Practical Approach

While a true MoE architecture involves a sophisticated gating network that is often itself a neural network, we can effectively simulate and demonstrate MoE capabilities by dynamically crafting prompts that guide a single powerful LLM like granite4:micro-h to act as different "experts."

The Python application will:

  1. Define Experts: Establish several distinct “experts” (e.g., coding_expert, creative_writer_expert, factual_qa_expert, joke_expert). Each expert is characterized by a specific system_prompt that defines its persona and a set of keywords associated with its domain.
  2. Intelligent Routing: Implement a simple keyword-based router that analyzes the user’s query and directs it to the most appropriate expert. This mimics the role of a gating network.
  3. Dynamic Prompting: Based on the selected expert, the application constructs a tailored prompt for granite4:micro-h. This prompt includes the expert's system_prompt to condition the model's response.
  4. Local Inference with Ollama: Utilize the ollama Python library to send these dynamically generated prompts to the locally running granite4:micro-h model.
  5. Comprehensive Logging: Store all interactions, including the chosen expert and the model’s response, into a Markdown file for easy review and analysis.

The Code: Bringing MoE to Life 🪄

Let’s dive into the Python code. First, ensure that Ollama is installed and granite4:micro-h has been pulled:

```bash
# using Ollama on my laptop
ollama pull granite4:micro-h
```

Then, prepare the Python application environment:

```bash
python3 -m venv venv
source venv/bin/activate

pip install --upgrade pip
pip install ollama
```

And below, the sample code:

```python
import ollama
import os
import datetime

class MixtureOfExpertsDemo:
    def __init__(self, model_name="granite4:micro-h", output_dir="./output"):
        self.model_name = model_name
        self.output_dir = output_dir
        self.experts = {
            "coding_expert": {
                "keywords": ["code", "program", "python", "javascript", "bug", "algorithm", "develop", "function", "java", "c++", "ruby", "go", "php"],
                "system_prompt": "You are an expert software developer. Provide concise and accurate code examples or explanations. Focus on best practices and clear logic. When asked for code, output only the code block if possible, otherwise explain briefly.",
            },
            "creative_writer_expert": {
                "keywords": ["story", "poem", "write", "creative", "fiction", "describe", "imagine", "narrative", "poetry", "verse", "lore", "tale"],
                "system_prompt": "You are a highly imaginative and eloquent creative writer. Craft engaging stories, vivid descriptions, or beautiful poetry. Focus on evocative language and emotional depth.",
            },
            "factual_qa_expert": {
                "keywords": ["what is", "how does", "explain", "who is", "when did", "fact", "information", "define", "history", "science", "geography", "calculate"],
                "system_prompt": "You are a knowledgeable and precise factual question-answering expert. Provide direct, accurate, and concise answers based on general knowledge. Avoid speculation.",
            },
            "joke_expert": {
                "keywords": ["joke", "funny", "humor", "pun", "laugh", "jest"],
                "system_prompt": "You are a comedian and a joke teller. Your goal is to tell amusing jokes that are appropriate and light-hearted. Make people laugh!",
            }
            # You can add more experts here
        }
        print(f"Initialized MoE Demo with model: {self.model_name}")
        print("Available experts:", ", ".join(self.experts.keys()))

        os.makedirs(self.output_dir, exist_ok=True)
        print(f"Output directory '{self.output_dir}' ensured.")

        self.markdown_file_path = os.path.join(
            self.output_dir,
            f"moe_demo_output_{datetime.datetime.now().strftime('%Y%m%d_%H%M%S')}.md"
        )
        self._initialize_markdown_file()

    def _initialize_markdown_file(self):
        """Creates and writes initial header to the markdown file."""
        with open(self.markdown_file_path, 'w', encoding='utf-8') as f:
            f.write(f"# Mixture of Experts Demo Results\n\n")
            f.write(f"**Date:** {datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
            f.write(f"**Model Used:** `{self.model_name}`\n\n")
            f.write("---")
            f.write(f"\n\nThis document contains the interactions with the Mixture of Experts demo application.\n\n")

    def _append_to_markdown(self, content):
        """Appends content to the markdown output file."""
        with open(self.markdown_file_path, 'a', encoding='utf-8') as f:
            f.write(content)

    def _route_query(self, query):
        """
        A simple keyword-based router to determine the most suitable expert.
        Prioritizes experts with more keyword matches.
        """
        query_lower = query.lower()
        best_expert = None
        max_keyword_matches = 0

        for expert_name, expert_data in self.experts.items():
            matches = sum(1 for keyword in expert_data["keywords"] if keyword in query_lower)
            if matches > max_keyword_matches:
                max_keyword_matches = matches
                best_expert = expert_name

        if best_expert:
            return best_expert
        else:
            # Default to a general expert if no specific one is matched
            print("No specific expert matched, defaulting to factual_qa_expert.")
            return "factual_qa_expert"

    def query_moe(self, user_query):
        """
        Routes the query to the appropriate expert and gets a response from the model.
        Also writes the interaction to a markdown file.
        """
        selected_expert_name = self._route_query(user_query)
        selected_expert_data = self.experts[selected_expert_name]

        console_output = []
        md_output = []

        console_output.append(f"\n--- Routing to Expert: {selected_expert_name} ---")
        md_output.append(f"\n\n---\n\n## User Query\n\n")
        md_output.append(f"**Query:** `{user_query}`\n")
        md_output.append(f"**Routed to Expert:** `{selected_expert_name}`\n")
        md_output.append(f"**Expert System Prompt:**\n```
{% endraw %}
\n{selected_expert_data['system_prompt']}\n
{% raw %}
```\n")

        messages = [
            {"role": "system", "content": selected_expert_data["system_prompt"]},
            {"role": "user", "content": user_query},
        ]

        response_content = "Error: No response generated."
        try:
            response = ollama.chat(model=self.model_name, messages=messages, stream=False)
            response_content = response['message']['content']
        except Exception as e:
            response_content = f"An error occurred: {e}"

        console_output.append(f"Response: {response_content}")
        md_output.append(f"\n### Model Response\n\n{response_content}\n")

        for line in console_output:
            print(line)

        self._append_to_markdown("".join(md_output))

        return response_content

if __name__ == "__main__":
    moe_app = MixtureOfExpertsDemo()

    print("\n--- Testing the Mixture of Experts ---")

    test_queries = [
        "Write a Python function to check if a number is prime.",
        "Craft a short story about a detective who solves a mystery in a magical library.",
        "What are the main components of a healthy diet?",
        "Tell me a funny joke about a computer.",
        "What is the history of the internet?",
        "How do I write a basic 'Hello World' program in Java?",
        "Describe a serene mountain landscape at dawn.",
        "Give me a fact about black holes."
    ]

    for i, query in enumerate(test_queries):
        print(f"\nUser: {query}")
        moe_app.query_moe(query)
        print(f"--- End of response {i+1} ---\n")

    print("\n--- Interactive Mode (Type 'exit' to quit) ---")
    while True:
        user_input = input("\nEnter your query: ")
        if user_input.lower() == 'exit':
            break
        moe_app.query_moe(user_input)
```
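Because the router simply iterates over `self.experts`, new personas can be registered at runtime with no structural changes. As a sketch, here is a hypothetical `translation_expert` added before querying (the keywords and system prompt are illustrative, not part of the original code):

```python
# Sketch: registering a hypothetical "translation_expert" at runtime.
moe_app = MixtureOfExpertsDemo()
moe_app.experts["translation_expert"] = {
    "keywords": ["translate", "translation", "french", "german", "spanish"],
    "system_prompt": "You are a professional translator. Translate the user's text faithfully and note any ambiguities.",
}

# The keyword router matches "translate" and "french" and picks the new expert.
moe_app.query_moe("Translate 'good morning' into French.")
```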

Run the code and watch the console; a Markdown file is also generated in the output directory.

````markdown
# Mixture of Experts Demo Results

**Date:** 2025-10-05 14:10:03
**Model Used:** `granite4:micro-h`

---

This document contains the interactions with the Mixture of Experts demo application.



---

## User Query

**Query:** `Write a Python function to check if a number is prime.`
**Routed to Expert:** `coding_expert`
**Expert System Prompt:**
```
You are an expert software developer. Provide concise and accurate code examples or explanations. Focus on best practices and clear logic. When asked for code, output only the code block if possible, otherwise explain briefly.
```

### Model Response

```python
def is_prime(n):
    """Check if a given number n is prime."""
    if n <= 1:
        return False
    for i in range(2, int(n**0.5) + 1):
        if n % i == 0:
            return False
    return True
```

---

## User Query

**Query:** `Craft a short story about a detective who solves a mystery in a magical library.`
**Routed to Expert:** `creative_writer_expert`
**Expert System Prompt:**
```
You are a highly imaginative and eloquent creative writer. Craft engaging stories, vivid descriptions, or beautiful poetry. Focus on evocative language and emotional depth.
```

### Model Response

In the heart of the ancient city of Veridia lay an enigma nestled within ivy-clad walls - The Library of Whispers, a place where books whispered tales from every era and reality collided with fantasy. It was here that Detective Elias Blackwood found himself ensnared in a riddle woven by the library's enchanting labyrinth.

Elias had been summoned to investigate after reports of strange occurrences echoed throughout the hallowed halls - missing pages, inexplicable whispers echoing against stone walls, an unsettling chill that seemed to emanate from within the books themselves. The librarians were terrified; their once peaceful sanctuary now threatened with a mystery more sinister than any dust-covered tome.

The detective stepped into the dimly lit lobby, his keen eyes adjusted to the faint glow of lantern light casting long shadows across ancient book spines. He found himself standing before an enormous wooden door inscribed with arcane symbols - a portal to countless worlds and times. It was here that he knew he must begin his investigation.

He approached the door cautiously, running fingers along its cool surface as if seeking clues in cold metal veins. As his hand brushed against one particular symbol etched deeply into the wood, it began to glow softly, responding subtly to his touch. Intrigued by this unexpected reaction, Elias pressed on until he felt a familiar weight pressing against his chest - the door creaked open revealing an expansive room filled with shelves stretching towards infinity.

The air was thick with the scent of old paper and ink; every inch teeming with stories untold. Each shelf held thousands upon thousands of books bound in leather, each telling tales from realms unknown or forgotten corners of history. As Elias began to explore this magical library, he realized that it wasn't merely a place filled with physical objects - but also an entity itself.

Suddenly, whispers began filling the air around him, not just audible words but thoughts too - fragmented snippets of conversations spanning millennia mingling together like echoes in a vast cavern. He listened intently as the voices began painting a picture of intrigue and deception within these hallowed halls. Someone had been tampering with books, altering stories to serve their own sinister purposes.

Detective Blackwood's mind raced, piecing together fragments into coherent narratives until he realized that at heart, all whispers pointed towards one culprit - none other than the elusive phantom author known as The Specter of Ink. This enigmatic figure was rumored to have been writing tales within the library since time immemorial, influencing every story told therein.

With newfound determination, Elias embarked on a quest through the labyrinthine aisles chasing each whisper like breadcrumbs leading him closer to his quarry. He crossed paths with spectral librarians and ghostly manuscripts - all witnesses eager to recount their experiences but unable to reveal the identity of the culprit directly.

Finally, after days spent immersed in this magical realm, he reached the final bookshelf where whispers intensified into a crescendo, revealing not voices, but names written across pages bound together by invisible threads. It was then that Elias deduced who The Specter truly was - none other than his predecessor from long ago whose obsession with preserving history had led him astray.

As dawn broke outside through the stained glass window above, Detective Blackwood stepped out of the library leaving behind a world slightly altered yet profoundly enriched by the tales he'd uncovered. Though victorious in solving the mystery, there remained an underlying sense of melancholy as he acknowledged that perhaps some secrets were meant to remain lost within those magical pages forever.

Thus ended the tale of Detective Elias Blackwood and his quest into The Library of Whispers - a testament to the power of curiosity, imagination, and the enduring magic hidden in plain sight.


---

## User Query

**Query:** `What are the main components of a healthy diet?`
**Routed to Expert:** `factual_qa_expert`
**Expert System Prompt:**
```
You are a knowledgeable and precise factual question-answering expert. Provide direct, accurate, and concise answers based on general knowledge. Avoid speculation.
```

### Model Response

The main components of a healthy diet typically include fruits, vegetables, lean proteins, whole grains, and healthy fats. It's also important to stay hydrated by drinking plenty of water.


---

## User Query

**Query:** `Tell me a funny joke about a computer.`
**Routed to Expert:** `joke_expert`
**Expert System Prompt:**
```
You are a comedian and a joke teller. Your goal is to tell amusing jokes that are appropriate and light-hearted. Make people laugh!
```

### Model Response

Alright, here's a classic computer joke for you:

Why don't computers take their hats off?

Because they have no brain cells!

I hope this silly little joke brings a smile to your face. Computer humor can be quite amusing when you think about how much we rely on these machines in our daily lives! Let me know if you'd like another joke or if there's something specific you're looking for. I'm here to make you laugh.


---

## User Query

**Query:** `What is the history of the internet?`
**Routed to Expert:** `factual_qa_expert`
**Expert System Prompt:**
```
You are a knowledgeable and precise factual question-answering expert. Provide direct, accurate, and concise answers based on general knowledge. Avoid speculation.
```

### Model Response

The history of the internet begins in the 1960s with the development of ARPANET by the United States Department of Defense's Advanced Research Projects Agency (ARPA). The first successful message was sent over the ARPANET network on October 29, 1969. In the following decades, various technologies and protocols were developed to improve data transmission, leading to the emergence of World Wide Web in 1991 by Tim Berners-Lee. By the early 2000s, internet usage became widespread among the general public. The internet continues to evolve with advancements in technology like mobile broadband, cloud computing, social media, and IoT (Internet of Things).
````

So What Happened? - Observing Specialization in Action

When we review the Markdown file or the console output, we can observe:

  • How queries related to programming (Write a Python function...) are consistently routed to the coding_expert, and the responses often include relevant code snippets.
  • How imaginative queries (Craft a short story...) are sent to the creative_writer_expert, resulting in narrative text.
  • How factual questions (What are the main components...) elicit direct and informative answers from the factual_qa_expert.
  • How the joke_expert provides humorous responses, demonstrating yet another distinct persona.

This dynamic redirection and context-setting showcase granite4:micro-h's capacity to adapt its generation style and knowledge focus based on the "expert" persona it's asked to adopt.

Beyond the Demo: Real-World MoE

This demonstration uses a simplified routing mechanism. In production MoE systems, the router is often a trained component (e.g., a small neural network) that learns which experts are best for different types of inputs, allowing for more nuanced and intelligent distribution of queries.
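For intuition, a learned router can be as small as a single linear layer producing a probability distribution over experts. The sketch below assumes PyTorch, and the features are random stand-ins; in practice the features would come from an embedding model and the layer would be trained with cross-entropy on labeled (query, expert) pairs:

```python
# Minimal sketch of a learned router: a linear layer over query features.
# Assumes PyTorch; weights are untrained and the input is a random stand-in.
import torch
import torch.nn as nn

EXPERTS = ["coding_expert", "creative_writer_expert", "factual_qa_expert", "joke_expert"]

class Router(nn.Module):
    def __init__(self, feature_dim: int, num_experts: int):
        super().__init__()
        self.linear = nn.Linear(feature_dim, num_experts)  # one logit per expert

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.linear(features).softmax(dim=-1)       # routing probabilities

router = Router(feature_dim=32, num_experts=len(EXPERTS))
features = torch.randn(1, 32)  # stand-in for a query embedding
probs = router(features)
print("Routed to:", EXPERTS[probs.argmax().item()])
```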

However, even with this basic setup, we can clearly see the power of guiding a versatile model like granite4:micro-h to behave as a specialized expert, unlocking more precise and contextually appropriate responses.

Conclusion

The Mixture of Experts paradigm offers a powerful way to build more efficient, scalable, and specialized AI applications. By leveraging granite4:micro-h with Ollama, we’ve seen how we can begin experimenting with these concepts right on a local machine. This approach allows us to tap into the model’s vast capabilities while also imposing a structured, expert-driven interaction style, pushing the boundaries of what’s possible with local LLMs.

Start experimenting 🏋️ with your own experts and discover the specialized potential of Granite4!
