Simran Shaikh

Posted on May 18

I Built an XAI Agent with Hermes Agent That Explains Any ML Model in Plain English — Here's Everything I Learned

#hermesagentchallenge #devchallenge #agents

Hermes Agent Challenge Submission

TL;DR: I spent a weekend building XAI-Agent — an autonomous Hermes Agent pipeline that runs real SHAP + LIME analysis on any ML model and generates a plain-English explainability report. This post is everything I learned: how Hermes Agent's multi-step planning actually works, where it surprised me, where it frustrated me, and why I think it's the most underrated open-source agent framework right now.

The Conversation That Started This

Three weeks ago I was in a meeting presenting a RandomForest model I'd trained to predict customer churn. The model was good — 91% accuracy, solid precision-recall curve. I was proud of it.

Then our Head of Product asked: "But why does it predict churn for this specific customer?"

I opened my Jupyter notebook. Showed the SHAP waterfall plot.

She stared at it for five seconds.

"Can you just... tell me in normal words?"

That moment broke something in my brain. I'd spent three days training the model and thirty minutes on explainability. The explainability was useless to the person who needed it most.

So I built XAI-Agent. And in doing so, I learned more about Hermes Agent than I expected.

What Is Hermes Agent, Actually?

Before I get into what I built, let me give you the honest explanation I wish I'd had when I started.

Hermes Agent is an open-source agentic framework built for multi-step autonomous task execution.

That sentence has a lot of words. Here's what it actually means in practice:

Most "AI" tools you interact with are single-shot. You send a prompt, you get a response. Done. The model doesn't remember what it did, doesn't use the output of one step as input to the next, and doesn't make decisions about how to approach a problem.

Hermes Agent does all three of those things.

It gives you:

A planning loop — the agent breaks down a task into steps before executing
Tool use — distinct callable functions the agent can invoke in sequence
Context persistence — output from Tool 1 is available to Tool 2, 3, 4, and 5
Autonomous decision-making — the agent decides which tool to use and when, based on what it finds

This sounds abstract. Let me make it concrete with what I actually built.

The Problem I Was Solving (And Why It's Bigger Than You Think)

Explainable AI (XAI) has a dirty secret: the tools exist, but the workflow is broken.

SHAP and LIME are genuinely powerful libraries. But using them requires:

Writing custom Python code for each model type
Knowing which explainer to use (TreeExplainer? KernelExplainer? DeepExplainer?)
Interpreting the numerical output yourself
Translating that into language a non-technical person understands
Running a separate bias audit
Writing a report that combines all of this

That's 4-6 hours of work per model, requiring a data scientist to babysit every step.

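And this isn't a niche problem. The EU AI Act imposes transparency obligations on high-risk AI systems. GDPR Article 22 restricts decisions based solely on automated processing and, read together with the GDPR's transparency provisions, is widely interpreted as giving people a right to an explanation. In the US, lenders must give applicants specific reasons for adverse credit decisions, including decisions driven by ML models.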

I expect the demand for XAI to explode over the next 3 years. The libraries exist, but an efficient workflow for delivering explanations with them doesn't yet.

That's the gap I built XAI-Agent to fill.

How I Used Hermes Agent to Build It

Here's the part I want to go deep on — because understanding how Hermes Agent enabled this architecture is the most useful thing I can share.

The Core Insight: Tools Are Not Functions

When I first read about Hermes Agent's tool use, I thought of tools as just... functions. Like, the agent calls shap_analyze() and gets back a result.

That mental model is wrong, and it held me back for half a day.

Tools in Hermes Agent are autonomous units of responsibility. Each tool:

Has a single, clear job
Can access shared agent state (what previous tools discovered)
Makes decisions based on that state
Produces output that enriches the shared state for future tools

The difference sounds subtle. The impact on architecture is enormous.

Here's what my 5-tool pipeline looks like and, more importantly, why it's designed this way:

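import joblib
import numpy as np
import pandas as pd
import shap


class HermesXAIAgent:
    """
    Hermes Agent implementation: 5 autonomous tools.
    Each tool feeds context into the next.
    """

    def __init__(self):
        # Shared state: every tool reads from and writes to this dict
        self.results = {}

    def tool_file_reader(self, model_path, dataset_path, target_col):
        """
        TOOL 1: Inspection

        This isn't just 'load the files'. The tool makes decisions:
        - What type of model is this? (affects every downstream tool)
        - What is the task? (classification vs regression)
        - Is there class imbalance? (affects which samples LIME picks)
        - Which SHAP explainer should Tool 2 use?

        All of this gets stored in self.results for Tools 2-5 to use.
        """
        model = joblib.load(model_path)
        model_type = type(model).__name__  # e.g. "RandomForestClassifier"

        # Assumes a CSV dataset with the target in a single column
        df = pd.read_csv(dataset_path)
        X, y = df.drop(columns=[target_col]), df[target_col]

        # Heuristic: a target with few distinct values is treated as classification.
        # This decision propagates through the entire pipeline
        task = "classification" if y.nunique() <= 10 else "regression"

        self.results["model"] = model             # Tool 2 needs the fitted model
        self.results["X"] = X                     # Tool 2 samples rows from this
        self.results["model_type"] = model_type   # Tool 2 reads this
        self.results["task"] = task               # Tools 3, 4, 5 read this

    def tool_shap_analyzer(self):
        """
        TOOL 2: Global Explainability

        Uses model_type from Tool 1 to auto-select the right explainer.
        A naive implementation would hardcode TreeExplainer.
        This tool makes an intelligent decision.
        """
        model = self.results["model"]
        X = self.results["X"]
        model_type = self.results["model_type"]  # Read from Tool 1

        TREE_MODELS = ["RandomForestClassifier", "XGBClassifier", ...]  # list elided

        # Explain a sample rather than every row; the sizes here are illustrative
        X_sample = X.sample(min(150, len(X)), random_state=0)

        if model_type in TREE_MODELS:
            # TreeExplainer: fast, exact, works with tree structure
            explainer = shap.TreeExplainer(model)
        else:
            # KernelExplainer: slower, universal fallback
            background = X.sample(min(50, len(X)), random_state=0)
            # Regressors have no predict_proba, so pick the function by task
            predict_fn = (model.predict_proba
                          if self.results["task"] == "classification"
                          else model.predict)
            explainer = shap.KernelExplainer(predict_fn, background)

        # Handle SHAP version incompatibility; this one took me 2 hours
        shap_raw = explainer.shap_values(X_sample)

        if isinstance(shap_raw, list):
            shap_vals = shap_raw[1]           # Old SHAP: list per class
        elif shap_raw.ndim == 3:
            shap_vals = shap_raw[:, :, 1]    # New SHAP: 3D array
        else:
            shap_vals = shap_raw             # Regression: use as-is

        # Assumed definition of importance: mean absolute SHAP value per feature
        importance_df = pd.DataFrame({
            "feature": X_sample.columns,
            "importance": np.abs(shap_vals).mean(axis=0),
        }).sort_values("importance", ascending=False)

        self.results["shap_vals"] = shap_vals   # Tool 5 uses this
        self.results["imp_df"] = importance_df  # Tools 3, 5 use this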

See how Tool 2 reads model_type from Tool 1's output? And produces imp_df that Tool 5 will use? That's the Hermes Agent planning loop in action. It's not a chain of independent function calls — it's a coherent reasoning process where every step builds on what came before.
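To isolate the shared-state idea from the SHAP details, here is a minimal sketch of a sequential tool runner. It is not Hermes Agent's API and not the XAI-Agent source: `run_pipeline`, the three stand-in tools, and the `state` dictionary are hypothetical names, and the tools only fake the work the real ones do.

```python
from typing import Callable, Dict, List

State = Dict[str, object]
Tool = Callable[[State], None]


def inspect_model(state: State) -> None:
    # Stand-in for Tool 1: records facts that later tools depend on.
    state["model_type"] = "RandomForestClassifier"


def choose_explainer(state: State) -> None:
    # Stand-in for Tool 2: reads Tool 1's output to make a decision.
    tree_models = {"RandomForestClassifier", "XGBClassifier"}
    state["explainer"] = "tree" if state["model_type"] in tree_models else "kernel"


def write_report(state: State) -> None:
    # Stand-in for Tool 5: it only sees what earlier tools stored.
    state["report"] = (
        f"Explained a {state['model_type']} with the {state['explainer']} explainer."
    )


def run_pipeline(tools: List[Tool]) -> State:
    """Run each tool in order against one shared state dictionary."""
    state: State = {}
    for tool in tools:
        tool(state)
    return state


final_state = run_pipeline([inspect_model, choose_explainer, write_report])
print(final_state["report"])
```

Reordering the list so that `choose_explainer` runs before `inspect_model` raises a `KeyError`, which is a cheap way to see that each step really does depend on the one before it.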

The Part That Surprised Me: Fallback Planning

One of the things that makes Hermes Agent genuinely useful in production is that it handles failure gracefully within the planning loop.

In my tool_shap_analyzer, if the primary explainer fails (say, the model doesn't support TreeExplainer), the agent doesn't crash and show an error. It falls back to KernelExplainer, logs which method it used, and continues to Tool 3.

This sounds simple. But it means the agent's output is always a complete report — not a half-finished analysis with an error message in the middle.

That's the difference between a demo and a tool people actually use.
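A fallback like this can be sketched as an ordered list of strategies wrapped in try/except. This is an illustration under assumptions, not the XAI-Agent code: `explain_with_fallback` is a hypothetical helper, and the two strategy functions are stand-ins that simulate a failing tree explainer and a working kernel explainer without calling SHAP.

```python
import logging
from typing import Callable, List, Tuple

logger = logging.getLogger("xai_agent")


def explain_with_fallback(
    strategies: List[Tuple[str, Callable[[], list]]],
) -> Tuple[str, list]:
    """Try each (name, strategy) pair in order; return the first that succeeds."""
    errors = []
    for name, strategy in strategies:
        try:
            values = strategy()
        except Exception as exc:  # deliberately broad: any failure triggers the fallback
            logger.warning("%s failed (%s); trying next strategy", name, exc)
            errors.append(f"{name}: {exc}")
            continue
        return name, values
    raise RuntimeError("all strategies failed: " + "; ".join(errors))


def tree_strategy() -> list:
    # Stand-in failure: simulates the tree explainer rejecting the model.
    raise TypeError("model type not supported by the tree explainer")


def kernel_strategy() -> list:
    # Stand-in for the slower, model-agnostic path; values are placeholders.
    return [0.0756, 0.0538, 0.0503]


method_used, shap_values = explain_with_fallback(
    [("TreeExplainer", tree_strategy), ("KernelExplainer", kernel_strategy)]
)
print(method_used)
```

Returning the strategy name alongside the values is what lets the final report say which method was used.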

The Part That Frustrated Me: Context Size

Here's something nobody tells you about agentic pipelines: context accumulates fast.

By the time I got to Tool 5 (report_writer), my agent state contained:

The full model object
The original dataset (569 rows × 31 columns)
The SHAP values array (150 samples × 30 features)
Three LIME explanation objects
Feature importance DataFrames

That's a lot of state to pass around. For my use case (Streamlit + local execution) it was fine. But if you're building a Hermes Agent pipeline for production use with large datasets, you need to think carefully about what you store in self.results vs. what you compute on demand.

My solution: I store only what downstream tools need. The raw SHAP array (large) gets converted to a summary DataFrame (small) before storage. The LIME explanation objects get converted to plain lists immediately.
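A rough sketch of that compaction step, assuming the SHAP values are a 2D NumPy array and the LIME objects expose `as_list()`. The helper names `summarize_shap` and `compact_lime` are hypothetical, and the array is random data with the same shape as the one described above, not real SHAP output.

```python
import numpy as np


def summarize_shap(shap_vals: np.ndarray, feature_names: list) -> list:
    """Collapse an (n_samples, n_features) array to one mean |SHAP| per feature."""
    mean_abs = np.abs(shap_vals).mean(axis=0)
    order = np.argsort(mean_abs)[::-1]  # most important feature first
    return [(feature_names[i], float(mean_abs[i])) for i in order]


def compact_lime(explanation) -> list:
    """Keep only plain (feature, weight) pairs from a LIME explanation object."""
    return [(str(name), float(weight)) for name, weight in explanation.as_list()]


rng = np.random.default_rng(0)
raw = rng.normal(size=(150, 30))  # placeholder data: 150 samples x 30 features
names = [f"feature_{i}" for i in range(30)]
summary = summarize_shap(raw, names)

print(raw.size, "floats reduced to", len(summary), "rows")
```

The trade-off is that per-sample detail is gone after summarizing, so any downstream tool that needs individual rows has to recompute them or read them from disk.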

The Technical Deep Dive: SHAP + Hermes Agent

Let me go deeper on the SHAP integration because this is where the agentic approach really pays off.

Why SHAP Alone Isn't Enough

SHAP gives you this:

worst area                  0.0756
worst concave points        0.0538
mean concave points         0.0503
worst perimeter             0.0489
worst radius                0.0401

That's useful to me. It's useless to my Head of Product.

What she needs is:

"The size of the tumor at its largest measurement ('worst area') is the single most important factor in this model's predictions. When this measurement is high, the model is significantly more likely to predict malignancy. Think of it as the model's primary red flag."

The Hermes Agent pipeline handles this translation. Tool 2 generates the numbers. Tool 5 converts them to prose using model-aware language. The agent knows it's a medical dataset (from the feature names) and calibrates its language accordingly.
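One simple way to do this kind of translation is a sentence template over the ranked features. The sketch below is a hypothetical illustration, not how Tool 5 is actually implemented; `describe_importance` is an invented name, and the input is the five values shown above.

```python
from typing import List, Tuple


def describe_importance(ranked: List[Tuple[str, float]], top_n: int = 3) -> str:
    """Turn (feature, mean |SHAP|) pairs into one plain-English paragraph."""
    ranked = sorted(ranked, key=lambda pair: pair[1], reverse=True)
    total = sum(score for _, score in ranked)
    top_name, top_score = ranked[0]
    sentences = [
        f"'{top_name}' is the single most important factor in this model's "
        f"predictions, accounting for about {top_score / total:.0%} of the "
        f"influence among the features listed."
    ]
    runners_up = [name for name, _ in ranked[1:top_n]]
    if runners_up:
        joined = " and ".join(f"'{name}'" for name in runners_up)
        sentences.append(f"It is followed by {joined}.")
    return " ".join(sentences)


shap_summary = [
    ("worst area", 0.0756),
    ("worst concave points", 0.0538),
    ("mean concave points", 0.0503),
    ("worst perimeter", 0.0489),
    ("worst radius", 0.0401),
]
print(describe_importance(shap_summary))
```

Mean absolute SHAP values carry magnitude but not direction, so the template deliberately avoids saying whether a high value pushes the prediction up or down. That statement needs the signed per-sample values.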

The 3D SHAP Array Bug That Will Break Your Code

I want to specifically call this out because it's a real issue that hit me hard and I've seen it break other implementations.

SHAP's output format for multi-class models changed between versions. Older SHAP returns:

# Older SHAP: list of arrays, one per class
shap_values = [array_class_0, array_class_1]  # each: (n_samples, n_features)

Newer SHAP returns (the switch landed around the 0.45 release; check the release notes for your installed version):

# Newer SHAP: single 3D array
shap_values  # shape: (n_samples, n_features, n_classes)

If you write naive code like shap_values[1] to get class 1, it works on old SHAP and silently returns the wrong thing on new SHAP: the (n_features, n_classes) slice for the second sample, not the (n_samples, n_features) matrix for class 1.

The Hermes Agent approach of having a dedicated tool_shap_analyzer with explicit version handling is what caught this. Because the tool is isolated, I could add this logic cleanly:

if isinstance(shap_raw, list):
    sv = shap_raw[1] if len(shap_raw) > 1 else shap_raw[0]
elif hasattr(shap_raw, 'ndim') and shap_raw.ndim == 3:
    sv = shap_raw[:, :, 1]  # New SHAP: take class 1 slice
else:
    sv = shap_raw  # Regression or binary

This is a good example of why agentic tool isolation matters. If my SHAP analysis were one giant function, this fix would be buried in 200 lines of code. As a distinct tool, it's a handful of lines, clearly documented, and easy to test independently.
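To show what testing it independently can look like, here is a small sketch that wraps the same branching in a function and feeds it synthetic arrays in all three formats. `select_class_1` is a hypothetical name, and the arrays are made up with NumPy rather than produced by SHAP.

```python
import numpy as np


def select_class_1(shap_raw):
    """Return an (n_samples, n_features) array regardless of the input format."""
    if isinstance(shap_raw, list):
        return shap_raw[1] if len(shap_raw) > 1 else shap_raw[0]
    if getattr(shap_raw, "ndim", None) == 3:
        return shap_raw[:, :, 1]
    return shap_raw


n_samples, n_features = 4, 3
class_0 = np.zeros((n_samples, n_features))
class_1 = np.ones((n_samples, n_features))

old_style = [class_0, class_1]                      # list, one array per class
new_style = np.stack([class_0, class_1], axis=-1)   # (n_samples, n_features, n_classes)
regression = np.full((n_samples, n_features), 0.5)  # already 2D

for label, raw in [("old list", old_style), ("new 3D", new_style), ("2D", regression)]:
    print(label, select_class_1(raw).shape)

# The naive index on the new format selects sample 1, not class 1
print("naive new_style[1]:", new_style[1].shape)
```

All three inputs come back as (4, 3), while the naive `new_style[1]` has shape (3, 2). No exception is raised, which is why the bug is easy to miss.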

Hermes Agent vs. Other Agentic Frameworks

I know some of you are thinking: "Why Hermes Agent? Why not LangChain, AutoGen, or CrewAI?"

Fair question. Here's my honest take after building this:

LangChain

LangChain is powerful but opinionated. It works best when you're chaining LLM calls together. For my use case — where the "tools" are Python scientific computing libraries, not LLM endpoints — LangChain felt like overkill. I'd be fighting its abstractions rather than using them.

Use LangChain when: You're primarily chaining LLM calls, doing RAG, or need the massive ecosystem of pre-built integrations.

Use Hermes Agent when: You want clean, explicit tool definitions where you control exactly what runs and when.

AutoGen

AutoGen's multi-agent conversation model is fascinating but complex. For a pipeline where the execution order is deterministic (always: inspect → SHAP → LIME → bias → report), AutoGen's flexibility is unnecessary complexity.

Use AutoGen when: You need agents to negotiate with each other, have non-deterministic execution paths, or are building multi-agent debate systems.

Use Hermes Agent when: You have a clear sequential task where each step has defined inputs and outputs.

CrewAI

CrewAI is probably the closest conceptually — role-based agents with tool access. The main difference I found: Hermes Agent gives you more direct control over the planning loop. In CrewAI, the "crew" model adds abstraction that I didn't need.

Use CrewAI when: You want to model your problem as a team of specialized agents with roles and delegation.

Use Hermes Agent when: You want to own the planning logic explicitly and aren't sure yet what "roles" make sense for your problem.

My Honest Assessment

Hermes Agent's sweet spot is deterministic multi-step pipelines where you care about exactly what runs. Scientific computing, data analysis, document processing, code generation — anything where you can define clear tool responsibilities and want explicit control over the execution flow.

It's not trying to be everything. That's actually its strength.

What I'd Build Next

After spending a weekend with Hermes Agent, here's what I'm thinking about:

Counterfactual explanations. The natural next question after "why did the model predict X?" is "what would need to change to get a different prediction?" I'd add a Tool 6 for this, using DiCE (Diverse Counterfactual Explanations) to generate statements like "if your tumor's worst area had been 15% smaller, the model would have predicted benign."

Neural network support. Right now XAI-Agent handles tree models and most sklearn models. Adding SHAP DeepExplainer for PyTorch/TensorFlow would cover the deep learning models that many production systems run on.

CI/CD integration. The most powerful version of this isn't a web app — it's a tool that runs automatically every time a model is retrained, generates a comparison report (did the important features change? did bias metrics shift?), and posts it as a GitHub PR comment.
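The comparison step could start as small as the sketch below. Everything in it is hypothetical: `compare_importance` is an invented helper, the `previous` scores reuse the values shown earlier, and the `retrained` scores are made-up numbers purely for illustration. Posting the result to a PR is left out.

```python
from typing import Dict, List


def compare_importance(
    old: Dict[str, float], new: Dict[str, float], top_n: int = 3
) -> List[str]:
    """Describe how the top-N features changed between two model versions."""

    def top(scores: Dict[str, float]) -> List[str]:
        return sorted(scores, key=scores.get, reverse=True)[:top_n]

    old_top, new_top = top(old), top(new)
    lines = [f"NEW in top {top_n}: {f}" for f in new_top if f not in old_top]
    lines += [f"DROPPED from top {top_n}: {f}" for f in old_top if f not in new_top]
    return lines or [f"Top {top_n} features unchanged"]


previous = {
    "worst area": 0.0756,
    "worst concave points": 0.0538,
    "mean concave points": 0.0503,
    "worst perimeter": 0.0489,
}
retrained = {  # invented values, for illustration only
    "worst area": 0.0700,
    "worst perimeter": 0.0600,
    "worst concave points": 0.0500,
    "mean concave points": 0.0400,
}

for line in compare_importance(previous, retrained):
    print(line)
```

A membership check like this ignores reordering within the top N and changes in magnitude, so a real report would probably also want rank deltas and the bias metrics.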

The Bigger Picture: Why Open-Source Agents Matter

I want to end with something that's been on my mind since I started building this.

XAI platforms from companies like Fiddler AI, Arize, and Arthur AI are excellent. They're also $30K–$100K+ per year. A startup, a researcher, a solo developer, a nonprofit building AI for healthcare in a developing country — none of them can afford that.

The open-source AI agent ecosystem is the equalizer.

Hermes Agent running SHAP + LIME locally, generating a downloadable report, requiring nothing beyond a Python environment — that's accessible to anyone with a laptop.

As AI becomes more embedded in consequential decisions (loan approvals, medical diagnoses, hiring, parole, content moderation), the ability to explain and audit those decisions cannot be a luxury that only well-funded companies can afford.

Open, capable agent systems like Hermes Agent aren't just technically interesting. They're the infrastructure for making AI accountability universal.

That's why I built XAI-Agent. That's why I'm writing about Hermes Agent. And that's why I think the work this community is doing with open-source agents genuinely matters.

Try It Yourself

GitHub: github.com/SimranShaikh20/xai-agent

git clone https://github.com/SimranShaikh20/xai-agent
cd xai-agent
pip install -r requirements.txt
streamlit run app.py

Test files (sample_model.pkl + sample_dataset.csv) are included. Full analysis in under 3 minutes.

Questions I'd Love to Discuss

I'm genuinely curious about the community's experience:

Have you used Hermes Agent on something other than text/LLM tasks? What was the use case?
How do you handle context accumulation in long agentic pipelines?
What explainability features would make XAI-Agent actually useful for your work?

Drop them in the comments — I read and respond to everything 👇

Built for the DEV Hermes Agent Challenge — May 2026

Tags: #hermesagentchallenge #agents #ai #machinelearning #explainableai #opensource #python #streamlit