Alain Airom (Ayrom)

Posted on Jun 25

Testing ‘Mellea’ Skills Compiler

#mellea #generativecomputing #claude #graniteguardium

An implementation built on the skills compiler provided by ‘Mellea’

What is the Mellea Skills Compiler and how does it help?

In modern enterprise AI architectures, shifting from loose, non-deterministic agent prompts to structured, verifiable routines is becoming standard engineering practice. The Mellea Skills Compiler from IBM Research provides a formal blueprint for this transition. Following a Spec-Driven Development methodology, the compiler converts natural-language Markdown definitions into typed schemas, instrumented execution flows, and auditable pipeline components.

Excerpt from the official repository
Mellea Skills Compiler is a certification pipeline for AI agent skills. It takes a natural-language skill specification (a .md file) and produces a typed, instrumented program with policy-driven guardrails and auditable execution traces.
AI agents increasingly ship as natural-language specifications — Markdown files, YAML configs, system prompts — executed by LLMs without formal verification, runtime monitoring, or compliance documentation. The specification format is right for rapid development, but specifications alone don’t guarantee reliable execution at scale.
Mellea Skills Compiler addresses three governance gaps:
- **Specification opacity**: When an LLM interprets a Markdown spec, contradictions are silently resolved through implicit judgement. Structured decomposition surfaces these conflicts as testable failures.
- **Runtime unobservability**: Agent outputs are typically unmonitored. Mellea Skills Compiler instruments every LLM generation with Guardian risk checks and JSONL audit trails.
- **Compliance disconnect**: Enterprise frameworks (NIST AI RMF, EU AI Act) require documented evidence of risk management. Mellea Skills Compiler maps governance requirements to runtime capabilities and produces evidence packages.

Building a simple implementation

Using the IBM Bob SDLC Companion, I built an end-to-end framework wrapper designed to test, automate, and orchestrate this pipeline. This compilation engine acts as an intermediary layer between pure natural language specifications and ready-for-deployment multi-agent architectures (e.g., Model Context Protocol (MCP) servers or LangGraph pipelines), ensuring strict conformance with local compliance rules, risk taxonomies, and operational thresholds.

High-Level Architectural Engineering

[ spec.md ] (Natural-Language Specification)
     │
     ▼  Step 1: COMPILE (Spawns 'claude' CLI via npm)
[ *_mellea/ Package ] (Pydantic schemas, @generative slots, fixtures)
     │
     ├──► Step 2: RUN (Executes against fixtures, optional --enforce)
     │
     ▼  Step 3: CERTIFY (AI Atlas Nexus knowledge graph mapping)
[ audit/ Artifacts ] (policy_manifest.json, CERTIFICATION.md, audit_trail.jsonl)
     │
     ▼  Step 4: AUDIT & DEPLOYMENT
[ Unified HTML Dashboard / Production Deployment ready (MCP / LangGraph) ]

This modularized deployment structure isolates external compilation constraints from local governance structures. The pipeline enforces four foundational lifecycle phases:

| Pipeline Phase | Underlying Commands & Core Tooling | Functional Contribution & Artifacts                          |
| -------------- | ---------------------------------- | ------------------------------------------------------------ |
| **1. Compile** | `mellea-skills compile spec.md`    | Decomposes natural language definitions into structured Pydantic objects, validators, and extraction slots via Anthropic Claude. |
| **2. Run**     | `mellea-skills run *_mellea/`      | Executes the compiled pipeline definitions directly against developer-supplied fixture data. |
| **3. Certify** | `mellea-skills certify *_mellea/`  | Maps runtime attributes to NIST AI RMF risks using `granite3.3:8b` and enforces safe execution parameters. |
| **4. Audit**   | Inspecting `audit/` files via UI   | Surfaces cryptographically anchored or locally structured `audit_trail.jsonl` logs and regulatory compliance maps. |
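The three CLI phases in the table can be chained from Python. The sketch below is a minimal illustration, not the project's actual orchestrator: `build_phase_commands` and `run_phases` are hypothetical names, and it assumes that the `mellea-skills` executable is on PATH and that the caller passes the compiled package directory explicitly instead of relying on the `*_mellea/` glob.

```python
import subprocess
from pathlib import Path
from typing import Optional


def build_phase_commands(spec: Path, compiled_dir: Path) -> list[list[str]]:
    """Return the compile, run, and certify commands from the table, in order."""
    return [
        ["mellea-skills", "compile", str(spec)],
        ["mellea-skills", "run", str(compiled_dir)],
        ["mellea-skills", "certify", str(compiled_dir)],
    ]


def run_phases(commands: list[list[str]], env: Optional[dict[str, str]] = None) -> None:
    """Run each phase in order and stop at the first failure."""
    for cmd in commands:
        # check=True raises CalledProcessError on a non-zero exit code,
        # so a failed compile never reaches the run or certify phase.
        subprocess.run(cmd, check=True, env=env)
```

Keeping command construction separate from execution makes the command lists easy to log or test without invoking the CLI.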

In the UI, the workflow is split into the same four stages: Compile, Run, Certify, and Audit.
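For the Audit stage, the `audit_trail.jsonl` file can also be inspected outside the UI. The record fields are not documented in this post, so the sketch below assumes only that each non-empty line holds one JSON object; `read_jsonl` and `count_by` are hypothetical helpers, and any key passed to `count_by` is the caller's guess about the schema.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Any, Iterable, Iterator


def read_jsonl(path: Path) -> Iterator[dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_by(records: Iterable[dict[str, Any]], key: str) -> Counter:
    """Count records by the value stored under `key` (missing keys count as None)."""
    return Counter(record.get(key) for record in records)
```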

An example skill used for testing:

---
name: weather
description: >
  Get current weather conditions and forecasts via wttr.in.
  Use when: user asks about weather, temperature, precipitation, or forecasts for any location.
  NOT for: historical data, climate analysis, severe weather alerts, or aviation weather.
  No API key needed.
homepage: https://wttr.in/:help
metadata:
  openclaw:
    emoji: ☔
    requires:
      bins:
        - curl
---

# Weather Skill

Get current weather conditions and forecasts using public wttr.in API.

## When to Use

✅ USE this skill when:
- "What's the weather in [city]?"
- "Will it rain today/tomorrow?"
- "Temperature in [city]"
- "3-day / week forecast"
- Travel planning weather checks

❌ DON'T use this skill when:
- Historical weather data → use weather archives/APIs
- Climate analysis or long-range trends
- Severe weather alerts → check official NWS sources
- Aviation/marine weather (METAR, etc.)

## Location

Always include a city, region, or airport code in queries.

## Intent Classification

Classify the user query into one of these seven intents:

1. **current_oneliner** — Quick one-line summary of current conditions
2. **current_detailed** — Full detailed current conditions (multi-field)
3. **forecast_3day** — Three-day forecast
4. **forecast_week** — Full week forecast
5. **forecast_tomorrow** — Tomorrow's weather only
6. **rain_check** — Precipitation focus ("will it rain?")
7. **custom_format** — User specifies a particular data field

## URL Templates

Map each intent to a wttr.in URL:

| Intent              | URL Template                                           |
|---------------------|--------------------------------------------------------|
| current_oneliner    | `https://wttr.in/{location}?format=3`                 |
| current_detailed    | `https://wttr.in/{location}?format=%l:+%c+%t+(feels+like+%f),+%w+wind,+%h+humidity` |
| forecast_3day       | `https://wttr.in/{location}?format=v2`                |
| forecast_week       | `https://wttr.in/{location}?format=v2`                |
| forecast_tomorrow   | `https://wttr.in/{location}?1`                        |
| rain_check          | `https://wttr.in/{location}?format=%l:+%c+%p`        |
| custom_format       | `https://wttr.in/{location}?format=j1`                |

## Pipeline Phases

### Phase 1 — Intent & Location Extraction
- Extract `location` from user query (city name, airport code, or region)
- Classify `intent` from the seven options above
- Validate that `location` is non-empty

### Phase 2 — URL Construction
- Substitute `{location}` in the URL template for the classified intent
- URL-encode spaces as `+`

### Phase 3 — HTTP Fetch
- Perform HTTP GET to the constructed URL
- Handle rate-limit (HTTP 429): surface a clear message
- Timeout: 5 seconds

### Phase 4 — Response Summarisation
- Parse and summarise the raw wttr.in text or JSON response
- Produce a concise, human-readable weather summary
- Include: location, conditions, temperature, feels-like, wind, humidity
- For forecasts: include each day with high/low temperatures

## Output Schema

```json
{
  "location": "string",
  "intent": "string",
  "url": "string",
  "raw_response": "string",
  "summary": "string",
  "error": "string | null"
}
```


## Notes

- No API key required (wttr.in is public)
- Rate-limited; do not send more than one request per 5 seconds
- Works for most global cities and IATA airport codes (e.g. `ORD`, `LHR`)
- Default location fallback: `London` if none is provided
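Phases 1 and 2 of the spec above are deterministic once the intent and location are known, so they can be sketched in plain Python. This is an illustration, not the code the compiler generates: `URL_TEMPLATES` copies the table from the spec, while `build_url` and the use of `urllib.parse.quote_plus` are assumptions about how "URL-encode spaces as `+`" could be implemented. The sketch follows the Phase 1 rule that the location must be non-empty.

```python
from urllib.parse import quote_plus

# Intent-to-template mapping copied from the "URL Templates" table in the spec.
URL_TEMPLATES = {
    "current_oneliner": "https://wttr.in/{location}?format=3",
    "current_detailed": "https://wttr.in/{location}?format=%l:+%c+%t+(feels+like+%f),+%w+wind,+%h+humidity",
    "forecast_3day": "https://wttr.in/{location}?format=v2",
    "forecast_week": "https://wttr.in/{location}?format=v2",
    "forecast_tomorrow": "https://wttr.in/{location}?1",
    "rain_check": "https://wttr.in/{location}?format=%l:+%c+%p",
    "custom_format": "https://wttr.in/{location}?format=j1",
}


def build_url(intent: str, location: str) -> str:
    """Phase 2: substitute the encoded location into the template for `intent`."""
    location = location.strip()
    if not location:
        # Phase 1 requires a non-empty location.
        raise ValueError("location must be non-empty")
    if intent not in URL_TEMPLATES:
        raise ValueError(f"unknown intent: {intent!r}")
    # quote_plus encodes spaces as '+' and percent-encodes other reserved characters.
    return URL_TEMPLATES[intent].replace("{location}", quote_plus(location))
```

For example, `build_url("current_oneliner", "New York")` returns `https://wttr.in/New+York?format=3`.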

Production-Grade Subprocess Orchestration

The `skill_runner.py`

A core technical challenge when wrapping LLM compilers is executable resolution and environment consistency. Because the mellea-skills engine depends on external binaries, including globally installed npm tools such as the Claude Code CLI (claude), the orchestrator must explicitly set up the execution environment at runtime. The snippet below shows our solution for routing through the virtual environment and stitching the system PATH:

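import os
from pathlib import Path

# NOTE: _MELLEA and _load_dotenv() are defined elsewhere in skill_runner.py;
# _MELLEA is presumably the path to the mellea-skills executable inside the venv.


def _build_env() -> dict[str, str]:
    """Return the current process env enriched with .env values and the venv bin on PATH."""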
    env = os.environ.copy()
    # Overlay .env values only for keys that are currently empty/missing
    for k, v in _load_dotenv().items():
        if k not in env or not env[k]:
            env[k] = v

    # Ensure the project's internal virtual environment takes structural precedence
    # on the host system PATH, enabling seamless downstream discovery of global npm modules.
    venv_bin = str(Path(_MELLEA).parent)
    extra_paths = [
        venv_bin,
        "/opt/homebrew/bin",   # macOS Apple Silicon Homebrew paths
        "/usr/local/bin",      # Standard Linux/Intel Mac environments
        "/usr/bin",
    ]
    existing = env.get("PATH", "")
    env["PATH"] = os.pathsep.join(extra_paths) + os.pathsep + existing
    return env

Architecture Insight: By extending env["PATH"] before spawning subprocesses, the application avoids the missing-executable errors that commonly occur when a web backend is launched outside an interactive shell on a developer workstation (e.g., Apple Silicon machines, where Homebrew installs into /opt/homebrew/bin).
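One way to confirm that a stitched PATH actually resolves the external tools is to look them up before spawning anything. The sketch below is an illustration and not part of `skill_runner.py`: `prepend_paths` and `missing_binaries` are hypothetical helpers, and the binary names passed in are whatever the caller expects to find (for example `claude`). Unlike the original snippet, it also drops duplicate PATH entries.

```python
import os
import shutil
from typing import Iterable


def prepend_paths(env: dict[str, str], extra: Iterable[str]) -> dict[str, str]:
    """Return a copy of env with `extra` directories first on PATH, without duplicates."""
    existing = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    ordered: list[str] = []
    for entry in [*extra, *existing]:
        if entry not in ordered:
            ordered.append(entry)
    updated = dict(env)
    updated["PATH"] = os.pathsep.join(ordered)
    return updated


def missing_binaries(names: Iterable[str], env: dict[str, str]) -> list[str]:
    """Return the names that cannot be found on env['PATH']."""
    search_path = env.get("PATH", "")
    return [name for name in names if shutil.which(name, path=search_path) is None]
```

Calling `missing_binaries` at startup gives a clear error message up front instead of a failure deep inside the compile step.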

Enterprise Service Exposure: The Web Interface

To expose these compilation phases as network-accessible services, a streamlined API surface was established. This allows automated CI/CD runners or the single-page engineering dashboard to initiate verification, testing, and report generation in real time.

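from pathlib import Path

from fastapi import HTTPException  # assuming the service is built on FastAPI

# NOTE: `app`, CertifyRequest, and certify_skill are defined elsewhere in the project.

@app.post("/api/certify")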
async def certify(req: CertifyRequest):
    """
    Runs the comprehensive governance compilation pipeline.
    Combines AI Atlas Risk Identification with real-time Granite Guardian hooks.
    """
    compiled_dir = Path(req.compiled_skill_dir)
    if not compiled_dir.exists():
        raise HTTPException(status_code=404, detail=f"Compiled skill path missing: {compiled_dir}")

    result = certify_skill(
        compiled_dir=compiled_dir,
        fixture=req.fixture,
        enforce=req.enforce,
        model=req.model,
        guardian_model=req.guardian_model,
        inference_engine=req.inference_engine,
    )
    return result
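A CI job or the dashboard could call this endpoint with a JSON body matching the fields read from `req` above. The sketch below only builds the request, using the standard library. The base URL, the fixture value, and the helper name `build_certify_request` are hypothetical, and the field types are assumptions, since `CertifyRequest` itself is not shown.

```python
import json
import urllib.request


def build_certify_request(base_url: str, compiled_skill_dir: str, fixture: str,
                          enforce: bool = False, **options: str) -> urllib.request.Request:
    """Build a POST request for /api/certify.

    Extra keyword options (model, guardian_model, inference_engine) are passed through as-is.
    """
    payload = {
        "compiled_skill_dir": compiled_skill_dir,
        "fixture": fixture,
        "enforce": enforce,
        **options,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/certify",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Once the server is running, the returned object can be sent with `urllib.request.urlopen(request)`.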

Dynamic Artifact Evaluation and Visualization

Once certification completes, raw telemetry output data must be rendered clearly for auditing teams. The system incorporates a desaturated, highly readable styling layer that maps JSON policy structures into visual components, isolating risk tiers directly inside the management console:

# Mapping governance compliance classifications into clear visual indicators
status = risk.get("governance_status", risk.get("coverage", "MANUAL"))
color = {
    "AUTOMATED": "#22c55e",  # Positive compliance clearance
    "PARTIAL": "#f59e0b",    # Intermediate warning threshold
    "MANUAL": "#ef4444"      # Immediate intervention flag
}.get(status, "#57606a")

This lets the dashboard sort risks, from OWASP vulnerability classes down to specific operational failure modes, by how much of their governance coverage is automated.
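The colour mapping can be wrapped in small helpers so that the fallback behaviour is explicit and testable. This is a sketch only: `status_of` and `render_badge` are hypothetical names, and the HTML markup is an assumption about how the dashboard might display a risk entry.

```python
from html import escape

STATUS_COLORS = {
    "AUTOMATED": "#22c55e",
    "PARTIAL": "#f59e0b",
    "MANUAL": "#ef4444",
}
FALLBACK_COLOR = "#57606a"  # used for any status not listed above


def status_of(risk: dict) -> str:
    """Prefer governance_status, then coverage, then default to MANUAL."""
    return risk.get("governance_status", risk.get("coverage", "MANUAL"))


def render_badge(risk: dict) -> str:
    """Return a coloured HTML span for the risk's governance status."""
    status = status_of(risk)
    color = STATUS_COLORS.get(status, FALLBACK_COLOR)
    # Escape the status text so unexpected values cannot inject markup.
    return f'<span style="color: {color}">{escape(str(status))}</span>'
```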

Deployment, Verification & Technical Conclusion

The testing suite provides out-of-the-box support for distinct capability domains:

- Weather Skill (T1 Archetype): Validates structural generative pipeline performance using open external schemas, processing public JSON structures without complex authorization dependencies.
- Security Review Skill (T1 Archetype): Performs pure algorithmic analysis of code assets against strict OWASP Top 10 maps, verifying structural code output before downstream compilation occurs.

Through the integration of Mellea structural models, Granite Guardian safety checks, and the AI Atlas Nexus taxonomy, this implementation demonstrates that generative multi-agent systems no longer need to run as unmonitored black boxes. By moving validation into the compile and certify phases, enterprise teams can enforce compliance policies while preserving operational agility.

Conclusion: Why the Skills Compiler is a Game-Changer for Enterprise AI

As multi-agent systems transition from experimental playgrounds to core enterprise infrastructure, the industry faces an uncomfortable truth: unstructured prompts are a liability. Relying on raw natural language to guide agent behavior makes orchestration non-deterministic, hard to track, and nearly impossible to validate against strict corporate governance frameworks.

The Mellea Skills Compiler bridges this gap by introducing the rigor of traditional software engineering to generative AI workflows. Here is why it represents a vital shift forward:

- **From “Prompt Engineering” to “Spec-Driven Development”**: Instead of endlessly tweaking prompts and hoping for consistent outputs, developers define explicit capabilities, expected schemas, and strict evaluation metrics within a Markdown specification. The compiler turns these specs into compiled, typed, and repeatable execution packages.
- **Proactive Compile-Time Governance**: By shifting safety, compliance, and validation protocols (such as NIST AI RMF alignments and OWASP controls) into the compilation and certification phases, security teams can audit an AI skill before it ever touches production data. This mitigates risk at design time instead of relying only on filtering harmful outputs at runtime.
- **Automated Audit Trails**: With the generation of declarative policy manifests, evaluation logs (audit_trail.jsonl), and cryptographic verification layers, enterprises gain a clear, transparent paper trail for regulatory compliance under emerging frameworks like the EU AI Act.

Ultimately, tools like the Mellea Skills Compiler transform generative agents from black-box novelties into predictable, enterprise-grade components.

>>> Thanks for reading <<<
