Debmalya Sen

Posted on Jun 23

Designing Helium Agent

#ai #agents #opensource #programming

1. What is Helium Agent?

Helium Agent is a lightweight, terminal-focused AI agent written in Python.
Think Codex or OpenCode, but without the overhead. It's published on PyPI (
pip install helium-agent ), runs as helium . from any directory, and works
with any OpenAI-compatible LLM endpoint — cloud or local.

Core capabilities:

• General-purpose chat with tool calling
• Long-running coding tasks via an agentic loop
• Deep research with multi-source evidence collection and citation
• RAG (Retrieval-Augmented Generation) for file-based Q&A
• Persistent memory across sessions
• Hierarchical subagent delegation
• A plugin system via SKILL.md files

The design philosophy is minimum viable complexity — every feature is
implemented in the simplest way that works, with no speculative
abstractions.

2. Architecture Overview

The architecture is deliberately flat. There's no framework, no dependency
injection container, no event bus. Modules import each other directly (with
lazy imports where circular dependencies would otherwise occur). This is a
feature, not a bug — the codebase is navigable in a single sitting.

3. Design Choices

3.1 Prompt-Based Tool Calling

Decision: Helium does NOT use OpenAI's function calling API. Instead, the
LLM is instructed (via the system prompt) to output tool calls as
<action>{"tool": "...", "args": {...}}</action> XML tags.

Why:

• Model agnostic. Works with any LLM that can follow instructions — local
llama.cpp models, OpenRouter free tiers, fine-tuned models. No need for the
provider to support a specific tool-calling schema.
• Full control. The tool prompt is a plain string, editable without touching
code. Adding a new tool means adding a function and a description — no
schema generation, no API negotiation.
• Simpler debugging. The raw LLM output is human-readable. You can see
exactly what the model tried to do.

Trade-off: JSON extraction becomes fragile. The model might output slightly
malformed JSON, wrap it in markdown code blocks, or include extra text. This
led to extract_json() in utils/parser.py — a 75-line cascade of
fallbacks:

This works in practice but is a maintenance liability. Every new edge case from a new model means another fallback branch.

3.2 Dependency Injection in the Agentic Loop

Decision: AgenticLoop accepts two callables — ask_model and execute_tool_call — rather than importing the LLM and tool modules directly.

class AgenticLoop:
    def __init__(self, ask_model, execute_tool_call, max_turns=6):
        self.ask_model = ask_model
        self.execute_tool_call = execute_tool_call
        self.max_turns = max_turns

Why: This single decision enables the entire system's composability:

The general chat loop uses the standard LLM and tool execution.
The coding workflow (/code) creates an AgenticLoop with auto-approved tools and max_turns=30.
Subagents create their own AgenticLoop with a filtered tool set (only the tools the parent allowed).
Skills inject their SKILL.md body into the system prompt and run a fresh AgenticLoop.

No subclasses, no strategy pattern, no configuration objects. Just two callables.

3.3 Global Mutable State (by design)

Decision: Several modules use module-level singletons:

conversation_history (list) in core/llm.py
_manager in tools/memory_ops.py
_todo_list in tools/todo_tools.py
_manager in tools/subagent_tools.py

Why: Helium is a single-user, single-threaded terminal agent. There is exactly one conversation, one memory store, one todo list at any time. Global state is the simplest representation of this reality.

Trade-off: This rules out concurrency. You can't run two subagents in parallel because they'd share the same conversation history and tool state. This is acceptable today but is the first thing that would need to change for async support.

3.4 Three-Tier Permission Model

safe — auto-execute. Reads, searches, memory lookups, todo queries.
risky — requires user confirmation. File writes, bash, app launching.
conditional (bash only) — is_command_safe() inspects the command string. ls, cat, pwd are safe. rm, mv, chmod are risky.

The --nuclear / --auto-approve flag bypasses all checks. Useful for CI, dangerous for production.

Why this matters: The LLM can hallucinate tool calls. Without permission gates, a confused model could delete files or run arbitrary commands. The three-tier system lets harmless operations flow freely while blocking anything that could cause damage.

3.5 No SDK — Raw HTTP Only

Decision: All LLM communication uses requests.post() with manual SSE parsing. No OpenAI SDK, no httpx, no abstraction layer.

Why:

Fewer dependencies. The requirements.txt stays small. Each dependency is a potential breakage point.
Full transparency. You can see exactly what's being sent to the API and what's coming back.
Provider flexibility. Any endpoint that accepts OpenAI-format chat completions works. No SDK version pinning, no API compatibility matrix.

Trade-off: Manual SSE parsing is fiddly. stream_openrouter_response() in utils/check_llm_api.py handles chunked transfer encoding, data: [DONE] markers, and error responses. This is code that a well-tested SDK would handle for you.

3.6 SKILL.md Plugin System

Decision: Skills are markdown files with YAML frontmatter, discovered from ~/.config/helium-agent/skills/ and .helium/skills/.

---
name: caveman
trigger: /caveman
type: slash
description: "Respond like a caveman"
---
You are a caveman. Respond to everything in caveman speak.
Use grunts and simple words. No modern language.

Why:

Zero code required. Anyone can create a skill by writing a markdown file.
Two types: Slash commands (triggered by /name) and contextual skills (injected into the system prompt when relevant).
Skill-scoped tools. A skill can declare allowed_tools to restrict what the LLM can do within that skill's context.

This is the simplest possible plugin system. No Python entry points, no registration APIs, no configuration files beyond the markdown itself.

4. Orchestration: How It All Fits Together

4.1 The Agentic Loop

The core of Helium is a turn-based loop that alternates between the LLM and the tool system:

Key parameters:

General chat: max_turns = 6. Enough for a few tool calls without runaway loops.
Coding workflow: max_turns = 30. Long enough for multi-file edits with verification.
Temperature: 0.3. Low enough for consistency, high enough for natural language.
History: MAX_HISTORY = 10. The last 10 messages are kept. Older ones are dropped.

The loop terminates when:

The LLM responds without an <action> tag (final answer).
max_turns is reached (timeout).
A tool call is invalid and the LLM can't recover.

4.2 The Research Pipeline

Deep research is the most complex orchestration in Helium. It's a multi-stage pipeline with iteration:

The dual-provider search (DuckDuckGo + SearxNG) is a redundancy measure. If one provider is down or rate-limited, the other fills the gap. The SourcePolicy scores URLs by source type (official docs > blogs > forums) to prioritize high-quality evidence.

4.3 The Subagent System

Added in the June 15 session, the subagent system enables hierarchical delegation:

Critical design decisions:

Reuses AgenticLoop. No new execution engine. A subagent runs the same loop as the main agent, just with a filtered tool set.
Tool filtering at the manager layer. _wrap_execute_tool() intercepts tool calls and rejects anything not in the subagent's allowed_tools. The LLM doesn't know it's restricted.
IDs are unique, names aren't. You can have five "researcher" subagents with different IDs. This allows multiple instances of the same role.
Lazy imports. tools/registry.py imports subagent tools lazily to avoid a circular dependency chain (registry → subagent_tools → subagent_manager → llm → registry).

4.4 The Memory System

The three-layer approach means:

Flat store handles keyword search well ("What's my preferred language?")
Knowledge graph handles relationship queries ("What do I prefer?") via SPO triplets
Conversation store provides session context without polluting long-term memory

All three share a single SQLite connection. The memory is "persistent" within a project directory (stored as memory.db), but not shared across projects.

5. Mistakes and How to Avoid Them

5.1 Circular Import Hell

What happened: When adding the subagent system, the import chain tools/registry.py → tools/subagent_tools.py → core/subagent_manager.py → core/llm.py → tools/registry.py created a circular dependency that crashed on startup.

The fix: Lazy imports in tools/registry.py:

def _create_subagent_lazy(*args, **kwargs):
    from tools.subagent_tools import create_subagent
    return create_subagent(*args, **kwargs)

How to avoid it:

Draw the import graph before adding new modules.
If A → B → C → A is unavoidable, break the cycle at the point with the fewest dependents (usually the leaf module).
Consider a registry/observer pattern instead of direct imports for tool systems.

5.2 JSON Extraction Fragility

What happened: Different LLMs produce tool calls in subtly different formats. Some wrap JSON in markdown code blocks. Some include trailing commas. Some add explanatory text before or after the JSON. Each new model exposed a new edge case in extract_json().

The current state: A 75-line cascade of regex fallbacks. It works, but every new model is a potential breakage.