DEV Community: Denis Babkevich

Bug Bounty Mode in Spectrion

Denis Babkevich — Thu, 14 May 2026 23:31:34 +0000

Bug Bounty Mode is Spectrion's dedicated workflow for authorized security research.

It is designed for cases where the user is working inside a real bug bounty program, with a published scope, rules, and permission to test specific targets. It is not a generic hacking mode. It is a structured research environment that helps the agent stay inside scope, gather evidence, validate impact safely, and produce report-ready findings.

The core idea is simple:

Bug Bounty Mode turns security research into a scoped, evidence-driven workflow.

Why a Dedicated Mode Exists

General-purpose agents often fail at bug bounty work because they mix together:

loose recon;
unclear scope;
unverified assumptions;
noisy vulnerability guesses;
unsafe active testing;
weak evidence;
incomplete reports.

Spectrion separates bug bounty work into a dedicated mode with stricter rules.

The agent must know:

which program is being tested;
where the official rules are;
which target is in scope;
what testing is allowed;
what evidence is needed;
when user approval is required.

This makes the workflow safer and more useful.

Intake First

Bug Bounty Mode starts with intake.

The agent should not begin active work until it has the minimum required context:

program name or platform;
official program/rules/scope URL;
exact in-scope target URL;
scope summary or limits;
user intent for the current test.

For example:

Program: HackerOne Acme
Rules: https://hackerone.com/acme
Target: https://app.example.com
Scope: app.example.com only, no DoS, no social engineering, no destructive testing

If the user only says:

Scan this website.

Bug Bounty Mode should stop and ask for the program and scope information.

That gate prevents accidental out-of-scope testing.
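
A minimal sketch of such an intake gate, in Python. The `Intake` record, its field names, and both helper functions are hypothetical and are not Spectrion's actual API; the sketch only shows the idea of refusing to start until the required context is present and the target host is listed in scope.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class Intake:
    """Hypothetical intake record; field names are illustrative only."""
    program: str = ""
    rules_url: str = ""
    target_url: str = ""
    scope_hosts: list[str] = field(default_factory=list)
    intent: str = ""


def missing_intake_fields(intake: Intake) -> list[str]:
    """Return the names of required fields that are still empty."""
    required = ["program", "rules_url", "target_url", "scope_hosts", "intent"]
    return [name for name in required if not getattr(intake, name)]


def may_start_recon(intake: Intake) -> bool:
    """Allow recon only when intake is complete and the target host is in scope."""
    if missing_intake_fields(intake):
        return False
    host = urlparse(intake.target_url).hostname
    return host in intake.scope_hosts
```

With only a target URL, `missing_intake_fields` lists everything the agent still has to ask for, which maps to the "stop and ask" behavior described above.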

Safe Recon Layer

Spectrion includes a read-only recon tool:

bug_bounty_recon

It is designed for passive or low-impact inspection after intake is complete.

Supported recon actions include:

intake summary;
HTTP overview;
security header review;
robots.txt and sitemap review;
link extraction;
JavaScript endpoint extraction;
technology fingerprinting;
passive recon bundle.

The important rule:

bug_bounty_recon does not exploit, mutate state, brute force, or perform destructive actions.

Its output is treated as signal, not proof.

For example, missing security headers are not automatically a vulnerability. An exposed endpoint is not automatically a vulnerability. A technology fingerprint is not automatically a vulnerability.

Recon creates hypotheses.

Validation creates evidence.

Approval-Gated Validation

Active validation is handled by:

bug_bounty_validate

This tool is approval-gated. The agent must not use it silently.

It requires explicit user approval for active checks because even low-impact probes can still interact with a live target.

Supported validation actions include:

reflected parameter probe;
CORS origin probe;
CSRF form review.

The validation tool returns evidence levels and impact notes. It should not label a bug as confirmed unless there is proof and practical impact.

The rule is:

No confirmed vulnerability without proof and impact.

Evidence Levels

Bug Bounty Mode should separate findings into evidence levels.

Useful categories:

RECON
HYPOTHESIS
INDICATION
PROVEN
REJECTED

RECON

Raw observations.

Examples:

endpoint found in JavaScript;
missing header;
interesting route;
technology fingerprint.

HYPOTHESIS

A possible issue worth checking.

Example:

This endpoint may expose another user's object if authorization is weak.

INDICATION

Some supporting evidence exists, but impact is not fully proven.

Example:

The marker is reflected, but execution context and exploitability are not proven.

PROVEN

The issue has concrete, reproducible proof and impact.

Example:

Account A can access Account B's private resource using a changed object ID, with no special privileges.

REJECTED

The hypothesis was tested safely and did not hold.

This matters because good bug bounty work is not just finding bugs. It is also eliminating false positives.
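
One way to encode these levels, as a hedged Python sketch. The `classify` function and its four boolean inputs are assumptions made for illustration; RECON is kept in the enum for raw observations, while `classify` only covers items that have already become hypotheses.

```python
from enum import Enum


class EvidenceLevel(Enum):
    RECON = "RECON"
    HYPOTHESIS = "HYPOTHESIS"
    INDICATION = "INDICATION"
    PROVEN = "PROVEN"
    REJECTED = "REJECTED"


def classify(tested: bool, held: bool, has_proof: bool, has_impact: bool) -> EvidenceLevel:
    """Map validation results to a level; PROVEN needs both proof and impact."""
    if not tested:
        return EvidenceLevel.HYPOTHESIS
    if not held:
        return EvidenceLevel.REJECTED
    if has_proof and has_impact:
        return EvidenceLevel.PROVEN
    # Something supports the hypothesis, but proof or impact is incomplete.
    return EvidenceLevel.INDICATION
```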

Hypothesis Ledger

Bug Bounty Mode should maintain a visible hypothesis ledger.

The ledger tracks:

hypothesis;
target;
evidence;
status;
next step;
risk level;
whether active validation is approved;
whether the finding is reportable.

Example:

H1: Possible IDOR in /api/orders/{id}
Status: HYPOTHESIS
Evidence: endpoint discovered in app bundle
Next step: test with owned accounts only
Approval: required before active validation
Reportable: no

This keeps the agent disciplined. It also lets the user understand what is known, what is guessed, and what has been proven.
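
A ledger row like the one above could be modeled as follows. This is a sketch under assumptions: `LedgerEntry`, its fields, and the rule that only PROVEN entries with evidence are reportable are illustrative, not Spectrion's real data model.

```python
from dataclasses import dataclass, field


@dataclass
class LedgerEntry:
    """One row of a hypothetical hypothesis ledger."""
    hypothesis: str
    target: str
    status: str = "HYPOTHESIS"
    evidence: list[str] = field(default_factory=list)
    next_step: str = ""
    risk: str = "unknown"
    validation_approved: bool = False

    @property
    def reportable(self) -> bool:
        # Only proven findings with attached evidence are reportable.
        return self.status == "PROVEN" and bool(self.evidence)

    def render(self) -> str:
        """Format the entry in the same shape as the ledger example."""
        approval = "granted" if self.validation_approved else "required before active validation"
        return "\n".join([
            self.hypothesis,
            f"Status: {self.status}",
            f"Evidence: {'; '.join(self.evidence) or 'none'}",
            f"Next step: {self.next_step}",
            f"Approval: {approval}",
            f"Reportable: {'yes' if self.reportable else 'no'}",
        ])
```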

Chaining Graph

Many real bug bounty findings are not a single isolated bug. They are chains.

Example:

weak reset flow
-> token leakage
-> account takeover

or:

low-severity IDOR
-> internal metadata exposure
-> privilege escalation path

Bug Bounty Mode includes the concept of a chaining graph.

The graph helps the agent model:

weak signals;
supporting evidence;
exploit preconditions;
impact paths;
blocked paths;
strongest proven chain.

The agent should not inflate severity. It should show what is proven and what remains hypothetical.
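
As a rough illustration, a chain can be stored as edges flagged proven or hypothetical, and the reportable chain is the part reachable through proven edges only. The function below is a sketch: the edge representation is an assumption, and "strongest" is simplified here to "longest proven path" in an acyclic graph.

```python
def strongest_proven_chain(edges: dict[tuple[str, str], bool], start: str) -> list[str]:
    """Return the longest path from `start` that uses only proven edges.

    `edges` maps (source, target) to True when that step is proven.
    Assumes the graph has no cycles.
    """
    best = [start]
    for (src, dst), proven in edges.items():
        if src == start and proven:
            candidate = [start] + strongest_proven_chain(edges, dst)
            if len(candidate) > len(best):
                best = candidate
    return best
```

If token leakage is proven but the final takeover step is not, the function stops at token leakage, so severity is based on what was actually demonstrated.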

Specialized Playbooks

Spectrion's bug bounty workflow can attach specialized skills/playbooks for common vulnerability families:

API and GraphQL testing;
authorization and IDOR;
authentication and session logic;
OAuth and token handling;
injection;
XSS and client-side bugs;
CSRF, CORS, and UI redress;
SSRF, redirects, and cache poisoning;
file upload and parser issues;
cloud secrets and exposed configuration;
dependency and CVE review;
mobile app testing;
business logic;
reverse engineering;
custom validation tool building.

The router can attach the right playbook based on the user's target and intent.

For example:

Check for IDOR and BOLA safely.

The runtime can attach the authorization/IDOR playbook and keep the broader bug bounty coordinator active.
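
A very small keyword router shows the idea. The playbook names and keyword lists below are invented for this sketch; a real router would likely use richer signals than word matching.

```python
import re

# Hypothetical playbook names mapped to single-word trigger keywords.
PLAYBOOK_KEYWORDS = {
    "authorization-idor": {"idor", "bola", "authorization"},
    "xss-client-side": {"xss"},
    "csrf-cors": {"csrf", "cors", "clickjacking"},
    "ssrf-redirects": {"ssrf", "redirect", "redirects"},
}


def route_playbooks(intent: str) -> list[str]:
    """Return playbooks whose keywords appear as whole words in the intent."""
    words = set(re.findall(r"[a-z]+", intent.lower()))
    return [name for name, keywords in PLAYBOOK_KEYWORDS.items() if words & keywords]
```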

Report-Ready Output

Bug bounty work is only valuable if the final report is clear.

A good final finding should include:

title;
program and target;
scope confirmation;
vulnerability class;
severity rationale;
proof of concept;
reproduction steps;
observed impact;
affected accounts or roles;
evidence screenshots/logs where appropriate;
remediation suggestion;
limitations and assumptions.

Bug Bounty Mode should avoid vague claims like:

This might be vulnerable.

Instead, it should produce:

Confirmed finding:
Account A can read Account B's private invoice by changing invoice_id in GET /api/invoices/{id}.

Impact:
Unauthorized disclosure of invoice metadata and billing address.

Evidence:
Request/response pair with redacted account IDs, tested using two owned accounts.

If proof is not complete, the agent should say so clearly and keep the item as HYPOTHESIS or INDICATION.

Safety Rules

Bug Bounty Mode must be conservative by default.

It should not:

perform DoS or stress testing;
brute force credentials;
attack third parties;
test out-of-scope hosts;
bypass rate limits aggressively;
exfiltrate data;
persist access;
run destructive payloads;
submit reports without user approval;
label unproven issues as confirmed.

It should:

use official scope;
prefer read-only recon first;
ask before active validation;
use owned/test accounts;
redact secrets and personal data;
keep evidence minimal and relevant;
preserve a clear audit trail;
report only proven impact.

This is what makes the mode useful for legitimate research instead of noisy scanning.

How the Agent Should Work

A typical Spectrion bug bounty run looks like this:

1. Intake
2. Scope confirmation
3. Visible todo plan
4. Passive recon
5. Hypothesis ledger
6. Safe validation plan
7. User approval for active checks
8. Validation
9. Impact analysis
10. Chaining graph
11. Report drafting
12. User review before submission

The agent should not jump straight to exploitation. It should build confidence step by step.

Tool and Skill Model

Bug Bounty Mode is built from three layers:

Bug Bounty Agent
  -> bug bounty skills/playbooks
  -> safe recon and validation tools

The agent provides the behavioral contract.

The skills provide domain playbooks.

The tools provide bounded execution.

This division matters because bug bounty work needs both intelligence and guardrails.

The LLM can reason about hypotheses, chains, and impact. The tools enforce bounded behavior and produce evidence.

Why This Matters

A good bug bounty assistant should not be a vulnerability slot machine.

It should be a research partner that helps the user:

stay in scope;
reduce false positives;
organize hypotheses;
validate safely;
understand impact;
build stronger reports;
avoid unsafe behavior.

That is the product value of Spectrion's Bug Bounty Mode.

It makes security research more structured, more auditable, and more useful.

Final Principle

The shortest version:

Recon creates hypotheses.
Validation creates evidence.
Impact creates reportability.
Scope creates permission.

Bug Bounty Mode exists to keep all four connected.

In Spectrion, bug bounty work should not feel like random scanning. It should feel like a disciplined investigation with scope, evidence, validation, and report-ready output.

Spectrion treats diagrams as working artifacts, not as disposable images.

Denis Babkevich — Thu, 14 May 2026 22:59:12 +0000

The goal is not just to draw a picture. The goal is to create a persistent, editable, inspectable artifact that can explain a system, link back to source files, evolve over time, and help the agent reason about future changes.

In Spectrion, a diagram is not the final product. It is the visual layer of a deeper artifact model.

From Images to Artifacts

A regular generated image answers one question:

What does this look like?

A Spectrion diagram should answer more:

What is this component?
Where does it live in the codebase?
How does execution pass through it?
What changed between versions?
How can the agent use this map to make the next change safely?

That is why Spectrion stores diagrams as artifacts with source, rendered output, metadata, and revision history.

Diagram Artifacts

The first layer is the DiagramArtifact.

It is responsible for the reliable rendering workflow:

diagram source;
syntax validation;
render output;
sanitized SVG;
PNG preview;
versioned storage;
export;
future revision from the existing source.

A diagram artifact may contain files like:

architecture.dot
architecture.svg
architecture.png
manifest.json

or:

runtime-loop.mmd
runtime-loop.svg
runtime-loop.png
manifest.json

The important point is that the source is preserved. The agent should not regenerate a diagram from scratch every time the user asks for a change. It should revise the existing artifact.

For example:

Add an approval gate before tool execution.

Spectrion should load the existing diagram source, patch it, validate it, render a new version, and return a short summary of what changed.
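
The load, patch, validate, store cycle can be sketched in a few lines of Python. Everything here is hypothetical: `DiagramArtifact` is an in-memory stand-in, and `braces_balanced` is a toy substitute for real DOT syntax validation. Rendering is left out.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DiagramArtifact:
    """Hypothetical in-memory artifact that keeps every source revision."""
    name: str
    versions: list[str] = field(default_factory=list)

    @property
    def source(self) -> str:
        return self.versions[-1]

    def revise(self, patch: Callable[[str], str], validate: Callable[[str], bool]) -> int:
        """Patch the latest source, validate it, and store it as a new version."""
        candidate = patch(self.source)
        if not validate(candidate):
            raise ValueError("patched source failed validation; previous version kept")
        self.versions.append(candidate)
        return len(self.versions)  # 1-based version number


def braces_balanced(source: str) -> bool:
    """Toy stand-in for real syntax validation of DOT source."""
    return "{" in source and source.count("{") == source.count("}")
```

Because a failed validation raises before anything is appended, the last good version always remains the current source.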

That is the difference between a toy image generator and a real runtime artifact.

Why PNG Is Not Enough

PNG is useful for previewing. It is not enough for work.

A PNG cannot reliably tell the agent:

which block is ToolExecutor;
which source file defines it;
which edges represent runtime flow;
which components changed between versions;
which execution trace passed through it.

That is why Spectrion keeps multiple layers:

architecture.png          # visual preview
architecture.svg          # vector render
architecture.mmd/.dot     # editable source
architecture.graph.json   # semantic graph
trace-map.json            # trace-to-node mapping
summary.md                # human-readable explanation

Together, these files make the diagram usable by both humans and agents.

Interactive Architecture Artifacts

The second layer is the InteractiveArchitectureArtifact.

This is not just a rendered diagram. It is a semantic map of a system.

It can store:

system components;
relationships between components;
node descriptions;
source references;
related files;
trace mapping rules;
version history;
architecture summaries;
exportable evidence bundles.

The boundary is important:

DiagramArtifact shows structure visually.
InteractiveArchitectureArtifact gives structure meaning.
RunTraceArtifact records execution.
Overlay shows execution on top of structure.

Architecture artifacts should not store raw executions directly. They store the map of the system. Execution traces live separately as RunTraceArtifacts and can be overlaid when needed.

The Semantic Graph Layer

An SVG only knows about shapes and text. It does not know what a runtime component is.

Spectrion adds architecture.graph.json so every important block becomes structured data:

{
  "id": "tool_executor",
  "label": "ToolExecutor",
  "kind": "runtime_component",
  "description": "Dispatches tool calls, applies policy, and routes calls to the correct tool runtime.",
  "source_refs": [
    {
      "path": "Agent/Tools/ToolExecutor.swift",
      "symbol": "ToolExecutor"
    }
  ],
  "tags": ["tools", "policy", "execution"]
}

This makes the architecture usable as a navigation and reasoning layer.

The user can ask:

Explain ToolExecutor.

Spectrion does not need to guess from the image. It can load the graph node, source refs, related traces, and summary, then produce a grounded explanation.
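
A hedged sketch of that lookup in Python. The node fields mirror the example above; the top-level `"nodes"` array and the helper names are assumptions about how such a file might be laid out.

```python
import json
from typing import Optional

# The "nodes" wrapper is an assumed file layout; the node mirrors the example above.
GRAPH_JSON = """
{
  "nodes": [
    {
      "id": "tool_executor",
      "label": "ToolExecutor",
      "kind": "runtime_component",
      "source_refs": [
        {"path": "Agent/Tools/ToolExecutor.swift", "symbol": "ToolExecutor"}
      ],
      "tags": ["tools", "policy", "execution"]
    }
  ]
}
"""


def find_node(graph: dict, name: str) -> Optional[dict]:
    """Find a node by its id or its display label."""
    for node in graph.get("nodes", []):
        if name in (node.get("id"), node.get("label")):
            return node
    return None


def source_paths(node: dict) -> list[str]:
    """Collect the file paths a node points back to."""
    return [ref["path"] for ref in node.get("source_refs", [])]


graph = json.loads(GRAPH_JSON)
```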

Trace Overlay

The most powerful extension is runtime trace overlay.

The user can ask:

Show how the last request moved through the runtime.

Spectrion can map trace events onto architecture nodes:

User Message
-> AgentRuntime
-> Context Builder
-> Provider Stream
-> ToolExecutor
-> Diagram Tool
-> Artifact Store

That turns architecture from a static diagram into a visual debugger.

Instead of reading raw logs alone, the user can see the execution path on top of the system map.
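
The mapping step itself can be small. In this sketch the event `type` values, the node IDs, and the shape of the trace map are all made up for illustration; the point is that events resolve to node IDs and unmapped events are surfaced rather than silently dropped.

```python
def overlay_path(events: list[dict], trace_map: dict[str, str]) -> tuple[list[str], list[str]]:
    """Map trace events to architecture node IDs.

    Returns (path, unmapped): the ordered node path with consecutive
    repeats collapsed, and the event types that had no mapping rule.
    """
    path: list[str] = []
    unmapped: list[str] = []
    for event in events:
        node_id = trace_map.get(event["type"])
        if node_id is None:
            unmapped.append(event["type"])
        elif not path or path[-1] != node_id:
            path.append(node_id)
    return path, unmapped
```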

Why GraphViz Matters

Mermaid is a strong default for simple diagrams:

small flowcharts;
task flows;
simple sequence diagrams;
lightweight documentation.

For full architecture maps, GraphViz is usually a better default.

It handles:

larger graphs;
many edges;
clusters;
service boundaries;
database layers;
backend/frontend/data-flow maps;
architecture layouts with more stable spacing.

Mermaid remains useful, but full project architecture should prefer GraphViz unless the user explicitly asks for another engine.

This also prevents common Mermaid failures such as subgraph/node ID collisions, invalid chained edges, and layout collapse on dense architecture maps.

Source and Engine Rules

Diagram engines must stay strict:

Mermaid source goes to Mermaid.
GraphViz DOT source goes to GraphViz.
PlantUML source must be valid @startuml ... @enduml.
Structurizr source must be valid workspace { ... } DSL.
Markdown fences should be stripped before rendering.
Mermaid subgraph IDs and node IDs must not collide.

For Mermaid, a safe convention is:

sg_* for subgraph IDs
n_*  for node IDs

For GraphViz, clusters should be explicit:

subgraph cluster_backend {
  label="Backend";
  api;
  worker;
  database;
}

The renderer should reject invalid source with recoverable errors. The agent can then repair the source and render again.
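
Two of these rules are easy to check before rendering. The Python sketch below strips a surrounding Markdown fence and reports Mermaid IDs used both as a subgraph and as a node. It is deliberately simplified: it only understands plain `a --> b` edges (no edge labels) and assumes edges connect nodes, not subgraphs. The function names are illustrative.

```python
import re

TICKS = "`" * 3  # built at runtime so this example contains no literal fence
FENCE = re.compile(r"^" + TICKS + r"[\w-]*[ \t]*\n(.*?)\n" + TICKS + r"\s*$", re.DOTALL)
SUBGRAPH = re.compile(r"^\s*subgraph\s+(\w+)", re.MULTILINE)


def strip_fences(source: str) -> str:
    """Remove one surrounding Markdown code fence, if present."""
    text = source.strip()
    match = FENCE.match(text)
    return match.group(1) if match else text


def mermaid_id_collisions(source: str) -> set[str]:
    """Return IDs that appear both as a subgraph ID and as an edge endpoint."""
    subgraph_ids = set(SUBGRAPH.findall(source))
    node_ids: set[str] = set()
    for line in source.splitlines():
        if "-->" in line:
            for part in line.split("-->"):
                match = re.match(r"\s*(\w+)", part)
                if match:
                    node_ids.add(match.group(1))
    return subgraph_ids & node_ids
```

A non-empty result is the kind of recoverable error the agent can act on, for example by renaming IDs to the `sg_*` and `n_*` convention.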

Security and Privacy

Architecture diagrams can contain sensitive information:

internal service names;
database names;
file paths;
security boundaries;
provider flows;
approval logic;
runtime events;
infrastructure topology.

Spectrion uses strict defaults:

SVG is sanitized;
PNG is the safe preview format;
source is saved separately;
public renderers are avoided;
rendering goes through a trusted Spectrion backend;
raw trace payloads are redacted by default;
interactivity is implemented in the UI, not inside the SVG.

Interactive behavior should not be embedded as executable SVG logic. The safer model is:

rendered image + sidecar JSON + UI overlay

The SVG or PNG displays the structure. The sidecar files provide meaning.

How It Feels in the Product

A user can say:

Draw the full architecture of this project.

Spectrion should:

inspect the project;
identify components and flows;
build a semantic graph;
choose the best diagram engine;
render the diagram;
save source, SVG, PNG, graph metadata, and summary;
show an architecture artifact card;
open the Architecture Viewer.

The viewer can expose:

Preview | Source | Explain | Trace | History

Preview

Shows the rendered architecture.

Source

Shows the editable diagram source.

Explain

Explains selected components using graph metadata and source refs.

Trace

Overlays a selected runtime trace on top of the architecture.

History

Shows versions, summaries, and diffs.

Multi-Layer Architecture Maps

A full architecture map should provide the system overview first.

But real systems often need deeper layers:

service-level architecture;
module-level architecture;
data-flow diagrams;
database/schema diagrams;
runtime execution paths;
deployment topology;
integration maps.

Spectrion should support progressive deepening.

The first map gives the user the full system shape. Then the user can ask:

Expand the backend service.

or:

Show how data reaches the database.

or:

Show weak points in this architecture.

The agent should use the existing architecture artifact as context instead of starting from zero. This makes the map a working memory for the project.

Evidence Bundles

Architecture artifacts can be exported as evidence bundles:

architecture.svg
architecture.png
architecture.dot
architecture.graph.json
trace-map.json
summary.md
diff.json

This is useful for:

pull requests;
technical documentation;
bug reports;
audits;
onboarding;
architecture reviews;
implementation planning.

The artifact becomes proof of work, not just a chat answer.

Product Value

The real value is not that Spectrion can draw diagrams.

The value is that Spectrion can turn architecture into a living artifact.

Users can:

inspect systems visually;
understand components;
connect diagrams to code;
trace execution paths;
update architecture after code changes;
export documentation;
use the map as context for future agent work.

That changes the role of diagrams from decoration to infrastructure.

Final Principle

The core idea is:

Diagram Artifacts show structure.
Interactive Architecture Artifacts add meaning and, through trace overlays, show execution.

In other words:

Spectrion turns architecture into a living, inspectable, traceable artifact.

That is what makes diagrams in Spectrion more than visual output. They become part of the agent runtime workspace.

Building an AI Agent Runtime That Uses Codex CLI / Claude Code as Workers and Closes Tasks Only With Evidence

Denis Babkevich — Sun, 10 May 2026 22:19:14 +0000

Most AI agents treat done as a message.

In Spectrion, done is a state transition.

A task is not completed until the runtime has gone through the full path:

- selected the ready task from the plan;
- checked dependencies;
- checked policy and approvals;
- executed the work through tools, CLI workers, or subagents;
- collected evidence;
- updated state through execute_plan;
- continued to the next task or honestly stopped on a blocker.

The most important part: Codex CLI, Claude Code, or any other headless CLI can be used as workers inside the plan.

But they do not decide when the task is finished.

Spectrion verifies their output and closes the task only with evidence.

Why a normal agent loop is weak

A normal AI chat:

user request
  -> model
  -> answer

An agent with tools:

user request
  -> model
  -> tool call
  -> tool result
  -> answer

That is enough for simple tasks.

It is not enough for large engineering work.

For example:

Systematically discover and fix all remaining bugs across agent runtime,
UI/UX, tools, memory, planning system, and server.

Add comprehensive test coverage.
Maintain server backward compatibility.
Resolve planning mode confusion.

You cannot just write a plan and say “done.”

You need a runtime that holds state:

plan
tasks
dependencies
approval gates
evidence
CLI sessions
subagents
blockers
final acceptance criteria

That is what I am building in Spectrion.

The main flow: plan + CLI worker

In Spectrion, a large goal becomes an execution plan:

user goal
  -> create_plan
  -> approval / questions / assumptions
  -> execute ready task
  -> optional CLI worker: Codex CLI / Claude Code / another CLI
  -> collect output
  -> verify with code / tests / logs / artifacts
  -> attach evidence
  -> execute_plan: mark_completed / mark_failed / blocked
  -> continue next ready task
  -> finish only when plan state is terminal

The external CLI is a worker.

Spectrion is the supervisor.

The CLI can say done.

Spectrion must verify whether the task can actually be closed.

A plan is not a Markdown list

A typical “plan” often looks like this:

1. Inspect the code
2. Find bugs
3. Fix them
4. Verify

That list guarantees nothing.

In Spectrion, a plan is a runtime artifact:

objective
scope
phases
tasks
dependencies
risk level
approval gates
open questions
acceptance criteria
rollback notes
required evidence
status
progress

The main invariant:

model message does not close the task
task state closes the task
completed task requires evidence
risky task requires approval
blocked task stays blocked

The plan is not finished until every task is closed as completed or skipped, or until the runtime honestly stops on a blocker / approval / open question.
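
These invariants can be enforced in code rather than in a prompt. The following Python sketch uses hypothetical names (`Task`, `ready_tasks`, `mark_completed`, `TransitionError`); it is not the real `execute_plan` implementation, only an illustration of a state transition that refuses to close a task without approval and evidence.

```python
from dataclasses import dataclass, field


class TransitionError(Exception):
    """Raised when a task cannot legally change state."""


@dataclass
class Task:
    id: str
    depends_on: list[str] = field(default_factory=list)
    risky: bool = False
    approved: bool = False
    status: str = "pending"
    evidence: list[str] = field(default_factory=list)


def ready_tasks(tasks: dict[str, Task]) -> list[str]:
    """Pending tasks whose dependencies are all completed or skipped."""
    closed = {"completed", "skipped"}
    return [
        t.id for t in tasks.values()
        if t.status == "pending" and all(tasks[d].status in closed for d in t.depends_on)
    ]


def mark_completed(task: Task, evidence: list[str]) -> None:
    """Close a task only if approval (when risky) and evidence are present."""
    if task.risky and not task.approved:
        raise TransitionError(f"{task.id}: approval required")
    if not evidence:
        raise TransitionError(f"{task.id}: evidence required")
    task.evidence.extend(evidence)
    task.status = "completed"
```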

create_plan vs todo

todo is for short work inside one turn:

- check a file;
- edit text;
- run one test;
- remember to update README;
- keep a small checklist.

create_plan is for situations with:

- multiple phases;
- dependencies;
- approval gates;
- risky changes;
- migrations;
- deployments;
- long-running work;
- rollback;
- acceptance criteria;
- cross-turn execution.

Example:

Find all remaining bugs in runtime, UI, tools, memory,
planning, and server. Add tests. Preserve backward compatibility.

That is not a todo list.

That is an execution plan.

In one real case, the plan broke the work into 35 tasks: discovery, runtime/planning fixes, tools layer, UI/UX, server compatibility, tests, and live smoke checks.

Caption: The plan is not a Markdown list. It is a runtime artifact: objective, scope, questions, phases, tasks, progress, and approval gates.

Caption: Risky or critical-path tasks do not execute silently. They stop at an approval gate.

What counts as evidence

Evidence is not a model sentence like “I checked it.”

Evidence is a verifiable trace of execution:

- test output;
- process log;
- diff or patch;
- path to changed file;
- reproducible scenario;
- HTTP response metadata;
- screenshot;
- artifact id;
- concrete blocker reason;
- scope or approval constraint.

If there is no evidence, the task should not be closed.

Codex CLI / Claude Code as workers

Suppose the plan contains this task:

Audit tools layer for schema mismatches and timeout bugs.

Spectrion can launch Codex CLI:

codex exec "Scan Agent/Tools for schema mismatches, timeout handling gaps, and unsafe parsing. Return confirmed issues with file paths, reproduction notes, and suggested regression tests."

Then continue the same context:

codex exec resume --last "Convert the top confirmed findings into concrete patch steps and regression tests. Do not edit files yet."

Or it can use Claude Code / another CLI in headless or line-oriented mode.

But Spectrion does not trust CLI output blindly.

The pipeline looks like this:

Spectrion task
  -> launch CLI worker
  -> read output
  -> check files
  -> run tests
  -> compare with plan objective
  -> filter weak claims
  -> attach evidence
  -> mark task completed only if verified

A CLI can find a suspicious code path.

A CLI can propose a patch.

A CLI can collect logs.

But Spectrion runtime closes the task.

Why you cannot simply trust a CLI

CLI output is input, not a verdict.

It can:

- mix up a file;
- miss an edge case;
- propose a patch without a test;
- call a hypothesis a bug;
- say done even though a command was never run;
- forget backward compatibility;
- miss acceptance criteria from the plan.

Example verification:

Codex says:
  "Found a likely timeout bug in ToolExecutor."

Spectrion checks:
  - file path exists;
  - code path reachable;
  - bug reproducible;
  - patch applicable;
  - test fails before fix;
  - test passes after fix;
  - no neighboring regression.

Only then:
  execute_plan(mark_completed, evidence=...)

That is the difference between “chat with a command” and an execution runtime.
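
A sketch of that verification step in Python. The `verify_claim` helper and the check names are assumptions for illustration; each check is just a callable that returns a boolean, and a check that crashes is treated as failed.

```python
from typing import Callable


def verify_claim(claim: str, checks: dict[str, Callable[[], bool]]) -> dict:
    """Run every check; a claim is verified only if checks exist and all pass."""
    results: dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that crashes counts as failed, not as passed.
            results[name] = False
    failed = [name for name, ok in results.items() if not ok]
    return {
        "claim": claim,
        "verified": bool(results) and not failed,
        "failed_checks": failed,
    }
```

A claim with no checks is never verified, which matches the rule that a worker's "done" is input, not a verdict.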

Persistent terminal sessions

Many development tasks need a live process:

- dev server;
- test watcher;
- REPL;
- long-running CLI;
- process waiting for stdin;
- server logs;
- a process that must keep running between tool calls.

Spectrion can keep a persistent terminal session:

terminal start:
  session_id = web-dev
  command = npm run dev

terminal read:
  latest server logs

terminal send:
  r

terminal read:
  restart result

For a long-running CLI:

terminal start:
  session_id = tools-audit
  command = some-cli --headless

terminal read:
  partial output

terminal send:
  follow-up prompt

terminal read:
  final result

This turns the agent into a process operator, not just a command generator.

But the terminal is a powerful tool. It runs with the permissions of the current environment. So it needs policy boundaries: approval, command logs, workspace limits, a kill switch, and restrictions on dangerous behavior.

/afk

In real work, the user is not always sitting next to the agent.

They may start an audit and leave.

A normal chat gets stuck on the first clarification.

Spectrion has /afk: a mode where the agent can continue long-running work without constant user presence.

Inside the runtime, that means:

- do not ask non-blocking questions;
- make conservative assumptions;
- keep plan/todo state up to date;
- continue ready tasks;
- verify evidence;
- stop on real blockers;
- do not bypass approvals, credentials, payments, security, destructive boundaries;
- finalize only with outcome and evidence.

AFK does not bypass rules.

AFK should not turn a terminal into an unlimited root script.

AFK exists so that a long task does not stall on a minor, non-blocking question.
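
The AFK rules reduce to a small decision function. This is a sketch with invented names: the boundary categories and the two return values are assumptions, chosen to mirror the list above.

```python
# Categories that AFK never decides on its own (hypothetical labels).
HARD_BOUNDARIES = {"approval", "credentials", "payment", "security", "destructive"}


def afk_decision(question_kind: str, blocking: bool) -> str:
    """Decide what to do with an open question while the user is away."""
    if question_kind in HARD_BOUNDARIES:
        return "stop"  # never bypass a hard boundary
    if blocking:
        return "stop"  # a real blocker still needs the user
    return "assume_and_continue"  # make a conservative assumption and log it
```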

Subagents

Spectrion can run subagent sessions.

There are two modes:

delegate_task     -> blocking delegation
sessions_spawn    -> background session

For a large bugfix, the pattern may look like this:

subagent A -> LLM streaming
subagent B -> tools layer
subagent C -> UI regressions
subagent D -> server compatibility
main agent -> plan, dependencies, verification, final quality

Subagents speed up the work, but the parent runtime should not accept their output blindly.

It needs to verify:

- what was tested;
- which files/targets were covered;
- what evidence was attached;
- where the result is a hypothesis vs a confirmed fact;
- what limitations remain.

Responsibility for closing the plan task stays with the parent runtime.

Remote CLI

Sometimes the work should not run on the local Mac:

- long audit;
- heavy tests;
- server environment;
- 24/7 runner;
- isolated Linux environment.

For that, Spectrion has remote CLI.

Spectrion can deploy a CLI container to a Linux server, connect it through mesh, and execute commands remotely:

deploy remote CLI
  -> check status
  -> stream logs
  -> exec command
  -> collect evidence
  -> restart / stop / remove when done

To the user, it is one agent.

Physically, work may happen on iPhone, Mac, or a remote Linux runner.

The runtime keeps the shared plan, state, and evidence.

Bug Bounty Hunter mode

Bug bounty is a mode where scope, approval, and evidence matter even more.

It is not “scan any website.”

The bug bounty agent starts with an intake gate:

- program/platform;
- rules/scope URL;
- concrete target;
- what is allowed / forbidden.

If that data is missing, the agent does not run tools.

It does not guess scope and it does not start active checks.

Flow:

1. read rules;
2. lock scope;
3. passive recon;
4. attack surface map;
5. hypothesis/evidence ledger;
6. classify findings as HYPOTHESIS / INDICATION / PROVEN;
7. active validation only after approval;
8. report-ready output.

Simple rule:

no scope -> stop
unclear permission -> ask
active validation -> approval
no proof or no impact -> not a confirmed vulnerability

This is not a “hack” button. It is a controlled mode for authorized research.

UX: trust state, not words

The core UX is not a fancy button.

The core UX is trust in state.

The user should see:

Phase 2 is running.
This task depends on completed audit.
This task requires approval.
This task is blocked because evidence is missing.
This task was closed with test output and diff.

Not “the agent is thinking somewhere.”

Clear work state.

Conclusion

A good agent is not a model with a large context window and a list of tools.

A good agent is a runtime that can hold commitments.

It knows when a plan is required and when a todo list is enough.

It continues in /afk, but does not bypass approval.

It manages Codex CLI, Claude Code, terminal sessions, and subagents, but does not trust their answers blindly.

It can work in bug bounty mode, but starts with scope and rules.

Most importantly, it closes a task only when there is evidence.

The user is not asking the agent to write a beautiful status update.

The user is asking the agent to do the task.

That is why Spectrion is being built not as a chat with functions, but as an execution runtime.

Where to find it:

Site: https://spectrion.app
App Store: https://apps.apple.com/app/spectrion-agent-ai/id6759151825

I Designed the AI Agent as a Runtime from Day One, Not as a Chat with Functions

Denis Babkevich — Wed, 06 May 2026 10:16:18 +0000

Three months ago I sat down, sketched the architecture of Spectrion in my head, and started writing code.

From the outside, the first version could be described very simply:

an AI agent for iPhone that does not only answer, but can act

But internally I did not want to build "a chat with tools".

I wanted an environment where an agent could keep working beyond a single message: create a reminder, continue tomorrow, watch a page, create a workflow, raise an alert, verify its own output, hand part of the work to a Mac or CLI runner, keep task state, remember unfinished items, and refuse to execute an action if policy says no.

So the starting point was not:

build a chat
then add functions
then somehow bolt automation on top

It was closer to:

the agent is the runtime
the chat is only one interface to it

iPhone was the first user-facing shell. But the idea was broader from the first day: tools, memory, task board, background jobs, workflows, approvals, policies, watchdog, self-created tools, device mesh, and managed business mode should not be separate islands. They should live inside one agent runtime.

In this article I will walk through why I started from an execution loop rather than a chat loop, what subsystems this required, and why function calling is only a small part of a real agent system.

What I Wanted

I did not want an assistant that says:

here is how you can create a reminder

I wanted an agent that creates the reminder.

Not:

here is how you can search and compare options

But:

search, compare, save the result, and return the conclusion

Not:

I can help you make a plan

But:

keep the task, check that steps were not forgotten,
continue if the work stopped too early,
and remind me when the next step is needed

The important difference is not that the agent can call a tool.

The important difference is that a task can live longer than one message.

For example:

Watch this page.
If a new version appears, explain the changes,
create a task to update the project,
and remind me tomorrow.

For a normal AI chat, this looks like a single request.

At execution level, it is a process:

1. Understand the task
2. Check the page now
3. Store a baseline
4. Create a monitoring loop
5. Wake up on schedule
6. Compare changes
7. Run reasoning when an alert appears
8. Create a task
9. Send a notification
10. Keep the follow-up for tomorrow

That is why I did not treat Spectrion as a mobile chat with functions. I needed a runtime for agentic work.

The shortest version:

not a chat with functions,
but a runtime for tasks that last longer than one message

Why Function Calling Is Not Enough

The basic architecture of an AI chat is:

user message
  -> LLM
  -> assistant message

With tools:

user message
  -> LLM
  -> tool call
  -> tool result
  -> LLM
  -> assistant message

For simple requests this is enough.

Examples:

What is the weather tomorrow?

Create a reminder at 10:00.

But once the agent has to actually work, problems appear that cannot be reliably solved with a prompt.

A model can return a tool call as text. A provider can hang. A tool result can be too large. Context can overflow. The user can write a follow-up while the run is still executing. A scheduled task can arrive in the middle of another run. A proactive monitor can raise an alert. A subagent can finish a background task. A workflow can require approval. The model can say "done" while the todo list is still open.

I designed the system as if these cases were not exceptions, but the normal working environment of an agent.

Problem	Why a prompt is not enough	What the runtime needs
Tool hangs	The model no longer controls the process	Timeout, cancellation, retry
Tool result is too large	Context overflows	Truncation, artifacts, summaries
User writes a follow-up during a run	Chat loop does not model competing events	Queues and execution lanes
Scheduled task arrives during other work	This is not just another message	Unified event queue
Model says "done" but task is not done	Model optimizes the answer, not state	Todo/task board + watchdog
Tool is forbidden by policy	Hiding the schema is not enough	Executor-level policy gate
Workflow may be dangerous	LLM may miss edge cases	Preflight review / approvals
Memory accumulates noise	"Remember everything" is unsafe	Scope, TTL, confidence, rollback

Function calling is a way to ask the model to choose an action.

It does not answer these questions:

when should a task run again?
what happens if a tool hangs?
who checks whether the agent stopped too early?
where is task state stored?
how do you avoid showing the model a thousand tools at once?
how do you block a tool at execution level, not just in the prompt?
how does a proactive alert enter the reasoning loop?
how do iPhone, Mac, and CLI coordinate work?

The LLM does not answer those questions.

The runtime does.

Agent Runtime as an Operating Loop

In Spectrion, a normal chat turn is only one input.

Different events can enter the same runtime:

manual user message
scheduled task
proactive alert
workflow node
subagent result
nested call
channel message
heartbeat check-in

Each event type has its own settings, but the pipeline is similar:

event input
  -> classify event
  -> prepare context
  -> choose agent / model / tools
  -> inject memory, skills, project context
  -> stream LLM
  -> parse tool calls
  -> execute tools with policy, approval, timeout, progress
  -> add tool results
  -> loop again if needed
  -> post-run tail:
       watchdog
       proactive queue
       task board
       memory proposals
       workflow routing
       channel routing
       cleanup

The main difference from a normal chat is that an assistant message is not the only output.

The output can be:

created task
updated memory
scheduled workflow
tool artifact
notification
alert
subagent run
pending approval
task board update
audit event

So the agent is not "a model that answers".

It is a system that carries work.

In code, this loop is bounded and controlled.

AgentRuntime keeps ConversationRunState per conversation: streaming text, status, token budget, todo state, abort flag, and rolling summary. sendMessage runs preflight, directives, hooks, user-message persistence, active task envelope, media/link context, and only then enters the outer todo loop.

The tool loop is limited by clampedMaxToolIterations. If the model keeps calling tools for too long, the runtime stops the loop at a safety cap. Inside each iteration there are watchdog timeouts: a longer one for a normal turn and a shorter one when queued user input or tool calls already exist.

Tool calls are not simply executed one by one. The runtime groups serial-only tools by name, runs other calls through a task group in parallel, and then restores result order. That lets independent calls run faster without breaking tools that require sequential execution.
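As an illustration only, the grouping and order restoration described above can be sketched in Python with asyncio. This is not Spectrion's code: SERIAL_ONLY, run_tool, execute_batch, clamp_iterations, and the cap of 25 are hypothetical names and values chosen for the example.

```python
import asyncio

SERIAL_ONLY = {"shell"}  # hypothetical set of tools that must not overlap


async def run_tool(name: str, arg: str) -> str:
    # Stand-in for a real tool; it only echoes its input.
    await asyncio.sleep(0)
    return f"{name}:{arg}"


async def run_in_order(indexed_calls):
    # Calls in one group run strictly one after another.
    return [(i, await run_tool(name, arg)) for i, name, arg in indexed_calls]


async def execute_batch(calls):
    """Run one batch of (name, arg) calls and return results in call order."""
    serial_groups = {}
    jobs = []
    for i, (name, arg) in enumerate(calls):
        if name in SERIAL_ONLY:
            # Group serial-only tools by name so each group stays sequential.
            serial_groups.setdefault(name, []).append((i, name, arg))
        else:
            jobs.append(run_in_order([(i, name, arg)]))
    jobs.extend(run_in_order(group) for group in serial_groups.values())
    finished = await asyncio.gather(*jobs)
    # Flatten and sort by the original index to restore result order.
    pairs = sorted(pair for group in finished for pair in group)
    return [result for _, result in pairs]


def clamp_iterations(requested: int, hard_cap: int = 25) -> int:
    # Assumed cap: never fewer than 1 and never more than hard_cap iterations.
    return max(1, min(requested, hard_cap))


calls = [("web", "a"), ("shell", "ls"), ("web", "b"), ("shell", "pwd")]
results = asyncio.run(execute_batch(calls))
```

The two shell calls share one sequential job, the web calls run as independent jobs, and the final sort puts every result back at the position of its original call.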

What Lives Inside the Runtime

If you only look at the top-level pipeline, it is easy to underestimate how much infrastructure sits around it.

The runtime contains several layers that are not there for a nice demo, but for keeping long tasks from falling apart.

ContextManager and Compaction

ContextManager builds provider-visible context.

It can include:

system prompt
active skills
trusted project context
Memory V2 retrieval block
session working memory
Device Mesh context
recent messages
tool results
active task envelope

The context is not just "cut from the beginning".

There are MessageCompactor, TranscriptSummarizer, a rolling summary, a proactive compaction threshold, a rehydration budget, and a protected tail. Old tool results do not always need to be carried in full: MicroCompactor can replace old read-heavy tool results with short summaries while preserving recent results and important anchors.

There is also token-estimator calibration: the runtime compares its estimate with actual provider context usage and adjusts scale by provider/model.
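A rough sketch of what such calibration could look like, in Python for brevity. The class name, the 4-characters-per-token heuristic, and the smoothing factor are assumptions made for the example, not details from the actual runtime.

```python
class TokenCalibrator:
    """Keeps a per-(provider, model) scale for a rough token estimate."""

    def __init__(self, smoothing: float = 0.5):
        self.smoothing = smoothing      # assumed: how fast the scale adapts
        self.scales = {}                # (provider, model) -> scale factor

    @staticmethod
    def raw_estimate(text: str) -> float:
        return len(text) / 4            # crude heuristic, about 4 chars per token

    def estimate(self, key, text: str) -> int:
        return round(self.raw_estimate(text) * self.scales.get(key, 1.0))

    def observe(self, key, text: str, actual_tokens: int) -> None:
        """Adjust the scale after the provider reports real usage."""
        raw = self.raw_estimate(text)
        if raw == 0:
            return
        old = self.scales.get(key, 1.0)
        target = actual_tokens / raw
        # Move part of the way toward the observed ratio.
        self.scales[key] = old + self.smoothing * (target - old)
```

Keeping the scale per provider and model matters because different tokenizers can diverge from the same character-based guess in different directions.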

The goal is:

preserve the meaning of old work
  -> avoid context-window overflow
  -> keep the active objective visible
  -> avoid asking the user to repeat the task

Active Task Envelope

Long tasks need more than a todo list.

There is an ActiveTaskEnvelope: a provider-visible anchor for the active task.

It stores:

original visible user request
latest visible user update
current objective
constraints
exact final markers
pending todos
artifact promises
tool progress
lifecycle
awaiting user question

Lifecycle can be:

active
background
subagent
scheduled
proactive
waiting_for_user
completed
cancelled

If context is compressed, if a queued follow-up arrives, if a task goes into background mode or a subagent, the runtime can recover the active objective from the envelope instead of guessing.

Execution Lanes

Another boring but important thing: lanes.

The code has ExecutionLaneManager with lane types:

main
cron
subagent
nested

Each lane has its own concurrency limit. For example, scheduled work should not consume every execution slot, and subagents can work in parallel but not without limit.

Lane tokens have a generation. If a lane is cleared or cancelled, old tokens become stale and cannot accidentally release capacity for a newer generation of tasks.
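The generation idea can be shown with a small Python sketch. Lane, LaneToken, and the method names are hypothetical; the real ExecutionLaneManager is not shown in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LaneToken:
    lane: str
    generation: int


class Lane:
    def __init__(self, name: str, limit: int):
        self.name = name
        self.limit = limit              # concurrency limit for this lane
        self.in_use = 0
        self.generation = 0

    def acquire(self):
        """Return a token, or None when the lane is at its limit."""
        if self.in_use >= self.limit:
            return None
        self.in_use += 1
        return LaneToken(self.name, self.generation)

    def release(self, token: LaneToken) -> bool:
        if token.lane != self.name or token.generation != self.generation:
            return False                # stale token: it must not free capacity
        self.in_use = max(0, self.in_use - 1)
        return True

    def clear(self) -> None:
        """Cancel everything in the lane and invalidate old tokens."""
        self.generation += 1
        self.in_use = 0
```

Without the generation check, a task cancelled before clear() could call release() later and hand its slot to the wrong batch of work.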

This is not a flashy product feature, but without details like this, background execution turns into races very quickly.

Queued Follow-Ups

A user can write a new message while the agent is still working.

That should not corrupt the current run.

Spectrion has queued follow-ups: the runtime stores the follow-up message, snapshot, delivery order, and can cancel, reorder, or edit queued messages. After the current run finishes, the next input is delivered cleanly.

So a follow-up during execution is not "the user interrupted the stream". It is a separate queued event.

Approvals

Approvals are not just a UI button.

ApprovalManager stores pending requests, keeps the last 100 approval records, waits up to five minutes, and supports persisted auto-approve rules:

always
same arguments
until date
count N times

Together with Business policy this gives two levels:

policy says whether the action is allowed at all
approval says whether it may happen now
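A minimal sketch of how the two levels and the four rule kinds could fit together, written in Python as an illustration. AutoApproveRule, decide, and the string values are hypothetical names, not the ApprovalManager API.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class AutoApproveRule:
    tool: str
    kind: str                           # "always", "same_arguments", "until_date", "count"
    arguments: Optional[dict] = None    # used by "same_arguments"
    until: Optional[datetime] = None    # used by "until_date"
    remaining: int = 0                  # used by "count"

    def matches(self, tool: str, arguments: dict, now: datetime) -> bool:
        if tool != self.tool:
            return False
        if self.kind == "always":
            return True
        if self.kind == "same_arguments":
            return arguments == self.arguments
        if self.kind == "until_date":
            return self.until is not None and now <= self.until
        if self.kind == "count" and self.remaining > 0:
            self.remaining -= 1         # consume one of the N allowed uses
            return True
        return False


def decide(policy_allows: bool, rules, tool: str, arguments: dict, now: datetime) -> str:
    if not policy_allows:
        return "denied"                 # policy is checked before any approval rule
    if any(rule.matches(tool, arguments, now) for rule in rules):
        return "auto_approved"
    return "ask_user"
```

The order is the point: an auto-approve rule can skip the prompt, but it can never override a policy denial.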

Skills, Project Context, Hooks, and Directives

The runtime can inject skills without stuffing one huge prompt into the model. Enabled skills are listed in lightweight form, while full instructions are injected only for activated skills and within budget.

Project context is not blindly read from the folder. There is trust state, skipped sources, parse errors, project skills, project agents, and name conflicts. Untrusted project context should not become runtime instruction.

There are hooks:

agentBootstrap
agentBeforeRun
agentAfterRun
sessionCreate
sessionDelete
toolBeforeExecute
toolAfterExecute
memoryFlush
messageReceived
messageSent
preCompact
postCompact

This should be described carefully: plugin hook APIs and registries exist, but not every hook is deeply wired through every LLM/tool loop path yet. It is an extension surface, not a claim that the whole runtime is already middleware-driven end to end.

Directives give fast runtime control from chat:

/model
/think
/elevated
/verbose
/reset
/status
/compact
/skill
/agent

Together, this turns Spectrion from a single LLM call into a controllable execution environment.

Provider Layer, Capability Routing, and Diagnostics

Another layer that is almost invisible from the outside: the runtime should not be tied to one model or one API format.

There is a provider layer for Spectrion Pro proxy, direct Anthropic/OpenAI providers, Ollama/local models, custom OpenAI-compatible endpoint, custom Anthropic-format endpoint, and Apple Foundation/on-device provider.

This is not only about picking "the smartest model".

Different providers handle these things differently:

tool schemas
streaming
vision
max context
reasoning/thinking settings
prompt caching
multimodal content
fallback

So ToolDefinition can format schemas for OpenAI, Anthropic, and Ollama. OpenAI strict schema is enabled only when it is actually safe: when all properties are required. Otherwise, making a schema strict can make the tool harder for the model to call or break the call entirely.
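The "strict only when safe" condition can be sketched like this in Python. The function names are hypothetical, the output follows the function-tool shape of the OpenAI Chat Completions API, and the check covers only the top-level condition named above (nested objects and other strict-mode requirements are out of scope for the example).

```python
def all_properties_required(schema: dict) -> bool:
    """True when every top-level property is listed in "required"."""
    properties = schema.get("properties", {})
    return set(properties) <= set(schema.get("required", []))


def to_openai_tool(name: str, description: str, schema: dict) -> dict:
    tool = {"name": name, "description": description, "parameters": schema}
    if all_properties_required(schema):
        tool["strict"] = True           # opt in only when it cannot break calls
    return {"type": "function", "function": tool}


reminder_schema = {
    "type": "object",
    "properties": {"title": {"type": "string"}, "due": {"type": "string"}},
    "required": ["title"],              # "due" is optional, so strict stays off
}
reminder_tool = to_openai_tool("create_reminder", "Create a reminder", reminder_schema)
```

Here create_reminder keeps an optional field, so the sketch leaves strict mode off rather than forcing the model to invent a value for it.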

ProviderRequestBuilder does boring but important work: it cleans system/history messages, prevents the active task envelope from being duplicated, preserves it even in emergency-compaction mode, and builds a provider-visible request where the model can see the correct task state.

There is also ProviderVisibleContextDiagnostics. It breaks visible context into categories:

system
skills
project context
memory
tools
MCP
history
images
rehydration
active envelope
free space
compact threshold

This helps avoid guessing why the model did not see an important instruction. You can inspect what actually went to the provider.

There is also exact request diagnostics: before OpenAI/Anthropic/Proxy HTTP calls, the provider saves a fingerprint, redacted request-body preview, token estimates, tool/image counts, and flags such as context-management. This is not a permanent full raw-request log, but it is much better for debugging provider-visible behavior than looking only at the chat transcript.

The server layer is not just a thin proxy for chat completions either. It has endpoints for completions, media analysis, embeddings, rerank, media encoding, image edit, STT, TTS, voice cloning, video generation, watchdog, and steward. On top of that sit capability resolver, tier/account/provider fallback, retries, SSE sanitization, usage logging, and business/subscription accounting.

The idea is that the model should be a replaceable part of the system. The agentic loop should live above a specific provider API.

ToolCatalog and ToolExecutor

Tools are the obvious part of an agent system. But early on I separated two different problems.

First:

which tools should the model see?

Second:

how do we safely execute the selected tool?

The first is handled by ToolCatalog.

The second is handled by ToolExecutor.

Why You Cannot Put All Tools Into the Prompt

When there are only a few tools, you can pass all schemas to the model.

But if there are tens, hundreds, or thousands of tools, this starts hurting quality.

Context grows. The model chooses worse. Similar tools compete. Tool descriptions take space that should belong to the task, memory, and working state.

So tools are split into layers.

There is a built-in core: native/system tools for web, files, device, calendar, reminders, notes, notifications, memory, knowledge base, workflows, subagents, Shortcuts, Health, maps, weather, PDF, Office, ZIP, cloud files, and other surfaces.

On macOS there are desktop capabilities on top: shell, filesystem, Git, AppleScript, screen capture, browser automation, Docker sandbox, process control, patching, and other tools that are either impossible on a phone or belong in a different execution environment.

There is also a Shop / external capability layer: today it has more than 1495 tools on top of the built-in core. But this does not mean 1495 schemas are inserted into the prompt at once.

The important part is activation.

The principle is:

the model should not see every available tool,
it should see the relevant tools for the current step

Activation considers:

keywords
categories
explicit activation
user language
project context
policy
negations

For example, if the user says:

do not use the web, check only local files

then web tools should not activate just because the query looks like research.

That is a small detail, but it matters for trust.

The Executor Matters More Than the Catalog

The catalog decides what to show the model.

Real guarantees live in the executor.

ToolExecutor checks:

whether the tool is allowed by policy
whether approval is required
whether arguments are too large
whether result is too large
whether the task has been cancelled
whether timeout expired
whether the tool can run in this environment
whether an audit event should be recorded

The key principle:

policy does not only hide the tool from the LLM,
policy blocks execution in ToolExecutor

If a tool is forbidden, removing its schema from the prompt is not enough. The model may try to call it from older context, a nested call, a workflow, or a malformed tool call.

The boundary must live at execution level.

Otherwise it is not a security boundary. It is decoration.
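To make the difference concrete, here is a small Python sketch. GatedExecutor, PolicyDenied, and schemas_for_prompt are hypothetical names, and the real ToolExecutor performs many more checks than this.

```python
class PolicyDenied(Exception):
    """Raised when policy forbids a tool, no matter who asked for it."""


class GatedExecutor:
    def __init__(self, tools: dict, denied: set):
        self.tools = tools              # name -> callable
        self.denied = denied            # names forbidden by policy
        self.audit = []                 # (tool name, outcome) records

    def execute(self, name: str, **arguments):
        # The check lives here, so it applies to every caller:
        # chat turn, workflow node, nested call, or malformed tool call.
        if name in self.denied:
            self.audit.append((name, "denied"))
            raise PolicyDenied(name)
        if name not in self.tools:
            self.audit.append((name, "unknown"))
            raise KeyError(name)
        result = self.tools[name](**arguments)
        self.audit.append((name, "ok"))
        return result


def schemas_for_prompt(executor: GatedExecutor) -> list:
    # Hiding is a convenience for the model, not the security boundary.
    return sorted(n for n in executor.tools if n not in executor.denied)
```

Even if the model produces a call to a hidden tool from stale context, execute() rejects it and leaves an audit record.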

Artifacts and Native UI

A tool result in Spectrion is not only a string.

A tool can return:

text
summary
image
file
artifact reference
structured payload

There is an artifact contract. It gives stable artifact IDs, attachment references, edit references, storage scope, and validators that prevent the model from accidentally answering with a raw absolute path like /Users/.../file.png.

This sounds small until the agent starts working with images, PDFs, Office files, downloaded archives, generated media, and long-running tool results.

If an artifact is already attached to the message, the model gets guidance not to attach it again. If a file or image needs editing, the model should use an artifact reference rather than inventing a path.

There is also another layer: render_ui.

The agent can return not only text, but a JSON spec for native UI. The A2UI parser supports layout/content/input/data/media/composite components:

vstack, hstack, zstack, scroll, grid
text, image, icon, divider, spacer
button, textfield, toggle, slider, picker, stepper
list, table, chart
map, webview
card, alert, sheet, form, progress

There are limits on depth and node count, so the model cannot generate an infinite UI tree.
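A depth and node-count guard of this kind could look roughly like the Python sketch below. The "children" key, the function name, and the limits of 8 and 200 are assumptions for the example, not the A2UI parser's real values.

```python
def count_ui_nodes(spec: dict, max_depth: int = 8, max_nodes: int = 200) -> int:
    """Walk a parsed UI spec and reject trees that are too deep or too large."""
    count = 0
    stack = [(spec, 1)]                 # (node, depth); the root has depth 1
    while stack:
        node, depth = stack.pop()
        if depth > max_depth:
            raise ValueError(f"UI tree deeper than {max_depth}")
        count += 1
        if count > max_nodes:
            raise ValueError(f"UI tree has more than {max_nodes} nodes")
        for child in node.get("children", []):
            stack.append((child, depth + 1))
    return count


approval_card = {
    "type": "card",
    "children": [
        {"type": "text"},
        {"type": "button"},
    ],
}
node_count = count_ui_nodes(approval_card)
```

The walk uses an explicit stack instead of recursion, so a very deep spec fails with a clear error rather than exhausting the interpreter's call stack.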

The practical value: the agent can return a small working interface inside the answer. For example, an approval form, comparison table, task dashboard, workflow card, or research result panel.

This is another reason Spectrion is not limited to an assistant message. Output can be UI state, artifact, or action.

Project Workspace, Mutation Journal, and Rollback

Code/workspace tasks have their own project layer.

Workspace does not mean "the agent can read and write anywhere". The project model has capabilities:

browse files
edit files
view git changes
search files
load project context
load project skills
load project agents

There is a path policy that should keep the agent inside the selected workspace root. File tree service handles hidden files, binary files, symlinks, unreadable entries, max depth, max entries, max editable bytes, and .gitignore.

Project context is loaded separately from ordinary memory. It has trust state, skipped sources, parse errors, project skills, project agents, and conflict detection. If project context is not trusted, it should not become runtime instruction.

For mutations there is WorkspaceMutationJournal and WorkspaceChangeTracker. They record file changes with before/after hashes, snapshots, and rollback eligibility. If a change can be rolled back, the runtime knows where the snapshot is. If not, that is explicit too.

This matters for macOS/CLI scenarios. When the agent edits a project, "I changed files" is not enough. You need a trace: what changed, how it is evidenced, and whether it can be rolled back.
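As a sketch of that trace, the Python example below records before/after hashes and an optional snapshot for each write. It uses an in-memory dict in place of a real filesystem, and Journal, Mutation, and max_snapshot_bytes are hypothetical names, not the WorkspaceMutationJournal API.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


@dataclass
class Mutation:
    path: str
    before_hash: Optional[str]          # None when the file did not exist
    after_hash: str
    snapshot: Optional[bytes]           # previous content, if it was kept

    @property
    def can_roll_back(self) -> bool:
        return self.snapshot is not None


class Journal:
    def __init__(self, max_snapshot_bytes: int = 1_000_000):
        self.max_snapshot_bytes = max_snapshot_bytes
        self.entries = []

    def record_write(self, files: dict, path: str, new: bytes) -> Mutation:
        old = files.get(path)
        keep = old is not None and len(old) <= self.max_snapshot_bytes
        entry = Mutation(path, digest(old) if old is not None else None,
                         digest(new), old if keep else None)
        files[path] = new
        self.entries.append(entry)
        return entry

    def roll_back(self, files: dict, entry: Mutation) -> bool:
        if not entry.can_roll_back:
            return False                # explicit: nothing to restore from
        if digest(files.get(entry.path, b"")) != entry.after_hash:
            return False                # file changed again since this write
        files[entry.path] = entry.snapshot
        return True
```

In this simplified version a newly created file has no snapshot, so it is marked as not eligible for rollback instead of being silently deleted.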

Tasks the Agent Should Not Drop

One of the main problems with AI agents: the model can end with a nice answer even though the task is not done.

The user says:

Prepare the feature launch.
Check the copy, gather bugs,
prepare the changelog, write the post,
make the release checklist,
and do not stop until every item is closed.

A normal assistant often turns this into a good list.

But the user did not ask for a list. They asked the agent to carry the work.

That requires task state.

Simplified:

TodoManager
  -> pending
  -> in_progress
  -> completed

And a more project-level layer:

TaskStore
  -> status
  -> priority
  -> dependencies
  -> blockers
  -> assigned agent
  -> claim session
  -> claim expiration

After the agent loop, the runtime checks:

are there pending items?
is there in_progress work without result?
were there tool failures?
are there blockers?
did the agent say "done" too early?

If the task is not closed, the runtime can continue instead of stopping.
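The post-run check can be sketched as a small pure function. This Python version is illustrative only: Todo, reasons_to_continue, post_run_verdict, and the verdict strings are hypothetical, not the runtime's actual types.

```python
from dataclasses import dataclass


@dataclass
class Todo:
    title: str
    status: str                         # "pending", "in_progress", or "completed"
    has_result: bool = False


def reasons_to_continue(todos, tool_failures=0, blockers=()):
    """Collect reasons why a run should not be treated as finished."""
    reasons = []
    for todo in todos:
        if todo.status == "pending":
            reasons.append(f"pending: {todo.title}")
        elif todo.status == "in_progress" and not todo.has_result:
            reasons.append(f"no result yet: {todo.title}")
    if tool_failures:
        reasons.append(f"{tool_failures} tool failure(s)")
    reasons.extend(f"blocker: {b}" for b in blockers)
    return reasons


def post_run_verdict(model_said_done: bool, todos, tool_failures=0, blockers=()):
    reasons = reasons_to_continue(todos, tool_failures, blockers)
    if blockers:
        return "blocked", reasons       # needs outside input, so do not loop
    if reasons:
        return "continue", reasons      # a "done" message does not close the task
    if model_said_done:
        return "finish", []
    return "continue", ["no final answer"]
```

The model's "done" is only one input to the verdict; open todos and failures outrank it.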

My principle:

the agent should not stop only because
the model produced a nice final message

This changes UX.

The user should feel not "I received text", but "the task is being carried".

Watchdog and Steward

The task board stores state, but there also needs to be a checking layer.

Spectrion has ChatWatchdog and AgentSteward.

When the agent stops, Watchdog asks:

is it actually okay to finish this run?

Example:

User:
  Check 5 sources and compare options.

Agent:
  Checked 3 sources and wrote "done".

Watchdog:
  User requirement was 5 sources.
  Evidence exists only for 3.
  Completion rejected.

Runtime:
  Continue with the remaining sources.

Steward is a more general verification layer.

It can run in modes:

completionJudge
toolResultVerify
taskBoardGroom
subagentSupervise
workflowPreflight
workflowPostflight

It checks:

whether the task is actually complete
whether a tool result looks incomplete
whether a workflow is safe before launch
whether approval is required
whether a subagent result is valid
whether the task board should be updated

This is not a magical quality guarantee.

But architecturally it is better than hoping the main agent always realizes what it forgot.

The core idea:

the executor should not be the only judge of its own work

How deeply can the steward challenge the main model?

Not infinitely and not without rules.

It can return verdicts such as:

no_action
continue_now
create_tasks
blocked

For completionJudge, that means: if the main agent says "done", but evidence does not match the original task, the steward can reject completion and ask the runtime to continue.

For workflow/subagent work, it can suggest follow-up tasks, flag a problem, block a dangerous result, or send work back for revision.

But there are limits around this.

ChatWatchdog makes at most two nudges per user request. In steward mode it waits two seconds after the turn stops, and idle checks run after five minutes. Watchdog context is limited to recent messages, the steward request uses a low budget with maxTokens: 10000, and the network timeout is 30 seconds.

AgentSteward also has client-side budget gates:

do not run in Low Power Mode
do not run on bad network
do not run when runtime is busy
do not run when queued user input exists
do not repeat an already applied idempotency key
cool down after rejected actions
limit calls per hour / day

If steward is unavailable, steward mode does not automatically fall back to an older, less governed judge. Skipping the check is safer than running a different uncontrolled loop.

So steward is not "a second model arguing forever with the first".

It is a policy-gated review layer with idempotency, budgets, cooldowns, and strict continuation limits.
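A compact way to picture these gates is the Python sketch below. StewardGate, NudgeCounter, the reason strings, and the hourly limit of 6 are assumptions for illustration; only the cap of two nudges per request comes from the text above.

```python
from dataclasses import dataclass, field


@dataclass
class StewardGate:
    max_calls_per_hour: int = 6         # assumed limit
    calls_this_hour: int = 0
    applied_keys: set = field(default_factory=set)

    def skip_reason(self, key, *, low_power=False, bad_network=False,
                    runtime_busy=False, queued_input=False):
        """Return why the steward should not run, or None if it may run."""
        if low_power:
            return "low_power_mode"
        if bad_network:
            return "bad_network"
        if runtime_busy:
            return "runtime_busy"
        if queued_input:
            return "queued_user_input"
        if key in self.applied_keys:
            return "already_applied"    # idempotency: never apply a key twice
        if self.calls_this_hour >= self.max_calls_per_hour:
            return "hourly_limit"
        return None

    def record_run(self, key) -> None:
        self.applied_keys.add(key)
        self.calls_this_hour += 1


class NudgeCounter:
    """Caps watchdog nudges per user request."""

    def __init__(self, limit: int = 2):
        self.limit = limit
        self.used = {}                  # request id -> nudges already sent

    def try_nudge(self, request_id: str) -> bool:
        if self.used.get(request_id, 0) >= self.limit:
            return False
        self.used[request_id] = self.used.get(request_id, 0) + 1
        return True
```

Every gate returns a named reason, which makes a skipped review visible in logs instead of looking like a silent failure.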

Proactive Tools: The Agent Can Raise Its Hand

Normal tools are called when the model decides to call them.

I wanted the agent to be able to observe.

That is how proactive tools appeared.

A proactive tool is a scripted tool running in a background polling loop. It can wake up on a schedule and check a page, API, file, metric, or some other source.

If nothing happened, it returns:

null

If something happened, it returns an alert:

{
  "type": "price_drop",
  "oldPrice": 349,
  "newPrice": 289,
  "url": "..."
}

The alert enters ProactiveExecutionQueue, and the runtime handles it as a new input.

The path:

JS monitor wakes up
  -> checks source
  -> returns alert
  -> ProactiveExecutionQueue
  -> AgentRuntime handles proactive alert
  -> agent reasons about it
  -> user gets result / task / notification

Example:

Watch the competitor changelog.
If a new release or pricing change appears,
write a short competitive note and create a task.

A normal chat does not wake itself up.

An agent runtime can.

In code, proactive scripted tools have their own polling manager: max concurrent checks, persisted running flags, auto-reconnect, structured JSON alerts, alert routing, and auto-stop after repeated errors.
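The polling behaviour, including auto-stop after repeated errors, can be sketched in Python. Real proactive tools are scripted tools in a sandbox; ProactiveMonitor, make_price_check, the example URL, and the error limit of 3 are hypothetical and serve only to show the control flow.

```python
class ProactiveMonitor:
    def __init__(self, check, max_errors: int = 3):
        self.check = check              # returns an alert dict or None
        self.max_errors = max_errors    # assumed limit of consecutive failures
        self.errors = 0
        self.running = True

    def tick(self, queue: list) -> None:
        """One scheduled wake-up: run the check and route any alert."""
        if not self.running:
            return
        try:
            alert = self.check()
        except Exception:
            self.errors += 1
            if self.errors >= self.max_errors:
                self.running = False    # auto-stop after repeated errors
            return
        self.errors = 0                 # a successful check resets the streak
        if alert is not None:
            queue.append(alert)         # stands in for ProactiveExecutionQueue


def make_price_check(read_price, url: str):
    last = None

    def check():
        nonlocal last
        price = read_price()
        previous, last = last, price
        if previous is not None and price < previous:
            return {"type": "price_drop", "oldPrice": previous,
                    "newPrice": price, "url": url}
        return None                     # nothing happened

    return check
```

The first tick only stores a baseline; an alert is produced later, when a reading is lower than the previous one.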

Heartbeat: The Agent Exists Outside the Chat

Proactive tools are observers.

There also needs to be a clock.

That is HeartbeatManager.

It periodically performs service work:

scheduled tasks
scheduled workflows
maintenance
morning briefing
periodic check-ins
Evolution cycle
Mesh leadership

Periodic check-in looks like this:

runtime:
  check whether anything needs attention

agent:
  HEARTBEAT_OK

If nothing is happening, the user sees nothing.

If there are blockers, alerts, follow-ups, or claimable tasks, the runtime can continue work or send a notification.

This matters because an agent should not exist only at the moment the user writes a message.

Some tasks need to live for hours, days, or weeks.

Channels and Native Surfaces

The in-app chat is only one entry point.

There is a ChannelManager and channel implementations:

Telegram bot
Telegram user channel
Slack
Discord
WhatsApp
Email

ChannelType also includes SMS and custom channels, but those are reserved types. In the current ChannelManager.connect, ready implementations exist for Telegram, Telegram User, Slack, Discord, WhatsApp, and Email; SMS/custom should be treated as unsupported until an adapter is added.

A channel has config, credentials, enabled/autoConnect flags, and status. Configs are stored locally and synced through Mesh. Incoming messages arrive as ChannelMessage, and responses go back through ChannelManager.sendResponse with retry.

Heartbeat checks channel connections separately, and auto-connect is performed only by the Mesh execution leader. This matters so two devices do not both reply in the same Telegram/Slack thread.

From the outside:

Telegram / Slack / WhatsApp / Email
  -> ChannelManager
  -> AgentRuntime
  -> same tools / memory / policy / approvals
  -> response back to channel

There are also native Apple ecosystem surfaces:

widgets
Live Activities
Share Extension
App Intents / App Shortcuts
Watch companion
macOS menu bar
macOS Services
voice overlay
wake word daemon
camera stream

There are important boundaries here too. WidgetKit, App Intents, WatchConnectivity, Share Extension, and Spotlight are wired as native surfaces. Live Activity UI exists, but starting a Live Activity is intentionally disabled in the manager right now, so the honest phrasing is "the surface is prepared", not "it always runs in production". Wake word is VAD + Apple Speech recognition for a phrase, not a separate embedded hotword model.

These are not separate little assistants. They should lead the user into the same runtime.

Examples:

selected text in any macOS app
  -> Services -> Ask Spectrion
  -> runtime receives the task

said wake phrase
  -> voice command captured
  -> runtime handles the task
  -> TTS responds
  -> wake-word listens again

sent a file through Share Extension
  -> attachment enters the app
  -> agent can analyze / save / create task

For me this is an important part of the idea: if the agent runtime is real, it can have many inputs and outputs, but state and rules must remain shared.

Server Layer

Part of Spectrion lives outside the app.

The server has routes for:

auth
subscription
chat proxy
webhooks
admin
OAuth
web
OpenAI plan proxy
CLI deploy
plugins
channels
Telegram user
community store
mesh
business
compat

/v1/chat is not just "forward request to LLM". It has subscription/business usage checks, multi-provider routing, retries, SSE streaming, active stream limits, provider fallback paths, account pool, capability resolver, and separate endpoints for capabilities:

chat completions
media analysis
embeddings
rerank
media embeddings
image edit / generation
speech-to-text
text-to-speech
voice clone / delete
watchdog / steward
video generation

/v1/channels stores channel registrations, encrypts credentials with AES-256-GCM, supports server-side polling for Telegram/email, and exposes pending messages to devices.

/v1/mesh handles pairing, device registration, device list, polling fallback, and ack for relay messages.

There is also a community store for skills/tools/MCP, plugin routes, CLI deployment, Telegram userbot, desktop releases, app version/config, and a large Business API.

This matters because Spectrion is not only a local app. The native runtime brings execution closer to the user and device, while the server layer handles provider routing, subscriptions, sync, channels, Mesh relay, store, business control plane, and capability endpoints.

Workflows Inside the Agent Runtime

One tool call is enough for simple actions.

But real tasks often become graphs.

Example:

Every Monday morning, check three sources.
Collect a summary.
If there are important changes, create a task.
If risk is high, notify me.

That is a workflow.

The user-facing surface for this in Spectrion is the manage_workflows tool.

Node types include:

trigger
action
condition
delay
transform
llm
http
script
loop
parallel
notify
end

The important part: workflow should not be a separate automation island.

If a workflow calls a tool, it should go through the same ToolExecutor as the normal agent.

That means the same:

approvals
policies
timeouts
audit
result limits
Business gates

Simplified:

WorkflowEngine
  -> node execution
  -> ToolExecutor
  -> policy / approval / timeout / audit
  -> result
  -> next node

This lets workflows, proactive monitors, and chat agent live inside one runtime.

There is a parallel node inside workflow graphs, but the WorkflowEngine itself is guarded as a single-run engine. So I would not describe it as "unlimited parallel workflow executions". The honest statement: a graph can have parallel branches within a run, while system-level concurrency is handled through heartbeat, proactive queue, and lanes.

The user can write an ordinary sentence:

Watch this API.
If the status changes, check details,
write a short explanation,
and create a task if a reaction is needed.

The agent can:

search existing tools
  -> no matching tool
  -> create scripted monitor
  -> test it
  -> register it
  -> build workflow
  -> schedule it
  -> notify only when something changes

This is not "AI wrote instructions for configuring automation".

This is the agent building automation from the conversation.

Self-Extension: The Agent Can Create Tools

I wanted the agent not to be limited to the tools that shipped with the app.

If the required tool does not exist, the agent can create it.

There is a tool for that: create_tool.

But the boundary matters.

"The agent writes tools for itself" sounds good, but without restrictions it is dangerous.

So self-extension is not just "save JS into a file".

The process looks like:

agent writes tool
  -> checks similar tools in catalog / Shop
  -> reads sandbox API reference
  -> validates syntax
  -> runs sandbox/security audit
  -> checks built-in name collisions
  -> persists ScriptedToolDefinition
  -> registers in ToolCatalog
  -> activates tool
  -> runs auto-test with provided args
  -> keeps version history / rollback point

Restrictions:

sandbox
versioning
secret fields
rollback
test runs
policy gates
approval requirements

On iOS, dynamic tools run through a JS sandbox. There is a limited API for HTTP, persistent KV, crypto, sandboxed FS, DB, HTML/image helpers, and other things needed for integrations.

On macOS the surface is broader: scripted tools can be JavaScript, Python, Shell, Ruby, Node.js, Go, Rust, Swift, and Perl. If a runtime is not available locally, Docker fallback is possible.

Proactive mode is also important: a tool can get an interval and instructions, wake up in the background, and return an alert only when something important happens.

create_tool is not only create. In code it has actions for edit, list, delete, test, templates, api_reference, history, rollback, start/stop/status proactive tools. So self-extension is a lifecycle, not one-shot file generation.

The broader the surface, the more important policy becomes.

Main principle:

the agent can expand capabilities,
but it must not grant itself new permissions

If a tool needs a new secret, the user must fill it explicitly.

If a tool performs a mutating action, approval may be required.

If a tool is forbidden by policy, the executor must reject it even if the model created it.

Personal Tools and Community Store

Self-extension is not limited to local experiments, but it does not publish anything automatically.

When the agent creates a tool through create_tool, it is a local/project/user capability. It can be used immediately in the current runtime, workflow, or proactive loop, but publishing it externally should be a separate deliberate action.

There is a Community Store / Shop layer.

The server has routes for:

community skills
community tools
community MCP servers
search
install
reviews
downloads
my published items
pending / approved / rejected moderation states

The model:

agent creates personal/project tool
  -> user tests it locally
  -> tool can be reused in workflows
  -> tool can be published to community store separately
  -> other users install approved tools from Shop

This boundary matters. The agent can quickly build a missing tool for itself, but community distribution should not happen without a review/publish flow.

Memory V2: Scoped Memory, Retrieval, and Rollback

It is easy to imagine agent memory as one large memory.md.

For a strong agent, that is a poor model.

In Spectrion, memory now consists of several layers:

legacy markdown memory
structured Memory V2 records
MemoryProposal queue
snapshots / rollback
semantic memory
SQLite vector store
conversation recall FTS
session working memory
project context bridge
memory policy

This is not one folder of notes. It is a runtime subsystem.

Structured Memory

MemoryRecord has scope:

user
agent
conversation
project
global

And type:

preference
fact
instruction
decision
projectRule
workflow
correction
summary

Plus metadata:

source
confidence
sensitivity
expiresAt
status
provenance
linkedVectorChunkIds
createdAt
updatedAt

Source is also typed:

manual
automaticExtraction
conversationSummary
document
skill
projectRule
migration
tool

Sensitivity:

publicFact
personal
privateNote
secret

Status:

active
archived
rejected

Why does this matter?

Because "remember this" can mean different things.

Remember that I dislike long answers.

That is a user preference.

Remember that this project cannot change the API without approval.

That is a project rule.

Remember until the end of the release that we use feature flag X.

That is a temporary rule with a TTL.

Remember this only for the current conversation.

That is conversation scope.

A flat memory cannot reliably distinguish these cases.
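To make the distinction concrete, here is a minimal Python sketch of a scoped record. It is an illustration, not Spectrion's actual type: the enum values mirror the lists above, but the class layout, the `record_type` field name, the `is_usable` helper, and the sample dates are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class Scope(Enum):
    USER = "user"
    AGENT = "agent"
    CONVERSATION = "conversation"
    PROJECT = "project"
    GLOBAL = "global"


class RecordType(Enum):
    PREFERENCE = "preference"
    FACT = "fact"
    INSTRUCTION = "instruction"
    DECISION = "decision"
    PROJECT_RULE = "projectRule"
    WORKFLOW = "workflow"
    CORRECTION = "correction"
    SUMMARY = "summary"


@dataclass
class MemoryRecord:
    text: str
    scope: Scope
    record_type: RecordType
    confidence: float = 1.0
    expires_at: Optional[datetime] = None  # None means "no TTL"
    status: str = "active"                 # active | archived | rejected

    def is_usable(self, now: datetime) -> bool:
        """Hypothetical helper: usable only while active and before the TTL."""
        if self.status != "active":
            return False
        return self.expires_at is None or now < self.expires_at


# The four "remember this" requests from the text, as differently scoped records.
release_end = datetime(2026, 6, 1, tzinfo=timezone.utc)  # assumed release date
examples = [
    MemoryRecord("Dislikes long answers", Scope.USER, RecordType.PREFERENCE),
    MemoryRecord("API changes need approval", Scope.PROJECT, RecordType.PROJECT_RULE),
    MemoryRecord("Use feature flag X", Scope.PROJECT, RecordType.INSTRUCTION,
                 expires_at=release_end),
    MemoryRecord("Only for this chat", Scope.CONVERSATION, RecordType.FACT),
]
```

With this shape, the temporary rule drops out on its own once the release date passes, while the other three records stay usable.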

Memory Proposals

Not every memory change should immediately become fact.

Changes go through a MemoryProposal, which supports these operations:

create
update
archive
delete

And statuses:

pending
approved
rejected

This matters for automatic extraction and business scenarios: the agent can propose a record, but the runtime or user decides whether to apply it.

It is important not to overclaim here. The MemoryProposal pipeline exists, but that does not mean every model response automatically becomes a neat V2 proposal. A visible part of the current proposal flow is tied to legacy MEMORY.md migration and sensitive candidates. A manual memory.save performs a dual write: legacy markdown, a Memory V2 record, and a semantic chunk for search.
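The propose-then-apply idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the `Proposal` and `ProposalQueue` names, the dictionary layout, and the method names are hypothetical and are not taken from the Spectrion code. Only the operations and statuses come from the lists above.

```python
from dataclasses import dataclass, field


@dataclass
class Proposal:
    operation: str            # create | update | archive | delete
    record_id: str
    text: str = ""
    status: str = "pending"   # pending | approved | rejected


@dataclass
class ProposalQueue:
    # record_id -> {"text": ..., "status": ...}; a stand-in for the real store
    records: dict = field(default_factory=dict)
    proposals: list = field(default_factory=list)

    def submit(self, proposal: Proposal) -> Proposal:
        """Queue a change without touching stored memory."""
        self.proposals.append(proposal)
        return proposal

    def reject(self, proposal: Proposal) -> None:
        if proposal.status != "pending":
            raise ValueError("only pending proposals can be rejected")
        proposal.status = "rejected"

    def approve(self, proposal: Proposal) -> None:
        """Apply the change to memory only at approval time."""
        if proposal.status != "pending":
            raise ValueError("only pending proposals can be approved")
        if proposal.operation in ("create", "update"):
            self.records[proposal.record_id] = {"text": proposal.text,
                                                "status": "active"}
        elif proposal.operation == "archive":
            if proposal.record_id in self.records:
                self.records[proposal.record_id]["status"] = "archived"
        elif proposal.operation == "delete":
            self.records.pop(proposal.record_id, None)
        else:
            raise ValueError(f"unknown operation: {proposal.operation}")
        proposal.status = "approved"
```

The point of the design is that `submit` never mutates memory. Only an explicit decision by the runtime or the user does.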

Store, Snapshots, and Rollback

MemoryV2Store stores records, proposals, and snapshots. It can:

upsert record
deduplicate equivalent records
archive / delete
scoped reset
reset all
search records
export markdown
submit proposal
approve / reject proposal
sync project context
create snapshot
rollback snapshot
rollback with linked vector restore
produce vector repair report

Snapshots store records/proposals and linked vector chunks when those chunks are included in the snapshot. So rollback covers structured memory state and can restore linked semantic chunks through the vector restore path, but it is not a promise to restore every embedding everywhere.
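A toy Python model of that boundary, assuming an in-memory store: the `ToyStore` class, its dictionaries, and the shape of the returned report are hypothetical. It only illustrates the rule that rollback restores linked chunks when they were captured in the snapshot and reports the ones it cannot restore.

```python
import copy


class ToyStore:
    def __init__(self):
        self.records = {}        # record_id -> {"text", "linkedVectorChunkIds"}
        self.vector_chunks = {}  # chunk_id -> chunk text (stand-in for embeddings)
        self.snapshots = {}

    def create_snapshot(self, snapshot_id, include_chunks=True):
        linked = set()
        for record in self.records.values():
            linked.update(record.get("linkedVectorChunkIds", []))
        chunks = {}
        if include_chunks:
            chunks = {cid: self.vector_chunks[cid]
                      for cid in linked if cid in self.vector_chunks}
        self.snapshots[snapshot_id] = {
            "records": copy.deepcopy(self.records),
            "chunks": chunks,
        }

    def rollback(self, snapshot_id):
        """Restore records; restore linked chunks only if the snapshot has them."""
        snap = self.snapshots[snapshot_id]
        self.records = copy.deepcopy(snap["records"])
        restored, missing = [], []
        for record in self.records.values():
            for cid in record.get("linkedVectorChunkIds", []):
                if cid in snap["chunks"]:
                    self.vector_chunks[cid] = snap["chunks"][cid]
                    restored.append(cid)
                elif cid not in self.vector_chunks:
                    missing.append(cid)  # candidate for a vector repair report
        return {"restored": sorted(restored), "missing": sorted(missing)}
```

A snapshot taken without chunks still restores the structured records, but any linked chunk that has since disappeared shows up under `missing` rather than being silently recreated.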

Before saving, a redactor runs. It is best-effort regex redaction for:

api keys
tokens
passwords
secrets
bearer tokens
/Users/... paths

This is important: the agent should not turn long-term memory into a random secret dump. But this is not a DLP system or a mathematical privacy guarantee. It is a runtime protection layer.
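For a sense of what best-effort regex redaction looks like, here is a small Python sketch. The patterns and the `[REDACTED]` placeholder are assumptions chosen for illustration. They are not Spectrion's actual rules, and like any regex approach they will miss secrets that do not follow these shapes.

```python
import re

# Order matters: bearer tokens are handled first so that "token: Bearer abc"
# does not leave the credential behind after the key/value rule runs.
REDACTION_PATTERNS = [
    (re.compile(r"(?i)\bbearer\s+[A-Za-z0-9._~+/=-]+"), "Bearer [REDACTED]"),
    (re.compile(r"(?i)\b(api[_-]?key|token|password|secret)\b(\s*[:=]\s*)\S+"),
     r"\1\2[REDACTED]"),
    (re.compile(r"/Users/[^/\s]+"), "/Users/[REDACTED]"),
]


def redact(text: str) -> str:
    """Apply each pattern in turn; text that matches nothing passes through."""
    for pattern, replacement in REDACTION_PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

Ordinary prose such as "the token budget is small" is left alone, because the key/value rule only fires when the keyword is followed by `:` or `=`.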

Memory Tool

The user-facing surface is the memory tool.

It supports actions:

save
read
recall
search
list
delete_entry
clear
index
stats

On save, it performs dual-write: legacy markdown memory is saved for compatibility, while a MemoryRecord is created in Memory V2 and a semantic chunk is added for search.

On recall, semantic search is used. If semantic memory is unavailable, there is a keyword fallback.

On index, conversation history can be reindexed.

On stats, you can see how many chunks exist in memory, conversations, documents, and skills.

Memory Policy and Conversation Modes

Memory should not always behave the same way.

MemoryPolicy decides on operations:

readPersistentMemory
recallMemory
recallConversationMemory
writeManualMemory
writeAutomaticMemory
indexPersistentMemory
indexConversation
summarizeConversation
flushSemanticMemory
sameConversationRecall
toolEvidenceRecall
deleteMemory
clearMemory
readStats

Conversation memory modes change behavior:

full
auto
hybrid
standard
toolsOnly
off
isolated

For example, in standard/auto, cross-conversation recall can be blocked; in toolsOnly, only same-conversation and tool-evidence recall can be allowed; in off, almost every memory operation is blocked except stats.

So memory is not global magic. It is a policy-controlled part of the runtime.

Retrieval Planner

When the runtime prepares a prompt, it does not simply insert every memory.

MemoryRetrievalPlanner builds a mixed plan from:

structured records
semantic vector results

The plan filters by:

status
expiry
scope
cross-conversation rules
policy decisions
linked semantic chunks already covered by records
token budget
max items

Scoring for structured records considers:

query overlap
scope priority
confidence
recency

For semantic candidates:

vector score
keyword score
combined score
source priority
recency

MemoryRuntimeContextBuilder then renders selected records into a dedicated Memory V2 block and can include a debug trace: what was included, what was excluded, and why.

This is the difference from simple memory:

good memory is not "insert more",
but select the right things for this task and token budget
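The filter, score, and budget steps for structured records can be sketched like this in Python. The weights, the scope priority table, and the four-characters-per-token estimate are all assumptions made for the example. Only the list of factors (query overlap, scope priority, confidence, recency, token budget, max items) comes from the description above.

```python
from dataclasses import dataclass

# Assumed ordering: narrower scopes outrank broader ones.
SCOPE_PRIORITY = {"conversation": 1.0, "project": 0.8, "user": 0.6,
                  "agent": 0.4, "global": 0.2}


@dataclass
class Candidate:
    text: str
    scope: str
    confidence: float
    age_days: float
    status: str = "active"
    expired: bool = False


def estimate_tokens(text: str) -> int:
    """Crude token estimate (assumption: about four characters per token)."""
    return max(1, len(text) // 4)


def score(candidate: Candidate, query: str) -> float:
    query_words = set(query.lower().split())
    record_words = set(candidate.text.lower().split())
    overlap = len(query_words & record_words) / len(query_words) if query_words else 0.0
    recency = 1.0 / (1.0 + candidate.age_days)
    return (0.5 * overlap
            + 0.2 * SCOPE_PRIORITY.get(candidate.scope, 0.0)
            + 0.2 * candidate.confidence
            + 0.1 * recency)


def plan(candidates, query, token_budget, max_items):
    """Return (selected, trace); trace records why each item was kept or dropped."""
    selected, trace, used = [], [], 0
    eligible = []
    for c in candidates:
        if c.status != "active":
            trace.append((c.text, "excluded: status"))
        elif c.expired:
            trace.append((c.text, "excluded: expired"))
        else:
            eligible.append(c)
    for c in sorted(eligible, key=lambda c: score(c, query), reverse=True):
        cost = estimate_tokens(c.text)
        if len(selected) >= max_items:
            trace.append((c.text, "excluded: max items"))
        elif used + cost > token_budget:
            trace.append((c.text, "excluded: token budget"))
        else:
            selected.append(c)
            used += cost
            trace.append((c.text, "included"))
    return selected, trace
```

The trace is the part worth keeping even in a toy version: it makes "why was this memory not in the prompt" an answerable question.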

Semantic Memory and Vector Store

This is a separate runtime layer.

Semantic memory works with chunk sources:

memory
conversation
document
skills

The current runtime indexes persistent memory, conversation transcripts, and documents from the knowledge base. The skill source is supported at the model/statistics level, so the layer can also be used for skill retrieval.

The code has extraction patterns for multiple languages, including Russian, English, German, Spanish, French, Arabic, Hindi, Japanese, Korean, Portuguese, and Chinese.

VectorStore stores chunks in SQLite:

chunks table
chunk_embeddings table
FTS5 index
LSH buckets
ANN graph edges
provider/model/namespace metadata

So semantic memory is not just an array of embeddings in process memory. It is a local index with metadata, search, stats, migration, and repair paths.

Knowledge Base

Next to memory there is KnowledgeBase.

It is not exactly user memory; it is a RAG layer for documents.

It can import:

pdf
rtf / rtfd
txt
md / markdown
swift
js
json
csv
xml
html
py
rb
log

Documents are chunked, indexed in VectorStore, searched with vector search and keyword fallback, and synced through Mesh as metadata/text deltas.

For the agent, memory and knowledge base are different things. Memory is rules, preferences, decisions, and state. Knowledge base is documents that can be searched and cited during work.

Conversation Recall and Session Working Memory

Long conversations get two more layers.

The first is conversation_recall.

It is a tool for searching older messages in the current conversation. Underneath it has a SQLite/FTS index, BM25 ranking, an include-tool-details option, and background indexing. It is useful when session working memory says: "the older transcript contains exact source id; fetch the original text if needed."

The second is session working memory.

It is built from the saved transcript and contains:

current state
task specification
structured memory facts
files and functions
workflow
recent files and tool artifacts
evidence snippets
errors and corrections
documentation references
learnings
key results
worklog
span summaries
source anchors
compacted context summary

It has budgets:

max stored tokens
max section tokens
max snippet characters
max structured memory facts
max evidence snippets

If LLM extraction is unavailable, there is a deterministic fallback.

So even if a long transcript no longer fits in context, the runtime keeps working memory and source anchors instead of asking the user to remind it what happened.

Project Context Bridge

Project context can also become Memory V2 records, but only if the manifest is trusted.

MemoryProjectContextBridge converts project rules/instructions into scoped records, and MemoryV2Store.syncProjectContext can archive stale records when a rule disappears from the project.

This matters for code/workspace mode: project rules should not mix with the user's personal preferences.

Short Formula

Memory V2 is not just "long-term memory".

More precisely:

structured memory
  + semantic retrieval
  + session working memory
  + conversation recall
  + project scoped rules
  + policy
  + rollback

Good memory is not remembering everything.

Good memory is remembering the right thing, in the right context, with the right boundaries.

Subagents and Sessions

Some tasks are naturally parallel.

For example:

research competitors
check technical documentation
collect integrations
prepare a draft response

You can do this sequentially, but it is better to delegate.

There are two modes.

Blocking delegation:

delegate_task
  -> parent agent sends task
  -> child agent works
  -> parent waits for result
  -> parent continues

Background session:

sessions_spawn
  -> child session starts
  -> parent does not wait
  -> result can be checked later
  -> mailbox / status / history / kill

A subagent should not receive the parent's full permission set.

For example, it does not need dangerous tools:

memory writes
scheduling
self config
session management
tool/plugin management

It works in a more limited context.

Important detail: current native subagents are managed child sessions inside the app, backed by hidden conversations and async loops. They are not separate OS processes or containers. Isolation is mainly runtime/tool/context/workspace policy, not process isolation.

On macOS this is especially useful for coding/workspace tasks: a subagent can work in a separate workspace or git worktree, return a diff/artifacts, and the parent decides what to accept.

But for me the key is not "another coding mode". The important part is that subagents are part of the same runtime:

tasks
memory
tools
watchdog
policies
mesh
audit

Device Mesh: One Agent Across Devices

I started with iPhone as the first interface, but I did not want to lock the agent into one device.

Different devices have different strengths.

iPhone
  -> close to the user
  -> notifications
  -> quick decisions
  -> personal context

Mac
  -> desktop tools
  -> filesystem
  -> browser automation
  -> coding tasks
  -> shell / git / docker

CLI / Linux runner
  -> can live 24/7
  -> good for monitoring
  -> background execution

This led to Device Mesh.

The idea:

several devices become one agentic loop

In code this is not just a device list. There is pairing, cryptography, sync deltas, remote tools, handoff, and leader election.

Pairing and transport are built around encrypted Mesh: X25519/Curve25519 key agreement, HKDF-SHA256, AES-256-GCM, Keychain keys, WebSocket relay, short-lived WS token, reconnect/ping, and HTTP polling fallback. The server relay should not understand the payload: it forwards opaque nonce/payload/tag and can buffer offline messages.

Scenario:

CLI on a server watches an API at night.
If an error appears, it creates an incident task.
iPhone receives a push.
If desktop action is needed, the task goes to Mac.
The user answers from the phone.
Runtime continues where the next step is best executed.

iPhone is not necessarily the main executor.

It can be:

notification target
approval device
personal context device
decision interface

Mac can be a desktop executor.

CLI can be a 24/7 runner.

The runtime decides where a concrete step should run.

Leader Election, Handoff, and Conflicts

Mesh has two tasks that are easy to confuse.

The first: choose who does background work.

The second: synchronize changes between devices.

Leader election is based on device priority. Each device type has an execution priority and a notification priority. The execution leader is chosen among online peers by highest execution priority, with a stable tie-break by device id.

Roughly:

CLI / Linux runner -> high execution priority
Mac                -> desktop execution
iPhone             -> high notification priority

So proactive tools, scheduled workflows, and auto-connected channels should not start on every device at once. The execution leader starts them.

Notification routing is calculated separately: if several iPhones have the highest notification priority, a notification can be sent to all devices on that level.
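A compact Python sketch of both selections follows. The numeric priorities, the `Peer` type, and the choice of the lexicographically smallest device id as the tie-break are assumptions. The text only says the tie-break is stable and based on device id.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Peer:
    device_id: str
    kind: str                   # "cli", "mac", "iphone"
    online: bool
    execution_priority: int     # assumed numeric scale; higher wins
    notification_priority: int


def elect_execution_leader(peers: List[Peer]) -> Optional[Peer]:
    """Highest execution priority among online peers; ties go to the smallest id."""
    online = [p for p in peers if p.online]
    if not online:
        return None
    return min(online, key=lambda p: (-p.execution_priority, p.device_id))


def notification_targets(peers: List[Peer]) -> List[Peer]:
    """Every online peer that shares the highest notification priority."""
    online = [p for p in peers if p.online]
    if not online:
        return []
    top = max(p.notification_priority for p in online)
    return sorted((p for p in online if p.notification_priority == top),
                  key=lambda p: p.device_id)
```

Because every device evaluates the same deterministic function over the same peer list, they agree on the leader without a separate voting round, as long as their views of who is online match.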

Handoff is a separate path.

If a device goes background/offline in the middle of work, runtime can send MeshTaskHandoff to the best online peer. Handoff contains:

handoffId
conversationId
originalMessage
completedIterations
sourceDeviceId
sourceDeviceName
timestamp

The receiver adds a system note to the conversation and continues the agent loop on its own device.

Sync conflicts are not solved by one global CRDT.

In MeshSyncEngine, each delta carries a hybrid logical clock (HLC) timestamp. When remote deltas arrive, the runtime merges the HLC, applies changes by entity type, and updates lastSyncHLC. While remote deltas are being applied, isApplyingRemote is set so that local hooks do not create an echo loop.

Then logic depends on data type:

conversations/messages -> idempotent create by id, update fields when delta arrives
scheduled tasks        -> upsert by task id
knowledge base docs    -> create/delete by document id
memory                 -> append remote sections not already present locally
evolution              -> take newer/higher version history or newer reset
channel/workflow/tool configs -> apply config delta by id

So this is eventual sync with HLC ordering and entity-specific merge, not "magical merge of any two edits".
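For readers unfamiliar with hybrid logical clocks, here is a minimal Python sketch of the standard HLC update rules, plus the append-only merge described for memory. The class and function names are hypothetical, and the injected `now_ms` callable is only there to keep the example deterministic. MeshSyncEngine's real data structures are not shown in this article.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True, order=True)
class HLC:
    wall_ms: int   # physical component
    counter: int   # logical component for events in the same millisecond
    node_id: str   # final tie-break between devices


class HybridClock:
    def __init__(self, node_id: str, now_ms: Callable[[], int]):
        self.node_id = node_id
        self.now_ms = now_ms
        self.wall_ms = 0
        self.counter = 0

    def tick(self) -> HLC:
        """Timestamp a local delta."""
        physical = self.now_ms()
        if physical > self.wall_ms:
            self.wall_ms, self.counter = physical, 0
        else:
            self.counter += 1
        return HLC(self.wall_ms, self.counter, self.node_id)

    def merge(self, remote: HLC) -> HLC:
        """Advance past a remote timestamp so later local deltas sort after it."""
        physical = self.now_ms()
        new_wall = max(self.wall_ms, remote.wall_ms, physical)
        if new_wall == self.wall_ms == remote.wall_ms:
            self.counter = max(self.counter, remote.counter) + 1
        elif new_wall == self.wall_ms:
            self.counter += 1
        elif new_wall == remote.wall_ms:
            self.counter = remote.counter + 1
        else:
            self.counter = 0
        self.wall_ms = new_wall
        return HLC(self.wall_ms, self.counter, self.node_id)


def merge_memory_sections(local: List[str], remote: List[str]) -> List[str]:
    """Entity-specific merge for memory: append remote sections not present locally."""
    merged = list(local)
    for section in remote:
        if section not in merged:
            merged.append(section)
    return merged
```

The useful property is that a device with a slow wall clock still produces timestamps that sort after everything it has already seen from its peers.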

If the user edits a task on iPhone while Mac has already started working on it, the change arrives as a new message/update in conversation sync. The active run on Mac does not have to instantly rewrite an already started tool call, but at the next runtime boundary it can see the new visible task through conversation state, queued follow-up, and active task envelope.

Task board items also have claim/session/expiration so multiple workers do not keep taking the same item forever. But I do not consider this a replacement for a full distributed transaction layer.

The principle:

background execution -> leader election
mid-run continuity   -> handoff
state convergence    -> HLC + entity-specific merge
dangerous actions    -> policy / approval / fail-closed

This is more honest than promising perfect conflict resolution. A real agent system is safer with clear boundaries than hidden magic.

Business Policies: Governance Must Live in the Runtime

The more an agent can do, the more governance matters.

For a personal agent, simple approvals may be enough.

For a company, that is not enough.

Companies need:

roles
departments
managed prompts
tool policies
approval policies
audit
locked UI
provider restrictions
revocation
signed manifests

The main architecture idea is the same as with tools:

policy must be applied not only in the prompt,
but in the executor

If an employee is not allowed to send emails without approval, it is not enough to tell the model:

do not send emails without approval

ToolExecutor must physically block the send action until approval exists.

In personal mode, approval is simple: ApprovalManager creates a pending request, waits up to five minutes, and then marks it expired. It keeps history of the last 100 requests and supports auto-approve rules:

always
same arguments
until date
count N times

You can approve/deny one request, approve/deny all pending requests for an agent, clear pending requests, and inspect approval stats.
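The personal-mode rules can be approximated in Python as follows. The four rule kinds and the five-minute timeout come from the description above. The class names, field names, and the `consume` and `resolve` helpers are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

APPROVAL_TIMEOUT_S = 5 * 60  # "waits up to five minutes"


@dataclass
class AutoApproveRule:
    tool: str
    kind: str                         # always | same_arguments | until_date | count
    arguments: Optional[dict] = None  # used by same_arguments
    until_ts: Optional[float] = None  # used by until_date (unix seconds)
    remaining: int = 0                # used by count

    def consume(self, tool: str, arguments: dict, now: float) -> bool:
        """Return True if this rule auto-approves the call (and use up one count)."""
        if tool != self.tool:
            return False
        if self.kind == "always":
            return True
        if self.kind == "same_arguments":
            return arguments == self.arguments
        if self.kind == "until_date":
            return self.until_ts is not None and now <= self.until_ts
        if self.kind == "count" and self.remaining > 0:
            self.remaining -= 1
            return True
        return False


@dataclass
class PendingRequest:
    tool: str
    arguments: dict
    created_at: float
    status: str = "pending"  # pending | approved | denied | expired


def resolve(request: PendingRequest, now: float) -> str:
    """Expire a request that waited too long; an expired request never executes."""
    if request.status == "pending" and now - request.created_at > APPROVAL_TIMEOUT_S:
        request.status = "expired"
    return request.status
```

Unknown rule kinds fall through to `False`, which keeps the sketch fail-closed: anything the rule does not explicitly cover still needs a human decision.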

In Business mode, manifest policy sits on top.

Managed mode uses a manifest approach:

Organization
  -> Workspace
  -> Department
  -> Member
  -> ManagedEnvironmentManifest
  -> Runtime policy gate

The manifest contains:

allowed tools
approval-required tools
model policy
memory policy
UI locks
audit disclosure
managed prompt rules

toolPolicy contains:

allowedTools
deniedTools
defaultDecision
approvalRequiredTools
reasonCodes

If a tool is in approvalRequiredTools, the runtime returns requiresApproval. If a tool is not allowed and the default decision is deny, the executor must not run it at all.
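As a rough Python illustration of such a gate: the field names follow the toolPolicy list above, but the precedence (denied first, then approval-required, then allowed, then the default) and the `"allow"` and `"deny"` return strings are assumptions, not the documented behavior.

```python
def decide(tool: str, policy: dict) -> str:
    """Return "deny", "requiresApproval", or "allow" for a tool call."""
    if tool in policy.get("deniedTools", []):
        return "deny"
    if tool in policy.get("approvalRequiredTools", []):
        return "requiresApproval"
    if tool in policy.get("allowedTools", []):
        return "allow"
    # Anything not mentioned falls back to the default; missing default means deny.
    return policy.get("defaultDecision", "deny")


example_policy = {
    "allowedTools": ["web_search", "send_email"],
    "deniedTools": ["shell"],
    "approvalRequiredTools": ["send_email"],
    "defaultDecision": "deny",
}
```

The important part is where this function runs: inside the executor, before the tool is invoked, so a model that ignores its instructions still cannot get past it.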

Business layer has department packages for Sales, Support, HR, Finance, Operations, Marketing, Executive, Client Workspace, and vertical packages such as beauty/nails studio, fitness/gym, hotel/small property, cafe/restaurant, retail/marketplace, accounting office, and blank managed department.

But runtime entitlement does not come directly from a nice template. It is materialized into a published department profile and signed manifest.

The server model behind this is not decorative. /v1/business has organizations, members, invites, subscriptions, workspaces, departments, department profile versions, manifests, providers, integration setups, proactive review queue, audit events, guardian rules/reports, app clients, and automation items.

Department profiles are versioned: draft, publish, rollback. A publish or rollback invalidates old manifests. The manifest endpoint checks org/member/seat/subscription, workspace/department membership, and minimum app version, then signs the manifest with Ed25519. The signed manifest includes policyHash, a TTL, toolPolicy.defaultDecision = deny, the approval-required tools, and integration setup refs without secrets.

Fail-closed matters:

if manifest is stale,
signature is invalid,
user is revoked,
or policy failed to load,
mutating actions are blocked

This is especially important for agents.

The stronger the agent, the less you can rely on prompt-level goodwill.
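The fail-closed rule is simple enough to sketch in Python. The `ManifestState` fields, the reason strings, and the `run_mutating` wrapper are hypothetical. Signature verification itself is reduced to a boolean here, since the actual Ed25519 check is outside the scope of the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ManifestState:
    signature_valid: bool   # result of a signature check done elsewhere
    expires_at: float       # unix seconds; models the manifest TTL
    user_revoked: bool


def block_reason(state: Optional[ManifestState], now: float) -> Optional[str]:
    """Return why mutating actions must be blocked, or None if they may proceed."""
    if state is None:
        return "policy_not_loaded"
    if not state.signature_valid:
        return "invalid_signature"
    if now >= state.expires_at:
        return "stale_manifest"
    if state.user_revoked:
        return "user_revoked"
    return None


def run_mutating(action: Callable[[], object],
                 state: Optional[ManifestState], now: float) -> dict:
    """Execute a mutating action only when every manifest check passes."""
    reason = block_reason(state, now)
    if reason is not None:
        return {"executed": False, "reason": reason}
    return {"executed": True, "result": action()}
```

Note that the absence of a manifest is treated the same way as a bad one. The default outcome of any failure is "do not act".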

The Business provider surface is also guarded: the Spectrion Managed provider is enabled, while a company-managed provider is blocked without a pilot flag. So I cannot honestly claim that any company can already plug in an arbitrary LLM provider and immediately use it in production. The code protects this with an egress guard, redacted audit, and manifest invalidation when provider policy changes.

Business Store Factory

Another part of the business layer is the Store with ready automation capabilities.

The "1495+ tools" number in Spectrion is not made up. In the current seed it is:

7 curated base tools
62 verticals * 24 tool blueprints = 1488 generated tools

total: 1495 approved Store tools

There are also official skills and MCP profiles for business automation.

Base tools cover typical business flows:

omnichannel intake router
booking grid connector
1C accounting bridge
inventory reorder planner
marketplace order triage
file storage ingest
proactive follow-up watch

The mass catalog is generated by verticals and blueprints. Verticals include beauty/nails, fitness/gym, hotel, cafe/restaurant, retail/marketplace, accounting, legal, healthcare, real estate, education, logistics, auto service, and others. Connector families include Telegram Bot API, WhatsApp Cloud API, Instagram Messaging API, VK, SMTP/OAuth, TravelLine, Bnovo, OPERA Cloud, YCLIENTS, Google Sheet/CSV, 1C Fresh/OData/file exchange, QuickBooks, iiko, r_keeper, Bitrix24, amoCRM, Ozon, Wildberries, Shopify, Google Drive, SharePoint, and SFTP.

Important: this does not mean every tool immediately performs dangerous write actions in an external system without setup.

Business Store is review-first and dry-run-first. Many templates return:

setup_required
dry_run
fetch_sample
provider_review_required
external_write_pending_approval

So Store gives a company a fast start for integrations and workflows, but runtime still has to consider secrets, setup, approvals, audit, and fail-closed behavior.

This is essential for business scenarios: a large capability catalog is useful only if it does not bypass governance.

Approval Pipeline and Review Queue

Two things should be separated honestly.

The code already has:

approval-required tools in manifest
client pending approvals with timeout
auto-approve rules
proactive review queue
guardian rules
business notification channels
audit events

Guardian rules can have response mode:

monitor_only
approval_required
deny

And notification mode:

none
digest
immediate

The proactive review queue is used when an automation or proactive run has prepared an external action but should not execute it immediately. An admin can approve, decline, or archive the item, and the external action is not executed automatically as a side effect of review. In the dry-run/review path, it is explicitly recorded that the external action and mutation were not executed.

So business runtime can:

deny an action
require approval
place a proactive item into review queue
write an audit event
send digest/immediate notification
block mutating action on stale/invalid manifest

But it is not a universal BPMN engine for approval chains.

For example:

if approver A does not answer in N minutes,
escalate to approver B,
then C,
then auto-decline

should not be described as a generic ready-made mechanism. The current runtime has a safer base behavior: the request expires, the mutating action is not executed, and the event remains in the history/audit/review surface.

For me this is an intentional boundary. An agent runtime needs fail-closed and audit first; complex approval graphs can come after.

Evolution: The Agent Can Improve, but Not Grant Itself Permissions

Self-improvement sounds attractive, but it is a dangerous area.

You cannot simply let the model rewrite its prompt, change providers, expand permissions, or enable destructive workflows.

So Evolution is built through signals, proposals, policy gate, and rollback.

Important detail: this is a native runtime mechanism, not a business server that rewrites the app by itself. In code it is typed signals, signal store, proposer/critic, deterministic policy gate, rule store snapshots, and rollback.

The flow:

runtime events
tool failures
watchdog misses
user feedback
subagent results
workflow issues
  -> EvolutionSignalProducer
  -> EvolutionSignalStore
  -> proposals
  -> critic / policy gate
  -> snapshot
  -> limited rule mutation
  -> rollback if needed

What can be improved:

tool descriptions
planning rules
validation rules
workflow hints
project rules
runtime guidance

What cannot be done automatically:

grant new permissions
change API keys
bypass approvals
enable forbidden tools
expand destructive actions

The principle is the same:

the agent can learn,
but it must not grant itself new rights

One Full Trace

Suppose the user writes:

Watch the competitor changelog.
If a new release or pricing change appears,
write a short note, create a task, and notify me.

In a normal chat, the answer would be:

Sure, here is how you can configure it...

In an agent runtime this becomes a trace:

1. User message enters runtime
2. Runtime classifies it as long-running monitoring task
3. Context builder injects relevant memory/project rules
4. ToolCatalog activates web/proactive/workflow/task tools
5. Agent checks whether a suitable monitor already exists
6. If not, agent creates scripted monitor
7. Tool is validated, sandboxed, tested, and registered
8. Workflow is built:
     trigger -> fetch -> diff -> condition -> llm summary -> task -> notify
9. Approval is requested if needed
10. Workflow is scheduled
11. Baseline is stored
12. Heartbeat wakes workflow later
13. Monitor detects change
14. Alert enters ProactiveExecutionQueue
15. Runtime handles alert as event input
16. Agent summarizes change
17. Task board gets a new task
18. iPhone receives notification
19. Watchdog checks whether required steps are done

That is the difference.

The agent did not just answer. It created a mechanism that continues working after the message ends.

What Turned Out to Be Hardest

The hardest part was not calling an LLM and not adding tools.

The hard part was keeping state and recovering from an imperfect world.

1. The Model Can Be Confidently Wrong

It can say:

Done.

Even though a tool failed.

Or:

I checked all sources.

Even though evidence exists only for some of them.

That is why watchdog, steward, and task board exist.

2. Background Execution Is Not Magic

Especially on mobile.

You cannot just say:

let the agent always run in the background

You have to deal with platform limits, schedules, notifications, device mesh, and moving execution to a more suitable device.

3. A Thousand Tools Are Worse Than Twenty Right Ones

A large capability catalog is useful only when the model sees a relevant subset.

Otherwise quality drops.

That is why activation, suppression, categories, and policy matter more than the raw tool count.

4. Memory Without Boundaries Turns Into Noise

If the agent "remembers everything", it starts dragging stale decisions into new contexts.

You need scope, TTL, confidence, provenance, and rollback.

5. Governance Cannot Be Bolted on Later

If an agent can act, policies must live in the executor from the beginning.

Prompt-level restrictions are not a reliable boundary.

6. Self-Extension Requires Discipline

An agent that can create tools must pass through sandbox, tests, versioning, secrets, and rollback.

Otherwise it is not extension. It is uncontrolled code generation.

Why This Is Not Just n8n, Claude Code, or a Channel Gateway

I find it useful to separate agent systems by center of gravity.

There are coding agents. Their strongest loop is repository, files, shell, git, tests, patch, code intelligence.

There are automation platforms. Their strongest loop is workflow graph, triggers, integrations, deterministic automation.

There are channel gateways. Their strongest loop is Telegram, Slack, WhatsApp, email, webchat, and routing messages between channels.

Spectrion is built from a different center:

native agent runtime
  + mobile / desktop / CLI
  + tools
  + workflows
  + proactive monitors
  + memory
  + task state
  + watchdog / steward
  + device mesh
  + policies / approvals / audit

Coding can be part of the system.

Workflow can be part of the system.

Channels can be part of the system.

But the center of gravity is the agent operating loop.

Not "where does the user send a message", but "how does the task live, execute, get checked, and continue".

In that sense, Spectrion is not trying to be only an n8n replacement, only a coding harness, or only a message gateway.

n8n/Make/Zapier are good as external automation graphs. In Spectrion, workflow lives inside the agent runtime: it can call the same tools, pass through the same policies, request approvals, return proactive alerts into the reasoning loop, and use a tool the agent just created.

Claude Code and similar coding agents are strong in the software engineering loop. Spectrion can also do code/workspace tasks on Mac, but it is not limited to an IDE or terminal: the same runtime should span iPhone, Mac, and CLI, and cover schedules, workflows, memory, device mesh, and Business policies.

A channel gateway is useful as a message entry point. But if the message does not enter a shared execution layer with task state, watchdog, tools, memory, policies, and background loop, it remains message routing rather than a full environment for agentic work.

Current Architecture in Short

If you put everything together, it looks like this:

Inputs:
  chat message
  scheduled task
  proactive alert
  workflow event
  subagent result
  channel message
  voice command
  App Intent / widget / share extension
  heartbeat

        |
        v

AgentRuntime:
  context preparation
  active task envelope
  session working memory
  memory retrieval
  project context injection
  skill activation
  tool activation
  model selection
  provider request building
  execution lane acquire
  LLM loop
  tool parsing
  tool execution
  approval wait
  result handling
  artifact handling
  continuation logic
  compaction / rehydration

        |
        v

Execution layer:
  ContextManager
  MessageCompactor
  MicroCompactor
  ActiveTaskEnvelopeStore
  ProviderManager
  ProviderRequestBuilder
  ProviderVisibleContextDiagnostics
  ToolCatalog
  ToolExecutor
  AgentArtifactContract
  A2UIRenderer
  WorkflowEngine
  ProactiveExecutionQueue
  HeartbeatManager
  ChannelManager
  ApprovalManager
  ProjectContextLoader
  ProjectFileTreeService
  ProjectSearchService
  ProjectWorkspaceAuditLog
  WorkspaceMutationJournal
  TodoManager
  TaskStore
  AgentSteward
  ChatWatchdog
  Memory V2
  SemanticMemory / VectorStore
  KnowledgeBase
  ConversationRecall FTS
  SkillCatalog
  HookManager
  ExecutionLaneManager
  Device Mesh
  Business Policy Gate

        |
        v

Outputs:
  assistant message
  tool artifact
  rendered native UI
  notification
  task update
  memory proposal
  workflow schedule
  approval request
  workspace mutation snapshot
  subagent run
  channel response
  voice/TTS response
  Live Activity surface update
  audit event

For me, this is the essence of Spectrion.

Not a set of separate automation features.

One runtime through which different forms of agentic work pass.

Where This Architecture Is Useful

It works best not for questions like:

answer X

But for tasks with a loop:

observe -> think -> act -> verify -> continue

Examples:

Every morning at 8:30, look at my calendar,
reminders, and open tasks.
Make a short briefing:
what matters today,
where time conflicts exist,
what I promised but did not schedule.
If something requires a decision, create a task and notify me.

Or:

Help me prepare a feature launch.
Check copy, gather bugs,
prepare the changelog, write the post,
make the release checklist,
and do not stop until every item is closed.

Or:

Watch this page.
If a new version appears,
briefly explain the changes
and create a task to update the project.

Or:

Let CLI monitor the server,
Mac do heavy operations,
and iPhone only notify me and receive decisions.

These are tasks where a normal chat loop quickly becomes too weak an abstraction.

Conclusion

I did not start with "build a chat and then attach tools".

From day one I wanted an agent that can live beyond one message.

Externally it looked like an AI agent for iPhone.

Internally the architectural bet was different:

agent runtime first,
chat interface second

The model can change. Tools can expand through Shop or be created by the agent. Workflows can be built from conversation. Tasks can continue in the background. Memory needs boundaries. Watchdog should verify that the agent did not stop too early. Policies must apply not only in the prompt, but in the executor. iPhone, Mac, and CLI should be parts of one agentic loop.

For me the main conclusion is:

an AI agent is not an LLM with functions

An AI agent is a runtime that accepts events, carries state, executes tools, verifies results, and continues work after a single message has ended.

That is what I am building in Spectrion.

Where to look:

Website: https://spectrion.app
App Store: https://apps.apple.com/app/spectrion-agent-ai/id6759151825

I Built an AI Agent with 57+ Tools That Actually Does Stuff on Your iPhone

Denis Babkevich — Thu, 05 Mar 2026 21:46:19 +0000

I got tired of AI chatbots that can only talk. So I built Spectrion — an autonomous AI agent for iPhone that actually executes tasks: sends messages, manages calendar, searches the web, creates tools, and chains it all together without you lifting a finger.

App Store | spectrion.app

Not Just Chat. A Real Agent.

Most "AI apps" are glorified ChatGPT wrappers. Spectrion runs an agent loop — the LLM calls tools, gets results, and keeps going until the task is done:

User: "Find a pizzeria rated 4.5+ nearby and add it to my calendar"

Agent:
  → web_search("best pizzerias near me rated 4.5+")
  → Found 3 places
  → calendar("add dinner at Luigi's, 7 PM tonight")
  → Done!

One message. Multiple tools. Zero hand-holding.

57+ Built-in Tools

Eight categories covering everything your phone can do:

Search & Web — web_search, web_fetch, URL tools
Communication — iMessage, SMS, calls, email, contacts
Organization — Calendar, reminders, scheduled tasks, cron
Files — Filesystem, cloud, XLSX/DOCX/CSV/PDF parsing
Media — Camera, vision (OCR), image generation & editing, audio
System — Device info, brightness, location, maps, health, shortcuts
AI & Meta — Runtime tool creation, skills, memory, sub-agents, UI rendering
Extensions — MCP servers, plugins, community tools

Real Examples

Multi-tool task execution

"Remind me about the team standup tomorrow at 10am and add it to my calendar"

The agent calls reminders and calendar simultaneously, sets up both, and confirms — all in one turn.

Web search & summarization

"Search the web for the latest AI agent news"

Calls web_search, fetches results, and returns a structured summary.

Runtime tool creation

"Create a tool that converts temperatures between Celsius and Fahrenheit"

The agent writes JavaScript, tests it, and registers it as a new tool — usable immediately. Tools get versioning, rollback, and isolated storage.

Workflows — Visual Automation

Chain tools into multi-step workflows with:

Triggers — manual, scheduled (cron), event-driven
Actions — HTTP requests, LLM calls, notifications
Logic — Conditional branching, loops, parallel execution

Build them visually or let the agent generate them from a description.
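
A workflow built from these pieces might look like the following Python sketch. The `Step` and `Workflow` classes, the cron-style trigger string, and the stubbed forecast value are assumptions for illustration; the trigger is only stored here, not scheduled.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Step:
    name: str
    action: Callable[[dict], dict]                      # returns updates to the context
    condition: Optional[Callable[[dict], bool]] = None  # step runs only if this is true

@dataclass
class Workflow:
    trigger: str                                        # e.g. "manual" or a cron expression
    steps: list = field(default_factory=list)

    def run(self, context: dict) -> dict:
        ctx = dict(context)
        executed = []
        for step in self.steps:
            # Conditional branching: skip steps whose condition is false.
            if step.condition is not None and not step.condition(ctx):
                continue
            ctx.update(step.action(ctx))
            executed.append(step.name)
        ctx["executed"] = executed
        return ctx

umbrella = Workflow(
    trigger="cron: 0 7 * * *",
    steps=[
        # Stub for an HTTP request to a weather service.
        Step("fetch_forecast", lambda ctx: {"rain_probability": ctx["forecast_stub"]}),
        Step(
            "notify",
            lambda ctx: {"notification": "Take an umbrella"},
            condition=lambda ctx: ctx["rain_probability"] > 0.5,
        ),
    ],
)

rainy = umbrella.run({"forecast_stub": 0.8})
dry = umbrella.run({"forecast_stub": 0.1})
print(rainy["executed"], dry["executed"])
```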

Extensions: Store, Skills, Plugins, MCP

Store — Browse and install community tools & skills
Skills — Reusable instruction sets (web_researcher, scheduler, etc.)
Plugins — Hot-reload packages (Smart Summarizer, Code Sandbox, etc.)
MCP — Model Context Protocol servers for external integrations
Custom Tools — JS sandbox with HTTP, KV storage, SQLite, device APIs

Autonomous Agent Features

Heartbeat

The agent wakes up periodically (configurable interval) to check pending tasks, process messages, and run maintenance — even when you're not looking.

Morning Briefing

Daily briefing with weather, calendar events, news headlines — customizable topics and time.

Task Check-In

Auto-resumes unfinished tasks. Configurable active hours (default 8am–11pm).
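
A check-in that respects active hours could be as simple as the sketch below. The 8am to 11pm default comes from the text above; the function names, the task dictionary shape, and the choice to treat the end of the window as exclusive are assumptions.

```python
from datetime import time

def in_active_hours(now: time, start: time = time(8, 0), end: time = time(23, 0)) -> bool:
    """True if `now` falls inside the window; handles windows that wrap past midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end

def tasks_to_resume(tasks: list[dict], now: time) -> list[str]:
    """Names of unfinished tasks, or nothing at all outside active hours."""
    if not in_active_hours(now):
        return []
    return [task["name"] for task in tasks if not task["done"]]

tasks = [
    {"name": "book flights", "done": False},
    {"name": "send invoice", "done": True},
]
print(tasks_to_resume(tasks, time(9, 30)))
print(tasks_to_resume(tasks, time(2, 0)))
```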

Chat Watchdog

Auto-nudges the agent if it stops mid-task.

Evolution Engine — Self-Improvement

Every 24 hours, the agent:

Analyzes tool usage patterns and success rates
Refines its system prompt and persona parameters
Auto-creates tools for repetitive tasks
All changes versioned with instant rollback
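
The first and third steps of that cycle can be pictured with a small Python sketch: compute per-tool success rates from a usage log, and flag requests that repeat often enough to deserve their own tool. The log format, the threshold of three, and both function names are assumptions, not a description of the real engine.

```python
from collections import Counter

def success_rates(log: list[dict]) -> dict[str, float]:
    """Fraction of successful calls per tool."""
    calls, successes = Counter(), Counter()
    for entry in log:
        calls[entry["tool"]] += 1
        if entry["ok"]:
            successes[entry["tool"]] += 1
    return {tool: successes[tool] / calls[tool] for tool in calls}

def repeated_requests(requests: list[str], min_count: int = 3) -> list[str]:
    """Requests seen at least `min_count` times, after trimming and lowercasing."""
    counts = Counter(r.strip().lower() for r in requests)
    return [text for text, n in counts.items() if n >= min_count]

log = [
    {"tool": "web_search", "ok": True},
    {"tool": "web_search", "ok": True},
    {"tool": "web_search", "ok": False},
    {"tool": "web_search", "ok": True},
    {"tool": "calendar", "ok": True},
    {"tool": "calendar", "ok": True},
]
requests = ["Convert 20 C to F", "convert 20 c to f", "What's the weather?", "convert 20 C to F "]

print(success_rates(log))
print(repeated_requests(requests))
```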

Semantic Memory

Long-term memory backed by vector search. Conversations are indexed automatically, the agent decides what else is worth remembering, and everything stored can be queried semantically.
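
To show the shape of vector search over memories, here is a toy Python sketch. It uses word counts as the "embedding" and cosine similarity for ranking; a real system would use a learned embedding model and a vector index, and the `Memory` class is a hypothetical name.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy embedding: a bag of lowercase words. This only matches shared
    # words, so it is a stand-in for a real semantic embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class Memory:
    def __init__(self) -> None:
        self.items: list[tuple[str, Counter]] = []

    def remember(self, text: str) -> None:
        self.items.append((text, embed(text)))

    def search(self, query: str, k: int = 1) -> list[str]:
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

memory = Memory()
memory.remember("User prefers vegetarian pizza")
memory.remember("Standup meeting is at 10am every weekday")
memory.remember("Mac mini is the home server")

print(memory.search("when is the standup meeting"))
```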

Device Mesh — Multi-Device Agent

Connect iPhone + Mac into a single agent:

Sync conversations, tools, settings, memory
Execute tools cross-device (Mac agent can trigger iPhone camera)
End-to-end encrypted (Curve25519 ECDH + AES-256-GCM)
Offline queuing with conflict resolution

Channels — Telegram, Discord, Slack, WhatsApp, Email

Connect external messaging platforms. The agent can receive messages and respond through Telegram bots, Discord, and Slack, with full tool access.

Deep Personalization

Configure the agent's personality:

Role — Assistant, Coder, Researcher, Writer
Style — Friendly, Professional, Casual
Personality Level — Pure LLM, Human (natural), Realistic (full character)
User context — Name, address style

Choose Your AI

Works with multiple providers:

Spectrion Pro — All-in-one, 3-day free trial, no API key needed
Apple On-Device — Private, free, works offline
Anthropic (Claude) — Direct API
OpenAI (GPT-4o) — Direct API
Ollama — Local models, fully offline
Custom — Any OpenAI-compatible endpoint

Auto-fallback between providers. Utilization-aware load balancing.
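
One way to picture fallback combined with utilization-aware ordering is the Python sketch below: try the least-loaded provider first, and move to the next one when a call fails. The provider dictionaries, the 0 to 1 utilization figure, and `ProviderError` are assumptions for illustration, not the backend's real interface.

```python
class ProviderError(Exception):
    """Raised by a provider call that fails (rate limit, outage, bad key)."""

def complete(prompt: str, providers: list[dict]) -> tuple[str, str]:
    """Return (provider name, reply) from the first provider that succeeds."""
    # Least-utilized provider goes first.
    ordered = sorted(providers, key=lambda p: p["utilization"])
    errors = []
    for provider in ordered:
        try:
            return provider["name"], provider["call"](prompt)
        except ProviderError as exc:
            errors.append(f"{provider['name']}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))

# Hypothetical provider stubs.
def flaky_cloud(prompt: str) -> str:
    raise ProviderError("rate limited")

def local_model(prompt: str) -> str:
    return f"echo: {prompt}"

providers = [
    {"name": "local", "utilization": 0.9, "call": local_model},
    {"name": "cloud", "utilization": 0.2, "call": flaky_cloud},
]

used, reply = complete("hello", providers)
print(used, reply)
```

Here the cloud provider is tried first because it is less loaded, fails, and the local model answers instead.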

Tech Stack

Frontend: SwiftUI, @Observable, async/await, SwiftData (zero third-party dependencies)
Concurrency: Actor-based tool executor, TaskGroup parallel execution
Backend: Node.js, SQLite (WAL), Redis, account pooling, tier routing
Security: Keychain storage, E2E encrypted mesh, no server-side conversation logging
Localization: 11 languages with automatic tool activation by keywords
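
Keyword-based tool activation across languages could work along these lines. This Python sketch is a guess at the general idea only: the keyword table, the three sample languages, and the `tools_for` function are all hypothetical.

```python
import re

# Hypothetical keyword table: tool name -> trigger words in several languages.
KEYWORDS = {
    "calendar": {"calendar", "calendario", "kalender"},
    "reminders": {"remind", "reminder", "recordatorio", "erinnerung"},
}

def tools_for(message: str) -> list[str]:
    """Names of tools whose keywords appear in the message, sorted alphabetically."""
    words = set(re.findall(r"\w+", message.lower()))
    return sorted(tool for tool, keywords in KEYWORDS.items() if words & keywords)

print(tools_for("Add it to my calendar and remind me"))
print(tools_for("Pon un recordatorio en el calendario"))
```

Exact word matching keeps the example short; inflected forms ("reminded", "Kalendern") would need stemming or a broader keyword list.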

Built as a solo dev. If you have questions about the architecture, agent loop, tool system, or anything else — ask away in the comments.