Introduction
The idea of automating research isn't new. Since Sakana AI's AI Scientist v2, there have been many attempts to hand the entire research process over to LLM agents. But in practice, these systems require either a cloud budget, an in-house engineering team, or domain-specific tooling — making them tools for the few who already have resources.
ARI (Artificial Research Intelligence) is an open-source research automation system designed to tear down that wall. It runs identically on a local Ollama instance on your laptop and on a SLURM supercomputer cluster with commercial APIs — using a single Markdown file. The core contains zero hardcoded domain knowledge; every decision is made by the LLM at runtime. This design means the same pipeline can handle HPC performance benchmarks, ML hyperparameter tuning, and — in principle — chemistry optimization.
In this article, I'll introduce the system design and walk through a real 11-page SpMM performance analysis paper that ARI produced with zero human intervention.
- Project homepage: https://kotama7.github.io/ARI/
- GitHub: https://github.com/kotama7/ARI
3-Line Summary
- Input: A Markdown file describing your research goal (minimum 3 lines)
- Output: Experiment code, measured data, figures, LaTeX paper, peer review, and reproducibility verification report
- Environment: Seamlessly switches from laptop (local Ollama) to HPC cluster (SLURM + commercial API) with the same experiment file
Current version: v0.4.1 (released 2026-04-08). Includes a 9-page React/TypeScript web dashboard, 14 MCP skills, and documentation in 3 languages.
Why ARI — Democratizing Research Automation
Research automation has historically required:
- Expensive cloud budgets
- In-house engineering teams
- Domain-specific tools that don't generalize
ARI is built on a single claim: the distance from "I have an idea" to "I have results" should be measured in hours, not months — regardless of your resources.
The system scales along 5 axes with one unified codebase:
| Axis | Minimal | Full |
|---|---|---|
| Compute | Laptop (local process) | Supercomputer (SLURM cluster) |
| LLM | Local Ollama (qwen3:8b) | Commercial API (GPT-4, Claude) |
| Experiment spec | 3-line `.md` | Detailed SLURM scripts + rules |
| Domain | Compute benchmarks | Physical world (robotics, sensors, lab) |
| Expertise | Beginner (goal only) | Expert (full parameter control) |
The minimal experiment file really is just this:
```markdown
# Matrix Multiply Optimization

## Research Goal
Maximize GFLOPS of DGEMM on this machine.

<!-- metric_keyword: GFLOPS -->
```
From this 3-line goal, ARI runs survey → hypothesis generation → implementation → execution → figure generation → paper writing → reproducibility verification end-to-end.
Architecture — "experiment.md → paper + verification report"
```text
experiment.md ──► ARI Core ──► results + paper + reproducibility report
                      │
          ┌───────────┼──────────────────────┐
          │           │                      │
    BFTS Engine   ReAct Loop       Post-BFTS Pipeline
   (Best-First   (per-node agent)  (workflow.yaml driven)
   Tree Search)       │
               MCP Skill Servers
                (plugin system)
```
ARI's core has three layers:
- BFTS (Best-First Tree Search) engine — explores the hypothesis space evidence-driven, not exhaustively
- ReAct loop — LLM agent running per node: reasoning → tool call → observation
- MCP skill servers — purely functional tools implemented via Model Context Protocol (HPC job submission, paper generation, figure generation, etc.)
After BFTS completes, the Post-BFTS Pipeline defined in workflow.yaml runs data extraction → figure generation → paper writing → peer review → reproducibility verification automatically.
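The post-BFTS stages could be declared roughly like this (the stage names mirror the skills described later in this article, but the exact `workflow.yaml` schema is defined by the repository, so treat this as an illustrative sketch only):

```yaml
# Illustrative sketch -- consult the ARI repository for the real schema.
post_bfts:
  stages:
    - skill: ari-skill-transform   # BFTS tree -> science-ready data
    - skill: ari-skill-plot        # matplotlib figure generation
    - skill: ari-skill-paper       # LaTeX + BibTeX + peer review
    - skill: ari-skill-paper-re    # reproducibility verification
```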
End-to-End Data Flow (10 Steps)
1. Survey — fetch related work from arXiv / Semantic Scholar
2. Hypothesis generation — VirSci-style multi-agent deliberation determines hypotheses, key metrics, and evaluation criteria
3. Tree search — BFTS expands candidate nodes in priority order
4. Experiment execution — ReAct agent generates, compiles, and runs code per node (auto-polls until SLURM job completes)
5. Peer review evaluation — LLMEvaluator assigns `scientific_score` (0.0–1.0)
6. Tree-wide analysis — Transform skill BFS-traverses the tree to extract hardware/method/ablation insights
7. Figure generation — Plot skill's LLM writes matplotlib code and outputs PDF figures
8. LaTeX paper writing — Paper skill generates a full paper with BibTeX citations
9. Paper peer review — LLM acts as referee and scores the paper
10. Reproducibility verification — A separate ReAct agent reads only the paper text, re-runs the experiment, and cross-checks claimed values against actual measurements
Step 10 is worth highlighting: the reproducibility agent reads only the paper — no access to the original experiment setup. This checks whether the methods described in the paper are actually sufficient to reproduce the results. This is a check that human peer review cannot realistically perform.
The Core Design — Zero Domain Knowledge Principle
Reading the ARI source code, you'll notice something: ari-core contains zero domain-specific keywords for HPC, ML, chemistry, or anything else. This is not accidental — it's a design invariant enforced in code review.
| ❌ Forbidden | ✅ Correct |
|---|---|
| `if "GFLOP" in metric_name` | Use the LLM's `scientific_score` |
| `grep -i "gcc\|openmp"` | LLM probes the toolchain at runtime |
| "Compare against MKL" in prompt | LLM decides comparisons |
| Hardcode figure type | LLM chooses from data |
| `+0.2` score weight | LLM scores holistically |
| `lscpu` in system prompt | LLM calls it if needed |
The core specifies only three things:
- Format: tool calls in JSON, experiment descriptions in Markdown
- Protocol: skill communication via MCP
- Signal: BFTS ranking via LLM-assigned `scientific_score` (0.0–1.0)
Everything else — what to measure, what to compare, which hardware info matters, which figures to draw, which citations to include — is determined autonomously by the LLM at runtime.
Why scientific_score?
Earlier versions of ARI (pre-v0.2) ranked nodes using domain-specific keywords like `gflop` and `bandwidth`. This worked for HPC but silently failed in other domains.
`scientific_score` is a 0.0–1.0 quality signal assigned holistically by an LLM acting as a peer reviewer:
- Did the experiment actually generate measured values?
- Did it compare against existing methods?
- Is the methodology reproducible?
- Do the results support a clear scientific claim?
The LLM decides the weights; ARI only reads the number. This lets the same BFTS engine work equally well for HPC benchmarks, ML hyperparameter tuning, and chemistry optimization.
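What "ARI only reads the number" means in practice can be sketched in a few lines (the class and field names here are illustrative, loosely modeled on the pseudocode in the next section):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    metrics: dict = field(default_factory=dict)

def best_node(nodes):
    """Rank purely by the LLM-assigned score; the core never inspects
    domain metrics like GFLOPS or bandwidth."""
    return max(nodes, key=lambda n: n.metrics.get("_scientific_score", 0.0))

nodes = [
    Node("draft", {"_scientific_score": 0.41}),
    Node("improve", {"_scientific_score": 0.78, "GFLOPS": 26.22}),
    Node("failed-run", {}),  # unscored node -> treated as 0.0
]
print(best_node(nodes).name)  # improve
```

The `GFLOPS` entry is carried along as opaque data; only the score drives the search.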
BFTS — Treating Failures as Information, Not Noise
ARI's BFTS runs with a two-pool structure: pending (nodes waiting to run) and frontier (completed but unexpanded nodes).
```python
def bfts(experiment, config):
    root = Node(experiment, depth=0)
    pending = [root]   # nodes waiting to run
    frontier = []      # completed but unexpanded nodes
    all_nodes = [root]
    while len(all_nodes) < config.max_total_nodes:
        # Step 1: LLM selects the best frontier nodes to expand
        while frontier and len(pending) < config.max_parallel:
            best = llm_select_best_to_expand(frontier)
            frontier.remove(best)
            children = llm_propose_directions(best)  # improve / ablation / validation
            pending.extend(children)
            all_nodes.extend(children)
        # Step 2: parallel batch execution
        batch = llm_select_next_nodes(pending, config.max_parallel)
        for node in batch:
            pending.remove(node)  # selected nodes leave the pending pool
        results = parallel_run(batch)
        for node in results:
            memory.write(node.eval_summary)  # ancestor-scoped memory skill
            frontier.append(node)  # success or failure -- both go to frontier
    return max(all_nodes, key=lambda n: n.metrics.get("_scientific_score", 0))
```
Four key design points:
- Lazy expansion: Completed nodes aren't expanded until selected by the LLM. Low-scoring nodes stay in "holding" indefinitely
- Failures are not retried: A failed node spawns a `debug`-labeled child that inherits the failure context. This is not a retry — retries treat failure as noise; ARI's `expand()` treats failure as signal
- Strict budget management: `len(all_nodes) < max_total_nodes` is the only termination condition
- Node labels: `draft`, `improve`, `debug`, `ablation`, `validation` — each communicates intent and context to the LLM
Ancestor-Chain-Scoped Memory
Each node has memory it can only read from its own ancestor chain:
```text
root ──▶ memory["root"]
 ├─ node_A ──▶ memory["node_A"]
 │    ├─ node_A1 (reads: root + node_A)
 │    └─ node_A2 (reads: root + node_A, NOT node_A1)
 └─ node_B (reads: root only, NOT node_A branch)
```
Sibling nodes don't share memory, so parallel branches can't contaminate each other. Memory queries use the node's own eval_summary (not domain keywords), keeping search results semantically relevant to that node's work.
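The ancestor-chain read is simple to picture: walk parent pointers to the root and concatenate. A minimal sketch (class and field names are my own, not ARI's):

```python
from dataclasses import dataclass, field

@dataclass
class MemNode:
    node_id: str
    parent: "MemNode | None" = None
    entries: list = field(default_factory=list)

    def readable_memory(self) -> list:
        """Collect memory from this node up through its ancestor chain,
        returned root-first. Siblings are invisible, so parallel
        branches stay isolated."""
        chain, cur = [], self
        while cur is not None:
            chain.append(cur)
            cur = cur.parent
        return [e for node in reversed(chain) for e in node.entries]

root = MemNode("root", entries=["baseline: 17.2 GFLOP/s"])
a = MemNode("node_A", parent=root, entries=["K-blocking helps"])
a1 = MemNode("node_A1", parent=a, entries=["unroll=8 best"])
b = MemNode("node_B", parent=root)
print(a1.readable_memory())  # root + node_A + node_A1
print(b.readable_memory())   # root only
```

Node `b` never sees the `node_A` branch, no matter how the branches are scheduled.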
MCP Skills — "LLM Reasons, Skills Execute"
All side effects in ARI go through MCP skill servers. Each skill is an independent process exposing functions via the `@mcp.tool()` decorator.
This gives three properties:
- Isolation: Each skill runs in its own process. A bug in paper generation can't break HPC job submission
- Replaceability: Any skill can be swapped without touching others
- Discoverability: LLM agents discover available tools at runtime. Adding a skill = adding a capability, no agent reprogramming needed
14 skills are available; 9 are registered by default in workflow.yaml:
| Skill | Role | Uses LLM? | Default |
|---|---|---|---|
| `ari-skill-hpc` | SLURM submission / polling / Singularity / bash | ✗ | ✓ |
| `ari-skill-evaluator` | Metric spec extraction from experiment file | △ | ✓ |
| `ari-skill-idea` | arXiv survey + VirSci hypothesis generation | ✓ | ✓ |
| `ari-skill-web` | DuckDuckGo / arXiv / Semantic Scholar / citation crawl | △ | ✓ |
| `ari-skill-memory` | Ancestor-scoped node memory (JSONL) | ✗ | ✓ |
| `ari-skill-transform` | BFTS tree → science-ready data format | ✓ | ✓ |
| `ari-skill-plot` | Matplotlib / seaborn figure generation | ✓ | ✓ |
| `ari-skill-paper` | LaTeX writing + BibTeX + peer review | ✓ | ✓ |
| `ari-skill-paper-re` | ReAct reproducibility verification | ✓ | ✓ |
Skills that can be written deterministically (`ari-skill-hpc`, `ari-skill-memory`) use no LLM. LLM calls cost both money and latency; "use a pure function if you can" is ARI's policy.
Extension Path to the Physical World
The MCP plugin architecture is intentionally designed to grow beyond computation:
```text
Today (compute):
  ari-skill-hpc       → SLURM job submission
  ari-skill-evaluator → metric extraction from stdout
  ari-skill-paper     → LaTeX paper writing

Tomorrow (physical world):
  ari-skill-robot     → robot arm control via ROS2 MCP bridge
  ari-skill-sensor    → temperature / pressure sensor reads
  ari-skill-labware   → pipette control, plate reader integration
  ari-skill-camera    → experiment observation via computer vision
```
Adding these requires no changes to ari-core. Write a `server.py` with `@mcp.tool()` functions and register it in `workflow.yaml`. The same infrastructure that optimizes compiler flags today can optimize reaction temperatures tomorrow.
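The plugin contract is easy to picture with a toy registry. This is plain Python mimicking the shape of an `@mcp.tool()` server, not the real MCP SDK, and both tool functions below are hypothetical:

```python
# Toy stand-in for an MCP skill server: a decorator registers functions,
# and an agent discovers them by name at runtime. The real implementation
# uses the MCP SDK; this only illustrates the contract.
TOOLS: dict = {}

def tool(fn):
    TOOLS[fn.__name__] = fn
    return fn

@tool
def read_sensor(channel: str) -> float:
    """Hypothetical physical-world skill: return a temperature reading."""
    fake_readings = {"reactor_temp": 21.5}
    return fake_readings[channel]

@tool
def submit_job(script: str) -> str:
    """Hypothetical compute skill: pretend to queue a SLURM script."""
    return f"submitted:{script}"

# The agent discovers and calls tools without the core knowing the domain:
print(sorted(TOOLS))                         # ['read_sensor', 'submit_job']
print(TOOLS["read_sensor"]("reactor_temp"))  # 21.5
```

The core never imports either function; it only sees names and signatures at runtime, which is why swapping a compute skill for a lab skill needs no core changes.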
Web Dashboard — The Main Interface
ARI v0.4.x ships a 9-page React/TypeScript SPA dashboard as the main interface:
```shell
ari viz ./checkpoints/ --port 8765   # http://localhost:8765
```
| Page | Function |
|---|---|
| Home | Quick actions, recent experiments, system status |
| New Experiment | 4-step wizard (chat / write / upload goal → scope → resources → launch) |
| Experiments | List / delete / resume all checkpoints |
| Monitor | Real-time phase stepper (Idle → Idea → BFTS → Paper → Review), SSE live log, cost tracking |
| Tree | Interactive BFTS node tree — open any node to see metrics, tool call trace, generated code, stdout |
| Results | View/download paper (PDF/TeX), review report, reproducibility results, generated figures |
| Ideas | VirSci-generated hypotheses with novelty / feasibility scores and gap analysis |
| Workflow | Edit post-BFTS pipeline stages |
| Settings | LLM provider, API keys, SLURM partition auto-detect, SSH remote test |
Real-time updates via WebSocket (tree changes) + SSE (log streaming).
Results — A Paper ARI Wrote Entirely on Its Own
Here's what ARI autonomously produced: "Stoch-Loopline: Burstiness- and Tail-Latency-Aware Loopline Modeling for Robust Multi-Core CPU CSR SpMM Scaling"
Artificial Research Intelligence — April 6, 2026
Research theme: Performance modeling of CSR SpMM (sparse matrix × dense matrix multiply) on multi-core CPU
Hardware: Fujitsu fx700 node, OpenMP 32 threads
Problem: Existing roofline models predict only average throughput — they can't capture the non-monotonic performance variation across sparsity patterns and dense widths N. The goal: model bursty, irregular memory access and the associated tail latency.
What ARI Autonomously Produced
| Type | Content |
|---|---|
| New analytical model | Stoch-Loopline — extends loopline/roofline with burstiness, tail latency, and "scaling collapse risk" |
| 2 kernel implementations | Variant-1 (row-parallel gather + explicit unroll) / Variant-3 (rows-in-flight window) |
| Ablation study | K-blocking / N-tiling+packing / scalar / no-AVX / prefetch disabled |
| Synthetic CSR generator | Uniform and Zipf modes (lognormal-based), with CV / skewness / Gini statistics |
| Experiment sweep | Up to M = K = 200,000 (~3.2M nonzeros), dense width N ∈ {4, 8, 16, 32, 64, 128} |
| 3 figures | Throughput/bandwidth curves, operating point scatter plot, prefetch ablation |
| References | Alappat et al. (2020, 2021), Trotter et al. (2020), Lei et al. (2025) — auto-collected via Semantic Scholar |
Key Numerical Results
| Configuration | GFLOP/s | Effective BW |
|---|---|---|
| K-blocked CSR SpMM (peak) | 23.82 | 58.30 GB/s |
| Validation sweep (N=16, 32 threads) | 26.22 | 65.55 GB/s |
| Max measured BW (root sweep) | 17.17 | 105.18 GB/s |
| Software prefetch improvement (width avg) | +3.53 | +8.18 GB/s |
Most interesting: ARI autonomously discovered a "scaling collapse" phenomenon. Increasing dense width N from 64 → 256 causes throughput to drop from 26.22 → ~18.3 GFLOP/s and bandwidth from 65.5 → 41–42 GB/s. This is counterintuitive — you'd expect higher N to improve compute density. The paper explains this via Stoch-Loopline's "tail latency amplification" and "collapse risk" concepts.
The pseudocode (Algorithm 1 / Algorithm 2) matches the actual compiled spmm_stoch_loopline.cpp source structure and unroll counts (unroll ∈ {4, 8}) exactly — because the Transform skill actually reads the source code from the BFTS tree before writing the paper.
Not Just "Write a Paper and Stop" — Reproducibility Loop
After writing the paper, ari-skill-paper-re automatically:
- Text-extracts the paper PDF
- Reads the configuration
- Re-runs the job
- Cross-checks claimed values against actual measurements
If the paper claims "26.22 GFLOP/s", a separate agent independently verifies that number is reproducible using only the paper text as its information source. This makes reproducibility a first-class design principle, not an afterthought.
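At its core, the cross-check amounts to a tolerance comparison between claimed and re-measured values. A sketch of that idea (the function and the 5% relative tolerance are my assumptions for illustration, not ARI's actual acceptance criterion, which is judged by the agent):

```python
def crosscheck(claimed: float, measured: float, rel_tol: float = 0.05) -> bool:
    """Does an independently re-measured value support the paper's claim?

    Illustrative check only -- the 5% relative tolerance is an assumption.
    """
    return abs(measured - claimed) <= rel_tol * abs(claimed)

print(crosscheck(claimed=26.22, measured=25.90))  # True: within tolerance
print(crosscheck(claimed=26.22, measured=18.30))  # False: not reproduced
```

The hard part, of course, is everything before this line: reconstructing a runnable experiment from prose alone.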
Quick Start
```shell
# 1. Install
git clone https://github.com/kotama7/ARI && cd ARI
bash setup.sh

# 2. Set up an AI model (choose one)
ollama pull qwen3:8b                            # free, local
export ARI_BACKEND=openai OPENAI_API_KEY=sk-…   # or cloud API

# 3a. Launch the dashboard
ari viz ./checkpoints/ --port 8765
# Open http://localhost:8765 → use the wizard to create and launch experiments

# 3b. Or run directly from the CLI
ari run experiment.md                 # local experiment
ari run experiment.md --profile hpc   # SLURM cluster (auto-detect + profile)
```
Three environment profiles are provided: `laptop.yaml` / `hpc.yaml` / `cloud.yaml`. `ari/env_detect.py` auto-detects the scheduler (SLURM / PBS / LSF / SGE / Kubernetes) and container runtime (Docker / Singularity / Apptainer).
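Scheduler auto-detection of this kind can be sketched as a `PATH` probe. The marker binaries below are a plausible heuristic of my own, not necessarily the exact probes `ari/env_detect.py` uses:

```python
import shutil

# Marker binary per scheduler -- an illustrative detection heuristic.
SCHEDULER_MARKERS = {
    "slurm": "sbatch",
    "lsf": "bsub",
    "sge": "qconf",        # probe before the shared `qsub` of PBS/SGE
    "pbs": "qsub",
    "kubernetes": "kubectl",
}

def detect_scheduler() -> str:
    """Return the first scheduler whose marker binary is on PATH."""
    for name, binary in SCHEDULER_MARKERS.items():
        if shutil.which(binary):
            return name
    return "local"  # fall back to running experiments as plain processes

print(detect_scheduler())
```

On a laptop this typically prints `local`, which is exactly the case where ARI runs experiments as child processes instead of batch jobs.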
After a run, output is organized in ./checkpoints/<run_id>/:
| File | Content |
|---|---|
| `tree.json` | Full BFTS node tree (all nodes, metrics, parent-child links) |
| `science_data.json` | Science-ready formatted data (no internal BFTS terminology) |
| `full_paper.tex` / `.pdf` | Generated LaTeX paper and PDF |
| `review_report.json` | LLM peer review scores and feedback |
| `reproducibility_report.json` | Independent reproducibility verification results |
| `figures_manifest.json` | Figure paths and captions |
| `cost_trace.jsonl` | Per-call LLM cost tracking |
| `experiments/<slug>/<node_id>/` | Per-node working directory and generated code |
Design Principles Summary
Five principles ARI never violates:
| # | Principle | Meaning |
|---|---|---|
| P1 | Domain-agnostic core | Zero experiment-specific knowledge in `ari-core` |
| P2 | Deterministic by default | MCP tools are deterministic by default; LLM-using tools are explicitly annotated |
| P3 | Multi-purpose metrics | No hardcoded scalar scores |
| P4 | Dependency injection | Switching experiments = editing `.md` only |
| P5 | Reproducibility first | Hardware described by specs, not cluster names |
And the anti-goals:
- Not a replacement for experts (an amplifier)
- Not operated without human supervision at physical risk boundaries
- Not a black box (every decision is logged and traceable)
- Not hardcoding "what good science looks like"
Links
- 🌐 Project homepage (demo + sample paper viewer): https://kotama7.github.io/ARI/
- 📄 Sample paper PDF: https://kotama7.github.io/ARI/sample_paper.pdf
- 💻 GitHub (MIT): https://github.com/kotama7/ARI
- 🎬 Dashboard demo (EN): https://github.com/kotama7/ARI/raw/main/docs/movie/en/ari_dashboard_demo.mp4