樹神宇徳
ARI — A Universal Research Automation System That Runs from Laptop to Supercomputer

Introduction

The idea of automating research isn't new. Since Sakana AI's AI Scientist v2, there have been many attempts to hand the entire research process over to LLM agents. But in practice, these systems require either a cloud budget, an in-house engineering team, or domain-specific tooling — making them tools for the few who already have resources.

ARI (Artificial Research Intelligence) is an open-source research automation system designed to tear down that wall. It runs identically on a local Ollama instance on your laptop and on a SLURM supercomputer cluster with commercial APIs — using a single Markdown file. The core contains zero hardcoded domain knowledge; every decision is made by the LLM at runtime. This design means the same pipeline can handle HPC performance benchmarks, ML hyperparameter tuning, and — in principle — chemistry optimization.

In this article, I'll introduce the system design and walk through a real 11-page SpMM performance analysis paper that ARI produced with zero human intervention.


3-Line Summary

  • Input: A Markdown file describing your research goal (minimum 3 lines)
  • Output: Experiment code, measured data, figures, LaTeX paper, peer review, and reproducibility verification report
  • Environment: Seamlessly switches from laptop (local Ollama) to HPC cluster (SLURM + commercial API) with the same experiment file

Current version: v0.4.1 (released 2026-04-08). Includes a 9-page React/TypeScript web dashboard, 14 MCP skills, and documentation in 3 languages.


Why ARI — Democratizing Research Automation

Research automation has historically required:

  • Expensive cloud budgets
  • In-house engineering teams
  • Domain-specific tools that don't generalize

ARI is built on a single claim: the distance from "I have an idea" to "I have results" should be measured in hours, not months — regardless of your resources.

The system scales along 5 axes with one unified codebase:

| Axis | Minimal | Full |
| --- | --- | --- |
| Compute | Laptop (local process) | Supercomputer (SLURM cluster) |
| LLM | Local Ollama (qwen3:8b) | Commercial API (GPT-4, Claude) |
| Experiment spec | 3-line `.md` | Detailed SLURM scripts + rules |
| Domain | Compute benchmarks | Physical world (robotics, sensors, lab) |
| Expertise | Beginner (goal only) | Expert (full parameter control) |

The minimal experiment file really is just this:

```markdown
# Matrix Multiply Optimization
## Research Goal
Maximize GFLOPS of DGEMM on this machine.
<!-- metric_keyword: GFLOPS -->
```

From this 3-line goal, ARI runs survey → hypothesis generation → implementation → execution → figure generation → paper writing → reproducibility verification end-to-end.


Architecture — "experiment.md → paper + verification report"

```
experiment.md ──► ARI Core ──► results + paper + reproducibility report
                      │
          ┌───────────┼──────────────────────┐
          │           │                      │
     BFTS Engine   ReAct Loop         Post-BFTS Pipeline
  (Best-First    (per-node agent)   (workflow.yaml driven)
   Tree Search)        │
                  MCP Skill Servers
                  (plugin system)
```

ARI's core has three layers:

  • BFTS (Best-First Tree Search) engine — explores the hypothesis space in an evidence-driven way rather than exhaustively
  • ReAct loop — LLM agent running per node: reasoning → tool call → observation
  • MCP skill servers — purely functional tools implemented via Model Context Protocol (HPC job submission, paper generation, figure generation, etc.)

After BFTS completes, the Post-BFTS Pipeline defined in workflow.yaml runs data extraction → figure generation → paper writing → peer review → reproducibility verification automatically.

End-to-End Data Flow (10 Steps)

  1. Survey — fetch related work from arXiv / Semantic Scholar
  2. Hypothesis generation — VirSci-style multi-agent deliberation determines hypotheses, key metrics, and evaluation criteria
  3. Tree search — BFTS expands candidate nodes in priority order
  4. Experiment execution — ReAct agent generates, compiles, and runs code per node (auto-polls until SLURM job completes)
  5. Peer review evaluation — LLMEvaluator assigns scientific_score (0.0–1.0)
  6. Tree-wide analysis — Transform skill BFS-traverses the tree to extract hardware/method/ablation insights
  7. Figure generation — Plot skill's LLM writes matplotlib code and outputs PDF figures
  8. LaTeX paper writing — Paper skill generates a full paper with BibTeX citations
  9. Paper peer review — LLM acts as referee and scores the paper
  10. Reproducibility verification — A separate ReAct agent reads only the paper text, re-runs the experiment, and cross-checks claimed values against actual measurements

Step 10 is worth highlighting: the reproducibility agent reads only the paper — no access to the original experiment setup. This checks whether the methods described in the paper are actually sufficient to reproduce the results. This is a check that human peer review cannot realistically perform.


The Core Design — Zero Domain Knowledge Principle

Reading the ARI source code, you'll notice something: ari-core contains zero domain-specific keywords for HPC, ML, chemistry, or anything else. This is not accidental — it's a design invariant enforced in code review.

| ❌ Forbidden | ✅ Correct |
| --- | --- |
| `if "GFLOP" in metric_name` | Use the LLM's scientific_score |
| `grep -i "gcc\|openmp"` | |
| "Compare against MKL" in prompt | LLM decides comparisons |
| Hardcode figure type | LLM chooses from data |
| +0.2 score weight | LLM scores holistically |
| `lscpu` in system prompt | LLM calls it if needed |

The core specifies only three things:

  • Format: tool calls in JSON, experiment descriptions in Markdown
  • Protocol: skill communication via MCP
  • Signal: BFTS ranking via LLM-assigned scientific_score (0.0–1.0)

Everything else — what to measure, what to compare, which hardware info matters, which figures to draw, which citations to include — is determined autonomously by the LLM at runtime.

Why scientific_score?

Earlier versions of ARI (pre-v0.2) ranked nodes using domain-specific keywords like `gflop` and `bandwidth`. This worked for HPC but silently failed in other domains.

scientific_score is a 0.0–1.0 quality signal assigned holistically by an LLM acting as peer reviewer:

  • Did the experiment actually generate measured values?
  • Did it compare against existing methods?
  • Is the methodology reproducible?
  • Do the results support a clear scientific claim?

The LLM decides the weights; ARI only reads the number. This lets the same BFTS engine work equally well for HPC benchmarks, ML hyperparameter tuning, and chemistry optimization.
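As a minimal sketch of what "ARI only reads the number" can look like in practice (the JSON reply shape and function name are my assumptions, not ARI's actual code):

```python
import json

def read_scientific_score(llm_response: str) -> float:
    """Parse a reviewer LLM's JSON reply and clamp the score to [0.0, 1.0].
    The {"scientific_score": ...} reply shape is an assumption of this sketch."""
    try:
        payload = json.loads(llm_response)
        score = float(payload["scientific_score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return 0.0  # an unparseable review yields the worst score, never a crash
    return max(0.0, min(1.0, score))

print(read_scientific_score('{"scientific_score": 0.85}'))  # 0.85
```

Because the core only clamps and compares this number, nothing in it needs to know what the score rewards.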


BFTS — Treating Failures as Information, Not Noise

ARI's BFTS runs with a two-pool structure: pending (nodes waiting to run) and frontier (completed but unexpanded nodes).

```python
def bfts(experiment, config):
    # Conceptual sketch of ARI's two-pool best-first tree search.
    root = Node(experiment, depth=0)
    pending = [root]   # nodes waiting to run
    frontier = []      # completed but not yet expanded
    all_nodes = [root]

    while len(all_nodes) < config.max_total_nodes:
        # Step 1: LLM selects the best frontier node to expand
        while frontier and len(pending) < config.max_parallel:
            best = llm_select_best_to_expand(frontier)
            frontier.remove(best)
            children = llm_propose_directions(best)  # improve / ablation / validation
            pending.extend(children)
            all_nodes.extend(children)

        # Step 2: Parallel batch execution
        batch = llm_select_next_nodes(pending, config.max_parallel)
        for node in batch:
            pending.remove(node)  # selected nodes leave the pending pool
        results = parallel_run(batch)

        for node in results:
            memory.write(node.eval_summary)
            frontier.append(node)  # success or failure — both go to frontier

    return max(all_nodes, key=lambda n: n.metrics.get("_scientific_score", 0))
```

Four key design points:

  • Lazy expansion: Completed nodes aren't expanded until selected by the LLM. Low-scoring nodes stay in "holding" indefinitely
  • Failures are not retried: A failed node spawns a debug-labeled child that inherits the failure context. This is not a retry — retries treat failure as noise; ARI's expand() treats failure as signal
  • Strict budget management: len(all_nodes) < max_total_nodes is the only termination condition
  • Node labels: draft, improve, debug, ablation, validation — each communicates intent and context to the LLM
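The failure-as-signal idea can be sketched in a few lines: a hypothetical `expand_failed` helper (not ARI's actual API) turns a failed node into a debug-labeled child that carries the error forward instead of discarding it:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    # Minimal node sketch; field names are illustrative, not ARI's schema.
    label: str
    depth: int
    context: dict = field(default_factory=dict)

def expand_failed(parent: Node) -> Node:
    """A failed node spawns a 'debug' child that inherits the failure
    context, so the LLM sees *why* the parent failed before trying again."""
    return Node(
        label="debug",
        depth=parent.depth + 1,
        context={**parent.context, "parent_error": parent.context.get("error")},
    )

failed = Node(label="draft", depth=1, context={"error": "segfault in kernel"})
child = expand_failed(failed)
print(child.label)  # debug
```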

Ancestor-Chain-Scoped Memory

Each node has memory it can only read from its own ancestor chain:

```
root ──▶ memory["root"]
  ├─ node_A ──▶ memory["node_A"]
  │   ├─ node_A1 (reads: root + node_A)
  │   └─ node_A2 (reads: root + node_A, NOT node_A1)
  └─ node_B (reads: root only, NOT node_A branch)
```

Sibling nodes don't share memory, so parallel branches can't contaminate each other. Memory queries use the node's own eval_summary (not domain keywords), keeping search results semantically relevant to that node's work.


MCP Skills — "LLM Reasons, Skills Execute"

All side effects in ARI go through MCP skill servers. Each skill is an independent process that exposes functions via the @mcp.tool() decorator.

This gives three properties:

  • Isolation: Each skill runs in its own process. A bug in paper generation can't break HPC job submission
  • Replaceability: Any skill can be swapped without touching others
  • Discoverability: LLM agents discover available tools at runtime. Adding a skill = adding a capability, no agent reprogramming needed

14 skills are available; the 9 below are registered by default in workflow.yaml:

| Skill | Role |
| --- | --- |
| ari-skill-hpc | SLURM submission / polling / Singularity / bash |
| ari-skill-evaluator | Metric spec extraction from the experiment file |
| ari-skill-idea | arXiv survey + VirSci hypothesis generation |
| ari-skill-web | DuckDuckGo / arXiv / Semantic Scholar / citation crawl |
| ari-skill-memory | Ancestor-scoped node memory (JSONL) |
| ari-skill-transform | BFTS tree → science-ready data format |
| ari-skill-plot | Matplotlib / seaborn figure generation |
| ari-skill-paper | LaTeX writing + BibTeX + peer review |
| ari-skill-paper-re | ReAct reproducibility verification |

Skills that can be written deterministically (ari-skill-hpc, ari-skill-memory) use no LLM. LLM calls have both cost and latency; "use a pure function if you can" is ARI's policy.

Extension Path to the Physical World

The MCP plugin architecture is intentionally designed to grow beyond computation:

```
Today (compute):
  ari-skill-hpc       → SLURM job submission
  ari-skill-evaluator → metric extraction from stdout
  ari-skill-paper     → LaTeX paper writing

Tomorrow (physical world):
  ari-skill-robot   → robot arm control via ROS2 MCP bridge
  ari-skill-sensor  → temperature / pressure sensor reads
  ari-skill-labware → pipette control, plate reader integration
  ari-skill-camera  → experiment observation via computer vision
```

Adding these requires no changes to ari-core. Write a server.py with @mcp.tool() functions, register it in workflow.yaml. The same infrastructure that optimizes compiler flags today can optimize reaction temperatures tomorrow.


Web Dashboard — The Main Interface

ARI v0.4.x ships a 9-page React/TypeScript SPA dashboard as the main interface:

```shell
ari viz ./checkpoints/ --port 8765  # http://localhost:8765
```
| Page | Function |
| --- | --- |
| Home | Quick actions, recent experiments, system status |
| New Experiment | 4-step wizard (chat / write / upload goal → scope → resources → launch) |
| Experiments | List / delete / resume all checkpoints |
| Monitor | Real-time phase stepper (Idle → Idea → BFTS → Paper → Review), SSE live log, cost tracking |
| Tree | Interactive BFTS node tree — open any node to see metrics, tool call trace, generated code, stdout |
| Results | View/download paper (PDF/TeX), review report, reproducibility results, generated figures |
| Ideas | VirSci-generated hypotheses with novelty / feasibility scores and gap analysis |
| Workflow | Edit post-BFTS pipeline stages |
| Settings | LLM provider, API keys, SLURM partition auto-detect, SSH remote test |

Real-time updates via WebSocket (tree changes) + SSE (log streaming).


Results — A Paper ARI Wrote Entirely on Its Own

Here's what ARI autonomously produced: "Stoch-Loopline: Burstiness- and Tail-Latency-Aware Loopline Modeling for Robust Multi-Core CPU CSR SpMM Scaling"

Artificial Research Intelligence — April 6, 2026

Research theme: Performance modeling of CSR SpMM (sparse matrix × dense matrix multiply) on multi-core CPU

Hardware: Fujitsu fx700 node, OpenMP 32 threads

Problem: Existing roofline models predict only average throughput — they can't capture the non-monotonic performance variation across sparsity patterns and dense widths N. The goal: model bursty, irregular memory access and the associated tail latency.

What ARI Autonomously Produced

| Type | Content |
| --- | --- |
| New analytical model | Stoch-Loopline — extends loopline/roofline with burstiness, tail latency, and "scaling collapse risk" |
| 2 kernel implementations | Variant-1 (row-parallel gather + explicit unroll) / Variant-3 (rows-in-flight window) |
| Ablation study | K-blocking / N-tiling+packing / scalar / no-AVX / prefetch disabled |
| Synthetic CSR generator | Uniform and Zipf modes (lognormal-based), with CV / skewness / Gini statistics |
| Experiment sweep | Up to M = K = 200,000 (~3.2M nonzeros), dense width N ∈ {4, 8, 16, 32, 64, 128} |
| 3 figures | Throughput/bandwidth curves, operating point scatter plot, prefetch ablation |
| References | Alappat et al. (2020, 2021), Trotter et al. (2020), Lei et al. (2025) — auto-collected via Semantic Scholar |

Key Numerical Results

| Configuration | GFLOP/s | Effective BW |
| --- | --- | --- |
| K-blocked CSR SpMM (peak) | 23.82 | 58.30 GB/s |
| Validation sweep (N=16, 32 threads) | 26.22 | 65.55 GB/s |
| Max measured BW (root sweep) | 17.17 | 105.18 GB/s |
| Software prefetch improvement (width avg) | +3.53 | +8.18 GB/s |

Most interesting: ARI autonomously discovered a "scaling collapse" phenomenon. Increasing dense width N from 64 → 256 causes throughput to drop from 26.22 → ~18.3 GFLOP/s and bandwidth from 65.5 → 41–42 GB/s. This is counterintuitive — you'd expect higher N to improve compute density. The paper explains this via Stoch-Loopline's "tail latency amplification" and "collapse risk" concepts.

The pseudocode (Algorithm 1 / Algorithm 2) matches the actual compiled spmm_stoch_loopline.cpp source structure and unroll counts (unroll ∈ {4, 8}) exactly — because the Transform skill actually reads the source code from the BFTS tree before writing the paper.

Not Just "Write a Paper and Stop" — Reproducibility Loop

After writing the paper, ari-skill-paper-re automatically:

  1. Text-extracts the paper PDF
  2. Reads the configuration
  3. Re-runs the job
  4. Cross-checks claimed values against actual measurements

If the paper claims "26.22 GFLOP/s", a separate agent independently verifies that number is reproducible using only the paper text as its information source. This makes reproducibility a first-class design principle, not an afterthought.
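A minimal sketch of the numeric cross-check, assuming a relative-tolerance comparison (the 5% threshold and function name are my assumptions, not ARI's documented policy):

```python
def values_match(claimed: float, measured: float, rel_tol: float = 0.05) -> bool:
    """Hypothetical check: does an independent re-run confirm a paper's
    claimed value within a relative tolerance? 5% is an assumed default."""
    if claimed == 0:
        return measured == 0
    return abs(measured - claimed) / abs(claimed) <= rel_tol

# Paper claims 26.22 GFLOP/s; suppose the re-run measures 25.40 GFLOP/s.
assert values_match(26.22, 25.40)      # within 5% -> considered reproduced
assert not values_match(26.22, 18.30)  # a large gap would flag the claim
```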


Quick Start

```shell
# 1. Install
git clone https://github.com/kotama7/ARI && cd ARI
bash setup.sh

# 2. Set up AI model (choose one)
ollama pull qwen3:8b                           # free, local
export ARI_BACKEND=openai OPENAI_API_KEY=sk-…  # or cloud API

# 3a. Launch dashboard
ari viz ./checkpoints/ --port 8765
# Open http://localhost:8765 → use wizard to create and launch experiments

# 3b. Or run directly from CLI
ari run experiment.md                # local experiment
ari run experiment.md --profile hpc  # SLURM cluster (auto-detect + profile)
```

Three environment profiles are provided: laptop.yaml / hpc.yaml / cloud.yaml. ari/env_detect.py auto-detects scheduler (SLURM / PBS / LSF / SGE / Kubernetes) and container runtime (Docker / Singularity / Apptainer).
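A sketch of what such scheduler detection can look like, probing for each scheduler's submit command on PATH (the function name is illustrative; ari/env_detect.py may work differently):

```python
import shutil

def detect_scheduler() -> str:
    """Guess the batch scheduler by probing for submit commands on PATH.
    Note that qsub is shared by PBS/Torque and SGE, so a real detector
    would disambiguate further (e.g. via qstat version output)."""
    probes = [
        ("sbatch", "slurm"),
        ("qsub", "pbs"),
        ("bsub", "lsf"),
        ("kubectl", "kubernetes"),
    ]
    for command, scheduler in probes:
        if shutil.which(command):
            return scheduler
    return "local"  # no scheduler found: run as a local process
```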

After a run, output is organized in ./checkpoints/<run_id>/:

| File | Content |
| --- | --- |
| tree.json | Full BFTS node tree (all nodes, metrics, parent-child links) |
| science_data.json | Science-ready formatted data (no internal BFTS terminology) |
| full_paper.tex / .pdf | Generated LaTeX paper and PDF |
| review_report.json | LLM peer review scores and feedback |
| reproducibility_report.json | Independent reproducibility verification results |
| figures_manifest.json | Figure paths and captions |
| cost_trace.jsonl | Per-call LLM cost tracking |
| experiments/<slug>/<node_id>/ | Per-node working directory and generated code |
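As a hypothetical post-run analysis, tree.json can be loaded to recover the best-scoring node; the schema assumed here (a top-level "nodes" list with per-node "metrics") mirrors the `_scientific_score` ranking in the BFTS pseudocode and may not match ARI's actual file exactly:

```python
import json
import os
import tempfile

def best_node(tree_path: str) -> dict:
    """Return the highest-scoring node from a checkpoint's tree.json
    (schema assumed for this sketch)."""
    with open(tree_path) as f:
        nodes = json.load(f)["nodes"]
    return max(nodes, key=lambda n: n.get("metrics", {}).get("_scientific_score", 0.0))

# Demo against a tiny synthetic tree:
sample = {"nodes": [
    {"id": "root", "metrics": {"_scientific_score": 0.40}},
    {"id": "node_A1", "metrics": {"_scientific_score": 0.87}},
]}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(sample, f)
    path = f.name
winner = best_node(path)
os.unlink(path)
print(winner["id"])  # node_A1
```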

Design Principles Summary

Five principles ARI never violates:

| # | Principle | Meaning |
| --- | --- | --- |
| P1 | Domain-agnostic core | Zero experiment-specific knowledge in ari-core |
| P2 | Deterministic by default | MCP tools are deterministic by default; LLM-using tools are explicitly annotated |
| P3 | Multi-purpose metrics | No hardcoded scalar scores |
| P4 | Dependency injection | Switching experiments = editing .md only |
| P5 | Reproducibility first | Hardware described by specs, not cluster names |

And the anti-goals:

  • Not a replacement for experts (an amplifier)
  • Not operated without human supervision at physical risk boundaries
  • Not a black box (every decision is logged and traceable)
  • Not hardcoding "what good science looks like"
