Introduction
The idea of automating research isn't new. Since Sakana AI's AI Scientist v2, there have been many attempts to hand the entire research process over to LLM agents. But in practice, these systems require either a cloud budget, an in-house engineering team, or domain-specific tooling — making them tools for the few who already have resources.
ARI (Artificial Research Intelligence) is an open-source research automation system designed to tear down that wall. It runs identically on a local Ollama instance on your laptop and on a SLURM supercomputer cluster with commercial APIs — using a single Markdown file. The core contains zero hardcoded domain knowledge; every decision is made by the LLM at runtime. This design means the same pipeline can handle HPC performance benchmarks, ML hyperparameter tuning, and — in principle — chemistry optimization.
In this article, I'll introduce the system design and walk through a real 11-page SpMM performance analysis paper that ARI produced with zero human intervention.
- Project homepage: https://kotama7.github.io/ARI/
- GitHub: https://github.com/kotama7/ARI
3-Line Summary
- Input: A Markdown file describing your research goal (minimum 3 lines)
- Output: Experiment code, measured data, figures, LaTeX paper, peer review, and reproducibility verification report
- Environment: Seamlessly switches from laptop (local Ollama) to HPC cluster (SLURM + commercial API) with the same experiment file
Current version: v0.4.1 (released 2026-04-08). Includes a 9-page React/TypeScript web dashboard, 14 MCP skills, and documentation in 3 languages.
Why ARI — Democratizing Research Automation
Research automation has historically required:
- Expensive cloud budgets
- In-house engineering teams
- Domain-specific tools that don't generalize
ARI is built on a single claim: the distance from "I have an idea" to "I have results" should be measured in hours, not months — regardless of your resources.
The system scales along 5 axes with one unified codebase:
| Axis | Minimal | Full |
|---|---|---|
| Compute | Laptop (local process) | Supercomputer (SLURM cluster) |
| LLM | Local Ollama (qwen3:8b) | Commercial API (GPT-4, Claude) |
| Experiment spec | 3-line `.md` | Detailed SLURM scripts + rules |
| Domain | Compute benchmarks | Physical world (robotics, sensors, lab) |
| Expertise | Beginner (goal only) | Expert (full parameter control) |
The minimal experiment file really is just this:
```markdown
# Matrix Multiply Optimization

## Research Goal
Maximize GFLOPS of DGEMM on this machine.

<!-- metric_keyword: GFLOPS -->
```
From this 3-line goal, ARI runs survey → hypothesis generation → implementation → execution → figure generation → paper writing → reproducibility verification end-to-end.
Architecture — "experiment.md → paper + verification report"
```text
experiment.md ──► ARI Core ──► results + paper + reproducibility report
                      │
          ┌───────────┼──────────────────────┐
          │           │                      │
    BFTS Engine   ReAct Loop       Post-BFTS Pipeline
   (Best-First   (per-node agent)  (workflow.yaml driven)
   Tree Search)       │
               MCP Skill Servers
                (plugin system)
```
ARI's core has three layers:
- BFTS (Best-First Tree Search) engine — explores the hypothesis space evidence-driven, not exhaustively
- ReAct loop — LLM agent running per node: reasoning → tool call → observation
- MCP skill servers — purely functional tools implemented via Model Context Protocol (HPC job submission, paper generation, figure generation, etc.)
After BFTS completes, the Post-BFTS Pipeline defined in workflow.yaml runs data extraction → figure generation → paper writing → peer review → reproducibility verification automatically.
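The post-BFTS stages could be declared roughly like this (the stage names mirror the skills described later in this article, but the exact `workflow.yaml` schema is defined by the repository, so treat this as an illustrative sketch only):

```yaml
# Illustrative sketch -- consult the ARI repository for the real schema.
post_bfts:
  stages:
    - skill: ari-skill-transform   # BFTS tree -> science-ready data
    - skill: ari-skill-plot        # matplotlib figure generation
    - skill: ari-skill-paper       # LaTeX + BibTeX + peer review
    - skill: ari-skill-paper-re    # reproducibility verification
```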
End-to-End Data Flow (10 Steps)
1. Survey — fetch related work from arXiv / Semantic Scholar
2. Hypothesis generation — VirSci-style multi-agent deliberation determines hypotheses, key metrics, and evaluation criteria
3. Tree search — BFTS expands candidate nodes in priority order
4. Experiment execution — ReAct agent generates, compiles, and runs code per node (auto-polls until SLURM job completes)
5. Peer review evaluation — LLMEvaluator assigns `scientific_score` (0.0–1.0)
6. Tree-wide analysis — Transform skill BFS-traverses the tree to extract hardware/method/ablation insights
7. Figure generation — Plot skill's LLM writes matplotlib code and outputs PDF figures
8. LaTeX paper writing — Paper skill generates a full paper with BibTeX citations
9. Paper peer review — LLM acts as referee and scores the paper
10. Reproducibility verification — A separate ReAct agent reads only the paper text, re-runs the experiment, and cross-checks claimed values against actual measurements
Step 10 is worth highlighting: the reproducibility agent reads only the paper — no access to the original experiment setup. This checks whether the methods described in the paper are actually sufficient to reproduce the results. This is a check that human peer review cannot realistically perform.
The Core Design — Zero Domain Knowledge Principle
Reading the ARI source code, you'll notice something: ari-core contains zero domain-specific keywords for HPC, ML, chemistry, or anything else. This is not accidental — it's a design invariant enforced in code review.
| ❌ Forbidden | ✅ Correct |
|---|---|
| `if "GFLOP" in metric_name` | Use the LLM's `scientific_score` |
| `grep -i "gcc\|openmp"` | LLM probes the toolchain at runtime |
| "Compare against MKL" in prompt | LLM decides comparisons |
| Hardcode figure type | LLM chooses from data |
| `+0.2` score weight | LLM scores holistically |
| `lscpu` in system prompt | LLM calls it if needed |
The core specifies only three things:
- Format: tool calls in JSON, experiment descriptions in Markdown
- Protocol: skill communication via MCP
- Signal: BFTS ranking via LLM-assigned `scientific_score` (0.0–1.0)
Everything else — what to measure, what to compare, which hardware info matters, which figures to draw, which citations to include — is determined autonomously by the LLM at runtime.
Why scientific_score?
Earlier versions of ARI (pre-v0.2) ranked nodes using domain-specific keywords like `gflop` and `bandwidth`. This worked for HPC but silently failed in other domains.
`scientific_score` is a 0.0–1.0 quality signal assigned holistically by an LLM acting as a peer reviewer:
- Did the experiment actually generate measured values?
- Did it compare against existing methods?
- Is the methodology reproducible?
- Do the results support a clear scientific claim?
The LLM decides the weights; ARI only reads the number. This lets the same BFTS engine work equally well for HPC benchmarks, ML hyperparameter tuning, and chemistry optimization.
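What "ARI only reads the number" means in practice can be sketched in a few lines (the class and field names here are illustrative, loosely modeled on the pseudocode in the next section):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    metrics: dict = field(default_factory=dict)

def best_node(nodes):
    """Rank purely by the LLM-assigned score; the core never inspects
    domain metrics like GFLOPS or bandwidth."""
    return max(nodes, key=lambda n: n.metrics.get("_scientific_score", 0.0))

nodes = [
    Node("draft", {"_scientific_score": 0.41}),
    Node("improve", {"_scientific_score": 0.78, "GFLOPS": 26.22}),
    Node("failed-run", {}),  # unscored node -> treated as 0.0
]
print(best_node(nodes).name)  # improve
```

The `GFLOPS` entry is carried along as opaque data; only the score drives the search.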
BFTS — Treating Failures as Information, Not Noise
ARI's BFTS runs with a two-pool structure: pending (nodes waiting to run) and frontier (completed but unexpanded nodes).
```python
def bfts(experiment, config):
    root = Node(experiment, depth=0)
    pending = [root]   # nodes waiting to run
    frontier = []      # completed but unexpanded nodes
    all_nodes = [root]
    while len(all_nodes) < config.max_total_nodes:
        # Step 1: LLM selects the best frontier nodes to expand
        while frontier and len(pending) < config.max_parallel:
            best = llm_select_best_to_expand(frontier)
            frontier.remove(best)
            children = llm_propose_directions(best)  # improve / ablation / validation
            pending.extend(children)
            all_nodes.extend(children)
        # Step 2: parallel batch execution
        batch = llm_select_next_nodes(pending, config.max_parallel)
        for node in batch:
            pending.remove(node)  # selected nodes leave the pending pool
        results = parallel_run(batch)
        for node in results:
            memory.write(node.eval_summary)  # ancestor-scoped memory skill
            frontier.append(node)  # success or failure -- both go to frontier
    return max(all_nodes, key=lambda n: n.metrics.get("_scientific_score", 0))
```
Four key design points:
- Lazy expansion: Completed nodes aren't expanded until selected by the LLM. Low-scoring nodes stay in "holding" indefinitely
- Failures are not retried: A failed node spawns a `debug`-labeled child that inherits the failure context. This is not a retry — retries treat failure as noise; ARI's `expand()` treats failure as signal
- Strict budget management: `len(all_nodes) < max_total_nodes` is the only termination condition
- Node labels: `draft`, `improve`, `debug`, `ablation`, `validation` — each communicates intent and context to the LLM
Ancestor-Chain-Scoped Memory
Each node has memory it can only read from its own ancestor chain:
```text
root ──▶ memory["root"]
 ├─ node_A ──▶ memory["node_A"]
 │    ├─ node_A1 (reads: root + node_A)
 │    └─ node_A2 (reads: root + node_A, NOT node_A1)
 └─ node_B (reads: root only, NOT node_A branch)
```
Sibling nodes don't share memory, so parallel branches can't contaminate each other. Memory queries use the node's own eval_summary (not domain keywords), keeping search results semantically relevant to that node's work.
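The ancestor-chain read is simple to picture: walk parent pointers to the root and concatenate. A minimal sketch (class and field names are my own, not ARI's):

```python
from dataclasses import dataclass, field

@dataclass
class MemNode:
    node_id: str
    parent: "MemNode | None" = None
    entries: list = field(default_factory=list)

    def readable_memory(self) -> list:
        """Collect memory from this node up through its ancestor chain,
        returned root-first. Siblings are invisible, so parallel
        branches stay isolated."""
        chain, cur = [], self
        while cur is not None:
            chain.append(cur)
            cur = cur.parent
        return [e for node in reversed(chain) for e in node.entries]

root = MemNode("root", entries=["baseline: 17.2 GFLOP/s"])
a = MemNode("node_A", parent=root, entries=["K-blocking helps"])
a1 = MemNode("node_A1", parent=a, entries=["unroll=8 best"])
b = MemNode("node_B", parent=root)
print(a1.readable_memory())  # root + node_A + node_A1
print(b.readable_memory())   # root only
```

Node `b` never sees the `node_A` branch, no matter how the branches are scheduled.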
MCP Skills — "LLM Reasons, Skills Execute"
All side effects in ARI go through MCP skill servers. Each skill is an independent process exposing functions via the `@mcp.tool()` decorator.
This gives three properties:
- Isolation: Each skill runs in its own process. A bug in paper generation can't break HPC job submission
- Replaceability: Any skill can be swapped without touching others
- Discoverability: LLM agents discover available tools at runtime. Adding a skill = adding a capability, no agent reprogramming needed
14 skills are available; 9 are registered by default in workflow.yaml:
| Skill | Role | Uses LLM? | Default |
|---|---|---|---|
| `ari-skill-hpc` | SLURM submission / polling / Singularity / bash | ✗ | ✓ |
| `ari-skill-evaluator` | Metric spec extraction from experiment file | △ | ✓ |
| `ari-skill-idea` | arXiv survey + VirSci hypothesis generation | ✓ | ✓ |
| `ari-skill-web` | DuckDuckGo / arXiv / Semantic Scholar / citation crawl | △ | ✓ |
| `ari-skill-memory` | Ancestor-scoped node memory (JSONL) | ✗ | ✓ |
| `ari-skill-transform` | BFTS tree → science-ready data format | ✓ | ✓ |
| `ari-skill-plot` | Matplotlib / seaborn figure generation | ✓ | ✓ |
| `ari-skill-paper` | LaTeX writing + BibTeX + peer review | ✓ | ✓ |
| `ari-skill-paper-re` | ReAct reproducibility verification | ✓ | ✓ |
Skills that can be written deterministically (`ari-skill-hpc`, `ari-skill-memory`) use no LLM. LLM calls cost both money and latency; "use a pure function if you can" is ARI's policy.
Extension Path to the Physical World
The MCP plugin architecture is intentionally designed to grow beyond computation:
```text
Today (compute):
  ari-skill-hpc       → SLURM job submission
  ari-skill-evaluator → metric extraction from stdout
  ari-skill-paper     → LaTeX paper writing

Tomorrow (physical world):
  ari-skill-robot     → robot arm control via ROS2 MCP bridge
  ari-skill-sensor    → temperature / pressure sensor reads
  ari-skill-labware   → pipette control, plate reader integration
  ari-skill-camera    → experiment observation via computer vision
```
Adding these requires no changes to ari-core. Write a `server.py` with `@mcp.tool()` functions and register it in `workflow.yaml`. The same infrastructure that optimizes compiler flags today can optimize reaction temperatures tomorrow.
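The plugin contract is easy to picture with a toy registry. This is plain Python mimicking the shape of an `@mcp.tool()` server, not the real MCP SDK, and both tool functions below are hypothetical:

```python
# Toy stand-in for an MCP skill server: a decorator registers functions,
# and an agent discovers them by name at runtime. The real implementation
# uses the MCP SDK; this only illustrates the contract.
TOOLS: dict = {}

def tool(fn):
    TOOLS[fn.__name__] = fn
    return fn

@tool
def read_sensor(channel: str) -> float:
    """Hypothetical physical-world skill: return a temperature reading."""
    fake_readings = {"reactor_temp": 21.5}
    return fake_readings[channel]

@tool
def submit_job(script: str) -> str:
    """Hypothetical compute skill: pretend to queue a SLURM script."""
    return f"submitted:{script}"

# The agent discovers and calls tools without the core knowing the domain:
print(sorted(TOOLS))                         # ['read_sensor', 'submit_job']
print(TOOLS["read_sensor"]("reactor_temp"))  # 21.5
```

The core never imports either function; it only sees names and signatures at runtime, which is why swapping a compute skill for a lab skill needs no core changes.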
Web Dashboard — The Main Interface
ARI v0.4.x ships a 9-page React/TypeScript SPA dashboard as the main interface:
```shell
ari viz ./checkpoints/ --port 8765   # http://localhost:8765
```
| Page | Function |
|---|---|
| Home | Quick actions, recent experiments, system status |
| New Experiment | 4-step wizard (chat / write / upload goal → scope → resources → launch) |
| Experiments | List / delete / resume all checkpoints |
| Monitor | Real-time phase stepper (Idle → Idea → BFTS → Paper → Review), SSE live log, cost tracking |
| Tree | Interactive BFTS node tree — open any node to see metrics, tool call trace, generated code, stdout |
| Results | View/download paper (PDF/TeX), review report, reproducibility results, generated figures |
| Ideas | VirSci-generated hypotheses with novelty / feasibility scores and gap analysis |
| Workflow | Edit post-BFTS pipeline stages |
| Settings | LLM provider, API keys, SLURM partition auto-detect, SSH remote test |
Real-time updates via WebSocket (tree changes) + SSE (log streaming).
Results — A Paper ARI Wrote Entirely on Its Own
Here's what ARI autonomously produced: "Stoch-Loopline: Burstiness- and Tail-Latency-Aware Loopline Modeling for Robust Multi-Core CPU CSR SpMM Scaling"
Artificial Research Intelligence — April 6, 2026
Research theme: Performance modeling of CSR SpMM (sparse matrix × dense matrix multiply) on multi-core CPU
Hardware: Fujitsu fx700 node, OpenMP 32 threads
Problem: Existing roofline models predict only average throughput — they can't capture the non-monotonic performance variation across sparsity patterns and dense widths N. The goal: model bursty, irregular memory access and the associated tail latency.
What ARI Autonomously Produced
| Type | Content |
|---|---|
| New analytical model | Stoch-Loopline — extends loopline/roofline with burstiness, tail latency, and "scaling collapse risk" |
| 2 kernel implementations | Variant-1 (row-parallel gather + explicit unroll) / Variant-3 (rows-in-flight window) |
| Ablation study | K-blocking / N-tiling+packing / scalar / no-AVX / prefetch disabled |
| Synthetic CSR generator | Uniform and Zipf modes (lognormal-based), with CV / skewness / Gini statistics |
| Experiment sweep | Up to M = K = 200,000 (~3.2M nonzeros), dense width N ∈ {4, 8, 16, 32, 64, 128} |
| 3 figures | Throughput/bandwidth curves, operating point scatter plot, prefetch ablation |
| References | Alappat et al. (2020, 2021), Trotter et al. (2020), Lei et al. (2025) — auto-collected via Semantic Scholar |
Key Numerical Results
| Configuration | GFLOP/s | Effective BW |
|---|---|---|
| K-blocked CSR SpMM (peak) | 23.82 | 58.30 GB/s |
| Validation sweep (N=16, 32 threads) | 26.22 | 65.55 GB/s |
| Max measured BW (root sweep) | 17.17 | 105.18 GB/s |
| Software prefetch improvement (width avg) | +3.53 | +8.18 GB/s |
Most interesting: ARI autonomously discovered a "scaling collapse" phenomenon. Increasing dense width N from 64 → 256 causes throughput to drop from 26.22 → ~18.3 GFLOP/s and bandwidth from 65.5 → 41–42 GB/s. This is counterintuitive — you'd expect higher N to improve compute density. The paper explains this via Stoch-Loopline's "tail latency amplification" and "collapse risk" concepts.
The pseudocode (Algorithm 1 / Algorithm 2) matches the actual compiled spmm_stoch_loopline.cpp source structure and unroll counts (unroll ∈ {4, 8}) exactly — because the Transform skill actually reads the source code from the BFTS tree before writing the paper.
Not Just "Write a Paper and Stop" — Reproducibility Loop
After writing the paper, ari-skill-paper-re automatically:
- Text-extracts the paper PDF
- Reads the configuration
- Re-runs the job
- Cross-checks claimed values against actual measurements
If the paper claims "26.22 GFLOP/s", a separate agent independently verifies that number is reproducible using only the paper text as its information source. This makes reproducibility a first-class design principle, not an afterthought.
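At its core, the cross-check amounts to a tolerance comparison between claimed and re-measured values. A sketch of that idea (the function and the 5% relative tolerance are my assumptions for illustration, not ARI's actual acceptance criterion, which is judged by the agent):

```python
def crosscheck(claimed: float, measured: float, rel_tol: float = 0.05) -> bool:
    """Does an independently re-measured value support the paper's claim?

    Illustrative check only -- the 5% relative tolerance is an assumption.
    """
    return abs(measured - claimed) <= rel_tol * abs(claimed)

print(crosscheck(claimed=26.22, measured=25.90))  # True: within tolerance
print(crosscheck(claimed=26.22, measured=18.30))  # False: not reproduced
```

The hard part, of course, is everything before this line: reconstructing a runnable experiment from prose alone.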
Quick Start
```shell
# 1. Install
git clone https://github.com/kotama7/ARI && cd ARI
bash setup.sh

# 2. Set up an AI model (choose one)
ollama pull qwen3:8b                            # free, local
export ARI_BACKEND=openai OPENAI_API_KEY=sk-…   # or cloud API

# 3a. Launch the dashboard
ari viz ./checkpoints/ --port 8765
# Open http://localhost:8765 → use the wizard to create and launch experiments

# 3b. Or run directly from the CLI
ari run experiment.md                 # local experiment
ari run experiment.md --profile hpc   # SLURM cluster (auto-detect + profile)
```
Three environment profiles are provided: `laptop.yaml` / `hpc.yaml` / `cloud.yaml`. `ari/env_detect.py` auto-detects the scheduler (SLURM / PBS / LSF / SGE / Kubernetes) and container runtime (Docker / Singularity / Apptainer).
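Scheduler auto-detection of this kind can be sketched as a `PATH` probe. The marker binaries below are a plausible heuristic of my own, not necessarily the exact probes `ari/env_detect.py` uses:

```python
import shutil

# Marker binary per scheduler -- an illustrative detection heuristic.
SCHEDULER_MARKERS = {
    "slurm": "sbatch",
    "lsf": "bsub",
    "sge": "qconf",        # probe before the shared `qsub` of PBS/SGE
    "pbs": "qsub",
    "kubernetes": "kubectl",
}

def detect_scheduler() -> str:
    """Return the first scheduler whose marker binary is on PATH."""
    for name, binary in SCHEDULER_MARKERS.items():
        if shutil.which(binary):
            return name
    return "local"  # fall back to running experiments as plain processes

print(detect_scheduler())
```

On a laptop this typically prints `local`, which is exactly the case where ARI runs experiments as child processes instead of batch jobs.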
After a run, output is organized in ./checkpoints/<run_id>/:
| File | Content |
|---|---|
| `tree.json` | Full BFTS node tree (all nodes, metrics, parent-child links) |
| `science_data.json` | Science-ready formatted data (no internal BFTS terminology) |
| `full_paper.tex` / `.pdf` | Generated LaTeX paper and PDF |
| `review_report.json` | LLM peer review scores and feedback |
| `reproducibility_report.json` | Independent reproducibility verification results |
| `figures_manifest.json` | Figure paths and captions |
| `cost_trace.jsonl` | Per-call LLM cost tracking |
| `experiments/<slug>/<node_id>/` | Per-node working directory and generated code |
Design Principles Summary
Five principles ARI never violates:
| # | Principle | Meaning |
|---|---|---|
| P1 | Domain-agnostic core | Zero experiment-specific knowledge in `ari-core` |
| P2 | Deterministic by default | MCP tools are deterministic by default; LLM-using tools are explicitly annotated |
| P3 | Multi-purpose metrics | No hardcoded scalar scores |
| P4 | Dependency injection | Switching experiments = editing `.md` only |
| P5 | Reproducibility first | Hardware described by specs, not cluster names |
And the anti-goals:
- Not a replacement for experts (an amplifier)
- Not operated without human supervision at physical risk boundaries
- Not a black box (every decision is logged and traceable)
- Not hardcoding "what good science looks like"
Links
- 🌐 Project homepage (demo + sample paper viewer): https://kotama7.github.io/ARI/
- 📄 Sample paper PDF: https://kotama7.github.io/ARI/sample_paper.pdf
- 💻 GitHub (MIT): https://github.com/kotama7/ARI
- 🎬 Dashboard demo (EN): https://github.com/kotama7/ARI/raw/main/docs/movie/en/ari_dashboard_demo.mp4