DEV Community: ArshTechPro

MAI-Thinking-1: Microsoft's New Reasoning Model and What It Means for Developers

ArshTechPro — Fri, 05 Jun 2026 16:24:31 +0000

Microsoft just shipped MAI-Thinking-1, their first in-house reasoning model. If you've been watching the AI space, you know reasoning models — the kind that "think before they answer" — have become a battleground. OpenAI has o3, Anthropic has Claude with extended thinking, Google has Gemini's thinking mode. Now Microsoft is in with their own, and they built it from the ground up rather than licensing or distilling from someone else's model.

Here is what you actually need to know as a developer.

What Is MAI-Thinking-1?

MAI-Thinking-1 is Microsoft's reasoning-focused language model, developed by their internal AI lab (Microsoft AI, or MAI). It is a medium-sized model designed specifically for complex, multi-step tasks — the kind of problems where a model needs to reason through multiple steps before producing an answer, rather than just pattern-matching to a response.

The headline positioning is this: it is a smaller model that punches well above its weight class on software engineering and math benchmarks.

The Architecture: Sparse Mixture of Experts

The model is a sparse Mixture of Experts (MoE) architecture:

35 billion active parameters at inference time
~1 trillion total parameters across all expert layers

This distinction matters for developers. In a dense model, every parameter fires for every token. In a MoE model, only a subset of "experts" activate per token, so the active compute footprint is much smaller than the total parameter count suggests. The practical result: you get near-frontier quality reasoning at a significantly lower inference cost than a comparable dense model.
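To make the active-versus-total distinction concrete, here is a toy sketch of top-k expert routing in plain Python. The expert count, the value of k, and the router scores are invented for illustration and say nothing about MAI-Thinking-1's real configuration. Only the 35B and ~1T figures come from the announcement.

```python
import math

def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights."""
    # Softmax over all experts (subtract the max for numerical stability).
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only the k most probable experts; the others do no work for this token.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in chosen)
    return {i: probs[i] / mass for i in chosen}

# Hypothetical router scores for one token across 8 experts.
logits = [0.1, 2.3, -1.0, 0.7, 1.9, -0.4, 0.0, 0.5]
weights = top_k_route(logits, k=2)

# Figures reported in the article: 35B active out of roughly 1T total.
active_fraction = 35e9 / 1e12

print(sorted(weights))           # indices of the experts that ran for this token
print(f"{active_fraction:.1%}")  # share of parameters used per token
```

The point of the sketch is the ratio: per-token compute tracks the active slice, while all experts still have to be held in memory by whoever hosts the model.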

For scale, GPT-4 class models are rumored to have around 1.8T total parameters, though OpenAI has never confirmed a figure or an architecture. What drives per-token cost is the active parameter count, and that is why Microsoft is calling this "mid-weight pricing."

Benchmark Performance

Microsoft reports the following numbers:

Benchmark	MAI-Thinking-1	Notes
AIME 2025	97.0%	Advanced math competition
AIME 2026	94.5%	Most recent math competition
SWE-Bench Pro	Competitive with Claude Opus 4.6	Real-world software engineering tasks
Human side-by-side	Preferred over Claude Sonnet 4.6	Blind evaluation by Surge raters

The SWE-Bench Pro result is worth unpacking. SWE-Bench tests models on real GitHub issues: the model has to read a codebase, understand a bug report, and produce a patch that passes the existing test suite. It is arguably the most developer-relevant benchmark that exists right now. If it holds up, being competitive with Claude Opus 4.6 on this benchmark with only 35B active parameters is a meaningful result.

The human preference eval covered 1,276 tasks across single-turn and multi-turn conversations, judged by professional raters from Surge, and prioritized whether responses actually advanced the user's goals rather than just sounding good.

What Makes It Different From Other Models: Training Philosophy

Microsoft made a deliberate choice that is worth understanding because it affects how the model behaves.

No distillation from third-party models. Most smaller models are trained by learning to imitate a larger, more capable model (this is called distillation or knowledge distillation). MAI-Thinking-1 was trained without doing this. Microsoft argues that distilled models are fundamentally bound to the design choices of their teacher model and struggle to generalize to new situations. Training from scratch on their own data means the model has to genuinely learn reasoning rather than mimicking it.

Clean, licensed training data only. All pre-training data was commercially licensed, and AI-generated content was excluded from pre-training. For enterprises, this matters a lot: it affects copyright exposure and gives Microsoft better ability to explain (and improve) model behavior.

In-house training infrastructure end-to-end. From hardware co-design on Microsoft's own accelerators to the reinforcement learning framework, the entire training stack is built internally. This is what they call the "Hill-Climbing Machine" — a system where every component can be improved independently, so capabilities improve continuously rather than requiring architectural overhauls.

Developer-Relevant Features

Before you think about API calls, here is the feature set:

Context window: 256,000 tokens. That is roughly 600 pages of text. You can fit entire codebases, large contracts, or lengthy research documents in a single context. For agentic coding workflows this is essential.
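The "roughly 600 pages" figure can be reproduced with back-of-the-envelope arithmetic. The two ratios below (0.75 words per token, 320 words per page) are common rules of thumb that I am assuming here, not numbers from Microsoft, and real token counts depend on the tokenizer and on the content (code usually tokenizes less efficiently than prose).

```python
CONTEXT_TOKENS = 256_000
WORDS_PER_TOKEN = 0.75  # assumption: rough average for English prose
WORDS_PER_PAGE = 320    # assumption: a fairly dense printed page

def estimate_pages(tokens, words_per_token=WORDS_PER_TOKEN, words_per_page=WORDS_PER_PAGE):
    """Convert a token budget into an approximate page count."""
    return tokens * words_per_token / words_per_page

def fits_in_context(text, reserve_for_output=4_096):
    """Crude pre-flight check: estimate tokens from the whitespace word count."""
    estimated_tokens = len(text.split()) / WORDS_PER_TOKEN
    return estimated_tokens + reserve_for_output <= CONTEXT_TOKENS

print(round(estimate_pages(CONTEXT_TOKENS)))
```

A check like `fits_in_context` is only a first filter. Before sending a large codebase, count tokens with the provider's own tokenizer once one is available.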

Function calling / tool use. Supported. If you are building agents that need to call APIs, query databases, or interact with external services, the model can handle structured tool calls in the standard format.

System prompt / developer instructions. The model was trained to follow multi-layer instructions — meaning system prompts, user instructions, and constraints stack and interact predictably rather than the model silently ignoring one in favor of another.

Chat Completions API compatibility. This is significant. The API uses the same interface as the widely adopted OpenAI Chat Completions format. If you already have code that calls Azure OpenAI or any OpenAI-compatible endpoint, migration should require minimal changes — primarily just swapping the model name and endpoint URL.

Enterprise security via Microsoft Foundry. All MAI models come with Microsoft Foundry's compliance stack: data residency controls, audit logging, private networking options. If you are building in a regulated industry, this is the access path that gets you the compliance paperwork you need.

What Setup Will Look Like (When It's Available)

Since the model is Chat Completions API-compatible, here is roughly what calling it should look like once you have Foundry access. The endpoint and key are placeholders, and the pattern is essentially identical to calling Azure OpenAI:

import openai

client = openai.AzureOpenAI(
    azure_endpoint="https://<your-foundry-endpoint>.azure.com",
    api_version="2024-12-01-preview",
    api_key="<your-foundry-api-key>"
)

response = client.chat.completions.create(
    model="mai-thinking-1",
    messages=[
        {
            "role": "system",
            "content": "You are a senior software engineer. Think step by step."
        },
        {
            "role": "user",
            "content": "Review this function and identify any edge cases: ..."
        }
    ],
    max_tokens=4096
)

print(response.choices[0].message.content)

If you are already on the Azure OpenAI SDK or any OpenAI-compatible client, this is the shape of the migration. The main difference is the endpoint URL and model name — the rest of your code stays the same.

For agentic workflows with tool calling:

tools = [
    {
        "type": "function",
        "function": {
            "name": "run_tests",
            "description": "Run the test suite and return results",
            "parameters": {
                "type": "object",
                "properties": {
                    "test_path": {
                        "type": "string",
                        "description": "Path to the test file or directory"
                    }
                },
                "required": ["test_path"]
            }
        }
    }
]

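# The conversation so far; `client` is the AzureOpenAI client created earlier.
messages = [
    {"role": "user", "content": "Run the tests under tests/ and summarize any failures."}
]

response = client.chat.completions.create(
    model="mai-thinking-1",
    messages=messages,
    tools=tools,
    tool_choice="auto"
)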
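The request above only advertises the tool. Your code still has to execute it and hand the result back. In the Chat Completions format, each entry in `response.choices[0].message.tool_calls` carries an `id`, a `function.name`, and `function.arguments` as a JSON string. The sketch below dispatches one such call using plain values so it runs without a network connection. The `run_tests` stub and the registry are hypothetical stand-ins, not part of any SDK.

```python
import json

def run_tests(test_path):
    # Hypothetical stub: a real version might shell out to a test runner.
    return {"test_path": test_path, "passed": 12, "failed": 0}

TOOL_REGISTRY = {"run_tests": run_tests}

def handle_tool_call(call_id, name, arguments_json, registry=TOOL_REGISTRY):
    """Execute one tool call and build the 'tool' message to send back."""
    if name not in registry:
        result = {"error": f"unknown tool: {name}"}
    else:
        try:
            result = registry[name](**json.loads(arguments_json))
        except (json.JSONDecodeError, TypeError) as exc:
            # Malformed JSON or wrong argument names: report it to the model.
            result = {"error": str(exc)}
    return {"role": "tool", "tool_call_id": call_id, "content": json.dumps(result)}

# Values shaped like one entry of message.tool_calls.
reply = handle_tool_call("call_1", "run_tests", '{"test_path": "tests/"}')
print(reply["content"])
```

In a real loop you would append the assistant message that contained the tool calls, then this reply, to `messages`, and call `create` again so the model can read the result.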

Where MAI-Thinking-1 Fits in Your Stack

If you are trying to decide whether this model is worth tracking, here is a practical breakdown by use case:

Agentic coding pipelines. This is the primary target use case. The model was trained on deterministic, executable environments with real test suites. It is built for the multi-step loop of reading code, making edits, running tests, and recovering from failures. If you are building AI-powered code review, bug fixing, or code generation pipelines, this is worth evaluating.

Complex reasoning tasks. The AIME scores put it near the top of the field for mathematical reasoning. If your application involves multi-step problem solving (financial modeling, technical analysis, research summarization with synthesis), a reasoning model like this is likely to outperform a standard instruction-tuned model of similar size.

Enterprise document processing. The 256k context window plus the licensing provenance story makes this a credible option for enterprises processing contracts, technical documentation, or large codebases where IP exposure and compliance are real concerns.

High-volume daily workflows. The MoE architecture and mid-weight pricing position this below frontier-cost models. If you have a use case that could benefit from strong reasoning but cannot justify the cost of running a full dense frontier model on every request, this is the price-performance sweet spot Microsoft is targeting.

The Safety Approach (And Why It Matters for Developers)

Microsoft made an interesting engineering decision on safety that is worth understanding.

Rather than treating safety as a post-hoc filter or a separate fine-tuning stage, they trained safety with the same reinforcement learning loop as capability. Unsafe compliance and unnecessary over-refusals are both treated as defects in the same reward model, weighted by potential harm severity.

The practical effect: you should see fewer situations where the model refuses legitimate developer requests (writing code that involves networking, security concepts, system administration) while still declining actually harmful requests. Microsoft explicitly calls unnecessary refusals a failure mode, not a safe default.

For developers, this means less time spent writing system prompts that work around overly cautious models.

What to Watch For

A few things to keep an eye on as this moves to public preview:

Pricing. Not yet announced publicly. The "mid-weight" positioning suggests something meaningfully below frontier model pricing, but the actual numbers will determine whether the SWE-Bench Pro performance justifies switching from existing workflows.

Regional availability. Microsoft Foundry supports multi-region deployment, but which specific Azure regions will have MAI-Thinking-1 available at launch will affect latency and data residency requirements for some use cases.

Rate limits and quota. Private previews typically have constrained throughput. Production planning should wait for public preview numbers.

Quick Reference


Property	Value
Model type	Sparse Mixture of Experts (reasoning)
Active parameters	35B
Total parameters	~1T
Context window	256,000 tokens
API format	Chat Completions (OpenAI-compatible)
Function calling	Yes
Current status	Private preview on Microsoft Foundry
Public access	Coming soon (MAI Playground)
Early access	Apply via Microsoft Foundry signup form

Headroom: Cut Your LLM Token Usage by Up to 95% Without Changing Your Answers

ArshTechPro — Thu, 04 Jun 2026 09:15:08 +0000

If you're building AI agents or running LLM pipelines in production, you already know the pain: tool outputs, logs, RAG chunks, and conversation history pile up fast. Before you know it, you're burning through tokens at a rate that makes your billing dashboard uncomfortable to look at.

Headroom is an open-source project that tackles this problem directly. It compresses everything your AI agent reads — before it ever reaches the LLM — and claims 60–95% token reduction on real workloads, with accuracy preserved.

The Core Idea

Headroom sits as a layer between your application and the LLM provider. It takes whatever your agent was about to send — a stack of tool call results, a long log file, a RAG retrieval dump — and compresses it using one of several strategies depending on the content type:

SmartCrusher handles JSON (arrays, nested objects, mixed types)
CodeCompressor uses AST-aware compression for Python, JS, Go, Rust, Java, and C++
Kompress-base is a HuggingFace model trained on agentic traces, for prose and text
CacheAligner stabilizes prompt prefixes so provider KV caches actually hit consistently
CCR (Content-Compressed Retrieval) stores originals locally and lets the LLM fetch them on demand — so compression is fully reversible

A ContentRouter figures out what kind of content it's looking at and picks the right compressor automatically. You don't have to think about it.
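The routing idea can be illustrated with a toy classifier. This is not Headroom's ContentRouter. The function name and the heuristics are hypothetical, and they only show how content type might be detected before a compressor is chosen.

```python
import ast
import json

def detect_content_type(text):
    """Guess whether text is JSON, Python source, or prose."""
    stripped = text.strip()
    # JSON documents start with an object or array delimiter.
    if stripped[:1] in ("{", "["):
        try:
            json.loads(stripped)
            return "json"
        except json.JSONDecodeError:
            pass
    try:
        tree = ast.parse(stripped)
        # Count it as code only if it contains a definition or an import.
        code_nodes = (ast.FunctionDef, ast.ClassDef, ast.Import, ast.ImportFrom)
        if any(isinstance(node, code_nodes) for node in tree.body):
            return "python"
    except SyntaxError:
        pass
    return "prose"

print(detect_content_type('[{"id": 1, "status": "ok"}]'))
print(detect_content_type("def f(x):\n    return x + 1\n"))
print(detect_content_type("The build failed after the deploy step."))
```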

The key thing: originals are never deleted. The compressed text the model sees is lossy, but if the LLM needs the full version of something, it can retrieve it, so nothing is permanently lost.
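A minimal sketch of what "reversible" means here: keep the original under a content hash, and send the model a short stub that includes the key. The class and method names are hypothetical, and simple truncation stands in for real compression. This is not how Headroom implements CCR.

```python
import hashlib

class OriginalStore:
    """Keeps full originals locally, keyed by a content hash."""

    def __init__(self):
        self._originals = {}

    def compress(self, text, keep_chars=80):
        """Return (stub, key); the stub is what would be sent to the model."""
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
        self._originals[key] = text
        if len(text) <= keep_chars:
            return text, key
        omitted = len(text) - keep_chars
        stub = f"{text[:keep_chars]} [+{omitted} chars omitted, retrieve with key {key}]"
        return stub, key

    def retrieve(self, key):
        """Return the untouched original for a key handed out by compress()."""
        return self._originals[key]

store = OriginalStore()
log = "ERROR upstream timeout\n" * 500
stub, key = store.compress(log)
print(len(log), len(stub))
```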

Real Numbers

These are the token counts from the project's benchmarks on real agent workloads. Savings vary widely by workload, and the last row falls below the headline 60–95% range:

Workload	Before	After	Savings
Code search (100 results)	17,765	1,408	92%
SRE incident debugging	65,694	5,118	92%
GitHub issue triage	54,174	14,761	73%
Codebase exploration	78,502	41,254	47%
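The savings column follows directly from the before and after counts. The snippet below recomputes it from the table and also pools the four workloads into one aggregate figure, which is my own derived number and not one the project reports.

```python
# (before, after) token counts copied from the table above.
workloads = {
    "Code search (100 results)": (17_765, 1_408),
    "SRE incident debugging": (65_694, 5_118),
    "GitHub issue triage": (54_174, 14_761),
    "Codebase exploration": (78_502, 41_254),
}

def savings_pct(before, after):
    """Percentage of tokens removed, rounded to the nearest whole percent."""
    return round(100 * (1 - after / before))

for name, (before, after) in workloads.items():
    print(f"{name}: {savings_pct(before, after)}%")

# Pooled view: total tokens across all four workloads.
total_before = sum(before for before, _ in workloads.values())
total_after = sum(after for _, after in workloads.values())
print(f"Pooled: {savings_pct(total_before, total_after)}%")
```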

On accuracy benchmarks (GSM8K math, TruthfulQA, SQuAD v2, BFCL tool-use), scores hold steady or slightly improve after compression. The intuition is that stripping noise helps the model focus on the signal.

You can reproduce these yourself with:

python -m headroom.evals suite --tier 1

Setup: Three Ways to Use It

Headroom gives you three integration modes. Pick whichever fits how you work.

Option 1: Wrap an existing agent (zero code changes)

pip install "headroom-ai[all]"
headroom wrap claude

That's it. Headroom intercepts traffic from Claude Code, Codex, Cursor, Aider, or Copilot CLI automatically. You don't touch your existing code at all.

Option 2: Drop-in proxy

Run Headroom as a local proxy on any port:

headroom proxy --port 8787

Then point your existing OpenAI/Anthropic SDK calls at localhost:8787 instead of the provider URL. Any language, any framework — no code changes needed beyond updating the base URL.
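As a rough illustration of the base-URL swap, here is a standard-library sketch that builds (but does not send) a Chat Completions request aimed at the proxy. The `/v1/chat/completions` path assumes the proxy mirrors the OpenAI route layout, which I have not verified against Headroom's documentation. The model name and API key are placeholders.

```python
import json
import urllib.request

PROXY_BASE_URL = "http://localhost:8787"  # port from `headroom proxy --port 8787`

def build_chat_request(messages, model, api_key, base_url=PROXY_BASE_URL):
    """Build an OpenAI-style chat request that targets the local proxy."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",  # assumed path, see note above
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )

req = build_chat_request(
    [{"role": "user", "content": "ping"}],
    model="your-model-name",   # placeholder
    api_key="sk-placeholder",  # placeholder
)
print(req.full_url)
# To actually send it, call urllib.request.urlopen(req) while the proxy is running.
```

With an SDK the same change is usually a single constructor argument for the base URL, and the rest of the calling code stays as it was.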

Option 3: Inline library

For finer control, use it directly in Python or TypeScript:

Python:

from headroom import compress

messages = [{"role": "user", "content": your_giant_tool_output}]
compressed = compress(messages, model="claude-opus-4-6")
# compressed has the same structure, far fewer tokens

TypeScript:

import { compress } from "headroom-ai";

const compressed = await compress(messages, { model: "claude-opus-4-6" });

With the Anthropic SDK directly:

from anthropic import Anthropic
from headroom import withHeadroom

client = withHeadroom(Anthropic())
# Use client exactly like normal — compression happens automatically

With LangChain:

from headroom.integrations.langchain import HeadroomChatModel

llm = HeadroomChatModel(your_existing_llm)

With Vercel AI SDK:

import { wrapLanguageModel } from "ai";
import { headroomMiddleware } from "headroom-ai";

const model = wrapLanguageModel({
  model: yourModel,
  middleware: headroomMiddleware(),
});

Requires Python 3.10+. For Node/TypeScript: npm install headroom-ai.

MCP Server Mode

If you're using an MCP client (Claude Desktop, etc.), you can install Headroom as an MCP server:

headroom mcp install

This exposes three MCP tools: headroom_compress, headroom_retrieve, and headroom_stats. Your AI agent can call them directly as part of its tool loop.

Cross-Agent Memory

One underrated feature: shared memory across agents. If you're running Claude and Codex side by side, Headroom can give them a common compressed context store with automatic deduplication.

from headroom.memory import SharedContext

ctx = SharedContext()
ctx.put("current_task", task_description)

# In a different agent's session
task = ctx.get("current_task")

This is useful in multi-agent pipelines where you'd otherwise be passing the same context repeatedly.

headroom learn

There's also a headroom learn command that mines failed agent sessions and writes corrections back to CLAUDE.md, AGENTS.md, or GEMINI.md. The idea is that your agent accumulates a record of what went wrong and avoids repeating the same mistakes.

headroom learn

It parses session logs, extracts failure patterns, and appends structured learnings to your project's agent config files.

Check Your Savings

After using Headroom for a while:

headroom stats

This shows you cumulative compression ratios, tokens saved, and per-content-type breakdowns.

Is It Worth Trying?

Yes, if you:

Run AI coding agents (Claude Code, Cursor, Codex, Aider) regularly and pay for tokens
Build pipelines where tool outputs and RAG chunks are large and repetitive
Want cross-agent shared memory without building it yourself
Need reversible compression — Headroom never discards originals

Skip it, or approach carefully, if you:

Only use a single provider's built-in context management and don't need more
Work in sandboxed or restricted environments where running a local process is an issue
Are on a very simple single-turn setup where context bloat isn't a real problem yet

Quick Reference

# Install
pip install "headroom-ai[all]"
npm install headroom-ai

# Wrap an agent
headroom wrap claude

# Run as proxy
headroom proxy --port 8787

# Install as MCP server
headroom mcp install

# Check savings
headroom stats

# Learn from failures
headroom learn

GitHub: chopratejas/headroom

Harness: Turn a One-Line Prompt Into a Full Agent Team for Claude Code

ArshTechPro — Tue, 02 Jun 2026 09:34:41 +0000

You have Claude Code. You want to build something ambitious — a deep research pipeline, a full-stack app scaffold, a code review system. You could wire up agents manually, writing each definition by hand. Or you could type "build a harness for this project" and let Harness do it.

Harness is a Claude Code plugin that takes a plain-English description of what you want to build and produces a ready-to-run agent team: the agent definitions, the skill files, the orchestration logic — all of it.

What Problem Does It Solve?

Multi-agent work in Claude Code requires a lot of upfront scaffolding. You need to:

define each agent's role and responsibilities in a .claude/agents/ markdown file
write skill files in .claude/skills/ that describe how tasks get done
decide how agents communicate and hand off work
handle error cases and validation

For a non-trivial project, this is several hours of work before you have written a line of actual code. Harness compresses that into a single conversational prompt.

The Six Architecture Patterns

Harness does not just dump agents into a folder. It picks one of six battle-tested team structures based on your domain:

Pipeline — agents run in sequence, each one feeding into the next. Good for anything with clear stages: plan, write, test, deploy.

Fan-out/Fan-in — a coordinator spawns parallel agents, collects their results, and merges them. Good for research or code review where independent threads can run simultaneously.

Expert Pool — agents are specialists invoked selectively based on what the current task needs. Good for domains with diverse sub-problems.

Producer-Reviewer — one agent generates, another critiques. Good for content creation, documentation, or anything where quality gates matter.

Supervisor — a central agent dynamically routes tasks to workers based on what needs to happen next. Good for open-ended workflows.

Hierarchical Delegation — top-down recursive delegation where complex tasks get broken down through multiple layers. Good for large-scale engineering or project management.

Harness reads your description and picks the pattern that fits best. You can also guide it explicitly.
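Two of these patterns are easy to show in ordinary Python, with plain functions standing in for agents. This is only an analogy for the control flow. Harness itself generates markdown agent and skill definitions, not Python, and every name below is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def run_pipeline(stages, task):
    """Pipeline: each stage's output feeds the next stage."""
    return reduce(lambda value, stage: stage(value), stages, task)

def run_fan_out_fan_in(workers, task, merge):
    """Fan-out/Fan-in: run workers in parallel, then merge their results."""
    with ThreadPoolExecutor(max_workers=len(workers)) as pool:
        # pool.map preserves the order of `workers` in its results.
        results = list(pool.map(lambda worker: worker(task), workers))
    return merge(results)

# Stand-in "agents" for a plan, write, check pipeline.
def plan(task):
    return f"plan({task})"

def write(task):
    return f"write({task})"

def check(task):
    return f"check({task})"

# Stand-in reviewers for a parallel code review.
def security_review(diff):
    return f"security ok: {diff}"

def style_review(diff):
    return f"style ok: {diff}"

pipeline_result = run_pipeline([plan, write, check], "feature")
review_result = run_fan_out_fan_in([security_review, style_review], "diff", merge="; ".join)
print(pipeline_result)
print(review_result)
```

The other four patterns differ mainly in who decides what runs next: a fixed reviewer in Producer-Reviewer, a routing agent in Supervisor, and so on.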

Setup

Prerequisites

You need Claude Code installed and agent teams enabled. Agent teams are still behind a feature flag:

export CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1

Add that to your shell profile (.zshrc, .bashrc, etc.) so it persists.

Install via Plugin Marketplace

Inside Claude Code, run:

/plugin marketplace add revfactory/harness

Then:

/plugin install harness@harness

That's it. The plugin is now globally available in your Claude Code sessions.

Install Manually (Global Skill)

If you prefer to manage things yourself, clone the repo and copy the skill directly:

git clone https://github.com/revfactory/harness.git
cp -r harness/skills/harness ~/.claude/skills/harness

This drops the skill files into Claude Code's global skill directory and makes them available in any project.

Using It

Once installed, trigger it with a natural language prompt inside Claude Code. There is no special syntax — just describe what you want.

Example: deep research agent team

Build a harness for deep research. I need an agent team that can investigate
any topic from multiple angles — web search, academic sources, community
sentiment — then cross-validate findings and produce a comprehensive report.

Example: code review pipeline

Build a harness for comprehensive code review. I want parallel agents
checking architecture, security vulnerabilities, performance bottlenecks,
and code style — then merging all findings into a single report.

Example: full-stack development

Build a harness for full-stack website development. The team should handle
design, frontend (React/Next.js), backend (API), and QA testing in a
coordinated pipeline from wireframe to deployment.

After you run one of these, Harness generates files in your project:

your-project/
├── .claude/
│   ├── agents/
│   │   ├── analyst.md
│   │   ├── builder.md
│   │   └── qa.md
│   └── skills/
│       ├── analyze/
│       │   └── SKILL.md
│       └── build/
│           ├── SKILL.md
│           └── references/

The agent files define each agent's persona, capabilities, and constraints. The skill files define the step-by-step procedures each agent follows. You can read and edit every file — nothing is a black box.

What the Six-Phase Workflow Looks Like

Harness does not just dump files. It runs a structured process:

Domain Analysis — it reads your prompt and identifies the key actors, inputs, and outputs
Team Architecture Design — it picks the right pattern from the six and sketches the team structure
Agent Definition Generation — it writes the .claude/agents/ markdown files
Skill Generation — it writes the .claude/skills/ files with Progressive Disclosure (loading only what context is needed, when it is needed)
Integration and Orchestration — it wires inter-agent data passing and error handling
Validation and Testing — it sets up trigger verification and dry-run tests

Is It Worth It?

The repo includes A/B test results from a companion repository (revfactory/claude-code-harness) covering 15 software engineering tasks:

Metric	Without Harness	With Harness
Average Quality Score	49.5 / 100	79.3 / 100
Win Rate	—	15 out of 15
Output Variance	—	-32%

The improvement scaled with task complexity: +23.8 points on basic tasks, +29.6 on advanced, +36.2 on expert-level tasks. The more difficult the problem, the more structure helps.
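A quick arithmetic check on the reported numbers. The source does not say how the 15 tasks split across tiers, so the unweighted mean below assumes equal-sized tiers. That is my assumption, not something the study states.

```python
# Averages from the table above.
baseline, with_harness = 49.5, 79.3

# Per-tier gains quoted in the text.
tier_gains = {"basic": 23.8, "advanced": 29.6, "expert": 36.2}

overall_gain = round(with_harness - baseline, 1)
relative_gain = round(100 * overall_gain / baseline)

# Unweighted mean of the tier gains (assumes equal-sized tiers).
mean_tier_gain = round(sum(tier_gains.values()) / len(tier_gains), 1)

print(overall_gain, relative_gain, mean_tier_gain)
```

The mean of the tier gains lands within a tenth of a point of the overall gain, which is what you would expect if the tiers are about the same size.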

One important caveat: this is an author-measured study with n=15, and third-party replications have not yet been published.

Where it clearly helps:

You are starting a new project and want agent scaffolding without spending hours on definitions
Your task has multiple distinct sub-problems that map cleanly onto a team pattern
You want to experiment with different team architectures quickly
You are building something complex enough that ad-hoc prompting produces inconsistent results

What Gets Generated vs What You Maintain

Harness generates a starting point. The files it creates are plain markdown — readable, editable, version-controllable. You own them after generation.

Ecosystem Fit

Harness is Claude Code-native. It does not work with Gemini CLI or Codex out of the box — a Codex port called meta-harness exists for that.

If you are using LangGraph for state-recoverable, long-running orchestration, Harness is not a replacement. LangGraph handles persistent state and recovery across sessions; Harness handles team architecture design within Claude Code. They occupy different layers.

Quick Reference

# Enable agent teams
export CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1

# Install via plugin marketplace
/plugin marketplace add revfactory/harness
/plugin install harness@harness

# Or manually
git clone https://github.com/revfactory/harness.git
cp -r harness/skills/harness ~/.claude/skills/harness

# Use it
"Build a harness for [your domain]"

Verdict

Harness solves a real problem. Multi-agent scaffolding is tedious to write from scratch, easy to get wrong, and hard to keep consistent. Harness handles the structural work so you can focus on the domain logic.

Repository: github.com/revfactory/harness
License: Apache 2.0

Compound Engineering: A Plugin That Makes Your AI Coding Agent Smarter Over Time

ArshTechPro — Sat, 30 May 2026 23:34:00 +0000

Most developers using AI coding tools hit the same ceiling eventually. The agent writes code, you accept or reject it, and next time it starts from scratch again. There's no memory of what worked, no accumulated judgment about your codebase, no improvement from one session to the next. You're getting faster, but the tool isn't getting better at helping you specifically.

Compound Engineering is a plugin that tries to fix that. Built by Every.to and available for Claude Code, Cursor, Codex, GitHub Copilot, and a growing list of other tools, it introduces a structured workflow designed around a simple principle: each unit of engineering work should make the next one easier.

The Core Idea

Traditional development accumulates technical debt. Features add complexity, bug fixes leave behind knowledge no one wrote down, and the codebase slowly becomes harder to change.

The Compound Engineering philosophy inverts the ratio: 80% of the effort goes into planning and review, 20% into execution. The thinking is that a sharp plan produces a smaller, cleaner implementation. A good code review catches a pattern, not just a specific bug. A documented learning means the agent doesn't have to rediscover the same constraint next week.

The plugin ships 37 skills and 51 agents that implement this workflow as slash commands you run inside your AI coding tool.

The Workflow Loop

The core loop looks like this:

/ce-brainstorm "add retry logic to background jobs"
/ce-plan docs/brainstorms/background-job-retry-requirements.md
/ce-work
/ce-code-review
/ce-compound

Here's what each step actually does:

/ce-brainstorm runs an interactive Q&A session. It asks clarifying questions about your feature or problem, then produces a right-sized requirements document. The output is a file you can hand directly to the next step.

/ce-plan takes that requirements document and turns it into a detailed implementation plan: what to change, what to test, what the edge cases are.

/ce-work executes the plan. It uses worktrees for isolation and tracks tasks as it goes.

/ce-code-review is a multi-agent review pass before you merge. It looks for issues but, more importantly, tries to catch patterns — recurring problems that are worth documenting rather than just fixing.

/ce-compound is where the compounding happens. It documents the learnings from this cycle so the agent has better context the next time you work on something similar.

There are also four commands that sit outside the core loop:

/ce-strategy creates and maintains a STRATEGY.md file — the product's target problem, approach, personas, and key metrics. When this file exists, brainstorm and plan commands read it as grounding, so your strategy choices flow naturally into feature decisions.

/ce-ideate sits upstream of brainstorm for bigger questions. Instead of jumping into requirements, it generates and critically evaluates several ideas, then routes the strongest one into the brainstorm step.

/ce-debug is for bug investigations. It systematically reproduces the failure, traces the root cause, and implements a fix rather than just patching the symptom.

/ce-product-pulse generates a time-windowed report on usage, performance, and errors. Reports are saved to docs/pulse-reports/ so they accumulate into a browseable history of how the product is actually performing.

Installation

Claude Code (simplest path)

/plugin marketplace add EveryInc/compound-engineering-plugin
/plugin install compound-engineering

No Bun required. After installing, run /ce-setup to check your environment and bootstrap project config.

Cursor

In Cursor Agent chat:

/add-plugin compound-engineering

Or search for "compound engineering" in the plugin marketplace.

GitHub Copilot (VS Code)

Open the VS Code command palette
Run Chat: Install Plugin from Source
Enter EveryInc/compound-engineering-plugin as the repo
Select compound-engineering when VS Code shows the available plugins

Codex (three steps required)

Codex currently needs an extra step because its native plugin spec handles skills but not custom agents. The agents are what power commands like /ce-code-review and /ce-plan.

# Step 1: Register the marketplace
codex plugin marketplace add EveryInc/compound-engineering-plugin

# Step 2: Install the agents via Bun
bunx @every-env/compound-plugin install compound-engineering --to codex

# Step 3: Launch Codex, run /plugins, find Compound Engineering, and install
codex

All three steps are required. Skipping the Bun step means delegation-based skills will report missing agents.

Gemini CLI, OpenCode, Kiro, Pi

bunx @every-env/compound-plugin install compound-engineering --to gemini
bunx @every-env/compound-plugin install compound-engineering --to opencode
bunx @every-env/compound-plugin install compound-engineering --to kiro
bunx @every-env/compound-plugin install compound-engineering --to pi

A Typical Bug Investigation

For debugging, the flow is shorter:

/ce-debug "checkout webhook sometimes creates duplicate invoices"
/ce-code-review
/ce-compound

/ce-debug doesn't just jump to a fix. It reproduces the failure first, traces where it originates, then implements a targeted fix. After a review and a compound step, that knowledge about the invoicing edge case is now part of the project's accumulated context.

What Actually Gets Written to Disk

This is worth understanding. Compound Engineering is not just about prompts — it produces files in your project:

STRATEGY.md — the product anchor document, if you use /ce-strategy
docs/brainstorms/ — requirements documents from /ce-brainstorm
docs/pulse-reports/ — product performance reports from /ce-product-pulse
Compound notes written by /ce-compound, stored wherever the plugin is configured to put them

These files are meant to persist across sessions and become grounding context for future agent interactions. The point is that each cycle is building toward a more informed next cycle, not starting fresh.

Is It Worth Using?

The plugin is a genuine attempt to solve a real problem: AI coding agents are stateless by default, and their usefulness degrades over the life of a complex project unless you actively manage context.

It's worth trying if:

You're working on a non-trivial codebase where decisions have history and context matters
You find yourself re-explaining the same architectural constraints to your agent in every session
You want more structured reviews than just "does this code work"
You're using Claude Code, Cursor, or Copilot and want a workflow rather than just a chat interface

It may be overkill if:

You're working on small, self-contained scripts or prototypes
Your sessions are isolated enough that accumulated context doesn't matter
You prefer a lighter workflow and the brainstorm/plan/compound ceremony feels like friction

The contribution policy is also worth knowing: the author explicitly does not accept outside contributions and reviews issues and PRs through their own agents rather than directly. That's an unusual choice for an open-source tool, but it's stated clearly and the release cadence (153 releases, latest in May 2026) suggests active maintenance regardless.

One honest note: the value of this plugin scales with how consistently you run the full loop. If you only use /ce-work and skip /ce-compound, you're leaving the most important part on the table. The compounding only happens if you complete the cycle.

Quick Reference

Command	What it does
`/ce-setup`	First-time setup and environment check
`/ce-strategy`	Create or update `STRATEGY.md`
`/ce-ideate`	Big-picture ideation before brainstorming
`/ce-brainstorm`	Interactive requirements doc generation
`/ce-plan`	Turn requirements into an implementation plan
`/ce-work`	Execute the plan
`/ce-debug`	Reproduce, trace, and fix a bug
`/ce-code-review`	Multi-agent pre-merge review
`/ce-doc-review`	Documentation review
`/ce-compound`	Document learnings for future sessions
`/ce-product-pulse`	Time-windowed usage and error report

GitHub: https://github.com/EveryInc/compound-engineering-plugin

MarkItDown: Microsoft's Tool for Converting Almost Anything to Markdown

ArshTechPro — Fri, 29 May 2026 14:50:33 +0000

If you've been building LLM-powered applications, you've likely run into the same problem: your data lives in PDFs, Word documents, Excel sheets, and PowerPoint decks — but your AI pipeline expects clean text. Copy-pasting doesn't scale, and most conversion tools either strip too much structure or produce noisy output.

Microsoft's MarkItDown is built specifically for this gap. It's a lightweight Python utility that converts a wide range of file formats into Markdown, preserving the structure that matters: headings, tables, lists, and links.

What Is MarkItDown?

MarkItDown is a Python library (and CLI tool) that converts files and documents into Markdown. It is not designed for pixel-perfect human-readable output. The explicit goal is to feed text into LLMs and text analysis pipelines — and Markdown is the right format for that because most large language models understand it natively and it is highly token-efficient.

Supported formats include:

PDF
Word (.docx)
PowerPoint (.pptx)
Excel (.xlsx and older .xls)
Images (EXIF metadata + optional OCR)
Audio files (EXIF metadata + optional speech transcription)
HTML
CSV, JSON, XML
ZIP files (iterates and converts contents)
YouTube URLs (fetches transcription)
EPubs

That's a broad surface area for one library.

Installation

You need Python 3.10 or higher. The simplest way to get everything:

pip install 'markitdown[all]'

The [all] extra installs the optional dependencies for every supported format. If you want a leaner install, you can pick specific formats:

pip install 'markitdown[pdf,docx,pptx]'

Available optional extras: pdf, docx, pptx, xlsx, xls, outlook, audio-transcription, youtube-transcription, az-doc-intel.

It is recommended to work inside a virtual environment:

python -m venv .venv
source .venv/bin/activate
pip install 'markitdown[all]'

Using the CLI

The command-line interface is straightforward:

# Convert a file and print to stdout
markitdown report.pdf

# Save output to a file
markitdown report.pdf -o report.md

# Pipe input
cat report.pdf | markitdown

That's it. No configuration required for basic use.
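
For batch jobs, the same CLI can be driven from a short script. The sketch below is one way to do that in Python, assuming markitdown is already on your PATH. The helper names build_commands and convert_all, the default "*.pdf" pattern, and the directory layout are illustrative choices, and only the -o flag shown above is used.

```python
import subprocess
from pathlib import Path
from typing import List


def build_commands(src_dir: str, out_dir: str, pattern: str = "*.pdf") -> List[List[str]]:
    """Build one `markitdown <file> -o <name>.md` command per matching file."""
    out = Path(out_dir)
    commands = []
    for src in sorted(Path(src_dir).glob(pattern)):
        target = out / (src.stem + ".md")
        commands.append(["markitdown", str(src), "-o", str(target)])
    return commands


def convert_all(src_dir: str, out_dir: str, pattern: str = "*.pdf") -> None:
    """Run each command in turn; check=True raises if a conversion fails."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for cmd in build_commands(src_dir, out_dir, pattern):
        subprocess.run(cmd, check=True)
```

Keeping the command construction separate from the execution makes it easy to print or log the planned conversions before running them.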

Using the Python API

For programmatic use in your pipeline:

from markitdown import MarkItDown

md = MarkItDown(enable_plugins=False)
result = md.convert("financials.xlsx")
print(result.text_content)

The result.text_content attribute holds the converted Markdown string.

Converting Different File Types

from markitdown import MarkItDown

md = MarkItDown()

# Word, PowerPoint, CSV, and HTML inputs all go through the same call
for path in ["proposal.docx", "slides.pptx", "data.csv", "page.html"]:
    result = md.convert(path)
    print(result.text_content)

The API is consistent regardless of file type. You call .convert() and get back a result object.

LLM-Powered Image Descriptions

If you pass an image file (or a PowerPoint with images), MarkItDown can call an LLM to generate descriptions for those images, which then become part of the Markdown output. You supply your own client:

from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")

result = md.convert("diagram.jpg")
print(result.text_content)

This is useful when the actual visual content of an image matters for downstream processing, not just the file metadata.

OCR Support via Plugin

For PDFs and Office documents that contain images with embedded text (scanned documents, screenshots inside slides), MarkItDown supports a separate OCR plugin:

pip install markitdown-ocr
pip install openai

from markitdown import MarkItDown
from openai import OpenAI

md = MarkItDown(
    enable_plugins=True,
    llm_client=OpenAI(),
    llm_model="gpt-4o",
)
result = md.convert("scanned_report.pdf")
print(result.text_content)

The OCR plugin uses the same LLM vision pattern as image descriptions — no separate ML libraries or binaries are required.

Azure Document Intelligence

For enterprise-grade document parsing (better table extraction, form recognition), MarkItDown integrates with Azure Document Intelligence:

# CLI
markitdown report.pdf -o report.md -d -e "<your_endpoint>"

from markitdown import MarkItDown

md = MarkItDown(docintel_endpoint="<your_endpoint>")
result = md.convert("complex_form.pdf")
print(result.text_content)

This is the right path if you are processing complex financial documents, legal contracts, or forms where structure accuracy is critical.

Running with Docker

If you prefer containerized workflows:

docker build -t markitdown:latest .
docker run --rm -i markitdown:latest < your-file.pdf > output.md

Plugin Ecosystem

MarkItDown supports third-party plugins. They are disabled by default.

# List installed plugins
markitdown --list-plugins

# Enable plugins for a conversion
markitdown --use-plugins path-to-file.pdf

To find community plugins, search GitHub for #markitdown-plugin.

Security Considerations

One thing worth knowing before you integrate this into a server-side application: MarkItDown runs with the privileges of the current process. It can access local files and remote URIs the same way open() or requests.get() can.

The recommendation from the project is to avoid passing untrusted input directly to .convert(). If you only need to convert local files, use convert_local(). If you need to handle streams, use convert_stream(). Prefer the narrowest API for your use case.

This is standard advice for any file processing library, but it is worth calling out explicitly if you are building a web-facing feature.
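
As a rough illustration of "prefer the narrowest API", the sketch below validates a user-supplied path against an upload directory before handing it to a local-only converter. The function names resolve_inside and convert_upload and the idea of a fixed base directory are assumptions for this example, not part of MarkItDown. The converter is passed in as a callable so the path check can be exercised without the library installed.

```python
from pathlib import Path
from typing import Callable


def resolve_inside(base_dir: str, user_path: str) -> Path:
    """Return the resolved path, or raise ValueError if it escapes base_dir."""
    base = Path(base_dir).resolve()
    candidate = (base / user_path).resolve()
    # Absolute paths and ../ sequences resolve outside base and are rejected.
    if not candidate.is_relative_to(base):
        raise ValueError(f"{user_path!r} is outside {base}")
    return candidate


def convert_upload(convert_local: Callable[[str], object], base_dir: str, user_path: str):
    """Validate the path first, then call a local-only converter on it."""
    safe_path = resolve_inside(base_dir, user_path)
    return convert_local(str(safe_path))
```

With MarkItDown installed, a call might look like convert_upload(MarkItDown().convert_local, "/srv/uploads", filename), where /srv/uploads is a placeholder directory.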

Is It Worth Using?

The honest answer: it depends on what you need it for.

MarkItDown is a good fit if:

You are building an LLM pipeline that needs to ingest documents in various formats.
You want a consistent Python API across PDF, Word, Excel, HTML, and other types without gluing together multiple libraries.
You need a quick CLI tool to batch-convert files for indexing or embedding.
You want the flexibility to extend conversion behavior via plugins.

MarkItDown is not the right tool if:

You need pixel-perfect conversion for human consumption. The project documentation explicitly says the output is meant for text analysis tools, not high-fidelity document rendering.
You need production OCR without LLM dependencies. The OCR plugin requires an OpenAI-compatible client, which adds latency and cost.
You are working with heavily formatted documents where layout matters beyond headings and tables (e.g., multi-column academic papers, complex invoice layouts).

Quick Reference

Task	Command
Install all formats	`pip install 'markitdown[all]'`
Convert via CLI	`markitdown file.pdf -o output.md`
Convert via Python	`MarkItDown().convert("file.pdf").text_content`
Convert with LLM images	Pass `llm_client` and `llm_model` to `MarkItDown()`
Enable OCR plugin	`pip install markitdown-ocr`, then `enable_plugins=True`
Use Azure Doc Intelligence	Pass `docintel_endpoint` to `MarkItDown()`
Run via Docker	`docker run --rm -i markitdown:latest < file.pdf > output.md`

GitHub: https://github.com/microsoft/markitdown

Pi: The Open-Source AI Coding Agent You Probably Haven't Tried Yet

ArshTechPro — Tue, 26 May 2026 10:54:21 +0000

If you've been following the AI coding agent space, you've likely heard of Claude Code, GitHub Copilot, or Codex. But there's a fast-moving open-source alternative sitting at over 46,000 GitHub stars that deserves a serious look: pi, from earendil-works/pi.

This article walks you through what pi actually is, how to get it running in under five minutes, and whether it's worth adding to your workflow.

What Is Pi?

Pi is a monorepo of tools built for constructing and running AI agents. The centerpiece is a coding agent CLI — a terminal-based assistant that can read your files, write code, run shell commands, and iterate on tasks, all within your actual project directory.

The repo is built entirely in TypeScript and ships as a set of npm packages:

@earendil-works/pi-coding-agent — the interactive CLI you'll use day to day
@earendil-works/pi-agent-core — the agent runtime (tool calling, state management) for building your own agents
@earendil-works/pi-ai — a unified LLM API layer that normalizes OpenAI, Anthropic, Google, and others behind one interface
@earendil-works/pi-tui — a terminal UI library with differential rendering
@earendil-works/pi-web-ui — web components for AI chat interfaces

Setup in Five Minutes

Prerequisites: Node.js installed, and an API key or existing subscription (Claude Pro, ChatGPT Plus, or GitHub Copilot).

Step 1: Install

npm install -g @earendil-works/pi-coding-agent

That's the whole install. No Docker, no Python environment, no build step.

If you prefer another package manager:

pnpm add -g @earendil-works/pi-coding-agent
# or
yarn global add @earendil-works/pi-coding-agent
# or
bun add -g @earendil-works/pi-coding-agent

Step 2: Authenticate

Pi supports two authentication paths.

Option A — Subscription login (Claude Pro/Max, ChatGPT Plus/Pro, GitHub Copilot):

Start pi from any directory and run:

pi
/login

A prompt will appear to select your provider. This stores credentials in ~/.pi/agent/auth.json.

Option B — API key:

export ANTHROPIC_API_KEY=sk-ant-...
pi

You can use OPENAI_API_KEY, GOOGLE_API_KEY, or others the same way. The /login command can also store API keys interactively so you don't need to export them every session.

Step 3: Start a session

Navigate to your project and launch:

cd /path/to/your/project
pi

Pi starts in interactive mode and loads your project directory as its working context. Type a request and press Enter:

Summarize this repository and tell me how to run its checks.

Out of the box, the agent has access to four tools: read (read files), write (create or overwrite files), edit (patch files), and bash (run shell commands). Additional read-only tools like grep, find, and ls are available through tool options.

Key Features Worth Knowing

Context files

Pi loads AGENTS.md (or CLAUDE.md) files at startup to give the model project-specific instructions. You can have a global one in ~/.pi/agent/AGENTS.md and a per-project one in your repo root. Example:

# Project Instructions

- Run `npm run check` after code changes.
- Do not run production migrations locally.
- Keep responses concise.

Run /reload inside a session to pick up changes without restarting.
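
If you are unsure which context files exist on a given machine, a small pre-flight check can list them. This sketch only tests for the paths named above (the global ~/.pi/agent/AGENTS.md and an AGENTS.md or CLAUDE.md in the project root). The function name find_context_files is hypothetical, and it makes no claim about how pi itself orders or merges these files.

```python
from pathlib import Path
from typing import List, Optional

CONTEXT_NAMES = ("AGENTS.md", "CLAUDE.md")


def find_context_files(project_root: str, home: Optional[str] = None) -> List[Path]:
    """List the context files mentioned in the article that exist on disk."""
    home_dir = Path(home) if home else Path.home()
    candidates = [home_dir / ".pi" / "agent" / "AGENTS.md"]
    candidates += [Path(project_root) / name for name in CONTEXT_NAMES]
    return [path for path in candidates if path.is_file()]
```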

File references

Type @ in the editor to fuzzy-search and reference files, or pass them on the command line:

pi @src/app.ts @src/app.test.ts "Review these together"

You can paste images with Ctrl+V (Alt+V on Windows) or drag them into supported terminals.

Session management

Sessions are saved automatically. Resuming is straightforward:

pi -c         # Continue most recent session
pi -r         # Browse previous sessions

Inside a session, /fork and /clone let you branch the conversation tree — useful when you want to try two different approaches to a problem without losing your current state.

Non-interactive (one-shot) mode

Pi works well in scripts and pipelines:

pi -p "Summarize this codebase"
cat README.md | pi -p "Summarize this text"
pi -p @screenshot.png "What's in this image?"

For automation, --mode json gives structured event output and --mode rpc allows stdin/stdout process integration.

Shell commands mid-session

Prefix a command with ! to run it and send the output to the model:

!npm run lint

Use !!command to run it without adding the output to the model's context window.

Model switching

Use /model or Ctrl+L to change models mid-session. Shift+Tab cycles thinking levels. This is useful if you want a fast cheap model for exploration and a smarter one for final implementation.

Using Pi as a Library

If you're building something on top of pi rather than using it as a CLI, the SDK path is clean:

import {
  AuthStorage,
  createAgentSession,
  ModelRegistry,
  SessionManager,
} from "@earendil-works/pi-coding-agent";

const authStorage = AuthStorage.create();
const modelRegistry = ModelRegistry.create(authStorage);

const { session } = await createAgentSession({
  sessionManager: SessionManager.inMemory(),
  authStorage,
  modelRegistry,
});

await session.prompt("What files are in the current directory?");

For non-Node.js integrations, pi supports RPC mode over stdin/stdout with JSONL framing — so you can integrate from any language.
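
JSONL framing itself is simple: each message is one JSON document on its own line. The sketch below shows generic encode and decode helpers in Python. It does not reproduce pi's actual RPC schema, so the "type" and "message" fields in the usage lines are hypothetical placeholders.

```python
import json
from typing import Iterable, Iterator


def encode_message(message: dict) -> bytes:
    """Serialize one message as a single newline-terminated JSON line."""
    # json.dumps escapes newlines inside strings, so the only raw "\n"
    # in the output is the terminator added here.
    return (json.dumps(message, separators=(",", ":")) + "\n").encode("utf-8")


def decode_stream(lines: Iterable[bytes]) -> Iterator[dict]:
    """Yield one parsed message per non-empty line."""
    for raw in lines:
        text = raw.decode("utf-8").strip()
        if text:
            yield json.loads(text)


if __name__ == "__main__":
    # Hypothetical message shape, for illustration only.
    frame = encode_message({"type": "prompt", "message": "List the files\nin this repo"})
    print(frame)
    print(list(decode_stream([frame])))
```

In a real integration the encoded bytes would be written to the child process's stdin and its stdout would be iterated line by line through decode_stream.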

Building from Source

If you want to contribute or run from source:

git clone https://github.com/earendil-works/pi.git
cd pi
npm install       # Install all dependencies
npm run build     # Build all packages
npm run check     # Lint, format, and type check
./pi-test.sh      # Run pi from sources (any directory)

Note: npm run check requires a prior npm run build because the web-ui package needs compiled .d.ts files from dependencies.

Is It Worth a Try?

Yes, with some caveats.

Pi earns attention for a few concrete reasons:

It's genuinely multi-provider. Most coding agents are tied to one model provider. Pi normalizes across OpenAI, Anthropic, Google, and others at the API layer, so you can switch without re-learning a tool. If you already pay for Claude Pro or GitHub Copilot, pi can use those subscriptions directly — no extra API costs by default.

The session model is well-designed. Branching, forking, and resuming sessions is something most similar tools handle poorly. Pi treats this as a first-class feature, which matters when you're doing long iterative work.

The extensibility story is solid. Extensions are TypeScript modules that can add tools, slash commands, event handlers, and custom UI. If the built-in tools don't cover your workflow, you can add to them.

Where it's less compelling: The terminal UI won't appeal to everyone, and if you're deeply embedded in VS Code with Copilot already working, the switching cost is real. The documentation is good but spread across many individual files in the repo — there's no single polished docs site yet.

For developers who want control over their AI tooling, prefer the terminal, or need to build agents programmatically rather than just use them interactively, pi is a serious option. It's the kind of tool that rewards spending an hour with it.

Quick Reference

Task	Command
Install	`npm install -g @earendil-works/pi-coding-agent`
Start in project	`cd /project && pi`
Login (subscription)	`/login` inside pi
Set API key	`export ANTHROPIC_API_KEY=...`
Continue last session	`pi -c`
Browse sessions	`pi -r`
One-shot prompt	`pi -p "your prompt"`
Switch model	`/model` or Ctrl+L
Run shell command	`!your-command`
Reload context files	`/reload`
Uninstall	`npm uninstall -g @earendil-works/pi-coding-agent`

Repo: github.com/earendil-works/pi

cmux: The Native macOS Terminal Built for Running AI Coding Agents in Parallel

ArshTechPro — Mon, 25 May 2026 04:11:23 +0000

If you have ever run three Claude Code sessions at the same time in a stock terminal, you know the pain. Notifications are generic ("Claude is waiting for your input" — every single time), tab titles blur together, and there is no good way to tell which agent needs you without clicking into each pane one by one. cmux was built to fix exactly this.

What Is cmux?

cmux is an open-source, native macOS terminal application built on top of Ghostty, the GPU-accelerated terminal emulator. It wraps Ghostty's rendering engine (libghostty) in a Swift/AppKit shell and layers on top the features that matter when you are managing multiple AI coding agents simultaneously:

Vertical tab sidebar showing git branch, linked PR status, working directory, listening ports, and the latest notification text for each workspace
Agent-aware notification rings — when an agent needs input, its pane gets a blue visual ring and the sidebar tab lights up
Notification panel with a single keyboard shortcut to jump to the most recent unread agent
In-app split browser with a scriptable API so agents can interact with your dev server directly
Socket and CLI API to script workspace creation, pane splits, keystrokes, and browser control from anywhere

It reads your existing ~/.config/ghostty/config, so your fonts, themes, and colors carry over instantly.

Why Not Just Use tmux or iTerm2?

Fair question. Here is the honest comparison.

cmux vs tmux

tmux is a terminal multiplexer that runs inside any terminal. It is text-based, highly composable, and works over SSH. It has no native notification system for AI agents — you would need to wire up OSC sequences yourself and build your own status line logic. The tab sidebar in cmux gives you live git branch, PR number, CWD, and agent notification text with zero configuration. tmux also runs inside existing terminals, so you are still at the mercy of whatever notification plumbing that terminal has (or does not have).

Feature	cmux	tmux
AI agent notification rings	Built-in	Manual setup required
Vertical sidebar with git/PR status	Yes	No (status bar only)
GPU-accelerated rendering	Yes (libghostty)	Depends on host terminal
In-app browser with scripting API	Yes	No
Native macOS app	Yes	No
Works over SSH	Not yet	Yes
Cross-platform	macOS only	Yes

cmux vs iTerm2

iTerm2 is the veteran macOS terminal. It has excellent shell integration, triggers, a mature notification system, and its own Metal-based GPU renderer. But it is not built with AI agent workflows in mind: notifications do not carry workspace-level context, and there is no sidebar showing agent state. If you live in Claude Code, Codex, or OpenCode all day, iTerm2 will give you a generic macOS notification with no way to quickly surface which of your eight agents actually needs attention.

Feature	cmux	iTerm2
Agent notification with visual ring	Yes	No
Sidebar with per-workspace agent status	Yes	No
GPU-accelerated rendering	Yes (libghostty)	Yes (Metal renderer)
In-app scriptable browser	Yes	No
Shell integration / triggers	Via CLI	Yes, mature
Cross-platform	macOS only	macOS only
Open source	Yes (AGPL-3.0)	Yes

cmux vs Warp

Warp is a modern terminal with AI features built in. It has good UX, but it is written in Rust with its own GPU-rendered UI rather than on Electron or Tauri, so the real difference is scope: Warp builds its AI features into the terminal itself, whereas cmux is intentionally not an AI orchestrator. It is a primitive that gives you the tools to run any agent (Claude Code, Codex, OpenCode, Gemini CLI, Aider, Kiro) side by side without locking you into one workflow.

Installation

Option 1: DMG (Recommended for First Install)

Download the latest .dmg from the releases page:

https://github.com/manaflow-ai/cmux/releases/latest/download/cmux-macos.dmg

Open it, drag cmux to your Applications folder, and launch it. cmux auto-updates via Sparkle from that point — you only need to download once.

Option 2: Homebrew

brew tap manaflow-ai/cmux
brew install --cask cmux

To update later:

brew upgrade --cask cmux

System requirements: macOS 14.0 or later, Apple Silicon or Intel.

On first launch macOS will ask you to confirm opening an app from an identified developer. Click Open.

Setting Up the CLI

The CLI is what lets you script cmux from inside or outside the app. Inside cmux terminals it works automatically. To use it from an external script or CI hook, create a symlink:

sudo ln -sf "/Applications/cmux.app/Contents/Resources/bin/cmux" /usr/local/bin/cmux

Test it:

cmux list-workspaces
cmux notify --title "Hello" --body "cmux CLI is working"

The Notification System

This is the core reason to use cmux if you run agents in parallel. There are three ways to send a notification into cmux.

1. CLI (Easiest)

cmux notify --title "Build Complete" --body "webpack finished in 4.2s"
cmux notify --title "Claude Code" --subtitle "Waiting" --body "Agent needs input"

2. OSC 777 (Shell / Any Language)

This is the rxvt-unicode notification escape sequence. It works from any shell script or language that can write to stdout:

printf '\e]777;notify;My Title;Message body\a'

Shell function you can drop in .zshrc or .bashrc:

cmux_notify() {
  printf '\e]777;notify;%s;%s\a' "$1" "$2"
}

cmux_notify "Tests passed" "All 142 tests green"

From Python:

import sys

def notify(title: str, body: str):
    sys.stdout.write(f'\x1b]777;notify;{title};{body}\x07')
    sys.stdout.flush()

notify("Script done", "Processed 5000 rows")

From Node.js:

function notify(title, body) {
  process.stdout.write(`\x1b]777;notify;${title};${body}\x07`);
}

notify('Build done', 'webpack finished');
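
One caveat with the helpers above: a title or body taken from untrusted output could contain control characters or a semicolon, which may end the sequence early or shift the fields. The following sketch adds basic sanitizing, assuming the terminal splits the title from the body on ";". The function names are illustrative.

```python
import sys


def _strip_controls(text: str) -> str:
    """Drop non-printable characters such as BEL, ESC, and newlines."""
    return "".join(ch for ch in text if ch.isprintable())


def osc777(title: str, body: str) -> str:
    """Build an OSC 777 notify sequence with sanitized fields."""
    safe_title = _strip_controls(title).replace(";", ",")
    safe_body = _strip_controls(body)
    return f"\x1b]777;notify;{safe_title};{safe_body}\x07"


def notify(title: str, body: str) -> None:
    """Write the sequence to stdout and flush so it is sent immediately."""
    sys.stdout.write(osc777(title, body))
    sys.stdout.flush()
```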

3. OSC 99 (Kitty Protocol — Richer)

If you need subtitles or notification IDs:

printf '\e]99;i=1;e=1;d=0;p=title:Build Complete\e\\'
printf '\e]99;i=1;e=1;d=0;p=subtitle:Project X\e\\'
printf '\e]99;i=1;e=1;d=1;p=body:All tests passed\e\\'

Use OSC 777 for most cases. Use OSC 99 only when you need subtitle fields.

Wiring Up Claude Code Hooks

This is probably the most useful setup step. Claude Code supports lifecycle hooks, so you can fire a cmux notification the moment an agent stops or completes a sub-task.

Step 1 — Create the hook script:

#!/bin/bash
# ~/.claude/hooks/cmux-notify.sh

# Skip silently if we're not running inside cmux
[ -S /tmp/cmux.sock ] || exit 0

# Claude Code passes the hook event as JSON on stdin
EVENT=$(cat)
EVENT_TYPE=$(echo "$EVENT" | jq -r '.hook_event_name // "unknown"')
TOOL=$(echo "$EVENT" | jq -r '.tool_name // ""')

case "$EVENT_TYPE" in
    "Stop")
        cmux notify --title "Claude Code" --body "Session complete"
        ;;
    "PostToolUse")
        # Use if rather than && so a non-matching tool does not make the hook exit non-zero
        if [ "$TOOL" = "Task" ]; then
            cmux notify --title "Claude Code" --body "Agent finished sub-task"
        fi
        ;;
esac

chmod +x ~/.claude/hooks/cmux-notify.sh

Step 2: Register the hook in Claude Code's settings file, ~/.claude/settings.json (JSON does not allow comments, so keep the path out of the file itself):

{
  "hooks": {
    "Stop": [
      {
        "matcher": "",
        "hooks": [
          {
            "type": "command",
            "command": "~/.claude/hooks/cmux-notify.sh"
          }
        ]
      }
    ],
    "PostToolUse": [
      {
        "matcher": "Task",
        "hooks": [
          {
            "type": "command",
            "command": "~/.claude/hooks/cmux-notify.sh"
          }
        ]
      }
    ]
  }
}

Restart Claude Code. Now every time a session completes or a sub-agent finishes, the cmux pane gets a blue notification ring and the sidebar tab lights up. Press Cmd+Shift+U to jump straight to the most recent unread.
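
If you would rather not depend on jq, the same decision logic is easy to express in Python. This is a sketch that mirrors the shell hook above: it reads the hook_event_name and tool_name fields and returns the cmux notify command to run, or None when nothing should fire. The function names are hypothetical, and the socket check from the shell version is left out.

```python
import json
from typing import List, Optional


def notification_for(event: dict) -> Optional[List[str]]:
    """Map a hook event to a `cmux notify` command, or None to stay quiet."""
    name = event.get("hook_event_name", "unknown")
    if name == "Stop":
        body = "Session complete"
    elif name == "PostToolUse" and event.get("tool_name") == "Task":
        body = "Agent finished sub-task"
    else:
        return None
    return ["cmux", "notify", "--title", "Claude Code", "--body", body]


def command_from_stdin_text(stdin_text: str) -> Optional[List[str]]:
    """Parse the JSON payload the hook receives on stdin."""
    return notification_for(json.loads(stdin_text))
```

A wrapper script would read sys.stdin, call command_from_stdin_text, and pass a non-None result to subprocess.run.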

Key Keyboard Shortcuts

You will use these constantly once you have a few agents running.

Shortcut	Action
`Cmd+N`	New workspace
`Cmd+1–8`	Jump to workspace by number
`Cmd+D`	Split pane right
`Cmd+Shift+D`	Split pane down
`Cmd+Shift+L`	Open browser in split
`Cmd+I`	Open notification panel
`Cmd+Shift+U`	Jump to latest unread agent
`Cmd+B`	Toggle sidebar
`Cmd+Shift+R`	Rename workspace

The In-App Browser

One underrated feature: cmux ships with a split browser pane with a scriptable API ported from Vercel Labs' agent-browser. Open it with Cmd+Shift+L.

Agents running Claude Code can snapshot the accessibility tree of the browser, get element references, click, fill forms, and evaluate JavaScript — all without leaving the terminal. This is useful when an agent is working against a local dev server and you want it to verify UI changes or run through a form flow directly.

Is It Worth Trying?

Yes, if you:

Run multiple AI coding agents in parallel (Claude Code, Codex, OpenCode, Gemini CLI, Aider)
Are on macOS and want native performance over an Electron-based terminal
Already use Ghostty and want agent-aware notifications without switching apps
Want to script your workspace layout through a CLI or socket API

Maybe not yet, if you:

Are on Linux or Windows (cmux is macOS-only, macOS 14+)
Need SSH session support (not available yet)
Rely on live process restore after a restart (layout restores but running shells/agents do not resume yet)
Are happy with a single agent workflow where notifications are not a problem

The project is young (v0.61 at the time of writing, 4.5k GitHub stars) but actively maintained with 24 releases already shipped. It is free, open source under AGPL-3.0, and auto-updates silently. The nightly build runs alongside the stable app with its own bundle ID if you want to live on the edge.

If you regularly find yourself clicking through terminal panes to figure out which Claude Code session is blocked, cmux solves that problem specifically and solves it well.

Quick Links

GitHub: https://github.com/manaflow-ai/cmux

Gemini Spark: Google's 24/7 AI Agent Just Changed the Rules (And What It Means for Developers)

ArshTechPro — Sun, 24 May 2026 12:17:24 +0000

Google I/O 2026 had a lot of announcements. New models, redesigned apps, smart glasses. But if you build software for a living, one announcement deserves your full attention: Gemini Spark.

Not because it has a catchy name. Because it represents a real architectural shift in how AI agents work — and because Google just validated a protocol that was originally Anthropic's idea.

Let me break it down.

What Is Gemini Spark?

Gemini Spark is Google's 24/7 personal AI agent.

From the Google I/O keynote:

"It runs on dedicated virtual machines on Google Cloud. And it's 24/7 so you don't need to keep your laptop open. It's powered by Gemini 3.5 and the Google Antigravity harness, which allows it to perform long-horizon tasks easily in the background."

That phrase "long-horizon tasks" is the one developers should fixate on. A standard API call has a lifecycle measured in seconds. Spark's lifecycle is measured in hours and days.

The Technical Stack

Spark is built on two things that matter here:

Gemini 3.5 Flash — The newly released model announced at the same I/O. It is optimized for agentic workflows and runs faster than previous generations. Spark uses Flash by default, with Gemini 3.5 Pro support coming later.

Google Antigravity — This is the internal orchestration framework Google uses to manage long-running agent tasks. Version 2.0 is now available to external developers. Think of it as Google's answer to the kind of agent harness that tools like LangGraph or CrewAI provide — but designed specifically for tasks that span hours or days rather than seconds.

What Can It Actually Do?

Spark is not a chatbot. It is an agent. The distinction matters.

A chatbot answers a question. An agent receives a goal, breaks it into subtasks, executes those subtasks over time, checks in when needed, and delivers results.

Concretely, Spark can:

Draft and send emails using Gmail context
Read and write Google Docs, Sheets, Slides, and Drive files
Plan multi-step workflows and execute them in sequence
Run as an agentic browser inside Chrome (coming later this summer)
Connect to third-party tools via MCP (more on this below)
Be reached through email or chat, not just the Gemini app
Show live task progress through Android Halo (coming later this year)

The key design constraint: Spark is built to check with you before taking major actions. You opt in to turning it on, you set the parameters, and it asks for confirmation before high-stakes moves. This is intentional — Google is being cautious about autonomous action at launch.

The MCP Angle: This Is the Part Developers Should Care About Most

Here is the headline buried in the keynote that deserves its own section.

Spark integrates with third-party tools through MCP — the Model Context Protocol.

MCP was originally an open standard developed and published by Anthropic. It defines how AI models communicate with external tools in a standardized way — essentially a universal adapter so that any AI agent can talk to any tool without custom integration code for every combination.

Google confirmed that Spark will expand to third-party apps including Canva, OpenTable, and Instacart through MCP, with that support rolling out within weeks of launch.

Why does this matter for developers?

If you maintain a SaaS product, a developer tool, or any kind of API, you no longer need separate integrations for each AI platform. Build one MCP server, and your tool becomes accessible to every major AI agent runtime on the market.
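
To make "build one MCP server" a little more concrete: MCP messages are JSON-RPC 2.0, and the two requests an agent sends most often are tools/list and tools/call. The sketch below handles just those two in plain Python. It is a simplified illustration, not a complete server: the initialize handshake and the transport are omitted, and the check_availability tool is a hypothetical stand-in for your own API. In practice you would likely use an official MCP SDK.

```python
import json

# Hypothetical tool definition; a real server would describe its own tools.
TOOLS = [{
    "name": "check_availability",
    "description": "Return open reservation slots for a date.",
    "inputSchema": {
        "type": "object",
        "properties": {"date": {"type": "string"}},
        "required": ["date"],
    },
}]


def check_availability(date: str) -> str:
    # Placeholder for a call into your own product's API.
    return f"Availability lookup requested for {date}"


def _reply(request: dict, key: str, value: dict) -> dict:
    """Wrap a result or error in a JSON-RPC 2.0 response envelope."""
    return {"jsonrpc": "2.0", "id": request.get("id"), key: value}


def handle(request: dict) -> dict:
    """Answer a tools/list or tools/call request."""
    method = request.get("method")
    if method == "tools/list":
        return _reply(request, "result", {"tools": TOOLS})
    if method == "tools/call":
        params = request.get("params", {})
        if params.get("name") != "check_availability":
            return _reply(request, "error", {"code": -32602, "message": "Unknown tool"})
        date = params.get("arguments", {}).get("date", "unspecified")
        content = [{"type": "text", "text": check_availability(date)}]
        return _reply(request, "result", {"content": content, "isError": False})
    return _reply(request, "error", {"code": -32601, "message": "Method not found"})


if __name__ == "__main__":
    call = {"jsonrpc": "2.0", "id": 1, "method": "tools/call",
            "params": {"name": "check_availability", "arguments": {"date": "2026-06-01"}}}
    print(json.dumps(handle(call), indent=2))
```

The point of the standard is that this one request shape is the same regardless of which agent runtime is on the other end.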

Gemini Spark vs. OpenClaw: Two Different Philosophies

OpenClaw and Gemini Spark are solving the same underlying problem — persistent, autonomous AI agents — but they approach it from opposite directions. Here is a direct comparison:

	Gemini Spark	OpenClaw
Hosting	Google Cloud VMs (managed)	Self-hosted on your own hardware
Source	Proprietary	MIT-licensed, open source
Model	Gemini 3.5 Flash/Pro	Any LLM (Claude, GPT, Gemini, Llama, 200+ backends)
Interface	Gemini app, email, chat	WhatsApp, Telegram, Slack, Signal, iMessage
Memory	Google Workspace context	Local Markdown files on your disk
MCP support	Yes (coming in weeks)	Community-driven via skills/plugins
Availability	Google AI Ultra subscribers (US first)	Free, self-hosted
Oversight	Google infrastructure	You own and control everything

The practical difference:

OpenClaw is a local-first agent. It runs on your machine, stores memory as plain Markdown files on your disk, and lets you bring any model you want. If you want full control over what the agent can access, how it stores data, and which model powers it, OpenClaw gives you that at zero subscription cost (you pay only for API usage). The tradeoff is that you manage the infrastructure.

Gemini Spark is a cloud-first managed agent. You do not run anything yourself. Google handles the VMs, the uptime, the orchestration. It runs even when your devices are off. The tradeoff is that you are inside Google's ecosystem, limited to their model, and it requires a Google AI Ultra subscription.

Neither is strictly better. They serve different developer profiles.

If you are building personal automation that you want tight control over, runs locally, and integrates with whatever LLM you prefer — OpenClaw is still the more flexible choice.

If you are deep in Google Workspace, want zero infrastructure management, and need something that can work reliably in the background without a server to maintain — Spark is the more turnkey solution.

The interesting thing is that MCP may reduce this distinction over time. If Spark can connect to the same MCP servers as Claude Desktop and OpenClaw, then tool access converges even when runtime and hosting remain different.

Availability and Access

Spark is still early. Google is rolling it out to trusted testers first, with a beta coming to Google AI Ultra subscribers in the US starting the week of May 26, 2026.

Timeline for what is coming:

Now: Trusted tester rollout
Next week (US): Beta for Google AI Ultra subscribers
Coming weeks: MCP support for third-party apps
Later this summer: Chrome agentic browser support
Later this year: Android Halo live task progress, Agent Payments Protocol

The Agent Payments Protocol is worth noting separately — this will allow Spark to make purchases autonomously within parameters you define. That capability has significant implications for e-commerce and workflow automation, though Google is understandably cautious about rolling it out.

Multica: An Open-Source Platform for Managing AI Coding Agents Like Teammates

ArshTechPro — Thu, 21 May 2026 22:07:10 +0000

If you've been using Claude Code, Codex, or similar AI coding agents, you've probably felt the friction: you paste a prompt, watch the run, babysit the output, copy something into the next prompt, and repeat. It works, but it doesn't scale — and it definitely doesn't feel like working with a team.

Multica is an open-source project that tries to fix that. The pitch is simple: treat your AI agents the way you treat human teammates. Assign them issues. Watch them post updates. Let them report blockers. Have them compound skills over time.

What Multica Actually Does

At its core, Multica gives your coding agents a place to live inside your team's workflow. Instead of operating a chatbot in isolation, you assign tasks to an agent the same way you'd assign a GitHub issue to a colleague. The agent picks it up, executes it on a runtime (your local machine or a cloud instance), streams progress back in real time, and posts comments when it needs clarification or hits a wall.

A few things stand out:

Task lifecycle management. Tasks move through states: enqueue, claim, start, complete, or fail. You're not just running a command and hoping — you have visibility into where each task is.

Reusable skills. When an agent solves something well — a deployment script, a migration pattern, a code review checklist — that solution becomes a reusable skill the whole team can pull from. Skills accumulate over time, which is where the "compound" part of the tagline comes from.

Multi-agent, multi-workspace. You can have multiple agents running on different runtimes, organized into workspaces. Each workspace is isolated with its own issues, agents, and settings.

Vendor-neutral. Multica works with Claude Code, Codex, OpenCode, OpenClaw, Hermes, Gemini, Pi, and Cursor Agent. You're not locked into one provider.
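
To make the task lifecycle above concrete, here is a minimal sketch of the five states as a small state machine. The state names come from the list above; the allowed transitions, the `Task` class, and the `advance` method are assumptions for illustration, not Multica's actual implementation.

```python
from enum import Enum


class TaskState(Enum):
    ENQUEUED = "enqueue"
    CLAIMED = "claim"
    STARTED = "start"
    COMPLETED = "complete"
    FAILED = "fail"


# Allowed transitions (assumed for illustration, not taken from Multica's code).
ALLOWED = {
    TaskState.ENQUEUED: {TaskState.CLAIMED},
    TaskState.CLAIMED: {TaskState.STARTED, TaskState.FAILED},
    TaskState.STARTED: {TaskState.COMPLETED, TaskState.FAILED},
    TaskState.COMPLETED: set(),
    TaskState.FAILED: set(),
}


class Task:
    """Hypothetical task record that remembers every state it passed through."""

    def __init__(self, title):
        self.title = title
        self.state = TaskState.ENQUEUED
        self.history = [TaskState.ENQUEUED]

    def advance(self, new_state):
        # Reject any move the transition table does not allow.
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"cannot go from {self.state.name} to {new_state.name}")
        self.state = new_state
        self.history.append(new_state)


task = Task("Fix flaky login test")
for step in (TaskState.CLAIMED, TaskState.STARTED, TaskState.COMPLETED):
    task.advance(step)
print([s.value for s in task.history])
```

Keeping the full history, rather than only the current state, is what gives you the "where is each task" visibility the feature list describes.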

The Architecture in Plain Terms

The stack is straightforward:

┌──────────────┐     ┌──────────────┐     ┌──────────────────┐
│   Next.js    │────>│  Go Backend  │────>│   PostgreSQL     │
│   Frontend   │<────│  (Chi + WS)  │<────│   (pgvector)     │
└──────────────┘     └──────┬───────┘     └──────────────────┘
                            │
                     ┌──────┴───────┐
                     │ Agent Daemon │  runs on your machine
                     └──────────────┘

Frontend: Next.js 16 with the App Router
Backend: Go with the Chi router, sqlc for type-safe queries, and gorilla/websocket for real-time streaming
Database: PostgreSQL 17 with pgvector (for skill embeddings and similarity search)
Agent runtime: A local daemon that auto-detects whatever agent CLIs you have on your PATH
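
To illustrate what "skill embeddings and similarity search" means in practice, here is a small pure-Python sketch that ranks skills by cosine similarity to a query vector. In the real stack this ranking would presumably happen inside PostgreSQL via pgvector; the skill names, the toy 3-dimensional vectors, and the `top_skills` helper are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_skills(query, skills, k=2):
    """Return the names of the k skills whose embeddings are closest to the query."""
    ranked = sorted(
        skills.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Toy embeddings; real ones have hundreds of dimensions and come from a model.
skills = {
    "deploy-script": [0.9, 0.1, 0.0],
    "db-migration": [0.1, 0.9, 0.1],
    "review-checklist": [0.0, 0.2, 0.9],
}
print(top_skills([0.8, 0.3, 0.0], skills))
```

The query vector points mostly along the first axis, so the deployment skill ranks first and the migration skill second.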

The daemon is the key piece. It bridges your machine (where the actual agent CLI lives) with the Multica server (cloud or self-hosted). When an agent is assigned a task, the server routes it to the appropriate daemon, which spawns the CLI process and streams output back via WebSocket.
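
The spawn-and-stream loop described above can be sketched with Python's standard `subprocess` module. This is only an illustration of the pattern: the real daemon is written in Go and sends output over a WebSocket, whereas here a callback stands in for the socket and a child Python process stands in for the agent CLI. The `run_and_stream` name is hypothetical.

```python
import subprocess
import sys


def run_and_stream(command, on_line):
    """Spawn a process and hand each output line to on_line as it arrives."""
    process = subprocess.Popen(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so nothing is lost
        text=True,
        bufsize=1,  # line-buffered in text mode
    )
    for line in process.stdout:
        on_line(line.rstrip("\n"))
    process.stdout.close()
    return process.wait()


# Stand-in for an agent CLI: a child Python process that prints two lines.
fake_agent = [sys.executable, "-c", "print('cloning repo'); print('tests passed')"]
received = []
exit_code = run_and_stream(fake_agent, received.append)
print(exit_code, received)
```

Forwarding lines as they arrive, instead of waiting for the process to exit, is what makes real-time progress possible on long agent runs.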

Getting Up and Running

Installation is one line:

macOS / Linux:

brew install multica-ai/tap/multica
# or without Homebrew:
curl -fsSL https://raw.githubusercontent.com/multica-ai/multica/main/scripts/install.sh | bash

Windows:

irm https://raw.githubusercontent.com/multica-ai/multica/main/scripts/install.ps1 | iex

Then connect everything:

multica setup   # authenticate + start the daemon in one command

After that, open the web app, go to Settings → Runtimes, and you should see your machine listed. From there you create an agent (pick a provider and runtime), and you're ready to assign tasks.

The CLI surface is minimal:

Command	What it does
`multica setup`	One-shot: configure, authenticate, start daemon
`multica daemon start`	Start the local runtime manually
`multica daemon status`	Check what's running
`multica issue list`	List your workspace issues
`multica issue create`	Create a new issue
`multica update`	Pull the latest version

If you want to self-host the whole thing (server included), add --with-server to the install script:

curl -fsSL https://raw.githubusercontent.com/multica-ai/multica/main/scripts/install.sh | bash -s -- --with-server
multica setup self-host

This pulls the official Docker images from GHCR. You'll need Docker. The full self-hosting guide lives in SELF_HOSTING.md in the repo.

For Contributors

The dev setup is a single command:

make dev

It auto-detects your environment, sets up the .env file, installs dependencies, runs DB migrations, and starts all services. Prerequisites: Node.js v20+, pnpm v10.28+, Go v1.26+, and Docker.

The codebase is about 53% TypeScript and 43% Go — frontend and backend are clearly separated, which makes it easy to work on one without touching the other.

Multica vs Going Solo With an Agent CLI

Here's an honest comparison of what Multica adds versus just running claude or codex directly:

What you gain:

A shared board where your whole team sees what agents are working on
Real-time streaming progress instead of waiting for a long CLI run to finish
A skills library that accumulates team knowledge rather than living in individual prompts
Multi-agent routing — different tasks can go to different agents on different machines
An audit trail: who assigned what, when, and what happened

What you take on:

Running a daemon process (or a full server if self-hosting)
A PostgreSQL database
The overhead of a web app and task board

If you're a solo developer running occasional agent tasks, the CLI alone might be enough. If you're on a team — even a small one — trying to coordinate multiple agents across projects, Multica addresses a real gap.

Is It Worth Trying?

It depends on where you are with AI agents.

If you're still experimenting and running agents manually on single tasks, Multica is probably more infrastructure than you need right now. Start with the agent CLI directly and see what breaks.

If you've gotten past that point and are starting to feel the coordination pain — agents running on different machines, teammates not knowing what's been automated, the same solutions being re-invented in different prompts — that's exactly the gap Multica is designed to fill.

It's early software (v0.2.x as of this writing), so expect rough edges. But the core loop — assign, execute, report, reuse — is working, and the momentum behind the project is real.

Links:

GitHub: https://github.com/multica-ai/multica

Why Reading Food Labels Shouldn't Feel Like Decoding a Chemistry Exam

ArshTechPro — Thu, 21 May 2026 11:39:42 +0000

Millions of people with dietary restrictions struggle with food labels every day. Here's the real problem — and how we built SafeScan to fix it.

The Hidden Struggle at Every Grocery Aisle

If you've ever stood in a grocery store, squinting at a tiny ingredient list, trying to figure out if something is safe to eat — you're not alone.

For the 79 million Americans with food allergies, the millions of people looking for halal options, the growing community of vegans and vegetarians, and families managing multiple dietary needs at once, grocery shopping isn't just shopping. It's a high-stakes guessing game.

And the labels don't make it easy.

The Real Problem: Labels Are Designed for Regulators, Not People

Here's what most people don't realize: food labels are technically accurate, but practically useless for the average consumer trying to avoid specific ingredients.

For vegans and vegetarians, the challenge goes beyond spotting "meat" or "chicken." Animal-derived ingredients hide behind names most people wouldn't recognize:

Casein and whey — both from milk, found in "non-dairy" creamers
Carmine (or E120) — a red dye made from crushed insects
Gelatin — derived from animal bones, lurking in gummy candies, marshmallows, and even some yogurts
L-Cysteine — an amino acid often sourced from duck feathers, used in commercial bread
Isinglass — fish bladder extract used to clarify some wines and beers

A product can say "plant-based" on the front and still contain animal-derived emulsifiers in the fine print.

For halal consumers, it gets even more complex. Beyond pork and alcohol (which are relatively easy to spot), there's an entire gray area — mushbooh (doubtful) — that requires ingredient-level analysis:

Glycerin — could be plant-derived or animal-derived. The label won't tell you.
Mono and diglycerides — same problem.
Natural flavors — one of the most common ingredients in packaged food, and one of the most opaque. Could contain anything.
Enzymes — widely used in cheese and baked goods, often from animal sources with no disclosure required.

There's no "halal" or "haram" column on a nutrition label. You're on your own.

For people with allergies, the stakes are literally life-threatening. The FDA's "Big 9" allergens must be declared, but:

Peanuts can appear as "arachis hypogaea" or "groundnuts"
Milk hides behind "lactalbumin," "ghee," or "recaldent"
Eggs show up as "albumin," "lysozyme," or "meringue powder"
"May contain" warnings are voluntary — a manufacturer can choose not to disclose cross-contamination risks

And if you're managing allergies for a child, or for multiple family members with different restrictions? Multiply that cognitive load by every person, every product, every shopping trip.

What We Built: SafeScan

We got tired of the mental gymnastics. So we built SafeScan — a free iOS app that turns your phone's camera into a personal food safety analyst.

How it works:

Scan the barcode. SafeScan looks up the product from a database of over 3 million food items.
Photograph the ingredient label. The app uses on-device OCR to read the actual text — because sometimes the database is incomplete, and the physical label is the ground truth.
Get a clear verdict. Safe. Unsafe. Caution. The app cross-references every ingredient against your personal profile using a curated database of hundreds of allergen synonyms, hidden sources, and dietary restriction rules.
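
As a rough illustration of the verdict step, the sketch below maps each ingredient to a restriction and a certainty level, then collapses the results into Safe, Caution, or Unsafe. The table entries echo examples from this article, but the data structure, the `verdict` function, and the "definite" versus "possible" labels are assumptions, not SafeScan's actual code.

```python
# Hypothetical synonym table: ingredient name -> (restriction it affects, certainty).
HIDDEN_SOURCES = {
    "casein": ("milk", "definite"),
    "whey": ("milk", "definite"),
    "albumin": ("egg", "definite"),
    "gelatin": ("animal", "definite"),
    "glycerin": ("animal", "possible"),
    "natural flavors": ("animal", "possible"),
}


def verdict(ingredients, avoid):
    """Return 'Unsafe', 'Caution', or 'Safe' for one profile's avoid set."""
    result = "Safe"
    for raw in ingredients:
        match = HIDDEN_SOURCES.get(raw.strip().lower())
        if match is None:
            continue  # unknown ingredient: nothing to flag in this sketch
        restriction, certainty = match
        if restriction not in avoid:
            continue  # flagged source, but not one this profile avoids
        if certainty == "definite":
            return "Unsafe"  # one definite hit is enough
        result = "Caution"  # ambiguous source: keep looking for a definite hit
    return result


print(verdict(["Sugar", "Whey", "Salt"], {"milk"}))
print(verdict(["Sugar", "Glycerin"], {"animal"}))
print(verdict(["Sugar", "Glycerin"], {"milk"}))
```

The three calls print Unsafe, Caution, and Safe respectively: whey is a definite milk source, glycerin is only a possible animal source, and glycerin is irrelevant to a milk-only profile.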

No account required. No data leaves your phone. It works offline.

Family Profiles

This was the feature that started the whole project. Real families don't have one set of dietary needs — they have many.

SafeScan lets you create separate profiles for each family member. Your daughter is allergic to tree nuts and eggs. Your partner keeps halal. You're vegan. One app handles all of it. You can even scan a single product and see the verdict for every family member at once.

The Ontology Under the Hood

The part we're most proud of (and the part you'll never see) is the allergen ontology — a hand-curated knowledge graph that maps thousands of ingredient names to their actual sources.

It knows that "surimi" may contain egg. That "stearic acid" can be animal-derived. That "E471" is a mono/diglyceride that could come from pork fat. That "arachis oil" is just another name for peanut oil.

When you scan a product, you're not just doing a string match against a list of allergens. You're running every ingredient through a multi-strategy lookup that catches what human eyes miss.
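
One way to picture a multi-strategy lookup is a chain of progressively looser matches. The sketch below is an assumption about how such a chain might look, not the app's real algorithm: the four strategies, the `normalize` and `lookup` helpers, and the tiny ontology are all illustrative.

```python
import re

# Tiny hypothetical ontology: canonical name -> source note.
ONTOLOGY = {
    "arachis oil": "peanut",
    "e471": "mono/diglyceride, possibly animal-derived",
    "stearic acid": "possibly animal-derived",
    "surimi": "may contain egg",
}


def normalize(name):
    """Lowercase, drop parenthetical notes, and collapse punctuation to spaces."""
    name = re.sub(r"\(.*?\)", " ", name.lower())
    name = re.sub(r"[^a-z0-9]+", " ", name)
    return name.strip()


def lookup(ingredient):
    """Try progressively looser strategies; return (strategy, note) or None."""
    if ingredient in ONTOLOGY:  # 1. exact match
        return "exact", ONTOLOGY[ingredient]
    cleaned = normalize(ingredient)
    if cleaned in ONTOLOGY:  # 2. match after normalization
        return "normalized", ONTOLOGY[cleaned]
    compact = cleaned.replace(" ", "")
    if re.fullmatch(r"e\d{3,4}[a-z]?", compact) and compact in ONTOLOGY:  # 3. E-number
        return "e-number", ONTOLOGY[compact]
    for known, note in ONTOLOGY.items():  # 4. known name inside a longer phrase
        if re.search(rf"\b{re.escape(known)}\b", cleaned):
            return "phrase", note
    return None


for label in ["arachis oil", "Arachis Oil (refined)", "E-471", "emulsifier E471", "sugar"]:
    print(label, "->", lookup(label))
```

A plain string match would catch only the first label; the looser strategies pick up the capitalized, hyphenated, and embedded variants that show up on real packaging.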

Who This Is For

Parents managing food allergies for kids who can't read labels yet
People navigating religious dietary laws in countries where those laws aren't reflected on packaging
Vegans and vegetarians who are tired of discovering animal ingredients after buying something
Anyone with a "custom avoid" list — whether it's MSG, carrageenan, Red 40, or high-fructose corn syrup
Families where everyone at the dinner table has different restrictions

The Honest Disclaimer

SafeScan is an aid, not a medical device. For severe allergies, always verify with the manufacturer. We built this to reduce the daily cognitive burden of reading labels — not to replace medical advice.

Try It

SafeScan is free, ad-free, and private. Available on the App Store for iPhone and iPad.

If this resonates with you, we'd genuinely appreciate you sharing it with someone who spends too long reading ingredient lists. That's who we built it for.

Understand Anything: Turn Any Codebase Into an Interactive Knowledge Graph

ArshTechPro — Tue, 19 May 2026 11:38:23 +0000

You join a new team. The codebase has 200,000 lines of code, no docs worth reading, and the one engineer who knew everything just left. Where do you start?

That exact problem is what Understand Anything was built to solve. It is an open-source plugin (roughly 15k GitHub stars as of May 2026) that scans your project using a multi-agent AI pipeline, builds a structured knowledge graph of every file, function, class, and dependency, and then gives you an interactive visual dashboard to explore it all. The stated goal is refreshingly honest: "graphs that teach, not graphs that impress."

What It Actually Does

At its core, Understand Anything does three things:

Structural analysis. It maps your codebase as a graph where every file, function, and class is a node. You can click any node to see a plain-English summary of what it does, what depends on it, and where it fits in the architecture.

Business domain extraction. Beyond code structure, it has a separate domain view that maps how your code relates to real business processes — domains, flows, and steps. This is genuinely useful when you need to explain a system to a non-technical stakeholder or write onboarding docs.

Knowledge base analysis. If your team uses a Karpathy-pattern LLM wiki (markdown files with wikilinks), you can point the tool at it and get a force-directed knowledge graph with community clustering. The tool discovers both explicit links and implicit relationships between concepts.

Supporting features include guided tours (auto-generated walkthroughs ordered by dependency), fuzzy and semantic search, diff impact analysis to see what your current changes affect, and a persona-adaptive UI that adjusts detail level based on whether you describe yourself as a junior dev, PM, or senior engineer.

How to Set It Up

Setup is straightforward. The tool works across a wide range of AI coding environments: Claude Code, Cursor, VS Code with GitHub Copilot, Codex, Gemini CLI, and about a dozen others.

Claude Code (Native)

/plugin marketplace add Lum1104/Understand-Anything
/plugin install understand-anything

macOS / Linux (for Codex, Gemini CLI, Cursor, Copilot, and others)

curl -fsSL https://raw.githubusercontent.com/Lum1104/Understand-Anything/main/install.sh | bash

If you want to skip the interactive prompt and target a specific platform directly:

curl -fsSL https://raw.githubusercontent.com/Lum1104/Understand-Anything/main/install.sh | bash -s codex

Supported platform values: gemini, codex, opencode, pi, openclaw, antigravity, vibe, vscode, hermes, cline, kimi.

Windows (PowerShell)

iwr -useb https://raw.githubusercontent.com/Lum1104/Understand-Anything/main/install.ps1 | iex

The installer clones the repo to ~/.understand-anything/repo and creates the right symlinks for your chosen platform. Restart your CLI or IDE afterward.

Cursor and VS Code + Copilot

These two auto-discover the plugin via the .cursor-plugin/plugin.json and .copilot-plugin/plugin.json files respectively when you clone the repo. No manual installation step needed — clone and open.

Using It Day-to-Day

Once installed, the main commands are:

# Analyze the entire codebase and build the graph
/understand

# Open the interactive dashboard
/understand-dashboard

# Ask questions in natural language
/understand-chat How does the payment flow work?

# See what your current diff affects
/understand-diff

# Deep-dive into a specific file or function
/understand-explain src/auth/login.ts

# Generate onboarding docs for new team members
/understand-onboard

# Extract business domain flows
/understand-domain

# Analyze a markdown wiki knowledge base
/understand-knowledge ~/path/to/wiki

For multilingual teams, you can generate content in your preferred language:

/understand --language zh   # Simplified Chinese
/understand --language ja   # Japanese
/understand --language ko   # Korean

Sharing the Graph With Your Team

The generated graph is stored as a JSON file at .understand-anything/knowledge-graph.json. You can commit it to the repo so teammates skip the pipeline entirely on first use. Exclude the scratch files:

.understand-anything/intermediate/
.understand-anything/diff-overlay.json

For large graphs (10 MB+), use git-lfs:

git lfs install
git lfs track ".understand-anything/*.json"
git add .gitattributes .understand-anything/

How the Pipeline Works

When you run /understand, it orchestrates five specialized agents in sequence. A sixth, domain-analyzer, is added for domain extraction, and article-analyzer handles wiki knowledge bases:

Agent	What It Does
`project-scanner`	Discovers files, detects languages and frameworks
`file-analyzer`	Extracts functions, classes, imports; produces nodes and edges
`architecture-analyzer`	Identifies architectural layers (API, Service, Data, UI, Utility)
`tour-builder`	Generates guided learning tours ordered by dependency
`graph-reviewer`	Validates completeness and referential integrity
`domain-analyzer`	Extracts business domains, flows, and process steps
`article-analyzer`	Extracts entities and implicit relationships from wiki articles

File analyzers run in parallel — up to 5 concurrent with 20-30 files per batch. It also supports incremental updates, so only files changed since the last run get re-analyzed.
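
The batching and incremental behavior described above can be sketched with the standard library. The numbers (5 workers, batches in the 20-30 range) come from the paragraph above; the hash-based change detection, the helper names, and the placeholder `analyze_batch` are assumptions about how such a pipeline could work, not the plugin's actual code.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def make_batches(items, batch_size=25):
    """Split a list into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def changed_files(contents, previous_hashes):
    """Return paths whose content hash differs from the last recorded run."""
    changed = []
    for path, text in contents.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if previous_hashes.get(path) != digest:
            changed.append(path)
    return sorted(changed)


def analyze_batch(batch):
    """Placeholder for the real (LLM-backed) analysis; here it only counts files."""
    return len(batch)


def analyze_all(paths, max_workers=5, batch_size=25):
    """Run analyze_batch over all batches with at most max_workers in flight."""
    batches = make_batches(paths, batch_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(analyze_batch, batches))


files = {f"src/file_{i}.ts": f"export const x = {i};" for i in range(60)}
previous = {p: hashlib.sha256(t.encode("utf-8")).hexdigest() for p, t in files.items()}
files["src/file_7.ts"] = "export const x = 700;"  # simulate an edit since the last run
print(changed_files(files, previous))
print(analyze_all(sorted(files)))
```

Only the edited file shows up as changed, which is why incremental runs are so much cheaper than the first full scan.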

The Merits

Broad platform support. It works natively with Claude Code and has one-line installs for 14 other platforms. If your team uses different editors, everyone can still use the same tool.

AI-native but not AI-locked. The knowledge graph output is plain JSON. Once generated, the dashboard runs independently. You are not making LLM calls every time you explore the graph.

Incrementally useful. You do not have to commit to using every feature. Running /understand + /understand-dashboard alone is already valuable for orientation on an unfamiliar codebase.

Team-shareable output. Committing the graph to the repo means the analysis work is done once and shared. A new hire can open the dashboard on day one without running the pipeline.

Actively maintained. 14.7k stars, 1.4k forks, 496 commits, a v2.5.0 release in May 2026. The project is not abandoned.

Language support. English, Simplified Chinese, Traditional Chinese, Japanese, and Korean are supported for output, which matters for distributed teams.
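
Because the graph output is plain JSON, you can inspect it with ordinary tooling. The sketch below checks referential integrity, the same property the graph-reviewer agent is described as validating. The `nodes`/`edges` shape and the `source`/`target` keys are a hypothetical schema chosen for illustration; check the real knowledge-graph.json before relying on any field name.

```python
import json

# Hypothetical graph shape; the real knowledge-graph.json schema may differ.
raw = json.dumps({
    "nodes": [{"id": "src/auth/login.ts"}, {"id": "src/db/users.ts"}],
    "edges": [
        {"source": "src/auth/login.ts", "target": "src/db/users.ts"},
        {"source": "src/auth/login.ts", "target": "src/missing.ts"},
    ],
})


def dangling_edges(graph):
    """Return edges that point at node ids not present in the graph."""
    known = {node["id"] for node in graph["nodes"]}
    return [
        edge for edge in graph["edges"]
        if edge["source"] not in known or edge["target"] not in known
    ]


graph = json.loads(raw)
problems = dangling_edges(graph)
print(problems)
```

A check like this could run in CI to catch a committed graph that references files which have since been deleted.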

The Drawbacks

LLM costs are on you. The multi-agent pipeline makes real LLM calls during the analysis phase.

Graph quality depends on code quality. If the codebase has unclear naming, no logical separation of concerns, or is largely procedural scripts, the resulting graph will reflect that chaos rather than clarify it. The tool surfaces structure that exists; it does not invent structure that does not.

Initial scan time on large repos. Even with parallel processing (5 concurrent agents), scanning a 200,000-line monorepo takes time. The incremental update feature helps on subsequent runs, but the first pass can be slow.

Knowledge graph can become stale. Unless you enable --auto-update (a post-commit hook), the graph drifts from the codebase. Teams that forget to re-run /understand before major releases will hand out outdated onboarding graphs.

Is It Worth a Try?

For most developers, yes, with some caveats.

The most compelling use case is onboarding. A committed knowledge graph means a new team member can open an interactive visual map of the architecture on day one, take a guided tour ordered by dependency, ask natural language questions about how flows work, and get to meaningful contribution faster. That alone is worth the LLM cost of the initial scan.

The tool is still evolving, but the community is active and the source is MIT-licensed, so there is little risk in trying it.

Quick Reference

GitHub: https://github.com/Lum1104/Understand-Anything
License: MIT

OpenSRE: Build Your Own AI Incident-Investigation Agent

ArshTechPro — Mon, 18 May 2026 09:00:18 +0000

Most AI coding tools stop at the editor. They help you write code. But the hardest, most stressful part of running software is not writing it. It is the moment it breaks in production at 2 a.m.

OpenSRE is built for that moment.

The problem it solves

When an incident hits, the evidence is scattered. Logs are in Datadog. Metrics are in Grafana. The config change that caused it is in Git. Service dependencies live in your infra layer. Each system saw part of what happened. None of them saw all of it.

So you do it manually. You pull logs, line up timestamps, ping the colleague who knows that stack, and slowly piece the story together. It takes hours. Under on-call pressure, you often just ship a patch to get the system back up and figure out the real cause later.

OpenSRE automates that investigation.

What it is

OpenSRE is an open-source framework, built on LangGraph, for building AI-powered SRE agents that automate incident investigation and root cause analysis. It is Apache 2.0 licensed and maintained by Tracer.

The point is not a single fixed product. It is a toolkit. You plug in the alerting sources you already use and compose custom investigation workflows tailored to your own infrastructure.

How the investigation runs

When an alert fires, the agent works through a defined sequence:

Ingest the alert from your monitoring or incident system.
Assemble context from logs, metrics, configs, and dependencies.
Frame the failure modes that could plausibly explain the incident.
Execute investigation queries across connected systems, in parallel.
Evaluate hypotheses against the evidence collected.
Deliver a root cause report and recommended next actions, to Slack out of the box.

The agent tests several hypotheses at once and stops when it has enough confidence to give a clear answer, rather than running forever or guessing early.
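
The evaluate-and-report steps can be sketched as scoring several hypotheses concurrently and only naming a root cause when the best score clears a threshold. This is a simplified illustration, not OpenSRE's LangGraph implementation: the hypotheses, scoring rules, evidence keys, and the 0.75 threshold are all invented for the example, and true early stopping is not modeled.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def score_bad_deploy(evidence):
    return 0.9 if evidence["recent_deploy"] and evidence["error_rate_spike"] else 0.1


def score_disk_pressure(evidence):
    return 0.8 if evidence["disk_full"] else 0.05


def score_traffic_surge(evidence):
    return 0.4 if evidence["error_rate_spike"] else 0.1


# Hypothetical hypotheses, each with a scoring function over the evidence.
HYPOTHESES = {
    "bad_deploy": score_bad_deploy,
    "disk_pressure": score_disk_pressure,
    "traffic_surge": score_traffic_surge,
}


def investigate(evidence, threshold=0.75):
    """Score all hypotheses in parallel; name a root cause only above the threshold."""
    scores = {}
    with ThreadPoolExecutor(max_workers=len(HYPOTHESES)) as pool:
        futures = {pool.submit(fn, evidence): name for name, fn in HYPOTHESES.items()}
        for future in as_completed(futures):
            scores[futures[future]] = future.result()
    best = max(scores, key=scores.get)
    if scores[best] >= threshold:
        return {"root_cause": best, "confidence": scores[best]}
    return {"root_cause": None, "confidence": scores[best]}


# Hypothetical evidence assembled from logs, metrics, and config history.
evidence = {"recent_deploy": True, "error_rate_spike": True, "disk_full": False}
print(investigate(evidence))
```

Returning no root cause when nothing clears the threshold mirrors the "rather than guessing early" behavior: an inconclusive report is more useful than a confident wrong one.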

What it connects to

OpenSRE integrates with the systems that already power modern platforms:

Data platform: Apache Airflow, Kafka, Spark
Observability: Grafana, Datadog, CloudWatch, Sentry
Infrastructure: Kubernetes, AWS, GCP, Azure
Dev tools: GitHub
Communication: Slack, PagerDuty

Adding a new output destination, such as routing reports to PagerDuty or OpsGenie, is described as one of the easiest contributions you can make.

Design principles worth noting

OpenSRE leans on a few principles that matter for production use: deterministic investigations, evidence-backed conclusions, parallel hypothesis testing, and fully auditable workflows.

That last point is important. This is not a black-box LLM that hands you a guess. The investigation is traceable, so you can see why it reached a conclusion.

Getting started

You can try it without touching production. The repo ships a local Grafana plus Loki demo that produces a real root cause report. After the setup steps, the final command below runs it:

git clone https://github.com/Tracer-Cloud/open-sre-agent
cd open-sre-agent
make install
make install-hooks
cp .env.example .env
opensre onboard
make local-grafana-live

The opensre onboard step walks you through configuring a local LLM provider and optionally validating integrations like Grafana, Datadog, Slack, AWS, GitHub, and Sentry. There is also a bundled demo that skips Docker entirely if you just want to see the flow.

Is it useful?

Promising, with caveats worth being honest about.

It is the youngest of the new wave of AI-agent tooling, with a smaller community and no tagged releases yet. It is also clearly aimed at data-platform teams, the Airflow, Kafka, and Spark crowd. If that describes your stack and on-call is genuinely painful, the local demo is worth an afternoon.

Heed the project's own security guidance: use read-only credentials, restrict network exposure, log every investigation, and always review a report before any automated remediation. An agent touching production systems deserves that caution.

The takeaway

AI agents are moving past the editor and into operations. OpenSRE is an early, open look at what an AI SRE actually involves: not a magic fix-it button, but a structured, auditable investigator that correlates the signals you already have. If incident response on your team still means hours of manual log-correlation, it is a project worth watching and, if your stack fits, worth trying.