<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Arindam Majumder </title>
    <description>The latest articles on DEV Community by Arindam Majumder  (@arindam_1729).</description>
    <link>https://dev.to/arindam_1729</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F965723%2Fe0982512-4de1-4154-b3c3-1869d19e9ecc.png</url>
      <title>DEV Community: Arindam Majumder </title>
      <link>https://dev.to/arindam_1729</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/arindam_1729"/>
    <language>en</language>
    <item>
      <title>I built a local dashboard to inspect Claude Code sessions, tokens, and costs</title>
      <dc:creator>Arindam Majumder </dc:creator>
      <pubDate>Thu, 02 Apr 2026 07:58:55 +0000</pubDate>
      <link>https://dev.to/arindam_1729/i-built-a-local-dashboard-to-inspect-claude-code-sessions-tokens-and-costs-173m</link>
      <guid>https://dev.to/arindam_1729/i-built-a-local-dashboard-to-inspect-claude-code-sessions-tokens-and-costs-173m</guid>
      <description>&lt;p&gt;I’ve been using Claude Code heavily over the last few weeks and started wondering where my tokens were actually going.&lt;/p&gt;

&lt;p&gt;Claude stores everything locally in ~/.claude/, which is great, but the data mostly sits in JSON logs. If you want to understand session usage, token costs, tool calls, or activity patterns, you basically end up digging through raw files.&lt;/p&gt;
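&lt;p&gt;For a sense of what that digging looks like, here’s a rough Python sketch that walks the session logs and totals token usage per project. The directory layout and field names (&lt;code&gt;projects/&lt;/code&gt;, &lt;code&gt;message.usage.input_tokens&lt;/code&gt;) are assumptions about what the logs contain, not a documented schema:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Rough sketch: sum token usage per project from local Claude Code JSONL logs.
# Paths and field names are assumptions; adjust to match your ~/.claude layout.
import json
from collections import Counter
from pathlib import Path

totals = Counter()
for log in Path.home().glob(".claude/projects/*/*.jsonl"):
    for line in log.read_text(encoding="utf-8").splitlines():
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partial or non-JSON lines
        usage = (event.get("message") or {}).get("usage") or {}
        totals[log.parent.name] += usage.get("input_tokens", 0) + usage.get("output_tokens", 0)

for project, tokens in totals.most_common():
    print(f"{project}: {tokens} tokens")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;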

&lt;p&gt;So I built a small tool called cc-lens.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3spc4nyf2nhk221or95m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3spc4nyf2nhk221or95m.png" alt="Image1" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It’s a local-first dashboard that reads your Claude Code session files and turns them into something you can actually explore.&lt;/p&gt;

&lt;p&gt;It runs entirely on your machine: no cloud sync, no sign-ups, no telemetry.&lt;/p&gt;

&lt;p&gt;Some things it shows:&lt;/p&gt;

&lt;p&gt;• Usage overview: sessions, messages, tokens, estimated cost&lt;br&gt;
• Per-project breakdown: see which repos are burning the most tokens&lt;br&gt;
• Full session replay: inspect conversations turn-by-turn with token counts and tool calls&lt;br&gt;
• Cost &amp;amp; cache analytics: stacked charts by model and cache usage&lt;br&gt;
• Activity heatmap: GitHub-style view of when you’re using Claude the most&lt;br&gt;
• Memory &amp;amp; plan explorer: browse/edit Claude memory files and saved plans&lt;br&gt;
• Export/import: move dashboards across machines&lt;/p&gt;

&lt;p&gt;You can run it instantly with:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;npx cc-lens&lt;/code&gt;&lt;br&gt;
(or clone the repo if you prefer).&lt;/p&gt;

&lt;p&gt;Here's the &lt;a href="https://github.com/Arindam200/cc-lens/" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt;, if you want to try it out!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>webdev</category>
      <category>javascript</category>
    </item>
    <item>
      <title>[Boost]</title>
      <dc:creator>Arindam Majumder </dc:creator>
      <pubDate>Sat, 28 Mar 2026 13:37:31 +0000</pubDate>
      <link>https://dev.to/arindam_1729/-1p2l</link>
      <guid>https://dev.to/arindam_1729/-1p2l</guid>
      <description>&lt;p&gt;Boost: &lt;a href="https://dev.to/studio1hq/running-llm-applications-across-providers-with-bifrost-313h"&gt;Running LLM Applications Across Providers with Bifrost&lt;/a&gt;, posted for Studio1 (5 min read).&lt;/p&gt;</description>
      <category>ai</category>
      <category>llm</category>
      <category>proxy</category>
      <category>litellm</category>
    </item>
    <item>
      <title>Build a Semantic Movie Discovery App with Claude Code and Weaviate Agent Skills</title>
      <dc:creator>Arindam Majumder </dc:creator>
      <pubDate>Fri, 27 Mar 2026 20:45:45 +0000</pubDate>
      <link>https://dev.to/studio1hq/build-a-semantic-movie-discovery-app-with-claude-code-and-weaviate-agent-skills-30gd</link>
      <guid>https://dev.to/studio1hq/build-a-semantic-movie-discovery-app-with-claude-code-and-weaviate-agent-skills-30gd</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Versatility in agentic coding is increasing as new tools such as Model Context Protocol (MCP) servers and Agent Skills become more common. At the same time, many developers ask the same question when building AI applications: should they use MCP servers or Agent Skills? The important thing is understanding what each approach does well and choosing the one that fits your use case.&lt;/p&gt;

&lt;p&gt;In this post, we’ll explain what MCP servers and Agent Skills are and how they differ, including architecture diagrams and technical details. In the later sections, we’ll also walk through how to use &lt;a href="https://github.com/weaviate/agent-skills" rel="noopener noreferrer"&gt;Weaviate Agent Skills&lt;/a&gt; with &lt;a href="https://code.claude.com/docs/en/overview" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt; to build a “Semantic Movie Discovery” application with several useful features.&lt;/p&gt;

&lt;p&gt;Let’s get started!&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding MCP
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://modelcontextprotocol.io/" rel="noopener noreferrer"&gt;Model Context Protocol&lt;/a&gt; (MCP) is an open standard introduced by Anthropic that enables Large Language Models (LLMs) to interact with external systems such as data sources, APIs and services. MCP provides a structured way for an &lt;a href="https://weaviate.io/agentic-ai" rel="noopener noreferrer"&gt;AI agent&lt;/a&gt; to connect to compliant tools through a single interface instead of requiring custom integrations for each service.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flqfus3ya7jofj8kchzml.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flqfus3ya7jofj8kchzml.png" alt="MCP Architecture " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  MCP Architecture
&lt;/h3&gt;

&lt;p&gt;The MCP system operates on a client–server model and consists of three main components.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Host:&lt;/strong&gt; the application that runs the AI model and provides the environment where the agent operates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Client:&lt;/strong&gt; the protocol connector inside the host that handles communication between the model and MCP servers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Server:&lt;/strong&gt; an external service that exposes tools, resources, or prompts that the agent can access.&lt;/li&gt;
&lt;/ul&gt;
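
&lt;p&gt;To make the server role concrete, here is a minimal sketch of an MCP server written with the official Python SDK (the &lt;code&gt;mcp&lt;/code&gt; package). It exposes a single tool that any MCP-compatible host can call; treat the exact import path as an assumption if your SDK version differs:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal MCP server sketch using the official Python SDK (pip install mcp).
# It exposes one tool; an MCP host such as Claude Code connects as the client.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-server")

@mcp.tool()
def word_count(text: str) -&gt; int:
    """Count the words in a piece of text."""
    return len(text.split())

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;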

&lt;h3&gt;
  
  
  MCP and Agentic Coding
&lt;/h3&gt;

&lt;p&gt;Before MCP, each AI tool required custom integrations for every external service it wanted to connect to. MCP simplifies this process by introducing a shared protocol that multiple agents and tools can use.&lt;/p&gt;

&lt;p&gt;Developers can now expose capabilities through an MCP server once and allow any compatible agent to access them without building separate integrations for each system.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Understanding Agent Skills&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://platform.claude.com/docs/en/agents-and-tools/agent-skills/overview" rel="noopener noreferrer"&gt;Agent Skills&lt;/a&gt;, also introduced by Anthropic, provide developers with a simple way to extend AI coding agents without running MCP servers. An Agent Skill is a structured configuration file, usually written as markdown files with YAML metadata that defines capabilities, parameter schemas and natural-language instructions describing how the agent should use those capabilities.&lt;/p&gt;

&lt;p&gt;AI tools such as Claude Code read these files at session start and load the skills directly into the agent's working context without requiring an additional runtime.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn13awyixqnmfnllmjlld.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn13awyixqnmfnllmjlld.png" alt="Agent Skills with an AI tool (Claude Code)" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How Agent Skills Work
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;When Claude Code detects a skill file in the project directory (typically under &lt;code&gt;.claude/skills/&lt;/code&gt;), it loads the manifest into the agent's context at the beginning of the session.&lt;/li&gt;
&lt;li&gt;The skill definition describes available capabilities, how to invoke them correctly and when to prefer one approach over another. Because the instructions are written in natural language alongside parameter schemas, the agent can reason about how to use the skill.&lt;/li&gt;
&lt;li&gt;Skills are portable across repositories. If a developer commits a skill file to a repository, any collaborator who clones the project and opens it in Claude Code automatically gains access to the same capabilities without additional setup.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;MCP and Agent Skills solve different problems in agent systems. MCP provides a standardized way for AI agents to connect to external tools, APIs, databases and services through a client–server architecture with structured schemas. Agent Skills extend the agent’s capabilities through configuration files that define workflows, instructions and parameter schemas without requiring a running server.&lt;/p&gt;

&lt;p&gt;In simple terms, &lt;strong&gt;MCP enables agents to access external systems, while Agent Skills define how agents perform tasks or workflows within their environment.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Weaviate Agent Skills
&lt;/h2&gt;

&lt;p&gt;Weaviate has released an official set of &lt;a href="https://github.com/weaviate/agent-skills" rel="noopener noreferrer"&gt;Agent Skills&lt;/a&gt; designed for use with Claude Code and other compatible agent-based development environments like Cursor, Antigravity, Windsurf and more. These skills provide structured access to Weaviate vector databases, allowing agents to perform common operations such as search, querying, schema inspection, data exploration and collection management.&lt;/p&gt;

&lt;p&gt;The repository includes ready-to-use skill definitions for tasks like semantic, hybrid and keyword search, along with natural language querying through the Query Agent. It also supports workflows such as creating collections, importing data and fetching filtered results, and it ships with cookbooks for full projects. This enables agents to build with Weaviate and perform multi-step retrieval and agentic tasks more effectively.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhgiqyrgy3vpbq0xxz5ej.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhgiqyrgy3vpbq0xxz5ej.png" alt="Weaviate Ecosystem Tools and Features" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Agent Skills and Vector Databases
&lt;/h2&gt;

&lt;p&gt;AI coding agents face difficulties when working with vector databases. Vector database APIs provide extensive capabilities, including basic “key–value” retrieval, single-vector near-text searches, multimodal near-image searches, hybrid BM25-plus-vector search, generative modules and multi-tenant system support. Without structured guidance, even a capable coding agent may produce suboptimal queries: correct syntax but the wrong search strategy, missing parameters or failure to use powerful features like the Weaviate Query Agent.&lt;/p&gt;
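
&lt;p&gt;For example, the same request can be answered with three different query strategies in the Weaviate Python client, and picking the wrong one is exactly the kind of mistake an unguided agent makes. A sketch, assuming a connected v4 &lt;code&gt;client&lt;/code&gt; and an existing Movie collection:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Three query strategies for the same request (sketch; assumes `client` is a
# connected weaviate.WeaviateClient and a "Movie" collection already exists).
movies = client.collections.get("Movie")

# Keyword (BM25): best for exact terms like titles or actor names.
kw = movies.query.bm25(query="space station thriller", limit=5)

# Pure vector search: best for fuzzy, descriptive queries.
sem = movies.query.near_text(query="a lonely astronaut stranded far from home", limit=5)

# Hybrid: blends both; alpha=0.5 weighs keyword and vector scores equally.
hyb = movies.query.hybrid(query="space station thriller", alpha=0.5, limit=5)

for result in (kw, sem, hyb):
    print([obj.properties["title"] for obj in result.objects])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;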

&lt;p&gt;&lt;a href="https://weaviate.io/blog/weaviate-agent-skills" rel="noopener noreferrer"&gt;Weaviate Agent Skills&lt;/a&gt; address this by providing correct usage patterns, parameter recommendations and decision logic, enabling coding agents to generate production-ready code from their initial attempts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Weaviate Agent Skills repository is organized into two main parts&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Facdcuqk3n68wemqdz6hj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Facdcuqk3n68wemqdz6hj.png" alt="Overview of Weaviate Agent Skills" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Weaviate Skill&lt;/strong&gt; (skills/weaviate): Focused scripts for tasks such as schema inspection, data ingestion and vector search. Agents use these while writing application logic or backend code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cookbooks Skill&lt;/strong&gt; (skills/weaviate-cookbooks): End-to-end project examples that combine tools such as FastAPI, Next.js and Weaviate to demonstrate full application workflows.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Weaviate Agent Skills work with several development environments, including Claude Code, Cursor, GitHub Copilot, VS Code and Gemini CLI. When connected to a Weaviate Cloud instance, agents can directly interact with database modules and perform search, data management and retrieval tasks.&lt;/p&gt;

&lt;p&gt;To evaluate how effective Weaviate Agent Skills really are, let’s build a small project and see how they accelerate RAG and agentic application development with Claude Code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building the Semantic Movie Discovery Application
&lt;/h2&gt;

&lt;p&gt;We will build a &lt;strong&gt;Movie Discovery App&lt;/strong&gt; that takes a natural-language description and returns the most semantically similar movies from a Weaviate collection. In the process, we will explore Weaviate capabilities such as multimodal storage, named vector search, generative AI (RAG) and the Query Agent in action with Claude Code, showing how these Agentic tools help you build applications faster.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Prerequisites&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.python.org/downloads/" rel="noopener noreferrer"&gt;Python 3.10&lt;/a&gt; or higher&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.weaviate.io/weaviate/quickstart" rel="noopener noreferrer"&gt;Weaviate Cloud&lt;/a&gt; – Create a free cluster and obtain an API key.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.themoviedb.org/" rel="noopener noreferrer"&gt;TMDB API key&lt;/a&gt; – Used to fetch movie metadata&lt;/li&gt;
&lt;li&gt;OpenAI API key – Required for &lt;a href="https://weaviate.io/rag" rel="noopener noreferrer"&gt;RAG&lt;/a&gt; features.&lt;/li&gt;
&lt;li&gt;Access to &lt;a href="https://code.claude.com/docs/en/quickstart" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://nodejs.org/en/download" rel="noopener noreferrer"&gt;Node.js 18+&lt;/a&gt; and npm – Required to run the Next.js frontend&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 1: Project Setup
&lt;/h3&gt;

&lt;p&gt;Create a &lt;strong&gt;movie-discovery-app&lt;/strong&gt; folder&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;mkdir&lt;/span&gt; &lt;span class="n"&gt;movie&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;discovery&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create and activate a  &lt;strong&gt;Python virtual environment&lt;/strong&gt; in the folder&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;movie-discovery-app py &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;source &lt;/span&gt;venv&lt;span class="se"&gt;\S&lt;/span&gt;cripts&lt;span class="se"&gt;\a&lt;/span&gt;ctivate.bat 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Install Python dependencies&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;weaviate-client&lt;span class="o"&gt;==&lt;/span&gt;4.20.1 fastapi uvicorn[standard] openai weaviate-agents&amp;gt;&lt;span class="o"&gt;=&lt;/span&gt;1.3.0 requests python-dotenv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Install Node.js dependencies for the frontend (run this once the frontend/ folder exists after Step 3)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;frontend &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; npm &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, create a &lt;code&gt;.env&lt;/code&gt; file at the project root. Add the following parameters to configure &lt;strong&gt;Weaviate Agent Skills with Claude Code&lt;/strong&gt;, along with your &lt;strong&gt;OpenAI API key&lt;/strong&gt; and &lt;strong&gt;TMDB API key&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;WEAVIATE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;your&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;without&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;https&lt;/span&gt;
&lt;span class="n"&gt;WEAVIATE_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;your&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;
&lt;span class="n"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;your&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;
&lt;span class="n"&gt;TMDB&lt;/span&gt; &lt;span class="n"&gt;API&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;your&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;tmdb&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After signing up for Weaviate, click the &lt;strong&gt;Create Cluster&lt;/strong&gt; button to start a new cluster for your use.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feo4cx6bxr7o7xkbqyu1j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feo4cx6bxr7o7xkbqyu1j.png" alt="Image1" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click &lt;strong&gt;“How to Connect”&lt;/strong&gt; to view the required Weaviate connection parameters.&lt;/p&gt;

&lt;p&gt;Now that everything is set up, we can connect Weaviate Cloud with &lt;strong&gt;Claude Code&lt;/strong&gt; by running &lt;code&gt;claude&lt;/code&gt; in your project terminal:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz9y0xh1tmthf9gp5hilm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz9y0xh1tmthf9gp5hilm.png" alt="Claude Code screnshot" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Use the following prompt in your Claude terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Write and run &lt;span class="sb"&gt;`check_modules.py`&lt;/span&gt; that connects using &lt;span class="sb"&gt;`weaviate.connect_to_weaviate_cloud`&lt;/span&gt;with &lt;span class="sb"&gt;`skip_init_checks=True`&lt;/span&gt;, loads credentials from &lt;span class="sb"&gt;`.env`&lt;/span&gt; with &lt;span class="sb"&gt;`python-dotenv`&lt;/span&gt;,
and prints the full JSON list of enabled Weaviate modules.
Run it with &lt;span class="sb"&gt;`venv/Scripts/python check_modules.py`&lt;/span&gt;."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
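
&lt;p&gt;The script Claude Code generates from this prompt should come out roughly like the following. This is a sketch of the expected result, not the verbatim output:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# check_modules.py: sketch of what Claude Code should generate from the prompt.
import json
import os

import weaviate
from dotenv import load_dotenv
from weaviate.classes.init import Auth

load_dotenv()  # pulls WEAVIATE_URL / WEAVIATE_API_KEY from .env

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.environ["WEAVIATE_URL"],
    auth_credentials=Auth.api_key(os.environ["WEAVIATE_API_KEY"]),
    skip_init_checks=True,
)
try:
    meta = client.get_meta()
    print(json.dumps(meta.get("modules", {}), indent=2))  # enabled modules
finally:
    client.close()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;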



&lt;h3&gt;
  
  
  Step 2: Create A Weaviate Collection and Import Sample Movie Data
&lt;/h3&gt;

&lt;p&gt;In this step, we create a Weaviate collection and import the movie dataset into Weaviate.  The dataset contains movie metadata sourced from the TMDB API. Each entry includes: &lt;em&gt;title, overview, release_date, poster_url, popularity, and other important movie fields&lt;/em&gt;. You can import a JSON or CSV dataset directly into Weaviate.&lt;/p&gt;

&lt;p&gt;Run this prompt to retrieve the dataset from the TMDB API and save it to a file named &lt;em&gt;movies.json&lt;/em&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Create a TMDB dataset JSON file, movies.json, that contains 100 movie metadata and poster URLs directly from the TMDB API. 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Afterwards, the &lt;a href="https://github.com/weaviate/agent-skills/blob/main/skills/weaviate/references/import_data.md" rel="noopener noreferrer"&gt;Weaviate Import Skill&lt;/a&gt; creates a Weaviate collection and imports the data from &lt;em&gt;movies.json&lt;/em&gt; into the Weaviate database. Claude Code activates the skill when prompted with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Import &lt;span class="sb"&gt;`movie.json`&lt;/span&gt; into a new Weaviate collection called Movie
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
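
&lt;p&gt;Under the hood, the generated import script typically boils down to something like this. A sketch, assuming &lt;code&gt;client&lt;/code&gt; is connected as in &lt;code&gt;check_modules.py&lt;/code&gt; and &lt;em&gt;movies.json&lt;/em&gt; holds a flat list of objects:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of the import the skill performs (assumes `client` is connected as in
# check_modules.py and movies.json holds a list of flat dicts).
import json

client.collections.create("Movie")  # auto-schema infers properties on import

movies = client.collections.get("Movie")
with movies.batch.dynamic() as batch:
    for movie in json.load(open("movies.json", encoding="utf-8")):
        batch.add_object(properties=movie)

total = movies.aggregate.over_all(total_count=True).total_count
print(f"Imported {total} objects")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;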



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnbeb2l8quvgqtbfmbzt7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnbeb2l8quvgqtbfmbzt7.png" alt="Claude Code" width="800" height="266"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then the data is imported:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuihrumms8ofngypte6vi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuihrumms8ofngypte6vi.png" alt="Terminal Output" width="800" height="236"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Building the FastAPI Backend and Next.js Frontend with Weaviate Cookbooks
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/weaviate/agent-skills/blob/main/skills/weaviate-cookbooks/references/frontend_interface.md" rel="noopener noreferrer"&gt;Weaviate cookbooks&lt;/a&gt; enable the app to use a two-layer architecture: a FastAPI backend that exposes REST endpoints and a Next.js frontend that renders the UI. The backend connects directly to Weaviate Cloud and the Weaviate Query Agent. Weaviate cookbooks also include some frontend guidelines to communicate with the &lt;a href="https://github.com/weaviate/agent-skills/blob/main/skills/weaviate-cookbooks/references/frontend_interface.md" rel="noopener noreferrer"&gt;Weaviate backend&lt;/a&gt; over HTTP.&lt;/p&gt;

&lt;p&gt;The app is organized into two views accessed via a collapsible sidebar:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Search view&lt;/strong&gt;: performs semantic search and RAG using Weaviate named vectors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chat view&lt;/strong&gt;: handles multi-turn conversations through the Weaviate Query Agent.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Our app includes the following features:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Layer&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Component&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Role&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Backend&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;backend.py (FastAPI), REST API on port 8000 (interactive docs at /docs)&lt;/td&gt;
&lt;td&gt;Routes: GET /health, GET /search, POST /ai/explain, POST /ai/plan, POST /chat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Frontend&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Next.js + TypeScript (port 3000)&lt;/td&gt;
&lt;td&gt;Single-page app with sidebar navigation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;SearchView.tsx&lt;/td&gt;
&lt;td&gt;Semantic search (near_text), AI explanations (single_prompt), Movie Night Planner (grouped_task)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;MovieCard.tsx&lt;/td&gt;
&lt;td&gt;Renders base64 poster inline, watchlist add/remove button&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;ChatView.tsx&lt;/td&gt;
&lt;td&gt;Multi-turn Query AI Agent chat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;AppSidebar.tsx&lt;/td&gt;
&lt;td&gt;Navigation (Search/Chat), Weaviate logo + feature summary, watchlist manager with ‘.txt’ export&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Use the following prompts with Claude Code to generate the backend and frontend:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Backend Prompt:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;/weaviate cookbooks 

Create &lt;span class="sb"&gt;`backend.py`&lt;/span&gt;: a FastAPI app with CORS enabled for localhost:3000.
Connect to Weaviate Cloud using credentials from .env with skip_init_checks=True.
The /search endpoint should return genre and vote_average alongside title, description, release_year, and poster.
Implement these routes:  
&lt;span class="p"&gt;
-&lt;/span&gt; GET  /health                  → {"status": "ok"}  
&lt;span class="p"&gt;-&lt;/span&gt; GET  /search?q=...&amp;amp;limit=3    → near_text on text_vector, return title/description/release_year/poster  
&lt;span class="p"&gt;-&lt;/span&gt; POST /ai/explain              → generate.near_text with single_prompt  
&lt;span class="p"&gt;-&lt;/span&gt; POST /ai/plan                 → generate.near_text with grouped_task  
&lt;span class="p"&gt;-&lt;/span&gt; POST /chat                    → QueryAgent.ask() with full message history

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
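
&lt;p&gt;To give a feel for what the cookbook skill generates, the core of the &lt;code&gt;/search&lt;/code&gt; route ends up looking something like this. A trimmed sketch, carrying over the &lt;code&gt;text_vector&lt;/code&gt; named-vector assumption from the prompt:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Trimmed sketch of the generated /search route (not the full backend.py).
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()
app.add_middleware(CORSMiddleware, allow_origins=["http://localhost:3000"],
                   allow_methods=["*"], allow_headers=["*"])

@app.get("/search")
def search(q: str, limit: int = 3):
    movies = client.collections.get("Movie")  # `client` is connected at startup
    res = movies.query.near_text(query=q, limit=limit, target_vector="text_vector")
    return [
        {
            "title": o.properties.get("title"),
            "description": o.properties.get("overview"),
            "release_year": o.properties.get("release_date"),
            "poster": o.properties.get("poster_url"),
        }
        for o in res.objects
    ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;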



&lt;p&gt;&lt;strong&gt;Frontend Prompt:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Using Weaviate cookbooks frontend reference, create a Next.js TypeScript app in the frontend/ folder.
MovieCard.tsx should display a star rating (vote_average) and genre tag beneath the movie title. 

Components needed:  
&lt;span class="p"&gt;
-&lt;/span&gt; page.tsx        — SidebarProvider layout, view state (search | chat)  
&lt;span class="p"&gt;-&lt;/span&gt; SearchView.tsx  — search input, MovieCard grid, AI explain and plan buttons  
&lt;span class="p"&gt;-&lt;/span&gt; MovieCard.tsx   — poster image, title, year, description, watchlist button  
&lt;span class="p"&gt;-&lt;/span&gt; ChatView.tsx    — message bubbles, source citations, clear chat  
&lt;span class="p"&gt;-&lt;/span&gt; AppSidebar.tsx  — navigation, Weaviate logo + feature list, watchlist + exportBackend base URL from NEXT_PUBLIC_BACKEND_HOST env var (default localhost:8000)

Run backend and frontend servers with: uvicorn backend:app --reload --port 800 and npm run dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After this, Claude Code will automatically build the app by adding relevant files and start both servers. You can start using the application immediately.&lt;/p&gt;

&lt;p&gt;The FastAPI backend runs at &lt;code&gt;http://localhost:8000&lt;/code&gt; (interactive docs at &lt;code&gt;/docs&lt;/code&gt;), while the frontend app is available at &lt;code&gt;http://localhost:3000&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You can also manually start both processes in separate terminals:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Terminal 1 — Backend &lt;/span&gt;
uvicorn backend:app &lt;span class="nt"&gt;--reload&lt;/span&gt; &lt;span class="nt"&gt;--port&lt;/span&gt; 8000
&lt;span class="c"&gt;# Terminal 2 — Frontend&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;frontend &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; npm run dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Congratulations! You’ve completed the project without needing to do much manual configuration or coding.&lt;/strong&gt; 🔥&lt;/p&gt;

&lt;h3&gt;
  
  
  Demo
&lt;/h3&gt;

&lt;p&gt;So far, we have used Weaviate Agent Skills with Claude Code to build a Semantic Movie Discovery Application powered by an OpenAI API key, a TMDB API key, and Weaviate.&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/4udXaqI0PaQ"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Movie Discovery app we built includes the following features&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Semantic search:&lt;/strong&gt; Describe a mood or theme and retrieve matching movies using vector-based search (&lt;code&gt;near_text&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI explanations:&lt;/strong&gt; Generate per-movie summaries using RAG with &lt;code&gt;single_prompt&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Movie Night Planner:&lt;/strong&gt; Create a viewing order, snack pairings and a theme summary using &lt;code&gt;grouped_task&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conversational chat:&lt;/strong&gt; Ask questions about the movie collection through a chat interface powered by the Weaviate Query Agent, with source citations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watchlist:&lt;/strong&gt; Save movies during your session and export the list as a &lt;code&gt;.txt&lt;/code&gt; file.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What’s Next?
&lt;/h3&gt;

&lt;p&gt;You could add image-based search to find visually similar movies and better match your preferences. You could also add a hybrid search feature that combines keyword-heavy queries with vector and image search.&lt;/p&gt;

&lt;p&gt;You can take your app even further by getting up to speed with Weaviate’s latest &lt;a href="https://weaviate.io/blog" rel="noopener noreferrer"&gt;releases&lt;/a&gt; and becoming familiar with features such as server-side batching, async replication improvements and Object TTL.&lt;/p&gt;

&lt;p&gt;To explore further, check out the latest Weaviate &lt;a href="https://weaviate.io/blog" rel="noopener noreferrer"&gt;releases&lt;/a&gt; and join the discussion on the &lt;a href="https://forum.weaviate.io/" rel="noopener noreferrer"&gt;community forum&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
&lt;strong&gt;Weaviate Agent Skills in Action&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The following Weaviate modules and skills were used in the application:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Text2vec-weaviate:&lt;/strong&gt; Responsible for text embeddings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi2multivec-weaviate:&lt;/strong&gt; Responsible for embedding images.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generative-openai:&lt;/strong&gt; Integrates GPT directly into the query workflow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weaviate Skill:&lt;/strong&gt; Creates a collection and imports data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weaviate Cookbooks Skill:&lt;/strong&gt; For defining the app’s logic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weaviate Query Agent:&lt;/strong&gt; A higher-level abstraction that accepts natural language queries, decides the best query method, executes queries, synthesizes results and returns answers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Weaviate Agent Skills help in shipping faster and more accurate RAG applications. Backend development tasks such as schema inspection, data ingestion and search operations are automated and optimized. Ultimately, this helps developers save valuable development time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Both MCP servers and Agent Skills provide useful patterns for building AI-powered applications. MCP servers are well-suited for exposing external tools and services through a standardized interface, while Agent Skills focus on guiding coding agents with structured workflows and best practices.&lt;/p&gt;

&lt;p&gt;In this tutorial, we demonstrated how Weaviate Agent Skills can simplify development by helping Claude Code generate correct database queries, ingestion pipelines and search logic. By combining vector search, multimodal storage and generative capabilities, we built a semantic movie discovery application with minimal manual setup.&lt;/p&gt;

&lt;p&gt;As agentic development environments continue to evolve, tools like MCP servers and Agent Skills will likely be used together. The key is understanding where each approach fits and selecting the one that best supports your application architecture.&lt;/p&gt;

&lt;p&gt;Happy building.&lt;/p&gt;




&lt;h3&gt;
  
  
  Resources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://modelcontextprotocol.io/docs/getting-started/intro" rel="noopener noreferrer"&gt;Model Context Protocol&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/weaviate/agent-skills" rel="noopener noreferrer"&gt;Weaviate Agent Skills&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://code.claude.com/docs/en/overview" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/Studio1HQ/movie-discovery-app" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt; for the Movie Discovery App&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>rag</category>
      <category>webdev</category>
    </item>
    <item>
      <title>We Cut Our MCP Token Spend in Half. Here's the Architecture</title>
      <dc:creator>Arindam Majumder </dc:creator>
      <pubDate>Wed, 25 Mar 2026 19:04:52 +0000</pubDate>
      <link>https://dev.to/studio1hq/we-cut-our-mcp-token-spend-in-half-heres-the-architecture-1jic</link>
      <guid>https://dev.to/studio1hq/we-cut-our-mcp-token-spend-in-half-heres-the-architecture-1jic</guid>
      <description>&lt;p&gt;When we started scaling our MCP workflows, token usage was something we barely tracked. The system worked well, responses were accurate, and adding more tools felt like the right next step. Over time, the cost began rising in ways that did not align with how much the system was actually used.&lt;/p&gt;

&lt;p&gt;At first, we assumed this was due to higher usage or more complex queries. The data showed something else. Even simple requests were using more tokens than expected. This led us to ask a basic question. What exactly are we sending to the LLM on every call?&lt;/p&gt;

&lt;p&gt;A closer look made things clearer. The issue came from how the system was built: the way we handled context, tool definitions, and execution flow added extra tokens at every step.&lt;/p&gt;

&lt;p&gt;This article explains how we found the root cause and redesigned the architecture to fix it. The changes cut our MCP token usage by nearly half and gave us better control over how the system behaves.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Token Usage in MCP Systems
&lt;/h2&gt;

&lt;p&gt;Once we started examining token usage, a clear pattern showed up. The LLM was receiving far more context than most requests actually needed. A large part of this came from tool definitions being sent repeatedly on every call.&lt;/p&gt;

&lt;p&gt;Each request included the full list of tools, even when only one or two were needed. On top of that, earlier outputs and intermediate results were passed back into the model. The context kept growing, even for simple queries.&lt;/p&gt;

&lt;p&gt;The execution flow added to the problem. The LLM would choose a tool, call it, process the result, and then repeat the same cycle if another step was needed. Each step added more tokens, and the same data often appeared many times across calls.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fraya207lc4ie4r2yqsd2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fraya207lc4ie4r2yqsd2.png" alt="Image1" width="800" height="1422"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This setup worked at a small scale. As the number of tools increased, the cost grew quickly. More tools meant more context. More steps meant repeated processing. The system was doing extra work without adding real value. At this point, the cause was clear. Token usage came from how the system handled context and execution. The design itself was driving the overhead.&lt;/p&gt;
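
&lt;p&gt;A back-of-the-envelope model shows how quickly this compounds. The numbers below are illustrative, not our production figures:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative (made-up) numbers: context cost of resending every tool schema
# on each step of a multi-step workflow, versus loading schemas on demand.
TOOLS = 40               # registered tools
SCHEMA_TOKENS = 350      # average tokens per tool definition
STEPS = 6                # LLM turns in one workflow
HISTORY_TOKENS = 500     # prior outputs replayed per turn (rough average)

classic = sum(TOOLS * SCHEMA_TOKENS + step * HISTORY_TOKENS for step in range(STEPS))
on_demand = 2 * SCHEMA_TOKENS + STEPS * 200  # two schemas read once + small plan context

print(f"classic: ~{classic:,} tokens, on-demand: ~{on_demand:,} tokens")
# classic: ~91,500 tokens, on-demand: ~1,900 tokens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;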

&lt;h2&gt;
  
  
  Introducing Bifrost
&lt;/h2&gt;

&lt;p&gt;We started looking for a way to change how the system handled tool execution. The goal was simple. Reduce the amount of context sent to the LLM and avoid repeated processing across steps.&lt;/p&gt;

&lt;p&gt;During this process, we came across &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt;, an &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;open source&lt;/a&gt; MCP gateway. It works between the application, the model, and the tools. It brings structure for how tools are discovered and executed, so the LLM receives only what is needed on each call.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flhnphaglsh5ymggy61oe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flhnphaglsh5ymggy61oe.png" alt="Image" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This changed how we thought about the system. Tool access became more controlled. Context stayed limited to what was required for each request. The overall flow of execution became easier to follow and reason about.&lt;/p&gt;

&lt;p&gt;These changes directly addressed the issues we were seeing. Tool definitions were sent only when required. Repeated decision loops were reduced. The system handled execution in a more controlled and predictable way.&lt;/p&gt;

&lt;p&gt;From here, the focus moved away from adjusting prompts and toward changing how the system runs end-to-end.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architectural Changes with Bifrost Code Mode
&lt;/h2&gt;

&lt;p&gt;The main change came from how execution was handled inside Bifrost. &lt;a href="https://docs.getbifrost.ai/mcp/code-mode" rel="noopener noreferrer"&gt;Code Mode&lt;/a&gt; is a Bifrost feature that changes how the LLM interacts with MCP tools. Earlier, the LLM handled both planning and step-by-step tool interaction. Each step required another call, and each call carried a growing context.&lt;/p&gt;

&lt;p&gt;Code Mode separates these responsibilities. The LLM focuses on planning. It generates executable code that defines the full workflow for a task. &lt;/p&gt;

&lt;p&gt;Code Mode works best when multiple MCP servers are involved, workflows have several steps, or tools need to share data. For simpler setups with one or two tools, Classic MCP works well.&lt;/p&gt;

&lt;p&gt;A mixed setup also works. Use Code Mode for heavier workflows like search or databases, and keep simple tools as direct calls.&lt;/p&gt;
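
&lt;p&gt;Registration-wise, the mix is just a flag per server: the curl call later in this post registers a Code Mode server, and a plain tool server is the same call with &lt;code&gt;is_code_mode_client&lt;/code&gt; set to false. A sketch using Python's requests, with the payload fields mirroring that registration API and a hypothetical "weather" server:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: register a simple server as a classic (non-Code-Mode) MCP client.
# Endpoint and fields mirror the curl registration call shown later.
import requests

resp = requests.post(
    "http://localhost:8080/api/mcp/client",
    json={
        "name": "weather",                    # hypothetical simple server
        "connection_type": "http",
        "connection_string": "http://localhost:3002/mcp",
        "tools_to_execute": ["*"],
        "is_code_mode_client": False,         # classic MCP: schemas sent inline
    },
    timeout=10,
)
resp.raise_for_status()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;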

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcz78lp878cwfdmchwomm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcz78lp878cwfdmchwomm.png" alt="Image2" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Selecting the right tools&lt;/li&gt;
&lt;li&gt;Passing data between tools&lt;/li&gt;
&lt;li&gt;Defining how the final output is produced&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The system exposes a minimal interface to the LLM. It can list available tools, read tool details, and, when required, understand how each tool works. Tool definitions are accessed on demand, which keeps the initial context small.&lt;/p&gt;

&lt;p&gt;Once the plan is generated, execution moves to a runtime environment. The code runs in a sandbox and interacts directly with tools. All intermediate steps, tool responses, and data transformations stay within this layer.&lt;/p&gt;

&lt;p&gt;This removes the need for repeated LLM calls during execution. The workflow runs in one pass, guided by the generated code. The LLM is involved mainly at the planning stage and for producing the final response if required.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fawpurvuv48ogzbgr1rdu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fawpurvuv48ogzbgr1rdu.png" alt="Image" width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The flow becomes more structured. A request comes in, relevant tools are identified, code is generated, and execution happens in a controlled environment. The system handles state and intermediate data outside the LLM.&lt;/p&gt;

&lt;p&gt;This approach improves clarity in how tasks are executed. The generated code can be inspected, debugged, and understood directly. Each request follows a defined path, which makes behavior easier to track and reason about.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using Bifrost CLI in Our Workflow
&lt;/h2&gt;

&lt;p&gt;Getting started required two commands. First, start the gateway:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then launch the CLI from a separate terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost-cli
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;MCP servers are registered once through the API. The key flag is &lt;code&gt;is_code_mode_client&lt;/code&gt;, which tells Bifrost to handle that server through Code Mode instead of sending its tool definitions on every request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8080/api/mcp/client &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "name": "youtube",
    "connection_type": "http",
    "connection_string": "http://localhost:3001/mcp",
    "tools_to_execute": ["*"],
    "is_code_mode_client": true
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once registered, the LLM discovers tools on demand using &lt;code&gt;listToolFiles&lt;/code&gt; and &lt;code&gt;readToolFile&lt;/code&gt;, then submits a full execution plan through &lt;code&gt;executeToolCode&lt;/code&gt;. A workflow that previously took six LLM turns now completes in three to four.&lt;/p&gt;

&lt;p&gt;Bifrost organizes tool definitions using two binding levels. Server-level (default) groups all tools from a server into one &lt;code&gt;.pyi&lt;/code&gt; file. Tool-level gives each tool its own file — better for servers with 30+ tools. Set it once in &lt;code&gt;config.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"tool_manager_config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"code_mode_binding_level"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"server"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
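
&lt;p&gt;With server-level binding, the youtube server registered earlier surfaces as a single stub file. The following is a hypothetical sketch of what such a &lt;code&gt;.pyi&lt;/code&gt; might contain; the real signatures are derived from the server's tool schemas:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical server-level stub (youtube.pyi) as Code Mode might expose it.
# Real signatures come from the MCP server's tool schemas.
def search(query: str, maxResults: int = 10) -&gt; dict: ...
def getVideoDetails(videoId: str) -&gt; dict: ...
def listComments(videoId: str, maxResults: int = 20) -&gt; dict: ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;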



&lt;p&gt;Debugging became simpler because the generated code is the execution plan. When something went wrong, the issue was visible directly in the code rather than buried in prompt chains. This setup also made execution easier to inspect.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;youtube&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AI infrastructure&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;maxResults&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;titles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;snippet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;items&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;titles&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;titles&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;titles&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The generated code runs in a Starlark interpreter; Starlark is a restricted dialect of Python. A few constraints to keep in mind:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No import statements, file I/O, or network access&lt;/li&gt;
&lt;li&gt;Classes are not supported; use dictionaries instead&lt;/li&gt;
&lt;li&gt;Tool calls run synchronously; async handling is not required&lt;/li&gt;
&lt;li&gt;Each tool call has a default timeout of 30 seconds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Code Mode also works with &lt;a href="https://docs.getbifrost.ai/mcp/agent-mode" rel="noopener noreferrer"&gt;Agent Mode&lt;/a&gt; for automated workflows. The &lt;code&gt;listToolFiles&lt;/code&gt; and &lt;code&gt;readToolFile&lt;/code&gt; tools are always auto-executable since they are read-only. &lt;/p&gt;

&lt;p&gt;The &lt;code&gt;executeToolCode&lt;/code&gt; tool only auto-executes if every tool call within the generated code is on the approved list. If any call falls outside that list, Bifrost returns it to the user for approval before running.&lt;/p&gt;

&lt;h2&gt;
  
  
  Impact on Token Usage and System Efficiency
&lt;/h2&gt;

&lt;p&gt;The reduction in token usage came from four specific changes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tool schemas were sent only when required&lt;/li&gt;
&lt;li&gt;Intermediate outputs stayed within the execution layer&lt;/li&gt;
&lt;li&gt;Repeated context across steps was removed&lt;/li&gt;
&lt;li&gt;Fewer LLM calls were needed, since execution moved to a sandbox and ran in a single flow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These changes had a clear effect. Token usage dropped by nearly half, and latency dropped along with it. Execution also became more predictable, since each request followed a defined path with fewer moving parts.&lt;/p&gt;

&lt;p&gt;The broader takeaway is that token cost is a function of system design. Tweaking prompts or trimming outputs helps at the edges, but the bulk of the overhead comes from how the workflow itself is structured.&lt;/p&gt;

&lt;p&gt;LLMs work best when they focus on planning. Managing execution through repeated loops adds cost and introduces variability. A separate execution layer keeps the flow stable and easier to understand. Context also needs careful control. It should be built for each request with only the required information. Letting it grow across steps results in unnecessary overhead and increased token usage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Token inefficiency in MCP workflows comes from system design. Bifrost and Code Mode introduced a clear separation between planning and execution. The LLM handles planning, and the runtime handles execution. This brought immediate and measurable improvements in both cost and system behavior.&lt;/p&gt;

&lt;p&gt;If you are working with MCP workflows at scale, &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; is worth exploring. The &lt;a href="https://docs.getbifrost.ai/" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; provides a good starting point to set up the gateway, connect servers, and run workflows using Code Mode.&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
    </item>
    <item>
      <title>Composer 2 is controversial, but my actual experience was solid</title>
      <dc:creator>Arindam Majumder </dc:creator>
      <pubDate>Sat, 21 Mar 2026 06:55:09 +0000</pubDate>
      <link>https://dev.to/arindam_1729/composer-2-is-controversial-but-my-actual-experience-was-solid-5a7h</link>
      <guid>https://dev.to/arindam_1729/composer-2-is-controversial-but-my-actual-experience-was-solid-5a7h</guid>
      <description>&lt;p&gt;I tried Composer 2 properly today, and honestly, if you put all the controversy aside for a second, the model itself is not bad at all.&lt;/p&gt;

&lt;p&gt;In fact, my first impression is that it’s a real upgrade over Composer 1 and 1.5. I gave it a pretty solid test. I asked it to build a full-stack Reddit clone and deploy it too.&lt;/p&gt;

&lt;p&gt;On the first go, it handled most of the work surprisingly well. The deployment also worked, which was a good sign. The main thing that broke was authentication.&lt;/p&gt;

&lt;p&gt;Then on the second prompt, I asked it to fix that, and it actually fixed the auth issue and redeployed the app.&lt;/p&gt;

&lt;p&gt;That said, it was not perfect. There were still some backend issues left that it could not fully solve. So I would not say it is at the level of Claude Opus 4.6 or GPT-5.4 for coding quality.&lt;/p&gt;

&lt;p&gt;But speed-wise, it felt much faster. For me, it was around 5 to 7x faster than Opus 4.6 / GPT-5.4 in actual workflow, and it also feels much more cost-effective.&lt;/p&gt;

&lt;p&gt;That combination matters a lot.&lt;/p&gt;

&lt;p&gt;Because even if the raw coding quality is still below Opus 4.6 / GPT-5.4, the overall experience was smoother than I expected. It gets you from idea to working product much faster, and for a lot of people that tradeoff will be worth it.&lt;/p&gt;

&lt;p&gt;My current take is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Better than Composer 1 / 1.5 by a clear margin&lt;/li&gt;
&lt;li&gt;Fast enough to change how often I’d use it&lt;/li&gt;
&lt;li&gt;Good at getting most of the app done quickly&lt;/li&gt;
&lt;li&gt;Still weak enough in backend reliability that I would not fully trust it yet for complex production work&lt;/li&gt;
&lt;li&gt;Not as strong as Opus 4.6 / GPT-5.4 in coding depth, but still very usable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So yeah, I agree with the criticism that it is not on the same level as Opus 4.6 / GPT-5.4 for hard coding tasks (maybe because the base model is Kimi K2.5).&lt;/p&gt;

&lt;p&gt;But I also think some people are dismissing it too quickly. If you judge it as a faster, cheaper, improved Composer, it is genuinely solid.&lt;/p&gt;

&lt;p&gt;I shared a longer breakdown &lt;a href="https://www.youtube.com/watch?v=nv1fcjfC5wg" rel="noopener noreferrer"&gt;here&lt;/a&gt; with the exact build flow, where it got things right, and where it still fell short, in case anyone wants more context.&lt;/p&gt;

</description>
      <category>cursor</category>
      <category>kimi</category>
      <category>ai</category>
      <category>composer</category>
    </item>
    <item>
      <title>Building an AI-Powered Content Moderation API with InsForge Edge Functions</title>
      <dc:creator>Arindam Majumder </dc:creator>
      <pubDate>Fri, 20 Mar 2026 09:55:13 +0000</pubDate>
      <link>https://dev.to/arindam_1729/building-an-ai-powered-content-moderation-api-with-insforge-edge-functions-j0k</link>
      <guid>https://dev.to/arindam_1729/building-an-ai-powered-content-moderation-api-with-insforge-edge-functions-j0k</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Modern applications rely on user-generated content such as comments, reviews, and messages. Platforms must moderate this content to enforce safety policies and maintain compliance. Manual moderation does not scale, so production systems typically rely on automated moderation pipelines powered by AI.&lt;/p&gt;

&lt;p&gt;Traditional implementations require multiple backend services. Developers often provision servers, integrate AI APIs, manage databases, and configure storage separately. This fragmented setup increases operational overhead and slows development. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/InsForge/InsForge" rel="noopener noreferrer"&gt;InsForge&lt;/a&gt; simplifies this architecture by combining Edge Functions, PostgreSQL Database, Storage, and Model Gateway in a single platform. Benchmarks also show that it can deliver &lt;a href="https://insforge.dev/blog/mcpmark-benchmark-results-v2" rel="noopener noreferrer"&gt;~1.6× faster responses and 2.4x lower token usage&lt;/a&gt; compared to fragmented integrations.&lt;/p&gt;

&lt;p&gt;In this tutorial, we will build a production-ready AI moderation API that runs entirely within InsForge.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcqdx9ozf5ku2uwyr2ypm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcqdx9ozf5ku2uwyr2ypm.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Are Building
&lt;/h2&gt;

&lt;p&gt;Here is what we will build: a simple backend moderation workflow composed from InsForge core services:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AI Moderation API Endpoint:&lt;/strong&gt; We will create an API endpoint using &lt;a href="https://docs.insforge.dev/core-concepts/functions/architecture" rel="noopener noreferrer"&gt;Edge Functions&lt;/a&gt; that accepts user-submitted text content and processes moderation requests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI-Powered Content Evaluation:&lt;/strong&gt; The API will use Model Gateway to access an AI model that classifies submitted content as SAFE or UNSAFE.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database Storage for Approved Content:&lt;/strong&gt; Approved comments will be stored in a PostgreSQL &lt;a href="https://docs.insforge.dev/core-concepts/database/architecture" rel="noopener noreferrer"&gt;Database managed by InsForge&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Attachment Handling with Storage:&lt;/strong&gt; Optional user attachments will be uploaded and stored using &lt;a href="https://docs.insforge.dev/core-concepts/storage/architecture" rel="noopener noreferrer"&gt;Storage Buckets&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automated Moderation Response:&lt;/strong&gt; Unsafe content will be rejected immediately, and the API will return a structured moderation response.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production-Ready Backend Workflow:&lt;/strong&gt; The moderation pipeline will run entirely within InsForge using Database, Edge Functions, &lt;a href="https://docs.insforge.dev/core-concepts/ai/architecture" rel="noopener noreferrer"&gt;Model Gateway&lt;/a&gt;, and Storage, without external servers or additional infrastructure.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Project Setup and Repository Structure
&lt;/h2&gt;

&lt;p&gt;Before configuring the backend resources, clone the project repository and review the project structure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Studio1HQ/Content-moderation-Insforge" rel="noopener noreferrer"&gt;Clone the repository&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Studio1HQ/Content-moderation-Insforge
&lt;span class="nb"&gt;cd &lt;/span&gt;content-moderation-app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Install dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The repository contains both the Next.js frontend and the InsForge Edge Function used for moderation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Repository Structure
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Path&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;src/app&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Next.js application pages and layouts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;src/components&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;UI components such as the moderation form&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;src/lib&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Client utilities for connecting to InsForge APIs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;insforge-functions/moderate-comment&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Edge Function implementation for moderation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;handler.ts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Serverless function that processes moderation requests&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This structure keeps the frontend and backend logic organized within the same project while allowing the Edge Function to be deployed independently.&lt;/p&gt;

&lt;p&gt;After cloning the repository, proceed with configuring the backend resources in InsForge.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: You can set up this backend in two ways. Follow the manual steps in this tutorial to create the database, storage bucket, and Edge Function using the dashboard and CLI. Alternatively, you can use InsForge MCP with your AI coding agent to provision the same resources using a single prompt. See the MCP section at the end of the article for the prompt template and instructions.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Step 1: Setting Up the Database
&lt;/h2&gt;

&lt;p&gt;InsForge provides a managed PostgreSQL Database that you can configure directly from the dashboard.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Open the Tables Section&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open your project in the InsForge Dashboard.&lt;/li&gt;
&lt;li&gt;In the left sidebar, select Tables.&lt;/li&gt;
&lt;li&gt;Click the + icon next to Tables.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3w6xkfmm1hgtpswgto3r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3w6xkfmm1hgtpswgto3r.png" alt="Image1" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Name the table &lt;code&gt;comments&lt;/code&gt; and create the following columns.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Column&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;uuid&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Primary key for each comment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;content&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;string&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;User submitted comment text&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;attachment_url&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;string&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;URL for uploaded file (optional)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;status&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;string&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Moderation result (&lt;code&gt;approved&lt;/code&gt; or &lt;code&gt;rejected&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;created_at&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;timestamp&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Time when the comment was created&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Save the Table&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Click Create Table to apply the schema.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;comments&lt;/code&gt; table will appear in the Tables panel.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F53ywzcg9oeat3v2g9ugd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F53ywzcg9oeat3v2g9ugd.png" alt="Image3" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Creating the Edge Function
&lt;/h2&gt;

&lt;p&gt;Next, create the serverless API that will process moderation requests.&lt;/p&gt;

&lt;p&gt;InsForge Edge Functions allow you to run backend logic without managing servers. In this tutorial, the function receives user content, evaluates it using AI, and stores approved results in the database.&lt;/p&gt;

&lt;p&gt;Navigate to the Edge Function directory in the repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;insforge-functions/moderate-comment/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Inside this folder, there will be a file named:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;handler.ts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This file will contain the moderation logic executed by the Edge Function.&lt;/p&gt;

&lt;p&gt;The Edge Function performs the following tasks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Accept a POST request containing user content.&lt;/li&gt;
&lt;li&gt;Send the content to the AI model through Model Gateway.&lt;/li&gt;
&lt;li&gt;Classify the content as SAFE or UNSAFE.&lt;/li&gt;
&lt;li&gt;Upload attachments to Storage if present.&lt;/li&gt;
&lt;li&gt;Insert approved content into the comments table.&lt;/li&gt;
&lt;li&gt;Return a structured moderation response.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All moderation logic runs inside the Edge Function, keeping the backend workflow centralized within InsForge.&lt;/p&gt;
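
&lt;p&gt;To make that flow concrete, below is a minimal sketch of the handler logic. It assumes a Deno-style runtime that serves a &lt;code&gt;(Request) =&gt; Response&lt;/code&gt; handler and an OpenAI-compatible chat endpoint behind Model Gateway; the &lt;code&gt;AI_GATEWAY_URL&lt;/code&gt; and &lt;code&gt;AI_GATEWAY_KEY&lt;/code&gt; variables are hypothetical placeholders, and the storage upload and database insert are reduced to comments. The &lt;code&gt;handler.ts&lt;/code&gt; in the repository is the authoritative version.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// handler.ts sketch. Assumptions: Deno-style runtime, OpenAI-compatible
// gateway endpoint; AI_GATEWAY_URL / AI_GATEWAY_KEY are hypothetical names.

async function classify(content: string): Promise&lt;"SAFE" | "UNSAFE"&gt; {
  const res = await fetch(Deno.env.get("AI_GATEWAY_URL") ?? "", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": `Bearer ${Deno.env.get("AI_GATEWAY_KEY")}`,
    },
    body: JSON.stringify({
      model: "openai/gpt-4o-mini",
      messages: [
        { role: "system", content: "Classify the user text as SAFE or UNSAFE. Answer with one word." },
        { role: "user", content: content },
      ],
    }),
  });
  const data = await res.json();
  const verdict = data.choices?.[0]?.message?.content?.trim().toUpperCase();
  return verdict === "SAFE" ? "SAFE" : "UNSAFE"; // fail closed on anything unexpected
}

export default async function handler(req: Request): Promise&lt;Response&gt; {
  if (req.method !== "POST") {
    return new Response("Method Not Allowed", { status: 405 });
  }

  const { content } = await req.json();
  const verdict = await classify(content);

  if (verdict === "UNSAFE") {
    // Rejected content is returned immediately; nothing is stored or uploaded.
    return Response.json({ status: "rejected" });
  }

  // Approved path: upload the optional attachment to the "attachments" bucket,
  // then insert the row into the "comments" table through InsForge (omitted here).
  return Response.json({ status: "approved" });
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
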

&lt;p&gt;Deploy the function using the InsForge CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;insforge functions deploy moderate-comment--file ./insforge-functions/moderate-comment/handler.ts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faslzh2e3doyhv84lfjc7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faslzh2e3doyhv84lfjc7.png" alt="Image5" width="800" height="266"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once deployed, the function becomes available as a backend API endpoint that the frontend application can call.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fljbdqd5046ag3genf6us.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fljbdqd5046ag3genf6us.png" alt="Image6" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: AI Integration Inside the Function
&lt;/h2&gt;

&lt;p&gt;The moderation logic inside the Edge Function uses Model Gateway, which provides unified access to multiple AI models directly within InsForge.&lt;/p&gt;

&lt;p&gt;Model Gateway allows Edge Functions to call AI models without configuring external API clients or managing provider-specific integrations.&lt;/p&gt;

&lt;p&gt;Open the Model Gateway section in the InsForge dashboard and enable a model for the project.&lt;/p&gt;

&lt;p&gt;For this tutorial, enable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openai/gpt-4o-mini
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This model will be used to classify incoming content during moderation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbysoea6ct9gsuzn5ye2v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbysoea6ct9gsuzn5ye2v.png" alt="Image9" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Use the CLI to send a test request to the moderation API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;insforge&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;functions&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;invoke&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;moderate-comment--data&lt;/span&gt;&lt;span class="s2"&gt;"{\"&lt;/span&gt;&lt;span class="nx"&gt;content\&lt;/span&gt;&lt;span class="s2"&gt;":\"&lt;/span&gt;&lt;span class="nx"&gt;This&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;community&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;platform&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;is&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;very&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;helpful.\&lt;/span&gt;&lt;span class="s2"&gt;"}"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command sends a JSON payload containing the &lt;code&gt;content&lt;/code&gt; field to the Edge Function.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fngzbpe1ujssmhstwvlyg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fngzbpe1ujssmhstwvlyg.png" alt="Image10" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Edge Function also inserts the approved comment into the comments table in the database.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Configuring InsForge Storage
&lt;/h2&gt;

&lt;p&gt;The moderation workflow also supports optional file uploads using InsForge Storage. Storage provides an S3-compatible object storage system that integrates directly with Edge Functions and the database.&lt;/p&gt;

&lt;p&gt;When a user submits a comment with an attachment, the Edge Function uploads the file to a storage bucket before inserting the comment into PostgreSQL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Create a Storage Bucket&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Open the Storage section in the InsForge dashboard.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Navigate to Storage in the sidebar.&lt;/li&gt;
&lt;li&gt;Click Create Bucket.&lt;/li&gt;
&lt;li&gt;Name the bucket: attachments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This bucket will store files uploaded with moderated comments. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7no9lh3kevy5mv8t7n65.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7no9lh3kevy5mv8t7n65.png" alt="Image 10" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The upload operation returns a &lt;strong&gt;public file URL&lt;/strong&gt;, which is stored in the &lt;code&gt;attachment_url&lt;/code&gt; column of the &lt;code&gt;comments&lt;/code&gt; table.&lt;/p&gt;

&lt;p&gt;The moderation function processes attachments as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The user submits content with an optional file.&lt;/li&gt;
&lt;li&gt;The Edge Function evaluates the text using AI moderation.&lt;/li&gt;
&lt;li&gt;If the content is classified as SAFE, the file is uploaded to the attachments bucket.&lt;/li&gt;
&lt;li&gt;The returned file URL is stored in the comments table.&lt;/li&gt;
&lt;li&gt;If the content is UNSAFE, the function rejects the request and no file is uploaded.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This ensures that only approved content and attachments are stored, keeping the storage system aligned with the moderation rules.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5: Building the Next.js UI
&lt;/h2&gt;

&lt;p&gt;The repository already includes a &lt;strong&gt;Next.js application&lt;/strong&gt; that provides a simple interface for interacting with the moderation API.&lt;/p&gt;

&lt;p&gt;Navigate to the frontend code inside the &lt;code&gt;src&lt;/code&gt; directory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key UI Files&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File / Folder&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;src/app/page.tsx&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Main page that renders the moderation interface&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;src/components&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Reusable UI components for the moderation workflow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;src/lib/insforge.ts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Utility for connecting the frontend to the InsForge backend&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The UI includes a form where users submit content for moderation.&lt;/p&gt;

&lt;p&gt;The form collects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Text content entered by the user&lt;/li&gt;
&lt;li&gt;Optional file attachment&lt;/li&gt;
&lt;li&gt;A submit action that triggers the moderation request&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When the user submits the form, the application sends a POST request to the Edge Function endpoint.&lt;/p&gt;
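
&lt;p&gt;A minimal sketch of that request is shown below. The base URL comes from the &lt;code&gt;NEXT_PUBLIC_INSFORGE_BASE_URL&lt;/code&gt; environment variable used later during deployment; the &lt;code&gt;/functions/moderate-comment&lt;/code&gt; path is an assumption for illustration, so check the endpoint shown in your dashboard after deploying.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Client-side submit call (sketch; the endpoint path is an assumption).
const BASE_URL = process.env.NEXT_PUBLIC_INSFORGE_BASE_URL;

export async function submitComment(content: string): Promise&lt;{ status: string }&gt; {
  const res = await fetch(`${BASE_URL}/functions/moderate-comment`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ content }),
  });
  if (!res.ok) {
    throw new Error(`Moderation request failed with status ${res.status}`);
  }
  return res.json(); // { status: "approved" } or { status: "rejected" }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
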

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zaapex5dbdokcbra9ay.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zaapex5dbdokcbra9ay.png" alt="Image11" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The UI handles the API response and updates the interface accordingly.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Approved comments appear in the moderation results section.&lt;/li&gt;
&lt;li&gt;Rejected content displays an error message.&lt;/li&gt;
&lt;li&gt;Approved entries are also visible in the comments database table.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This setup creates a complete workflow where the Next.js UI communicates with the InsForge Edge Function to perform moderation in real time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using an AI Agent to Build the UI
&lt;/h3&gt;

&lt;p&gt;You can also accelerate this step using an AI coding agent (such as Cursor, Claude Code, or other agent-based tools). Instead of manually writing the UI components, the agent can generate the form, API calls, and component structure based on a prompt.&lt;/p&gt;

&lt;p&gt;Example prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;Create&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="nx"&gt;Next&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;js&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="nx"&gt;content&lt;/span&gt; &lt;span class="nx"&gt;moderation&lt;/span&gt; &lt;span class="nx"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;

&lt;span class="nx"&gt;Requirements&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;A&lt;/span&gt; &lt;span class="nx"&gt;form&lt;/span&gt; &lt;span class="kd"&gt;with&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="nx"&gt;textarea&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt; &lt;span class="nx"&gt;comments&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;An&lt;/span&gt; &lt;span class="nx"&gt;optional&lt;/span&gt; &lt;span class="nx"&gt;file&lt;/span&gt; &lt;span class="nx"&gt;upload&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;A&lt;/span&gt; &lt;span class="nx"&gt;submit&lt;/span&gt; &lt;span class="nx"&gt;button&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;Send&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="nx"&gt;POST&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt; &lt;span class="nx"&gt;to&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;InsForge&lt;/span&gt; &lt;span class="nx"&gt;Edge&lt;/span&gt; &lt;span class="nb"&gt;Function&lt;/span&gt; &lt;span class="nx"&gt;endpoint&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;moderation&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;Display&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;moderation&lt;/span&gt; &lt;span class="nf"&gt;result &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;approved&lt;/span&gt; &lt;span class="nx"&gt;or&lt;/span&gt; &lt;span class="nx"&gt;rejected&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;UI&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;Use&lt;/span&gt; &lt;span class="nx"&gt;React&lt;/span&gt; &lt;span class="nx"&gt;state&lt;/span&gt; &lt;span class="nx"&gt;to&lt;/span&gt; &lt;span class="nx"&gt;handle&lt;/span&gt; &lt;span class="nx"&gt;form&lt;/span&gt; &lt;span class="nx"&gt;submission&lt;/span&gt; &lt;span class="nx"&gt;and&lt;/span&gt; &lt;span class="nx"&gt;responses&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 6: Testing the API Endpoint
&lt;/h2&gt;

&lt;p&gt;After deploying the Edge Function and setting up the UI, test the moderation workflow to verify that the API behaves correctly. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Submit Safe Content&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Enter a comment through the UI and submit the form.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9sz5omlqs9mvejn1zo2m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9sz5omlqs9mvejn1zo2m.png" alt="Image 12" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Expected behavior:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Edge Function sends the content to the AI moderation model.&lt;/li&gt;
&lt;li&gt;The model classifies the text as SAFE.&lt;/li&gt;
&lt;li&gt;The function inserts the comment into the comments table in PostgreSQL.&lt;/li&gt;
&lt;li&gt;If an attachment is included, the file is uploaded to the attachments storage bucket.&lt;/li&gt;
&lt;li&gt;The API returns an approved response to the frontend.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F32e8hp25nekpkmnmwd4u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F32e8hp25nekpkmnmwd4u.png" alt="Image 13" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, test a rejection case.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff3putsxojxrqvnfim1ft.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff3putsxojxrqvnfim1ft.png" alt="Image 14" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Expected behavior:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Edge Function sends the text to the AI moderation model.&lt;/li&gt;
&lt;li&gt;The model classifies the content as UNSAFE.&lt;/li&gt;
&lt;li&gt;The function immediately returns a rejection response.&lt;/li&gt;
&lt;li&gt;No entry is inserted into the comments table.&lt;/li&gt;
&lt;li&gt;No file is uploaded to Storage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu71rplg51e61daejmmuh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu71rplg51e61daejmmuh.png" alt="Image 16" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The table in your InsForge dashboard also reflects the results:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqrufr30e913dc6304kbk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqrufr30e913dc6304kbk.png" alt="Image 17" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 7: Deployment Using InsForge
&lt;/h2&gt;

&lt;p&gt;Once the function and UI are ready, deploy the application using the InsForge CLI. This publishes the Next.js frontend alongside the already-deployed Edge Function and connects everything to the project environment.&lt;/p&gt;

&lt;p&gt;Refer to the &lt;a href="https://insforge.dev/blog/insforge-deployment" rel="noopener noreferrer"&gt;deployment guide here&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Authenticate the CLI with your InsForge account.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;insforge auth login
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Complete the authentication process in the browser. Link the local project directory to your InsForge backend.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;insforge &lt;span class="nb"&gt;link&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Select the project created earlier in the InsForge dashboard. This connects the CLI to the correct backend workspace.&lt;/p&gt;

&lt;p&gt;Deploy the Next.js application while passing the required environment variable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;insforge&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;deployments&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;deploy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;--env&lt;/span&gt;&lt;span class="s2"&gt;"{\"&lt;/span&gt;&lt;span class="nx"&gt;NEXT_PUBLIC_INSFORGE_BASE_URL\&lt;/span&gt;&lt;span class="s2"&gt;":\"&lt;/span&gt;&lt;span class="nx"&gt;https://your-project.insforge.app\&lt;/span&gt;&lt;span class="s2"&gt;"}"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This environment variable allows the frontend to communicate with the deployed Edge Function.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fosegkycwonbfpt46vvw9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fosegkycwonbfpt46vvw9.png" alt="Image 14" width="800" height="266"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verify the Deployment&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After deployment, the application becomes accessible via the InsForge-hosted domain.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwej10rcemepkz59bxtit.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwej10rcemepkz59bxtit.png" alt="Image 16" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Access the &lt;a href="https://sec3hf94.insforge.site/" rel="noopener noreferrer"&gt;live demo here&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Using MCP to Accelerate Development
&lt;/h2&gt;

&lt;p&gt;Instead of manually creating tables, storage buckets, and Edge Functions, you can also configure the backend using &lt;a href="https://docs.insforge.dev/mcp-setup" rel="noopener noreferrer"&gt;Remote MCP (Model Context Protocol)&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;MCP exposes InsForge backend capabilities as tools that an AI coding agent can call to provision resources automatically. With a single prompt, the agent can generate the database schema, configure storage, and deploy the moderation function.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpt7ojupntpxibvg3snb4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpt7ojupntpxibvg3snb4.png" alt="Image 17" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Example prompt used to create this backend workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;Create&lt;/span&gt; &lt;span class="n"&gt;backend&lt;/span&gt; &lt;span class="n"&gt;resources&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="n"&gt;moderation&lt;/span&gt; &lt;span class="n"&gt;application&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;InsForge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;

&lt;span class="n"&gt;Requirements&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt; &lt;span class="k"&gt;Create&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;PostgreSQL&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;named&lt;/span&gt; &lt;span class="nv"&gt;"comments"&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
   &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;UUID&lt;/span&gt; &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="n"&gt;attachment_url&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;nullable&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt; &lt;span class="k"&gt;Create&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="k"&gt;storage&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt; &lt;span class="n"&gt;named&lt;/span&gt; &lt;span class="nv"&gt;"attachments"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;storing&lt;/span&gt; &lt;span class="n"&gt;uploaded&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;

&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt; &lt;span class="k"&gt;Create&lt;/span&gt; &lt;span class="n"&gt;an&lt;/span&gt; &lt;span class="n"&gt;Edge&lt;/span&gt; &lt;span class="k"&gt;Function&lt;/span&gt; &lt;span class="n"&gt;named&lt;/span&gt; &lt;span class="nv"&gt;"moderate-comment"&lt;/span&gt; &lt;span class="n"&gt;that&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
   &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;accepts&lt;/span&gt; &lt;span class="n"&gt;POST&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="k"&gt;comment&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;
   &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;sends&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="k"&gt;to&lt;/span&gt; &lt;span class="n"&gt;an&lt;/span&gt; &lt;span class="n"&gt;AI&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;
   &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;classifies&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;SAFE&lt;/span&gt; &lt;span class="k"&gt;or&lt;/span&gt; &lt;span class="n"&gt;UNSAFE&lt;/span&gt;
   &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;uploads&lt;/span&gt; &lt;span class="n"&gt;attachments&lt;/span&gt; &lt;span class="k"&gt;to&lt;/span&gt; &lt;span class="k"&gt;storage&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="n"&gt;present&lt;/span&gt;
   &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;inserts&lt;/span&gt; &lt;span class="n"&gt;approved&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="k"&gt;database&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using MCP, developers can provision backend resources and deploy functions directly from prompts, significantly accelerating backend setup while keeping the same architecture described in this tutorial.&lt;/p&gt;

&lt;p&gt;Refer to the &lt;a href="https://docs.insforge.dev/mcp-setup" rel="noopener noreferrer"&gt;quick demo here&lt;/a&gt;. &lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this tutorial, we built a content moderation API using InsForge Edge Functions, integrated AI-powered classification through Model Gateway, stored approved results in PostgreSQL, and handled optional file uploads with Storage. The entire workflow runs inside InsForge, without external servers or fragmented infrastructure.&lt;/p&gt;

&lt;p&gt;This approach demonstrates how developers can combine Edge Functions, AI integration, database services, and storage to implement production-ready backend APIs with minimal operational overhead.&lt;/p&gt;

&lt;p&gt;If your application relies on user-generated content, moderation pipelines, or AI-assisted workflows, this architecture provides a straightforward and scalable foundation.&lt;/p&gt;

&lt;p&gt;Ready to simplify your backend stack? Explore InsForge’s Edge Functions, Model Gateway, PostgreSQL database, and Storage services to build intelligent APIs without managing infrastructure.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Try &lt;a href="https://github.com/InsForge/InsForge" rel="noopener noreferrer"&gt;InsForge&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Quickstart guide &lt;a href="https://github.com/InsForge/InsForge?tab=readme-ov-file#quickstart" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>fullstack</category>
      <category>insforge</category>
      <category>edgefunctions</category>
    </item>
    <item>
      <title>Cursor Composer 2: Features, Pricing, Benchmarks, and Initial Impressions</title>
      <dc:creator>Arindam Majumder </dc:creator>
      <pubDate>Thu, 19 Mar 2026 20:25:28 +0000</pubDate>
      <link>https://dev.to/arindam_1729/cursor-composer-20-features-pricing-benchmarks-and-initial-impressions-19jd</link>
      <guid>https://dev.to/arindam_1729/cursor-composer-20-features-pricing-benchmarks-and-initial-impressions-19jd</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Cursor has released Composer 2, the latest version of its in-house coding model.&lt;/p&gt;

&lt;p&gt;The announcement is focused and fairly easy to summarize. Cursor is making three main claims:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Composer 2 is frontier-level at coding&lt;/li&gt;
&lt;li&gt;it is materially better than previous Composer versions on Cursor’s published benchmarks&lt;/li&gt;
&lt;li&gt;it is priced aggressively enough to be practical for everyday use&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That combination makes the release worth paying attention to. In this post, I’ll walk through what Composer 2 is, what Cursor says improved, how the benchmark results look, what the pricing means, and my initial take on the release.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is Composer 2?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9tq4zz7n7m00yc7hk1gy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9tq4zz7n7m00yc7hk1gy.png" alt="Image1" width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Composer 2 is Cursor’s latest in-house coding model.&lt;/p&gt;

&lt;p&gt;Cursor describes it as frontier-level at coding and positions it as a better cost-performance option for agentic software work. The model is now available in Cursor, and the announcement puts most of the emphasis on three areas:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;stronger coding performance&lt;/li&gt;
&lt;li&gt;improved long-horizon task handling&lt;/li&gt;
&lt;li&gt;lower cost than many competing fast models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Unlike some model launches that bundle a large number of product features together, this one is mostly about the model itself. Cursor is not presenting Composer 2 as a general platform shift. It is presenting it as a more capable and more economical coding model.&lt;/p&gt;




&lt;h2&gt;
  
  
  Composer 2 Key Features
&lt;/h2&gt;

&lt;p&gt;The Composer 2 announcement is short, but there are still a few important takeaways.&lt;/p&gt;

&lt;h3&gt;
  
  
  Better coding performance
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F839qounser481rjf4b1r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F839qounser481rjf4b1r.png" alt="Image4" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cursor says Composer 2 delivers large improvements on all of the benchmarks it tracks, including Terminal-Bench 2.0 and SWE-bench Multilingual.&lt;/p&gt;

&lt;p&gt;That matters because it suggests the gains are not limited to one internal evaluation. Cursor is showing improvement across several coding-oriented benchmarks rather than relying on a single headline number.&lt;/p&gt;

&lt;h3&gt;
  
  
  Continued pretraining
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa5tq4atu9e7ii4hrx0do.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa5tq4atu9e7ii4hrx0do.png" alt="Image3" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One of the most notable details in the post is that these improvements come from Cursor’s first continued pretraining run.&lt;/p&gt;

&lt;p&gt;This is important because continued pretraining is often what gives a model a stronger base before more specialized post-training methods are applied. Cursor is explicitly saying that Composer 2 starts from a better foundation than earlier Composer versions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reinforcement learning for long-horizon tasks
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frphs7vriemh5upww7cyi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frphs7vriemh5upww7cyi.png" alt="Image2" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cursor also says it trains Composer 2 on long-horizon coding tasks using reinforcement learning.&lt;/p&gt;

&lt;p&gt;This is probably the most interesting technical claim in the announcement. Cursor says Composer 2 can solve challenging tasks requiring hundreds of actions. That implies the model is being optimized for sustained multi-step software tasks, not just short code completions or simple edits.&lt;/p&gt;

&lt;h3&gt;
  
  
  A fast variant with the same intelligence
&lt;/h3&gt;

&lt;p&gt;Cursor also introduces a faster Composer 2 variant and says it has the same intelligence.&lt;/p&gt;

&lt;p&gt;That is a useful product choice. Instead of forcing users to pick between a “smart” model and a “fast” model family, Cursor is presenting speed as a deployment option on top of the same underlying capability level.&lt;/p&gt;




&lt;h2&gt;
  
  
  Composer 2 Benchmarks
&lt;/h2&gt;

&lt;p&gt;Cursor publishes three benchmark comparisons in the announcement:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;CursorBench&lt;/th&gt;
&lt;th&gt;Terminal-Bench 2.0&lt;/th&gt;
&lt;th&gt;SWE-bench Multilingual&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Composer 2&lt;/td&gt;
&lt;td&gt;61.3&lt;/td&gt;
&lt;td&gt;61.7&lt;/td&gt;
&lt;td&gt;73.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Composer 1.5&lt;/td&gt;
&lt;td&gt;44.2&lt;/td&gt;
&lt;td&gt;47.9&lt;/td&gt;
&lt;td&gt;65.9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Composer 1&lt;/td&gt;
&lt;td&gt;38.0&lt;/td&gt;
&lt;td&gt;40.0&lt;/td&gt;
&lt;td&gt;56.9&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These gains are large enough to be meaningful.&lt;/p&gt;

&lt;p&gt;The biggest point here is not just that Composer 2 is ahead of Composer 1 and 1.5, but that the improvements show up consistently across all three benchmarks. That gives the release more credibility than a single isolated result would.&lt;/p&gt;

&lt;p&gt;Terminal-Bench 2.0 is especially relevant because Cursor frames it as an evaluation for agentic terminal use. If Composer 2 is genuinely stronger there, that supports Cursor’s claim that the model is getting better at longer, more interactive coding tasks.&lt;/p&gt;

&lt;p&gt;SWE-bench Multilingual is also worth noting because it suggests broader coding competence beyond narrow English-only setups.&lt;/p&gt;

&lt;p&gt;Still, these are vendor-published numbers, so the right takeaway is measured optimism rather than certainty.&lt;/p&gt;




&lt;h2&gt;
  
  
  How Composer 2 Is Priced
&lt;/h2&gt;

&lt;p&gt;Cursor says Composer 2 is priced at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;$0.50 per million input tokens&lt;/li&gt;
&lt;li&gt;$2.50 per million output tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The faster variant is priced at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;$1.50 per million input tokens&lt;/li&gt;
&lt;li&gt;$7.50 per million output tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cursor also says the fast variant has lower cost than other fast models and that fast will be the default option.&lt;/p&gt;

&lt;p&gt;This part of the announcement is more important than it looks. Model releases are usually judged on benchmark quality first, but pricing determines whether a model becomes part of normal daily use or gets reserved for occasional high-value tasks. Cursor is clearly trying to push Composer 2 into the first category.&lt;/p&gt;

&lt;p&gt;On individual plans, Composer usage draws from a standalone usage pool with a generous allowance included.&lt;/p&gt;




&lt;h2&gt;
  
  
  Composer 2 vs Earlier Composer Versions
&lt;/h2&gt;

&lt;p&gt;Based on Cursor’s published table, Composer 2 is a clear step up from Composer 1.5 and Composer 1.&lt;/p&gt;

&lt;p&gt;The improvement is visible across all the benchmarks included in the post, and Cursor attributes that jump to a combination of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a stronger base model from continued pretraining&lt;/li&gt;
&lt;li&gt;reinforcement learning on long-horizon coding tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is a sensible recipe for a coding model. Better base training improves general capability, while long-horizon RL helps the model stay coherent over extended multi-step tasks.&lt;/p&gt;

&lt;p&gt;From the announcement alone, Composer 2 looks like a real model upgrade rather than a minor iteration.&lt;/p&gt;




&lt;h2&gt;
  
  
  Initial Impressions
&lt;/h2&gt;

&lt;p&gt;My first impression is that this is a disciplined release.&lt;/p&gt;

&lt;p&gt;Cursor is not trying to claim that Composer 2 changes everything. The message is narrower and more believable: the model is better, it handles long-horizon coding tasks more effectively, and it is priced aggressively enough to be useful in regular workflows.&lt;/p&gt;

&lt;p&gt;The long-horizon point is the one I would pay most attention to. A lot of coding models can produce a good patch in one pass. Fewer models stay reliable across a task that unfolds over many actions. If Composer 2 is genuinely stronger there, that is a meaningful improvement.&lt;/p&gt;

&lt;p&gt;The pricing is the other major strength. A coding model can be strong on benchmarks and still be awkward in practice if the economics are wrong. Cursor seems to understand that and is making cost a central part of the launch rather than an afterthought.&lt;/p&gt;

&lt;p&gt;At the same time, this is still an announcement built around Cursor’s own evaluation framing. The benchmark gains look strong, but the real test will be whether Composer 2 feels materially better in day-to-day software work.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk1simb06d45iu9zxnhgo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk1simb06d45iu9zxnhgo.png" alt="Image2" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Composer 2 looks like a meaningful upgrade to Cursor’s coding model stack.&lt;/p&gt;

&lt;p&gt;The release is compelling for three reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the benchmark gains are substantial&lt;/li&gt;
&lt;li&gt;the training story is technically coherent&lt;/li&gt;
&lt;li&gt;the pricing is practical&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you already use Cursor, Composer 2 is worth trying.&lt;/p&gt;

&lt;p&gt;If you evaluate coding models more broadly, this release is notable because it tries to improve both capability and economics at the same time. That is the right combination to optimize for.&lt;/p&gt;

</description>
      <category>cursor</category>
      <category>ai</category>
      <category>llm</category>
      <category>mcp</category>
    </item>
    <item>
      <title>Running LLM Applications Across Providers with Bifrost</title>
      <dc:creator>Arindam Majumder </dc:creator>
      <pubDate>Tue, 17 Mar 2026 16:15:23 +0000</pubDate>
      <link>https://dev.to/studio1hq/running-llm-applications-across-providers-with-bifrost-313h</link>
      <guid>https://dev.to/studio1hq/running-llm-applications-across-providers-with-bifrost-313h</guid>
      <description>&lt;p&gt;Many modern applications include AI features that rely on large language models accessed through APIs. When an application sends a prompt to a model and receives a response, that request usually goes through an external service.&lt;/p&gt;

&lt;p&gt;Getting access to different LLMs is easier today. Providers such as &lt;a href="https://platform.openai.com/api-keys" rel="noopener noreferrer"&gt;OpenAI&lt;/a&gt; and &lt;a href="https://platform.claude.com/" rel="noopener noreferrer"&gt;Anthropic&lt;/a&gt; provide model APIs, and platforms like &lt;a href="https://aws.amazon.com/bedrock/" rel="noopener noreferrer"&gt;Amazon Bedrock&lt;/a&gt; and &lt;a href="https://cloud.google.com/vertex-ai" rel="noopener noreferrer"&gt;Google Vertex AI&lt;/a&gt; give access to several models from one place. Because of this, many applications connect to more than one provider to compare models, manage cost, or keep a backup option if one service fails.&lt;/p&gt;

&lt;p&gt;But each provider works a little differently. Authentication methods, rate limits, and request formats are not the same. Managing these differences inside an application can slowly add complexity to the system. In this article, let us explore Bifrost, an open-source LLM gateway that provides a single layer to route requests and manage interactions with multiple model providers.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hidden Cost of Provider Integrations
&lt;/h2&gt;

&lt;p&gt;Connecting to several LLM providers may look simple at the start. Adding another provider can feel like just integrating one more API.&lt;/p&gt;

&lt;p&gt;That situation changes once the application runs in production. Requests may need to go to different models based on cost, response quality, or latency. If a provider slows down or becomes unavailable, the system must redirect requests to another provider and keep the service running.&lt;/p&gt;

&lt;p&gt;Handling these situations introduces additional logic into the codebase. The application needs to manage how requests are routed between models. It must also include retry logic for failed calls, fallback providers during outages, and tracking for how requests are distributed across models.&lt;/p&gt;

&lt;p&gt;Each of these responsibilities adds work to the system. Over time, operational logic seeps into the application and increases maintenance effort. This overhead is the hidden cost of working directly with multiple model providers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introducing Bifrost: A Gateway for LLM Infrastructure
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/overview" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; is an &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;open-source&lt;/a&gt; LLM and MCP gateway designed to manage interactions between applications and model providers. It sits between the application and the LLM services and acts as a central layer that controls how requests move between systems.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvsyseg3iy2fg1v6h6yhe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvsyseg3iy2fg1v6h6yhe.png" alt="Image1" width="800" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Applications often connect directly to each provider they use. Bifrost adds a gateway layer between the application and the providers, so requests pass through a single entry point before reaching the model services.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffygdaoyre598cw4i7cdw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffygdaoyre598cw4i7cdw.png" alt="Image2" width="800" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This structure separates provider management from the application. The application sends requests to one endpoint, and the gateway manages communication with different model providers. Provider configuration and request handling stay inside the gateway layer, reducing provider-specific logic in the application code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Infrastructure Capabilities
&lt;/h2&gt;

&lt;p&gt;Bifrost provides several infrastructure capabilities for managing LLM interactions across providers. These capabilities move provider-specific handling out of the application and into the gateway layer.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-provider routing:&lt;/strong&gt; Bifrost supports multiple AI providers through a single API interface. Applications send requests to one endpoint, and the gateway routes each request to the configured provider or model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load balancing:&lt;/strong&gt; When multiple providers or API keys are configured, Bifrost distributes requests across them based on defined rules. Traffic spreads across providers and reduces the chance of hitting rate limits on a single service.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic fallback:&lt;/strong&gt; When a provider returns an error or becomes unavailable, Bifrost sends the request to another configured provider.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic caching:&lt;/strong&gt; Bifrost stores responses and returns them for semantically similar prompts, reducing repeated API calls and improving response time.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Platform Support and Integrations
&lt;/h2&gt;

&lt;p&gt;Bifrost fits environments where applications use multiple models and providers. The gateway exposes an OpenAI-compatible API, so applications that already use OpenAI SDKs can connect with minimal changes and send requests through a single endpoint.&lt;/p&gt;
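
&lt;p&gt;As a minimal sketch of what "minimal changes" means here: the official OpenAI SDKs read their base URL and API key from environment variables, so pointing an existing application at a locally running gateway can be as small as the following (the port is illustrative and depends on how you start Bifrost):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# The OpenAI SDKs honor these environment variables, so the application
# code itself does not change. The address assumes a local gateway on port 8080.
export OPENAI_BASE_URL="http://localhost:8080/v1"
export OPENAI_API_KEY="your-provider-key"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;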

&lt;p&gt;Bifrost works with several &lt;a href="https://docs.getbifrost.ai/providers/supported-providers/overview" rel="noopener noreferrer"&gt;LLM providers&lt;/a&gt;, such as OpenAI, Anthropic, Amazon Bedrock, Google Vertex AI, Cohere, and Mistral. Applications can reach these providers through the same gateway interface.&lt;/p&gt;

&lt;p&gt;The gateway also supports the &lt;a href="https://docs.getbifrost.ai/mcp/overview" rel="noopener noreferrer"&gt;Model Context Protocol (MCP)&lt;/a&gt;. Systems that use MCP can connect tools and external services through the same layer used for model requests. Bifrost also includes a &lt;a href="https://docs.getbifrost.ai/plugins/getting-started" rel="noopener noreferrer"&gt;plugin system&lt;/a&gt; for adding custom behavior such as request validation, logging, or request transformation.&lt;/p&gt;

&lt;p&gt;Bifrost can run using tools such as NPX or Docker and can operate in local setups or production environments. The project is open source under the MIT license and can run across different infrastructure environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Gateway Performance and Benchmark&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A gateway processes every request sent to a model provider. The performance of this layer becomes important in systems that handle a large number of AI requests.&lt;/p&gt;

&lt;p&gt;Bifrost is written in Go, a language often used for backend services that process many requests simultaneously. The system focuses on keeping the extra processing time very small.&lt;/p&gt;

&lt;p&gt;Benchmark tests show that Bifrost adds about 11 microseconds of latency per request at 5,000 requests per second. That is 0.011 milliseconds, a negligible overhead next to model calls that typically take hundreds of milliseconds.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://docs.getbifrost.ai/benchmarking/getting-started" rel="noopener noreferrer"&gt;published benchmarks&lt;/a&gt; were executed on AWS EC2 t3.medium and t3.large instances. These are cloud virtual machines with moderate CPU and memory resources that are commonly used to run backend services and APIs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnqud1pe1ewno7lns871w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnqud1pe1ewno7lns871w.png" alt="Image3" width="800" height="266"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Bifrost also provides a &lt;a href="https://github.com/maximhq/bifrost-benchmarking" rel="noopener noreferrer"&gt;public benchmarking repository&lt;/a&gt; with the scripts and setup used in the tests. Anyone can run the same tests or perform custom benchmarking based on their own infrastructure, traffic patterns, or model providers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started with Bifrost
&lt;/h2&gt;

&lt;p&gt;Bifrost is designed for quick setup and can run locally or in a server environment. The gateway can start in a few steps and begin routing LLM requests through a single endpoint.&lt;/p&gt;

&lt;p&gt;One way to start Bifrost is by using &lt;strong&gt;NPX&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Bifrost can also run using &lt;strong&gt;Docker&lt;/strong&gt;, which allows the gateway to start inside a container environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 8080:8080 maximhq/bifrost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the gateway starts, applications can send LLM requests to the Bifrost endpoint. The gateway then routes the requests to the configured model providers.&lt;/p&gt;
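
&lt;p&gt;A rough sketch of such a request, assuming a local gateway on port 8080 and an OpenAI-style chat completions route (the model identifier here is illustrative; exact naming depends on your provider configuration, so check the Bifrost docs):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Illustrative request against a locally running gateway.
# The route follows the OpenAI-compatible API; the model name is an assumption.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello from Bifrost"}]
  }'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;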

&lt;p&gt;Configuration options allow the gateway to define providers, API keys, routing rules, caching behavior, and fallback settings. These configurations control how requests move between different LLM providers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;Managing several LLM providers inside an application can introduce extra operational logic and maintenance effort. A gateway layer offers a cleaner structure for handling these interactions.&lt;/p&gt;

&lt;p&gt;Bifrost provides this layer by placing a gateway between applications and model providers. Requests go through one endpoint, and the gateway manages routing and provider communication.&lt;/p&gt;

&lt;p&gt;This approach keeps provider integrations outside the core application code and places request management in a separate infrastructure layer.&lt;/p&gt;

&lt;p&gt;To explore configuration options, deployment steps, and additional features, &lt;a href="https://docs.getbifrost.ai/overview" rel="noopener noreferrer"&gt;refer to the official Bifrost documentation&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>proxy</category>
      <category>litellm</category>
    </item>
    <item>
      <title>5 OpenClaw Plugins That Actually Make It Production-Ready</title>
      <dc:creator>Arindam Majumder </dc:creator>
      <pubDate>Fri, 13 Mar 2026 15:19:20 +0000</pubDate>
      <link>https://dev.to/arindam_1729/5-openclaw-plugins-that-actually-make-it-production-ready-14kn</link>
      <guid>https://dev.to/arindam_1729/5-openclaw-plugins-that-actually-make-it-production-ready-14kn</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;There is a certain point every serious &lt;a href="https://openclaw.ai/" rel="noopener noreferrer"&gt;OpenClaw&lt;/a&gt; user reaches. The agent is running, the setup works, and then slowly, almost without noticing, the cracks show up. A workflow that should take seconds starts requiring three follow-up prompts. The context window fills up with things the agent should already know. The API bill at the end of the month is higher than expected, and there is no clear answer for why.&lt;/p&gt;

&lt;p&gt;Most people at this point start tweaking their skills, adjusting prompts, or switching models, but the problem is usually none of those things.&lt;/p&gt;

&lt;p&gt;OpenClaw's default configuration is designed to get you started, not to match how you actually use it. The real power that makes it suitable for daily professional use lies in the plugin layer, yet most OpenClaw users have never explored it.&lt;/p&gt;

&lt;p&gt;In this post, we are covering five &lt;a href="https://docs.openclaw.ai/tools/plugin" rel="noopener noreferrer"&gt;OpenClaw Plugins&lt;/a&gt;; each solves a different problem and adds a layer that the default setup simply does not have. But before getting into the plugins themselves, it is worth understanding what separates a plugin from a skill.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Are OpenClaw Plugins (And Why They're Different from Skills)
&lt;/h2&gt;

&lt;p&gt;If you have spent any time in the OpenClaw community, you have probably seen both terms used interchangeably. They are not the same thing, and the distinction matters more than it seems.&lt;/p&gt;

&lt;p&gt;A skill is a markdown file, specifically a &lt;code&gt;SKILL.md&lt;/code&gt;, that gets injected into the agent's context at inference time. It shapes how the agent thinks, what tone it uses, and what steps it follows. Every time the agent runs, that file loads into the prompt. Skills are useful for behavior, but they come at a cost: they consume tokens on every single request, whether or not they are relevant to what you asked.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk7n9a9e4277gcv1qhoc0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk7n9a9e4277gcv1qhoc0.png" alt="OpenClaw skill vs plugin." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A plugin is fundamentally different. It is a standalone executable that runs as a separate process alongside OpenClaw. Instead of loading into context, it exposes a set of tools through a defined interface that the agent can call when it actually needs them.  OpenClaw loads plugins once at startup and calls into them only when a task requires it. No tokens consumed just by existing.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Install a Plugin
&lt;/h2&gt;

&lt;p&gt;Installing any plugin follows the same pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openclaw plugins &lt;span class="nb"&gt;install&lt;/span&gt; &amp;lt;plugin-name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That command downloads the plugin, registers it in your OpenClaw configuration at &lt;code&gt;~/.openclaw/openclaw.json&lt;/code&gt;, and makes its tools available the next time the agent starts. You can open that file at any time to see which plugins are currently registered and adjust their individual configurations.&lt;/p&gt;

&lt;p&gt;To confirm a plugin is active after installation, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openclaw plugins list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This returns all registered plugins and their current status. If something is not showing up, a full restart of the OpenClaw daemon is usually all it takes.&lt;/p&gt;

&lt;p&gt;With that covered, here are the five plugins worth adding to your setup.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. &lt;a href="https://manifest.build/docs/install" rel="noopener noreferrer"&gt;Manifest&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;When you configure OpenClaw, you pick a default model. Claude Opus, GPT-4, whatever you prefer. From that point on, every request, regardless of its type, goes to that model. Asking the agent to list files in a directory costs the same as asking it to debug a race condition across three services. The model does not know the difference, and OpenClaw does not try to make one.&lt;/p&gt;

&lt;p&gt;This is where most API bills quietly spiral. Not from one expensive task, but from hundreds of simple ones hitting a premium model they never needed.&lt;/p&gt;

&lt;p&gt;Manifest sits between OpenClaw and your LLM providers. Every request passes through it before reaching a model. It reads the request, classifies the task complexity, and routes it to the cheapest model capable of handling it. Simple lookups go to lighter models. Reasoning-heavy tasks escalate to whatever model can actually handle them. Routing occurs in milliseconds and is invisible to the agent; it only sees a response.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgcg691g1sqc289fsfcz0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgcg691g1sqc289fsfcz0.png" alt="Manifest plugin routing OpenClaw" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The cost difference compounds fast. Users running OpenClaw through Manifest have reported up to 70% reduction in monthly API spend, not by doing less, but by stopping the habit of paying Opus prices for Haiku-level work. The Manifest dashboard makes this visible: you can see cost broken down per session, per tool call, and per model, so you know exactly where your spend is going and whether the routing decisions are working as expected.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn6cf2c6c4amj9953a552.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn6cf2c6c4amj9953a552.png" alt="Manifest Dashboard" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Installing Manifest:&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openclaw plugins &lt;span class="nb"&gt;install &lt;/span&gt;manifest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once installed, Manifest registers itself as the default routing layer. You can configure routing thresholds and model preferences in &lt;code&gt;~/.openclaw/openclaw.json&lt;/code&gt; under the &lt;code&gt;manifest&lt;/code&gt; plugin entry.&lt;/p&gt;

&lt;p&gt;Manifest makes the biggest difference in setups where the agent runs long sessions, handles multi-step tasks, or operates overnight without supervision. The more requests flow through OpenClaw, the more the routing logic saves, because the inefficiency it fixes is not a one-time cost; it is per-request.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. &lt;a href="https://composio.dev/toolkits/composio/framework/openclaw" rel="noopener noreferrer"&gt;Composio&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Out of the box, OpenClaw cannot reach your Gmail, Slack, GitHub, or Notion. Not because the agent is incapable, but because every external service requires OAuth authentication, token management, and refresh handling, none of which OpenClaw sets up for you. Most people work around this by manually generating API keys, pasting them into configuration files, and hoping the tokens don't expire mid-session. It works until it does not.&lt;/p&gt;

&lt;p&gt;Composio solves this at the authentication layer. It runs as an MCP server that sits between OpenClaw and every external app you want the agent to reach. You connect your accounts once through the Composio dashboard, and from that point on OpenClaw talks only to Composio, which handles everything else: token refresh, OAuth flows, rate limits, API versioning. None of that touches your OpenClaw config directly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fumz76k9hvc68dhv6ii1w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fumz76k9hvc68dhv6ii1w.png" alt="Composio MCP Server connecting OpenClaw" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each app connection runs in an isolated MCP session. If one integration fails or a token expires, it does not affect the others. The agent continues operating normally while Composio handles the reconnection in the background.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Installing Composio:&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openclaw plugins &lt;span class="nb"&gt;install &lt;/span&gt;composio
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After installation, connect your apps through the Composio dashboard and add the plugin entry to &lt;code&gt;~/.openclaw/openclaw.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"plugins"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"composio"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"apiKey"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"your-composio-api-key"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In practice, what this unlocks is straightforward. A single prompt like &lt;em&gt;"summarize my unread emails, open a GitHub issue for anything that needs follow-up, and post a summary to the team Slack channel"&lt;/em&gt; now executes end to end, no switching tabs, no copying API keys, no manual auth setup. The agent has the required access, and Composio ensures it remains valid.&lt;/p&gt;

&lt;p&gt;With 850+ supported apps, Composio covers most of what a professional OpenClaw setup would ever need to reach.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. &lt;a href="https://github.com/hyperspell/hyperspell-openclaw" rel="noopener noreferrer"&gt;Hyperspell&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;OpenClaw's default memory is a &lt;code&gt;MEMORY.md&lt;/code&gt; file. It grows with every session, gets compacted when it reaches a limit, loses information in the process, and reloads entirely on every turn, whether the content is relevant or not. For occasional use, this is fine, but for anyone relying on OpenClaw daily, it becomes a real problem fast.&lt;/p&gt;

&lt;p&gt;Hyperspell replaces this layer entirely. It indexes your connected data sources (emails, documents, and past conversations) into a knowledge graph, then injects only the relevant slice of that graph before each agent turn. The agent gets what it needs, not everything it has ever seen.&lt;/p&gt;

&lt;p&gt;Memory also becomes sharper over time. Every query refines how the knowledge graph is indexed, so context recall improves the more you use it. An agent running Hyperspell can reference a decision you made three weeks ago without you having to bring it up.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Installing Hyperspell:&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;openclaw&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;plugins&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;install&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;@hyperspell/openclaw-hyperspell&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Connect your data sources through the Hyperspell dashboard, then add your API key under the &lt;code&gt;hyperspell&lt;/code&gt; entry in &lt;code&gt;~/.openclaw/openclaw.json&lt;/code&gt;. Context injection is automatic from there.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. &lt;a href="https://github.com/lekt9/openclaw-foundry" rel="noopener noreferrer"&gt;OpenClaw Foundry&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Most workflows repeat. You run the same sequence of tasks every morning, follow the same steps every time a PR needs review, and ask the agent the same three things before a meeting. OpenClaw handles all of these, but it handles them the same way every time, waiting for you to prompt it from scratch. It does not recognize the pattern. It does not try to make things easier on its own.&lt;/p&gt;

&lt;p&gt;Foundry fixes this. It sits in the background during your sessions, watches what you ask for, and, when it detects a recurring pattern, writes a new tool definition into itself. That tool becomes part of the agent's available toolkit the next time you start a session, no manual configuration required.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn2tv42jbkdfqbbntrf78.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn2tv42jbkdfqbbntrf78.png" alt="OpenClaw Foundry plugin" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What makes this different from writing a skill is the output. A skill adds behavioral instructions to the agent's context. Foundry creates an executable tool that the agent can call, with its own inputs and outputs, registered in the tool registry and available on demand.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Installing Foundry:&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openclaw plugins &lt;span class="nb"&gt;install&lt;/span&gt; @getfoundry/foundry-openclaw
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This downloads the plugin from npm, extracts it to &lt;code&gt;~/.openclaw/extensions/foundry/&lt;/code&gt;, enables it automatically, and restarts the gateway. After that, add the following to &lt;code&gt;~/.openclaw/openclaw.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"plugins"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"entries"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"foundry"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"enabled"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"autoLearn"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"sources"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"docs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"experience"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"arxiv"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"github"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"marketplace"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"autoPublish"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;autoLearn: true&lt;/code&gt; is the key setting: it tells Foundry to continuously learn from your sessions without requiring you to trigger it manually. The &lt;code&gt;sources&lt;/code&gt; block controls where Foundry pulls additional context when writing new tools: OpenClaw's own documentation, your past session experience, arXiv papers, and public GitHub repos. For most setups, keeping &lt;code&gt;docs&lt;/code&gt; and &lt;code&gt;experience&lt;/code&gt; enabled is enough to start.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. &lt;a href="https://github.com/comet-ml/opik-openclaw" rel="noopener noreferrer"&gt;Opik&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Multi-step agent runs fail in non-obvious ways. A tool call returns incorrect output, a sub-agent silently errors out, or a model call takes 12 seconds on a task that should take 2. Without structured tracing, you are left reading raw logs and guessing. That gets old fast.&lt;/p&gt;

&lt;p&gt;Opik is an open-source LLM and agent observability platform built by Comet ML. The OpenClaw plugin hooks into the gateway process and exports a structured trace for every run: LLM request and response spans, tool call inputs and outputs, sub-agent lifecycle events, latency at each step, and token usage with cost. Every event that matters has a corresponding span in the Opik dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5tih007yxok3phfi4enr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5tih007yxok3phfi4enr.png" alt="Opik Dashboard" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Image reference: &lt;a href="https://github.com/comet-ml/opik-openclaw" rel="noopener noreferrer"&gt;https://github.com/comet-ml/opik-openclaw&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is a different layer from what Manifest covers. Manifest tells you how much a request costs and which model handled it. Opik tells you what the agent actually did inside that request: which tools it called, in what order, what each one returned, and where the run slowed down or failed. The two answer different questions, and neither replaces the other.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Installing Opik:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Requirements: OpenClaw &lt;code&gt;&amp;gt;=2026.3.2&lt;/code&gt;, Node.js &lt;code&gt;&amp;gt;=22.12.0&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openclaw plugins &lt;span class="nb"&gt;install&lt;/span&gt; @opik/opik-openclaw
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After installation, restart the gateway, then run the setup wizard:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openclaw opik configure
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This validates your endpoint and API key and automatically writes the config. To verify everything is connected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openclaw opik status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The recommended config in &lt;code&gt;~/.openclaw/openclaw.json&lt;/code&gt; looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"plugins"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"entries"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="s2"&gt;"opik-openclaw"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;"enabled"&lt;/span&gt;: &lt;span class="nb"&gt;true&lt;/span&gt;,
        &lt;span class="s2"&gt;"config"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
          &lt;span class="s2"&gt;"apiKey"&lt;/span&gt;: &lt;span class="s2"&gt;"your-api-key"&lt;/span&gt;,
          &lt;span class="s2"&gt;"apiUrl"&lt;/span&gt;: &lt;span class="s2"&gt;"https://www.comet.com/opik/api"&lt;/span&gt;,
          &lt;span class="s2"&gt;"projectName"&lt;/span&gt;: &lt;span class="s2"&gt;"openclaw"&lt;/span&gt;,
          &lt;span class="s2"&gt;"workspaceName"&lt;/span&gt;: &lt;span class="s2"&gt;"default"&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
      &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For teams that cannot send trace data to a third party, Opik is fully self-hostable. Replace &lt;code&gt;apiUrl&lt;/code&gt; with your own instance endpoint, and nothing else changes.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Where to Go From Here&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Each plugin owns a distinct layer. Hyperspell handles context before the request starts. Manifest handles model routing during it. Composio handles external reach when the agent needs to act. Foundry watches for patterns across sessions and builds tools from them. Opik traces everything after the fact, so you know exactly what happened and why.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F164mtxkcmhpwjyfb6hjr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F164mtxkcmhpwjyfb6hjr.png" alt="openclaw plugins" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;None of them overlap, and none of them are doing the same job twice. You can start with just one, whichever layer is causing the most friction in your current setup, and layer in the rest as your workflow grows.&lt;/p&gt;

&lt;p&gt;Each plugin has its own documentation to read before configuring anything: &lt;a href="https://manifest.build/docs" rel="noopener noreferrer"&gt;&lt;strong&gt;Manifest&lt;/strong&gt;&lt;/a&gt;, &lt;a href="https://composio.dev/" rel="noopener noreferrer"&gt;&lt;strong&gt;Composio&lt;/strong&gt;&lt;/a&gt;, &lt;a href="https://github.com/hyperspell/hyperspell-openclaw" rel="noopener noreferrer"&gt;&lt;strong&gt;Hyperspell&lt;/strong&gt;&lt;/a&gt;, &lt;a href="https://github.com/lekt9/openclaw-foundry" rel="noopener noreferrer"&gt;&lt;strong&gt;OpenClaw Foundry&lt;/strong&gt;&lt;/a&gt;, and &lt;a href="https://www.comet.com/docs/opik" rel="noopener noreferrer"&gt;&lt;strong&gt;Opik&lt;/strong&gt;&lt;/a&gt;. The &lt;a href="https://docs.openclaw.ai/tools/plugin" rel="noopener noreferrer"&gt;&lt;strong&gt;OpenClaw plugin docs&lt;/strong&gt;&lt;/a&gt; cover the installation system in full if you want to go deeper on how plugins interact with the gateway.&lt;/p&gt;

</description>
      <category>openclaw</category>
      <category>ai</category>
      <category>programming</category>
      <category>skills</category>
    </item>
    <item>
      <title>[Boost]</title>
      <dc:creator>Arindam Majumder </dc:creator>
      <pubDate>Sat, 07 Mar 2026 08:35:19 +0000</pubDate>
      <link>https://dev.to/arindam_1729/-554c</link>
      <guid>https://dev.to/arindam_1729/-554c</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/arindam_1729" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F965723%2Fe0982512-4de1-4154-b3c3-1869d19e9ecc.png" alt="arindam_1729"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/arindam_1729/what-is-llm-observability-the-complete-guide-2026-26e6" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;What is LLM Observability? The Complete Guide (2026)&lt;/h2&gt;
      &lt;h3&gt;Arindam Majumder  ・ Mar 6&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#ai&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#llm&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#observability&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#programming&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>ai</category>
      <category>llm</category>
      <category>observability</category>
      <category>programming</category>
    </item>
    <item>
      <title>What is LLM Observability? The Complete Guide (2026)</title>
      <dc:creator>Arindam Majumder </dc:creator>
      <pubDate>Fri, 06 Mar 2026 07:59:12 +0000</pubDate>
      <link>https://dev.to/arindam_1729/what-is-llm-observability-the-complete-guide-2026-26e6</link>
      <guid>https://dev.to/arindam_1729/what-is-llm-observability-the-complete-guide-2026-26e6</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; LLM observability is your ability to understand what your language models are doing in production, not just whether they're up, but whether they're good. &lt;br&gt;
This guide covers everything: what it is, how it differs from traditional monitoring, the four pillars, key metrics, RAG and agent observability, enterprise challenges, the current tools landscape, and how to implement it from scratch.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;Shipping an LLM feature is no longer the hard part. Keeping it reliable, fast, and cheap in production is.&lt;/p&gt;

&lt;p&gt;Most engineering teams hit the same wall about three months after their first production deployment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;responses quietly degrade,&lt;/li&gt;
&lt;li&gt;costs balloon unexpectedly,&lt;/li&gt;
&lt;li&gt;a customer escalation surfaces a class of failures nobody noticed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And the team realizes they have almost no visibility into what the model is actually doing.&lt;/p&gt;

&lt;p&gt;That's where we need LLM observability.&lt;/p&gt;

&lt;p&gt;In this article, we'll understand what LLM observability is, why traditional monitoring isn't enough, the four pillars you need to instrument, and how to implement it, for simple deployments, RAG pipelines, and agentic systems alike.&lt;/p&gt;

&lt;p&gt;Let's start with the problem that makes all of this necessary.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;You can have a perfect HTTP 200 response from an LLM API and still be on the edge of a production disaster.&lt;/p&gt;

&lt;p&gt;The model responded. No error was thrown. The latency was acceptable. Your uptime check is green.&lt;/p&gt;

&lt;p&gt;And somewhere in that response, the model hallucinated a fact, cited a policy that doesn't exist, or gave a customer the wrong refund amount, with complete confidence, in fluent prose.&lt;/p&gt;

&lt;p&gt;This is the core problem that LLM observability exists to solve. Traditional software monitoring tells you whether your system &lt;em&gt;worked&lt;/em&gt;. LLM observability tells you whether it worked &lt;em&gt;well&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The gap between those two questions is where most enterprise AI deployments silently fail.&lt;/p&gt;

&lt;p&gt;The LLM observability platform market was valued at &lt;a href="https://www.einpresswire.com/article/870921708/large-language-model-llm-observability-platform-market-to-grow-at-363-cagr-2025" rel="noopener noreferrer"&gt;&lt;strong&gt;$1.44 billion in 2024&lt;/strong&gt; and is projected to reach &lt;strong&gt;$6.80 billion by 2029&lt;/strong&gt;&lt;/a&gt;, with a 36.3% CAGR. But market growth doesn't tell you what to do on Monday morning.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is LLM Observability?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkeatmu9enpobhjtc5246.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkeatmu9enpobhjtc5246.png" alt="Image1" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;LLM observability is the practice of capturing, analyzing, and acting on telemetry data from large language model applications in production. It gives you the ability to understand the internal state of an LLM system from its outputs, so your team can ensure the models function accurately, reliably, and safely at scale.&lt;/p&gt;

&lt;p&gt;More precisely, LLM observability means capturing inference-level data (token usage, prompt content, response quality, error rates, latency breakdowns, and cost) and correlating it with user interactions to provide a complete, queryable picture of system behavior.&lt;/p&gt;

&lt;p&gt;Before going further, three terms are worth distinguishing:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Term&lt;/th&gt;
&lt;th&gt;Scope&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM Monitoring&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Continuously tracks performance metrics (latency, token usage, error rates) as a health signal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM Observability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Goes beyond monitoring &amp;amp; provides in-depth insight into &lt;em&gt;how and why&lt;/em&gt; an LLM behaves the way it does&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM Tracing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Captures request/response flow through an LLM pipeline, tracking inputs, intermediate stages, and outputs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Observability includes monitoring but adds root-cause analysis. Monitoring tells you &lt;em&gt;something is wrong&lt;/em&gt;. Observability tells you &lt;em&gt;why&lt;/em&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why LLM Observability is Different from Traditional APM
&lt;/h2&gt;

&lt;p&gt;Your platform team already runs Datadog, Grafana, New Relic, or some combination. You have RED metrics (Rate, Errors, Duration) on every service. Your infrastructure is well-monitored.&lt;/p&gt;

&lt;p&gt;That's great, but none of these are sufficient for LLMs.&lt;/p&gt;

&lt;p&gt;Traditional APM was designed for deterministic systems. A successful API call means the function completed correctly. An HTTP 200 means the request was handled. These assumptions break completely the moment you introduce a language model.&lt;/p&gt;

&lt;p&gt;Language models are non-deterministic by nature; the same prompt can return five different answers across five calls.&lt;/p&gt;

&lt;p&gt;Here's the fundamental incompatibility:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Traditional APM&lt;/th&gt;
&lt;th&gt;LLM Observability&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Primary question&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Did it work?&lt;/td&gt;
&lt;td&gt;Was it good?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Success signal&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;HTTP 200, no exception&lt;/td&gt;
&lt;td&gt;Output quality score&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Input space&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Finite, structured&lt;/td&gt;
&lt;td&gt;Infinite, natural language&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Output behavior&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Deterministic&lt;/td&gt;
&lt;td&gt;Non-deterministic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Failure modes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Errors, timeouts&lt;/td&gt;
&lt;td&gt;Hallucinations, drift, unsafe content&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fixed compute&lt;/td&gt;
&lt;td&gt;Variable (token-based)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Debugging unit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Stack trace&lt;/td&gt;
&lt;td&gt;Prompt + context + response chain&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Analysis type&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Threshold alerts&lt;/td&gt;
&lt;td&gt;Semantic search over interactions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The most dangerous LLM failures are &lt;strong&gt;silent&lt;/strong&gt;. The model doesn't throw an exception when it makes something up. Degradations happen without a single error being raised.&lt;/p&gt;

&lt;p&gt;A customer service bot starts giving inaccurate answers. A RAG pipeline silently returns less relevant chunks after an index update. A fine-tuned model's quality drifts after a base model version change by the API provider. None of these appear in your existing dashboards.&lt;/p&gt;

&lt;p&gt;LLM observability doesn't replace traditional APM; it supplements it. You still need infrastructure health monitoring. You just need an additional layer that understands what's happening at the &lt;em&gt;content&lt;/em&gt; level, not just the &lt;em&gt;infrastructure&lt;/em&gt; level.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Four Pillars of LLM Observability
&lt;/h2&gt;

&lt;p&gt;Traditional observability has three pillars: logs, metrics, and traces. LLM observability keeps all three and adds a fourth that doesn't exist in traditional software monitoring at all.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pillar 1: Metrics
&lt;/h3&gt;

&lt;p&gt;Metrics in LLM observability cover two distinct layers, infrastructure performance and business quality, and you need both.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F62gl5uh2tinzzve4y9ml.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F62gl5uh2tinzzve4y9ml.png" alt="Explanation" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Time to First Token (TTFT)&lt;/strong&gt; is the elapsed time between when a request is sent and when the first response token arrives. It's the primary latency signal for streaming interfaces because it's what users &lt;em&gt;feel&lt;/em&gt; as "waiting." TTFT has two components: the prefill phase (the model processes all input tokens and builds its KV-cache) plus any scheduling or queue time. The longer the prompt, the higher the TTFT. This is why prompt size distribution is a leading indicator for latency degradation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Token throughput (TPS)&lt;/strong&gt; is the number of output tokens generated per second after the first token. This is your compute load metric. It's not the same as requests per second; a single request can be 50 tokens or 50,000. Token throughput tells you what the model is actually doing; RPS tells you how many requests arrived. Both matter, for different reasons.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;End-to-end latency&lt;/strong&gt; is what most teams track. But the unit that matters is percentiles, not averages. LLM response time distributions are heavily skewed by prompt length, context size, and load. p50 is your median user experience. p95 is your bad-day experience. p99 is where your SLA commitments should live. Average latency in a skewed distribution is a lie; it will always look better than what your worst users actually experience. A quick way to check this on raw samples is sketched after the figure below.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F30o0gclufzfhic4pqk66.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F30o0gclufzfhic4pqk66.png" alt="Image1" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Error metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6zczl9ra3w4499zthote.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6zczl9ra3w4499zthote.png" alt="Error Metrics" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Not all LLM errors are equal, and treating them as a single "error rate" metric is one of the most common mistakes in inference monitoring.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;4xx errors&lt;/strong&gt;: malformed requests, invalid parameters, context length violations. Client problem. The application team owns it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;429s&lt;/strong&gt;: rate limit exhaustion. Capacity problem. Needs a quota increase or traffic shaping, not a bug fix.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;5xx errors&lt;/strong&gt;: infrastructure failures, model unavailability. Infra team incident. Pages differently from a 4xx.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When these are merged into a single error rate, a 429 storm looks identical to an infrastructure outage. You escalate to the wrong team, debug the wrong layer, and waste hours while your users are already feeling the impact.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.tokenfactory.nebius.com/ai-models-inference/observability" rel="noopener noreferrer"&gt;Nebius Token Factory&lt;/a&gt; separates these three error classes natively   each tracked as its own signal, filterable by endpoint, region, and API key. This is how error monitoring should work by default.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quality metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fma73sb6rwfxfhuf5baes.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fma73sb6rwfxfhuf5baes.png" alt="Quality metrics" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These are unique to LLM observability and have no equivalent in traditional APM:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Groundedness score:&lt;/strong&gt; Alignment between the response and source documents (critical for RAG)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Relevance score:&lt;/strong&gt; How well the response addresses the actual user query&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Faithfulness:&lt;/strong&gt; Whether the response is supported by the retrieved context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hallucination rate:&lt;/strong&gt; Percentage of responses containing fabricated information&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Refusal rate:&lt;/strong&gt; Proportion of queries declined by safety filters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cost metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7kcmxmco6h0jc71c0n3y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7kcmxmco6h0jc71c0n3y.png" alt="cost-metrics" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Token consumption drives your AI budget, and it can grow invisibly fast.&lt;/p&gt;

&lt;p&gt;Track cost per request broken down by model, endpoint, team, and API key. Track input vs. output token ratios: a single bloated system prompt, replicated across millions of requests, can materially change your monthly bill. And track cache hit ratio, since cached input tokens are typically billed at a steep discount.&lt;/p&gt;

&lt;p&gt;Most teams that audit their token usage find that a &lt;a href="https://www.helicone.ai/blog/monitor-and-optimize-llm-costs" rel="noopener noreferrer"&gt;30–50% reduction is achievable without any quality loss&lt;/a&gt;.&lt;/p&gt;
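
&lt;p&gt;The arithmetic is worth making concrete. A sketch with &lt;em&gt;hypothetical&lt;/em&gt; prices; substitute your provider's actual per-million-token rates:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Cost per request from token counts. Prices are PLACEHOLDERS, not real rates.
PRICE_PER_M = {  # model: (input_usd, output_usd) per 1M tokens -- illustrative
    "model-a": (0.50, 1.50),
    "model-b": (2.00, 8.00),
}

def request_cost_usd(model: str, input_tokens: int, output_tokens: int) -&gt; float:
    in_price, out_price = PRICE_PER_M[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# A bloated 3,000-token system prompt replicated across 1M requests/month:
per_request = request_cost_usd("model-b", input_tokens=3_000, output_tokens=0)
print(f"${per_request * 1_000_000:,.0f}/month spent on the system prompt alone")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;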

&lt;p&gt;&lt;a href="https://docs.tokenfactory.nebius.com/ai-models-inference/observability" rel="noopener noreferrer"&gt;Nebius Token Factory&lt;/a&gt; tracks token throughput broken down by project, endpoint, and API key out of the box   giving you the per-consumer cost visibility that most teams have to build custom tooling for.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frcey3r2oh7ufci93fook.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frcey3r2oh7ufci93fook.png" alt="Image1" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Pillar 2: Traces
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2j4lt2ulzeb9vufcva9h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2j4lt2ulzeb9vufcva9h.png" alt="Traces" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Traces are the backbone of LLM observability. A trace captures the complete lifecycle of a request as it moves through every component of your system, from the user's input through retrieval, re-ranking, the LLM call, post-processing, and back to the user.&lt;/p&gt;

&lt;p&gt;The trace hierarchy for LLM applications:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Session&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A multi-turn user conversation (groups related traces)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Trace&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Complete end-to-end request lifecycle&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Span&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A discrete unit of work within the trace&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Generation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A single LLM call, prompt in, completion out&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Retrieval&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A RAG document fetch operation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tool Call&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;An external API call made by an agent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Event&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A state milestone within a span&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;What each generation span should capture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The exact prompt content (or a hash, if privacy-constrained)&lt;/li&gt;
&lt;li&gt;Model name and version&lt;/li&gt;
&lt;li&gt;Temperature and sampling parameters&lt;/li&gt;
&lt;li&gt;Input token count and output token count&lt;/li&gt;
&lt;li&gt;TTFT, total latency, and per-token latency&lt;/li&gt;
&lt;li&gt;Cost in USD&lt;/li&gt;
&lt;li&gt;Evaluation scores&lt;/li&gt;
&lt;li&gt;Tool arguments and returns (for agents)&lt;/li&gt;
&lt;li&gt;Retrieved document chunks and relevance scores (for RAG)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The critical discipline: &lt;strong&gt;propagate a single trace_id through every layer of the system.&lt;/strong&gt; Application → retriever → guardrails → model call → post-processing. Without this thread, distributed traces are incoherent. Debugging becomes guessing.&lt;/p&gt;
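
&lt;p&gt;A sketch of both disciplines, the span attributes and the propagation, using the OpenTelemetry Python API. The attribute names loosely follow the OTel GenAI semantic conventions, and &lt;code&gt;call_model&lt;/code&gt; is a placeholder for your client call:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Generation span sketch with the OpenTelemetry Python API.
# call_model() is a placeholder; attribute names loosely follow the
# OTel GenAI semantic conventions.
from opentelemetry import trace

tracer = trace.get_tracer("llm-app")

def generate(prompt: str) -&gt; str:
    with tracer.start_as_current_span("generation") as span:
        span.set_attribute("gen_ai.request.model", "model-a")   # illustrative
        span.set_attribute("gen_ai.request.temperature", 0.2)
        completion, usage = call_model(prompt)                  # placeholder
        span.set_attribute("gen_ai.usage.input_tokens", usage.input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", usage.output_tokens)
        return completion
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because the span is opened as the &lt;em&gt;current&lt;/em&gt; span, any retrieval or tool spans created inside it inherit the same trace_id through the OTel context automatically, which is exactly the propagation discipline described above.&lt;/p&gt;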




&lt;h3&gt;
  
  
  Pillar 3: Logs
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdo6io0nhhimappvvno0z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdo6io0nhhimappvvno0z.png" alt="Logs" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;LLM logs differ from traditional application logs in one fundamental way: the payload is unstructured natural language, not structured error codes.&lt;/p&gt;

&lt;p&gt;A JSON log of a successful API call tells you almost nothing about whether that call was valuable. You need a different approach.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recommended log structure for every LLM call:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;trace_id              # Propagated from parent span
timestamp
model_id
model_provider
input_tokens
output_tokens
latency_ms
ttft_ms
cost_usd
user_id               # If applicable
session_id
environment           # dev / staging / prod
application_name
error_type            # If applicable
evaluation_scores     # groundedness, relevance, etc.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Privacy-first default:&lt;/strong&gt; Log metadata, not content. Token counts, model names, latency, and trace IDs by default, not raw prompt and response content. Prompts frequently contain PII. Enable full content capture only for authenticated sessions with explicit data governance controls.&lt;/p&gt;
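
&lt;p&gt;A sketch of that default in practice: the record carries counts, latency, and a content hash, never the text itself. Field names mirror the schema above; &lt;code&gt;print&lt;/code&gt; stands in for your real log sink:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Privacy-first log record: metadata and a hash, never raw prompt text.
import hashlib
import json
import time

def log_llm_call(trace_id: str, model_id: str, prompt: str,
                 input_tokens: int, output_tokens: int, latency_ms: float):
    record = {
        "trace_id": trace_id,
        "timestamp": time.time(),
        "model_id": model_id,
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "latency_ms": latency_ms,
        "environment": "prod",
    }
    print(json.dumps(record))   # swap in your real sink

log_llm_call("tr-123", "model-a", "user prompt here", 812, 64, 930.5)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;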

&lt;p&gt;A semantic search capability over stored prompts is eventually necessary at scale. Traditional log indexing is insufficient for natural language; you can't grep your way to the root cause in a production LLM incident when the relevant signal is "prompts that started producing low groundedness scores after Tuesday."&lt;/p&gt;




&lt;h3&gt;
  
  
  Pillar 4: Evaluation
&lt;/h3&gt;

&lt;p&gt;This is the pillar that has no equivalent in traditional software observability. It is unique to LLM systems, and it is the pillar most teams skip.&lt;/p&gt;

&lt;p&gt;Evaluation is the practice of systematically assessing the quality of LLM outputs. Not just whether requests were completed, but whether the responses were &lt;em&gt;good&lt;/em&gt; by the standards your application requires.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;LLM-as-Judge:&lt;/strong&gt; A separate LLM evaluates the outputs of your primary LLM. Scores across dimensions: relevance, accuracy, coherence, safety, tone. Scales to production traffic volumes without human bottlenecks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Online evaluation:&lt;/strong&gt; Running evals on live traffic in real-time. Flags outputs that fall below quality thresholds as they happen.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Offline evaluation:&lt;/strong&gt; Running evals on captured traces as a batch job. Slower but more thorough, uses more expensive evaluators on a sample of historical traffic.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Human annotation:&lt;/strong&gt; The gold standard for precision. Human labelers reviewing flagged outputs and feeding corrections back into training datasets.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The practical implementation: define 3–5 evaluation metrics you will actually act on. Tracking 20 metrics and acting on none is worse than tracking 3 rigorously. For most applications, a good starting set is: groundedness (for RAG), response relevance, format compliance, and safety.&lt;/p&gt;
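
&lt;p&gt;A minimal LLM-as-Judge sketch for the groundedness metric, assuming an OpenAI-compatible endpoint. The judge model name, base URL, and rubric are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# LLM-as-Judge sketch against an OpenAI-compatible API.
# Model name, base_url, and rubric are placeholders -- swap in your own.
from openai import OpenAI

client = OpenAI(base_url="https://example.com/v1", api_key="...")

RUBRIC = ("Score the ANSWER for groundedness in the CONTEXT from 1 to 5. "
          "Reply with the number only.")

def judge_groundedness(context: str, answer: str) -&gt; int:
    resp = client.chat.completions.create(
        model="judge-model",   # placeholder
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"CONTEXT:\n{context}\n\nANSWER:\n{answer}"},
        ],
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;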




&lt;h2&gt;
  
  
  LLM Observability for RAG Pipelines
&lt;/h2&gt;

&lt;p&gt;RAG (Retrieval-Augmented Generation) is now one of the most common LLM deployment patterns in the enterprise. It also introduces multiple independent observable stages, each of which can fail or degrade silently.&lt;/p&gt;

&lt;p&gt;The observable stages in a RAG pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Query processing&lt;/strong&gt;: User query embedding, query rewriting, or expansion&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval&lt;/strong&gt;: Vector database lookup, hybrid search, metadata filtering&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reranking&lt;/strong&gt;: Scoring and ordering retrieved chunks by relevance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context assembly&lt;/strong&gt;: Stuffing ranked chunks into the prompt within context window limits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generation&lt;/strong&gt;: The LLM call&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Post-processing&lt;/strong&gt;: Citation extraction, formatting, safety filtering&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each stage has its own failure mode. Slow retrieval. Poor reranking quality. Context window overflow. Hallucination despite good retrieval. Good retrieval that the model ignores.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RAG-specific metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Context relevance:&lt;/strong&gt; Are the retrieved chunks actually relevant to the query?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Faithfulness/groundedness:&lt;/strong&gt; Does the generated answer stay within what the retrieved documents support? (This is your RAG-specific hallucination detector.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Answer relevance:&lt;/strong&gt; Does the response address the original question?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recall@k:&lt;/strong&gt; Of all truly relevant documents in the corpus, what fraction was retrieved? (See the sketch after this list.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chunk hit rate:&lt;/strong&gt; How often at least one retrieved chunk was genuinely useful&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context window utilization:&lt;/strong&gt; Truncation rates and overflow events&lt;/li&gt;
&lt;/ul&gt;
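
&lt;p&gt;Recall@k, referenced above, is the simplest of these to compute offline against a labeled set:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Recall@k: of all truly relevant docs, what fraction made it into the top-k?
def recall_at_k(retrieved_ids: list, relevant_ids: set, k: int) -&gt; float:
    hits = len(set(retrieved_ids[:k]) &amp; relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0

print(recall_at_k(["d3", "d7", "d1"], {"d1", "d2"}, k=3))   # 0.5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;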

&lt;p&gt;The key insight for RAG observability: &lt;strong&gt;evaluate components independently from the full pipeline.&lt;/strong&gt; Good retrieval does not guarantee good answers. Good answers can sometimes emerge from poor retrieval.&lt;/p&gt;

&lt;p&gt;You need both component-level evaluation (retrieval quality) and end-to-end evaluation (answer quality), and you need to track them separately, because they can disagree.&lt;/p&gt;

&lt;p&gt;Wrap your retrieval function in a span that captures: which documents were fetched, their relevance scores, the query that produced them, and the latency. Link this span to the parent trace. When a user gets a bad answer, you need to be able to answer: "Was the retrieval bad, or was the model bad with good data?"&lt;/p&gt;
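
&lt;p&gt;A sketch of that retrieval span, again with the OpenTelemetry Python API; &lt;code&gt;vector_store&lt;/code&gt; is a placeholder for your retriever, and the span's own duration captures the latency:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Retrieval span: capture docs, scores, and query, linked to the parent
# trace through the current OTel context. vector_store is a placeholder.
from opentelemetry import trace

tracer = trace.get_tracer("rag-pipeline")

def retrieve(query: str):
    with tracer.start_as_current_span("retrieval") as span:
        span.set_attribute("retrieval.query", query)
        docs = vector_store.search(query, top_k=5)   # placeholder retriever
        span.set_attribute("retrieval.doc_ids", [d.id for d in docs])
        span.set_attribute("retrieval.scores", [d.score for d in docs])
        return docs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;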

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmlaos7mjejjn0gtzdde7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmlaos7mjejjn0gtzdde7.png" alt="Image1" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  LLM Observability for Agentic Systems
&lt;/h2&gt;

&lt;p&gt;Agents are the hardest observability problem in the LLM space.&lt;/p&gt;

&lt;p&gt;In a simple LLM call, you have one input and one output. In an agentic system, a single user request might execute 15 LLM calls across multiple models, trigger 8 tool calls, read from memory, spawn sub-agents, and make decisions at branching points, all before returning a response.&lt;/p&gt;

&lt;p&gt;None of this is visible in standard API logs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why agents are uniquely hard:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Execution path variability:&lt;/strong&gt; The same input can produce completely different execution paths across runs. Slight phrasing changes, ambiguous instructions, or small differences in retrieved memory can produce different tool call sequences.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Depth:&lt;/strong&gt; Multi-hop reasoning chains mean errors can originate deep in the chain, far from where symptoms appear.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Temporal span:&lt;/strong&gt; Agent tasks can run for minutes to hours. The request/response tracing model breaks down.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Silent failure:&lt;/strong&gt; An agent might silently terminate early, return a partial result, or use the wrong tool, without any error being raised.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;According to &lt;a href="https://www.langchain.com/state-of-agent-engineering" rel="noopener noreferrer"&gt;LangChain's State of AI Agents report&lt;/a&gt;, &lt;strong&gt;89% of organizations&lt;/strong&gt; have implemented some form of observability for their AI agents. But &lt;strong&gt;quality issues remain the primary production barrier&lt;/strong&gt;, cited by 32% of organizations, because monitoring that the agent &lt;em&gt;ran&lt;/em&gt; is not the same as monitoring that the agent &lt;em&gt;worked correctly&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Required trace data for agents:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every LLM call: prompt, completion, token usage, latency, cost&lt;/li&gt;
&lt;li&gt;Every tool call: function name, arguments passed, return value, latency&lt;/li&gt;
&lt;li&gt;Every memory read/write&lt;/li&gt;
&lt;li&gt;Every branching decision with the reasoning that led to it&lt;/li&gt;
&lt;li&gt;Session context across multiple turns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The single most important capability:&lt;/strong&gt; Step-level trace reconstruction. For any failed or low-quality agent run, you must be able to reconstruct the exact sequence of decisions the agent made. Not just "the agent returned a bad result", but "at step 7, the agent called the search tool with query X, received result Y, and then decided Z, which led to the failure."&lt;/p&gt;

&lt;p&gt;Without that capability, debugging agentic failures is not debugging. It's guessing.&lt;/p&gt;
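
&lt;p&gt;A sketch of what step-level capture looks like in an agent loop. The decision and tool-dispatch functions are placeholders for your agent framework; the point is the append-only step record:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# One append-only record per agent decision, so any failed run can be
# replayed step by step. decide_next_action / execute_tool / persist
# are placeholders for your agent framework and log sink.
import time

def run_agent(task: str, trace_id: str, max_steps: int = 10):
    steps, state = [], task
    for i in range(max_steps):
        decision = decide_next_action(state)                  # LLM call
        result = execute_tool(decision.tool, decision.args)   # tool call
        steps.append({
            "trace_id": trace_id, "step": i, "ts": time.time(),
            "tool": decision.tool, "args": decision.args,
            "result_summary": str(result)[:200],
            "reasoning": decision.reasoning,                  # why this step
        })
        if decision.tool == "finish":
            break
        state = result
    persist(steps)   # this is what you replay when step 7 went wrong
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;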




&lt;h2&gt;
  
  
  Enterprise LLM Observability Challenges
&lt;/h2&gt;

&lt;p&gt;For enterprises, observability is not just a developer experience concern. It's a compliance, security, and governance concern.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Privacy
&lt;/h3&gt;

&lt;p&gt;LLMs in enterprise settings operate on sensitive data, such as customer records, contracts, source code, financial data, and medical records. Standard observability that logs raw prompts can inadvertently create a regulated data lake nobody intended to build.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The practical implications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PII in prompts:&lt;/strong&gt; User inputs frequently contain names, account numbers, addresses, and medical information. Your observability pipeline needs PII scrubbing middleware before data is written to any storage system.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt content as IP:&lt;/strong&gt; Even without PII, system prompts may encode proprietary business logic, product roadmaps, or trade secrets. Who has access to that data in your observability tool?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data residency:&lt;/strong&gt; For EU-based or regulated-industry teams, GDPR, HIPAA, and financial services regulations may prohibit transmitting prompt data to US-based SaaS platforms.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The default should be: log metadata, not content. Enable full prompt/response logging only where it's been explicitly reviewed and approved.&lt;/p&gt;
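
&lt;p&gt;Where content capture is approved, scrub before storage. A regex-based sketch; these patterns are illustrative, not exhaustive, and production systems should treat a dedicated PII-detection library as the first line of defense:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Regex-based PII scrubbing before anything hits storage.
# Illustrative patterns only -- not an exhaustive PII detector.
import re

PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CARD":  re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def scrub(text: str) -&gt; str:
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(scrub("Reach me at jane@example.com, SSN 123-45-6789."))
# Reach me at [EMAIL], SSN [SSN].
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;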

&lt;p&gt;For teams running under GDPR or strict data residency requirements, Nebius Token Factory stores observability metrics in the EU-North region regardless of where inference runs, and supports Zero Data Retention mode, where requests and outputs are never stored or reused. Both options exist because the right choice depends on your regulatory context.&lt;/p&gt;

&lt;h3&gt;
  
  
  Compliance Requirements
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;EU AI Act:&lt;/strong&gt; Requires transparency, risk documentation, and human oversight for high-risk AI systems. Observability is the technical foundation for compliance; you cannot document risk you cannot measure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GDPR:&lt;/strong&gt; Data protection, right to explanation, right to erasure. Every AI-assisted decision that affects a user must be traceable. "Right to erasure" is particularly complex for LLMs; models don't support selective forgetting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HIPAA:&lt;/strong&gt; Any LLM processing patient information requires audit trails of every interaction, Business Associate Agreements (BAAs) with every vendor in the data path (including your observability tool), and strict access controls.&lt;/p&gt;

&lt;p&gt;This is not theoretical. These requirements apply to production LLM deployments today, and the cost of non-compliance dwarfs the cost of proper observability infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-Tenancy
&lt;/h3&gt;

&lt;p&gt;Enterprises serving multiple internal teams or external customers from shared LLM infrastructure face a set of observability challenges that single-tenant deployments don't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost attribution&lt;/strong&gt; requires per-request metadata tagging (team ID, product, and customer ID) so you can do showback and chargeback across cost centers. Without this, your AI costs are opaque at the organizational level, even if your infrastructure team can see the aggregate bill.&lt;/p&gt;

&lt;p&gt;Nebius Token Factory's per-API-key and per-project filtering makes this practical without additional instrumentation: each consumer's token usage, latency, and error rates are filterable independently.&lt;/p&gt;
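
&lt;p&gt;If you're building showback in-house, the core is just per-request tagging plus aggregation; a sketch with illustrative records:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Showback sketch: tag every request with its cost center, then aggregate.
from collections import defaultdict

usage_log = [  # illustrative records from your request log
    {"team": "search",  "customer": "acme",   "cost_usd": 0.250},
    {"team": "search",  "customer": "globex", "cost_usd": 0.500},
    {"team": "support", "customer": "acme",   "cost_usd": 0.125},
]

cost_by_team = defaultdict(float)
for rec in usage_log:
    cost_by_team[rec["team"]] += rec["cost_usd"]

print(dict(cost_by_team))   # {'search': 0.75, 'support': 0.125}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;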

&lt;p&gt;&lt;strong&gt;Data isolation&lt;/strong&gt; means Tenant A's prompts and conversation history must never appear in Tenant B's context. Your retrieval layer and vector stores must support tenant-scoped filtering, and your observability data access must be role-gated by the same tenant boundaries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Differential SLAs&lt;/strong&gt; require per-tenant performance tracking. If you've committed to different latency or availability targets for different customers or teams, you need to monitor against those targets independently.&lt;/p&gt;

&lt;h3&gt;
  
  
  Non-Determinism at Scale
&lt;/h3&gt;

&lt;p&gt;A production LLM can return a different response to the same prompt across runs. This isn't a bug; it's inherent to how these systems work. But it fundamentally changes what monitoring means.&lt;/p&gt;

&lt;p&gt;You cannot monitor LLM quality as binary pass/fail. You have to monitor distributions, trends, and statistical anomalies. A model behaving "correctly" on 97% of requests is very different from one behaving correctly on 83%, yet neither registers as a hard failure. You need statistical significance thresholds, not simple error-rate alerts.&lt;/p&gt;
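
&lt;p&gt;A sketch of what "statistical, not binary" means in practice: a two-proportion z-test comparing today's eval pass rate against a baseline window. The thresholds are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Two-proportion z-test: has the eval pass rate shifted beyond sampling noise?
import math

def proportion_shift_z(passes_a: int, n_a: int, passes_b: int, n_b: int) -&gt; float:
    p_a, p_b = passes_a / n_a, passes_b / n_b
    pooled = (passes_a + passes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

z = proportion_shift_z(970, 1000, 830, 1000)   # 97% baseline vs. 83% today
if abs(z) &gt; 1.96:   # ~95% confidence, illustrative threshold
    print(f"quality shift is statistically significant (z={z:.1f})")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;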

&lt;p&gt;Model provider updates compound this. When OpenAI or Anthropic updates their base model, your fine-tuned model or prompt template may produce different outputs without any change on your end. Observability is the only way to catch this before users do.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Current Tool Landscape
&lt;/h2&gt;

&lt;p&gt;The LLM observability tool market has matured into three categories. Choosing the wrong one for your situation creates lock-in, privacy exposure, or observability blind spots.&lt;/p&gt;

&lt;h3&gt;
  
  
  Purpose-Built LLM Observability Platforms
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://langfuse.com" rel="noopener noreferrer"&gt;Langfuse&lt;/a&gt;&lt;/strong&gt;: Open source (MIT license), self-hostable, framework-agnostic. Best for teams with data residency requirements or those who want full control without per-request SaaS pricing at scale.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.langchain.com/langsmith" rel="noopener noreferrer"&gt;LangSmith&lt;/a&gt;&lt;/strong&gt;: Commercial, deep LangChain/LangGraph integration. Best for teams already invested in the LangChain ecosystem who want native debugging tools.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.helicone.ai" rel="noopener noreferrer"&gt;Helicone&lt;/a&gt;&lt;/strong&gt;: Proxy-based integration (change one URL, logging starts immediately). Best for fast time-to-value when using vanilla API calls.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://phoenix.arize.com" rel="noopener noreferrer"&gt;Arize Phoenix&lt;/a&gt;&lt;/strong&gt;: Open source, OpenTelemetry-native, strong for LLM and RAG evaluation. Best for teams wanting vendor-neutral instrumentation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.getmaxim.ai" rel="noopener noreferrer"&gt;Maxim AI&lt;/a&gt;&lt;/strong&gt;: Ultra-low latency gateway (&amp;lt;11 microseconds at 5,000 RPS) plus evaluation. Best for teams needing gateway + eval + observability in one platform.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Enterprise APM Platforms Adding LLM Support
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.datadoghq.com/product/llm-observability/" rel="noopener noreferrer"&gt;Datadog LLM Observability&lt;/a&gt;&lt;/strong&gt;: Native OTel GenAI Semantic Convention support (v1.37+) and "AI Guard" for real-time security guardrails. Best for enterprises already standardized on Datadog who want LLM observability in the same pane of glass as infrastructure monitoring.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://grafana.com" rel="noopener noreferrer"&gt;Grafana&lt;/a&gt; + &lt;a href="https://openlit.io" rel="noopener noreferrer"&gt;OpenLIT&lt;/a&gt;&lt;/strong&gt;: For teams already running Prometheus/Grafana, OpenLIT (&lt;code&gt;pip install openlit&lt;/code&gt;, two lines of setup) exports LLM metrics via OTLP to Grafana Cloud. Best for teams who want to stay in their existing observability stack.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.splunk.com" rel="noopener noreferrer"&gt;Splunk&lt;/a&gt;&lt;/strong&gt;: Hallucination detection, drift management, compliance audit trails. Best for enterprise security/compliance teams already standardized on Splunk.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
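
&lt;p&gt;The OpenLIT setup mentioned above really is two lines; the OTLP endpoint below is a placeholder for your own collector:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# OpenLIT auto-instruments common LLM SDKs and exports via OTLP.
import openlit

openlit.init(otlp_endpoint="http://localhost:4318")   # placeholder endpoint
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;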

&lt;h3&gt;
  
  
  Inference Platforms with Built-in Observability
&lt;/h3&gt;

&lt;p&gt;An often-overlooked category: inference platforms that instrument your models at the infrastructure level, without requiring application-layer SDKs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://tokenfactory.nebius.com" rel="noopener noreferrer"&gt;Nebius Token Factory&lt;/a&gt;&lt;/strong&gt; is the clearest example of this approach. TTFT, token throughput, error categorization (4xx/429/5xx separately), active replica counts, and prompt size distributions are all tracked natively   per endpoint, per API key, per region   with near-real-time updates. Metrics export via Prometheus for integration with your existing Grafana dashboards.&lt;/p&gt;

&lt;p&gt;The advantage of this approach: you get inference-layer observability without any instrumentation overhead in your application code. The disadvantage: it doesn't give you the application-layer context (session IDs, user IDs, evaluation scores) that purpose-built platforms provide. The right answer for most teams is both: infrastructure observability at the inference layer and application observability at the SDK layer, unified in one Grafana dashboard.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to Choose
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Criteria&lt;/th&gt;
&lt;th&gt;Recommendation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Data residency required&lt;/td&gt;
&lt;td&gt;Self-hosted: &lt;a href="https://langfuse.com" rel="noopener noreferrer"&gt;Langfuse&lt;/a&gt;, &lt;a href="https://grafana.com" rel="noopener noreferrer"&gt;Grafana&lt;/a&gt; + &lt;a href="https://openlit.io" rel="noopener noreferrer"&gt;OpenLIT&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Already on LangChain&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.langchain.com/langsmith" rel="noopener noreferrer"&gt;LangSmith&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fastest time-to-value&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://www.helicone.ai" rel="noopener noreferrer"&gt;Helicone&lt;/a&gt; (proxy)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Already on Datadog&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.datadoghq.com/product/llm-observability/" rel="noopener noreferrer"&gt;Datadog LLM Observability&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Evaluation-first approach&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://phoenix.arize.com" rel="noopener noreferrer"&gt;Arize Phoenix&lt;/a&gt;, &lt;a href="https://www.braintrust.dev" rel="noopener noreferrer"&gt;Braintrust&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inference-layer visibility&lt;/td&gt;
&lt;td&gt;&lt;a href="https://tokenfactory.nebius.com" rel="noopener noreferrer"&gt;Nebius Token Factory&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Enterprise scale + compliance&lt;/td&gt;
&lt;td&gt;Combination of the above&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;The teams that operate AI reliably are not necessarily the ones with the best models. They are the ones who can see what their models are actually doing in production and respond before their users notice.&lt;/p&gt;

&lt;p&gt;LLM observability is not an ML concern or an infra concern. It is a product quality concern. Every blind spot in your observability stack is a failure mode that will eventually become an incident. The only variable is whether your team finds it first or your users do.&lt;/p&gt;

&lt;p&gt;The four pillars (metrics, traces, logs, and evaluation) are your minimum. Start with the metrics and traces. Add evaluation before your team thinks you need it. Close the loop between observation and improvement.&lt;/p&gt;

&lt;p&gt;Instrument everything. Ship nothing you can't see.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>observability</category>
      <category>programming</category>
    </item>
    <item>
      <title>[Boost]</title>
      <dc:creator>Arindam Majumder </dc:creator>
      <pubDate>Wed, 11 Feb 2026 20:15:31 +0000</pubDate>
      <link>https://dev.to/arindam_1729/-2ae2</link>
      <guid>https://dev.to/arindam_1729/-2ae2</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/copilotkit" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__org__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F7820%2F85b7e418-7abd-4fb5-8be6-69eb48a30e53.gif" alt="CopilotKit" width="320" height="320"&gt;
      &lt;div class="ltag__link__user__pic"&gt;
        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F965723%2Fe0982512-4de1-4154-b3c3-1869d19e9ecc.png" alt="" width="612" height="612"&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/copilotkit/build-a-multi-agent-telecom-support-system-with-copilotkit-langgraph-js-52oc" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;Build a Multi-Agent Telecom Support System with CopilotKit &amp;amp; LangGraph JS&lt;/h2&gt;
      &lt;h3&gt;Arindam Majumder  for CopilotKit ・ Feb 9&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#webdev&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#ai&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#programming&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#opensource&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>webdev</category>
      <category>ai</category>
      <category>programming</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
