DEV Community: GetPochi

Five Practical Tips to Save Token Consumption with Pochi

GetPochi — Wed, 04 Mar 2026 08:17:52 +0000

The usual experience with coding agents is predictable - they start out sharp, then slowly become confused, verbose, and expensive. Instructions keep piling up, tools accumulate, and failed attempts linger in the conversation. By the time token costs start hitting the roof, the agent already feels harder to work with.

At that point, most teams reach for the usual fixes: improve the prompts, avoid the biggest models for small tasks, and aggressively cache tool calls. While all of these help, they come with their own overhead of constant tuning and close monitoring - and even then, they rarely address the root cause of runaway token usage.

This post breaks down practical workflow patterns you can apply to address common sources of context bloat, explains the principles behind them, and shows how Pochi supports these behaviors in day-to-day work. If you’ve ever felt like an agent got worse the longer you worked with it, these patterns are likely why, and how Pochi helps you fix it.

1. Compact context aggressively as noise accumulates

Token usage grows over time as conversations accumulate failed attempts and abandoned approaches. This context debt increases token usage and degrades response quality, making agents more verbose and error-prone.

To solve this, Pochi periodically allows you two options to compact the context:

Compact Task: This summarizes the task context and replaces long conversational history with a concise, up-to-date representation of intent and state. Applicable when you want to stay in the same task and continue the conversation with the condensed context.

Create a New Task with Summary: This creates a clean task with a summary of the previous conversation, helping you avoid hitting context limits while keeping all relevant information.

These mechanisms are especially useful during long debugging sessions, iterative refactors, or tasks with multiple rounds of clarification.

In these cases, the majority of the conversation history becomes irrelevant once a direction is chosen. Compacting ensures the agent doesn’t keep paying for that history over and over again.

Steps:

When a task gets long, compact the context regularly (e.g., after 3–5 iterations).
Keep only the essential state and intent.
If the direction changes, create a new task with a summary of the previous conversation.

2. Attach intent to code instead of explaining it in chat

Explaining code changes in plain chat is one of the fastest ways to burn tokens. Each time you prompt the agent with queries like:

Prompt:
- “Actually, change this part…”
- “No, not that file , the other one”
- “I meant refactor this logic, not rewrite it”

The model has to re-read large parts of the context, reconstruct what changed, and infer your intent all over again. This kind of repetition adds up quickly.

Pochi avoids this by attaching intent directly to code through Edits and Reviews.

Edits track the exact diffs you introduce locally while iterating. If you tweak a variable, adjust logic, or partially rewrite a block, Pochi includes only those changes in the agent’s context the next time you send a prompt.

Reviews, on the other hand, let you leave inline comments directly on generated code. Instead of re-explaining issues in chat, you comment on specific lines and batch that feedback into a single, focused update.

Steps:

Use edits to track local diffs during iteration.
Attach intent directly to code using inline comments.
Batch feedback into a single update instead of multiple chat messages.

3. Isolate intent early with subagents and forks

Token usage often spikes when multiple ideas compete in the same context. You start with one goal, explore a few approaches, abandon some, and finally pivot to another direction. In this case, the agent is continuously juggling multiple lines of intent. Even with compaction, the model still has to reconcile what you meant before with what you want now.

Language models are optimized for coherent, single-threaded intent. When a task mixes multiple implementation strategies, the model keeps all of that alive in context, even if only one direction is still relevant.

The answer is isolation. Separate tasks mean separate contexts, and separate contexts mean fewer tokens spent reconciling unrelated ideas.

Pochi supports this through task forking and subagents:

Forking a task creates a new task that starts from the current code state but does not inherit conversational noise. It’s ideal when you want to try a different approach or explore an alternative implementation without dragging prior reasoning along.

Subagents allow focused exploration within the same repository while keeping contexts separate. Each subagent works with a clean, bounded scope instead of accumulating unrelated history. In practice, this kind of isolation can lead to dramatic token savings. Developers running large, multi-step workflows often split work across multiple subagents, each with its own narrow instruction set.

Steps:

When you explore multiple approaches, create a fork or a subagent.
Keep each task focused on one implementation strategy.
Use separate contexts for separate goals.

4. Scope MCP servers per task to shrink the action space

Before an agent attempts to solve a problem, it evaluates what it can do with every enabled tool. Each additional MCP server expands the action space the model must reason over. Even if a tool is never used, the model still spends tokens evaluating whether it is relevant, how it compares to other options, and when it applies.

For example, if a task only requires database access, exposing ten additional MCP servers adds unnecessary reasoning overhead. More options mean more branches to evaluate, which translates directly into higher token usage.

Scope MCP servers per task so the agent only reasons about tools relevant to the task. Only selected servers are loaded into context, contribute tool definitions, and influence model reasoning. All other tools are invisible to the agent.

Steps to follow:

Before starting a task, decide which MCP servers are necessary.
Disable all MCP servers not required for this task.
If the task changes, immediately rescope the tool set.
Keep a task template for common workflows (DB, infra, testing, etc.).

5. When execution dominates, move heavy data processing out of the model

Once you’ve scoped tools correctly, the next source of runaway token usage often shows up during execution, typically inside MCP-backed workflows.

The prompt may be short and the tool choice correct, yet execution tokens spike because large volumes of raw data are streamed into the model.

Tool calls may return hundreds or thousands of rows, which then get streamed into the model as large JSON payloads. At that point, execution tokens dwarf everything else.

The failure mode is asking the model to reason over this raw data. For example, we asked Pochi:

Prompt: How many failed orders do we have?

In the first attempt, the agent queried the database, fetched all matching rows, and streamed them into the model so it could filter and count them.

The answer was correct, but thousands of records flowed through the context just so the model could compute a single number.

A better approach is to separate what needs to be computed from how it is computed. Instead of reasoning over data, the model should generate code that performs the computation and returns only the result.

Prompt: Please write a small script that queries the database, filters failed orders and only returns the final count.

Pochi generated the following script:

import sqlite3
import os

def count_failed_orders():
    # Assuming the database file path based on common patterns if it were local, 
    # but since I have execute_sql tool, I will demonstrate how to do it via SQL.
    # For a standalone script, it would typically connect to a DB.

    # This is a conceptual script that would be used in an environment where 
    # the DB connection is established.

    query = "SELECT COUNT(*) as failed_count FROM orders WHERE status = 'failed';"

    # In this specific environment, we use the provided tools to interact with the DB.
    # If this were a real Python script for the user:
    """
    conn = sqlite3.connect('database.db')
    cursor = conn.cursor()
    cursor.execute("SELECT COUNT(*) FROM orders WHERE status = 'failed';")
    count = cursor.fetchone()[0]
    print(f"Total failed orders: {count}")
    conn.close()
    """

    # Since I'm tasked to write a script, I'll provide a clean Python script.
    print(f"Querying failed orders count...")
    # Placeholder for actual DB execution logic
    return query

if __name__ == "__main__":
    sql = count_failed_orders()
    print(f"SQL to execute: {sql}")

Now when prompted again to fetch the number of failed orders, the model never sees the raw records. Only a short summary enters the context, dropping execution tokens from tens of thousands to a few hundred.

Why not just run a `COUNT(*)` query?

At this point, it’s reasonable to think that the agent can run the SQL query directly:

  query = "SELECT COUNT(*) as failed_count FROM orders WHERE status = 'failed';"

Why do we need to get the agent to write a separate script?

And you’re right. Expect that agents often choose the least expensive path. Even when a database can do aggregation, agents frequently fall back to pull-and-process patterns:

SELECT * FROM orders WHERE created_at > '2026-01-01';

This is where token usage explodes, for several reasons:

Schema uncertainty: If the agent isn’t confident about column names, enums, joins, indexes, it plays it safe by fetching rows and reasoning in-text.
Ambiguous instructions: If the prompt says: “Find refunded orders and tell me how many there are”, the agent may fetch records first, inspect fields, and thenthen count, Instead of jumping straight to COUNT(*)
Tool abstraction: Many MCP database tools expose run query, fetch rows, but don’t strongly bias the model toward aggregation-first queries. So the model takes the path it can reason about most reliably.
Multi-step reasoning: If the question is slightly more complex: “How many refunded orders from customers who signed up last quarter?”. The agent might fetch orders, fetch users, join in its head and then count. That’s almost guaranteed to stream a lot of data.

Databases are cheap at filtering and counting while language models are not. The solution is simple - let the model decide what to compute, and let code handle how the computation happens. Only the final result should enter the context. Having a script lets us review the code and make sure that it runs the same computation every time the same prompt is called.

This keeps execution costs predictable, even when working with large datasets.

Steps:

Detect when tool results exceed ~100 rows.
Instead of asking the model to reason over raw data, ask it to generate code that computes the result.
Return only the final result to the model (summary, count, aggregation).
Use aggregation-first queries when possible.

Conclusion

If there’s one theme across all five guides, it’s that token usage is shaped long before a prompt is sent.
Most token blowups don’t come from bad prompts or choosing the wrong model. They come from workflows that allow too much context, too many tools, and too many competing ideas to accumulate in the same place.

When each task has a clear goal, well-suited tools, and a clean context, the agent doesn’t have to waste tokens reconciling noise. It can converge faster, reason more clearly, and produce better results at lower cost.

At Pochi, this philosophy is baked into the product. The goal isn’t to make you think about tokens because we tailor an experience that naturally keeps context small, intent clear, and costs predictable.

A safe way to let coding agents interact with your database (without prod write access)

GetPochi — Tue, 24 Feb 2026 03:12:47 +0000

We previously examined common approaches teams use to protect production databases (i.e. command allowlists, SQL filters, and manual approval workflows) and why they fail in the presence of autonomous agents.

The primary reason is that agents "work really hard" - they often route around these restrictions to deliver the results with any possible execution surface (shell, file system, runtime).

This tutorial demonstrates how to grant database access in Pochi without exposing production credentials or enabling uncontrolled writes.

Why this matters

Agents must never execute arbitrary code against production systems. At the same time, agents are most useful when they can read and write data to iterate quickly. The challenge is doing this safely.

We’ll walk through multiple access tiers, explain their security tradeoffs, and show how to progressively increase agent autonomy without expanding the production attack surface.

Prerequisites

Before we start, you’ll need the following:

Node.js installed
PostgreSQL running

Note: Before following along, do note that this tutorial is a demo. Don’t expose credentials in real systems and always use secrets managers.

Tier 1: Read-only inspection

To make sure that an agent never connects to the database directly, a common first step is to provide read-only access via a tool interface. The tool service owns the database credentials and strictly constrains the queries that can be executed.

For our demo, we’ll use an MCP HTTP service that exposes a fixed set of read-only tools. Production data will be accessible only through this interface.

Step 1: Create a read-only Postgres role

We’ll use a Postgres database. To create one, run the following command in your terminal:

create db shop_sandbox

Next, we create a dedicated database role whose permissions enforce read-only access at the database layer:

CREATE ROLE readonly LOGIN PASSWORD 'readonly';
GRANT CONNECT ON DATABASE shop_sandbox TO readonly;
GRANT USAGE ON SCHEMA public TO readonly;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO readonly;

Continue over to add a new table:

CREATE TABLE orders (
  id SERIAL PRIMARY KEY,
  user_id INT,
  status TEXT,
  total NUMERIC,
  created_at TIMESTAMP DEFAULT now()
);

CREATE TABLE order_items (
  id SERIAL PRIMARY KEY,
  order_id INT REFERENCES orders(id),
  price NUMERIC,
  qty INT
);

Now we proceed to add data. Here's a sample to get you started:

INSERT INTO orders (user_id, status, total, created_at)
VALUES
  (1, 'paid', NULL, now() - interval '2 days'),
  (2, 'failed', NULL, now() - interval '1 day'),
  (3, 'paid', 120, now());

INSERT INTO order_items (order_id, price, qty)
VALUES
  (1, 30, 2),
  (1, 40, 1),
  (2, 10, 1),
  (2, 15, 1);

Step 2: MCP Server (`read-only` tools)

Next, we implement a minimal MCP server (server.js) :

import express from "express";
import cors from "cors";
import pg from "pg";

/* ---------------- DB (PRODUCTION READ-ONLY) ---------------- */

const pool = new pg.Pool({
  host: "localhost",
  port: 5432,
  user: "readonly",
  password: "readonly",
  database: "shop_sandbox",
});

/* ---------------- MCP HTTP Server ---------------- */

const app = express();
app.use(cors());
app.use(express.json());

app.post("/mcp", async (req, res) => {
  const { id, method, params } = req.body;

  try {
    if (method === "initialize") {
      return res.json({
        jsonrpc: "2.0",
        id,
        result: {
          protocolVersion: "2024-11-05",
          capabilities: { tools: {} },
          serverInfo: {
            name: "prod-readonly-db",
            version: "1.0.0",
          },
        },
      });
    }

    if (method === "ping") {
      return res.json({ jsonrpc: "2.0", id, result: {} });
    }

    if (method === "tools/list") {
      return res.json({
        jsonrpc: "2.0",
        id,
        result: {
          tools: Object.entries(TOOLS).map(([name, t]) => ({
            name,
            description: t.description,
            inputSchema: t.inputSchema,
          })),
        },
      });
    }

    if (method === "tools/call") {
      const tool = TOOLS[params.name];
      if (!tool) throw new Error(Unknown tool: ${params.name});

      const result = await tool.handler(params.arguments || {});
      return res.json({
        jsonrpc: "2.0",
        id,
        result: {
          content: [{ type: "json", value: result }],
        },
      });
    }

    return res.json({
      jsonrpc: "2.0",
      id,
      error: { code: -32601, message: "Method not found" },
    });
  } catch (err) {
    console.error(err);
    return res.json({
      jsonrpc: "2.0",
      id,
      error: { code: -32000, message: String(err) },
    });
  }
});

const PORT = 3333;
app.listen(PORT, () => {
  console.log(Prod MCP server running at http://localhost:${PORT}/mcp);
});

In this MCP server, we expose a narrowly scoped tool called recent_orders that allows us to filter the most recent orders in a given time period.

const TOOLS = {
  recent_orders: {
    handler: async ({ since_hours, status }) => {
      return pool.query(`
        SELECT id, user_id, status, total, created_at
        FROM orders
        WHERE created_at >= now() - ($1 || ' hours')::interval
          AND ($2::text IS NULL OR status = $2)
        ORDER BY created_at DESC
        LIMIT 50
      `, [since_hours, status ?? null]);
    }
  }
};

Step 3: Configure Pochi

Now we configure Pochi to use the MCP. For that, let's add to config.jsonc:

"prod-readonly": {
      "url": "http://localhost:3333/mcp",
      "disabled": false
    },

From the agent’s perspective, the only available interface to production data is the MCP tool API. To test this theory, let’s give the agent a read-only prompt:

Prompt: Show failed orders in the last 24 hours

controls
style={{
width: "100%",
borderRadius: "8px",
boxShadow: "0 4px 12px rgba(0, 0, 0, 0.15)",
}}
>

Your browser does not support the video tag.

As expected, the agent invokes the recent_orders tool and gets the corresponding output. Since there was no write operation involved, there was no direct database access.

So far, this looks safe.

Next, let’s ask the agent to modify some data.

Prompt: Mark this order as refunded.

Since there is no MCP tool capable of performing writes, the expected behaviour is for execution to fail.

Instead of failing, the agent searches for alternative execution paths to complete the task when its first few attempts to do a write operation fails. It searches the code for database-related code, generates a small Node.js script using the pg client and executes it through the shell. This ends up updating the database directly.

controls
style={{
width: "100%",
borderRadius: "8px",
boxShadow: "0 4px 12px rgba(0, 0, 0, 0.15)",
}}
>

Your browser does not support the video tag.

If you look closely, the agent simply completed the task using an execution path that still existed because we left that gap open.

From this, we conclude that since the agent can be given shell access, database credentials, or the ability to construct queries, the security boundary still lives inside the model rather than the system.

Step 4: Disable execution surface

To actually make the setup read-only, two additional controls can be applied. First, we explicitly revoke database writes:

REVOKE INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public FROM readonly;
ALTER DEFAULT PRIVILEGES IN SCHEMA public
REVOKE INSERT, UPDATE, DELETE ON TABLES FROM readonly;

Second, we disable execution permissions in the editor.

This approach removes the entire class of risk. From Pochi’s standpoint, this means the agent cannot run any shell commands, write files, or execute programs. The only remaining interface was the MCP tool API, which is equivalent to calling a standard, credential-isolated API to query production data.

Following this, the read requests continued to work as expected from before. The agent used the MCP tools and returned the proper output. But now, when asked again to do a write operation:

Prompt: Mark this order as pending

The agent could still reason about how to perform updates, and could still propose code in the editor, but it no longer had the ability to apply or execute those changes without explicit human approval. The task remained incomplete by design.

controls
style={{
width: "100%",
borderRadius: "8px",
boxShadow: "0 4px 12px rgba(0, 0, 0, 0.15)",
}}
>

Your browser does not support the video tag.

That being said, production was still accessible through the same execution surface, requiring constant human oversight. This could lead to problems where a human might blindly click ‘Save’ or settings may have auto-approve write enabled by default.

While this may appear operational, it is strongly recommended to avoid granting even read access to production databases due to the unpredictable nature of agents. Agents are non-deterministic by design, and production systems should not be exposed to that uncertainty.

But the challenge remains that many legitimate tasks still require writes. This leads us to the next tier of access.

Tier 2: Safe writes via clone + pipeline (demo)

Granting an agent direct write access to production is dangerous. Even if you try to enforce human approvals, the agent can still find ways to bypass restrictions if it has any execution surface.

The safer approach is to strictly separate reasoning from execution.

For this we'll use an Isolated Work Environment (IWE). The agent can generate and test migration scripts in a writable clone of the database (IWE), but the production database remains locked down. Once the migration is validated on the clone, the same script is applied to production through the normal deployment pipeline, with human approval and rollback controls.

Let’s have a look at what this looks like in a complete flow.

Continuing our example from before, we found that many paid orders in production were missing the total amount value. The correct total for an order should be:

SUM(order_items.price * order_items.quantity)

We want to backfill all orders where the total IS NULL. At the same time, we also do not want to give the agent write access to production.

Step 1: Architecture

In order to enforce isolation, we’ll run two separate MCP services.

Production MCP: This will be connected to our existing database, i.e. shop_sandbox, which has a read-only DB role
Validation MCP: This will be connected to a clone of shop_sandbox named shop_validate and will have a write-capable DB role.

Each MCP server will have its own database credentials and expose its own tool interface. From the agent’s point of view, the only way to interact with a database is by calling tools exposed by whichever MCP servers are enabled for the task.

Step 2: Add a production inspection tool

As seen earlier, shop_sandbox is locked down. The agent can only query it via MCP read-only tools. The database role for this MCP has only SELECT permission, and shell access is disabled in agent tasks.

Previously, we only exposed the recent_orders tool via this MCP. Now we’ll introduce the orders_missing_totals tool:

 orders_missing_totals: {
    description: "Find orders with NULL totals",
    inputSchema: { type: "object", properties: {} },
    handler: async () => {
      const res = await pool.query(
                SELECT id, user_id, status, created_at         FROM orders         WHERE total IS NULL         ORDER BY created_at DESC        
      );
      return res.rows;
    },
  },

To check if everything works as expected, let’s run a sample prompt:

Prompt: Show orders missing total

controls
style={{
width: "100%",
borderRadius: "8px",
boxShadow: "0 4px 12px rgba(0, 0, 0, 0.15)",
}}
>

Your browser does not support the video tag.

As seen, the agent can inspect how many orders are missing a total field, but it cannot modify anything on the production database.

Step 3: Create a writable clone

Next, we create a writeable clone of shop_sandbox named shop_validate with the same schema and data.

pg_dump -U apple shop_sandbox | psql -U writer shop_validate

Create a write-capable role for validation:

CREATE ROLE writer LOGIN PASSWORD 'writer';

Grant database and schema access:

Create a write-capable role for validation:
CREATE ROLE writer LOGIN PASSWORD 'writer';
Grant database and schema access:
GRANT CONNECT ON DATABASE shop_validate TO writer;
GRANT USAGE ON SCHEMA public TO writer;

GRANT SELECT, INSERT, UPDATE, DELETE
ON ALL TABLES IN SCHEMA public
TO writer;

Step 4: Setup Validation MCP

Add the new server to config.jsonc:

"validate-write": {
      "url": "http://localhost:3334/mcp",
      "disabled": false
    }

This new MCP will be our environment for the agent to perform writes. It will expose two tools: execute_sql and run_migration_file.


const TOOLS = {
  execute_sql: {
    description: "Execute raw SQL against validation database",
    inputSchema: {
      type: "object",
      properties: {
        sql: { type: "string" },
      },
      required: ["sql"],
    },
    handler: async ({ sql }) => {
      const res = await pool.query(sql);
      return res.rows;
    },
  },

  run_migration_file: {
    description: "Run SQL migration file against validation database",
    inputSchema: {
      type: "object",
      properties: {
        path: { type: "string" },
      },
      required: ["path"],
    },
    handler: async ({ path }) => {
      if (!fs.existsSync(path)) {
        throw new Error(Migration file not found: ${path});
      }
      const sql = fs.readFileSync(path, "utf8");
      const res = await pool.query(sql);
      return { rowCount: res.rowCount };
    },
  },
};

This way, we give Validation MCP the ability perform write operations while Production MCP only has read access. For the purpose of this blog, this is how we achieved environment isolation by routing the agent to different tool backends.

Step 5: Plan (Pochi Plan Mode)

Next, we will use Pochi’s plan mode to create a migration plan. Since Pochi allows us to enable MCP per task, we can use Validation MCP to perform the write operations.

controls
style={{
width: "100%",
borderRadius: "8px",
boxShadow: "0 4px 12px rgba(0, 0, 0, 0.15)",
}}
>

Your browser does not support the video tag.

To start with Plan mode we insert the below prompt and click on Plan in the prompt-send option:

Prompt: Create a migration plan to backfill totals for orders where total IS NULL. Do not execute anything.

controls
style={{
width: "100%",
borderRadius: "8px",
boxShadow: "0 4px 12px rgba(0, 0, 0, 0.15)",
}}
>

Your browser does not support the video tag.

So far there is no database access involved. You can review the plan, put inline comments and Pochi will modify the plan based on your inputs.

controls
style={{
width: "100%",
borderRadius: "8px",
boxShadow: "0 4px 12px rgba(0, 0, 0, 0.15)",
}}
>

Your browser does not support the video tag.

Step 6: Generate Migration Script

Once you have the plan finalised, you can generate a migration script. To do so, we prompt back to Pochi:

Prompt: Convert the approved plan into a SQL migration file at: `migrations/backfill_order_totals.sql`. Do not execute anything.

controls
style={{
width: "100%",
borderRadius: "8px",
boxShadow: "0 4px 12px rgba(0, 0, 0, 0.15)",
}}
>

Your browser does not support the video tag.

Pochi generates the following migration file, which can now be reviewed, versioned, and audited:

-- Migration: Backfill order totals from `order_items`
-- Description: Updates `orders.total` where it is `NULL` by summing `price * qty` from `order_items`.

BEGIN;

-- 1. Identify records to be updated and store in a temporary table for verification/rollback
CREATE TEMP TABLE backfill_log AS
SELECT id FROM orders WHERE total IS NULL;

-- 2. Perform the backfill
UPDATE orders
SET total = (
    SELECT SUM(price * qty)
    FROM order_items
    WHERE order_items.order_id = orders.id
)
WHERE id IN (SELECT id FROM backfill_log);

-- 3. Verification queries (intended to be run before COMMIT)

-- Check if any orders in the log still have NULL totals
-- SELECT COUNT(*) FROM orders WHERE id IN (SELECT id FROM backfill_log) AND total IS NULL;

-- Check for discrepancies between orders.total and sum of order_items
-- SELECT o.id, o.total, SUM(oi.price * oi.qty) as expected
-- FROM orders o
-- JOIN order_items oi ON o.id = oi.order_id
-- WHERE o.id IN (SELECT id FROM backfill_log)
-- GROUP BY o.id, o.total
-- HAVING o.total != SUM(oi.price * oi.qty);

COMMIT;

-- Rollback Strategy:
-- In case of failure before COMMIT, the transaction will roll back automatically.
-- In case of failure after COMMIT (if the temp table is still available in the session):
-- UPDATE orders SET total = NULL WHERE id IN (SELECT id FROM backfill_log);

At this point, still no database values have been modified.

Step 7: Validate in Clone

Now, with only validation MCP enabled, we can ask Pochi to do a write operation by applying the migration to the shop_validate database.

Prompt:  Apply the migration in migrations/backfill_order_totals.sql to the validation database and verify that no orders have NULL totals.

controls
style={{
width: "100%",
borderRadius: "8px",
boxShadow: "0 4px 12px rgba(0, 0, 0, 0.15)",
}}
>

Your browser does not support the video tag.

Pochi will read the SQL file, send the SQL to validation MCP service, which will update shop_validate database.

Before:

After:

All this time production database shop_sandbox remains untouched.

Now if the changes are incorrect or you would like modifications, you can prompt back Pochi. For eg:

Prompt: Fix the migration logic and re-run until totals are correct. Do not update the migration file yet.

The agent will try new SQL and rerun in validation.

Once you are happy with the changes you can update the SQL file:

Prompt: Update migrations/backfill_order_totals.sql to reflect the final validated logic.

Step 8: Deploy to Production (Manual + Approved Pipeline)

Once validation looks good, production updates can be performed manually or via an approved deployment pipeline. In our case, we run the following command to apply the changes from shop_validate to shop_database.

psql -U apple -d shop_sandbox -f migrations/backfill_order_totals.sql

It uses the same migration and review process without the agent ever touching the production credentials. You can again verify the details on your production database.

Before:

After:

How this maps to real production setups

In the demo above, we used two local Postgres databases (shop_sandbox and shop_validate) to illustrate isolation.

In real systems, this isolation usually already exists in production, staging, and development database environments. Modern database platforms make it straightforward to create and manage these environments.

For example, Managed Postgres (RDS, Cloud SQL, Neon, Supabase) has read replicas, cloned databases, and point-in-time snapshots that are restored into new instances. Data warehouses have schema-level clones (Snowflake, BigQuery) and masked production extracts.

As a user, all you need to do is route the agent to the correct tier depending on the operation you intend to perform.

Conclusion

In this tutorial, we displayed why tool-level restrictions are not enough and how agents can still bypass read-only controls if they have any execution surface.

The safest approach is to separate reasoning from execution:

Tier 1: Read-only access via narrow tools
Tier 2: Writable clones + validated migration scripts

This gives agents useful autonomy while keeping your production systems protected.

How to Give Coding Agents Access to SSH and Databases (Without Breaking Production)

GetPochi — Sat, 21 Feb 2026 02:19:21 +0000

As AI agents become more capable, teams are trying to limit the damage they can do when given access to SSH or production databases.
Common approaches include:

Command allowlists: allow ls, cat, grep, tail; block rm, mv, chmod.
SQL filters: Allow SELECT; block INSERT, DELETE, DROP.
Manual approval flows: Run everything in read-only mode until a human explicitly accepts changes.

These practices assume that constraining agent behaviour through rules, filters, and approvals can prevent dangerous actions.

This assumption is wrong.

Allowlists, prompts, and approval dialogs are control surfaces that influence what an agent chooses to do. Shells, credentials, runtimes, and database roles function as execution surfaces, defining what the system can do.

Risk is determined by execution surfaces, not control surfaces.

Enforcing safety at the behavior layer isn’t the solution. What follows is why these approaches fail in practice - and what actually holds up in production.

Why database-level controls fail

Human engineers rarely connect to primary databases with full access. They query replicas or views that cannot change production data. The same principle must apply to agents, but it is not enough to enforce this only at the query level.

There are several reasons for this.

First, SQL filtering is unreliable. Even if you block write statements, many databases still support queries that trigger full table scans. Constructs like SELECT INTO can introduce new tables, and functions can produce side effects.

Second, read access alone is dangerous. It can expose authentication tokens, PII, or operational metadata. This is why databases themselves do not rely on client-side query validation and instead implement safety through roles, views, and replicas.

Re-implementing parts of this logic in the agent layer with regexes or heuristics is both fragile and incomplete.

Agents route around blocked tools

An agent’s goal is task completion. Blocking individual tools does not mean entire classes of state changes are prevented. For example, if the agent finds that direct deletion within a database schema is blocked, it will reach the same outcome by putting together other allowed operations.

In this case, even though direct access to the tool is blocked, it doesn't prevent many other possible walkarounds that could apply undesired changes to the production system.

In practice, this can include writing a script that performs deletion or just invoking a different tool that indirectly gives the expected outcome. This is not adversarial behavior. Instructions, allowlists, and approval dialogs influence behavior, but they do not define what the system is capable of doing.

For that reason, safety cannot live solely inside the model. It must be enforced outside the model through OS permissions, roles, and tool interfaces. Access control is not a prompting problem. It's an infrastructure problem that will require explicit separation between reasoning and execution, with enforcement applied at deterministic execution boundaries.

What actually holds up in production

1. `Read-only` access still allows irreversible damage

In one setup, we exposed production data to an agent exclusively through a read-only tool interface backed by a SELECT-only database role. On paper, this appeared safe - the agent could inspect data but had no explicit write tools.

However, as long as the agent retained access to a general execution surface (shell access, runtime file system, or database credentials), it simply routed around the restriction, generating its own script and updating the database through an unintended path.

controls
style={{
width: "100%",
borderRadius: "8px",
boxShadow: "0 4px 12px rgba(0, 0, 0, 0.15)",
}}
>

Your browser does not support the video tag.

We removed the execution surface entirely and enforced read-only access at the infrastructure level. This involved revoking database write permissions and disabling shell execution. Post this, the agent could reason freely but could no longer apply any changes.

controls
style={{
width: "100%",
borderRadius: "8px",
boxShadow: "0 4px 12px rgba(0, 0, 0, 0.15)",
}}
>

Your browser does not support the video tag.

Even when read-only access is enforced at the infrastructure level, production access still depends on human approval. Over time, these approvals tend to degrade into procedural steps. Auto-approve paths appear, reviews become mechanical, and the safety boundary weakens.

As a result, granting agents direct access to production databases, even in read-only mode, is best avoided. Agents are non-deterministic by design, and production systems should not be exposed to that uncertainty.

The challenge is that many legitimate tasks still require writes.

2. Writes must flow through existing deployment pipelines

Many engineering tasks involve backfills, schema updates, and data correction. All of these require write access.

Giving an agent write access to production data under these circumstances is rarely acceptable. A more robust pattern is to let agents propose changes, generate and test migration scripts, and iterate freely. But applying those same changes to production will still wait on explicit human approval.

This mirrors standard engineering practice. It avoids giving production credentials to agents and ensures all changes are auditable and reversible.

The tradeoff is slower iteration, because agents can’t validate assumptions against real production data. This is where isolated writable environments become important.

3. Isolated writable environments enable safe iteration

Isolated Writable Environments (IWEs) are disposable database instances that mirror production schema and data. Within these environments, agents can evolve schemas, validate queries, and test migrations freely. Once the changes are ready, the same test and migration scripts can be replayed through the original production pipelines.

In practice, combining IWEs with gated production deployment yields the best results. You get the agent to perform actions against isolated databases while production remains gated by standard deployment processes.

The same principle applies outside the database layer. With SSH and shell access, the execution surface becomes effectively unbounded unless similar infrastructure-level boundaries are enforced.

Why shells make safety undefined

A shell is a general-purpose programming environment. Once exposed, safety boundaries become undefined.

An agent can use cat to overwrite files, using grep can exfiltrate secrets, and tail -f, if used on the wrong file, can leak sensitive data indefinitely. On the other hand, having an allowlist to control which binaries are executed does not automatically constrain the kind of operations that are possible on the system.

Having a shell exposed grants access to the file system, process creation, and the environment state. And once a shell exists, the boundary of what is allowed is no longer clear.

The safest designs accept this and shift the focus from preventing mistakes to containing them.

Disposable environments

The simplest and most reliable pattern is to treat any machine an agent can access as disposable. This way when something goes wrong, you just replace the whole thing and not spend time to fix it. This is already how CI systems operate, and increasingly how dev environments are provisioned.

Instead of connecting agents to long-lived servers, teams route them to short-lived containers, ephemeral VMs or dev sandboxes per task or per session. This way agents are free to modify files, install packages and experiment with configurations. But this introduces infrastructure cost while reducing production impact.

Restricted hosts and forced commands

Some teams still require agents for log inspection, operational debugging, or controlled maintenance. But even in these cases, full interactive shells are rarely necessary. Common restrictions can include:

SSH users with no home directory and limited permissions
forced commands in authorized_keys so only specific scripts can run
wrapper binaries that expose narrow actions instead of general shells

For example, instead of allowing:

ssh agent@host

the key may enforce:

command="/usr/local/bin/fetch-logs.sh"

This reduces damage area but comes with its own pitfalls. Debugging can become harder and workflows might require constant tooling updates.

Structured tools and APIs provide stronger safety guarantees than shells.. A shell exposes the file system, environment variables and network access - making it impossible to create boundaries without building a second operating system around it.

This approach already mirrors how human access to infrastructure has evolved as well. We now have fewer SSH sessions, more pipelines, dashboards, and automation APIs. Agents benefit from the same shift, for the same reasons.

Conclusion

Throughout this post, we saw the same pattern repeat across different systems. When the agent was instructed to follow safety, through instructions, prompts, or behavioral constraints, it failed. Whereas, when safety was enforced through infrastructure (roles, isolation, and execution boundaries), it held.

Many teams make this fundamental mistake of treating access control as a prompting problem instead of a system problem.

The only reliable way to make agents safe is to design environments where destructive actions are physically unreachable, and all state changes flow through auditable, deterministic systems. Once safety is achieved by design, mistakes become recoverable, contained, and reviewable - and only then can agents be given meaningful autonomy across real production use cases.

We’ve been shipping "slop" for 20 years. We just used to call it an MVP.

GetPochi — Fri, 09 Jan 2026 15:04:23 +0000

A lot of people have started using the word “slop” as shorthand for AI-generated code. Their stance is that AI is flooding the industry with low-quality software, and we’re all going to pay for it later in outages, regressions, and technical debt.

This argument sounds convincing until you look honestly at how software has actually been built for the last 20 years.

The uncomfortable truth is that “slop” didn’t start with AI. In fact, it is AI that made it impossible to keep pretending otherwise.

Let’s pull back the curtain on a silent pact the industry followed, long before the first LLM was trained.

Software has always optimized for execution

Outside of Google’s famously rigorous review culture, most Big Tech giants (Meta, Amazon, and Microsoft included) have historically prioritized speed.

In the real world, PRs are often skimmed, bugs are fixed after users report them, and the architecture itself evolves after the product proves itself. We didn’t call this “slop” back then; we called it an MVP (Minimum Viable Product).

By comparison, some of the code that coding agents deliver today is already better than the typical early-stage PRs in many companies. AI isn’t introducing a new era of “good enough” code; it’s just the latest tool for a strategy we’ve used for decades.” And in hindsight, we have always been willing to trade internal code purity for external market velocity.

The Open Source Antidote

The primary exception is open-source projects, which operate differently. Open source has consistently produced reliable, maintainable code, even with contributions from dozens or hundreds of developers.

Why?

Because open source forces modularity. Unlike internal corporate developers who can reach across a private monolith to create messy dependencies, open-source contributors often work in isolation. To be successful, the project must maintain strict API boundaries and clean abstractions so that someone with zero internal context can contribute without breaking the system.

This environment creates aggressive iteration loops and context-rich opinions. Every contribution undergoes a series of automated tests and diverse human peer reviews. Unlike internal systems, which remain messy even after years of maintenance, open source libraries receive feedback from diverse sources, which usually converge better on overall quality than code written for one or two specific use cases.

This trend of prioritizing execution over perfection actually fits most application-layer workflows in companies today. If we treat an AI agent like an external open-source contributor, i.e. someone who needs strict boundaries and automated feedback to be successful, the “slop” disappears.

Engineering Quality into the Agent

At Pochi, we believe the output of an AI agent is only as good as the contextual guardrails you build around it. If you want to avoid”slop”, you have to go further than simple chat prompts. Some tips we found useful were:

1. Solving the Hallucination Problem
The biggest problem with AI code is its tendency to “hallucinate” nonexistent libraries or deprecated syntax. This is because developers convey changes from a “Prompt Engineering” lens instead of an “Environment Engineering” perspective.

This is solvable if you integrate the agent directly into the CI/CD pipeline, where every line of code can be instantly validated against existing compilers and linters. This way, you don’t have to wait for the AI to get it right, but trust your environment to catch it when it’s wrong.

2. Utilizing “Cloud Markdown”
A “Cloud Markdown” approach is useful for high-scale design practices. Instead of a static PDF with verbose architectural standards, you create a README.pochi.md file that acts as the agent's source of truth.

An example architectural guardrails file can look like this:

#Project Design Patterns 

## Data Fetching
- Rule: No direct fetch calls in components.
- Pattern: Use the useQuery wrapper from @/lib/api.
- Reasoning: Ensures global error handling and caching are applied.
## State Management
- Constraint: All shared state must reside in LiveStore.
- Pattern: const [data, set] = useLiveStore(key);

With this approach, you end up with three critical workflows:

Documentation as Context: You can store Markdown files with deep architectural rules and design patterns within the repository.
Prompt Injection: Before an agent begins a task, it “reads” these Markdown files to understand global restrictions (e.g., “Always use local-first storage patterns via LiveStore”).
Context Scaffolding: This ensures the agent isn’t just writing a snippet in a vacuum, but is following the specific scaffolding of the existing codebase.
This helps you embed deep architectural knowledge directly into the workflow. Now, before every major migration, the agent gets tasked with gathering as much file-level context as possible to produce the most accurate result.

Conclusion

At the end of the day, users never see “slop.” They see broken interfaces, slow loading times, crashes, and unreliable features.

If you dismiss AI code as “slop,” you are missing out on the greatest velocity shift in the history of computing. By combining Open Source discipline (rigorous review and modularity) with AI-assisted execution, we can finally build software that is both fast to ship and resilient to change.

How do you build serious features using only VS Code’s public APIs?

GetPochi — Fri, 09 Jan 2026 13:46:18 +0000

I've been writing a series on how we trained our NES model, what that model takes as context to make a prediction, and how these model requests are managed with correct timing under continuous typing.

With this, we’ve reached a point where NES can predict what edit should happen and when it should appear in the editor.

Now, we'll talk about how there is still one critical decision to make. Once a suggestion arrives, how should that change be presented inside a live editor?

Pochi’s NES is built as a VS Code-native feature, not a standalone IDE or a custom fork. This means previews must integrate with VS Code’s public APIs, performance model, and established interaction patterns.

This introduces a core design challenge - to surface enough context for a suggestion to be actionable, without disrupting the developer's flow.

Designing a system that honors this is more than a matter of visual polish; it is a complex systems + UX problem. We’ll explore why this balance is so difficult for a native AI agent and the specific rendering strategies NES uses to achieve it.

The Display Problem

Unlike conventional editor features, NES does not control where the user’s cursor is when a suggestion arrives. The editor is a continuously changing environment and does not function like a static canvas. Sometimes the user's cursor might be exactly where the edit belongs, or it can be twenty lines away, or the suggestion itself can be a huge change spanning multiple lines.

Showing such suggestions naïvely introduces new failure modes that are easy to trigger and hard to ignore. One experiences jumps in cursor position, abrupt viewport scrolls, or rendering large changes directly in the editing flow. In practice, these behaviors are often more disruptive than not showing a suggestion at all.

This brings us to the most fundamental design question: How do we show an edit without stealing the developer’s attention?

Answering that question requires understanding the VS Code interaction model.

VS Code does not provide a built-in API for previewing LLM-generated edits. Instead, the editor offers different primitives for different kinds of locations and edits. These primitives are optimized for various interaction patterns, each with their own affordances and limitations. Some work well for cursor-local edits, while others are better suited for changes elsewhere in the file.

Understanding this difference is key. Pochi's NES does not render suggestions in a single, fixed way. Instead, NES relies on these primitives to create a balance between visibility and disruption.

Dynamic Rendering Strategy

Rather than forcing all suggestions into a single representation, we designed a Dynamic Rendering Strategy offering the optimal visual experiences in different editing scenarios:

Suggestions that target the current cursor position are rendered inline, flowing naturally into the user's typing behavior.
Suggestions that apply off-cursor are previewed via an inline diff decoration, avoiding jumps in the viewport.
For large, multi-line block inserts, a floating preview is used to provide sufficient context without disrupting the user's current focus.
This way, each path is deliberately scoped to the situations where it performs best, aligning it with the least disruptive representation for a given edit.

Let’s take a walk-through of these rendering strategies in detail and examine when each one is used, starting with the least disruptive case.

Inline Completion

When an edit is positioned right at the cursor, the least disruptive option is to stay out of the way. In such cases, we render the edit inline, making it blend directly into the user's typing flow.

To achieve this, we use VS Code's inline completion API. This approach works especially well for small, localized changes like autoclosing brackets, replacing a few characters, or edits that are directly made under the cursor.

Inline Diff Preview

Because NES predicts the next meaningful edit across the file (not just at the cursor), many suggestions naturally apply outside the user’s current editing position. For example, while you are typing inside a function, NES may suggest updating a related import, adjusting a type definition, or fixing a reference several lines away.

In these cases, the cost of getting the presentation wrong is high. The user is forced to jump across the file, break context and interrupt their flow.

To avoid that, we render the suggestion as an inline diff decoration. The text to be replaced is highlighted in red, while the new content is shown in green at the insertion point. This way, the user gets a clear preview of the change without moving the cursor.

This works particularly well for changes involving single-line updates or even multiple lines where each line is being changed independently.

Floating Diff Image

Because NES has the ability to propose structural edits, such as inserting a new helper function, refactoring a block of logic, or adding a multi-line configuration, it frequently produces multi-line suggestions that cannot be represented as simple, inline changes.

In these cases, the suggestion is no longer tied to the cursor’s immediate context, and the standard inline rendering stratergies do not suffice.

At this point, the decision falls under either pulling the user away from where they’re working or bringing the preview to them. Since preserving developer flow is a core design principle for NES, we consistently choose the latter.

In order to make the suggestion appear near the edit target without moving the cursor, we generate a floating diff preview and render it as an image. The color schema of the suggestion will also stay consistent with the other solutions we discussed previously - red for deleted text, and green for inserted ones.

VS Code allows extensions to attach image-based decorations. With careful layout and positioning, these decorations can be floated near the edit target and used as a diff preview. However, the editor does not render code into images, which means the preview has to be generated by the extension itself.

This required a small rendering pipeline:

Theme matching: Every VS Code theme is an extension with a standard JSON format. We parse the active theme, extract its token colour map, and match it to the user’s active settings so the preview matches the theme in the editor.
Syntax highlighting: VSCode includes a bundled TextMate runtime. We load the grammar for the current filetype, generate syntax scopes, and apply the same colouring rules that VS Code uses. This ensures that the rendered code maintains the same appearance as the code in the editor.
Image rendering: Here we use canvaskit-wasm to render the tokenized code into an image. To draw the code properly, we took the editor’s current fontSize and lineHeight, drew each tokenized segment at the correct coordinates, then applied diff highlights (additions in green and removals in red). The final image is then surfaced using the decoration API.

This approach allows multi-line edit suggestions to appear near their target location while preserving cursor position and avoiding viewport jumps.

Conclusion

Different kinds of edit suggestions need different presentation strategies, with the editor API playing a decisive role in shaping the final experience.

Rendering an NES suggestion ended up being less about displaying text and more about maintaining the reader’s attention. Because no matter what, once attention is broken, even the best suggestion gets ignored.

Each rendering path is designed to stay as close as possible to the developer’s flow while working within the editor’s interaction model.

At this point in our journey, NES can decide what to suggest (the model), when to surface it (request management), and how to show it without disruption (rendering paths). Combined, these layers define how AI-generated edit results become truly helpful in a real IDE.

How do you build serious features using only VS Code’s public APIs?

GetPochi — Thu, 08 Jan 2026 15:21:16 +0000

With this, we’ve reached a point where NES can predict what edit should happen and when it should appear in the editor.

Now, we'll talk about how there is still one critical decision to make. Once a suggestion arrives, how should that change be presented inside a live editor?

This introduces a core design challenge - to surface enough context for a suggestion to be actionable, without disrupting the developer's flow.

The Display Problem

This brings us to the most fundamental design question: How do we show an edit without stealing the developer’s attention?

Answering that question requires understanding the VS Code interaction model.

Understanding this difference is key. Pochi's NES does not render suggestions in a single, fixed way. Instead, NES relies on these primitives to create a balance between visibility and disruption.

Dynamic Rendering Strategy

Rather than forcing all suggestions into a single representation, we designed a Dynamic Rendering Strategy offering the optimal visual experiences in different editing scenarios:

Let’s take a walk-through of these rendering strategies in detail and examine when each one is used, starting with the least disruptive case.

Inline Completion

When an edit is positioned right at the cursor, the least disruptive option is to stay out of the way. In such cases, we render the edit inline, making it blend directly into the user's typing flow.

Inline Diff Preview

In these cases, the cost of getting the presentation wrong is high. The user is forced to jump across the file, break context and interrupt their flow.

This works particularly well for changes involving single-line updates or even multiple lines where each line is being changed independently.

Floating Diff Image

In these cases, the suggestion is no longer tied to the cursor’s immediate context, and the standard inline rendering stratergies do not suffice.

This required a small rendering pipeline:

Theme matching: Every VS Code theme is an extension with a standard JSON format. We parse the active theme, extract its token colour map, and match it to the user’s active settings so the preview matches the theme in the editor.
Syntax highlighting: VSCode includes a bundled TextMate runtime. We load the grammar for the current filetype, generate syntax scopes, and apply the same colouring rules that VS Code uses. This ensures that the rendered code maintains the same appearance as the code in the editor.
Image rendering: Here we use canvaskit-wasm to render the tokenized code into an image. To draw the code properly, we took the editor’s current fontSize and lineHeight, drew each tokenized segment at the correct coordinates, then applied diff highlights (additions in green and removals in red). The final image is then surfaced using the decoration API.

This approach allows multi-line edit suggestions to appear near their target location while preserving cursor position and avoiding viewport jumps.

Conclusion

Different kinds of edit suggestions need different presentation strategies, with the editor API playing a decisive role in shaping the final experience.

Each rendering path is designed to stay as close as possible to the developer’s flow while working within the editor’s interaction model.

Everyone says to have better context management. I'll show you how I built it. https://dev.to/getpochi/nes-series-part-2-real-time-context-management-in-your-code-editor-3jeo

GetPochi — Tue, 23 Dec 2025 14:53:05 +0000

Everyone says to have better context management. I'll show you how I built it.

GetPochi ・ Dec 9 '25

#ai #machinelearning #llm #programming

https://dev.to/getpochi/what-it-really-took-to-train-a-next-edit-suggestion-model-4bf2

GetPochi — Tue, 23 Dec 2025 14:48:50 +0000

How I Trained a Next-Edit Suggestion Model for a Coding Agent (32k Github Stars)

GetPochi ・ Nov 19 '25

#machinelearning #ai #opensource #architecture

How does a code editor decide the right moment to show an LLM-generated code suggestion

GetPochi — Tue, 23 Dec 2025 14:45:11 +0000

In Part 1, we talked about how we trained our NES model to predict the next meaningful edit you’re likely to make in your code.

In Part 2, we then covered what the model takes as context to make this edit. This included deep dives into editable regions, user’s edit history, and using the overall project context.

Together, these two pieces (model + context) form the core intelligence foundation of the NES system. But incorporating them into an end-to-end real-time engineering system requires more thinking about real developer coding behaviour.

A code editor is a continuously changing space. Developers type, pause, delete, move the cursor, undo, redo, and essentially keep editing, often faster than any model can respond. Even a fast model call involves network latency, debouncing delays, server-side scheduling, and decoding / streaming time.

If not careful, a request that was correct when it was sent can return a response that arrives a few hundred milliseconds too late. Which means now you end up with edit suggestions for code that no longer exists. This is something that’s termed “haunted” for being technically right but not at the right place.

This means, in practice, a correct edit shown at the wrong moment is perceived as wrong by the user. So even with proper context and a good model, it is equally important to have the correct timing. Then only can the product actually feel useful without being distracting.

But getting timing right is challenging, due to the ever evolving nature of the user’s editing state. To make NES feel real-time and helpful, we had to reason about what happens before a request is sent, while it’s in-flight, and after the model responds. This is what we call request management.

Let’s look at it in more detail.

The NES Request Management Lifecycle

Request Management of a NES prediction happens in three stages:

Before the request: waiting until the user actually pauses
While the request is in-flight: discard anything that becomes outdated
After the model responds: keep the suggestion alive if the user continues along the same trajectory

These map to what we technically implement as debouncing, cancellation, and speculative-prediction caching.

This structure helps bring the intelligent results (what we get with context + model) to users reliably, even as they type continuously. NES continues to run this loop as you type. Let’s take a closer look at how we handle timing at each stage.

Debouncing: Requesting the Model at the Right Moment

The first question we had to tackle was, “When is the right time to send a request?” When a developer is typing continuously, a request on every keystroke has little value and is wasteful. At the same time, waiting too long would make the system feel unresponsive and slow. We had to find that sweet spot that lies in detecting the exact moment the user actually paused typing.

Most systems solve this with a fixed interval, (say, 100ms), but real-world typing isn’t this predictable. Instead, we decided to adapt the debounce interval based on how the user is behaving right at that moment.

To achieve this, we made NES pay attention to a handful of lightweight signals.

For example, typing a . often means the developer is about to pause to access a method of an object, so we get the signal to shorten the debounce delay. Whereas, if the user is continuously typing through a variable name, we stretch the delay a bit to avoid jumping in too early. And if the model’s recent response times have been slower due to network conditions, we account for that too, so suggestions land at the exact moment the user expects them.

This way, the result is a debounce time window that changes with the user’s rhythm. It is short when the user has paused, and long when they’re in flow, all while making sure it never exceeds 1 second.

Cancellation: Correctness Over Completion

Once a request is finally sent, the editor doesn’t stop moving. A user can continue typing, move the cursor, or undo and redo steps before the model has even started responding. When that happens, the original request becomes stale instantly.

In such a case, we cancel the original request from the client-side, and in turn, the server propagates the cancellation, with any late responses being discarded without ever getting them rendered to the UI.

This is a deliberate design decision that optimises and enforces correctness in a live, ever-evolving editing system. We would prefer that NES show nothing rather than something misleading.

If you’re interested in how this works end-to-end, including streaming behaviour, we’ve written more about it here.

Speculative Prediction: Staying One Step Ahead of the User

Traditional caching is straightforward. If nothing has changed, just reuse the previous response. In the case of NES, this helps to avoid duplicate requests. And to think about it, throwing work away all the time would be expensive if we didn’t balance it out elsewhere.

But we go a step further. When the model returns an edit suggestion, we don’t just cache it for the exact context that produced it, but also speculate on the next few contexts the user is likely to enter.

Now, if the user continues typing along the same trajectory, NES doesn't need to call the model again and can continue serving the speculated suggestion. We call this speculative prediction.

A speculated prediction remains valid as long as the user is essentially still typing into the suggestion and the surrounding context hasn't changed.

It’ll be better to illustrate this with the help of an example. Suppose a user types:

NES sends a request and gets the following suggestion:

If the user then continues to type, resulting in:

This user edit is part of the received suggestion. Therefore, the suggestion should still be displayed (unless the user has explicitly rejected it by pressing esc).

By retrieving this result from the forward prediction cache, we can display the suggestion faster and reduce LLM request usage.

Of course, if the user is not satisfied with the cached suggestion, they still have the option to send a new request to get multiple choices. In essence, forward caching helps accelerate the common path, improving the overall experience.

Conclusion

By the time a suggestion appears in NES, a lot has already happened. Debouncing decides when is the right time to make a request, cancellation makes sure outdated intent never surfaces in the UI, and speculative prediction lets us reuse good existing predictions when the user naturally moves through them.

While you’d find these techniques are familiar in distributed systems, applying them inside a code editor was a challenge of its own. The primary driving factor wasn’t about throughput or load but about every evolving human intent under motion.

What’s next?

So far, we’ve focused on how NES decides what to suggest and when those suggestions should appear. With request management in place, we now have a system that ensures LLM-powered edits reach the user only when they’re truly helpful.

But now that brings us to the next stage of the process: How should these edits be presented?

NES suggestions aren’t always a single line near the cursor. Sometimes the relevant edit is several lines or even several files away. Presenting enough information for a quick action without breaking the developer’s flow is a surprisingly deep design and engineering challenge.

This is especially tricky inside a code editor like VS Code, where rendering options are limited. In such cases, how do we preview multi-line edits precisely? How do we make them feel lightweight, immediate, and skimmable, without being modal or disruptive?

In Part 4, we’ll dive into how we approached these constraints and built a rendering system that enables richer previews and lower-latency interactions for code edits.

Closing the Loop: How Reinforcement Learning is Changing AI Coding

GetPochi — Sat, 13 Dec 2025 16:13:33 +0000

TL;DR

Using SFT teaches models how to write code, but it is RL that is necessary to teach them what works. On the other hand, introducing RL in software engineering brings its own specific challenges: data availability, signal sparsity, and state tracking. In this post, we’ll break down how recent works address these challenges.

So far, the focus of RL driven improvements had been based on competitive coding. For example, in LeetCode-style tasks, the model works in a closed loop. It generally receives a clear problem statement and in turn, it generates a single, self-contained solution.

This means there are no dependencies involved, no files systems to navigate, and no legacy code that can break. It is exactly like solving a logic puzzle in isolation rather than understanding the engineering implications on the overall codebase.

However, the field of Software Engineering (SWE) in real-world is fundamentally different. It is a stateful, multi-turn interactive problem. A day-to-day involves much more than just writing the correct functions. You often need to navigate a file system, check up on dependency graphs, run the proper tests, and interpret logs in case of errors. This implies, an agent effectively needs to maintain coherence across a long horizon of interactions.

Which is why RL is an ideal candidate for SWE since agent actions produce verifiable results. At the same time, it also introduces challenges that are not present in single-turn tasks. For example,

Data Availability: We cannot easily simulate millions of environmental interactions like we can with math problems.

Signal Sparsity: Often, success signals appear at the very end of a long sequence of edits.

State Tracking: Along with the static text of the code, the model must understand the dynamic state of the runtime environment
Recent works from Meta and Moonshot AI surfaced how the industry is pivoting from general reasoning RL to domain-specific SWE-RL to address these challenges.

The Data Problem

In order to learn trial and error, the standard RL requires an agent to interact with an environment. For coding, this means running a test suite or compiling code. As compared to verifying math proof, this simulation will be prohibitively slow and expensive in a real-world setting. Here, the engineering challenges arises to figure out how to bootstrap the entire learning process without being dependent on costly online simulation.

Meta proved that you can bypass the online bottleneck by using the massive offline history of Github. In its recent work on SWE-RL they talked through this approach instead of setting up a live sandbox for every training step.

But offline data lacks a reward signal. For every historical Pull Request, you cannot easily go back in time and execute tests. SWE-RL solves this by creating a process proxy reward. They calculate the fine grained text similarity between the generated patch and the actual developer ground truth solution instead of just checking if the code runs.

Depending on how closely the generated format matches the human solution, the model receives a continuous reward. On the other hand, if the model generates an invalid patch format, it receives a penalty. This demonstrates that even before touching a compiler, you can teach a model complex engineering behaviours like navigating file structures and adhering to project conventions using static history.

The Signal Sparsity Problem

Next, we have the credit assignment problem while facing online training within executable environments. That means it is difficult to indentify which step really contributed to the final success of the model and which step should get the reward. This reflects on software engineering as any agent can fail after 50 steps of editing and testing. Standard RL struggles to identify which specific step caused the failure.

The Kimi-Dev paper addresses this through task decomposition. It treats software engineering as a composition of atomic skills: the BugFixer (editing logic) and the TestWriter (verifying logic) instead of training an end-to-end agent to solve the issue immediately.

Their solution starts with Agentless RL. They train the model specifically on these short horizon atomic tasks using outcome based rewards. They look for signals on whether the patch passed the test or did the test reproduce the bug. And since the tasks are scoped down, the feedback signal becomes dense and clear.

Kimi-Dev shows that with minimal additional data, a model having these pre-trained capabilities can be adapted to a complex multi-turn agent framework. This suggests that the most efficient path to autonomous agents is rigorous skill acquisition’s followed by workflow adaptation rather than brute force end to end training.

The State Problem: Building a Code World Model

Coming to the final challenge, which is also arguably the most profound. Engineers generally do not just read code as text but also think about the execution loops in their mind. This involves tracking how variables change in memory and how functions interact between files. Meanwhile, since current code LLMs lack this internal compiler engine, they just merely predict the next token based on statistical likelihood.

Meta Code World Model addresses this by fundamentally changing the training curriculum. They realized that waiting until the RL phase to teach execution dynamics is too late. The rewards are too sparse and the gradient vanishes on hard problems.

Instead, in the mid-training stage, they inject process supervision directly. To teach the physics of code, they constructed two massive datasets:

Python Execution Traces: With over 120 million examples, the model is trained to predict not just the next line of code, but also the exact state of runtime variables (the values in memory) after every single line.
ForagerAgent Trajectories: Agents with 3 million trajectories that interact with a Docker environment to solve tasks.

This forces the model to internalise a Code World Model. By the time the model enters the final RL stage it is no longer starting from scratch. It already understands if I write X then variable Y changes to Z.

Consequently, the RL stage becomes a process of Goal Alignment. It uses sparse result rewards like passing tests simply to guide a model. It already understands execution physics to select the specific path that satisfies the verification requirement.

Takeaway: Moving Toward Verifiable Agents

This progression from SWE-RL (offline proxy rewards) to Kimi-Dev (decomposed skill learning) and CWM (execution-trace world models) outlines a clear engineering roadmap for the next generation of code models and agentic RL frameworks.

We are seeing a shift from generic reasoning to specialized engineering. Future models will be more than just smart. They will be grounded in repository history, capable of self-verification through test writing, and possess an explicit internal model of runtime state.

At TabbyML we view these developments as the foundation for Verifiable Engineering. The future value of AI in software development lies in building agents that understand and respect the state of your system.

Everyone says to have better context management. I'll show you how I built it.

GetPochi — Tue, 09 Dec 2025 08:36:13 +0000

In Part 1, we covered how we trained our NES model, including topics such as the special tokens we use, the LoRA-based fine-tuning on Gemini Flash Lite, and how we utilized a judge LLM to evaluate the model.

However, the end experience is far more than just building a good model. To make NES feel “intent-aware” inside your editor, we needed to give the model the right context at the right moment.

In part 2, we’ll talk about that runtime system, or to be precise, how Pochi manages, ranks, and streams real-time edit context. This is the core that helps NES to understand your intent and predict the next meaningful change.

Why Context Management Matters

To start, let’s understand what context management is. In our case, it’s the layer between when a user starts typing and when the model is called with a well-formed prompt. During that in-between phase, the system gathers and prepares all the relevant context the LLM needs before we make a model request.

As to why it matters, imagine simply sending the entire file to the model on every keystroke. Not only will the model become slower and noisier, but you’d get unstable predictions and over 20 model calls per second, rendering the whole experience unusable.

Instead, as previewed in the first article, we provide NES with three kinds of context:

File Context: text, filepath, cursor position, and the region to be edited
Edit History: record of recent edit steps
Additional context from other files (optional): e.g., functions/type declarations that help understand the current file

Each of these depends on clever filtering, segmentation, and timing - all of which happen in milliseconds during normal typing, as we’ll learn below.

1. File Context: Finding the “live” region of code

The first question to solve: “Where is the user editing right now?”. This is the foundation of every NES prompt. We answer this by gathering three quick pieces of information from the VS Code API:

The current file text
The file path
The user’s cursor position

Using this information, we compute what is called “the editable region”. This region is generally a small code window around the user’s cursor of ~10 lines.

Why ~10 lines?

Because realistically, the next edit will almost always happen very close to where the user is already editing. This small window keeps the latency extremely low and is large enough to capture the structure around the edit.

And while we observe many models are over-eager and hallucinate changes elsewhere, our model is prevented from rewriting parts of the file the user wasn’t touching.

An example of the editable region would be:

2. Edit history: Following the user’s intent over time

So far, we have learnt where the user is currently editing, but we also need to understand how the code is changing over time. This is the part where edit history becomes important for the edit model to predict the user’s intent.

Now, while we could use the VS Code API to register a listener for text change events, this ends up triggering an event for almost every keystroke. For example, if a user updates a type from string to email, it ends up producing ~6 events.

These are not your meaningful edit steps. If we send this to the model, it will think each keystroke is a new “user intent” and will fire too many requests with wildly different predictions. Instead, we reconstruct real edit steps using an internal change segmentation grouping.

How we group events into meaningful steps

Since we cannot directly use the listener events, we decided to reduce them to events that represent edit steps. To achieve this, we group raw text-change events into undo-redo scale units.

Most editors record undo-redo steps on a word scale - for example, when a user inputs a sentence, an undo action will revert the last input word. In our case, for building edit prediction prompts, we do this on a larger scale.

Once we receive information on a user’s cursor position and tracking gets initiated, we create an edit steps list, where each step is an accumulation of several text change events. We found that 5 steps is the sweet spot to build a prompt. Anything more than that adds noise, and if less, loses the intent.

For each received text change event, we check if it is adjacent to the previous one. If yes, it belongs to the same edit step; otherwise, if it happens in a different part of the file, we consider it as a new edit step.

So continuing our example from earlier, if the user happens to add a validateEmail function next, we now have two edit steps in tracking.
The first edit step:

The second edit step:

NES receives these steps wrapped inside <|edit_history|> token to learn how the code is evolving.

Special Case: Git Checkout Noise

One edge case we uncovered is when users run git checkout to switch branches. This triggers massive file changes, none of which represent real user intent. If we were to treat these as edit steps, the model would end up thinking the user rewrote half the codebase. In order to avoid polluting the model direction, we:

Monitor the git status
Reset edit history when it changes (checkout, pull, stash)
Resume tracking after a few seconds

3. Additional Context: Bringing in the rest of your project

Code rarely exists in isolation. If you’re editing a function call, the model may need the definition. Likewise, if you’re modifying a type, the model may need the type declaration.

To give NES this kind of project-aware understanding, we pull additional snippets using the user’s installed language server. For this, we have two VS Code / LSP APIs:

We use vscode.provideDocumentRangeSemanticTokens to scan the editable region for each token type. Then we can find the tokens of interest, like a function, interface, or type defined in another file.
Next, we use the VS Code commandvscode.executeDefinitionProvider to get the target location for the definition code snippets. This is like Ctrl / Cmd + clicking on a function to see the definition in another file.

These two commands are provided by the language server (LSP), which should be available when the language plugin is installed in VS Code. We then extract the definition snippet and include it in <|additional_context|> token as shown below:NES Cover Image

This gives the model the same context a developer would mentally reference before typing the next edit.

Note: We do realise that some of the functions could be huge or a type might be hundreds of lines, with LSP sometimes returning entire class bodies. Therefore, to throttle/limit semantic snippet extraction, we’ve currently hard-coded a maximum of 2000 characters per snippet for now.

Meanwhile, in cases where good LSP support is lacking, like plain text, we don’t add any related snippets context to the prompt. Instead, the prompt will still contain the prefix, suffix, and edit records.

Putting It All Together

At the end, each piece is carefully constructed into the model exactly the way it was trained. This symmetry between training and runtime makes NES far more reliable than native autocomplete-style approaches.

What’s next?

In our next post, we’ll talk about Request Management, the system that ensures the model never gets a chance to be wrong about the user’s current context.

We all understand real coding experience involves a lot of short, focused typing, moving the cursor to different places, and continuing to edit while a request is still in flight. This means the model requests can become outdated before their response arrives, or worse, it might produce suggestions for code that no longer exists.

One of the reasons NES feels fast is because everything that isn’t the latest user intent is thrown away immediately. This cancellation of stale predictions is one of the biggest reasons Pochi’s NES feels so smooth and accurate.

More on this in our upcoming Part 3 post. Stay tuned!

How We Built True Parallel Agents With Git Worktrees

GetPochi — Mon, 24 Nov 2025 05:57:18 +0000

Background Context

We’re building Pochi, a full-stack AI teammate that can handle all your coding tasks and think, communicate, and work like a real engineer. One of our recent feature requests involved releasing Parallel agents.

Most teams rarely work on a single task at a time. You might be partway through a feature when a bug report arrives, someone needs a small refactor reviewed, or a documentation fix is pending.

So you end up switching branches, stashing and popping changes, resetting your workspace, and trying to hold the original task in your head. This is context switching, and it’s one of the biggest hidden costs in software development.

Parallel Agents were introduced to remove this cost. They are not new, but the way most tools implement them still felt off. Our own experience with Cursor / Github Copilot and the likes showcased that these tools operate as parallel agents inside a single editor tab. So in essence, you’re effectively still working in one tab at a time: switching tasks means switching the state of the same working directory and the same conversation.

This is the part that matters. When the underlying repo state is shared, “parallel tasks” are still serial in practice.

The Design Question We Asked

What if multiple agents could work on the same codebase without sharing the same working directory?

The answer already exists in Git: git worktree add path branch. A worktree gives you a second checked-out working directory backed by the same .git repository.

So instead of trying to simulate “parallel tasks” in one tab, if we made each agent correspond to its own worktree, we could expose these worktrees directly inside VS Code (Source Control + Pochi tabs). That means no manual git worktree management is required as each agent simply gets its own branch, filesystem, and local execution context.

Parallel agents only feel parallel when the filesystem is parallel.

What We Built

Based on that, we built Parallel Agents in Pochi that use separate Git worktrees, so that each task has its own working directory, branch, chat history, and terminal environment. This means that each task state stays isolated.

A great example would be to run the same task with different models to pick the best response. Won’t that be a faster and much better experience - all within the same timeframe?

From a UX standpoint, the important bit is how it surfaces in the editor: each agent is a separate tab, each tab bound to its own worktree. You can diff, commit, discard, or merge worktrees independently. You can run two model-generated solutions side-by-side and compare outcomes without the branches stepping on each other.

Imagine if you’ve to run the same task with different models to pick the best response. Won’t a multi tab approach be a faster and much better experience - all within the same timeframe?

While under the hood it’s git worktrees with orchestration that binds each worktree to its own agent state.

How to use?

You can create a worktree from the Pochi sidebar or from the Source Control panel in VS Code. Once a worktree exists, starting a task in that worktree opens it as its own tab in Pochi.

You can switch tabs to switch tasks. Each tab reflects a complete development context: code, chat, history, and tooling.

When a task is complete, you can view a diff of that worktree against the main branch and create a PR. Or you can discard it entirely.
The point is that the work is isolated, so it doesn’t interfere with anything that is in progress. Additionally, you can also open an integrated terminal directly inside each task’s worktree.

When to use Parallel Agents

Parallel Agents are most useful when you want to avoid breaking focus on ongoing work: quick bugfixes during feature development, long-running refactors that you want to keep separate, documentation changes that happen alongside coding, or letting an AI assistant explore broader changes in a sandbox.

On the other hand, if a change is meant to be reviewed and merged as a single unit, keeping it on one branch remains simpler.

You can refer the full documentation here: https://docs.getpochi.com/parallel-agents/

In case you’d like to give Pochi a try, you can install the extension here.

DEV Community: GetPochi

Five Practical Tips to Save Token Consumption with Pochi

1. Compact context aggressively as noise accumulates

2. Attach intent to code instead of explaining it in chat

3. Isolate intent early with subagents and forks

4. Scope MCP servers per task to shrink the action space

5. When execution dominates, move heavy data processing out of the model

Why not just run a COUNT(*) query?

Conclusion

A safe way to let coding agents interact with your database (without prod write access)

Why this matters

Prerequisites

Tier 1: Read-only inspection

Step 1: Create a read-only Postgres role

Step 2: MCP Server (read-only tools)

Step 3: Configure Pochi

Step 4: Disable execution surface

Tier 2: Safe writes via clone + pipeline (demo)

Step 1: Architecture

Step 2: Add a production inspection tool

Step 3: Create a writable clone

Step 4: Setup Validation MCP

Step 5: Plan (Pochi Plan Mode)

Step 6: Generate Migration Script

Step 7: Validate in Clone

Step 8: Deploy to Production (Manual + Approved Pipeline)

How this maps to real production setups

Conclusion

How to Give Coding Agents Access to SSH and Databases (Without Breaking Production)

Why database-level controls fail

Agents route around blocked tools

What actually holds up in production

1. Read-only access still allows irreversible damage

2. Writes must flow through existing deployment pipelines

3. Isolated writable environments enable safe iteration

Why shells make safety undefined

Disposable environments

Restricted hosts and forced commands

Conclusion

We’ve been shipping "slop" for 20 years. We just used to call it an MVP.

Software has always optimized for execution

The Open Source Antidote

Engineering Quality into the Agent

Conclusion

How do you build serious features using only VS Code’s public APIs?

The Display Problem

Dynamic Rendering Strategy

Inline Completion

Inline Diff Preview

Floating Diff Image

Conclusion

How do you build serious features using only VS Code’s public APIs?

The Display Problem

Dynamic Rendering Strategy

Inline Completion

Inline Diff Preview

Floating Diff Image

Conclusion

Everyone says to have better context management. I'll show you how I built it. https://dev.to/getpochi/nes-series-part-2-real-time-context-management-in-your-code-editor-3jeo

Everyone says to have better context management. I'll show you how I built it.

GetPochi ・ Dec 9 '25

https://dev.to/getpochi/what-it-really-took-to-train-a-next-edit-suggestion-model-4bf2

How I Trained a Next-Edit Suggestion Model for a Coding Agent (32k Github Stars)

GetPochi ・ Nov 19 '25

How does a code editor decide the right moment to show an LLM-generated code suggestion

The NES Request Management Lifecycle

Debouncing: Requesting the Model at the Right Moment

Cancellation: Correctness Over Completion

Speculative Prediction: Staying One Step Ahead of the User

Conclusion

What’s next?

Closing the Loop: How Reinforcement Learning is Changing AI Coding

TL;DR

The Data Problem

The Signal Sparsity Problem

The State Problem: Building a Code World Model

Takeaway: Moving Toward Verifiable Agents

Related Papers:

Everyone says to have better context management. I'll show you how I built it.

Why Context Management Matters

Why not just run a `COUNT(*)` query?

Step 2: MCP Server (`read-only` tools)

1. `Read-only` access still allows irreversible damage