우병수

Posted on Jun 29 • Originally published at techdigestor.com

Code Mode + MCP: Wiring Your Local LLM Into a Real Development Workflow

#productivity #tools #career #discuss

TL;DR: The hallucination that wastes the most time isn't a wrong algorithm; it's a wrong tool call.

📖 Reading time: ~24 min

What's in this article

The Problem: Your AI Coding Assistant Lives in a Walled Garden
How MCP Fits Into a Code Mode Agent Loop
Setting Up an MCP Server for Your Dev Environment
Routing Between Local Ollama Models and Cloud APIs Inside Code Mode
Wiring MCP Into an Automated Dev Pipeline (n8n + Webhooks)
Three Non-Obvious Behaviors That Will Cost You Time
When This Setup Is Not the Right Call

The Problem: Your AI Coding Assistant Lives in a Walled Garden

The hallucination that wastes the most time isn't a wrong algorithm — it's a wrong tool call. Your LLM confidently writes prisma db pull --schema=./prisma/schema.prisma against a database that's actually on a non-standard port behind a tunnel, or generates a Docker Compose service block referencing an image tag that hasn't existed since you migrated registries. The model knows your code because you pasted it. It has no idea what's actually running.

This is the structural gap: the model's context window is a snapshot, not a live connection. You copy in a schema file, a config block, maybe some logs — and the moment any of those change, the model is reasoning against stale state. It can't ask your Postgres instance what columns actually exist. It can't check whether your localhost:3000 health endpoint is up. It can't read the compiled output in /dist to verify its own edits worked. So it fills the gaps with confident guesses, and those guesses are wrong in proportion to how far your actual stack drifts from "typical project".

Model Context Protocol fixes this at the architecture level rather than the workflow level. Instead of you manually copying context into a chat window, MCP defines a typed tool-calling surface — the model can call read_file, run_command, query_database, or any custom tool you expose, and get structured responses back. The key word is typed: tool inputs and outputs are defined schemas, so the model knows exactly what arguments are valid and what shape the response will have. That eliminates an entire class of hallucinated API calls where the model invents parameters that don't exist in your actual SDK.

Code Mode in tools like Cursor, Cline, and Continue tightens this further because they run an agentic loop — plan a change, write the edit, verify it compiled or the test passed, then loop. That verify step is where everything falls apart without reliable tool access. If the model can't actually run tsc --noEmit and read stderr, it either skips verification entirely (optimistically assuming success) or asks you to run it and paste back the output, which defeats the loop. MCP gives that verify step a real execution surface. The plan→edit→verify cycle becomes autonomous instead of human-relayed. For a broader map of where this fits, see our guide on AI Coding Tools in 2026: Cloud Copilots vs Local Models.

The practical consequence is that walled-garden assistants optimize for demo quality, not operational accuracy. They look impressive on greenfield TypeScript with a clean schema. They fall apart on the actual project: the monorepo with four package managers, the legacy MySQL 5.7 table with implicit nulls everywhere, the build pipeline that requires three env vars that aren't in any file the model has seen. MCP doesn't make the model smarter — it makes the model's knowledge current, specific, and verifiable against the system that's actually running.

How MCP Fits Into a Code Mode Agent Loop

The surprising thing about MCP isn't the protocol itself — it's how little the model actually needs to understand about it. The model sees a flat list of typed tool signatures, picks one, emits a JSON call, and gets a result back. That's it. The JSON-RPC plumbing — server discovery, capability negotiation, result framing — happens entirely in the client layer. Concretely: your editor plugin (the MCP client) maintains a list of running MCP servers; each server advertises its tools as typed schemas; when the model decides to call read_file or run_tests, it's doing the same thing it does when calling any function-calling API. The model never speaks directly to the MCP server. The client brokers everything.
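To make the client-brokered flow concrete, here is a minimal sketch of the translation step: the model emits a function-style call, and the client wraps it in a JSON-RPC 2.0 `tools/call` request for the server. The `ModelToolCall` shape and the `toJsonRpcRequest` helper are illustrative names I made up for this sketch, not part of any SDK, and real clients also handle responses, errors, and capability negotiation.

```typescript
// Shape of what the model emits (illustrative; real clients vary)
interface ModelToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

// JSON-RPC 2.0 request as MCP uses it for tool invocation
interface JsonRpcRequest {
  jsonrpc: "2.0";
  id: number;
  method: string;
  params: { name: string; arguments: Record<string, unknown> };
}

let nextId = 1;

// Hypothetical helper: the client, not the model, builds the wire message
function toJsonRpcRequest(call: ModelToolCall): JsonRpcRequest {
  return {
    jsonrpc: "2.0",
    id: nextId++,
    method: "tools/call",
    params: { name: call.name, arguments: call.arguments },
  };
}

const request = toJsonRpcRequest({ name: "read_file", arguments: { path: "src/index.ts" } });
console.log(JSON.stringify(request));
```

The model only ever produces the first shape; everything in the second is the client's job.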

Code Mode is where MCP actually earns its keep, and the distinction from Chat Mode matters operationally. Chat Mode is stateless turn-by-turn: the model answers, forgets, moves on. Code Mode maintains a task context — open files, recent edits, tool call history, failure state — and will re-invoke a tool if the previous attempt returned an error or produced unexpected output. That re-invocation behavior is the sharp edge. Your MCP servers must be stateless and idempotent, because the agent will retry them without knowing what side effects the first call left behind. A shell tool that runs npm install twice should produce the same result as running it once. A file-write tool that gets called twice with the same payload shouldn't corrupt the file. If your server holds mutable state between calls, Code Mode will eventually find the broken edge.
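One way to get the "called twice, same result" property for a file-write tool is to write the full desired content and treat an identical payload as a no-op. The sketch below assumes a whole-file write tool; `writeFileIdempotent` is a hypothetical handler name, and it uses a temp directory so it can run anywhere.

```typescript
import { existsSync, mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical tool handler: writes the full desired content rather than appending,
// so calling it twice with the same payload leaves the file in the same state.
function writeFileIdempotent(path: string, content: string): "written" | "unchanged" {
  if (existsSync(path) && readFileSync(path, "utf8") === content) {
    return "unchanged"; // retry of an already-applied call: no side effect
  }
  writeFileSync(path, content, "utf8");
  return "written";
}

const dir = mkdtempSync(join(tmpdir(), "mcp-idem-"));
const target = join(dir, "example.txt");
const first = writeFileIdempotent(target, "hello\n");
const second = writeFileIdempotent(target, "hello\n"); // simulated agent retry
console.log(first, second);
```

Append-style or patch-style tools don't get this for free, which is why they need an explicit guard.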

Local models are where this architecture develops cracks. Tool-call reliability (meaning the model emits well-formed JSON that correctly matches the tool's argument schema) degrades noticeably on small models quantized below Q5. A 7B model under load will produce malformed JSON arguments, miss required fields, or hallucinate argument names that don't exist in the schema. This isn't a fringe case; it happens regularly in agentic loops where the context window fills with prior tool results. On my 32GB VRAM workstation, Q4_K_M at 32B is the practical floor for running a reliable Code Mode loop. Below that, you're spending more time writing retry logic and output validators than you're saving by running locally. For anything business-critical, the fallback to a cloud model isn't a failure; it's the right call.
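If you do run a smaller model, the output validator mentioned above can be quite small. This is a rough sketch that catches the three failure modes listed (malformed JSON, missing required fields, invented argument names). The `ToolSchema` shape is a simplified stand-in I'm assuming for illustration; a real server would validate against full JSON Schema or Zod.

```typescript
// Simplified schema shape (an assumption for this sketch, not the MCP wire format)
interface ToolSchema {
  required: string[];
  properties: Record<string, "string" | "number" | "boolean">;
}

type Validation =
  | { ok: true; args: Record<string, unknown> }
  | { ok: false; error: string };

// Hypothetical validator: checks raw model output before it reaches the MCP server
function validateToolArgs(raw: string, schema: ToolSchema): Validation {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { ok: false, error: "malformed JSON" };
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return { ok: false, error: "arguments must be a JSON object" };
  }
  const args = parsed as Record<string, unknown>;
  for (const key of schema.required) {
    if (!(key in args)) return { ok: false, error: `missing required field: ${key}` };
  }
  for (const [key, value] of Object.entries(args)) {
    const expected = schema.properties[key];
    if (expected === undefined) return { ok: false, error: `unknown argument: ${key}` };
    if (typeof value !== expected) return { ok: false, error: `wrong type for ${key}` };
  }
  return { ok: true, args };
}

const readFileSchema: ToolSchema = { required: ["path"], properties: { path: "string" } };
const good = validateToolArgs('{"path":"src/index.ts"}', readFileSchema);
const truncated = validateToolArgs('{"path":"src/index', readFileSchema);
const invented = validateToolArgs('{"path":"a.ts","encoding":"utf8"}', readFileSchema);
console.log(good.ok, truncated.ok, invented.ok);
```

Feeding the `error` string back to the model as the tool result usually gives it enough to correct itself on the next attempt.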

Three MCP server types cover the majority of real development workflows:

Filesystem servers — read and write project files, list directory trees, watch for changes. The agent uses these to inspect source, patch files, and verify its own edits. The main gotcha: scope the allowed paths tightly in the server config, or the agent will happily traverse your entire home directory looking for context.
Shell/process servers — run test suites, linters, build commands, and capture stdout/stderr. These are the highest-value tools in a Code Mode loop because they close the feedback cycle: write code, run tests, read failures, patch, repeat. Keep timeouts aggressive — a hanging test suite will stall the entire agent loop.
Service connectors — Postgres queries, REST API calls, git operations. Useful when the task involves understanding live schema state, checking API responses, or inspecting commit history. These are also the most dangerous from an idempotency standpoint: a tool that fires a POST request on retry can create duplicate records. Wrap mutating calls in dry-run flags or explicit confirmation steps.
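The "keep timeouts aggressive" advice can be enforced in one place at the dispatch layer. Below is a minimal sketch of a promise-level timeout; `withTimeout` is a hypothetical helper, and the two stand-in promises simulate a fast linter and a hanging test suite. Note that this only stops the agent loop from waiting: it does not kill the underlying process, so a real shell tool should also pass a timeout to the child process itself.

```typescript
// Hypothetical wrapper: reject if a tool call runs longer than its budget
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Stand-ins for a fast linter and a hanging test suite
const fastTool = new Promise<string>((resolve) => setTimeout(() => resolve("lint ok"), 10));
const hangingTool = new Promise<string>(() => { /* never settles */ });

const results = Promise.all([
  withTimeout(fastTool, 200, "run_eslint"),
  withTimeout(hangingTool, 50, "run_tests").catch((e: Error) => e.message),
]);
results.then(([fast, hung]) => console.log(fast, "|", hung));
```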

Setting Up an MCP Server for Your Dev Environment

The TypeScript SDK wins on completeness — it ships with typed tool schemas, Zod validation helpers, and the full transport abstraction layer out of the box. The Python SDK is fine for pure data tools, but if your MCP server shells out frequently (running lint, parsing files, invoking CLI tools), the startup overhead on each subprocess call compounds fast. For a dev environment server that needs to feel snappy inside Cursor or Cline, start with @modelcontextprotocol/sdk on Node 20+.

Here's a minimal but real server scaffold with tool registration, a typed input schema, and stdio transport:

```typescript
// mcp-dev-server/src/index.ts
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";
import { execFileSync } from "node:child_process";
import path from "node:path";

const server = new McpServer({
  name: "dev-tools",
  version: "0.1.0",
});

// Register a tool; the schema is validated before your handler runs
server.tool(
  "run_eslint",
  {
    filepath: z.string().describe("Relative path inside workspace"),
  },
  async ({ filepath }) => {
    // Don't trust the model to pass safe paths: resolve against the allowed root
    // so "../" segments cannot escape it
    const allowed = path.resolve(process.env.WORKSPACE_ROOT ?? "/workspace/src");
    const target = path.resolve(allowed, filepath);
    if (target !== allowed && !target.startsWith(allowed + path.sep)) {
      return { content: [{ type: "text", text: "Path outside allowed root" }], isError: true };
    }
    try {
      // execFileSync takes arguments as an array, so the path is never shell-interpreted.
      // The compact formatter ships with ESLint 8; ESLint 9 needs eslint-formatter-compact.
      const result = execFileSync("npx", ["eslint", "--format", "compact", target], {
        encoding: "utf8",
        timeout: 15_000,
      });
      return { content: [{ type: "text", text: result }] };
    } catch (err) {
      // ESLint exits non-zero when it finds problems; return its report instead of throwing
      const e = err as { stdout?: string; message: string };
      return { content: [{ type: "text", text: e.stdout || e.message }] };
    }
  }
);

const transport = new StdioServerTransport();
await server.connect(transport);
// process stays alive; the MCP host controls the lifecycle via stdin/stdout
```

The mcp.json config is what your editor actually reads to know which servers exist and how to spawn them. Cursor looks in .cursor/mcp.json at the project root; VS Code with the MCP extension uses .vscode/mcp.json. The structure is nearly the same either way: a named map of server entries with command, args, and optional env. The one difference to watch is the top-level key, which is mcpServers in Cursor's file and servers in VS Code's. Here's a realistic two-server config with a filesystem tool jailed to a subdirectory and a Postgres query tool:

```json
// .cursor/mcp.json
{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": [
        "-y",
        "@modelcontextprotocol/server-filesystem",
        "./src",
        "./tests"
      ]
    },
    "postgres": {
      "command": "npx",
      "args": [
        "-y",
        "@modelcontextprotocol/server-postgres",
        "postgresql://readonly_user:pass@localhost:5432/mydb"
      ]
    }
  }
}
```

The allowed-directory list on the filesystem server is the single most important security control most people get wrong. The reference server takes it as positional arguments after the package name (there is no separate flag for it), and the reference Postgres server takes its connection string the same way. Here the filesystem server is jailed to src/ and tests/, not the repo root. Point it at your repo root or home directory instead and a model operating in Code Mode can and will read .env files, private keys, or node_modules/.cache, because anything under an allowed directory is reachable. Relative paths resolve against the directory the editor spawns the server from, so use absolute paths if that isn't the project root. Use a read-only Postgres role for the DB server, not your migration user. These aren't theoretical risks; they're the first two things to audit before you let any agentic flow run unsupervised.

The stdio vs SSE transport distinction will bite you the first time you wonder why your server loses state mid-session. Stdio (the default in Cursor and Cline) spawns your server as a subprocess and communicates over stdin/stdout — the process is killed when the session closes, which means any in-memory state, open DB connections, or file watchers disappear with it. SSE keeps a long-running HTTP server and multiplexes sessions over it, which is what you want if your server does expensive initialization (loading an embedding model, opening a connection pool). The cost is operational: you need a running process, a port, and CORS headers set to allow your editor's origin. For most local dev setups, stdio is fine and simpler. Switch to SSE only when you're sharing one server instance across multiple clients or need persistent state between tool calls.

Routing Between Local Ollama Models and Cloud APIs Inside Code Mode

The routing decision that actually matters isn't "local vs. cloud" as a preference — it's token estimate as a proxy for task complexity. A 40-token autocomplete request has no business hitting Claude Sonnet over an API call with 200ms round-trip latency. But a multi-step agent task that needs to read three files, reason about dependencies, and emit a patch? That's where a 7B model starts hallucinating import paths. The split I run is a simple token-count threshold with a task-type override:

```typescript
// route.ts: called before every model dispatch
const FAST_MODEL = "qwen2.5-coder:7b"; // Ollama local, ~5GB VRAM
const STRONG_LOCAL = "qwen2.5-coder:32b-q4_k_m"; // ~22GB VRAM, used for agents
const CLOUD_FALLBACK = "claude-sonnet-4-5"; // API, for when local is saturated

type TaskType = "autocomplete" | "inline_edit" | "agent" | "multi_file";
```
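The routing function itself, `routeModel`, is called later in the dispatch code but isn't shown here. As a rough sketch of what a token-count threshold with a task-type override could look like: the 2,000-token cutoff and the rule that agent and multi-file tasks always go to the larger local model are my assumptions for illustration, not measured values, so tune both for your own workload.

```typescript
type TaskType = "autocomplete" | "inline_edit" | "agent" | "multi_file";

const FAST_MODEL = "qwen2.5-coder:7b";
const STRONG_LOCAL = "qwen2.5-coder:32b-q4_k_m";
const CLOUD_FALLBACK = "claude-sonnet-4-5";

// Assumed cutoff for illustration only
const TOKEN_THRESHOLD = 2_000;

function routeModel(estimatedTokens: number, taskType: TaskType, ollamaHealthy: boolean): string {
  // Local inference unavailable: everything goes to the cloud client
  if (!ollamaHealthy) return CLOUD_FALLBACK;
  // Task-type override: multi-step work skips the small model regardless of size
  if (taskType === "agent" || taskType === "multi_file") return STRONG_LOCAL;
  // Otherwise the token estimate decides
  return estimatedTokens <= TOKEN_THRESHOLD ? FAST_MODEL : STRONG_LOCAL;
}

console.log(routeModel(40, "autocomplete", true));
console.log(routeModel(40, "agent", true));
console.log(routeModel(40, "autocomplete", false));
```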

Wiring Cline or Continue to Ollama's OpenAI-compatible endpoint sounds trivial until it isn't. The config looks like this in Continue's config.json:

```json
{
  "models": [
    {
      "title": "Qwen2.5-Coder 7B",
      "provider": "ollama",
      "model": "qwen2.5-coder:7b",
      "apiBase": "http://localhost:11434/v1"
    }
  ]
}
```

Two things burn people here. First: the model name must be the exact tag string from ollama list — qwen2.5-coder:7b not qwen2.5-coder or qwen2.5-coder:latest. Ollama will 404 silently or return a generic error that Continue surfaces as "model not found" with no useful hint. Second, and this one is subtle: a trailing slash on apiBase breaks tool-call parsing in Cline's MCP dispatch layer. http://localhost:11434/v1/ causes malformed endpoint concatenation downstream — the tool-call response comes back as plain text instead of JSON, and the MCP server never receives the structured call. No trailing slash, ever.
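Both problems are cheap to catch before they cost you a debugging session. The sketch below normalizes the base URL and does an exact-match check against the model list. `normalizeApiBase` and `hasExactTag` are hypothetical helper names, and the sample data stands in for a real response from Ollama's /api/tags endpoint.

```typescript
// Strip trailing slashes so downstream path concatenation cannot produce "//"
function normalizeApiBase(apiBase: string): string {
  return apiBase.replace(/\/+$/, "");
}

// Shape of Ollama's GET /api/tags response, reduced to the field used here
interface OllamaTags {
  models: { name: string }[];
}

// Exact-match check: "qwen2.5-coder" must not match "qwen2.5-coder:7b"
function hasExactTag(tags: OllamaTags, wanted: string): boolean {
  return tags.models.some((m) => m.name === wanted);
}

const sampleTags: OllamaTags = { models: [{ name: "qwen2.5-coder:7b" }] }; // sample data
console.log(normalizeApiBase("http://localhost:11434/v1/"));
console.log(hasExactTag(sampleTags, "qwen2.5-coder:7b"), hasExactTag(sampleTags, "qwen2.5-coder"));
```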

The VRAM arithmetic on a 32GB card is tight but workable if you're disciplined. The 32B Q4_K_M model loads to roughly 22GB. An active filesystem MCP server process plus a live Postgres MCP connection adds maybe 1-2GB of system VRAM overhead depending on your driver and how much the MCP servers cache. That leaves you 8-9GB of headroom — comfortable for a single agent session where the KV cache stays bounded. The problem is the second agent session. Once you're over 30GB allocation, the driver starts paging KV cache to system RAM, and latency on generation steps jumps 4-8x on typical query lengths. You'll feel it as stalls mid-stream rather than a clean slow response. The fix is to serialize agent sessions at the dispatch layer, not the model layer — queue the second request until the first completes rather than letting both run concurrently.
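Serializing at the dispatch layer doesn't need a job-queue library. Here is a minimal sketch of a promise chain that starts each agent session only after the previous one settles. `SerialQueue` and `agentSession` are illustrative names, and the timers stand in for real model calls.

```typescript
// Minimal serial queue: each task starts only after the previous one settles
class SerialQueue {
  private tail: Promise<unknown> = Promise.resolve();

  enqueue<T>(task: () => Promise<T>): Promise<T> {
    const run = this.tail.then(task);
    // Swallow errors on the chain itself so one failed session doesn't block the next;
    // the caller still sees the rejection through the returned promise
    this.tail = run.catch(() => undefined);
    return run;
  }
}

const events: string[] = [];
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Stand-in for an agent session hitting the 32B model
async function agentSession(label: string, ms: number): Promise<string> {
  events.push(`${label}:start`);
  await sleep(ms);
  events.push(`${label}:end`);
  return label;
}

const queue = new SerialQueue();
const finished = Promise.all([
  queue.enqueue(() => agentSession("A", 30)),
  queue.enqueue(() => agentSession("B", 5)),
]);
finished.then(() => console.log(events.join(" ")));
```

Session B is shorter but still waits for A, which is the point: only one KV cache is live at a time.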

The health-check-before-dispatch pattern is what keeps the TypeScript automation engine from failing visibly when Ollama is saturated or mid-reload. Ollama's /api/tags endpoint responds in under 5ms when the service is up and becomes immediately unreachable when it's not — it's a better liveness signal than /api/generate with a dummy prompt:

```typescript
// health.ts
async function isOllamaHealthy(timeoutMs = 800): Promise<boolean> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch("http://localhost:11434/api/tags", {
      signal: controller.signal,
    });
    return res.ok;
  } catch {
    // connection refused OR timeout: treat both as unhealthy
    return false;
  } finally {
    clearTimeout(timer); // always clear the timer, including on the error path
  }
}

// In the dispatch loop (routeModel, estimatedTokens, and taskType come from route.ts):
const healthy = await isOllamaHealthy();
const model = routeModel(estimatedTokens, taskType, healthy);
// If model === CLOUD_FALLBACK, swap the client to OpenAI SDK: same interface, different baseURL
```

The 800ms timeout is deliberate. Anything longer and a slow-but-alive Ollama instance under load will pass the health check but then time out on the actual generation request — you've just added latency without gaining reliability. If it can't answer /api/tags in under a second, treat it as unavailable and route to the cloud client. The OpenAI SDK and the Ollama OpenAI-compatible endpoint share enough interface surface that swapping the baseURL and adding an API key is the only code change needed — keep both clients instantiated and select between them after the health check resolves.

Wiring MCP Into an Automated Dev Pipeline (n8n + Webhooks)

The most disruptive part of this whole setup isn't the agent — it's realizing you can close the loop between your repo and your local reasoning layer without touching a single SaaS CI product. A GitHub webhook fires, your local n8n instance picks it up, an MCP-capable agent reads the diff, runs your actual test suite, and posts a structured review back to the PR. The whole chain runs on your workstation. Here's how to wire it.

GitHub Webhook → n8n → MCP Agent → PR Comment

The entry point is a Webhook node in n8n listening on something like /webhook/github-pr. You authenticate it with a shared secret checked against the X-Hub-Signature-256 header — n8n's Webhook node exposes the raw body, so you can verify it in a Function node before anything else runs. After verification, extract pull_request.number, pull_request.diff_url, and repository.full_name from the payload. A second HTTP Request node fetches the raw diff from diff_url using your GitHub token. That diff string goes into the body of the call to your local MCP agent endpoint.
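The signature check itself is a few lines. GitHub sends the header as `sha256=` followed by the hex HMAC-SHA256 of the raw request body, keyed with your webhook secret. The sketch below uses Node's crypto module; inside n8n you'd adapt it to whatever the code node exposes, and the secret and payload here are made up for the self-check.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// GitHub sends "sha256=<hex HMAC of the raw body>" in X-Hub-Signature-256
function verifyGithubSignature(rawBody: string, header: string | undefined, secret: string): boolean {
  if (!header || !header.startsWith("sha256=")) return false;
  const expected = "sha256=" + createHmac("sha256", secret).update(rawBody, "utf8").digest("hex");
  const received = Buffer.from(header, "utf8");
  const computed = Buffer.from(expected, "utf8");
  // timingSafeEqual throws on length mismatch, so compare lengths first
  return received.length === computed.length && timingSafeEqual(received, computed);
}

// Self-check with a made-up secret and payload
const secret = "example-secret";
const body = JSON.stringify({ action: "opened", pull_request: { number: 1 } });
const validHeader = "sha256=" + createHmac("sha256", secret).update(body, "utf8").digest("hex");
console.log(verifyGithubSignature(body, validHeader, secret));
console.log(verifyGithubSignature(body + " ", validHeader, secret));
```

Verify against the raw body bytes, not a re-serialized JSON object; re-serializing can change whitespace or key order and break the match.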

The HTTP Request node hitting your MCP-capable agent endpoint should look like this:

```json
{
  "model": "qwen2.5-coder:32b",
  "messages": [
    {
      "role": "user",
      "content": "Review this diff for correctness, style violations, and test coverage gaps. Diff:\n\n{{$json.diff}}"
    }
  ],
  "tool_choice": "auto",
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "mcp__shell__run_command",
        "description": "Run a shell command on the local machine via MCP shell tool",
        "parameters": {
          "type": "object",
          "properties": {
            "command": { "type": "string" },
            "cwd": { "type": "string" }
          },
          "required": ["command"]
        }
      }
    },
    {
      "type": "function",
      "function": {
        "name": "mcp__filesystem__read_file",
        "description": "Read a file from the registered workspace root",
        "parameters": {
          "type": "object",
          "properties": {
            "path": { "type": "string" }
          },
          "required": ["path"]
        }
      }
    }
  ]
}
```

The tool names use the mcp__<server>__<tool> namespacing convention. When the agent decides to run tests, it will return a tool call for mcp__shell__run_command with something like "command": "cd /workspace/myrepo && npm test -- --reporter=json". Your n8n workflow catches that tool call response, executes the dispatch back to the MCP server's tool endpoint, feeds the output back into a follow-up messages array, and loops until the agent returns a plain text completion — which then gets POST'd to the GitHub PR comments API as a structured review.
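Stripped of the n8n nodes, the loop described above has a simple shape. This sketch injects the model call and the tool dispatch as functions so it runs with stubs; the `Message`, `ToolCall`, and `ModelReply` types are simplified stand-ins I'm assuming for illustration rather than the exact OpenAI-compatible wire format, and the step cap is an added safety assumption.

```typescript
interface ToolCall { id: string; name: string; arguments: Record<string, unknown>; }
interface Message { role: "user" | "assistant" | "tool"; content: string; toolCallId?: string; }
type ModelReply = { kind: "text"; text: string } | { kind: "tool_calls"; calls: ToolCall[] };

// Both functions are injected, so the loop is independent of n8n, Ollama, or any SDK
async function runAgentLoop(
  messages: Message[],
  callModel: (transcript: Message[]) => Promise<ModelReply>,
  callTool: (call: ToolCall) => Promise<string>,
  maxSteps = 8, // assumed cap so a confused model cannot loop forever
): Promise<string> {
  const transcript = [...messages];
  for (let step = 0; step < maxSteps; step++) {
    const reply = await callModel(transcript);
    if (reply.kind === "text") return reply.text; // plain completion ends the loop
    // Record the model's request, then each tool result, before asking again
    transcript.push({ role: "assistant", content: JSON.stringify(reply.calls) });
    for (const call of reply.calls) {
      const output = await callTool(call);
      transcript.push({ role: "tool", content: output, toolCallId: call.id });
    }
  }
  throw new Error(`agent did not finish within ${maxSteps} steps`);
}

// Stubbed model: asks for one test run, then summarizes
const stubModel = async (transcript: Message[]): Promise<ModelReply> =>
  transcript.some((m) => m.role === "tool")
    ? { kind: "text", text: "Tests pass; no issues found." }
    : { kind: "tool_calls", calls: [{ id: "1", name: "mcp__shell__run_command", arguments: { command: "npm test" } }] };
const stubTool = async (_call: ToolCall): Promise<string> => "12 passing";

const review = runAgentLoop([{ role: "user", content: "Review this diff" }], stubModel, stubTool);
review.then((text) => console.log(text));
```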

The Version Mismatch Failure Mode

The most reliable way to break this pipeline silently is to run two different versions of your MCP server — one that your editor (Cursor, VS Code with Cline, whatever) registered tools against, and a different one that n8n is hitting at runtime. The agent returns a tool call like mcp__shell__run_command, your MCP server doesn't recognize it because the tool was renamed or its parameter schema changed in a patch release, and you get a generic 400 or an unhandled tool error that n8n logs as a completed node. The PR comment never gets posted. You don't notice until someone asks why the bot went quiet.

Fix this in two places. First, pin the MCP server package version in your Dockerfile or package.json — no floating ^ or ~:

```json
// package.json: MCP server process
{
  "dependencies": {
    "@modelcontextprotocol/server-filesystem": "0.6.2",
    "@modelcontextprotocol/server-shell": "0.4.1"
  }
}
```

Second, add a /info route to your MCP server wrapper that returns the pinned version. Then have your n8n workflow hit /info as its first node and assert the version matches what it expects before doing anything else. A mismatch kills the workflow run immediately with a descriptive error rather than a ghost failure downstream:

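```typescript
// Express wrapper around your MCP server
import express from "express";

const app = express();
const startTime = new Date();
// registeredTools is assumed to be the array of tool definitions your wrapper already tracks

app.get("/info", (req, res) => {
  res.json({
    // npm sets npm_package_version from package.json only when launched via an npm script
    mcp_server_version: process.env.npm_package_version,
    tools: registeredTools.map((t) => t.name),
    started_at: startTime.toISOString(),
  });
});
```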
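On the workflow side, the assertion can be a small function that throws with a descriptive message. This is a sketch under the assumption that /info returns the two fields used here; `assertServerMatches` is a hypothetical name, and the version and tool names are sample values.

```typescript
// Subset of the /info response this check relies on
interface ServerInfo {
  mcp_server_version: string;
  tools: string[];
}

function assertServerMatches(info: ServerInfo, expectedVersion: string, requiredTools: string[]): void {
  if (info.mcp_server_version !== expectedVersion) {
    throw new Error(
      `MCP server version mismatch: expected ${expectedVersion}, got ${info.mcp_server_version}`,
    );
  }
  const missing = requiredTools.filter((tool) => !info.tools.includes(tool));
  if (missing.length > 0) {
    throw new Error(`MCP server is missing tools: ${missing.join(", ")}`);
  }
}

// Sample response; in the workflow this comes from GET /info
const sampleInfo: ServerInfo = { mcp_server_version: "0.4.1", tools: ["run_command", "read_file"] };
assertServerMatches(sampleInfo, "0.4.1", ["run_command"]);
console.log("server matches");
```

Checking the tool list as well as the version catches the case where someone bumps the pin but a tool was renamed.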

Nightly Unattended Loop via PM2 Cron

The same agent-plus-MCP pattern runs as a nightly static analysis sweep with zero human involvement. The PM2 ecosystem config schedules a Node script that calls git log to get all files changed since the last tagged deploy, sends them through the MCP-capable agent with the filesystem and shell tools available, instructs the agent to auto-apply ESLint fixes only on functions with cyclomatic complexity below a threshold (I use 10 — above that, it flags rather than touches), commits the result, and opens a draft PR via the GitHub API. If nothing changed or no fixable issues were found, it exits cleanly and PM2 logs the no-op.

```javascript
// ecosystem.config.js
module.exports = {
  apps: [{
    name: 'nightly-lint-agent',
    script: './scripts/nightly-lint-loop.js',
    cron_restart: '0 2 * * *', // 2 AM local time
    autorestart: false,        // don't restart on exit; this is a one-shot job
    watch: false,
    env: {
      MCP_ENDPOINT: 'http://localhost:3100',
      GITHUB_TOKEN: process.env.GITHUB_TOKEN,
      COMPLEXITY_THRESHOLD: '10',
      BASE_BRANCH: 'main'
    }
  }]
};
```

One gotcha that doesn't show up until the second or third run: if the agent opens a draft PR but the nightly job runs again before anyone closes it, you'll accumulate stacked draft PRs against the same base. Guard against this by querying the GitHub API for open draft PRs authored by your bot token before the agent does any work — if one already exists for the same base branch, append to it rather than opening a new one, or skip the run entirely. The GitHub search API supports is:pr is:draft author:app/your-bot-name base:main which is fast enough to call synchronously at job start.
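As an alternative to the search query, the same guard can be sketched against the list-pull-requests REST endpoint (GET /repos/{owner}/{repo}/pulls with state=open), filtering the response locally. Note this swaps in a different endpoint from the search API mentioned above. `findOpenBotDraft` is a hypothetical helper, the bot login is a placeholder, and the sample array stands in for the API response.

```typescript
// Subset of the fields GitHub returns for each pull request in the list response
interface PullSummary {
  number: number;
  draft: boolean;
  user: { login: string };
  base: { ref: string };
}

// Returns the bot's existing open draft against the base branch, if there is one
function findOpenBotDraft(pulls: PullSummary[], botLogin: string, baseBranch: string): PullSummary | undefined {
  return pulls.find(
    (pr) => pr.draft && pr.user.login === botLogin && pr.base.ref === baseBranch,
  );
}

// Sample data; the login is a placeholder for your bot account
const openPulls: PullSummary[] = [
  { number: 41, draft: false, user: { login: "alice" }, base: { ref: "main" } },
  { number: 42, draft: true, user: { login: "your-bot-name[bot]" }, base: { ref: "main" } },
];

const existing = findOpenBotDraft(openPulls, "your-bot-name[bot]", "main");
console.log(existing ? `skip run: draft PR #${existing.number} already open` : "no draft open, proceed");
```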

Three Non-Obvious Behaviors That Will Cost You Time

The behaviors that actually cost you hours aren't the ones in the error logs — they're the ones that produce wrong but plausible output and silent data corruption. Here are three that don't show up in the MCP spec until you've already been burned by them.

Tool Call Cancellation Is Not Guaranteed

If the model emits a partial tool call payload and the session drops mid-flight, several MCP clients will queue that call for retry on reconnect. The spec treats tool calls as idempotent by convention, but nothing in the protocol enforces that — and your file-write tool almost certainly isn't idempotent. The failure mode: reconnect triggers a second write_file call with the same arguments, and your patch gets applied twice. With line-insertion edits, that means duplicate blocks. With find-and-replace, the second pass may silently corrupt the file by matching on already-modified content.

The fix is a content guard on every write tool, not an assumption that the client handles this. Before applying any edit, read a hash of the current file state and compare it against a hash captured when the task was first dispatched:

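```typescript
// write_tool_handler.ts
import { createHash } from "node:crypto";
import { readFile } from "node:fs/promises";

interface EditParams {
  path: string;
  patch: string;
  expectedHash: string; // hash the caller saw when it read the file
}

type EditResult =
  | { success: true; newHash: string }
  | { success: false; reason: string };

// applyPatch is whatever patch routine your server already uses (not shown here)
declare function applyPatch(path: string, patch: string): Promise<void>;

async function sha256OfFile(path: string): Promise<string> {
  return createHash("sha256").update(await readFile(path)).digest("hex");
}

async function applyEdit(params: EditParams): Promise<EditResult> {
  const currentHash = await sha256OfFile(params.path);

  // caller passes the hash they saw when they read the file
  // if hashes diverge, the file changed under us, so refuse the write
  if (currentHash !== params.expectedHash) {
    return {
      success: false,
      reason: `content_mismatch: expected ${params.expectedHash}, got ${currentHash}`,
    };
  }

  await applyPatch(params.path, params.patch);

  return {
    success: true,
    newHash: await sha256OfFile(params.path),
  };
}
```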

A line-range guard works too and is simpler to reason about for bounded edits: the tool refuses to write if the target line range has shifted. Either way, the model gets a clean error it can reason about instead of silently writing corrupted output. The extra round-trip is worth it every time.
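For comparison, here is a minimal sketch of the line-range variant. It works on an in-memory string to keep the example self-contained; `RangeEdit` and `applyRangeEdit` are illustrative names, and the guard simply checks that the target lines still contain what the caller saw when it read the file.

```typescript
interface RangeEdit {
  startLine: number;       // 1-indexed, inclusive
  endLine: number;         // 1-indexed, inclusive
  expectedLines: string[]; // what the caller saw in that range when it read the file
  replacement: string[];
}

type RangeResult = { success: true; content: string } | { success: false; reason: string };

function applyRangeEdit(content: string, edit: RangeEdit): RangeResult {
  const lines = content.split("\n");
  const current = lines.slice(edit.startLine - 1, edit.endLine);
  const unchanged =
    current.length === edit.expectedLines.length &&
    current.every((line, i) => line === edit.expectedLines[i]);
  if (!unchanged) {
    return { success: false, reason: `range_shifted: lines ${edit.startLine}-${edit.endLine} no longer match` };
  }
  lines.splice(edit.startLine - 1, current.length, ...edit.replacement);
  return { success: true, content: lines.join("\n") };
}

const original = "a\nb\nc";
const edit: RangeEdit = { startLine: 2, endLine: 2, expectedLines: ["b"], replacement: ["b1", "b2"] };
const firstPass = applyRangeEdit(original, edit);
// Simulated retry against the already-patched content
const secondPass = firstPass.success ? applyRangeEdit(firstPass.content, edit) : firstPass;
console.log(firstPass.success, secondPass.success);
```

One limitation to keep in mind: if the replacement happens to start with the same lines it replaced, a retry can still pass this check, which is where the hash guard is stricter.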

Context Window Bleed Kills Long Agent Sessions

Code mode accumulates tool call results directly in the context window — every read_file response, every lint output, every test result goes in as a message. On a 32B model running at 32K context, around 15 tool calls in, the model starts behaviorally ignoring early system instructions. It doesn't error out. It just stops following the task constraints you set at the top of the session — the coding style rules, the file exclusion list, the output format requirements. This is a soft attention failure, not a hard truncation, and it's insidious because the output still looks reasonable.

The practical mitigation is task checkpointing: break any job that requires more than 8-10 tool calls into explicit subtasks, each with its own fresh context. In my n8n flows, I do this by treating each subtask as a separate HTTP call to the model endpoint, passing only a compressed summary of prior state rather than the full tool call history. Something like a 200-token "checkpoint header" that says files modified so far, constraints still active, next objective — assembled by a lightweight summarizer node before each subtask starts:

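```http
# n8n HTTP Request node: subtask dispatch
# Ollama's /api/chat takes a messages array; context size is set through options.num_ctx
POST /api/chat
{
  "model": "qwen2.5-coder:32b",
  "messages": [
    { "role": "system", "content": "{{ $json.checkpointHeader }}" },
    { "role": "user", "content": "{{ $json.subtaskPrompt }}" }
  ],
  "options": { "num_ctx": 32768 },
  "stream": false
}
```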

The checkpoint header is cheap to generate and completely sidesteps the bleed problem. The tradeoff is that you lose conversational continuity — the model can't reference specific earlier outputs by memory. For code generation tasks that's almost never a problem; for exploratory debugging sessions where context coherence matters, you may prefer a smaller model with a larger context window instead.
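The header assembly itself can be plain string building. In this sketch the `Checkpoint` fields mirror the three items named above (files modified, constraints still active, next objective); the four-characters-per-token estimate is a rough heuristic I'm assuming for the budget check, not a real tokenizer, and `buildCheckpointHeader` is a hypothetical name.

```typescript
interface Checkpoint {
  filesModified: string[];
  activeConstraints: string[];
  nextObjective: string;
}

// Rough budget check: ~4 characters per token is a heuristic, not an exact count
const approxTokens = (text: string): number => Math.ceil(text.length / 4);

function buildCheckpointHeader(cp: Checkpoint, maxTokens = 200): string {
  const header = [
    `Files modified so far: ${cp.filesModified.join(", ") || "none"}`,
    `Constraints still active: ${cp.activeConstraints.join("; ") || "none"}`,
    `Next objective: ${cp.nextObjective}`,
  ].join("\n");
  if (approxTokens(header) > maxTokens) {
    throw new Error(`checkpoint header is ~${approxTokens(header)} tokens; summarize further`);
  }
  return header;
}

const checkpointHeader = buildCheckpointHeader({
  filesModified: ["src/auth.ts", "src/session.ts"],
  activeConstraints: ["no default exports", "do not touch migrations/"],
  nextObjective: "add unit tests for session expiry",
});
console.log(checkpointHeader);
```

Failing loudly when the header exceeds its budget keeps the summary from quietly growing back into the full history.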

Resources vs Tools: The Cold-Start Latency You're Leaving on the Table

MCP draws a hard architectural line between resources and tools. Resources are application-controlled context: the client lists and reads them, and a client that preloads them at session initialization makes them available to the model with no round-trip cost during the task. Tools are called on demand, which means a network hop plus server-side execution every single invocation. Most MCP server implementations default to exposing everything as a tool because it's simpler to implement, and most tutorials do the same. The cost is paid at cold start: if your agent calls describe_schema or list_available_endpoints at the top of every session, you're paying 2-4 seconds per session just to hand the model static information it could have had for free.

Anything that doesn't change between sessions belongs in a resource. Schema definitions, API surface documentation, file tree snapshots of stable directories, environment capability lists — these are all resource candidates. Exposing them correctly in your MCP server definition looks like this:

```yaml
# mcp_server_config.yaml
resources:
  - uri: "schema://db/main"
    name: "Main database schema"
    mimeType: "application/json"
    # served from a static file or a cached query result refreshed on server start
    handler: "handlers.schema.serve_main"
  - uri: "schema://api/openapi"
    name: "Internal API surface"
    mimeType: "application/yaml"
    handler: "handlers.schema.serve_openapi"

tools:
  - name: "run_query"
    description: "Execute a read-only SQL query"
    # this is dynamic, so keep it as a tool
    handler: "handlers.db.run_query"
```

A client that preloads resources reads them at session start and includes them in the initial context before the first user message, so the model has the schema available immediately without any tool call overhead. Not every client does this automatically, so check how yours surfaces resources. The common gotcha here: if your resource handler hits a slow database or a remote API at serve time, you've moved the latency problem rather than eliminated it. Cache the resource payload on server startup and refresh it on a schedule, not on every client connection.
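The cache-on-startup, refresh-on-a-schedule pattern can be sketched as a small wrapper around whatever loads the payload. `CachedResource` is a hypothetical class, and the loader here is a stand-in for a slow schema query; the resource handler would call `read()` instead of hitting the database.

```typescript
// Caches a resource payload; the loader runs at startup and on a timer, never per client read
class CachedResource<T> {
  private value: T | undefined;
  private timer: ReturnType<typeof setInterval> | undefined;
  private readonly loader: () => Promise<T>;

  constructor(loader: () => Promise<T>) {
    this.loader = loader;
  }

  async start(refreshMs: number): Promise<void> {
    this.value = await this.loader();
    this.timer = setInterval(() => { void this.refresh(); }, refreshMs);
  }

  private async refresh(): Promise<void> {
    try {
      this.value = await this.loader();
    } catch {
      // keep serving the last good payload if a refresh fails
    }
  }

  read(): T {
    if (this.value === undefined) throw new Error("resource not loaded yet");
    return this.value;
  }

  stop(): void {
    if (this.timer !== undefined) clearInterval(this.timer);
  }
}

let loadCount = 0;
const schemaResource = new CachedResource(async () => {
  loadCount++;
  return JSON.stringify({ tables: ["users", "orders"] }); // stand-in for a slow schema query
});

const ready = schemaResource.start(60_000).then(() => {
  schemaResource.read();
  schemaResource.read(); // two client reads, still one load
  console.log("loads:", loadCount);
  schemaResource.stop();
});
```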

When This Setup Is Not the Right Call

The filesystem MCP tool's recursive read is genuinely useful on a focused project directory — it stops being useful the moment your codebase crosses into monorepo territory. Past roughly 500K LOC, the tool will either get truncated by the model's context window before it assembles a coherent picture, or it'll burn most of the context budget on directory traversal before a single line of analysis happens. If you're in that situation, the right architecture is a code embedding index with semantic search in front of it. I use bge-m3 for this — it's dense enough to handle code tokens well and fits comfortably in a moderate VRAM budget. The MCP layer then becomes a thin retrieval interface rather than a raw filesystem reader: the agent queries the index, gets back the 10-15 most relevant chunks, and works from those instead of trying to read the repo whole.
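The retrieval step behind that thin interface is a nearest-neighbor lookup over precomputed chunk embeddings. The sketch below uses brute-force cosine similarity with toy three-dimensional vectors standing in for real embeddings; producing the vectors with bge-m3 (or any other model) and swapping in a proper vector index are both outside what it shows, and `topK` is an illustrative name.

```typescript
interface Chunk {
  path: string;
  text: string;
  embedding: number[];
}

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Brute-force top-k: fine for a sketch, replace with a vector index at real scale
function topK(query: number[], chunks: Chunk[], k: number): Chunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosine(query, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((scored) => scored.chunk);
}

// Toy vectors standing in for real embeddings
const chunks: Chunk[] = [
  { path: "src/db.ts", text: "connection pool setup", embedding: [0, 1, 0] },
  { path: "src/auth.ts", text: "token validation", embedding: [0.9, 0.1, 0] },
  { path: "src/session.ts", text: "session refresh", embedding: [0.7, 0.7, 0] },
];
const best = topK([1, 0, 0], chunks, 2);
console.log(best.map((c) => c.path).join(", "));
```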

GPU constraint is the other hard wall. The agentic loop that makes Code Mode + MCP actually useful requires a model capable of multi-step tool use and self-correction — on my 32GB VRAM box that's workable with a quantized 32B or a pair of smaller specialized models. Under 16GB VRAM, you don't have that headroom. A 7B or 8B model will accept MCP tool calls, but the planning quality degrades fast on anything beyond two-step tasks; you'll spend more time fixing the agent's mistakes than you would have spent writing the code. The practical answer is to route the agent to a cloud API — Claude 3.5 Sonnet or GPT-4o handle the reasoning — and keep MCP tool permissions strictly read-only. Read-only means the worst case is a wasted API call, not an overwritten file or an executed shell command you didn't intend.

Shared dev environments are where this setup can become a genuine security problem rather than just an inconvenience. An MCP server with shell exec permissions is an arbitrary code execution surface. On a single-operator workstation where you control the process list, what the server can spawn, and what credentials are in the environment, that's a manageable risk — you're the only lateral movement target and you can audit it. On shared infra — a team dev box, a cloud VM that multiple engineers SSH into, a Kubernetes pod with a mounted service account — an MCP server with exec access is a privilege escalation path waiting to happen. One misconfigured tool definition or one prompt injection through a malicious code comment, and the blast radius extends to every other user and service that machine can reach. This architecture was designed for single-operator use; treating it as a team tool without a significant rethink of the permission model is the wrong call.

Monorepo scale (>500K LOC): Replace recursive filesystem reads with bge-m3 embeddings + semantic retrieval; the MCP tool becomes a query interface, not a directory walker.
Under 16GB VRAM: Use a cloud API for the agent brain, lock MCP tools to read-only, accept that local inference isn't the bottleneck worth optimizing here.
Shared infrastructure: MCP shell access on multi-user systems is a lateral movement risk; the single-operator assumption baked into this setup doesn't hold and the permission model needs a full redesign before it's safe to deploy that way.

Disclaimer: This article is for informational purposes only. The views and opinions expressed are those of the author(s) and do not necessarily reflect the official policy or position of Sonic Rocket or its affiliates. Always consult with a certified professional before making any financial or technical decisions based on this content.
