Back when we introduced MCP support, we ended on a teaser: Phase 3 would tackle Sampling—letting servers request completions from the host instead of only exposing tools and resources to it. NodeLLM 1.17 delivers on that, and pairs it with a second, unrelated but overdue improvement: precise control over how tool calls execute, now available consistently in both core and the ORM persistence layer.
🔄 MCP Sampling: Closing the Loop
Sampling inverts the usual MCP direction. Instead of the client asking the server for tools, the server asks the client to run an LLM completion on its behalf. This lets an MCP server offer LLM-powered capabilities—summarization, classification, drafting—without needing its own API key or provider integration.
createLLMSamplingHandler answers those requests using a real NodeLLM instance, so a server's tool ends up powered by whatever model you configure client-side:
import { createLLM } from "@node-llm/core";
import { MCP, createLLMSamplingHandler } from "@node-llm/mcp";
const llm = createLLM({ provider: "openai" });
const mcp = await MCP.connect(
  { command: "node", args: ["./sampling-server.mjs"] },
  { sampling: createLLMSamplingHandler(llm, "gpt-4o-mini") }
);
const tools = await mcp.discoverTools();
// The server only advertises sampling-backed tools once it sees
// the client declared sampling support during the handshake.
If you need full control over how a sampling request is answered (routing by model hint, injecting your own guardrails), pass a plain handler function instead of the one createLLMSamplingHandler builds. It receives the raw sampling/createMessage params and returns a CreateMessageResult, so you decide exactly how (or whether) to answer.
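As a rough illustration of what such a handler might look like, the sketch below routes on the request's model hints and caps maxTokens before answering. The types are simplified, text-only local copies of the MCP sampling shapes. ALLOWED_MODELS, DEFAULT_MODEL, MAX_TOKENS_CAP, and the complete callback are hypothetical stand-ins, and the exact signature NodeLLM expects for a plain handler is an assumption here, not something taken from its API reference.

```typescript
// Simplified, text-only local copies of the MCP sampling shapes (a subset of the spec).
interface TextContent { type: "text"; text: string }
interface SamplingMessage { role: "user" | "assistant"; content: TextContent }
interface CreateMessageParams {
  messages: SamplingMessage[];
  systemPrompt?: string;
  maxTokens: number;
  modelPreferences?: { hints?: { name?: string }[] };
}
interface CreateMessageResult {
  role: "assistant";
  content: TextContent;
  model: string;
  stopReason?: string;
}

// Hypothetical stand-in for whatever LLM call you make client-side.
type Complete = (model: string, prompt: string, maxTokens: number) => Promise<string>;

// Hypothetical policy: models the client is willing to use, and a token ceiling.
const ALLOWED_MODELS = ["gpt-4o", "gpt-4o-mini"];
const DEFAULT_MODEL = "gpt-4o-mini";
const MAX_TOKENS_CAP = 512;

// Hints are treated as substrings of model names; the first hint that matches wins.
function pickModel(params: CreateMessageParams): string {
  for (const hint of params.modelPreferences?.hints ?? []) {
    const name = hint.name;
    if (!name) continue;
    const match = ALLOWED_MODELS.find((model) => model.includes(name));
    if (match) return match;
  }
  return DEFAULT_MODEL;
}

function makeSamplingHandler(complete: Complete) {
  return async (params: CreateMessageParams): Promise<CreateMessageResult> => {
    // Guardrail: refuse empty requests rather than sending them upstream.
    if (params.messages.length === 0) {
      throw new Error("Sampling request contained no messages");
    }
    const model = pickModel(params);
    const maxTokens = Math.min(params.maxTokens, MAX_TOKENS_CAP);
    const prompt = params.messages
      .map((m) => `${m.role}: ${m.content.text}`)
      .join("\n");
    const text = await complete(model, prompt, maxTokens);
    return { role: "assistant", content: { type: "text", text }, model, stopReason: "endTurn" };
  };
}
```

The handler built by makeSamplingHandler would be passed where the sampling option goes. Throwing is one way to decline a request; how that surfaces to the server depends on the client library.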
⚡ Concurrent Tool Execution
When a model returns several independent tool calls in the same turn, NodeLLM has always executed them one at a time. That's safe by default, but it wastes time when the calls don't depend on each other: three weather lookups for three different cities, say. The toolConcurrency option makes parallel execution opt-in:
const chat = llm.chat("gpt-4o-mini").withTool(WeatherTool).withToolConcurrency(true);
await chat.ask("What is the weather in Tokyo, London, and New York?");
The same flag works in stream() and in Agent mode, so agentic loops get the same latency win without any change to how tools are defined.
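To make the difference concrete, here is a minimal sketch of the two scheduling strategies. It is not NodeLLM's internal implementation; runToolCalls, ToolCall, and ToolFn are hypothetical names used only for illustration.

```typescript
interface ToolCall { id: string; name: string; args: Record<string, unknown> }
interface ToolResult { id: string; result: unknown }
type ToolFn = (args: Record<string, unknown>) => Promise<unknown>;

async function runToolCalls(
  calls: ToolCall[],
  tools: Record<string, ToolFn>,
  concurrent: boolean
): Promise<ToolResult[]> {
  const runOne = async (call: ToolCall): Promise<ToolResult> => {
    const tool = tools[call.name];
    if (!tool) throw new Error(`Unknown tool: ${call.name}`);
    return { id: call.id, result: await tool(call.args) };
  };

  if (concurrent) {
    // Start every call immediately; Promise.all keeps results in call order.
    return Promise.all(calls.map(runOne));
  }

  // Sequential: each call starts only after the previous one settles.
  const results: ToolResult[] = [];
  for (const call of calls) {
    results.push(await runOne(call));
  }
  return results;
}
```

With three lookups that each take roughly the same time, the concurrent path finishes in about the time of the slowest call rather than the sum of all three. Note that Promise.all rejects as soon as any call fails, so a real implementation would likely want per-call error handling.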
🧩 Additive Callback Stacking
Calling an on*(), beforeRequest(), or afterResponse() hook a second time used to silently replace the first handler—a sharp edge if two independent concerns (logging and a UI update, say) each tried to register their own callback. Every registered handler now runs, in order:
chat
  .onEndMessage((msg) => audit.log(msg))
  .onEndMessage(() => ui.refresh());

chat
  .beforeRequest(redactPII)
  .beforeRequest(logOutboundPrompt);
Nothing changes for the common case of a single handler; this only matters once you start composing middleware-like behavior from multiple independent call sites.
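The additive behavior can be pictured as a list of handlers per hook rather than a single slot. The sketch below is an illustration of that idea, not NodeLLM's code. HookRegistry is a hypothetical name, and awaiting each handler before starting the next is an assumption about what "in order" means.

```typescript
type Handler<T> = (payload: T) => void | Promise<void>;

class HookRegistry<T> {
  private handlers: Handler<T>[] = [];

  // Append instead of overwrite, so earlier registrations survive.
  add(handler: Handler<T>): this {
    this.handlers.push(handler);
    return this;
  }

  // Run handlers in registration order, awaiting each before the next.
  async emit(payload: T): Promise<void> {
    for (const handler of this.handlers) {
      await handler(payload);
    }
  }
}
```

Under the old replace-on-register behavior, the equivalent of add() would have assigned to a single field, which is why the second registration silently dropped the first.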
🗄️ ORM 0.8.0: The Same Tool Control, Persisted
@node-llm/orm now exposes the same tool-execution surface as core, so chats backed by Prisma get precise control without dropping down to the raw core API:
import { createChat } from "@node-llm/orm/prisma";
import { ToolExecutionMode } from "@node-llm/core";
const chat = await createChat(prisma, { model: "gpt-4o" })
  .withToolExecution(ToolExecutionMode.CONFIRM)
  .onConfirmToolCall(async (call) => await askUserToApprove(call))
  .onToolCallError((call, error) => ({ error: error.message }))
  .withToolChoice("get_weather")
  .withToolConcurrency(true);
toolExecution accepts auto (default), confirm (calls onConfirmToolCall before each execution—handy for human-in-the-loop approval flows), or dry-run (skip execution entirely). Combined with toolChoice and toolCalls, the ORM adapter now mirrors core's tool orchestration one-for-one, and everything the model actually did still ends up correctly persisted.
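The three modes can be read as a small decision procedure around each tool call. The sketch below is one way to express it under stated assumptions: executeToolCall, Outcome, and the option names are hypothetical, and treating a declined or missing confirmation as a skipped call is a guess at the behavior rather than something the release notes spell out.

```typescript
type Mode = "auto" | "confirm" | "dry-run";
interface ToolCall { name: string; args: Record<string, unknown> }
type Outcome =
  | { status: "executed"; result: unknown }
  | { status: "skipped"; reason: string }
  | { status: "failed"; result: unknown };

interface ExecutionOptions {
  mode: Mode;
  onConfirm?: (call: ToolCall) => Promise<boolean>;
  onError?: (call: ToolCall, error: Error) => unknown;
}

async function executeToolCall(
  call: ToolCall,
  run: (args: Record<string, unknown>) => Promise<unknown>,
  options: ExecutionOptions
): Promise<Outcome> {
  // dry-run: never touch the tool.
  if (options.mode === "dry-run") {
    return { status: "skipped", reason: "dry-run" };
  }
  // confirm: ask first. Assumption: no handler means no approval.
  if (options.mode === "confirm") {
    const approved = options.onConfirm ? await options.onConfirm(call) : false;
    if (!approved) return { status: "skipped", reason: "denied" };
  }
  try {
    return { status: "executed", result: await run(call.args) };
  } catch (err) {
    const error = err instanceof Error ? err : new Error(String(err));
    // Without an error handler, let the failure propagate.
    if (!options.onError) throw error;
    return { status: "failed", result: options.onError(call, error) };
  }
}
```

Returning a structured outcome for skipped and failed calls, instead of throwing, is what lets a persistence layer record what actually happened in each case.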
Alongside this, @node-llm/testing 0.5.1 fixes a rare async race in VCR cassette auto-naming and formally declares vitest as a peer dependency instead of a hard one, so you can bring your own vitest version without a resolution conflict.
📊 Monitor: Cache, Reasoning, and Image Token Accounting
Separately, @node-llm/monitor 0.4.2 and @node-llm/monitor-otel 0.1.1 now extract and store cache, cache-creation, reasoning, and image token counts—not just prompt/completion totals. Token usage arrives in wildly different shapes depending on where it came from, so extraction is normalized against every naming convention we've seen in the wild:
interface ExtractedTokenUsage {
  prompt: number;
  cached: number;
  completion: number;
  cacheCreation: number;
  reasoning: number;
  image: number;
}
Whether the raw event used the Vercel AI SDK's cachedInputTokens, OpenAI's snake_case, or the OTel GenAI semantic convention's cache_read_input_tokens, the dashboard and time-series aggregation now see the same six numbers—so cost and usage breakdowns stay accurate as reasoning models and prompt caching become the default rather than the exception.
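A normalizer of this kind can be sketched as a lookup over candidate field names, taking the first numeric value found. This is not the monitor package's actual code. The field lists cover the names mentioned above plus a few common OpenAI-style and camelCase variants, and imageTokens / image_tokens are purely hypothetical placeholders.

```typescript
interface ExtractedTokenUsage {
  prompt: number;
  completion: number;
  cached: number;
  cacheCreation: number;
  reasoning: number;
  image: number;
}

type Raw = Record<string, unknown>;

// Return the first finite number found at any of the dot-separated paths, else 0.
function firstNumber(raw: Raw, paths: string[]): number {
  for (const path of paths) {
    let value: unknown = raw;
    for (const key of path.split(".")) {
      if (typeof value !== "object" || value === null) {
        value = undefined;
        break;
      }
      value = (value as Raw)[key];
    }
    if (typeof value === "number" && Number.isFinite(value)) return value;
  }
  return 0;
}

function extractTokenUsage(raw: Raw): ExtractedTokenUsage {
  return {
    prompt: firstNumber(raw, ["promptTokens", "inputTokens", "prompt_tokens", "input_tokens"]),
    completion: firstNumber(raw, ["completionTokens", "outputTokens", "completion_tokens", "output_tokens"]),
    cached: firstNumber(raw, ["cachedInputTokens", "prompt_tokens_details.cached_tokens", "cache_read_input_tokens"]),
    cacheCreation: firstNumber(raw, ["cache_creation_input_tokens"]),
    reasoning: firstNumber(raw, ["reasoningTokens", "completion_tokens_details.reasoning_tokens"]),
    // Hypothetical field names; image token reporting varies by provider.
    image: firstNumber(raw, ["imageTokens", "image_tokens"]),
  };
}
```

Missing fields default to 0, so downstream aggregation can sum the six numbers without null checks. Flat attribute keys that themselves contain dots would need a separate lookup, which this sketch does not attempt.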
Getting Started
npm install @node-llm/core@1.17.0 @node-llm/mcp@0.2.0 @node-llm/orm@0.8.0 @node-llm/testing@0.5.1
npm install @node-llm/monitor@0.4.2 @node-llm/monitor-otel@0.1.1
For the complete list of changes, see the Commit History and CHANGELOG.