ZyVOP

Posted on • Originally published at zyvop.com

Building a Production AI Agent in Node.js: Tool Calling, the ReAct Loop, and Error Handling

Most agent tutorials stop at a toy. A bot that checks the weather, a script that answers one question, then a victory lap in the README.

None of that prepares you for what happens when a tool throws an error, the model calls a function ten times in a row, or you blow past your rate limit mid-conversation. This post builds the other kind of agent, running on Groq.

Groq is worth a look for this specifically: an OpenAI-compatible API, no credit card to start, and inference fast enough that the agent loop feels instant instead of laggy. No framework either — just the SDK and a loop you can read top to bottom. By the end you'll have a customer-support agent that looks up orders, does math, and searches a FAQ, plus the iteration caps, retries, and error handling that keep it from falling over once real traffic hits it.

Full source, including the test suite, is linked at the bottom.

What "production" actually means here

An agent isn't a chatbot with extra steps. It reasons about a task, picks a tool, looks at the result, and decides what to do next, looping until it has an answer. That loop is usually called ReAct: reason, act, observe, repeat.

The demo version of this loop assumes everything goes right. The production version has to handle:

  • A tool that throws, times out, or returns garbage

  • A model that keeps calling tools and never wraps up

  • A 429 or a 500 from the API itself, which is different from a tool failure

  • A model that sends back malformed arguments for a tool call

  • Needing to know what happened after the fact, not just whether it worked

Each of those gets its own piece of the code below.

Why Groq, and why no framework

Groq doesn't train its own models. It runs open-weight ones (Llama, GPT-OSS, Qwen, and others) on custom LPU hardware built for inference speed.

The free tier needs no credit card and gives you every model, gated only by rate limits: roughly 30 requests a minute and a few thousand tokens a minute per model, enforced per account rather than per key. That's plenty for a demo and for prototyping. It's not enough for real user traffic without upgrading.

Because Groq's API mirrors OpenAI's chat completions format, you don't need LangChain.js or any other framework to get tool calling working. The groq-sdk package gives you the same loop with nothing hidden, which matters when something breaks and you need to know exactly where.

The loop itself is under 100 lines. You're about to read all of them.

Defining a tool

Groq's tool format wants a type: "function" wrapper, a name, a description the model reads to decide when to use the tool, and a JSON schema for the arguments. Here's the calculator tool in full:

// src/tools/calculator.js
import { safeCalculate } from "../lib/safeCalculate.js";

export const calculatorTool = {
  definition: {
    type: "function",
    function: {
      name: "calculate",
      description:
        "Evaluate a basic arithmetic expression. Supports +, -, *, /, ^, parentheses, and decimals. " +
        "Always use this instead of doing math yourself, including for things like order totals or discounts.",
      parameters: {
        type: "object",
        properties: {
          expression: {
            type: "string",
            description: "A math expression, e.g. '(42 + 8) * 3' or '199.99 * 0.85'"
          }
        },
        required: ["expression"]
      }
    }
  },

  async handler({ expression }) {
    const result = safeCalculate(expression);
    return String(result);
  }
};


That description line is doing real work. It tells the model to reach for this tool instead of computing totals itself, which matters because language models are unreliable at arithmetic. The tool use overview covers schema design in more depth if you're writing tools with nested objects or several optional fields.

One thing worth flagging: safeCalculate is a small hand-written parser, not eval(). The expression comes from the model, and the model's input ultimately traces back to whatever the user typed.

Piping that into eval() or new Function() means a sufficiently creative prompt is now running arbitrary code on your server. The repo's calculator does the math with a real tokenizer and recursive-descent parser instead: more code, but nothing the model sends is ever executed as JavaScript.
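To make the tokenizer-plus-parser idea concrete, here's a minimal sketch of a recursive-descent evaluator. It is not the repo's safeCalculate: the name safeCalculateSketch, the grammar, and the error messages are my assumptions, chosen only to cover the operators the tool description lists.

```javascript
// Illustrative sketch only; the repo's safeCalculate may be structured differently.

// Split the input into number and operator tokens, rejecting everything else.
function tokenize(expression) {
  const tokens = [];
  const pattern = /\s*(\d+\.?\d*|\.\d+|[-+*\/^()])/y; // sticky: matches exactly at lastIndex
  let position = 0;
  while (position < expression.length) {
    pattern.lastIndex = position;
    const match = pattern.exec(expression);
    if (!match) {
      if (expression.slice(position).trim() === "") break; // only trailing whitespace left
      throw new Error(`Unexpected input at position ${position}`);
    }
    tokens.push(match[1]);
    position = pattern.lastIndex;
  }
  return tokens;
}

function safeCalculateSketch(expression) {
  const tokens = tokenize(expression);
  let index = 0;
  const peek = () => tokens[index];
  const next = () => tokens[index++];

  // expression := term (("+" | "-") term)*
  function parseExpression() {
    let value = parseTerm();
    while (peek() === "+" || peek() === "-") {
      const operator = next();
      const right = parseTerm();
      value = operator === "+" ? value + right : value - right;
    }
    return value;
  }

  // term := unary (("*" | "/") unary)*
  function parseTerm() {
    let value = parseUnary();
    while (peek() === "*" || peek() === "/") {
      const operator = next();
      const right = parseUnary();
      if (operator === "/" && right === 0) throw new Error("Division by zero");
      value = operator === "*" ? value * right : value / right;
    }
    return value;
  }

  // unary := "-" unary | power
  function parseUnary() {
    if (peek() === "-") {
      next();
      return -parseUnary();
    }
    return parsePower();
  }

  // power := primary ("^" unary)?  (right-associative, so 2 ^ 3 ^ 2 is 2 ^ 9)
  function parsePower() {
    const base = parsePrimary();
    if (peek() === "^") {
      next();
      return base ** parseUnary();
    }
    return base;
  }

  // primary := number | "(" expression ")"
  function parsePrimary() {
    const token = next();
    if (token === "(") {
      const value = parseExpression();
      if (next() !== ")") throw new Error("Expected closing parenthesis");
      return value;
    }
    if (token === undefined || !/^[\d.]/.test(token)) {
      throw new Error(`Unexpected token: ${token}`);
    }
    return Number(token);
  }

  const result = parseExpression();
  if (index < tokens.length) throw new Error(`Unexpected token: ${peek()}`);
  return result;
}
```

The point of the design is that the tokenizer only ever emits numbers and a fixed set of operators, so input like process.exit() fails at the first letter instead of reaching anything that could run it.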

The other two tools follow the same shape: get_order_status looks up a mock order by ID (with a deliberately broken ORD-FAIL entry for testing failure handling), and search_knowledge_base does keyword search over a small FAQ array. Full source for both is in the repo.
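For a sense of what "the same shape" means, here's a rough sketch of an order-lookup tool with a rigged failure. The order record, the order_id parameter name, the JSON result format, and the error text are all my assumptions for illustration; check the repo for the real ones.

```javascript
// Hypothetical mock data; the repo's actual records and field names may differ.
const MOCK_ORDERS = new Map([
  ["ORD-1001", { status: "shipped", carrier: "UPS", tracking: "1Z999AA10123456784", eta: "2026-07-02" }]
]);

const orderStatusTool = {
  definition: {
    type: "function",
    function: {
      name: "get_order_status",
      description: "Look up the current status of an order by its ID, e.g. 'ORD-1001'.",
      parameters: {
        type: "object",
        properties: {
          order_id: { type: "string", description: "Order ID such as 'ORD-1001'" }
        },
        required: ["order_id"]
      }
    }
  },

  async handler({ order_id }) {
    // Deliberately broken entry, so the failure path can be exercised on demand.
    if (order_id === "ORD-FAIL") throw new Error("Order lookup service timed out");

    const order = MOCK_ORDERS.get(order_id);
    // An unknown ID is a normal result the model can relay, not an exception.
    if (!order) return JSON.stringify({ found: false, order_id });
    return JSON.stringify({ found: true, order_id, ...order });
  }
};
```

Note the split between the two unhappy paths: a missing order comes back as data, while a simulated outage throws and is left for the agent loop to catch.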

The loop itself

This is the part that drives the agent. It lives in src/agent.js:

async run(userMessage, { history = [], onStep } = {}) {
  const messages = [...history, { role: "user", content: userMessage }];
  const trace = [];
  const usage = { inputTokens: 0, outputTokens: 0 };

  for (let step = 0; step < this.maxIterations; step++) {
    const response = await this.callModel(messages);
    usage.inputTokens += response.usage?.prompt_tokens ?? 0;
    usage.outputTokens += response.usage?.completion_tokens ?? 0;

    const choice = response.choices[0];
    const message = choice.message;

    if (choice.finish_reason !== "tool_calls" || !message.tool_calls?.length) {
      const text = (message.content ?? "").trim();
      messages.push({ role: "assistant", content: message.content ?? "" });
      const finalStep = { step, type: "final", text };
      trace.push(finalStep);
      onStep?.(finalStep);
      return { text, steps: step + 1, usage, trace, history: messages, truncated: false };
    }

    // Groq requires the assistant's tool_calls message echoed back verbatim
    // before the matching tool results — this is the OpenAI-style contract.
    messages.push({ role: "assistant", content: message.content, tool_calls: message.tool_calls });

    for (const toolCall of message.tool_calls) {
      const result = await this.executeTool(toolCall, step);
      trace.push(result.traceEntry);
      onStep?.(result.traceEntry);
      messages.push(result.toolMessage);
    }
  }

  const truncatedStep = { step: this.maxIterations, type: "max_iterations_reached" };
  trace.push(truncatedStep);
  onStep?.(truncatedStep);

  return {
    text: "I wasn't able to finish this within the step limit. Here's what I found before stopping.",
    steps: this.maxIterations,
    usage,
    trace,
    history: messages,
    truncated: true
  };
}


Five things worth calling out:

The iteration cap is the for loop's bound, not an afterthought. Without it, a model stuck in a reasoning loop — or a tool that always returns "try again" — burns through your rate limit until something else stops it. That matters more on a free tier than a metered one: you don't lose money, you lose your remaining requests for the minute, and maxIterations defaults to 8 here to keep that from happening.

finish_reason is checked, not just whether tool_calls exists. Groq's response can technically include a tool_calls array while finish_reason says something else (truncated output, for instance), so the loop checks both before deciding the model wants a tool run.

Tool failures don't throw past this function. executeTool (below) catches whatever the tool does and turns it into a normal tool role message. The model sees the failure as part of the conversation and can retry with different input, try another tool, or tell the user it couldn't complete the request.

history comes in and goes out, and it never contains the system message. The system prompt gets injected fresh inside callModel on every call instead of being stored in history. Otherwise a long conversation would accumulate a duplicate system message every turn.

Every step gets logged through onStep, whether it's a tool call or the final answer. The CLI uses this to print tool calls in verbose mode; a real deployment would pipe it to whatever you use for logging.
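As a sketch of what an onStep consumer might look like, the function below turns the three trace entry shapes from the loop above into one log line each. formatStep is a hypothetical helper of mine, not something the repo is known to ship.

```javascript
// Hypothetical formatter for the trace entries that run() passes to onStep.
function formatStep(entry) {
  switch (entry.type) {
    case "tool_call": {
      const outcome = entry.isError ? "FAILED" : "ok";
      // entry.input is the raw JSON argument string the model sent.
      return `[step ${entry.step}] ${entry.tool}(${entry.input}) -> ${outcome} in ${entry.durationMs}ms`;
    }
    case "final":
      return `[step ${entry.step}] final answer (${entry.text.length} chars)`;
    case "max_iterations_reached":
      return `[step ${entry.step}] stopped: iteration cap reached`;
    default:
      return `[step ${entry.step}] unknown entry type: ${entry.type}`;
  }
}

// Usage with the agent from this post:
//   await agent.run("Where is ORD-1001?", { onStep: (entry) => console.log(formatStep(entry)) });
```

Swapping console.log for a structured logger is the only change a real deployment would need at the call site.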

Tool execution looks like this:

async executeTool(toolCall, step) {
  const name = toolCall.function.name;
  const handler = this.findHandler(name);
  const startedAt = Date.now();
  let content;
  let isError = false;

  try {
    if (!handler) throw new Error(`Unknown tool: ${name}`);

    let input;
    try {
      input = JSON.parse(toolCall.function.arguments || "{}");
    } catch {
      throw new Error(`Model sent invalid JSON arguments for "${name}"`);
    }

    content = await this.withTimeout(handler(input), this.toolTimeoutMs);
  } catch (err) {
    isError = true;
    content = `Tool error: ${err.message}`;
    this.logger.warn?.(`[agent] tool "${name}" failed: ${err.message}`);
  }

  const durationMs = Date.now() - startedAt;

  return {
    traceEntry: { step, type: "tool_call", tool: name, input: toolCall.function.arguments, result: content, isError, durationMs },
    toolMessage: { role: "tool", tool_call_id: toolCall.id, content: String(content) }
  };
}


Two details here are specific to Groq's OpenAI-style format and may surprise you if you're used to Anthropic's tool-calling shape. First, toolCall.function.arguments arrives as a JSON string, not an already-parsed object. The model can and occasionally will send back something that doesn't parse, so that JSON.parse is wrapped in its own try/catch with a clear error message rather than letting it throw a raw SyntaxError up the stack.

Second, there's no is_error flag on the result message the way Anthropic's tool_result blocks have one. A failed tool just returns its error as text inside a normal tool role message, and the model reads it like any other result.

The timeout matters as much as the try/catch. A tool that calls a flaky downstream API can hang for a long time if you let it; withTimeout races the handler against a timer and turns a hang into a clean error after 10 seconds by default.
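The withTimeout helper isn't shown above. A minimal version, assuming it takes a promise and a millisecond budget the way executeTool calls it, could look like this; the error message is my own wording.

```javascript
// Sketch of a timeout wrapper: reject if the promise hasn't settled within `ms`.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms}ms`)), ms);
  });
  // Whichever settles first wins; the timer is always cleared so it can't
  // keep the process alive after a fast tool call.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

One caveat with this pattern: losing the race doesn't cancel the underlying work. A hung HTTP request keeps running in the background unless the tool also accepts something like an AbortSignal.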

API-level failures get separate handling from tool failures, because they mean different things. A 429 or a 500 from Groq's API is transient: retrying with backoff usually fixes it. A 401 means your API key is wrong, and retrying does nothing but waste a request:

async callModel(messages, attempt = 0) {
  const maxRetries = 3; // total attempts including the first call, so at most two retries
  try {
    return await this.client.chat.completions.create({
      model: this.model,
      max_completion_tokens: this.maxTokens,
      messages: [{ role: "system", content: this.systemPrompt }, ...messages],
      tools: this.toolDefinitions(),
      tool_choice: "auto"
    });
  } catch (err) {
    const status = err?.status;
    const retryable = RETRYABLE_STATUS.has(status); // 408, 409, 429, 500, 502, 503, 504
    if (!retryable || attempt >= maxRetries - 1) throw err;
    const delayMs = 500 * 2 ** attempt;
    this.logger.warn?.(`[agent] API call failed (status ${status}), retrying in ${delayMs}ms`);
    await sleep(delayMs);
    return this.callModel(messages, attempt + 1);
  }
}


One more Groq-specific detail: the groq-sdk client retries some of these statuses on its own, twice by default. The client here is constructed with maxRetries: 0 so the SDK's built-in retry doesn't stack on top of this one. Without that, a single rate-limited call could silently balloon to nine real HTTP attempts (three of mine, each of which the SDK would try up to three times: the initial request plus its two retries) instead of three.
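callModel leans on two names it doesn't define, RETRYABLE_STATUS and sleep. The sketch below fills them in the obvious way (the status list is taken from the comment in callModel) and adds two small functions of my own, backoffDelays and worstCaseRequests, just to make the retry arithmetic visible.

```javascript
// Statuses worth retrying, copied from the comment in callModel.
const RETRYABLE_STATUS = new Set([408, 409, 429, 500, 502, 503, 504]);

// Promise-based delay used between attempts.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Delays produced by `500 * 2 ** attempt`. There is one fewer delay than
// attempts, because the last failure is rethrown instead of retried.
function backoffDelays(maxAttempts, baseMs = 500) {
  return Array.from({ length: maxAttempts - 1 }, (_, attempt) => baseMs * 2 ** attempt);
}

// Worst-case HTTP requests when an outer loop wraps an SDK that also retries.
function worstCaseRequests(outerAttempts, sdkRetries) {
  return outerAttempts * (1 + sdkRetries);
}

console.log(RETRYABLE_STATUS.has(429), RETRYABLE_STATUS.has(401)); // true false
console.log(backoffDelays(3));        // [ 500, 1000 ]
console.log(worstCaseRequests(3, 2)); // 9
console.log(worstCaseRequests(3, 0)); // 3
```

With three attempts the loop waits 500ms and then 1000ms, so a persistently failing call gives up after roughly a second and a half of backoff plus request time.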

Wiring it up

src/server.js exposes the agent as POST /api/agent/chat, rate-limited with express-rate-limit at a level that sits comfortably under Groq's free-tier cap even with a few concurrent users. Sessions live in an in-memory Map for the demo — fine for trying this out, not fine once you have more than one server process, at which point Redis is the obvious swap.
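The in-memory session handling can be sketched in a few lines. This is my guess at the general shape, not the code in server.js: createSessionStore and its createId parameter are hypothetical, and in a real server you'd pass something like randomUUID from node:crypto.

```javascript
// Hypothetical in-memory session store keyed by session ID.
function createSessionStore(createId) {
  const sessions = new Map(); // sessionId -> { history }

  return {
    // Reuse a known session; otherwise start a fresh one under a server-generated ID.
    getOrCreate(sessionId) {
      if (sessionId && sessions.has(sessionId)) {
        return { sessionId, session: sessions.get(sessionId) };
      }
      const id = createId(); // never adopt an unknown client-supplied ID
      const session = { history: [] };
      sessions.set(id, session);
      return { sessionId: id, session };
    },
    size: () => sessions.size
  };
}

// Usage inside a request handler, with the agent from this post:
//   const { sessionId, session } = store.getOrCreate(req.body.sessionId);
//   const result = await agent.run(req.body.message, { history: session.history });
//   session.history = result.history;
```

Everything here disappears on restart and isn't shared across processes, which is exactly why the Map stops being enough once you scale past one server.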

src/cli.js is the same agent wired to a terminal readline loop instead, useful for testing prompts and tool behavior interactively without standing up a server.

Both read a GROQ_API_KEY from the environment and default to openai/gpt-oss-120b as the model, overridable with GROQ_MODEL. Get a key at console.groq.com/keys — no credit card needed.

Trying it over HTTP

Here's what hitting the running server looks like. The JSON envelope below is exact: those are the literal field names server.js returns. The reply text, steps count, and usage numbers will vary from run to run, since this is a live model and not a fixture.

A normal order lookup:

curl -X POST http://localhost:3000/api/agent/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "What is the status of ORD-1001?"}'

{
  "sessionId": "a1b2c3d4-5678-90ab-cdef-1234567890ab",
  "reply": "Order ORD-1001 has shipped via UPS, tracking number 1Z999AA10123456784. It's estimated to arrive July 2, 2026.",
  "steps": 2,
  "truncated": false,
  "usage": { "inputTokens": 612, "outputTokens": 47 }
}


steps: 2 means the model called a tool once, then answered — exactly the loop from earlier. Now the same question about the order rigged to fail, reusing the sessionId from above to stay in the same conversation:

curl -X POST http://localhost:3000/api/agent/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "Can you check ORD-FAIL for me?", "sessionId": "a1b2c3d4-5678-90ab-cdef-1234567890ab"}'

{
  "sessionId": "a1b2c3d4-5678-90ab-cdef-1234567890ab",
  "reply": "I wasn't able to look up that order just now — the lookup service timed out. Could you try again in a moment?",
  "steps": 2,
  "truncated": false,
  "usage": { "inputTokens": 701, "outputTokens": 39 }
}


That's the tool-error path from earlier, end to end: the get_order_status handler threw, executeTool caught it, the model got the failure as a normal message instead of a crash, and it answered like a person would — not a stack trace in sight. And the multi-tool case, in one request:

curl -X POST http://localhost:3000/api/agent/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "What is the status of ORD-1001, and if I return it how much of the $89.99 do I get back after a 15% restocking fee?"}'

{
  "sessionId": "f0e1d2c3-4b5a-69d8-c7e6-0a1b2c3d4e5f",
  "reply": "ORD-1001 has shipped via UPS and should arrive by July 2, 2026. If you return it, you'd get back $76.49 after the 15% restocking fee.",
  "steps": 3,
  "truncated": false,
  "usage": { "inputTokens": 845, "outputTokens": 61 }
}


steps: 3 here — one call to get_order_status, one to calculate, then the final answer combining both. That $76.49 isn't the model doing arithmetic in its head: calculate actually runs 89.99 * 0.85 through the real parser from earlier (which returns 76.49149999999999, full floating-point precision and all), and the model rounds that to cents for a reply a person can read.

Testing it without spending a request

The agent loop takes a client in its constructor. In production that's a real Groq instance. In tests, it's a plain object with a scripted chat.completions.create() that returns whatever response you tell it to:

test("handles malformed tool-call arguments from the model without crashing", async () => {
  const client = makeScriptedClient([
    {
      choices: [{
        finish_reason: "tool_calls",
        message: {
          role: "assistant",
          content: null,
          tool_calls: [{ id: "call_3", type: "function", function: { name: "calculate", arguments: "{not json" } }]
        }
      }],
      usage: { prompt_tokens: 10, completion_tokens: 5 }
    },
    {
      choices: [{ finish_reason: "stop", message: { role: "assistant", content: "Sorry, something went wrong." } }],
      usage: { prompt_tokens: 12, completion_tokens: 6 }
    }
  ]);

  const agent = new Agent({
    client,
    model: "openai/gpt-oss-120b",
    tools: [calculatorStub],
    systemPrompt: "test",
    logger: silentLogger
  });

  const result = await agent.run("Send a malformed tool call");

  const toolCallStep = result.trace.find((step) => step.type === "tool_call");
  assert.equal(toolCallStep.isError, true);
  assert.match(toolCallStep.result, /invalid JSON arguments/);
});


That test exists because it's a real failure mode in this format, not a hypothetical one: Groq's models occasionally send back arguments that don't parse cleanly, and the only way to find that out before a user does is to write the test that assumes it'll happen.
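The test above calls makeScriptedClient without showing it. A minimal version, assuming it only needs to replay responses in order, might look like the sketch below. Throwing scripted Error entries and recording requests in calls are additions of mine, handy for retry tests; the repo's helper may work differently.

```javascript
// Sketch of a scripted stand-in for the Groq client. Each call to
// chat.completions.create() consumes the next entry in `script`.
function makeScriptedClient(script) {
  const calls = []; // every request body, in order, for later assertions
  let cursor = 0;

  return {
    calls,
    chat: {
      completions: {
        async create(request) {
          calls.push(request);
          if (cursor >= script.length) {
            throw new Error("Scripted client ran out of responses");
          }
          const entry = script[cursor++];
          // An Error in the script simulates an API failure such as a 429.
          if (entry instanceof Error) throw entry;
          return entry;
        }
      }
    }
  };
}

// Example: a rate-limit error followed by a normal completion.
const rateLimited = Object.assign(new Error("rate limited"), { status: 429 });
const scripted = makeScriptedClient([
  rateLimited,
  { choices: [{ finish_reason: "stop", message: { role: "assistant", content: "ok" } }] }
]);
```

Failing loudly when the script runs dry matters: a loop that makes one call more than expected shows up as a test failure instead of an undefined response.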

The full suite (21 tests across the loop, the calculator parser, and the tools) runs on Node's built-in test runner — no Jest, no extra dependency:

npm test


What the mocked tests actually catch

Worth being precise about what's verified here and what isn't. The 21 tests run against a scripted mock client, so they prove the loop's behavior: a 429 gets retried with backoff, a 401 doesn't get retried at all, a tool that throws gets turned into a clean error message instead of crashing the process, and malformed tool arguments don't take down the server either.

None of that touches Groq's actual servers, and it doesn't need to — that's the point of mocking the client. What it can't tell you is whether your specific model picks the right tool for a given prompt, or how it behaves under real latency.

For that, the repo ships a small smoke-test script. Start the server, then run it against a live key:

npm start                          # terminal 1
bash scripts/live-smoke-test.sh    # terminal 2


It runs nine checks in sequence: a normal order lookup, an unknown order ID, the simulated outage on ORD-FAIL, the calculator, the FAQ search, one turn that needs two tools at once, a follow-up that checks session memory, an invalid request, and a burst of twelve rapid-fire calls to confirm the local rate limiter kicks in. Watching that run once tells you more about how your model handles tool selection than any number of mocked tests can. On Groq's free tier, it costs nothing but a couple of minutes.

If you'd rather watch it think in real time, npm run cli -- --verbose prints every tool call as it happens, which is the faster way to see why the model reached for a given tool.

Taking it further

This is a teaching example. Before anything like it goes near real users:

  • Move sessions from the in-memory Map to Redis or a database

  • Add auth in front of the chat endpoint: right now anyone who can reach the port can spend your rate limit

  • Watch for 429s under real load: the free tier's per-minute cap is easy to hit with more than a couple of concurrent users, and Groq's Developer tier (still no minimum spend) raises that ceiling

  • Trim or summarize long conversation history before it eats your context window

  • Replace the mock order and FAQ data with real lookups
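On trimming history: with the OpenAI-style message format, cutting at an arbitrary index can leave a tool message without the assistant tool_calls message it answers. Here's one rough way to avoid that, under the assumption that a fixed message count is a good enough budget. trimHistory is a hypothetical helper, and a real version would more likely count tokens or summarize.

```javascript
// Keep roughly the last `maxMessages` messages, but only ever start the
// window on a user message so tool results are never separated from the
// assistant tool_calls message they belong to.
function trimHistory(history, maxMessages) {
  if (history.length <= maxMessages) return history;

  let start = history.length - maxMessages;
  // Skip forward past any partial turn at the front of the window.
  while (start < history.length && history[start].role !== "user") start++;

  // If a single turn is longer than the budget, this returns an empty
  // history rather than a broken one.
  return history.slice(start);
}
```

The trade-off is that the window can come out shorter than maxMessages, which is the price of never sending the API a conversation it would reject.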

Get the code

Full source, with the test suite and a setup guide, is on GitHub: https://github.com/zyvop27-cmyk/zyvop-blogs/tree/main/ai-agent-node

Clone it, drop in your own GROQ_API_KEY, and ask it about an order. The agent loop, the tools, and the error handling are all small enough to read in one sitting — which is the point.


