I Used Hindsight to Make My Groq Agent Decisions Auditable — Here's What That Actually Looks Like

Gehini Busarapalli — Mon, 18 May 2026 12:44:59 +0000

The hardest part of running LLMs inside a production pipeline isn't the inference. It's figuring out, three hours later, why the model classified a specific user as AT_RISK when you expected POWER_USER, and what that decision caused downstream. Groq gives you fast inference. It doesn't give you memory. I added Hindsight to fill that gap, and the difference between debugging with it and without it is large enough that I'd wire it in before writing a single prompt.

What the Pipeline Looks Like From the LLM's Perspective

VORTEX uses Groq (llama3-70b-8192) in two places: Agent 2 classifies user intent and produces a score from 0–100, and Agent 3 generates a personalized email draft. Both agents receive a structured activity atom as input and return structured JSON as output.

Agent 2's prompt looks roughly like this:

// Agent 2 — Intent Architect: Groq call
const prompt = `
You are an intent classification engine for a B2B SaaS product.

Given this user activity atom, return a JSON object with:
- intent_score: integer 0-100
- tier: "POWER_USER" | "AT_RISK" | "PASSIVE"  
- urgency: "HIGH" | "MEDIUM" | "LOW"
- primary_pain: string describing the core friction point

Activity atom:
${JSON.stringify(atom, null, 2)}

Return only valid JSON. No explanation.
`;

const response = await fetch('https://api.groq.com/openai/v1/chat/completions', {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${GROQ_API_KEY}`,
    'Content-Type': 'application/json',   // the request body is JSON
  },
  body: JSON.stringify({
    model: 'llama3-70b-8192',
    messages: [{ role: 'user', content: prompt }],
    temperature: 0.1,   // low temperature for consistent scoring
    max_tokens: 200,    // caps the completion; the JSON reply is short
  })
});

temperature: 0.1 is the most important parameter here. Intent scoring needs to be repeatable: the same activity atom should produce the same score, or very nearly the same, on repeated runs. A low temperature does not make the model strictly deterministic, but a high temperature introduces variance that makes the scores unreliable as routing inputs. With 0.1, the model is consistent enough that Agent 7's threshold logic (≥ 80 = HOT_LEAD) produces predictable outcomes.
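Whether 0.1 is "consistent enough" can be checked rather than assumed. The sketch below is a hypothetical helper, not part of VORTEX: it takes the scores from several repeated runs on one atom and reports whether any of them would have crossed the routing threshold. The function name and the sample scores are made up for illustration; only the threshold of 80 comes from the text above.

```javascript
// Hypothetical helper: summarize repeated scores for a single activity atom.
// The default threshold of 80 is Agent 7's HOT_LEAD cutoff described above.
function summarizeScoreStability(scores, threshold = 80) {
  if (scores.length === 0) {
    throw new Error('need at least one score');
  }
  const min = Math.min(...scores);
  const max = Math.max(...scores);
  // A routing flip means some runs landed at or above the threshold and some below it.
  const flipsRouting = min < threshold && max >= threshold;
  return { min, max, spread: max - min, flipsRouting };
}

// Made-up scores from five repeated runs on the same atom:
console.log(summarizeScoreStability([91, 90, 91, 92, 91]));
```

A small spread that never straddles the threshold is what the routing logic actually depends on; a spread of a few points near 80 would be a problem even at low temperature.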

The Problem Hindsight Solves

Before Hindsight, when a lead didn't get a Slack alert, my debugging process was:

1. Check Firestore leads collection — see the current status field
2. Check activity_feed collection — see which agents fired
3. Check agent_logs collection — read each agent's output by timestamp
4. Manually reconstruct the sequence from three separate documents
5. Realize the timestamps are in different formats across agents
6. Give up and re-trigger the event to watch it live

This process took 15–20 minutes for a simple routing failure. The root issue was that I had logs, not memory. Logs tell you what happened in isolation. Agent memory tells you what happened in sequence, causally linked, for a specific input.

Hindsight stores each agent's contribution keyed by lead ID, in order, with the full input and output at each step. Reconstructing the chain for any lead is one query:

// Retrieve full decision chain for a lead
const chain = await hindsight.recall({ key: leadId });

// Returns ordered array of agent decisions:
// [
//   { agent: 'behavioral_scout', input: atom, output: atom, timestamp },
//   { agent: 'intent_architect', input: atom, output: { intent_score: 91, tier: 'POWER_USER' }, timestamp },
//   { agent: 'persona_scriptwriter', input: scoredLead, output: { subject, body }, timestamp },
//   { agent: 'executive_router', input: scoredLead, output: { tier: 'HOT_LEAD', actions: ['slack', 'email'] }, timestamp }
// ]

That's the full reasoning chain, in order, with inputs and outputs at every step. What used to take 20 minutes takes seconds.
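To make the storage model concrete, here is a minimal in-memory stand-in for "keyed by lead ID, in order, with input and output at each step". It is not the Hindsight client and says nothing about how Hindsight works internally; the class name is hypothetical, and the store/recall shapes simply mirror the calls shown in this post.

```javascript
// Hypothetical in-memory stand-in, NOT the Hindsight client.
// It only illustrates the "ordered entries per lead ID" model.
class DecisionChainStore {
  constructor() {
    this.chains = new Map(); // leadId -> array of entries in insertion order
  }

  store({ key, agent, data }) {
    // One timestamp format for every agent avoids the mixed-format problem above.
    const entry = { agent, ...data, timestamp: new Date().toISOString() };
    if (!this.chains.has(key)) {
      this.chains.set(key, []);
    }
    this.chains.get(key).push(entry);
    return entry;
  }

  recall({ key }) {
    // Return a shallow copy so callers cannot reorder or truncate the stored chain.
    return [...(this.chains.get(key) ?? [])];
  }
}
```

The point of the shape is that ordering and grouping happen at write time, so reading a chain back never requires joining separate collections by timestamp.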

What the Debate Log Actually Shows

The Hindsight chain surfaces in the dashboard as the Debate Log — a terminal-style view that replays each agent's contribution for a selected lead:

// AgentActivity.jsx — maps Hindsight chain to display lines
const allLines = DEBATE_LOG.flatMap(block => [
  {
    isHeader: true,
    agent:    block.agentName,
    time:     block.time,
    agentId:  block.agent,
  },
  ...block.lines,
]);

For a lead with intent_score: 91 — a user who hit their API export limit after a 47-minute session with 3 teammates invited — the Debate Log renders this:

[14:03:01] > AGT-01 — BEHAVIORAL SCOUT
  Event: api_limit_hit · Session: 47 min
  API calls today: 98 / 100
  Teammates invited: 3 — adoption signal detected
  → Routing to Executive Router

[14:03:02] > AGT-02 — INTENT ARCHITECT  
  Invoking Groq llama3-70b-8192...
  Classification: POWER_USER
  Intent score: 91 / 100 · Urgency: HIGH
  Primary pain: USAGE_LIMIT

[14:03:03] > AGT-03 — PERSONA SCRIPTWRITER
  Subject: You hit the Data Export limit — here's how to unblock your team
  Body: 178 words · Personalization tokens: 4

[14:03:05] > AGT-07 — EXECUTIVE ROUTER
  Score 91 ≥ 80 → HOT_LEAD tier
  ✓ Slack fired → #sales
  ✓ Email queued for approval
  Pipeline complete · 4.2s

Every line in this output came from Hindsight. The sales rep reading the Slack alert can pull up this log and see exactly why they're being notified about this lead, what the model saw, and what it decided.

The Mistake: Storing Predictions Instead of Inputs

My first Hindsight integration stored the model's output, not its input:

// Wrong — stores the prediction, not what produced it
await hindsight.store({
  key: leadId,
  agent: 'intent_architect',
  data: {
    intent_score: result.intent_score,
    tier: result.tier,
  }
});

This is useless for debugging. When the score is wrong, you need to know what the model saw — the full activity atom — not just what it returned. Storing only the output tells you the model was wrong. It doesn't tell you why.

The correct version stores the full input alongside the output:

// Correct — stores input + output so you can reconstruct the reasoning
await hindsight.store({
  key: leadId,
  agent: 'intent_architect',
  data: {
    input: atom,               // what the model received
    prompt_tokens: usage.prompt_tokens,
    completion_tokens: usage.completion_tokens,
    output: {
      intent_score: result.intent_score,
      tier: result.tier,
      urgency: result.urgency,
      primary_pain: result.primary_pain,
    },
    model: 'llama3-70b-8192',
    temperature: 0.1,
  }
});

Now when a score looks wrong, I can pull the input atom and re-run the prompt manually against Groq to reproduce the result. Without the input stored, reproduction is impossible.

Why Groq Specifically

The latency profile matters for this architecture. Agent 7 calls Agent 2 and Agent 3 synchronously — it waits for both before updating Firestore and firing Slack. If each LLM call takes 8–10 seconds (typical for a large model on a slower provider), the total pipeline time for a HOT lead is 20+ seconds. That's too slow to feel real-time on the dashboard.

Groq's inference for llama3-70b-8192 runs at roughly 800 tokens per second. Agent 2's completion is under 200 tokens. Agent 3's email draft is under 250 tokens. End-to-end LLM time is around 1.5–2 seconds per call, which keeps total pipeline time under 5 seconds.
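A back-of-the-envelope check on those numbers, using only the figures quoted above. The sketch covers output generation alone; the constant names are illustrative, and the token counts are the stated upper bounds rather than measurements.

```javascript
// Figures quoted in the text above; treat them as estimates, not benchmarks.
const TOKENS_PER_SECOND = 800;
const AGENT_2_MAX_TOKENS = 200;
const AGENT_3_MAX_TOKENS = 250;

// Time to generate the completion only, ignoring everything else in the call.
function generationSeconds(completionTokens, tokensPerSecond = TOKENS_PER_SECOND) {
  return completionTokens / tokensPerSecond;
}

console.log(generationSeconds(AGENT_2_MAX_TOKENS)); // 0.25
console.log(generationSeconds(AGENT_3_MAX_TOKENS)); // 0.3125
```

Generation is therefore a small part of the 1.5 to 2 seconds quoted per call. The remainder presumably goes to prompt processing, network round trips, and queueing, none of which this sketch models.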

The tradeoff: Groq is fast but the model selection is limited compared to providers like Anthropic or OpenAI. For intent classification and email generation at low temperature, llama3-70b-8192 is sufficient. For tasks requiring more nuanced reasoning or longer context, you'd want to evaluate other options — but you'd also be accepting higher latency.

Prompt Engineering for Consistent JSON Output

Getting Groq to return valid, parseable JSON consistently required a few specific practices:

Explicit schema in the prompt. Describing the exact field names and types in the prompt reduces hallucinated field names:

// Good — explicit schema
`Return a JSON object with exactly these fields:
{ "intent_score": <integer 0-100>, "tier": <"POWER_USER"|"AT_RISK"|"PASSIVE">, ... }`

// Bad — vague
`Analyze this and return a JSON with the user's intent score.`

"Return only valid JSON. No explanation." Without this instruction, llama3-70b-8192 frequently wraps the JSON in markdown code fences or adds a preamble sentence. The instruction removes most of both, but it is not a guarantee, which is why the parsing step still strips fences.

Low temperature (0.1). Reduces variance in field naming and value ranges. At higher temperatures, the model occasionally returns "score" instead of "intent_score", or returns a string where you expected an integer.

Parse defensively in the Code node:

// Agent 2 — Parse Intent (n8n Code node)
const raw = $input.first().json.choices[0].message.content;

// Strip markdown fences if present despite instructions
const cleaned = raw.replace(/```json\n?|\n?```/g, '').trim();

let parsed;
try {
  parsed = JSON.parse(cleaned);
} catch (e) {
  // Log to Hindsight before throwing
  await hindsight.store({ key: leadId, agent: 'intent_architect', data: { error: e.message, raw } });
  throw new Error(`JSON parse failed: ${e.message}`);
}

Logging parse failures to Hindsight before throwing means failed runs are still queryable. Without this, a JSON parse error produces a gap in the decision chain that's invisible unless you're watching the n8n execution log in real time.
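JSON.parse succeeding does not mean the payload is usable: a response with "score" instead of "intent_score", or a string where an integer belongs, parses without error. A minimal validation sketch, assuming the four fields and allowed values listed in Agent 2's prompt, could run right after parsing. The function name is hypothetical.

```javascript
// Allowed values taken from the Agent 2 prompt shown earlier.
const TIERS = ['POWER_USER', 'AT_RISK', 'PASSIVE'];
const URGENCIES = ['HIGH', 'MEDIUM', 'LOW'];

// Hypothetical helper: returns a list of problems; an empty list means the result is usable.
function validateIntentResult(parsed) {
  if (parsed === null || typeof parsed !== 'object' || Array.isArray(parsed)) {
    return ['result is not a JSON object'];
  }
  const errors = [];
  const score = parsed.intent_score;
  if (!Number.isInteger(score) || score < 0 || score > 100) {
    errors.push('intent_score must be an integer from 0 to 100');
  }
  if (!TIERS.includes(parsed.tier)) {
    errors.push(`tier must be one of ${TIERS.join(', ')}`);
  }
  if (!URGENCIES.includes(parsed.urgency)) {
    errors.push(`urgency must be one of ${URGENCIES.join(', ')}`);
  }
  if (typeof parsed.primary_pain !== 'string' || parsed.primary_pain.trim() === '') {
    errors.push('primary_pain must be a non-empty string');
  }
  return errors;
}
```

A non-empty error list can be stored alongside the raw response in the same way as a parse failure, so schema drift shows up in the decision chain instead of surfacing later as a routing bug.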

What I'd Add Next

Hindsight namespaces per agent. Currently all entries go into a single workspace queryable by lead ID. Fleet-level analytics — how often does Agent 2 return each tier, what's the score distribution, how has it changed as the prompt evolved — require per-agent namespaces. That's the next thing I'd add.

Input hashing for response caching. Groq supports prompt caching for repeated prefixes, but that only speeds up processing of a shared prompt prefix; the completion still runs. If two leads have identical activity atoms (same event type, same feature, same score), the second Groq call is redundant. Hashing the input atom and checking an application-side cache before calling Groq would reduce both latency and API costs for common event patterns.
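A minimal sketch of that cache, assuming the atom is plain JSON data. For simplicity it uses the canonical JSON string itself as the cache key; a digest of that string (SHA-256, for example) would serve the same purpose with shorter keys. Both function names are hypothetical, and `classify` stands in for whatever function performs the Groq call.

```javascript
// Build a canonical JSON string: object keys are sorted recursively so that
// two atoms with the same content but different key order map to the same key.
function canonicalize(value) {
  if (Array.isArray(value)) {
    return `[${value.map(canonicalize).join(',')}]`;
  }
  if (value !== null && typeof value === 'object') {
    const fields = Object.keys(value)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${canonicalize(value[k])}`);
    return `{${fields.join(',')}}`;
  }
  return JSON.stringify(value);
}

// Hypothetical wrapper: caches whatever `classify` returns, keyed by atom content.
function withAtomCache(classify) {
  const cache = new Map();
  return function cachedClassify(atom) {
    const key = canonicalize(atom);
    if (!cache.has(key)) {
      const result = classify(atom);
      cache.set(key, result);
      // If the call is async and fails, drop the entry so a later retry can succeed.
      Promise.resolve(result).catch(() => cache.delete(key));
    }
    return cache.get(key);
  };
}
```

Two caveats apply. Any field that is unique per event, such as a timestamp or user ID, has to be removed from the atom before keying or the cache never hits. And the key should include a prompt version, otherwise a prompt change keeps serving results from the old prompt.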

Structured evals against the Hindsight log. Every stored input/output pair in Hindsight is a potential eval case. Running the current prompt against historical inputs and comparing outputs is how you know whether a prompt change improved or regressed classification quality. Right now that comparison is manual.
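The comparison step of such an eval is small. The sketch below assumes each case pairs the output stored in the decision log (`expected`) with what the current prompt returns for the same stored input (`actual`); producing `actual` means re-running the Groq call, which is left out here. The function name and the pair shape are hypothetical.

```javascript
// pairs: [{ expected, actual }]
//   expected: output previously stored for an input
//   actual:   output of the current prompt for that same input
function summarizeEval(pairs) {
  if (pairs.length === 0) {
    throw new Error('no eval cases supplied');
  }
  const tierMismatches = pairs.filter((p) => p.actual.tier !== p.expected.tier);
  const totalScoreDelta = pairs.reduce(
    (sum, p) => sum + Math.abs(p.actual.intent_score - p.expected.intent_score),
    0
  );
  return {
    cases: pairs.length,
    // Fraction of cases where the tier did not change.
    tierAgreement: (pairs.length - tierMismatches.length) / pairs.length,
    // Average absolute change in intent_score across all cases.
    meanScoreDelta: totalScoreDelta / pairs.length,
    tierMismatches,
  };
}
```

Stored outputs are not ground truth, so a mismatch is a prompt to look at the case, not proof of a regression. Cases a human has reviewed and labeled are the ones that can actually say whether a prompt change helped.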

Takeaways

Store inputs, not just outputs. The output tells you what the model decided. The input tells you why. Without both, failed runs are not reproducible.

Low temperature is not optional for scoring pipelines. If the model's output drives routing decisions, variance is a bug. temperature: 0.1 keeps the scoring consistent enough to trust the thresholds.

Log parse failures to memory before throwing. A JSON parse error is not just a code failure — it's a data point about prompt reliability. Storing it in Hindsight means you can query how often it happens and under what inputs.

Audit trails are not optional when LLMs make decisions. A sales rep acting on a Slack alert needs to understand why they're being alerted. A developer debugging a misclassified lead needs to reproduce the model's reasoning. Agent memory is what makes both possible.

Closing

Fast inference without observability is just fast failure. Groq gets the pipeline under 5 seconds end-to-end. Hindsight makes every decision in that pipeline inspectable, reproducible, and queryable by lead ID. The combination is what makes the system trustworthy enough to act on — not just fast enough to impress in a demo.

DEV Community: Gehini Busarapalli