Ashmita Khuntia

Posted on Jun 27

How Hindsight and Four Agents Changed My Market Analysis

#ai #software #development #agents

The first version of my market analysis pipeline could write a sharp, well-reasoned brief — and then forget it ever existed the moment the process exited. Ask it the same question a week later and you'd get a different opinion, with zero awareness that it had ever weighed in before. That's not a small bug. For a system whose entire premise is "tell me what's happening in this market," having no memory of what it told you yesterday is disqualifying.

This is the story of why I ended up treating memory as a first-class part of the architecture instead of an afterthought, and what changed once I did.

What the system does

The project is a Strategic Market Intelligence System (SMIS): you give it a query — a company, a sector, a technology trend — and it runs that query through four agents in a fixed sequence, each one handing its output to the next:

Market Intelligence Agent -> Trend Agent -> Insights & Recommendation Agent -> Critic Agent

It's built on LangChain with Groq running llama-3.3-70b-versatile as the LLM, Tavily for web search, and a Streamlit front end that streams the pipeline's print() output so you can watch each agent finish its stage in real time. There's no agent framework deciding which tool to call — I deliberately kept this as a fixed pipeline with explicit handoffs (pipeline.py just calls each function in order), because for this use case I wanted predictable, debuggable stages, not an LLM improvising tool calls.

The orchestration is almost embarrassingly simple:

state = market_intelligence_agent(query)
state = trend_agent(state)
state = insights_recommendation_agent(state)
state = critic_agent(state)

Each function takes a dict, adds its output, and returns the expanded dict. No graph library, no state machine. That simplicity is intentional — it made the next decision easier to reason about.

The core problem: a Critic with no consequences

Agent 1 searches the web and summarizes raw signals. Agent 2 reads the top sources in full and pulls out recurring patterns. Both are stateless and have no opinion about anything beyond the current run — which is fine, that's their job.

Agent 3 is where it gets interesting. It's supposed to turn trend analysis into one actionable recommendation. Agent 4, the critic, then evaluates that recommendation and returns a verdict: APPROVED, NEEDS REVISION, or REJECTED.

Here's the problem I ran into early on: the Critic's verdict had nowhere to go. It got displayed in the Streamlit UI, the user read it, and then it evaporated. If the Critic rejected a recommendation about, say, overstating EV battery supply chain risk, that judgment had zero effect on what the system would say the next time someone asked about EV battery supply chains. The system could — and did — make the same flagged mistake in back-to-back runs, because nothing about the rejection persisted anywhere the next run could see it.

That's the gap Hindsight closes. It's not in the system for atmosphere — it connects agent 4's output back into agent 3's input, across separate, unrelated invocations of the pipeline.

Where memory actually lives in the code

I kept the Hindsight calls out of the agent functions themselves and behind a small set of wrappers in tools.py, so the agents read like plain business logic and the memory client is a swappable implementation detail:

Three operations, three distinct jobs: retain writes a fact, recall does similarity-style lookup over what's already stored, and reflect is the one I actually find most useful — it doesn't just return matching memories, it synthesizes an answer to a judgment question across all of them. That distinction matters more than it looks like it should. A plain vector lookup tells you "here are five things that are semantically close to your query." reflect lets me ask "has anything like this been wrong before?" and get a direct answer instead of five paragraphs I'd have to summarize myself in another LLM call. The Hindsight docs describe this as the difference between retrieval and reasoning over memory, and in practice that's exactly the gap it filled for me — I was about to build a second summarization pass on top of recall results before I noticed reflect already did that.

Only two of my four agents touch any of this. The Insights & Recommendation Agent reads:

And the Critic writes, right after it renders its own verdict:

That's the whole loop. Agent 4 writes the verdict back as a tagged memory; agent 3, on some future run, recalls and reflects on it before generating a new recommendation. The Market Intelligence and Trend agents stay completely stateless on purpose — they're doing fresh retrieval and extraction every time, which is correct for them. Memory belongs at the judgment layer, not the retrieval layer.

What this actually looks like in use

Run the pipeline once on "EV battery supply chain risk in 2026" and the Critic might come back with NEEDS REVISION — flagging that the recommendation overstates near-term supply constraints based on thin sourcing. That verdict gets written to memory with a tag like topic:ev-battery-supply-chain-risk-in-2026.

Run the same query, or a close variant, weeks later. This time recall_past_insights surfaces that prior critique as plain text, and reflect_on_findings answers the cautionary question directly: yes, a similar recommendation was flagged before, and here's why. That gets folded straight into the system prompt for agent 3, which is explicitly instructed to avoid repeating flagged mistakes. The recommendation that comes out the second time is shaped by the first run's failure — not because I hand-wrote a rule about EV batteries, but because the Critic's judgment became durable input instead of disposable output.

That's the behavior I actually wanted from the start and didn't get until memory was wired into the loop: the system gets harder to fool the same way twice.

Lessons learned

Memory belongs where judgment happens, not everywhere. I was tempted to wire Hindsight into all four agents "for consistency." It would have been mostly wasted writes — retrieval agents don't benefit from remembering last week's search results, and "trend" detection should ideally be its own beneficiary of historical memory too, which is a gap I still want to close. The Critic-to-Recommender loop is where memory paid for itself immediately.
Distinguish recall from reflect early. If I'd only had a recall-style lookup, I'd have built a second LLM call just to summarize the recalled memories into a judgment. Having reflect available as a separate primitive in Vectorize's agent memory model saved me from re-implementing synthesis logic that the memory layer should own.
A verdict with no write-back is just UI decoration. The Critic agent existed for weeks before it actually changed anything. Adding retain_insight() after the critique was a three-line change, but it's the line that turned "the system has an opinion" into "the system has a memory of its opinion."
Keep the memory client behind a thin wrapper. Routing every Hindsight call through retain_insight, recall_past_insights, and reflect_on_findings in tools.py rather than calling the client directly from inside agent logic meant I could test, mock, or eventually swap the backing memory implementation without touching agents.py at all.
Tag aggressively, even with simple slugs. The topic tag on every retained memory (topic:ev-battery-supply-chain-risk-in-2026) is unglamorous, but it's what makes recall scoped instead of noisy. Without it, every recall call would be pulling from the entire memory bank instead of the relevant slice.

The pipeline itself is still a straight line — four functions, called in order, no branching logic, no agent framework deciding what happens next. What changed is that the line now has a past. The Critic's job stopped being a one-off opinion and started being a constraint the next run has to actually reckon with — which, for a system whose whole point is giving advice about markets that don't sit still, is the part that was missing the whole time.

Top comments (1)

Justin • Jun 27

This was an excellent read. I love your section on what you learned too, especially no. 1, "Memory belongs where judgment happens, not everywhere" - I'm stealing this from you.

If you don't mind, I'd like to share a lesson from my own project: you're working with lower-level persistence objects; if you don't mind more complexity, I'd consider Neo4j and graphing these together, merging them based on fact edges - so you can get finer accuracy with recall, and also allow context to scale deeper as you store more. Good luck!