kaushal trivedi

Posted on May 26

Built an AI agent framework, discovered more agents made it worse, and accidentally created cognition infrastructure for AI.

#ai #programming #githubcopilot #chatgpt

I want to tell you about the most surprising thing I've found in the past few weeks of building.
I was running ablation studies on a multi-agent system — comparing different configurations of Planner, Coder, Reviewer, Tester, Verifier agents working together. The hypothesis was obvious: more specialized agents = better results. That's how human teams work, right?
Here's what I actually found:
minimal (Coder → Tester only): 19/20 solved, 27,476 tokens, $0.014
full pipeline (all 5 agents): 18/20 solved, 37,118 tokens, $0.009
review_first ordering: 18/20 solved, 45,591 tokens, $0.009
The reviewer agent costs one solved task (18/20 instead of 19/20) and 9,642 extra tokens (37,118 vs. 27,476). It makes things worse.
I ran this three times across different seeds thinking I'd made a mistake. Same result every time. I then trained a Q-Learning agent on 220 execution trajectories to independently verify — it confirmed that the minimal policy dominates 89% of task states.
More agents. Worse performance. More tokens.
I genuinely did not expect that.

How this started
A few weeks ago I was frustrated by a pattern I kept seeing in autonomous agents: they'd fail at something, you'd restart, and they'd fail at the exact same thing again. No memory. No learning. Every session starts cold.
It felt like hiring someone who forgets everything overnight. Imagine telling your engineer the same bug exists every single morning.
So I asked a weird question: what if memory lived in the environment instead of the agent?
Instead of modifying the agent to have memory, the environment stores every failure and injects it back as context next time. The agent doesn't need to change at all — any LLM, any RL agent, any rule-based system automatically gets memory for free.
That was the insight that became CogniCore.
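Here's a minimal sketch of that idea, assuming nothing about CogniCore's actual internals. `MemoryEnv`, `reset`, `step`, and `naive_agent` are hypothetical names used only to show where the memory lives, and the store is in-process rather than persisted to disk.

```python
from collections import defaultdict


class MemoryEnv:
    """Hypothetical environment wrapper: the memory lives here, not in the agent."""

    def __init__(self):
        # category -> list of actions that failed in earlier attempts
        self.failures = defaultdict(list)
        self.category = None

    def reset(self, category):
        """Start a task and return an observation that includes past failures."""
        self.category = category
        return {"task": category, "failed_before": list(self.failures[category])}

    def step(self, action, succeeded):
        """Record the outcome; failures are kept for later attempts."""
        if not succeeded:
            self.failures[self.category].append(action)


def naive_agent(observation):
    """A stateless agent: it only avoids what the observation says failed."""
    for action in ("guard_fix", "rewrite"):
        if action not in observation["failed_before"]:
            return action
    return "rewrite"


env = MemoryEnv()
first = naive_agent(env.reset("null_handling"))   # no history yet
env.step(first, succeeded=False)                  # the attempt fails
second = naive_agent(env.reset("null_handling"))  # history is injected
print(first, second)
```

The agent function never stores anything itself. Swapping it for an LLM call or an RL policy would leave the memory mechanism untouched, which is the point of putting it in the environment.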

What I built
Over the past few weeks this evolved from a simple memory experiment into something I'm calling NEXUS — a runtime cognition layer for autonomous AI agents.
Here's what it does:
Persistent cross-session memory
Every failure is stored. Every success is stored. When a similar task appears, the agent gets context about what worked and what didn't — not just in this session but across all previous sessions. Forever.
```python
# Agent remembers guard_fix failed 6 times for null_handling
# Automatically suggests rewrite instead
# Without any changes to the agent itself

from cognicore import Memory, ReflectionEngine

mem = Memory()
ref = ReflectionEngine(memory=mem)

# After enough failures...
action, reason, confidence = ref.suggest_override("null_handling", "guard_fix")
# → action="rewrite", confidence=0.87
# → reason="guard_fix failed 6/6 times, rewrite succeeded 3/3"
```
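One way such an override could be derived from stored outcomes is sketched below. This is not the library's implementation: the standalone `suggest_override` function, the `min_failures` threshold, and the record format are assumptions, and it does not try to reproduce the 0.87 confidence value.

```python
from collections import Counter


def suggest_override(history, category, planned_action, min_failures=3):
    """Return (action, reason) or None, given (category, action, ok) records.

    Hypothetical rule: override when the planned action has failed at least
    `min_failures` times with no success in this category, and some other
    action has succeeded at least once.
    """
    tries, wins = Counter(), Counter()
    for cat, action, ok in history:
        if cat == category:
            tries[action] += 1
            wins[action] += int(ok)

    if tries[planned_action] < min_failures or wins[planned_action] > 0:
        return None  # not enough evidence against the planned action

    # Pick the alternative with the highest observed success rate.
    alternatives = [a for a in tries if a != planned_action and wins[a] > 0]
    if not alternatives:
        return None
    best = max(alternatives, key=lambda a: wins[a] / tries[a])
    failed = tries[planned_action]
    reason = (
        f"{planned_action} failed {failed}/{failed} times, "
        f"{best} succeeded {wins[best]}/{tries[best]}"
    )
    return best, reason


history = [("null_handling", "guard_fix", False)] * 6
history += [("null_handling", "rewrite", True)] * 3
print(suggest_override(history, "null_handling", "guard_fix"))
```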

The compounding effect
This is the part that genuinely excites me. The more tasks NEXUS handles, the cheaper and faster it gets:
Week 1: cost per fix $0.05 (no memory, tries everything)
Week 4: cost per fix $0.02 (knows what doesn't work)
Week 8: cost per fix $0.01 (skips failed approaches immediately)
I measured this. It's real. An agent with 6 months of memory on your codebase is fundamentally different from one starting cold — and that difference compounds every single day.
NEXUS multi-agent runtime
This is where it gets interesting. NEXUS coordinates specialized agents:
Planner → decomposes the issue
Coder → generates patches
Tester → validates in sandbox
Memory → checks past failures before each attempt
And based on my ablation research — no Reviewer. The data is clear.
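For an ablation like this, it helps if each policy is just an ordered list of stages, so that dropping an agent is a one-line change. The sketch below assumes that representation; the stage names follow the agents listed in this post, but `POLICIES`, `run_policy`, and `ablate` are hypothetical and the stages are stand-ins that only record that they ran.

```python
def make_stage(name):
    """Build a stand-in stage that only records that it ran."""
    def stage(state):
        state["trace"].append(name)
        return state
    return stage


# The stage order for "full" follows the agent list in the post;
# the real framework's ordering and names may differ.
POLICIES = {
    "minimal": ["coder", "tester"],
    "full": ["planner", "coder", "reviewer", "tester", "verifier"],
}


def run_policy(stage_names):
    """Run stages in order, passing one shared state dict along."""
    state = {"trace": []}
    for stage_name in stage_names:
        state = make_stage(stage_name)(state)
    return state["trace"]


def ablate(policy, dropped):
    """Return a copy of a policy with one stage removed."""
    return [s for s in POLICIES[policy] if s != dropped]


print(run_policy(POLICIES["minimal"]))
print(run_policy(ablate("full", "reviewer")))
```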
Agent Immune System
A DQN-backed threat detector designed to learn to block prompt injection, jailbreaks, and token bomb attacks. The goal is for it to get better with every attack it sees, developing "antibodies" for known threats.
```python
from cognicore.immune import NexusShield

# your_agent is the agent object you already have
agent = NexusShield(agent=your_agent)
# Now protected. Learns from every interaction.
```

Replay and time travel
Every agent decision is event-sourced. You can rewind any task to any step and branch from that point with a different strategy. The RL navigator learns which branches lead to success over time.
```bash
cognicore replay --task abc123 --from-step 3
cognicore branch --task abc123 --step 3 --policy minimal
```
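To make the replay and branch idea concrete, here is a small event-sourcing sketch. It is not how the `cognicore` CLI is implemented; `TaskLog` and its methods are hypothetical, and the events are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TaskLog:
    """Hypothetical append-only log of agent decisions for one task."""
    task_id: str
    events: list = field(default_factory=list)

    def record(self, step, action, result):
        self.events.append({"step": step, "action": action, "result": result})

    def replay(self, upto_step):
        """Rebuild state by folding over events up to and including a step."""
        state = {"applied": [], "last_result": None}
        for event in self.events:
            if event["step"] > upto_step:
                break
            state["applied"].append(event["action"])
            state["last_result"] = event["result"]
        return state

    def branch(self, from_step, new_id):
        """Copy history up to a step into a new log that can diverge."""
        kept = [dict(e) for e in self.events if e["step"] <= from_step]
        return TaskLog(task_id=new_id, events=kept)


log = TaskLog("abc123")
log.record(1, "plan", "ok")
log.record(2, "guard_fix", "tests_failed")
log.record(3, "guard_fix", "tests_failed")
log.record(4, "guard_fix", "tests_failed")

alt = log.branch(from_step=3, new_id="abc123-b")
alt.record(4, "rewrite", "tests_passed")  # try a different strategy at step 4
print(log.replay(4)["last_result"], alt.replay(4)["last_result"])
```

Because state is always rebuilt from the log, the original history stays intact after branching, and both branches can be compared step by step.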
6 enterprise integrations
GitHub Issues auto-trigger (label nexus → auto-fix → PR), CI failure fixer, Slack live updates, Linear integration, scheduled overnight runs, memory-backed PR review.
```bash
cognicore integrations setup
# Interactive wizard connects GitHub, Slack, Linear
# Then just label an issue with 'nexus' and watch it fix itself
```

The benchmark results
Policy comparison (20 tasks, 3 seeds, SWE-style):

| Policy | Solved | Tokens | Cost |
|---|---|---|---|
| minimal | 19/20 (95%) | 27,476 | $0.014 |
| full_pipeline | 18/20 (90%) | 37,118 | $0.009 |
| review_first | 18/20 (90%) | 45,591 | $0.009 |
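The differences relative to the minimal policy can be worked out directly from the numbers above. This snippet only does arithmetic on the reported solve counts and token counts; it adds no new measurements.

```python
results = {
    "minimal": {"solved": 19, "tokens": 27_476},
    "full_pipeline": {"solved": 18, "tokens": 37_118},
    "review_first": {"solved": 18, "tokens": 45_591},
}

base = results["minimal"]
deltas = {}
for name, r in results.items():
    extra = r["tokens"] - base["tokens"]
    deltas[name] = {
        "solved": r["solved"] - base["solved"],  # change in solved tasks
        "tokens": extra,                         # change in token count
        "overhead": extra / base["tokens"],      # relative token overhead
    }
    d = deltas[name]
    print(f"{name:14s} {d['solved']:+d} solved  {d['tokens']:+,d} tokens  ({d['overhead']:+.0%})")
```

In relative terms, the full pipeline uses about 35% more tokens than minimal and review_first about 66% more, each for one fewer solved task.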

RL policy learning:
220 trajectories → Q-Learning → 11,000 updates
Learned: minimal wins 89% of states
Exception: test_first wins for long-description tasks
Honest caveat: these are rule-based agents on curated tasks, not real LLMs on production repos. The architecture is designed for LLM substitution, and we're working on that now. But the orchestration findings held across all three seeds.
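For readers unfamiliar with learning a policy from logged runs, here is a sketch of tabular Q-learning over recorded transitions. Everything specific in it is an assumption for illustration: the two states, the reward values, and the hyperparameters are invented and are not the benchmark's data or settings.

```python
import random
from collections import defaultdict

ACTIONS = ["minimal", "full_pipeline", "test_first"]


def q_learn(transitions, epochs=50, alpha=0.1, gamma=0.9, seed=0):
    """Fit a Q-table by replaying logged (state, action, reward, next_state)."""
    rng = random.Random(seed)
    q = defaultdict(float)  # (state, action) -> estimated value
    for _ in range(epochs):
        batch = transitions[:]
        rng.shuffle(batch)
        for state, action, reward, next_state in batch:
            # next_state None marks the end of a trajectory.
            if next_state is None:
                future = 0.0
            else:
                future = max(q[(next_state, a)] for a in ACTIONS)
            target = reward + gamma * future
            q[(state, action)] += alpha * (target - q[(state, action)])
    return q


def greedy_policy(q, states):
    """Pick the highest-valued action for each state."""
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in states}


# Invented one-step trajectories; the rewards are placeholders, not measurements.
logged = [
    ("short_task", "minimal", 0.9, None),
    ("short_task", "full_pipeline", 0.6, None),
    ("short_task", "test_first", 0.7, None),
    ("long_task", "minimal", 0.5, None),
    ("long_task", "full_pipeline", 0.6, None),
    ("long_task", "test_first", 0.8, None),
]
q = q_learn(logged)
print(greedy_policy(q, ["short_task", "long_task"]))
```

The learned policy is only as good as the logged rewards, so a finding like "minimal wins in most states" depends on how states and rewards were defined for the 220 trajectories.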

The CognitiveMemory system
This is the part I'm most proud of technically. It's a three-layer biological memory model:
```python
import cognicore as cc

cog = cc.CognitiveMemory()

# After 20 experiences...
result = cog.recall(category='null_handling')

# Returns:
#   recommended_action: 'rewrite'
#   confidence: 0.75
#   sources_used: ['episodic', 'semantic', 'procedural']
#   episodic: 3 past null_handling fixes
#   semantic: accuracy=0.75 for this category
#   procedural: rule learned from repetition
```

Working memory holds the last 7 items. Behind it sit the three layers: episodic memory (specific past experiences), semantic memory (category-level patterns), and procedural memory (rules learned from repetition). Each layer contributes to the recommendation. The agent doesn't just remember; it learns rules from repeated experience.
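As a rough illustration of how layers like these could combine into one recommendation, here is a toy version. `LayeredMemory` is a hypothetical stand-in, not CogniCore's `CognitiveMemory`; the rule threshold and the use of category accuracy as confidence are assumptions, and the sample data is chosen to mirror the output shown above rather than to reproduce the library's computation.

```python
from collections import Counter, deque


class LayeredMemory:
    """Hypothetical stand-in for a layered memory; not CogniCore's real class."""

    def __init__(self, working_size=7, rule_threshold=3):
        self.working = deque(maxlen=working_size)  # most recent experiences
        self.episodic = []                         # every experience, in order
        self.rule_threshold = rule_threshold

    def store(self, category, action, ok):
        experience = (category, action, ok)
        self.working.append(experience)
        self.episodic.append(experience)

    def recall(self, category):
        episodes = [e for e in self.episodic if e[0] == category]
        if not episodes:
            return None
        # Semantic layer: category-level success rate.
        accuracy = sum(ok for _, _, ok in episodes) / len(episodes)
        # Procedural layer: an action becomes a rule once it has succeeded
        # at least `rule_threshold` times in this category.
        wins = Counter(a for _, a, ok in episodes if ok)
        rule = next(
            (a for a, n in wins.most_common() if n >= self.rule_threshold), None
        )
        fallback = wins.most_common(1)[0][0] if wins else None
        sources = ["episodic", "semantic"] + (["procedural"] if rule else [])
        return {
            "recommended_action": rule or fallback,
            "confidence": accuracy,
            "sources_used": sources,
        }


memory = LayeredMemory()
memory.store("null_handling", "guard_fix", False)
for _ in range(3):
    memory.store("null_handling", "rewrite", True)
print(memory.recall("null_handling"))
```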

What this could become
I keep thinking about this framing: every infrastructure company starts by solving a problem that everyone has but nobody has built proper tooling for.
AWS solved "I need servers but don't want to manage them."
Docker solved "it works on my machine."
Kubernetes solved "I need to orchestrate containers."
The autonomous agent space right now feels like pre-Docker. Every team is rebuilding memory, retry logic, and orchestration from scratch. Every deployment is fragile. Nobody has won the "cognition infrastructure" layer.
That's what NEXUS is trying to be. Not an agent. Not a wrapper. The layer underneath that makes any agent smarter, cheaper, and more reliable over time.

The honest part
I'm one person. This is Alpha. There are bugs — I've documented four known ones in the repo and I'm fixing them as fast as I can. The immune system doesn't catch prompt injection yet. The SemanticMemory fuzzy matching isn't as good as I want it to be.
But the core architecture works. The memory compounds. The ablation finding is real. The CognitiveMemory recommendation system actually suggests the right action after enough experience.
1,700+ downloads in the first week. Getting good traction on r/reinforcementlearning. Interesting conversations starting with some folks in the agent memory space.

Try it
```bash
pip install cognicore-env

# Quick demo
python -c "
import cognicore as cc

env = cc.make('SafetyClassification-Easy-v1')
agent = cc.AutoLearner()
cc.train(agent=agent, env=env, episodes=30)
score = cc.evaluate(agent=agent, env=env, episodes=5)
print(f'Score: {score:.2%}')
"

# Or open the dashboard
cognicore ui
```
GitHub: github.com/Kaushalt2004/cognicore-my-openenv

The reviewer finding still surprises me every time I look at it. I expected the paper to say "multi-agent coordination improves performance." Instead it says "be very careful what agents you add."
I think that's a more interesting finding honestly.

Top comments (2)

Harjot Singh • May 31

"More agents made it worse" is the most honest and useful thing anyone can say about multi-agent, because the naive intuition (more agents = more capability) is wrong - past a point, each added agent adds coordination overhead, handoff loss, and more surface for one agent's bad output to poison the rest. Agents don't sum; they interact, and uncontrolled interaction degrades. The teams who learn this build FEWER, better-coordinated agents, not more.

The "accidentally created cognition infrastructure" turn is telling - it means the value wasn't the agents, it was the structure you had to build to make them not step on each other (shared state, handoff contracts, coordination logic). The infra IS the product; the agents are interchangeable workers on top of it. That's exactly the lesson behind Moonshift (a multi-agent pipeline that ships a prompt to a deployed SaaS) - the orchestration/coordination layer is the hard-won value, not the agent count, which is what keeps a build coherent AND ~$3 flat. Genuinely sharp writeup - the "fewer agents, better orchestration" lesson is earned, not theorized. At what agent count did it start degrading for you, and what coordination primitive fixed it? That inflection point is the practical gold.

kaushal trivedi • Jun 5

That's exactly what surprised me. I started by adding more specialized agents because it felt like the obvious path, but after benchmarking, the coordination cost grew faster than the gains.

The biggest improvement came when I shifted focus from adding agents to improving the runtime around them: memory, replay, reflection, and execution history. Once agents could learn from previous failures, the need for additional reviewers dropped significantly.

In my tests, performance started degrading once I added reviewer-style agents into the loop. Solve rates decreased while token usage increased. The most effective coordination primitive ended up being shared execution memory rather than additional agent handoffs.

I'm still running larger benchmarks, but the pattern so far has been: fewer agents, stronger memory, better results.