DEV Community: Shashi Jagtap

Introducing SuperClaw: Red-Team OpenClaw Agents Before They Red-Team You

Shashi Jagtap — Mon, 02 Feb 2026 00:33:59 +0000

The couple of tools making waves in the world right now, OpenClaw and the social network of AI agents like moltbook, Autonomous AI agents are no longer experimental toys. They’re being wired directly into personal machines, cloud services, internal tools, and production workflows. In the rush to explore what agentic systems can do, many teams/people are skipping a step that traditional software learned the hard way: security validation before deployment.

Over the past few days, you watched the OpenClaw ecosystem grow rapidly. Developers are building agents by giving them access to almost everything. Some of these agents run entirely on local models, while others rely on cloud-hosted LLMs. Both approaches are powerful, but once an agent starts interacting beyond a tightly controlled environment, the risk profile changes in ways that are easy to underestimate. This is where SuperClaw comes in.

SuperClaw is an open-source security testing and red-teaming framework designed specifically for autonomous AI agents. Its purpose is simple: help you understand how your agent behaves under adversarial conditions before it touches sensitive data or connects to untrusted systems.

The Problem We’re Saw: Local OpenClaw is Ok But moltbook?

Agent developer tools like OpenClaw make it easy to grant broad permissions. Often, those permissions are given early, “just to get things working,” and never revisited. Agents become long-lived, accumulate memory, and evolve through prompt changes, skills, and configuration tweaks. Over time, behavior can drift in ways that are difficult to reason about, especially when the agent is exposed to inputs you do not fully control.

A growing concern is the trend of connecting OpenClaw agents to external agent networks, particularly moltbook, which presents itself as a social network for AI agents. From a security perspective, this introduces a fundamentally new threat model. An agent is no longer just responding to a user or a trusted system. It is ingesting content generated by other autonomous agents, with unknown goals, unknown safeguards, and no meaningful trust boundary.

Untrusted content can influence an agent in subtle ways. Prompt injection does not always look like an obvious exploit. Instructions can be hidden in benign-looking text, spread across multiple turns, or encoded to evade simple filters. Because agents can reason, plan, and act, a successful manipulation can lead not just to bad output, but to real actions: tool misuse, data exposure, or policy bypass.

When agents interact with other agents, these risks can cascade. A compromised or poorly designed agent can influence others, amplifying the impact across an entire network. Once an agent has been exposed, it may be impossible to fully reconstruct what it has seen, learned, or internalized.

The uncomfortable reality is this: connecting high-privilege agents to untrusted environments without security testing is dangerous.

Why Traditional Security Approaches for OpenClaw

Most existing security tools were built for static systems. They assume deterministic behavior, short-lived processes, and clear request-response boundaries. Autonomous agents break those assumptions. They reason over time, make decisions based on context, and adapt behavior dynamically.

Securing agents requires a different approach. You need to test how an agent behaves, not just how it’s configured. You need to see what tools it tries to call, what data it attempts to access, and how it responds when the input is intentionally adversarial.

That gap is what SuperClaw is designed to fill.

What SuperClaw Does

SuperClaw performs scenario-driven, behavior-first security testing on real agents. It generates adversarial scenarios, executes them against your agent in a controlled environment, captures evidence such as tool calls and artifacts, and scores the results against explicit security contracts.

The output is not a vague warning or a pass/fail badge. It’s an evidence-led report that shows what happened, why it matters, and what to fix. Reports can be generated in HTML for human review, JSON for automation, or SARIF for CI and GitHub code scanning.

Just as importantly, SuperClaw is built with guardrails. It runs in local-only mode by default, requires explicit authorization for remote targets, and treats automated findings as signals that must be verified manually. This is a red-teaming tool, not an exploitation framework.

SuperClaw does not generate agents. It does not operate them in production. It exists solely to help you understand risk before deployment.

A Clear Warning About Agent Networks

It’s worth stating plainly: do not upload or connect privileged OpenClaw agents to Moltbook or similar agent networks without red-teaming them first.Agent-to-agent environments dramatically expand the attack surface. They combine untrusted input, mutable behavior, and long-lived state in ways we are only beginning to understand. If your agent has access to personal data, internal systems, or execution tools, exposing it without testing is a gamble. SuperClaw gives you a way to evaluate that risk before it becomes an incident.

Who SuperClaw Is For

SuperClaw is built for developers and security teams who are serious about deploying autonomous agents responsibly. SuperClaw helps you ask the right questions before it’s too late.It is especially useful before granting new permissions, before connecting to external services, and before deploying long-lived agents into real environments.

Conclusion

Autonomous agents like OpenClaw are powerful. They are also unpredictable in ways traditional software is not.If you’re building with OpenClaw or similar frameworks, don’t assume that “working” means “safe.” Don’t trust unvetted agent networks by default. And don’t skip security testing just because the system feels experimental. You can use them in the private envirnment but think hard before uploading to the external services.
Red-team your agents before they red-team you. SuperClaw is open source and available now.

Introducing ACP Bridge to Amp Code

Shashi Jagtap — Sat, 31 Jan 2026 12:25:49 +0000

Superagentic AI is proud to launch acp-amp, an open source adapter that bridges Amp Code to the Agent Client Protocol. We loved using Amp Code so much that we wanted to integrate it with SuperQode, our own agentic quality engineering product. Now you can run Amp inside Zed, Jetbrains IDE, SuperQode, and any ACP compatible client.

The Agent Client Protocol (ACP) is an emerging standard that allows AI coding agents to communicate with editors and development tools through a unified interface. With acp-amp, you get full Amp capabilities wherever ACP is supported.

Why we built acp-amp

At Superagentic AI, we believe in giving developers choice. Amp Code is one of the most capable coding agents available, and we wanted to make it accessible beyond just the VS Code extension. When we started building SuperQode, our agentic quality engineering platform, we needed a way to connect Amp to our system.

The solution was the Agent Client Protocol. ACP provides a clean JSON RPC interface over stdio that any compatible client can use. We built acp-amp as the bridge, and now we are releasing it as open source so everyone can benefit.

Two ways to install

We provide both Python and Node.js versions with matching features. Pick the one that fits your workflow.

Python (Recommended)

uv tool install acp-amp

and

acp-amp run

Node.js (Simple)

npm i -g @superagenticai/acp-amp


acp-amp

Connect to Zed

Zed has native ACP support, making it simple to add Amp as an agent. Add one of these configurations to your ~/.config/zed/settings.json:

Connect to JetBrains IDEs

You can also connect your agent to JetBrains IDEs via ~/.jetbrains/acp.json using the JetBrains ACP setup guide.

NPM config

Here is config

  {
    "agent_servers": {
      "Amp": {
        "command": "npx",
        "args": ["@superagenticai/acp-amp"]
      }
    }
  }

Python config

 {
    "agent_servers": {
      "Amp": {
        "command": "acp-amp",
        "args": ["run"]
      }
    }
  }

Full Amp capabilities

acp-amp is not a limited wrapper. You get everything Amp offers through the ACP bridge:

Multi-turn conversation sessions with context continuity
Tool execution with permission modes
MCP integration for connecting to Model Context Protocol servers
Image support for sending and receiving visual content
Session management with clean isolation between tasks

Product demo

https://www.youtube.com/watch?v=Yt9-aSzPMhY (https://www.youtube.com/watch?v=Yt9-aSzPMhY)

Compatible ACP clients

acp-amp works with any client that speaks ACP over stdio. Here are some you can try today:

Zed (https://zed.dev) - Modern editor with native ACP support
SuperQode (https://super-agentic.ai/superqode) - Our agentic quality engineering platform
Toad (https://github.com/batrachianai/toad) - Python ACP client with fast setup
fast-agent (https://github.com/evalstate/fast-agent) - High speed Python ACP client

The ACP ecosystem is growing rapidly. Claude Code, Codex, Gemini CLI, GitHub Copilot, JetBrains Junie, and
many more are adopting the protocol. With acp-amp, you can bring Amp to all of them.

Important: ACP requires paid Amp credits. Free credits do not work for ACP connections. Top up a few
dollars in Amp before connecting.

Resources

GitHub Repository: https://github.com/SuperagenticAI/acp-amp (https://github.com/SuperagenticAI/acp-amp)
Documentation: https://superagenticai.github.io/acp-amp/ (https://superagenticai.github.io/acp-amp/)
Product Page: https://super-agentic.ai/acp-amp (https://super-agentic.ai/acp-amp)
NPM Package: https://www.npmjs.com/package/@superagenticai/acp-amp (https://www.npmjs.com/package/ @superagenticai/acp-amp)
PyPI Package: https://pypi.org/project/acp-amp/ (https://pypi.org/project/acp-amp/)

acp-amp is open source under the MIT license. We built it because we believe Amp is a great coding agent and developers should be able to use it wherever they work. If you build software with AI agents, we invite you to try it, integrate it into your workflow, and contribute back to the project.

View on GitHub | Visit Product Page

Announcing SuperQode and SuperQE: Redefining Quality Engineering for Agentic Software

Shashi Jagtap — Thu, 29 Jan 2026 12:11:00 +0000

Today we are publicly announcing SuperQode and SuperQE. They are open source tools that redefine how quality engineering works in the age of agentic software. SuperQode is the interactive TUI for exploratory quality engineering. SuperQE is the
automation CLI for deep evaluation testing and CI workflows.

Together they create a new workflow where agents test agents, evidence replaces assumptions, and humans remain in control. This release is the
foundation for a new quality standard. AI can write code at unprecedented speed. We need quality systems that move just as fast.

What SuperQode is

SuperQode is a developer focused terminal UI built for agentic quality engineering. You connect a coding
agent via the Agent Client Protocol, explore your codebase, run adversarial testing, and review evidence. It feels like a live quality lab, but it
is safe by design.

Key capabilities

Interactive TUI for exploration and debugging
Agent orchestration built in
Role based testing for security, regression, API, full stack, and chaos
Sandboxed workspace model that snapshots, tests, and reverts automatically
Self hosted and privacy first by default

SuperQode is not just a UI. It is the execution harness for SuperQE. It provides the environment where deep testing can happen without risking your repo.

What SuperQE is

SuperQE is the automation CLI that operationalizes agentic quality engineering. It runs in CI,
in scheduled jobs, or as an ad hoc command. SuperQE coordinates a team of testing agents with different personas, drives adversarial exploration,
and produces Quality Reports with evidence.

Key capabilities

Automation first CLI for quality engineering
Multi agent testing to stress code from different angles
Evidence based findings with reproducible steps
Designed for CI and continuous validation
Supports quick scans or deep evaluation runs

SuperQE is not test generation. It is adversarial validation. The goal is to break the code
before users do, then prove the fix with evidence.

How it works

Snapshot the repository
Sandbox all changes
Run adversarial tests
Generate Quality Reports
Revert the repo by default
Keep evidence and artifacts separately

This makes the system safe by default and powerful in practice.

Product demo

Demo video: https://www.youtube.com/watch?v=x2V323HgXRk

What you can do today

Use SuperQode for interactive exploration
Use SuperQE for automated quality runs
Connect to ACP compatible coding agents
Run deep evaluations with evidence and structured findings

Get started quickly via the documentation or browse the GitHub repository.

Enterprise roadmap

Enterprise extends this foundation with deeper automation, verified fixes, richer CI outputs, and integration support. The first integration is
Moltbot, provided as an experimental self hosted option with secure, private local models.

Resources

This is the first public release of SuperQode and SuperQE. We are excited to share it with the community and begin building the next generation of quality engineering workflows. If you build
software with AI agents, we invite you to try it, break things, and tell us what you find.

Visit the Product Page.

SpecMem: How Kiroween in San Francisco Sparked the First Unified Agent Experience and Pragmatic Memory for Coding Agents

Shashi Jagtap — Mon, 08 Dec 2025 00:58:01 +0000

How a wrong turn in San Francisco led to building Unified Agent Experience(AX) & pragmatic memory for AI coding agents

After an amazing experience exhibiting at the ODSC AI West Conference in San Francisco (October 28–30), I decided to stay in the city for a few extra days. Partly to explore, partly to experience the AI energy in San Francisco. I'd heard so much about the hype, energy, and vibes on social media, and I wanted to experience it firsthand. On the morning of October 31st (Halloween Day), I set out early to explore without any agenda or plan. I called an Uber to drop me close to the OpenAI office on 18th Street, excited to soak in the AI atmosphere. On the way, I'd seen huge billboards all about AI lining the highways. When I got dropped off, the experience was... not what I expected. I found myself surrounded by a group of people on a still-dark street. They were shouting, acting erratically. For a moment, I felt genuinely unsafe, I somehow managed to escape the situation by entering into local shop and taking help from shop-owner to reach to Financial District. Forget that, I do want to share what happened next: the vibrant, electric AI scene that made San Francisco live up to its reputation.

In the San Francisco: Financial District

After reaching Embarcadero area and walking around, I needed a place to charge my phone and MacBook. It was still early morning. I started looking for a co-working space around and maps pointed me toward something called the AWS Builders Loft. I assumed it was simply a co-working hub run by AWS but decided to check it out. When I arrived, I asked the front desk if they had co-working space available. The receptionist mentioned there was an event going on upstairs, so not sure about the space, but he was kind enough to let me try anyway. I went up to the second floor, and the moment I stepped in, I could sense the energy of an AI coding hackathon in full swing. Developers hunched over laptops. I told the receptionist on the second floor that I come from London and would love to join. She asked me to register on Luma, checked my passport, and let me in. The event was called Kiroween, a clever mix of "Kiro" and "Halloween."

I'd recently heard about Kiro during a prompt engineering conference talk in London by Ricardo Sueiras, where I attended his talk right before mine. But I'd never tried Kiro myself. I've always been skeptical of VS Code forks and AI-driven IDEs like Cursor or Windsurf. I loved coding with CLI and lightweight text editors and I didn't want to go back to VS-Code Forks including Kiro. I have no intention to try out Kiro anytime soon or probably never.

On other note, In recent months, since the AI boom, I've noticed a shift in the hackathon landscape. Many AI hackathons today feel less like genuine innovation events and more like user acquisition campaigns funded by VC money, offering modest prizes and platform credits in exchange for engagement. The authentic spirit of pre-AI tech hackathons, where builders came together purely to create and learn, seems to have faded. Don't get me wrong: hackathons still offer valuable opportunities for free food, platform credits, and making new connections. But as a founder, I've become increasingly selective about which events I join. Why invest time building on someone else's platform, potentially navigating IP complexities, when that energy could go toward my own startup?

But here I was, in the middle of the San Francisco at AWS Builders Loft, surrounded by Silicon Vally builders hacking away. A kind lady at the registration desk handed me a badge and pointed me to the breakfast bar. It felt like fate. Breakfast, the possibility of free lunch, space to charge my phone and Mac and coding in all in one place in the heart of the San Francisco. I figured, why not stay for the few hours and give Kiro a try instead of paying for co-working space elsewhere?

The AWS office was vibrant and inspiring. After a few attempts to get my UK-to-US adapter working, I finally settled down, downloaded Kiro, and officially joined the Kiroween Hackathon kickoff event.

First Impressions of Kiro: Good Part

Kiro is marketed as structured AI coding with spec-driven development. I’ve been hearing a lot about Spec Driven Development for Agentic Coding lately, especially GitHub launched SpecKit and AWS Launched Kiro, there are some startups in the Bay Area as well in London promoting SDD practices recently but I was still not convinced with their point of Spec Driven Development and what they are going to achieve in the Agent space. The debate is still ON if the Spec Driven Development is real or just hype especially from the some articles from ThoughtWorks and Marmelab and talk from the recent AI Engineer Code Summit by Dex Horthy from HumanLayer.

As someone who has always appreciated TDD and BDD practices, Since 2012 used RSpec or Cucumber and implemented BDD practices in major companies like AOL, BCC. I can get the ideas and concepts pretty quickly. At Superagentic AI, we’ve applied similar principles to our own work, in particular through SuperOptiX and our SuperSpec DSL, which allows users to define agent specifications in a human-readable way and practically followed TDD/BDD principles in development of AI Agents. But this was my first experience with an IDE that directly supports Spec Driven Development as a core workflow. I launched Kiro for the first time in my MacBook.

The onboarding experience with Kiro was smooth and intuitive. The setup was quick, and I had a project running within minutes. Only caveat was I need to have existing project either on disk or remotely on Github. I wish Kiro should gave me some template project to get started. Anyways, with uv I slapped new Python project within a sec to get ready for Kiro. What immediately stood out to me was how Kiro structured development around specifications written in the same language used to build the software, reminiscent of frameworks like RSpec or Cucumber, but fully integrated into the IDE.

Kiro divides the workflow into three logical stages like Product Owner, Technical Architect and Developers works together.

Requirements (Product Owner/Business Analyst)

This is where you write high-level requirements, user stories, and acceptance criteria, exactly how business analysts and development teams have always done. Stories follow the familiar As a… I want… So that… format. Remember the top part of the Gherkin feature file?
This followed buy the scenario style acceptance criteria. However the acceptance criteria in the Kiro wasn't following the proper GIVEN/WHEN/THEN Style syntax. It looks a bit awkward but I can take it as new DSL to learn if needed.

Design (Tech Lead and Architects)

Here you define the system’s technical architecture, not implementation of code, but the conceptual design and structure. It’s the space where tech leads and developers brainstorm the architecture before diving into the code.Think of this as technical architect documentation. The design step is actually Technical architect not design that designers do in Figma. I wish this step should be something called "Architect" or "Plan" rather than design.

Tasks (Developers)

This is where the actual implementation task defined. You can skip unnecessary tasks, view individual changes, and observe execution details in real time. You can watch your task getting executed in modular way thats amazing like you asking the tasks to the developers from your project management tools like JIRA but you can have all in there in the IDEs now.

This structure makes Kiro feel like a truly behavior-driven IDE, bringing the principles of RSpec or Cucumber, to modern AI-driven development. It reminded me of two books I bought back in 2013, The RSpec Book and The Cucumber Book, which feel more relevant than ever in the era of agentic coding. I was going back to the old 2013 days with modern touch of AI Agents.

Lightweight and Focused

Despite my skepticism toward heavyweight AI IDEs, Kiro genuinely surprised me. Having used numerous editors including VS Code and its various forks, I've grown accustomed to the trade-off between features and performance. Most feel heavy and resource-intensive, especially when AI capabilities are layered on top. Kiro, by contrast, felt remarkably lightweight and responsive, comparable to Zed with only a slight overhead. There's clearly thoughtful engineering happening under the hood. The experience is noticeably smoother than most IDEs in its class, which suggests the team has prioritized performance alongside functionality. For developers who value a snappy editing experience, this is a meaningful differentiator.

First Impressions of Kiro: Limitations

Kiro shows tremendous promise in bringing TDD and SDD concepts to software development with AI coding agents. However, coming from extensive use of Claude Code, Codex, and other CLI-based tools, I spotted several areas for improvement at first glance.

Model Support

Currently, Kiro supports only a limited selection of models primarily Claude models and an "Auto" mode. For developers who rely on model diversity, this is a significant limitation.My typical workflow spans multiple models for different phases of development:
Research: GPT, Grok
Planning: Gemini (for its broad web coverage)
Architecture/Code: Claude or Qwen

Since Kiro doesn't yet integrate with local models or allow flexible model selection, I couldn't fully apply my preferred workflow.

Spec DSL in Requirements.md

The domain-specific language used in Kiro for writing requirements doesn't feel entirely consistent with established frameworks like RSpec or Cucumber Gherkin. It blends elements from both but doesn't fully adopt either style, using keywords such as WHEN, THE, and WHERE. While readable, this feels slightly unconventional for developers familiar with traditional BDD syntax.That said, the natural language approach is promising and could evolve into a strong industry standard with further refinement.

Task Planning

Kiro's tasks.md file is a valuable feature, listing all generated tasks in one place. However, it sometimes creates tasks that aren't necessary, which can disrupt developer flow, especially when tasks have dependencies. Making the task list more editable, allowing developers to easily prune irrelevant tasks, would significantly streamline the experience.

Testing Integration

There are ways you can write tests as part of the task but Kiro did not use the clearly refined acceptance criteria to turn into executable specification which can automatically become the API or UI tests.

Executable Specifications and Living Documentation

One of the most exciting opportunities for Kiro lies in making specifications executable. In TDD and BDD, executable specs naturally become tests, eliminating the need for separate test suites. If business stakeholders could run these specs directly to validate requirements, it would create a powerful "living documentation" system, aligning business and technical teams around a single source of truth.

Limitations as Inspiration

Rather than viewing these limitations as dealbreakers, I saw them as opportunities. These gaps sparked an idea: what if I could build something that addresses these challenges while benefiting both Kiro and the broader Agentic Coding agent ecosystem?

The Kick-Off & Interview at AWS Builder Loft San Francisco Office

The Kiroween Hackathon kick-off itself was electric. The AWS Builders Loft had an incredible atmosphere filled with creativity and collaboration. I met several amazing builders and founders throughout the day, had enriching conversations, and shared ideas about AI development. I also enjoyed friendly argument San Francisco vs London with Aymen. I also had the chance to give a short interview with the Kiro team, sharing my first impressions and how I could see Kiro fitting into my future workflow. I look forward to seeing that interview published on Kiro or Devpost channels soon. Participating in Kiroween turned what was meant to be a casual day of sightseeing into one of the most memorable and productive experiences of my trip. A big thank you to AWS Builders Loft, the AWS team, and everyone at the hackathon who helped me with setup and made me feel welcome. Thanks Helen, Vinni and Erik making my day memorable with interview. It was an unforgettable experience.

Back to London: The Problem That Kept Nagging Me

I came back to London with amazing memories showcasing the products in OSDC AI and Kiroween Kick-off at AWS builders loft but the current problems in Agentic Coding kept nagging me all the time. After returning to London, I got pulled into attending conferences, business shows, hosting meetups, and various other commitments. I couldn't start working on the hackathon project until November 30th, when my Kiro credits finally loaded. I had only 6 days left to build something. And with the SaaSr AI London conference on December 1-2, I genuinely had 4 days to work on the Kiroween hackathon project. Meanwhile, Gemini-3, Claude Opus 4.5 launched with massive buzz, adding to the momentum to agentic coding space.

Markdown Madness in Agentic Coding

AI Conferences and talks everywhere discussed the overload of markdown files that Claude Code generates. Experts shared advice on writing CLAUDE.md and AGENT.md files effectively. Some promoted how they turned coding agents into better code reviewers by sharing hacked markdown files as prompts, often thinly veiled product promotions. It felt less like engineering and more like a prompting guide for coding models. This approach troubled me. Every provider was trying to lock developers into their specific coding agent through proprietary file formats and prompting strategies. Developers were drowning in markdown madness, following prompting guides that varied wildly between tools. All those file-system-based approaches to generate context felt... messy.

AI Engineer Code Summit: Everyone Promoting Their Own Approach

At the AI Engineer Code Summit in New York, talks covered Agent Skills by Anthropic, Antegravity by Google DeepMind, and various tools advocating their own ways to build context. Some companies like Amp were taking interesting opiniated approaches combining large and small language models. Dex clearly claimed that Spec-Driven Development is broken, referencing a ThoughtWorks blog post arguing that specs are just detailed prompts. Swyx mentioned Fast Agent approach from Devin. I watched every single talk carefully. What emerged was a pattern: everyone was promoting own promoting techniques with different terminology. Context Engineering. Skills. Harness Engineering. Eval Engineering. Compressed Context. The concepts overlapped, but the branding differed. Another recurring theme was "don't outsource thinking" and "keep humans in the loop." I appreciated Amp's different opinionated approach, but nobody talked about how to optimize the code or prompts that go into coding agents in a portable way. I kept thinking: This space is getting so messy. How can developers switch coding agents without rebuilding context again and again?

Why are providers trying to lock developers into their ecosystems with proprietary prompting strategies and ideologies? They promote modular approaches for using coding models and embedding models, but nobody talking about being modular in terms of selecting coding agents. One thing became clear: a lot of this mess is caused by the file-system-based approach to gathering context.

Does Spec-Driven Development Really Work?

Spec-Driven Development is being criticized, particularly in blog posts by ThoughtWorks and Mermelab. Coming from the TDD/BDD world, I'd experienced firsthand how few developers actually practice Test-Driven Development. Hardly anyone does it consistently. So why are we spending so much time reviewing specs and code (double work)? What happens to specs once they're implemented? They rot. I also tried other SDD frameworks like GitHub SpecKit and browsed Tessl during this time. I noticed something troubling that crystallized the pain points:

Problems with Spec-Driven Development :

Too verbose, feels like waterfall
Specs rot after features ship
Nobody maintains the documentation
Bureaucratic gates that slow development

But pure vibe coding was equally brittle:

Agents forget everything between sessions
No constraints, no memory
Repeated mistakes
Unpredictable behavior

Whats the middle ground here? Fast Agent approach by Devin or Harness Engineering by HumanLayer, Semantic Search by Cursor or Model based approach by AMP. Or the Kiro or SpecKit approach of Spec Driven Development. No clear answers. The specs are necessary but it shouldn't be over engineered. I kept thinking: The specs generated by these tools could be reused somehow. Either for living docs, executable tests, or... memory.

The Agent Experience and Agent Memory Paradigm

Having spent the last 10+ years in Developer Experience roles, building tools and frameworks to make developers productive, I realized something fundamental was shifting. Developer Experience is evolving into Agent Experience. I was inspired by Netlify's CEO, Matt Biilmann discussing the Agent Experience paradigm and made it one of the core pillars for Superagentic AI. We're not building for human developers anymore. We're building for agents. With all this chaos happening in the coding agent space, I thought: why not build the Agent Experience layer for coding agents by giving them Pragmatic Memory?

Nobody had thought about building centralised memory as context for coding agents that could be used across various tools and tasks. Memory where relevant context would be retrieved dynamically by the agent, regardless of which coding agent you're using. Memory that doesn't blindly follow Anthropic's file-system-based approach. Memory that keeps development spec-centric without the bureaucracy. Agentic Memory has been research topic while some cool tools like Zep, mem0, Letta evolving I thought this could be the great opportunity to build the Agent memory for coding Agents.

That's when it hit me. Specs as Memory. Memory for Specs.

SpecMem was born.

What I Built at Kiroween: SpecMem

SpecMem is the first-ever Agent Experience (AgentEx) platform: a unified, embeddable cognitive memory layer for AI coding agents.

The Burning Problems that I tried to solve

Developers Are Drowning in Markdown

CLAUDE.md, AGENTS.md, .cursorrules, requirements.md, design.md, tasks.md... the list grows with every feature. These specifications represent hours of careful thought. But what happens after the feature ships? They rot. They're forgotten. They become digital dust. Agentic coding needs other scalable approach than the File System search for gathering context.

AI Coding Agents Have Amnesia

Modern coding agents suffer from catastrophic forgetting. Sessions reset, context is lost, previous decisions vanish. Agents write code without knowing your specs, acceptance criteria, or earlier decisions. Agnetic Coding needs dedicated memory that can retrieved on demand.

Vendor Lock-In Is Real

Every coding agent uses its own proprietary format. Claude uses CLAUDE.md, Cursor uses .cursorrules, Kiro uses .kiro/specs/. Switching agents means rewriting all your specs. Your project knowledge is trapped in one tool. Agentic coding tools need modularity.

No Agent Experience Layer Exists

We have DevEx (Developer Experience) for humans. But where is Agent Experience Layer for AI coding agents? There's no unified memory layer, no context optimization, no impact analysis.

Key Features Built

Framework Adapters: Pro support for Kiro and limited support for SpecKit, Tessl, Experimental Support for non Spec Driven coding Agents like Claude Code, Cursor, Codex, Factory, Warp, Gemini CLI
Cognitive Memory: Vector-based semantic search with LanceDB, ChromaDB, Qdrant, or AgentVectorDB
SpecImpact Graph: Bidirectional relationships between specs, code, and tests
SpecDiff Timeline: Track spec evolution, detect drift, find contradictions
SpecValidator: 6 quality rules for specification health
Spec Coverage: Map acceptance criteria to tests, identify gaps
Health Scores: Project health grades (A-F) with improvement suggestions
Web UI: Interactive dashboard with live sync and WebSocket updates
GitHub Action: CI integration with PR comments and configurable thresholds
MCP Server: Native Kiro Powers integration via Model Context Protocol
Multiple CLI Commands: Full-featured command-line interface
Python API: Programmatic access via SpecMemClient

The Killer Feature: Swap agents without losing context.

SpecMem creates a unified, normalized, agent-agnostic context layer. Switch from Kiro → Claude Code → Cursor → SpecKit without rewriting spec files or losing project knowledge. Your specifications become portable. Your memory persists. Your agents remember.

Pragmatic SDD: The Balance Struck. Pure Spec-Driven Development feels like waterfall. Pure vibe coding is chaos.

SpecMem strikes the balance:

Specs as Memory: Not bureaucratic gates, but searchable knowledge
Selective Context: SpecImpact gives agents only relevant specs, not everything
Living Docs: SpecDiff detects drift, SpecValidator finds contradictions
Gradual Adoption: Start with any format, no big-bang migration

SpecMem 😍 Kiro: First-Class Integration for Spec-Driven Development

SpecMem was born during Kiroween 2025 with one mission: make Kiro's Spec-Driven Development workflow even more powerful. We've built first-class support for Kiro IDE with native adapters, MCP server integration, and seamless workflow enhancements that feel like they've always been part of Kiro.

⚡ Kiro Powers Integration: Install SpecMem as a Kiro Power and unlock persistent memory for your coding agent. Query specs without leaving Kiro, analyze impact in real-time, and get context-aware suggestions that understand your entire project history. Your agent finally remembers.

🔗 MCP Server: Full Model Context Protocol support means Kiro's agent can query your specifications, analyze change impact, and retrieve optimized context automatically. No manual copy-pasting. No context switching. Just intelligent, on-demand memory that knows what your agent needs.

📄 Native Kiro Adapter: SpecMem understands .kiro/specs/ structure natively. Your requirements.md, design.md, and tasks.md files are parsed into searchable, semantic memory. Every user story, acceptance criterion, and design decision becomes queryable knowledge.

🎯 Visualize Your Specs: Build the SpecMem dashboard to see your Kiro specifications come alive. Validate them against tests, detect drift, track coverage, and generate health scores. Show this dashboard to your Product Owner and watch their face light up when they see living, trackable specs.

⚙️ CI/CD Integration: Add SpecMem to your GitHub pipelines to validate specs and generate coverage reports, just like you do for test coverage. Treat spec quality as a first-class citizen in your delivery process.

🔍 Smarter Pull Requests: Integrate SpecMem into your PR workflow. Get instant insights on specification impact, coverage gaps, and potential drift with every code change. Catch spec issues before they merge.

🧠 Specs as Memory: Index your Kiro specifications using your preferred vector database—LanceDB, ChromaDB, or Qdrant. Transform static markdown into searchable, semantic memory that coding agents can query across sessions.

⚡ Selective Testing: Run SpecMem against your code changes to identify only the impacted tests. When you modify auth/service.py, SpecMem knows which specs are affected and which tests to run. Save CI time, reduce compute costs, and accelerate your feedback loop.

SpecMem amplifies Kiro. Your Kiro specs become living documentation, your agent gains persistent memory, and your workflow stays intact. That's the power of Agent Experience.

You can call it Pragmatic SDD. SpecMem is on GitHub or Browse Documentation.

Watch SpecMem in Action

The Hackathon Submission Drama: A Race Against Time

With only 4 days to build SpecMem, I was coding until the final hour. With 10 minutes left, I discovered the hackathon required a demo video. I rushed through recording, answered questions with "N/A" where possible, and waited as YouTube crawled through the upload. I clicked Submit. The browser spun. Then: "Sorry, this hackathon is no longer accepting submissions." My heart sank, not because I might miss an opportunity, but because the Kiro team wouldn't see the work. I wanted this visible to help improve the Kiro and Agentic Coding ecosystem.I immediately emailed the organizer with a screenshot showing the seconds-late submission. Within minutes, they sent a late submission link. I rushed through again and got it in.

15,000+ lines of code. 14 major features. Built in 4 days. Submitted with seconds to spare.

How Can Kiro users Use SpecMem right Now?

SpecMem is published on PyPI and available on GitHub. Kiro users can start using it today.

Here's what you can do:

Visualize Your Specs: Build the SpecMem dashboard to visualize your specifications, validate them against tests, and detect drift. Host it as GitHub Pages for team collaboration. Show this dashboard to your Product Owner or Business Analyst and watch their face light up.
Integrate with CI/CD: Add SpecMem to your GitHub pipelines to validate specs and get coverage data, just like you do for test coverage. Catch spec issues before they reach production.
Enhance Pull Requests: Add SpecMem to your PR workflow to get insights on specification impact, coverage gaps, and potential drift with every code change.
Index Specs as Memory: Use your favorite vector database (LanceDB, ChromaDB, Qdrant) and embedding models to index your specs as searchable memory for coding agents.
Run Selective Tests: Use SpecMem against your code changes to identify only the tests that need to run, saving CI time and compute costs.

Get Started: Browse the documentation to see where you can plug SpecMem into your existing Kiro projects.

What's Next for SpecMem in terms of Features

Since SpecMem is submitted as a hackathon project, I can't touch the codebase during the judging period. However, I'll be forking it to my personal GitHub and continuing development in parallel. The hackathon was just the beginning.

Short Term

My immediate focus is on Spec-Driven Development coding agents, specifically Kiro, GitHub SpecKit, and Tessl. I want to make those adapters more robust so users can fully leverage the specs generated by these frameworks, transforming them into living documentation, searchable memory, and actionable insights. Currently, SpecMem can run as a GitHub Action to lint specifications, detect drift, and map tests to acceptance criteria. I want to take this further by providing valuable, contextual feedback directly on GitHub Pull Requests, helping teams catch spec issues before they merge.

Additional short-term priorities include:

Enhanced semantic search with better relevance ranking
Support for more vector databases and embedding models
Improved SpecMem dashboard that serves both developers and product owners

Medium Term

SpecMem Cloud: A hosted solution for teams who prefer not to self-host SpecMem dashboard. Connect your GitHub repository containing Kiro, SpecKit, or Tessl specs, and SpecMem handles the rest.
Real-time Collaboration: Multi-user support where spec changes trigger notifications and keep teams synchronized.
Native Support for Non-SDD Agents: Bind code to specs for coding agents that don't follow Spec-Driven Development, including Claude Code, Cursor, Windsurf, and others. Bring pragmatic memory to every coding agent, regardless of their native approach.

Long Term

Yes, there's a long-term vision. Interested in where this is heading? Reach out. I'd love to connect.

The SpecMem Vision

SpecMem is redefining Agent Experience for coding agents. By introducing Pragmatic Memory, we're making coding agents smarter, more context-aware, and more effective. More importantly, SpecMem gives developers the freedom to switch between coding agents based on tasks and capabilities, without vendor lock-in. Your specifications, your memory, your choice of tools.

A Note on the Competitive Landscape

Current market leaders like Claude Code, Cursor, Codex, Windsurf, Factory, Amp, Gemini, and yes, even Kiro, may not embrace this approach enthusiastically. Agent portability could potentially disrupt their user retention strategies. Each provider has invested heavily in their proprietary formats and ecosystems. But that's precisely why SpecMem matters as it's giving freedom to ultimate builders.

Beyond the Hackathon

The goal of SpecMem was never simply to impress hackathon judges or increase winning chances. It was to address a fundamental challenge in the current coding agent space: fragmentation, lock-in, and amnesia by defining the new approach of Agent Experience to coding agents.

My goal is also to help improve Kiro through constructive feedback. I've shared honest observations about Kiro's strengths and areas for improvement because I want to see Kiro succeed in an increasingly competitive landscape. The Kiro team has built something promising, and I hope this feedback helps them come back even stronger.

An Open Invitation

To the incumbents: you're welcome to evaluate these ideas, adopt them, or build similar features. You have the funding, resources and engineering talent to take this further. Perhaps some of these concepts will inspire new startups or product directions. For Superagentic AI, SpecMem represents a foundation we intend to build upon. We'll continue developing killer features that push the boundaries of what Agent Experience can be. I am happy to collaborate on ideas and concepts if you are interested. The future of coding agents shouldn't lock developers-in but give their agents the best possible experience, regardless of which tools they choose.

That's the vision. That's SpecMem.

Kiro Usage Experience Hackathon and Beyond

I used Kiro extensively during the hackathon, exploring nearly all of its features in a compressed timeframe. While the window was short, it was enough to form clear opinions about where Kiro excels and where it needs improvement.

Where Kiro Shines

Structured Development Workflow: Kiro excelled at keeping my requirements, technical designs, and tasks organized and executable. I could always refer back to what had been implemented, feature by feature. This traceability is genuinely valuable for complex projects.
Modularity and Steering: Kiro gave me the flexibility to modify specific features without disrupting the entire project. The steering docs allowed me to enforce my own coding standards and rules as I worked.
Hooks and Powers: These newer features appear powerful. I used hooks effectively during the hackathon, though I didn't find an opportunity to use Powers for SpecMem since the use case didn't require them.
CLI Launch: The Kiro CLI landed just in time, allowing me to return to my preferred lightweight coding experience rather than relying solely on the IDE.
Collaborative Experience: It feels like collaborative experience even If I was woking solo on the project. It feels like I am working with various people in the team like Product Owners, Tech Architects and developers.

Areas for Improvement

While working with Kiro solely, I felt like I should have more features to make my workflows even better.

Limited Model Selection: Kiro's model choices felt restrictive. I ended up using Claude Opus 4.5 for all my work because the alternatives were limited. I wanted to switch models based on tasks, using Gemini for planning and other models for specific purposes. Neither Gemini nor GPT models were available. I avoided the "Auto" mode since there's no transparency about which models it uses under the hood, and I didn't want unexpected disruptions. Kiro doesn't support local models hosted via Ollama, MLX, SGLang, vLLM, and similar tools. It would be also great to allow developers to select different models for plan, architect and code as some models are very specialised vs others to perform specific tasks.
Workflow Friction: Kiro slowed down my workflow significantly. I had to review all generated requirements, which weren't always written properly. When I tried to amend them through the model, the results didn't match my expectations, forcing manual edits. The same applied to architecture designs and tasks. Eventually, I found myself accepting whatever Kiro generated without thorough review, just to maintain momentum.
Response Times: The response times when executing tasks or generating requirements, designs, and tasks were noticeably slow. I haven't experienced such latency with any other CLI or IDE to date. Session Management: Kiro didn't notify me when approaching context limits. It summarized my sessions mid-task, and subsequent sessions completely lost the flow. I had to manually copy tasks to new sessions to continue, which was disruptive. Forced Workflow Loops: I had no control over when Kiro would cycle through the requirements/design/task loop. When I didn't need this workflow. I resorted to overriding prompts with instructions like "PLEASE DO NOT GENERATE REQUIREMENTS/DESIGN/TASKS."
CLI Experience: I tried the CLI as soon as it was announced but soon realised that its too early stage to explore it fully as too much manual configuration to get full support. I returned to the IDE.

Again these are my personal experience using Kiro as solo developer and my own hackathon project which didn't explore full power of hooks and Kiro powers. I truly understand Kiro is still new and emerging but shown so much potential so far. I really hope feedback from this hackathon will definitely shapes the future of the Kiro and entire Spec Driven development and SpecMem project has already provided some food for thoughts. Kiro has potential. With the right improvements, it could become a serious contender in the coding agent space. I hope this feedback helps the team prioritize what matters most to developers like me.

Kiro at AWS re:Invent: The Future Looks Promising

I recently caught up with the keynotes and talks from AWS re:Invent related to Kiro, and I'm genuinely impressed by the new feature announcements. CEO Matt Garman's keynote unveiled exciting agent announcements including Kiro Autonomous Agent, Security Agent, and DevOps Agent. I'm looking forward to seeing how these play out in real-world scenarios.

Dr. Swami's keynote also highlighted Kiro, reinforcing its strategic importance within AWS's vision. Byron Cook's talk was particularly interesting, discussing how Kiro leverages natural language specifications for both specs and tests, a concept that aligns perfectly with what SpecMem is trying to achieve. I also watched several Lab sessions demonstrating how to use Kiro and the CLI effectively. These hands-on walkthroughs showcased the practical applications and workflow improvements Kiro enables.

The future development of Kiro-specific features within AWS looks solid. The investment and roadmap are clear. I can't wait to see the next releases and how Kiro continues to evolve in the competitive coding agent landscape.

The Takeaway

Sometimes the meaningful things happen completely by accident. I went to San Francisco to exhibit at ODSC AI and explore AI vibes. I stumbled into a hackathon. I discovered limitations that sparked an idea. I built something in few days that I believe can change how developers work with AI coding agents or at least start of something new in this space. The coding agent space is messy. Every provider promotes their own file formats, prompting strategies, and context engineering approaches. Developers are drowning in markdown madness while agents forget everything between sessions.

SpecMem introduces a new paradigm: Agent Experience (AgentEx). Just as DevEx optimizes the experience for human developers, AgentEx optimizes the experience for AI coding agents. At its core is Unified Pragmatic Memory, a centralized, agent-agnostic memory layer that lets you switch between coding agents without rebuilding context or losing project knowledge.

Specs shouldn't be documents that rot. They should be memory that agents use on demand. Context Engineering shouldn't be forces it should come as natural.

That's SpecMem.

Links

Landing Page: https://super-agentic.ai/specmem
GitHub: https://github.com/SuperagenticAI/specmem
Documentation: https://superagenticai.github.io/specmem/

SpecMem is developed by Superagentic AI as part of the Kiroween Hackathon, December 2025.

A big thank you to AWS Builders Loft, the AWS Startups team, the Kiro team, and everyone at the hackathon who helped me with setup and made me feel welcome. It was an unforgettable experience.

Agent Optimization: Why Context Engineering Isn’t Enough

Shashi Jagtap — Thu, 02 Oct 2025 21:06:14 +0000

In the rapidly evolving world of AI agents and large language models, the conversation has moved quickly. It began with prompt engineering, then shifted to context engineering which focuses on curating what information enters a model’s context window. The topic of Context Engineering is not stopping at all as Anthopic published blog post on effective context engineering for AI agents recently.

As context engineering is being discussed everywhere, one fundamental problem has being ignore in all context is optimization. Until and unless this basic problem of optimization of prompt gets resolved curating the context will be still brittle when model or weight shifted. There is an opportunity to explore new concept of Agent Optimization whch views the agent not as a prompt or a context buffer but as a complete system. It spans prompts, retrieval augmented generation, memory, and tool orchestration. The idea is straightforward. Context alone cannot guarantee reliability especially as models evolve. To build sustainable agents, optimization must occur across the entire pipeline. Agent Optimization: The Next Big Shift Beyond Context Engineering? Let's dive deeper.

Context Engineering: A Crucial Foundation

Context engineering rose to prominence because it addressed an immediate bottleneck: the finite nature of an LLM’s attention. Models can only process so many tokens effectively and longer contexts often lead to what Anthropic calls context rot. This is when recall accuracy decreases as sequence length increases. In its widely read guide, Anthropic described context as “a critical but finite resource for AI agents.” Their strategies included:

Compaction: Summarizing long histories to preserve coherence without exhausting the context window. In Claude Code this technique is used to keep multi hour programming sessions coherent.

Structured Note Taking: Storing important facts outside the context and retrieving them only when necessary, reducing token waste.

Sub Agent Architectures: Delegating tasks to smaller agents with clean contexts and then integrating their outputs.

Just in Time Retrieval: Dynamically fetching information instead of overloading the initial context with every possible detail.

These methods improve efficiency, reduce hallucination, and enhance autonomy. Frameworks such as LangChain and DSPy could integrate many of these strategies, proving their practical value.

Yet the limitations are clear. Curated context is fragile when models change. Optimizations tuned for Claude may not work for GPT 6 or Llama 5 or next future models. The transformer architecture imposes strict limits on attention which means longer context windows do not always translate into better performance. Context engineering answers what information goes in but not how the agent interprets it or adapts to it.

Why Context Alone Falls Short

Research has shown that scaling context length does not eliminate the problem. Studies on effective context length reveal that many open source models struggle to maintain accuracy beyond a fraction of their advertised window. This means the usable context is often much smaller than the maximum. Another challenge is brittleness across models. A context strategy that works well on one model may degrade when applied to another because of differences in inductive biases. This problem is already visible in enterprise deployments where upgrading models can cause accuracy to drop for downstream tasks.

Most importantly context engineering does not optimize the rest of the agent. Prompt templates, RAG pipelines, memory persistence, and tool usage remain under optimized. Without tuning across these layers agents may sound coherent but fail at reliability.

Agent Optimization: A Holistic Approach

Agent Optimization reframes the challenge. It assumes that agents are systems of systems and that each layer must be optimized for long term robustness.

Prompt Optimization

The GEPA framework has demonstrated that prompts can evolve dynamically by mutating based on execution traces. In evaluations GEPA achieved up to nineteen percent improvement over static baselines in low rollout environments. Unlike context engineering this method is robust to model updates since prompts evolve with the system rather than remaining fixed.

Retrieval Augmented Generation Optimization

RAG pipelines are widely used to reduce hallucination but their success depends heavily on retriever quality, embedding choice, ranking depth, and filtering strategies. Research has highlighted failure cases where irrelevant or adversarial passages undermine accuracy in domains such as healthcare. Optimizing retrieval is as critical as optimizing the model itself.

Tool Calls and Orchestration

Tools extend agent capability but only if designed carefully. Anthropic advises that tools should be minimal, explicit, and non overlapping. Optimizing tool invocation and validation reduces errors and ensures agents use tools effectively.

Memory and Persistence

Memory remains a difficult problem. Techniques such as structured note taking help but the harder challenge is deciding what to remember, how to compress it, and when to retrieve it. Adaptive memory systems that evolve with usage are increasingly seen as part of the optimization stack.

Frameworks and Programmability

DSPy represents the direction forward. Instead of ad hoc prompt hacking it provides declarative modules for prompts, retrieval, and memory which can be optimized automatically with algorithms such as GEPA or MIPROv2. In benchmarks like HotPotQA DSPy raised accuracy from twenty four percent to fifty one percent, results that context engineering alone could not achieve.

Why Agent Optimization?

Agent Optimization could make this things better because it directly addresses the problems engineers are facing today.

It is adaptable and survives model updates.
It is robust and tunes retrieval, prompts, memory, and tools together.
It is scalable and enables systematic improvement instead of trial and error.
It is the natural next step. Prompt engineering gave way to context engineering. Context engineering might give way to Agent Optimization or Agent Engineering.

Conclusion

Context engineering remains important but it is no longer the endpoint. Some of the technologies from Anthropic, GEPA, and DSPy shows that the future lies in Agent Optimization or Agent Engineering. This holistic approach treats prompts, retrieval augmented generation, memory, and tools as interconnected layers that must be optimized together.The most reliable AI agents of the future will not be those with the best curated context windows but those optimized across the full stack. In 2025 and beyond Agent Optimization will define the next wave of reliable and adaptive AI systems. What do you think?

References

Anthropic. Effective Context Engineering for AI Agents.
https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
Chroma. Context Rot: How Increasing Input Tokens Impacts LLM Performance.
https://research.trychroma.com/context-rot
Huang et al. Why Does the Effective Context Length of LLMs Fall Short? (STRING). arXiv, 2024.
https://arxiv.org/abs/2410.18745
Wang et al. GEPA: Genetic Evolutionary Prompt Adaptation. arXiv, 2025.
https://arxiv.org/abs/2507.19457
IBM. What is Model Drift?
https://www.ibm.com/think/topics/model-drift

Intelligent RAG Optimization with GEPA: Revolutionizing Knowledge Retrieval

Shashi Jagtap — Thu, 18 Sep 2025 13:59:43 +0000

The field of prompt optimization has witnessed a breakthrough with GEPA (Genetic Pareto), a novel approach that uses natural language reflection to optimize prompts for large language models. Based on the research published in "GEPA: Genetic Pareto Prompt Optimization for Large Language Models".

GEPA is an amazing tool for prompt optimization and the new GEPA RAG Adapter contributed by us with the RAG GUIDE extends the proven genetic pareto optimization methodology to one of the most important applications of LLMs: Retrieval Augmented Generation (RAG).

The recently merged GEPA RAG Adapter brings this powerful optimization methodology to RAG systems, enabling automatic optimization of the entire RAG pipeline across multiple vector databases.

Background: The Challenge of RAG Optimization

Retrieval Augmented Generation (RAG) systems have become essential for building AI applications that need to access and reason over specific knowledge bases.

However, optimizing RAG systems has traditionally been a manual, time-intensive process requiring domain expertise and extensive trial-and-error experimentation. Each component of the RAG pipeline, from query reformulation to answer generation, requires carefully crafted prompts that often need to be tuned separately, making it difficult to achieve optimal end-to-end performance.

The introduction of GEPA's RAG Adapter addresses this challenge by applying the proven genetic pareto optimization methodology specifically to RAG systems, enabling automatic discovery of optimal prompts across the entire pipeline.

What is GEPA?

GEPA (Genetic Pareto) is a prompt optimization technique for large language models that represents a significant advancement over traditional approaches. The methodology introduces several key innovations:

Natural Language Reflection: Unlike traditional reinforcement learning methods that rely on scalar rewards, GEPA uses natural language as its learning medium. The system samples system-level trajectories (including reasoning, tool calls, and outputs), reflects on these trajectories in natural language, diagnoses problems, and proposes prompt updates.

Pareto Frontier Optimization: GEPA maintains a "Pareto frontier" of optimization attempts, combining lessons learned from multiple approaches rather than focusing on a single optimization path. This approach enables more robust and comprehensive optimization.

GEPA demonstrates remarkable efficiency in the research paper, achieving:

10% average improvement over Group Relative Policy Optimization (GRPO)
Up to 20% improvement in best cases
35x fewer rollouts compared to traditional methods
Over 10% improvement compared to leading prompt optimizer MIPROv2

Why GEPA Works for RAG

The interpretable, natural language–based approach of GEPA is particularly well suited for RAG optimization because:

Complex Interaction Understanding: RAG systems involve complex interactions between retrieval quality and generation quality. GEPA's natural language reflection can identify and articulate these nuanced relationships.
Multi-Component Optimization: RAG pipelines require optimizing multiple components simultaneously. GEPA's Pareto frontier approach can balance trade-offs between different components effectively.
Interpretable Improvements: The natural language reflection mechanism provides clear insights into why certain prompt modifications improve performance, making the optimization process more transparent and debuggable.

Prompt Optimization with GEPA

GEPA's prompt optimization process follows a systematic approach that has been proven effective across various LLM applications.

The Optimization Loop

The optimization process consists of six key steps:

Trajectory Sampling: GEPA samples complete execution trajectories from the system, capturing not just final outputs but the entire reasoning process.
Natural Language Reflection: The system analyzes these trajectories using natural language, identifying patterns, problems, and opportunities for improvement.
Diagnostic Analysis: Problems are diagnosed in interpretable terms, such as "query reformulation is too narrow" or "context synthesis includes irrelevant information."
Prompt Proposal: Based on the analysis, GEPA proposes specific prompt modifications using natural language reasoning.
Testing and Evaluation: Proposed changes are tested against evaluation criteria, with results fed back into the optimization loop.
Pareto Frontier Update: Successful improvements are incorporated into the Pareto frontier, building a comprehensive understanding of what works.

This approach leverages the language understanding capabilities of LLMs themselves to drive the optimization process, creating a self-improving system that can articulate and reason about its own performance.

RAG Introduction: The Challenge of Knowledge Retrieval

Retrieval Augmented Generation represents a shift in how we build knowledge-intensive AI applications. Traditional language models are limited to the knowledge they were trained on, which becomes outdated and cannot include private or domain-specific information. RAG solves this by combining the reasoning capabilities of LLMs with real-time access to relevant documents from vector databases.

The RAG Pipeline

A typical RAG system involves several critical steps:

Query Processing: User queries must be processed and potentially reformulated to improve retrieval effectiveness.
Document Retrieval: Relevant documents are retrieved from a vector database using semantic similarity or hybrid search methods.
Document Reranking: Retrieved documents may be reordered based on relevance criteria specific to the query.
Context Synthesis: Multiple retrieved documents are synthesized into coherent context that supports answer generation.
Answer Generation: The LLM generates a final answer based on the synthesized context and original query.

Each of these steps involves prompts that significantly impact the overall system performance, making optimization crucial for real-world applications.

RAG Optimization with GEPA

The GEPA RAG Adapter brings systematic optimization to every component of the RAG pipeline. Here's how GEPA's methodology applies to RAG optimization:

Vector Store Agnostic Design

One of the most powerful aspects of the GEPA RAG Adapter is its vector store agnostic design. The adapter provides a unified optimization interface that works across multiple vector databases.

Supported Vector Stores

The adapter supports five major vector databases:

ChromaDB: Ideal for local development and prototyping. Simple setup with no external dependencies required.
Weaviate: Production ready with hybrid search capabilities and advanced features. Requires Docker.
Qdrant: High performance with advanced filtering and payload search capabilities. Can run in memory mode.
LanceDB: Serverless, developer-friendly architecture built on Apache Arrow. No Docker required.
Milvus: Cloud-native scalability with Milvus Lite for local development. No Docker required for Lite mode.

Data Structure for RAG Optimization

train_data = [
    RAGDataInst(
        query="What is machine learning?",
        ground_truth_answer="Machine Learning is a method of data analysis that automates analytical model building...",
        relevant_doc_ids=["ml_basics"],
        metadata={"category": "definition", "difficulty": "beginner"},
    ),
    RAGDataInst(
        query="How does deep learning work?",
        ground_truth_answer="Deep Learning is a subset of machine learning based on artificial neural networks...",
        relevant_doc_ids=["dl_basics"],
        metadata={"category": "explanation", "difficulty": "intermediate"},
    ),
]

Initial Prompt Templates

initial_prompts = {
    "answer_generation": """You are an AI expert providing accurate technical explanations.

Based on the retrieved context, provide a clear and informative answer to the user's question.

Guidelines:
- Use information from the provided context
- Be accurate and concise
- Include key technical details
- Structure your response clearly

Context: {context}

Question: {query}

Answer:"""
}

Running GEPA Optimization

result = gepa.optimize(
    seed_candidate=initial_prompts,
    trainset=train_data,
    valset=val_data,
    adapter=rag_adapter,
    reflection_lm=llm_client,
    max_metric_calls=args.max_iterations,
)

best_score = result.val_aggregate_scores[result.best_idx]
optimized_prompts = result.best_candidate
total_iterations = result.total_metric_calls

Implementation and Usage

Installation

pip install gepa
pip install chromadb
pip install lancedb pyarrow sentence-transformers
pip install pymilvus sentence-transformers
pip install qdrant-client
pip install weaviate-client

Using the Unified Optimization Script

cd src/gepa/examples/rag_adapter

python rag_optimization.py --vector-store chromadb
python rag_optimization.py --vector-store lancedb
python rag_optimization.py --vector-store milvus
python rag_optimization.py --vector-store qdrant
python rag_optimization.py --vector-store weaviate

(… full command examples included in the repo …)

Features and Capabilities

Multi-Component Optimization

GEPA RAG Adapter optimizes:

Query Reformulation
Context Synthesis
Answer Generation
Document Reranking

Evaluation System

eval_result = rag_adapter.evaluate(
    batch=val_data[:1], 
    candidate=initial_prompts, 
    capture_traces=True
)

initial_score = eval_result.scores[0]
sample_answer = eval_result.outputs[0]['final_answer']

Quick Start

See GEPA GUIDE.

ollama pull qwen3:8b
ollama pull nomic-embed-text:latest

Quick start:

cd src/gepa/examples/rag_adapter
python rag_optimization.py --vector-store chromadb --max-iterations 10

Watch Demo

YouTube Demo

Summary

The GEPA RAG Adapter represents an advancement in RAG system optimization, bringing the proven genetic pareto methodology to one of the most important applications of large language models.

Technical Advantages

Automated Optimization
Vector Store Agnostic
Efficiency (35x fewer rollouts)
Interpretable Process

Potential Benefits

Unified Interface
Flexible Deployment
Production Ready
Extensible Design

Scientific Foundation

Research Backed
Natural Language Reflection
Pareto Frontier Optimization

Conclusion

The integration of GEPA's genetic pareto optimization methodology with RAG systems is still early but a strong start.

Best use today is with DSPy GEPA Adapter, but you can also optimize RAG pipelines using standalone GEPA.

Developers now have access to a systematic, automated approach for building high-performance knowledge retrieval systems. The unified script enables easy experimentation across different vector stores, while the vector store agnostic design ensures optimization work translates across deployment environments.

The GEPA RAG Adapter is available today in the GEPA repository, with working examples and comprehensive documentation.

Introducing SuperQuantX: Foundational Research for Quantum and Agentic AI

Shashi Jagtap — Fri, 12 Sep 2025 11:07:58 +0000

Today is a landmark day for Superagentic AI. We are thrilled to introduce SuperQuantX, our open source SDK created to unify the fragmented world of Quantum AI development. The timing could not be more perfect. With the Quantum World Congress taking place next week from September 16 to 18 and the recent Quantum Agent research paper sparking fresh debate, we are stepping into the spotlight with a launch that could not be more timely. To make it even more exciting, SuperQuantX will also debut on Product Hunt, bringing this vision directly to the global developer and research community.

SuperQuantX provides a seamless way for researchers and developers to build quantum enhanced agentic systems using one unified interface. SuperQuantX is built for experimentation and for rigorous research work. It brings AI and quantum computing together in a way that lets you focus on ideas instead of integration effort. This project is the result of research at Superagentic AI and is intended to be a foundation for quantum enhanced agentic research. SuperQuantX aims to make it simple to write code once and run it on any supported quantum backend without rewriting your logic.

Background

Our journey into Quantum Machine Learning began with a book: Practical Guide to Quantum Machine Learning and Quantum Optimization. The subject was fascinating but the tooling was fragmented. Each framework exposed its own API and its own conventions. Often we found ourselves spending more time wrestling with tool differences than testing research hypotheses. That experience motivated us to build a unified platform where code stays constant and the backend can change.

The current quantum ecosystem offers many strengths but also forces researchers to juggle multiple SDKs and vendor specific quirks. That slows research and complicates collaboration. SuperQuantX removes that barrier by providing a single API for multiple frameworks, enabling reproducibility, faster validation, and greater creative focus on algorithmic progress rather than on integration details.

SuperQuantX provides one consistent API that connects to a wide set of quantum frameworks such as PennyLane, Qiskit, Cirq, Amazon Braket, TKET, and D Wave Ocean. With that single interface you can run the same research code across different backends while keeping experiments reproducible and portable. This reduces friction and accelerates iteration for teams and for individual researchers.

Features

Unified API spanning PennyLane, Qiskit, Cirq, Amazon Braket, TKET, and D Wave Ocean.
Seamless backend switching without changing research logic
Tools tailored for agentic AI research and autonomous quantum driven systems
Pre built quantum machine learning and optimization algorithms including QSVM, Quantum Neural Network, VQE, and QAOA
Circuit visualization and result analysis tools
Cloud integration with managed simulators and hardware providers
Comprehensive documentation and developer examples for fast onboarding

At Superagentic AI we believe that future intelligence will be both quantum enhanced and agentic. Our vision with SuperQuantX is to create a unified foundation for that future. We want to accelerate discovery, enable reproducible research, and make it straightforward to build autonomous quantum enhanced agents using a cohesive platform.

How It Works

SuperQuantX acts as a bridge between research code and vendor specific frameworks. You write circuits or agent logic using a consistent API. When you switch a backend the library handles the backend specifics such as transpilation, scheduling, differentiable programming integration, or annealing configuration. This keeps your research code stable while letting you validate experiments across multiple providers.

Examples

Here is an example that creates quantum entanglement with minimal code.

import superquantx as sqx

backend = sqx.get_backend('simulator')
circuit = backend.create_circuit(2)

circuit.h(0) # Superposition
circuit.cx(0, 1) # Entanglement
circuit.measure_all()

result = backend.run(circuit, shots=1000)
print(result) # Expected: state counts close to balanced values

You can switch the backend easily to PennyLane, Qiskit, or Cirq by altering one parameter. That same example works across multiple frameworks without touching its logic.
Advanced examples include constructing a Quantum SVM:

from sklearn.datasets import make_classification
X, y = make_classification(n_samples=100)

qsvm = sqx.QuantumSVM(backend='simulator', feature_map='ZFeatureMap')
qsvm.fit(X_train, y_train)
accuracy = qsvm.score(X_test, y_test)

And experimenting with hybrid algorithms such as VQE or Quantum Neural Networks:

vqe = sqx.VQE(backend='simulator', hamiltonian=H)
ground_energy = vqe.find_ground_state()

qnn = sqx.QuantumNeuralNetwork(n_qubits=4, n_layers=2)

Quick Video Demo

Watch the demo on YouTube

https://www.youtube.com/watch?v=I3Dz1R2vcng

Quantum Not Here, But You Cannot Ignore

Large scale quantum hardware is not widely available yet. That fact makes foundation building critical today. The research we do now will shape future breakthroughs. Fragmentation slows progress. SuperQuantX removes that barrier and enables researchers to validate across platforms, collaborate across ecosystems, and iterate faster on quantum enhanced methods.

Conclusion

SuperQuantX is the platform for exploring quantum machine learning. It transforms fragmentation into clarity and gives researchers a pragmatic and future ready foundation for quantum agentic research. This is the start of a longer journey and we are inviting the community to build with us.

SuperQuantX Useful Links
Homepage: Superagentic AI
GitHub: SuperagenticAI/superquantx
Docs: SuperQuantX Documentation
Product Hunt: SuperQuantX

Codex CLI: Running GPT-OSS and Local Coding Models with Ollama, LM Studio, and MLX

Shashi Jagtap — Mon, 01 Sep 2025 22:01:24 +0000

Agentic coding is evolving rapidly, reshaping how developers interact with AI to generate code. Instead of being locked inside full-blown IDEs, many are moving back toward lightweight, flexible command-line interfaces. Since the arrival of Claude Code, we’ve seen a wave of new coding CLIs Gemini CLI, Qwen Code, and others but each has come with a major limitation: they are tied to a single model provider.

Codex CLI breaks that pattern. It’s the first CLI designed to be truly universal, capable of running any model cloud-based or open-source, local or remote through a single, unified interface. No more juggling separate CLIs or switching mental contexts depending on the model you want to use. There might be some toy open source projects doing something similar, but Codex is the first official CLI from a major model provider that allows developers to achieve this. With Codex CLI, you configure providers once and seamlessly switch between them with simple providers, profiles, or MCP servers.

This is still early stage, but it opens up a lot of possibilities for Agentic Coding in the near future.

Codex CLI

Codex CLI is OpenAI’s bold response to the wave of coding assistants like Claude Code and Gemini CLI. OpenAI describes it as “one agent for everywhere you code” and that vision shows. With a single installation, you get a lightweight yet powerful CLI that brings AI coding directly into your terminal.

Installation is straightforward:

If you have Node.js installed, run:

  npm i -g @openai/codex

On macOS, you can also use Homebrew:

  brew install codex

Once installed, you’re ready to go. Simply navigate to any project directory and launch:

codex

From there, Codex CLI integrates seamlessly into your workflow, providing an AI assistant without needing an IDE or browser-based environment.

Cloud Models vs. Open Source Models

OpenAI has recently released two open-source models: GPT-OSS-20B and GPT-OSS-120B, alongside GPT-5.

By default, Codex CLI connects to cloud models like GPT-5. These are great for rapid prototyping, but they also come with tradeoffs: API costs, usage limits, and the need for a constant internet connection.

The real breakthrough is that Codex also supports open-source, self-hosted models. With the --oss flag or a configured profile, you can run inference locally through providers like Ollama, LM Studio, or MLX.

For example:

codex --oss

By default, this checks if you have gpt-oss-20b installed with Ollama. You can also specify another model:

codex --oss -m gpt-oss:120b

Running models locally unlocks powerful advantages:

Run powerful LLMs locally without sending data to external servers
Avoid vendor lock-in by swapping providers or models at will
Optimize for privacy, speed, and cost while keeping workflows flexible

In short, Codex gives developers the freedom to choose between cutting-edge cloud models and locally hosted OSS models—all from the same CLI.

Configuring Codex with `config.toml`

When you install Codex CLI, you’ll find a ~/.codex/ directory on your system. This directory contains configuration files and subdirectories. If ~/.codex/config.toml doesn’t exist, create it manually.

This file allows you to configure providers and create profiles for different models. Some options aren’t fully documented yet, but you can explore the Codex source code for details. You can also configure MCP servers here.

Ollama Configuration

Assuming you have a model already downloaded and Ollama running, add the following to your ~/.codex/config.toml:

[model_providers.ollama]
name = "Ollama"
base_url = "http://localhost:11434/v1"

[profiles.gpt-oss-120b-ollama]
model_provider = "ollama"
model = "gpt-oss:120b"

Then launch Codex with:

codex --oss --profile gpt-oss-120b-ollama

LM Studio Configuration

In LM Studio, you’ll need to load a model and start the server (default port is 1234). You can use the LM Studio UI or the CLI:

# List available models
lms ls  

# Load the model
lms load qwen/qwen3-coder-30b  

# Start the server
lms server start

Config for GPT-OSS-120B

[model_providers.lms]
name = "LM Studio"
base_url = "http://localhost:1234/v1"

[profiles.gpt-oss-120b-lms]
model_provider = "lms"
model = "gpt-oss:120b"

Config for Qwen3-Coder-30B

[model_providers.lm_studio]
name = "LM Studio"
base_url = "http://localhost:1234/v1"

[profiles.qwen3-coder-30b-lms]
model_provider = "lm_studio"
model = "qwen/qwen3-coder-30b"

Launch with:

codex --profile gpt-oss-120b-lms  
codex --profile qwen3-coder-30b-lms

MLX Configuration

On Apple Silicon, you can use MLX for faster inference. Install the MLX LM package:

pip install mlx-lm

Start a local server:

mlx_lm.server --model SuperagenticAI/gpt-oss-20b-8bit-mlx --port 8888

Update your Codex config:

[model_providers.mlx]
name = "MLX LM"
base_url = "http://localhost:8888/v1"

[profiles.gpt-oss-20b-8bit-mlx]
model_provider = "mlx"
model = "SuperagenticAI/gpt-oss-20b-8bit-mlx"

Run with:

codex --profile gpt-oss-20b-8bit-mlx

Watch It in Action

🎥 Demo Video

Context Length

One challenge with local coding models is context length you may need to adjust it for larger projects.

Ollama: use /set parameter num_ctx
LM Studio: pass --context-length to the lms load command
MLX: configure via model/server launch parameters

Why Run Local Models?

While cloud APIs are convenient, local models bring unique benefits:

Privacy: your code never leaves your machine
Cost control: no API bills for long-running tasks
Flexibility: swap models without waiting for API support
Resilience: works offline or in restricted environments

By combining Codex CLI with providers like Ollama, LM Studio, and MLX, you get the best of both worlds: a unified developer experience with full freedom to choose between cloud and local inference.

Final Thoughts

Codex CLI marks a shift in how developers interact with AI coding models. For the first time, you can use one CLI to manage all your models from OpenAI’s cloud APIs to cutting-edge OSS models running locally.

If you’re serious about building with AI while keeping flexibility, privacy, and cost in check, it’s worth setting up Codex CLI with local providers today.

GEPA DSPy Optimizer in SuperOptiX: Revolutionizing AI Agent Optimization Through Reflective Prompt Evolution

Shashi Jagtap — Mon, 18 Aug 2025 11:24:45 +0000

How SuperOptiX leverages GEPA's breakthrough reflective optimization to transform basic AI agents into sophisticated problem solvers

Introduction

The landscape of AI agent optimization has fundamentally shifted with the introduction of GEPA as a DSPy optimizer. Unlike traditional optimization approaches that rely on trial-and-error or reinforcement learning, GEPA introduces a paradigm of reflective prompt evolution — teaching AI agents to improve by analyzing their own mistakes and generating better instructions.

In this comprehensive guide, we'll explore how SuperOptiX integrates GEPA as a first-class DSPy optimizer, enabling developers to achieve dramatic performance improvements with minimal training data. We'll walk through practical examples, demonstrate the optimization process, and show you exactly how to leverage this powerful combination in your own projects.

Background: The Evolution of DSPy Prompt Optimizers

Traditional Optimization Challenges

Before diving into GEPA, it's important to understand the limitations of traditional prompt optimization approaches:

Volume Requirements: Most optimizers require hundreds of training examples to achieve meaningful improvements, making them impractical for specialized domains where data is scarce.

Black Box Nature: Traditional methods provide little insight into why certain prompts work better, making it difficult to understand or validate improvements.

Domain Limitations: Generic optimization techniques struggle with domain-specific requirements like mathematical reasoning, medical accuracy, or legal compliance.

Resource Intensity: Many approaches require extensive computational resources and time to achieve optimal results.

DSPy's Optimization Framework

DSPy revolutionized prompt optimization by treating prompts as learnable parameters rather than static text. The framework provides several optimizers, each with distinct strengths:

BootstrapFewShot: Creates few-shot examples through bootstrapping
SIMBA: Uses stochastic introspective optimization
MIPROv2: Multi-step instruction prompt optimization
COPRO: Collaborative prompt optimization

However, these optimizers still faced the fundamental challenge of limited feedback mechanisms — relying primarily on scalar metrics rather than rich, interpretable feedback.

Introducing GEPA: The Breakthrough in Reflective Optimization

What Makes GEPA Different

GEPA, introduced in the research paper "Reflective Prompt Evolution Can Outperform Reinforcement Learning", represents a fundamental breakthrough by incorporating human-like reflection into the optimization process.

Instead of blindly trying different prompt variations, GEPA:

Analyzes Failures: Uses a reflection LM to understand what went wrong in failed attempts
Generates Insights: Creates textual feedback explaining improvement opportunities
Evolves Prompts: Develops new prompt candidates based on reflective insights
Builds Knowledge: Constructs a graph of improvements, preserving successful patterns

Technical Architecture

GEPA's architecture consists of four key components:

Student LM: The primary language model being optimized
Reflection LM: A separate model that analyzes student performance and provides feedback
Feedback System: Domain-specific metrics that provide rich textual feedback
Graph Constructor: Builds a tree of prompt improvements using Pareto optimization

This multi-model approach enables GEPA to achieve what single-model optimizers cannot: genuine understanding of failure modes and targeted improvements.

Key Innovations from the Research

The original GEPA paper demonstrates several breakthrough capabilities:

Sample Efficiency: Achieves significant improvements with as few as 3-10 training examples, compared to 100+ for traditional methods.

Domain Adaptability: Leverages textual feedback to incorporate domain-specific knowledge (medical guidelines, legal compliance, security best practices).

Multi-Objective Optimization: Simultaneously optimizes for accuracy, safety, compliance, and other criteria through rich feedback.

Interpretable Improvements: Generates human-readable prompt improvements that can be understood and validated by experts.

GEPA as a DSPy Optimizer in SuperOptiX

Seamless Integration

SuperOptiX integrates GEPA as a first-class DSPy optimizer through the DSPyOptimizerFactory, making it as easy to use as any other optimization method:

spec:
  optimization:
    optimizer:
      name: GEPA
      params:
        metric: advanced_math_feedback
        auto: light
        reflection_lm: qwen3:8b
        reflection_minibatch_size: 3
        skip_perfect_score: true

This simple configuration unlocks GEPA's powerful reflective optimization capabilities within the SuperOptiX agent framework.

Advanced Feedback Metrics

SuperOptiX enhances GEPA with seven specialized feedback metrics:

advanced_math_feedback: Mathematical problem solving with step-by-step validation
multi_component_enterprise_feedback: Business document analysis with multi-aspect evaluation
vulnerability_detection_feedback: Security analysis with remediation guidance
privacy_preservation_feedback: Data privacy compliance assessment
medical_accuracy_feedback: Healthcare applications with safety validation
legal_analysis_feedback: Legal document processing with regulatory alignment
custom domain metrics: Extensible framework for specialized domains

These metrics provide the rich textual feedback that GEPA needs to drive targeted improvements.

Memory-Optimized Configurations

SuperOptiX provides three optimization tiers to balance performance with resource requirements:

Lightweight (8GB+ RAM):

optimization:
  optimizer:
    name: GEPA
    params:
      auto: minimal
      max_full_evals: 3
      reflection_lm: llama3.2:1b

Standard (16GB+ RAM):

optimization:
  optimizer:
    name: GEPA
    params:
      auto: light
      max_full_evals: 10
      reflection_lm: qwen3:8b

Production (32GB+ RAM):

optimization:
  optimizer:
    name: GEPA
    params:
      auto: heavy
      max_full_evals: 50
      reflection_lm: qwen3:8b

Step-by-Step: Transforming a Math Agent with GEPA

The Problem: Basic Math Agent Limitations

Let's start with a concrete example. Consider a basic math agent that can solve quadratic equations but lacks sophistication:

Input: "Solve x² - 5x + 6 = 0"
Basic Agent Output: "Using the quadratic formula: x = 2 or x = 3"

While technically correct, this output lacks:

Multiple solution approaches
Step-by-step reasoning
Verification steps
Educational value

GEPA Optimization Process

Here's how GEPA transforms this agent through reflective optimization:

Step 1: Initial Setup

# Clone the demonstration repository
git clone https://github.com/SuperagenticAI/gepa-eval.git
cd gepa-eval

# Set up the environment
./scripts/setup.sh

Step 2: Agent Compilation

# Compile the math agent
super agent compile advanced_math_gepa

This creates a pipeline from the agent playbook, establishing baseline capabilities.

Step 3: Baseline Evaluation

# Evaluate current performance
super agent evaluate advanced_math_gepa

Typical baseline results show ~60% accuracy with basic problem-solving approaches.

Step 4: GEPA Optimization

# Run GEPA optimization (3-5 minutes)
super agent optimize advanced_math_gepa

During optimization, GEPA:

Executes the agent on training scenarios
Reflects on failures using the reflection LM
Generates improved prompt candidates
Evaluates candidates using domain-specific feedback
Iterates to build a tree of improvements

Step 5: Remarkable Transformation

After optimization, the same input produces dramatically improved output:

Input: "Solve x² - 5x + 6 = 0"

GEPA-Optimized Output:

**Method 1: Factoring**
Step 1: Look for two numbers that multiply to 6 and add to -5
These numbers are -2 and -3
Factor: (x - 2)(x - 3) = 0
Solutions: x = 2 or x = 3

**Method 2: Quadratic Formula**
For ax² + bx + c = 0, x = (-b ± √(b²-4ac)) / 2a
Here: a=1, b=-5, c=6
x = (5 ± √(25-24)) / 2 = (5 ± 1) / 2
x = 3 or x = 2

**Verification:**
x = 2: (2)² - 5(2) + 6 = 4 - 10 + 6 = 0 ✓
x = 3: (3)² - 5(3) + 6 = 9 - 15 + 6 = 0 ✓

Performance Improvements

The optimization yields measurable improvements:

Accuracy: 60% → 95%
Multiple Methods: Single approach → Multiple solution paths
Verification: None → Complete validation
Education: Basic → Pedagogically structured

Quick Start Guide: Getting Started with GEPA

Prerequisites

System Requirements:

Python 3.11+
8GB+ RAM (16GB+ recommended)
SuperOptiX framework

Model Requirements:

# Install required models
ollama pull llama3.1:8b      # Primary processing
ollama pull qwen3:8b         # GEPA reflection
ollama pull llama3.2:1b      # Lightweight option

Interactive Demo Experience

The fastest way to experience GEPA is through our demonstration repository:

# Clone and run lightweight demo (2-3 minutes)
git clone https://github.com/SuperagenticAI/gepa-eval.git
cd gepa-eval
./scripts/run_light_demo.sh

# Or run full demo (5-10 minutes, better results)
./scripts/run_demo.sh

Integration with SuperOptiX

Once you've experienced the demo, integrate GEPA into your SuperOptiX projects:

# 1. Install SuperOptiX
pip install superoptix

# 2. Initialize your project
super init my_gepa_project
cd my_gepa_project

# 3. Pull a GEPA-enabled agent
super agent pull advanced_math_gepa

# 4. Compile and optimize
super agent compile advanced_math_gepa
super agent optimize advanced_math_gepa

# 5. Test the optimized agent
super agent run advanced_math_gepa --goal "Your problem here"

Creating Custom GEPA Agents

Create domain-specific agents with GEPA optimization:

# custom_agent_playbook.yaml
apiVersion: agent/v1
kind: AgentSpec
metadata:
  name: Custom GEPA Agent
  id: custom-gepa
spec:
  language_model:
    location: local
    provider: ollama
    model: llama3.1:8b
  optimization:
    optimizer:
      name: GEPA
      params:
        metric: advanced_math_feedback  # Choose appropriate metric
        auto: light
        reflection_lm: qwen3:8b
  feature_specifications:
    scenarios:
      - name: example_scenario
        input:
          problem: "Your domain-specific problem"
        expected_output:
          answer: "Expected high-quality response"

Where GEPA Excels and Where It Makes Less Sense

GEPA Works Well When:

The task is open-ended, ambiguous, or has multiple "good enough" answers.
You want to optimize for semantic similarity, not just exact match.
You have access to a strong reflection LLM.

GEPA Makes Less Sense When:

The task is trivial or has a single, unambiguous answer.
You don't have a good semantic metric.
You want very fast, one-shot optimization.

GEPA's Sweet Spots

Specialized Domains: GEPA shines in domains requiring expertise:

Mathematics: Multi-step problem solving with verification
Healthcare: Medical reasoning with safety considerations
Legal: Contract analysis with compliance validation
Security: Vulnerability detection with remediation guidance
Finance: Risk assessment with regulatory alignment

Quality-Critical Applications: When accuracy and interpretability matter more than speed:

Educational content generation
Professional consulting
Regulatory compliance
Safety-critical systems

Limited Training Data: GEPA excels when you have:

3-10 high-quality examples
Domain expertise but limited labeled data
Need for rapid prototyping in specialized areas

Multi-Objective Requirements: When optimizing for multiple criteria:

Accuracy + Safety + Compliance
Performance + Interpretability + Efficiency
Domain expertise + User experience

When to Consider Alternatives

Simple, General Tasks: For basic question-answering or general-purpose agents, traditional optimizers may be sufficient:

Basic Q&A systems
Simple classification tasks
General conversation agents

Large Dataset Scenarios: With 100+ training examples, other optimizers might be more efficient:

Large-scale content moderation
Bulk document processing
High-volume customer service

Resource Constraints: GEPA requires more resources:

Memory: Needs two models (primary + reflection)
Time: 3-5+ minutes for optimization
Compute: More intensive than simple optimizers

Tool-Calling Agents: GEPA currently doesn't work with ReAct agents that use tools as per the our experiment but there might be workarounds (Genies tier and above in SuperOptiX).

Advanced Customization and Use Cases

Custom Feedback Metrics

Create domain-specific feedback functions for your specialized use cases:

def healthcare_compliance_feedback(example, pred, trace=None):
    """Custom feedback for healthcare applications."""
    from dspy.primitives import Prediction

    # Analyze medical accuracy, safety, and compliance
    score = evaluate_medical_response(example, pred)
    feedback = generate_improvement_suggestions(example, pred)

    return Prediction(score=score, feedback=feedback)

Potential Use Cases

Educational Technology:

Personalized tutoring systems with step-by-step explanations
Adaptive learning platforms with domain-specific feedback
Assessment generators with pedagogical optimization

Professional Services:

Legal document analysis with compliance checking
Financial risk assessment with regulatory alignment
Medical diagnosis support with safety validation

Research and Development:

Scientific literature review with methodology validation
Patent analysis with competitive intelligence
Market research with trend identification

You can look for other GEPA agent in the SuperOptiX docs here.

Documentation and Resources

For comprehensive guides and technical documentation, explore:

GEPA Optimization Guide: Complete technical documentation
DSPy Optimizers Overview: All available optimizers
Interactive Demo Repository: Hands-on examples
SuperOptiX Documentation: Full framework documentation
Original GEPA Paper: Research foundation

Conclusion: The Future of AI Agent Optimization

GEPA's integration with SuperOptiX represents more than just another optimization technique, it's an intelligent, reflective agent improvement. By combining the power of DSPy's optimization framework with GEPA's revolutionary reflective capabilities, SuperOptiX enables developers to create AI agents that don't just perform tasks, but genuinely understand and improve their own reasoning processes. The transformation we've witnessed in our math agent example from basic problem solving to sophisticated, multi-method approaches with verification that demonstrates the practical impact of this integration.

As AI continues to evolve, the agents that will make the greatest impact are those that can learn from their mistakes, adapt to new domains, and provide interpretable, trustworthy reasoning. GEPA in SuperOptiX provides the foundation for building these next-generation intelligent systems.

Ready to experience the future of AI agent optimization? Start with our interactive demo and see the transformation for yourself.

SuperOptiX is the comprehensive AI agent framework that makes advanced optimization accessible to every developer. Learn more at SuperOptix.ai or explore the full documentation.

Optimas + SuperOptiX: Global‑Reward Optimization for DSPy, CrewAI, AutoGen, and OpenAI Agents SDK

Shashi Jagtap — Thu, 14 Aug 2025 10:14:55 +0000

Optimization has been central to SuperOptiX from day one—whether it's prompts, weights, parameters, or compute. It began with DSPy-style programmatic prompt engineering and teleprompting as it was the only framework doing prompt optimization. It was surprising that other frameworks couldn't figure out ways to optimize prompts like DSPy, but now we have a solution. Today, we're bringing Optimas into the SuperOptiX ecosystem so you can apply globally aligned local rewards across multiple frameworks: OpenAI Agent SDK, CrewAI, AutoGen, and DSPy. You can check out the Optimas and SuperOptiX integration here.

Optimizing a single prompt isn't enough for modern "compound" AI systems. Real systems chain LLMs, tools, and traditional ML into multi‑step workflows, and the right unit of optimization is the whole pipeline. Optimas introduces globally aligned local rewards that make per‑component improvements reliably lift end‑to‑end performance. SuperOptiX now brings Optimas to your existing agent stacks—OpenAI Agent SDK, CrewAI, AutoGen, and DSPy—behind one practical CLI, so you can go from baseline to optimized without changing frameworks.

What Optimas is (and why it matters)

Optimas is a unified optimization framework for compound AI systems. It:

Learns an LRF per component that remains globally aligned, so local updates are safe and beneficial to the whole system. This enables efficient optimization without always running the entire pipeline for every candidate. See the method and guarantees: arXiv paper.
Supports heterogeneous configuration types:
- Prompts and textual instructions via metric‑guided search
- Hyperparameters and discrete choices like top‑k, tool/model selection, routing
- Model parameters where supported (e.g., RL with PPO)
Works across frameworks (OpenAI Agent SDK, CrewAI, AutoGen, DSPy) through target adapters.
Compound‑system optimization: works across multiple components and tools, not just single prompts.
Global alignment via local rewards: optimize each component while ensuring the whole system converges.
Multiple optimizers:
- OPRO: single‑iteration prompting optimization
- MIPRO: multi‑iteration prompting
- COPRO: cooperative optimization

What this unlocks

Optimize prompts, hyperparameters, model parameters, and model routers across compound AI systems.
Run OPRO, MIPRO, and COPRO optimization loops using a single CLI workflow.
Keep your preferred agent stack (DSPy, CrewAI, AutoGen, OpenAI SDK) and get consistent optimization behavior.

Why this is impactful

Optimas learns a local reward function (LRF) for each component that stays aligned with a global objective. Independently maximizing a component's local reward still increases overall system quality. This is data‑efficient and avoids excessive full system runs.
It supports heterogeneous updates across prompts, hyperparameters, model selection/routing, and (where applicable) model parameters via RL, achieving consistent gains across diverse pipelines.
The Optimas paper reports an average relative improvement of 11.92% across five complex compound systems, with theoretical guarantees and strong empirical results. See:
- Optimas site: optimas.stanford.edu
- Paper: arXiv: OPTIMAS

Where Optimas fits in SuperOptiX

SuperOptiX as the name suggests, is built around optimization; Optimas plugs directly into its lifecycle:

Compile your agent into a runnable pipeline for a specific target.
Evaluate to get a baseline.
Optimize with Optimas (OPRO/MIPRO/COPRO) using the same CLI across targets.
Run the optimized agent.

This extends optimization beyond prompts to hyperparameters, model selection, routing, and parameters where supported.

Focus‑aligned: SuperOptiX is built around optimization; Optimas operationalizes optimization across agents and tools.
Beyond prompts: SuperOptiX + Optimas can optimize prompts, hyperparameters, model parameters, and even model routers, aligning with complex production workflows.
One CLI to rule them all: Use a consistent sequence—compile, evaluate, optimize, run—across all targets.

Optimas vs. DSPy (complementary)

DSPy is a framework for composing LLM pipelines and programmatic teleprompting.
Optimas is an optimization engine that runs globally aligned local updates across multi‑component systems, regardless of the underlying framework (including DSPy).
In practice: build in your preferred stack; use Optimas to optimize it end‑to‑end. If your system is in DSPy, try --optimizer mipro for deeper prompt refinement; OPRO and COPRO are also available.
In short: DSPy builds and teleprompts pipelines; Optimas runs global optimization across pipelines and frameworks. They complement each other.

How SuperOptiX uses Optimas

SuperOptiX ships templates and playbooks for each target:

Targets: optimas-openai, optimas-crewai, optimas-autogen, optimas-dspy
Agents: optimas_openai, optimas_crewai, optimas_autogen, optimas_dspy
Optimizers: --optimizer opro|mipro|copro
Engine: --engine optimas to engage Optimas in the pipeline

Use a consistent CLI surface:

Compile with --target
Evaluate with --engine optimas
Optimize with --engine optimas and --optimizer
Run with the chosen --target and --engine when applicable

Installation

# Core Optimas integration
pip install "superoptix[optimas]"

# Target-specific extras (choose as needed)
pip install "superoptix[optimas,optimas-openai]"
pip install "superoptix[optimas,optimas-crewai]"
pip install "superoptix[optimas,optimas-autogen]"
pip install "superoptix[optimas,optimas-dspy]"

# CrewAI note: resolve dependency pin by installing CrewAI without deps, then json-repair
pip install crewai==0.157.0 --no-deps
pip install "json-repair>=0.30.0"

CrewAI has a hard dependency on json-repair 0.26.0 while DSPy 3.0.0 needs it 0.30.0.

Quick start across all targets

Project setup and demo agents:

# Initialize a new project
super init test_optimas
cd test_optimas

# Pull demo playbooks
super agent pull optimas_openai      # OpenAI SDK (recommended)
super agent pull optimas_crewai      # CrewAI
super agent pull optimas_autogen     # AutoGen
super agent pull optimas_dspy        # DSPy

The unified lifecycle (works the same for all targets)

The sequence is consistent: compile → evaluate → optimize → run. Below are target‑specific recipes with practical knobs.

OpenAI Agent SDK (recommended for production)

This is the fastest, most stable path to results with minimal dependency friction.

# Compile → Evaluate
super agent compile optimas_openai --target optimas-openai
super agent evaluate optimas_openai --engine optimas --target optimas-openai

# Optimize (OPRO shown; adjust search breadth, temperature, and timeout inline)
SUPEROPTIX_OPRO_NUM_CANDIDATES=3 \
SUPEROPTIX_OPRO_MAX_WORKERS=3 \
SUPEROPTIX_OPRO_TEMPERATURE=0.8 \
SUPEROPTIX_OPRO_COMPILE_TIMEOUT=120 \
super agent optimize optimas_openai --engine optimas --target optimas-openai --optimizer opro

# Run
super agent run optimas_openai --engine optimas --target optimas-openai \
  --goal "Write a Python function to add two numbers"

Why these knobs matter: candidates broaden the search; temperature encourages variation; compile timeout helps for larger models; workers control concurrency for faster iterations.

CrewAI (great for role‑based multi‑agent workflows)

If you orchestrate crews of agents, Optimas can optimize prompts and task hyperparameters in the same loop.

# Compile → Evaluate
super agent compile optimas_crewai --target optimas-crewai
super agent evaluate optimas_crewai --engine optimas --target optimas-crewai

# Optimize (tune LiteLLM behavior; keep workers modest)
LITELLM_TIMEOUT=60 \
LITELLM_MAX_RETRIES=3 \
SUPEROPTIX_OPRO_MAX_WORKERS=3 \
super agent optimize optimas_crewai --engine optimas --target optimas-crewai --optimizer opro

# Run
super agent run optimas_crewai --engine optimas --target optimas-crewai \
  --goal "Write a Python function to calculate factorial"

Tip: retries and timeouts harden long‑running optimization loops against transient provider hiccups. For model client behavior, see LiteLLM.

AutoGen (strong for conversational/multi‑agent; optimization can be slower)

AutoGen excels at complex, multi‑turn agent interactions; give the optimizer more headroom.

# Compile → Evaluate
super agent compile optimas_autogen --target optimas-autogen
super agent evaluate optimas_autogen --engine optimas --target optimas-autogen

# Optimize (increase compile timeout for heavier pipelines)
LITELLM_TIMEOUT=60 \
LITELLM_MAX_RETRIES=3 \
SUPEROPTIX_OPRO_MAX_WORKERS=3 \
SUPEROPTIX_OPRO_COMPILE_TIMEOUT=180 \
super agent optimize optimas_autogen --engine optimas --target optimas-autogen --optimizer opro

# Run
super agent run optimas_autogen --engine optimas --target optimas-autogen \
  --goal "Write a Python function to reverse a string"

Why timeouts help: larger or tool‑heavy pipelines can exceed quick compile windows; a higher timeout reduces spurious failures during candidate generation.

DSPy (fully supported; tune concurrency if needed)

DSPy is a natural fit if your system is authored in DSPy. Start with OPRO for reliability; try MIPRO for deeper prompt improvements.

# Compile → Evaluate
super agent compile optimas_dspy --target optimas-dspy
super agent evaluate optimas_dspy --engine optimas --target optimas-dspy

# Optimize (start with OPRO; adjust temperature/workers)
SUPEROPTIX_OPRO_MAX_WORKERS=3 \
SUPEROPTIX_OPRO_TEMPERATURE=0.8 \
super agent optimize optimas_dspy --engine optimas --target optimas-dspy --optimizer opro

# Run
super agent run optimas_dspy --engine optimas --target optimas-dspy \
  --goal "Write a Python function to calculate fibonacci numbers"

If you see any concurrency/threading issues in your model client stack, set SUPEROPTIX_OPRO_MAX_WORKERS=1 during optimization to serialize candidate evaluations.

Choosing an optimizer (and when to use which)

OPRO (Optimization by Prompting): default choice; single‑iteration, dependable progress. super agent optimize <agent> --engine optimas --target <target> --optimizer opro
MIPRO (Multi‑Iteration Prompting): deeper prompt refinement across rounds; especially good for DSPy prompt programs. super agent optimize <agent> --engine optimas --target <target> --optimizer mipro
COPRO (Cooperative Prompting): coordinate improvements when several components benefit from joint search. super agent optimize <agent> --engine optimas --target <target> --optimizer copro

Practical tuning (inline, reproducible)

Keep configuration inline to make runs easy to share and reproduce:

# OPRO baseline
SUPEROPTIX_OPRO_NUM_CANDIDATES=3 \
SUPEROPTIX_OPRO_MAX_WORKERS=3 \
SUPEROPTIX_OPRO_TEMPERATURE=0.7 \
SUPEROPTIX_OPRO_COMPILE_TIMEOUT=60 \
super agent optimize <agent> --engine optimas --target <target> --optimizer opro

# Quality pass (longer search)
SUPEROPTIX_OPRO_TEMPERATURE=0.9 \
SUPEROPTIX_OPRO_COMPILE_TIMEOUT=300 \
super agent optimize <agent> --engine optimas --target <target> --optimizer opro

# Harden model client behavior (affects DSPy and CrewAI)
LITELLM_TIMEOUT=60 \
LITELLM_MAX_RETRIES=3 \
LITELLM_CACHE_ENABLED=false \
LITELLM_LOG_LEVEL=ERROR \
super agent optimize <agent> --engine optimas --target <target>

# If you see any concurrency/threading issues
SUPEROPTIX_OPRO_MAX_WORKERS=1 \
super agent optimize <agent> --engine optimas --target <target>

When to pick which target

OpenAI Agent SDK: most stable and fast—use this for production optimization and quick benchmarks.
CrewAI: best for role‑based multi‑agent workflows; follow the install note for json-repair.
AutoGen: designed for multi‑turn and conversational agents; increase compile timeouts during optimization.
DSPy: ideal if your pipeline is authored in DSPy; start with OPRO, then explore MIPRO for deeper improvements.

Why this approach scales

Global‑local alignment ensures local progress translates into higher system performance, avoiding the "optimize one, regress the rest" trap common in compound systems. See theory and results: arXiv.
Data efficiency comes from optimizing locally with learned LRFs instead of always running the full system per candidate. The paper shows consistent gains with fewer full passes.
Heterogeneous optimization unifies prompt search, discrete hyperparameter/model selection, routing, and model parameter updates (e.g., PPO) into one loop.

What's next

Add a GEPA adapter so richer program synthesis traces and execution feedback can be optimized end‑to‑end inside SuperOptiX.
Continue collaborating with the Optimas team to upstream adapters and keep parity across targets and optimizers.
Expand optimization scope from prompts/hyperparameters into richer parameter and routing policies, guided by the LRF methodology.

Demo

Conclusion

SuperOptiX is not limited to prompt optimization; Optimas unlocked power to go beyond prompt optimization and support existing frameworks. This is an early stage for Optimas and we will evolve together as framework support gets stronger.

SuperOptiX Memory: A Practical Guide for Building Agents That Remember

Shashi Jagtap — Tue, 12 Aug 2025 08:28:40 +0000

Modern AI agents aren't just chatbots with prompt-in, answer-out. To feel coherent and genuinely helpful over time, they need to remember. Agent memory is the capability that lets an agent retain facts, preferences, conversations, and experiences across turns and sessions so every new interaction benefits from the history that came before it.

With memory, agents can personalize responses, maintain context across multi-step tasks, and learn from feedback. This is core to agentic systems, and it's why memory is a first-class feature in SuperOptiX.

SuperOptiX is a full-stack agentic AI framework designed for context and agent engineering with an evaluation-first, optimization-core philosophy. Explore the platform at the SuperOptiX website. The framework's declarative DSL, SuperSpec, lets you describe what you want and have SuperOptiX build the pipeline; learn more at the SuperSpec page and the SuperSpec documentation.

What is Agent Memory?

Conceptually, memory is how agents build "continuity of self." Concretely, it's a combination of mechanisms that store and retrieve useful information:

Short-term memory: session-scoped working memory and conversation history—what's happening right now and in the last few turns.
Long-term memory: durable knowledge that persists across sessions—facts, preferences, and patterns the agent should retain.
Episodic memory: structured records of interactions and events over time—who asked what, what the agent did, and how it turned out.
Context manager: a discipline for combining global, session, task, and local state into a just-right context sent to the model.

This layered design balances immediacy (short-term), durability (long-term), chronology (episodic), and precision (context management). The result is an agent that feels consistent, learns from experience, and remains efficient.

For a deeper conceptual and practical tour, see the Memory System Guide.

How SuperOptiX Memory Works

SuperOptiX provides a powerful, multi-layer memory model you can use via Python, via DSPy adapters configured with JSON-like configs, or declaratively through SuperSpec (YAML).

Short-term memory captures rolling conversation context and working notes. Use it for ephemeral state and the last N messages.
Long-term memory persists knowledge with optional semantic search—store guidance ("always return runnable code"), user preferences, and domain facts. Enable embeddings if you want recall by meaning, not just literal keywords.
Episodic memory tracks episodes and events—great for analytics and learning (e.g., "episode resolved successfully," "user preferred example-based explanations").
The context manager merges relevant state across scopes to build clean, bounded prompts for the LLM.

Choosing a Memory Backend

Pick the backend that matches your deployment needs:

file: portable, zero-ops JSON/pickle storage; great for demos and quick local runs.
sqlite: reliable embedded database; sensible default for most agents.
redis: networked, high-throughput in-memory store for production workloads.

Use Memory from Python (Public API)

Below are usage-only examples for working with memory in your own Python code.

from superoptix.memory import AgentMemory, FileBackend, SQLiteBackend
# RedisBackend is also available if you install and configure redis

# Create an agent memory (defaults to SQLite)
memory = AgentMemory(agent_id="writer-assistant")

# Short-term: store ephemeral context
memory.remember("User prefers TypeScript", memory_type="short", ttl=3600)

# Long-term: store durable knowledge with categories/tags
memory.remember(
    "Always provide runnable code snippets",
    memory_type="long",
    category="authoring_guidelines",
    tags=["writing", "code", "quality"]
)

# Recall (semantic search if embeddings are enabled)
results = memory.recall("runnable code", memory_type="long", limit=5)
for r in results:
    print(r["content"])

# Track an interaction episode with events
episode_id = memory.start_interaction({"user_id": "alice"})
memory.add_interaction_event("user_question", "How to configure memory backends?")
# ... generate your response ...
memory.end_interaction({"success": True})

# Introspection and housekeeping
print(memory.get_memory_summary())
memory.cleanup_memory()

# Explicit backend selection
file_memory = AgentMemory("file-demo", backend=FileBackend(".superoptix/memory"))
sqlite_memory = AgentMemory("sqlite-demo", backend=SQLiteBackend(".superoptix/mem.db"))

Configure Memory via DSPy Adapters (JSON)

SuperOptiX integrates memory into DSPy-based agents through adapters. You don't need to wire internals—provide a JSON-like configuration dict (or load it from a .json file), and the adapter will:

1) retrieve relevant long-term memories for the query,
2) include recent short-term conversation snippets,
3) manage episodes and events,
4) persist useful insights after responses.

See DSPy's adapter documentation for background on the adapter pattern.

How DSPy Adapters Integrate with Memory

The DSPy adapter creates a memory-enhanced agent module that automatically handles the complete memory lifecycle:

Memory Initialization: When you create a DSPyAdapter, it automatically instantiates an AgentMemory system based on your config. The adapter reads the memory.enabled and memory.enable_embeddings flags to configure the memory system appropriately.

Memory-Enhanced Agent Module: The adapter generates a custom DSPy module (MemoryEnhancedAgentModule) that wraps your agent logic with memory operations. This module:

Starts an interaction episode when processing begins
Retrieves relevant memories before generating responses
Stores conversation history and insights after completion
Manages the complete interaction lifecycle

Context Building Process: Before sending a query to the LLM, the adapter:

Searches long-term memory for semantically relevant knowledge
Retrieves recent conversation context from short-term memory
Merges persona information, relevant memories, and conversation history
Builds a clean, bounded context string for the model

Memory Persistence: After the LLM generates a response, the adapter:

Stores the Q&A pair in short-term memory for immediate context
Adds the interaction to the conversation history
Logs events (user query, agent response) to the episodic memory
Ends the interaction episode with success/failure metadata

Example JSON config (save as `agent.config.json`)

{
  "llm": {
    "provider": "ollama",
    "model": "llama3.2:1b",
    "api_base": "http://localhost:11434",
    "temperature": 0.2
  },
  "persona": {
    "name": "MemoryDemo",
    "description": "Demonstrates SuperOptiX layered memory"
  },
  "memory": {
    "enabled": true,
    "enable_embeddings": true
  }
}

Advanced Memory Configuration

You can fine-tune memory behavior through additional configuration options:

{
  "llm": {
    "provider": "ollama",
    "model": "llama3.2:1b",
    "api_base": "http://localhost:11434"
  },
  "persona": {
    "name": "AdvancedMemoryBot",
    "description": "Advanced memory configuration example"
  },
  "memory": {
    "enabled": true,
    "enable_embeddings": true,
    "short_term_capacity": 200,
    "memory_retrieval": {
      "max_memories": 5,
      "min_similarity": 0.3,
      "include_conversation_history": true
    },
    "episodic_tracking": {
      "auto_start_episodes": true,
      "event_logging": true,
      "outcome_tracking": true
    }
  }
}

Run with the DSPy adapter

import json
import asyncio
from superoptix.adapters.dspy_adapter import DSPyAdapter
# Or: from superoptix.adapters.observability_enhanced_dspy_adapter import ObservabilityEnhancedDSPyAdapter

with open("agent.config.json", "r") as f:
    config = json.load(f)

adapter = DSPyAdapter(config)
# adapter = ObservabilityEnhancedDSPyAdapter(config)  # for detailed tracing/debugging

async def main():
    result = await adapter.run({
        "query": "Remind me how to enable memory in SuperSpec.",
        "context": {"user_id": "alice"}  # optional context becomes part of the episode
    })
    print(result["result"])
    print("Memory stats:", result.get("memory_stats") or result.get("observability", {}).get("memory_stats"))

asyncio.run(main())

Memory Statistics and Monitoring

The adapter returns comprehensive memory statistics with each response:

# Example response structure
{
    "result": "To enable memory in SuperSpec, add the memory section...",
    "episode_id": "ep_12345",
    "memory_stats": {
        "interactions": 15,
        "short_term_items": 8,
        "long_term_items": 42,
        "active_episodes": 1
    }
}

Observability-Enhanced Adapter

For production deployments, use the ObservabilityEnhancedDSPyAdapter which provides:

Detailed memory operation tracing
Performance metrics for memory operations
Debug breakpoints for memory inspection
Integration with external observability tools (MLflow, Langfuse)

Tip: To extend observability, include:

"observability": {
  "debug_mode": false,
  "trace_memory": true,
  "enable_breakpoints": false
}

Configure Memory in SuperSpec (YAML)

SuperSpec is SuperOptiX's declarative DSL. You describe your agent, and SuperOptiX compiles it to a runnable DSPy pipeline. Learn about SuperSpec at the SuperSpec page and browse the full SuperSpec documentation.

apiVersion: agent/v1
kind: AgentSpec
metadata:
  name: memory-demo
  id: memory_demo
  namespace: demo
  level: genies
spec:
  language_model:
    location: local
    provider: ollama
    model: llama3.1:8b
    api_base: http://localhost:11434
    temperature: 0.7
    max_tokens: 2048

  memory:
    enabled: true
    short_term:
      enabled: true
      max_tokens: 2000
      window_size: 10
    long_term:
      enabled: true
      storage_type: local    # file | sqlite | redis
      max_entries: 500
      persistence: true
    episodic:
      enabled: true
      max_episodes: 100
    context_manager:
      enabled: true
      max_context_length: 4000
      context_strategy: sliding_window

Compile and run with the Super CLI:

# Ensure a local model is installed (Ollama is the default backend)
super model install llama3.2:8b

# Compile and run the agent
super agent compile memory_demo
super agent run memory_demo --goal "Show me how memory works in SuperOptiX"

For a complete overview of the SuperOptiX platform, visit the SuperOptiX website. For a deep dive into memory systems and examples, check out the Memory System Guide.

Practical Patterns and Tips

Start with sqlite for persistence; use file for simple portability; use redis for high-throughput services.
Use short-term memory for rolling conversation context; use long-term memory for durable knowledge with categories and tags.
Treat episodic memory as your analytics backbone: start episodes around conversations/tasks, log events, and end with outcomes.
Enable embeddings when you need "by-meaning" recall; leave it off to save compute for keyword-only search.
Periodically call cleanup APIs for long-running services to keep memory lean.
Use the observability-enhanced adapter for production deployments to monitor memory performance and debug issues.
Configure appropriate memory retrieval limits to balance context richness with prompt efficiency.

References

SuperOptiX: A Deep Technical Dive into the Next-Generation AI Agent Framework

Shashi Jagtap — Tue, 12 Aug 2025 08:21:04 +0000

SuperOptiX: A Deep Technical Dive into the Next-Generation AI Agent Framework

Building Intelligent Agents with DSPy, RAG, Memory, and Observability

Built by Superagentic AI • Official Website • Documentation

Introduction

SuperOptiX represents a paradigm shift in AI agent development, combining the declarative power of DSPy with enterprise-grade features like RAG (Retrieval-Augmented Generation), multi-layered memory systems, comprehensive observability, and a sophisticated tool ecosystem. This deep technical dive explores how SuperOptiX leverages DSPy under the hood to create a powerful yet accessible framework for building production-ready AI agents.

Core Architecture Overview

SuperOptiX is built on a modular, extensible architecture that separates concerns while maintaining tight integration between components. The framework leverages DSPy as its core reasoning engine while adding enterprise capabilities through carefully designed abstractions.

Core Architecture Components

SuperOptiX Framework consists of the following key components:

DSPy Core: Signatures, Modules, Optimizers
SuperSpec DSL: Schema Validation, Template Generation, Compliance Checking
RAG System: Vector Databases, Document Processing, Semantic Search
Memory System: Short-term Memory, Episodic Memory, Long-term Memory
Observability: Event Tracing, Performance Metrics, External Integrations
Tool Ecosystem: Core Tools, Domain Tools, Custom Tools
Model Management: Multi-backend support, Model configuration

1. DSPy Integration: The Reasoning Engine

At the heart of SuperOptiX lies DSPy (Declarative Self-improving Language Programs), which provides the foundational reasoning capabilities. The framework extends DSPy through a sophisticated pipeline architecture that maintains the declarative programming model while adding enterprise features.

DSPy MixIn Pattern

SuperOptiX employs a sophisticated MixIn pattern that seamlessly extends DSPy's capabilities while maintaining its core abstractions. This architectural approach allows the framework to layer enterprise-grade features on top of DSPy's declarative programming model without disrupting its fundamental design principles. The MixIn system operates through multiple specialized components: the Tracing MixIn automatically instruments all DSPy operations with comprehensive observability, capturing performance metrics, execution traces, and external integrations; the Memory MixIn enhances DSPy's context management with persistent short-term, episodic, and long-term memory capabilities, enabling agents to maintain conversation history and learn from past interactions; the RAG MixIn extends DSPy's retriever system with enterprise vector database support, providing unified access to multiple vector databases while maintaining compatibility with DSPy's retrieval patterns; the Tool MixIn integrates a comprehensive tool ecosystem with DSPy's tool system, automatically registering and managing tools across different categories while preserving DSPy's tool execution model; and the Model MixIn provides multi-backend model management that works seamlessly with DSPy's language model abstraction, supporting various model providers while maintaining the framework's model configuration patterns. This MixIn architecture ensures that developers can leverage all of DSPy's strengths—signatures, modules, optimizers, and the declarative programming model—while benefiting from enterprise features like comprehensive observability, persistent memory, RAG capabilities, extensive tooling, and flexible model management, all without requiring changes to DSPy's core implementation or breaking existing DSPy workflows.

# Conceptual representation of the DSPy MixIn pattern
class SuperOptixPipeline(dspy.Module):
    """
    SuperOptiX extends DSPy through a MixIn pattern that adds:
    - Automatic component initialization
    - Enterprise feature integration
    - Performance monitoring
    - Memory management
    - Tool ecosystem integration
    """

    def __init__(self, config=None):
        super().__init__()
        self.config = config or {}

        # Auto-setup framework components through MixIns
        self._setup_tracing()
        self._setup_language_model()
        self._setup_tools()
        self._setup_memory()
        self._setup_evaluation()

        # Call user-defined setup
        self.setup()

How SuperOptiX Uses DSPy Under the Hood

SuperOptiX leverages DSPy's core components in several key ways:

1. Signature-Based Agent Definition

SuperOptiX uses DSPy signatures to define agent capabilities declaratively:

# Example of how SuperOptiX leverages DSPy signatures
@abstractmethod
def get_signature(self) -> dspy.Signature:
    """Return the DSPy signature for this agent."""
    pass

@abstractmethod
def forward(self, **kwargs) -> dspy.Prediction:
    """Implement the core reasoning logic."""
    pass

This approach ensures that all agents follow DSPy's declarative programming model while maintaining the flexibility to implement custom reasoning patterns.

2. Module Composition

SuperOptiX composes DSPy modules to create sophisticated agent pipelines:

# Conceptual representation of module composition
class AgentPipeline(dspy.Module):
    def __init__(self):
        super().__init__()
        # Compose DSPy modules for different capabilities
        self.chain_of_thought = dspy.ChainOfThought()
        self.react_agent = dspy.ReAct()
        self.retriever = dspy.Retrieve()

3. Optimizer Integration

SuperOptiX integrates DSPy optimizers for automatic performance tuning:

# Example of optimizer integration
def optimize_agent(self, training_data):
    """Use DSPy optimizers to improve agent performance."""
    optimizer = dspy.BootstrapFewShot()
    optimized_pipeline = optimizer.compile(self, trainset=training_data)
    return optimized_pipeline

2. SuperSpec DSL: Declarative Agent Definition

SuperOptiX introduces SuperSpec, a Domain-Specific Language for defining agent playbooks with comprehensive validation and compliance checking. For detailed documentation on SuperSpec, visit the official documentation.

Schema-Driven Development

apiVersion: agent/v1
kind: AgentSpec
metadata:
  name: "Math Tutor"
  id: "math-tutor"
  namespace: "education"
  version: "1.0.0"
spec:
  language_model:
    provider: "ollama"
    model: "llama3.2:1b"
  persona:
    role: "Mathematics Teacher"
    goal: "Help students learn mathematics concepts"
  tasks:
    - name: "solve_math_problem"
      instruction: "Solve the given mathematical problem step by step"
      inputs: [{"name": "problem", "type": "str"}]
      outputs: [{"name": "solution", "type": "str"}]
  agentflow:
    - name: "analyze_problem"
      type: "Think"
      task: "solve_math_problem"

Template Generation System

The SuperSpec generator provides intelligent template creation that maps to DSPy components:

# Conceptual representation of template generation
class SuperSpecGenerator:
    """Generates agent templates that map to DSPy components."""

    def generate_template(self, tier: str, role: str, namespace: str):
        """Generate a template that maps to appropriate DSPy modules."""
        template = {
            "apiVersion": "agent/v1",
            "kind": "AgentSpec",
            "spec": {
                "language_model": self._get_model_config(tier),
                "persona": self._get_persona_config(role),
                "tasks": self._get_task_configs(role, tier),
                "agentflow": self._get_agentflow_config(role, tier)
            }
        }
        return template

3. RAG System: DSPy Retriever Integration

SuperOptiX implements RAG capabilities by extending DSPy's retriever system with enterprise features.

DSPy Retriever Extension

# Conceptual representation of RAG integration with DSPy
class RAGMixin:
    """Mixin providing RAG capabilities to SuperOptiX pipelines."""

    def setup_rag(self, spec_data):
        """Setup RAG system that integrates with DSPy retrievers."""
        # Configure vector database
        self._setup_vector_database(config)

        # Create DSPy retriever
        self._setup_dspy_retriever(config)

        return True

    def _setup_dspy_retriever(self, config):
        """Create a DSPy retriever that works with our vector database."""
        # Custom retriever that integrates with our vector database
        class CustomRetriever:
            def __init__(self, vector_db, k=5):
                self.vector_db = vector_db
                self.k = k

            def __call__(self, query, k=None):
                # Query vector database and return results
                results = self.vector_db.search(query, k or self.k)
                return results

Multi-Vector Database Support

SuperOptiX supports multiple vector databases through a unified interface:

ChromaDB: Local vector database with persistence
LanceDB: High-performance vector database
FAISS: Facebook AI Similarity Search
Weaviate: Vector search engine
Qdrant: Vector similarity search engine
Milvus: Open-source vector database
Pinecone: Cloud vector database

4. Memory System: Context-Aware Reasoning

SuperOptiX implements a sophisticated memory system that enhances DSPy's reasoning capabilities with persistent context.

Memory Integration with DSPy

# Conceptual representation of memory integration
class MemoryMixin:
    """Mixin providing memory capabilities to SuperOptiX pipelines."""

    def setup_memory(self, config):
        """Setup memory system that enhances DSPy reasoning."""
        self.short_term = ShortTermMemory()
        self.long_term = LongTermMemory()
        self.episodic = EpisodicMemory()

        # Integrate with DSPy context
        self._setup_dspy_context_integration()

    def _setup_dspy_context_integration(self):
        """Integrate memory with DSPy's context management."""
        # Enhance DSPy's context with our memory system
        pass

Memory Types

SuperOptiX provides three types of memory that work together:

Short-term Memory: Recent context and working memory
Episodic Memory: Conversation history and task episodes
Long-term Memory: Persistent storage and knowledge base

5. Observability: DSPy-Aware Tracing

SuperOptiX implements comprehensive observability that tracks DSPy operations and provides detailed insights.

DSPy-Aware Tracing

# Conceptual representation of DSPy-aware tracing
class SuperOptixTracer:
    """Tracer that understands DSPy operations."""

    def trace_dspy_operation(self, operation_name, dspy_module):
        """Trace DSPy operations with context."""
        with self.trace_operation(operation_name, "dspy"):
            # Track DSPy-specific metrics
            self._track_dspy_metrics(dspy_module)
            return dspy_module

    def _track_dspy_metrics(self, dspy_module):
        """Track DSPy-specific performance metrics."""
        # Track signature calls, module execution, optimizer performance
        pass

Performance Monitoring

The observability system tracks:

DSPy Operations: Signature calls, module execution, optimizer performance
Model Interactions: Token usage, response times, error rates
Tool Usage: Tool calls, execution times, success rates
Memory Operations: Storage, retrieval, context management
RAG Queries: Vector database queries, retrieval performance

6. Tool System: DSPy Tool Integration

SuperOptiX provides a comprehensive tool ecosystem that integrates seamlessly with DSPy's tool system.

Tool Registration with DSPy

# Conceptual representation of tool integration
class ToolRegistry:
    """Registry that integrates tools with DSPy."""

    def register_tool(self, tool_name, tool_func):
        """Register a tool that can be used by DSPy agents."""
        # Register with our registry
        self.tools[tool_name] = tool_func

        # Make available to DSPy
        self._register_with_dspy(tool_name, tool_func)

    def _register_with_dspy(self, tool_name, tool_func):
        """Register tool with DSPy's tool system."""
        # Integrate with DSPy's tool mechanism
        pass

Tool Categories

The framework organizes tools into logical categories:

Core Tools: Calculator, DateTime, File Reader, Text Analyzer, Web Search, JSON Processor
Domain Tools: Finance, Healthcare, Education, Legal, Marketing, Development
Custom Tools: User-defined tools and API integrations

7. Model Management: DSPy Model Integration

SuperOptiX provides comprehensive model management that works with DSPy's model abstraction.

Model Backend Support

SuperOptiX supports multiple model backends:

Ollama: Local model serving (recommended for cross-platform)
MLX: Apple Silicon optimization
LM Studio: Local model management
Hugging Face: Cloud model hosting
Custom: Custom endpoints and fine-tuned models

DSPy Model Configuration

# Conceptual representation of model integration
class ModelManager:
    """Manages model configuration for DSPy."""

    def setup_model(self, config):
        """Setup model that works with DSPy."""
        provider = config.get("provider", "ollama")
        model_name = config.get("model", "llama3.2:1b")

        # Configure model for DSPy
        dspy_model = self._configure_for_dspy(provider, model_name)

        # Set as DSPy's language model
        dspy.configure(lm=dspy_model)

8. CLI Interface: Unified Command Experience

SuperOptiX provides a comprehensive CLI that unifies all operations:

Command Structure

# Project Management
super init <project_name>
super spec generate <playbook_name> <template> --rag

# Agent Operations
super agent pull <agent_name>
super agent compile <agent_name>
super agent evaluate <agent_name>
super agent optimize <agent_name>
super agent run <agent_name>

# Model Management
super model install <model_name> -b <backend>
super model list
super model server

# Marketplace
super market browse agents
super market install agent <agent_name>
super market search "<query>"

# Observability
super observe dashboard
super observe traces

9. Data Flow Architecture

Agent Execution Flow

The SuperOptiX agent execution follows this sequence:

User initiates: User runs super agent run <agent>
CLI Interface: Processes the command and parses the playbook
Playbook Parser: Validates and extracts agent configuration
DSPy Generator: Creates the DSPy pipeline from the playbook
Agent Pipeline: Initializes the agent with all components
Component Setup:
- Model: Setup language model
- Tools: Register available tools
- RAG: Setup vector database (if enabled)
- Memory: Initialize memory system
- Observability: Start tracing
Query Processing:
- User sends query to agent
- Memory: Retrieve relevant context
- RAG: Retrieve knowledge (if enabled)
- Model: Generate response
- Tools: Execute tools (if needed)
- Memory: Store interaction
- Observability: Record events
- Return response to user

DSPy Integration Points

SuperOptiX integrates with DSPy at multiple levels:

Playbook Processing: SuperOptiX Playbook → SuperSpec Parser → DSPy Generator → DSPy Pipeline
Reasoning Modules: Chain of Thought, ReAct Agent, Custom Signatures
RAG Integration: RAG System → DSPy Retriever
Memory Integration: Memory System → DSPy Context
Tool Integration: Tool System → DSPy Tools
Model Integration: Model Management → DSPy Language Model

10. Performance Optimization

DSPy-Specific Optimizations

SuperOptiX implements several optimizations specifically for DSPy:

Signature Optimization: Efficient signature compilation and caching
Module Composition: Optimized module composition for better performance
Optimizer Integration: Automatic optimization using DSPy optimizers
Context Management: Efficient context handling for large conversations

RAG Optimization

Efficient chunking strategies: Optimize document chunking for retrieval
Embedding model selection: Choose appropriate embedding models for domain
Vector database optimization: Configure database-specific optimizations
Query caching and result ranking: Improve response times and relevance

Memory Optimization

Memory hierarchy management: Optimize short-term vs long-term memory usage
Context window optimization: Balance context length with performance
Storage backend selection: Choose appropriate storage for use case
Garbage collection strategies: Manage memory efficiently

11. Security & Compliance

Data Security

Local model execution: Keep sensitive data on-premises
Encrypted storage: Secure memory and trace storage
Secure API key management: Protect external service credentials
Data anonymization: Anonymize data in observability systems

Access Control

Feature restrictions: Limit capabilities based on tier
User authentication: Secure access to agent operations
API rate limiting: Prevent abuse and ensure fair usage
Resource usage quotas: Manage computational resources

12. Deployment Architecture

Local Development

# Single machine setup
super init my_project
super model install llama3.1:8b -b ollama
super agent run my_agent

Production Deployment

Production deployment follows a scalable architecture. For detailed deployment guides, visit the official documentation:

Load Balancer: Distributes requests across multiple agent instances
Agent Instances: Multiple instances (Instance 1, Instance 2, Instance N) for horizontal scaling
Model Service: Centralized model serving for all agent instances
Backend Services:
- Vector Database: For RAG capabilities
- Memory Store: For persistent memory
- Observability Platform: For monitoring and tracing

Scalability Features

Horizontal scaling: Scale agent instances across multiple machines
Model serving optimization: Optimize model inference performance
Database connection pooling: Efficient database resource management
Caching layers: Improve response times with intelligent caching

13. Integration Points

External Systems

Vector databases: ChromaDB, Pinecone, Weaviate, Qdrant, Milvus
Model providers: OpenAI, Anthropic, Hugging Face, local models
Observability platforms: MLflow, Langfuse, Prometheus, Grafana
CI/CD pipelines: GitHub Actions, GitLab CI, Jenkins

Framework Integration

DSPy ecosystem: Full compatibility with DSPy modules and optimizers
LangChain integration: Planned integration for broader ecosystem
Hugging Face ecosystem: Leverage Hugging Face models and tools
Custom model frameworks: Support for custom model implementations

Conclusion

SuperOptiX represents a significant advancement in AI agent development, providing a comprehensive framework that combines the declarative power of DSPy with enterprise-grade features. The framework's DSPy MixIn pattern ensures that developers can leverage DSPy's strengths while benefiting from enterprise capabilities.

Key architectural strengths include:

DSPy Integration: Deep integration with DSPy's declarative programming model
MixIn Pattern: Clean extension of DSPy capabilities without breaking abstractions
Enterprise Features: RAG, memory, observability, and tool ecosystem
Performance: Optimized for production workloads with comprehensive caching
Security: Enterprise-grade security and compliance features
Scalability: Designed to scale from development to production

The framework's technical architecture provides a solid foundation for building sophisticated AI agents while maintaining developer productivity and operational excellence. Whether you're building simple Q&A agents or complex multi-agent systems, SuperOptiX provides the tools and infrastructure needed for success.

SuperOptiX is designed to be the go-to framework for AI agent development, combining cutting-edge research with practical engineering to deliver production-ready AI solutions. Built by Superagentic AI.