We upgraded our AI agent's "intelligence" from string matching to actual understanding
Our OUROBOROS system has 23 primitives. Think of them as reflexes: pattern-matched behaviors the agent can trigger without asking the LLM. Things like detecting when a task is similar to a past failure, or recognizing that a piece of feedback contradicts earlier advice.
Last month I audited how these primitives actually worked. The honest answer was uncomfortable.
Eight of them were genuinely smart. They used proper logic, maintained state, and produced useful results. Ten of them were keyword matching dressed up in function names that sounded impressive. The remaining five were pure theater. One of them evaluated assumptions by computing md5(assumption)[:8] % 3 == 0 and calling that "adversarial analysis." Another "mutated" directives by prepending the string [refined] to them. That was the mutation. It prepended a word.
Here's how we fixed the ten shallow ones in one shot, why the theater five are gone, and what the whole thing taught us about agent systems.
The audit
The trigger was a bug. The agent failed to recognize that "optimize database queries" and "speed up SQL performance" were the same task. The similarity primitive returned 0.0. Zero. Any developer knows these mean the same thing, but the primitive was comparing them with Jaccard similarity on tokenized words. The word sets {optimize, database, queries} and {speed, up, sql, performance} share zero tokens. So: 0.0 similarity.
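The failure is easy to reproduce. This mirrors the old primitive's logic rather than its exact code:

```python
def jaccard(a: str, b: str) -> float:
    # Token-set overlap: the old primitive's entire notion of "similarity".
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / max(len(sa | sb), 1)

print(jaccard("optimize database queries", "speed up SQL performance"))  # 0.0
```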
I started checking the others. The contradiction detector? It looked for the word "not" near another word. The deduplication primitive? Exact string match after lowercasing. The feedback clustering? Grouped by shared nouns using a simple POS tagger.
They weren't broken. They just weren't doing what their names claimed. It was like opening the hood of a car and finding a hamster wheel.
The fix: one shared semantic layer
The obvious fix would be to bolt some NLP onto each primitive individually. Maybe swap Jaccard for cosine similarity on TF-IDF vectors. Maybe add WordNet synonyms to the contradiction detector.
The less obvious but better fix: build one shared semantic embedding module and have all ten primitives use it.
We went with all-MiniLM-L6-v2 from sentence-transformers. It's a 22MB model that produces 384-dimensional embeddings. On my development machine (AMD Ryzen 5, no GPU, 14GB RAM) it runs in about 80ms per sentence. Not fast enough for real-time chat, but fast enough for the batch operations and background analysis these primitives actually do.
Here's the core module, simplified:
```python
from functools import lru_cache

try:
    from sentence_transformers import SentenceTransformer
    _model = SentenceTransformer('all-MiniLM-L6-v2')
    _SEMANTIC_AVAILABLE = True
except Exception:  # missing package or failed model download: fall back to Jaccard
    _SEMANTIC_AVAILABLE = False


@lru_cache(maxsize=512)
def embed(text: str):
    """Return a normalized 384-dim embedding, or None if the model is unavailable."""
    if not _SEMANTIC_AVAILABLE:
        return None
    return _model.encode(text, normalize_embeddings=True)


def similarity(a: str, b: str) -> float:
    ea, eb = embed(a), embed(b)
    if ea is None or eb is None:
        # Fallback: Jaccard overlap on lowercased tokens.
        sa, sb = set(a.lower().split()), set(b.lower().split())
        return len(sa & sb) / max(len(sa | sb), 1)
    # Embeddings are normalized, so the dot product is cosine similarity.
    return float(ea @ eb)
```
A few things worth noting about this setup.
The try/except import pattern means the system degrades gracefully. If sentence-transformers isn't installed, or if the model download fails, every primitive falls back to Jaccard similarity. The agent keeps working. It's just dumber. This matters because we deploy on some pretty constrained environments and not every box can spare 22MB for a model file.
The lru_cache with 512 entries means we don't re-encode the same strings. In practice, the same task descriptions and feedback snippets get compared repeatedly during a session, so the cache hit rate sits around 60-70%. Each cached hit drops the lookup from 80ms to roughly 0.
And normalize_embeddings=True means the dot product (ea @ eb) gives us cosine similarity directly. No need to compute norms separately.
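To make the interface concrete, here's a usage sketch. The expected score comes from the measurements in the next section, so treat it as approximate:

```python
score = similarity("optimize database queries", "speed up SQL performance")
# With the model loaded: roughly 0.74 (see the table below).
# Without it, the Jaccard fallback returns 0.0 because the phrases share no tokens.
print(f"similarity: {score:.3f}")
```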
The numbers
This is the part that surprised me. I expected an improvement, but the gap between keyword matching and semantic similarity was bigger than I thought.
| Comparison | Jaccard | Semantic |
|---|---|---|
| "optimize database queries" vs "speed up SQL performance" | 0.000 | 0.736 |
| "fix the login bug" vs "users can't sign in" | 0.000 | 0.682 |
| "refactor auth module" vs "clean up authentication code" | 0.250 | 0.814 |
| "add dark mode" vs "implement dark theme" | 0.000 | 0.891 |
| "improve error messages" vs "better error handling" | 0.167 | 0.593 |
| "update dependencies" vs "bump package versions" | 0.000 | 0.547 |
The Jaccard column is mostly zeros because synonyms and paraphrases don't share tokens. The semantic column isn't perfect either. 0.547 for "update dependencies" vs "bump package versions" is on the low side. But it's way better than zero. And for the primitives that consume these scores (deduplication, clustering, contradiction detection), a threshold of 0.55 catches most of what Jaccard misses entirely.
We tuned the thresholds per primitive after this. Deduplication uses 0.75 because false positives there mean merging unrelated tasks. Similarity detection uses 0.60 because it's better to over-suggest than to miss. The contradiction detector uses 0.50 as a first pass, then runs a separate logical analysis on high-similarity pairs. That two-stage approach (filter by similarity, then analyze by logic) turned out to be more reliable than the old "look for the word not" approach.
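Here's a rough sketch of that two-stage shape. The second stage is a placeholder comment, since the logical analysis itself isn't shown in this post:

```python
CONTRADICTION_PREFILTER = 0.50  # the first-pass threshold mentioned above

def contradiction_candidates(statements: list[str]) -> list[tuple[str, str]]:
    """Stage 1: keep only pairs of statements that are about the same thing."""
    pairs = []
    for i, a in enumerate(statements):
        for b in statements[i + 1:]:
            if similarity(a, b) >= CONTRADICTION_PREFILTER:
                pairs.append((a, b))
    return pairs

# Stage 2 (not shown): run the actual logical analysis only on these candidate
# pairs instead of on every possible pairing.
```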
Why one module instead of ten fixes
When I first realized ten primitives needed fixing, my instinct was to fix them one at a time. Add better NLP to the deduplicator. Add synonym expansion to the similarity checker. Maybe bring in spaCy for the contradiction detector.
That approach has two problems.
First, it means ten different NLP pipelines to maintain. Ten different model downloads. Ten different fallback behaviors. Ten different sets of edge cases.
Second, and more important: the hardest part of making these primitives work isn't the comparison logic. It's the embedding. Once you have good vector representations of the text, the rest is just arithmetic. Cosine similarity is a dot product. Clustering is k-means on vectors. Deduplication is thresholding the similarity matrix. The hard part is turning "speed up SQL performance" into a vector that lives near "optimize database queries" in vector space. That's what the sentence-transformer model does.
By centralizing that step, each primitive only needs to define its own threshold and its own response to the similarity score. The embedding work happens once and gets reused everywhere.
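For example, a deduplication primitive reduces to a threshold and a loop. This is a sketch, not our exact implementation, using the 0.75 threshold mentioned earlier:

```python
DEDUP_THRESHOLD = 0.75  # strict, because a false positive merges unrelated tasks

def dedupe(tasks: list[str]) -> list[str]:
    """Keep each task unless it is semantically close to one already kept."""
    kept: list[str] = []
    for task in tasks:
        if all(similarity(task, existing) < DEDUP_THRESHOLD for existing in kept):
            kept.append(task)
    return kept
```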
This also made the theater primitives easier to spot. When every primitive goes through the same module, you can add logging and see which ones actually call it. The five that never called it? Those were the theater ones. Gone now.
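The instrumentation can be as small as a wrapper that counts callers. A hypothetical sketch:

```python
import collections
import functools
import inspect

call_counts = collections.Counter()

def track_callers(fn):
    """Count which function calls fn, so primitives that never use it stand out."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        call_counts[inspect.stack()[1].function] += 1
        return fn(*args, **kwargs)
    return wrapper

similarity = track_callers(similarity)  # inspect call_counts after a session
```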
What we learned
Building agent systems for months has taught us a few things, and this refactor reinforced them.
Names lie. A function called detect_contradiction that searches for the word "not" is not detecting contradictions. It's doing string matching. The gap between what code is called and what code does is where bugs hide in agent systems. Audit early.
Shared infrastructure pays for itself. One embedding module upgraded ten primitives at once. The marginal cost of the eleventh primitive is near zero because the infrastructure is already there. Same argument as shared libraries, but it hits different when the "library" is a 22MB neural network you're loading into RAM.
Fallback behavior is not optional. The try/except import pattern took five minutes to write and has saved us multiple times. Deployment environments are unpredictable. The agent should work everywhere, just better where resources allow.
CPU is enough for a lot of things. We don't have a GPU. The embeddings run on a Ryzen 5 in 80ms. For batch operations and background analysis, that's fine. Not every ML feature needs a TPU. Ship the CPU version first, optimize later if you actually need to.
Theater code is worse than no code. Those five primitives that did nothing? They made the system seem more capable than it was. When you're debugging an agent and you see it has a "feedback_synthesis" primitive, you assume feedback synthesis is happening. When it's not, you waste hours checking everything except the primitive itself. We'd have been better off without it.
Where we are now
The system has 18 primitives. Eight that were already smart. Ten that got upgraded through the shared semantic layer. The five theater props are gone.
The embedding cache uses about 3MB of RAM at steady state. The model itself is 22MB on disk. Total inference time for a typical session (maybe 30-40 embedding calls) adds up to roughly 3 seconds, most of which is cached away.
We haven't tried a larger model yet. all-MiniLM-L6-v2 works well enough that the bottleneck is now in the threshold tuning and the downstream logic, not in the embeddings. When that changes, we'll revisit.
The code is on GitHub. The relevant module is semantic.py in the ouroboros package. If you're building something similar and want to compare notes, open an issue.