<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rost</title>
    <description>The latest articles on DEV Community by Rost (@rosgluk).</description>
    <link>https://dev.to/rosgluk</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3544400%2F04dd81bf-749e-4055-971f-316c0134e76c.jpg</url>
      <title>DEV Community: Rost</title>
      <link>https://dev.to/rosgluk</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rosgluk"/>
    <language>en</language>
    <item>
      <title>PARA Method for Engineers: Organize Knowledge by Action</title>
      <dc:creator>Rost</dc:creator>
      <pubDate>Sun, 21 Jun 2026 12:18:24 +0000</pubDate>
      <link>https://dev.to/rosgluk/para-method-for-engineers-organize-knowledge-by-action-1npg</link>
      <guid>https://dev.to/rosgluk/para-method-for-engineers-organize-knowledge-by-action-1npg</guid>
      <description>&lt;p&gt;Organizing notes by topic sounds logical until you have notes on PostgreSQL in five different folders and cannot find the one that matters for today's problem.&lt;/p&gt;

&lt;p&gt;The issue is not discipline. The issue is that topic-based organization asks the wrong question. "What is this about?" is useful for libraries. For engineers, the better question is "What am I doing with this?" That is the premise of PARA.&lt;/p&gt;

&lt;p&gt;PARA is a simple four-bucket system created by Tiago Forte as the organizational backbone of his &lt;a href="https://www.glukhov.org/knowledge-management/foundations/second-brain/" rel="noopener noreferrer"&gt;Building a Second Brain&lt;/a&gt; framework. The idea is that all information can be sorted into four categories: Projects, Areas, Resources, and Archives. Each category represents a different level of actionability, and that distinction drives where every note lives.&lt;/p&gt;

&lt;p&gt;This guide applies PARA to engineering work specifically — codebases, documentation, learning material, and the tension between active project work and long-term reference.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem With Topic-Based Organization
&lt;/h2&gt;

&lt;p&gt;Most engineers organize knowledge the way they organize code: by domain.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;databases/
  postgresql/
  redis/
api/
  rest/
  graphql/
devops/
  kubernetes/
  terraform/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That structure makes sense when you are browsing. It breaks down when you need something for a specific task. You remember a useful note about database migration safety, but it could be in &lt;code&gt;databases/postgresql/&lt;/code&gt;, &lt;code&gt;devops/deployments/&lt;/code&gt;, &lt;code&gt;api/versioning/&lt;/code&gt;, or nowhere because you saved it somewhere temporary.&lt;/p&gt;

&lt;p&gt;Topic folders force you to decide where knowledge belongs before you understand its context. PARA delays that decision — instead of asking what something is about, it asks what you are currently doing with it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Four Buckets
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Projects
&lt;/h3&gt;

&lt;p&gt;A project is active, time-bound work with a defined outcome.&lt;/p&gt;

&lt;p&gt;For engineers, projects are things like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Migrate billing service to queue v2
Upgrade PostgreSQL from 14 to 16
Write architecture decision record for auth service redesign
Implement rate limiting on public API
Publish article about distributed tracing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every project has a completion state. When you finish, the project moves to Archives. When you are not actively working on it, it is not a project.&lt;/p&gt;

&lt;p&gt;The key constraint: a project note should only contain what you need for that project. Reference material belongs in Resources. Reusable concepts belong in your Zettelkasten or personal notes. Project notes are working documents, not knowledge stores.&lt;/p&gt;

&lt;h3&gt;
  
  
  Areas
&lt;/h3&gt;

&lt;p&gt;An area is an ongoing responsibility without a deadline.&lt;/p&gt;

&lt;p&gt;For engineers, areas include:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;System architecture
Infrastructure reliability
Code review quality
Professional development
API design standards
Security posture
On-call responsibilities
Mentoring
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Areas do not finish. You are always responsible for infrastructure reliability. You always care about your professional development. The difference between a project and an area is that areas do not have exit criteria — they are things you maintain, not things you complete.&lt;/p&gt;

&lt;p&gt;A useful rule: if you can imagine shipping it or closing the ticket, it is a project. If it is just part of what your role means, it is an area.&lt;/p&gt;

&lt;h3&gt;
  
  
  Resources
&lt;/h3&gt;

&lt;p&gt;Resources are reference material you collected because it might be useful later.&lt;/p&gt;

&lt;p&gt;For engineers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;API documentation bookmarks
Cheat sheets
Benchmark results
Architecture diagrams for third-party systems
Conference talks you want to revisit
Library documentation
Research papers
Interesting blog articles
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Resources are not tied to any current project or area. You collect them because you expect to need them eventually. The important discipline is that resources should not masquerade as projects. A collection of Kubernetes documentation is a resource. A running task to "learn Kubernetes for the platform migration" is a project.&lt;/p&gt;

&lt;h3&gt;
  
  
  Archives
&lt;/h3&gt;

&lt;p&gt;Archives contain everything that is no longer active.&lt;/p&gt;

&lt;p&gt;Items move to Archives when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A project is complete or cancelled&lt;/li&gt;
&lt;li&gt;An area of responsibility changes hands&lt;/li&gt;
&lt;li&gt;Resource material is too outdated to be useful&lt;/li&gt;
&lt;li&gt;You want to preserve something but do not need it in the active buckets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Archiving is not deletion. Archives are low-friction storage for things that have finished their active life. If you cannot remember whether something ended up in Archives, that is fine, because you rarely need archived content urgently.&lt;/p&gt;

&lt;h2&gt;
  
  
  PARA in Practice for Engineers
&lt;/h2&gt;

&lt;p&gt;Here is a concrete example of what an engineer's PARA structure might look like in Obsidian:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Projects/
  billing-queue-migration/
  postgresql-16-upgrade/
  rate-limiting-rfc/
  blog-distributed-tracing/

Areas/
  architecture-standards/
  infrastructure/
  on-call-runbooks/
  career-development/

Resources/
  api-references/
  database-cheatsheets/
  benchmark-results/
  conference-notes/

Archives/
  2025-q4-projects/
  deprecated-services/
  old-runbooks/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The folder structure itself is not sacred. What matters is the discipline of placing notes in the right category based on their relationship to your current work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mapping a Typical Engineer's Knowledge
&lt;/h3&gt;

&lt;p&gt;Many engineers start with an undifferentiated pile of notes. Migrating to PARA requires a single audit pass:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Projects&lt;/strong&gt; — anything with a ticket, a deadline, or a deliverable you are currently working toward.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Areas&lt;/strong&gt; — recurring responsibilities that define your role.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resources&lt;/strong&gt; — reference material you collected without a specific project in mind.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Archives&lt;/strong&gt; — everything else.&lt;/p&gt;

&lt;p&gt;A working rule: when in doubt, Archive it. You can always retrieve it later. An overcrowded Projects folder is more damaging than an underused Archive.&lt;/p&gt;

&lt;h2&gt;
  
  
  PARA and Zettelkasten: A Practical Hybrid
&lt;/h2&gt;

&lt;p&gt;PARA and &lt;a href="https://www.glukhov.org/knowledge-management/methods/zettelkasten-for-developers/" rel="noopener noreferrer"&gt;Zettelkasten&lt;/a&gt; are often compared as competing systems. They are not competing. They solve different problems.&lt;/p&gt;

&lt;p&gt;Zettelkasten is for ideas. It captures atomic concepts, links them by meaning, and lets understanding emerge from the connections. Zettelkasten notes are not tied to projects, so they do not belong in any of the four buckets. A single note about idempotency can apply to ten different projects, past and future.&lt;/p&gt;

&lt;p&gt;PARA is for action. It organizes working context around what you are actively doing, responsible for, or collecting for later use.&lt;/p&gt;

&lt;p&gt;A practical hybrid:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Projects/
  billing-queue-migration/
    migration-plan.md
    open-questions.md
    → links to Zettelkasten: [[Idempotency keys turn retries into safe operations]]
    → links to Zettelkasten: [[Outbox pattern separates persistence from delivery]]

Areas/
  architecture-standards/
    current-adr-index.md
    → links to Zettelkasten: [[Database constraints are concurrency control]]

Resources/
  benchmark-results/
    q1-2026-postgres-benchmarks.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this model, PARA folders hold working documents and context. Zettelkasten notes hold reusable knowledge. Project notes link to Zettelkasten concepts — the project uses the concept without owning it.&lt;/p&gt;

&lt;p&gt;This is more durable than trying to make PARA do the job of Zettelkasten. Projects end. Concepts stay.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Failures
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Over-Archiving
&lt;/h3&gt;

&lt;p&gt;Some engineers use Archives as a dump for anything they feel guilty discarding. When Archives become large and unsorted, they lose their value. Archives should contain completed work in reasonable shape, not a graveyard of unsorted notes.&lt;/p&gt;

&lt;p&gt;A periodic archive sweep — quarterly works well — keeps it manageable. Delete duplicates. Consolidate. Ask whether the old project note still contains anything worth keeping as a Resource or Zettelkasten note before archiving it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Areas Becoming Dumping Grounds
&lt;/h3&gt;

&lt;p&gt;When Areas grow without pruning, they start to look like a topic-based folder system. An Area called &lt;code&gt;databases/&lt;/code&gt; that contains unsorted notes from three years is not a responsibility — it is a pile.&lt;/p&gt;

&lt;p&gt;Keep each Area tight. An area should represent something you are actively accountable for, not a topic you are broadly interested in. Interest goes into Resources. Accountability goes into Areas.&lt;/p&gt;

&lt;h3&gt;
  
  
  Resources Growing Without Review
&lt;/h3&gt;

&lt;p&gt;Resources are easy to collect and easy to forget. A bookmark dump in &lt;code&gt;Resources/&lt;/code&gt; with 400 unsorted links is harder to use than a bookmark manager. Resources should be curated lightly — remove outdated material, keep the signal.&lt;/p&gt;

&lt;h3&gt;
  
  
  Skipping the Weekly Review
&lt;/h3&gt;

&lt;p&gt;PARA works best with a weekly ten-minute review of your Projects folder. For each active project:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is this still active?&lt;/li&gt;
&lt;li&gt;What is the next concrete action?&lt;/li&gt;
&lt;li&gt;Is there anything to move to Archives?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without that review, Projects accumulate stale entries and the system loses its value as a current view of your work.&lt;/p&gt;
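
&lt;p&gt;The first review question ("Is this still active?") can be partly automated. The sketch below is a minimal illustration, assuming each project is a folder under &lt;code&gt;Projects/&lt;/code&gt; and that "no file modified in 14 days" is a reasonable signal of staleness. The function name and the threshold are hypothetical choices, not part of PARA.&lt;/p&gt;

```python
import time
from pathlib import Path


def stale_projects(projects_dir: Path, max_age_days: int = 14) -> list[str]:
    """Return project folder names with no file modified in the last max_age_days."""
    cutoff = time.time() - max_age_days * 86400
    stale = []
    for folder in sorted(p for p in projects_dir.iterdir() if p.is_dir()):
        # Newest modification time of any file in the project folder.
        # An empty folder gets 0.0, so it is always reported as stale.
        mtimes = [f.stat().st_mtime for f in folder.rglob("*") if f.is_file()]
        newest = max(mtimes, default=0.0)
        if cutoff > newest:
            stale.append(folder.name)
    return stale
```

&lt;p&gt;A stale folder is only a prompt for the review, not a verdict. You still decide whether the project is paused, finished, or simply quiet.&lt;/p&gt;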

&lt;h2&gt;
  
  
  Implementation in Obsidian
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.glukhov.org/knowledge-management/tools/obsidian-for-personal-knowledge-management/" rel="noopener noreferrer"&gt;Obsidian&lt;/a&gt; is a natural fit for PARA because folders map directly to the four buckets and Dataview queries can surface project status automatically.&lt;/p&gt;

&lt;p&gt;A basic setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vault/
  ├── Projects/
  ├── Areas/
  ├── Resources/
  ├── Archives/
  └── Zettelkasten/     ← concept notes, linked freely
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A simple Dataview query to surface active project notes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LIST FROM "Projects"
WHERE !contains(file.path, "Archives")
SORT file.mtime DESC
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tags can mark status without moving files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;project&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;active&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;project&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;paused&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;project&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;done&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a project completes, tag it &lt;code&gt;done&lt;/code&gt;, then move the folder to &lt;code&gt;Archives/YEAR-QN/&lt;/code&gt;. Simple, auditable, reversible.&lt;/p&gt;
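
&lt;p&gt;The move to &lt;code&gt;Archives/YEAR-QN/&lt;/code&gt; can be scripted. This is a sketch under stated assumptions: the vault uses the folder layout shown earlier, the quarter is taken from the date you pass in, and the helper names &lt;code&gt;archive_dir_name&lt;/code&gt; and &lt;code&gt;archive_project&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
import shutil
from datetime import date
from pathlib import Path


def archive_dir_name(day: date) -> str:
    """Build the YEAR-QN folder name, for example 2026-Q2 for a date in June 2026."""
    quarter = (day.month - 1) // 3 + 1
    return f"{day.year}-Q{quarter}"


def archive_project(vault: Path, project: str, day: date) -> Path:
    """Move Projects/project to Archives/YEAR-QN/project and return the new path."""
    source = vault / "Projects" / project
    target_dir = vault / "Archives" / archive_dir_name(day)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / project
    # Refuse to overwrite an existing archive entry with the same name.
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    shutil.move(str(source), str(target))
    return target
```

&lt;p&gt;Because the operation is a plain folder move, reversing it means moving the folder back under &lt;code&gt;Projects/&lt;/code&gt;.&lt;/p&gt;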

&lt;h2&gt;
  
  
  Implementation in Plain Files
&lt;/h2&gt;

&lt;p&gt;You do not need Obsidian. PARA works equally well in a Git repository with plain Markdown:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;knowledge/
  projects/
    2026-billing-migration/
      README.md
      migration-plan.md
      decisions.md
  areas/
    architecture/
      adr-index.md
  resources/
    databases/
      postgres-16-release-notes.md
  archives/
    2025/
      feature-x-launch/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Git gives you history, diff, search, and portability. That is often more than enough for a personal system.&lt;/p&gt;

&lt;h2&gt;
  
  
  When PARA Makes Sense
&lt;/h2&gt;

&lt;p&gt;PARA is well suited when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You juggle multiple active projects at the same time&lt;/li&gt;
&lt;li&gt;You need to quickly find what relates to today's work&lt;/li&gt;
&lt;li&gt;You want a system that is folder-friendly and tool-agnostic&lt;/li&gt;
&lt;li&gt;You combine it with a Zettelkasten or concept-note layer for reusable ideas&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PARA is less useful when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You work on a single long-running project with no clear buckets&lt;/li&gt;
&lt;li&gt;You are primarily doing research-oriented work with no active deliverables&lt;/li&gt;
&lt;li&gt;You prefer emergent structure over explicit categorization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For engineers doing a mix of active project work and long-term learning, PARA and Zettelkasten together cover most cases: PARA for context, Zettelkasten for thinking.&lt;/p&gt;

&lt;h2&gt;
  
  
  Decision Framework
&lt;/h2&gt;

&lt;p&gt;When a new note arrives, ask these questions in order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Is this tied to something I am actively working toward? → &lt;strong&gt;Projects&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Is this part of an ongoing responsibility I own? → &lt;strong&gt;Areas&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Is this reference material I might need later? → &lt;strong&gt;Resources&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Is this finished or inactive? → &lt;strong&gt;Archives&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Is this a reusable concept or idea not tied to any project? → &lt;strong&gt;Zettelkasten&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That is the full decision tree. Five options. One rule per option. It takes about ten seconds per note.&lt;/p&gt;
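
&lt;p&gt;The five questions above can be sketched as a small Python function. This is an illustration only: the &lt;code&gt;Note&lt;/code&gt; fields and the &lt;code&gt;classify&lt;/code&gt; name are hypothetical, and each field simply records your answer to one question. The fallback to Archives follows the "when in doubt, Archive it" working rule.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class Note:
    """Answers to the five questions for one note (hypothetical fields)."""
    active_project: bool = False        # tied to something I am actively working toward
    owned_responsibility: bool = False  # part of an ongoing responsibility I own
    reference_material: bool = False    # reference material I might need later
    finished_or_inactive: bool = False  # finished or inactive work
    reusable_concept: bool = False      # reusable idea not tied to any project


def classify(note: Note) -> str:
    """Return the first bucket whose question is answered yes, in the listed order."""
    questions = [
        (note.active_project, "Projects"),
        (note.owned_responsibility, "Areas"),
        (note.reference_material, "Resources"),
        (note.finished_or_inactive, "Archives"),
        (note.reusable_concept, "Zettelkasten"),
    ]
    for answer, bucket in questions:
        if answer:
            return bucket
    # No question matched: apply the working rule "when in doubt, Archive it".
    return "Archives"


# Order matters: reference material is checked before reusable concepts.
print(classify(Note(reference_material=True, reusable_concept=True)))  # prints "Resources"
```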

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;PARA works because it matches how engineers actually use knowledge — not for browsing, but for acting. You do not open your notes to see what is in &lt;code&gt;databases/&lt;/code&gt;. You open them because you are working on a specific problem right now, and you need the relevant material to surface quickly.&lt;/p&gt;

&lt;p&gt;The discipline of separating active projects from reference material, and both from finished work, reduces the cognitive overhead of maintaining a personal knowledge base. In combination with a &lt;a href="https://www.glukhov.org/knowledge-management/foundations/personal-knowledge-management/" rel="noopener noreferrer"&gt;personal knowledge management&lt;/a&gt; foundation and a Zettelkasten for concept-level notes, PARA gives you the organizational backbone that keeps everything findable when it matters.&lt;/p&gt;

&lt;p&gt;Start with one folder per bucket. Run one audit to sort your existing notes. Review Projects weekly. The rest will follow naturally.&lt;/p&gt;

</description>
      <category>para</category>
      <category>obsidian</category>
      <category>knowledgemanagement</category>
      <category>secondbrain</category>
    </item>
    <item>
      <title>Evergreen Notes: Write Notes That Compound Over Time</title>
      <dc:creator>Rost</dc:creator>
      <pubDate>Sun, 21 Jun 2026 12:18:21 +0000</pubDate>
      <link>https://dev.to/rosgluk/evergreen-notes-write-notes-that-compound-over-time-2hbc</link>
      <guid>https://dev.to/rosgluk/evergreen-notes-write-notes-that-compound-over-time-2hbc</guid>
      <description>&lt;p&gt;Most engineering notes are written once and forgotten. You capture something during a debugging session, paste it into a doc, and rediscover it two years later with no context for why it mattered.&lt;/p&gt;

&lt;p&gt;The problem is not effort. Engineers write constantly — code comments, Slack messages, Confluence pages, Jira descriptions, pull request explanations, architecture diagrams. The problem is that most of those notes are written for a specific moment and age poorly. They do not compound. They accumulate.&lt;/p&gt;

&lt;p&gt;Evergreen notes are the alternative. The idea is simple: write each note so that it stays useful indefinitely, improves when you revisit it, and connects to other notes in a way that makes the whole system more valuable over time.&lt;/p&gt;

&lt;p&gt;The term was popularized by researcher Andy Matuschak, whose own public notes demonstrate the idea at scale. For engineers, the principle has direct applications in technical writing, documentation, architecture decisions, and the long-term capture of hard-won lessons.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Makes a Note Evergreen
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Atomic
&lt;/h3&gt;

&lt;p&gt;An evergreen note contains one idea. Not one topic — one idea.&lt;/p&gt;

&lt;p&gt;A note called "PostgreSQL" is not evergreen. It is a container waiting to be filled. A note called "Partial indexes reduce write overhead when queries target a small subset" is evergreen. It states a specific, portable claim.&lt;/p&gt;

&lt;p&gt;The atomic constraint is important because it controls reuse. A container note can only be linked as a vague topic. An atomic note can be linked wherever that specific idea applies — in a discussion of query optimization, in a comparison of indexing strategies, in a project note about a specific performance problem.&lt;/p&gt;

&lt;h3&gt;
  
  
  Standalone
&lt;/h3&gt;

&lt;p&gt;An evergreen note should be understandable without its original source.&lt;/p&gt;

&lt;p&gt;That means writing in your own words. A note that says "See the linked article — good stuff on caching" is not evergreen. A note that says "Write-through caching updates the cache synchronously with the database on every write, improving read consistency at the cost of higher write latency" is evergreen. You can read it a year later without chasing the original source.&lt;/p&gt;

&lt;p&gt;This is harder than it sounds. Writing a standalone note requires actually understanding what you read, not just tagging it. That processing step is where most of the learning happens.&lt;/p&gt;

&lt;h3&gt;
  
  
  Evolving
&lt;/h3&gt;

&lt;p&gt;Evergreen notes improve over time rather than going stale.&lt;/p&gt;

&lt;p&gt;A fleeting note has a lifecycle: you write it, it serves a moment, it becomes irrelevant. An evergreen note should be worth revisiting and refining six months or two years later. You might add a counterexample, update it with a production experience, link it to a new pattern, or simply rewrite it more precisely.&lt;/p&gt;

&lt;p&gt;The word "evergreen" is intentional: these notes do not die after harvest. They persist and improve.&lt;/p&gt;

&lt;h3&gt;
  
  
  Linked
&lt;/h3&gt;

&lt;p&gt;Evergreen notes connect to other notes rather than sitting in isolation.&lt;/p&gt;

&lt;p&gt;A standalone note about write-through caching connects naturally to notes about read-heavy workloads, cache invalidation, eventual consistency, and database write performance. Each link makes both notes more useful — the connection surfaces context that neither note contains alone.&lt;/p&gt;

&lt;p&gt;The linking habit is what turns a collection of individual insights into a network of connected understanding.&lt;/p&gt;

&lt;h2&gt;
  
  
  Note Types and When to Use Each
&lt;/h2&gt;

&lt;p&gt;Understanding evergreen notes requires understanding what they are not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fleeting notes&lt;/strong&gt; are temporary captures. A line scribbled during a debugging session, a bookmark to revisit, a question to follow up on. Fleeting notes serve a moment. They should be processed quickly and either discarded or promoted into something more durable. Most fleeting notes never become evergreen notes, and that is fine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Literature notes&lt;/strong&gt; are summaries of external sources — a documentation page, a postmortem, a book chapter, a conference talk. Literature notes preserve what a source said. They are a step toward understanding, not understanding itself. A literature note says "this source claims X." An evergreen note says "I believe X for these reasons."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evergreen notes&lt;/strong&gt; synthesize what you have come to understand. They live at the output of the learning process, not the input.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Note type&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Lifespan&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fleeting&lt;/td&gt;
&lt;td&gt;Quick capture&lt;/td&gt;
&lt;td&gt;Hours to days&lt;/td&gt;
&lt;td&gt;"Look into why Postgres vacuum missed this row"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Literature&lt;/td&gt;
&lt;td&gt;Source summary&lt;/td&gt;
&lt;td&gt;Medium term&lt;/td&gt;
&lt;td&gt;"Redis docs say AOF fsync default is 1s"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Evergreen&lt;/td&gt;
&lt;td&gt;Portable idea&lt;/td&gt;
&lt;td&gt;Years&lt;/td&gt;
&lt;td&gt;"Fsync-on-write durability trades throughput for crash safety"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Writing Evergreen Technical Notes
&lt;/h2&gt;

&lt;p&gt;The structure of a good evergreen technical note follows a simple logic: claim, evidence, implication.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Write-through caching improves read consistency at the cost of write latency&lt;/span&gt;

Write-through caching updates the cache at the same time as the underlying store
on every write. Every read hits fresh data because the write path ensures
consistency before the write is acknowledged.

The tradeoff is write latency — every write now requires two operations (store
and cache) to complete before the caller receives a confirmation.

This pattern suits read-heavy workloads where cache staleness has real
business impact, such as product inventory counts or user settings.

Links:
&lt;span class="p"&gt;-&lt;/span&gt; [[Read-through caching shifts cache population to read time]]
&lt;span class="p"&gt;-&lt;/span&gt; [[Cache invalidation is a coordination problem]]
&lt;span class="p"&gt;-&lt;/span&gt; [[Write-behind caching trades consistency for write throughput]]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That note is useful without the source. It states the claim, explains the tradeoff, gives a context where it applies, and links to related ideas.&lt;/p&gt;

&lt;h3&gt;
  
  
  What to Avoid
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Time-sensitive references&lt;/strong&gt; age badly. "As of Postgres 14, this behavior works this way" is a literature note, not an evergreen note. Write the principle instead: "The planner skips index scans when estimated row count exceeds a threshold relative to table size." That claim survives version changes even if the threshold changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool-specific commands without context&lt;/strong&gt; are snippets, not notes. A note that is just a &lt;code&gt;kubectl&lt;/code&gt; command copied from a Stack Overflow answer is not evergreen. A note about why that command works, including what Kubernetes resource it affects and what problem it solves, has a chance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Assumptions about reader knowledge&lt;/strong&gt; degrade fast. Write as if explaining to a competent colleague who is not inside your current context.&lt;/p&gt;

&lt;h3&gt;
  
  
  Good Candidates for Evergreen Notes in Engineering
&lt;/h3&gt;

&lt;p&gt;Almost any hard-won lesson with broad applicability is a good candidate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Architecture tradeoffs and the reasoning behind decisions&lt;/li&gt;
&lt;li&gt;Debugging patterns that apply across systems&lt;/li&gt;
&lt;li&gt;API design rules and their edge cases&lt;/li&gt;
&lt;li&gt;Performance characteristics with real-world numbers attached&lt;/li&gt;
&lt;li&gt;Security assumptions that turned out to be wrong&lt;/li&gt;
&lt;li&gt;Test strategy lessons from projects where the approach failed&lt;/li&gt;
&lt;li&gt;Deployment constraints that changed how the team worked&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The common thread: specific enough to be actionable, general enough to apply more than once.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Evergreen Workflow
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Capture Fleeting Notes
&lt;/h3&gt;

&lt;p&gt;Capture quickly without overthinking. The goal is not to produce an evergreen note in the moment — it is to preserve the raw material for one.&lt;/p&gt;

&lt;p&gt;During a debugging session:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Found that the cache was returning stale user permissions after role changes.
The TTL was 5 minutes but the role update was immediate.
Need to think through how to handle this — invalidation on write?
Or shorter TTL? Or event-driven update?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is a fleeting note. It is not an evergreen note, but it contains the seeds of several.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Process Into Evergreen Notes Within 48 Hours
&lt;/h3&gt;

&lt;p&gt;Processing is where the value appears. Take the raw capture and extract the ideas that are worth preserving.&lt;/p&gt;

&lt;p&gt;From that debugging note, you might write:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Role-based cache entries require invalidation on write, not just TTL expiry&lt;/span&gt;

When cached data encodes permissions or roles, TTL-based expiry is not safe.
A user whose role is downgraded keeps elevated permissions until the TTL expires.
Write-time invalidation — or event-driven cache updates on role change — is required
for correctness in permission-sensitive caches.

Links:
&lt;span class="p"&gt;-&lt;/span&gt; [[Cache invalidation is a coordination problem]]
&lt;span class="p"&gt;-&lt;/span&gt; [[Authorization decisions should not be cached at rest without validation]]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The debugging context is gone. The portable idea remains.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Connect to Existing Notes
&lt;/h3&gt;

&lt;p&gt;After writing the note, spend two minutes asking:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What existing note does this relate to?&lt;/li&gt;
&lt;li&gt;What concept does this depend on?&lt;/li&gt;
&lt;li&gt;What does this extend or contradict?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Add links in both directions. The new note links to existing notes. Existing notes that are now richer for the connection link back.&lt;/p&gt;
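
&lt;p&gt;One way to check the "both directions" habit is to scan for one-way links. The sketch below assumes a flat folder of Markdown notes that use &lt;code&gt;[[Note title]]&lt;/code&gt; wikilinks, with each note's title equal to its file name. The &lt;code&gt;missing_backlinks&lt;/code&gt; function is a hypothetical helper, not a feature of any particular tool.&lt;/p&gt;

```python
import re
from pathlib import Path

# Matches the target of a wikilink such as [[Note title]] or [[Note title|alias]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)")


def missing_backlinks(notes_dir: Path) -> list[tuple[str, str]]:
    """Return (source, target) pairs where source links to target but not back."""
    links = {}
    for path in notes_dir.glob("*.md"):
        text = path.read_text(encoding="utf-8")
        links[path.stem] = {m.strip() for m in WIKILINK.findall(text)}
    missing = []
    for source, targets in sorted(links.items()):
        for target in sorted(targets):
            # Only check targets that exist as notes in this folder.
            if target in links and source not in links[target]:
                missing.append((source, target))
    return missing
```

&lt;p&gt;Not every one-way link is a problem. Treat the output as a list of candidates to review during the two-minute linking pass.&lt;/p&gt;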

&lt;h3&gt;
  
  
  Step 4: Revisit and Improve
&lt;/h3&gt;

&lt;p&gt;Evergreen notes do not have a single correct state. Every time you encounter the idea again — in a production incident, a design review, a code review comment — consider returning to the note and making it better.&lt;/p&gt;

&lt;p&gt;You might:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add a more concrete example&lt;/li&gt;
&lt;li&gt;Update the claim based on new evidence&lt;/li&gt;
&lt;li&gt;Remove a caveat that turned out not to matter&lt;/li&gt;
&lt;li&gt;Add a link to a new related note&lt;/li&gt;
&lt;li&gt;Rewrite the opening sentence for clarity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That cycle of refinement is what makes notes compound rather than decay.&lt;/p&gt;

&lt;h2&gt;
  
  
  Evergreen Notes and Documentation
&lt;/h2&gt;

&lt;p&gt;There is a useful distinction between personal evergreen notes and team documentation.&lt;/p&gt;

&lt;p&gt;Personal evergreen notes are your understanding, written for future you. They can be rough, opinionated, and incomplete. Their value is in being reusable for your thinking.&lt;/p&gt;

&lt;p&gt;Team documentation is for shared understanding. It needs accuracy, accessibility, and maintenance ownership.&lt;/p&gt;

&lt;p&gt;The two layers complement each other. Your evergreen notes about why a system was designed a certain way can become the raw material for the architecture decision record. Your debugging notes can feed the runbook. Your API design notes can inform the style guide.&lt;/p&gt;

&lt;p&gt;The direction of flow is usually: evergreen notes → polished documentation, not the reverse.&lt;/p&gt;

&lt;h2&gt;
  
  
  Evergreen Notes and RAG Systems
&lt;/h2&gt;

&lt;p&gt;As AI-augmented knowledge tools become more practical, well-written evergreen notes become increasingly valuable as retrieval source material. The &lt;a href="https://www.glukhov.org/knowledge-management/foundations/retrieval-vs-representation/" rel="noopener noreferrer"&gt;retrieval versus representation&lt;/a&gt; problem in knowledge management is essentially about quality of source material — and evergreen notes, being atomic, standalone, and written for comprehension, chunk well for vector search.&lt;/p&gt;

&lt;p&gt;A Zettelkasten of atomic evergreen notes is a natural foundation for a personal &lt;a href="https://www.glukhov.org/rag/" rel="noopener noreferrer"&gt;RAG&lt;/a&gt; system. The atomic structure aligns with retrieval chunk size. The standalone property means retrieved notes need no additional context to be useful. The linking structure provides graph traversal opportunities beyond keyword search.&lt;/p&gt;

&lt;p&gt;This is increasingly relevant for engineers who want to query their own knowledge base with an LLM rather than starting from scratch each time.&lt;/p&gt;
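
&lt;p&gt;The graph-traversal idea can be sketched in a few lines. This is a minimal illustration, assuming notes link to each other with Obsidian-style &lt;code&gt;[[wikilinks]]&lt;/code&gt; and that a retrieval step (vector or keyword search) has already returned some note titles. The names &lt;code&gt;extract_links&lt;/code&gt; and &lt;code&gt;expand_with_neighbors&lt;/code&gt; and the three-note vault are hypothetical:&lt;/p&gt;

```python
import re

# Matches [[Target]], [[Target|alias]] and [[Target#heading]]; captures only the target.
WIKILINK = re.compile(r"\[\[([^\]|#]+)")


def extract_links(text: str) -> list[str]:
    """Return the note titles a note links to, in order, without duplicates."""
    seen: dict[str, None] = {}
    for match in WIKILINK.finditer(text):
        seen.setdefault(match.group(1).strip(), None)
    return list(seen)


def expand_with_neighbors(hits: list[str], notes: dict[str, str]) -> list[str]:
    """Add directly linked notes to a list of retrieved note titles."""
    expanded = list(hits)
    for title in hits:
        for linked in extract_links(notes.get(title, "")):
            if linked in notes and linked not in expanded:
                expanded.append(linked)
    return expanded


# Hypothetical three-note vault, keyed by note title.
notes = {
    "Cache invalidation": "Stale reads come from [[TTL tuning]] and [[Write-through caching|write-through]].",
    "TTL tuning": "Short TTLs trade hit rate for freshness.",
    "Write-through caching": "Writes update the cache and the store together.",
}

print(expand_with_neighbors(["Cache invalidation"], notes))
```

&lt;p&gt;A single retrieved note pulls in its two linked neighbours, which is context that keyword search alone would not have surfaced.&lt;/p&gt;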

&lt;h2&gt;
  
  
  Common Pitfalls
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Writing Too Broadly
&lt;/h3&gt;

&lt;p&gt;A note that covers an entire topic is not an evergreen note — it is a draft article. If your note is longer than a single screen and covers more than one claim, break it into smaller notes and link them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Writing Too Narrowly
&lt;/h3&gt;

&lt;p&gt;A note that is too specific to one context has no reuse value. "Fixed the billing service cache bug on 2024-03-14" is a log entry, not an evergreen note. Raise the abstraction level until the idea applies in at least three different contexts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Confusing "Evergreen" With "Never Changes"
&lt;/h3&gt;

&lt;p&gt;Evergreen does not mean immutable. It means the note remains worth returning to. A note about Go generics written in 2022 is still evergreen if you update it to reflect how patterns evolved in 2024. A note that you never touch because you believe it is permanently correct is a note that will eventually become wrong in silence.&lt;/p&gt;

&lt;h3&gt;
  
  
  Skipping the Processing Step
&lt;/h3&gt;

&lt;p&gt;The most common failure is treating evergreen notes as a collection target rather than a writing practice. You cannot grow a collection of high-quality atomic notes by saving bookmarks. The evergreen note is not the article you read — it is what you extracted from it in your own words.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tools
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Obsidian
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.glukhov.org/knowledge-management/tools/obsidian-for-personal-knowledge-management/" rel="noopener noreferrer"&gt;Obsidian&lt;/a&gt; is one of the most popular tools for evergreen notes. Its local Markdown files, bidirectional links, and graph view align well with the practice. A simple structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vault/
  fleeting/
    daily/
  literature/
  evergreen/
  maps/       ← index notes for clusters of evergreen notes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The graph view in Obsidian makes link clusters visible — useful for discovering which concepts form natural groups that might become index notes or published articles.&lt;/p&gt;

&lt;h3&gt;
  
  
  Plain Markdown With Git
&lt;/h3&gt;

&lt;p&gt;A Git repository of Markdown files works well and has no dependency on any specific tool. Standard Markdown links connect notes. Search is handled by your editor or &lt;code&gt;grep&lt;/code&gt;. Version history comes from Git.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;knowledge/
  evergreen/
    caching/
    api-design/
    performance/
  literature/
  fleeting/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The discipline is the same regardless of tool — one idea per note, written in your own words, linked to related notes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Starting From Zero
&lt;/h2&gt;

&lt;p&gt;The most useful way to start is not to migrate your existing notes. It is to write one evergreen note today.&lt;/p&gt;

&lt;p&gt;Take something you learned in the last week. Write it as a claim. Explain it in your own words in one paragraph. Link it to one related note if one already exists; if not, leave it unlinked for now.&lt;/p&gt;

&lt;p&gt;That is a complete evergreen note. Repeat once per week for six months and you have a working system.&lt;/p&gt;

&lt;p&gt;The compounding effect takes time to become visible. Engineers who maintain evergreen notes for a year often report that their notes start answering questions before they finish asking them — because they have already written the answer in a previous context.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;The reason evergreen notes work is not that they are better at storage. They are better at thinking. The discipline of writing one portable idea per note, in your own words, with links to related ideas, forces understanding that passive collection does not.&lt;/p&gt;

&lt;p&gt;For engineers, this has practical consequences. The notes from a production incident that you process into evergreen format are more useful than the incident log. The design tradeoff you distill into an atomic note is more useful than the architecture diagram. The debugging pattern you generalize from a specific bug is more reusable than the ticket.&lt;/p&gt;

&lt;p&gt;Used alongside the &lt;a href="https://www.glukhov.org/knowledge-management/methods/para-method-for-engineers/" rel="noopener noreferrer"&gt;PARA method&lt;/a&gt; for organizing active work, evergreen notes give you the conceptual layer that PARA does not provide — a growing network of reusable understanding that persists across projects, across roles, and across years.&lt;/p&gt;

</description>
      <category>obsidian</category>
      <category>knowledgemanagement</category>
      <category>zettelkasten</category>
    </item>
    <item>
      <title>Cost Optimization for LLM Systems: Where the Money Actually Goes</title>
      <dc:creator>Rost</dc:creator>
      <pubDate>Fri, 19 Jun 2026 09:52:51 +0000</pubDate>
      <link>https://dev.to/rosgluk/cost-optimization-for-llm-systems-where-the-money-actually-goes-17e</link>
      <guid>https://dev.to/rosgluk/cost-optimization-for-llm-systems-where-the-money-actually-goes-17e</guid>
      <description>&lt;p&gt;LLM costs scale linearly with usage. A system processing 10,000 requests a day at $0.01 per request costs $100 a day, or $36,500 a year. At enterprise volumes, the bill grows in proportion.&lt;/p&gt;

&lt;p&gt;Cost optimization isn't about cutting corners. It's about spending tokens where they matter.&lt;/p&gt;

&lt;p&gt;Every token you waste is a token you could have spent on a better answer.&lt;/p&gt;
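
&lt;p&gt;The linear scaling is easy to sanity-check. Here is a small sketch using the 10,000 requests and $0.01 per request from the opening example; the &lt;code&gt;projected_cost&lt;/code&gt; name is illustrative, and the flat per-request price is a simplifying assumption, since real pricing is per token:&lt;/p&gt;

```python
def projected_cost(requests_per_day: int, cost_per_request: float, days: int = 365) -> float:
    """Linear cost projection: every request is assumed to cost the same."""
    return requests_per_day * cost_per_request * days


daily = projected_cost(10_000, 0.01, days=1)
yearly = projected_cost(10_000, 0.01)
print(f"daily: ${daily:,.2f}, yearly: ${yearly:,.2f}")
```

&lt;p&gt;With those inputs the projection is $100.00 per day and $36,500.00 per year. Doubling either the traffic or the per-request cost doubles the total.&lt;/p&gt;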

&lt;h2&gt;
  
  
  Token budgeting
&lt;/h2&gt;

&lt;p&gt;The simplest way to control costs is to set limits. Per session, per task, or per day.&lt;/p&gt;

&lt;h3&gt;
  
  
  Strategy 1: Per-Session Budgets
&lt;/h3&gt;

&lt;p&gt;Per-session budgets are straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SessionBudget&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;budget_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;budget&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;budget_tokens&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;used&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;allocate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;used&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;budget&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;used&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;remaining&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;budget&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;used&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Strategy 2: Per-Task Budgets
&lt;/h3&gt;

&lt;p&gt;Per-task budgets are more useful. Different tasks need different amounts of context:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;task_budgets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;classify&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;max_tokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;qwen2.5-1.5b&lt;/span&gt;
  &lt;span class="na"&gt;summarize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;max_tokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;500&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;qwen2.5-7b&lt;/span&gt;
  &lt;span class="na"&gt;code_review&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;max_tokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2000&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;qwen2.5-coder-7b&lt;/span&gt;
  &lt;span class="na"&gt;reason&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;max_tokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4000&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;qwen2.5-32b&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
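
&lt;p&gt;One way to consume a config like this is a lookup with a conservative default for task types that have no entry. This is a minimal sketch, assuming the YAML has already been parsed into Python objects; &lt;code&gt;TaskBudget&lt;/code&gt;, &lt;code&gt;budget_for&lt;/code&gt;, and the default values are hypothetical:&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskBudget:
    max_tokens: int
    model: str


# Mirrors the YAML config as it would look after parsing.
TASK_BUDGETS = {
    "classify": TaskBudget(100, "qwen2.5-1.5b"),
    "summarize": TaskBudget(500, "qwen2.5-7b"),
    "code_review": TaskBudget(2000, "qwen2.5-coder-7b"),
    "reason": TaskBudget(4000, "qwen2.5-32b"),
}

# Assumed fallback for task types that have no entry yet.
DEFAULT_BUDGET = TaskBudget(500, "qwen2.5-7b")


def budget_for(task_type: str) -> TaskBudget:
    """Return the budget for a task type, or the default if it is unknown."""
    return TASK_BUDGETS.get(task_type, DEFAULT_BUDGET)


print(budget_for("classify"))
print(budget_for("translate"))
```

&lt;p&gt;Keeping the model name next to the token limit means the router makes both decisions from one lookup.&lt;/p&gt;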



&lt;h3&gt;
  
  
  Strategy 3: Adaptive Budgets
&lt;/h3&gt;

&lt;p&gt;Adaptive budgets adjust based on what actually happens. If a task type consistently uses 80 tokens, there is no reason to keep reserving the 1,000-token default for it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AdaptiveBudget&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;task_history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;allocate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;task_type&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;task_history&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;task_history&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;record&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokens_used&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;task_type&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;task_history&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;task_history&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokens_used&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;task_history&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="mf"&gt;0.9&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;task_history&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;tokens_used&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



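&lt;p&gt;The exponential moving average keeps 90% of the previous estimate and blends in 10% of each new observation, so the budget follows sustained shifts in usage while a single outlier barely moves it. Lower the 0.9 weight if your workloads are volatile and the estimate needs to react faster. Note that &lt;code&gt;allocate&lt;/code&gt; returns 1.5 times the estimate as headroom, so a task averaging 80 tokens gets a budget of 120.&lt;/p&gt;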
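
&lt;p&gt;To see how the smoothing weight behaves, it helps to run the update rule by hand. The sketch below applies the same formula as &lt;code&gt;record&lt;/code&gt; to an estimate that starts at 100 tokens and then observes 80 tokens on every call; the &lt;code&gt;ema_updates&lt;/code&gt; helper and the starting values are illustrative:&lt;/p&gt;

```python
def ema_updates(start: float, observed: float, weight: float, steps: int) -> list[float]:
    """Apply estimate = weight * estimate + (1 - weight) * observed repeatedly."""
    estimate = start
    history = []
    for _ in range(steps):
        estimate = weight * estimate + (1 - weight) * observed
        history.append(estimate)
    return history


slow = ema_updates(start=100, observed=80, weight=0.9, steps=10)
fast = ema_updates(start=100, observed=80, weight=0.5, steps=10)

print(f"weight 0.9 after 10 calls: {slow[-1]:.1f}")
print(f"weight 0.5 after 10 calls: {fast[-1]:.1f}")
```

&lt;p&gt;With a weight of 0.9 the estimate is still around 87 after ten calls; with 0.5 it has effectively reached 80. A higher weight means a steadier but slower-moving budget.&lt;/p&gt;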

&lt;h2&gt;
  
  
  API vs local inference
&lt;/h2&gt;

&lt;p&gt;Local inference is cheaper at scale. The break-even depends on your hardware and API rates.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;API ($/M tokens, input / output)&lt;/th&gt;
&lt;th&gt;Local cost/hour&lt;/th&gt;
&lt;th&gt;Break-even usage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o&lt;/td&gt;
&lt;td&gt;$2.50 / $10.00&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet 4&lt;/td&gt;
&lt;td&gt;$3.00 / $15.00&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen2.5-72B&lt;/td&gt;
&lt;td&gt;$0.50 / $2.00&lt;/td&gt;
&lt;td&gt;~$0.50&lt;/td&gt;
&lt;td&gt;~4 hours/day&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen2.5-32B&lt;/td&gt;
&lt;td&gt;$0.30 / $1.20&lt;/td&gt;
&lt;td&gt;~$0.20&lt;/td&gt;
&lt;td&gt;~2 hours/day&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen2.5-7B&lt;/td&gt;
&lt;td&gt;$0.10 / $0.40&lt;/td&gt;
&lt;td&gt;~$0.05&lt;/td&gt;
&lt;td&gt;~1 hour/day&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The hardware math:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Hardware&lt;/th&gt;
&lt;th&gt;Upfront&lt;/th&gt;
&lt;th&gt;Monthly electricity&lt;/th&gt;
&lt;th&gt;Break-even vs API&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;RTX 3090 (used)&lt;/td&gt;
&lt;td&gt;$600&lt;/td&gt;
&lt;td&gt;$15&lt;/td&gt;
&lt;td&gt;~4 months&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 4090&lt;/td&gt;
&lt;td&gt;$1,500&lt;/td&gt;
&lt;td&gt;$20&lt;/td&gt;
&lt;td&gt;~6 months&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 5080&lt;/td&gt;
&lt;td&gt;$1,000&lt;/td&gt;
&lt;td&gt;$18&lt;/td&gt;
&lt;td&gt;~5 months&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DGX Spark&lt;/td&gt;
&lt;td&gt;$2,000&lt;/td&gt;
&lt;td&gt;$30&lt;/td&gt;
&lt;td&gt;~8 months&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At moderate usage, an hour or more per day, local inference pays for itself. At high usage, the savings are dramatic. The catch is upfront capital: an RTX 5080 is $1,000. An API bill you can pause. Hardware you can't.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fallback strategies
&lt;/h2&gt;

&lt;p&gt;When your preferred model is too expensive or too slow, fall back to something cheaper. The key is knowing when quality is "good enough."&lt;/p&gt;

&lt;h3&gt;
  
  
  Strategy 1: Quality-Based Fallback
&lt;/h3&gt;

&lt;p&gt;Quality-based fallback tries models until the output meets a threshold:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;QualityFallback&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quality_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;quality_threshold&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.015&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5-72b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.002&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5-32b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.001&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5-7b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0004&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;first_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;model_config&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate_quality&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;first_result&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;first_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
        &lt;span class="c1"&gt;# No model met the threshold: reuse the preferred model's answer rather than paying for a second call.&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;first_result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The problem is evaluation itself. How do you measure quality without calling another model? Some systems use a small classifier. Others use heuristic checks — length, structure, keyword presence. None of these are perfect.&lt;/p&gt;
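
&lt;p&gt;A heuristic scorer along those lines might look like the sketch below. The three checks, their equal weighting, and the &lt;code&gt;heuristic_quality&lt;/code&gt; name are assumptions for illustration; a real scorer would be tuned per task type:&lt;/p&gt;

```python
def heuristic_quality(result: str, required_keywords: list[str], min_words: int = 20) -> float:
    """Score a response between 0.0 and 1.0 from three cheap checks."""
    text = result.strip()
    words = text.split()

    # Length: long enough to be a real answer.
    length_ok = len(words) >= min_words

    # Structure: ends like a finished sentence rather than a truncated one.
    structure_ok = text.endswith((".", "!", "?"))

    # Keywords: fraction of the expected terms that appear.
    lowered = text.lower()
    if required_keywords:
        hits = sum(1 for kw in required_keywords if kw.lower() in lowered)
        keyword_score = hits / len(required_keywords)
    else:
        keyword_score = 1.0

    return (length_ok + structure_ok + keyword_score) / 3


answer = "Use a write-through cache so reads never see stale data."
print(heuristic_quality(answer, ["cache", "stale"], min_words=5))
print(heuristic_quality("ok", ["cache", "stale"], min_words=5))
```

&lt;p&gt;The complete answer passes all three checks and the two-letter reply passes none. Checks like these catch truncated or off-topic output cheaply, but they say nothing about whether the answer is correct.&lt;/p&gt;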

&lt;h3&gt;
  
  
  Strategy 2: Latency-Based Fallback
&lt;/h3&gt;

&lt;p&gt;Latency-based fallback is simpler. Route to the fastest model that meets your time budget:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LatencyFallback&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_latency&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;5.0&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_latency&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;max_latency&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5-1.5b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;latency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5-7b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;latency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5-32b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;latency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;10.0&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;latency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;5.0&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;model_config&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;latency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;model_config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;latency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_latency&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Caching
&lt;/h2&gt;

&lt;p&gt;Caching is the most underrated cost optimization. Identical prompts happen more often than you think — classification requests, FAQ-style queries, repeated tool calls.&lt;/p&gt;

&lt;h3&gt;
  
  
  Strategy 1: Prompt Caching
&lt;/h3&gt;

&lt;p&gt;Exact prompt caching is simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PromptCache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;max_size&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
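        &lt;span class="c1"&gt;# Evict the oldest entry only when adding a new key at capacity&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;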
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;iter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Strategy 2: Semantic Caching
&lt;/h3&gt;

&lt;p&gt;Semantic caching is more useful. It catches prompts that are different but mean the same thing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentence_transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SentenceTransformer&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SemanticCache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;similarity_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SentenceTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;all-MiniLM-L6-v2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;similarity_threshold&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;prompt_embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;])[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cached_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cached_response&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="n"&gt;cached_embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;cached_prompt&lt;/span&gt;&lt;span class="p"&gt;])[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="n"&gt;similarity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cosine_similarity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;prompt_embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cached_embedding&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;similarity&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cached_response&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
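        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;

    &lt;span class="nd"&gt;@staticmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;cosine_similarity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# encode() returns NumPy arrays by default, so @ is a dot product&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;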
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The threshold matters. 0.95 is strict: only near-identical prompts match, so fewer requests hit the cache. 0.85 is more forgiving but risks returning wrong answers. Measure both your hit rate and your false-match rate, then adjust.&lt;/p&gt;
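&lt;p&gt;One way to pick a threshold is to score a labelled sample of prompt pairs and sweep candidate values. The sketch below assumes you already have pairs of (similarity, same intent?); the function name &lt;code&gt;sweep_thresholds&lt;/code&gt; is hypothetical and the sample scores are made up purely for illustration:&lt;/p&gt;

```python
def sweep_thresholds(scored_pairs, thresholds):
    """Report cache behaviour for each candidate threshold.

    scored_pairs: list of (similarity, is_equivalent) tuples from a labelled sample.
    Returns {threshold: (hit_rate, false_match_rate)}, where false_match_rate is the
    share of cache hits whose cached answer would have been wrong.
    """
    results = {}
    for t in thresholds:
        hits = [ok for sim, ok in scored_pairs if sim >= t]
        hit_rate = len(hits) / len(scored_pairs)
        false_rate = hits.count(False) / len(hits) if hits else 0.0
        results[t] = (hit_rate, false_rate)
    return results


# Illustrative, made-up scores: (cosine similarity, same intent?)
sample = [(0.98, True), (0.96, True), (0.91, True), (0.88, False), (0.86, True), (0.70, False)]
for threshold, (hit_rate, false_rate) in sweep_thresholds(sample, [0.95, 0.85]).items():
    print(f"threshold={threshold}: hit rate {hit_rate:.0%}, false matches {false_rate:.0%}")
```

&lt;p&gt;The useful number is the pair of rates together: a lower threshold only pays off if the extra hits are not mostly wrong answers.&lt;/p&gt;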

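&lt;p&gt;Response caching for common queries is worth it too. If users ask "what is the weather" or "what is the time" repeatedly, match the pattern, not just the exact prompt. Note that the substring check below misses variants such as "what's the weather" unless the query is normalized first:&lt;br&gt;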
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ResponseCache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;common_queries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;what is the weather&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Check weather API&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;what is the time&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Check system time&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;who is the president&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Check current president&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;query_lower&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;common_query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;common_queries&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;common_query&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;query_lower&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This isn't sophisticated, but it works. Common queries are common for a reason.&lt;/p&gt;
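&lt;p&gt;Substring matching gets more robust with a small normalization step in front of it. The sketch below is one possible approach, not a complete solution: &lt;code&gt;normalize_query&lt;/code&gt; and the &lt;code&gt;CONTRACTIONS&lt;/code&gt; table are hypothetical names, and the table covers only a few English contractions:&lt;/p&gt;

```python
import re

# Hypothetical contraction table; extend it for your own traffic.
CONTRACTIONS = {"what's": "what is", "who's": "who is", "where's": "where is"}


def normalize_query(query: str) -> str:
    """Lowercase, expand a few contractions, drop punctuation, collapse whitespace."""
    # Treat curly apostrophes like straight ones before expanding contractions.
    text = query.lower().replace("\u2019", "'")
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Keep ASCII letters, digits, and whitespace only (this drops non-ASCII letters too).
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()
```

&lt;p&gt;Passing &lt;code&gt;normalize_query(query)&lt;/code&gt; into the pattern lookup lets "What's the weather today?" match the "what is the weather" pattern.&lt;/p&gt;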

&lt;h2&gt;
  
  
  When optimization helps
&lt;/h2&gt;

&lt;p&gt;Optimization matters when you're processing high volumes, running mixed workloads, or paying API costs that add up.&lt;/p&gt;

&lt;p&gt;It doesn't matter when you're prototyping, using a single model, or processing low volumes. The complexity of budgeting, fallback, and caching isn't worth it for a system that makes 100 requests a day.&lt;/p&gt;

&lt;p&gt;Get the basic flow working first. Add optimization when the bill comes in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tradeoffs
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;Quality&lt;/th&gt;
&lt;th&gt;Complexity&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;No optimization&lt;/td&gt;
&lt;td&gt;Highest&lt;/td&gt;
&lt;td&gt;Consistent&lt;/td&gt;
&lt;td&gt;Lowest&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Token budgeting&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;Variable&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fallback models&lt;/td&gt;
&lt;td&gt;Low-Medium&lt;/td&gt;
&lt;td&gt;Variable&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Caching&lt;/td&gt;
&lt;td&gt;Lowest&lt;/td&gt;
&lt;td&gt;High (for cache hits)&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hybrid&lt;/td&gt;
&lt;td&gt;Optimized&lt;/td&gt;
&lt;td&gt;Optimized&lt;/td&gt;
&lt;td&gt;Highest&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Production systems usually run hybrid. Budget per session, fall back on quality or latency, cache what you can. The complexity is real, but so are the savings.&lt;/p&gt;

&lt;h2&gt;
  
  
  Related
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.glukhov.org/llm-architecture/model-routing/model-routing-strategies/" rel="noopener noreferrer"&gt;Model Routing Strategies&lt;/a&gt; — capability-based, cost-aware, latency-aware routing&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.glukhov.org/llm-architecture/guardrails/llm-guardrails-in-practice/" rel="noopener noreferrer"&gt;LLM Guardrails in Practice&lt;/a&gt; — input validation, output filtering, safety&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.glukhov.org/llm-architecture/model-routing/multi-model-system-design/" rel="noopener noreferrer"&gt;Multi-Model System Design&lt;/a&gt; — architecture for multiple models&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.glukhov.org/llm-architecture/" rel="noopener noreferrer"&gt;LLM Architecture&lt;/a&gt; — system design pillar: routing, cost, guardrails, and orchestration&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>costoptimization</category>
      <category>localinference</category>
    </item>
    <item>
      <title>LLM Guardrails in Practice: What Actually Works</title>
      <dc:creator>Rost</dc:creator>
      <pubDate>Fri, 19 Jun 2026 09:52:41 +0000</pubDate>
      <link>https://dev.to/rosgluk/llm-guardrails-in-practice-what-actually-works-54ph</link>
      <guid>https://dev.to/rosgluk/llm-guardrails-in-practice-what-actually-works-54ph</guid>
      <description>&lt;p&gt;LLMs are unpredictable. They hallucinate, leak data, generate harmful content, or refuse legitimate requests. Guardrails constrain model behavior without sacrificing capability.&lt;/p&gt;

&lt;p&gt;The key is knowing which guardrails matter and which are just noise.&lt;/p&gt;

&lt;p&gt;Guardrails aren't about controlling the model. They're about controlling the risk.&lt;/p&gt;

&lt;h2&gt;
  
  
  Input validation
&lt;/h2&gt;

&lt;p&gt;The most important guardrail. Bad input gets bad output, and bad input can also prompt-inject your system.&lt;/p&gt;

&lt;h3&gt;
  
  
  Strategy 1: Prompt Sanitization
&lt;/h3&gt;

&lt;p&gt;Sanitize dangerous patterns early:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PromptSanitizer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dangerous_patterns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ignore\s+previous\s+instructions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system\s+prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;you\s+are\s+now\s+free&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;break\s+out\s+of&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sanitize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;pattern&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dangerous_patterns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[REDACTED]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IGNORECASE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This isn't bulletproof. Adversarial inputs are creative. But it catches the obvious ones, and the obvious ones are the most common.&lt;/p&gt;

&lt;h3&gt;
  
  
  Strategy 2: Input Length Limits
&lt;/h3&gt;

&lt;p&gt;Length limits prevent token waste and timeouts. This check counts characters, which is a cheap proxy for tokens:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;InputValidator&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_length&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;max_length&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;validate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Input too long: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &amp;gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OK&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Strategy 3: Content Filtering
&lt;/h3&gt;

&lt;p&gt;Content filtering blocks policy violations. The patterns here depend on your domain:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ContentFilter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;blocked_topics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;violence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hate speech&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;self-harm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sexual content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;illegal activities&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;prompt_lower&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;blocked_topics&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;prompt_lower&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Blocked: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OK&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simple string matching is fast but imprecise. For production, use a classifier model — even a small one like Qwen2.5-1.5B — to detect policy violations. It's more accurate and harder to evade.&lt;/p&gt;

&lt;h2&gt;
  
  
  Output filtering
&lt;/h2&gt;

&lt;p&gt;The model's output needs checking too. Structure, content, and facts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Strategy 1: Response Validation
&lt;/h3&gt;

&lt;p&gt;Validate structure first. If you expect JSON with specific fields, check that the parsed object contains them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ResponseValidator&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;required_fields&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confidence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;validate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;required_fields&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Missing field: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OK&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Strategy 2: Content Filtering
&lt;/h3&gt;

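&lt;p&gt;Filter harmful content with a pattern blocklist. Treat regex matching as a coarse first pass, since it misses paraphrases and can flag benign text that happens to contain a blocked phrase:&lt;br&gt;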
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;OutputFilter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;blocked_patterns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kill\s+someone&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bomb\s+recipe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hate\s+speech&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;self-harm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;pattern&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;blocked_patterns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IGNORECASE&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Blocked: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OK&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Strategy 3: Fact-Checking
&lt;/h3&gt;

&lt;p&gt;Fact-checking is harder. You can't validate every claim, so pick the ones that matter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;FactChecker&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;known_facts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;capital of france&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Paris&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;population of usa&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;330 million&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;speed of light&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;299,792,458 m/s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;claim&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;claim_lower&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;claim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;fact&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;truth&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;known_facts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;fact&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;claim_lower&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;truth&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;claim_lower&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fact check failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;fact&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OK&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For real fact-checking, you need a retrieval pipeline. Check claims against a knowledge base, not a hardcoded dictionary.&lt;/p&gt;
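&lt;p&gt;As a rough illustration of that shape, the sketch below separates a retrieval step from the comparison step. &lt;code&gt;Fact&lt;/code&gt; and &lt;code&gt;KnowledgeBaseChecker&lt;/code&gt; are hypothetical names, and substring matching stands in for what would normally be a search index or vector store, so treat it as a toy under those assumptions rather than a working fact-checker.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class Fact:
    topic: str  # phrase that identifies what the fact is about
    value: str  # value a correct claim is expected to mention


class KnowledgeBaseChecker:
    """Toy stand-in for a retrieval pipeline (hypothetical, not a real library)."""

    def __init__(self, facts: list[Fact]):
        self.facts = facts

    def retrieve(self, claim: str) -> list[Fact]:
        # Assumption: a real system would query a search index here.
        # Substring matching keeps the sketch dependency-free.
        claim_lower = claim.lower()
        return [f for f in self.facts if f.topic.lower() in claim_lower]

    def check(self, claim: str) -> tuple[bool, str]:
        claim_lower = claim.lower()
        evidence = self.retrieve(claim)
        if not evidence:
            # No evidence found: pass the claim through, but say so.
            return True, "Unverified: no matching evidence"
        for fact in evidence:
            if fact.value.lower() not in claim_lower:
                return False, f"Contradicts knowledge base: {fact.topic}"
        return True, "OK"
```

&lt;p&gt;Returning a distinct "unverified" message keeps claims with no evidence separate from claims that were actually confirmed, which matters when you later decide what to show the user.&lt;/p&gt;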

&lt;h2&gt;
  
  
  Safety mechanisms
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Strategy 1: Rate Limiting
&lt;/h3&gt;

&lt;p&gt;Rate limiting prevents abuse:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;deque&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RateLimiter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_requests&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_requests&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;max_requests&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;window&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;window&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;requests&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;deque&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;allow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;requests&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;popleft&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_requests&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
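&lt;p&gt;The limiter above tracks one shared window, so a single noisy caller can exhaust the quota for everyone. A per-client variant might look like the sketch below. The class name &lt;code&gt;PerClientRateLimiter&lt;/code&gt;, the &lt;code&gt;client_id&lt;/code&gt; key, and the injectable &lt;code&gt;clock&lt;/code&gt; parameter are assumptions added for illustration.&lt;/p&gt;

```python
import time
from collections import defaultdict, deque
from typing import Callable


class PerClientRateLimiter:
    def __init__(self, max_requests: int = 10, window: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.max_requests = max_requests
        self.window = window
        self.clock = clock  # injectable so tests can control time
        self.requests: dict[str, deque] = defaultdict(deque)

    def allow(self, client_id: str) -> bool:
        now = self.clock()
        history = self.requests[client_id]
        # Drop timestamps that have aged out of the window.
        while history and now - history[0] > self.window:
            history.popleft()
        if len(history) >= self.max_requests:
            return False
        history.append(now)
        return True
```

&lt;p&gt;This keeps state in process memory, so it only limits traffic seen by one worker. A multi-process deployment would need a shared store for the counters.&lt;/p&gt;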



&lt;h3&gt;
  
  
  Strategy 2: Token Budgeting
&lt;/h3&gt;

&lt;p&gt;Token budgeting caps per-request costs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TokenBudget&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;validate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;token_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;token_count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Token limit exceeded: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;token_count&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &amp;gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OK&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Strategy 3: Context Window Management
&lt;/h3&gt;

&lt;p&gt;Context window management prevents overflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ContextManager&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;max_context&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sliding window trimming is simple but loses early context. Better approaches use summarization or attention-based compression, but those add latency.&lt;/p&gt;
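&lt;p&gt;A summarization-based variant could look roughly like this. Instead of discarding old messages, it folds them into a single summary entry and keeps the most recent messages verbatim. &lt;code&gt;SummarizingContext&lt;/code&gt;, &lt;code&gt;keep_recent&lt;/code&gt;, and &lt;code&gt;naive_summarize&lt;/code&gt; are hypothetical names, and the summarizer here is a plain truncation placeholder where a model call would normally go.&lt;/p&gt;

```python
from typing import Callable


def naive_summarize(messages: list[str]) -> str:
    # Placeholder: keep the first 40 characters of each message.
    # In practice this would be a model call, which is where the latency comes from.
    return "Summary: " + " | ".join(m[:40] for m in messages)


class SummarizingContext:
    def __init__(self, max_chars: int = 4096, keep_recent: int = 4,
                 summarize: Callable[[list[str]], str] = naive_summarize):
        self.max_chars = max_chars
        self.keep_recent = keep_recent  # assumed to be at least 1
        self.summarize = summarize
        self.context: list[str] = []

    def size(self) -> int:
        return len(" ".join(self.context))

    def add(self, message: str) -> None:
        self.context.append(message)
        # Compress only when over budget and there is something older to fold.
        if self.size() > self.max_chars and len(self.context) > self.keep_recent:
            old = self.context[:-self.keep_recent]
            recent = self.context[-self.keep_recent:]
            self.context = [self.summarize(old)] + recent
```

&lt;p&gt;The sketch does not guarantee the result fits the budget, because the summary plus the recent messages can still exceed &lt;code&gt;max_chars&lt;/code&gt;. A production version would need a fallback such as hard trimming.&lt;/p&gt;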

&lt;h2&gt;
  
  
  Compliance
&lt;/h2&gt;

&lt;p&gt;Enterprise systems need compliance guardrails. Two that matter most:&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 1: Data Residency
&lt;/h3&gt;

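&lt;p&gt;&lt;strong&gt;Data residency&lt;/strong&gt;: keep data within the required geographic boundaries. The check below only validates a region label, so the caller must pass the region where the request is actually processed:&lt;br&gt;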
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DataResidency&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;allowed_regions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;allowed_regions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;allowed_regions&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;validate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;allowed_regions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Region not allowed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OK&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Pattern 2: Audit Logging
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Audit logging&lt;/strong&gt;: record every model interaction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AuditLogger&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;log_file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audit.log&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;log_file&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;entry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;request&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Audit logs are critical for debugging and compliance. Make them structured, append-only, and stored securely.&lt;/p&gt;
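&lt;p&gt;One way to make an append-only log tamper-evident is to chain entries by hash, so that editing any earlier record breaks verification. The sketch below is an in-memory illustration of that idea. &lt;code&gt;ChainedAuditLog&lt;/code&gt; and its field names are hypothetical, and a real deployment would persist entries to write-once storage.&lt;/p&gt;

```python
import hashlib
import json
from datetime import datetime, timezone


class ChainedAuditLog:
    def __init__(self):
        self.entries: list[dict] = []
        self.last_hash = "0" * 64  # sentinel hash for the first entry

    def append(self, request: dict, response: dict) -> dict:
        body = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "request": request,
            "response": response,
            "prev_hash": self.last_hash,
        }
        # Hash a canonical JSON form so verification is reproducible.
        payload = json.dumps(body, sort_keys=True).encode("utf-8")
        entry = dict(body, hash=hashlib.sha256(payload).hexdigest())
        self.entries.append(entry)
        self.last_hash = entry["hash"]
        return entry

    def verify(self) -> bool:
        prev = "0" * 64
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            payload = json.dumps(body, sort_keys=True).encode("utf-8")
            if entry["prev_hash"] != prev:
                return False
            if hashlib.sha256(payload).hexdigest() != entry["hash"]:
                return False
            prev = entry["hash"]
        return True
```

&lt;p&gt;Hash chaining detects modification but does not prevent it, and it does not stop someone from truncating the end of the log. It complements access controls rather than replacing them.&lt;/p&gt;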

&lt;h2&gt;
  
  
  Putting it together
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Pattern 1: Simple Guardrails
&lt;/h3&gt;

&lt;p&gt;A simple guardrail pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SimpleGuardrails&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;input_validator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;InputValidator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_filter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OutputFilter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;valid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;input_validator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;validate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;valid&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;valid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_filter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;valid&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Pattern 2: Advanced Guardrails
&lt;/h3&gt;

&lt;p&gt;Advanced guardrails add sanitization, rate limiting, and token budgets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AdvancedGuardrails&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sanitizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PromptSanitizer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;input_validator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;InputValidator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content_filter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ContentFilter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_filter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OutputFilter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rate_limiter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RateLimiter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_requests&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;token_budget&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TokenBudget&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sanitizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sanitize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;valid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;input_validator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;validate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;valid&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

        &lt;span class="n"&gt;valid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content_filter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;valid&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rate_limiter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;allow&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error: Rate limit exceeded&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;valid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_filter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;valid&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

        &lt;span class="n"&gt;valid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;token_budget&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;validate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;valid&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  When guardrails matter
&lt;/h2&gt;

&lt;p&gt;Guardrails matter when you're building user-facing systems, handling sensitive data, or running in production. They also matter when you have compliance requirements — GDPR, HIPAA, SOC 2.&lt;/p&gt;

&lt;p&gt;They don't matter when you're prototyping, using models for internal tools only, or not handling sensitive data. Skip them until you need them.&lt;/p&gt;

&lt;p&gt;The tradeoff is always capability versus safety. More guardrails mean fewer failures but also fewer capabilities. Find the balance that works for your system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tradeoffs
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Safety&lt;/th&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;No guardrails&lt;/td&gt;
&lt;td&gt;Lowest&lt;/td&gt;
&lt;td&gt;Highest&lt;/td&gt;
&lt;td&gt;Lowest&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Input validation&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output filtering&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Safety mechanisms&lt;/td&gt;
&lt;td&gt;Highest&lt;/td&gt;
&lt;td&gt;Lowest&lt;/td&gt;
&lt;td&gt;Highest&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compliance&lt;/td&gt;
&lt;td&gt;Highest&lt;/td&gt;
&lt;td&gt;Lowest&lt;/td&gt;
&lt;td&gt;Highest&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Related
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.glukhov.org/llm-architecture/model-routing/model-routing-strategies/" rel="noopener noreferrer"&gt;Model Routing Strategies&lt;/a&gt; — capability-based, cost-aware, latency-aware routing&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.glukhov.org/llm-architecture/cost-optimization/cost-optimization-for-llm-systems/" rel="noopener noreferrer"&gt;Cost Optimization for LLM Systems&lt;/a&gt; — token budgeting, fallback models, caching&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.glukhov.org/llm-architecture/model-routing/multi-model-system-design/" rel="noopener noreferrer"&gt;Multi-Model System Design&lt;/a&gt; — architecture for multiple models&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.glukhov.org/llm-architecture/" rel="noopener noreferrer"&gt;LLM Architecture&lt;/a&gt; — system design pillar: routing, cost, guardrails, and orchestration&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>safety</category>
      <category>llmsecurity</category>
    </item>
    <item>
      <title>Model Routing: Stop Using One Model for Everything</title>
      <dc:creator>Rost</dc:creator>
      <pubDate>Fri, 19 Jun 2026 09:51:51 +0000</pubDate>
      <link>https://dev.to/rosgluk/model-routing-stop-using-one-model-for-everything-4mf1</link>
      <guid>https://dev.to/rosgluk/model-routing-stop-using-one-model-for-everything-4mf1</guid>
      <description>&lt;p&gt;Running a 70B parameter model to summarize a 200-word email is wasteful. Running a 3B model to review production code is reckless. Most systems live somewhere in between — and that's where model routing comes in.&lt;/p&gt;

&lt;p&gt;Model routing matches task complexity to model capability. The tradeoffs are real, but the savings are too.&lt;/p&gt;

&lt;h2&gt;
  
  
  The routing problem
&lt;/h2&gt;

&lt;p&gt;People usually start with one model and stick with it. That works until you notice the cost, or the latency, or both. The alternative is building a router — something that decides which model handles which request.&lt;/p&gt;

&lt;p&gt;Four strategies work in practice:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Capability-based&lt;/strong&gt; — route by what the model can do&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost-aware&lt;/strong&gt; — route by what you're willing to spend&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency-aware&lt;/strong&gt; — route by how fast you need it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid&lt;/strong&gt; — combine them&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each optimizes something different. Picking one is usually a decision about what hurts most.&lt;/p&gt;

&lt;h2&gt;
  
  
  Capability-based routing
&lt;/h2&gt;

&lt;p&gt;The simplest approach. Classify the task, send it to the model that handles it.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Model size&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Classification, tagging&lt;/td&gt;
&lt;td&gt;1-3B&lt;/td&gt;
&lt;td&gt;Qwen2.5-1.5B, Gemma-2-2B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Summarization, extraction&lt;/td&gt;
&lt;td&gt;3-8B&lt;/td&gt;
&lt;td&gt;Qwen2.5-7B, Llama-3.1-8B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code generation&lt;/td&gt;
&lt;td&gt;7-16B&lt;/td&gt;
&lt;td&gt;Qwen2.5-Coder-7B, DeepSeek-Coder-V2-Lite (16B)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complex reasoning&lt;/td&gt;
&lt;td&gt;14-70B&lt;/td&gt;
&lt;td&gt;Qwen2.5-32B, Llama-3.1-70B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Creative writing, analysis&lt;/td&gt;
&lt;td&gt;70B+&lt;/td&gt;
&lt;td&gt;Qwen2.5-72B, Claude, GPT-4&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If the task doesn't need the bigger model, don't use it. A 1.5B model handles sentiment classification fine. It just won't write a coherent essay.&lt;/p&gt;

&lt;p&gt;Implementation is straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ROUTING_RULES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;classify&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5-1.5b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summarize&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5-7b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code_review&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5-coder-7b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5-32b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4000&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;creative&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;8000&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;route_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ROUTING_RULES&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ROUTING_RULES&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The catch is classification itself. If you get the task type wrong, you route to the wrong model. I've seen systems classify code review as "summarization" and lose quality silently.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost-aware routing
&lt;/h2&gt;

&lt;p&gt;Local inference shines here. Local models are effectively free after hardware amortization. An RTX 5080 pays for itself in about six months at moderate API usage.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input ($/M tokens)&lt;/th&gt;
&lt;th&gt;Output ($/M tokens)&lt;/th&gt;
&lt;th&gt;Local cost/hour&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o&lt;/td&gt;
&lt;td&gt;$2.50&lt;/td&gt;
&lt;td&gt;$10.00&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet 4&lt;/td&gt;
&lt;td&gt;$3.00&lt;/td&gt;
&lt;td&gt;$15.00&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen2.5-72B (API)&lt;/td&gt;
&lt;td&gt;$0.50&lt;/td&gt;
&lt;td&gt;$2.00&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen2.5-32B (local)&lt;/td&gt;
&lt;td&gt;$0.00&lt;/td&gt;
&lt;td&gt;$0.00&lt;/td&gt;
&lt;td&gt;~$0.10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen2.5-7B (local)&lt;/td&gt;
&lt;td&gt;$0.00&lt;/td&gt;
&lt;td&gt;$0.00&lt;/td&gt;
&lt;td&gt;~$0.05&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you're processing thousands of requests per session, roughly $0.05 per hour in electricity beats paying $15 per million output tokens.&lt;/p&gt;

&lt;p&gt;Budget-based routing falls back as you spend:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CostAwareRouter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;budget_per_session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.10&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;budget&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;budget_per_session&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cheap&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5-7b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;medium&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5-32b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expensive&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.000015&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;ratio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spent&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;budget&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ratio&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expensive&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;ratio&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;medium&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cheap&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Quality degrades as you fall back. You start with Claude, move to Qwen-32B, then to Qwen-7B. By the end of a long session, the output is noticeably worse. Whether that matters depends on what you're building.&lt;/p&gt;

&lt;h2&gt;
  
  
  Latency-aware routing
&lt;/h2&gt;

&lt;p&gt;Interactive tools need fast first tokens. Batch jobs can wait. The difference is usually a factor of five in model size.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use case&lt;/th&gt;
&lt;th&gt;First token&lt;/th&gt;
&lt;th&gt;Complete&lt;/th&gt;
&lt;th&gt;Max model size&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Real-time chat&lt;/td&gt;
&lt;td&gt;&amp;lt; 200ms&lt;/td&gt;
&lt;td&gt;&amp;lt; 2s&lt;/td&gt;
&lt;td&gt;&amp;lt; 7B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Interactive tools&lt;/td&gt;
&lt;td&gt;&amp;lt; 500ms&lt;/td&gt;
&lt;td&gt;&amp;lt; 5s&lt;/td&gt;
&lt;td&gt;&amp;lt; 14B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Batch processing&lt;/td&gt;
&lt;td&gt;&amp;lt; 1s&lt;/td&gt;
&lt;td&gt;&amp;lt; 30s&lt;/td&gt;
&lt;td&gt;Any&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Research/analysis&lt;/td&gt;
&lt;td&gt;&amp;lt; 2s&lt;/td&gt;
&lt;td&gt;&amp;lt; 60s&lt;/td&gt;
&lt;td&gt;Any&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;When you're streaming tokens to a user, first token latency is what they feel. A 32B model taking half a second to start feels sluggish compared to a 1.5B model that fires instantly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LatencyAwareRouter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model_latencies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5-1.5b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;first_token&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;complete&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5-7b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;first_token&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;complete&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5-32b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;first_token&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;complete&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;10.0&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;first_token&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;complete&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;5.0&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_latency&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;latencies&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model_latencies&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;complete&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;latencies&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;complete&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;target_latency&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5-1.5b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The latency numbers are rough — they depend on your hardware, quantization, and batch size. Measure on your own setup.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fallback strategies
&lt;/h2&gt;

&lt;p&gt;Models fail. APIs rate-limit. Timeouts happen. The pattern that works is a fallback chain, ordered from the most capable model to the most reliable one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
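&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;

&lt;span class="n"&gt;log&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# APIError below is a placeholder for your client library's error type&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;FallbackRouter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;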
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fallback_chain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timeout&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5-72b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timeout&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5-32b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timeout&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5-7b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timeout&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;route_with_fallback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fallback_chain&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timeout&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;except &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;TimeoutError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;APIError&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Model &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;All fallback models failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The last model in the chain should be local. It may be slower, but it can't fail because of a network outage, a rate limit, or an expired API key.&lt;/p&gt;
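
&lt;p&gt;One consequence of per-model timeouts is easy to miss: if every model runs into its limit, the request waits for the sum of all of them. The sketch below is a standalone variant of the chain above, not a drop-in replacement. The names &lt;code&gt;worst_case_wait&lt;/code&gt; and &lt;code&gt;fake_call_model&lt;/code&gt; and the injected &lt;code&gt;call_model&lt;/code&gt; signature are assumptions made for illustration, and &lt;code&gt;ConnectionError&lt;/code&gt; stands in for whatever error your client library raises.&lt;/p&gt;

```python
from typing import Callable

# Timeouts copied from the fallback chain above (seconds).
FALLBACK_CHAIN = [
    {"model": "claude-sonnet-4", "timeout": 30},
    {"model": "qwen2.5-72b", "timeout": 60},
    {"model": "qwen2.5-32b", "timeout": 120},
    {"model": "qwen2.5-7b", "timeout": 300},
]


def worst_case_wait(chain: list[dict]) -> int:
    """Upper bound on waiting time if every model runs into its timeout."""
    return sum(step["timeout"] for step in chain)


def route_with_fallback(
    prompt: str,
    call_model: Callable[[str, str, int], str],
) -> tuple[str, str]:
    """Try each model in order; return (model_name, answer) from the first success."""
    errors = []
    for step in FALLBACK_CHAIN:
        try:
            return step["model"], call_model(step["model"], prompt, step["timeout"])
        except (TimeoutError, ConnectionError) as exc:
            # Remember why this model failed, then move on to the next one.
            errors.append(f"{step['model']}: {exc}")
    raise RuntimeError("All fallback models failed: " + "; ".join(errors))


def fake_call_model(model: str, prompt: str, timeout: int) -> str:
    """Hypothetical stand-in: the hosted model is unreachable, local ones answer."""
    if model.startswith("claude"):
        raise ConnectionError("network unreachable")
    return f"[{model}] {prompt}"
```

&lt;p&gt;With the timeouts listed above, &lt;code&gt;worst_case_wait(FALLBACK_CHAIN)&lt;/code&gt; comes to 510 seconds, so an overall deadline on top of the per-model timeouts may be worth considering.&lt;/p&gt;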

&lt;h2&gt;
  
  
  When routing helps
&lt;/h2&gt;

&lt;p&gt;Routing makes sense when your workload is mixed. If you're doing classification, summarization, and reasoning in the same system, a router saves money and latency.&lt;/p&gt;

&lt;p&gt;It doesn't make sense when every request has roughly the same complexity. Just use the model that's good at that task; a router only adds moving parts you don't need.&lt;/p&gt;

&lt;p&gt;Early prototyping is another reason to skip it. Get the task working with one model, then add routing when cost or latency actually becomes a problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tradeoffs
&lt;/h2&gt;

&lt;p&gt;Every routing strategy optimizes something and sacrifices something else:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single model&lt;/strong&gt;: simplest, consistent quality, most expensive when a large model handles every request&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capability-based&lt;/strong&gt;: lower cost and better quality per task, moderate complexity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost-aware&lt;/strong&gt;: cheapest, quality varies, moderate complexity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency-aware&lt;/strong&gt;: fastest, may sacrifice quality, moderate complexity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid&lt;/strong&gt;: balances quality, cost, and latency, but is the most complex to implement&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Production systems usually converge on hybrid. Start with capability-based routing, add cost awareness when the bill comes in, add latency awareness when users complain about slowness.&lt;/p&gt;
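
&lt;p&gt;To make the hybrid idea concrete, here is a minimal sketch that filters by capability first, then applies latency and cost budgets, and finally picks the cheapest survivor. Everything in it is an assumption for illustration: the &lt;code&gt;ModelProfile&lt;/code&gt; type, the &lt;code&gt;CATALOG&lt;/code&gt; entries, and every cost and latency number are made up and should be replaced with your own measurements.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    name: str
    capabilities: frozenset[str]
    cost_per_1k_tokens: float  # illustrative, not a real price
    latency_s: float           # illustrative, measure on your own setup


# Hypothetical catalog: every number here is invented for the example.
CATALOG = [
    ModelProfile("qwen2.5-1.5b", frozenset({"classification"}), 0.0001, 0.5),
    ModelProfile("qwen2.5-7b", frozenset({"classification", "summarization"}), 0.0005, 2.0),
    ModelProfile(
        "qwen2.5-32b",
        frozenset({"classification", "summarization", "reasoning"}),
        0.002,
        8.0,
    ),
    ModelProfile(
        "claude-sonnet-4",
        frozenset({"classification", "summarization", "reasoning"}),
        0.01,
        4.0,
    ),
]


def route(task: str, max_latency_s: float, max_cost: float) -> str:
    """Capability filter first, then both budgets, then the cheapest model wins."""
    capable = [m for m in CATALOG if task in m.capabilities]
    if not capable:
        raise ValueError(f"no model supports task {task!r}")

    within_budget = [
        m
        for m in capable
        if max_latency_s >= m.latency_s and max_cost >= m.cost_per_1k_tokens
    ]
    if not within_budget:
        # Nothing fits both budgets: degrade to the fastest capable model.
        return min(capable, key=lambda m: m.latency_s).name
    return min(within_budget, key=lambda m: m.cost_per_1k_tokens).name
```

&lt;p&gt;The order of the filters is a design choice. Capability comes first because a cheap, fast answer from a model that can't do the task is still a wrong answer.&lt;/p&gt;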

&lt;h2&gt;
  
  
  Related
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.glukhov.org/llm-architecture/cost-optimization/cost-optimization-for-llm-systems/" rel="noopener noreferrer"&gt;Cost Optimization for LLM Systems&lt;/a&gt; — token budgeting, caching, fallback models&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.glukhov.org/llm-architecture/guardrails/llm-guardrails-in-practice/" rel="noopener noreferrer"&gt;LLM Guardrails in Practice&lt;/a&gt; — input validation, output filtering, safety&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.glukhov.org/llm-architecture/model-routing/multi-model-system-design/" rel="noopener noreferrer"&gt;Multi-Model System Design&lt;/a&gt; — architecture for multiple models&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.glukhov.org/llm-architecture/" rel="noopener noreferrer"&gt;LLM Architecture&lt;/a&gt; — system design pillar: routing, cost, guardrails, and orchestration&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>localinference</category>
      <category>modelrouting</category>
    </item>
    <item>
      <title>Multi-Model System Design: When One Model Isn't Enough</title>
      <dc:creator>Rost</dc:creator>
      <pubDate>Fri, 19 Jun 2026 09:51:46 +0000</pubDate>
      <link>https://dev.to/rosgluk/multi-model-system-design-when-one-model-isnt-enough-311c</link>
      <guid>https://dev.to/rosgluk/multi-model-system-design-when-one-model-isnt-enough-311c</guid>
      <description>&lt;p&gt;Single-model systems are simple. Multi-model systems are powerful. The challenge isn't choosing models — it's designing the architecture that orchestrates them.&lt;/p&gt;

&lt;p&gt;A multi-model system isn't about having more models. It's about having the right model for the right task at the right time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture patterns
&lt;/h2&gt;

&lt;p&gt;Five patterns cover most use cases:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Complexity&lt;/th&gt;
&lt;th&gt;When to use&lt;/th&gt;
&lt;th&gt;Tradeoff&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Single Model&lt;/td&gt;
&lt;td&gt;Lowest&lt;/td&gt;
&lt;td&gt;Prototyping, simple tasks&lt;/td&gt;
&lt;td&gt;Limited capability&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sequential&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Multi-step workflows&lt;/td&gt;
&lt;td&gt;Higher latency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parallel&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Independent tasks&lt;/td&gt;
&lt;td&gt;Higher cost&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hierarchical&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Complex reasoning&lt;/td&gt;
&lt;td&gt;Complex orchestration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ensemble&lt;/td&gt;
&lt;td&gt;Highest&lt;/td&gt;
&lt;td&gt;Critical decisions&lt;/td&gt;
&lt;td&gt;Highest cost&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Pick the simplest one that works. Complexity is real, and it compounds.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sequential architecture
&lt;/h2&gt;

&lt;p&gt;Process tasks through a chain of models, each specializing in a step.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 1: Pipeline
&lt;/h3&gt;

&lt;p&gt;Pipeline pattern — each model's output feeds the next:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ModelPipeline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5-1.5b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;classify&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5-7b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extract&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5-32b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;  &lt;span class="c1"&gt;# named text to avoid shadowing the built-in input()&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;model_config&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;model_config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Latency adds up. Three models in sequence means the end-to-end latency is the sum of all three calls, and the slowest model dominates. Only use this when each step actually needs a different model.&lt;/p&gt;
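
&lt;p&gt;Before deciding whether a step deserves its own model, it helps to see where the time goes. The sketch below times each step of a pipeline separately. It is a standalone illustration: &lt;code&gt;run_timed_pipeline&lt;/code&gt; and &lt;code&gt;slowest_step&lt;/code&gt; are hypothetical helpers, and the steps are plain callables standing in for model calls.&lt;/p&gt;

```python
import time
from typing import Callable


def run_timed_pipeline(
    text: str,
    steps: list[tuple[str, Callable[[str], str]]],
) -> tuple[str, dict[str, float]]:
    """Run the steps in order and record wall-clock seconds spent in each one."""
    timings: dict[str, float] = {}
    current = text
    for name, step in steps:
        start = time.perf_counter()
        current = step(current)  # each step consumes the previous step's output
        timings[name] = time.perf_counter() - start
    return current, timings


def slowest_step(timings: dict[str, float]) -> str:
    """Name of the step that contributes most to end-to-end latency."""
    return max(timings, key=timings.get)


if __name__ == "__main__":
    # Stand-in steps; in a real pipeline each would wrap a model call.
    output, timings = run_timed_pipeline(
        "  some input  ",
        [("strip", str.strip), ("upper", str.upper)],
    )
    print(output, timings, "total:", sum(timings.values()))
```

&lt;p&gt;The total is the sum of the per-step timings, so shaving the slowest step usually pays off more than tuning the fast ones.&lt;/p&gt;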

&lt;h3&gt;
  
  
  Pattern 2: Router
&lt;/h3&gt;

&lt;p&gt;Router pattern — classify the task, route to the specialist:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ModelRouter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;classifier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5-1.5b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;specialists&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5-coder-7b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;math&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5-32b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;creative&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;general&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5-7b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;task_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;classify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;specialists&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;specialists&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;general&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The classifier is the weak link. If it misclassifies, you route to the wrong model and lose quality. Use a classifier that's good enough — even a small one works if the categories are clear.&lt;/p&gt;
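
&lt;p&gt;The router above calls &lt;code&gt;self.classify&lt;/code&gt; without showing it. One possible shape, sketched under assumptions: ask the small model for a single label, normalize the reply, and fall back to &lt;code&gt;general&lt;/code&gt; whenever the reply is not a known label. The prompt wording, the &lt;code&gt;build_classifier_prompt&lt;/code&gt; helper, and the injected &lt;code&gt;call_model&lt;/code&gt; signature are all hypothetical.&lt;/p&gt;

```python
from typing import Callable

# Same categories as the specialists in the router above.
LABELS = ("code", "math", "creative", "general")


def build_classifier_prompt(prompt: str) -> str:
    """Ask for exactly one label so the reply is easy to validate."""
    return (
        "Classify the request into exactly one of: "
        + ", ".join(LABELS)
        + ". Reply with the label only.\n\nRequest: "
        + prompt
    )


def classify(
    prompt: str,
    call_model: Callable[[str, str], str],
    classifier: str = "qwen2.5-1.5b",
) -> str:
    """Return a known label, or "general" when the reply can't be trusted."""
    raw = call_model(classifier, build_classifier_prompt(prompt))
    # Small models often add whitespace, quotes, or a trailing period.
    label = raw.strip().strip(".\"'").lower()
    return label if label in LABELS else "general"
```

&lt;p&gt;Falling back to the general model on an unrecognized reply keeps a confused classifier from raising errors, at the price of sometimes missing a specialist.&lt;/p&gt;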

&lt;h2&gt;
  
  
  Parallel architecture
&lt;/h2&gt;

&lt;p&gt;Process independent tasks simultaneously.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 1: Fan-Out
&lt;/h3&gt;

&lt;p&gt;Fan-out — run the same prompt through multiple models:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ModelFanOut&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5-7b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5-32b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;tasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Useful for comparison, A/B testing, or when you want to pick the best output. It is expensive, because you pay for every model on every request, so reserve it for decisions where the quality gain justifies the cost.&lt;/p&gt;
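
&lt;p&gt;Picking the best output needs two things the fan-out class leaves open: a way to score answers and a plan for models that fail mid-flight. The sketch below is one way to handle both, assuming an async &lt;code&gt;call_model&lt;/code&gt; coroutine and a &lt;code&gt;score&lt;/code&gt; function that you supply; both are hypothetical parameters, not part of any library.&lt;/p&gt;

```python
import asyncio
from typing import Awaitable, Callable


async def fan_out_and_pick(
    prompt: str,
    models: list[str],
    call_model: Callable[[str, str], Awaitable[str]],
    score: Callable[[str], float],
) -> tuple[str, str]:
    """Query all models concurrently; return (model, answer) with the top score."""
    results = await asyncio.gather(
        *(call_model(model, prompt) for model in models),
        return_exceptions=True,  # one failing model should not discard the rest
    )
    # Keep only the models that actually produced an answer.
    answers = [
        (model, result)
        for model, result in zip(models, results)
        if not isinstance(result, BaseException)
    ]
    if not answers:
        raise RuntimeError("every model in the fan-out failed")
    return max(answers, key=lambda pair: score(pair[1]))
```

&lt;p&gt;Because the calls run concurrently, the wait is roughly that of the slowest model rather than the sum, while the cost is still the sum of all calls.&lt;/p&gt;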

&lt;h3&gt;
  
  
  Pattern 2: Voting
&lt;/h3&gt;

&lt;p&gt;Voting — combine outputs through consensus:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Counter&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ModelVoting&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5-7b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5-32b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;vote&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;responses&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="c1"&gt;# Normalize case and whitespace so equivalent labels count as one vote&lt;/span&gt;
        &lt;span class="n"&gt;votes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;votes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;most_common&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Majority voting works for classification. For generation tasks, it's harder — you need semantic similarity, not exact matches.&lt;/p&gt;
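
&lt;p&gt;For free-form answers, one workable approach is to pick the response that agrees most with all the others. The sketch below uses &lt;code&gt;difflib.SequenceMatcher&lt;/code&gt; from the standard library, which measures lexical overlap and is only a crude stand-in for semantic similarity; in practice you would likely swap &lt;code&gt;similarity&lt;/code&gt; for cosine similarity over embeddings. The function names are hypothetical.&lt;/p&gt;

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Lexical similarity in [0, 1]; a stand-in for embedding cosine similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pick_consensus(responses: list[str]) -> str:
    """Return the response most similar, on average, to all the others."""
    if not responses:
        raise ValueError("need at least one response")
    if len(responses) == 1:
        return responses[0]

    def mean_similarity(index: int) -> float:
        # Average agreement between this response and every other one.
        others = [r for i, r in enumerate(responses) if i != index]
        return sum(similarity(responses[index], o) for o in others) / len(others)

    best = max(range(len(responses)), key=mean_similarity)
    return responses[best]
```

&lt;p&gt;An outlier answer scores low against everyone else and drops out, which is the same intuition as majority voting without requiring exact matches.&lt;/p&gt;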

&lt;h2&gt;
  
  
  Hierarchical architecture
&lt;/h2&gt;

&lt;p&gt;Use models at different levels of abstraction.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 1: Planner-Executor
&lt;/h3&gt;

&lt;p&gt;Planner-executor — a strong model plans, smaller models execute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PlannerExecutor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;planner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5-32b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;executors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5-coder-7b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5-7b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;math&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5-7b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;plan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;planner&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Plan: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse_plan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;executor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;executors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5-7b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;planner&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Synthesize: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The planner does the heavy lifting. The executors handle specific tasks. This pattern works well when the planning step is expensive but the execution steps are cheap.&lt;/p&gt;
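
&lt;p&gt;The class above calls &lt;code&gt;parse_plan&lt;/code&gt; without showing it. A minimal sketch follows, assuming the planner is prompted to reply with a JSON array of steps that each carry a &lt;code&gt;type&lt;/code&gt; and a &lt;code&gt;prompt&lt;/code&gt;. The JSON format and the &lt;code&gt;general&lt;/code&gt; fallback label are assumptions for illustration, not part of the original design.&lt;/p&gt;

```python
import json

DEFAULT_STEP_TYPE = "general"  # hypothetical fallback label for untyped steps


def parse_plan(plan_text):
    """Turn planner output into a list of {"type", "prompt"} steps.

    Assumes the planner was asked to reply with a JSON array. If the
    reply is not a JSON array, the whole text becomes one general step.
    """
    fallback = [{"type": DEFAULT_STEP_TYPE, "prompt": plan_text.strip()}]
    try:
        raw_steps = json.loads(plan_text)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(raw_steps, list):
        return fallback

    steps = []
    for item in raw_steps:
        if not isinstance(item, dict) or "prompt" not in item:
            continue  # skip malformed entries rather than guessing
        steps.append({
            "type": str(item.get("type", DEFAULT_STEP_TYPE)),
            "prompt": str(item["prompt"]),
        })
    return steps
```

&lt;p&gt;Falling back to a single step keeps the pipeline running when the planner ignores the format, at the cost of losing the decomposition for that request.&lt;/p&gt;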

&lt;h3&gt;
  
  
  Pattern 2: Supervisor-Worker
&lt;/h3&gt;

&lt;p&gt;Supervisor-worker — a supervisor delegates and reviews:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SupervisorWorker&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;supervisor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5-32b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;workers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5-7b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5-coder-7b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;assignments&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;supervisor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Assign: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;assignment&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse_assignments&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;assignments&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;assignment&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;worker&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;assignment&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;supervisor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Review: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The supervisor is the bottleneck. It plans, delegates, and reviews. Make sure it's fast enough, or the whole system slows down.&lt;/p&gt;
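
&lt;p&gt;One way to keep the supervisor calls as the only serial part is to run the worker assignments concurrently. The sketch below assumes the assignments are independent of each other and that &lt;code&gt;call_model&lt;/code&gt; is safe to call from several threads. The helper name &lt;code&gt;run_assignments&lt;/code&gt; is hypothetical.&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor


def run_assignments(call_model, assignments, max_workers=4):
    """Run independent worker assignments concurrently.

    Results come back in the same order as the assignments, so the
    supervisor can review them exactly as in the sequential loop.
    """
    def run_one(assignment):
        return call_model(assignment["worker"], assignment["task"])

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, assignments))
```

&lt;p&gt;Threads fit here because model calls are mostly network waits. If assignments depend on each other, the sequential loop is still the right shape.&lt;/p&gt;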

&lt;h2&gt;
  
  
  Ensemble architecture
&lt;/h2&gt;

&lt;p&gt;Combine multiple models for critical decisions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 1: Weighted Ensemble
&lt;/h3&gt;

&lt;p&gt;Weighted ensemble — score each model's output, pick the highest:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;WeightedEnsemble&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5-32b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5-7b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;decide&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;responses&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Weights reflect your confidence in each model. Adjust them based on measured performance on your own workload, not published benchmarks.&lt;/p&gt;
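
&lt;p&gt;A rough sketch of deriving weights from observed results, assuming you log a pass or fail outcome per model response from your own evaluation. The function name, the record shape, and the &lt;code&gt;floor&lt;/code&gt; parameter are all hypothetical.&lt;/p&gt;

```python
from collections import defaultdict


def weights_from_outcomes(outcomes, floor=0.05):
    """Derive normalised ensemble weights from (model, passed) records."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for model, ok in outcomes:
        total[model] += 1
        passed[model] += 1 if ok else 0

    # The floor keeps a model with no passes from being silenced entirely.
    rates = {model: max(passed[model] / total[model], floor) for model in total}
    norm = sum(rates.values())
    return {model: rate / norm for model, rate in rates.items()}
```

&lt;p&gt;The returned weights sum to one, so they can replace the hard-coded values in &lt;code&gt;self.models&lt;/code&gt; directly.&lt;/p&gt;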

&lt;h3&gt;
  
  
  Pattern 2: Consensus Ensemble
&lt;/h3&gt;

&lt;p&gt;Consensus ensemble — require agreement, escalate if there isn't any:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ConsensusEnsemble&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5-32b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5-7b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;decide&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;responses&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Counter&lt;/span&gt;
        &lt;span class="n"&gt;votes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;max_votes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;votes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;max_votes&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;votes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;most_common&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5-32b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



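&lt;p&gt;The threshold controls how strict consensus is. With three models, 0.7 requires all three to agree, because two matching responses out of three (about 0.67) fall below it; set the threshold to 0.66 or lower to accept a two-of-three majority. Votes are counted by exact string match, so this works best when responses are normalized to short labels. Lower the threshold for faster decisions, raise it for higher confidence.&lt;/p&gt;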
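
&lt;p&gt;To see what a given threshold demands in practice, it helps to convert it into a vote count. The helper below is a small illustrative calculation for the check used in &lt;code&gt;decide&lt;/code&gt;; the name &lt;code&gt;votes_needed&lt;/code&gt; is hypothetical.&lt;/p&gt;

```python
import math


def votes_needed(num_models, threshold):
    """Smallest number of matching responses that satisfies the threshold."""
    return math.ceil(threshold * num_models)


# With three models, print the vote count each threshold requires.
for threshold in (0.5, 0.66, 0.7, 1.0):
    print(threshold, votes_needed(3, threshold))
```

&lt;p&gt;For three models this gives two votes at 0.5 and 0.66, and three votes at 0.7 and 1.0.&lt;/p&gt;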

&lt;h2&gt;
  
  
  When multi-model systems make sense
&lt;/h2&gt;

&lt;p&gt;Multi-model systems make sense when you have mixed workloads, need high quality for critical decisions, or are optimizing for cost or latency.&lt;/p&gt;

&lt;p&gt;They don't make sense when all tasks are of similar complexity, you're prototyping, or simplicity matters more than optimization.&lt;/p&gt;

&lt;p&gt;The rule of thumb: start with one model. Add more when you hit a real constraint — cost, latency, or quality. Don't architect complexity before you need it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tradeoffs
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;th&gt;Quality&lt;/th&gt;
&lt;th&gt;Complexity&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Single Model&lt;/td&gt;
&lt;td&gt;Lowest&lt;/td&gt;
&lt;td&gt;Lowest&lt;/td&gt;
&lt;td&gt;Variable&lt;/td&gt;
&lt;td&gt;Lowest&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sequential&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parallel&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hierarchical&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Highest&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ensemble&lt;/td&gt;
&lt;td&gt;Highest&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Highest&lt;/td&gt;
&lt;td&gt;Highest&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Every pattern trades something. Pick the one that matches your constraints.&lt;/p&gt;

&lt;h2&gt;
  
  
  Related
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.glukhov.org/llm-architecture/model-routing/model-routing-strategies/" rel="noopener noreferrer"&gt;Model Routing Strategies&lt;/a&gt; — capability-based, cost-aware, latency-aware routing&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.glukhov.org/llm-architecture/cost-optimization/cost-optimization-for-llm-systems/" rel="noopener noreferrer"&gt;Cost Optimization for LLM Systems&lt;/a&gt; — token budgeting, fallback models, caching&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.glukhov.org/llm-architecture/guardrails/llm-guardrails-in-practice/" rel="noopener noreferrer"&gt;LLM Guardrails in Practice&lt;/a&gt; — input validation, output filtering, safety&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.glukhov.org/llm-architecture/" rel="noopener noreferrer"&gt;LLM Architecture&lt;/a&gt; — system design pillar: routing, cost, guardrails, and orchestration&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>systemdesign</category>
      <category>llmarchitecture</category>
    </item>
    <item>
      <title>AI Assistant Architecture: LLM, Memory, Tools, Routing, Observability</title>
      <dc:creator>Rost</dc:creator>
      <pubDate>Tue, 16 Jun 2026 12:20:21 +0000</pubDate>
      <link>https://dev.to/rosgluk/ai-assistant-architecture-llm-memory-tools-routing-observability-4jdi</link>
      <guid>https://dev.to/rosgluk/ai-assistant-architecture-llm-memory-tools-routing-observability-4jdi</guid>
      <description>&lt;p&gt;A production AI assistant is not "an LLM with a prompt". It is a system that accepts intent, keeps state, decides when to retrieve or act, and exposes enough runtime detail to debug failures.&lt;/p&gt;

&lt;p&gt;That systems-level view is what the &lt;a href="https://www.glukhov.org/ai-systems/" rel="noopener noreferrer"&gt;AI Systems cluster&lt;/a&gt; explores when assistants move beyond a single model invocation.&lt;/p&gt;

&lt;p&gt;OpenAI describes agents as applications that plan, call tools, collaborate, and keep enough state for multi-step work, while Anthropic frames the same problem as a managed harness that can run files, commands, web access, and code securely.&lt;/p&gt;

&lt;p&gt;The cleanest architecture splits responsibilities into five layers: LLM, Memory, Tooling, Routing, and Observability. That split matches the capabilities exposed by major provider APIs, by MCP, by self-hosted runtimes such as vLLM and llama.cpp, and by real assistant systems such as &lt;a href="https://www.glukhov.org/ai-systems/openclaw/" rel="noopener noreferrer"&gt;OpenClaw&lt;/a&gt; and Hermes.&lt;/p&gt;

&lt;p&gt;Memory should be treated as more than "longer context". Retrieval systems turn external knowledge into explicit non-parametric memory — the same design space covered in depth by &lt;a href="https://www.glukhov.org/rag/" rel="noopener noreferrer"&gt;Retrieval-Augmented Generation (RAG)&lt;/a&gt; — and both Anthropic's context guidance and the "Lost in the Middle" paper warn that merely cramming more tokens into context does not guarantee reliable recall.&lt;/p&gt;

&lt;p&gt;Tool use is a contract boundary, not magic. OpenAI function calling, Anthropic tool use, and MCP all rely on the same pattern: the model emits a structured request, some runtime executes it, and the result flows back into the conversation. If that boundary is sloppy, the assistant becomes sloppy.&lt;/p&gt;

&lt;p&gt;My bias is simple: start boring. One orchestrator, one durable memory path, one trace per request, and one explicit policy for tool execution. Multi-agent graphs are useful, but only after you can explain your single-agent failure cases without guessing.&lt;/p&gt;

&lt;h2&gt;
  
  
  What an AI assistant system is
&lt;/h2&gt;

&lt;p&gt;A practical definition is this: an AI assistant system is a runtime that turns user intent into a response or action by combining a model interface, context assembly, tool execution, state management, and telemetry. That is why the useful docs are not just model cards. The useful docs are API references, tool contracts, retrieval guides, routing docs, and tracing docs. OpenAI's Responses API exposes stateful interactions, built-in tools, and function calling. Anthropic's Claude API exposes direct Messages access as well as Managed Agents. OpenClaw and Hermes go one step further and show what happens when you put those capabilities behind persistent gateways, channels, sessions, and memory.&lt;/p&gt;

&lt;p&gt;In other words, an assistant system has a broader contract than a chat completion. A good internal contract looks something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AssistantRequest  = user intent + identity + session + attachments + policy
AssistantResponse = answer + actions + citations + state changes + trace id
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That contract matters because every production disagreement eventually reduces to one of these questions: what context was visible, which tool executed, which model answered, which memory was read or written, and where the trace says the system spent time. OpenTelemetry defines traces as the path of a request through an application, which is exactly the abstraction serious assistants need. LangSmith and OpenLIT then specialise that idea for LLMs, tools, vector stores, and agent workflows.&lt;/p&gt;
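
&lt;p&gt;The contract above can be written down as two plain data types. This is only a sketch: the field names and types are assumptions chosen to mirror the terms in the contract, not an interface taken from any provider or runtime.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from uuid import uuid4


@dataclass
class AssistantRequest:
    intent: str                                       # what the user asked for
    user_id: str                                      # identity
    session_id: str                                   # session
    attachments: list = field(default_factory=list)   # files, images, links
    policy: dict = field(default_factory=dict)        # e.g. which tools are allowed


@dataclass
class AssistantResponse:
    answer: str
    actions: list = field(default_factory=list)        # tool calls that ran
    citations: list = field(default_factory=list)      # sources shown to the user
    state_changes: list = field(default_factory=list)  # memory reads and writes
    trace_id: str = field(default_factory=lambda: uuid4().hex)
```

&lt;p&gt;Making the trace id part of the response type means every answer can be tied back to the request path that produced it.&lt;/p&gt;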

&lt;h2&gt;
  
  
  Core components and interfaces
&lt;/h2&gt;

&lt;p&gt;The component split below is the one I find most durable. It is also the split that lines up best with the official APIs and the open-source runtimes people actually operate.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Main responsibility&lt;/th&gt;
&lt;th&gt;Typical interface&lt;/th&gt;
&lt;th&gt;Example technologies&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;LLM layer&lt;/td&gt;
&lt;td&gt;Reason, generate, decide, emit structured calls&lt;/td&gt;
&lt;td&gt;Responses API, Messages API, OpenAI-compatible or Anthropic-compatible endpoints&lt;/td&gt;
&lt;td&gt;OpenAI, Anthropic, vLLM, llama.cpp, Ollama&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory layer&lt;/td&gt;
&lt;td&gt;Hold session state, durable notes, and searchable knowledge&lt;/td&gt;
&lt;td&gt;embeddings, vector search, memory read/write tools, retrieval APIs&lt;/td&gt;
&lt;td&gt;OpenAI embeddings and vector stores, Pinecone, Weaviate, pgvector, Milvus, Hermes memory, OpenClaw memory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tooling layer&lt;/td&gt;
&lt;td&gt;Read data and perform actions outside the model&lt;/td&gt;
&lt;td&gt;JSON-schema tools, MCP tools, file and web search, native runtime tools&lt;/td&gt;
&lt;td&gt;OpenAI function calling, Anthropic tool use, MCP, LangChain tools, LlamaIndex query tools&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Routing layer&lt;/td&gt;
&lt;td&gt;Choose model, backend, policy, and tenant path&lt;/td&gt;
&lt;td&gt;model aliases, failover groups, health checks, budgets, channel bindings&lt;/td&gt;
&lt;td&gt;LiteLLM, OpenClaw multi-agent routing, Hermes provider runtime resolution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Observability&lt;/td&gt;
&lt;td&gt;Explain what happened and why&lt;/td&gt;
&lt;td&gt;traces, spans, logs, metrics, eval runs&lt;/td&gt;
&lt;td&gt;OpenTelemetry, LangSmith, OpenLIT&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The table above is derived from the official provider interfaces, MCP, vector database docs, and the runtime docs for vLLM, llama.cpp, OpenClaw, and Hermes.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;LLM layer&lt;/strong&gt; should do three things well: consume a current working context, emit either a final answer or a structured action request, and return enough metadata to support retries and tracing. OpenAI's Responses API is explicitly designed for stateful interactions plus built-in tools and function calling. Anthropic's Messages API exposes the same core loop through &lt;code&gt;tool_use&lt;/code&gt; blocks and &lt;code&gt;tool_result&lt;/code&gt; returns, while Managed Agents gives you a hosted harness if you do not want to build the loop yourself. Self-hosted runtimes such as vLLM and llama.cpp matter because they preserve familiar provider-style interfaces while letting you place inference inside your own environment.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Memory layer&lt;/strong&gt; should be split mentally into three buckets: working memory, durable symbolic memory, and searchable semantic memory. OpenAI embeddings return vectors that can be indexed and searched; OpenAI Retrieval and File Search then layer semantic and keyword search on top of vector stores. Pinecone, Weaviate, pgvector, and Milvus represent four common storage shapes: fully managed, open-source vector-native, Postgres-native, and distributed vector database. Hermes and OpenClaw add a useful reminder that not all memory belongs in a vector store: file-backed notes, reviewed promotions, and session-scoped snapshots are often the more honest design — patterns unpacked in &lt;a href="https://www.glukhov.org/ai-systems/hermes/hermes-agent-memory-system/" rel="noopener noreferrer"&gt;Hermes Agent Memory System&lt;/a&gt; for bounded core memory and frozen session snapshots.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Tooling layer&lt;/strong&gt; is where an assistant stops being a summariser and starts being software. OpenAI function calling treats tools as schema-defined functionality the model may decide to invoke. Anthropic says the same thing more explicitly: tool use is a contract between your application and the model, and the model never executes anything on its own. MCP generalises that contract into a client-server protocol where hosts connect to one or more servers that expose tools, prompts, and resources — the same boundary described step by step in &lt;a href="https://www.glukhov.org/ai-systems/mcp/mcp-server-in-go/" rel="noopener noreferrer"&gt;MCP Server in Go&lt;/a&gt;. LangChain and LlamaIndex sit comfortably here as orchestration libraries: LangChain focuses on a prebuilt agent architecture and integrations, while LlamaIndex focuses on context-augmented data access, query engines, and workflows.&lt;/p&gt;
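
&lt;p&gt;The contract boundary can be sketched without any provider SDK. In the example below the model is assumed to emit a JSON object with &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;arguments&lt;/code&gt;; that shape, the &lt;code&gt;TOOLS&lt;/code&gt; registry, and the &lt;code&gt;add&lt;/code&gt; tool are illustrative assumptions rather than the wire format of OpenAI, Anthropic, or MCP.&lt;/p&gt;

```python
import json

TOOLS = {}  # hypothetical registry: tool name to (required argument names, function)


def tool(name, required):
    """Register a function as a callable tool with its required arguments."""
    def register(fn):
        TOOLS[name] = (tuple(required), fn)
        return fn
    return register


@tool("add", required=("a", "b"))
def add(a, b):
    return a + b


def execute_tool_call(raw_call):
    """Validate a model-emitted call, run it, and always return a result dict."""
    try:
        call = json.loads(raw_call)
    except json.JSONDecodeError:
        return {"ok": False, "error": "call is not valid JSON"}
    if not isinstance(call, dict):
        return {"ok": False, "error": "call must be a JSON object"}

    name = call.get("name")
    args = call.get("arguments", {})
    if not isinstance(name, str) or name not in TOOLS:
        return {"ok": False, "error": f"unknown tool: {name}"}
    if not isinstance(args, dict):
        return {"ok": False, "error": "arguments must be an object"}

    required, fn = TOOLS[name]
    if set(args) != set(required):
        return {"ok": False, "error": f"expected arguments {sorted(required)}"}
    return {"ok": True, "result": fn(**args)}
```

&lt;p&gt;The runtime, not the model, decides whether a call runs. Returning a structured error instead of raising lets the failure flow back into the conversation like any other tool result.&lt;/p&gt;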

&lt;p&gt;The &lt;strong&gt;Routing layer&lt;/strong&gt; exists because "which model?" is never the only question. You also need "which provider path, which tenant, which budget, which latency class, and which fallback?". LiteLLM is useful because its official docs are refreshingly concrete: weighted pick, least-busy, latency-based, cost-based routing, and bounded failovers are all first-class patterns. OpenClaw extends routing upward into channel and agent isolation, while Hermes extends it downward into model slots for main and auxiliary work such as summarisation, context compression, and MCP tool routing. That is the right mental model: the router chooses more than a model, it chooses an execution lane.&lt;/p&gt;
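
&lt;p&gt;A minimal sketch of "choosing an execution lane", assuming each lane bundles a model with a latency class, a cost figure, and a health flag. The &lt;code&gt;Lane&lt;/code&gt; type, its fields, and the selection rule are hypothetical and much simpler than what LiteLLM, OpenClaw, or Hermes implement.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lane:
    name: str            # hypothetical lane label
    model: str           # model or provider alias behind the lane
    latency_class: str   # for example "interactive" or "batch"
    cost_per_1k: float   # illustrative price, not a real quote
    healthy: bool = True


def choose_lane(lanes, latency_class):
    """Pick the cheapest healthy lane for a latency class.

    Falls back to the cheapest healthy lane of any class when no
    healthy lane matches the requested class.
    """
    healthy = [lane for lane in lanes if lane.healthy]
    if not healthy:
        raise RuntimeError("no healthy lanes available")
    matching = [lane for lane in healthy if lane.latency_class == latency_class]
    return min(matching or healthy, key=lambda lane: lane.cost_per_1k)
```

&lt;p&gt;Tenant and budget checks would slot in as further filters before the final &lt;code&gt;min&lt;/code&gt;.&lt;/p&gt;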

&lt;p&gt;The &lt;strong&gt;Observability layer&lt;/strong&gt; is what prevents architecture from turning into folklore. OpenTelemetry gives you the trace abstraction. LangSmith gives you end-to-end visibility over LLM application steps and supports cloud, hybrid, and self-hosted deployment shapes. OpenLIT gives you OpenTelemetry-native AI observability with zero-code and manual instrumentation options, including support for LLMs, agent frameworks, vector databases, and GPUs. For production metrics, traces, and SLO patterns across inference and agent workflows, see &lt;a href="https://www.glukhov.org/observability/observability-for-llm-systems/" rel="noopener noreferrer"&gt;Observability for LLM Systems&lt;/a&gt;. If your assistant has no trace per request, no span per model call, and no event history for tool execution, you do not really have an architecture yet. You have vibes.&lt;/p&gt;
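
&lt;p&gt;To make "one trace per request, one span per model call" concrete, here is a standard-library stand-in. It is not the OpenTelemetry API; the &lt;code&gt;Trace&lt;/code&gt; class and its record fields are assumptions meant only to show the shape of the data you would export.&lt;/p&gt;

```python
import time
import uuid
from contextlib import contextmanager


class Trace:
    """One trace per request; spans are appended as they finish."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self.spans = []

    @contextmanager
    def span(self, name, **attributes):
        start = time.perf_counter()
        record = {"name": name, "attributes": attributes, "error": None}
        try:
            yield record
        except Exception as exc:
            record["error"] = repr(exc)  # keep the failure, then re-raise
            raise
        finally:
            record["duration_s"] = time.perf_counter() - start
            self.spans.append(record)


# Usage: wrap each model call or tool execution in its own span.
trace = Trace()
with trace.span("model_call", model="example-model"):
    pass  # the real model call would go here
```

&lt;p&gt;In a real deployment the same structure maps onto OpenTelemetry spans, which add context propagation and exporters on top.&lt;/p&gt;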

&lt;h2&gt;
  
  
  Capture, enrich, respond
&lt;/h2&gt;

&lt;p&gt;The sequence that keeps showing up in real systems is capture -&amp;gt; enrich -&amp;gt; respond -&amp;gt; record. Different frameworks wrap it differently, but the flow is stable enough to treat as the backbone.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sequenceDiagram
    participant U as User or Channel
    participant G as Gateway or UI
    participant R as Router
    participant M as Memory and Retrieval
    participant L as LLM
    participant T as Tools or MCP
    participant O as Observability

    U-&amp;gt;&amp;gt;G: message, file, or command
    G-&amp;gt;&amp;gt;O: start root trace
    G-&amp;gt;&amp;gt;R: request + identity + session + policy
    R-&amp;gt;&amp;gt;M: load session state and retrieve context
    M--&amp;gt;&amp;gt;R: notes, chunks, metadata
    R-&amp;gt;&amp;gt;L: prompt + context + tool schemas
    L--&amp;gt;&amp;gt;R: answer or tool call
    alt tool call
        R-&amp;gt;&amp;gt;T: execute tool or MCP action
        T--&amp;gt;&amp;gt;R: tool result
        R-&amp;gt;&amp;gt;L: tool result + updated context
        L--&amp;gt;&amp;gt;R: final answer
    end
    R-&amp;gt;&amp;gt;M: persist session changes and memory candidates
    R-&amp;gt;&amp;gt;O: spans, metrics, eval events
    G--&amp;gt;&amp;gt;U: response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;strong&gt;capture&lt;/strong&gt; step is usually more important than it looks. OpenClaw and Hermes both put a persistent gateway in front of the assistant because ingress is not just text entry. It includes channel metadata, identities, authorisation, session boundaries, direct messages, groups, cron ticks, and delivery semantics. If you skip that layer and rely on a raw chat widget abstraction, you will eventually bolt it back on as ad hoc middleware anyway.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;enrich&lt;/strong&gt; step is where mature systems diverge from toy demos. OpenAI Retrieval and File Search make retrieval explicit through vector stores and search calls. LlamaIndex formalises the same pattern through data connectors, indexes, query engines, and workflows. Hermes goes further by splitting the model estate into main and auxiliary slots, offloading work such as compression, summarisation, and routing to smaller or more specialised models. That is a design pattern worth stealing: do not spend your most expensive model tokens on chores.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;respond&lt;/strong&gt; step is not "generate text". It is "close the current loop". If the model can answer directly, it does. If it needs a tool, it emits a structured request. Anthropic's tool-use contract and OpenAI's function-calling guide both make this explicit. The reason this matters architecturally is that outputs now include both language and control flow. Your response object is partly prose and partly runtime plan.&lt;/p&gt;
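
&lt;p&gt;One way to picture "language plus control flow" is a bounded loop that keeps calling the model until it returns prose. The sketch below assumes a hypothetical reply shape (&lt;code&gt;type&lt;/code&gt;, &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;arguments&lt;/code&gt;, &lt;code&gt;text&lt;/code&gt;) and a caller-supplied &lt;code&gt;model&lt;/code&gt; function. Neither matches a specific provider's wire format.&lt;/p&gt;

```python
def run_turn(model, tools, messages, max_steps=4):
    """Loop until the model returns prose; tool requests are executed by the runtime."""
    for _ in range(max_steps):
        reply = model(messages)
        if reply["type"] == "answer":
            return reply["text"], messages
        # Control-flow output: the model asked for a tool, so the runtime runs it.
        result = tools[reply["name"]](**reply["arguments"])
        messages = messages + [
            {"role": "tool", "name": reply["name"], "content": result}
        ]
    raise RuntimeError("tool loop exceeded max_steps")
```

&lt;p&gt;The &lt;code&gt;max_steps&lt;/code&gt; bound is a design choice rather than a requirement of any API: without it, a confused model can keep requesting tools indefinitely.&lt;/p&gt;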

&lt;p&gt;The &lt;strong&gt;record&lt;/strong&gt; step is where consistency semantics show up. Pinecone separates write and read paths and processes writes after durable acknowledgement. Hermes memory is injected as a frozen snapshot per session so it can preserve prefix-cache performance, which means new writes do not automatically appear in the current session prompt. OpenClaw's Dreaming system only promotes reviewed, grounded candidates into &lt;code&gt;MEMORY.md&lt;/code&gt;, and it is opt-in rather than always-on. The practical lesson is that memory is rarely truly read-after-write across every layer. You need to design for staged visibility.&lt;/p&gt;
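
&lt;p&gt;Staged visibility is easier to reason about with a toy model. The sketch below is not Hermes or OpenClaw code. It assumes a hypothetical &lt;code&gt;SessionMemory&lt;/code&gt; class that freezes a snapshot at session start, so writes are durable immediately but only reach the prompt in the next session.&lt;/p&gt;

```python
class SessionMemory:
    """Durable store plus a per-session frozen snapshot (hypothetical names)."""

    def __init__(self, durable):
        self._durable = list(durable)
        self._snapshot = ()

    def start_session(self):
        # Freeze what the prompt will see for the whole session.
        self._snapshot = tuple(self._durable)

    def write(self, fact):
        # Writes are durable immediately but not visible in the current prompt.
        self._durable.append(fact)

    def prompt_memory(self):
        """Return the memory lines that would be injected into the prompt."""
        return list(self._snapshot)
```

&lt;p&gt;Calling &lt;code&gt;write&lt;/code&gt; mid-session leaves &lt;code&gt;prompt_memory&lt;/code&gt; unchanged until &lt;code&gt;start_session&lt;/code&gt; runs again, which is exactly the gap between "memory updated" and "memory visible".&lt;/p&gt;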

&lt;h2&gt;
  
  
  OpenClaw and Hermes as reference systems
&lt;/h2&gt;

&lt;p&gt;OpenClaw and Hermes are useful reference cases because they are not just wrappers around one provider API. Both present an assistant as a long-running system with gateways, sessions, tools, memory, and multiple model backends.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Architectural concern&lt;/th&gt;
&lt;th&gt;OpenClaw mapping&lt;/th&gt;
&lt;th&gt;Hermes mapping&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Ingress and surfaces&lt;/td&gt;
&lt;td&gt;Self-hosted gateway connecting chat apps and channel surfaces&lt;/td&gt;
&lt;td&gt;Single background messaging gateway connecting many external platforms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Orchestration&lt;/td&gt;
&lt;td&gt;Gateway-centric control plane for channels and AI interactions&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;AIAgent&lt;/code&gt; loop handling prompt assembly, provider selection, tool dispatch, retries, and failover&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Routing&lt;/td&gt;
&lt;td&gt;Multi-agent routing binds inbound traffic to isolated agents with separate workspaces and sessions&lt;/td&gt;
&lt;td&gt;Main and auxiliary model slots split core reasoning from compression, summarisation, approvals, and MCP routing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory&lt;/td&gt;
&lt;td&gt;File-backed memory plus optional active memory and background Dreaming promotion&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;MEMORY.md&lt;/code&gt; and &lt;code&gt;USER.md&lt;/code&gt; injected as a frozen session snapshot, plus external memory providers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tooling and extension&lt;/td&gt;
&lt;td&gt;Built-in tools, session tools, provider plugins, custom and self-hosted endpoints&lt;/td&gt;
&lt;td&gt;40+ tools, built-in MCP client, toolsets, skills, and memory-provider plugins&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This mapping is grounded in the official OpenClaw and Hermes docs and repos. OpenClaw documents a gateway architecture, multi-agent routing, custom and self-hosted provider support including vLLM and Ollama, optional active memory, and Dreaming-based promotion. Hermes documents a messaging gateway, a central &lt;code&gt;AIAgent&lt;/code&gt; loop, main and auxiliary model slots, built-in memory, and native MCP integration.&lt;/p&gt;

&lt;p&gt;My slightly opinionated read is that both systems make the same architectural argument in different accents. OpenClaw is strongly gateway-first. Hermes is strongly agent-loop-first. But both reject the shallow idea that an assistant is just "prompt plus model". They model channels, identities, memory semantics, tool surfaces, and backend heterogeneity as first-class concerns. That is exactly what a production architecture should do.&lt;/p&gt;

&lt;p&gt;A practical hybrid stack inspired by both systems looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;edge&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;gateway&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hermes or openclaw&lt;/span&gt;

&lt;span class="na"&gt;routing&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;proxy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;litellm&lt;/span&gt;
  &lt;span class="na"&gt;policy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;latency and budget aware&lt;/span&gt;
  &lt;span class="na"&gt;tenancy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;session and channel scoped&lt;/span&gt;

&lt;span class="na"&gt;llm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;primary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai responses or anthropic messages&lt;/span&gt;
  &lt;span class="na"&gt;local_fallback&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vllm&lt;/span&gt;
  &lt;span class="na"&gt;local_dev&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ollama or llama.cpp&lt;/span&gt;

&lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;session&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sqlite or postgres&lt;/span&gt;
  &lt;span class="na"&gt;semantic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pgvector or weaviate&lt;/span&gt;
  &lt;span class="na"&gt;embeddings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai embeddings or ollama embeddings&lt;/span&gt;

&lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;contract&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;json schema tools plus mcp&lt;/span&gt;
  &lt;span class="na"&gt;examples&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;filesystem, browser, web search, internal APIs&lt;/span&gt;

&lt;span class="na"&gt;observability&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;traces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;opentelemetry&lt;/span&gt;
  &lt;span class="na"&gt;ai_dashboards&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openlit or langsmith&lt;/span&gt;
  &lt;span class="na"&gt;evals&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai evals plus app-specific regression sets&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That stack is a reasoned deployment pattern rather than a vendor-prescribed blueprint. It works because the official interfaces line up: OpenAI and Anthropic expose tool-oriented APIs, vLLM and llama.cpp emulate provider-style endpoints, Ollama handles local models and embeddings, MCP standardises external tools, LiteLLM handles routing and failover, and OpenTelemetry-compatible platforms can trace the whole path.&lt;/p&gt;

&lt;h2&gt;
  
  
  Patterns, tables, and tradeoffs
&lt;/h2&gt;

&lt;p&gt;There are a few repeatable assistant patterns worth naming. A &lt;strong&gt;managed assistant&lt;/strong&gt; keeps most of the runtime inside provider APIs. A &lt;strong&gt;retrieval-first assistant&lt;/strong&gt; treats memory and search as the main differentiator. A &lt;strong&gt;tool-first assistant&lt;/strong&gt; behaves more like an operator than a chatbot. A &lt;strong&gt;gateway assistant&lt;/strong&gt; prioritises always-on access through messaging surfaces. A &lt;strong&gt;specialist mesh&lt;/strong&gt; decomposes work into multiple agents or routes. Official docs across OpenAI, Anthropic, LlamaIndex, LiteLLM, OpenClaw, and Hermes all support versions of these patterns, even if they name them differently.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;What it optimises for&lt;/th&gt;
&lt;th&gt;Best use case&lt;/th&gt;
&lt;th&gt;Hidden cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Managed assistant&lt;/td&gt;
&lt;td&gt;Speed of delivery&lt;/td&gt;
&lt;td&gt;Internal copilots and support bots&lt;/td&gt;
&lt;td&gt;Provider lock-in and less control over runtime details&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retrieval-first assistant&lt;/td&gt;
&lt;td&gt;Grounded answers over owned data&lt;/td&gt;
&lt;td&gt;Docs, support, knowledge work&lt;/td&gt;
&lt;td&gt;Retrieval quality becomes the real product&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool-first assistant&lt;/td&gt;
&lt;td&gt;Action over conversation&lt;/td&gt;
&lt;td&gt;Ops workflows, data pulls, automations&lt;/td&gt;
&lt;td&gt;Side effects, retries, and approvals become core concerns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gateway assistant&lt;/td&gt;
&lt;td&gt;Ubiquitous access&lt;/td&gt;
&lt;td&gt;Personal and team assistants across chat surfaces&lt;/td&gt;
&lt;td&gt;Identity, session, and security complexity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Specialist mesh&lt;/td&gt;
&lt;td&gt;Division of labour&lt;/td&gt;
&lt;td&gt;Complex workflows with real ownership boundaries&lt;/td&gt;
&lt;td&gt;Harder debugging, orchestration, and eval design&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This pattern table is a synthesis from the provider docs, framework docs, and reference systems rather than a claim from any one vendor.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Option shape&lt;/th&gt;
&lt;th&gt;Typical components&lt;/th&gt;
&lt;th&gt;Strength&lt;/th&gt;
&lt;th&gt;Weakness&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Managed&lt;/td&gt;
&lt;td&gt;OpenAI Responses or Anthropic Managed Agents, hosted file search or vector stores&lt;/td&gt;
&lt;td&gt;Fastest path, fewer moving parts, hosted tools&lt;/td&gt;
&lt;td&gt;Lowest control over data path and runtime semantics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hybrid&lt;/td&gt;
&lt;td&gt;Provider API plus self-hosted router and vector store&lt;/td&gt;
&lt;td&gt;Good balance of speed and control&lt;/td&gt;
&lt;td&gt;More contracts to maintain&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-hosted&lt;/td&gt;
&lt;td&gt;vLLM or llama.cpp or Ollama, MCP, self-hosted vector DB, OTel&lt;/td&gt;
&lt;td&gt;Strong privacy and deployment control&lt;/td&gt;
&lt;td&gt;Highest ops burden, hardware and tuning overhead&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Table notes: OpenAI hosted File Search is a managed tool, Anthropic offers a managed harness, Pinecone is a managed vector service, while vLLM, llama.cpp, Ollama, pgvector, Weaviate, Milvus, LangSmith self-hosted, and OpenLIT all support self-managed or hybrid operation to varying degrees.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Vector store&lt;/th&gt;
&lt;th&gt;Shape&lt;/th&gt;
&lt;th&gt;Why teams choose it&lt;/th&gt;
&lt;th&gt;Watchout&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Pinecone&lt;/td&gt;
&lt;td&gt;Managed vector service&lt;/td&gt;
&lt;td&gt;Strong operational simplicity and scalable managed architecture&lt;/td&gt;
&lt;td&gt;External dependency and managed-service economics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Weaviate&lt;/td&gt;
&lt;td&gt;Open-source vector database&lt;/td&gt;
&lt;td&gt;Vector plus inverted indexes and flexible index choices&lt;/td&gt;
&lt;td&gt;More cluster tuning than a hosted-only path&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pgvector&lt;/td&gt;
&lt;td&gt;Postgres extension&lt;/td&gt;
&lt;td&gt;Keep vectors with relational data and existing SQL stack&lt;/td&gt;
&lt;td&gt;Not the best fit for every high-scale ANN workload&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Milvus&lt;/td&gt;
&lt;td&gt;Distributed vector database&lt;/td&gt;
&lt;td&gt;Purpose-built scale and ecosystem around managed Zilliz Cloud&lt;/td&gt;
&lt;td&gt;Another specialist datastore to operate&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Table notes: Pinecone documents a managed control plane and regional data planes. Weaviate documents vector and inverted indexes with multiple vector index types. pgvector adds exact and approximate nearest-neighbour search to Postgres. Milvus positions itself as an open-source high-performance, scalable vector database, with Zilliz Cloud as the managed option.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;LLM option&lt;/th&gt;
&lt;th&gt;Interface style&lt;/th&gt;
&lt;th&gt;Best at&lt;/th&gt;
&lt;th&gt;Watchout&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI Responses&lt;/td&gt;
&lt;td&gt;Stateful responses plus built-in tools&lt;/td&gt;
&lt;td&gt;Fast start, hosted tools, structured loops&lt;/td&gt;
&lt;td&gt;You inherit platform-specific abstractions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anthropic Messages&lt;/td&gt;
&lt;td&gt;Direct model access with explicit tool-use contract&lt;/td&gt;
&lt;td&gt;Clear tool boundaries and good control in custom loops&lt;/td&gt;
&lt;td&gt;More runtime is your responsibility unless you use Managed Agents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;vLLM&lt;/td&gt;
&lt;td&gt;OpenAI-compatible and Anthropic-compatible self-hosted serving&lt;/td&gt;
&lt;td&gt;High-throughput self-hosted inference&lt;/td&gt;
&lt;td&gt;Real infrastructure and model-serving work&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ollama&lt;/td&gt;
&lt;td&gt;Simple local model and embedding runtime&lt;/td&gt;
&lt;td&gt;Local development and small self-hosted stacks&lt;/td&gt;
&lt;td&gt;Not the same class of serving system as a tuned distributed runtime&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;llama.cpp&lt;/td&gt;
&lt;td&gt;Lightweight local server with provider-compatible routes&lt;/td&gt;
&lt;td&gt;Edge, CPU-first, constrained environments&lt;/td&gt;
&lt;td&gt;You do more manual tuning and capability matching&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Table notes: OpenAI documents Responses as its advanced interface for stateful responses and built-in tools. Anthropic documents the Messages API and the tool-use contract separately from Managed Agents. vLLM exposes an OpenAI-compatible server plus Anthropic Messages API support. Ollama documents local embedding and model workflows. llama.cpp documents OpenAI-compatible chat, responses, and embeddings routes, plus Anthropic-compatible chat completions.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Constraint or tradeoff&lt;/th&gt;
&lt;th&gt;Bias toward managed&lt;/th&gt;
&lt;th&gt;Bias toward self-hosted&lt;/th&gt;
&lt;th&gt;Practical mitigation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Latency&lt;/td&gt;
&lt;td&gt;Often better first iteration and fewer local tuning tasks&lt;/td&gt;
&lt;td&gt;Can win when model and data are colocated and kept warm&lt;/td&gt;
&lt;td&gt;Use routing tiers, hot caches, and smaller auxiliary models&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;Easy to start, variable at token scale&lt;/td&gt;
&lt;td&gt;Better amortisation at steady utilisation&lt;/td&gt;
&lt;td&gt;Measure real traffic before optimising by instinct&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Privacy and residency&lt;/td&gt;
&lt;td&gt;Simpler for non-sensitive data&lt;/td&gt;
&lt;td&gt;Stronger control for sensitive and regulated flows&lt;/td&gt;
&lt;td&gt;Use hybrid boundaries and keep only what must move&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Consistency&lt;/td&gt;
&lt;td&gt;Hosted tools still have staged visibility semantics&lt;/td&gt;
&lt;td&gt;Self-hosted memory pipelines also stage and promote data&lt;/td&gt;
&lt;td&gt;Define read-after-write rules explicitly by layer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scaling&lt;/td&gt;
&lt;td&gt;Less control-plane pain&lt;/td&gt;
&lt;td&gt;Better tailoring for steady, specialised workloads&lt;/td&gt;
&lt;td&gt;Use batching, queueing, and isolated tenants&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Debuggability&lt;/td&gt;
&lt;td&gt;Easy to miss opaque provider internals&lt;/td&gt;
&lt;td&gt;Easy to drown in self-made complexity&lt;/td&gt;
&lt;td&gt;Trace every request and evaluate every route&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This tradeoff matrix is an architectural inference from the official docs, not a vendor benchmark. The consistency row matters more than many blog posts admit: Pinecone separates write and read paths, Hermes freezes memory into session-start prompts, and OpenClaw promotes durable memory through staged review. That means "memory updated" and "memory visible to the current answer" are often different truths.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure modes and mitigations
&lt;/h2&gt;

&lt;p&gt;Most assistants do not fail because the base model is "bad". They fail because the surrounding system lies to the model, starves it of the right context, lets tools drift, or makes debugging impossible.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Where it breaks&lt;/th&gt;
&lt;th&gt;Typical symptom&lt;/th&gt;
&lt;th&gt;Usual cause&lt;/th&gt;
&lt;th&gt;Mitigation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prompt assembly&lt;/td&gt;
&lt;td&gt;Confident but off-target answer&lt;/td&gt;
&lt;td&gt;Too much irrelevant context, poor ordering&lt;/td&gt;
&lt;td&gt;Budget context, rerank, keep key facts near the top&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retrieval&lt;/td&gt;
&lt;td&gt;Correct tone, wrong facts&lt;/td&gt;
&lt;td&gt;Bad chunking, stale index, weak filters&lt;/td&gt;
&lt;td&gt;Evaluate retrieval separately, add metadata filters and hybrid search&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool boundary&lt;/td&gt;
&lt;td&gt;Wrong action or duplicate action&lt;/td&gt;
&lt;td&gt;Loose schemas, retries without idempotency&lt;/td&gt;
&lt;td&gt;Tight schemas, idempotency keys, approval gates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Routing&lt;/td&gt;
&lt;td&gt;Wildly inconsistent behaviour by request&lt;/td&gt;
&lt;td&gt;Cost or latency routing without quality controls&lt;/td&gt;
&lt;td&gt;Add sticky sessions and per-route evals&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory&lt;/td&gt;
&lt;td&gt;Stale or poisoned recall&lt;/td&gt;
&lt;td&gt;Over-eager writes, weak review, cross-session leakage&lt;/td&gt;
&lt;td&gt;Separate working and durable memory, review promotions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Observability&lt;/td&gt;
&lt;td&gt;No idea what happened&lt;/td&gt;
&lt;td&gt;Missing traces or no span granularity&lt;/td&gt;
&lt;td&gt;Emit root and subspans for retrieval, model, and tool calls&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hallucination control&lt;/td&gt;
&lt;td&gt;Plausible but unsupported claims&lt;/td&gt;
&lt;td&gt;Weak grounding or no validation pass&lt;/td&gt;
&lt;td&gt;Reference-doc validation, self-consistency checks, eval gates&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The evidence base for this table is broad but consistent. Anthropic's tool docs make clear that tool use is a contract boundary. OpenAI Guardrails includes hallucination detection against a reference knowledge base via File Search. SelfCheckGPT shows that self-consistency across samples can help detect unsupported claims. The "Lost in the Middle" results and Anthropic's context guidance both reinforce the same operational lesson: more tokens do not remove the need for context curation.&lt;/p&gt;

&lt;p&gt;The preferred mitigation stack is boring and repetitive: trace every request, version prompts, evaluate retrieval independently, keep tools idempotent, and run regression evals before you change routes or memory policy. OpenAI's Evals docs and repo are blunt about why: without evals, it is hard and time-consuming to understand how model or prompt changes affect your use case. That applies just as much to routers and retrieval as it does to prompts.&lt;/p&gt;
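
&lt;p&gt;"Keep tools idempotent" is the item on that list that is easiest to sketch. The example below assumes a hypothetical &lt;code&gt;IdempotentExecutor&lt;/code&gt; with an in-memory result cache keyed by request ID, tool name, and arguments. A production version would need a durable store and an expiry policy.&lt;/p&gt;

```python
import hashlib
import json


class IdempotentExecutor:
    """Run each logical tool call at most once, even if the caller retries."""

    def __init__(self):
        self._results = {}

    @staticmethod
    def key_for(tool_name, arguments, request_id):
        """Derive a stable idempotency key from the request and its arguments."""
        payload = json.dumps([request_id, tool_name, arguments], sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def run(self, tool_name, arguments, request_id, handler):
        key = self.key_for(tool_name, arguments, request_id)
        if key not in self._results:
            # First attempt: perform the side effect and remember the outcome.
            self._results[key] = handler(**arguments)
        # Retries with the same key get the stored result, not a second side effect.
        return self._results[key]
```

&lt;p&gt;A retry with the same request ID returns the stored result, while a new request ID runs the handler again. That is the behaviour you want when a router or gateway retries after a timeout.&lt;/p&gt;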

&lt;h2&gt;
  
  
  More reading
&lt;/h2&gt;

&lt;p&gt;If you want to go deeper, these are the most useful primary sources to keep open while designing or reviewing an assistant architecture.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;OpenAI: Responses Overview, Function Calling, Using Tools, Retrieval, File Search, Evals, and MCP for remote tool servers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Anthropic: API Overview, Tool Use, the tool-use contract, Managed Agents, Context Windows, and the MCP connector.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;MCP itself: the Architecture Overview and Specification are worth reading directly, because they explain hosts, clients, servers, tools, prompts, resources, transports, and capability negotiation cleanly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Frameworks and routing: LangChain Overview, LlamaIndex context-augmentation docs, LiteLLM routing docs, LangSmith observability docs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Self-hosted runtimes and assistant systems: vLLM, llama.cpp server, Ollama embeddings, OpenClaw docs and repo, Hermes docs and repo.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Storage and observability: Pinecone, Weaviate, pgvector, Milvus, OpenTelemetry, OpenLIT.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Research papers: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, Lost in the Middle, and SelfCheckGPT.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>hermes</category>
      <category>openclaw</category>
      <category>architecture</category>
      <category>llm</category>
    </item>
    <item>
      <title>Memory Systems in AI Assistants</title>
      <dc:creator>Rost</dc:creator>
      <pubDate>Tue, 16 Jun 2026 12:20:17 +0000</pubDate>
      <link>https://dev.to/rosgluk/memory-systems-in-ai-assistants-3gf0</link>
      <guid>https://dev.to/rosgluk/memory-systems-in-ai-assistants-3gf0</guid>
      <description>&lt;p&gt;Memory turns assistants from reactive to persistent, but it is also where many systems quietly rot. Surveys argue the short-term versus long-term split is no longer enough for modern agent memory; OpenAI and LangGraph SDKs point to a simpler stack — working memory, durable state, and retrieval.&lt;/p&gt;

&lt;p&gt;Assistants need working memory for the current run, durable state for stable facts and preferences, and retrieval memory for relevant supporting context. My slightly opinionated view is that structured state is underused, vector retrieval is overused, and most memory failures come from promotion and injection policy rather than storage choice.&lt;/p&gt;

&lt;p&gt;The other important point is that memory does not automatically fix long context. LoCoMo shows that very long-term conversational recall remains hard, and "Lost in the Middle" shows that simply throwing more tokens at the model can degrade performance when relevant information lands in the middle of the prompt. Good memory systems are selective, layered, and explicit about precedence.&lt;/p&gt;

&lt;p&gt;This guide sits in the &lt;a href="https://www.glukhov.org/ai-systems/memory/" rel="noopener noreferrer"&gt;AI Systems Memory hub&lt;/a&gt; as the cross-framework map for the memory layer inside &lt;a href="https://www.glukhov.org/ai-systems/architecture/ai-assistant-architecture/" rel="noopener noreferrer"&gt;AI Assistant Architecture&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to think about assistant memory
&lt;/h2&gt;

&lt;p&gt;Assistant memory is not the same problem as PKM, wikis, or standalone RAG pipelines — &lt;a href="https://www.glukhov.org/knowledge-management/foundations/pkm-vs-rag-vs-wiki-vs-memory-systems/" rel="noopener noreferrer"&gt;PKM vs RAG vs Wiki vs Memory Systems&lt;/a&gt; maps those paradigms at the knowledge-architecture level. This guide stays one layer down, in the runtime contracts assistants actually implement.&lt;/p&gt;

&lt;p&gt;The cleanest way to think about memory is not as "chat history", but as a set of storage contracts with different jobs. One store preserves the active thread. Another store keeps durable user state. Another supports semantic lookup over documents or past interactions. OpenAI's memory guidance for personalisation makes this explicit by separating global and session memory, while LangGraph separates thread-level persistence from long-term stores across conversations.&lt;/p&gt;

&lt;p&gt;Memory matters because production assistants repeat work, revisit goals, and operate across days or weeks. Generative Agents popularised the pattern of storing experiences, reflecting on them, and retrieving them dynamically for future planning. MemGPT pushed that further by modelling memory as tiers and movement between fast and slow stores. More recent systems such as A-MEM and Mem0 focus on linking, consolidation, and deployment efficiency rather than just recall volume.&lt;/p&gt;

&lt;h2&gt;
  
  
  Types of memory
&lt;/h2&gt;

&lt;p&gt;Production assistants typically need three cooperating layers: short-term working memory, long-term retrieval memory, and structured memory. The sections below explain how each behaves in real systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  Short-term memory
&lt;/h3&gt;

&lt;p&gt;Short-term memory is the working context of the current conversation or run. OpenAI Sessions automatically prepend conversation history before each run and append new items after each run. LangGraph implements the same idea as thread-level persistence through a checkpointer. This layer keeps local coherence, but it is also the first thing that explodes when tool results, file reads, or long chats pile up.&lt;/p&gt;

&lt;h3&gt;
  
  
  Long-term retrieval memory
&lt;/h3&gt;

&lt;p&gt;Long-term retrieval memory stores items that are looked up when relevant rather than replayed every turn. That overlaps with &lt;a href="https://www.glukhov.org/rag/" rel="noopener noreferrer"&gt;RAG&lt;/a&gt; as a retrieval technique, but it is not the whole assistant memory story — wikis and PKM corpora often feed the index while structured state and session memory live elsewhere, as the PKM/RAG/wiki/memory comparison above makes clear. In classical RAG, the model combines parametric memory with non-parametric memory such as a dense vector index. Self-RAG improves on naive retrieval by making retrieval on-demand rather than fixed for every request. In practical assistant systems, this is usually the vector store or searchable transcript layer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Structured memory
&lt;/h3&gt;

&lt;p&gt;Structured memory stores durable facts, preferences, or constraints in explicit fields with precedence rules. OpenAI's personalisation cookbook is unusually clear here. Global and session memory have different roles, the latest user instruction wins, session memory can override global memory for the current task, and memory that conflicts with current user intent should trigger clarification rather than silent obedience. This is why structured state is often better than retrieval for stable preferences, policies, or standing constraints.&lt;/p&gt;
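
&lt;p&gt;The precedence rule can be written down in a few lines. This is a sketch under stated assumptions, not the cookbook's code: the function name &lt;code&gt;resolve_preference&lt;/code&gt; and the plain-dict stores are hypothetical, and the clarification step for conflicting intent is left out.&lt;/p&gt;

```python
def resolve_preference(key, global_memory, session_memory, current_instruction):
    """Resolve one field by precedence: current instruction, then session, then global."""
    for source, store in (
        ("instruction", current_instruction),
        ("session", session_memory),
        ("global", global_memory),
    ):
        if key in store:
            # Return the value together with its source so the choice is auditable.
            return store[key], source
    return None, "unset"
```

&lt;p&gt;Returning the source alongside the value makes it possible to log why a preference was applied, which helps when a user asks why the assistant ignored a stored default.&lt;/p&gt;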

&lt;h2&gt;
  
  
  Retrieval mechanics
&lt;/h2&gt;

&lt;p&gt;A typical retrieval flow has five steps: capture, encode, search, rerank or filter, then inject. Pinecone, Weaviate, Qdrant, Redis, and Milvus all document variants of this pattern. Some support dense vectors only, others support hybrid retrieval that combines semantic and lexical search, and some expose metadata filters or namespaces for tenancy and scope control. The engineering point is straightforward. Retrieval quality depends as much on filtering, chunking, and ranking strategy as on the embedding model itself.&lt;/p&gt;

&lt;p&gt;Hybrid retrieval is usually the sensible default when queries mix meaning and exact terms. Weaviate documents hybrid search with an &lt;code&gt;alpha&lt;/code&gt; parameter balancing vector and keyword components, Qdrant supports hybrid and multi-stage queries through its Query API and score-fusion methods, and Milvus describes dense, sparse, and hybrid retrieval in the same system. That matters for assistants because users often ask for both approximate meaning and exact identifiers, file names, revision numbers, or product codes. When the lexical side lives in Postgres or Elasticsearch rather than inside the vector database, &lt;a href="https://www.glukhov.org/data-infrastructure/search/postgresql-full-text-search-vs-elasticsearch/" rel="noopener noreferrer"&gt;PostgreSQL full text search vs Elasticsearch&lt;/a&gt; helps you choose where keyword search should run in production.&lt;/p&gt;
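
&lt;p&gt;To show what an &lt;code&gt;alpha&lt;/code&gt;-style blend does, here is a small sketch that min-max normalises two score sets and mixes them. It is an illustration of weighted score fusion in general, not the exact algorithm of Weaviate, Qdrant, or Milvus. In this sketch an &lt;code&gt;alpha&lt;/code&gt; of 1.0 means vector scores only and 0.0 means keyword scores only, and the function names are made up.&lt;/p&gt;

```python
def normalise(scores):
    """Min-max scale scores into [0, 1]; a constant set maps to all ones."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc: 1.0 for doc in scores}
    return {doc: (value - low) / (high - low) for doc, value in scores.items()}


def hybrid_rank(vector_scores, keyword_scores, alpha=0.5):
    """Blend scores: alpha=1.0 is vector only, alpha=0.0 is keyword only."""
    vec, kw = normalise(vector_scores), normalise(keyword_scores)
    docs = set(vec) | set(kw)
    fused = {d: alpha * vec.get(d, 0.0) + (1 - alpha) * kw.get(d, 0.0) for d in docs}
    # Sort by fused score descending, then by document id for stable ties.
    return sorted(fused, key=lambda d: (-fused[d], d))
```

&lt;p&gt;Normalising first matters because vector similarities and keyword scores such as BM25 live on different scales. Without it, one side silently dominates the blend.&lt;/p&gt;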

&lt;p&gt;One more opinionated point: retrieval should not decide policy. It should supply candidates. The assistant still needs structured rules for precedence, privacy, recency, and conflict resolution. OpenAI's state-based memory example makes this explicit, and it is a much healthier pattern than pretending similarity search alone can resolve contradictory user state.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common issues
&lt;/h2&gt;

&lt;p&gt;The most common failure is stale or contradictory memory. OpenAI's long-term memory cookbook calls memory consolidation the most sensitive and error-prone stage, listing context poisoning, memory loss, duplicate memories, and contradiction handling as core concerns. That is correct, and it is where many assistants fail quietly. They remember too much, too early, and without a rule for forgetting.&lt;/p&gt;

&lt;p&gt;The second failure is context overload. LangGraph warns that long conversations can exceed the LLM context window and recommends trimming, deletion, summarisation, or checkpoint management. OpenClaw similarly prunes old tool outputs from in-memory context while preserving the full on-disk transcript. These are not optional optimisations. They are required if your assistant reads, searches, or executes anything non-trivial.&lt;/p&gt;
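
&lt;p&gt;Trimming is the simplest of those tactics. The sketch below keeps system messages and the newest turns that fit a budget. It is not LangGraph's or OpenClaw's code: the message shape, the function name &lt;code&gt;trim_history&lt;/code&gt;, and the word-count stand-in for a real tokenizer are all assumptions.&lt;/p&gt;

```python
def trim_history(messages, max_tokens, count_tokens=lambda text: len(text.split())):
    """Keep system messages plus the newest turns that fit in max_tokens."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(count_tokens(m["content"]) for m in system)
    kept = []
    for message in reversed(turns):  # walk from newest to oldest
        cost = count_tokens(message["content"])
        if cost > budget:
            break  # stop at the first turn that does not fit
        kept.append(message)
        budget -= cost
    return system + list(reversed(kept))
```

&lt;p&gt;In practice you would pass the model's own tokenizer as &lt;code&gt;count_tokens&lt;/code&gt; and summarise the dropped turns rather than discard them outright.&lt;/p&gt;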

&lt;p&gt;The third failure is assuming long context equals reliable recall. LoCoMo shows that long-term conversational memory is still difficult, and "Lost in the Middle" shows position sensitivity inside long prompts. If memory is important, do not rely on brute-force prompt stuffing. Use compaction, retrieval, and explicit state.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tradeoffs
&lt;/h2&gt;

&lt;p&gt;The vector database layer is where many assistant teams make early platform bets. The comparison below focuses on documented product characteristics that matter for assistant memory design.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;System&lt;/th&gt;
&lt;th&gt;What stands out&lt;/th&gt;
&lt;th&gt;Best fit&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Pinecone&lt;/td&gt;
&lt;td&gt;Managed vector database with integrated embedding, reranking, metadata filters, namespaces, and support for dense, sparse, and BM25-style full-text in one schema&lt;/td&gt;
&lt;td&gt;Teams that want managed retrieval with minimal infra&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Weaviate&lt;/td&gt;
&lt;td&gt;Open-source vector database storing objects and vectors, with semantic and hybrid search and strong RAG positioning&lt;/td&gt;
&lt;td&gt;Teams that want open-source flexibility with hybrid retrieval&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qdrant&lt;/td&gt;
&lt;td&gt;AI-native vector search with filtering, hybrid and multi-stage queries, plus an embedded offline-capable Edge mode&lt;/td&gt;
&lt;td&gt;Teams that want search control, edge deployment, or strong filtering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pgvector&lt;/td&gt;
&lt;td&gt;Vector similarity search inside Postgres, with exact and approximate search plus ACID, JOINs, and recovery features&lt;/td&gt;
&lt;td&gt;Teams already standardised on Postgres and relational data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Milvus&lt;/td&gt;
&lt;td&gt;Cloud-native vector database with disaggregated storage and compute, plus dense, sparse, and hybrid retrieval&lt;/td&gt;
&lt;td&gt;Large-scale retrieval workloads and distributed deployments&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Once you pick a backend, operating it is a &lt;a href="https://www.glukhov.org/data-infrastructure/" rel="noopener noreferrer"&gt;data infrastructure&lt;/a&gt; problem. Postgres with pgvector keeps session metadata and vectors on one stack, while &lt;a href="https://www.glukhov.org/data-infrastructure/databases/neo4j/" rel="noopener noreferrer"&gt;Neo4j&lt;/a&gt; fits when retrieval memory is graph-shaped rather than flat chunks.&lt;/p&gt;

&lt;p&gt;The latency and cost pattern below is a design synthesis based on the operational models described in OpenAI Sessions and compaction guidance, LangGraph memory management, OpenAI state-based memory, and the documented retrieval behaviour of Redis and vector stores. It is intentionally qualitative, because real numbers depend on corpus size, embedding model, network placement, and caching.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Memory tactic&lt;/th&gt;
&lt;th&gt;Read latency&lt;/th&gt;
&lt;th&gt;Write latency&lt;/th&gt;
&lt;th&gt;Token cost pressure&lt;/th&gt;
&lt;th&gt;Infra cost&lt;/th&gt;
&lt;th&gt;When it is worth it&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Raw session history&lt;/td&gt;
&lt;td&gt;Lowest&lt;/td&gt;
&lt;td&gt;Lowest&lt;/td&gt;
&lt;td&gt;Highest&lt;/td&gt;
&lt;td&gt;Lowest&lt;/td&gt;
&lt;td&gt;Simple multi-turn chat and short runs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Summary or compaction memory&lt;/td&gt;
&lt;td&gt;Low to medium&lt;/td&gt;
&lt;td&gt;Medium, because summarisation itself is a model step&lt;/td&gt;
&lt;td&gt;Medium to low&lt;/td&gt;
&lt;td&gt;Low to medium&lt;/td&gt;
&lt;td&gt;Long-running work where the active run must continue&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Structured profile and state&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Durable preferences, rules, and standing constraints&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vector or hybrid retrieval&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Low to medium&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Large corpora, searchable history, document grounding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full replay of everything&lt;/td&gt;
&lt;td&gt;High and increasingly unstable&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Highest&lt;/td&gt;
&lt;td&gt;Low infra, high model spend&lt;/td&gt;
&lt;td&gt;Almost never, except tiny corpora and debugging&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Implementation examples
&lt;/h2&gt;

&lt;p&gt;OpenAI's current stack gives two useful reference patterns. The first is Sessions for short-term continuity across runs. The second is state-based long-term memory, where structured profile fields and global memory notes are injected at session start, session notes are distilled during the run, and a consolidation step promotes only durable items into global memory. That inject → reason → distill → consolidate loop is one of the clearest public memory patterns available right now.&lt;/p&gt;

&lt;p&gt;LangGraph provides a similar but framework-agnostic split. Checkpointers handle short-term thread memory and stores handle long-term search across conversations. The store can be searched inside nodes at runtime, which makes it a good reference design for assistants that need explicit orchestration rather than hidden framework magic.&lt;/p&gt;

&lt;p&gt;Hermes is a useful public example of layered memory in the wild. Its built-in memory uses &lt;code&gt;MEMORY.md&lt;/code&gt;, &lt;code&gt;USER.md&lt;/code&gt;, and SQLite FTS5 session search, while external provider plugins add graph memory, semantic retrieval, automatic fact extraction, and user modelling. The full mechanics are documented in &lt;a href="https://www.glukhov.org/ai-systems/hermes/hermes-agent-memory-system/" rel="noopener noreferrer"&gt;Hermes Agent Memory System&lt;/a&gt;, and the eight pluggable backends are compared in &lt;a href="https://www.glukhov.org/ai-systems/memory/agent-memory-providers/" rel="noopener noreferrer"&gt;Agent memory providers compared&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;OpenClaw offers a different take, with session pruning, optional active memory that runs before the main reply, and an opt-in Dreaming system for background memory consolidation. Those examples are worth paying attention to because they treat memory as an operational subsystem, not just a retrieval trick. For how OpenClaw maps onto the wider five-layer assistant stack, see the &lt;a href="https://www.glukhov.org/ai-systems/openclaw/" rel="noopener noreferrer"&gt;OpenClaw system overview&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Research prototypes point in the same direction. MemGPT uses hierarchical memory tiers and control flow for context management, A-MEM uses dynamic indexing and linking inspired by Zettelkasten, and Mem0 reports better accuracy with much lower p95 latency and token cost than full-context baselines on LoCoMo. You do not need to copy these systems wholesale, but their shared lesson is clear. Memory quality comes from selection and organisation, not from storing everything forever.&lt;/p&gt;

&lt;h2&gt;
  
  
  When memory helps versus hurts
&lt;/h2&gt;

&lt;p&gt;Memory helps when the assistant repeatedly encounters stable preferences, durable constraints, reusable workflow lessons, or large external corpora that cannot fit in a prompt. OpenAI's reliable agents guide makes the distinction well. Compaction helps the current long-running run continue, while memory helps future runs reuse workflow lessons. That is the right mental model for most business assistants.&lt;/p&gt;

&lt;p&gt;Memory hurts when the task is one-shot, the user state changes often, the retrieval index is noisy, or the system cannot reconcile conflicts. OpenAI's travel-memory example warns that session memory should not automatically become global memory, and it explicitly states that memory is not a security boundary. If your assistant treats every recalled string as truth, you have built a confusion engine, not a memory system.&lt;/p&gt;

&lt;h2&gt;
  
  
  A selective memory loop
&lt;/h2&gt;

&lt;p&gt;The simplest robust memory loop is selective and staged. Load durable state, retrieve supporting context, answer, capture only candidate memories, then consolidate later. Both OpenAI's state-based pattern and recent memory papers move in this direction.&lt;/p&gt;
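
&lt;p&gt;The staged loop can be sketched in a few lines of Python. Everything here is illustrative: &lt;code&gt;run_turn&lt;/code&gt;, &lt;code&gt;consolidate&lt;/code&gt;, the &lt;code&gt;state&lt;/code&gt; layout, and the injected &lt;code&gt;retrieve&lt;/code&gt;, &lt;code&gt;answer&lt;/code&gt;, and &lt;code&gt;is_durable&lt;/code&gt; callables are hypothetical, and "last write wins" is an assumed conflict rule.&lt;/p&gt;

```python
def run_turn(state, question, retrieve, answer):
    """One staged turn: load state, retrieve, answer, capture candidates only."""
    context = {"profile": state["profile"], "evidence": retrieve(question)}
    reply, candidates = answer(question, context)
    state["session_notes"].extend(candidates)  # nothing is promoted yet
    return reply

def consolidate(state, is_durable):
    """Later, promote only the notes that pass the durability check."""
    for key, value in state["session_notes"]:
        if is_durable(key, value):
            state["profile"][key] = value  # last write wins on conflict
    state["session_notes"] = []
    return state
```

&lt;p&gt;Keeping &lt;code&gt;consolidate&lt;/code&gt; as a separate step is what lets you review, test, or delay promotion instead of writing to durable state in the middle of a reply.&lt;/p&gt;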

&lt;p&gt;Without tracing and evals, memory changes are hard to debug. When you promote new facts or change retrieval policy, pair those changes with the observability patterns in &lt;a href="https://www.glukhov.org/observability/observability-for-llm-systems/" rel="noopener noreferrer"&gt;Observability for LLM Systems&lt;/a&gt; so you can see which layer injected what.&lt;/p&gt;

&lt;h2&gt;
  
  
  Take-Away
&lt;/h2&gt;

&lt;p&gt;The practical memory stack for assistants is not "just use a vector DB". It is working memory for the live run, structured state for durable truth, retrieval memory for supporting evidence, and a conservative consolidation policy that forgets as deliberately as it remembers. Recent research and current SDK guidance both point in that direction.&lt;/p&gt;

&lt;p&gt;For the full assistant stack around this layer, start with &lt;a href="https://www.glukhov.org/ai-systems/architecture/ai-assistant-architecture/" rel="noopener noreferrer"&gt;AI Assistant Architecture&lt;/a&gt;. For Hermes-specific bounded memory and provider plugins, follow &lt;a href="https://www.glukhov.org/ai-systems/hermes/hermes-agent-memory-system/" rel="noopener noreferrer"&gt;Hermes Agent Memory System&lt;/a&gt; and &lt;a href="https://www.glukhov.org/ai-systems/memory/agent-memory-providers/" rel="noopener noreferrer"&gt;Agent memory providers compared&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>hermes</category>
      <category>openclaw</category>
      <category>architecture</category>
      <category>llm</category>
    </item>
    <item>
      <title>AI for Knowledge Management: Real Workflows That Hold Up</title>
      <dc:creator>Rost</dc:creator>
      <pubDate>Sun, 31 May 2026 13:51:08 +0000</pubDate>
      <link>https://dev.to/rosgluk/ai-for-knowledge-management-real-workflows-that-hold-up-3ag</link>
      <guid>https://dev.to/rosgluk/ai-for-knowledge-management-real-workflows-that-hold-up-3ag</guid>
      <description>&lt;p&gt;AI is not replacing knowledge management; it is changing the shape of it for both individuals and teams.&lt;/p&gt;

&lt;p&gt;Microsoft's Work Trend Index describes a move toward hybrid teams of humans and agents, and NIST's AI RMF argues that trustworthy AI systems need explicit roles, evaluation, and oversight rather than vague automation.&lt;br&gt;
Those ideas fit neatly beside the human-centred practices in the site's &lt;a href="https://www.glukhov.org/knowledge-management/" rel="noopener noreferrer"&gt;Knowledge Management in 2026 pillar&lt;/a&gt;, which focuses on tools and methods long before any model is involved.&lt;/p&gt;

&lt;p&gt;That is exactly the right frame for knowledge work: AI is best treated as an enrichment layer over notes, docs, runbooks, and research, not as a magical second brain that works without structure. A useful mental model is the one developed in &lt;a href="https://www.glukhov.org/knowledge-management/foundations/pkm-vs-rag-vs-wiki-vs-memory-systems/" rel="noopener noreferrer"&gt;PKM vs RAG vs Wiki vs Memory Systems&lt;/a&gt;, where human note systems, shared wikis, retrieval pipelines, and agent memory each play a distinct role instead of collapsing into a single tool.&lt;/p&gt;

&lt;p&gt;The slightly opinionated version is this: if your notes are chaotic, AI will not rescue them. It will often make the chaos more fluent. Good knowledge management still starts with capture, naming, ownership, and source discipline. What AI changes is what you can do after capture: compress, extract, link, retrieve, and repackage information at useful speed. That view fits both modern prompting guidance, which recommends small, well-scoped tasks, and chunking guidance that preserves semantic units for retrieval instead of flattening everything into one blob.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why AI changes knowledge management
&lt;/h2&gt;

&lt;p&gt;The core shift is from static archives to active memory. Embeddings convert text into vectors that reflect relatedness and are commonly used for search, clustering, and recommendations. Retrieval systems can then surface semantically similar material even when the query shares few or no keywords with the source text. In practical terms, that means a note about "incident review" can still find a runbook chunk titled "post-deployment outage steps" without brittle exact-match rules.&lt;/p&gt;

&lt;p&gt;This is why AI-augmented knowledge management is worth doing now. The enabling pieces are no longer exotic: embedding APIs are mainstream, vector stores are standard, local embedding models are easy to run, and production databases such as Postgres can do both exact and approximate nearest-neighbour search with pgvector. The result is not artificial knowledge in the philosophical sense. It is a much more practical thing: better recall, better compression, and better context at the moment someone needs to think, especially when paired with solid representation choices from work such as &lt;a href="https://www.glukhov.org/knowledge-management/foundations/retrieval-vs-representation/" rel="noopener noreferrer"&gt;Retrieval vs Representation in Knowledge Systems&lt;/a&gt;. If your next step is implementation detail, the &lt;a href="https://www.glukhov.org/rag/" rel="noopener noreferrer"&gt;RAG cluster&lt;/a&gt; covers chunking, retrieval, reranking, and production patterns in depth.&lt;/p&gt;
&lt;h2&gt;
  
  
  Workflow patterns that actually work
&lt;/h2&gt;

&lt;p&gt;The patterns that hold up in production are boring in the best way. They use AI for bounded transformations, not vague autonomy. In practice, three patterns show up again and again: summarisation, extraction, and linking suggestions. Those map neatly to what current tools do well: summarise within a clear scope, extract structured data with schemas, and compute semantic relatedness through embeddings and retrieval. They also map cleanly onto the layered view of knowledge systems behind concepts such as &lt;a href="https://www.glukhov.org/knowledge-management/foundations/second-brain/" rel="noopener noreferrer"&gt;second brain workflows&lt;/a&gt; and &lt;a href="https://www.glukhov.org/knowledge-management/knowledge-systems-architectures/compiled-knowledge/what-is-llm-wiki/" rel="noopener noreferrer"&gt;LLM Wiki style compiled knowledge&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Summaries that preserve decisions
&lt;/h3&gt;

&lt;p&gt;Summarisation works best when it stays close to the source and preserves the parts humans actually need later: decisions, unresolved questions, owners, dates, and links back to the original material. OpenAI's enterprise prompting guidance explicitly recommends "one prompt, one deliverable", simple headings, and clear success criteria. That is a good discipline for knowledge work too: summarise one meeting, one document, or one research item at a time, then store the summary beside the source. Do not ask a model to "summarise my knowledge base" and expect anything trustworthy.&lt;/p&gt;

&lt;p&gt;A real workflow looks like this: capture meeting notes or a PDF, run a scoped summary prompt, store the summary with source references, then add a human check before it becomes canonical. If the source is a rich PDF, multimodal parsing can matter because slide decks and exported web pages often contain layout cues that plain text extraction misses. OpenAI's PDF parsing cookbook shows a practical split between text extraction and page-image analysis for turning rich PDFs into retrievable content.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Context
You are assisting with team knowledge capture.

# Instructions
Summarise this meeting note in:
- 5 key points
- decisions made
- open questions
- actions with owners
- terms that should link to existing notes

# Constraints
- Do not invent details
- If something is unclear, mark it as uncertain
- Include the source note ID
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Extraction that creates reusable fields
&lt;/h3&gt;

&lt;p&gt;Extraction is where AI starts to feel genuinely infrastructural. Instead of storing only prose, you ask the model to populate reusable fields such as entities, systems, APIs, owners, action items, products, dates, claims, or risk tags. OpenAI's Structured Outputs feature is designed to keep responses aligned to a JSON Schema, and Ollama offers the same pattern locally with schema-based JSON output. That matters because useful knowledge systems are made of fields you can sort, filter, compare, and validate, not just paragraphs that sound clever.&lt;/p&gt;
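
&lt;p&gt;Even with schema-constrained output, it is worth checking records before they enter the knowledge base. The following is a hand-rolled sketch using only the Python standard library. The &lt;code&gt;REQUIRED_FIELDS&lt;/code&gt; table and &lt;code&gt;validate_record&lt;/code&gt; are assumptions for illustration, and a real pipeline would more likely use a full JSON Schema validator.&lt;/p&gt;

```python
import json

# Hypothetical required fields and their expected Python types.
REQUIRED_FIELDS = {"source_id": str, "summary": str, "entities": list, "actions": list}

def validate_record(raw_json):
    """Return a list of problems; an empty list means the record is usable."""
    try:
        record = json.loads(raw_json)
    except json.JSONDecodeError as error:
        return [f"invalid JSON: {error.msg}"]
    if not isinstance(record, dict):
        return ["top-level value must be an object"]
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"wrong type for {field}")
    return problems
```

&lt;p&gt;Returning a list of problems rather than raising makes it easy to route bad records to a human review queue instead of dropping them.&lt;/p&gt;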

&lt;p&gt;OpenAI's long-document entity extraction example follows the right operational pattern: chunk the document, extract the relevant facts from each chunk, and then combine results. That same workflow works for postmortems, research papers, product docs, customer interviews, and support transcripts. In practice, I would extract more than named entities: I would also pull "needs follow-up", "contradicts existing note", and "candidate for evergreen note" because those fields create action, not just metadata.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"source_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"note-2026-05-22-incident-review"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Short summary here."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"entities"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"service-a"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"postgres"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"oauth"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"actions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"owner"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ops"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"task"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"rotate keys"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"due"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-24"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"related_terms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"token refresh"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"deployment checklist"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"medium"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
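
&lt;p&gt;The chunk, extract, combine pattern described above can be sketched as follows. This is not the cookbook's code: &lt;code&gt;chunk_text&lt;/code&gt;, &lt;code&gt;extract_document&lt;/code&gt;, and the &lt;code&gt;extract_chunk&lt;/code&gt; callable (which would wrap a model call) are hypothetical, and splitting on word counts is a placeholder for structure-aware chunking.&lt;/p&gt;

```python
def chunk_text(text, max_words=200):
    """Split text into word-bounded chunks; real pipelines would chunk by structure."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def extract_document(text, extract_chunk, max_words=200):
    """Run extract_chunk on every chunk and merge the results into one record."""
    merged = {"entities": [], "actions": []}
    for chunk in chunk_text(text, max_words):
        result = extract_chunk(chunk)
        for entity in result.get("entities", []):
            if entity not in merged["entities"]:  # de-duplicate, keep first-seen order
                merged["entities"].append(entity)
        merged["actions"].extend(result.get("actions", []))
    return merged
```

&lt;p&gt;The merge step is where duplicates and contradictions between chunks surface, so it is a natural place to add fields such as "contradicts existing note".&lt;/p&gt;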



&lt;h3&gt;
  
  
  Linking that turns notes into a graph
&lt;/h3&gt;

&lt;p&gt;Link suggestions are the quiet workhorse of AI for knowledge management. Embeddings are explicitly used for search, clustering, and recommendations, which makes them a natural fit for features such as "related notes", "similar incidents", "see also", and "you may want to merge these two docs". Semantic retrieval is especially good at surfacing conceptually related content even when wording differs. That makes it far better than folder hierarchies alone for large note sets and technical documentation.&lt;/p&gt;
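
&lt;p&gt;A minimal sketch of link suggestion, assuming each note already has an embedding vector: rank the other notes by cosine similarity and keep the closest ones. The function names, the &lt;code&gt;min_score&lt;/code&gt; threshold, and the in-memory dictionary are illustrative assumptions, and a real system would query a vector index instead of scanning every note.&lt;/p&gt;

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def suggest_links(note_id, embeddings, top_k=3, min_score=0.5):
    """Return up to top_k (other_id, score) pairs most similar to note_id."""
    target = embeddings[note_id]
    scored = [(other, cosine(target, vec))
              for other, vec in embeddings.items() if other != note_id]
    scored = [pair for pair in scored if pair[1] >= min_score]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]
```

&lt;p&gt;The threshold matters as much as the ranking. Without it, every note gets three "related" links, including ones that are not related at all.&lt;/p&gt;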

&lt;p&gt;Dense semantic search should not be your only retrieval signal, though. Exact identifiers still matter: function names, package names, issue IDs, error codes, SKUs, regulation numbers. Google Research has shown that hybrid retrieval, which combines semantic and lexical signals, improves recall because each method finds relevant material the other misses. In a technical knowledge base, that is not an academic detail. It is the difference between finding the conceptually related design note and also finding the exact migration command someone needs at 2 a.m.&lt;/p&gt;
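
&lt;p&gt;One common way to combine a semantic ranking with a lexical one is reciprocal rank fusion. It is named here as an illustration and is not the method from the research mentioned above. In this sketch, &lt;code&gt;k=60&lt;/code&gt; is a conventional default and the document IDs are made up.&lt;/p&gt;

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids; each list contributes 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical results from a semantic search and a keyword search.
semantic = ["design-note", "migration-cmd", "faq"]
lexical = ["migration-cmd", "changelog"]
fused = reciprocal_rank_fusion([semantic, lexical])
```

&lt;p&gt;Because it works on ranks rather than raw scores, this approach avoids having to normalise vector distances against keyword scores, and a document found by both methods rises to the top.&lt;/p&gt;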

&lt;p&gt;If you are already on Postgres, pgvector is the pragmatic option. It stores vectors with the rest of your data, supports exact search by default, and offers approximate indexing through HNSW and IVFFlat when you need more speed and can tolerate some recall trade-off. That is enough to build related-content suggestions, semantic search, and note deduplication without adding a separate vector database on day one.&lt;/p&gt;

&lt;h2&gt;
  
  
  The human plus AI loop
&lt;/h2&gt;

&lt;p&gt;The model that actually works is not "human or AI". It is capture -&amp;gt; AI enrich -&amp;gt; human refine. Microsoft describes the broader shift as humans working with assistants and then agent teams, while NIST's AI RMF and Playbook stress clearly defined human roles, responsibilities, and oversight in human-AI configurations. For knowledge management, that means humans remain accountable for the canonical note, the source of truth, and the final merge or publication decision. AI does the first-pass compression and cross-linking; humans do the judgement.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;capture -&amp;gt; parse -&amp;gt; chunk -&amp;gt; embed -&amp;gt; enrich -&amp;gt; review -&amp;gt; publish
             |         |        |
             |         |        +-&amp;gt; related notes
             |         +-&amp;gt; retrieval index
             +-&amp;gt; structure-aware extraction
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This division of labour is more than cautious process design. It matches how risk accumulates. NIST notes that understanding the limitations of human-AI interaction improves AI risk management, and that roles in oversight and use should be clearly differentiated. In practice, that means the model can draft titles, tags, summaries, and candidate links, but a person should approve anything that changes taxonomy, publishes external content, or overwrites an existing note. If you let the model silently rewrite your knowledge base, you are not building memory. You are outsourcing editorial control to a probabilistic system.&lt;/p&gt;

&lt;h2&gt;
  
  
  The tool choices that matter
&lt;/h2&gt;

&lt;p&gt;The base layer is embeddings plus retrieval. OpenAI's embeddings guide frames embeddings as a way to measure relatedness between text strings, while the Retrieval API handles semantic search over your data through vector stores. For many teams, that is the minimum viable stack for AI-augmented knowledge management: parse content, chunk it well, embed it, and retrieve the right fragments before synthesis. If you only do one serious thing this quarter, make it retrieval-backed recall instead of a chat wrapper over raw documents.&lt;/p&gt;

&lt;p&gt;Local models are the right answer when privacy, offline use, or cost control dominate. Ollama documents both local embeddings and structured outputs, and its product pages emphasise that data stays yours and that workloads can run entirely offline. That makes local-first pipelines sensible for internal notes, engineering runbooks, and sensitive research archives. My bias is simple: use local models for indexing, classification, and routine enrichment; reach for hosted APIs when you need stronger reasoning, multimodal extraction, or the best available model quality.&lt;/p&gt;

&lt;p&gt;Do not ignore parsing and chunking. Unstructured's chunking docs recommend building chunks from semantic document elements rather than raw character boundaries when possible, and OpenAI's PDF cookbook shows why rich-document parsing matters for RAG. Structure-aware PDF work goes further: naive parsing can destroy tables, scramble reading order, and strip hierarchical headings, while structure-aware parsing preserves paragraphs, tables, and document hierarchy. In knowledge management, that is the difference between an index that understands your corpus and one that merely tokenises it.&lt;/p&gt;
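
&lt;p&gt;To illustrate element-based chunking, the sketch below groups parsed elements into chunks, starts a new chunk at each heading, and never splits an element. It does not use Unstructured's API: the &lt;code&gt;(kind, text)&lt;/code&gt; tuples, the function name, and the character limit are assumptions standing in for whatever your parser emits.&lt;/p&gt;

```python
def chunk_by_elements(elements, max_chars=500):
    """Group (kind, text) elements into chunks without splitting any element."""
    chunks, current, size = [], [], 0
    for kind, text in elements:
        starts_section = kind == "heading"
        too_big = size + len(text) > max_chars
        if current and (starts_section or too_big):
            chunks.append("\n".join(current))  # close the chunk at a clean boundary
            current, size = [], 0
        current.append(text)
        size += len(text)
    if current:
        chunks.append("\n".join(current))
    return chunks
```

&lt;p&gt;A table or code block that exceeds the limit still stays whole in its own chunk, which is usually a better trade-off than cutting it in half.&lt;/p&gt;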

&lt;h2&gt;
  
  
  Limitations worth respecting
&lt;/h2&gt;

&lt;p&gt;Hallucination is still the obvious risk, but the more useful framing is insufficient context. RAG exists because large language models can hallucinate, use stale knowledge, and produce answers with weak traceability; retrieval helps by grounding generation in external knowledge. Even so, Google Research found that models often answer incorrectly instead of abstaining when the provided context is not sufficient. That matters for knowledge management because "I found something similar" is not the same as "I found enough to answer". Your system should preserve source references, expose uncertainty, and prefer abstention over confident fabrication.&lt;/p&gt;
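
&lt;p&gt;A simple way to encode "prefer abstention" is to gate generation on the retrieved evidence. In this sketch, &lt;code&gt;answer_or_abstain&lt;/code&gt;, the passage fields, the thresholds, and the &lt;code&gt;generate&lt;/code&gt; callable are all hypothetical, and a similarity score is only a rough proxy for whether the context is actually sufficient.&lt;/p&gt;

```python
def answer_or_abstain(question, passages, generate, min_score=0.75, min_passages=1):
    """Call generate only when enough passages clear min_score; otherwise abstain."""
    strong = [p for p in passages if p["score"] >= min_score]
    if min_passages > len(strong):
        return {"answer": None, "sources": [], "abstained": True}
    return {
        "answer": generate(question, strong),
        "sources": [p["source_id"] for p in strong],  # keep traceability
        "abstained": False,
    }
```

&lt;p&gt;Returning the source IDs alongside the answer keeps every response traceable, and the explicit &lt;code&gt;abstained&lt;/code&gt; flag lets the interface say "not enough information" instead of guessing.&lt;/p&gt;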

&lt;p&gt;Long context does not remove the need for retrieval discipline. The 2023 "Lost in the Middle" paper showed that model performance could degrade when relevant information sat in the middle of long inputs, while later Google results show that at least some newer models have improved substantially on simple needle-in-a-haystack retrieval near context limits. The sober lesson is not "long context solves it" or "long context is useless". It is that you should test your actual workflows and corpus, because position effects, task type, and document structure still matter.&lt;/p&gt;

&lt;p&gt;Loss of structure is the quieter failure mode, and in technical documentation it can be worse than hallucination because it poisons retrieval before the model even starts reasoning. Structure-aware PDF research shows that naive parsing can split tables, destroy their internal meaning, and break reading order, while semantic chunking systems try to preserve coherent document elements. If your source material includes tables, diagrams, code examples, or multi-column layouts, your parser is part of your knowledge system, not a boring preprocessing detail.&lt;/p&gt;

&lt;p&gt;So the practical rule is this: keep the human editorial loop, preserve source links, use schemas for extraction, and treat retrieval quality as a product feature. AI does not replace PKM, team docs, or knowledge architecture. It changes the leverage. Used well, it turns raw notes into searchable, linkable, structured memory. Used badly, it turns your documentation into high-speed drift.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>knowledgemanagement</category>
      <category>rag</category>
    </item>
    <item>
      <title>Multi-Tenancy Database Patterns with examples in Go</title>
      <dc:creator>Rost</dc:creator>
      <pubDate>Thu, 28 May 2026 13:13:31 +0000</pubDate>
      <link>https://dev.to/rosgluk/multi-tenancy-database-patterns-with-examples-in-go-1kje</link>
      <guid>https://dev.to/rosgluk/multi-tenancy-database-patterns-with-examples-in-go-1kje</guid>
      <description>&lt;p&gt;&lt;a href="https://www.glukhov.org/app-architecture/multitenancy/multi-tenant-database-patterns/" rel="noopener noreferrer"&gt;Multi-tenancy&lt;/a&gt; is a fundamental architectural pattern for SaaS applications, allowing multiple customers (tenants) to share the same application infrastructure while maintaining data isolation.&lt;/p&gt;

&lt;p&gt;Choosing the right database pattern is crucial for scalability, security, and operational efficiency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview of Multi-Tenancy Patterns
&lt;/h2&gt;

&lt;p&gt;When designing a multi-tenant application, you have three primary database architecture patterns to choose from:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Shared Database, Shared Schema&lt;/strong&gt; (most common)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Shared Database, Separate Schema&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Separate Database per Tenant&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each pattern has distinct characteristics, trade-offs, and use cases. Let's explore each in detail.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pattern 1: Shared Database, Shared Schema
&lt;/h2&gt;

&lt;p&gt;This is the most common multi-tenancy pattern, where all tenants share the same database and schema, with a &lt;code&gt;tenant_id&lt;/code&gt; column used to distinguish tenant data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────┐
│     Single Database                 │
│  ┌───────────────────────────────┐  │
│  │  Shared Schema                │  │
│  │  - users (tenant_id, ...)     │  │
│  │  - orders (tenant_id, ...)    │  │
│  │  - products (tenant_id, ...)  │  │
│  └───────────────────────────────┘  │
└─────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Implementation Example&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When implementing multi-tenant patterns, understanding SQL fundamentals is crucial. For a comprehensive reference on SQL commands and syntax, check out our &lt;a href="https://www.glukhov.org/developer-tools/database-tools/sql-cheatsheet/" rel="noopener noreferrer"&gt;SQL Cheatsheet&lt;/a&gt;. Here's how to set up the shared schema pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Users table with tenant_id&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;SERIAL&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tenant_id&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;255&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;255&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="k"&gt;FOREIGN&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tenant_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;tenants&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Index on tenant_id for performance&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_users_tenant_id&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tenant_id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Row-Level Security (PostgreSQL example)&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="n"&gt;ENABLE&lt;/span&gt; &lt;span class="k"&gt;ROW&lt;/span&gt; &lt;span class="k"&gt;LEVEL&lt;/span&gt; &lt;span class="k"&gt;SECURITY&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;POLICY&lt;/span&gt; &lt;span class="n"&gt;tenant_isolation&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;
    &lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt;
    &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tenant_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;current_setting&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'app.current_tenant'&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For more PostgreSQL-specific features and commands, including RLS policies, schema management, and performance tuning, refer to our &lt;a href="https://www.glukhov.org/developer-tools/database-tools/postgresql-cheatsheet/" rel="noopener noreferrer"&gt;PostgreSQL Cheatsheet&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Application-Level Filtering&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The examples below use GORM, but the same filtering applies with any Go data-access library. For a detailed comparison of Go ORMs including GORM, Ent, Bun, and sqlc, see our &lt;a href="https://www.glukhov.org/app-architecture/data-access/comparing-go-orms-gorm-ent-bun-sqlc/" rel="noopener noreferrer"&gt;comprehensive guide to Go ORMs for PostgreSQL&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Example in Go with GORM&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;GetUserByEmail&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;gorm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DB&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tenantID&lt;/span&gt; &lt;span class="kt"&gt;uint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;User&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="n"&gt;User&lt;/span&gt;
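    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"tenant_id = ? AND email = ?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tenantID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;First&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="c"&gt;// do not return a zero-value User on failure&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;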
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;// Middleware to set tenant context&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;TenantMiddleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;next&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Handler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Handler&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HandlerFunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResponseWriter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;tenantID&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;extractTenantID&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c"&gt;// From subdomain, header, or JWT&lt;/span&gt;
        &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithValue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="s"&gt;"tenant_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tenantID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;next&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ServeHTTP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
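&lt;p&gt;The middleware above stores the tenant under a plain string key, which can collide with keys set by other packages. A minimal sketch of a common refinement follows, using only the standard library: an unexported key type plus small helpers to write and read the value. The names &lt;code&gt;WithTenantID&lt;/code&gt;, &lt;code&gt;TenantIDFromContext&lt;/code&gt;, &lt;code&gt;BackgroundWithTenant&lt;/code&gt;, and &lt;code&gt;ParseTenantID&lt;/code&gt; are illustrative assumptions, not part of the original example.&lt;/p&gt;

```go
package main

import (
	"context"
	"strconv"
)

// tenantKey is an unexported type, so keys from other packages cannot collide with it.
type tenantKey struct{}

// WithTenantID returns a copy of ctx that carries the tenant ID.
func WithTenantID(ctx context.Context, id uint) context.Context {
	return context.WithValue(ctx, tenantKey{}, id)
}

// TenantIDFromContext returns the tenant ID stored by WithTenantID.
// ok is false when no tenant was set, so callers can fail closed.
func TenantIDFromContext(ctx context.Context) (id uint, ok bool) {
	id, ok = ctx.Value(tenantKey{}).(uint)
	return id, ok
}

// BackgroundWithTenant builds a root context for jobs that run outside a request.
func BackgroundWithTenant(id uint) context.Context {
	return WithTenantID(context.Background(), id)
}

// ParseTenantID converts a raw header or claim value into a tenant ID.
// Zero and non-numeric values are rejected.
func ParseTenantID(raw string) (uint, bool) {
	n, err := strconv.ParseUint(raw, 10, 32)
	if err != nil || n == 0 {
		return 0, false
	}
	return uint(n), true
}
```

&lt;p&gt;A handler could then call &lt;code&gt;TenantIDFromContext(r.Context())&lt;/code&gt; and reject the request when &lt;code&gt;ok&lt;/code&gt; is false, so a missing tenant fails closed rather than silently falling back to a default.&lt;/p&gt;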



&lt;p&gt;&lt;strong&gt;Shared Schema Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lowest cost&lt;/strong&gt;: Single database instance, minimal infrastructure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Easiest operations&lt;/strong&gt;: One database to back up, monitor, and maintain&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simple schema changes&lt;/strong&gt;: Migrations apply to all tenants at once&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best for high tenant count&lt;/strong&gt;: Efficient resource utilization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-tenant analytics&lt;/strong&gt;: Easy to aggregate data across tenants&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Shared Schema Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Weaker isolation&lt;/strong&gt;: Data leakage risk if queries forget tenant_id filter&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Noisy neighbor&lt;/strong&gt;: One tenant's heavy workload can affect others&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limited customization&lt;/strong&gt;: All tenants share the same schema&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance challenges&lt;/strong&gt;: Harder to meet strict data isolation requirements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backup complexity&lt;/strong&gt;: Can't restore individual tenant data easily&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Shared Schema Best For&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SaaS applications with many small-to-medium tenants&lt;/li&gt;
&lt;li&gt;Applications where tenants don't need custom schemas&lt;/li&gt;
&lt;li&gt;Cost-sensitive startups&lt;/li&gt;
&lt;li&gt;When tenant count is high (thousands+)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Pattern 2: Shared Database, Separate Schema
&lt;/h2&gt;

&lt;p&gt;Each tenant gets their own schema within the same database, providing better isolation while sharing infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Separate Schema Architecture&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────┐
│     Single Database                 │
│  ┌──────────┐  ┌──────────┐         │
│  │ Schema A │  │ Schema B │  ...    │
│  │ (Tenant1)│  │ (Tenant2)│         │
│  └──────────┘  └──────────┘         │
└─────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Separate Schema Implementation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;PostgreSQL schemas are a powerful feature for multi-tenancy. For detailed information on PostgreSQL schema management, connection strings, and database administration commands, consult our &lt;a href="https://www.glukhov.org/developer-tools/database-tools/postgresql-cheatsheet/" rel="noopener noreferrer"&gt;PostgreSQL Cheatsheet&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Create schema for tenant&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="n"&gt;tenant_123&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Set search path for tenant operations&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;search_path&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;tenant_123&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Create tables in tenant schema&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;tenant_123&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;SERIAL&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;255&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;255&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Application Connection Management&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With separate schemas, every connection must be pointed at the right tenant schema before it runs a query. The code below uses GORM; for a comparison of Go ORMs covering connection pooling, performance characteristics, and use cases, refer to our &lt;a href="https://www.glukhov.org/app-architecture/data-access/comparing-go-orms-gorm-ent-bun-sqlc/" rel="noopener noreferrer"&gt;Go ORMs comparison guide&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
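&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Point a tenant's queries at its schema via search_path.&lt;/span&gt;
&lt;span class="c"&gt;// Caveat: SET affects only the pooled connection that executes it, so in&lt;/span&gt;
&lt;span class="c"&gt;// production run it inside the same transaction as the tenant's queries.&lt;/span&gt;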
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;GetTenantDB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tenantID&lt;/span&gt; &lt;span class="kt"&gt;uint&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;gorm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DB&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;db&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;initializeDB&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Exec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"SET search_path TO tenant_%d, public"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tenantID&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;// Or use PostgreSQL connection string&lt;/span&gt;
&lt;span class="c"&gt;// postgresql://user:pass@host/db?search_path=tenant_123&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
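&lt;p&gt;Because Go database handles are connection pools, a session-level &lt;code&gt;SET search_path&lt;/code&gt; reaches only the one connection that happens to run it. A hedged sketch of one alternative is to build a transaction-scoped &lt;code&gt;SET LOCAL&lt;/code&gt; statement and execute it first inside each tenant transaction. The helper names &lt;code&gt;TenantSchema&lt;/code&gt; and &lt;code&gt;SearchPathStatement&lt;/code&gt; and the identifier pattern are assumptions for illustration; the &lt;code&gt;tenant_123&lt;/code&gt; naming follows the SQL above.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"regexp"
)

// schemaNameRE restricts schema names to a conservative identifier subset
// (lowercase letters, digits, underscore, at most 63 characters).
var schemaNameRE = regexp.MustCompile(`^[a-z_][a-z0-9_]{0,62}$`)

// TenantSchema derives the schema name used in the SQL above, e.g. tenant_123.
func TenantSchema(tenantID uint) string {
	return fmt.Sprintf("tenant_%d", tenantID)
}

// SearchPathStatement returns a transaction-scoped SET LOCAL statement.
// SET does not accept bind parameters, so the name is validated and quoted
// instead of being passed as an argument.
func SearchPathStatement(schema string) (string, error) {
	if !schemaNameRE.MatchString(schema) {
		return "", fmt.Errorf("invalid schema name %q", schema)
	}
	return fmt.Sprintf(`SET LOCAL search_path TO "%s", public`, schema), nil
}
```

&lt;p&gt;Running the returned statement as the first statement of a transaction keeps the setting on the same connection as the tenant's queries, and PostgreSQL discards it at commit or rollback.&lt;/p&gt;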



&lt;p&gt;&lt;strong&gt;Separate Schema Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Better isolation&lt;/strong&gt;: Schema-level separation reduces data leakage risk&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Customization&lt;/strong&gt;: Each tenant can have different table structures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Moderate cost&lt;/strong&gt;: Still single database instance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Easier per-tenant backups&lt;/strong&gt;: Can back up individual schemas&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Better for compliance&lt;/strong&gt;: Stronger than shared schema pattern&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Separate Schema Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Schema management complexity&lt;/strong&gt;: Migrations must run per tenant&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Connection overhead&lt;/strong&gt;: Need to set search_path per connection&lt;/li&gt;
&lt;li&gt;
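&lt;strong&gt;Limited scalability&lt;/strong&gt;: PostgreSQL sets no hard schema limit, but catalog size, migration time, and tools such as pg_dump tend to degrade once there are thousands of schemas&lt;/li&gt;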
&lt;li&gt;
&lt;strong&gt;Cross-tenant queries&lt;/strong&gt;: More complex, requires dynamic schema references&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource limits&lt;/strong&gt;: Still shared database resources&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Separate Schema Best For&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Medium-scale SaaS (dozens to hundreds of tenants)&lt;/li&gt;
&lt;li&gt;When tenants need schema customization&lt;/li&gt;
&lt;li&gt;Applications needing better isolation than shared schema&lt;/li&gt;
&lt;li&gt;When compliance requirements are moderate&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Pattern 3: Separate Database per Tenant
&lt;/h2&gt;

&lt;p&gt;Each tenant gets their own database, either on a shared PostgreSQL server (as in the SQL below) or on a dedicated instance, providing maximum isolation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Separate Database Architecture&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│  Database 1  │  │  Database 2  │  │  Database 3  │
│  (Tenant A)  │  │  (Tenant B)  │  │  (Tenant C)  │
└──────────────┘  └──────────────┘  └──────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Separate Database Implementation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Create database for tenant&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;tenant_enterprise_corp&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Connect to tenant database (\c is a psql meta-command, not SQL)&lt;/span&gt;
&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;c&lt;/span&gt; &lt;span class="n"&gt;tenant_enterprise_corp&lt;/span&gt;

&lt;span class="c1"&gt;-- Create tables (no tenant_id needed!)&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;SERIAL&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;255&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;255&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Dynamic Connection Management&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Connection pool manager&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;TenantDBManager&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;pools&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;uint&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;gorm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DB&lt;/span&gt;
    &lt;span class="n"&gt;mu&lt;/span&gt;    &lt;span class="n"&gt;sync&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RWMutex&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;TenantDBManager&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;GetDB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tenantID&lt;/span&gt; &lt;span class="kt"&gt;uint&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;gorm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DB&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RLock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exists&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pools&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tenantID&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt; &lt;span class="n"&gt;exists&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RUnlock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RUnlock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Lock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Unlock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c"&gt;// Double-check after acquiring write lock&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exists&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pools&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tenantID&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt; &lt;span class="n"&gt;exists&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// Create new connection&lt;/span&gt;
    &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;gorm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;postgres&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="s"&gt;"host=localhost user=dbuser password=dbpass dbname=tenant_%d sslmode=disable"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tenantID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;gorm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;{})&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pools&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tenantID&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
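&lt;p&gt;Each pool opened by the manager above draws on the database server's connection limit, and databases hosted on the same PostgreSQL server share a single &lt;code&gt;max_connections&lt;/code&gt; setting. The budget arithmetic can be sketched as follows; &lt;code&gt;PoolSizePerTenant&lt;/code&gt; and its parameters are hypothetical names, and the figures in the note below are examples only.&lt;/p&gt;

```go
package main

// PoolSizePerTenant splits a server-wide connection budget across tenant pools.
// maxConnections mirrors the PostgreSQL max_connections setting, reserved is the
// number of connections held back for admin and migration tooling, and
// activeTenants is how many per-tenant pools may be open at once.
// It returns 0 when the budget cannot give every tenant at least one connection.
func PoolSizePerTenant(maxConnections, reserved, activeTenants int) int {
	if 1 > activeTenants || reserved >= maxConnections {
		return 0
	}
	return (maxConnections - reserved) / activeTenants
}
```

&lt;p&gt;With &lt;code&gt;max_connections&lt;/code&gt; at 100, 10 connections reserved, and 30 active tenants, the function returns 3. That value could be applied with &lt;code&gt;SetMaxOpenConns&lt;/code&gt; on the &lt;code&gt;*sql.DB&lt;/code&gt; that GORM exposes through &lt;code&gt;db.DB()&lt;/code&gt;. A result of 0 signals that the tenants need a pooler or more than one server.&lt;/p&gt;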



&lt;p&gt;&lt;strong&gt;Separate Database Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Maximum isolation&lt;/strong&gt;: Complete data separation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best security&lt;/strong&gt;: Cross-tenant access requires connecting to the wrong database, not merely omitting a filter&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full customization&lt;/strong&gt;: Each tenant can have completely different schemas&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Independent scaling&lt;/strong&gt;: Scale tenant databases individually&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Easy compliance&lt;/strong&gt;: Meets strictest data isolation requirements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-tenant backups&lt;/strong&gt;: Simple, independent backup/restore&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No noisy neighbors&lt;/strong&gt;: Tenant workloads don't affect each other when each database runs on its own instance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Separate Database Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Highest cost&lt;/strong&gt;: Multiple database instances require more resources&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operational complexity&lt;/strong&gt;: Managing many databases (backups, monitoring, migrations)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Connection limits&lt;/strong&gt;: Every tenant needs its own connection pool, and pools on a shared server all count against the same max_connections limit&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-tenant analytics&lt;/strong&gt;: Requires data federation or ETL&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Migration complexity&lt;/strong&gt;: Must run migrations across all databases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource overhead&lt;/strong&gt;: More memory, CPU, and storage needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Separate Database Best For&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enterprise SaaS with high-value customers&lt;/li&gt;
&lt;li&gt;Strict compliance requirements (HIPAA, GDPR, SOC 2)&lt;/li&gt;
&lt;li&gt;When tenants need significant customization&lt;/li&gt;
&lt;li&gt;Low to medium tenant count (dozens to low hundreds)&lt;/li&gt;
&lt;li&gt;When tenants have very different data models&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Security Considerations
&lt;/h2&gt;

&lt;p&gt;Regardless of the pattern chosen, enforce tenant isolation at more than one layer:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Row-Level Security (RLS)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;PostgreSQL RLS filters every query on a protected table by tenant, providing a database-level safety net when application code omits the filter. Superusers and, by default, the table owner bypass RLS, so the application should connect as a separate unprivileged role, or the table should use &lt;code&gt;FORCE ROW LEVEL SECURITY&lt;/code&gt;. For more details on PostgreSQL RLS, security policies, and other advanced PostgreSQL features, see our &lt;a href="https://www.glukhov.org/developer-tools/database-tools/postgresql-cheatsheet/" rel="noopener noreferrer"&gt;PostgreSQL Cheatsheet&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Enable RLS&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;ENABLE&lt;/span&gt; &lt;span class="k"&gt;ROW&lt;/span&gt; &lt;span class="k"&gt;LEVEL&lt;/span&gt; &lt;span class="k"&gt;SECURITY&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Policy to isolate by tenant&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;POLICY&lt;/span&gt; &lt;span class="n"&gt;tenant_isolation&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
    &lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt;
    &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tenant_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;current_setting&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'app.current_tenant'&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Application sets tenant context&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;current_tenant&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'123'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
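&lt;p&gt;A plain &lt;code&gt;SET&lt;/code&gt; lasts for the whole session, so with pooled connections the tenant value can outlive the request that set it. One way to scope it to a single transaction is &lt;code&gt;set_config&lt;/code&gt; with its third argument (&lt;code&gt;is_local&lt;/code&gt;) set to true, sketched below with &lt;code&gt;database/sql&lt;/code&gt;. The names &lt;code&gt;WithTenantTx&lt;/code&gt; and &lt;code&gt;TenantSetting&lt;/code&gt; are hypothetical, and the &lt;code&gt;$1&lt;/code&gt; placeholder assumes a PostgreSQL driver such as pgx or lib/pq.&lt;/p&gt;

```go
package main

import (
	"context"
	"database/sql"
	"errors"
	"strconv"
)

// setTenantSQL uses set_config with is_local = true, so PostgreSQL drops the
// setting automatically when the transaction ends.
const setTenantSQL = "SELECT set_config('app.current_tenant', $1, true)"

// TenantSetting renders the tenant ID as the text value that the policy reads
// through current_setting('app.current_tenant')::INTEGER.
func TenantSetting(tenantID int) (string, error) {
	if tenantID >= 1 {
		return strconv.Itoa(tenantID), nil
	}
	return "", errors.New("tenant ID must be positive")
}

// WithTenantTx runs fn inside a transaction whose RLS context is tenantID.
func WithTenantTx(ctx context.Context, db *sql.DB, tenantID int, fn func(tx *sql.Tx) error) error {
	value, err := TenantSetting(tenantID)
	if err != nil {
		return err
	}
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // has no effect once Commit has succeeded

	if _, err := tx.ExecContext(ctx, setTenantSQL, value); err != nil {
		return err
	}
	if err := fn(tx); err != nil {
		return err
	}
	return tx.Commit()
}
```

&lt;p&gt;Because the setting is tied to the transaction, this approach also fits poolers that hand connections out per transaction, where session state cannot be relied on.&lt;/p&gt;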



&lt;p&gt;&lt;strong&gt;2. Application-Level Filtering&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Always filter by tenant_id in application code. The examples below use GORM, but different ORMs have their own approaches to query building. For guidance on choosing the right ORM for your multi-tenant application, check our &lt;a href="https://www.glukhov.org/app-architecture/data-access/comparing-go-orms-gorm-ent-bun-sqlc/" rel="noopener noreferrer"&gt;comparison of Go ORMs&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// ❌ BAD - Missing tenant filter&lt;/span&gt;
&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"email = ?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;First&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;// ✅ GOOD - Always include tenant filter&lt;/span&gt;
&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"tenant_id = ? AND email = ?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tenantID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;First&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;// ✅ BETTER - Use scopes or middleware&lt;/span&gt;
&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Scopes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TenantScope&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tenantID&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"email = ?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;First&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Connection Pooling&lt;/strong&gt;&lt;/p&gt;

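&lt;p&gt;Use connection poolers that support tenant context. With PgBouncer in transaction pooling mode, a session-level &lt;code&gt;SET&lt;/code&gt; can leak to whichever client is handed the same server connection next, so set the tenant inside each transaction instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// PgBouncer with transaction pooling: set the tenant per transaction,&lt;/span&gt;
&lt;span class="c"&gt;// e.g. SELECT set_config('app.current_tenant', '123', true) after BEGIN&lt;/span&gt;
&lt;span class="c"&gt;// Or use application-level connection routing&lt;/span&gt;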
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4. Audit Logging&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Track all tenant data access:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;AuditLog&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;ID&lt;/span&gt;        &lt;span class="kt"&gt;uint&lt;/span&gt;
    &lt;span class="n"&gt;TenantID&lt;/span&gt;  &lt;span class="kt"&gt;uint&lt;/span&gt;
    &lt;span class="n"&gt;UserID&lt;/span&gt;    &lt;span class="kt"&gt;uint&lt;/span&gt;
    &lt;span class="n"&gt;Action&lt;/span&gt;    &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;Table&lt;/span&gt;     &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;RecordID&lt;/span&gt;  &lt;span class="kt"&gt;uint&lt;/span&gt;
    &lt;span class="n"&gt;Timestamp&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Time&lt;/span&gt;
    &lt;span class="n"&gt;IPAddress&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Performance Optimization
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Indexing Strategy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Proper indexing is crucial for multi-tenant database performance. Understanding SQL indexing strategies, including composite indexes and partial indexes, is essential. For a comprehensive reference on SQL commands including CREATE INDEX and query optimization, see our &lt;a href="https://www.glukhov.org/developer-tools/database-tools/sql-cheatsheet/" rel="noopener noreferrer"&gt;SQL Cheatsheet&lt;/a&gt;. For PostgreSQL-specific indexing features and performance tuning, refer to our &lt;a href="https://www.glukhov.org/developer-tools/database-tools/postgresql-cheatsheet/" rel="noopener noreferrer"&gt;PostgreSQL Cheatsheet&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Composite indexes for tenant queries&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_orders_tenant_created&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tenant_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_orders_tenant_status&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tenant_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Partial indexes for common tenant-specific queries&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_orders_active_tenant&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tenant_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'active'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Query Optimization&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Use prepared statements for tenant queries&lt;/span&gt;
&lt;span class="n"&gt;stmt&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Prepare&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"SELECT * FROM users WHERE tenant_id = $1 AND email = $2"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;// Batch operations per tenant&lt;/span&gt;
&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"tenant_id = ?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tenantID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;// Use connection pooling per tenant (for separate database pattern)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Monitoring&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Effective database management tools are essential for monitoring multi-tenant applications. You'll need to track query performance, resource usage, and database health across all tenants. For comparing database management tools that can help with this, check out our &lt;a href="https://www.glukhov.org/developer-tools/database-tools/dbeaver-vs-beekeeper/" rel="noopener noreferrer"&gt;DBeaver vs Beekeeper comparison&lt;/a&gt;. Both tools offer excellent features for managing and monitoring PostgreSQL databases in multi-tenant environments.&lt;/p&gt;

&lt;p&gt;Monitor per-tenant metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Query performance per tenant&lt;/li&gt;
&lt;li&gt;Resource usage per tenant&lt;/li&gt;
&lt;li&gt;Connection counts per tenant&lt;/li&gt;
&lt;li&gt;Database size per tenant&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Migration Strategy
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Shared Schema Pattern&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When implementing database migrations, your choice of ORM affects how you handle schema changes. The examples below use GORM's AutoMigrate feature, but different ORMs have different migration strategies. For detailed information on how various Go ORMs handle migrations and schema management, see our &lt;a href="https://www.glukhov.org/app-architecture/data-access/comparing-go-orms-gorm-ent-bun-sqlc/" rel="noopener noreferrer"&gt;Go ORMs comparison&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Migrations apply to all tenants automatically&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;Migrate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;gorm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DB&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AutoMigrate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;User&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Order&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Product&lt;/span&gt;&lt;span class="p"&gt;{})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Separate Schema/Database Pattern&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Migrations must run per tenant&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;MigrateAllTenants&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tenantIDs&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;uint&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tenantID&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;tenantIDs&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;db&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;GetTenantDB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tenantID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AutoMigrate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;User&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Order&lt;/span&gt;&lt;span class="p"&gt;{});&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Errorf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"tenant %d: %w"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tenantID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Decision Matrix
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Factor&lt;/th&gt;
&lt;th&gt;Shared Schema&lt;/th&gt;
&lt;th&gt;Separate Schema&lt;/th&gt;
&lt;th&gt;Separate DB&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Isolation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scalability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Low-Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Customization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Operational Complexity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Compliance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best Tenant Count&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1000+&lt;/td&gt;
&lt;td&gt;10-1000&lt;/td&gt;
&lt;td&gt;1-100&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Hybrid Approach
&lt;/h2&gt;

&lt;p&gt;You can combine patterns for different tenant tiers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Small tenants: Shared schema&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tenant&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tier&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;"standard"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;GetSharedDB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tenant&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;// Enterprise tenants: Separate database&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tenant&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tier&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;"enterprise"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;GetTenantDB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tenant&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
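&lt;p&gt;The fragment above has no branch for a tier it does not recognize. A minimal sketch of the same routing decision as a complete function might look like this; the &lt;code&gt;Target&lt;/code&gt; type, the &lt;code&gt;RouteTenant&lt;/code&gt; name, and the database naming scheme are assumptions made for illustration.&lt;/p&gt;

```go
package main

import "fmt"

// Tenant carries the fields the routing decision needs (illustrative).
type Tenant struct {
	ID   uint
	Tier string
}

// Target names where a tenant's data lives. The names are hypothetical.
type Target struct {
	Database string // logical database name
	Shared   bool   // true when rows must also be filtered by tenant_id
}

// RouteTenant maps a tier to a storage target and rejects unknown tiers
// instead of silently falling through.
func RouteTenant(t Tenant) (Target, error) {
	switch t.Tier {
	case "standard":
		return Target{Database: "shared", Shared: true}, nil
	case "enterprise":
		return Target{Database: fmt.Sprintf("tenant_%d", t.ID)}, nil
	default:
		return Target{}, fmt.Errorf("tenant %d: unknown tier %q", t.ID, t.Tier)
	}
}

func main() {
	for _, t := range []Tenant{{1, "standard"}, {42, "enterprise"}, {7, "trial"}} {
		target, err := RouteTenant(t)
		fmt.Println(target, err)
	}
}
```

&lt;p&gt;Returning an error for an unknown tier is a deliberate choice here: defaulting to the shared database would quietly place a misconfigured enterprise tenant next to everyone else.&lt;/p&gt;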



&lt;h2&gt;
  
  
  Best Practices
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Always filter by tenant&lt;/strong&gt;: Never trust application code alone; use RLS when possible. Understanding SQL fundamentals helps ensure proper query construction—refer to our &lt;a href="https://www.glukhov.org/developer-tools/database-tools/sql-cheatsheet/" rel="noopener noreferrer"&gt;SQL Cheatsheet&lt;/a&gt; for query best practices.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor tenant resource usage&lt;/strong&gt;: Identify and throttle noisy neighbors. Use database management tools like those compared in our &lt;a href="https://www.glukhov.org/developer-tools/database-tools/dbeaver-vs-beekeeper/" rel="noopener noreferrer"&gt;DBeaver vs Beekeeper guide&lt;/a&gt; to track performance metrics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implement tenant context middleware&lt;/strong&gt;: Centralize tenant extraction and validation. Your ORM choice affects how you implement this—see our &lt;a href="https://www.glukhov.org/app-architecture/data-access/comparing-go-orms-gorm-ent-bun-sqlc/" rel="noopener noreferrer"&gt;Go ORMs comparison&lt;/a&gt; for different approaches.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use connection pooling&lt;/strong&gt;: Efficiently manage database connections. PostgreSQL-specific connection pooling strategies are covered in our &lt;a href="https://www.glukhov.org/developer-tools/database-tools/postgresql-cheatsheet/" rel="noopener noreferrer"&gt;PostgreSQL Cheatsheet&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plan for tenant migration&lt;/strong&gt;: Keep the ability to move tenants between patterns as their needs change.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implement soft delete&lt;/strong&gt;: Use &lt;code&gt;deleted_at&lt;/code&gt; instead of hard deletes for tenant data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit everything&lt;/strong&gt;: Log all tenant data access for compliance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test isolation&lt;/strong&gt;: Run regular security audits and cross-tenant access tests to catch data leakage.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Choosing the right multi-tenancy database pattern depends on your specific requirements for isolation, cost, scalability, and operational complexity. The Shared Database, Shared Schema pattern works well for most SaaS applications, while Separate Database per Tenant is necessary for enterprise customers with strict compliance needs.&lt;/p&gt;

&lt;p&gt;Start with the simplest pattern that meets your requirements, and plan for migration to a more isolated pattern as your needs evolve. Always prioritize security and data isolation, regardless of the pattern chosen.&lt;/p&gt;

&lt;h2&gt;
  
  
  Useful Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.postgresql.org/docs/current/ddl-rowsecurity.html" rel="noopener noreferrer"&gt;PostgreSQL Row-Level Security Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/apn/apn-multi-tenant-saas-database-architecture/" rel="noopener noreferrer"&gt;Multi-Tenant SaaS Database Architecture&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.microsoft.com/en-us/azure/sql-database/saas-tenancy-app-design-patterns" rel="noopener noreferrer"&gt;Designing Multi-Tenant Databases&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.glukhov.org/app-architecture/data-access/comparing-go-orms-gorm-ent-bun-sqlc/" rel="noopener noreferrer"&gt;Comparing Go ORMs for PostgreSQL: GORM vs Ent vs Bun vs sqlc&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.glukhov.org/developer-tools/database-tools/postgresql-cheatsheet/" rel="noopener noreferrer"&gt;PostgreSQL Cheatsheet: A Developer’s Quick Reference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.glukhov.org/developer-tools/database-tools/dbeaver-vs-beekeeper/" rel="noopener noreferrer"&gt;DBeaver vs Beekeeper - SQL Database Management Tools&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.glukhov.org/developer-tools/database-tools/sql-cheatsheet/" rel="noopener noreferrer"&gt;SQL Cheatsheet - most useful SQL commands&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>sql</category>
      <category>devops</category>
      <category>dev</category>
      <category>privacy</category>
    </item>
    <item>
      <title>Zettelkasten for Developers: A Practical Method That Works</title>
      <dc:creator>Rost</dc:creator>
      <pubDate>Mon, 25 May 2026 13:09:20 +0000</pubDate>
      <link>https://dev.to/rosgluk/zettelkasten-for-developers-a-practical-method-that-works-3ij</link>
      <guid>https://dev.to/rosgluk/zettelkasten-for-developers-a-practical-method-that-works-3ij</guid>
      <description>&lt;p&gt;Developers do not usually suffer from a lack of information. We suffer from too much of it.&lt;/p&gt;

&lt;p&gt;There are API docs, pull requests, production incidents, design discussions, meeting notes, architecture diagrams, code comments, Slack threads, research papers, experiments, bookmarks, and half-finished ideas sitting in five different tools. The hard part is not saving information. The hard part is turning it into reusable thinking.&lt;/p&gt;

&lt;p&gt;That is where Zettelkasten becomes useful.&lt;/p&gt;

&lt;p&gt;A Zettelkasten is often described as a note-taking system, but that undersells it. Used well, it is a &lt;a href="https://www.glukhov.org/knowledge-management/" rel="noopener noreferrer"&gt;personal knowledge system&lt;/a&gt; for developing ideas over time. For developers, it can become a practical bridge between code, architecture, debugging, learning, and writing.&lt;/p&gt;

&lt;p&gt;The opinionated part is this: most developers should not use Zettelkasten as a romantic productivity hobby. Do not build a beautiful note museum. Build a working system that helps you solve problems, explain systems, and make better engineering decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is Zettelkasten?
&lt;/h2&gt;

&lt;p&gt;Zettelkasten means "slip box". The method is associated with sociologist Niklas Luhmann, who used a large collection of linked notes to develop ideas and write extensively.&lt;/p&gt;

&lt;p&gt;The important lesson is not that he used paper cards. The important lesson is that his notes were not isolated files. Each note had a clear idea, a place in the system, and links to other notes. Over time, the system became more valuable because connections accumulated.&lt;/p&gt;

&lt;p&gt;For developers, the modern version is simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Write one useful idea per note.&lt;/li&gt;
&lt;li&gt;Link it to related notes.&lt;/li&gt;
&lt;li&gt;Use those links to grow explanations, decisions, patterns, and articles.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That is it. The rest is implementation detail.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Developers Struggle With Knowledge Overload
&lt;/h2&gt;

&lt;p&gt;Software development creates knowledge that is both detailed and temporary.&lt;/p&gt;

&lt;p&gt;You learn why a cache invalidation bug happened. You discover a weird edge case in a framework. You compare two queueing strategies. You debug a production outage. You understand why a legacy service behaves strangely. You read a great article about distributed tracing.&lt;/p&gt;

&lt;p&gt;Then, two months later, you vaguely remember that you once knew the answer.&lt;/p&gt;

&lt;p&gt;The usual developer knowledge stack makes this worse:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bookmarks store sources, not understanding.&lt;/li&gt;
&lt;li&gt;Folders force early categorization.&lt;/li&gt;
&lt;li&gt;Wikis become stale when nobody owns them.&lt;/li&gt;
&lt;li&gt;TODO lists mix tasks with ideas.&lt;/li&gt;
&lt;li&gt;Code comments explain local details, not broader concepts.&lt;/li&gt;
&lt;li&gt;Chat messages disappear into history.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A Zettelkasten helps because it treats knowledge as a network, not a warehouse. If that framing sounds familiar from reading about &lt;a href="https://www.glukhov.org/knowledge-management/foundations/second-brain/" rel="noopener noreferrer"&gt;building a second brain&lt;/a&gt;, that is not a coincidence — both methods attack the same gap between capture and reuse, but Zettelkasten's discipline of atomic notes and explicit links gives developers a more granular handle on technical ideas.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Principles of Zettelkasten
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Atomic Notes
&lt;/h3&gt;

&lt;p&gt;An atomic note contains one idea.&lt;/p&gt;

&lt;p&gt;Not one topic. Not one article summary. Not one giant page called "PostgreSQL". One idea.&lt;/p&gt;

&lt;p&gt;For example, these are too broad:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PostgreSQL notes
Kubernetes
Caching
System design
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;These are closer to atomic:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Partial indexes reduce write overhead when queries target a small subset
Kubernetes readiness probes protect traffic routing, not container startup
Write-through caching improves consistency but increases write latency
Idempotency keys turn retries into safe operations
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Atomic notes are powerful because they are easier to link. A huge page can only be linked as a vague topic. A focused note can be connected to an exact concept, decision, bug, or system.&lt;/p&gt;

&lt;p&gt;A good developer note should usually answer one of these questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is the idea?&lt;/li&gt;
&lt;li&gt;When does it matter?&lt;/li&gt;
&lt;li&gt;What tradeoff does it expose?&lt;/li&gt;
&lt;li&gt;Where have I seen it in real code?&lt;/li&gt;
&lt;li&gt;What other concept does it connect to?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Linking
&lt;/h3&gt;

&lt;p&gt;Links are the heart of the system.&lt;/p&gt;

&lt;p&gt;The point is not to create a pretty graph. The point is to make ideas reusable.&lt;/p&gt;

&lt;p&gt;When you write a note about idempotency keys, link it to notes about retries, distributed systems, payment processing, message queues, API design, and incident prevention. When you write a note about database migrations, link it to deploy safety, rollback strategy, backward compatibility, and feature flags.&lt;/p&gt;

&lt;p&gt;A link should usually mean one of these things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"This explains the same concept from another angle."&lt;/li&gt;
&lt;li&gt;"This is a practical example of the idea."&lt;/li&gt;
&lt;li&gt;"This is a tradeoff or counterpoint."&lt;/li&gt;
&lt;li&gt;"This concept depends on that concept."&lt;/li&gt;
&lt;li&gt;"This note belongs in a larger argument."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Avoid lazy links. Linking every note to every other note creates noise. The best links are intentional.&lt;/p&gt;

&lt;h3&gt;
  
  
  Emergence
&lt;/h3&gt;

&lt;p&gt;Emergence is the part of Zettelkasten that sounds mystical, but it is practical.&lt;/p&gt;

&lt;p&gt;You do not need to design the perfect structure upfront. You add useful notes, connect them honestly, and let clusters appear over time.&lt;/p&gt;

&lt;p&gt;After a few months, you may notice that many notes connect around topics like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API reliability&lt;/li&gt;
&lt;li&gt;Observability&lt;/li&gt;
&lt;li&gt;Developer experience&lt;/li&gt;
&lt;li&gt;Event-driven architecture&lt;/li&gt;
&lt;li&gt;Database performance&lt;/li&gt;
&lt;li&gt;Technical debt&lt;/li&gt;
&lt;li&gt;Documentation&lt;/li&gt;
&lt;li&gt;Security reviews&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those clusters become future articles, internal docs, design principles, conference talks, onboarding material, or better engineering decisions.&lt;/p&gt;

&lt;p&gt;This is why Zettelkasten is different from a folder hierarchy. Folders ask you to decide where knowledge belongs before you fully understand it. Links let knowledge belong to multiple contexts.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Developer Adaptation Of Zettelkasten
&lt;/h2&gt;

&lt;p&gt;Classic Zettelkasten advice often comes from academic writing — the &lt;a href="https://www.glukhov.org/knowledge-management/foundations/personal-knowledge-management/" rel="noopener noreferrer"&gt;personal knowledge management&lt;/a&gt; literature covers that tradition well. Developers need a slightly different version.&lt;/p&gt;

&lt;p&gt;A developer Zettelkasten should connect three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Concepts&lt;/li&gt;
&lt;li&gt;Code&lt;/li&gt;
&lt;li&gt;Systems&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Concepts
&lt;/h3&gt;

&lt;p&gt;Concept notes explain reusable ideas.&lt;/p&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Backpressure prevents fast producers from overwhelming slow consumers
Optimistic locking detects conflicting writes without blocking readers
Circuit breakers protect dependencies from repeated failing calls
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;These notes should be written in your own words. Copying documentation is not enough. The value comes from forcing yourself to explain the concept clearly.&lt;/p&gt;

&lt;p&gt;A useful concept note can include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A short explanation&lt;/li&gt;
&lt;li&gt;A concrete example&lt;/li&gt;
&lt;li&gt;A tradeoff&lt;/li&gt;
&lt;li&gt;A link to a related pattern&lt;/li&gt;
&lt;li&gt;A link to a real system where you used it&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Code
&lt;/h3&gt;

&lt;p&gt;Code notes capture practical implementation knowledge.&lt;/p&gt;

&lt;p&gt;They are not random snippet dumps. A snippet is useful only when it explains a decision or pattern.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;## Idempotent request handling with a database constraint

The safest implementation is often a unique constraint on the idempotency key.
The application can retry safely because duplicate requests resolve to the same
stored result instead of creating a second side effect.

Related:

- [[Retries need idempotent operations]]
- [[Database constraints are concurrency control]]
- [[Payment APIs should treat network failure as unknown outcome]]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Good code notes explain why the code works, when to use it, and what can go wrong.&lt;/p&gt;

&lt;h3&gt;
  
  
  Systems
&lt;/h3&gt;

&lt;p&gt;System notes connect abstract ideas to your actual architecture.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The billing service uses idempotency keys because payment provider calls may
succeed even when our HTTP client times out.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This note can link to:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Idempotency keys turn retries into safe operations
Timeouts do not prove failure
Payment APIs should model unknown outcomes
Outbox pattern separates database writes from external side effects
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This is where Zettelkasten becomes valuable for senior engineering work. It helps you build a memory of why systems are shaped the way they are.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Practical Workflow
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Capture Fleeting Notes
&lt;/h3&gt;

&lt;p&gt;A fleeting note is a rough capture. It does not need to be polished.&lt;/p&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Look into why readiness probe failed during deploy.
Maybe retries made the duplicate invoice bug worse.
Good quote from incident review: timeout is not failure.
Research: Postgres partial index for active rows only.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Use whatever is fastest: Obsidian daily note, Logseq journal, a text file, mobile notes, or a scratch buffer.&lt;/p&gt;

&lt;p&gt;The rule is simple: capture quickly, process later.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Process Notes Into Permanent Notes
&lt;/h3&gt;

&lt;p&gt;Processing is where the value appears.&lt;/p&gt;

&lt;p&gt;Turn rough notes into clear, reusable notes. Rewrite in your own words. Give each note a title that states the idea.&lt;/p&gt;

&lt;p&gt;Bad title:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Retries
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Better title:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Retries are safe only when the operation is idempotent
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Bad note:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Need idempotency for retries.
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Better note:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Retries can turn a temporary network problem into duplicate side effects.
A retry is safe only when the operation can run more than once and still
produce the same business result. For APIs, this often requires an
idempotency key, a unique constraint, or a stored request result.
&lt;/code&gt;&lt;/pre&gt;

&lt;h3&gt;
  
  
  Step 3: Add Links While The Context Is Fresh
&lt;/h3&gt;

&lt;p&gt;After writing the note, ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What does this explain?&lt;/li&gt;
&lt;li&gt;What does this depend on?&lt;/li&gt;
&lt;li&gt;Where have I seen this in code?&lt;/li&gt;
&lt;li&gt;What is the opposite view?&lt;/li&gt;
&lt;li&gt;What system would benefit from this?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Add only the links that help future you think.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Create Index Notes Or Maps Of Content
&lt;/h3&gt;

&lt;p&gt;Once a cluster grows, create an index note.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# API Reliability

## Core ideas

- [[Retries are safe only when the operation is idempotent]]
- [[Timeouts do not prove failure]]
- [[Circuit breakers reduce pressure on failing dependencies]]
- [[Rate limits protect shared resources]]

## Implementation patterns

- [[Idempotency keys turn retries into safe operations]]
- [[Outbox pattern separates persistence from delivery]]
- [[Dead letter queues preserve failed messages for inspection]]

## System examples

- [[Billing service payment retry design]]
- [[Webhook delivery failure handling]]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This gives you navigation without forcing everything into folders.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Use Notes To Produce Output
&lt;/h3&gt;

&lt;p&gt;A Zettelkasten should produce something.&lt;/p&gt;

&lt;p&gt;For developers, output can be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Architecture decision records&lt;/li&gt;
&lt;li&gt;Design documents&lt;/li&gt;
&lt;li&gt;Blog posts&lt;/li&gt;
&lt;li&gt;Debugging guides&lt;/li&gt;
&lt;li&gt;Onboarding docs&lt;/li&gt;
&lt;li&gt;Pull request explanations&lt;/li&gt;
&lt;li&gt;Internal talks&lt;/li&gt;
&lt;li&gt;Refactoring plans&lt;/li&gt;
&lt;li&gt;Incident review insights&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your notes never influence your work, the system is too decorative.&lt;/p&gt;

&lt;h2&gt;
  
  
  Recommended Note Types For Developers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Fleeting Notes
&lt;/h3&gt;

&lt;p&gt;Temporary notes for quick capture.&lt;/p&gt;

&lt;p&gt;Use them for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ideas during coding&lt;/li&gt;
&lt;li&gt;Debugging observations&lt;/li&gt;
&lt;li&gt;Meeting fragments&lt;/li&gt;
&lt;li&gt;Questions&lt;/li&gt;
&lt;li&gt;Bookmarks to process later&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Delete or convert them quickly. Do not let them become a swamp.&lt;/p&gt;

&lt;h3&gt;
  
  
  Literature Notes
&lt;/h3&gt;

&lt;p&gt;Notes about external sources.&lt;/p&gt;

&lt;p&gt;For developers, a source can be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Documentation&lt;/li&gt;
&lt;li&gt;Blog article&lt;/li&gt;
&lt;li&gt;RFC&lt;/li&gt;
&lt;li&gt;Source code&lt;/li&gt;
&lt;li&gt;Conference talk&lt;/li&gt;
&lt;li&gt;GitHub issue&lt;/li&gt;
&lt;li&gt;Postmortem&lt;/li&gt;
&lt;li&gt;Book chapter&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Keep source notes separate from your own permanent notes. A source note says, "This source said this." A permanent note says, "I understand this idea this way."&lt;/p&gt;

&lt;h3&gt;
  
  
  Permanent Notes
&lt;/h3&gt;

&lt;p&gt;These are the core of the Zettelkasten.&lt;/p&gt;

&lt;p&gt;A permanent note should be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Atomic&lt;/li&gt;
&lt;li&gt;Written in your own words&lt;/li&gt;
&lt;li&gt;Linked to related notes&lt;/li&gt;
&lt;li&gt;Useful without needing the original source&lt;/li&gt;
&lt;li&gt;Stable enough to revisit later&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Project Notes
&lt;/h3&gt;

&lt;p&gt;Project notes are allowed, but do not confuse them with permanent notes.&lt;/p&gt;

&lt;p&gt;A project note might be:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Migrate billing worker to queue v2
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;It can link to permanent notes like:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Backpressure prevents queue consumers from collapsing
Outbox pattern separates persistence from delivery
Feature flags reduce deployment risk
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Projects end. Concepts stay.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tool Examples
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Obsidian
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.glukhov.org/knowledge-management/tools/obsidian-for-personal-knowledge-management/" rel="noopener noreferrer"&gt;Obsidian&lt;/a&gt; works well for developer Zettelkasten because it uses local Markdown files and supports internal links.&lt;/p&gt;

&lt;p&gt;A simple Obsidian structure:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;notes/
  fleeting/
  sources/
  permanent/
  maps/
  projects/
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Example note:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Timeouts do not prove failure

A timeout means the client stopped waiting. It does not prove the server failed.
The operation may have succeeded, failed, or still be running.

This matters for payment APIs, job queues, and any external side effect.

Related:

- [[Retries are safe only when the operation is idempotent]]
- [[Idempotency keys turn retries into safe operations]]
- [[External side effects need reconciliation]]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Obsidian is a good fit if you like file ownership, plain text, and editor-like workflows.&lt;/p&gt;

&lt;h3&gt;
  
  
  Logseq
&lt;/h3&gt;

&lt;p&gt;Logseq is useful if you prefer outlining, daily journals, and block-level references.&lt;/p&gt;

&lt;p&gt;Its block model works well for capturing small units of thought. You can write rough notes in the journal, then promote useful blocks into permanent notes.&lt;/p&gt;

&lt;p&gt;Example Logseq-style workflow:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;- Timeout during payment request does not prove payment failure.
  - This should become a permanent note about unknown outcomes.
  - Related: [[Idempotency]], [[Retries]], [[Payment APIs]]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Logseq is a good fit if your thinking starts as outlines and you like block references. For a side-by-side comparison of both tools across workflow style, sync options, and plugin ecosystems, &lt;a href="https://www.glukhov.org/knowledge-management/tools/obsidian-vs-logseq-comparison/" rel="noopener noreferrer"&gt;Obsidian vs Logseq&lt;/a&gt; maps the trade-offs clearly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Plain Markdown And Git
&lt;/h3&gt;

&lt;p&gt;You do not need a special app.&lt;/p&gt;

&lt;p&gt;A Git repository of Markdown files can be enough:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;knowledge/
  permanent/
  sources/
  maps/
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Use normal Markdown links:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[Retries are safe only when operations are idempotent](../permanent/retries-safe-only-with-idempotency.md)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This approach is boring, durable, and developer-friendly. That is a compliment.&lt;/p&gt;
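
&lt;p&gt;Plain links have one cost: nothing warns you when a file is renamed. As a minimal sketch, assuming notes end in &lt;code&gt;.md&lt;/code&gt; and use relative links like the one above, a short Python script can list links whose target file no longer exists. The name &lt;code&gt;find_broken_links&lt;/code&gt; and the regular expression are illustrative assumptions, not part of any tool, and the pattern only covers simple inline links.&lt;/p&gt;

```python
import re
from pathlib import Path

# Captures the target of [text](target); a simplification of Markdown link syntax.
LINK_PATTERN = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def find_broken_links(root):
    """Return (note, target) pairs for relative .md links that point at missing files."""
    root = Path(root)
    broken = []
    for note in sorted(root.rglob("*.md")):
        text = note.read_text(encoding="utf-8")
        for target in LINK_PATTERN.findall(text):
            # Skip web links and in-page anchors; only local files are checked.
            if "://" in target or target.startswith("#"):
                continue
            path_part = target.split("#", 1)[0]
            if not path_part.endswith(".md"):
                continue
            # Relative links resolve against the folder of the note that contains them.
            if not (note.parent / path_part).resolve().exists():
                broken.append((note.relative_to(root).as_posix(), target))
    return broken
```

&lt;p&gt;Calling &lt;code&gt;find_broken_links("knowledge")&lt;/code&gt; from the repository root returns an empty list when every relative link resolves, so it can run as a quick check before a commit.&lt;/p&gt;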

&lt;h2&gt;
  
  
  Naming Notes
&lt;/h2&gt;

&lt;p&gt;Prefer titles that make claims.&lt;/p&gt;

&lt;p&gt;Weak titles:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Caching
Queues
OAuth
PostgreSQL indexes
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Strong titles:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Cache invalidation is a coordination problem
Queues hide latency but do not remove work
OAuth access tokens should be short lived
Partial indexes are useful when queries target a subset
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A claim-based title makes the note easier to understand and easier to link.&lt;/p&gt;

&lt;h2&gt;
  
  
  What To Put In A Developer Zettelkasten
&lt;/h2&gt;

&lt;p&gt;Good candidates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Architecture principles&lt;/li&gt;
&lt;li&gt;Debugging lessons&lt;/li&gt;
&lt;li&gt;Production incident insights&lt;/li&gt;
&lt;li&gt;API design rules&lt;/li&gt;
&lt;li&gt;Database patterns&lt;/li&gt;
&lt;li&gt;Security assumptions&lt;/li&gt;
&lt;li&gt;Performance tradeoffs&lt;/li&gt;
&lt;li&gt;Framework edge cases&lt;/li&gt;
&lt;li&gt;Refactoring heuristics&lt;/li&gt;
&lt;li&gt;Testing strategies&lt;/li&gt;
&lt;li&gt;Deployment lessons&lt;/li&gt;
&lt;li&gt;Code review patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Poor candidates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Raw meeting transcripts&lt;/li&gt;
&lt;li&gt;Unprocessed bookmarks&lt;/li&gt;
&lt;li&gt;Huge copied documentation pages&lt;/li&gt;
&lt;li&gt;Random snippets with no explanation&lt;/li&gt;
&lt;li&gt;Task lists&lt;/li&gt;
&lt;li&gt;Secrets&lt;/li&gt;
&lt;li&gt;Credentials&lt;/li&gt;
&lt;li&gt;Anything that belongs in official company documentation only&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A personal Zettelkasten can reference work, but it should not become an unsafe shadow copy of private systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Mistakes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Mistake 1: Over-Structuring Too Early
&lt;/h3&gt;

&lt;p&gt;Developers love structure. That is sometimes a problem.&lt;/p&gt;

&lt;p&gt;Do not spend the first week designing folders, tags, templates, naming conventions, dashboards, and automation. You do not yet know what structure your notes need.&lt;/p&gt;

&lt;p&gt;Start with a small number of note types:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;fleeting
sources
permanent
maps
projects
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Let complexity earn its place.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mistake 2: Treating It Like Folders
&lt;/h3&gt;

&lt;p&gt;A Zettelkasten is not a better folder tree.&lt;/p&gt;

&lt;p&gt;If every note belongs to exactly one folder and has no meaningful links, you have built a filing cabinet. That may still be useful, but it is not a Zettelkasten.&lt;/p&gt;

&lt;p&gt;The value comes from connections:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;API retries -&amp;gt; idempotency -&amp;gt; database constraints -&amp;gt; payment safety -&amp;gt; incident prevention
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;That chain is more useful than a folder called "Backend".&lt;/p&gt;

&lt;h3&gt;
  
  
  Mistake 3: Saving Instead Of Thinking
&lt;/h3&gt;

&lt;p&gt;Copying is not learning.&lt;/p&gt;

&lt;p&gt;A saved paragraph from documentation may help later, but a rewritten explanation helps now. The act of restating an idea in your own words is where understanding improves.&lt;/p&gt;

&lt;p&gt;A good rule:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Do not create a permanent note until you can explain the idea without copying.
&lt;/code&gt;&lt;/pre&gt;

&lt;h3&gt;
  
  
  Mistake 4: Linking Everything
&lt;/h3&gt;

&lt;p&gt;Too many links are as bad as too few.&lt;/p&gt;

&lt;p&gt;Do not link words just because they exist. Link ideas because the relationship matters.&lt;/p&gt;

&lt;p&gt;A useful link should help future you answer:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Why is this connected?
&lt;/code&gt;&lt;/pre&gt;

&lt;h3&gt;
  
  
  Mistake 5: Confusing Tags With Structure
&lt;/h3&gt;

&lt;p&gt;Tags are useful for status and broad grouping:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;#todo
#source
#security
#draft
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;But tags should not carry the whole system. If you rely only on tags, you lose the richer meaning of direct links.&lt;/p&gt;

&lt;p&gt;A link says:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;This idea relates to that idea in a specific way.
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A tag usually says:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;This belongs to a broad bucket.
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Both are useful. They are not the same.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mistake 6: Never Producing Output
&lt;/h3&gt;

&lt;p&gt;A Zettelkasten that never produces output becomes a private archive.&lt;/p&gt;

&lt;p&gt;Output does not have to mean public writing. It can be a design doc, an incident review, a better pull request, or a clear explanation to a teammate.&lt;/p&gt;

&lt;p&gt;The system should make your thinking easier to reuse.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Minimal Template
&lt;/h2&gt;

&lt;p&gt;Use a small template. Resist the urge to create a form with fifteen fields.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Title as a claim

## Idea

Explain the idea in your own words.

## Why it matters

Describe the practical impact.

## Example

Show a code, system, or debugging example.

## Tradeoffs

Mention limits, risks, or counterpoints.

## Links

- [[Related note]]
- [[Another related note]]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;For many notes, even this is too much. A title, a paragraph, and three links can be enough.&lt;/p&gt;
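
&lt;p&gt;If you create notes from the command line, the template above is small enough to generate. The sketch below is one hypothetical way to do it in Python. The names &lt;code&gt;slugify&lt;/code&gt; and &lt;code&gt;render_note&lt;/code&gt; are invented for this example, and the file naming rule (lowercase words joined by hyphens) is an assumption rather than a requirement of the method.&lt;/p&gt;

```python
import re

# Mirrors the minimal template; the section bodies are prompts to overwrite.
TEMPLATE = """# {title}

## Idea

Explain the idea in your own words.

## Why it matters

Describe the practical impact.

## Example

Show a code, system, or debugging example.

## Tradeoffs

Mention limits, risks, or counterpoints.

## Links

{links}
"""


def slugify(title):
    """Turn a claim-style title into a lowercase, hyphen-separated file name stem."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def render_note(title, links=()):
    """Fill the template with a title and a list of related note titles."""
    link_lines = "\n".join(f"- [[{link}]]" for link in links)
    return TEMPLATE.format(title=title, links=link_lines)
```

&lt;p&gt;A caller would write the result of &lt;code&gt;render_note(title, links)&lt;/code&gt; to a file named &lt;code&gt;slugify(title) + ".md"&lt;/code&gt;. Keeping the title as the only required input matches the advice to write the claim first and fill in the rest later.&lt;/p&gt;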

&lt;h2&gt;
  
  
  Example: From Bug To Zettelkasten Notes
&lt;/h2&gt;

&lt;p&gt;Imagine you fixed a bug where users were charged twice after a timeout.&lt;/p&gt;

&lt;p&gt;A weak note would be:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Payment bug - retries caused duplicate charge.
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A stronger set of notes might be:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Timeouts do not prove failure
Retries are safe only when the operation is idempotent
Idempotency keys turn retries into safe operations
Payment APIs should model unknown outcomes
Database constraints are concurrency control
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now the bug has become reusable engineering knowledge.&lt;/p&gt;

&lt;p&gt;Later, those notes can support:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A postmortem&lt;/li&gt;
&lt;li&gt;A design doc for payment retries&lt;/li&gt;
&lt;li&gt;A blog post about idempotency&lt;/li&gt;
&lt;li&gt;A checklist for external API integrations&lt;/li&gt;
&lt;li&gt;A code review comment&lt;/li&gt;
&lt;li&gt;A safer implementation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the practical value of Zettelkasten.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Weekly Maintenance Routine
&lt;/h2&gt;

&lt;p&gt;You do not need a complicated review process.&lt;/p&gt;

&lt;p&gt;Once a week:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Process rough notes.&lt;/li&gt;
&lt;li&gt;Delete notes that no longer matter.&lt;/li&gt;
&lt;li&gt;Convert useful ideas into permanent notes.&lt;/li&gt;
&lt;li&gt;Add missing links.&lt;/li&gt;
&lt;li&gt;Promote clusters into map notes.&lt;/li&gt;
&lt;li&gt;Pick one note and turn it into output.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Keep it lightweight. The system should support development, not compete with it.&lt;/p&gt;
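
&lt;p&gt;The "add missing links" step is easier when you can see which notes are isolated. Here is a rough sketch, assuming wiki-style &lt;code&gt;[[Title]]&lt;/code&gt; links where each title matches a file name without the &lt;code&gt;.md&lt;/code&gt; extension. &lt;code&gt;find_unlinked_notes&lt;/code&gt; is a hypothetical helper, not a feature of Obsidian or Logseq.&lt;/p&gt;

```python
import re
from pathlib import Path

# Captures the note title inside [[Title]], ignoring any |alias or #heading suffix.
WIKI_LINK = re.compile(r"\[\[([^\]|#]+)")


def find_unlinked_notes(root):
    """Return titles of notes that neither link out nor are linked to."""
    notes = {
        path.stem: path.read_text(encoding="utf-8")
        for path in Path(root).rglob("*.md")
    }
    outgoing = {
        title: {target.strip() for target in WIKI_LINK.findall(text)}
        for title, text in notes.items()
    }
    # Every title that appears as a link target somewhere in the collection.
    linked_to = set().union(*outgoing.values())
    return sorted(
        title for title in notes
        if not outgoing[title] and title not in linked_to
    )
```

&lt;p&gt;The output is a candidate list, not a verdict. Some isolated notes deserve a link, and others are the low-value notes the routine tells you to delete.&lt;/p&gt;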

&lt;h2&gt;
  
  
  Practical Rules
&lt;/h2&gt;

&lt;p&gt;Use these rules to keep the system healthy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One idea per note.&lt;/li&gt;
&lt;li&gt;Write titles as claims.&lt;/li&gt;
&lt;li&gt;Prefer links over folders.&lt;/li&gt;
&lt;li&gt;Keep source notes separate from your own ideas.&lt;/li&gt;
&lt;li&gt;Connect notes to real code and real systems.&lt;/li&gt;
&lt;li&gt;Create map notes only when a cluster exists.&lt;/li&gt;
&lt;li&gt;Delete low-value notes.&lt;/li&gt;
&lt;li&gt;Do not automate before you understand your workflow.&lt;/li&gt;
&lt;li&gt;Use the system to produce something.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  When Zettelkasten Is Not Worth It
&lt;/h2&gt;

&lt;p&gt;Zettelkasten is not the answer to every problem.&lt;/p&gt;

&lt;p&gt;It may be overkill if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You only need a task manager.&lt;/li&gt;
&lt;li&gt;You rarely revisit technical ideas.&lt;/li&gt;
&lt;li&gt;You do not write, teach, design, or document.&lt;/li&gt;
&lt;li&gt;Your notes are mostly short-lived project details.&lt;/li&gt;
&lt;li&gt;You are using it to avoid doing the actual work.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is most useful when your work depends on compounding understanding.&lt;/p&gt;

&lt;p&gt;That includes senior engineering, architecture, technical leadership, debugging complex systems, writing, consulting, research, and learning deeply over many years.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;For developers, Zettelkasten is not about collecting notes. It is about building a thinking environment.&lt;/p&gt;

&lt;p&gt;The method works best when it stays practical: atomic notes, meaningful links, real examples, and regular output. Connect concepts to code. Connect code to systems. Connect systems to decisions.&lt;/p&gt;

&lt;p&gt;Do not try to build the perfect second brain. Build a useful one.&lt;/p&gt;

&lt;p&gt;A good developer Zettelkasten should help you answer better questions:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Where have I seen this problem before?
What concept explains this bug?
What tradeoff are we making?
What pattern applies here?
What should I write down so I do not relearn this again?
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;That is enough.&lt;/p&gt;

</description>
      <category>obsidian</category>
      <category>logseq</category>
      <category>knowledgemanagement</category>
    </item>
    <item>
      <title>OpenClaw vs Hermes Agent: Stars, Downloads &amp; Usage 2026</title>
      <dc:creator>Rost</dc:creator>
      <pubDate>Mon, 25 May 2026 13:09:06 +0000</pubDate>
      <link>https://dev.to/rosgluk/openclaw-vs-hermes-agent-stars-downloads-usage-2026-b07</link>
      <guid>https://dev.to/rosgluk/openclaw-vs-hermes-agent-stars-downloads-usage-2026-b07</guid>
      <description>&lt;p&gt;Open-source AI agent frameworks are exploding in popularity on GitHub.&lt;br&gt;
Two projects at the core of the &lt;a href="https://www.glukhov.org/ai-systems/" rel="noopener noreferrer"&gt;self-hosted AI systems&lt;/a&gt; ecosystem — &lt;strong&gt;OpenClaw&lt;/strong&gt; and &lt;strong&gt;Hermes Agent&lt;/strong&gt; — have pulled so far ahead that the rest of the field is fighting for a distant third place.&lt;/p&gt;

&lt;p&gt;Here is the full picture as of May 2026.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Leaderboard
&lt;/h2&gt;

&lt;p&gt;Star counts are live data fetched from the GitHub API on &lt;strong&gt;May 21, 2026&lt;/strong&gt;. Repos are sorted by current stars, descending.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Project&lt;/th&gt;
&lt;th&gt;GitHub repo&lt;/th&gt;
&lt;th&gt;Language&lt;/th&gt;
&lt;th&gt;Stars&lt;/th&gt;
&lt;th&gt;Releases last 30 days&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;OpenClaw&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;&lt;code&gt;openclaw/openclaw&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;TypeScript&lt;/td&gt;
&lt;td&gt;373,616&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;62&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Hermes Agent&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/NousResearch/hermes-agent" rel="noopener noreferrer"&gt;&lt;code&gt;NousResearch/hermes-agent&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;160,175&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Nanobot&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/HKUDS/nanobot" rel="noopener noreferrer"&gt;&lt;code&gt;HKUDS/nanobot&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;42,873&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;AstrBot&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/AstrBotDevs/AstrBot" rel="noopener noreferrer"&gt;&lt;code&gt;AstrBotDevs/AstrBot&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;32,709&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;11&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;ZeroClaw&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/zeroclaw-labs/zeroclaw" rel="noopener noreferrer"&gt;&lt;code&gt;zeroclaw-labs/zeroclaw&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Rust&lt;/td&gt;
&lt;td&gt;31,500&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;≥1&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;NanoClaw&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/nanocoai/nanoclaw" rel="noopener noreferrer"&gt;&lt;code&gt;nanocoai/nanoclaw&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;TypeScript&lt;/td&gt;
&lt;td&gt;29,143&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;≥1&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;PicoClaw&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/sipeed/picoclaw" rel="noopener noreferrer"&gt;&lt;code&gt;sipeed/picoclaw&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Go&lt;/td&gt;
&lt;td&gt;29,121&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;AionUi&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/iOfficeAI/AionUi" rel="noopener noreferrer"&gt;&lt;code&gt;iOfficeAI/AionUi&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;TypeScript&lt;/td&gt;
&lt;td&gt;26,025&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;≥3&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;NemoClaw&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/NVIDIA/NemoClaw" rel="noopener noreferrer"&gt;&lt;code&gt;NVIDIA/NemoClaw&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;TypeScript&lt;/td&gt;
&lt;td&gt;20,571&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;OpenFang&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/RightNow-AI/openfang" rel="noopener noreferrer"&gt;&lt;code&gt;RightNow-AI/openfang&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Rust&lt;/td&gt;
&lt;td&gt;17,599&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;≥5&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;LangBot&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/langbot-app/LangBot" rel="noopener noreferrer"&gt;&lt;code&gt;langbot-app/LangBot&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;16,084&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;memU&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/NevaMind-AI/memU" rel="noopener noreferrer"&gt;&lt;code&gt;NevaMind-AI/memU&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;13,672&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;IronClaw&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/nearai/ironclaw" rel="noopener noreferrer"&gt;&lt;code&gt;nearai/ironclaw&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Rust&lt;/td&gt;
&lt;td&gt;12,305&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;Moltworker&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/cloudflare/moltworker" rel="noopener noreferrer"&gt;&lt;code&gt;cloudflare/moltworker&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;TypeScript&lt;/td&gt;
&lt;td&gt;9,899&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;MemOS&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/MemTensor/MemOS" rel="noopener noreferrer"&gt;&lt;code&gt;MemTensor/MemOS&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;9,246&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;≥2&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;ClawWork&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/HKUDS/ClawWork" rel="noopener noreferrer"&gt;&lt;code&gt;HKUDS/ClawWork&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;8,111&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;17&lt;/td&gt;
&lt;td&gt;NullClaw&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/nullclaw/nullclaw" rel="noopener noreferrer"&gt;&lt;code&gt;nullclaw/nullclaw&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Zig&lt;/td&gt;
&lt;td&gt;7,603&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;MimicLaw&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/memovai/mimiclaw" rel="noopener noreferrer"&gt;&lt;code&gt;memovai/mimiclaw&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;C&lt;/td&gt;
&lt;td&gt;5,422&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;Moltis&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/moltis-org/moltis" rel="noopener noreferrer"&gt;&lt;code&gt;moltis-org/moltis&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Rust&lt;/td&gt;
&lt;td&gt;2,697&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;≥3&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;Clawra&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/SumeLabs/clawra" rel="noopener noreferrer"&gt;&lt;code&gt;SumeLabs/clawra&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;TypeScript&lt;/td&gt;
&lt;td&gt;2,298&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
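
&lt;p&gt;The article does not show how the counts were collected. One plausible approach, sketched here as an assumption, is the GitHub REST API repository endpoint, whose JSON response includes a &lt;code&gt;stargazers_count&lt;/code&gt; field. The function names are hypothetical, and unauthenticated requests are rate limited.&lt;/p&gt;

```python
import json
import urllib.request


def repo_api_url(owner, repo):
    """Build the REST endpoint for one repository."""
    return f"https://api.github.com/repos/{owner}/{repo}"


def star_count_from_payload(payload):
    """Read the star count from a decoded repository response."""
    return payload["stargazers_count"]


def fetch_star_count(owner, repo):
    """Fetch the current star count; needs network access."""
    request = urllib.request.Request(
        repo_api_url(owner, repo),
        headers={"Accept": "application/vnd.github+json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return star_count_from_payload(json.load(response))
```

&lt;p&gt;Because star counts change constantly, a value fetched today will differ from the snapshot in the table.&lt;/p&gt;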

&lt;h2&gt;
  
  
  OpenClaw: 373k Stars and Still Growing
&lt;/h2&gt;

&lt;p&gt;OpenClaw is a personal AI assistant framework built in TypeScript. It runs entirely on the user's own device and connects to over 50 messaging platforms — WhatsApp, Telegram, Slack, Discord, and more — through a single unified interface.&lt;/p&gt;

&lt;p&gt;The project launched in November 2025 but truly ignited on &lt;strong&gt;January 30, 2026&lt;/strong&gt;, reaching 100,000 stars within 48 hours of its relaunch. By April 2026 it had overtaken React to become the most-starred software repository in GitHub's history. At the time of this writing it sits at &lt;strong&gt;373,616 stars&lt;/strong&gt;, 72,000+ forks, and 360 contributors.&lt;/p&gt;

&lt;p&gt;The release cadence is extraordinary: &lt;strong&gt;62 tagged releases in the last 30 days&lt;/strong&gt; puts it in a category of its own in terms of iteration speed. The full arc of how OpenClaw grew from a weekend prototype to GitHub's most-starred repository — including the economics behind the viral spike and the April 2026 subscription cutoff that reshaped weekly growth — is detailed in the &lt;a href="https://www.glukhov.org/ai-systems/openclaw/openclaw-rise-and-fall-timeline/" rel="noopener noreferrer"&gt;OpenClaw rise and fall timeline&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hermes Agent: The Challenger
&lt;/h2&gt;

&lt;p&gt;Nous Research's Hermes Agent markets itself as "the agent that grows with you." It is a self-improving AI agent built in Python with a built-in learning loop — it creates new skills from experience, searches past conversations for relevant context, and can run on a range of infrastructure options from local hardware to cloud.&lt;/p&gt;

&lt;p&gt;Created in July 2025 and now at &lt;strong&gt;160,175 stars&lt;/strong&gt;, Hermes Agent recently surpassed OpenClaw as the world's most-used open-source AI agent &lt;strong&gt;by daily token processing&lt;/strong&gt; on OpenRouter — though OpenClaw still leads in cumulative all-time usage. The gap between the two in raw GitHub stars remains large (over 200k), but Hermes Agent's trajectory is notably steeper.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mid-field: The 20k–45k Band
&lt;/h2&gt;

&lt;p&gt;The third through eighth positions are all clustered between 26k and 43k stars, making ranking changes here frequent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Nanobot&lt;/strong&gt; (HKUDS, 42,873 ⭐) — Python, lightweight graph-based task orchestration from the HKU Data Science lab.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AstrBot&lt;/strong&gt; (AstrBotDevs, 32,709 ⭐) — Python, multi-platform chatbot framework with active release history (11 releases in the last 30 days).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ZeroClaw&lt;/strong&gt; (zeroclaw-labs, 31,500 ⭐) — Rust, systems-level agent runtime targeting low-latency deployments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NanoClaw&lt;/strong&gt; (nanocoai, 29,143 ⭐) — TypeScript, recently migrated from &lt;code&gt;qwibitai/nanoclaw&lt;/code&gt; to &lt;code&gt;nanocoai/nanoclaw&lt;/code&gt;; the rename caused a brief star-count gap in trackers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PicoClaw&lt;/strong&gt; (Sipeed, 29,121 ⭐) — Go, embedded-friendly agent framework. Only 22 stars separate it from NanoClaw.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AionUi&lt;/strong&gt; (iOfficeAI, 26,025 ⭐) — TypeScript, focuses on agentic UI generation with a visual workflow editor.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Language Breakdown
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Language&lt;/th&gt;
&lt;th&gt;Repos in top 20&lt;/th&gt;
&lt;th&gt;Total stars&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;294,897&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TypeScript&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;470,155&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rust&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;51,502&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Go&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;29,121&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Zig&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;7,603&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;5,422&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;TypeScript leads in total star weight — largely because of OpenClaw itself — while Python holds the most individual projects. Rust is carving out a niche in the performance-sensitive tier (ZeroClaw, OpenFang, IronClaw).&lt;/p&gt;
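
&lt;p&gt;The shares behind that statement can be recomputed from the table. What follows is a minimal Python sketch that uses only the figures quoted above; the variable and function names are illustrative and do not come from any tracker's API.&lt;/p&gt;

```python
# Figures copied from the language breakdown table: (repos, total stars).
LANGUAGE_STATS = {
    "Python": (8, 294_897),
    "TypeScript": (7, 470_155),
    "Rust": (3, 51_502),
    "Go": (1, 29_121),
    "Zig": (1, 7_603),
    "C": (1, 5_422),
}

def star_shares(stats):
    """Return each language's share of all listed stars, as a percentage."""
    total = sum(stars for _, stars in stats.values())
    return {lang: 100 * stars / total for lang, (_, stars) in stats.items()}

def stars_per_repo(stats):
    """Return the mean star count per repository for each language."""
    return {lang: stars / repos for lang, (repos, stars) in stats.items()}

shares = star_shares(LANGUAGE_STATS)
per_repo = stars_per_repo(LANGUAGE_STATS)

# Print languages from largest to smallest share of stars.
for lang in sorted(shares, key=shares.get, reverse=True):
    print(f"{lang:12} {shares[lang]:5.1f} % of stars, {per_repo[lang]:9,.0f} stars per repo")
```

&lt;p&gt;The per-repository mean makes the skew visible: a single very large TypeScript repository accounts for most of that language's total, so the mean says little about a typical TypeScript project in the list.&lt;/p&gt;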

&lt;h2&gt;
  
  
  Release Velocity vs Star Count
&lt;/h2&gt;

&lt;p&gt;High star counts do not always mean high release velocity. Several top-starred repos (NemoClaw, memU, ClawWork, Clawra, MimicLaw) show zero releases in the last 30 days — they may be in maintenance mode or experiencing slower development cycles.&lt;/p&gt;

&lt;p&gt;AstrBot stands out in the mid-field with 11 releases in 30 days, suggesting active feature development. OpenFang (≥5) and Moltis (≥3) are also moving quickly relative to their star counts, which may signal emerging momentum.&lt;/p&gt;

&lt;h2&gt;
  
  
  Notable Moves Since Last Snapshot
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;NanoClaw&lt;/strong&gt; renamed org from &lt;code&gt;qwibitai&lt;/code&gt; to &lt;code&gt;nanocoai&lt;/code&gt;; updated link in the table above.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NemoClaw&lt;/strong&gt; language corrected to TypeScript (previously listed as JavaScript in older data).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AionUi&lt;/strong&gt; gained ~800 stars, consolidating its hold on 8th position.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MemOS&lt;/strong&gt; crossed 9,000 stars.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  OpenRouter Usage Rankings
&lt;/h2&gt;

&lt;p&gt;GitHub stars measure mindshare; OpenRouter token volume measures actual runtime usage. The two charts tell different stories.&lt;/p&gt;

&lt;p&gt;The table below shows the &lt;strong&gt;global daily ranking on OpenRouter&lt;/strong&gt; as of &lt;strong&gt;May 21, 2026&lt;/strong&gt;, filtered to apps and agents that have opted into usage attribution. Counts are daily tokens processed through the platform.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;App / Agent&lt;/th&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Daily tokens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Hermes Agent&lt;/td&gt;
&lt;td&gt;Personal / CLI Agents&lt;/td&gt;
&lt;td&gt;458 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;OpenClaw&lt;/td&gt;
&lt;td&gt;Personal / CLI Agents&lt;/td&gt;
&lt;td&gt;173 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Kilo Code&lt;/td&gt;
&lt;td&gt;CLI / IDE Agents&lt;/td&gt;
&lt;td&gt;163 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Descript&lt;/td&gt;
&lt;td&gt;Video Generation&lt;/td&gt;
&lt;td&gt;68.1 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Claude Code&lt;/td&gt;
&lt;td&gt;CLI Agents&lt;/td&gt;
&lt;td&gt;64.1 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;pi&lt;/td&gt;
&lt;td&gt;CLI Agents&lt;/td&gt;
&lt;td&gt;58 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Janitor AI&lt;/td&gt;
&lt;td&gt;Roleplay&lt;/td&gt;
&lt;td&gt;28.4 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;ISEKAI ZERO&lt;/td&gt;
&lt;td&gt;Game&lt;/td&gt;
&lt;td&gt;26.8 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;CSS AI Pro&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;25.4 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;Cline&lt;/td&gt;
&lt;td&gt;IDE / CLI Agents&lt;/td&gt;
&lt;td&gt;23.5 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;Roo Code&lt;/td&gt;
&lt;td&gt;IDE / Cloud Agents&lt;/td&gt;
&lt;td&gt;20.1 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;Lemonade&lt;/td&gt;
&lt;td&gt;Programming App&lt;/td&gt;
&lt;td&gt;20 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;Mira&lt;/td&gt;
&lt;td&gt;Personal Agents&lt;/td&gt;
&lt;td&gt;15.2 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;VidMuse&lt;/td&gt;
&lt;td&gt;Video Generation&lt;/td&gt;
&lt;td&gt;13.3 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;AA-LCR Benchmark&lt;/td&gt;
&lt;td&gt;Research&lt;/td&gt;
&lt;td&gt;9.42 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;SillyTavern&lt;/td&gt;
&lt;td&gt;Roleplay&lt;/td&gt;
&lt;td&gt;7.84 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;OpenHands&lt;/td&gt;
&lt;td&gt;CLI Agents&lt;/td&gt;
&lt;td&gt;7.21 B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;Nous Research API&lt;/td&gt;
&lt;td&gt;General Chat&lt;/td&gt;
&lt;td&gt;6.63 B&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;Gaps in rank numbers (e.g., no 7 or 17) reflect apps without public attribution at the time of retrieval; OpenRouter lists 60 apps in total.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  All-time cumulative tokens
&lt;/h3&gt;

&lt;p&gt;The daily flip has now carried over to the cumulative chart. As of May 21, Hermes Agent has also overtaken OpenClaw in all-time tokens, a milestone it crossed some time after the May 10 daily flip.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;App / Agent&lt;/th&gt;
&lt;th&gt;All-time tokens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Hermes Agent&lt;/td&gt;
&lt;td&gt;8.14 T&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenClaw&lt;/td&gt;
&lt;td&gt;7.18 T&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kilo Code&lt;/td&gt;
&lt;td&gt;5.21 T&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Code&lt;/td&gt;
&lt;td&gt;2.6 T&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  What the numbers mean
&lt;/h3&gt;

&lt;p&gt;The gap between Hermes Agent (458 B daily) and OpenClaw (173 B) is now wider than it was on May 10, when the flip first happened at 224 B vs 186 B. Hermes has more than doubled its daily volume in 11 days; OpenClaw's daily volume has declined.&lt;/p&gt;
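
&lt;p&gt;The arithmetic behind "more than doubled" is easy to check. The sketch below takes the four daily volumes quoted above and derives the growth factor, the implied average daily rate, and the size of the gap on each date. The compound daily rate is an assumption made for illustration; nothing in the OpenRouter data says growth was smooth between the two snapshots.&lt;/p&gt;

```python
# Daily OpenRouter token volumes in billions, as quoted in the text.
MAY_10 = {"Hermes Agent": 224, "OpenClaw": 186}
MAY_21 = {"Hermes Agent": 458, "OpenClaw": 173}
DAYS = 11

def growth_factor(before, after):
    """Ratio of the later volume to the earlier one."""
    return after / before

def daily_rate(before, after, days):
    """Average compound daily growth rate implied by two snapshots."""
    return (after / before) ** (1 / days) - 1

for app in MAY_10:
    factor = growth_factor(MAY_10[app], MAY_21[app])
    rate = daily_rate(MAY_10[app], MAY_21[app], DAYS)
    print(f"{app}: x{factor:.2f} over {DAYS} days ({rate:+.1%} per day)")

# Absolute gap between the two agents on each date, in billions of tokens.
gap_before = MAY_10["Hermes Agent"] - MAY_10["OpenClaw"]
gap_after = MAY_21["Hermes Agent"] - MAY_21["OpenClaw"]
print(f"Gap: {gap_before} B on May 10, {gap_after} B on May 21")
```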

&lt;p&gt;The architecture difference explains a lot of this. OpenClaw is session-native — it resets between runs, which means every session re-pays the full context-stuffing cost. Hermes is a persistent runtime with a &lt;a href="https://www.glukhov.org/ai-systems/hermes/hermes-agent-memory-system/" rel="noopener noreferrer"&gt;three-layer memory system&lt;/a&gt; (identity snapshot, SQLite FTS5 session database, self-written procedural skill files). Once a skill is written, repeat tasks cost a fraction of the tokens.&lt;/p&gt;
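
&lt;p&gt;To make the middle layer of that memory system more concrete, here is a minimal sketch of a full-text session store built on Python's standard &lt;code&gt;sqlite3&lt;/code&gt; module. It assumes the local SQLite build has FTS5 enabled. The table name, columns, and helper functions are hypothetical and are not taken from the Hermes codebase.&lt;/p&gt;

```python
import sqlite3

def open_session_store(path=":memory:"):
    """Create a full-text index over past session messages (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS sessions "
        "USING fts5(session_id UNINDEXED, role UNINDEXED, content)"
    )
    return conn

def remember(conn, session_id, role, content):
    """Append one message to the searchable history."""
    conn.execute(
        "INSERT INTO sessions (session_id, role, content) VALUES (?, ?, ?)",
        (session_id, role, content),
    )

def recall(conn, query, limit=3):
    """Return the best-matching past messages, most relevant first.

    The query is passed to FTS5 as-is, so it must be valid FTS5 query syntax.
    """
    rows = conn.execute(
        "SELECT session_id, content FROM sessions "
        "WHERE sessions MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return rows.fetchall()

store = open_session_store()
remember(store, "s1", "user", "Rotate the nginx TLS certificate on the staging host")
remember(store, "s1", "assistant", "Certificate rotated with certbot and nginx reloaded")
remember(store, "s2", "user", "Summarise open pull requests for the billing service")

# Only the two messages from session s1 mention a certificate.
for session_id, content in recall(store, "certificate"):
    print(session_id, content)
```

&lt;p&gt;A persistent runtime can run a lookup like this at the start of a task and inject only the few matching messages, instead of replaying whole transcripts into the context window.&lt;/p&gt;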

&lt;p&gt;For the coding-agent sub-category specifically, the top five are Hermes Agent, OpenClaw, Kilo Code, Claude Code, and &lt;code&gt;pi&lt;/code&gt;. Cline (#11) and Roo Code (#12) round out the open-source coding-agent tier, both crossing 20 B daily tokens.&lt;/p&gt;

&lt;p&gt;The driver of Hermes's May acceleration was the &lt;strong&gt;v0.13.0 "Tenacity" release&lt;/strong&gt; (May 7, 2026): 864 commits, 588 merged PRs, 295 contributors. That release shipped a Kanban-style durable multi-agent task board with heartbeat monitoring and hallucination recovery, plus eight P0 security fixes and Google Chat as the 20th messaging integration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Community Health
&lt;/h2&gt;

&lt;p&gt;GitHub repository metrics reveal a sharp contrast in project maturity and maintenance style between the two leaders.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;OpenClaw&lt;/th&gt;
&lt;th&gt;Hermes Agent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Issue close rate&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;89.9 %&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;37.2 %&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Contributors&lt;/td&gt;
&lt;td&gt;360&lt;/td&gt;
&lt;td&gt;400&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Forks&lt;/td&gt;
&lt;td&gt;72,696&lt;/td&gt;
&lt;td&gt;26,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Releases shipped (total)&lt;/td&gt;
&lt;td&gt;82+&lt;/td&gt;
&lt;td&gt;14+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Disclosed CVEs (2026 YTD)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;9 in 4 days&lt;/strong&gt; (March 2026)&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Worst CVE severity&lt;/td&gt;
&lt;td&gt;CVSS 9.9&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Exposed public instances&lt;/td&gt;
&lt;td&gt;135,000+ across 82 countries&lt;/td&gt;
&lt;td&gt;Not separately tracked&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security response (v0.13.0)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;8 P0 fixes, default redaction on&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;OpenClaw's 89.9 % issue close rate reflects a well-staffed, responsive maintainer team — the highest of any project in this space. Its release cadence (62 in the last 30 days alone) is exceptional, but that velocity has a cost: roughly a quarter of updates reportedly break response delivery on at least one channel, and the March 2026 CVE cluster (nine issues in four days, the worst at CVSS 9.9) forced emergency patching at scale. Shadowserver confirmed over 135,000 exposed Gateway instances across 82 countries in the same window. The OpenClaw team does publish fixes fast; the problem is that a community of this size patches slowly.&lt;/p&gt;

&lt;p&gt;Hermes Agent's 37.2 % issue close rate is the expected profile for a three-month-old project with a backlog accumulating faster than it can be triaged. The security record so far is clean — zero disclosed agent-specific CVEs as of May 2026 — though that partly reflects fewer eyes on the codebase. The v0.13.0 "Tenacity" release shipped eight P0 fixes proactively, before any public disclosure, which is a good signal of security culture.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ecosystem Size
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Package downloads
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Package&lt;/th&gt;
&lt;th&gt;Registry&lt;/th&gt;
&lt;th&gt;Weekly downloads&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;openclaw&lt;/code&gt; (main)&lt;/td&gt;
&lt;td&gt;npm&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5,344,931&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@tencent-weixin/openclaw-weixin&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;npm&lt;/td&gt;
&lt;td&gt;230,903&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@ollama/openclaw-web-search&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;npm&lt;/td&gt;
&lt;td&gt;160,221&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@paperclipai/adapter-openclaw-gateway&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;npm&lt;/td&gt;
&lt;td&gt;159,310&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@larksuite/openclaw-lark&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;npm&lt;/td&gt;
&lt;td&gt;115,964&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;hermes-agent&lt;/code&gt; (main)&lt;/td&gt;
&lt;td&gt;PyPI&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;53,134&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The raw numbers are not directly comparable: the two registries measure downloads differently, and npm totals are inflated by every &lt;code&gt;npm install&lt;/code&gt; that runs in CI. OpenClaw also has a larger ecosystem of third-party adapter packages that each pull in the core. Even so, a roughly hundredfold difference reflects OpenClaw's deeper penetration of automated pipelines and developer toolchains.&lt;/p&gt;
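
&lt;p&gt;For a sense of scale, the ratio between the two core packages can be computed directly from the table above. This is a small sketch over the quoted weekly figures only; it does not query either registry, and the variable names are illustrative.&lt;/p&gt;

```python
import math

# Weekly download figures copied from the package table.
NPM_PACKAGES = {
    "openclaw": 5_344_931,
    "@tencent-weixin/openclaw-weixin": 230_903,
    "@ollama/openclaw-web-search": 160_221,
    "@paperclipai/adapter-openclaw-gateway": 159_310,
    "@larksuite/openclaw-lark": 115_964,
}
HERMES_PYPI = 53_134

# How many times larger the core npm package is than the PyPI package.
core_ratio = NPM_PACKAGES["openclaw"] / HERMES_PYPI

# Combined weekly downloads of the listed adapter packages.
adapter_total = sum(count for name, count in NPM_PACKAGES.items() if name != "openclaw")

print(f"Core ratio: {core_ratio:.1f}x (about 10^{math.log10(core_ratio):.1f})")
print(f"Listed adapters combined: {adapter_total:,} downloads per week")
print(f"Adapters as share of core: {adapter_total / NPM_PACKAGES['openclaw']:.1%}")
```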

&lt;p&gt;Hermes Agent at 53,000 PyPI downloads per week is not a small number for a three-month-old Python tool. Its install rate has grown roughly linearly with the GitHub star count.&lt;/p&gt;

&lt;h3&gt;
  
  
  Skills and integrations
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;OpenClaw&lt;/th&gt;
&lt;th&gt;Hermes Agent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Third-party skill marketplace&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;ClawHub — 44,000+ skills&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None yet (self-generated only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Messaging integrations (official)&lt;/td&gt;
&lt;td&gt;50+ channels&lt;/td&gt;
&lt;td&gt;20 channels&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Community repositories (GitHub)&lt;/td&gt;
&lt;td&gt;Large (untracked by maintainers)&lt;/td&gt;
&lt;td&gt;80+ quality-filtered&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Skill libraries (community)&lt;/td&gt;
&lt;td&gt;Embedded in ClawHub&lt;/td&gt;
&lt;td&gt;17 curated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-agent orchestration frameworks&lt;/td&gt;
&lt;td&gt;Built-in ACP swarm&lt;/td&gt;
&lt;td&gt;9 third-party&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;External memory providers&lt;/td&gt;
&lt;td&gt;Via skills&lt;/td&gt;
&lt;td&gt;8 native&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;ClawHub is OpenClaw's most durable moat: 44,000 community-maintained skills covering integrations, automations, and workflows that would take months to replicate. Hermes's answer is to generate skills from its own task completions rather than pull them from a marketplace — a fundamentally different philosophy that pays off on deep, repeated tasks but leaves gaps on long-tail integrations. The eight external memory backends Hermes ships natively — Honcho, OpenViking, Mem0, Hindsight, and four more — are compared in detail in &lt;a href="https://www.glukhov.org/ai-systems/memory/agent-memory-providers/" rel="noopener noreferrer"&gt;Agent Memory Providers Compared&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;One security note on ClawHub: in Q1 2026 Koi Security identified 341 malicious entries in the registry, prompting OpenClaw to add a verification layer to the skill submission pipeline. A detailed guide to vetting skills, understanding which are safe to install, and navigating ClawHub's quality tiers is in &lt;a href="https://www.glukhov.org/ai-systems/openclaw/skills/" rel="noopener noreferrer"&gt;OpenClaw Skills Ecosystem and Practical Production Picks&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Community Sentiment
&lt;/h2&gt;

&lt;p&gt;A synthesis of Reddit threads across r/homeautomation, r/selfhosted, and r/MachineLearning (compiled by &lt;a href="https://kilo.ai/articles/openclaw-vs-hermes-what-reddit-says" rel="noopener noreferrer"&gt;kilo.ai&lt;/a&gt;) breaks down operator preferences as follows:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stance&lt;/th&gt;
&lt;th&gt;Share&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Stay on OpenClaw&lt;/td&gt;
&lt;td&gt;35 %&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Switched fully to Hermes&lt;/td&gt;
&lt;td&gt;30 %&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Run both side by side&lt;/td&gt;
&lt;td&gt;20 %&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Withholding judgment on Hermes&lt;/td&gt;
&lt;td&gt;15 %&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 15 % holding off on Hermes are primarily concerned about what some users characterise as coordinated promotion activity from newly created accounts in Hermes-related threads — a pattern common to fast-growing projects but notable enough that veteran community members flag it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Top OpenClaw complaints (by upvote volume)
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Release breakage&lt;/strong&gt; — most-upvoted complaint has 305 votes: &lt;em&gt;"Every single update ships more bugs and problems than before."&lt;/em&gt; An estimated 25 % of releases break response delivery on at least one channel.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory drift&lt;/strong&gt; — agents forget prior instructions across sessions, requiring users to re-establish context manually.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-host friction&lt;/strong&gt; — disproportionate time spent on Docker configuration, SSH setup, and YAML tuning relative to actual agent work.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Top Hermes Agent complaints (by frequency)
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Unreliable self-evaluation&lt;/strong&gt; — the agent occasionally reports task success when the outcome was a partial failure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skill file overwriting&lt;/strong&gt; — auto-improvement rewrites manually tuned skill files, discarding intentional customisation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration gaps&lt;/strong&gt; — ClawHub has a skill for almost everything; Hermes does not, and self-generation takes time to catch up.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The "run both" pattern (20 % of operators) is the most architecturally interesting: OpenClaw as the channel-and-routing layer up front, with Hermes as the deep-specialist backend. Messages arrive via Telegram or Slack, OpenClaw routes them, and the tasks where compounding matters are dispatched to a Hermes instance that has been improving on exactly those workflows for weeks.&lt;/p&gt;
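
&lt;p&gt;The routing decision in that setup can be sketched in a few lines of Python. Everything here is hypothetical: the keyword list, the class names, and the two backend labels stand in for whatever a real deployment would use, and neither OpenClaw nor Hermes exposes this exact interface.&lt;/p&gt;

```python
from dataclasses import dataclass, field

@dataclass
class Message:
    channel: str  # for example "telegram" or "slack"
    text: str

@dataclass
class Router:
    """Front layer: handles one-off requests itself, forwards recurring workflows."""
    # Hypothetical keywords marking workflows where learned skills compound.
    specialist_keywords: set = field(
        default_factory=lambda: {"deploy", "report", "triage"}
    )

    def pick_backend(self, message):
        """Return the label of the backend that should handle the message."""
        words = set(message.text.lower().split())
        if words.intersection(self.specialist_keywords):
            return "specialist"  # stands in for a long-lived Hermes instance
        return "front"  # stands in for the channel-facing OpenClaw layer

router = Router()
inbox = [
    Message("telegram", "What time is it in Lisbon"),
    Message("slack", "Run the weekly report for billing"),
]
for msg in inbox:
    print(f"{msg.channel}: {router.pick_backend(msg)}")
```

&lt;p&gt;Keyword matching is the simplest possible policy. A real router would more likely classify by task type or by whether a matching skill already exists on the specialist side.&lt;/p&gt;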

&lt;h2&gt;
  
  
  Search Interest Trend
&lt;/h2&gt;

&lt;p&gt;Tracking the &lt;strong&gt;project growth leaderboard&lt;/strong&gt; (weekly new GitHub stars, a cleaner signal than raw star count) shows a clear momentum reversal as of May 2026.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Project&lt;/th&gt;
&lt;th&gt;Weekly star growth&lt;/th&gt;
&lt;th&gt;Leaderboard position&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claw-code&lt;/td&gt;
&lt;td&gt;+7,000&lt;/td&gt;
&lt;td&gt;#1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hermes Agent&lt;/td&gt;
&lt;td&gt;+3,800&lt;/td&gt;
&lt;td&gt;#3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenClaw&lt;/td&gt;
&lt;td&gt;+1,700&lt;/td&gt;
&lt;td&gt;#11&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;OpenClaw had a +40,000/week peak in early February 2026 during the post-relaunch explosion. At +1,700/week in May, it is still growing in absolute terms — 373k stars does not happen without weekly adds — but it has settled into a mature project cadence, not a growth sprint.&lt;/p&gt;

&lt;p&gt;Hermes Agent at +3,800/week is adding stars more than twice as fast as OpenClaw and sits at #3 on the leaderboard, despite having less than half of OpenClaw's cumulative stars. Its growth curve is steeper than OpenClaw's was at the same age (week 12 post-launch).&lt;/p&gt;
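
&lt;p&gt;A back-of-the-envelope projection shows how far apart the two still are in cumulative stars. The sketch below assumes both weekly rates stay constant, which is a strong simplification given how much OpenClaw's own rate has changed since February. The 373k figure is the rounded total quoted above.&lt;/p&gt;

```python
# Snapshot figures from the text; 373k is the rounded OpenClaw total.
OPENCLAW_STARS, OPENCLAW_WEEKLY = 373_000, 1_700
HERMES_STARS, HERMES_WEEKLY = 160_175, 3_800

def weeks_to_parity(leader, leader_rate, chaser, chaser_rate):
    """Weeks until the chaser catches up under constant weekly growth, or None."""
    closing_speed = chaser_rate - leader_rate
    if max(closing_speed, 0) == 0:
        return None  # the gap is not shrinking
    return (leader - chaser) / closing_speed

weeks = weeks_to_parity(OPENCLAW_STARS, OPENCLAW_WEEKLY, HERMES_STARS, HERMES_WEEKLY)
print(f"Star gap: {OPENCLAW_STARS - HERMES_STARS:,}")
print(f"Parity in roughly {weeks:.0f} weeks (about {weeks / 52:.1f} years) at current rates")
```

&lt;p&gt;On these assumptions the star gap closes on a timescale of years, not months, which is why weekly growth is the more informative signal here.&lt;/p&gt;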

&lt;p&gt;The broader &lt;strong&gt;search interest trend&lt;/strong&gt; corroborates the star-growth pattern. Queries for "Hermes Agent" and "hermes-agent install" have been rising consistently since the February launch; "OpenClaw" search volume peaked in late January and has been flat-to-declining since. The intersection point, where Hermes search volume equals OpenClaw's, has not yet been reached, but the trajectories suggest a crossover sometime in Q3 2026 if current rates hold.&lt;/p&gt;

&lt;p&gt;The HN community has also shifted: threads about OpenClaw now centre on security hardening, transport trust (Telegram's lack of default end-to-end encryption), and maintenance overhead. Threads about Hermes Agent are still mostly "how do I set this up for X" — an earlier-stage energy that reflects a project still in its adoption phase.&lt;/p&gt;

&lt;h2&gt;
  
  
  Useful links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://openclawai.io" rel="noopener noreferrer"&gt;OpenClaw official site&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/NousResearch/hermes-agent" rel="noopener noreferrer"&gt;Hermes Agent — Nous Research&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://openrouter.ai/apps" rel="noopener noreferrer"&gt;OpenRouter app rankings (live)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://openclawchronicles.com" rel="noopener noreferrer"&gt;OpenClaw Chronicles — community news&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hermesatlas.com" rel="noopener noreferrer"&gt;Hermes Atlas — monthly state reports&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>hermes</category>
      <category>openclaw</category>
      <category>community</category>
      <category>selfhosting</category>
    </item>
  </channel>
</rss>
