DEV Community: Rost

AI for Knowledge Management: Real Workflows That Hold Up

Rost — Sun, 31 May 2026 13:51:08 +0000

AI is not replacing knowledge management; it is changing the shape of it for both individuals and teams.

Microsoft's Work Trend Index describes a move toward hybrid teams of humans and agents, and NIST's AI RMF argues that trustworthy AI systems need explicit roles, evaluation, and oversight rather than vague automation.
Those ideas fit neatly beside the human-centred practices in the site's Knowledge Management in 2026 pillar, which focuses on tools and methods long before any model is involved.

That is exactly the right frame for knowledge work: AI is best treated as an enrichment layer over notes, docs, runbooks, and research, not as a magical second brain that works without structure. A useful mental model is the one developed in PKM vs RAG vs Wiki vs Memory Systems, where human note systems, shared wikis, retrieval pipelines, and agent memory each play a distinct role instead of collapsing into a single tool.

The slightly opinionated version is this: if your notes are chaotic, AI will not rescue them. It will often make the chaos more fluent. Good knowledge management still starts with capture, naming, ownership, and source discipline. What AI changes is what you can do after capture: compress, extract, link, retrieve, and repackage information at useful speed. That view fits both modern prompting guidance, which recommends small, well-scoped tasks, and chunking guidance that preserves semantic units for retrieval instead of flattening everything into one blob.

Why AI changes knowledge management

The core shift is from static archives to active memory. Embeddings convert text into vectors that reflect relatedness and are commonly used for search, clustering, and recommendations. Retrieval systems can then surface semantically similar material even when the query shares few or no keywords with the source text. In practical terms, that means a note about "incident review" can still find a runbook chunk titled "post-deployment outage steps" without brittle exact-match rules.
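
To make "relatedness" concrete, here is a minimal sketch of cosine similarity, one common way to compare embedding vectors. The three-dimensional vectors are toy values invented for illustration; real embeddings come from a model and have hundreds or thousands of dimensions. `Cosine` and `MostSimilar` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"math"
)

// Cosine returns the cosine similarity of two equal-length vectors,
// or 0 when the lengths differ or either vector has zero magnitude.
func Cosine(a, b []float64) float64 {
	if len(a) != len(b) {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// MostSimilar returns the key of the stored vector closest to query and its score.
// For an empty store it returns "" and negative infinity.
func MostSimilar(query []float64, store map[string][]float64) (string, float64) {
	best, bestScore := "", math.Inf(-1)
	for id, v := range store {
		if s := Cosine(query, v); s > bestScore {
			best, bestScore = id, s
		}
	}
	return best, bestScore
}

func main() {
	// Toy vectors standing in for real embeddings (assumption, not model output).
	store := map[string][]float64{
		"runbook: post-deployment outage steps": {0.9, 0.1, 0.2},
		"note: quarterly budget planning":       {0.1, 0.9, 0.3},
	}
	query := []float64{0.8, 0.2, 0.1} // pretend embedding of "incident review"
	id, score := MostSimilar(query, store)
	fmt.Printf("%s (%.2f)\n", id, score)
}
```

A linear scan like this is fine for a few thousand notes; beyond that, an index such as the ones pgvector provides is the usual next step.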

This is why AI-augmented knowledge management is worth doing now. The enabling pieces are no longer exotic: embedding APIs are mainstream, vector stores are standard, local embedding models are easy to run, and production databases such as Postgres can do both exact and approximate nearest-neighbour search with pgvector. The result is not artificial knowledge in the philosophical sense. It is a much more practical thing: better recall, better compression, and better context at the moment someone needs to think, especially when paired with solid representation choices from work such as Retrieval vs Representation in Knowledge Systems. If your next step is implementation detail, the RAG cluster covers chunking, retrieval, reranking, and production patterns in depth.

Workflow patterns that actually work

The patterns that hold up in production are boring in the best way. They use AI for bounded transformations, not vague autonomy. In practice, three patterns show up again and again: summarisation, extraction, and linking suggestions. Those map neatly to what current tools do well: summarise within a clear scope, extract structured data with schemas, and compute semantic relatedness through embeddings and retrieval. They also map cleanly onto the layered view of knowledge systems behind concepts such as second brain workflows and LLM Wiki style compiled knowledge.

Summaries that preserve decisions

Summarisation works best when it stays close to the source and preserves the parts humans actually need later: decisions, unresolved questions, owners, dates, and links back to the original material. OpenAI's enterprise prompting guidance explicitly recommends "one prompt, one deliverable", simple headings, and clear success criteria. That is a good discipline for knowledge work too: summarise one meeting, one document, or one research item at a time, then store the summary beside the source. Do not ask a model to "summarise my knowledge base" and expect anything trustworthy.

A real workflow looks like this: capture meeting notes or a PDF, run a scoped summary prompt, store the summary with source references, then add a human check before it becomes canonical. If the source is a rich PDF, multimodal parsing can matter because slide decks and exported web pages often contain layout cues that plain text extraction misses. OpenAI's PDF parsing cookbook shows a practical split between text extraction and page-image analysis for turning rich PDFs into retrievable content.

# Context
You are assisting with team knowledge capture.

# Instructions
Summarise this meeting note in:
- 5 key points
- decisions made
- open questions
- actions with owners
- terms that should link to existing notes

# Constraints
- Do not invent details
- If something is unclear, mark it as uncertain
- Include the source note ID

Extraction that creates reusable fields

Extraction is where AI starts to feel genuinely infrastructural. Instead of storing only prose, you ask the model to populate reusable fields such as entities, systems, APIs, owners, action items, products, dates, claims, or risk tags. OpenAI's Structured Outputs feature is designed to keep responses aligned to a JSON Schema, and Ollama offers the same pattern locally with schema-based JSON output. That matters because useful knowledge systems are made of fields you can sort, filter, compare, and validate, not just paragraphs that sound clever.

OpenAI's long-document entity extraction example follows the right operational pattern: chunk the document, extract the relevant facts from each chunk, and then combine results. That same workflow works for postmortems, research papers, product docs, customer interviews, and support transcripts. In practice, I would extract more than named entities: I would also pull "needs follow-up", "contradicts existing note", and "candidate for evergreen note" because those fields create action, not just metadata.

{
  "source_id": "note-2026-05-22-incident-review",
  "summary": "Short summary here.",
  "entities": ["service-a", "postgres", "oauth"],
  "actions": [
    {"owner": "ops", "task": "rotate keys", "due": "2026-05-24"}
  ],
  "related_terms": ["token refresh", "deployment checklist"],
  "confidence": "medium"
}
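
A record like the one above is only useful if it is validated before it is stored. The sketch below decodes it into Go types and rejects obviously bad output. The field names mirror the JSON example; the allowed confidence values (`low`, `medium`, `high`) and the ISO date format for `due` are assumptions, since the example shows only one value of each.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"time"
)

// Action is one follow-up item extracted from a note.
type Action struct {
	Owner string `json:"owner"`
	Task  string `json:"task"`
	Due   string `json:"due"` // assumed ISO date, e.g. 2026-05-24
}

// Extraction mirrors the JSON record shown above.
type Extraction struct {
	SourceID     string   `json:"source_id"`
	Summary      string   `json:"summary"`
	Entities     []string `json:"entities"`
	Actions      []Action `json:"actions"`
	RelatedTerms []string `json:"related_terms"`
	Confidence   string   `json:"confidence"`
}

// ParseExtraction decodes model output and rejects records that lack a
// source reference, use an unexpected confidence label, or carry bad dates.
func ParseExtraction(raw []byte) (*Extraction, error) {
	var e Extraction
	if err := json.Unmarshal(raw, &e); err != nil {
		return nil, fmt.Errorf("decode: %w", err)
	}
	if e.SourceID == "" {
		return nil, errors.New("missing source_id")
	}
	switch e.Confidence {
	case "low", "medium", "high": // assumed vocabulary
	default:
		return nil, fmt.Errorf("unexpected confidence %q", e.Confidence)
	}
	for _, a := range e.Actions {
		if _, err := time.Parse("2006-01-02", a.Due); err != nil {
			return nil, fmt.Errorf("action %q: bad due date: %w", a.Task, err)
		}
	}
	return &e, nil
}

func main() {
	raw := []byte(`{"source_id":"note-2026-05-22-incident-review","confidence":"medium",
		"actions":[{"owner":"ops","task":"rotate keys","due":"2026-05-24"}]}`)
	e, err := ParseExtraction(raw)
	if err != nil {
		fmt.Println("rejected:", err)
		return
	}
	fmt.Println(e.SourceID, len(e.Actions))
}
```

Rejecting a record outright is a deliberately strict choice; routing failures to a human review queue would fit the same pattern.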

Linking that turns notes into a graph

Link suggestions are the quiet workhorse of AI for knowledge management. Embeddings are explicitly used for search, clustering, and recommendations, which makes them a natural fit for "related notes", "similar incidents", "see also", and "you may want to merge these two docs" features. Semantic retrieval is especially good at surfacing conceptually related content even when wording differs. That makes it far better than folder hierarchies alone for large note sets and technical documentation.

Dense semantic search should not be your only retrieval signal, though. Exact identifiers still matter: function names, package names, issue IDs, error codes, SKUs, regulation numbers. Google Research has shown that hybrid retrieval, which combines semantic and lexical signals, improves recall because each method finds relevant material the other misses. In a technical knowledge base, that is not an academic detail. It is the difference between finding the conceptually related design note and also finding the exact migration command someone needs at 2 a.m.
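
One simple way to combine the two signals is reciprocal rank fusion (RRF). It is not necessarily the method used in the research mentioned above; it is a common, easy-to-reason-about option. The sketch assumes you already have two ranked lists of document IDs, one from semantic search and one from keyword search. The constant `k = 60` is a frequently used default, treated here as an assumption to tune.

```go
package main

import (
	"fmt"
	"sort"
)

// FuseRRF merges ranked ID lists with reciprocal rank fusion:
// each list contributes 1/(k+rank) to a document's score, with rank starting at 1.
func FuseRRF(k float64, rankings ...[]string) []string {
	scores := make(map[string]float64)
	for _, ranking := range rankings {
		for i, id := range ranking {
			scores[id] += 1.0 / (k + float64(i+1))
		}
	}
	ids := make([]string, 0, len(scores))
	for id := range scores {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(a, b int) bool {
		if scores[ids[a]] != scores[ids[b]] {
			return scores[ids[a]] > scores[ids[b]]
		}
		return ids[a] < ids[b] // deterministic tie-break
	})
	return ids
}

func main() {
	// Hypothetical result lists from two retrievers.
	semantic := []string{"design-note", "migration-cmd", "retro"}
	lexical := []string{"migration-cmd", "error-code-E42"}
	fmt.Println(FuseRRF(60, semantic, lexical))
}
```

A document that appears in both lists rises to the top, while documents found by only one retriever still make the merged list.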

If you are already on Postgres, pgvector is the pragmatic option. It stores vectors with the rest of your data, supports exact search by default, and offers approximate indexing through HNSW and IVFFlat when you need more speed and can tolerate some recall trade-off. That is enough to build related-content suggestions, semantic search, and note deduplication without adding a separate vector database on day one.

The human plus AI loop

The model that actually works is not human or AI. It is capture -> AI enrich -> human refine. Microsoft describes the broader shift as humans working with assistants and then agent teams, while NIST's AI RMF and Playbook stress clearly defined human roles, responsibilities, and oversight in human-AI configurations. For knowledge management, that means humans remain accountable for the canonical note, the source of truth, and the final merge or publication decision. AI does the first-pass compression and cross-linking; humans do the judgement.

capture -> parse -> chunk -> embed -> enrich -> review -> publish
             |         |        |
             |         |        +-> related notes
             |         +-> retrieval index
             +-> structure-aware extraction

This division of labour is more than cautious process design. It matches how risk accumulates. NIST notes that understanding the limitations of human-AI interaction improves AI risk management, and that roles in oversight and use should be clearly differentiated. In practice, that means the model can draft titles, tags, summaries, and candidate links, but a person should approve anything that changes taxonomy, publishes external content, or overwrites an existing note. If you let the model silently rewrite your knowledge base, you are not building memory. You are outsourcing editorial control to a probabilistic system.
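
The approval rule in that paragraph is small enough to encode directly. The following is a minimal sketch, assuming a hypothetical `Change` type that your enrichment pipeline would emit; the kinds listed are taken from the paragraph above, not from any particular tool.

```go
package main

import "fmt"

// ChangeKind classifies what an AI-proposed change would do.
type ChangeKind int

const (
	DraftTitle ChangeKind = iota
	DraftTags
	DraftSummary
	CandidateLink
	TaxonomyChange
	ExternalPublish
)

// Change is a hypothetical record describing one proposed edit.
type Change struct {
	Kind       ChangeKind
	NoteID     string
	Overwrites bool // true if applying the change replaces existing note content
}

// NeedsApproval reports whether a person must sign off before the change is applied.
func NeedsApproval(c Change) bool {
	if c.Overwrites {
		return true
	}
	switch c.Kind {
	case TaxonomyChange, ExternalPublish:
		return true
	default:
		return false
	}
}

func main() {
	changes := []Change{
		{Kind: DraftSummary, NoteID: "note-1"},
		{Kind: TaxonomyChange, NoteID: "note-2"},
		{Kind: DraftTags, NoteID: "note-3", Overwrites: true},
	}
	for _, c := range changes {
		fmt.Printf("%s needs approval: %v\n", c.NoteID, NeedsApproval(c))
	}
}
```

Keeping the rule in one function makes the policy reviewable, which matters more than the code itself.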

The tool choices that matter

The base layer is embeddings plus retrieval. OpenAI's embeddings guide frames embeddings as a way to measure relatedness between text strings, while the Retrieval API handles semantic search over your data through vector stores. For many teams, that is the minimum viable stack for AI-augmented knowledge management: parse content, chunk it well, embed it, and retrieve the right fragments before synthesis. If you only do one serious thing this quarter, make it retrieval-backed recall instead of a chat wrapper over raw documents.

Local models are the right answer when privacy, offline use, or cost control dominate. Ollama documents both local embeddings and structured outputs, and its product pages emphasise that data stays yours and that workloads can run entirely offline. That makes local-first pipelines sensible for internal notes, engineering runbooks, and sensitive research archives. My bias is simple: use local models for indexing, classification, and routine enrichment; reach for hosted APIs when you need stronger reasoning, multimodal extraction, or the best available model quality.

Do not ignore parsing and chunking. Unstructured's chunking docs recommend building chunks from semantic document elements rather than raw character boundaries when possible, and OpenAI's PDF cookbook shows why rich-document parsing matters for RAG. Structure-aware PDF work goes further: naive parsing can destroy tables, scramble reading order, and strip hierarchical headings, while structure-aware parsing preserves paragraphs, tables, and document hierarchy. In knowledge management, that is the difference between an index that understands your corpus and one that merely tokenises it.
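
As a rough illustration of chunking on document elements instead of raw character offsets, the sketch below packs whole paragraphs into chunks and starts a new chunk at each heading. It assumes blank-line-separated paragraphs and headings that begin with `#`; it is not how Unstructured or any specific library implements chunking.

```go
package main

import (
	"fmt"
	"strings"
)

// ChunkByParagraph packs whole paragraphs into chunks of at most maxBytes,
// starting a new chunk at every heading (a paragraph beginning with "#").
// A single paragraph longer than maxBytes is kept whole rather than split.
func ChunkByParagraph(doc string, maxBytes int) []string {
	var chunks []string
	var cur strings.Builder
	flush := func() {
		if cur.Len() > 0 {
			chunks = append(chunks, cur.String())
			cur.Reset()
		}
	}
	for _, p := range strings.Split(doc, "\n\n") {
		p = strings.TrimSpace(p)
		if p == "" {
			continue
		}
		isHeading := strings.HasPrefix(p, "#")
		// Close the current chunk at a heading or when the paragraph would not fit.
		if isHeading || (cur.Len() > 0 && cur.Len()+2+len(p) > maxBytes) {
			flush()
		}
		if cur.Len() > 0 {
			cur.WriteString("\n\n")
		}
		cur.WriteString(p)
	}
	flush()
	return chunks
}

func main() {
	doc := "# Rollback\n\nStop the deploy.\n\nRestore the snapshot.\n\n# Verify\n\nRun smoke tests."
	for i, c := range ChunkByParagraph(doc, 200) {
		fmt.Printf("chunk %d: %q\n", i, c)
	}
}
```

Each chunk keeps its heading with the paragraphs it governs, so a retrieved fragment carries its own context.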

Limitations worth respecting

Hallucination is still the obvious risk, but the more useful framing is insufficient context. RAG exists because large language models can hallucinate, use stale knowledge, and produce answers with weak traceability; retrieval helps by grounding generation in external knowledge. Even so, Google Research found that models often answer incorrectly instead of abstaining when the provided context is not sufficient. That matters for knowledge management because "I found something similar" is not the same as "I found enough to answer". Your system should preserve source references, expose uncertainty, and prefer abstention over confident fabrication.

Long context does not remove the need for retrieval discipline. The 2023 "Lost in the Middle" paper showed that model performance could degrade when relevant information sat in the middle of long inputs, while more recent Google results show that at least some current models have improved substantially on simple needle-in-a-haystack retrieval near context limits. The sober lesson is not "long context solves it" or "long context is useless". It is that you should test your actual workflows and corpus, because position effects, task type, and document structure still matter.

Loss of structure is the quieter failure mode, and in technical documentation it can be worse than hallucination because it poisons retrieval before the model even starts reasoning. Structure-aware PDF research shows that naive parsing can split tables, destroy their internal meaning, and break reading order, while semantic chunking systems try to preserve coherent document elements. If your source material includes tables, diagrams, code examples, or multi-column layouts, your parser is part of your knowledge system, not a boring preprocessing detail.

So the practical rule is this: keep the human editorial loop, preserve source links, use schemas for extraction, and treat retrieval quality as a product feature. AI does not replace PKM, team docs, or knowledge architecture. It changes the leverage. Used well, it turns raw notes into searchable, linkable, structured memory. Used badly, it turns your documentation into high-speed drift.

Multi-Tenancy Database Patterns with examples in Go

Rost — Thu, 28 May 2026 13:13:31 +0000

Multi-tenancy is a fundamental architectural pattern for SaaS applications, allowing multiple customers (tenants) to share the same application infrastructure while maintaining data isolation.

Choosing the right database pattern is crucial for scalability, security, and operational efficiency.

Overview of Multi-Tenancy Patterns

When designing a multi-tenant application, you have three primary database architecture patterns to choose from:

Shared Database, Shared Schema (most common)
Shared Database, Separate Schema
Separate Database per Tenant

Each pattern has distinct characteristics, trade-offs, and use cases. Let's explore each in detail.

Pattern 1: Shared Database, Shared Schema

This is the most common multi-tenancy pattern, where all tenants share the same database and schema, with a tenant_id column used to distinguish tenant data.

Architecture

┌─────────────────────────────────────┐
│     Single Database                 │
│  ┌───────────────────────────────┐  │
│  │  Shared Schema                │  │
│  │  - users (tenant_id, ...)     │  │
│  │  - orders (tenant_id, ...)    │  │
│  │  - products (tenant_id, ...)  │  │
│  └───────────────────────────────┘  │
└─────────────────────────────────────┘

Implementation Example

When implementing multi-tenant patterns, understanding SQL fundamentals is crucial. For a comprehensive reference on SQL commands and syntax, check out our SQL Cheatsheet. Here's how to set up the shared schema pattern:

-- Users table with tenant_id
CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    tenant_id INTEGER NOT NULL,
    email VARCHAR(255) NOT NULL,
    name VARCHAR(255),
    created_at TIMESTAMP DEFAULT NOW(),
    FOREIGN KEY (tenant_id) REFERENCES tenants(id)
);

-- Index on tenant_id for performance
CREATE INDEX idx_users_tenant_id ON users(tenant_id);

-- Row-Level Security (PostgreSQL example)
ALTER TABLE users ENABLE ROW LEVEL SECURITY;
-- Table owners bypass RLS unless it is forced
ALTER TABLE users FORCE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON users
    FOR ALL
    USING (tenant_id = current_setting('app.current_tenant')::INTEGER);

For more PostgreSQL-specific features and commands, including RLS policies, schema management, and performance tuning, refer to our PostgreSQL Cheatsheet.

Application-Level Filtering

When working with Go applications, choosing the right ORM can significantly impact your multi-tenant implementation. The examples below use GORM, but there are several excellent options available. For a detailed comparison of Go ORMs including GORM, Ent, Bun, and sqlc, see our comprehensive guide to Go ORMs for PostgreSQL.

// Example in Go with GORM
func GetUserByEmail(db *gorm.DB, tenantID uint, email string) (*User, error) {
    var user User
    // Always scope the lookup by tenant_id
    if err := db.Where("tenant_id = ? AND email = ?", tenantID, email).First(&user).Error; err != nil {
        return nil, err // includes gorm.ErrRecordNotFound
    }
    return &user, nil
}

// tenantKey is an unexported context key type; it avoids collisions with string keys from other packages
type tenantKey struct{}

// Middleware to set tenant context
func TenantMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        tenantID := extractTenantID(r) // From subdomain, header, or JWT
        ctx := context.WithValue(r.Context(), tenantKey{}, tenantID)
        next.ServeHTTP(w, r.WithContext(ctx))
    })
}
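
The middleware relies on `extractTenantID`, which the article leaves undefined. Here is one possible shape for the header-then-subdomain part of it, written against plain strings so it can be tested without an HTTP server. The function name `TenantFromRequest` and the `baseDomain` parameter are hypothetical, and a header should only be trusted when a gateway you control sets it.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"strings"
)

// ErrNoTenant is returned when no tenant can be derived from the request data.
var ErrNoTenant = errors.New("no tenant in request")

// TenantFromRequest resolves a tenant slug from a header value first, then from
// the single leading label of the host, e.g. "acme" in "acme.example.com".
func TenantFromRequest(headerValue, host, baseDomain string) (string, error) {
	if v := strings.TrimSpace(headerValue); v != "" {
		return v, nil
	}
	// Strip an optional port before comparing hostnames.
	if h, _, err := net.SplitHostPort(host); err == nil {
		host = h
	}
	host = strings.ToLower(host)
	suffix := "." + baseDomain
	if !strings.HasSuffix(host, suffix) {
		return "", ErrNoTenant
	}
	sub := strings.TrimSuffix(host, suffix)
	if sub == "" || strings.Contains(sub, ".") {
		return "", ErrNoTenant // reject empty or nested subdomains
	}
	return sub, nil
}

func main() {
	for _, host := range []string{"acme.example.com:8080", "example.com", "a.b.example.com"} {
		tenant, err := TenantFromRequest("", host, "example.com")
		fmt.Printf("%-24s tenant=%q err=%v\n", host, tenant, err)
	}
}
```

In a handler you would pass `r.Header.Get(...)` and `r.Host` into this function and reject the request when it returns an error, instead of continuing with an empty tenant.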

Shared Schema Pros

Lowest cost: Single database instance, minimal infrastructure
Easiest operations: One database to backup, monitor, and maintain
Simple schema changes: Migrations apply to all tenants at once
Best for high tenant count: Efficient resource utilization
Cross-tenant analytics: Easy to aggregate data across tenants

Shared Schema Cons

Weaker isolation: Data leakage risk if queries forget tenant_id filter
Noisy neighbor: One tenant's heavy workload can affect others
Limited customization: All tenants share the same schema
Compliance challenges: Harder to meet strict data isolation requirements
Backup complexity: Can't restore individual tenant data easily

Shared Schema Best For

SaaS applications with many small-to-medium tenants
Applications where tenants don't need custom schemas
Cost-sensitive startups
When tenant count is high (thousands+)

Pattern 2: Shared Database, Separate Schema

Each tenant gets their own schema within the same database, providing better isolation while sharing infrastructure.

Separate Schema Architecture

┌─────────────────────────────────────┐
│     Single Database                 │
│  ┌──────────┐  ┌──────────┐         │
│  │ Schema A │  │ Schema B │  ...    │
│  │ (Tenant1)│  │ (Tenant2)│         │
│  └──────────┘  └──────────┘         │
└─────────────────────────────────────┘

Separate Schema Implementation

PostgreSQL schemas are a powerful feature for multi-tenancy. For detailed information on PostgreSQL schema management, connection strings, and database administration commands, consult our PostgreSQL Cheatsheet.

-- Create schema for tenant
CREATE SCHEMA tenant_123;

-- Set search path for tenant operations
SET search_path TO tenant_123, public;

-- Create tables in tenant schema
CREATE TABLE tenant_123.users (
    id SERIAL PRIMARY KEY,
    email VARCHAR(255) NOT NULL,
    name VARCHAR(255),
    created_at TIMESTAMP DEFAULT NOW()
);

Application Connection Management

Managing database connections efficiently is critical for multi-tenant applications. The connection management code below uses GORM, but you might want to explore other ORM options. For a thorough comparison of Go ORMs including connection pooling, performance characteristics, and use cases, refer to our Go ORMs comparison guide.

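// Set the schema search path for a tenant.
// Caveat: SET affects only the pooled connection that runs it, so later queries on
// this *gorm.DB may use a different connection. Run tenant work on a single
// connection or transaction, or use the per-tenant connection string below.
func GetTenantDB(tenantID uint) *gorm.DB {
    db := initializeDB()
    // tenantID is a uint, so formatting it into the statement cannot inject SQL
    db.Exec(fmt.Sprintf("SET search_path TO tenant_%d, public", tenantID))
    return db
}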

// Or use PostgreSQL connection string
// postgresql://user:pass@host/db?search_path=tenant_123
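
PostgreSQL's `SET` does not accept bind parameters, so the schema name has to be interpolated into the statement. If the name ever comes from anything other than a numeric ID, validate it first. The sketch below assumes the `tenant_<number>` naming used in this section and builds a `SET LOCAL` statement, which only takes effect inside a transaction and resets when it ends. `SearchPathStatement` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"regexp"
)

// schemaNameRE allows only names of the form tenant_<digits>.
var schemaNameRE = regexp.MustCompile(`^tenant_[0-9]+$`)

// SearchPathStatement returns a transaction-scoped SET statement for a tenant
// schema, or an error if the name does not match the expected pattern.
func SearchPathStatement(schema string) (string, error) {
	if !schemaNameRE.MatchString(schema) {
		return "", fmt.Errorf("invalid tenant schema %q", schema)
	}
	// The validated name contains no quotes, so quoting it as an identifier is safe.
	return fmt.Sprintf(`SET LOCAL search_path TO "%s", public`, schema), nil
}

func main() {
	for _, name := range []string{"tenant_123", `tenant_1"; DROP SCHEMA public; --`} {
		stmt, err := SearchPathStatement(name)
		fmt.Printf("stmt=%q err=%v\n", stmt, err)
	}
}
```

Execute the returned statement as the first step of the transaction that performs the tenant's queries, so every query in that transaction runs on the same connection with the same search path.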

Separate Schema Pros

Better isolation: Schema-level separation reduces data leakage risk
Customization: Each tenant can have different table structures
Moderate cost: Still single database instance
Easier per-tenant backups: Can backup individual schemas
Better for compliance: Stronger than shared schema pattern

Separate Schema Cons

Schema management complexity: Migrations must run per tenant
Connection overhead: Need to set search_path per connection
Limited scalability: PostgreSQL has no hard schema limit, but catalog size, pg_dump time, and migration runs degrade as schema counts reach the thousands
Cross-tenant queries: More complex, requires dynamic schema references
Resource limits: Still shared database resources

Separate Schema Best For

Medium-scale SaaS (dozens to hundreds of tenants)
When tenants need schema customization
Applications needing better isolation than shared schema
When compliance requirements are moderate

Pattern 3: Separate Database per Tenant

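Each tenant gets its own database, providing maximum isolation. Running each database on its own instance is what delivers independent scaling and removes noisy neighbors; separate databases on one shared server still compete for CPU, memory, and I/O.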

Separate Database Architecture

┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│  Database 1  │  │  Database 2  │  │  Database 3  │
│  (Tenant A)  │  │  (Tenant B)  │  │  (Tenant C)  │
└──────────────┘  └──────────────┘  └──────────────┘

Separate Database Implementation

-- Create database for tenant
CREATE DATABASE tenant_enterprise_corp;

-- Connect to tenant database
\c tenant_enterprise_corp

-- Create tables (no tenant_id needed!)
CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    email VARCHAR(255) NOT NULL,
    name VARCHAR(255),
    created_at TIMESTAMP DEFAULT NOW()
);

Dynamic Connection Management

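// Connection pool manager
type TenantDBManager struct {
    pools map[uint]*gorm.DB
    mu    sync.RWMutex
}

// NewTenantDBManager initialises the pool map; GetDB would panic writing to a nil map
func NewTenantDBManager() *TenantDBManager {
    return &TenantDBManager{pools: make(map[uint]*gorm.DB)}
}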

func (m *TenantDBManager) GetDB(tenantID uint) (*gorm.DB, error) {
    m.mu.RLock()
    if db, exists := m.pools[tenantID]; exists {
        m.mu.RUnlock()
        return db, nil
    }
    m.mu.RUnlock()

    m.mu.Lock()
    defer m.mu.Unlock()

    // Double-check after acquiring write lock
    if db, exists := m.pools[tenantID]; exists {
        return db, nil
    }

    // Create new connection
    db, err := gorm.Open(postgres.Open(fmt.Sprintf(
        "host=localhost user=dbuser password=dbpass dbname=tenant_%d sslmode=disable",
        tenantID,
    )), &gorm.Config{})

    if err != nil {
        return nil, err
    }

    m.pools[tenantID] = db
    return db, nil
}

Separate Database Pros

Maximum isolation: Complete data separation
Strongest security boundary: Cross-tenant access requires connecting to the wrong database, not just omitting a filter
Full customization: Each tenant can have completely different schemas
Independent scaling: Scale tenant databases individually
Easy compliance: Meets strictest data isolation requirements
Per-tenant backups: Simple, independent backup/restore
No noisy neighbors: Tenant workloads don't affect each other

Separate Database Cons

Highest cost: Multiple database instances require more resources
Operational complexity: Managing many databases (backups, monitoring, migrations)
Connection limits: Each database instance has connection limits
Cross-tenant analytics: Requires data federation or ETL
Migration complexity: Must run migrations across all databases
Resource overhead: More memory, CPU, and storage needed

Separate Database Best For

Enterprise SaaS with high-value customers
Strict compliance requirements (HIPAA, GDPR, SOC 2)
When tenants need significant customization
Low to medium tenant count (dozens to low hundreds)
When tenants have very different data models

Security Considerations

Regardless of the pattern chosen, security is paramount:

1. Row-Level Security (RLS)

PostgreSQL RLS automatically filters queries by tenant, providing a database-level security layer. This feature is particularly powerful for multi-tenant applications. For more details on PostgreSQL RLS, security policies, and other advanced PostgreSQL features, see our PostgreSQL Cheatsheet.

-- Enable RLS
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;

-- Policy to isolate by tenant
CREATE POLICY tenant_isolation ON orders
    FOR ALL
    USING (tenant_id = current_setting('app.current_tenant')::INTEGER);

-- Application sets tenant context
SET app.current_tenant = '123';

2. Application-Level Filtering

Always filter by tenant_id in application code. The examples below use GORM, but different ORMs have their own approaches to query building. For guidance on choosing the right ORM for your multi-tenant application, check our comparison of Go ORMs.

// ❌ BAD - Missing tenant filter
db.Where("email = ?", email).First(&user)

// ✅ GOOD - Always include tenant filter
db.Where("tenant_id = ? AND email = ?", tenantID, email).First(&user)

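// ✅ BETTER - Use scopes or middleware
db.Scopes(TenantScope(tenantID)).Where("email = ?", email).First(&user)

// TenantScope is a reusable GORM scope that applies the tenant filter
func TenantScope(tenantID uint) func(*gorm.DB) *gorm.DB {
    return func(db *gorm.DB) *gorm.DB {
        return db.Where("tenant_id = ?", tenantID)
    }
}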

3. Connection Pooling

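Use a pooler configuration that preserves tenant context. A session-level SET can leak between clients when server connections are reused, so set the tenant inside each transaction:

// PgBouncer with transaction pooling: use SET LOCAL app.current_tenant (or set_config(..., true)) inside the transaction
// Or use application-level connection routing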

4. Audit Logging

Track all tenant data access:

type AuditLog struct {
    ID        uint
    TenantID  uint
    UserID    uint
    Action    string
    Table     string
    RecordID  uint
    Timestamp time.Time
    IPAddress string
}

Performance Optimization

Indexing Strategy

Proper indexing is crucial for multi-tenant database performance. Understanding SQL indexing strategies, including composite indexes and partial indexes, is essential. For a comprehensive reference on SQL commands including CREATE INDEX and query optimization, see our SQL Cheatsheet. For PostgreSQL-specific indexing features and performance tuning, refer to our PostgreSQL Cheatsheet.

-- Composite indexes for tenant queries
CREATE INDEX idx_orders_tenant_created ON orders(tenant_id, created_at DESC);
CREATE INDEX idx_orders_tenant_status ON orders(tenant_id, status);

-- Partial indexes for common tenant-specific queries
CREATE INDEX idx_orders_active_tenant ON orders(tenant_id, created_at)
WHERE status = 'active';

Query Optimization

// Use prepared statements for tenant queries (sqlDB is a *sql.DB, e.g. from db.DB())
stmt, err := sqlDB.Prepare("SELECT id, email, name FROM users WHERE tenant_id = $1 AND email = $2")
if err != nil {
    return err
}
defer stmt.Close()

// Batch operations per tenant
db.Where("tenant_id = ?", tenantID).Find(&users)

// Use connection pooling per tenant (for separate database pattern)

Monitoring

Effective database management tools are essential for monitoring multi-tenant applications. You'll need to track query performance, resource usage, and database health across all tenants. For comparing database management tools that can help with this, check out our DBeaver vs Beekeeper comparison. Both tools offer excellent features for managing and monitoring PostgreSQL databases in multi-tenant environments.

Monitor per-tenant metrics:

Query performance per tenant
Resource usage per tenant
Connection counts per tenant
Database size per tenant

Migration Strategy

Shared Schema Pattern

When implementing database migrations, your choice of ORM affects how you handle schema changes. The examples below use GORM's AutoMigrate feature, but different ORMs have different migration strategies. For detailed information on how various Go ORMs handle migrations and schema management, see our Go ORMs comparison.

// Migrations apply to all tenants automatically
func Migrate(db *gorm.DB) error {
    return db.AutoMigrate(&User{}, &Order{}, &Product{})
}

Separate Schema/Database Pattern

// Migrations must run per tenant
func MigrateAllTenants(tenantIDs []uint) error {
    for _, tenantID := range tenantIDs {
        db := GetTenantDB(tenantID)
        if err := db.AutoMigrate(&User{}, &Order{}); err != nil {
            return fmt.Errorf("tenant %d: %w", tenantID, err)
        }
    }
    return nil
}
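
The loop above stops at the first failing tenant, which leaves later tenants unmigrated and gives no record of which ones succeeded. A possible alternative, sketched here with only the standard library, keeps going and reports every failure at once. The `migrate` callback is a stand-in for whatever runs your per-tenant migration; `MigrateEach` and `errSimulated` are hypothetical names. `errors.Join` requires Go 1.20 or later.

```go
package main

import (
	"errors"
	"fmt"
)

// errSimulated stands in for a real migration failure in the demo below.
var errSimulated = errors.New("simulated migration failure")

// MigrateEach runs migrate for every tenant and continues after failures.
// It returns the IDs that succeeded and a joined error describing the rest
// (nil when every tenant migrated cleanly).
func MigrateEach(tenantIDs []uint, migrate func(tenantID uint) error) ([]uint, error) {
	var done []uint
	var errs []error
	for _, id := range tenantIDs {
		if err := migrate(id); err != nil {
			errs = append(errs, fmt.Errorf("tenant %d: %w", id, err))
			continue
		}
		done = append(done, id)
	}
	return done, errors.Join(errs...)
}

func main() {
	done, err := MigrateEach([]uint{1, 2, 3}, func(id uint) error {
		if id == 2 {
			return errSimulated
		}
		return nil
	})
	fmt.Println("migrated:", done)
	fmt.Println("errors:", err)
}
```

Whether to continue or halt on failure is a policy decision: continuing keeps most tenants current, while halting avoids running tenants on mixed schema versions.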

Decision Matrix

Factor	Shared Schema	Separate Schema	Separate DB
Isolation	Low	Medium	High
Cost	Low	Medium	High
Scalability	High	Medium	Low-Medium
Customization	None	Medium	High
Operational Complexity	Low	Medium	High
Compliance	Limited	Good	Excellent
Best Tenant Count	1000+	10-1000	1-100

Hybrid Approach

You can combine patterns for different tenant tiers:

// Small tenants: Shared schema
if tenant.Tier == "standard" {
    return GetSharedDB(tenant.ID)
}

// Enterprise tenants: Separate database
if tenant.Tier == "enterprise" {
    return GetTenantDB(tenant.ID)
}
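
The fragment above falls through silently for any tier it does not recognise. A minimal sketch of the same routing decision as a single function with an explicit error case follows; the `Pattern` type and `PatternFor` name are hypothetical, and the tier strings are the two used above.

```go
package main

import "fmt"

// Pattern identifies which multi-tenancy pattern serves a tenant.
type Pattern int

const (
	UnknownPattern Pattern = iota
	SharedSchema
	SeparateDatabase
)

// Tenant carries only the fields needed for routing.
type Tenant struct {
	ID   uint
	Tier string
}

// PatternFor maps a tenant tier to a storage pattern and rejects unknown tiers
// instead of silently falling back to shared storage.
func PatternFor(t Tenant) (Pattern, error) {
	switch t.Tier {
	case "standard":
		return SharedSchema, nil
	case "enterprise":
		return SeparateDatabase, nil
	default:
		return UnknownPattern, fmt.Errorf("tenant %d: unknown tier %q", t.ID, t.Tier)
	}
}

func main() {
	for _, t := range []Tenant{{1, "standard"}, {2, "enterprise"}, {3, "trial"}} {
		p, err := PatternFor(t)
		fmt.Printf("tenant %d -> pattern %d, err=%v\n", t.ID, p, err)
	}
}
```

Failing closed on an unknown tier is the safer default here, because routing an enterprise tenant to shared storage by accident is an isolation problem, not just a performance one.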

Best Practices

Always filter by tenant: Never trust application code alone; use RLS when possible. Understanding SQL fundamentals helps ensure proper query construction—refer to our SQL Cheatsheet for query best practices.
Monitor tenant resource usage: Identify and throttle noisy neighbors. Use database management tools like those compared in our DBeaver vs Beekeeper guide to track performance metrics.
Implement tenant context middleware: Centralize tenant extraction and validation. Your ORM choice affects how you implement this—see our Go ORMs comparison for different approaches.
Use connection pooling: Efficiently manage database connections. PostgreSQL-specific connection pooling strategies are covered in our PostgreSQL Cheatsheet.
Plan for tenant migration: Ability to move tenants between patterns
Implement soft delete: Use deleted_at instead of hard deletes for tenant data
Audit everything: Log all tenant data access for compliance
Test isolation: Regular security audits to prevent cross-tenant data leakage

Conclusion

Choosing the right multi-tenancy database pattern depends on your specific requirements for isolation, cost, scalability, and operational complexity. The Shared Database, Shared Schema pattern works well for most SaaS applications, while Separate Database per Tenant is necessary for enterprise customers with strict compliance needs.

Start with the simplest pattern that meets your requirements, and plan for migration to a more isolated pattern as your needs evolve. Always prioritize security and data isolation, regardless of the pattern chosen.

Zettelkasten for Developers: A Practical Method That Works

Rost — Mon, 25 May 2026 13:09:20 +0000

Developers do not usually suffer from a lack of information. We suffer from too much of it.

There are API docs, pull requests, production incidents, design discussions, meeting notes, architecture diagrams, code comments, Slack threads, research papers, experiments, bookmarks, and half-finished ideas sitting in five different tools. The hard part is not saving information. The hard part is turning it into reusable thinking.

That is where Zettelkasten becomes useful.

A Zettelkasten is often described as a note-taking system, but that undersells it. Used well, it is a personal knowledge system for developing ideas over time. For developers, it can become a practical bridge between code, architecture, debugging, learning, and writing.

The opinionated part is this: most developers should not use Zettelkasten as a romantic productivity hobby. Do not build a beautiful note museum. Build a working system that helps you solve problems, explain systems, and make better engineering decisions.

What Is Zettelkasten?

Zettelkasten means "slip box". The method is associated with sociologist Niklas Luhmann, who used a large collection of linked notes to develop ideas and write extensively.

The important lesson is not that he used paper cards. The important lesson is that his notes were not isolated files. Each note had a clear idea, a place in the system, and links to other notes. Over time, the system became more valuable because connections accumulated.

For developers, the modern version is simple:

Write one useful idea per note.
Link it to related notes.
Use those links to grow explanations, decisions, patterns, and articles.

That is it. The rest is implementation detail.

Why Developers Struggle With Knowledge Overload

Software development creates knowledge that is both detailed and temporary.

You learn why a cache invalidation bug happened. You discover a weird edge case in a framework. You compare two queueing strategies. You debug a production outage. You understand why a legacy service behaves strangely. You read a great article about distributed tracing.

Then, two months later, you vaguely remember that you once knew the answer.

The usual developer knowledge stack makes this worse:

Bookmarks store sources, not understanding.
Folders force early categorization.
Wikis become stale when nobody owns them.
TODO lists mix tasks with ideas.
Code comments explain local details, not broader concepts.
Chat messages disappear into history.

A Zettelkasten helps because it treats knowledge as a network, not a warehouse. If that framing sounds familiar from reading about building a second brain, that is not a coincidence — both methods attack the same gap between capture and reuse, but Zettelkasten's discipline of atomic notes and explicit links gives developers a more granular handle on technical ideas.

Core Principles of Zettelkasten

Atomic Notes

An atomic note contains one idea.

Not one topic. Not one article summary. Not one giant page called "PostgreSQL". One idea.

For example, these are too broad:

PostgreSQL notes
Kubernetes
Caching
System design

These are closer to atomic:

```text
Partial indexes reduce write overhead when queries target a small subset
Kubernetes readiness probes protect traffic routing, not container startup
Write-through caching improves consistency but increases write latency
Idempotency keys turn retries into safe operations
```

Atomic notes are powerful because they are easier to link. A huge page can only be linked as a vague topic. A focused note can be connected to an exact concept, decision, bug, or system.

A good developer note should usually answer one of these questions:

What is the idea?
When does it matter?
What tradeoff does it expose?
Where have I seen it in real code?
What other concept does it connect to?

Linking

Links are the heart of the system.

The point is not to create a pretty graph. The point is to make ideas reusable.

When you write a note about idempotency keys, link it to notes about retries, distributed systems, payment processing, message queues, API design, and incident prevention. When you write a note about database migrations, link it to deploy safety, rollback strategy, backward compatibility, and feature flags.

A link should usually mean one of these things:

"This explains the same concept from another angle."
"This is a practical example of the idea."
"This is a tradeoff or counterpoint."
"This concept depends on that concept."
"This note belongs in a larger argument."

Avoid lazy links. Linking every note to every other note creates noise. The best links are intentional.
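One way to see why intentional links make ideas reusable is to invert them: every outgoing link also tells the target note who depends on it. The sketch below is a minimal in-memory model, not how any particular tool stores notes; the `Note` type and `Backlinks` function are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// Note is a hypothetical in-memory model of one atomic note.
type Note struct {
	Title string
	Links []string // titles of the notes this note points to
}

// Backlinks inverts the outgoing links: for each target title it returns
// the sorted titles of the notes that link to it.
func Backlinks(notes []Note) map[string][]string {
	back := make(map[string][]string)
	for _, n := range notes {
		for _, target := range n.Links {
			back[target] = append(back[target], n.Title)
		}
	}
	for target := range back {
		sort.Strings(back[target])
	}
	return back
}

func main() {
	notes := []Note{
		{Title: "Idempotency keys turn retries into safe operations",
			Links: []string{"Retries are safe only when the operation is idempotent"}},
		{Title: "Timeouts do not prove failure",
			Links: []string{"Retries are safe only when the operation is idempotent"}},
		{Title: "Retries are safe only when the operation is idempotent"},
	}
	for target, sources := range Backlinks(notes) {
		fmt.Println(target, "<-", sources)
	}
}
```

A note with many backlinks is usually a concept that other ideas lean on, which makes it a good candidate for review or for a future index note.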

Emergence

Emergence is the part of Zettelkasten that sounds mystical, but it is practical.

You do not need to design the perfect structure upfront. You add useful notes, connect them honestly, and let clusters appear over time.

After a few months, you may notice that many notes connect around topics like:

API reliability
Observability
Developer experience
Event-driven architecture
Database performance
Technical debt
Documentation
Security reviews

Those clusters become future articles, internal docs, design principles, conference talks, onboarding material, or better engineering decisions.

This is why Zettelkasten is different from a folder hierarchy. Folders ask you to decide where knowledge belongs before you fully understand it. Links let knowledge belong to multiple contexts.
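If the notes are available as a title-to-links map, clusters can be approximated as connected components of the link graph. This is only a rough sketch under that assumption: real clusters are judged by meaning, not just reachability, and the `Clusters` function is a hypothetical helper, not part of any note-taking tool.

```go
package main

import (
	"fmt"
	"sort"
)

// Clusters groups note titles into connected components, treating every
// link as undirected. links maps a note title to the titles it links to.
func Clusters(links map[string][]string) [][]string {
	// Build an undirected adjacency list that also contains link targets
	// which have no outgoing links of their own.
	adj := make(map[string][]string)
	for from, targets := range links {
		if _, ok := adj[from]; !ok {
			adj[from] = nil
		}
		for _, to := range targets {
			adj[from] = append(adj[from], to)
			adj[to] = append(adj[to], from)
		}
	}
	titles := make([]string, 0, len(adj))
	for t := range adj {
		titles = append(titles, t)
	}
	sort.Strings(titles) // deterministic cluster order

	seen := make(map[string]bool)
	var clusters [][]string
	for _, start := range titles {
		if seen[start] {
			continue
		}
		// A depth-first walk from an unvisited note collects one cluster.
		var cluster []string
		stack := []string{start}
		seen[start] = true
		for len(stack) > 0 {
			cur := stack[len(stack)-1]
			stack = stack[:len(stack)-1]
			cluster = append(cluster, cur)
			for _, next := range adj[cur] {
				if !seen[next] {
					seen[next] = true
					stack = append(stack, next)
				}
			}
		}
		sort.Strings(cluster)
		clusters = append(clusters, cluster)
	}
	return clusters
}

func main() {
	links := map[string][]string{
		"Timeouts do not prove failure":                          {"Retries are safe only when the operation is idempotent"},
		"Idempotency keys turn retries into safe operations":     {"Retries are safe only when the operation is idempotent"},
		"Partial indexes are useful when queries target a subset": {},
	}
	for i, c := range Clusters(links) {
		fmt.Println(i, c)
	}
}
```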

A Developer Adaptation Of Zettelkasten

Classic Zettelkasten advice often comes from academic writing — the personal knowledge management literature covers that tradition well. Developers need a slightly different version.

A developer Zettelkasten should connect three things:

Concepts
Code
Systems

Concepts

Concept notes explain reusable ideas.

Examples:

```text
Backpressure prevents fast producers from overwhelming slow consumers
Optimistic locking detects conflicting writes without blocking readers
Circuit breakers protect dependencies from repeated failing calls
```

These notes should be written in your own words. Copying documentation is not enough. The value comes from forcing yourself to explain the concept clearly.

A useful concept note can include:

A short explanation
A concrete example
A tradeoff
A link to a related pattern
A link to a real system where you used it

Code

Code notes capture practical implementation knowledge.

They are not random snippet dumps. A snippet is useful only when it explains a decision or pattern.

For example:

```markdown
# Idempotent request handling with a database constraint

The safest implementation is often a unique constraint on the idempotency key.
The application can retry safely because duplicate requests resolve to the same
stored result instead of creating a second side effect.

- [[Retries need idempotent operations]]
- [[Database constraints are concurrency control]]
- [[Payment APIs should treat network failure as unknown outcome]]
```

Good code notes explain why the code works, when to use it, and what can go wrong.
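A code note like this is easier to trust when a small runnable sketch sits next to it. The example below assumes an in-memory map stands in for the table with the unique constraint; `Store` and `Charge` are hypothetical names, and a real service would rely on the database to reject the duplicate key rather than on a mutex.

```go
package main

import (
	"fmt"
	"sync"
)

// Store mimics a table with a unique constraint on the idempotency key.
// It is an illustration only; a real implementation would use the database.
type Store struct {
	mu      sync.Mutex
	results map[string]string // idempotency key -> stored result
	charges int               // number of side effects actually performed
}

func NewStore() *Store {
	return &Store{results: make(map[string]string)}
}

// Charge performs the side effect at most once per key. A retry with the
// same key returns the stored result instead of charging again.
func (s *Store) Charge(key string, amountCents int) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	if result, ok := s.results[key]; ok {
		return result // duplicate request: resolve to the stored result
	}
	s.charges++
	result := fmt.Sprintf("charge-%d:%d", s.charges, amountCents)
	s.results[key] = result
	return result
}

func main() {
	s := NewStore()
	first := s.Charge("order-42", 1999)
	retry := s.Charge("order-42", 1999) // e.g. the client timed out and retried
	fmt.Println(first == retry, s.charges)
}
```

What can go wrong is worth writing into the note as well: this sketch ignores the case where the same key arrives with a different amount, which a real API would need to reject or report.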

Systems

System notes connect abstract ideas to your actual architecture.

For example:

```text
The billing service uses idempotency keys because payment provider calls
may succeed even when our HTTP client times out.
```

This note can link to:

```text
Idempotency keys turn retries into safe operations
Timeouts do not prove failure
Payment APIs should model unknown outcomes
Outbox pattern separates database writes from external side effects
```

This is where Zettelkasten becomes valuable for senior engineering work. It helps you build a memory of why systems are shaped the way they are.

A Practical Workflow

Step 1: Capture Fleeting Notes

A fleeting note is a rough capture. It does not need to be polished.

Examples:

```text
Look into why readiness probe failed during deploy.
Maybe retries made the duplicate invoice bug worse.
Good quote from incident review: timeout is not failure.
Research: Postgres partial index for active rows only.
```

Use whatever is fastest: Obsidian daily note, Logseq journal, a text file, mobile notes, or a scratch buffer.

The rule is simple: capture quickly, process later.

Step 2: Process Notes Into Permanent Notes

Processing is where the value appears.

Turn rough notes into clear, reusable notes. Rewrite in your own words. Give each note a title that states the idea.

Bad title:

```text
Retries
```

Better title:

```text
Retries are safe only when the operation is idempotent
```

Bad note:

```text
Need idempotency for retries.
```

Better note:

```text
Retries can turn a temporary network problem into duplicate side effects.
A retry is safe only when the operation can run more than once and still
produce the same business result. For APIs, this often requires an
idempotency key, a unique constraint, or a stored request result.
```

Step 3: Add Links While The Context Is Fresh

After writing the note, ask:

What does this explain?
What does this depend on?
Where have I seen this in code?
What is the opposite view?
What system would benefit from this?

Add only the links that help future you think.

Step 4: Create Index Notes Or Maps Of Content

Once a cluster grows, create an index note.

For example:

```markdown
# API Reliability

## Core ideas

- [[Retries are safe only when the operation is idempotent]]
- [[Timeouts do not prove failure]]
- [[Circuit breakers reduce pressure on failing dependencies]]
- [[Rate limits protect shared resources]]

## Implementation patterns

- [[Idempotency keys turn retries into safe operations]]
- [[Outbox pattern separates persistence from delivery]]
- [[Dead letter queues preserve failed messages for inspection]]

## System examples

- [[Billing service payment retry design]]
- [[Webhook delivery failure handling]]
```

This gives you navigation without forcing everything into folders.
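Index notes are usually written by hand, but once a cluster is known the skeleton can be generated. The following is a small sketch, assuming wikilink syntax and a `Section` type invented here for illustration; it only renders text and does not read or write any vault.

```go
package main

import (
	"fmt"
	"strings"
)

// Section is a hypothetical grouping inside a map-of-content note.
type Section struct {
	Heading string
	Titles  []string
}

// RenderIndex builds the Markdown body of an index note: a top-level
// heading, then one sub-heading and one wikilink list per section.
func RenderIndex(topic string, sections []Section) string {
	var b strings.Builder
	fmt.Fprintf(&b, "# %s\n", topic)
	for _, s := range sections {
		fmt.Fprintf(&b, "\n## %s\n\n", s.Heading)
		for _, t := range s.Titles {
			fmt.Fprintf(&b, "- [[%s]]\n", t)
		}
	}
	return b.String()
}

func main() {
	fmt.Print(RenderIndex("API Reliability", []Section{
		{Heading: "Core ideas", Titles: []string{
			"Retries are safe only when the operation is idempotent",
			"Timeouts do not prove failure",
		}},
		{Heading: "System examples", Titles: []string{
			"Billing service payment retry design",
		}},
	}))
}
```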

Step 5: Use Notes To Produce Output

A Zettelkasten should produce something.

For developers, output can be:

Architecture decision records
Design documents
Blog posts
Debugging guides
Onboarding docs
Pull request explanations
Internal talks
Refactoring plans
Incident review insights

If your notes never influence your work, the system is too decorative.

Recommended Note Types For Developers

Fleeting Notes

Temporary notes for quick capture.

Use them for:

Ideas during coding
Debugging observations
Meeting fragments
Questions
Bookmarks to process later

Delete or convert them quickly. Do not let them become a swamp.

Literature Notes

Notes about external sources.

For developers, a source can be:

Documentation
Blog article
RFC
Source code
Conference talk
GitHub issue
Postmortem
Book chapter

Keep source notes separate from your own permanent notes. A source note says, "This source said this." A permanent note says, "I understand this idea this way."

Permanent Notes

These are the core of the Zettelkasten.

A permanent note should be:

Atomic
Written in your own words
Linked to related notes
Useful without needing the original source
Stable enough to revisit later

Project Notes

Project notes are allowed, but do not confuse them with permanent notes.

A project note might be:

```text
Migrate billing worker to queue v2
```

It can link to permanent notes like:

```text
Backpressure prevents queue consumers from collapsing
Outbox pattern separates persistence from delivery
Feature flags reduce deployment risk
```

Projects end. Concepts stay.

Tool Examples

Obsidian

Obsidian works well for developer Zettelkasten because it uses local Markdown files and supports internal links.

A simple Obsidian structure:

```text
notes/
  fleeting/
  sources/
  permanent/
  maps/
  projects/
```

Example note:

```markdown
# Timeouts do not prove failure

A timeout means the client stopped waiting. It does not prove the server failed.
The operation may have succeeded, failed, or still be running.

This matters for payment APIs, job queues, and any external side effect.

- [[Retries are safe only when the operation is idempotent]]
- [[Idempotency keys turn retries into safe operations]]
- [[External side effects need reconciliation]]
```

Obsidian is a good fit if you like file ownership, plain text, and editor-like workflows.

Logseq

Logseq is useful if you prefer outlining, daily journals, and block-level references.

Its block model works well for capturing small units of thought. You can write rough notes in the journal, then promote useful blocks into permanent notes.

Example Logseq-style workflow:

```text
- Timeout during payment request does not prove payment failure.
  - This should become a permanent note about unknown outcomes.
  - Related: [[Idempotency]], [[Retries]], [[Payment APIs]]
```

Logseq is a good fit if your thinking starts as outlines and you like block references. For a side-by-side comparison of both tools across workflow style, sync options, and plugin ecosystems, Obsidian vs Logseq maps the trade-offs clearly.

Plain Markdown And Git

You do not need a special app.

A Git repository of Markdown files can be enough:

```text
knowledge/
  permanent/
  sources/
  maps/
```

Use normal Markdown links:

```markdown
[Retries are safe only when operations are idempotent](../permanent/retries-safe-only-with-idempotency.md)
```

This approach is boring, durable, and developer-friendly. That is a compliment.
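With plain files, each note title has to become a file name. One possible convention, sketched below, is to lowercase the title and join its words with hyphens. The `Slug` function is a hypothetical helper, and its output will not necessarily match hand-shortened names such as the one in the link above.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// Slug turns a claim-style title into a lowercase, hyphen-separated
// file name stem. Runs of non-alphanumeric characters become one hyphen,
// and leading or trailing separators are dropped.
func Slug(title string) string {
	var b strings.Builder
	pendingHyphen := false
	for _, r := range strings.ToLower(title) {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			if pendingHyphen && b.Len() > 0 {
				b.WriteByte('-')
			}
			pendingHyphen = false
			b.WriteRune(r)
		} else {
			pendingHyphen = true
		}
	}
	return b.String()
}

func main() {
	fmt.Println(Slug("Timeouts do not prove failure") + ".md")
}
```

Keeping the slug derived from the claim means a directory listing already reads like a list of ideas, which suits the boring-and-durable goal.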

Naming Notes

Prefer titles that make claims.

Weak titles:

```text
Caching
Queues
OAuth
PostgreSQL indexes
```

Strong titles:

```text
Cache invalidation is a coordination problem
Queues hide latency but do not remove work
OAuth access tokens should be short lived
Partial indexes are useful when queries target a subset
```

A claim-based title makes the note easier to understand and easier to link.

What To Put In A Developer Zettelkasten

Good candidates:

Architecture principles
Debugging lessons
Production incident insights
API design rules
Database patterns
Security assumptions
Performance tradeoffs
Framework edge cases
Refactoring heuristics
Testing strategies
Deployment lessons
Code review patterns

Poor candidates:

Raw meeting transcripts
Unprocessed bookmarks
Huge copied documentation pages
Random snippets with no explanation
Task lists
Secrets
Credentials
Anything that belongs only in official company documentation

A personal Zettelkasten can reference work, but it should not become an unsafe shadow copy of private systems.

Common Mistakes

Mistake 1: Over-Structuring Too Early

Developers love structure. That is sometimes a problem.

Do not spend the first week designing folders, tags, templates, naming conventions, dashboards, and automation. You do not yet know what structure your notes need.

Start with a small number of note types:

```text
fleeting
sources
permanent
maps
projects
```

Let complexity earn its place.

Mistake 2: Treating It Like Folders

A Zettelkasten is not a better folder tree.

If every note belongs to exactly one folder and has no meaningful links, you have built a filing cabinet. That may still be useful, but it is not Zettelkasten.

The value comes from connections:

```text
API retries -> idempotency -> database constraints -> payment safety -> incident prevention
```

That chain is more useful than a folder called "Backend".

Mistake 3: Saving Instead Of Thinking

Copying is not learning.

A saved paragraph from documentation may help later, but a rewritten explanation helps now. The act of restating an idea in your own words is where understanding improves.

A good rule:

```text
Do not create a permanent note until you can explain the idea without copying.
```

Mistake 4: Linking Everything

Too many links are as bad as too few.

Do not link words just because they exist. Link ideas because the relationship matters.

A useful link should help future you answer:

```text
Why is this connected?
```

Mistake 5: Confusing Tags With Structure

Tags are useful for status and broad grouping:

```text
#todo
#source
#security
#draft
```

But tags should not carry the whole system. If you rely only on tags, you lose the richer meaning of direct links.

A link says:

```text
This idea relates to that idea in a specific way.
```

A tag usually says:

```text
This belongs to a broad bucket.
```

Both are useful. They are not the same.

Mistake 6: Never Producing Output

A Zettelkasten that never produces output becomes a private archive.

Output does not have to mean public writing. It can be a design doc, an incident review, a better pull request, or a clear explanation to a teammate.

The system should make your thinking easier to reuse.

A Minimal Template

Use a small template. Resist the urge to create a form with fifteen fields.

```markdown
# Title as a claim

## Idea

Explain the idea in your own words.

## Why it matters

Describe the practical impact.

## Example

Show a code, system, or debugging example.

## Tradeoffs

Mention limits, risks, or counterpoints.
```

Example: From Bug To Zettelkasten Notes

Imagine you fixed a bug where users were charged twice after a timeout.

A weak note would be:

```text
Payment bug - retries caused duplicate charge.
```

A stronger set of notes might be:

```text
Timeouts do not prove failure
Retries are safe only when the operation is idempotent
Idempotency keys turn retries into safe operations
Payment APIs should model unknown outcomes
Database constraints are concurrency control
```

Now the bug has become reusable engineering knowledge.

Later, those notes can support:

A postmortem
A design doc for payment retries
A blog post about idempotency
A checklist for external API integrations
A code review comment
A safer implementation

That is the practical value of Zettelkasten.

A Weekly Maintenance Routine

You do not need a complicated review process.

Once a week:

Process rough notes.
Delete notes that no longer matter.
Convert useful ideas into permanent notes.
Add missing links.
Promote clusters into map notes.
Pick one note and turn it into output.

Keep it lightweight. The system should support development, not compete with it.
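The "add missing links" step is easier when something points out the notes that have no connections at all. Below is a minimal sketch, assuming notes use `[[wikilink]]` syntax and are already loaded as a title-to-body map; `Orphans` is a hypothetical helper, and loading files from disk is left out on purpose.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// wikilink matches [[Target]] and captures Target. It also accepts the
// [[Target|alias]] form and captures only the target part.
var wikilink = regexp.MustCompile(`\[\[([^\]|]+)(?:\|[^\]]*)?\]\]`)

// Orphans returns the sorted titles of notes that neither link out nor
// are linked to. notes maps a note title to its Markdown body.
func Orphans(notes map[string]string) []string {
	connected := make(map[string]bool)
	for title, body := range notes {
		for _, m := range wikilink.FindAllStringSubmatch(body, -1) {
			connected[title] = true // this note links out
			connected[m[1]] = true  // the target is linked to
		}
	}
	var orphans []string
	for title := range notes {
		if !connected[title] {
			orphans = append(orphans, title)
		}
	}
	sort.Strings(orphans)
	return orphans
}

func main() {
	notes := map[string]string{
		"Timeouts do not prove failure": "See [[Retries are safe only when the operation is idempotent]].",
		"Retries are safe only when the operation is idempotent": "A retry can duplicate side effects.",
		"Queues hide latency but do not remove work":             "No links yet.",
	}
	fmt.Println(Orphans(notes))
}
```

An orphan is not automatically a bad note; it is a prompt to either link it, convert it, or delete it during the weekly pass.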

Practical Rules

Use these rules to keep the system healthy:

One idea per note.
Write titles as claims.
Prefer links over folders.
Keep source notes separate from your own ideas.
Connect notes to real code and real systems.
Create map notes only when a cluster exists.
Delete low-value notes.
Do not automate before you understand your workflow.
Use the system to produce something.

When Zettelkasten Is Not Worth It

Zettelkasten is not the answer to every problem.

It may be overkill if:

You only need a task manager.
You rarely revisit technical ideas.
You do not write, teach, design, or document.
Your notes are mostly short-lived project details.
You are using it to avoid doing the actual work.

It is most useful when your work depends on compounding understanding.

That includes senior engineering, architecture, technical leadership, debugging complex systems, writing, consulting, research, and learning deeply over many years.

Final Thoughts

For developers, Zettelkasten is not about collecting notes. It is about building a thinking environment.

The method works best when it stays practical: atomic notes, meaningful links, real examples, and regular output. Connect concepts to code. Connect code to systems. Connect systems to decisions.

Do not try to build the perfect second brain. Build a useful one.

A good developer Zettelkasten should help you answer better questions:

```text
Where have I seen this problem before?
What concept explains this bug?
What tradeoff are we making?
What pattern applies here?
What should I write down so I do not relearn this again?
```

That is enough.

OpenClaw vs Hermes Agent: Stars, Downloads & Usage 2026

Rost — Mon, 25 May 2026 13:09:06 +0000

Open-source AI agent frameworks are exploding in popularity on GitHub.
Two projects at the core of the self-hosted AI systems ecosystem — OpenClaw and Hermes Agent — have pulled so far ahead that the rest of the field is fighting for a distant third place.

Here is the full picture as of May 2026.

The Leaderboard

Star counts are live data fetched from the GitHub API on May 21, 2026. Repos are sorted by current stars, descending.

Rank	Project	GitHub repo	Language	Stars	Releases last 30 days
1	OpenClaw	`openclaw/openclaw`	TypeScript	373,616	62
2	Hermes Agent	`NousResearch/hermes-agent`	Python	160,175	5
3	Nanobot	`HKUDS/nanobot`	Python	42,873	2
4	AstrBot	`AstrBotDevs/AstrBot`	Python	32,709	11
5	ZeroClaw	`zeroclaw-labs/zeroclaw`	Rust	31,500	≥1
6	NanoClaw	`nanocoai/nanoclaw`	TypeScript	29,143	≥1
7	PicoClaw	`sipeed/picoclaw`	Go	29,121	3
8	AionUi	`iOfficeAI/AionUi`	TypeScript	26,025	≥3
9	NemoClaw	`NVIDIA/NemoClaw`	TypeScript	20,571	0
10	OpenFang	`RightNow-AI/openfang`	Rust	17,599	≥5
11	LangBot	`langbot-app/LangBot`	Python	16,084	1
12	memU	`NevaMind-AI/memU`	Python	13,672	0
13	IronClaw	`nearai/ironclaw`	Rust	12,305	4
14	Moltworker	`cloudflare/moltworker`	TypeScript	9,899	0
15	MemOS	`MemTensor/MemOS`	Python	9,246	≥2
16	ClawWork	`HKUDS/ClawWork`	Python	8,111	0
17	NullClaw	`nullclaw/nullclaw`	Zig	7,603	2
18	MimicLaw	`memovai/mimiclaw`	C	5,422	0
19	Moltis	`moltis-org/moltis`	Rust	2,697	≥3
20	Clawra	`SumeLabs/clawra`	TypeScript	2,298	0

OpenClaw: 373k Stars and Still Growing

OpenClaw is a personal AI assistant framework built in TypeScript. It runs entirely on the user's own device and connects to over 50 messaging platforms — WhatsApp, Telegram, Slack, Discord, and more — through a single unified interface.

The project launched in November 2025 but truly ignited on January 30, 2026, reaching 100,000 stars within 48 hours of its relaunch. By April 2026 it had overtaken React to become the most-starred software repository in GitHub's history. At the time of this writing it sits at 373,616 stars, 72,000+ forks, and 360 contributors.

The release cadence is extraordinary: 62 tagged releases in the last 30 days puts it in a category of its own in terms of iteration speed. The full arc of how OpenClaw grew from a weekend prototype to GitHub's most-starred repository — including the economics behind the viral spike and the April 2026 subscription cutoff that reshaped weekly growth — is detailed in the OpenClaw rise and fall timeline.

Hermes Agent: The Challenger

Nous Research's Hermes Agent markets itself as "the agent that grows with you." It is a self-improving AI agent built in Python with a built-in learning loop — it creates new skills from experience, searches past conversations for relevant context, and can run on a range of infrastructure options from local hardware to cloud.

Created in July 2025 and now at 160,175 stars, Hermes Agent recently surpassed OpenClaw as the world's most-used open-source AI agent by daily token processing on OpenRouter, and the OpenRouter figures later in this article show it ahead in cumulative all-time usage as well. The gap between the two in raw GitHub stars remains large (over 200k), but Hermes Agent's trajectory is notably steeper.

Mid-field: The 20k–45k Band

The third through eighth positions are all clustered between 26k and 43k stars, making ranking changes here frequent:

Nanobot (HKUDS, 42,873 ⭐) — Python, lightweight graph-based task orchestration from the HKU Data Science lab.
AstrBot (AstrBotDevs, 32,709 ⭐) — Python, multi-platform chatbot framework with active release history (11 releases in the last 30 days).
ZeroClaw (zeroclaw-labs, 31,500 ⭐) — Rust, systems-level agent runtime targeting low-latency deployments.
NanoClaw (nanocoai, 29,143 ⭐) — TypeScript, recently migrated from qwibitai/nanoclaw to nanocoai/nanoclaw; the rename caused a brief star-count gap in trackers.
PicoClaw (Sipeed, 29,121 ⭐) — Go, embedded-friendly agent framework. Only 22 stars separate it from NanoClaw.
AionUi (iOfficeAI, 26,025 ⭐) — TypeScript, focuses on agentic UI generation with a visual workflow editor.

Language Breakdown

Language	Repos in top 20	Total stars
Python	7	282,870
TypeScript	6	461,552
Rust	4	64,101
Go	1	29,121
Zig	1	7,603
C	1	5,422

TypeScript leads in total star weight, largely because of OpenClaw itself, while Python holds the most individual projects. Rust is carving out a niche in the performance-sensitive tier (ZeroClaw, OpenFang, IronClaw, Moltis).
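A per-language breakdown like this is a simple group-and-sum over the leaderboard rows. The sketch below tallies the 20 rows exactly as they appear in the leaderboard table at the top of this article; the `Repo`, `LangTotal`, and `ByLanguage` names are introduced here for illustration and are not part of any tracker's API.

```go
package main

import (
	"fmt"
	"sort"
)

// Repo is one leaderboard row, reduced to the fields the tally needs.
type Repo struct {
	Name     string
	Language string
	Stars    int
}

// leaderboard copies the 20 rows of the leaderboard table in this article.
var leaderboard = []Repo{
	{"OpenClaw", "TypeScript", 373616}, {"Hermes Agent", "Python", 160175},
	{"Nanobot", "Python", 42873}, {"AstrBot", "Python", 32709},
	{"ZeroClaw", "Rust", 31500}, {"NanoClaw", "TypeScript", 29143},
	{"PicoClaw", "Go", 29121}, {"AionUi", "TypeScript", 26025},
	{"NemoClaw", "TypeScript", 20571}, {"OpenFang", "Rust", 17599},
	{"LangBot", "Python", 16084}, {"memU", "Python", 13672},
	{"IronClaw", "Rust", 12305}, {"Moltworker", "TypeScript", 9899},
	{"MemOS", "Python", 9246}, {"ClawWork", "Python", 8111},
	{"NullClaw", "Zig", 7603}, {"MimicLaw", "C", 5422},
	{"Moltis", "Rust", 2697}, {"Clawra", "TypeScript", 2298},
}

// LangTotal is the aggregate for one language.
type LangTotal struct {
	Language string
	Repos    int
	Stars    int
}

// ByLanguage groups repos by language and returns the totals sorted by
// star count, descending.
func ByLanguage(repos []Repo) []LangTotal {
	index := make(map[string]int) // language -> position in totals
	var totals []LangTotal
	for _, r := range repos {
		i, ok := index[r.Language]
		if !ok {
			i = len(totals)
			index[r.Language] = i
			totals = append(totals, LangTotal{Language: r.Language})
		}
		totals[i].Repos++
		totals[i].Stars += r.Stars
	}
	sort.Slice(totals, func(a, b int) bool { return totals[a].Stars > totals[b].Stars })
	return totals
}

func main() {
	for _, t := range ByLanguage(leaderboard) {
		fmt.Printf("%-10s %2d repos %8d stars\n", t.Language, t.Repos, t.Stars)
	}
}
```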

Release Velocity vs Star Count

High star counts do not always mean high release velocity. Several top-starred repos (NemoClaw, memU, ClawWork, Clawra, MimicLaw) show zero releases in the last 30 days — they may be in maintenance mode or experiencing slower development cycles.

AstrBot stands out in the mid-field with 11 releases in 30 days, suggesting active feature development. OpenFang (≥5) and Moltis (≥3) are also moving quickly relative to their star counts, which may signal emerging momentum.

Notable Moves Since Last Snapshot

NanoClaw renamed org from qwibitai to nanocoai; updated link in the table above.
NemoClaw language corrected to TypeScript (previously listed as JavaScript in older data).
AionUi gained ~800 stars and held on to 8th place.
MemOS crossed 9,000 stars.

OpenRouter Usage Rankings

GitHub stars measure mindshare; OpenRouter token volume measures actual runtime usage. The two charts tell different stories.

The table below shows the global daily ranking on OpenRouter as of May 21, 2026, filtered to apps and agents that have opted into usage attribution. Counts are daily tokens processed through the platform.

Rank	App / Agent	Category	Daily tokens
1	Hermes Agent	Personal / CLI Agents	458 B
2	OpenClaw	Personal / CLI Agents	173 B
3	Kilo Code	CLI / IDE Agents	163 B
4	Descript	Video Generation	68.1 B
5	Claude Code	CLI Agents	64.1 B
6	pi	CLI Agents	58 B
8	Janitor AI	Roleplay	28.4 B
9	ISEKAI ZERO	Game	26.8 B
10	CSS AI Pro	—	25.4 B
11	Cline	IDE / CLI Agents	23.5 B
12	Roo Code	IDE / Cloud Agents	20.1 B
13	Lemonade	Programming App	20 B
14	Mira	Personal Agents	15.2 B
15	VidMuse	Video Generation	13.3 B
16	AA-LCR Benchmark	Research	9.42 B
18	SillyTavern	Roleplay	7.84 B
19	OpenHands	CLI Agents	7.21 B
20	Nous Research API	General Chat	6.63 B

Gaps in rank numbers (e.g., no 7 or 17) reflect apps without public attribution at the time of retrieval; OpenRouter lists 60 apps in total.

All-time cumulative tokens

Both the daily leader and the all-time leader have changed since earlier in the year. As of today, Hermes Agent has also overtaken OpenClaw on the all-time chart, a milestone it crossed some time after the May 10 daily flip.

App / Agent	All-time tokens
Hermes Agent	8.14 T
OpenClaw	7.18 T
Kilo Code	5.21 T
Claude Code	2.6 T

What the numbers mean

The gap between Hermes Agent (458 B daily) and OpenClaw (173 B) is now wider than it was on May 10, when the flip first happened at 224 B vs 186 B. Hermes has more than doubled its daily volume in 11 days; OpenClaw's daily volume has declined.
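The comparison above is plain percentage arithmetic on the quoted daily volumes. As a quick check, the sketch below computes the relative change between the May 10 and May 21 figures; `PercentChange` is a helper defined here for illustration.

```go
package main

import "fmt"

// PercentChange returns the relative change from before to after, in percent.
func PercentChange(before, after float64) float64 {
	return (after - before) / before * 100
}

func main() {
	// Daily token volumes in billions, as quoted above (May 10 vs May 21).
	fmt.Printf("Hermes Agent: %+.1f%%\n", PercentChange(224, 458))
	fmt.Printf("OpenClaw:     %+.1f%%\n", PercentChange(186, 173))
}
```

The Hermes figure comes out slightly above +100 %, which is what "more than doubled" means here, while OpenClaw's change is a single-digit decline.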

The architecture difference explains a lot of this. OpenClaw is session-native — it resets between runs, which means every session re-pays the full context-stuffing cost. Hermes is a persistent runtime with a three-layer memory system (identity snapshot, SQLite FTS5 session database, self-written procedural skill files). Once a skill is written, repeat tasks cost a fraction of the tokens.

For the coding-agent sub-category specifically, the top five are Hermes Agent, OpenClaw, Kilo Code, Claude Code, and pi. Cline (#11) and Roo Code (#12) round out the open-source coding-agent tier, both crossing 20 B daily tokens.

The driver of Hermes's May acceleration was the v0.13.0 "Tenacity" release (May 7, 2026): 864 commits, 588 merged PRs, 295 contributors. That release shipped a Kanban-style durable multi-agent task board with heartbeat monitoring and hallucination recovery, plus eight P0 security fixes and Google Chat as the 20th messaging integration.

Community Health

GitHub repository metrics reveal a sharp contrast in project maturity and maintenance style between the two leaders.

Metric	OpenClaw	Hermes Agent
Issue close rate	89.9 %	37.2 %
Contributors	360	400
Forks	72,696	26,000
Releases shipped (total)	82+	14+
Disclosed CVEs (2026 YTD)	9 in 4 days (March 2026)	0
Worst CVE severity	CVSS 9.9	—
Exposed public instances	135,000+ across 82 countries	Not separately tracked
Security response (v0.13.0)	—	8 P0 fixes, default redaction on

OpenClaw's 89.9 % issue close rate reflects a well-staffed, responsive maintainer team — the highest of any project in this space. Its release cadence (62 in the last 30 days alone) is exceptional, but that velocity has a cost: roughly a quarter of updates reportedly break response delivery on at least one channel, and the March 2026 CVE cluster (nine issues in four days, the worst at CVSS 9.9) forced emergency patching at scale. Shadowserver confirmed over 135,000 exposed Gateway instances across 82 countries in the same window. The OpenClaw team does publish fixes fast; the problem is that a community of this size patches slowly.

Hermes Agent's 37.2 % issue close rate is the expected profile for a three-month-old project with a backlog accumulating faster than it can be triaged. The security record so far is clean — zero disclosed agent-specific CVEs as of May 2026 — though that partly reflects fewer eyes on the codebase. The v0.13.0 "Tenacity" release shipped eight P0 fixes proactively, before any public disclosure, which is a good signal of security culture.

Ecosystem Size

Package downloads

Package	Registry	Weekly downloads
`openclaw` (main)	npm	5,344,931
`@tencent-weixin/openclaw-weixin`	npm	230,903
`@ollama/openclaw-web-search`	npm	160,221
`@paperclipai/adapter-openclaw-gateway`	npm	159,310
`@larksuite/openclaw-lark`	npm	115,964
`hermes-agent` (main)	PyPI	53,134

The raw numbers are not directly comparable: npm counts installs on every npm install (including CI runs), while PyPI counts pip installs. OpenClaw also has a larger ecosystem of third-party adapter packages that each pull in the core. Even so, the roughly hundredfold difference (about 5.3 million versus 53,000 weekly downloads) reflects OpenClaw's deeper penetration of automated pipelines and developer toolchains.

Hermes Agent at 53,000 PyPI downloads per week is not a small number for a three-month-old Python tool. Its install rate has grown roughly linearly with the GitHub star count.

Skills and integrations

Dimension	OpenClaw	Hermes Agent
Third-party skill marketplace	ClawHub — 44,000+ skills	None yet (self-generated only)
Messaging integrations (official)	50+ channels	20 channels
Community repositories (GitHub)	Large (untracked by maintainers)	80+ quality-filtered
Skill libraries (community)	Embedded in ClawHub	17 curated
Multi-agent orchestration frameworks	Built-in ACP swarm	9 third-party
External memory providers	Via skills	8 native

ClawHub is OpenClaw's most durable moat: 44,000 community-maintained skills covering integrations, automations, and workflows that would take months to replicate. Hermes's answer is to generate skills from its own task completions rather than pull them from a marketplace — a fundamentally different philosophy that pays off on deep, repeated tasks but leaves gaps on long-tail integrations. The eight external memory backends Hermes ships natively — Honcho, OpenViking, Mem0, Hindsight, and four more — are compared in detail in Agent Memory Providers Compared.

One security note on ClawHub: in Q1 2026 Koi Security identified 341 malicious entries in the registry, prompting OpenClaw to add a verification layer to the skill submission pipeline. A detailed guide to vetting skills, understanding which are safe to install, and navigating ClawHub's quality tiers is in OpenClaw Skills Ecosystem and Practical Production Picks.

Community Sentiment

A synthesis of Reddit threads across r/homeautomation, r/selfhosted, and r/MachineLearning (compiled by kilo.ai) breaks down operator preferences as follows:

Stance	Share
Stay on OpenClaw	35 %
Switched fully to Hermes	30 %
Run both side by side	20 %
Withholding judgment on Hermes	15 %

The 15 % holding off on Hermes are primarily concerned about what some users characterise as coordinated promotion activity from newly created accounts in Hermes-related threads — a pattern common to fast-growing projects but notable enough that veteran community members flag it.

Top OpenClaw complaints (by upvote volume)

Release breakage — most-upvoted complaint has 305 votes: "Every single update ships more bugs and problems than before." An estimated 25 % of releases break response delivery on at least one channel.
Memory drift — agents forget prior instructions across sessions, requiring users to re-establish context manually.
Self-host friction — disproportionate time spent on Docker configuration, SSH setup, and YAML tuning relative to actual agent work.

Top Hermes Agent complaints (by frequency)

Unreliable self-evaluation — the agent occasionally reports task success when the outcome was a partial failure.
Skill file overwriting — auto-improvement rewrites manually tuned skill files, discarding intentional customisation.
Integration gaps — ClawHub has a skill for almost everything; Hermes does not, and self-generation takes time to catch up.

The "run both" pattern (20 % of operators) is the most architecturally interesting: OpenClaw as the channel-and-routing layer up front, with Hermes as the deep-specialist backend. Messages arrive via Telegram or Slack, OpenClaw routes them, and the tasks where compounding matters are dispatched to a Hermes instance that has been improving on exactly those workflows for weeks.

Search Interest Trend

Tracking the project growth leaderboard (weekly new GitHub stars, a cleaner signal than raw star count) shows a clear momentum reversal as of May 2026.

Project	Weekly star growth	Leaderboard position
Claw-code	+7,000	#1
Hermes Agent	+3,800	#3
OpenClaw	+1,700	#11

OpenClaw had a +40,000/week peak in early February 2026 during the post-relaunch explosion. At +1,700/week in May, it is still growing in absolute terms — 373k stars does not happen without weekly adds — but it has settled into a mature project cadence, not a growth sprint.

Hermes Agent at +3,800/week is the fastest-growing agent runtime on the leaderboard right now, despite having less than half of OpenClaw's cumulative stars. Its growth curve is steeper than OpenClaw's was at the same age (week 12 post-launch).

The broader search interest trend corroborates the star-growth pattern. Queries for "Hermes Agent" and "hermes-agent install" have been rising consistently since the February launch; "OpenClaw" search volume peaked in late January and has been flat-to-declining since. The intersection point — where Hermes search volume equals OpenClaw's — has not yet been reached, but the trajectories suggest it will cross sometime in Q3 2026 if current rates hold.

The HN community has also shifted: threads about OpenClaw now centre on security hardening, transport trust (Telegram's lack of default end-to-end encryption), and maintenance overhead. Threads about Hermes Agent are still mostly "how do I set this up for X" — an earlier-stage energy that reflects a project still in its adoption phase.


Go Unit Testing: Structure & Best Practices

Rost — Sun, 24 May 2026 02:28:55 +0000

Go's built-in testing package provides a powerful, minimalist framework for writing unit tests without external dependencies.
Here are the testing fundamentals, project structure, and advanced patterns to build reliable Go applications.

Why Testing Matters in Go

Go's philosophy emphasizes simplicity and reliability. The standard library includes the testing package, making unit testing a first-class citizen in the Go ecosystem. Well-tested Go code improves maintainability, catches bugs early, and provides documentation through examples. If you're new to Go, check out our Go Cheat Sheet for a quick reference of the language fundamentals.

Key benefits of Go testing:

Built-in support: No external frameworks required
Fast execution: Packages are tested in parallel by default, and tests within a package can opt in with t.Parallel()
Simple syntax: Minimal boilerplate code
Rich tooling: Coverage reports, benchmarks, and profiling
CI/CD friendly: Easy integration with automated pipelines

Project Structure for Go Tests

Go tests live alongside your production code with a clear naming convention:

myproject/
├── go.mod
├── main.go
├── calculator.go
├── calculator_test.go
├── utils/
│   ├── helper.go
│   └── helper_test.go
└── models/
    ├── user.go
    └── user_test.go

Key conventions:

Test files end with _test.go
Tests are in the same package as the code (or in a package named with the _test suffix, such as calculator_test, for black-box testing)
Each source file can have a corresponding test file

Package Testing Approaches

White-box testing (same package):

package calculator

import "testing"
// Can access unexported functions and variables

Black-box testing (external package):

package calculator_test

import (
    "testing"
    "myproject/calculator"
)
// Can only access exported functions (recommended for public APIs)

Basic Test Structure

Every test function follows this pattern:

package calculator

import "testing"

// Test function must start with "Test"
func TestAdd(t *testing.T) {
    result := Add(2, 3)
    expected := 5

    if result != expected {
        t.Errorf("Add(2, 3) = %d; want %d", result, expected)
    }
}

testing.T methods:

t.Error() / t.Errorf(): Mark test as failed but continue
t.Fatal() / t.Fatalf(): Mark test as failed and stop immediately
t.Log() / t.Logf(): Log output (shown only when the test fails or with the -v flag)
t.Skip() / t.Skipf(): Skip the test
t.Parallel(): Run test in parallel with other parallel tests

t.Log is for human-readable test diagnostics. In running services, log/slog and JSON-friendly records are usually a better match for aggregation and incident debugging. See Structured Logging in Go with slog for Observability and Alerting.

Table-Driven Tests: The Go Way

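Table-driven tests are the idiomatic Go approach for testing multiple scenarios: each case is one row in a slice of structs, and a single loop runs every row as a named subtest. With Go generics, you can also create type-safe test helpers that work across different data types. A typical table-driven test looks like this: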

func TestCalculate(t *testing.T) {
    tests := []struct {
        name     string
        a, b     int
        op       string
        expected int
        wantErr  bool
    }{
        {"addition", 2, 3, "+", 5, false},
        {"subtraction", 5, 3, "-", 2, false},
        {"multiplication", 4, 3, "*", 12, false},
        {"division", 10, 2, "/", 5, false},
        {"division by zero", 10, 0, "/", 0, true},
    }

    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            result, err := Calculate(tt.a, tt.b, tt.op)

            if (err != nil) != tt.wantErr {
                t.Errorf("Calculate() error = %v, wantErr %v", err, tt.wantErr)
                return
            }

            if result != tt.expected {
                t.Errorf("Calculate(%d, %d, %q) = %d; want %d", 
                    tt.a, tt.b, tt.op, result, tt.expected)
            }
        })
    }
}

Advantages:

Single test function for multiple scenarios
Easy to add new test cases
Clear documentation of expected behavior
Better test organization and maintainability

Running Tests

Basic Commands

# Run tests in current directory
go test

# Run tests with verbose output
go test -v

# Run tests in all subdirectories
go test ./...

# Run specific test
go test -run TestAdd

# Run tests matching pattern
go test -run TestCalculate/addition

# Limit how many t.Parallel() tests run at once (default is GOMAXPROCS)
go test -parallel 4

# Run tests with timeout
go test -timeout 30s

Test Coverage

# Run tests with coverage
go test -cover

# Generate coverage profile
go test -coverprofile=coverage.out

# View coverage in browser
go tool cover -html=coverage.out

# Show coverage by function
go tool cover -func=coverage.out

# Set coverage mode (set, count, atomic)
go test -covermode=count -coverprofile=coverage.out

Useful Flags

-short: Ask long-running tests to shorten their run; tests that check testing.Short() usually skip themselves
-race: Enable race detector (finds concurrent access issues)
-cpu: Specify GOMAXPROCS values
-count n: Run each test n times
-failfast: Stop on first test failure

Test Helpers and Setup/Teardown

Helper Functions

Mark helper functions with t.Helper() to improve error reporting:

func assertEqual(t *testing.T, got, want int) {
    t.Helper() // Failures are reported at the caller's line, not inside this helper
    if got != want {
        t.Errorf("got %d, want %d", got, want)
    }
}

func TestMath(t *testing.T) {
    result := Add(2, 3)
    assertEqual(t, result, 5) // Error line points here
}

Setup and Teardown

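import (
    "os"
    "testing"
)

func TestMain(m *testing.M) {
    // Setup code here (setup and teardown are placeholders for your own functions)
    setup()

    // Run tests
    code := m.Run()

    // Teardown code here
    teardown()

    // os.Exit skips deferred calls, so teardown runs explicitly before exiting
    os.Exit(code)
}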

Test Fixtures

func setupTestCase(t *testing.T) func(t *testing.T) {
    t.Log("setup test case")
    return func(t *testing.T) {
        t.Log("teardown test case")
    }
}

func TestSomething(t *testing.T) {
    teardown := setupTestCase(t)
    defer teardown(t)

    // Test code here
}

Mocking and Dependency Injection

Interface-Based Mocking

When testing code that interacts with databases, using interfaces makes it easy to create mock implementations. If you're working with PostgreSQL in Go, see our comparison of Go ORMs for choosing the right database library with good testability.

// Production code
type Database interface {
    GetUser(id int) (*User, error)
}

type UserService struct {
    db Database
}

func (s *UserService) GetUserName(id int) (string, error) {
    user, err := s.db.GetUser(id)
    if err != nil {
        return "", err
    }
    return user.Name, nil
}

// Test code
type MockDatabase struct {
    users map[int]*User
}

func (m *MockDatabase) GetUser(id int) (*User, error) {
    if user, ok := m.users[id]; ok {
        return user, nil
    }
    return nil, errors.New("user not found")
}

func TestGetUserName(t *testing.T) {
    mockDB := &MockDatabase{
        users: map[int]*User{
            1: {ID: 1, Name: "Alice"},
        },
    }

    service := &UserService{db: mockDB}
    name, err := service.GetUserName(1)

    if err != nil {
        t.Fatalf("unexpected error: %v", err)
    }
    if name != "Alice" {
        t.Errorf("got %s, want Alice", name)
    }
}

Popular Testing Libraries

Testify

The most popular Go testing library for assertions and mocks:

import (
    "testing"

    "github.com/stretchr/testify/assert"
    "github.com/stretchr/testify/mock"
)

func TestWithTestify(t *testing.T) {
    result := Add(2, 3)
    assert.Equal(t, 5, result, "they should be equal")
    assert.NotZero(t, result) // an int is never nil, so check for non-zero instead
}

// Mock example
type MockDB struct {
    mock.Mock
}

func (m *MockDB) GetUser(id int) (*User, error) {
    args := m.Called(id)
    // Comma-ok assertion: avoids a panic when the mock is set up to return (nil, err)
    user, _ := args.Get(0).(*User)
    return user, args.Error(1)
}

Other Tools

gomock: Google's mocking framework with code generation
httptest: Standard library for testing HTTP handlers
testcontainers-go: Integration testing with Docker containers
ginkgo/gomega: BDD-style testing framework

When testing integrations with external services like AI models, you'll need to mock or stub those dependencies. For example, if you're using Ollama in Go, consider creating interface wrappers to make your code more testable.

Benchmark Tests

Go includes built-in support for benchmarks:

func BenchmarkAdd(b *testing.B) {
    for i := 0; i < b.N; i++ {
        Add(2, 3)
    }
}

// Run benchmarks
// go test -bench=. -benchmem

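Output shows the number of iterations and the average time per operation (ns/op); with -benchmem it also reports bytes and allocations per operation (B/op, allocs/op).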

Best Practices

Write table-driven tests: Use the slice of structs pattern for multiple test cases
Use t.Run for subtests: Better organization and can run subtests selectively
Test exported functions first: Focus on public API behavior
Keep tests simple: Each test should verify one thing
Use meaningful test names: Describe what is being tested and expected outcome
Don't test implementation details: Test behavior, not internals
Use interfaces for dependencies: Makes mocking easier
Aim for high coverage, but quality over quantity: 100% coverage doesn't mean bug-free
Run tests with -race flag: Catch concurrency issues early
Use TestMain for expensive setup: Avoid repeating setup in each test

Example: Complete Test Suite

package user

import (
    "errors"
    "testing"
)

type User struct {
    ID    int
    Name  string
    Email string
}

func ValidateUser(u *User) error {
    if u.Name == "" {
        return errors.New("name cannot be empty")
    }
    if u.Email == "" {
        return errors.New("email cannot be empty")
    }
    return nil
}

// Test file: user_test.go
func TestValidateUser(t *testing.T) {
    tests := []struct {
        name    string
        user    *User
        wantErr bool
        errMsg  string
    }{
        {
            name:    "valid user",
            user:    &User{ID: 1, Name: "Alice", Email: "alice@example.com"},
            wantErr: false,
        },
        {
            name:    "empty name",
            user:    &User{ID: 1, Name: "", Email: "alice@example.com"},
            wantErr: true,
            errMsg:  "name cannot be empty",
        },
        {
            name:    "empty email",
            user:    &User{ID: 1, Name: "Alice", Email: ""},
            wantErr: true,
            errMsg:  "email cannot be empty",
        },
    }

    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            err := ValidateUser(tt.user)

            if (err != nil) != tt.wantErr {
                t.Errorf("ValidateUser() error = %v, wantErr %v", err, tt.wantErr)
                return
            }

            if err != nil && err.Error() != tt.errMsg {
                t.Errorf("ValidateUser() error message = %v, want %v", err.Error(), tt.errMsg)
            }
        })
    }
}

Conclusion

Go's testing framework provides everything needed for comprehensive unit testing with minimal setup. By following Go idioms like table-driven tests, using interfaces for mocking, and leveraging built-in tools, you can create maintainable, reliable test suites that grow with your codebase.

These testing practices apply to all types of Go applications, from web services to CLI applications built with Cobra & Viper. Testing command-line tools requires similar patterns with additional focus on testing input/output and flag parsing.

Start with simple tests, gradually add coverage, and remember that testing is an investment in code quality and developer confidence. The Go community's emphasis on testing makes it easier to maintain projects long-term and collaborate effectively with team members.

See the App Architecture hub for related guides on Go project structure, dependency injection, API design, and integration patterns.

Qwen 3.6 27B and 35B MTP vs Standard on 16GB GPU

Rost — Sun, 24 May 2026 00:31:07 +0000

I tested speculative decoding (Multi-Token Prediction, MTP) performance in Qwen 3.6 27B and 35B on an RTX 4080 with 16 GB VRAM.

For a broader view of token speeds and VRAM trade-offs across more models on the same hardware, see 16 GB VRAM LLM benchmarks with llama.cpp.

What MTP (Multi-Token Prediction) Is

Multi-Token Prediction is a form of speculative decoding built directly into certain model checkpoints. Instead of predicting one token per forward pass, the model carries extra "MTP heads" that propose several future tokens in a single step — then verifies them in parallel. If the guesses are accepted, the effective throughput rises without changing the output quality.

The Qwen 3.6 family ships both standard GGUF files and MTP-enabled variants. In llama.cpp, MTP is activated through:

--spec-type draft-mtp --spec-draft-n-max 3

--spec-draft-n-max is the key tuning knob. It sets how many speculative tokens the MTP head proposes at each step. Higher values give a potential speed boost but cost extra VRAM for the draft buffers — a real constraint on 16 GB cards.

What and How I Tested

I tested how the two Qwen 3.6 models behave with MTP enabled versus standard decoding on a GPU with 16 GB VRAM (RTX 4080).

To fit model weights and KV cache into VRAM I used heavily quantised variants:

Qwen3.6-27B-UD-IQ3_XXS and Qwen3.6-27B-UD-IQ3_XXS-MTP
Qwen3.6-35B-A3B-UD-IQ3_S and Qwen3.6-35B-A3B-UD-IQ3_S-MTP

Two context budgets are tracked per run:

Avg Ctx — the context size at which llama.cpp occupies ~14.8 GB VRAM, leaving other apps (Xorg, GNOME Shell, Cursor) a comfortable ~500 MB buffer.
Max Ctx — the largest context llama.cpp could allocate given that the same desktop apps already hold ~500 MB VRAM.

A key reason for keeping the average context at a practical target is that Hermes Agent — which I use as the primary AI assistant connecting to llama.cpp on this machine — requires at least 64 K context by default and will reject models with a smaller window at startup. Models below that threshold cannot maintain enough working memory for multi-step tool-calling workflows. For llama.cpp this means passing --ctx-size 65536 or larger. Any MTP configuration that compresses the average usable context significantly below 64 K is therefore unsuitable for daily Hermes workloads, which is why the Avg Ctx numbers in the tables below are the most decision-relevant ones.

Both KV cache quantisation levels were tested: q8 (higher quality, more VRAM) and q5 (lower VRAM, longer context). Be aware that moving from q8 to q5 KV cache can cause a noticeable quality drop — in my testing the degradation was significant enough to make q5 unsuitable for my workloads. The speed and context numbers for q5 are included for completeness, but you should test response quality on your own tasks before committing to it.

Qwen 3.6 27B MTP vs Standard

KV Cache q8

Metric	MTP max 1	MTP max 2	MTP max 3	MTP max 4	Standard (IQ3_XXS)
Prompt Speed	148 t/s	151 t/s	148 t/s	147 t/s	200 t/s
Gen Speed	65 t/s	75 t/s	73 t/s	75 t/s	45 t/s
Avg Ctx	40 K	40 K	40 K	30 K	80 K
Max Ctx	60 K	60 K	60 K	50 K	100 K

With q8 KV cache, MTP at --spec-draft-n-max 2 delivers ~67 % faster generation (75 vs 45 t/s) at the cost of halving the average context window from 80 K to 40 K. Prompt ingestion speed drops from 200 to ~150 t/s because MTP requires device-to-host transfers during the prefill phase.

KV Cache q5

Metric	MTP max 1	MTP max 2	MTP max 3	MTP max 4	Standard (IQ3_XXS)
Prompt Speed	145 t/s	144 t/s	141 t/s	139 t/s	191 t/s
Gen Speed	57 t/s	62 t/s	67 t/s	66 t/s	41 t/s
Avg Ctx	70 K	60 K	60 K	50 K	130 K
Max Ctx	100 K	100 K	90 K	80 K	160 K

Switching to q5 KV cache recovers meaningful context: --spec-draft-n-max 1 gives 70 K average context at 57 t/s — a 39 % generation speedup over standard decoding while still keeping the context window at a useful size. At --spec-draft-n-max 3 the context drops to 60 K but generation reaches 67 t/s (+63 %).
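
The percentages quoted for the 27B model follow directly from the generation speeds in the two tables. A small sketch of the arithmetic, where speedupPercent is a hypothetical helper name:

```go
package main

import (
    "fmt"
    "math"
)

// speedupPercent returns how much faster mtp is than standard, in whole percent.
func speedupPercent(mtp, standard float64) int {
    return int(math.Round((mtp/standard - 1) * 100))
}

func main() {
    fmt.Println(speedupPercent(75, 45)) // q8 KV, max 2 vs standard: 67
    fmt.Println(speedupPercent(57, 41)) // q5 KV, max 1 vs standard: 39
    fmt.Println(speedupPercent(67, 41)) // q5 KV, max 3 vs standard: 63
}
```

The same function applied to your own measurements gives a like-for-like comparison, provided both runs use the same KV cache type.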

Qwen 3.6 27B Takeaway

MTP is genuinely useful for the 27B dense model. The sweet spot on 16 GB VRAM is:

q8 KV + --spec-draft-n-max 2 — best raw speed (75 t/s), context down to 40–60 K
q5 KV + --spec-draft-n-max 1 — best speed-vs-context balance (57 t/s, 70 K avg context)

Qwen 3.6 35B MTP vs Standard

The 35B model is a Mixture-of-Experts (MoE) architecture (35B-A3B means 35B total parameters, ~3B active per token). MoE models usually benefit more from MTP because the sparse routing keeps the MTP head computationally cheap relative to a full forward pass.

KV Cache q8

Metric	MTP max 1	MTP max 2	MTP max 3	MTP max 4	Standard (IQ3_S)
Prompt Speed	277 t/s	277 t/s	265 t/s	275 t/s	368 t/s
Gen Speed	186 t/s	189 t/s	180 t/s	171 t/s	146 t/s
Avg Ctx	15 K	10 K	—	—	80 K
Max Ctx	80 K	70 K	60 K	50 K	150 K

The MoE architecture delivers impressive raw generation speed with MTP (+27 % at max 1, +29 % at max 2 vs standard 146 t/s). But the practical problem is the average context. With q8 KV cache, even --spec-draft-n-max 1 only gives 15 K average context — barely enough for modest tasks. Higher draft depths have no viable average context at all on a 16 GB card.

This is the central VRAM cost question for MTP on consumer hardware: the extra draft buffers eat into the remaining VRAM budget directly, and the 35B-A3B model with q8 KV cache leaves very little headroom.

KV Cache q5

Metric	MTP max 1	MTP max 2	MTP max 3	MTP max 4	Standard (IQ3_S)
Prompt Speed	264 t/s	266 t/s	270 t/s	264 t/s	343 t/s
Gen Speed	151 t/s	147 t/s	137 t/s	131 t/s	122 t/s
Avg Ctx	10 K	—	—	—	120 K
Max Ctx	120 K	110 K	110 K	80 K	200 K

q5 KV cache does not rescue the average context on this model. --spec-draft-n-max 1 gives only 10 K average context at 151 t/s (versus 15 K at q8), although the maximum context rises from 80 K to 120 K. Standard decoding at q5 gives 122 t/s with 120 K average context.

Qwen 3.6 35B Takeaway

On a 16 GB GPU the 35B MoE model with MTP faces a hard wall: the average usable context collapses to 10–15 K tokens, making it impractical for real workloads. Standard decoding at 122–146 t/s with 80–120 K context is significantly more useful.

If you have 24 GB+ VRAM, the 35B + MTP combination becomes much more attractive — the context window issue disappears and you keep the speed benefit.

Choosing the Right `--spec-draft-n-max` Value

The question of how many speculative tokens to propose per step (--spec-draft-n-max) does not have a single right answer — it depends on both model architecture and available VRAM:

For 27B dense on 16 GB: --spec-draft-n-max 2 with q8 KV is the fastest, --spec-draft-n-max 1 with q5 KV is the most context-friendly.
For 35B MoE on 16 GB: --spec-draft-n-max 1 is the only option that keeps any usable context, and even then only marginally.
Higher values (3, 4) increase VRAM pressure without proportional speed gains: on the 27B model with q8 KV, max 4 cuts the average context from 40 K to 30 K compared with max 2, while generation speed stays at 75 t/s.

How to Enable MTP in llama.cpp

Make sure you use an MTP-enabled GGUF (the filename contains MTP). If you are new to llama.cpp flags, the llama.cpp Quickstart with CLI and Server covers all the fundamentals. Then launch llama-server or llama-cli with:

llama-server \
  --model Qwen3.6-27B-UD-IQ3_XXS-MTP.gguf \
  --ctx-size 40000 \
  -ngl 99 --flash-attn on \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --spec-type draft-mtp \
  --spec-draft-n-max 2

For q5 KV cache, replace q8_0 with q5_1 or q5_0 and adjust --ctx-size upward:

llama-server \
  --model Qwen3.6-27B-UD-IQ3_XXS-MTP.gguf \
  --ctx-size 80000 \
  -ngl 99 --flash-attn on \
  --cache-type-k q5_1 --cache-type-v q5_1 \
  --spec-type draft-mtp \
  --spec-draft-n-max 1

MTP is activated automatically once llama.cpp sees the MTP heads in the GGUF file and --spec-type draft-mtp is set.
A standard Qwen3.6-27B-UD-IQ3_XXS.gguf therefore will not work in MTP mode; you need Qwen3.6-27B-UD-IQ3_XXS-MTP.gguf.
The MTP file, however, works in both speculative decoding mode and plain autoregressive mode.

Conclusion

On a 16 GB GPU (RTX 4080), with these quants, llama.cpp's MTP is a clear win for Qwen 3.6 27B and a net negative for Qwen 3.6 35B in practical use:

Qwen 3.6 27B (IQ3_XXS) — MTP is worthwhile:

q8 KV + MTP max 2 → ~67 % faster generation, context 40–60 K (vs 80–100 K without MTP)
q5 KV + MTP max 1 → ~39 % faster generation, context 70–100 K (vs 130–160 K without MTP)
Good balance of speed and VRAM efficiency at --spec-draft-n-max 2

Qwen 3.6 35B (IQ3_S) — MTP is not practical at 16 GB:

Generation speed is 27–29 % higher but average context collapses to 10–15 K at q8, 10 K at q5
Standard decoding at 122–146 t/s with 80–120 K context is more useful for real tasks
The situation improves substantially on 24 GB+ VRAM

On paper, q5 KV cache is the obvious answer for maximising context window while keeping MTP speed gains — but in practice the quality drop moving from q8 to q5 can be significant. Test q5 on your own tasks before adopting it; for my workloads the degradation was unacceptable, and q8 with a tighter context budget remains the better trade-off.

For the wider picture of LLM serving options and infrastructure trade-offs, see the LLM Hosting in 2026 pillar and LLM Performance in 2026. If you are tuning Qwen 3.6 sampler settings alongside MTP, the Agentic LLM Inference Parameters Reference for Qwen 3.6 and Gemma 4 is a useful companion.

Unload All llama.cpp Router Models Without Restarting

Rost — Wed, 20 May 2026 01:00:03 +0000

llama.cpp router mode is one of the most useful changes to llama-server in years. It finally gives local LLM operators something close to the model management experience people expect from Ollama, while keeping the raw performance and low-level control that make llama.cpp worth using in the first place.

But there is one sharp edge: unloading everything is not a single magic button in the HTTP API.

The router can list models. It can load a model. It can unload a model. It can evict the least recently used model when --models-max is reached. What it does not currently document as a first-class endpoint is a universal "unload all models now" call.

That is not a real blocker. The correct pattern is simple, explicit, and scriptable:

Ask the router which models exist.
Filter the models whose status is loaded.
Call /models/unload once per loaded model.

This is the approach I recommend for serious local LLM workflows. It is boring, visible, and easy to debug. That is exactly what you want when your goal is to free VRAM without restarting the whole inference service.

What llama.cpp router mode actually does

In classic llama-server usage, you start one server with one model:

llama-server \
  --model ./models/qwen3-8b.gguf \
  --port 8080

Router mode changes that arrangement. Instead of binding the server to one GGUF file, the router becomes a coordinator for multiple models. It can discover models from a cache or from a models directory, load them on demand, route requests to the correct model, and unload models when needed.

A typical router-mode startup looks like this:

llama-server \
  --models-dir ./models \
  --models-max 4 \
  --port 8080

The important option here is --models-max. It controls how many models may be loaded at the same time. If the limit is reached, llama.cpp can evict the least recently used model. That is useful, but it is not a substitute for a deliberate unload operation. LRU eviction is reactive. An unload script is operational control.

My opinionated take: if you run local models for real work, you should treat router mode like an inference process manager, not like a toy chat server. Explicit lifecycle operations matter.

The model management endpoints you need

The main endpoint for discovery is:

curl -s http://localhost:8080/models | jq

That endpoint returns the models known to the router and their current lifecycle status. The exact JSON shape can vary slightly between builds, so inspect your own response before writing automation.

A common response shape looks like this:

{
  "data": [
    {
      "id": "qwen3-8b",
      "status": "loaded"
    },
    {
      "id": "llama-3.2-3b",
      "status": "unloaded"
    }
  ]
}

To unload one model, call:

curl -s -X POST http://localhost:8080/models/unload \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3-8b"}' \
  | jq

That is the primitive operation. Everything else in this article builds on that.

There is no documented unload all endpoint

This is the part that trips people up.

You might expect something like this:

curl -X POST http://localhost:8080/models/unload-all

Do not build around that assumption. The documented operation is per model. You pass a model identifier to /models/unload, and llama.cpp unloads that one model.

This is not necessarily bad API design. A per-model operation is safer. It makes the caller decide what should be unloaded. It also avoids surprising production behavior where one admin request accidentally kills every warm model being used by other clients.

For a workstation, an unload-all shortcut would be convenient. For a multi-user inference box, explicit loops are better.

Unload one model first

Before automating anything, test the exact model identifier your router expects.

First list models:

curl -s http://localhost:8080/models | jq

Pick one loaded model from the output, then unload it:

curl -s -X POST http://localhost:8080/models/unload \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3-8b"}' \
  | jq

Check the model list again:

curl -s http://localhost:8080/models | jq

If the model status changes to unloaded, your endpoint, port, and model identifier are correct.

If it does not work, do not guess. Inspect the JSON. Router aliases, GGUF filenames, and model IDs are often not the same string.

Unload all loaded models with curl and jq

Once the single-model unload works, the unload-all pattern is just a shell loop.

Use this when your /models response has .data[].id and .data[].status:

curl -s http://localhost:8080/models \
| jq -r '.data[] | select(.status == "loaded") | .id' \
| while IFS= read -r model; do
    echo "Unloading: $model"
    curl -s -X POST http://localhost:8080/models/unload \
      -H "Content-Type: application/json" \
      -d "{\"model\":\"$model\"}" \
      | jq
  done

This is the whole trick. It is not glamorous, but it is the right shape for an admin operation:

It only unloads models that are actually loaded.
It prints what it is doing.
It fails model by model instead of hiding everything behind one opaque action.
It works from cron, systemd hooks, SSH, or CI jobs.

A reusable script for production use

For anything you run more than twice, stop pasting one-liners. Save a script.

Create llama-router-unload-all.sh:

#!/usr/bin/env bash
set -euo pipefail

LLAMA_SERVER_URL="${LLAMA_SERVER_URL:-http://localhost:8080}"

models_json="$(curl -fsS "$LLAMA_SERVER_URL/models")"

loaded_models="$(printf '%s' "$models_json" \
  | jq -r '.data[] | select(.status == "loaded") | .id')"

if [ -z "$loaded_models" ]; then
  echo "No loaded models found."
  exit 0
fi

printf '%s\n' "$loaded_models" | while IFS= read -r model; do
  [ -z "$model" ] && continue

  echo "Unloading: $model"

  curl -fsS -X POST "$LLAMA_SERVER_URL/models/unload" \
    -H "Content-Type: application/json" \
    -d "{\"model\":\"$model\"}" \
    | jq

done

echo "Done. Current model state:"
curl -fsS "$LLAMA_SERVER_URL/models" | jq

Make it executable:

chmod +x llama-router-unload-all.sh

Run it against the default local server:

./llama-router-unload-all.sh

Run it against another host:

LLAMA_SERVER_URL=http://192.168.1.50:8080 ./llama-router-unload-all.sh

This is the version I would actually keep in a tools directory. It uses curl -f so HTTP errors fail the script, and it lets you override the server URL without editing the file.

Adapting the script to your JSON shape

Do not blindly assume every llama.cpp build returns the exact same fields forever. Router mode is still evolving, and your build may expose a slightly different JSON shape.

Start by inspecting the response:

curl -s http://localhost:8080/models | jq

The script uses this filter:

jq -r '.data[] | select(.status == "loaded") | .id'

If your model identifier is in .name, change it to:

jq -r '.data[] | select(.status == "loaded") | .name'

If your status field uses another value, adjust the filter accordingly. The principle is what matters: select loaded models, extract the identifier accepted by /models/unload, then call unload for each one.

Why models may load again after you unload them

This is the most common source of confusion.

Router mode supports on-demand loading. If a client sends a chat completion request for a model that is currently unloaded, the router may load it again automatically.

That means this sequence is possible:

You unload every model.
Open WebUI, a test script, or an agent sends a request.
llama.cpp loads the requested model again.
You think unload failed, but it did not.

The fix is operational, not technical. Stop client traffic first if your goal is to keep VRAM free.

For example:

Stop benchmark scripts.
Pause agents and cron jobs.
Close or disconnect Open WebUI sessions.
Disable health checks that accidentally perform real model requests.

Unloading is not a firewall. If clients keep asking for models, router mode is doing its job by serving them.

Open WebUI and the Eject button

Open WebUI can integrate with llama.cpp model unload support. When the provider is configured as llama.cpp, Open WebUI can show loaded-model state and expose an Eject action for admins.

Under the hood, that action calls Open WebUI's own unload API, which then calls llama.cpp's /models/unload endpoint on the configured connection.

That is nice for manual operation, but I would still keep the shell script. A UI button is convenient. A script is auditable, repeatable, and usable on a headless box at 2 AM.

When to use unload all

Unloading every loaded model is useful when you want to:

Free GPU memory before starting a larger model.
Reset a development box without restarting llama-server.
Prepare for a benchmark run with a clean memory state.
Drain local inference workloads before maintenance.
Recover from a messy session where too many models were warmed.

It is not the right tool when active users are depending on warm models. In that case, tune --models-max, use deliberate routing, and let LRU eviction do part of the work. If you need smarter timeout-based unloading with per-model lifecycle control, llama-swap is a purpose-built proxy that layers exactly that on top of any llama-server setup.

My rule is simple: use LRU for normal pressure, use explicit unload for operator intent.

Troubleshooting

The models endpoint returns 404

You may not be running a router-capable build, or you may be calling the wrong port.

Check the server process and available options:

llama-server --help | grep -i models

Then test both endpoints:

curl -s http://localhost:8080/models | jq
curl -s http://localhost:8080/v1/models | jq

The /v1/models endpoint is the OpenAI-compatible model list. The /models endpoint is the router model-management endpoint. They are related, but they are not the same thing.

jq is not installed

Install it before scripting JSON parsing.

On Ubuntu or Debian:

sudo apt-get update
sudo apt-get install jq

On macOS with Homebrew:

brew install jq

The unload call returns an error

Most failures come from passing the wrong model identifier. Use the exact identifier returned by /models, not the filename you think should work.

Also check whether your model name contains quotes, slashes, or spaces. The script above handles normal strings well, but unusual names may require more careful JSON construction.

For maximum safety, you can build the POST body with jq:

jq -n --arg model "$model" '{model: $model}'

A more defensive unload loop would use that body instead of hand-escaped JSON.

VRAM is not freed immediately

First confirm the model status changed. Then check whether another request reloaded it. Also remember that GPU memory tools can lag or report allocator behavior rather than instant application-level intent.

The practical test is simple: stop traffic, unload models, list model status, then inspect GPU memory. For measured VRAM usage across model sizes and context windows on llama.cpp, the 16 GB VRAM llama.cpp benchmarks give concrete figures to sanity-check against.

A safer JSON body version

If your model identifiers contain unusual characters, use jq to generate the JSON request body:

curl -s http://localhost:8080/models \
| jq -r '.data[] | select(.status == "loaded") | .id' \
| while IFS= read -r model; do
    echo "Unloading: $model"
    body="$(jq -n --arg model "$model" '{model: $model}')"
    curl -s -X POST http://localhost:8080/models/unload \
      -H "Content-Type: application/json" \
      -d "$body" \
      | jq
  done

This is the version to use if your models are named with repository-style identifiers, custom aliases, or paths.

Final take

llama.cpp router mode is a big step forward for local LLM operations. It gives you dynamic loading, model switching, and memory-aware eviction without giving up the directness of llama-server.

But do not wait for a perfect unload-all endpoint. The clean solution already exists: list loaded models and unload them one by one.

That pattern is explicit. It is scriptable. It works over SSH. It plays nicely with Open WebUI. And most importantly, it frees VRAM without restarting the router.

For local AI infrastructure, that is exactly the kind of boring control surface you want.

LLM Wiki - Compiled Knowledge That RAG Cannot Replace

Rost — Mon, 18 May 2026 09:22:56 +0000

The premise is simple: compiled knowledge is more reusable than retrieved fragments.
RAG became the default answer to a straightforward question - how do I give an LLM access to external knowledge?

And the usual architecture is by now familiar.
Take documents, split them into chunks, embed the chunks, store them in a vector database, retrieve relevant pieces at query time, and pass them into the model. That pattern is useful, but it is also overused. RAG is very good at access and not automatically good at structure. It can find relevant fragments, but it does not create a stable understanding of a domain. It can retrieve context, but it does not decide what the canonical explanation is. It can answer from documents, but it does not maintain a living knowledge base.

LLM Wiki is not just another retrieval pattern but a different way to think about knowledge architecture entirely. Instead of asking the model to synthesize from raw chunks every time a question is asked, an LLM Wiki uses the model earlier in the pipeline, performing synthesis at ingest time and storing the result as structured, readable, linked knowledge.

A good shorthand is this:

RAG retrieves knowledge at query time.
LLM Wiki compiles knowledge at ingest time.

That distinction changes cost, latency, quality, maintenance, governance, and failure modes - and it is the central reason LLM Wiki deserves its own architecture category.

RAG optimizes retrieval, not representation

RAG is powerful because it lets a language model use information outside its training data, making it useful for:

company documentation
product manuals
technical support
internal search
research assistants
policy lookup
code documentation
knowledge base chatbots

But RAG has a structural weakness: it often treats knowledge as a pile of retrievable fragments rather than a structured model of a domain.

A typical RAG system works like this:

Collect documents.
Split them into chunks.
Create embeddings.
Store the chunks in a vector database.
Retrieve similar chunks for each query.
Ask the LLM to answer using those chunks.
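The six steps above can be sketched in a few lines of Python. This is a toy illustration, not a production pipeline: the `embed` function is a word-count stand-in for a real embedding model, the in-memory list stands in for a vector database, and the sample documents are invented for the example.

```python
import math
import re
from collections import Counter


def chunk(text: str, max_words: int = 40) -> list[str]:
    """Split text into fixed-size word windows (a deliberately naive chunker)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase token counts. A real system would call a model here."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, index: list[tuple[str, Counter]], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Steps 1-4: collect, chunk, embed, store (sample documents are made up).
documents = [
    "Ollama is easy to run locally and exposes a simple API.",
    "vLLM targets server workloads with high throughput batching.",
    "llama.cpp runs quantized models on modest hardware.",
]
index = [(c, embed(c)) for doc in documents for c in chunk(doc)]

# Steps 5-6: retrieve for a query, then hand the chunks to the model as context.
top = retrieve("which tool runs quantized models", index, k=1)
prompt = "Answer using only this context:\n" + "\n".join(top)
```

Note what is missing: nothing in this loop remembers the answer. Every query repeats the retrieve-and-assemble work from the same raw chunks.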

This works well for many questions, but it also creates repeated interpretation work for complex ones. Every time a user asks something conceptually rich, the system has to:

retrieve fragments
decide which fragments matter
infer relationships
resolve contradictions
build a temporary explanation
produce an answer

Then that synthesis disappears and the next query starts from scratch. This is fine when questions are simple, but it becomes wasteful when the same concepts are repeatedly reconstructed from raw fragments.

The most common RAG mistake is assuming that better retrieval equals better knowledge. Sometimes that is true, but often it is not, because retrieval and representation solve different problems. Retrieval answers which pieces of text are relevant; representation answers how knowledge should be structured in the first place. A RAG system can retrieve five accurate chunks about a topic and still fail because:

the chunks are outdated
the documents contradict each other
the important concept is spread across pages
the source uses inconsistent terminology
the answer requires synthesis, not lookup
there is no canonical page

RAG is an access layer, not a knowledge model by itself, and an LLM Wiki exists precisely because some knowledge should be represented before it is retrieved.

What is an LLM Wiki?

An LLM Wiki is a knowledge system where a language model helps transform source material into structured wiki-like knowledge. Instead of storing only raw documents and retrieving chunks later, the system creates derived knowledge artifacts such as:

topic pages
summaries
glossaries
concept pages
entity pages
cross-links
comparisons
contradiction notes
source references
decision records
explanations

The output is usually human-readable and, in many implementations, stored as plain Markdown, which matters because Markdown makes the system:

inspectable
portable
editable
versionable
easy to diff
compatible with static sites and PKM tools

The idea is not that the LLM magically knows everything but that the LLM helps maintain a structured layer over the source material, acting as a structuring assistant rather than the final authority.

The core idea

The core idea of LLM Wiki is ingest-time knowledge synthesis. In a RAG system, synthesis usually happens when a user asks a question; in an LLM Wiki, synthesis happens earlier, during ingestion, before any question has been asked.

A simplified pipeline looks like this:

sources
  -> ingest
  -> summarize
  -> structure
  -> link
  -> maintain
  -> query or browse

The system does not wait until query time to figure out what the knowledge means - it creates a reusable structure in advance, which makes LLM Wiki closer to a compiled knowledge base than a search pipeline.
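As a rough sketch of what "compile in advance" means, the snippet below runs a miniature ingest pass. Everything in it is an assumption made for illustration: `summarize` is a placeholder where an LLM call would go, the `topics` field is assumed to come from an earlier extraction step, and linking pages that share a source is a crude heuristic standing in for model-proposed links.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    slug: str
    summary: str
    sources: list[str] = field(default_factory=list)
    links: list[str] = field(default_factory=list)


def summarize(text: str) -> str:
    """Placeholder for an LLM call: keeps only the first sentence."""
    return text.split(". ")[0].rstrip(".") + "."


def compile_wiki(sources: dict[str, dict]) -> dict[str, Page]:
    """Group sources by topic, summarize once, then link pages that share a source."""
    pages: dict[str, Page] = {}
    for source_id, src in sources.items():
        for topic in src["topics"]:
            page = pages.setdefault(topic, Page(slug=topic, summary=""))
            page.sources.append(source_id)  # keep provenance on every page
            page.summary = (page.summary + " " + summarize(src["text"])).strip()
    for page in pages.values():
        page.links = sorted(
            other.slug
            for other in pages.values()
            if other is not page and set(other.sources) & set(page.sources)
        )
    return pages


# Hypothetical sources; the ids and texts are invented for the example.
sources = {
    "post-1": {
        "topics": ["ollama", "llama-cpp"],
        "text": "Ollama wraps llama.cpp for local use. It adds a model registry.",
    },
    "post-2": {
        "topics": ["vllm"],
        "text": "vLLM focuses on serving throughput. It batches requests.",
    },
}
wiki = compile_wiki(sources)
```

The important property is that `wiki` exists before any question is asked, and each page still records which sources produced it.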

A practical example

Imagine you have 60 articles about local LLM hosting. A RAG system might split them into chunks and retrieve relevant sections when you ask about the differences between Ollama, vLLM, llama.cpp, and SGLang, then let the LLM assemble an answer from those retrieved fragments.

An LLM Wiki system does something different. At ingest time, it creates structured pages:

ollama.md
vllm.md
llama-cpp.md
sglang.md
local-llm-hosting-overview.md
inference-backends-comparison.md
gpu-memory-and-context-length.md

Then it links them. When you later ask a question, the system is not starting from raw fragments but from a structured knowledge layer that was assembled before the question arrived. For conceptual and comparative questions, that usually means a more coherent and more consistent answer.

How LLM Wiki works

There is no single official implementation, but most LLM Wiki systems follow the same conceptual stages.

Source collection

The system starts with source material - blog posts, PDFs, Markdown notes, technical documentation, transcripts, papers, meeting notes, bookmarks, code comments, and README files - which should be preserved as a separate layer, distinct from the generated wiki. This matters because generated wiki pages are derived knowledge, not original truth, and a serious LLM Wiki should always maintain links back to sources so that every generated page can answer the basic question: where did this claim come from?

Ingestion and extraction

During ingestion, the system reads source material and extracts useful knowledge. It may identify:

main topics
entities and tools
definitions
claims
decisions
examples
contradictions between sources
open questions
recurring concepts

This stage is where LLM Wiki starts to differ from ordinary RAG: while RAG usually chunks documents for retrieval, LLM Wiki tries to understand and reshape the material conceptually rather than just making it searchable.

Summarization

The system creates summaries, but useful summaries are not just shorter versions of text - they should preserve the structure of the argument. A weak summary says "this document discusses local LLM hosting tools." A useful summary says "this document compares local LLM hosting tools by deployment complexity, GPU usage, API compatibility, and production readiness, positioning Ollama as easy for local use, vLLM as stronger for server workloads, and llama.cpp as flexible for quantized models."

For technical knowledge, a summary should capture:

what problem it solves
what assumptions it makes
what tradeoffs it contains
what dependencies it has
what is still uncertain

This is where LLMs are genuinely useful, because they are good at compressing messy prose into structured explanations.

Structuring

Summaries alone are not enough - the system must also decide where knowledge belongs, which is the representation layer. Common structures include:

topic pages
concept pages
index pages
comparison pages
glossary entries
how-to pages
architecture notes
decision records
maps of related pages

A pile of summaries is not a wiki. A wiki needs page boundaries, links, and recurring structure, and the test of a good LLM Wiki is whether its pages are genuinely reusable, not how many of them exist.

Linking

Links define the shape of the knowledge system. In a normal document archive, relationships are often implicit; in an LLM Wiki, they should become explicit. Useful link types include:

concept to concept
article to summary
tool to comparison
problem to solution
architecture to implementation
source to derived page
glossary term to detailed page

This is one of the most important differences between LLM Wiki and basic summarization: summaries reduce text, but links build a knowledge graph.
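Because the pages are plain Markdown, the link graph can be recovered with very little code. The sketch below assumes relative links of the form `[text](page.md)` and an in-memory dict of page bodies; both are simplifications, and the page contents are invented for the example. It reports backlinks, broken links, and orphan pages that nothing points to.

```python
import re
from collections import defaultdict

# Matches Markdown links whose target ends in .md, e.g. [vLLM](vllm.md)
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+\.md)\)")


def outgoing_links(markdown: str) -> set[str]:
    """Return the set of .md targets linked from a page body."""
    return set(LINK_RE.findall(markdown))


def link_report(pages: dict[str, str]) -> dict[str, object]:
    """Build backlinks and list broken links and orphan pages."""
    backlinks: dict[str, set[str]] = defaultdict(set)
    broken: set[tuple[str, str]] = set()
    for name, body in pages.items():
        for target in outgoing_links(body):
            if target in pages:
                backlinks[target].add(name)
            else:
                broken.add((name, target))
    orphans = sorted(name for name in pages if not backlinks[name])
    return {
        "backlinks": {k: sorted(v) for k, v in backlinks.items() if v},
        "broken": sorted(broken),
        "orphans": orphans,
    }


pages = {
    "ollama.md": "See the [comparison](inference-backends-comparison.md).",
    "vllm.md": "See the [comparison](inference-backends-comparison.md) and [SGLang](sglang.md).",
    "inference-backends-comparison.md": "Compares [Ollama](ollama.md) and [vLLM](vllm.md).",
    "gpu-memory-and-context-length.md": "No links yet.",
}
report = link_report(pages)
```

Orphans and broken links are cheap signals that the structure has gaps: a page nobody links to is either missing from an index or not earning its place.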

Review and correction

This stage is optional only in toy systems; in serious systems, human review is essential. The review process should check:

whether summaries are faithful
whether links are useful
whether claims are sourced
whether pages are duplicated
whether concepts are misplaced
whether outdated information is marked
whether generated pages overstate certainty

LLM Wiki can reduce human effort, but it should never remove human responsibility.

LLM Wiki vs RAG

The cleanest distinction between LLM Wiki and RAG is timing.

Query-time synthesis

In RAG, the system retrieves information when a user asks a question.

query
  -> retrieve chunks
  -> assemble context
  -> generate answer

This is flexible and works well when:

the corpus is large
information changes often
questions are unpredictable
you need broad coverage
you cannot curate everything

But it may be less coherent for conceptual questions, because the model has to synthesize from fragments each time, which can produce inconsistent answers across similar queries.

Ingest-time synthesis

In LLM Wiki, the system performs synthesis before the question arrives.

sources
  -> summarize
  -> structure
  -> link
  -> query or browse later

This is less flexible but more coherent, and it works well when:

the corpus is manageable
the domain is stable
concepts repeat
human readability matters
you want reusable synthesis
you want a maintained knowledge layer

The main differences

Dimension	RAG	LLM Wiki
Main timing	Query time	Ingest time
Main operation	Retrieve chunks	Compile knowledge
Best corpus	Large and changing	Curated and stable
Output	Generated answer	Structured knowledge pages
Infrastructure	Search index or vector DB	Markdown or wiki structure
Strength	Flexible access	Reusable synthesis
Weakness	Fragmented context	Maintenance drift
Human readability	Often indirect	Usually direct

Complementary, not mutually exclusive

The debate should not be framed as "LLM Wiki or RAG" - that is the wrong question. LLM Wiki does not replace RAG in most production systems; both have distinct and complementary roles. A well-designed system may look like this:

raw documents
  -> source store
  -> LLM Wiki synthesis
  -> reviewed knowledge pages
  -> search index
  -> RAG over source and synthesis
  -> answer with citations

In that architecture, LLM Wiki improves the representation layer and RAG improves the access layer. Use RAG for retrieval over large and changing corpora, use LLM Wiki for compiled synthesis over stable and curated knowledge, and use both together when you need scale and coherence at the same time.

LLM Wiki vs adjacent systems

LLM Wiki vs summarization

A weak LLM Wiki is just a folder of generated summaries, and that is not enough. Summarization compresses content; LLM Wiki structures it. A real LLM Wiki needs stable pages, links, concepts, indexes, source tracking, revision history, maintenance workflows, and conflict detection - the wiki part matters as much as the LLM part.

LLM Wiki vs knowledge graph

A knowledge graph represents entities and relationships explicitly, while an LLM Wiki creates a softer, document-oriented graph through Markdown pages and links. A mature system can use both: the wiki provides human-readable explanations and the knowledge graph provides precisely structured, machine-queryable relationships.

LLM Wiki vs agent memory

LLM Wiki is also different from AI memory. Memory stores context that affects future behavior, while an LLM Wiki stores structured knowledge that can be read, searched, reviewed, and linked by both humans and systems.

Memory might remember:

the user prefers Go examples
the project avoids ORMs
the agent tried a command yesterday
a bug investigation failed

An LLM Wiki might store:

what Go database access patterns exist
how sqlc compares with GORM
why outbox patterns matter
how RAG differs from memory systems

Memory is behavioral context; LLM Wiki is represented knowledge - and mixing the two leads to systems that are hard to inspect, audit, or maintain.

When LLM Wiki works well

LLM Wiki works best for stable domains, personal research, curated corpora, technical documentation, and situations where repeated synthesis over the same material is wasteful.

Stable domains

LLM Wiki works best when the domain does not change every hour. Good examples include:

technical concepts
research notes
learning material
architecture patterns
book notes
model comparison notes
internal engineering principles
curated documentation
personal knowledge bases

If knowledge is stable enough to summarize without becoming stale within days, LLM Wiki can deliver lasting value that compounds as the wiki grows.

Research synthesis

Research synthesis is one of the strongest use cases, because researchers often read many sources and repeatedly ask the same meta-questions:

What are the main ideas?
Which sources agree?
Which sources conflict?
What concepts repeat?
What is the current state of the topic?
What should I read next?

LLM Wiki helps turn that research material into reusable structure - topic pages, comparison pages, contradiction notes, and related links - so the researcher does not have to rebuild the same mental map every time they return to a domain. It is especially useful when working with papers, technical articles, transcripts, documentation, notes, and experiment logs.

Personal knowledge systems

LLM Wiki fits naturally with PKM, second brain workflows, and the broader knowledge systems spectrum, because a personal knowledge system already contains:

notes
links
unfinished ideas
summaries
references
topic maps

An LLM can help maintain the structure by:

summarizing long notes
proposing links
creating topic pages
detecting duplicate concepts
extracting glossary terms
generating index pages
identifying gaps

The human remains the editor, which is the right relationship between human judgment and machine assistance.

Technical blogging

A technical blog can use LLM Wiki ideas internally even without building a full automated system. A well-structured site can include:

pillar pages
cluster index pages
topic summaries
related article maps
glossary pages
comparison pages
canonical explainers

This is not only SEO but also knowledge representation: a technical blog becomes more valuable when its articles are connected into a durable structure that both humans and AI systems can navigate.

Small team knowledge bases

LLM Wiki can work well for small teams with curated knowledge, including engineering decisions, product architecture, onboarding notes, support playbooks, internal standards, postmortems, and runbooks. The key condition is governance: someone must review and maintain the generated structure, because without clear ownership the wiki decays into noise regardless of how well it was initially generated.

When LLM Wiki is a poor fit

Highly dynamic data

LLM Wiki is weaker when information changes constantly. Live inventory, pricing feeds, incident status, financial market data, rapidly changing support tickets, and real-time logs are all better served by retrieval or direct API access. Compiling fast-moving data into static summaries is counterproductive unless you have a strong refresh process that keeps the compiled layer in sync with reality.

Large unmanaged corpora

LLM Wiki does not automatically scale to millions of documents. At large scale, the difficult problems extend well beyond generation and include:

access control
data lineage
ownership
deduplication
indexing
freshness tracking
evaluation
governance

A simple Markdown wiki is not equipped to address those needs, and at enterprise scale, LLM Wiki may become one layer inside a larger knowledge architecture rather than the whole system.

Low-quality sources

LLM Wiki cannot reliably fix bad sources. If the source material is contradictory, outdated, low quality, duplicated, incomplete, or badly scoped, generated pages may look polished but be wrong. This is dangerous precisely because a clean generated page creates false confidence - the formatting signals quality even when the underlying content does not justify it.

No review process

LLM Wiki without review is risky because generated structure creates authority. A bad answer in RAG may affect one query, but a bad generated wiki page may affect many future queries, readers, and agents that retrieve from it. The model may overgeneralize, miss exceptions, invent structure, merge incompatible ideas, hide uncertainty, create misleading links, or summarize outdated material as though it were current - so for any knowledge that actually matters, human review is not optional.

Limitations and failure modes

The main risks of building an LLM Wiki are stale summaries, hallucinated synthesis baked into the knowledge base, weak source tracking, maintenance cost, and false confidence in generated structure.

Maintenance drift

Knowledge drift happens when generated pages stop matching the underlying sources. This can happen because:

sources changed
new sources were added
old pages were not refreshed
summaries were edited manually
links became outdated
model output changed over time

Drift is the central operational risk of LLM Wiki, and a good system needs explicit refresh and validation workflows to catch it before it propagates.
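One simple way to catch the first cause on that list is to record a content hash of each source when a page is generated, then compare later. The sketch below assumes each page stores a `{source_id: hash}` mapping; that layout and the file names are hypothetical. It only detects changed or missing sources, so it will not notice new sources that should have been folded into a page.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Stable content hash used to detect source changes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_stale_pages(
    pages: dict[str, dict[str, str]], sources: dict[str, str]
) -> dict[str, list[str]]:
    """Return {page: [reasons]} for pages whose recorded source hashes no longer match."""
    stale: dict[str, list[str]] = {}
    for page, recorded in pages.items():
        reasons = []
        for source_id, old_hash in recorded.items():
            if source_id not in sources:
                reasons.append(f"{source_id}: source missing")
            elif fingerprint(sources[source_id]) != old_hash:
                reasons.append(f"{source_id}: source changed")
        if reasons:
            stale[page] = reasons
    return stale


# Hashes recorded when the wiki pages were generated.
sources_v1 = {"runbook.md": "Restart the service.", "adr-7.md": "We use Postgres."}
pages = {
    "deployment.md": {"runbook.md": fingerprint(sources_v1["runbook.md"])},
    "storage.md": {"adr-7.md": fingerprint(sources_v1["adr-7.md"])},
}

# Later, one source is edited.
sources_v2 = {**sources_v1, "runbook.md": "Drain traffic, then restart the service."}
stale = find_stale_pages(pages, sources_v2)
```

A check like this fits well in a scheduled job or a pre-publish hook, feeding stale pages back into the review process.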

Hallucinated synthesis

RAG can hallucinate at answer time, but LLM Wiki can hallucinate at ingest time, which is more subtle and more dangerous. If a generated wiki page contains a wrong synthesis, future users may treat that page as ground truth, and future AI systems may retrieve it and amplify the mistake further. Generated structure needs provenance, and every important claim should link back to its original sources so the hallucination can be caught during review rather than silently embedded in the knowledge base.

Over-structuring

Once you have an LLM that can create pages cheaply, it is tempting to create too many of them. You can end up with:

empty taxonomy
duplicate concepts
shallow pages
meaningless links
generated clutter
fake completeness

A useful wiki is not measured by page count but by whether pages are actually reused, linked, and updated over time.

Unclear ownership

The model cannot own the page. A serious system needs clear ownership rules covering:

who reviews pages
who approves updates
who deletes stale pages
who resolves contradictions
who decides canonical structure

Without that clarity, LLM Wiki becomes another abandoned knowledge base - well-intentioned, well-generated, and quietly ignored.

Architecture patterns

Pattern 1. Personal LLM Wiki

The personal pattern is the simplest and most practical version, best suited for individuals.

notes and sources
  -> LLM assisted summaries
  -> Markdown pages
  -> manual review
  -> Obsidian or static site

It works well for researchers, writers, engineers, technical bloggers, students, and consultants, where the value comes from reducing repeated synthesis and making personal knowledge easier to navigate without requiring any team coordination or governance infrastructure.

Pattern 2. Team LLM Wiki

The team pattern is best for small groups and needs more governance than the personal version.

team docs
  -> ingest workflow
  -> generated draft pages
  -> review queue
  -> published wiki
  -> search or RAG layer

The review queue is critical here, because generated knowledge should never be published directly into a team source of truth without a human checkpoint - even a lightweight review process catches the most dangerous hallucinations before they become institutional knowledge.
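A review queue can be as small as a status field with a few allowed transitions. The sketch below is one possible shape, not a prescribed workflow: the status names, the `generated_by` field, and the rule that the generator cannot approve its own page are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field

# Allowed status transitions; there is no direct path from draft to published.
ALLOWED = {
    "draft": {"in_review"},
    "in_review": {"published", "draft"},
    "published": {"in_review"},
}


@dataclass
class WikiPage:
    slug: str
    status: str = "draft"
    generated_by: str = "llm"
    history: list[tuple[str, str, str]] = field(default_factory=list)

    def transition(self, new_status: str, actor: str) -> None:
        """Move the page to a new status, enforcing the human checkpoint."""
        if new_status not in ALLOWED[self.status]:
            raise ValueError(f"{self.status} -> {new_status} is not allowed")
        if new_status == "published" and actor == self.generated_by:
            raise PermissionError("a generated page needs a human approver")
        self.history.append((self.status, new_status, actor))
        self.status = new_status


page = WikiPage("incident-response")
page.transition("in_review", actor="llm")      # the ingest workflow queues the draft
page.transition("published", actor="alice")    # a human approves it
```

The `history` list doubles as a minimal audit trail: it records who moved the page and when it crossed the checkpoint.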

Pattern 3. LLM Wiki plus RAG

This is often the most balanced architecture, giving you both raw source access and compiled synthesis.

raw sources
  -> LLM Wiki pages
  -> reviewed knowledge base
  -> search index
  -> RAG over raw and compiled knowledge
  -> cited answer

The RAG system can retrieve from original documents, generated summaries, topic pages, comparison pages, and glossary entries. That gives the retriever reviewed, already-synthesized material to draw on in addition to raw text, which tends to help most on conceptual and comparative questions.
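Here is a minimal sketch of retrieval over both layers. It uses plain keyword overlap instead of embeddings to stay self-contained, and the `wiki_boost` factor that favours reviewed wiki pages is my own assumption, not an established default. Document ids and texts are invented for the example.

```python
import re
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    layer: str  # "wiki" (compiled, reviewed) or "source" (raw)
    text: str


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search(
    query: str, corpus: list[Doc], k: int = 3, wiki_boost: float = 1.5
) -> list[tuple[float, Doc]]:
    """Rank documents by token overlap, boosting the compiled wiki layer."""
    q = tokens(query)
    scored = []
    for doc in corpus:
        overlap = len(q & tokens(doc.text))
        if overlap == 0:
            continue
        score = overlap * (wiki_boost if doc.layer == "wiki" else 1.0)
        scored.append((score, doc))
    # Highest score first; ties broken by id so results are deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1].doc_id))
    return scored[:k]


def build_context(hits: list[tuple[float, Doc]]) -> str:
    """Format hits with a layer and id tag so the answer can cite them."""
    return "\n".join(f"[{doc.layer}:{doc.doc_id}] {doc.text}" for _, doc in hits)


corpus = [
    Doc("inference-backends-comparison.md", "wiki", "Comparison of Ollama and vLLM for local hosting."),
    Doc("post-ollama.md", "source", "Notes on running Ollama on a laptop."),
    Doc("post-vllm.md", "source", "vLLM batching notes."),
]
hits = search("ollama vs vllm", corpus, k=2)
context = build_context(hits)
```

Tagging each hit with its layer keeps the distinction visible downstream: a citation to a wiki page is a citation to an interpretation, and the reader can follow it back to the raw source.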

Pattern 4. LLM Wiki as site architecture

For a technical website, LLM Wiki ideas can guide content structure even without automation.

articles
  -> pillar pages
  -> topic maps
  -> comparisons
  -> internal links
  -> search and AI access

This turns a blog into a knowledge system where articles are not just posts but nodes in a structured map - a significant difference for both reader experience and machine-readable discoverability.

LLM Wiki design principles

Keep raw sources separate

Never lose the original source. Generated pages should not replace source documents but sit above them - the source layer provides evidence, the wiki layer provides interpretation, and losing the original means losing the ability to verify, challenge, or update the interpretation derived from it.

Use Markdown where possible

Markdown is boring and excellent. It is portable, readable, diffable, versionable, easy to edit, friendly to static sites, and friendly to PKM tools. Boring formats survive longer than clever platforms, which means a Markdown-based LLM Wiki built today will still be usable long after whatever proprietary database you might have chosen has gone through multiple breaking migrations. For syntax reference, see the Markdown Cheatsheet and the guide to Markdown Code Blocks, which are especially relevant when structuring wiki pages that include technical content.

Track provenance

Every generated page should answer:

What sources created this?
When was it generated?
When was it reviewed?
What changed?
Who approved it?

Without provenance, trust collapses over time as pages drift further from their origins. A practical page schema might look like this:

title
summary
status
sources
last_reviewed
related_pages
concepts
open_questions

For technical content, add:

applies_to
version
examples
tradeoffs
failure_modes

For research content, add:

claims
evidence
contradictions
confidence
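The base schema above maps directly onto a small data structure. The sketch below is one way to hold it in Python and render it as YAML-style front matter at the top of a Markdown page. The rendering is intentionally naive (it does not quote or escape values), and the example page and source path are hypothetical.

```python
from dataclasses import asdict, dataclass, field
from datetime import date
from typing import Optional


@dataclass
class PageMeta:
    title: str
    summary: str
    status: str
    sources: list[str]
    last_reviewed: Optional[date] = None
    related_pages: list[str] = field(default_factory=list)
    concepts: list[str] = field(default_factory=list)
    open_questions: list[str] = field(default_factory=list)

    def problems(self) -> list[str]:
        """Flag missing provenance before a page is published."""
        issues = []
        if not self.sources:
            issues.append("no sources listed")
        if self.last_reviewed is None:
            issues.append("never reviewed")
        return issues

    def to_front_matter(self) -> str:
        """Render the metadata as a YAML-style block (no escaping; a sketch only)."""
        lines = ["---"]
        for key, value in asdict(self).items():
            if isinstance(value, list):
                if value:
                    lines.append(f"{key}:")
                    lines.extend(f"  - {item}" for item in value)
                else:
                    lines.append(f"{key}: []")
            else:
                lines.append(f"{key}: {'' if value is None else value}")
        lines.append("---")
        return "\n".join(lines)


meta = PageMeta(
    title="llama.cpp",
    summary="Flexible runtime for quantized models.",
    status="needs review",
    sources=["posts/llama-cpp-router.md"],  # hypothetical source path
)
front = meta.to_front_matter()
```

Keeping the metadata in the page itself means it travels with the Markdown file through Git, static site builds, and PKM tools.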

Prefer fewer better pages

Do not generate a page for every minor idea. Prefer strong concept pages, useful comparison pages, topic indexes, canonical summaries, and glossary entries that earn their place. A small useful wiki with twenty well-maintained pages beats a large generated mess with two hundred pages nobody reads or updates.

Make links meaningful

Links should explain relationships rather than just connect pages at random. Useful link types include:

related concept
depends on
contrasts with
example of
source for
expands on
implementation of

Random links create noise and erode reader trust in the structure.
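If links carry a relation, they can be validated and queried rather than just followed. The sketch below treats the link types listed above as a fixed vocabulary; the `TypedLink` class and the example page names are assumptions for illustration.

```python
from dataclasses import dataclass

# The relation vocabulary from the list above.
LINK_TYPES = {
    "related concept", "depends on", "contrasts with", "example of",
    "source for", "expands on", "implementation of",
}


@dataclass(frozen=True)
class TypedLink:
    source: str
    relation: str
    target: str

    def __post_init__(self) -> None:
        # Reject links whose relation is not in the agreed vocabulary.
        if self.relation not in LINK_TYPES:
            raise ValueError(f"unknown relation: {self.relation!r}")

    def describe(self) -> str:
        return f"{self.source} {self.relation} {self.target}"


def neighbours(links: list[TypedLink], page: str, relation: str) -> list[str]:
    """Pages reachable from `page` through links of one relation type."""
    return sorted(l.target for l in links if l.source == page and l.relation == relation)


links = [
    TypedLink("llm-wiki.md", "contrasts with", "rag.md"),
    TypedLink("llm-wiki.md", "depends on", "source-store.md"),
    TypedLink("rag.md", "depends on", "embeddings.md"),
]
contrasts = neighbours(links, "llm-wiki.md", "contrasts with")
```

A closed vocabulary also limits what a model can invent: a proposed link with an unknown relation fails loudly instead of silently adding noise.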

Mark uncertainty

LLM Wiki pages should not pretend all knowledge is equally certain. Useful status markers include:

confirmed
likely
disputed
outdated
needs review
source conflict
generated summary

These markers protect readers from false confidence and give maintainers a clear signal about which pages need attention.
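The markers become more useful once they are machine-readable. A small sketch, assuming each page carries exactly one status: the enum mirrors the list above, while the split into "needs attention" and "fine for now" is my own judgment call rather than a standard.

```python
from enum import Enum


class Status(Enum):
    CONFIRMED = "confirmed"
    LIKELY = "likely"
    DISPUTED = "disputed"
    OUTDATED = "outdated"
    NEEDS_REVIEW = "needs review"
    SOURCE_CONFLICT = "source conflict"
    GENERATED_SUMMARY = "generated summary"


# Statuses that should put a page in front of a maintainer (an assumption).
NEEDS_ATTENTION = {
    Status.DISPUTED,
    Status.OUTDATED,
    Status.NEEDS_REVIEW,
    Status.SOURCE_CONFLICT,
}


def triage(pages: dict[str, Status]) -> list[str]:
    """Return the slugs of pages a maintainer should look at, in stable order."""
    return sorted(slug for slug, status in pages.items() if status in NEEDS_ATTENTION)


def banner(status: Status) -> str:
    """A one-line notice to show readers at the top of a page."""
    return f"Status: {status.value}"


pages = {
    "ollama.md": Status.CONFIRMED,
    "sglang.md": Status.GENERATED_SUMMARY,
    "vllm.md": Status.OUTDATED,
    "gpu-memory-and-context-length.md": Status.SOURCE_CONFLICT,
}
todo = triage(pages)
```

Parsing the status from front matter with `Status("needs review")` also rejects typos, so a misspelled marker cannot quietly hide a page from triage.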

How to evaluate an LLM Wiki

Do not only ask whether the generated pages look impressive - ask whether they improve knowledge work. Useful evaluation questions include:

Can users find concepts faster?
Are repeated questions answered better?
Are source links preserved?
Are contradictions easier to see?
Are pages reused?
Are summaries accurate?
Is stale content detected?
Does the wiki reduce repeated synthesis?
Does it help humans write or decide?
Does it improve RAG answer quality?

If the answer is no to most of these, the wiki is decoration regardless of how many pages it contains.

LLM Wiki and knowledge management

LLM Wiki belongs in knowledge management because it is fundamentally about representation, not primarily about model hosting, vector search, or agent execution. It answers a different question: how should knowledge be structured so that humans and AI systems can reuse it? That places it in the knowledge systems architecture layer, connecting naturally to PKM, wikis, RAG, agent memory, knowledge graphs, technical publishing, and research synthesis.

A clean layer model looks like this:

Human thinking - PKM, explore and develop ideas
Shared knowledge - Wiki, maintain canonical pages
Compiled knowledge - LLM Wiki, generate structured synthesis
Machine access - RAG, retrieve context at query time
Agent continuity - Memory, persist behavior and preferences

LLM Wiki occupies the compiled knowledge layer, and that position is what makes it useful - it is the layer that turns a pile of documents into something both humans and machines can navigate and reason over.

My opinionated take

LLM Wiki is important, but the hype is slightly wrong - it is not a RAG killer, but a reminder that knowledge representation matters. The industry spent years optimizing retrieval pipelines, and that work was necessary, but many systems still retrieve from badly structured knowledge. Better embeddings and better rerankers help, but they cannot fully compensate for a weak knowledge layer.

LLM Wiki pushes the conversation back toward structure by asking better questions:

What are the core concepts?
What is canonical?
How do ideas connect?
What should be summarized once?
What should be retrieved fresh?
What should be reviewed by humans?

That is the right conversation, and the future is not just better vector search but layered knowledge systems where representation, retrieval, and memory each play a distinct and well-understood role.

Conclusion

LLM Wiki is an architecture pattern for compiled knowledge that uses language models to help transform source material into structured, linked, reusable knowledge before questions are asked. Its core workflow is:

summarize
  -> structure
  -> link
  -> review
  -> reuse

Compared with RAG, the main difference is timing: RAG performs synthesis at query time, while LLM Wiki performs synthesis at ingest time, which makes it valuable for stable domains, research synthesis, personal knowledge bases, technical blogs, and curated team knowledge.

But it has real limitations. It can drift when sources change, hallucinate when model output is wrong, create false confidence when review is absent, and collapse into noise when ownership is unclear. Used badly, it becomes another abandoned wiki. Used well, it becomes the representation layer between raw documents and AI systems - not a replacement for RAG, but the missing layer that makes retrieval worth using.

Sources and further reading

AWS - What Is Retrieval Augmented Generation? - AWS foundational overview of how RAG pipelines are constructed and when they are appropriate.
IBM - Retrieval Augmented Generation - IBM overview of RAG architecture, covering grounding, hallucination reduction, and enterprise use cases.
Google Cloud - Retrieval Augmented Generation - Google Cloud perspective on RAG use cases, system design, and integration with vector search.
Atlan - LLM Wiki vs RAG Knowledge Base - Practical comparison of LLM Wiki and RAG approaches from a data catalog perspective.
Ranjan Kumar - LLM Wiki, Synthesis Time, RAG, and Agentic Memory - In-depth discussion of the timing distinction between synthesis approaches and how they fit into agentic architectures.
Dev.to - RAG vs Agent Memory vs LLM Wiki - Practical comparison of all three knowledge patterns with implementation notes.
Starmorph - Karpathy LLM Wiki Knowledge Base Guide - Guide inspired by Andrej Karpathy's framing of LLM Wiki as a compiled knowledge system.
MindStudio - LLM Wiki vs RAG Knowledge Base - MindStudio perspective on choosing between LLM Wiki and RAG for AI assistant knowledge.

Retrieval vs Representation in Knowledge Systems

Rost — Mon, 18 May 2026 09:22:43 +0000

Most modern knowledge systems optimize retrieval, and that is understandable.
Search is visible, easy to demo, and feels magical when it works. Type a question, get an answer.

But retrieval is only one half of the problem. The deeper question is:

What shape does the knowledge have before anything tries to retrieve it?

That is representation — the structure behind the knowledge:

notes
pages
schemas
graphs
entities
relationships
summaries
taxonomies
source boundaries
canonical versions

Retrieval asks:

Can I find something relevant?

Representation asks:

Is the knowledge organized in a way that makes sense?

These are not the same problem. A RAG system with poor representation becomes a fast interface to a messy archive. It can retrieve fragments, but it cannot fix broken structure. It can quote documents, but it cannot decide which one is canonical. It can assemble context, but it cannot guarantee that the underlying knowledge is coherent.

This is why LLM Wiki-style systems are interesting: they shift effort from query time to ingest time. Instead of only retrieving chunks when a user asks a question, they attempt to pre-structure knowledge into pages, concepts, summaries, and links. That does not make RAG obsolete. Retrieval and representation are different layers, and good knowledge systems need both.

The core distinction

Retrieval is about access; representation is about meaning.

Layer	Question	Examples
Retrieval	How do I find the right information?	search, embeddings, BM25, reranking, vector stores
Representation	How is knowledge structured?	notes, wikis, graphs, schemas, ontologies
Reasoning	How do I use the knowledge?	synthesis, comparison, inference, decision making

A weak system often jumps straight to retrieval; a strong system first asks:

What are the core concepts?
What is the canonical source?
What relationships matter?
What changes over time?
What should be retrieved?
What should already be represented?

This is the difference between search over documents and an actual knowledge system.

Why retrieval became dominant

Retrieval became dominant because it maps well to the modern AI stack. A typical RAG pipeline looks like this:

Load documents
Split them into chunks
Generate embeddings
Store vectors
Retrieve relevant chunks
Optionally rerank them
Put them into an LLM prompt
Generate an answer

This pipeline is practical: it is relatively easy to build, works with messy documents, scales to large corpora, avoids retraining models, and gives LLMs access to current information. That is why RAG became the default pattern for "AI over documents."

But there is a trap:

RAG improves access to knowledge. It does not automatically improve the knowledge.

If your content is duplicated, outdated, contradictory, badly chunked, or poorly named, retrieval will surface those problems — often with confidence.

What representation means

Representation is the way knowledge is shaped before retrieval happens. It answers questions like:

Is this knowledge stored as documents, notes, entities, or facts?
Are relationships explicit or implicit?
Are there canonical pages?
Are there summaries?
Are concepts linked?
Is the system organized by topic, workflow, time, or ownership?
Can a human maintain it?
Can a machine reason over it?

Representation is not decoration — it determines what kind of operations are possible.

Forms of representation

Documents

Documents are the most common representation. Examples include:

articles
PDFs
manuals
reports
README files
support pages
blog posts

Documents are easy for humans to write, but they are often hard for machines to use because they mix facts, narrative, context, examples, opinions, outdated sections, and repeated explanations into the same container. Documents are good containers, but they are not always good knowledge structures.

Notes

Notes are more flexible than documents. They can be:

atomic
linked
private
unfinished
concept-focused

A note system, such as a PKM or second brain, can represent evolving knowledge better than a polished document repository. Good notes capture thinking in progress; bad notes become an unsearchable junk drawer.

Wikis

Wikis represent knowledge as maintained pages. A good wiki has:

stable pages
clear topics
internal links
ownership
canonical answers
update patterns

A wiki is stronger than a loose document dump because it gives knowledge a home. "Deployment checklist" lives in one place. "Incident response" lives in one place. "RAG architecture" lives in one place. That matters because retrieval works better when knowledge has a stable structure.

Knowledge graphs

Knowledge graphs represent knowledge as entities and relationships. Instead of storing only text, they model things like:

Person works on Project
Model supports ContextLength
Page depends on Concept
Service connects to Database
Tool implements Protocol

Graphs are powerful because relationships become explicit, which helps with traversal, dependency analysis, entity resolution, lineage, reasoning, and recommendations. But graphs are expensive to maintain and they are not magic — a bad graph is just structured confusion.
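To make "relationships become explicit" concrete, here is a minimal sketch of a triple store with a traversal for dependency analysis. The class, the relation names, and the service names are all hypothetical; a real system would more likely use a graph database.

```python
from collections import defaultdict, deque


class TripleStore:
    """Stores (subject, relation, object) triples and walks them."""

    def __init__(self):
        self._out = defaultdict(list)

    def add(self, subject, relation, obj):
        self._out[subject].append((relation, obj))

    def neighbors(self, subject, relation=None):
        # .get avoids creating empty entries for unknown subjects.
        edges = self._out.get(subject, [])
        return [o for r, o in edges if relation is None or r == relation]

    def reachable(self, start, relation):
        # Breadth-first walk along one relation type (transitive closure).
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for nxt in self.neighbors(node, relation):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen


graph = TripleStore()
graph.add("alice", "works_on", "checkout-service")
graph.add("checkout-service", "connects_to", "orders-db")
graph.add("checkout-service", "depends_on", "payments-service")
graph.add("payments-service", "depends_on", "ledger-service")

# Everything checkout-service depends on, directly or indirectly.
dependencies = graph.reachable("checkout-service", "depends_on")
print(sorted(dependencies))
```

The traversal is only as good as the edges: a missing or wrong `depends_on` triple produces a confident but incomplete answer.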

Schemas and ontologies

Schemas define expected structure; ontologies go further and define types, relations, and constraints. They answer:

What kinds of things exist?
What properties do they have?
How can they relate?
What rules apply?

This is useful when correctness matters, such as in medical knowledge, legal knowledge, enterprise data catalogs, product taxonomies, and compliance systems. The tradeoff is rigidity: the more formal the representation, the more expensive it is to evolve.

LLM-generated representations

Modern systems increasingly use LLMs to create representations. Examples include:

summaries
extracted entities
topic pages
concept maps
synthetic FAQs
document outlines
cross-links
glossary entries

This is where LLM Wiki style systems sit. They use the model not only to answer queries but to pre-process and structure knowledge before the query happens. RAG says "retrieve relevant chunks at query time"; LLM Wiki says "compile useful knowledge structures at ingest time." Both patterns can coexist in the same architecture.

What retrieval means

Retrieval is the process of finding relevant information. Common retrieval methods include:

keyword search
full-text search
vector search
hybrid search
metadata filtering
graph traversal
reranking
query rewriting
agentic search

Retrieval is not one thing — it is a layered stack of complementary methods.

Keyword search

Keyword search matches terms and is still useful because it is predictable, debuggable, fast, and good for exact terms, IDs, error messages, names, and code. Its weakness is semantic mismatch: if the user searches "how to stop repeated answers" but the document says "presence penalty", keyword search may miss the best result.

Vector search

Vector search retrieves by semantic similarity. It is useful when:

wording differs
concepts are fuzzy
users ask natural language questions
documents use inconsistent terminology

Its weakness is precision — vector search can retrieve things that feel related but are not actually correct, which is especially risky in technical systems.

Hybrid search

Hybrid search combines keyword and vector retrieval, which is often better than either alone. Keyword search catches exact matches; vector search catches conceptual matches. For technical knowledge bases, hybrid retrieval is usually a strong default.
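The text does not say how the two result lists should be merged. One widely used option is reciprocal rank fusion (RRF), which scores each document by summing 1 / (k + rank) over every ranking it appears in. A minimal sketch, with hypothetical document IDs and the conventional constant k = 60:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for the same query from two retrievers.
keyword_ranking = ["error-codes", "sampling-params", "changelog"]
vector_ranking = ["sampling-params", "decoding-guide", "error-codes"]

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print(fused)
```

Documents that both retrievers rank well rise to the top, while a document seen by only one retriever still stays in the list. RRF needs only ranks, not comparable scores, which is why it is a convenient default.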

Reranking

Reranking takes an initial set of retrieved results and reorders them using a stronger model. This improves quality because the first retrieval step is often broad. A typical pattern retrieves 50 chunks, reranks to the top 5 or 10, then passes only the best context to the LLM. Reranking is one of the most practical ways to improve RAG quality.
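The two-stage shape can be sketched as follows. Both scoring functions are toy stand-ins: `cheap_score` plays the role of the broad first retrieval, and `strong_score` plays the role of a stronger reranking model (in practice often a cross-encoder). The chunks and function names are hypothetical.

```python
import re


def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def cheap_score(query, chunk):
    # First stage: count distinct terms shared by query and chunk.
    return len(set(tokens(query)) & set(tokens(chunk)))


def strong_score(query, chunk):
    # Stand-in for a stronger model: rewards query terms that appear
    # adjacent and in order, with shared terms as a small bonus.
    q, c = tokens(query), tokens(chunk)
    shared_bigrams = set(zip(q, q[1:])) & set(zip(c, c[1:]))
    return len(shared_bigrams) + 0.1 * cheap_score(query, chunk)


def retrieve_then_rerank(query, chunks, first_k=50, final_k=5):
    # Broad retrieval first, then reorder only the surviving candidates.
    candidates = sorted(
        chunks, key=lambda c: cheap_score(query, c), reverse=True
    )[:first_k]
    return sorted(
        candidates, key=lambda c: strong_score(query, c), reverse=True
    )[:final_k]


chunks = [
    "Rollback notes for local deployment, not production.",
    "Production deployment rollback uses the previous release tag.",
    "Team lunch schedule for Friday.",
]
query = "production deployment rollback"
best = retrieve_then_rerank(query, chunks, first_k=2, final_k=1)
print(best)
```

The first stage cannot tell the first two chunks apart, since both share three terms with the query. Only the second stage puts the production chunk on top, and it runs on two candidates instead of the whole corpus.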

Agentic retrieval

Agentic retrieval turns search into a process. Instead of one query, an agent may:

Ask an initial question
Search
Inspect results
Reformulate the query
Search again
Compare sources
Synthesize an answer

This is closer to research than search. It is useful for complex questions, but it is slower and harder to control.
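A minimal sketch of that loop, covering the search, inspect, and reformulate steps. The `search`, `reformulate`, and `is_sufficient` callables are hypothetical stand-ins for a search backend and LLM-driven decisions, and the round cap is one simple way to keep the process controllable.

```python
def agentic_search(question, search, reformulate, is_sufficient, max_rounds=3):
    """Search, inspect, and reformulate until the evidence looks sufficient."""
    query = question
    tried, collected = [], []
    for _ in range(max_rounds):  # hard cap keeps cost and latency bounded
        tried.append(query)
        for result in search(query):
            if result not in collected:
                collected.append(result)
        if is_sufficient(collected):
            break
        query = reformulate(query)
    return collected, tried


# Hypothetical stand-ins: a tiny exact-match index and a rewrite table.
INDEX = {"presence penalty": ["sampling-params.md"]}
REWRITES = {"how to stop repeated answers": "presence penalty"}


def search(query):
    return INDEX.get(query, [])


def reformulate(query):
    return REWRITES.get(query, query)


def is_sufficient(collected):
    return len(collected) >= 1


sources, queries = agentic_search(
    "how to stop repeated answers", search, reformulate, is_sufficient
)
print(sources, queries)
```

The first query finds nothing, the reformulated query finds the page, and the loop stops. Comparing sources and synthesizing an answer would happen after this function returns.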

Retrieval without representation is fragile

A retrieval system can only retrieve what exists. It cannot reliably fix:

unclear concepts
duplicate pages
inconsistent terminology
stale documentation
missing source ownership
contradictory statements
weak internal linking
bad document boundaries

This is one of the most common mistakes in RAG projects: teams build a vector database and expect it to become a knowledge system. A vector database is not a knowledge architecture; it is an access layer.

Representation without retrieval is isolated

The opposite failure also exists. You can have a beautifully structured knowledge base that nobody can find. This happens with:

over-designed wikis
deep folder trees
rigid taxonomies
poorly indexed documentation
private note systems with no discovery
graphs without usable interfaces

Representation gives knowledge structure; retrieval gives knowledge reach. You need both.

The tradeoff map

Speed vs coherence

Retrieval is fast to build and representation takes longer. If you need a prototype, retrieval wins; if you need long-term trust, representation matters more.

Priority	Better starting point
Fast Q&A over many docs	Retrieval
Stable technical knowledge	Representation
Exploratory research	PKM plus retrieval
Enterprise assistant	Structured corpus plus RAG
Agent memory	Representation plus selective retrieval

A pure RAG prototype can be built quickly, but a reliable knowledge system takes curation.

Flexibility vs consistency

Loose documents are flexible; structured knowledge is consistent. Flexibility helps when:

the domain changes quickly
knowledge is incomplete
users are exploring
the system is personal

Consistency helps when:

multiple people rely on it
answers must be trusted
workflows depend on it
AI systems consume it

The more people or agents depend on knowledge, the more representation matters.

Recall vs precision

Retrieval systems often optimize recall first, which means finding anything that might be relevant. But good answers need precision, which means finding the best evidence rather than merely related evidence. Representation improves precision by making concepts and boundaries clearer — a well-structured page is easier to retrieve accurately than a random paragraph buried inside a long document.
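The two measures are easy to compute once you know which documents are actually relevant. A small worked example, with hypothetical document IDs:

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Precision@k: share of the top k that is relevant.
    Recall@k: share of all relevant documents found in the top k."""
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


retrieved = ["d1", "d7", "d3", "d9", "d2"]  # ranked system output
relevant = {"d1", "d2", "d4"}               # ground truth for this query

p3, r3 = precision_recall_at_k(retrieved, relevant, 3)  # 1 hit: 1/3 and 1/3
p5, r5 = precision_recall_at_k(retrieved, relevant, 5)  # 2 hits: 2/5 and 2/3
print(p3, r3, p5, r5)
```

Going from the top 3 to the top 5 doubles recall (1/3 to 2/3) but adds only one relevant document among two new results. Widening retrieval buys coverage at the cost of more noise in the context.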

Ingest-time cost vs query-time cost

RAG keeps ingestion cheap (split, embed, store) and pushes most of the interpretive work to query time. At query time, the system:

rewrites the query
retrieves chunks
reranks results
assembles context
asks the model to reason over fragments

LLM Wiki style systems push more work to ingest time. At ingest time, the system:

reads sources
extracts concepts
writes summaries
creates pages
links related ideas
maintains structure

Architecture	Expensive step	Benefit
RAG	Query time	Flexible retrieval
LLM Wiki	Ingest time	Pre-compiled structure
Knowledge graph	Modeling time	Explicit relationships
Wiki	Maintenance time	Canonical knowledge

None of these is universally better — they optimize different costs.

Why LLM Wiki exists

LLM Wiki exists because retrieval alone often repeats work. In a normal RAG system, every query may force the model to interpret raw fragments again:

Retrieve chunks about a topic
Ask the LLM to infer the concept
Generate an answer
Forget the synthesis
Repeat next time

LLM Wiki says:

Stop re-deriving the same synthesis. Compile it.

Instead of only storing raw documents, it creates structured pages that summarize and connect knowledge, which can improve coherence, reuse, token efficiency, human readability, and long-term maintenance. But it has a cost: the system must maintain the wiki, and if the wiki is wrong, stale, or hallucinated, the structure becomes dangerous.
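The "compile it" idea can be sketched as a small store that synthesizes a page when sources are ingested and then serves reads without calling the model again. The class is hypothetical, and `fake_synthesize` stands in for an LLM call.

```python
import hashlib


class CompiledWiki:
    """Synthesizes a topic page at ingest time and reuses it on every read."""

    def __init__(self, synthesize):
        self._synthesize = synthesize  # stand-in for an LLM call
        self._pages = {}               # topic -> (fingerprint, page text)
        self.synthesis_calls = 0

    @staticmethod
    def _fingerprint(sources):
        digest = hashlib.sha256()
        for text in sources:
            digest.update(text.encode("utf-8"))
            digest.update(b"\x00")  # separator between sources
        return digest.hexdigest()

    def ingest(self, topic, sources):
        fingerprint = self._fingerprint(sources)
        cached = self._pages.get(topic)
        if cached is not None and cached[0] == fingerprint:
            return  # sources unchanged, keep the compiled page
        self.synthesis_calls += 1
        self._pages[topic] = (fingerprint, self._synthesize(topic, sources))

    def read(self, topic):
        # Reads never trigger synthesis.
        return self._pages[topic][1]


def fake_synthesize(topic, sources):
    return f"{topic}: compiled from {len(sources)} source(s)"


wiki = CompiledWiki(fake_synthesize)
wiki.ingest("reranking", ["note one", "note two"])
first = wiki.read("reranking")
wiki.ingest("reranking", ["note one", "note two"])                # skipped
wiki.ingest("reranking", ["note one", "note two", "note three"])  # recompiled
latest = wiki.read("reranking")
print(wiki.synthesis_calls, latest)
```

The fingerprint check is the maintenance cost in miniature: the page is only trustworthy if it is recompiled whenever its sources change, and nothing here checks whether the synthesis itself is correct.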

RAG hallucination vs bad representation

People often blame the LLM when a RAG system gives a bad answer, and sometimes that is correct. But many failures are actually retrieval or representation failures.

Failure mode 1. Correct document, wrong chunk

The answer exists, but chunking splits it badly. The model receives:

half of a paragraph
missing context
a table without explanation
a definition without constraints

The LLM fills those gaps, which looks like hallucination, but the deeper problem is broken representation.
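A small sketch shows how this happens. It compares a fixed-size splitter with a paragraph-aware one on a hypothetical two-paragraph document; both functions are illustrative, not taken from any particular library.

```python
def fixed_size_chunks(text, size):
    # Cuts every `size` characters, ignoring sentence and paragraph boundaries.
    return [text[i:i + size] for i in range(0, len(text), size)]


def paragraph_chunks(text, max_chars):
    # Packs whole paragraphs into chunks; a paragraph is never split,
    # even if it is longer than max_chars on its own.
    chunks, current = [], ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if not current or len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


text = (
    "A timeout is the maximum wait for a reply. "
    "It applies only to read requests.\n\n"
    "Retries are limited to three attempts."
)


def keeps_definition_with_constraint(chunks):
    return any(
        "A timeout is" in c and "only to read requests" in c for c in chunks
    )


fixed = fixed_size_chunks(text, 45)
by_paragraph = paragraph_chunks(text, 80)
print(keeps_definition_with_constraint(fixed))         # False
print(keeps_definition_with_constraint(by_paragraph))  # True
```

With the fixed-size splitter, the definition and its constraint land in different chunks, so a retriever can return "a definition without constraints" even though the source document was correct.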

Failure mode 2. Related chunk, wrong answer

Vector search retrieves something semantically similar but operationally wrong. The query asks about production deployment; the retrieved chunk discusses local development. The terms overlap but the meaning differs, so the model answers with local setup instructions for a production problem. This is retrieval imprecision.

Failure mode 3. Conflicting sources

Two documents disagree — one old, one new. The retrieval system returns both, and the LLM merges them into a confident but invalid answer. This is not just a retrieval problem but a representation problem, because the knowledge base lacks canonical state.
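One way to give the knowledge base canonical state is to carry it as metadata and resolve conflicts before the model sees them. A minimal sketch, assuming hypothetical `topic`, `updated`, and `canonical` fields on each source:

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SourceDoc:
    doc_id: str
    topic: str
    updated: date
    canonical: bool
    text: str


def resolve_conflicts(hits):
    """Keep one document per topic: canonical first, then most recent."""
    by_topic = {}
    for doc in hits:
        best = by_topic.get(doc.topic)
        if best is None or (doc.canonical, doc.updated) > (best.canonical, best.updated):
            by_topic[doc.topic] = doc
    return list(by_topic.values())


# Hypothetical retrieval results: two deployment pages that disagree.
hits = [
    SourceDoc("deploy-old", "deployment", date(2023, 3, 1), False,
              "Deploy with the legacy script."),
    SourceDoc("deploy-current", "deployment", date(2025, 9, 1), True,
              "Deploy with the release pipeline."),
    SourceDoc("oncall", "incident response", date(2024, 1, 10), False,
              "Page the on-call engineer."),
]
resolved = resolve_conflicts(hits)
print([doc.doc_id for doc in resolved])
```

The filter itself is trivial. The hard part is the representation work behind it: someone has to decide which page is canonical and keep that flag true.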

Failure mode 4. No concept model

The system has many documents but no model of the domain. It does not know that:

"agent memory" differs from "RAG"
"wiki" differs from "PKM"
"embedding search" differs from "full text search"
"deployment" differs from "hosting"

Without conceptual representation, retrieval becomes fuzzy matching.

Failure mode 5. Generated structure becomes fake authority

LLM Wiki systems have their own failure mode. If an LLM generates a clean page from bad sources, the result can look more authoritative than the original material. This is dangerous: a polished hallucination is worse than a messy source document. Any generated representation needs:

source links
review
update rules
confidence markers
ownership
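Those requirements can be enforced as data rather than left as good intentions. A minimal sketch, with hypothetical field names and an assumed 90-day review interval:

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Optional


@dataclass
class GeneratedPage:
    title: str
    body: str
    source_links: list = field(default_factory=list)
    owner: Optional[str] = None
    reviewed_on: Optional[date] = None
    confidence: str = "unreviewed"  # e.g. "unreviewed", "low", "high"
    review_interval_days: int = 90  # assumed update rule

    def problems(self, today):
        """Return the reasons this page should not yet be trusted."""
        found = []
        if not self.source_links:
            found.append("no source links")
        if self.owner is None:
            found.append("no owner")
        if self.reviewed_on is None:
            found.append("never reviewed")
        elif today - self.reviewed_on > timedelta(days=self.review_interval_days):
            found.append("review overdue")
        if self.confidence == "unreviewed":
            found.append("no confidence marker")
        return found


today = date(2026, 1, 1)
draft = GeneratedPage("Reranking", "Generated summary text.")
trusted = GeneratedPage(
    "Reranking",
    "Generated summary text.",
    source_links=["notes/reranking.md"],
    owner="docs-team",
    reviewed_on=date(2025, 12, 1),
    confidence="high",
)
print(draft.problems(today))
print(trusted.problems(today))
```

A publishing step could refuse any page whose `problems` list is not empty, so a polished draft cannot pass as reviewed knowledge.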

Design implications

Optimize retrieval when the corpus is large and dynamic

Retrieval should be the priority when:

the corpus is huge
documents change frequently
users ask many unpredictable questions
you need broad coverage
perfect structure is unrealistic

Examples: support knowledge bases, enterprise document search, research assistants, internal chat over many files, legal discovery, and customer service bots. In these cases, invest in strong retrieval:

hybrid search
metadata filters
reranking
query rewriting
source citation
evaluation sets

Optimize representation when coherence matters

Representation should be the priority when:

knowledge must be trusted
answers must be consistent
concepts are reused often
the domain has clear structure
multiple systems depend on it

Examples: architecture knowledge, product documentation, compliance rules, API references, operational runbooks, curated research collections, and technical blog clusters. In these cases, invest in:

canonical pages
glossary terms
diagrams
internal links
ownership
versioning
review cadence

Optimize both when AI systems depend on knowledge

If an AI agent depends on the knowledge, retrieval alone is usually not enough. Agents need:

stable context
clear task rules
durable memory
structured references
source boundaries
update behavior

For agentic systems, representation becomes part of system design. A coding agent does not only need to retrieve "some docs" — it needs to know:

project conventions
architecture decisions
command patterns
forbidden dependencies
testing workflow
deployment rules

Some of that belongs in RAG, some belongs in memory, and some belongs in structured project documentation.

Practical decision framework

If the problem is finding information

Optimize retrieval. Examples:

"Find relevant pages."
"Answer questions over documents."
"Search across many PDFs."
"Locate similar support tickets."

Use:

full-text search
vector search
hybrid retrieval
reranking
metadata filtering

If the problem is making knowledge coherent

Optimize representation. Examples:

"Create a canonical explanation."
"Resolve duplicate pages."
"Define the domain model."
"Build a stable knowledge base."

Use:

wiki pages
concept maps
taxonomies
knowledge graphs
summaries
schemas

If the problem is repeated synthesis

Use compiled representation. Examples:

"We answer the same conceptual questions repeatedly."
"The system keeps re-summarizing the same sources."
"We need a stable synthesis layer."

Use:

LLM Wiki
curated summaries
topic pages
human-reviewed generated pages

If the problem is adaptive continuity

Use memory. Examples:

"The agent should remember user preferences."
"The coding agent should remember project conventions."
"The assistant should continue work across sessions."

Use:

agent memory
preference stores
episodic memory
semantic memory
project memory

How this applies to a technical blog

A technical blog can be more than a sequence of posts: it can become a represented knowledge system. Articles are documents, categories are a weak taxonomy, internal links are graph edges, pillar pages are canonical summaries, series pages are curated pathways, and search is retrieval. If you only publish isolated posts, retrieval has to work harder. If you build strong representation, retrieval becomes easier.

That means:

clear cluster boundaries
stable slugs
canonical pages
comparison pages
glossary-style explainers
internal links
structured metadata

This is why site architecture matters — not just for SEO, but because it is knowledge representation. The Knowledge Management cluster on this site is itself an example of representation-first publishing.

How this applies to RAG

RAG quality depends heavily on representation. A well-structured source corpus improves:

chunk quality
retrieval accuracy
citation quality
answer consistency
evaluation clarity

Before building a complex RAG pipeline, ask:

Are the source documents current?
Are duplicates removed?
Are important concepts clearly named?
Are pages scoped correctly?
Are tables and code blocks retrievable?
Are canonical answers obvious?
Are document boundaries meaningful?

If the answer to any of these is no, better embeddings will only help so much.
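The first two questions can be partly automated. The sketch below flags exact duplicates (after whitespace and case normalization) and documents past an assumed age limit. The file names, dates, and one-year threshold are hypothetical.

```python
import hashlib
from datetime import date, timedelta


def audit_corpus(docs, today, max_age_days=365):
    """docs maps doc_id -> (text, last_updated). Returns duplicates and stale docs."""
    seen, duplicates, stale = {}, [], []
    for doc_id, (text, updated) in docs.items():
        # Normalize case and whitespace so trivial edits do not hide a copy.
        normalized = " ".join(text.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            duplicates.append((seen[digest], doc_id))
        else:
            seen[digest] = doc_id
        if today - updated > timedelta(days=max_age_days):
            stale.append(doc_id)
    return {"duplicates": duplicates, "stale": stale}


docs = {
    "deploy.md": ("Deploy with the release pipeline.", date(2025, 11, 1)),
    "deploy-copy.md": ("Deploy  with the RELEASE pipeline.", date(2025, 11, 2)),
    "legacy.md": ("Use the legacy FTP upload.", date(2021, 5, 1)),
}
report = audit_corpus(docs, today=date(2026, 1, 1))
print(report)
```

This only catches copies that are identical after normalization. Near-duplicates and contradictions still need similarity checks or human review.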

How this applies to LLM Wiki

LLM Wiki is a representation-first pattern. It is useful when:

the corpus is small or medium-sized
knowledge is stable enough to summarize
repeated synthesis is expensive
humans benefit from readable pages
you want structure before retrieval

It is less useful when:

the corpus is massive
content changes constantly
freshness is more important than coherence
governance is weak
generated summaries cannot be reviewed

LLM Wiki is not a replacement for RAG but a different layer, and a strong system can use both:

LLM Wiki creates structured summaries.
RAG retrieves from raw sources and wiki pages.
Human review keeps the representation trustworthy.

Suggested architecture patterns

Pattern 1. Retrieval first

Use when speed matters.

documents
  -> chunks
  -> embeddings
  -> retrieval
  -> LLM answer

Good for:

prototypes
broad search
large corpora
early experiments

Weakness: coherence depends on source quality.

Pattern 2. Representation first

Use when trust matters.

sources
  -> curated pages
  -> internal links
  -> maintained knowledge base
  -> search or RAG

Good for:

documentation
technical knowledge
long-term content
team knowledge

Weakness: requires maintenance.

Pattern 3. Compiled knowledge

Use when repeated synthesis matters.

raw sources
  -> LLM extraction
  -> generated summaries
  -> topic pages
  -> reviewed knowledge base
  -> retrieval

Good for:

LLM Wiki systems
research collections
personal knowledge bases
stable domains

Weakness: generated structure must be audited.

Pattern 4. Hybrid knowledge architecture

Use when building serious systems.

raw documents
  -> structured knowledge layer
  -> search index
  -> retrieval and reranking
  -> AI answer
  -> feedback and maintenance

Good for:

production RAG
internal knowledge systems
AI assistants
technical publishing systems

Weakness: more moving parts.

Evaluation questions

To evaluate retrieval, ask:

Did the system find the right source?
Did it rank the right source highly?
Did it retrieve enough context?
Did it avoid irrelevant context?
Did the answer cite the correct source?
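The first two questions map onto two standard metrics: hit rate (was the right source in the top k?) and mean reciprocal rank (how high was it?). A minimal sketch over a hypothetical evaluation set, with a canned `search` function standing in for the real retriever:

```python
def evaluate_retrieval(eval_set, search, k=5):
    """eval_set is a list of (query, expected_source) pairs."""
    hits, reciprocal_ranks = 0, []
    for query, expected_source in eval_set:
        results = search(query)[:k]
        if expected_source in results:
            hits += 1
            reciprocal_ranks.append(1.0 / (results.index(expected_source) + 1))
        else:
            reciprocal_ranks.append(0.0)
    n = len(eval_set)
    return {"hit_rate": hits / n, "mrr": sum(reciprocal_ranks) / n}


# Canned results standing in for a real retriever.
FAKE_RESULTS = {
    "how do I roll back": ["deploy.md", "oncall.md"],
    "who gets paged": ["deploy.md", "oncall.md"],
    "where are api keys stored": ["deploy.md", "oncall.md"],
    "how do I deploy": ["deploy.md"],
}


def search(query):
    return FAKE_RESULTS.get(query, [])


eval_set = [
    ("how do I roll back", "deploy.md"),             # found at rank 1
    ("who gets paged", "oncall.md"),                 # found at rank 2
    ("where are api keys stored", "secrets.md"),     # missed
    ("how do I deploy", "deploy.md"),                # found at rank 1
]
metrics = evaluate_retrieval(eval_set, search)
print(metrics)
```

Here the hit rate is 3/4 and the mean reciprocal rank is (1 + 0.5 + 0 + 1) / 4 = 0.625. Tracking both over time shows whether a retrieval change helped or only moved results around.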

To evaluate representation, ask:

Is the knowledge structured clearly?
Is there a canonical page?
Are concepts named consistently?
Are relationships explicit?
Is the content maintained?
Can humans and machines both use it?

Do not evaluate a knowledge system only by answer quality — a good answer can hide a bad structure.

The opinionated rule

If your system fails occasionally, improve retrieval. If it fails repeatedly in the same conceptual area, improve representation.

Bad retrieval misses the right information. Bad representation means the right information does not really exist in a usable shape.
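If failed queries are tagged with the concept they were about, the rule can be applied mechanically. A small sketch, where the concept tags and the threshold of three repeats are assumptions chosen for illustration:

```python
from collections import Counter


def triage_failures(failed_queries, repeat_threshold=3):
    """failed_queries is a list of (query, concept) pairs from a failure log."""
    counts = Counter(concept for _, concept in failed_queries)
    return {
        concept: (
            "improve representation"
            if count >= repeat_threshold
            else "improve retrieval"
        )
        for concept, count in counts.items()
    }


failed = [
    ("what does the agent remember", "agent memory"),
    ("is memory the same as RAG", "agent memory"),
    ("where is memory stored", "agent memory"),
    ("how do I deploy to staging", "deployment"),
]
plan = triage_failures(failed)
print(plan)
```

A cluster of failures around one concept suggests the knowledge itself is missing or badly shaped, while an isolated miss is more likely a ranking problem.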

Conclusion

Retrieval and representation solve different problems: retrieval gives access, representation gives structure. RAG is powerful because it makes external knowledge available to LLMs at query time, but RAG does not automatically make knowledge coherent, canonical, or maintained. That is why wikis, PKM systems, knowledge graphs, and LLM Wiki style systems still matter.

The future is not retrieval vs representation but layered knowledge systems:

representation for structure
retrieval for access
memory for continuity
reasoning for synthesis

If you are building a serious knowledge system, do not start with the vector database. Start with the shape of the knowledge, then decide how it should be retrieved.

PKM vs RAG vs Wiki vs Memory Systems Explained Clearly

Rost — Sun, 17 May 2026 08:50:06 +0000

PKM, RAG, wikis, and AI memory systems are often discussed as if they solve the same problem.
They do not.
They all deal with knowledge, but they operate at different layers:

PKM helps humans think.
Wikis help groups preserve shared knowledge.
RAG helps machines retrieve external knowledge.
Memory systems help AI agents persist context over time.

Confusing these systems leads to bad architecture.

You get wikis full of personal scratch notes, RAG systems without a source of truth, memory layers pretending to be databases, and PKM tools overloaded with automation they were never designed to handle.

A better model is to see them as different parts of a knowledge systems spectrum.

This article compares PKM, RAG, wikis, and AI memory systems by structure, retrieval, ownership, evolution, and real-world use cases.

The short version

System	Primary user	Main purpose	Best for
PKM	Individual	Develop personal knowledge	Thinking, learning, synthesis
Wiki	Team or public group	Maintain shared knowledge	Documentation, policies, reference
RAG	Machine system	Retrieve context for generation	AI answers over external data
AI memory	AI agent	Persist context over time	Long-running agents and personalization

The most important distinction is this:

PKM and wikis structure knowledge. RAG retrieves knowledge. Memory systems evolve agent context.

That is the core mental model.

Why these systems are confused

They overlap in visible behavior.

All of them can:

store notes
retrieve information
answer questions
organize references
connect ideas

But they differ in intent.

A PKM system is not just a private wiki.
A wiki is not just a RAG database.
A RAG pipeline is not an AI memory.
An AI memory system is not a replacement for structured documentation.

The confusion comes from treating "knowledge" as one thing.

In practice, knowledge work moves through several stages:

Capture
Structure
Retrieval
Interpretation
Reuse
Evolution

Different systems optimize different stages.

The four paradigms

1. PKM

PKM stands for personal knowledge management.

It is the practice of capturing, organizing, connecting, and using knowledge for personal work.

Typical PKM systems include:

Obsidian
Logseq
Notion
plain Markdown folders
Zettelkasten systems
second brain systems

PKM is human-driven.

The goal is not just storage. The goal is better thinking.

What PKM is good at

PKM works well for:

learning a new domain
developing original ideas
connecting notes over time
writing articles or books
tracking personal research
building a second brain

A good PKM system is messy in a useful way. It supports unfinished thoughts, partial ideas, private context, and evolving concepts.

This is why PKM is not the same as documentation.

Documentation wants clarity.
PKM tolerates ambiguity.

PKM failure modes

PKM often fails when it becomes:

a dumping ground
a folder taxonomy project
a productivity aesthetic
a tool optimization hobby
a private archive nobody uses

The main risk is collection without synthesis.

If you only save information, you do not have a knowledge system. You have a personal landfill.

Opinionated take

PKM should optimize for reuse, not capture.

Capturing everything feels productive, but it creates debt. The real value appears when notes become connected, rewritten, compressed, and used in output.

2. Wiki

A wiki is a structured knowledge base designed for shared reference.

Typical wiki systems include:

DokuWiki
MediaWiki
Confluence
BookStack
Git-based documentation sites
internal company knowledge bases

A wiki is usually more formal than PKM.

It should answer:

What do we know, and where is the current version?

What wikis are good at

Wikis work well for:

team documentation
operational runbooks
product knowledge
policy documents
technical reference
onboarding material
stable domain knowledge

A wiki is a social contract.

It says:

This page is the place where this knowledge lives.

That makes ownership and maintenance critical.

Wiki failure modes

Wikis often fail because they become stale.

Common problems:

no page owners
outdated screenshots
duplicate pages
unclear canonical versions
too much hierarchy
no maintenance rhythm

A wiki with old information is worse than no wiki, because it creates false confidence.

Opinionated take

A wiki should be boring.

That is a compliment.

A good wiki is not where ideas are born. It is where stable knowledge is preserved after it becomes useful to others.

3. RAG

RAG stands for retrieval-augmented generation.

It is an AI architecture where a system retrieves relevant external information before asking a language model to generate an answer.

A basic RAG pipeline usually has:

Documents
Chunking
Embeddings or search index
Retrieval
Optional reranking
Prompt assembly
LLM generation

RAG is machine-driven.

The goal is not to create knowledge. The goal is to give a model relevant context at query time.

What RAG is good at

RAG works well for:

question answering over documents
internal search assistants
support bots
technical documentation assistants
compliance lookup
research over large corpora
connecting LLMs to updated information

RAG is especially useful when the model cannot or should not memorize the information.

RAG failure modes

RAG often fails when teams treat it as magic search.

Common problems:

bad chunking
weak retrieval
noisy context
missing metadata
no source of truth
stale documents
weak evaluation
no human feedback loop

RAG does not fix bad knowledge management.

If the underlying content is fragmented, outdated, or contradictory, the RAG system will surface that mess with confidence.

Opinionated take

RAG is not a knowledge strategy.

RAG is an access strategy.

It helps machines access knowledge, but it does not decide what knowledge is valid, maintained, canonical, or useful.

4. AI memory systems

AI memory systems give agents persistent context beyond a single prompt or conversation.

They may store:

user preferences
past decisions
long-term facts
task history
summaries
reflections
extracted entities
episodic memories
semantic memories

Examples and related ideas include:

MemGPT-style memory tiers
long-term agent memory
episodic memory
semantic memory
vector memory
profile memory
tool state memory
reflective agents

AI memory is agent-driven.

The goal is continuity.

What AI memory is good at

AI memory systems work well for:

personal assistants
long-running coding agents
research agents
customer support agents
tutoring systems
workflow automation
persistent companions
multi-session task execution

Memory matters when the system must behave as if it remembers.

AI memory failure modes

Memory systems are dangerous when unmanaged.

Common problems:

remembering wrong facts
storing too much
privacy risk
stale preferences
poor memory ranking
memory poisoning
no forgetting mechanism
confusing memory with truth

A memory system needs governance.

It should answer:

What should be remembered?
Who approved it?
How long should it live?
When should it be forgotten?
How is it corrected?
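Those five questions translate fairly directly into a data model. A minimal sketch, with hypothetical class and field names, where every item records who approved it and how long it may live:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class MemoryItem:
    key: str
    value: str
    approved_by: Optional[str]  # who approved it
    created_at: datetime
    ttl: Optional[timedelta]    # how long it should live; None means no expiry


class GovernedMemory:
    def __init__(self):
        self._items = {}

    def remember(self, item):
        # What should be remembered: only approved items are stored.
        if item.approved_by is None:
            raise ValueError(f"memory '{item.key}' has no approver")
        # How it is corrected: writing the same key replaces the old value.
        self._items[item.key] = item

    def recall(self, key, now):
        item = self._items.get(key)
        if item is None:
            return None
        # When it should be forgotten: expired items are dropped on access.
        if item.ttl is not None and now - item.created_at > item.ttl:
            del self._items[key]
            return None
        return item.value

    def forget(self, key):
        self._items.pop(key, None)


start = datetime(2026, 1, 1)
memory = GovernedMemory()
memory.remember(MemoryItem("test_command", "go test ./...", "user", start, None))
memory.remember(MemoryItem("editor", "vim", "user", start, timedelta(days=30)))

later = start + timedelta(days=31)
print(memory.recall("test_command", later))  # still remembered
print(memory.recall("editor", later))        # expired, returns None
```

The design choice worth noting is that forgetting is the default for anything with a lifetime, instead of something that has to be remembered separately.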

Opinionated take

AI memory is not just long context.

Long context lets a model see more at once.
Memory decides what survives across time.

Those are different problems.

Core differences table

Dimension	PKM	Wiki	RAG	AI memory
Primary user	Individual	Team or public group	AI system	AI agent
Main function	Thinking	Shared reference	Query-time retrieval	Persistent context
Knowledge state	Evolving	Stabilized	Retrieved	Adaptive
Structure	Flexible	Explicit	Index-based	Learned or extracted
Retrieval style	Human search and linking	Navigation and search	Semantic or hybrid retrieval	Relevance plus salience
Ownership	Personal	Page or team owners	System maintainers	Agent or user controlled
Time horizon	Long term personal	Long term shared	Query time	Multi-session
Best output	Insight	Reliable reference	Grounded answer	Continuity
Main risk	Hoarding	Staleness	Bad retrieval	Bad memory
Good metric	Reuse in thinking	Trust and freshness	Answer quality	Helpful continuity

Structure vs retrieval vs evolution

The simplest way to understand these systems is to compare what they optimize. The architectural implications of that distinction are explored in depth in Retrieval vs Representation in Knowledge Systems.

PKM optimizes personal evolution

PKM is about how your understanding changes.

You collect material, rewrite it, connect it, and turn it into something useful.

The output is often:

a better mental model
a written article
a decision
a research direction
a reusable insight

PKM is not primarily about fast lookup. It is about long-term sensemaking.

Wikis optimize shared structure

Wikis are about stable knowledge.

They ask:

What is the current answer?
Who owns it?
Where should people go?
What should be updated?

A wiki works when people trust it.

RAG optimizes machine retrieval

RAG is about retrieving the right context at the right time.

It asks:

What documents are relevant?
Which chunks should be used?
How much context fits?
What should the model cite?

RAG works when retrieval quality is high and the source corpus is trustworthy.

AI memory optimizes continuity

Memory systems are about persistence across sessions.

They ask:

What should the agent remember?
What should be forgotten?
Which memory matters now?
How should memory change behavior?

Memory works when it improves future behavior without polluting the agent with stale or incorrect context.

When to use PKM

Use PKM when the knowledge is personal, unfinished, or exploratory.

Good scenarios:

learning distributed systems
planning articles
researching LLM architecture
collecting book notes
building a second brain
tracking personal experiments

Use PKM when you are still thinking.

Example

You are learning about RAG evaluation.

You collect:

articles
benchmark notes
diagrams
implementation ideas
failures from your own experiments

This belongs in PKM first.

Later, once the knowledge stabilizes, you may publish an article or turn it into documentation.

When to use a wiki

Use a wiki when knowledge must be shared and maintained.

Good scenarios:

team onboarding
API documentation
operational runbooks
architecture decision records
product knowledge
deployment instructions
support procedures

Use a wiki when others need a reliable answer.

Example

Your team has one correct way to deploy a Hugo site to S3 and CloudFront.

That does not belong only in someone's private notes.

It belongs in a wiki or documentation system with clear ownership.

When to use RAG

Use RAG when an AI system needs access to external knowledge at query time.

Good scenarios:

chatbot over documentation
search assistant over internal docs
support assistant over help articles
legal or compliance assistant
research over large document sets
developer assistant over code docs

Use RAG when the problem is:

The model needs information that lives outside its weights.

Example

You have hundreds of technical articles and want an assistant to answer questions using them.

RAG is a good fit.

But only if the documents are clean enough to retrieve from.

When to use AI memory

Use AI memory when an agent needs continuity.

Good scenarios:

coding agents that remember project conventions
personal assistants that remember preferences
research agents that continue long investigations
tutoring agents that remember student progress
support agents that remember prior interactions
autonomous agents that track goals

Use memory when the system must improve over time.

Example

A coding agent should remember:

the project uses Go
tests run with a specific command
the user prefers minimal dependencies
database migrations follow a convention

That is not just retrieval. It is persistent operating context.

How these systems combine

The most useful systems are hybrids.

A mature knowledge architecture might look like this:

PKM for personal exploration
Wiki for stable shared knowledge
RAG for machine access
AI memory for long-running agent continuity

Each layer has a job.

Pattern 1. PKM to wiki

This is the human knowledge pipeline.

Flow:

Capture notes privately
Connect ideas
Distill insights
Publish stable knowledge
Maintain as shared reference

This is how personal research becomes organizational knowledge.

Example

You research self-hosted knowledge tools in Obsidian.

After testing DokuWiki, Nextcloud, and static Markdown systems, you write a stable guide in your site or team wiki.

PKM created the insight.
The wiki preserves the result.

Pattern 2. Wiki to RAG

This is the machine access pipeline.

Flow:

Maintain canonical wiki pages
Index them
Retrieve relevant sections
Generate grounded answers
Link back to sources

This is one of the cleanest RAG patterns.

The wiki remains the source of truth.
RAG becomes the access layer.

Example

A support bot answers questions using a product wiki.

The bot should not replace the wiki. It should cite and route users back to the canonical pages.

Pattern 3. RAG plus memory

This is the agent continuity pipeline.

Flow:

RAG retrieves external facts
Memory stores user or task context
The agent combines both
Future behavior improves

RAG answers:

What does the knowledge base say?

Memory answers:

What matters about this user, project, or task?

Example

A coding agent uses RAG to retrieve framework docs.

It uses memory to remember that your project avoids ORMs, prefers sqlc, and uses structured logging.

Those are different knowledge types.
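
One way to keep the two knowledge types separate is to label them separately in the prompt. The sketch below is illustrative only: AgentMemory and assemble_context are hypothetical names, the retrieved chunk is a placeholder, and a real agent would persist the memory between sessions rather than hold it in a Python object.

```python
from dataclasses import dataclass, field


@dataclass
class AgentMemory:
    """Durable project context (held in memory here; persisted in a real agent)."""
    preferences: list[str] = field(default_factory=list)

    def remember(self, fact: str) -> None:
        # Store each convention once so repeated sessions do not bloat the prompt.
        if fact not in self.preferences:
            self.preferences.append(fact)


def assemble_context(question: str, retrieved_docs: list[str], memory: AgentMemory) -> str:
    """Keep remembered conventions and retrieved facts in separate labeled sections."""
    docs = "\n".join(f"- {d}" for d in retrieved_docs)
    prefs = "\n".join(f"- {p}" for p in memory.preferences)
    return (
        f"Project conventions (from memory):\n{prefs}\n\n"
        f"Reference material (from retrieval):\n{docs}\n\n"
        f"Task: {question}"
    )


memory = AgentMemory()
memory.remember("avoid ORMs")
memory.remember("prefer sqlc for queries")
memory.remember("use structured logging")
memory.remember("avoid ORMs")  # duplicate, ignored

context = assemble_context(
    "Add a users table query",
    ["sqlc generates type-safe Go code from SQL."],  # placeholder retrieved chunk
    memory,
)
```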

Pattern 4. PKM plus AI assistant

This is the hybrid thinking pipeline.

Flow:

Human captures notes
AI summarizes and suggests links
Human edits and validates
Knowledge becomes more structured
Some pages graduate to wiki or publication

The AI augments the PKM system, but it should not own the truth.

Example

An AI assistant can suggest connections between notes about RAG, memory systems, and LLM Wiki.

But the human decides which connections are meaningful.

Common architecture mistakes

Mistake 1. Treating RAG as a wiki

RAG is not a knowledge base.

It does not automatically create a canonical structure. It retrieves from whatever exists.

If the source documents are bad, RAG becomes a confident interface to bad knowledge.

Mistake 2. Treating memory as a database

AI memory is selective context, not general storage.

A database stores records.
Memory changes behavior.

If you need exact facts, use a database or knowledge base.
If you need continuity, use memory.

Mistake 3. Treating PKM as documentation

PKM can be messy.

Documentation should not be.

Private notes can contain half-formed ideas. Shared documentation should contain stable, maintained knowledge.

Mistake 4. Treating a wiki as a thinking tool

A wiki can support thinking, but it is not ideal for early exploration.

If every early thought must become a polished page, people stop writing.

Use PKM for rough thinking. Use wikis for durable knowledge.

Mistake 5. Treating long context as memory

Long context is not memory.

It only helps while the context is present.

Memory persists, selects, updates, and sometimes forgets.

Decision guide

Use this simple decision model.

If the knowledge is private and evolving

Use PKM.

If the knowledge is shared and stable

Use a wiki.

If an AI needs to answer from external documents

Use RAG.

If an agent needs continuity over time

Use memory.

If you need all four

Build a layered system.

Do not force one tool to do every job.
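
The decision model is small enough to write down as code. This is a toy sketch with a hypothetical choose_layers function, meant only to show that the conditions are independent and can all be true at once.

```python
def choose_layers(
    *,
    private_and_evolving: bool = False,
    shared_and_stable: bool = False,
    ai_answers_from_documents: bool = False,
    agent_needs_continuity: bool = False,
) -> list[str]:
    """Return every layer whose condition holds; more than one means a layered system."""
    rules = [
        (private_and_evolving, "PKM"),
        (shared_and_stable, "wiki"),
        (ai_answers_from_documents, "RAG"),
        (agent_needs_continuity, "memory"),
    ]
    return [layer for condition, layer in rules if condition]


team_docs = choose_layers(shared_and_stable=True, ai_answers_from_documents=True)
```

Here team_docs contains both the wiki and RAG layers, which is the Wiki to RAG pattern from earlier.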

The knowledge systems spectrum

These systems form a spectrum from human thinking to AI continuity.

Layer	System	Role
Human thought	PKM	Explore and synthesize
Shared structure	Wiki	Preserve and maintain
Machine access	RAG	Retrieve and generate
Agent continuity	Memory	Persist and adapt

The direction matters.

Knowledge often starts as personal thought, becomes shared structure, is indexed for machine retrieval, and then becomes part of persistent agent behavior.

That is the modern knowledge stack.

Where LLM Wiki fits

LLM Wiki-style systems sit between wikis and AI architecture.

They are not classic RAG.

Instead of retrieving chunks only at query time, they attempt to pre-structure knowledge into pages, summaries, entities, and links.

That makes them closer to compiled knowledge systems.

A useful placement:

System	Position
Wiki	Human-maintained structured knowledge
RAG	Query-time machine retrieval
LLM Wiki	Ingest-time, machine-structured knowledge
Memory	Persistent agent context

This is why LLM Wiki belongs near knowledge systems architecture, not inside ordinary RAG.

Practical examples

Example 1. Personal technical blog

A technical blogger might use:

PKM for research notes
Hugo site as published knowledge
internal linking as wiki-like structure
RAG later for site search
AI memory for writing assistant preferences

This is a strong architecture.

It keeps human judgment at the center while still allowing AI support.

Example 2. Engineering team

An engineering team might use:

PKM for individual learning
wiki for standards and runbooks
RAG assistant for internal docs
memory for coding agents working inside repositories

The wiki should remain canonical.

The RAG assistant should not invent process.
The memory layer should remember project preferences, not replace architecture decisions.

Example 3. AI research workflow

A researcher might use:

PKM for paper notes
wiki for stable summaries
RAG for literature search
memory for long-running research agents

This works because each layer handles a different time scale.

Security and governance

Knowledge systems become risky when they store sensitive or stale information.

PKM governance

Questions:

What should stay private?
What should be published?
What should be deleted?

Wiki governance

Questions:

Who owns each page?
When was it last reviewed?
What is canonical?

RAG governance

Questions:

Which sources are indexed?
Are answers cited?
How is retrieval evaluated?
What content is excluded?
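
"How is retrieval evaluated?" can start with something as plain as recall@k over a small hand-labeled set. The sketch below assumes you already have ranked document ids from your retriever and a human-marked set of relevant ids per query. The query strings and ids are made up for illustration.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        raise ValueError("each evaluation query needs at least one relevant document")
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


# Hypothetical labeled set: query -> (ranked ids from the retriever, ids marked relevant)
eval_set = {
    "rollback steps": (["runbook-7", "faq-2", "runbook-1"], {"runbook-7", "runbook-1"}),
    "vpn setup": (["faq-9", "guide-3", "guide-4"], {"guide-3"}),
}

scores = {
    query: recall_at_k(ranked, relevant, k=2)
    for query, (ranked, relevant) in eval_set.items()
}
mean_recall = sum(scores.values()) / len(scores)
```

Tracking a number like this across index changes is what turns "are answers cited?" from a hope into a regression test.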

Memory governance

Questions:

What is remembered?
Can users inspect memory?
Can users delete memory?
How are wrong memories corrected?

Memory needs the strictest governance because it can silently influence future behavior.
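
Those four questions map onto a very small interface. The following is a sketch, not a real memory library: GovernedMemory and its method names are hypothetical, and storage is an in-process dict purely to keep the example self-contained.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class MemoryEntry:
    key: str
    value: str
    source: str  # where the agent learned this, so a wrong memory can be traced
    updated_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class GovernedMemory:
    """Every entry can be inspected, corrected, and deleted by the user."""

    def __init__(self) -> None:
        self._entries: dict[str, MemoryEntry] = {}

    def remember(self, key: str, value: str, source: str) -> None:
        self._entries[key] = MemoryEntry(key, value, source)

    def inspect(self) -> list[MemoryEntry]:
        return sorted(self._entries.values(), key=lambda e: e.key)

    def correct(self, key: str, value: str) -> None:
        if key not in self._entries:
            raise KeyError(key)
        # Overwrite and record that the new value came from the user.
        self._entries[key] = MemoryEntry(key, value, source="user correction")

    def forget(self, key: str) -> bool:
        return self._entries.pop(key, None) is not None


store = GovernedMemory()
store.remember("test_command", "go test ./...", source="session notes")
store.remember("orm_policy", "uses an ORM", source="inferred from chat")
store.correct("orm_policy", "avoids ORMs")
store.forget("test_command")
```

Recording a source on every entry is the design choice that matters here: a memory that silently shapes behavior should at least say where it came from.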

SEO and content strategy note

If you run a technical site, this distinction is not only architectural. It is also editorial.

You can map content like this:

PKM pages explain human knowledge practices.
Wiki pages explain structured knowledge systems.
RAG pages explain retrieval engineering.
Memory pages explain persistent AI behavior.
Architecture pages compare and connect the paradigms.

This gives your site a clean authority mesh instead of a pile of loosely related AI articles.

Final conclusion

PKM, RAG, wikis, and AI memory systems are not competitors.

They are different answers to different questions.

PKM asks:

How do I think better over time?

A wiki asks:

What do we know, and where is the trusted version?

RAG asks:

What external context should the model use right now?

AI memory asks:

What should this agent remember for the future?

Once you separate those questions, the architecture becomes obvious.

Use PKM for thinking.
Use wikis for shared truth.
Use RAG for retrieval.
Use memory for continuity.

The future is not one knowledge system that replaces all others.

The future is layered knowledge architecture. For tools, methods, and self-hosted platforms across the full knowledge management spectrum, the cluster pillar maps the territory.

Agentic LLM Inference Parameters Reference for Qwen and Gemma

Rost — Sun, 17 May 2026 02:27:20 +0000

This page is a practical reference for agentic LLM inference tuning (temperature, top_p, top_k, penalties, and how they interact in multi-step and tool-heavy workflows).

It sits alongside the broader LLM performance engineering hub and pairs best with a clear LLM hosting and serving story. Throughput and scheduling still dominate when the model is starved, but unstable sampling wastes retries and output tokens long before the GPU becomes the limit.

This page consolidates:

vendor recommended parameters
embedded defaults from GGUF and APIs
real-world community findings
agentic workflow optimizations

Right now it is focused on:

Qwen 3.6 (dense and MoE)
Gemma 4 (dense and MoE)

If you run terminal agents such as OpenCode, pair this reference with local LLM behavior in OpenCode so workload-level results and sampler defaults stay aligned.

The goal is simple:

Provide a single place to configure models for agent loops, coding, and multi-step reasoning.

TLDR Reference Table - All models (agentic defaults)

Model	Mode	temp	top_p	top_k	presence_penalty
Qwen 3.6 27B	thinking general	1.0	0.95	20	0.0
Qwen 3.6 27B	coding	0.6	0.95	20	0.0
Qwen 3.6 35B MoE	thinking	1.0	0.95	20	1.5
Qwen 3.6 35B MoE	coding	0.6	0.95	20	0.0
Gemma 4 31B	general	1.0	0.95	64	0.0
Gemma 4 31B	coding	1.2	0.95	65	0.0
Gemma 4 26B MoE	general	1.0	0.95	64	0.0
Gemma 4 26B MoE	coding	1.2	0.95	65	0.0
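
If you want the table in code rather than in your head, a lookup like the one below is enough. The numbers are copied from the table above. The keys such as "qwen-35b-moe" are illustrative labels, not official model ids, and SamplerPreset and preset_for are hypothetical names.

```python
from typing import NamedTuple


class SamplerPreset(NamedTuple):
    temperature: float
    top_p: float
    top_k: int
    presence_penalty: float


# Values copied from the table above; keys are illustrative labels only.
PRESETS: dict[tuple[str, str], SamplerPreset] = {
    ("qwen-27b-dense", "thinking"): SamplerPreset(1.0, 0.95, 20, 0.0),
    ("qwen-27b-dense", "coding"): SamplerPreset(0.6, 0.95, 20, 0.0),
    ("qwen-35b-moe", "thinking"): SamplerPreset(1.0, 0.95, 20, 1.5),
    ("qwen-35b-moe", "coding"): SamplerPreset(0.6, 0.95, 20, 0.0),
    ("gemma-31b-dense", "general"): SamplerPreset(1.0, 0.95, 64, 0.0),
    ("gemma-31b-dense", "coding"): SamplerPreset(1.2, 0.95, 65, 0.0),
    ("gemma-26b-moe", "general"): SamplerPreset(1.0, 0.95, 64, 0.0),
    ("gemma-26b-moe", "coding"): SamplerPreset(1.2, 0.95, 65, 0.0),
}


def preset_for(model: str, mode: str) -> SamplerPreset:
    """Fail loudly on unknown combinations instead of falling back to server defaults."""
    try:
        return PRESETS[(model, mode)]
    except KeyError:
        raise KeyError(f"no preset for model={model!r} mode={mode!r}") from None


coding = preset_for("qwen-35b-moe", "coding")
```

Raising on an unknown pair is deliberate: a silent fallback to runtime defaults is exactly how an agent ends up sampling with settings nobody chose.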

What "Agentic Inference" Actually Means

Most parameter guides assume:

chat
single-shot completion
human interaction

Agentic systems are different.

They require:

multi-step reasoning
tool calling
consistent outputs
low error propagation

This changes tuning priorities.

Core shift

Use case	Priority
Chat	natural language quality
Creative	diversity
Agentic	consistency + reasoning stability

Qwen 3.6 Tuning

Dense vs MoE matters

Qwen is one of the few families where:

MoE requires different penalties

Dense (27B)

stable
predictable
no routing complexity

Recommended:

presence_penalty = 0.0

MoE (35B-A3B)

expert routing per token
risk of repetition loops

Recommended:

presence_penalty = 1.5 (general)
0.0 for coding

Why this matters

MoE models can get stuck reusing the same experts.

Presence penalty helps:

diversify token paths
improve reasoning exploration
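
To make the mechanism concrete, here is a toy sketch of a presence penalty. It assumes the common additive form that OpenAI documents for its API, where a flat amount is subtracted from the logit of any token that has already appeared. Whether your runtime applies exactly this formula is worth verifying. Note that the penalty acts on tokens, so any effect on expert routing is indirect.

```python
def apply_presence_penalty(
    logits: dict[str, float],
    generated: list[str],
    presence_penalty: float,
) -> dict[str, float]:
    """Subtract a flat penalty from every token that has appeared at least once."""
    seen = set(generated)
    return {
        token: (logit - presence_penalty) if token in seen else logit
        for token, logit in logits.items()
    }


logits = {"retry": 2.0, "done": 1.5, "read_file": 1.0}  # toy next-token scores
history = ["retry", "retry", "retry"]                    # a repetition loop in progress

penalized = apply_presence_penalty(logits, history, presence_penalty=1.5)
best_before = max(logits, key=logits.get)
best_after = max(penalized, key=penalized.get)
```

With these toy numbers the looping token stops being the top candidate once the penalty is applied. The same effect is why a non-zero penalty can hurt coding: legitimate repetition, such as identifiers, gets penalized too.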

Qwen Agentic Coding Setup

This is where most people get it wrong.

Correct setup

temperature = 0.6
top_p = 0.95
top_k = 20
presence_penalty = 0.0

Why low temperature works

Coding agents need:

predictable, low-variance outputs
repeatable tool calls
stable formatting

Higher temperature:

breaks JSON
introduces hallucinated APIs
increases retries
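
The effect is easy to see on a toy distribution. The sketch below applies standard temperature scaling (divide the logits by the temperature, then softmax) to three made-up scores. The logits are invented for illustration and do not come from any real model.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Divide logits by the temperature, then normalize with a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [3.0, 2.0, 0.0]  # toy scores: correct token, plausible token, junk token

cool = softmax_with_temperature(logits, 0.6)
warm = softmax_with_temperature(logits, 1.2)
```

At 0.6 more probability mass sits on the top token and less on the junk token than at 1.2. In a tool-calling loop that junk-token mass is what shows up as a stray character in the middle of a JSON payload.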

Gemma 4 Tuning

Gemma behaves differently.

No official defaults

model cards are empty
configs are implicit
real tuning comes from Google AI Studio, GGUF defaults, and community benchmarks

The Counter-Intuitive Finding

Gemma 4 performs better with higher temperature.

Observed behavior

Temp	Result
0.5	poor reasoning
1.0	stable baseline
1.2 to 1.5	best coding performance

This contradicts standard advice.

Why high temperature works here

Hypothesis:

training distribution favors exploration
reasoning mode depends on diversity
model compensates for lack of explicit chain-of-thought control

Result:

higher temperature improves solution search space

Gemma Agentic Coding Setup

Recommended:

temperature = 1.2
top_p = 0.95
top_k = 65
penalties = 0.0

Important

Do not apply the traditional "low temp for code" rule blindly.

Gemma is an exception.

Thinking Mode and Agent Systems

Both Qwen and Gemma support reasoning modes.

Why it matters

Agent loops require:

intermediate reasoning
error recovery
multi-step planning

Practical rule

Always enable thinking mode for:

coding agents
tool use
multi-step tasks

Parameter Strategy by Use Case

Coding agents

prioritize determinism
minimize penalties
stable sampling

Reasoning agents

moderate temperature
allow exploration
preserve structure

Tool calling

strict formatting
low randomness
consistent token patterns

Schema and JSON tooling are orthogonal to logits; combine these sampling rules with structured output patterns for Ollama and Qwen3 so validators see fewer retries.

Vendor Defaults vs Reality

Vendor defaults are:

safe
generic
not optimized

Community findings often show:

better performance
task-specific tuning
architecture-aware adjustments

Example

Gemma:

official: no guidance
community: high temperature improves coding

Qwen:

official: inconsistent sections
community: standardized values converge

Practical Deployment Notes

Under concurrency, queueing and memory splits interact with retries as much as sampling does—read how Ollama handles parallel requests alongside the presets above.

Ollama

works well for both families
verify GPU compatibility
defaults may differ from reference

vLLM

supports advanced sampling
stable for production
use explicit parameters
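
"Use explicit parameters" can be as simple as never sending a request without them. This sketch builds a request body for an OpenAI-compatible chat completions endpoint such as the one vLLM serves. It assumes the server accepts top_k as an extra sampling field, which is not part of the OpenAI schema, so check your server version. The model name is a placeholder and build_chat_request is a hypothetical helper.

```python
import json


def build_chat_request(
    model: str,
    messages: list[dict[str, str]],
    *,
    temperature: float,
    top_p: float,
    top_k: int,
    presence_penalty: float,
) -> str:
    """Serialize a request body with every sampling field set explicitly."""
    body = {
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,  # extension field; assumed to be accepted by the server
        "presence_penalty": presence_penalty,
    }
    return json.dumps(body)


payload = build_chat_request(
    "my-served-model",  # placeholder: use the name your server was launched with
    [{"role": "user", "content": "List the files you would change."}],
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    presence_penalty=0.0,
)
```

Making the sampling arguments keyword-only and required means a caller cannot forget one and inherit a server default by accident.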

llama.cpp

requires sampler ordering
always enable jinja for modern models
incorrect sampler chain reduces output quality

Key Takeaways

there is no universal parameter set
architecture matters more than model size
agentic systems require different tuning than chat
community benchmarks are often ahead of vendors

Final Opinion

Most parameter guides are outdated.

They assume:

chat use
low temperature for code
static configurations

Modern models break those assumptions.

If you are building agentic systems:

treat inference tuning as a first-class system design problem

Not a config file.

Future Direction

This reference will evolve into:

per-model deep dives
agent-specific configs
benchmarking-backed tuning

Because:

inference is where model capability becomes system performance

LLM Structured Output Validation in Python That Holds Up

Rost — Fri, 15 May 2026 01:26:29 +0000

Most LLM "structured output" tutorials are unserious.
They teach you to ask for JSON politely and then hope the model behaves.
That is not validation.
That is optimism with braces.

OpenAI's own docs make the distinction explicit. JSON mode gives you valid JSON, while Structured Outputs enforces schema adherence, and OpenAI recommends using Structured Outputs instead of JSON mode when possible.

That still does not make the payload trustworthy. JSON Schema defines structure and allowed values, Pydantic gives you typed validation in Python, and OpenAI explicitly notes that a schema-valid response can still contain incorrect values. On top of that, refusals and incomplete outputs can bypass the shape you expected. In production, structured output validation is a pipeline, not a toggle. The same boundary also has to live inside the wider story of throughput, retries, and scheduler limits on the LLM performance engineering hub.

Structured output validation is a contract

Structured output validation for LLMs means you define the shape of the answer up front, constrain the model to produce that shape when possible, and then validate the result again before your application trusts it. In practical terms, that means checking required fields, types, enums, closed object shapes, and domain rules before the payload touches your database, UI, queue, or downstream service. JSON Schema exists for exactly this kind of structural validation, Pydantic is built to validate untrusted data against Python type hints, and Python's jsonschema library gives you a direct way to validate an instance against a schema.

There is also a clean split between two common use cases. If the model is supposed to answer the user in a structured format, use a structured response format. If the model is supposed to call your application's tools or functions, use function calling. OpenAI's docs spell out that distinction, and for function calling they recommend enabling strict: true so the arguments reliably adhere to the function schema.

My strong opinion is simple. Treat every structured LLM response as an API boundary. Once you start thinking in terms of contracts instead of prompts, the architecture gets cleaner, the bugs get cheaper, and the whole "why did the model invent a new field in production" problem mostly disappears. That is the real answer to "what is structured output validation for LLMs" and it is a much better answer than "ask the model nicely for JSON."

JSON mode is not validation

If you remember only one thing from this article, make it this. JSON mode is not schema validation. OpenAI's Help Center says JSON mode will not guarantee the output matches any specific schema, only that it is valid JSON and parses without errors. The Structured Outputs guide says the same thing in a cleaner way. Both JSON mode and Structured Outputs can produce valid JSON, but only Structured Outputs enforces schema adherence.

That difference matters more than people admit. In its Structured Outputs launch post, OpenAI reported that gpt-4o-2024-08-06 with Structured Outputs scored 100 percent on its complex JSON schema evals, while gpt-4-0613 scored under 40 percent. You do not need to treat those numbers as universal truth to see the broader point. Schema enforcement changes the failure surface from "anything could happen" to "the contract is much tighter."

There are still edge cases, and pretending otherwise is how toy demos become pager duty. OpenAI documents that the model can refuse an unsafe request, and those refusals are surfaced outside your normal schema path. It also documents incomplete responses, including cases such as hitting max_output_tokens or a content filter interruption. So the FAQ "is JSON mode enough for reliable LLM output" has a short answer and a longer one. The short answer is no. The longer answer is that even strict structured output still needs explicit failure handling.

Where structured output still breaks

Schema enforcement shrinks the problem. It does not delete it. In real traffic you still see broken or surprising payloads for reasons that have little to do with your prompt wording.

Failure shapes worth designing for

Models and clients disagree about details. You can get extra prose before or after the JSON, Markdown fenced blocks around the payload, or a tool call whose name is valid but whose arguments are JSON that does not match your Pydantic model. Streaming makes it worse because you might validate a half-finished buffer. Defensive code should assume "string in, maybe JSON inside" rather than "bytes on the wire already match my model."

Provider and API differences

Not every host exposes the same structured-output surface. One stack might give you a first-class schema-bound completion, another might only guarantee JSON syntax, and local runtimes might lag behind hosted APIs. That is one reason the FAQ "how do you validate LLM JSON in Python" starts with provider enforcement when it exists and still ends with Python-side validation. For a wider view of how vendors compare, see the structured output comparison across popular LLM providers. If you run models locally, the same validation pipeline applies after you normalize the wire format, for example after extraction with Ollama as in structured LLM output with Ollama in Python and Go. When a runtime still wraps JSON with odd prefixes or reasoning traces, expect the same class of parser failures described in Ollama GPT-OSS structured output issues.

The Python stack that actually works

My recommendation is boring on purpose. First, let the model provider enforce the structural contract when it can. Second, validate the returned payload in Python with Pydantic. Third, use explicit business-rule validation for facts that a schema alone cannot prove. Fourth, test the contract with fixtures and adversarial examples instead of waving at a playground screenshot and calling it done. OpenAI's Structured Outputs docs, Pydantic's validator model, Python's jsonschema tooling, and OpenAI's own structured-output eval examples all point in that direction.

Pydantic is the right center of gravity for Python. It lets you model the output as normal Python types, generate JSON Schema with model_json_schema(), and validate raw JSON with model_validate_json(). Pydantic's docs also note that model_validate_json() is generally the better path than doing json.loads(...) first and then validating, because that two-step route adds extra parsing work in Python.

If you keep standalone schema files in your repo, or you want CI to validate fixture payloads independently of model code, Python's jsonschema package gives you the simplest possible contract check with jsonschema.validate(...). If you want that in pre-commit, check-jsonschema exists specifically as a CLI and pre-commit hook built on jsonschema. That is a very good fit for teams that want schema changes reviewed like code changes.

Frameworks can reduce plumbing, but they do not remove the need for actual validation. LangChain now auto-selects provider-native structured output when the provider supports it and falls back to a tool strategy otherwise. Instructor layers Pydantic response models, validation, retries, and multi-provider support on top of model calls. Guardrails focuses on validators and input-output guard layers. Useful tools, all of them. But the schema and the business rules still belong to you. If you are choosing between higher-level libraries, the BAML vs Instructor comparison for Python is a useful companion to this article.

A minimal OpenAI and Pydantic example

The smallest production-worthy example has a few non-negotiables. Use a closed set of enum-like values where possible. Forbid extra keys. Add field descriptions so the schema is understandable to humans and more legible to the model. Keep the root object explicit and boring. OpenAI recommends clear names plus titles and descriptions for important keys, JSON Schema uses enum to restrict values, and Pydantic can close the object shape with extra="forbid".

from typing import Literal

from openai import OpenAI
from pydantic import BaseModel, ConfigDict, Field

class TicketClassification(BaseModel):
    model_config = ConfigDict(extra="forbid")

    category: Literal["billing", "bug", "how_to", "abuse"] = Field(
        description="Support ticket category."
    )
    priority: Literal["low", "medium", "high"] = Field(
        description="Operational urgency."
    )
    needs_human: bool = Field(
        description="Whether a human should review the case."
    )
    summary: str = Field(
        description="A one sentence summary of the issue."
    )

client = OpenAI()

response = client.responses.parse(
    model="gpt-4o-2024-08-06",
    input=[
        {
            "role": "system",
            "content": "Classify support tickets. Return only the structured result.",
        },
        {
            "role": "user",
            "content": "Customer reports duplicate charges after refreshing checkout.",
        },
    ],
    text_format=TicketClassification,
)

result = response.output_parsed
print(result.model_dump())

Two details in that example are easy to miss and absolutely worth caring about. extra="forbid" on the Pydantic side mirrors the JSON Schema idea of additionalProperties: false, which is also a requirement for strict tool schemas in OpenAI's function-calling docs. And enums are not cosmetic. They are one of the simplest ways to stop the model from inventing a value your code does not understand.

The OpenAI Python SDK supports client.responses.parse(...) with a Pydantic model supplied as text_format, and the parsed object is returned on response.output_parsed. The same SDK also supports client.chat.completions.parse(...), where the parsed object lives on message.parsed. If you want direct structured data extraction with minimal glue, those helpers are the cleanest starting point.

Parse, normalize, then validate

Structured Outputs and model_validate_json remove a lot of parsing pain when the stack is aligned end to end. The moment you support a provider that returns plain chat text, a model that wraps JSON in fences, or a logging path that stores the raw completion string, you want one choke point that turns text into a dict before Pydantic runs.

import json

def parse_json_from_llm_text(text: str) -> dict:
    cleaned = text.strip()
    if cleaned.startswith("```"):
        cleaned = cleaned.split("\n", 1)[1]
        cleaned = cleaned.rsplit("```", 1)[0].strip()

    # Common "Sure, here is the JSON:" prefix before the object.
    if not cleaned.startswith("{") and "{" in cleaned and "}" in cleaned:
        start = cleaned.find("{")
        end = cleaned.rfind("}")
        if end > start:
            cleaned = cleaned[start : end + 1]

    return json.loads(cleaned)

ticket_dict = parse_json_from_llm_text(raw_completion_text)
ticket = TicketClassification.model_validate(ticket_dict)

That helper is intentionally boring. It handles fenced ```json ... ``` blocks and a leading natural-language preamble when the payload is still a single top-level object. It is not a full JSON extractor. If the model nests braces inside string values, naive slicing can break, and the right fix is usually stricter prompting, schema-bound completions, or a dedicated parser library.

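If you need something a little sturdier than slicing without reaching for a library, the standard library can do the brace matching for you. The sketch below uses json.JSONDecoder.raw_decode, which parses one JSON value starting at a given index and ignores whatever follows. extract_first_json_object is a hypothetical helper name and the sample string is invented.

```python
import json
from typing import Any


def extract_first_json_object(text: str) -> dict[str, Any]:
    """Return the first JSON object embedded in text, ignoring surrounding prose."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one value at `start` and tolerates trailing text.
            value, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, dict):
            return value
        # Not a valid object here (for example "{draft}"); try the next brace.
        start = text.find("{", start + 1)
    raise ValueError("no JSON object found in model output")


raw = 'Note {draft}: {"summary": "User typed } in the form", "priority": "high"} Hope that helps!'
parsed = extract_first_json_object(raw)
```

Because a real JSON parser decides where the object ends, braces inside string values and stray braces in the surrounding chatter stop being a problem. The result is still an untrusted dict and still goes through Pydantic afterwards.
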
Streaming completions

If you stream chat tokens, do not run json.loads or model_validate_json on every delta. Buffer until the API reports a finished message (check your client for the stream termination or finish_reason), concatenate the text, then parse once. The same rule applies when tool-call arguments arrive in chunks. You only validate after the arguments string is complete.

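chunks: list[str] = []
for chunk in completion_stream:
    # Some chunks (for example a trailing usage chunk) carry no choices.
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content or ""
    chunks.append(delta)
# Parse exactly once, after the stream has finished.
raw_completion_text = "".join(chunks)
ticket = TicketClassification.model_validate_json(raw_completion_text)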

You can still pass raw_completion_text through parse_json_from_llm_text first when you expect fences or chatter around the JSON.

Once you own plain-string parsing, the next constraint is often not Python but the provider's JSON Schema dialect and what the remote API actually accepts.

Provider schema limits (before you get clever in Python)

Do not blindly dump any schema generator output into an API and assume every JSON Schema feature is supported. OpenAI supports a subset of JSON Schema, requires all fields to be required for Structured Outputs, requires the root to be an object rather than a top-level anyOf, and documents limits on nesting depth and total property count. Keep the provider-facing schema simple. That is not a compromise. That is good engineering.
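
A cheap way to catch those problems before the API does is a pre-flight check on the schema dict. The sketch below covers only three rules: object root, every property listed in required, and closed objects via additionalProperties set to false. It does not follow $ref, $defs, arrays, or anyOf branches, and it does not check depth or property-count limits. strict_schema_problems is a hypothetical helper, not an OpenAI or Pydantic function.

```python
from typing import Any


def strict_schema_problems(schema: dict[str, Any], path: str = "$") -> list[str]:
    """Collect violations of three rules: object root, all fields required, closed objects."""
    problems: list[str] = []
    if path == "$" and schema.get("type") != "object":
        problems.append("$: root must be an object")
    if schema.get("type") == "object":
        properties = schema.get("properties", {})
        missing = sorted(set(properties) - set(schema.get("required", [])))
        if missing:
            problems.append(f"{path}: not listed in required: {missing}")
        if schema.get("additionalProperties") is not False:
            problems.append(f"{path}: additionalProperties must be false")
        # Recurse into nested objects declared directly under properties.
        for name, child in properties.items():
            problems.extend(strict_schema_problems(child, f"{path}.{name}"))
    return problems


schema = {
    "type": "object",
    "properties": {
        "category": {"type": "string", "enum": ["billing", "bug"]},
        "meta": {"type": "object", "properties": {"lang": {"type": "string"}}},
    },
    "required": ["category"],
    "additionalProperties": False,
}
problems = strict_schema_problems(schema)
```

Run it in CI against whatever model_json_schema() produces and you find out about an optional field at review time rather than from a 400 response.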

If you need a provider-agnostic validation path, or you want to validate stored fixtures and mocks, Pydantic plus jsonschema is still a great combination.

from jsonschema import validate as validate_json

schema = TicketClassification.model_json_schema()

payload = {
    "category": "bug",
    "priority": "high",
    "needs_human": True,
    "summary": "Checkout duplicates charges after refresh.",
}

validate_json(instance=payload, schema=schema)
ticket = TicketClassification.model_validate(payload)
print(ticket)

That pattern is especially handy in tests, contract fixtures, and integrations where the model provider does not offer native structured output enforcement. Just remember that a locally generated schema may be broader than a given provider's supported subset, so "valid locally" does not automatically mean "accepted by every LLM API." Also note that some providers preprocess and cache schema artifacts, so the first request for a new schema can be slower than warm requests.

Tool calls are a second contract

Function or tool calling is the other major structured-output shape. The model chooses a name and passes arguments that should match a JSON Schema you control. OpenAI recommends strict: true on tool definitions so arguments stay aligned with that schema. In agent-heavy stacks, bad sampling turns into invalid tool JSON fast; keep sampler settings aligned with multi-step work using the agentic inference parameters reference for Qwen and Gemma.

The snippets below assume you have already mapped the provider's tool-call object into a name string and an arguments dict, for example by reading tool_calls[].function on chat completions and running json.loads on the arguments string first. dispatch_tool is the step after that normalization.

Two practical rules help in Python. First, validate the tool name against an explicit allowlist before you route execution. Second, validate the arguments dict with the same Pydantic model you use in tests, not with ad hoc key access. The failure mode you are avoiding is "valid JSON arguments, wrong shape for the tool that fired," which slips past string checks.

from typing import Any, Callable

from pydantic import BaseModel

ToolHandler = Callable[[dict[str, Any]], str]

def dispatch_tool(
    *,
    name: str,
    arguments: dict[str, Any],
    handlers: dict[str, tuple[type[BaseModel], ToolHandler]],
) -> str:
    if name not in handlers:
        raise ValueError(f"unsupported tool {name}")
    model_cls, handler = handlers[name]
    validated = model_cls.model_validate(arguments)
    return handler(validated.model_dump())

handlers: dict[str, tuple[type[BaseModel], ToolHandler]] = {
    "classify_ticket": (
        TicketClassification,
        lambda data: f"queued as {data['category']}",
    ),
}

That pattern keeps routing and validation in one place. Your real handlers will be richer, but the split should stay the same: allowed names, typed arguments, then side effects.

Schema validation still needs business rules

A valid object is not the same thing as a correct object. OpenAI says this directly. Structured Outputs does not prevent mistakes inside the values of the JSON object. That is why the FAQ "why do schema validation and business-rule validation both matter" has a blunt answer. Because a response can match the schema perfectly and still be wrong in a way that hurts the business.

Here is a realistic example. The structure can be valid, but the pricing logic can still be nonsense.

from decimal import Decimal
from typing import Literal
from typing_extensions import Self

from pydantic import BaseModel, ConfigDict, Field, model_validator

class Offer(BaseModel):
    model_config = ConfigDict(extra="forbid")

    currency: Literal["USD", "EUR", "GBP"]
    amount: Decimal = Field(gt=0)
    original_amount: Decimal | None
    discounted: bool

    @model_validator(mode="after")
    def check_discount_logic(self) -> Self:
        if self.discounted:
            if self.original_amount is None:
                raise ValueError(
                    "original_amount is required when discounted is true"
                )
            if self.original_amount <= self.amount:
                raise ValueError(
                    "original_amount must be greater than amount"
                )
        return self

That validator does something schemas alone often do poorly in real systems. It checks cross-field semantics after the whole model has been parsed. Pydantic's model_validator exists exactly for this kind of whole-object validation. Notice the Decimal | None field without a default. That keeps the field present while still allowing null, which matches OpenAI's documented pattern for optional-like values under strict Structured Outputs.

If you want validation failures to feed back into the model automatically, Instructor is a practical layer on top of Pydantic. Its docs describe a retry loop where validation errors are captured, formatted as feedback, and used to ask the model to try again.

import instructor

retrying_client = instructor.from_provider("openai/gpt-4o", max_retries=2)

offer = retrying_client.create(
    response_model=Offer,
    messages=[
        {
            "role": "user",
            "content": (
                "Extract the offer from this text. "
                "Was 49.00 USD, now 19.00 USD."
            ),
        }
    ],
)

This is one of the few conveniences I will happily recommend. Automatic retries tied to real validation errors are useful. Silent coercion is not. Instructor's model layer, retry docs, and validation docs all lean into that same idea, and they are right to do so.

You can implement the same idea without a framework. The loop is small. Ask the model, validate with Pydantic, and if validation fails, send the error details back in a follow-up user message and ask for corrected JSON only. Cap attempts, log the final failure, and surface a controlled error to callers. When you already rely on responses.parse or other schema-bound helpers, you may rarely exercise this path. It still matters for JSON mode, older chat endpoints, or any gateway that hands you a raw string.

from openai import OpenAI
from pydantic import ValidationError

client = OpenAI()

messages = [
    {"role": "system", "content": "Return only JSON that matches the ticket schema."},
    {"role": "user", "content": "Customer reports duplicate charges after refreshing checkout."},
]

ticket: TicketClassification | None = None
for attempt in range(2):
    completion = client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        messages=messages,
        response_format={"type": "json_object"},
    )
    raw_text = completion.choices[0].message.content or ""
    try:
        ticket = TicketClassification.model_validate_json(raw_text)
        break
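    except ValidationError as exc:
        # Keep the rejected answer in context so the model can see what it is correcting.
        if raw_text:
            messages.append({"role": "assistant", "content": raw_text})
        # Feed the real validator output back instead of a generic "try again".
        messages.append(
            {
                "role": "user",
                "content": f"Validation failed with {exc.errors()}. Return corrected JSON only.",
            }
        )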
else:
    raise RuntimeError("exhausted structured output retries")

assert ticket is not None

In real services you would attach tracing IDs, redact customer text in logs, and distinguish recoverable validation errors from refusals or incomplete responses. The important part is that the retry is driven by real validator output, not by a generic "try again" message.
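To make that loop testable without a network call, one option is to inject the model call as a plain function. The sketch below is an illustration under stated assumptions, not the article's production code: Ticket, parse_ticket, call_with_repair, and the allowed category values are hypothetical names, and a small hand-written validator stands in for Pydantic so the block runs on the standard library alone.

```python
import json
from dataclasses import dataclass
from typing import Callable


class StructuredOutputError(RuntimeError):
    """Raised when every attempt fails validation (fail closed)."""


@dataclass
class Ticket:
    category: str
    priority: str


ALLOWED_CATEGORIES = {"bug", "billing", "question"}  # hypothetical enum values


def parse_ticket(raw_text: str) -> Ticket:
    """Stand-in validator: raises ValueError with a specific, reusable message."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise ValueError("top-level value must be an object")
    if set(data) != {"category", "priority"}:
        raise ValueError(f"expected keys category and priority, got {sorted(data)}")
    if data["category"] not in ALLOWED_CATEGORIES:
        raise ValueError(f"category {data['category']!r} is not allowed")
    return Ticket(category=data["category"], priority=data["priority"])


def call_with_repair(
    ask_model: Callable[[list], str],
    messages: list,
    max_attempts: int = 2,
) -> Ticket:
    """Ask, validate, and retry with the validator's own error text."""
    history = list(messages)  # copy so the caller's list is not mutated
    last_error = ""
    for _ in range(max_attempts):
        raw_text = ask_model(history)
        try:
            return parse_ticket(raw_text)
        except ValueError as exc:
            last_error = str(exc)
            # Keep the rejected answer in context, then send the real error back.
            history.append({"role": "assistant", "content": raw_text})
            history.append(
                {
                    "role": "user",
                    "content": f"Validation failed: {last_error}. Return corrected JSON only.",
                }
            )
    # Bounded attempts are exhausted: surface a controlled error, never a guess.
    raise StructuredOutputError(f"gave up after {max_attempts} attempts: {last_error}")
```

In a test, ask_model can be a fake that returns one bad payload and then a good one, which lets you assert both the repair path and the fail-closed path. In a service, it would wrap whichever client call returns the raw string.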

Test, retry, and fail closed

What should happen when LLM validation fails? Not a shrug. Reject the payload, log the failure, retry with bounded attempts if the task is worth retrying, and fail closed instead of normalizing garbage into something that only looks acceptable. This is also where many teams forget to handle refusals and incomplete outputs explicitly, even though the provider docs tell them those paths exist.

For OpenAI's Responses API, failure handling should be first-class code, not an afterthought. The variable below is response, as returned by client.responses.create or client.responses.parse, not the completion object from the Chat Completions examples elsewhere in this article.

if response.status == "incomplete":
    raise RuntimeError(response.incomplete_details.reason)

content = response.output[0].content[0]

if content.type == "refusal":
    raise RuntimeError(content.refusal)

That is not defensive over-engineering. It is directly aligned with the documented failure modes. If the model refuses, you are not holding a schema-valid payload. If the response is incomplete, you are not holding a schema-valid payload. Treat both as explicit branches in your control flow.
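One way to keep those branches explicit is to give each failure its own exception type, so callers can count and handle them separately. This is a minimal sketch, assuming the response object exposes status, incomplete_details.reason, and output items whose content parts are typed output_text or refusal. The exception names and extract_output_text are hypothetical, and the stub built with SimpleNamespace only imitates that shape for illustration.

```python
from types import SimpleNamespace


class StructuredOutputFailure(RuntimeError):
    """Base class so callers can catch every fail-closed path at once."""


class IncompleteResponse(StructuredOutputFailure):
    """The response was cut off; any JSON in it cannot be trusted."""


class ModelRefusal(StructuredOutputFailure):
    """The model declined; there is no payload to validate."""


class MissingOutput(StructuredOutputFailure):
    """No text part was found in the response."""


def extract_output_text(response) -> str:
    """Return the text to validate, or raise a specific failure."""
    if response.status == "incomplete":
        details = getattr(response, "incomplete_details", None)
        raise IncompleteResponse(getattr(details, "reason", None) or "unknown reason")
    for item in response.output:
        # Skip items that are not assistant messages (for example reasoning items).
        if getattr(item, "type", None) != "message":
            continue
        for part in item.content:
            if part.type == "refusal":
                raise ModelRefusal(part.refusal)
            if part.type == "output_text":
                return part.text
    raise MissingOutput("response contained no output_text part")


# Stub shaped like a refused response, for illustration only.
refused = SimpleNamespace(
    status="completed",
    incomplete_details=None,
    output=[
        SimpleNamespace(
            type="message",
            content=[SimpleNamespace(type="refusal", refusal="I can't help with that.")],
        )
    ],
)

try:
    extract_output_text(refused)
except ModelRefusal as exc:
    outcome = f"refusal: {exc}"
```

Scanning the output items rather than indexing the first one is a design choice in this sketch, so a leading non-message item does not hide a refusal.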

You should also test the contract outside the model call itself.

import pytest
from jsonschema import validate as validate_json
from pydantic import ValidationError

def test_ticket_fixture_matches_schema():
    payload = {
        "category": "bug",
        "priority": "high",
        "needs_human": True,
        "summary": "Checkout duplicates charges after refresh.",
    }
    validate_json(instance=payload, schema=TicketClassification.model_json_schema())

def test_discount_logic_rejects_broken_offer():
    with pytest.raises(ValidationError):
        Offer.model_validate(
            {
                "currency": "USD",
                "amount": "19.00",
                "original_amount": "10.00",
                "discounted": True,
            }
        )

def test_ticket_rejects_unknown_category_string():
    with pytest.raises(ValidationError):
        TicketClassification.model_validate(
            {
                "category": "refund",
                "priority": "high",
                "needs_human": True,
                "summary": "Customer wants a refund.",
            }
        )

def test_ticket_rejects_extra_keys():
    with pytest.raises(ValidationError):
        TicketClassification.model_validate(
            {
                "category": "bug",
                "priority": "high",
                "needs_human": True,
                "summary": "Broken flow.",
                "severity": "critical",
            }
        )

This is the right shape of test strategy for LLM output validation in Python. Validate golden fixtures with jsonschema so every field in the contract is exercised. Validate semantics with Pydantic, then add adversarial cases such as illegal enum strings, forbidden extra keys, and cross-field contradictions you care about. If you snapshot real model outputs, scrub PII and treat them as regression fixtures.
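For the snapshot step, a small scrubbing pass before anything is written to disk is a reasonable starting point. The sketch below is deliberately naive and rests on assumptions: the two regular expressions only catch email addresses and long digit runs, and scrub and to_fixture are hypothetical helper names. Real PII handling needs rules that match your own data.

```python
import json
import re

# Naive patterns for illustration; extend them for names, addresses, and so on.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
LONG_NUMBER = re.compile(r"\d{6,}")


def scrub(value):
    """Recursively replace obvious identifiers in strings, leaving structure intact."""
    if isinstance(value, str):
        return LONG_NUMBER.sub("<number>", EMAIL.sub("<email>", value))
    if isinstance(value, dict):
        return {key: scrub(item) for key, item in value.items()}
    if isinstance(value, list):
        return [scrub(item) for item in value]
    return value  # numbers, booleans, and null pass through unchanged


def to_fixture(raw_model_output: str) -> str:
    """Turn a raw JSON model output into stable, scrubbed fixture text."""
    return json.dumps(scrub(json.loads(raw_model_output)), indent=2, sort_keys=True)
```

Sorting keys keeps fixture diffs small when a later model version returns the same fields in a different order.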

If your team lives in the OpenAI stack, the Evals API also includes structured-output evaluation recipes specifically for testing and iterating on tasks that depend on machine-readable formats. And if you keep raw schema files in the repo, wire check-jsonschema into CI or pre-commit. Ship contracts, not vibes.

Production checks that save you later

When validation fails, the answer is the same blunt one as before. Reject the payload, log why, retry with targeted feedback when the task is worth another attempt, and fail closed instead of coercing bad data into a queue.

A short operations checklist helps teams avoid repeat incidents.

Log schema version or a hash of the JSON Schema you sent to the provider so you can replay failures accurately.
Redact model inputs and outputs in logs. Structured logs are useless if they leak customer text.
Emit counters or metrics for refusal rate, incomplete response rate, validation failure rate, and repair success rate. Spikes there beat guessing when a model or prompt change shipped.
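The first and third items can be sketched with the standard library alone. This is an illustration under stated assumptions: schema_fingerprint, record_outcome, and the outcome names are hypothetical, and the in-memory Counter is a stand-in for whatever metrics client you already run.

```python
import hashlib
import json
from collections import Counter

# Outcome names are assumptions chosen to mirror the checklist above.
OUTCOMES = {"ok", "refusal", "incomplete", "validation_failure", "repair_success"}

metrics = Counter()  # stand-in for a real metrics client


def schema_fingerprint(schema: dict) -> str:
    """Short, stable hash of a JSON Schema, independent of key order."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def record_outcome(outcome: str, schema: dict) -> dict:
    """Bump a counter and return a log record that carries no model text."""
    if outcome not in OUTCOMES:
        raise ValueError(f"unknown outcome: {outcome}")
    metrics[outcome] += 1
    return {
        "event": "structured_output",
        "outcome": outcome,
        "schema_hash": schema_fingerprint(schema),
    }
```

Because the log record holds only the outcome and the schema hash, it stays safe to ship to shared logging while still letting you replay a failure against the exact contract that was in force.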

Broader guidance on observability for LLM systems helps wire those signals into dashboards, traces, and SLO reviews once the counters exist.

The best practice is not complicated. Use provider-side Structured Outputs or strict tool schemas when you can. Normalize raw text when you must. Mirror the contract in Python with Pydantic. Add business-rule validation for what the schema cannot prove. Handle refusals and incomplete responses as normal branches. Test the contract until it stops being a demo and starts being software. Anything less is just prompt engineering cosplay.