<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rost</title>
    <description>The latest articles on DEV Community by Rost (@rosgluk).</description>
    <link>https://dev.to/rosgluk</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3544400%2F04dd81bf-749e-4055-971f-316c0134e76c.jpg</url>
      <title>DEV Community: Rost</title>
      <link>https://dev.to/rosgluk</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rosgluk"/>
    <language>en</language>
    <item>
      <title>Idempotency in Distributed Systems That Actually Works</title>
      <dc:creator>Rost</dc:creator>
      <pubDate>Mon, 11 May 2026 11:37:09 +0000</pubDate>
      <link>https://dev.to/rosgluk/idempotency-in-distributed-systems-that-actually-works-5dl6</link>
      <guid>https://dev.to/rosgluk/idempotency-in-distributed-systems-that-actually-works-5dl6</guid>
      <description>&lt;p&gt;Idempotency in distributed systems is the property that saves you after the network lies, the queue retries, the client panics, and the operator hits replay. In production systems, duplicate delivery is normal. Duplicate side effects are the bug.&lt;/p&gt;

&lt;p&gt;HTTP defines an idempotent method as one where multiple identical requests have the same intended effect on the server as one request. That is why PUT, DELETE, and safe methods are idempotent in protocol semantics and can be retried automatically after a communication failure.&lt;/p&gt;

&lt;p&gt;That definition is useful, but it is not enough. In real architectures, idempotency is not an HTTP trivia answer. It is a business guarantee. If a customer hits "pay" once, you do not get to charge twice because a timeout happened between commit and response. If a worker updates inventory and crashes before acking the message, you do not get to decrement stock twice because the broker redelivered. That is the bar.&lt;/p&gt;

&lt;p&gt;The mistake I see over and over is treating idempotency as a transport feature instead of a system property. Queue deduplication, HTTP verbs, and client retries help, but none of them rescue a design that lets the same business intent create a second side effect. If you want the broader framing for how these integration decisions fit service boundaries and persistence trade-offs, start with &lt;a href="https://www.glukhov.org/app-architecture/" rel="noopener noreferrer"&gt;App Architecture in Production: Integration Patterns, Code Design, and Data Access&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where duplicates come from in production
&lt;/h2&gt;

&lt;p&gt;Duplicates do not appear because teams are careless. They appear because distributed systems retry, reorder, and replay.&lt;/p&gt;

&lt;p&gt;A client can send a create request, the server can commit it, and the response can still disappear on the wire. That is exactly why HTTP distinguishes idempotent methods and why payment APIs such as Stripe and PayPal expose explicit idempotency mechanisms for unsafe methods like POST.&lt;/p&gt;

&lt;p&gt;Message brokers make the problem even more obvious. At-least-once delivery means a consumer can be invoked repeatedly for the same message, and a handler can update the database successfully but fail before acknowledgment, causing the broker to deliver the same message again.&lt;/p&gt;

&lt;p&gt;Webhooks are no different. GitHub says webhook deliveries can arrive out of order, failed deliveries are not automatically redelivered, and each delivery carries a unique &lt;code&gt;X-GitHub-Delivery&lt;/code&gt; GUID that you should use when protecting against replay. For a practical architecture view of chat endpoints as interaction boundaries, see &lt;a href="https://www.glukhov.org/app-architecture/integration-patterns/chat-platforms-as-system-interfaces/" rel="noopener noreferrer"&gt;Chat Platforms as System Interfaces in Modern Systems&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Even systems that advertise stronger guarantees still leave you work to do. Kafka can prevent duplicate entries in Kafka logs with idempotent producers and can provide exactly-once delivery for read-process-write flows that stay inside Kafka with transactions and &lt;code&gt;read_committed&lt;/code&gt; consumers. But Kafka's own design docs are clear that external systems still require coordination with offsets and outputs. Google Cloud Pub/Sub exactly-once delivery is limited to pull subscriptions, within a cloud region, and still requires clients to track processing progress until acknowledgment succeeds.&lt;/p&gt;

&lt;p&gt;My opinionated summary is simple. Assume the transport will retry. Assume operators will replay. Assume webhooks will arrive late. Design the write path so a repeated intent cannot create a second business effect.&lt;/p&gt;

&lt;h2&gt;
  
  
  The API contract I actually trust
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How do idempotency keys prevent duplicate API requests
&lt;/h3&gt;

&lt;p&gt;The only API contract I trust for mutating operations is caller-supplied intent plus server-side persistence.&lt;/p&gt;

&lt;p&gt;AWS recommends a caller-provided request identifier and warns that the service must atomically record the idempotency token together with the mutating work. Stripe stores the first status code and response body for a key, compares later parameters with the original request, and returns the same result for retries. PayPal uses &lt;code&gt;PayPal-Request-Id&lt;/code&gt; on supported POST APIs and returns the latest status for the previous request with that same header.&lt;/p&gt;

&lt;p&gt;That leads to a practical contract:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The client generates an idempotency key for a business operation.&lt;/li&gt;
&lt;li&gt;The server scopes that key by tenant and operation name.&lt;/li&gt;
&lt;li&gt;The server stores a request hash so the same key cannot be reused for a different payload.&lt;/li&gt;
&lt;li&gt;The server records state such as &lt;code&gt;pending&lt;/code&gt;, &lt;code&gt;completed&lt;/code&gt;, or &lt;code&gt;failed&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Retries with the same key either return the stored outcome or a stable pointer to it.&lt;/li&gt;
&lt;li&gt;Retries with the same key and a different payload fail loudly.&lt;/li&gt;
&lt;/ol&gt;
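
&lt;p&gt;To make that contract concrete, here is a sketch of the wire-level behavior. The endpoint, payloads, and key value are hypothetical; the header name follows the &lt;code&gt;Idempotency-Key&lt;/code&gt; convention discussed below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# first attempt (hypothetical endpoint and payload)
POST /v1/orders
Idempotency-Key: 54d2c1e0-0a77-4f2e-9b1c-example
{"customer_id": "c-42", "amount_cents": 1999}
201 Created          result stored under the key

# network retry: same key, same payload
POST /v1/orders
Idempotency-Key: 54d2c1e0-0a77-4f2e-9b1c-example
{"customer_id": "c-42", "amount_cents": 1999}
201 Created          stored response replayed, no second order

# same key, different payload
POST /v1/orders
Idempotency-Key: 54d2c1e0-0a77-4f2e-9b1c-example
{"customer_id": "c-42", "amount_cents": 2999}
409 Conflict         request hash mismatch, fail loudly
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;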

&lt;p&gt;There is an IETF &lt;code&gt;Idempotency-Key&lt;/code&gt; header draft, but as of 2026-05-09 it is still listed in the IETF Datatracker as an expired Internet-Draft rather than a published RFC. In practice, the header name is still widely useful as a de facto convention, but you should document the contract in your own API instead of pretending the standard is finished.&lt;/p&gt;

&lt;p&gt;What should the key represent? Intent. Not an HTTP attempt. Not a TCP connection. Not a retry counter. If the user means "create order 123 once", every retry for that same command must reuse the same key. If the user means "place a second order", that must use a different key.&lt;/p&gt;

&lt;p&gt;A request ID is for tracing. An idempotency key is for correctness. If you mix those up, your dashboards look tidy while your money moves twice.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why PUT is not enough
&lt;/h3&gt;

&lt;p&gt;No, HTTP PUT is not enough to make an operation idempotent.&lt;/p&gt;

&lt;p&gt;Yes, RFC 9110 gives PUT idempotent semantics. But if your PUT handler emits a new downstream event, sends an email on every retry, or charges an external provider again, then your implementation has violated the business contract even if your route name looks respectable.&lt;/p&gt;

&lt;p&gt;Verb choice helps clients understand intent. It does not implement intent for you.&lt;/p&gt;

&lt;p&gt;Use PUT when the resource model genuinely fits a full replacement or upsert style operation. Use POST when you are creating commands or actions. But for any mutation that might be retried across network boundaries, document an explicit idempotency contract. If your mutating actions are triggered from chat workflows, the same contract applies in &lt;a href="https://www.glukhov.org/app-architecture/integration-patterns/slack/" rel="noopener noreferrer"&gt;Slack Integration Patterns for Alerts and Workflows&lt;/a&gt; and &lt;a href="https://www.glukhov.org/app-architecture/integration-patterns/discord/" rel="noopener noreferrer"&gt;Discord Integration Pattern for Alerts and Control Loops&lt;/a&gt;. Hidden side effects are where architecture goes to die.&lt;/p&gt;

&lt;h3&gt;
  
  
  How long should an idempotency key be stored
&lt;/h3&gt;

&lt;p&gt;Longer than your transport team wants.&lt;/p&gt;

&lt;p&gt;Stripe says keys can be pruned after at least 24 hours. PayPal says retention is API specific and gives examples that can last up to 45 days. Amazon SQS FIFO deduplicates only within a 5-minute window. GitHub keeps recent deliveries for 3 days for manual redelivery. Those numbers are wildly different because the right retention period is a business decision, not a protocol default.&lt;/p&gt;

&lt;p&gt;If you only keep keys for five minutes because your queue does, you are not designing idempotency. You are copying a transport limitation into your business layer.&lt;/p&gt;

&lt;p&gt;Keep idempotency records for at least the maximum of these windows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;client retry horizon&lt;/li&gt;
&lt;li&gt;queue redrive horizon&lt;/li&gt;
&lt;li&gt;webhook replay horizon&lt;/li&gt;
&lt;li&gt;operator replay horizon&lt;/li&gt;
&lt;li&gt;settlement or compensation horizon for money-moving operations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For payments, bookings, and provisioning, that often means hours or days, not minutes.&lt;/p&gt;
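
&lt;p&gt;One way to enforce that retention is a periodic prune keyed on the &lt;code&gt;expires_at&lt;/code&gt; column of the idempotency table shown in the next section. A minimal sketch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- run on a schedule; expires_at encodes the business replay horizon,
-- not the queue's dedup window
delete from api_idempotency
where expires_at &amp;lt; now();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;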

&lt;p&gt;AWS also calls out two anti-patterns I fully agree with. Do not use timestamps as the key, because clock skew and collisions make them unreliable. Do not blindly store entire request payloads as the dedup record for every request, because that harms performance and scalability. Store a normalized request hash plus the minimum response state you need to replay safely. If you must reproduce the first response byte for byte, store the canonical response body the way Stripe does.&lt;/p&gt;
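
&lt;p&gt;As a hedged illustration of the request-hash approach in PostgreSQL, assuming the &lt;code&gt;pgcrypto&lt;/code&gt; extension and a hypothetical payload. The canonicalization step, a stable field order before hashing, is the part you must define yourself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;create extension if not exists pgcrypto;

-- hash a canonicalized payload; the JSON here is illustrative
select encode(
    digest(convert_to('{"amount_cents":1999,"currency":"USD"}', 'UTF8'), 'sha256'),
    'hex'
) as request_hash;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;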

&lt;h2&gt;
  
  
  The database patterns that make idempotency real
&lt;/h2&gt;

&lt;p&gt;Idempotency becomes real when the persistence layer can win a race exactly once.&lt;/p&gt;

&lt;p&gt;PostgreSQL gives you two critical primitives here. Unique constraints enforce uniqueness on one or more columns, and &lt;code&gt;INSERT ... ON CONFLICT&lt;/code&gt; lets you define an alternative action instead of failing on a uniqueness violation. PostgreSQL also documents that &lt;code&gt;ON CONFLICT DO UPDATE&lt;/code&gt; guarantees an atomic insert-or-update outcome under concurrency.&lt;/p&gt;

&lt;p&gt;That means your idempotency layer should usually start with a table like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;api_idempotency&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;tenant_id&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;operation&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;idempotency_key&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;request_hash&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;state&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;response_body&lt;/span&gt; &lt;span class="n"&gt;jsonb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;resource_type&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;resource_id&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="n"&gt;timestamptz&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;expires_at&lt;/span&gt; &lt;span class="n"&gt;timestamptz&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tenant_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;operation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;idempotency_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the handling flow should look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;begin transaction

try insert (tenant_id, operation, idempotency_key, request_hash, state='pending')
on conflict do nothing

load row for (tenant_id, operation, idempotency_key) for update

if row.request_hash != incoming_request_hash
    fail with conflict or validation error

if row.state = 'completed'
    return stored response

if row.state = 'pending' and row was created by another live request
    either wait briefly, or fail fast with a retryable response

perform local business mutation

store stable result in idempotency row
set state = 'completed'

commit
return result
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
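
&lt;p&gt;Here is one possible SQL shape for the claim step, assuming the &lt;code&gt;api_idempotency&lt;/code&gt; table above. Parameter placeholders and the retention interval are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- claim the key, or silently lose to an earlier request
insert into api_idempotency
    (tenant_id, operation, idempotency_key, request_hash, state, expires_at)
values
    ($1, $2, $3, $4, 'pending', now() + interval '7 days')
on conflict (tenant_id, operation, idempotency_key) do nothing;

-- always re-read under lock; this serializes racing retries
select state, request_hash, status_code, response_body
from api_idempotency
where tenant_id = $1 and operation = $2 and idempotency_key = $3
for update;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;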



&lt;p&gt;The important part is not the syntax. The important part is the atomicity. Recording the key and performing the mutation must succeed or fail together. AWS says this explicitly for API idempotency, and the same rule applies in SQL-backed services.&lt;/p&gt;

&lt;p&gt;Do not do a naive check-then-act sequence like "select key; if missing then insert order". Under concurrency, two requests can pass the check and both create the side effect. A unique constraint is not optional. It is the mechanism that turns your architecture from optimistic folklore into something you can prove under load.&lt;/p&gt;

&lt;p&gt;Here is the rule I use in reviews. If the dedup decision is not protected by the same transactional boundary as the mutation, you do not have idempotency. You have hope.&lt;/p&gt;

&lt;h2&gt;
  
  
  Messages, events, and webhooks need their own boundary
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How do consumers handle duplicate events and messages
&lt;/h3&gt;

&lt;p&gt;For message consumers, the classic pattern is still the right one. Record processed message IDs in the same database transaction as the business update. Chris Richardson describes the &lt;code&gt;PROCESSED_MESSAGES&lt;/code&gt; table approach directly, using a primary key on subscriber and message ID so duplicates fail cleanly and can be ignored.&lt;/p&gt;

&lt;p&gt;Many teams call that explicit &lt;code&gt;processed_messages&lt;/code&gt; store an inbox table. The label matters less than the rule. The receiver must persist proof that it already handled the message before a retry can safely do nothing.&lt;/p&gt;

&lt;p&gt;A minimal form looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;processed_messages&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;subscriber_id&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;message_id&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;processed_at&lt;/span&gt; &lt;span class="n"&gt;timestamptz&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subscriber_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the consumer flow is just as strict as the HTTP flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;begin transaction

insert into processed_messages (subscriber_id, message_id)
values (?, ?)
on conflict do nothing

if no row inserted
    rollback
    ack and ignore duplicate

apply business mutation

commit
ack message
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
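
&lt;p&gt;In PostgreSQL, the "no row inserted" check can be expressed directly with &lt;code&gt;RETURNING&lt;/code&gt;. A sketch, not the only valid shape:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- zero rows back means the message was already processed
insert into processed_messages (subscriber_id, message_id)
values ($1, $2)
on conflict (subscriber_id, message_id) do nothing
returning message_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;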



&lt;p&gt;That pattern is boring. Good. Idempotency should be boring.&lt;/p&gt;

&lt;p&gt;It is also usually better than trying to lean on broker marketing terms. Kafka's exactly-once support is excellent when you stay inside Kafka's own transactional model, but Kafka's docs still warn that external destinations need cooperation. SQS FIFO reduces duplicate sends only within its 5-minute dedup window. Pub/Sub exactly-once still expects the subscriber to track progress and avoid duplicate work when acknowledgments fail.&lt;/p&gt;

&lt;p&gt;Exactly-once is usually a local optimization. Idempotent side effects are the system guarantee.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pair dedup with the outbox pattern
&lt;/h3&gt;

&lt;p&gt;If your service updates local state and also publishes an event, idempotent consumption alone is not enough. You also need a safe way to get the event out after the local transaction commits.&lt;/p&gt;

&lt;p&gt;That is why the transactional outbox pattern matters. Chris Richardson describes the basic idea as writing the event to an outbox table in the same transaction as the business update, and then publishing it asynchronously. Debezium says the outbox pattern avoids inconsistencies between a service's internal state and the events consumed by other services. NServiceBus goes further and shows how outbox processing deduplicates incoming messages and avoids zombie records and ghost messages.&lt;/p&gt;
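
&lt;p&gt;A minimal outbox table sketch. The column names follow the common Debezium-style convention but are illustrative, not a required schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;create table outbox (
    id uuid primary key,
    aggregate_type text not null,
    aggregate_id text not null,
    event_type text not null,
    payload jsonb not null,
    created_at timestamptz not null default now()
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;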

&lt;p&gt;This is the architecture I recommend for services that own data and publish integration events:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Validate and persist the command under an idempotency key.&lt;/li&gt;
&lt;li&gt;Write business state and outbox event in one local transaction.&lt;/li&gt;
&lt;li&gt;Let CDC or an outbox dispatcher publish the event.&lt;/li&gt;
&lt;li&gt;Make downstream consumers idempotent too.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Outbox does not remove the need for idempotent consumers. It removes the need to pretend that a database commit and a broker publish can be one magical distributed transaction when they usually cannot.&lt;/p&gt;

&lt;h3&gt;
  
  
  Webhooks are just messages with better branding
&lt;/h3&gt;

&lt;p&gt;Treat inbound webhooks exactly like messages from an untrusted network edge.&lt;/p&gt;

&lt;p&gt;GitHub documents that deliveries can arrive out of order, recommends using &lt;code&gt;X-Hub-Signature-256&lt;/code&gt; to verify authenticity, and provides &lt;code&gt;X-GitHub-Delivery&lt;/code&gt; as the unique delivery identifier. It also notes that redeliveries reuse the same delivery ID.&lt;/p&gt;

&lt;p&gt;So the architecture is straightforward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;verify the signature first&lt;/li&gt;
&lt;li&gt;use the delivery GUID as the dedup key&lt;/li&gt;
&lt;li&gt;persist receipt before side effects&lt;/li&gt;
&lt;li&gt;make handlers order-aware rather than assuming arrival order&lt;/li&gt;
&lt;li&gt;enqueue the heavy work and return fast&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your webhook handler writes directly to business tables before it records receipt, it is not production-ready. It is just faster at making duplicate mistakes.&lt;/p&gt;
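
&lt;p&gt;A minimal receipt store for the GitHub case, using the delivery GUID as the dedup key. Same primary-key trick as the inbox table above; the table name is illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;create table webhook_receipts (
    source text not null,        -- e.g. 'github'
    delivery_id text not null,   -- the X-GitHub-Delivery GUID
    received_at timestamptz not null default now(),
    primary key (source, delivery_id)
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;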

&lt;h2&gt;
  
  
  Sagas and workflow engines still need idempotency
&lt;/h2&gt;

&lt;p&gt;Sagas and durable workflow engines do not delete the problem. They make it visible.&lt;/p&gt;

&lt;p&gt;Temporal recommends writing Activities to be idempotent because Activities can be retried after failures or timeouts. Its docs even call out the edge case where a worker completes an external side effect successfully but crashes before reporting completion, which causes the Activity to run again. Temporal also suggests using a combination of Workflow Run ID and Activity ID as a stable idempotency key when calling downstream services. If you are applying this in service orchestration, &lt;a href="https://www.glukhov.org/app-architecture/integration-patterns/go-microservices-for-ai-ml-orchestration-patterns/" rel="noopener noreferrer"&gt;Go Microservices for AI/ML Orchestration&lt;/a&gt; covers the broader workflow trade-offs.&lt;/p&gt;

&lt;p&gt;That is exactly the right mental model. A workflow engine can preserve execution history and coordinate retries. It cannot retroactively uncharge a card or unsend an email unless your application gives it idempotent steps and idempotent compensations.&lt;/p&gt;

&lt;p&gt;The same applies to sagas. Temporal's own saga guidance describes compensating actions that run when a step fails. Those compensations must be idempotent too. If "refund payment" runs twice, you may have solved the original bug by creating a new one.&lt;/p&gt;

&lt;p&gt;My rule here is brutal and simple. Every Activity, every command handler, and every compensation that touches the outside world should either be naturally idempotent or carry a real idempotency key to the downstream system.&lt;/p&gt;
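
&lt;p&gt;A minimal sketch of carrying such a key from a Temporal Activity, using the Python SDK. The payments client is hypothetical; the key derivation follows Temporal's Run ID plus Activity ID suggestion:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from temporalio import activity

# hypothetical stand-in for any payment provider SDK
class PaymentsClient:
    async def charge(self, payment: dict, idempotency_key: str) -&amp;gt; None: ...

payments_client = PaymentsClient()

@activity.defn
async def charge_card(payment: dict) -&amp;gt; None:
    info = activity.info()
    # stable across retries of this Activity within one workflow run
    idem_key = f"{info.workflow_run_id}:{info.activity_id}"
    await payments_client.charge(payment, idempotency_key=idem_key)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;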

&lt;h2&gt;
  
  
  How to test idempotency before production
&lt;/h2&gt;

&lt;p&gt;Most teams test happy paths and then act surprised when retries happen. That is not enough.&lt;/p&gt;

&lt;p&gt;You should have automated tests for at least these cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the server commits the mutation but the response never reaches the client&lt;/li&gt;
&lt;li&gt;two identical requests race with the same idempotency key&lt;/li&gt;
&lt;li&gt;the same key is reused with a different payload&lt;/li&gt;
&lt;li&gt;a consumer commits its database work and crashes before ack&lt;/li&gt;
&lt;li&gt;a webhook is replayed with the same delivery ID&lt;/li&gt;
&lt;li&gt;an outbox dispatcher publishes the same event more than once&lt;/li&gt;
&lt;li&gt;a workflow Activity completes the external call and crashes before completion is reported&lt;/li&gt;
&lt;li&gt;an idempotency record expires and a genuine late retry arrives&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AWS explicitly recommends comprehensive test suites that include successful requests, failed requests, and duplicate requests. That advice is pedestrian and absolutely correct.&lt;/p&gt;

&lt;p&gt;I would add one more failure drill. Verify that the replayed response is semantically equivalent to the first result. AWS discusses late-arriving retries and argues for responses that preserve the original meaning even after underlying state has changed. That is the difference between "no extra side effect happened" and "the caller still has a consistent contract."&lt;/p&gt;

&lt;h2&gt;
  
  
  Opinionated rules that save real systems
&lt;/h2&gt;

&lt;p&gt;Here are the rules I would enforce in an architecture review.&lt;/p&gt;

&lt;p&gt;First, idempotency keys belong to business intent, not transport attempts.&lt;/p&gt;

&lt;p&gt;Second, scope every key by tenant and operation. Global key spaces are how unrelated requests collide.&lt;/p&gt;

&lt;p&gt;Third, persist the dedup decision atomically with the mutation. If that is not true, the design is wrong.&lt;/p&gt;

&lt;p&gt;Fourth, reject same-key different-payload retries. Stripe and AWS both do this for good reason.&lt;/p&gt;

&lt;p&gt;Fifth, keep keys for the full replay horizon of the business process, not for the shortest queue window.&lt;/p&gt;

&lt;p&gt;Sixth, pair producers with an outbox and consumers with message ID tracking. One side without the other is half a design.&lt;/p&gt;

&lt;p&gt;Seventh, propagate the same operation identity downstream when the business action is the same. AWS explicitly recommends passing the idempotency token along the processing chain.&lt;/p&gt;

&lt;p&gt;Eighth, never assume exactly-once marketing removes the need for idempotent side effects.&lt;/p&gt;

&lt;p&gt;If that sounds strict, good. Idempotency is where optimistic architecture meets production reality. You do not need complexity everywhere. But wherever duplicate side effects would hurt money, state, or trust, idempotency should be a first-class part of the contract.&lt;/p&gt;

&lt;h2&gt;
  
  
  Useful Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.rfc-editor.org/rfc/rfc9110.html" rel="noopener noreferrer"&gt;RFC 9110, HTTP Semantics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.cloud.google.com/pubsub/docs/exactly-once-delivery" rel="noopener noreferrer"&gt;Google Cloud Pub/Sub, Exactly-once delivery&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.github.com/en/webhooks/testing-and-troubleshooting-webhooks/redelivering-webhooks" rel="noopener noreferrer"&gt;GitHub Docs, Redelivering webhooks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.temporal.io/evaluate/use-cases-design-patterns" rel="noopener noreferrer"&gt;Temporal Documentation, Use cases and design patterns&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://datatracker.ietf.org/doc/draft-ietf-httpapi-idempotency-key-header/" rel="noopener noreferrer"&gt;IETF Datatracker, The Idempotency-Key HTTP Header Field&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>dev</category>
      <category>microservices</category>
      <category>api</category>
    </item>
    <item>
      <title>Hermes Voice Control from Your Phone</title>
      <dc:creator>Rost</dc:creator>
      <pubDate>Sun, 10 May 2026 11:12:56 +0000</pubDate>
      <link>https://dev.to/rosgluk/hermes-voice-control-from-your-phone-3fm6</link>
      <guid>https://dev.to/rosgluk/hermes-voice-control-from-your-phone-3fm6</guid>
      <description>&lt;p&gt;You already chat to Hermes Agent from your phone with text.&lt;br&gt;
Now you want to talk to it directly and get spoken replies back.&lt;br&gt;
That is usually the right move, especially if you already use &lt;a href="https://www.glukhov.org/ai-systems/hermes/" rel="noopener noreferrer"&gt;Hermes as a persistent self-hosted assistant&lt;/a&gt;.&lt;br&gt;
Typing long prompts on a small screen is slow and error-prone.&lt;/p&gt;



&lt;p&gt;Voice mode makes Hermes practical in the moments where it matters most, while walking, commuting, or doing admin work away from your desk.&lt;/p&gt;

&lt;p&gt;The good news is that voice mode can run with zero paid APIs. A local faster-whisper model handles transcription, and Edge TTS handles spoken output for free. This guide covers setup, provider choices, platform differences, practical command patterns, and the failure modes that usually block first-time users.&lt;/p&gt;
&lt;h2&gt;
  
  
  How the Pipeline Works
&lt;/h2&gt;

&lt;p&gt;Three stages, no magic:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Transcription (STT)&lt;/strong&gt; — Your voice message becomes text.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reasoning&lt;/strong&gt; — Hermes processes that text exactly like a typed request.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Synthesis (TTS)&lt;/strong&gt; — The response text is converted back to audio.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The important distinction from consumer assistants is execution depth. Hermes is not just answering trivia. It can call tools, inspect files, run code paths, and continue multi-step work from memory. In practice, that means voice can trigger real workflows such as incident triage, draft generation, and targeted debugging. If you want the broader architecture context, the &lt;a href="https://www.glukhov.org/ai-systems/" rel="noopener noreferrer"&gt;AI Systems pillar&lt;/a&gt; explains how this voice layer fits into local agent infrastructure.&lt;/p&gt;
&lt;h2&gt;
  
  
  What Voice Control Is Great For
&lt;/h2&gt;

&lt;p&gt;Use voice mode for the stages of work where keyboard precision is not yet required:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Operational checks&lt;/strong&gt; while away from your laptop.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Idea capture&lt;/strong&gt; for drafts, outlines, and rough specs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fast triage&lt;/strong&gt; of alerts and errors before deeper desktop follow-up.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hands-busy workflows&lt;/strong&gt; where speaking is the only realistic input channel.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Voice Input: Pick an STT Provider
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;API Key&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Local faster-whisper&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;On-device, ~150 MB model, 90+ languages&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Groq Whisper&lt;/td&gt;
&lt;td&gt;Free tier&lt;/td&gt;
&lt;td&gt;&lt;code&gt;GROQ_API_KEY&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Fast cloud inference&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI Whisper&lt;/td&gt;
&lt;td&gt;Paid&lt;/td&gt;
&lt;td&gt;&lt;code&gt;VOICE_TOOLS_OPENAI_KEY&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Highest accuracy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mistral Voxtral&lt;/td&gt;
&lt;td&gt;Paid&lt;/td&gt;
&lt;td&gt;&lt;code&gt;MISTRAL_API_KEY&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Alternative cloud option&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Configuration in &lt;code&gt;~/.hermes/config.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;stt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;local&lt;/span&gt;
  &lt;span class="na"&gt;local&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;base&lt;/span&gt;  &lt;span class="c1"&gt;# tiny, base, small, medium, large-v3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start with &lt;code&gt;local&lt;/code&gt;. It works immediately, handles multilingual speech, and adds no recurring cost. Move to Groq or OpenAI only if your local setup cannot meet your latency or accuracy requirements. For command-level setup and diagnostics while testing providers, keep the &lt;a href="https://www.glukhov.org/ai-systems/hermes/hermes-agent-cli-cheatsheet/" rel="noopener noreferrer"&gt;Hermes CLI cheat sheet&lt;/a&gt; nearby.&lt;/p&gt;

&lt;h3&gt;
  
  
  Faster Whisper Model Selection
&lt;/h3&gt;

&lt;p&gt;Use a simple progression:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;tiny&lt;/strong&gt; for very low-power devices where speed matters most.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;base&lt;/strong&gt; as the default balance for laptops and small servers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;small&lt;/strong&gt; when accents, noisy environments, or domain terms reduce accuracy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;medium or large-v3&lt;/strong&gt; when quality is critical and hardware budget is higher.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your transcripts are consistently wrong, increase model size first before adding more prompt complexity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Voice Output: TTS Providers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Quality&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Edge TTS (default)&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;Quick start, 322 voices, 74 languages&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ElevenLabs&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;Paid&lt;/td&gt;
&lt;td&gt;Premium quality, voice cloning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI TTS&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;Paid&lt;/td&gt;
&lt;td&gt;Natural voices, 6 options&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MiniMax TTS&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;Paid&lt;/td&gt;
&lt;td&gt;Fine-grained speed/volume/pitch control&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NeuTTS&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;Free (local)&lt;/td&gt;
&lt;td&gt;Fully offline, voice cloning&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;tts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;edge"&lt;/span&gt;
  &lt;span class="na"&gt;speed&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1.0&lt;/span&gt;

  &lt;span class="na"&gt;edge&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;voice&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en-US-AriaNeural"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One critical detail is output format. Telegram voice bubbles are most reliable when audio is encoded as OGG with Opus. Hermes relies on ffmpeg for these conversions in common setups. If ffmpeg is missing, replies often show up as file attachments instead of inline voice bubbles.&lt;/p&gt;

&lt;p&gt;Install ffmpeg early:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;ffmpeg  &lt;span class="c"&gt;# Ubuntu/Debian&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;ffmpeg       &lt;span class="c"&gt;# macOS&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
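
&lt;p&gt;Hermes drives this conversion itself through ffmpeg, but if you want to sanity-check your ffmpeg build manually, the target format looks like this (file names are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# encode audio as OGG with Opus, the format Telegram voice bubbles expect
ffmpeg -i reply.mp3 -c:a libopus -b:a 32k reply.ogg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;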



&lt;h2&gt;
  
  
  Platform Workflows and Practical Differences
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Telegram
&lt;/h3&gt;

&lt;p&gt;Telegram is the easiest place to start. Voice messages are first-class on mobile, and the interaction loop is simple: hold, speak, release, receive.&lt;/p&gt;

&lt;p&gt;Setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Create a bot via @BotFather, get your token&lt;/span&gt;
&lt;span class="c"&gt;# 2. Add to ~/.hermes/.env:&lt;/span&gt;
&lt;span class="nv"&gt;TELEGRAM_BOT_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;***&lt;/span&gt;
&lt;span class="nv"&gt;TELEGRAM_ALLOWED_USERS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your_user_id

&lt;span class="c"&gt;# 3. Start the gateway&lt;/span&gt;
hermes gateway start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then open the Hermes chat, tap the microphone, and speak. If STT and TTS are enabled, Hermes transcribes your request, executes it, and sends a voice reply.&lt;/p&gt;

&lt;h3&gt;
  
  
  Discord
&lt;/h3&gt;

&lt;p&gt;Discord supports two useful modes. Voice messages in DMs or channels are close to Telegram behavior.&lt;/p&gt;

&lt;p&gt;The more advanced option is live voice channels. In that flow, Hermes can participate continuously, transcribing speech and responding without explicit message bubbles.&lt;/p&gt;

&lt;p&gt;Requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Message Content Intent enabled in your bot settings&lt;/li&gt;
&lt;li&gt;Server Members Intent enabled&lt;/li&gt;
&lt;li&gt;Bot permissions: Connect and Speak&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Signal
&lt;/h3&gt;

&lt;p&gt;Signal works through the &lt;code&gt;signal-cli&lt;/code&gt; daemon. Voice messages still use the same Hermes STT and TTS pipeline.&lt;/p&gt;

&lt;p&gt;A useful pattern is running &lt;code&gt;signal-cli&lt;/code&gt; as a linked device and using Signal Note to Self. You can leave yourself a voice note and get Hermes output in the same thread.&lt;/p&gt;

&lt;h3&gt;
  
  
  WhatsApp
&lt;/h3&gt;

&lt;p&gt;WhatsApp follows the same gateway model. Audio messages transcribe automatically once the connector is configured.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mobile App Permissions
&lt;/h2&gt;

&lt;p&gt;Both iOS and Android need microphone access for the messaging app you're using.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;iOS:&lt;/strong&gt; Settings → Telegram (or Discord) → Permissions → Microphone → Allow. Enable Background App Refresh for instant responses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Android:&lt;/strong&gt; Settings → Apps → Telegram → Permissions → Microphone → Allow. For Discord voice channels, enable overlay permission.&lt;/p&gt;

&lt;p&gt;Pinning the Hermes bot chat to your home screen helps — one tap to start speaking.&lt;/p&gt;

&lt;h2&gt;
  
  
  Speaking Patterns That Work Reliably
&lt;/h2&gt;

&lt;p&gt;Voice interaction has different ergonomics than typing. You cannot easily paste logs or quote long stack traces, so structure matters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Be explicit.&lt;/strong&gt; Say the action, scope, and output format in one sentence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep one objective per message.&lt;/strong&gt; Split multi-step jobs into short follow-ups.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Constrain output.&lt;/strong&gt; Ask for numbered actions or a 3-point summary when mobile readability matters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stay short.&lt;/strong&gt; Around 10 to 30 seconds per message usually transcribes better.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use iterative turns.&lt;/strong&gt; Correct and refine in the next voice message instead of overloading the first.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example Prompts You Can Speak
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;"Check deployment logs for the last one hour and report only critical errors."&lt;/li&gt;
&lt;li&gt;"Create a draft outline for a post about OpenTelemetry migration with five sections."&lt;/li&gt;
&lt;li&gt;"Summarize this bug in three bullets and propose the most likely root cause."&lt;/li&gt;
&lt;li&gt;"Review the config and tell me what to change for lower transcription latency."&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Common Use Cases with Concrete Outcomes
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Operations&lt;/strong&gt; — "Check production health and list failed services."
Outcome is a focused status update you can act on immediately.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Writing&lt;/strong&gt; — "Turn these rough points into a publishable intro paragraph."
Outcome is polished text from spoken notes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Debug triage&lt;/strong&gt; — "Investigate this TypeError and suggest the first fix to test."
Outcome is a concrete next step before opening the IDE.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Research&lt;/strong&gt; — "Find three recent sources on topic X and summarize differences."
Outcome is a compressed briefing for later deep work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automation&lt;/strong&gt; — "Run the home routine and confirm device states."
Outcome is direct action plus confirmation.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Voice messages not transcribing:&lt;/strong&gt; Confirm &lt;code&gt;stt.enabled: true&lt;/code&gt; in &lt;code&gt;config.yaml&lt;/code&gt;. Verify local dependencies are installed. Then restart with &lt;code&gt;hermes gateway restart&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TTS not responding:&lt;/strong&gt; Confirm &lt;code&gt;tts.provider&lt;/code&gt; is set. If using a paid provider, verify the API key in &lt;code&gt;.env&lt;/code&gt;. Validate current voice settings from the Hermes CLI status commands.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Poor transcription quality:&lt;/strong&gt; Increase &lt;code&gt;stt.local.model&lt;/code&gt; from &lt;code&gt;base&lt;/code&gt; to &lt;code&gt;small&lt;/code&gt; or &lt;code&gt;medium&lt;/code&gt;. Reduce noise and speak in shorter segments. If needed, switch to cloud STT for better accuracy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Voice bubbles showing as files on Telegram:&lt;/strong&gt; Install ffmpeg and restart the gateway. This is the most common issue.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Free Stack
&lt;/h2&gt;

&lt;p&gt;For cost-conscious setups, this baseline is strong:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;STT:&lt;/strong&gt; Local faster-whisper with no API key&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TTS:&lt;/strong&gt; Edge TTS with wide language coverage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total cost:&lt;/strong&gt; $0&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a meaningful advantage over many closed assistants where voice quality and automation quickly become paid-only features.&lt;/p&gt;

&lt;p&gt;If quality requirements increase, upgrade one layer at a time. Usually STT upgrades produce the biggest immediate gain, then TTS quality can be improved later if needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ Topics in Practice
&lt;/h2&gt;

&lt;p&gt;The four most common user questions are predictable. They also overlap with memory and profile design concerns covered in &lt;a href="https://www.glukhov.org/ai-systems/hermes/hermes-agent-memory-system/" rel="noopener noreferrer"&gt;Hermes Agent Memory System&lt;/a&gt; and &lt;a href="https://www.glukhov.org/ai-systems/hermes/production-setup/" rel="noopener noreferrer"&gt;Hermes production setup patterns&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Whether voice commands get the same tool access as text.&lt;/li&gt;
&lt;li&gt;Whether a free stack is viable for daily use.&lt;/li&gt;
&lt;li&gt;Why Telegram sometimes shows attachments instead of voice bubbles.&lt;/li&gt;
&lt;li&gt;Which local Whisper model should be used first.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This guide addresses each of these directly in setup, tuning, and troubleshooting sections so you can move from first run to stable daily usage quickly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Start Recap
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Install voice extras&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"hermes-agent[all]"&lt;/span&gt;

&lt;span class="c"&gt;# 2. Set up Telegram gateway&lt;/span&gt;
hermes gateway setup

&lt;span class="c"&gt;# 3. Install ffmpeg (required for Telegram voice bubbles)&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;ffmpeg

&lt;span class="c"&gt;# 4. Send a voice message from your phone&lt;/span&gt;
&lt;span class="c"&gt;# Hermes transcribes, processes, and responds&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From there, iterate based on your real bottleneck. If latency is the issue, tune model size or cloud STT. If audio quality is the issue, tune TTS provider and voice preset. Start free, measure, then upgrade only where it actually improves your workflow.&lt;/p&gt;

</description>
      <category>hermes</category>
      <category>selfhosting</category>
      <category>llm</category>
    </item>
    <item>
      <title>Kanban in Hermes Agent for Self Hosted LLM Workflows</title>
      <dc:creator>Rost</dc:creator>
      <pubDate>Fri, 08 May 2026 09:48:56 +0000</pubDate>
      <link>https://dev.to/rosgluk/kanban-in-hermes-agent-for-self-hosted-llm-workflows-1ekf</link>
      <guid>https://dev.to/rosgluk/kanban-in-hermes-agent-for-self-hosted-llm-workflows-1ekf</guid>
      <description>&lt;p&gt;Hermes Agent ships with a Kanban-style board and the Hermes Gateway that can saturate your self-hosted LLM if too many tasks are dispatched at once.&lt;/p&gt;

&lt;p&gt;Put plainly: you can easily DDoS your own LLM this way.&lt;/p&gt;

&lt;p&gt;Hermes Kanban is a durable multi-profile board backed by &lt;code&gt;~/.hermes/kanban.db&lt;/code&gt;.&lt;br&gt;&lt;br&gt;
Each lane represents a phase of work, and each card is a task that can be claimed by a specific Hermes profile.&lt;br&gt;&lt;br&gt;
Out of the box, the dispatcher can promote many &lt;code&gt;ready&lt;/code&gt; tasks in one pass. That is fine for elastic cloud APIs, but it can overload a small self-hosted GPU cluster.&lt;/p&gt;

&lt;p&gt;If you are new to this stack, start with the broader &lt;a href="https://www.glukhov.org/ai-systems/hermes/" rel="noopener noreferrer"&gt;Hermes setup and operations guide&lt;/a&gt; and the &lt;a href="https://www.glukhov.org/ai-systems/" rel="noopener noreferrer"&gt;AI Systems pillar&lt;/a&gt; for surrounding architecture.&lt;/p&gt;

&lt;p&gt;This post shows how to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Understand&lt;/strong&gt; how Hermes Kanban dispatch interacts with your LLM gateway.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Control&lt;/strong&gt; parallelism safely for heavy tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batch&lt;/strong&gt; promotions with cron so background jobs do not collide with interactive use.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor&lt;/strong&gt; and tune the system so GPUs stay busy without overload.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  How Hermes Kanban and the dispatcher work
&lt;/h2&gt;

&lt;p&gt;At a high level, the system has three layers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Board&lt;/strong&gt; - durable SQLite state for tasks, columns, relations, and history.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workers&lt;/strong&gt; - Hermes profiles started in isolated workspaces to process a task.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dispatcher&lt;/strong&gt; - a long-lived process that scans for dispatchable cards and starts runs.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Tasks created from CLI or dashboard usually start in &lt;code&gt;backlog&lt;/code&gt; or &lt;code&gt;ready&lt;/code&gt;.&lt;br&gt;&lt;br&gt;
The dispatcher scans for eligible cards, claims one atomically, and starts the assigned profile with its tools and memory.&lt;br&gt;&lt;br&gt;
Each worker then calls your LLM gateway or local runtime (for example, OpenAI-compatible endpoints backed by Ollama, vLLM, or llama.cpp). For deployment choices across these runtimes, see &lt;a href="https://www.glukhov.org/llm-hosting/" rel="noopener noreferrer"&gt;LLM Hosting in 2026 Local Self-Hosted and Cloud Infrastructure Compared&lt;/a&gt;. If you are tuning request fan-out on Ollama itself, this pairs well with &lt;a href="https://www.glukhov.org/llm-performance/ollama/how-ollama-handles-parallel-requests/" rel="noopener noreferrer"&gt;How Ollama Handles Parallel Requests&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you add many heavy tasks and do not cap promotions, your gateway can get flooded with concurrent requests.&lt;br&gt;&lt;br&gt;
On a single-GPU or CPU-bound host, that often means queueing, thrashing, and timeouts instead of better throughput.&lt;/p&gt;
&lt;h2&gt;
  
  
  The practical limitation today
&lt;/h2&gt;

&lt;p&gt;In the Hermes builds many teams currently run, the dispatcher config exposes only two Kanban dispatch keys and does not apply a global active-task cap from config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;kanban&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;dispatch_in_gateway&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
  &lt;span class="na"&gt;dispatch_interval_seconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For active-task control, rely on explicit dispatch cadence (&lt;code&gt;hermes kanban dispatch --max ...&lt;/code&gt;) plus dependency modeling.&lt;/p&gt;
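
&lt;p&gt;For example, a crontab entry can provide that cadence. The five-minute interval, the cap of 2, and the install path are illustrative; Strategy 2 below refines this into a running-aware wrapper:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# spawn at most 2 new Kanban tasks every 5 minutes
*/5 * * * * /home/abc/.local/bin/hermes kanban dispatch --max 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;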

&lt;p&gt;Known gotchas:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do not run gateway-embedded dispatch and &lt;code&gt;hermes kanban daemon --force&lt;/code&gt; against the same board, or you can get claim races.&lt;/li&gt;
&lt;li&gt;If the gateway is down, &lt;code&gt;ready&lt;/code&gt; tasks do not dispatch and can burst later when service returns.&lt;/li&gt;
&lt;li&gt;Longer dispatch intervals feel uneven because claiming happens in ticks.&lt;/li&gt;
&lt;li&gt;Behavior can vary across versions because run-state and reclaim edge cases were patched over time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Quick verification when behavior looks wrong:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1) confirm exactly one dispatcher path is active&lt;/span&gt;
pgrep &lt;span class="nt"&gt;-af&lt;/span&gt; &lt;span class="s2"&gt;"hermes gateway start|hermes kanban daemon"&lt;/span&gt;

&lt;span class="c"&gt;# 2) check the wired Kanban dispatcher keys&lt;/span&gt;
rg &lt;span class="s2"&gt;"dispatch_in_gateway|dispatch_interval_seconds"&lt;/span&gt; ~/.hermes/config.yaml

&lt;span class="c"&gt;# 3) inspect queue shape&lt;/span&gt;
hermes kanban list &lt;span class="nt"&gt;--status&lt;/span&gt; ready
hermes kanban list &lt;span class="nt"&gt;--status&lt;/span&gt; running
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key ideas:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dispatcher config wires &lt;code&gt;dispatch_in_gateway&lt;/code&gt; and &lt;code&gt;dispatch_interval_seconds&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dispatch --max&lt;/code&gt; limits new spawns in that pass, not total running tasks.&lt;/li&gt;
&lt;li&gt;For small self-hosted clusters, start conservative and increase only after latency stays stable.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When first deploying Hermes near your LLM gateway:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keep only supported Kanban dispatcher keys in config.&lt;/li&gt;
&lt;li&gt;Observe GPU and CPU utilization under real queue pressure.&lt;/li&gt;
&lt;li&gt;Use Strategy 1 or Strategy 2 for deterministic pacing.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Investigation findings and root cause
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;hermes kanban dispatch&lt;/code&gt; does not read &lt;code&gt;config.yaml&lt;/code&gt; for &lt;code&gt;max_active_tasks&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;hermes_cli/kanban.py&lt;/code&gt;, the dispatch command exposes &lt;code&gt;--max&lt;/code&gt; as a CLI cap (default &lt;code&gt;None&lt;/code&gt;) and passes only &lt;code&gt;args.max&lt;/code&gt; into &lt;code&gt;kb.dispatch_once(...)&lt;/code&gt;. There is no &lt;code&gt;max_active_tasks&lt;/code&gt; config lookup in this path. See &lt;a href="https://github.com/NousResearch/hermes-agent/raw/refs/heads/main/hermes_cli/kanban.py" rel="noopener noreferrer"&gt;hermes_cli/kanban.py raw&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Then in &lt;code&gt;kanban_db.dispatch_once&lt;/code&gt;, the only cap is &lt;code&gt;max_spawn&lt;/code&gt;, with logic equivalent to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;max_spawn&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;spawned&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;max_spawn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;break&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is no check of already running tasks and no &lt;code&gt;max_active_tasks&lt;/code&gt; reference in that dispatch path. See &lt;a href="https://github.com/NousResearch/hermes-agent/raw/refs/heads/main/hermes_cli/kanban_db.py" rel="noopener noreferrer"&gt;hermes_cli/kanban_db.py raw&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Effective behavior:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;hermes kanban dispatch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;is unbounded for that pass, limited only by the size of the ready queue.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;hermes kanban dispatch &lt;span class="nt"&gt;--max&lt;/span&gt; 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;caps only new spawns in that pass, not total running tasks.&lt;/p&gt;

&lt;p&gt;The wired config knobs around gateway dispatch are &lt;code&gt;kanban.dispatch_in_gateway&lt;/code&gt; and &lt;code&gt;kanban.dispatch_interval_seconds&lt;/code&gt;.&lt;br&gt;&lt;br&gt;
So &lt;code&gt;max_active_tasks&lt;/code&gt; is ignored in this dispatch path because it is not implemented there.&lt;/p&gt;
&lt;h2&gt;
  
  
  Strategy 1 - Encode dependencies for strictly sequential flows
&lt;/h2&gt;

&lt;p&gt;Some workflows should run strictly one after another — for example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;multi-step data pipelines with shared intermediate artefacts&lt;/li&gt;
&lt;li&gt;migrations or infrastructure changes&lt;/li&gt;
&lt;li&gt;batch jobs that write to the same object store or database&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Hermes Kanban supports parent-child dependencies between tasks so that a child card becomes dispatchable only when its parent is done.&lt;/p&gt;

&lt;p&gt;You can model this with a small helper script around the Hermes CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/usr/bin/env bash&lt;/span&gt;

&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-euo&lt;/span&gt; pipefail

&lt;span class="nv"&gt;parent_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;hermes kanban add &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--title&lt;/span&gt; &lt;span class="s1"&gt;'Ingest customer logs for April'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; &lt;span class="s1"&gt;'etl-worker'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--column&lt;/span&gt; backlog&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

hermes kanban add &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--title&lt;/span&gt; &lt;span class="s1"&gt;'Generate April anomaly report'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; &lt;span class="s1"&gt;'analytics-worker'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--column&lt;/span&gt; backlog &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--parent&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;parent_id&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

hermes kanban add &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--title&lt;/span&gt; &lt;span class="s1"&gt;'Publish April summary to dashboard'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; &lt;span class="s1"&gt;'reporting-worker'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--column&lt;/span&gt; backlog &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--parent&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;parent_id&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With an appropriate board policy and low dispatcher limits, only the parent task runs first.&lt;br&gt;&lt;br&gt;
Once it finishes, the child tasks become ready, and the dispatcher pulls them one by one without ever exceeding your concurrency caps.&lt;/p&gt;
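
&lt;p&gt;If the steps must be strictly linear rather than fanning out after one parent, chain the cards instead. A sketch assuming the same &lt;code&gt;--parent&lt;/code&gt; flag and that &lt;code&gt;add&lt;/code&gt; prints the new card ID, as in the example above; titles are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;#!/usr/bin/env bash
set -euo pipefail

# each card becomes dispatchable only after the previous one is done
step1="$(hermes kanban add --title 'Ingest' --column backlog)"
step2="$(hermes kanban add --title 'Transform' --column backlog --parent "${step1}")"
hermes kanban add --title 'Publish' --column backlog --parent "${step2}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
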
&lt;h2&gt;
  
  
  Strategy 2 - Use Linux cron with a running-aware dispatch cap
&lt;/h2&gt;

&lt;p&gt;If you want deterministic pacing, use host cron plus a small wrapper script.&lt;br&gt;&lt;br&gt;
Instead of always calling &lt;code&gt;dispatch --max 2&lt;/code&gt;, first count the currently running tasks, then dispatch only into the remaining slots.&lt;/p&gt;

&lt;p&gt;Create &lt;code&gt;hermes-kanban-dispatch-capped.sh&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/usr/bin/env bash&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-euo&lt;/span&gt; pipefail

&lt;span class="nv"&gt;MAX_PARALLEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;MAX_PARALLEL&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="nv"&gt;2&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nv"&gt;BOARD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;BOARD&lt;/span&gt;&lt;span class="k"&gt;:-}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="nv"&gt;board_args&lt;/span&gt;&lt;span class="o"&gt;=()&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[[&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$BOARD&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nv"&gt;board_args&lt;/span&gt;&lt;span class="o"&gt;=(&lt;/span&gt;&lt;span class="nt"&gt;--board&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$BOARD&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;

&lt;span class="c"&gt;# or where your hermes is installed&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/home/abc/.local/bin:&lt;/span&gt;&lt;span class="nv"&gt;$PATH&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="nv"&gt;running_out&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;hermes kanban &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;board_args&lt;/span&gt;&lt;span class="p"&gt;[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; list &lt;span class="nt"&gt;--status&lt;/span&gt; running&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$running_out&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt;&lt;span class="s2"&gt;"(no matching tasks)"&lt;/span&gt;&lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="o"&gt;]]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nv"&gt;running_count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0
&lt;span class="k"&gt;else
  &lt;/span&gt;&lt;span class="nv"&gt;running_count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;printf&lt;/span&gt; &lt;span class="s1"&gt;'%s\n'&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$running_out&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;wc&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;fi

&lt;/span&gt;&lt;span class="nv"&gt;slots&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;$((&lt;/span&gt; MAX_PARALLEL &lt;span class="o"&gt;-&lt;/span&gt; running_count &lt;span class="k"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;((&lt;/span&gt; slots &amp;lt;&lt;span class="o"&gt;=&lt;/span&gt; 0 &lt;span class="o"&gt;))&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Already at limit running=&lt;/span&gt;&lt;span class="nv"&gt;$running_count&lt;/span&gt;&lt;span class="s2"&gt; max=&lt;/span&gt;&lt;span class="nv"&gt;$MAX_PARALLEL&lt;/span&gt;&lt;span class="s2"&gt; dispatch skipped"&lt;/span&gt;
  &lt;span class="nb"&gt;exit &lt;/span&gt;0
&lt;span class="k"&gt;fi

&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"running=&lt;/span&gt;&lt;span class="nv"&gt;$running_count&lt;/span&gt;&lt;span class="s2"&gt; max=&lt;/span&gt;&lt;span class="nv"&gt;$MAX_PARALLEL&lt;/span&gt;&lt;span class="s2"&gt; slots=&lt;/span&gt;&lt;span class="nv"&gt;$slots&lt;/span&gt;&lt;span class="s2"&gt; dispatching up to &lt;/span&gt;&lt;span class="nv"&gt;$slots&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

hermes kanban &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;board_args&lt;/span&gt;&lt;span class="p"&gt;[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; dispatch &lt;span class="nt"&gt;--max&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$slots&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make it executable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;chmod&lt;/span&gt; +x ./hermes-kanban-dispatch-capped.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run it with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;MAX_PARALLEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2 ./hermes-kanban-dispatch-capped.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a specific board:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;BOARD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;my-board &lt;span class="nv"&gt;MAX_PARALLEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2 ./hermes-kanban-dispatch-capped.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Schedule it once per minute with cron:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; /opt/hermes/scripts/hermes-kanban-dispatch-capped.sh &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; /var/log/hermes/kanban-cron.log 2&amp;gt;&amp;amp;1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Operational notes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cron often has a minimal &lt;code&gt;PATH&lt;/code&gt;, so if &lt;code&gt;hermes&lt;/code&gt; is not found, use its full path inside the script (for example &lt;code&gt;/usr/local/bin/hermes&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;If you log to &lt;code&gt;/var/log/hermes/...&lt;/code&gt;, create that directory first and ensure the cron user has write access.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /var/log/hermes
&lt;span class="nb"&gt;sudo chown&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$USER&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;:&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$USER&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; /var/log/hermes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create or edit cron entries with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;crontab &lt;span class="nt"&gt;-e&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then verify with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;crontab &lt;span class="nt"&gt;-l&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Sub-minute cadence with one cron entry
&lt;/h3&gt;

&lt;p&gt;Cron ticks once per minute, but you can still dispatch more frequently by running a short loop inside the script.&lt;/p&gt;

&lt;p&gt;Example &lt;code&gt;hermes-kanban-dispatch-subminute.sh&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/usr/bin/env bash&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-euo&lt;/span&gt; pipefail

&lt;span class="nv"&gt;LOCK_FILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/tmp/hermes-kanban-dispatch.lock"&lt;/span&gt;
&lt;span class="nv"&gt;RUNS_PER_MINUTE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;RUNS_PER_MINUTE&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="nv"&gt;4&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;    &lt;span class="c"&gt;# 4 runs =&amp;gt; every 15 seconds&lt;/span&gt;
&lt;span class="nv"&gt;CAP_SCRIPT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;CAP_SCRIPT&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="p"&gt;/opt/hermes/scripts/hermes-kanban-dispatch-capped.sh&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="nb"&gt;exec &lt;/span&gt;9&amp;gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$LOCK_FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
flock &lt;span class="nt"&gt;-n&lt;/span&gt; 9 &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;exit &lt;/span&gt;0

&lt;span class="nv"&gt;sleep_seconds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;$((&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; RUNS_PER_MINUTE &lt;span class="k"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="nv"&gt;i&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1&lt;span class="p"&gt;;&lt;/span&gt; i&amp;lt;&lt;span class="o"&gt;=&lt;/span&gt;RUNS_PER_MINUTE&lt;span class="p"&gt;;&lt;/span&gt; i++&lt;span class="o"&gt;))&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$CAP_SCRIPT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;((&lt;/span&gt; i &amp;lt; RUNS_PER_MINUTE &lt;span class="o"&gt;))&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;sleep&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$sleep_seconds&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="k"&gt;fi
done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make it executable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;chmod&lt;/span&gt; +x ./hermes-kanban-dispatch-subminute.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Schedule it once per minute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; /opt/hermes/scripts/hermes-kanban-dispatch-subminute.sh &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; /var/log/hermes/kanban-subminute.log 2&amp;gt;&amp;amp;1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives an effective sub-minute cadence while &lt;code&gt;flock&lt;/code&gt; prevents overlapping runs.&lt;/p&gt;
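
&lt;p&gt;A quick way to sanity-check the lock: start one copy in the background, then run a second; the second should exit immediately without dispatching anything.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./hermes-kanban-dispatch-subminute.sh &amp;amp;   # first copy holds the flock
./hermes-kanban-dispatch-subminute.sh       # flock -n fails here, exits 0
wait
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
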

&lt;p&gt;Why this works:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;list --status running&lt;/code&gt; gives current running load.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dispatch --max N&lt;/code&gt; caps only new spawns for that pass.&lt;/li&gt;
&lt;li&gt;Computing &lt;code&gt;N&lt;/code&gt; as remaining slots keeps total running tasks near your target limit.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Important caveat: this cap applies only to dispatches made through this script.&lt;br&gt;&lt;br&gt;
Disable the gateway's embedded dispatch; otherwise it can still promote tasks independently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;kanban&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;dispatch_in_gateway&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The official docs describe both command capabilities and note gateway dispatch defaults in the Kanban feature guide: &lt;a href="https://github.com/NousResearch/hermes-agent/blob/main/website/docs/user-guide/features/kanban.md" rel="noopener noreferrer"&gt;Hermes Kanban docs&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Internal Hermes Cron
&lt;/h2&gt;

&lt;p&gt;Do not use it for this.&lt;br&gt;
Do you really want your LLM to process a recurring prompt like &lt;code&gt;Execute in terminal the command /path/hermes-kanban-dispatch-capped.sh&lt;/code&gt;, especially while it is busy doing useful work? Host cron is cheaper and more deterministic for a fixed cadence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hermes Kanban Monitoring and Tuning
&lt;/h2&gt;

&lt;p&gt;Whichever strategy you choose, you should monitor:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLM gateway metrics — request rate, latency, error rate, token throughput.&lt;/li&gt;
&lt;li&gt;Node health — GPU utilisation, VRAM usage, CPU load and RAM.&lt;/li&gt;
&lt;li&gt;Hermes metrics — how many tasks are in backlog, ready, active and done (a quick shell sketch follows this list).&lt;/li&gt;
&lt;/ul&gt;
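
&lt;p&gt;For the Hermes side, a minimal terminal sketch reusing the &lt;code&gt;list --status&lt;/code&gt; filter from the capped script; status names may differ per board:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# count currently running tasks (same output-parsing caveat as the capped script)
hermes kanban list --status running | wc -l

# refresh the running list every 30 seconds
watch -n 30 'hermes kanban list --status running'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
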

&lt;p&gt;For production metric baselines and dashboards, see &lt;a href="https://www.glukhov.org/observability/monitoring-llm-inference-prometheus-grafana/" rel="noopener noreferrer"&gt;Monitor LLM Inference in Production with Prometheus and Grafana&lt;/a&gt; and the broader &lt;a href="https://www.glukhov.org/llm-performance/" rel="noopener noreferrer"&gt;LLM Performance hub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Start with low concurrency, then gradually raise limits while watching for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;rising latency at constant throughput&lt;/li&gt;
&lt;li&gt;increasing timeout or rate limit errors&lt;/li&gt;
&lt;li&gt;long tails where some tasks stay active for a very long time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As soon as you see these symptoms, roll back to the previous stable configuration and keep it as your default.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Kanban is the right tool
&lt;/h2&gt;

&lt;p&gt;Hermes Kanban shines when you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;long lived research or engineering backlogs&lt;/li&gt;
&lt;li&gt;multi agent collaboration with named profiles&lt;/li&gt;
&lt;li&gt;workflows that must survive restarts and host reboots&lt;/li&gt;
&lt;li&gt;humans who want a dashboard to triage work&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you only need a single run to create a few temporary helpers, the built-in delegate-task tools are usually simpler.&lt;br&gt;&lt;br&gt;
Once you need history, dashboards, and strict control over how your agents hit self-hosted LLMs, the Kanban board plus dispatcher is the right foundation.&lt;/p&gt;

&lt;p&gt;With a few configuration tweaks and optional cron based batching you can keep Hermes Kanban responsive while protecting your gateway and hardware.&lt;/p&gt;

</description>
      <category>hermes</category>
      <category>selfhosting</category>
      <category>llm</category>
      <category>ai</category>
    </item>
    <item>
      <title>Hermes Agent Skill Authoring — SKILL.md Structure and Best Practices</title>
      <dc:creator>Rost</dc:creator>
      <pubDate>Wed, 06 May 2026 08:10:04 +0000</pubDate>
      <link>https://dev.to/rosgluk/hermes-agent-skill-authoring-skillmd-structure-and-best-practices-44n9</link>
      <guid>https://dev.to/rosgluk/hermes-agent-skill-authoring-skillmd-structure-and-best-practices-44n9</guid>
      <description>&lt;p&gt;Hermes Agent treats &lt;strong&gt;skills&lt;/strong&gt; as the default way to teach repeatable workflows. Official documentation describes them as on-demand knowledge documents aligned with the open &lt;a href="https://agentskills.io/specification" rel="noopener noreferrer"&gt;agentskills.io&lt;/a&gt; shape, loaded through &lt;strong&gt;progressive disclosure&lt;/strong&gt; so the model sees a small index first and only pulls full instructions when a task actually needs them.&lt;/p&gt;

&lt;p&gt;Authoring is less about clever wording than about &lt;strong&gt;packaging&lt;/strong&gt;—you are telling the runtime when to load a procedure, what order of steps counts as “done,” and how to tell success from a silent failure. This article stays focused on &lt;code&gt;SKILL.md&lt;/code&gt; structure, supporting folders, visibility rules, and the split between secrets and non-secret settings—the details that decide whether a skill shows up in &lt;code&gt;/slash&lt;/code&gt; commands, survives a hub install, or quietly disappears on CI.&lt;/p&gt;

&lt;p&gt;Hermes sits inside the broader &lt;strong&gt;&lt;a href="https://www.glukhov.org/ai-systems/" rel="noopener noreferrer"&gt;AI Systems: Self-Hosted Assistants, RAG, and Local Infrastructure&lt;/a&gt;&lt;/strong&gt; cluster, where assistants are treated as systems built from inference, retrieval, memory, and tooling rather than as a single chat surface. Install paths, provider wiring, gateway behavior, and the layout of &lt;code&gt;~/.hermes&lt;/code&gt; are all spelled out in the &lt;strong&gt;&lt;a href="https://www.glukhov.org/ai-systems/hermes/" rel="noopener noreferrer"&gt;Hermes AI Assistant - Install, Setup, Workflow, and Troubleshooting&lt;/a&gt;&lt;/strong&gt; guide; day-to-day shell ergonomics—&lt;code&gt;hermes skills&lt;/code&gt;, profiles, gateway, memory—are easier to scan in the &lt;strong&gt;&lt;a href="https://www.glukhov.org/ai-systems/hermes/hermes-agent-cli-cheatsheet/" rel="noopener noreferrer"&gt;Hermes Agent CLI cheat sheet — commands, flags, and slash shortcuts&lt;/a&gt;&lt;/strong&gt;. In real deployments, skills inherit isolation from &lt;strong&gt;profiles&lt;/strong&gt; (separate config, secrets, memories, and skill trees). &lt;strong&gt;&lt;a href="https://www.glukhov.org/ai-systems/hermes/production-setup/" rel="noopener noreferrer"&gt;Hermes AI Assistant Skills for Real Production Setups&lt;/a&gt;&lt;/strong&gt; argues for treating those profiles—not individual markdown files—as the unit of ownership; keep that in mind when you name skills and decide what belongs in shared &lt;code&gt;external_dirs&lt;/code&gt; versus a single profile.&lt;/p&gt;

&lt;h2&gt;
  
  
  Skill or tool?
&lt;/h2&gt;

&lt;p&gt;Official guidance is blunt. &lt;strong&gt;Use a skill&lt;/strong&gt; when the capability is mostly prose instructions plus shell commands and tools Hermes already exposes—wrapping a CLI, driving &lt;code&gt;git&lt;/code&gt;, calling &lt;code&gt;curl&lt;/code&gt;, or using &lt;code&gt;web_extract&lt;/code&gt; for structured fetches. &lt;strong&gt;Use a tool&lt;/strong&gt; when you need tight integration for API keys and auth flows, deterministic binary handling, streaming, or Python that must execute the same way every time.&lt;/p&gt;

&lt;p&gt;That boundary matters in practice because skills ship without changing agent code, while tools carry review and release overhead. Most teams benefit from starting with a skill, then promoting only the brittle core to a tool once the failure modes are obvious (auth refresh loops, binary parsers, strict idempotency).&lt;/p&gt;

&lt;h3&gt;
  
  
  Procedures versus curated memory
&lt;/h3&gt;

&lt;p&gt;Skills answer &lt;strong&gt;how&lt;/strong&gt; to run a workflow; Hermes’ bounded core memory answers &lt;strong&gt;what has already been agreed&lt;/strong&gt; about the user and the project. A skill loads when the task matches its description; MEMORY.md and USER.md stay in the prompt as a small, curated fact layer. The two mechanisms stack rather than compete, and the full picture of snapshots, limits, and external providers is laid out in &lt;strong&gt;&lt;a href="https://www.glukhov.org/ai-systems/hermes/hermes-agent-memory-system/" rel="noopener noreferrer"&gt;Hermes Agent Memory System: How Persistent AI Memory Actually Works&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Anatomy of a skill directory
&lt;/h2&gt;

&lt;p&gt;On disk, every skill is a folder under &lt;code&gt;~/.hermes/skills/&lt;/code&gt;, often nested under a category such as &lt;code&gt;devops/&lt;/code&gt; or &lt;code&gt;research/&lt;/code&gt;. Hermes expects &lt;strong&gt;&lt;code&gt;SKILL.md&lt;/code&gt; at the leaf&lt;/strong&gt;; everything else is optional structure you add when the instructions would otherwise sprawl. The usual pattern is &lt;code&gt;references/&lt;/code&gt; for long tables or vendor docs, &lt;code&gt;templates/&lt;/code&gt; for output skeletons, &lt;code&gt;scripts/&lt;/code&gt; for deterministic helpers, and &lt;code&gt;assets/&lt;/code&gt; for static files the agent should not re-fetch.&lt;/p&gt;

&lt;p&gt;That layout mirrors how progressive disclosure works in practice: the agent can stay at the main file until it truly needs a deep appendix. Keeping “happy path” prose in &lt;code&gt;SKILL.md&lt;/code&gt; and pushing rarely used detail into &lt;code&gt;references/&lt;/code&gt; is one of the cheapest ways to protect token budgets.&lt;/p&gt;
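
&lt;p&gt;A minimal sketch to scaffold that layout from the shell; the category and skill names are just examples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# SKILL.md lives at the leaf; the support folders are optional
mkdir -p ~/.hermes/skills/devops/backup-check/{references,templates,scripts,assets}
touch ~/.hermes/skills/devops/backup-check/SKILL.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
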

&lt;p&gt;Hermes can also merge in &lt;strong&gt;external skill directories&lt;/strong&gt; via &lt;code&gt;skills.external_dirs&lt;/code&gt; in &lt;code&gt;config.yaml&lt;/code&gt;. Those paths are scanned for discovery, but the agent still writes through &lt;code&gt;skill_manage&lt;/code&gt; into the primary &lt;code&gt;~/.hermes/skills/&lt;/code&gt; tree. &lt;strong&gt;Local names shadow external ones&lt;/strong&gt;, so if you “fix” a shared skill in your home directory, teammates pulling the same external repo will not see your edit until they remove or rename the local copy—a common source of “it works on my machine” confusion.&lt;/p&gt;

&lt;h2&gt;
  
  
  SKILL.md frontmatter that survives review
&lt;/h2&gt;

&lt;p&gt;The body of &lt;code&gt;SKILL.md&lt;/code&gt; is Markdown; the opening block must be valid YAML between &lt;code&gt;---&lt;/code&gt; delimiters. Real skills accumulate long fenced examples, so the small habits from &lt;strong&gt;&lt;a href="https://www.glukhov.org/documentation-tools/markdown/markdown-codeblocks/" rel="noopener noreferrer"&gt;Markdown Code Blocks: Complete Guide with Syntax, Languages &amp;amp; Examples&lt;/a&gt;&lt;/strong&gt;—consistent language tags, readable excerpts, tight fences—keep large files maintainable for humans and slightly easier for the model to scan.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Required fields&lt;/strong&gt; are &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;description&lt;/code&gt;. The &lt;code&gt;name&lt;/code&gt; becomes the slash route and index key; it stays lowercase with hyphens and must respect the documented length cap. The &lt;code&gt;description&lt;/code&gt; is the only prose many sessions ever pay for at &lt;strong&gt;level zero&lt;/strong&gt;, so it should read like a search result or router string (“when backups look stale, verify latest archive and checksum”), not the first paragraph of a blog post.&lt;/p&gt;

&lt;p&gt;Optional top-level keys such as &lt;code&gt;version&lt;/code&gt;, &lt;code&gt;author&lt;/code&gt;, and &lt;code&gt;license&lt;/code&gt; help hub packaging and audits. The &lt;code&gt;platforms&lt;/code&gt; list (&lt;code&gt;macos&lt;/code&gt;, &lt;code&gt;linux&lt;/code&gt;, &lt;code&gt;windows&lt;/code&gt;) is sharper than it looks—when set, Hermes omits the skill entirely on non-matching hosts, which is why a skill that “works on my Mac” can vanish in Linux CI with no error message beyond a shorter skill list.&lt;/p&gt;

&lt;p&gt;Hermes-specific knobs live under &lt;code&gt;metadata.hermes&lt;/code&gt;: &lt;code&gt;tags&lt;/code&gt;, &lt;code&gt;related_skills&lt;/code&gt;, and the conditional visibility fields in the next section. &lt;strong&gt;&lt;code&gt;required_environment_variables&lt;/code&gt;&lt;/strong&gt; declares secrets that should land in &lt;code&gt;.env&lt;/code&gt; and pass into sandboxes; &lt;strong&gt;&lt;code&gt;required_credential_files&lt;/code&gt;&lt;/strong&gt; covers OAuth token files and other on-disk credentials that must mount into Docker or Modal; &lt;strong&gt;&lt;code&gt;metadata.hermes.config&lt;/code&gt;&lt;/strong&gt; declares non-secret preferences stored under &lt;code&gt;skills.config&lt;/code&gt; in &lt;code&gt;config.yaml&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Official docs stress &lt;strong&gt;size discipline&lt;/strong&gt; for a reason. Trim the &lt;code&gt;description&lt;/code&gt; to its budget, front-load the procedure, and push historical notes or giant option matrices into &lt;code&gt;references/&lt;/code&gt; so a partial &lt;code&gt;skill_view&lt;/code&gt; still gives the agent something actionable.&lt;/p&gt;

&lt;p&gt;Below is a &lt;strong&gt;minimal&lt;/strong&gt; &lt;code&gt;SKILL.md&lt;/code&gt; you can drop into &lt;code&gt;~/.hermes/skills/devops/backup-check/SKILL.md&lt;/code&gt; (or any category folder) and iterate from there.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backup-check&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Verify nightly backup archives exist, are non-empty, and pass a quick checksum spot-check on the latest file.&lt;/span&gt;
&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1.0.0&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;hermes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;devops&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;backups&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;shell&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;requires_toolsets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;terminal&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backup_check.archive_dir&lt;/span&gt;
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Absolute path to the directory that holds backup archives&lt;/span&gt;
        &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/var/backups"&lt;/span&gt;
        &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Backup archive directory (absolute path)&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="gh"&gt;# Backup archive spot-check&lt;/span&gt;

&lt;span class="gu"&gt;## When to use&lt;/span&gt;

Use when the user asks to confirm backups ran, to audit the latest archive on disk, or to catch empty or stale backup files before a restore drill.

&lt;span class="gu"&gt;## Quick reference&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; Latest archive directory is configured under &lt;span class="sb"&gt;`skills.config.backup_check.archive_dir`&lt;/span&gt; (set via &lt;span class="sb"&gt;`hermes config migrate`&lt;/span&gt; if declared in metadata).
&lt;span class="p"&gt;-&lt;/span&gt; Default check uses &lt;span class="sb"&gt;`ls`&lt;/span&gt; by mtime and &lt;span class="sb"&gt;`test -s`&lt;/span&gt; for non-empty files.

&lt;span class="gu"&gt;## Procedure&lt;/span&gt;
&lt;span class="p"&gt;
1.&lt;/span&gt; Resolve the archive directory from skill config or ask the user once if unset.
&lt;span class="p"&gt;2.&lt;/span&gt; List the most recently modified file matching the expected pattern (for example &lt;span class="sb"&gt;`*.tar.zst`&lt;/span&gt;).
&lt;span class="p"&gt;3.&lt;/span&gt; Confirm the file exists, is non-empty, and record its path and size for the reply.
&lt;span class="p"&gt;4.&lt;/span&gt; If a checksum file exists beside the archive, verify it with the documented tool (for example &lt;span class="sb"&gt;`sha256sum -c`&lt;/span&gt;).

&lt;span class="gu"&gt;## Pitfalls&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; Empty files can still have a recent mtime if a failed job touched the path; always check size.
&lt;span class="p"&gt;-&lt;/span&gt; Relative paths break when the terminal cwd is not the backup host; use absolute paths in config.

&lt;span class="gu"&gt;## Verification&lt;/span&gt;

The user should see the latest archive path, byte size, and either a checksum OK line or an explicit note that no &lt;span class="sb"&gt;`.sha256`&lt;/span&gt; sidecar was found.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
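
&lt;p&gt;Once the file is in place, a quick smoke test from the shell, assuming the &lt;code&gt;hermes skills&lt;/code&gt; and &lt;code&gt;hermes chat&lt;/code&gt; surfaces described elsewhere in this series:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;hermes skills list    # the new skill should appear in the index
hermes chat --toolsets skills -q "Check that last night's backup archive exists"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
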



&lt;h2&gt;
  
  
  Progressive disclosure in practice
&lt;/h2&gt;

&lt;p&gt;Progressive disclosure is the difference between a skill library that feels snappy and one that burns thousands of tokens before the first user message. Hermes walks three conceptual steps: a compact catalog (names and short descriptions), the full &lt;code&gt;SKILL.md&lt;/code&gt; when the task matches, and—only if needed—a slice of a reference file via &lt;code&gt;skill_view&lt;/code&gt; paths. &lt;strong&gt;Assume level zero is all the model will read&lt;/strong&gt; until it explicitly commits; every sentence in the &lt;code&gt;description&lt;/code&gt; and the first screen of body text should help routing, not storytelling.&lt;/p&gt;

&lt;p&gt;A practical outline that survives partial loads is &lt;strong&gt;When to use&lt;/strong&gt; (triggers in plain language), &lt;strong&gt;Quick reference&lt;/strong&gt; (commands, env vars, file paths), &lt;strong&gt;Procedure&lt;/strong&gt; (ordered steps the agent should not improvise away), &lt;strong&gt;Pitfalls&lt;/strong&gt; (known failure modes), and &lt;strong&gt;Verification&lt;/strong&gt; (what “green” looks like). Narrative history, vendor changelog dumps, and twenty-row option tables belong in &lt;code&gt;references/&lt;/code&gt; with stable headings so the agent can pull a single section.&lt;/p&gt;

&lt;p&gt;When a skill activates, Hermes can rewrite &lt;strong&gt;&lt;code&gt;${HERMES_SKILL_DIR}&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;${HERMES_SESSION_ID}&lt;/code&gt;&lt;/strong&gt; in the body so shell lines point at the installed folder without hand-built paths. Optional &lt;strong&gt;inline shell&lt;/strong&gt; snippets (&lt;code&gt;!`cmd`&lt;/code&gt;) can inject fresh context (current branch, disk free space), but they execute on the host and stay disabled unless &lt;code&gt;skills.inline_shell&lt;/code&gt; is on—treat that flag as a trust boundary for the whole skill source, not a convenience toggle.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conditional activation and prompt hygiene
&lt;/h2&gt;

&lt;p&gt;Skills can &lt;strong&gt;show or hide&lt;/strong&gt; based on which toolsets or tools exist in the current session. &lt;code&gt;requires_toolsets&lt;/code&gt; / &lt;code&gt;requires_tools&lt;/code&gt; gate a skill behind capabilities that must be present; &lt;code&gt;fallback_for_toolsets&lt;/code&gt; / &lt;code&gt;fallback_for_tools&lt;/code&gt; surface a cheaper or local path when a premium integration is absent—the DuckDuckGo fallback when a paid web search API is not configured is the canonical example.&lt;/p&gt;

&lt;p&gt;These predicates directly shape &lt;strong&gt;prompt noise&lt;/strong&gt;. An overly strict &lt;code&gt;requires_*&lt;/code&gt; rule hides a skill from newcomers who have not finished &lt;code&gt;hermes tools&lt;/code&gt; setup yet; an overly loose &lt;code&gt;fallback_for_*&lt;/code&gt; rule duplicates half your library whenever someone omits an API key. The useful middle ground is to name real prerequisites, test with &lt;code&gt;hermes chat --toolsets skills&lt;/code&gt;, and toggle keys or toolsets on purpose while watching whether the skill list breathes the way you expect.&lt;/p&gt;
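
&lt;p&gt;A sketch of that toggle test using the example skill above, which requires the &lt;code&gt;terminal&lt;/code&gt; toolset; with the prerequisite present the skill should be listed, without it the skill should vanish:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# prerequisite present: backup-check should be visible
hermes chat --toolsets skills,terminal -q "Which skills are available?"

# prerequisite absent: backup-check should drop out of the index
hermes chat --toolsets skills -q "Which skills are available?"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
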

&lt;h2&gt;
  
  
  Secrets, config, and credential files
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Secrets&lt;/strong&gt; should be declared in &lt;code&gt;required_environment_variables&lt;/code&gt;. Hermes can prompt when a skill loads in the local CLI, persist values in &lt;code&gt;.env&lt;/code&gt;, and pass them into &lt;code&gt;terminal&lt;/code&gt; and &lt;code&gt;execute_code&lt;/code&gt; sandboxes &lt;strong&gt;without&lt;/strong&gt; streaming the raw secret back into the model transcript. Remote chat surfaces refuse to collect keys inline and instead point people at &lt;code&gt;hermes setup&lt;/code&gt; or manual &lt;code&gt;.env&lt;/code&gt; edits—author your skill text so it matches that behavior (tell users &lt;em&gt;that&lt;/em&gt; a key is required, not &lt;em&gt;to paste it into Telegram&lt;/em&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Non-secret preferences&lt;/strong&gt;—default paths, org names, feature toggles—belong in &lt;code&gt;metadata.hermes.config&lt;/code&gt;. Values resolve into &lt;code&gt;skills.config&lt;/code&gt; inside &lt;code&gt;config.yaml&lt;/code&gt;, show up in &lt;code&gt;hermes config show&lt;/code&gt;, and arrive in the skill message as resolved facts so the model does not need to open your config file mid-task.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;File-shaped credentials&lt;/strong&gt; (OAuth token JSON, service account keys) map to &lt;code&gt;required_credential_files&lt;/code&gt;. When those files exist, Hermes can bind-mount them into Docker or sync them into Modal jobs; declaring them upfront avoids the classic “script works locally, dies in sandbox” gap.&lt;/p&gt;

&lt;h2&gt;
  
  
  Supporting scripts and dependencies
&lt;/h2&gt;

&lt;p&gt;The upstream guide pushes authors toward &lt;strong&gt;boring dependencies&lt;/strong&gt;: stdlib Python, &lt;code&gt;curl&lt;/code&gt;, and Hermes’ own tools (&lt;code&gt;web_extract&lt;/code&gt;, &lt;code&gt;read_file&lt;/code&gt;, &lt;code&gt;terminal&lt;/code&gt;). That is less about purity than about reproducibility—every extra &lt;code&gt;pip install&lt;/code&gt; is another silent failure when the agent runs in a clean container.&lt;/p&gt;

&lt;p&gt;When JSON or XML parsing is fiddly, a short script under &lt;code&gt;scripts/&lt;/code&gt; plus a &lt;code&gt;${HERMES_SKILL_DIR}&lt;/code&gt; path beats asking the model to re-derive parsers each run. If you truly need a package, state the install command in &lt;strong&gt;Procedure&lt;/strong&gt;, repeat the failure symptom in &lt;strong&gt;Pitfalls&lt;/strong&gt;, and give a &lt;strong&gt;Verification&lt;/strong&gt; command that fails loudly when the dependency is missing.&lt;/p&gt;
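
&lt;p&gt;For example, a &lt;strong&gt;Procedure&lt;/strong&gt; step can call a helper through the rewritten variable instead of a hand-built path; the script name here is hypothetical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# ${HERMES_SKILL_DIR} resolves to the installed skill folder at activation
python3 "${HERMES_SKILL_DIR}/scripts/parse_inventory.py" --input latest.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
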

&lt;h2&gt;
  
  
  Publishing, hub installs, and trust
&lt;/h2&gt;

&lt;p&gt;Community skills move through the Skills Hub and the other discovery paths the user guide lists—official optional skills, GitHub slugs, &lt;code&gt;skills.sh&lt;/code&gt; entries, &lt;code&gt;.well-known&lt;/code&gt; indexes, and raw &lt;code&gt;SKILL.md&lt;/code&gt; URLs. Installs are scanned for obvious exfiltration, injection, and destructive patterns; trust tiers run from &lt;strong&gt;builtin&lt;/strong&gt; through &lt;strong&gt;community&lt;/strong&gt;, and some findings only clear with &lt;code&gt;--force&lt;/code&gt; while the worst cases stay blocked entirely.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;&lt;code&gt;SKILL.md&lt;/code&gt; file shape is not Hermes-specific&lt;/strong&gt;; IDE-centric assistants use the same progressive-loading idea with different discovery and triggers. &lt;strong&gt;&lt;a href="https://www.glukhov.org/ai-devtools/claude-code/claude-skills-for-developers/" rel="noopener noreferrer"&gt;Claude Skills and SKILL.md for Developers: VS Code, JetBrains, Cursor&lt;/a&gt;&lt;/strong&gt; is a useful contrast read—frontmatter discipline and “load only when relevant” carry over, even when the installer and slash-command wiring differ.&lt;/p&gt;

&lt;p&gt;Org-wide rollouts usually pair a &lt;strong&gt;private tap or shared Git repo&lt;/strong&gt; with &lt;code&gt;external_dirs&lt;/code&gt; for read-only sharing, while keeping the agent-writable copy under each profile when &lt;code&gt;skill_manage&lt;/code&gt; is allowed to mutate skills in place.&lt;/p&gt;

&lt;h2&gt;
  
  
  Troubleshooting and optimization
&lt;/h2&gt;

&lt;p&gt;When a skill misbehaves, walk this checklist before rewriting prose.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Visibility&lt;/strong&gt; — Confirm &lt;code&gt;platforms&lt;/code&gt;, &lt;code&gt;requires_*&lt;/code&gt;, and &lt;code&gt;fallback_for_*&lt;/code&gt; predicates. A skill that “works on my Mac” but not in Linux CI is often a platform guard.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Name collisions&lt;/strong&gt; — Duplicate names across local and external directories follow &lt;strong&gt;local precedence&lt;/strong&gt;. Rename or namespace aggressively.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Discovery layout&lt;/strong&gt; — A misplaced &lt;code&gt;SKILL.md&lt;/code&gt; or wrong category folder can drop the skill from indexing entirely.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token load&lt;/strong&gt; — If sessions feel slow, shorten level-zero descriptions, move depth into &lt;code&gt;references/&lt;/code&gt;, and deduplicate giant tables.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent edits&lt;/strong&gt; — Hermes can create, patch, or delete skills via &lt;code&gt;skill_manage&lt;/code&gt;. Treat valuable skills like code: review diffs, export snapshots, and reset bundled skills deliberately when upgrades drift.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A tight regression loop beats rereading the whole file: &lt;code&gt;hermes chat --toolsets skills -q "Use the &amp;lt;skill&amp;gt; workflow to &amp;lt;concrete task&amp;gt;"&lt;/code&gt; should show the agent pulling the right disclosure level before it freestyles. If it never invokes &lt;code&gt;skill_view&lt;/code&gt;, your &lt;strong&gt;When to use&lt;/strong&gt; text or &lt;code&gt;description&lt;/code&gt; probably does not match how people phrase requests.&lt;/p&gt;

&lt;p&gt;Official references stay authoritative for behavior changes—the &lt;strong&gt;&lt;a href="https://hermes-agent.nousresearch.com/docs/user-guide/features/skills/" rel="noopener noreferrer"&gt;Skills System&lt;/a&gt;&lt;/strong&gt; user guide for runtime semantics, &lt;strong&gt;&lt;a href="https://hermes-agent.nousresearch.com/docs/developer-guide/creating-skills" rel="noopener noreferrer"&gt;Creating Skills&lt;/a&gt;&lt;/strong&gt; for author-facing rules, the &lt;strong&gt;&lt;a href="https://hermes-agent.nousresearch.com/docs/reference/skills-catalog" rel="noopener noreferrer"&gt;Bundled Skills Catalog&lt;/a&gt;&lt;/strong&gt; for copy-paste examples, and the &lt;strong&gt;&lt;a href="https://agentskills.io/specification" rel="noopener noreferrer"&gt;agentskills.io specification&lt;/a&gt;&lt;/strong&gt; for the shared file format Hermes aligns with.&lt;/p&gt;

</description>
      <category>selfhosting</category>
      <category>hermes</category>
      <category>aiagents</category>
      <category>llm</category>
    </item>
    <item>
      <title>Hermes Agent CLI cheat sheet — commands, flags, and slash shortcuts</title>
      <dc:creator>Rost</dc:creator>
      <pubDate>Mon, 04 May 2026 10:57:09 +0000</pubDate>
      <link>https://dev.to/rosgluk/hermes-agent-cli-cheat-sheet-commands-flags-and-slash-shortcuts-3pcb</link>
      <guid>https://dev.to/rosgluk/hermes-agent-cli-cheat-sheet-commands-flags-and-slash-shortcuts-3pcb</guid>
      <description>&lt;p&gt;Hermes Agent from Nous Research is a model-agnostic, tool-using assistant you run locally or on a VPS.&lt;/p&gt;

&lt;p&gt;Hermes does not lock you into one surface. You can use&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the classic &lt;strong&gt;&lt;code&gt;hermes&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;hermes chat&lt;/code&gt;&lt;/strong&gt; CLI, &lt;/li&gt;
&lt;li&gt;the full-screen &lt;strong&gt;&lt;code&gt;hermes --tui&lt;/code&gt;&lt;/strong&gt; session, &lt;/li&gt;
&lt;li&gt;a long-running &lt;strong&gt;&lt;code&gt;hermes gateway&lt;/code&gt;&lt;/strong&gt; for Telegram, Discord, Slack, and other messaging platforms,&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;&lt;code&gt;hermes dashboard&lt;/code&gt;&lt;/strong&gt; for a local browser UI when the web extra is installed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those paths share the same config and data under &lt;strong&gt;&lt;code&gt;~/.hermes&lt;/code&gt;&lt;/strong&gt;; this page lists &lt;strong&gt;shell commands&lt;/strong&gt; that matter across those modes.&lt;/p&gt;

&lt;p&gt;Below is a &lt;strong&gt;dense command reference&lt;/strong&gt; grouped by task.&lt;/p&gt;

&lt;h2&gt;
  
  
  Install Hermes Agent and first-run CLI commands
&lt;/h2&gt;

&lt;p&gt;For install and troubleshooting, start with &lt;a href="https://www.glukhov.org/ai-systems/hermes/" rel="noopener noreferrer"&gt;Hermes AI Assistant — Install, Setup, Workflow, and Troubleshooting&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;The installer pulls the repo, sets up a Python environment, and wires the &lt;code&gt;hermes&lt;/code&gt; executable. After &lt;code&gt;source ~/.bashrc&lt;/code&gt; or &lt;code&gt;~/.zshrc&lt;/code&gt;, your &lt;strong&gt;default entry point&lt;/strong&gt; for interactive chat is simply &lt;strong&gt;&lt;code&gt;hermes&lt;/code&gt;&lt;/strong&gt; (same family as &lt;strong&gt;&lt;code&gt;hermes chat&lt;/code&gt;&lt;/strong&gt;).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Download and run the official install script.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;hermes&lt;/code&gt; / &lt;code&gt;hermes chat&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Start interactive chat after install (default daily entry).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;hermes --version&lt;/code&gt; / &lt;code&gt;hermes version&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Print version information.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes completion bash&lt;/code&gt; / &lt;code&gt;zsh&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Print shell completion script for bash or zsh.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes update [--check] [--backup] [--restart-gateway]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Pull latest code, reinstall deps, optional pre-update home snapshot or gateway restart.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes uninstall [--full] [--yes]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Remove Hermes; optional full data deletion.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Native Windows is not supported; use &lt;strong&gt;WSL2&lt;/strong&gt;. Android installs via Termux follow a dedicated path in the upstream docs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Global flags for every &lt;code&gt;hermes&lt;/code&gt; invocation
&lt;/h2&gt;

&lt;p&gt;These flags apply before subcommands and change &lt;strong&gt;which profile&lt;/strong&gt;, &lt;strong&gt;which session&lt;/strong&gt;, or &lt;strong&gt;how much personal config&lt;/strong&gt; loads.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Flag&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;--profile&lt;/code&gt;, &lt;code&gt;-p&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Select Hermes profile for this run (overrides sticky default from &lt;code&gt;hermes profile use&lt;/code&gt;).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;--resume&lt;/code&gt;, &lt;code&gt;-r&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Resume a session by ID or title.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;--continue [name]&lt;/code&gt;, &lt;code&gt;-c&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Continue the latest session, or latest matching a title.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;--worktree&lt;/code&gt;, &lt;code&gt;-w&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Start in an isolated Git worktree for parallel agents.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--yolo&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Bypass dangerous-command approval prompts (use with care).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--pass-session-id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Include session ID in the system prompt.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--ignore-user-config&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Skip &lt;code&gt;~/.hermes/config.yaml&lt;/code&gt; (defaults only); &lt;code&gt;.env&lt;/code&gt; still loads.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--ignore-rules&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Skip auto-injection of AGENTS.md, SOUL.md, &lt;code&gt;.cursorrules&lt;/code&gt;, memory, preloaded skills.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--tui&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Launch the TUI (&lt;code&gt;HERMES_TUI=1&lt;/code&gt; equivalent).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--dev&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;With &lt;code&gt;--tui&lt;/code&gt;, run TS sources via &lt;code&gt;tsx&lt;/code&gt; for TUI development.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Isolated automation often pairs &lt;strong&gt;&lt;code&gt;hermes chat --ignore-user-config --ignore-rules&lt;/code&gt;&lt;/strong&gt; with &lt;strong&gt;&lt;code&gt;hermes -z&lt;/code&gt;&lt;/strong&gt; for reproducible one-shots.&lt;/p&gt;
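
&lt;p&gt;A sketch of that pairing for a cron-friendly one-shot, assuming global flags compose with &lt;code&gt;-z&lt;/code&gt; as described above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# defaults only, no rules injection, final answer alone on stdout
hermes --ignore-user-config --ignore-rules -z "Summarize open TODOs in this repo" &amp;gt; todos.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
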

&lt;h2&gt;
  
  
  &lt;code&gt;hermes chat&lt;/code&gt;, one-shot prompts, and &lt;code&gt;hermes -z&lt;/code&gt;
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command / pattern&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes chat&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Interactive or scripted chat; main surface for &lt;code&gt;-q&lt;/code&gt;, &lt;code&gt;-m&lt;/code&gt;, &lt;code&gt;--provider&lt;/code&gt;, toolsets, resume, worktree, checkpoints.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes chat -q "..."&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;One-shot prompt (non-interactive); keeps richer output than &lt;code&gt;-z&lt;/code&gt; when tools run.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes -z "..."&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Scripted one-shot&lt;/strong&gt; — final answer only on stdout, no banner or session noise. Same agent and tools; best for pipes and scripts.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;hermes chat --quiet&lt;/code&gt;, &lt;code&gt;-Q&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Quieter programmatic mode (banner and tool previews suppressed).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;-m&lt;/code&gt; / &lt;code&gt;--model&lt;/code&gt;, &lt;code&gt;--provider&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Per-run model and provider overrides; env &lt;code&gt;HERMES_INFERENCE_MODEL&lt;/code&gt; / &lt;code&gt;HERMES_INFERENCE_PROVIDER&lt;/code&gt; mirror behavior.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;-t&lt;/code&gt; / &lt;code&gt;--toolsets&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Enable comma-separated toolsets for the run.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;-s&lt;/code&gt; / &lt;code&gt;--skills&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Preload skills (repeat or comma-separated).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--image path&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Attach a local image to a single query.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--checkpoints&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Enable filesystem checkpoints before destructive edits.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--max-turns N&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Cap tool-calling iterations per turn (default from config).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--source&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Session source tag (&lt;code&gt;cli&lt;/code&gt; vs &lt;code&gt;tool&lt;/code&gt; for integrations).&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Hermes model outside the session vs &lt;code&gt;/model&lt;/code&gt; inside it&lt;/strong&gt; — Running &lt;strong&gt;&lt;code&gt;hermes model&lt;/code&gt;&lt;/strong&gt; from the shell is where you &lt;strong&gt;add providers&lt;/strong&gt;, keys, and OAuth. Slash &lt;strong&gt;&lt;code&gt;/model&lt;/code&gt;&lt;/strong&gt; only switches among &lt;strong&gt;already configured&lt;/strong&gt; providers. If you only see OpenRouter in &lt;code&gt;/model&lt;/code&gt;, exit the session and complete &lt;strong&gt;&lt;code&gt;hermes model&lt;/code&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model picker, credential pools, and fallback providers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes model&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Interactive provider and model picker; keys, OAuth, custom endpoints.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes auth&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Credential pools — &lt;code&gt;add&lt;/code&gt;, &lt;code&gt;list&lt;/code&gt;, &lt;code&gt;remove&lt;/code&gt;, &lt;code&gt;reset&lt;/code&gt; for rotation-friendly keys and OAuth.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes fallback [list | add | …]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Manage fallback providers (list, add, …).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes setup [model | tts | …]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Guided setup flows (model, tts, …).&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Deprecated &lt;strong&gt;&lt;code&gt;hermes login&lt;/code&gt; / &lt;code&gt;hermes logout&lt;/code&gt;&lt;/strong&gt; — use &lt;strong&gt;&lt;code&gt;hermes auth&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;hermes model&lt;/code&gt;&lt;/strong&gt; instead.&lt;/p&gt;

&lt;p&gt;Picking local OpenAI-compatible endpoints versus hosted APIs for &lt;strong&gt;&lt;code&gt;hermes model&lt;/code&gt;&lt;/strong&gt; sits on the same trade-offs as general &lt;a href="https://www.glukhov.org/llm-hosting/" rel="noopener noreferrer"&gt;LLM hosting&lt;/a&gt; (latency, cost, ops).&lt;/p&gt;

&lt;h2&gt;
  
  
  Config files and &lt;code&gt;hermes config&lt;/code&gt; commands
&lt;/h2&gt;

&lt;p&gt;Configuration resolves as &lt;strong&gt;CLI overrides → &lt;code&gt;config.yaml&lt;/code&gt; → &lt;code&gt;.env&lt;/code&gt; → defaults&lt;/strong&gt;. API keys belong in &lt;strong&gt;&lt;code&gt;.env&lt;/code&gt;&lt;/strong&gt;; structured settings in &lt;strong&gt;&lt;code&gt;config.yaml&lt;/code&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes config show&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Display effective configuration.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes config edit&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Open &lt;code&gt;config.yaml&lt;/code&gt; in &lt;code&gt;$EDITOR&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes config set key value&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Set values (secrets routed to &lt;code&gt;.env&lt;/code&gt;, non-secrets to YAML).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;hermes config path&lt;/code&gt; / &lt;code&gt;hermes config env-path&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Print paths to config and env files.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes config check&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Detect missing or stale settings.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes config migrate&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Apply newly introduced options interactively.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Where files live&lt;/strong&gt; — Everything sits under &lt;strong&gt;&lt;code&gt;HERMES_HOME&lt;/code&gt;&lt;/strong&gt; (default &lt;strong&gt;&lt;code&gt;~/.hermes&lt;/code&gt;&lt;/strong&gt;) for config, secrets, memories, skills, sessions, gateway state, and logs.&lt;/p&gt;
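
&lt;p&gt;To see where your install actually resolves these, a quick check using the commands from the table above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;hermes config path       # config.yaml location
hermes config env-path   # .env location
hermes config show       # effective configuration after overrides
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
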

&lt;h2&gt;
  
  
  Session management and &lt;code&gt;hermes profile&lt;/code&gt;
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes sessions list&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;List recent sessions.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes sessions browse&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Interactive picker with search and resume.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes sessions export&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Export sessions (e.g. JSONL).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;hermes sessions delete&lt;/code&gt;, &lt;code&gt;prune&lt;/code&gt;, &lt;code&gt;rename&lt;/code&gt;, &lt;code&gt;stats&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Delete one session, prune old ones, rename titles, show store stats.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes profile list&lt;/code&gt; / &lt;code&gt;use&lt;/code&gt; / …&lt;/td&gt;
&lt;td&gt;List profiles or switch the sticky default profile.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;hermes profile export&lt;/code&gt; / &lt;code&gt;import&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Archive or restore a profile tarball.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes profile alias&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Short wrapper scripts for fast profile switching.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Use &lt;strong&gt;&lt;code&gt;hermes -p work chat -q "..."&lt;/code&gt;&lt;/strong&gt; for ad hoc runs without changing the sticky default profile.&lt;/p&gt;
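
&lt;p&gt;A hedged sketch for moving a profile between machines. The subcommands come from the table above; the exact argument shapes are assumptions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Archive the current profile to a tarball (argument form is hypothetical)
hermes profile export work

# On the target machine, restore from that tarball (path is hypothetical)
hermes profile import work-profile.tar.gz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;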

&lt;h2&gt;
  
  
  Skills hub, toolsets, shell hooks, and plugins
&lt;/h2&gt;

&lt;p&gt;For profile-first configuration and skills tuned to real production workflows by role, see &lt;a href="https://www.glukhov.org/ai-systems/hermes/production-setup/" rel="noopener noreferrer"&gt;Hermes AI Assistant Skills for Real Production Setups&lt;/a&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes tools&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Interactive per-platform tool enablement; &lt;code&gt;--summary&lt;/code&gt; prints current choices.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;hermes skills browse&lt;/code&gt;, &lt;code&gt;search&lt;/code&gt;, &lt;code&gt;inspect&lt;/code&gt;, &lt;code&gt;install&lt;/code&gt;, &lt;code&gt;list&lt;/code&gt;, &lt;code&gt;check&lt;/code&gt;, &lt;code&gt;update&lt;/code&gt;, &lt;code&gt;audit&lt;/code&gt;, &lt;code&gt;uninstall&lt;/code&gt;, &lt;code&gt;publish&lt;/code&gt;, &lt;code&gt;snapshot&lt;/code&gt;, &lt;code&gt;tap&lt;/code&gt;, &lt;code&gt;config&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Skills hub workflows including registries and URL installs.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;hermes curator status&lt;/code&gt;, &lt;code&gt;run&lt;/code&gt;, &lt;code&gt;pause&lt;/code&gt;, &lt;code&gt;pin&lt;/code&gt;, &lt;code&gt;rollback&lt;/code&gt;, …&lt;/td&gt;
&lt;td&gt;Background skill maintenance and safe rollback.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;hermes hooks list&lt;/code&gt;, &lt;code&gt;test&lt;/code&gt;, &lt;code&gt;revoke&lt;/code&gt;, &lt;code&gt;doctor&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Declared shell hooks and allowlists in config.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes plugins&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Composite UI or subcommands to install, enable, disable, remove plugins.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
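
&lt;p&gt;A typical hub flow. The search term and skill name are placeholders; the subcommands come from the table above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Find, vet, and install a skill ("changelog" and "some-skill" are illustrative)
hermes skills search changelog
hermes skills inspect some-skill
hermes skills install some-skill

# Confirm what is now active
hermes skills list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;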

&lt;h2&gt;
  
  
  Built-in memory and &lt;code&gt;hermes memory&lt;/code&gt; providers
&lt;/h2&gt;

&lt;p&gt;Built-in &lt;strong&gt;MEMORY.md&lt;/strong&gt; / &lt;strong&gt;USER.md&lt;/strong&gt; stay active; external providers add optional recall layers. For how that architecture behaves in practice, read &lt;a href="https://www.glukhov.org/ai-systems/hermes/hermes-agent-memory-system/" rel="noopener noreferrer"&gt;Hermes Agent Memory System — How Persistent AI Memory Actually Works&lt;/a&gt;. To compare external backends and activation trade-offs, see &lt;a href="https://www.glukhov.org/ai-systems/memory/agent-memory-providers/" rel="noopener noreferrer"&gt;Agent Memory Providers Compared — Honcho, Mem0, Hindsight, and Five More&lt;/a&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes memory setup&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Interactive external memory provider configuration.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes memory status&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Show active provider settings.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes memory off&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Disable external provider; built-in files remain.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;When a provider is active it may register extra provider-specific top-level subcommands — run &lt;strong&gt;&lt;code&gt;hermes --help&lt;/code&gt;&lt;/strong&gt; to see what is wired today.&lt;/p&gt;
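
&lt;p&gt;A minimal lifecycle sketch using only the commands above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Configure an external provider interactively
hermes memory setup

# Confirm which provider is active and how it is wired
hermes memory status

# Drop back to the built-in MEMORY.md / USER.md files only
hermes memory off
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;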

&lt;h2&gt;
  
  
  Messaging gateway, DM pairing, and platforms
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes gateway setup&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Interactive messaging platform setup.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes gateway run&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Foreground gateway (recommended on &lt;strong&gt;WSL&lt;/strong&gt;, Docker, Termux).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;hermes gateway start&lt;/code&gt; / &lt;code&gt;stop&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Start or stop the background gateway.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;hermes gateway install&lt;/code&gt; / &lt;code&gt;uninstall&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Install or remove the gateway service.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;hermes pairing list&lt;/code&gt; / &lt;code&gt;approve&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;List and approve DM pairing requests.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes whatsapp&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;WhatsApp bridge pairing flow.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes slack manifest&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Generate Slack app manifest with gateway slash parity.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;On &lt;strong&gt;WSL&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;hermes gateway run&lt;/code&gt;&lt;/strong&gt; inside &lt;strong&gt;tmux&lt;/strong&gt; is the resilient pattern when &lt;strong&gt;&lt;code&gt;gateway start&lt;/code&gt;&lt;/strong&gt; misbehaves.&lt;/p&gt;
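
&lt;p&gt;A minimal sketch of that pattern (the session name is arbitrary):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Run the gateway in a detached tmux session that survives the terminal
tmux new-session -d -s hermes-gw 'hermes gateway run'

# Reattach later to inspect output
tmux attach -t hermes-gw
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;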

&lt;h2&gt;
  
  
  Cron scheduler, webhooks, and Kanban
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes cron …&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Create, edit, pause, resume, run, remove scheduled prompts (&lt;code&gt;tick&lt;/code&gt; for manual scheduler pass).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;hermes webhook subscribe&lt;/code&gt;, &lt;code&gt;list&lt;/code&gt;, &lt;code&gt;remove&lt;/code&gt;, &lt;code&gt;test&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Dynamic webhook routes for event-driven runs.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes kanban …&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Multi-profile task board backed by SQLite; &lt;code&gt;dispatch&lt;/code&gt; drives workers.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  &lt;code&gt;hermes doctor&lt;/code&gt;, logs, backup, and usage insights
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes doctor [--fix]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Interactive diagnostics and optional auto-repair.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes status [--all] [--deep]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Concise status; deeper checks when needed.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes dump [--show-keys]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Paste-friendly setup summary for Discord or GitHub issues.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes debug share&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Upload redacted debug bundle to a paste service (or &lt;code&gt;--local&lt;/code&gt;).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes logs [agent | errors | …] [-f]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Tail a log stream; &lt;code&gt;-f&lt;/code&gt; follows live output.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes backup&lt;/code&gt;, &lt;code&gt;hermes import&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Zip snapshots of home data and restore paths.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes insights [--days N] [--source …]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Token, cost, and activity analytics.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;When something breaks after an upgrade, &lt;strong&gt;&lt;code&gt;hermes doctor&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;hermes status&lt;/code&gt;&lt;/strong&gt;, and &lt;strong&gt;&lt;code&gt;hermes logs errors -f&lt;/code&gt;&lt;/strong&gt; form the fastest triage loop.&lt;/p&gt;
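
&lt;p&gt;The same loop in script form:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;hermes doctor            # diagnose; add --fix to attempt auto-repair
hermes status --all      # broad health snapshot
hermes logs errors -f    # follow the error stream while reproducing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;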

&lt;h2&gt;
  
  
  MCP, ACP, web dashboard, and OpenClaw migration
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes mcp serve&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Run Hermes as an MCP server.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;hermes mcp add&lt;/code&gt;, &lt;code&gt;remove&lt;/code&gt;, &lt;code&gt;list&lt;/code&gt;, &lt;code&gt;test&lt;/code&gt;, &lt;code&gt;configure&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Manage MCP client connections from Hermes.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes acp&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Agent Client Protocol stdio server for editors (extra install may apply).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes dashboard [--port …] [--host …]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Local web dashboard (&lt;code&gt;pip install hermes-agent[web]&lt;/code&gt;).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hermes claw migrate …&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Migrate OpenClaw-style configs into Hermes (&lt;code&gt;--dry-run&lt;/code&gt;, presets, optional secrets).&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;OpenClaw migration&lt;/strong&gt; — &lt;code&gt;hermes claw migrate&lt;/code&gt; reads legacy OpenClaw home directories; for what that stack looked like before moving, see the &lt;a href="https://www.glukhov.org/ai-systems/openclaw/" rel="noopener noreferrer"&gt;OpenClaw case study&lt;/a&gt;.&lt;/p&gt;
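
&lt;p&gt;A safe first pass, using the &lt;code&gt;--dry-run&lt;/code&gt; flag from the table above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Preview what would be migrated before touching anything
hermes claw migrate --dry-run

# Run the real migration once the plan looks right
hermes claw migrate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;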

&lt;h2&gt;
  
  
  Slash commands in the Hermes CLI session
&lt;/h2&gt;

&lt;p&gt;Type &lt;strong&gt;&lt;code&gt;/&lt;/code&gt;&lt;/strong&gt; for autocomplete. Commands are &lt;strong&gt;case-insensitive&lt;/strong&gt;; skills register extra &lt;strong&gt;&lt;code&gt;/skill-name&lt;/code&gt;&lt;/strong&gt; routes. The tables below are a curated subset; for the full registry see &lt;strong&gt;Official Hermes Agent documentation&lt;/strong&gt; at the end of this article.&lt;/p&gt;

&lt;h3&gt;
  
  
  Session flow, background tasks, and goals
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;/new&lt;/code&gt;, &lt;code&gt;/reset&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;New session ID and history.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/resume [name]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Resume a named session.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/compress [focus]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Manual context compression with optional focus topic.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;/retry&lt;/code&gt;, &lt;code&gt;/undo&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Retry last turn or drop last exchange.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/title …&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Name the session for later &lt;code&gt;/resume&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;/background …&lt;/code&gt;, &lt;code&gt;/queue …&lt;/code&gt;, &lt;code&gt;/steer …&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Parallel background run, queued next prompt, mid-loop nudge after next tool.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/goal …&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Persistent multi-turn objective with judge loop (&lt;code&gt;status&lt;/code&gt;, &lt;code&gt;pause&lt;/code&gt;, &lt;code&gt;resume&lt;/code&gt;, &lt;code&gt;clear&lt;/code&gt;).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;/branch&lt;/code&gt;, &lt;code&gt;/fork&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Branch the conversation for alternate exploration.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Models, tool toggles, skills, and reload
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/model … [--global]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Switch models among configured providers; &lt;code&gt;--global&lt;/code&gt; persists default.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;/tools …&lt;/code&gt;, &lt;code&gt;/toolsets&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Session tool toggles and toolset listing.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/skills …&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Search, install, and manage skills from chat.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/cron …&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Scheduled tasks UI from the CLI session.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/reload-mcp&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Reload MCP servers from config.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/reload&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Reload &lt;code&gt;.env&lt;/code&gt; into the running session without restart.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Usage, help, and quitting
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;/usage&lt;/code&gt;, &lt;code&gt;/insights&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Token and cost visibility; analytics snapshot.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;/help&lt;/code&gt;, &lt;code&gt;/quit&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Help or exit the CLI.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Messaging apps (Telegram, Discord, Slack, and others) expose an overlapping slash set plus &lt;strong&gt;&lt;code&gt;/approve&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;/restart&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;/commands&lt;/code&gt;&lt;/strong&gt;, and related gateway-only helpers — platform differences are documented in the slash command reference under &lt;strong&gt;Official Hermes Agent documentation&lt;/strong&gt; below.&lt;/p&gt;

&lt;h2&gt;
  
  
  More useful reading
&lt;/h2&gt;

&lt;p&gt;Related pages on this site (broader context for Hermes and terminal agents):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.glukhov.org/ai-systems/" rel="noopener noreferrer"&gt;AI Systems — Self-Hosted Assistants, RAG, and Local Infrastructure&lt;/a&gt; — cluster overview and how assistants fit the stack&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.glukhov.org/ai-systems/memory/" rel="noopener noreferrer"&gt;AI Systems Memory&lt;/a&gt; — memory hub and adjacent guides&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.glukhov.org/ai-devtools/" rel="noopener noreferrer"&gt;AI Developer Tools&lt;/a&gt; — terminal and IDE tooling landscape&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.glukhov.org/ai-devtools/opencode/" rel="noopener noreferrer"&gt;OpenCode Quickstart&lt;/a&gt; — another terminal-first agent for ergonomic comparison&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Official Hermes Agent documentation
&lt;/h2&gt;

&lt;p&gt;Upstream documentation on &lt;em&gt;hermes-agent.nousresearch.com&lt;/em&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://hermes-agent.nousresearch.com/" rel="noopener noreferrer"&gt;Hermes Agent&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hermes-agent.nousresearch.com/docs/reference/cli-commands" rel="noopener noreferrer"&gt;CLI commands reference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hermes-agent.nousresearch.com/docs/reference/slash-commands" rel="noopener noreferrer"&gt;Slash commands reference&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Tip.&lt;/strong&gt; Keep &lt;strong&gt;&lt;code&gt;hermes dump&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;hermes doctor --fix&lt;/code&gt;&lt;/strong&gt; in muscle memory — they turn vague "something broke" reports into actionable diffs against a known-good setup.&lt;/p&gt;

</description>
      <category>cheatsheet</category>
      <category>hermes</category>
      <category>selfhosting</category>
      <category>llm</category>
    </item>
    <item>
      <title>MinIO CE in 2026: Retired Upstream, Source-Only, and What to Use</title>
      <dc:creator>Rost</dc:creator>
      <pubDate>Mon, 04 May 2026 10:56:52 +0000</pubDate>
      <link>https://dev.to/rosgluk/minio-ce-in-2026-retired-upstream-source-only-and-what-to-use-1k02</link>
      <guid>https://dev.to/rosgluk/minio-ce-in-2026-retired-upstream-source-only-and-what-to-use-1k02</guid>
      <description>&lt;p&gt;MinIO Community Edition is no longer a safe default for new production systems.  &lt;/p&gt;

&lt;p&gt;As of 2026, the public project status and distribution model changed enough that many teams now treat MinIO CE as end of life for serious workloads.&lt;/p&gt;

&lt;p&gt;If you are deciding whether to keep MinIO CE, fork it, or migrate, this guide gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a factual timeline of what changed&lt;/li&gt;
&lt;li&gt;the practical risk for operators&lt;/li&gt;
&lt;li&gt;a technical comparison of SeaweedFS, Garage, RustFS, and Ceph RGW&lt;/li&gt;
&lt;li&gt;a migration plan you can execute in phases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For broader context around storage, databases, and search in production AI stacks, see the &lt;a href="https://www.glukhov.org/data-infrastructure/" rel="noopener noreferrer"&gt;Data Infrastructure for AI Systems&lt;/a&gt; pillar.&lt;/p&gt;

&lt;h2&gt;
  
  
  What happened to MinIO CE
&lt;/h2&gt;

&lt;p&gt;The community concern is not one single event. It is the sequence.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Date&lt;/th&gt;
&lt;th&gt;What changed&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;May 2025&lt;/td&gt;
&lt;td&gt;Key management features moved out of CE path&lt;/td&gt;
&lt;td&gt;Reduced CE parity for auth and admin workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Oct 2025&lt;/td&gt;
&lt;td&gt;Community Docker images and public binaries stopped&lt;/td&gt;
&lt;td&gt;Operators must build and verify from source&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dec 2025&lt;/td&gt;
&lt;td&gt;Public maintenance mode messaging became explicit&lt;/td&gt;
&lt;td&gt;Fewer expectations for active OSS iteration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Feb 2026&lt;/td&gt;
&lt;td&gt;Repository archived for the first time&lt;/td&gt;
&lt;td&gt;Read only state blocks normal OSS collaboration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Apr 2026&lt;/td&gt;
&lt;td&gt;Repository archived again and stayed locked&lt;/td&gt;
&lt;td&gt;Confirms long term frozen upstream posture&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The core operational impact is simple:&lt;br&gt;&lt;br&gt;
you inherit more supply-chain, patching, and maintenance responsibility than most teams expect from a mainstream S3-compatible store.&lt;/p&gt;

&lt;h2&gt;
  
  
  Is MinIO still open source in 2026
&lt;/h2&gt;

&lt;p&gt;A common question is whether MinIO is still open source at all.&lt;/p&gt;

&lt;p&gt;The server code in the public repository is still under AGPLv3.&lt;br&gt;&lt;br&gt;
However, the practical community path changed from normal binary-first consumption to source-first self-builds.&lt;br&gt;&lt;br&gt;
For many teams, that feels less like a living OSS ecosystem and more like unsupported source availability.&lt;/p&gt;

&lt;p&gt;So the accurate answer is nuanced:&lt;br&gt;&lt;br&gt;
the license remains open source, but operationally the community experience is no longer what most platform teams need for low-risk production adoption.&lt;/p&gt;

&lt;h2&gt;
  
  
  Is MinIO CE safe for new production deployments
&lt;/h2&gt;

&lt;p&gt;For greenfield deployments, usually no, especially when compared with documented options in this &lt;a href="https://www.glukhov.org/data-infrastructure/object-storage/garage-vs-minio-vs-s3/" rel="noopener noreferrer"&gt;MinIO vs Garage vs AWS S3 comparison&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why the risk profile changed
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Patch cadence risk&lt;/strong&gt;
no stable, trusted community binary channel means every CVE cycle becomes your build and release cycle&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verification burden&lt;/strong&gt;
your team must own provenance, repeatability, and rollback strategy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ecosystem drift risk&lt;/strong&gt;
tooling that assumed public images may lag or break&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;People risk&lt;/strong&gt;
senior SRE and security time is consumed by platform plumbing instead of product work&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you already run MinIO CE internally, this does not mean panic shutdown.&lt;br&gt;&lt;br&gt;
It means treating the platform as controlled technical debt and putting a migration runway on your roadmap.&lt;/p&gt;

&lt;h2&gt;
  
  
  Community verdict and market response
&lt;/h2&gt;

&lt;p&gt;Across operator communities in 2025 to 2026, the pattern is consistent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fewer teams choose MinIO CE for net new deployments&lt;/li&gt;
&lt;li&gt;more teams evaluate Garage and SeaweedFS first&lt;/li&gt;
&lt;li&gt;enterprise teams with strict S3 semantics often move toward Ceph RGW&lt;/li&gt;
&lt;li&gt;RustFS gets attention as a direct successor style option, but with alpha caution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This trend matters because platform safety is partly social:&lt;br&gt;&lt;br&gt;
healthy ecosystems reduce integration risk, improve troubleshooting velocity, and widen hiring pools.&lt;/p&gt;

&lt;h2&gt;
  
  
  Best alternatives to MinIO CE
&lt;/h2&gt;

&lt;h2&gt;
  
  
  SeaweedFS
&lt;/h2&gt;

&lt;p&gt;SeaweedFS is a strong option when you care about huge object counts, small file behavior, and practical efficiency in commodity environments.&lt;/p&gt;

&lt;h3&gt;
  
  
  Choose SeaweedFS when
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;you need high small-object density&lt;/li&gt;
&lt;li&gt;you prefer Apache 2.0 governance and licensing clarity&lt;/li&gt;
&lt;li&gt;you want production readiness without the heavy footprint of Ceph&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Garage
&lt;/h2&gt;

&lt;p&gt;Garage is attractive for lightweight self-hosted clusters, edge nodes, and geo distributed deployments on modest hardware.&lt;/p&gt;

&lt;p&gt;If you want a concrete setup path, use this &lt;a href="https://www.glukhov.org/data-infrastructure/object-storage/garage-quickstart/" rel="noopener noreferrer"&gt;Garage S3 quickstart&lt;/a&gt; to validate replication and operations before migration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Choose Garage when
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;resource efficiency matters more than full S3 feature parity&lt;/li&gt;
&lt;li&gt;you run mixed ARM or small node environments&lt;/li&gt;
&lt;li&gt;you want simple operations over maximal feature surface&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  RustFS
&lt;/h2&gt;

&lt;p&gt;RustFS is frequently discussed as the closest successor narrative to MinIO style deployment and UX.&lt;/p&gt;

&lt;h3&gt;
  
  
  Choose RustFS when
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;you accept alpha-stage software risk&lt;/li&gt;
&lt;li&gt;you can test deeply before production&lt;/li&gt;
&lt;li&gt;you want to track a fast moving project with potential upside&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For regulated or high uptime systems, keep RustFS in pilot until maturity is proven in your own reliability tests.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ceph RGW
&lt;/h2&gt;

&lt;p&gt;Ceph RGW remains the enterprise heavyweight with broad capability and scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  Choose Ceph RGW when
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;you need mature enterprise S3 behavior&lt;/li&gt;
&lt;li&gt;your team already has Ceph operational expertise&lt;/li&gt;
&lt;li&gt;you can support higher infrastructure and on-call complexity&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Which object store is best for your use case
&lt;/h2&gt;

&lt;p&gt;Use this pragmatic filter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Small team and low ops budget&lt;/strong&gt;
start with Garage or SeaweedFS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Large enterprise and strict compatibility needs&lt;/strong&gt;
prefer Ceph RGW&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Experimental migration from MinIO style workflows&lt;/strong&gt;
pilot RustFS, but keep rollback options&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No option is universally best.&lt;br&gt;&lt;br&gt;
The correct target depends on required S3 features, RPO and RTO goals, team maturity, and how much platform ownership you want.&lt;/p&gt;

&lt;p&gt;If your team still needs legacy MinIO background before deciding, this &lt;a href="https://www.glukhov.org/data-infrastructure/object-storage/minio-vs-aws-s3/" rel="noopener noreferrer"&gt;MinIO vs AWS S3 overview&lt;/a&gt; and this &lt;a href="https://www.glukhov.org/data-infrastructure/object-storage/minio-cheatsheet/" rel="noopener noreferrer"&gt;MinIO command cheatsheet&lt;/a&gt; help with current-state audits.&lt;/p&gt;

&lt;h2&gt;
  
  
  Migration plan from MinIO CE
&lt;/h2&gt;

&lt;p&gt;If you are currently on MinIO CE, this phased approach avoids risky big-bang moves.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 1 inventory and risk scoring
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;list buckets, object counts, and growth rates (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;classify workloads by criticality and recovery objectives&lt;/li&gt;
&lt;li&gt;identify hard S3 dependencies such as versioning, object lock, or policy behavior&lt;/li&gt;
&lt;/ul&gt;
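
&lt;p&gt;For the inventory step, a minimal sketch with the MinIO client. The alias, endpoint, credentials, and bucket names are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Point mc at the existing CE deployment
mc alias set legacy http://minio.internal:9000 ACCESS_KEY SECRET_KEY

# Enumerate buckets, then size one of them
mc ls legacy
mc du legacy/assets

# Check whether a bucket actually relies on versioning
mc version info legacy/assets
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;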

&lt;h3&gt;
  
  
  Phase 2 proof of compatibility
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;stand up one or two candidate platforms&lt;/li&gt;
&lt;li&gt;replay representative read and write workloads&lt;/li&gt;
&lt;li&gt;verify auth, lifecycle rules, retention behavior, and SDK edge cases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Plan to instrument your pilot from day one with metrics and alerts from the &lt;a href="https://www.glukhov.org/observability/" rel="noopener noreferrer"&gt;Observability pillar&lt;/a&gt; so migration regressions are measurable rather than anecdotal.&lt;/p&gt;
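
&lt;p&gt;A minimal compatibility probe with the AWS CLI. The endpoint, bucket, and key are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Round-trip a small object through the candidate store
aws --endpoint-url http://candidate:3900 s3api put-object \
    --bucket pilot --key probe.txt --body ./probe.txt
aws --endpoint-url http://candidate:3900 s3api get-object \
    --bucket pilot --key probe.txt /tmp/probe.out

# Verify the S3 features your workloads actually depend on
aws --endpoint-url http://candidate:3900 s3api get-bucket-versioning --bucket pilot
aws --endpoint-url http://candidate:3900 s3api get-object-lock-configuration --bucket pilot
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;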

&lt;h3&gt;
  
  
  Phase 3 pilot cutover
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;migrate one low blast radius workload first&lt;/li&gt;
&lt;li&gt;run dual read validation where possible (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;measure latency, error rates, and operational overhead&lt;/li&gt;
&lt;/ul&gt;
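
&lt;p&gt;Dual read validation can start as simple checksum comparison; the aliases and object key are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# The same object read from old and new stores should hash identically
mc cat legacy/assets/report.pdf | sha256sum
mc cat candidate/assets/report.pdf | sha256sum
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;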

&lt;h3&gt;
  
  
  Phase 4 production migration
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;migrate high priority internet facing workloads first&lt;/li&gt;
&lt;li&gt;keep rollback artifacts and retention windows&lt;/li&gt;
&lt;li&gt;document final runbooks before decommissioning MinIO CE paths&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Bottom line
&lt;/h2&gt;

&lt;p&gt;MinIO CE may still run, but it is no longer the low-friction default for new production object storage.&lt;br&gt;&lt;br&gt;
Treat current clusters as transition infrastructure, not a long-horizon foundation.&lt;/p&gt;

&lt;p&gt;For most teams in 2026, the safer direction is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SeaweedFS or Garage for pragmatic self hosted deployments&lt;/li&gt;
&lt;li&gt;Ceph RGW for enterprise scale and mature S3 requirements&lt;/li&gt;
&lt;li&gt;RustFS for monitored pilot environments only&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Make the migration decision early while you can still choose your timeline instead of reacting to the next forced change.&lt;/p&gt;

</description>
      <category>minio</category>
      <category>garage</category>
      <category>s3</category>
      <category>selfhosting</category>
    </item>
    <item>
      <title>NemoClaw practical guide for secure OpenClaw operations in 2026</title>
      <dc:creator>Rost</dc:creator>
      <pubDate>Fri, 01 May 2026 11:02:25 +0000</pubDate>
      <link>https://dev.to/rosgluk/nemoclaw-practical-guide-for-secure-openclaw-operations-in-2026-1ck0</link>
      <guid>https://dev.to/rosgluk/nemoclaw-practical-guide-for-secure-openclaw-operations-in-2026-1ck0</guid>
      <description>&lt;p&gt;Most AI agent stacks still treat security as a post-demo fix.&lt;br&gt;
NemoClaw starts from the opposite assumption and makes isolation, policy, and routing day-zero defaults.&lt;/p&gt;



&lt;p&gt;OpenClaw stays the assistant while OpenShell stays the enforcement layer, and NemoClaw acts as the opinionated glue between them. That glue matters because it makes the safer path easier to install, easier to observe, and much harder to skip when you are rushing.&lt;/p&gt;

&lt;p&gt;That is exactly why NemoClaw matters in 2026. It is not just another wrapper around an LLM agent, because it is designed as a reference stack for running always-on OpenClaw assistants inside sandboxed OpenShell containers, with routed inference, policy-based egress control, and lifecycle tooling built in from day one. If you want broader context for where this fits, start with the &lt;a href="https://www.glukhov.org/ai-systems/" rel="noopener noreferrer"&gt;AI Systems hub&lt;/a&gt; and the baseline &lt;a href="https://www.glukhov.org/ai-systems/openclaw/" rel="noopener noreferrer"&gt;OpenClaw system overview&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;There is one hard truth you should not bury for the sake of hype. NVIDIA marks NemoClaw as alpha software in early preview, beginning March 16, 2026, and explicitly warns that interfaces and behavior may still shift between releases. Treat it like a serious lab tool, not finished production furniture.&lt;/p&gt;
&lt;h2&gt;
  
  
  What NemoClaw is and when to use it
&lt;/h2&gt;

&lt;p&gt;NemoClaw exists for a specific operational job rather than for experimentation theater. It gives you a practical way to run an always-on OpenClaw assistant with guardrails around network access, filesystem access, process privileges, and model routing. If you have ever looked at an autonomous agent and thought that it should not have casual host access, NemoClaw is a strong answer to that discomfort.&lt;/p&gt;

&lt;p&gt;The stack is easier to understand when you separate the layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenClaw is the assistant runtime, tools, memory, and behaviour inside the container.&lt;/li&gt;
&lt;li&gt;OpenShell is the execution environment that provides sandbox lifecycle, credential-storing gateway, inference proxying, and policy enforcement.&lt;/li&gt;
&lt;li&gt;NemoClaw is the opinionated reference stack that onboards, configures, and operates OpenClaw correctly inside OpenShell.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That distinction matters because it explains the product's purpose. NemoClaw is not trying to replace OpenClaw. It is trying to make OpenClaw survivable in real environments.&lt;/p&gt;

&lt;p&gt;Typical use cases are obvious and sensible:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;running an always-on assistant with controlled egress&lt;/li&gt;
&lt;li&gt;testing agent behaviour before granting broader access&lt;/li&gt;
&lt;li&gt;pushing a sandboxed assistant onto a remote GPU host for persistent operation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My blunt take is this. If you only want a throwaway demo, raw OpenClaw is simpler and faster to get moving, and the &lt;a href="https://www.glukhov.org/ai-systems/openclaw/quickstart/" rel="noopener noreferrer"&gt;OpenClaw quickstart&lt;/a&gt; is the fastest route. If you want something that behaves like it belongs on an actual machine, NemoClaw is the more serious choice because its defaults are built for operators instead of screenshots.&lt;/p&gt;
&lt;h2&gt;
  
  
  NemoClaw security and operations features that matter
&lt;/h2&gt;

&lt;p&gt;A long feature list is cheap. The right feature list is not. These are the capabilities that actually change how you operate the system.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Guided onboarding&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;nemoclaw onboard&lt;/code&gt; validates prerequisites, credentials, providers, and policy before the sandbox is created.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hardened blueprint&lt;/td&gt;
&lt;td&gt;NemoClaw builds on a versioned blueprint and a security-first image rather than a pile of one-off shell steps.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Routed inference&lt;/td&gt;
&lt;td&gt;The agent talks to &lt;code&gt;inference.local&lt;/code&gt;, while provider credentials stay on the host.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Layered protection&lt;/td&gt;
&lt;td&gt;Network, filesystem, process, and inference controls are enforced together instead of as optional extras.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Policy tiers and presets&lt;/td&gt;
&lt;td&gt;You can start restricted and selectively add access for package registries, search, messaging, or other services.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;State management&lt;/td&gt;
&lt;td&gt;Snapshots and rebuild flows exist so upgrades do not have to mean memory loss.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Channel messaging&lt;/td&gt;
&lt;td&gt;Telegram, Discord, Slack, and similar bridges can be wired in through controlled host-side operations.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Skill installation&lt;/td&gt;
&lt;td&gt;You can push skills into a running sandbox without turning the whole environment into mutable sludge.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;NemoClaw supports several inference paths, including NVIDIA Endpoints, OpenAI, Anthropic, Google Gemini, compatible OpenAI style and Anthropic style endpoints, and local Ollama. For compatible endpoints, onboarding validates the endpoint with a real inference request because many services copy the OpenAI shape but fail on real runtime behavior. If you are choosing an inference runtime strategy first, the broader &lt;a href="https://www.glukhov.org/llm-hosting/" rel="noopener noreferrer"&gt;LLM hosting guide for 2026&lt;/a&gt; is a useful companion. Experimental local NIM and local vLLM paths also exist, but they are gated behind an environment flag for a reason, so use them for evaluation instead of unattended long-running workloads.&lt;/p&gt;

&lt;p&gt;The security model is the real headline. NemoClaw starts with deny-by-default egress, keeps provider credentials on the host, uses a read-only OpenClaw config inside the sandbox, and lets operators review unknown network requests in the OpenShell TUI. This is not flashy, but that is exactly the point, because flashy agent stacks are common while boring control surfaces are the scarce resource in production.&lt;/p&gt;
&lt;h3&gt;
  
  
  The defaults you should not casually relax
&lt;/h3&gt;

&lt;p&gt;NemoClaw does have escape hatches. It also tells you, very politely, when you are about to do something foolish.&lt;/p&gt;

&lt;p&gt;The biggest foot-guns are these:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;--dangerously-skip-permissions&lt;/code&gt; trades the default sandbox posture for a permissive one&lt;/li&gt;
&lt;li&gt;adding permanent baseline policy entries for one-off requests makes privilege creep feel normal&lt;/li&gt;
&lt;li&gt;writing directly into &lt;code&gt;/sandbox/.openclaw&lt;/code&gt; is the wrong mental model because that config is meant to stay locked down&lt;/li&gt;
&lt;li&gt;using &lt;code&gt;openclaw agent --local&lt;/code&gt; as if it were your standard operating mode is a bad habit for anything security-sensitive&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last point deserves emphasis. Local mode is convenient for smoke testing and one-off checks, but it is not the posture you should normalize for an always-on assistant that has any real permissions.&lt;/p&gt;
&lt;h2&gt;
  
  
  NemoClaw quickstart for your first sandbox
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;Here is the practical baseline before you waste an afternoon pretending a tiny laptop is enough. The official prerequisites page currently lists Node.js 22.16 or later and npm 10 or later in addition to Docker runtime requirements.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource&lt;/th&gt;
&lt;th&gt;Minimum&lt;/th&gt;
&lt;th&gt;Recommended&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CPU&lt;/td&gt;
&lt;td&gt;4 vCPU&lt;/td&gt;
&lt;td&gt;4+ vCPU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAM&lt;/td&gt;
&lt;td&gt;8 GB&lt;/td&gt;
&lt;td&gt;16 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Disk&lt;/td&gt;
&lt;td&gt;20 GB free&lt;/td&gt;
&lt;td&gt;40 GB free&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The tested runtime matrix is also straightforward:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;Runtime&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Linux&lt;/td&gt;
&lt;td&gt;Docker&lt;/td&gt;
&lt;td&gt;Primary tested path&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;macOS on Apple Silicon&lt;/td&gt;
&lt;td&gt;Colima or Docker Desktop&lt;/td&gt;
&lt;td&gt;Works with limitations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DGX Spark&lt;/td&gt;
&lt;td&gt;Docker&lt;/td&gt;
&lt;td&gt;Tested&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Windows&lt;/td&gt;
&lt;td&gt;WSL2 with Docker Desktop backend&lt;/td&gt;
&lt;td&gt;Works with limitations&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you are on macOS, install Xcode Command Line Tools first. If you are on Linux, make sure Docker is actually running and that your user can talk to it without permission drama.&lt;/p&gt;

&lt;p&gt;There is also a resource detail that catches many first-time users. The sandbox image is around 2.4 GB compressed, and the export pipeline can temporarily consume enough memory to trigger OOM on weak machines. If you cannot add RAM, adding at least 8 GB swap is an official workaround, though it slows onboarding. For small dedicated AI boxes, the &lt;a href="https://www.glukhov.org/hardware/ai/nvidia-dgx-spark/" rel="noopener noreferrer"&gt;NVIDIA DGX Spark overview&lt;/a&gt; gives a concrete reference point for local always-on deployments.&lt;/p&gt;
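
&lt;p&gt;If you take the swap route on a Linux host, the standard sequence applies; size it to your machine:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile   # add an /etc/fstab entry to persist across reboots
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
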
&lt;h3&gt;
  
  
  Install and onboard
&lt;/h3&gt;

&lt;p&gt;The official install path is intentionally simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://www.nvidia.com/nemoclaw.sh | bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then confirm the CLI is present:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nemoclaw &lt;span class="nt"&gt;--help&lt;/span&gt;
nemoclaw &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From there, the real work is onboarding. &lt;code&gt;nemoclaw onboard&lt;/code&gt; drives sandbox creation, provider setup, and policy application in one guided flow, which is why it should be your default lifecycle entry point.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nemoclaw onboard
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;During onboarding you will choose an inference provider, a sandbox name, and a policy tier. The tier choice matters more than most people expect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;restricted&lt;/code&gt; keeps the base sandbox only&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;balanced&lt;/code&gt; is the default and adds development tooling plus web search related access&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;open&lt;/code&gt; adds broad third-party access, including messaging and productivity services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My recommendation is not subtle. For an always-on assistant, start with the smallest posture that can possibly work. If that means &lt;code&gt;restricted&lt;/code&gt;, good. Add only what the agent proves it needs.&lt;/p&gt;

&lt;p&gt;If you want a scripted run, the non-interactive flow looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;NEMOCLAW_POLICY_TIER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;restricted &lt;span class="se"&gt;\&lt;/span&gt;
nemoclaw onboard &lt;span class="nt"&gt;--non-interactive&lt;/span&gt; &lt;span class="nt"&gt;--yes-i-accept-third-party-software&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use a sane sandbox name. NemoClaw expects lowercase alphanumeric characters and hyphens. If you keep trying to be clever with names, the validator will win.&lt;/p&gt;

&lt;h3&gt;
  
  
  First connection and first prompt
&lt;/h3&gt;

&lt;p&gt;Once onboarding completes, connect to the sandbox:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nemoclaw my-assistant connect
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Inside the sandbox, open the terminal UI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openclaw tui
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you only want a one-message smoke test, you can do this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openclaw agent &lt;span class="nt"&gt;--agent&lt;/span&gt; main &lt;span class="nt"&gt;--local&lt;/span&gt; &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"hello"&lt;/span&gt; &lt;span class="nt"&gt;--session-id&lt;/span&gt; &lt;span class="nb"&gt;test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That said, do not confuse a smoke test with an operating model. For real day-two usage, I would rather stay honest about the system and use the TUI plus host-side monitoring instead of normalising &lt;code&gt;--local&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  NemoClaw operations that matter on day two
&lt;/h2&gt;

&lt;p&gt;Once the sandbox exists, NemoClaw becomes an operations tool, not just an installer. These are the commands that pull their weight.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;List sandboxes&lt;/td&gt;
&lt;td&gt;&lt;code&gt;nemoclaw list&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Shows provider, model, and applied presets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Check health&lt;/td&gt;
&lt;td&gt;&lt;code&gt;nemoclaw my-assistant status&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Shows sandbox health and inference state&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stream logs&lt;/td&gt;
&lt;td&gt;&lt;code&gt;nemoclaw my-assistant logs --follow&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Your first stop for failed blueprint runs and runtime errors&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Watch blocked egress&lt;/td&gt;
&lt;td&gt;&lt;code&gt;openshell term&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Lets you review and approve unknown network requests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Add a preset&lt;/td&gt;
&lt;td&gt;&lt;code&gt;nemoclaw my-assistant policy-add pypi --yes&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Permanent access for a known integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Remove a preset&lt;/td&gt;
&lt;td&gt;&lt;code&gt;nemoclaw my-assistant policy-remove pypi --yes&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Roll back access when you no longer need it&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pause a channel&lt;/td&gt;
&lt;td&gt;&lt;code&gt;nemoclaw my-assistant channels stop telegram&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Keeps credentials but disables the bridge&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Re-enable a channel&lt;/td&gt;
&lt;td&gt;&lt;code&gt;nemoclaw my-assistant channels start telegram&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Brings a paused bridge back without re-entering tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Install a skill&lt;/td&gt;
&lt;td&gt;&lt;code&gt;nemoclaw my-assistant skill install ./my-skill/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Pushes a skill into the running sandbox&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Create a snapshot&lt;/td&gt;
&lt;td&gt;&lt;code&gt;nemoclaw my-assistant snapshot create --name before-upgrade&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Fast insurance before risky changes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Restore a snapshot&lt;/td&gt;
&lt;td&gt;&lt;code&gt;nemoclaw my-assistant snapshot restore before-upgrade&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Rewind state cleanly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rebuild safely&lt;/td&gt;
&lt;td&gt;&lt;code&gt;nemoclaw my-assistant rebuild --yes&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Upgrade while preserving workspace state&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Changing network access after onboarding
&lt;/h3&gt;

&lt;p&gt;This is where NemoClaw is noticeably better than ad hoc agent setups. Instead of loosening all controls after the first block, you can keep a constrained baseline and then approve or persist only what is necessary.&lt;/p&gt;

&lt;p&gt;For one-off blocked destinations, use the TUI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openshell term
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That lets you review host, port, binary, and method or path information when available. Approved requests stay available for the current session, but they do not become permanent baseline policy. That is a feature, not a bug.&lt;/p&gt;

&lt;p&gt;For durable changes, add or remove presets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nemoclaw my-assistant policy-add github &lt;span class="nt"&gt;--yes&lt;/span&gt;
nemoclaw my-assistant policy-remove github &lt;span class="nt"&gt;--yes&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you need something more specific than a stock preset, define a custom policy entry and keep &lt;code&gt;protocol: rest&lt;/code&gt; with method and path restrictions for HTTP APIs whenever possible. L4-only rules are a compromise. Pretending otherwise just makes bad policy look neat.&lt;/p&gt;
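
&lt;p&gt;A hypothetical shape for such an entry. Every field name below is an assumption apart from &lt;code&gt;protocol: rest&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Sketch of a custom policy entry (field names are assumptions)
cat &gt;&gt; my-policy.yaml &lt;&lt;'EOF'
- host: api.example.com
  protocol: rest        # keep REST-level method and path rules where possible
  methods: [GET]
  paths:
    - /v1/status
EOF
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;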

&lt;h3&gt;
  
  
  Switching models without rebuilding your whole life
&lt;/h3&gt;

&lt;p&gt;If you are staying within the same provider family, model changes are simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openshell inference &lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;--provider&lt;/span&gt; openai-api &lt;span class="nt"&gt;--model&lt;/span&gt; &amp;lt;model&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then verify the result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nemoclaw my-assistant status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you are switching across provider families, the story gets more opinionated. You are not just flipping a runtime pointer. You are changing the route and some baked image configuration. In practice, that means you should treat the change like a real reconfiguration and rerun onboarding or recreate the sandbox with the appropriate overrides.&lt;/p&gt;

&lt;p&gt;Cost is another practical reason to keep this workflow clean. Public pricing pages in April 2026 show large spreads between model tiers, such as GPT-5.4 mini at low single-digit dollars per million output tokens versus premium frontier tiers that cost an order of magnitude more. Anthropic pricing similarly ranges from Haiku class to Opus class, and the wider pricing shift is covered in &lt;a href="https://www.glukhov.org/ai-systems/openclaw/anthropic-claude-subscription-agent-tools/" rel="noopener noreferrer"&gt;Claude, OpenClaw, and the End of Flat Pricing for Agents&lt;/a&gt;. If you need a practical playbook for spending less under these conditions, see &lt;a href="https://www.glukhov.org/llm-performance/cost-effective-llm-applications/" rel="noopener noreferrer"&gt;token optimization strategies for LLM cost control&lt;/a&gt;, because being able to switch models without policy chaos is an operational advantage, not just a convenience.&lt;/p&gt;

&lt;h3&gt;
  
  
  Understanding what persists and what does not
&lt;/h3&gt;

&lt;p&gt;The useful state lives in the workspace under &lt;code&gt;/sandbox/.openclaw/workspace/&lt;/code&gt;. That includes files such as &lt;code&gt;AGENTS.md&lt;/code&gt;, &lt;code&gt;IDENTITY.md&lt;/code&gt;, &lt;code&gt;MEMORY.md&lt;/code&gt;, &lt;code&gt;SOUL.md&lt;/code&gt;, &lt;code&gt;USER.md&lt;/code&gt;, and the &lt;code&gt;memory/&lt;/code&gt; directory of daily notes. If you are designing longer-lived assistants, the &lt;a href="https://www.glukhov.org/ai-systems/memory/" rel="noopener noreferrer"&gt;AI Systems Memory hub&lt;/a&gt; and &lt;a href="https://www.glukhov.org/ai-systems/memory/agent-memory-providers/" rel="noopener noreferrer"&gt;agent memory provider comparison&lt;/a&gt; are useful next reads.&lt;/p&gt;

&lt;p&gt;The good news is that sandbox restarts preserve this state. The bad news is that &lt;code&gt;nemoclaw destroy&lt;/code&gt; does not care about your feelings. Destroying the sandbox deletes its persistent volume and your workspace goes with it.&lt;/p&gt;

&lt;p&gt;That is why the rebuild and snapshot flows matter. NemoClaw is usable precisely because it does not force you to choose between upgrading and losing the assistant's memory.&lt;/p&gt;

&lt;h3&gt;
  
  
  The rule everyone learns late
&lt;/h3&gt;

&lt;p&gt;Here is the rule that saves the most time once you internalize it. A surprising amount of NemoClaw configuration is build-time or image-time configuration, not live mutable state.&lt;/p&gt;

&lt;p&gt;That explains several behaviours that confuse new users:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;messaging channels are baked into the image and host-side commands rebuild the sandbox when channels change&lt;/li&gt;
&lt;li&gt;the OpenClaw config path inside the sandbox is read-only&lt;/li&gt;
&lt;li&gt;some auth, proxy, and port settings require re-onboarding or sandbox recreation&lt;/li&gt;
&lt;li&gt;editing the right host-side state is usually the correct move, while editing from inside the sandbox is usually the wrong one&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once you accept that model, the platform stops feeling random and starts feeling deliberate.&lt;/p&gt;

&lt;h2&gt;
  
  
  NemoClaw troubleshooting that actually saves time
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Install and platform problems
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;What is really happening&lt;/th&gt;
&lt;th&gt;What to do&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;nemoclaw&lt;/code&gt; not found after install&lt;/td&gt;
&lt;td&gt;Your shell has not refreshed its PATH&lt;/td&gt;
&lt;td&gt;Run &lt;code&gt;source ~/.bashrc&lt;/code&gt; or &lt;code&gt;source ~/.zshrc&lt;/code&gt;, or open a new terminal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Docker permission denied on Linux&lt;/td&gt;
&lt;td&gt;Your user is not in the &lt;code&gt;docker&lt;/code&gt; group&lt;/td&gt;
&lt;td&gt;Run &lt;code&gt;sudo usermod -aG docker $USER&lt;/code&gt; then &lt;code&gt;newgrp docker&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Docker is not running&lt;/td&gt;
&lt;td&gt;The installer or onboarding cannot reach the runtime&lt;/td&gt;
&lt;td&gt;Start Docker and rerun &lt;code&gt;nemoclaw onboard&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Colima socket not detected on macOS&lt;/td&gt;
&lt;td&gt;Colima is not running or the socket path is missing&lt;/td&gt;
&lt;td&gt;Run &lt;code&gt;colima status&lt;/code&gt; and start Colima if needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unsupported platform error&lt;/td&gt;
&lt;td&gt;You are outside the tested matrix&lt;/td&gt;
&lt;td&gt;Move to a tested Docker-based runtime before wasting more time&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Runtime and policy problems
&lt;/h3&gt;

&lt;p&gt;If the agent cannot reach an external host, the first answer is usually not that the provider is broken. The first answer is usually that the destination is not allowed by policy yet, especially on new sandboxes.&lt;/p&gt;

&lt;p&gt;Open the TUI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openshell term
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the request is legitimate, approve it for the session or add the correct preset or custom policy entry permanently.&lt;/p&gt;

&lt;p&gt;If onboarding fails because port &lt;code&gt;18789&lt;/code&gt; is taken, find and kill the conflict:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;lsof &lt;span class="nt"&gt;-i&lt;/span&gt; :18789
&lt;span class="nb"&gt;kill&lt;/span&gt; &amp;lt;PID&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If an older release left an orphaned SSH forward behind after a destroy, current NemoClaw versions can clean that up automatically during re-onboard. Older ones may need the manual kill.&lt;/p&gt;

&lt;p&gt;If the dashboard does not load after setting &lt;code&gt;NEMOCLAW_DASHBOARD_PORT&lt;/code&gt;, rerun onboarding on a current release with the desired port. Older builds had a bug where the host respected the custom port but the sandbox still listened on the default one.&lt;/p&gt;

&lt;h3&gt;
  
  
  Memory, rebuilds, and channels
&lt;/h3&gt;

&lt;p&gt;If sandbox creation dies with exit code 137, you probably hit an out-of-memory condition during the image push path. Add swap or use a machine with more RAM. The cheap machine was not actually cheap if it cost you a day.&lt;/p&gt;

&lt;p&gt;Before risky changes, snapshot first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nemoclaw my-assistant snapshot create &lt;span class="nt"&gt;--name&lt;/span&gt; before-upgrade
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you need to upgrade the sandbox but keep the assistant state, rebuild instead of destroy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nemoclaw my-assistant rebuild &lt;span class="nt"&gt;--yes&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you rotate Telegram, Discord, or Slack tokens, rerun onboarding so NemoClaw can detect the credential change and recreate the sandbox correctly.&lt;/p&gt;

&lt;p&gt;And if you try to fix channels from inside the sandbox with &lt;code&gt;openclaw channels&lt;/code&gt; commands, stop. Channel config is baked into the image and the config path is read-only. Use the host-side commands instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nemoclaw my-assistant channels add telegram
nemoclaw my-assistant channels remove telegram
nemoclaw my-assistant channels stop telegram
nemoclaw my-assistant channels start telegram
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Inference and local model pain
&lt;/h3&gt;

&lt;p&gt;"Compatible" endpoints are the classic source of false confidence. A server that exposes an OpenAI-looking API does not necessarily support the streaming behaviour OpenClaw expects.&lt;/p&gt;

&lt;p&gt;If onboarding succeeded but runtime calls fail on a compatible endpoint, rerun onboarding and let NemoClaw re-probe the endpoint. Do not assume a config override alone is enough.&lt;/p&gt;
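
&lt;p&gt;You can smoke-test streaming yourself before blaming either side. A minimal sketch, assuming a local OpenAI-compatible server on port 8000 and an illustrative model name:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# expect incremental "data:" chunks, not one buffered JSON blob
curl -N http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "my-model", "stream": true, "messages": [{"role": "user", "content": "ping"}]}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;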

&lt;p&gt;For local backends, keep an eye on health and binding issues:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nemoclaw my-assistant status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If local inference health checks look wrong on older releases, IPv6 versus IPv4 resolution may be the culprit. If Ollama behaves badly in WSL, make sure Docker Desktop integration is working and consider increasing &lt;code&gt;OLLAMA_CONTEXT_LENGTH&lt;/code&gt; before restarting &lt;code&gt;ollama serve&lt;/code&gt;.&lt;/p&gt;
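
&lt;p&gt;For example (pick a context length your hardware can actually hold):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# raise the context window, then restart the server
export OLLAMA_CONTEXT_LENGTH=16384
ollama serve
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;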

&lt;p&gt;If all else fails, collect diagnostics instead of guessing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nemoclaw debug &lt;span class="nt"&gt;--sandbox&lt;/span&gt; my-assistant &lt;span class="nt"&gt;--output&lt;/span&gt; ./nemoclaw-debug.tar.gz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is a much better bug report than a screenshot of a half-visible terminal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Should you use NemoClaw in 2026
&lt;/h2&gt;

&lt;p&gt;NemoClaw is opinionated in the right places. It assumes that an always-on agent should begin inside a cage, that inference credentials should stay on the host, and that network access should be earned rather than assumed. For this class of tooling, that philosophy is still the right default.&lt;/p&gt;

&lt;p&gt;It is also still alpha. That means rough edges are real, the runtime model takes time to learn, and the issue you hit may genuinely be a product issue rather than operator error. If you are honest about that constraint, the stack is usable today for serious evaluation and controlled internal workloads.&lt;/p&gt;

&lt;p&gt;My recommendation is simple. Use NemoClaw if you care about secure-by-default agent operations, want a clearer separation between assistant and enforcement layer, and are willing to operate within a deliberate lifecycle. If you only want the fastest possible demo, there are simpler toys, but if you want a safer long-running stack, NemoClaw is one of the most convincing options available right now. Once you are stable, the practical follow-on is &lt;a href="https://www.glukhov.org/ai-systems/openclaw/production-setup/" rel="noopener noreferrer"&gt;OpenClaw production setup patterns with plugins and skills&lt;/a&gt;, which maps day-to-day operating models. At that stage, add formal monitoring with &lt;a href="https://www.glukhov.org/observability/monitoring-llm-inference-prometheus-grafana/" rel="noopener noreferrer"&gt;LLM inference observability using Prometheus and Grafana&lt;/a&gt; so operations do not depend on terminal intuition alone.&lt;/p&gt;

</description>
      <category>cheatsheet</category>
      <category>openclaw</category>
      <category>selfhosting</category>
      <category>llm</category>
    </item>
    <item>
      <title>Agent Memory Providers Compared — Honcho, Mem0, Hindsight, and Five More</title>
      <dc:creator>Rost</dc:creator>
      <pubDate>Thu, 30 Apr 2026 10:15:44 +0000</pubDate>
      <link>https://dev.to/rosgluk/agent-memory-providers-compared-honcho-mem0-hindsight-and-five-more-5bl8</link>
      <guid>https://dev.to/rosgluk/agent-memory-providers-compared-honcho-mem0-hindsight-and-five-more-5bl8</guid>
      <description>&lt;p&gt;Modern assistants still forget everything when you close the tab unless something persists beyond the context window. &lt;strong&gt;Agent memory providers&lt;/strong&gt; are services or libraries that hold facts and summaries across sessions — often wired in as &lt;strong&gt;plugins&lt;/strong&gt; so the framework stays thin while memory scales.&lt;/p&gt;

&lt;p&gt;This guide compares eight backends that ship as &lt;strong&gt;Hermes Agent&lt;/strong&gt; external memory plugins — Honcho, OpenViking, Mem0, Hindsight, Holographic, RetainDB, ByteRover, and Supermemory — and explains how they fit into broader &lt;strong&gt;&lt;a href="https://www.glukhov.org/ai-systems/" rel="noopener noreferrer"&gt;AI systems&lt;/a&gt;&lt;/strong&gt; stacks. The same vendors appear in &lt;strong&gt;OpenClaw&lt;/strong&gt; and other agent tooling via community or official integrations. The &lt;strong&gt;&lt;a href="https://www.glukhov.org/ai-systems/memory/" rel="noopener noreferrer"&gt;AI Systems Memory hub&lt;/a&gt;&lt;/strong&gt; lists this article alongside Cognee and related guides.&lt;/p&gt;

&lt;p&gt;For Hermes-specific bounded core memory (MEMORY.md and USER.md), freezing behaviour, and triggers, see &lt;strong&gt;&lt;a href="https://www.glukhov.org/ai-systems/hermes/hermes-agent-memory-system/" rel="noopener noreferrer"&gt;Hermes Agent Memory System&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;




&lt;p&gt;Hermes Agent lists eight external memory provider plugins for persistent, cross-session knowledge. Only one external provider can be active at a time. Built-in MEMORY.md and USER.md stay loaded alongside it — additive, not replacement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;External dependencies.&lt;/strong&gt; Every external provider except Holographic requires at least one external service call — an LLM for memory extraction, an embedding model for semantic search, or a database like PostgreSQL for storage. These dependencies have direct implications for privacy, cost, and whether your memory stack can run fully &lt;a href="https://www.glukhov.org/llm-hosting/self-hosting/llm-selfhosting-and-ai-sovereignty/" rel="noopener noreferrer"&gt;self-hosted&lt;/a&gt;. Hindsight and ByteRover bundle or eliminate most of these dependencies; Honcho, Mem0, and Supermemory require the most moving parts. Where a provider supports Ollama or any OpenAI-compatible endpoint, you can route LLM and embedding calls to a local model and keep data off third-party servers entirely.&lt;/p&gt;

&lt;h3&gt;
  
  
  Activation with Hermes Agent
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;hermes memory setup   &lt;span class="c"&gt;# Interactive picker + configuration&lt;/span&gt;
hermes memory status  &lt;span class="c"&gt;# Check what's active&lt;/span&gt;
hermes memory off     &lt;span class="c"&gt;# Disable external provider&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or manually in &lt;code&gt;~/.hermes/config.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openviking&lt;/span&gt;  &lt;span class="c1"&gt;# or honcho, mem0, hindsight, holographic, retaindb, byterover, supermemory&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Provider Comparison
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Storage&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;External Dependencies&lt;/th&gt;
&lt;th&gt;Self-hostable&lt;/th&gt;
&lt;th&gt;Unique Feature&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Honcho&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cloud/Self-hosted&lt;/td&gt;
&lt;td&gt;Paid/Free&lt;/td&gt;
&lt;td&gt;LLM + Embedding model + PostgreSQL/pgvector + Redis&lt;/td&gt;
&lt;td&gt;Yes — Docker / K3s / Fly.io&lt;/td&gt;
&lt;td&gt;Dialectic user modeling + session-scoped context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenViking&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Self-hosted&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;LLM (VLM) + Embedding model&lt;/td&gt;
&lt;td&gt;Yes — local server; Ollama-native init wizard&lt;/td&gt;
&lt;td&gt;Filesystem hierarchy + tiered loading&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Mem0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cloud/Self-hosted&lt;/td&gt;
&lt;td&gt;Paid/Free OSS&lt;/td&gt;
&lt;td&gt;LLM + Embedding model + Vector store (Qdrant or pgvector)&lt;/td&gt;
&lt;td&gt;Yes — Docker Compose OSS; fully local possible&lt;/td&gt;
&lt;td&gt;Server-side LLM extraction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hindsight&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cloud/Local&lt;/td&gt;
&lt;td&gt;Free/Paid&lt;/td&gt;
&lt;td&gt;LLM + bundled PostgreSQL + built-in embedder + built-in reranker&lt;/td&gt;
&lt;td&gt;Yes — Docker or embedded Python; fully local with Ollama&lt;/td&gt;
&lt;td&gt;Knowledge graph + &lt;code&gt;reflect&lt;/code&gt; synthesis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Holographic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Local&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Native — no infra required&lt;/td&gt;
&lt;td&gt;HRR algebra + trust scoring&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RetainDB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cloud&lt;/td&gt;
&lt;td&gt;$20/mo&lt;/td&gt;
&lt;td&gt;Cloud-managed (LLM + retrieval on RetainDB servers)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Delta compression&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ByteRover&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Local/Cloud&lt;/td&gt;
&lt;td&gt;Free/Paid&lt;/td&gt;
&lt;td&gt;LLM only — no embedding model, no DB&lt;/td&gt;
&lt;td&gt;Yes — local-first by default; Ollama supported&lt;/td&gt;
&lt;td&gt;File-based context tree; no embedding pipeline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Supermemory&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cloud&lt;/td&gt;
&lt;td&gt;Paid&lt;/td&gt;
&lt;td&gt;LLM + PostgreSQL/pgvector (enterprise Cloudflare deploy)&lt;/td&gt;
&lt;td&gt;Enterprise plan only&lt;/td&gt;
&lt;td&gt;Context fencing + session graph ingest&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Detailed Breakdown
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Honcho
&lt;/h4&gt;

&lt;p&gt;Best for: multi-agent systems, cross-session context, user-agent alignment.&lt;/p&gt;

&lt;p&gt;Honcho runs alongside existing memory — USER.md stays as-is, and Honcho adds an additional layer of context. It models conversations as peers exchanging messages — one user peer plus one AI peer per Hermes profile, all sharing a workspace.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;External dependencies:&lt;/strong&gt; Honcho requires an LLM for session summarisation, user-representation derivation, and dialectic reasoning; an embedding model for semantic search across observations; PostgreSQL with the pgvector extension for vector storage; and Redis for caching. The managed cloud at &lt;code&gt;api.honcho.dev&lt;/code&gt; handles all of this for you. For self-hosted deployments (Docker, K3s, or Fly.io), you supply your own credentials. The LLM slot accepts any OpenAI-compatible endpoint, including Ollama and vLLM, so inference can stay on-premises. The embedding slot defaults to &lt;code&gt;openai/text-embedding-3-small&lt;/code&gt; but supports configurable providers via &lt;code&gt;LLM_EMBEDDING_API_KEY&lt;/code&gt; and &lt;code&gt;LLM_EMBEDDING_BASE_URL&lt;/code&gt; — any OpenAI-compatible embedding server works, including local options like vLLM with a BGE model.&lt;/p&gt;
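
&lt;p&gt;For a self-hosted deployment that keeps embeddings local, the override is two environment variables; the values below are placeholders for your own endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# point the embedding slot at a local OpenAI-compatible server (values are examples)
export LLM_EMBEDDING_BASE_URL=http://localhost:8000/v1
export LLM_EMBEDDING_API_KEY=local-placeholder
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;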

&lt;p&gt;&lt;strong&gt;Tools:&lt;/strong&gt; &lt;code&gt;honcho_profile&lt;/code&gt; (read/update peer card), &lt;code&gt;honcho_search&lt;/code&gt; (semantic search), &lt;code&gt;honcho_context&lt;/code&gt; (session context — summary, representation, card, messages), &lt;code&gt;honcho_reasoning&lt;/code&gt; (LLM-synthesized), &lt;code&gt;honcho_conclude&lt;/code&gt; (create/delete conclusions).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key config knobs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;contextCadence&lt;/code&gt; (default 1): Minimum turns between base-layer refreshes&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dialecticCadence&lt;/code&gt; (default 2): Minimum turns between &lt;code&gt;peer.chat()&lt;/code&gt; LLM calls (1-5 recommended)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dialecticDepth&lt;/code&gt; (default 1): &lt;code&gt;.chat()&lt;/code&gt; passes per invocation (clamped 1-3)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;recallMode&lt;/code&gt; (default 'hybrid'): &lt;code&gt;hybrid&lt;/code&gt; (auto+tools), &lt;code&gt;context&lt;/code&gt; (inject only), &lt;code&gt;tools&lt;/code&gt; (tools only)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;writeFrequency&lt;/code&gt; (default 'async'): Flush timing: &lt;code&gt;async&lt;/code&gt;, &lt;code&gt;turn&lt;/code&gt;, &lt;code&gt;session&lt;/code&gt;, or integer N&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;observationMode&lt;/code&gt; (default 'directional'): &lt;code&gt;directional&lt;/code&gt; (all on) or &lt;code&gt;unified&lt;/code&gt; (shared pool)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Architecture:&lt;/strong&gt; Two-layer context injection — base layer (session summary + representation + peer card) + dialectic supplement (LLM reasoning). Automatically selects cold-start vs warm prompts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-peer mapping:&lt;/strong&gt; Workspace is a shared environment across profiles. User peer (&lt;code&gt;peerName&lt;/code&gt;) is a global human identity. AI peer (&lt;code&gt;aiPeer&lt;/code&gt;) is one per Hermes profile (&lt;code&gt;hermes&lt;/code&gt; default, &lt;code&gt;hermes.&amp;lt;profile&amp;gt;&lt;/code&gt; for others).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setup:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;hermes memory setup  &lt;span class="c"&gt;# select "honcho"&lt;/span&gt;
&lt;span class="c"&gt;# or legacy: hermes honcho setup&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Config: &lt;code&gt;$HERMES_HOME/honcho.json&lt;/code&gt; (profile-local) or &lt;code&gt;~/.honcho/config.json&lt;/code&gt; (global).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Profile management:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;hermes profile create coder &lt;span class="nt"&gt;--clone&lt;/span&gt;  &lt;span class="c"&gt;# Creates hermes.coder with shared workspace&lt;/span&gt;
hermes honcho &lt;span class="nb"&gt;sync&lt;/span&gt;                   &lt;span class="c"&gt;# Backfills AI peers for existing profiles&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  OpenViking
&lt;/h4&gt;

&lt;p&gt;Best for: self-hosted knowledge management with structured browsing.&lt;/p&gt;

&lt;p&gt;OpenViking provides a filesystem hierarchy with tiered loading. It's free, &lt;a href="https://www.glukhov.org/llm-hosting/self-hosting/llm-selfhosting-and-ai-sovereignty/" rel="noopener noreferrer"&gt;self-hosted&lt;/a&gt;, and gives you full control over your memory storage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;External dependencies:&lt;/strong&gt; OpenViking requires a VLM (vision-language model) for semantic processing and memory extraction, and an embedding model for vector search — both are mandatory. Supported VLM providers include OpenAI, Anthropic, DeepSeek, Gemini, Moonshot, and vLLM (for local deployment). For embeddings, supported providers include OpenAI, Volcengine (Doubao), Jina, Voyage, and — via Ollama — any locally served embedding model. The &lt;code&gt;openviking-server init&lt;/code&gt; interactive wizard can detect available RAM and recommend suitable Ollama models (e.g. Qwen3-Embedding 8B for embeddings, Gemma 4 27B for VLM) and configure everything automatically for a fully local, zero-API-key setup. No external database is required; OpenViking stores memory in the filesystem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tools:&lt;/strong&gt; &lt;code&gt;viking_search&lt;/code&gt;, &lt;code&gt;viking_read&lt;/code&gt; (tiered), &lt;code&gt;viking_browse&lt;/code&gt;, &lt;code&gt;viking_remember&lt;/code&gt;, &lt;code&gt;viking_add_resource&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setup:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;openviking
openviking-server init   &lt;span class="c"&gt;# interactive wizard (recommends Ollama models for local setup)&lt;/span&gt;
openviking-server
hermes memory setup  &lt;span class="c"&gt;# select "openviking"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"OPENVIKING_ENDPOINT=http://localhost:1933"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.hermes/.env
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Mem0
&lt;/h4&gt;

&lt;p&gt;Best for: hands-off memory management with auto extraction.&lt;/p&gt;

&lt;p&gt;Mem0 handles memory extraction server-side via an LLM call on every &lt;code&gt;add&lt;/code&gt; operation — it reads the conversation, extracts discrete facts, deduplicates, and stores them. The managed cloud API handles all infrastructure. The open-source library and self-hosted server give you full control.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;External dependencies:&lt;/strong&gt; Mem0 requires an LLM for memory extraction (default: OpenAI &lt;code&gt;gpt-4.1-nano&lt;/code&gt;; 20 providers supported, including Ollama, vLLM, and LM Studio for local models) and an embedding model for retrieval (default: OpenAI &lt;code&gt;text-embedding-3-small&lt;/code&gt;; 10 providers supported, including Ollama and HuggingFace for local models). Storage uses Qdrant at &lt;code&gt;/tmp/qdrant&lt;/code&gt; in library mode, or PostgreSQL with pgvector in self-hosted server mode — both can run locally. A fully local, zero-cloud Mem0 stack is achievable: Ollama for LLM, Ollama for embeddings, and a local Qdrant instance, all configured via &lt;code&gt;Memory.from_config&lt;/code&gt;.&lt;/p&gt;
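
&lt;p&gt;As a sketch, that zero-cloud stack can be assembled like this (model choices are illustrative), then wired together via &lt;code&gt;Memory.from_config&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# fully local Mem0 ingredients: Ollama models plus a local Qdrant
ollama pull llama3.1                        # extraction LLM
ollama pull nomic-embed-text                # embedding model
docker run -d -p 6333:6333 qdrant/qdrant    # vector store
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;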

&lt;p&gt;&lt;strong&gt;Tools:&lt;/strong&gt; &lt;code&gt;mem0_profile&lt;/code&gt;, &lt;code&gt;mem0_search&lt;/code&gt;, &lt;code&gt;mem0_conclude&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setup:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;mem0ai
hermes memory setup  &lt;span class="c"&gt;# select "mem0"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"MEM0_API_KEY=your-key"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.hermes/.env
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Config: &lt;code&gt;$HERMES_HOME/mem0.json&lt;/code&gt; (&lt;code&gt;user_id&lt;/code&gt;: &lt;code&gt;hermes-user&lt;/code&gt;, &lt;code&gt;agent_id&lt;/code&gt;: &lt;code&gt;hermes&lt;/code&gt;).&lt;/p&gt;

&lt;h4&gt;
  
  
  Hindsight
&lt;/h4&gt;

&lt;p&gt;Best for: knowledge graph-based recall with entity relationships.&lt;/p&gt;

&lt;p&gt;Hindsight builds a knowledge graph of your memory, extracting entities and relationships. Its unique &lt;code&gt;reflect&lt;/code&gt; tool performs cross-memory synthesis — combining multiple memories into new insights. Recall runs four retrieval strategies in parallel (semantic, keyword/BM25, graph traversal, temporal), then merges and re-orders results using reciprocal rank fusion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;External dependencies:&lt;/strong&gt; Hindsight requires an LLM for fact and entity extraction on &lt;code&gt;retain&lt;/code&gt; calls, and for synthesis on &lt;code&gt;reflect&lt;/code&gt; calls (default: OpenAI; supported providers include Anthropic, Gemini, Groq, Ollama, LM Studio, and any OpenAI-compatible endpoint). The embedding model and cross-encoder reranking model are bundled inside Hindsight itself — they run locally within the &lt;code&gt;hindsight-all&lt;/code&gt; package and require no external API. PostgreSQL is also bundled with the embedded Python installation via a managed &lt;code&gt;pg0&lt;/code&gt; data directory; you can alternatively point Hindsight at an external PostgreSQL instance. For a fully local, zero-cloud setup, set &lt;code&gt;HINDSIGHT_API_LLM_PROVIDER=ollama&lt;/code&gt; and point it at a local Ollama model — &lt;code&gt;retain&lt;/code&gt; and &lt;code&gt;recall&lt;/code&gt; work fully; &lt;code&gt;reflect&lt;/code&gt; requires a tool-calling-capable model (e.g. &lt;code&gt;qwen3:8b&lt;/code&gt;).&lt;/p&gt;
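
&lt;p&gt;Condensed into commands, the local setup described above looks like this (the model choice follows the tool-calling note):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# route Hindsight's extraction LLM to a local Ollama instance
export HINDSIGHT_API_LLM_PROVIDER=ollama
# a tool-calling-capable model so reflect works as well as retain/recall
ollama pull qwen3:8b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;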

&lt;p&gt;&lt;strong&gt;Tools:&lt;/strong&gt; &lt;code&gt;hindsight_retain&lt;/code&gt;, &lt;code&gt;hindsight_recall&lt;/code&gt;, &lt;code&gt;hindsight_reflect&lt;/code&gt; (unique cross-memory synthesis).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setup:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;hermes memory setup  &lt;span class="c"&gt;# select "hindsight"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"HINDSIGHT_API_KEY=your-key"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.hermes/.env
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Auto-installs &lt;code&gt;hindsight-client&lt;/code&gt; (cloud) or &lt;code&gt;hindsight-all&lt;/code&gt; (local). Requires version &amp;gt;= 0.4.22.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Config:&lt;/strong&gt; &lt;code&gt;$HERMES_HOME/hindsight/config.json&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;mode&lt;/code&gt;: &lt;code&gt;cloud&lt;/code&gt; or &lt;code&gt;local&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;recall_budget&lt;/code&gt;: &lt;code&gt;low&lt;/code&gt; / &lt;code&gt;mid&lt;/code&gt; / &lt;code&gt;high&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;memory_mode&lt;/code&gt;: &lt;code&gt;hybrid&lt;/code&gt; / &lt;code&gt;context&lt;/code&gt; / &lt;code&gt;tools&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;auto_retain&lt;/code&gt; / &lt;code&gt;auto_recall&lt;/code&gt;: &lt;code&gt;true&lt;/code&gt; (default)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Local UI: &lt;code&gt;hindsight-embed -p hermes ui start&lt;/code&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Holographic
&lt;/h4&gt;

&lt;p&gt;Best for: privacy-focused setups with local-only storage.&lt;/p&gt;

&lt;p&gt;Holographic uses HRR (Holographic Reduced Representation) algebra for memory encoding, with trust scoring for memory reliability. No cloud dependency — everything runs locally on your own hardware.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;External dependencies:&lt;/strong&gt; None. Holographic requires no LLM, no embedding model, no database, and no network connection. Memory encoding is done entirely through HRR algebra running in-process. This makes it unique among all eight providers — it is the only one that operates with zero external calls. The trade-off is that retrieval quality is lower than embedding-based semantic search, and there is no cross-memory synthesis like Hindsight's &lt;code&gt;reflect&lt;/code&gt;. Where privacy and zero-dependency operation are non-negotiable, Holographic is the only option that delivers unconditionally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tools:&lt;/strong&gt; 2 tools for memory operations via HRR algebra.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setup:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;hermes memory setup  &lt;span class="c"&gt;# select "holographic"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  RetainDB
&lt;/h4&gt;

&lt;p&gt;Best for: high-frequency updates with delta compression.&lt;/p&gt;

&lt;p&gt;RetainDB uses delta compression to efficiently store memory updates and hybrid retrieval (vector + BM25 + reranking) to surface relevant context. It's cloud-based at $20/month, with all memory processing handled server-side.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;External dependencies:&lt;/strong&gt; RetainDB's LLM calls, embedding pipeline, and reranking all run on RetainDB's own cloud infrastructure — you supply only a &lt;code&gt;RETAINDB_KEY&lt;/code&gt;. Memory extraction uses Claude Sonnet server-side. There is no self-hosting option and no local mode. All conversation data is sent to RetainDB servers for processing and storage. If data sovereignty or offline operation matters for your use case, this provider is not suitable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tools:&lt;/strong&gt; &lt;code&gt;retaindb_profile&lt;/code&gt; (user profile), &lt;code&gt;retaindb_search&lt;/code&gt; (semantic search), &lt;code&gt;retaindb_context&lt;/code&gt; (task-relevant context), &lt;code&gt;retaindb_remember&lt;/code&gt; (store with type + importance), &lt;code&gt;retaindb_forget&lt;/code&gt; (delete memories).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setup:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;hermes memory setup  &lt;span class="c"&gt;# select "retaindb"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  ByteRover
&lt;/h4&gt;

&lt;p&gt;Best for: local-first memory with human-readable, auditable storage.&lt;/p&gt;

&lt;p&gt;ByteRover stores memory as a structured markdown context tree — a hierarchy of domain, topic, and subtopic files — rather than embedding vectors or a database. An LLM reads source content, reasons about it, and places extracted knowledge into the right location in the hierarchy. Retrieval is MiniSearch full-text search with tiered fallback to LLM-powered search; no vector database is required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;External dependencies:&lt;/strong&gt; ByteRover requires an LLM for memory curation and search (18 providers supported, including Anthropic, OpenAI, Google, Ollama, and any OpenAI-compatible endpoint via the &lt;code&gt;openai-compatible&lt;/code&gt; provider slot). It requires no embedding model and no database — the context tree is a local directory of plain markdown files. Cloud sync is optional and used only for team collaboration; everything works fully offline by default. For a fully self-contained local setup, connect Ollama as the provider (&lt;code&gt;brv providers connect openai-compatible --base-url http://localhost:11434/v1&lt;/code&gt;) and no data leaves your machine.&lt;/p&gt;
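
&lt;p&gt;As a runnable one-liner, the fully local connection from the paragraph above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# connect a local Ollama endpoint via the openai-compatible provider slot
brv providers connect openai-compatible --base-url http://localhost:11434/v1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;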

&lt;p&gt;&lt;strong&gt;Tools:&lt;/strong&gt; 3 tools for memory operations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setup:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;hermes memory setup  &lt;span class="c"&gt;# select "byterover"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Supermemory
&lt;/h4&gt;

&lt;p&gt;Best for: enterprise workflows with context fencing and session graph ingest.&lt;/p&gt;

&lt;p&gt;Supermemory provides context fencing (isolating memory by context) and session graph ingest (importing entire conversation histories). It automatically extracts memories, builds user profiles, and runs hybrid retrieval combining semantic and keyword search. The managed cloud API is the primary deployment target.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;External dependencies:&lt;/strong&gt; Supermemory's cloud service handles all LLM inference and embedding server-side — you supply only a Supermemory API key. Self-hosting is available exclusively as an enterprise plan add-on and deploys to Cloudflare Workers; it requires you to provide PostgreSQL with the pgvector extension (for vector storage) and an OpenAI API key (mandatory, with Anthropic and Gemini as optional additions). There is no Docker-based or local self-hosting path — the architecture is tightly coupled to Cloudflare Workers edge compute. For users who need full data sovereignty without an enterprise contract, this provider is not the right choice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tools:&lt;/strong&gt; 4 tools for memory operations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setup:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;hermes memory setup  &lt;span class="c"&gt;# select "supermemory"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  How to Choose
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Need multi-agent support?&lt;/strong&gt; Honcho&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Want self-hosted and free?&lt;/strong&gt; OpenViking or Holographic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Want zero-config?&lt;/strong&gt; Mem0&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Want knowledge graphs?&lt;/strong&gt; Hindsight&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Want delta compression?&lt;/strong&gt; RetainDB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Want bandwidth efficiency?&lt;/strong&gt; ByteRover&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Want enterprise features?&lt;/strong&gt; Supermemory&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Want privacy (local only)?&lt;/strong&gt; Holographic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Want fully local with zero external services?&lt;/strong&gt; Holographic (no dependencies at all) or Hindsight/Mem0/ByteRover with Ollama&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Want human-readable, auditable memory with no embedding pipeline?&lt;/strong&gt; ByteRover&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For full profile-by-profile provider configurations and real-world workflow patterns, see &lt;a href="https://www.glukhov.org/ai-systems/hermes/production-setup/" rel="noopener noreferrer"&gt;Hermes Agent production setup&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Related guides
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.glukhov.org/ai-systems/memory/" rel="noopener noreferrer"&gt;AI Systems Memory hub&lt;/a&gt; — scope of this subcluster and links to Cognee guides&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.glukhov.org/ai-systems/hermes/hermes-agent-memory-system/" rel="noopener noreferrer"&gt;Hermes Agent Memory System&lt;/a&gt; — core two-file memory before plugins&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.glukhov.org/ai-systems/hermes/production-setup/" rel="noopener noreferrer"&gt;Hermes Agent production setup&lt;/a&gt; — profile wiring for providers in practice&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>hermes</category>
      <category>openclaw</category>
      <category>ai</category>
      <category>selfhosting</category>
    </item>
    <item>
      <title>Hermes Agent Memory System: How Persistent AI Memory Actually Works</title>
      <dc:creator>Rost</dc:creator>
      <pubDate>Thu, 30 Apr 2026 10:15:40 +0000</pubDate>
      <link>https://dev.to/rosgluk/hermes-agent-memory-system-how-persistent-ai-memory-actually-works-1j19</link>
      <guid>https://dev.to/rosgluk/hermes-agent-memory-system-how-persistent-ai-memory-actually-works-1j19</guid>
      <description>&lt;p&gt;You know the drill. You open a chat with an AI agent, explain your project, share your preferences, get some work done, and close the tab. Come back the following week and it's like talking to a stranger — all context gone, every preference forgotten, the project re-explained from scratch.&lt;/p&gt;

&lt;p&gt;This isn't a bug. It's how Large Language Models work by design. They're stateless: each request is independent, each response generated from whatever prompt you send right now, with no memory, no history, and no continuity beyond the tokens in the current context window.&lt;/p&gt;

&lt;p&gt;For single-turn interactions, that's fine. Ask a question, get an answer, move on. But for agents — systems that are supposed to &lt;em&gt;do things&lt;/em&gt; across sessions, learn from mistakes, and evolve with you — statelessness is a hard architectural limit. It's one of the central unsolved problems in &lt;a href="https://www.glukhov.org/ai-systems/" rel="noopener noreferrer"&gt;self-hosted AI systems&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The industry has tried to solve this. LangChain added memory modules. OpenAI introduced assistants with threads. Frameworks like Letta, Zep, and Cognee built entire architectures around persistent memory. Databricks published on "memory scaling" — the idea that agent performance improves with accumulated experience. Dedicated benchmark papers, episodic memory surveys, and a rapidly growing ecosystem of tools have all emerged since 2024 to address what is increasingly recognised as one of the central unsolved problems in agentic AI.&lt;/p&gt;

&lt;p&gt;Most of these approaches share a common problem: they treat memory as an afterthought — a database you query, a context window you stuff, a retrieval system that adds latency and noise rather than clarity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.glukhov.org/ai-systems/hermes/" rel="noopener noreferrer"&gt;Hermes Agent&lt;/a&gt; takes a fundamentally different approach. Memory isn't something the agent &lt;em&gt;retrieves&lt;/em&gt; when needed. It's something the agent &lt;em&gt;is&lt;/em&gt; at all times — built into the system prompt, curated, bounded, and always active. It's small enough to be fast, structured enough to be useful, and disciplined enough to know what to forget.&lt;/p&gt;

&lt;p&gt;This article explains exactly how that works.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 1: The AI Agent Memory Problem
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why "Just Add Context" Doesn't Scale for Agents
&lt;/h3&gt;

&lt;p&gt;The obvious solution to stateless AI is to add context. Attach the previous conversation. Include the project documentation. Send the entire history.&lt;/p&gt;

&lt;p&gt;For a while, that works. You've got a 128K context window. You can fit a lot of text in there.&lt;/p&gt;

&lt;p&gt;But context isn't memory — there's a real and important difference between them. Context is everything you're shown right now; memory is what you actively keep and carry forward.&lt;/p&gt;

&lt;p&gt;Context has no curation. It's a dump: as it grows, the model has to process thousands of tokens of irrelevant history to find the one fact it needs. That &lt;a href="https://www.glukhov.org/llm-performance/cost-effective-llm-applications/" rel="noopener noreferrer"&gt;costs tokens and money&lt;/a&gt;, compounds latency, and eventually hits the ceiling.&lt;/p&gt;

&lt;p&gt;Memory is curated. It's the distillation of experience into something compact and actionable. It doesn't grow indefinitely — it consolidates, updates, and forgets.&lt;/p&gt;

&lt;p&gt;Human memory works the same way. You don't remember every conversation you've ever had. You remember the parts that matter: who you're talking to, what they care about, what you've agreed on, what you've learned. The rest is either forgotten or searchable when you need it.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Research Landscape
&lt;/h3&gt;

&lt;p&gt;The AI agent memory space has exploded since 2024, with dedicated benchmark suites, a growing research literature, and a measurable performance gap between different architectural approaches. Here's where things stand.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Letta&lt;/strong&gt; (formerly MemGPT) was one of the earliest frameworks to treat persistent memory as a first-class concern, reaching 21.7K GitHub stars. It uses an OS-inspired three-tier model: core memory (small, always in context), recall memory (searchable conversation history), and archival memory (long-term cold storage). The insight that not all memory is equal was correct. The implementation, however, requires agents to run entirely inside the Letta runtime — adopting it means adopting the whole platform, not just a memory layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zep / Graphiti&lt;/strong&gt; focuses on conversational memory with temporal entity tracking — facts carry validity windows so the graph knows when something was true. It's strong for chatbots that need relationship graphs, less suited for autonomous agents tracking environment facts and project conventions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cognee&lt;/strong&gt; is built for knowledge extraction from documents and structured data, with 30+ ingestion connectors and a knowledge graph backend. It excels at institutional knowledge and &lt;a href="https://www.glukhov.org/rag/" rel="noopener noreferrer"&gt;RAG pipelines&lt;/a&gt; but is less focused on personal agent memory. See &lt;a href="https://www.glukhov.org/ai-systems/memory/selfhosting-cognee-quickstart-llms-comparison/" rel="noopener noreferrer"&gt;self-hosting Cognee with local LLMs&lt;/a&gt; for a practical setup guide.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hindsight&lt;/strong&gt; does knowledge graph-based recall with entity relationships and a unique &lt;code&gt;reflect&lt;/code&gt; synthesis tool that performs cross-memory synthesis — combining multiple memories into new insights. It's among the top performers on agent memory benchmarks and is available as a memory provider for Hermes Agent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mem0&lt;/strong&gt; handles memory extraction server-side via LLM analysis, requiring minimal configuration. The Mem0 research paper, published at ECAI 2025 (arXiv:2504.19413), benchmarked ten distinct approaches to AI memory and validated the selective extraction approach — storing discrete facts, deduplicating, and retrieving only what's relevant. Mem0 has grown to approximately 48K GitHub stars and supports 21 framework integrations. The trade-off is cloud dependency and cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Databricks' memory scaling research&lt;/strong&gt; introduced the concept that agent performance improves with accumulated experience. Their architecture holds system prompts, enterprise assets, and episodic/semantic memories scoped at organization and user level, validating the idea that memory quality matters as much as model capability.&lt;/p&gt;

&lt;p&gt;The common thread across most frameworks is that they treat memory as a retrieval problem: store it somewhere, query it when needed, inject it into context. Hermes does the opposite — memory isn't retrieved on demand, it's injected at session start and always present. Always active, always available, curated enough to stay useful.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 2: Architecture — Two Files, One Brain
&lt;/h2&gt;

&lt;p&gt;Hermes Agent's built-in memory system lives in two files.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;~/.hermes/memories/MEMORY.md&lt;/code&gt; — Agent's personal notes (2,200 chars, ~800 tokens)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;~/.hermes/memories/USER.md&lt;/code&gt; — User profile (1,375 chars, ~500 tokens)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's the entire persistent memory surface: two files, under 3,600 characters total, roughly 1,300 tokens. It looks deliberately small because it is — and that's exactly the design intent.&lt;/p&gt;

&lt;h3&gt;
  
  
  MEMORY.md: The Agent's Notes
&lt;/h3&gt;

&lt;p&gt;This is where the agent stores everything it learns about its environment, the project, tools, conventions, and lessons learned. Here's what it looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;User's project is a Go microservice at ~/code/gateway using gRPC + PostgreSQL
This machine runs Ubuntu 22.04, has Docker and kubectl installed
User prefers snake_case for variable names and avoids camelCase
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These aren't logs. They're facts. Dense, declarative, information-packed. No timestamps, no fluff, no "on January 5th the user asked me to..."&lt;/p&gt;

&lt;h3&gt;
  
  
  USER.md: The User Profile
&lt;/h3&gt;

&lt;p&gt;This is where the agent stores everything it knows about you.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;User is a full-stack developer comfortable with TypeScript, Go, and Python.
User prefers snake_case for variable names and avoids camelCase.
User primarily uses Linux Ubuntu 22.04.
User deploys to AWS using Terraform.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Identity, role, preferences, technical skills, communication style, pet peeves. The stuff that makes the agent respond differently to you than to anyone else.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Frozen Snapshot Pattern
&lt;/h3&gt;

&lt;p&gt;At session start, both files are loaded from disk and injected as a frozen block into the system prompt. Here's what it looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;══════════════════════════════════════════════
MEMORY (your personal notes) [7% — 166/2,200 chars]
══════════════════════════════════════════════
User's project is a Go microservice at ~/code/gateway using gRPC + PostgreSQL
§
This machine runs Ubuntu 22.04, has Docker and kubectl installed
§
User prefers snake_case for variable names and avoids camelCase
§
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;══════════════════════════════════════════════
USER PROFILE (who the user is) [8% — 110/1,375 chars]
══════════════════════════════════════════════
User is a full-stack developer comfortable with TypeScript, Go, and Python.
§
User prefers snake_case for variable names and avoids camelCase.
§
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The format uses headers, usage percentages, character counts, and &lt;code&gt;§&lt;/code&gt; (section sign) delimiters. Entries can be multiline. It's designed to be parseable by the model while remaining human-readable.&lt;/p&gt;

&lt;p&gt;Why frozen? &lt;a href="https://www.glukhov.org/llm-performance/" rel="noopener noreferrer"&gt;Prefix caching&lt;/a&gt;. The system prompt is the same across every turn in a session. By keeping memory static after session start, the model can cache the prefix computation and only process the variable parts — the conversation. This is a significant performance optimization. You're not re-computing attention over the same memory tokens on every turn.&lt;/p&gt;

&lt;p&gt;Changes made during a session persist to disk immediately, but they only appear in the system prompt at the next session start. Tool responses always show the live state, but the model's "mind" doesn't change mid-session. This prevents the model from chasing its own tail — updating memory and then reacting to its own update in the same conversation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Character Limits as a Feature
&lt;/h3&gt;

&lt;p&gt;2,200 characters. 1,375 characters. These aren't arbitrary limits. They're design constraints that force curation.&lt;/p&gt;

&lt;p&gt;Unlimited memory is a liability: it encourages dumping everything in and never consolidating, until the store is mostly noise. Bounded memory forces the agent to be selective. What's actually important? What will I need again? What can be compressed without losing meaning?&lt;/p&gt;

&lt;p&gt;When memory is full, the agent doesn't fail silently. It receives an error listing the current entries and usage, then follows a workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Read current entries from error response&lt;/li&gt;
&lt;li&gt;Identify removable or consolidatable entries&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;replace&lt;/code&gt; to merge related entries into shorter versions&lt;/li&gt;
&lt;li&gt;Add the new entry&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is how memory stays useful. It's not a database. It's a curated collection of facts that matter.&lt;/p&gt;

&lt;h3&gt;
  
  
  Security: Prompt Injection Scanning
&lt;/h3&gt;

&lt;p&gt;Every memory entry is scanned before acceptance. The system blocks prompt injection attempts, credential exfiltration, SSH backdoors, and invisible Unicode characters.&lt;/p&gt;

&lt;p&gt;Memory is also deduplicated. Exact duplicate entries are rejected automatically. This prevents adversaries from trying to inject malicious content through repeated submissions.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 3: When Memory Fires — Triggers &amp;amp; Decisions
&lt;/h2&gt;

&lt;p&gt;The most common question about Hermes Agent's memory is when it actually saves something.&lt;/p&gt;

&lt;p&gt;The answer is: constantly, but selectively. The agent manages its own memory via the &lt;code&gt;memory&lt;/code&gt; tool, and the decision to save is driven by a combination of explicit signals and implicit patterns.&lt;/p&gt;

&lt;h3&gt;
  
  
  Writing Triggers: When Does the Agent Decide to Save?
&lt;/h3&gt;

&lt;p&gt;The agent saves memory proactively. It doesn't wait for you to ask. Here's what triggers it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User corrections.&lt;/strong&gt; When you correct the agent, that's a signal to remember. "Don't do that again." "Use this instead." "Remember this." These are explicit instructions to update memory.&lt;/p&gt;

&lt;p&gt;Example: you ask the agent to configure a Python environment. It suggests &lt;code&gt;pip&lt;/code&gt;. You say "I use &lt;code&gt;poetry&lt;/code&gt; for everything." The agent saves: &lt;code&gt;User prefers using the 'poetry' package manager for all Python projects.&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Discovered preferences.&lt;/strong&gt; The agent observes patterns and infers preferences. If you consistently use a certain tool, framework, or workflow, it gets saved.&lt;/p&gt;

&lt;p&gt;Example: after seeing you use &lt;code&gt;poetry&lt;/code&gt; multiple times across different projects, the agent saves it as a preference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Environment facts.&lt;/strong&gt; Things about the machine, the project, the tools installed. These are discovered through exploration and saved as facts.&lt;/p&gt;

&lt;p&gt;Example: the agent checks what's installed and saves: &lt;code&gt;This machine runs Ubuntu 22.04, has Docker and kubectl installed.&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Project conventions.&lt;/strong&gt; How the project is structured, what tools it uses, what patterns it follows. These are discovered through code inspection and saved.&lt;/p&gt;

&lt;p&gt;Example: &lt;code&gt;User's project is a Go microservice at ~/code/gateway using gRPC + PostgreSQL.&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Completed complex workflows.&lt;/strong&gt; After completing a task that took 5+ tool calls, the agent considers saving the approach as a skill or at least noting what worked.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool quirks and workarounds.&lt;/strong&gt; When the agent discovers something non-obvious about a tool, API, or system — a limitation, a workaround, a convention — it saves it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What gets skipped:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Trivial or obvious information&lt;/li&gt;
&lt;li&gt;Things easily re-discovered&lt;/li&gt;
&lt;li&gt;Raw data dumps&lt;/li&gt;
&lt;li&gt;Session-specific ephemera&lt;/li&gt;
&lt;li&gt;Information already in context files (SOUL.md, AGENTS.md)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Reading Triggers: When Does the Agent Recall?
&lt;/h3&gt;

&lt;p&gt;Memory isn't retrieved — it's always there. But there are different levels of access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Session start (automatic).&lt;/strong&gt; MEMORY.md and USER.md are injected into the system prompt. The agent has them from the first token. No query needed, no latency, no tool call. This is the core memory — always active.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;session_search&lt;/code&gt; (on-demand).&lt;/strong&gt; When the agent needs to find something from past conversations that isn't in core memory, it uses the &lt;code&gt;session_search&lt;/code&gt; tool. This queries SQLite (&lt;code&gt;~/.hermes/state.db&lt;/code&gt;) with FTS5 full-text search and Gemini Flash summarization.&lt;/p&gt;

&lt;p&gt;Example: you ask "Did we discuss Docker networking last week?" The agent searches session history and returns a summary of the relevant conversation.&lt;/p&gt;
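
&lt;p&gt;You can also inspect the store directly. A hypothetical sketch; the FTS table name below is a guess, so list the real schema first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;sqlite3 ~/.hermes/state.db '.tables'   # find the real FTS table first
# hypothetical query shape, assuming an FTS5 table named sessions_fts
sqlite3 ~/.hermes/state.db \
  "SELECT * FROM sessions_fts WHERE sessions_fts MATCH 'docker networking' LIMIT 5;"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;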

&lt;p&gt;&lt;strong&gt;External provider tools (when configured).&lt;/strong&gt; When an external memory provider is active, the agent has additional tools available: &lt;code&gt;honcho_search&lt;/code&gt;, &lt;code&gt;hindsight_recall&lt;/code&gt;, &lt;code&gt;mem0_search&lt;/code&gt;, etc. These are used when the agent determines that external context is needed.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Decision Tree
&lt;/h3&gt;

&lt;p&gt;Here's how the agent weighs "is this worth remembering?":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Is this a correction or explicit instruction?
  YES → Save to memory
  NO → Is this a preference or pattern?
    YES → Save to user profile
    NO → Is this an environment fact or convention?
      YES → Save to memory
      NO → Is this easily re-discovered?
        YES → Skip
        NO → Is this session-specific?
          YES → Skip
          NO → Save to memory
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent doesn't overthink this. It saves proactively, consolidates when full, and trusts the character limits to keep things tight.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 4: Internal Memory vs. External Knowledge Bases
&lt;/h2&gt;

&lt;p&gt;This is where confusion often happens. Hermes Agent has &lt;em&gt;internal memory&lt;/em&gt; (MEMORY.md, USER.md, external providers) and &lt;em&gt;external knowledge bases&lt;/em&gt; (LLM Wiki, Obsidian, Notion, ArXiv, filesystem), and they serve completely different roles. This is similar to the distinction between &lt;a href="https://www.glukhov.org/rag/" rel="noopener noreferrer"&gt;retrieval-augmented generation&lt;/a&gt; pipelines and agent working memory — external retrieval is good for deep knowledge lookups, not for carrying identity and preferences. Internal memory is the agent's brain — always active, curated, carried into every session. External knowledge bases are its library — vast reference resources consulted on demand.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Distinction
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Internal Memory (the brain):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Small, persistent, injected into system prompt&lt;/li&gt;
&lt;li&gt;Contains: user preferences, agent conventions, immediate lessons&lt;/li&gt;
&lt;li&gt;Always "in mind" during conversation&lt;/li&gt;
&lt;li&gt;Curated, bounded, actively managed&lt;/li&gt;
&lt;li&gt;Examples: MEMORY.md, USER.md, Honcho, Hindsight, Mem0&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;External Knowledge Bases (the library):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vast, reference-only, accessed on-demand&lt;/li&gt;
&lt;li&gt;Contains: documents, papers, code, notes, databases&lt;/li&gt;
&lt;li&gt;Accessed via tools when needed&lt;/li&gt;
&lt;li&gt;Not "remembered" — looked up&lt;/li&gt;
&lt;li&gt;Examples: LLM Wiki, Obsidian, Notion, ArXiv, filesystem, GitHub&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How They Relate
&lt;/h3&gt;

&lt;p&gt;The agent &lt;em&gt;accesses&lt;/em&gt; external bases via tools when needed. It doesn't "remember" them — it looks them up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM Wiki (llm-wiki):&lt;/strong&gt; Karpathy's interlinked Markdown knowledge base for building and querying domain knowledge. The agent uses the &lt;code&gt;llm-wiki&lt;/code&gt; skill to read, search, and query it. It's a reference resource, not memory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.glukhov.org/knowledge-management/tools/obsidian-for-personal-knowledge-management/" rel="noopener noreferrer"&gt;Obsidian&lt;/a&gt;:&lt;/strong&gt; Personal note vaults with bidirectional links. The agent uses the &lt;code&gt;obsidian&lt;/code&gt; skill to read, search, and create notes. Obsidian is part of the broader &lt;a href="https://www.glukhov.org/knowledge-management/" rel="noopener noreferrer"&gt;personal knowledge management&lt;/a&gt; ecosystem that Hermes can tap into as a library resource.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Notion/Airtable:&lt;/strong&gt; Structured databases and wikis accessed via API. The agent queries them when needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ArXiv:&lt;/strong&gt; Academic paper repositories. The agent searches and extracts papers when researching a topic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Filesystem:&lt;/strong&gt; Project code, documentation, configurations. The agent reads files when working on a project.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Distillation Pattern
&lt;/h3&gt;

&lt;p&gt;Here's the key insight: critical insights from external bases can be &lt;em&gt;distilled&lt;/em&gt; into internal memory.&lt;/p&gt;

&lt;p&gt;Example: the agent reads a paper from ArXiv about memory scaling for AI agents. It doesn't save the entire paper to memory. It saves the key takeaway: &lt;code&gt;Memory scaling: agent performance improves with accumulated experience through user interaction and business context stored in memory.&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The external resource is vast. The internal memory is the distillation.&lt;/p&gt;

&lt;h3&gt;
  
  
  When to Use Which
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Internal memory for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Who am I helping?"&lt;/li&gt;
&lt;li&gt;"What do they prefer?"&lt;/li&gt;
&lt;li&gt;"What did we just learn?"&lt;/li&gt;
&lt;li&gt;"What's the project setup?"&lt;/li&gt;
&lt;li&gt;"What tools are available?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;External knowledge bases for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"What's the latest research on X?"&lt;/li&gt;
&lt;li&gt;"What's in my project's documentation?"&lt;/li&gt;
&lt;li&gt;"What did we discuss last month?"&lt;/li&gt;
&lt;li&gt;"What's the API for this service?"&lt;/li&gt;
&lt;li&gt;"What's the code structure?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent understands the difference and uses each appropriately — it doesn't conflate looking up a document with recalling something it has learned about you and your environment.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 5: How It Actually Works
&lt;/h2&gt;

&lt;p&gt;Let's look at the mechanics.&lt;/p&gt;

&lt;h3&gt;
  
  
  The &lt;code&gt;memory&lt;/code&gt; Tool
&lt;/h3&gt;

&lt;p&gt;The agent manages memory through a single tool with three actions: &lt;code&gt;add&lt;/code&gt;, &lt;code&gt;replace&lt;/code&gt;, &lt;code&gt;remove&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;There is no &lt;code&gt;read&lt;/code&gt; action — memory content is auto-injected into the system prompt. The agent doesn't need to read it because it's always there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;add&lt;/code&gt;&lt;/strong&gt; — Adds a new entry.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;add&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;memory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User runs macOS 14 Sonoma, uses Homebrew, has Docker Desktop installed.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;code&gt;replace&lt;/code&gt;&lt;/strong&gt; — Replaces an existing entry using substring matching.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;replace&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;memory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;old_text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dark mode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User prefers light mode in VS Code, dark mode in terminal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;code&gt;remove&lt;/code&gt;&lt;/strong&gt; — Removes an entry using substring matching.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;remove&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;memory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;old_text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temporary project fact&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Substring Matching
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;replace&lt;/code&gt; and &lt;code&gt;remove&lt;/code&gt; locate entries by a short unique substring passed as &lt;code&gt;old_text&lt;/code&gt;. You don't need to reproduce the full entry text, so surgical edits are possible even when the agent doesn't recall an entry verbatim.&lt;/p&gt;

&lt;p&gt;If a substring matches multiple entries, an error is returned requesting a more specific match. The agent then refines its query.&lt;/p&gt;
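&lt;p&gt;For example (the entry text here is illustrative, not from a real session), a short substring that appears in two entries fails, and a longer one succeeds:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Ambiguous: "VS Code" appears in two entries, so the tool returns an error
memory(action="remove", target="memory", old_text="VS Code")

# Refined: a longer substring matches exactly one entry
memory(action="remove", target="memory", old_text="VS Code on the work laptop")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;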

&lt;h3&gt;
  
  
  Target Stores: &lt;code&gt;memory&lt;/code&gt; vs &lt;code&gt;user&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;target&lt;/code&gt; parameter determines which file gets updated; a short sketch follows the list.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;memory&lt;/code&gt;&lt;/strong&gt; — Agent's personal notes. Environment facts, project conventions, tool quirks, lessons learned.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;user&lt;/code&gt;&lt;/strong&gt; — User profile. Identity, role, timezone, communication preferences, pet peeves, workflow habits.&lt;/li&gt;
&lt;/ul&gt;
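
&lt;p&gt;A minimal sketch, reusing the tool signature shown above (entry contents are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Environment fact: belongs in the agent's own notes
memory(action="add", target="memory",
       content="Project gateway: Go + gRPC, PostgreSQL. Tests run via make test.")

# Fact about the person: belongs in the user profile
memory(action="add", target="user",
       content="Prefers concise answers. Timezone: Europe/Berlin.")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;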

&lt;h3&gt;
  
  
  Capacity Management
&lt;/h3&gt;

&lt;p&gt;When memory is &amp;gt;80% full, the agent consolidates. It merges related entries, removes outdated facts, and compresses information.&lt;/p&gt;

&lt;p&gt;Good memory entries are compact and information-dense:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User runs macOS 14 Sonoma, uses Homebrew, has Docker Desktop installed. Shell: zsh with oh-my-zsh. Editor: Neovim with Telescope plugin.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Bad memory entries are vague or verbose:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User has a project.
On January 5th, 2026, the user asked me to look at their project which is located at ~/code/gateway and it uses Go with gRPC and PostgreSQL for the database layer.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first is dense and useful. The second shows both failure modes: the one-line entry is too vague to act on, and the dated entry buries a few useful facts in narrative detail.&lt;/p&gt;

&lt;h3&gt;
  
  
  Session Search vs Persistent Memory
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;session_search&lt;/code&gt; and persistent memory serve different purposes.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Persistent Memory&lt;/th&gt;
&lt;th&gt;Session Search&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Capacity&lt;/td&gt;
&lt;td&gt;~1,300 tokens total&lt;/td&gt;
&lt;td&gt;Unlimited (all sessions)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speed&lt;/td&gt;
&lt;td&gt;Instant (in system prompt)&lt;/td&gt;
&lt;td&gt;Requires search + LLM summarization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Use Case&lt;/td&gt;
&lt;td&gt;Key facts always available&lt;/td&gt;
&lt;td&gt;Finding specific past conversations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Management&lt;/td&gt;
&lt;td&gt;Manually curated by agent&lt;/td&gt;
&lt;td&gt;Automatic — all sessions stored&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Token Cost&lt;/td&gt;
&lt;td&gt;Fixed per session (~1,300 tokens)&lt;/td&gt;
&lt;td&gt;On-demand (searched when needed)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Rule of thumb: use memory for critical facts that should always be in context. Use session search for historical lookups.&lt;/p&gt;
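
&lt;p&gt;As a sketch of the second case (the exact &lt;code&gt;session_search&lt;/code&gt; signature is not documented here, so the parameter name is an assumption):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Historical lookup: search past sessions instead of spending memory budget
results = session_search(query="gateway project database migration discussion")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;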




&lt;h2&gt;
  
  
  Part 6: External Memory Providers
&lt;/h2&gt;

&lt;p&gt;Beyond built-in MEMORY.md and USER.md, Hermes Agent can attach &lt;strong&gt;one external memory plugin&lt;/strong&gt; at a time — Honcho, OpenViking, Mem0, Hindsight, Holographic, RetainDB, ByteRover, or Supermemory — for persistent, cross-session knowledge. The two core files remain loaded alongside whichever provider is active (additive, not replacement).&lt;/p&gt;

&lt;p&gt;Activate and inspect providers with &lt;code&gt;hermes memory setup&lt;/code&gt;, &lt;code&gt;hermes memory status&lt;/code&gt;, and &lt;code&gt;hermes memory off&lt;/code&gt;, or set &lt;code&gt;memory.provider&lt;/code&gt; in &lt;code&gt;~/.hermes/config.yaml&lt;/code&gt;.&lt;/p&gt;
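
&lt;p&gt;A minimal sketch of the config-file route (only the &lt;code&gt;memory.provider&lt;/code&gt; key is named above; the value shown is one of the listed providers):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# ~/.hermes/config.yaml
memory:
  provider: mem0   # or honcho, supermemory, ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;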

&lt;p&gt;For the &lt;strong&gt;full comparison table&lt;/strong&gt;, LLM and embedding dependency notes, per-provider breakdowns, and how these backends relate to OpenClaw and other stacks, see &lt;strong&gt;&lt;a href="https://www.glukhov.org/ai-systems/memory/agent-memory-providers/" rel="noopener noreferrer"&gt;Agent memory providers compared&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For &lt;strong&gt;profile-specific wiring&lt;/strong&gt; and production workflows, see &lt;a href="https://www.glukhov.org/ai-systems/hermes/production-setup/" rel="noopener noreferrer"&gt;Hermes Agent production setup&lt;/a&gt;. The &lt;strong&gt;&lt;a href="https://www.glukhov.org/ai-systems/memory/" rel="noopener noreferrer"&gt;AI Systems Memory hub&lt;/a&gt;&lt;/strong&gt; lists this guide plus related Cognee and knowledge-layer articles.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 7: The Philosophy
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why Bounded Memory Beats Unlimited Memory
&lt;/h3&gt;

&lt;p&gt;The instinct is to make memory as large as possible. Store everything. Retrieve what you need.&lt;/p&gt;

&lt;p&gt;Bounded memory works better. Here's why.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Curation forces quality.&lt;/strong&gt; When you have limited space, you only save what matters. You compress, consolidate, and prioritize. Unlimited memory encourages dumping everything in and never cleaning up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speed matters.&lt;/strong&gt; 1,300 tokens in the system prompt is fast. 100,000 tokens retrieved from a database is slow. Memory should be instant, not a query.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Noise degrades performance.&lt;/strong&gt; More memory isn't better memory. It's noisier memory. The model has to distinguish signal from noise, and that takes attention — attention that should be spent on the actual task.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Forgetting is a feature.&lt;/strong&gt; Human memory forgets. That's not a bug — it's how we prioritize. Agents should forget too. Not everything deserves to be remembered.&lt;/p&gt;

&lt;h3&gt;
  
  
  The "Forgetting" Problem
&lt;/h3&gt;

&lt;p&gt;Agents need to unlearn. Not just forget, but &lt;em&gt;actively&lt;/em&gt; remove outdated information.&lt;/p&gt;

&lt;p&gt;Here's how Hermes Agent handles it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;remove&lt;/code&gt; action:&lt;/strong&gt; Delete entries that are no longer relevant.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;replace&lt;/code&gt; action:&lt;/strong&gt; Update entries with new information.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capacity pressure:&lt;/strong&gt; When memory is full, the agent consolidates and removes old entries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security scanning:&lt;/strong&gt; Blocks malicious or corrupted entries.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Forgetting isn't failure — it's maintenance. An agent that can't unlearn will eventually carry as much noise as signal.&lt;/p&gt;

&lt;h3&gt;
  
  
  Memory Scaling
&lt;/h3&gt;

&lt;p&gt;Databricks introduced the concept of "memory scaling": does an agent with thousands of users perform better than one with a single user?&lt;/p&gt;

&lt;p&gt;Their research suggests yes, but with caveats. Memory scaling requires:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Quality extraction:&lt;/strong&gt; Not all interactions are worth remembering. The agent must extract insights, not logs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Effective retrieval:&lt;/strong&gt; Retrieved memories must be relevant. Noise degrades performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generalization:&lt;/strong&gt; Memories should be patterns, not specifics. "User prefers Python" scales. "User ran command X at timestamp Y" does not.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Hermes Agent's bounded memory naturally supports memory scaling. By forcing curation, it ensures that memories are generalizable, compact, and useful.&lt;/p&gt;

&lt;h3&gt;
  
  
  What This Means for the Future
&lt;/h3&gt;

&lt;p&gt;Memory is becoming the competitive moat in agentic AI — not the model itself, but what the model carries between sessions. Two agents with identical underlying models can perform very differently: one remembers your preferences, your environment, and your past mistakes; the other starts cold every time.&lt;/p&gt;

&lt;p&gt;The question is no longer whether agents should have persistent memory. It's settled: they must. The open question is how to design that memory well — what to keep, what to discard, how to make it instant, and how to prevent it from becoming noise.&lt;/p&gt;

&lt;p&gt;Hermes Agent's answer is to keep memory small, curated, and always active — not a database you query, but a working model of the user that the agent carries with it into every conversation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Hermes Agent's memory system is deliberately simple: two files, firm character limits, no retrieval pipeline, no &lt;a href="https://www.glukhov.org/rag/vector-stores/vector-stores-for-rag-comparison/" rel="noopener noreferrer"&gt;vector database&lt;/a&gt;, and no per-query latency. What sounds like a constraint is the whole point.&lt;/p&gt;

&lt;p&gt;It works because it treats memory the way a brain works rather than the way a database does — small, curated, and always active. The agent doesn't retrieve memory when it needs it; the memory is simply always there, woven into the system prompt from the first token of every session.&lt;/p&gt;

&lt;p&gt;External memory providers extend this system for users who need more: knowledge graphs, multi-agent support, self-hosted storage, enterprise features. But the core remains the same: bounded, curated, always available.&lt;/p&gt;

&lt;p&gt;And external knowledge bases — LLM Wiki, Obsidian, Notion, ArXiv — serve a different role. They're the library, not the brain. The agent looks them up, doesn't remember them. Critical insights get distilled into internal memory; the rest stays in the library.&lt;/p&gt;

&lt;p&gt;This is how an AI agent remembers you. Not by storing everything, but by remembering what matters.&lt;/p&gt;




&lt;p&gt;Hermes Agent was released by Nous Research in February 2026 and reached over 64,000 GitHub stars by April 2026 (v0.9.0), with 242+ contributors. It is open-source and available at &lt;a href="https://github.com/NousResearch/hermes-agent" rel="noopener noreferrer"&gt;github.com/NousResearch/hermes-agent&lt;/a&gt;. For install, configuration, and workflow guides, see the &lt;a href="https://www.glukhov.org/ai-systems/hermes/" rel="noopener noreferrer"&gt;Hermes Agent overview&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>hermes</category>
      <category>architecture</category>
      <category>selfhosting</category>
      <category>llm</category>
    </item>
    <item>
      <title>OpenClaw Rise and Fall — Timeline and Real Reasons Behind the Collapse</title>
      <dc:creator>Rost</dc:creator>
      <pubDate>Tue, 28 Apr 2026 12:50:44 +0000</pubDate>
      <link>https://dev.to/rosgluk/openclaw-rise-and-fall-timeline-and-real-reasons-behind-the-collapse-3nh7</link>
      <guid>https://dev.to/rosgluk/openclaw-rise-and-fall-timeline-and-real-reasons-behind-the-collapse-3nh7</guid>
      <description>&lt;p&gt;OpenClaw did not fail as a product. It lost its fuel.&lt;/p&gt;

&lt;p&gt;What looks like a dramatic boom and collapse is actually something more mechanical and more interesting. OpenClaw was a thin layer on top of a temporary economic advantage in the AI ecosystem. Once that advantage disappeared, so did the attention.&lt;/p&gt;

&lt;p&gt;This article breaks down the exact timeline, the real drivers behind the spike, and why the drop was inevitable.&lt;/p&gt;




&lt;h2&gt;
  
  
  The illusion of product-driven growth
&lt;/h2&gt;

&lt;p&gt;Most people assume OpenClaw grew because it was a great AI agent — and that is only partially true.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.glukhov.org/ai-systems/openclaw/" rel="noopener noreferrer"&gt;OpenClaw&lt;/a&gt; was genuinely useful. It supported more than 50 integrations, worked across Claude, GPT-4o, Gemini, and DeepSeek, and attracted enterprise adoption — Tencent built a platform directly on top of it. But capability alone did not set it apart from comparable alternatives:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cline&lt;/li&gt;
&lt;li&gt;LangChain-based setups&lt;/li&gt;
&lt;li&gt;Other agent wrappers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The real driver was access rather than capability — a distinction that explains the entire arc of OpenClaw's rise and collapse.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;OpenClaw made powerful models cheap to use at scale.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Phase 1. Quiet emergence (November 2025)
&lt;/h2&gt;

&lt;p&gt;The story begins in November 2025, when Peter Steinberger built the first prototype in roughly one hour. He was annoyed that the tool did not exist yet, so he built it, calling it &lt;strong&gt;Clawdbot&lt;/strong&gt; — a nod to Anthropic's Claude, complete with a lobster mascot.&lt;/p&gt;

&lt;p&gt;The first version was practical rather than flashy: an AI agent that could manage calendars, check email, book appointments, and automate computer tasks on the user's behalf. Steinberger shared it in developer communities, and early adopters recognized something promising, though growth at this stage remained slow and organic, with no visibility outside technical circles.&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 2. The viral ignition (January–February 2026)
&lt;/h2&gt;

&lt;p&gt;The spike began when several forces aligned in quick succession.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Naming drama and forced rebrands
&lt;/h3&gt;

&lt;p&gt;In late January 2026, Anthropic sent Steinberger a trademark notice over "Clawdbot," citing phonetic similarity to "Claude." By his account, Anthropic handled it professionally — but the notice forced a rename. The project became &lt;strong&gt;Moltbot&lt;/strong&gt; for three days, then &lt;strong&gt;OpenClaw&lt;/strong&gt;, and the forced rebranding generated exactly the kind of attention that marketing budgets cannot buy.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. The agent hype wave
&lt;/h3&gt;

&lt;p&gt;The market was already primed for an agent breakthrough:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;autonomous agents were trending across social media and the tech press&lt;/li&gt;
&lt;li&gt;"AI that can act" had become the dominant narrative&lt;/li&gt;
&lt;li&gt;developers were actively searching for tools that could automate complex workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;OpenClaw arrived at exactly the right moment, when demand for this kind of tool was at its highest and the story of autonomous AI agents was capturing mainstream attention.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. The cheap compute loophole
&lt;/h3&gt;

&lt;p&gt;The most decisive factor was a compute pricing loophole that no amount of good engineering could have manufactured deliberately.&lt;/p&gt;

&lt;p&gt;Users discovered that OpenClaw could connect to Claude by grabbing the OAuth token from a Claude Pro or Max subscription and spoofing the authentication headers of Anthropic's own &lt;a href="https://www.glukhov.org/ai-devtools/claude-code/" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt; client. Instead of paying per token through the API, they effectively got:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;near unlimited agent execution for a fixed monthly cost&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The numbers made this explosive. A Claude Max subscription cost $200 per month, while running equivalent workloads through the API would cost far more — industry analysts estimated a price gap of more than five times, meaning Anthropic was quietly subsidising each heavy OpenClaw user by hundreds of dollars a month.&lt;/p&gt;

&lt;p&gt;This changed behavior instantly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;developers ran heavy experiments they would never have attempted at API prices&lt;/li&gt;
&lt;li&gt;viral demos flooded social media&lt;/li&gt;
&lt;li&gt;large-scale automation became accessible to solo developers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Nothing in the software changed — the economics did, and that shift alone was enough to ignite a viral adoption curve. By March 2, 2026, the OpenClaw repository had accumulated &lt;strong&gt;247,000 GitHub stars and 47,700 forks&lt;/strong&gt;, reaching 100,000 stars in under 48 hours — a pace that led it to be widely described as the fastest-growing GitHub project in history.&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 3. Peak usage and inflated expectations
&lt;/h2&gt;

&lt;p&gt;At peak interest, developers pushed agents to extremes, social media amplified the results, and expectations exploded around what personal AI automation could achieve. An estimated &lt;strong&gt;135,000 OpenClaw instances&lt;/strong&gt; were running simultaneously when Anthropic made its announcement, and one founder described publicly how she had deployed nine separate AI agents to manage her administrative work and personal household logistics.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why do AI tools suddenly become popular and then fade
&lt;/h3&gt;

&lt;p&gt;Because the initial spike is driven by novelty and perceived leverage. Once users test the limits, reality sets in — the tool proves harder to use reliably, and the economic conditions that made it attractive often turn out to be temporary. In OpenClaw's case, the perceived leverage was real but built on borrowed economics that Anthropic had not priced for agentic workloads.&lt;/p&gt;




&lt;h2&gt;
  
  
  The creator leaves for OpenAI (February 2026)
&lt;/h2&gt;

&lt;p&gt;Before the collapse arrived, OpenClaw lost its original architect.&lt;/p&gt;

&lt;p&gt;On February 14–15, 2026, Steinberger announced he was leaving the project to join OpenAI. Sam Altman posted that Steinberger would "drive the next generation of personal agents" at the company, and Steinberger wrote that "teaming up with OpenAI is the fastest way to bring this to everyone." OpenClaw was transferred to an independent open-source foundation with OpenAI's continued support.&lt;/p&gt;

&lt;p&gt;The timing was striking. Anthropic had declined to hire or partner with Steinberger, despite the fact that his tool had become arguably their best free marketing in years — a project built explicitly to showcase how good Claude was. Instead, he went directly to their biggest competitor, taking with him both the project's momentum and its community relationships.&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 4. The correction begins
&lt;/h2&gt;

&lt;p&gt;Two things started happening at the same time.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Reality of agent limitations
&lt;/h3&gt;

&lt;p&gt;Users who had deployed OpenClaw at scale began encountering its real constraints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;agents are brittle and fail unpredictably on multi-step tasks&lt;/li&gt;
&lt;li&gt;reliability is inconsistent across different workflows and environments&lt;/li&gt;
&lt;li&gt;setup and maintenance is non-trivial for most users outside technical circles&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These limitations alone would have caused a gradual decline, but OpenClaw did not taper off gradually — it dropped sharply, because a second and more decisive force hit at exactly the same time.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. The economic layer breaks
&lt;/h3&gt;

&lt;p&gt;Anthropic had already run this playbook once. In January 2026, just weeks before OpenClaw peaked, they blocked &lt;a href="https://www.glukhov.org/ai-devtools/opencode/" rel="noopener noreferrer"&gt;&lt;strong&gt;OpenCode&lt;/strong&gt;&lt;/a&gt; — another popular third-party coding client — from using Claude subscription tokens in what was framed as a terms of service violation, not a capacity issue. OpenClaw users had every reason to expect the same treatment, and that moment arrived in April.&lt;/p&gt;

&lt;p&gt;Anthropic then introduced restrictions that closed the loophole entirely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;third-party tools were blocked from using subscription OAuth tokens&lt;/li&gt;
&lt;li&gt;usage shifted to pay-as-you-go extra billing or full API keys&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This removed the key advantage:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;cheap large-scale execution&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now users faced a very different cost structure:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before cutoff&lt;/th&gt;
&lt;th&gt;After cutoff&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Monthly plan cost&lt;/td&gt;
&lt;td&gt;$20–$200 (flat)&lt;/td&gt;
&lt;td&gt;$20–$200 + usage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost per task&lt;/td&gt;
&lt;td&gt;Effectively $0&lt;/td&gt;
&lt;td&gt;$0.50–$2.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API rate (Sonnet 4.6 input)&lt;/td&gt;
&lt;td&gt;Covered by sub&lt;/td&gt;
&lt;td&gt;$3 per million tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API rate (Sonnet 4.6 output)&lt;/td&gt;
&lt;td&gt;Covered by sub&lt;/td&gt;
&lt;td&gt;$15 per million tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Increase for heavy users&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;10× to 50×&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
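
&lt;p&gt;To make the table concrete, here is a back-of-the-envelope calculation at the Sonnet 4.6 rates above. The token volumes are assumptions for a heavy agent user, not measured data:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# API rates from the table above, in USD per million tokens
INPUT_RATE, OUTPUT_RATE = 3.00, 15.00

# Assumed monthly volume for a heavy agent user (illustrative only)
input_millions, output_millions = 150, 40

api_cost = input_millions * INPUT_RATE + output_millions * OUTPUT_RATE
print(f"${api_cost:,.0f}/month on the API vs $200 flat")  # $1,050/month
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That is roughly the five-times gap analysts described, and it scales linearly with usage, which is why the heaviest users saw 10× to 50× increases.&lt;/p&gt;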

&lt;h3&gt;
  
  
  What caused the sudden drop in interest in AI agent tools
&lt;/h3&gt;

&lt;p&gt;The answer is straightforward: not a lack of innovation, but the loss of affordable compute. Once the pricing floor disappeared, the incentive to experiment and share disappeared with it, and search interest followed almost immediately.&lt;/p&gt;




&lt;h2&gt;
  
  
  April 4, 2026 — The hard cutoff
&lt;/h2&gt;

&lt;p&gt;On April 4, 2026, at 12 PM Pacific Time, subscription access ended for all third-party tools.&lt;/p&gt;

&lt;p&gt;Boris Cherny, Head of Claude Code at Anthropic, posted on X that Claude Pro and Max subscriptions would no longer cover usage from third-party tools, effective immediately. An Anthropic spokesperson confirmed that using subscriptions with third-party tools was always against the terms of service, and that those tools were placing "an outsized strain on our systems." Additional context made the timing feel urgent: on April 1, the full source code of Claude Code — 512,000 lines of TypeScript — had leaked through an npm package, exposing exactly how Anthropic's first-party tools authenticated with the backend and making it more pressing to lock down third-party tools that were spoofing those same patterns.&lt;/p&gt;

&lt;p&gt;Anthropic offered a one-time credit equal to one month's subscription fee and a 30% discount on pre-purchased usage bundles to ease the transition. For light users, the credit covered the adjustment period, but for power users running multiple instances the new numbers simply did not work. The effect on activity was immediate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;experimentation stopped&lt;/li&gt;
&lt;li&gt;viral sharing disappeared&lt;/li&gt;
&lt;li&gt;search interest collapsed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This matches the sharp drop in Google Trends almost perfectly. The full policy mechanics and migration options after the cutoff are covered in &lt;a href="https://www.glukhov.org/ai-systems/openclaw/anthropic-claude-subscription-agent-tools/" rel="noopener noreferrer"&gt;Claude, OpenClaw, and the End of Flat Pricing for Agents&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  OpenAI moves in the opposite direction
&lt;/h2&gt;

&lt;p&gt;On the same day as the Anthropic ban, OpenAI publicly confirmed that ChatGPT Plus, Pro, and Team subscribers were entirely free to use their subscriptions to power OpenClaw through OAuth — including with models like GPT-5.3 Codex for complex coding tasks.&lt;/p&gt;

&lt;p&gt;This was not accidental timing. By hiring Steinberger and explicitly opening their subscription gates, OpenAI positioned themselves as the developer-friendly alternative at the exact moment Anthropic cut off its most active community, securing the loyalty of the developers who were building the next generation of AI tools.&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 5. Where OpenClaw users actually went
&lt;/h2&gt;

&lt;p&gt;Users did not disappear after the ban — they redistributed across a spectrum of alternatives depending on their technical depth and budget.&lt;/p&gt;

&lt;h3&gt;
  
  
  Direct usage of chat assistants
&lt;/h3&gt;

&lt;p&gt;Many users moved back to direct chat interfaces, trading agent automation for the simplicity and reliability they had given up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ChatGPT&lt;/li&gt;
&lt;li&gt;Claude UI&lt;/li&gt;
&lt;li&gt;Gemini&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Are AI agents replacing traditional chat assistants
&lt;/h3&gt;

&lt;p&gt;No — for most users, agents add complexity without enough reliability gains. The chat interface remains the default for daily use because it is faster to start, easier to debug when something goes wrong, and requires no infrastructure setup. Agents serve a committed minority of power users, not the general population. The &lt;a href="https://www.glukhov.org/ai-devtools/" rel="noopener noreferrer"&gt;AI developer tools&lt;/a&gt; ecosystem has evolved to fill this gap with tools that sit between raw agents and simple chat, giving developers structured assistance without full agentic overhead.&lt;/p&gt;




&lt;h3&gt;
  
  
  Cheaper model ecosystems
&lt;/h3&gt;

&lt;p&gt;Power users with the technical ability to self-host migrated toward lower-cost alternatives:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Qwen&lt;/li&gt;
&lt;li&gt;DeepSeek&lt;/li&gt;
&lt;li&gt;other low-cost models accessible through &lt;a href="https://www.glukhov.org/llm-hosting/comparisons/hosting-llms-ollama-localai-jan-lmstudio-vllm-comparison/" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt; for fully local setups&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Which models are popular for low-cost AI experimentation
&lt;/h3&gt;

&lt;p&gt;Models that offer lower pricing, fewer usage restrictions, and flexible deployment including local self-hosting absorbed the bulk of displaced OpenClaw power users. These ecosystems grew quietly rather than generating public hype, which is why the migration was largely invisible in trend data even as it represented a significant redistribution of compute demand.&lt;/p&gt;




&lt;h3&gt;
  
  
  Alternative agent frameworks
&lt;/h3&gt;

&lt;p&gt;Developers who still needed agent capabilities switched to leaner approaches:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;custom scripts tailored to specific workflows&lt;/li&gt;
&lt;li&gt;lightweight frameworks with fewer dependencies&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.glukhov.org/llm-hosting/" rel="noopener noreferrer"&gt;self-hosted solutions&lt;/a&gt; combining local models with minimal tooling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key difference from OpenClaw is that these users optimized for cost and control rather than convenience, and built for sustainability rather than maximum automation at minimum price. This is the pattern common across the &lt;a href="https://www.glukhov.org/ai-systems/" rel="noopener noreferrer"&gt;self-hosted AI systems&lt;/a&gt; ecosystem — provider independence treated as a design requirement, not an afterthought.&lt;/p&gt;




&lt;h2&gt;
  
  
  The overlooked factor — why cost is the real product
&lt;/h2&gt;

&lt;p&gt;The most important insight from OpenClaw's trajectory is that cost functions as the real product in AI adoption.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why is cost important in AI adoption
&lt;/h3&gt;

&lt;p&gt;Because usage scales non-linearly with compute costs. When compute is cheap, experimentation explodes, innovation accelerates, and attention grows because viral sharing becomes economically rational. When compute becomes expensive, usage contracts to serious workflows only, casual users leave, and hype disappears almost overnight — which is precisely why &lt;a href="https://www.glukhov.org/llm-performance/cost-effective-llm-applications/" rel="noopener noreferrer"&gt;token optimization and cost reduction strategies&lt;/a&gt; become critical skills once compute stops being subsidized.&lt;/p&gt;

&lt;p&gt;OpenClaw demonstrated this rule in an unusually clear form: between February and April 2026, the software did not change, but the economics of running it did — and that single shift was enough to collapse the community in a matter of days.&lt;/p&gt;




&lt;h2&gt;
  
  
  OpenClaw was never the core story
&lt;/h2&gt;

&lt;p&gt;OpenClaw functioned as a surface layer on top of more fundamental forces.&lt;/p&gt;

&lt;p&gt;The real story involved three factors operating simultaneously:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;access to Claude models at subscription prices rather than API rates&lt;/li&gt;
&lt;li&gt;a five-to-one pricing mismatch between what users paid and what usage actually cost Anthropic&lt;/li&gt;
&lt;li&gt;a policy correction that had to happen eventually given the scale of that mismatch&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once those underlying conditions changed, any tool that depended on them would show the same pattern — which is exactly why similar tools spiked and declined in lockstep, regardless of their individual quality or feature sets. Anthropic's decision also revealed something strategic: by blocking third-party clients while protecting Claude Code, the company chose to concentrate developer engagement inside its own first-party tooling at a moment when independent communities were iterating faster than any centralized lab.&lt;/p&gt;




&lt;h2&gt;
  
  
  The pattern repeats across AI
&lt;/h2&gt;

&lt;p&gt;OpenClaw's trajectory is not unique — the same cycle has played out repeatedly across the AI ecosystem.&lt;/p&gt;

&lt;p&gt;The same pattern appears in AutoGPT, BabyAGI, and other early agent frameworks that attracted massive attention and then faded as compute costs, reliability limits, or platform restrictions were enforced. The cycle is consistent:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;New capability appears
&lt;/li&gt;
&lt;li&gt;Cheap or free usage emerges
&lt;/li&gt;
&lt;li&gt;Viral experimentation begins
&lt;/li&gt;
&lt;li&gt;Costs or limits are enforced
&lt;/li&gt;
&lt;li&gt;Attention collapses
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each cycle leaves behind a smaller, more committed user base and a clearer understanding of what actually works at scale — which is how progress compounds even through the boom-and-bust pattern.&lt;/p&gt;




&lt;h2&gt;
  
  
  OpenClaw vs Hermes Agent — what the trend data shows
&lt;/h2&gt;

&lt;p&gt;Worldwide Google Trends data over the past three months compares search interest for OpenClaw AI and Hermes Agent. OpenClaw peaked at an index of 100 in mid-March 2026 and collapsed sharply in April after the subscription cutoff. Hermes Agent barely registered during OpenClaw's peak, then gradually picked up interest as OpenClaw faded — reaching an index of around 40 in bursts through April, compared to OpenClaw's average of 49 and Hermes's average of 8.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.glukhov.org/ai-systems/hermes/" rel="noopener noreferrer"&gt;Hermes Agent&lt;/a&gt; is an open-source framework built by Nous Research and released in February 2026. Unlike OpenClaw, which is optimized for broad reactive tool use across many integrations, Hermes is built around a learning loop: it generates reusable skills from successful task completions, refines them through continued use, and maintains a persistent model of the user across sessions. The result is an agent that improves the more it is used on the same task types, rather than approaching each job from the same baseline. It reached 95,600 GitHub stars in its first seven weeks.&lt;/p&gt;

&lt;p&gt;The gap between the two trend lines is significant. OpenClaw's hype surplus did not transfer to Hermes — it evaporated. Casual experimenters who had been running agents cheaply on Claude subscriptions simply left the space rather than migrating to an alternative. The users who did move to Hermes were the committed technical minority who needed persistent, self-hosted automation and were willing to set it up properly — which is exactly the kind of smaller, more sustainable user base that remains after every AI hype cycle collapses. For those users, &lt;a href="https://www.glukhov.org/ai-systems/hermes/production-setup/" rel="noopener noreferrer"&gt;Hermes production setup patterns&lt;/a&gt; are worth exploring.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final takeaway — follow the economics, not the interface
&lt;/h2&gt;

&lt;p&gt;OpenClaw did not rise because it was revolutionary — it rose because it unlocked something temporarily underpriced, and it fell not because it failed as a product but because that pricing advantage was removed by the platform it depended on.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This was not a product lifecycle. It was a pricing event.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Understanding this distinction is critical for predicting the next spike in AI tooling. The same pattern will repeat whenever a new compute subsidy appears, whether through a subscription loophole, a generous free tier, or a new open-weight model that undercuts established pricing. Track where compute is temporarily cheap and you will find the next wave of viral AI tools before the hype arrives.&lt;/p&gt;

</description>
      <category>selfhosting</category>
      <category>llm</category>
      <category>ai</category>
    </item>
    <item>
      <title>Llama-Server Router Mode - Dynamic Model Switching Without Restarts</title>
      <dc:creator>Rost</dc:creator>
      <pubDate>Mon, 27 Apr 2026 11:42:46 +0000</pubDate>
      <link>https://dev.to/rosgluk/llama-server-router-mode-dynamic-model-switching-without-restarts-1h0j</link>
      <guid>https://dev.to/rosgluk/llama-server-router-mode-dynamic-model-switching-without-restarts-1h0j</guid>
      <description>&lt;p&gt;For a long time, &lt;code&gt;llama.cpp&lt;/code&gt; had a glaring limitation:&lt;br&gt;&lt;br&gt;
you could only serve &lt;strong&gt;one model per process&lt;/strong&gt;, and switching meant a restart.&lt;/p&gt;



&lt;p&gt;That era is over.&lt;/p&gt;

&lt;p&gt;Recent updates introduced &lt;strong&gt;router mode&lt;/strong&gt; in &lt;code&gt;llama-server&lt;/code&gt;, bringing something much closer to what people expect from modern local LLM runtimes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;dynamic model loading
&lt;/li&gt;
&lt;li&gt;unloading on demand
&lt;/li&gt;
&lt;li&gt;switching per request
&lt;/li&gt;
&lt;li&gt;no process restart
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words: &lt;strong&gt;Ollama-like behavior, but without the training wheels&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you are still deciding between local runtimes, cloud APIs, and self-hosted infrastructure, the&lt;br&gt;
&lt;a href="https://www.glukhov.org/llm-hosting/" rel="noopener noreferrer"&gt;LLM hosting overview&lt;/a&gt; is a good starting point.&lt;/p&gt;


&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Router mode requires a recent &lt;code&gt;llama-server&lt;/code&gt; build — roughly post mid-2024. Older builds do not have the &lt;code&gt;--models-preset&lt;/code&gt; flag.&lt;/p&gt;

&lt;p&gt;For install options (package manager, pre-built binaries, or full source build with CUDA), see the&lt;br&gt;
&lt;a href="https://www.glukhov.org/llm-hosting/llama-cpp/" rel="noopener noreferrer"&gt;llama.cpp quickstart&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Once you have &lt;code&gt;llama-server&lt;/code&gt;, confirm your build supports router mode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;llama-server &lt;span class="nt"&gt;--help&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; models
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the &lt;code&gt;--models-preset&lt;/code&gt; flag appears, you are good. If it is absent, update to a newer build.&lt;/p&gt;

&lt;p&gt;Here is the models-related portion of the help output from my current build:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nt"&gt;-cl&lt;/span&gt;,   &lt;span class="nt"&gt;--cache-list&lt;/span&gt;                     show list of models &lt;span class="k"&gt;in &lt;/span&gt;cache
                                        Prefix/Suffix/Middle&lt;span class="o"&gt;)&lt;/span&gt; as some models prefer this. &lt;span class="o"&gt;(&lt;/span&gt;default: disabled&lt;span class="o"&gt;)&lt;/span&gt;
                                        models with dynamic resolution &lt;span class="o"&gt;(&lt;/span&gt;default: &lt;span class="nb"&gt;read &lt;/span&gt;from model&lt;span class="o"&gt;)&lt;/span&gt;
                                        models with dynamic resolution &lt;span class="o"&gt;(&lt;/span&gt;default: &lt;span class="nb"&gt;read &lt;/span&gt;from model&lt;span class="o"&gt;)&lt;/span&gt;
                                        embedding models &lt;span class="o"&gt;(&lt;/span&gt;default: disabled&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nt"&gt;--models-dir&lt;/span&gt; PATH                       directory containing models &lt;span class="k"&gt;for &lt;/span&gt;the router server &lt;span class="o"&gt;(&lt;/span&gt;default: disabled&lt;span class="o"&gt;)&lt;/span&gt;
                                        &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;env&lt;/span&gt;: LLAMA_ARG_MODELS_DIR&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nt"&gt;--models-preset&lt;/span&gt; PATH                    path to INI file containing model presets &lt;span class="k"&gt;for &lt;/span&gt;the router server
                                        &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;env&lt;/span&gt;: LLAMA_ARG_MODELS_PRESET&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nt"&gt;--models-max&lt;/span&gt; N                          &lt;span class="k"&gt;for &lt;/span&gt;router server, maximum number of models to load simultaneously
                                        &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;env&lt;/span&gt;: LLAMA_ARG_MODELS_MAX&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nt"&gt;--models-autoload&lt;/span&gt;, &lt;span class="nt"&gt;--no-models-autoload&lt;/span&gt;
                                        &lt;span class="k"&gt;for &lt;/span&gt;router server, whether to automatically load models &lt;span class="o"&gt;(&lt;/span&gt;default:
                                        &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;env&lt;/span&gt;: LLAMA_ARG_MODELS_AUTOLOAD&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What router mode actually does
&lt;/h2&gt;

&lt;p&gt;Router mode turns &lt;code&gt;llama-server&lt;/code&gt; into a &lt;strong&gt;model dispatcher&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of binding to a single model via &lt;code&gt;-m&lt;/code&gt;, the server:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;starts with no model loaded&lt;/li&gt;
&lt;li&gt;receives a request that names a model&lt;/li&gt;
&lt;li&gt;loads that model if it is not already in memory&lt;/li&gt;
&lt;li&gt;runs inference&lt;/li&gt;
&lt;li&gt;optionally unloads the model after the response, or keeps it warm for the next request&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The key idea
&lt;/h3&gt;

&lt;p&gt;You are no longer running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./llama-server &lt;span class="nt"&gt;-m&lt;/span&gt; model.gguf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You are running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./llama-server &lt;span class="nt"&gt;--models&lt;/span&gt; models.ini &lt;span class="nt"&gt;--port&lt;/span&gt; 8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And letting the server decide &lt;strong&gt;what to load and when&lt;/strong&gt;, based on what the client actually requests.&lt;/p&gt;

&lt;p&gt;This matters because it means one persistent process can serve an entire fleet of models, with clients selecting the right one per task — a coding model, a chat model, a summarisation model — without any coordination overhead on your side.&lt;/p&gt;




&lt;h2&gt;
  
  
  Configuration: defining your models
&lt;/h2&gt;

&lt;p&gt;This is where things are still a bit raw.&lt;/p&gt;

&lt;p&gt;There is no fully stable official format yet, but current builds support &lt;strong&gt;INI-style model definitions&lt;/strong&gt; via a config file.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example models.ini
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[llama3]&lt;/span&gt;
&lt;span class="py"&gt;model&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;/opt/models/llama-3-8b-instruct.Q5_K_M.gguf&lt;/span&gt;
&lt;span class="py"&gt;ctx-size&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;8192&lt;/span&gt;
&lt;span class="py"&gt;ngl&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;35&lt;/span&gt;
&lt;span class="py"&gt;threads&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;8&lt;/span&gt;

&lt;span class="nn"&gt;[mistral]&lt;/span&gt;
&lt;span class="py"&gt;model&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;/opt/models/mistral-7b-instruct-v0.3.Q4_K_M.gguf&lt;/span&gt;
&lt;span class="py"&gt;ctx-size&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;4096&lt;/span&gt;
&lt;span class="py"&gt;ngl&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;20&lt;/span&gt;
&lt;span class="py"&gt;threads&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;8&lt;/span&gt;

&lt;span class="nn"&gt;[qwen]&lt;/span&gt;
&lt;span class="py"&gt;model&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;/opt/models/qwen2.5-coder-7b-instruct.Q5_K_M.gguf&lt;/span&gt;
&lt;span class="py"&gt;ctx-size&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;16384&lt;/span&gt;
&lt;span class="py"&gt;ngl&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;35&lt;/span&gt;
&lt;span class="py"&gt;threads&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;8&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each section name becomes the &lt;strong&gt;model identifier&lt;/strong&gt; that clients use in the &lt;code&gt;"model"&lt;/code&gt; field of their API requests.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key config parameters
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;What it controls&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;model&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Absolute path to the GGUF file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ctx-size&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Context window size in tokens. Larger values use more VRAM.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ngl&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Number of GPU layers offloaded. Set to &lt;code&gt;0&lt;/code&gt; for CPU-only; increase until you hit VRAM limits.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;threads&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CPU threads for the layers that remain on CPU.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Choosing the right &lt;code&gt;ngl&lt;/code&gt; value depends on your GPU's available VRAM — for GPU selection and hardware economics, the &lt;a href="https://www.glukhov.org/hardware/" rel="noopener noreferrer"&gt;compute hardware guide&lt;/a&gt; is a useful reference. To watch live VRAM consumption while dialing it in, see the &lt;a href="https://www.glukhov.org/observability/gpu-monitoring-apps-linux/" rel="noopener noreferrer"&gt;GPU monitoring tools for Linux&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Starting the server with config
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./llama-server &lt;span class="nt"&gt;--models&lt;/span&gt; /opt/llama.cpp/models.ini &lt;span class="nt"&gt;--port&lt;/span&gt; 8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Confirm the server started correctly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:8080/v1/models | jq &lt;span class="s1"&gt;'.data[].id'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see each section name from your &lt;code&gt;models.ini&lt;/code&gt; listed as a model ID.&lt;/p&gt;
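
&lt;p&gt;With the example &lt;code&gt;models.ini&lt;/code&gt; above, the output looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"llama3"
"mistral"
"qwen"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;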

&lt;h3&gt;
  
  
  A note on stability
&lt;/h3&gt;

&lt;p&gt;The INI config interface is &lt;strong&gt;still evolving&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;flags may change between commits&lt;/li&gt;
&lt;li&gt;some parameters are only recognised by specific build configurations&lt;/li&gt;
&lt;li&gt;documentation lags behind implementation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Pin to a specific llama.cpp commit if you need reproducibility across restarts.&lt;/p&gt;
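
&lt;p&gt;From inside your llama.cpp checkout, pinning and rebuilding looks like this (the commit hash is a placeholder for whichever revision you validated):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git checkout COMMIT_HASH   # the revision you tested router mode against
cmake -B build
cmake --build build --target llama-server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;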




&lt;h2&gt;
  
  
  API usage: switching models on request
&lt;/h2&gt;

&lt;p&gt;Once the server is running, model switching happens through the standard OpenAI-compatible API. You simply set the &lt;code&gt;"model"&lt;/code&gt; field.&lt;/p&gt;

&lt;h3&gt;
  
  
  List registered models
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:8080/v1/models
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Completion request — first model
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:8080/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "model": "llama3",
    "messages": [
      {"role": "user", "content": "Explain router mode in one paragraph"}
    ]
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Switch to a different model — same endpoint, same port
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:8080/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "model": "qwen",
    "messages": [
      {"role": "user", "content": "Write a Python function that reads a CSV file"}
    ]
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The server handles the unload/load cycle transparently. Your client code does not change — only the &lt;code&gt;model&lt;/code&gt; field.&lt;/p&gt;

&lt;h3&gt;
  
  
  Python example
&lt;/h3&gt;

&lt;p&gt;If you are using the &lt;code&gt;openai&lt;/code&gt; Python client:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8080/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;not-needed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Use the coding model
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write a Go HTTP handler&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Switch to the chat model — same client, different model name
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is the capital of Australia?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What happens internally
&lt;/h3&gt;

&lt;p&gt;When a request arrives for &lt;code&gt;qwen&lt;/code&gt; and &lt;code&gt;llama3&lt;/code&gt; is currently loaded:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;llama3&lt;/code&gt; is unloaded from VRAM&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;qwen&lt;/code&gt; weights are read from disk and loaded into VRAM&lt;/li&gt;
&lt;li&gt;inference runs&lt;/li&gt;
&lt;li&gt;the next request determines whether to keep &lt;code&gt;qwen&lt;/code&gt; loaded or swap again&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This directly answers the common question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How can a local LLM server switch models without restarting?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;By dynamically loading models per request, not binding at startup.&lt;/p&gt;




&lt;h2&gt;
  
  
  Systemd service: production-ready setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Create a dedicated user and directories
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;useradd &lt;span class="nt"&gt;--system&lt;/span&gt; &lt;span class="nt"&gt;--shell&lt;/span&gt; /usr/sbin/nologin &lt;span class="nt"&gt;--home-dir&lt;/span&gt; /opt/llama.cpp llm
&lt;span class="nb"&gt;sudo mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /opt/llama.cpp/models
&lt;span class="nb"&gt;sudo chown&lt;/span&gt; &lt;span class="nt"&gt;-R&lt;/span&gt; llm:llm /opt/llama.cpp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Copy your binary and model config into place:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo cp &lt;/span&gt;build/bin/llama-server /opt/llama.cpp/
&lt;span class="nb"&gt;sudo cp &lt;/span&gt;models.ini /opt/llama.cpp/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  /etc/systemd/system/llama-server.service
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[Unit]&lt;/span&gt;
&lt;span class="py"&gt;Description&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;Llama.cpp Router Server&lt;/span&gt;
&lt;span class="py"&gt;After&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;network.target&lt;/span&gt;

&lt;span class="nn"&gt;[Service]&lt;/span&gt;
&lt;span class="py"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;simple&lt;/span&gt;
&lt;span class="py"&gt;User&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;llm&lt;/span&gt;
&lt;span class="py"&gt;WorkingDirectory&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/opt/llama.cpp&lt;/span&gt;
&lt;span class="py"&gt;ExecStart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/opt/llama.cpp/llama-server --models /opt/llama.cpp/models.ini --port 8080&lt;/span&gt;
&lt;span class="py"&gt;Restart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;always&lt;/span&gt;
&lt;span class="py"&gt;RestartSec&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;5&lt;/span&gt;

&lt;span class="py"&gt;Environment&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;LLAMA_LOG_LEVEL=info&lt;/span&gt;

&lt;span class="nn"&gt;[Install]&lt;/span&gt;
&lt;span class="py"&gt;WantedBy&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;multi-user.target&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
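
&lt;p&gt;The unit above is deliberately minimal. If you want basic sandboxing, keep the hardening in a drop-in so package upgrades do not overwrite it; a sketch with common systemd directives (verify GPU access still works with them enabled):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;sudo systemctl edit llama-server
# In the override editor, add:
#   [Service]
#   NoNewPrivileges=true
#   ProtectSystem=strict
#   ProtectHome=true
#   ReadWritePaths=/opt/llama.cpp   # only needed if the server writes here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;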



&lt;h3&gt;
  
  
  Enable and start
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl daemon-reload
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl &lt;span class="nb"&gt;enable &lt;/span&gt;llama-server
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl start llama-server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Verify and inspect logs
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl status llama-server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;journalctl &lt;span class="nt"&gt;-u&lt;/span&gt; llama-server &lt;span class="nt"&gt;-f&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On a successful start you will see lines indicating the server is listening and the model registry has been loaded. A quick sanity check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://localhost:8080/v1/models | jq &lt;span class="s1"&gt;'.data[].id'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you have a persistent service with auto-restart and centralised model switching — no manual process management required. If you want to apply the same pattern to other binaries, &lt;a href="https://www.glukhov.org/developer-tools/terminals-shell/executable-as-a-service-in-linux/" rel="noopener noreferrer"&gt;hosting any executable as a Linux service&lt;/a&gt; walks through the general approach.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;llama-server&lt;/code&gt; &lt;code&gt;--metrics&lt;/code&gt; flag exposes a Prometheus-compatible endpoint. For llama.cpp-specific dashboards, PromQL queries, and alerting rules, see the &lt;a href="https://www.glukhov.org/observability/monitoring-llm-inference-prometheus-grafana/" rel="noopener noreferrer"&gt;LLM inference monitoring guide&lt;/a&gt;. For the broader observability setup, the &lt;a href="https://www.glukhov.org/observability/" rel="noopener noreferrer"&gt;observability guide&lt;/a&gt; covers the full stack.&lt;/p&gt;




&lt;h2&gt;
  
  
  Limitations you need to understand
&lt;/h2&gt;

&lt;p&gt;Router mode is genuinely useful, but it comes with trade-offs you should be clear about before relying on it in production.&lt;/p&gt;

&lt;h3&gt;
  
  
  Only one model in memory at a time
&lt;/h3&gt;

&lt;p&gt;Even though multiple models are defined in &lt;code&gt;models.ini&lt;/code&gt;, only one is resident in VRAM per worker at any given moment. Switching means a full unload-and-reload cycle.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;switching means reload&lt;/li&gt;
&lt;li&gt;latency spike is unavoidable&lt;/li&gt;
&lt;li&gt;on a typical 7B model at Q5, a reload can take 3–10 seconds depending on disk speed and VRAM bandwidth&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This answers another key question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Does llama.cpp support serving multiple models at once?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Not really. It supports &lt;strong&gt;multiple definitions&lt;/strong&gt;, not simultaneous residency. If you need two models genuinely loaded in parallel, you need two server processes, each with its own VRAM budget: in practice, either two GPUs or one GPU with enough VRAM to hold both models.&lt;/p&gt;
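
&lt;p&gt;A minimal sketch of genuine parallel residency, assuming two CUDA GPUs and hypothetical model paths:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Pin one llama-server process per GPU so each model stays resident
CUDA_VISIBLE_DEVICES=0 /opt/llama.cpp/llama-server -m /opt/llama.cpp/models/llama3.gguf --port 8081 &amp;amp;
CUDA_VISIBLE_DEVICES=1 /opt/llama.cpp/llama-server -m /opt/llama.cpp/models/qwen.gguf --port 8082 &amp;amp;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;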

&lt;p&gt;For measured VRAM consumption and tokens-per-second across model sizes, the &lt;a href="https://www.glukhov.org/llm-performance/" rel="noopener noreferrer"&gt;LLM performance benchmarks&lt;/a&gt; cover the full picture. For numbers specific to llama.cpp on a 16 GB GPU — dense and MoE models at multiple context sizes — see the &lt;a href="https://www.glukhov.org/llm-performance/benchmarks/best-llm-on-16gb-vram-gpu/" rel="noopener noreferrer"&gt;16 GB VRAM llama.cpp benchmarks&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  No smart caching
&lt;/h3&gt;

&lt;p&gt;Unlike Ollama, which maintains a warm pool and evicts models based on recency:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;there is no automatic model eviction strategy&lt;/li&gt;
&lt;li&gt;no background pre-warming&lt;/li&gt;
&lt;li&gt;no priority queue for frequently used models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you send alternating requests for &lt;code&gt;llama3&lt;/code&gt; and &lt;code&gt;mistral&lt;/code&gt;, every single request triggers a reload. This is the fundamental cost of being closer to the metal.&lt;/p&gt;
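
&lt;p&gt;You can measure the thrash directly; a quick sketch that times an alternating request pattern (expect every call to pay the reload penalty):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Alternating models forces a full reload on every request in router mode
for model in llama3 mistral llama3 mistral; do
  time curl -s http://localhost:8080/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d "{\"model\": \"$model\", \"messages\": [{\"role\": \"user\", \"content\": \"ping\"}]}" \
    &amp;gt; /dev/null
done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;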

&lt;h3&gt;
  
  
  Latency is unpredictable for mixed workloads
&lt;/h3&gt;

&lt;p&gt;A well-behaved workload that uses one model consistently will be fast. A workload that interleaves multiple models will be slow. Plan your client routing logic accordingly — group requests by model where possible.&lt;/p&gt;

&lt;h3&gt;
  
  
  Config is not stable
&lt;/h3&gt;

&lt;p&gt;The INI support exists and works in most recent builds, but it is not fully standardised. Flags and parameter names have changed across versions. If you upgrade &lt;code&gt;llama-server&lt;/code&gt;, test your &lt;code&gt;models.ini&lt;/code&gt; against the new build before deploying.&lt;/p&gt;
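
&lt;p&gt;A cheap pre-deploy check is to boot the candidate build against your existing config on a spare port; a sketch where the build path and port are assumptions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Start the candidate build next to production; router mode loads no weights until a request arrives
/tmp/llama.cpp-new/llama-server --models /opt/llama.cpp/models.ini --port 8081 &amp;amp;
NEW_PID=$!
sleep 5
# If the registry parses, every model id should list cleanly
curl -s http://localhost:8081/v1/models | jq '.data[].id' || echo 'models.ini incompatible with new build'
kill "$NEW_PID"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;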




&lt;h2&gt;
  
  
  Llama.cpp vs Ollama: honest comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;llama.cpp router&lt;/th&gt;
&lt;th&gt;Ollama&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Dynamic loading&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model switching&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Built-in registry&lt;/td&gt;
&lt;td&gt;Partial (INI)&lt;/td&gt;
&lt;td&gt;Yes (pull-based)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory management&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;td&gt;Advanced&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model eviction&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;TTL-based&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UX polish&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI API compat&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Control&lt;/td&gt;
&lt;td&gt;Maximum&lt;/td&gt;
&lt;td&gt;Opinionated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Config stability&lt;/td&gt;
&lt;td&gt;Experimental&lt;/td&gt;
&lt;td&gt;Stable&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Opinionated take
&lt;/h3&gt;

&lt;p&gt;Choose llama.cpp router mode when you want:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;maximum control over runtime parameters per model&lt;/li&gt;
&lt;li&gt;minimal process overhead&lt;/li&gt;
&lt;li&gt;direct access to llama.cpp flags without abstraction layers&lt;/li&gt;
&lt;li&gt;a hackable base for custom tooling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Choose Ollama when you want:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a stable, polished experience&lt;/li&gt;
&lt;li&gt;automatic model downloading and versioning&lt;/li&gt;
&lt;li&gt;smart keep-alive and eviction without configuration&lt;/li&gt;
&lt;li&gt;batteries included from day one&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Neither is wrong. The choice depends on how much you want to manage yourself.&lt;/p&gt;

&lt;p&gt;If you go with Ollama, the &lt;a href="https://www.glukhov.org/llm-hosting/ollama/ollama-cheatsheet/" rel="noopener noreferrer"&gt;Ollama CLI cheatsheet&lt;/a&gt; covers day-to-day commands. For a broader comparison that also includes vLLM, LM Studio, and LocalAI, see &lt;a href="https://www.glukhov.org/llm-hosting/comparisons/hosting-llms-ollama-localai-jan-lmstudio-vllm-comparison/" rel="noopener noreferrer"&gt;how different local runtimes compare in 2026&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Llama.cpp vs llama-swap
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;llama-swap&lt;/code&gt; is an &lt;strong&gt;external orchestrator&lt;/strong&gt; that sits in front of one or more &lt;code&gt;llama-server&lt;/code&gt; instances:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;it intercepts requests and inspects the &lt;code&gt;model&lt;/code&gt; field&lt;/li&gt;
&lt;li&gt;it starts the appropriate &lt;code&gt;llama-server&lt;/code&gt; process for that model&lt;/li&gt;
&lt;li&gt;it shuts down idle instances after a configurable timeout&lt;/li&gt;
&lt;li&gt;it proxies the request through once the model is ready&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a hands-on setup, see the &lt;a href="https://www.glukhov.org/llm-hosting/llama-swap/" rel="noopener noreferrer"&gt;llama-swap quickstart&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key difference
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;router mode&lt;/th&gt;
&lt;th&gt;llama-swap&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No (separate binary)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Maturity&lt;/td&gt;
&lt;td&gt;Experimental&lt;/td&gt;
&lt;td&gt;More stable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flexibility&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Control layer&lt;/td&gt;
&lt;td&gt;Internal&lt;/td&gt;
&lt;td&gt;External proxy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per-model config&lt;/td&gt;
&lt;td&gt;INI file&lt;/td&gt;
&lt;td&gt;YAML file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Process model&lt;/td&gt;
&lt;td&gt;Single process&lt;/td&gt;
&lt;td&gt;One process per model&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  When to use llama-swap
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;llama-swap&lt;/code&gt; gives you process-level isolation per model, which means a crash in one model instance does not affect others. It also lets each model run with completely independent &lt;code&gt;llama-server&lt;/code&gt; flags.&lt;/p&gt;

&lt;p&gt;Use it if you need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;better lifecycle control and isolation&lt;/li&gt;
&lt;li&gt;smarter switching logic with configurable idle timeouts&lt;/li&gt;
&lt;li&gt;more predictable latency (each model has a warm process after first load)&lt;/li&gt;
&lt;li&gt;production stability today, not eventually&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  When native router mode is enough
&lt;/h3&gt;

&lt;p&gt;Use the built-in router if you want:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;zero external dependencies&lt;/li&gt;
&lt;li&gt;a single process to manage&lt;/li&gt;
&lt;li&gt;simpler deployment (one binary, one config file)&lt;/li&gt;
&lt;li&gt;minimal stack for dev or single-user setups&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final thoughts
&lt;/h2&gt;

&lt;p&gt;Router mode is a meaningful step forward for &lt;code&gt;llama-server&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It answers the long-standing question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What is router mode in llama.cpp server?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It is the missing layer that turns a static binary into a &lt;strong&gt;dynamic inference service&lt;/strong&gt; — one process that can field requests for a whole catalogue of models.&lt;/p&gt;

&lt;p&gt;But it is not finished.&lt;/p&gt;

&lt;p&gt;Today it is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;powerful enough for real workloads&lt;/li&gt;
&lt;li&gt;promising as a foundation for more sophisticated routing&lt;/li&gt;
&lt;li&gt;slightly rough around the config and stability edges&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your workload is predictable and you can group requests by model, router mode works well today. If you need production-grade reliability and per-model isolation, reach for &lt;code&gt;llama-swap&lt;/code&gt; while the native implementation matures.&lt;/p&gt;

&lt;p&gt;Either way, you get &lt;strong&gt;Ollama-like behaviour, without hiding the machinery&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>cheatsheet</category>
      <category>gguf</category>
      <category>ai</category>
      <category>llm</category>
    </item>
    <item>
      <title>Claude Skills and SKILL.md for Developers: VS Code, JetBrains, Cursor</title>
      <dc:creator>Rost</dc:creator>
      <pubDate>Sat, 25 Apr 2026 08:10:03 +0000</pubDate>
      <link>https://dev.to/rosgluk/claude-skills-and-skillmd-for-developers-vs-code-jetbrains-cursor-17f6</link>
      <guid>https://dev.to/rosgluk/claude-skills-and-skillmd-for-developers-vs-code-jetbrains-cursor-17f6</guid>
      <description>&lt;p&gt;Most teams misuse Claude Skills in one of two ways. They either turn &lt;code&gt;SKILL.md&lt;/code&gt; into a dumping ground, or they never graduate from giant copy-pasted prompts.&lt;/p&gt;

&lt;p&gt;Both approaches are sloppy. If you want Skills to work in a real dev workflow, you need to treat them like code and operations logic, not like prompt poetry.&lt;/p&gt;

&lt;p&gt;Claude Skills are directories anchored by &lt;code&gt;SKILL.md&lt;/code&gt;, with optional scripts, references, and assets. They work because of progressive disclosure. The agent starts by loading only compact metadata such as the skill name and description, then reads the full instructions only when the task matches. That lets an agent keep many skills available without bloating every session from the start.&lt;/p&gt;

&lt;p&gt;Anthropic's own guidance makes the intended division of labour pretty clear. &lt;code&gt;CLAUDE.md&lt;/code&gt; is for durable, always-on project context. Skills are for reusable knowledge, playbooks, and invocable workflows that should load on demand. Claude Code even folded old custom commands into the same mechanism, so legacy &lt;code&gt;.claude/commands/*.md&lt;/code&gt; files still work, but Skills are now the better long-term shape — and the most reusable building block in any &lt;a href="https://www.glukhov.org/ai-devtools/" rel="noopener noreferrer"&gt;AI-powered development workflow&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use Claude Skills: CLAUDE.md vs Skills vs Hooks
&lt;/h2&gt;

&lt;p&gt;A Claude Skill is worth creating when you keep pasting the same checklist, the same deployment playbook, the same code review rubric, or the same internal API gotchas into chat. Anthropic explicitly recommends creating a skill when you keep reusing the same procedure, or when a section of &lt;code&gt;CLAUDE.md&lt;/code&gt; has grown into a process rather than a fact. That is the practical answer to the FAQ question "What is a Claude Skill and when should you use one". Use a Skill for repeatable procedure, not for general taste or broad repo rules.&lt;/p&gt;

&lt;p&gt;The real win is control over context cost and behaviour. A good Skill is loaded only when relevant, while a bloated &lt;code&gt;CLAUDE.md&lt;/code&gt; is loaded every session. Anthropic recommends keeping &lt;code&gt;CLAUDE.md&lt;/code&gt; short and moving domain knowledge or procedures into Skills precisely because on-demand loading keeps the agent focused on the task in front of it.&lt;/p&gt;

&lt;p&gt;My opinionated rule is simple. If the instruction should apply every single session, it belongs in &lt;code&gt;CLAUDE.md&lt;/code&gt;. If the instruction is a reusable method, checklist, or workflow that matters only sometimes, it belongs in a Skill. If the action must happen automatically on every matching event, it probably belongs in a hook, not a Skill. Anthropic's feature overview frames those tools in almost exactly that layering model.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;When to use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CLAUDE.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Always loaded&lt;/td&gt;
&lt;td&gt;Project facts, durable conventions, repo-wide rules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Skill&lt;/td&gt;
&lt;td&gt;Loaded on demand&lt;/td&gt;
&lt;td&gt;Repeatable procedures, playbooks, domain checklists&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hook&lt;/td&gt;
&lt;td&gt;Event-triggered&lt;/td&gt;
&lt;td&gt;Automatic side effects on file save, commit, or session start&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A practical smell for each: if you find yourself pasting the same instructions into every chat, that is a Skill. If a &lt;code&gt;CLAUDE.md&lt;/code&gt; section has grown into a step-by-step process, extract it into a Skill. If you want something to fire silently every time a file is saved, write a hook instead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Claude Skills IDE Support: VS Code, JetBrains, Cursor, and Codex
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.glukhov.org/ai-devtools/claude-code/" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt; runs across CLI, Desktop, VS Code, JetBrains, web, and mobile-related remote control flows. Anthropic describes the CLI as the most complete local surface, while the IDE integrations trade some CLI-only capabilities for editor-native review, file context, and tighter workflow ergonomics. Configuration, project memory, and MCP servers are shared across the local surfaces, so your &lt;code&gt;.claude&lt;/code&gt; setup follows you rather than being trapped in one editor.&lt;/p&gt;

&lt;p&gt;For VS Code, Anthropic says the extension is the recommended interface inside the editor. It provides plan review, inline diffs, file mention support, and integrated access to the CLI. The same install flow also exposes a direct path for Cursor. For JetBrains, the current supported list includes IntelliJ IDEA, PyCharm, Android Studio, WebStorm, PhpStorm, and GoLand, with diff viewing, selection sharing, file-reference shortcuts, and diagnostic sharing built into the plugin.&lt;/p&gt;

&lt;p&gt;JetBrains support is better than many developers realise. If you run &lt;code&gt;claude&lt;/code&gt; from the IDE's integrated terminal, the integration features are active automatically. If you start from an external terminal, Anthropic documents the &lt;code&gt;/ide&lt;/code&gt; command to connect Claude Code back to the JetBrains session, and it explicitly recommends launching from the same project root so Claude sees the same files your IDE sees. If you use auto-edit modes in JetBrains, Anthropic also warns that IDE configuration files can become part of the editable surface, so manual approvals are the safer default in that environment.&lt;/p&gt;

&lt;p&gt;Now the bigger point. Claude Skills are not only a Claude Code thing. Agent Skills is an open standard. The official Agent Skills quickstart says the same skill can work in VS Code with GitHub Copilot, Claude Code, and OpenAI Codex, and OpenAI's own Codex docs say Skills are available in the Codex CLI, IDE extension, and app. The Agent Skills implementation guide adds an important portability detail: &lt;code&gt;.agents/skills&lt;/code&gt; has emerged as the cross-client convention, while some clients also scan &lt;code&gt;.claude/skills&lt;/code&gt; for pragmatic compatibility.&lt;/p&gt;

&lt;p&gt;So here is the practical compatibility rule I recommend. If you are building for Claude Code first and only, author in &lt;code&gt;.claude/skills&lt;/code&gt;. If you genuinely want cross-client portability, target the open Agent Skills shape and use &lt;code&gt;.agents/skills&lt;/code&gt; as the canonical path. Do not pretend those two goals are identical. They are related, not identical.&lt;/p&gt;
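
&lt;p&gt;If you want one source of truth serving both layouts, a symlink covers the gap; a sketch assuming the portable copy lives under &lt;code&gt;.agents/skills/&lt;/code&gt; (whether a given client follows symlinks is an assumption worth verifying):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Author once under the open-standard path, expose it to Claude Code via symlink
mkdir -p .agents/skills .claude/skills
ln -s ../../.agents/skills/review-pr .claude/skills/review-pr
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;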

&lt;p&gt;Quick compatibility reference:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Client&lt;/th&gt;
&lt;th&gt;Skills path&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Code CLI&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;.claude/skills/&lt;/code&gt; or &lt;code&gt;~/.claude/skills/&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Most complete surface; full &lt;code&gt;allowed-tools&lt;/code&gt; support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VS Code + Claude extension&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.claude/skills/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Inline diffs, plan review, file mention&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cursor&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.claude/skills/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Same install path as VS Code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JetBrains (IDEA, PyCharm, etc.)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.claude/skills/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Run &lt;code&gt;claude&lt;/code&gt; from IDE terminal or use &lt;code&gt;/ide&lt;/code&gt; to reconnect&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitHub Copilot, OpenAI Codex&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.agents/skills/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Open Agent Skills standard; cross-client portability&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude.ai web&lt;/td&gt;
&lt;td&gt;Upload via UI&lt;/td&gt;
&lt;td&gt;Dir name must match &lt;code&gt;name&lt;/code&gt; field; 200-char description cap&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  SKILL.md File Structure, Folder Layout, and Storage Locations
&lt;/h2&gt;

&lt;p&gt;A proper Skill is a folder, not a random markdown file sitting at repo root. The core specification requires a directory with a &lt;code&gt;SKILL.md&lt;/code&gt; file and allows optional &lt;code&gt;scripts/&lt;/code&gt;, &lt;code&gt;references/&lt;/code&gt;, and &lt;code&gt;assets/&lt;/code&gt; directories. &lt;code&gt;SKILL.md&lt;/code&gt; must contain YAML frontmatter followed by markdown instructions. In the spec, &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;description&lt;/code&gt; are required, &lt;code&gt;name&lt;/code&gt; is limited to 64 characters using lowercase letters, numbers, and hyphens, &lt;code&gt;compatibility&lt;/code&gt; is only for real environment requirements, and &lt;code&gt;allowed-tools&lt;/code&gt; is explicitly experimental across implementations.&lt;/p&gt;

&lt;p&gt;Claude Code is a bit looser than the portable spec because it can derive a name from the directory and fall back to the first paragraph when &lt;code&gt;description&lt;/code&gt; is missing. You should not rely on that if you care about portability or predictability. Claude.ai requires the directory name to match the &lt;code&gt;name&lt;/code&gt; field, and its custom-skill upload path caps descriptions at 200 characters even though the broader spec allows much more. The portable choice is to set an explicit &lt;code&gt;name&lt;/code&gt;, keep the directory identical, and write a precise description that fits in tight limits. That answers the FAQ topic "What should a SKILL.md file contain" without hand-waving.&lt;/p&gt;

&lt;p&gt;Start from a structure this boring:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;repo/
  .claude/
    skills/
      review-pr/
        SKILL.md
        scripts/
          review.sh
        references/
          checklist.md
        assets/
          comment-template.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If portability across Skills-compatible clients matters more than Claude Code convenience, keep the same internal shape and swap &lt;code&gt;.claude/skills/&lt;/code&gt; for &lt;code&gt;.agents/skills/&lt;/code&gt;. The folder structure is the same idea either way.&lt;/p&gt;

&lt;p&gt;For Claude Code, the storage locations are straightforward. Project skills live at &lt;code&gt;.claude/skills/&amp;lt;skill-name&amp;gt;/SKILL.md&lt;/code&gt;. Personal skills live at &lt;code&gt;~/.claude/skills/&amp;lt;skill-name&amp;gt;/SKILL.md&lt;/code&gt;. Plugin-distributed skills live under &lt;code&gt;&amp;lt;plugin&amp;gt;/skills/&amp;lt;skill-name&amp;gt;/SKILL.md&lt;/code&gt;. Anthropic documents precedence across the built-in scopes as enterprise over personal over project, while plugin skills avoid collisions by using a namespaced form such as &lt;code&gt;plugin-name:skill-name&lt;/code&gt;. On Windows, &lt;code&gt;~/.claude&lt;/code&gt; resolves to &lt;code&gt;%USERPROFILE%\.claude&lt;/code&gt;, and &lt;code&gt;CLAUDE_CONFIG_DIR&lt;/code&gt; can relocate the whole base directory.&lt;/p&gt;

&lt;p&gt;The choice between project and personal scope is straightforward. Use &lt;code&gt;.claude/skills/&lt;/code&gt; inside the repo when the Skill is tightly coupled to that codebase — for example, a deploy playbook that knows your specific cluster names or a review rubric tuned to your team's conventions. Use &lt;code&gt;~/.claude/skills/&lt;/code&gt; for Skills that travel with you across projects: personal checklists, generic changelog generators, preferred debugging workflows. Anything you would put in a dotfiles repo belongs in personal scope.&lt;/p&gt;

&lt;p&gt;A few sharp edges are worth memorising. &lt;code&gt;SKILL.md&lt;/code&gt; must be named exactly with that casing. Anthropic's PDF guide recommends kebab-case folder names and explicitly says not to place a &lt;code&gt;README.md&lt;/code&gt; inside the skill folder, because the operative documentation should live in &lt;code&gt;SKILL.md&lt;/code&gt; or &lt;code&gt;references/&lt;/code&gt;. That same guide also stresses that &lt;code&gt;SKILL.md&lt;/code&gt; naming is case-sensitive. These are boring constraints, but boring constraints are what make tooling reliable.&lt;/p&gt;

&lt;p&gt;Claude Code also does the right thing for monorepos. It automatically discovers nested &lt;code&gt;.claude/skills/&lt;/code&gt; directories when you work inside subdirectories, which is ideal for package-level or service-level skills. It also watches existing skill directories for live changes during the current session. The one restart trap is creating a top-level skills directory that did not exist when the session started. Anthropic documents that as the case where you do need to restart so the new directory can be watched.&lt;/p&gt;
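
&lt;p&gt;In a large monorepo it is easy to lose track of which nested skills actually exist; a quick audit sketch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# List every project-scoped skill in the working tree (filename case matters)
find . -type f -name 'SKILL.md' -path '*/.claude/skills/*'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;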

&lt;h2&gt;
  
  
  Claude Skills Best Practices: Descriptions, Scripts, and Scope
&lt;/h2&gt;

&lt;p&gt;The fastest way to create a useless Skill is to ask an LLM to invent one from generic training knowledge. Anthropic's best-practices guide warns against exactly that. The valuable bits are the domain-specific corrections, edge cases, tool choices, and conventions the model would not reliably invent on its own. The right workflow is to solve the task once with the agent, correct it until it works, then extract the method into a Skill.&lt;/p&gt;

&lt;p&gt;Scope the Skill like a good function, not like a wiki. Anthropic says Skills should encapsulate a coherent unit of work. Too narrow, and you force multiple skills to stack for one task. Too broad, and the agent cannot activate them precisely. The best-practices guide is blunt that overly comprehensive skills can hurt more than they help because the model chases irrelevant instructions and loses the signal.&lt;/p&gt;

&lt;p&gt;Description quality is not a cosmetic concern. It is the routing layer. Both Anthropic and the Agent Skills docs say the &lt;code&gt;description&lt;/code&gt; field is the primary mechanism the model uses to decide whether to load a Skill at all. Good descriptions say what the Skill does, when to use it, and the trigger phrases or file types a user would actually mention. Bad descriptions are vague, overly technical, or broad enough to match nonsense. That is the real answer to the FAQ question "Why is a Claude Skill not triggering". Usually the router is bad, not the model.&lt;/p&gt;

&lt;p&gt;The contrast is clear side by side:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bad descriptions — too vague to route reliably:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Helps with code review&lt;/code&gt; — matches everything, disambiguates nothing&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Useful for development tasks&lt;/code&gt; — broader than a search query&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Assists with writing&lt;/code&gt; — not a router, just a category label&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Good descriptions — specific trigger language:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Review pull requests for security issues, migration risk, and missing tests. Use when reviewing a PR, git diff, or release critical change.&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Generate a changelog from git log output. Use when preparing a release, writing release notes, or summarising commits since last tag.&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Scaffold a new Go HTTP handler with request validation and error middleware. Use when adding a new endpoint or route to a Go service.&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pattern is the same each time: state what the Skill does, name the exact user phrases that should activate it, and optionally name file types or tools that are relevant. If your description would match a generic Google query, it is not specific enough.&lt;/p&gt;
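
&lt;p&gt;A low-tech way to audit routing language across a repo is to read every description side by side; a minimal sketch, assuming project-scoped skills:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Print each skill's description next to its path for a quick vagueness check
grep -H '^description:' .claude/skills/*/SKILL.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;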

&lt;p&gt;If a workflow has side effects, make it manual. Claude Code exposes that directly. &lt;code&gt;disable-model-invocation: true&lt;/code&gt; makes a Skill user-invoked only, which Anthropic recommends for actions like deploys, commits, or outbound messages. &lt;code&gt;user-invocable: false&lt;/code&gt; goes the other way and hides the Skill from the slash menu while still letting Claude use it as background knowledge. That answers the FAQ topic "When should a skill be manual instead of automatic" in one sentence: manual for risk, automatic for safe repeatable guidance.&lt;/p&gt;

&lt;p&gt;Keep &lt;code&gt;SKILL.md&lt;/code&gt; small enough to stay intelligible. Anthropic recommends keeping it under 500 lines and around 5,000 tokens, then moving detailed material into &lt;code&gt;references/&lt;/code&gt; or similar files with explicit loading instructions. "Read &lt;code&gt;references/api-errors.md&lt;/code&gt; if the API returns a non-200" is a good pattern. "See references/" is lazy. Claude Code also injects the rendered Skill into the conversation as a message and does not keep re-reading the file on later turns. After context compaction, only recent Skill content is carried forward within token budgets. Huge Skills are therefore not merely ugly. They are brittle over long sessions.&lt;/p&gt;

&lt;p&gt;A good &lt;code&gt;SKILL.md&lt;/code&gt; can stay very plain:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;review-pr&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Review pull requests for security issues, migration risk, and missing tests. Use when reviewing a PR, git diff, or release critical change.&lt;/span&gt;
&lt;span class="na"&gt;compatibility&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Designed for Claude Code. Requires git and gh.&lt;/span&gt;
&lt;span class="na"&gt;disable-model-invocation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;allowed-tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Bash(git diff *) Bash(gh pr diff *) Read Grep Glob&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="c1"&gt;# Review PR&lt;/span&gt;

&lt;span class="s"&gt;Read references/checklist.md before running any commands.&lt;/span&gt;

&lt;span class="s"&gt;1. Collect the diff and changed files.&lt;/span&gt;
&lt;span class="s"&gt;2. Flag correctness, security, and test coverage issues.&lt;/span&gt;
&lt;span class="s"&gt;3. Return findings grouped by severity with file references.&lt;/span&gt;
&lt;span class="s"&gt;4. Suggest the smallest safe fix first.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use scripts when determinism matters more than eloquence. The Skills scripts guide is excellent here. It says agent-facing scripts must avoid interactive prompts, document usage through &lt;code&gt;--help&lt;/code&gt;, emit helpful error messages, prefer structured output such as JSON or CSV on stdout, send diagnostics to stderr, and support retry-safe use. It also recommends pinning one-off tool versions and describing runtime requirements explicitly in &lt;code&gt;SKILL.md&lt;/code&gt; or the &lt;code&gt;compatibility&lt;/code&gt; field rather than assuming the environment has the right packages.&lt;/p&gt;

&lt;p&gt;A minimal but correct agent-facing script looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/usr/bin/env bash&lt;/span&gt;
&lt;span class="c"&gt;# scripts/collect-diff.sh — called by review-pr skill&lt;/span&gt;
&lt;span class="c"&gt;# Usage: collect-diff.sh &amp;lt;base-ref&amp;gt; [&amp;lt;head-ref&amp;gt;]&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-euo&lt;/span&gt; pipefail

&lt;span class="nv"&gt;BASE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;1&lt;/span&gt;:?Usage:&lt;span class="p"&gt; collect-diff.sh &amp;lt;base-ref&amp;gt; [&amp;lt;head-ref&amp;gt;]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nv"&gt;HEAD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;2&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="nv"&gt;HEAD&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="c"&gt;# Structured output to stdout so the agent can parse it&lt;/span&gt;
git diff &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;BASE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;...&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;HEAD&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--name-only&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | jq &lt;span class="nt"&gt;-Rs&lt;/span&gt; &lt;span class="s1"&gt;'{
      "changed_files": split("\n") | map(select(length &amp;gt; 0))
    }'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="nb"&gt;printf&lt;/span&gt; &lt;span class="s1"&gt;'{"error":"git diff failed"}\n'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&amp;amp;2&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nb"&gt;exit &lt;/span&gt;1&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three things make this agent-safe. &lt;code&gt;set -euo pipefail&lt;/code&gt; ensures the script exits loudly on any failure rather than silently proceeding. JSON on stdout gives the agent a format it can parse without guessing. Diagnostics go to stderr so the agent's stdout stream stays clean. None of this is clever. All of it is necessary.&lt;/p&gt;
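
&lt;p&gt;Invocation stays equally boring; a usage sketch with a placeholder base ref:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# The agent can run this verbatim and parse the JSON result
./scripts/collect-diff.sh origin/main | jq -r '.changed_files[]'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;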

&lt;p&gt;One subtle trap is &lt;code&gt;allowed-tools&lt;/code&gt;. In the spec it is experimental and support varies. In Claude Code it pre-approves specific tools while the Skill is active, but it does not restrict the universe of callable tools, and deny rules still belong in Claude Code permissions. In the Claude Agent SDK, Anthropic explicitly says the &lt;code&gt;allowed-tools&lt;/code&gt; frontmatter in &lt;code&gt;SKILL.md&lt;/code&gt; does not apply, so SDK apps must enforce tool access in the main &lt;code&gt;allowed_tools&lt;/code&gt; or &lt;code&gt;allowedTools&lt;/code&gt; configuration instead. If you ignore that difference, your Skill will behave differently in the CLI and in SDK-powered automation.&lt;/p&gt;

&lt;p&gt;One more advanced pattern is worth stealing. When a workflow would flood your main thread with logs, file searches, or long research output, Claude Code lets a Skill run in a forked subagent using &lt;code&gt;context: fork&lt;/code&gt; and an &lt;code&gt;agent&lt;/code&gt; such as &lt;code&gt;Explore&lt;/code&gt;. Anthropic shows this for research workflows, where the heavy lifting happens in isolated context and the main conversation gets the summary. For deep codebase exploration, that is a much better design than a giant inline Skill that pollutes the main session.&lt;/p&gt;

&lt;p&gt;A forked Skill looks like this in frontmatter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;explore-codebase&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deep exploration of an unfamiliar codebase. Use when onboarding to a new repo, auditing architecture, or mapping module dependencies.&lt;/span&gt;
&lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fork&lt;/span&gt;
&lt;span class="na"&gt;agent&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Explore&lt;/span&gt;
&lt;span class="na"&gt;compatibility&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Requires Claude Code CLI.&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="c1"&gt;# Explore Codebase&lt;/span&gt;

&lt;span class="s"&gt;1. Walk the directory tree and summarise the top-level modules.&lt;/span&gt;
&lt;span class="s"&gt;2. Identify the main entry points and their responsibilities.&lt;/span&gt;
&lt;span class="s"&gt;3. Map the dependency graph between packages.&lt;/span&gt;
&lt;span class="s"&gt;4. Return a structured summary to the main session — not the raw file list.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key line is &lt;code&gt;context: fork&lt;/code&gt;. Without it, the exploration output lands inline in your conversation. With it, the subagent runs in its own context window and hands back a summary. The difference matters on large repos where exploration alone can consume thousands of tokens.&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing Claude Skills: Triggers, Correctness, and Baseline Comparisons
&lt;/h2&gt;

&lt;p&gt;A Skill is not tested because one happy-path demo worked once. Anthropic's guide breaks testing into three layers: manual testing in Claude.ai, scripted testing in Claude Code, and programmatic testing via the Skills API. The recommended evaluation areas are triggering, functional correctness, and performance against a baseline without the Skill. That is also the best answer to the FAQ question "How do you test whether a skill is reliable". You test route selection, output quality, and efficiency, not just whether the model sounded confident.&lt;/p&gt;

&lt;p&gt;The official eval guidance gives a clean structure for test cases. Each case should include a realistic user prompt, a human-readable description of the expected output, and optional input files. The docs store those in &lt;code&gt;evals/evals.json&lt;/code&gt; inside the Skill directory, which is a sensible convention even if you roll your own harness.&lt;/p&gt;

&lt;p&gt;Use a fixture file and a no-nonsense eval layout like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"skill_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"review-pr"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"evals"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Review this PR for security issues and missing tests"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"expected_output"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Findings grouped by severity with file references and at least one test recommendation."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"files"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"evals/files/pr-diff.patch"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Summarise last week's commits"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"expected_output"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The skill should not activate."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"files"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;My own testing rule is harsher than most teams use, but it lines up with the official guidance. Every serious Skill should have should-trigger queries, should-not-trigger queries, at least one edge-case test, and a baseline comparison without the Skill. Anthropic's examples compare tool calls, failed API calls, clarification loops, and token use with and without the Skill because "works" is not the same as "improves the workflow".&lt;/p&gt;

&lt;p&gt;If you test through the Claude Agent SDK, remember the plumbing. Skills are filesystem artefacts there, not programmatic registrations. Anthropic says you must enable the &lt;code&gt;"Skill"&lt;/code&gt; tool and load the relevant filesystem settings through &lt;code&gt;settingSources&lt;/code&gt; or &lt;code&gt;setting_sources&lt;/code&gt;. If you omit &lt;code&gt;user&lt;/code&gt; or &lt;code&gt;project&lt;/code&gt;, or point &lt;code&gt;cwd&lt;/code&gt; at the wrong place, the SDK will not discover the Skill. Anthropic even recommends asking "What Skills are available?" as a direct discovery check.&lt;/p&gt;

&lt;p&gt;Also test on the model and client you actually intend to ship. The open Agent Skills quickstart explicitly warns that tool-use reliability varies across models, and some models may answer directly instead of executing the command the Skill intends. That is not always a Skill design problem. Sometimes it is a model-selection problem, and your test matrix should expose it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Claude Skills Troubleshooting: Common Failures and Fixes
&lt;/h2&gt;

&lt;p&gt;When a Skill misbehaves, assume packaging before intelligence. The most common failures are still the boring ones.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If the Skill is not found at all, verify the file is named exactly &lt;code&gt;SKILL.md&lt;/code&gt;, with the right case, inside the correct directory. Anthropic's troubleshooting guide calls out filename case explicitly, and its Claude Code and SDK docs point you straight at &lt;code&gt;.claude/skills/*/SKILL.md&lt;/code&gt; and &lt;code&gt;~/.claude/skills/*/SKILL.md&lt;/code&gt; as the first checks.&lt;/li&gt;
&lt;li&gt;If frontmatter is invalid, check the YAML delimiters and quotes first. Anthropic's examples show the classic mistakes: missing &lt;code&gt;---&lt;/code&gt;, unclosed quotes, or invalid names with spaces and capitals. Skill names should be lowercase and hyphenated.&lt;/li&gt;
&lt;li&gt;If the Skill exists but does not trigger, the description is usually too vague. Claude Code's own troubleshooting says to include keywords users would naturally say, verify the Skill appears when you ask "What skills are available?", and try rephrasing closer to the description. Anthropic's PDF guide adds a great diagnostic trick: ask Claude when it would use the Skill and listen to how it paraphrases the description back to you.&lt;/li&gt;
&lt;li&gt;If the Skill triggers too often, narrow the scope. Anthropic recommends making the description more specific, adding negative triggers, and using &lt;code&gt;disable-model-invocation: true&lt;/code&gt; for workflows you want only by explicit command. Over-triggering is usually just under-specified routing language.&lt;/li&gt;
&lt;li&gt;If the Skill seems to lose influence in long sessions, remember that descriptions can be shortened in the Claude Code catalogue when many skills are present, and invoked Skills are later carried within token budgets after compaction. Anthropic recommends front-loading keywords in the description, trimming excess text, and, for Claude Code specifically, adjusting &lt;code&gt;SLASH_COMMAND_TOOL_CHAR_BUDGET&lt;/code&gt; if description listings are being squeezed too aggressively.&lt;/li&gt;
&lt;li&gt;If a bundled script hangs or behaves erratically, check whether it expects interactive input. The scripts guide says agents run in non-interactive shells, so TTY prompts, password dialogs, and confirmation menus are design bugs. Accept input through flags, environment variables, or stdin and make failures explicit.&lt;/li&gt;
&lt;li&gt;If the SDK does not see your Skill, confirm that &lt;code&gt;allowed_tools&lt;/code&gt; includes &lt;code&gt;"Skill"&lt;/code&gt;, that &lt;code&gt;settingSources&lt;/code&gt; or &lt;code&gt;setting_sources&lt;/code&gt; contains &lt;code&gt;user&lt;/code&gt; and/or &lt;code&gt;project&lt;/code&gt;, and that &lt;code&gt;cwd&lt;/code&gt; points at the directory that actually contains &lt;code&gt;.claude/skills/&lt;/code&gt;. Without that setup, the Skill system is not enabled no matter how correct your markdown looks.&lt;/li&gt;
&lt;li&gt;If an MCP-backed Skill loads but the tool calls fail, Anthropic's troubleshooting checklist is sensible: verify the MCP server is connected, confirm authentication and scopes, test the MCP tool directly without the Skill, then check the exact tool names because they are case-sensitive.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The boring truth is that good Claude Skills look like good operational engineering. Clear names. Small files. Explicit triggers. Deterministic scripts where needed. Real tests. If your Skill reads like a crisp runbook, the agent has a fighting chance. If it reads like a brainstorm, you have simply hidden chaos in a folder.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>aicoding</category>
      <category>dev</category>
    </item>
  </channel>
</rss>
