<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Teruo Kunihiro</title>
    <description>The latest articles on DEV Community by Teruo Kunihiro (@trknhr).</description>
    <link>https://dev.to/trknhr</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F41546%2F717de37a-8306-4328-8010-f38b0c5ed560.jpg</url>
      <title>DEV Community: Teruo Kunihiro</title>
      <link>https://dev.to/trknhr</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/trknhr"/>
    <language>en</language>
    <item>
      <title>Apple’s container Just Hit v1.0.0</title>
      <dc:creator>Teruo Kunihiro</dc:creator>
      <pubDate>Wed, 10 Jun 2026 13:50:19 +0000</pubDate>
      <link>https://dev.to/trknhr/apples-container-just-hit-v100-mid</link>
      <guid>https://dev.to/trknhr/apples-container-just-hit-v100-mid</guid>
      <description>&lt;p&gt;Apple’s &lt;code&gt;container&lt;/code&gt; has finally reached &lt;strong&gt;v1.0.0&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The name is a bit too generic, so in this article I’ll call it Apple’s &lt;code&gt;container&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;At first glance, it is tempting to describe it as “Apple’s Docker.” But that is not quite accurate. Apple’s &lt;code&gt;container&lt;/code&gt; is a CLI tool for creating and running Linux containers as lightweight virtual machines on macOS. It is written in Swift, optimized for Apple silicon, and works with OCI-compatible container images, so it can pull images from standard registries and push images that you build yourself. The GitHub repository currently lists &lt;code&gt;1.0.0&lt;/code&gt; as the latest release, dated June 9, 2026. (&lt;a href="https://github.com/apple/container" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;I have not used it heavily in production yet, so this is mostly a documentation-based first look rather than a deep hands-on review.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Apple’s &lt;code&gt;container&lt;/code&gt;?
&lt;/h2&gt;

&lt;p&gt;Apple’s &lt;code&gt;container&lt;/code&gt; is a tool for running Linux containers on a Mac.&lt;/p&gt;

&lt;p&gt;The official README describes it as a tool that lets you create and run Linux containers as lightweight virtual machines on macOS. It consumes and produces OCI-compatible container images, which means the workflow should feel familiar if you already use Docker, Podman, or other OCI-based tools. (&lt;a href="https://github.com/apple/container" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;For example, the basic commands look very Docker-like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;container run &lt;span class="nt"&gt;-it&lt;/span&gt; ubuntu:latest /bin/bash
container build &lt;span class="nt"&gt;-t&lt;/span&gt; my-app:latest &lt;span class="nb"&gt;.&lt;/span&gt;
container image pull alpine:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The command reference includes familiar operations such as &lt;code&gt;container run&lt;/code&gt;, &lt;code&gt;container build&lt;/code&gt;, &lt;code&gt;container create&lt;/code&gt;, &lt;code&gt;container exec&lt;/code&gt;, &lt;code&gt;container logs&lt;/code&gt;, &lt;code&gt;container start&lt;/code&gt;, and &lt;code&gt;container stop&lt;/code&gt;. It also supports options such as volume mounts, memory/CPU configuration, port publishing, Rosetta support, and SSH agent forwarding. (&lt;a href="https://github.com/apple/container/blob/main/docs/command-reference.md" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;)&lt;/p&gt;
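
&lt;p&gt;As a rough sketch of how those options look on the command line, the snippet below prints a few &lt;code&gt;container run&lt;/code&gt; invocations instead of executing them. The flag spellings (&lt;code&gt;--cpus&lt;/code&gt;, &lt;code&gt;--memory&lt;/code&gt;, &lt;code&gt;--volume&lt;/code&gt;, &lt;code&gt;--publish&lt;/code&gt;) are as I understand them from the command reference and should be checked against &lt;code&gt;container run --help&lt;/code&gt;. The &lt;code&gt;run&lt;/code&gt; helper, the paths, and the &lt;code&gt;nginx:latest&lt;/code&gt; image are made up for illustration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Hypothetical helper: print each command, and only execute it when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = 0 ]; then "$@"; fi
}

# Limit CPU and memory for a one-off container.
run container run --rm --cpus 2 --memory 1g alpine:latest uname -a
# Mount a host directory into the container (the path is an assumption).
run container run --rm --volume "$HOME/project:/work" alpine:latest ls /work
# Publish a container port on localhost.
run container run --detach --rm --publish 127.0.0.1:8080:80 nginx:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;By default this only echoes the commands, so it is safe to paste anywhere. Setting &lt;code&gt;DRY_RUN=0&lt;/code&gt; on a Mac with &lt;code&gt;container&lt;/code&gt; installed would run them for real.&lt;/p&gt;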

&lt;p&gt;It has a Docker-like CLI, and it understands OCI images, but it is not a Docker daemon-compatible implementation. It is a different implementation built around Apple’s Containerization framework and macOS virtualization technologies.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;code&gt;container machine&lt;/code&gt;: the interesting part
&lt;/h2&gt;

&lt;p&gt;One of the most interesting features is &lt;code&gt;container machine&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;A normal application container is usually modeled around one process or one application. A &lt;code&gt;container machine&lt;/code&gt;, on the other hand, is modeled more like a persistent Linux environment on your Mac.&lt;/p&gt;

&lt;p&gt;Apple describes container machines as fast, lightweight, persistent Linux environments based on standard OCI images. They also provide host integrations such as automatic username and home directory sharing. (&lt;a href="https://github.com/apple/container/blob/main/docs/container-machine.md" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;container machine create alpine:latest &lt;span class="nt"&gt;--name&lt;/span&gt; dev
container machine run &lt;span class="nt"&gt;-n&lt;/span&gt; dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives you a Linux environment that is closer to a small development machine than a one-shot container.&lt;/p&gt;

&lt;p&gt;Your macOS home directory can be mounted inside the container machine, so you can edit code with your Mac editor or IDE while building and running the project inside Linux. Apple’s docs describe this as “edit on the Mac, build inside.” (&lt;a href="https://github.com/apple/container/blob/main/docs/container-machine.md" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;It can also run the image’s init system. If the image includes &lt;code&gt;systemd&lt;/code&gt;, you can run services such as PostgreSQL using commands like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;systemctl start postgresql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apple’s documentation explicitly calls out this use case for testing real Linux services inside a container machine. (&lt;a href="https://github.com/apple/container/blob/main/docs/container-machine.md" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;This feels less like a “Docker replacement” and more like a &lt;strong&gt;Mac-native, WSL-like Linux development environment&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That is probably the part I am most interested in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Can it replace Docker Desktop?
&lt;/h2&gt;

&lt;p&gt;For some workflows, maybe.&lt;/p&gt;

&lt;p&gt;For many real-world team workflows, probably not yet.&lt;/p&gt;

&lt;p&gt;Docker Desktop is not just a container runner. Docker’s documentation describes Docker Desktop as an application for Mac, Linux, and Windows that lets you build, share, and run containerized applications. It includes Docker Engine, Docker CLI, Docker Build, Docker Compose, Docker Scout, and Kubernetes. (&lt;a href="https://docs.docker.com/desktop/" rel="noopener noreferrer"&gt;Docker Documentation&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Apple’s &lt;code&gt;container&lt;/code&gt; covers an important subset of that world:&lt;/p&gt;

&lt;p&gt;It can run containers.&lt;br&gt;
It can build OCI images.&lt;br&gt;
It can pull and push images.&lt;br&gt;
It has familiar container lifecycle commands.&lt;br&gt;
It integrates nicely with Apple silicon and macOS.&lt;/p&gt;

&lt;p&gt;But it does not look like a drop-in Docker Desktop replacement.&lt;/p&gt;

&lt;p&gt;The biggest practical gap for many developers is &lt;strong&gt;Docker Compose&lt;/strong&gt;. A lot of local development environments are built around &lt;code&gt;docker compose up&lt;/code&gt;, especially for apps that need a database, Redis, background workers, and multiple services.&lt;/p&gt;

&lt;p&gt;There is a third-party project called &lt;code&gt;container-compose&lt;/code&gt;, and Homebrew also has a &lt;code&gt;container-compose&lt;/code&gt; formula, but that still means relying on a non-Apple bridge for a very central part of many workflows. (&lt;a href="https://formulae.brew.sh/formula/container-compose" rel="noopener noreferrer"&gt;Homebrew Formulae&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;For single containers, isolated experiments, image builds, and lightweight Linux dev environments, Apple’s &lt;code&gt;container&lt;/code&gt; looks very promising.&lt;/p&gt;

&lt;p&gt;For Compose-heavy development environments, Docker Desktop is still the safer default.&lt;/p&gt;
&lt;h2&gt;
  
  
  The biggest architectural difference
&lt;/h2&gt;

&lt;p&gt;The biggest difference between Docker Desktop and Apple’s &lt;code&gt;container&lt;/code&gt; is how they use virtual machines.&lt;/p&gt;

&lt;p&gt;macOS does not have a Linux kernel. So if you want to run Linux containers on a Mac, some kind of Linux environment is required.&lt;/p&gt;

&lt;p&gt;Docker Desktop for Mac uses a Linux VM to run containers. Docker’s documentation says Docker Desktop supports multiple Virtual Machine Managers to power the Linux VM that runs containers. (&lt;a href="https://docs.docker.com/desktop/features/vmm/" rel="noopener noreferrer"&gt;Docker Documentation&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Conceptually, Docker Desktop looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Docker Desktop for Mac:

macOS
  └─ Linux VM
       ├─ container A
       ├─ container B
       └─ container C
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apple’s &lt;code&gt;container&lt;/code&gt; takes a different approach.&lt;/p&gt;

&lt;p&gt;Instead of putting many containers inside one shared Linux VM, it runs &lt;strong&gt;a lightweight VM for each container&lt;/strong&gt;. Apple’s technical overview says this gives each container VM-level isolation, lets the user mount only the necessary host data into each VM, and aims for memory usage lower than full VMs with boot times comparable to containers inside a shared VM. (&lt;a href="https://github.com/apple/container/blob/main/docs/technical-overview.md" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Conceptually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Apple container:

macOS
  ├─ lightweight VM ─ container A
  ├─ lightweight VM ─ container B
  └─ lightweight VM ─ container C
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That design is very Apple-like.&lt;/p&gt;

&lt;p&gt;It keeps the container workflow, but moves the isolation boundary closer to a VM boundary. This is attractive for security and privacy, especially when running code you do not fully trust.&lt;/p&gt;

&lt;p&gt;At WWDC, Apple also described Containerization as running each container inside its own lightweight VM while still providing sub-second start times. Each container can also get its own dedicated IP address, which can remove the need for individual port mappings in some cases. (&lt;a href="https://developer.apple.com/videos/play/wwdc2025/346/" rel="noopener noreferrer"&gt;Apple Developer&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Of course, there are trade-offs.&lt;/p&gt;

&lt;p&gt;Apple’s technical overview notes that memory pages freed inside a container VM are not always returned to the macOS host. If you run many memory-intensive containers, you may need to restart them occasionally to reduce memory usage. (&lt;a href="https://github.com/apple/container/blob/main/docs/technical-overview.md" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;So I would not assume this is automatically lighter than Docker Desktop for every workload. It needs real-world testing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Docker-like, but not Docker-compatible
&lt;/h2&gt;

&lt;p&gt;The CLI looks familiar, but compatibility is not the same thing as similarity.&lt;/p&gt;

&lt;p&gt;For basic usage, you might imagine something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;alias &lt;/span&gt;&lt;span class="nv"&gt;docker&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'container'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And for simple commands, that may feel surprisingly natural.&lt;/p&gt;

&lt;p&gt;But the broader Docker ecosystem is not just the CLI shape. Many tools expect the Docker Engine API, the Docker socket, or Docker Compose behavior.&lt;/p&gt;

&lt;p&gt;There is already a GitHub issue asking Apple’s &lt;code&gt;container&lt;/code&gt; to expose the Docker Engine API through something like &lt;code&gt;/var/run/docker.sock&lt;/code&gt;, and that issue is marked “Closed as not planned.” (&lt;a href="https://github.com/apple/container/issues/66" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Another issue requesting Moby API support says that such support would be a prerequisite for Docker Compose to support Apple’s runtime, but that issue is marked as a duplicate. (&lt;a href="https://github.com/apple/container/issues/229" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;So at least right now, I would not think of Apple’s &lt;code&gt;container&lt;/code&gt; as a Docker Desktop drop-in replacement.&lt;/p&gt;

&lt;h2&gt;
  
  
  OS support
&lt;/h2&gt;

&lt;p&gt;The official README says you need an Apple silicon Mac to run &lt;code&gt;container&lt;/code&gt;, and that macOS 26 is the supported target because the project relies on new virtualization and networking features in that release. (&lt;a href="https://github.com/apple/container" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Homebrew already provides a formula, so installation is a single command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;container
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Homebrew formula lists stable version &lt;code&gt;1.0.0&lt;/code&gt;, requires arm64 architecture, and lists macOS 15 or newer as a requirement, with Xcode 26 or newer required for building. (&lt;a href="https://formulae.brew.sh/formula/container" rel="noopener noreferrer"&gt;Homebrew Formulae&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;There is an important nuance here.&lt;/p&gt;

&lt;p&gt;Apple’s technical overview says &lt;code&gt;container&lt;/code&gt; can run on macOS 15, but with functional and user-experience limitations. For example, macOS 15 has limitations around network isolation, multiple networks, and container IP addresses. The &lt;code&gt;container network&lt;/code&gt; commands are not available on macOS 15. (&lt;a href="https://github.com/apple/container/blob/main/docs/technical-overview.md" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;So the cleanest experience seems to be macOS 26 on Apple silicon.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters on macOS
&lt;/h2&gt;

&lt;p&gt;On Linux, containers run on the host Linux kernel using features such as namespaces and cgroups.&lt;/p&gt;

&lt;p&gt;On macOS, that is not possible directly because macOS is not Linux.&lt;/p&gt;

&lt;p&gt;That is why Docker Desktop, Colima, Rancher Desktop, Podman Machine, and now Apple’s &lt;code&gt;container&lt;/code&gt; all have to solve the same core problem:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do we provide a Linux environment on a Mac without making the developer experience terrible?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Docker Desktop solves this with a managed Linux VM and a polished Docker ecosystem around it.&lt;/p&gt;

&lt;p&gt;Apple’s &lt;code&gt;container&lt;/code&gt; solves it by making lightweight VMs part of the container abstraction itself.&lt;/p&gt;

&lt;p&gt;That distinction matters.&lt;/p&gt;

&lt;p&gt;Docker Desktop optimizes for compatibility with the existing Docker ecosystem.&lt;/p&gt;

&lt;p&gt;Apple’s &lt;code&gt;container&lt;/code&gt; appears to optimize for Apple-platform integration, isolation, and a cleaner VM-per-container model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Apple’s &lt;code&gt;container&lt;/code&gt; could be useful
&lt;/h2&gt;

&lt;p&gt;I see three strong use cases.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Running single containers on Mac
&lt;/h3&gt;

&lt;p&gt;For quick experiments, sandboxing, and running one-off Linux tools, Apple’s &lt;code&gt;container&lt;/code&gt; could be very convenient.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;container run &lt;span class="nt"&gt;-it&lt;/span&gt; ubuntu:latest /bin/bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you do not need Compose, Kubernetes, or deep Docker ecosystem compatibility, this might be enough.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. A Mac-native Linux development environment
&lt;/h3&gt;

&lt;p&gt;This is where &lt;code&gt;container machine&lt;/code&gt; gets interesting.&lt;/p&gt;

&lt;p&gt;You can keep using your Mac editor, your Mac terminal, your Mac tools, and still build or test inside a real Linux environment.&lt;/p&gt;

&lt;p&gt;This could become a really nice “WSL for Mac” experience.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Local sandboxing for untrusted code
&lt;/h3&gt;

&lt;p&gt;Because each container runs inside its own lightweight VM, Apple’s &lt;code&gt;container&lt;/code&gt; may be a good fit for running code with stronger isolation than a regular shared-kernel container.&lt;/p&gt;

&lt;p&gt;That could be useful for local experiments, CI-like testing, or even AI coding agents that need to run generated code in a safer environment.&lt;/p&gt;

&lt;p&gt;I would not call it a complete security solution by itself, but the direction is interesting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Apple’s &lt;code&gt;container&lt;/code&gt; reaching v1.0.0 is a big milestone.&lt;/p&gt;

&lt;p&gt;I do not think it means everyone should uninstall Docker Desktop today.&lt;/p&gt;

&lt;p&gt;Docker Desktop is still the more complete environment if your workflow depends on Docker Compose, Kubernetes, Docker Engine API compatibility, or existing tools that expect the Docker socket.&lt;/p&gt;

&lt;p&gt;But Apple’s &lt;code&gt;container&lt;/code&gt; is important because it gives macOS developers an official Apple-native option for running Linux containers.&lt;/p&gt;

&lt;p&gt;Docker Desktop usually runs multiple containers inside a shared Linux VM.&lt;br&gt;
Apple’s &lt;code&gt;container&lt;/code&gt; runs each container inside its own lightweight VM. (&lt;a href="https://github.com/apple/container/blob/main/docs/technical-overview.md" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;That makes Apple’s approach especially interesting for isolation, privacy, and sandbox-style workflows.&lt;/p&gt;

&lt;p&gt;Personally, the feature I am most excited about is not the Docker-like CLI itself. It is &lt;code&gt;container machine&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The idea of a persistent Linux development environment that integrates naturally with macOS, shares my home directory, and lets me edit on the Mac while building inside Linux feels very promising.&lt;/p&gt;

</description>
      <category>docker</category>
      <category>container</category>
    </item>
    <item>
      <title>How I Make Claude Code's 5-Hour Usage Window Last Longer on Claude Pro</title>
      <dc:creator>Teruo Kunihiro</dc:creator>
      <pubDate>Wed, 27 May 2026 13:46:15 +0000</pubDate>
      <link>https://dev.to/trknhr/how-i-make-claude-codes-5-hour-usage-window-last-longer-on-claude-pro-jkb</link>
      <guid>https://dev.to/trknhr/how-i-make-claude-codes-5-hour-usage-window-last-longer-on-claude-pro-jkb</guid>
      <description>&lt;p&gt;When using Claude Code with Claude Pro, one problem you will almost certainly run into is the usage limit.&lt;/p&gt;

&lt;p&gt;The actual usage depends on many things: message length, attached files, conversation history, the model you are using, and the features you enable. Claude Pro has a session-based limit that resets every five hours, as well as a weekly usage limit. (&lt;a href="https://support.claude.com/en/articles/8325606-what-is-the-pro-plan" rel="noopener noreferrer"&gt;Claude Help Center&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;I use Claude Pro with the assumption that I will eventually hit the limit. Because of that, I try to avoid carrying unnecessary context, control when I start heavy work, and move important information out of the conversation and into files.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use &lt;code&gt;/clear&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;When I start a new task or switch models, I usually run &lt;code&gt;/clear&lt;/code&gt; first.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;/clear&lt;/code&gt; starts a new conversation with an empty context. The previous conversation is still available through &lt;code&gt;/resume&lt;/code&gt;, so this does not mean throwing away the work entirely. I use it to separate the current task from old context that is no longer needed. (&lt;a href="https://code.claude.com/docs/en/commands" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;The two cases I pay the most attention to are model switching and long idle periods.&lt;/p&gt;

&lt;p&gt;In Claude Code, each model has its own prompt cache. Because of that, when you switch models with &lt;code&gt;/model&lt;/code&gt;, the next request will reread the entire conversation history without a cache hit, even if the conversation itself has not changed. If you switch models while carrying a long conversation, the first request after the switch can consume a large amount of your usage limit. (&lt;a href="https://code.claude.com/docs/en/prompt-caching" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Long idle periods across sessions are also worth watching. For Claude subscriptions, Claude Code uses a one-hour TTL prompt cache for the main conversation. If the conversation is idle for a long time, the cache can expire. For example, if you wait more than an hour for a reset and then resume, the next input may reprocess the long history. (&lt;a href="https://code.claude.com/docs/en/prompt-caching" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;New Claude Code sessions can also change the cache prefix. The working directory, OS, shell, and git status snapshot can all affect the prefix. The official docs explain that sequential sessions can share the cache only when they are on the same machine and directory, and when the git status snapshot at startup matches. In other words, it is better to think of session resumes as situations where cache misses can easily happen because of TTL expiration or prefix differences. (&lt;a href="https://code.claude.com/docs/en/prompt-caching" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;For that reason, before switching models or leaving work idle for a long time, I save the necessary information as a plan or spec file and then run &lt;code&gt;/clear&lt;/code&gt;. I want to avoid starting a new session with the first prompt already carrying a huge old conversation history.&lt;/p&gt;

&lt;p&gt;If I want to continue the same task, I use &lt;code&gt;/compact&lt;/code&gt; instead of &lt;code&gt;/clear&lt;/code&gt;. &lt;code&gt;/compact&lt;/code&gt; replaces the conversation history with a summary, so the conversation-layer cache has to be rebuilt, but the next turn rebuilds it from a much shorter summary. Used at a natural stopping point, it helps both with usage and with keeping the model focused. (&lt;a href="https://code.claude.com/docs/en/prompt-caching" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt;)&lt;/p&gt;

&lt;h2&gt;
  
  
  Start the session early with &lt;code&gt;/schedule&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Claude Pro has a session limit that resets every five hours. If I start working at 6 a.m., I can expect resets around 11 a.m. and 4 p.m. during the workday. Of course, the exact remaining usage and reset time should be checked in the Usage screen or with &lt;code&gt;/usage&lt;/code&gt;, but the timing of when you start heavy work matters a lot. (&lt;a href="https://support.claude.com/en/articles/9797557-usage-limit-best-practices" rel="noopener noreferrer"&gt;Claude Help Center&lt;/a&gt;)&lt;/p&gt;
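
&lt;p&gt;The window arithmetic is easy to sketch in shell. This is only an illustration: it assumes the window opens exactly on the hour at the first request and lasts five hours, and the function name &lt;code&gt;reset_hours&lt;/code&gt; is made up for this example. The real reset time should still be checked with &lt;code&gt;/usage&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Print the next COUNT reset hours (24-hour clock) for a session that starts at START.
# Assumes a fixed five-hour window; real reset times may differ.
reset_hours() {
  start=$1
  count=$2
  i=1
  while [ "$i" -le "$count" ]; do
    echo $(( (start + 5 * i) % 24 ))
    i=$(( i + 1 ))
  done
}

reset_hours 6 2   # a 6 a.m. start: prints 11, then 16
reset_hours 9 2   # a 9 a.m. start: prints 14, then 19
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With a 9 a.m. start, the second reset lands at 7 p.m. Moving the start to 6 a.m. pulls it back to 4 p.m., inside a typical workday.&lt;/p&gt;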

&lt;p&gt;Recently, the limits have become more generous, so each five-hour session now allows a fairly large number of tokens. On May 6, 2026, Anthropic announced that it doubled Claude Code's five-hour rate limits for Pro, Max, Team, and Enterprise users, and removed peak-hours limit reductions for Pro and Max users. Claude Code weekly limits were also increased by 50% through July 13. (&lt;a href="https://www.anthropic.com/news/higher-limits-spacex" rel="noopener noreferrer"&gt;Anthropic&lt;/a&gt;, &lt;a href="https://x.com/ClaudeDevs/status/2054639777685934564" rel="noopener noreferrer"&gt;ClaudeDevs&lt;/a&gt;, &lt;a href="https://www.businessinsider.com/openai-anthropic-out-freebie-each-other-codex-claude-code-2026-5" rel="noopener noreferrer"&gt;Business Insider&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Even so, if you use Claude Code heavily on Pro, you can burn through a five-hour session very quickly. A temporary increase in the weekly limit does not prevent you from hitting the session limit if you pack too much heavy work into a short period.&lt;/p&gt;

&lt;p&gt;I want to maximize the number of useful resets during my daytime working hours, so I use &lt;code&gt;/schedule&lt;/code&gt; to run a routine early in the morning. &lt;code&gt;/schedule&lt;/code&gt; creates, updates, and runs Claude Code routines, and those routines run on Anthropic-managed cloud infrastructure. By scheduling something simple, such as a small &lt;code&gt;hello&lt;/code&gt; command, I can start the session early and plan the day around the five-hour windows. (&lt;a href="https://code.claude.com/docs/en/commands" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;For my schedule, simply controlling when the session starts can turn two useful reset windows into three.&lt;/p&gt;

&lt;h2&gt;
  
  
  Write plans and specs to files
&lt;/h2&gt;

&lt;p&gt;For larger Claude Code tasks, I try not to keep plans and specs only inside the conversation. I write them out to files.&lt;/p&gt;

&lt;p&gt;This is very important. If the working state exists only in the conversation history, running &lt;code&gt;/clear&lt;/code&gt; removes the context. But if the plan or spec is saved as a Markdown file, I can simply ask the next session to read that file and continue.&lt;/p&gt;

&lt;p&gt;I sometimes use Superpowers skills such as &lt;code&gt;brainstorming&lt;/code&gt; and &lt;code&gt;writing-plans&lt;/code&gt;. The Superpowers &lt;code&gt;brainstorming&lt;/code&gt; skill stores design specs under &lt;code&gt;docs/superpowers/specs/YYYY-MM-DD-&amp;lt;topic&amp;gt;-design.md&lt;/code&gt;, and the &lt;code&gt;writing-plans&lt;/code&gt; skill stores implementation plans under &lt;code&gt;docs/superpowers/plans/YYYY-MM-DD-&amp;lt;feature-name&amp;gt;.md&lt;/code&gt;. (&lt;a href="https://github.com/obra/superpowers/blob/main/skills/brainstorming/SKILL.md" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;, &lt;a href="https://github.com/obra/superpowers/blob/main/skills/writing-plans/SKILL.md" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;The tool itself does not have to be Superpowers. The important point is this: do not use the conversation as the only place where the working state lives.&lt;/p&gt;

&lt;p&gt;I keep only "what to do now" in the conversation. Specs, plans, test procedures, and reasons for decisions go into files. That way, I can run &lt;code&gt;/clear&lt;/code&gt; without losing productivity.&lt;/p&gt;
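
&lt;p&gt;A minimal sketch of that habit, without any particular tool: a small shell function that creates a dated Markdown plan file which a fresh session can be asked to read. The function name &lt;code&gt;new_plan&lt;/code&gt;, the default &lt;code&gt;docs/plans&lt;/code&gt; directory, and the headings are all assumptions for illustration, not a required format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Create a dated plan file and print the template that was written.
# Arguments: topic, optional target directory (defaults to docs/plans).
new_plan() {
  topic=$1
  dir=${2:-docs/plans}
  file="$dir/$(date +%F)-$topic.md"
  mkdir -p "$dir"
  # tee writes the template to the file and echoes it to the terminal.
  printf '# Plan: %s\n\n## Goal\n\n## Steps\n\n## Decisions and reasons\n\n## How to test\n' "$topic" | tee "$file"
  echo "saved to $file"
}

new_plan login-rate-limit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;After filling in the file, the next session only needs a short prompt that points at it, so running &lt;code&gt;/clear&lt;/code&gt; costs nothing.&lt;/p&gt;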

&lt;h2&gt;
  
  
  Use cheaper models when possible
&lt;/h2&gt;

&lt;p&gt;Not every task needs Opus.&lt;/p&gt;

&lt;p&gt;The official Claude Code cost management docs say that Sonnet can handle many coding tasks well and is cheaper than Opus. Opus is better reserved for complex architecture decisions and multi-step reasoning. (&lt;a href="https://code.claude.com/docs/en/costs" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;To be honest, I do not always optimize this perfectly. But when using Superpowers skills or subagents, simple subtasks are sometimes routed toward Sonnet, which feels like it saves usage. The official docs also mention that simple subagent tasks can be configured to use Haiku. (&lt;a href="https://code.claude.com/docs/en/costs" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;On the other hand, for tasks that involve orchestrating multiple subagents or making high-level design decisions, Opus feels more stable to me. If I try too hard to save usage there, I often pay for it later with retries and corrections.&lt;/p&gt;

&lt;p&gt;So the pattern I usually follow is similar to the Superpowers &lt;a href="https://github.com/obra/superpowers/blob/main/skills/subagent-driven-development/SKILL.md" rel="noopener noreferrer"&gt;SKILL.md&lt;/a&gt;: use Sonnet for simple implementation, research, test fixes, and file-level work; use Opus for design decisions, complex debugging, subagent orchestration, and reviewing long plans.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;If you use Claude Code on Claude Pro, it is better to assume that you will eventually hit the usage limit.&lt;/p&gt;

&lt;p&gt;The most important thing is to avoid carrying unnecessary context. Use &lt;code&gt;/clear&lt;/code&gt; when starting a new task or switching models. Use &lt;code&gt;/compact&lt;/code&gt; when continuing the same task while cleaning up the context. Do not keep all working state inside a long conversation; write plans and specs to files.&lt;/p&gt;

&lt;p&gt;Also, pay attention to when your five-hour session starts. Starting early in the morning makes it easier to take advantage of multiple resets during the day. With &lt;code&gt;/schedule&lt;/code&gt;, you can control the start timing of routine work to some extent.&lt;/p&gt;

&lt;p&gt;For models, use Sonnet for everyday work and reserve Opus for heavy design decisions and complex orchestration. The goal is not simply to use the cheapest model. The goal is to choose the model that fails the least within your available usage limit.&lt;/p&gt;

&lt;p&gt;In the end, saving Claude Code usage is not really about being stingy. It is about managing working state. Keep the session light, move important information into files, and avoid making the model carry everything in the conversation.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>productivity</category>
      <category>devtools</category>
    </item>
    <item>
      <title>Choosing Models for an Agentic Chat App on Amazon Bedrock</title>
      <dc:creator>Teruo Kunihiro</dc:creator>
      <pubDate>Mon, 25 May 2026 07:34:35 +0000</pubDate>
      <link>https://dev.to/trknhr/choosing-models-for-an-agentic-chat-app-on-amazon-bedrock-3gdi</link>
      <guid>https://dev.to/trknhr/choosing-models-for-an-agentic-chat-app-on-amazon-bedrock-3gdi</guid>
      <description>&lt;p&gt;When building an agentic chat application on Amazon Bedrock, one of the first hard decisions is model selection.&lt;/p&gt;

&lt;p&gt;This article is not a rigorous benchmark or formal evaluation. It is simply a set of practical notes from experimenting with multiple Bedrock models while building a personal agentic chat application. Pricing, supported features, and regional behavior change frequently, so you should always validate with official documentation and your own workload before making production decisions.&lt;/p&gt;

&lt;p&gt;The app I’m currently building is a serverless agent that gets invoked from Slack. It receives user messages and dynamically calls tools such as memory, task management, calendar integration, web extraction, and custom skills.&lt;/p&gt;

&lt;p&gt;So this is not just a simple chatbot.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;user message
  -&amp;gt; model decides tool usage
  -&amp;gt; tool execution
  -&amp;gt; model observes result
  -&amp;gt; sometimes replans
  -&amp;gt; final Slack response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this setup, model pricing alone is not enough. Tool call stability, Japanese UX quality, retry rate, fallback frequency, and output token volume all matter a lot.&lt;/p&gt;

&lt;p&gt;My conclusion, at least for now, is that Moonshot AI’s Kimi K2.5 works best as the primary model.&lt;/p&gt;




&lt;h2&gt;
  
  
  Sonnet Is Expensive
&lt;/h2&gt;

&lt;p&gt;Claude Sonnet is the baseline reference point.&lt;/p&gt;

&lt;p&gt;Claude Sonnet 4.5 costs $3 per 1M input tokens and $15 per 1M output tokens, while Claude Haiku 4.5 costs $1 / $5. Sonnet’s quality is reassuring, but at three times Haiku’s price the cost becomes significant for agentic chat workloads where output tokens can grow quickly.&lt;/p&gt;

&lt;p&gt;Agentic chat systems often invoke the model multiple times for a single user message. Tool schemas, tool results, conversation history, and system prompts all inflate token usage compared to ordinary Q&amp;amp;A applications.&lt;/p&gt;

&lt;p&gt;Because of that, I positioned Sonnet like this from the beginning:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Claude Sonnet:
  fallback on failure
  escalation path for high-value users
  difficult multi-step reasoning
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the main model, I needed something cheaper than Sonnet but more reliable at agentic behavior than the lightweight models.&lt;/p&gt;




&lt;h2&gt;
  
  
  Haiku Is Cheap, but Slightly Weak
&lt;/h2&gt;

&lt;p&gt;Claude Haiku 4.5 is attractive from a pricing perspective. If your architecture benefits heavily from prompt caching, it can become extremely cost efficient for applications with large system prompts and repeated tool schemas.&lt;/p&gt;

&lt;p&gt;Bedrock prompt caching reduces input token cost and latency by caching repeated prompt prefixes.&lt;/p&gt;

&lt;p&gt;Still, in my own testing, Haiku felt slightly too weak to serve as the main model.&lt;/p&gt;

&lt;p&gt;It works well for simple classification, lightweight extraction, and short summaries. But I had concerns about tool selection, replanning stability, Japanese response quality, and multi-step reliability.&lt;/p&gt;

&lt;p&gt;So Haiku feels better suited as a helper model rather than the primary agent model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Claude Haiku:
  routing
  lightweight classification
  lightweight extraction
  first-pass processing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  MiniMax M2.5 Is Cheap and Agent-Friendly — but Japanese UX Is Weak
&lt;/h2&gt;

&lt;p&gt;MiniMax M2.5 was one of the strongest candidates.&lt;/p&gt;

&lt;p&gt;According to the Bedrock model card, MiniMax M2.5 is positioned as an “agent-native frontier model” optimized for reasoning efficiency, task decomposition, complex workflows, and agentic scaffolding. It supports a 196K context window and 8K maximum output tokens.&lt;/p&gt;

&lt;p&gt;The pricing is also extremely competitive.&lt;/p&gt;

&lt;p&gt;In the Tokyo region:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Approximate Cost for 1,000 Calls&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MiniMax M2.5&lt;/td&gt;
&lt;td&gt;~$4.32&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mistral Large 3&lt;/td&gt;
&lt;td&gt;~$6.70&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kimi K2.5&lt;/td&gt;
&lt;td&gt;~$9.36&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
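
&lt;p&gt;The table does not state the token counts behind these figures. As a rough check, the Kimi K2.5 and Mistral Large 3 rows are consistent with about 8,000 input tokens and 1,000 output tokens per call at the Tokyo prices quoted later in this article. That per-call size is an assumption inferred from the numbers, not a figure from the pricing pages, and the helper name is illustrative:&lt;/p&gt;

```python
# Prices in USD per 1M tokens (Tokyo region), as quoted in this article.
PRICES = {
    "Kimi K2.5": (0.72, 3.60),
    "Mistral Large 3": (0.61, 1.82),
}

# Assumed workload per call (inferred from the table, not an official figure).
INPUT_TOKENS_PER_CALL = 8_000
OUTPUT_TOKENS_PER_CALL = 1_000


def cost_per_calls(input_price, output_price, calls=1_000):
    """Return the USD cost of `calls` invocations at the assumed token counts."""
    input_cost = input_price * INPUT_TOKENS_PER_CALL * calls / 1_000_000
    output_cost = output_price * OUTPUT_TOKENS_PER_CALL * calls / 1_000_000
    return round(input_cost + output_cost, 2)


for model, (input_price, output_price) in PRICES.items():
    print(model, cost_per_calls(input_price, output_price))
```

&lt;p&gt;Under that assumption the function returns 9.36 for Kimi K2.5 and 6.7 for Mistral Large 3, which matches the table. Swapping in your own measured token counts is the more useful exercise.&lt;/p&gt;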

&lt;p&gt;On paper, MiniMax M2.5 is very attractive. It also supports Bedrock Agents, Flows, and structured outputs.&lt;/p&gt;

&lt;p&gt;However, after actually using it, I felt that the Japanese UX and customer-facing conversational quality were slightly off. It may work well for internal planning or orchestration, but I was not fully comfortable exposing it directly to users in Slack conversations.&lt;/p&gt;

&lt;p&gt;MiniMax is probably one of the strongest cost-performance options available today, but I ultimately excluded it as the main chat model.&lt;/p&gt;




&lt;h2&gt;
  
  
  Gemma Is Extremely Cheap, but Better for First-Pass Processing
&lt;/h2&gt;

&lt;p&gt;The Gemma 3 family was also considered.&lt;/p&gt;

&lt;p&gt;In the Tokyo region, Gemma 3 pricing is extremely low (input / output per 1M tokens):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gemma 3 27B: $0.28 / $0.46&lt;/li&gt;
&lt;li&gt;Gemma 3 12B: $0.11 / $0.35&lt;/li&gt;
&lt;li&gt;Gemma 3 4B: $0.05 / $0.10&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At those prices, Gemma becomes very useful for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;classification&lt;/li&gt;
&lt;li&gt;lightweight RAG answers&lt;/li&gt;
&lt;li&gt;short summaries&lt;/li&gt;
&lt;li&gt;routing&lt;/li&gt;
&lt;li&gt;first-pass response generation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, my target workload was an agentic chat main model. Since even Haiku already felt slightly weak for that role, Gemma was difficult to justify as the primary agent.&lt;/p&gt;




&lt;h2&gt;
  
  
  Nemotron 3 Super 120B
&lt;/h2&gt;

&lt;p&gt;At one point I also evaluated NVIDIA Nemotron 3 Super 120B.&lt;/p&gt;

&lt;p&gt;According to the Bedrock model card, Nemotron 3 Super is a 120B-parameter open hybrid MoE model with 12B active parameters. It targets complex multi-agent applications and supports a 256K context window with 32K output tokens.&lt;/p&gt;

&lt;p&gt;Pricing is surprisingly low:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;$0.18 / 1M input tokens&lt;/li&gt;
&lt;li&gt;$0.78 / 1M output tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even cheaper than MiniMax.&lt;/p&gt;

&lt;p&gt;On paper, it looked extremely compelling.&lt;/p&gt;

&lt;p&gt;However, in my own testing, on-demand invocation latency in the Tokyo region was sometimes very slow, and even short responses occasionally timed out. Meanwhile, in us-east-1, forced tool calls and short responses often completed in around 2–3 seconds.&lt;/p&gt;

&lt;p&gt;So I would not conclude that Nemotron itself is fundamentally slow. Regional infrastructure and routing likely have a large impact.&lt;/p&gt;

&lt;p&gt;Since my target use case is a customer-facing chat application deployed in Tokyo, I decided not to use it as the main model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Nemotron 3 Super:
  strong pricing and specs
  tool use works
  but latency in ap-northeast-1 felt risky
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Mistral Large 3 Is Good, but Not Decisive
&lt;/h2&gt;

&lt;p&gt;Mistral Large 3 was also a very realistic option.&lt;/p&gt;

&lt;p&gt;According to the Bedrock model card, Mistral Large 3 is a 675B-parameter model optimized for coding, reasoning, and multilingual tasks. It supports a 256K context window and 32K output tokens.&lt;/p&gt;

&lt;p&gt;In Bedrock Runtime, it supports Agents, Flows, structured outputs, and prompt caching.&lt;/p&gt;

&lt;p&gt;Pricing in Tokyo:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;$0.61 / 1M input tokens&lt;/li&gt;
&lt;li&gt;$1.82 / 1M output tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Considerably cheaper than Kimi K2.5.&lt;/p&gt;

&lt;p&gt;My practical experience with it was not bad at all. But in this specific agentic chat workload, Kimi K2.5 consistently felt more stable.&lt;/p&gt;

&lt;p&gt;Also, while the official model card says prompt caching is supported, I occasionally saw Bedrock reject requests when using &lt;code&gt;cachePoint&lt;/code&gt; in my own setup.&lt;/p&gt;

&lt;p&gt;Mistral offers a very good balance between cost and quality, but Kimi ultimately ranked higher for my use case.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why I Ended Up Choosing Kimi K2.5
&lt;/h2&gt;

&lt;p&gt;In the end, I chose &lt;code&gt;moonshotai.kimi-k2.5&lt;/code&gt; as the main model.&lt;/p&gt;

&lt;p&gt;The reason is simple:&lt;/p&gt;

&lt;p&gt;Among all the models I tested, it provided the best balance of agentic behavior stability and Japanese UX quality.&lt;/p&gt;

&lt;p&gt;According to the Bedrock model card, Kimi K2.5 offers improved reasoning, coding, and multilingual capabilities. It supports a 256K context window, 16K output tokens, and multimodal image input.&lt;/p&gt;

&lt;p&gt;Within Bedrock Runtime, it supports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;response streaming&lt;/li&gt;
&lt;li&gt;Guardrails&lt;/li&gt;
&lt;li&gt;Prompt Management&lt;/li&gt;
&lt;li&gt;Flows&lt;/li&gt;
&lt;li&gt;Agents&lt;/li&gt;
&lt;li&gt;structured outputs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Pricing in Tokyo:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;$0.72 / 1M input tokens&lt;/li&gt;
&lt;li&gt;$3.60 / 1M output tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More expensive than MiniMax or Mistral, but still significantly cheaper than Sonnet.&lt;/p&gt;

&lt;p&gt;When selecting models, failure rate matters as much as raw token pricing.&lt;/p&gt;

&lt;p&gt;Even if a model is cheap, frequent tool selection failures, malformed JSON, retries, or Sonnet fallbacks can easily increase the total effective cost.&lt;/p&gt;

&lt;p&gt;In agentic systems especially, a single bad decision can cascade into failed tool calls and unnecessary replanning.&lt;/p&gt;

&lt;p&gt;That is why my final evaluation of Kimi K2.5 became:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Not the cheapest model, but the most stable main model.&lt;/p&gt;
&lt;/blockquote&gt;
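
&lt;p&gt;The effective-cost argument can be made concrete with a small expected-value sketch. It assumes every failed primary call is retried exactly once on Sonnet with the same token counts (8,000 in, 1,000 out, an assumed workload). The failure rates and the price of the "cheap" model are made-up inputs, not measurements:&lt;/p&gt;

```python
def call_cost(input_price, output_price, input_tokens=8_000, output_tokens=1_000):
    """USD cost of one call; prices are per 1M tokens, token counts are assumed."""
    return (input_price * input_tokens + output_price * output_tokens) / 1_000_000


def effective_cost(primary, fallback, failure_rate):
    """Expected USD cost per request when a failed primary call is retried once
    on the fallback model (the failed primary call is still billed)."""
    return call_cost(*primary) + failure_rate * call_cost(*fallback)


SONNET = (3.00, 15.00)  # Claude Sonnet 4.5 price quoted in this article
KIMI = (0.72, 3.60)     # Kimi K2.5 Tokyo price quoted in this article
CHEAP = (0.30, 1.20)    # hypothetical cheaper model, not a real price

# Failure rates below are illustrative inputs, not measured values.
print(round(effective_cost(KIMI, SONNET, failure_rate=0.02), 5))
print(round(effective_cost(CHEAP, SONNET, failure_rate=0.20), 5))
```

&lt;p&gt;With these illustrative inputs, the cheaper model ends up costing more per request (about $0.0114 versus $0.0101) once Sonnet fallbacks are counted. The real comparison depends entirely on the failure rates you measure.&lt;/p&gt;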




&lt;h2&gt;
  
  
  No Prompt Cache Support for Kimi K2.5 on Bedrock
&lt;/h2&gt;

&lt;p&gt;One unfortunate limitation is prompt caching.&lt;/p&gt;

&lt;p&gt;The Bedrock model card for Kimi K2.5 lists support for Agents, Flows, and structured outputs, but does not currently mention prompt caching.&lt;/p&gt;

&lt;p&gt;The Bedrock prompt caching documentation explicitly lists which models support cache checkpoints and where they can be inserted (&lt;code&gt;system&lt;/code&gt;, &lt;code&gt;messages&lt;/code&gt;, or &lt;code&gt;tools&lt;/code&gt;). Claude models and some others are listed there, but so far I have found little evidence that Kimi K2.5 supports prompt caching on the Bedrock side.&lt;/p&gt;

&lt;p&gt;Moonshot’s direct API does show cache-hit pricing for Kimi K2.5.&lt;/p&gt;

&lt;p&gt;However, that does not automatically mean the same cache behavior or pricing applies through Bedrock.&lt;/p&gt;




&lt;h2&gt;
  
  
  Reducing Cost with Payload Slimming and Flex Tier
&lt;/h2&gt;

&lt;p&gt;Once Kimi K2.5 became the primary model, the next challenge was cost optimization.&lt;/p&gt;

&lt;p&gt;Especially output tokens.&lt;/p&gt;

&lt;p&gt;The first thing that matters is payload slimming.&lt;/p&gt;

&lt;p&gt;That means minimizing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;system prompts&lt;/li&gt;
&lt;li&gt;tool schemas&lt;/li&gt;
&lt;li&gt;tool results&lt;/li&gt;
&lt;li&gt;conversation history&lt;/li&gt;
&lt;li&gt;RAG excerpts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In agentic chat systems, tool schemas and tool results can dramatically inflate input token usage.&lt;/p&gt;

&lt;p&gt;Some practical optimizations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;limit &lt;code&gt;maxTokens&lt;/code&gt; depending on workload&lt;/li&gt;
&lt;li&gt;avoid exposing long intermediate reasoning&lt;/li&gt;
&lt;li&gt;trim tool results down to only required fields&lt;/li&gt;
&lt;li&gt;avoid injecting every tool schema every time&lt;/li&gt;
&lt;li&gt;cache repeated FAQ answers, search results, and tool results on the application side&lt;/li&gt;
&lt;/ul&gt;
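
&lt;p&gt;As one example, trimming tool results to the required fields can be as small as the sketch below. The calendar payload and its field names are hypothetical, chosen only to show the shape of the idea:&lt;/p&gt;

```python
def trim_tool_result(result, required_fields):
    """Keep only the fields the model needs from a tool result dict."""
    return {key: result[key] for key in required_fields if key in result}


# Hypothetical calendar tool output; the field names are illustrative only.
raw_event = {
    "id": "evt_123",
    "title": "Design review",
    "start": "2026-05-25T10:00:00+09:00",
    "attendees": ["a@example.com", "b@example.com"],
    "html_description": "...long markup the model does not need...",
    "etag": "abc123",
}

# Only the trimmed dict is serialized back into the conversation.
slim_event = trim_tool_result(raw_event, ["title", "start"])
print(slim_event)
```

&lt;p&gt;An allowlist like this is safer than a blocklist, because new fields added by the tool later do not silently inflate the prompt.&lt;/p&gt;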

&lt;p&gt;These optimizations matter regardless of which model you choose.&lt;/p&gt;

&lt;p&gt;I also started experimenting with Bedrock Flex tier.&lt;/p&gt;

&lt;p&gt;Bedrock provides Standard, Flex, Priority, and Reserved service tiers. Flex is intended for workloads that can tolerate slightly more variable latency in exchange for lower cost.&lt;/p&gt;

&lt;p&gt;AWS documentation specifically mentions these as example Flex workloads:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;model evaluation&lt;/li&gt;
&lt;li&gt;summarization&lt;/li&gt;
&lt;li&gt;agentic workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Moonshot Flex pricing on Bedrock is advertised at roughly a 50% discount compared to Standard.&lt;/p&gt;

&lt;p&gt;That means Kimi K2.5 in Tokyo becomes approximately:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;Input&lt;/th&gt;
&lt;th&gt;Output&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Standard&lt;/td&gt;
&lt;td&gt;$0.72&lt;/td&gt;
&lt;td&gt;$3.60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flex&lt;/td&gt;
&lt;td&gt;~$0.36&lt;/td&gt;
&lt;td&gt;~$1.80&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Initially, I planned to use Standard for interactive chat and Flex only for asynchronous tasks, evaluations, summaries, and background processing.&lt;/p&gt;

&lt;p&gt;However, after trying Kimi K2.5 on Flex, the latency for lightweight Slack interactions felt much better than expected.&lt;/p&gt;

&lt;p&gt;This is not a rigorous benchmark, and behavior may differ under heavy load or long tool loops.&lt;/p&gt;

&lt;p&gt;Still, for small-scale personal projects or serverless agents, starting with Flex for the main response path actually feels realistic.&lt;/p&gt;

&lt;p&gt;My current setup looks roughly like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;main interactive responses:
  moonshotai.kimi-k2.5 / Flex

async processing, summaries, evaluations:
  moonshotai.kimi-k2.5 / Flex

failure handling and difficult reasoning:
  Claude Sonnet fallback

lightweight classification and routing:
  cheaper helper models
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
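
&lt;p&gt;The fallback path in that layout can be sketched as plain control flow. The two model functions below are hypothetical stand-ins rather than Bedrock SDK calls, and treating malformed JSON as a failure is an assumption about how tool output gets validated:&lt;/p&gt;

```python
import json


def run_with_fallback(primary, fallback, prompt):
    """Call `primary`; if it raises or returns malformed JSON, use `fallback`.

    `primary` and `fallback` are hypothetical callables that take a prompt and
    return a JSON string; real code would wrap the Bedrock client here.
    """
    try:
        return json.loads(primary(prompt)), "primary"
    except (RuntimeError, TimeoutError, json.JSONDecodeError):
        return json.loads(fallback(prompt)), "fallback"


# Stand-ins for model calls, used only to exercise the control flow.
def flaky_model(prompt):
    return '{"tool": "calendar", "args": '  # truncated, malformed JSON


def stable_model(prompt):
    return '{"tool": "calendar", "args": {"day": "today"}}'


result, used = run_with_fallback(flaky_model, stable_model, "What is on today?")
print(used, result["tool"])
```

&lt;p&gt;Returning which path was used makes it easy to log the fallback frequency, which is one of the numbers that decides the effective cost.&lt;/p&gt;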






&lt;h2&gt;
  
  
  Explaining Security Concerns Around Chinese Models
&lt;/h2&gt;

&lt;p&gt;When using Chinese-origin models like Kimi K2.5 or MiniMax M2.5, security concerns often come up internally.&lt;/p&gt;

&lt;p&gt;The important point is not to argue that “Chinese models are safe.”&lt;/p&gt;

&lt;p&gt;Instead, the distinction between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;direct API usage&lt;/li&gt;
&lt;li&gt;Bedrock-managed usage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;must be explained clearly.&lt;/p&gt;

&lt;p&gt;According to Amazon Bedrock documentation, model providers cannot access Bedrock logs or customer prompts/completions.&lt;/p&gt;

&lt;p&gt;That means using Kimi or MiniMax through Bedrock has a very different risk profile compared to directly calling vendor APIs.&lt;/p&gt;

&lt;p&gt;The explanation I found most practical was:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;We are not sending data directly to a Chinese model provider.
The models are executed within Amazon Bedrock’s managed environment.
Customer prompts and completions are not shared with the model provider through Bedrock.

Therefore, the main operational concerns become:
  IAM
  logging
  Guardrails
  RAG access control
  tool-call permissions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Final Architecture
&lt;/h2&gt;

&lt;p&gt;My final conclusion currently looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;main model:
  moonshotai.kimi-k2.5

interactive tier:
  currently testing Flex
  fallback to Standard if latency becomes problematic

cost-sensitive tier:
  Flex

fallback:
  Claude Sonnet

helper models:
  MiniMax / Gemma / Nemotron for specialized workloads
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Model roles ended up being:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Evaluation&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet&lt;/td&gt;
&lt;td&gt;Excellent but expensive&lt;/td&gt;
&lt;td&gt;fallback / escalation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Haiku&lt;/td&gt;
&lt;td&gt;Cheap but slightly weak&lt;/td&gt;
&lt;td&gt;routing / extraction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MiniMax M2.5&lt;/td&gt;
&lt;td&gt;Cheap and agent-oriented&lt;/td&gt;
&lt;td&gt;not ideal for Japanese-facing UX&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 3&lt;/td&gt;
&lt;td&gt;Extremely cheap&lt;/td&gt;
&lt;td&gt;first-pass processing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nemotron 3 Super&lt;/td&gt;
&lt;td&gt;Cheap, non-Chinese, tool-capable&lt;/td&gt;
&lt;td&gt;latency concerns in Tokyo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mistral Large 3&lt;/td&gt;
&lt;td&gt;Strong balance&lt;/td&gt;
&lt;td&gt;good, but less stable than Kimi&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kimi K2.5&lt;/td&gt;
&lt;td&gt;Strong Japanese UX and tool stability&lt;/td&gt;
&lt;td&gt;main model&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;This whole exploration started from a simple question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Sonnet is expensive. Is there a cheaper main model for agentic chat?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;MiniMax M2.5 was extremely attractive in terms of pricing and agent-oriented behavior, but the Japanese customer-facing UX did not fully work for me.&lt;/p&gt;

&lt;p&gt;Mistral Large 3 offered an excellent balance overall, but Kimi K2.5 consistently felt more stable.&lt;/p&gt;

&lt;p&gt;Nemotron 3 Super 120B looked fascinating from a pricing and specification perspective, but latency in the Tokyo region made it difficult to trust for customer-facing chat.&lt;/p&gt;

&lt;p&gt;Haiku can become highly cost efficient with prompt caching, but it still felt slightly weak for my main agent workload.&lt;/p&gt;

&lt;p&gt;As a result, I settled on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kimi K2.5 as the main model&lt;/li&gt;
&lt;li&gt;Sonnet as fallback&lt;/li&gt;
&lt;li&gt;Flex tier and payload slimming for cost optimization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For my own use case, Kimi K2.5 was not the absolute cheapest model.&lt;/p&gt;

&lt;p&gt;But once retry rates, UX quality, and operational stability were included in the calculation, it delivered the best effective cost.&lt;/p&gt;

&lt;p&gt;Going forward, I want to build more formal evaluations around:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;conversational quality&lt;/li&gt;
&lt;li&gt;tool call success rate&lt;/li&gt;
&lt;li&gt;retry frequency&lt;/li&gt;
&lt;li&gt;latency&lt;/li&gt;
&lt;li&gt;Japanese UX&lt;/li&gt;
&lt;li&gt;token cost&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Rather than endlessly adding more candidate models, I want to keep pruning the stack into something operationally simple and reliable.&lt;/p&gt;

</description>
      <category>amazonbedrock</category>
    </item>
    <item>
      <title>TanStack Was Not the Whole Story: Mini Shai-Hulud Was an npm/PyPI Supply-Chain Worm</title>
      <dc:creator>Teruo Kunihiro</dc:creator>
      <pubDate>Wed, 13 May 2026 08:09:17 +0000</pubDate>
      <link>https://dev.to/trknhr/tanstack-was-not-the-whole-story-mini-shai-hulud-was-an-npmpypi-supply-chain-worm-pok</link>
      <guid>https://dev.to/trknhr/tanstack-was-not-the-whole-story-mini-shai-hulud-was-an-npmpypi-supply-chain-worm-pok</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This article is based on public reporting available as of 2026-05-13. Mini Shai-Hulud is still an actively tracked campaign, so affected packages and IOCs (indicators of compromise) may change.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In May 2026, a supply-chain compromise was reported across TanStack's npm packages. Malicious versions were published for 42 &lt;code&gt;@tanstack/*&lt;/code&gt; packages, and installing those versions triggered a credential stealer.&lt;/p&gt;

&lt;p&gt;If you look only at TanStack, the incident can seem like a single npm compromise. But when you read The Hacker News coverage and the analyses from StepSecurity and Socket, it is better understood as part of a broader self-propagating supply-chain campaign called &lt;strong&gt;Mini Shai-Hulud&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The important point is that this was not just "a dependency package was compromised." It was closer to &lt;strong&gt;a worm that used developer machines and CI/CD environments as stepping stones to reach the next maintainer and the next package ecosystem&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What happened at TanStack
&lt;/h2&gt;

&lt;p&gt;According to the TanStack GitHub Advisory, malicious versions were published to the npm registry for 42 &lt;code&gt;@tanstack/*&lt;/code&gt; packages, totaling 84 versions, between 2026-05-11 19:20 and 19:26 UTC. The issue is tracked as &lt;code&gt;CVE-2026-45321&lt;/code&gt; with a CVSS score of 9.6.&lt;/p&gt;

&lt;p&gt;The publish was authenticated through the legitimate GitHub Actions OIDC trusted-publisher binding. At the same time, the advisory explains that the publish workflow itself was not modified.&lt;/p&gt;

&lt;p&gt;This section is based on TanStack's official postmortem and GitHub Advisory:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://tanstack.com/blog/npm-supply-chain-compromise-postmortem" rel="noopener noreferrer"&gt;Postmortem: TanStack NPM supply-chain compromise&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/TanStack/router/security/advisories/GHSA-g7cv-rxg3-hmpx" rel="noopener noreferrer"&gt;TanStack GitHub Advisory GHSA-g7cv-rxg3-hmpx&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At a high level, the TanStack-specific path looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TanStack-specific path:

checkout and build fork PR code inside pull_request_target
  + GitHub Actions cache poisoning
  + OIDC token extraction from the Actions runner process
  -&amp;gt; malicious publish that looked like it came from the legitimate release path
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The issue was not simply the use of &lt;code&gt;pull_request_target&lt;/code&gt;. The problem was that the workflow checked out and executed untrusted fork PR code inside a &lt;code&gt;pull_request_target&lt;/code&gt; workflow. &lt;code&gt;pull_request_target&lt;/code&gt; runs in the context of the base repository, so it should generally be limited to operations that do not execute the contents of the PR, such as labeling or commenting.&lt;/p&gt;

&lt;p&gt;TanStack's postmortem explains that &lt;code&gt;bundle-size.yml&lt;/code&gt; ran on &lt;code&gt;pull_request_target&lt;/code&gt;, checked out the fork PR merge ref, and ran a build for bundle-size measurement. In other words, untrusted fork PR code ran within the base repository's cache scope. That became the entry point for cache poisoning.&lt;/p&gt;
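
&lt;p&gt;A rough way to spot this pattern in your own repositories is a text heuristic over workflow files. The sketch below is not a YAML parser and is not TanStack's actual workflow; the sample workflow strings and the function name are made up for illustration, and the check can produce both false positives and false negatives:&lt;/p&gt;

```python
import re

# A checkout "ref:" that points at PR-controlled code (heuristic patterns only).
PR_REF = re.compile(r"ref:.*(github\.event\.pull_request\.head|refs/pull/)")


def runs_untrusted_pr_code(workflow_text):
    """Heuristic: True if a pull_request_target workflow checks out PR code."""
    if "pull_request_target" not in workflow_text:
        return False
    if "actions/checkout" not in workflow_text:
        return False
    return PR_REF.search(workflow_text) is not None


# Illustrative workflow snippets, not taken from any real repository.
risky = """
on: pull_request_target
jobs:
  size:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          ref: refs/pull/${{ github.event.number }}/merge
      - run: pnpm install
"""

safe = """
on: pull_request_target
jobs:
  label:
    runs-on: ubuntu-latest
    steps:
      - run: echo "label only"
"""

print(runs_untrusted_pr_code(risky), runs_untrusted_pr_code(safe))
```

&lt;p&gt;A hit is a prompt to read the workflow by hand, since what matters is whether anything from the PR is then built or executed.&lt;/p&gt;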

&lt;p&gt;Using similar cache keys between test and release workflows is not unusual by itself. For example, caching a pnpm store based on the hash of &lt;code&gt;pnpm-lock.yaml&lt;/code&gt; is a common CI optimization.&lt;/p&gt;

&lt;p&gt;The problem is when a cache touched by untrusted PR code can also be restored by the release workflow. A cache is not executed just because it is restored. But if the release workflow later runs &lt;code&gt;pnpm install&lt;/code&gt; or a build step that references dependencies or binaries from the restored pnpm store, attacker-controlled code placed there can be invoked.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Using the same cache key:
  common

Letting release restore a cache created by untrusted PR code:
  should not happen
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the TanStack case, a malicious script executed from the fork PR poisoned the pnpm store. The &lt;code&gt;actions/cache&lt;/code&gt; post-job save then stored that pnpm store. Later, a release workflow triggered by a push to &lt;code&gt;main&lt;/code&gt; restored the same cache. During build, test, or cleanup work, attacker-controlled binaries were invoked, leading to OIDC token extraction and a direct publish to npm.&lt;/p&gt;

&lt;p&gt;The malicious package versions included an obfuscated JavaScript payload called &lt;code&gt;router_init.js&lt;/code&gt;, roughly 2.3 MB in size. It ran during install and collected AWS IMDS credentials, GCP metadata, Kubernetes service-account tokens, Vault tokens, npm tokens from &lt;code&gt;~/.npmrc&lt;/code&gt;, GitHub tokens, SSH private keys, and more.&lt;/p&gt;

&lt;p&gt;That explains how the TanStack release pipeline was abused. But Mini Shai-Hulud becomes more concerning when you look beyond TanStack.&lt;/p&gt;

&lt;h2&gt;
  
  
  It was not only TanStack
&lt;/h2&gt;

&lt;p&gt;The Hacker News article lists package compromises associated with TeamPCP that went beyond TanStack, including UiPath, Mistral AI-related packages, OpenSearch, and Guardrails AI across npm and PyPI.&lt;/p&gt;

&lt;p&gt;Socket also tracked additional compromised artifacts after the initial TanStack reporting, including OpenSearch, PyPI &lt;code&gt;mistralai@2.4.6&lt;/code&gt;, PyPI &lt;code&gt;guardrails-ai@0.10.1&lt;/code&gt;, and additional Squawk-related npm packages.&lt;/p&gt;

&lt;p&gt;The broader campaign can be summarized like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Mini Shai-Hulud
  + credential stealing
  + package maintainer enumeration
  + cross-ecosystem infection across npm and PyPI
  + persistence in Claude Code, VS Code, and GitHub Actions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The attacker did not only steal credentials. The malware also enumerated packages that a maintainer could publish to, then republished infected versions. A compromise in one developer machine or CI/CD environment could therefore spread into another package ecosystem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Worm behavior
&lt;/h2&gt;

&lt;p&gt;The post-install flow looks roughly like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;install of a compromised package
  -&amp;gt; router_init.js / transformers.pyz payload runs

credential theft
  - GitHub token
  - npm token
  - cloud credentials
  - SSH keys
  - CI secrets

exfiltration
  - filev2.getsession.org
  - seed1/2/3.getsession.org
  - GitHub GraphQL dead drop

self-propagation to more packages and repositories
  - enumerate maintainer packages
  - publish infected versions
  - inject workflows / persistence hooks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Stolen data was sent to Session/Oxen-related infrastructure such as &lt;code&gt;filev2.getsession.org&lt;/code&gt; and &lt;code&gt;seed1.getsession.org&lt;/code&gt;. The Hacker News describes the use of &lt;code&gt;filev2.getsession.org&lt;/code&gt; and Session Protocol infrastructure as an attempt to evade detection, since those domains may be less likely to be blocked in enterprise environments.&lt;/p&gt;

&lt;p&gt;There was also a fallback path that used stolen GitHub tokens to commit encrypted data to attacker-controlled repositories through the GitHub GraphQL API. This is essentially a dead drop: if the malware cannot send data directly to an external server, it can temporarily place the data in a GitHub repository for later retrieval. The commit author &lt;code&gt;claude@users.noreply.github.com&lt;/code&gt; is one IOC to look for in that path.&lt;/p&gt;

&lt;h2&gt;
  
  
  Persistence and lateral movement
&lt;/h2&gt;

&lt;p&gt;The concerning part is not only the credential theft that happens at install time. The reported persistence and lateral movement surface is broad.&lt;/p&gt;

&lt;p&gt;StepSecurity and Socket describe artifacts such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;.claude/settings.json&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;.claude/router_runtime.js&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;.claude/setup.mjs&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;.vscode/tasks.json&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;.vscode/setup.mjs&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;~/Library/LaunchAgents/com.user.gh-token-monitor.plist&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;~/.config/systemd/user/gh-token-monitor.service&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;.github/workflows/codeql_analysis.yml&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If hooks are installed into Claude Code or VS Code, the stealer can run again when the IDE starts. The &lt;code&gt;gh-token-monitor&lt;/code&gt; service is used to monitor and retransmit GitHub tokens.&lt;/p&gt;
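
&lt;p&gt;A minimal sketch for checking a working tree and a home directory for the artifact paths listed above. Several of these filenames (for example &lt;code&gt;.vscode/tasks.json&lt;/code&gt;) also exist in clean projects, so a hit only means the file deserves manual review. The function name is illustrative:&lt;/p&gt;

```python
from pathlib import Path

# Repository-relative artifact paths reported by StepSecurity and Socket.
REPO_ARTIFACTS = [
    ".claude/settings.json",
    ".claude/router_runtime.js",
    ".claude/setup.mjs",
    ".vscode/tasks.json",
    ".vscode/setup.mjs",
    ".github/workflows/codeql_analysis.yml",
]

# Home-relative artifact paths from the same reports.
HOME_ARTIFACTS = [
    "Library/LaunchAgents/com.user.gh-token-monitor.plist",
    ".config/systemd/user/gh-token-monitor.service",
]


def find_artifacts(root, relative_paths):
    """Return the listed paths that exist under `root`, for manual review."""
    root = Path(root)
    return [rel for rel in relative_paths if (root / rel).exists()]


if __name__ == "__main__":
    print(find_artifacts(Path.cwd(), REPO_ARTIFACTS))
    print(find_artifacts(Path.home(), HOME_ARTIFACTS))
```

&lt;p&gt;Run it from a clean machine against a mounted copy of the suspect disk where possible, rather than on the potentially infected host itself.&lt;/p&gt;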

&lt;p&gt;There are also reports of injected GitHub Actions workflows that serialize repository secrets with &lt;code&gt;toJSON(secrets)&lt;/code&gt; and send them to &lt;code&gt;api.masscan.cloud&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;On the CI/CD side, StepSecurity reported an especially important behavior. On Linux GitHub Actions runners, the malicious payload looked for the &lt;code&gt;Runner.Worker&lt;/code&gt; process and read &lt;code&gt;/proc/&amp;lt;pid&amp;gt;/mem&lt;/code&gt; to extract workflow secrets, including masked secrets. That means even secrets not explicitly referenced in the workflow YAML may be at risk if they are present in the runner process memory.&lt;/p&gt;

&lt;h2&gt;
  
  
  PyPI was also affected
&lt;/h2&gt;

&lt;p&gt;This was not only an npm incident.&lt;/p&gt;

&lt;p&gt;Socket highlights PyPI &lt;code&gt;guardrails-ai@0.10.1&lt;/code&gt; because malicious code could run on import. On Linux, it downloaded a Python artifact from &lt;code&gt;git-tanstack.com/transformers.pyz&lt;/code&gt;, wrote it to &lt;code&gt;/tmp/transformers.pyz&lt;/code&gt;, and executed it with &lt;code&gt;python3&lt;/code&gt;. Socket notes that this behavior was not present in the previous &lt;code&gt;guardrails-ai@0.10.0&lt;/code&gt; release.&lt;/p&gt;

&lt;p&gt;The Hacker News, citing Microsoft's analysis on X, also discusses &lt;code&gt;mistralai@2.4.6&lt;/code&gt;, including behavior that fetched a credential stealer from a remote server, avoided Russian-language environments, and included destructive branching for environments that appeared to be in certain regions.&lt;/p&gt;

&lt;p&gt;Watching npm lifecycle scripts is not enough. Python imports, CI installs, developer machines, and IDE hooks all matter here.&lt;/p&gt;

&lt;h2&gt;
  
  
  SLSA provenance was not enough
&lt;/h2&gt;

&lt;p&gt;One of the most important details is that the malicious packages were published through legitimate GitHub Actions OIDC trusted publishing and had valid SLSA provenance.&lt;/p&gt;

&lt;p&gt;Provenance tells you which pipeline produced an artifact. It does not prove that the pipeline was not contaminated by attacker-controlled code.&lt;/p&gt;

&lt;p&gt;In this attack, the trusted pipeline itself became the attacker's publish path. A provenance badge or Sigstore attestation alone is not enough to conclude that the artifact is safe.&lt;/p&gt;

&lt;h2&gt;
  
  
  Initial response
&lt;/h2&gt;

&lt;p&gt;If a developer machine or runner may have installed an affected version, reverting the lockfile is not enough. At the same time, this article should not be treated as a full incident-response runbook. It is safer to follow the official advisory and vendor analyses.&lt;/p&gt;

&lt;p&gt;The TanStack GitHub Advisory recommends treating affected developer machines and CI environments as compromised, rotating credentials that were accessible from the install process, checking cloud audit logs, and auditing CI pipelines.&lt;/p&gt;

&lt;p&gt;Start with these references:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/TanStack/router/security/advisories/GHSA-g7cv-rxg3-hmpx" rel="noopener noreferrer"&gt;TanStack GitHub Advisory&lt;/a&gt;: affected versions, patched versions, workaround, IOCs&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.stepsecurity.io/blog/mini-shai-hulud-is-back-a-self-spreading-supply-chain-attack-hits-the-npm-ecosystem" rel="noopener noreferrer"&gt;StepSecurity analysis&lt;/a&gt;: GitHub Actions, OIDC, SLSA provenance, secret exfiltration&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://socket.dev/blog/tanstack-npm-packages-compromised-mini-shai-hulud-supply-chain-attack" rel="noopener noreferrer"&gt;Socket analysis&lt;/a&gt;: additional affected packages, PyPI, persistence artifacts, detection notes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, the areas to check include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;isolation of affected machines and runners&lt;/li&gt;
&lt;li&gt;rotation of GitHub PATs, npm tokens, cloud credentials, Vault tokens, Kubernetes tokens, and SSH keys&lt;/li&gt;
&lt;li&gt;rotation of GitHub Actions secrets and environment secrets&lt;/li&gt;
&lt;li&gt;npm publish logs and unexpected changes in GitHub repositories&lt;/li&gt;
&lt;li&gt;persistence artifacts under &lt;code&gt;.claude/&lt;/code&gt;, &lt;code&gt;.vscode/&lt;/code&gt;, LaunchAgent, and systemd user services&lt;/li&gt;
&lt;li&gt;egress to &lt;code&gt;filev2.getsession.org&lt;/code&gt;, &lt;code&gt;seed*.getsession.org&lt;/code&gt;, and &lt;code&gt;api.masscan.cloud&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
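
&lt;p&gt;As a rough illustration of the egress check, the sketch below matches hostnames from a DNS or proxy log against the three host patterns listed above. It assumes the &lt;code&gt;seed*&lt;/code&gt; entry can be treated as a shell-style glob; the function names are made up for this example, and the authoritative IOC list is the one in the advisory and vendor write-ups.&lt;/p&gt;

```python
from fnmatch import fnmatch
from typing import Optional

# Host patterns copied from the checklist above; "seed*" is treated as a glob.
IOC_HOST_PATTERNS = [
    "filev2.getsession.org",
    "seed*.getsession.org",
    "api.masscan.cloud",
]


def matching_ioc(hostname: str) -> Optional[str]:
    """Return the first IOC pattern the hostname matches, or None."""
    # Normalize case and drop a trailing dot from fully qualified names.
    normalized = hostname.strip().rstrip(".").lower()
    for pattern in IOC_HOST_PATTERNS:
        if fnmatch(normalized, pattern):
            return pattern
    return None


def flag_hosts(hostnames: list[str]) -> dict[str, str]:
    """Map each suspicious hostname to the pattern that flagged it."""
    flagged = {}
    for host in hostnames:
        pattern = matching_ioc(host)
        if pattern is not None:
            flagged[host] = pattern
    return flagged
```

&lt;p&gt;A hit here is only a signal that a machine needs a closer look. A clean result does not prove the machine is unaffected.&lt;/p&gt;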

&lt;p&gt;StepSecurity also warns about npm tokens with the description &lt;code&gt;IfYouRevokeThisTokenItWillWipeTheComputerOfTheOwner&lt;/code&gt;. Because that may indicate destructive behavior, token revocation should be handled from a clean machine and according to the organization's incident-response process, not casually from a potentially infected host.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prevention lessons
&lt;/h2&gt;

&lt;p&gt;The lesson is not just "be careful with &lt;code&gt;pull_request_target&lt;/code&gt;."&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;do not share cache between untrusted PR workflows and release pipelines&lt;/li&gt;
&lt;li&gt;do not check out and execute untrusted code in &lt;code&gt;pull_request_target&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;grant &lt;code&gt;id-token: write&lt;/code&gt; only to the publish job&lt;/li&gt;
&lt;li&gt;explicitly set &lt;code&gt;permissions: id-token: none&lt;/code&gt; elsewhere&lt;/li&gt;
&lt;li&gt;separate release workflows from normal test workflows&lt;/li&gt;
&lt;li&gt;pin third-party actions by commit SHA instead of tags&lt;/li&gt;
&lt;li&gt;avoid leaving secrets on self-hosted or long-lived runners&lt;/li&gt;
&lt;li&gt;enforce lockfiles and frozen installs&lt;/li&gt;
&lt;li&gt;add a minimum release age for dependency updates&lt;/li&gt;
&lt;/ul&gt;
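
&lt;p&gt;Some of these rules can be checked mechanically. As one hedged example, the sketch below scans workflow text for &lt;code&gt;uses:&lt;/code&gt; references that are not pinned to a full 40-character commit SHA. It is a plain regex pass rather than a YAML parser, it ignores quoted values, and the function name is hypothetical.&lt;/p&gt;

```python
import re

# Matches "uses: owner/repo@ref" lines. Local "./" actions have no "@ref",
# so they are never matched.
USES_LINE = re.compile(r"^\s*-?\s*uses:\s*([^\s@#]+)@([^\s#]+)", re.MULTILINE)
FULL_SHA = re.compile(r"^[0-9a-f]{40}$")


def unpinned_actions(workflow_text: str) -> list[str]:
    """Return "action@ref" entries whose ref is not a full commit SHA."""
    findings = []
    for action, ref in USES_LINE.findall(workflow_text):
        if not FULL_SHA.match(ref):
            findings.append(f"{action}@{ref}")
    return findings
```

&lt;p&gt;Running something like this over every file in &lt;code&gt;.github/workflows/&lt;/code&gt; gives a quick list of tag-pinned actions to review.&lt;/p&gt;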

&lt;p&gt;With pnpm, &lt;code&gt;minimumReleaseAge&lt;/code&gt; can be set in &lt;code&gt;pnpm-workspace.yaml&lt;/code&gt;. The value is in minutes, so a 7-day delay is 7 × 24 × 60 = 10080. In pnpm 11, &lt;code&gt;minimumReleaseAgeStrict&lt;/code&gt; can also be set when you want stricter behavior.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;minimumReleaseAge&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10080&lt;/span&gt;
&lt;span class="na"&gt;minimumReleaseAgeStrict&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is not a complete defense. It will not magically clean a malicious version already in your lockfile, and it will not protect you if you explicitly install a malicious version. But it can reduce the chance of immediately pulling a newly published malicious release.&lt;/p&gt;
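
&lt;p&gt;To make the idea concrete, here is a minimal sketch of what a minimum release age check amounts to. This is not pnpm's implementation, only the arithmetic behind the setting; the function and constant names are invented for illustration.&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

MINUTES_PER_DAY = 24 * 60
MINIMUM_RELEASE_AGE_MINUTES = 7 * MINUTES_PER_DAY  # 7 days = 10080 minutes


def is_old_enough(published_at: datetime, now: datetime,
                  minimum_minutes: int = MINIMUM_RELEASE_AGE_MINUTES) -> bool:
    """True when the release has been public for at least minimum_minutes."""
    return now - published_at >= timedelta(minutes=minimum_minutes)


if __name__ == "__main__":
    now = datetime(2026, 5, 20, tzinfo=timezone.utc)
    fresh = datetime(2026, 5, 19, tzinfo=timezone.utc)   # 1 day old
    settled = datetime(2026, 5, 12, tzinfo=timezone.utc)  # 8 days old
    print(is_old_enough(fresh, now), is_old_enough(settled, now))
```

&lt;p&gt;The delay buys time for a malicious release to be reported and unpublished before a routine update picks it up.&lt;/p&gt;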

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;If you explain the TanStack incident only as a &lt;code&gt;pull_request_target&lt;/code&gt; mistake, it sounds smaller than it was.&lt;/p&gt;

&lt;p&gt;The broader picture is a self-propagating worm that crossed CI/CD, caches, OIDC, npm trusted publishing, IDE hooks, GitHub Actions secrets, and PyPI. The attacker did not merely compromise packages. They used developer and CI environments as stepping stones to reach the next maintainer and the next package.&lt;/p&gt;

&lt;p&gt;The right mental model is not just "dependency package compromise." It is &lt;strong&gt;developer environment and CI/CD compromise&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://thehackernews.com/2026/05/mini-shai-hulud-worm-compromises.html?m=1" rel="noopener noreferrer"&gt;Mini Shai-Hulud Worm Compromises TanStack, Mistral AI, Guardrails AI &amp;amp; More Packages - The Hacker News&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://tanstack.com/blog/npm-supply-chain-compromise-postmortem" rel="noopener noreferrer"&gt;Postmortem: TanStack NPM supply-chain compromise - TanStack&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/TanStack/router/security/advisories/GHSA-g7cv-rxg3-hmpx" rel="noopener noreferrer"&gt;Malware in 42 @tanstack/* packages exfiltrates cloud credentials, GitHub tokens, and SSH keys - GitHub Advisory&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.stepsecurity.io/blog/mini-shai-hulud-is-back-a-self-spreading-supply-chain-attack-hits-the-npm-ecosystem" rel="noopener noreferrer"&gt;TeamPCP's Mini Shai-Hulud Is Back - StepSecurity&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://socket.dev/blog/tanstack-npm-packages-compromised-mini-shai-hulud-supply-chain-attack" rel="noopener noreferrer"&gt;TanStack npm Packages Compromised in Ongoing Mini Shai-Hulud Supply-Chain Attack - Socket&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://news.ycombinator.com/item?id=48100706" rel="noopener noreferrer"&gt;Postmortem: TanStack NPM supply-chain compromise - Hacker News&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pnpm.io/settings#minimumreleaseage" rel="noopener noreferrer"&gt;Settings: minimumReleaseAge - pnpm&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>security</category>
      <category>npm</category>
      <category>pypi</category>
      <category>githubactions</category>
    </item>
    <item>
      <title>Building a Home Personal Assistant with Claude Managed Agents</title>
      <dc:creator>Teruo Kunihiro</dc:creator>
      <pubDate>Mon, 13 Apr 2026 07:46:16 +0000</pubDate>
      <link>https://dev.to/trknhr/building-a-home-personal-assistant-with-claude-managed-agents-5a8f</link>
      <guid>https://dev.to/trknhr/building-a-home-personal-assistant-with-claude-managed-agents-5a8f</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Claude Managed Agents was just announced, so I tried using it to build a personal assistant for household tasks.&lt;/p&gt;

&lt;p&gt;What I wanted was pretty simple: an AI I can call from Slack that can handle family notes, tasks, reminders, and schedules without too much ceremony. Things like birthdays, what gifts I bought last year, school handouts, grocery co-op deadlines, and small day-to-day household tasks.&lt;/p&gt;

&lt;p&gt;My first impression was very positive. Claude Managed Agents solves a lot of the annoying parts up front:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I do not have to host the execution environment myself&lt;/li&gt;
&lt;li&gt;Vaults and sandboxes are built in from the start&lt;/li&gt;
&lt;li&gt;MCP and custom tools make it easier to build a safer architecture&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That said, it does not eliminate the need for surrounding application code. I still needed a Slack event endpoint, persistent task state, and scheduled execution. In the end, I landed on an architecture centered on Claude Managed Agents, with &lt;code&gt;Lambda + DynamoDB + EventBridge Scheduler&lt;/code&gt; around it.&lt;/p&gt;

&lt;p&gt;Here is what the app looks like.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh7xqb1nb6rndum75152m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh7xqb1nb6rndum75152m.png" alt=" " width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  What I wanted to build
&lt;/h2&gt;

&lt;p&gt;These were the rough requirements:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Trigger the AI from Slack mentions for household tasks&lt;/li&gt;
&lt;li&gt;Let the AI take notes and transcribe things&lt;/li&gt;
&lt;li&gt;Connect with Google Calendar and Drive so important things are not missed&lt;/li&gt;
&lt;li&gt;Have the AI send a daily reminder about household tasks&lt;/li&gt;
&lt;li&gt;Let me send rough notes about finished tasks or recurring events and have the AI remember them in a useful way&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So far, items 1, 2, 4, and 5 are the ones actually working. Calendar and Drive integration (item 3) comes next.&lt;/p&gt;
&lt;h2&gt;
  
  
  Quickstart was genuinely useful
&lt;/h2&gt;

&lt;p&gt;I started from the Claude Console Quickstart:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.claude.com/workspaces/default/agents/" rel="noopener noreferrer"&gt;https://platform.claude.com/workspaces/default/agents/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is a good way to get an initial agent configuration in place. You can shape the setup through conversation instead of writing everything from scratch. Japanese IME input still felt a little awkward, since pressing Enter could send the message too early, but overall it was fast enough to be useful.&lt;/p&gt;
&lt;h3&gt;
  
  
  Slack MCP
&lt;/h3&gt;

&lt;p&gt;On the Slack side, I created a bot account and added the scopes I needed. The main ones ended up being:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;app_mentions:read&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;chat:write&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;files:read&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Slack MCP lives on the Managed Agent side, but actual event ingestion and attachment retrieval are handled by Lambda. In practice, that split felt better than trying to force everything through MCP alone.&lt;/p&gt;
&lt;h3&gt;
  
  
  Sandbox
&lt;/h3&gt;

&lt;p&gt;Claude Managed Agents also gives you a managed execution environment. In this project I used a sandbox configured for Slack MCP calls and custom tool usage.&lt;/p&gt;

&lt;p&gt;I did &lt;strong&gt;not&lt;/strong&gt; let the agent touch DynamoDB directly. Instead, DynamoDB access goes through custom tools, and Lambda performs the actual reads and writes. That keeps the permission boundary clear and makes the update rules easier to control from the application side.&lt;/p&gt;

&lt;p&gt;In Anthropic's docs, this execution environment is modeled as an &lt;code&gt;Environment&lt;/code&gt;. An Environment is basically the container configuration where the agent runs. You create it once and refer to it by ID. Multiple sessions can reuse the same Environment definition, but each session gets its own isolated container instance, and filesystem state is not shared across sessions. In other words, configuration is reusable, but runtime state is isolated per session.&lt;/p&gt;

&lt;p&gt;References:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.claude.com/docs/en/managed-agents/overview" rel="noopener noreferrer"&gt;https://platform.claude.com/docs/en/managed-agents/overview&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.claude.com/docs/en/managed-agents/environments" rel="noopener noreferrer"&gt;https://platform.claude.com/docs/en/managed-agents/environments&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That matters a lot. Even for a personal or family assistant, it means each run starts from a clean, isolated environment instead of inheriting leftovers from the previous run. Network settings are also part of the Environment, and Anthropic recommends using &lt;code&gt;limited&lt;/code&gt; networking with explicit &lt;code&gt;allowed_hosts&lt;/code&gt; for production. So the sandbox is not just “a safe box for Claude.” It is the unit that bundles isolation, dependency setup, and network permissions together.&lt;/p&gt;
&lt;h3&gt;
  
  
  Vault
&lt;/h3&gt;

&lt;p&gt;I stored the Slack MCP credentials in a Vault. Not having to place raw credentials directly into the agent configuration is a big win.&lt;/p&gt;

&lt;p&gt;The value of Vaults is pretty clear in Anthropic's docs. Vaults and credentials are treated as reusable authentication primitives that you register once and reference by ID. That means you do not need to run your own secret store for this part, pass tokens around on every request, or lose track of which credentials a session is using.&lt;/p&gt;

&lt;p&gt;Reference:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.claude.com/docs/en/managed-agents/vaults" rel="noopener noreferrer"&gt;https://platform.claude.com/docs/en/managed-agents/vaults&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Another important point is that MCP server definitions and authentication are separated. When you create the agent, you declare which MCP servers it can connect to. When you create a session, you pass &lt;code&gt;vault_ids&lt;/code&gt; to resolve authentication. Anthropic explicitly calls out that this separation keeps secrets out of reusable agent definitions while still letting each session authenticate with different credentials if needed. For a setup like this, where Slack MCP exists alongside application-managed Slack event handling, that split is very helpful.&lt;/p&gt;

&lt;p&gt;Reference:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.claude.com/docs/en/managed-agents/mcp-connector" rel="noopener noreferrer"&gt;https://platform.claude.com/docs/en/managed-agents/mcp-connector&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  I still needed regular application code
&lt;/h2&gt;

&lt;p&gt;At first I thought Managed Agents might cover most of it. In practice, I still needed surrounding application code for three reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;an HTTP endpoint for the Slack Events API&lt;/li&gt;
&lt;li&gt;asynchronous processing to stay within Slack’s 3-second response limit&lt;/li&gt;
&lt;li&gt;application state such as memory, tasks, sessions, and idempotency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So the architecture ended up looking like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Slack mention
  -&amp;gt; API Gateway
  -&amp;gt; Lambda (ingress)
  -&amp;gt; SQS
  -&amp;gt; Lambda (worker)
  -&amp;gt; Claude Managed Agent
  -&amp;gt; Slack reply

Daily reminder
  -&amp;gt; EventBridge Scheduler
  -&amp;gt; Lambda (scheduled runner)
  -&amp;gt; Claude Managed Agent
  -&amp;gt; Slack post

State
  -&amp;gt; DynamoDB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Slack mentions flow through &lt;code&gt;ingress Lambda -&amp;gt; SQS -&amp;gt; worker Lambda&lt;/code&gt;. Slack gets an immediate ACK, and the Claude interaction happens asynchronously in the background.&lt;/p&gt;
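
&lt;p&gt;A minimal sketch of the ingress side, to show why the ACK stays fast. It relies on the Slack Events API payload fields &lt;code&gt;type&lt;/code&gt;, &lt;code&gt;challenge&lt;/code&gt;, and &lt;code&gt;event&lt;/code&gt;, and returns an API Gateway proxy-style response. The &lt;code&gt;enqueue&lt;/code&gt; callback is a hypothetical stand-in for the SQS send, and request signature checks are left out for brevity.&lt;/p&gt;

```python
import json
from typing import Callable


def handle_slack_request(raw_body: str, enqueue: Callable[[str], None]) -> dict:
    """Acknowledge a Slack Events API request quickly and defer the real work.

    `enqueue` is a hypothetical callback; in the setup described above it
    would wrap an SQS send so the worker Lambda can talk to Claude later.
    """
    payload = json.loads(raw_body)

    # Slack sends this once when the Request URL is first configured.
    if payload.get("type") == "url_verification":
        return {"statusCode": 200, "body": payload["challenge"]}

    event = payload.get("event", {})
    if payload.get("type") == "event_callback" and event.get("type") == "app_mention":
        enqueue(raw_body)  # no Claude call here, so the ACK stays fast

    # Everything else is acknowledged and ignored.
    return {"statusCode": 200, "body": ""}
```

&lt;p&gt;Because the handler only parses JSON and hands the body off, its latency does not depend on how long the Claude session takes.&lt;/p&gt;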

&lt;p&gt;The daily reminder is triggered by EventBridge Scheduler. Right now it runs every day at &lt;code&gt;09:00 JST&lt;/code&gt; and posts a reminder for unfinished tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  What gets stored where
&lt;/h2&gt;

&lt;p&gt;This setup currently uses seven DynamoDB tables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;SlackThreadSessionsTable&lt;/code&gt;: mapping between Slack threads and Claude sessions&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ProcessedEventsTable&lt;/code&gt;: Slack event deduplication&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ScheduledTasksTable&lt;/code&gt;: scheduled task definitions&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;UserMemoriesTable&lt;/code&gt;: mapping to Claude memory stores&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MemoryItemsTable&lt;/code&gt;: semi-structured memory persisted through custom tools&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;TasksTable&lt;/code&gt;: current task state&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;TaskEventsTable&lt;/code&gt;: task history&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;MemoryItemsTable&lt;/code&gt; and &lt;code&gt;TasksTable&lt;/code&gt; / &lt;code&gt;TaskEventsTable&lt;/code&gt; are the important ones here.&lt;/p&gt;

&lt;p&gt;For household use, the data I actually care about looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;whose birthday it is&lt;/li&gt;
&lt;li&gt;what I gave them last year&lt;/li&gt;
&lt;li&gt;what tasks are still unfinished&lt;/li&gt;
&lt;li&gt;whether a task is already done&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That kind of information is easier to manage if it lives in DynamoDB as the source of truth, with Claude pulling it through tools only when needed. That is the approach I took.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using custom tools for memory and tasks
&lt;/h2&gt;

&lt;p&gt;I ended up defining these five tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;search_memories&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;save_memory&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;list_tasks&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;upsert_task&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;mark_task_done&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When the Managed Agent calls one of them, it emits &lt;code&gt;agent.custom_tool_use&lt;/code&gt;. Lambda receives that request, updates DynamoDB, and returns the result via &lt;code&gt;user.custom_tool_result&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I like this pattern a lot. The agent never needs direct DynamoDB IAM permissions, which makes the boundary safer and gives the application control over how updates are applied.&lt;/p&gt;

&lt;p&gt;I verified the flow end to end:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;save_memory&lt;/code&gt; stored “Hanako’s birthday is 8/12”&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;upsert_task&lt;/code&gt; created a task for buying a birthday gift&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;mark_task_done&lt;/code&gt; updated that task to done&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;TaskEventsTable&lt;/code&gt; recorded &lt;code&gt;created&lt;/code&gt; and &lt;code&gt;marked_done&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
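
&lt;p&gt;The round trip above can be sketched as a small dispatcher. This is an illustration under assumptions, not the actual Lambda code: in-memory dictionaries stand in for &lt;code&gt;TasksTable&lt;/code&gt; and &lt;code&gt;TaskEventsTable&lt;/code&gt;, and the event field names (&lt;code&gt;id&lt;/code&gt;, &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;input&lt;/code&gt;, &lt;code&gt;tool_use_id&lt;/code&gt;, &lt;code&gt;result&lt;/code&gt;) are guesses that should be checked against the Managed Agents events documentation.&lt;/p&gt;

```python
from typing import Any, Callable

# In-memory stand-ins for TasksTable and TaskEventsTable (assumption: the
# real handlers would read and write DynamoDB instead).
TASKS: dict[str, dict[str, Any]] = {}
TASK_EVENTS: list[dict[str, Any]] = []


def upsert_task(args: dict[str, Any]) -> dict[str, Any]:
    task_id = args["task_id"]
    existing = TASKS.get(task_id)
    status = existing["status"] if existing else "open"
    TASKS[task_id] = {"title": args["title"], "status": status}
    if existing is None:
        TASK_EVENTS.append({"task_id": task_id, "type": "created"})
    return {"task_id": task_id, "status": status}


def mark_task_done(args: dict[str, Any]) -> dict[str, Any]:
    task_id = args["task_id"]
    if task_id not in TASKS:
        return {"error": f"unknown task: {task_id}"}
    TASKS[task_id]["status"] = "done"
    TASK_EVENTS.append({"task_id": task_id, "type": "marked_done"})
    return {"task_id": task_id, "status": "done"}


TOOL_HANDLERS: dict[str, Callable[[dict[str, Any]], dict[str, Any]]] = {
    "upsert_task": upsert_task,
    "mark_task_done": mark_task_done,
}


def handle_custom_tool_use(event: dict[str, Any]) -> dict[str, Any]:
    """Turn a custom tool request into a result payload (field names assumed)."""
    handler = TOOL_HANDLERS.get(event["name"])
    if handler is None:
        result = {"error": f"unknown tool: {event['name']}"}
    else:
        result = handler(event.get("input", {}))
    return {"type": "user.custom_tool_result", "tool_use_id": event["id"], "result": result}
```

&lt;p&gt;The agent only ever sees tool names and JSON results. Which table is touched, and under what rules, stays an application decision.&lt;/p&gt;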

&lt;h2&gt;
  
  
  Slack mentions work naturally
&lt;/h2&gt;

&lt;p&gt;When I mention &lt;code&gt;@AI&lt;/code&gt; in Slack, the conversation continues in the same thread.&lt;/p&gt;

&lt;p&gt;What made this feel right was treating &lt;code&gt;Slack thread = Claude session&lt;/code&gt;. That aligns the Slack UX with the conversation context in a very natural way.&lt;/p&gt;
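
&lt;p&gt;A sketch of that mapping, assuming the standard Slack event fields &lt;code&gt;channel&lt;/code&gt;, &lt;code&gt;ts&lt;/code&gt;, and &lt;code&gt;thread_ts&lt;/code&gt;. The dictionary stands in for &lt;code&gt;SlackThreadSessionsTable&lt;/code&gt;, and &lt;code&gt;create_session&lt;/code&gt; is a hypothetical callback that would start a new Claude session and return its ID.&lt;/p&gt;

```python
from typing import Callable

# Stand-in for SlackThreadSessionsTable: thread key -> Claude session ID.
THREAD_SESSIONS: dict[str, str] = {}


def thread_key(event: dict) -> str:
    """Replies carry thread_ts; a top-level mention starts a thread at its own ts."""
    return f"{event['channel']}:{event.get('thread_ts') or event['ts']}"


def session_for(event: dict, create_session: Callable[[], str]) -> str:
    """Reuse the session bound to this thread, creating one on first contact."""
    key = thread_key(event)
    if key not in THREAD_SESSIONS:
        THREAD_SESSIONS[key] = create_session()
    return THREAD_SESSIONS[key]
```

&lt;p&gt;Replying in the thread with the same key is what keeps follow-up mentions in the same conversation context.&lt;/p&gt;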

&lt;p&gt;I also added attachment handling on the Lambda side. With &lt;code&gt;files:read&lt;/code&gt;, Lambda can fetch PDFs or images from Slack’s &lt;code&gt;url_private&lt;/code&gt; endpoints and pass them to Claude as &lt;code&gt;document&lt;/code&gt; or &lt;code&gt;image&lt;/code&gt; blocks.&lt;/p&gt;

&lt;p&gt;That makes flows like this possible:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;upload a school or daycare PDF&lt;/li&gt;
&lt;li&gt;let the AI read it&lt;/li&gt;
&lt;li&gt;extract tasks if needed&lt;/li&gt;
&lt;li&gt;save important details into memory&lt;/li&gt;
&lt;/ul&gt;
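
&lt;p&gt;The attachment step might look roughly like this. The sketch assumes the file bytes have already been downloaded from &lt;code&gt;url_private&lt;/code&gt; with the bot token, uses the &lt;code&gt;mimetype&lt;/code&gt; field from Slack's file object, and builds base64 content blocks in the shape used by Claude's Messages API. Whether a Managed Agents session accepts exactly this shape is an assumption to verify against the docs.&lt;/p&gt;

```python
import base64
from typing import Optional


def to_content_block(mimetype: str, data: bytes) -> Optional[dict]:
    """Map a downloaded Slack file to a document or image block, else None."""
    if mimetype == "application/pdf":
        kind = "document"
    elif mimetype.startswith("image/"):
        kind = "image"
    else:
        return None  # unsupported attachment types are skipped

    return {
        "type": kind,
        "source": {
            "type": "base64",
            "media_type": mimetype,
            "data": base64.b64encode(data).decode("ascii"),
        },
    }
```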

&lt;h2&gt;
  
  
  Daily reminders also worked well
&lt;/h2&gt;

&lt;p&gt;For scheduled execution, I used EventBridge Scheduler rather than the older CloudWatch Events-style rules.&lt;/p&gt;

&lt;p&gt;The current setup stores a &lt;code&gt;daily-summary&lt;/code&gt; task definition in DynamoDB. Every morning at 9 AM, the scheduled runner loads that definition, starts Claude, calls &lt;code&gt;list_tasks&lt;/code&gt; to fetch unfinished tasks, and posts a short reminder to Slack.&lt;/p&gt;

&lt;p&gt;What I like about this is that the reminder is not a fixed template. Claude can shape the wording based on the unfinished tasks in DynamoDB.&lt;/p&gt;
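
&lt;p&gt;For reference, a 09:00 JST schedule could be expressed with the parameters below, and the helper shows which UTC instant that corresponds to. The parameter names follow EventBridge Scheduler's &lt;code&gt;CreateSchedule&lt;/code&gt; API; the values are my reading of the setup described above rather than the actual configuration, and &lt;code&gt;next_reminder&lt;/code&gt; is a made-up helper.&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

JST = timezone(timedelta(hours=9), "JST")  # Japan has no daylight saving time

# Assumed schedule parameters: every day at 09:00 in the Asia/Tokyo time zone.
SCHEDULE = {
    "ScheduleExpression": "cron(0 9 * * ? *)",
    "ScheduleExpressionTimezone": "Asia/Tokyo",
}


def next_reminder(now: datetime) -> datetime:
    """Return the next 09:00 JST strictly after `now`, as an aware datetime."""
    local = now.astimezone(JST)
    candidate = local.replace(hour=9, minute=0, second=0, microsecond=0)
    if local >= candidate:
        candidate += timedelta(days=1)
    return candidate
```

&lt;p&gt;Setting the time zone on the schedule avoids hand-converting 09:00 JST to 00:00 UTC in the cron expression.&lt;/p&gt;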

&lt;h2&gt;
  
  
  Letting it read PDFs and remember things is surprisingly good
&lt;/h2&gt;

&lt;p&gt;This turned out to be one of the most promising parts for household use.&lt;/p&gt;

&lt;p&gt;If I can just upload a PDF to Slack and say &lt;code&gt;@AI take a look at this&lt;/code&gt;, the system can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;extract dates&lt;/li&gt;
&lt;li&gt;turn them into tasks&lt;/li&gt;
&lt;li&gt;save names or events into memory&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is exactly the kind of workflow that matters for family operations, where the problem is usually not a lack of information but forgetting things at the wrong time.&lt;/p&gt;

&lt;p&gt;In that sense, &lt;code&gt;save_memory&lt;/code&gt; and &lt;code&gt;search_memories&lt;/code&gt; seem especially useful.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pricing
&lt;/h2&gt;

&lt;p&gt;Cost is obviously a concern.&lt;/p&gt;

&lt;p&gt;According to Anthropic’s pricing page, the model I am using here, &lt;code&gt;Claude Sonnet 4.6&lt;/code&gt;, and the Managed Agents session runtime are priced at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Input: &lt;code&gt;$3 / MTok&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Output: &lt;code&gt;$15 / MTok&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Session runtime: &lt;code&gt;$0.08 / session-hour&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Reference:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.claude.com/docs/en/about-claude/pricing" rel="noopener noreferrer"&gt;https://platform.claude.com/docs/en/about-claude/pricing&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For household use, a rough estimate still puts this in a pretty reasonable range, around &lt;code&gt;$10/month&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I used these assumptions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;5 Slack mentions per day&lt;/li&gt;
&lt;li&gt;1 daily reminder per day&lt;/li&gt;
&lt;li&gt;per mention: 12k input tokens / 1.2k output tokens / 20 seconds runtime&lt;/li&gt;
&lt;li&gt;per reminder: 15k input tokens / 1.5k output tokens / 15 seconds runtime&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That gives roughly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;mentions: &lt;code&gt;about $8.4 / month&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;reminders: &lt;code&gt;about $2.1 / month&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;total: &lt;code&gt;about $10.5 / month&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
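
&lt;p&gt;The arithmetic behind those figures can be reproduced with a few lines. The prices and per-run assumptions are the ones listed above. The 31-day month is my own assumption, chosen because it reproduces the rounded totals; a 30-day month comes out slightly lower.&lt;/p&gt;

```python
INPUT_USD_PER_MTOK = 3.00
OUTPUT_USD_PER_MTOK = 15.00
RUNTIME_USD_PER_SESSION_HOUR = 0.08
DAYS_PER_MONTH = 31  # assumption: reproduces the rounded figures above


def cost_per_run(input_tokens: int, output_tokens: int, runtime_seconds: float) -> float:
    """Token cost plus session runtime cost for a single agent run, in USD."""
    tokens = (input_tokens * INPUT_USD_PER_MTOK
              + output_tokens * OUTPUT_USD_PER_MTOK) / 1_000_000
    runtime = runtime_seconds / 3600 * RUNTIME_USD_PER_SESSION_HOUR
    return tokens + runtime


mentions = 5 * DAYS_PER_MONTH * cost_per_run(12_000, 1_200, 20)
reminders = 1 * DAYS_PER_MONTH * cost_per_run(15_000, 1_500, 15)

print(f"mentions:  ${mentions:.2f}")              # mentions:  $8.44
print(f"reminders: ${reminders:.2f}")             # reminders: $2.10
print(f"total:     ${mentions + reminders:.2f}")  # total:     $10.54
```

&lt;p&gt;Session runtime contributes well under a cent per run at these durations, so the estimate is dominated by input and output tokens.&lt;/p&gt;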

&lt;p&gt;This will go up quickly if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;you read a lot of long PDFs&lt;/li&gt;
&lt;li&gt;you use web search or extra tools heavily&lt;/li&gt;
&lt;li&gt;conversations get long and context keeps expanding&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Still, for a personal household assistant with a small number of daily interactions, AWS costs are likely minor compared to Claude token costs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Things that were tricky
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Slack Events configuration
&lt;/h3&gt;

&lt;p&gt;At first, I had the classic problem where the &lt;code&gt;Request URL&lt;/code&gt; was verified but no events were arriving. In the end, I had to carefully make sure that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Event Subscriptions were enabled&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;app_mention&lt;/code&gt; was added&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;files:read&lt;/code&gt; was added&lt;/li&gt;
&lt;li&gt;the Slack app was reinstalled after changing scopes&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Splitting responsibility between Slack MCP and Lambda
&lt;/h3&gt;

&lt;p&gt;Slack MCP is useful, but once you need external event ingestion, attachment handling, threaded replies, and idempotency, it is easier to keep Slack input/output under application control.&lt;/p&gt;

&lt;p&gt;The split that worked best here was:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lambda handles input and delivery&lt;/li&gt;
&lt;li&gt;Managed Agent handles reasoning and tool usage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That division felt clean.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do not start with fully automatic memory saving
&lt;/h3&gt;

&lt;p&gt;This is more of an operational lesson than a technical one. Memory gets messy fast. Birthdays and gift history are good durable facts, but if you save every temporary request automatically, the memory store becomes noisy very quickly.&lt;/p&gt;

&lt;p&gt;For now, I prefer having an explicit &lt;code&gt;save_memory&lt;/code&gt; entry point. The agent can decide what looks durable, but the application still controls how it is persisted.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I want to do next
&lt;/h2&gt;

&lt;p&gt;These are the next things I want to add:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;register events in Google Calendar and link the returned event IDs to tasks&lt;/li&gt;
&lt;li&gt;read Google Drive documents and turn them into tasks or memories&lt;/li&gt;
&lt;li&gt;run weekly summaries of completed tasks&lt;/li&gt;
&lt;li&gt;add reminders like “a birthday is coming up” or “the co-op deadline is close”&lt;/li&gt;
&lt;li&gt;refine the memory persistence policy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Calendar integration feels especially important. The shape I want is: Claude registers something in Calendar, returns structured JSON, and the application syncs that result into DynamoDB task state.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing thoughts
&lt;/h2&gt;

&lt;p&gt;I came away with a very good impression.&lt;/p&gt;

&lt;p&gt;The managed aspect matters a lot. Availability, execution environments, credentials, and permission boundaries are all expensive to get right on your own. Claude Managed Agents makes that much easier to control.&lt;/p&gt;

&lt;p&gt;The pattern that currently feels best to me is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reasoning and sandboxing in Managed Agents&lt;/li&gt;
&lt;li&gt;webhooks, state, and integration glue in Lambda&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That split worked well for a household assistant too. At this point I can already see a path where I throw rough notes into Slack and get “remember this,” “remind me later,” and “what is still unfinished?” out of the same system.&lt;/p&gt;

&lt;p&gt;The next step is to connect Calendar and Drive and see how far this can go in real day-to-day use.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Claude Managed Agents Quickstart: &lt;a href="https://platform.claude.com/docs/en/managed-agents/quickstart" rel="noopener noreferrer"&gt;https://platform.claude.com/docs/en/managed-agents/quickstart&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Claude Managed Agents Environments: &lt;a href="https://platform.claude.com/docs/en/managed-agents/environments" rel="noopener noreferrer"&gt;https://platform.claude.com/docs/en/managed-agents/environments&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Claude Managed Agents Events and Streaming: &lt;a href="https://platform.claude.com/docs/en/managed-agents/events-and-streaming" rel="noopener noreferrer"&gt;https://platform.claude.com/docs/en/managed-agents/events-and-streaming&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Claude Managed Agents Memory: &lt;a href="https://platform.claude.com/docs/en/managed-agents/memory" rel="noopener noreferrer"&gt;https://platform.claude.com/docs/en/managed-agents/memory&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Claude Managed Agents Vaults: &lt;a href="https://platform.claude.com/docs/en/managed-agents/vaults" rel="noopener noreferrer"&gt;https://platform.claude.com/docs/en/managed-agents/vaults&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Claude Managed Agents MCP Connector: &lt;a href="https://platform.claude.com/docs/en/managed-agents/mcp-connector" rel="noopener noreferrer"&gt;https://platform.claude.com/docs/en/managed-agents/mcp-connector&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Claude Pricing: &lt;a href="https://platform.claude.com/docs/en/about-claude/pricing" rel="noopener noreferrer"&gt;https://platform.claude.com/docs/en/about-claude/pricing&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>automation</category>
      <category>claude</category>
    </item>
    <item>
      <title>Semver in Retrograde</title>
      <dc:creator>Teruo Kunihiro</dc:creator>
      <pubDate>Wed, 08 Apr 2026 15:02:07 +0000</pubDate>
      <link>https://dev.to/trknhr/semver-in-retrograde-1oj3</link>
      <guid>https://dev.to/trknhr/semver-in-retrograde-1oj3</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/aprilfools-2026"&gt;DEV April Fools Challenge&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I built
&lt;/h2&gt;

&lt;p&gt;I built a dependency analysis tool that delivers executive-grade reports about your project's emotional state.&lt;br&gt;
It just happens to be astrology. I call it Semver in Retrograde.&lt;/p&gt;

&lt;p&gt;You paste a &lt;code&gt;package.json&lt;/code&gt;, click "Analyze my dependency aura", and get a straight-faced executive report about the project's emotional state. It gives you Aura Stability, Chaos Index, Peer Dependency Tension, Mercury Status, the dependency Big 3, a prophecy, a lucky command, and a share card that looks ready for an internal quarterly review.&lt;/p&gt;

&lt;p&gt;That contrast is the joke. The interface looks like a serious dashboard. The output is dependency mysticism delivered in the tone of an operations meeting.&lt;/p&gt;

&lt;p&gt;I also added one feature that makes me disproportionately happy: if you paste something that looks like &lt;code&gt;requirements.txt&lt;/code&gt; or a &lt;code&gt;Gemfile&lt;/code&gt;, the app returns &lt;strong&gt;418 I'm a teapot&lt;/strong&gt;. Wrong ecosystem, wrong beverage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F65tywvg645npj5xc4sjs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F65tywvg645npj5xc4sjs.png" alt=" " width="800" height="121"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;Live demo: &lt;a href="https://semver-in-retrograde.vercel.app/" rel="noopener noreferrer"&gt;https://semver-in-retrograde.vercel.app/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Repo: &lt;a href="https://github.com/trknhr/semver-in-retrograde" rel="noopener noreferrer"&gt;trknhr/semver-in-retrograde&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One practical note: the public deployment does not call Gemini in production. I turned that off to keep the joke within budget, so the hosted version runs in a fixed "budget committee safe mode" for the narrative copy. The full Gemini path is what I used in local development and in the eval run.&lt;/p&gt;

&lt;p&gt;This is the demo flow I used:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Paste a &lt;code&gt;package.json&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Click "Analyze my dependency aura"&lt;/li&gt;
&lt;li&gt;Watch the dashboard appear like it's about to audit your org&lt;/li&gt;
&lt;li&gt;Then realize it's talking about your emotional instability&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;

&lt;p&gt;The code is here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/trknhr/semver-in-retrograde" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The app has a clean split. Local code parses and scores the manifest. Gemini writes the executive reading. So the same manifest always produces the same numbers, while the model handles the polished nonsense.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I built it
&lt;/h2&gt;

&lt;p&gt;I used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Next.js&lt;/li&gt;
&lt;li&gt;TypeScript&lt;/li&gt;
&lt;li&gt;Tailwind CSS&lt;/li&gt;
&lt;li&gt;server-side Gemini API&lt;/li&gt;
&lt;li&gt;Zod&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The architecture is more serious than the premise. That felt appropriate.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Deterministic manifest analysis
&lt;/h3&gt;

&lt;p&gt;The first step is completely local.&lt;/p&gt;

&lt;p&gt;The app parses &lt;code&gt;package.json&lt;/code&gt;, flattens the dependency sections, inspects the scripts block, and turns the manifest into a feature set. It looks at things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;dependency counts&lt;/li&gt;
&lt;li&gt;&lt;code&gt;peerDependencies&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;overrides&lt;/code&gt; / &lt;code&gt;resolutions&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;wildcard and &lt;code&gt;latest&lt;/code&gt; versions&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pre*&lt;/code&gt; / &lt;code&gt;post*&lt;/code&gt; scripts&lt;/li&gt;
&lt;li&gt;&lt;code&gt;postinstall&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;package manager hints&lt;/li&gt;
&lt;li&gt;framework / test / build tool fingerprints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those features feed a weighted scoring model. I wanted the joke to start from real manifest behavior, not from a model improvising a vibe.&lt;/p&gt;

&lt;p&gt;Each score reacts to specific signals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Aura Stability&lt;/strong&gt; improves with pinned versions. Wildcards, &lt;code&gt;latest&lt;/code&gt;, extra scripts, and override-heavy manifests drag it down.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chaos Index&lt;/strong&gt; climbs when the project has loose version ranges, lifecycle scripts, &lt;code&gt;postinstall&lt;/code&gt;, suspicious script names, or workspace sprawl.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Peer Dependency Tension&lt;/strong&gt; rises when the package asks other people to satisfy more of its needs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Boundary Issues&lt;/strong&gt; is really a score for governance by exception, so &lt;code&gt;overrides&lt;/code&gt;, &lt;code&gt;resolutions&lt;/code&gt;, and workspace hints push it upward.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Trust Issues&lt;/strong&gt; gets worse when the manifest is private, carries a &lt;code&gt;postinstall&lt;/code&gt;, or leans on suspicious scripts and &lt;code&gt;latest&lt;/code&gt; tags.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Mercury Status&lt;/strong&gt; comes from lifecycle-script severity, especially &lt;code&gt;pre*&lt;/code&gt;, &lt;code&gt;post*&lt;/code&gt;, and &lt;code&gt;postinstall&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
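&lt;p&gt;A minimal TypeScript sketch of that flow, from manifest to features to one weighted score, might look like the following. The type names, the &lt;code&gt;extractFeatures&lt;/code&gt; and &lt;code&gt;auraStability&lt;/code&gt; functions, and every weight are assumptions invented for illustration. They are not the values the app uses.&lt;/p&gt;

```typescript
// Hypothetical sketch: feature names and weights are illustrative only.
type Section = { [name: string]: string };

type Manifest = {
  dependencies?: Section;
  devDependencies?: Section;
  peerDependencies?: Section;
  overrides?: { [name: string]: unknown };
  resolutions?: Section;
  scripts?: Section;
};

type Features = {
  total: number; // all dependency entries across sections
  pinned: number; // exact versions such as "1.2.3"
  loose: number; // "*", "latest", or empty ranges
  lifecycleScripts: number; // script names with a pre/post prefix (naive match)
  hasPostinstall: boolean;
  overrideCount: number; // overrides plus resolutions
};

const EXACT_VERSION = /^\d+\.\d+\.\d+$/;

// Returns the version ranges of one dependency section, or nothing if it is absent.
function sectionRanges(section?: Section): string[] {
  return section ? Object.values(section) : [];
}

function extractFeatures(manifest: Manifest): Features {
  const ranges = [
    ...sectionRanges(manifest.dependencies),
    ...sectionRanges(manifest.devDependencies),
    ...sectionRanges(manifest.peerDependencies),
  ];
  const scriptNames = Object.keys(manifest.scripts ?? {});
  return {
    total: ranges.length,
    pinned: ranges.filter((r) => EXACT_VERSION.test(r)).length,
    loose: ranges.filter((r) => r === "*" || r === "latest" || r === "").length,
    lifecycleScripts: scriptNames.filter((n) => /^(pre|post)./.test(n)).length,
    hasPostinstall: scriptNames.includes("postinstall"),
    overrideCount:
      Object.keys(manifest.overrides ?? {}).length +
      Object.keys(manifest.resolutions ?? {}).length,
  };
}

// Pure function of the features, so the same manifest always yields the same score.
function auraStability(f: Features): number {
  const pinnedRatio = f.total === 0 ? 1 : f.pinned / f.total;
  const raw =
    50 + 40 * pinnedRatio - 15 * f.loose - 5 * f.lifecycleScripts - 5 * f.overrideCount;
  return Math.max(0, Math.min(100, Math.round(raw)));
}
```

&lt;p&gt;Because nothing here calls a model or reads a clock, the determinism comes for free: the score is a pure function of the parsed manifest.&lt;/p&gt;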

&lt;p&gt;So yes, the result is silly. But it is silly in a deterministic way. The six scores that show up in the product are all computed locally, so the same manifest always produces the same numbers.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Gemini for the narrative layer
&lt;/h3&gt;

&lt;p&gt;I used Gemini on the server for the parts that needed tone rather than math:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;executive summary&lt;/li&gt;
&lt;li&gt;sun / moon / rising interpretations&lt;/li&gt;
&lt;li&gt;red flags&lt;/li&gt;
&lt;li&gt;prophecy&lt;/li&gt;
&lt;li&gt;lucky command&lt;/li&gt;
&lt;li&gt;share caption&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Gemini does not decide the scores. It gets the extracted features and the computed numbers, then turns them into a dead-serious reading.&lt;/p&gt;

&lt;p&gt;The app asks for structured JSON and validates the result with Zod before rendering anything. That kept the product funny without handing core logic to the model.&lt;/p&gt;
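&lt;p&gt;The validate-before-render idea can be sketched without any dependencies. The app itself uses Zod. The hand-written checks below are a stand-in for that schema, and the &lt;code&gt;Reading&lt;/code&gt; shape with its field names is an assumption covering only a subset of the narrative parts listed above.&lt;/p&gt;

```typescript
// Dependency-free stand-in for a Zod schema; field names are assumptions.
type Reading = {
  executiveSummary: string;
  redFlags: string[];
  prophecy: string;
  luckyCommand: string;
  shareCaption: string;
};

function isNonEmptyString(value: unknown): value is string {
  return typeof value === "string" ? value.trim().length > 0 : false;
}

// Returns a typed reading, or null when the model output must not be rendered.
function parseReading(raw: string): Reading | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // the model returned something that is not JSON
  }
  if (typeof data !== "object" || data === null) return null;
  const record = data as { [key: string]: unknown };

  const redFlags = record["redFlags"];
  if (!Array.isArray(redFlags) || !redFlags.every(isNonEmptyString)) return null;

  const executiveSummary = record["executiveSummary"];
  const prophecy = record["prophecy"];
  const luckyCommand = record["luckyCommand"];
  const shareCaption = record["shareCaption"];
  if (!isNonEmptyString(executiveSummary)) return null;
  if (!isNonEmptyString(prophecy)) return null;
  if (!isNonEmptyString(luckyCommand)) return null;
  if (!isNonEmptyString(shareCaption)) return null;

  return { executiveSummary, redFlags, prophecy, luckyCommand, shareCaption };
}
```

&lt;p&gt;A caller would render the reading only when &lt;code&gt;parseReading&lt;/code&gt; returns a value, and fall back to fixed copy when it returns &lt;code&gt;null&lt;/code&gt;.&lt;/p&gt;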

&lt;p&gt;The public deployment does not hit Gemini live. I disabled that in production because paying for unlimited dependency clairvoyance for strangers seemed like a bad financial habit. So production serves a fixed, intentionally budget-conscious executive statement, while local development and evals use the real Gemini path.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. UI direction
&lt;/h3&gt;

&lt;p&gt;I did not want this to look like a horoscope app. I wanted it to look like a corporate audit dashboard that had developed a spiritual problem.&lt;/p&gt;

&lt;p&gt;The design goal was:&lt;/p&gt;

&lt;p&gt;"This should look like a compliance product that got trapped in a spiritual crisis."&lt;/p&gt;

&lt;h3&gt;
  
  
  4. My favorite April Fools detail
&lt;/h3&gt;

&lt;p&gt;If the input looks like Python or Ruby dependency files, the app returns &lt;strong&gt;418&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That part is useless, correct, and deeply satisfying.&lt;/p&gt;
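&lt;p&gt;For context, 418 is the "I'm a teapot" status code. A rough sketch of how such a check could work is below. The heuristics and the &lt;code&gt;looksLikePythonOrRuby&lt;/code&gt; and &lt;code&gt;statusFor&lt;/code&gt; names are assumptions for illustration, since the app's real detection rules are not shown in this post.&lt;/p&gt;

```typescript
// Hypothetical heuristics for spotting non-JavaScript dependency files.
function looksLikePythonOrRuby(input: string): boolean {
  const text = input.trim();
  // JSON-shaped input is treated as a package.json candidate.
  if (text.startsWith("{")) return false;

  const lines = text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0);

  // Gemfile: lines such as `source "https://rubygems.org"` or `gem "rails"`.
  const gemfile = lines.some((line) => /^(source|gem)\s+["']/.test(line));
  // requirements.txt: lines such as `requests==2.31.0` or `flask>=2.0`.
  const requirements = lines.some((line) =>
    /^[A-Za-z0-9_.-]+\s*(==|>=|~=)\s*[0-9]/.test(line)
  );
  // pyproject.toml: well-known table headers.
  const pyproject = lines.some((line) => line === "[tool.poetry]" || line === "[project]");

  return gemfile || requirements || pyproject;
}

// 418 for the wrong ecosystem, 200 otherwise.
function statusFor(input: string): number {
  return looksLikePythonOrRuby(input) ? 418 : 200;
}
```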

&lt;h3&gt;
  
  
  5. Eval, because the joke works better if the nonsense is measured
&lt;/h3&gt;

&lt;p&gt;I did not want the AI layer to run on hope.&lt;/p&gt;

&lt;p&gt;So I added a small &lt;code&gt;promptfoo&lt;/code&gt; harness around the reading endpoint and treated it like a real structured-output feature.&lt;/p&gt;

&lt;p&gt;The eval setup has two layers. The first is deterministic and checks response contract, writing constraints, and fixture-specific signal coverage. The second uses LLM-as-a-judge rubrics for tone and grounding.&lt;/p&gt;

&lt;p&gt;The deterministic checks cover things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the endpoint returns the full expected JSON shape&lt;/li&gt;
&lt;li&gt;the response stays in &lt;code&gt;live&lt;/code&gt; mode for the eval fixtures&lt;/li&gt;
&lt;li&gt;the copy does not drift into practical engineering advice&lt;/li&gt;
&lt;li&gt;the &lt;code&gt;luckyCommand&lt;/code&gt; still looks like a shell command&lt;/li&gt;
&lt;li&gt;the response actually reflects the manifest signals it was supposed to notice&lt;/li&gt;
&lt;/ul&gt;
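&lt;p&gt;Two of those deterministic checks could be written as plain predicates, roughly like this. The pattern list and the function names are assumptions for illustration, not the assertions in the project's &lt;code&gt;promptfoo&lt;/code&gt; config.&lt;/p&gt;

```typescript
// Hypothetical checks; the phrase list and command pattern are illustrative.
const ADVICE_PATTERNS = [/\byou should\b/i, /\bwe recommend\b/i, /\bbest practice/i];

// True when the text is a single line that starts with a bare lowercase
// executable name, such as "npm" or "pnpm", followed by optional arguments.
function looksLikeShellCommand(command: string): boolean {
  return /^[a-z][a-z0-9_.-]*([ \t]+\S+)*$/.test(command.trim());
}

// True when the copy slips from corporate mysticism into practical advice.
function driftsIntoAdvice(copy: string): boolean {
  return ADVICE_PATTERNS.some((pattern) => pattern.test(copy));
}
```

&lt;p&gt;Checks like these are cheap and never flaky, which is why they run before any judge model gets involved.&lt;/p&gt;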

&lt;p&gt;Then I added judge-based checks for the harder-to-measure parts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;does this still sound polished, dead-serious, and vaguely B2B?&lt;/li&gt;
&lt;li&gt;is it funny through sincerity rather than random nonsense?&lt;/li&gt;
&lt;li&gt;does it stay grounded in the fixture instead of inventing facts?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That gave me a cleaner contract for the product:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;local code owns the real scoring logic&lt;/li&gt;
&lt;li&gt;Gemini owns the tone&lt;/li&gt;
&lt;li&gt;evals make sure those boundaries do not blur&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The runner hits the local Next.js app over HTTP, so the eval path matches the real product path instead of a helper in isolation.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Eval results
&lt;/h3&gt;

&lt;p&gt;The saved run I kept for the project was:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;eval-qw8-2026-04-08T00:18:21&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;public report: &lt;a href="https://semver-in-retrograde.vercel.app/evals/eval-qw8-2026-04-08T00:18:21" rel="noopener noreferrer"&gt;semver-in-retrograde.vercel.app/evals/eval-qw8-2026-04-08T00:18:21&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;raw JSON: &lt;a href="https://semver-in-retrograde.vercel.app/evals/eval-qw8-2026-04-08T00-18-21.json" rel="noopener noreferrer"&gt;semver-in-retrograde.vercel.app/evals/eval-qw8-2026-04-08T00-18-21.json&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That run used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;promptfoo&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;4 manifest fixtures&lt;/li&gt;
&lt;li&gt;8 expanded test cases&lt;/li&gt;
&lt;li&gt;concurrency set to &lt;code&gt;1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;light retrying around transient model-availability issues&lt;/li&gt;
&lt;li&gt;Gemini as the judge model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;8 / 8 passing&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;0 failures&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;0 errors&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;runtime: about &lt;strong&gt;133 seconds&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The fixtures cover four different dependency personalities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a mildly over-governed Next.js workspace&lt;/li&gt;
&lt;li&gt;a commitment-avoidant Vite app with &lt;code&gt;latest&lt;/code&gt; and wildcard ranges&lt;/li&gt;
&lt;li&gt;a haunted library with overrides, resolutions, and lifecycle weirdness&lt;/li&gt;
&lt;li&gt;a relatively boring steady package that should not be over-dramatized&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last case mattered. A joke product can always get louder. The harder part is keeping it funny without inventing drama the manifest did not earn.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prize category
&lt;/h2&gt;

&lt;p&gt;I am submitting this for &lt;strong&gt;Best Google AI Usage&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Google AI is central to the project. Gemini runs the narrative layer on the server, returns structured JSON instead of free-form prose, gets validated before display, and sits behind evals that check both hard constraints and tone. The product only works because of that split between deterministic scoring and AI-generated corporate mysticism.&lt;/p&gt;

&lt;p&gt;That is the role I wanted the model to play. It does not own the critical logic. It owns the polished nonsense.&lt;/p&gt;

&lt;p&gt;If your JavaScript project has unresolved dependency feelings, Semver in Retrograde is ready to misinterpret them at enterprise scale.&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>418challenge</category>
      <category>showdev</category>
    </item>
    <item>
      <title>Lessons from the Spring 2026 OSS Incidents: Hardening npm, pnpm, and GitHub Actions Against Supply-Chain Attacks</title>
      <dc:creator>Teruo Kunihiro</dc:creator>
      <pubDate>Thu, 02 Apr 2026 05:00:50 +0000</pubDate>
      <link>https://dev.to/trknhr/lessons-from-the-spring-2026-oss-incidents-hardening-npm-pnpm-and-github-actions-against-1jnp</link>
      <guid>https://dev.to/trknhr/lessons-from-the-spring-2026-oss-incidents-hardening-npm-pnpm-and-github-actions-against-1jnp</guid>
      <description>&lt;p&gt;March 2026 saw a rapid succession of OSS supply-chain incidents.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In Trivy, an attacker repointed 76 of the 77 version tags for &lt;code&gt;trivy-action&lt;/code&gt; and 7 tags for &lt;code&gt;setup-trivy&lt;/code&gt; to a malicious commit, and a tampered &lt;code&gt;v0.69.4&lt;/code&gt; binary was released.&lt;/li&gt;
&lt;li&gt;In LiteLLM, malicious &lt;code&gt;1.82.7&lt;/code&gt; and &lt;code&gt;1.82.8&lt;/code&gt; packages were uploaded to PyPI, and the maintainers later identified &lt;code&gt;1.83.0&lt;/code&gt; as the clean release.&lt;/li&gt;
&lt;li&gt;In axios, &lt;code&gt;1.14.1&lt;/code&gt; and &lt;code&gt;0.30.4&lt;/code&gt; were briefly published to npm, and the hidden dependency &lt;code&gt;plain-crypto-js&lt;/code&gt; used &lt;code&gt;postinstall&lt;/code&gt; to distribute a cross-platform RAT (a remote access trojan, which lets attackers remotely control infected machines). (&lt;a href="https://www.aquasec.com/blog/trivy-supply-chain-attack-what-you-need-to-know/" rel="noopener noreferrer"&gt;Aqua&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A common recommendation for preventing incidents like these is to enable npm’s &lt;code&gt;min-release-age&lt;/code&gt; or pnpm’s &lt;code&gt;minimumReleaseAge&lt;/code&gt;.&lt;br&gt;
npm’s &lt;code&gt;min-release-age&lt;/code&gt; prevents versions newer than a specified number of days from being installed, while pnpm’s &lt;code&gt;minimumReleaseAge&lt;/code&gt; applies the same idea in minutes.&lt;br&gt;
Both are highly effective at reducing the chance of immediately picking up a freshly published malicious release. But they only protect you at the &lt;strong&gt;moment of dependency resolution&lt;/strong&gt;. They do not stop automatic install script execution, CI pipelines that reference mutable tags, or long-lived publish tokens lingering in your environment. pnpm itself makes this distinction explicit: compromised packages are often detected relatively quickly, but there is still an unavoidable exposure window between publication and detection. (&lt;a href="https://docs.npmjs.com/cli/v11/using-npm/config/" rel="noopener noreferrer"&gt;npm Docs&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;One screenshot captured the direction of travel perfectly. In the current stable pnpm release, both &lt;code&gt;blockExoticSubdeps&lt;/code&gt; and &lt;code&gt;strictDepBuilds&lt;/code&gt; default to &lt;code&gt;false&lt;/code&gt;, but in the next docs and the v11 release notes, both move to &lt;code&gt;true&lt;/code&gt;. &lt;code&gt;blockExoticSubdeps&lt;/code&gt; prevents transitive dependencies from pulling from exotic sources such as git repos or tarball URLs, while &lt;code&gt;strictDepBuilds&lt;/code&gt; can fail installation when unreviewed build scripts are present.&lt;br&gt;
pnpm is clearly steering toward a security-first model: away from “install anything” and toward “resolve and execute only what has been explicitly trusted.” (&lt;a href="https://pnpm.io/settings" rel="noopener noreferrer"&gt;pnpm&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;This post breaks the defense surface into four layers:&lt;br&gt;
&lt;strong&gt;dependency resolution&lt;/strong&gt;, &lt;strong&gt;install-time execution&lt;/strong&gt;, &lt;strong&gt;CI execution&lt;/strong&gt;, and the &lt;strong&gt;publish path&lt;/strong&gt;.&lt;br&gt;
&lt;code&gt;min-release-age&lt;/code&gt; belongs primarily to the dependency-resolution layer.&lt;/p&gt;
&lt;h2&gt;
  
  
  Delay and lock dependency resolution
&lt;/h2&gt;

&lt;p&gt;The first thing to stabilize is &lt;strong&gt;which versions get resolved&lt;/strong&gt;. npm’s &lt;code&gt;min-release-age&lt;/code&gt; works in days, while pnpm’s &lt;code&gt;minimumReleaseAge&lt;/code&gt; works in minutes, allowing you to let newly published versions “cool off” before they are eligible for installation.&lt;br&gt;
In practice, though, you will eventually want exceptions for emergency security fixes or dependencies that you need to update immediately.&lt;/p&gt;

&lt;p&gt;pnpm also provides &lt;code&gt;minimumReleaseAgeExclude&lt;/code&gt;, which lets you carve out exceptions for specific packages or versions.&lt;br&gt;
Dependabot has &lt;code&gt;cooldown&lt;/code&gt;, a grace-period setting that delays version update PRs even after a new dependency version has been published. That grace period applies only to version updates, not to security updates.&lt;br&gt;
So an operating model like “delay routine upgrades, but fast-track urgent security fixes” is perfectly workable in production. (&lt;a href="https://docs.npmjs.com/cli/v11/using-npm/config/" rel="noopener noreferrer"&gt;npm Docs&lt;/a&gt;)&lt;/p&gt;
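&lt;p&gt;To make the cool-off idea concrete, here is a small TypeScript model of the decision: a version is eligible once it is old enough, unless the package is on an exclude list. This is a conceptual sketch, not how npm or pnpm implement it. The &lt;code&gt;Candidate&lt;/code&gt; type and both function names are assumptions, and the exclude matching supports only exact names and a trailing &lt;code&gt;/*&lt;/code&gt; scope wildcard.&lt;/p&gt;

```typescript
// Illustrative model of a release cool-off check; not a package manager's real code.
type Candidate = { name: string; version: string; publishedAt: Date };

const MS_PER_MINUTE = 60000;
const MINUTES_PER_DAY = 24 * 60; // converts an npm-style day count into minutes

// Supports exact names and a trailing "/*" scope wildcard such as "@your-org/*".
function isExcluded(name: string, excludePatterns: string[]): boolean {
  return excludePatterns.some((pattern) =>
    pattern.endsWith("/*") ? name.startsWith(pattern.slice(0, -1)) : name === pattern
  );
}

function isEligible(
  candidate: Candidate,
  now: Date,
  minAgeMinutes: number,
  excludePatterns: string[]
): boolean {
  // Excluded packages skip the cool-off entirely.
  if (isExcluded(candidate.name, excludePatterns)) return true;
  const ageMinutes = (now.getTime() - candidate.publishedAt.getTime()) / MS_PER_MINUTE;
  return ageMinutes >= minAgeMinutes;
}
```

&lt;p&gt;With a 1440-minute threshold, a release published two hours ago is rejected and one published two days ago is accepted, which is the whole point: a briefly published malicious version never becomes eligible if it is pulled within the window.&lt;/p&gt;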

&lt;p&gt;That said, delaying upgrades is not enough on its own. If the dependency graph resolved at one point in time cannot be reproduced consistently across your team and CI, different environments will drift onto different versions. That is where the lockfile becomes critical.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;package-lock.json&lt;/code&gt; records the exact dependency graph and versions that were actually resolved. Committing it makes it much easier to reproduce the same dependency set in development and CI. &lt;code&gt;npm ci&lt;/code&gt; is designed around the lockfile: it fails if &lt;code&gt;package.json&lt;/code&gt; and the lockfile are out of sync, and it never rewrites the lockfile. In CI, that makes &lt;code&gt;npm ci&lt;/code&gt; safer than &lt;code&gt;npm install&lt;/code&gt; from a reproducibility standpoint, and it also makes unintended dependency changes easier to spot in diffs. (&lt;a href="https://docs.npmjs.com/cli/v8/configuring-npm/package-lock-json/" rel="noopener noreferrer"&gt;npm Docs&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Lockfiles matter for security, too. In GitHub’s dependency graph, a lockfile gives GitHub a much more accurate picture of the dependencies you actually resolved than a manifest alone. Indirect dependencies inferred only from the manifest may be excluded from vulnerability checks. (&lt;a href="https://docs.github.com/en/code-security/concepts/supply-chain-security/dependency-graph-data" rel="noopener noreferrer"&gt;GitHub Docs&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;There is one more risk in a different category worth calling out: dependency confusion. As a mitigation against public packages colliding with private package names, npm strongly recommends scoped packages. Managing internal packages under a namespace like &lt;code&gt;@your-org/foo&lt;/code&gt; is not flashy, but it is effective. (&lt;a href="https://docs.npmjs.com/threats-and-mitigations/" rel="noopener noreferrer"&gt;npm Docs&lt;/a&gt;)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="c"&gt;# .npmrc
&lt;/span&gt;&lt;span class="py"&gt;min-release-age&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;3&lt;/span&gt;
&lt;span class="py"&gt;ignore-scripts&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# pnpm-workspace.yaml&lt;/span&gt;
&lt;span class="na"&gt;minimumReleaseAge&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1440&lt;/span&gt;
&lt;span class="na"&gt;minimumReleaseAgeExclude&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;@your-org/*'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both settings keep you from consuming a release the moment it is published. In the examples above, the cool-off is 3 days for npm and 1440 minutes (24 hours) for pnpm, and pnpm also applies it to transitive dependencies.&lt;/p&gt;

&lt;p&gt;But this is only a mechanism for delaying the adoption of new releases. It does not guarantee reproducibility by itself. If you want stable, repeatable installs, the baseline is still to commit the lockfile and enforce strict lockfile-based installs in CI with commands like &lt;code&gt;npm ci&lt;/code&gt; or &lt;code&gt;pnpm install --frozen-lockfile&lt;/code&gt;. (&lt;a href="https://docs.npmjs.com/cli/v11/using-npm/config/" rel="noopener noreferrer"&gt;npm Docs&lt;/a&gt;)&lt;/p&gt;

&lt;h2&gt;
  
  
  Treat install as code execution, not just downloading packages
&lt;/h2&gt;

&lt;p&gt;The axios incident is a perfect example. The problem was not the axios code itself, but the &lt;code&gt;postinstall&lt;/code&gt; hook in the hidden package &lt;code&gt;plain-crypto-js&lt;/code&gt;. In other words, &lt;code&gt;npm install&lt;/code&gt; is not just artifact retrieval. Through dependency scripts, it is also &lt;strong&gt;code execution at install time&lt;/strong&gt;. (&lt;a href="https://snyk.io/blog/axios-npm-package-compromised-supply-chain-attack-delivers-cross-platform/" rel="noopener noreferrer"&gt;Snyk&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;npm has &lt;code&gt;ignore-scripts&lt;/code&gt;. When set to &lt;code&gt;true&lt;/code&gt;, it suppresses automatic script execution from &lt;code&gt;package.json&lt;/code&gt; during installation. Scripts you invoke explicitly with &lt;code&gt;npm run&lt;/code&gt; or &lt;code&gt;npm test&lt;/code&gt; still run, though without their pre- and post-scripts, and you are no longer running every dependency’s &lt;code&gt;preinstall&lt;/code&gt; / &lt;code&gt;install&lt;/code&gt; / &lt;code&gt;postinstall&lt;/code&gt; hook by default. (&lt;a href="https://docs.npmjs.com/cli/v11/using-npm/config/" rel="noopener noreferrer"&gt;npm Docs&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;pnpm pushes this idea further. In its supply-chain security guidance, pnpm notes that many past compromised packages abused &lt;code&gt;postinstall&lt;/code&gt;, and that v10 stopped automatically executing dependency &lt;code&gt;postinstall&lt;/code&gt; hooks. The recommended model is to explicitly allow only trusted packages via &lt;code&gt;allowBuilds&lt;/code&gt;. In the stable docs, &lt;code&gt;allowBuilds&lt;/code&gt; supports per-package allow/deny rules, and with &lt;code&gt;strictDepBuilds&lt;/code&gt; enabled, installation can fail the moment an unreviewed build script appears. (&lt;a href="https://pnpm.io/supply-chain-security" rel="noopener noreferrer"&gt;pnpm&lt;/a&gt;)&lt;/p&gt;
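&lt;p&gt;The allow-list model described here can be pictured as a small decision function. The sketch below is a conceptual model only, with made-up type and function names. pnpm's real behavior has more cases than this.&lt;/p&gt;

```typescript
// Conceptual model of an allow-list for dependency build scripts.
type BuildPolicy = {
  allowBuilds: { [pkg: string]: boolean | undefined }; // true = allow, false = deny
  strictDepBuilds: boolean; // fail the install when something is unreviewed
};

type Decision = { run: string[]; skipped: string[]; failInstall: boolean };

function decideBuilds(packagesWithScripts: string[], policy: BuildPolicy): Decision {
  const run: string[] = [];
  const denied: string[] = [];
  const unreviewed: string[] = [];

  for (const pkg of packagesWithScripts) {
    const rule = policy.allowBuilds[pkg];
    if (rule === true) run.push(pkg);
    else if (rule === false) denied.push(pkg);
    else unreviewed.push(pkg); // no rule yet: nobody has reviewed this package
  }

  // Unreviewed scripts never run. In strict mode they also stop the install.
  const failInstall = policy.strictDepBuilds ? unreviewed.length > 0 : false;
  return { run, skipped: [...denied, ...unreviewed], failInstall };
}
```

&lt;p&gt;Under this model, a package like &lt;code&gt;plain-crypto-js&lt;/code&gt; showing up with a build script would land in the unreviewed bucket: its script does not run, and in strict mode the install fails loudly instead of quietly continuing.&lt;/p&gt;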

&lt;p&gt;On top of that, enabling &lt;code&gt;blockExoticSubdeps&lt;/code&gt; prevents transitive dependencies from pulling from exotic sources such as git repositories or tarball URLs. &lt;code&gt;trustPolicy: no-downgrade&lt;/code&gt; can reject artifacts whose trust evidence is weaker than what was seen in earlier versions.&lt;br&gt;
All of these are ways to ensure that even if you do pull something bad, it does not automatically spread or execute. (&lt;a href="https://pnpm.io/settings" rel="noopener noreferrer"&gt;pnpm&lt;/a&gt;)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# pnpm-workspace.yaml&lt;/span&gt;
&lt;span class="na"&gt;minimumReleaseAge&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1440&lt;/span&gt;
&lt;span class="na"&gt;blockExoticSubdeps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;strictDepBuilds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;allowBuilds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;esbuild&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;trustPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;no-downgrade&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In short, &lt;code&gt;min-release-age&lt;/code&gt; makes it less likely that you will ingest a freshly compromised release, while &lt;code&gt;ignore-scripts&lt;/code&gt; and &lt;code&gt;strictDepBuilds&lt;/code&gt; are about preventing it from executing automatically even if it does get in. (&lt;a href="https://docs.npmjs.com/cli/v11/using-npm/config/" rel="noopener noreferrer"&gt;npm Docs&lt;/a&gt;)&lt;/p&gt;

&lt;h2&gt;
  
  
  Run GitHub Actions with immutable refs and least privilege
&lt;/h2&gt;

&lt;p&gt;In GitHub Actions, the first rule is to &lt;strong&gt;pin workflow code to immutable references&lt;/strong&gt;. Tag references such as &lt;code&gt;@v1&lt;/code&gt; or &lt;code&gt;@v1.2.3&lt;/code&gt; are convenient, but tags can be retargeted after the fact. GitHub explicitly states that the only way to reference an Action immutably is to pin it to a &lt;strong&gt;full-length commit SHA&lt;/strong&gt;. So instead of &lt;code&gt;uses: owner/action@v1&lt;/code&gt;, the safer baseline is &lt;code&gt;uses: owner/action@&amp;lt;commit SHA&amp;gt;&lt;/code&gt;. If your workflow depends on a moving reference like a tag, the code that runs later can change even when the workflow file itself does not. (&lt;a href="https://docs.github.com/en/actions/reference/security/secure-use" rel="noopener noreferrer"&gt;GitHub Docs&lt;/a&gt;)&lt;/p&gt;
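&lt;p&gt;One way to enforce this is a small lint over workflow files that flags any &lt;code&gt;uses:&lt;/code&gt; reference not pinned to a 40-character commit SHA. The TypeScript sketch below is illustrative: the &lt;code&gt;unpinnedActions&lt;/code&gt; name is made up, it scans lines with a regular expression rather than parsing YAML, and it does not handle quoted references.&lt;/p&gt;

```typescript
// Illustrative lint for `uses:` references in a workflow file.
const FULL_SHA = /^[0-9a-f]{40}$/;

// Returns every `uses:` reference that is not pinned to a full-length commit SHA.
function unpinnedActions(workflowYaml: string): string[] {
  const findings: string[] = [];
  for (const line of workflowYaml.split(/\r?\n/)) {
    const match = /^\s*(?:-\s*)?uses:\s*([^\s#]+)/.exec(line);
    if (match === null) continue;
    const reference = match[1] ?? "";
    // Local actions and container images are out of scope for this sketch.
    if (reference.startsWith("./") || reference.startsWith("docker://")) continue;
    const ref = reference.split("@")[1] ?? "";
    if (!FULL_SHA.test(ref)) findings.push(reference);
  }
  return findings;
}
```

&lt;p&gt;Run against a workflow, this reports tag references such as &lt;code&gt;actions/checkout@v4&lt;/code&gt; and stays silent for SHA-pinned ones, so it can serve as a CI gate on changes under &lt;code&gt;.github/workflows&lt;/code&gt;.&lt;/p&gt;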

&lt;p&gt;The next step is to &lt;strong&gt;minimize runtime privileges&lt;/strong&gt;. Keep &lt;code&gt;GITHUB_TOKEN&lt;/code&gt; permissions to the bare minimum, with defaults as narrow as &lt;code&gt;contents: read&lt;/code&gt;, and grant additional permissions only to the specific jobs that need them. Protect workflow files themselves with &lt;code&gt;CODEOWNERS&lt;/code&gt;, so changes to &lt;code&gt;.github/workflows&lt;/code&gt; require review. And for jobs that need cloud access, use OIDC instead of storing long-lived secrets in GitHub. Importantly, &lt;code&gt;permissions: id-token: write&lt;/code&gt; is only for minting an OIDC token to authenticate to an external service. It does not expand the workflow’s GitHub-side privileges. (&lt;a href="https://docs.github.com/en/actions/reference/security/secure-use" rel="noopener noreferrer"&gt;GitHub Docs&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;From there, the next defensive layer is to &lt;strong&gt;gate dependency changes at the PR boundary&lt;/strong&gt;. GitHub’s dependency review action checks dependencies added or updated in a pull request and can block merges when known vulnerabilities are introduced. In the review UI, you can inspect newly added or updated dependencies alongside release dates and vulnerability data. For example, the following workflow fails when the PR includes dependency changes with vulnerabilities rated high severity or above. (&lt;a href="https://docs.github.com/en/code-security/how-tos/secure-your-supply-chain/manage-your-dependency-security/configuring-the-dependency-review-action" rel="noopener noreferrer"&gt;GitHub Docs&lt;/a&gt;)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dependency-review&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;

&lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;review&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;read&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@&amp;lt;FULL_LENGTH_SHA&amp;gt;&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/dependency-review-action@&amp;lt;FULL_LENGTH_SHA&amp;gt;&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;fail-on-severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;high&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is an important nuance here. The dependency review action is primarily a mechanism for checking the safety of &lt;strong&gt;dependency changes introduced via PRs&lt;/strong&gt;. GitHub also recognizes &lt;code&gt;uses:&lt;/code&gt; references in &lt;code&gt;.github/workflows/&lt;/code&gt; as dependencies in the dependency graph, but &lt;strong&gt;Dependabot alerts for Actions are only generated automatically for semver-based references&lt;/strong&gt;. &lt;strong&gt;SHA-pinned Actions do not receive those alerts&lt;/strong&gt;. In practice, that means external Actions should be pinned by SHA for safety, and then reviewed on a schedule as part of deliberate update work. The operating model becomes: stay safe by default with immutable references, and review upgrades intentionally when you choose to move them. (&lt;a href="https://docs.github.com/en/code-security/how-tos/secure-your-supply-chain/manage-your-dependency-security/configuring-the-dependency-review-action" rel="noopener noreferrer"&gt;GitHub Docs&lt;/a&gt;)&lt;/p&gt;

&lt;h2&gt;
  
  
  Protect the publish path itself
&lt;/h2&gt;

&lt;p&gt;If you publish npm packages yourself, the publish path can become the source of upstream compromise. npm’s trusted publishing uses OIDC so you do not need to keep long-lived npm tokens in CI. After you configure a trusted publisher, npm strongly recommends restricting legacy token-based publishing and enabling &lt;strong&gt;“Require two-factor authentication and disallow tokens”&lt;/strong&gt;. The docs even walk through revoking old automation tokens after the migration. (&lt;a href="https://docs.npmjs.com/trusted-publishers/" rel="noopener noreferrer"&gt;npm Docs&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;When trusted publishing is used from GitHub Actions or GitLab CI/CD, npm also generates provenance attestations automatically. npm provenance makes it publicly verifiable where a package was built and who published it. In other words, if you publish from GitHub Actions with a trusted publisher configured, you usually do not need to explicitly add &lt;code&gt;npm publish --provenance&lt;/code&gt;; provenance is attached automatically. (&lt;a href="https://docs.npmjs.com/generating-provenance-statements/" rel="noopener noreferrer"&gt;npm Docs&lt;/a&gt;)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;publish&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;release&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;types&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;published&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;read&lt;/span&gt;
  &lt;span class="na"&gt;id-token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;publish&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@&amp;lt;FULL_LENGTH_SHA&amp;gt;&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/setup-node@&amp;lt;FULL_LENGTH_SHA&amp;gt;&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;node-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;24"&lt;/span&gt;
          &lt;span class="na"&gt;registry-url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://registry.npmjs.org"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm ci&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm publish&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is worth separating signatures from provenance here. npm’s ECDSA registry signatures are designed to verify that the distributed tarball was not tampered with in transit. For example, they can detect whether package contents were altered somewhere along the way by a mirror or proxy.&lt;/p&gt;

&lt;p&gt;Provenance, on the other hand, captures &lt;strong&gt;where a package came from, how it was built, and from which source code it was published&lt;/strong&gt;. So while signatures answer “Was the package that arrived here modified?”, provenance answers “Where did this package come from, and how was it produced?”&lt;/p&gt;

&lt;p&gt;&lt;code&gt;npm audit signatures&lt;/code&gt; can verify both registry signatures and provenance attestations. But it is best thought of as a complementary integrity-and-origin check, not the primary mechanism for day-to-day vulnerability detection. (&lt;a href="https://docs.npmjs.com/about-registry-signatures/" rel="noopener noreferrer"&gt;npm Docs&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;pnpm takes a slightly different posture. In addition to “verify later” mechanisms like npm’s signatures and provenance, pnpm can proactively block untrusted dependencies at install time with settings like &lt;code&gt;blockExoticSubdeps&lt;/code&gt; and &lt;code&gt;strictDepBuilds&lt;/code&gt;. In that sense, npm focuses more on verification, while pnpm also leans into prevention through install-time policy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cross-cutting controls: detect with SCA, block with package-manager policy
&lt;/h2&gt;

&lt;p&gt;This is where SCA becomes important. SCA (Software Composition Analysis) is the practice of enumerating the libraries your project depends on and continuously checking them for known vulnerabilities and license issues. It is the foundation for understanding what is actually in your stack and whether any of it is already known to be risky.&lt;/p&gt;

&lt;p&gt;In GitHub, that role is largely filled by the dependency graph. The dependency graph ingests dependencies from manifests and lockfiles, and dependencies that land in the graph can receive Dependabot alerts and security updates. GitHub also explicitly recommends lockfiles for building a more trustworthy graph. The flip side is that transitive dependencies resolved only at build time, or indirect dependencies inferred only from the manifest, can still be missed. (&lt;a href="https://docs.github.com/en/code-security/concepts/supply-chain-security/dependency-graph-data" rel="noopener noreferrer"&gt;GitHub Docs&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;That is what automatic dependency submission and the dependency submission API are for. They let you send not just lockfile-declared dependencies, but also the dependencies actually resolved by a real build, into the dependency graph. GitHub provides built-in workflows for this, and external CI/CD systems or custom build pipelines can also push dependency snapshots through the API. In other words, you can reflect not only &lt;strong&gt;statically visible dependencies&lt;/strong&gt;, but also &lt;strong&gt;the dependencies that were actually resolved at build time&lt;/strong&gt;. (&lt;a href="https://docs.github.com/en/code-security/how-tos/secure-your-supply-chain/secure-your-dependencies/configuring-automatic-dependency-submission-for-your-repository" rel="noopener noreferrer"&gt;GitHub Docs&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;External tools are easier to reason about when you split them by role. Snyk Open Source is a classic SCA tool for open-source dependency vulnerabilities and license issues. OSV-Scanner supports major JavaScript lockfiles including &lt;code&gt;package-lock.json&lt;/code&gt;, &lt;code&gt;pnpm-lock.yaml&lt;/code&gt;, &lt;code&gt;yarn.lock&lt;/code&gt;, and &lt;code&gt;bun.lock&lt;/code&gt;. Trivy can emit GitHub dependency snapshots with &lt;code&gt;--format github&lt;/code&gt;, which makes it useful as a bridge for feeding dependencies observed from images or artifacts back into GitHub’s dependency graph. (&lt;a href="https://docs.snyk.io/scan-with-snyk/snyk-open-source" rel="noopener noreferrer"&gt;Snyk User Docs&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Many of these tools are strongest at known vulnerabilities, advisories, and license metadata. Socket is addressing a different problem: through static analysis, it looks for suspicious behavior such as install scripts, network requests, environment variable access, telemetry, and obfuscated code, including cases that have not yet become formal advisories.&lt;/p&gt;

&lt;p&gt;The key point is that SCA alone is not enough. It can catch known vulnerabilities, but there is always a lag for freshly published malware or suspicious packages that have not yet been assigned an advisory. As pnpm points out, there is an unavoidable gap between the publication of malware and its detection. In practice, that is why you should not rely on &lt;strong&gt;detection&lt;/strong&gt; alone. You also need &lt;strong&gt;preventive controls&lt;/strong&gt; at the package-manager level—such as &lt;code&gt;minimumReleaseAge&lt;/code&gt;, &lt;code&gt;ignore-scripts&lt;/code&gt;, &lt;code&gt;blockExoticSubdeps&lt;/code&gt;, and &lt;code&gt;strictDepBuilds&lt;/code&gt;—to make risky dependencies both harder to ingest and harder to execute in the first place. (&lt;a href="https://pnpm.io/supply-chain-security" rel="noopener noreferrer"&gt;pnpm&lt;/a&gt;)&lt;/p&gt;

&lt;h2&gt;
  
  
  The minimum baseline to put in place today
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Add &lt;code&gt;min-release-age=3&lt;/code&gt; and &lt;code&gt;ignore-scripts=true&lt;/code&gt; to &lt;code&gt;.npmrc&lt;/code&gt;. npm provides the former as a day-based maturity window and the latter as a way to suppress automatic script execution. (&lt;a href="https://docs.npmjs.com/cli/v11/using-npm/config/" rel="noopener noreferrer"&gt;npm Docs&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Always commit the lockfile, and use &lt;code&gt;npm ci&lt;/code&gt; in CI. &lt;code&gt;npm ci&lt;/code&gt; fails on lockfile mismatch and never rewrites the lockfile. (&lt;a href="https://docs.npmjs.com/cli/v11/commands/npm-ci/" rel="noopener noreferrer"&gt;npm Docs&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Scope private packages. It is a basic but effective mitigation against dependency confusion. (&lt;a href="https://docs.npmjs.com/threats-and-mitigations/" rel="noopener noreferrer"&gt;npm Docs&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;If you use pnpm, enable &lt;code&gt;minimumReleaseAge&lt;/code&gt;, &lt;code&gt;blockExoticSubdeps&lt;/code&gt;, &lt;code&gt;strictDepBuilds&lt;/code&gt;, and &lt;code&gt;allowBuilds&lt;/code&gt;, and consider going as far as &lt;code&gt;trustPolicy: no-downgrade&lt;/code&gt; if appropriate. (&lt;a href="https://pnpm.io/settings" rel="noopener noreferrer"&gt;pnpm&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;In GitHub Actions, combine full-length commit SHA pinning, least-privilege &lt;code&gt;GITHUB_TOKEN&lt;/code&gt; settings, and &lt;code&gt;CODEOWNERS&lt;/code&gt; review requirements for workflow changes. (&lt;a href="https://docs.github.com/en/actions/reference/security/secure-use" rel="noopener noreferrer"&gt;GitHub Docs&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Move cloud authentication to OIDC, and grant &lt;code&gt;id-token: write&lt;/code&gt; only to the jobs that need it. (&lt;a href="https://docs.github.com/actions/security-for-github-actions/security-hardening-your-deployments/about-security-hardening-with-openid-connect" rel="noopener noreferrer"&gt;GitHub Docs&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Add the dependency review action to PRs so dependency diffs are reviewed before merge. Use GitHub dependency graph / Dependabot as the baseline monitoring layer for dependency visibility. (&lt;a href="https://docs.github.com/en/code-security/how-tos/secure-your-supply-chain/manage-your-dependency-security/configuring-the-dependency-review-action" rel="noopener noreferrer"&gt;GitHub Docs&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;If you publish packages, migrate to trusted publishing, disable legacy tokens, and revoke the ones you no longer need. (&lt;a href="https://docs.npmjs.com/trusted-publishers/" rel="noopener noreferrer"&gt;npm Docs&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Closing thoughts
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Delay resolution. Prevent install-time auto-execution. Pin references and permissions in CI. Eliminate long-lived credentials from the publish path, attach provenance, and verify what you ship. Then use SCA to monitor dependency drift and known risk.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Only when these controls are combined can you say you have actually started defending against supply-chain attacks. (&lt;a href="https://docs.npmjs.com/cli/v11/using-npm/config/" rel="noopener noreferrer"&gt;npm Docs&lt;/a&gt;)&lt;/p&gt;

</description>
      <category>security</category>
    </item>
    <item>
      <title>What I Learned from Reading Claude Code’s Reconstructed Source</title>
      <dc:creator>Teruo Kunihiro</dc:creator>
      <pubDate>Thu, 02 Apr 2026 01:45:41 +0000</pubDate>
      <link>https://dev.to/trknhr/what-i-learned-from-reading-claude-codes-reconstructed-source-1ebf</link>
      <guid>https://dev.to/trknhr/what-i-learned-from-reading-claude-codes-reconstructed-source-1ebf</guid>
      <description>&lt;h2&gt;
  
  
  What I Learned from Reading Claude Code’s Reconstructed Source
&lt;/h2&gt;

&lt;p&gt;Around March 31, 2026, it became widely known that parts of Claude Code CLI’s implementation could be reconstructed from source maps that had remained in the npm package. A public mirror circulated for a while, but it was not an official open-source release by Anthropic, and it has since turned into a different project.&lt;/p&gt;

&lt;p&gt;This post is a memo of my own impressions after reading a reconstructed copy of the source that I had saved locally at the time. Rather than discussing the current state of any public mirror, I want to focus on the design characteristics that became visible from actually tracing through the code.&lt;/p&gt;

&lt;h2&gt;
  
  
  My first impression: this is a much larger product codebase than I expected
&lt;/h2&gt;

&lt;p&gt;The first thing that surprised me was the sheer size of the codebase. In the reconstructed source I had on hand, there were roughly 1,900 files and about 510,000 lines of code. This is not a small single-purpose CLI. It is a fairly large product codebase that bundles terminal UI, tool execution, safety controls, IDE integration, memory, and extension mechanisms into one system.&lt;/p&gt;

&lt;p&gt;Technically, the project appears to be centered on TypeScript, with Bun as the runtime and a React/Ink-style stack for the terminal UI. In other words, it felt less like “a small CLI with some AI added on top” and more like “a substantial TypeScript product with an AI experience layered into it.”&lt;/p&gt;

&lt;h2&gt;
  
  
  The prompts live on the client side more than I expected
&lt;/h2&gt;

&lt;p&gt;One of the easiest things to start tracing in this codebase is prompt construction. At least within the portion that could be reconstructed, a surprisingly large part of the system instruction layer is present in the client-side code, where runtime context is then injected into it.&lt;/p&gt;

&lt;p&gt;That runtime context includes things like the current date, Git state, recent commits, Git user information, and the contents of local instruction files. On top of that foundation, additional instructions and memory-related text are composed into something close to the final system prompt.&lt;/p&gt;

&lt;p&gt;What I found especially interesting was that the intuitive assumption that “the real prompt must be assembled as a black box on the server side” did not seem to hold very well here, at least not within the portion of the code I could inspect. That does not prove there is no additional server-side processing, of course. But it does show that a significant amount of the prompt logic also exists on the client side.&lt;/p&gt;

&lt;h2&gt;
  
  
  In tool design, what matters is not the number of tools but how they are exposed and controlled
&lt;/h2&gt;

&lt;p&gt;Another striking part of the design is the layer that decides which tools are visible to the model and the separate layer that manages execution permissions. The system is clearly feature-rich, but there is a fairly sharp distinction between tools that are exposed routinely and tools that are internal, behind feature flags, or otherwise conditionally enabled.&lt;/p&gt;

&lt;p&gt;My impression was fairly simple: this codebase does not look like it was built around the idea that “more tools automatically make the system stronger.” If anything, it seems closer to the opposite view: the surface that is exposed to the model in normal operation should be kept as narrow as possible.&lt;/p&gt;

&lt;p&gt;There are also implementation details suggesting that the tool list itself has to stay aligned with prompt caching. That means the number of tools and their schemas are not just implementation details; they appear to be part of stable prompt operation as well.&lt;/p&gt;

&lt;p&gt;This lines up quite well with the increasingly common intuition that “fewer tools often lead to more stable behavior.” That said, this is my interpretation of the code, not an explicit principle written down in those exact words.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bash is not “just a way to run shell commands”
&lt;/h2&gt;

&lt;p&gt;The shell execution layer was one of the most memorable parts of the codebase for me. What is going on there is not simply command execution.&lt;/p&gt;

&lt;p&gt;Commands are categorized into groups such as search-oriented commands, read-oriented commands, listing commands, and commands where silence on success is the natural behavior. Exit codes are also normalized in command-specific ways. For example, the &lt;code&gt;1&lt;/code&gt; returned by grep-like commands is not always treated as a plain error; it can be reinterpreted as “no match found.”&lt;/p&gt;
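
&lt;p&gt;To make the exit-code idea concrete, here is a minimal TypeScript sketch. It is not taken from Claude Code: the names &lt;code&gt;Outcome&lt;/code&gt;, &lt;code&gt;SEARCH_COMMANDS&lt;/code&gt;, and &lt;code&gt;interpretExit&lt;/code&gt; are hypothetical, and the command list is only my assumption about which tools follow the grep convention of returning 1 when nothing matched.&lt;/p&gt;

```typescript
// Illustrative sketch only; these names are hypothetical, not Claude Code's.
type Outcome = "ok" | "no-match" | "error";

// Assumed set of tools where exit code 1 means "nothing matched" (grep convention).
const SEARCH_COMMANDS = new Set(["grep", "egrep", "fgrep", "rg"]);

// Map a raw exit code to an outcome, taking the command into account.
function interpretExit(command: string, exitCode: number): Outcome {
  if (exitCode === 0) return "ok";
  if (exitCode === 1) {
    // For search tools, 1 is a normal "no results" signal rather than a failure.
    if (SEARCH_COMMANDS.has(command)) return "no-match";
  }
  return "error";
}

console.log(interpretExit("grep", 1)); // no-match
console.log(interpretExit("ls", 1)); // error
```

&lt;p&gt;The point of a table like this is that the model sees “no match” as an ordinary result instead of an error it might try to recover from.&lt;/p&gt;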

&lt;p&gt;On top of that, commands that are considered read-only are guarded by allowlist-based flag checks, path validation, sed-specific restrictions, sandbox eligibility checks, and even AST-based safety checks. For more complex compound commands, there are also explicit upper bounds on the fan-out of the safety analysis.&lt;/p&gt;
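
&lt;p&gt;An allowlist-based flag check can be sketched roughly as follows. This is my own simplified illustration, not the reconstructed code: the &lt;code&gt;READ_ONLY_FLAGS&lt;/code&gt; table and the &lt;code&gt;isReadOnly&lt;/code&gt; function are hypothetical, and the sketch covers only the flag check, not path validation, sandboxing, or AST analysis.&lt;/p&gt;

```typescript
// Illustrative sketch only; this table is hypothetical, not Claude Code's real allowlist.
const READ_ONLY_FLAGS = new Map([
  ["ls", new Set(["-a", "-l", "-h"])],
  ["cat", new Set(["-n"])],
]);

// A command counts as read-only only if it is known and every flag is allowlisted.
function isReadOnly(argv: string[]): boolean {
  const [command, ...rest] = argv;
  const allowed = READ_ONLY_FLAGS.get(command);
  if (allowed === undefined) return false; // unknown command: never assumed safe
  return rest
    .filter((arg) => arg.startsWith("-"))
    .every((flag) => allowed.has(flag));
}

console.log(isReadOnly(["ls", "-l", "src"])); // true
console.log(isReadOnly(["ls", "--color"])); // false
console.log(isReadOnly(["rm", "-rf", "build"])); // false
```

&lt;p&gt;Because only exact matches pass, anything the table does not recognize is rejected by default, including combined short flags such as &lt;code&gt;-la&lt;/code&gt;. That default-deny behavior is the property that matters in a check like this.&lt;/p&gt;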

&lt;p&gt;So while Bash is clearly a powerful general-purpose tool inside Claude Code, it does not look like something the model is given raw. Instead, it seems to sit on top of a fairly thick deterministic scaffold before the model is allowed to use it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The comments are unusually good
&lt;/h2&gt;

&lt;p&gt;Another thing that stood out was the quality of the comments. By that, I do not just mean that there are many comments.&lt;/p&gt;

&lt;p&gt;In several places, the comments explain not only what the code is doing but why certain decisions were made: why a heavy operation needs to run before imports, why a given validator is necessary, or why a particular flag should not be treated as safe. They carry background reasoning, not just surface-level description.&lt;/p&gt;

&lt;p&gt;That makes the code easier for humans to follow, of course, but it also felt like the sort of writing that would remain legible to future code-completion systems or coding agents as well.&lt;/p&gt;

&lt;p&gt;People often say these days that comments should be kept to a minimum. But reading code like this is a good reminder that good comments are not clutter. They are part of the design.&lt;/p&gt;

&lt;h2&gt;
  
  
  Even the startup path shows product-level polish
&lt;/h2&gt;

&lt;p&gt;Looking around the entry path, it becomes clear that this product is not only concerned with adding features. It is also carefully tuned around perceived performance. The code is explicit about which side effects should run before heavier imports and what can be parallelized to reduce startup latency.&lt;/p&gt;

&lt;p&gt;When people talk about AI agents, attention tends to go first to prompts and loops. But in practice, details like startup optimization and other non-AI engineering work are often what determine how polished the product feels.&lt;/p&gt;

&lt;h2&gt;
  
  
  “Being visible” is not the same thing as “being open source”
&lt;/h2&gt;

&lt;p&gt;Finally, I want to emphasize the most important point.&lt;/p&gt;

&lt;p&gt;What became visible in this case was that some source code could be read because of the way published artifacts were left exposed. That is not the same thing as Anthropic officially releasing Claude Code as open source.&lt;/p&gt;

&lt;p&gt;Those two things need to be kept clearly separate. Anthropic’s current terms include restrictions aimed at preventing the construction of competing products, service replication, and reverse engineering. So treating this as an interesting code-reading exercise is one thing; assuming that the code can therefore be freely reused or redistributed is something else entirely.&lt;/p&gt;

&lt;p&gt;There is value in reading it. But “readable” and “freely usable” are not the same thing, and it is important not to blur that distinction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;What made this source-reading exercise interesting was not a generic takeaway like “Claude Code runs an agentic loop.” The more interesting part was seeing, in concrete form, which parts were made deterministic, which parts were injected as runtime context, and where the safety mechanisms were made deliberately thick.&lt;/p&gt;

&lt;p&gt;At least within the portion that could be reconstructed, the prompts were more client-side than I expected, Bash was more heavily guarded than I expected, the tool surface was narrower than I expected, and the comments were more thoughtful than I expected. The overall codebase is well organized, but at the same time it still has a little of the human roughness you would expect in a real product—for example, the way prompt construction seems to be spread across multiple layers.&lt;/p&gt;

&lt;p&gt;That mix of order and messiness is part of what makes the codebase interesting to me. In the end, that is what I wanted to capture in this memo.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>cli</category>
      <category>javascript</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Code Review with multiple AIs</title>
      <dc:creator>Teruo Kunihiro</dc:creator>
      <pubDate>Fri, 19 Dec 2025 02:00:48 +0000</pubDate>
      <link>https://dev.to/trknhr/code-review-with-ais-dep</link>
      <guid>https://dev.to/trknhr/code-review-with-ais-dep</guid>
      <description>&lt;p&gt;Hello folks.&lt;br&gt;
Have you ever wanted to quickly run code reviews using multiple AIs? I have. If you really want to do something like this, you can have an AI generate a script and run it locally right away. Problem solved! …But if we stop there, the blog post ends immediately, so please stick with me for a little longer.&lt;/p&gt;
&lt;h2&gt;
  
  
  The problem I want to solve
&lt;/h2&gt;

&lt;p&gt;In most cases, that really does solve it. But scripts created this way often end up calling pay-as-you-go APIs such as the ChatGPT API. Calling APIs isn’t inherently a problem, but I personally wanted to keep these kinds of tasks within a subscription fee if possible. (Subscriptions also have usage limits, so they’re effectively usage-based too, but with how I use them I rarely hit the limit.)&lt;/p&gt;

&lt;p&gt;AI vendors also offer their own coding agents like Codex, Claude Code, Gemini CLI, and so on. By authenticating inside those coding agents, you can use them within your subscription plan. GitHub Copilot doesn’t develop its own models, but it’s appealing because it’s inexpensive and fixed-price, and lets you try a variety of models.&lt;/p&gt;

&lt;p&gt;So it seems promising to delegate code review to these fixed-price coding agents and compare their results. That way, without issuing API keys, you can internally call multiple coding agents you already use and instantly get second opinions on your code review.&lt;/p&gt;

&lt;p&gt;You might also want to use a team-standard prompt for code reviews. Even if you don’t fully standardize, it’s nice to avoid reinventing prompts each time and use a reasonably well-prepared team-specific one.&lt;/p&gt;
&lt;h3&gt;
  
  
  Then why not run the CLIs in CI?
&lt;/h3&gt;

&lt;p&gt;That is a fair point. But once you run it in CI, bigger questions follow: how do we handle team-wide subscriptions, and would a dedicated SaaS be cheaper and perform better?&lt;/p&gt;

&lt;p&gt;So this time, I wanted something that runs locally at a script-like level. If you want to do it properly, as mentioned, moving it into CI is likely better. But even then, it might still be handy to quickly check multiple reviews locally.&lt;/p&gt;
&lt;h2&gt;
  
  
  The solution
&lt;/h2&gt;

&lt;p&gt;I’d like to introduce &lt;a href="https://github.com/trknhr/ai-utils" rel="noopener noreferrer"&gt;ai-utils&lt;/a&gt;, a tool I made that has made my development flow just a bit more convenient.&lt;/p&gt;
&lt;h3&gt;
  
  
  How it actually behaves
&lt;/h3&gt;

&lt;p&gt;Let’s look at a real review I ran when I added Copilot CLI support to &lt;a href="https://github.com/trknhr/ai-utils" rel="noopener noreferrer"&gt;ai-utils&lt;/a&gt;. Including full results for every AI would be too long, so I’ll show only excerpts.&lt;/p&gt;

&lt;p&gt;I reviewed &lt;a href="https://github.com/trknhr/ai-utils/commit/9c8db862016d71e492367c7d6dc0a16a4abd4480" rel="noopener noreferrer"&gt;this commit&lt;/a&gt;. The review itself used &lt;code&gt;staged-check&lt;/code&gt;, an AI check that runs against staged (pre-commit) files.&lt;/p&gt;

&lt;p&gt;The check is performed using a prompt like the one linked here:&lt;br&gt;
&lt;a href="https://github.com/trknhr/ai-utils/blob/main/internal/template/prompts/staged-check.md" rel="noopener noreferrer"&gt;staged-check.md&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this case, Codex’s feedback captured the situation most accurately. The gaps in the other reviews are understandable for a lightweight check: because the prompt only passes &lt;code&gt;git diff&lt;/code&gt;, the LLMs seem to struggle to grasp the repository as a whole. I’d like to improve that at some point.&lt;/p&gt;

&lt;p&gt;Below are the actual review outputs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Claude Code (Sonnet 4.5)&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;     The code quality is very good. The new Copilot provider implementation follows a structure consistent with the existing Codex provider. The `stripFlagWithValue` helper function is implemented in a properly reusable way and adheres to the DRY principle. Naming is clear and follows Go conventions.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gemini CLI (Gemini 2.5)&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;     * **Code duplication**:
         * **File**: `internal/provider/copilot.go`
         * **Issue**: The `stripFlagWithValue` function is newly defined in `copilot.go`, but this logic is also used for model flag handling in `codex.go`. Currently, `codex.go` cannot access this function defined in `copilot.go`, so similar logic would be (not visible here, but presumably) needed in `codex.go` as well. To avoid code duplication, this helper function should be moved to a shared utility file within the `provider` package (e.g., `provider/utils.go`).
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Codex (GPT-5.2)&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;     - **Fatal timeout unit bug**: In `initConfig()` in `internal/cli/root.go`, `config.ProviderConfig.Timeout` is assigned `120`, but since the type is `time.Duration`, that becomes **120ns** (you likely expected `120*time.Second`). This same value is set when adding `copilot`, so with the default configuration it will time out immediately / become unstable. At minimum, the newly added `copilot` part must be fixed; ideally, fix existing `claude/gemini/codex` similarly.

     - **Where to place the utility function**: `stripFlagWithValue()` is defined in `internal/provider/copilot.go` and is also used from `internal/provider/codex.go` (it works because it’s the same package, but it hurts readability). Moving it to a shared file like `internal/provider/args.go` would make the intent clearer.

     - **Hard-coded strings**: Provider name `"copilot"` and flags like `"--model"` are scattered (e.g., `internal/cli/root.go`, `internal/provider/*`). Even small constants (at least provider names) would improve consistency.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub Copilot CLI (Claude Sonnet 4.5)&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;     **`internal/provider/copilot.go`**: The `stripFlagWithValue` function is newly implemented, but the same logic is also used in `codex.go`. Following the DRY principle, this function should be moved to a common file such as `provider.go` so it can be shared by both providers.

     **`internal/provider/copilot_test.go`**: A trailing newline is missing at the end of the file (line 115). Go formatting conventions expect files to end with a newline.

     **`internal/provider/codex.go`**: An implementation of `stripFlagWithValue` has not been added. With this change, the codex provider may end up with an undefined reference when processing the model flag. The staging may be partial.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Even with some prompt issues, asking multiple AIs to review helped me avoid blindly trusting a single model’s feedback. I didn’t do it this time, but you can also compare SOTA models like Opus 4.5 and Gemini 3.0. In many cases you don’t know which model is best, so being able to benchmark and compare with a single command is very convenient.&lt;/p&gt;
&lt;h2&gt;
  
  
  My OSS project
&lt;/h2&gt;

&lt;p&gt;As mentioned above, &lt;a href="https://github.com/trknhr/ai-utils" rel="noopener noreferrer"&gt;ai-utils&lt;/a&gt; is my own OSS project. It’s small and functionally simple, but seemed useful enough that I decided to build it.&lt;br&gt;
Details are here: &lt;a href="https://github.com/trknhr/ai-utils" rel="noopener noreferrer"&gt;ai-utils&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Concept
&lt;/h3&gt;

&lt;p&gt;Easily run multiple AIs locally within your existing subscription plans.&lt;/p&gt;
&lt;h3&gt;
  
  
  Problems it solves
&lt;/h3&gt;

&lt;p&gt;There are plenty of OSS tools like this. But the three things I specifically wanted to solve were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I don’t want to issue API keys&lt;/li&gt;
&lt;li&gt;I want to rewrite prompts in my own style&lt;/li&gt;
&lt;li&gt;I want to compare responses from multiple AIs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I couldn’t find an OSS project that satisfied all three, so I chose to build one. In the AI era, it’s easy to build what you want, so I was able to overcome the cost of “reinventing the wheel.”&lt;/p&gt;
&lt;h3&gt;
  
  
  How to use
&lt;/h3&gt;

&lt;p&gt;On macOS, you can install easily with Homebrew:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew tap trknhr/homebrew-tap

brew &lt;span class="nb"&gt;install &lt;/span&gt;aiu
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On Linux, run the install shell script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-sSfL&lt;/span&gt; https://raw.githubusercontent.com/trknhr/ai-utils/main/install.sh | sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It only works if at least one supported coding agent, such as Claude Code or Codex, is installed and ready to use.&lt;/p&gt;

&lt;h3&gt;
  
  
  Trying it out
&lt;/h3&gt;

&lt;p&gt;Using &lt;code&gt;commit-msg&lt;/code&gt;, you can generate a commit message based on staged files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aiu commit-msg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With &lt;code&gt;-m&lt;/code&gt;, you can run multiple AIs in parallel.&lt;/p&gt;

&lt;p&gt;You can also run your own prompts. Inside a prompt file, &lt;code&gt;{{$ }}&lt;/code&gt; executes a command, so you can dynamically pass the command output to the AI.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Just say {{$ date }}.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This passes the current time to the AI, and it will return only the current time. Using the same mechanism, the review task passes things like &lt;code&gt;git diff&lt;/code&gt;.&lt;/p&gt;
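
&lt;p&gt;Conceptually, the expansion step looks something like the TypeScript sketch below. Treat it as an illustration under assumptions rather than the actual implementation: ai-utils is written in Go, the names &lt;code&gt;CommandRunner&lt;/code&gt;, &lt;code&gt;COMMAND_PLACEHOLDER&lt;/code&gt;, &lt;code&gt;renderPrompt&lt;/code&gt;, and &lt;code&gt;fakeRunner&lt;/code&gt; are hypothetical, and the runner returns canned output instead of spawning a real process.&lt;/p&gt;

```typescript
// Illustrative sketch only; ai-utils is written in Go and its real parser may differ.
type CommandRunner = (command: string) => string;

// Matches {{$ some command }} and captures the command text without padding.
const COMMAND_PLACEHOLDER = /\{\{\$\s*(.*?)\s*\}\}/g;

// Replace every placeholder with the output of the injected runner.
function renderPrompt(template: string, run: CommandRunner): string {
  return template.replace(COMMAND_PLACEHOLDER, (_match: string, command: string) => run(command));
}

// Stand-in runner with canned output, so the sketch never spawns a process.
const fakeRunner: CommandRunner = (command) =>
  command === "date" ? "Fri Dec 19 02:00:48 UTC 2025" : "";

console.log(renderPrompt("Just say {{$ date }}.", fakeRunner));
// Just say Fri Dec 19 02:00:48 UTC 2025.
```

&lt;p&gt;Injecting the runner keeps the template logic separate from process execution. A real tool would pass in a function that runs the command in a shell and returns its output.&lt;/p&gt;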

&lt;p&gt;So if your team wants custom prompts, you can place team-specific prompts under &lt;code&gt;.aiu/prompts/&lt;/code&gt; and run standardized reviews.&lt;/p&gt;

&lt;h2&gt;
  
  
  About development
&lt;/h2&gt;

&lt;p&gt;The implementation required for this app wasn’t challenging. AI is so good at implementing typical CLI applications that there wasn’t much I had to do myself. What I did was mostly define the spec and write tests, and I found myself thinking “So this is the AI era...” over and over.&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;This tool just calls the coding agents provided by each vendor, but wrapping it up as a CLI makes it surprisingly comfortable.&lt;/p&gt;

&lt;p&gt;Because the tool’s functionality is simple, it’s also an application where it’s easy to let AI handle most of the implementation. Probably about 95% of the code was written by AI.&lt;/p&gt;

&lt;p&gt;It won’t dramatically improve anything by itself, but it helps you move through small daily tasks a little more smoothly.&lt;/p&gt;

&lt;p&gt;If you’re interested, please refer to the &lt;a href="https://github.com/trknhr/ai-utils" rel="noopener noreferrer"&gt;GitHub page&lt;/a&gt; and install it. If you have complaints or requests, please open an Issue.&lt;/p&gt;

</description>
      <category>ai</category>
    </item>
    <item>
      <title>Assessing TOON Token Savings in an MCP Server</title>
      <dc:creator>Teruo Kunihiro</dc:creator>
      <pubDate>Thu, 20 Nov 2025 13:37:09 +0000</pubDate>
      <link>https://dev.to/trknhr/assessing-toon-token-savings-in-an-mcp-server-2b3i</link>
      <guid>https://dev.to/trknhr/assessing-toon-token-savings-in-an-mcp-server-2b3i</guid>
      <description>&lt;p&gt;I have been wiring &lt;a href="https://github.com/toon-format/toon" rel="noopener noreferrer"&gt;TOON&lt;/a&gt; support with &lt;a href="https://www.npmjs.com/package/toon-token-diff" rel="noopener noreferrer"&gt;&lt;code&gt;toon-token-diff&lt;/code&gt;&lt;/a&gt; into &lt;a href="https://github.com/trknhr/toon-token-diff" rel="noopener noreferrer"&gt;this MCP server&lt;/a&gt; to understand whether converting JSON payloads to TOON meaningfully reduces prompt costs. The short answer: TOON is elegant, but in my test harness it delivered microscopic savings for real-world workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  Environment
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Project mode&lt;/strong&gt;: &lt;code&gt;toon-token-diff&lt;/code&gt; in &lt;code&gt;libraryMode&lt;/code&gt; via &lt;code&gt;npm install toon-token-diff&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Models monitored&lt;/strong&gt;: &lt;code&gt;openai&lt;/code&gt; (tiktoken GPT-5 profile) and &lt;code&gt;claude&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration strategy&lt;/strong&gt;: lightweight instrumentation that appends token stats into a JSONL ledger for later analysis
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;estimateAndLog&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;toon-token-diff/libraryMode&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// inside my MCP tool handler&lt;/span&gt;
&lt;span class="nf"&gt;estimateAndLog&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;models&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;openai&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;claude&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;./token-logs.jsonl&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;label&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;mcp_tool_call&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This snippet runs after the MCP tool produces a JSON response. It serializes the payload, estimates TOON vs JSON tokens, and emits a structured record to &lt;code&gt;token-logs.jsonl&lt;/code&gt;. The rest of the MCP server stays untouched—no need to change transport or business logic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Observations
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Timestamp (UTC)&lt;/th&gt;
&lt;th&gt;openai JSON&lt;/th&gt;
&lt;th&gt;openai TOON&lt;/th&gt;
&lt;th&gt;openai Δ (%)&lt;/th&gt;
&lt;th&gt;claude JSON&lt;/th&gt;
&lt;th&gt;claude TOON&lt;/th&gt;
&lt;th&gt;claude Δ (%)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2025-11-19T14:16:54.296Z&lt;/td&gt;
&lt;td&gt;127&lt;/td&gt;
&lt;td&gt;126&lt;/td&gt;
&lt;td&gt;0.79&lt;/td&gt;
&lt;td&gt;130&lt;/td&gt;
&lt;td&gt;129&lt;/td&gt;
&lt;td&gt;0.77&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-11-19T14:17:15.720Z&lt;/td&gt;
&lt;td&gt;53,703&lt;/td&gt;
&lt;td&gt;53,702&lt;/td&gt;
&lt;td&gt;0.0019&lt;/td&gt;
&lt;td&gt;54,977&lt;/td&gt;
&lt;td&gt;54,976&lt;/td&gt;
&lt;td&gt;0.0018&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-11-19T14:17:34.988Z&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;14.29&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;14.29&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-11-19T14:17:39.246Z&lt;/td&gt;
&lt;td&gt;53,703&lt;/td&gt;
&lt;td&gt;53,702&lt;/td&gt;
&lt;td&gt;0.0019&lt;/td&gt;
&lt;td&gt;54,977&lt;/td&gt;
&lt;td&gt;54,976&lt;/td&gt;
&lt;td&gt;0.0018&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-11-19T14:17:48.333Z&lt;/td&gt;
&lt;td&gt;29&lt;/td&gt;
&lt;td&gt;29&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;28&lt;/td&gt;
&lt;td&gt;28&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-11-19T14:18:13.725Z&lt;/td&gt;
&lt;td&gt;91,729&lt;/td&gt;
&lt;td&gt;91,728&lt;/td&gt;
&lt;td&gt;0.0011&lt;/td&gt;
&lt;td&gt;98,607&lt;/td&gt;
&lt;td&gt;98,606&lt;/td&gt;
&lt;td&gt;0.0010&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-11-19T14:21:19.174Z&lt;/td&gt;
&lt;td&gt;127&lt;/td&gt;
&lt;td&gt;126&lt;/td&gt;
&lt;td&gt;0.79&lt;/td&gt;
&lt;td&gt;130&lt;/td&gt;
&lt;td&gt;129&lt;/td&gt;
&lt;td&gt;0.77&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-11-19T14:21:23.370Z&lt;/td&gt;
&lt;td&gt;91,729&lt;/td&gt;
&lt;td&gt;91,728&lt;/td&gt;
&lt;td&gt;0.0011&lt;/td&gt;
&lt;td&gt;98,607&lt;/td&gt;
&lt;td&gt;98,606&lt;/td&gt;
&lt;td&gt;0.0010&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-11-19T14:21:30.314Z&lt;/td&gt;
&lt;td&gt;53,703&lt;/td&gt;
&lt;td&gt;53,702&lt;/td&gt;
&lt;td&gt;0.0019&lt;/td&gt;
&lt;td&gt;54,977&lt;/td&gt;
&lt;td&gt;54,976&lt;/td&gt;
&lt;td&gt;0.0018&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

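&lt;p&gt;Nine consecutive tool runs told the same story: production payloads barely moved. The large payloads (roughly 54k to 99k tokens) shrank by a single token each, a saving of less than 0.002%. Only the intentionally tiny 14-token sample showed double-digit savings, which is irrelevant for backlog-scale prompts.&lt;/p&gt;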
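
&lt;p&gt;The Δ (%) columns in the table above appear to be the share of JSON tokens that TOON removes. The following is a minimal sketch of that arithmetic, assuming this formula (it reproduces the table's figures); the helper name &lt;code&gt;savingsPercent&lt;/code&gt; is hypothetical and not part of &lt;code&gt;toon-token-diff&lt;/code&gt;.&lt;/p&gt;

```typescript
// Percentage of tokens saved by TOON relative to the JSON baseline.
// Assumed formula: (json - toon) / json * 100.
function savingsPercent(jsonTokens: number, toonTokens: number): number {
  if (jsonTokens === 0) return 0; // avoid division by zero for empty payloads
  return ((jsonTokens - toonTokens) / jsonTokens) * 100;
}

// Token counts copied from three rows of the table (openai columns).
const sampleRows = [
  { jsonTokens: 14, toonTokens: 12 },
  { jsonTokens: 127, toonTokens: 126 },
  { jsonTokens: 53703, toonTokens: 53702 },
];

for (const row of sampleRows) {
  const pct = savingsPercent(row.jsonTokens, row.toonTokens);
  console.log(`${row.jsonTokens} -> ${row.toonTokens}: ${pct.toFixed(4)}%`);
}
```

&lt;p&gt;Because the absolute saving stays at one or two tokens, the percentage shrinks as the payload grows, which is why the large responses round to almost nothing.&lt;/p&gt;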

&lt;h2&gt;
  
  
  Why the Reduction Rate Is Flat
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Content dominates token volume&lt;/strong&gt; – The payload body itself accounts for nearly every token, so TOON’s structural tweaks barely register in the total.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Practical Guidance
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Keep TOON handy as a normalization format, but don't promise cost savings without benchmarking your actual payloads.&lt;/li&gt;
&lt;li&gt;Instrument with the &lt;code&gt;libraryMode&lt;/code&gt; snippet above before you ship; it gives you historical evidence of whether TOON helps.&lt;/li&gt;
&lt;li&gt;If savings are negligible, redirect effort toward higher-impact tactics: pruning unused fields, batching small tool calls, or applying semantic compression upstream.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Next Experiments
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Compare with alternative tokenizers (Gemini, Llama) to see whether non-GPT vocabularies respond differently.&lt;/li&gt;
&lt;li&gt;Add diff tooling that highlights specific fields TOON shrinks, so we can manually prune them if needed.&lt;/li&gt;
&lt;li&gt;Explore policy-driven trimming (e.g., dropping debug blobs) prior to TOON conversion.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;TOON remains a clever serialization trick, but as my MCP experiment showed, it is not an automatic token economy lever. Measure, log, and decide based on real numbers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.npmjs.com/package/toon-token-diff" rel="noopener noreferrer"&gt;&lt;code&gt;toon-token-diff&lt;/code&gt; on npm&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/toon-format/toon" rel="noopener noreferrer"&gt;toon-format/toon on GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/trknhr/toon-token-diff" rel="noopener noreferrer"&gt;trknhr/toon-token-diff on GitHub&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>toon</category>
    </item>
    <item>
      <title>ai-docs: Managing AI-Generated Context Files</title>
      <dc:creator>Teruo Kunihiro</dc:creator>
      <pubDate>Sun, 06 Jul 2025 14:50:41 +0000</pubDate>
      <link>https://dev.to/trknhr/managing-ai-generated-context-files-with-ai-docs-keep-your-main-branch-clean-1lcl</link>
      <guid>https://dev.to/trknhr/managing-ai-generated-context-files-with-ai-docs-keep-your-main-branch-clean-1lcl</guid>
      <description>&lt;h1&gt;
  
  
  Why I Built &lt;code&gt;ai-docs&lt;/code&gt;: Managing the Growing Chaos of AI Context Files
&lt;/h1&gt;

&lt;p&gt;When developing alongside AI agents, one of the first headaches that arises is how to manage the flood of context files they generate.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Here are a few specific challenges I kept facing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;As your AI coding assistant evolves, you naturally want to externalize and back up its memory files.&lt;/li&gt;
&lt;li&gt;These files are not deterministic and will inevitably differ across local environments and between developers.&lt;/li&gt;
&lt;li&gt;Git merges often lead to nasty conflicts.&lt;/li&gt;
&lt;li&gt;During code review, these files just get in the way.&lt;/li&gt;
&lt;li&gt;Yet simply ignoring them with &lt;code&gt;.gitignore&lt;/code&gt; is risky, because they can disappear. You still want to back them up remotely.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s when I realized: maybe these files don't belong in the main branch at all. And that's how &lt;code&gt;ai-docs&lt;/code&gt; was born.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/trknhr/ai-docs" rel="noopener noreferrer"&gt;GitHub - trknhr/ai-docs&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Spark
&lt;/h2&gt;

&lt;p&gt;The idea hit me during a casual meeting. What if we isolate AI-related files on a separate Git branch and mount them as a worktree? That way, we could keep them versioned and visible, without polluting the main development flow.&lt;/p&gt;

&lt;p&gt;Two days and one impulsive coding spree later, I had a working prototype. Like any proper AI-era project, I co-built it with ChatGPT and Claude.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Built It
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;I brainstormed the ideal workflow with ChatGPT.&lt;/li&gt;
&lt;li&gt;When the conversation alone didn’t give me clarity, I prototyped locally using Git worktrees.&lt;/li&gt;
&lt;li&gt;I summarized everything into a spec file and let Claude Code scaffold the CLI.&lt;/li&gt;
&lt;li&gt;Then I tested, tweaked, and patched wherever things didn’t behave as expected.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What &lt;code&gt;ai-docs&lt;/code&gt; Does
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;ai-docs&lt;/code&gt; is a CLI tool that helps you manage AI assistant context files by separating them into an isolated Git branch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creates an isolated branch named &lt;code&gt;@ai-docs/{username}&lt;/code&gt;, where &lt;code&gt;{username}&lt;/code&gt; is taken from your name in the config file, your &lt;code&gt;git user.name&lt;/code&gt;, or your hostname&lt;/li&gt;
&lt;li&gt;Mounts this branch locally at &lt;code&gt;.ai-docs/&lt;/code&gt; via Git worktree&lt;/li&gt;
&lt;li&gt;Moves files like &lt;code&gt;memory-bank/&lt;/code&gt; and &lt;code&gt;CLAUDE.md&lt;/code&gt; to this branch&lt;/li&gt;
&lt;li&gt;Automatically updates &lt;code&gt;.gitignore&lt;/code&gt; in &lt;code&gt;main&lt;/code&gt; to prevent tracking those files&lt;/li&gt;
&lt;li&gt;Provides &lt;code&gt;pull&lt;/code&gt; and &lt;code&gt;push&lt;/code&gt; commands to sync changes&lt;/li&gt;
&lt;/ul&gt;
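
&lt;p&gt;To make the branch naming and worktree mount concrete, here is a rough sketch. It is not the actual &lt;code&gt;ai-docs&lt;/code&gt; implementation: the &lt;code&gt;NameSources&lt;/code&gt; type, the function names, and the precedence order (config file, then &lt;code&gt;git user.name&lt;/code&gt;, then hostname) are all assumptions made for illustration, and it does not sanitize names that contain spaces.&lt;/p&gt;

```typescript
// Hypothetical inputs; the real ai-docs CLI may read these differently.
interface NameSources {
  configName?: string;  // name from the ai-docs config file
  gitUserName?: string; // value of `git config user.name`
  hostname?: string;    // machine hostname
}

// Returns the first non-empty source. The precedence order is an assumption.
function resolveUsername(sources: NameSources): string {
  const candidates = [sources.configName, sources.gitUserName, sources.hostname];
  for (const candidate of candidates) {
    const trimmed = (candidate ?? "").trim();
    if (trimmed !== "") return trimmed;
  }
  throw new Error("no username source available");
}

// Branch name in the @ai-docs/{username} form described above.
function branchName(sources: NameSources): string {
  return `@ai-docs/${resolveUsername(sources)}`;
}

// Git arguments that check out an existing branch as a worktree at .ai-docs/.
function worktreeCommand(sources: NameSources): string[] {
  return ["git", "worktree", "add", ".ai-docs", branchName(sources)];
}

console.log(worktreeCommand({ gitUserName: "alice", hostname: "alice-mac" }).join(" "));
```

&lt;p&gt;Keeping the name resolution separate from the Git call makes it easy to check which branch would be used before anything touches the repository.&lt;/p&gt;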

&lt;h2&gt;
  
  
  Challenges I Faced
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Claude Code and the Danger of &lt;code&gt;rm -rf&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The initial versions made liberal use of &lt;code&gt;rm -rf&lt;/code&gt;, which ended up deleting my &lt;code&gt;.git&lt;/code&gt; folder. A brutal reminder that you should &lt;em&gt;never&lt;/em&gt; blindly run AI-generated code.&lt;/p&gt;

&lt;p&gt;I later restricted file deletions to cases where the &lt;code&gt;--force&lt;/code&gt; flag is used, and leaned more heavily on safe &lt;code&gt;git&lt;/code&gt; commands.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. GitHub Actions: Trial and (Mostly) Error
&lt;/h3&gt;

&lt;p&gt;I wanted to set up automatic releases using GoReleaser + GitHub Actions. But it was a frustrating loop of misconfigurations, outdated AI suggestions, and documentation-diving. I learned a lot, but definitely want to improve my speed here next time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Usage (macOS Recommended)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew tap trknhr/homebrew-tap
brew &lt;span class="nb"&gt;install &lt;/span&gt;ai-docs

&lt;span class="c"&gt;# First-time setup (may need to run twice to initialize config)&lt;/span&gt;
ai-docs init &lt;span class="nt"&gt;-v&lt;/span&gt;

&lt;span class="c"&gt;# Push local AI context files to remote&lt;/span&gt;
ai-docs push &lt;span class="nt"&gt;-v&lt;/span&gt;

&lt;span class="c"&gt;# Pull updates from remote&lt;/span&gt;
ai-docs pull &lt;span class="nt"&gt;-v&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Options like &lt;code&gt;--dry-run&lt;/code&gt; and &lt;code&gt;--force&lt;/code&gt; are supported and useful during testing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary: A Clean Home for Your AI Files
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;ai-docs&lt;/code&gt; helps you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Keep your working branches clean&lt;/strong&gt;: AI context files live elsewhere&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Access files easily&lt;/strong&gt;: via &lt;code&gt;.ai-docs/&lt;/code&gt; worktree&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sync with ease&lt;/strong&gt;: using simple &lt;code&gt;push&lt;/code&gt; and &lt;code&gt;pull&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s still a rough-around-the-edges tool, but it works well enough to use daily.&lt;/p&gt;

&lt;p&gt;If you're building with AI and want to keep things organized, give &lt;code&gt;ai-docs&lt;/code&gt; a try. Feedback on GitHub or X (Twitter) would be amazing!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/trknhr/ai-docs" rel="noopener noreferrer"&gt;GitHub - trknhr/ai-docs&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Happy vibe coding!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>coding</category>
      <category>go</category>
    </item>
    <item>
      <title>Cha Cha Chat with AI in Local</title>
      <dc:creator>Teruo Kunihiro</dc:creator>
      <pubDate>Tue, 19 Dec 2023 04:33:31 +0000</pubDate>
      <link>https://dev.to/trknhr/cha-cha-chat-with-ai-in-local-a37</link>
      <guid>https://dev.to/trknhr/cha-cha-chat-with-ai-in-local-a37</guid>
      <description>&lt;p&gt;Hello everyone. I've recently joined a generative AI team at my &lt;a href="https://nulab.com" rel="noopener noreferrer"&gt;current company&lt;/a&gt;. I don't have much experience with generative AI, so I've been experimenting with running a Large Language Model (LLM) locally to prepare for any future requests to develop an AI chat app like ChatGPT. Since I'm a Japanese speaker, this article focuses on LLMs that handle Japanese.&lt;/p&gt;

&lt;p&gt;Let's get started.&lt;/p&gt;

&lt;h2&gt;
  
  
  About PC Specifications
&lt;/h2&gt;

&lt;p&gt;All experiments in this article were run in the following environment.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model&lt;/strong&gt;: MacBook Pro 14-inch, 2023&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chip&lt;/strong&gt;: Apple M2 Max&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory&lt;/strong&gt;: 64GB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OS&lt;/strong&gt;: macOS 14.1 &lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  About Large Language Models
&lt;/h1&gt;

&lt;p&gt;There are various types of Large Language Models (LLMs), like the well-known GPT, BERT, LLaMA, etc. Given my current knowledge, I won't dive into their differences or specifics, but for this article I chose LLaMA, which is popular among third parties for its accuracy and commercial viability.&lt;/p&gt;

&lt;h1&gt;
  
  
  Just Want to Get It Running
&lt;/h1&gt;

&lt;p&gt;I knew that publicly available LLMs could be found on a site called Hugging Face, but I had no idea how to run them locally. My aim was to create something like ChatGPT for future app implementation ideas.&lt;/p&gt;

&lt;p&gt;After some research, I came across two open source software (OSS) projects: &lt;a href="https://github.com/lm-sys/FastChat/tree/main" rel="noopener noreferrer"&gt;FastChat&lt;/a&gt; and &lt;a href="https://github.com/oobabooga/text-generation-webui" rel="noopener noreferrer"&gt;Text generation web UI&lt;/a&gt;. With these, I was able to run llama2 locally and chat with it.&lt;/p&gt;

&lt;p&gt;For those who just want to try llama2, Hugging Face has a demo page, which is probably the quickest way to experience it: &lt;a href="https://huggingface.co/spaces/ysharma/Explore_llamav2_with_TGI" rel="noopener noreferrer"&gt;Hugging Face Demo for Llama2&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  About Japanese Models
&lt;/h1&gt;

&lt;p&gt;While llama2 performs well in English, it seems far from the level of ChatGPT in Japanese. The responses in Japanese often include English words or are expressed in romanized Japanese. So, I looked for Japanese models.&lt;/p&gt;

&lt;h2&gt;
  
  
  About Youri7B
&lt;/h2&gt;

&lt;p&gt;This is a model pre-trained in Japanese by rinna Co., Ltd., based on llama2. I tried running it using the 'Text generation web UI' mentioned earlier. &lt;a href="https://rinna.co.jp/news/2023/10/20231031.html" rel="noopener noreferrer"&gt;Rinna Youri-7B&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, it didn't work as expected. The model seemed to load correctly in the UI, but all responses were in English. I couldn't figure out why.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running Python Files
&lt;/h2&gt;

&lt;p&gt;I tried running Python scripts as described on the Hugging Face Youri-7B page. This looked simpler than using third-party UIs, and I could have embedded it in an API once it worked. However, with my limited Python knowledge and the script consuming about 30GB of memory, my PC crashed.&lt;/p&gt;

&lt;h1&gt;
  
  
  Discovering Ollama
&lt;/h1&gt;

&lt;p&gt;There were several reasons I couldn't get LLMs running in my local environment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;My lack of Python knowledge&lt;/li&gt;
&lt;li&gt;Many dependencies, which caused difficulties and frustration&lt;/li&gt;
&lt;li&gt;I wanted to ignore runtime environments&lt;/li&gt;
&lt;li&gt;I wanted to avoid troubleshooting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Summing up these points, what I was looking for was an OSS tool with a chat UI that doesn't require specific knowledge of Python or an understanding of dependencies, and that has clear documentation on how to apply models from Hugging Face.&lt;/p&gt;

&lt;p&gt;While browsing the internet, I stumbled upon Ollama. Its documentation seemed minimal but sufficient for my needs. Ollama operates like Docker, with model configuration files and instructions for using models downloaded from Hugging Face. That's what I wanted!&lt;/p&gt;

&lt;h2&gt;
  
  
  Trying Ollama
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Run an LLM for Japanese
&lt;/h3&gt;

&lt;p&gt;I wanted to run the Japanese model Youri, so I set up the Modelfile as suggested in the documentation, like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FROM ./models/rinna-youri-7b-chat-q6_K.gguf

TEMPLATE """[INST] {{ .Prompt }} [/INST] """
PARAMETER num_ctx 4096
PARAMETER stop "[INST]"
PARAMETER stop "[/INST]"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Additionally, I used a gguf model converted by a volunteer from this &lt;a href="https://huggingface.co/mmnga/rinna-youri-7b-chat-gguf" rel="noopener noreferrer"&gt;Hugging Face page&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Running as a server
&lt;/h3&gt;

&lt;p&gt;Ollama runs a local server while the app is running, and setting it up is easy. See the &lt;a href="https://github.com/jmorganca/ollama?tab=readme-ov-file#start-ollama" rel="noopener noreferrer"&gt;README.md&lt;/a&gt; for how to launch the server. I tried one of the user-provided UIs called &lt;a href="https://github.com/ollama-ui/ollama-ui" rel="noopener noreferrer"&gt;ollama-ui&lt;/a&gt; and asked it a question about Japanese history. However, the answer quality in Japanese was lower than in English.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs6i229xxblkx7qqqc8cc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs6i229xxblkx7qqqc8cc.png" alt="Asking about the history of Japan in Japanese. The AI responds with a short answer." width="800" height="158"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fca3dnd9h67yn8nmmvirq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fca3dnd9h67yn8nmmvirq.png" alt="Asking about the history of Japan in English. The AI responds with a brief but sufficient overview of Japan's history in English." width="800" height="944"&gt;&lt;/a&gt;&lt;/p&gt;
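
&lt;p&gt;Besides a chat UI, the local server can also be called directly over HTTP. The sketch below builds a non-streaming request for Ollama's &lt;code&gt;/api/generate&lt;/code&gt; endpoint on the default port 11434. The model name &lt;code&gt;youri-chat&lt;/code&gt; and the helper &lt;code&gt;buildGenerateCall&lt;/code&gt; are hypothetical; use whatever name you gave the model when creating it from the Modelfile.&lt;/p&gt;

```typescript
// Shape of a non-streaming request body for Ollama's /api/generate endpoint.
interface GenerateRequest {
  model: string;
  prompt: string;
  stream: boolean;
}

// Builds the URL and fetch options without sending anything.
function buildGenerateCall(
  model: string,
  prompt: string,
  baseUrl: string = "http://localhost:11434",
) {
  const body: GenerateRequest = { model, prompt, stream: false };
  return {
    url: `${baseUrl}/api/generate`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(body),
    },
  };
}

// "youri-chat" is a hypothetical model name; the prompt asks, in Japanese,
// for a brief summary of the history of Japan.
const call = buildGenerateCall("youri-chat", "日本の歴史を簡単に教えてください。");
console.log(call.url);

// With the server running, send it with: const res = await fetch(call.url, call.init);
// The generated text is in the `response` field of the returned JSON.
```

&lt;p&gt;Setting &lt;code&gt;stream&lt;/code&gt; to &lt;code&gt;false&lt;/code&gt; returns one JSON object instead of a stream of partial results, which is simpler for a first experiment.&lt;/p&gt;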

&lt;h2&gt;
  
  
  Insights Gained While Running Ollama
&lt;/h2&gt;

&lt;p&gt;While exploring the Ollama repository, I noticed it was written in Go, which piqued my interest in how it runs LLaMA. It turns out that Ollama uses llama.cpp for execution, which appears to be an app designed to run LLaMA smoothly on Mac. llama.cpp itself does not seem to depend on Python and uses C++ instead, wrapping up the complex parts and making them accessible even to those with little understanding, like myself.&lt;/p&gt;

&lt;h1&gt;
  
  
  Exploring Frontend LLM
&lt;/h1&gt;

&lt;p&gt;I had heard rumors about running LLaMA as WebAssembly (WASM) on the frontend. So, I looked into some ambitious projects like &lt;a href="https://github.com/dmarcos/llama2.c-web" rel="noopener noreferrer"&gt;llama2.c-web&lt;/a&gt; and &lt;a href="https://github.com/mlc-ai/web-llm" rel="noopener noreferrer"&gt;WebLLM&lt;/a&gt;, which run LLMs on WASM. Running LLMs on the frontend is fascinating because it allows immediate responses without network dependency, which is ideal for quick-response needs like voice input or text summarization. I tried both platforms, and they worked impressively.&lt;br&gt;
A configuration where lightweight, rapid-response tasks are handled at the edge, while relatively heavier tasks are managed by server-based LLMs, appears to have high potential for scalability.&lt;/p&gt;

&lt;p&gt;Chat with llama2 on a web browser.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqah6prokvyu471o565m8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqah6prokvyu471o565m8.png" alt="The image depicts a chat interface where a user is asking about the capital of Japan." width="800" height="571"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Check those demos out! They are fantastic.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://webllm.mlc.ai/#chat-demo" rel="noopener noreferrer"&gt;https://webllm.mlc.ai/#chat-demo&lt;/a&gt;&lt;br&gt;
&lt;a href="https://diegomarcos.com/llama2.c-web" rel="noopener noreferrer"&gt;https://diegomarcos.com/llama2.c-web&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Try WebLLM
&lt;/h2&gt;

&lt;p&gt;WebLLM is one of the MLC-LLM projects that compiles LLMs for web execution. By compiling the models, it enables them to run on various device runtimes prepared by MLC-LLM. This means you can create LLMs that run in the browser's WASM runtime without depending on Python modules. For users, it's quite amazing that simply loading the model in the browser can start a chat like magic.&lt;/p&gt;

&lt;p&gt;Reference: &lt;a href="https://llm.mlc.ai/docs/get_started/project_overview.html" rel="noopener noreferrer"&gt;MLC-LLM Project Overview&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To run youri7b-chat, as described above, the model needs to be compiled first. For this, I referred to the following documentation and proceeded with the compilation:&lt;br&gt;
&lt;a href="https://llm.mlc.ai/docs/compilation/compile_models.html" rel="noopener noreferrer"&gt;Compile Models - MLC-LLM&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While going through the documentation, I realized that emscripten also needs to be installed, so I prepared that as well:&lt;br&gt;
&lt;a href="https://emscripten.org/docs/getting_started/downloads.html#installation-instructions-using-the-emsdk-recommended" rel="noopener noreferrer"&gt;Emscripten Installation Instructions&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once everything was ready and the compilation was done, I found something called simple-chat in the examples directory of webllm, which I decided to run locally:&lt;br&gt;
&lt;a href="https://github.com/mlc-ai/web-llm/tree/main/examples/simple-chat" rel="noopener noreferrer"&gt;Simple-Chat Example - WebLLM&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The compilation and web server setup went smoothly, but then the chat didn't work, and I have no idea how to get it working.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fneq6pbeev3nxpz3uk8a2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fneq6pbeev3nxpz3uk8a2.png" alt="A chat interface where a user is asking about the capital of Japan, but an error associated with WebGPU occurs." width="800" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Wrap-up
&lt;/h1&gt;

&lt;p&gt;This journey was solely about exploring and running OSS in my local environment, and I didn't write a single line of code. It highlighted the power of the OSS community and deepened my respect for everyone developing OSS. I hope to contribute to the LLM ecosystem in some way in the future.&lt;/p&gt;

&lt;p&gt;In conclusion, while there were many challenges, it was a learning experience. M2 Macs can handle these models surprisingly well, encouraging me to keep experimenting. Goodbye for now.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
