DEV Community: Teruo Kunihiro

Apple’s container Just Hit v1.0.0

Teruo Kunihiro — Wed, 10 Jun 2026 13:50:19 +0000

Apple’s container has finally reached v1.0.0.

The name is a bit too generic, so in this article I’ll call it Apple’s container.

At first glance, it is tempting to describe it as “Apple’s Docker.” But that is not quite accurate. Apple’s container is a CLI tool for creating and running Linux containers as lightweight virtual machines on macOS. It is written in Swift, optimized for Apple silicon, and works with OCI-compatible container images, so it can pull images from standard registries and push images that you build yourself. The GitHub repository currently lists 1.0.0 as the latest release, dated June 9, 2026. (GitHub)

I have not used it heavily in production yet, so this is mostly a documentation-based first look rather than a deep hands-on review.

What is Apple’s `container`?

Apple’s container is a tool for running Linux containers on a Mac.

The official README describes it as a tool that lets you create and run Linux containers as lightweight virtual machines on macOS. It consumes and produces OCI-compatible container images, which means the workflow should feel familiar if you already use Docker, Podman, or other OCI-based tools. (GitHub)

For example, the basic commands look very Docker-like:

container run -it ubuntu:latest /bin/bash
container build -t my-app:latest .
container image pull alpine:latest

The command reference includes familiar operations such as container run, container build, container create, container exec, container logs, container start, and container stop. It also supports options such as volume mounts, memory/CPU configuration, port publishing, Rosetta support, and SSH agent forwarding. (GitHub)

It has a Docker-like CLI, and it understands OCI images, but it is not a Docker daemon-compatible implementation. It is a different implementation built around Apple’s Containerization framework and macOS virtualization technologies.

`container machine`: the interesting part

One of the most interesting features is container machine.

A normal application container is usually modeled around one process or one application. A container machine, on the other hand, is modeled more like a persistent Linux environment on your Mac.

Apple describes container machines as fast, lightweight, persistent Linux environments based on standard OCI images. They also provide host integrations such as automatic username and home directory sharing. (GitHub)

For example:

container machine create alpine:latest --name dev
container machine run -n dev

This gives you a Linux environment that is closer to a small development machine than a one-shot container.

Your macOS home directory can be mounted inside the container machine, so you can edit code with your Mac editor or IDE while building and running the project inside Linux. Apple’s docs describe this as “edit on the Mac, build inside.” (GitHub)

It can also run the image’s init system. If the image includes systemd, you can run services such as PostgreSQL using commands like:

systemctl start postgresql

Apple’s documentation explicitly calls out this use case for testing real Linux services inside a container machine. (GitHub)

This feels less like a “Docker replacement” and more like a Mac-native, WSL-like Linux development environment.

That is probably the part I am most interested in.

Can it replace Docker Desktop?

For some workflows, maybe.

For many real-world team workflows, probably not yet.

Docker Desktop is not just a container runner. Docker’s documentation describes Docker Desktop as an application for Mac, Linux, and Windows that lets you build, share, and run containerized applications. It includes Docker Engine, Docker CLI, Docker Build, Docker Compose, Docker Scout, and Kubernetes. (Docker Documentation)

Apple’s container covers an important subset of that world:

It can run containers.
It can build OCI images.
It can pull and push images.
It has familiar container lifecycle commands.
It integrates nicely with Apple silicon and macOS.

But it does not look like a drop-in Docker Desktop replacement.

The biggest practical gap for many developers is Docker Compose. A lot of local development environments are built around docker compose up, especially for apps that need a database, Redis, background workers, and multiple services.

There is a third-party project called container-compose, and Homebrew also has a container-compose formula, but that still means relying on a non-Apple bridge for a very central part of many workflows. (Homebrew Formulae)

For single containers, isolated experiments, image builds, and lightweight Linux dev environments, Apple’s container looks very promising.

For Compose-heavy development environments, Docker Desktop is still the safer default.

The biggest architectural difference

The biggest difference between Docker Desktop and Apple’s container is how they use virtual machines.

macOS does not have a Linux kernel. So if you want to run Linux containers on a Mac, some kind of Linux environment is required.

Docker Desktop for Mac uses a Linux VM to run containers. Docker’s documentation says Docker Desktop supports multiple Virtual Machine Managers to power the Linux VM that runs containers. (Docker Documentation)

Conceptually, Docker Desktop looks like this:

Docker Desktop for Mac:

macOS
  └─ Linux VM
       ├─ container A
       ├─ container B
       └─ container C

Apple’s container takes a different approach.

Instead of putting many containers inside one shared Linux VM, it runs a lightweight VM for each container. Apple’s technical overview says this gives each container VM-level isolation, lets the user mount only the necessary host data into each VM, and aims for memory usage lower than full VMs with boot times comparable to containers inside a shared VM. (GitHub)

Conceptually:

Apple container:

macOS
  ├─ lightweight VM ─ container A
  ├─ lightweight VM ─ container B
  └─ lightweight VM ─ container C

That design is very Apple-like.

It keeps the container workflow, but moves the isolation boundary closer to a VM boundary. This is attractive for security and privacy, especially when running code you do not fully trust.

At WWDC, Apple also described Containerization as running each container inside its own lightweight VM while still providing sub-second start times. Each container can also get its own dedicated IP address, which can remove the need for individual port mappings in some cases. (Apple Developer)

Of course, there are trade-offs.

Apple’s technical overview notes that memory pages freed inside a container VM are not always returned to the macOS host. If you run many memory-intensive containers, you may need to restart them occasionally to reduce memory usage. (GitHub)

So I would not assume this is automatically lighter than Docker Desktop for every workload. It needs real-world testing.

Docker-like, but not Docker-compatible

The CLI looks familiar, but compatibility is not the same thing as similarity.

For basic usage, you might imagine something like this:

alias docker='container'

And for simple commands, that may feel surprisingly natural.

But the broader Docker ecosystem is not just the CLI shape. Many tools expect the Docker Engine API, the Docker socket, or Docker Compose behavior.

There is already a GitHub issue asking Apple’s container to expose the Docker Engine API through something like /var/run/docker.sock, and that issue is marked “Closed as not planned.” (GitHub)

Another issue requesting Moby API support says that such support would be a prerequisite for Docker Compose to support Apple’s runtime, but that issue is marked as a duplicate. (GitHub)

So at least right now, I would not think of Apple’s container as a Docker Desktop drop-in replacement.

OS support

The official README says you need an Apple silicon Mac to run container, and that macOS 26 is the supported target because the project relies on new virtualization and networking features in that release. (GitHub)

Homebrew already provides a formula, so installation is a single command:

brew install container

The Homebrew formula lists stable version 1.0.0, requires arm64 architecture, and lists macOS 15 or newer as a requirement, with Xcode 26 or newer required for building. (Homebrew Formulae)

There is an important nuance here.

Apple’s technical overview says container can run on macOS 15, but with functional and user-experience limitations. For example, macOS 15 has limitations around network isolation, multiple networks, and container IP addresses. The container network commands are not available on macOS 15. (GitHub)

So the cleanest experience seems to be macOS 26 on Apple silicon.
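
The support rules quoted above can be summarized as a small check. This is a minimal sketch in Python, not part of Apple's tooling: the function name `container_support_level` and the three labels are made up for illustration, and the rules encode only what the README and technical overview are quoted as saying in this section.

```python
import platform


def container_support_level(machine: str, macos_version: str) -> str:
    """Classify a Mac as 'full', 'limited', or 'unsupported' (illustrative labels)."""
    if machine != "arm64":
        # The README says an Apple silicon Mac is required.
        return "unsupported"
    try:
        major = int(macos_version.split(".")[0])
    except ValueError:
        # Empty or unparsable version string, e.g. when not running on macOS.
        return "unsupported"
    if major >= 26:
        return "full"
    if major == 15:
        # Runs, but without the container network commands and related features.
        return "limited"
    return "unsupported"


if __name__ == "__main__":
    # platform.mac_ver() returns an empty version string on non-macOS systems.
    print(container_support_level(platform.machine(), platform.mac_ver()[0]))
```

Passing the machine type and version in as arguments, instead of reading them inside the function, keeps the rule easy to test on any operating system.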

Why this matters on macOS

On Linux, containers run on the host Linux kernel using features such as namespaces and cgroups.

On macOS, that is not possible directly because macOS is not Linux.

That is why Docker Desktop, Colima, Rancher Desktop, Podman Machine, and now Apple’s container all have to solve the same core problem:

How do we provide a Linux environment on a Mac without making the developer experience terrible?

Docker Desktop solves this with a managed Linux VM and a polished Docker ecosystem around it.

Apple’s container solves it by making lightweight VMs part of the container abstraction itself.

That distinction matters.

Docker Desktop optimizes for compatibility with the existing Docker ecosystem.

Apple’s container appears to optimize for Apple-platform integration, isolation, and a cleaner VM-per-container model.

Where Apple’s `container` could be useful

I see three strong use cases.

1. Running single containers on Mac

For quick experiments, sandboxing, and running one-off Linux tools, Apple’s container could be very convenient.

container run -it ubuntu:latest /bin/bash

If you do not need Compose, Kubernetes, or deep Docker ecosystem compatibility, this might be enough.

2. A Mac-native Linux development environment

This is where container machine gets interesting.

You can keep using your Mac editor, your Mac terminal, your Mac tools, and still build or test inside a real Linux environment.

This could become a really nice “WSL for Mac” experience.

3. Local sandboxing for untrusted code

Because each container runs inside its own lightweight VM, Apple’s container may be a good fit for running code with stronger isolation than a regular shared-kernel container.

That could be useful for local experiments, CI-like testing, or even AI coding agents that need to run generated code in a safer environment.

I would not call it a complete security solution by itself, but the direction is interesting.

Conclusion

Apple’s container reaching v1.0.0 is a big milestone.

I do not think it means everyone should uninstall Docker Desktop today.

Docker Desktop is still the more complete environment if your workflow depends on Docker Compose, Kubernetes, Docker Engine API compatibility, or existing tools that expect the Docker socket.

But Apple’s container is important because it gives macOS developers an official Apple-native option for running Linux containers.

Docker Desktop usually runs multiple containers inside a shared Linux VM.
Apple’s container runs each container inside its own lightweight VM. (GitHub)

That makes Apple’s approach especially interesting for isolation, privacy, and sandbox-style workflows.

Personally, the feature I am most excited about is not the Docker-like CLI itself. It is container machine.

The idea of having a persistent Linux development environment that integrates naturally with macOS, shares my home directory, lets me edit on the Mac, and build inside Linux feels very promising.

How I Make Claude Code's 5-Hour Usage Window Last Longer on Claude Pro

Teruo Kunihiro — Wed, 27 May 2026 13:46:15 +0000

When using Claude Code with Claude Pro, one problem you will almost certainly run into is the Usage Limit.

The actual usage depends on many things: message length, attached files, conversation history, the model you are using, and the features you enable. Claude Pro has a session-based limit that resets every five hours, as well as a weekly usage limit. (Claude Help Center)

I use Claude Pro with the assumption that I will eventually hit the limit. Because of that, I try to avoid carrying unnecessary context, control when I start heavy work, and move important information out of the conversation and into files.

Use `/clear`

When I start a new task or switch models, I usually run /clear first.

/clear starts a new conversation with an empty context. The previous conversation is still available through /resume, so this does not mean throwing away the work entirely. I use it to separate the current task from old context that is no longer needed. (Claude Code)

The two cases I pay the most attention to are model switching and long idle periods.

In Claude Code, each model has its own prompt cache. Because of that, when you switch models with /model, the next request will reread the entire conversation history without a cache hit, even if the conversation itself has not changed. If you switch models while carrying a long conversation, the first request after the switch can consume a large amount of your usage limit. (Claude Code)

Long idle periods across sessions are also worth watching. For Claude subscriptions, Claude Code uses a one-hour TTL prompt cache for the main conversation. If the conversation is idle for a long time, the cache can expire. For example, if you wait more than an hour for a reset and then resume, the next input may reprocess the long history. (Claude Code)

New Claude Code sessions can also change the cache prefix. The working directory, OS, shell, and git status snapshot can all affect the prefix. The official docs explain that sequential sessions can share the cache only when they are on the same machine and directory, and when the git status snapshot at startup matches. In other words, it is better to think of session resumes as situations where cache misses can easily happen because of TTL expiration or prefix differences. (Claude Code)
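
As a mental model only, the miss conditions described above can be written down as a predicate. This sketch does not reflect how Claude Code actually implements caching: the `SessionPrefix` fields and the function name are assumptions chosen to mirror the factors listed in the docs (machine, directory, git status snapshot, model, and the one-hour TTL).

```python
from dataclasses import dataclass
from datetime import timedelta

# One-hour TTL for the main conversation, per the docs cited above.
CACHE_TTL = timedelta(hours=1)


@dataclass(frozen=True)
class SessionPrefix:
    machine: str
    directory: str
    git_status: str  # snapshot taken at session startup
    model: str       # each model has its own prompt cache


def cache_hit_plausible(previous: SessionPrefix, current: SessionPrefix, idle: timedelta) -> bool:
    """Return True only if none of the miss conditions described above applies."""
    return previous == current and idle < CACHE_TTL
```

If this predicate would be false for the next request, that is a good moment to save state to a file and start from a short context instead of a long history.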

For that reason, before switching models or leaving work idle for a long time, I save the necessary information as a plan or spec file and then run /clear. I want to avoid starting a new session with the first prompt already carrying a huge old conversation history.

If I want to continue the same task, I use /compact instead of /clear. /compact replaces the conversation history with a summary, which invalidates the conversation-layer cache. However, the next turn rebuilds the cache from a much shorter summary, so the rebuild is cheap. Used at a natural stopping point, it helps both with usage and with keeping the model focused. (Claude Code)

Start the session early with `/schedule`

Claude Pro has a session limit that resets every five hours. If I start working at 6 a.m., I can expect resets around 11 a.m. and 4 p.m. during the workday. Of course, the exact remaining usage and reset time should be checked in the Usage screen or with /usage, but the timing of when you start heavy work matters a lot. (Claude Help Center)

Recently, the limits have become more generous, so each five-hour session now allows a fairly large number of tokens. On May 6, 2026, Anthropic announced that it doubled Claude Code's five-hour rate limits for Pro, Max, Team, and Enterprise users, and removed peak-hours limit reductions for Pro and Max users. Claude Code weekly limits were also increased by 50% through July 13. (Anthropic, ClaudeDevs, Business Insider)

Even so, if you use Claude Code heavily on Pro, you can burn through a five-hour session very quickly. A temporary increase in the weekly limit does not prevent you from hitting the session limit if you pack too much heavy work into a short period.

I want to maximize the number of useful resets during my daytime working hours, so I use /schedule to run a routine early in the morning. /schedule creates, updates, and runs Claude Code routines, and those routines run on Anthropic-managed cloud infrastructure. By scheduling something simple, such as a small hello command, I can start the session early and plan the day around the five-hour windows. (Claude Code)

For my schedule, simply controlling when the session starts can turn two useful reset windows into three.
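
The arithmetic behind "two windows into three" can be checked with a short sketch. The 9:00 to 18:00 working day is a hypothetical example, and the code assumes each new five-hour window begins the moment the previous one ends, which is a simplification.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=5)


def usable_windows(first_start, work_start, work_end):
    """List the five-hour windows that overlap [work_start, work_end)."""
    windows = []
    start = first_start
    while start < work_end:
        end = start + WINDOW
        if end > work_start:
            windows.append((start, end))
        start = end
    return windows


day = datetime(2026, 5, 27)


def at(hour):
    """Return the given hour on the example day."""
    return day + timedelta(hours=hour)


late = usable_windows(at(9), at(9), at(18))   # first message at 9:00
early = usable_windows(at(6), at(9), at(18))  # scheduled routine at 6:00
print(len(late), len(early))  # prints "2 3"
```

With a 6:00 start, the windows are 6:00 to 11:00, 11:00 to 16:00, and 16:00 to 21:00, so three of them touch the working day instead of two.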

Write plans and specs to files

For larger Claude Code tasks, I try not to keep plans and specs only inside the conversation. I write them out to files.

This is very important. If the working state exists only in the conversation history, running /clear removes the context. But if the plan or spec is saved as a Markdown file, I can simply ask the next session to read that file and continue.

I sometimes use Superpowers skills such as writing-plans and writing-spec. The Superpowers brainstorming skill stores design specs under docs/superpowers/specs/YYYY-MM-DD-<topic>-design.md, and the writing-plans skill stores implementation plans under docs/superpowers/plans/YYYY-MM-DD-<feature-name>.md. (GitHub, GitHub)
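
If you want to follow the same naming convention by hand, a small helper can build the paths. This is a sketch, not code from Superpowers: the function names and the slug rule (lowercase, alphanumeric runs joined by hyphens) are assumptions, and only the directory layout and date prefix come from the paths quoted above.

```python
import re
from datetime import date
from pathlib import PurePosixPath


def slugify(name: str) -> str:
    """Lowercase the name and join alphanumeric runs with hyphens (assumed rule)."""
    return "-".join(re.findall(r"[a-z0-9]+", name.lower()))


def plan_path(feature: str, day: date) -> PurePosixPath:
    """Path for an implementation plan: plans/YYYY-MM-DD-<feature-name>.md."""
    return PurePosixPath("docs/superpowers/plans") / f"{day.isoformat()}-{slugify(feature)}.md"


def spec_path(topic: str, day: date) -> PurePosixPath:
    """Path for a design spec: specs/YYYY-MM-DD-<topic>-design.md."""
    return PurePosixPath("docs/superpowers/specs") / f"{day.isoformat()}-{slugify(topic)}-design.md"


print(plan_path("Slack retry logic", date(2026, 5, 27)))
```

A predictable path means the first prompt after /clear can be as short as "read the plan file for today and continue".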

The tool itself does not have to be Superpowers. The important point is this: do not use the conversation as the only place where the working state lives.

I keep only "what to do now" in the conversation. Specs, plans, test procedures, and reasons for decisions go into files. That way, I can run /clear without losing productivity.

Use cheaper models when possible

Not every task needs Opus.

The official Claude Code cost management docs say that Sonnet can handle many coding tasks well and is cheaper than Opus. Opus is better reserved for complex architecture decisions and multi-step reasoning. (Claude Code)

To be honest, I do not always optimize this perfectly. But when using Superpowers skills or subagents, simple subtasks are sometimes routed toward Sonnet, which feels like it saves usage. The official docs also mention that simple subagent tasks can be configured to use Haiku. (Claude Code)

On the other hand, for tasks that involve orchestrating multiple subagents or making high-level design decisions, Opus feels more stable to me. If I try too hard to save usage there, I often pay for it later with retries and corrections.

So the pattern I usually follow is similar to the Superpowers SKILL.md: use Sonnet for simple implementation, research, test fixes, and file-level work; use Opus for design decisions, complex debugging, subagent orchestration, and reviewing long plans.

Conclusion

If you use Claude Code on Claude Pro, it is better to assume that you will eventually hit the usage limit.

The most important thing is to avoid carrying unnecessary context. Use /clear when starting a new task or switching models. Use /compact when continuing the same task while cleaning up the context. Do not keep all working state inside a long conversation; write plans and specs to files.

Also, pay attention to when your five-hour session starts. Starting early in the morning makes it easier to take advantage of multiple resets during the day. With /schedule, you can control the start timing of routine work to some extent.

For models, use Sonnet for everyday work and reserve Opus for heavy design decisions and complex orchestration. The goal is not simply to use the cheapest model. The goal is to choose the model that fails the least within your available usage limit.

In the end, saving Claude Code usage is not really about being stingy. It is about managing working state. Keep the session light, move important information into files, and avoid making the model carry everything in the conversation.

Choosing Models for an Agentic Chat App on Amazon Bedrock

Teruo Kunihiro — Mon, 25 May 2026 07:34:35 +0000

When building an agentic chat application on Amazon Bedrock, one of the first hard decisions is model selection.

This article is not a rigorous benchmark or formal evaluation. It is simply a set of practical notes from experimenting with multiple Bedrock models while building a personal agentic chat application. Pricing, supported features, and regional behavior change frequently, so you should always validate with official documentation and your own workload before making production decisions.

The app I’m currently building is a serverless agent that gets invoked from Slack. It receives user messages and dynamically calls tools such as memory, task management, calendar integration, web extraction, and custom skills.

So this is not just a simple chatbot.

user message
  -> model decides tool usage
  -> tool execution
  -> model observes result
  -> sometimes replans
  -> final Slack response

In this setup, model pricing alone is not enough. Tool call stability, Japanese UX quality, retry rate, fallback frequency, and output token volume all matter a lot.
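
The loop in the diagram above can be sketched in a few lines. This is an illustration of the control flow only, not the app's actual code: `run_agent`, the decision dictionary shape, `stub_model`, and `list_tasks` are all hypothetical, and the stub stands in for a real Bedrock call.

```python
def run_agent(user_message, model, tools, max_steps=5):
    """Loop: ask the model, run the requested tool, feed the result back."""
    history = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):
        decision = model(history)
        if decision["type"] == "final":
            return decision["text"]
        tool = tools[decision["tool"]]
        result = tool(**decision["args"])
        history.append({"role": "tool", "name": decision["tool"], "content": result})
    # Step budget exhausted: stop instead of looping forever.
    return "Sorry, I could not finish this request."


def stub_model(history):
    """Stand-in for a model call: request one tool, then answer."""
    if history[-1]["role"] == "user":
        return {"type": "tool_call", "tool": "list_tasks", "args": {"owner": "me"}}
    return {"type": "final", "text": f"You have: {history[-1]['content']}"}


def list_tasks(owner):
    """Hypothetical task-management tool."""
    return ["write spec", "review PR"]


reply = run_agent("What are my tasks?", stub_model, {"list_tasks": list_tasks})
print(reply)
```

Every pass through the loop is another model invocation carrying the growing history, which is why retries and replanning show up directly in the bill.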

My conclusion, at least for now, is that Moonshot AI’s Kimi K2.5 works best as the primary model.

Sonnet Is Expensive

Claude Sonnet is the baseline reference point.

Claude Sonnet 4.5 costs $3 per 1M input tokens and $15 per 1M output tokens, three times the $1 / $5 of Claude Haiku 4.5. Sonnet provides reassuring quality, but the cost becomes significant for agentic chat workloads where output tokens can grow quickly.

Agentic chat systems often invoke the model multiple times for a single user message. Tool schemas, tool results, conversation history, and system prompts all inflate token usage compared to ordinary Q&A applications.

Because of that, I positioned Sonnet like this from the beginning:

Claude Sonnet:
  fallback on failure
  escalation for high-value users
  difficult multi-step reasoning

For the main model, I needed something cheaper than Sonnet while still being more reliable for agentic behavior than lightweight models.

Haiku Is Cheap, but Slightly Weak

Claude Haiku 4.5 is attractive from a pricing perspective. If your architecture benefits heavily from prompt caching, it can become extremely cost efficient for applications with large system prompts and repeated tool schemas.

Bedrock prompt caching reduces input token cost and latency by caching repeated prompt prefixes.

Still, in my own testing, Haiku felt slightly too weak to serve as the main model.

It works well for simple classification, lightweight extraction, and short summaries. But I had concerns about tool selection, replanning stability, Japanese response quality, and multi-step reliability.

So Haiku feels better suited as a helper model rather than the primary agent model.

Claude Haiku:
  routing
  lightweight classification
  lightweight extraction
  first-pass processing

MiniMax M2.5 Is Cheap and Agent-Friendly — but Japanese UX Is Weak

MiniMax M2.5 was one of the strongest candidates.

According to the Bedrock model card, MiniMax M2.5 is positioned as an “agent-native frontier model” optimized for reasoning efficiency, task decomposition, complex workflows, and agentic scaffolding. It supports a 196K context window and 8K maximum output tokens.

The pricing is also extremely competitive.

In the Tokyo region:

Model	Approximate Cost for 1,000 Calls
MiniMax M2.5	~$4.32
Mistral Large 3	~$6.70
Kimi K2.5	~$9.36
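
The table does not state how many tokens each call uses. As a rough check, the Kimi K2.5 and Mistral Large 3 rows are consistent with about 8,000 input tokens and 1,000 output tokens per call at the Tokyo per-token prices quoted later in this article. That call shape is an inference from the numbers, not something stated here, and MiniMax is left out because its per-token price is not quoted.

```python
def cost_per_calls(input_price, output_price, input_tokens, output_tokens, calls=1000):
    """Cost in USD for a batch of calls; prices are per 1M tokens."""
    per_call = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return calls * per_call


# Assumed call shape (inferred from the table, not stated in the article).
INPUT_TOKENS, OUTPUT_TOKENS = 8_000, 1_000

kimi = cost_per_calls(0.72, 3.60, INPUT_TOKENS, OUTPUT_TOKENS)
mistral = cost_per_calls(0.61, 1.82, INPUT_TOKENS, OUTPUT_TOKENS)
print(f"Kimi K2.5: ${kimi:.2f}, Mistral Large 3: ${mistral:.2f}")
```

Plugging in your own measured token counts is more useful than the defaults, since agentic calls with large tool schemas can be far heavier on input.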

On paper, MiniMax M2.5 is very attractive. It also supports Bedrock Agents, Flows, and structured outputs.

However, after actually using it, I felt that the Japanese UX and customer-facing conversational quality were slightly off. It may work well for internal planning or orchestration, but I was not fully comfortable exposing it directly to users in Slack conversations.

MiniMax is probably one of the strongest cost-performance options available today, but I ultimately excluded it as the main chat model.

Gemma Is Extremely Cheap, but Better for First-Pass Processing

The Gemma 3 family was also considered.

In the Tokyo region, Gemma 3 pricing (input / output, per 1M tokens) is extremely low:

Gemma 3 27B: $0.28 / $0.46
Gemma 3 12B: $0.11 / $0.35
Gemma 3 4B: $0.05 / $0.10

At those prices, Gemma becomes very useful for:

classification
lightweight RAG answers
short summaries
routing
first-pass response generation

However, my target workload was an agentic chat main model. Since even Haiku already felt slightly weak for that role, Gemma was difficult to justify as the primary agent.

Nemotron 3 Super 120B

At one point I also evaluated NVIDIA Nemotron 3 Super 120B.

According to the Bedrock model card, Nemotron 3 Super is a 120B-parameter open hybrid MoE model with 12B active parameters. It targets complex multi-agent applications and supports a 256K context window with 32K output tokens.

Pricing is surprisingly low:

$0.18 / 1M input tokens
$0.78 / 1M output tokens

Even cheaper than MiniMax.

On paper, it looked extremely compelling.

However, in my own testing, on-demand invocation latency in the Tokyo region was sometimes very slow, and even short responses occasionally timed out. Meanwhile, in us-east-1, forced tool calls and short responses often completed in around 2–3 seconds.

So I would not conclude that Nemotron itself is fundamentally slow. Regional infrastructure and routing likely have a large impact.

Since my target use case is a customer-facing chat application deployed in Tokyo, I decided not to use it as the main model.

Nemotron 3 Super:
  strong pricing and specs
  tool use works
  but latency in ap-northeast-1 felt risky

Mistral Large 3 Is Good, but Not Decisive

Mistral Large 3 was also a very realistic option.

According to the Bedrock model card, Mistral Large 3 is a 675B-parameter model optimized for coding, reasoning, and multilingual tasks. It supports a 256K context window and 32K output tokens.

In Bedrock Runtime, it supports Agents, Flows, structured outputs, and prompt caching.

Pricing in Tokyo:

$0.61 / 1M input tokens
$1.82 / 1M output tokens

The input price is only slightly lower than Kimi K2.5’s $0.72, but the output price is roughly half of Kimi’s $3.60.

My practical experience with it was not bad at all. But in this specific agentic chat workload, Kimi K2.5 consistently felt more stable.

Also, while the official model card says prompt caching is supported, I occasionally saw Bedrock reject requests when using cachePoint in my own setup.
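
One way to live with occasional rejections is to try the request with a cache checkpoint and retry once without it. The sketch below shows that pattern under stated assumptions: `send` is a hypothetical wrapper around the real client call, and `ValueError` stands in for whatever validation error your client actually raises. Only the `cachePoint` block shape follows the Converse-style request format.

```python
def build_system_blocks(prompt, use_cache):
    """System content blocks for a Converse-style request."""
    blocks = [{"text": prompt}]
    if use_cache:
        # Marks everything before this block as a cacheable prefix.
        blocks.append({"cachePoint": {"type": "default"}})
    return blocks


def converse_with_cache_fallback(send, prompt, messages):
    """Try with a cache checkpoint first; retry once without it if rejected."""
    try:
        return send(system=build_system_blocks(prompt, True), messages=messages)
    except ValueError:
        # Hypothetical: `send` raises ValueError when the request is rejected.
        return send(system=build_system_blocks(prompt, False), messages=messages)
```

The retry costs one extra round trip on failure, so it is worth logging how often it triggers before relying on caching in cost estimates.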

Mistral offers a very good balance between cost and quality, but Kimi ultimately ranked higher for my use case.

Why I Ended Up Choosing Kimi K2.5

In the end, I chose moonshotai.kimi-k2.5 as the main model.

The reason is simple:

Among all the models I tested, it provided the best balance of agentic behavior stability and Japanese UX quality.

According to the Bedrock model card, Kimi K2.5 offers improved reasoning, coding, and multilingual capabilities. It supports a 256K context window, 16K output tokens, and multimodal image input.

Within Bedrock Runtime, it supports:

response streaming
Guardrails
Prompt Management
Flows
Agents
structured outputs

Pricing in Tokyo:

$0.72 / 1M input tokens
$3.60 / 1M output tokens

More expensive than MiniMax or Mistral, but still significantly cheaper than Sonnet.

When selecting models, failure rate matters as much as raw token pricing.

Even if a model is cheap, frequent tool selection failures, malformed JSON, retries, or Sonnet fallbacks can easily increase the total effective cost.

In agentic systems especially, a single bad decision can cascade into failed tool calls and unnecessary replanning.
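
A back-of-the-envelope formula makes this concrete. Every number below is hypothetical: the per-call costs assume 8,000 input and 1,000 output tokens, the "cheap model" price is invented, and the retry and fallback rates are placeholders to replace with your own measurements.

```python
def effective_cost(primary_cost, fallback_cost, fallback_rate, retry_rate=0.0):
    """Expected USD per user message, counting retries and fallbacks."""
    return primary_cost * (1 + retry_rate) + fallback_cost * fallback_rate


# Per-call costs for an assumed 8,000 input / 1,000 output token call.
SONNET_CALL = 0.039    # at $3 / $15 per 1M tokens
KIMI_CALL = 0.00936    # at $0.72 / $3.60 per 1M tokens
CHEAP_CALL = 0.004     # invented price for a hypothetical cheaper model

# Hypothetical failure rates, for illustration only.
cheap_effective = effective_cost(CHEAP_CALL, SONNET_CALL, fallback_rate=0.15, retry_rate=0.20)
kimi_effective = effective_cost(KIMI_CALL, SONNET_CALL, fallback_rate=0.02, retry_rate=0.02)

print(f"cheap model: ${cheap_effective:.5f} per message")
print(f"Kimi K2.5:   ${kimi_effective:.5f} per message")
```

With these made-up rates, the model that costs less than half per call ends up slightly more expensive per message, because each fallback pays the Sonnet price.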

That is why my final evaluation of Kimi K2.5 became:

Not the cheapest model, but the most stable main model.

No Prompt Cache Support for Kimi K2.5 on Bedrock

One unfortunate limitation is prompt caching.

The Bedrock model card for Kimi K2.5 lists support for Agents, Flows, and structured outputs, but does not currently mention prompt caching.

The Bedrock prompt caching documentation explicitly lists which models support cache checkpoints and where they can be inserted (system, messages, or tools). Claude models and some others are listed there, but I found little evidence that Kimi K2.5 supports prompt caching on the Bedrock side.

Moonshot’s direct API does show cache-hit pricing for Kimi K2.5.

However, that does not automatically mean the same cache behavior or pricing applies through Bedrock.

Reducing Cost with Payload Slimming and Flex Tier

Once Kimi K2.5 became the primary model, the next challenge was cost optimization.

Especially output tokens.

The first thing that matters is payload slimming.

That means minimizing:

system prompts
tool schemas
tool results
conversation history
RAG excerpts

In agentic chat systems, tool schemas and tool results can dramatically inflate input token usage.

Some practical optimizations:

limit maxTokens depending on workload
avoid exposing long intermediate reasoning
trim tool results down to only required fields
avoid injecting every tool schema every time
cache repeated FAQ answers, search results, and tool results on the application side

These optimizations matter regardless of which model you choose.
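
Trimming tool results is the easiest of these to show in code. The sketch below is one possible approach, not the app's implementation: `slim_tool_result`, the `max_items` cap, and the calendar event fields are all hypothetical.

```python
def slim_tool_result(result, keep, max_items=5):
    """Keep only the listed fields, and cap list length, before returning to the model."""
    if isinstance(result, list):
        return [slim_tool_result(item, keep, max_items) for item in result[:max_items]]
    if isinstance(result, dict):
        return {key: value for key, value in result.items() if key in keep}
    return result


# Hypothetical raw output from a calendar tool.
raw_events = [
    {"id": "evt_1", "title": "Standup", "start": "09:30", "etag": "abc", "html_link": "https://example.com/1"},
    {"id": "evt_2", "title": "Design review", "start": "14:00", "etag": "def", "html_link": "https://example.com/2"},
]

slim = slim_tool_result(raw_events, keep={"title", "start"})
print(slim)
```

Dropping identifiers and links the model never needs shrinks every later turn as well, because tool results stay in the conversation history.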

I also started experimenting with Bedrock Flex tier.

Bedrock provides Standard, Flex, Priority, and Reserved service tiers. Flex is intended for workloads that can tolerate slightly more variable latency in exchange for lower cost.

AWS documentation specifically mentions these as example Flex workloads:

model evaluation
summarization
agentic workflows

Moonshot Flex pricing on Bedrock is advertised at roughly a 50% discount compared to Standard.

That means Kimi K2.5 in Tokyo becomes approximately:

Tier	Input	Output
Standard	$0.72	$3.60
Flex	~$0.36	~$1.80

Initially, I planned to use Standard for interactive chat and Flex only for asynchronous tasks, evaluations, summaries, and background processing.

However, after trying Kimi K2.5 on Flex, the latency for lightweight Slack interactions felt much better than expected.

This is not a rigorous benchmark, and behavior may differ under heavy load or long tool loops.

Still, for small-scale personal projects or serverless agents, starting with Flex for the main response path actually feels realistic.

My current setup looks roughly like this:

main interactive responses:
  moonshotai.kimi-k2.5 / Flex

async processing, summaries, evaluations:
  moonshotai.kimi-k2.5 / Flex

failure handling and difficult reasoning:
  Claude Sonnet fallback

lightweight classification and routing:
  cheaper helper models
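
Expressed as configuration, the setup above might look like the following sketch. Only the `moonshotai.kimi-k2.5` identifier and the Flex assignments come from this article. The placeholder model names, the tiers for the fallback and helper routes, and the `pick_route` function are assumptions for illustration.

```python
ROUTES = {
    "interactive": {"model": "moonshotai.kimi-k2.5", "tier": "flex"},
    "async": {"model": "moonshotai.kimi-k2.5", "tier": "flex"},
    # Placeholder names: substitute the real model IDs you use.
    "fallback": {"model": "claude-sonnet-placeholder", "tier": "standard"},
    "routing": {"model": "helper-model-placeholder", "tier": "standard"},
}


def pick_route(task_kind, primary_failed=False):
    """Choose a model and tier for a task; fall back when the primary failed."""
    if primary_failed:
        return ROUTES["fallback"]
    # Unknown task kinds are treated as interactive chat.
    return ROUTES.get(task_kind, ROUTES["interactive"])


print(pick_route("async"))
```

Keeping the routing in one table makes it a one-line change to move the interactive path back to Standard if Flex latency becomes a problem.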

Explaining Security Concerns Around Chinese Models

When using Chinese-origin models like Kimi K2.5 or MiniMax M2.5, security concerns often come up inside the organization.

The important point is not to argue that “Chinese models are safe.”

Instead, the distinction between direct API usage and Bedrock-managed usage must be explained clearly.

According to Amazon Bedrock documentation, model providers cannot access Bedrock logs or customer prompts/completions.

That means using Kimi or MiniMax through Bedrock has a very different risk profile compared to directly calling vendor APIs.

The explanation I found most practical was:

We are not sending data directly to a Chinese model provider.
The models are executed within Amazon Bedrock’s managed environment.
Customer prompts and completions are not shared with the model provider through Bedrock.

Therefore, the main operational concerns become:
  IAM
  logging
  Guardrails
  RAG access control
  tool-call permissions

Final Architecture

My final conclusion currently looks like this:

main model:
  moonshotai.kimi-k2.5

interactive tier:
  currently testing Flex
  fallback to Standard if latency becomes problematic

cost-sensitive tier:
  Flex

fallback:
  Claude Sonnet

helper models:
  MiniMax / Gemma / Nemotron for specialized workloads

Model roles ended up being:

Model	Evaluation	Role
Claude Sonnet	Excellent but expensive	fallback / escalation
Claude Haiku	Cheap but slightly weak	routing / extraction
MiniMax M2.5	Cheap and agent-oriented	not ideal for Japanese-facing UX
Gemma 3	Extremely cheap	first-pass processing
Nemotron 3 Super	Cheap, non-Chinese, tool-capable	latency concerns in Tokyo
Mistral Large 3	Strong balance	good, but less stable than Kimi
Kimi K2.5	Strong Japanese UX and tool stability	main model

Closing Thoughts

This whole exploration started from a simple question:

“Sonnet is expensive. Is there a cheaper main model for agentic chat?”

MiniMax M2.5 was extremely attractive in terms of pricing and agent-oriented behavior, but the Japanese customer-facing UX did not fully work for me.

Mistral Large 3 offered an excellent balance overall, but Kimi K2.5 consistently felt more stable.

Nemotron 3 Super 120B looked fascinating from a pricing and specification perspective, but latency in the Tokyo region made it difficult to trust for customer-facing chat.

Haiku can become highly cost efficient with prompt caching, but it still felt slightly weak for my main agent workload.

As a result, I settled on:

Kimi K2.5 as the main model
Sonnet as fallback
Flex tier and payload slimming for cost optimization

For my own use case, Kimi K2.5 was not the absolute cheapest model.

But once retry rates, UX quality, and operational stability were included in the calculation, it delivered the best effective cost.

Going forward, I want to build more formal evaluations around:

conversational quality
tool call success rate
retry frequency
latency
Japanese UX
token cost

Rather than endlessly adding more candidate models, I want to keep pruning the stack into something operationally simple and reliable.

TanStack Was Not the Whole Story: Mini Shai-Hulud Was an npm/PyPI Supply-Chain Worm

Teruo Kunihiro — Wed, 13 May 2026 08:09:17 +0000

This article is based on public reporting available as of 2026-05-13. Mini Shai-Hulud is still an actively tracked campaign, so affected packages and IOCs (indicators of compromise) may change.

In May 2026, a supply-chain compromise was reported across TanStack's npm packages. Malicious versions were published for 42 @tanstack/* packages, and installing those versions triggered a credential stealer.

If you look only at TanStack, the incident can seem like a single npm compromise. But when you read The Hacker News coverage and the analyses from StepSecurity and Socket, it is better understood as part of a broader self-propagating supply-chain campaign called Mini Shai-Hulud.

The important point is that this was not just "a dependency package was compromised." It was closer to a worm that used developer machines and CI/CD environments as stepping stones to reach the next maintainer and the next package ecosystem.

What happened at TanStack

According to the TanStack GitHub Advisory, malicious versions were published to the npm registry for 42 @tanstack/* packages, totaling 84 versions, between 2026-05-11 19:20 and 19:26 UTC. The issue is tracked as CVE-2026-45321 with a CVSS score of 9.6.

The publish was authenticated through the legitimate GitHub Actions OIDC trusted-publisher binding. At the same time, the advisory explains that the publish workflow itself was not modified.

This section is based on TanStack's official postmortem and GitHub Advisory.

At a high level, the TanStack-specific path looked like this:

TanStack-specific path:

checkout and build fork PR code inside pull_request_target
  + GitHub Actions cache poisoning
  + OIDC token extraction from the Actions runner process
  -> malicious publish that looked like it came from the legitimate release path

The issue was not simply the use of pull_request_target. The problem was that the workflow checked out and executed untrusted fork PR code inside a pull_request_target workflow. pull_request_target runs in the context of the base repository, so it should generally be limited to operations that do not execute the contents of the PR, such as labeling or commenting.

TanStack's postmortem explains that bundle-size.yml ran on pull_request_target, checked out the fork PR merge ref, and ran a build for bundle-size measurement. In other words, untrusted fork PR code ran within the base repository's cache scope. That became the entry point for cache poisoning.
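To audit your own repositories for the same pattern, a crude text heuristic is often enough for a first pass. The sketch below is an assumption-laden illustration, not a tool from any of the cited analyses: it flags a workflow file that uses pull_request_target and also checks out a PR-controlled ref. It does not parse YAML, so expect both false positives and false negatives.

```python
import re

# Refs that point at code controlled by the pull request author.
PR_CONTROLLED_REF = re.compile(
    r"github\.event\.pull_request\.head\.(sha|ref)|refs/pull/.+/(merge|head)"
)


def looks_risky(workflow_text: str) -> bool:
    """Heuristic: pull_request_target combined with a checkout of PR code."""
    if "pull_request_target" not in workflow_text:
        return False
    uses_checkout = "actions/checkout" in workflow_text
    return uses_checkout and bool(PR_CONTROLLED_REF.search(workflow_text))
```

A hit only means the workflow deserves a manual read. The real question is whether anything from the checked-out tree is executed or cached afterwards.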

Using similar cache keys between test and release workflows is not unusual by itself. For example, caching a pnpm store based on the hash of pnpm-lock.yaml is a common CI optimization.

The problem is when a cache touched by untrusted PR code can also be restored by the release workflow. A cache is not executed just because it is restored. But if the release workflow later runs pnpm install or a build step that references dependencies or binaries from the restored pnpm store, attacker-controlled code placed there can be invoked.

Using the same cache key:
  common

Letting release restore a cache created by untrusted PR code:
  should not happen

In the TanStack case, a malicious script executed from the fork PR poisoned the pnpm store. The actions/cache post-job save then stored that pnpm store. Later, a release workflow triggered by a push to main restored the same cache. During build, test, or cleanup work, attacker-controlled binaries were invoked, leading to OIDC token extraction and a direct publish to npm.

The malicious package versions included an obfuscated JavaScript payload called router_init.js, roughly 2.3 MB in size. It ran during install and collected AWS IMDS credentials, GCP metadata, Kubernetes service-account tokens, Vault tokens, npm tokens from ~/.npmrc, GitHub tokens, SSH private keys, and more.

That explains how the TanStack release pipeline was abused. But Mini Shai-Hulud becomes more concerning when you look beyond TanStack.

It was not only TanStack

The Hacker News article lists package compromises associated with TeamPCP that went beyond TanStack, including UiPath, Mistral AI-related packages, OpenSearch, and Guardrails AI across npm and PyPI.

Socket also tracked additional compromised artifacts after the initial TanStack reporting, including OpenSearch, PyPI mistralai@2.4.6, PyPI guardrails-ai@0.10.1, and additional Squawk-related npm packages.

The broader campaign can be summarized like this:

Mini Shai-Hulud
  + credential stealing
  + package maintainer enumeration
  + cross-ecosystem infection across npm and PyPI
  + persistence in Claude Code, VS Code, and GitHub Actions

The attacker did not only steal credentials. The malware also enumerated packages that a maintainer could publish to, then republished infected versions. A compromise in one developer machine or CI/CD environment could therefore spread into another package ecosystem.

Worm behavior

The post-install flow looks roughly like this:

Compromised package install
  -> install of an infected package

router_init.js / transformers.pyz
  -> malicious payload execution

credential theft
  -> credential collection
  - GitHub token
  - npm token
  - cloud credentials
  - SSH keys
  - CI secrets

exfiltration
  -> data exfiltration
  - filev2.getsession.org
  - seed1/2/3.getsession.org
  - GitHub GraphQL dead drop

self-propagation
  -> spreading to more packages and repositories
  - enumerate maintainer packages
  - publish infected versions
  - inject workflows / persistence hooks

Stolen data was sent to Session/Oxen-related infrastructure such as filev2.getsession.org and seed1.getsession.org. The Hacker News describes the use of filev2.getsession.org and Session Protocol infrastructure as an attempt to evade detection, since those domains may be less likely to be blocked in enterprise environments.

There was also a fallback path that used stolen GitHub tokens to commit encrypted data to attacker-controlled repositories through the GitHub GraphQL API. This is essentially a dead drop: if the malware cannot send data directly to an external server, it can temporarily place the data in a GitHub repository for later retrieval. The commit author claude@users.noreply.github.com is one IOC to look for in that path.

Persistence and lateral movement

The concerning part is not only the credential theft that happens at install time. The reported persistence and lateral movement surface is broad.

StepSecurity and Socket describe artifacts such as:

.claude/settings.json
.claude/router_runtime.js
.claude/setup.mjs
.vscode/tasks.json
.vscode/setup.mjs
~/Library/LaunchAgents/com.user.gh-token-monitor.plist
~/.config/systemd/user/gh-token-monitor.service
.github/workflows/codeql_analysis.yml

If hooks are installed into Claude Code or VS Code, the stealer can run again when the IDE starts. The gh-token-monitor service is used to monitor and retransmit GitHub tokens.
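A quick way to look for these paths is a small script like the one below. It is only a sketch: the path list is copied from the artifacts above, the function name is made up, and several of these files (for example .vscode/tasks.json or a CodeQL workflow) exist in perfectly healthy projects, so a hit means "open the file and read it," not "this machine is infected."

```python
from pathlib import Path
from typing import List

# Paths relative to a repository checkout.
REPO_ARTIFACTS = [
    ".claude/settings.json",
    ".claude/router_runtime.js",
    ".claude/setup.mjs",
    ".vscode/tasks.json",
    ".vscode/setup.mjs",
    ".github/workflows/codeql_analysis.yml",
]
# Paths relative to the user's home directory.
HOME_ARTIFACTS = [
    "Library/LaunchAgents/com.user.gh-token-monitor.plist",
    ".config/systemd/user/gh-token-monitor.service",
]


def find_artifacts(repo_root: Path, home: Path) -> List[Path]:
    """Return every listed artifact path that exists on disk."""
    candidates = [repo_root / p for p in REPO_ARTIFACTS]
    candidates += [home / p for p in HOME_ARTIFACTS]
    return [p for p in candidates if p.exists()]
```

Typical usage would be `find_artifacts(Path.cwd(), Path.home())`, run from a clean shell rather than from inside the IDE under suspicion.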

There are also reports of injected GitHub Actions workflows that serialize repository secrets with toJSON(secrets) and send them to api.masscan.cloud.

On the CI/CD side, StepSecurity reported an especially important behavior. On Linux GitHub Actions runners, the malicious payload looked for the Runner.Worker process and read /proc/<pid>/mem to extract workflow secrets, including masked secrets. That means even secrets not explicitly referenced in the workflow YAML may be at risk if they are present in the runner process memory.

PyPI was also affected

This was not only an npm incident.

Socket highlights PyPI guardrails-ai@0.10.1 because malicious code could run on import. On Linux, it downloaded a Python artifact from git-tanstack.com/transformers.pyz, wrote it to /tmp/transformers.pyz, and executed it with python3. Socket notes that this behavior was not present in the previous guardrails-ai@0.10.0 release.

The Hacker News, citing Microsoft's analysis on X, also discusses mistralai@2.4.6, including behavior that fetched a credential stealer from a remote server, avoided Russian-language environments, and included destructive branching for environments that appeared to be in certain regions.

Watching npm lifecycle scripts is not enough. Python imports, CI installs, developer machines, and IDE hooks all matter here.

SLSA provenance was not enough

One of the most important details is that the malicious packages were published through legitimate GitHub Actions OIDC trusted publishing and had valid SLSA provenance.

Provenance tells you which pipeline produced an artifact. It does not prove that the pipeline was not contaminated by attacker-controlled code.

In this attack, the trusted pipeline itself became the attacker's publish path. A provenance badge or Sigstore attestation alone is not enough to conclude that the artifact is safe.

Initial response

If a developer machine or runner may have installed an affected version, reverting the lockfile is not enough. At the same time, this article should not be treated as a full incident-response runbook. It is safer to follow the official advisory and vendor analyses.

The TanStack GitHub Advisory recommends treating affected developer machines and CI environments as compromised, rotating credentials that were accessible from the install process, checking cloud audit logs, and auditing CI pipelines.

Start with these references:

TanStack GitHub Advisory: affected versions, patched versions, workaround, IOCs
StepSecurity analysis: GitHub Actions, OIDC, SLSA provenance, secret exfiltration
Socket analysis: additional affected packages, PyPI, persistence artifacts, detection notes

In practice, the areas to check include:

isolation of affected machines and runners
rotation of GitHub PATs, npm tokens, cloud credentials, Vault tokens, Kubernetes tokens, and SSH keys
rotation of GitHub Actions secrets and environment secrets
npm publish logs and unexpected changes in GitHub repositories
persistence artifacts under .claude/, .vscode/, LaunchAgent, and systemd user services
egress to filev2.getsession.org, seed*.getsession.org, and api.masscan.cloud
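For the egress item, a minimal sketch of matching log lines against the reported hosts might look like this. The log format (whitespace-separated tokens, optional `:port` suffix) is an assumption, and the host list covers only the domains named in this article, so treat it as a starting point and take the authoritative IOC list from the advisory.

```python
from fnmatch import fnmatch
from typing import Iterable, List

IOC_HOST_PATTERNS = [
    "filev2.getsession.org",
    "seed*.getsession.org",
    "api.masscan.cloud",
]


def normalize(host: str) -> str:
    """Lowercase and drop a trailing dot so FQDN spellings compare equal."""
    return host.strip().lower().rstrip(".")


def is_ioc_host(host: str) -> bool:
    return any(fnmatch(normalize(host), pattern) for pattern in IOC_HOST_PATTERNS)


def flagged_hosts(log_lines: Iterable[str]) -> List[str]:
    """Collect every token in the log that matches a known IOC host."""
    hits = []
    for line in log_lines:
        for token in line.split():
            host = token.split(":")[0]  # strip an optional :port suffix
            if is_ioc_host(host):
                hits.append(normalize(host))
    return hits
```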

StepSecurity also warns about npm tokens with the description IfYouRevokeThisTokenItWillWipeTheComputerOfTheOwner. Because that may indicate destructive behavior, token revocation should be handled from a clean machine and according to the organization's incident-response process, not casually from a potentially infected host.

Prevention lessons

The lesson is not just "be careful with pull_request_target."

do not share cache between untrusted PR workflows and release pipelines
do not check out and execute untrusted code in pull_request_target
grant id-token: write only to the publish job
explicitly set permissions: id-token: none elsewhere
separate release workflows from normal test workflows
pin third-party actions by commit SHA instead of tags
avoid leaving secrets on self-hosted or long-lived runners
enforce lockfiles and frozen installs
add a minimum release age for dependency updates

With pnpm, minimumReleaseAge can be set in pnpm-workspace.yaml. For example, a 7-day delay is 10080 minutes. In pnpm 11, minimumReleaseAgeStrict can also be set when you want stricter behavior.

minimumReleaseAge: 10080
minimumReleaseAgeStrict: true

This is not a complete defense. It will not magically clean a malicious version already in your lockfile, and it will not protect you if you explicitly install a malicious version. But it can reduce the chance of immediately pulling a newly published malicious release.
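The arithmetic and the basic idea behind a release-age gate can be written out as follows. This is an illustration of the concept only, not pnpm's implementation, and the function names are made up for the example.

```python
from datetime import datetime, timedelta, timezone


def days_to_minutes(days: int) -> int:
    """Convert a delay in days to the minutes value the setting expects."""
    return days * 24 * 60


def is_old_enough(published_at: datetime, now: datetime,
                  minimum_release_age_minutes: int) -> bool:
    """True once a release has been public for at least the minimum age."""
    return now - published_at >= timedelta(minutes=minimum_release_age_minutes)


# A version published at the start of the TanStack incident window.
published = datetime(2026, 5, 11, 19, 20, tzinfo=timezone.utc)
two_days_later = datetime(2026, 5, 13, tzinfo=timezone.utc)

print(days_to_minutes(7))                                # 10080
print(is_old_enough(published, two_days_later, 10080))   # False
```

With a 7-day window, a version published during the incident would still have been too young to install two days later, when the public reporting was already available.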

Conclusion

If you explain the TanStack incident only as a pull_request_target mistake, it sounds smaller than it was.

The broader picture is a self-propagating worm that crossed CI/CD, caches, OIDC, npm trusted publishing, IDE hooks, GitHub Actions secrets, and PyPI. The attacker did not merely compromise packages. They used developer and CI environments as stepping stones to reach the next maintainer and the next package.

The right mental model is not just "dependency package compromise." It is developer environment and CI/CD compromise.

Building a Home Personal Assistant with Claude Managed Agents

Teruo Kunihiro — Mon, 13 Apr 2026 07:46:16 +0000

Introduction

Claude Managed Agents was just announced, so I tried using it to build a personal assistant for household tasks.

What I wanted was pretty simple: an AI I can call from Slack that can handle family notes, tasks, reminders, and schedules without too much ceremony. Things like birthdays, what gifts I bought last year, school handouts, grocery co-op deadlines, and small day-to-day household tasks.

My first impression was very positive. Claude Managed Agents solves a lot of the annoying parts up front:

I do not have to host the execution environment myself
Vaults and sandboxes are built in from the start
MCP and custom tools make it easier to build a safer architecture

That said, it does not eliminate the need for surrounding application code. I still needed a Slack event endpoint, persistent task state, and scheduled execution. In the end, I landed on an architecture centered on Claude Managed Agents, with Lambda + DynamoDB + EventBridge Scheduler around it.

What I wanted to build

These were the rough requirements:

1. Trigger the AI from Slack mentions for household tasks
2. Let the AI take notes and transcribe things
3. Connect with Google Calendar and Drive so important things are not missed
4. Have the AI send a daily reminder about household tasks
5. Let me send rough notes about finished tasks or recurring events and have the AI remember them in a useful way

So far, the parts that are actually working are mainly 1 / 2 / 4 / 5. Calendar and Drive integration are next.

Quickstart was genuinely useful

I started from the Claude Console Quickstart:

https://platform.claude.com/workspaces/default/agents/

It is a good way to get an initial agent configuration in place. You can shape the setup through conversation instead of writing everything from scratch. Japanese IME input still felt a little awkward, and Enter could fire too early, but overall it was fast enough to be useful.

Slack MCP

On the Slack side, I created a bot account and added the scopes I needed. The main ones ended up being:

app_mentions:read
chat:write
files:read

The Slack MCP lives on the Managed Agent side, but actual event ingestion and attachment retrieval are handled by Lambda. In practice, that split felt better than trying to force everything through MCP alone.

Sandbox

Claude Managed Agents also gives you a managed execution environment. In this project I used a sandbox configured for Slack MCP calls and custom tool usage.

I did not let the agent touch DynamoDB directly. Instead, DynamoDB access goes through custom tools, and Lambda performs the actual reads and writes. That keeps the permission boundary clear and makes the update rules easier to control from the application side.

In Anthropic's docs, this execution environment is modeled as an Environment. An Environment is basically the container configuration where the agent runs. You create it once and refer to it by ID. Multiple sessions can reuse the same Environment definition, but each session gets its own isolated container instance, and filesystem state is not shared across sessions. In other words, configuration is reusable, but runtime state is isolated per session.

That matters a lot. Even for a personal or family assistant, it means each run starts from a clean, isolated environment instead of inheriting leftovers from the previous run. Network settings are also part of the Environment, and Anthropic recommends using limited networking with explicit allowed_hosts for production. So the sandbox is not just “a safe box for Claude.” It is the unit that bundles isolation, dependency setup, and network permissions together.

Vault

I stored the Slack MCP credentials in a Vault. Not having to place raw credentials directly into the agent configuration is a big win.

The value of Vaults is pretty clear in Anthropic's docs. Vaults and credentials are treated as reusable authentication primitives that you register once and reference by ID. That means you do not need to run your own secret store for this part, pass tokens around on every request, or lose track of which credentials a session is using.

Reference:

https://platform.claude.com/docs/en/managed-agents/vaults

Another important point is that MCP server definitions and authentication are separated. When you create the agent, you declare which MCP servers it can connect to. When you create a session, you pass vault_ids to resolve authentication. Anthropic explicitly calls out that this separation keeps secrets out of reusable agent definitions while still letting each session authenticate with different credentials if needed. For a setup like this, where Slack MCP exists alongside application-managed Slack event handling, that split is very helpful.

Reference:

https://platform.claude.com/docs/en/managed-agents/mcp-connector

I still needed regular application code

At first I thought Managed Agents might cover most of it. In practice, I still needed surrounding application code for three reasons:

an HTTP endpoint for Slack Events API
asynchronous processing to stay within Slack’s 3-second response limit
application state such as memory, tasks, sessions, and idempotency

So the architecture ended up looking like this:

Slack mention
  -> API Gateway
  -> Lambda (ingress)
  -> SQS
  -> Lambda (worker)
  -> Claude Managed Agent
  -> Slack reply

Daily reminder
  -> EventBridge Scheduler
  -> Lambda (scheduled runner)
  -> Claude Managed Agent
  -> Slack post

State
  -> DynamoDB

Slack mentions flow through ingress Lambda -> SQS -> worker Lambda. Slack gets an immediate ACK, and the Claude interaction happens asynchronously in the background.
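A minimal sketch of what the ingress step has to do is shown below. It is not the code from this project: the handler shape, the lowercased header keys, and the `enqueue` callable (standing in for an SQS send) are assumptions. The signature scheme itself, an HMAC-SHA256 over `v0:{timestamp}:{body}` compared against the `X-Slack-Signature` header, is Slack's documented request verification.

```python
import hashlib
import hmac
import json
import time
from typing import Callable, Optional

MAX_SKEW_SECONDS = 60 * 5  # reject requests older than five minutes


def slack_signature(signing_secret: str, timestamp: str, body: str) -> str:
    base = f"v0:{timestamp}:{body}".encode("utf-8")
    digest = hmac.new(signing_secret.encode("utf-8"), base, hashlib.sha256).hexdigest()
    return f"v0={digest}"


def is_valid_request(signing_secret: str, timestamp: str, body: str,
                     signature: str, now: Optional[float] = None) -> bool:
    now = time.time() if now is None else now
    try:
        skew = abs(now - int(timestamp))
    except ValueError:
        return False
    if skew > MAX_SKEW_SECONDS:
        return False
    expected = slack_signature(signing_secret, timestamp, body)
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))


def handle_ingress(headers: dict, body: str, signing_secret: str,
                   enqueue: Callable[[str], None],
                   now: Optional[float] = None) -> dict:
    """Verify, answer the URL challenge, or enqueue and ACK immediately."""
    timestamp = headers.get("x-slack-request-timestamp", "")
    signature = headers.get("x-slack-signature", "")
    if not is_valid_request(signing_secret, timestamp, body, signature, now):
        return {"statusCode": 401, "body": "invalid signature"}
    payload = json.loads(body)
    if payload.get("type") == "url_verification":
        return {"statusCode": 200, "body": payload["challenge"]}
    enqueue(body)  # the slow Claude call happens in the worker, not here
    return {"statusCode": 200, "body": ""}
```

Keeping the handler this small is what makes the 3-second limit a non-issue: nothing in it waits on the model.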

The daily reminder is triggered by EventBridge Scheduler. Right now it runs every day at 09:00 JST and posts a reminder for unfinished tasks.

What gets stored where

This setup currently uses seven DynamoDB tables:

SlackThreadSessionsTable: mapping between Slack threads and Claude sessions
ProcessedEventsTable: Slack event deduplication
ScheduledTasksTable: scheduled task definitions
UserMemoriesTable: mapping to Claude memory stores
MemoryItemsTable: semi-structured memory persisted through custom tools
TasksTable: current task state
TaskEventsTable: task history

MemoryItemsTable and TasksTable / TaskEventsTable are the important ones here.

For household use, the data I actually care about looks like this:

whose birthday it is
what I gave them last year
what tasks are still unfinished
whether a task is already done

That kind of information is easier to manage if it lives in DynamoDB as the source of truth, with Claude pulling it through tools only when needed. That is the approach I took.

Using custom tools for memory and tasks

I ended up defining these five tools:

search_memories
save_memory
list_tasks
upsert_task
mark_task_done

When the Managed Agent calls one of them, it emits agent.custom_tool_use. Lambda receives that request, updates DynamoDB, and returns the result via user.custom_tool_result.

I like this pattern a lot. The agent never needs direct DynamoDB IAM permissions, which makes the boundary safer and gives the application control over how updates are applied.

I verified the flow end to end:

save_memory stored “Hanako’s birthday is 8/12”
upsert_task created a task for buying a birthday gift
mark_task_done updated that task to done
TaskEventsTable recorded created and marked_done
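The tool side of that flow can be sketched with an in-memory store standing in for the DynamoDB tables. Everything here is illustrative: the argument names, the `status` values, and the `updated` event are assumptions, and the wiring from agent.custom_tool_use to this dispatcher and back to user.custom_tool_result is left out.

```python
from datetime import datetime, timezone
from typing import Dict, List, Optional


class Store:
    """In-memory stand-in for MemoryItemsTable, TasksTable, and TaskEventsTable."""

    def __init__(self) -> None:
        self.memories: List[str] = []
        self.tasks: Dict[str, dict] = {}
        self.task_events: List[dict] = []

    def log_event(self, task_id: str, event: str) -> None:
        self.task_events.append(
            {"task_id": task_id, "event": event,
             "at": datetime.now(timezone.utc).isoformat()}
        )


def save_memory(store: Store, text: str) -> dict:
    store.memories.append(text)
    return {"saved": text}


def search_memories(store: Store, query: str) -> List[str]:
    return [m for m in store.memories if query.lower() in m.lower()]


def list_tasks(store: Store, include_done: bool = False) -> List[dict]:
    return [t for t in store.tasks.values() if include_done or t["status"] != "done"]


def upsert_task(store: Store, task_id: str, title: str) -> dict:
    is_new = task_id not in store.tasks
    status = "open" if is_new else store.tasks[task_id]["status"]
    store.tasks[task_id] = {"task_id": task_id, "title": title, "status": status}
    store.log_event(task_id, "created" if is_new else "updated")
    return store.tasks[task_id]


def mark_task_done(store: Store, task_id: str) -> Optional[dict]:
    task = store.tasks.get(task_id)
    if task is None:
        return None
    task["status"] = "done"
    store.log_event(task_id, "marked_done")
    return task


TOOLS = {f.__name__: f for f in
         (save_memory, search_memories, list_tasks, upsert_task, mark_task_done)}


def handle_tool_call(store: Store, name: str, arguments: dict) -> dict:
    """Dispatch one tool request and wrap the outcome for the reply event."""
    handler = TOOLS.get(name)
    if handler is None:
        return {"ok": False, "error": f"unknown tool: {name}"}
    return {"ok": True, "result": handler(store, **arguments)}
```

Because every write goes through `handle_tool_call`, the application decides what a tool is allowed to change, and the agent never needs table-level permissions.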

Slack mentions work naturally

When I mention @AI in Slack, the conversation continues in the same thread.

What made this feel right was treating Slack thread = Claude session. That aligns the Slack UX with the conversation context in a very natural way.
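The mapping itself is small. In the sketch below, a plain dict stands in for SlackThreadSessionsTable and `create_session` is a hypothetical callable that starts a new Claude session and returns its ID. The `channel`, `ts`, and `thread_ts` fields are the ones Slack puts on message events.

```python
from typing import Callable, Dict


def thread_key(event: dict) -> str:
    """Replies carry thread_ts; a top-level mention starts a thread keyed by its own ts."""
    return f"{event['channel']}:{event.get('thread_ts') or event['ts']}"


def session_for(event: dict, sessions: Dict[str, str],
                create_session: Callable[[], str]) -> str:
    """Reuse the session bound to this Slack thread, or create and store one."""
    key = thread_key(event)
    if key not in sessions:
        sessions[key] = create_session()
    return sessions[key]
```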

I also added attachment handling on the Lambda side. With files:read, Lambda can fetch PDFs or images from Slack’s url_private endpoints and pass them to Claude as document or image blocks.

That makes flows like this possible:

upload a school or daycare PDF
let the AI read it
extract tasks if needed
save important details into memory
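As a rough sketch of the attachment step: Slack's url_private links need the bot token in an Authorization header, and the downloaded bytes then have to be wrapped as a content block. The block shape below follows the base64 image and document blocks of Anthropic's Messages API. Whether a Managed Agents session accepts exactly this shape is an assumption here, and the function names are made up.

```python
import base64
import urllib.request


def slack_file_request(url_private: str, bot_token: str) -> urllib.request.Request:
    """Build the authenticated request; pass it to urllib.request.urlopen to download."""
    return urllib.request.Request(
        url_private, headers={"Authorization": f"Bearer {bot_token}"}
    )


def to_content_block(data: bytes, mimetype: str) -> dict:
    """Wrap file bytes as a base64 document (PDF) or image block."""
    if mimetype == "application/pdf":
        kind = "document"
    elif mimetype.startswith("image/"):
        kind = "image"
    else:
        raise ValueError(f"unsupported attachment type: {mimetype}")
    return {
        "type": kind,
        "source": {
            "type": "base64",
            "media_type": mimetype,
            "data": base64.b64encode(data).decode("ascii"),
        },
    }
```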

Daily reminders also worked well

For scheduled execution, I used EventBridge Scheduler rather than the older CloudWatch Events style rules.

The current setup stores a daily-summary task definition in DynamoDB. Every morning at 9 AM, the scheduled runner loads that definition, starts Claude, calls list_tasks to fetch unfinished tasks, and posts a short reminder to Slack.

What I like about this is that the reminder is not a fixed template. Claude can shape the wording based on the unfinished tasks in DynamoDB.

Letting it read PDFs and remember things is surprisingly good

This turned out to be one of the most promising parts for household use.

If I can just upload a PDF to Slack and say @AI take a look at this, the system can:

extract dates
turn them into tasks
save names or events into memory

That is exactly the kind of workflow that matters for family operations, where the problem is usually not a lack of information but forgetting things at the wrong time.

In that sense, save_memory and search_memories seem especially useful.

Pricing

Cost is obviously a concern.

According to Anthropic’s pricing page, Claude Sonnet 4.6 (the model I am using here) and Managed Agents session runtime are priced at:

Input: $3 / MTok
Output: $15 / MTok
Session runtime: $0.08 / session-hour

Reference:

https://platform.claude.com/docs/en/about-claude/pricing

For household use, a rough estimate still puts this in a pretty reasonable range, around $10/month.

I used these assumptions:

5 Slack mentions per day
1 daily reminder per day
per mention: 12k input tokens / 1.2k output tokens / 20 seconds runtime
per reminder: 15k input tokens / 1.5k output tokens / 15 seconds runtime

That gives roughly:

mentions: about $8.4 / month
reminders: about $2.1 / month
total: about $10.5 / month
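The estimate can be reproduced with a few lines of arithmetic. The prices and per-run assumptions are the ones listed above. The 31-day month is my assumption, chosen because it is the value that reproduces these rounded figures.

```python
INPUT_PER_MTOK = 3.00     # USD per million input tokens
OUTPUT_PER_MTOK = 15.00   # USD per million output tokens
RUNTIME_PER_HOUR = 0.08   # USD per session-hour


def cost_per_run(input_tokens: int, output_tokens: int, runtime_seconds: float) -> float:
    """Token cost plus session runtime cost for a single agent run."""
    return (input_tokens / 1_000_000 * INPUT_PER_MTOK
            + output_tokens / 1_000_000 * OUTPUT_PER_MTOK
            + runtime_seconds / 3600 * RUNTIME_PER_HOUR)


def monthly_cost(runs_per_day: int, per_run: float, days: int = 31) -> float:
    return runs_per_day * per_run * days


mentions = monthly_cost(5, cost_per_run(12_000, 1_200, 20))
reminders = monthly_cost(1, cost_per_run(15_000, 1_500, 15))

print(round(mentions, 1), round(reminders, 1), round(mentions + reminders, 1))
# 8.4 2.1 10.5
```

Session runtime contributes well under a cent per run at these durations, so the estimate is driven almost entirely by input tokens. That is why long PDFs and growing context are the things to watch.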

This will go up quickly if:

you read a lot of long PDFs
you use web search or extra tools heavily
conversations get long and context keeps expanding

Still, for a personal household assistant with a small number of daily interactions, AWS costs are likely minor compared to Claude token costs.

Things that were tricky

Slack Events configuration

At first, I had the classic problem where the Request URL was verified but no events were arriving. In the end, I had to carefully make sure that:

Event Subscriptions were enabled
app_mention was added
files:read was added
the Slack app was reinstalled after changing scopes

Splitting responsibility between Slack MCP and Lambda

Slack MCP is useful, but once you need external event ingestion, attachment handling, threaded replies, and idempotency, it is easier to keep Slack input/output under application control.

The split that worked best here was:

Lambda handles input and delivery
Managed Agent handles reasoning and tool usage

That division felt clean.

Do not start with fully automatic memory saving

This is more of an operational lesson than a technical one. Memory gets messy fast. Birthdays and gift history are good durable facts, but if you save every temporary request automatically, the memory store becomes noisy very quickly.

For now, I prefer having an explicit save_memory entry point. The agent can decide what looks durable, but the application still controls how it is persisted.

What I want to do next

These are the next things I want to add:

register events in Google Calendar and link the returned event IDs to tasks
read Google Drive documents and turn them into tasks or memories
run weekly summaries of completed tasks
add reminders like “a birthday is coming up” or “the co-op deadline is close”
refine the memory persistence policy

Calendar integration feels especially important. The shape I want is: Claude registers something in Calendar, returns structured JSON, and the application syncs that result into DynamoDB task state.

Closing thoughts

I came away with a very good impression.

The managed aspect matters a lot. Availability, execution environments, credentials, and permission boundaries are all expensive to get right on your own. Claude Managed Agents makes that much easier to control.

The pattern that currently feels best to me is:

reasoning and sandboxing in Managed Agents
webhooks, state, and integration glue in Lambda

That split worked well for a household assistant too. At this point I can already see a path where I throw rough notes into Slack and get “remember this,” “remind me later,” and “what is still unfinished?” out of the same system.

The next step is to connect Calendar and Drive and see how far this can go in real day-to-day use.

References

Claude Managed Agents Quickstart: https://platform.claude.com/docs/en/managed-agents/quickstart
Claude Managed Agents Environments: https://platform.claude.com/docs/en/managed-agents/environments
Claude Managed Agents Events and Streaming: https://platform.claude.com/docs/en/managed-agents/events-and-streaming
Claude Managed Agents Memory: https://platform.claude.com/docs/en/managed-agents/memory
Claude Managed Agents Vaults: https://platform.claude.com/docs/en/managed-agents/vaults
Claude Managed Agents MCP Connector: https://platform.claude.com/docs/en/managed-agents/mcp-connector
Claude Pricing: https://platform.claude.com/docs/en/about-claude/pricing

Semver in Retrograde

Teruo Kunihiro — Wed, 08 Apr 2026 15:02:07 +0000

This is a submission for the DEV April Fools Challenge

What I built

I built a dependency analysis tool that delivers executive-grade reports about your project's emotional state.
It just happens to be astrology. It is called Semver in Retrograde.

You paste a package.json, click "Analyze my dependency aura", and get a straight-faced executive report about the project's emotional state. It gives you Aura Stability, Chaos Index, Peer Dependency Tension, Mercury Status, the dependency Big 3, a prophecy, a lucky command, and a share card that looks ready for an internal quarterly review.

That contrast is the joke. The interface looks like a serious dashboard. The output is dependency mysticism delivered in the tone of an operations meeting.

I also added one feature that makes me disproportionately happy: if you paste something that looks like requirements.txt or a Gemfile, the app returns 418 I'm a teapot. Wrong ecosystem, wrong beverage.

Demo

Live demo: https://semver-in-retrograde.vercel.app/

Repo: trknhr/semver-in-retrograde

One practical note: the public deployment does not call Gemini in production. I turned that off to keep the joke within budget, so the hosted version runs in a fixed "budget committee safe mode" for the narrative copy. The full Gemini path is what I used in local development and in the eval run.

This is the demo flow I used:

Paste a package.json
Click "Analyze my dependency aura"
Watch the dashboard appear like it's about to audit your org
Then realize it's talking about your emotional instability

Code

The code is in the GitHub repository trknhr/semver-in-retrograde.

The app has a clean split. Local code parses and scores the manifest. Gemini writes the executive reading. So the same manifest always produces the same numbers, while the model handles the polished nonsense.

How I built it

I used:

Next.js
TypeScript
Tailwind CSS
server-side Gemini API
Zod

The architecture is more serious than the premise. That felt appropriate.

1. Deterministic manifest analysis

The first step is completely local.

The app parses package.json, flattens the dependency sections, inspects the scripts block, and turns the manifest into a feature set. It looks at things like:

dependency counts
peerDependencies
overrides / resolutions
wildcard and latest versions
pre* / post* scripts
postinstall
package manager hints
framework / test / build tool fingerprints
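A feature extractor along these lines might look like the sketch below. It is written in Python for illustration and is not the app's TypeScript: the feature names, the set of "loose" ranges, and the rule for recognising lifecycle hooks are all assumptions, and it covers only part of the list above.

```python
import json

DEP_SECTIONS = ("dependencies", "devDependencies", "peerDependencies", "optionalDependencies")
LOOSE_RANGES = {"*", "latest", "x", ""}
BUILTIN_EVENTS = {"install", "publish", "pack", "version", "test", "start", "stop", "restart"}


def is_lifecycle_hook(name: str, scripts: dict) -> bool:
    """pre<x>/post<x> counts only if <x> is a script or a built-in npm event.

    This keeps names like "preview" or "prettier" from being misread as hooks.
    """
    for prefix in ("pre", "post"):
        target = name[len(prefix):]
        if name.startswith(prefix) and target and (target in scripts or target in BUILTIN_EVENTS):
            return True
    return False


def extract_features(manifest_text: str) -> dict:
    manifest = json.loads(manifest_text)
    # Flatten every dependency section into (section, name, range) entries.
    entries = [
        (section, name, str(version_range).strip())
        for section in DEP_SECTIONS
        for name, version_range in manifest.get(section, {}).items()
    ]
    scripts = manifest.get("scripts", {})
    return {
        "dependency_count": len(entries),
        "peer_count": sum(1 for section, _, _ in entries if section == "peerDependencies"),
        "loose_count": sum(1 for _, _, rng in entries if rng.lower() in LOOSE_RANGES),
        "override_count": len(manifest.get("overrides", {})) + len(manifest.get("resolutions", {})),
        "lifecycle_scripts": sorted(n for n in scripts if is_lifecycle_hook(n, scripts)),
        "has_postinstall": "postinstall" in scripts,
        "package_manager": manifest.get("packageManager"),
    }
```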

Those features feed a weighted scoring model. I wanted the joke to start from real manifest behavior, not from a model improvising a vibe.

Pinned versions help Aura Stability. Wildcards, latest, extra scripts, and override-heavy manifests drag it down. Chaos Index climbs when the project has loose version ranges, lifecycle scripts, postinstall, suspicious script names, or workspace sprawl. Peer Dependency Tension rises when the package asks other people to satisfy more of its needs. Boundary Issues is really a score for governance by exception, so overrides, resolutions, and workspace hints push it upward. Trust Issues gets worse when the manifest is private, carries a postinstall, or leans on suspicious scripts and latest tags. Mercury Status comes from lifecycle-script severity, especially pre*, post*, and postinstall.

So yes, the result is silly. But it is silly in a deterministic way.

Those signals show up in the product as Aura Stability, Chaos Index, Peer Dependency Tension, Boundary Issues, Trust Issues, and Mercury Status.

All of this is computed locally so the core behavior stays deterministic.
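To make "weighted scoring" concrete, here is a toy version of two of the signals. The weights and the `Features` fields are invented for the example and are not the app's real numbers. The point is only that the output is a pure function of the manifest features, so the same input always yields the same score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Features:
    dependency_count: int
    pinned_count: int
    loose_count: int        # wildcard or "latest" ranges
    lifecycle_scripts: int  # pre*/post* hooks
    has_postinstall: bool
    override_count: int     # overrides + resolutions


def clamp(value: float, low: float = 0, high: float = 100) -> float:
    return max(low, min(high, value))


def aura_stability(f: Features) -> float:
    """Pinned versions raise the score; loose ranges, hooks, and overrides lower it."""
    pinned_ratio = f.pinned_count / max(f.dependency_count, 1)
    score = (50 + 50 * pinned_ratio
             - 10 * f.loose_count
             - 5 * f.lifecycle_scripts
             - 5 * f.override_count)
    return clamp(score)


def chaos_index(f: Features) -> float:
    """Loose ranges, lifecycle hooks, and postinstall all push chaos upward."""
    score = 15 * f.loose_count + 10 * f.lifecycle_scripts
    if f.has_postinstall:
        score += 20
    return clamp(score)
```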

2. Gemini for the narrative layer

I used Gemini on the server for the parts that needed tone rather than math:

executive summary
sun / moon / rising interpretations
red flags
prophecy
lucky command
share caption

Gemini does not decide the scores. It gets the extracted features and the computed numbers, then turns them into a dead-serious reading.

The app asks for structured JSON and validates the result with Zod before rendering anything. That kept the product funny without handing core logic to the model.
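The same validate-before-render idea, sketched in Python rather than Zod, looks roughly like this. Apart from luckyCommand, the field names are guesses for illustration, and the real schema also covers the sun / moon / rising readings, which are omitted here.

```python
import json
from typing import Optional

STRING_FIELDS = ("executiveSummary", "prophecy", "luckyCommand", "shareCaption")


def parse_reading(raw: str) -> Optional[dict]:
    """Return the validated reading, or None if the model output is malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for field in STRING_FIELDS:
        value = data.get(field)
        if not isinstance(value, str) or not value.strip():
            return None
    flags = data.get("redFlags")
    if not isinstance(flags, list) or not all(isinstance(flag, str) for flag in flags):
        return None
    # Copy only the known fields so unexpected keys never reach the UI.
    reading = {field: data[field] for field in STRING_FIELDS}
    reading["redFlags"] = flags
    return reading


def reading_or_fallback(raw: str, fallback: dict) -> dict:
    """Render the model's reading only if it validates; otherwise use fixed copy."""
    parsed = parse_reading(raw)
    return parsed if parsed is not None else fallback
```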

The public deployment does not hit Gemini live. I disabled that in production because paying for unlimited dependency clairvoyance for strangers seemed like a bad financial habit. So production serves a fixed, intentionally budget-conscious executive statement, while local development and evals use the real Gemini path.

3. UI direction

I did not want this to look like a horoscope app. I wanted it to look like a corporate audit dashboard that had developed a spiritual problem.

The design goal was:

"This should look like a compliance product that got trapped in a spiritual crisis."

4. My favorite April Fools detail

If the input looks like Python or Ruby dependency files, the app returns 418.

That part is useless, correct, and deeply satisfying.
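One way to implement that check is a pair of cheap text heuristics that run before JSON parsing. The regexes and the 400 fallback for other invalid input are assumptions for this sketch, not the app's actual detection rules.

```python
import json
import re

# A requirements.txt style line: name, comparison operator, version.
REQUIREMENT_LINE = re.compile(r"^[A-Za-z0-9_.\-\[\]]+\s*(==|>=|<=|~=|!=|>|<)\s*[\w.*]+")
# A Gemfile style line: `source "..."` or `gem "..."` at the start of a line.
GEMFILE_LINE = re.compile(r"^\s*(source|gem)\s+['\"]", re.MULTILINE)


def looks_like_wrong_ecosystem(text: str) -> bool:
    if GEMFILE_LINE.search(text):
        return True
    lines = [line.strip() for line in text.splitlines()]
    lines = [line for line in lines if line and not line.startswith("#")]
    return bool(lines) and all(REQUIREMENT_LINE.match(line) for line in lines)


def status_for(text: str) -> int:
    """418 for Python or Ruby manifests, 200 for a JSON object, 400 otherwise."""
    if looks_like_wrong_ecosystem(text):
        return 418
    try:
        manifest = json.loads(text)
    except json.JSONDecodeError:
        return 400
    return 200 if isinstance(manifest, dict) else 400
```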

5. Eval, because the joke works better if the nonsense is measured

I did not want the AI layer to run on hope.

So I added a small promptfoo harness around the reading endpoint and treated it like a real structured-output feature.

The eval setup has two layers. The first is deterministic and checks response contract, writing constraints, and fixture-specific signal coverage. The second uses LLM-as-a-judge rubrics for tone and grounding.

The deterministic checks cover things like:

the endpoint returns the full expected JSON shape
the response stays in live mode for the eval fixtures
the copy does not drift into practical engineering advice
the luckyCommand still looks like a shell command
the response actually reflects the manifest signals it was supposed to notice
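Two of the deterministic checks in that list are easy to picture as plain functions. This is a sketch under my own assumptions: the phrase list and the "looks like a shell command" heuristic are invented for illustration and are not the project's actual promptfoo assertions.

```typescript
// Hypothetical phrases that would signal a drift into practical engineering advice.
const ADVICE_PHRASES = ["you should upgrade", "we recommend", "best practice"];

// Heuristic: one line, starts with a lowercase command-like token, and does not end like a sentence.
function looksLikeShellCommand(value: string): boolean {
  const trimmed = value.trim();
  if (trimmed.length === 0 || trimmed.includes("\n")) return false;
  const commandShaped = /^[a-z][a-z0-9._-]*(\s+\S+)*$/.test(trimmed);
  const endsLikeProse = /[.!?]$/.test(trimmed);
  return commandShaped && !endsLikeProse;
}

// True when the copy contains any of the banned advice phrases (case-insensitive).
function containsPracticalAdvice(copy: string): boolean {
  const lower = copy.toLowerCase();
  return ADVICE_PHRASES.some((phrase) => lower.includes(phrase));
}
```

Checks like these are cheap enough to run on every fixture before any judge model is involved.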

Then I added judge-based checks for the harder-to-measure parts:

does this still sound polished, dead-serious, and vaguely B2B?
is it funny through sincerity rather than random nonsense?
does it stay grounded in the fixture instead of inventing facts?

That gave me a cleaner contract for the product:

local code owns the real scoring logic
Gemini owns the tone
evals make sure those boundaries do not blur

The runner hits the local Next.js app over HTTP, so the eval path matches the real product path instead of a helper in isolation.

6. Eval results

The saved run I kept for the project was:

eval-qw8-2026-04-08T00:18:21
public report: semver-in-retrograde.vercel.app/evals/eval-qw8-2026-04-08T00:18:21
raw JSON: semver-in-retrograde.vercel.app/evals/eval-qw8-2026-04-08T00-18-21.json

That run used:

promptfoo
4 manifest fixtures
8 expanded test cases
concurrency set to 1
light retrying around transient model-availability issues
Gemini as the judge model

Result:

8 / 8 passing
0 failures
0 errors
runtime: about 133 seconds

The fixtures cover four different dependency personalities:

a mildly over-governed Next.js workspace
a commitment-avoidant Vite app with latest and wildcard ranges
a haunted library with overrides, resolutions, and lifecycle weirdness
a relatively boring steady package that should not be over-dramatized

That last case mattered. A joke product can always get louder. The harder part is keeping it funny without inventing drama the manifest did not earn.

Prize category

I am submitting this for Best Google AI Usage.

Google AI is central to the project. Gemini runs the narrative layer on the server, returns structured JSON instead of free-form prose, gets validated before display, and sits behind evals that check both hard constraints and tone. The product only works because of that split between deterministic scoring and AI-generated corporate mysticism.

That is the role I wanted the model to play. It does not own the critical logic. It owns the polished nonsense.

If your JavaScript project has unresolved dependency feelings, Semver in Retrograde is ready to misinterpret them at enterprise scale.

Lessons from the Spring 2026 OSS Incidents: Hardening npm, pnpm, and GitHub Actions Against Supply-Chain Attacks

Teruo Kunihiro — Thu, 02 Apr 2026 05:00:50 +0000

March 2026 saw a rapid succession of OSS supply-chain incidents.

In Trivy, an attacker repointed 76 of the 77 version tags for trivy-action and 7 tags for setup-trivy to a malicious commit, and a tampered v0.69.4 binary was released.
In LiteLLM, malicious 1.82.7 and 1.82.8 packages were uploaded to PyPI, and the maintainers later identified 1.83.0 as the clean release.
In axios, 1.14.1 and 0.30.4 were briefly published to npm, and the hidden dependency plain-crypto-js used postinstall to distribute a cross-platform RAT (a remote access trojan that lets attackers remotely control infected machines). (Aqua)

A common recommendation for preventing incidents like these is to enable npm’s min-release-age or pnpm’s minimumReleaseAge.
npm’s min-release-age prevents versions newer than a specified number of days from being installed, while pnpm’s minimumReleaseAge applies the same idea in minutes.
Both are highly effective at reducing the chance of immediately picking up a freshly published malicious release. But they only protect you at the moment of dependency resolution. They do not stop automatic install script execution, CI pipelines that reference mutable tags, or long-lived publish tokens lingering in your environment. pnpm itself makes this distinction explicit: compromised packages are often detected relatively quickly, but there is still an unavoidable exposure window between publication and detection. (npm Docs)

pnpm’s own docs show the direction of travel. In the current stable pnpm release, both blockExoticSubdeps and strictDepBuilds default to false, but in the next docs and the v11 release notes, both move to true. blockExoticSubdeps prevents transitive dependencies from pulling from exotic sources such as git repos or tarball URLs, while strictDepBuilds can fail installation when unreviewed build scripts are present.
pnpm is clearly steering toward a security-first model: away from “install anything” and toward “resolve and execute only what has been explicitly trusted.” (pnpm)

This post breaks the defense surface into four layers:
dependency resolution, install-time execution, CI execution, and the publish path.
min-release-age belongs primarily to the dependency-resolution layer.

Delay and lock dependency resolution

The first thing to stabilize is which versions get resolved. npm’s min-release-age works in days, while pnpm’s minimumReleaseAge works in minutes, allowing you to let newly published versions “cool off” before they are eligible for installation.
In practice, though, you will eventually want exceptions for emergency security fixes or dependencies that you need to update immediately.

pnpm also provides minimumReleaseAgeExclude, which lets you carve out exceptions for specific packages or versions.
Dependabot has cooldown, a grace-period setting that delays version update PRs even after a new dependency version has been published. That grace period applies only to version updates, not to security updates.
So an operating model like “delay routine upgrades, but fast-track urgent security fixes” is perfectly workable in production. (npm Docs)
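The arithmetic behind a release-age gate is simple enough to sketch. This is not how npm or pnpm implement it; it only illustrates the day/minute units and the exclusion idea. The names ReleaseAgePolicy, daysToMinutes, and isEligible are hypothetical, and the exclusion list matches exact package names only, whereas a real setting may accept patterns.

```typescript
const MINUTES_PER_DAY = 24 * 60;

interface ReleaseAgePolicy {
  minimumAgeMinutes: number; // pnpm-style unit
  excludedPackages: string[]; // exact names only in this sketch
}

// Convert an npm-style day count into the minute unit pnpm uses.
function daysToMinutes(days: number): number {
  return days * MINUTES_PER_DAY;
}

// A version is eligible once it is at least `minimumAgeMinutes` old, or if its package is excluded.
function isEligible(
  name: string,
  publishedAt: Date,
  now: Date,
  policy: ReleaseAgePolicy,
): boolean {
  if (policy.excludedPackages.includes(name)) return true;
  const ageMinutes = (now.getTime() - publishedAt.getTime()) / (60 * 1000);
  return ageMinutes >= policy.minimumAgeMinutes;
}
```

With a 1440-minute window, a version published twelve hours ago is skipped, and the same version becomes eligible a day after publication.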

That said, delaying upgrades is not enough on its own. If the dependency graph resolved at one point in time cannot be reproduced consistently across your team and CI, different environments will drift onto different versions. That is where the lockfile becomes critical.

package-lock.json records the exact dependency graph and versions that were actually resolved. Committing it makes it much easier to reproduce the same dependency set in development and CI. npm ci is designed around the lockfile: it fails if package.json and the lockfile are out of sync, and it never rewrites the lockfile. In CI, that makes npm ci safer than npm install from a reproducibility standpoint, and it also makes unintended dependency changes easier to spot in diffs. (npm Docs)

Lockfiles matter for security, too. In GitHub’s dependency graph, a lockfile gives GitHub a much more accurate picture of the dependencies you actually resolved than a manifest alone. Indirect dependencies inferred only from the manifest may be excluded from vulnerability checks. (GitHub Docs)

There is one more risk in a different category worth calling out: dependency confusion. As a mitigation against public packages colliding with private package names, npm strongly recommends scoped packages. Managing internal packages under a namespace like @your-org/foo is not flashy, but it is effective. (npm Docs)

# .npmrc
min-release-age=3
ignore-scripts=true

# pnpm-workspace.yaml
minimumReleaseAge: 1440
minimumReleaseAgeExclude:
  - '@your-org/*'

Using npm’s min-release-age or pnpm’s minimumReleaseAge helps you avoid immediately consuming newly published versions. npm configures this in days, pnpm in minutes, and pnpm also applies it to transitive dependencies.

But this is only a mechanism for delaying the adoption of new releases. It does not guarantee reproducibility by itself. If you want stable, repeatable installs, the baseline is still to commit the lockfile and enforce strict lockfile-based installs in CI with commands like npm ci or pnpm install --frozen-lockfile. (npm Docs)

Treat install as code execution, not just downloading packages

The axios incident is a perfect example. The problem was not the axios code itself, but the postinstall hook in the hidden package plain-crypto-js. In other words, npm install is not just artifact retrieval. Through dependency scripts, it is also code execution at install time. (Snyk)

npm has ignore-scripts, and when set to true, it suppresses automatic script execution from package.json during installation. Explicitly invoked scripts such as npm run or npm test still run their target script (though not its pre- and post- scripts), so at minimum you are no longer running every dependency’s preinstall / install / postinstall hook by default. (npm Docs)

pnpm pushes this idea further. In its supply-chain security guidance, pnpm notes that many past compromised packages abused postinstall, and that v10 stopped automatically executing dependency postinstall hooks. The recommended model is to explicitly allow only trusted packages via allowBuilds. In the stable docs, allowBuilds supports per-package allow/deny rules, and with strictDepBuilds enabled, installation can fail the moment an unreviewed build script appears. (pnpm)

On top of that, enabling blockExoticSubdeps prevents transitive dependencies from pulling from exotic sources such as git repositories or tarball URLs. trustPolicy: no-downgrade can reject artifacts whose trust evidence is weaker than what was seen in earlier versions.
All of these are ways to ensure that even if you do pull something bad, it does not automatically spread or execute. (pnpm)

# pnpm-workspace.yaml
minimumReleaseAge: 1440
blockExoticSubdeps: true
strictDepBuilds: true
allowBuilds:
  esbuild: true
trustPolicy: no-downgrade
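To picture what an "exotic source" looks like in a dependency map, here is a rough sketch. The prefix list is my own assumption based on the git and tarball examples above; pnpm's real classification may differ, and the function names are hypothetical.

```typescript
// Hypothetical prefixes for non-registry sources (git repositories and tarball URLs).
const EXOTIC_PREFIXES = ["git+", "git://", "github:", "http://", "https://"];

function isExoticSpecifier(specifier: string): boolean {
  return EXOTIC_PREFIXES.some((prefix) => specifier.startsWith(prefix));
}

// Return the names of dependencies whose specifier points outside the registry.
function exoticDependencies(dependencies: Record<string, string>): string[] {
  return Object.entries(dependencies)
    .filter(([, specifier]) => isExoticSpecifier(specifier))
    .map(([name]) => name);
}
```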

In short, min-release-age makes it less likely that you will ingest a freshly compromised release, while ignore-scripts and strictDepBuilds are about preventing it from executing automatically even if it does get in. (npm Docs)

Run GitHub Actions with immutable refs and least privilege

In GitHub Actions, the first rule is to pin workflow code to immutable references. Tag references such as @v1 or @v1.2.3 are convenient, but tags can be retargeted after the fact. GitHub explicitly states that the only way to reference an Action immutably is to pin it to a full-length commit SHA. So instead of uses: owner/action@v1, the safer baseline is uses: owner/action@<commit SHA>. If your workflow depends on a moving reference like a tag, the code that runs later can change even when the workflow file itself does not. (GitHub Docs)
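A pinning rule like this is easy to lint for. The sketch below scans workflow text for uses: references that are not pinned to a 40-character commit SHA. It is deliberately line-based rather than a real YAML parser, and the function name is hypothetical.

```typescript
const FULL_LENGTH_SHA = /^[0-9a-f]{40}$/;

// Return every `uses:` reference that is not pinned to a full-length commit SHA.
// Line-based on purpose: this is a quick lint sketch, not a YAML parser.
function findUnpinnedActions(workflowText: string): string[] {
  const findings: string[] = [];
  for (const line of workflowText.split("\n")) {
    const match = /^\s*(?:-\s*)?uses:\s*([^\s#]+)/.exec(line);
    if (match === null) continue;
    const captured = match[1];
    if (captured === undefined) continue;
    const reference = captured.replace(/^["']|["']$/g, "");
    // Local actions (./path) and docker:// images are out of scope for this sketch.
    if (reference.startsWith("./") || reference.startsWith("docker://")) continue;
    const atIndex = reference.lastIndexOf("@");
    const ref = atIndex === -1 ? "" : reference.slice(atIndex + 1);
    if (!FULL_LENGTH_SHA.test(ref)) findings.push(reference);
  }
  return findings;
}
```

Running it over .github/workflows in CI would flag owner/action@v1 style references while letting SHA-pinned ones through.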

The next step is to minimize runtime privileges. Keep GITHUB_TOKEN permissions to the bare minimum, with defaults as narrow as contents: read, and grant additional permissions only to the specific jobs that need them. Protect workflow files themselves with CODEOWNERS, so changes to .github/workflows require review. And for jobs that need cloud access, use OIDC instead of storing long-lived secrets in GitHub. Importantly, permissions: id-token: write is only for minting an OIDC token to authenticate to an external service. It does not expand the workflow’s GitHub-side privileges. (GitHub Docs)

From there, the next defensive layer is to gate dependency changes at the PR boundary. GitHub’s dependency review action checks dependencies added or updated in a pull request and can block merges when known vulnerabilities are introduced. In the review UI, you can inspect newly added or updated dependencies alongside release dates and vulnerability data. For example, the following workflow fails when the PR includes dependency changes with vulnerabilities rated high severity or above. (GitHub Docs)

name: dependency-review

on:
  pull_request:

permissions: {}

jobs:
  review:
    permissions:
      contents: read
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@<FULL_LENGTH_SHA>
      - uses: actions/dependency-review-action@<FULL_LENGTH_SHA>
        with:
          fail-on-severity: high

There is an important nuance here. The dependency review action is primarily a mechanism for checking the safety of dependency changes introduced via PRs. GitHub also recognizes uses: references in .github/workflows/ as dependencies in the dependency graph, but Dependabot alerts for Actions are only generated automatically for semver-based references. SHA-pinned Actions do not receive those alerts. In practice, that means external Actions should be pinned by SHA for safety, and then reviewed on a schedule as part of deliberate update work. The operating model becomes: stay safe by default with immutable references, and review upgrades intentionally when you choose to move them. (GitHub Docs)

Protect the publish path itself

If you publish npm packages yourself, the publish path can become the source of upstream compromise. npm’s trusted publishing uses OIDC so you do not need to keep long-lived npm tokens in CI. After you configure a trusted publisher, npm strongly recommends restricting legacy token-based publishing and enabling “Require two-factor authentication and disallow tokens”. The docs even walk through revoking old automation tokens after the migration. (npm Docs)

When trusted publishing is used from GitHub Actions or GitLab CI/CD, npm also generates provenance attestations automatically. npm provenance makes it publicly verifiable where a package was built and who published it. In other words, if you publish from GitHub Actions with a trusted publisher configured, you usually do not need to explicitly add npm publish --provenance; provenance is attached automatically. (npm Docs)

name: publish

on:
  release:
    types: [published]

permissions:
  contents: read
  id-token: write

jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@<FULL_LENGTH_SHA>
      - uses: actions/setup-node@<FULL_LENGTH_SHA>
        with:
          node-version: "24"
          registry-url: "https://registry.npmjs.org"
      - run: npm ci
      - run: npm publish

It is worth separating signatures from provenance here. npm’s ECDSA registry signatures are designed to verify that the distributed tarball was not tampered with in transit. For example, they can detect whether package contents were altered somewhere along the way by a mirror or proxy.

Provenance, on the other hand, captures where a package came from, how it was built, and from which source code it was published. So while signatures answer “Was the package that arrived here modified?”, provenance answers “Where did this package come from, and how was it produced?”

npm audit signatures can verify both registry signatures and provenance attestations. But it is best thought of as a complementary integrity-and-origin check, not the primary mechanism for day-to-day vulnerability detection. (npm Docs)

pnpm takes a slightly different posture. In addition to “verify later” mechanisms like npm’s signatures and provenance, pnpm can proactively block untrusted dependencies at install time with settings like blockExoticSubdeps and strictDepBuilds. In that sense, npm focuses more on verification, while pnpm also leans into prevention through install-time policy.

Cross-cutting controls: detect with SCA, block with package-manager policy

This is where SCA becomes important. SCA (Software Composition Analysis) is the practice of enumerating the libraries your project depends on and continuously checking them for known vulnerabilities and license issues. It is the foundation for understanding what is actually in your stack and whether any of it is already known to be risky.

In GitHub, that role is largely filled by the dependency graph. The dependency graph ingests dependencies from manifests and lockfiles, and dependencies that land in the graph can receive Dependabot alerts and security updates. GitHub also explicitly recommends lockfiles for building a more trustworthy graph. The flip side is that transitive dependencies resolved only at build time, or indirect dependencies inferred only from the manifest, can still be missed. (GitHub Docs)

That is what automatic dependency submission and the dependency submission API are for. They let you send not just lockfile-declared dependencies, but also the dependencies actually resolved by a real build, into the dependency graph. GitHub provides built-in workflows for this, and external CI/CD systems or custom build pipelines can also push dependency snapshots through the API. In other words, you can reflect not only statically visible dependencies, but also the dependencies that were actually resolved at build time. (GitHub Docs)

External tools are easier to reason about when you split them by role. Snyk Open Source is a classic SCA tool for open-source dependency vulnerabilities and license issues. OSV-Scanner supports major JavaScript lockfiles including package-lock.json, pnpm-lock.yaml, yarn.lock, and bun.lock. Trivy can emit GitHub dependency snapshots with --format github, which makes it useful as a bridge for feeding dependencies observed from images or artifacts back into GitHub’s dependency graph. (Snyk User Docs)

Many of these tools are strongest at known vulnerabilities, advisories, and license metadata. Socket is addressing a different problem: through static analysis, it looks for suspicious behavior such as install scripts, network requests, environment variable access, telemetry, and obfuscated code, including cases that have not yet become formal advisories.

The key point is that SCA alone is not enough. It can catch known vulnerabilities, but there is always a lag for freshly published malware or suspicious packages that have not yet been assigned an advisory. As pnpm points out, there is an unavoidable gap between the publication of malware and its detection. In practice, that is why you should not rely on detection alone. You also need preventive controls at the package-manager level—such as minimumReleaseAge, ignore-scripts, blockExoticSubdeps, and strictDepBuilds—to make risky dependencies both harder to ingest and harder to execute in the first place. (pnpm)

The minimum baseline to put in place today

Add min-release-age=3 and ignore-scripts=true to .npmrc. npm provides the former as a day-based maturity window and the latter as a way to suppress automatic script execution. (npm Docs)
Always commit the lockfile, and use npm ci in CI. npm ci fails on lockfile mismatch and never rewrites the lockfile. (npm Docs)
Scope private packages. It is a basic but effective mitigation against dependency confusion. (npm Docs)
If you use pnpm, enable minimumReleaseAge, blockExoticSubdeps, strictDepBuilds, and allowBuilds, and consider going as far as trustPolicy: no-downgrade if appropriate. (pnpm)
In GitHub Actions, combine full-length commit SHA pinning, least-privilege GITHUB_TOKEN settings, and CODEOWNERS review requirements for workflow changes. (GitHub Docs)
Move cloud authentication to OIDC, and grant id-token: write only to the jobs that need it. (GitHub Docs)
Add the dependency review action to PRs so dependency diffs are reviewed before merge. Use GitHub dependency graph / Dependabot as the baseline monitoring layer for dependency visibility. (GitHub Docs)
If you publish packages, migrate to trusted publishing, disable legacy tokens, and revoke the ones you no longer need. (npm Docs)

Closing thoughts

Delay resolution. Prevent install-time auto-execution. Pin references and permissions in CI. Eliminate long-lived credentials from the publish path, attach provenance, and verify what you ship. Then use SCA to monitor dependency drift and known risk.

Only when these controls are combined can you say you have actually started defending against supply-chain attacks. (npm Docs)

What I Learned from Reading Claude Code’s Reconstructed Source

Teruo Kunihiro — Thu, 02 Apr 2026 01:45:41 +0000

Around March 31, 2026, it became widely known that parts of Claude Code CLI’s implementation could be reconstructed from source maps that had remained in the npm package. A public mirror circulated for a while, but it was not an official open-source release by Anthropic, and it has since turned into a different project.

This post is a memo of my own impressions after reading a reconstructed copy of the source that I had saved locally at the time. Rather than discussing the current state of any public mirror, I want to focus on the design characteristics that became visible from actually tracing through the code.

My first impression: this is a much larger product codebase than I expected

The first thing that surprised me was the sheer size of the codebase. In the reconstructed source I had on hand, there were roughly 1,900 files and about 510,000 lines of code. This is not a small single-purpose CLI. It is a fairly large product codebase that bundles terminal UI, tool execution, safety controls, IDE integration, memory, and extension mechanisms into one system.

Technically, the project appears to be centered on TypeScript, with Bun as the runtime and a React/Ink-style stack for the terminal UI. In other words, it felt less like “a small CLI with some AI added on top” and more like “a substantial TypeScript product with an AI experience layered into it.”

The prompts live on the client side more than I expected

One of the easiest things to start tracing in this codebase is prompt construction. At least within the portion that could be reconstructed, a surprisingly large part of the system instruction layer is present in the client-side code, where runtime context is then injected into it.

That runtime context includes things like the current date, Git state, recent commits, Git user information, and the contents of local instruction files. On top of that foundation, additional instructions and memory-related text are composed into something close to the final system prompt.
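As a rough picture of that composition step, here is a sketch of my own. It is not code from the reconstructed source: the RuntimeContext fields and the buildSystemPrompt function are hypothetical and only mirror the kinds of context listed above.

```typescript
// All field names are hypothetical; they mirror the kinds of context the post lists.
interface RuntimeContext {
  currentDate: string;
  gitBranch: string;
  recentCommits: string[];
  instructionFileContents: string[];
}

// Compose a base instruction layer with runtime context, skipping empty sections.
function buildSystemPrompt(basePrompt: string, context: RuntimeContext): string {
  const sections = [
    basePrompt,
    `Current date: ${context.currentDate}`,
    `Git branch: ${context.gitBranch}`,
    context.recentCommits.length > 0
      ? `Recent commits:\n${context.recentCommits.join("\n")}`
      : "",
    ...context.instructionFileContents,
  ];
  return sections.filter((section) => section.length > 0).join("\n\n");
}
```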

What I found especially interesting was that the intuitive assumption that “the real prompt must be assembled as a black box on the server side” did not seem to hold very well here, at least not within the portion of the code I could inspect. That does not prove there is no additional server-side processing, of course. But it does show that a significant amount of the prompt logic also exists on the client side.

In tool design, what matters is not the number of tools but how they are exposed and controlled

Another striking part of the design is the layer that decides which tools are visible to the model and the separate layer that manages execution permissions. The system is clearly feature-rich, but there is a fairly sharp distinction between tools that are exposed routinely and tools that are internal, behind feature flags, or otherwise conditionally enabled.

My impression was fairly simple: this codebase does not look like it was built around the idea that “more tools automatically make the system stronger.” If anything, it seems closer to the opposite view: the surface that is exposed to the model in normal operation should be kept as narrow as possible.

There are also implementation details suggesting that the tool list itself has to stay aligned with prompt caching. That means the number of tools and their schemas are not just implementation details; they appear to be part of stable prompt operation as well.

This lines up quite well with the increasingly common intuition that “fewer tools often lead to more stable behavior.” That said, this is my interpretation of the code, not an explicit principle written down in those exact words.

Bash is not “just a way to run shell commands”

The shell execution layer was one of the most memorable parts of the codebase for me. What is going on there is not simply command execution.

Commands are categorized into groups such as search-oriented commands, read-oriented commands, listing commands, and commands where silence on success is the natural behavior. Exit codes are also normalized in command-specific ways. For example, the 1 returned by grep-like commands is not always treated as a plain error; it can be reinterpreted as “no match found.”
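The exit-code point can be sketched in a few lines. This is my own illustration rather than the reconstructed code: the Outcome type, the command set, and the function name are assumptions. What it relies on is standard behavior: grep and ripgrep exit with 1 when nothing matched and with 2 or higher on a real error.

```typescript
type Outcome = "ok" | "no-match" | "error";

// Commands for which exit code 1 means "nothing matched" rather than failure.
const NO_MATCH_EXITS_ONE = new Set(["grep", "rg"]);

// Normalize an exit code into an outcome the caller can reason about.
function interpretExit(command: string, exitCode: number): Outcome {
  if (exitCode === 0) return "ok";
  if (exitCode === 1 && NO_MATCH_EXITS_ONE.has(command)) return "no-match";
  return "error";
}
```

Without this kind of normalization, an agent would read every empty search result as a failed command.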

On top of that, commands that are considered read-only are guarded by allowlist-based flag checks, path validation, sed-specific restrictions, sandbox eligibility checks, and even AST-based safety checks. For more complex compound commands, there are also explicit upper bounds on the fan-out of the safety analysis.
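An allowlist-based flag check is the easiest of those guards to illustrate. The sketch below is hypothetical: the command table, the flag lists, and the function name are mine, and the real checks described above are far more thorough (path validation, sandbox eligibility, AST analysis).

```typescript
// Hypothetical allowlist: command name -> flags considered safe for read-only use.
const READ_ONLY_FLAGS: Record<string, string[]> = {
  ls: ["-l", "-a", "-la", "-h"],
  cat: ["-n"],
  grep: ["-n", "-i", "-r", "-E"],
};

// Approve an invocation only when the command is known and every flag is on its allowlist.
function isReadOnlyInvocation(argv: string[]): boolean {
  const [command, ...args] = argv;
  if (command === undefined) return false;
  const allowed = READ_ONLY_FLAGS[command];
  if (allowed === undefined) return false; // unknown command: never auto-approved
  return args
    .filter((arg) => arg.startsWith("-"))
    .every((flag) => allowed.includes(flag));
}
```

The useful property is the default: anything not explicitly listed, such as sed -i or an unfamiliar grep flag, falls through to a stricter path instead of being assumed safe.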

So while Bash is clearly a powerful general-purpose tool inside Claude Code, it does not look like something the model is given raw. Instead, it seems to sit on top of a fairly thick deterministic scaffold before the model is allowed to use it.

The comments are unusually good

Another thing that stood out was the quality of the comments. By that, I do not just mean that there are many comments.

In several places, the comments explain not only what the code is doing but why certain decisions were made: why a heavy operation needs to run before imports, why a given validator is necessary, or why a particular flag should not be treated as safe. They carry background reasoning, not just surface-level description.

That makes the code easier for humans to follow, of course, but it also felt like the sort of writing that would remain legible to future code-completion systems or coding agents as well.

People often say these days that comments should be kept to a minimum. But reading code like this is a good reminder that good comments are not clutter. They are part of the design.

Even the startup path shows product-level polish

Looking around the entry path, it becomes clear that this product is not only concerned with adding features. It is also carefully tuned around perceived performance. The code is explicit about which side effects should run before heavier imports and what can be parallelized to reduce startup latency.

When people talk about AI agents, attention tends to go first to prompts and loops. But in practice, details like startup optimization and other non-AI engineering work are often what determine how polished the product feels.

“Being visible” is not the same thing as “being open source”

Finally, I want to emphasize the most important point.

What became visible in this case was that some source code could be read because of the way published artifacts were left exposed. That is not the same thing as Anthropic officially releasing Claude Code as open source.

Those two things need to be kept clearly separate. Anthropic’s current terms include restrictions aimed at preventing the construction of competing products, service replication, and reverse engineering. So treating this as an interesting code-reading exercise is one thing; assuming that the code can therefore be freely reused or redistributed is something else entirely.

There is value in reading it. But “readable” and “freely usable” are not the same thing, and it is important not to blur that distinction.

Conclusion

What made this source-reading exercise interesting was not a generic takeaway like “Claude Code runs an agentic loop.” The more interesting part was seeing, in concrete form, which parts were made deterministic, which parts were injected as runtime context, and where the safety mechanisms were made deliberately thick.

At least within the portion that could be reconstructed, the prompts were more client-side than I expected, Bash was more heavily guarded than I expected, the tool surface was narrower than I expected, and the comments were more thoughtful than I expected. The overall codebase is well organized, but at the same time it still has a little of the human roughness you would expect in a real product—for example, the way prompt construction seems to be spread across multiple layers.

That mix of order and messiness is part of what makes the codebase interesting to me. In the end, that is what I wanted to capture in this memo.

Code Review with multiple AIs

Teruo Kunihiro — Fri, 19 Dec 2025 02:00:48 +0000

Hello folks.
Have you ever wanted to quickly run code reviews using multiple AIs? I have. If you really want to do something like this, you can have an AI generate a script and run it locally right away. Problem solved! …But if we stop there, the blog post ends immediately, so please stick with me for a little longer.

The problem I want to solve

In most cases, that really does solve it. But scripts created this way often end up calling pay-as-you-go APIs such as the ChatGPT API. Calling APIs isn’t inherently a problem, but I personally wanted to keep these kinds of tasks within a flat subscription fee if possible. (Subscriptions also have usage limits, so they’re effectively usage-based too, but with how I use them, I rarely hit the limit.)

AI vendors also offer their own coding agents like Codex, Claude Code, Gemini CLI, and so on. By authenticating inside those coding agents, you can use them within your subscription plan. GitHub Copilot doesn’t develop its own models, but it’s appealing because it’s inexpensive and fixed-price, and lets you try a variety of models.

So it seems promising to delegate code review to these fixed-price coding agents and compare their results. That way, without issuing API keys, you can internally call multiple coding agents you already use and instantly get second opinions on your code review.

You might also want to use a team-standard prompt for code reviews. Even if you don’t fully standardize, it’s nice to avoid reinventing prompts each time and use a reasonably well-prepared team-specific one.

Then why not run the CLIs in CI?

You’re absolutely right. But once you run it in CI, bigger questions follow: how do we handle team-wide subscriptions, and would a dedicated SaaS be cheaper and perform better?

So this time, I wanted something that runs locally at a script-like level. If you want to do it properly, as mentioned, moving it into CI is likely better. But even then, it might still be handy to quickly check multiple reviews locally.

The solution

I’d like to introduce ai-utils, a tool I made that has made my development flow just a bit more convenient.

How it actually behaves

Let’s look at a real review I ran when I added Copilot CLI support to ai-utils. Including full results for every AI would be too long, so I’ll show only excerpts.

I reviewed this commit. More precisely, I ran staged-check, an AI check against staged (pre-commit) files.

The check is performed using a prompt like the one linked here:
staged-check.md

In this case, Codex’s feedback captured the situation most accurately. That is understandable for a lightweight check: because the prompt only passes git diff, the LLM seems to struggle to grasp the repository as a whole. I’d like to improve that at some point.

Below are the actual review outputs.

Claude Code (Sonnet 4.5)

     The code quality is very good. The new Copilot provider implementation follows a structure consistent with the existing Codex provider. The `stripFlagWithValue` helper function is implemented in a properly reusable way and adheres to the DRY principle. Naming is clear and follows Go conventions.

Gemini CLI (Gemini 2.5)

     * **Code duplication**:
         * **File**: `internal/provider/copilot.go`
         * **Issue**: The `stripFlagWithValue` function is newly defined in `copilot.go`, but this logic is also used for model flag handling in `codex.go`. Currently, `codex.go` cannot access this function defined in `copilot.go`, so similar logic would be (not visible here, but presumably) needed in `codex.go` as well. To avoid code duplication, this helper function should be moved to a shared utility file within the `provider` package (e.g., `provider/utils.go`).

Codex (GPT-5.2)

     - **Fatal timeout unit bug**: In `initConfig()` in `internal/cli/root.go`, `config.ProviderConfig.Timeout` is assigned `120`, but since the type is `time.Duration`, that becomes **120ns** (you likely expected `120*time.Second`). This same value is set when adding `copilot`, so with the default configuration it will time out immediately / become unstable. At minimum, the newly added `copilot` part must be fixed; ideally, fix existing `claude/gemini/codex` similarly.

     - **Where to place the utility function**: `stripFlagWithValue()` is defined in `internal/provider/copilot.go` and is also used from `internal/provider/codex.go` (it works because it’s the same package, but it hurts readability). Moving it to a shared file like `internal/provider/args.go` would make the intent clearer.

     - **Hard-coded strings**: Provider name `"copilot"` and flags like `"--model"` are scattered (e.g., `internal/cli/root.go`, `internal/provider/*`). Even small constants (at least provider names) would improve consistency.

GitHub Copilot CLI (Claude Sonnet 4.5)

     **`internal/provider/copilot.go`**: The `stripFlagWithValue` function is newly implemented, but the same logic is also used in `codex.go`. Following the DRY principle, this function should be moved to a common file such as `provider.go` so it can be shared by both providers.

     **`internal/provider/copilot_test.go`**: A trailing newline is missing at the end of the file (line 115). Go formatting conventions expect files to end with a newline.

     **`internal/provider/codex.go`**: An implementation of `stripFlagWithValue` has not been added. With this change, the codex provider may end up with an undefined reference when processing the model flag. The staging may be partial.

Even with some prompt issues, asking multiple AIs to review helped me avoid blindly trusting a single model’s feedback. I didn’t do it this time, but you can also compare SOTA models like Opus 4.5 and Gemini 3.0. In many cases you don’t know which model is best, so being able to benchmark and compare with a single command is very convenient.

My OSS project

As mentioned above, ai-utils is my own OSS project. It’s small and functionally simple, but it seemed useful enough that I decided to build it.
Details are here: ai-utils

Concept

Easily run multiple AIs locally within your existing subscription plans.

Problems it solves

There are plenty of OSS tools like this. But the three things I specifically wanted to solve were:

I don’t want to issue API keys
I want to rewrite prompts in my own style
I want to compare responses from multiple AIs

I couldn’t find an OSS project that satisfied all three, so I chose to build one. In the AI era, it’s easy to build what you want, so I was able to overcome the cost of “reinventing the wheel.”

How to use

On macOS, you can install easily with Homebrew:

brew tap trknhr/homebrew-tap

brew install aiu

On Linux, run the install shell script:

curl -sSfL https://raw.githubusercontent.com/trknhr/ai-utils/main/install.sh | sh

You can’t use it unless supported coding agents like Claude Code or Codex are installed and ready to use.

Trying it out

Using commit-msg, you can generate a commit message based on staged files:

aiu commit-msg

With -m, you can run multiple AIs in parallel.
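Conceptually, running several agents in parallel means sending one prompt to several commands and collecting each answer. The following is a minimal Python sketch of that idea, not aiu's actual code (the reviews above suggest aiu itself is written in Go). The function names `run_agent` and `run_all` are hypothetical, and `cat` and `tr` are stand-ins for real agent CLIs.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_agent(command: list[str], prompt: str) -> str:
    """Send the prompt to one command on stdin and return its stdout."""
    result = subprocess.run(command, input=prompt, capture_output=True, text=True)
    return result.stdout.strip()


def run_all(agents: dict[str, list[str]], prompt: str) -> dict[str, str]:
    """Run every agent concurrently and map each name to its answer."""
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        futures = {
            name: pool.submit(run_agent, command, prompt)
            for name, command in agents.items()
        }
        return {name: future.result() for name, future in futures.items()}


if __name__ == "__main__":
    # `cat` and `tr` are placeholders; a real setup would list agent CLIs here.
    stand_ins = {"echo-back": ["cat"], "shouty": ["tr", "a-z", "A-Z"]}
    for name, answer in run_all(stand_ins, "review this diff").items():
        print(f"{name}: {answer}")
```

Threads are enough here because each worker spends its time waiting on an external process rather than doing work in Python.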

You can also run your own prompts. Inside a prompt file, {{$ }} executes a command, so you can dynamically pass the command output to the AI.

Example:

Just say {{$ date }}.

This passes the current time to the AI, and it will return only the current time. Using the same mechanism, the review task passes things like git diff.
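As a rough illustration of how such a placeholder could be expanded, here is a minimal Python sketch. It is an assumption about the mechanism, not the tool's real implementation; `expand_prompt` and the regular expression are hypothetical names chosen for this example.

```python
import re
import subprocess

# Matches {{$ some command }} and captures the command text.
PLACEHOLDER = re.compile(r"\{\{\$\s*(.+?)\s*\}\}")


def expand_prompt(template: str) -> str:
    """Replace each {{$ command }} with that command's standard output."""

    def run(match):
        result = subprocess.run(
            match.group(1),
            shell=True,           # the placeholder holds a shell command line
            capture_output=True,
            text=True,
            check=True,           # fail loudly if the command fails
        )
        return result.stdout.strip()

    return PLACEHOLDER.sub(run, template)


if __name__ == "__main__":
    print(expand_prompt("Just say {{$ date }}."))
```

Because the placeholder runs arbitrary shell commands, prompt files of this kind should be treated like scripts and reviewed before a team shares them.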

So if your team wants custom prompts, you can place team-specific prompts under .aiu/prompts/ and run standardized reviews.

About development

The implementation required for this app wasn’t challenging. AI is so good at implementing typical CLI applications that there wasn’t much I had to do myself. What I did was mostly define the spec and write tests, and I found myself thinking “So this is the AI era...” over and over.

Summary

This tool just calls the coding agents provided by each vendor, but wrapping them in a single CLI makes them surprisingly comfortable to use.

Because the tool’s functionality is simple, it’s also an application where it’s easy to let AI handle most of the implementation. Probably about 95% of the code was written by AI.

It won’t dramatically improve anything by itself, but it helps you move through small daily tasks a little more smoothly.

If you’re interested, please refer to the GitHub page and install it. If you have complaints or requests, please open an Issue.

Assessing TOON Token Savings in an MCP Server

Teruo Kunihiro — Thu, 20 Nov 2025 13:37:09 +0000

I have been wiring TOON support with toon-token-diff into this MCP server to understand whether converting JSON payloads to TOON meaningfully reduces prompt costs. The short answer: TOON is elegant, but in my test harness it delivered microscopic savings for real-world workloads.

Environment

Project mode: toon-token-diff in libraryMode via npm install toon-token-diff
Models monitored: openai (tiktoken GPT-5 profile) and claude
Integration strategy: lightweight instrumentation that appends token stats into a JSONL ledger for later analysis

import { estimateAndLog } from "toon-token-diff/libraryMode";

// inside my MCP tool handler
estimateAndLog(JSON.stringify(result), {
  models: ["openai", "claude"],
  file: "./token-logs.jsonl",
  format: "json",
  label: "mcp_tool_call",
});

This snippet runs after the MCP tool produces a JSON response. It serializes the payload, estimates TOON vs JSON tokens, and emits a structured record to token-logs.jsonl. The rest of the MCP server stays untouched—no need to change transport or business logic.

Observations

Timestamp (UTC)	openai JSON	openai TOON	openai Δ (%)	claude JSON	claude TOON	claude Δ (%)
2025-11-19T14:16:54.296Z	127	126	0.79	130	129	0.77
2025-11-19T14:17:15.720Z	53,703	53,702	0.0019	54,977	54,976	0.0018
2025-11-19T14:17:34.988Z	14	12	14.29	14	12	14.29
2025-11-19T14:17:39.246Z	53,703	53,702	0.0019	54,977	54,976	0.0018
2025-11-19T14:17:48.333Z	29	29	0.00	28	28	0.00
2025-11-19T14:18:13.725Z	91,729	91,728	0.0011	98,607	98,606	0.0010
2025-11-19T14:21:19.174Z	127	126	0.79	130	129	0.77
2025-11-19T14:21:23.370Z	91,729	91,728	0.0011	98,607	98,606	0.0010
2025-11-19T14:21:30.314Z	53,703	53,702	0.0019	54,977	54,976	0.0018

Nine consecutive tool runs told the same story: production payloads barely moved. Only the intentionally tiny sample showed double-digit savings, which is irrelevant for backlog-scale prompts.
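The Δ (%) columns follow from the token counts as (JSON − TOON) / JSON × 100. A small Python sketch of that arithmetic, using the openai values from the table (the function name `reduction_pct` is only illustrative):

```python
def reduction_pct(json_tokens: int, toon_tokens: int) -> float:
    """Percentage of tokens saved by TOON relative to the JSON baseline."""
    return (json_tokens - toon_tokens) / json_tokens * 100


# (JSON tokens, TOON tokens) pairs taken from the openai columns of the table.
rows = [(127, 126), (53_703, 53_702), (14, 12), (29, 29), (91_729, 91_728)]

for json_tokens, toon_tokens in rows:
    saved = json_tokens - toon_tokens
    pct = reduction_pct(json_tokens, toon_tokens)
    print(f"{json_tokens:>6} -> {toon_tokens:>6}: {saved} token(s) saved, {pct:.4f}%")
```

Seen this way, the large payloads save exactly one token each, so the percentage shrinks as the payload grows.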

Why the Reduction Rate Is Flat

Content dominates token volume – The payload body itself accounts for nearly every token, so TOON’s structural tweaks barely register in the total.
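To make that concrete, here is a hedged Python sketch that measures how much of a serialized payload is structure rather than content. The payload shape is hypothetical (the actual MCP responses are not shown in this post), and character counts are only a rough proxy for tokens.

```python
import json

# Hypothetical payload: one short key wrapping a large free-text body.
payload = {"body": "lorem ipsum " * 5_000}

serialized = json.dumps(payload)
content_chars = len(payload["body"])
# Everything that is not the body text: braces, quotes, the key, the colon.
structural_chars = len(serialized) - content_chars

share = structural_chars / len(serialized) * 100
print(f"structural share: {share:.3f}% of {len(serialized)} characters")
```

When the structural share is this small, a format that only changes the structure has almost nothing to remove.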

Practical Guidance

Keep TOON handy as a normalization format, but don't promise cost savings without benchmarking your actual payloads.
Instrument with the libraryMode snippet above before you ship; it gives you historical evidence of whether TOON helps.
If savings are negligible, redirect effort toward higher-impact tactics: pruning unused fields, batching small tool calls, or applying semantic compression upstream.

Next Experiments

Compare with alternative tokenizers (Gemini, Llama) to see whether non-GPT vocabularies respond differently.
Add diff tooling that highlights specific fields TOON shrinks, so we can manually prune them if needed.
Explore policy-driven trimming (e.g., dropping debug blobs) prior to TOON conversion.

TOON remains a clever serialization trick, but as my MCP experiment showed, it is not an automatic token economy lever. Measure, log, and decide based on real numbers.


ai-docs managing AI generated context files

Teruo Kunihiro — Sun, 06 Jul 2025 14:50:41 +0000

Why I Built `ai-docs`: Managing the Growing Chaos of AI Context Files

When developing alongside AI agents, one of the first headaches that arises is how to manage the flood of context files they generate.

The Problem

Here are a few specific challenges I kept facing:

As your AI coding assistant evolves, you naturally want to externalize and back up its memory files.
These files are not deterministic and will inevitably differ across local environments and between developers.
Git merges often lead to nasty conflicts.
During code review, these files just get in the way.
Yet simply ignoring them with .gitignore is risky because they could be lost. You still want to back them up remotely.

That’s when I realized: maybe these files don't belong in the main branch at all. And that's how ai-docs was born.

GitHub - trknhr/ai-docs

The Spark

The idea hit me during a casual meeting. What if we isolate AI-related files on a separate Git branch and mount them as a worktree? That way, we could keep them versioned and visible, without polluting the main development flow.

Two days and one impulsive coding spree later, I had a working prototype. Like any proper AI-era project, I co-built it with ChatGPT and Claude.

How I Built It

I brainstormed the ideal workflow with ChatGPT.
When the conversation alone didn’t give me clarity, I prototyped locally using Git worktrees.
I summarized everything into a spec file and let Claude Code scaffold the CLI.
Then I tested, tweaked, and patched wherever things didn’t behave as expected.

What `ai-docs` Does

ai-docs is a CLI tool that helps you manage AI assistant context files by separating them into an isolated Git branch.

Key Features:

Creates an isolated branch named @ai-docs/{username}, where {username} is taken from the name in the config file, git user.name, or the hostname
Mounts this branch locally at .ai-docs/ via Git worktree
Moves files like memory-bank/ and CLAUDE.md to this branch
Automatically updates .gitignore in main to prevent tracking those files
Provides pull and push commands to sync changes
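The branch naming and worktree mount can be sketched in Python as follows. This is an illustration under assumptions, not ai-docs' actual code: the fallback order (config file, then git user.name, then hostname) is inferred from the list above, the function names are hypothetical, and the Git commands shown are one plausible way to create and mount such a branch.

```python
from typing import Optional


def resolve_username(
    config_name: Optional[str], git_user_name: Optional[str], hostname: str
) -> str:
    """Pick the first non-empty candidate: config file, git user.name, hostname."""
    for candidate in (config_name, git_user_name, hostname):
        if candidate and candidate.strip():
            return candidate.strip()
    raise ValueError("no usable username found")


def branch_name(username: str) -> str:
    """Build the per-user branch name described in the feature list."""
    return f"@ai-docs/{username}"


def worktree_commands(username: str, mount_dir: str = ".ai-docs") -> list[list[str]]:
    """Git commands that create the branch and mount it as a worktree."""
    branch = branch_name(username)
    return [
        ["git", "branch", branch],
        ["git", "worktree", "add", mount_dir, branch],
    ]


if __name__ == "__main__":
    user = resolve_username(None, "alice", "alices-mbp")
    for command in worktree_commands(user):
        print(" ".join(command))
```

The sketch only builds the commands instead of running them, and it leaves out sanitizing names that contain spaces, which a real tool would need because spaces are not valid in branch names.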

Challenges I Faced

1. Claude Code and the Danger of `rm -rf`

The initial versions made liberal use of rm -rf, which ended up deleting my .git folder. A brutal reminder that you should never blindly run AI-generated code.

I later restricted file deletions to cases where the --force flag is used, and leaned more heavily on safe git commands.
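A guard of this kind might look like the following Python sketch. It is hypothetical and not taken from ai-docs: `safe_remove` is an illustrative name, and the extra refusal to touch anything under `.git` is my own assumption layered on top of the --force rule described above.

```python
import shutil
from pathlib import Path


def safe_remove(target: Path, force: bool = False) -> bool:
    """Delete target only when force is set and it is not Git metadata.

    Returns True if something was deleted, False if the call was skipped.
    """
    target = Path(target)
    if not force:
        return False  # without --force, never delete anything
    if ".git" in target.resolve().parts:
        # Assumption for this sketch: Git metadata is always off limits.
        raise ValueError(f"refusing to delete Git metadata: {target}")
    if target.is_dir():
        shutil.rmtree(target)
    elif target.exists():
        target.unlink()
    else:
        return False  # nothing to delete
    return True
```

Making the destructive path opt-in means a bug elsewhere in the tool results in a skipped deletion rather than a lost repository.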

2. GitHub Actions: Trial and (Mostly) Error

I wanted to set up automatic releases using GoReleaser + GitHub Actions. But it was a frustrating loop of misconfigurations, outdated AI suggestions, and documentation-diving. I learned a lot, but definitely want to improve my speed here next time.

Usage (macOS Recommended)

brew tap trknhr/homebrew-tap
brew install ai-docs

# First-time setup (may need to run twice to initialize config)
ai-docs init -v

# Push local AI context files to remote
ai-docs push -v

# Pull updates from remote
ai-docs pull -v

Options like --dry-run and --force are supported and useful during testing.

Summary: A Clean Home for Your AI Files

ai-docs helps you:

Keep your working branches clean: AI context files live elsewhere
Access files easily: via .ai-docs/ worktree
Sync with ease: using simple push and pull

It’s still a rough-around-the-edges tool, but it works well enough to use daily.

If you're building with AI and want to keep things organized, give ai-docs a try. Feedback on GitHub or X (Twitter) would be amazing!

GitHub - trknhr/ai-docs

Happy vibe coding!

Cha Cha Chat with AI in Local

Teruo Kunihiro — Tue, 19 Dec 2023 04:33:31 +0000

Hello everyone. I recently joined a generative AI team at my current company. I don't have much experience with generative AI, so I've been experimenting with running a Large Language Model (LLM) locally to prepare for future requests to develop an AI chat app like ChatGPT. Since I'm a Japanese speaker, this article focuses on LLMs that handle Japanese.

Let's get started.

About PC Specifications

All experiments in this article were run in the following environment.

Model: MacBook Pro 14-inch, 2023
Chip: Apple M2 Max
Memory: 64GB
OS: macOS 14.1

About Large Language Models

There are various types of Large Language Models (LLMs), such as the well-known GPT, BERT, and LLaMA. Given my current knowledge, I won't dive into their differences or specifics. For this article, I chose LLaMA, which is popular among third parties for its accuracy and commercial viability.

Just Want to Get It Running

I knew that publicly available LLMs could be found on a site called Hugging Face, but I had no idea how to run them locally. My aim was to create something like ChatGPT for future app implementation ideas.

After some research, I came across two open-source projects, FastChat and Text generation web UI. With these, I was able to run llama2 locally and chat with it.

For those who just want to try llama2, Hugging Face has a demo page, which is probably the quickest way to experience it: Hugging Face Demo for Llama2

About Japanese Models

While llama2 performs well in English, it seems far from the level of ChatGPT in Japanese. The responses in Japanese often include English words or are expressed in romanized Japanese. So, I looked for Japanese models.

About Youri7B

This is a model pre-trained in Japanese by rinna Co., Ltd., based on llama2. I tried running it using the 'Text generation web UI' mentioned earlier. Rinna Youri-7B

However, it didn't work as expected. The model seemed to load correctly in the UI, but all responses were in English, and I couldn't figure out why.

Running Python Files

I tried running the Python scripts described on the Hugging Face Youri-7B page. It looked simpler than using third-party UIs, and I could have embedded it in an API once it worked. However, with my limited Python knowledge and the script consuming about 30GB of memory, my PC crashed.

Discovering Ollama

There were several reasons I couldn't get LLMs running in my local environment:

Lack of Python Knowledge
Many dependencies caused difficulties and frustrations
Wanted to ignore runtime environments
Wanted to avoid troubleshooting

Summing up these points, what I'm looking for now is an OSS with a chat UI that doesn't require specific knowledge of Python or understanding of dependencies, and one that has clear documentation on how to apply models from Hugging Face.

Then, while browsing around, I stumbled upon Ollama. Its documentation seemed minimal but sufficient for my needs. Ollama operates like Docker, with model configuration files and instructions for using models downloaded from Hugging Face. That's what I wanted!

Trying Ollama

Run an LLM for Japanese

I wanted to run the Japanese model Youri, so I set up the Modelfile as suggested in the documentation:

FROM ./models/rinna-youri-7b-chat-q6_K.gguf

TEMPLATE """[INST] {{ .Prompt }} [/INST] """
PARAMETER num_ctx 4096
PARAMETER stop "[INST]"
PARAMETER stop "[/INST]"

Additionally, I used a gguf model converted by a volunteer from this Hugging Face page.

Running as a server

Ollama runs a local server while the app is running, and setting it up is easy; see README.md for how to launch the server. I tried one of the user-provided UIs, ollama-ui, and asked it a question about Japanese history. However, the quality of its Japanese answers was lower than in English.
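For reference, the local server can also be called directly over HTTP, without a UI. The following Python sketch assumes Ollama's REST API on its default address (`http://localhost:11434`) and its `/api/generate` endpoint; the model name `youri-chat` in the comment is hypothetical and stands for whatever name the Modelfile was registered under.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming request for Ollama's /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        OLLAMA_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str) -> str:
    """Send the prompt to a running Ollama server and return the reply text."""
    with urllib.request.urlopen(build_generate_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]


# Example call (requires a running Ollama server; the model name is hypothetical):
# print(generate("youri-chat", "日本の首都はどこですか?"))
```

Setting `stream` to false returns one JSON object instead of a stream of partial responses, which keeps the client code short.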

Insights Gained While Running Ollama

While exploring the Ollama repository, I noticed it was written in Go, which made me curious about how it runs LLaMA. It turns out that Ollama uses llama.cpp for execution, which appears to be a project designed to run LLaMA smoothly on Mac. llama.cpp itself does not seem to depend on Python and is written in C++ instead; it wraps up the complex parts and makes them accessible even to those with little understanding, like myself.

Exploring Frontend LLM

I had heard rumors about running LLaMA as WebAssembly (WASM) on the frontend, so I looked into some ambitious projects like llama2.c-web and WebLLM, which run LLMs on WASM. Running LLMs on the frontend is fascinating because it allows immediate responses without network dependency, which is ideal for quick-response needs like voice input or text summarization. I tried both, and they worked impressively.
A configuration where lightweight, rapid-response tasks are handled at the edge while relatively heavier tasks are managed by server-based LLMs appears to have high potential for scalability.

Chat with llama2 on a web browser.

Check those demos out! They are fantastic.

https://webllm.mlc.ai/#chat-demo
https://diegomarcos.com/llama2.c-web

Try WebLLM

WebLLM is one of the MLC-LLM projects that compiles LLMs for web execution. By compiling the models, it enables them to run on various device runtimes prepared by MLC-LLM. This means you can create LLMs that run in the browser's WASM runtime without depending on Python modules. For users, it's quite amazing that simply loading the model in the browser can start a chat like magic.

Reference: MLC-LLM Project Overview

To run youri7b-chat, as described above, the model needs to be compiled first. For this, I referred to the following documentation and proceeded with the compilation:
Compile Models - MLC-LLM

While going through the documentation, I realized that emscripten also needs to be installed, so I prepared that as well:
Emscripten Installation Instructions

Once everything was ready and the compilation was done, I found something called simple-chat in the examples directory of webllm, which I decided to run locally:
Simple-Chat Example - WebLLM

The compilation and web server setup went smoothly, but the app itself didn't work, and I have no idea how to fix it.

Wrap-up

This journey was solely about exploring and running OSS in my local environment; I didn't write a single line of code. It highlighted the power of the OSS community and deepened my respect for everyone developing OSS. I hope to contribute to the LLM ecosystem in some way in the future.

In conclusion, while there were many challenges, it was a learning experience. M2 Macs can handle these models surprisingly well, encouraging me to keep experimenting. Goodbye for now.
