<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Amit Raz</title>
    <description>The latest articles on DEV Community by Amit Raz (@amit_raz_4280cb3a49bb4086).</description>
    <link>https://dev.to/amit_raz_4280cb3a49bb4086</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3841271%2F0c668b0c-cb3d-4bc7-97a5-d9f2e90ae912.jpg</url>
      <title>DEV Community: Amit Raz</title>
      <link>https://dev.to/amit_raz_4280cb3a49bb4086</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/amit_raz_4280cb3a49bb4086"/>
    <language>en</language>
    <item>
      <title>I Ran Google's New Gemma 4 Models Locally (26B and 31B) — Here's What I Found</title>
      <dc:creator>Amit Raz</dc:creator>
      <pubDate>Mon, 06 Apr 2026 11:03:46 +0000</pubDate>
      <link>https://dev.to/amit_raz_4280cb3a49bb4086/i-ran-googles-new-gemma-4-models-locally-26b-and-31b-heres-what-i-found-4o2g</link>
      <guid>https://dev.to/amit_raz_4280cb3a49bb4086/i-ran-googles-new-gemma-4-models-locally-26b-and-31b-heres-what-i-found-4o2g</guid>
      <description>&lt;p&gt;Google dropped Gemma 4 a few days ago and I immediately wanted to know: can&lt;br&gt;
you actually run these things locally on consumer hardware? Not for a research&lt;br&gt;
project. For real use.&lt;/p&gt;

&lt;p&gt;I had two machines to test with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An i9 with 96GB RAM and an RTX 4090&lt;/li&gt;
&lt;li&gt;A 64-core / 128-thread AMD machine (CPU-only)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I ran the 26B and 31B variants. Here's what happened.&lt;/p&gt;


&lt;h2&gt;
  
  
  A quick note on the architecture
&lt;/h2&gt;

&lt;p&gt;Before the numbers, one thing worth knowing: these two models are&lt;br&gt;
architecturally different.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;26B is a Mixture-of-Experts (MoE)&lt;/strong&gt; model with 128 experts, but only&lt;br&gt;
~4B parameters are active at any given time. That's why it's fast and fits&lt;br&gt;
comfortably in VRAM despite the 26B label.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;31B is a dense model&lt;/strong&gt; — all 31 billion parameters are active on every&lt;br&gt;
token. That's why it hits the memory wall hard.&lt;/p&gt;

&lt;p&gt;This distinction explains everything you're about to see in the benchmarks.&lt;/p&gt;


&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;p&gt;I used &lt;a href="https://ollama.com/" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt; to pull and run both models:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run gemma4:26b
ollama run gemma4:31b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both support a 256K context window and native function calling out of the box.&lt;/p&gt;
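&lt;p&gt;A side note on scripting: Ollama also exposes a local HTTP API (port 11434 by default), so you don't have to stay in the interactive CLI. Here's a minimal Python sketch for driving these runs and pulling the timing numbers; it assumes a default Ollama install with the model already pulled, and the helper names are my own:&lt;/p&gt;

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def ask(model: str, prompt: str) -> dict:
    """Send one non-streaming generation request to the local Ollama server."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def tokens_per_second(token_count: int, duration_ns: int) -> float:
    """Ollama reports token counts alongside durations in nanoseconds."""
    return token_count / (duration_ns / 1e9)
```

&lt;p&gt;The response dict carries the generated text under &lt;code&gt;response&lt;/code&gt; plus metadata like &lt;code&gt;eval_count&lt;/code&gt; and &lt;code&gt;eval_duration&lt;/code&gt;; dividing one by the other gives the same tokens/s metric shown in the benchmark tables below.&lt;/p&gt;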




&lt;h2&gt;
  
  
  Benchmarks
&lt;/h2&gt;

&lt;p&gt;I ran a mix of prompts: simple factual questions, some reasoning tasks, and&lt;br&gt;
something heavier — a complex trading algorithm that uses AI-based prediction.&lt;br&gt;
I asked the models to explain the logic and suggest improvements.&lt;/p&gt;

&lt;p&gt;I also compared the outputs directly against Claude Code on the same prompts.&lt;/p&gt;

&lt;h3&gt;
  
  
  26B (MoE) on RTX 4090
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prompt eval rate&lt;/td&gt;
&lt;td&gt;15.56 tokens/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Eval duration&lt;/td&gt;
&lt;td&gt;~10.5s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Generation rate&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;149.56 tokens/s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is fast. Like, actually fast. 149 tokens per second means you're not&lt;br&gt;
sitting and watching a cursor blink. It feels close to real-time. The MoE&lt;br&gt;
architecture earns its keep here — only 4B parameters are active, so the&lt;br&gt;
4090's 24GB VRAM handles it cleanly with room to spare.&lt;/p&gt;

&lt;h3&gt;
  
  
  31B (Dense) on RTX 4090
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prompt eval rate&lt;/td&gt;
&lt;td&gt;26.30 tokens/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Eval duration&lt;/td&gt;
&lt;td&gt;~3m 5s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Generation rate&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;7.84 tokens/s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Big drop. Unlike the 26B, the dense 31B has to load all its parameters for&lt;br&gt;
every token. It doesn't fit cleanly into the 4090's VRAM and spills into&lt;br&gt;
system RAM — you feel every bit of it. For interactive use, this is painful.&lt;/p&gt;

&lt;p&gt;The screenshot below shows what that looks like in Task Manager:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffgjbivos3ngqlmh937dv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffgjbivos3ngqlmh937dv.png" alt=" " width="680" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GPU dedicated memory is maxed out at ~45.9GB of the ~49GB available. The GPU usage reads&lt;br&gt;
low (around 24%) not because there's no work, but because the GPU spends most&lt;br&gt;
of its time waiting on data coming from system RAM.&lt;/p&gt;

&lt;h3&gt;
  
  
  26B (MoE) on AMD 64-core / 128-thread (CPU only)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prompt eval rate&lt;/td&gt;
&lt;td&gt;45.33 tokens/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Eval duration&lt;/td&gt;
&lt;td&gt;~3m 20s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Generation rate&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;8.80 tokens/s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Slower for generation, but the prompt eval rate is actually higher than the&lt;br&gt;
4090 — all those cores load context fast. Generation at 8.80 tokens/s is slow&lt;br&gt;
for interactive chat, but more usable than you'd expect for background tasks.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quality
&lt;/h2&gt;

&lt;p&gt;All three runs handled the trading algorithm task well. The output was&lt;br&gt;
structured, accurate, and included reasonable improvement suggestions.&lt;/p&gt;

&lt;p&gt;I compared the responses directly against Claude Code on the same prompts.&lt;br&gt;
They were practically identical. Not "close enough" — genuinely hard to tell&lt;br&gt;
apart on this type of task.&lt;/p&gt;

&lt;p&gt;That surprised me. A model running locally on your own hardware, for free,&lt;br&gt;
producing output indistinguishable from a frontier cloud API on complex&lt;br&gt;
reasoning tasks.&lt;/p&gt;




&lt;h2&gt;
  
  
  The part that surprised me most — and it applies to every setup
&lt;/h2&gt;

&lt;p&gt;Here's the thing that changed how I think about local models, regardless of&lt;br&gt;
whether you're running on a GPU or CPU:&lt;/p&gt;

&lt;p&gt;A local model isn't subject to API limits. No token limits per minute, no cost&lt;br&gt;
per call, no rate limiting. If you're running agents that need to process large&lt;br&gt;
contexts, search through a codebase, analyze documents, or run long autonomous&lt;br&gt;
tasks — you can just let them run overnight. The agent works while you sleep.&lt;/p&gt;

&lt;p&gt;For agentic workflows specifically, this is a bigger deal than the raw token/s&lt;br&gt;
numbers suggest. An 8.80 tok/s model running uninterrupted for 8 hours&lt;br&gt;
processes a lot more work than a faster cloud model that hits rate limits every&lt;br&gt;
few minutes.&lt;/p&gt;
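&lt;p&gt;The back-of-envelope math is worth doing once (my arithmetic, using the CPU-only rate from the table above):&lt;/p&gt;

```python
def tokens_overnight(rate_tok_per_s: float, hours: float) -> int:
    """Total tokens generated at a sustained rate over a wall-clock span."""
    return int(rate_tok_per_s * hours * 3600)


# The CPU-only 26B run, left alone for a full night:
# 8.80 tok/s sustained for 8 hours comes out to roughly 253k tokens,
# with no rate limits or per-call costs interrupting the agent.
overnight = tokens_overnight(8.80, 8)
```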




&lt;h2&gt;
  
  
  Verdict
&lt;/h2&gt;

&lt;p&gt;The 26B MoE on a 4090 is the sweet spot right now. It fits cleanly in VRAM,&lt;br&gt;
generates at 149 tok/s, and produces quality that holds up against frontier&lt;br&gt;
models on reasoning tasks. For most local development and agentic use cases,&lt;br&gt;
you won't feel a meaningful gap.&lt;/p&gt;

&lt;p&gt;The 31B dense needs more VRAM than most people have. Unless you have a&lt;br&gt;
multi-GPU setup or an M-series Mac with 64GB+, the memory pressure kills the&lt;br&gt;
speed advantage you'd expect from the larger model.&lt;/p&gt;

&lt;p&gt;The CPU-only path is more viable than I expected for non-latency-sensitive&lt;br&gt;
work. If you have a powerful server without a GPU, the 26B MoE is genuinely&lt;br&gt;
runnable for batch tasks.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;In the next post I'll show how to connect Cursor, VS Code, and Claude Code to&lt;br&gt;
a locally running model like this. That's where it becomes practically useful&lt;br&gt;
for day-to-day development.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm Amit Raz, a Software Architect specializing in AI and software&lt;br&gt;
development. I build tools and apps at &lt;a href="https://rzailabs.com" rel="noopener noreferrer"&gt;rzailabs.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>gemma</category>
    </item>
    <item>
      <title>Why Your Android Reminder App Is Silently Failing You (And How to Fix It)</title>
      <dc:creator>Amit Raz</dc:creator>
      <pubDate>Fri, 03 Apr 2026 14:29:54 +0000</pubDate>
      <link>https://dev.to/amit_raz_4280cb3a49bb4086/why-your-android-reminder-app-is-silently-failing-you-and-how-to-fix-it-2ff2</link>
      <guid>https://dev.to/amit_raz_4280cb3a49bb4086/why-your-android-reminder-app-is-silently-failing-you-and-how-to-fix-it-2ff2</guid>
      <description>&lt;p&gt;You set a reminder. You go about your day. The time passes. Nothing fires.&lt;/p&gt;

&lt;p&gt;You open the app and the task is sitting there, overdue, with no notification ever sent. Sound familiar?&lt;/p&gt;

&lt;p&gt;This isn't a bug in any one app. It's a systemic problem with how most reminder apps handle alarms on Android, and once you understand why it happens, you can't unsee it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Root Cause: Android's Battery Optimization
&lt;/h2&gt;

&lt;p&gt;Android has gotten increasingly aggressive about killing background processes to save battery. This is mostly good for users, but it creates a real problem for apps that need to wake up at a specific time and do something.&lt;/p&gt;

&lt;p&gt;The standard approaches most apps use are &lt;code&gt;WorkManager&lt;/code&gt;, &lt;code&gt;Handler.postDelayed()&lt;/code&gt;, or scheduled jobs. These are perfectly fine for non-time-critical background work. But for a reminder that needs to fire at exactly 9:00am, they're not reliable. Android can, and will, defer or skip them entirely when Doze mode is active or battery saver is on.&lt;/p&gt;

&lt;p&gt;Doze mode kicks in when the device is stationary and unplugged for a while. Battery saver can be triggered manually or automatically. On some manufacturers (Samsung, Xiaomi, OnePlus especially) there's additional proprietary battery optimization on top of Android's own system that makes this even worse.&lt;/p&gt;

&lt;p&gt;The result: your reminder app looks fine. It shows the task. It shows the time. But the alarm never fires.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Right API for Time-Critical Alarms
&lt;/h2&gt;

&lt;p&gt;Android has a specific API designed for exactly this use case: &lt;code&gt;AlarmManager&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;There are several methods, and the difference between them matters a lot:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Inexact, deferrable. Android can batch and delay this.&lt;/span&gt;
&lt;span class="n"&gt;alarmManager&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;AlarmManager&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;RTC_WAKEUP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;triggerTime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pendingIntent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// Exact, but still deferrable during Doze mode.&lt;/span&gt;
&lt;span class="n"&gt;alarmManager&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setExact&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;AlarmManager&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;RTC_WAKEUP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;triggerTime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pendingIntent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// Exact, fires even during Doze mode. This is what you want.&lt;/span&gt;
&lt;span class="n"&gt;alarmManager&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setExactAndAllowWhileIdle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;AlarmManager&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;RTC_WAKEUP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;triggerTime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pendingIntent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// Treated like a clock alarm by the system. Highest priority.&lt;/span&gt;
&lt;span class="n"&gt;alarmManager&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setAlarmClock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alarmClockInfo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pendingIntent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;setAlarmClock()&lt;/code&gt; is the one I ended up using in &lt;a href="https://rzailabs.com/projects/sticky-tasks" rel="noopener noreferrer"&gt;Sticky Tasks&lt;/a&gt;. It shows a clock icon in the status bar (which is actually useful UX, users can see their next alarm is set) and Android won't suppress it. It's the same mechanism the built-in clock app uses.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;setExactAndAllowWhileIdle()&lt;/code&gt; is a solid alternative if you don't want the status bar indicator. Both will fire reliably during Doze mode and with battery saver active.&lt;/p&gt;

&lt;h2&gt;
  
  
  Permissions You Need to Declare
&lt;/h2&gt;

&lt;p&gt;Starting from Android 12 (API 31), you need to explicitly request the &lt;code&gt;SCHEDULE_EXACT_ALARM&lt;/code&gt; permission:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;uses-permission&lt;/span&gt; &lt;span class="na"&gt;android:name=&lt;/span&gt;&lt;span class="s"&gt;"android.permission.SCHEDULE_EXACT_ALARM"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From Android 13 (API 33), you should also handle the case where this permission is revoked by the user. Check it before scheduling:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alarmManager&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;canScheduleExactAlarms&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// schedule the alarm&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// direct user to settings to grant permission&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;intent&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ACTION_REQUEST_SCHEDULE_EXACT_ALARM&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startActivity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For full-screen notifications (the kind that show even when the screen is off), you also need:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;uses-permission&lt;/span&gt; &lt;span class="na"&gt;android:name=&lt;/span&gt;&lt;span class="s"&gt;"android.permission.USE_FULL_SCREEN_INTENT"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since Android 14, you need to request this at runtime, not just declare it in the manifest.&lt;/p&gt;

&lt;h2&gt;
  
  
  Surviving Reboots
&lt;/h2&gt;

&lt;p&gt;Here's one that catches a lot of developers off guard. &lt;code&gt;AlarmManager&lt;/code&gt; alarms don't survive a device restart. The moment the phone reboots, all your scheduled alarms are gone.&lt;/p&gt;

&lt;p&gt;The fix is a &lt;code&gt;BroadcastReceiver&lt;/code&gt; that listens for &lt;code&gt;BOOT_COMPLETED&lt;/code&gt; and re-registers all pending alarms:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BootReceiver&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;BroadcastReceiver&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;onReceive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="p"&gt;==&lt;/span&gt; &lt;span class="nc"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ACTION_BOOT_COMPLETED&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="c1"&gt;// fetch all pending tasks from your database&lt;/span&gt;
            &lt;span class="c1"&gt;// re-schedule their alarms&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Register it in your manifest:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;receiver&lt;/span&gt; &lt;span class="na"&gt;android:name=&lt;/span&gt;&lt;span class="s"&gt;".BootReceiver"&lt;/span&gt; &lt;span class="na"&gt;android:exported=&lt;/span&gt;&lt;span class="s"&gt;"true"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;intent-filter&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;action&lt;/span&gt; &lt;span class="na"&gt;android:name=&lt;/span&gt;&lt;span class="s"&gt;"android.intent.action.BOOT_COMPLETED"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/intent-filter&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/receiver&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And add the permission:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;uses-permission&lt;/span&gt; &lt;span class="na"&gt;android:name=&lt;/span&gt;&lt;span class="s"&gt;"android.permission.RECEIVE_BOOT_COMPLETED"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without this, any user who restarts their phone loses all their scheduled reminders silently. They'll never know why.&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing This Properly
&lt;/h2&gt;

&lt;p&gt;Manual testing is painful here. You can't just wait for alarms to fire. A few adb commands that help:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Force device into Doze mode immediately&lt;/span&gt;
adb shell dumpsys deviceidle force-idle

&lt;span class="c"&gt;# Check current Doze state&lt;/span&gt;
adb shell dumpsys deviceidle

&lt;span class="c"&gt;# Simulate battery saver on&lt;/span&gt;
adb shell settings put global low_power 1

&lt;span class="c"&gt;# Turn it off&lt;/span&gt;
adb shell settings put global low_power 0

&lt;span class="c"&gt;# Step through Doze states manually&lt;/span&gt;
adb shell dumpsys deviceidle step
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Test that your alarm fires correctly in each of these states before shipping. It's tedious, but worth it. This is exactly the kind of thing that fails silently in production, and you'll never see it in your crash logs.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Looks Like in Practice
&lt;/h2&gt;

&lt;p&gt;I ran into all of this while building &lt;a href="https://rzailabs.com/projects/sticky-tasks" rel="noopener noreferrer"&gt;Sticky Tasks&lt;/a&gt;, a reminder app I built because I kept missing notifications from other apps. Once I switched to &lt;code&gt;setAlarmClock()&lt;/code&gt; and added the boot receiver, the reliability difference was immediate.&lt;/p&gt;

&lt;p&gt;Alarms fire with battery saver on. They fire after restarts. They fire on Samsung devices with aggressive battery optimization enabled.&lt;/p&gt;

&lt;p&gt;It's not magic. It's just using the right API for the job.&lt;/p&gt;

&lt;p&gt;If you're building anything time-critical on Android, whether it's reminders, medication alerts, scheduled notifications, or anything that has to fire at a specific moment, this is the approach. The standard background job APIs aren't designed for this and they'll let you down.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm Amit Raz, a Software Architect and AI consultant based in Israel. I build Android apps under the RZApps brand and write about what I learn along the way. More at &lt;a href="https://rzailabs.com" rel="noopener noreferrer"&gt;rzailabs.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Why I Started Watching My Claude Code Context Window (And Built Something to Track It)</title>
      <dc:creator>Amit Raz</dc:creator>
      <pubDate>Thu, 02 Apr 2026 11:01:12 +0000</pubDate>
      <link>https://dev.to/amit_raz_4280cb3a49bb4086/why-i-started-watching-my-claude-code-context-window-and-built-something-to-track-it-22o2</link>
      <guid>https://dev.to/amit_raz_4280cb3a49bb4086/why-i-started-watching-my-claude-code-context-window-and-built-something-to-track-it-22o2</guid>
      <description>&lt;p&gt;If you're using Claude Code heavily and not paying attention to your context window, you're probably paying more than you need to. Here's why it matters and what I changed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The thing most people don't realize
&lt;/h2&gt;

&lt;p&gt;Every time you send a message in Claude Code, the entire conversation history gets sent with it. Not just your new question. Everything. Every file you pasted, every response Claude gave, every back-and-forth since you opened the session.&lt;/p&gt;

&lt;p&gt;This means cost doesn't scale with message length. It scales with accumulated context.&lt;/p&gt;

&lt;p&gt;If your context window is at 70% and you ask something simple like "can you rename this variable?", you're paying for the full 70% of history sitting behind that tiny question. The question itself is almost irrelevant to the token count.&lt;/p&gt;

&lt;p&gt;Once this clicked for me, I couldn't unsee it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually drives your token costs
&lt;/h2&gt;

&lt;p&gt;Let's make this concrete.&lt;/p&gt;

&lt;p&gt;Say you've been in a Claude Code session for two hours. You've pasted several files, iterated on a feature, debugged a few things. Your context is sitting at 65%. Now you ask a quick follow-up question.&lt;/p&gt;

&lt;p&gt;That API call includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All the files you pasted earlier&lt;/li&gt;
&lt;li&gt;Every response Claude gave&lt;/li&gt;
&lt;li&gt;All your messages&lt;/li&gt;
&lt;li&gt;Your new question (tiny, almost irrelevant to the total)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The new question might be 20 tokens. The history behind it could be 40,000. That's what you're paying for.&lt;/p&gt;

&lt;p&gt;This is by design, not a bug. The model needs the history to maintain coherence. But it means your costs compound as a session grows, and most people don't notice because there's no obvious signal telling them to pay attention.&lt;/p&gt;
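&lt;p&gt;The compounding is easy to sketch. This is illustrative only; it ignores the system prompt, tool output, and any prompt caching, and just counts what happens when every request resends the full history:&lt;/p&gt;

```python
def session_input_tokens(message_sizes: list[int]) -> int:
    """Total input tokens billed over a session where each request
    resends the entire conversation so far."""
    total = 0
    context = 0
    for size in message_sizes:
        context += size   # the new message joins the accumulated context
        total += context  # the whole context ships with this request
    return total


# Ten exchanges of ~500 tokens each: the final request alone carries
# 5,000 tokens of context, and the session bills 27,500 input tokens
# in total. Cost grows quadratically with session length.
```

&lt;p&gt;The point of the sketch: the tenth question costs ten times what the first one did, even if the questions themselves are the same size.&lt;/p&gt;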

&lt;h2&gt;
  
  
  The fix is simple but you have to be deliberate about it
&lt;/h2&gt;

&lt;p&gt;When a session gets long, especially before starting a new feature or a significant refactor, I now do this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open a fresh session&lt;/li&gt;
&lt;li&gt;Write a short handoff note: what we built, current state of the code, what I need next&lt;/li&gt;
&lt;li&gt;Paste only the files relevant to the next task&lt;/li&gt;
&lt;li&gt;Continue from there&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's it. The handoff takes maybe two minutes. In exchange, I'm starting the next task with a lean context instead of dragging tens of thousands of tokens of history into every subsequent query.&lt;/p&gt;

&lt;p&gt;The responses often get sharper too. A packed context window can cause the model to lose focus on earlier content. Starting fresh with a tight, relevant context tends to produce more focused answers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I built a custom status bar
&lt;/h2&gt;

&lt;p&gt;The problem is that none of this is visible by default in Claude Code. You're flying blind. There's no indicator telling you how full your context is, how much of your 5-hour session budget you've used, or how much of your 7-day limit remains.&lt;/p&gt;

&lt;p&gt;So I built a custom status bar that shows all three in real time.&lt;/p&gt;

&lt;p&gt;It sits in the terminal and updates as I work. When I see the context creeping up, it's a clear signal: finish this thread, write the handoff, open a new session.&lt;/p&gt;

&lt;p&gt;Before I built it, I had no idea how fast context accumulates during a real coding session. Seeing the number climb in real time changes how you work.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(I shared how I built it in a previous post. Link in the comments.)&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The mental model shift
&lt;/h2&gt;

&lt;p&gt;Think of the context window status like a fuel gauge, not a progress bar.&lt;/p&gt;

&lt;p&gt;A progress bar tells you how far you've come. A fuel gauge tells you when to stop and refuel before you run out. The context window is the latter. Watching it helps you make an active decision: keep going, or reset and start lean.&lt;/p&gt;

&lt;p&gt;Most developers I've talked to treat Claude Code sessions like a continuous conversation that they just let run. That works fine for short tasks. For longer sessions, it's quietly expensive.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Every Claude Code query sends the full conversation history&lt;/li&gt;
&lt;li&gt;Cost scales with accumulated context, not message length&lt;/li&gt;
&lt;li&gt;Heavy context also affects response quality&lt;/li&gt;
&lt;li&gt;The fix: start fresh sessions before big new tasks, with a short handoff summary&lt;/li&gt;
&lt;li&gt;Make context visible so you know when to reset&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're using Claude Code for serious development work, the context window is worth paying attention to. It's not just a technical detail. It's directly tied to what you're spending.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Amit Raz is a Software Architect and AI consultant based in Israel. I build AI-powered products and write about developer tools, Android development, and AI workflows at &lt;a href="https://rzailabs.com" rel="noopener noreferrer"&gt;rzailabs.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>ai</category>
      <category>productivity</category>
      <category>devtools</category>
    </item>
    <item>
      <title>Stop Using Claude Code for Everything: How I Cut My Token Usage by Being Smarter About Which AI Does What</title>
      <dc:creator>Amit Raz</dc:creator>
      <pubDate>Mon, 30 Mar 2026 07:57:02 +0000</pubDate>
      <link>https://dev.to/amit_raz_4280cb3a49bb4086/stop-using-claude-code-for-everything-how-i-cut-my-token-usage-by-being-smarter-about-which-ai-1dm5</link>
      <guid>https://dev.to/amit_raz_4280cb3a49bb4086/stop-using-claude-code-for-everything-how-i-cut-my-token-usage-by-being-smarter-about-which-ai-1dm5</guid>
      <description>&lt;p&gt;Claude Code is genuinely impressive. It can navigate a large codebase, plan multi-step changes, write and run tests, and iterate on its own output. But it's also expensive to run, and if you're using it the way I was at first, you're probably burning a lot of tokens on tasks that don't need it.&lt;/p&gt;

&lt;p&gt;Here's the workflow shift that made the biggest difference for me.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem with using one tool for everything
&lt;/h2&gt;

&lt;p&gt;When Claude Code is open in your terminal, it's tempting to just throw everything at it. Rename this variable. Write a quick helper function. Add a comment here. Fix this typo.&lt;/p&gt;

&lt;p&gt;The thing is, Claude Code sends your full conversation context with every request. That means even a tiny ask like "rename this variable" is consuming tokens proportional to how long your session has been running. It adds up fast.&lt;/p&gt;

&lt;p&gt;Meanwhile, you probably already have a perfectly capable model sitting right there in your editor that's much cheaper to run for simple tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  The setup: run Claude Code inside your editor
&lt;/h2&gt;

&lt;p&gt;Before getting into the workflow, there's a small change worth making. If you've been running Claude Code in an external terminal while VSCode or Cursor is open on the same folder, move it into the integrated terminal. You don't need to configure anything. Just open the terminal panel inside the editor and run Claude Code from there.&lt;/p&gt;

&lt;p&gt;What you gain from this is access to the full editor context while Claude is working. The git diff panel shows you exactly what's being changed in real time. You can stage or revert specific hunks without switching windows. And when something breaks, the debugger is right there. You can set breakpoints, inspect variables, and feed that information back to Claude without ever leaving the editor.&lt;br&gt;
It also makes the two-model workflow I'm about to describe much easier to actually use in practice.&lt;/p&gt;

&lt;h2&gt;
  
  
  The workflow: right tool for the right task
&lt;/h2&gt;

&lt;p&gt;Once Claude Code is running inside your editor, you have two AI tools in the same window:&lt;/p&gt;

&lt;p&gt;Claude Code in the terminal, for tasks that need real reasoning. Architecting a new feature. Debugging something complex. Refactoring across multiple files. Anything where you need the model to actually think through a problem.&lt;/p&gt;

&lt;p&gt;Cursor composer or GitHub Copilot (depending on your editor), for everything else. Renaming things. Writing a simple util. Adding a docblock. Generating a test for a function that's already clearly defined. These are pattern-matching tasks, and a smaller model handles them just fine.&lt;/p&gt;
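&lt;p&gt;If it helps to see the split as code, here's one way to encode it as a tiny routing heuristic. The function name and thresholds are my own illustration, not a real API:&lt;/p&gt;

```python
# Hypothetical routing heuristic for the two-tool split described above.
# The thresholds mirror a "one sentence, one or two files" rule of thumb.
def route(one_sentence_task: bool, output_known: bool, files_touched: int) -> str:
    """Pick the cheaper in-editor model when the task is small and well-defined."""
    if one_sentence_task and output_known and files_touched <= 2:
        return "composer"     # or Copilot: pattern-matching work
    return "claude-code"      # needs actual reasoning

route(True, True, 1)    # a rename or quick helper -> "composer"
route(False, False, 5)  # a cross-file refactor -> "claude-code"
```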

&lt;p&gt;The practical rule I use: if I can describe the task in one sentence and I already know roughly what the output should look like, it goes to the composer or Copilot. If I'm not sure how to solve it, or it touches more than one or two files, it goes to Claude Code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this actually matters
&lt;/h2&gt;

&lt;p&gt;Claude Code's billing is based on tokens, and context accumulates over a session. The longer you've been working, the more expensive each request gets, even the simple ones. By handling small tasks in the composer instead, you keep Claude Code sessions shorter and more focused, which means less context bloat and lower costs overall.&lt;/p&gt;

&lt;p&gt;There's also a quality argument here. Claude Code does its best work on hard problems. If you're constantly interrupting it with trivial requests, you're not getting the most out of it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this looks like in practice
&lt;/h2&gt;

&lt;p&gt;A typical session for me now looks something like this:&lt;/p&gt;

&lt;p&gt;I open Cursor with Claude Code running in the integrated terminal. I'm building a new feature, so I ask Claude Code to plan the implementation and start on the core logic. While it's working, I can watch the diffs in the git panel. If it breaks something, I use the debugger to understand what happened and give Claude Code the specific info it needs.&lt;/p&gt;

&lt;p&gt;For smaller things that come up along the way, like tweaking a component or writing a quick helper, I switch to the composer. It's faster, it's cheaper, and it doesn't interrupt the Claude Code session or bloat the context.&lt;/p&gt;

&lt;p&gt;When I'm done with the feature, I start a fresh Claude Code session for the next task rather than letting the context grow indefinitely.&lt;/p&gt;

&lt;h2&gt;
  
  
  The short version
&lt;/h2&gt;

&lt;p&gt;Claude Code is a powerful tool, but it's not free, and not every task needs it. Running it inside your editor instead of an external terminal makes it easier to use alongside your editor's built-in AI for simpler tasks. Keep Claude Code for the hard stuff. Let the lighter models handle the rest.&lt;/p&gt;

&lt;p&gt;It's a small workflow change, but it makes a real difference over a full day of coding.&lt;/p&gt;

&lt;p&gt;I'm Amit Raz, a Software Architect specializing in AI and software development based in Haifa, Israel. I build AI-powered products and help businesses integrate AI into their workflows. More at rzailabs.com.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>productivity</category>
      <category>tooling</category>
    </item>
    <item>
      <title>I Made a Command That Documents My Entire Repo Every Time I Take a Break</title>
      <dc:creator>Amit Raz</dc:creator>
      <pubDate>Tue, 24 Mar 2026 07:44:11 +0000</pubDate>
      <link>https://dev.to/amit_raz_4280cb3a49bb4086/i-made-a-command-that-documents-my-entire-repo-every-time-i-take-a-break-44nj</link>
      <guid>https://dev.to/amit_raz_4280cb3a49bb4086/i-made-a-command-that-documents-my-entire-repo-every-time-i-take-a-break-44nj</guid>
<description>&lt;p&gt;I work with AI coding agents every day. Cursor, Claude Code, sometimes both in the same project. And the thing that used to slow me down the most wasn't writing code. It was re-orienting the agent at the start of every session.&lt;/p&gt;

&lt;p&gt;"Here's the folder structure." "We use Provider for state." "Don't put new screens in the root of lib." "Android alarm testing needs a real device."&lt;/p&gt;

&lt;p&gt;Over and over. Every session.&lt;/p&gt;

&lt;p&gt;So I built a fix. I call it /document-project, and the prompt file is here if you want to skip straight to it: &lt;a href="https://gist.github.com/razamit/b28d7d8b0acaf995969673df47333d58" rel="noopener noreferrer"&gt;https://gist.github.com/razamit/b28d7d8b0acaf995969673df47333d58&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What it actually is
&lt;/h2&gt;

&lt;p&gt;It's a markdown prompt file I keep in my projects. When I type the command in Cursor or Claude Code, the agent reads the file, walks the entire repo, and produces or updates two things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AGENTS.md at the root — a machine-oriented map with build commands, tech stack, layout, conventions, and known footguns&lt;/li&gt;
&lt;li&gt;Folder-level README.md files — short, only where they add navigation value&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The file tells the agent exactly what to write, what to skip, and how to format it. It prioritizes accurate and short over comprehensive and stale.&lt;/p&gt;

&lt;h2&gt;
  
  
  A real example
&lt;/h2&gt;

&lt;p&gt;Here's what AGENTS.md looks like after running on my StickyTasks Flutter app:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## One-line purpose&lt;/span&gt;
Cross-platform Flutter task app with sticky reminders, local
notifications, recurring tasks, optional daily recap, ads + Pro IAP.

&lt;span class="gu"&gt;## Tech stack&lt;/span&gt;
| Area  | Stack                                    |
|-------|------------------------------------------|
| App   | Flutter, Dart SDK ^3.9.0                 |
| State | provider                                 |
| DB    | Hive + hive_flutter                      |
| Notif | flutter_local_notifications, timezone    |

&lt;span class="gu"&gt;## Layout map&lt;/span&gt;
| Path              | Role                                          |
|-------------------|-----------------------------------------------|
| lib/              | screens, widgets, viewmodels, controllers...  |
| android/          | Kotlin native, alarms, manifest               |
| functions/        | Firebase callable sendFeedback                |

&lt;span class="gu"&gt;## Known footguns&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Android alarms: test on real device, not emulator
&lt;span class="p"&gt;-&lt;/span&gt; Hive model changes: run build_runner after
&lt;span class="p"&gt;-&lt;/span&gt; Firebase feedback: needs RESEND_API_KEY configured
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No walls of text. No dependency dumps. Just what an agent needs to orient fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  The key design decisions in the prompt
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;AGENTS.md comes first. Many tools treat it as the machine-oriented entry point.&lt;/li&gt;
&lt;li&gt;Folder READMEs only where they pay off — top-level areas, confusing names, shared entry points. Not everywhere.&lt;/li&gt;
&lt;li&gt;Don't paste dependency lists or full directory trees. Summarize and point to files.&lt;/li&gt;
&lt;li&gt;Update, don't append. If structure changed, remove outdated sections instead of adding contradictions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How to set it up
&lt;/h2&gt;

&lt;p&gt;Download the file from the Gist and drop it in your project:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cursor: place it in .cursor/ named document-project.md&lt;/li&gt;
&lt;li&gt;Claude Code: place it in .claude/commands/ named document-project.md&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then just type /document-project in the chat. That's it.&lt;/p&gt;
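&lt;p&gt;As a sketch, the whole setup is two directories and a copy. The stub file below stands in for the real Gist contents, which you'd download manually; the paths are the ones from the steps above:&lt;/p&gt;

```shell
# Stand-in for the downloaded prompt file; replace with the real Gist contents.
printf '# document-project prompt\n' > document-project.md

# Paths per the article: Cursor uses .cursor/, Claude Code uses .claude/commands/.
mkdir -p .cursor .claude/commands
cp document-project.md .cursor/document-project.md
cp document-project.md .claude/commands/document-project.md
```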

&lt;h2&gt;
  
  
  When I run it
&lt;/h2&gt;

&lt;p&gt;I run it on a break. Literally: type the command, go make coffee, come back, and the whole repo is documented and up to date.&lt;/p&gt;

&lt;p&gt;The next session, the agent starts oriented. It puts new features in the right layer. It knows what build_runner is for. It doesn't ask me to re-explain the architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bigger principle
&lt;/h2&gt;

&lt;p&gt;AI agents don't have memory. That's not going to change soon. So the question is: how do you give them the context they need without spending 10 minutes typing it every single time?&lt;/p&gt;

&lt;p&gt;Good documentation written for machines, not humans, is the answer. Short, scannable, honest about footguns. AGENTS.md is that format.&lt;/p&gt;

&lt;p&gt;If you're using Cursor or Claude Code on any project longer than a weekend, this is worth setting up.&lt;/p&gt;

&lt;p&gt;Prompt file: &lt;a href="https://gist.github.com/razamit/b28d7d8b0acaf995969673df47333d58" rel="noopener noreferrer"&gt;https://gist.github.com/razamit/b28d7d8b0acaf995969673df47333d58&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I'm Amit Raz, a Software Architect and AI consultant. I build AI-powered products and help teams integrate AI into their workflows. Check out my work at rzailabs.com.&lt;/p&gt;

</description>
      <category>cursor</category>
      <category>claudecode</category>
      <category>productivity</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
