DEV Community: Urvil Joshi

10 Claude Code Features for Daily Work

Urvil Joshi — Mon, 22 Jun 2026 06:41:14 +0000

Claude Code starts brilliant. Then twenty minutes in, it forgets your conventions, edits the one package you explicitly told it to leave alone, and confidently ships the wrong thing.

You’ve felt this. And the natural conclusion is “the model got worse.” It didn’t. Your context got messy. Almost every Claude Code feature worth knowing exists to fix exactly that one problem, and once you see them through that lens, they stop being a random list of commands and start being a single discipline.

Everything below is demoed on one project: task-api, a small Spring Boot project (controllers, services, repositories, a shared exception handler, JUnit tests).

🍥Feature 1 : CLAUDE.md - stop re-explaining your project

The number-one reason Claude feels inconsistent isn’t that it’s dumb. It’s that every new session starts from zero knowledge of your project. So it makes perfectly reasonable choices that just aren’t your choices.

CLAUDE.md fixes that. It's a markdown file in your repo root that Claude Code auto-loads at the start of every conversation: your stack, your conventions, your "never touch this" rules, so you stop re-explaining yourself.

Here’s the difference, with one prompt run twice: “Add a DELETE /tasks/{id} endpoint.”

Without CLAUDE.md, Claude adds the controller mapping correctly and writes a service method, but it invents its own exception handling, throwing a generic “resource not found” when the task is missing. The repo already has a shared getOrThrow + ApiError pattern for exactly this. Claude also wrote no test case and ignored the other best practices used in this repo, because it had no way to know they existed. The code isn't wrong. It's generic.

That’s the real value. CLAUDE.md turns “technically correct” into “the way we do it here,” every session, without you typing it again.

Best practices for CLAUDE.md:

Create it once with /init, which scans the repo and scaffolds a starter file. A common question is "how often do I run /init?" The answer is once: it's a bootstrap, not a ritual. Not per feature, not per story. After that you maintain the file by hand, and only re-run /init after a big architectural overhaul.
Edit it with /memory (it opens the file for you), or just open it in your editor and type.
What goes in it: things that are true on almost every turn, such as the build command, test framework, and architecture rules. If something only matters sometimes, it doesn’t belong here.
Phrase rules positively. “Prefer X over Y” sticks better than “Don’t do Y”; language models tend to handle positive framing better than negation.

Here is my generated CLAUDE.md, to which I added a few best practices to follow when adding any code to this project.

With CLAUDE.md, same prompt: now it routes through the shared getOrThrow method, returns the standard ApiError, keeps the controller thin, and writes the test in the project's style on the first try, no nudging.
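For readers who have not met this pattern, here is a minimal sketch of what a shared getOrThrow helper and an ApiError payload could look like. Only the names getOrThrow and ApiError come from the project described above. The NotFoundException type, the fields of ApiError, and the in-memory map standing in for a repository are assumptions made for illustration, and the sketch is plain Java rather than the project's actual Spring wiring.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical error payload; the real ApiError in task-api may carry other fields.
record ApiError(int status, String message) {}

// Hypothetical exception that a shared handler would translate into an ApiError.
class NotFoundException extends RuntimeException {
    NotFoundException(String message) {
        super(message);
    }
}

record Task(long id, String title) {}

class TaskService {
    // In-memory stand-in for a Spring Data repository (assumption).
    private final Map<Long, Task> store = new ConcurrentHashMap<>();

    void save(Task task) {
        store.put(task.id(), task);
    }

    // The single place that turns "missing task" into the project's standard error.
    Task getOrThrow(long id) {
        return Optional.ofNullable(store.get(id))
                .orElseThrow(() -> new NotFoundException("Task " + id + " not found"));
    }

    // What DELETE /tasks/{id} would call: reuse getOrThrow instead of new error handling.
    void delete(long id) {
        Task task = getOrThrow(id);
        store.remove(task.id());
    }
}

class SharedExceptionHandler {
    // Stand-in for an exception-handler method: maps the exception to a 404 ApiError.
    static ApiError handle(NotFoundException e) {
        return new ApiError(404, e.getMessage());
    }
}
```

The point of the pattern is that the "not found" decision lives in one method, so a new endpoint inherits the same 404 behaviour simply by calling it.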

🍥Feature 2 : /context - see what’s actually filling the window

So how do you see your context? Type /context.

It’s the dashboard most people never open. It shows exactly what’s filling your window right now (system prompt, tools, your CLAUDE.md, loaded skills, MCP servers, and your message history), broken down by token count and percentage. When Claude starts drifting, this tells you why, and it flags when tools or history are eating too much.

One thing worth understanding: the context window is the whole container. Message history (every prompt, reply, and tool result so far) is just one slice of it, and it’s the slice that balloons over a long session and crowds everything else out.

Want the full breakdown? /context all expands the view.

🍥Feature 3 : Subagents - the biggest context win

A subagent is a separate Claude instance with its own context window. When you need it to read forty files to find something, you don’t want forty files dumped into your main conversation; that’s pure pollution. The subagent does the digging in its own window and hands back only the answer.

Start with /agents. The library shows the built-in subagents that ship with Claude Code, and you can create custom ones.
I made a project-scoped “codebase investigator” and gave it a description.
Gave it read-only tools (it’s an investigator; it should never edit).
Picked Sonnet as the model, and confirmed.
It now lives at .claude/agents/ in the repo.

To run it, mention it with @:

Build a few (an explorer, a test-runner, a security-auditor) and each one keeps its mess out of your main thread. It’s the difference between researching in your main document versus opening a scratch tab. Same brain, protected workspace.

🍥Feature 4 : /compact - compress on purpose, not blindly

When a session gets long, Claude auto-compacts: it summarizes history to free up space. The upgrade is to not wait for it, and not let it summarize blind.

Picture a session with a useful chunk of work plus a messy debugging detour you no longer care about. Passive auto-compaction summarizes everything equally, including the dead ends. Manual compaction lets you steer it.

Check the expanded view first with /context all to see what you're working with. Then compact with intent:

Run /context all again and you'll see usage drop sharply — but the decisions survived and the exhaust is gone. Compact on purpose and a session stays sharp for hours instead of slowly turning to mush.

🍥Feature 5: Plan Mode - think before you do

The fastest way to waste tokens is letting Claude edit before it understands. Plan Mode flips that: it researches and writes a plan, with no file changes, and you approve before a single line is written.

Two ways in: cycle modes with Shift+Tab, or type /plan.

Give it a task.
It produces a structured plan of steps and the files it intends to touch.
You review, then approve execution.

Two wins here. One, you catch a wrong approach before it’s written, not after you’re staring at a bad diff. Two, all that planning research happens without dumping every file it peeked at into your main context.

Anything more complex than a one-liner, plan first. It feels slower. It’s dramatically faster.

🍥Feature 6: Skills - on-demand context

Remember I said some rules don’t belong in CLAUDE.md? Skills are where they go. A Skill is on-demand context: a packaged capability Claude loads only when the task calls for it.

Create a folder named for the skill with a SKILL.md inside, under .claude/skills/ for a project-level skill.
The file holds the skill’s name, a description of when to use it, and the details.
I built a code-review skill tuned to the project's conventions. /skills lists it.

Invoke it with a slash: /code-review pulls the checklist into context and runs it against recent changes.

Here’s the mental model: CLAUDE.md is what Claude always knows. Skills are what it can go get. Your PR checklist, your changelog format, your deploy runbook: none of that needs to be in context for every “fix a typo” request. Make it a Skill.

The more you move out of CLAUDE.md into Skills, the less context you burn every single turn.

🍥Feature 7 : MCP — reach outside your repo

Everything so far lives inside your repo. MCP, the Model Context Protocol, connects Claude Code to the outside: GitHub, your database, a browser, your issue tracker. Instead of pasting a stack trace or a schema into chat, Claude pulls it live.

To wire up MCP for a project, add a .mcp.json with the connection details for your server.

Enable MCP in your project-level settings.json.

For the demo I planted a bug: I removed the shared exception handling and called findById(...).get() directly, so a missing task returns a 500 instead of a clean 404.

I filed a real GitHub issue describing it.

/mcp shows my connected servers, including the GitHub issues server.

Now I just tell it to fix the issue I filed. It calls the MCP server, reads the live issue, opens the relevant files, and fixes it properly: back through the shared getOrThrow method, returning a 404 with the standard ApiError.

This is the jump from “Claude that edits files” to “Claude that operates your whole dev environment.”

The context it needs, it fetches itself through a real tool, not a copy-paste you’ll forget to refresh. The issue, the schema, the running UI: all live.

🍥Feature 8: Hooks - deterministic automation

Some things shouldn’t depend on Claude remembering to do them. Hooks are deterministic automation: shell commands that fire on events. After every edit, run the linter. After a file changes, run its tests.

I added a hook in my project’s .claude/settings.json that runs on PostToolUse, matching Edit, Write, and MultiEdit, and runs the Maven tests:

/hooks lets you view it (under PostToolUse, project scope).

To show it working, I had Claude change a controller’s POST response status.

I asked it to set it back to 201 Created, and the moment the edit landed, the hook fired the tests on its own, repeatedly, surfacing the failures with no prompt from me.

The underrated part: hooks cost zero context. You’re not burning tokens asking Claude “please remember to run the tests” and hoping it complies; you’re guaranteeing it, every time, for free. Lint on save, test on change, format on write. Deterministic beats hopeful.

🍥Feature 9: Rewind - undo a bad path in two keystrokes

So you went down a bad path. Three failed refactor attempts are now sitting in your context, poisoning everything after.

I ran git status: nothing to commit.

Then I deliberately gave a bad prompt to demonstrate rewind.

A few files in, you decide it’s the wrong call. git status now shows untracked files and a tangled change.

Instead of manually reverting everything, press Esc Esc to open the rewind menu (it lists your prompts as a timeline), select the point before the refactor, and choose Restore code and conversation.

Run git status again: clean. The whole tangent, gone in two keystrokes.

You can roll back code only, conversation only, or both, which is perfect for trial-and-error refactoring. And a related option, Summarize up to here, keeps the decisions and drops the exhaust by compressing the earlier history into a clean summary while leaving recent messages intact.

The distinction matters: rewind undoes state; summarize compresses context. Every turn is a branching point.

The pros aren’t smarter; they just refuse to drag dead context forward.

🍥Feature 10: /goal + Auto Mode - let it run

/goal keeps Claude working across turns until a completion condition holds — not "do one thing and stop," but "keep going until this is actually done."

Pair it with Auto Mode: a safety classifier handles your permission prompts, so safe actions just run and risky ones still get blocked.

Switch to Auto Mode with Shift+Tab.

Set a concrete, checkable goal.

Claude edits, runs the tests, hits a failure, fixes it, re-runs — looping autonomously until the suite is green and the goal is met.

This is the closest Claude Code gets to “go build it while I grab a coffee.”

The guardrail is the classifier: the sane middle between approving every single step and the fully-yolo skip-permissions flag. Set a clear, checkable goal, let it cook, and come back to working code. (Auto Mode runs on the Pro plan, too.)

✨It was one idea the whole time

Ten features, but they’re one idea wearing different hats: protect your context, and the model stays sharp.

Set it up with CLAUDE.md and Skills.
See it with /context.
Plan before you touch it with Plan Mode.
Protect it with subagents, /compact, and /rewind.
Automate it with hooks and MCP.
Let it run with /goal and Auto Mode.

None of these are tricks. They’re a single discipline: keep the window clean, and a capable model stays capable.


I Built My Own Spec-Driven Dev Workflow in Claude Code. Here’s What I Learned.

Urvil Joshi — Tue, 26 May 2026 04:58:04 +0000

I’ve been using AI to code for some time now. Copilot, Claude Code, Codex, Pi. I’ve shipped code with them. I’ve also spent more time than I’d like to admit fixing the things they confidently produced.

A few months ago I started seeing a pattern I couldn’t ignore. The bugs weren’t random. They were the same bugs, over and over, across different projects. AI would write code that looked right. It would compile. The obvious test case would pass. And then it would fail on the edge case nobody asked about.

I thought it was my prompting. So I got better at prompting. The bugs got more subtle, not fewer.

Then I created a workflow focused on human gates, planning, and review, which I covered in an earlier article.

Then I came across spec-driven development (SDD). It was an obvious upgrade for my workflow, so I built my own workflow around it in Claude Code. I’m going to walk you through what I built and what I learned. I’m still figuring it out. This is not a “here’s the answer” post. This is a “here’s where I am” post.

🍥The Experiment That Made Me Care

A few weeks ago I ran an experiment: I built a refund endpoint for a Spring Boot project. Standard stuff. I decided to add a partial refund feature to compare vibe coding and SDD.

I prompted Claude Code with a reasonable description. Out came code that compiled and ran. I tested a single refund. It worked.

Then I ran two refund requests at the same time on the same order. Both succeeded. The order’s total was $100. The total refunded came out to $150.

The customer would have walked away with an extra $50.

The code wasn’t broken in the usual sense. It had a check. It compared the refund amount to the order total. The check was just wrong under concurrent load: a classic race condition. The AI didn’t know to handle concurrency because I didn’t tell it to handle concurrency. I didn’t tell it because I was using it as a search engine, not a pair programmer. And I wasn’t thinking about it because the prompt-and-go workflow doesn’t make you think about it.

That’s the moment I stopped blaming prompts.
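The race described above is a check-then-act bug, and it can be sketched in a few lines. This is not the code from the experiment: RefundLedger, its method names, and the amounts in cents are hypothetical, and an in-memory counter stands in for the database row. A real Spring Boot service would more likely close the gap at the database level (for example with row locking or a version column), which this sketch does not show.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical model of the bug class described above; not the author's code.
class RefundLedger {
    private final long orderTotalCents;
    private final AtomicLong refundedCents = new AtomicLong();

    RefundLedger(long orderTotalCents) {
        this.orderTotalCents = orderTotalCents;
    }

    // Racy version: the check and the update are two separate steps, so two
    // threads can both pass the check before either one records its refund.
    boolean refundUnsafe(long amountCents) {
        if (refundedCents.get() + amountCents > orderTotalCents) {
            return false;
        }
        refundedCents.addAndGet(amountCents);
        return true;
    }

    // Safer version: compare-and-set only succeeds if nobody changed the total
    // since we read it, so the check and the update act on the same value.
    boolean refundAtomic(long amountCents) {
        while (true) {
            long current = refundedCents.get();
            long next = current + amountCents;
            if (next > orderTotalCents) {
                return false;
            }
            if (refundedCents.compareAndSet(current, next)) {
                return true;
            }
            // Another thread won the race; re-read and re-check.
        }
    }

    long refundedCents() {
        return refundedCents.get();
    }
}
```

With refundAtomic, two concurrent refunds that together exceed the order total cannot both succeed, whichever thread runs first. With refundUnsafe they sometimes can, which is exactly the kind of failure a single-request test never shows.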

✨What I Actually Realized

With the SDD approach, you are not just writing a prompt. When you start building a feature, you should know in your head how you will build it.

That’s how we did coding previously, right?

You get a feature. You create a design. You put it somewhere: Obsidian, Notion, a notepad, your notebook, whatever. You draw the full picture. You know in your head what code you’ll add, what design pattern you’ll use, what the edge cases are. Once you know everything, you start coding.

That’s what spec-driven development is asking you to do with AI.

You’re clarifying everything to the agent. You’re not the audience watching it work. You’re the one who is driving it. You should know how this feature should be built, what changes should be done, what the requirements are, what the design changes in the code will be.

The AI is incredibly capable at translating clear specs into working code. It’s bad at extracting intent from vague prompts. Once you accept that, the whole approach inverts.

🔍SDD Is Not New (And That’s the Point)

I want to be honest here because I see a lot of takes painting spec-driven development as some breakthrough AI-era methodology. It’s not.

CORBA’s IDL files in the nineties pioneered the “spec generates code” pattern for interfaces. Protocol Buffers carried that pattern forward in 2001. Test-Driven Development from the late nineties established “write the contract first.” Behavior-Driven Development in 2006 made specs readable in plain English.

SDD takes those ideas and applies them to entire features, scaled up by LLMs. One researcher named Bryan Finster put it bluntly in a January 2026 paper: “SDD is not a revolution. It’s just BDD with branding.”

He’s mostly right. The branding does matter, because it reminds practitioners that specs should be authoritative, not advisory.

The reason this works now and didn’t work in the 2000s with UML codegen is that natural language plus LLMs can bridge the gap that diagrams and codegen compilers never could. We’re not inventing the methodology. We’re finally making it viable.

Spec Kit and Kiro

There are two real tools getting attention right now.

GitHub’s Spec Kit, open-sourced in September 2025. It’s a CLI that installs slash commands and templates into your existing project. You run it, and your AI agent (Claude Code, Copilot, Cursor, over thirty of them) gets a structured spec-driven workflow. Good if you want to get started fast.

AWS’s Kiro. A full IDE built on Code OSS, with spec-driven development as a first-class primitive. Three documents per feature (requirements, design, tasks) with human approval gates between each. Good if you want the IDE experience.

Both are solid. Honestly. If you want to try SDD today, install one of them.

But if you’re a developer reading this, I assume you have your own way of working. Over your career, you’ll try different workflows to find what suits your style. Instead of adopting someone else’s structure, you can create your own: something that complements how you think.

That’s what I did.

🧰What I Built

I built a 14-phase workflow inside Claude Code using its native primitives — subagents, slash commands, hooks, and a status tracking file. No external tools. No third-party install. Just Claude Code’s own capabilities, composed deliberately.

Eleven Claude subagents, each with one job:

A repo-init agent that reads my codebase and writes a project.md. Tech stack, build commands, test commands, conventions. Every later agent reads this.

An issue-fetch agent that pulls a ticket from GitHub and creates a working folder.

Requirements clarification agents for business questions. They ask me questions in batches until the requirements are clear.

A requirements agent that drafts a requirements.md from the answers. I review it and leave comments, and it revises and re-presents the file until I am satisfied with the requirements.

Technical design clarification agents for technical questions. They ask me questions in batches until the design is clear.
A technical design agent that drafts a design.md. I review it and leave comments, including pseudocode to steer the design the way I like, and it re-presents the file until I am satisfied with the design.

A task planner that breaks the design into ordered tasks with explicit test-first requirements.
A TDD implementation agent that runs one task at a time (red, green, refactor), logging every test command and result to a traceability.md file.
A review agent that audits the diff against requirements, design, tests, conventions, security, and maintainability.

A review resolution agent that fixes the findings the human accepts, plus any custom findings the human adds. The human should review all the code at this point too: even with all these steps, implementation mistakes still slip through, and as the developer you are responsible for keeping them to a minimum.
A human resolution review agent that shows the human what was resolved and asks for final approval. This is the gate where you review the diff again if you have time for extra caution.

A final summary step that, once the human approves, updates the final summary.md file.

A PR agent that drafts the commit message and pull request body.

Above all of them sits an orchestrator, a thirteenth file, that parses my plain-English messages and routes them to the right subagent. I never type slash commands during the workflow. I just say “approve requirements” or “accept findings 1 and 2, reject 3, and also add this finding” or “raise PR” and the orchestrator handles it.

I’m not claiming this is the best structure. It’s mine. It fits how I work. Yours would look different and should.

🏁What It Caught On the Refund Bug

I ran the same refund task through this workflow.

The very first clarification agent raised its questions before any code existed.

None of these were in the ticket description. None would have been caught by vibe coding. Every single one would become a bug if missed.

That’s the whole thing. The clarification phase is where the bugs that ship in production get caught before any code exists.

The agent didn’t catch the race condition. I caught it, because the workflow forced me to think about concurrency before I let the AI write anything.

I’m the one who solved the bug. The workflow just made sure I didn’t skip the question.

🔍The Results with the Spec-Driven Development Flow

I again ran two refund requests at the same time on the same order. This time one succeeded and one did not, as expected. The order’s total was $100. The total refunded came out to $90.

✨The Honest Trade-offs

This workflow has overhead. Real overhead.

A vibe-coded refund endpoint takes me maybe ten minutes. The spec-driven version through this workflow takes closer to forty minutes: clarification rounds, requirements review, design clarification, design review, then the actual TDD implementation.

For a one-off script? Not worth it. Friday afternoon prototype? Vibe code it. Exploring something where you don’t know what you want yet? Vibe code it.

But for code that handles money, code that lives in production, code that other people will read and maintain, the forty minutes upfront saves multiples on rework, debugging, and shipped bugs. The tests aren’t an afterthought. The design doc isn’t fiction. The next person reading the code (including future me) has the requirements, the design, the tasks, and the full test history sitting right there in the repo.

I’m not telling you to use my workflow. I’m telling you to think about which work in your life deserves which approach.

🍥Why I Wrote This

This post isn’t a tutorial. It’s me sharing where I am.

I’ve been hearing a lot of “vibe coding is dead” and “spec-driven is the future” takes lately. I think both are slightly wrong. Vibe coding is the right tool for some work. Spec-driven is the right tool for other work. The skill is knowing which is which.

I’m continuously improving my workflow. If you’ve built something similar, or if you think mine can be improved somewhere, please tell me. That’s the whole point of writing this in public.


I tried Pi after watching its founder explain why he quit Claude Code

Urvil Joshi — Thu, 07 May 2026 06:11:29 +0000

A walkthrough of the open-source coding agent that fits in 1,000 tokens, and the one reason I can’t fully switch yet.

A few days ago, I watched Mario Zechner, the creator of Pi, explain why he stopped using Claude Code. By the end, he’d convinced me to try it.

Pi is an open-source coding agent with four tools, a system prompt under a thousand tokens, and one idea: the agent should be minimal and should be able to modify itself.

This post walks through what’s actually different about Pi, the demo I built to test it, an honest comparison with Claude Code, and the one reason I haven’t fully switched.

🍥Why Mario built Pi

Mario’s pitch is simple and authentic.

Modern coding agents got bloated.

Claude Code’s system prompt has grown to roughly 14,000 tokens. Tools get added, modified, and removed between releases. System reminders get injected into your context behind your back. You aren’t the owner of your context and you have zero control over it.

Mario’s argument, paraphrased:

Models are already trained to be coding agents. They don’t need a 10,000-token system prompt explaining what a coding agent is. They know.

So Pi strips it all down. Four tools: read, write, edit, bash. A system prompt under 1,000 tokens. And none of the following:

No MCP servers
No sub-agents
No permission prompts
No plan mode
No built-in to-dos
No background bash

Instead, the agent extends itself. You ask Pi to add a feature, it writes a TypeScript extension, you hot-reload, and you’re done.

An agent that adapts to your workflow, instead of the other way around.

That line is what got me to install it.

✨Installing Pi (60 seconds)

Head to pi.dev, where there’s a one-line install command. Or grab it from npm. Paste it in your terminal and you’re done.

Type pi to start it. Then /login to pick a provider.

You can sign in with your Anthropic, OpenAI, or GitHub Copilot subscription, or bring your own API key. Worth knowing: as of recently, Anthropic’s Pro plan limits don’t apply when you authenticate Pi with your Claude account. You’ll be billed for extra usage on top of the subscription. For me, that’s the dealbreaker.

I’m logged in with my Claude account, so /model lets me pick from any Anthropic model. I'm running Sonnet here. There's also a /settings slash command if you want to change reasoning level, theme, or hide thinking traces.

🔍Three things that are actually different in Pi

I won’t bore you with every slash command; Pi’s GitHub has the full reference. But there are three design choices that genuinely set Pi apart.

1. The system prompt is yours

Drop a file called system.md in ~/.pi/agent/ and Pi uses your system prompt instead of the default. Want to keep Pi's prompt and just append your own rules? Use append-system.md.

This is huge. If you want to use Pi for non-coding work like research, writing, or anything else, you can swap out the entire instruction set. No coding agent I’ve used lets you do this. In Claude Code, you’re stuck inside the 14k-token prompt the team ships.

2. Sessions are trees, not lines

Most coding agents give you a linear conversation. If the agent went the wrong way ten messages ago, you re-prompt or restart.

Pi sessions are trees. Use /tree to see the full branch structure. Use /fork to create a new branch from any earlier message. You jump back to the point where things went sideways and continue from there: no re-prompting, no context loss, and no restart.
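To make the tree idea concrete, here is a small sketch of a conversation stored as a tree instead of a list. It is purely illustrative: SessionNode and its methods are invented for this example and say nothing about how Pi actually stores sessions or implements /tree and /fork.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a tiny message tree, not Pi's actual session format.
class SessionNode {
    final String message;
    final SessionNode parent;
    final List<SessionNode> children = new ArrayList<>();

    SessionNode(String message, SessionNode parent) {
        this.message = message;
        this.parent = parent;
    }

    // Replying to any node, not only the newest one, is what makes this a tree:
    // replying to an older node starts a new branch and leaves the old one intact.
    SessionNode reply(String message) {
        SessionNode child = new SessionNode(message, this);
        children.add(child);
        return child;
    }

    // The context for a branch is the path from the root down to this node.
    List<String> context() {
        List<String> path = new ArrayList<>();
        for (SessionNode n = this; n != null; n = n.parent) {
            path.add(0, n.message);
        }
        return path;
    }
}
```

Forking from an earlier node gives the new branch a context that contains everything up to the fork point and nothing from the abandoned branch, which is the "no context loss, no restart" property described above.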

3. Bash does almost everything

Pi only ships four tools. There’s no dedicated grep tool, no find tool, no git_status tool. Just bash.

Mario’s reasoning: models are reinforcement-trained on bash. They know how to use it. Adding specialized tools is just added noise. If you want a custom tool, you build it as an extension.

There are also no permission prompts. Pi runs full access by default. Mario’s argument is that most users mindlessly click “accept” on every permission prompt anyway. If you want real human gates, build them as an extension.

🧰The part that sold me

Pi is a deliberately minimal harness. To get anything beyond the bare minimum, you have to build it. That sounds like a downside until you see what Mario gave you in return: the agent ships with full knowledge of its own source code, and it can extend itself.

In other words, you ask Pi to add a feature. Pi reads its own extension docs. Pi writes the TypeScript file. You hot-reload. The feature is now part of your agent.

I’ll show you two extensions.

Extension 1: rebuilding my Claude Code orchestrator in Pi

If you read my last post, you saw my issue-to-PR workflow in Claude Code. It’s an orchestrator sub-agent that spawns four other sub-agents with human approval gates in between.

When I switched to Pi, I wanted the same workflow. But Pi has no sub-agents.

So I had two choices:

Option A: spawn separate Pi processes. Each phase runs in its own isolated context window. This mirrors Claude Code’s sub-agent model exactly.
Option B: single shared session. All phases run in one continuous Pi conversation. Higher token usage, but simpler to demo.

For this demo I went with Option B. For real production work, I’d use Option A.

I won’t walk through the full flow, as it’s already in my previous post. The point here is that the same multi-phase, gated workflow I used Claude Code’s sub-agent system for, I rebuilt as a single Pi extension. A TypeScript file in ~/.pi/agent/extensions/. Hot-reload, and it's live.

Extension 2: a status widget Pi built for itself

This one’s the demo that captures Pi’s whole pitch in 30 seconds. Everyone on Reddit is building this, so I built one too.

I gave it this prompt:

Read your own extension docs and build a status widget that shows the current git branch and number of uncommitted changes. Save it to my extensions folder.

Pi read its own documentation, wrote the extension, saved it to the right path, and told me to run /reload.

I ran /reload.

The status bar at the bottom of my terminal now showed my git branch and uncommitted file count. The agent had just extended itself. Live. In one prompt. Try doing that in Claude Code.

🏁Skills

Pi also supports skills. They live in ~/.pi/agent/skills/ (or in your project repo for project-local skills).

To invoke one, type /skill and write your prompt. There's nothing radically different from how skills work in other agents; you just paste your skill and call it like above.

✨Pi vs Claude Code — honest comparison

Here’s what each one ships out of the box.

Claude Code has permission prompts, MCP support, sub-agents, plan mode, a large system prompt, and is locked to Anthropic models. It’s a finished product. Everything you need is there on day one.

Pi has four tools, a sub-1,000-token system prompt, multi-provider support (Anthropic, OpenAI, Copilot, OpenRouter, Ollama, and more), an editable system prompt, extensions, and the ability to modify itself. It ships none of Claude Code’s extra features by default, but you can build any of them as extensions, or install someone else’s package.

If you want something that works on day one, Claude Code wins. If you want to actually own your workflow, Pi wins.

✍️The one reason I can’t fully switch

Anthropic bills your Pi sessions as extra per-token usage. Effectively, you’re paying twice: once for your subscription, and again for every Pi token.

That’s not Pi’s fault. It’s Anthropic’s policy. But it means switching to Pi while keeping my Anthropic subscription is financially dumb for me right now.

If I weren’t on Anthropic (say I were using a ChatGPT subscription, or Copilot, or running local models with Ollama), Pi would be my full-time coding agent. The minimalism, the extensibility, the fact that I control my context. Mario nailed it. This is how I want my coding agent to work.

So for now, Pi sits next to Claude Code in my workflow.


I Built an Orchestrator AI Agent That Takes My Github issue to Pull Requests.

Urvil Joshi — Tue, 28 Apr 2026 13:09:05 +0000

A Claude Code workflow with one orchestrator, five subagents, and three human gates, running end to end on a real Spring Boot project.

This is my minimal dev setup in 2026.

Pixel Agents in VS Code to monitor my agents. Claude Code in the terminal. Together, they take a GitHub issue and turn it into a merged pull request, with three approvals from me along the way.

🍥 The project: LinkStash

LinkStash is a Spring Boot URL shortener I built last week. To be upfront: this is not a serious production repo. It’s a demo I created specifically to show this workflow on a realistic codebase.

The repo has one open issue:

That’s the issue I want my workflow to handle. I’m going to invoke an agent, and it will handle it with my inputs and reviews.

✨ Step 1 → CLAUDE.md sets the rules

Every Claude Code session reads CLAUDE.md first. It's the file you keep at the repo root that tells Claude how your project works. Conventions, what not to do, project structure: basically all of it.

For LinkStash, mine includes things like:

Constructor injection only — never field injection
Records for DTOs
Don’t add Lombok
Don’t push to main
Always write a Flyway migration, never use ddl-auto: update

If you’ve never written one, run /init in Claude Code and it'll generate a starting point you can edit. The trick is keeping it tight: long CLAUDE.md files dilute attention. Forty lines of clear rules beat two hundred lines of vague guidance.

If you use multiple coding agents (Claude, Codex, Copilot), you can create an AGENTS.md for general conventions shared across all of them, and keep an agent-specific file for anything that applies to only one agent.

✨Step 2 → The orchestrator agent

Here’s where it gets interesting. My main agent is called issue-resolver, and it lives in .claude/agents/.

It does three things on its own and pauses three times for me. The high-level flow:

Fetches the GitHub issue (via my MCP server — more on this below)
Spawns a subagent that explores the codebase and writes ARCHITECTURE.md
Spawns a subagent that drafts plan.md
Pauses for me to approve the plan
Spawns a subagent that implements the plan
Spawns a subagent that runs /ultrareview for self-critique
Pauses for me to triage findings
Spawns a subagent that applies accepted findings
Pauses for me to do a final review of the changes
Pushes and opens the PR

The key rule baked into the agent prompt: never modify code yourself, always delegate to subagents. The orchestrator only orchestrates. Each subagent has one job.
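The flow above can be pictured as a plain pipeline where automated steps run on their own and gate steps block until a human says yes. This is only a Python illustration of the control flow, not how Claude Code agents are defined (those are prompt files under .claude/agents/). The step names and the `approve` callback are hypothetical, and a rejected gate simply stops the run here, whereas the real workflow loops back for revisions.

```python
from typing import Callable, List, Tuple

# Each step is (name, is_gate). Names mirror the flow above and are illustrative only.
STEPS: List[Tuple[str, bool]] = [
    ("fetch issue", False),
    ("explore codebase", False),
    ("draft plan", False),
    ("approve plan", True),
    ("implement plan", False),
    ("review changes", False),
    ("triage findings", True),
    ("apply findings", False),
    ("final review", True),
    ("push and open PR", False),
]


def run_pipeline(steps: List[Tuple[str, bool]],
                 approve: Callable[[str], bool]) -> List[str]:
    """Run steps in order and stop at the first gate the human rejects."""
    completed: List[str] = []
    for name, is_gate in steps:
        if is_gate and not approve(name):
            break  # a rejected gate halts everything downstream
        completed.append(name)
    return completed
```

With `approve` always returning True, all ten steps complete. If the human rejects "triage findings", nothing after the review step runs, so the PR is never opened.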

🔍A note on the MCP server

For fetching the issue, I’m using a custom MCP server I built in a previous video. You don’t have to do this: the official GitHub MCP server has a gh_get_issue tool that does the same thing. Or you could use Claude Code skills.

I’m using my own because I built it for a related workflow already. Pick whichever fits your workflow.

✨Step 3 → Kicking off the loop

The whole invocation is one line:

@issue-resolver fetch and resolve issue #1

That’s it. The orchestrator goes to work. First it fetches the issue from my MCP server. Then the explore subagent reads the codebase and writes ARCHITECTURE.md : entities, endpoints, data flow, testing conventions.

Then the plan subagent runs and produces plan.md. This is where I get the first interesting moment.

✨ Step 4 → Gate 1: Plan approval

The plan came back with two open questions the agent flagged for me:

Which rate-limiting algorithm — token bucket or fixed window?
Should creating a link with a past expiresAt return 400?

This is exactly what I want. The agent isn’t guessing; it’s handing the engineering decisions back to me.

I told it: greedy token bucket (Bucket4j default behavior) and yes, return 400 on past expiry validated against server time.
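For readers who haven't met the algorithm, here is a minimal Python sketch of a token bucket with greedy (continuous) refill. LinkStash itself uses Bucket4j in Java, so this is not Bucket4j's API. The class name, parameters, and injectable clock are assumptions made for illustration.

```python
import time
from typing import Callable


class TokenBucket:
    """Token bucket with continuous ("greedy") refill. Illustrative only."""

    def __init__(self, capacity: int, refill_per_second: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.tokens = float(capacity)  # the bucket starts full
        self.last = clock()

    def try_consume(self, n: int = 1) -> bool:
        now = self.clock()
        # Greedy refill: credit tokens for all elapsed time right away,
        # instead of waiting for a whole interval to pass.
        elapsed = now - self.last
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_per_second)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False  # the caller would typically answer with HTTP 429
```

A fixed-window limiter would instead reset a counter at each window boundary, which allows bursts at the edges. The token bucket smooths that out while still permitting short bursts up to `capacity`.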

I sent the plan back. Plan came back updated. Both open questions resolved. Approved.

✨Step 5 → Implementation runs

The implement subagent takes over. New tables for API keys and link expiration. The Bucket4j filter. Updated controllers. Tests added. Tests run after each major change. All green.

✨Step 6 → Gate 2: /ultrareview findings

After implementation, the orchestrator spawns the review subagent. /ultrareview is Claude Code's high-effort self-critique mode.

Findings come back as a numbered list with severity, file location, and suggested fixes. I get a structured prompt:

Reply: "accept all" / "accept all except [numbers]" / "accept only [numbers]"

This is the second gate. I read each finding, decide which are real, which are nitpicks, which are wrong. If I disagree with one, I exclude it.
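The three reply forms are simple enough to parse mechanically. In my setup the agent interprets the reply itself, so the sketch below is only a hypothetical Python helper showing how the replies map onto finding numbers. The function name and error handling are assumptions.

```python
import re
from typing import List


def select_findings(reply: str, total: int) -> List[int]:
    """Return the 1-based finding numbers accepted by a triage reply."""
    everything = list(range(1, total + 1))
    text = reply.strip().lower()
    numbers = [int(n) for n in re.findall(r"\d+", text)]
    unknown = [n for n in numbers if n not in everything]
    if unknown:
        raise ValueError(f"no such finding(s): {unknown}")
    if text == "accept all":
        return everything
    if text.startswith("accept all except") and numbers:
        return [n for n in everything if n not in numbers]
    if text.startswith("accept only") and numbers:
        return sorted(set(numbers))
    # Anything else is ambiguous, so refuse rather than guess.
    raise ValueError(f"unrecognized reply: {reply!r}")
```

Refusing ambiguous replies is deliberate: at a human gate, a misread "accept" is worse than asking again.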

The honest part of this workflow is right here: even when /ultrareview is correct in principle, I’m the one who is responsible for the commit. I don’t accept findings blindly. I read them.

This time I accepted all of them as they were all fair calls.

✨Step 7 → Gate 3: Post-fix review

After the apply-findings subagent resolves the accepted findings, the orchestrator pauses one more time before pushing.

I scroll through the diff. If there are additional changes I want, even though /ultrareview didn’t flag them, I describe them and the agent runs apply-findings again. If the diff looks clean, I type push.

You might say three gates is excessive, that it slows the workflow down. For a demo, sure, you can argue that. But for production code, code that ships to clients, code that runs at scale, you should know what your AI is shipping.

The gates aren’t friction. They’re the part of the workflow that keeps you accountable.

✨Step 8 → The PR

The PR has a clean summary, “Generated by Claude Code” attribution, all the commits, all the file changes. Ready for review.

🧰What I’d take from this if I were building it myself

A few things I learned that I’d recommend if you’re trying to set up your own version:

Keep your CLAUDE.md tight. Forty lines of clear rules beat two hundred lines of vague guidance.

Subagents for one job each. The temptation is to make smart, multi-purpose agents. Resist it. Each subagent does one thing: explore, plan, implement, review, or apply findings. Predictable and easy to debug.

Human gates are non-negotiable for real work. Demo? Skip them if you want. Production? Human gates. That’s the floor, not the ceiling.

🏁Closing

The point of this workflow isn’t “AI does my job.” It’s the opposite. AI does the typing. I do the deciding. Critical decisions, in the right places. That’s the modern dev workflow if you’re trying to use these tools seriously instead of as a novelty :).

🎗️Reference

Claude Code Agent Workflow: Issue to Merged PR (Full Demo)

Andrej Karpathy’s LLM Wiki: Create your own knowledge base

Urvil Joshi — Mon, 20 Apr 2026 04:50:08 +0000

Andrej Karpathy tweeted something that quietly broke the AI community’s understanding of how we should be using LLMs to manage knowledge.

Two days later, he followed up with a GitHub gist called llm-wiki.md. The idea isn’t a product. It’s not code. It’s a pattern, one that can help you create a small-scale personal knowledge base in a few minutes.

Let’s break this down.

🍥The Tweet That Started It

Karpathy’s original tweet:

“Something I’m finding very useful recently: using LLMs to build personal knowledge bases for various topics of research interest. In this way, a large fraction of my recent token throughput is going less into manipulating code, and more into manipulating…”

— @karpathy, April 2, 2026

And that’s what he published: a single markdown file on GitHub Gist. He calls it an idea file: a document meant to be copy-pasted into an LLM agent like Claude Code, OpenAI Codex, or any other agent, which then instantiates the pattern for your specific needs.

✨ The Core Idea: Stop Retrieving. Start Compiling.

Here’s the insight in one sentence: instead of having the LLM re-read your raw documents every time you ask a question, build a persistent, structured wiki once and keep it updated forever.

Karpathy used an analogy from software engineering: compilation.

┌────────────────────────────────────────────────────────────┐
│                    SOFTWARE ENGINEERING                    │
│                                                            │
│  Source Code ──[compile once]──► Binary                    │
│  (readable)                      (runs fast every          │
│                                   single call)             │
└────────────────────────────────────────────────────────────┘
                        ⇕ same idea ⇕
┌────────────────────────────────────────────────────────────┐
│                          LLM WIKI                          │
│                                                            │
│  Raw Sources ──[LLM compiles]──► Wiki                      │
│  (PDFs, notes,                   (pre-synthesized,         │
│   articles)                       interlinked,             │
│                                   always ready)            │
└────────────────────────────────────────────────────────────┘

You don’t execute source code every time you want to run a program. You compile it once into a binary and run that. Karpathy says: treat knowledge the same way. Your PDFs and notes are the source code. The wiki is the binary.

Every time you add a new document, the LLM doesn’t just index it. It reads it, extracts the key information, updates existing pages, revises summaries, flags contradictions, and strengthens cross-links. The wiki is a persistent, compounding artifact.

In Karpathy’s own words, the line that captures the whole philosophy:

“Obsidian is the IDE; the LLM is the programmer; the wiki is the codebase.”

You rarely write the wiki yourself. You curate sources, ask questions, and think. The LLM handles the rest: summarizing, cross-referencing, filing, and bookkeeping.

🔍The Three-Layer Architecture

╔══════════════════════════════════════════════════════════════╗
║ LAYER 3 — THE SCHEMA ║
║ (CLAUDE.md / AGENTS.md) ║
║ ║
║ Rules • Conventions • Workflows • How to ingest/query ║
║ ║
║ ↕ tells the LLM HOW to behave ║
╠══════════════════════════════════════════════════════════════╣
║ LAYER 2 — THE WIKI ║
║ (LLM owns this entirely) ║
║ ║
║ ┌──────────┐ ┌──────────┐ ┌──────────┐ ║
║ │ Entity │──│ Concept │──│ Overview │ index.md ║
║ │ pages │ │ pages │ │ pages │ log.md ║
║ └──────────┘ └──────────┘ └──────────┘ ║
║ ↑ LLM creates, links, updates, maintains ║
╠══════════════════════════════════════════════════════════════╣
║ LAYER 1 — RAW SOURCES ║
║ (IMMUTABLE) ║
║ ║
║ 📄 PDFs 📰 Articles 🎧 Podcast notes 🖼️ Images ║
║ ║
║ LLM reads • NEVER modifies • source of truth ║
╚══════════════════════════════════════════════════════════════╝

Layer 1 — Raw sources. Your curated collection. Articles, papers, meeting notes, images. Immutable. The LLM reads them but never modifies them. This is your ground truth. The fact that they’re immutable is a deliberate design choice: you can always re-compile the wiki from scratch if needed.

Layer 2 — The wiki. A directory of markdown files the LLM owns completely. Entity pages, concept pages, summaries, an index, a log. You read it. The LLM writes it.

Layer 3 — The schema. This is a CLAUDE.md (for Claude Code) or AGENTS.md (for Codex) file. It’s the config that turns a generic agent into a disciplined wiki maintainer. It defines how pages are structured, how new sources get ingested, how answers get formatted.

🧰The Three Operations

                    ┌──────────────────────┐
                    │ YOU (Human) │
                    │ curates & asks │
                    └──────────┬───────────┘
                               │
          ┌────────────────────┼────────────────────┐
          │ │ │
          ▼ ▼ ▼
   ┌────────────┐ ┌────────────┐ ┌────────────┐
   │ 1. INGEST │ │ 2. QUERY │ │ 3. LINT │
   ├────────────┤ ├────────────┤ ├────────────┤
   │ Drop new │ │ Ask a │ │ Health- │
   │ source → │ │ question → │ │ check wiki │
   │ LLM reads, │ │ LLM reads │ │ → find │
   │ summarises,│ │ wiki & │ │ contra- │
   │ updates │ │ synthesises│ │ dictions, │
   │ 10–15 wiki │ │ answer │ │ orphans, │
   │ pages │ │ w/ cites │ │ stale data │
   └─────┬──────┘ └─────┬──────┘ └─────┬──────┘
         │ │ │
         └────────────────────┴────────────────────┘
                              │
                              ▼
                    ┌──────────────────────┐
                    │ WIKI COMPOUNDS │
                    │ (every op makes it │
                    │ richer over time)│
                    └──────────────────────┘

Ingest. You drop a source into the raw folder. The LLM reads it, writes a summary page, and touches some related pages: updating, cross-linking, flagging contradictions. A single article becomes a web of updates across your entire knowledge base.

Query. You ask a question. The LLM doesn’t search raw documents; it reads the already-synthesized wiki and answers. And here’s the compounding trick: good answers can be filed back into the wiki as new pages. Your explorations become permanent knowledge.

Lint. Periodically, you ask the LLM to audit the whole wiki. Find contradictions. Find orphan pages with no links pointing in. Find concepts that are mentioned but missing their own page. The wiki stays healthy because the LLM does the maintenance no human ever wants to do.
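Two of those lint checks are purely structural, so they can be sketched without an LLM at all. The Python below is a rough illustration, assuming pages are keyed by their wikilink name and that index and log are exempt from the orphan check. The function name is hypothetical, and finding contradictions still needs the LLM.

```python
import re
from typing import Dict, Set, Tuple

# Captures the target of [[target]], [[target|label]] or [[target#heading]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)")


def lint_wiki(pages: Dict[str, str]) -> Tuple[Set[str], Set[str]]:
    """Return (orphans, missing) for a mapping of page name -> markdown text."""
    linked: Set[str] = set()
    for name, text in pages.items():
        for target in WIKILINK.findall(text):
            target = target.strip()
            if target != name:  # a self-link is not an inbound link
                linked.add(target)
    # Orphans: pages that exist but that nothing links to.
    orphans = set(pages) - linked - {"index", "log"}
    # Missing: link targets that have no page of their own yet.
    missing = linked - set(pages)
    return orphans, missing
```

On a real vault you would build `pages` by reading every .md file under wiki/ and using the file stem as the page name.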

✨Let’s Actually Build One

Let’s build a working LLM Wiki together.

What you need

Claude Code (or OpenAI Codex, or any agent): the brain
Obsidian (free, obsidian.md): the viewer
A folder on your computer: your vault

Step 1: Create the folder structure

Open your terminal:

bash

mkdir llm-wiki-demo && cd llm-wiki-demo
mkdir raw

You now have:

llm-wiki-demo/
├── raw/ (your immutable sources go here)

Step 2: Open Claude Code in that folder, and paste this single message

“I want you to read this idea file by Andrej Karpathy and help me set up an LLM Wiki in this directory. Before you do anything, ask me what this wiki will be about, and what sources I plan to feed it. Once I answer, write me a CLAUDE.md schema file based on my answer”.

paste the full contents of Karpathy’s original gist here

Step 3: Claude will respond with some clarifying questions

Claude will respond with a few clarifying questions like:

What topic will this wiki cover?
What kinds of sources will you feed it?
Roughly how many sources are you planning to ingest?
What page types do you want?

Step 4: Answer honestly

For this demo, I’m building a wiki about AI and the philosophy of software. My answer:

“The wiki covers AI research and the philosophy of software. I’ll feed it short essays and blog posts from people like Rich Sutton and Andrej Karpathy. Probably 10–20 sources. I want concept pages, essay summaries, and author pages.”

Claude will now write a CLAUDE.md file tailored to that use case, initialize wiki/index.md and wiki/log.md, and say something like "Ready to ingest your first source."

You just built the whole schema without writing a line of code. That’s Karpathy’s pattern working exactly as intended.

Step 5: Ingest sources

For my demo I have two sources:

#1 Rich Sutton’s “The Bitter Lesson”

Drop Rich Sutton’s “The Bitter Lesson” into raw/ as bitter-lesson.pdf.

Tell Claude:

“Ingest raw/bitter-lesson.pdf."

Watch what happens. Claude reads the 2-page essay and generates something like:

wiki/
├── index.md (updated)
├── log.md (new entry appended)
├── sources/
│ └── bitter-lesson.md (summary page)
├── concepts/
│ ├── search.md
│ ├── learning.md
│ ├── moores-law.md
│ ├── general-methods.md
│ └── human-knowledge-approaches.md
├── examples/
│ ├── computer-chess.md
│ ├── computer-go.md
│ ├── speech-recognition.md
│ └── computer-vision.md
└── people/
    └── rich-sutton.md

One 2-page PDF just became ~10 interlinked pages. Each page cross-references the others with Obsidian-style [[wikilinks]].

#2 — Karpathy’s “Software 2.0”

Drop Karpathy’s “Software 2.0” into raw/ as software-2-0.pdf.

Tell Claude:

“Ingest raw/software-2-0.pdf."

Claude doesn’t start from scratch. It reads your existing wiki first, recognizes that Karpathy’s “Software 2.0” essay is arguing something closely related to the Bitter Lesson, and does something remarkable: it updates the existing pages to add Karpathy’s framing, strengthens the cross-references, and creates new pages only where needed.

The software-2-0.md page now includes a [[bitter-lesson]] backlink because the LLM detected the conceptual connection between the two essays, a link no human added.

Your wiki got denser, not just bigger. This is the compounding property Karpathy is pointing at.

Step 6: Ask a synthesis question

Now the payoff:

“How do Sutton and Karpathy agree about the future of software, and where might they disagree?”

Claude doesn’t reopen the PDFs. It reads the wiki pages you just built, follows the [[links]] between them, and gives you a grounded cross-author synthesis in seconds. That answer, which draws on connections that didn't exist 60 seconds ago, is now a file sitting in your vault forever.

This is what Karpathy means when he says knowledge compounds.

Step 7: Open Obsidian and point it at the folder

Install Obsidian, create a new vault, point it at your llm-wiki-demo/ folder, and hit the graph view.

You’re now looking at your knowledge as a network. Nodes are pages. Edges are the links Claude added automatically. Every source you add makes the graph denser.

That moment when the graph renders for the first time is when most people get it.

🔍RAG vs LLM Wiki: The Honest Comparison

The question everyone asks: is this actually better than RAG?

Honest answer: neither wins. They solve different problems.

┌─────────────────────────────────┬─────────────────────────────────┐
│ RAG │ LLM WIKI │
├─────────────────────────────────┼─────────────────────────────────┤
│ │ │
│ 📄 Raw docs stay raw │ 📄 Raw docs compiled into │
│ │ structured wiki pages │
│ │ │
│ 🔍 Retrieves chunks per query │ 📖 Reads pre-synthesized pages │
│ │ │
│ 🔁 Stateless — every query │ 📈 Stateful — knowledge │
│ starts from scratch │ compounds over time │
│ │ │
│ 🧩 Answers assembled from │ 🔗 Answers drawn from already- │
│ fragments at runtime │ connected concepts │
│ │ │
│ 🕒 Cheap per query │ 💰 Expensive ingest, │
│ │ cheap query │
│ │ │
│ ✅ Perfect traceability to │ ⚠️ Answers 1–2 steps removed │
│ source (which chunk?) │ from raw source │
│ │ │
│ ❌ No cross-time synthesis │ ✅ Links March article to │
│ │ October article naturally │
│ │ │
│ ✅ Fresh data always re-read │ ⚠️ Updates require re-ingest │
│ │ │
│ ✅ Hallucinations stay local │ ⚠️ Hallucinations can get │
│ to one answer │ baked in as "facts" │
│ │ │
│ 🎯 Best for: large, changing │ 🎯 Best for: ~100–500 curated │
│ corpora, fact lookup, │ sources, research projects, │
│ millions of docs │ personal knowledge, books │
│ │ │
└─────────────────────────────────┴─────────────────────────────────┘

RAG is great when you have millions of documents that change constantly and you need precise citations to an exact chunk. Think customer support, legal search, enterprise fact lookup.

LLM Wiki is great when you have a bounded, curated corpus, maybe a few hundred sources on a topic you’re going deep on. Research projects. A book you’re studying. A course you’re taking. Your own journal. Situations where synthesis matters more than retrieval, where the valuable answers require connecting five sources, not looking up one.

There’s a real critique of the LLM Wiki pattern worth taking seriously: because the LLM summarizes and compresses sources into wiki pages, there’s a risk of hallucinations getting baked in as “facts.” With pure RAG, a wrong answer is just one wrong answer. With an LLM Wiki, a small misunderstanding can quietly propagate across linked pages.

That’s why Karpathy emphasizes the lint step (periodic audits), and why any serious implementation should spot-check generated pages against raw sources.

🧰Why This Actually Matters

It’s not really about wikis. Karpathy is pointing at something much older: a 1945 vision by Vannevar Bush called the Memex, a personal, curated knowledge store where the connections between documents are as valuable as the documents themselves.

Bush’s vision was closer to this than to what the web became: private, actively curated, with associative trails between ideas. The reason the Memex was never really built isn’t technical. It’s that nobody wants to do the bookkeeping: updating cross-references, keeping summaries current, noting when new data contradicts old claims.

As Karpathy writes in the gist:

“The tedious part of maintaining a knowledge base is not the reading or the thinking; it’s the bookkeeping. Humans abandon wikis because the maintenance burden grows faster than the value. LLMs don’t get bored, don’t forget to update a cross-reference, and can touch 15 files in one pass.”

The tedious part of knowledge is finally solved.

Your job shifts from filing to thinking. From organizing to curating. From searching to asking better questions. The LLM handles everything else.

🎗️Reference

Karpathy’s Tweet: https://x.com/karpathy/status/2039805659525644595
Karpathy’s original gist: gist.github.com/karpathy/442a6bf555914893e9891c11519de94f
Claude Code: claude.com/claude-code
Obsidian: obsidian.md
Demo source 1 — Sutton’s “The Bitter Lesson”: incompleteideas.net/IncIdeas/BitterLesson.html
Demo source 2 — Karpathy’s “Software 2.0”: karpathy.medium.com/software-2-0-a64152b37c35
Karpathy’s LLM Wiki Changes Everything: https://youtu.be/04z2M_Nv_Rk

Karpathy’s Auto Research and its application beyond ML

Urvil Joshi — Mon, 13 Apr 2026 04:53:16 +0000

What if you could hand an AI agent a single file, a scoring function, and say “make this better” then go to sleep? You wake up to a hundred experiments completed, the best ones committed to your git history, and a better result than what you started with.

Andrej Karpathy open-sourced exactly this a while ago. It’s called Auto Research, and once you understand the pattern, you start seeing places to apply it everywhere.

Who Is Andrej Karpathy and Why Does This Matter?

If you write code for a living, you’ve probably used something Karpathy built without knowing it.

He was a co-founder of OpenAI. He led Tesla’s Autopilot team, which built the neural networks that power self-driving. He created nanoGPT, minbpe, and llm.c, three of the most influential open-source AI projects in existence. He also coined the term vibe coding, which, love it or hate it, is now part of every developer's vocabulary.

So when Karpathy open-sources something, it’s worth paying attention.

🍥 What Is Auto Research?

The story behind Auto Research is simple. Karpathy had a training script for GPT-2 that he’d been hand-optimizing for months. Tweaking hyperparameters. Trying different learning rate schedules. Adjusting batch sizes. At some point he asked himself the obvious question:

“Why am I doing this manually? Why not have an AI agent run these experiments for me?”

That question became Auto Research.

Auto Research is a closed-loop autonomous optimization system. An AI agent runs experiments in a tight loop: hypothesize, modify, evaluate, keep or revert. Then repeat. Forever, or until you tell it to stop.

Here’s the loop in plain terms:

Hypothesize. The agent reads the current state of the file being optimized, looks at previous results, and forms a theory about what to try next.
Modify. It edits exactly one file.
Evaluate. It runs an evaluation script that returns a single score.
Keep or revert. If the score improved, git commit. If it got worse, git reset --hard. Clean slate.
Loop. Back to step 1 with the new context.
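The five steps above can be sketched in a few lines of Python. This is not Karpathy's code: the function and parameter names are hypothetical, the "file" is just a string, and an in-memory history list stands in for git commit and git reset --hard. It also assumes a higher score is better and leaves out the per-experiment time budget.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, float]]


def auto_research(
    initial: str,
    propose: Callable[[str, History], str],
    evaluate: Callable[[str], float],
    experiments: int,
) -> Tuple[str, float, History]:
    """Keep-or-revert loop over a single mutable artifact."""
    best, best_score = initial, evaluate(initial)
    history: History = [(best, best_score)]  # stands in for the git log
    for _ in range(experiments):
        candidate = propose(best, history)  # hypothesize + modify
        score = evaluate(candidate)         # evaluate: one scalar
        if score > best_score:              # keep (the "git commit" branch)
            best, best_score = candidate, score
            history.append((candidate, score))
        # Otherwise revert: the candidate is dropped and `best` is untouched.
    return best, best_score, history
```

In a real run, `propose` is the agent editing the file and `evaluate` is the locked scoring script. Only improvements land in the history, which is why the git log reads as a trail of successful experiments.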

There’s a detail in here that’s easy to miss but absolutely critical: fixed time budget per experiment.

Every experiment gets the same amount of compute. Why? Because otherwise the agent could cheat. If experiment A gets five minutes and experiment B gets fifty, of course B might look better: it had ten times the compute. By fixing the time budget, you force the agent to win on the quality of its ideas, not on brute force.

And notice what’s acting as the memory: git. Your git log becomes a complete trail of every successful experiment. Every commit message says what the agent tried and what score it achieved. At the end of an overnight run, you can git log --oneline and see the entire optimization journey.

If you start it before bed, you can wake up to roughly a hundred experiments completed.

✨The Three-File Architecture

Auto Research works because of a constraint system built around three files. Each one has a specific role, and the boundaries between them are what prevent the whole thing from collapsing into chaos.

File 1: program.md

This is the file you write. Think of it as a system prompt for the experiment loop. You define three things:

The objective. What are you optimizing? “Minimize p99 latency.” “Maximize test pass rate.” “Reduce image size.”
The constraints. What can’t the agent do? “Don’t exceed 512MB of memory.”
The protocol. How should the agent behave? “Run eval after every change.” “Commit if better, revert if worse.” “Don’t stop to ask questions.”

program.md is the job description. You're hiring an AI employee, and this is their employment contract.

File 2: train.py

This is the one and only file the agent can edit. The name comes from Karpathy’s original use case (training GPT-2), but it doesn’t have to be a Python script. It can be:

A system prompt
A SQL query
A Dockerfile
A CSS file
A config file
Literally anything you want to optimize

The single-file constraint is deliberate. By giving the agent one degree of freedom, you prevent it from making sprawling changes you can’t review. The agent has a focused surface area; you have a reviewable diff.

File 3: prepare.py

This is the most important file in the entire system, and the agent absolutely cannot touch it.

prepare.py defines what better means. It runs the evaluation, computes the metric, and outputs a single scalar number. The agent reads this score and decides whether to commit or revert.

Why is it locked? Because if the agent could edit the evaluation, it could just rewrite the scoring function to always return a perfect score. Game over. Optimization meaningless.

There’s a subtle but important corollary here: if you set the wrong metric, the agent will confidently optimize the wrong thing. It will improve the number you gave it, even if that number doesn’t measure what you actually care about. Choosing the right metric is your job. The agent handles execution. You handle direction.

🚨The Misconception That’s Costing People

Most people who hear about Auto Research think it’s a machine learning thing. They see Karpathy’s GPT-2 example and assume the pattern only applies to training models.

This is wrong, and it’s the most expensive misconception you can have about this technology.

Auto Research is a pattern, not a tool. The pattern works anywhere you have three things:

One scalar metric. A single number that tells you if things got better or worse.
Automated evaluation. No human in the loop. If you need a person to look at the result and judge, it’s too slow.
One mutable file. A focused surface area for the agent to work with.

If all three conditions are met, you can Auto Research it. Here’s what that opens up:

Prompt engineering. Your file is system_prompt.txt. Your metric is accuracy on a labeled test set. The agent tries different phrasings, few-shot examples, chain-of-thought instructions, even different languages. Each experiment runs the prompt against your test data and reports accuracy.

API performance. Your file is the handler code. Your metric is p99 latency under load. The agent experiments with caching, connection pooling, query batching, async patterns. The eval script fires a thousand requests and measures the 99th percentile.

Dockerfile optimization. Your file is the Dockerfile. Your metric is build_time × image_size. The agent tries multi-stage builds, different base images, layer reordering. The eval runs docker build and measures both numbers.

SQL query tuning. Your file is query.sql. Your metric is execution time on a fixed dataset. The agent tries index hints, join strategies, CTEs vs subqueries. The eval just runs the query and reports wall clock time.
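Whatever the domain, the evaluation has to collapse a run into one number. As an illustration for the API performance case, here is a small Python sketch that turns a list of measured request latencies into a p99 value using the nearest-rank method. The function name is hypothetical, and collecting the latencies (the load test itself) is left out.

```python
from typing import Sequence


def p99(latencies_ms: Sequence[float]) -> float:
    """99th-percentile latency by the nearest-rank method (lower is better)."""
    if not latencies_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.99 * n, giving a 1-based rank without float rounding.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]
```

Since lower latency is better, a loop that maximizes its score would use the negated value.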

The pattern doesn’t change. The loop doesn’t change. The three files don’t change. Only the contents change.

The rule is simple: if you can score it, you can Auto Research it. If you can’t score it, don’t try.

🔍Why This Is Bigger Than It Looks

I want to zoom out for a second, because the implications of Auto Research go beyond “neat trick for optimizing code.”

Karpathy has talked about his end vision for this. Back in the early 2000s, there was a project called SETI@home, where you could donate spare computing power on your home PC to the search for extraterrestrial life. Karpathy wants to build the same thing, but for AI research. Millions of agents, distributed across thousands of computers, all running Auto Research loops on different problems.

Think about what that means. Right now, AI labs spend tens of millions of dollars on researchers whose job is essentially to be the experiment loop: propose changes, run training runs, look at results, decide what to try next. That work is now scriptable.

Karpathy’s prediction is that every frontier AI lab will eventually adopt some form of Auto Research internally. And he made the basic version open source.

The person who can set up the loop correctly will out-produce a team doing it manually.

🧰What I Think You Should Do With This

If you’ve read this far and you’re a developer, here’s my honest take.

Try it once. Clone Karpathy’s repo. Pick the smallest possible problem in your codebase that has a clear metric: maybe a slow function, a bloated Dockerfile, a config file with too many knobs. Set up the three files. Let an AI agent loop on it for an hour. You’ll learn more from one run than from reading ten articles like this one.

Then start noticing. Once the pattern is in your head, you’ll start seeing Auto Research opportunities everywhere. That nightly batch job that takes 40 minutes? The system prompt you’ve been hand-tuning for weeks? The query that’s too slow but you don’t have time to optimize? All of these are candidates.

Get good at picking metrics. This is the hard part and the only part the agent can’t do for you. A bad metric gives you confident, automated, beautifully-committed garbage. A good metric gives you measurable progress toward something that actually matters. Spend more time on this than on anything else.

The closing thought I keep coming back to is something Karpathy himself said:

“Any metric you care about that is reasonably efficient to evaluate can be auto researched.”

The loop is open source. The pattern is freely available. The only thing standing between you and using it is picking your first metric.

🎗️Reference

Karpathy Autoresearch Github

Karpathy’s AutoResearch and its applications beyond ML

7 Git Concepts That Will Boost Your Productivity Exponentially

Urvil Joshi — Sat, 28 Mar 2026 23:44:37 +0000

If you’ve been using Git for a while and have the basics down, there’s still more power to discover. Git is packed with features that most developers never touch, and learning just a handful of them can improve your workflow.

Here are seven concepts that, once learned, you’ll find yourself reaching for regularly.

1. Git Worktree : Work on Multiple Branches Without the Stash

We’ve all been there. You’re deep in a feature, changes are halfway done, and suddenly you need to review someone’s code or jump on a hotfix. You reach for git stash, switch branches, do your thing, switch back, run git stash pop, and pray you remember which stash was which.

git stash
# ...do other work...
git stash pop

This works, but it has real downsides. If your codebase is large and takes 20–30 minutes to compile, switching branches means recompiling twice. And if you’re a serial stasher, you’ll inevitably lose track of what’s where.

Git worktree solves this elegantly. It lets you check out multiple branches into separate directories simultaneously, all backed by the same repository.

git worktree add -b code-review ../code-review release-branch

This creates a new directory ../code-review with the code-review branch checked out from release-branch. Your original working directory is untouched: no stashing, no recompilation.

A few things to keep in mind. Always create your worktree in a sibling directory, not inside your main repo. Nesting worktrees causes duplicate file issues. The main worktree contains the .git folder with all repository metadata, while secondary worktrees have a .git file that points back to the main one.

You can list all active worktrees with git worktree list and clean them up when you're done:

git worktree remove ../code-review

Personally, I keep worktrees open for release branches and branches that need regular code review. It saves a surprising amount of context-switching time.

2. Git Squash : Keep Your History Clean

Over the course of a feature or hotfix, it’s easy to accumulate a trail of commits like “fix typo,” “change color,” “update docs,” and “refactor again.” These might be meaningful while you’re working, but they clutter the project history for everyone else.

Git squash lets you collapse multiple commits into one, giving you a clean, readable history. There are two common approaches.

Interactive rebase:

git rebase -i HEAD~3

This opens an editor showing your last three commits. You mark the ones you want to fold in with squash (or s), keep the top one as pick, and Git combines them. You'll get a chance to write a new commit message.

pick abc1234 Code refactor
squash def5678 Code refactor
squash ghi9012 Code refactor

Squash merge:

When merging a feature branch into your release branch, you can use the --squash flag:

git merge --squash feature-branch

This pulls in all the changes but doesn’t create a commit automatically. You then commit once with a clean message that summarizes the entire body of work. The result is a single, meaningful entry in your branch’s history instead of a dozen incremental ones.

3. Git Aliases : Stop Typing the Same Long Commands

If you’re anything like most Git users, you probably type certain commands dozens of times a day. Commands like:

git log --oneline --graph --decorate

That gets old fast. Git aliases let you create shortcuts for frequently used commands:

git config --global alias.logs "log --oneline --graph --decorate"

Now git logs runs the full command. You can alias any long or complex command you use frequently. If you're maintaining an open-source project or spending a lot of time in the terminal, aliases add up to real time savings over the course of a day.

4. Git Bisect : Find the Exact Commit That Broke Things

Regression bugs are frustrating. Something that worked last week is now broken, and somewhere in the last 100 commits, something went wrong. You could manually check out commits one by one, but Git has a smarter tool built in.

git bisect performs a binary search through your commit history to find the exact commit that introduced the bug.

git bisect start
git bisect bad # current HEAD is broken
git bisect good abc1234 # this older commit was working

Git checks out a commit halfway between the two. You run your tests, then tell Git whether that commit is good or bad:

git bisect good # or: git bisect bad

It narrows the range and checks out the next candidate. For 100 commits, you’ll find the culprit in roughly 7 steps instead of 100.
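To see where "roughly 7 steps" comes from, here is a minimal sketch of the binary search idea in Java. It is not Git's actual implementation: the class name BisectSketch, the integer commit indexes, and the hypothetical culprit at index 73 are all assumptions made for illustration.

```java
import java.util.function.IntPredicate;

// Sketch of the binary search behind git bisect; not Git's real implementation.
class BisectSketch {
    // Number of test runs made by the most recent findFirstBad call.
    static int steps;

    // Commits are indexed oldest first. Index `good` is known good, index `bad`
    // is known bad, and every commit after the first bad one is assumed bad too.
    static int findFirstBad(int good, int bad, IntPredicate isBad) {
        steps = 0;
        while (bad - good > 1) {
            int mid = good + (bad - good) / 2; // the commit Git would check out next
            steps++;
            if (isBad.test(mid)) {
                bad = mid;   // corresponds to `git bisect bad`
            } else {
                good = mid;  // corresponds to `git bisect good`
            }
        }
        return bad; // first commit for which the test fails
    }

    public static void main(String[] args) {
        // Hypothetical history: 100 commits after the known-good one, culprit at 73.
        int firstBad = findFirstBad(0, 100, commit -> commit >= 73);
        System.out.println("first bad commit: " + firstBad + ", test runs: " + steps);
    }
}
```

Each iteration halves the remaining range, so a range of 100 commits needs at most 7 test runs.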

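*A practical tip* : If you’re using a test file to verify the bug, keep it untracked (for example, list it in .gitignore or store it outside the repository) so it persists across checkouts. If the file is committed, Git removes it whenever bisect checks out a commit from before the file was added.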

5. Git Cherry-Pick : Selectively Apply Commits Across Branches

Sometimes you need to move a specific commit from one branch to another without merging the entire branch. Maybe you fixed a bug on your feature branch that the release branch also needs, but the feature isn’t ready to merge yet.

git cherry-pick does exactly this:

git checkout release-branch
git cherry-pick <commit-hash>

This applies the changes from that single commit onto your current branch as a new commit.

Common use cases include applying a bugfix to a release branch without merging unfinished feature work, backporting fixes to a main or master branch that’s publicly visible, and recovering lost commits that you found via git reflog. It's far cleaner than copy-pasting code changes and committing them separately, which is something you see more often than you'd expect.

6. Git Reflog : Your Safety Net for Lost Commits

git log shows you commit history. git reflog shows you everything: every branch switch, every rebase, every cherry-pick, every checkout. It's a full local activity log of what your HEAD has pointed to.

This becomes invaluable when something goes wrong. Say you accidentally delete a local branch that was never pushed to remote:

git branch -D feature-branch

The commits aren’t actually gone yet. Git only deleted the branch label; the commits persist until garbage collection runs, which can be weeks or months later.

To recover:

git reflog
# Find the commit hash from before the deletion
git checkout -b feature-branch <commit-hash>

Your branch and all its commits are back. This works for detached HEAD situations, careless rebases, and any number of “I’ve made a terrible mistake” moments. Think of git reflog as Git's undo history.

7. Git Hooks : Automate Quality Checks Before You Commit

Git hooks are scripts that run automatically at specific points in your Git workflow, such as before a commit, before a push, or after a merge. They’re a powerful way to enforce quality standards without relying on memory or discipline.

The most commonly useful hook is the pre-commit hook. Here’s how to set one up from scratch.

Inside any Git repository, navigate to .git/hooks/. You'll find sample files there. To create a pre-commit hook, rename pre-commit.sample to pre-commit (removing the .sample extension) and write your script:

#!/bin/bash
# Pre-commit hook: compile Java files before allowing commit

echo "Running pre-commit checks..."
javac *.java
if [ $? -ne 0 ]; then
    echo "Compilation failed. Commit aborted."
    exit 1
fi
echo "All checks passed."
exit 0

Now, every time you run git commit, this script executes first. If the compilation fails, the commit is blocked.

You can extend this pattern to run unit tests, static analysis tools like SonarQube, linting, or any validation your project requires. If the hook exits with a non-zero status, the commit is rejected. It’s a simple mechanism that prevents “I forgot to do this before pushing” problems.

✍️Conclusion

These seven concepts sit just beyond the basics, but each one addresses a real-world problem that developers face regularly. Worktrees eliminate context-switching overhead. Squash keeps your history readable. Aliases save keystrokes. Bisect turns debugging from a guessing game into a logarithmic search. Cherry-pick gives you precision. Reflog acts as your safety net. And hooks automate checks at key points in the Git lifecycle.

Pick one that solves a problem you’re currently facing and try it out. You’ll be surprised how quickly these become part of your daily workflow.

🎗️Reference

10X Your Git Workflow: 7 Pro Tips [Git Productivity 2025]

Canonical Log Lines: Stripe's Brilliant Technique for Production Observability

Urvil Joshi — Tue, 17 Mar 2026 13:38:47 +0000

Stripe published a technique they call canonical log lines: one fat, structured log line emitted at the end of every request that contains all the important telemetry in one place. It sounds too simple, but it fundamentally changes how you query, debug, and understand production systems.

🚨The Problem With Traditional Logging

Let’s say your payment API receives a single request. Internally, it passes through every layer: authentication, rate limiting, database queries, and response generation. Traditional logging looks something like this:

[2024-03-18 22:48:32.990] Request started http_method=POST http_path=/v1/charges request_id=req_123
[2024-03-18 22:48:32.991] User authenticated auth_type=api_key key_id=mk_123 user_id=usr_123
[2024-03-18 22:48:32.992] Rate limiting ran rate_allowed=true rate_quota=100 rate_remaining=99
[2024-03-18 22:48:32.998] Charge created charge_id=ch_123 team=acquiring
[2024-03-18 22:48:32.999] Request finished duration=0.009 http_status=200 database_queries=34

This looks perfectly fine. And for simple questions, it works:

# Was anything rate limited recently?
"Rate limiting ran" rate_allowed=false

# Duration stats over the last hour
"Request finished" earliest=-1h | stats count p50(duration) p95(duration) p99(duration)

But now imagine you are in the middle of an incident and someone asks: “Which users are being rate limited the most?”

Suddenly you have a problem. The user_id field is in the authentication line. The rate_allowed field is in the rate limiting line. They're completely separate. To link them, you need to join across lines using request_id as a common bridge field.

*PROBLEM* : Each log line only knows what its own module knows. The rate limiter doesn’t know the user ID. The auth module doesn’t know whether rate limiting passed. The information you need is almost always spread across multiple log lines, and joining them is too much processing when you are dealing with millions or billions of requests.

✨The Solution: One Canonical Line With All Important Data

The canonical log line is Stripe’s answer: at the end of every request, emit a single fat, structured log line that contains all the key data together. Keep in mind that it is emitted in addition to the regular logs, not instead of them.

[2024-03-18 22:48:32.999] canonical-log-line
  alloc_count=9123
  auth_type=api_key
  database_queries=34
  duration=0.009
  http_method=POST
  http_path=/v1/charges
  http_status=200
  key_id=mk_123
  permissions_used=account_write
  rate_allowed=true
  rate_quota=100
  rate_remaining=99
  request_id=req_123
  user_id=usr_123

Every key piece of information about this request (auth, rate limiting, HTTP details, database stats) lives in a single readable line. The query for “who is being rate limited most?” no longer needs a join: filter canonical-log-line entries on rate_allowed=false and count them by user_id.

Canonical lines are an ergonomic feature for engineers. Collecting everything important in one place, where it is reachable through queries that are easy to write, makes production incidents easier to debug and analyse.
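As a rough illustration of why the single line helps, the following Java sketch answers the rate-limiting question over a handful of canonical lines without any join. It is not Stripe's tooling or a log-platform query: the class RateLimitReport, its methods, and the sample user IDs are hypothetical, and the parser assumes simple key=value tokens with no spaces inside values.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: count rate-limited requests per user from canonical log lines.
class RateLimitReport {
    // Parse space-separated key=value tokens; tokens without '=' (such as the
    // timestamp or the "canonical-log-line" marker) are skipped.
    static Map<String, String> parse(String line) {
        Map<String, String> fields = new HashMap<>();
        for (String token : line.trim().split("\\s+")) {
            int eq = token.indexOf('=');
            if (eq > 0) {
                fields.put(token.substring(0, eq), token.substring(eq + 1));
            }
        }
        return fields;
    }

    // Both fields live on the same line, so one pass is enough.
    static Map<String, Integer> rateLimitedByUser(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            if (!line.contains("canonical-log-line")) {
                continue; // ignore regular log lines
            }
            Map<String, String> fields = parse(line);
            if ("false".equals(fields.get("rate_allowed")) && fields.containsKey("user_id")) {
                counts.merge(fields.get("user_id"), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "canonical-log-line http_status=429 rate_allowed=false user_id=usr_123",
            "canonical-log-line http_status=200 rate_allowed=true user_id=usr_456",
            "canonical-log-line http_status=429 rate_allowed=false user_id=usr_123");
        System.out.println(rateLimitedByUser(lines));
    }
}
```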

🔍More Real Query Examples

The power of canonical log lines is not just rate limiting. Because every key field is together, you can ask almost any operational question in a single line of query syntax.

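Example 1: Performance by Endpoint

During an incident, you want latency percentiles for the /v1/charges endpoint for a specific user, while filtering out client errors. Because http_path, user_id, http_status, and duration all sit on the same line, that is one filter followed by a percentile aggregation over duration.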

Example 2 — Detect a Bug vs. Legitimate Rate Limiting

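Is rate limiting hitting a few users, or is it a bug affecting everyone? Counting rate-limited canonical lines per user_id tells you which: a handful of users dominating the counts points to legitimate limiting, while an even spread across users points to a bug.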

Example 3 — TLS Version Adoption (Real Stripe Use Case)

Stripe needed to migrate users from TLS 1.0/1.1 to TLS 1.2. They could query this instantly with canonical lines.

🍥Architecture Used By Stripe

The beauty of canonical log lines is that the implementation is simple. The idea is to do it in middleware, which first makes it completely automatic and second gives you strict control over it.

During a request’s lifecycle, each module decorates a shared environment object with relevant fields. The canonical logger sits at the very end of the middleware chain and drains all those fields into one log line.

Stripe wraps the canonical line emission in a Ruby ensure block, which means it runs even if an exception was thrown mid-request. This guarantees you always have observability, especially during incidents, when you need it most.
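The same shape can be sketched in Java, where try/finally plays the role of Ruby's ensure. This is an illustration under assumptions, not Stripe's code: the class CanonicalLineMiddleware, the handle and format methods, the sink callback, and the choice to record http_status=500 on failure are all hypothetical.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Consumer;

// Sketch: modules decorate a shared map, and one line is emitted at the end.
class CanonicalLineMiddleware {
    // Runs the handler, then always emits exactly one canonical line to the sink.
    static void handle(Consumer<Map<String, Object>> handler, Consumer<String> sink) {
        Map<String, Object> env = new TreeMap<>(); // sorted keys, like the sample line
        long start = System.nanoTime();
        try {
            handler.accept(env);                   // auth, rate limiter, etc. add fields
            env.putIfAbsent("http_status", 200);
        } catch (RuntimeException e) {
            env.put("http_status", 500);           // assumption: failures map to 500
            throw e;
        } finally {
            // Like Ruby's ensure: runs whether or not the handler threw.
            double seconds = (System.nanoTime() - start) / 1_000_000_000.0;
            env.put("duration", String.format(Locale.ROOT, "%.3f", seconds));
            sink.accept(format(env));
        }
    }

    // Drain every collected field into a single key=value line.
    static String format(Map<String, Object> env) {
        StringBuilder sb = new StringBuilder("canonical-log-line");
        for (Map.Entry<String, Object> e : env.entrySet()) {
            sb.append(' ').append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        handle(env -> {
            env.put("user_id", "usr_123");
            env.put("rate_allowed", true);
        }, System.out::println);

        try {
            handle(env -> {
                env.put("user_id", "usr_456");
                throw new IllegalStateException("database unavailable");
            }, System.out::println);
        } catch (IllegalStateException e) {
            System.out.println("request failed: " + e.getMessage());
        }
    }
}
```

The second request still produces its canonical line, including the user_id that was recorded before the failure.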

Stripe’s approach for canonical line storage:

Kafka: canonical lines are serialized as Protocol Buffers and pushed asynchronously to a Kafka topic. This keeps the request path fast.
Splunk: lines are ingested in near real time. Perfect when you need answers in seconds. Great for the last hour of data.
S3 and Redshift: a consumer reads from Kafka, batches the data, and writes it to S3. Periodic jobs ingest it into Redshift for SQL-based long-term analytics over months of history.
Developer Dashboard: Stripe’s Developer Dashboard is powered by MapReduce jobs over these S3 archives.

Long-Term Analytics in SQL (Redshift)

Stripe archives canonical lines to Redshift. This lets them run queries over months of history in standard SQL.

🧰Canonical vs. Other Observability Tools

Canonical log lines do not replace metrics or distributed tracing. They occupy a unique place in observability.

Ideally we should use all four, with canonical log lines as the first view in any debugging process because they are fast to query and cheap to aggregate.

✍️Conclusion

Canonical log lines are not a new technology. They are a simple convention that makes your existing logging infrastructure dramatically more powerful.

The beauty of this pattern is that it scales from a small start-up to Stripe’s global payment infrastructure. The implementation is simple yet powerful.

🎗️Reference

Docker Model Runner: Run AI Models Locally Within Your Docker Ecosystem

Urvil Joshi — Tue, 07 Oct 2025 13:10:57 +0000

Docker Model Runner (DMR)

Docker Model Runner (DMR) officially reached General Availability on September 18th, transitioning from its beta phase that began in April. This powerful tool enables developers to pull, run, and manage AI models locally within the Docker ecosystem, bringing the convenience of containerization to machine learning workflows.

What is Docker Model Runner?

Docker Model Runner allows you to run Large Language Models (LLMs) directly on your local machine while leveraging Docker's robust ecosystem. It combines the best features of local AI inference with Docker's familiar tooling and workflows.

Key Features

1. Local LLM Execution

Running LLM models locally provides several critical advantages:

Enhanced Data Security: Your data remains entirely on your local machine, never leaving your control or being sent to external services. This is particularly important for sensitive or proprietary information.
Accelerated Development Workflows: Developers can iterate faster by running AI models alongside their applications without network latency or API rate limits.
Seamless Integration: If you're already using Docker Compose for your development environment, you can easily add AI models to your stack. When you spin up your containers, your LLM will launch simultaneously, creating a fully integrated local development environment.

2. OpenAI-Compatible APIs

Docker Model Runner provides OpenAI-compatible API endpoints, making integration straightforward. Many applications already use OpenAI's API format, which means:

No code changes required in existing applications
Client applications can switch seamlessly between cloud and local models
Response formats remain consistent with OpenAI standards
Your existing parsing logic continues to work without modification

3. Integrated Inference Engine

The architecture is designed for optimal performance:

Models run on your host machine rather than inside Docker containers, maximizing performance
Uses the llama.cpp inference server for efficient model execution
Automatic NVIDIA GPU support when available
Combines Docker's ecosystem management capabilities with Ollama-like performance
Provides Docker commands for pulling, caching, managing, and running models

4. OCI Artifact Distribution

Models are packaged and distributed as Open Container Initiative (OCI) artifacts, the same standardized format used for Docker images. This means:

Models can be pushed to any OCI-compatible registry
Standardized packaging ensures consistency and portability
Most models are distributed in GGUF (GPT-Generated Unified Format)

GGUF files typically store quantized weights, which reduces model size and lets AI models run on standard hardware, including CPU-only systems. This format is ideal for local deployments where computational resources may be limited.

5. Multiple Interaction Methods

Docker Model Runner offers flexibility in how you interact with models:

Command-line interface for terminal-based interactions
Docker Desktop GUI for visual model management
OpenAI-compatible REST APIs for programmatic access

6. Parallel Multi-Model Support

Need to run multiple models simultaneously? Docker Model Runner handles this effortlessly. For example, if you're building an AI agent that performs text summarization and image generation, you can run both models in parallel without complex configuration. Models can be accessed through the GUI, CLI, and API endpoints concurrently.

Getting Started

Prerequisites

Docker Desktop version 4.41.0 or higher
(Optional) NVIDIA GPU for accelerated inference

Configuration

Open Docker Desktop settings and enable:

GPU-backed inference: Allows automatic NVIDIA GPU detection and utilization
Host TCP support: Enables OpenAI-compatible API access via HTTP
CORS settings: Set to "all" if you encounter API access issues

Finding and Pulling Models

From Docker Hub:

Navigate to the AI section in Docker Hub and search for models using the ai/ prefix. Popular models like Llama 3.2, Mistral, and Phi-3 are readily available. Each model listing shows different quantization versions, allowing you to balance performance and resource requirements.

To pull a model:

docker model pull ai/llama3.2:1b-instruct-q4_K_M

From Hugging Face:

Browse to your desired model on Hugging Face, select "Use this model," and choose "Docker Model Runner" as the deployment method. The interface will display the appropriate pull command with your selected quantization level.

Basic Commands

Check Docker Model Runner status:

docker model status

List downloaded models:

docker model list

This displays model metadata including name, parameters, quantization level, and architecture.

Interacting with Models

Command-Line Interface

Single query:

docker model run ai/llama3.2:1b-instruct-q4_K_M "What is Docker?"

Interactive session:

docker model run ai/llama3.2:1b-instruct-q4_K_M

This opens a chat interface where you can have multi-turn conversations. Docker Model Runner maintains context across multiple exchanges. Exit by typing /bye.

Docker Desktop GUI

In the Models tab, navigate to the Local section and click "Run" next to your desired model. This launches an interactive interface where you can:

Chat with the model through a text input field
View the Inspect tab for model metadata and architecture details
Check the Requests tab to see your conversation history

The GUI maintains multi-turn conversation context, allowing natural, contextual interactions.

OpenAI-Compatible API

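With host TCP support enabled in Docker Desktop settings, you can access models via REST API on the configured port. Docker's documentation uses 12434 as the default and shows the OpenAI-compatible routes under an /engines/ prefix (for example /engines/v1/chat/completions), so check the port and path for your version if the request below returns a 404: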

curl -X POST http://localhost:PORT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/llama3.2:1b-instruct-q4_K_M",
    "messages": [{"role": "user", "content": "What is Docker?"}]
  }'

The response format matches OpenAI's API specification, ensuring compatibility with existing tooling and parsers.
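For application code, the same call can be sketched with Java's built-in HttpClient (Java 11 or later). The class ModelRunnerClient and its helper methods are hypothetical names, the base URL is supplied by the caller because the port depends on your settings, and the JSON is assembled by hand with only minimal escaping, so treat this as an illustration and not a client library.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch: build and send a chat completion request to a local endpoint.
class ModelRunnerClient {
    // Minimal escaping for quotes and backslashes; assumes no control characters.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Same JSON shape as the curl example: a model name and one user message.
    static String chatBody(String model, String prompt) {
        return "{\"model\":\"" + escape(model) + "\","
             + "\"messages\":[{\"role\":\"user\",\"content\":\"" + escape(prompt) + "\"}]}";
    }

    // baseUrl is an assumption supplied by the caller, e.g. http://localhost:PORT
    static HttpRequest chatRequest(String baseUrl, String model, String prompt) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/v1/chat/completions"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(chatBody(model, prompt)))
                .build();
    }

    public static void main(String[] args) throws Exception {
        String model = "ai/llama3.2:1b-instruct-q4_K_M";
        if (args.length == 0) {
            // No base URL given: show the payload instead of calling the network.
            System.out.println(chatBody(model, "What is Docker?"));
            return;
        }
        HttpRequest request = chatRequest(args[0], model, "What is Docker?");
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```

Run it with the base URL as the only argument once the model is pulled and host TCP support is enabled.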

Docker Model Runner vs. Ollama

Both tools enable local AI model execution, but they have distinct characteristics:

Performance: Docker Model Runner runs models on the host machine rather than in containers, typically achieving approximately 12% better performance than containerized approaches. Ollama also runs on the host, either as a standalone binary or managed service, providing similar performance benefits.
Integration: Docker Model Runner provides seamless integration with Docker Desktop and Docker Compose, making it ideal if you're already using Docker for development. Models can be defined in your compose files and started automatically with your application stack. Ollama operates as a standalone application with its own CLI and basic API.
API Endpoints: Both offer OpenAI-compatible endpoints, but they use different default ports. You can configure these as needed for your environment.

Tips and Resources

The official Docker Model Runner documentation provides comprehensive guidance for various platforms including WSL 2, Linux, and macOS. The "Known Issues" section addresses common problems and their solutions.

For those interested in the technical details, the Docker team has published an in-depth blog post covering the design philosophy, goals, GPU acceleration strategies, and high-level architecture. This resource is invaluable for understanding the engineering decisions behind Docker Model Runner.

Conclusion

Docker Model Runner represents a natural evolution for developers already invested in the Docker ecosystem. By bringing local AI model execution to Docker Desktop, it eliminates the need for separate tools while providing familiar commands and workflows.

The combination of data privacy, development speed, and seamless integration makes Docker Model Runner particularly attractive for:

Development teams building AI-powered applications
Organizations with data sensitivity requirements
Developers seeking faster iteration cycles
Teams already standardized on Docker tooling

If you're currently using Docker for development but haven't explored local AI model execution, Docker Model Runner offers a compelling entry point. Its integration with existing Docker workflows means minimal learning curve while unlocking powerful AI capabilities directly in your development environment.

Whether you're building chatbots, implementing RAG systems, or experimenting with AI agents, Docker Model Runner provides the infrastructure to do so efficiently and securely on your local machine.

Resources

Docker Model Runner Tutorial 2025: Run AI Models Locally in Minutes | Complete Guide

10X Your Git Workflow: 7 Pro Tips to Boost Productivity 🚀

Urvil Joshi — Wed, 17 Sep 2025 08:52:10 +0000

Hey DEV community! Tired of Git stashes or messy commits? My new YouTube video, 10X Your Git Workflow: 7 Pro Tips (Worktree, Hooks & More), shares advanced hacks to save time and streamline version control.

Highlights:

Swap git stash for git worktree to juggle branches smoothly.
Clean commits with interactive rebase for polished PRs.
Automate checks with Git hooks to catch errors early.
Recover lost commits with git reflog—your safety net!

Perfect for devs using GitHub or GitLab. Watch now: [https://youtu.be/d_xZgcRJ--Q]

What’s your top Git trick or worst Git headache? Share below! 😄

#git #versioncontrol #developerproductivity #programming #coding

Hey DEV community! Tired of Git chaos? My new YouTube video, 10X Your Git Workflow: 7 Pro Tips, shares advanced hacks to save time https://youtu.be/d_xZgcRJ--Q #git #versioncontrol #developerproductivity #programming #coding

Urvil Joshi — Wed, 17 Sep 2025 08:44:00 +0000


Understanding False Sharing and How to Mitigate It in Java

Urvil Joshi — Sat, 08 Feb 2025 17:06:53 +0000

In multi-threaded programming, optimizing performance is a never-ending task. One bottleneck developers often overlook is false sharing. This article takes a deep dive into what false sharing is, how it impacts performance, and how to mitigate it, using practical examples in Java.

🍥What is False Sharing?

False sharing occurs when multiple threads modify variables that reside on the same cache line. A cache line is the smallest unit of data that can be transferred between the main memory (RAM) and the CPU cache. Modern CPUs cache data in chunks (typically 64 bytes), and when one thread updates a variable in a cache line, it invalidates the entire cache line for other threads. This forces other threads to reload the cache line from memory, even if they are accessing different variables within the same cache line.

The result? Unnecessary cache invalidations and reloads, leading to significant performance degradation, especially in high-concurrency scenarios.
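To make "same cache line" concrete, here is a small arithmetic sketch. The 64-byte line size is the typical value mentioned above, and the addresses are made up for illustration: the JVM does not expose real object addresses, and the class CacheLineMath and its methods are hypothetical.

```java
// Sketch: two addresses share a cache line when they fall in the same 64-byte block.
class CacheLineMath {
    static final int CACHE_LINE_BYTES = 64; // typical size; an assumption, not queried from hardware

    // Index of the cache line that contains the given address.
    static long lineIndex(long address) {
        return address / CACHE_LINE_BYTES;
    }

    static boolean sameLine(long a, long b) {
        return lineIndex(a) == lineIndex(b);
    }

    public static void main(String[] args) {
        long base = 0x1000;         // hypothetical address where an object starts
        long count1 = base + 12;    // hypothetical field offsets inside that object
        long count2 = base + 16;
        long farAway = base + 140;  // hypothetical field more than one line away

        System.out.println("count1 and count2 share a line: " + sameLine(count1, count2));
        System.out.println("count1 and farAway share a line: " + sameLine(count1, farAway));
    }
}
```

Two adjacent int fields almost always land in the same 64-byte block, which is why neighbouring counters in one object are the classic false sharing case.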

🚨The Problem: False Sharing in Action

Let’s start by looking at a simple example that demonstrates false sharing. Consider the following Java code:

public class FalseSharingProblem {

    public static void main(String[] args) {
        FalseSharingCounter falseSharingCounter1 = new FalseSharingCounter();
        FalseSharingCounter falseSharingCounter2 = falseSharingCounter1;

        Runnable r1 = () -> {
            int iterations = 1_000_000_000;
            long start = System.currentTimeMillis();
            for (int i = 0; i < iterations; i++) {
                falseSharingCounter1.count1++;
            }
            System.out.println("Time taken "+(System.currentTimeMillis()-start)+" ms");
        };

        Runnable r2 = () -> {
            int iterations = 1_000_000_000;
            long start = System.currentTimeMillis();
            for (int i = 0; i < iterations; i++) {
                falseSharingCounter2.count2++;
            }
            System.out.println("Time taken "+(System.currentTimeMillis()-start)+" ms");
        };

        Thread.ofPlatform().name("Thread1").start(r1);
        Thread.ofPlatform().name("Thread2").start(r2);
    }
}

public class FalseSharingCounter {

    public volatile int count1 = 0;
    public volatile int count2 = 0;
}

In this example, two threads (Thread1 and Thread2) are incrementing two different counters (count1 and count2) that reside in the same FalseSharingCounter object. Since count1 and count2 are likely to be on the same cache line, updating one counter will invalidate the cache line for the other thread, causing false sharing.

🔍The Impact

When you run this code, you’ll notice that the time taken to complete the increments is significantly higher than expected. This is due to the constant cache line invalidations caused by false sharing.

✨The Artificial Solution: Separate Objects

One way to mitigate false sharing is to ensure that the counters are not on the same cache line. This can be achieved by using separate objects for each counter:

public class FalseSharingArtificialSolution {

    public static void main(String[] args) {
        FalseSharingCounter falseSharingCounter1 = new FalseSharingCounter();
        FalseSharingCounter falseSharingCounter2 = new FalseSharingCounter();

        Runnable r1 = () -> {
            int iterations = 1_000_000_000;
            long start = System.currentTimeMillis();
            for (int i = 0; i < iterations; i++) {
                falseSharingCounter1.count1++;
            }
            System.out.println("Time taken "+(System.currentTimeMillis()-start)+" ms");
        };

        Runnable r2 = () -> {
            int iterations = 1_000_000_000;
            long start = System.currentTimeMillis();
            for (int i = 0; i < iterations; i++) {
                falseSharingCounter2.count2++;
            }
            System.out.println("Time taken "+(System.currentTimeMillis()-start)+" ms");
        };

        Thread.ofPlatform().name("Thread1").start(r1);
        Thread.ofPlatform().name("Thread2").start(r2);
    }
}

public class FalseSharingCounter {

    public volatile int count1 = 0;
    public volatile int count2 = 0;
}

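In this solution, falseSharingCounter1 and falseSharingCounter2 are two separate objects, so count1 and count2 no longer sit next to each other in a single object. This usually removes the false sharing, and you'll typically observe a significant improvement in performance. The JVM gives no guarantee about where two small objects are allocated, though, so they can still end up on the same cache line.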

🧰The Elegant Solution: Using @Contended

While the artificial solution works, it’s not always practical to create separate objects for every counter. Another solution is to add padding around the variables. You can do that manually, but Java provides a more elegant option: the @jdk.internal.vm.annotation.Contended annotation. This annotation tells the JVM to add padding around the annotated field or class to prevent false sharing.

Example 1: Padding a Single Field

public class FalseSharingContendedCounter1 {
    // The JVM pads this field so it does not share a cache line with the other fields of this class
    @jdk.internal.vm.annotation.Contended
    public volatile int count1 = 0;
    public volatile int count2 = 0;
}

In this example, count1 is padded to ensure it doesn't share a cache line with count2.

Example 2: Padding the Entire Class

@jdk.internal.vm.annotation.Contended
public class FalseSharingContendedCounter2 {
    public volatile int count1 = 0;
    public volatile int count2 = 0;
}

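Here, the entire class is padded. At class level the annotation treats the declared fields as one contention group, so the object is isolated from neighbouring data, but count1 and count2 can still share a cache line with each other. Use field-level annotations when the fields themselves are updated by different threads.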

Example 3: Grouping Fields

public class FalseSharingContendedCounter3 {
    @jdk.internal.vm.annotation.Contended("group1")
    public volatile int count1 = 0;
    @jdk.internal.vm.annotation.Contended("group1")
    public volatile int count2 = 0;
    @jdk.internal.vm.annotation.Contended("group2")
    public volatile int count3 = 0;
}

In this example, count1 and count2 are grouped together and will share the same cache line, while count3 is placed in a different cache line.

Running the Contended Solution

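To use the @Contended annotation in your own classes, run your Java program with the -XX:-RestrictContended JVM option, for example java -XX:-RestrictContended FalseSharingContendedSolution. On JDK 9 and later the annotation lives in a non-exported package of java.base, so compiling typically also needs --add-exports java.base/jdk.internal.vm.annotation=ALL-UNNAMED.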

Here’s the complete code for the contended solution:

// Run with the -XX:-RestrictContended JVM option
public class FalseSharingContendedSolution {

    public static void main(String[] args) {
        FalseSharingContendedCounter1 falseSharingCounter1 = new FalseSharingContendedCounter1();
        FalseSharingContendedCounter1 falseSharingCounter2 = falseSharingCounter1;

        Runnable r1 = () -> {
            int iterations = 1_000_000_000;
            long start = System.currentTimeMillis();
            for (int i = 0; i < iterations; i++) {
                falseSharingCounter1.count1++;
            }
            System.out.println("Time taken "+(System.currentTimeMillis()-start)+" ms");
        };

        Runnable r2 = () -> {
            int iterations = 1_000_000_000;
            long start = System.currentTimeMillis();
            for (int i = 0; i < iterations; i++) {
                falseSharingCounter2.count2++;
            }
            System.out.println("Time taken "+(System.currentTimeMillis()-start)+" ms");
        };

        Thread.ofPlatform().name("Thread1").start(r1);
        Thread.ofPlatform().name("Thread2").start(r2);
    }
}

✍️Conclusion

False sharing is a subtle but significant performance issue in multi-threaded applications. By understanding how cache lines work and using techniques like object separation or the @Contended annotation, you can mitigate false sharing and improve the performance of your Java applications.

Remember, in the world of high-performance computing, every nanosecond counts. So, the next time you’re dealing with multi-threaded counters or shared variables, don’t forget to check for false sharing!

🎗️Reference

False Sharing in Java — Jakob Jenkov

Happy coding! 🚀