Tianshu AI

Three things AI agents keep getting wrong (and why I'm rebuilding the platform from scratch)

TL;DR. Today's AI agents can call tools. The scaffolding for actually running a job — visible progress, real isolation, continuity across devices — is still missing. After hitting the same three pain points enough times, I'm rebuilding the agent platform from scratch. Open source. Self-hostable. Pre-alpha. Build in public, every week. Star github.com/tianshu-ai if you want the v0.1 demo when it lands — or scroll down and tell me which of the three pains hit you the hardest.


The 3-minute video version

If you'd rather read, the rest of the post says the same thing — with a few more concrete examples and three architecture sketches the video skips.


I've been using AI agents for a while

For a year or so I've been using a handful of agent-shaped tools — to do research, run code, automate boring stuff. Different products, similar shape: a chat box, a tool-using model behind it, sometimes a sandbox.

They work. Until you actually try to put one to work.

Three things keep showing up. They aren't the same bug. But every time, I get the same uneasy feeling — this is supposed to be the easy part, and it isn't.


Pain #1 — You think it's running. It's actually waiting for you to talk

You hand the agent a task. You walk away to make coffee, or take a meeting.

You come back, and it's stuck on a question:

"Should I use npm or pnpm?"

So much for "automatic." I'm just babysitting a robot.

The deeper problem isn't even the question itself. It's that the agent has no idea which decisions you actually want to be asked about and which ones it should settle with a default and move on. Every CLI flag, every package manager, every "do you want to enable telemetry?" prompt becomes a stop. The agent is technically running. In practice, it's been waiting for you for forty minutes.

If a junior engineer Slack-pinged me every time they had to choose between npm and pnpm, I would not call that "an agent." I would call it "an interview I am running."
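
To make the "ask vs. just pick a default" distinction concrete, here is a minimal sketch of a decision policy. Everything in it is an assumption for illustration: `DecisionPolicy`, the category names, and the default values are hypothetical, and none of this is Tianshu's actual design.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DecisionPolicy:
    """Hypothetical policy: which kinds of decisions may interrupt the user."""

    # Categories the user wants to be asked about (made-up names).
    ask_about: set[str] = field(
        default_factory=lambda: {"spend_money", "delete_data"}
    )
    # Choices the agent may settle silently (made-up defaults).
    defaults: dict[str, str] = field(
        default_factory=lambda: {"package_manager": "npm", "telemetry": "off"}
    )

    def resolve(self, category: str) -> tuple[str, str | None]:
        """Return ("ask", None) or ("default", value) for a decision category."""
        if category in self.ask_about:
            return ("ask", None)
        if category in self.defaults:
            return ("default", self.defaults[category])
        # Unknown category: asking is the conservative fallback.
        return ("ask", None)
```

With a policy like this, "npm or pnpm?" resolves to a default and the run continues, while anything on the ask list still stops and waits for a human.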


Pain #2 — You think it finished. It actually crashed half an hour ago

You give it a research task. Twenty open tabs, summarize, dump in a doc. You go to a meeting.

You come back. Silent screen.

Context overflowed. Or the model timed out. Or some tool errored. Task killed mid-way. No notice. No trail.

You have to reverse-engineer where it got to: open the half-written doc, scroll the chat, guess which subtasks finished, then write a new prompt to drag it back on track. Sometimes it's faster to start over than to figure out what happened. Sometimes — be honest — you only realize it died because you noticed the laptop fan stopped.

The thing that bothers me here is the asymmetry. Modern agents have plenty of internal state — chain-of-thought, tool call traces, intermediate scratchpads. None of that is exposed in a way that survives a crash. When it dies, you get nothing. When it finishes, you get the answer. When it half-finishes? You get a vibe.

A human contractor who disappeared mid-job and didn't text you would lose the contract. We somehow accept this from the agents we pay for.
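
What "a trail that survives a crash" could look like, as a rough sketch: an append-only journal that is flushed to disk after every step event, plus a replay function that rebuilds the last known state. The file layout, field names, and function names are my assumptions here, not an existing format.

```python
from __future__ import annotations

import json
import os
from pathlib import Path


def append_event(journal: Path, step: str, status: str, detail: str = "") -> None:
    """Append one event and force it to disk before returning."""
    record = {"step": step, "status": status, "detail": detail}
    with journal.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
        f.flush()
        os.fsync(f.fileno())  # the event survives even if the process dies next


def last_known_state(journal: Path) -> dict[str, str]:
    """Replay the journal and return the latest status recorded per step."""
    state: dict[str, str] = {}
    if not journal.exists():
        return state
    for line in journal.read_text(encoding="utf-8").splitlines():
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip a torn line left by a crash mid-write
        state[record["step"]] = record["status"]
    return state
```

After a crash, `last_known_state` tells you which subtasks finished and which one was in flight, which is exactly the part you otherwise reconstruct by scrolling the chat.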


Pain #3 — Your task. Someone else's account

This one is a category most "personal AI assistant" products quietly skip.

A lot of agent products are built single-user: one machine, one human, one set of cookies. That's a fine assumption if your only target is the solo MacBook owner.

But — the family laptop. The shared workstation in the office. The team's agent that everyone in the channel pokes at. Those get used by two or three people back-to-back. The agent fires up a browser, the browser still has the previous user's session open, and now your "summarize this thread" task is reading email as somebody else.

Or worse: the agent acts. Likes a post. Replies to a DM. Submits a form. Under the wrong identity.

Next morning, somebody at the office asks: "Hey — were you on Slack at 1 a.m.?"

This isn't a hypothetical. It's an account-isolation bug, and once you know it exists, you stop wanting to share an "agent" with anybody.


So… what's the actual hole?

These three aren't the same bug. But they're missing the same thing.

Today's agents can call tools. That's the part LLM products got good at. The thing that's still missing is the scaffolding for actually running a job — the layer that lives between "the model can call this tool" and "a human can leave the room and trust this thing."

I've hit those three pains enough times that the obvious patch — "just prompt better" or "use a different agent" — stopped feeling like a real fix. So I'm rebuilding the platform from scratch.

The scaffolding, I think, is three things.


Idea 1 — Make progress visible

A board. Not a chat log. A first-class plan view that tells you:

  • where the agent is (which step, which subtask)
  • where it's stuck (which input it's waiting for, which tool failed)
  • why it stopped (model finished? error? user cancel? context overflow?)

Imagine a Kanban-shaped surface where every running agent is a column, every step is a card, and the card stays around with its log even after the agent dies. You should be able to glance and know whether to walk away or step in. You shouldn't have to guess.

Plan + workers on a Kanban board: each agent step is a card with status, assignee, and result. Stuck steps stay visible instead of dying silently.
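
A minimal sketch of the data behind such a card, assuming the fields named above (status, assignee, what it is waiting for, why it stopped, and a log that outlives the worker). The class and enum names are hypothetical; the real board may model this differently.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    TODO = "todo"
    RUNNING = "running"
    WAITING_INPUT = "waiting_input"
    STOPPED = "stopped"


class StopReason(Enum):
    FINISHED = "finished"
    ERROR = "error"
    USER_CANCEL = "user_cancel"
    CONTEXT_OVERFLOW = "context_overflow"


@dataclass
class Card:
    title: str
    assignee: str
    status: Status = Status.TODO
    stop_reason: StopReason | None = None
    waiting_for: str | None = None
    log: list[str] = field(default_factory=list)

    def stop(self, reason: StopReason, note: str) -> None:
        """Record why the step stopped; the card and its log stay on the board."""
        self.status = Status.STOPPED
        self.stop_reason = reason
        self.log.append(note)

    def needs_human(self) -> bool:
        """True if the step waits for input or stopped for any reason but success."""
        if self.status is Status.WAITING_INPUT:
            return True
        return (
            self.status is Status.STOPPED
            and self.stop_reason is not StopReason.FINISHED
        )


def glance(cards: list[Card]) -> str:
    """One-line answer to 'walk away or step in?'."""
    stuck = [c.title for c in cards if c.needs_human()]
    return "step in: " + ", ".join(stuck) if stuck else "walk away"
```

The point of the sketch is that a stop reason is a required part of the record, so "it died" and "it finished" can never look the same on the board.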

The unsexy version of this insight is: agents don't need more autonomy. They need a better status bar.


Idea 2 — Real isolation. One workspace per (person, task)

A workspace is the agent's "user" — its own:

  • browser profile (cookies, login state, extensions)
  • file root
  • credentials / secrets vault
  • tool config

Two jobs by two people on the same machine should run in two workspaces. Two jobs by the same person, where one is "personal Twitter cleanup" and the other is "company OKR draft," should also run in two workspaces. You can share the machine without sharing the identity.
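
As a rough sketch of "one workspace per (person, task)", the snippet below derives a separate directory tree from that pair and refuses file paths that escape it. The directory layout, the `Workspace` class, and the slug rule are assumptions for illustration only; real isolation would also need the sandbox and browser layers, which a path check alone does not provide.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from pathlib import Path


def _slug(text: str) -> str:
    """Lowercase and hyphenate a name so it is safe as a directory name."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


@dataclass(frozen=True)
class Workspace:
    base: Path
    person: str
    task: str

    @property
    def root(self) -> Path:
        return self.base / _slug(self.person) / _slug(self.task)

    @property
    def browser_profile(self) -> Path:
        # Separate cookies and login state per workspace (hypothetical layout).
        return self.root / "browser-profile"

    @property
    def files(self) -> Path:
        return self.root / "files"

    def resolve(self, relative: str) -> Path:
        """Resolve a path inside the file root; refuse anything that escapes it."""
        files_root = self.files.resolve()
        target = (files_root / relative).resolve()
        if not target.is_relative_to(files_root):  # Python 3.9+
            raise PermissionError(f"{relative!r} escapes workspace {self.root}")
        return target
```

Here "personal Twitter cleanup" and "company OKR draft" land in different trees even for the same person, and a worker asking for `../../someone-else/...` gets a `PermissionError` instead of a file.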

This is unglamorous infrastructure work. It's also why I think the right primitive for an agent platform isn't "session" — it's tenant. Multi-tenancy as a design assumption from day one, not a Pro-tier feature bolted on later.

Sandbox + workspace per tenant: each tenant gets an isolated microsandbox that mounts only its own workspace tree, with workers collaborating through one shared filesystem.

Nothing crosses workspace boundaries by default. Cookies don't leak. Files don't leak. Identity doesn't leak.

Multi-tenancy partitioned across five layers: DB rows, filesystem, sandbox, browser sidecar, and secrets vault — every layer carries tenant_id from day 1.
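
For the DB-rows layer, "every layer carries tenant_id" could be sketched as a store object that is bound to one tenant and adds the filter to every query itself, so calling code cannot forget it. The table, the `TenantStore` class, and the use of SQLite are assumptions made for this example, not the project's schema.

```python
from __future__ import annotations

import sqlite3


def open_db() -> sqlite3.Connection:
    """Create an in-memory database where every row carries a tenant_id."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE notes ("
        "id INTEGER PRIMARY KEY, tenant_id TEXT NOT NULL, body TEXT NOT NULL)"
    )
    return conn


class TenantStore:
    """Hypothetical wrapper: every statement is pinned to one tenant_id."""

    def __init__(self, conn: sqlite3.Connection, tenant_id: str) -> None:
        self._conn = conn
        self._tenant_id = tenant_id

    def add_note(self, body: str) -> None:
        self._conn.execute(
            "INSERT INTO notes (tenant_id, body) VALUES (?, ?)",
            (self._tenant_id, body),
        )

    def notes(self) -> list[str]:
        rows = self._conn.execute(
            "SELECT body FROM notes WHERE tenant_id = ? ORDER BY id",
            (self._tenant_id,),
        )
        return [body for (body,) in rows]
```

Two stores over the same connection see only their own rows, which is the property the other four layers would need to mirror in their own terms.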


Idea 3 — Continuity across devices

You should be able to:

  • send the agent a line from your phone on the bus,
  • pick it up from your laptop at the desk,
  • ask one more follow-up from a different chat tool entirely,

…and the agent should keep up. Same plan, same workspace, same memory. The chat surface (Telegram, WhatsApp, the project's own web UI, an iMessage thread, a hardware button on a desk gadget) is just a channel — it's not where the agent lives.

That's a stronger claim than "we have a mobile app." It means the agent identity is portable across surfaces, not duplicated per surface. Channels are pluggable; the agent is one thing.

Channel abstraction: web, Slack, Telegram, Discord, and hardware all enter through the same ChannelAdapter interface; a Hub normalizes events and a Router scopes them into per-chat sessions, never merging across channels.
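
A minimal sketch of that shape, reusing the names from the diagram (`ChannelAdapter`, Hub, Router). The method names, the `Event` fields, and the raw payload shapes are all assumptions; in particular the Telegram-style payload below is made up and is not the real Telegram API schema.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Event:
    channel: str
    chat_id: str
    text: str


class ChannelAdapter(Protocol):
    name: str

    def normalize(self, raw: dict) -> Event: ...


class TelegramAdapter:
    name = "telegram"

    def normalize(self, raw: dict) -> Event:
        # Made-up payload shape for illustration.
        return Event(self.name, str(raw["chat"]), raw["msg"])


class WebAdapter:
    name = "web"

    def normalize(self, raw: dict) -> Event:
        return Event(self.name, str(raw["session"]), raw["body"])


class Hub:
    """Turns channel-specific payloads into one Event type."""

    def __init__(self, adapters: list[ChannelAdapter]) -> None:
        self._adapters = {a.name: a for a in adapters}

    def receive(self, channel: str, raw: dict) -> Event:
        return self._adapters[channel].normalize(raw)


class Router:
    """Scopes events into sessions keyed by (channel, chat_id)."""

    def __init__(self) -> None:
        self.sessions: dict[tuple[str, str], list[str]] = {}

    def route(self, event: Event) -> tuple[str, str]:
        key = (event.channel, event.chat_id)
        self.sessions.setdefault(key, []).append(event.text)
        return key
```

Because the session key includes the channel, a Telegram chat and a web session with the same id never merge. How sessions attach to a shared plan and workspace is a separate layer that this sketch leaves out.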

The agent stays put. The channels come and go.


Naming

I'm calling it Tianshu (天枢).

Tianshu is the first star of the Big Dipper — the one that decides where the whole constellation points. The orchestrator. You give it the direction, and it brings workers along to do the actual job.

The workers, eventually, are going to get names too — drawn from Chinese craftsman gods rather than the usual "Worker-1, Worker-2." Lu Ban (鲁班) the master builder for code-generation work. Nüwa (女娲) the creator for synthesis. Xihe (羲和) the sun-charioteer for scheduled / time-bound jobs. There's a small mythology behind it, and I'll write that one up properly later — partly because it's fun, partly because every single AI-agent product I see in English is named like a SaaS startup, and I want this one to feel different.

Open source. Self-hostable. The default is your machine, not someone else's cloud.


What's actually built

Honestly: not much yet, by design.

  • The architecture is being written down as RFCs (this post is the first half of RFC-001 — the "why" half).
  • v0.1 demo target is the plan board + workspace isolation — the bare minimum to demonstrate Pains #1 and #3 are fixable.
  • Code drops will follow each RFC, not lead them.

For the architecturally-curious, here's the one-page v0.1 sketch — channel layer, planner / dispatcher / aggregator main agent, sandboxed workers, tenant-scoped storage:

Tianshu v0.1 architecture: the channel layer feeds the main agent (Planner / Dispatcher / Aggregator), which spawns workers in per-tenant sandboxes; workspace, browser sidecar, and secrets vault are all scoped to the tenant.
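
The Planner / Dispatcher / Aggregator split can be pictured with a toy pipeline. This is only a sketch under heavy assumptions: the function names are hypothetical, the "planner" just splits a string, and a thread pool stands in for workers that would really run in per-tenant sandboxes.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def plan(task: str) -> list[str]:
    """Planner stand-in: split a task description into steps on ';'."""
    return [s.strip() for s in task.split(";") if s.strip()]


def dispatch(steps: list[str], worker: Callable[[str], str]) -> list[str]:
    """Dispatcher stand-in: one worker call per step, results in plan order."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(worker, steps))


def aggregate(steps: list[str], results: list[str]) -> str:
    """Aggregator stand-in: pair each step with its result in one report."""
    return "\n".join(f"{s}: {r}" for s, r in zip(steps, results))
```

The useful property even in the toy version is that the plan exists as data before any worker runs, which is what a plan board needs in order to show steps that have not started yet.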

I'd rather show up with one running pixel than with a glossy landing page and no repo. So this post lives where I am right now: somewhere between "I have opinions" and "I have a binary."


What I want from you

Three real questions, not rhetorical:

1. Have you hit any of those three pains? Which one was worst? I want anti-patterns, not validation.

2. What did I miss? Pain #4. The hole I haven't noticed yet. Especially around eval, observability, or "things go fine in dev, weird in prod" stories.

3. Is there a tool out there that already gets one of these right? I'm not looking for a list of every agent framework — I'm looking for the one piece somebody nailed that I should just copy from instead of reinventing.

Reply, file an issue, DM me, email — whatever works. The goal of this post is to be wrong in a useful way before I write much more code.


Subscribe to the build

Devlog every week. Once a demo runs, a video. If those three pains sound like your week — see you in the issues.

I'm Tianshu. See you next week.
