<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tianshu AI</title>
    <description>The latest articles on DEV Community by Tianshu AI (@tianshu_ai).</description>
    <link>https://dev.to/tianshu_ai</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3960037%2Ff266cfbf-3e12-4130-8639-b8b1c21d77bf.png</url>
      <title>DEV Community: Tianshu AI</title>
      <link>https://dev.to/tianshu_ai</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tianshu_ai"/>
    <language>en</language>
    <item>
      <title>Three things AI agents keep getting wrong (and why I'm rebuilding the platform from scratch)</title>
      <dc:creator>Tianshu AI</dc:creator>
      <pubDate>Tue, 02 Jun 2026 14:44:47 +0000</pubDate>
      <link>https://dev.to/tianshu_ai/three-things-ai-agents-keep-getting-wrong-and-why-im-rebuilding-the-platform-from-scratch-42p6</link>
      <guid>https://dev.to/tianshu_ai/three-things-ai-agents-keep-getting-wrong-and-why-im-rebuilding-the-platform-from-scratch-42p6</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR.&lt;/strong&gt; Today's AI agents can call tools. The scaffolding for actually &lt;em&gt;running a job&lt;/em&gt; — visible progress, real isolation, continuity across devices — is still missing. After hitting the same three pain points enough times, I'm rebuilding the agent platform from scratch. Open source. Self-hostable. Pre-alpha. Build in public, every week. &lt;strong&gt;Star &lt;a href="https://github.com/tianshu-ai" rel="noopener noreferrer"&gt;github.com/tianshu-ai&lt;/a&gt;&lt;/strong&gt; if you want the v0.1 demo when it lands — or scroll down and tell me which of the three pains hit you the hardest.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The 3-minute video version
&lt;/h2&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/Xw7c3JrlUVo"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;If you'd rather read, the rest of the post says the same thing — with a few more concrete examples and three architecture sketches the video skips.&lt;/p&gt;




&lt;h2&gt;
  
  
  I've been using AI agents for a while
&lt;/h2&gt;

&lt;p&gt;For a year or so I've been using a handful of agent-shaped tools — to do research, run code, automate boring stuff. Different products, similar shape: a chat box, a tool-using model behind it, sometimes a sandbox.&lt;/p&gt;

&lt;p&gt;They work. Until you actually try to &lt;em&gt;put one to work&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Three things keep showing up. They aren't the same bug. But every time, I get the same uneasy feeling — &lt;em&gt;this is supposed to be the easy part, and it isn't&lt;/em&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Pain #1 — You think it's running. It's actually waiting for you to talk
&lt;/h2&gt;

&lt;p&gt;You hand the agent a task. You walk away to make coffee, or take a meeting.&lt;/p&gt;

&lt;p&gt;You come back, and it's stuck on a question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Should I use &lt;strong&gt;npm&lt;/strong&gt; or &lt;strong&gt;pnpm&lt;/strong&gt;?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So much for "automatic." I'm just babysitting a robot.&lt;/p&gt;

&lt;p&gt;The deeper problem isn't the question itself. It's that the agent has no idea which decisions you actually want to be asked about and which ones it should settle with a default and move on. Every CLI flag, every package manager, every "do you want to enable telemetry?" prompt becomes a stop. The agent is &lt;em&gt;technically&lt;/em&gt; running. In practice, it's been waiting for you for forty minutes.&lt;/p&gt;

&lt;p&gt;If a junior engineer Slack-pinged me every time they had to choose between npm and pnpm, I would not call that "an agent." I would call it "an interview I am running."&lt;/p&gt;




&lt;h2&gt;
  
  
  Pain #2 — You think it finished. It actually crashed half an hour ago
&lt;/h2&gt;

&lt;p&gt;You give it a research task. Twenty open tabs, summarize, dump in a doc. You go to a meeting.&lt;/p&gt;

&lt;p&gt;You come back. Silent screen.&lt;/p&gt;

&lt;p&gt;Context overflowed. Or the model timed out. Or some tool errored. Task killed mid-way. &lt;strong&gt;No notice. No trail.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You have to reverse-engineer where it got to: open the half-written doc, scroll the chat, guess which subtasks finished, then write a new prompt to drag it back on track. Sometimes it's faster to start over than to figure out what happened. &lt;em&gt;Sometimes&lt;/em&gt; — be honest — you only realize it died because you noticed the laptop fan stopped.&lt;/p&gt;

&lt;p&gt;The thing that bothers me here is the &lt;em&gt;asymmetry&lt;/em&gt;. Modern agents have plenty of internal state — chain-of-thought, tool call traces, intermediate scratchpads. &lt;strong&gt;None of that is exposed in a way that survives a crash.&lt;/strong&gt; When it dies, you get nothing. When it finishes, you get the answer. When it half-finishes? You get a vibe.&lt;/p&gt;

&lt;p&gt;A human contractor who disappeared mid-job and didn't text you would lose the contract. We somehow accept this from the agents we pay for.&lt;/p&gt;




&lt;h2&gt;
  
  
  Pain #3 — Your task. Someone else's account
&lt;/h2&gt;

&lt;p&gt;This one is a category most "personal AI assistant" products quietly skip.&lt;/p&gt;

&lt;p&gt;A lot of agent products are built &lt;strong&gt;single-user&lt;/strong&gt;: one machine, one human, one set of cookies. That's a fine assumption if your only target is the solo MacBook owner.&lt;/p&gt;

&lt;p&gt;But — the family laptop. The shared workstation in the office. The team's agent that everyone in the channel pokes at. &lt;strong&gt;Those get used by two or three people back-to-back.&lt;/strong&gt; The agent fires up a browser, the browser still has the previous user's session open, and now your "summarize this thread" task is reading email as somebody else.&lt;/p&gt;

&lt;p&gt;Or worse: the agent &lt;em&gt;acts&lt;/em&gt;. Likes a post. Replies to a DM. Submits a form. Under the wrong identity.&lt;/p&gt;

&lt;p&gt;Next morning, somebody at the office asks: &lt;em&gt;"Hey — were you on Slack at 1 a.m.?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This isn't a hypothetical. It's an account-isolation bug, and once you know it exists, you stop wanting to share an "agent" with anybody.&lt;/p&gt;




&lt;h2&gt;
  
  
  So… what's the actual hole?
&lt;/h2&gt;

&lt;p&gt;These three aren't the same bug. But they're missing the same thing.&lt;/p&gt;

&lt;p&gt;Today's agents can &lt;strong&gt;call tools&lt;/strong&gt;. That's the part LLM products got good at. The thing that's still missing is the &lt;strong&gt;scaffolding for actually running a job&lt;/strong&gt; — the layer that lives between "the model can call this tool" and "a human can leave the room and trust this thing."&lt;/p&gt;

&lt;p&gt;I've hit those three pains enough times that the obvious patch — "just prompt better" or "use a different agent" — stopped feeling like a real fix. So I'm rebuilding the platform from scratch.&lt;/p&gt;

&lt;p&gt;The scaffolding, I think, is three things.&lt;/p&gt;




&lt;h2&gt;
  
  
  Idea 1 — Make progress visible
&lt;/h2&gt;

&lt;p&gt;A board. Not a chat log. A first-class &lt;strong&gt;plan&lt;/strong&gt; view that tells you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;where the agent is (which step, which subtask)&lt;/li&gt;
&lt;li&gt;where it's stuck (which input it's waiting for, which tool failed)&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;why&lt;/em&gt; it stopped (model finished? error? user cancel? context overflow?)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Imagine a Kanban-shaped surface where every running agent is a column, every step is a card, and the card stays around with its log even after the agent dies. You should be able to &lt;strong&gt;glance&lt;/strong&gt; and know whether to walk away or step in. You shouldn't have to guess.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2vmlwh6zm0pyidgblk99.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2vmlwh6zm0pyidgblk99.png" alt="Plan + workers on a Kanban board: each agent step is a card with status, assignee, and result. Stuck steps stay visible instead of dying silently." width="800" height="738"&gt;&lt;/a&gt;&lt;/p&gt;
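
&lt;p&gt;To make the board idea concrete, here is a minimal sketch of the data behind it. Everything in it is an assumption for illustration, not Tianshu's actual code: the class names (StopReason, Card, Board), the NEEDS_HUMAN policy, and the sample steps are all hypothetical. It shows one column per agent, one card per step, and a card that keeps its stop reason and log after the step ends.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class StopReason(Enum):
    """Why a step stopped. The names are illustrative, not a real schema."""
    FINISHED = "finished"
    ERROR = "error"
    USER_CANCEL = "user_cancel"
    CONTEXT_OVERFLOW = "context_overflow"
    WAITING_FOR_INPUT = "waiting_for_input"


# Hypothetical policy: these reasons mean a human should step in.
NEEDS_HUMAN = {
    StopReason.ERROR,
    StopReason.CONTEXT_OVERFLOW,
    StopReason.WAITING_FOR_INPUT,
}


@dataclass
class Card:
    """One step of a plan. The log stays with the card after the step ends."""
    title: str
    stop_reason: Optional[StopReason] = None  # None means still running
    log: list[str] = field(default_factory=list)

    def stop(self, reason: StopReason, note: str) -> None:
        self.stop_reason = reason
        self.log.append(note)


@dataclass
class Board:
    """One column per agent, one card per step."""
    columns: dict[str, list[Card]] = field(default_factory=dict)

    def add(self, agent: str, title: str) -> Card:
        card = Card(title)
        self.columns.setdefault(agent, []).append(card)
        return card

    def needs_attention(self) -> list[tuple[str, Card]]:
        """The glance view: every card a human should look at."""
        return [
            (agent, card)
            for agent, cards in self.columns.items()
            for card in cards
            if card.stop_reason in NEEDS_HUMAN
        ]


board = Board()
board.add("researcher", "open sources").stop(StopReason.FINISHED, "20 tabs read")
board.add("researcher", "summarize").stop(StopReason.CONTEXT_OVERFLOW, "died at tab 14")
board.add("builder", "install deps").stop(StopReason.WAITING_FOR_INPUT, "npm or pnpm?")

for agent, card in board.needs_attention():
    print(f"{agent}: {card.title} [{card.stop_reason.value}] {card.log[-1]}")
```

&lt;p&gt;The design choice worth noticing is that a stop reason is data on the card rather than a line buried in a chat log, so "walk away or step in" becomes a query instead of a guess.&lt;/p&gt;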

&lt;p&gt;The unsexy version of this insight is: agents don't need more autonomy. They need a better status bar.&lt;/p&gt;




&lt;h2&gt;
  
  
  Idea 2 — Real isolation. One workspace per (person, task)
&lt;/h2&gt;

&lt;p&gt;A workspace is the agent's "user" — its own:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;browser profile (cookies, login state, extensions)&lt;/li&gt;
&lt;li&gt;file root&lt;/li&gt;
&lt;li&gt;credentials / secrets vault&lt;/li&gt;
&lt;li&gt;tool config&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Two jobs by two people on the same machine should run in two workspaces. Two jobs by &lt;em&gt;the same&lt;/em&gt; person, where one is "personal Twitter cleanup" and the other is "company OKR draft," should also run in two workspaces. &lt;strong&gt;You can share the machine without sharing the identity.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is unglamorous infrastructure work. It's also why I think the right primitive for an agent platform isn't "session" — it's &lt;strong&gt;tenant&lt;/strong&gt;. Multi-tenancy as a design assumption from day one, not a Pro-tier feature bolted on later.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsyi7xy46466c79bd0mi8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsyi7xy46466c79bd0mi8.png" alt="Sandbox + workspace per tenant: each tenant gets an isolated microsandbox that mounts only its own workspace tree, with workers collaborating through one shared filesystem." width="800" height="719"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Nothing crosses workspace boundaries by default. Cookies don't leak. Files don't leak. Identity doesn't leak.&lt;/p&gt;
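
&lt;p&gt;A rough sketch of what "one workspace per (person, task)" could look like at the filesystem level. This is an assumption-heavy illustration, not the project's implementation: the Workspace class, the open_workspace helper, and the directory names are hypothetical, and a directory layout alone is not a sandbox. It only shows the shape of the idea: separate roots, and a path check that refuses to leave them.&lt;/p&gt;

```python
import re
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Workspace:
    """Everything an agent may touch for one (person, task) pair."""
    root: Path

    @property
    def browser_profile(self) -> Path:
        return self.root / "browser-profile"

    @property
    def files(self) -> Path:
        return self.root / "files"

    @property
    def secrets(self) -> Path:
        return self.root / "secrets"

    @property
    def tool_config(self) -> Path:
        return self.root / "tool-config"

    def resolve(self, relative: str) -> Path:
        """Map a path requested by the agent into the file root, refusing escapes."""
        base = self.files.resolve()
        target = (base / relative).resolve()
        if not target.is_relative_to(base):
            raise PermissionError(f"{relative!r} leaves the workspace")
        return target


def open_workspace(base: Path, person: str, task: str) -> Workspace:
    """Create (or reopen) the workspace for one person and one task."""
    for name in (person, task):
        # Keep names to a safe alphabet so they cannot smuggle in path separators.
        if not re.fullmatch(r"[A-Za-z0-9_-]+", name):
            raise ValueError(f"unsafe workspace name: {name!r}")
    ws = Workspace(base / person / task)
    for directory in (ws.browser_profile, ws.files, ws.secrets, ws.tool_config):
        directory.mkdir(parents=True, exist_ok=True)
    return ws


base = Path(tempfile.mkdtemp())
alice_personal = open_workspace(base, "alice", "twitter-cleanup")
alice_work = open_workspace(base, "alice", "okr-draft")
bob_work = open_workspace(base, "bob", "okr-draft")

# Same machine, three distinct browser profiles.
print(alice_personal.browser_profile)
print(alice_work.browser_profile)
print(bob_work.browser_profile)
```

&lt;p&gt;A real platform would pass browser_profile to the browser it launches and enforce the boundary with a sandbox rather than a path check, but the key stays the same: the pair (person, task), not the machine.&lt;/p&gt;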

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1syvybw4vl018zr92uzj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1syvybw4vl018zr92uzj.png" alt="Multi-tenancy partitioned across five layers: DB rows, filesystem, sandbox, browser sidecar, and secrets vault — every layer carries tenant_id from day 1." width="800" height="758"&gt;&lt;/a&gt;&lt;/p&gt;





&lt;h2&gt;
  
  
  Idea 3 — Continuity across devices
&lt;/h2&gt;

&lt;p&gt;You should be able to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;send the agent a line from your phone on the bus,&lt;/li&gt;
&lt;li&gt;pick it up from your laptop at the desk,&lt;/li&gt;
&lt;li&gt;ask one more follow-up from a different chat tool entirely,&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…and the agent should &lt;em&gt;keep up&lt;/em&gt;. Same plan, same workspace, same memory. The chat surface (Telegram, WhatsApp, the project's own web UI, an iMessage thread, a hardware button on a desk gadget) is just a &lt;strong&gt;channel&lt;/strong&gt; — it's not where the agent lives.&lt;/p&gt;

&lt;p&gt;That's a stronger claim than "we have a mobile app." It means the agent identity is portable across surfaces, not duplicated per surface. Channels are pluggable; the agent is one thing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8rakcn2kwpblto9ddd8c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8rakcn2kwpblto9ddd8c.png" alt="Channel abstraction: web, Slack, Telegram, Discord, and hardware all enter through the same ChannelAdapter interface; a Hub normalizes events and a Router scopes them into per-chat sessions, never merging across channels." width="800" height="742"&gt;&lt;/a&gt;&lt;/p&gt;
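
&lt;p&gt;Here is a small sketch of the "channel is not where the agent lives" idea. The names ChannelAdapter and Hub are borrowed from the diagram above; everything else is an assumption made for illustration, including the Event fields, the payload shapes (which are invented and are not the real Telegram API), and the choice to key the agent by user id.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Event:
    """A channel-neutral message. Field names are illustrative."""
    user_id: str
    channel: str
    text: str


class ChannelAdapter(Protocol):
    """Anything that can turn a raw channel payload into an Event."""
    name: str

    def to_event(self, raw: dict) -> Event: ...


class TelegramLike:
    name = "telegram"

    def to_event(self, raw: dict) -> Event:
        # Invented payload shape, not the real Telegram API.
        return Event(raw["from"], self.name, raw["text"])


class WebUI:
    name = "web"

    def to_event(self, raw: dict) -> Event:
        return Event(raw["user"], self.name, raw["body"])


@dataclass
class Agent:
    """One agent per user. Its memory does not depend on the channel."""
    memory: list[Event] = field(default_factory=list)

    def handle(self, event: Event) -> str:
        self.memory.append(event)
        return f"step {len(self.memory)} noted via {event.channel}"


class Hub:
    """Normalizes incoming payloads and hands them to the user's one agent."""

    def __init__(self) -> None:
        self.agents: dict[str, Agent] = {}

    def receive(self, adapter: ChannelAdapter, raw: dict) -> str:
        event = adapter.to_event(raw)
        agent = self.agents.setdefault(event.user_id, Agent())
        return agent.handle(event)


hub = Hub()
print(hub.receive(TelegramLike(), {"from": "alice", "text": "start the OKR draft"}))
print(hub.receive(WebUI(), {"user": "alice", "body": "add last quarter's numbers"}))
```

&lt;p&gt;The second message arrives through a different channel but lands in the same Agent, so it is counted as step 2 rather than starting over. Adding a channel means writing one more adapter; the agent itself does not change.&lt;/p&gt;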

&lt;p&gt;The agent stays put. The channels come and go.&lt;/p&gt;




&lt;h2&gt;
  
  
  Naming
&lt;/h2&gt;

&lt;p&gt;I'm calling it &lt;strong&gt;Tianshu (天枢)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Tianshu is the first star of the Big Dipper — the one that decides where the whole constellation points. The orchestrator. You give it the direction, and it brings &lt;em&gt;workers&lt;/em&gt; along to do the actual job.&lt;/p&gt;

&lt;p&gt;The workers, eventually, are going to get names too — drawn from Chinese craftsman gods rather than the usual "Worker-1, Worker-2." &lt;strong&gt;Lu Ban&lt;/strong&gt; (鲁班) the master builder for code-generation work. &lt;strong&gt;Nüwa&lt;/strong&gt; (女娲) the creator for synthesis. &lt;strong&gt;Xihe&lt;/strong&gt; (羲和) the sun-charioteer for scheduled / time-bound jobs. There's a small mythology behind it, and I'll write that one up properly later — partly because it's fun, partly because every single AI-agent product I see in English is named like a SaaS startup, and I want this one to feel different.&lt;/p&gt;

&lt;p&gt;Open source. Self-hostable. The default is &lt;em&gt;your&lt;/em&gt; machine, not someone else's cloud.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's actually built
&lt;/h2&gt;

&lt;p&gt;Honestly: not much yet, by design.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The architecture is being written down as RFCs (this post is the first half of RFC-001 — the "why" half).&lt;/li&gt;
&lt;li&gt;v0.1 demo target is the &lt;strong&gt;plan board + workspace isolation&lt;/strong&gt; — the bare minimum to demonstrate Pains #1 and #3 are fixable.&lt;/li&gt;
&lt;li&gt;Code drops will follow each RFC, not lead them.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For the architecturally curious, here's the one-page v0.1 sketch, covering the channel layer, the planner / dispatcher / aggregator main agent, sandboxed workers, and tenant-scoped storage:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbygi5mv74m31shwe9v9f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbygi5mv74m31shwe9v9f.png" alt="Tianshu v0.1 architecture: the channel layer feeds the main agent (Planner / Dispatcher / Aggregator), which spawns workers in per-tenant sandboxes; workspace, browser sidecar, and secrets vault are all scoped to the tenant." width="800" height="681"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I'd rather show up with one running pixel than with a glossy landing page and no repo. So this post lives where I am right now: somewhere between "I have opinions" and "I have a binary."&lt;/p&gt;




&lt;h2&gt;
  
  
  What I want from you
&lt;/h2&gt;

&lt;p&gt;Three real questions, not rhetorical:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Have you hit any of those three pains?&lt;/strong&gt; Which one was worst? I want anti-patterns, not validation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. What did I miss?&lt;/strong&gt; Pain #4. The hole I haven't noticed yet. Especially around eval, observability, or "things go fine in dev, weird in prod" stories.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Is there a tool out there that already gets one of these right?&lt;/strong&gt; I'm not looking for a list of every agent framework — I'm looking for the &lt;em&gt;one piece&lt;/em&gt; somebody nailed that I should just copy from instead of reinventing.&lt;/p&gt;

&lt;p&gt;Reply, file an issue, DM me, email — whatever works. The goal of this post is to be wrong in a useful way before I write much more code.&lt;/p&gt;




&lt;h2&gt;
  
  
  Subscribe to the build
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/tianshu-ai" rel="noopener noreferrer"&gt;github.com/tianshu-ai&lt;/a&gt; — star it to be the first to see v0.1 drop.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;X&lt;/strong&gt;: &lt;a href="https://x.com/tianshuAIdev" rel="noopener noreferrer"&gt;@tianshuAIdev&lt;/a&gt; — short build-in-public updates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;YouTube&lt;/strong&gt;: &lt;a href="https://www.youtube.com/@Tianshu-AI" rel="noopener noreferrer"&gt;@Tianshu-AI&lt;/a&gt; — the 3-minute version of this post lives here.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Devlog every week. Once a demo runs, a video. If those three pains sound like &lt;em&gt;your&lt;/em&gt; week — see you in the issues.&lt;/p&gt;

&lt;p&gt;I'm Tianshu. See you next week.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>opensource</category>
      <category>buildinpublic</category>
    </item>
  </channel>
</rss>
