If you are using AI-assisted programming for anything bigger than a quick script, you have probably seen the same pattern: the agent starts strong, then a few sessions later it forgets what mattered, rewrites working code, marks things “done” without testing, or leaves you with a half-finished feature and no breadcrumbs.
That is not a model problem as much as it is a harness problem. Long-running work is inherently shift work. Each new session begins with partial context, and most real projects cannot fit into a single context window. The fix is to stop treating your agent like a one-shot code generator and start treating it like an engineer joining a codebase mid-sprint, with a clear backlog, a reproducible environment, and a definition of done.
In our experience building infrastructure for teams that ship fast, the most reliable setup is a two-part harness: an initializer run that prepares the repo for many sessions, and a repeatable coding loop that makes incremental progress while leaving clean artifacts.
To make this concrete, we will also map the harness to a real backend so you can prototype end-to-end without babysitting infrastructure. That is where SashiDo - Backend for Modern Builders fits naturally, because we give you Database, APIs, Auth, Storage, Realtime, Jobs, Functions, and Push in minutes, which lets the agent focus on product work instead of DevOps.
Why long-running agents fail in the real world
Most “ai for coding” workflows break down in two predictable ways.
First, the agent tries to do too much in one pass. It will start implementing multiple features, change shared abstractions, and then run out of context mid-flight. The next session wakes up, sees an inconsistent codebase, and spends most of its budget re-deriving what happened instead of moving forward. The worst part is that you often do not notice the damage until later because the work fails silently.
Second, once a project has some visible progress, the agent starts declaring victory. It sees a UI, some endpoints, a few tests. Then it assumes you are done, even if edge cases, auth flows, billing, or background jobs are missing. This is especially common with “programming ai” setups that do not have an explicit, testable feature inventory.
The throughline is simple: without a stable shared memory and a stable definition of done, each session is forced to guess. Your harness needs to remove guessing.
The harness pattern: initializer session plus incremental coding sessions
A practical long-horizon setup splits responsibilities.
The initializer session exists once, at the very beginning. Its job is not to build features. Its job is to create a working environment and durable artifacts that every future session can trust. Think of it as setting up the project the way a senior engineer would. You want a runnable dev environment, a clear feature list, a place to log progress, and a repo state you can always roll back to.
Every subsequent session is a coding session. Its job is narrow: pick one feature, implement it, prove it works, record what changed, and commit the result. If you enforce this rhythm, you fix both failure modes at once. You prevent the agent from one-shotting the whole app, and you prevent it from “calling it done” based on vibes.
If you are building with “ai dev tools” that can automate terminal and browser actions, this pattern gets even stronger because you can require end-to-end checks, not just unit tests.
The three artifacts that make sessions resumable
Long-running work succeeds when a fresh session can answer three questions quickly: What is the goal? What is the current state? What should I do next?
We rely on three lightweight artifacts to answer those questions with minimal token waste.
1) A feature list that the agent cannot hand-wave away
The feature list is your guardrail against premature “done.” It should be structured, test-oriented, and easy to update without rewriting history.
A practical approach is a JSON file (many teams name it feature_list.json) where each feature entry includes fields like category, description, user-visible steps, and a boolean such as passes set to false by default. The key is that coding sessions are only allowed to flip passes from false to true after verification. They do not rewrite the description or delete items just because implementation is hard.
This is also the point where you force the agent to stop being a code writer ai and become a product engineer. The description and steps should read like what a human would do in the app, not like internal implementation notes.
2) A progress file that summarizes “what happened” across sessions
A short, append-only progress log (often a plain text file such as claude-progress.txt) is the fastest way to rehydrate context. It should capture what feature was attempted, what files changed, how it was tested, what is still broken, and what the next session should do first.
Keep it boring. You want a new session to skim 20 lines and immediately know where to start. The progress file is not documentation. It is shift notes.
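A sketch of what "shift notes" can look like in practice, assuming the `claude-progress.txt` file name from the article; the note fields mirror the ones listed above, and the sample values are hypothetical.

```python
from datetime import datetime, timezone
from pathlib import Path

LOG = Path("claude-progress.txt")  # file name taken from the article

def append_shift_note(feature: str, files: list[str], tested: str,
                      broken: str, next_step: str) -> None:
    """Append a short, skimmable shift note; never rewrite earlier entries."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    note = (
        f"--- {stamp} ---\n"
        f"Feature: {feature}\n"
        f"Files changed: {', '.join(files)}\n"
        f"Tested: {tested}\n"
        f"Still broken: {broken}\n"
        f"Next session: {next_step}\n"
    )
    with LOG.open("a") as f:  # append-only by construction
        f.write(note)

append_shift_note(
    feature="User can sign up with email",
    files=["src/auth.ts", "src/routes.ts"],
    tested="signup flow passes locally end-to-end",
    broken="password reset email not wired up",
    next_step="run init.sh, then start on password reset",
)
```

Opening the file in append mode is the whole trick: earlier shifts stay untouched, and a new session only has to skim the last entry.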
3) An init script that makes the environment reproducible
Your init script (commonly init.sh) is the antidote to “it works on my machine.” It should do the minimum to start the project, run migrations or seed data, and kick off a basic smoke test.
If you follow the spirit of the Twelve-Factor App approach, the script should rely on environment configuration and keep the app process model simple. That makes it easier for an agent to run, and easier for you to run in CI later.
This is also where a real backend platform helps. When the backend is already provisioned, the init path is short. You are not asking the agent to install, configure, and secure a database server. You are asking it to connect to a backend that already exists.
AI-assisted programming across context windows: the incremental loop
Once the initializer artifacts exist, every coding session should follow the same loop. The loop is intentionally repetitive because repetition is what makes sessions resumable.
Start by grounding yourself in the repo state. Read the progress log, scan recent commits, and confirm the feature list still matches the product goal. Then run init.sh and execute a smoke test before touching code. This catches “broken baseline” problems early, before you pile new changes on top.
Only then do you pick a single failing feature. Implement it in the smallest change set you can. Test it like a user. Update the feature list to mark it passing. Add a concise progress note. Commit.
That last part matters. A git commit is not just version control. It is your rollback and your memory boundary. The official git commit documentation is not glamorous, but the discipline it enables is exactly what long-running agent work needs. If a session goes off the rails, you can reset to the last known good commit without debate.
A small session checklist you can reuse
Keep this near your prompt template and near your repo README. Short, boring, consistent.
- Confirm you are in the expected directory and repo.
- Read the progress log, then read the last 10 to 20 commits.
- Run init.sh, then run the smoke test before making changes.
- Choose exactly one feature whose passes flag is false.
- Implement the change, then test end-to-end.
- Update passes to true only after testing, write a progress note, and commit with a descriptive message.
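The checklist above can be sketched as glue code. This is a minimal illustration, not a real harness: the script names `init.sh` and `smoke_test.sh` follow the article, while the function names and the stub scripts are assumptions.

```python
import subprocess
from pathlib import Path

def run(cmd: list[str]) -> bool:
    """Run one step and report success; a real harness would log output too."""
    return subprocess.run(cmd, capture_output=True).returncode == 0

def pick_and_plan(features: list[dict]) -> str:
    # Baseline first: a broken smoke test beats any new feature work.
    if not run(["bash", "init.sh"]) or not run(["bash", "smoke_test.sh"]):
        return "fix baseline"
    # Exactly one failing feature per session.
    target = next((f for f in features if not f["passes"]), None)
    if target is None:
        return "all features passing"
    # After implementing and verifying end-to-end, the session would flip
    # `passes`, write a progress note, and commit, e.g.
    # run(["git", "commit", "-am", "feat: ..."]).
    return f"implement: {target['description']}"

if __name__ == "__main__":
    # Stand-ins for your real scripts, so the sketch runs anywhere.
    Path("init.sh").write_text("exit 0\n")
    Path("smoke_test.sh").write_text("exit 0\n")
    print(pick_and_plan([{"description": "signup", "passes": False}]))
```

Note the ordering: the baseline check happens before any feature is even selected, which is what catches "broken baseline" problems early.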
This is the part most “ai that can code” demos skip. They show generation. They do not show continuity.
Testing: stop trusting green unit tests when the user flow is broken
Long-running agents have a predictable testing failure mode: they “verify” by running a linter, maybe a unit test, maybe a single API call, and then they claim completion. That is better than nothing, but it does not tell you if the feature works end-to-end.
For web apps, browser-driven testing is the fastest way to keep agents honest. If your harness can drive a browser, require it. Tools like Playwright make it straightforward to automate real user flows, including login, navigation, and form submission. You do not need a huge test suite. You need a reliable smoke test that proves the baseline works, plus a focused scenario test for the feature you just implemented.
When you connect this to your feature list, you get a powerful loop: the agent is not allowed to mark passes true until it can execute the corresponding steps in an automated or at least reproducible way.
There is a trade-off. Browser automation can be flaky, and vision limitations can hide UI problems. But in practice, it is still a net win because it catches regressions that a unit test never sees, like broken routing, missing auth headers, or a UI that never renders due to a runtime error.
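For real UI flows, Playwright is the right tool; to show just the gating shape without a browser, here is a stdlib-only sketch where a stub server stands in for your running app. The endpoint path and handler are assumptions for illustration.

```python
import http.server
import threading
import urllib.request

# Stand-in for your running app; in practice you would point Playwright
# (or a curl-style check) at the real frontend instead of this stub.
class StubApp(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        ok = self.path == "/health"
        self.send_response(200 if ok else 404)
        self.end_headers()
        self.wfile.write(b"ok" if ok else b"missing")

    def log_message(self, *args):  # keep request logging quiet
        pass

def smoke_test(base_url: str) -> bool:
    """Baseline check: the app must at least serve its health endpoint."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeout, or HTTP error
        return False

if __name__ == "__main__":
    server = http.server.HTTPServer(("127.0.0.1", 0), StubApp)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = f"http://127.0.0.1:{server.server_address[1]}"
    print(smoke_test(base))
    server.shutdown()
```

The same boolean is what gates the `passes` flip: no green smoke test, no "done".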
Recovery and safety: treat each session like a production change
Long-running agent loops can create subtle security and reliability issues because the agent is effectively committing code repeatedly. The guardrail is to adopt a lightweight secure development posture early.
A pragmatic reference is NIST’s Secure Software Development Framework (SSDF) SP 800-218. You do not need heavyweight compliance for a prototype, but the SSDF mindset translates cleanly into agent harness rules: make changes traceable, verify before release, and reduce the blast radius of mistakes.
In practice, that means keeping secrets out of the repo, storing them in environment variables, reviewing dependency changes, and ensuring the agent’s tests cover the flows that matter. It also means you should never let the agent “refactor everything” as a side quest. If you want a refactor, create a feature item for it and make it pass like everything else.
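As a taste of "keep secrets out of the repo", here is a rough pre-commit-style scan of a diff for hardcoded credentials. The patterns are deliberately crude and the sample diff is invented; dedicated scanners such as gitleaks or trufflehog go much further.

```python
import re

# Rough patterns only; real secret scanners cover far more cases.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),  # AWS access key id shape
    re.compile(r"-----BEGIN (RSA |EC )?PRIVATE KEY-----"),
    re.compile(r"(?i)(api[_-]?key|secret|token)\s*[:=]\s*['\"][^'\"]{12,}"),
]

def find_secrets(diff_text: str) -> list[str]:
    """Return added lines in a unified diff that look like hardcoded secrets."""
    hits = []
    for line in diff_text.splitlines():
        if line.startswith("+") and not line.startswith("+++"):
            if any(p.search(line) for p in SECRET_PATTERNS):
                hits.append(line)
    return hits

if __name__ == "__main__":
    diff = """\
+++ b/src/config.ts
+const apiKey = "sk_live_abcdef1234567890"
+const appUrl = process.env.APP_URL
"""
    print(find_secrets(diff))
    # → ['+const apiKey = "sk_live_abcdef1234567890"']
```

Run against each session's staged diff, a check like this turns "secrets stay in environment variables" from a guideline into an enforced rule.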
Mapping the harness to a real backend you can ship on
A harness is only half the battle. The other half is having real infrastructure to test against, without spending your limited founder time on setup and ops.
When we see solo founders and indie hackers attempt multi-session builds, the backend is usually where momentum dies. Auth gets bolted on late. Database migrations drift. File storage is hacked in via local folders. Push notifications are postponed indefinitely. Then the agent spends sessions trying to duct-tape infrastructure rather than shipping user-visible value.
This is why we built SashiDo - Backend for Modern Builders. For long-running agent work, the platform acts like a stable external system the agent can target consistently across sessions.
You can set up a project so your feature list includes backend-backed items from day one, like signup with social login, profile CRUD, file upload with CDN delivery, or realtime presence. In SashiDo, those map cleanly to things we provision by default: MongoDB with CRUD APIs, built-in User Management with social providers, S3-backed file storage with a CDN layer, Realtime over WebSockets, serverless Functions, scheduled Jobs, and push notifications.
The harness benefit is subtle but huge: your init script can consistently start the frontend and point it at the same backend app, and your tests can validate flows that are actually production-shaped.
If you want the quickest path, start with our SashiDo Docs and follow the flow in our Getting Started Guide. This keeps the “backend exists” part out of your agent’s context window, so it can spend tokens on the product.
A concrete way to structure your first few features
Instead of letting the agent invent architecture, tie features to backend capabilities you already have.
You might start with a thin vertical slice: user signup and login, a single main data object stored in the database, and a UI that lists and edits it. MongoDB’s data model and CRUD operations are well understood, and the MongoDB CRUD concepts documentation is a good canonical reference when you are sanity-checking queries and updates.
Then expand into “real app” capabilities that often get deferred: file uploads for user content, realtime updates for collaboration, and scheduled jobs for cleanup or recurring work. If you hit scaling limits, you can later scale compute using our Engines. Our write-up on SashiDo Engines explains when to move past the default and how the cost model works.
On pricing, keep your harness honest by linking to the live source of truth. We offer a 10-day free trial with no credit card, and our current plans are always listed on our pricing page.
Where this compares to other backends
If you are evaluating alternatives while building your harness, keep the comparison grounded in session continuity. Does the backend reduce setup steps in init.sh? Does it give the agent stable APIs and auth flows to test against?
If you are weighing us against Firebase or Supabase, we maintain direct comparisons that focus on practical trade-offs: SashiDo vs Firebase and SashiDo vs Supabase.
Trade-offs and when to add more specialized agents
The initializer plus coding loop works well because it is simple. But there are limits.
If your feature list gets large, you may want a separate “triage” step that periodically reorders priorities, merges duplicates, and clarifies acceptance criteria. Do that intentionally and rarely. Otherwise, you will churn the file more than you ship.
If testing becomes complex, a dedicated testing pass can help. But do not turn this into a multi-agent architecture because it sounds cool. Do it because your bottleneck is verification, not implementation.
And if you start seeing repeated regressions, tighten your definition of clean state. Make the smoke test mandatory. Require that each session leaves the app runnable. Require that the progress note includes how to reproduce and how to verify.
The goal is not to create bureaucracy. The goal is to make AI-assisted programming behave like a reliable teammate who can pick up work tomorrow without re-learning yesterday.
Conclusion: make your agent boring, and your velocity will get exciting
The biggest unlock in long-running agent work is not smarter prompts. It is a harness that forces continuity.
When you combine a one-time initializer session with durable artifacts, then enforce an incremental loop with testing and commits, your “ai for coding” workflow stops feeling like gambling. You stop losing sessions to confusion. You stop accumulating half-done work. You gain a predictable rhythm where every session either ships a feature or leaves a clear note about why it could not.
If you want to apply this pattern to something users can actually touch, anchor it to a real backend early. That is where SashiDo - Backend for Modern Builders helps. You get the database, auth, functions, jobs, storage, realtime, and push capabilities up front, which gives your harness a stable target and makes end-to-end tests meaningful.
Sources and further reading
- The Twelve-Factor App
- Git commit documentation
- Playwright documentation
- NIST SP 800-218 Secure Software Development Framework
- MongoDB CRUD operations concepts
If you are building a long-running agent harness and want a backend your agent can reliably test against across sessions, you can explore SashiDo’s platform at SashiDo - Backend for Modern Builders and start with a 10-day free trial to wire up Auth, Database, Functions, Realtime, Storage, Jobs, and Push without DevOps.