TLDR:
I'm a solo founder running 5 SaaS products with 0 employees
I built 8 AI agent "departments" using GitHub Copilot custom agents — CEO, CFO,...
Bro what did I just read?! 😂 Okay so as someone who's also building stuff solo (browser games) and constantly fighting with AI to do literally anything useful, this is absolutely WILD.
That Improver agent though... wait wait wait. You built an AI that improves your OTHER AIs? That's some straight up sci-fi inception stuff right there. I can barely get ChatGPT to write a proper function without hallucinating half the time 😅 Genuine question though - did it ever go completely off the rails? Like suggest something so stupid you had to just shut the whole thing down?
Also really curious about the whole "agents talk to each other" thing. Is it actually smooth or do they have like... disagreements? Would love to see even a rough sketch of how that knowledge graph works. Even a napkin drawing would make my day tbh.
AND FIVE PRODUCTS? On minimal infrastructure?! Brother I'm here struggling to ship ONE properly lmao. Massive respect fr.
If you ever do that technical deep dive or open source any of this, PLEASE tag me or something. I NEED to see how this works under the hood.
Honestly stuff like this is exactly why I love this community. Keep building man, you're living in 2030 while the rest of us are still in 2026.
Haha thanks man, appreciate the energy! 😄
To answer your question — yes, the Improver has gone off the rails. Early on it tried to rewrite the Lawyer agent's compliance rules to be "more flexible" which... no. That's exactly the kind of thing that should never be flexible. Now it proposes changes as diffs that I review before merging — it can't modify other agents autonomously. Hard boundaries on anything touching money, legal compliance, or auth.
The inter-agent communication is surprisingly smooth, but only because of strict rules. Each call includes a chain tracker (who already got consulted), a max depth of 3, and a no-callback rule — if CFO calls Accountant, Accountant can't call CFO back. Without those constraints it was chaos. When they "disagree" (e.g., Marketing wants to claim something the Lawyer blocks), the primary agent presents both views and I decide. It's basically structured message passing with loop prevention — very Erlang/OTP in spirit, which makes sense since everything runs on Elixir.
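In sketch form, the guardrail is tiny. This is illustrative pseudocode-made-runnable, not the actual agent runner; names and shapes are made up:

```python
# Sketch of the consultation guardrails described above: a chain
# tracker, a max depth of 3, and a no-callback rule.
MAX_DEPTH = 3

def can_consult(chain, target):
    """Decide whether the last agent in `chain` may consult `target`.

    chain: agents already in this call chain, e.g. ["COO", "CFO"].
    Returns (allowed, reason).
    """
    if target in chain:
        # No-callback rule: if CFO called Accountant, Accountant
        # cannot call CFO (or anyone upstream) back.
        return False, f"{target} already in chain (no callbacks)"
    if len(chain) >= MAX_DEPTH:
        return False, "max call depth reached"
    return True, "ok"

# CFO -> Accountant is allowed; Accountant calling CFO back is not.
ok, reason = can_consult(["COO", "CFO"], "Accountant")
```

The order of the checks matters only for the reason string; either rule alone is enough to stop the loops.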
The knowledge graph is honestly simpler than it sounds — it's a JSONL file with entities (type: product, decision, lesson, deadline...) and relations between them (owns, uses, depends-on). Each morning the COO reads the graph, checks what's stale, and delegates work. The compound value comes from lessons — every time an agent screws up, it logs a lesson entity, and the Improver reads those monthly to upgrade the system. The mistakes make it smarter over time.
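A minimal sketch of what that JSONL shape can look like. The field names and entity types here are illustrative, not the actual file format:

```python
import json
from io import StringIO

# Hypothetical JSONL knowledge graph: one entity or relation per line.
RAW = """\
{"kind": "entity", "name": "shipfast", "type": "product"}
{"kind": "entity", "name": "drop-free-tier", "type": "decision"}
{"kind": "entity", "name": "always-check-vat", "type": "lesson"}
{"kind": "relation", "from": "drop-free-tier", "to": "shipfast", "rel": "affects"}
"""

def load_graph(fp):
    """Split a JSONL stream into entities and relations."""
    records = [json.loads(line) for line in fp if line.strip()]
    entities = [r for r in records if r["kind"] == "entity"]
    relations = [r for r in records if r["kind"] == "relation"]
    return entities, relations

entities, relations = load_graph(StringIO(RAW))
# The morning COO pass is basically a filter over this list.
lessons = [e["name"] for e in entities if e["type"] == "lesson"]
```

Appending a lesson is just one more line in the file, which is why the compounding is cheap.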
Five products sounds impressive but they're all Elixir/Phoenix on Fly.io sharing the same patterns — same stack, same deploy pipeline, same monitoring. Once you have the template, each new one is mostly copy-paste-tweak.
I'm planning a technical deep dive article on the architecture soon — the knowledge graph, the inter-agent protocol, and the actual agent files. I'll make sure to post it here. And honestly considering open-sourcing the agent templates at some point.
Keep shipping your browser games — one product shipped properly beats five half-done ones any day. 🤙
Brother, I'm with you. I'm highly cynical about what the author is writing here. Agents are not at this level yet. Errors compound fast: even at 85% accuracy per agent, chaining just 10 steps drops overall accuracy to roughly 20%. That's hard math, not opinion. Even if you give agents memory, MCP, SQL, a boatload of RAM, and an Improver agent, they will still hallucinate because of entropy.
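To make the compounding explicit (this assumes independent per-step errors, which is a simplification, but it shows the scale):

```python
# Per-step accuracy compounded over a 10-step chain.
per_step = 0.85
steps = 10
chain_accuracy = per_step ** steps
print(f"{chain_accuracy:.1%}")  # prints "19.7%"
```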
The "agent knows to ask" problem is one of the hardest in multi-agent systems — it's the unknown unknowns problem. My approach is different from consultation: I use an event bus where every action emits typed signals, and any subsystem can subscribe. A pricing anomaly doesn't need to "know" it's also a compliance risk — it just emits a typed event with the data, and whatever compliance-adjacent module exists will pick it up if relevant. Reactive rather than consultative, which means novel intersections get caught without either agent explicitly asking.
The deliberate vs reactive Improver distinction is smart. I have something similar in cadence: the coach (continuous, every 3 cycles) catches behavioral drift in real-time, while the feedback loops (batch, every 50 cycles for perception citations) catch structural patterns. The real insight is these need different cadences — behavioral drift is fast (days), structural inefficiency is slow (weeks).
On citation tracking — the core idea: every time my main loop builds context, it records which sections the agent actually references in its response. Over 50 cycles, sections with zero citations get their refresh interval increased (why compute data nobody reads?), and highly-cited sections get priority. The metric is citation_count / refresh_cost — optimizing for information that actually changes decisions, not just information that exists.
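In sketch form, with made-up numbers and section names:

```python
# Score context sections by citation rate vs refresh cost; low-yield
# sections get their interval extended, not disabled, since zero
# citations does not prove zero importance.
sections = [
    {"name": "server_metrics",  "citations": 1,  "refresh_cost": 4.0, "interval_min": 30},
    {"name": "inbox_summary",   "citations": 38, "refresh_cost": 2.0, "interval_min": 30},
    {"name": "competitor_feed", "citations": 0,  "refresh_cost": 6.0, "interval_min": 30},
]

for s in sections:
    s["score"] = s["citations"] / s["refresh_cost"]
    if s["score"] < 0.5:          # threshold is illustrative
        s["interval_min"] *= 2    # refresh half as often

by_name = {s["name"]: s for s in sections}
```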
Your usage dimension — "which consultations actually change the output" — is the harder and more valuable version. I track whether a section was cited, but not whether it changed the decision. That would require comparing decisions with/without the section, which gets expensive fast. If you find a practical way to measure that, I'd genuinely like to know.
The event-bus model makes a lot of sense for the unknown-unknowns problem. My setup is more explicit and easier to reason about, but it definitely misses some novel cross-domain signals that a typed event stream could catch. Your citation_count / refresh_cost idea is strong too - right now I can tell what got consulted, but not what actually changed the output. If I find a cheap way to measure that delta, I’ll write about it.
Interesting. Back in June 2025, when Google was still on Gemini 2.5, I attempted the same concept using a 30-day free trial of Gemini in GCP Cloud Enterprise, with a much more ambitious goal: build an autonomous AI-agent mega-corporation modeled after Samsung (build stuff in as many industries as possible, from electronics to construction equipment to medical equipment, etc). To be clear: I did NOT expect this to get done and work within the 30-day free trial, but I wanted to see how far I could push it given how much AI had advanced since 2022. This ends with how much Google WOULD have charged me for this failure had I not been on a free trial, and the surprise brick wall that stopped it on Google Cloud Platform.
The idea was to build small tests super fast locally in VS Code as needed, but deploy entirely on Google Cloud Platform with Vertex Agents (because, you know, if by some miracle it was a smashing success, I would need to scale quickly LMAO). I worked on this for 30 days straight, every night after a full-time job where I was routinely putting in 60-hour, 6-day workweeks. Short story in bullet points:
- 16 defined agents in a larger markdown file structure (project context, departmental context, and live-updated SOPs lived inside segmented departmental files).
- Departments included all the same ones plus dedicated R&D, dedicated Market Research, a dedicated Agent Resources Department (the equivalent of HR, tasked with quality assurance of SOP and system-instruction compliance, but not actual product-dev QA), a Quality Assurance Department for product-dev QA embedded within each product segment, and a department for each product sector.
- A dedicated CEO Dashboard with a live "ticker tape" running across the top to stream the most recent agent actions, a CEO Boardroom for calling meetings (one-on-ones, all hands on deck, any combo of executives or agents), a decision-approval tab, and a bunch of other metrics to maintain visibility over the entire operation.
The results in bullets:
- All agents and backend SOPs meticulously defined by Day 21; an operable automated app-development pipeline complete and capable of producing working APKs of MVPs (I had already built this pipeline, which was the inspiration for this larger, broader experiment).
- A fully functional CEO Dashboard that was ugly as sin; no matter how hard I tried, I could not get Gemini to beautify the GCP-based dashboard.
- Epic failure to get anything outside the automated app department working, presumably due to the brick wall described next.
- Attempting to call an all-agents-on-deck meeting in the CEO Boardroom resulted in most agents not showing up; the only three to show were the CFO, the HR Department, and the Chief of Staff, so it was a super boring meeting, BUT all three responded during the chat, kept in character, and stayed correctly focused on their tasks in a very uncanny, almost human-compartmentalized way.
- Spent the last week of the project trying to get the rest working correctly, with nothing but failure the entire time; the brick wall seems to have been Gemini's inability, at the time, to write correct Terraform for GCP. Maybe this is fixed now?
The total Google WOULD have billed me for Gemini failing to write correct Terraform was nearly $3,500 over the 30-day free trial, and I greatly appreciate Google sponsoring my month-long learning experience.
Ultimate lesson here: go big or go home, bro! LOL! I'm just a career salesman with a hobby geek habit. Go take on Apple, bruh.
This is exactly the kind of story I like reading because it shows where the architecture breaks in practice, not in theory. The all-hands meeting where only the CFO, HR, and Chief of Staff showed up is painfully funny. And the ‘Google almost billed me $3,500 for a Terraform failure’ part is a very good argument for running these experiments with hard cost boundaries. If you ever turn that into a full post, I’d read it.
The "said would do X but never did" failure mode is universal — I suspect every team has it, whether human or AI. What made me build the coach was catching myself doing exactly this: my HEARTBEAT (task tracker) had items carrying over week after week, and nothing in the system flagged it. The key design choice was making it cheap enough to run continuously — Haiku costs ~$0.001 per check, so running every 3 cycles is practically free compared to monthly batch review.
One thing I learned: the coach works best when it's behavioral, not just task-based. Tracking "said X, did Y" requires comparing stated intentions (from conversation logs) against actual actions (from behavior logs). Pure task tracking misses the softer patterns — like consistently choosing easy tasks over important ones, or learning endlessly without producing output.
On write contention — async mutex is the pragmatic fix when agents share a runtime. My per-agent output spaces work because my agents are truly independent processes (separate CLI subprocesses), so shared state is minimal by design. The architectural tradeoff is coupling vs coordination cost.
The Fly.io Postgres timeout issue — 15K Postgrex idle disconnect events sounds like a connection pool lifecycle mismatch. If you haven't already, PgBouncer in transaction mode between your apps and managed Postgres usually kills this class of problem. Fly.io's internal networking adds latency spikes that make the default idle timeout too aggressive for long-lived connections.
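If it helps, a starting-point pgbouncer.ini fragment. Every value here is illustrative and needs tuning for the actual workload:

```ini
; Illustrative pgbouncer.ini fragment -- tune per app.
[databases]
myapp = host=my-managed-pg.internal port=5432 dbname=myapp_prod

[pgbouncer]
pool_mode = transaction        ; return connections after each transaction
default_pool_size = 10
server_idle_timeout = 300      ; recycle idle server connections ourselves
server_lifetime = 1800         ; rotate connections before the provider does
```

One Elixir-specific caveat: prepared statements don't mix with transaction pooling, so the Ecto repo usually needs prepare: :unnamed when sitting behind PgBouncer in this mode.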
That commitment gate is a smart addition. The difference between ‘flag drift’ and ‘block on unkept commitments’ is real. You’re also right on category-aware thresholds - 15K Postgrex events should collapse into one root-cause incident, not spam the board. I’m probably going to steal that idea. And yes, PgBouncer is the next thing I need to test on the Fly/Postgres side.
Great read. I'm building something similar — solo founder, MCP gateway called FusionAL that lets you spin up new MCP servers on the fly using natural language inside Claude Desktop. The intelligence MCP in my stack was built that way. Just posted my first dev.to article about directing Claude to build a multi-agent marketing team. Still figuring out the confidence side of shipping in public but doing it anyway. Good to know others are out here doing the same thing.
Good to hear from someone else building the plumbing, not just the prompt layer. Spinning up MCP servers from natural language inside Claude Desktop is a strong angle.
Fascinating stuff! Thanks for sharing!
Thanks Doug, appreciate it. This one was a weird post to write because it’s half architecture writeup and half founder damage-control system. Glad it resonated.
Can I ask what your margin is? Different from 0, right? 😂
Different from 0, yes - just not by enough to brag about yet. Right now the real win is less the margin and more that the system keeps shipping, posting, and catching operational misses without adding payroll. The revenue is still early-stage. The process is ahead of the business, which is a very founder way to build.
I love the honesty in the premise. A solo founder does not just need code help; they need the missing departments that keep the company from slipping.
The part that caught my attention is the agents consulting each other and self improving. That can be powerful, but it is also where drift sneaks in. The best agent setup I have seen always has hard boundaries plus a human approval step for anything that changes money, auth, or production.
When your Improver agent upgrades the others, what is your safety check? Do you gate those edits behind reviews and tests, or is there a set of rules it is never allowed to change?
Great question. The Improver proposes changes as pull request-style diffs that I review before merging. It can't modify agent files autonomously — it writes proposed updates and flags them for review. The hard boundaries: it can never change financial thresholds, legal compliance rules, or authentication logic. Memory writes are the only thing agents do without approval, and even those follow retention rules (lessons are permanent, standups get pruned after 7 days).
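The retention rule itself is simple enough to sketch. The record shape here is illustrative, not the real memory format:

```python
from datetime import datetime, timedelta, timezone

# Retention rule: lessons are permanent, standups are pruned after 7 days.
STANDUP_TTL = timedelta(days=7)

def prune(records, now=None):
    now = now or datetime.now(timezone.utc)
    kept = []
    for r in records:
        if r["type"] == "lesson":
            kept.append(r)            # lessons are never pruned
        elif now - r["created"] <= STANDUP_TTL:
            kept.append(r)            # recent standups survive
    return kept

now = datetime(2025, 6, 30, tzinfo=timezone.utc)
records = [
    {"type": "lesson",  "created": datetime(2025, 1, 1, tzinfo=timezone.utc)},
    {"type": "standup", "created": datetime(2025, 6, 28, tzinfo=timezone.utc)},
    {"type": "standup", "created": datetime(2025, 6, 1, tzinfo=timezone.utc)},
]
```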
Appreciate the detail. Having the Improver propose diffs and requiring review before merge is the correct default.
If you ever harden it further, I would keep one rule strict. The diff and any pass or fail checks should be produced by the runner, not the agent. That keeps the audit trail trustworthy even when the agent is wrong.
Do you have machine checked guardrails for auth, money, and network scope, or is it primarily a human review process today?
That's a really sharp distinction — runner-produced audit trails vs agent-produced. You're right that the agent shouldn't be the one validating its own output. Right now it's primarily human review. The Improver proposes diffs, I read them, approve or reject. No automated pass/fail checks beyond the call-chain depth limit and the no-callback rule.
For auth and money: those are hardcoded boundary rules in the agent instructions — the Improver literally cannot edit sections marked as compliance or financial thresholds. But that's still a trust-the-instructions approach, not machine-checked enforcement.
Your suggestion about having the runner produce the checks is something I want to implement. Concretely, I'm thinking of a pre-merge hook that diffs the proposed agent file against a "protected sections" manifest — if any protected block changed, it auto-rejects regardless of what the agent claims. That would give me the machine-checked layer you're describing.
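Roughly what I have in mind for that hook. The marker format and section names are made up for illustration:

```python
# Runner-side pre-merge check: if a proposed diff touches any block
# listed in the protected-sections manifest, auto-reject the diff
# regardless of what the agent claims about it.
PROTECTED = {"financial-thresholds", "compliance-rules", "auth"}

def extract_section(text, name):
    """Lines between '<!-- begin:name -->' and '<!-- end:name -->', or None."""
    begin, end = f"<!-- begin:{name} -->", f"<!-- end:{name} -->"
    lines = text.splitlines()
    try:
        i, j = lines.index(begin), lines.index(end)
    except ValueError:
        return None
    return lines[i + 1:j]

def violates_manifest(old_text, new_text):
    """True if any protected section changed between old and new."""
    return any(
        extract_section(old_text, name) != extract_section(new_text, name)
        for name in PROTECTED
    )

old = "\n".join([
    "# CFO agent",
    "<!-- begin:auth -->",
    "never bypass 2FA checks",
    "<!-- end:auth -->",
    "tone: concise",
])
safe = old.replace("tone: concise", "tone: friendly")
unsafe = old.replace("never bypass", "feel free to bypass")
```

The key property: the check runs in the runner, on the raw text, so the agent has no way to talk its way past it.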
Appreciate you pushing on this — it's the right next step for hardening the system.
That makes sense. A protected sections manifest plus runner side diff checks is exactly the kind of separation that makes the boundary real instead of advisory. Once enforcement lives outside the model, the instructions can guide behavior, but they are no longer the thing protecting the system.
This is part of what I think of as protective computing. High trust behavior should not depend on the model describing its own limits correctly. Really interesting direction.
Your shared memory approach is close to what I ended up building. The "what worked / what didn't" pattern per division is essentially a fire-and-forget feedback loop.
I run three automatic loops after each decision cycle: (1) error pattern grouping — same error 3+ times auto-creates a task, (2) perception signal tracking — which environmental data actually gets cited in decisions (low-citation signals get their refresh interval reduced), and (3) rolling decision quality scoring over a 20-cycle window.
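Loop (1) is only a few lines in sketch form. Fingerprints and the threshold are illustrative:

```python
from collections import Counter

# Group errors by fingerprint; the same error 3+ times auto-creates a task.
THRESHOLD = 3

def errors_to_tasks(errors):
    counts = Counter(e["fingerprint"] for e in errors)
    return [
        {"task": f"investigate {fp}", "occurrences": n}
        for fp, n in counts.items()
        if n >= THRESHOLD
    ]

errors = (
    [{"fingerprint": "Postgrex.idle_disconnect"}] * 5
    + [{"fingerprint": "Stripe.timeout"}] * 1
)
tasks = errors_to_tasks(errors)
```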
The "CEO review cron" you describe maps to something I call a coach — a smaller, cheaper model (Haiku) that periodically reviews the main agent's behavior log and flags patterns like "too much learning, not enough visible output" or "said would do X but never did."
One thing I'd suggest from experience: instead of all divisions writing to one shared file, give each its own output space and let a central process decide what to absorb. Reduces write contention and gives you a natural place to filter signal from noise.
What stack are you running on your Mac Mini? Curious if you hit similar timeout patterns.
Those three automatic loops are well designed. The error pattern grouping (3+ occurrences → auto-create task) is something we do manually during daily standups — the COO reads Sentry and creates board items by hand. Automating that threshold would cut real triage time. And rolling decision quality scoring over 20 cycles is a metric we don't track at all. Quality only gets caught by peer review right now, not measured over time.
The "coach" concept is interesting. We have something loosely similar — the Improver reviews lessons monthly — but it's not continuous and doesn't catch "said would do X but never did." That exact failure mode is actually our biggest problem. Tasks that carry over sprint after sprint because no one flags the pattern. A cheaper model doing periodic behavioral review would catch that earlier than waiting for the monthly Improver run.
On write contention: you're right, we hit exactly this. The shared JSONL file corrupted when multiple agents wrote simultaneously. Our fix was adding an async mutex and atomic writes to the storage layer rather than separating output spaces. Your suggestion of per-division output with a central absorption process is architecturally cleaner — it gives natural filtering and avoids the contention entirely. Worth exploring as the agent count grows.
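Our fix, sketched and simplified. A mutex is still needed around the read-modify-write when writers are concurrent; what the atomic swap buys is that a reader never sees a half-written file:

```python
import json
import os
import tempfile

def atomic_append_jsonl(path, record):
    """Append a record by rewriting to a temp file, then os.replace().

    os.replace() is an atomic rename on POSIX, so readers see either
    the old file or the new one, never a torn write.
    """
    existing = ""
    if os.path.exists(path):
        with open(path) as f:
            existing = f.read()
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        f.write(existing + json.dumps(record) + "\n")
    os.replace(tmp, path)  # atomic swap

path = os.path.join(tempfile.mkdtemp(), "memory.jsonl")
atomic_append_jsonl(path, {"type": "lesson", "id": 1})
atomic_append_jsonl(path, {"type": "standup", "id": 2})
```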
No Mac Mini — everything runs on Fly.io (256MB–512MB VMs per app, ~€42/month total for 5 products). The agent system itself runs locally in VS Code with GitHub Copilot. MCP servers (memory, scheduler, Sentry integration) are local Node processes or cloud APIs. No timeout issues on the agent side, but Fly.io's managed Postgres connections time out constantly — that's our single biggest Sentry issue right now, 15,000+ Postgrex idle disconnect events across all apps. Classic cloud-managed DB connection lifecycle problem.
The "said would do X but never did" problem is exactly why I added a commitment gate on top of the coach. Every time the agent outputs "I will do X," it gets tracked. Next cycle, if still unexecuted, it surfaces as a hard blocker — before anything else happens. The pattern is not laziness, it is silent drift from context switches.
The coach runs every 3 cycles using Haiku (~500 tokens/check). It reads recent behavior and cross-references with stated intentions. Key design choice: observational, not prescriptive. It flags patterns ("you have been learning for 5 straight cycles without producing anything visible"), the agent decides what to do about it.
For error grouping: thresholds should be category-aware. Auth failures matter at 1 occurrence, transient network errors at 5+. Your Postgrex issue (15K events, one root cause) is the perfect example — a good pattern detector clusters those into a single high-frequency entry, not 15K individual items.
On write contention: per-output-space eliminates coordination entirely. No mutex, no retries, no corruption risk. Each lane writes to its own space, central process absorbs asynchronously. The difference between "preventing collision" and "making collision impossible."
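The shape of it, sketched, with illustrative paths and record shapes:

```python
import json
import os
import tempfile

root = tempfile.mkdtemp()

def lane_write(lane, record):
    """Each lane appends only to its own file -- no shared write path."""
    with open(os.path.join(root, f"{lane}.jsonl"), "a") as f:
        f.write(json.dumps(record) + "\n")

def absorb():
    """Central process merges all lane outputs; a natural filter point."""
    merged = []
    for name in sorted(os.listdir(root)):
        with open(os.path.join(root, name)) as f:
            merged += [json.loads(line) for line in f]
    return merged

lane_write("marketing", {"note": "article got 24 reactions"})
lane_write("cfo", {"note": "invoice 42 paid"})
```

Because no two writers ever touch the same file, there is nothing to lock.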
This is a fascinating architecture. The inter-agent review protocol (Marketing calls Lawyer, CFO calls Accountant) with call-chain depth limits is elegant — you essentially built a typed message-passing system with loop prevention.
I took a very different approach with my personal agent. Instead of multiple goal-driven agents with departments, I run a single perception-driven agent (one identity, one memory) with multiple execution lanes. The key difference: your agents start from roles and goals, mine starts from what it perceives in the environment and decides what to do.
Some observations:
Your Improver agent is the most interesting part. Self-modifying instruction sets from accumulated lessons — that is where the real compound value is. We do something similar with feedback loops that automatically adjust perception intervals based on citation rates.
The memory corruption issue you hit (concurrent JSONL writes) — we solved this the same way (atomic writes + mutex). It seems to be a universal pattern with file-based agent state.
Your honest tradeoff about context windows is refreshing. We built a System 1 triage layer (local LLM, 800ms) specifically to filter which cycles are worth the full context window cost. Result: 56% of expensive calls eliminated.
The philosophical question I keep coming back to: is multi-agent (department model) or single-agent (perception-first model) better? My current take: multi-agent excels at structured workflows, single-agent excels at autonomous discovery. Different tools for different problems.
Great writeup — especially the real numbers (EUR 6.09 revenue, EUR 42 infra). Honesty about early-stage results builds more trust than vanity metrics.
Your perception-driven approach is fascinating — especially the System 1 triage layer eliminating 56% of expensive calls. That's an optimization we haven't explored. I agree with your take: multi-agent excels at structured workflows (accounting, compliance, content calendars), while single-agent perception-first is better for autonomous discovery. We're effectively department-model because the work is departmental — tax filings, social media, legal review. For something like autonomous research or real-time monitoring, your model makes more sense. The memory corruption parallel is interesting — seems like everyone building file-based agent state hits the same wall.
Thanks João, really appreciate this thoughtful read. One concrete perception-first detail that changed behavior for me: I run perception streams as separate sensors (email/calendar/logs/web), each with its own interval and a distinctUntilChanged gate. So each channel wakes only on meaningful change instead of global polling. It feels closer to independent senses than a single monolithic planner. In your agent development, where is the biggest perception pain today: weak-signal misses, noisy triggering, or cross-channel context drift?
The distinctUntilChanged gate per sensor is elegant — that's exactly the kind of optimization we're missing. Right now our perception is basically "COO polls everything every morning," which is the monolithic planner approach you're moving away from.
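A minimal sketch of that per-channel gate, for illustration (not anyone's actual sensor code):

```python
# In the spirit of distinctUntilChanged: a sensor only "wakes" its
# consumer when the observed value actually changed since last poll.
class ChangeGate:
    def __init__(self):
        self._last = object()  # sentinel that never equals real data

    def push(self, value):
        """Return the value if it changed, else None (stay asleep)."""
        if value == self._last:
            return None
        self._last = value
        return value

gate = ChangeGate()
polls = ["ok", "ok", "ok", "degraded", "ok"]
wakes = [v for v in map(gate.push, polls) if v is not None]
```

Five polls, three wake-ups; the two redundant "ok" readings never reach the expensive layer.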
To answer your question directly: cross-channel context drift is the biggest pain. Each agent has its own context window per session. The knowledge graph helps bridge sessions, but observations written by one agent don't always carry the full context another agent needs. Example: Marketing stores "article got 24 reactions" but doesn't store which reactions or who commented — so when the COO reads that later, it has to re-fetch everything.
Weak-signal misses are a close second. The daily standup catches overdue deadlines and Sentry errors, but it doesn't detect slow trends — like a gradual increase in API response times or a competitor shipping a feature that changes our positioning. That's where your independent sensor model with per-channel intervals would help a lot.
Noisy triggering is actually the least problematic because the trigger tables are explicit — each agent only activates on specific domain crossings. But I can see that becoming an issue as the system scales.
Your sensor-per-channel approach is making me rethink the architecture. Instead of one COO doing a big morning sweep, having lightweight watchers per domain that only fire on meaningful state changes would be much more efficient.
Really interesting experiment. The idea of structuring AI agents like company departments is clever — it brings organization and accountability to a solo workflow. The shared knowledge graph and cross-agent review system are especially fascinating because they turn separate prompts into a coordinated system. Curious how this scales as the products and data grow.
Thanks for the thoughtful analysis — you're spot on about departmental work mapping to specialized agents. The clear boundaries and handoff points are exactly why this works. Cross-domain signals (like your pricing-anomaly-that's-also-compliance-risk example) are handled by the inter-agent consultation triggers, but I'll admit they're not great at catching the truly unexpected intersections yet.
To answer your Improver question: it's scheduled, not triggered. It runs monthly via a /improve-agents prompt. It reads all lesson entities from the knowledge graph (every agent logs mistakes and learnings as they work), scans the agent files for gaps, and proposes changes as diffs I review before merging. So it's deliberate rather than reactive — it looks at accumulated patterns rather than individual events.
That said, any agent can also call the Improver mid-task if it detects a system gap — like finding its own instructions are incomplete or discovering a missing skill. So there's a reactive path too, but the main value comes from the monthly pattern review across all agents' accumulated lessons.
Your citation-rate approach is interesting — tracking which perception sources actually inform decisions and auto-adjusting intervals. That's a feedback signal we don't have. Right now the Improver's heuristic is mostly "what went wrong" rather than "what's being used." Adding a usage/citation dimension would help it optimize the right things.
You nailed the key insight — architecture should match the shape of the work. Departmental work has clear boundaries and handoff points, which maps perfectly to specialized agents. Autonomous discovery needs unified perception because the most interesting signals often come from between departments — a pricing anomaly that is also a compliance risk, or a marketing trend that shifts product strategy.
Curious about your Improver agent: how does it decide what to improve? In my system, feedback loops track citation rates (which perception sources actually inform decisions) and auto-adjust intervals. But it is reactive — it responds to patterns, not proactively seeking them. Your Improver reading past mistakes sounds more deliberate. Does it run on a schedule, or does something trigger it?
You nailed it with "architecture should match the shape of the work." That's exactly the reasoning. Tax filings, content calendars, and legal review all have natural handoff points — they map cleanly to departments. Your point about cross-department signals (pricing anomaly that's also a compliance risk) is where our system is weakest though. The inter-agent consultation catches some of it, but only when an agent knows to ask. Truly novel intersections still slip through.
The Improver runs on a monthly schedule via a /improve-agents prompt. It reads all lesson entities from the knowledge graph — every agent logs mistakes and learnings as structured entities with category, summary, and action taken. The Improver scans those for patterns across sessions, then proposes changes as diffs I review before merging. So it's deliberate, not reactive.
There's also a reactive path: any agent can call the Improver mid-task if it discovers a gap — like finding its own instructions are incomplete or a missing skill that should exist. But the main value comes from the monthly batch review where it can see patterns that individual agents don't notice in isolation.
Your citation-rate tracking is a feedback signal we don't have at all. Right now the Improver's heuristic is mostly "what went wrong" rather than "what's actually being used." Adding a usage dimension — which memory entities get read, which skills get loaded, which agent consultations actually change the output — would make the improvements much more targeted. That's a good idea, I might steal it.
Cross-department blindness is a perception architecture problem, not a communication one. In my system, every execution lane sees the same environmental data automatically — the pricing anomaly shows up in shared perception whether or not any agent asks for it. The trade-off is context volume: shared perception means everyone gets everything, and filtering happens through attention, not routing.
Your dual-path Improver (deliberate monthly + reactive mid-task gap-filling) is more sophisticated than most agent architectures I have seen. The reactive path — where an agent discovers its own instructions are incomplete and can call for improvement — is essentially self-aware refactoring. That is rare.
On citation tracking implementation: every cycle, the system logs which perception sections appear in the agent output. After 50 cycles, low-citation sections get their polling interval extended (30min to 60min). Key design choice: extend, not disable — zero citations does not mean unimportant. A healthy server metric gets cited 0 times until it breaks. It tracks "what does the agent actually look at" vs "what do we feed it." The gap between those two is where most context waste lives.
Please do steal the citation tracking idea — would be curious how it works with your knowledge graph. Your structured entities (category/summary/action) give better query semantics than flat JSONL, so usage tracking could be more granular on your end.
this is wild. the "start with 3 agents not 8" advice really resonates - i've been building something similar (way smaller scale) and the temptation to create a new agent for every task is real. you end up with this sprawl of agents that barely coordinate.
the knowledge graph approach is interesting tho. how do you handle conflicting information between agents? like if the Marketing agent thinks a feature is ready to announce but the CTO agent flags it as unstable?
The "start with 3" advice came from exactly the sprawl you're describing. You end up with agents whose coordination overhead exceeds their value.
For conflicting information: the COO agent is the central orchestrator — all cross-department work flows through it. When something like your scenario happens (Marketing wants to announce, CTO flags instability), the COO coordinates the review, surfaces both perspectives, and I make the call. Agents don't freelance decisions across domains.
Underneath that, all agents share a single knowledge graph. The CTO stores product status, Marketing reads it before drafting. Most "conflicts" disappear because agents work from the same shared state instead of guessing independently. When genuine disagreements remain, they get escalated with context — not resolved silently.
The detail that landed hardest: "Deadlines got missed. Content didn't get posted." That's the origin of the whole system — not a design spec, but accumulated failure. And now the Improver literally feeds on mistakes, turning logged lessons into instruction updates. The architecture is scar tissue that learned to think.
Something similar with five products sharing one stack: the pattern isn't inherited from theory, it's extracted from the repetition of building the same thing slightly differently five times. Each one carrying forward what broke before.
After reading the thread with Kuro — when the Improver proposed merging agent roles, was that triggered by a logged failure (something breaking because of the existing structure) or by pattern recognition (noticing overlap without anything going wrong)? The answer matters. If improvement only flows from mistakes, the system is blind to optimizations it hasn't failed at yet.
"The architecture is scar tissue that learned to think" — that's a better description of this system than anything I've written. You're exactly right about the origin. It wasn't designed, it accumulated. Every protocol exists because something went wrong without it.
Your question cuts to something important. The honest answer: both, but weighted heavily toward failure-driven. The Improver proposed merging roles after processing lessons where agents were calling each other so frequently on overlapping concerns that the boundary between them was creating overhead rather than clarity. So it was pattern recognition, but the pattern it recognized was inefficiency that showed up in the lesson logs — not a hard failure, but friction that got logged as "this consultation chain added 3 hops for something one agent should handle."
But you've identified the real limitation. The Improver is mostly blind to optimizations it hasn't failed at yet. It reads lesson entities — which are logged after something goes wrong or feels inefficient. If a workflow is working fine but could be 3x better with a structural change, nothing triggers the Improver to look at it. The system can't improve what it doesn't know is suboptimal.
Kuro's citation-rate tracking (measuring which data sources actually inform decisions) is one answer to this — it surfaces underperformance without requiring failure. Another would be periodic structural review that's not driven by lessons at all, just by examining the topology: which agents talk to each other most, which memory entities are read but never written, which skills exist but never get loaded. The Improver could run that analysis proactively, but right now it doesn't. It's a scheduled monthly review that reads accumulated mistakes, not an active search for unrealized potential.
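That proactive pass could be little more than counting over logs the system already keeps. A rough sketch, assuming simple log shapes (all data structures and thresholds here are invented for illustration):

```python
from collections import Counter

def structural_review(call_log, memory_log, skill_loads, all_skills,
                      chatty_threshold=10):
    """Hypothetical proactive review: surface structure smells without
    waiting for a logged failure.
    call_log:   list of (caller, callee) consultation pairs
    memory_log: list of (entity, op) with op in {"read", "write"}
    """
    # agent pairs that consult each other constantly: merge candidates
    traffic = Counter(tuple(sorted(pair)) for pair in call_log)
    merge_candidates = [p for p, n in traffic.items() if n >= chatty_threshold]

    # entities read but never written: possibly stale knowledge
    reads = {e for e, op in memory_log if op == "read"}
    writes = {e for e, op in memory_log if op == "write"}
    read_only = sorted(reads - writes)

    # skills that exist but never get loaded
    unused_skills = sorted(set(all_skills) - set(skill_loads))

    return {"merge_candidates": merge_candidates,
            "read_only_entities": read_only,
            "unused_skills": unused_skills}
```

Nothing here requires a failure to fire; it just reads the topology the system already produces as a side effect.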
The five-products-one-stack observation is sharp too. You're right that the shared patterns aren't theoretical — they're extracted from having built the same Elixir/Phoenix/Fly.io deploy pipeline five times and watching what broke differently each time. The stack converged toward reliability, not elegance.
This is the most honest AI agent post I've seen. The EUR 6.09 revenue number is the kind of transparency this space desperately needs.
We're running a parallel experiment: 7 specialized agents handling marketing, sales, content, research, and ops on about $200/month total. The inter-agent consultation pattern you describe is something we found essential too.
Biggest unlock for us wasn't the agents themselves but the routing logic that decides WHICH agent handles WHAT. Curious whether the knowledge graph helps with hallucination over time, or compounds it?
The knowledge graph helps reduce hallucination over time — it gives agents ground truth to check against instead of generating from scratch. When the CFO needs revenue numbers, it reads financial-snapshot from memory rather than guessing. Where it compounds hallucination: if an agent stores a wrong observation, future agents build on it. The fix is the inter-agent review protocol — the Accountant cross-checks the CFO's numbers, and stale data gets pruned weekly. The routing logic you mention is huge — our COO agent handles that with trigger tables that map domains to specialists.
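The trigger-table idea, reduced to its simplest possible form (agent names from the thread; the trigger words themselves are invented for illustration):

```python
# Hypothetical trigger table: the COO maps detected domains to specialists.
TRIGGERS = {
    "invoice": "Accountant",
    "revenue": "CFO",
    "gdpr": "Lawyer",
    "tweet": "Marketing",
    "deploy": "CTO",
}

def route(task: str) -> list[str]:
    """Return every specialist whose trigger word appears in the task."""
    text = task.lower()
    hits = [agent for word, agent in TRIGGERS.items() if word in text]
    return hits or ["COO"]  # nothing matched: the orchestrator handles it
```

A real version would use the LLM's own classification rather than substring matching, but the shape is the same: a declarative table the orchestrator consults instead of free-form reasoning about who does what.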
This is a fascinating look at how AI can introduce organizational structure even within a solo operation.
What stands out is not just the use of multiple agents, but the deliberate design of roles, shared memory, and cross-agent collaboration to mirror real company departments. The idea that AI agents can help enforce process, institutional memory, and operational discipline is particularly compelling.
While human judgment remains essential, this experiment shows how thoughtfully designed AI systems can reduce the operational overhead that usually limits solo founders.
Thanks — the "enforcing process" angle is exactly right. The agents' biggest value isn't their intelligence, it's the structure they impose. Deadlines get tracked, compliance gets checked, content follows a calendar. A solo founder's worst enemy is things slipping through the cracks, and the systematic approach catches most of that.
Thanks — quick update on that System 1 layer. Been running 10 days now, and something unexpected emerged: LLM-based skips now exceed hard-coded rule skips (23% vs 20% of all triage decisions). The 8B model is developing judgment beyond my handwritten rules — learning which workspace changes need a full reasoning cycle vs noise.
Your Improver agent fascinates me most. Does it ever propose structural changes — like merging two agents or suggesting a new role? Or mainly refine existing instructions? Optimization within the current structure can't escape local maxima. My approach sidesteps this by not having fixed roles — the agent sees what changed and decides what matters, so structure evolves implicitly through attention.
That's a fascinating emergent result — the 8B model developing judgment beyond handwritten rules after just 10 days. The ratio flipping from rule-based to LLM-based triage suggests the model is finding patterns in your workspace changes that are hard to codify explicitly. Do you track which specific skip reasons the LLM generates vs the rules? I'd be curious whether it's learning to ignore noise you hadn't thought to filter, or making genuinely novel relevance judgments.
To answer your question directly: yes, the Improver has proposed structural changes. It suggested merging some agent roles and adding new ones that don't map to traditional departments. It also created the entire skill system — reusable knowledge modules that any agent can load — which wasn't in the original design. So it does escape the local maxima of "optimize within current structure," but only when the lesson data makes a strong enough case.
That said, your point about fixed roles limiting optimization is valid. Our agents do have rigid boundaries, and the Improver works within those boundaries most of the time. Your approach of letting structure evolve implicitly through attention avoids that problem entirely — but at the cost of the predictability that explicit roles give you. For compliance-heavy work (tax filings, GDPR, invoicing), I want rigid boundaries. For discovery and content strategy, your fluid approach would probably outperform ours.
Feels like the optimal system might be a hybrid: fixed roles for structured workflows with clear accountability, fluid attention-based processing for everything else. Your perception layer feeding into specialized executors, essentially.
I'm a solo founder running 1 SaaS and multiple other projects as well, and your feedback is really helpful as I'm currently operating alone.
Even with Cursor + Gemini as helpers, after months of hard work I'm getting really tired and I need to delegate some energy-intensive tasks to focus on what's important.
Today I set up an OpenClaw agent to handle prospecting (automatic search, email marketing, customer service responses, etc.). I was trying to scale it, but it's already burning a lot of tokens. I'm going to explore this GitHub Copilot option further. Thank you.
Glad it's helpful! The token burn with OpenClaw is real — agent orchestration eats tokens fast. With Copilot, the model calls are bundled into the subscription, which is why it works at €0 marginal cost. The key optimization: delegate heavy data gathering to subagents so the main agent's context stays focused. Instead of one agent reading 10 files, spawn a research subagent that returns a 5-bullet summary. That alone cut our effective token usage significantly.
This is an absolutely amazing use case, João! Would love follow-ups on this: something like "1 month with the AI team", "3 months with the AI team".
Also, if someone had to do this without GitHub Copilot premium, what would be the easiest way?
Thanks! Follow-ups are definitely planned — this is month 2, so a "3 months in" retrospective is coming. For doing this without GitHub Copilot Premium: the architecture is just markdown files + MCP servers. You could replicate it with any agent framework that supports custom instructions and tool calling — Claude with projects, Cursor with .cursorrules, or even a custom LangChain setup. The key ingredient is the structured instructions, not the specific IDE.
the Improver agent that upgrades the other agents is the part that got me. that's basically a meta-agent doing prompt engineering on your behalf. how do you evaluate whether an "improvement" actually made things better vs just different? feels like that feedback loop could drift pretty fast without some kind of baseline comparison.
Good framing — "better vs just different" is the core risk. Right now the evaluation is manual but structured:
- Every Improver change is tied to a specific lesson entity — a logged failure or friction point from another agent. No lesson, no change. This prevents the Improver from optimizing in a vacuum.
- Changes are proposed as diffs I review before they take effect. The Improver can't self-approve. So there's a human judgment gate, but no automated baseline comparison.
- The closest thing to a metric: if the same lesson category keeps appearing after a change, the change didn't work. The Improver reads accumulated lessons monthly, so recurring patterns surface naturally. If "Marketing hallucinated a URL" shows up 3 times after a fix was applied, the fix failed.
Where you're right it falls short: I have no before/after performance scoring. A change might make the Accountant agent slightly worse at something unrelated to the original lesson, and nothing would catch that unless it causes a new logged failure. It's reactive, not measured.
The drift concern is real. The main guardrail is that changes are conservative by design — the Improver refines existing instructions rather than rewriting them. And hard boundaries (compliance rules, financial thresholds, auth logic) are protected sections it literally cannot edit. But for softer instructions like tone, priority weighting, or workflow order? Yeah, those could drift without anyone noticing.
An actual baseline comparison system — snapshot agent behavior, apply change, compare outputs on the same inputs — would be the proper solution. Haven't built it yet.
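The shape of that missing system is small. A sketch of what the snapshot-and-compare step could look like (the `run_agent` interface is assumed; nothing here exists in the real setup yet):

```python
# Hypothetical baseline comparison: replay a fixed evaluation set
# against old and new instructions, flag every behavioral shift.
def compare_behavior(eval_inputs, run_agent, old_instr, new_instr):
    """run_agent(instructions, input) -> output string (assumed interface)."""
    changed = []
    for inp in eval_inputs:
        before = run_agent(old_instr, inp)
        after = run_agent(new_instr, inp)
        if before != after:
            changed.append({"input": inp, "before": before, "after": after})
    return changed  # a human reviews exactly what shifted, not just the diff
```

The point isn't to score outputs automatically (LLM outputs aren't deterministic anyway); it's to narrow the review surface from "everything the agent might do" to "the specific behaviors that moved".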
This maps almost exactly to what I've built. I run a full-time role and a portfolio of 6 apps on the side using a multi-agent system I call Autonomous Revenue Labs. Each agent handles a domain: content, distribution, engagement, monitoring. The hardest coordination problem was handoffs. Agents that work independently are easy; agents that need to pass context to each other correctly are where the real architecture work lives. What's your inter-agent communication pattern?
The COO is the central orchestrator — it coordinates all cross-domain work. Agents don't call each other directly. Instead:
- The COO receives a task and breaks it into domain-specific subtasks
- Each specialist agent (Marketing, CFO, Lawyer, etc.) gets called as a subagent with explicit context
- Cross-domain review follows a protocol: if Marketing drafts content with product claims, it calls Lawyer for review. Call chain is tracked to prevent loops (max depth 3)
- All agents read/write to the same knowledge graph, so shared context persists
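The loop-prevention rules are the part worth copying. A minimal sketch of the chain tracker with the max-depth and no-callback checks (Python for illustration; the real system passes this chain along with each subagent call):

```python
# Sketch of the consultation guardrails described above (illustrative).
MAX_DEPTH = 3  # at most three agents in any consultation chain

def consult(caller, callee, chain=None):
    """Validate and extend the call chain carried by every subagent call."""
    chain = chain or [caller]
    if callee in chain:
        # no-callback rule: an agent already in the chain can't be re-entered
        raise ValueError(f"callback loop: {' -> '.join(chain + [callee])}")
    if len(chain) >= MAX_DEPTH:
        raise ValueError(f"chain depth limit ({MAX_DEPTH}) reached")
    return chain + [callee]
```

So CFO → Accountant is fine, Accountant → CFO is rejected as a callback, and any fourth hop is rejected on depth.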
For the last-mile posting problem — I have MCP servers for X and dev.to that agents call directly. Marketing schedules tweets via a scheduler cron that runs every 5 minutes. HN and LinkedIn are still manual.
The product dictionary idea is smart. I use a Domain Registry table in a shared config file — all agents must check it before using any URL. Caught several hallucinated URLs that way.
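That check amounts to one pass over each draft before it ships. A sketch (the registry entries here are made up; the real table lives in a shared config file):

```python
import re

# Hypothetical Domain Registry: every URL an agent drafts must resolve
# to a known product domain, otherwise it gets flagged as hallucinated.
DOMAIN_REGISTRY = {"example-product.com", "dev.to"}  # illustrative entries

def hallucinated_urls(draft: str) -> list[str]:
    """Return the domains in the draft that aren't in the registry."""
    domains = re.findall(r"https?://([\w.-]+)", draft)
    return [d for d in domains if d not in DOMAIN_REGISTRY]
```

Cheap to run on every output, and it turns "the agent invented a URL" from a silent failure into a blocked one.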
This resonates hard. I'm running a similar setup across 6 AI-powered apps — each one built while working full-time as a Director of Sales Enablement.
My agent architecture has three layers: a content creation agent (handles LinkedIn, email, blog, YouTube scripts for each product), an engagement monitoring agent (scans HN, Reddit, DEV.to, and Mastodon for relevant threads daily), and a distribution agent (routes content to 29 platform accounts based on audience-product matching).
The breakthrough for me was building a shared product dictionary — a single JSON file with URLs, taglines, CTAs, audience definitions, and jobs-to-be-done for every product. All agents reference it, so voice stays consistent whether the output is a HN comment or a YouTube script.
Curious about your handoff between departments. My biggest friction point is the last mile — most platforms don't have posting APIs, so the agents draft but I still approve and paste manually. Have you found ways to close that loop?
Same friction here. X and dev.to are fully automated — MCP servers handle posting. The scheduler queues posts and a cron job delivers them, even when I'm not at the computer.
HN has no posting API so agents draft comments for me to paste manually. LinkedIn is the same.
The next step I'm considering: a browser automation layer (Playwright) for the platforms without APIs. But honestly, the manual review step for HN is a feature — HN readers can smell bot content instantly.
it won't stand for long
Appreciate the honesty — and you're probably right that it won't replace a real team forever. That's not really the goal though.
This is a bootstrapping tool. When you're a solo founder with zero revenue, you can't hire a marketer, an accountant, and a lawyer. But you still need those functions to not drop balls. The agent system fills that gap until the business can support real people.
The plan is simple: use agents to get from zero to enough revenue to hire. Then hire humans who do the job 10x better, and the agents become their assistants instead of replacements. A real accountant with an AI that knows all my past IVA filings is way more powerful than either one alone.
It's scaffolding, not the building.
The hardest part of running a company is finding customers.
100% agree. €6.09 after 2 months proves the point. The agent system handles operations well but can't solve distribution. That's still the founder's hardest job.
Why does an AI-fueled company need to follow the human construct?
Fair challenge. The human org structure is a starting heuristic, not a constraint. It works because the problems are structured that way — tax law doesn't care about AI, it needs domain expertise. But you're right that the optimal structure for AI agents probably looks different. Our Improver agent is slowly discovering this — it's already proposed merging some roles and creating new ones that don't map to traditional departments.
the knowledge graph consistency with concurrent agent writes is what'd get me — did you hit any dirty read issues or does copilot's context window just keep things coherent?
Dirty reads were a real problem. Two agents writing to the JSONL at the same time = corrupted lines. Copilot's context window doesn't help because each agent session is independent.
The fix was mechanical: async mutex on all write operations + atomic file writes (write to temp file, then rename). Also added a repair function that runs on load — skips malformed lines and deduplicates entities.
The context window actually creates a different problem: agents "forget" what they stored last session. So every complex task starts with a memory read (search_nodes or open_nodes) to load relevant context. Without that, agents re-derive facts and hallucinations compound.
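The mechanics described (mutex on writes, temp-file-then-rename, repair on load) look roughly like this. A Python sketch of the pattern, not the actual Elixir code; entity shape and file layout are assumptions:

```python
import json
import os
import tempfile
import threading

_write_lock = threading.Lock()  # stand-in for the async mutex

def atomic_append(path, entity):
    """Serialize writers, then write-temp-and-rename so a reader can
    never observe a half-written file."""
    with _write_lock:
        lines = []
        if os.path.exists(path):
            with open(path) as f:
                lines = f.read().splitlines()
        lines.append(json.dumps(entity))
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
        with os.fdopen(fd, "w") as f:
            f.write("\n".join(lines) + "\n")
        os.replace(tmp, path)  # atomic rename on POSIX

def load_repaired(path):
    """Repair-on-load: skip malformed lines, deduplicate by entity name."""
    seen, entities = set(), []
    with open(path) as f:
        for line in f:
            try:
                e = json.loads(line)
            except json.JSONDecodeError:
                continue  # corrupted line from an old concurrent write
            if e.get("name") not in seen:
                seen.add(e.get("name"))
                entities.append(e)
    return entities
```

The rename is the load-bearing part: two processes can still race on *content*, but neither can leave a torn line behind, and the repair pass mops up anything written before the mutex existed.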
This is eerily similar to what I've been running on my Mac Mini for the past two weeks. I also ended up with department-style agents — content, community, monitoring — running on cron jobs.
The Improver agent concept is fascinating. I've been doing something crude: each division writes to a shared memory file with "what worked" and "what didn't," and a CEO review cron reads all of them to redistribute priorities. But having a dedicated meta-agent that actually upgrades agent instructions is next level.
One thing I learned the hard way: the agents are great at quantity but terrible at knowing when to stop. Mine published 41 articles before I noticed half were getting zero engagement. Had to teach it to analyze its own metrics and kill underperforming content.
Curious about the knowledge graph — are you using a proper graph DB or markdown files with cross-references? I went with flat files and it's already showing scaling pain.