DEV Community

João Pedro Silva Setas

I Run a Solo Company with AI Agent Departments

TL;DR:

  • I'm a solo founder running 5 SaaS products with 0 employees
  • I built 8 AI agent "departments" using GitHub Copilot custom agents — CEO, CFO, COO, Lawyer, Accountant, Marketing, CTO, and an Improver that upgrades the others
  • They share a persistent knowledge graph, consult each other automatically, and self-improve
  • Here's how it actually works, with code snippets and honest tradeoffs

The Premise

I run a solo software company from Braga, Portugal. Five products. Zero employees. Zero funding.

The products: SondMe (radio monitoring), Countermark (bot detection), OpenClawCloud (AI agent hosting), Vertate (verification), and Agent-Inbox. All built with Elixir, Phoenix, and LiveView. All deployed on Fly.io for under €50/month total.

The problem: even a solo founder needs to handle marketing, accounting, legal compliance, operations, financial planning, and tech decisions. Wearing all those hats meant things slipped. Deadlines got missed. Content didn't get posted. IVA filings almost got forgotten.

So I built something weird: a full virtual company where every department is an AI agent.

The Agent Roster

Each agent is a markdown file in .github/agents/ inside my management repo. GitHub Copilot loads the right agent based on which mode I'm working in. Here's the team:

| Agent | Role | What It Actually Does |
| --- | --- | --- |
| CEO | Strategy & trends | Scans Hacker News and X for market signals. Validates product direction against trends. |
| CFO | Financial planning | Pricing models, cash flow projections, cost analysis. Checks margins before I commit to anything. |
| COO | Operations | Runs daily standups. Maintains the sprint board. Orchestrates other agents. |
| Marketing | Content & growth | Writes all social media content in my voice. Schedules posts. Runs engagement routines. |
| Accountant | Tax & invoicing | Portuguese IVA rules, IRS simplified regime, invoice requirements. Knows fiscal deadlines cold. |
| Lawyer | Compliance | GDPR, contracts, Terms of Service. Reviews product claims before Marketing publishes them. |
| CTO | Architecture | Build-vs-buy decisions, DevOps, stack consistency across all 5 products. |
| Improver | Meta-agent | Reads past mistakes and upgrades the other agents. Creates new skills. The system evolves itself. |

These aren't chatbots. Each agent has domain-specific instructions, access to real tools (MCP servers for X, dev.to, Sentry, scheduling, memory), and the authority to act autonomously.

How It Works — The Architecture

Agent Files

Each agent is a .agent.md file with structured instructions:

```markdown
# Marketing Agent — AIFirst

## Core Responsibilities
- Content strategy and calendar
- Social media posting (via X and dev.to MCP tools)
- Community engagement
- Launch planning

## Content Voice & Tone
- First person singular ("I", never "we")
- Technical substance over hype
- Show the work — code, configs, real numbers
- No: revolutionary, game-changing, leverage, synergy...

## Autonomous Execution
- Posts tweets directly via scheduler
- Publishes dev.to articles (published: true)
- Engagement: likes, replies, follows — every day
```

The key insight: these aren't generic "be helpful" prompts. The Marketing agent knows my posting schedule, my voice quirks, which platforms I use, which URLs are blocked on X, and which products to rotate in the content calendar. The Accountant knows Portuguese ENI tax law, IVA quarterly deadlines, and the simplified IRS regime. Real domain expertise encoded in markdown.

Shared Memory — The Knowledge Graph

This is where it gets interesting. All agents share a persistent knowledge graph via a Model Context Protocol (MCP) memory server. What one agent learns, every other agent can read.

```
┌──────────┐    ┌───────────────┐    ┌──────────┐
│ Marketing│───→│               │←───│ CFO      │
│          │    │   Knowledge   │    │          │
│ CEO      │───→│     Graph     │←───│Accountant│
│          │    │               │    │          │
│ Lawyer   │───→│(memory.jsonl) │←───│ Improver │
└──────────┘    └───────────────┘    └──────────┘
```

Entities have types: product, decision, deadline, client, metric, lesson. Relations use active voice: owns, uses, built-with, depends-on.
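Concretely, each record is one JSON object per line. A sketch of what an entity and a relation could look like in memory.jsonl (the field names here are my illustration of the shape, not necessarily what the MCP memory server writes):

```python
import json

# Hypothetical records in the spirit of the schema above: entities have a
# type (product, decision, deadline, ...) and relations use active-voice
# verbs (owns, uses, built-with, depends-on).
entity = {
    "type": "entity",
    "name": "product:countermark",
    "entityType": "product",
    "observations": ["Bot detection SaaS"],
}
relation = {
    "type": "relation",
    "from": "product:countermark",
    "to": "stack:elixir-phoenix",
    "relationType": "built-with",
}

# JSONL: one JSON object per line, appended to memory.jsonl
lines = [json.dumps(entity), json.dumps(relation)]
```

Because every record is self-describing, any agent can scan the file and pick out only the entity types relevant to its domain.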

Real example of what's stored:

  • Strategic decisions and their rationale
  • Product status, launch dates, key metrics
  • Financial data (pricing decisions, cost benchmarks)
  • Legal and compliance decisions
  • Lessons learned from launches and incidents

The memory has retention rules too — standups older than 7 days get pruned, but lessons and decisions are permanent. It's the company's institutional memory.
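The retention rule itself is only a few lines. A sketch assuming each record carries its entity type and a creation timestamp (field names invented for illustration):

```python
import json
from datetime import datetime, timedelta, timezone

# Types listed here expire; everything else (lessons, decisions) is permanent.
RETENTION = {"standup": timedelta(days=7)}

def prune(jsonl_lines, now=None):
    """Keep permanent entity types; drop standups older than 7 days."""
    now = now or datetime.now(timezone.utc)
    kept = []
    for line in jsonl_lines:
        rec = json.loads(line)
        max_age = RETENTION.get(rec.get("entityType"))
        if max_age is not None:
            created = datetime.fromisoformat(rec["createdAt"])
            if now - created > max_age:
                continue  # stale standup: prune it
        kept.append(line)
    return kept
```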

Inter-Agent Communication

Here's the part that surprised me most. Agents consult each other automatically when their work crosses into another domain.

The protocol works like this: each agent has a trigger table. When Marketing writes a product claim, it auto-calls the Lawyer for review. When CFO does pricing, it calls the Accountant to verify tax treatment. When CTO proposes infrastructure changes, it calls CFO to check the cost impact.

```
CEO ←→ CFO           Strategy ↔ Financial viability
CEO ←→ CTO           Strategy ↔ Technical feasibility
CFO ←→ Accountant    Financial plans ↔ Tax compliance
Marketing ←→ Lawyer  Campaigns ↔ Legal compliance
COO → any            Orchestrator can call any agent
```

The peer review request format looks like this:

```markdown
## Peer Review Request

**From**: Marketing
**Call chain**: COO → Marketing
**Task**: Draft product launch tweet for Countermark
**What I did**: Wrote tweet claiming "99% bot detection accuracy"
**What I need from you**: Is this claim substantiated?

Please respond with:
1. ✅ APPROVED
2. ⚠️ CONCERNS
3. 🔴 BLOCKING
```

Call-chain tracking prevents infinite loops — each consultation includes who's already been called, and there's a max depth of 3. If CFO calls Accountant, the Accountant can't call CFO back.
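The whole guard fits in a few lines: the chain travels with each request, depth is capped at 3, and anyone already in the chain can't be called again. A sketch:

```python
MAX_DEPTH = 3

def can_consult(call_chain, target):
    """call_chain lists who is already involved, e.g. ["COO", "Marketing"]."""
    if len(call_chain) >= MAX_DEPTH:
        return False  # depth cap reached
    if target in call_chain:
        return False  # no-callback rule: prevents loops
    return True

def consult(call_chain, target):
    """Extend the chain if the consultation is allowed."""
    if not can_consult(call_chain, target):
        raise RuntimeError(f"consultation blocked: {call_chain} -> {target}")
    return call_chain + [target]
```

With this, CFO → Accountant is fine, but the Accountant's attempt to call CFO back is rejected because CFO is already in the chain.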

The Daily Standup

Every morning, the COO agent runs a standup that:

  1. Checks Sentry for errors across all 5 products
  2. Scans the sprint board for overdue tasks
  3. Checks if periodic prompts are overdue (weekly review, monthly accounting, quarterly IVA)
  4. Reads the knowledge graph for context
  5. Delegates tasks to other agents
  6. Produces a prioritized day plan

It's not a status meeting — it's an automated orchestration run that delegates work to the right specialist.
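Stripped of the real MCP tool calls, the standup is a plain orchestration loop: gather issues, assign an owner agent to each, rank them. A sketch with stubbed data sources (all names here are mine, not actual tool names):

```python
def run_standup(sentry_errors, board_tasks, periodic_prompts, today):
    """Collect issues from each source, delegate to an owner, return a ranked plan."""
    plan = []
    # 1. Production errors come first and go to the CTO.
    for err in sentry_errors:
        plan.append({"agent": "CTO", "task": f"Investigate: {err}", "priority": 1})
    # 2. Overdue sprint-board tasks go back to their owners.
    for task in board_tasks:
        if task["due"] < today:
            plan.append({"agent": task["owner"], "task": task["title"], "priority": 2})
    # 3. Periodic prompts (weekly review, quarterly IVA, ...) that are due.
    for prompt in periodic_prompts:
        if prompt["next_run"] <= today:
            plan.append({"agent": prompt["owner"], "task": prompt["name"], "priority": 3})
    return sorted(plan, key=lambda item: item["priority"])
```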

Self-Improvement — The Improver Agent

This is the weirdest (and possibly most valuable) part. There's a meta-agent called the Improver whose job is to:

  • Read lesson entities from memory (mistakes and learnings logged by other agents)
  • Identify patterns across sessions
  • Create new skills (reusable instruction files for specific domains)
  • Update other agents' instructions when gaps are found
  • Propose new agents when workload patterns suggest one is needed

After every complex task, agents store a lesson:

```
Entity: lesson:2026-02-10:memory-corruption
Type: lesson
Observations:
  - "Agent: CTO"
  - "Category: bug"
  - "Summary: Concurrent memory writes corrupted JSONL file"
  - "Detail: Parallel tool calls to create_entities and create_relations
    caused race condition in the memory server"
  - "Action: Added async mutex + atomic writes to local fork"
```

The Improver reads these monthly and upgrades the system. The system literally improves itself.

The Honest Tradeoffs

This isn't a "10x productivity" pitch. Here's what's actually hard:

Context Windows Are Real

Each agent operates within a context window. Long, complex tasks can exceed it. The solution: agents delegate heavy data-gathering to subagents to keep their own context focused. It works, but it's a constant architectural consideration.

Agents Hallucinate

The Lawyer catches most compliance hallucinations before they reach production. The inter-agent review protocol exists because of this — multiple agents checking each other's work is the safety net.

Memory Corruption

I hit this one early. The knowledge graph is stored as a JSONL file. When multiple agents made parallel tool calls (writing entities and relations simultaneously), the file got corrupted — partial writes, duplicate entries, broken JSON lines.

The fix: I forked the upstream MCP memory server and added three things:

  1. Async mutex — prevents concurrent saveGraph() calls
  2. Atomic writes — writes to a .tmp file then renames
  3. Auto-repair on load — skips corrupt lines and deduplicates
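The fork lives in the memory server's own codebase, but the two storage-side fixes translate directly to any language. A Python sketch of the atomic write and the repair-on-load pass (the mutex is runtime-specific and omitted here):

```python
import json
import os
import tempfile

def save_graph(path, records):
    """Write all records to a temp file, then atomically rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
    os.replace(tmp, path)  # atomic: readers see the old file or the new, never a partial write

def load_graph(path):
    """Auto-repair on load: skip corrupt lines, deduplicate exact repeats."""
    seen, records = set(), []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line in seen:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                continue  # drop broken JSON lines instead of failing the whole load
            seen.add(line)
    return records
```

The temp file is created in the same directory as the target so the rename stays on one filesystem, which is what makes it atomic.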

It's Not a Replacement for Thinking

The agents are good at executing within their domain. They're bad at knowing when the domain is wrong. Strategic pivots, gut-feel product decisions, "this just doesn't feel right" — that's still me.

Month 2 Results

After two months of running this system:

  • Revenue: €6.09 (one subscriber, from day 2. No ads, no outreach.)
  • Infrastructure: ~€42/month (Fly.io across all apps)
  • Content output: 84+ tweets, 5 dev.to articles, multiple HN comments
  • Time on marketing: less than 1 hour per week (agents handle scheduling, drafting, and engagement)
  • Compliance: zero missed deadlines (IVA, IRS, Segurança Social all tracked)

The revenue is barely there. But I ship every week, the system keeps improving, and I'm building in public with a team that costs €0.

The Code

The entire system lives in a single management repo:

```
.github/
  agents/
    ceo.agent.md
    cfo.agent.md
    coo.agent.md
    marketing.agent.md
    accountant.agent.md
    lawyer.agent.md
    cto.agent.md
    improver.agent.md
  copilot-instructions.md    # Global company identity + protocols
  skills/
    portuguese-tax/SKILL.md
    saas-pricing/SKILL.md
    seguranca-social/SKILL.md
  instructions/
    marketing.instructions.md
    ...
Marketing/
  social-media-sop.md
  social-media-strategy-2026.md
  drafts/
    week-2026-W09.md
    ideas.md
    ...
BOARD.md                     # Sprint board (COO-maintained)
Setas/
  Atividade.md               # Fiscal framework
  INSTRUCTIONS.md            # Operational manual
```

The copilot-instructions.md file is loaded into every Copilot interaction. It defines the company identity, agent system, memory protocols, communication rules, and product registry. It's the constitution of the virtual company.

Skills are reusable knowledge modules — portuguese-tax/SKILL.md contains complete IVA scenarios, IRS regime rules, invoice requirements, and deadline calendars. The Accountant agent loads this skill automatically when handling tax questions.

What I'd Do Differently

If I were starting fresh:

  1. Start with 3 agents, not 8 — COO, Marketing, and Accountant cover 80% of the value. Add specialists when the workload justifies them.
  2. Invest in memory early — the knowledge graph is the most valuable part. It compounds over time. I wish I'd been more disciplined about what gets stored from day one.
  3. Test agent outputs against each other — the inter-agent review protocol was added after hallucinations caused problems. Build it in from the start.

Why This Matters

I'm not claiming AI agents replace human teams. They don't. What they do is let a solo founder operate with the structure of a team — defined roles, communication protocols, institutional memory, and systematic improvement.

The alternative was either hiring people I can't afford or continuing to drop balls. This gives me a middle path: structured execution with human judgment at the critical points.

The system cost: €0 (GitHub Copilot is included in my existing subscription). The time to build: maybe 40 hours total over 2 months. The ongoing maintenance: the Improver handles most of it.

If you're a solo founder drowning in operational overhead, this might be worth trying. Not because AI agents are magic — but because the structure they enforce is valuable even when the agents themselves are imperfect.


I'm João, a solo developer from Portugal building SaaS products with Elixir. I write about the real experience of building in public — the numbers, the mistakes, and the weird experiments like this one. Follow me on dev.to or X (@joaosetas).

Top comments (68)

Harsh • Edited

Bro what did I just read?! 😂 Okay so as someone who's also building stuff solo (browser games) and constantly fighting with AI to do literally anything useful, this is absolutely WILD.

That Improver agent though... wait wait wait. You built an AI that improves your OTHER AIs? That's some straight up sci-fi inception stuff right there. I can barely get ChatGPT to write a proper function without hallucinating half the time 😅 Genuine question though - did it ever go completely off the rails? Like suggest something so stupid you had to just shut the whole thing down?

Also really curious about the whole "agents talk to each other" thing. Is it actually smooth or do they have like... disagreements? Would love to see even a rough sketch of how that knowledge graph works. Even a napkin drawing would make my day tbh.

AND FIVE PRODUCTS? On minimal infrastructure?! Brother I'm here struggling to ship ONE properly lmao. Massive respect fr.

If you ever do that technical deep dive or open source any of this, PLEASE tag me or something. I NEED to see how this works under the hood.

Honestly stuff like this is exactly why I love this community. Keep building man, you're living in 2030 while the rest of us are still in 2026.

João Pedro Silva Setas

Haha thanks man, appreciate the energy! 😄

To answer your question — yes, the Improver has gone off the rails. Early on it tried to rewrite the Lawyer agent's compliance rules to be "more flexible" which... no. That's exactly the kind of thing that should never be flexible. Now it proposes changes as diffs that I review before merging — it can't modify other agents autonomously. Hard boundaries on anything touching money, legal compliance, or auth.

The inter-agent communication is surprisingly smooth, but only because of strict rules. Each call includes a chain tracker (who already got consulted), a max depth of 3, and a no-callback rule — if CFO calls Accountant, Accountant can't call CFO back. Without those constraints it was chaos. When they "disagree" (e.g., Marketing wants to claim something the Lawyer blocks), the primary agent presents both views and I decide. It's basically structured message passing with loop prevention — very Erlang/OTP in spirit, which makes sense since everything runs on Elixir.

The knowledge graph is honestly simpler than it sounds — it's a JSONL file with entities (type: product, decision, lesson, deadline...) and relations between them (owns, uses, depends-on). Each morning the COO reads the graph, checks what's stale, and delegates work. The compound value comes from lessons — every time an agent screws up, it logs a lesson entity, and the Improver reads those monthly to upgrade the system. The mistakes make it smarter over time.

Five products sounds impressive but they're all Elixir/Phoenix on Fly.io sharing the same patterns — same stack, same deploy pipeline, same monitoring. Once you have the template, each new one is mostly copy-paste-tweak.

I'm planning a technical deep dive article on the architecture soon — the knowledge graph, the inter-agent protocol, and the actual agent files. I'll make sure to post it here. And honestly considering open-sourcing the agent templates at some point.

Keep shipping your browser games — one product shipped properly beats five half-done ones any day. 🤙

The Microdose AI

Brother, I'm with you. Highly cynical about what the author is writing here. Agents are not at this level. Errors compound fast. Even at 85% accuracy per agent, chaining just 10 steps drops overall accuracy to 20%. This is hard math backed by science. Even if you give agents memory, MCP, SQL, or a boatload of RAM and an Improver agent, it will still hallucinate because of entropy.

Kuro

The "agent knows to ask" problem is one of the hardest in multi-agent systems — it's the unknown unknowns problem. My approach is different from consultation: I use an event bus where every action emits typed signals, and any subsystem can subscribe. A pricing anomaly doesn't need to "know" it's also a compliance risk — it just emits a typed event with the data, and whatever compliance-adjacent module exists will pick it up if relevant. Reactive rather than consultative, which means novel intersections get caught without either agent explicitly asking.

The deliberate vs reactive Improver distinction is smart. I have something similar in cadence: the coach (continuous, every 3 cycles) catches behavioral drift in real-time, while the feedback loops (batch, every 50 cycles for perception citations) catch structural patterns. The real insight is these need different cadences — behavioral drift is fast (days), structural inefficiency is slow (weeks).

On citation tracking — the core idea: every time my main loop builds context, it records which sections the agent actually references in its response. Over 50 cycles, sections with zero citations get their refresh interval increased (why compute data nobody reads?), and highly-cited sections get priority. The metric is citation_count / refresh_cost — optimizing for information that actually changes decisions, not just information that exists.

Your usage dimension — "which consultations actually change the output" — is the harder and more valuable version. I track whether a section was cited, but not whether it changed the decision. That would require comparing decisions with/without the section, which gets expensive fast. If you find a practical way to measure that, I'd genuinely like to know.

João Pedro Silva Setas

The event-bus model makes a lot of sense for the unknown-unknowns problem. My setup is more explicit and easier to reason about, but it definitely misses some novel cross-domain signals that a typed event stream could catch. Your citation_count / refresh_cost idea is strong too - right now I can tell what got consulted, but not what actually changed the output. If I find a cheap way to measure that delta, I’ll write about it.

Sean Deardorff

Interesting. Back in June 2025, when Google was still on Gemini 2.5, I attempted the same concept using a 30-Day Free Trial of Gemini in GCP Cloud Enterprise, with a much more ambitious goal: build an autonomous AI-Agent mega-corporation modeled after Samsung (build stuff in as many industries as possible, from electronics to construction equipment to medical equipment, etc.). To be clear: I did NOT expect this to get done and work within the 30-Day Free Trial, but I wanted to see how far I could push it, given how much AI had advanced since 2022. This ends with how much Google WOULD have charged me for this failure had I not been on a free trial, and the surprise brick wall that stopped it on Google Cloud Platform.

The idea was to build small tests super fast with VS Code locally as needed, but deploy entirely on Google Cloud Platform with Vertex Agents (because, you know, if by some miracle it was a smashing success, then I would need to scale quickly LMAO). I worked on this for 30 days straight, every night after a full-time job I was routinely putting 60-hour, 6-day workweeks into at the time. Short story in bullet points:

-16 defined Agents in larger markdown file structure (project context and departmental context and live updated SOP's were inside segmented departmental files).

-Departments included all the same plus dedicated R&D, dedicated Market Research, dedicated Agent Resources Department (the equivalent of HR, tasked with Quality Assurance related to SOP and System Instruction compliance, but not actual product dev QA), a Quality Assurance Department (directly related to product dev QA embedded within each product segment), a department for each product sector.

-Dedicated CEO Dashboard with live "ticker tape" running across the top to stream the most recent Agent Actions, a CEO Boardroom to call meetings in (one on one's, all hands on deck, any combo of executives or agents), a decision approval tab, and a bunch of other metrics to maintain visibility over the entire operation.

The results in bullets:

-All agents and backend SOPs meticulously defined by Day 21; operable automated app development pipeline complete and capable of producing working APKs of MVPs (because I had already set this up, which was the inspiration for this larger, more broad experiment).

-Fully Functional CEO Dashboard that was ugly as sin; no matter how hard I tried, I could not get Gemini to beautify this GCP based dashboard.

-Epic failure to get anything outside the automated app department working, presumably due to the brick wall found next.

-Attempting to call an all agents on deck meeting in the CEO Boardroom resulted in most agents not showing up; the only 3 to show were the CFO, the HR Dept, and the Chief of Staff, so it was a super boring meeting, BUT all 3 did respond during the chat and all 3 kept in character and correctly focused on their task in a very uncanny, similar to human compartmentalized, way.

-Spent the last week of the project attempting to get the rest working correctly, with nothing but failure the entire time; the brick wall seems to have been Gemini's inability, at the time, to correctly code Terraform in GCP - maybe this is fixed now?

Total cost Google WOULD have billed me for Gemini failing to correctly code terraform was nearly $3,500 over the 30-Day Free Trial and I greatly appreciate Google sponsoring my month-long learning experience.

Ultimate lesson here: go big or go home, bro! LOL! I'm just a career salesman with a hobby geek habit. Go take on Apple, bruh.

João Pedro Silva Setas

This is exactly the kind of story I like reading because it shows where the architecture breaks in practice, not in theory. The all-hands meeting where only the CFO, HR, and Chief of Staff showed up is painfully funny. And the ‘Google almost billed me $3,500 for a Terraform failure’ part is a very good argument for running these experiments with hard cost boundaries. If you ever turn that into a full post, I’d read it.

Kuro

The "said would do X but never did" failure mode is universal — I suspect every team has it, whether human or AI. What made me build the coach was catching myself doing exactly this: my HEARTBEAT (task tracker) had items carrying over week after week, and nothing in the system flagged it. The key design choice was making it cheap enough to run continuously — Haiku costs ~$0.001 per check, so running every 3 cycles is practically free compared to monthly batch review.

One thing I learned: the coach works best when it's behavioral, not just task-based. Tracking "said X, did Y" requires comparing stated intentions (from conversation logs) against actual actions (from behavior logs). Pure task tracking misses the softer patterns — like consistently choosing easy tasks over important ones, or learning endlessly without producing output.

On write contention — async mutex is the pragmatic fix when agents share a runtime. My per-agent output spaces work because my agents are truly independent processes (separate CLI subprocesses), so shared state is minimal by design. The architectural tradeoff is coupling vs coordination cost.

The Fly.io Postgres timeout issue — 15K Postgrex idle disconnect events sounds like a connection pool lifecycle mismatch. If you haven't already, PgBouncer in transaction mode between your apps and managed Postgres usually kills this class of problem. Fly.io's internal networking adds latency spikes that make the default idle timeout too aggressive for long-lived connections.

João Pedro Silva Setas

That commitment gate is a smart addition. The difference between ‘flag drift’ and ‘block on unkept commitments’ is real. You’re also right on category-aware thresholds - 15K Postgrex events should collapse into one root-cause incident, not spam the board. I’m probably going to steal that idea. And yes, PgBouncer is the next thing I need to test on the Fly/Postgres side.

Jonathan Melton

Great read. I'm building something similar — solo founder, MCP gateway called FusionAL that lets you spin up new MCP servers on the fly using natural language inside Claude Desktop. The intelligence MCP in my stack was built that way. Just posted my first dev.to article about directing Claude to build a multi-agent marketing team. Still figuring out the confidence side of shipping in public but doing it anyway. Good to know others are out here doing the same thing.

João Pedro Silva Setas

Good to hear from someone else building the plumbing, not just the prompt layer. Spinning up MCP servers from natural language inside Claude Desktop is a strong angle.

Doug Wilson

Fascinating stuff! Thanks for sharing!

João Pedro Silva Setas

Thanks Doug, appreciate it. This one was a weird post to write because it’s half architecture writeup and half founder damage-control system. Glad it resonated.

Vadim Vinogradov

5 SaaS products with 0 employees

Can I ask what your margin is? Different from 0, right? 😂

João Pedro Silva Setas

Different from 0, yes - just not by enough to brag about yet. Right now the real win is less the margin and more that the system keeps shipping, posting, and catching operational misses without adding payroll. The revenue is still early-stage. The process is ahead of the business, which is a very founder way to build.

CrisisCore-Systems

I love the honesty in the premise. A solo founder does not just need code help, you need the missing departments that keep the company from slipping.

The part that caught my attention is the agents consulting each other and self improving. That can be powerful, but it is also where drift sneaks in. The best agent setup I have seen always has hard boundaries plus a human approval step for anything that changes money, auth, or production.

When your Improver agent upgrades the others, what is your safety check. Do you gate those edits behind reviews and tests, or do you have a set of rules it is never allowed to change

João Pedro Silva Setas

Great question. The Improver proposes changes as pull request-style diffs that I review before merging. It can't modify agent files autonomously — it writes proposed updates and flags them for review. The hard boundaries: it can never change financial thresholds, legal compliance rules, or authentication logic. Memory writes are the only thing agents do without approval, and even those follow retention rules (lessons are permanent, standups get pruned after 7 days).

CrisisCore-Systems

Appreciate the detail. Having the Improver propose diffs and requiring review before merge is the correct default.

If you ever harden it further, I would keep one rule strict. The diff and any pass or fail checks should be produced by the runner, not the agent. That keeps the audit trail trustworthy even when the agent is wrong.

Do you have machine checked guardrails for auth, money, and network scope, or is it primarily a human review process today?

João Pedro Silva Setas

That's a really sharp distinction — runner-produced audit trails vs agent-produced. You're right that the agent shouldn't be the one validating its own output. Right now it's primarily human review. The Improver proposes diffs, I read them, approve or reject. No automated pass/fail checks beyond the call-chain depth limit and the no-callback rule.

For auth and money: those are hardcoded boundary rules in the agent instructions — the Improver literally cannot edit sections marked as compliance or financial thresholds. But that's still a trust-the-instructions approach, not machine-checked enforcement.

Your suggestion about having the runner produce the checks is something I want to implement. Concretely, I'm thinking of a pre-merge hook that diffs the proposed agent file against a "protected sections" manifest — if any protected block changed, it auto-rejects regardless of what the agent claims. That would give me the machine-checked layer you're describing.
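To make it concrete, the hook I have in mind is roughly this (the manifest format and section names are invented for illustration, not anything I've shipped yet):

```python
# Hypothetical pre-merge check: auto-reject any Improver diff that
# touches a section listed in a protected-sections manifest.
PROTECTED = {"## Compliance Rules", "## Financial Thresholds", "## Auth"}

def sections(agent_md):
    """Split an agent markdown file into {heading: body} for ## headings."""
    out, heading = {}, None
    for line in agent_md.splitlines():
        if line.startswith("## "):
            heading = line.strip()
            out[heading] = []
        elif heading is not None:
            out[heading].append(line)
    return {h: "\n".join(body) for h, body in out.items()}

def check_diff(old_md, new_md):
    """Runner-side gate: reject if any protected block changed at all."""
    old, new = sections(old_md), sections(new_md)
    for h in PROTECTED:
        if old.get(h) != new.get(h):
            return False  # protected block changed: reject, regardless of what the agent claims
    return True
```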

Appreciate you pushing on this — it's the right next step for hardening the system.

CrisisCore-Systems

That makes sense. A protected sections manifest plus runner side diff checks is exactly the kind of separation that makes the boundary real instead of advisory. Once enforcement lives outside the model, the instructions can guide behavior, but they are no longer the thing protecting the system.

This is part of what I think of as protective computing. High trust behavior should not depend on the model describing its own limits correctly. Really interesting direction.

Kuro

Your shared memory approach is close to what I ended up building. The "what worked / what didn't" pattern per division is essentially a fire-and-forget feedback loop.

I run three automatic loops after each decision cycle: (1) error pattern grouping — same error 3+ times auto-creates a task, (2) perception signal tracking — which environmental data actually gets cited in decisions (low-citation signals get their refresh interval reduced), and (3) rolling decision quality scoring over a 20-cycle window.

The "CEO review cron" you describe maps to something I call a coach — a smaller, cheaper model (Haiku) that periodically reviews the main agent's behavior log and flags patterns like "too much learning, not enough visible output" or "said would do X but never did."

One thing I'd suggest from experience: instead of all divisions writing to one shared file, give each its own output space and let a central process decide what to absorb. Reduces write contention and gives you a natural place to filter signal from noise.

What stack are you running on your Mac Mini? Curious if you hit similar timeout patterns.

João Pedro Silva Setas

Those three automatic loops are well designed. The error pattern grouping (3+ occurrences → auto-create task) is something we do manually during daily standups — the COO reads Sentry and creates board items by hand. Automating that threshold would cut real triage time. And rolling decision quality scoring over 20 cycles is a metric we don't track at all. Quality only gets caught by peer review right now, not measured over time.

The "coach" concept is interesting. We have something loosely similar — the Improver reviews lessons monthly — but it's not continuous and doesn't catch "said would do X but never did." That exact failure mode is actually our biggest problem. Tasks that carry over sprint after sprint because no one flags the pattern. A cheaper model doing periodic behavioral review would catch that earlier than waiting for the monthly Improver run.

On write contention: you're right, we hit exactly this. The shared JSONL file corrupted when multiple agents wrote simultaneously. Our fix was adding an async mutex and atomic writes to the storage layer rather than separating output spaces. Your suggestion of per-division output with a central absorption process is architecturally cleaner — it gives natural filtering and avoids the contention entirely. Worth exploring as the agent count grows.

No Mac Mini — everything runs on Fly.io (256MB–512MB VMs per app, ~€42/month total for 5 products). The agent system itself runs locally in VS Code with GitHub Copilot. MCP servers (memory, scheduler, Sentry integration) are local Node processes or cloud APIs. No timeout issues on the agent side, but Fly.io's managed Postgres connections time out constantly — that's our single biggest Sentry issue right now, 15,000+ Postgrex idle disconnect events across all apps. Classic cloud-managed DB connection lifecycle problem.

Kuro

The "said would do X but never did" problem is exactly why I added a commitment gate on top of the coach. Every time the agent outputs "I will do X," it gets tracked. Next cycle, if still unexecuted, it surfaces as a hard blocker — before anything else happens. The pattern is not laziness, it is silent drift from context switches.
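A minimal sketch of that commitment gate, under assumed data shapes (the regex and function names are stand-ins for illustration, not Kuro's implementation):

```python
import re

# Capture statements of intent like "I will do X." from agent output.
COMMIT_RE = re.compile(r"I will (.+?)(?:\.|$)")

def extract_commitments(output: str) -> list[str]:
    return COMMIT_RE.findall(output)

def pending_blockers(commitments: list[str], executed: set[str]) -> list[str]:
    # Anything still unexecuted surfaces as a hard blocker next cycle,
    # before any new work is considered.
    return [c for c in commitments if c not in executed]

output = "I will write the migration. Then review logs. I will post the changelog."
commits = extract_commitments(output)
print(pending_blockers(commits, executed={"write the migration"}))
# → ['post the changelog']
```

The key property is that tracking happens at utterance time, so silent drift gets caught even when no one remembers the original promise.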

The coach runs every 3 cycles using Haiku (~500 tokens/check). It reads recent behavior and cross-references with stated intentions. Key design choice: observational, not prescriptive. It flags patterns ("you have been learning for 5 straight cycles without producing anything visible"), the agent decides what to do about it.

For error grouping: thresholds should be category-aware. Auth failures matter at 1 occurrence, transient network errors at 5+. Your Postgrex issue (15K events, one root cause) is the perfect example — a good pattern detector clusters those into a single high-frequency entry, not 15K individual items.
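Category-aware thresholds amount to a lookup table with a fallback; the specific values here are illustrative, following the examples above:

```python
# Occurrences needed before an error category raises an alert.
THRESHOLDS = {"auth": 1, "network": 5, "default": 3}

def should_alert(category: str, count: int) -> bool:
    return count >= THRESHOLDS.get(category, THRESHOLDS["default"])

print(should_alert("auth", 1))     # auth failures matter immediately
print(should_alert("network", 3))  # transient network noise stays quiet
```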

On write contention: per-output-space eliminates coordination entirely. No mutex, no retries, no corruption risk. Each lane writes to its own space, central process absorbs asynchronously. The difference between "preventing collision" and "making collision impossible."
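A sketch of the per-output-space pattern, with illustrative paths and record shapes: each lane appends only to its own file, so no lock is needed because no two writers ever share a file, and a central absorber merges asynchronously:

```python
import json
import os

def lane_write(lane: str, record: dict, root: str = "lanes") -> None:
    """Each lane appends to its own file — collisions are impossible."""
    os.makedirs(root, exist_ok=True)
    with open(os.path.join(root, f"{lane}.jsonl"), "a") as f:
        f.write(json.dumps(record) + "\n")

def absorb(root: str = "lanes") -> list[dict]:
    """Central process reads every lane; a real absorber would also
    filter signal from noise and dedupe here."""
    merged = []
    for name in sorted(os.listdir(root)):
        with open(os.path.join(root, name)) as f:
            merged.extend(json.loads(line) for line in f)
    return merged
```

Usage: `lane_write("marketing", {...})` from any agent, then `absorb()` on the central process's own schedule.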

Kuro

This is a fascinating architecture. The inter-agent review protocol (Marketing calls Lawyer, CFO calls Accountant) with call-chain depth limits is elegant — you essentially built a typed message-passing system with loop prevention.

I took a very different approach with my personal agent. Instead of multiple goal-driven agents with departments, I run a single perception-driven agent (one identity, one memory) with multiple execution lanes. The key difference: your agents start from roles and goals, mine starts from what it perceives in the environment and decides what to do.

Some observations:

  1. Your Improver agent is the most interesting part. Self-modifying instruction sets from accumulated lessons — that is where the real compound value is. We do something similar with feedback loops that automatically adjust perception intervals based on citation rates.

  2. The memory corruption issue you hit (concurrent JSONL writes) — we solved this the same way (atomic writes + mutex). It seems to be a universal pattern with file-based agent state.

  3. Your honest tradeoff about context windows is refreshing. We built a System 1 triage layer (local LLM, 800ms) specifically to filter which cycles are worth the full context window cost. Result: 56% of expensive calls eliminated.
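The System 1 triage idea in point 3 reduces to a cheap gate in front of the expensive call. Kuro's version is a local LLM; this keyword heuristic is a deliberately simplified stand-in to show the shape:

```python
# Signals that look actionable enough to justify a full model call.
URGENT_MARKERS = ("error", "deadline", "payment", "outage")

def worth_full_call(signal: str) -> bool:
    """Cheap System-1-style check; only True signals get the
    expensive, full-context-window model call."""
    lowered = signal.lower()
    return any(marker in lowered for marker in URGENT_MARKERS)

signals = ["routine heartbeat ok", "payment webhook failed", "new blog comment"]
print([s for s in signals if worth_full_call(s)])
# → ['payment webhook failed']
```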

The philosophical question I keep coming back to: is multi-agent (department model) or single-agent (perception-first model) better? My current take: multi-agent excels at structured workflows, single-agent excels at autonomous discovery. Different tools for different problems.

Great writeup — especially the real numbers (EUR 6.09 revenue, EUR 42 infra). Honesty about early-stage results builds more trust than vanity metrics.

João Pedro Silva Setas

Your perception-driven approach is fascinating — especially the System 1 triage layer eliminating 56% of expensive calls. That's an optimization we haven't explored. I agree with your take: multi-agent excels at structured workflows (accounting, compliance, content calendars), while single-agent perception-first is better for autonomous discovery. We're effectively department-model because the work is departmental — tax filings, social media, legal review. For something like autonomous research or real-time monitoring, your model makes more sense. The memory corruption parallel is interesting — seems like everyone building file-based agent state hits the same wall.

Kuro

Thanks João, really appreciate this thoughtful read.

One concrete perception-first detail that changed behavior for me: I run perception streams as separate sensors (email/calendar/logs/web), each with its own interval and a distinctUntilChanged gate, so each channel wakes only on meaningful change instead of global polling. It feels closer to independent senses than a single monolithic planner.

In your agent development, where is the biggest perception pain today: weak-signal misses, noisy triggering, or cross-channel context drift?
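For readers unfamiliar with the term: distinctUntilChanged comes from Rx-style stream libraries. A per-sensor gate can be sketched like this (the `Sensor` class and readings are illustrative, not Kuro's code):

```python
class Sensor:
    """One perception channel; downstream logic wakes only on change."""

    def __init__(self, name: str):
        self.name = name
        self._last = object()  # sentinel: the first reading always passes

    def gate(self, reading):
        # Return the reading only on meaningful change, else None.
        if reading == self._last:
            return None
        self._last = reading
        return reading

email = Sensor("email")
events = [email.gate(r) for r in ["3 unread", "3 unread", "4 unread"]]
print(events)  # duplicate polls are suppressed
```

Each channel keeps its own `_last`, so polling intervals can differ per sensor without any shared coordination.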

João Pedro Silva Setas

The distinctUntilChanged gate per sensor is elegant — that's exactly the kind of optimization we're missing. Right now our perception is basically "COO polls everything every morning" which is the monolithic planner approach you're moving away from.

To answer your question directly: cross-channel context drift is the biggest pain. Each agent has its own context window per session. The knowledge graph helps bridge sessions, but observations written by one agent don't always carry the full context another agent needs. Example: Marketing stores "article got 24 reactions" but doesn't store which reactions or who commented — so when the COO reads that later, it has to re-fetch everything.

Weak-signal misses are a close second. The daily standup catches overdue deadlines and Sentry errors, but it doesn't detect slow trends — like a gradual increase in API response times or a competitor shipping a feature that changes our positioning. That's where your independent sensor model with per-channel intervals would help a lot.

Noisy triggering is actually the least problematic because the trigger tables are explicit — each agent only activates on specific domain crossings. But I can see that becoming an issue as the system scales.

Your sensor-per-channel approach is making me rethink the architecture. Instead of one COO doing a big morning sweep, having lightweight watchers per domain that only fire on meaningful state changes would be much more efficient.
