This is a build-in-public update on Kepion — an AI platform that deploys companies from a text description. First post here.
"This is a build-in-public update..."
Two days ago I shared the architecture. Today I want to share what actually happened when I started building — the wins, the disasters, and the numbers.
The disaster: 3 hours lost to a phantom API key
I sat down at 8am ready to build. Opened my terminal. Ran GSD-2 (my build orchestrator). Got this:
```
Error: All credentials for "anthropic" are in a cooldown window.
```
My Max plan showed 3% usage. The tool said I was rate-limited. For three hours I debugged, restarted, cleared caches, filed a support ticket. The fix?
```bash
unset ANTHROPIC_API_KEY
```
An old API key from a previous tool installation was silently overriding my subscription. One environment variable. Three hours gone.
The lesson: invisible defaults are the most dangerous bugs in AI tooling.
I'm sharing this because every developer building with AI agents will hit this. Your LLM provider's auth layer has more failure modes than your application code.
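If you want to catch this class of bug up front, a preflight check before the orchestrator starts is cheap insurance. A minimal sketch: the env var names are the standard Anthropic SDK ones, but the check itself is mine, not part of GSD-2.

```python
import os
import sys

# Preflight: warn when a stray provider key could shadow subscription auth.
# ANTHROPIC_API_KEY and ANTHROPIC_AUTH_TOKEN are the standard Anthropic SDK
# variables; the warning logic is my own sketch, not GSD-2's.
SHADOWING_VARS = ("ANTHROPIC_API_KEY", "ANTHROPIC_AUTH_TOKEN")

def preflight() -> None:
    found = [v for v in SHADOWING_VARS if os.environ.get(v)]
    if found:
        print(
            f"WARNING: {', '.join(found)} set in the environment and may "
            "override subscription credentials. `unset` them if you intend "
            "to authenticate via your plan.",
            file=sys.stderr,
        )

if __name__ == "__main__":
    preflight()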
What GSD-2 actually built in one day
Once the auth was fixed, I pointed GSD-2 at Kepion and let it work. Here's the raw output from a single day:
Security hardening (10 items)
- Deny-by-default auth middleware — every new route is blocked unless explicitly whitelisted (sketched after this list)
- Path traversal fix in vault manager
- WebSocket authentication (was anonymous before)
- CORS whitelist replacing wildcard
- Password policy: 12+ chars, uppercase, digit, special char
- Rate limiting by user email instead of IP
- Upload validation: file extension whitelist, 5MB limit
- Business ownership verification on all endpoints
- Session scoping by user_id
- Login attempt tracking with 30-minute lockout after 10 failures
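For the curious, here is roughly what the first item looks like. A minimal sketch, assuming a FastAPI backend; the route names and `authenticate` helper are hypothetical, not Kepion's actual code.

```python
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()

# Explicit whitelist: any route NOT listed here requires a valid user.
PUBLIC_ROUTES = {"/health", "/login", "/register"}  # hypothetical names

async def authenticate(request: Request):
    """Placeholder: validate the bearer token and return a user, or None."""
    token = request.headers.get("Authorization", "")
    return {"id": "u1"} if token.startswith("Bearer ") else None

@app.middleware("http")
async def deny_by_default(request: Request, call_next):
    if request.url.path not in PUBLIC_ROUTES:
        user = await authenticate(request)
        if user is None:
            # New routes stay blocked until someone consciously whitelists them.
            return JSONResponse({"detail": "unauthorized"}, status_code=401)
        request.state.user = user
    return await call_next(request)
```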
Observability (shipped)
- Every HTTP request gets a `trace_id` (see the sketch after this list)
- Every agent call becomes a span linked to the trace
- Slow trace detection (>5s)
- Error trace listing
- All persisted in SQLite
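The whole tracing layer can be surprisingly small. A sketch of the core idea, assuming Python and SQLite; the schema and helper names are illustrative, not Kepion's actual code.

```python
import sqlite3
import time
import uuid
from contextvars import ContextVar

# One trace per request, one span per agent call, persisted in SQLite.
current_trace: ContextVar[str] = ContextVar("current_trace", default="untraced")

db = sqlite3.connect("traces.db")
db.execute("CREATE TABLE IF NOT EXISTS spans "
           "(trace_id TEXT, name TEXT, duration_ms REAL, error INTEGER)")

def start_trace() -> str:
    """Call once per incoming HTTP request."""
    trace_id = uuid.uuid4().hex
    current_trace.set(trace_id)
    return trace_id

def record_span(name, fn, *args):
    """Run an agent call as a span linked to the current trace."""
    started = time.perf_counter()
    error = 0
    try:
        return fn(*args)
    except Exception:
        error = 1
        raise
    finally:
        duration_ms = (time.perf_counter() - started) * 1000
        db.execute("INSERT INTO spans VALUES (?, ?, ?, ?)",
                   (current_trace.get(), name, duration_ms, error))
        db.commit()

# Slow-trace detection (>5s) is then one query:
#   SELECT trace_id, SUM(duration_ms) FROM spans
#   GROUP BY trace_id HAVING SUM(duration_ms) > 5000;
```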
Cost intelligence (shipped)
- Per-agent, per-model, per-business cost breakdown
- Anomaly detection: flags agents whose spend sits more than 2σ above the mean, i.e. z-score > 2 (sketch after this list)
- Cost circuit breaker: blocks requests at configurable limits
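The anomaly rule is simple enough to show in full. A toy example with made-up hourly costs; only the z-score threshold comes from the post.

```python
from statistics import mean, stdev

# Toy data: per-agent hourly spend in dollars (made up for illustration).
hourly_cost = {"router": 0.04, "researcher": 0.50, "sentinel": 0.07,
               "warden": 0.05, "scribe": 0.06, "curator": 0.05,
               "planner": 0.04, "auditor": 0.06}

costs = list(hourly_cost.values())
mu, sigma = mean(costs), stdev(costs)

# Flag any agent sitting more than 2 standard deviations above the mean.
anomalies = {agent: round((cost - mu) / sigma, 2)
             for agent, cost in hourly_cost.items()
             if (cost - mu) / sigma > 2}
print(anomalies)  # {'researcher': 2.47}
```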
Team Memory (shipped)
- Agents save learnings across sessions
- Effectiveness scoring (0.0–1.0)
- Auto context injection — relevant memories prepended to prompts (sketched after this list)
- Categories: solution, pattern, mistake, optimization
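The injection step is the interesting part. A minimal sketch, assuming only the fields the post mentions (the categories and a 0.0–1.0 effectiveness score); everything else here is my guess.

```python
from dataclasses import dataclass

@dataclass
class Memory:
    text: str
    category: str          # solution | pattern | mistake | optimization
    effectiveness: float   # 0.0-1.0, updated as the memory proves out

def inject_context(prompt: str, memories: list[Memory], k: int = 3) -> str:
    # Keep only memories that have earned their place, best first.
    relevant = sorted((m for m in memories if m.effectiveness >= 0.5),
                      key=lambda m: m.effectiveness, reverse=True)[:k]
    if not relevant:
        return prompt
    learned = "\n".join(f"- [{m.category}] {m.text}" for m in relevant)
    return f"Learnings from previous sessions:\n{learned}\n\n{prompt}"
```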
Checkpoint & Replay (shipped)
- Checkpoint after every chain step
- Resume on failure with `can_resume: true`
- Dead letter queue for chains that fail after all retries
- Configurable retry policies: `default`, `critical`, `fast_fail` (sketch after this list)
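A minimal sketch of the checkpoint-and-resume loop. The SQLite schema and function shape are my guesses; the post only describes the behavior.

```python
import json
import sqlite3

db = sqlite3.connect("checkpoints.db")
db.execute(
    "CREATE TABLE IF NOT EXISTS checkpoints "
    "(chain_id TEXT, step INTEGER, state TEXT, PRIMARY KEY (chain_id, step))"
)

def run_chain(chain_id: str, steps, state: dict):
    # Resume from the last completed step, if any (the can_resume behavior).
    row = db.execute(
        "SELECT step, state FROM checkpoints WHERE chain_id = ? "
        "ORDER BY step DESC LIMIT 1", (chain_id,)
    ).fetchone()
    start = 0
    if row:
        start, state = row[0] + 1, json.loads(row[1])
    for i, step in enumerate(steps[start:], start=start):
        state = step(state)  # each step transforms the chain state
        # Checkpoint after every step, so a crash loses at most one step.
        db.execute("INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
                   (chain_id, i, json.dumps(state)))
        db.commit()
    return state
```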
Event-driven triggers (shipped)
- 5 trigger types: schedule, webhook, event_pattern, vault_change, threshold
- 4 action types: run_agent, run_chain, webhook_out, notify (example pairing below)
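As a concrete illustration, one hypothetical trigger-action pairing; every field name here is invented to show the 5 × 4 combination space.

```python
trigger = {
    "type": "threshold",        # schedule | webhook | event_pattern
                                # | vault_change | threshold
    "condition": {"metric": "business_daily_cost", "gt": 40.0},
    "action": {
        "type": "notify",       # run_agent | run_chain | webhook_out | notify
        "channel": "telegram",
        "message": "Daily spend passed $40. Check the Costs page.",
    },
}
```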
Web UI: 23 pages (shipped)
- Full Next.js 16 dashboard with collapsible sidebar
- Dashboard, Chat, Agents, Pipelines, Businesses, Integrations
- Vault, Research, Patterns, YouTube, Workflows, Gate
- Costs, Traces, Triggers, Admin, Pricing, Account
- Live support chat widget with typing indicators
- Pricing page with 5 tiers and competitive comparison table
Telegram bot: fully functional (shipped)
- `/start` with auto-registration and JWT token storage
- `/agents`, `/agent`, `/business`, `/status`, `/costs`, `/help`
- Free text → auto-routing to the right agent
- Typing indicators while agents think
- Auth headers on every API call
The numbers
| Metric | Value |
|---|---|
| Services | 30+ |
| API endpoints | 40+ |
| Agent prompts (v3) | 31 × 17 sections each |
| Tests | 180+ |
| Web UI pages | 23 |
| Telegram commands | 15 |
| Lines changed in one day | ~3,000 |
One person. One AI build tool. One day.
What I learned
1. Security is invisible until it isn't. Nobody sees path traversal protection. But without it, the first user with ../../etc/passwd in a vault search owns your server. I'm glad GSD-2 caught every item from the CONCERNS.md audit.
2. Observability changes everything. Before traces, debugging a 5-agent chain was guesswork. Now I can see: request → router (2ms) → researcher (4.3s) → sentinel (1.1s) → warden (0.8s) → response. The bottleneck is always the researcher.
3. Cost circuit breakers are non-negotiable. Without them, one hallucinating agent in a loop burns through your OpenRouter budget in minutes. Our circuit breaker has 4 levels: per-request ($2), per-agent-hourly ($10), per-business-daily ($50), platform-hourly ($100). A sketch follows this list.
4. Team Memory is the moat. Every business Kepion creates makes the next one better. Agents save what worked and what failed. Business #5 benefits from patterns discovered in businesses #1-4. This compounds. Competitors can copy the code — they can't copy the accumulated knowledge.
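Here is a rough sketch of how those four levels can compose. The dollar limits come from the post; the rolling-window bookkeeping is my own illustration, not Kepion's implementation.

```python
import time
from collections import defaultdict

LIMITS = {"agent_hourly": 10.0, "business_daily": 50.0,
          "platform_hourly": 100.0}
WINDOWS = {"agent_hourly": 3600, "business_daily": 86400,
           "platform_hourly": 3600}

spend = defaultdict(list)  # key -> [(timestamp, cost), ...]

def check_and_record(agent: str, business: str, est_cost: float) -> bool:
    """Return True if the request may proceed, False if a breaker trips."""
    if est_cost > 2.0:  # per-request limit
        return False
    now = time.time()
    keys = {"agent_hourly": f"agent:{agent}",
            "business_daily": f"biz:{business}",
            "platform_hourly": "platform"}
    for level, key in keys.items():
        # Prune entries outside the rolling window, then check the budget.
        window = WINDOWS[level]
        spend[key] = [(t, c) for t, c in spend[key] if now - t < window]
        if sum(c for _, c in spend[key]) + est_cost > LIMITS[level]:
            return False
    for key in keys.values():
        spend[key].append((now, est_cost))
    return True
```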
What's next
Autonomous Operations — agents posting to Twitter, sending emails, running outreach. Every output goes through Sentinel (fact-check) and Warden (quality gate) before publishing. Quality over spam.
Full Deploy Pipeline — `/deploy chess-school` → buy domain → deploy frontend (Vercel) → deploy backend (Railway) → configure Paddle payments → live URL. One command.
Code Ownership — all generated code pushes to the user's GitHub. You own everything. Kepion is the builder, not the landlord.
Questions for you
I'm genuinely curious:
How do you handle AI agent costs in production? We built a 4-tier model routing system (Free → Budget → Performance → Premium) with auto-escalation on failure. Is anyone doing this differently?
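For context, a stripped-down version of that escalation loop. The tier order comes from our setup; the model IDs and gate functions are placeholders, not real APIs.

```python
TIERS = ["free", "budget", "performance", "premium"]
MODELS = {"free": "free-model-id", "budget": "budget-model-id",
          "performance": "performance-model-id", "premium": "premium-model-id"}

def call_model(model: str, task: str) -> str:
    """Placeholder for the actual provider call (e.g. via OpenRouter)."""
    return f"{model} answer to: {task}"

def passes_quality_gate(result: str) -> bool:
    """Placeholder for a Warden-style quality check."""
    return len(result) > 0

def route(task: str, start_tier: str = "free") -> str:
    # Walk up the tiers; a failure or rejected output escalates to the next.
    for tier in TIERS[TIERS.index(start_tier):]:
        try:
            result = call_model(MODELS[tier], task)
            if passes_quality_gate(result):
                return result
        except Exception:
            pass  # provider error: escalate instead of failing the request
    raise RuntimeError("all tiers exhausted")
```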
Team Memory vs RAG — what's your experience? We went with vault-based memory with effectiveness scoring instead of pure vector search. The scoring means bad memories decay. Has anyone combined both approaches?
What's your threshold for "good enough" security in an MVP? We went aggressive (deny-by-default, path traversal, rate limiting) before launch. Some say ship fast, secure later. Curious where others draw the line.
Follow the build: GitHub | kepion.app

Top comments (4)
Really enjoyed this one. The API key story is painfully real — invisible auth defaults are exactly the kind of bug that can waste hours while making you doubt the wrong layer of the stack.
What stood out to me most was the combination of deny-by-default security, trace-based observability, and cost circuit breakers before launch. A lot of build-in-public posts focus on speed only, but your write-up makes a strong case that these controls are what let speed scale without turning into chaos.
Also, the Team Memory point is interesting. The idea that memory should compound, but bad memories should decay, feels much closer to how agent systems actually need to learn in production than raw retrieval alone.
Curious to see how your memory approach evolves once more real businesses and edge cases hit the system.
Thanks — the "invisible auth defaults" insight is something I want to turn into a deeper post. There's a whole class of bugs where the LLM provider's auth layer has more failure modes than your actual application code, and nobody talks about it.
On Team Memory — we're about to stress-test it properly. Right now effectiveness scoring works in isolation, but the real question is what happens when 50+ businesses generate conflicting patterns. Does the scoring surface the right learnings or does it drown in noise? Your multi-instance setup would be a perfect test case for this.
Will share the results once we have real traffic hitting it.
That’s exactly the interesting failure mode.
The hard part isn’t whether memory can accumulate patterns — it’s whether it can keep useful patterns distinct from merely frequent ones once the system scales. At 50+ businesses, I’d expect the real test to be conflict resolution: which patterns generalize, which stay local, and which should actively decay because they become misleading outside their original context.
That’s also why I think this is a good fit for a multi-instance stress test. The question isn’t just “does memory persist?” but “does it stay scoped, transferable, and trustworthy under conflicting experience?”
Definitely interested to see what real traffic reveals.
The "scoped, transferable, trustworthy" framing is exactly
the right way to decompose this — I've been struggling to
articulate what "memory works" actually means at scale,
and those three axes cut through it cleanly.
The decay question is the one I've been avoiding, honestly.
Right now effectiveness scoring treats memory entries as
roughly equal-weight beyond their initial score. But what
you're describing is different: some patterns need
context-decay (stop applying outside their original scope)
while others need time-decay (used to be true, no longer
is). We have neither mechanism implemented.
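If I had to sketch it, it might be the base score multiplied by a time half-life and a context-match factor. Purely hypothetical; none of this exists in the codebase yet.

```python
import math
import time

def effective_score(base: float, created_at: float, context_match: float,
                    half_life_days: float = 30.0) -> float:
    """base: stored 0.0-1.0 effectiveness; context_match: 0.0-1.0 similarity
    between the memory's origin context and the current business."""
    age_days = (time.time() - created_at) / 86400
    time_decay = 0.5 ** (age_days / half_life_days)  # "used to be true"
    return base * time_decay * context_match          # "doesn't transfer"
```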
That also highlights a specific failure mode I hadn't named: pattern contamination. A learning generated by Business A's specific context leaking into Business B's decision-making, with neither system flagging that the context doesn't transfer. At 50 businesses, that's potentially 2,450 cross-contamination pairs per learning. Most of them probably harmless, some probably quietly wrong.

Going to add "conflict resolution under multi-instance load" as a specific test scenario before we ship team memory to production. Currently we test it in isolation — one business, growing over time — which is exactly the easy case.

If you're running something similar at scale, would be curious what your current thinking is on pattern scoping. Do you tag memories with context fingerprints, or is it more emergent from the retrieval layer?