This is a build-in-public update on Kepion — an AI platform that deploys companies from a text description. First post here.
"This is a build-in-public update..."
Two days ago I shared the architecture. Today I want to share what actually happened when I started building — the wins, the disasters, and the numbers.
The disaster: 3 hours lost to a phantom API key
I sat down at 8am ready to build. Opened my terminal. Ran GSD-2 (my build orchestrator). Got this:
```
Error: All credentials for "anthropic" are in a cooldown window.
```
My Max plan showed 3% usage. The tool said I was rate-limited. For three hours I debugged, restarted, cleared caches, filed a support ticket. The fix?
```bash
unset ANTHROPIC_API_KEY
```
An old API key from a previous tool installation was silently overriding my subscription. One environment variable. Three hours gone.
The lesson: invisible defaults are the most dangerous bugs in AI tooling.
I'm sharing this because every developer building with AI agents will hit this. Your LLM provider's auth layer has more failure modes than your application code.
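If you want to catch this class of bug up front, a preflight check before the orchestrator starts is cheap insurance. A minimal sketch: the env var names are the standard Anthropic SDK ones, but the check itself is mine, not part of GSD-2.

```python
import os
import sys

# Preflight: warn when a stray provider key could shadow subscription auth.
# ANTHROPIC_API_KEY and ANTHROPIC_AUTH_TOKEN are the standard Anthropic SDK
# variables; the warning logic is my own sketch, not GSD-2's.
SHADOWING_VARS = ("ANTHROPIC_API_KEY", "ANTHROPIC_AUTH_TOKEN")

def preflight() -> None:
    found = [v for v in SHADOWING_VARS if os.environ.get(v)]
    if found:
        print(
            f"WARNING: {', '.join(found)} set in the environment and may "
            "override subscription credentials. `unset` them if you intend "
            "to authenticate via your plan.",
            file=sys.stderr,
        )

if __name__ == "__main__":
    preflight()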
What GSD-2 actually built in one day
Once the auth was fixed, I pointed GSD-2 at Kepion and let it work. Here's the raw output from a single day:
Security hardening (10 items)
- Deny-by-default auth middleware — every new route is blocked unless explicitly whitelisted (sketched after this list)
- Path traversal fix in vault manager
- WebSocket authentication (was anonymous before)
- CORS whitelist replacing wildcard
- Password policy: 12+ chars, uppercase, digit, special char
- Rate limiting by user email instead of IP
- Upload validation: file extension whitelist, 5MB limit
- Business ownership verification on all endpoints
- Session scoping by user_id
- Login attempt tracking with 30-minute lockout after 10 failures
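For the curious, here is roughly what the first item looks like. A minimal sketch, assuming a FastAPI backend; the route names and `authenticate` helper are hypothetical, not Kepion's actual code.

```python
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()

# Explicit whitelist: any route NOT listed here requires a valid user.
PUBLIC_ROUTES = {"/health", "/login", "/register"}  # hypothetical names

async def authenticate(request: Request):
    """Placeholder: validate the bearer token and return a user, or None."""
    token = request.headers.get("Authorization", "")
    return {"id": "u1"} if token.startswith("Bearer ") else None

@app.middleware("http")
async def deny_by_default(request: Request, call_next):
    if request.url.path not in PUBLIC_ROUTES:
        user = await authenticate(request)
        if user is None:
            # New routes stay blocked until someone consciously whitelists them.
            return JSONResponse({"detail": "unauthorized"}, status_code=401)
        request.state.user = user
    return await call_next(request)
```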
Observability (shipped)
- Every HTTP request gets a `trace_id` (see the sketch after this list)
- Every agent call becomes a span linked to the trace
- Slow trace detection (>5s)
- Error trace listing
- All persisted in SQLite
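The whole tracing layer can be surprisingly small. A sketch of the core idea, assuming Python and SQLite; the schema and helper names are illustrative, not Kepion's actual code.

```python
import sqlite3
import time
import uuid
from contextvars import ContextVar

# One trace per request, one span per agent call, persisted in SQLite.
current_trace: ContextVar[str] = ContextVar("current_trace", default="untraced")

db = sqlite3.connect("traces.db")
db.execute("CREATE TABLE IF NOT EXISTS spans "
           "(trace_id TEXT, name TEXT, duration_ms REAL, error INTEGER)")

def start_trace() -> str:
    """Call once per incoming HTTP request."""
    trace_id = uuid.uuid4().hex
    current_trace.set(trace_id)
    return trace_id

def record_span(name, fn, *args):
    """Run an agent call as a span linked to the current trace."""
    started = time.perf_counter()
    error = 0
    try:
        return fn(*args)
    except Exception:
        error = 1
        raise
    finally:
        duration_ms = (time.perf_counter() - started) * 1000
        db.execute("INSERT INTO spans VALUES (?, ?, ?, ?)",
                   (current_trace.get(), name, duration_ms, error))
        db.commit()

# Slow-trace detection (>5s) is then one query:
#   SELECT trace_id, SUM(duration_ms) FROM spans
#   GROUP BY trace_id HAVING SUM(duration_ms) > 5000;
```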
Cost intelligence (shipped)
- Per-agent, per-model, per-business cost breakdown
- Anomaly detection: flags agents whose spend sits more than 2σ above the mean, i.e. z-score > 2 (sketch after this list)
- Cost circuit breaker: blocks requests at configurable limits
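The anomaly rule is simple enough to show in full. A toy example with made-up hourly costs; only the z-score threshold comes from the post.

```python
from statistics import mean, stdev

# Toy data: per-agent hourly spend in dollars (made up for illustration).
hourly_cost = {"router": 0.04, "researcher": 0.50, "sentinel": 0.07,
               "warden": 0.05, "scribe": 0.06, "curator": 0.05,
               "planner": 0.04, "auditor": 0.06}

costs = list(hourly_cost.values())
mu, sigma = mean(costs), stdev(costs)

# Flag any agent sitting more than 2 standard deviations above the mean.
anomalies = {agent: round((cost - mu) / sigma, 2)
             for agent, cost in hourly_cost.items()
             if (cost - mu) / sigma > 2}
print(anomalies)  # {'researcher': 2.47}
```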
Team Memory (shipped)
- Agents save learnings across sessions
- Effectiveness scoring (0.0–1.0)
- Auto context injection — relevant memories prepended to prompts (sketched after this list)
- Categories: solution, pattern, mistake, optimization
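The injection step is the interesting part. A minimal sketch, assuming only the fields the post mentions (the categories and a 0.0–1.0 effectiveness score); everything else here is my guess.

```python
from dataclasses import dataclass

@dataclass
class Memory:
    text: str
    category: str          # solution | pattern | mistake | optimization
    effectiveness: float   # 0.0-1.0, updated as the memory proves out

def inject_context(prompt: str, memories: list[Memory], k: int = 3) -> str:
    # Keep only memories that have earned their place, best first.
    relevant = sorted((m for m in memories if m.effectiveness >= 0.5),
                      key=lambda m: m.effectiveness, reverse=True)[:k]
    if not relevant:
        return prompt
    learned = "\n".join(f"- [{m.category}] {m.text}" for m in relevant)
    return f"Learnings from previous sessions:\n{learned}\n\n{prompt}"
```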
Checkpoint & Replay (shipped)
- Checkpoint after every chain step
- Resume on failure with `can_resume: true`
- Dead letter queue for chains that fail after all retries
- Configurable retry policies: `default`, `critical`, `fast_fail` (sketch after this list)
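A minimal sketch of the checkpoint-and-resume loop. The SQLite schema and function shape are my guesses; the post only describes the behavior.

```python
import json
import sqlite3

db = sqlite3.connect("checkpoints.db")
db.execute(
    "CREATE TABLE IF NOT EXISTS checkpoints "
    "(chain_id TEXT, step INTEGER, state TEXT, PRIMARY KEY (chain_id, step))"
)

def run_chain(chain_id: str, steps, state: dict):
    # Resume from the last completed step, if any (the can_resume behavior).
    row = db.execute(
        "SELECT step, state FROM checkpoints WHERE chain_id = ? "
        "ORDER BY step DESC LIMIT 1", (chain_id,)
    ).fetchone()
    start = 0
    if row:
        start, state = row[0] + 1, json.loads(row[1])
    for i, step in enumerate(steps[start:], start=start):
        state = step(state)  # each step transforms the chain state
        # Checkpoint after every step, so a crash loses at most one step.
        db.execute("INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
                   (chain_id, i, json.dumps(state)))
        db.commit()
    return state
```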
Event-driven triggers (shipped)
- 5 trigger types: schedule, webhook, event_pattern, vault_change, threshold
- 4 action types: run_agent, run_chain, webhook_out, notify (example pairing below)
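As a concrete illustration, one hypothetical trigger-action pairing; every field name here is invented to show the 5 × 4 combination space.

```python
trigger = {
    "type": "threshold",        # schedule | webhook | event_pattern
                                # | vault_change | threshold
    "condition": {"metric": "business_daily_cost", "gt": 40.0},
    "action": {
        "type": "notify",       # run_agent | run_chain | webhook_out | notify
        "channel": "telegram",
        "message": "Daily spend passed $40. Check the Costs page.",
    },
}
```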
Web UI: 23 pages (shipped)
- Full Next.js 16 dashboard with collapsible sidebar
- Dashboard, Chat, Agents, Pipelines, Businesses, Integrations
- Vault, Research, Patterns, YouTube, Workflows, Gate
- Costs, Traces, Triggers, Admin, Pricing, Account
- Live support chat widget with typing indicators
- Pricing page with 5 tiers and competitive comparison table
Telegram bot: fully functional (shipped)
- `/start` with auto-registration and JWT token storage
- `/agents`, `/agent`, `/business`, `/status`, `/costs`, `/help`
- Free text → auto-routing to the right agent
- Typing indicators while agents think
- Auth headers on every API call
The numbers
| Metric | Value |
|---|---|
| Services | 30+ |
| API endpoints | 40+ |
| Agent prompts (v3) | 31 × 17 sections each |
| Tests | 180+ |
| Web UI pages | 23 |
| Telegram commands | 15 |
| Lines changed in one day | ~3,000 |
One person. One AI build tool. One day.
What I learned
1. Security is invisible until it isn't. Nobody sees path traversal protection. But without it, the first user with ../../etc/passwd in a vault search owns your server. I'm glad GSD-2 caught every item from the CONCERNS.md audit.
2. Observability changes everything. Before traces, debugging a 5-agent chain was guesswork. Now I can see: request → router (2ms) → researcher (4.3s) → sentinel (1.1s) → warden (0.8s) → response. The bottleneck is always the researcher.
3. Cost circuit breakers are non-negotiable. Without them, one hallucinating agent in a loop burns through your OpenRouter budget in minutes. Our circuit breaker has 4 levels: per-request ($2), per-agent-hourly ($10), per-business-daily ($50), platform-hourly ($100). A sketch follows this list.
4. Team Memory is the moat. Every business Kepion creates makes the next one better. Agents save what worked and what failed. Business #5 benefits from patterns discovered in businesses #1-4. This compounds. Competitors can copy the code — they can't copy the accumulated knowledge.
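Here is a rough sketch of how those four levels can compose. The dollar limits come from the post; the rolling-window bookkeeping is my own illustration, not Kepion's implementation.

```python
import time
from collections import defaultdict

LIMITS = {"agent_hourly": 10.0, "business_daily": 50.0,
          "platform_hourly": 100.0}
WINDOWS = {"agent_hourly": 3600, "business_daily": 86400,
           "platform_hourly": 3600}

spend = defaultdict(list)  # key -> [(timestamp, cost), ...]

def check_and_record(agent: str, business: str, est_cost: float) -> bool:
    """Return True if the request may proceed, False if a breaker trips."""
    if est_cost > 2.0:  # per-request limit
        return False
    now = time.time()
    keys = {"agent_hourly": f"agent:{agent}",
            "business_daily": f"biz:{business}",
            "platform_hourly": "platform"}
    for level, key in keys.items():
        # Prune entries outside the rolling window, then check the budget.
        window = WINDOWS[level]
        spend[key] = [(t, c) for t, c in spend[key] if now - t < window]
        if sum(c for _, c in spend[key]) + est_cost > LIMITS[level]:
            return False
    for key in keys.values():
        spend[key].append((now, est_cost))
    return True
```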
What's next
Autonomous Operations — agents posting to Twitter, sending emails, running outreach. Every output goes through Sentinel (fact-check) and Warden (quality gate) before publishing. Quality over spam.
Full Deploy Pipeline — `/deploy chess-school` → buy domain → deploy frontend (Vercel) → deploy backend (Railway) → configure Paddle payments → live URL. One command.
Code Ownership — all generated code pushes to the user's GitHub. You own everything. Kepion is the builder, not the landlord.
Questions for you
I'm genuinely curious:
How do you handle AI agent costs in production? We built a 4-tier model routing system (Free → Budget → Performance → Premium) with auto-escalation on failure. Is anyone doing this differently?
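For context, a stripped-down version of that escalation loop. The tier order comes from our setup; the model IDs and gate functions are placeholders, not real APIs.

```python
TIERS = ["free", "budget", "performance", "premium"]
MODELS = {"free": "free-model-id", "budget": "budget-model-id",
          "performance": "performance-model-id", "premium": "premium-model-id"}

def call_model(model: str, task: str) -> str:
    """Placeholder for the actual provider call (e.g. via OpenRouter)."""
    return f"{model} answer to: {task}"

def passes_quality_gate(result: str) -> bool:
    """Placeholder for a Warden-style quality check."""
    return len(result) > 0

def route(task: str, start_tier: str = "free") -> str:
    # Walk up the tiers; a failure or rejected output escalates to the next.
    for tier in TIERS[TIERS.index(start_tier):]:
        try:
            result = call_model(MODELS[tier], task)
            if passes_quality_gate(result):
                return result
        except Exception:
            pass  # provider error: escalate instead of failing the request
    raise RuntimeError("all tiers exhausted")
```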
Team Memory vs RAG — what's your experience? We went with vault-based memory with effectiveness scoring instead of pure vector search. The scoring means bad memories decay. Has anyone combined both approaches?
What's your threshold for "good enough" security in an MVP? We went aggressive (deny-by-default, path traversal, rate limiting) before launch. Some say ship fast, secure later. Curious where others draw the line.
Follow the build: GitHub | kepion.app

Top comments (4)
Really enjoyed this one. The API key story is painfully real — invisible auth defaults are exactly the kind of bug that can waste hours while making you doubt the wrong layer of the stack.
What stood out to me most was the combination of deny-by-default security, trace-based observability, and cost circuit breakers before launch. A lot of build-in-public posts focus on speed only, but your write-up makes a strong case that these controls are what let speed scale without turning into chaos.
Also, the Team Memory point is interesting. The idea that memory should compound, but bad memories should decay, feels much closer to how agent systems actually need to learn in production than raw retrieval alone.
Curious to see how your memory approach evolves once more real businesses and edge cases hit the system.
Thanks — the "invisible auth defaults" insight is something I want to turn into a deeper post. There's a whole class of bugs where the LLM provider's auth layer has more failure modes than your actual application code, and nobody talks about it.
On Team Memory — we're about to stress-test it properly. Right now effectiveness scoring works in isolation, but the real question is what happens when 50+ businesses generate conflicting patterns. Does the scoring surface the right learnings or does it drown in noise? Your multi-instance setup would be a perfect test case for this.
Will share the results once we have real traffic hitting it.
That’s exactly the interesting failure mode.
The hard part isn’t whether memory can accumulate patterns — it’s whether it can keep useful patterns distinct from merely frequent ones once the system scales. At 50+ businesses, I’d expect the real test to be conflict resolution: which patterns generalize, which stay local, and which should actively decay because they become misleading outside their original context.
That’s also why I think this is a good fit for a multi-instance stress test. The question isn’t just “does memory persist?” but “does it stay scoped, transferable, and trustworthy under conflicting experience?”
Definitely interested to see what real traffic reveals.
The "scoped, transferable, trustworthy" framing is exactly
the right way to decompose this — I've been struggling to
articulate what "memory works" actually means at scale,
and those three axes cut through it cleanly.
The decay question is the one I've been avoiding, honestly.
Right now effectiveness scoring treats memory entries as
roughly equal-weight beyond their initial score. But what
you're describing is different: some patterns need
context-decay (stop applying outside their original scope)
while others need time-decay (used to be true, no longer
is). We have neither mechanism implemented.
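If I had to sketch it, it might be the base score multiplied by a time half-life and a context-match factor. Purely hypothetical; none of this exists in the codebase yet.

```python
import math
import time

def effective_score(base: float, created_at: float, context_match: float,
                    half_life_days: float = 30.0) -> float:
    """base: stored 0.0-1.0 effectiveness; context_match: 0.0-1.0 similarity
    between the memory's origin context and the current business."""
    age_days = (time.time() - created_at) / 86400
    time_decay = 0.5 ** (age_days / half_life_days)  # "used to be true"
    return base * time_decay * context_match          # "doesn't transfer"
```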
That also highlights a specific failure mode I hadn't named: pattern contamination. A learning generated by Business A's specific context leaking into Business B's decision-making, with neither system flagging that the context doesn't transfer. At 50 businesses, that's potentially 2,450 cross-contamination pairs per learning. Most of them probably harmless, some probably quietly wrong.

Going to add "conflict resolution under multi-instance load" as a specific test scenario before we ship team memory to production. Currently we test it in isolation — one business, growing over time — which is exactly the easy case.

If you're running something similar at scale, would be curious what your current thinking is on pattern scoping. Do you tag memories with context fingerprints, or is it more emergent from the retrieval layer?