This is the launch post in a series on building Praxia, an Apache-2.0 multi-agent orchestrator. Later posts go deep on the TiDB Vector memory backend and a Japanese-specialized STT integration.
TL;DR
This spring I built and shipped Praxia, a multi-agent orchestrator OS, from scratch in about 5 weeks of nights and weekends (Apache-2.0).
- 🚀 PyPI: `pip install praxia` (https://pypi.org/project/praxia/)
- 📦 GitHub: https://github.com/praxia-dev/praxia
- 🎬 60-second demo: https://youtu.be/o_6NbjJU1AA
- 🌐 Landing: https://praxia.tools/
The differentiator is automatic personal → organizational memory promotion. The "prompts that actually work," which a senior engineer painstakingly tunes, usually stay locked in that one person's head. Praxia tries to solve that with a 5-layer memory stack + a 3-path promotion engine.
What I actually started — the decision
I'd been using LangChain / CrewAI / AutoGen at work since late 2025, and one structural discomfort kept nagging at me:
What really separates a good agent from a bad one isn't the library or the model — it's the accumulated domain-specific trial and error.
And that accumulation almost always lives in one senior person's head (their Cursor / VS Code / Obsidian). Tacit knowledge that evaporates the day they leave. General-purpose frameworks are powerless against that.
In April 2026 I started writing code between day-job hours. A month later I shipped v0.1.0 to PyPI and GitHub.
Why existing frameworks didn't cut it
Four walls I hit in practice:
| Wall | What was happening |
|---|---|
| Setup complexity | 2-3 days just to get something running, which is too slow to inform a production go/no-go decision. |
| Tacit knowledge doesn't propagate | The prompts that work stay in one person's private space. |
| No evidence for evaluation | "It runs" doesn't guarantee "it works." |
| Agents stagnate | Build it once, and there's no feedback loop. |
The second one was the killer. General agent frameworks give you strong primitives, but the process of accumulating knowledge into the organization is left entirely to the implementer.
The core design — 5 layers + 3 paths
The 5-layer memory stack
L1 PersonalMemory Per-user (6 backends: JSON / Mem0 / Letta / Zep / Hindsight / LangMem)
L2 PromotionEngine Nightly batch. Decides L1 → L3 promotion
L3 SharedMemory Org-wide. RBAC gating, time decay
L4 MarkdownStore Git-managed, PR review required, immutable
L5 GraphLayer Optional (Zep / Graphiti), relation extraction
The key is that L1 → L4 is not manual. A sleep-time consolidator (nightly batch) scans personal memory and auto-promotes the right parts into organizational knowledge.
The 3-path promotion engine
Memory promotion is evaluated in parallel across three independent signals:
- Frequency — facts repeated across N+ people
- Outcome correlation — co-occurrence with wins / approved PRs / passing tests
- LLM self-eval — a 0..1 "org-knowledge candidacy" score
The final score is a weighted blend, and any single decisive path triggers promotion. This deliberately avoids single-mechanism dependence (where one broken signal takes the whole thing down).
The design choices that made "I can build this myself" possible
Shipping in 5 weeks came down to four design choices.
1. Seven extension points
Every extension point is built on the same praxia.extensions.Registry primitive:
| Extension point | ~LoC | Entry point |
|---|---|---|
| Connector (Box / Notion / Slack …) | ~50 | praxia.connectors |
| Memory backend | ~80 | praxia.memory_backends |
| File parser | ~30 | praxia.parsers |
| Output exporter | ~30 | praxia.exporters |
| OAuth provider | ~20 | praxia.oauth_providers |
| Skill | ~50 | praxia.skills |
| Flow | ~50 | praxia.flows |
"Extend via a pyproject.toml entry point, never edit core files" keeps the cognitive load low when you write your own.
2. Apache-2.0 with everything included (no paywall)
SSO (Google / Microsoft Entra / Okta / GitHub / Keycloak), RBAC, audit logs, per-user OAuth (13 providers), KMS-backed token encryption (AWS / Azure / GCP / Vault / local) — all in the OSS core.
Most of what commercial agent platforms paywall as an "Enterprise tier" ships here under Apache-2.0. That directly speeds up adoption decisions.
3. 100+ providers via LiteLLM
Provider quirks (Anthropic not supporting response_format, GPT-5.x disallowing temperature, Azure's deployment-name format, …) are absorbed at the LiteLLM layer. No provider-specific API keys leak into Praxia core. Fully offline operation is possible too (Ollama + a local model + backend=json).
4. A Streamlit UI that's easy to throw away
The UI is Streamlit, but the backend also runs headless as praxia serve (FastAPI). When I swap to Next.js or mobile later, I throw away only the UI layer.
What went into v0.1.0, and what didn't
Shipped (deliberately a full set):
- 5-layer memory + 3-path promotion engine
- 6 business skills (investment / sales / design / procurement / patent / legal)
- 6 LTM backends + Composite/Routed parallel fusion
- Per-user OAuth for 13 providers
- SSO + RBAC + ACL + audit logs
- Autonomous agent (LLM-driven tool-use loop)
- Document Designer (sandboxed `python-pptx/docx` → designed file output)
- i18n in 8 languages (en / ja / zh-CN / ko / es / fr / de / pt-BR)
Deliberately deferred (v0.2+):
- Multi-tenant GUI — OSS targets "single-org self-host"; SaaS-grade tenant isolation belongs in a future Open Core tier
- PDF output — LibreOffice-based workflow recommended for now
- Native Pinecone / Weaviate / Qdrant backends: can be wrapped via `mem0` through Composite/Routed, so low priority
The "what NOT to build" list was as important as the "what to build" list.
The TOP 3 things that actually ate my time
This was a "ship something working in 5 weeks" sprint, and the three things that genuinely ate time were all in the memory layer and promotion engine.
1. Blending three signals that live on completely different scales
The PromotionEngine evaluates L1 → L3 promotion across Frequency / Outcome correlation / LLM self-eval in parallel. This was trickier than I expected:
- Frequency is an integer in 0..∞ (reference count for the same fact)
- Outcome correlation is a ratio in 0..1 (success rate of tasks where the fact co-occurred)
- LLM self-eval is a continuous 0..1 — but non-deterministic, and the score distribution differs per provider (GPT-5 strict, median ~0.4; Claude lenient ~0.7; Gemini bimodal)
My first implementation was a plain weighted average. Because frequency is unbounded, "facts I just happened to touch a lot in L1" floated to the top. I ended up here:
```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PromoteSignal:
    frequency: float     # raw count
    outcome_corr: float  # 0..1
    self_eval: float     # 0..1, median of N=3 LLM calls


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


# PromoteConfig (not shown) carries freq_mean, freq_std, and the three thresholds.
def decide_promotion(sig: PromoteSignal, cfg: PromoteConfig) -> bool:
    # Z-score normalize on a rolling 30-day population, then sigmoid
    z_freq = sigmoid((sig.frequency - cfg.freq_mean) / cfg.freq_std)
    # OR logic with per-path threshold
    return (
        z_freq > cfg.freq_threshold                  # 0.85
        or sig.outcome_corr > cfg.outcome_threshold  # 0.70
        or sig.self_eval > cfg.self_eval_threshold   # 0.80
    )
```
Key points:
- Z-score normalization tames the runaway frequency signal
- OR logic with per-path thresholds means "any single decisive path promotes" — deliberately avoiding single-mechanism dependence
- LLM self-eval uses the median of N=3 calls to average out non-determinism (I tried N=5 — the benefit plateaued)
- `decide_promotion` is a pure function, so I can replay historical promotion logs to do parameter sensitivity analysis
It took ~5 redesigns to land on "Z-score → sigmoid → OR with thresholds." Lesson: don't start signal fusion with a weighted average. Per-path decisive thresholds + OR turned out to be the most robust.
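Because `decide_promotion` is pure, a sensitivity sweep is just a loop over candidate configs. The sketch below shows one way to do that replay. The `PromoteConfig` shape is assumed from the field names used in the snippet above, `sweep` is a hypothetical helper, and the log entries are made up for illustration.

```python
import math
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PromoteSignal:
    frequency: float
    outcome_corr: float
    self_eval: float


@dataclass(frozen=True)
class PromoteConfig:  # assumed shape; field names follow the snippet above
    freq_mean: float
    freq_std: float
    freq_threshold: float = 0.85
    outcome_threshold: float = 0.70
    self_eval_threshold: float = 0.80


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def decide_promotion(sig: PromoteSignal, cfg: PromoteConfig) -> bool:
    z_freq = sigmoid((sig.frequency - cfg.freq_mean) / cfg.freq_std)
    return (
        z_freq > cfg.freq_threshold
        or sig.outcome_corr > cfg.outcome_threshold
        or sig.self_eval > cfg.self_eval_threshold
    )


def sweep(log, base, field, values):
    """Replay a signal log once per candidate value; return promotion rates."""
    rates = {}
    for v in values:
        cfg = replace(base, **{field: v})  # copy of base with one field changed
        promoted = sum(decide_promotion(s, cfg) for s in log)
        rates[v] = promoted / len(log)
    return rates


# Made-up log entries, for illustration only.
log = [
    PromoteSignal(frequency=40, outcome_corr=0.30, self_eval=0.50),
    PromoteSignal(frequency=3, outcome_corr=0.75, self_eval=0.40),
    PromoteSignal(frequency=5, outcome_corr=0.20, self_eval=0.82),
    PromoteSignal(frequency=4, outcome_corr=0.10, self_eval=0.78),
]
base = PromoteConfig(freq_mean=5.0, freq_std=10.0)
rates = sweep(log, base, "self_eval_threshold", [0.75, 0.80, 0.85])
```

On this toy log the promotion rate falls from 1.0 to 0.75 to 0.5 as the self-eval threshold rises, because the first two entries are carried by the frequency and outcome paths regardless of the self-eval setting.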
2. Reconciling "fan-out search × single-destination write" in the Composite backend
The Composite backend sends search queries in parallel to multiple backends (JSON / Mem0 / TiDB / Letta…) and fuses results with Reciprocal Rank Fusion (RRF). But writes go to a single backend specified by write_to=. This asymmetry created three pitfalls.
(a) Add a backend later, and past data is invisible to it
Running Composite(backends=[A, B], write_to="A") and then adding C means C has none of the past writes. Search fan-out becomes lopsided — 2 backends hit, 1 doesn't. I discovered this during dogfooding as "one backend just has lower search quality."
→ Added a replay_writes(source=A, target=C, since=...) admin API after the fact. Needed for rebalancing.
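A backfill of this kind can be sketched roughly as follows. Only the name `replay_writes` and its `source` / `target` / `since` parameters come from the text above; the `InMemoryBackend` class, its `scan` / `has` / `add` methods, and the sample records are hypothetical. The important property is that re-running the replay copies nothing twice.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Iterator


@dataclass(frozen=True)
class Record:
    record_id: str
    text: str
    written_at: datetime


class InMemoryBackend:
    """Toy backend for illustration; a real one would talk to a store."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._records: Dict[str, Record] = {}

    def add(self, record: Record) -> None:
        self._records[record.record_id] = record

    def has(self, record_id: str) -> bool:
        return record_id in self._records

    def scan(self, since: datetime) -> Iterator[Record]:
        """Yield records written at or after `since`, oldest first."""
        for rec in sorted(self._records.values(), key=lambda r: r.written_at):
            if rec.written_at >= since:
                yield rec


def replay_writes(source: InMemoryBackend, target: InMemoryBackend,
                  since: datetime) -> int:
    """Copy records written to `source` since `since` into `target`.

    Records the target already holds are skipped, so a second run is
    harmless. Returns the number of records actually copied.
    """
    copied = 0
    for rec in source.scan(since):
        if not target.has(rec.record_id):
            target.add(rec)
            copied += 1
    return copied


a, c = InMemoryBackend("A"), InMemoryBackend("C")
a.add(Record("r1", "Customer X prioritizes ROI",
             datetime(2026, 4, 1, tzinfo=timezone.utc)))
a.add(Record("r2", "Customer Y asks for SLAs",
             datetime(2026, 5, 1, tzinfo=timezone.utc)))
cutoff = datetime(2026, 4, 15, tzinfo=timezone.utc)
first = replay_writes(a, c, since=cutoff)   # copies r2 only
second = replay_writes(a, c, since=cutoff)  # nothing left to copy
```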
(b) Graceful degradation when write_to goes down
When write_to is down, stopping writes but continuing reads looks attractive. But that can cause "a record that was hitting in search disappears on the next search" (because the write never re-runs after recovery).
I ended up dropping graceful degradation and raising instead. Half-baked availability stacks a data-truthfulness problem on top of eventual consistency. "When it's down, honestly say it's down" turned out to be the easiest to operate.
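In code, "honestly say it's down" can be as small as refusing to swallow the write error. This is a sketch under assumptions: `Composite` here is a simplified stand-in for the real backend, and `BackendUnavailable` and `FlakyBackend` are names invented for the example.

```python
class BackendUnavailable(RuntimeError):
    """Raised when the designated write backend cannot accept writes."""


class FlakyBackend:
    """Toy backend whose availability can be toggled for the example."""

    def __init__(self) -> None:
        self.up = True
        self.records = []

    def add(self, text: str) -> None:
        if not self.up:
            raise ConnectionError("backend offline")
        self.records.append(text)


class Composite:
    """Hypothetical wrapper: many read backends, one write target."""

    def __init__(self, backends: dict, write_to: str) -> None:
        self.backends = backends
        self.write_to = write_to

    def add(self, text: str) -> None:
        target = self.backends[self.write_to]
        try:
            target.add(text)
        except ConnectionError as exc:
            # No silent drop and no fallback to another backend: the caller
            # must see the failure so the write can be retried later.
            raise BackendUnavailable(
                f"write_to={self.write_to!r} is down; write rejected"
            ) from exc


store = Composite({"A": FlakyBackend(), "B": FlakyBackend()}, write_to="A")
store.add("first fact")
store.backends["A"].up = False
try:
    store.add("second fact")
    rejected = False
except BackendUnavailable:
    rejected = True
```

The write is neither dropped silently nor redirected to backend B, so no record ends up visible in some searches and missing in others.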
(c) Tie-breaking in RRF fusion
When different backends each return a different record at the same rank (for example, each backend's rank 1), the RRF scores tie exactly. Without a tie-breaking rule, ordering depends on backend registration order, and CI tests go flaky.
→ Introduced lexicographic tie-breaking on (rrf_score, timestamp DESC, backend_priority). Now ordering is reproducible even in CI.
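The fusion and tie-break can be sketched as below. This is a generic Reciprocal Rank Fusion, where each record scores the sum of `1 / (k + rank)` over the backends that returned it, and not Praxia's actual code. The `Hit` type, the `rrf_fuse` name, and `k = 60` (a commonly used default) are assumptions, and the final `record_id` key is an extra added here so that the ordering is total.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Hit:
    record_id: str
    timestamp: float  # e.g. Unix seconds; newer is larger


def rrf_fuse(results: Dict[str, List[Hit]], priority: List[str],
             k: int = 60) -> List[str]:
    """Fuse per-backend rankings with RRF and a deterministic tie-break.

    results:  backend name -> hits, best first
    priority: backend names, most preferred first (used only to break ties)
    """
    score: Dict[str, float] = defaultdict(float)
    newest: Dict[str, float] = {}
    best_prio: Dict[str, int] = {}
    for backend, hits in results.items():
        prio = priority.index(backend)
        for rank, hit in enumerate(hits, start=1):
            rid = hit.record_id
            score[rid] += 1.0 / (k + rank)
            newest[rid] = max(newest.get(rid, hit.timestamp), hit.timestamp)
            best_prio[rid] = min(best_prio.get(rid, prio), prio)
    # Sort key: higher score first, then newer timestamp, then preferred
    # backend, then record_id as a last resort.
    return sorted(
        score,
        key=lambda rid: (-score[rid], -newest[rid], best_prio[rid], rid),
    )


results_ab = {"A": [Hit("x", 100.0), Hit("z", 50.0)],
              "B": [Hit("y", 200.0), Hit("z", 50.0)]}
# Same data, backends registered in the opposite order.
results_ba = {"B": results_ab["B"], "A": results_ab["A"]}
order_ab = rrf_fuse(results_ab, priority=["A", "B"])
order_ba = rrf_fuse(results_ba, priority=["A", "B"])
```

Here `z` wins because both backends return it, while `x` and `y` tie on score and the newer `y` is placed first. The result is the same whichever backend is registered first.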
3. Idempotency of the sleep-time consolidator
Since the nightly batch re-runs L1 → L3 promotion, not promoting the same fact twice is the crux of idempotency. Strict record_id-based dedup wasn't enough:
- The same user records "Customer X prioritizes ROI" and "X-san values ROI" as two different phrasings → different record_ids, semantically identical
- LLM self-eval is non-deterministic: the same text scores 0.78 / 0.82 / 0.76, so near a threshold you get "didn't promote yesterday, promoted today"
So I added fuzzy dedup:
```python
from __future__ import annotations

# Record, embed, and cosine_sim are not shown here.
def is_duplicate(candidate: Record, l3_recent: list[Record]) -> bool:
    # Already-promoted check: any L3 record within recent window whose
    # embedding cosine-sim >= threshold is treated as duplicate
    cand_vec = embed(candidate.text)
    return any(
        cosine_sim(cand_vec, r.embedding) >= 0.92
        for r in l3_recent  # already filtered to last 30 days
    )
```
The hard part was tuning the two thresholds (similarity + window):
| Setting | Problem it causes |
|---|---|
| 0.95 / 7 days (tight) | "Same fact" with different wording ends up duplicated across L3 |
| 0.85 / 90 days (loose) | "New but similar facts" (e.g. a similar trend for customer Y) get suppressed |
| 0.92 / 30 days (adopted) | On a hand-labeled set of 50, both false-merge and false-split stayed < 5% |
The deciding factor was instrumenting both directions at once. Not just "rate of missed duplicates" but also "rate of treating genuinely distinct facts as duplicates." Track only one side and you inevitably tune lopsided. I later applied this to the PromotionEngine's whole calibration loop.
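Measuring both directions at once might look like the sketch below. It assumes a hand-labeled set of record pairs with precomputed cosine similarities; `LabeledPair`, `error_rates`, and the four sample pairs are invented for illustration, and the time window is left out to keep the example small.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class LabeledPair:
    similarity: float  # cosine similarity of the two records' embeddings
    same_fact: bool    # human label: do both records state the same fact?


def error_rates(pairs: List[LabeledPair],
                threshold: float) -> Tuple[float, float]:
    """Return (false_merge_rate, false_split_rate) at a similarity threshold.

    false merge: distinct facts treated as duplicates (similarity >= threshold)
    false split: the same fact treated as distinct    (similarity <  threshold)
    """
    distinct = [p for p in pairs if not p.same_fact]
    same = [p for p in pairs if p.same_fact]
    false_merge = sum(p.similarity >= threshold for p in distinct)
    false_split = sum(p.similarity < threshold for p in same)
    return (
        false_merge / len(distinct) if distinct else 0.0,
        false_split / len(same) if same else 0.0,
    )


# Made-up labels for illustration; a real set would come from hand review.
pairs = [
    LabeledPair(0.97, True),
    LabeledPair(0.93, True),
    LabeledPair(0.90, False),
    LabeledPair(0.86, False),
]
for threshold in (0.85, 0.92, 0.95):
    merge_rate, split_rate = error_rates(pairs, threshold)
    print(f"{threshold:.2f}  false-merge={merge_rate:.2f}  "
          f"false-split={split_rate:.2f}")
```

Reporting the two rates side by side for every candidate threshold makes a lopsided setting visible immediately: on this toy data the loose threshold merges every distinct pair, and the tight one splits half of the true duplicates.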
v0.1.0 by the numbers
| Metric | Value |
|---|---|
| Codebase | ~25,000 LoC (Python + tests + i18n) |
| Tests | 431 passing |
| Supported LLM providers | 100+ via LiteLLM |
| Bundled connectors | 19 (pull + push) |
| Per-user OAuth providers | 13 |
| Docs | 30,000+ chars across EN + JA |
| Dev time | ~5 weeks (nights and weekends) |
Next 3 months
- v0.2: First-class TiDB Vector / pgvector backends (currently via `mem0` wrap)
- v0.2: localhost loopback OAuth (drop the `praxia serve` requirement)
- v0.3: Multi-tenant org features (the Open Core entry point)
- Docs: expanded English tutorials
- Community: Discord, more active Discussions
For anyone in the same spot
Three things I felt after sprinting for 5 weeks, for anyone debating whether to start a solo OSS project:
- Ship frequency over polish. Shipping `v0.1.0` teaches you far more than polishing `v0.0.4` forever.
- State your differentiator in one line. For Praxia: "personal → org memory auto-promotion." If you can't say it, you can't write the post, can't pitch it, and PRs won't come.
- The real competitor for OSS is the commercial SaaS solving the same problem — not LangChain or CrewAI, but the paywall on paid agent platforms. Bundling those features under Apache-2.0 is itself the differentiation.
Closing
⭐ Stars / 🍴 Forks / Issues / PRs all welcome.
github.com/praxia-dev/praxia
If you liked this, the 60-second demo is here: https://youtu.be/o_6NbjJU1AA
Connecting "individual brilliance × organizational continuity" with AI — that's the mission Praxia started with this spring.
Next in this series: the enterprise-platform features I put directly in the OSS core (SSO, RBAC, audit logs, KMS-envelope-encrypted OAuth tokens) and why none of them are behind an Enterprise tier.