gen99

I spent 5 weeks building an open-source multi-agent orchestrator. The hard part wasn't the agents — it was the memory.

This is the launch post in a series on building Praxia, an Apache-2.0 multi-agent orchestrator. Later posts go deep on the TiDB Vector memory backend and a Japanese-specialized STT integration.

TL;DR

This spring I built and shipped Praxia, a multi-agent orchestrator OS, from scratch in about 5 weeks of nights and weekends (Apache-2.0).

The differentiator is automatic personal → organizational memory promotion. The "prompts that actually work," which a senior engineer painstakingly tunes, usually stay locked in that one person's head. Praxia tries to solve that with a 5-layer memory stack + a 3-path promotion engine.


Why I started: the decision

I'd been using LangChain / CrewAI / AutoGen at work since late 2025, and one structural discomfort kept nagging at me:

What really separates a good agent from a bad one isn't the library or the model — it's the accumulated domain-specific trial and error.

And that accumulation almost always lives in one senior person's head (their Cursor / VS Code / Obsidian). Tacit knowledge that evaporates the day they leave. General-purpose frameworks are powerless against that.

In April 2026 I started writing code around my day job. About a month later I shipped v0.1.0 to PyPI and GitHub.


Why existing frameworks didn't cut it

Four walls I hit in practice:

Wall | What was happening
Setup complexity | 2-3 days just to get something running, which is too slow to make a go/no-go call on production use.
Tacit knowledge doesn't propagate | The prompts that work stay in one person's private space.
No evidence for evaluation | "It runs" doesn't guarantee "it works."
Agents stagnate | Build it once, and there's no feedback loop.

The second one was the killer. General agent frameworks give you strong primitives, but the process of accumulating knowledge into the organization is left entirely to the implementer.


The core design — 5 layers + 3 paths

The 5-layer memory stack

L1 PersonalMemory   Per-user (6 backends: JSON / Mem0 / Letta / Zep / Hindsight / LangMem)
L2 PromotionEngine  Nightly batch. Decides L1 → L3 promotion
L3 SharedMemory     Org-wide. RBAC gating, time decay
L4 MarkdownStore    Git-managed, PR review required, immutable
L5 GraphLayer       Optional (Zep / Graphiti), relation extraction

The key is that L1 → L4 is not manual. A sleep-time consolidator (nightly batch) scans personal memory and auto-promotes the right parts into organizational knowledge.

The 3-path promotion engine

Memory promotion is evaluated in parallel across three independent signals:

  1. Frequency — facts repeated across N+ people
  2. Outcome correlation — co-occurrence with wins / approved PRs / passing tests
  3. LLM self-eval — a 0..1 "org-knowledge candidacy" score

Each path is checked against its own threshold, and any single decisive path triggers promotion. This deliberately avoids single-mechanism dependence (where one broken signal takes the whole thing down).


The design choices that made "I can build this myself" possible

Shipping in 5 weeks came down to four design choices.

1. Seven extension points

Every extension point is built on the same praxia.extensions.Registry primitive:

Extension point | ~LoC | Entry point
Connector (Box / Notion / Slack …) | ~50 | praxia.connectors
Memory backend | ~80 | praxia.memory_backends
File parser | ~30 | praxia.parsers
Output exporter | ~30 | praxia.exporters
OAuth provider | ~20 | praxia.oauth_providers
Skill | ~50 | praxia.skills
Flow | ~50 | praxia.flows

"Extend via a pyproject.toml entry point, never edit core files" keeps the cognitive load low when you write your own.

2. Apache-2.0 with everything included (no paywall)

SSO (Google / Microsoft Entra / Okta / GitHub / Keycloak), RBAC, audit logs, per-user OAuth (13 providers), KMS-backed token encryption (AWS / Azure / GCP / Vault / local) — all in the OSS core.

Most of what commercial agent platforms paywall as an "Enterprise tier" ships here under Apache-2.0. That directly speeds up adoption decisions.

3. 100+ providers via LiteLLM

Provider quirks (Anthropic not supporting response_format, GPT-5.x disallowing temperature, Azure's deployment-name format, …) are absorbed at the LiteLLM layer. No provider-specific API keys leak into Praxia core. Fully offline operation is possible too (Ollama + a local model + backend=json).

4. A Streamlit UI that's easy to throw away

The UI is Streamlit, but the backend also runs headless as praxia serve (FastAPI). When I swap to Next.js or mobile later, I throw away only the UI layer.


What went into v0.1.0, and what didn't

Shipped (deliberately a full set):

  • 5-layer memory + 3-path promotion engine
  • 6 business skills (investment / sales / design / procurement / patent / legal)
  • 6 LTM backends + Composite/Routed parallel fusion
  • Per-user OAuth for 13 providers
  • SSO + RBAC + ACL + audit logs
  • Autonomous agent (LLM-driven tool-use loop)
  • Document Designer (sandboxed python-pptx / docx → designed file output)
  • i18n in 8 languages (en / ja / zh-CN / ko / es / fr / de / pt-BR)

Deliberately deferred (v0.2+):

  • Multi-tenant GUI — OSS targets "single-org self-host"; SaaS-grade tenant isolation belongs in a future Open Core tier
  • PDF output — LibreOffice-based workflow recommended for now
  • Native Pinecone / Weaviate / Qdrant backends — can be wrapped via mem0 through Composite/Routed, so low priority

The "what NOT to build" list was as important as the "what to build" list.


The TOP 3 things that actually ate my time

This was a "ship something working in 5 weeks" sprint, and the three things that genuinely ate time were all in the memory layer and promotion engine.

1. Blending three signals that live on completely different scales

The PromotionEngine evaluates L1 → L3 promotion across Frequency / Outcome correlation / LLM self-eval in parallel. This was trickier than I expected:

  • Frequency is an integer in 0..∞ (reference count for the same fact)
  • Outcome correlation is a ratio in 0..1 (success rate of tasks where the fact co-occurred)
  • LLM self-eval is a continuous 0..1 — but non-deterministic, and the score distribution differs per provider (GPT-5 strict, median ~0.4; Claude lenient ~0.7; Gemini bimodal)

My first implementation was a plain weighted average. Because frequency is unbounded, "facts I just happened to touch a lot in L1" floated to the top. I ended up here:

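import math
from dataclasses import dataclass

@dataclass(frozen=True)
class PromoteSignal:
    frequency: float       # raw count
    outcome_corr: float    # 0..1
    self_eval: float       # 0..1, median of N=3 LLM calls

@dataclass(frozen=True)
class PromoteConfig:       # shape inferred from the fields used below
    freq_mean: float       # rolling 30-day mean of frequency
    freq_std: float        # rolling 30-day std dev of frequency (must be > 0)
    freq_threshold: float = 0.85
    outcome_threshold: float = 0.70
    self_eval_threshold: float = 0.80

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def decide_promotion(sig: PromoteSignal, cfg: PromoteConfig) -> bool:
    # Z-score normalize on a rolling 30-day population, then squash to 0..1
    z_freq = sigmoid((sig.frequency - cfg.freq_mean) / cfg.freq_std)
    # OR logic: any single path above its own threshold promotes
    return (
        z_freq              > cfg.freq_threshold
        or sig.outcome_corr > cfg.outcome_threshold
        or sig.self_eval    > cfg.self_eval_threshold
    )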

Key points:

  • Z-score normalization tames the runaway frequency signal
  • OR logic with per-path thresholds means "any single decisive path promotes" — deliberately avoiding single-mechanism dependence
  • LLM self-eval uses the median of N=3 calls to damp non-determinism (I tried N=5, but the benefit plateaued)
  • decide_promotion is a pure function, so I can replay historical promotion logs to do parameter sensitivity analysis

It took ~5 redesigns to land on "Z-score → sigmoid → OR with thresholds." Lesson: don't start signal fusion with a weighted average. Per-path decisive thresholds + OR turned out to be the most robust.
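Because decide_promotion is pure, replaying logged signals under different thresholds is just a loop. The sketch below assumes a hypothetical in-memory log of PromoteSignal values; the sweep helper and the sample data are invented for illustration, and the definitions are repeated so the block runs on its own.

```python
import math
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PromoteSignal:
    frequency: float
    outcome_corr: float
    self_eval: float


@dataclass(frozen=True)
class PromoteConfig:
    freq_mean: float
    freq_std: float
    freq_threshold: float = 0.85
    outcome_threshold: float = 0.70
    self_eval_threshold: float = 0.80


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def decide_promotion(sig: PromoteSignal, cfg: PromoteConfig) -> bool:
    z_freq = sigmoid((sig.frequency - cfg.freq_mean) / cfg.freq_std)
    return (
        z_freq > cfg.freq_threshold
        or sig.outcome_corr > cfg.outcome_threshold
        or sig.self_eval > cfg.self_eval_threshold
    )


def sweep(log, base: PromoteConfig, field_name: str, values) -> dict:
    """Replay the same log once per candidate value of a single threshold.

    Returns {candidate value: number of records that would be promoted}.
    """
    results = {}
    for v in values:
        cfg = replace(base, **{field_name: v})  # frozen dataclass, so copy
        results[v] = sum(decide_promotion(sig, cfg) for sig in log)
    return results


# Toy log, invented for illustration only.
log = [
    PromoteSignal(frequency=40, outcome_corr=0.20, self_eval=0.30),  # frequency path
    PromoteSignal(frequency=3, outcome_corr=0.75, self_eval=0.50),   # outcome path
    PromoteSignal(frequency=5, outcome_corr=0.40, self_eval=0.78),   # near self-eval threshold
    PromoteSignal(frequency=2, outcome_corr=0.10, self_eval=0.20),   # no path fires
]
base = PromoteConfig(freq_mean=5.0, freq_std=10.0)
counts = sweep(log, base, "self_eval_threshold", [0.75, 0.80, 0.85])
```

In this toy log only the third record is sensitive to the self-eval threshold, which is the kind of borderline case a replay makes visible before a parameter change goes live.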

2. Reconciling "fan-out search × single-destination write" in the Composite backend

The Composite backend sends search queries in parallel to multiple backends (JSON / Mem0 / TiDB / Letta…) and fuses results with Reciprocal Rank Fusion (RRF). But writes go to a single backend specified by write_to=. This asymmetry created three pitfalls.

(a) Add a backend later, and past data is invisible to it

Running Composite(backends=[A, B], write_to="A") and then adding C means C has none of the past writes. Search fan-out becomes lopsided — 2 backends hit, 1 doesn't. I discovered this during dogfooding as "one backend just has lower search quality."

→ Added a replay_writes(source=A, target=C, since=...) admin API after the fact. Needed for rebalancing.
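A rough sketch of what such a replay could look like, assuming each backend exposes hypothetical iter_since() and upsert() methods (these names are illustrative, not Praxia's API). Upserting by record_id keeps the replay idempotent, so it should be safe to re-run after a partial failure.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Iterator


@dataclass(frozen=True)
class Record:
    record_id: str
    text: str
    written_at: datetime


@dataclass
class InMemoryBackend:
    """Hypothetical stand-in for a real memory backend."""
    name: str
    _rows: dict = field(default_factory=dict)

    def upsert(self, rec: Record) -> None:
        # Same record_id overwrites, so replays never create duplicates.
        self._rows[rec.record_id] = rec

    def iter_since(self, since: datetime) -> Iterator[Record]:
        rows = sorted(self._rows.values(), key=lambda r: r.written_at)
        return (r for r in rows if r.written_at >= since)

    def __len__(self) -> int:
        return len(self._rows)


def replay_writes(source: InMemoryBackend, target: InMemoryBackend,
                  since: datetime) -> int:
    """Copy every record written to `source` at or after `since` into `target`."""
    copied = 0
    for rec in source.iter_since(since):
        target.upsert(rec)
        copied += 1
    return copied


# Usage: backend C joins late, so backfill it from A.
a, c = InMemoryBackend("A"), InMemoryBackend("C")
a.upsert(Record("r1", "Customer X prioritizes ROI",
                datetime(2026, 4, 10, tzinfo=timezone.utc)))
copied = replay_writes(a, c, since=datetime(2026, 4, 1, tzinfo=timezone.utc))
```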

(b) Graceful degradation when write_to goes down

When write_to is down, stopping writes but continuing reads looks attractive. But that can cause "a record that was hitting in search disappears on the next search" (because the write never re-runs after recovery).

I ended up dropping graceful degradation and raising instead. Half-baked availability stacks a data-truthfulness problem on top of eventual consistency. "When it's down, honestly say it's down" turned out to be the easiest to operate.
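The fail-fast write path can be sketched as below. The class and exception names are hypothetical, and only the write side is shown; whether reads should also fail during an outage is a separate policy choice this sketch does not cover.

```python
class BackendUnavailable(RuntimeError):
    """Raised when the designated write backend cannot accept a write."""


class DictBackend:
    """Hypothetical in-memory backend with a switch to simulate an outage."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.up = True
        self.rows: dict[str, str] = {}

    def write(self, record_id: str, text: str) -> None:
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        self.rows[record_id] = text


class Composite:
    """Writes go to exactly one backend, named by write_to."""

    def __init__(self, backends: list[DictBackend], write_to: str) -> None:
        self.backends = {b.name: b for b in backends}
        if write_to not in self.backends:
            raise ValueError(f"unknown write_to backend: {write_to!r}")
        self.write_to = write_to

    def write(self, record_id: str, text: str) -> None:
        # No silent fallback to another backend and no dropped write:
        # the caller sees the outage and can retry after recovery.
        try:
            self.backends[self.write_to].write(record_id, text)
        except ConnectionError as exc:
            raise BackendUnavailable(
                f"write_to={self.write_to!r} is unavailable"
            ) from exc
```

Raising keeps the retry decision with the caller, so a write is either stored in the designated backend or visibly failed, never quietly lost.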

(c) Tie-breaking in RRF fusion

When multiple backends return the same record_id at rank 1, the scores tie exactly. Without a tie-breaking rule, ordering depends on backend registration order, and CI tests go flaky.

→ Introduced lexicographic tie-breaking on (rrf_score, timestamp DESC, backend_priority). Now ordering is reproducible even in CI.
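A minimal RRF sketch with a deterministic sort key. It assumes the commonly used constant k = 60 and a hypothetical Hit type; the tie-break order follows the tuple described above, with a final record_id comparison added here as an assumption so that the ordering is total.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    record_id: str
    timestamp: float       # epoch seconds; newer wins ties
    backend_priority: int  # lower number = preferred backend


def rrf_fuse(rankings: list[list[Hit]], k: int = 60) -> list[str]:
    """Fuse ranked lists; rankings[i] is backend i's results, best first.

    Each record scores sum(1 / (k + rank)) over the backends that returned it.
    """
    scores: dict[str, float] = defaultdict(float)
    best: dict[str, Hit] = {}
    for hits in rankings:
        for rank, hit in enumerate(hits, start=1):
            scores[hit.record_id] += 1.0 / (k + rank)
            prev = best.get(hit.record_id)
            # Keep the newest copy (then the preferred backend) for tie-breaking.
            if prev is None or (-hit.timestamp, hit.backend_priority) < (
                -prev.timestamp, prev.backend_priority
            ):
                best[hit.record_id] = hit

    def sort_key(rid: str):
        h = best[rid]
        # Higher score first, then newer, then preferred backend, then id.
        return (-scores[rid], -h.timestamp, h.backend_priority, rid)

    return sorted(scores, key=sort_key)
```

Because the sort key never falls back to insertion order, swapping the order in which backends are registered should not change the fused ranking.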

3. Idempotency of the sleep-time consolidator

Since the nightly batch re-runs L1 → L3 promotion, not promoting the same fact twice is the crux of idempotency. Strict record_id-based dedup wasn't enough:

  • The same user records "Customer X prioritizes ROI" and "X-san values ROI" as two different phrasings → different record_ids, semantically identical
  • LLM self-eval is non-deterministic: the same text scores 0.78 / 0.82 / 0.76, so near a threshold you get "didn't promote yesterday, promoted today"

So I added fuzzy dedup:

def is_duplicate(candidate: Record, l3_recent: list[Record]) -> bool:
    # Already-promoted check: any L3 record within recent window whose
    # embedding cosine-sim >= threshold is treated as duplicate
    cand_vec = embed(candidate.text)
    return any(
        cosine_sim(cand_vec, r.embedding) >= 0.92
        for r in l3_recent  # already filtered to last 30 days
    )
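For readers who want to run the idea end to end, here is a self-contained variant. It is a sketch under assumptions: the Record shape, the promoted_at field, and the pure-Python cosine_sim are illustrative, and the candidate's embedding is passed in directly so no embedding model is needed. Unlike the snippet above, it applies the time window inside the function.

```python
import math
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Record:
    text: str
    embedding: tuple
    promoted_at: datetime


def cosine_sim(a, b) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_duplicate(cand_vec, l3: list[Record], now: datetime,
                 sim_threshold: float = 0.92, window_days: int = 30) -> bool:
    """True if any L3 record inside the window is close enough to the candidate."""
    cutoff = now - timedelta(days=window_days)
    return any(
        r.promoted_at >= cutoff
        and cosine_sim(cand_vec, r.embedding) >= sim_threshold
        for r in l3
    )
```

Exposing both thresholds as parameters makes it easy to try the tight and loose settings against the same labeled data.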

The hard part was tuning the two thresholds (similarity + window):

Setting (similarity / window) | Problem it causes
0.95 / 7 days (tight) | "Same fact" with different wording ends up duplicated across L3
0.85 / 90 days (loose) | "New but similar facts" (e.g. a similar trend for customer Y) get suppressed
0.92 / 30 days (adopted) | On a hand-labeled set of 50, both false-merge and false-split stayed < 5%

The deciding factor was instrumenting both directions at once. Not just "rate of missed duplicates" but also "rate of treating genuinely distinct facts as duplicates." Track only one side and you inevitably tune lopsided. I later applied this to the PromotionEngine's whole calibration loop.
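Measuring both error directions on a labeled set can be sketched as follows. The pair format, the function name, and the per-class denominators are assumptions for illustration; similarity scores are passed in precomputed so the sketch does not depend on any particular embedding model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledPair:
    similarity: float  # cosine similarity between the two records
    same_fact: bool    # human label: do they state the same fact?


def dedup_error_rates(pairs: list[LabeledPair],
                      threshold: float) -> tuple[float, float]:
    """Return (false_merge_rate, false_split_rate) at a similarity threshold.

    false merge: distinct facts treated as duplicates (similarity >= threshold)
    false split: the same fact kept twice (similarity < threshold)
    """
    distinct = [p for p in pairs if not p.same_fact]
    same = [p for p in pairs if p.same_fact]
    false_merge = sum(p.similarity >= threshold for p in distinct)
    false_split = sum(p.similarity < threshold for p in same)
    return (
        false_merge / len(distinct) if distinct else 0.0,
        false_split / len(same) if same else 0.0,
    )
```

Sweeping the threshold over the same labeled pairs shows the trade-off directly: raising it pushes errors toward false splits, lowering it pushes them toward false merges.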


v0.1.0 by the numbers

Metric | Value
Codebase | ~25,000 LoC (Python + tests + i18n)
Tests | 431 passing
Supported LLM providers | 100+ via LiteLLM
Bundled connectors | 19 (pull + push)
Per-user OAuth providers | 13
Docs | 30,000+ chars across EN + JA
Dev time | ~5 weeks (nights and weekends)

Next 3 months

  • v0.2: First-class TiDB Vector / pgvector backends (currently via mem0 wrap)
  • v0.2: localhost loopback OAuth (drop the praxia serve requirement)
  • v0.3: Multi-tenant org features (the Open Core entry point)
  • Docs: expanded English tutorials
  • Community: Discord, more active Discussions

For anyone in the same spot

Three things I felt after sprinting for 5 weeks, for anyone debating whether to start a solo OSS project:

  1. Ship frequency over polish. Shipping v0.1.0 teaches you far more than polishing v0.0.4 forever.
  2. State your differentiator in one line. For Praxia: "personal → org memory auto-promotion." If you can't say it, you can't write the post, can't pitch it, and PRs won't come.
  3. The real competitor for OSS is the commercial SaaS solving the same problem — not LangChain or CrewAI, but the paywall on paid agent platforms. Bundling those features under Apache-2.0 is itself the differentiation.

Closing

⭐ Stars / 🍴 Forks / Issues / PRs all welcome.
github.com/praxia-dev/praxia

If you liked this, the 60-second demo is here: https://youtu.be/o_6NbjJU1AA

Connecting "individual brilliance × organizational continuity" with AI — that's the mission Praxia started with this spring.

Next in this series: the enterprise-platform features I put directly in the OSS core (SSO, RBAC, audit logs, KMS-envelope-encrypted OAuth tokens) and why none of them are behind an Enterprise tier.
