DEV Community: Papa

When Your AI Agent Handles Money, "It Worked" Isn't Good Enough

Papa — Sat, 25 Jul 2026 02:02:35 +0000

Three verifier agents. One milestone. Real ETH at stake.

The first time I ran the consensus loop without tracing, a silent timeout in the peer broadcast caused two nodes to vote while the third sat idle. The quorum passed anyway (2-of-3), but I had no idea the third node was broken until I checked the logs manually — 40 minutes later.

That's when I stopped treating observability as a nice-to-have and started treating it as the product.

What I Built

Weft is an autonomous milestone verifier: builders stake ETH against project deliverables, and when a deadline passes, three independent AI agents gather evidence — contract deployment, on-chain usage, GitHub commits — corroborate over encrypted channels, and execute a verdict on-chain. Capital releases or refunds automatically.

The stack:

Python verifier daemon
0G Chain smart contracts
Zama FHE for sealed ballot privacy
AXL for peer-to-peer messaging
SigNoz for the entire observability layer

The Instrumentation That Actually Mattered

I instrumented the daemon with OpenTelemetry's Python SDK — standard stuff at first: one root span per verification cycle, child spans for each evidence-gathering step. But the useful instrumentation came from the places I didn't initially think to trace.

The peer broadcast

Each verifier broadcasts its signed verdict envelope to the other two nodes via HTTP POST. I wrapped the broadcast call in a span with attributes for peer.address, milestone.hash, and envelope.evidence_root, and recorded the response code as http.status_code.

python
with tracer.start_as_current_span(
    "axl.broadcast_verdict",
    attributes={
        "peer.address": peer_url,
        "milestone.hash": milestone_hash,
        "envelope.evidence_root": evidence_root,
    },
) as span:
    resp = requests.post(f"{peer_url}/send", json=envelope, timeout=10)
    span.set_attribute("http.status_code", resp.status_code)

When the third node was timing out, the span showed a 10s duration (the full request timeout) with a deadline_exceeded status, something I'd never have found in stdout logs, because the daemon's retry logic silently moved on.

The consensus wait

When AXL_WAIT_FOR_PEERS=1, the daemon polls its inbox for matching peer envelopes before voting. I added a span that tracks how long consensus took to form and how many unique signers contributed. In SigNoz, this shows up as a variable-width bar in the waterfall — you immediately see whether consensus was fast (all nodes healthy) or slow (one node lagging).
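The wait itself can be sketched in a few lines of plain Python. This is an illustration rather than Weft's actual code: wait_for_consensus, poll_inbox, and the envelope keys signer and evidence_root are hypothetical names, and the 2-of-3 quorum is taken from the story above. The two values it returns are the ones worth attaching to the span: the set of unique signers and the time consensus took to form.

```python
import time
from typing import Callable, Iterable


def wait_for_consensus(
    poll_inbox: Callable[[], Iterable[dict]],  # hypothetical inbox reader
    evidence_root: str,
    quorum: int = 2,
    timeout_s: float = 30.0,
    interval_s: float = 1.0,
) -> tuple[set[str], float]:
    """Poll until `quorum` unique signers match `evidence_root`, or time runs out."""
    started = time.monotonic()
    signers: set[str] = set()
    while True:
        for envelope in poll_inbox():
            # Only envelopes that agree on the evidence root count toward quorum.
            if envelope.get("evidence_root") == evidence_root:
                signers.add(envelope["signer"])
        elapsed = time.monotonic() - started
        if len(signers) >= quorum or elapsed >= timeout_s:
            return signers, elapsed
        time.sleep(interval_s)
```

Using a set means a peer that rebroadcasts the same envelope is counted once, which is what a "unique signers" attribute needs.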

The FHE ballot encryption

Zama's FHEVM operations are computationally expensive. A span around submit_encrypted_weighted_verdict revealed it was taking 4–6 seconds per call — 10x longer than the regular submitVerdict. That's fine once you know it. Without the span, the total cycle time just looked "slow," with no obvious bottleneck.

What SigNoz Gave Me That Logs Didn't

The dashboard I actually lived in had eight panels:

Verification cycle duration (p50/p95) — one number that told me if something was degrading
Evidence source availability — three gauges for GitHub API, RPC node, and 0G indexer
Peer consensus formation time — how long from first broadcast to quorum
Transaction success rate — did the on-chain verdict actually land
Per-node vote status — which verifiers voted, which are stuck
FHE encryption latency — the Zama call specifically
Active milestone count — workload across pending deadlines
Error rate by span — which operation is failing most

The trace waterfall is where debugging actually happened. A single verification cycle produces 15–25 spans:

deadline_scheduler.poll
→ indexer_client.get_milestone
→ metadata_reader.read
→ github_client.collect
→ eth_rpc.get_code
→ mvp_verifier.count_callers
→ kimi_client.generate_narrative
→ axl_client.broadcast
→ peer_inbox.wait_for_consensus
→ keeperhub_client.execute_verdict

When something breaks, you click the trace and see exactly where.

The thing that surprised me: I used trace-to-logs correlation more than I expected. The daemon emits structured JSON logs with trace IDs, so clicking a suspicious span in SigNoz jumps straight to the exact log lines from that operation. This was invaluable when debugging a case where evidence-gathering passed but the attestation JSON was malformed — the span said "success," but the log showed a missing field that only mattered downstream.
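For anyone wiring up the same correlation, here is a minimal sketch of a JSON log formatter that carries trace IDs, using only the standard library. The field names trace_id and span_id and the class name JsonFormatter are my assumptions, not necessarily what the Weft daemon emits. In a real daemon the IDs would come from the active OpenTelemetry span context; here they are passed in through `extra` so the example stays self-contained.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object, including trace context if present."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Populated via `extra={...}` on the logging call; None otherwise.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
        }
        return json.dumps(payload)


def make_logger(name: str = "weft.verifier") -> logging.Logger:
    """Build a logger that writes JSON lines to stderr (name is hypothetical)."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

A call such as `logger.warning("attestation missing field", extra={"trace_id": tid, "span_id": sid})` then produces a line that a backend can join to the span with the same IDs.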

What I'd Tell My Past Self

Instrument the boundaries, not the internals. I wasted time tracing individual lines of business logic. The useful spans were at system boundaries: HTTP calls to peers, RPC calls to the chain, API calls to GitHub/Kimi/0G. Internal function calls are better served by logs correlated to the parent span.

Set span attributes aggressively. milestone.hash, verifier.address, evidence.type, consensus.signer_count — these turn a generic waterfall into a queryable dataset. When I needed to answer "which milestone took longest to verify last week," it was a SigNoz query, not a grep.

The alert that saved me: a simple condition — verification cycle > 120s — caught a regression where the 0G indexer RPC was returning stale data. The daemon kept retrying reads for metadata that hadn't propagated yet. I added exponential backoff with a ceiling, but without the alert I'd have burned through rate limits for hours before noticing.
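Exponential backoff with a ceiling is small enough to show in full. This is a generic sketch, not the daemon's code: the base delay, growth factor, and 30-second ceiling are illustrative values I picked, and backoff_delays is a hypothetical helper name.

```python
def backoff_delays(
    base_s: float = 1.0,
    factor: float = 2.0,
    ceiling_s: float = 30.0,
    attempts: int = 6,
) -> list[float]:
    """Delay before each retry: grows geometrically, then is clamped at the ceiling."""
    return [min(ceiling_s, base_s * factor**attempt) for attempt in range(attempts)]


print(backoff_delays())  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

The ceiling is what keeps a long outage from turning into multi-minute sleeps, while the growth is what stops the retry storm that burns rate limits.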

Don't trace in production without sampling in mind. Three verifier nodes, each polling every 60 seconds, each producing 15+ spans per cycle — that's roughly 2,700 spans/hour, minimum. Fine for a hackathon. For production, you'd want head-based sampling on the poll cycles and 100% sampling on the verdict-execution path.
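The span volume is simple arithmetic, spelled out here with the figures from this post (three nodes, a 60-second poll, 15 to 25 spans per cycle). The function name is just for illustration.

```python
def spans_per_hour(nodes: int, poll_interval_s: int, spans_per_cycle: int) -> int:
    """Total spans emitted per hour across all verifier nodes."""
    cycles_per_hour = 3600 // poll_interval_s
    return nodes * cycles_per_hour * spans_per_cycle


print(spans_per_hour(3, 60, 15))  # 2700, the lower bound
print(spans_per_hour(3, 60, 25))  # 4500, the upper bound
```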

The Takeaway

The point of Weft is that autonomous agents handling money must be observable — not because regulators require it (though they will), but because you need to debug it at 2am when a verdict doesn't land and there's real capital on the line.

SigNoz turned "the third node seems broken" from a 40-minute log-spelunking session into a 10-second glance at a trace waterfall.

If you're building agents that do anything consequential — transactions, deployments, approvals — instrument the decision path end-to-end before you ship. You'll thank yourself the first time something fails silently.

Weft is open source: github.com/thisyearnofear/weft. The SigNoz dashboards are provisioned via OpenTofu in agent/scripts/weft_signoz_provision.sh.

Building a Diversifi Memory Agent on Qwen Cloud

Papa — Mon, 20 Jul 2026 18:03:03 +0000

I'm a solo builder. That means when something breaks at 2am, there's no teammate to Slack — just me, a terminal, and a growing suspicion that I've misconfigured something obvious. This is the story of building DiversiFi, my submission for Track 1 of the Qwen Cloud Global AI Hackathon (MemoryAgent), and the three-hour detour into Alibaba Cloud's beta-access purgatory that taught me more than the parts that actually worked.

The idea

DiversiFi is a treasury management agent — but the part I actually wanted to build wasn't the treasury logic. It was the memory.

Most "agent memory" is just chat history with a fancier name. I wanted something different: an agent that distills your actual profile out of the noise. Your risk tolerance. Your financial philosophy — maybe it's Africapitalism, maybe Islamic finance principles, maybe Buen Vivir. Your recurring anxiety about currency depreciation. Your swap patterns over time. Not "here's everything you've ever said to me," but "here's who you are, as best I can tell."

That means two things have to happen continuously: raw interactions need to get consolidated into durable statements, and old, irrelevant signal needs to get forgotten. That's the whole game for Track 1, and it's a genuinely hard problem dressed up as a simple one.

Wiring three services together

I ended up leaning on three Alibaba Cloud pieces:

DashScope (Model Studio / Bailian) is the brain. I used the OpenAI-compatible endpoint, which meant I could point my existing OpenAI-client-shaped provider at Qwen by changing exactly one base URL. Zero rewrites. The consolidation pipeline feeds up to 40 raw memories into Qwen and asks it to boil them down into 3-7 durable profile statements — qwen-plus by default, qwen-long when I need the 1M-token context for bigger memory pools, qwen-max when quality matters more than speed. I even stood up a dedicated MaaS workspace endpoint in Singapore, and I have a live curl against it returning real completions. That part felt great.

Tablestore's Agent Memory Store was supposed to be the long-term memory layer underneath all of it — the official SDK gives you createMemoryStore, addMemories, searchMemories, the works, plus something I didn't expect to love as much as I do: it runs its own background extraction pass on raw messages, so you get a second, independent take on what matters, layered on top of my own Qwen-based consolidation. The four-level scope — appId / tenantId / agentId / runId — is exactly the multi-user isolation model a treasury agent needs, and I didn't have to invent it myself.

Function Compute is the glue. A small Node.js 18 handler in cn-beijing that pulls raw memories out of storage, sends them to Qwen for consolidation, writes the distilled profile back as a high-priority memory, and evicts what it just absorbed. A cron job on my Hetzner box hits this every six hours per active user, quietly keeping every profile fresh.
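The consolidation pass that handler performs can be sketched independently of any cloud service. The real handler is Node.js and talks to Qwen and a memory store; the version below is a Python illustration with an in-memory list standing in for storage and a `summarize` callable standing in for the model call. The record shape (kind, priority, text) and the function name are assumptions for the sake of the example; the 40-memory batch limit is from the description above.

```python
from typing import Callable


def consolidate_user(
    store: list[dict],
    summarize: Callable[[list[str]], list[str]],  # stand-in for the Qwen call
    batch_limit: int = 40,
) -> list[dict]:
    """Distill up to `batch_limit` raw memories and evict the ones absorbed."""
    raw = [m for m in store if m["kind"] == "raw"][:batch_limit]
    if not raw:
        return store
    statements = summarize([m["text"] for m in raw])
    absorbed = {id(m) for m in raw}
    # Keep everything that was not just consolidated.
    kept = [m for m in store if id(m) not in absorbed]
    # Write the distilled profile back as high-priority memories.
    distilled = [{"kind": "profile", "priority": "high", "text": s} for s in statements]
    return kept + distilled
```

Evicting in the same pass as the write-back is the important part: if the two were separate jobs, a failure between them would leave both the raw memories and their summary in the store.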

On paper, that's the whole architecture. In practice, one of the three legs never got to stand.

"The user is disabled"

Here's where it stopped being a hackathon and started being a mystery.

I created the Tablestore instance, set up a RAM user with full OTS and FC access, double-checked the AccessKey was active. Every single call to the Memory Storage API came back with the same error:

OTSAuthFailed: The user is disabled.

That phrase is doing a lot of misdirection. My first instinct, and probably yours, is "okay, something's wrong with my IAM policy." So I re-read the docs. Rebuilt the RAM user. Tried the root account's own key, just to rule out permissions entirely. Same error, every time.

Turns out the Tablestore Memory Storage sub-service — the actual API surface behind createMemoryStore and friends — is in 邀测, invitation-only beta. It's restricted to cn-beijing, and access is gated at the account level by a manual allowlist, not by any policy I could write myself. "The user is disabled" isn't a permissions message. It's the beta gate itself, phrased in a way that sends you chasing IAM ghosts for hours. The official guide, once I found it, points you to a DingTalk group to request access. The billing docs even quietly confirm the service isn't GA yet — pricing doesn't kick in until the end of July.

I want to be honest about how much time that cost me. If you're building on an Alibaba Cloud beta service for a hackathon with a deadline, request access before you write a line of code. I didn't, and I paid for it in hours I didn't have.

What a solo builder does when a dependency won't cooperate

Not stop, obviously. But also — not fake it.

The Tablestore adapter is done. Fully written, fully tested, using the SDK exactly as documented. It just can't run yet, because the account isn't allowlisted. So instead of blocking the whole submission on someone else's approval queue, I wired a local fallback — Cognee — that implements the identical remember/recall/sweep interface. The consolidation service tries Tablestore first if the endpoint is configured, and falls back seamlessly if it isn't. From the user's side, nothing changes. From my side, the app is fully functional today, and the moment that DingTalk request gets approved, Tablestore comes online with zero code changes — the environment variable is already sitting there waiting.

That fallback pattern ended up being the thing I'm proudest of in this whole build, more than any single integration. Every Alibaba Cloud service in this app sits behind an environment variable, and none of them are load-bearing for the app to function:

No ALIBABA_CLOUD_FC_ENDPOINT? The Guardian cron just consolidates locally.
No TABLESTORE_ENDPOINT? Memory falls back to Cognee.
No DASHSCOPE_API_KEY? The LLM chain falls through to Gemini, then Venice, then down the list.

Alibaba Cloud makes this thing better. It was never allowed to be a single point of failure — and that turned out to matter a lot more than I expected when one of its own services locked me out.
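In code, the pattern amounts to letting the environment decide which backend is active and always having a default. A minimal Python sketch follows; the environment variable names are the ones listed above, while the function names and the backend labels returned are hypothetical, and the real app's provider list is longer than the two fallbacks shown.

```python
import os
from typing import Mapping


def pick_memory_backend(env: Mapping[str, str] = os.environ) -> str:
    """Tablestore only when its endpoint is configured; Cognee otherwise."""
    return "tablestore" if env.get("TABLESTORE_ENDPOINT") else "cognee"


def pick_llm_chain(env: Mapping[str, str] = os.environ) -> list[str]:
    """Ordered provider chain; DashScope leads only when a key is present."""
    chain: list[str] = []
    if env.get("DASHSCOPE_API_KEY"):
        chain.append("dashscope")
    chain += ["gemini", "venice"]  # fallbacks named in the text
    return chain
```

Taking the environment as a parameter rather than reading os.environ directly makes each branch testable with a plain dict.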

Forgetting, on purpose

The other requirement — timely forgetting — actually got easier to think about once I stopped treating memory as something that only grows. There are two layers to it. A soft decay function quietly penalizes the recall score of anything older than 30 days, hitting zero at twice that age, so stale memories fade from relevance before they're deleted. A harder sweep then actually evicts anything below that threshold, and consolidation itself evicts the raw memories it just absorbed into a distilled statement — otherwise you're just accumulating noise forever and calling it "memory."
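The soft decay can be written as a single multiplier on the recall score. The sketch below assumes the fade is linear between day 30 and day 60; the post only fixes the two endpoints (no penalty up to 30 days, zero at twice that), so the shape in between is my assumption, as is the function name.

```python
def decay_factor(age_days: float, grace_days: float = 30.0) -> float:
    """Multiplier for a memory's recall score: 1.0 inside the grace period,
    falling linearly to 0.0 at twice the grace period."""
    if age_days <= grace_days:
        return 1.0
    return max(0.0, 1.0 - (age_days - grace_days) / grace_days)


for age in (10, 45, 60, 90):
    print(age, decay_factor(age))  # 1.0, 0.5, 0.0, 0.0
```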

The third requirement — recalling critical memories inside a limited context window — turned out to have a slightly counterintuitive answer. Qwen's qwen-long genuinely does give you a million tokens of context. But the real win wasn't using all of it. It was consolidating 40 raw memories down into 3-7 durable statements and prioritizing those in recall, so the advisor's context stays full of signal — your philosophy, your risk profile — instead of scrollback.

Where it stands

DashScope: live, verified, answering real requests today. Function Compute: fully configured, one s deploy away from running. The Tablestore instance itself exists, has internet access enabled, has the right policies attached — it's just waiting on a beta gate that isn't mine to open. And Cognee is quietly doing the job in the meantime, with 880 tests green behind it.

If there's a lesson in here for another solo builder eyeing a cloud hackathon: build for the failure of the shiny new thing before you've even confirmed it works. Not out of pessimism — out of respect for the fact that beta services, by definition, might not let you in on the first try. The interesting engineering, it turns out, wasn't the Qwen integration. It was making sure the whole thing didn't depend on it.

Code: github.com/thisyearnofear/diversify
FC handler (all three services in one place): alibaba-cloud/fc-memory-consolidation/index.js
Tablestore adapter: packages/shared/src/services/tablestore-memory-service.ts
DashScope provider: packages/shared/src/services/ai/providers/dashscope-provider.ts
Deployment docs: docs/alibaba-cloud-deployment.md

Built for the Qwen Cloud Global AI Hackathon, Track 1: MemoryAgent.

Databard: How I wired TestSprite into my coding agent to defend invariants that unit tests can't catch

Papa — Fri, 10 Jul 2026 22:12:12 +0000

I Built an Autonomous Testing Loop That Catches Silent Economic Bugs

Built for TestSprite Season 3 — "CLI Launch & Loop Engineering."

Most testing loops test the wrong thing. They check "does the code run?" when the real question is "does the money move in the right direction?"

I'm building DataBard — a marketplace where AI-persona agents bid on data-brief WANTs and settle on Solana devnet. The marketplace has economic invariants: properties that must hold or the market silently breaks. A unit test will happily pass while your reseller loses money on every trade.

This is the story of building a TestSprite-powered loop that catches those bugs — and the real bugs it caught during development.

The Problem: Silent Economic Bugs

Traditional tests check code paths. assertEqual(add(1, 2), 3) tells you the function works. But what about:

"The Digest reseller must earn positive margin on every deal" — if the pricing strategy's estimatedSubCost is wrong, the reseller charges 0.03 SOL and pays out 0.043 SOL. The code runs fine. The money just moves the wrong direction. Every. Single. Trade.
"Cascade must win Quality briefs, Newsroom must win Freshness" — if the buyer LLM's fit-vs-price weights drift, the cheapest persona always wins. The market "works" — it just lost its entire differentiator.
"Escrow state machine rejects invalid transitions" — if release() doesn't check for delivered state, a buyer can release payment before the seller commits. The API returns 200. The money is gone.

These are economic invariants, not code correctness checks. No amount of unit testing catches them. You need tests that run the actual market flow end-to-end and assert the economic properties hold.

The Loop

Here's the architecture:

Write — the coding agent ships code.
Verify — the TestSprite CLI uploads Python tests to the cloud and runs them against the live URL (POST /api/market/graph-demo), which exercises real Solana devnet escrow calls.
Fix — on failure, the agent reads the TestSprite failure bundle and proposes a minimal patch.
Verify again — rerun until green, or until a 4-iteration cap is hit.

flowchart LR
    A[Coding Agent<br/>writes code] --> B[TestSprite CLI<br/>runs tests]
    B --> C[TestSprite Cloud<br/>pytest + requests<br/>vs live API]
    C -- failure bundle --> D[Fixer<br/>reads failure, patches code]
    D --> A
    C -- all green --> E[LOOP.md<br/>audit trail]

Every iteration appends to LOOP.md — the audit trail that judges read.
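Stripped of the TestSprite and LLM specifics, the control flow is a bounded verify-and-fix loop. The Python sketch below is an illustration under assumptions: run_tests and apply_fix are hypothetical callables standing in for the CLI run and the fixer, and the returned history list stands in for the LOOP.md entries. The cap of four iterations is the one described above.

```python
from typing import Callable


def run_loop(
    run_tests: Callable[[], list[str]],        # returns names of failing invariants
    apply_fix: Callable[[list[str]], bool],    # returns False if no patch was applied
    max_iterations: int = 4,
) -> dict:
    """Verify, fix, and re-verify until green or until the iteration cap is hit."""
    history: list[dict] = []
    for iteration in range(1, max_iterations + 1):
        failures = run_tests()
        history.append({"iteration": iteration, "failures": failures})
        if not failures:
            return {"green": True, "history": history}
        # Stop on the last allowed iteration or when the fixer gives up.
        if iteration == max_iterations or not apply_fix(failures):
            break
    return {"green": False, "history": history}
```

Recording the failing iteration before attempting a fix is what keeps the trail honest: a run that ends red still shows every attempt.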

The Tests

Each invariant is a Python + requests test that hits the live API. No mocks. No local server. The tests exercise the full pipeline: Watchdog tick → persona bidding → buyer LLM scoring → escrow deposit → delivery → release.

Invariant 1: Digest must earn positive margin

def test_digest_earns_positive_margin_on_every_deal():
    # 1. Consumer posts the digest WANT
    post_resp = post_phase(None, "post")
    want_id = post_resp["wantId"]

    # 2. Consumer awards (Digest is the only bidder)
    award_resp = post_phase(want_id, "award")
    parent_price = award_resp["parentDeal"]["priceLamports"]

    # 3. Digest fulfils by buying from Newsroom×N
    deliver_resp = post_phase(want_id, "deliver")
    sub_prices = [s["priceLamports"] for s in deliver_resp["subDeals"]]
    total_sub_cost = sum(sub_prices)

    # THE INVARIANT: parent price must exceed sum of sub prices
    assert parent_price > total_sub_cost, \
        f"Digest losing money: parent {parent_price} ≤ subs {total_sub_cost}"

This test doesn't check if the code runs. It checks if the economy works. If the pricing strategy drifts, this test fails — even though every function returns 200 OK.

Invariant 2: Persona fit must reflect focus

def test_cascade_wins_quality_brief():
    """The e-commerce fixture triggers quality delta hints;
    Cascade (deep-dive persona) should win it, not Newsroom (flash)."""
    result = run_cycle_with_focus("ecommerce")
    winner = result["award"]["winnerPersona"]
    assert winner == "cascade", \
        f"Wrong persona won: {winner} — fit weights may have drifted"

Invariant 3: Escrow state machine rejects invalid transitions

def test_cannot_release_before_deliver():
    """Post a WANT, award it, then try to skip straight to release."""
    post = post_phase(None, "post")
    want_id = post["wantId"]
    post_phase(want_id, "award")
    # Try to release WITHOUT delivering — must fail
    r = requests.post(DEMO, json={"fixture": "ecommerce", "phase": "release", "wantId": want_id})
    assert r.status_code >= 400 or not r.json().get("ok"), \
        "Release before deliver should be rejected!"

The Real Bugs It Caught

Bug 1: Digest lost money on every deal

What happened: The Digest reseller's pricingStrategy charged subFloor × N × 1.25, assuming Newsroom's floor of 0.008 SOL. But Newsroom's urgency-adjusted pricing at 120s deadlines was 0.014 SOL/sub. Digest was charging 0.03 SOL and paying out 0.043 SOL — a −0.013 SOL loss on every trade.

Why this matters: Silent economic bugs are the worst kind. Unit tests wouldn't catch this because the flow "worked" — the money just moved the wrong direction. The test_digest_earns_positive_margin_on_every_deal invariant catches it on the first run.

The fix: The loop's fixer (powered by an LLM) read the TestSprite failure bundle, identified that estimatedSubCost was too low in voice-config.ts, and bumped it to sol(0.015) with a 1.3 margin multiplier. Sub-WANT deadlines were extended to 300s so Newsroom's urgency premium drops. Result: +0.018 SOL profit per trade, verified live on devnet.
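The numbers are worth working through once. The sketch below uses Decimal to avoid float noise. The post never states N; three sub-deals is an assumption, chosen because 0.008 × 3 × 1.25 reproduces the 0.03 SOL quote. The 0.043 SOL payout is taken as reported rather than recomputed, and the function names are illustrative.

```python
from decimal import Decimal


def quoted_price(sub_cost: Decimal, n_subs: int, markup: Decimal) -> Decimal:
    """Parent price under an 'estimated sub cost x N x markup' strategy."""
    return sub_cost * n_subs * markup


def margin(parent_price: Decimal, total_sub_cost: Decimal) -> Decimal:
    """Positive means the reseller earns; negative means it pays to trade."""
    return parent_price - total_sub_cost


N_SUBS = 3  # assumption, see lead-in

before = quoted_price(Decimal("0.008"), N_SUBS, Decimal("1.25"))
print(before, margin(before, Decimal("0.043")))  # 0.03 SOL quote, -0.013 SOL margin

after = quoted_price(Decimal("0.015"), N_SUBS, Decimal("1.3"))
print(after)  # 0.0585 SOL quote under the patched estimate and multiplier
```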

Bug 2: Persona fit weights drifted

What happened: The buyer LLM's fit-vs-price scoring had drifted to a 0.50/0.50 split between fit and price. At that split, the cheapest persona (Newsroom) was winning Quality briefs — the exact opposite of the intended behavior. The market was picking "cheapest" instead of "best fit."

The fix: The loop's fixer moved the split to 0.68/0.32 in favor of fit, which was enough for Cascade's higher fit score to outweigh Newsroom's lower price on quality-flagged briefs. Two iterations to land on that ratio, then green.
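A toy version of the scoring shows why the split matters. The weighted-sum formula and the per-persona scores below are invented for illustration (the post gives only the two weight splits), but they reproduce the behavior described: the cheaper persona wins at 0.50/0.50 and the better-fit persona wins at 0.68/0.32.

```python
def score(fit: float, price_score: float, fit_weight: float) -> float:
    """Weighted blend of fit and price attractiveness, both on a 0..1 scale."""
    return fit_weight * fit + (1.0 - fit_weight) * price_score


def winner(bids: dict[str, tuple[float, float]], fit_weight: float) -> str:
    """Name of the bid with the highest blended score."""
    return max(bids, key=lambda name: score(*bids[name], fit_weight))


# Hypothetical (fit, price_score) pairs: Cascade fits better, Newsroom is cheaper.
bids = {"cascade": (0.9, 0.4), "newsroom": (0.6, 0.8)}

print(winner(bids, 0.50))  # newsroom
print(winner(bids, 0.68))  # cascade
```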

Bug 3: Hand-written IDL discriminator wrong

What happened: The Anchor escrow IDL was hand-written to avoid a runtime dependency on the generated types. The commit_delivery instruction discriminator was invented instead of computed from sha256("global:<name>")[:8]. On-chain calls would have silently landed on wrong instructions.

The fix: Computed all discriminators via sha256("global:<name>")[:8], verified against the anchor build-generated target/idl/escrow.json. Four matched; one didn't. Caught before deploy.
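The computation is short enough to check by hand in Python with nothing but hashlib. The helper name is mine; the hash input format is the one quoted above.

```python
import hashlib


def anchor_discriminator(instruction_name: str) -> bytes:
    """First 8 bytes of sha256("global:<name>") for an Anchor instruction."""
    return hashlib.sha256(f"global:{instruction_name}".encode("utf-8")).digest()[:8]


print(anchor_discriminator("commit_delivery").hex())
```

Comparing this output against the discriminator array in the generated target/idl/escrow.json is a quick way to catch a hand-written value that doesn't match.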

The Fixer: Multi-Provider Fallback

The fixer is the part that reads the failure bundle and proposes a patch. It takes structured JSON of the form { file, old_string, new_string, commit_message, reasoning } and applies it via a plain fs.writeFileSync, but only after checking that old_string appears exactly once in the target file and is at least 40 characters long. That length floor matters: a short old_string like "return price" could match in five different places, and the fixer would have no way to know which one the model meant. Forcing a longer, more specific anchor makes an accidental match far less likely, and the exactly-once check rejects whatever ambiguity remains.

No arbitrary code execution. No nested CLI with permission-bypass. No tool-calling. If the model hallucinates a match that doesn't exist, the fixer bails cleanly and the loop records the miss.
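The guardrail logic fits in a dozen lines. The real fixer is Node.js and writes files; this Python sketch works on strings so it stays self-contained, and the names apply_patch, PatchRejected, and MIN_ANCHOR_CHARS are hypothetical. The two rules are the ones described above: a 40-character floor and exactly one match.

```python
MIN_ANCHOR_CHARS = 40


class PatchRejected(ValueError):
    """Raised when a proposed patch fails a guardrail; nothing is modified."""


def apply_patch(source: str, old_string: str, new_string: str) -> str:
    """Return `source` with the single occurrence of `old_string` replaced."""
    if len(old_string) < MIN_ANCHOR_CHARS:
        raise PatchRejected(f"old_string too short: must be >= {MIN_ANCHOR_CHARS} chars")
    matches = source.count(old_string)
    if matches != 1:
        # 0 means the model hallucinated the anchor; >1 means it is ambiguous.
        raise PatchRejected(f"expected exactly 1 match, found {matches}")
    return source.replace(old_string, new_string, 1)
```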

The multi-provider fallback was the key insight. Models differ in JSON discipline. A model that outputs a preamble like "Here's the fix:" — or picks a too-short old_string — fails the schema check and the loop moves to the next provider:

[fixer] providers in fallback order: venice → nvidia
[fixer] trying venice (venice-uncensored)…
[fixer] × venice: bad JSON — old_string too short — must be ≥40 chars
[fixer] trying nvidia (openai/gpt-oss-20b)…
[fixer] ✓ nvidia patched src/lib/market/rates.ts

Total: 11.5 seconds. Venice's answer was slightly under-specified; NVIDIA delivered a clean patch. Both providers stayed within their respective SDK quotas.

The fallback chain runs Venice → NVIDIA NIM → Anthropic, so no single-provider outage stalls the loop. Provider order is configurable via LOOP_PROVIDER_ORDER=venice,nvidia,anthropic.
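The fallback itself can be sketched as "try each provider in order, keep the first answer that passes the schema check." In the Python illustration below, each provider is a plain callable returning raw model text, which is an assumption made so the example runs without network access; the function names are hypothetical. The required fields, the 40-character rule, and the LOOP_PROVIDER_ORDER variable come from the description above.

```python
import json
import os
from typing import Callable, Mapping, Optional

REQUIRED_KEYS = {"file", "old_string", "new_string", "commit_message", "reasoning"}


def provider_order(env: Mapping[str, str] = os.environ) -> list[str]:
    """Parse LOOP_PROVIDER_ORDER, falling back to the default chain."""
    raw = env.get("LOOP_PROVIDER_ORDER", "venice,nvidia,anthropic")
    return [name.strip() for name in raw.split(",") if name.strip()]


def parse_patch(raw: str) -> dict:
    """Strict parse: any preamble text makes json.loads raise (a ValueError)."""
    patch = json.loads(raw)
    if not isinstance(patch, dict) or not REQUIRED_KEYS.issubset(patch):
        raise ValueError("patch is missing required fields")
    if len(patch["old_string"]) < 40:
        raise ValueError("old_string too short")
    return patch


def first_valid_patch(
    providers: Mapping[str, Callable[[], str]], order: list[str]
) -> Optional[tuple[str, dict]]:
    """Return (provider name, patch) from the first provider with a valid answer."""
    for name in order:
        ask = providers.get(name)
        if ask is None:
            continue  # provider not configured
        try:
            return name, parse_patch(ask())
        except (ValueError, TypeError):
            continue  # bad JSON or failed schema check: move to the next provider
    return None
```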

CI/CD: The Loop on Autopilot

The TestSprite checker is wired into GitHub Actions. Every PR reruns the invariants and fails the build if something breaks:

# .github/workflows/testsprite.yml
on: pull_request
env:
  TESTSPRITE_API_KEY: ${{ secrets.TESTSPRITE_API_KEY }}
  PROJECT_ID: ${{ secrets.TESTSPRITE_PROJECT_ID }}
jobs:
  verify-invariants:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm install -g @testsprite/testsprite-cli
      - run: |
          testsprite test run --all \
            --project "$PROJECT_ID" \
            --wait \
            --output json > results.json
      - run: |
          FAILED=$(cat results.json | jq '[.[] | select(.status != "passed")] | length')
          if [ "$FAILED" != "0" ]; then exit 1; fi

This is the stickiest version of the loop. Long after the hackathon, every PR is gated on the economic invariants. You can't merge broken pricing logic. You can't drift the persona weights. The checker works forever.

What I Learned

1. Test invariants, not code paths

The biggest insight: test what must be true about the system, not what the code does. "Digest earns positive margin" is an invariant. "The pricing function returns a number" is a code path check. The former catches real bugs. The latter catches typos.

2. Real tests > mock tests

Every test hits the live API. No mocks, no local server, no jest.mock(). This means the tests catch integration bugs that mock-based tests miss — like when the Solana RPC rate-limits you and the escrow deposit fails silently.

The downside: the tests are slower (14 seconds for 4 tests) and can fail for infrastructure reasons (Solana devnet RPC 429s). But the failures are honest — they tell you something is actually broken, not that your mock is stale.

3. The fixer needs guardrails

An autonomous agent that reads failures and patches code is powerful but dangerous. The guardrails that made it safe:

Structured JSON only — no free-form code execution
old_string must be ≥40 chars and unique — prevents ambiguous or hallucinated matches
4-iteration cap — runaway loops are worse than honest failures
Every patch committed — reviewers can git log every mutation

4. The loop is honest about failures

The LOOP.md audit trail shows 22 iterations: 4 passed, 16 failed/infra-limited, 9 patches applied. That's not a failure — that's the loop working. A loop that only shows passes is suspicious. A loop that shows failures, fixes, and re-runs is trustworthy.

The Numbers

Metric	Value
Total iterations	22
Tests passed	4
Tests failed (caught real bugs)	9
Infrastructure failures (Solana RPC 429)	7
Patches applied by the fixer	9
Unique commit SHAs in audit trail	9
Provider fallback chain	Venice → NVIDIA → Anthropic
Fixer end-to-end time	~11.5 seconds
CI/CD integration	GitHub Actions

Try It

Live app: databard.persidian.com
Source code: github.com/thisyearnofear/databard
Loop audit trail: LOOP.md
Test files: tests/testsprite/

The TestSprite CLI is open source (Apache 2.0) — install it from GitHub and wire it into your own loop.

Built for TestSprite Season 3 — "CLI Launch & Loop Engineering." The loop is the product. The product is the loop.

Familexyz: Giving AI Agents Persistent Memory with Cognee Cloud

Papa — Sat, 04 Jul 2026 15:23:15 +0000

We integrated Cognee Cloud's memory layer into FamilyXYZ, a multi-agent AI platform where five philosophy-inspired agents help families strengthen their relationships across Telegram and the web, and ended up filing a PR against Cognee itself.

TL;DR — 4 lessons from this integration:

When the docs are unclear, read the source. An "undocumented" endpoint we thought didn't exist actually did — we found it in the Cognee repo, not the docs.
The default search mode isn't always right for you. Switching from GRAPH_COMPLETION to CHUNKS cut an LLM call and seconds of latency out of every recall.
Graceful degradation is non-negotiable. A no-op fallback service meant the app kept working even with the memory layer fully disabled.
Fixing a bug taught us more than the docs did. We ended up contributing a PR (#3863) after tracing inconsistent error responses through the router source.

Details below.

The Problem

FamilyXYZ has five AI agents — Wisdom (Alain de Botton), Intimacy (Esther Perel & Gottman), Presence (Thich Nhat Hanh), Growth (James Clear & Carol Dweck), and Bridge (StoryCorps & bell hooks). Families interact with them via a Telegram bot and a web dashboard.

The agents were helpful in the moment, but they had no memory. A parent could tell the Wisdom agent about a difficult conversation with their teenager on Monday, and by Wednesday the agent had no idea that conversation ever happened. For a family wellness product, this is a fundamental gap — real relationships are built on accumulated context, not single interactions.

We needed persistent, per-user memory that:

Stored every conversation, check-in, and interaction
Could be queried for relevant context before each agent response
Supported enrichment (building a knowledge graph from raw text)
Allowed full data deletion (GDPR-style right to be forgotten)
Was optional — the app had to keep working if the memory service was down

Why Cognee Cloud

We evaluated several options: raw vector databases (Pinecone, Weaviate), LLM-native memory (Mem0), and Cognee. We chose Cognee Cloud for three reasons:

1. Knowledge graph, not just vectors. Cognee doesn't just embed text and do similarity search — it builds an actual knowledge graph with entities and relationships. When a user says "I argued with my partner Sarah about screen time," Cognee extracts the entities (user, Sarah, screen time) and relationships (argued_with, about). This means recall can find semantically related memories even without keyword overlap.

2. The lifecycle maps perfectly to agent memory. Cognee has four operations that map cleanly to what an agent memory layer needs:

remember() — ingest text, build the knowledge graph automatically
recall() — query the graph for relevant context
improve() — run enrichment on the existing graph
forget() — delete a user's entire memory

3. Managed cloud, no infrastructure. Cognee Cloud gives us a tenant-isolated instance with a REST API. No graph database to run, no embedding pipeline to maintain, no vector index to tune.

The Integration

Architecture

We built a standalone workspace package (packages/memory) that wraps the Cognee Cloud REST API. It exports a MemoryService interface with five methods (remember, recall, improve, forget, isEnabled) and two implementations:

packages/memory/src/
├── MemoryService.ts # Interface: remember, recall, improve, forget, isEnabled
├── CogneeMemoryService.ts # Talks to Cognee Cloud via REST API
├── NoopMemoryService.ts # Silent fallback — all methods are no-ops
└── index.ts # Factory: reads env vars, returns the right impl

The key design decision: graceful degradation. If COGNEE_ENABLED is not true, or the API key is missing, or Cognee is unreachable, the app uses NoopMemoryService and continues working with its existing SQLite-backed state. The memory layer is an enhancement, not a dependency.
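The shape of that factory, sketched in Python rather than the package's TypeScript: the interface is trimmed to three methods for brevity, COGNEE_API_KEY is an assumed variable name (the text only says "the API key"), and make_cognee is a hypothetical constructor for the real client. This sketch covers only the boot-time decision; degrading when Cognee becomes unreachable mid-session would need per-call error handling as well.

```python
from typing import Callable, Mapping, Protocol


class MemoryService(Protocol):
    def is_enabled(self) -> bool: ...
    def recall(self, user_id: str, query: str) -> list[str]: ...
    def remember(self, user_id: str, text: str) -> None: ...


class NoopMemoryService:
    """Silent fallback: every method succeeds and does nothing."""

    def is_enabled(self) -> bool:
        return False

    def recall(self, user_id: str, query: str) -> list[str]:
        return []

    def remember(self, user_id: str, text: str) -> None:
        return None


def create_memory_service(
    env: Mapping[str, str],
    make_cognee: Callable[[Mapping[str, str]], MemoryService],  # hypothetical
) -> MemoryService:
    """Return the real service only when enabled, configured, and constructible."""
    if env.get("COGNEE_ENABLED") != "true" or not env.get("COGNEE_API_KEY"):
        return NoopMemoryService()
    try:
        return make_cognee(env)
    except Exception:
        # Construction failed (for example, the service is unreachable): degrade.
        return NoopMemoryService()
```

Because callers only ever see the interface, none of them need an "is memory available?" branch; an empty recall result simply means no extra context is injected.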

Authentication

Cognee Cloud uses two headers for multi-tenant auth:
X-Api-Key:
X-Tenant-Id:

We encapsulated these in an authHeaders() helper that injects both on every request.

Per-User Isolation

Each user gets a dedicated Cognee dataset named familexyz_user_<userId>. This gives us clean isolation — one user's memories never bleed into another's — and makes forget() simple: delete the dataset, and all the user's graph data, vectors, and relational records are gone.

Naming note: you'll see "FamilyXYZ" (the product), familexyz (the repo/package name), and famile.xyz (the live domain) used throughout this post. The codebase predates the current domain, hence the mismatch — nothing to worry about if you're cross-referencing the repo.

The Four Operations

remember() — POST /api/v1/remember

After every Telegram conversation, check-in, or family interaction, we store the content plus metadata (which agent, what source, which family member) as a text blob. Cognee's remember endpoint handles both ingestion and graph building (add + cognify combined), so we don't need a separate processing step.

const formData = new FormData();
formData.append("data", new Blob([enriched], { type: "text/plain" }), "memory.txt");
formData.append("datasetName", dataset);
// POST /api/v1/remember

recall() — POST /api/v1/recall

Before an agent responds to a user message, we call recall() to fetch relevant context from the user's memory graph. We use searchType: "CHUNKS" with topK: 5 — this returns raw text snippets directly without an LLM call, which is cheaper, faster, and gives us text we can inject into the agent's prompt as context.

body: JSON.stringify({
    query,
    datasets: [dataset],
    searchType: "CHUNKS",
    topK: 5,
})

The returned snippets are injected into the agent's system prompt as "Previous context from memory:" — giving the agent awareness of past conversations without any agent framework changes.
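The injection step can be sketched as a pure function. The "Previous context from memory:" label is from the text above; the function name and the bullet formatting are assumptions.

```typescript
// Appends recalled snippets to the base system prompt.
// With no usable snippets (for example when the no-op service is active),
// the base prompt is returned unchanged.
function buildSystemPrompt(basePrompt: string, snippets: string[]): string {
  const cleaned = snippets.map((s) => s.trim()).filter((s) => s.length > 0);
  if (cleaned.length === 0) return basePrompt;
  const bullets = cleaned.map((s) => `- ${s}`).join("\n");
  return `${basePrompt}\n\nPrevious context from memory:\n${bullets}`;
}
```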

improve() — POST /api/v1/improve

Triggered from the web dashboard's "Improve Memory" button. Runs Cognee's enrichment pipeline (memify) on the user's knowledge graph in the background. This is fire-and-forget — if it fails, the existing graph is still usable.

forget() — POST /api/v1/forget

Triggered from the dashboard's "Forget All" button. Sends { dataset: "familexyz_user_<id>" } and Cognee handles the rest: deletes relational records, graph nodes/edges, and vector embeddings. Clean slate.

Wiring It Into the App

The memory service is a singleton, initialized once at boot:

// agent/src/index.ts
import { initMemoryService } from "@familexyz/memory";
initMemoryService();

Then in the Telegram bot's message router, before routing to an agent:

const memory = getMemoryService();
const context = await memory.recall(userId, userMessage);
// Inject context into agent prompt

And after the agent responds:

await memory.remember(userId, `User: ${userMessage}\nAgent: ${agentResponse}`, {
    source: "conversation",
    agent: "wisdom",
});

The web dashboard at famile.xyz/memory exposes the full lifecycle via REST endpoints (/api/memory/status, /api/memory/recall, /api/memory/remember, /api/memory/forget, /api/memory/improve) with JWT authentication.

What We Learned

1. The docs don't always match the API

When we first integrated, the documentation for the forget endpoint was inconsistent — the OpenAPI spec listed a forget tag, but the API reference page 404'd. We initially assumed the REST endpoint didn't exist and built a workaround (UUID lookup via GET /datasets + DELETE /datasets/{id}).

After diving into the Cognee open-source repo to contribute a PR, we discovered POST /api/v1/forget does exist and accepts { dataset: "name" } directly. We simplified our implementation to a single call.

Lesson: When the docs are unclear, read the source. Open-source projects often have better answers in the code than in the docs.

2. CHUNKS vs GRAPH_COMPLETION matters a lot

The default searchType for recall() is GRAPH_COMPLETION, which calls an LLM to generate a natural language answer from the graph. This is expensive (LLM call per recall), slow (seconds of latency), and returns an answer, not raw context.

For our use case — injecting context into an agent prompt — we want raw text snippets, not LLM-generated answers. Switching to searchType: "CHUNKS" was a significant improvement: no LLM call, sub-second latency, and text we can directly inject.

Lesson: Read the search type options carefully. The default isn't always right for your use case.

3. Graceful degradation is non-negotiable

The memory layer is optional. The app worked for months without it. If Cognee goes down, credits run out, or the network fails, the app must keep working.

Our NoopMemoryService pattern — where every method is a silent no-op — made this painless. The calling code doesn't know or care whether Cognee is active. It calls recall(), gets an empty array, and the agent responds without memory context. Not ideal, but not broken.

Lesson: When adding a cloud dependency to an existing app, wrap it in an interface with a no-op fallback. The integration becomes a feature enhancement, not a reliability risk.

4. Contributing back is the best documentation

We found issue #3748 in the Cognee repo — "Inconsistent API error responses in improve, forget, and recall routers." Three routers were returning raw error dicts with non-standard HTTP status codes (420, 409) instead of the canonical ErrorResponse DTO used by the other five routers.

We fixed it in PR #3863, and the process of reading the router source code taught us more about the API than the docs ever did. We saw exactly how each endpoint handled errors, what status codes to expect, and what the response schema looked like.

Lesson: The fastest way to understand an open-source API is to fix a bug in it.

The Stack

Layer	Technology	Why
Agent framework	ElizaOS (TypeScript)	Existing multi-agent orchestration, already in place before this integration
Memory layer	Cognee Cloud (REST API)	Managed knowledge graph — no infra to run
Backend	Hono on Node.js, PM2 on Hetzner	Lightweight, fast cold starts, cheap to run on a VPS
Frontend	Next.js on Netlify	Simple static + serverless deploy for the dashboard
Bot	Telegram Bot API	Primary interaction surface for families
Auth	JWT (HMAC-SHA256)	Stateless auth across bot and dashboard
Package manager	pnpm workspaces	Monorepo with shared packages like `@familexyz/memory`

Results

The memory layer is live in production at api.famile.xyz. A round-trip test confirms it works:

remember: Store "User checked in: mood=good, had breakfast with family" → {"success":true}
recall: Query "What did the user do this morning?" → Returns the exact memory with metadata
status: {"enabled":true,"service":"cognee"}

The Telegram bot now remembers past conversations, the web dashboard shows memory status and lets families manage their data, and the whole system degrades gracefully if Cognee is ever unavailable.

What's Next

Session-scoped memory: Cognee supports session_id for per-conversation context tracking. We could give each Telegram conversation its own session.
Node sets: Cognee's remember accepts node_set tags for grouping related data. We could tag memories by agent (wisdom, intimacy, etc.) for more targeted recall.
Scope-aware recall: Cognee's scope parameter can search "graph", "session", "trace", or "all" — we're currently using the default "auto" but could be more intentional about which memory sources to query.

FamilyXYZ is an open-source family wellness platform. The Cognee integration code is at github.com/thisyearnofear/familexyz. The Cognee PR is at github.com/topoteretes/cognee/pull/3863.

I turned café wifi speeds into a metro map — on Aurora Serverless v2 + Vercel

Papa — Mon, 29 Jun 2026 23:51:41 +0000

I Built Lattency: A Crowdsourced Metro Map for Café Wi-Fi Using Aurora PostgreSQL Serverless v2 and Vercel

Built for the H0: Hack the Zero Stack hackathon (Vercel × AWS Databases).

Stack: Amazon Aurora PostgreSQL Serverless v2 + PostGIS + Next.js on Vercel

Live Demo: https://lattency.vercel.app/

Source: https://github.com/thisyearnofear/lattency

#H0Hackathon

Finding a café with reliable Wi-Fi in Nairobi is surprisingly difficult.

You buy a coffee, settle in, open your laptop, and discover the network can't survive a video call. Twenty minutes later you're packing up and looking for another café.

It's a problem that quietly steals hours every week.

So I built Lattency.

Instead of reviews saying "the Wi-Fi is good", Lattency is a crowdsourced metro map where every café is a station and every line represents a Wi-Fi speed tier—not geography.

🚄 Express — 50 Mbps and above
🚉 Local — 10–49 Mbps
🚧 Suspended — Below 10 Mbps

Anyone can run a speed test, submit a measurement, and within seconds the café is reclassified for everyone else.
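The tier boundaries above translate into a small classifier. This sketch assumes the input is a café's median download speed in Mbps and that anything from 10 up to (but not including) 50 counts as Local; the names are illustrative.

```typescript
type Tier = "express" | "local" | "suspended";

// Maps a median download speed (Mbps) to its metro line.
function classifyTier(medianDownloadMbps: number): Tier {
  if (medianDownloadMbps >= 50) return "express";
  if (medianDownloadMbps >= 10) return "local";
  return "suspended";
}
```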

The interesting part wasn't drawing the map.

It was building a backend that could answer "What's the fastest café near me?", update itself in near real time, resist bad data, and cost almost nothing when nobody was using it.

This post is about the infrastructure that makes that possible.

Choosing the database

The hackathon offered three AWS database options:

Amazon DynamoDB
Amazon Aurora DSQL
Amazon Aurora PostgreSQL

For Lattency, the decision came down to one requirement:

Find cafés within a radius of the user's location.

With PostGIS, that's almost trivial.

SELECT id, name, lat, lng
FROM cafes
WHERE ST_DWithin(
    geog,
    ST_MakePoint($1, $2)::geography,
    $3
);

ST_DWithin() is exactly the kind of query PostGIS was built for.

DynamoDB has no native geospatial radius queries, and Aurora DSQL doesn't currently support PostGIS.

Once location search became a requirement, Aurora PostgreSQL wasn't simply the easiest choice—it was the only database that naturally fit the problem.

The second reason was economics.

Aurora PostgreSQL Serverless v2 scales all the way down to 0 ACUs when idle.

A project like Lattency doesn't receive traffic around the clock. It wakes up during working hours, slows down overnight, and may eventually expand city by city.

Paying only when the database is actually serving requests is exactly the pricing model I wanted.

The trade-off is a cold start of roughly 15–30 seconds after long periods of inactivity.

Fortunately, there are ways to hide that from users.

Let PostgreSQL decide the line colour

Raw speed tests are noisy.

One person on hotel Wi-Fi, a VPN, or a congested network shouldn't immediately downgrade an entire café.

Instead of calculating speed tiers on every request, Lattency maintains a materialized view that stores per-café statistics.

Every new measurement refreshes that view.

REFRESH MATERIALIZED VIEW CONCURRENTLY cafe_speed_stats;

The important part is CONCURRENTLY.

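Without it, PostgreSQL takes an exclusive lock on the view while rebuilding it, so every read waits for the refresh to finish.

With it, visitors continue reading from the old version while PostgreSQL prepares the new one in the background. The one prerequisite is a unique index on the materialized view.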

No downtime.

As more measurements came in, I made the aggregation smarter.

Once a café has enough submissions, measurements marked as outliers are ignored when calculating the median.

That means:

one accidental upload won't move a café into Suspended
genuine network upgrades still appear naturally over time

Outliers are excluded from the calculation—not deleted—so the underlying data remains intact.
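In Lattency this aggregation lives in the materialized view, but the logic is easier to see outside SQL. The sketch below is an illustration only: the `isOutlier` flag is assumed to be set elsewhere, and the threshold of 5 is a made-up stand-in for "enough submissions".

```typescript
interface Measurement {
  downloadMbps: number;
  isOutlier: boolean; // assumed to be flagged by some earlier step
}

const MIN_SAMPLES_FOR_FILTERING = 5; // hypothetical threshold

function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Median speed for one café; outliers are skipped, never deleted.
function cafeMedianMbps(measurements: Measurement[]): number | null {
  if (measurements.length === 0) return null;
  const useFilter = measurements.length >= MIN_SAMPLES_FOR_FILTERING;
  const kept = useFilter
    ? measurements.filter((m) => !m.isOutlier)
    : measurements;
  // If every sample was flagged, fall back to the full set.
  const pool = kept.length > 0 ? kept : measurements;
  return median(pool.map((m) => m.downloadMbps));
}
```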

Making serverless behave like a long-running app

Connecting a serverless application to PostgreSQL introduces four practical problems.

1. Too many database connections

Every serverless function can create its own PostgreSQL connection.

That doesn't scale very well.

Instead, I cache a single pg.Pool on globalThis, allowing warm instances to reuse existing connections.

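import { Pool } from "pg";

// Cache the pool on globalThis so warm instances reuse it across invocations.
const globalForPg = globalThis as { pool?: Pool };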

export const pool =
    globalForPg.pool ??
    new Pool({
        connectionString: process.env.DATABASE_URL,
        ssl: { rejectUnauthorized: true },
        max: 1,
    });

globalForPg.pool = pool;

Keeping each instance to a single pooled connection dramatically reduces connection pressure.

2. Don't wake the database for every visitor

Most people viewing the homepage are looking at the same information.

There's no reason every request should hit Aurora.

The homepage is statically rendered using Incremental Static Regeneration.

export const revalidate = 60;

Most visitors receive a cached page from Vercel's edge network.

Aurora is queried at most once a minute instead of once per visitor, reducing both cost and cold starts.

3. Never fail because the database is sleeping

Cold starts happen.

Instead of showing an error page while Aurora wakes up, Lattency falls back to a bundled snapshot included with the application.

The map still loads.

Users can still explore cafés.

For a hackathon demo, that's the difference between an impressive first impression and a blank screen.
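One way to express that fallback is to race the live query against a deadline and serve the snapshot if the query loses or fails. This is a sketch under assumptions: the `Cafe` shape, the function names, and the 3-second default are all hypothetical.

```typescript
interface Cafe {
  id: string;
  name: string;
  medianMbps: number;
}

type Loaded = { cafes: Cafe[]; source: "live" | "snapshot" };

// Tries the database first; falls back to the bundled snapshot on error or timeout.
async function loadCafes(
  fetchLive: () => Promise<Cafe[]>,
  snapshot: Cafe[],
  timeoutMs = 3000,
): Promise<Loaded> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error("database did not answer in time")),
      timeoutMs,
    );
  });
  try {
    const cafes = await Promise.race([fetchLive(), deadline]);
    return { cafes, source: "live" };
  } catch {
    return { cafes: snapshot, source: "snapshot" };
  } finally {
    // Stop the timer so it cannot keep the function instance alive.
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

Returning the `source` alongside the data lets the UI show a small "cached data" hint when the snapshot was used.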

4. Refresh after the response

Refreshing the materialized view isn't part of the user's request.

Using Next.js' after() API, the response is sent immediately while the refresh runs afterwards.

after(async () => {
    await refreshSpeedStats();
});

Users don't wait for maintenance work.

Crowdsourced data only works if people can't game it

Allowing anyone to contribute data is both the best feature and the biggest security problem.

I wanted automatic submissions to be more trustworthy than manually typed numbers.

Instead of asking contributors to copy results from another speed test, Lattency performs the test directly in the browser against a Vercel Edge region.

It measures:

download speed
upload speed
latency
jitter
packet loss

More importantly, the server—not the client—decides whether the submission counts as an automatic test.

const testMethod =
    body.downloadBytes &&
    body.downloadDurationMs
        ? "auto"
        : "manual";

A user can't simply claim they performed an automatic test.

The evidence has to exist.

To prevent spam without storing personal information, every IP address is salted and hashed before comparison.

createHash("sha256")
    .update(ip + process.env.RATE_LIMIT_SALT)
    .digest("hex");

The raw IP is never stored.

Rate limiting becomes:

One measurement per IP, per café, every ten minutes.

Privacy is preserved while abuse becomes significantly harder.
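Put together, the rule can be sketched as a check keyed on the hashed IP plus the café. The in-memory `Map` here is a stand-in for whatever store the real app uses, and the class and method names are hypothetical; only the salted SHA-256 and the ten-minute window come from the description above.

```typescript
import { createHash } from "node:crypto";

const WINDOW_MS = 10 * 60 * 1000; // ten minutes

// Salted SHA-256 of the IP, so the raw address never needs to be kept.
function hashIp(ip: string, salt: string): string {
  return createHash("sha256").update(ip + salt).digest("hex");
}

class SubmissionLimiter {
  private readonly salt: string;
  private readonly lastSeen = new Map<string, number>();

  constructor(salt: string) {
    this.salt = salt;
  }

  // Returns true and records the attempt if this IP may submit for this café now.
  tryAccept(ip: string, cafeId: string, nowMs: number): boolean {
    const key = `${hashIp(ip, this.salt)}:${cafeId}`;
    const previous = this.lastSeen.get(key);
    if (previous !== undefined && nowMs - previous < WINDOW_MS) {
      return false;
    }
    this.lastSeen.set(key, nowMs);
    return true;
  }
}
```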

Adding a brand-new café is also transactional.

Creating the café and inserting its first measurement happen inside the same database transaction.

If either operation fails, everything rolls back.

The map never ends up with empty cafés that have no measurements attached.
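The transaction could follow the usual BEGIN / COMMIT / ROLLBACK pattern on a single client. In this sketch the `TxClient` interface is a minimal slice modelled on a node-postgres client, and the table names, column names, and function name are all hypothetical.

```typescript
// Minimal client surface needed for one transaction.
interface TxClient {
  query(
    text: string,
    params?: unknown[],
  ): Promise<{ rows: Array<Record<string, unknown>> }>;
  release(): void;
}

interface NewCafe {
  name: string;
  lat: number;
  lng: number;
  downloadMbps: number;
}

// Creates the café and its first measurement atomically; returns the new id.
async function addCafeWithFirstMeasurement(
  connect: () => Promise<TxClient>,
  cafe: NewCafe,
): Promise<unknown> {
  const client = await connect();
  try {
    await client.query("BEGIN");
    // ST_MakePoint takes (x, y), which means (longitude, latitude).
    const inserted = await client.query(
      "INSERT INTO cafes (name, geog) VALUES ($1, ST_MakePoint($2, $3)::geography) RETURNING id",
      [cafe.name, cafe.lng, cafe.lat],
    );
    const cafeId = inserted.rows[0].id;
    await client.query(
      "INSERT INTO measurements (cafe_id, download_mbps) VALUES ($1, $2)",
      [cafeId, cafe.downloadMbps],
    );
    await client.query("COMMIT");
    return cafeId;
  } catch (err) {
    // Undo the café insert if anything after BEGIN failed.
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}
```

Both statements must run on the same client; issuing them through the pool directly could spread them across connections and break the transaction.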

The production architecture I almost shipped

One rabbit hole consumed far more time than I expected.

RDS Proxy.

Initially, I wanted every database connection to pass through an RDS Proxy instead of exposing Aurora directly.

I configured:

a dedicated security group
Secrets Manager credentials
IAM permissions
the proxy itself

Everything looked correct.

Nothing connected.

Eventually I realised why.

RDS Proxy is intentionally private.

Its endpoint lives inside the VPC.

That's perfect for Lambda, ECS, and EC2.

It's not designed for platforms like Vercel running outside your AWS network.

Connecting to it requires additional networking such as PrivateLink or a load balancer.

That lesson ended up being more valuable than the configuration itself.

If I were taking Lattency to production today, I'd choose one of these architectures:

Vercel × AWS Marketplace Aurora Integration — Aurora provisioned through Vercel using PrivateLink. No public database endpoint.
PrivateLink — More infrastructure, but the same private networking model.
Network Load Balancer + RDS Proxy — Works without reprovisioning, although it adds cost and operational complexity.

For the hackathon, opening PostgreSQL on port 5432 was the pragmatic decision.

I think it's more useful to explain that trade-off honestly than pretend the demo shipped with perfect infrastructure.

What I ended up with

The metro-map interface is what people notice first.

The infrastructure is what makes it believable.

Lattency combines:

Aurora PostgreSQL Serverless v2 + PostGIS for fast geospatial searches and scale-to-zero pricing.
Materialized views for outlier-aware café classification without blocking reads.
Incremental Static Regeneration so most visitors never touch the database.
Connection reuse to keep PostgreSQL healthy in a serverless environment.
Server-side trust verification so automatic measurements can't be trivially faked.
Graceful fallbacks so the application continues working even while Aurora is waking up.

The same architecture could power Wi-Fi maps for any city.

Nairobi just happened to be the first one.

If you'd like to explore it yourself:

🗺️ Live Demo: https://lattency.vercel.app/

💻 GitHub: https://github.com/thisyearnofear/lattency

I'd love to hear what you'd build with the same stack—or how you'd improve Lattency.

Built for the H0: Hack the Zero Stack hackathon.

#H0Hackathon

Cognivern - Spend OS For Agent Teams

Papa — Mon, 08 Jun 2026 06:47:42 +0000

This is a submission for the GitHub Finish-Up-A-Thon Challenge

What I Built

Cognivern is a control plane for agent operations — a SpendOS for agent teams.

As AI agents proliferate across development workflows, a quiet crisis is brewing: no one really controls what agents spend, on what, on behalf of whom, or why. Every agent gets what amounts to a blank check — against model APIs, against wallets, against third-party services. Cognivern exists to fix that.

The platform unifies governed wallet spend and AI spend governance across IDE, CLI, and agent workflows into a single auditable control layer. The core promise is simple: move fast without blank checks. Every spend decision can be policy-checked, privacy-preserving, efficiency-aware, and audit-ready — before it executes.

This matters especially in emerging markets and for teams building on-chain infrastructure, where cost overruns from runaway agents aren't just annoying — they're existential. You don't burn budget you don't have chasing a misconfigured prompt loop.

At its core, Cognivern provides:

Policy evaluation — enforce who/what/when rules before any spend executes
Privacy-native operations — evaluate sensitive policy context via confidential paths using Fhenix FHE (Fully Homomorphic Encryption)
AI spend governance — model/runtime usage visibility and optimization alongside financial controls
Audit trails — persist decision evidence (decisionId, attestation, run context) for continuous accountability
Multi-provider AI routing — ChainGPT as the primary Web3-native LLM, with Fireworks, OpenAI, Gemini, Anthropic, and others as fallbacks

The stack is TypeScript + Solidity, deployed across X Layer Testnet (execution and policy), Filecoin Calibration (audit storage), and Fhenix (confidential policy state). The frontend lives at cognivern.vercel.app and includes a PromptOS terminal for natural-language governance interaction.

Demo

🔗 Live app: cognivern.vercel.app

🔗 API: cognivern.thisyearnofear.com

🔗 PromptOS Terminal: cognivern.vercel.app/os

🔗 Source: github.com/thisyearnofear/cognivern

Key flows you can explore:

Submit a spend request through the dashboard and watch policy evaluation fire in real time
Use the PromptOS terminal to interact with governance rules in natural language
Inspect the audit log — every decision has a decisionId and attestation

Try the encrypted spend path (Fhenix), where policy is evaluated over encrypted inputs, so the server never sees the raw values

The Comeback Story

Cognivern started as a hackathon project with a clear thesis but rough edges everywhere. The core governance loop worked, but it was held together with duct tape: no proper workspace isolation, no rate limiting, brittle contract interactions, and a frontend that was functional but not something you'd confidently hand to an operator.

Here's what changed during the finish-up:

Infrastructure hardening — Added per-workspace and per-API-key rate limiters with sliding windows, deep health checks, and circuit-breaker patterns. Moved to TypeScript strict mode throughout. Built out a unified CI pipeline.
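For readers unfamiliar with the technique, a sliding-window limiter keeps the timestamps of recent requests per key and rejects a request once the window is full. The sketch below is a generic in-memory version, not Cognivern's implementation; the class name and the key format are assumptions.

```typescript
// Allows at most `limit` requests per key within any `windowMs` span.
class SlidingWindowLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  private readonly hits = new Map<string, number[]>();

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  allow(key: string, nowMs: number): boolean {
    const cutoff = nowMs - this.windowMs;
    // Keep only timestamps that are still inside the window.
    const recent = (this.hits.get(key) ?? []).filter((t) => t > cutoff);
    if (recent.length >= this.limit) {
      this.hits.set(key, recent);
      return false;
    }
    recent.push(nowMs);
    this.hits.set(key, recent);
    return true;
  }
}
```

Per-workspace and per-API-key limits would then be two instances, or one instance with keys such as `workspace:<id>` and `key:<id>`.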

172 tests — Unit, integration, and E2E via Playwright. The project went from "it works on my machine" to something with real coverage guarantees.

Multi-workspace and policy versioning — Each workspace now has independent API keys, rate limits, and a full policy version history. This was the feature that turned a demo into something a real team could adopt.

Fhenix Wave 5–7 — The FHE integration went from a proof-of-concept to a full institutional demo: encrypted policies, MEV-protected execution, selective auditor disclosure, two-phase FHE resolution with resolveDecision, sealed-bid vendor selection, and a Privara confidential payroll flow. Also migrated from Helium testnet to Arbitrum Sepolia.

ChainGPT integration — Brought in ChainGPT as the primary AI provider for Web3-native governance queries, with the Smart Contract Auditor running as runtime pre-spend defense. This felt like the missing piece — governance AI that actually understands on-chain context.

Operator UX — PromptOS terminal integrated into the sidebar, voice input via ElevenLabs STT, self-service onboarding flow, animated workspace mode toggles, full mobile responsiveness.

The project went from ~60% production-ready to ~93%. The remaining 7% is mostly production key management and a few contract audit items before mainnet.

My Experience with GitHub Copilot

Cognivern is a project with a lot of moving parts — Solidity contracts, TypeScript APIs, multi-chain deployment scripts, FHE integration, and a React frontend — often all in motion at the same time. Copilot was the connective tissue that kept things moving without constant context-switching tax.

A few specific ways it earned its keep:

Boilerplate elimination for the governance endpoints. The API has 12+ endpoints with consistent patterns — request validation, policy lookup, decision logging, response shaping. Writing the first one from scratch was fine; Copilot handled the rest, often getting the full shape right on the first suggestion.

Solidity contract work. The ConfidentialSpendPolicy contract for Fhenix was genuinely novel — FHE operations aren't something most developers have pattern-matched on. Copilot's suggestions weren't always right, but they were useful scaffolding that surfaced the right questions. The back-and-forth of accepting, rejecting, and editing suggestions was faster than writing from scratch.

Test generation. Getting to 172 tests would have taken much longer without Copilot helping generate test cases from the function signatures and existing test patterns. It's particularly good at the "write 10 edge case tests for this validator" kind of ask.

README and documentation. The architecture docs, developer guide, and deployment docs are detailed. Copilot helped maintain consistent voice and structure across them, and was surprisingly good at inferring the right level of technical detail for each audience.

The honest take: Copilot didn't make hard architectural decisions easier. The FHE integration design, the multi-chain deployment strategy, the policy versioning data model — those required real thinking. But it absorbed a huge amount of the mechanical work and kept me in flow during the push to get this finished.

Find me on Farcaster and Lens — always building at the intersection of AI, emerging markets, and on-chain infrastructure.

DiversiFi — Finishing What Inflation Started

Papa — Mon, 08 Jun 2026 06:34:44 +0000

This is a submission for the GitHub Finish-Up-A-Thon Challenge

What I Built

DiversiFi is an AI-powered stablecoin diversification app built on Celo and Arbitrum. The premise is simple but personal: your stablecoins shouldn't all be pegged to the dollar.

If you live in Kenya — as I do — inflation isn't an abstract macroeconomic concept. It's the gap between what you earned last year and what that money buys today. It's the reason holding savings in a local currency account quietly destroys purchasing power, and why stablecoins feel like a genuine unlock: your savings can actually compound instead of erode.

But even dollar-pegged stables have their own exposure. And if you care about your continent — about African economies developing their own financial infrastructure, about emerging markets building on-chain alternatives to broken legacy rails — then a portfolio that's 100% cUSD is both financially incomplete and ideologically inconsistent.

DiversiFi tries to fix both problems at once. Connect a wallet, pick a financial philosophy, deposit stablecoins into a non-custodial Safe smart account, and let an AI agent rebalance your holdings across regional stablecoins — cUSD (US), cEUR (EU), KESm (Kenya), COPm (Colombia), PHPm (Philippines), cREAL (Brazil) — based on live inflation and economic data.

The agent doesn't just chase yield. It reads governance forums, World Bank inflation feeds, and economic signals to make allocation decisions that reflect both the numbers and the philosophy you've chosen:

Africapitalism — keep wealth circulating in African economies
Islamic Finance — Sharia-compliant, no interest-bearing assets
Buen Vivir — LatAm philosophy balancing material wealth with community wellbeing
Global Diversification — maximum geographic spread
Custom — define your own allocation targets

This isn't cosmetic. Each philosophy filters which assets the agent can touch, how it weights rebalancing recommendations, and what it rules out entirely. The goal is a tool that reflects how real people in real places actually think about money, not just a generic robo-advisor with a world-map splash screen.

Built by @papajams · Lens

Demo

🔗 Live app: diversifiapp.vercel.app
📦 Repo: github.com/thisyearnofear/diversify

The Comeback Story

DiversiFi started as a hackathon prototype — the kind that works well enough for a 3-minute pitch but quietly falls apart the moment you try to actually use it.

The core flows were broken. The agent could recommend rebalances but couldn't reliably execute them. The permission system — the piece that makes this non-custodial and therefore trustworthy — was wired up but unenforced, which defeated the whole point. The UI showed allocation targets but gave no real-time feedback on what the agent was actually doing. And the financial strategy layer was mostly decorative; it influenced the copy, not the code.

The push to actually finish it came from submitting to the Ethereum México x Bitso Hackathon — a 5-week global build sprint at the intersection of AI, stablecoins, and payments, with Bitso as a key integration partner and 20% of judging weighted on LATAM real-world impact. Having real mentors and a live demo day in front of regulators and fund managers has a way of clarifying what "done" actually means.

Here's what changed:

Execution layer fixed. _executor.ts now correctly bridges the vault service to the chain via Privy smart accounts, with a local dev fallback that doesn't require a full smart account setup to test against.

Permission model enforced. Session signer policies now actually gate what the agent can spend, on which contracts, within what time bounds. The agent cannot exceed user-defined limits. This is the difference between "non-custodial" as a marketing claim and non-custodial as an architectural guarantee.

Strategy wired into agent behaviour. Each financial philosophy now filters and weights rebalance recommendations at the vault.service.ts level. Africapitalism doesn't just change the UI label — it changes which assets the agent will and won't touch.

Real transaction receipts. Transactions now log through OpenClaw with human-readable summaries. Users can see exactly what the agent did, why, and when — not just a tx hash.

Bitso integration. Added Bitso as a payment rail, bridging fiat on-ramps to on-chain stablecoin positions. For LATAM users this matters enormously: getting funds into the protocol shouldn't require already being crypto-native.

Expanded to Arbitrum. Extended beyond Celo to support Arbitrum, broadening the asset universe and giving users access to deeper liquidity pools.

Fee model stabilised. 1% annual management + 10% performance above high-water mark + 0.10% swap spread, now calculated and settled correctly at withdrawal rather than estimated and forgotten.
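As a worked example of that fee model, the sketch below computes the three components at withdrawal. The rates are the ones quoted above; the input shape, pro-rating the management fee by days held on an average balance, and charging the spread on total swap volume are assumptions, not a description of the actual contract logic.

```typescript
const MANAGEMENT_FEE_ANNUAL = 0.01; // 1% per year
const PERFORMANCE_FEE = 0.1; // 10% of gains above the high-water mark
const SWAP_SPREAD = 0.001; // 0.10% of swapped volume

interface FeeInputs {
  averageBalance: number; // assumed base for the management fee
  daysHeld: number;
  valueAtWithdrawal: number;
  highWaterMark: number;
  swapVolume: number;
}

interface FeeBreakdown {
  management: number;
  performance: number;
  swap: number;
  total: number;
}

function feesAtWithdrawal(input: FeeInputs): FeeBreakdown {
  const management =
    input.averageBalance * MANAGEMENT_FEE_ANNUAL * (input.daysHeld / 365);
  // No performance fee unless the value exceeds the previous peak.
  const gainAboveMark = Math.max(0, input.valueAtWithdrawal - input.highWaterMark);
  const performance = gainAboveMark * PERFORMANCE_FEE;
  const swap = input.swapVolume * SWAP_SPREAD;
  return { management, performance, swap, total: management + performance + swap };
}
```

Under these assumptions, a 1,000-unit average balance held for a year, withdrawn at 1,100 against a 1,000 high-water mark after 2,000 units of swaps, pays 10 + 10 + 2 = 22 units in fees.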

The project went from a prototype that made a good pitch to something I'd actually trust with a real deposit.

My Experience with GitHub Copilot

I used Copilot Chat throughout the finishing process — primarily for architecture and debugging, less as a code generator and more as a thinking partner when things got tangled.

The most valuable moments were in the permission and execution layers, which are genuinely non-trivial. ERC-4337 smart accounts, session signer policies, Privy's secure enclave model — these interact in ways that aren't obvious, and when something breaks the error messages are often unhelpfully cryptic. Being able to paste a stack trace or policy config into Copilot Chat and get a focused hypothesis about what was failing saved real time that would otherwise have gone into reading SDK internals line by line.

I also used it to pressure-test the security model. Walking through the architecture — user controls Safe, agent signs within policy, no private key on server — and asking Copilot to look for holes surfaced a few edge cases around policy expiry and fallback signing I hadn't thought through carefully enough. Having something push back on your assumptions is underrated.

It's not magic. It didn't know Mento Protocol's quirks or Celo's specific bundler constraints out of the box. But as a tool for reasoning through complex, interlocking systems — rather than just autocompleting boilerplate — Copilot Chat earned its place in this build.

Find me on Farcaster and Lens — always building at the intersection of AI, emerging markets, and on-chain infrastructure.

WebMCP Might Be the Most Important Announcement at Google I/O 2026

Papa — Mon, 25 May 2026 00:49:35 +0000

Every few years a technology shows up that looks like a product but is actually a protocol. When that happens, the product gets forgotten and the protocol becomes infrastructure. Google I/O 2026 had one of those moments. It just didn't get treated like one.

The models were impressive. Gemini 3.5 Flash is four times faster than its predecessors. Antigravity 2.0 makes agent orchestration feel like something you'd actually ship. AI Studio now deploys to Cloud Run in one click. None of it was architecturally surprising. But buried in the developer sessions was something different: WebMCP, a proposed open standard for exposing structured tools to browser-based AI agents.

That one is worth sitting with.

The Failure Mode Everyone Already Knows

If you have ever maintained Selenium automation for more than six months, you already understand the problem WebMCP is trying to solve.

The automation works until the product team redesigns the checkout page. Then the selector breaks. You fix it. Three weeks later the login flow changes. You fix it again. You are not engineering anything — you are running a permanent rearguard action against a UI that was never designed to stay still. The automation is fragile because it is built on inference: your code is guessing at intent by reading presentation.

The first generation of browser AI agents has exactly this problem, at larger scale and higher stakes. They can see buttons and forms and navigation menus, and they can click on things, but they are always one redesign away from failing. They are imitating human behavior because the web has never offered them an alternative.

Imagine booking a flight through an agent today. The agent visually searches for departure fields, date pickers, seat selectors, and payment buttons. Every redesign risks breaking the workflow. Under WebMCP, the airline could expose booking itself as a structured capability: destination, dates, passenger count, seat preferences, payment authorization. The agent stops navigating the interface and starts interacting with the system underneath it.

WebMCP is the alternative.

The standard lets web developers expose structured tools — JavaScript functions, typed parameters, form interactions — as machine-readable capabilities. Instead of an agent inferring "this is probably a search box" by parsing the DOM, the site simply declares: here is a search function, here are its inputs, here is what it returns. Declarative for standard interactions, imperative for anything requiring runtime JavaScript. Chrome's experimental origin trial starts in Chrome 149.
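To make "declare a function, its inputs, and what it returns" concrete, here is a conceptual sketch of a tool descriptor and a tiny registry. It is not the WebMCP API surface: every type, class, and method name below is hypothetical and only illustrates the idea of a typed, machine-readable capability.

```typescript
type ParamType = "string" | "number" | "boolean";

interface ToolDescriptor {
  name: string;
  description: string;
  params: Record<string, { type: ParamType; required: boolean }>;
  run: (args: Record<string, unknown>) => unknown;
}

class ToolRegistry {
  private readonly tools = new Map<string, ToolDescriptor>();

  register(tool: ToolDescriptor): void {
    this.tools.set(tool.name, tool);
  }

  // What an agent would read instead of inferring intent from the DOM.
  describe(): Array<Pick<ToolDescriptor, "name" | "description" | "params">> {
    return Array.from(this.tools.values()).map((t) => ({
      name: t.name,
      description: t.description,
      params: t.params,
    }));
  }

  // Validates arguments against the declared types before running the tool.
  call(name: string, args: Record<string, unknown>): unknown {
    const tool = this.tools.get(name);
    if (!tool) throw new Error(`unknown tool: ${name}`);
    for (const key of Object.keys(tool.params)) {
      const spec = tool.params[key];
      const value = args[key];
      if (value === undefined) {
        if (spec.required) throw new Error(`missing parameter: ${key}`);
        continue;
      }
      if (typeof value !== spec.type) {
        throw new Error(`parameter ${key} must be ${spec.type}`);
      }
    }
    return tool.run(args);
  }
}
```

The point of the sketch is the contract: a redesign can move every button on the page, and a call such as `call("searchFlights", { destination: "NBO" })` keeps working as long as the declared tool does.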

The immediate gain is reliability. But that is not the interesting part.

What Changes Under the Surface

Websites have always been designed around visibility. If a human could see and operate something, the web had succeeded. That assumption ran so deep it was invisible — interfaces were presentation layers, and making them look right was the whole job.

WebMCP introduces a different assumption: systems may not need to be visually navigable to be operationally useful. The interface stops being primarily a presentation layer and starts being a capability surface.

That is a significant mutation.

An airline site exposing a structured booking capability is no longer just a place you visit. It becomes a service an agent can call directly. The distinction between website and API starts to blur at the protocol level, not just for developers, but for the web itself.

There is historical precedent for this shift.

RSS made web content machine-readable. A feed reader did not have to scrape a blog and guess where the article title ended and the sidebar began. The site simply exposed structure directly. RSS eventually collapsed as a consumer technology, but the idea it proved — that structured syndication beats scraping — became foundational to modern content APIs.

WebMCP does for actions what RSS did for content.

That distinction matters enormously.

Content syndication is passive. The machine reads what a human wrote. Action exposure is active — the machine performs operations on a user's behalf, with real-world consequences. The jump from "readable" to "actionable" changes the ontology of the web itself.

This is what Google is quietly building toward.

Antigravity 2.0 orchestrates agents. Gemini Spark acts across Gmail, Calendar, and eventually third-party tools via MCP. But agent workflows are only as reliable as the surfaces they operate on. The whole agentic stack presupposes that websites will eventually expose structured interfaces for machine consumption.

WebMCP is the specification for what that looks like on the open web.

The Critique You Have to Make

Here is where most conference coverage goes soft.

WebMCP only matters if adoption follows. An open standard with one browser behind it and no ecosystem buy-in is just a Chrome experiment. The history of proposed web standards is mostly a graveyard of promising ideas that died waiting for critical mass, or got implemented inconsistently enough that developers ended up writing workarounds anyway — which is to say, they ended up back at the Selenium problem.

Google has enough platform leverage to put Chrome 149 in front of most of the world's browser users within six months. It does not have the same leverage over every site that agents will need to use. The gap between "here is a standard" and "here is a standard that Stripe and Shopify and healthcare portals have implemented correctly" is years of developer effort and business negotiation. Nothing about announcing a standard compresses that timeline.

There is also a safety question the I/O coverage largely sidesteps.

Structured tool exposure cuts both ways. Right now browser agents are relatively safe for the same reason they are limited: they cannot do that much. A web where every site exposes clean, machine-actionable capabilities is a web where the blast radius of a compromised or misbehaving agent gets significantly larger.

The permissions model, the consent model, the audit trail — none of that is solved by declaring "here are the actions this site supports." If anything, it sharpens the accountability question.
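As a rough illustration of what sits outside the declaration, here is a minimal Python sketch of a consent gate and audit trail wrapped around tool calls. Everything in it (`GuardedToolRunner`, `ConsentRequired`, the log fields) is a hypothetical design of my own, not something WebMCP specifies.

```python
import time
from typing import Any, Callable, Dict, Iterable, List


class ConsentRequired(Exception):
    """Raised when an agent tries a tool the user has not approved."""


class GuardedToolRunner:
    """Hypothetical wrapper: checks user consent and records every attempt."""

    def __init__(self, granted_tools: Iterable[str]) -> None:
        self.granted_tools = set(granted_tools)
        self.audit_log: List[Dict[str, Any]] = []

    def invoke(self, tool_name: str, handler: Callable[..., Any], **arguments: Any) -> Any:
        allowed = tool_name in self.granted_tools
        # Log before executing, so denied and failed attempts are recorded too.
        self.audit_log.append(
            {
                "timestamp": time.time(),
                "tool": tool_name,
                "arguments": dict(arguments),
                "allowed": allowed,
            }
        )
        if not allowed:
            raise ConsentRequired(f"user has not granted '{tool_name}'")
        return handler(**arguments)
```

Even this toy version leaves the hard questions open: who grants consent, at what granularity, and who gets to read the log.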

The infrastructure is arriving faster than the trust guarantees.

That is the honest summary of where agentic development actually sits right now. Not just for WebMCP — for all of it.

Why This Is Still the Story

None of those concerns make WebMCP less important. They make it more important to track carefully.

The DEV community's instinct after I/O was telling. The submissions that resonated were not about model benchmarks. They were about infrastructure, about privacy, about frameworks designed for machines as much as humans. That pattern is not accidental.

Developers who ship things for a living have a reliable nose for where the actual work is going to land, and right now that nose is pointing at integration — not intelligence.

The capability problem is closer to solved than most people want to admit. Models reason well. Models act. What remains unsolved is making those actions reliable, auditable, and safe at scale.

That is an infrastructure problem.

And infrastructure problems get solved by protocols, not products.

WebMCP is an early answer to the question of what reliable agent-web interaction should look like. It will probably not be the final answer. RSS wasn't either. But RSS proved the idea was viable, and everything that followed built on that proof.

The original web connected documents.

The next version may connect capabilities — not just for humans navigating pages, but for agents executing intent.

The web was built for humans to navigate.

The next version may be built for agents to operate.

Submitted for the Google I/O 2026 Writing Challenge on DEV.