I spent a weekend building a paper trading system for Kalshi's 15-minute BTC binary prediction markets. The hook was Sentient Foundation's roma-dspy Python package, their ROMA (Recursive Open Meta-Agent) framework, which I wanted to actually use for something real-ish rather than run the hello-world example and close the tab.
This post is about what that looks like in practice: the architecture, the places ROMA genuinely helped, the places it caused problems, and how the whole thing actually behaved.
What KXBTC15M is
Kalshi runs a market called KXBTC15M. Every 15 minutes a new binary contract opens: will BTC's price be higher at the end of this window than it was at the start? You bet YES or NO in cents (0 to 99¢), which maps directly to the market's implied probability. A 38¢ YES ask means the crowd thinks there's roughly a 38% chance BTC ends the window above the strike.
The ticker format is KXBTC15M-{YY}{MON}{DD}{HHMM}-{NN} in US Eastern Time. The floor_strike field on the market object is the BTC price to beat, set when the window opens. These markets are only live during certain hours, which is worth knowing before you try to test against a live environment.
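Both conventions are easy to capture in code. A minimal sketch (the helper names are mine, not from the repo): the ticker is pure string formatting per the pattern above, and a YES ask in cents maps straight to an implied probability.

```typescript
// Build a KXBTC15M ticker from its parts, per the
// KXBTC15M-{YY}{MON}{DD}{HHMM}-{NN} pattern (US Eastern Time).
// Helper names are illustrative, not from the actual repo.
function btc15mTicker(yy: string, mon: string, dd: string, hhmm: string, nn: string): string {
  return `KXBTC15M-${yy}${mon}${dd}${hhmm}-${nn}`;
}

// A 38¢ YES ask is an implied probability of roughly 0.38.
function impliedProbability(askCents: number): number {
  return askCents / 100;
}

console.log(btc15mTicker('25', 'JAN', '03', '1430', '01')); // KXBTC15M-25JAN031430-01
console.log(impliedProbability(38)); // 0.38
```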
Architecture overview
Two processes, one purpose:
Next.js 16 (port 3000)             Python FastAPI (port 8001)
─────────────────────────          ──────────────────────────────
API routes (proxy)                 /analyze → roma-dspy solve()
6-agent TypeScript pipeline ←→     /reset   → circuit breaker reset
useMarketTick (2s poll)            /health
usePipeline (5m cycle)
The Next.js app owns the UI, the Kalshi API calls, the price feed, and the orchestration. The Python service does exactly one thing: accept a goal + context string, run it through Sentient's roma-dspy solve(), and return the result as JSON.
The service supports four LLM providers out of the box: Grok, Anthropic, OpenAI, and OpenRouter. OpenRouter is worth highlighting: it gives you access to any model through a single API key and pay-per-use pricing, which is useful when you're hitting per-provider rate limits (more on that shortly).
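Provider selection can be as simple as an env-var lookup into a table. A hypothetical sketch — AI_PROVIDER is the variable the post mentions later, but the per-provider key names below are my assumptions, not the service's actual config:

```typescript
// Sketch of provider selection keyed off an AI_PROVIDER env var.
// The per-provider API-key env names below are assumptions,
// not necessarily what the repo's Python service actually reads.
type Provider = 'grok' | 'anthropic' | 'openai' | 'openrouter';

const API_KEY_ENV: Record<Provider, string> = {
  grok: 'XAI_API_KEY',
  anthropic: 'ANTHROPIC_API_KEY',
  openai: 'OPENAI_API_KEY',
  openrouter: 'OPENROUTER_API_KEY',
};

function resolveProvider(env: Record<string, string | undefined>): Provider {
  const p = (env.AI_PROVIDER ?? 'openrouter') as Provider; // default is my choice
  if (!(p in API_KEY_ENV)) throw new Error(`Unknown provider: ${p}`);
  return p;
}

console.log(resolveProvider({ AI_PROVIDER: 'anthropic' })); // anthropic
```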
The pipeline DAG
Six agents, run in sequence:
MarketDiscovery ──┐
PriceFeed ────────┼──► SentimentAgent (ROMA) ──► ProbabilityModel (ROMA) ──► RiskManager ──► Execution
Orderbook ────────┘
// lib/agents/index.ts (abbreviated)
export async function runAgentPipeline(...): Promise<PipelineState> {
  const mdResult = await runMarketDiscovery(markets)  // rule-based
  const pfResult = runPriceFeed(quote, strike)        // rule-based
  const sentResult = await runSentiment(...)          // ROMA
  await new Promise(r => setTimeout(r, 8_000))        // rate-limit breathing room
  const probResult = await runProbabilityModel(...)   // ROMA
  const riskResult = runRiskManager(...)              // rule-based
  const execResult = runExecution(...)                // rule-based
  // ...assemble the agent results into a PipelineState and return it
}
The design decision I'm most confident about: only the two judgment agents use ROMA. MarketDiscovery, PriceFeed, RiskManager, and Execution are all deterministic. Putting LLM reasoning in the risk manager felt like a bad idea: you want that layer to be predictable, auditable, and fast.
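As a concrete example of what "deterministic" buys you, here is roughly the shape of the edge gate the RiskManager applies. The 3% minimum comes from later in the post; the function and field names are mine, not the repo's:

```typescript
// Deterministic edge gate, in the spirit of the rule-based RiskManager.
// MIN_EDGE matches the 3% threshold described in the post; the names
// here are illustrative, not lifted from the actual repo.
const MIN_EDGE = 0.03;

interface RiskDecision {
  approved: boolean;
  reason: string;
}

function checkEdge(pModel: number, pMarket: number): RiskDecision {
  const edge = pModel - pMarket;
  if (Math.abs(edge) < MIN_EDGE) {
    return {
      approved: false,
      reason: `REJECTED: edge ${(Math.abs(edge) * 100).toFixed(1)}% below minimum (3%)`,
    };
  }
  return { approved: true, reason: `edge ${(edge * 100).toFixed(1)}% clears minimum` };
}

console.log(checkEdge(0.32, 0.31).reason); // REJECTED: edge 1.0% below minimum (3%)
```

Because this is a pure function of two numbers, every rejection is reproducible after the fact, which is exactly the audit property you lose the moment an LLM sits in this seat.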
How the Python ROMA service works
# python-service/main.py
from roma_dspy.core.engine.solve import solve, ROMAConfig

@app.post("/analyze")
def analyze(req: AnalyzeRequest):
    config = build_roma_config(_llm_config)
    result = solve(full_prompt, max_depth=req.max_depth, config=config)
    ...
That's the whole thing. solve() runs the Atomizer → Planner → parallel Executors → Aggregator flow internally, the core of what Sentient Foundation built with ROMA. At max_depth=1 it tends to solve atomically (one LLM call, no decomposition), which is what I want here. Decomposing "assess BTC sentiment" into parallel calls on a rate-limited key was the source of most of my problems.
The /reset endpoint exists for a specific reason: ROMA has internal circuit breakers. If enough LLM calls fail (429s, timeouts), the breaker opens and every subsequent call fails immediately. That's sensible in a long-running service, but frustrating when you've fixed the underlying issue and the service is stuck refusing all requests until restart. The TypeScript client detects the error message and auto-resets before retrying:
if (text.includes('Circuit breaker is open')) {
  await resetCircuitBreakers() // POST /reset, best-effort
}
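Generalized, the pattern is "call, detect open breaker, reset once, retry once". A sketch with the call and reset injected as functions so the transport details don't matter (names are mine):

```typescript
// Retry-once wrapper around a ROMA call: if the error indicates an open
// circuit breaker, fire a best-effort reset and retry a single time.
// Function names are illustrative, not from the repo.
async function withBreakerReset<T>(
  call: () => Promise<T>,
  reset: () => Promise<void>,
): Promise<T> {
  try {
    return await call();
  } catch (err) {
    const msg = err instanceof Error ? err.message : String(err);
    if (!msg.includes('Circuit breaker is open')) throw err;
    await reset().catch(() => {}); // best-effort, mirrors POST /reset
    return await call();           // one retry after the reset
  }
}
```

In the real client, `call` would be the fetch to /analyze and `reset` the POST to /reset.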
Both agents have rule-based fallbacks. The UI shows which path ran: SentimentAgent (roma-dspy · grok) vs SentimentAgent (rule-based · roma-dspy unavailable). The pipeline keeps running either way.
The rate-limit problem
ROMA's Planner decomposes a goal into N parallel executor tasks and fires them concurrently. At max_depth=2, a single /analyze call can generate 4β6 simultaneous LLM requests. Two ROMA agents per cycle means 10β12 LLM calls within a few seconds. With a rate-limited API key that reliably produces a 429 cascade, which trips the circuit breaker, which makes the second agent fail before it even tries.
Two fixes: max_depth=1 (atomic solve, one LLM call instead of six) and an 8-second pause between the two agents in the orchestrator. Neither is elegant. Both work.
If you're running this seriously, OpenRouter is the move here. Set AI_PROVIDER=openrouter and you can route to Claude, Grok, Gemini, or any other model through one key with generous shared rate limits, which beats hammering a single provider's per-minute cap with parallel executor calls.
What the two ROMA agents actually do
SentimentAgent receives live BTC price, 1h/24h changes, strike distance, minutes until close, and top-5 orderbook levels on each side. ROMA returns natural language reasoning. A separate structured extraction call pulls out { score, label, momentum, orderbookSkew, signals }.
ProbabilityModelAgent receives the sentiment score and signals, plus the market-implied probability (yes_ask / 100). It asks ROMA to estimate true P(YES) and whether the model edge justifies a trade.
A real output from a working cycle:
SentimentAgent: neutral (0.01)
  → strong 24h momentum offset by bearish orderbook skew
ProbabilityModelAgent: P(model)=32% vs P(market)=31%, edge +1%
  → NO_TRADE: edge below 3% threshold
RiskManager: REJECTED: edge 1.0% below minimum (3%)
That's the system working correctly. Thin edge, no trade. Right call.
Limitations, honestly
It's paper trading by default. Live mode exists behind a confirmation modal, but I haven't run enough live cycles to have any opinion on whether the edge estimates are real.
KXBTC15M markets are illiquid most of the time. Outside certain hours there's often no active market. The pipeline handles this gracefully but it limits how much live testing you can actually do.
The 3% minimum edge is conservative by design. Given that the probability estimates come from an LLM reasoning over 1h/24h momentum and orderbook depth (not a trained model), that conservatism seems right.
What I actually think about using ROMA here
ROMA externalizes the "how do I break this problem down" question from application code. I write a goal; the framework decides whether to decompose it. For a focused single-topic analysis, that decomposition turns out to be unnecessary β atomic solve works fine. Where it could be genuinely valuable is for broader goals that benefit from parallel investigation across multiple dimensions simultaneously.
The pattern I'd take from this into other projects: accept that ROMA returns natural language, and put a thin structured extraction layer at the boundary to get typed outputs. Keeps the two concerns separate. Works well.
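Concretely, that boundary layer can be a tiny validator: ask the model for JSON, parse it, and signal a fallback if any field is missing. A sketch (type and function names are mine; the field list matches the SentimentAgent output described above):

```typescript
// Thin structured-extraction boundary: natural language in from ROMA,
// typed object out. Names are illustrative; the fields mirror the
// { score, label, momentum, orderbookSkew, signals } shape from the post.
interface SentimentOutput {
  score: number;
  label: string;
  momentum: string;
  orderbookSkew: string;
  signals: string[];
}

function parseSentiment(raw: string): SentimentOutput | null {
  try {
    const obj = JSON.parse(raw);
    if (
      typeof obj.score !== 'number' ||
      typeof obj.label !== 'string' ||
      typeof obj.momentum !== 'string' ||
      typeof obj.orderbookSkew !== 'string' ||
      !Array.isArray(obj.signals)
    ) {
      return null; // caller falls back to the rule-based path
    }
    return obj as SentimentOutput;
  } catch {
    return null; // model didn't return valid JSON
  }
}
```

Returning `null` rather than throwing keeps the fallback decision in one place in the caller, which is what lets the pipeline keep running when roma-dspy is unavailable.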
Full code at github.com/Julian-dev28/sentient-market-reader. If you're curious about the Kalshi auth (RSA-PSS signed headers), the circuit breaker handling, or the TS/Python boundary, drop a reply.
Top comments (1)
Multi-agent architectures for trading are interesting because they force you to solve the coordination problem in a domain with zero tolerance for stale state. Most multi-agent demos I've seen are essentially sequential pipelines disguised as agent systems. A real trading setup has to handle conflicting signals from different agents β one says buy, one says sell β and the arbitration logic is where the actual engineering is. How are you handling disagreement between agents? Is there a consensus mechanism, or does one agent have veto power? That design choice probably matters more than the individual model quality.