DEV Community: Jonathan Murray

Stop letting your hackathon API keys rot

Jonathan Murray — Wed, 03 Jun 2026 22:12:15 +0000

You've got OpenAI, Anthropic, Gemini, and xAI credits sitting in five dashboards. Plug them all into one API and get free state management, courtesy of Dev.to and MLH.

If you've done a hackathon or run a startup, you have API credits scattered everywhere. OpenAI from one event. Anthropic from another. Gemini and xAI from your last sprint. All sitting in separate dashboards, half-used, slowly expiring.

Backboard fixes that. One API, your keys, every model.

Bring your own key

Drop in keys from any of these providers and route across all of them behind a single Backboard API:

OpenAI
Anthropic
OpenRouter
Google Gemini
xAI
Cohere
ElevenLabs

You keep your credits. You keep your rates. You stop stitching seven SDKs together. One key in front of all of them, with memory, routing, and stateful threads built in.

Free state management, on the house

Memory is the part everyone skips at a hackathon because it's a pain to build. Not here. State management on Backboard is free, brought to you by Dev.to and MLH.

Stateful threads at the message level. No vector DB to spin up, no session glue code. Your agent remembers across the whole build.

Add your keys in 30 seconds

Sign in at app.backboard.io
Go to Dashboard → API Keys
Paste your provider keys
Ship

pip install backboard-sdk
# or
npm install backboard-sdk

from backboard import Backboard

bb = Backboard(api_key="your_backboard_key")

# Your OpenAI, Anthropic, Gemini keys are already wired in.
# Memory and state come free.
thread = bb.threads.create(assistant_id="your_assistant")
bb.messages.create(thread_id=thread.id, content="Remember this for later.")

Got tokens from a hackathon? Credits your startup was granted? Put them to work instead of letting them expire.

Add your keys: app.backboard.io/dashboard/api-keys

If there was an alternative to Claude Code, Cursor, Codex, Factory, etc. that performed better on benchmarks, consumed up to 30% less tokens, cost up to 90% less and DID NOT train on your code and data, would you switch?

Jonathan Murray — Wed, 03 Jun 2026 16:10:45 +0000

Stateful AI without a database: threads and assistants

Jonathan Murray — Wed, 03 Jun 2026 10:17:28 +0000

LLMs are stateless. Every API call to a raw model is a blank slate. The model has no idea what was said two messages ago. So the moment you want a chatbot that remembers the conversation, you are on the hook for state.

The usual answer is infrastructure. Spin up Postgres to store message history. Add Redis to cache sessions. Stand up a vector database for long-term memory. Write the code that loads history, trims it to fit the context window, stitches it into every prompt, and saves the new turn. That is a lot of plumbing before the bot says hello.

Backboard handles state for you. Two ideas replace the whole stack: threads and assistants. You never run a database.

The model

Three things, nested:

Message is one turn. A user message in, an assistant reply out.
Thread is one conversation. An ordered list of messages. Pass its thread_id on the next call and the model sees the full history.
Assistant is the profile above the thread. It holds the name, default instructions, tools, and memory. One assistant can own many threads, for example one thread per end-user.

Memory lives on the assistant, so it is shared across every thread under it. History lives on the thread. Both persist on Backboard's side. Nothing to provision.

Threads: state within one conversation

Send a first message and a thread is created automatically. The response hands you a thread_id. Pass it back on the next call and the conversation continues with full context.

Python

pip install backboard-sdk

import asyncio
from backboard import BackboardClient

async def main():
    client = BackboardClient(api_key="YOUR_API_KEY")

    first = await client.send_message("My favorite color is blue.")

    # Same thread: the model remembers the previous turn
    second = await client.send_message(
        "What did I just tell you?",
        thread_id=first.thread_id,
    )
    print(second.content)  # "You told me your favorite color is blue."

asyncio.run(main())

JavaScript (Node 18+)

const send = (body) =>
  fetch("https://app.backboard.io/api/threads/messages", {
    method: "POST",
    headers: {
      "X-API-Key": "YOUR_API_KEY",
      "Content-Type": "application/json",
    },
    body: JSON.stringify(body),
  }).then((r) => r.json());

const first = await send({ content: "My favorite color is blue." });

// Same thread: pass the thread_id back
const second = await send({
  content: "What did I just tell you?",
  thread_id: first.thread_id,
});

console.log(second.content);

cURL

# First message, thread auto-created
curl -X POST "https://app.backboard.io/api/threads/messages" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"content": "My favorite color is blue."}'

# Continue: pass the thread_id from the first response
curl -X POST "https://app.backboard.io/api/threads/messages" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"content": "What did I just tell you?", "thread_id": "THREAD_ID_FROM_FIRST_RESPONSE"}'

No history table. No prompt-stitching code. The thread_id is your conversation state, and Backboard stores it. When a thread gets long enough to crowd the context window, Backboard summarizes older messages automatically so you do not have to manage trimming.

Assistants: state across conversations

A thread remembers one chat. An assistant remembers the user across many chats. Memory is stored per assistant, so to carry facts into a brand new conversation you reuse the same assistant_id and start a fresh thread.

Python

# Conversation 1
await client.send_message(
    "I'm allergic to peanuts.",
    assistant_id="your-assistant-id",
    memory="Auto",
)

# Conversation 2: new thread, same assistant, memory carries over
reply = await client.send_message(
    "Any dietary restrictions you remember?",
    assistant_id="your-assistant-id",
    memory="Auto",
)
print(reply.content)  # "You mentioned you're allergic to peanuts."

JavaScript (Node 18+)

await send({
  content: "I'm allergic to peanuts.",
  assistant_id: "your-assistant-id",
  memory: "Auto",
});

const reply = await send({
  content: "Any dietary restrictions you remember?",
  assistant_id: "your-assistant-id",
  memory: "Auto",
});

console.log(reply.content);

cURL

curl -X POST "https://app.backboard.io/api/threads/messages" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"content": "I am allergic to peanuts.", "assistant_id": "your-assistant-id", "memory": "Auto"}'

curl -X POST "https://app.backboard.io/api/threads/messages" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"content": "Any dietary restrictions you remember?", "assistant_id": "your-assistant-id", "memory": "Auto"}'

This is the part that normally requires a vector database: embedding facts, storing vectors, running similarity search on every request. Here it is one parameter, memory="Auto", and the assistant owns it.

When to pass what

Goal	Pass
Keep talking in the same chat	The same `thread_id` every call
New chat, but remember the user	Omit `thread_id`, reuse the same `assistant_id` with `memory="Auto"`
One assistant, many users	One `assistant_id`, a separate `thread_id` per user

That last row is the whole pattern for a multi-user app. One assistant defines your AI. Each user gets their own thread. State stays separated without a schema you designed, a migration you ran, or a database you babysit.

The point

Stateless models force you to build a state layer. Backboard makes that layer part of the API. Threads hold the conversation. Assistants hold the profile and the memory. Both persist server-side. You ship a stateful, multi-user AI app and never write a line of database code.

Grab a key and try it: app.backboard.io

Architecture in full: docs.backboard.io/concepts/architecture

Send your first AI message in one API call

Jonathan Murray — Tue, 02 Jun 2026 16:41:37 +0000

Most AI tutorials start with a setup checklist. Pick a model provider. Create an account. Wire up a vector database for memory. Stand up a server to hold conversation state. Glue it all together. Then, finally, you send a message.

Backboard skips all of that. One API call sends your first message. A thread, an assistant, memory, and routing across thousands of models are already running behind that single call. You do not assemble the stack. It is the stack.

Here is the whole thing.

Step 1: Get a key

Sign up at app.backboard.io, go to Settings then API Keys, and copy your key. New accounts get $5 in free credits for 30 days. No credit card.

That is the only setup. Keep your key server-side, never in frontend or mobile code.

Step 2: Send the message

Pick your language. Same call in all three.

Python

pip install backboard-sdk

import asyncio
from backboard import BackboardClient

async def main():
    client = BackboardClient(api_key="YOUR_API_KEY")

    response = await client.send_message(
        "Hello! Tell me a fun fact about space."
    )

    print("Reply:", response.content)
    print("Thread ID:", response.thread_id)
    print("Assistant ID:", response.assistant_id)

asyncio.run(main())

JavaScript (Node 18+)

No install needed. Just fetch.

const response = await fetch("https://app.backboard.io/api/threads/messages", {
  method: "POST",
  headers: {
    "X-API-Key": "YOUR_API_KEY",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    content: "Hello! Tell me a fun fact about space.",
  }),
});

const result = await response.json();

console.log("Reply:", result.content);
console.log("Thread ID:", result.thread_id);
console.log("Assistant ID:", result.assistant_id);

cURL

curl -X POST "https://app.backboard.io/api/threads/messages" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"content": "Hello! Tell me a fun fact about space."}'

Run it. You get a reply. That is your first AI message.

What just happened

You sent one string. Backboard did the rest:

Created a thread. The thread_id in the response is a live conversation. Send the next message with it and the model remembers what was said.
Created an assistant. The assistant_id is a reusable profile. Attach memory, documents, and tools to it later without changing your call.
Picked a model. No provider config required. It defaulted to openai / gpt-4o. You can change that with two parameters, shown below.

No vector DB. No state server. No provider SDK. One call.

Continue the conversation

Pass the thread_id back. The model now has context.

follow_up = await client.send_message(
    "Make it shorter.",
    thread_id=response.thread_id,
)
print(follow_up.content)  # knows you mean the space fact

That is stateful conversation with zero extra infrastructure.

Swap the model with two parameters

One key gives you thousands of models. Change the provider and model per message. Same thread, same code.

response = await client.send_message(
    "Explain quantum computing simply.",
    llm_provider="anthropic",
    model_name="claude-sonnet-4-20250514",
)

Want a different model next turn? Change two strings. You are never locked to one provider.

Turn on memory

Add memory="Auto" and the assistant remembers facts across conversations, not just within one thread.

# Thread 1: tell it something
await client.send_message(
    "My name is Sarah and I prefer dark mode.",
    assistant_id="your-assistant-id",
    memory="Auto",
)

# Thread 2, same assistant: it remembers
reply = await client.send_message(
    "What do you remember about me?",
    assistant_id="your-assistant-id",
    memory="Auto",
)
print(reply.content)  # "Your name is Sarah and you prefer dark mode."

Persistent memory, one parameter. No database to provision.

The point

The first call is one line because the platform is full-stack. Memory, model routing, RAG, and stateful threads sit behind a single key. You start with a working AI message, then turn on capabilities as you need them by adding parameters, not services.

Full docs: docs.backboard.io

If Microsoft and Uber can't afford AI coding, what chance do the rest of us have?

Jonathan Murray — Mon, 25 May 2026 15:46:18 +0000

Two stories landed in the same news cycle.

Microsoft cancelled most internal Claude Code licenses. Windows, Surface, Teams, Outlook, all migrating to GitHub Copilot CLI by June 30. Reporting is consistent on the why: usage exploded, the bills got indefensible, and the company that owns Azure and is one of Anthropic's biggest partners decided it was cheaper to migrate thousands of engineers than to keep paying the meter.

Uber's CTO Praveen Neppalli Naga said the company is "back to the drawing board" on AI coding. They burned through their planned 2026 AI budget within months. R&D was $3.4B last year and is still climbing. Engineers were ranked on internal leaderboards for AI tool usage. Claude Code became dominant. Costs went vertical.

Read that twice. Two of the most capitalized, AI-bullish companies on the planet just hit the wall on AI coding cost, and we're still in the first inning.

If they can't make the math work, what happens to the rest of us.

The thing nobody is saying out loud

The current generation of AI coding tools is built on an assumption: more tokens equals better output.

Bigger context windows. Longer reasoning chains. More tool calls per task. The whole industry is in a token-maxing arms race, and the pricing model is perfectly aligned with that race. Every additional token the agent burns is revenue for the model provider. Every re-fetch of the same file, every redundant reasoning loop, every "let me re-read your codebase to remember what we discussed", that's the meter running.

This is the part where I'm supposed to be diplomatic. I'm not going to be.

Claude Code is excellent. Cursor is excellent. Codex is excellent. The engineering is genuinely impressive. But the business model is a parking meter and you are the car. The longer your session, the deeper the agent goes, the more files it touches, the more money the vendor makes. Productivity and cost are positively correlated. That's not a bug. That's the design.

Microsoft figured this out at scale and pulled the plug. Uber figured it out and is rebuilding from scratch. If you're a developer reading this thinking "well, my $200/month plan is fine for now", I have bad news. Your plan is fine because somebody upstream is eating the difference between what you pay and what your usage actually costs. That subsidy ends the moment these companies need real margins. Anthropic is reportedly raising at a $900B valuation. OpenAI just raised again. The investor math doesn't close at "we lose money on every power user forever."

You're not the customer in this model. You're the funnel.

Bigger context is not the answer

The industry's response so far has been to make the context window bigger. 200K. 1M. 2M. Look at all this room.

This is a category error.

A bigger context window doesn't help you, it helps the bill. You're paying to stuff your entire repo into a prompt every turn so the model can "remember" what file structure you have. That's not memory. That's amnesia with a credit card attached.

Real memory, the kind your brain runs on, doesn't reload everything every time you think. It selectively recalls what's relevant. It compresses. It forgets things that don't matter. It builds a model of the world that persists across sessions.

When your coding agent actually remembers your codebase architecture, your conventions, the decision you made last Tuesday, the bug you fixed in auth.ts three weeks ago, the patterns your team prefers, it doesn't need to re-read 400K tokens of context to do the next task. It already knows. The token bill collapses. Quality goes up, not down, because the agent isn't drowning in fresh context every turn.

This is the part of the stack the hyperscalers don't want to build.

Memory is harder than context. Memory is opinionated. Memory requires you to commit to architecture decisions about what to retain, what to compress, what to forget. And critically, memory cuts token revenue. It's a direct conflict of interest for any vendor whose margin depends on you burning tokens.

If you're a vendor making money per token, why would you ever ship the feature that uses fewer tokens.

You wouldn't. And they haven't.

Silicon Valley can afford this. The rest of the world cannot.

A Brazilian developer earning R$15K/month does not have a $200/month Claude Max budget. A two-person Jakarta startup is not dropping $1,500/month per seat on agentic coding. An indie hacker in Lagos is not running a Cursor team plan. The math doesn't work and it isn't going to start working because OpenAI raises another $40B at a higher valuation.

The current AI coding market is a luxury product priced for San Francisco salaries and venture-subsidized burn. That's a real market and the companies serving it should keep serving it. But pretending that's the market is delusional.

There are roughly 30 million developers globally. Maybe 2 million of them work at companies that can sustainably absorb token-metered agentic coding at current prices. The other 28 million need a different solution. Not a worse one. A different one. One whose architecture isn't designed to extract maximum revenue per keystroke.

And let's be honest with each other for a second. The "AI levels the playing field for developers in emerging markets" narrative has been one of the dominant talking points of the last two years. Every keynote. Every blog post. Every "the future of work is global" panel.

How is that going? Right now, with current pricing, the playing field is the most tilted it has ever been. A junior developer in Toronto on a Pro plan has more leverage per dollar than a senior developer in São Paulo on a budget. That's not democratization. That's a new caste system with better marketing.

The "aligned with the driver, not the parking meter" test

I keep coming back to this framing because it keeps being right. The question to ask any AI tool you adopt going forward is whose side the economics are actually on.

If the vendor makes more money when you use it more, you have a parking meter. Your interests and theirs diverge the moment you scale.

If the vendor makes more money when you succeed (you ship faster, retain users, build better), you have a partner. Your interests align.

Most of the AI coding industry right now is parking meters wearing partner costumes. Microsoft just got billed for the parking. Uber too. The smart play for everyone else is to pick tools where the architecture itself, not just the marketing copy, is on your side.

What we're doing about it

We're opening the alpha of our CLI at Backboard. Memory-first. Built for the 28 million developers who are not the target market of the current generation of tools.

I'm not going to pitch you here. I'm telling you we're taking this problem on, and we'd rather have a smaller post and a bigger fight than the other way around.

If the Microsoft and Uber stories landed wrong for you, if you're tired of token bills that look like rent, if you think memory is more interesting than context, come find us.

We're aligned with the driver. Not the parking meter.

backboard.io

OpenAI and Anthropic are Friendster and MySpace, if Subquadratic proves to be true.

Jonathan Murray — Wed, 06 May 2026 15:24:34 +0000

If you've ever shipped an LLM-powered feature that needed to reason over a real codebase, a real contract, or a real research corpus, you already know the shape of the problem. The model technically accepts a million tokens of context. In practice, the answers get worse as the context gets longer, and your infra bill gets worse faster than that.

SubQ is built around SSA — Subquadratic Sparse Attention — a linearly scaling attention mechanism designed for long-context retrieval, reasoning, and software engineering workloads. The technical results are strong on their own merits: 52.2× prefill speedup at 1M tokens, RULER 95.0%, MRCR v2 65.9%, SWE-Bench Verified 81.8%.

But the more interesting question is what happens to the industry if results like these stop being a one-off. The valuations, pricing, and competitive narrative around the major labs have been priced as if compute is the moat — as if maximizing token use and burning more dollars per call is the cost of doing business at the frontier. SSA is one of the first credible signals that this might not be true for much longer. And if it isn't, the OpenAIs and Anthropics of today look less like permanent fixtures and more like the Friendsters and MySpaces of the next platform shift.

The problem isn't "missing context." It's fragmented context.

The hard problems enterprise AI needs to solve are long-context problems. Codebases, contracts, enterprise corpora, databases, spreadsheets, research collections, and long-running agent sessions rarely fail because the answer is absent. They fail because the relevant evidence is distributed across a large body of context, referenced indirectly, and only meaningful when multiple pieces are held in view at once.

If you build with these systems, this list will look familiar:

a codebase where a function is defined in one module, called in dozens of others, and constrained by tests elsewhere
a contract where an obligation depends on a definition, an exception, and a referenced clause several pages apart
a research workflow where a conclusion depends on reconciling evidence across many papers
a long-running coding task where prior planning decisions, intermediate edits, review notes, and regressions all matter

These aren't lookup problems. They're multi-hop reasoning problems over fragmented corpora. And the workarounds we've been using — chunking, RAG, agentic decomposition, recursive summarization — all have the same shape. They preserve some signal and lose some signal. RAG keeps semantic similarity but loses position, hierarchy, neighboring context, and reference structure. Agentic workflows decompose tasks into smaller calls but compound errors across steps and bake hand-authored orchestration policy into the system. The bitter lesson keeps showing up: scaffolding that works today doesn't generalize tomorrow.

SSA is an attempt to remove more of the reason that scaffolding is necessary in the first place.

Why dense attention is the bottleneck

Attention is a retrieval operation built into the model. Each token acts as a query, compares itself against every other token, scores their relevance, and aggregates their information into its next representation. Powerful, because every token gets access to the full context. Expensive, for the exact same reason — every query compares against every key, and the cost grows quadratically with sequence length.

At small contexts this is fine. At hundreds of thousands to millions of tokens, it becomes the dominant constraint. Doubling context doesn't double cost; it quadruples it.

And here's the part that should bother any engineer: most of that work is wasted. In trained models, the vast majority of attention weights are near zero. The model performs the full all-pairs comparison, but only a small fraction of those interactions meaningfully influence the output. Dense attention isn't just quadratic — it's wastefully quadratic.

FlashAttention made this much more practical at today's context lengths by avoiding materialization of the full attention matrix and optimizing memory movement. That's a real win. But it doesn't change the underlying scaling. The number of comparisons is still the same. The model still does quadratic work; it just does that work more efficiently.

System-level workarounds — retrieval pipelines, context compaction, recursive decomposition, agentic orchestration — make dense-attention systems usable. None of them change the scaling law. They route around the limitation. The quadratic cost is the boundary they're routing around.

What prior efficient architectures gave up

The field has spent years trying to make attention cheaper. The hard part isn't reducing cost. It's reducing cost without breaking retrieval. Every prior approach traded something away.

Fixed-pattern sparse attention — sliding windows, strided patterns, dilated masks — gets subquadratic scaling by deciding in advance which positions a token can attend to. The routing decision is positional, not content-aware. The model decides where to look before it knows what it's looking for. When the relevant information falls outside the pattern, it's invisible.

State space models and recurrent alternatives drop the all-pairs comparison entirely, replacing it with a compressed state that evolves across the sequence. Linear scaling by construction — but the state has fixed capacity. Information gets summarized, blurred, or discarded as the sequence grows. Great at gist and structure, weaker at retrieving a specific fact introduced arbitrarily far back.

Hybrid architectures combine both ideas: efficient layers do most of the compute, dense attention layers preserve retrieval. Works in practice, but the dense layers stay load-bearing. As context grows, their quadratic cost dominates again. The benefit is scalar, not asymptotic.

DeepSeek Sparse Attention offsets attention's quadratic cost to a lightning indexer that selects, per query, which keys to attend to. The indexer is itself quadratic — it scores every query against every key with small constants but the same O(n²) scaling. The complexity has been moved, not removed.

The pattern is consistent. Fixed sparsity gives up content-dependent routing. Recurrent models give up exact retrieval. Hybrids reintroduce the original cost. DeepSeek-style indexers stay quadratic and become cost-prohibitive at scale.

The open problem isn't "make attention faster." It's: build a mechanism that's efficient, content-dependent, and capable of retrieving from arbitrary positions across long context.

How SSA works

SSA changes how attention work is allocated. The core idea is content-dependent selection: for each query, the model selects which parts of the sequence are worth attending to, and computes attention exactly over those positions.

Dense attention assumes every pair might matter, so it evaluates all of them. In practice, almost none do. SSA drops that assumption. It doesn't approximate attention — it restricts attention to the positions that actually carry signal, and skips the rest.

That gives SSA three properties that matter together:

Linear scaling in compute and memory. Attention cost grows with the number of selected positions, not the full sequence. Long context becomes economically usable.
Content-dependent routing. The model decides where to look based on meaning, not position. Relevant information can be retrieved regardless of where it appears.
Sparse retrieval from arbitrary positions. Unlike recurrent or compressed approaches, SSA preserves the ability to recover specific information introduced far earlier in the sequence.

The practical distinction matters: SSA is not just a faster implementation of dense attention. It reduces the amount of attention work the model performs. That reduction is what shows up as speed.

Measured in wall-clock input processing time on B200s, SSA achieves the following speedups over standard attention with FlashAttention-2 (FlashAttention-3 did not produce a speedup over FA-2 on B200s):

Context length	SSA speed increase vs. FlashAttention
128K	7.2×
256K	13.2×
512K	23.0×
1M	52.2×

This is the throughput inversion that matters in production. Dense attention becomes slower relative to SSA as context grows. SSA gets more advantageous exactly where long-context workloads become most valuable.

Training SSA for long-context behavior

Architecture is necessary but not sufficient. A model can have a long context window and still fail to use it well. SSA was trained to make long-context use reliable, not just possible.

The training pipeline is three stages:

Pre-training establishes base language modeling capability and the long-context representations the selection mechanism uses.
Supervised fine-tuning shapes behavior toward instruction following, structured reasoning, and the code generation patterns enterprise workloads need.
Reinforcement learning targets the behaviors that are hardest to induce through supervised examples: reliable long-context retrieval, and coding behavior that uses the available context aggressively instead of defaulting to local reasoning.

That last stage is the one developers should care about. Long-context failures often look plausible. A model answers from nearby context because nearby evidence is easier to use, even when the decisive evidence is much earlier. It produces a locally correct patch that violates an interface defined elsewhere. It summarizes a prior decision instead of preserving the exact constraint that should govern a later step. SSA's RL stage is designed around exactly those failure modes.

Training data emphasizes long-form sources with high information density and cross-reference structure — the kind of data that forces the selection mechanism to learn routing over large positional distances. The goal isn't benchmark memorization. It's teaching the model to attend to what matters regardless of where it sits.

Why the training infrastructure matters too

Long-context training isn't only a modeling problem. It's a systems problem that only shows up at scale. At million-token sequence lengths, failure modes that are invisible at shorter contexts become binding — memory pressure, sequence partitioning across devices, gradient instability, numerical precision, kernel efficiency. These determine whether training runs at all.

The SSA training stack runs stably at 1M tokens and beyond, maintains linear memory scaling across the training pipeline, and uses distributed sequence parallelism to shard sequences across devices when they exceed single-device limits.

The consequence isn't just that long-context training becomes possible. It becomes iterable.

Under dense attention, long-context experiments are expensive enough that they get treated as reserved runs. With SSA's linear scaling, they become routine. More ablations, more evaluations, faster feedback, targeted fixes on the behaviors that actually matter at long context.

That's the deeper implication. SSA doesn't only reduce the cost of inference. It reduces the cost of learning long-context behavior in the first place — and that's the thing that compounds for developers downstream.

Evaluating functional context, not nominal context

An advertised context window doesn't tell you how much context a model can use. The real question is whether the model can retrieve, connect, and reason over evidence distributed across that window.

SubQ is evaluated across two axes:

Deployment viability — compute reduction and wall-clock speed
Retrieval capability — RULER, MRCR v2, and SWE-Bench Verified

More general benchmarks will be published in the upcoming model card. Needle-in-a-Haystack tests exact retrieval of a single target. RULER extends that to multi-hop retrieval, aggregation, variable tracking, and selective filtering. MRCR v2 goes further: the model must locate and integrate multiple pieces of evidence distributed across the context, where the relevant set isn't given in advance. That's closer to the shape of real work — finding one fact isn't enough; the model has to determine which pieces matter and combine them into a coherent answer.

Results

Compute and speed

SSA's linear scaling means doubling context length doubles attention compute, rather than quadrupling it. At 1M tokens, that's a 62.5× attention FLOP reduction relative to standard quadratic attention.

Context length	Attention FLOP reduction vs. standard attention
128K	8×
1M	62.5×

Wall-clock speed is the more product-relevant result: a 52.2× prefill speedup over dense attention at 1M tokens. That's the difference between a long-context system that behaves like an interactive tool and one that feels like an offline batch job.

Context length	Input processing speed increase
128K	7.2×
256K	13.2×
512K	23.0×
1M	52.2×

RULER

RULER tests retrieval and reasoning beyond simple needle lookup — multi-hop retrieval, aggregation, variable tracking, selective filtering.

Model	RULER @ 128K
SSA / SubQ	95.0%
Opus 4.6	94.8%

For real workflows this matters because multi-hop tasks compound. A missed reference early in the chain can corrupt every conclusion downstream.

MRCR v2

MRCR v2 is the most demanding retrieval benchmark in this set. It evaluates the ability to locate and integrate multiple non-adjacent pieces of evidence across long context.

Model	MRCR v2 score
SSA / SubQ	65.9%
Opus 4.6	78.3%
GPT 5.5	74.0%
GPT 5.4	36.6%
Opus 4.7	32.2%
Gemini 3.1 Pro	26.3%

SubQ lands at 65.9% — solidly in the range of frontier dense models, well ahead of GPT 5.4, Opus 4.7, and Gemini 3.1 Pro. That's the clearest evidence for the gap between nominal and functional context. A model can accept a long input and still fail to reason reliably over that input. MRCR v2 surfaces the gap because it requires retrieval and combination, not just token processing.

SWE-Bench Verified

SWE-Bench Verified is an end-to-end software engineering benchmark on real GitHub issues. Not a pure retrieval test — it asks whether the model can use codebase understanding to localize bugs, reason about implementation constraints, and produce patches.

Model	SWE-Bench Verified
SSA / SubQ	81.8%
Opus 4.7	87.6%
Opus 4.6	80.8%
Gemini 3.1 Pro	80.6%
GPT 5.4	not reported
GPT 5.5	not reported

Sitting at 81.8% — ahead of Opus 4.6 and Gemini 3.1 Pro on a real-world coding benchmark while running on a subquadratic architecture — is the result that should land hardest for developers. This is the workload most of us actually care about.

The part nobody priced in

Step back from the architecture for a second and look at what the current AI industry is actually selling.

The valuations, the capex, the data center buildouts, the multi-year compute contracts — all of it is underwritten by an assumption that frontier intelligence requires frontier-scale spend. Long context costs a lot. Reasoning costs a lot. Agents cost a lot. The premise running through every pitch deck and earnings call is that the labs with the most GPUs win, and the rest of the market pays for tokens at whatever margin those labs choose.

SSA is one architecture, on one model, with one set of benchmarks. But the result it points at is uncomfortable for that premise: the dominant cost of long-context inference may not be a law of physics — it may be an artifact of dense attention. A 52.2× prefill speedup at 1M tokens isn't a 10% efficiency gain. It is the kind of step-change that, if it generalizes, rewrites the unit economics of the entire industry.

If you don't have to maximize tokens consumed and dollars burned to get frontier-quality long-context behavior, a lot of the moat narrative collapses with it.

Why the incumbents look more fragile than they're priced

The Friendster and MySpace comparison isn't snark — it's a specific lesson. Both had network effects. Both had brand. Both had scale advantages that looked durable right up until a better-architected product showed up and the users moved over a weekend. The moat people talked about (network effects, switching costs) turned out to be much weaker than the moat that actually mattered (a better product on a better stack).

The current frontier labs have a similar mismatch:

API-level switching cost is near zero. Most production code paths abstract the model behind a thin client. Swapping providers is a config change, not a migration.
Compute scarcity is the moat people brag about. It is also the moat that subquadratic architectures attack first. If a challenger can match frontier quality at a fraction of the FLOPs, the capex advantage flips into a capex liability — billions of dollars of GPU contracts depreciating against a more efficient successor.
Pricing power assumes scarcity. Today's per-token prices for long context look reasonable because the underlying compute is genuinely expensive. Drop the cost of a 1M-token prefill by 50× and the same prices start looking like rent extraction, not value capture.
Brand isn't a defense once parity exists. "Nobody got fired for buying OpenAI" works until a model with comparable benchmarks costs an order of magnitude less to serve. Then it works against them, the same way "nobody got fired for choosing IBM" did.

This isn't a prediction that any specific lab disappears. Anthropic, OpenAI, and Google have real assets — distribution, talent, training data, alignment research, regulatory relationships. Those don't evaporate. But the valuations and the pricing power are built on the assumption that frontier compute is a stable moat, and that assumption depends on dense attention staying expensive.

SSA is one of the first credible signals that it might not.

What developers should actually take away

Strip out the industry analysis and the practical takeaways for anyone building on top of these systems are pretty clean:

Long context as a product surface is about to get a lot cheaper and a lot better. If you've been deferring long-context features because the economics didn't pencil, the economics are about to pencil.
A nominal context window has never told you what a model can actually use. RULER 95.0% and MRCR v2 65.9% on a subquadratic architecture is the gap between marketing tokens and functional tokens, and that gap is closing.
Less hand-authored scaffolding. Chunking, recursive summarization, and bespoke orchestration are workarounds for an attention bottleneck. As that bottleneck loosens, the scaffolding becomes a maintenance burden rather than an asset.
Watch where the open and challenger labs go next. Efficient architectures disproportionately benefit teams that don't already own a hyperscaler-sized GPU fleet. The next frontier-quality model that runs cheaply on commodity infra is the one to track.
Don't lock into long-term commitments priced on dense-attention economics. Multi-year contracts written against today's per-token costs are the riskiest thing on the table if a successor architecture cuts those costs by an order of magnitude.

SSA on its own is one paper, one architecture, one set of numbers. The reason it's worth paying attention to is what it implies if the result is real and replicable: the AI bubble's tightest correlation — bigger spend, better model — gets a lot weaker. That's good for developers, good for customers, and meaningfully bad for any incumbent whose story to investors depends on the old curve holding.

The Friendsters and MySpaces of this cycle won't lose because their products got worse. They'll lose because someone shows up with a better-architected stack at a fraction of the cost, and the switching cost turns out to have been a config flag the whole time.

Worth watching.

Very cool use of Backboard!

Jonathan Murray — Sat, 02 May 2026 02:29:45 +0000

DEV Weekend Challenge: Earth Day

Arqam Waheed

Apr 20

Terra Triage: I Built a 3-Agent Wildlife Dispatcher That Learns From Every Referral

#devchallenge #weekendchallenge #ai #backboard

Comments 3

9 min read

"Of Course" Erodes Trust Faster Than Bad Code ... Two Words That Are Killing Your Career

Jonathan Murray — Thu, 30 Apr 2026 18:38:04 +0000

You already have the job or the internship. You're on the team. You're in the meetings. You're in the Slack channels.

And the thing that's going to hold you back has nothing to do with your code.

Someone you work with says "hey, could we do X?" and you say "yeah, of course." Feels confident. Feels like you just proved you've got it.

But you gave an answer that contains zero information. No cost. No timeline. No tradeoff. No indication of whether you even understood the question. And now you're either about to disappear for two weeks and come back with something nobody asked for, or pull an all-nighter for something that was just a question, not a request.

Both started with "of course."

Why I'm Writing This

I'm a non-technical founder. I don't write the code. But I build alongside my team every day. I set direction, I think through problems, I get my hands dirty in the product.

The devs who accelerated fastest on my team were never the ones who said yes the fastest. They were the ones who slowed down long enough to make sure we were talking about the same thing. The ones who said "of course" to everything burned out, shipped the wrong thing, and lost trust. Not because they weren't talented, but because they never gave anyone a chance to actually collaborate with them.

A Yell That Sounds Like a Whisper

Not everything a founder or lead says carries the same weight. But it doesn't always feel that way from your side. When the person steering the ship says "hey what if we tried this," it can land like a mandate even when it's just a thought.

So before you go heads-down for 48 hours on something mentioned in a 5-minute conversation, ask:

"Is this urgent or is this something we should plan for?"

"Are you asking me to build this or are you asking if it's possible?"

"Where does this sit relative to what I'm working on right now?"

And yeah, sometimes the answer is going to be "yes, it's urgent, do it right now, and please don't ask me any more questions." That happens. But even that is better than the silence you were working inside of before. That five-second question just saved you from building the wrong thing at the wrong pace.

"Of Course" Erodes Trust Faster Than Bad Code

You say "of course." You go dark. A week passes. Someone checks in. The thing isn't done, or it's half-done, or it's not what was asked for. The people around you start second-guessing every "of course" that comes after it.

That didn't happen because you're a bad developer. It happened because you skipped making sure everyone was on the same page before you started building.

If you're stuck, say so. If it's more complex than expected, flag it. If the original ask doesn't make sense technically, speak up. "That won't work because of X, but here's what would" is one of the most valuable sentences in engineering.

If you can build but you can't communicate what you're building, why you built it that way, and what could go wrong, you are operating at half your potential.

What the Best Devs Actually Sound Like

"I want to make sure I understand what you're looking for before I start."

"That's a cool idea. Here's what it would take and here's what we'd need to deprioritize."

"I can do a rough version by Friday to see if it's even the right direction. Want that instead of the full build?"

"Honestly, I'm not sure yet. Let me look into it and come back to you tomorrow."

None of those sound weak. They sound like someone you'd hand the keys to. Someone who respects the problem enough to not pretend it's already solved.

The Line

The most dangerous dev says "of course."

The most valuable dev says "let me make sure I understand the problem first."

You already got the job. Now show them why they were right.

The Hidden Challenge of Multi-LLM Context Management

Jonathan Murray — Fri, 24 Apr 2026 20:19:51 +0000

Why token counting isn't a solved problem when building across providers

Building AI products that span multiple LLM providers involves a challenge most developers don't anticipate until they hit it: context windows are not interoperable.

On the surface, managing context in a multi-LLM system seems straightforward. You track how long conversations get, trim when needed, and move on. In practice, it's considerably more complex — and if you're routing requests across providers like OpenAI, Anthropic, Google, Cohere, or xAI, there's a fundamental mismatch that can break your product in subtle ways.

The Tokenization Problem

Every major LLM provider uses its own tokenizer. These tokenizers don't agree. The same block of text produces different token counts depending on which model processes it. The difference is often 10–20%, sometimes more.

What this means in practice: a conversation that fits comfortably in one model's context window may silently overflow another's. A prompt routed to OpenAI might count as 1,200 tokens; the same prompt routed to Claude might count as 1,450. That gap matters.

Where It Breaks

The failure modes tend to show up at the boundaries. When you switch providers mid-conversation, the new model has to ingest the full prior context. If your context management layer was calibrated to the previous model's tokenizer, the new model may see a context that's already at or over the limit — before it's even responded to anything new.

This produces three common failure patterns:

Unexpected context-window overflow: the conversation that worked before now breaches the limit
Inconsistent truncation: different models truncate at different points, changing what prior context the model actually sees
Routing failures that are unpredictable because the numbers your system used don't match the numbers the model actually used

Why Simple Estimates Fail

The instinct is to maintain a single "token estimate" with a generous safety margin. The problem is that the margin you'd need varies by provider, model version, and content type (code tokenizes differently than prose). A margin calibrated for one use case will either be too tight for another, causing failures, or too generous, causing unnecessary truncation that degrades conversation quality.

The Solution: Provider-Aware Token Counting

A robust multi-LLM context management layer makes token counting provider-specific. Rather than maintaining a single estimate, it measures each prompt the way the actual target model will measure it. The routing layer uses these per-provider measurements to make decisions before requests are sent.

This lets the system stay ahead of context limits: it knows when a conversation is approaching an edge, trims or compresses history calibrated to the specific model receiving the request, and avoids the pricing and failure surprises that come from miscounted tokens.

The end result is what users should see: a smooth conversation experience, regardless of which model is serving it. The complexity of "every model speaks a slightly different token language" stays inside the infrastructure layer, invisible to the people using the product.

This is the approach we've taken in our adaptive context window management component, and it's become a foundational part of how we think about multi-LLM routing more broadly.

Rob Imbeault
Apr 17, 2026

Why LLM Reasoning Is Breaking AI Infrastructure (And How to Fix It)

Jonathan Murray — Fri, 24 Apr 2026 20:18:05 +0000

If you've tried building anything serious on top of large language models (LLMs) recently, you've probably run into this:

"Thinking" is supposed to make models better. In practice, it makes your infrastructure worse.

This isn't a model problem—it's an infrastructure and abstraction problem. And it's getting worse as teams scale across multiple AI providers.

Let's break down exactly where things go wrong.

The Illusion of "Just Turn On Reasoning"

At a high level, LLM reasoning sounds straightforward:

Turn reasoning on → better answers
Turn reasoning off → cheaper, faster

But in production systems, reality looks very different.

What actually happens:

Models don't reason when explicitly prompted
Models over-reason on trivial queries, wasting tokens
Behavior is inconsistent across providers and model versions

Instead of predictable performance, you get variability.

You're no longer just building an AI product—you're debugging model behavior at runtime.

The Fragmentation Problem in LLM Reasoning

One of the biggest hidden challenges in AI infrastructure today is fragmentation.

Every major provider has implemented reasoning differently:

OpenAI → reasoning effort levels (low, medium, high)
Anthropic (Claude) → explicit reasoning token budgets
Google AI (Gemini) → hybrid approaches depending on model version

That's just input configuration.

Output fragmentation is even worse:

Some models return separate reasoning blocks
Others provide summarized reasoning
Some mix reasoning directly into standard responses

There is:

No shared schema
No standardized interface
No predictable structure

What this means for developers:

If you're building a multi-model AI system, you now need:

Input normalization layers
Output parsing logic per provider
Custom handling for reasoning formats

At this point, "simple API routing" becomes complex middleware engineering.

AI Cost Optimization Becomes a Moving Target

Reasoning doesn't just impact performance—it breaks cost predictability.

Billing inconsistencies across providers:

Some expose reasoning tokens explicitly
Others bundle them into total usage
Some introduce custom billing fields

Now you're not just optimizing latency or quality.

You're building a cost translation layer across providers.

This adds complexity to:

Forecasting
Budget control
Scaling decisions

Why Multi-Model Switching Breaks Systems

In theory, switching between LLM providers should improve reliability and cost efficiency.

In practice, it introduces system instability.

Even within a single provider:

Different endpoints behave differently
Input formats change
Output schemas change
Reasoning structures vary

Now add state management:

What context should persist?
How do you maintain reasoning continuity?
How do you prevent token explosion?

The result:

Most teams either:

Abandon portability, or
Build fragile adapter layers that constantly break

The Real Problem: Lack of Abstraction

After working through these challenges, one thing becomes clear:

The core issue isn't reasoning—it's the absence of a unified abstraction layer.

Developers today are forced to:

Learn multiple reasoning systems
Normalize different response formats
Track multiple billing models
Rebuild state handling for each provider

This is not scalable.

What "Unified LLM Reasoning" Should Look Like

To make AI infrastructure truly production-ready, reasoning needs to be abstracted.

A unified system should provide:

A single reasoning parameter
Direct control over reasoning budgets
Consistent behavior across models
Standardized input/output formats

The impact:

Developers can:

Tune reasoning without provider lock-in
Switch models without rewriting logic
Maintain consistent state across systems

And most importantly:

Stop thinking about thinking.

The Uncomfortable Truth About Scaling AI Systems

If you're working with LLMs and haven't encountered these issues yet—you will.

Complexity compounds rapidly when you:

Add a second provider
Enable reasoning features
Optimize for cost
Maintain persistent context

At that point:

You're no longer building your product. You're building AI infrastructure.

The Future of AI Platforms

Short-term impact:

Reduced engineering time (weeks to months saved)
Lower debugging overhead
More predictable cost structures

Long-term shift:

The winning AI platforms won't be defined by model quality alone.

They will be defined by:

Interoperability (model interchangeability)
Statefulness (persistent, portable context)

That's the real unlock in the next phase of AI development.

Quick Audit for Your AI Stack

If you're currently integrating multiple LLM providers, ask yourself:

How many reasoning formats are you handling?
How portable is your state management layer?
How predictable are your AI costs?

If those answers aren't clean and consistent:

You're already paying the infrastructure tax.

Rob Imbeault
Apr 20, 2026

I Broke SSO Trying to Center a Div. Let's Talk About Tokenmaxxing

Jonathan Murray — Fri, 24 Apr 2026 15:28:12 +0000

Backboard CODEGEN CLI Waitlist

A couple weeks ago, I tried to recenter some text on one of my side project SSO pages.

That's the whole task. Move the text. Left a bit. Right a bit. Until it's in the middle. Center. Middle.

I opened Claude Code. I said, roughly, "hey, center this."

Fifteen minutes later I was two bugs deep, SSO was broken — not the text, the whole login flow — and I'd hit my usage limit trying to unbreak the thing I broke while trying to do the thing that should've taken eight seconds in the inspector.

Palm. Face.

"Why did I just do that?"

That was tokenmaxxing. You know what tokenmaxxing is. Your timeline knows. There is an entire subgenre of VC on X right now posting, with their whole chest, variations of "if your engineers aren't maxing out their token budgets every day, they aren't working hard enough."

Three thousand likes. Quote tweets from other VCs agreeing. "This," they type. "100%," they type.

I want to say this clearly, one time, so we can move on: that is insane.

Measuring engineering effort by token spend is like measuring a chef by how much gas they burn. Congratulations. Your kitchen is on fire and the soup is fine.

Tokenmaxxing is when the answer to every problem — a typo, a bug, a bad schema, a bad decision, a bad Tuesday — is to shove more context, more tokens, more model at it until the problem stops complaining.

It is console.log("hello world") wearing a $400 watch.

A lot of people are going to read this and get defensive. I get it. I've done it.

I once built a "documentation agent" that loaded the entire repo into context and then asked, very politely, whether we had a login page.

We did. It was in routes/login.tsx.

That query cost $2.17.

I tell myself it was research.

Here's the part nobody says out loud: brute-force compute is the new jQuery.

Not in the "it works, ship it" way. In the "we're going to look at this in three years and wince" way.

We're living in a window where:

A 100k-token prompt to find one number is considered normal.
"Just pass the whole codebase" is a real architectural decision that real adults say out loud in real meetings.
The solution to hallucinations is more tokens. The solution to latency is more tokens. The solution to your cat being sad is, apparently, more tokens.

And the people selling the tokens? Thrilled. Obviously. You would be too.

I want to be clear: I love LLMs. I use them constantly. I have emotions about them I will not discuss here.

But the current game is rigged in a very specific direction. The model companies make more money when you're lazy. Your sloppy prompt is their margin. Your 90,000-token scaffolding is someone's yacht.

Meanwhile, the indie devs — the people who built the internet worth having — are getting priced out of the exact kind of tinkering that used to be free. You can't "just try something" when "just trying something" is $40.

The next big app should not require a $10k/month API budget to prototype. It used to require a laptop and an unreasonable amount of Red Bull. I'd like to go back to that, if possible.

So here is my proposal, which I will now name dramatically so it fits in a tweet.

The Token Minimizing Revolution.

It has two rules, and they are embarrassingly obvious.

Precision over volume. A small, clever retrieval beats a giant dumb context every time. RAG, fine-tunes, routers, caches. Boring stuff. Works.
Token-golfing is the new code-golfing. The flex is not "look what I made the big model do." The flex is "look what I made the small model do."

We're building something right now at backboard.io that is the opposite of tokenmaxxing.

But if you've been feeling that itch — the one where you look at your API bill and think this is not a technology problem, this is a vibes problem — you are not alone.

The revolution will be small. Efficient. Under budget.

Stop building what your customers ask for

Jonathan Murray — Wed, 22 Apr 2026 13:44:39 +0000

I was at a conference this week.

Bunch of stakeholders on stage. Hospital admins, big-name buyers, a couple of policy folks. The message to founders was loud and clear:

"You need to be consulting us. You need to be adapting your products to our suggestions."

And honestly? I hated it.

Not because they were completely wrong. They were half right. They were just shouting the half that was wrong.

Here's the part that's true

Building in a vacuum is how you ship things nobody uses. Founders, especially technical ones, have a real habit of deciding what the world needs from inside a Notion doc.

So yes. Talk to users. Ride along. Watch people struggle with your product. All of that.

The stakeholders aren't crazy for wanting a seat at the table.

Here's the part that breaks things

"Listen to us" slowly turns into "do what we say."

And that's where it gets weird.

Because every dev on earth has learned this lesson already. It's called a bug report.

A user says "the login is slow." You dig in. The login isn't slow. They're on hotel wifi and there's no loading spinner, so it feels frozen. The complaint was real. The proposed fix, "make the login faster," was useless.

Stakeholder feedback works exactly the same way.

The pain is the signal. The proposed fix is a guess. Usually a bad one.

A senior eng who shipped whatever the ticket said would get laughed out of the room. Why do we call a founder who ships whatever the customer asks for "responsive"?

Why the fix is almost always wrong

Three reasons, no mystery to any of this:

1) Stakeholders see their slice. Not the whole system. Of course their fix is local.
2) They imagine solutions inside the workflow they already have. Which is often the exact workflow you're trying to change.
3) The thing that would actually solve the problem doesn't exist in their vocabulary yet. That's kind of your job.

When a cardiologist says "add a button that auto-generates the referral letter," the real signal is referrals are friction. The button might be the worst possible version of the fix. Maybe the letter shouldn't exist. Maybe the referral shouldn't need a letter. That's a conversation. Not a ticket.

The receipt: healthcare AI just ran this experiment for us

For years, stakeholders told the industry they wanted "AI that can pass the medical boards."

The industry listened. Every model got tuned on USMLE-style questions. Board-exam scores became the benchmark everyone pointed at.

This month, JAMA Network Open dropped a study across 21 top LLMs (ChatGPT, Claude, Gemini, DeepSeek, Grok). Final-diagnosis accuracy on complete cases? Over 90%.

Differential diagnosis, the thing an actual doctor does all day? Failed more than 80% of the time.

The stakeholders asked for the wrong benchmark. Founders shipped it. We now have a generation of models that ace trivia and fold on reasoning.

The founders who had pushed back, the ones who said we hear you want trustworthy AI, we're not going to chase board scores to prove it, would look prescient right now. The ones who obeyed built an industry of exam-passers.

How to actually do it

When a stakeholder hands me a feature request, I try to never put it in the backlog as written. Three questions first:

1) What were they trying to do when they felt the pain?
2) What's the actual friction, stripped of their proposed fix?
3) What would "solved" feel like, regardless of how it gets built?

Rule of thumb: if a stakeholder ask fits neatly into a Jira ticket, I haven't translated it yet.

Back to the conference

I get why the stakeholders were on that stage. They've been burned by founders who ignored them. They want to be heard.

But "heard" is not the same as "obeyed." And founders who treat customer feedback as a spec instead of a bug report end up building slightly nicer versions of the thing that already isn't working.

Listen obsessively.
Obey selectively.
And be willing to tell the room that the button they're asking for isn't the thing they actually need.

That's not arrogance. That's the job.

What's a piece of stakeholder feedback you took literally and regretted? Or one you translated into something better and it worked?