mithilesh gaurihar

Posted on Jun 29 • Originally published at Medium

Build the Runtime, Not the Stack

#ai #llm #opensource #softwareengineering

By Krish Garg, Mithilesh Gaurihar

What this post covers

AI infrastructure risk is no longer theoretical. Here is what we are going to walk through:

A US government directive took Anthropic's Fable 5 offline for every developer on the planet, with zero notice.
GitHub Copilot ended flat-rate pricing and moved to usage-based billing, quietly ending the AI coding subsidy era.
Every team hard-wired to a single model, tool, or pricing plan had no fallback. The ones that did had a runtime layer sitting above all of it.
Model-agnostic doesn't mean migration is free. You will retune. But the runtime scopes the damage to one layer instead of everything.
The platforms you're routing across, OpenAI, Anthropic, Microsoft, are all actively trying to absorb the orchestration layer themselves. That's the real lock-in risk, not the model.
The answer is building the runtime on open infrastructure before you need it, not after.

Bottom line: the model, the tool, and the pricing plan can all move overnight. The only thing that doesn't have to is the runtime layer above them.

Two Headlines. One Lesson.

The US Commerce Department sent Anthropic a letter on a Friday evening. By midnight, Fable 5 and Mythos 5 were offline for every user on the planet. Not just foreign nationals, but everyone, because Anthropic had no way to verify citizenship at scale in real time. AWS Bedrock, Google Cloud, Microsoft Foundry, and the direct Claude API all went dark simultaneously. Teams woke up the next morning to broken pipelines and no warning.

Read Anthropic's official statement →

Around the same time, GitHub Copilot ended flat-rate pricing and moved every plan to usage-based billing. The AI coding subsidy era ended with an email and a new line item on the bill.

Read the GitHub announcement →

And if you think that's isolated to models and billing, SpaceX has agreed to acquire Cursor-maker Anysphere for $60 billion pending regulatory approval. Same pattern, different layer.

Read the CBS News report →

Different companies. Different layers. Same pattern: rented ground.

Why this keeps happening

Most AI system engineering goes into the model layer. Prompt design, tool calling, eval pipelines, output parsing. All of it tuned against a specific model's behavior. When that model changes, disappears, or gets repriced, the tuning work is either portable or it isn't. In most production systems, it isn't.

The layer that makes it portable is the runtime: model routing, observability, and cost tracking sitting above the model itself. Teams that built that layer had options on Friday night. Teams that didn't were waiting for support tickets.

This isn't a new idea. Perplexity's Aravind Srinivas has argued that the model is not the product. Anthropic's Boris Cherny has written about loop engineering: stop prompting agents directly and design the loops that prompt them. The durable artifact is the loop, not the model.

This news just made that expensive to ignore.

What model-agnostic actually means in practice

A model-agnostic runtime needs to do three things. Most teams skip all three until they are forced to deal with the consequences.

1. Treat the model as a configuration, not a dependency.
Prompts, tool definitions, and output schemas should not assume a specific provider's API shape. The moment your prompt is written for GPT-4's output format, or your tool calling assumes Anthropic's schema, you have a hard dependency masquerading as flexibility.

2. Span the entire run.
The runtime needs to see every component, every model call, every handoff in a single trace. When you route across multiple models, the failure is almost never inside one model. It is in the handoff between them. A trace that only covers one provider tells you nothing useful about that failure.

3. Surface cost in real time.
Not in a billing dashboard at the end of the month. As the run executes. When you pay per token, the decision of whether to commit a workload to an expensive model needs to happen before you commit it, not after.

Most teams get none of these right because they build the model integration first and the runtime later.

How RocketRide implements this

The model is a node, not a dependency

In RocketRide's pipeline builder, the model is a component node sitting between your prompt and your output. A Chat source feeds into a Prompt component, passes questions to an LLM node, and returns answers downstream. That LLM node is a configuration entry. Point it at OpenAI, Anthropic, Gemini, Mistral, Qwen, or a local Ollama instance and the rest of the pipeline stays exactly where it is. The connections, the prompt component, the output handler: none of it changes. You are swapping one node, not rewriting the system.

Here is what a pipeline looks like running end to end:

Swap GPT for Claude, Gemini, Qwen, Mistral, or a local Ollama instance without touching the pipeline.

Every run is fully observable

When a pipeline runs, RocketRide streams six views simultaneously: Design, Status, Tokens, Flow, Trace, and Errors. Each scoped to that specific run, not aggregated across all runs. The Status tab streams CPU%, CPU Memory, GPU Memory, and Total Completions per second in real time.

CPU, memory, and completion throughput stream live as the pipeline executes. Scoped to the run, not the account.

This matters for a specific reason. When you route across multiple models, the runtime is the only place that sees all of them. Without that span, you are flying blind across every provider boundary. A trace that only covers one model tells you nothing useful when the failure is in the handoff between them.

Cost surfaces before you commit

The Tokens tab breaks down what each component consumed: Total Tokens, CPU Usage, CPU Memory, and GPU Memory, per component, not just as a pipeline total.

The Tokens tab breaks down CPU usage, memory, and total tokens per component in real time. Cost visibility at the right layer.

That per-component breakdown is what makes cost decisions actionable. You can see whether the Chat component is consuming disproportionate tokens before the expensive LLM node even runs. When you are evaluating whether a cheaper model held up on a complex task, you are looking at this view, not waiting for an invoice at the end of the month.

The honest part: swapping is not free

Repointing a config key is the start of the work, not the end.

If the banned model was the one your prompts, tool-calling schemas, and eval thresholds were tuned against, you will retune. Prompt behavior differs across providers. Tool calling schemas are not universal. Evals that passed on one model will fail on another. That is real effort and we won't pretend otherwise.

What the runtime gives you is a scoped problem instead of an architectural crisis. The pipeline structure, the tracing across the Flow and Trace tabs, the cost visibility in the Tokens tab: none of that moves when you swap the model node. You are fixing one layer, not rebuilding everything. That distinction matters when you are scrambling at midnight because a government directive just took your model offline.

The economics make the effort worth it. Harvey's multi-model approach points to a consistent pattern: routing cheaper models for lower-complexity steps and premium models where it counts can outperform a single expensive model while costing less. When you pay per token, that gap compounds across every agent run. But those gains only materialize if the runtime layer exists to do the routing and the measurement. Without it, you cannot even see where the cost is going, let alone optimize it.

The counterargument worth taking seriously

The obvious pushback: capability beats resilience. Bet on the best model and architectural flexibility is a distraction.

The stronger version is structural. OpenAI's AgentKit and Anthropic's Managed Agents are both pulling orchestration, memory, evals, and observability into managed surfaces. VentureBeat reported that Anthropic has explicitly signaled the orchestration layer is where lock-in will live, because tooling infrastructure will outlast interchangeable models.

The platforms you are routing across are actively trying to absorb the runtime themselves.

That is a real risk. It is also the clearest argument for building on open infrastructure. A vendor-owned runtime is the same rented ground problem one layer up. The swap cost has to stay low, which means the runtime cannot be the thing you are locked into. That is why RocketRide Server is open source. Not as a positioning decision. As an architectural one.

What to do

The routing logic, tracing, and cost attribution are not novel engineering. What is hard is building the abstraction before you need it. Most teams build the model integration first and the runtime later, if at all. When the model gets banned or repriced, there is no later.

If your loop is model-agnostic, observable, and on open infrastructure, the next ban, acquisition, or repricing is an operational inconvenience rather than an architectural crisis.

Start here

Try RocketRide Cloud — spin up a model-agnostic runtime and watch every component trace and token stream live as it runs.
RocketRide Server on GitHub — open source and open to contributions. If you're building in this space, we'd love to build alongside you.
Download the free VS Code extension — the pipeline builder and all six run views live directly inside VS Code, free and open source, no account needed to get started.

If you've run a pipeline on RocketRide Cloud, let us know your feedback.

DEV Community