GitHub: github.com/nimiplatform/nimi | Apache-2.0 / MIT
Local inference is becoming the default. But fragmentation is the real problem.
Models are getting stronger and smaller. Local inference is no longer a hobbyist pursuit — it's becoming a standard part of how AI apps are built. IDC predicts that by 2027, 80% of AI inference will run locally or at the edge.
The 2025 Stack Overflow Developer Survey found that 59% of developers use three or more AI tools simultaneously. Open any AI project being built today and you can see why. Take an AI character app as an example: it needs speech recognition (STT), text reasoning (LLM), voice synthesis (TTS), scene image generation, and maybe background music. Five modalities, five different capabilities.
With today's toolchain, we need:
- Local text inference: Ollama or llama.cpp
- Local image generation: ComfyUI or AUTOMATIC1111
- Local voice synthesis: Piper or GPT-SoVITS
- Cloud video generation: Runway API
- Cloud music generation: Suno API
Five tools. Five processes. Five configurations. Five different interface formats.
Every AI capability is an island.
It's like cooking a single meal, but the kitchen is split into five rooms. Chopping in room A, frying in room B, seasoning in room C. Every room has a different lock, a different stove, and different measuring cups.
We spend 40% of our development time writing glue code — provider switching, fallback logic, health checks, streaming adapters, token metering, error retries. None of this has anything to do with our actual product. Yet every AI app is writing the same glue from scratch.
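The glue in question looks roughly the same in every codebase. Here is a minimal sketch of the retry-then-fall-back wrapper the paragraph describes; the `Provider` type and all names are made up for illustration, not any real SDK:

```typescript
// Illustrative glue code: the provider-fallback wrapper every AI app
// ends up hand-rolling. All names here are hypothetical.
type Provider = (prompt: string) => Promise<string>;

async function generateWithFallback(
  prompt: string,
  providers: Provider[],
  retriesPerProvider = 2,
): Promise<string> {
  for (const provider of providers) {
    for (let attempt = 0; attempt < retriesPerProvider; attempt++) {
      try {
        return await provider(prompt);
      } catch {
        // swallow the error, retry, then move on to the next provider
      }
    }
  }
  throw new Error('All providers failed');
}
```

Multiply this by five modalities, add health checks and token metering, and the 40% figure stops sounding surprising.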
That leaves maybe 20% of our time for what we actually want to build. The remaining 40% goes to servers, deployment, and infrastructure.
Local tools and cloud SDKs each solve half the problem
The first half of the development pipeline already has some solutions. OpenRouter's numbers tell the story: 5 million developers routing requests across 60+ providers and 300+ models. Multi-provider isn't a niche need, it's the norm.
Category one: local model runners. Ollama, LM Studio, LocalAI. They solve "run a model on your machine," but they don't touch the cloud. When local GPU isn't enough, or we need GPT-4-level reasoning, we're on our own switching to a cloud SDK.
Category two: cloud API gateways. OpenRouter, LiteLLM. They unify multiple cloud providers behind one interface, but they don't touch local. Want to use local models to save money, or work offline? They can't help.
There's also a third category: application-layer frameworks. LangChain, Vercel AI SDK. They abstract at the app level, but they don't manage where inference actually runs, what happens when a provider goes down, or how to manage local engine lifecycles.
No single solution handles: local + cloud + multimodal + routing/fallback + lifecycle management.
Each one solves one piece of the puzzle. Nobody has completed the whole picture.
Nimi Runtime is that complete picture.
Runtime: not a model runner — an execution surface
We built Nimi Runtime. In one sentence:
Docker for AI inference.
Docker didn't solve "how to run a program" — that was already solved. Docker solved "run it anywhere, same behavior." Nimi Runtime works the same way: whether calling local Llama or cloud GPT-4, whether it's text or image or voice — same interface, same behavior.
A single Go daemon, running in the background. Start it up, and every AI capability comes through one port.
```shell
nimi start                           # start the daemon
nimi run "Hello"                     # default inference (local or cloud)
nimi run --provider gemini "Hello"   # specify cloud
nimi run --model llama3.2 "Hello"    # specify local
```
Same command. Same interface. The execution plane is abstracted away.
42 cloud providers, covering OpenAI, Anthropic, Gemini, DeepSeek, Qwen, MiniMax, Kimi, GLM, and more. On the local side, Runtime supports LocalAI and the Nexa SDK and manages their lifecycle automatically: startup, shutdown, health probes, fault recovery. No manual management needed.
12 modalities, one protocol:
| Modality | Capability |
|---|---|
| Text generation | Chat, instructions, tool calls |
| Text + vision | Image understanding, OCR |
| Image generation | Text-to-image, image-to-image |
| Video generation | Text-to-video, image-to-video |
| Speech synthesis | TTS + voice cloning |
| Speech recognition | STT + timestamp alignment |
| Music generation | Text-to-music, style transfer |
| Embeddings | Semantic search, RAG |
| Knowledge retrieval | Document indexing + semantic search |
All through a single gRPC interface.
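To make "many modalities, one interface" concrete, here is a hypothetical sketch of what a single dispatch surface can look like from the app side. The request shape, modality names, and handler wiring are assumptions for illustration, not the actual Nimi gRPC protocol:

```typescript
// Hypothetical sketch of a single multimodal entry point.
// Everything here is illustrative, not the real Nimi protocol.
type Modality = 'text' | 'image' | 'speech' | 'music' | 'embedding';

interface InferRequest {
  modality: Modality;
  prompt: string;
  provider?: string; // omit to let the runtime route automatically
}

type Handler = (req: InferRequest) => string;

// One dispatch table instead of five separate client libraries.
const handlers: Record<Modality, Handler> = {
  text: (r) => `text<${r.prompt}>`,
  image: (r) => `image<${r.prompt}>`,
  speech: (r) => `speech<${r.prompt}>`,
  music: (r) => `music<${r.prompt}>`,
  embedding: (r) => `embedding<${r.prompt}>`,
};

function infer(req: InferRequest): string {
  return handlers[req.modality](req);
}
```

The point of the design: adding a sixth modality means adding one handler, not integrating a sixth tool.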
Key capabilities: routing, fallback, and code we no longer need to write
Smart routing. Set local-first priority: if LocalAI is healthy, it runs locally. LocalAI goes down? Automatic switch to cloud OpenAI. OpenAI rate-limited? Automatic switch to Gemini. Zero if-else statements needed.
Health monitoring. Runtime probes every provider every 8 seconds — HEALTHY, DEGRADED, UNREACHABLE, UNAUTHORIZED. What we see in our app is always an available inference service. Who's running it behind the scenes doesn't matter.
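The routing and health logic above can be sketched as follows. The four health states come from the text; the types and ordering rules are an assumed illustration of local-first selection, not Nimi's actual implementation:

```typescript
// Sketch of local-first routing over provider health states.
// States are from the text; everything else is an assumption.
type Health = 'HEALTHY' | 'DEGRADED' | 'UNREACHABLE' | 'UNAUTHORIZED';

interface ProviderStatus {
  name: string;
  local: boolean;
  health: Health;
}

// Prefer local over cloud; within each group, prefer HEALTHY over
// DEGRADED; skip unreachable or unauthorized providers entirely.
function pickProvider(statuses: ProviderStatus[]): string | null {
  const usable = statuses.filter(
    (s) => s.health === 'HEALTHY' || s.health === 'DEGRADED',
  );
  usable.sort((a, b) => {
    if (a.local !== b.local) return a.local ? -1 : 1; // local first
    if (a.health !== b.health) return a.health === 'HEALTHY' ? -1 : 1;
    return 0;
  });
  return usable[0]?.name ?? null;
}
```

The app never sees this decision; it only sees that `generate` succeeded.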
Idempotent deduplication. 10,000-request sliding window prevents duplicate billing. Concurrency control: global limit of 8, per-app limit of 2, all configurable.
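A sliding-window dedup like this can be sketched in a few lines. The 10,000-request window size comes from the text; the class and its API are an assumed illustration:

```typescript
// Sketch of sliding-window idempotency: remember the last N request
// IDs and reject repeats. The 10,000 default is from the text.
class IdempotencyWindow {
  private seen = new Set<string>();
  private order: string[] = [];

  constructor(private capacity = 10_000) {}

  // Returns true if the request is new, false if it's a duplicate.
  admit(requestId: string): boolean {
    if (this.seen.has(requestId)) return false;
    this.seen.add(requestId);
    this.order.push(requestId);
    if (this.order.length > this.capacity) {
      const evicted = this.order.shift()!;
      this.seen.delete(evicted);
    }
    return true;
  }
}
```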
Audit trail. Every AI call is logged — which provider, how many tokens, what routing decision was made, whether auto-switching occurred. Ring buffer stores the last 20,000 events. For debugging, cost analysis, and compliance.
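A fixed-size ring buffer is the standard structure for "keep the last N events". The 20,000-event capacity is from the text; the event shape and class below are assumed for illustration:

```typescript
// Sketch of a fixed-size ring buffer for audit events.
// The event shape is an assumption; the capacity is from the text.
interface AuditEvent {
  provider: string;
  tokens: number;
  fellBack: boolean;
}

class AuditRing {
  private buf: AuditEvent[];
  private next = 0;
  private count = 0;

  constructor(private capacity = 20_000) {
    this.buf = new Array(capacity);
  }

  record(event: AuditEvent): void {
    this.buf[this.next] = event; // overwrite the oldest slot when full
    this.next = (this.next + 1) % this.capacity;
    this.count = Math.min(this.count + 1, this.capacity);
  }

  // Events in oldest-to-newest order.
  events(): AuditEvent[] {
    const start = (this.next - this.count + this.capacity) % this.capacity;
    return Array.from(
      { length: this.count },
      (_, i) => this.buf[(start + i) % this.capacity],
    );
  }
}
```

Writes stay O(1) and memory stays bounded, which is why a ring beats an append-only log for an always-on daemon.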
This is code we used to write in every project. Not anymore.
Comparison
All solutions in one table:
| Capability | Ollama | LM Studio | LocalAI | ComfyUI | OpenRouter | LangChain | Nimi Runtime |
|---|---|---|---|---|---|---|---|
| Local text inference | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ |
| Local image generation | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ✅ |
| Local TTS/STT | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ✅ |
| Cloud providers | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ | ✅ (42) |
| Local + cloud routing | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ |
| Auto-fallback | ❌ | ❌ | ❌ | ❌ | Partial | ❌ | ✅ |
| Video / music | ❌ | ❌ | ❌ | Partial | ❌ | ❌ | ✅ |
| Daemon architecture | ✅ | ❌ | ❌ | ❌ | N/A | ❌ | ✅ |
| Workflow DAG | ❌ | ❌ | ❌ | ✅ | ❌ | ✅ | ✅ |
| Permissions / audit | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ |
Ollama made running local models beautifully simple. ComfyUI set the standard for image workflows. But each solves one dimension of the problem.
Nimi Runtime unifies these dimensions into a single execution surface. Our apps only need to know one thing: call the Runtime.
Use it in your app
SDK integration:
```typescript
import { Runtime } from '@nimiplatform/sdk/runtime';

const runtime = new Runtime();

// Local inference
const local = await runtime.generate({
  prompt: 'Describe yourself in one sentence.',
});

// Cloud inference: same interface, one parameter added
const cloud = await runtime.generate({
  provider: 'gemini',
  prompt: 'Describe yourself in one sentence.',
});
```
Already using Vercel AI SDK? Zero migration cost:
```typescript
import { generateText } from 'ai';
import { createNimiAiProvider } from '@nimiplatform/sdk/ai-provider';

const nimi = createNimiAiProvider({ runtime });

const { text } = await generateText({
  model: nimi.text('gemini/default'),
  prompt: 'Hello from Vercel AI SDK + Nimi',
});
```
Under the hood it's Nimi Runtime's routing and fallback logic. But the code looks exactly like using an OpenAI provider.
Nimi Runtime is open source. Apache-2.0 / MIT dual license.
GitHub: https://github.com/nimiplatform/nimi
```shell
curl -fsSL https://install.nimi.xyz | sh
nimi start
nimi run "Hello"
```
Three commands. One unified AI inference surface. 42 cloud providers, local engines, 12 modalities, one interface.
Stop writing a separate integration for every AI modality. Point Claude at the repo and it will tell you what to do next.
Nimi Team
GitHub: https://github.com/nimiplatform/nimi