<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nimi</title>
    <description>The latest articles on DEV Community by Nimi (@nimi_runtime).</description>
    <link>https://dev.to/nimi_runtime</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3830920%2F5c8966d9-dd48-44ed-9a74-c3d8d754d2fb.png</url>
      <title>DEV Community: Nimi</title>
      <link>https://dev.to/nimi_runtime</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nimi_runtime"/>
    <language>en</language>
    <item>
      <title>One SDK, 12 Modalities: AI Inference Shouldn't Be This Fragmented</title>
      <dc:creator>Nimi</dc:creator>
      <pubDate>Wed, 18 Mar 2026 09:44:50 +0000</pubDate>
      <link>https://dev.to/nimi_runtime/one-sdk-12-modalities-ai-inference-shouldnt-be-this-fragmented-4oi1</link>
      <guid>https://dev.to/nimi_runtime/one-sdk-12-modalities-ai-inference-shouldnt-be-this-fragmented-4oi1</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqgv9en7fvs7zegayb1nm.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqgv9en7fvs7zegayb1nm.jpg" alt="Nimi Banner" width="800" height="132"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;GitHub: &lt;a href="https://github.com/nimiplatform/nimi" rel="noopener noreferrer"&gt;github.com/nimiplatform/nimi&lt;/a&gt;&lt;/strong&gt; | Apache-2.0 / MIT&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;Local inference is becoming the default. But fragmentation is the real problem.&lt;/h2&gt;

&lt;p&gt;Models are getting stronger and smaller. Local inference is no longer a hobbyist pursuit — it's becoming a standard part of how AI apps are built. IDC predicts that by 2027, 80% of AI inference will run locally or at the edge.&lt;/p&gt;

&lt;p&gt;The 2025 Stack Overflow Developer Survey found that 59% of developers use three or more AI tools simultaneously. Open any AI project today and you'll see why. Take an AI character app: it needs speech recognition (STT), text reasoning (LLM), voice synthesis (TTS), scene image generation, and maybe background music. Five modalities, five separate capabilities.&lt;/p&gt;

&lt;p&gt;With today's toolchain, we need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Local text inference: Ollama or llama.cpp&lt;/li&gt;
&lt;li&gt;Local image generation: ComfyUI or AUTOMATIC1111&lt;/li&gt;
&lt;li&gt;Local voice synthesis: Piper or GPT-SoVITS&lt;/li&gt;
&lt;li&gt;Cloud video generation: Runway API&lt;/li&gt;
&lt;li&gt;Cloud music generation: Suno API&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Five tools. Five processes. Five configurations. Five different interface formats.&lt;/p&gt;

&lt;p&gt;Every AI capability is an island.&lt;/p&gt;

&lt;p&gt;It's like cooking a single meal, but the kitchen is split into five rooms. Chopping in room A, frying in room B, seasoning in room C. Every room has a different lock, a different stove, and different measuring cups.&lt;/p&gt;

&lt;p&gt;We spend 40% of our development time writing glue code — provider switching, fallback logic, health checks, streaming adapters, token metering, error retries. None of this has anything to do with our actual product. Yet every AI app is writing the same glue from scratch.&lt;/p&gt;
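&lt;p&gt;That glue looks roughly the same in every project. A minimal sketch of the fallback-and-retry layer (provider names and failure modes here are made up for illustration):&lt;/p&gt;

```typescript
// Hypothetical glue code, the kind every AI app ends up hand-rolling:
// walk an ordered provider list, retry each a few times, fall back to
// the next on failure. None of it is product code.
interface Provider {
  name: string;
  call(prompt: string): string; // throws on failure
}

function callWithFallback(providers: Provider[], prompt: string, retries = 2): string {
  for (const p of providers) {
    let attempt = 0;
    while (retries >= attempt) {
      try {
        return p.call(prompt);
      } catch {
        attempt += 1;
      }
    }
  }
  throw new Error("all providers failed");
}

// A dead local engine falls through to a working cloud provider.
const flakyLocal: Provider = {
  name: "local-llama",
  call() { throw new Error("engine not running"); },
};
const cloud: Provider = {
  name: "cloud-fallback",
  call(prompt) { return "echo: " + prompt; },
};

const out = callWithFallback([flakyLocal, cloud], "Hello");
// out === "echo: Hello", served by the fallback provider
```

&lt;p&gt;Multiply this by streaming adapters, token metering, and health checks, and the 40% figure stops looking surprising.&lt;/p&gt;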

&lt;p&gt;The time we spend on what we actually want to build? Maybe 20%. The remaining 40% goes to servers, deployment, and infrastructure.&lt;/p&gt;




&lt;h2&gt;Local tools and cloud SDKs each solve half the problem&lt;/h2&gt;

&lt;p&gt;The first half of the development pipeline already has partial solutions, and they fall into a few categories. OpenRouter's numbers tell the story — 5 million developers routing requests across 60+ providers and 300+ models. Multi-provider isn't a niche need; it's the norm.&lt;/p&gt;

&lt;p&gt;Category one: local model runners. Ollama, LM Studio, LocalAI. They solve "run a model on your machine," but they don't touch the cloud. When the local GPU isn't enough, or we need GPT-4-level reasoning, we're on our own switching to a cloud SDK.&lt;/p&gt;

&lt;p&gt;Category two: cloud API gateways. OpenRouter, LiteLLM. They unify multiple cloud providers behind one interface, but they don't touch local. Want to use local models to save money, or work offline? They can't help.&lt;/p&gt;

&lt;p&gt;There's also a third category: application-layer frameworks. LangChain, Vercel AI SDK. They abstract at the app level, but they don't manage where inference actually runs, what happens when a provider goes down, or how to manage local engine lifecycles.&lt;/p&gt;

&lt;p&gt;No single solution handles: local + cloud + multimodal + routing/fallback + lifecycle management.&lt;/p&gt;

&lt;p&gt;Each one solves one piece of the puzzle. Nobody has completed the whole picture.&lt;/p&gt;

&lt;p&gt;Nimi Runtime is that complete picture.&lt;/p&gt;




&lt;h2&gt;Runtime: not a model runner — an execution surface&lt;/h2&gt;

&lt;p&gt;We built Nimi Runtime. In one sentence:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Docker for AI inference.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Docker didn't solve "how to run a program" — that was already solved. Docker solved "run it anywhere, same behavior." Nimi Runtime works the same way: whether calling local Llama or cloud GPT-4, whether it's text or image or voice — same interface, same behavior.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3o9ugnxo520odkum1ceq.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3o9ugnxo520odkum1ceq.jpg" alt="Nimi Architecture" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A single Go daemon, running in the background. Start it up, and every AI capability comes through one port.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nimi start                              &lt;span class="c"&gt;# start the daemon&lt;/span&gt;
nimi run &lt;span class="s2"&gt;"Hello"&lt;/span&gt;                        &lt;span class="c"&gt;# default inference (local or cloud)&lt;/span&gt;
nimi run &lt;span class="nt"&gt;--provider&lt;/span&gt; gemini &lt;span class="s2"&gt;"Hello"&lt;/span&gt;      &lt;span class="c"&gt;# specify cloud&lt;/span&gt;
nimi run &lt;span class="nt"&gt;--model&lt;/span&gt; llama3.2 &lt;span class="s2"&gt;"Hello"&lt;/span&gt;       &lt;span class="c"&gt;# specify local&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same command. Same interface. The execution plane is abstracted away.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8c45fsbybyvxwhkm5fl8.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8c45fsbybyvxwhkm5fl8.gif" alt="Nimi Quickstart" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;42 cloud providers, covering OpenAI, Anthropic, Gemini, DeepSeek, Qwen, MiniMax, Kimi, and GLM for global coverage. On the local side, Runtime supports LocalAI and the Nexa SDK as engines and automatically manages their lifecycle: startup, shutdown, health probes, fault recovery. No manual management needed.&lt;/p&gt;

&lt;p&gt;12 modalities, one protocol. A sampling:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Modality&lt;/th&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Text generation&lt;/td&gt;
&lt;td&gt;Chat, instructions, tool calls&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Text + vision&lt;/td&gt;
&lt;td&gt;Image understanding, OCR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Image generation&lt;/td&gt;
&lt;td&gt;Text-to-image, image-to-image&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Video generation&lt;/td&gt;
&lt;td&gt;Text-to-video, image-to-video&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speech synthesis&lt;/td&gt;
&lt;td&gt;TTS + voice cloning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speech recognition&lt;/td&gt;
&lt;td&gt;STT + timestamp alignment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Music generation&lt;/td&gt;
&lt;td&gt;Text-to-music, style transfer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Embeddings&lt;/td&gt;
&lt;td&gt;Semantic search, RAG&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Knowledge retrieval&lt;/td&gt;
&lt;td&gt;Document indexing + semantic search&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All through a single gRPC interface.&lt;/p&gt;
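&lt;p&gt;"One protocol" means the modality is a field in the request, not a separate client library. An illustrative sketch (the field names are our assumptions, not Nimi's actual gRPC schema):&lt;/p&gt;

```typescript
// Illustrative only: a unified request envelope where the modality is
// data, not a separate client. Field names are assumptions, not
// Nimi's real schema.
type Modality = "text" | "vision" | "image" | "video" | "tts" | "stt" | "music" | "embedding";

interface InferenceRequest {
  modality: Modality;
  input: string;
  provider?: string; // omit to let the runtime route
}

// One entry point for every capability; the runtime decides the rest.
function routeLabel(req: InferenceRequest): string {
  const target = req.provider ?? "auto-routed";
  return req.modality + " via " + target;
}

const text = routeLabel({ modality: "text", input: "Hello" });
const speech = routeLabel({ modality: "tts", input: "Hello", provider: "local-piper" });
// text === "text via auto-routed"; speech === "tts via local-piper"
```

&lt;p&gt;Adding a new modality extends the envelope; it doesn't add a sixth client with its own configuration format.&lt;/p&gt;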

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ag5bmcu75u67eoz4x3h.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ag5bmcu75u67eoz4x3h.gif" alt="Nimi Multimodal" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;Key capabilities: routing, fallback, and code we no longer need to write&lt;/h2&gt;

&lt;p&gt;Smart routing. Set local-first priority: if LocalAI is healthy, it runs locally. LocalAI goes down? Automatic switch to cloud OpenAI. OpenAI rate-limited? Automatic switch to Gemini. Zero if-else statements needed.&lt;/p&gt;
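&lt;p&gt;The local-first policy reduces to picking the first usable entry in a priority-ordered list. A toy version using the health states Runtime reports (the selection logic is our simplification, not Nimi's source):&lt;/p&gt;

```typescript
// Toy health-aware router (our simplification, not Nimi's source):
// priority order is data, selection is a lookup. Healthy providers
// win; a degraded one is better than nothing.
type Health = "HEALTHY" | "DEGRADED" | "UNREACHABLE" | "UNAUTHORIZED";

interface Route {
  provider: string;
  health: Health;
}

function pickProvider(routes: Route[]): string {
  const healthy = routes.find((r) => r.health === "HEALTHY");
  if (healthy) return healthy.provider;
  const degraded = routes.find((r) => r.health === "DEGRADED");
  if (degraded) return degraded.provider;
  throw new Error("no provider available");
}

// Local engine is down and OpenAI is rate-limited, so the router
// lands on the next healthy provider in priority order.
const chosen = pickProvider([
  { provider: "localai", health: "UNREACHABLE" },
  { provider: "openai", health: "DEGRADED" },
  { provider: "gemini", health: "HEALTHY" },
]);
// chosen === "gemini"
```

&lt;p&gt;The point of "zero if-else statements" is that this decision lives in the runtime, not in application code.&lt;/p&gt;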

&lt;p&gt;Health monitoring. Runtime probes every provider every 8 seconds — HEALTHY, DEGRADED, UNREACHABLE, UNAUTHORIZED. What our app sees is always an available inference service. Which provider serves it behind the scenes doesn't matter.&lt;/p&gt;

&lt;p&gt;Idempotent deduplication. A 10,000-request sliding window prevents duplicate billing. Concurrency control: a global limit of 8, a per-app limit of 2, both configurable.&lt;/p&gt;
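&lt;p&gt;A sliding-window dedup is a small amount of code, but it's exactly the kind of thing nobody wants to maintain per project. A generic sketch (window size 3 for readability; the post's figure is 10,000):&lt;/p&gt;

```typescript
// Generic sliding-window idempotency check: remember the last N
// request keys; a repeat inside the window is rejected instead of
// billed twice.
class SlidingWindowDedup {
  private order: string[] = [];
  private seen = new Set();

  constructor(private capacity: number) {}

  // Returns true if the key is new; false if it's a duplicate still
  // inside the window.
  admit(key: string): boolean {
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    this.order.push(key);
    if (this.order.length > this.capacity) {
      const evicted = this.order.shift();
      this.seen.delete(evicted);
    }
    return true;
  }
}

const dedup = new SlidingWindowDedup(3);
const first = dedup.admit("req-1");  // true: new request
const repeat = dedup.admit("req-1"); // false: duplicate, not billed
dedup.admit("req-2");
dedup.admit("req-3");
dedup.admit("req-4");                // evicts req-1 from the window
const again = dedup.admit("req-1");  // true: the window has moved on
```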

&lt;p&gt;Audit trail. Every AI call is logged — which provider, how many tokens, what routing decision was made, whether auto-switching occurred. Ring buffer stores the last 20,000 events. For debugging, cost analysis, and compliance.&lt;/p&gt;
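&lt;p&gt;The audit log is a classic ring buffer: fixed memory, constant-time append, oldest events overwritten once it fills. A minimal sketch (the event shape is illustrative):&lt;/p&gt;

```typescript
// Minimal ring buffer for audit events: O(1) append, fixed memory,
// oldest entries overwritten once capacity is reached.
class RingBuffer {
  private buf: string[];
  private next = 0;
  private count = 0;

  constructor(private capacity: number) {
    this.buf = new Array(capacity);
  }

  push(event: string): void {
    this.buf[this.next] = event;
    this.next = (this.next + 1) % this.capacity;
    if (this.capacity > this.count) this.count += 1;
  }

  // Events oldest-first.
  toArray(): string[] {
    const start = this.count === this.capacity ? this.next : 0;
    const result: string[] = [];
    for (let i = 0; this.count > i; i += 1) {
      result.push(this.buf[(start + i) % this.capacity]);
    }
    return result;
  }
}

const audit = new RingBuffer(3); // the post cites 20,000 in production
audit.push("call openai: 120 tokens");
audit.push("fallback: openai to gemini");
audit.push("call gemini: 95 tokens");
audit.push("call localai: 40 tokens"); // overwrites the oldest entry
// toArray() now holds the last three events, oldest first
```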

&lt;p&gt;This is code we used to write in every project. Not anymore.&lt;/p&gt;




&lt;h2&gt;Comparison&lt;/h2&gt;

&lt;p&gt;All solutions in one table:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;Ollama&lt;/th&gt;
&lt;th&gt;LM Studio&lt;/th&gt;
&lt;th&gt;LocalAI&lt;/th&gt;
&lt;th&gt;ComfyUI&lt;/th&gt;
&lt;th&gt;OpenRouter&lt;/th&gt;
&lt;th&gt;LangChain&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Nimi Runtime&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Local text inference&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local image generation&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local TTS/STT&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloud providers&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅ (42)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local + cloud routing&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auto-fallback&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Video / music&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Daemon architecture&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Workflow DAG&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Permissions / audit&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Ollama made running local models beautifully simple. ComfyUI's image workflows are unmatched. But each solves one dimension of the problem.&lt;/p&gt;

&lt;p&gt;Nimi Runtime unifies these dimensions into a single execution surface. Our apps only need to know one thing: call the Runtime.&lt;/p&gt;




&lt;h2&gt;Use it in your app&lt;/h2&gt;

&lt;p&gt;SDK integration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Runtime&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@nimiplatform/sdk/runtime&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;runtime&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Runtime&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="c1"&gt;// Local inference&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;local&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;runtime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Describe yourself in one sentence.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Cloud inference — same interface, one parameter added&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cloud&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;runtime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gemini&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Describe yourself in one sentence.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Already using Vercel AI SDK? Zero migration cost:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;generateText&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ai&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;createNimiAiProvider&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@nimiplatform/sdk/ai-provider&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;nimi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;createNimiAiProvider&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;runtime&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;generateText&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;nimi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gemini/default&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Hello from Vercel AI SDK + Nimi&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under the hood it's Nimi Runtime's routing and fallback logic. But the code looks exactly like using an OpenAI provider.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmcr0onlqnqg0ijs2y1pq.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmcr0onlqnqg0ijs2y1pq.gif" alt="Nimi SDK" width="760" height="427"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Nimi Runtime is open source. Apache-2.0 / MIT dual license.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/nimiplatform/nimi" rel="noopener noreferrer"&gt;https://github.com/nimiplatform/nimi&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://install.nimi.xyz | sh
nimi start
nimi run &lt;span class="s2"&gt;"Hello"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three commands. One unified AI inference surface. 42 cloud providers, local engines, 12 modalities, one interface.&lt;/p&gt;

&lt;p&gt;Stop writing separate integrations for every AI modality. Point Claude at this link — it'll tell you what to do next.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Nimi Team&lt;/em&gt;&lt;br&gt;
&lt;em&gt;GitHub: &lt;a href="https://github.com/nimiplatform/nimi" rel="noopener noreferrer"&gt;https://github.com/nimiplatform/nimi&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>typescript</category>
      <category>devtools</category>
    </item>
  </channel>
</rss>
