One SDK, 12 Modalities: AI Inference Shouldn't Be This Fragmented

[Image: Nimi banner]

GitHub: github.com/nimiplatform/nimi | Apache-2.0 / MIT

Local inference is becoming the default. But fragmentation is the real problem.

Models are getting stronger and smaller. Local inference is no longer a hobbyist pursuit — it's becoming a standard part of how AI apps are built. IDC predicts that by 2027, 80% of AI inference will run locally or at the edge.

The 2025 Stack Overflow Developer Survey found that 59% of developers use three or more AI tools simultaneously. Open any AI project being built today and the same pattern shows up. Take an AI character app as an example: it needs speech recognition (STT), text reasoning (LLM), voice synthesis (TTS), scene image generation, and maybe background music. Five modalities, five different capabilities.

With today's toolchain, we need:

  • Local text inference: Ollama or llama.cpp
  • Local image generation: ComfyUI or AUTOMATIC1111
  • Local voice synthesis: Piper or GPT-SoVITS
  • Cloud video generation: Runway API
  • Cloud music generation: Suno API

Five tools. Five processes. Five configurations. Five different interface formats.

Every AI capability is an island.

It's like cooking a single meal, but the kitchen is split into five rooms. Chopping in room A, frying in room B, seasoning in room C. Every room has a different lock, a different stove, and different measuring cups.

We spend 40% of our development time writing glue code — provider switching, fallback logic, health checks, streaming adapters, token metering, error retries. None of this has anything to do with our actual product. Yet every AI app is writing the same glue from scratch.
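That glue looks nearly identical in every codebase. A minimal sketch of the hand-rolled provider fallback most of us have written at least once (the `Provider` shape and the provider names are illustrative stand-ins, not any real SDK):

```typescript
// Hand-rolled provider fallback: the glue every AI app keeps rewriting.
// The Provider shape here is an illustrative stand-in, not a real SDK type.
type Provider = {
  name: string;
  healthy: () => boolean;            // health check
  call: (prompt: string) => string;  // the actual inference call
};

function callWithFallback(providers: Provider[], prompt: string): string {
  for (const p of providers) {
    if (!p.healthy()) continue;      // skip providers that failed health checks
    try {
      return p.call(prompt);         // first success wins
    } catch {
      continue;                      // error: fall through to the next provider
    }
  }
  throw new Error("all providers failed");
}

// Usage: local engine first, cloud as fallback.
const answer = callWithFallback(
  [
    { name: "local", healthy: () => false, call: () => "local answer" },
    { name: "cloud", healthy: () => true, call: () => "cloud answer" },
  ],
  "Hello"
);
// answer === "cloud answer"
```

Multiply this by streaming, retries, and token metering, and the 40% figure stops looking surprising.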

The time we spend on what we actually want to build? Maybe 20%. The other 40% goes to servers, deployment, and infrastructure.


Local tools and cloud SDKs each solve half the problem

The first half of the development pipeline already has partial solutions, and they fall into three categories. OpenRouter's numbers tell the story — 5 million developers routing requests across 60+ providers and 300+ models. Multi-provider isn't a niche need; it's the norm.

Category one: local model runners. Ollama, LM Studio, LocalAI. They solve "run a model on your machine," but they don't touch the cloud. When the local GPU isn't enough, or we need GPT-4-level reasoning, we're on our own switching to a cloud SDK.

Category two: cloud API gateways. OpenRouter, LiteLLM. They unify multiple cloud providers behind one interface, but they don't touch local. Want to use local models to save money, or work offline? They can't help.

Category three: application-layer frameworks. LangChain, Vercel AI SDK. They abstract at the app level, but they don't manage where inference actually runs, what happens when a provider goes down, or how local engine lifecycles are handled.

No single solution handles: local + cloud + multimodal + routing/fallback + lifecycle management.

Each one solves one piece of the puzzle. Nobody has completed the whole picture.

Nimi Runtime is that complete picture.


Runtime: not a model runner — an execution surface

We built Nimi Runtime. In one sentence:

Docker for AI inference.

Docker didn't solve "how to run a program" — that was already solved. Docker solved "run it anywhere, same behavior." Nimi Runtime works the same way: whether calling local Llama or cloud GPT-4, whether it's text or image or voice — same interface, same behavior.

[Image: Nimi architecture diagram]

A single Go daemon, running in the background. Start it up, and every AI capability comes through one port.

nimi start                              # start the daemon
nimi run "Hello"                        # default inference (local or cloud)
nimi run --provider gemini "Hello"      # specify cloud
nimi run --model llama3.2 "Hello"       # specify local

Same command. Same interface. The execution plane is abstracted away.

[Image: Nimi quickstart]

42 cloud providers, covering OpenAI, Anthropic, Gemini, DeepSeek, Qwen, MiniMax, Kimi, GLM, and more — global coverage. On the local side, Runtime supports the LocalAI and Nexa SDK engines and manages their lifecycle automatically — startup, shutdown, health probes, fault recovery. No manual management needed.
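"Lifecycle management" here means the daemon owns the engine process, not the app. A purely illustrative supervise loop — Nimi's daemon is written in Go, but TypeScript is used here for consistency with the SDK examples, and the `Engine` handle is hypothetical:

```typescript
// Illustrative engine supervisor: probe, then restart on failure.
// The Engine shape is hypothetical; this only sketches what the
// daemon's lifecycle management covers so the app doesn't have to.
type Engine = { alive: boolean; restarts: number };

function probe(e: Engine): boolean {
  return e.alive; // stand-in for a real health probe (e.g. an HTTP ping)
}

function supervise(e: Engine): Engine {
  if (!probe(e)) {
    // fault recovery: restart the engine and record the attempt
    return { alive: true, restarts: e.restarts + 1 };
  }
  return e; // healthy: leave it running
}

const recovered = supervise({ alive: false, restarts: 0 });
// recovered.alive === true, recovered.restarts === 1
```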

12 modalities, one protocol:

  • Text generation: chat, instructions, tool calls
  • Text + vision: image understanding, OCR
  • Image generation: text-to-image, image-to-image
  • Video generation: text-to-video, image-to-video
  • Speech synthesis: TTS + voice cloning
  • Speech recognition: STT + timestamp alignment
  • Music generation: text-to-music, style transfer
  • Embeddings: semantic search, RAG
  • Knowledge retrieval: document indexing + semantic search

All through a single gRPC interface.
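What "one protocol" buys the caller can be sketched as a single request type covering every modality. The field names below are illustrative, not Nimi's actual gRPC schema:

```typescript
// One request envelope, many modalities: a sketch of the "one protocol"
// idea. Field names are illustrative, not Nimi's actual schema.
type InferenceRequest =
  | { modality: "text"; prompt: string }
  | { modality: "image"; prompt: string; size: string }
  | { modality: "tts"; text: string; voice: string };

// A single entry point replaces one client library per modality.
function describe(req: InferenceRequest): string {
  switch (req.modality) {
    case "text":
      return `text: ${req.prompt}`;
    case "image":
      return `image(${req.size}): ${req.prompt}`;
    case "tts":
      return `tts(${req.voice}): ${req.text}`;
  }
}

const sample = describe({ modality: "image", prompt: "a cat", size: "512x512" });
// sample === "image(512x512): a cat"
```

The point is not the dispatch itself but that adding a modality extends one union instead of adding a sixth client library.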

[Image: Nimi multimodal support]


Key capabilities: routing, fallback, and code we no longer need to write

Smart routing. Set local-first priority: if LocalAI is healthy, it runs locally. LocalAI goes down? Automatic switch to cloud OpenAI. OpenAI rate-limited? Automatic switch to Gemini. Zero if-else statements needed.
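The routing policy can be pictured as an ordered preference list checked against live health state. The health state names come from the article; the policy shape, the assumption that only HEALTHY providers receive traffic, and the `route` function are all illustrative, not Nimi's implementation:

```typescript
// Hypothetical routing policy: an ordered, local-first preference list
// evaluated against the health monitor's state. Not Nimi's actual code.
type Health = "HEALTHY" | "DEGRADED" | "UNREACHABLE" | "UNAUTHORIZED";

const policy = ["localai", "openai", "gemini"]; // local-first preference order

// Route to the first provider currently reported HEALTHY
// (assumption: DEGRADED providers are skipped).
function route(health: Record<string, Health>): string | undefined {
  return policy.find((p) => health[p] === "HEALTHY");
}

// LocalAI is down and OpenAI is rate-limited: traffic flows to Gemini
// without a single if-else in application code.
const chosen = route({
  localai: "UNREACHABLE",
  openai: "DEGRADED",
  gemini: "HEALTHY",
});
// chosen === "gemini"
```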

Health monitoring. Runtime probes every provider every 8 seconds — HEALTHY, DEGRADED, UNREACHABLE, UNAUTHORIZED. What we see in our app is always an available inference service. Who's running it behind the scenes doesn't matter.

Idempotent deduplication. A 10,000-request sliding window catches retried requests before they're billed twice. Concurrency control: a global limit of 8 concurrent requests and a per-app limit of 2, both configurable.
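A sliding-window dedup can be pictured as a bounded set of recently seen request IDs. The 10,000 window size mirrors the article; the class and key format are illustrative:

```typescript
// Sliding-window idempotency sketch: remember the last N request IDs
// and reject repeats. The capacity mirrors the article's 10,000-request
// window; the class and key shape are illustrative.
class DedupWindow {
  private seen = new Set<string>();
  private order: string[] = [];
  constructor(private capacity: number) {}

  // true: first sight, process it; false: duplicate, skip billing.
  admit(requestId: string): boolean {
    if (this.seen.has(requestId)) return false;
    this.seen.add(requestId);
    this.order.push(requestId);
    if (this.order.length > this.capacity) {
      // evict the oldest ID to keep the window bounded
      this.seen.delete(this.order.shift()!);
    }
    return true;
  }
}

const dedup = new DedupWindow(10_000);
const first = dedup.admit("req-42");  // true: new request, bill it
const repeat = dedup.admit("req-42"); // false: duplicate, don't bill again
```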

Audit trail. Every AI call is logged — which provider, how many tokens, what routing decision was made, whether auto-switching occurred. Ring buffer stores the last 20,000 events. For debugging, cost analysis, and compliance.
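A bounded audit log of this kind is a ring buffer: keep the last N events, overwrite the oldest. The capacity mirrors the article's 20,000; the event shape is illustrative:

```typescript
// Ring-buffer audit log sketch: the last N events survive, the oldest
// are dropped. Capacity mirrors the article's 20,000; the event shape
// is illustrative, not Nimi's actual record format.
type AuditEvent = { provider: string; tokens: number; fallback: boolean };

class RingBuffer<T> {
  private buf: T[] = [];
  constructor(private capacity: number) {}

  push(item: T): void {
    this.buf.push(item);
    if (this.buf.length > this.capacity) this.buf.shift(); // drop oldest
  }

  last(n: number): T[] {
    return this.buf.slice(-n);
  }
}

const audit = new RingBuffer<AuditEvent>(20_000);
audit.push({ provider: "localai", tokens: 120, fallback: false });
audit.push({ provider: "gemini", tokens: 95, fallback: true });
// audit.last(1)[0].provider === "gemini"
```

The same structure serves debugging ("what did the last call do?"), cost analysis (sum the token counts), and compliance (a tamper-evident recent history).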

This is code we used to write in every project. Not anymore.


Comparison

All solutions, side by side: Ollama, LM Studio, LocalAI, ComfyUI, OpenRouter, and LangChain each cover a slice of the capabilities below, while Nimi Runtime covers the full list:

  • Local text inference
  • Local image generation
  • Local TTS/STT
  • Cloud providers (Nimi Runtime: 42)
  • Local + cloud routing
  • Auto-fallback
  • Video / music generation
  • Daemon architecture
  • Workflow DAG
  • Permissions / audit

Ollama made running local models beautifully simple. ComfyUI set the standard for image workflows. But each solves one dimension of the problem.

Nimi Runtime unifies these dimensions into a single execution surface. Our apps only need to know one thing: call the Runtime.


Use it in your app

SDK integration:

import { Runtime } from '@nimiplatform/sdk/runtime';

const runtime = new Runtime();

// Local inference
const local = await runtime.generate({
  prompt: 'Describe yourself in one sentence.',
});

// Cloud inference — same interface, one parameter added
const cloud = await runtime.generate({
  provider: 'gemini',
  prompt: 'Describe yourself in one sentence.',
});

Already using Vercel AI SDK? Zero migration cost:

import { generateText } from 'ai';
import { createNimiAiProvider } from '@nimiplatform/sdk/ai-provider';

const nimi = createNimiAiProvider({ runtime });

const { text } = await generateText({
  model: nimi.text('gemini/default'),
  prompt: 'Hello from Vercel AI SDK + Nimi',
});

Under the hood it's Nimi Runtime's routing and fallback logic. But the code looks exactly like using an OpenAI provider.

[Image: Nimi SDK]


Nimi Runtime is open source. Apache-2.0 / MIT dual license.

GitHub: https://github.com/nimiplatform/nimi

curl -fsSL https://install.nimi.xyz | sh
nimi start
nimi run "Hello"

Three commands. One unified AI inference surface. 42 cloud providers, local engines, 12 modalities, one interface.

Stop writing separate integrations for every AI modality. Point Claude at this link — it'll tell you what to do next.


Nimi Team
GitHub: https://github.com/nimiplatform/nimi
