Constantly Installing Models for Others
Since the beginning of 2025, numerous hardware vendors and clients have approached us, expressing interest in our model management platform.
The reasons are straightforward: the interface is user-friendly, model-centric, and allows users to get started with a single click—low barrier to entry. Some clients were even willing to pay us to deploy a similar setup for them.
However, that platform was originally designed for specific hardware. NVIDIA GPUs, Huawei Ascend, Hygon DCU, AMD ROCm—each has different drivers, device mounting mechanisms, environment variables, and security contexts. Adapting to each new hardware type required writing massive amounts of code, which was painful.
What hurt more was the aftermath. Clients would come back to us regularly, asking us to adapt the latest model or upgrade the engine for them. Regardless of how much the platform sold for, it ultimately required continuous manual investment to maintain.
I started wondering: why do people actually need such a platform? After going around in circles, the answer always came back to the same place—TCO, Total Cost of Ownership.
Half the Value Evaporates in Three Months
AI evolves too fast.
Over ten new models can emerge within three months. You spend hundreds of thousands on an AI server, tune it, run the current best model, and then leave it untouched. Three months later, new models are roughly twice as powerful as the one you're running, and the hardware's relative value is cut in half.
Add to that the progress in inference engines (vLLM, SGLang, llama.cpp, each version squeezing out more performance) and the gap widens further. Go six months without upgrading, and compared with industry best practice you might retain only about 30% of the value.
The hardware isn't broken, but its value is declining.
Our previous platform worked fine out of the box, but once you needed to import the latest models or adapt to engine changes, the software couldn't keep up. You can't know in advance what new models will look like, whether engines will change their APIs, or whether hardware drivers will suddenly update. There is no way to be compatible with the future ahead of time.
The Cost of Mediocrity
Some might say: Aren't Ollama and LM Studio pretty good? One command or one app, and you can download and use models.
My view is that these tools are a compromise from an earlier era, chosen for lack of better options.
To reduce TCO, these tools often have to sacrifice the device's actual performance as well. To be more compatible and easier to use, they bake in a lot of mediocre presets. Take Ollama, for example: one-click startup is fine. But want to tune advanced inference engine parameters or handle more concurrency? You'll hit bottlenecks.
A vLLM inference engine has dozens of advanced parameters. Some of them can swing performance by 50% depending on whether they're enabled. Different versions behave differently, and optimal values vary by model. Without running it on actual hardware, you simply don't know what works.
Reducing TCO and maintaining SOTA (State of the Art) performance are conflicting goals in these solutions.
What AIMA Aims to Solve
What AIMA aims to do sounds simple: enable devices to achieve SOTA performance in various scenarios while driving TCO down to near the cost of hardware plus electricity.
Of course, doing it isn't simple. This is a four-dimensional optimization problem—hardware, inference engine, model, and application—four dimensions all changing rapidly, with extremely complex permutations.
AIMA's approach is to make itself a thin infrastructure layer. A Go binary, cross-platform compiled, zero CGO dependencies, install and use.
# Detect hardware
aima hal detect
# Initialize infrastructure
sudo aima init
# Deploy model (automatically matches engine and configuration)
aima deploy apply --model qwen3.5-35b-a3b
Three commands, from bare metal to running model inference. What happens behind the scenes? AIMA detects your GPU model and VRAM, matches the optimal engine and parameter configuration from the YAML knowledge base, generates K3S Pod declarations, and spins up the inference service. The entire process doesn't require you to know whether to use vLLM or llama.cpp, or manually configure CUDA paths or ROCm device mounting.
Currently supported hardware includes NVIDIA RTX 4060/4090/GB10, AMD Radeon 8060S and Ryzen AI MAX+ 395, Huawei Ascend 910B, Hygon BW150 DCU, and Apple M4. Engine support includes vLLM, llama.cpp, SGLang, and Ollama.
Knowledge, Not Code
AIMA's approach is: knowledge over code.
Traditional approach: for every new hardware or engine supported, write a bunch of if-else branches. AIMA doesn't do this. Hardware characteristics, engine parameters, model configurations—all defined in YAML files. Go code only does numerical comparison and generic rendering, containing no vendor-specific branches.
Support a new engine? Write a YAML. Support a new model? Also write a YAML. 80% of capability expansion doesn't require recompilation.
The knowledge base content looks like this: each model has multiple variants, with each variant annotated with applicable GPU architectures, minimum VRAM requirements, inference engine type, and specific launch parameters. AIMA's ConfigResolver automatically matches the most suitable variant based on your current hardware state. If your RTX 4060 only has 8GB VRAM, it will skip the vLLM solution requiring 16GB and automatically fall back to a llama.cpp GGUF solution.
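As an illustration, such an entry might look like the sketch below. The field names and structure here are invented for this post, not AIMA's actual schema; the flags shown are real vLLM and llama.cpp options, but the specific values are guesses:

```yaml
# Hypothetical knowledge entry -- illustrative schema, not AIMA's real one.
model: qwen3.5-35b-a3b
variants:
  - engine: vllm
    gpu_archs: [ampere, ada, hopper]
    min_vram_gb: 16
    args:
      - --gpu-memory-utilization=0.90
      - --max-model-len=32768
  - engine: llama.cpp
    format: gguf
    quant: Q4_K_M
    min_vram_gb: 6
    args:
      - --n-gpu-layers=999
```

Variants are listed best-first, so a machine that can't satisfy the first entry's requirements falls through to the next one.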
These fragmented pieces of knowledge previously scattered across forum posts, GitHub issues, and personal notes now have a unified, structured representation. Anyone can contribute a YAML, anyone can reuse configurations validated by others.
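The matching step ConfigResolver performs (numerical comparison against variant requirements, no vendor branches) can be sketched in a few lines of Go. The types and field names below are invented for illustration; they are not AIMA's actual code:

```go
package main

import "fmt"

// Variant is a hypothetical knowledge-base entry: one way to run a model.
type Variant struct {
	Engine    string
	MinVRAMGB int
}

// selectVariant picks the first variant whose VRAM requirement fits the
// detected hardware. Variants are assumed to be ordered best-first, so
// falling through the list is what implements the vLLM -> llama.cpp fallback.
func selectVariant(variants []Variant, vramGB int) (Variant, bool) {
	for _, v := range variants {
		if vramGB >= v.MinVRAMGB {
			return v, true
		}
	}
	return Variant{}, false
}

func main() {
	variants := []Variant{
		{Engine: "vllm", MinVRAMGB: 16},
		{Engine: "llama.cpp-gguf", MinVRAMGB: 6},
	}
	// An 8 GB RTX 4060 skips the 16 GB vLLM variant and falls back.
	if v, ok := selectVariant(variants, 8); ok {
		fmt.Println(v.Engine)
	}
}
```

The point is that the loop is generic: adding a new engine or hardware type changes only the data, never this code.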
57 MCP Tools
AIMA exposes 57 MCP tools, covering hardware detection, model management, engine management, deployment, knowledge base, benchmarking, cluster management, and all other functions.
Why does this matter? Because MCP is the standard protocol for AI Agents to operate external tools. Making all functionality into MCP tools means any MCP Client—Claude Code, GPT, or agents you write yourself—can directly control everything on this device.
The CLI is a thin wrapper around MCP tools, containing no additional logic. Humans use the CLI, Agents use MCP, both following the exact same code paths.
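That "thin wrapper" shape can be sketched generically. The registry, tool name, and stub result below are invented for illustration, not AIMA's real MCP surface:

```go
package main

import (
	"fmt"
	"os"
)

// toolFunc is the one signature every tool shares, whether it is invoked
// by an MCP client or by the CLI.
type toolFunc func(args []string) (string, error)

// registry holds every tool exactly once; there is no CLI-only logic.
var registry = map[string]toolFunc{
	"hardware.detect": func(args []string) (string, error) {
		return "gpu: example-gpu, vram: 8GB", nil // stub result
	},
}

// callTool is the single dispatch point. An MCP server and the CLI both
// go through it, so humans and Agents follow the same code path.
func callTool(name string, args []string) (string, error) {
	tool, ok := registry[name]
	if !ok {
		return "", fmt.Errorf("unknown tool %q", name)
	}
	return tool(args)
}

func main() {
	if len(os.Args) < 2 {
		return
	}
	out, err := callTool(os.Args[1], os.Args[2:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(out)
}
```

With this shape, exposing a new capability to both humans and Agents is one map entry, not two implementations.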
This is what "Managed by AI" in the name refers to. AIMA doesn't embed a large model running inside itself; instead, it turns itself into infrastructure that AI Agents can directly control. Agents detect hardware, query the knowledge base, deploy models, run benchmarks, check results, and adjust configurations; the entire loop runs without human involvement.
Progressive Intelligence
AIMA has a design I find quite interesting: not all scenarios have AI available, so it implements five levels of progressive intelligence.
The bottom is L0. Default values from the YAML knowledge base, embedded in the Go binary at compile time. No network, no AI, nothing—L0 can still give you a working inference service. Not optimal, but a safety net.
One level up, at L1, humans can manually override parameters via the CLI. Above that, L2 uses golden configurations derived from historical benchmarks: optimal parameter combinations already proven on this hardware, accumulated from benchmark data and reused directly the next time.
At L3a, if the device itself has sufficient compute power, the built-in Go Agent can use local models for simple tool-calling loops, making some decisions itself. The highest L3b connects to external powerful Agents (like Claude), capable of complex tuning, troubleshooting, and exploration.
Each level works independently, and from L0 upward each level progressively overrides the one below. A new device may start at L0, but as knowledge accumulates and network access becomes available, it can climb the levels on its own.
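The "progressively overriding" behavior is essentially layered parameter merging. A minimal sketch, with invented layer contents (the parameter names happen to be real vLLM flags, but the values are illustrative):

```go
package main

import "fmt"

// resolve merges parameter layers in ascending priority order, so later
// layers (e.g. L2 golden configs, then L1 manual overrides) win over
// the L0 defaults baked into the binary.
func resolve(layers ...map[string]string) map[string]string {
	out := map[string]string{}
	for _, layer := range layers {
		for k, v := range layer {
			out[k] = v
		}
	}
	return out
}

func main() {
	l0 := map[string]string{"max-num-seqs": "64", "gpu-memory-utilization": "0.85"}
	l2 := map[string]string{"gpu-memory-utilization": "0.92"} // golden config from benchmarks
	l1 := map[string]string{"max-num-seqs": "128"}            // manual override

	fmt.Println(resolve(l0, l2, l1))
}
```

Because L0 is always present at the bottom of the stack, removing every higher layer still yields a complete, working configuration.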
There's a design concept here called "exploration as knowledge": every exploration an Agent makes—parameter tuning, troubleshooting, deployment attempts—produces structured Knowledge Notes, written back to the knowledge base. Other devices' Agents can directly reuse this knowledge, skipping known failure paths and starting from optimal points.
The more devices used, the more knowledge accumulates, and the easier subsequent devices become to use. The input to this cycle is tokens and idle compute; the output is an increasingly thick knowledge base.
LAN as Cluster
Another thing worth mentioning: Fleet management.
AIMA uses mDNS for LAN auto-discovery. You place five AI devices with different hardware in your office—no need to configure IPs, no need for a registry center, they can discover each other automatically.
# Discover AIMA devices on the LAN
aima discover
# Execute tools remotely
aima fleet exec <device-id> hardware.detect
aima fleet exec <device-id> deploy.list
Each device exposes the same set of MCP tools, with remote and local following the same paths. For an Agent, managing one or managing a group makes no difference.
Write Less Code
Finally, let's talk about design philosophy.
AIMA's Go code has several hard constraints: no code branches for engine types, no code branches for hardware vendors. Engine behavior, model metadata, hardware container access configurations—all in YAML.
Container lifecycle is managed by K3S, GPU VRAM slicing is managed by HAMi. AIMA does very narrow things: generate Pod YAML from knowledge, kubectl apply, query status.
Less code makes it easier for AI to understand the project, and changing things doesn't burn as many tokens. As AI becomes more capable, minimal code becomes an advantage—AI can participate more fluently in this project.
Open Source
AIMA is open-sourced on GitHub under the Apache 2.0 license. A Go binary, cross-platform compiled, ready to use out of the box.
We will also provide an accompanying AIMA Service for remote troubleshooting when devices encounter issues. Together, the goal is to drive the TCO of AI inference devices down to near the cost of hardware, electricity, and a small spend on tokens.
High TCO means devices won't be fully utilized, and the market won't grow large. Compute is scarce now; even an old device, as long as it can still run inference, sitting idle is a waste.
Originally published at https://guanjiawei.ai/en/blog/why-we-built-aima