guanjiawei

Posted on • Originally published at guanjiawei.ai

The Last Piece of the Puzzle: Vibing an Inference Engine

I saw a piece of news yesterday that held my attention for a while.

Liam Price, 23, has no advanced mathematical training. On an ordinary Monday afternoon, he casually tossed Erdős problem #1196 into ChatGPT. After several rounds of prompting and sifting through the output, he unexpectedly cracked a problem that had stumped the mathematics community for 60 years. After reviewing the result, Terence Tao said something quite interesting: everyone who had looked at the problem before had "gone off track from the very first step."

That news kept me up most of the night. Not out of envy; it made me start re-examining my own projects.

A Puzzle, Missing the Last Piece

Working on edge AI, I've always had a puzzle in mind.

The core contradiction to solve is a classic one. On one side is SOTA: the hope that AI models running on small form-factor devices can still achieve the industry's strongest performance, across various operating systems and hardware. On the other side is TCO: the total cost of ownership for each device needs to be compressed as close as possible to hardware plus electricity.

Neither side alone is particularly difficult; it's putting them together that gets awkward. To reach SOTA, you need an expert to tune specifically for that machine's hardware, inference engine, and model. A parameter tuner earning twenty thousand yuan a month isn't expensive, but on a twenty-thousand-yuan edge device, the math doesn't add up. I wrote about the full story behind this in Edge AI Inference: Computing Goldmine or Management Black Hole?.

To close this gap, I built two puzzle pieces.

The first is AIMA: a knowledge-driven management platform that embeds tuning knowledge and lets Agents run benchmarks and tune parameters on each device themselves, gradually approaching that machine's optimal performance. Put plainly, it has AI take over the job of "that person with the twenty-thousand monthly salary."
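
For a sense of what that looks like in practice, here is a minimal sketch of such a benchmark-and-tune loop. The parameter names loosely mirror common llama.cpp knobs, and the run_benchmark helper is purely illustrative; none of this is AIMA's actual API.

```python
import itertools

# Illustrative parameter space an agent might sweep on one device.
# The knobs loosely mirror common llama.cpp settings; real names and
# ranges depend on the engine and hardware actually in use.
PARAM_GRID = {
    "n_gpu_layers": [0, 16, 32, 999],  # layers offloaded to GPU/NPU
    "n_batch": [128, 256, 512],        # prompt-processing batch size
    "n_threads": [4, 8],               # CPU threads for the rest
}

def run_benchmark(params: dict) -> float:
    """Hypothetical helper: restart the engine with `params`, replay a
    fixed prompt set, and return decode throughput in tokens/second."""
    raise NotImplementedError  # depends on the engine's CLI or HTTP API

def autotune(grid: dict) -> tuple[dict, float]:
    """Benchmark every combination and keep the fastest one. A real agent
    would prune the space and check memory limits instead of brute force."""
    best_params, best_tps = None, 0.0
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        tps = run_benchmark(params)
        if tps > best_tps:
            best_params, best_tps = params, tps
    return best_params, best_tps
```

The brute-force loop is the part a human tuner used to do by feel; the knowledge layer's job is to shrink that grid to the handful of combinations actually worth trying on a given box.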

The second is AIMA Service, for after-sales maintenance when devices fail. Engineers used to have to connect remotely and operate by hand; AIMA Service lets Agents take over directly: diagnosing, tuning, and fixing failures end to end, absorbing anything that goes beyond what the on-site operator can handle.

With both puzzle pieces in place, I thought I was nearly done. Until I recently started actually using them and discovered something awkward: I had finished an entire management layer, but had no inference engine to manage.

The Engines Available Now: Each Makes Me Frown

Ollama: Best Experience, Most Regrettable Performance

Ollama's user experience is genuinely excellent. Install it, models download automatically, everything starts with one click, and it works across macOS, Windows, and Linux; the whole experience is silky smooth. Almost everyone who first touches private deployment starts here.

But it has two fatal flaws.

The first is performance left on the table. What Ollama actually runs under the hood is llama.cpp, which has an enormous number of tunable parameters. That is both its strength (sufficiently flexible) and its weakness (the vast majority of people can't find the optimal set). Ollama simply picks one conservative set of defaults for everyone, letting ordinary users run out of the box at the cost of almost never reaching that machine's maximum performance.

The second is the default concurrency of one. OLLAMA_NUM_PARALLEL ships as 1 by default, so every request queues up obediently. You can raise it, but the allocated context (and with it the KV-cache memory) scales linearly with the parallel count, so the memory budget has to be redone; it's not as simple as flipping a switch. Batch jobs and multi-step Agent calls all suffer when they run into this.
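
To make the "redo the memory budget" part concrete, here is a back-of-the-envelope sketch of how the KV cache grows with the parallel count. The model dimensions are illustrative (roughly an 8B-class model with grouped-query attention and an fp16 cache); substitute your own model's numbers.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim,
                   ctx_len, num_parallel, bytes_per_elem=2):
    """Rough KV-cache size: K and V per layer, per KV head, per position.
    The engine allocates roughly ctx_len * num_parallel positions, so
    memory grows linearly with OLLAMA_NUM_PARALLEL."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * num_parallel * bytes_per_elem

# Illustrative 8B-class model: 32 layers, 8 KV heads, head_dim 128,
# 8k context per request, fp16 cache.
for parallel in (1, 2, 4):
    gib = kv_cache_bytes(32, 8, 128, 8192, parallel) / 2**30
    print(f"OLLAMA_NUM_PARALLEL={parallel}: ~{gib:.1f} GiB of KV cache")
```

With these illustrative numbers, going from one to four parallel slots adds roughly 3 GiB of cache on top of the model weights, which is exactly the kind of budgeting that still has to be done by hand today.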

llama.cpp: Parameters Are Tunable, but Precision Gets Eaten by Its Own Format

Used directly instead of through Ollama, llama.cpp can deliver 5–10% higher performance, concurrency is more controllable, and it runs on all kinds of hardware. It's one of the most cross-platform engines available.

But there's one thing I hadn't noticed before and only realized recently while building: it has to convert models from safetensors into its own GGUF format before it can run them.
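
For reference, that extra step usually looks something like the two commands below, wrapped here in Python. The script and binary names are those of recent llama.cpp releases, and the paths are placeholders for your own setup.

```python
import subprocess

MODEL_DIR = "path/to/hf-model"  # safetensors checkpoint, e.g. from Hugging Face

# 1) safetensors -> GGUF at f16: this conversion happens even before any
#    quantization is applied.
subprocess.run(
    ["python", "convert_hf_to_gguf.py", MODEL_DIR,
     "--outfile", "model-f16.gguf", "--outtype", "f16"],
    check=True,
)

# 2) Optional further quantization (e.g. Q4_K_M), which is where the larger
#    accuracy drops discussed below come from.
subprocess.run(
    ["./llama-quantize", "model-f16.gguf", "model-q4_k_m.gguf", "Q4_K_M"],
    check=True,
)
```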

This step wasn't a problem in past chat scenarios. But once you enter Agent use cases with multi-step calls, trouble arrives.

Even without any quantization, this format conversion alone causes a noticeable accuracy drop on agent tasks. Add quantization and it gets worse. I've looked at some public data: Q4_K_M on certain small models drops accuracy from 0.87 to nearly half, and Q3_K_M collapses outright; Llama 3.2 3B converted to q4_k_m drops MMLU from 64.2 to 61.8.

These numbers are tolerable in a single step. But an Agent task can run dozens of steps, losing a bit of precision at each step, and compound decay makes it a completely different story. One study found that when conversation length increases by 50%, efficiency drops 3–5%; after more than 12 rounds, Agents start performing large amounts of meaningless repetitive operations, stuffing the context full of garbage.
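
A toy calculation shows why small per-step losses compound. The per-step success rates below are made up purely for illustration; the point is the shape of the curve, not the exact numbers.

```python
# If each step of an agent task succeeds independently with probability p,
# the chance the whole N-step task stays on track is roughly p ** N.
for p in (0.99, 0.97, 0.95):
    for steps in (10, 30):
        print(f"p={p:.2f}, {steps} steps -> ~{p**steps:.0%} end-to-end")

# p=0.99 over 30 steps still lands around 74%;
# p=0.95 over 30 steps is already down near 21%.
```

A few points of per-step accuracy are the difference between an agent that mostly finishes and one that mostly wanders.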

Everyone is moving toward Agent programming and agent scenarios. The precision problems on this path will only become harder, not easier.

vLLM / SGLang: Performance Maxed Out, but Heavy in Another Way

vLLM and SGLang are the other extreme. Performance maxed out, native multimodal support. vLLM now stably supports vision models, plus ASR / TTS, embedding, and rerank interfaces; SGLang supports 30+ multimodal models, diffusion models, even TTS. The standard answer for cloud inference engines.

But at the edge, the problem is straightforward: packages often run to dozens of GB, deployment is heavy, and cross-platform support is poor. Expecting an ordinary user to run vLLM on their 16GB box is close to unrealistic.

What I Want My Engine to Look Like

Laying the strengths and pain points of these three on one table, the engine I want essentially stitches their advantages together.

Out-of-the-box like Ollama, cross-OS, package size in the hundreds-of-megabytes range, click and run. Under the hood as flexible as llama.cpp, not picky about CPU, GPU, or NPU. Performance maxed out like vLLM, with native support for vision, ASR, TTS, embedding, rerank—not patched in later. And one final requirement: no GGUF conversion, ingest safetensors directly, avoiding the precision cliff brought by format conversion.

Each of these things is being done by someone, individually. But I haven't seen anyone stack them all together yet.

If it gets built, the whole puzzle closes. User experience as foolproof as Ollama: a few hundred megabytes installed, click and run. On the performance side, AIMA works behind the scenes, letting Agents auto-tune parameters toward vLLM-level SOTA. Vision, speech, embedding, and rerank all run locally, with no need to wire up a pile of cloud accounts. When problems arise, AIMA Service takes over on its own.

This is actually the last piece of that "box that can run every auxiliary model" I talked about in The AI Box Should Be as Boring as a Router.

Something I Wouldn't Have Dared to Think About Three Months Ago

At this point I need to mention something else.

Three months ago my attitude toward "building an inference engine" was: I wouldn't dare think about it. Too low-level, too hardcore; open-source projects doing this were either moribund or internal big-tech efforts.

But today I've already started building—not just one engine, but several different versions. How did this mindset shift happen?

Three months ago I got seriously into AI coding for the first time. I gave the team a small one-week task: write a performance testing script, something I'd wanted for a long time that no one had built for me. A week later, everyone had built their own tools.

After finishing, new ideas started forming: could we make a demo? The kind you show at exhibitions. While building that, I thought, could we make a small automatic customer acquisition tool? After building that, could I make a personal website for myself?

Everything I built surprised me. Expectations were low, but everything worked out.

At this point I started thinking bigger: the model management platform I'd always wanted to build (AIMA), could I build it myself? I spent time on it and, to my surprise, actually built it. It started handling real business.

Next I asked: then can I build a cloud service? This is a completely different domain from single-machine software, something I'd never touched before. The result: I could still build it, and people could really use it.

Then recently I started optimizing performance at the hardware layer and got very good results. At that moment I suddenly realized: AI really can produce breakthroughs in areas that used to belong to the research domain, very low-level work, especially in scenarios that can be verified end to end.

Looking back: the software skills accumulated earlier, the cloud skills, and this newly gained research capability, assembled together, add up to an inference engine. From not daring to think about it to already building it, the gap wasn't skill; it was the repeated moments of "huh, so this is doable too."

Every step was "try it without expecting it to work." And every step expanded the answer to "what else can be done next" by another ring.

Final Thoughts

Looking back at that news about the 23-year-old.

The most memorable part isn't "AI made a mathematical discovery." The original report was quite frank: the AI's direct output was "very poor quality," and it was the human who picked out the valuable nugget. What's worth remembering is the triggering structure: a 23-year-old outside mathematics circles, with no professional training, who, because ChatGPT existed, dared to touch a problem that had troubled the mathematics community for 60 years.

It's not that AI did it for you; it's that AI made you dare to do it.

My own last three months have been the same: starting from a performance testing script, step by step turning things I thought impossible into things I'm now building. Models get stronger every month; what blocks you today often becomes a plain engineering problem two months later, when the next-generation model ships.

That doesn't mean you should just lie around waiting for models. It means: don't be too quick to conclude that something is impossible today. The scope of what's doable expands every month.


Originally published at https://guanjiawei.ai/en/blog/inference-engine-last-puzzle-piece
