When I started working on GoModel, I did not plan to spend much time on benchmarking.
I assumed benchmarking would be annoying, fragile, and probably much harder than it looked. In my head, it felt like one of those tasks that sounds simple at first, but turns into a mini research project once you actually start.
What I learned is the opposite: creating a useful benchmark is far easier than most people think.
One big reason is that AI removes much of the friction that existed a few years ago.
That was the biggest lesson for me.
What is GoModel?
GoModel is an open-source AI gateway / LLM proxy written in Go. It sits between your app and model providers like OpenAI, Anthropic, Gemini, Groq, xAI, and Ollama, and exposes a single OpenAI-compatible API.
I built it because I wanted a lightweight, production-friendly gateway that was easy to deploy, easy to reason about, and fully open-source.
Why I decided to benchmark it
At some point, I kept making the same claim in my head:
“GoModel feels lighter and faster.”
That may be true, but “feels” is not evidence.
I was mostly comparing it against LiteLLM, because LiteLLM is the best-known option in this space and the default reference point for many people looking at LLM gateways.
So I decided to stop guessing and just measure it.
That turned out to be one of the most useful things I have done for the project, not only because of the results, but because of what I learned while building the benchmark itself.
Lesson 1: benchmarking is easier now because you can just talk to AI
A few years ago, even starting a benchmark felt heavy.
First you had to think through the methodology. Then you had to decide what to measure. Then you had to write the scripts. Then you had to figure out how to run them, collect the numbers, and make sense of the results.
Now a lot of that work is much easier.
You can literally start by describing what you want in plain English:
- I have two services
- they do the same job
- I want to compare throughput, latency, and memory usage
- I want a simple repeatable benchmark
- I do not need a perfect academic setup
- I just want something fair and useful
That is already enough to get moving.
AI is very good at helping with exactly this kind of task. Not because it magically solves benchmarking for you, but because it removes a lot of the friction around getting started.
It can help you:
- define a reasonable benchmark scope
- generate load scripts
- suggest what metrics to collect
- point out obvious mistakes in the setup
- format results
- help you explain the limitations clearly
That part feels very different from how things used to be.
Before, benchmarking often felt blocked by setup cost.
Now it is much more like: just talk to AI, get a first version working, then iterate.
That does not mean you should trust every output blindly. You still need to think. You still need to validate the setup. You still need to understand what is actually being measured.
But the barrier to entry is much lower now.
And I think that is a big deal.
Lesson 2: a benchmark does not need to be perfect to be useful
This was the biggest mindset shift.
I think many developers avoid benchmarking because they imagine they need a huge setup: many machines, a big test matrix, production traffic replay, deep statistical analysis, and charts for every possible scenario.
In reality, you can learn a lot from a small benchmark if you ask a clear question.
My question was simple:
If both tools are used as an LLM gateway in front of the same kind of workload, how do they behave in terms of throughput, latency, and memory usage?
That is already enough.
You do not need to model the entire internet. You just need a test that is fair enough to reveal something meaningful.
AI also helps here because it forces you to phrase the question clearly. If you cannot explain the benchmark clearly to an AI assistant, there is a good chance your scope is still too vague.
Lesson 3: benchmarking forces product clarity
This part surprised me.
I expected benchmarking to tell me about performance.
What it also did was clarify the product itself.
Once you measure something, you are forced to answer questions like:
- What is this product actually optimized for?
- Where should it be better?
- What trade-offs did I make intentionally?
- What should users care about most?
In my case, the benchmark made the positioning much clearer.
GoModel is not just “an AI gateway.”
It is a Go-based, open-source, single-binary gateway designed to be lightweight, simple to deploy, and efficient in the hot path of LLM requests.
Without benchmarking, those are just words.
With benchmarking, they become testable claims.
Lesson 4: benchmarking is also a debugging tool
Before doing this, I mostly thought about benchmarks as something you publish.
That was a mistake.
A benchmark is also one of the fastest ways to find weak spots in your own system.
As soon as you push something under repeatable load, you start noticing where memory grows faster than expected, where latency becomes uneven, and where parts of the system become bottlenecks.
Even if I had never published the results, building the benchmark would still have been worth it.
It gave me a much more honest picture of the system.
And again, AI helps here not by replacing the benchmark, but by helping you move faster once you find a problem. You can ask it to review the script, suggest what might be skewing the result, or help you isolate one part of the test.
My biggest takeaway
The biggest lesson I learned is very simple:
Benchmarking is much more accessible today with AI tools.
You do not need a lab.
You do not need a giant team.
You do not need a perfect methodology.
And now, you also do not need to start from a blank page.
You can just describe what you want to measure, use AI to help generate a first version, and improve it from there.
You still need to think.
You still need to validate the setup.
You still need to be honest about the limits.
But getting started is much easier than it used to be.
Final thought
If you are building infrastructure, developer tools, or performance-sensitive software, I think it is worth benchmarking earlier than you expect.
Not because you need a marketing graph.
Because benchmarking forces clarity.
It helps you understand your product better, find bottlenecks faster, and communicate value more concretely.
And today, with AI, it is easier than ever to start.
That was true for me with GoModel, and it is probably true for a lot of other projects too.
If you want to check out the project, GoModel is open-source and available on GitHub:
I also published the full benchmark results here if you want to see the setup and the raw comparison: