<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Andrew S. Bandy</title>
    <description>The latest articles on DEV Community by Andrew S. Bandy (@s-bandy).</description>
    <link>https://dev.to/s-bandy</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3840236%2F2ce3fcf3-2b6f-416f-82bd-466d38f7e144.jpg</url>
      <title>DEV Community: Andrew S. Bandy</title>
      <link>https://dev.to/s-bandy</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/s-bandy"/>
    <language>en</language>
    <item>
      <title>Benchmarking AI Gateways: GoModel vs LiteLLM vs Portkey vs Bifrost</title>
      <dc:creator>Andrew S. Bandy</dc:creator>
      <pubDate>Fri, 26 Jun 2026 17:51:26 +0000</pubDate>
      <link>https://dev.to/s-bandy/benchmarking-ai-gateways-gomodel-vs-litellm-vs-portkey-vs-bifrost-5d98</link>
      <guid>https://dev.to/s-bandy/benchmarking-ai-gateways-gomodel-vs-litellm-vs-portkey-vs-bifrost-5d98</guid>
      <description>&lt;p&gt;In October 2025 I tried to build my startup on top of LiteLLM.&lt;/p&gt;

&lt;p&gt;At first it looked like the obvious choice. It supported many providers, it had&lt;br&gt;
an OpenAI-compatible API, and it was already used by a lot of people. I did not&lt;br&gt;
want to write an AI gateway. I wanted to build the product behind it.&lt;/p&gt;

&lt;p&gt;Then I started running it on the hot path.&lt;/p&gt;

&lt;p&gt;My opinion changed there.&lt;/p&gt;

&lt;p&gt;A gateway is not a dashboard or integration glue you call once in a while. It&lt;br&gt;
sits on every request, every retry, every stream, every tool call, every&lt;br&gt;
fallback, every timeout.&lt;/p&gt;

&lt;p&gt;A heavy gateway charges rent forever.&lt;/p&gt;

&lt;p&gt;Most AI gateway comparisons miss that part. They talk about provider count,&lt;br&gt;
dashboards, tracing, and "support for 1000+ models". Those things matter, but&lt;br&gt;
they are not free. Before the gateway calls OpenAI, Anthropic, Gemini, vLLM, or&lt;br&gt;
anything else, it has already spent your CPU, memory, cold-start time, and&lt;br&gt;
operational budget.&lt;/p&gt;

&lt;p&gt;I am not comparing full product maturity here. I am comparing how these gateways&lt;br&gt;
behave on the hot path.&lt;/p&gt;

&lt;p&gt;So I started writing &lt;a href="https://github.com/ENTERPILOT/GoModel" rel="noopener noreferrer"&gt;GoModel&lt;/a&gt;: a small&lt;br&gt;
open-source AI gateway and AI control plane in Go, with an OpenAI-compatible API&lt;br&gt;
and explicit provider adapters.&lt;/p&gt;

&lt;p&gt;When I &lt;a href="https://news.ycombinator.com/item?id=47861333" rel="nofollow noopener noreferrer"&gt;launched GoModel on Hacker News&lt;/a&gt;,&lt;br&gt;
I promised a real, reproducible benchmark. This article is that follow-up.&lt;/p&gt;

&lt;p&gt;The benchmark question is simple:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How lean is each AI gateway when it sits on the request path?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That question runs through the whole benchmark: GoModel vs LiteLLM vs Portkey vs&lt;br&gt;
Bifrost, measured by latency, throughput, memory, CPU, cold start, and image&lt;br&gt;
size rather than landing pages or feature matrices.&lt;/p&gt;
&lt;h2&gt;
  
  
  The runtime footprint matters
&lt;/h2&gt;

&lt;p&gt;Latency gets the easiest arguments. It rarely tells the whole story.&lt;/p&gt;

&lt;p&gt;Most real LLM calls are dominated by inference time. If a model takes &lt;code&gt;2000 ms&lt;/code&gt;&lt;br&gt;
to answer, the difference between &lt;code&gt;5 ms&lt;/code&gt; and &lt;code&gt;15 ms&lt;/code&gt; of proxy overhead is not&lt;br&gt;
the main story.&lt;/p&gt;

&lt;p&gt;The main story is the deployment envelope:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How much RAM does the gateway need under load?&lt;/li&gt;
&lt;li&gt;How much CPU does it burn per request?&lt;/li&gt;
&lt;li&gt;How many requests can it serve per core?&lt;/li&gt;
&lt;li&gt;How fast does it cold-start?&lt;/li&gt;
&lt;li&gt;How large is the Docker image?&lt;/li&gt;
&lt;li&gt;Can you run it as a sidecar, on a small VM, in serverless, or near local
models?&lt;/li&gt;
&lt;li&gt;Is the core gateway actually open-source?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those numbers decide whether the gateway can run where you want it to run.&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;372 MB&lt;/code&gt; compressed image (&lt;code&gt;1.2 GB&lt;/code&gt; unpacked) that holds gigabytes of&lt;br&gt;
RAM and takes &lt;code&gt;25 s&lt;/code&gt; to cold-start is a different operational proposition from a&lt;br&gt;
&lt;code&gt;16 MB&lt;/code&gt; image that peaks at &lt;code&gt;37 MB&lt;/code&gt; of RAM and is serving traffic &lt;code&gt;0.56 s&lt;/code&gt; after&lt;br&gt;
launch.&lt;/p&gt;

&lt;p&gt;So I care about the runtime footprint.&lt;/p&gt;
&lt;h2&gt;
  
  
  What this benchmark does not prove
&lt;/h2&gt;

&lt;p&gt;This benchmark does &lt;strong&gt;not&lt;/strong&gt; prove that one gateway is best for every company.&lt;/p&gt;

&lt;p&gt;I am not measuring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;bug counts or overall correctness&lt;/li&gt;
&lt;li&gt;semantic cache quality&lt;/li&gt;
&lt;li&gt;tracing UI quality&lt;/li&gt;
&lt;li&gt;guardrail quality&lt;/li&gt;
&lt;li&gt;admin dashboards&lt;/li&gt;
&lt;li&gt;long-term provider maintenance&lt;/li&gt;
&lt;li&gt;every possible provider-specific feature&lt;/li&gt;
&lt;li&gt;total provider count&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those things matter. Some of them matter a lot.&lt;/p&gt;

&lt;p&gt;LiteLLM in particular has more integrated providers and more gateway features&lt;br&gt;
than GoModel today. If your first requirement is maximum provider coverage right&lt;br&gt;
now, LiteLLM has a real advantage. This benchmark does not erase that. It&lt;br&gt;
measures the runtime footprint of putting each gateway on the request path. In&lt;br&gt;
practice, many smaller or newer providers already expose an OpenAI-compatible&lt;br&gt;
API, so provider count is not always the same as practical routing coverage.&lt;/p&gt;

&lt;p&gt;The benchmark measures one narrower thing: &lt;strong&gt;runtime and deployment overhead on&lt;br&gt;
the request path&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That still matters, because the gateway is on the hot path. If you run high&lt;br&gt;
request volume, local models, serverless workloads, edge workloads, or many small&lt;br&gt;
model calls, the overhead stops being theoretical.&lt;/p&gt;
&lt;h2&gt;
  
  
  AI gateway benchmark setup
&lt;/h2&gt;

&lt;p&gt;I tested four AI gateways people actually compare:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GoModel&lt;/li&gt;
&lt;li&gt;LiteLLM&lt;/li&gt;
&lt;li&gt;Portkey&lt;/li&gt;
&lt;li&gt;Bifrost&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every gateway talked to the &lt;strong&gt;same instant mock backend&lt;/strong&gt;, on purpose. I did not&lt;br&gt;
want to benchmark OpenAI, Anthropic, AWS networking, or random internet jitter.&lt;br&gt;
I wanted to isolate the gateway itself.&lt;/p&gt;

&lt;p&gt;Each gateway ran one at a time, in Docker, on an &lt;strong&gt;AWS &lt;code&gt;c7i.large&lt;/code&gt;&lt;/strong&gt; with&lt;br&gt;
2 vCPU and 4 GiB RAM, running the latest &lt;strong&gt;Amazon Linux 2023&lt;/strong&gt; AMI. The whole&lt;br&gt;
thing is Terraform'd, runs with one command, and tears itself down afterwards.&lt;/p&gt;

&lt;p&gt;I first ran this on a free-tier &lt;code&gt;t2.micro&lt;/code&gt;. That was cheap and easy to&lt;br&gt;
reproduce, but unfair to the heavier gateways. A 1 GiB machine cannot hold a&lt;br&gt;
gateway that wants gigabytes of memory, so it starts swapping. At that point you&lt;br&gt;
are benchmarking the host being too small.&lt;/p&gt;

&lt;p&gt;So I moved to &lt;code&gt;c7i.large&lt;/code&gt;: still small, but non-burstable and large enough that&lt;br&gt;
nothing swaps. It also makes the LiteLLM setup more honest. LiteLLM recommends&lt;br&gt;
one worker per vCPU, and this machine has 2 vCPUs, so LiteLLM gets 2&lt;br&gt;
workers. That gives it the multi-core access it is supposed to have instead of&lt;br&gt;
pinning it to a single worker on a tiny box.&lt;/p&gt;

&lt;p&gt;The test covered six workloads:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;chat completions, non-streaming&lt;/li&gt;
&lt;li&gt;chat completions, streaming&lt;/li&gt;
&lt;li&gt;Responses API, non-streaming&lt;/li&gt;
&lt;li&gt;Responses API, streaming&lt;/li&gt;
&lt;li&gt;Anthropic messages, non-streaming&lt;/li&gt;
&lt;li&gt;Anthropic messages, streaming&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each workload used &lt;code&gt;8,000&lt;/code&gt; requests at concurrency &lt;code&gt;10&lt;/code&gt;, across &lt;strong&gt;two trials&lt;br&gt;
with randomized gateway order&lt;/strong&gt;. Latency is the &lt;strong&gt;median across trials&lt;/strong&gt;, and I&lt;br&gt;
report p99 with its min-max range so one noisy window cannot tell the whole&lt;br&gt;
story.&lt;/p&gt;
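
&lt;p&gt;A minimal sketch of that aggregation, assuming each trial yields a plain list of per-request latencies in milliseconds and that percentiles use the nearest-rank method. The names &lt;code&gt;percentile&lt;/code&gt; and &lt;code&gt;summarize&lt;/code&gt; are hypothetical, and the public harness may compute these figures differently.&lt;/p&gt;

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of latencies."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(trials):
    """Collapse per-trial latency lists into the reported figures.

    Assumption: the headline p50 and p99 are medians of the per-trial
    percentiles, and the p99 range is the min and max across trials.
    """
    p50s = [percentile(trial, 50) for trial in trials]
    p99s = [percentile(trial, 99) for trial in trials]
    return {
        "p50": statistics.median(p50s),
        "p99": statistics.median(p99s),
        "p99_range": (min(p99s), max(p99s)),
    }
```

&lt;p&gt;With only two trials the median of the per-trial values is their midpoint, which is why the min-max range is worth reporting next to it.&lt;/p&gt;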

&lt;p&gt;I would not call this a statistically exhaustive study. It is a reproducible&lt;br&gt;
engineering benchmark, and the harness is public so people can rerun it, change&lt;br&gt;
the machine, or add their own workloads.&lt;/p&gt;

&lt;p&gt;A few details matter if you want to reproduce or criticize the numbers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Throughput is measured, not inferred.&lt;/strong&gt; The latency runs report
completed-req/s at fixed concurrency, but real capacity comes from a separate
concurrency sweep that drives each gateway to saturation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Every dialect is warmed up before measurement.&lt;/strong&gt; LiteLLM lazily imports some
per-dialect translation code on first use. A chat-only warmup made its
Responses and Messages paths look worse than they should. I warmed up all
dialects to avoid that.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retries are disabled for all gateways.&lt;/strong&gt; I also disabled GoModel's circuit
breaker for this benchmark. In production, rejecting traffic after upstream
trouble is the right behavior. In a saturation benchmark, it would make the
throughput number unfairly low.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LiteLLM runs with its recommended worker count.&lt;/strong&gt; A LiteLLM worker is
effectively single-threaded, and its production guidance is one worker per
vCPU. On this box that means &lt;code&gt;2&lt;/code&gt; workers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming uses terminal-marker or idle-gap detection.&lt;/strong&gt; If a gateway streams
content but never sends a terminal event, the harness measures to last byte
instead of hanging forever.&lt;/li&gt;
&lt;/ul&gt;
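
&lt;p&gt;The streaming rule in the last bullet can be sketched roughly as follows. This is an illustration, not the harness code: it assumes a reader that yields &lt;code&gt;(timestamp, chunk)&lt;/code&gt; pairs and yields &lt;code&gt;None&lt;/code&gt; once no bytes have arrived for the idle gap. The function name and the default marker strings are assumptions.&lt;/p&gt;

```python
def stream_end_time(events, terminal_markers=("[DONE]", "message_stop")):
    """Return the timestamp at which a streamed response counts as finished.

    events: iterable of (timestamp, chunk) pairs, or None when the reader
    saw no bytes for the configured idle gap (hypothetical reader contract).
    """
    last_byte_at = None
    for event in events:
        if event is None:
            # Idle gap elapsed without a terminal event: measure to last byte.
            break
        timestamp, chunk = event
        last_byte_at = timestamp
        if any(marker in chunk for marker in terminal_markers):
            # Terminal marker seen: the stream ended cleanly here.
            break
    return last_byte_at
```

&lt;p&gt;Either exit path returns the time of the last byte actually received, so a gateway that never sends a terminal event is still measured instead of stalling the run.&lt;/p&gt;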
&lt;h2&gt;
  
  
  GoModel vs LiteLLM vs Portkey vs Bifrost
&lt;/h2&gt;

&lt;p&gt;Representative latency is chat completions, non-streaming. All resource figures&lt;br&gt;
are measured under load on the same box.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;GoModel&lt;/th&gt;
&lt;th&gt;Bifrost&lt;/th&gt;
&lt;th&gt;Portkey&lt;/th&gt;
&lt;th&gt;LiteLLM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Runtime&lt;/td&gt;
&lt;td&gt;Go&lt;/td&gt;
&lt;td&gt;Go&lt;/td&gt;
&lt;td&gt;Node.js&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latency overhead &lt;code&gt;p50&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;1.8 ms&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2.5 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;9.7 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;30.6 ms&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latency &lt;code&gt;p99&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;6.9 ms&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;18.3 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;30.5 ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;39.3 ms&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Throughput (sustained)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;4900 req/s&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;3100 req/s&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;950 req/s&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;324 req/s&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Peak RAM under load&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;37 MB&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;143 MB&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;112 MB&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2.3 GB&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Efficiency (req/s per CPU %)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;52&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;25&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;8.2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2.6&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cold start to first request&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;0.56 s&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;7.1 s&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.1 s&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;25.5 s&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Docker image (compressed pull)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;16 MB&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;77 MB&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;59 MB&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;372 MB&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Workload coverage&lt;/td&gt;
&lt;td&gt;&lt;code&gt;6/6&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;6/6&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;4/6&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;6/6&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vendor-neutral core&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Partial †&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Core source available&lt;/td&gt;
&lt;td&gt;Yes ‡&lt;/td&gt;
&lt;td&gt;Partial ‡&lt;/td&gt;
&lt;td&gt;Partial ‡&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
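
&lt;p&gt;One way to read the efficiency row is as sustained throughput divided by CPU utilization. A minimal sketch, assuming CPU is sampled as a percentage where 100 means one fully busy vCPU (the harness may define it differently). The function name and the sample numbers are hypothetical, not benchmark results.&lt;/p&gt;

```python
def efficiency(req_per_s, cpu_percent):
    """Requests per second served per percentage point of CPU."""
    if cpu_percent == 0:
        raise ValueError("cpu_percent must be non-zero")
    return req_per_s / cpu_percent


# Hypothetical sample: 1000 req/s while using 50 % CPU.
print(efficiency(1000, 50))  # prints 20.0
```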
&lt;h2&gt;
  
  
  What stood out
&lt;/h2&gt;

&lt;p&gt;GoModel had the lowest median latency and the tightest tail: &lt;code&gt;1.8 ms&lt;/code&gt; p50 and&lt;br&gt;
&lt;code&gt;6.9 ms&lt;/code&gt; p99.&lt;/p&gt;

&lt;p&gt;Bifrost was close on median latency at &lt;code&gt;2.5 ms&lt;/code&gt;, which is a good result. The&lt;br&gt;
gap opened at the tail and in memory: &lt;code&gt;18.3 ms&lt;/code&gt; p99 and &lt;code&gt;143 MB&lt;/code&gt; peak RAM under&lt;br&gt;
load.&lt;/p&gt;

&lt;p&gt;Portkey was heavier than I expected for this narrow proxy benchmark. It served&lt;br&gt;
&lt;code&gt;950 req/s&lt;/code&gt; sustained and used &lt;code&gt;112 MB&lt;/code&gt; peak RAM under load. In this setup it did&lt;br&gt;
not serve the Anthropic &lt;code&gt;/v1/messages&lt;/code&gt; dialect, so it gets &lt;code&gt;4/6&lt;/code&gt; workload&lt;br&gt;
coverage. Treat that as a setup limitation, not a claim that Portkey cannot&lt;br&gt;
support Anthropic in a fuller virtual-key configuration.&lt;/p&gt;

&lt;p&gt;LiteLLM was the outlier. At its recommended worker count, it used about&lt;br&gt;
&lt;code&gt;2.3 GB&lt;/code&gt; of RAM, cold-started in &lt;code&gt;25.5 s&lt;/code&gt;, and sustained &lt;code&gt;324 req/s&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Not because Python is morally bad. The language matters only when it changes the&lt;br&gt;
deployment envelope. Here it does: memory floor, image size, cold-start time,&lt;br&gt;
dependency graph, and throughput per core.&lt;/p&gt;

&lt;p&gt;The later &lt;a href="https://futuresearch.ai/blog/litellm-pypi-supply-chain-attack/" rel="nofollow noopener noreferrer"&gt;supply-chain incident around LiteLLM&lt;/a&gt;&lt;br&gt;
also made me more confident in GoModel's design direction. A small Go binary&lt;br&gt;
with a standard-library-heavy dependency tree is structurally less exposed to&lt;br&gt;
that class of problem than a large Python dependency graph.&lt;/p&gt;
&lt;h2&gt;
  
  
  What AI gateway benchmarks do not capture
&lt;/h2&gt;

&lt;p&gt;Forwarding JSON is not the hard part.&lt;/p&gt;

&lt;p&gt;The hard part is provider drift.&lt;/p&gt;

&lt;p&gt;OpenAI, Anthropic, Gemini, AWS Bedrock, Azure OpenAI, Groq, xAI, Cerebras, vLLM,&lt;br&gt;
and local servers all disagree in small ways. Then they change those ways. Tool&lt;br&gt;
calling changes. Streaming changes. Reasoning parameters change. Image inputs&lt;br&gt;
change. Error formats change. Rate-limit semantics change.&lt;/p&gt;

&lt;p&gt;An AI gateway or AI control plane has to absorb that without becoming magic.&lt;/p&gt;

&lt;p&gt;GoModel's bet is not "support every model name on the internet".&lt;/p&gt;

&lt;p&gt;The bet is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;support the providers people actually deploy&lt;/li&gt;
&lt;li&gt;keep provider adapters explicit&lt;/li&gt;
&lt;li&gt;accept OpenAI-compatible requests generously&lt;/li&gt;
&lt;li&gt;translate only what needs translation&lt;/li&gt;
&lt;li&gt;pass through what should stay provider-specific&lt;/li&gt;
&lt;li&gt;return conservative OpenAI-compatible responses&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For the same reason, GoModel starts as a small OpenAI-compatible gateway, not as&lt;br&gt;
a dashboard with a proxy attached.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why this matters for local models and vLLM
&lt;/h2&gt;

&lt;p&gt;If all your traffic goes to a cloud model that takes several seconds to answer,&lt;br&gt;
gateway overhead can look academic.&lt;/p&gt;

&lt;p&gt;Local models change the math.&lt;/p&gt;

&lt;p&gt;If you are routing through an AI gateway to vLLM, Ollama, LM Studio, llama.cpp,&lt;br&gt;
or small specialized models on your own network, the model call can be much&lt;br&gt;
faster. Then gateway overhead, cold starts, memory, and sidecar size matter more.&lt;/p&gt;

&lt;p&gt;One reason I want GoModel to stay small: a gateway should be cheap enough to put&lt;br&gt;
near the workload.&lt;/p&gt;
&lt;h2&gt;
  
  
  Notes on neutrality and open source
&lt;/h2&gt;

&lt;p&gt;Bifrost is built by Maxim AI, an LLM&lt;br&gt;
evaluation and observability platform. It routes to many model providers, but&lt;br&gt;
the gateway also sits close to Maxim's eval and observability ecosystem. If you&lt;br&gt;
want to choose your own eval platform, or stay independent from any eval&lt;br&gt;
platform, ask whether Bifrost is the right match for you. Good software can&lt;br&gt;
still have incentives attached. "Vendor-neutral" needs an asterisk here.&lt;/p&gt;

&lt;p&gt;"Open-source" also needs care.&lt;/p&gt;

&lt;p&gt;Portkey keeps observability storage, dashboard, multi-team RBAC, and at-scale&lt;br&gt;
semantic caching in a closed managed tier. Bifrost's core gateway is Apache-2.0,&lt;br&gt;
but its Enterprise edition adds closed or managed features. LiteLLM's proxy core&lt;br&gt;
is MIT, but enterprise features like SSO, audit logs, and fine-grained access&lt;br&gt;
control sit behind a proprietary commercial license.&lt;/p&gt;

&lt;p&gt;GoModel is open-source today. Some enterprise-grade AI control plane features may&lt;br&gt;
stay private. The core gateway is intended to remain useful without those private&lt;br&gt;
features.&lt;/p&gt;
&lt;h2&gt;
  
  
  Reproduce it yourself
&lt;/h2&gt;

&lt;p&gt;The benchmark is built to be self-verifiable. It provisions the AWS instance,&lt;br&gt;
runs every gateway against the same backend, prints the tables, and destroys the&lt;br&gt;
infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/ENTERPILOT/GoModel/tree/main/docs/2026-06-25_aws_gateway_benchmark" rel="noopener noreferrer"&gt;Reproduce it yourself&lt;/a&gt;&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./run.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One caveat: it runs on &lt;strong&gt;paid&lt;/strong&gt; AWS infrastructure, not the free tier. A&lt;br&gt;
&lt;code&gt;c7i.large&lt;/code&gt; is about &lt;code&gt;$0.09&lt;/code&gt;/hour and the run self-destructs within an hour or&lt;br&gt;
two, so budget &lt;strong&gt;under &lt;code&gt;$1&lt;/code&gt;&lt;/strong&gt; per run to be safe.&lt;/p&gt;

&lt;p&gt;If you pass &lt;code&gt;KEEP=1&lt;/code&gt; or teardown fails, you keep paying until you destroy the&lt;br&gt;
box, so double-check the teardown.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;I did not start GoModel because I wanted another AI gateway in the world.&lt;/p&gt;

&lt;p&gt;I started it because the gateway I wanted to use became part of the problem. It&lt;br&gt;
sat on the hot path, but did not feel like hot-path software: too heavy, too&lt;br&gt;
slow to start, too expensive to keep around, too large for the job.&lt;/p&gt;

&lt;p&gt;This benchmark is the result of turning that frustration into numbers.&lt;/p&gt;

&lt;p&gt;The numbers say GoModel is small in the places I care about: &lt;code&gt;16 MB&lt;/code&gt; image,&lt;br&gt;
&lt;code&gt;37 MB&lt;/code&gt; peak RAM, &lt;code&gt;0.56 s&lt;/code&gt; cold start, &lt;code&gt;1.8 ms&lt;/code&gt; p50, &lt;code&gt;6.9 ms&lt;/code&gt; p99, and&lt;br&gt;
&lt;code&gt;4900 req/s&lt;/code&gt; sustained throughput on a small AWS box.&lt;/p&gt;

&lt;p&gt;LiteLLM still has more providers and more features today. Portkey and Bifrost&lt;br&gt;
have their own strengths. But if the gateway is going to sit between your users&lt;br&gt;
and every model call, I think it should first be cheap, predictable, and boring&lt;br&gt;
to run.&lt;/p&gt;

&lt;p&gt;GoModel is my attempt to build that kind of gateway.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>performance</category>
      <category>infrastructure</category>
      <category>webdev</category>
    </item>
    <item>
      <title>LiteLLM was compromised, but GoModel is a good alternative</title>
      <dc:creator>Andrew S. Bandy</dc:creator>
      <pubDate>Tue, 24 Mar 2026 18:39:51 +0000</pubDate>
      <link>https://dev.to/s-bandy/litellm-was-compromised-thats-why-im-building-gomodel-nmm</link>
      <guid>https://dev.to/s-bandy/litellm-was-compromised-thats-why-im-building-gomodel-nmm</guid>
      <description>&lt;p&gt;LiteLLM just had a serious supply chain incident.&lt;/p&gt;

&lt;p&gt;According to the &lt;a href="https://github.com/BerriAI/litellm/issues/24518" rel="noopener noreferrer"&gt;public GitHub reports&lt;/a&gt;, malicious PyPI versions of LiteLLM were published, including 1.82.8, with code that could run automatically on Python startup and steal secrets like environment variables, SSH keys, and cloud credentials. The reported payload sent that data to an attacker-controlled domain. A follow-up issue says the PyPI package was compromised through the maintainer's PyPI account, and that the bad releases were not shipped through the official GitHub CI/CD flow.&lt;/p&gt;

&lt;p&gt;This is bigger than one package. It is a reminder that the AI infra layer is now part of your security boundary.&lt;/p&gt;

&lt;p&gt;Fortunately, there is a good alternative: GoModel, a faster and simpler LiteLLM alternative written in Go. It is smaller and performs better, which suits teams that want a reliable LLM gateway.&lt;/p&gt;

&lt;p&gt;Repo link: &lt;a href="https://github.com/ENTERPILOT/GOModel/" rel="noopener noreferrer"&gt;https://github.com/ENTERPILOT/GOModel/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>performance</category>
      <category>opensource</category>
      <category>security</category>
    </item>
    <item>
      <title>Benchmarking GoModel, a LiteLLM alternative: lessons learned from building a simple benchmark</title>
      <dc:creator>Andrew S. Bandy</dc:creator>
      <pubDate>Mon, 23 Mar 2026 15:36:23 +0000</pubDate>
      <link>https://dev.to/s-bandy/benchmarking-gomodel-vs-litellm-alternative-lessons-learned-from-building-a-simple-benchmark-45m</link>
      <guid>https://dev.to/s-bandy/benchmarking-gomodel-vs-litellm-alternative-lessons-learned-from-building-a-simple-benchmark-45m</guid>
      <description>&lt;p&gt;When I started to look for an AI Gateway for my project I've encountered GoModel. I did not plan to spend much time on benchmarking.&lt;/p&gt;

&lt;p&gt;I assumed benchmarking would be annoying, fragile, and probably much harder than it looked. In my head, it felt like one of those tasks that sounds simple at first, but turns into a mini research project once you actually start.&lt;/p&gt;

&lt;p&gt;What I learned is the opposite: creating a &lt;strong&gt;useful&lt;/strong&gt; benchmark is much easier than most people think.&lt;/p&gt;

&lt;p&gt;And one big reason is that AI makes the whole process much easier than it was a few years ago.&lt;/p&gt;

&lt;p&gt;That was the biggest lesson for me.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is GoModel?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/ENTERPILOT/GOModel" rel="noopener noreferrer"&gt;GoModel&lt;/a&gt; is an open-source AI gateway / LLM proxy written in Go. It sits between your app and model providers like OpenAI, Anthropic, Gemini, Groq, xAI, and Ollama, and exposes a single OpenAI-compatible API.&lt;/p&gt;

&lt;p&gt;I built it because I wanted a lightweight, production-friendly gateway that was easy to deploy, easy to reason about, and fully open-source.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I decided to benchmark it
&lt;/h2&gt;

&lt;p&gt;At some point, I kept making the same claim in my head:&lt;/p&gt;

&lt;p&gt;“GoModel feels lighter and faster.”&lt;/p&gt;

&lt;p&gt;That may be true, but “feels” is not evidence.&lt;/p&gt;

&lt;p&gt;I was mostly comparing it against LiteLLM, because LiteLLM is the best-known option in this space and the default reference point for many people looking at LLM gateways.&lt;/p&gt;

&lt;p&gt;So I decided to stop guessing and just measure it.&lt;/p&gt;

&lt;p&gt;That turned out to be one of the most useful things I have done for the project, not only because of the results, but because of what I learned while building the benchmark itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  The biggest change: benchmarking is easier now because you can just talk to AI (Lesson 1)
&lt;/h2&gt;

&lt;p&gt;A few years ago, even starting a benchmark felt heavy.&lt;/p&gt;

&lt;p&gt;First you had to think through the methodology. Then you had to decide what to measure. Then you had to write the scripts. Then you had to figure out how to run them, collect the numbers, and make sense of the results.&lt;/p&gt;

&lt;p&gt;Now a lot of that work is much easier.&lt;/p&gt;

&lt;p&gt;You can literally start by describing what you want in plain English:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I have two services&lt;/li&gt;
&lt;li&gt;they do the same job&lt;/li&gt;
&lt;li&gt;I want to compare throughput, latency, and memory usage&lt;/li&gt;
&lt;li&gt;I want a simple repeatable benchmark&lt;/li&gt;
&lt;li&gt;I do not need a perfect academic setup&lt;/li&gt;
&lt;li&gt;I just want something fair and useful&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is already enough to get moving.&lt;/p&gt;

&lt;p&gt;AI is very good at helping with exactly this kind of task. Not because it magically solves benchmarking for you, but because it removes a lot of the friction around getting started.&lt;/p&gt;

&lt;p&gt;It can help you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;define a reasonable benchmark scope&lt;/li&gt;
&lt;li&gt;generate load scripts&lt;/li&gt;
&lt;li&gt;suggest what metrics to collect&lt;/li&gt;
&lt;li&gt;point out obvious mistakes in the setup&lt;/li&gt;
&lt;li&gt;format results&lt;/li&gt;
&lt;li&gt;help you explain the limitations clearly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That part feels very different from how things used to be.&lt;/p&gt;

&lt;p&gt;Before, benchmarking often felt blocked by setup cost.&lt;/p&gt;

&lt;p&gt;Now it is much more like: &lt;strong&gt;&lt;a href="https://steipete.me/posts/just-talk-to-it" rel="noopener noreferrer"&gt;just talk to AI&lt;/a&gt;, get a first version working, then iterate&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That does not mean you should trust every output blindly. You still need to think. You still need to validate the setup. You still need to understand what is actually being measured.&lt;/p&gt;

&lt;p&gt;But the barrier to entry is much lower now.&lt;/p&gt;

&lt;p&gt;And I think that is a big deal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lesson 2: a benchmark does not need to be perfect to be useful
&lt;/h2&gt;

&lt;p&gt;This was the biggest mindset shift.&lt;/p&gt;

&lt;p&gt;I think many developers avoid benchmarking because they imagine they need a huge setup: many machines, a big test matrix, production traffic replay, deep statistical analysis, and charts for every possible scenario.&lt;/p&gt;

&lt;p&gt;In reality, you can learn a lot from a small benchmark if you ask a clear question.&lt;/p&gt;

&lt;p&gt;My question was simple:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If both tools are used as an LLM gateway in front of the same kind of workload, how do they behave in terms of throughput, latency, and memory usage?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That is already enough.&lt;/p&gt;

&lt;p&gt;You do not need to model the entire internet. You just need a test that is fair enough to reveal something meaningful.&lt;/p&gt;

&lt;p&gt;AI also helps here because it forces you to phrase the question clearly. If you cannot explain the benchmark clearly to an AI assistant, there is a good chance your scope is still too vague.&lt;/p&gt;
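
&lt;p&gt;To show how small such a benchmark can be, here is a minimal sketch of a load runner. It assumes you supply your own &lt;code&gt;send_request&lt;/code&gt; callable that performs one gateway call. The names are hypothetical, and this is not the script used for the published numbers.&lt;/p&gt;

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def run_load(send_request, total_requests, concurrency):
    """Call send_request() total_requests times and time every call."""

    def timed_call(_):
        started = time.perf_counter()
        send_request()
        return time.perf_counter() - started

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_call, range(total_requests)))
    wall_time = time.perf_counter() - wall_start

    return {
        "requests": len(latencies),
        "req_per_s": len(latencies) / wall_time,
        "median_latency_s": statistics.median(latencies),
    }
```

&lt;p&gt;Run the same function against each gateway with the same backend and the same settings, and record memory separately, for example from the container runtime.&lt;/p&gt;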

&lt;h2&gt;
  
  
  Lesson 3: benchmarking forces product clarity
&lt;/h2&gt;

&lt;p&gt;This part surprised me.&lt;/p&gt;

&lt;p&gt;I expected benchmarking to tell me about performance.&lt;/p&gt;

&lt;p&gt;What it also did was clarify the product itself.&lt;/p&gt;

&lt;p&gt;Once you measure something, you are forced to answer questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is this product actually optimized for?&lt;/li&gt;
&lt;li&gt;Where should it be better?&lt;/li&gt;
&lt;li&gt;What trade-offs did I make intentionally?&lt;/li&gt;
&lt;li&gt;What should users care about most?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In my case, the benchmark made the positioning much clearer.&lt;/p&gt;

&lt;p&gt;GoModel is not just “an AI gateway.”&lt;/p&gt;

&lt;p&gt;It is a Go-based, open-source, single-binary gateway designed to be lightweight, simple to deploy, and efficient in the hot path of LLM requests.&lt;/p&gt;

&lt;p&gt;Without benchmarking, those are just words.&lt;/p&gt;

&lt;p&gt;With benchmarking, they become testable claims.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lesson 4: benchmarking is also a debugging tool
&lt;/h2&gt;

&lt;p&gt;Before doing this, I mostly thought about benchmarks as something you publish.&lt;/p&gt;

&lt;p&gt;That was a mistake.&lt;/p&gt;

&lt;p&gt;A benchmark is also one of the fastest ways to find weak spots in your own system.&lt;/p&gt;

&lt;p&gt;As soon as you push something under repeatable load, you start noticing where memory grows faster than expected, where latency becomes uneven, and where parts of the system become bottlenecks.&lt;/p&gt;

&lt;p&gt;Even if I had never published the results, building the benchmark would still have been worth it.&lt;/p&gt;

&lt;p&gt;It gave me a much more honest picture of the system.&lt;/p&gt;

&lt;p&gt;And again, AI helps here not by replacing the benchmark, but by helping you move faster once you find a problem. You can ask it to review the script, suggest what might be skewing the result, or help you isolate one part of the test.&lt;/p&gt;

&lt;h2&gt;
  
  
  My biggest takeaway
&lt;/h2&gt;

&lt;p&gt;The biggest lesson I learned is very simple:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benchmarking is much more accessible today with AI tools.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You do not need a lab.&lt;/p&gt;

&lt;p&gt;You do not need a giant team.&lt;/p&gt;

&lt;p&gt;You do not need a perfect methodology.&lt;/p&gt;

&lt;p&gt;And now, you also do not need to start from a blank page.&lt;/p&gt;

&lt;p&gt;You can just describe what you want to measure, use AI to help generate a first version, and improve it from there.&lt;/p&gt;

&lt;p&gt;You still need to think.&lt;/p&gt;

&lt;p&gt;You still need to validate the setup.&lt;/p&gt;

&lt;p&gt;You still need to be honest about the limits.&lt;/p&gt;

&lt;p&gt;But getting started is much easier than it used to be.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final thought
&lt;/h2&gt;

&lt;p&gt;If you are building infrastructure, developer tools, or performance-sensitive software, I think it is worth benchmarking earlier than you expect.&lt;/p&gt;

&lt;p&gt;Not because you need a marketing graph.&lt;/p&gt;

&lt;p&gt;Because benchmarking forces clarity.&lt;/p&gt;

&lt;p&gt;It helps you understand your product better, find bottlenecks faster, and communicate value more concretely.&lt;/p&gt;

&lt;p&gt;And today, with AI, it is easier than ever to start.&lt;/p&gt;

&lt;p&gt;That was true for me with benchmarking GoModel, and it is probably true for a lot of other projects too.&lt;/p&gt;

&lt;p&gt;If you want to check out the project, GoModel is open-source and available on GitHub:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/ENTERPILOT/GOModel" rel="noopener noreferrer"&gt;GoModel on GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Enterpilot startup has also published a benchmark result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://enterpilot.io/blog/gomodel-vs-litellm-benchmark-march-2026/" rel="noopener noreferrer"&gt;GoModel vs LiteLLM benchmark (March 2026)&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>go</category>
      <category>performance</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
