Sipp: a local-first runtime for Hybrid AI Applications

Constant, Yuan Chen — Wed, 24 Jun 2026 13:37:58 +0000

Over the past few months, I had the opportunity to contribute to llama.cpp’s WebGPU backend, helping push it from isolated operator support toward a more complete and reliable path for browser-based and multimodal inference. It was a collective effort with dozens of contributors, and it was an essential part of getting Sipp ready for release. The lead maintainer, Reese Levine, wrote a really nice blog post about it, Llamas on the Web, and published a paper on the architecture design. In this post, I want to share some thoughts on in-browser inference with WebGPU, and on the larger ecosystem we are building to unify local and cloud compute in a hybrid inference design that makes intelligence more available.

Check out Sipp: https://www.sipp.sh/

What it means to make intelligence available

A technology becomes interesting when it works in a lab, and becomes transformative when people can actually reach it, run it, adapt it, and build with it in the environments they already use. That is the sense in which Sipp is about making AI more available.

Language models have mostly entered software through chatboxes, but interactive applications rarely begin with a complete prompt. In games, design tools, and agent workspaces, the useful context is often the live environment itself: the open document, selected objects, recent edits, user actions, scene state, and background work. As open models become smaller and more capable, more of that work can happen close to the user. This matters for more than speed and privacy. Using that context should not require every interaction to cross a cloud boundary, depend on frontier-model pricing, or be gated by a single provider. Frontier models are still better for deep reasoning and planning, but most interactive software does not need that depth at every step. The stronger design is a split one: local models for immediate, private, continuous interaction, and larger remote models when the task truly requires deeper reasoning.

Sipp is designed for the space between them. You can register a local model, a gateway target, or a provider endpoint, and then call the same operations against the selected endpoint. In the remaining sections, I will introduce the system design behind Sipp, including:

how Sipp represents local, gateway, and provider endpoints
why query, chat, and embed are separate operations
how the local engine schedules requests and reuses key-value (KV) cache state
how the browser host makes WebGPU practical for GGUF models
how the gateway provides a policy and operations boundary for remote compute

Architecture at a glance

Sipp uses endpoint registration. An endpoint can run in the same process, in the browser, behind an HTTP gateway, or at an external provider. The application keeps the endpoint reference and passes it to query, chat, or embed.

flowchart LR
  App[Application] --> Client[SippClient]
  Client --> Local[Local endpoint]
  Client --> Gateway[Gateway endpoint]
  Client --> Provider[Provider endpoint]

  Local --> Engine[Sipp Engine]
  Engine --> Native[llama.cpp and ggml]

  Gateway --> GatewayServer[Gateway server]
  GatewayServer --> ServerClient[SippClient]
  ServerClient --> ServerLocal[Server local model]
  ServerClient --> ServerProvider[Provider target]

The diagram shows two important properties. First, the application chooses the endpoint. Sipp does not silently move a request from a local model to a remote model. The application owns the product constraints, such as privacy, cost, latency, and quality. Second, the operation shape stays stable. A feature can start with a browser model, add a server-side local model later, and then add a gateway target for harder requests without changing the public operation model.

Endpoint model

The core client stores endpoints behind a common inference endpoint interface. SippClient::add() builds the endpoint from a descriptor:

A local model descriptor loads SippEngine.
A gateway descriptor builds an HTTP gateway endpoint.
A provider descriptor builds a direct provider adapter when provider support is enabled.

After registration, Sipp resolves each operation to the selected endpoint, checks whether that endpoint supports the requested operation, and returns the same run and response types regardless of where the endpoint runs. This design keeps endpoint selection explicit while keeping the application API small.
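To make the registration flow concrete, here is a minimal Python sketch of a client that stores endpoints behind one interface and checks operation support before dispatch. It is not Sipp's API: `Descriptor`, `EchoEndpoint`, `Client`, and their fields are hypothetical names chosen for illustration, and the endpoint simply echoes the call instead of running a model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptor:
    """Hypothetical descriptor: endpoint kind, name, and supported operations."""
    kind: str  # "local", "gateway", or "provider"
    name: str
    operations: frozenset = frozenset({"query", "chat", "embed"})


class UnsupportedOperation(Exception):
    """Raised when the selected endpoint cannot serve the operation."""


class EchoEndpoint:
    """Stand-in endpoint that reports which endpoint handled the call."""

    def __init__(self, descriptor: Descriptor):
        self.descriptor = descriptor

    def supports(self, operation: str) -> bool:
        return operation in self.descriptor.operations

    def run(self, operation: str, payload):
        return {"endpoint": self.descriptor.name, "operation": operation, "payload": payload}


class Client:
    def __init__(self):
        self._endpoints = {}

    def add(self, descriptor: Descriptor) -> str:
        if descriptor.kind not in {"local", "gateway", "provider"}:
            raise ValueError(f"unknown endpoint kind: {descriptor.kind}")
        self._endpoints[descriptor.name] = EchoEndpoint(descriptor)
        return descriptor.name  # the application keeps this reference

    def _dispatch(self, ref: str, operation: str, payload):
        endpoint = self._endpoints[ref]  # no silent fallback to another endpoint
        if not endpoint.supports(operation):
            raise UnsupportedOperation(f"{ref} does not support {operation}")
        return endpoint.run(operation, payload)

    def query(self, ref: str, prompt: str):
        return self._dispatch(ref, "query", prompt)

    def chat(self, ref: str, messages: list):
        return self._dispatch(ref, "chat", messages)

    def embed(self, ref: str, texts: list):
        return self._dispatch(ref, "embed", texts)
```

The point of the sketch is that the caller always names the endpoint, and an unsupported operation fails loudly rather than being rerouted.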

Request operations

Sipp separates query, chat, and embed because each operation has different
runtime requirements.

query sends a raw prompt string, without applying a chat template. Use this
operation when the application owns the prompt format, such as completion-style
prompts, few-shot prompts, custom templates, encoder-decoder flows, or agent
loops that render prompts themselves.

chat sends ordered role and content messages. A local endpoint applies the
model's chat template. A gateway or provider endpoint maps those messages into
the selected remote protocol.

embed returns vectors instead of generated text. The endpoint must support
embeddings, and the local runtime uses a different path because it reads an
embedding result instead of sampling output tokens.

The separation prevents common integration bugs. A raw prompt does not
accidentally become a chat transcript. A chat request is not sent to a local
model without the model's template. An embedding request is not routed to a
generation-only endpoint.

The gateway and provider paths also enforce this boundary. Gateway requests
reject local-only fields such as contextKey, grammar, and local sampling overrides. Endpoint-specific options must be JSON-compatible and cannot override typed fields such as model, prompt, messages, or stream.
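The boundary checks described above can be sketched roughly as follows. This is an illustration in Python under stated assumptions, not Sipp's validation code: the function name is hypothetical, and `sampling` is a placeholder key standing in for the local sampling overrides, whose real field names are not given here.

```python
import json

# "contextKey" and "grammar" come from the text; "sampling" is a hypothetical
# stand-in for local sampling overrides.
LOCAL_ONLY_FIELDS = {"contextKey", "grammar", "sampling"}
TYPED_FIELDS = {"model", "prompt", "messages", "stream"}


def validate_gateway_request(request: dict, options: dict) -> dict:
    """Reject local-only fields and unsafe options, then merge the two."""
    local_only = LOCAL_ONLY_FIELDS.intersection(request)
    if local_only:
        raise ValueError(f"local-only fields on a gateway request: {sorted(local_only)}")

    clashes = TYPED_FIELDS.intersection(options)
    if clashes:
        raise ValueError(f"options cannot override typed fields: {sorted(clashes)}")

    try:
        # allow_nan=False rejects NaN and Infinity, which are not valid JSON.
        json.dumps(options, allow_nan=False)
    except (TypeError, ValueError) as exc:
        raise ValueError("endpoint options must be JSON-compatible") from exc

    # Typed request fields are written last, so they always win.
    return {**options, **request}
```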

Local engine

The local engine is the part of Sipp that turns endpoint calls into interactive
inference work.

The core loop is tick-based. A request enters the runtime, becomes an internal
generation or embedding request, and advances through scheduler ticks. This
model fits interactive applications, where a visible chat response, a short
background classification, and a longer prefill can overlap.

At each tick, the engine separates prompt prefill from token decode. The batch
planner builds a flat list of token contributions from ready slots. Each
contribution is either Prefill or Decode, and includes the slot index,
request ID, token, position, and whether the request needs logits.

This split gives the scheduler direct control over latency and throughput:

Decode steps are latency-sensitive because they produce visible tokens.
Prefill steps can process many prompt tokens before the first generated token.
Active decode streams should not wait behind a long prompt.
Long prompts should still make progress while decode streams are active.

The planner budgets decode and prefill separately, so neither kind of work
can starve the other. It reuses the plan buffer between ticks, which avoids
repeated allocation in the scheduler hot path. The planner also uses a small
bitmask fast path for counting occupied slots instead of allocating a HashSet
on each tick.
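A simplified planner along these lines might look like the sketch below. It is written in Python for brevity and is not Sipp's scheduler: the `Slot` and `Contribution` types, the budget parameters, and the decode-before-prefill ordering are assumptions made for illustration. The contribution fields follow the ones listed above, and the caller-owned `plan` list and integer bitmask mirror the allocation-avoidance ideas.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Kind(Enum):
    PREFILL = "prefill"
    DECODE = "decode"


@dataclass
class Slot:
    index: int
    request_id: int
    pending_prompt: List[int]          # prompt tokens not yet prefilled
    position: int = 0                  # next position in the sequence
    next_token: Optional[int] = None   # sampled token waiting for a decode step


@dataclass(frozen=True)
class Contribution:
    kind: Kind
    slot: int
    request_id: int
    token: int
    position: int
    needs_logits: bool


def plan_tick(slots, decode_budget: int, prefill_budget: int, plan: list) -> int:
    """Fill `plan` in place and return the number of slots with work this tick."""
    plan.clear()   # reuse the caller's buffer instead of building a new list
    occupied = 0   # one bit per slot index

    # Decode first: each active stream contributes one latency-sensitive token.
    for slot in slots:
        if slot.next_token is not None and decode_budget > 0:
            plan.append(Contribution(Kind.DECODE, slot.index, slot.request_id,
                                     slot.next_token, slot.position, True))
            decode_budget -= 1
            occupied |= 1 << slot.index

    # Prefill uses its own budget, so long prompts still make progress.
    for slot in slots:
        if slot.next_token is None and slot.pending_prompt and prefill_budget > 0:
            chunk = slot.pending_prompt[:prefill_budget]
            for offset, token in enumerate(chunk):
                # Only the final prompt token needs logits to sample from.
                is_last = offset == len(slot.pending_prompt) - 1
                plan.append(Contribution(Kind.PREFILL, slot.index, slot.request_id,
                                         token, slot.position + offset, is_last))
            prefill_budget -= len(chunk)
            occupied |= 1 << slot.index

    return bin(occupied).count("1")  # popcount of the bitmask
```

With one decoding slot and one slot holding a five-token prompt, a prefill budget of three yields one decode contribution and three prefill contributions, and the rest of the prompt waits for the next tick.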

After the native backend runs, Sipp applies request bookkeeping and emits token
batches. The runtime records the request metrics for observability.

Key-value cache as runtime state

Each local request can carry a contextKey. The key represents the logical
workflow, such as a document, scene, conversation, workspace, or
background task. The engine uses that key to decide whether it can reuse live KV
state or restore a prefix snapshot.

The KvCacheManager maps context keys to physical sequence slots. When a request completes and cache reuse is enabled, Sipp can keep the sequence idle but resident. A later request with the same contextKey can use that warm state. If the runtime has more active contexts than physical sequences, it evicts idle sessions with an LRU policy.

Sipp also supports prefix snapshots. During prefill, the runtime can restore the best matching snapshot for the same model fingerprint and context scope. It then recomputes only the missing suffix. The prefill path computes longest common prefix reuse, checks whether partial KV reuse is valid for the model family, makes room in the context window, and records cache hits.
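The slot mapping and prefix reuse can be illustrated with a small Python sketch. It is a deliberately reduced model, not the KvCacheManager itself: the `KvSlots` class and its methods are hypothetical, it folds prefix reuse into the warm sequence rather than modelling separate snapshots, and it leaves out model fingerprints, model-family validity checks, and context-window trimming.

```python
from collections import OrderedDict


def common_prefix_len(cached, prompt) -> int:
    """Number of leading tokens shared by the cached state and the new prompt."""
    shared = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        shared += 1
    return shared


class KvSlots:
    def __init__(self, num_sequences: int):
        self.free = list(range(num_sequences))
        self.idle = OrderedDict()  # context key -> (seq_id, tokens), oldest first
        self.active = {}           # context key -> seq_id

    def acquire(self, context_key: str, prompt: list):
        """Return (seq_id, reused_tokens) for a new request on this context."""
        if context_key in self.active:
            raise RuntimeError("context already has an active request")
        if context_key in self.idle:
            # Warm hit: only the suffix after the shared prefix needs prefill.
            seq_id, cached = self.idle.pop(context_key)
            reused = common_prefix_len(cached, prompt)
        elif self.free:
            seq_id, reused = self.free.pop(), 0
        elif self.idle:
            # Evict the least recently used idle context and take its sequence.
            _, (seq_id, _) = self.idle.popitem(last=False)
            reused = 0
        else:
            raise RuntimeError("all sequences are busy")
        self.active[context_key] = seq_id
        return seq_id, reused

    def release(self, context_key: str, tokens: list) -> None:
        """Keep the finished sequence resident so a later request can reuse it."""
        seq_id = self.active.pop(context_key)
        self.idle[context_key] = (seq_id, list(tokens))  # most recent goes last
```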

This cache state matters for hybrid routing because it determines the cost of asking for local evidence. A warm contextKey can let the runtime reuse prefix tokens, run a short verifier, or draft an answer without paying full prefill cost again. A router can then decide whether to return the local result, send the draft to the gateway for audit, or skip local work when the task likely needs a stronger model.

For example, an editor can use the same contextKey while a user works inside one file. The first request pays the cost of reading the file and recent edits into the local model. A follow-up request, such as "is this edit safe?", can reuse that warm prefix and run a cheap local check. If the check is confident, the UI can respond immediately. If the check is uncertain or needs more context, the router can send the task to the gateway instead.

Instead of a static local-then-cloud cascade, a router can use cache hits, prefill cost, network latency, and provider cost as routing inputs. We are actively researching this area and hope to share more insights in the future.
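As a thought experiment only, a router that uses those inputs could look like the Python sketch below. Nothing here reflects a shipped Sipp policy: the `RouteInputs` fields, the thresholds, and the decision rule are all hypothetical, and the numbers in the usage note are made up to show the shape of the decision.

```python
from dataclasses import dataclass


@dataclass
class RouteInputs:
    prompt_tokens: int
    cached_prefix_tokens: int    # tokens a warm context or snapshot can reuse
    prefill_tokens_per_s: float  # measured local prefill speed, must be > 0
    network_rtt_s: float         # round trip to the gateway
    remote_cost_usd: float       # estimated provider cost for this request


def local_prefill_seconds(inputs: RouteInputs) -> float:
    """Estimated time to prefill only the part of the prompt that is not cached."""
    missing = max(inputs.prompt_tokens - inputs.cached_prefix_tokens, 0)
    return missing / inputs.prefill_tokens_per_s


def choose_route(inputs: RouteInputs,
                 latency_budget_s: float = 0.3,
                 max_remote_cost_usd: float = 0.01) -> str:
    """Prefer local when the cache makes it cheap; otherwise weigh the remote path."""
    local_s = local_prefill_seconds(inputs)
    if local_s <= latency_budget_s:
        return "local"
    remote_is_faster = inputs.network_rtt_s < local_s
    remote_is_affordable = inputs.remote_cost_usd <= max_remote_cost_usd
    return "gateway" if remote_is_faster and remote_is_affordable else "local"
```

With a 4,000-token prompt and a local prefill speed of 1,000 tokens per second, a cache hit on 3,900 tokens leaves 0.1 s of local work and stays local, while a cold cache leaves 4 s of work and tips the same request toward the gateway if the remote cost is acceptable.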

Browser host

The browser host owns packaging, model staging, capability selection, and the JavaScript-facing runtime API. It does not duplicate the inference engine in the Rust core.

The build compiles the Rust browser ABI as an Emscripten static library. It then
links that library with llama.cpp, ggml, ggml-webgpu, and the multimodal
runtime. The ggml-webgpu target embeds WGSL shader files into a generated
header, and the Emscripten build uses Dawn's emdawnwebgpu port to call browser
WebGPU from C++ and WebAssembly.

The browser client runs through a worker-backed model service or a
main-thread model service and ships single-thread and pthread WebAssembly
artifacts. The pthread artifact requires SharedArrayBuffer, cross-origin
isolation, and the deployment headers that enable shared memory. The
single-thread artifact remains available when those requirements are not met.
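For local testing, cross-origin isolation is usually enabled by sending the Cross-Origin-Opener-Policy and Cross-Origin-Embedder-Policy response headers. The sketch below is a generic development server built on Python's standard library, not part of Sipp's tooling; the `IsolatedHandler` and `serve` names and the port are arbitrary choices.

```python
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

# Headers that opt a page into cross-origin isolation, which browsers
# require before exposing SharedArrayBuffer.
ISOLATION_HEADERS = {
    "Cross-Origin-Opener-Policy": "same-origin",
    "Cross-Origin-Embedder-Policy": "require-corp",
}


class IsolatedHandler(SimpleHTTPRequestHandler):
    """Static file handler that adds the isolation headers to every response."""

    def end_headers(self):
        for name, value in ISOLATION_HEADERS.items():
            self.send_header(name, value)
        super().end_headers()


def serve(port: int = 8080) -> None:
    """Serve the current directory on loopback until interrupted."""
    ThreadingHTTPServer(("127.0.0.1", port), IsolatedHandler).serve_forever()
```

Calling `serve()` from the directory that holds the built page is enough for a quick check. In the browser, the `crossOriginIsolated` global reports whether isolation took effect.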

Backend selection is capability-aware:

If the app requests CPU, Sipp uses CPU.
If the app requests WebGPU, Sipp returns adapter information.
If the app uses automatic selection, Sipp selects WebGPU only when the adapter exposes shader-f16; otherwise, it falls back to CPU.
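The three rules above reduce to a small decision function. The Python sketch below is illustrative rather than Sipp's implementation: `select_backend` is a hypothetical name, the adapter is modelled as a set of feature strings, and raising an error when WebGPU is requested without an adapter is an assumption, since the text does not say what happens in that case.

```python
from typing import Optional, Set


def select_backend(requested: str, adapter_features: Optional[Set[str]]) -> str:
    """Pick "cpu" or "webgpu"; adapter_features is None when no adapter exists."""
    if requested == "cpu":
        return "cpu"
    if requested == "webgpu":
        if adapter_features is None:
            # Assumption: an explicit WebGPU request fails rather than falling back.
            raise RuntimeError("WebGPU was requested but no adapter is available")
        return "webgpu"
    if requested == "auto":
        # Automatic selection only trusts adapters that expose shader-f16.
        if adapter_features is not None and "shader-f16" in adapter_features:
            return "webgpu"
        return "cpu"
    raise ValueError(f"unknown backend request: {requested}")
```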

Model loading is part of inference performance. The browser cache policy loads
files directly up to 2 GiB. It splits larger GGUF assets into 512 MiB shards and
loads those shards automatically. For large assets, the browser package stores
the files in OPFS, opens sync access handles, and mounts those handles into
Emscripten's filesystem. Read calls copy bytes from OPFS directly into a
Uint8Array view of the WebAssembly heap. That avoids reading into a JavaScript
ArrayBuffer and then copying the same bytes again into HEAPU8.

Vision models use the same lifecycle. The main model weights can be staged as GGUF shards, while the projector artifact is staged separately for the multimodal runtime.
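The direct-load and sharding thresholds translate into simple arithmetic. The Python sketch below assumes that "up to 2 GiB" is inclusive and that shards are plain byte ranges; the real staging code may align shards differently, and `plan_shards` is a hypothetical helper name.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

DIRECT_LOAD_LIMIT = 2 * GIB  # files up to this size load in one piece
SHARD_SIZE = 512 * MIB       # larger files are split into shards of this size


def plan_shards(file_size: int) -> list:
    """Return the (offset, length) byte ranges used to stage a model file."""
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    if file_size <= DIRECT_LOAD_LIMIT:
        return [(0, file_size)]
    return [
        (offset, min(SHARD_SIZE, file_size - offset))
        for offset in range(0, file_size, SHARD_SIZE)
    ]
```

Under these assumptions, a 4.5 GiB model stages as nine 512 MiB shards, while a 1.5 GiB model loads directly as a single range.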

WebGPU case study

The main difference between the ggml WebGPU backend and the ONNX and
TVM/WebLLM approaches is the model representation.

ONNX Runtime treats WebGPU as an execution provider for ONNX graphs. That is
a good fit for portable graph artifacts and provider abstraction. The tradeoff
is that GGUF-native details, such as tokenizer metadata, chat templates, KV
behavior, and llama.cpp quantized layouts, must cross a different artifact
boundary.

TVM/WebLLM uses a compiler pipeline. Model computation is lowered through
MLC-LLM and Apache TVM into WebGPU and WebAssembly artifacts. That path can
apply ahead-of-time optimization, graph fusion, and operator scheduling for a
curated model catalog. The tradeoff is that users cannot point the runtime at
an arbitrary GGUF file and run it directly.

GGML WebGPU keeps the model format and runtime behavior closer to execution.
ggml still builds the tensor graph dynamically, and WebGPU executes that graph
in the browser. The backend maps tensor views to WebGPU buffer offsets, supports
quantized layouts without expanding them into large intermediate tensors,
specializes shader pipelines by tensor type and quantization format, and reuses
per-kernel parameter storage. For quantized matrix multiplication,
dequantization stays in the shader path instead of becoming a separate
model-conversion step.

That architecture fits decode-heavy workloads. Decode often depends on memory
bandwidth because each generated token streams weights and attends over KV
state. Keeping quantized weights close to execution reduces memory expansion and
copy costs. Prefill has different tradeoffs because it exposes more parallelism
and can benefit from compiler fusion.

Using our public browser benchmark tooling at
benchmark.sipp.sh/benchmark, we measured
the following results on an NVIDIA GeForce RTX 3080, with one warmup run and
three measured runs:

Runtime or framework	Time to first token, lower is better	Decode, higher is better	End-to-end latency, lower is better
Sipp browser runtime	24.3 ms	77.07 tok/s	6,655 ms
WebLLM	160.0 ms	25.80 tok/s	19,930 ms
Transformers.js	301.0 ms	33.25 tok/s	15,670 ms

This benchmark is one data point. Browser version, model, memory pressure, and other factors can change the results. The numbers are still useful because they are consistent with what the architecture suggests: fewer avoidable copies, GGUF-native execution, and quantized decode in the backend.

Gateway control plane

Sipp separates gateway responsibilities into layers:

sipp::gateway_core defines protocol-neutral operations, request context, cancellation, stream events, target resolution, authorization, admission, and execution traits.
lib/gateway provides route-free HTTP helpers, including codecs, authentication traits, error translation, JSON responses, and server-sent events (SSE) encoding.
apps/gateway-server is the Axum application. It owns TOML configuration, listeners, bearer tokens, CORS, request-size limits, concurrency limits, rate limiting, metrics, and the admin dashboard.

This separation keeps policy out of the local engine. A product can hide
provider secrets, restrict targets by caller, enforce request-size limits, rate
limit public clients, expose health routes, or run a local model on a server GPU
without changing the local runtime.

The gateway server loads configured targets into a SippClient and
stores a map from public target names to endpoint references. Incoming requests
resolve a target, check authorization, acquire admission, and then execute the
same query, chat, or embed operation through SippClient.
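That request path (resolve, authorize, admit, execute) can be sketched in a few lines of Python. This is an illustration only, not the Axum server: the `Gateway` class, the token-to-target grant map, the non-blocking semaphore used for admission, and the HTTP status codes are all assumptions made to keep the example small.

```python
import threading


class GatewayError(Exception):
    def __init__(self, status: int, message: str):
        super().__init__(message)
        self.status = status


class Gateway:
    def __init__(self, targets: dict, grants: dict, max_concurrent: int):
        self.targets = targets  # public target name -> callable(operation, payload)
        self.grants = grants    # bearer token -> set of allowed target names
        self.admission = threading.BoundedSemaphore(max_concurrent)

    def handle(self, token: str, target_name: str, operation: str, payload):
        # 1. Resolve the public target name to an endpoint.
        endpoint = self.targets.get(target_name)
        if endpoint is None:
            raise GatewayError(404, f"unknown target: {target_name}")
        # 2. Check that this caller may use the target.
        if target_name not in self.grants.get(token, set()):
            raise GatewayError(403, "caller is not allowed to use this target")
        # 3. Acquire admission without queueing behind in-flight work.
        if not self.admission.acquire(blocking=False):
            raise GatewayError(429, "gateway is at its concurrency limit")
        # 4. Execute the same operation the local client would run.
        try:
            return endpoint(operation, payload)
        finally:
            self.admission.release()
```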

Query and chat can return finite responses or streams. Streaming responses use
SSE events: token batches, optional usage, and a final done event with finish
metadata. Embeddings use a finite response path.
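On the wire, a stream like this is a sequence of standard server-sent events. The Python sketch below shows one plausible encoding. The event names `token`, `usage`, and `done` follow the description above, while the JSON payload fields (`tokens`, `finishReason`) and the helper names are hypothetical.

```python
import json
from typing import Iterable, Iterator, Optional


def encode_sse(event: str, data: dict) -> str:
    """Encode one SSE frame: an event line, a data line, and a blank line."""
    # json.dumps escapes newlines, so the payload always fits on one data line.
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def stream_events(token_batches: Iterable[list],
                  usage: Optional[dict],
                  finish_reason: str) -> Iterator[str]:
    """Yield token batches, then optional usage, then the final done event."""
    for batch in token_batches:
        yield encode_sse("token", {"tokens": batch})
    if usage is not None:
        yield encode_sse("usage", usage)
    yield encode_sse("done", {"finishReason": finish_reason})
```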

The browser gateway client mirrors that contract. It validates the gateway base
URL, allows HTTP only for loopback, supports bearer and header authentication
with value providers, redacts secrets from errors, bounds error bodies and SSE
event sizes, and parses token, usage, done, and error events into the same
browser run abstraction used by local inference.
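Two of those client-side checks, the loopback-only rule for plain HTTP and bounded SSE parsing, are easy to show in isolation. The sketch below is in Python rather than the browser client's own language, and every name in it is hypothetical, including the 64 KiB event bound.

```python
import ipaddress
import json
from urllib.parse import urlsplit

MAX_EVENT_BYTES = 64 * 1024  # hypothetical bound on a single SSE event


def is_loopback(host) -> bool:
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # a non-loopback DNS name


def validate_base_url(url: str) -> str:
    """Accept HTTPS anywhere, and plain HTTP only for loopback hosts."""
    parts = urlsplit(url)
    if parts.scheme == "https" and parts.hostname:
        return url
    if parts.scheme == "http" and is_loopback(parts.hostname):
        return url
    raise ValueError(f"gateway base URL is not allowed: {url}")


def parse_sse(stream_text: str) -> list:
    """Parse complete SSE frames into (event_name, decoded_json) pairs."""
    events = []
    for block in stream_text.split("\n\n"):
        if not block.strip():
            continue
        if len(block.encode("utf-8")) > MAX_EVENT_BYTES:
            raise ValueError("SSE event exceeds the size bound")
        name, data_lines = "message", []  # "message" is the SSE default type
        for line in block.split("\n"):
            if line.startswith("event:"):
                name = line[len("event:"):].strip()
            elif line.startswith("data:"):
                value = line[len("data:"):]
                data_lines.append(value[1:] if value.startswith(" ") else value)
        if data_lines:  # frames without data are not dispatched
            events.append((name, json.loads("\n".join(data_lines))))
    return events
```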

Hybrid routing

Hybrid routing is under active research, and the decision depends on runtime
signals, including:

local model latency for the current device and backend
privacy requirements for the task
expected reasoning depth
remote target availability, cost, and authorization

With those signals, an application can start with simple policies:

Keep UI classification, lightweight summarization, grammar-constrained extraction, scene synchronization, and private state-adjacent tasks local.
Delegate long-horizon planning, difficult synthesis, stronger world knowledge, or high-accuracy tasks to a gateway target.
Send the remote model only the context it needs.
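Written as code, a starting policy of this kind can be a lookup plus a context filter. The Python sketch below is one hypothetical shape for it: the task labels are shorthand for the categories listed above, and `route` and `trim_context` are illustrative names, not Sipp functions.

```python
LOCAL_TASKS = {
    "ui-classification",
    "light-summarization",
    "constrained-extraction",
    "scene-sync",
}
REMOTE_TASKS = {
    "long-horizon-planning",
    "difficult-synthesis",
    "world-knowledge",
    "high-accuracy",
}


def route(task_kind: str, private: bool, gateway_available: bool) -> str:
    """Return "local" or "gateway" for a task under the simple policy."""
    if private or task_kind in LOCAL_TASKS:
        return "local"  # private work never leaves the device
    if task_kind in REMOTE_TASKS and gateway_available:
        return "gateway"
    return "local"  # unknown tasks and gateway outages stay local


def trim_context(context: dict, needed_keys: list) -> dict:
    """Keep only the context entries a remote model actually needs."""
    return {key: context[key] for key in needed_keys if key in context}
```

An application would typically call `trim_context` right before a gateway request, so that the open document or scene state is not sent wholesale.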

The application still owns the routing decision. Sipp provides the common
runtime model that makes the decision practical.

Closing

At the start of Sipp, the motivation was simple: AI should be usable in more places than a remote chatbox. By the end, that had become a systems question: how do we make model compute available inside real applications, close to the user when interaction demands it, and connected to deeper reasoning when the task requires it? That is the vision Sipp is built to support.

During this journey, I learned a lot by working close to inference itself and to the llama.cpp community. My work covered backend stability, shader correctness and optimization, quantized-kernel behavior, and the operator coverage needed by vision models. One challenge along the way was tracing precision drift through the CLIP vision path, attention shaders, and feed-forward layers, then fixing the shader behavior until multimodal outputs matched the CPU reference much more closely. This was not just implementation work. It required low-level debugging across WGSL shaders, memory semantics, quantization formats, and real model validation. These contributions became a central part of making llama.cpp’s WebGPU support more stable, complete, and practically usable. The deeper lessons, however, were not only technical. They came from seeing how much careful work sits behind a model that feels simple to use, and from collaborating with people who were solving the same problems from different parts of the stack.

Looking forward, local models handle the work that benefits from being close to the user, and cloud models are available when a task needs more depth. Applications should not have to choose one side forever. They should be able to place computation where it makes the experience better. This brings the argument back to where it started: the future is not just better chatboxes, but environment-aware software that can act close to the user and reason deeply when needed. Sipp is built to make that practical.

DEV Community: Constant, Yuan Chen