Tabby review: self-hosted AI code completion you actually control

#ai #productivity #tutorial #webdev

Cloud code completion has a quiet cost: every keystroke of context gets shipped to someone else's servers. For a lot of developers that's fine. For anyone working under a security policy, on a private codebase, or just allergic to sending proprietary code through a third-party API, it's a dealbreaker. Tabby is one open-source answer: a coding assistant you run on your own hardware, where the model, the data, and the uptime are all yours.

We set it up on a single workstation with a consumer GPU and ran it against a real project to find where the "self-hosted" promise holds and where it costs you.

What you actually control with Tabby

Tabby is a self-hosted alternative to GitHub Copilot, written in Rust and shipped as a single Docker image. You run one container, point your editor at it, and you get inline completions plus a chat panel — without any of that traffic leaving your network.

Three things are genuinely yours when you run it:

Your code stays put. Completion context never leaves the machine (or VPC) you deploy to. That's the entire reason most teams look at Tabby.
You pick the model. Tabby pulls from an open registry of code models — StarCoder, CodeLlama, DeepSeek Coder, and the Qwen2.5-Coder family. You can swap the completion model and the chat model independently.
It's free to run. The core is open source under Apache 2.0, and the open-source edition carries no per-seat fee; your cost is the hardware and the time to operate it.

It plugs into VS Code, JetBrains IDEs, and Vim/Neovim through official extensions, so the editor side feels close to what you're used to.

Tabby can index your git repositories so completions and chat answers draw on your own codebase, not just the base model's training data. That repo-aware context is what makes a smaller self-hosted model feel usable instead of generic.

Getting it running

The happy path is genuinely one command. With an NVIDIA GPU and Docker installed:

docker run -it --gpus all -p 8080:8080 \
  -v $HOME/.tabby:/data \
  registry.tabbyml.com/tabbyml/tabby \
  serve --model Qwen2.5-Coder-1.5B \
        --chat-model Qwen2.5-Coder-3B-Instruct \
        --device cuda

That brings up the server on localhost:8080 with a web UI, a completion model, and a chat model. You create an account on first launch, generate a token, drop it into the editor extension, and you're completing code.
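Before wiring up an editor, it helps to confirm the server actually answers. The shell sketch below assumes the token from the web UI is exported as TABBY_TOKEN, and every helper name in it is hypothetical. The endpoint paths (GET /v1/health, and POST /v1/completions with a `segments` object holding the text before and after the cursor) reflect Tabby's HTTP API as we understand it. The agent config path used by `write_agent_config` is also an assumption. Check both against the docs for your Tabby version and editor extension.

```shell
#!/bin/sh
# Smoke-test sketch for a fresh Tabby server. Helper names are hypothetical;
# endpoint paths follow Tabby's HTTP API as we understand it.
TABBY_URL="${TABBY_URL:-http://localhost:8080}"

# Return an error (without killing an interactive shell) if no token is set.
require_token() {
  if [ -z "${TABBY_TOKEN:-}" ]; then
    echo "set TABBY_TOKEN to the token shown in the Tabby web UI" >&2
    return 1
  fi
}

# Is the server up? Prints the health response on success.
tabby_health() {
  require_token || return 1
  curl -fsS -H "Authorization: Bearer $TABBY_TOKEN" "$TABBY_URL/v1/health"
}

# Request one completion; the cursor sits between prefix and suffix.
tabby_test_completion() {
  require_token || return 1
  curl -fsS -X POST "$TABBY_URL/v1/completions" \
    -H "Authorization: Bearer $TABBY_TOKEN" \
    -H "Content-Type: application/json" \
    -d '{"language": "python",
         "segments": {"prefix": "def fib(n):\n    ",
                      "suffix": "\n        return fib(n - 1) + fib(n - 2)"}}'
}

# Save endpoint + token where agent-based editor clients may look for them
# (the default path is an assumption; check your extension's docs).
# Never overwrites an existing file.
write_agent_config() {
  require_token || return 1
  config="${1:-$HOME/.tabby-client/agent/config.toml}"
  if [ -e "$config" ]; then
    echo "refusing to overwrite $config" >&2
    return 1
  fi
  mkdir -p "$(dirname "$config")"
  printf '[server]\nendpoint = "%s"\ntoken = "%s"\n' "$TABBY_URL" "$TABBY_TOKEN" > "$config"
}
```

Saved as, say, `tabby-smoke.sh` (a hypothetical name), you would run `export TABBY_TOKEN='<your-token>'` and then `. ./tabby-smoke.sh && tabby_health && tabby_test_completion`. The helpers return an error instead of exiting when the token is missing, so sourcing the file into an interactive shell is safe.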

The realistic version has a few more wrinkles:

GPU matters. A 1.5B completion model fits comfortably on a card with 6–8 GB of VRAM and returns suggestions fast enough to feel live. Larger chat models want more headroom, and on a single-GPU machine the completion and chat models share that memory. You can run on CPU with --device cpu, but latency climbs to the point where inline completion stops feeling like completion.
Model choice is a tradeoff, not a default. Bigger models give better suggestions but cost more memory and add latency. The Qwen2.5-Coder line is a reasonable starting point; size up only if your hardware allows.
Indexing takes time up front. Pointing Tabby at your repositories triggers an initial indexing pass, which can take a while on a large codebase; after that, context retrieval is fast.
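If you prefer declaring repositories in a file over clicking through the UI, some Tabby releases read them from `config.toml` in the data directory (the `$HOME/.tabby` folder mounted as /data in the docker command above) as `[[repositories]]` entries with `name` and `git_url` fields. Newer releases also manage repositories from the web UI, so treat this file format as an assumption to verify for your version. The helper below is a hypothetical convenience, not part of Tabby.

```shell
#!/bin/sh
# Hypothetical helper: append one repository entry to Tabby's config.toml.
# Assumes the [[repositories]] / name / git_url layout; verify for your version.
TABBY_HOME="${TABBY_HOME:-$HOME/.tabby}"

add_tabby_repo() {
  name="$1"
  git_url="$2"
  config="${3:-$TABBY_HOME/config.toml}"
  if [ -z "$name" ] || [ -z "$git_url" ]; then
    echo "usage: add_tabby_repo NAME GIT_URL [CONFIG]" >&2
    return 2
  fi
  mkdir -p "$(dirname "$config")"
  # Each [[repositories]] header starts a new entry in a TOML array of tables,
  # so appending never clobbers repositories that are already listed.
  printf '\n[[repositories]]\nname = "%s"\ngit_url = "%s"\n' "$name" "$git_url" >> "$config"
}
```

For example, `add_tabby_repo my-service https://git.example.com/team/my-service.git`, then restart the container so the server picks up the change. If you use a local `file://` URL instead, remember that it is resolved inside the container, so the checkout has to be mounted there as well.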

Don't judge Tabby by a CPU-only trial. A laptop without a GPU produces slow, stuttering completions that make the tool feel worse than it is. If you can't dedicate a GPU, test on a cloud GPU instance before deciding.
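One way to avoid an accidental CPU-only trial is a small pre-flight check before launching. This sketch reads total VRAM with `nvidia-smi`, passes `--device cuda` only above a floor, and otherwise starts on CPU with a warning. The 6144 MiB floor is our own assumption drawn from the 6–8 GB rule of thumb for a 1.5B completion model, and it ignores the chat model's share. The helper and environment variable names are hypothetical. `COMPLETION_MODEL` and `CHAT_MODEL` also show how either model can be swapped independently.

```shell
#!/bin/sh
# Pre-flight sketch: pick --device from the GPU's memory, then start Tabby.
# Assumptions (ours, not Tabby's): the 6144 MiB floor and all helper names.
MIN_VRAM_MIB="${MIN_VRAM_MIB:-6144}"
COMPLETION_MODEL="${COMPLETION_MODEL:-Qwen2.5-Coder-1.5B}"
CHAT_MODEL="${CHAT_MODEL:-Qwen2.5-Coder-3B-Instruct}"

# Total memory (MiB) of the first NVIDIA GPU; prints nothing if none is found.
gpu_vram_mib() {
  command -v nvidia-smi >/dev/null 2>&1 || return 0
  nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits 2>/dev/null |
    head -n 1 | tr -d ' '
}

# Map a VRAM figure (MiB) to the --device value to try.
suggest_device() {
  case "$1" in
    ''|*[!0-9]*) echo "cpu"; return 0 ;;  # empty or non-numeric: no usable GPU
  esac
  if [ "$1" -ge "$MIN_VRAM_MIB" ]; then
    echo "cuda"
  else
    echo "cpu"
  fi
}

# Start the server with the chosen device; only request GPUs when using CUDA.
start_tabby() {
  device="$(suggest_device "$(gpu_vram_mib)")"
  if [ "$device" = "cpu" ]; then
    echo "warning: no GPU with at least ${MIN_VRAM_MIB} MiB found; expect slow completions" >&2
    gpu_flag=""
  else
    gpu_flag="--gpus all"
  fi
  # $gpu_flag is intentionally unquoted: "--gpus all" splits into two
  # arguments, and an empty value adds none.
  docker run -it $gpu_flag -p 8080:8080 \
    -v "$HOME/.tabby:/data" \
    registry.tabbyml.com/tabbyml/tabby \
    serve --model "$COMPLETION_MODEL" \
          --chat-model "$CHAT_MODEL" \
          --device "$device"
}
```

Set `COMPLETION_MODEL` or `CHAT_MODEL` in the environment before sourcing the file to try other models from Tabby's registry. If `start_tabby` warns and falls back to CPU, read that as a cue to try a cloud GPU instance rather than as a verdict on Tabby.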

Here's the honest comparison against hosted options:

| | Tabby (self-hosted) | Hosted (Copilot / Cursor) |
|---|---|---|
| Code privacy | Stays on your infra | Sent to vendor API |
| Cost | Hardware + ops time | Per-seat subscription |
| Setup | Docker + a GPU | Install and sign in |
| Completion quality | Good, model-dependent | Strong (frontier models) |
| Maintenance | You own updates and uptime | Vendor handles it |

If you'd rather skip the ops entirely and you don't have a hard privacy constraint, a hosted AI editor is the lower-friction path.

Who should self-host — and who shouldn't

Tabby earns its keep in specific situations:

Regulated or air-gapped environments where code can't legally or contractually leave the building.
Teams at scale where per-seat AI subscriptions add up and you already have GPU capacity.
Privacy-first solo developers who want completion without the data tradeoff and don't mind running a container.

It's the wrong call if you want the strongest possible completion quality with zero operational overhead. The open models Tabby runs are capable, but a 1.5B or 3B model self-hosted on a workstation won't match a frontier model behind a hosted product on raw suggestion quality. You're trading some completion strength for full control. Whether that trade pays off depends entirely on why you're looking at self-hosting in the first place.

For most individuals with no privacy constraint, hosted tools win on convenience. For anyone who said "we can't send our code to an API" out loud this quarter, Tabby is one of the few answers that closes that gap.

Originally published at pickuma.com. Subscribe to the RSS or follow @pickuma.bsky.social for new reviews.