This article was originally published on aifoss.dev
Most self-hosted coding tools put the GPU requirement on the developer's own machine. Tabby flips that model: one GPU server on your network, every developer on the team connects to it. That single architectural decision explains both Tabby's strengths and its friction points.
Tabby (by TabbyML) is a self-hosted AI coding assistant built in Rust, licensed under Apache 2.0, and available at github.com/TabbyML/tabby with around 33,000 GitHub stars as of mid-2026. The latest stable release is v0.32.0 (January 25, 2026), which added Mistral Embedding API support, generic OAuth, and multi-branch repository indexing.
This review covers what works, what doesn't, and who should actually run it.
What Tabby Is (and Isn't)
Tabby is a code completion server. It runs inference on the server side and exposes an API that IDE plugins call in real time as you type. The experience on the client is similar to GitHub Copilot: inline ghost-text completions, with a chat panel available alongside.
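To make "exposes an API that IDE plugins call" concrete, here is a minimal sketch of the kind of request a plugin might send. The endpoint path (`/v1/completions`), the `segments` payload shape, and the bearer-token header reflect my understanding of Tabby's HTTP API and should be checked against the API docs served by your own instance. The token value and the helper name `build_completion_request` are hypothetical.

```python
import json
import urllib.request

# Assumptions, not taken from this article: endpoint path, payload shape,
# and auth header. Verify them against your server's API documentation.
TABBY_URL = "http://localhost:8080"   # assumed local server address
API_TOKEN = "auth_xxx"                # hypothetical placeholder token


def build_completion_request(prefix: str, suffix: str, language: str) -> urllib.request.Request:
    """Build (but do not send) a completion request for the text around the cursor."""
    payload = {
        "language": language,
        # Text before and after the cursor; the server fills in the middle.
        "segments": {"prefix": prefix, "suffix": suffix},
    }
    return urllib.request.Request(
        url=f"{TABBY_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_TOKEN}",
        },
        method="POST",
    )


req = build_completion_request("def fib(n):\n    ", "\n", "python")
print(req.get_method(), req.full_url)

# Sending it needs a running server, so it is left commented out:
#   with urllib.request.urlopen(req, timeout=5) as resp:
#       print(json.load(resp))
```

Building the request separately from sending it keeps the sketch runnable without a server, and shows that the client only ever ships a small slice of text around the cursor.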
What Tabby is not: an agentic coding tool. It won't read your entire codebase, spawn shell commands, or write multi-file features on its own. If that's what you need, look at Aider or Cline instead. Tabby's job is to be a fast, context-aware autocomplete layer — running entirely on your infrastructure.
The server-side architecture has a concrete implication: a team of ten developers can share one RTX 3090 rather than each needing their own GPU. For orgs where code can't touch external APIs — finance, healthcare, defense contractors — this changes the economics significantly.
Installation
Docker is the recommended path, and it's genuinely painless:
```bash
# NVIDIA GPU (CUDA)
docker run -it \
  --gpus all \
  -p 8080:8080 \
  -v $HOME/.tabby:/data \
  tabbyml/tabby \
  serve --model StarCoder2-3B --device cuda
```
```bash
# Apple Silicon (Metal): Docker on macOS cannot pass the GPU through to
# containers, so run the native binary (installed via Homebrew) instead
brew install tabbyml/tabby/tabby
tabby serve --model StarCoder2-3B --device metal
```
```bash
# CPU-only (expect ~8–15 tokens/sec; usable only for very small models)
docker run -it \
  -p 8080:8080 \
  -v $HOME/.tabby:/data \
  tabbyml/tabby \
  serve --model StarCoder2-1B --device cpu
```
After startup, http://localhost:8080 opens the admin dashboard. First-run setup takes under five minutes: create an admin account, generate API tokens for each developer, configure the model. The admin dashboard is clean, which enterprise software usually isn't.
For production team deployments, Tabby supports RunPod and other GPU cloud hosts, which is a practical option if your team doesn't have a spare GPU box sitting around.
Model Selection and Hardware Requirements
This is where Tabby requires the most upfront decision-making. Model quality and hardware costs scale together, and there's no free lunch.
| Model | VRAM (int8) | Speed (RTX 3080) | HumanEval | Best for |
|---|---|---|---|---|
| StarCoder2-1B | ~2 GB | 180–250 tok/s | ~28% | CPU or low-VRAM GPU |
| StarCoder2-3B | ~4 GB | 80–120 tok/s | ~36% | 4–6 GB VRAM |
| StarCoder2-7B / DeepSeek-Coder-6.7B | ~8 GB | 40–60 tok/s | ~49–52% | 8 GB VRAM |
| CodeLlama-7B | ~8 GB | 35–55 tok/s | ~37% | 8 GB VRAM, multi-language |
| Codestral-22B | ~16 GB | 15–25 tok/s | ~56% | 16+ GB VRAM |
The 3B tier is the practical sweet spot for teams. StarCoder2-3B fits in 4 GB VRAM, runs at 80+ tokens/second on an RTX 3080, and produces completions that are noticeably better than the 1B model for anything beyond trivial boilerplate. First-token latency on a warm 3B model is under 200ms — fast enough to not interrupt flow state.
DeepSeek-Coder-6.7B is the quality upgrade worth making if you have 8 GB VRAM headroom. It consistently outperforms CodeLlama-7B on benchmark evals across Python, Java, and C++. If the team is primarily Python, DeepSeek-Coder is the right call.
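The sizing guidance above boils down to a lookup: take the largest model whose int8 footprint fits your card. A rough sketch follows, using the VRAM figures from the table. The selection rule, the optional headroom parameter, and the function name `pick_model` are illustrative assumptions, not anything Tabby ships.

```python
from typing import Optional

# int8 VRAM figures copied from the table in this article, smallest first.
# At the 8 GB tier this lists DeepSeek-Coder-6.7B, the article's preferred pick.
MODELS_BY_VRAM_GB = [
    ("StarCoder2-1B", 2),
    ("StarCoder2-3B", 4),
    ("DeepSeek-Coder-6.7B", 8),
    ("Codestral-22B", 16),
]


def pick_model(vram_gb: float, headroom_gb: float = 0.0) -> Optional[str]:
    """Return the largest listed model that fits in vram_gb minus headroom_gb."""
    usable = vram_gb - headroom_gb
    fitting = [name for name, need_gb in MODELS_BY_VRAM_GB if need_gb <= usable]
    return fitting[-1] if fitting else None


for vram in (2, 6, 10, 24):
    print(f"{vram} GB -> {pick_model(vram)}")
```

Passing a non-zero `headroom_gb` is a cheap way to leave room for whatever else is running on the card; how much to reserve is a judgment call for your own hardware.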
Apple Silicon users running Metal get reasonable performance: StarCoder2-3B on an M2 Pro delivers roughly 60–80 tokens/second. Good enough for daily use.
CPU-only mode works, but barely. StarCoder2-3B on CPU runs at 8–15 tokens/second. That's a 1.3 to 2.5 second wait for a twenty-token completion, which effectively kills the autocomplete value proposition. If you have no GPU, use Continue.dev with a remote API endpoint instead; don't try to run Tabby on CPU for a real workflow.
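The arithmetic behind that verdict is just completion length divided by throughput. A small sketch, using the StarCoder2-3B throughput figures quoted in this article; it counts generation time only, so prompt processing and network latency would come on top.

```python
def completion_seconds(tokens: int, tokens_per_second: float) -> float:
    """Generation time only; ignores prompt processing and network latency."""
    return tokens / tokens_per_second


# Throughput figures quoted in this article for StarCoder2-3B.
scenarios = [
    ("CPU, slow end", 8),
    ("CPU, fast end", 15),
    ("RTX 3080, slow end", 80),
]

for label, tps in scenarios:
    print(f"{label}: {completion_seconds(20, tps):.2f} s for 20 tokens")
```

Even the slow end of the GPU range finishes a twenty-token completion in a quarter of a second, roughly ten times faster than the slow end of the CPU range.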
Tabby defaults to int8 precision under CUDA and targets GPU Compute Capability ≥ 7.0 (RTX 20-series and newer, or equivalent AMD via ROCm). Pascal cards (GTX 10xx) fall into a supported but degraded path: Compute Capability 6.1 works but loses some precision optimizations.
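If you are auditing a pile of older cards, the thresholds above are easy to encode. The sketch below only classifies a compute-capability string using the two cut-offs this article gives (7.0 and 6.1); the tier names and the function `cuda_support_tier` are made up for illustration. On recent NVIDIA drivers the string can usually be read with `nvidia-smi --query-gpu=compute_cap --format=csv,noheader`, though treat the availability of that query field on your driver as an assumption.

```python
def cuda_support_tier(compute_capability: str) -> str:
    """Map a compute capability such as '8.6' to a tier based on this article's thresholds."""
    major, minor = (int(part) for part in compute_capability.strip().split("."))
    if (major, minor) >= (7, 0):
        return "full"        # the path the article describes as the target
    if (major, minor) >= (6, 1):
        return "degraded"    # works, but loses some optimizations per the article
    return "unverified"      # not covered by the article; test before relying on it


for cc in ("8.6", "7.5", "6.1", "5.2"):
    print(cc, "->", cuda_support_tier(cc))
```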
IDE Integration
Tabby supports VS Code, JetBrains IDEs, Vim, Neovim, and Emacs through official plugins. The VS Code extension is the most polished: install it, paste your server URL and API token, and completions start appearing in under a minute.
The JetBrains plugin works well across IntelliJ IDEA, PyCharm, GoLand, and WebStorm. Setup is the same as VS Code. The Vim and Neovim plugins are functional but spartan — expect to spend more time on configuration if you're in that ecosystem.
One practical note: the IDE plugin doesn't fire a request on every keystroke. It debounces them, sending a completion request only after a configurable pause in typing (default 150ms). On a LAN connection to the server, this is invisible. Over VPN, add a few hundred milliseconds. It's still faster than GitHub Copilot on a slow internet day.
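To see what a debounce interval does to request volume, here is a small simulation. It is not the plugin's actual implementation, just a sketch of the general technique under the 150ms default mentioned above: a request is sent only when no newer keystroke arrives within the interval. The function name and the sample keystroke timings are invented for illustration.

```python
from typing import List


def debounced_requests(keystroke_times_ms: List[int], debounce_ms: int = 150) -> List[int]:
    """Return the times (ms) at which a completion request would be sent.

    Expects keystroke times in ascending order. A request fires debounce_ms
    after a keystroke only if no newer keystroke arrives before that deadline.
    """
    fire_times = []
    last_index = len(keystroke_times_ms) - 1
    for i, pressed_at in enumerate(keystroke_times_ms):
        deadline = pressed_at + debounce_ms
        if i == last_index or keystroke_times_ms[i + 1] >= deadline:
            fire_times.append(deadline)
    return fire_times


# Five quick keystrokes, a pause, then two more.
keys = [0, 60, 120, 180, 240, 900, 950]
print(debounced_requests(keys))  # [390, 1100]: two requests instead of seven
```

The trade-off is visible in the numbers: a longer interval means fewer requests hitting the shared GPU, at the cost of every completion starting that much later.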
Team Features and Enterprise Capabilities
This is where Tabby genuinely differentiates itself from tools like Continue.dev. The free self-hosted version includes features that most SaaS coding assistants charge for:
- SSO via GitHub, GitLab, and LDAP — developer accounts authenticate through your existing identity provider. No separate credential management.
- Usage analytics — per-user completion acceptance rates, active model usage, API call volume. Useful for proving ROI to skeptical engineering managers.
- Team management — admin dashboard for creating users, setting roles, managing API tokens.
- Codebase indexing — Tabby can index your Git repositories (including multi-branch as of v0.32.0) and GitLab Merge Requests to improve completion context. This is the equivalent of "codebase context" in Continue.dev, but served centrally.
- Answer Engine — a shareable Q&A layer on top of the indexed codebase. Think "ask a question about this repo" rather than "write me a feature."
The enterprise license adds custom branding and a few additional admin controls. For most self-hosted teams, the free tier covers everything needed.
Compare this to Continue.dev: Continue is excellent as a per-developer tool but has no central server model. Every developer manages their own model connection. For teams with a shared GPU, Tabby's centralized approach reduces operational overhead considerably.
Completion Quality: Honest Take
Tabby's completions are as good as the model you put behind it. That's a tautology, but it matters here because the "default" recommendation (StarCoder2-3B) produces noticeably weaker completions than GPT-4o or Claude Sonnet for complex logic.
For boilerplate, function signatures, import statements, and pattern completion within a known codebase, StarCoder2-3B is competitive. It handles Python, TypeScript, Go, and Rust well. For algorithmic complexity, multi-step reasoning, or unfamiliar APIs, the 3B model struggles in ways that GitHub Copilot doesn't.
Stepping up to DeepSeek-Coder-6.7B closes most of that gap in languages it was trained on. The HumanEval jump from 36% (StarCoder2-3B) to 52% (DeepSeek-Coder-6.7B) t