Kiên Bùi

Posted on Apr 16 • Originally published at github.com

Building magic-code: An Open-Source AI Coding Agent That Runs on Your Hardware

#opensource #rust #ai #selfhosted

How we built a TUI coding agent in Rust, tested it with 274 scenarios across 5 platforms, and made it work with a $600 GPU.

The Problem

AI coding assistants are powerful — but they come with trade-offs. Cloud-based tools send your code to external servers. Proprietary agents lock you into specific providers. And the costs add up fast.

We wanted something different: a coding agent that's fast, private, and runs on your own hardware. That's why we built magic-code.

What is magic-code?

magic-code is an open-source AI coding agent with a terminal UI (TUI), built in Rust. It works with any LLM provider, from Claude and GPT to self-hosted models like Qwen 3.5 on your own GPU.

$ magic-code "add error handling to the API endpoints"

The agent reads your code, plans changes, edits files, runs tests, and iterates — all from your terminal.

Key numbers

Metric	Value
Language	Rust
Binary size	9.1 MB (static musl)
Startup time	0ms
Built-in tools	30
Unit tests	274
Golden test scenarios	274
Supported providers	15+
Lines of code	18,691
License	MIT

Architecture: 6 Crates, Zero Coupling

mc-cli      → Binary, TUI runner, provider selection
mc-tui      → Terminal UI (ratatui), no dependencies on other mc-* crates
mc-core     → Runtime, ReAct loop, agents, memory, compaction
mc-provider → LLM providers (Anthropic, OpenAI, Gemini, generic)
mc-tools    → 30 tool implementations, permissions, sandbox
mc-config   → Configuration types and loader

The strict rule: mc-provider and mc-tools never depend on each other. Only mc-core orchestrates them. This keeps the codebase maintainable as it grows.
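
To make the dependency rule concrete, here is a minimal sketch of how a core runtime can own both sides while the provider and tool abstractions stay unaware of each other. The trait names, method signatures, and the `Runtime` type are hypothetical and chosen for illustration; the real mc-provider, mc-tools, and mc-core APIs may look quite different.

```rust
// Hypothetical traits for illustration only; not the actual magic-code API.

/// What a provider crate might expose: turn a prompt into a model reply.
pub trait Provider {
    fn complete(&self, prompt: &str) -> String;
}

/// What a tools crate might expose: a named action the agent can run.
pub trait Tool {
    fn name(&self) -> &str;
    fn run(&self, input: &str) -> String;
}

/// The core owns both sides. Neither trait mentions the other.
pub struct Runtime {
    provider: Box<dyn Provider>,
    tools: Vec<Box<dyn Tool>>,
}

impl Runtime {
    pub fn new(provider: Box<dyn Provider>, tools: Vec<Box<dyn Tool>>) -> Self {
        Runtime { provider, tools }
    }

    /// Simplification: the provider's reply is treated as a tool name,
    /// and the matching tool (if any) is run on the task text.
    pub fn step(&self, task: &str) -> Option<String> {
        let wanted = self.provider.complete(task);
        self.tools
            .iter()
            .find(|t| t.name() == wanted)
            .map(|t| t.run(task))
    }
}
```

Because only `Runtime` sees both traits, a new provider or a new tool can be added without touching the other crate.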

The Self-Hosted Challenge

Our primary goal was making magic-code work well with Qwen 3.5 9B — a model that runs on a single RTX 4070 Ti. This is a fundamentally different challenge than building for Claude or GPT-4.

What we learned

1. Small models need explicit instructions

With Claude, you can say "add a greet function" and it figures out the rest. With Qwen 9B, you need "read src/lib.rs then add a greet function using edit_file." We built a 4-tier prompt system that adapts instructions based on model capability:

Tier 1 (Frontier: Claude, GPT-4): Full autonomy, 30 tools
Tier 2 (Strong: Gemini, DeepSeek): Slightly more structured
Tier 3 (Local: Llama, Mistral): Minimal tools, simple English
Tier 4 (Qwen): Optimized for agentic tool calling, 10 tools
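
The tier list above can be sketched as a small lookup. This is an illustrative guess at the shape, not the project's code: the `Tier` enum, the substring matching in `tier_for`, and the fallback to the Local tier are all assumptions. Only the Tier 1 and Tier 4 tool counts are stated in this post, so the other tiers return `None` rather than an invented number.

```rust
/// Capability tiers as described in the post.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tier {
    Frontier, // Tier 1: Claude, GPT-4
    Strong,   // Tier 2: Gemini, DeepSeek
    Local,    // Tier 3: Llama, Mistral
    Qwen,     // Tier 4: Qwen
}

/// Hypothetical lookup by substring of the model name.
pub fn tier_for(model: &str) -> Tier {
    let m = model.to_lowercase();
    if m.contains("qwen") {
        Tier::Qwen
    } else if m.contains("claude") || m.contains("gpt-4") {
        Tier::Frontier
    } else if m.contains("gemini") || m.contains("deepseek") {
        Tier::Strong
    } else {
        // Assumption: unknown models fall back to the simplest general prompt.
        Tier::Local
    }
}

/// Number of tools exposed per tier, where the post gives a figure.
pub fn tool_limit(tier: Tier) -> Option<usize> {
    match tier {
        Tier::Frontier => Some(30),
        Tier::Qwen => Some(10),
        // The post does not give counts for these tiers.
        Tier::Strong | Tier::Local => None,
    }
}
```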

2. Thinking mode and tool calling don't mix (yet)

We discovered that Qwen 3.5 with vLLM's --reasoning-parser qwen3 puts tool calls inside thinking blocks — which the tool call parser can't extract. The fix: disable thinking when tools are present, re-enable for pure Q&A. This is actually recommended by the Qwen team.
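
A minimal sketch of that toggle, assuming the server accepts a `chat_template_kwargs` object with an `enable_thinking` field in the request body (this is how Qwen3 thinking is usually switched off per request on vLLM; whether magic-code sends exactly this field is an assumption). The function names are hypothetical.

```rust
/// Thinking is only enabled when the request carries no tools.
pub fn thinking_enabled(tool_count: usize) -> bool {
    tool_count == 0
}

/// Render the extra request field as a JSON fragment.
/// The field names are an assumption about the serving stack.
pub fn chat_template_kwargs(tool_count: usize) -> String {
    format!(
        "{{\"chat_template_kwargs\": {{\"enable_thinking\": {}}}}}",
        thinking_enabled(tool_count)
    )
}
```

With ten tools attached, `chat_template_kwargs(10)` yields `{"chat_template_kwargs": {"enable_thinking": false}}`; with none, the flag flips to `true`.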

3. Context window matters more than model size

In our testing, Qwen 3.5 9B with a 256K context window on vLLM outperformed larger models with smaller context windows on real coding tasks. We added Qwen to our model registry with proper context window settings and adaptive compaction thresholds.
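
One way to picture a registry entry and a threshold that adapts to the context window is sketched below. The `ModelInfo` type, the `QWEN_9B` constant, and the ratio-based formula are hypothetical; the post does not say how magic-code actually computes its thresholds.

```rust
/// Hypothetical registry entry; the real type may look different.
pub struct ModelInfo {
    pub name: &'static str,
    pub context_window: usize,
}

pub const QWEN_9B: ModelInfo = ModelInfo {
    name: "qwen3.5-9b",
    context_window: 262_144, // 256K, same value as --max-model-len in the vLLM setup
};

/// Token count at which compaction would start, as a fraction of the window.
/// The ratio is clamped so the threshold never exceeds the window itself.
pub fn compaction_threshold(model: &ModelInfo, ratio: f64) -> usize {
    let ratio = ratio.clamp(0.0, 1.0);
    (model.context_window as f64 * ratio) as usize
}
```

With an illustrative ratio of 0.75, `compaction_threshold(&QWEN_9B, 0.75)` comes out at 196,608 tokens.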

Testing: 274 Scenarios, 5 Platforms, Honest Results

We built a comprehensive golden test suite to evaluate magic-code across different languages and app types. Every scenario runs in a Docker sandbox with a fresh project, and results are verified by checking actual file contents — not just "did the model respond."

Test structure

tests/golden/
├── fixtures/          # 6 project templates (Rust, Python, React, Go, etc.)
├── scenarios/         # 274 scenarios across 22 categories
├── run.sh             # Parallel test runner (Docker sandbox)
├── run-platform.sh    # Platform-specific runner
├── verify.py          # Content verification (L1/L2 checks)
└── compare.py         # Cross-model comparison

Verification levels

We don't just check if the model responded. We verify:

L0: Did the model produce output? (tool calls + text)
L1: Does the expected file exist?
L2: Does the file contain the expected code patterns?
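
The three levels can be read as a ladder where each one implies the previous. The sketch below is an illustration in Rust with hypothetical names; the project's actual checker is `verify.py`, and its matching logic is not shown in this post. Plain substring matching is an assumption.

```rust
/// Verification levels, ordered from weakest to strongest.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Level {
    L0, // the model produced output
    L1, // the expected file exists
    L2, // the file contains every expected pattern
}

/// Highest level a scenario reached, or None if the model produced nothing.
/// `file` holds the expected file's contents, or None if it does not exist.
/// Note that an empty pattern list counts as L2 once the file exists.
pub fn verify(produced_output: bool, file: Option<&str>, patterns: &[&str]) -> Option<Level> {
    if !produced_output {
        return None;
    }
    match file {
        None => Some(Level::L0),
        Some(text) if patterns.iter().all(|p| text.contains(*p)) => Some(Level::L2),
        Some(_) => Some(Level::L1),
    }
}
```

A scenario that asks for a `greet` function would pass L1 as soon as the target file exists, but only pass L2 if that file actually contains something like `fn greet`.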

Results: Qwen 3.5 9B (self-hosted, RTX 4070 Ti)

Platform	L0 (responds)	L2 (verified correct)
Python Web API (FastAPI)	100%	69%
Python Desktop (Tkinter)	100%	82%
Go Web API	100%	68%
React Web App	100%	28%
React Native Mobile	100%	47%

Overall: 60% verified correct across 110 platform scenarios.

We're sharing these numbers honestly. A 9B model on a single GPU won't match Claude Sonnet, but it handles Python and Go tasks well, and once you own the hardware there are no per-token API fees.

Where Qwen 9B excels

✅ Single file edits (add function, fix bug)
✅ Python code (FastAPI, Tkinter)
✅ Go code (stdlib HTTP, tests)
✅ Bug fixes with clear descriptions
✅ Reading and understanding code

Where it struggles

❌ Creating new files from scratch (often runs bash instead of write_file)
❌ Complex TypeScript/JSX (React components)
❌ Multi-step refactoring
❌ Abstract patterns (ABC, generics, advanced types)

Comparison: Gemini 2.5 Pro via LiteLLM

Platform	Qwen 3.5 9B	Gemini 2.5 Pro
Python Web API	69%	96%
React Web App	28%	100%
Go Web API	68%	96%
Python Desktop	82%	95%
React Native	47%	90%

Gemini 2.5 Pro scores significantly higher, but it's a cloud model. The beauty of magic-code is that you can switch between models by changing a flag or two:

# Self-hosted (free)
magic-code --base-url http://localhost:4000 --model vllm/qwen3.5-9b "fix the bug"

# Cloud (when you need it)
magic-code --model gemini-2.5-pro "refactor the entire module"

What Makes magic-code Different

1. Provider agnostic

15+ providers out of the box. Anthropic, OpenAI, Gemini, Groq, DeepSeek, Mistral, Ollama, LiteLLM, vLLM — or any OpenAI-compatible endpoint.

2. Full agentic loop

Not just code completion. magic-code runs a ReAct loop: read code → plan → edit → run tests → iterate. It has 30 built-in tools including file operations, search, bash, browser, memory, and MCP support.
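
For readers new to ReAct, the loop boils down to: ask the model for the next action, run the tool it names, feed the observation back, and stop when the model says it is done or a step budget runs out. The sketch below uses closures in place of a real model and real tools; `Action`, `react_loop`, and the step budget are hypothetical names for illustration, not mc-core's API.

```rust
/// One decision from the model: call a tool or finish with an answer.
pub enum Action {
    CallTool { name: String, input: String },
    Finish(String),
}

/// Run think -> act -> observe until the model finishes
/// or `max_steps` is exhausted (in which case None is returned).
pub fn react_loop<M, T>(task: &str, max_steps: usize, mut model: M, mut run_tool: T) -> Option<String>
where
    M: FnMut(&[String]) -> Action,
    T: FnMut(&str, &str) -> String,
{
    // The history is what the model sees on every turn.
    let mut history = vec![format!("task: {}", task)];
    for _ in 0..max_steps {
        match model(&history) {
            Action::Finish(answer) => return Some(answer),
            Action::CallTool { name, input } => {
                let observation = run_tool(&name, &input);
                history.push(format!("{}({}) -> {}", name, input, observation));
            }
        }
    }
    None
}
```

The step budget matters most for small models, which can otherwise repeat the same tool call indefinitely.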

3. Context engineering

Smart compaction keeps conversations going without losing important context. Repo maps (via tree-sitter) give the model project awareness without reading every file. Memory persists facts across sessions.
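
As a rough picture of compaction, the sketch below keeps the first message (typically the system prompt) and the most recent messages, and collapses everything in between into a single placeholder line once the history goes over budget. This is a simplified stand-in, not magic-code's algorithm: the word-count token estimate and the placeholder summary are assumptions, and a real implementation would likely ask the model to write the summary.

```rust
/// Collapse the middle of the history into one summary line when over budget.
pub fn compact(messages: Vec<String>, token_budget: usize, keep_recent: usize) -> Vec<String> {
    // Crude token estimate: whitespace-separated words.
    let tokens: usize = messages.iter().map(|m| m.split_whitespace().count()).sum();
    // Nothing to do if we are under budget or there is no "middle" to drop.
    if tokens <= token_budget || messages.len() <= keep_recent + 1 {
        return messages;
    }
    let split = messages.len() - keep_recent;
    let dropped = split - 1; // everything between the first message and the recent tail
    let mut out = Vec::with_capacity(keep_recent + 2);
    out.push(messages[0].clone());
    out.push(format!("[summary of {} earlier messages]", dropped));
    out.extend_from_slice(&messages[split..]);
    out
}
```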

4. Security by default

Permission system for dangerous operations
Sandbox for bash execution
Prompt injection guards
Audit logging
8 CI security scanners (CodeQL, SonarCloud, cargo-audit, etc.)

5. Headless mode

Integrate magic-code into CI/CD pipelines:

# Auto-fix failing tests
magic-code --yes --json "fix the failing tests" -o result.json

# Batch processing
magic-code --yes --batch tasks.txt

# NDJSON streaming for web apps
magic-code --ndjson "explain auth.rs" | process_events.sh

Installation

# Quick install (binary)
curl -fsSL https://raw.githubusercontent.com/kienbui1995/mc-code/main/install.sh | sh

# Via cargo
cargo install magic-code

# From source
git clone https://github.com/kienbui1995/mc-code.git
cd mc-code/mc && cargo install --path crates/mc-cli

Self-Hosted Setup

Run Qwen 3.5 9B with vLLM:

vllm serve QuantTrio/Qwen3.5-9B-AWQ \
    --port 8300 \
    --gpu-memory-utilization 0.95 \
    --max-model-len 262144 \
    --quantization awq_marlin \
    --enable-prefix-caching \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --served-model-name qwen3.5-9b

Point magic-code at it:

magic-code --base-url http://localhost:8300 --model qwen3.5-9b "your task"

Or use LiteLLM as a proxy to switch between self-hosted and cloud models seamlessly.

What's Next

Improving Qwen 3.5 performance on file creation tasks
Testing with larger self-hosted models (Qwen 32B, Llama 70B)
HTTP API server for web app integration
Watch mode (file watcher, auto-respond)

Try It

magic-code is MIT licensed and available on GitHub and crates.io.

We built this because we believe AI coding tools should be open, fast, and runnable on your own hardware. The results aren't perfect — but they're honest, reproducible, and improving with every release.

cargo install magic-code

magic-code is built by kienbui1995. Star the repo if you find it useful. Contributions welcome.
