Amariah Kamau

Posted on Jul 3

I shipped two PRs into Alibaba's qwen-code using only open-weight models. Here's the honest version.

#ai #opensource #openweight #devtools

I build Atlarix, a desktop coding harness built for models from the open-weight labs (DeepSeek, Qwen, Kimi, MiniMax), not just compatible with them. The whole thesis is that much of the gap between an open-weight model and a frontier one closes when the harness does the heavy lifting.

So I decided to test that thesis in the least forgiving place I could think of: contributing real code to a frontier lab's own production repo, using nothing but open-weight models to write it.

Two pull requests are now merged into QwenLM/qwen-code — Alibaba's open-source coding agent. Here's the honest version, including the part that didn't go well and the constraint that shaped how I worked.

The first attempt got closed — correctly

My first swing was ambitious: a full always-on scheduled-task daemon. Tasks that fire on a cron schedule without an interactive session open, system service installation, webhook triggers, the works. Around 5,000 lines across four rollout phases.

A maintainer closed it. And they were right to.

The problem wasn't that the code didn't work — it was that I'd built a second, parallel daemon with its own storage format and lifecycle, when the repo already had a long-running daemon (qwen serve) and a durable scheduler I should have extended instead. On top of that, four phases in one PR is simply too much to review well.

The maintainer's feedback was blunt and generous at the same time: reuse the existing infrastructure, make the change incremental, split it into reviewable pieces. I'd written thousands of lines of duplication where roughly 300 lines of extension would have done.

That stung. But it was the most useful thing that happened in the whole process, because it taught me the lesson every open-source contributor eventually learns: read the existing architecture before you build, and keep your PRs small enough to review.

So I did the right-sized thing instead

Instead of fighting to rebuild the giant feature immediately, I went looking for a small, well-scoped problem — the kind of change that's easy to review, easy to revert, and unlikely to break anything.

I found one in the web-shell's model picker.

PR #6209 — vision model support in the web-shell UI. The CLI could select a vision model; the web-shell daemon UI couldn't. I added it, following the exact pattern the codebase already used for other model modes. Small, mechanical, convention-matching. It merged.

PR #6236 — a real data-loss fix. This one mattered more than its size suggests. When a user selected a vision model from the web-shell picker, they'd see a success toast — but their choice was silently discarded. The picker stored the model ID in one format (modelId(authType), ACP-style) while the core resolver expected another (authType:modelId). The mismatch meant the stored value never resolved, and the system quietly fell back to auto-select. The settings page still showed the value, which masked the failure completely. The user's explicit choice had no effect, and nothing told them.
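To make the mismatch concrete, here is a minimal sketch of the re-encoding idea. It is not the code from the PR: the function name, the regular expression, and the example model and auth values are all assumptions for illustration. Only the two string shapes, modelId(authType) and authType:modelId, come from the description above.

```typescript
// Hypothetical helper, not the actual qwen-code implementation.
// Matches the picker's ACP-style shape: modelId(authType)
const ACP_STYLE = /^(.+)\(([^()]+)\)$/;

// Convert "modelId(authType)" into the "authType:modelId" shape the
// resolver is described as expecting. Values that do not look
// ACP-style are returned unchanged.
function toResolverFormat(stored: string): string {
  const match = ACP_STYLE.exec(stored);
  if (match === null) {
    return stored;
  }
  const [, modelId, authType] = match;
  return `${authType}:${modelId}`;
}

// Illustrative values only; the real IDs depend on the user's configuration.
console.log(toResolverFormat("qwen-vl-max(openai)"));
```

The point of converting at write time is that the stored value and the resolver agree on one format, so a failed lookup can no longer hide behind a silent fallback to auto-select.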

The fix re-encodes the model ID into the resolver's format before persisting it, replaces some fragile ternary chains with type-safe dispatch, and adds the missing English and Chinese i18n keys. It merged too.
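For readers who haven't seen the pattern, this is roughly what "type-safe dispatch instead of ternary chains" can look like in TypeScript. The mode names, settings keys, and labels below are hypothetical, not taken from the qwen-code source; the sketch only shows why the compiler catches a missing case that a ternary chain would let through.

```typescript
// Hypothetical mode names and settings keys, for illustration only.
type ModelMode = "main" | "fast" | "vision";

// A ternary chain like the one below routes any newly added mode to
// the final branch without complaint:
//   mode === "main" ? "model.main" : mode === "fast" ? "model.fast" : "model.vision"

// A Record keyed by the union must list every mode, or it fails to compile.
const SETTING_KEY: Record<ModelMode, string> = {
  main: "model.main",
  fast: "model.fast",
  vision: "model.vision",
};

function settingKeyFor(mode: ModelMode): string {
  return SETTING_KEY[mode];
}

// An exhaustive switch gives the same guarantee for branching logic.
function labelFor(mode: ModelMode): string {
  switch (mode) {
    case "main":
      return "Main model";
    case "fast":
      return "Fast model";
    case "vision":
      return "Vision model";
    default: {
      // If a new mode is added to ModelMode, this assignment stops compiling.
      const unreachable: never = mode;
      throw new Error(`Unhandled model mode: ${String(unreachable)}`);
    }
  }
}

console.log(settingKeyFor("vision"), labelFor("vision"));
```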

The honest part: the models didn't write perfect code

This is the bit I actually care about, because it's where the hype usually lies.

Both PRs went through several rounds of genuine maintainer review. And the maintainers — Alibaba collaborators, plus an automated reviewer running on Qwen's own models — found real bugs. Not style nitpicks. Actual security and lifecycle issues: an HMAC check computed over the wrong input, a timing-attack-vulnerable token comparison, a timer that could fire on a stopped process, an auth config silently stripped from a webhook path.
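Two of those bug classes are easy to show in a few lines. The sketch below uses Node's built-in crypto module; the function names, the hex encoding, and the choice of HMAC-SHA256 are assumptions, and it is not the code that was reviewed. The post doesn't say which input the HMAC was wrongly computed over, so the example assumes one common variant: signing something other than the raw request body.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical webhook check. Assumption: the signature is the hex
// HMAC-SHA256 of the raw request body, exactly as received.
function signBody(secret: string, rawBody: string): string {
  return createHmac("sha256", secret).update(rawBody, "utf8").digest("hex");
}

function verifySignature(
  secret: string,
  rawBody: string,
  receivedHex: string,
): boolean {
  const expected = Buffer.from(signBody(secret, rawBody), "hex");
  const received = Buffer.from(receivedHex, "hex");
  // timingSafeEqual throws if the lengths differ, so check first.
  if (expected.length !== received.length) {
    return false;
  }
  // Constant-time comparison. A plain === on the strings can leak how
  // many leading characters matched through response timing.
  return timingSafeEqual(expected, received);
}

const secret = "example-secret"; // illustrative value
const body = '{"task":"nightly"}'; // illustrative payload
console.log(verifySignature(secret, body, signBody(secret, body)));
```

Hashing the raw body rather than a parsed and re-serialized copy matters because re-serialization can change whitespace or key order, which changes the digest.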

I fixed and re-verified each one before merge. On #6236, a maintainer even built the PR locally with real browser tests and screenshots to confirm the fix worked end-to-end before approving.

That review loop is the entire point. The claim isn't "open-weight models wrote flawless code." They didn't. The claim is that open-weight models, driven by a good harness, could take architectural feedback and iterate to something a senior maintainer at a frontier lab was willing to merge. That's a much more interesting and much more honest result than "the AI one-shotted it."

The constraint nobody tells you about: cost

Here's a detail I think is worth being transparent about, because it's the reality of building solo.

PR #6209 was built almost entirely on Qwen (3.6 Plus, 3.7 Plus/Max) via OpenRouter. But partway through PR #6236, I started running low on OpenRouter credits. As a solo founder trying to conserve runway for actual users, I switched to using DeepSeek API credits instead. So #6236 ended up roughly a 50/50 mix of Qwen and DeepSeek.

I could have hidden that and claimed "100% Qwen" across the board. But two things: first, it wouldn't be true, and the whole value of a post like this is that it's honest. Second — it actually makes the result broader, not weaker. The thesis was never "Qwen specifically." It was "open-weight models, in a harness built for them, can do real work." Making that work across two different open-weight labs is stronger evidence than making it work with one.

No frontier model wrote either PR. Just open-weight models — Qwen and DeepSeek — running in Atlarix.

Why this matters (to me, at least)

I'm a self-taught developer building in Nairobi. The models I can afford to run at scale are open-weight ones. Atlarix exists because I needed a way to make those models genuinely productive — not "good enough for a demo," but good enough to ship code into a repo maintained by the people who train the models.

Two merged PRs in a Tier-1 lab's production repo, written by open-weight models in the harness, reviewed and merged by the lab's own maintainers, is the clearest proof of that thesis I've been able to produce.

The gap between open-weight and frontier is real. But a lot of it lives in the harness, not the weights. Close that gap, and a model you can run yourself can punch well above its weight.

Atlarix is at atlarix.dev. If you're building with open-weight models or thinking about model-agnostic tooling, I'd genuinely like to compare notes.

Top comments (3)

Dipankar Sarkar • Jul 5

The honest part I appreciate: the first PR failed by building a parallel daemon instead of extending qwen serve. Worth naming that this is a harness failure, not a model failure, and it cuts straight to your own thesis.

A frontier model would have made the same mistake without the repo's architecture in context, because 'reuse the existing scheduler' is not in the weights, it is in the codebase. The gap your harness closes is not raw reasoning, it is grounding: feeding the model the existing infra so it extends instead of duplicates.

Which suggests the real benchmark for a harness is not 'did the code run' but 'did it find and reuse what already existed.' Much harder to measure, and the exact thing maintainers grade on.

Amariah Kamau • Jul 5

Yes, thanks for the in-depth analysis.
Atlarix helps every model by combining blueprint + grep, which lets models always get an accurate answer about the codebase, along with some extra features like background tasks, sub-agents, and direct task focus in the prompt.
Atlarix continues to surprise us every single day.

Edu Peralta • Jul 6

I run multiple agents in parallel and review every diff before it ships, so the honesty here stands out. The part I relate to most is treating the PR review as the real skill, not the prompting. Curious whether you found qwen-code's diffs needed more rework than what you get from a frontier model, or if the harness and the review discipline mattered more than which model wrote the patch. Shipping into someone else's open source repo with less name recognition behind the model is a genuinely harder trust problem than most agent posts admit.