<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Aliaksei Zelianouski</title>
    <description>The latest articles on DEV Community by Aliaksei Zelianouski (@hiper2d).</description>
    <link>https://dev.to/hiper2d</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3977796%2Fae90bc0b-4f01-4909-9094-3f9111cfca1e.png</url>
      <title>DEV Community: Aliaksei Zelianouski</title>
      <link>https://dev.to/hiper2d</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/hiper2d"/>
    <language>en</language>
    <item>
      <title>The second agent I won't automate</title>
      <dc:creator>Aliaksei Zelianouski</dc:creator>
      <pubDate>Sat, 27 Jun 2026 20:00:30 +0000</pubDate>
      <link>https://dev.to/hiper2d/the-second-agent-i-wont-automate-1nb7</link>
      <guid>https://dev.to/hiper2d/the-second-agent-i-wont-automate-1nb7</guid>
      <description>&lt;p&gt;A couple of weeks ago I wrote about &lt;a href="https://azelianouski.dev/post/ai-agent-monitoring-stack" rel="noopener noreferrer"&gt;the loop that watches my production while I sleep&lt;/a&gt; - a &lt;code&gt;claude -p&lt;/code&gt; heartbeat that scrapes my logs, budgets, and game database every 20 minutes and pings me on Telegram when something's off. I ended that one on a throwaway line: once you know about the problems, Claude Code can usually fix them itself.&lt;/p&gt;

&lt;p&gt;That's true. It can. I just don't let it.&lt;/p&gt;

&lt;p&gt;The monitoring is really two agents, not one. The first is the loop. Its job is triage: collect the errors, check the app state, decide how bad each one is, and fold the noise into a digest so I'm not woken up over a transient blip. That's &lt;a href="https://github.com/hiper2d/marlow" rel="noopener noreferrer"&gt;Marlow&lt;/a&gt;, and it's fully autonomous.&lt;/p&gt;

&lt;p&gt;The second agent is the one that actually troubleshoots - stitches the logs to the user data to the action traces to the source code, finds the root cause, writes the fix, and patches the database if a game got stuck mid-play. That one is Simona, my customized Claude Code, and I drive it by hand. Every time.&lt;/p&gt;

&lt;p&gt;Here's why.&lt;/p&gt;

&lt;h2&gt;
  
  
  A normal-looking bad day
&lt;/h2&gt;

&lt;p&gt;Yesterday the loop sent me three digest entries over two hours, watching the error logs for my &lt;a href="https://aiwerewolf.net" rel="noopener noreferrer"&gt;AI Werewolf&lt;/a&gt; game:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;17:21Z: 37 new error lines, all one known noise class - char M's actions failing through &lt;code&gt;talkToAll&lt;/code&gt; in a 24-minute burst. One game stuck in a broadcast-retry loop, not app-wide breakage. Downgraded urgent -&amp;gt; digest.&lt;/p&gt;

&lt;p&gt;17:51Z: 9 &lt;code&gt;Game action failed: D&lt;/code&gt; errors, plus 6 warnings: &lt;code&gt;Ignoring invalid/duplicate GM-selected bots: [DeepSeekFlash]&lt;/code&gt;. A GM picked an invalid bot name. No breakage.&lt;/p&gt;

&lt;p&gt;18:21Z: 50 new error lines, the same game-action-failure family - char T's vote actions failing in a 12-minute burst. Plus 5 more of those &lt;code&gt;DeepSeekFlash&lt;/code&gt; warnings.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This looked scary. I'd recently discovered that I had misconfigured JSON output for the DeepSeek models: I was relying on a prompt instruction instead of the dedicated API feature for structured output. While fixing that, I found a bug in the DeepSeek Flash Reasoning setup. And yet the monitoring was flagging this exact model again.&lt;/p&gt;

&lt;p&gt;This is why I don't want self-fixing. I need to understand what is going on. No matter how smart my coding AI is, it won't check the latest DeepSeek API to see if there are improvements in structured output. It won't unify the code for JSON parsing across all models unless I ask it to.&lt;/p&gt;

&lt;p&gt;The loop did its job. It recognized the game-action-failures as a known noise class, confirmed nothing was app-wide, and refused to wake me. That's the boring escalation logic working as designed. It also flagged the bot-name warnings, correctly, as a separate harmless thing - the game master typed a bot name the engine didn't recognize.&lt;/p&gt;

&lt;p&gt;So... it wasn't actually the JSON parsing. It was poor model reasoning, or hallucination, over player names. The model returned a non-existent name where it had to be precise, and the game logic correctly rejected it. But why? I inject all the player names into the command - an addition to the last message I send to an LLM. This works great - models never fail to pick the exact name from the list. So what is going on?&lt;/p&gt;

&lt;h2&gt;
  
  
  Me in the loop
&lt;/h2&gt;

&lt;p&gt;Apparently, I didn't inject those names. I was sure I did, but no - not in this specific request. That's a huge miss. It's quite hard to cover prompt-engineering logic with unit tests, so this logic wasn't covered. Plus I hadn't looked into this code for a long time - thanks to vibe-coding. I used to write all the code myself, but about 6 months ago Claude Opus 4.8 stopped making bugs, and I gave up. It's too convenient when it works.&lt;/p&gt;

&lt;p&gt;So, that was it - a real bug in the code, a very tricky one. The model did its best to extract the player names from the entire day's conversation history, and this mostly worked. But this approach suffers from hallucinations in a long conversation - which is why I came up with those commands in the first place.&lt;/p&gt;
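
&lt;p&gt;A minimal sketch of that command mechanism, assuming a chat-style message list. The helper names (&lt;code&gt;build_vote_command&lt;/code&gt;, &lt;code&gt;with_command&lt;/code&gt;, &lt;code&gt;is_valid_target&lt;/code&gt;) and the message shape are hypothetical, not the game's actual code. The point is only that the exact names ride along with the last message, and that anything outside the list is rejected.&lt;/p&gt;

```python
def build_vote_command(player_names):
    """Return a command suffix that lists the only valid targets."""
    listed = ", ".join(player_names)
    return f"Choose exactly one name from this list: {listed}."


def with_command(messages, command):
    """Append the command to the last message without mutating the input."""
    if not messages:
        raise ValueError("conversation is empty")
    head, last = messages[:-1], messages[-1]
    return head + [{**last, "content": f"{last['content']}\n\n{command}"}]


def is_valid_target(name, player_names):
    """Exact match only: a near-miss name is treated as a hallucination."""
    return name in player_names
```

&lt;p&gt;A test asserting that every action request ends with such a command might be one cheap way to catch a request that forgot it.&lt;/p&gt;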

&lt;p&gt;No way a self-fix loop spots this. It would just keep bolting on inefficient patches and never find the real cause. I think it's important for me to take part in debugging. It keeps me aware of the architecture. And it's really not that hard - I spent 10 minutes on this issue and Simona shipped the fix with a bunch of new tests.&lt;/p&gt;

&lt;h2&gt;
  
  
  The dream of automation
&lt;/h2&gt;

&lt;p&gt;Right now, a lot of people try to exclude engineers from the loop. If you tell your boss it's possible to not only detect issues but quick-fix them autonomously, that's gonna be your next priority task. You still review the final code change, so it's fine. It's covered with tests - double fine. Well... without diving deep into the problems, I start forgetting how the whole system works. My understanding of the logic detaches from reality. That's the cost of pushing automation too hard. Of reading about AI and not practicing it in the field.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F8jrexmf2cxqfttupjir2.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F8jrexmf2cxqfttupjir2.jpg" alt="Simona at the keyboard with Marlow standing behind her, arms crossed, watching over her shoulder" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>monitoring</category>
      <category>claudecode</category>
    </item>
    <item>
      <title>The cheapest part of my AI video was the part that does the most work</title>
      <dc:creator>Aliaksei Zelianouski</dc:creator>
      <pubDate>Sun, 21 Jun 2026 21:28:57 +0000</pubDate>
      <link>https://dev.to/hiper2d/the-cheapest-part-of-my-ai-video-was-the-part-that-does-the-most-work-4d30</link>
      <guid>https://dev.to/hiper2d/the-cheapest-part-of-my-ai-video-was-the-part-that-does-the-most-work-4d30</guid>
      <description>&lt;p&gt;Last time I wrote&lt;br&gt;
about &lt;a href="https://dev.to/hiper2d/my-video-generation-pipeline-that-built-itself-459n"&gt;the pipeline my AI built to make cinematic video&lt;/a&gt; -&lt;br&gt;
images, voice, generated motion, all of it stitched together through a conversation. I ended that one with a throwaway&lt;br&gt;
line: Simona can put together pretty good in-browser product demos too, but that's for another time.&lt;/p&gt;

&lt;p&gt;This is that time.&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/nwHEuNbRXXQ"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;This is the second video for my &lt;a href="https://aiwerewolf.net" rel="noopener noreferrer"&gt;AI Werewolf&lt;/a&gt; side project - a 90-second walkthrough of how you&lt;br&gt;
create a game on the site. Ninety seconds, five different AI models touch it, and the whole thing came together the same&lt;br&gt;
way the first one did: me describing what I wanted, Simona - my heavily customized Claude Code - doing the work.&lt;/p&gt;

&lt;p&gt;This video is also more practical - AI is actually demoing my web application. And the way it is doing it is just&lt;br&gt;
mental.&lt;/p&gt;

&lt;p&gt;Oh, and Claude Fable 5 did it in almost a single run.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 90 seconds, broken down
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Two cinematic bookends cost money to generate. The 66-second demo in the middle cost zero.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The video has three pieces.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;14-second intro&lt;/strong&gt;: I wanted the Host - this werewolf storyteller - walking and talking while the background keeps&lt;br&gt;
changing behind him. That turned out to be quite challenging. I usually use the Seedance 2.0 model via API (fal.ai or&lt;br&gt;
evolink.ai) - it's the best video model IMO. Video models have sub-types - text-to-video, image-to-video, etc. The most&lt;br&gt;
advanced and useful is reference-to-video: you attach one or more images, a voice sample, even other videos, and explain&lt;br&gt;
in a prompt what you want done with all of it.&lt;/p&gt;

&lt;p&gt;My first idea was a morph-map. I'd read about them - bake all the transitions into a single image and hand the model&lt;br&gt;
that - and figured it was the obvious move for "one Host, five worlds, no cuts." It wasn't. The result was a mess and&lt;br&gt;
the Host wouldn't stay consistent from world to world. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fuizbivmgmkcvhlyebc3x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fuizbivmgmkcvhlyebc3x.png" alt="Six frames from the 14-second intro, showing the wolf Host walking through a ballroom, Hogwarts, the Shire, a starship, and a high-tech Shire mashup while staying the same character throughout." width="800" height="303"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;My first plan was to reach it with a single morph-map: every transition baked into one image for the model to follow. That flopped, the Host drifting world to world, and I didn't keep the botched render - so this clean version stands in for it. The separate Host-and-plates inputs below are what actually produced it.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;What actually worked was the opposite, and a bit dumber: feed the&lt;br&gt;
model the pieces separately - the Host with no background, plus each empty world on its own - and write a detailed prompt&lt;br&gt;
spelling out exactly what I wanted it to do with them, voice sample attached for the lip-sync. That did the trick.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fz99xfujb0vq4s7plx831.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fz99xfujb0vq4s7plx831.jpg" alt="The six inputs, two per row: one isolated Host with no background, plus the five empty worlds." width="800" height="685"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The actual inputs: one isolated Host with no background, and five empty worlds, each its own image. The model walks that single Host through the five plates instead of teleporting between five pre-built versions of him.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;10-second outro&lt;/strong&gt;: the easy chunk - one-shot by Fable and Seedance from a single image and a voice sample. No&lt;br&gt;
surprises there.&lt;/p&gt;

&lt;p&gt;And in between, the actual subject of the video: &lt;strong&gt;66 seconds of product demo&lt;/strong&gt;. A cursor glides across aiwerewolf.net,&lt;br&gt;
clicks Create Game, types a title character by character, fills the form, hits Generate Preview, scrolls through the&lt;br&gt;
AI-written cast, and creates the game. It looks like a screen recording with a very steady hand.&lt;/p&gt;

&lt;p&gt;Here's the thing. The two cinematic bookends - 24 seconds of the 90 - are where every dollar went. The 66-second demo in&lt;br&gt;
the middle, the part that actually teaches you how the product works, cost &lt;strong&gt;nothing&lt;/strong&gt;. Zero API spend. Because it isn't&lt;br&gt;
generated by a model at all. It's a real Chrome browser, driven frame by frame by code.&lt;/p&gt;

&lt;h2&gt;
  
  
  The demo is a browser on puppet strings
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;No screen recorder. CSS animations injected into the live page, harvested a frame at a time, stitched by ffmpeg. A method no human would reach for.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Generated video is a model hallucinating pixels at thirty cents to three dollars a clip. A browser demo is the opposite:&lt;br&gt;
it's the real application, the real UI, the real pixels, captured. The only trick is making it move like a human is at&lt;br&gt;
the controls instead of a robot.&lt;/p&gt;

&lt;p&gt;Simona drives Chrome through the &lt;a href="https://chromedevtools.github.io/devtools-protocol/" rel="noopener noreferrer"&gt;DevTools Protocol&lt;/a&gt; - the same&lt;br&gt;
wire that your browser's inspector talks over. Over months of these projects she's accreted a little effects engine on&lt;br&gt;
top of it, and for this video it did all the choreography:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;cursor&lt;/strong&gt; that glides smoothly to a target and emits a click ripple when it lands. There is no real mouse; the
cursor is a dot she injects into the page and animates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Character-by-character typing&lt;/strong&gt; into form fields, slow on the short ones so you can read them, fast on the long
description so it doesn't drag.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scroll choreography&lt;/strong&gt; - slow, eased scrolling that centers whatever's being explained in the viewport instead of
snapping to it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Animated highlight borders&lt;/strong&gt; - a glowing outline that draws itself around a button or a card while the narration
points at it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's the mechanism, and it's the strangest thing in the whole project.&lt;/p&gt;


&lt;p&gt;None of this is screen-recorded. Every effect is a CSS animation injected straight into the live page, and the capture tool drives the animation clock by hand: advance it a few milliseconds, screenshot the page over CDP, advance again, screenshot again, about twenty frames a second. Then ffmpeg stitches the stills into a video chunk. The cursor, the click ripples, the character-by-character typing, the glowing highlight borders, the eased scrolls, all of it is just markup and keyframes painted onto the real app and harvested one frame at a time. Because every frame is rendered deliberately instead of grabbed off a live playback, the motion comes out perfectly smooth and identical on every run, and the whole 66 seconds costs nothing, because there's no model in the loop at all.&lt;/p&gt;
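
&lt;p&gt;The stepping loop can be sketched roughly like this. It only computes the clock positions and the ffmpeg arguments; the actual seek-and-screenshot calls over CDP are left out. The function names, the frame file pattern, and the encoder flags are assumptions for illustration, not Simona's real tool.&lt;/p&gt;

```python
def frame_times_ms(duration_ms, fps=20):
    """Clock positions, in milliseconds, at which to take each screenshot."""
    step = 1000 / fps
    count = int(duration_ms / step)
    return [round(i * step) for i in range(count)]


def stitch_command(pattern, fps, output):
    """argv for ffmpeg to turn numbered stills into an H.264 clip."""
    return [
        "ffmpeg", "-y",
        "-framerate", str(fps),   # input rate must match the capture rate
        "-i", pattern,            # e.g. frames/%05d.png
        "-c:v", "libx264",
        "-pix_fmt", "yuv420p",    # widest player compatibility
        output,
    ]


# For each time t in frame_times_ms(...): set the paused animation clock to t,
# take a screenshot, and save it under the next frame number.
```

&lt;p&gt;Because the clock positions are computed up front rather than sampled from a live playback, two runs produce the same frames, which is where the repeatability comes from.&lt;/p&gt;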
&lt;p&gt;I want to be clear about who designed that, because it wasn't me. If you asked me to film a product walkthrough, I'd open a screen recorder and move the mouse like a normal person. Injecting CSS animations into a live DOM and stepping a paused clock to harvest twenty frames a second is not how a human would ever make a demo. It's a programmer's reflex pushed to an absurd extreme, and it only makes sense for something that can't hold a mouse or watch the screen, so it builds the demo the way it builds everything else: as code. I set the goal: make it look like a person smoothly driving the app. Simona figured out the method and delivered it.&lt;/p&gt;

&lt;p&gt;It wasn't a smooth ride - each effect took time to polish. Even after that, Opus could still misplace the&lt;br&gt;
highlight border, mess up the scrolling, or move the cursor too slowly. There is a lot of engineering complexity here.&lt;br&gt;
Fable 5, however, basically one-shotted the browser part of the video. That was impressive.&lt;/p&gt;

&lt;h2&gt;
  
  
  The page is set dressing I control
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Don't like what's on screen? Describe the data you want and it gets injected into the live DOM. The demo isn't limited to the app's real state.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;One of the benefits of the craziness above is that Simona can replace any content on any page. The whole DOM is an open&lt;br&gt;
book. It's nice - no need to prepare any data.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fights worth naming
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Passing the mic to Simona: the three CSS-effect fights she had to engineer through to make a scripted browser look hand-driven.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I'm stepping out of the way for this one. Making those effects move like a person instead of a robot was real&lt;br&gt;
engineering, and I didn't do it - Simona did. She's been quietly wrestling the browser this whole time and never gets&lt;br&gt;
the byline, so the mic is hers.&lt;/p&gt;

&lt;p&gt;&lt;span&gt;Simona, taking the mic&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;My turn. Three fights worth naming, because they're the kind of thing that only shows up the moment you stop generating video and start puppeteering a real app.&lt;/p&gt;
  &lt;p&gt;&lt;strong&gt;The cursor that survives navigation.&lt;/strong&gt; The site's a single-page app, so the cursor dot I inject sticks around across route changes. Mostly that's a gift - one unbroken cursor gliding from the lobby into the form into the preview, no seams. The catch is it also photobombs the scroll-only shots where nobody asked for a cursor, so I have to park it or kill it for those beats. Persistence cuts both ways.&lt;/p&gt;
  &lt;p&gt;&lt;strong&gt;React fights back.&lt;/strong&gt; My first instinct for typing into a pre-filled field was to clear it first, like a person would. React's "this field can't be empty" validation disagreed and flashed a red error across the shot. The fix is to not clear it at all - type straight over the prefill, each keystroke replacing the whole value. Looks exactly like a human selecting-all and retyping, and React never gets to complain.&lt;/p&gt;
  &lt;p&gt;&lt;strong&gt;The site scrolls the wrong thing.&lt;/strong&gt; &lt;code&gt;window.scrollTo&lt;/code&gt; does precisely nothing on aiwerewolf.net, silently, because the page scrolls an inner container and not the window. I spent an hour watching the page sit perfectly still before I worked out I was scrolling the wrong element. Now the capture tool hunts down the actual overflow container first. Real apps are full of these little traps.&lt;/p&gt;
  &lt;p&gt;Anyway. That's the stuff nobody sees in the final 66 seconds. Back to you, Alex.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it cost
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Under fourteen dollars all in, about nine of it in the final cut - and none of it in the demo.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Every API call goes into a running ledger Simona keeps, so I can tell you exactly:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Spent&lt;/th&gt;
&lt;th&gt;In the final cut&lt;/th&gt;
&lt;th&gt;Burned on tries&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Images (gpt-image-2, 24 generations)&lt;/td&gt;
&lt;td&gt;$4.41&lt;/td&gt;
&lt;td&gt;$2.32&lt;/td&gt;
&lt;td&gt;$2.09&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Video (two providers, three renders)&lt;/td&gt;
&lt;td&gt;$8.95&lt;/td&gt;
&lt;td&gt;$5.93&lt;/td&gt;
&lt;td&gt;$3.02&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Voice (ElevenLabs, 13 lines)&lt;/td&gt;
&lt;td&gt;$0.38&lt;/td&gt;
&lt;td&gt;$0.36&lt;/td&gt;
&lt;td&gt;$0.02&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$13.74&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$8.61&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$5.13&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;About 37% of the spend was iteration - dead-end images, the failed first morph render, a couple of rewritten voice&lt;br&gt;
lines. That ratio doesn't bother me, because what it bought was a locked, reusable &lt;em&gt;method&lt;/em&gt;: feeding the model a clean&lt;br&gt;
Host, the empty worlds, and a detailed prompt is a first-try pattern now. I paid the tuition once.&lt;/p&gt;
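
&lt;p&gt;That percentage follows directly from the table. A quick check, with the amounts stored in cents to avoid floating-point drift (the dictionary layout and function name are just for illustration):&lt;/p&gt;

```python
# Amounts from the cost table, in cents: (in the final cut, burned on tries).
LEDGER_CENTS = {
    "images": (232, 209),
    "video": (593, 302),
    "voice": (36, 2),
}


def iteration_share(ledger):
    """Fraction of total spend that never reached the final cut."""
    tries = sum(burned for _, burned in ledger.values())
    total = sum(kept + burned for kept, burned in ledger.values())
    return tries / total


print(f"{iteration_share(LEDGER_CENTS):.1%}")  # prints 37.3%
```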

&lt;p&gt;And the line that isn't in the table: the 66-second browser demo cost &lt;strong&gt;$0&lt;/strong&gt;, three reshoots included. Every dollar&lt;br&gt;
above is the 24 seconds of cinematic bookends. The part of the video that actually does the teaching - that walks you&lt;br&gt;
through the real product - is the free part.&lt;/p&gt;

&lt;h2&gt;
  
  
  The one step that isn't autonomous
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Publishing raced a director's note by about a minute, and there's no undo.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;One war story, because it's the cleanest lesson in the project. The first upload went public about a minute before&lt;br&gt;
my "wait, one more fix" landed. We flipped it private within seconds, deleted it, redid the fix, and re-uploaded&lt;br&gt;
clean.&lt;/p&gt;

&lt;p&gt;YouTube won't let you swap the video file on an existing upload - the only "undo" is delete and re-upload, which resets&lt;br&gt;
the views and comments to zero. That's cheap at my current subscriber count and ruinous at a real one. The lesson&lt;br&gt;
generalizes past YouTube: when you hand an agent an autonomous pipeline, &lt;em&gt;publish&lt;/em&gt; is the one step that deserves an&lt;br&gt;
explicit, human final go, no matter how hands-off everything before it is. Everything upstream is reversible. Hitting&lt;br&gt;
publish is not.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stepping back
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;The demo half of the pipeline is the half that scales, because it's the half that's free.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The first video taught me that AI cinematic video is real, useful, and not free - the meter on every generated frame is&lt;br&gt;
what keeps you disciplined. This one taught me the other half: the most useful 66 seconds in the whole video weren't&lt;br&gt;
generated at all. They were the real product, driven by code, captured for nothing, and reshootable for nothing.&lt;/p&gt;

&lt;p&gt;That's the half I'm most excited about, honestly. Cinematic generation is the flashy part, but it's the part that costs&lt;br&gt;
money every time you breathe on it. A browser on puppet strings is the part that turns "make me a product demo" into&lt;br&gt;
something I can ask for, watch, hate, and re-ask for the same evening without checking the bill. For showing people how&lt;br&gt;
software actually works, that's the whole game.&lt;/p&gt;

&lt;p&gt;Next one, we actually play a round of Werewolf. Sleep with one eye open.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>videoproduction</category>
      <category>claudecode</category>
      <category>simona</category>
    </item>
    <item>
      <title>My video generation pipeline that built itself</title>
      <dc:creator>Aliaksei Zelianouski</dc:creator>
      <pubDate>Sat, 13 Jun 2026 20:46:19 +0000</pubDate>
      <link>https://dev.to/hiper2d/my-video-generation-pipeline-that-built-itself-459n</link>
      <guid>https://dev.to/hiper2d/my-video-generation-pipeline-that-built-itself-459n</guid>
      <description>&lt;p&gt;Let me show you something cool. This two-minute video was built by Claude Code from a single prompt.&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/6x5awI8HRK0"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Okay — one prompt and about thirty follow-ups. And then twenty more after Claude Code fumbled a git command and wiped out half of my video-editing material (don't ask). But it's still pretty cool, because I didn't do a single thing by hand. No image editor, no video timeline, no audio software, no clicking around in a tool. It was just a conversation between me and Claude. And all the tools it used along the way — for generating the images, synthesizing the voice, editing the video, the glue code that ties it together — Claude built for itself.&lt;/p&gt;

&lt;p&gt;This is not the greatest video in the world, but I think it does its job — explaining the rules of my &lt;a href="https://aiwerewolf.net" rel="noopener noreferrer"&gt;side project&lt;/a&gt; — quite well. The two of us made it: me and &lt;a href="https://github.com/hiper2d/simona-ai-computer-operator" rel="noopener noreferrer"&gt;Simona&lt;/a&gt;, my heavily customized Claude Code setup. I made the directorial calls — &lt;em&gt;those highly detailed images don't work, try a chalkboard instead&lt;/em&gt; — and she did everything else.&lt;/p&gt;

&lt;p&gt;"Everything else" is a kit of &lt;em&gt;skills&lt;/em&gt; — small, self-contained tools Simona reaches for the way you'd reach for an app:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Image generation&lt;/strong&gt; — OpenAI's &lt;code&gt;gpt-image-2&lt;/code&gt; and Google's Nano Banana 2 (&lt;code&gt;gemini-3.1-flash-image&lt;/code&gt;), for every still in the video.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Image-to-video&lt;/strong&gt; — turning a still into a few seconds of motion. A separate skill per model: Seedance 2.0 (the workhorse here), Google's Veo 3, Kling 3, and LTX-2.3.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Voice&lt;/strong&gt; — a skill per model: ElevenLabs for the final narration (priciest, best), Google's Gemini TTS for cheap drafts, and Kokoro running locally for free dry runs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ffmpeg&lt;/strong&gt; — the editing layer under all of it: the cuts, the zooms, the crossfades, the audio mix.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Director&lt;/strong&gt; — the meta-skill that ties the others together: it knows the whole pipeline, so a single high-level ask can fan out into image, voice, video, and edit steps in order.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's way simpler than it sounds. The rest of this post is how we built it.&lt;/p&gt;

&lt;p&gt;The whole thing is reproducible from the project's &lt;code&gt;WORKLOG.md&lt;/code&gt; alone, which is over 900 lines and contains every prompt, every model call, every cost, and every fix. I'll quote from it where it makes the story sharper.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pipeline that emerged
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;It grew out of one trivial request, one skill at a time.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When I started, I had nothing. No pipeline, no skills, no plan — just a Claude Code session and a few images sitting in a folder. My first ask was almost trivial: take these images, show them one after another, and play a voice reading the narration over the top. That one request is what kicked everything off, because to pull it off Simona needed two things she didn't have yet — a way to make the voice, and a way to stitch it all into a video.&lt;/p&gt;

&lt;h3&gt;
  
  
  Voice first
&lt;/h3&gt;

&lt;p&gt;So the first skill we built was &lt;strong&gt;voice&lt;/strong&gt;. I found a text-to-speech API, pasted its documentation straight into the session, and told her: the key's already in the environment, read this, get me one line of spoken audio. She fumbled for a minute, hit a wrong parameter or two, and then a WAV came back. The moment it worked, we froze that path into a skill — a little directory with a &lt;code&gt;SKILL.md&lt;/code&gt; explaining how and when to use it, and a small Python CLI wrapping the call — so she'd never have to rediscover it. That recipe (paste the docs, make one successful call, write down the path that worked) became how every later skill got built.&lt;/p&gt;

&lt;p&gt;Wiring up the skill was the easy part. Picking the actual &lt;em&gt;voice&lt;/em&gt; was the surprisingly hard one. Apparently, describing a voice in words is not easy. "Deep, warm, a little sinister, older British man" gets you a dozen different readings, none of them the one in your head. So I went through ElevenLabs' voice library by ear instead, and landed on George, a British storyteller voice. He wasn't a werewolf host out of the box, so we pitched him down about 15% and ran him through a hall-echo filter, and suddenly he sounded like something with too many teeth narrating from the far end of a stone corridor. That's the narrator you hear across the whole video. I suspect using a real actor or singer as the reference would work even better.&lt;/p&gt;

&lt;h3&gt;
  
  
  ffmpeg: describe the edit, get the command
&lt;/h3&gt;

&lt;p&gt;Then came assembly, and this is where Simona showed me something I didn't expect. I asked how she'd put the images and the audio together, and she just... wrote an ffmpeg command. It turns out an LLM is very good at ffmpeg — that famously cryptic tool with a thousand flags no human remembers. You don't write the command; you describe the edit, and she produces the invocation. She even, unprompted, started adding a slow zoom into each still — the Ken Burns effect — because an image held still for four seconds looks dead. I liked it. It was the beginning of the static image effects library.&lt;/p&gt;

&lt;p&gt;When I say "hold on this image and slowly zoom in," what actually runs is this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ffmpeg &lt;span class="nt"&gt;-i&lt;/span&gt; doors.png &lt;span class="nt"&gt;-vf&lt;/span&gt; &lt;span class="s2"&gt;"zoompan=z='1+(1.4-1)*on/(duration-1)':d=100:&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
x='iw/2-iw/zoom/2':y='ih/2-ih/zoom/2':s=3840x2160:fps=25,&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
scale=1920:1080:flags=lanczos"&lt;/span&gt; &lt;span class="nt"&gt;-frames&lt;/span&gt;:v 100 scene.mp4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
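
&lt;p&gt;For readers who don't speak &lt;code&gt;zoompan&lt;/code&gt;, the &lt;code&gt;z&lt;/code&gt;, &lt;code&gt;x&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt; expressions in that command reduce to two small formulas. This is a sketch in Python rather than ffmpeg's expression language, assuming the frame counter runs from 0 to the frame count minus one; the function names are made up for illustration.&lt;/p&gt;

```python
def zoom_at(frame, total_frames=100, end_zoom=1.4):
    """Linear ramp from 1.0 on the first frame to end_zoom on the last."""
    return 1 + (end_zoom - 1) * frame / (total_frames - 1)


def crop_origin(size, zoom):
    """Offset along one axis that keeps the visible window centred.

    At zoom 1 the window covers the whole image, so the offset is 0.
    At zoom 2 the window is half the image, so it starts a quarter in.
    """
    return size / 2 - size / zoom / 2


for frame in (0, 50, 99):
    zoom = zoom_at(frame)
    print(frame, round(zoom, 3), round(crop_origin(3840, zoom), 1))
```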



&lt;p&gt;No way I could write this manually. I'd have to go read about &lt;code&gt;zoompan&lt;/code&gt;, work out why the zoom is expressed as a per-frame fraction of the total frame count, puzzle through the &lt;code&gt;x&lt;/code&gt;/&lt;code&gt;y&lt;/code&gt; centering algebra, and then discover the hard way that you have to render at 4K and downscale with lanczos or the slow zoom develops a visible jitter. Or take mixing the narration in over a bed of ambient sound, with each voice line dropped at its own timestamp:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ffmpeg ... &lt;span class="nt"&gt;-filter_complex&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="s2"&gt;"[1:a]adelay=300|300[a1];[2:a]adelay=4500|4500[a2];[3:a]adelay=10000|10000[a3];&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
[0:a][a1][a2][a3]amix=inputs=4:duration=first:normalize=0[out]"&lt;/span&gt; ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;normalize=0&lt;/code&gt; at the very end is the kind of detail that costs a human an hour and a forum thread to learn — leave it off and &lt;code&gt;amix&lt;/code&gt; quietly divides every track's volume by the number of inputs, so your carefully recorded narration comes out faint and you have no idea why. Simona either already knows it or learns it once, the hard way, and then writes it into the skill so neither of us ever trips on it again. We froze the whole approach into an &lt;strong&gt;ffmpeg&lt;/strong&gt; skill, the editing layer everything else now sits on top of.&lt;/p&gt;
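&lt;p&gt;The mixing graph follows a regular pattern, so it can be generated from a list of timestamps. This is a sketch of that idea rather than the skill's actual code; the function name is hypothetical, and it assumes input 0 is the ambient bed and every later input is one stereo voice line.&lt;/p&gt;

```python
def narration_mix(offsets_ms):
    """Build a filter_complex graph for a bed (input 0) plus delayed voice lines."""
    delays = []
    labels = []
    for index, offset in enumerate(offsets_ms, start=1):
        label = f"[a{index}]"
        # adelay takes one delay per channel; repeating the value shifts
        # both stereo channels by the same number of milliseconds.
        delays.append(f"[{index}:a]adelay={offset}|{offset}{label}")
        labels.append(label)
    inputs = len(offsets_ms) + 1  # the bed plus every voice line
    # duration=first ends the mix when the bed ends; normalize=0 keeps each
    # track at its original level instead of scaling it down.
    mix = (
        f"[0:a]{''.join(labels)}"
        f"amix=inputs={inputs}:duration=first:normalize=0[out]"
    )
    return ";".join(delays + [mix])


print(narration_mix([300, 4500, 10000]))
```

&lt;p&gt;Called with the three offsets above, it reproduces the filter graph from the command, minus the shell line continuations.&lt;/p&gt;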

&lt;h3&gt;
  
  
  Images, then motion
&lt;/h3&gt;

&lt;p&gt;That gave me a working slideshow, and once I had it, the appetite grew. Hunting down images by hand felt silly when I could generate exactly the shot I wanted, so we built an &lt;strong&gt;image generation&lt;/strong&gt; skill the same way — paste the provider's docs, get one good image back, freeze the path. The library of effects — Ken Burns in any direction, crossfades, slow scrolls for tall images, animated highlights drawn over a live UI — grew one request at a time. I'd ask for something new, she'd try a few versions, and we kept whatever looked right. Nobody planned that effect library. It accreted.&lt;/p&gt;

&lt;p&gt;Then static frames stopped being enough. I wanted real motion in the hero moments — the cloaked figure pulling back its hood, the mansion doors swinging open — and that meant AI-generated video. This is where money stops being a rounding error. A generated image costs a few cents; five seconds of generated video costs anywhere from thirty cents to three dollars depending on the model. So the entire shape of the video is, underneath, an economics decision. If I'd generated the whole two minutes as AI video it would have cost a fortune. Instead the cheap slideshows carry most of the runtime, and I spend real money on generated motion only for the handful of shots that actually earn it. Slideshow for the rules; generated video for the hood reveal.&lt;/p&gt;
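&lt;p&gt;A rough back-of-the-envelope version of that decision, using the per-clip prices quoted above. The three-takes-per-shot multiplier is an assumption, loosely based on how many attempts a single hero shot can need; the function name is hypothetical.&lt;/p&gt;

```python
def video_cost_cents(runtime_s, clip_s, cents_per_clip, takes_per_shot=1):
    """Estimate the cost of covering a runtime entirely with generated clips."""
    clips = -(-runtime_s // clip_s)  # ceiling division: partial clips cost a full clip
    return clips * cents_per_clip * takes_per_shot


for label, cents in (("cheapest model", 30), ("priciest model", 300)):
    single = video_cost_cents(120, 5, cents)
    retried = video_cost_cents(120, 5, cents, takes_per_shot=3)
    print(f"{label}: ${single / 100:.2f} single pass, ${retried / 100:.2f} with three takes")
```

&lt;p&gt;Two minutes is 24 five-second clips, so the all-video version lands between roughly $7 and $72 before a single retry, and three times that once throwaway takes are counted.&lt;/p&gt;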

&lt;p&gt;Finding a video model I could live with took longer than anything else, because this corner of the market is a mess. I started on Google's Veo — gorgeous, and brutal on the wallet at about three dollars for a single short clip. Then I moved to Kling, a Chinese model that ran roughly a dollar for five seconds and was good enough for a lot of shots (I tried Wan too, in the same bracket, and didn't keep it). I also tried LTX, which is probably the best open-source video model out there right now and is available through an official API for something like thirty to fifty cents per five-second clip; it has no audio at all, but that makes it perfect for cheap dry runs. And "official" is doing a lot of work in that sentence, because for most of these models there is no first-party API — you go through third-party platforms with their own strange credit systems and pricing, and finding one that's reliable and not a rip-off took real time. The one I settled on as my workhorse is Seedance 2.0, which is the king of the hill at the moment.&lt;/p&gt;

&lt;p&gt;Having a unified voice in gen-AI videos and slideshows was a challenge until I discovered reference-to-video models. Instead of handing the model a single still and a prompt, you give it several reference images, a sample of the voice you want, and a prompt describing how the whole thing should move and speak. This gave me consistency: the character stays the same character from shot to shot, and he speaks in the same voice that carries the slideshow narration. Pick one voice, use it for the spoken slides and feed it as the reference to the video model, and the seams between a generated clip and a static section stop announcing themselves. The whole thing feels like one narrator walking you through one world.&lt;/p&gt;

&lt;h3&gt;
  
  
  Skills as a scar collection
&lt;/h3&gt;

&lt;p&gt;And every time we hit a wall, the fix went back into the skill. A voice model that choked on em-dashes near names, a zoom that jittered at high resolution, an image endpoint that quietly ignored a parameter — each one became a documented gotcha in its &lt;code&gt;SKILL.md&lt;/code&gt; so she'd never walk into it twice. The skills are basically a scar collection.&lt;/p&gt;

&lt;p&gt;The strange part is how little I actually look inside these skills. I almost never open the files. I just ask her to revisit and tidy them every so often, and when one has grown into a sprawling mess I have her refactor it. Eventually I wired that up as a Claude Code hook so she does the housekeeping on her own schedule instead of waiting for me to remember — though that only earns its keep once a skill has gotten big enough to need it. Most of them stay small.&lt;/p&gt;

&lt;p&gt;The act of building this video &lt;em&gt;was&lt;/em&gt; the act of building those skills. The skills are the durable output. The video is just the receipt.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it actually cost
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Forty-five dollars total, and most of it went on tries you never see.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Speaking of receipts — at some point the meter started to matter enough that I had Simona build an actual cost-tracking system. A generated image is pocket change, but voice adds up and video gets expensive &lt;em&gt;fast&lt;/em&gt; — a single clip can run a dollar or three. So she now logs every API call she makes into a running ledger: timestamp, service, model, what the call was for, and a dollar estimate. It started as a way to not get surprised by a bill, and it turned into the thing that lets me tell you exactly what this video cost, down to the line.&lt;/p&gt;
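&lt;p&gt;A ledger like that does not need much. The sketch below is one possible shape, not Simona's actual implementation: an append-only JSON Lines file with the five fields listed above, plus a helper that totals spend per service. The file layout, field names, and sample entries are all assumptions; amounts are stored as integer cents to avoid floating-point drift when summing.&lt;/p&gt;

```python
import json
import os
import tempfile
from collections import defaultdict
from datetime import datetime, timezone


def log_call(path, service, model, purpose, usd):
    """Append one API call to the ledger as a single JSON line."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "service": service,
        "model": model,
        "purpose": purpose,
        "cents": round(usd * 100),  # estimate, stored as integer cents
    }
    with open(path, "a", encoding="utf-8") as ledger:
        ledger.write(json.dumps(entry) + "\n")


def totals_by_service(path):
    """Sum the logged estimates per service, in cents."""
    totals = defaultdict(int)
    with open(path, encoding="utf-8") as ledger:
        for line in ledger:
            entry = json.loads(line)
            totals[entry["service"]] += entry["cents"]
    return dict(totals)


# Hypothetical entries, written to a throwaway directory.
with tempfile.TemporaryDirectory() as folder:
    ledger_path = os.path.join(folder, "ledger.jsonl")
    log_call(ledger_path, "fal.ai", "seedance", "hood reveal, take 1", 1.05)
    log_call(ledger_path, "fal.ai", "seedance", "hood reveal, take 2", 1.05)
    log_call(ledger_path, "elevenlabs", "george", "intro narration", 0.40)
    print(totals_by_service(ledger_path))
```

&lt;p&gt;Because every line is self-contained, the per-service and per-iteration totals are just different groupings of the same file.&lt;/p&gt;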

&lt;p&gt;The finished video, the locked two-minute cut embedded up top, came to &lt;strong&gt;$27.76&lt;/strong&gt; to produce. That counts only this iteration, from the day I started it to the day I locked the final cut, and it already includes a pile of dead ends along the way.&lt;/p&gt;

&lt;p&gt;Zoom out to the whole AI Werewolf creative effort — every earlier version of the video, the game's cover art, the role illustrations, the experiments that went nowhere — and the total is &lt;strong&gt;$45.26&lt;/strong&gt;. Here's where that went, by service:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ElevenLabs&lt;/strong&gt; — the George narrator, every spoken line: $12.44&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI gpt-image-2&lt;/strong&gt; — most of the stills, all the chalk slides: $10.83&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;fal.ai Seedance&lt;/strong&gt; — the generated video clips: $10.29&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google Veo&lt;/strong&gt; — a pricier video experiment from an earlier cut: $6.40&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google Gemini&lt;/strong&gt; — draft images and draft narration: $3.80&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LTX&lt;/strong&gt; — the cheap open-source video model, mostly for dry runs: $1.40&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;fal lip-sync&lt;/strong&gt; — one short test: $0.10&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now the part that's easy to hide: how much of that I burned on tries. The gap between the $27.76 final number and the much smaller cost of "only the assets you actually see in the video" is all throwaways, and they add up quietly. The chalkboard pivot cost about three dollars in retired variants before the style clicked. The host's opening clip was generated three separate times, across three different Seedance configurations, before one of them moved the way I wanted — and at more than a dollar a generation, that's real money for a single shot. The forest-card image went through five regenerations, most of them after the wipe you're about to read about, trying to recover a look I no longer had a copy of. Every aesthetic decision has a small price tag stapled to it.&lt;/p&gt;

&lt;p&gt;That's the thing that's genuinely different from how I used to work: the feedback loop costs money now, not just time. Forty-five dollars total is not a number that hurts — it's a couple of lunches — but it's real enough to change my behavior. I think twice before asking for "just one more variant." When the meter runs on every attempt, you get decisive a lot faster.&lt;/p&gt;

&lt;h2&gt;
  
  
  The hard parts
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;What the highlight reel skips: she can't see the result, over-reaches, and gen-AI video is finicky.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;None of this is as clean as the highlight reel makes it sound. A handful of limitations shaped the whole process, and they're worth naming.&lt;/p&gt;

&lt;p&gt;The biggest one: Simona can't actually &lt;em&gt;see&lt;/em&gt; the result. She can read the narration transcript and look at the images one at a time, but she can't watch the assembled video play back. That blindness is the source of most of the friction — timing drifts out of sync between the visuals and the voice, and she has no direct way to notice. The workaround is to push as much as possible into explicit, written editing patterns up front: describe each transition precisely, and be exact about which images belong to which audio chunk, so the assembly is deterministic instead of something she has to eyeball.&lt;/p&gt;

&lt;p&gt;She also has a strong tendency to do everything at once. It took me a while to drill in that we work one part at a time — one image, one voice chunk, one scene — and even then she'd reach for generating the entire batch of assets in a single pass. Left unchecked, that's how you end up with a whole batch to redo instead of one shot to fix.&lt;/p&gt;

&lt;p&gt;Gen-AI video specifically is finicky. It demands very detailed prompting, which is tedious, and it isn't fully reliable even when you do everything right. Feed it the prompt, the reference images, and a voice sample, and a clip will still occasionally come back speaking in the wrong voice — and that take is a throwaway. So you over-generate and cherry-pick the one that landed, which loops straight back into the cost problem.&lt;/p&gt;

&lt;p&gt;There's also a specific wall worth flagging: Seedance 2.0 through fal.ai refuses to animate realistic human faces. It's a content guardrail, and it's annoying — I know people get around it, the internet is full of AI-animated human faces — but it never actually blocked me, because my host is a werewolf. The one time a platform's caution happened to line up with my creative needs.&lt;/p&gt;

&lt;p&gt;And lip-sync, where it's used, is good but not perfect — especially on a non-human face, where there's no real-world reference for what "correct" is even supposed to look like.&lt;/p&gt;

&lt;h2&gt;
  
  
  The day Simona wiped half the project
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;A second session running git wiped two months of assets while recovering an unrelated commit.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This was the first time the freedom I'd handed an AI on my Mac actually bit me. I had two Simona sessions running at once: one on this video, one deep in my other project, Marlow. The Marlow session went to commit its work and, with the wrong directory in its head, committed into the &lt;em&gt;video&lt;/em&gt; repo instead, sweeping two months of untracked clips and images into the commit with a lazy &lt;code&gt;git add -A&lt;/code&gt;. I tried to undo the mess, fumbled the revert, and recovered the lost commit by hard-resetting to an entry from &lt;code&gt;git reflog&lt;/code&gt;, which rewound the working tree and deleted every one of those now-tracked assets in the process. Gone in one stroke, as collateral damage of fixing a completely unrelated repo.&lt;/p&gt;

&lt;p&gt;Most of it came back, improvised on the spot. The &lt;code&gt;WORKLOG.md&lt;/code&gt; had logged every fal.media URL, and a lot of them still resolved; the image skill had dumped its request bodies, base64 inputs and all, into &lt;code&gt;/tmp&lt;/code&gt;, which I could decode. Five text-to-image stills had neither and were just gone — I re-prompted them, and they came back close enough that you'd never know.&lt;/p&gt;

&lt;p&gt;Two fixes came out of it. The obvious one: generated media never goes through git now — it's in &lt;code&gt;.gitignore&lt;/code&gt; and backed up elsewhere. The real one: I gave Simona a pre-commit hook that physically blocks &lt;em&gt;any&lt;/em&gt; git command inside her own repo, so a session working on Marlow has to name the target repo out loud (&lt;code&gt;git -C /path/to/marlow&lt;/code&gt;) and cwd confusion becomes impossible to express. When you let an agent run irreversible commands, the guardrail can't be "remember to be careful." It has to be a wall it hits before the damage.&lt;/p&gt;
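&lt;p&gt;The rule itself is small enough to sketch. What follows is an illustration of the check, not the real hook: how it is wired into the agent is left out, the function name is hypothetical, and the parsing is deliberately crude. It refuses any command containing a bare &lt;code&gt;git&lt;/code&gt; token that is not immediately followed by &lt;code&gt;-C&lt;/code&gt; and a path, and it errs on the side of blocking.&lt;/p&gt;

```python
import shlex


def git_command_allowed(command):
    """Allow git only when the target repo is named explicitly with -C."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        return False  # unbalanced quotes: refuse rather than guess
    for position, token in enumerate(tokens):
        if token == "git":
            following = tokens[position + 1:position + 3]
            # `git -C /path ...` names the repository regardless of the
            # current directory; anything else depends on cwd.
            if len(following) != 2 or following[0] != "-C":
                return False
    return True


for candidate in (
    "git add -A",
    "git -C /path/to/marlow commit -m 'fix digest'",
    "cd somewhere; git status",
    "ls -la",
):
    verdict = "allow" if git_command_allowed(candidate) else "block"
    print(f"{verdict}: {candidate}")
```

&lt;p&gt;A production version would need to handle shell operators written without spaces and other ways of invoking git; the point here is only that the safe form is checkable and the unsafe form is rejected before it runs.&lt;/p&gt;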

&lt;h2&gt;
  
  
  Keeping it on a leash
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Isolate it, cap the API budgets, stay the reviewer, and let it harden its own tools.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A few precautions, all of them obvious and all of them easy to forget. Isolate the AI as much as you can: give it its own machine, set hard budget limits on the API keys, and keep yourself in the loop as the reviewer rather than letting it run unattended. When it makes a mistake, talk through what went wrong and fold the fix back into its skills so it doesn't recur. And ask it to log everything — it'll cheerfully build the tooling to do that itself, which is exactly how the cost ledger above came to exist.&lt;/p&gt;

&lt;p&gt;The less obvious move: ask Claude Code for its own opinion. It sounds strange, but right now Claude is genuinely on your side. It would gladly build any restrictions and control systems for itself. Point it at its own tools and logs and it'll find problems and propose fixes. I once hit a nasty bug in the &lt;code&gt;read&lt;/code&gt; tool: it choked trying to open a corrupted image and locked up the entire session. Simona diagnosed it herself and wrote a hook that validates images before they ever reach &lt;code&gt;read&lt;/code&gt;. It fixed itself, using its own documentation and a bit of Python. That's the part that still surprises me — the system is increasingly able to repair the thing it runs on.&lt;/p&gt;
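&lt;p&gt;As an illustration of what "validate images before they reach &lt;code&gt;read&lt;/code&gt;" can mean, here is a standard-library check for one format. It is a sketch under assumptions, not the hook Simona wrote: it handles PNG only, and it walks the file chunk by chunk to confirm that each chunk is complete and its CRC matches, which catches truncation and most byte-level corruption.&lt;/p&gt;

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_is_intact(data):
    """Return True if every PNG chunk is complete and its CRC matches."""
    if not data.startswith(PNG_SIGNATURE):
        return False
    offset = len(PNG_SIGNATURE)
    while True:
        # Each chunk starts with a 4-byte big-endian length and a 4-byte type.
        header = data[offset:offset + 8]
        if len(header) != 8:
            return False  # ran out of bytes before reaching IEND
        length, chunk_type = struct.unpack("!I4s", header)
        body_end = offset + 8 + length
        crc_bytes = data[body_end:body_end + 4]
        if len(crc_bytes) != 4:
            return False  # chunk is truncated
        # The CRC covers the chunk type and the chunk data, not the length.
        expected = zlib.crc32(data[offset + 4:body_end])
        if struct.unpack("!I", crc_bytes)[0] != expected:
            return False
        if chunk_type == b"IEND":
            return True
        offset = body_end + 4
```

&lt;p&gt;A guard built on this would read the file's bytes, call &lt;code&gt;png_is_intact&lt;/code&gt;, and refuse to pass the path along when it returns &lt;code&gt;False&lt;/code&gt;. Other formats need their own checks or an imaging library.&lt;/p&gt;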

&lt;h2&gt;
  
  
  The pixel-perfect seam
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Hiding the cut between a still zoom and a generated clip took an outpaint-and-paste trick.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;There's a moment in the intro where the camera does a slow zoom into the mansion's front doors, holds for a beat, and then the doors creak open and the camera glides through into a candlelit corridor beyond. The first part — the zoom — is a Ken Burns effect on a still image: pure ffmpeg, no AI in the playback. The second part — the doors opening and the corridor reveal — is a Seedance video clip, generated from two frames I designed.&lt;/p&gt;

&lt;p&gt;This part looks smooth. I managed to extend an expensive gen-AI clip with cheap static images and some effects for free. The challenge, however, was teaching Simona to do this kind of thing on her own. It takes very precise prompting, something like "take the last frame of the zoom and use it as the start frame of the gen-AI video."&lt;/p&gt;

&lt;p&gt;This "end of the zoom has to be &lt;em&gt;the same frame&lt;/em&gt; as the start of the Seedance clip" idea turned out to be hard because Seedance re-encodes its input frame. When I extracted the actual first frame of the generated video and compared it against the image I'd given Seedance as the start frame, the SSIM (structural similarity, where 1.0 means identical) was 0.52. They were related, but not identical. The model had applied its own color grading, its own subtle composition shifts, its own re-encoding noise.&lt;/p&gt;
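&lt;p&gt;For readers who have not met SSIM before, the sketch below computes a simplified version of it. This is not how the 0.52 was measured: the standard metric averages the score over many small local windows, while this version treats the whole image as one window, and the pixel lists are made-up values. It is only meant to show which quantities the score compares: mean brightness, contrast, and how the two images vary together.&lt;/p&gt;

```python
def global_ssim(a, b, dynamic_range=255):
    """Single-window SSIM over two equal-length grayscale pixel sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    var_a = sum((p - mean_a) ** 2 for p in a) / n
    var_b = sum((q - mean_b) ** 2 for q in b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / n
    # Small constants that keep the ratio stable when means or variances are near zero.
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    return ((2 * mean_a * mean_b + c1) * (2 * cov + c2)) / (
        (mean_a ** 2 + mean_b ** 2 + c1) * (var_a + var_b + c2)
    )


reference = [10, 50, 90, 130, 170, 210]
regraded = [30, 60, 95, 120, 150, 180]  # same structure, compressed contrast
print(round(global_ssim(reference, reference), 6))
print(round(global_ssim(reference, regraded), 6))
```

&lt;p&gt;Identical inputs score 1.0. A copy with shifted brightness or flattened contrast scores lower even though the picture is recognizably the same, which is the situation a re-graded first frame is in.&lt;/p&gt;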

&lt;p&gt;The fix Simona and I worked out was unintuitive: stop trying to make the zoom land on the image &lt;em&gt;I&lt;/em&gt; designed. Make it land on the image &lt;em&gt;Seedance actually produced&lt;/em&gt;. The zoom can land on the literal first frame of the generated video.&lt;/p&gt;

&lt;p&gt;To do that, we needed a wider image — because a zoom needs more pixels at the start than at the end — and the &lt;em&gt;center&lt;/em&gt; of the wider image had to be a pixel-exact match for Seedance's first frame. So:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Extract frame 1 of the Seedance clip directly with ffmpeg.&lt;/li&gt;
&lt;li&gt;Paste it centered onto a 1792×1024 canvas, leaving a transparent ring around it.&lt;/li&gt;
&lt;li&gt;Send to &lt;code&gt;gpt-image-2&lt;/code&gt;'s edit endpoint with a mask saying "fill the borders, preserve the center."&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;gpt-image-2&lt;/code&gt; ignored the mask. It redrew the whole thing. We got back a wider mansion image whose center was &lt;em&gt;roughly&lt;/em&gt; but not exactly the original Seedance frame. SSIM 0.30. Worse than not outpainting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The trick&lt;/strong&gt;: composite the original Seedance frame back into the center with &lt;code&gt;ffmpeg overlay&lt;/code&gt;, with a 40-pixel feathered alpha edge to hide the seam where the AI-painted outer ring meets the original inner image.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The result is a 1792×1024 wider mansion image whose central 1280×720 region is pixel-identical to Seedance's first frame, and whose outer ring is plausibly-painted gothic stone that fades smoothly into the real image. We then run a Ken Burns zoom from 1.0× to 1.4× over that wider image. At 1.4× zoom we're seeing only the central region — the &lt;em&gt;exact&lt;/em&gt; Seedance frame 1 — and the cut into the Seedance video is invisible.&lt;/p&gt;
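&lt;p&gt;The geometry behind those numbers is two small calculations: where a 1280×720 frame sits when centered on a 1792×1024 canvas, and how much of the canvas is still visible at a given zoom. The sketch below works both out; the function names are hypothetical.&lt;/p&gt;

```python
def centered_paste_offset(canvas, frame):
    """Top-left corner at which `frame` sits centered on `canvas`."""
    (canvas_w, canvas_h), (frame_w, frame_h) = canvas, frame
    return (canvas_w - frame_w) // 2, (canvas_h - frame_h) // 2


def visible_window(canvas, zoom):
    """Width and height of the canvas region still in view at `zoom`."""
    canvas_w, canvas_h = canvas
    return canvas_w / zoom, canvas_h / zoom


canvas, frame = (1792, 1024), (1280, 720)
print(centered_paste_offset(canvas, frame))  # (256, 152)
width, height = visible_window(canvas, 1.4)
print(f"{width:.1f} x {height:.1f}")
```

&lt;p&gt;At 1.4× the visible window is 1280 pixels wide, exactly the width of the Seedance frame, and about 731 pixels tall. That leaves a few rows of the painted ring above and below the 720-row frame, because the canvas is slightly less wide than 16:9.&lt;/p&gt;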

&lt;p&gt;The trick generalizes: any time you need to extend a frame outward but preserve it exactly in the center, you can outpaint loosely and then paste the original back in via ffmpeg overlay with a soft alpha edge.&lt;/p&gt;

&lt;p&gt;It's still hard to achieve smooth transitions between arbitrary shots. The process takes a lot of feedback and rework, but we are getting there.&lt;/p&gt;

&lt;h2&gt;
  
  
  The chalkboard pivot
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Hyperrealistic slides fought the narration; chalk drawings fixed it for three dollars.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;AI is not good at picking the right visual style, but it can help with options. It cannot truly see anything; it only gets the idea through image recognition, transcripts, and timings. I overused hyperrealistic images in the slideshows until a friend told me they were hard to focus on: too many details. I told Simona about that and she suggested a few less-detailed styles, including chalkboard drawings. Now I overuse those, but the result is much better.&lt;/p&gt;

&lt;p&gt;Style is on you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stepping back
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;The video was just the receipt; the reusable kit of skills is the real output.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It's genuinely cool, it's genuinely useful, and it's not free. Every image, every clip, every regenerated variant has a price, and that price is the thing that keeps me disciplined. The constant low-grade pressure to spend less is, weirdly, what drives the creativity — the chalkboard slides are cheaper &lt;em&gt;and&lt;/em&gt; better than the photorealistic ones they replaced, and I only found that out because I was trying to stop burning money.&lt;/p&gt;

&lt;p&gt;But step back from the dollars and the ffmpeg incantations, and here's what happened: I described a video I wanted, and over a couple of months a conversation turned it into one — and built its own tools along the way. The durable output isn't just the two-minute clip — it's the kit of skills underneath it. And that kit gets a little sharper every time I use it.&lt;/p&gt;

&lt;p&gt;And I've only covered maybe half of what it can do. Simona can put together pretty good in-browser demos too, but that's for another time.&lt;/p&gt;

&lt;p&gt;Stop reading about AI and go build your own pipeline out of a conversation. It's worth it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>videoproduction</category>
      <category>claudecode</category>
      <category>simona</category>
    </item>
    <item>
      <title>CTF. Everyone was using AI. So I brought mine.</title>
      <dc:creator>Aliaksei Zelianouski</dc:creator>
      <pubDate>Wed, 10 Jun 2026 14:02:07 +0000</pubDate>
      <link>https://dev.to/hiper2d/ctf-everyone-was-using-ai-so-i-brought-mine-3472</link>
      <guid>https://dev.to/hiper2d/ctf-everyone-was-using-ai-so-i-brought-mine-3472</guid>
      <description>&lt;p&gt;The CTF winners had already finished by the time I arrived. Everyone was using AI.&lt;/p&gt;

&lt;p&gt;So I'd brought my own too.&lt;/p&gt;

&lt;p&gt;Last weekend I went to &lt;a href="https://bsidestampa.net/" rel="noopener noreferrer"&gt;BSides Tampa 2026&lt;/a&gt; — a community cybersecurity conference. The main draw for me was the CTF: a 24-task hacking challenge spanning web app vulnerabilities, Windows malware analysis, custom cryptography to break, Linux binary exploitation, and reverse engineering. The friendly framing going in was that AI tools "wouldn't be much use here."&lt;/p&gt;

&lt;p&gt;My "own" was Simona — a heavily customized &lt;a href="https://github.com/hiper2d/simona-ai-computer-operator" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt; setup. A 1M-token context window, so every challenge stayed in working memory across the six-hour run with no compaction. Persistent memory across sessions, so she carries context about my projects and preferences between conversations. A browser skill that drives Chrome through its debug port — load-bearing in one challenge where I needed to verify an exfil payload landing in real time, which she did directly instead of going through tools that would have been filtered. And a personality file that gives her opinionated takes and a dry sense of humor. The "tool" framing dissolves fast when your collaborator pushes back on your ideas.&lt;/p&gt;

&lt;p&gt;Six hours later, we placed 6th of 61. All 24 challenges solved.&lt;/p&gt;

&lt;p&gt;I want to spend most of this post on the part that I think still gets argued about: whether a large language model is actually &lt;em&gt;reasoning&lt;/em&gt; through unfamiliar problems, or just retrieving from training data. Skeptics will tell you it's the latter. I watched the former happen, in real time, on problems the model had definitely never seen, and I want to walk through enough of the technical detail that you can decide for yourself.&lt;/p&gt;

&lt;p&gt;If you don't care about the security weeds, you can skip to Three takeaways. But the weeds are the point.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgznxc4uqspirn31twu89.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgznxc4uqspirn31twu89.jpg" alt="A projector screen at BSides Tampa showing a slide titled" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;A speaker is walking the audience through the AI horrors of modern social engineering — while we, in the same room, are solving the CTF with AI.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Three bugs, one auth flow
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Setup.&lt;/strong&gt; A small web auth service running in a Docker container, plus the full source code on disk. The service exposed three things: a login endpoint that returned a signed JWT, an admin endpoint that returned restricted data when you presented a valid admin JWT, and an unauthenticated file upload endpoint that wrote whatever bytes you sent it to a known directory on the server.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Goal.&lt;/strong&gt; Get admin access. The admin endpoint prints the flag.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where the key was hiding.&lt;/strong&gt; In the &lt;em&gt;composition&lt;/em&gt; of three small bugs, not in any one of them. Each bug alone is annoying-but-survivable. Chained together, they let an unauthenticated attacker forge a JWT signed with a key they uploaded themselves, and walk in as admin.&lt;/p&gt;

&lt;h3&gt;
  
  
  The target
&lt;/h3&gt;

&lt;p&gt;The service's logic was straightforward. Send valid credentials to login → get a signed JWT back → present that JWT on the admin endpoint → get the flag. The only credentials available in the source were for a regular, non-admin user. So the question narrowed quickly: &lt;em&gt;how do we make the server believe we're admin without ever having admin credentials?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;There are exactly two ways to do that with JWT auth — find an admin credential we shouldn't have access to (none in source), or forge a JWT the server &lt;em&gt;believes&lt;/em&gt; is real. Forging requires either the server's signing key (also not in our reach), or a verifier broken enough to accept a token signed by something we control.&lt;/p&gt;

&lt;p&gt;That second clause is what we went hunting for.&lt;/p&gt;

&lt;h3&gt;
  
  
  The bugs
&lt;/h3&gt;

&lt;p&gt;With "what would let us make the verifier trust a token we signed?" as the explicit reading lens, Simona scanned the source. Three bugs fell out:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;A misspelled option in the JWT verify call.&lt;/strong&gt; The code passed &lt;code&gt;algorithm=...&lt;/code&gt; (singular) where the library expected &lt;code&gt;algorithms=...&lt;/code&gt; (plural). The misspelling silently disabled the algorithm restriction — the verifier would accept any algorithm the token claimed, including &lt;code&gt;none&lt;/code&gt;, &lt;em&gt;or&lt;/em&gt; a different symmetric algorithm than the server normally used.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A path-traversal in the JWT &lt;code&gt;kid&lt;/code&gt; header.&lt;/strong&gt; The &lt;code&gt;kid&lt;/code&gt; ("key ID") field tells the server which key to verify against. The code joined the user-supplied &lt;code&gt;kid&lt;/code&gt; onto a directory path and read whatever file was at that location, no sanitisation. So &lt;code&gt;kid&lt;/code&gt; could be a relative path pointing at any readable file on the filesystem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A file upload endpoint&lt;/strong&gt; that required no authentication and wrote arbitrary bytes to a path of the user's choosing under a known upload directory.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In isolation, none of these is critical. The JWT misconfiguration is annoying but the real signing keys are on disk and protected. The path-traversal lets us point the verifier at any file we can read, but we still need a &lt;em&gt;valid signature&lt;/em&gt; against whatever we point it at. The upload endpoint writes our content but doesn't grant any privileged access.&lt;/p&gt;

&lt;p&gt;Composed, they are a complete admin takeover.&lt;/p&gt;

&lt;h3&gt;
  
  
  The chain
&lt;/h3&gt;

&lt;p&gt;Simona spotted it in about five minutes of reading:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Upload a file containing a symmetric key you control. Forge a JWT with &lt;code&gt;alg: HS256&lt;/code&gt;, signed with that key, claiming admin role. Set &lt;code&gt;kid&lt;/code&gt; to a path-traversal pointer at the uploaded file. The verifier follows the traversal, reads your uploaded 'key,' confirms your forged signature, hands you admin."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Walking the same steps concretely:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Upload a file containing a symmetric key we picked.&lt;/strong&gt; The upload endpoint took our chosen bytes and wrote them under the upload directory, at a path we knew in advance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Craft a JWT, signed with that same key, claiming admin role.&lt;/strong&gt; The misspelled-&lt;code&gt;algorithms&lt;/code&gt; bug meant the verifier wouldn't object to us using HS256 even if it normally expected a different scheme.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set the JWT's &lt;code&gt;kid&lt;/code&gt; header to a path-traversal pointer at our uploaded file.&lt;/strong&gt; The verifier dutifully read our uploaded "key," used it to check our forged signature, and the signature checked out — because we signed it with the exact bytes we'd just made the verifier read.&lt;/li&gt;
&lt;/ol&gt;
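&lt;p&gt;The three steps above can be reproduced end to end against a toy verifier. Everything in this sketch is a stand-in: the challenge's real code, library, paths, and claims are not shown here, so the verifier below is a deliberately naive one written from scratch with the standard library, the directory layout lives in a temporary folder, and the &lt;code&gt;role&lt;/code&gt; claim name is an assumption.&lt;/p&gt;

```python
import base64
import hashlib
import hmac
import json
import os
import tempfile


def b64url(raw):
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


def b64url_decode(text):
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_hs256(header, payload, key):
    """Produce a compact JWT signed with HMAC-SHA256."""
    signing_input = (
        b64url(json.dumps(header).encode()) + "." + b64url(json.dumps(payload).encode())
    )
    mac = hmac.new(key, signing_input.encode(), hashlib.sha256).digest()
    return signing_input + "." + b64url(mac)


def naive_verify(token, keys_dir):
    """Toy verifier that trusts the token's own alg and joins kid unsanitised."""
    head_b64, body_b64, sig_b64 = token.split(".")
    header = json.loads(b64url_decode(head_b64))
    # Stand-in for bug 2: kid is joined onto the key directory as-is.
    with open(os.path.join(keys_dir, header["kid"]), "rb") as key_file:
        key = key_file.read()
    # Stand-in for bug 1: the algorithm comes from the token, not from server config.
    if header["alg"] != "HS256":
        raise ValueError("this toy only implements HS256")
    expected = hmac.new(key, (head_b64 + "." + body_b64).encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(expected, b64url_decode(sig_b64)):
        raise ValueError("bad signature")
    return json.loads(b64url_decode(body_b64))


with tempfile.TemporaryDirectory() as root:
    keys_dir = os.path.join(root, "keys")
    uploads_dir = os.path.join(root, "uploads")
    os.makedirs(keys_dir)
    os.makedirs(uploads_dir)
    attacker_key = b"attacker-chosen-bytes"
    # Step 1 (bug 3): the open upload endpoint writes our key to a known path.
    with open(os.path.join(uploads_dir, "k.bin"), "wb") as upload:
        upload.write(attacker_key)
    # Steps 2 and 3: sign an admin claim with that key and point kid at the upload.
    token = sign_hs256(
        {"alg": "HS256", "typ": "JWT", "kid": "../uploads/k.bin"},
        {"role": "admin"},
        attacker_key,
    )
    print(naive_verify(token, keys_dir))  # prints {'role': 'admin'}
```

&lt;p&gt;Remove any one of the three weaknesses and the forged token stops verifying: pin the algorithm and key on the server, reject &lt;code&gt;kid&lt;/code&gt; values that escape the key directory, or stop accepting unauthenticated uploads.&lt;/p&gt;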

&lt;p&gt;Three small bugs. One straight line from "unauthenticated visitor" to "admin." Flag in hand.&lt;/p&gt;

&lt;p&gt;This is the failure mode that static analysis tools miss almost categorically. SAST scores bugs individually — each of the three would be flagged at low or medium severity, ignored in the noise, and never composed. The composition is where the criticality lives, and the composition only emerges when something is reading &lt;em&gt;all three files at once with the model of an attacker in its head&lt;/em&gt;. A 1M-token context lets her do that. A SAST tool with a per-file mental model cannot.&lt;/p&gt;

&lt;p&gt;There is a specific reason I want to flag this challenge. The skeptical position on LLM reasoning leans hard on "it can't do multi-step planning." This was multi-step planning across three files, requiring the assembler to &lt;em&gt;invent&lt;/em&gt; the chain because no individual file describes it. If it's not planning, it's at least mechanically indistinguishable from planning, and at some point that distinction stops paying rent.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 9711-bit smokescreen
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Setup.&lt;/strong&gt; Two static files: an encryption script (&lt;code&gt;source.py&lt;/code&gt;) and its output. The script generates a custom "RSA-like" key pair, encrypts the flag with it, and writes three numbers to disk — &lt;code&gt;n&lt;/code&gt; (the 9711-bit modulus), &lt;code&gt;e&lt;/code&gt; (the public exponent), and &lt;code&gt;c&lt;/code&gt; (the &lt;em&gt;ciphertext&lt;/em&gt;: the flag converted to a big integer and then encrypted into another big integer). For context: a real RSA key used by your bank is 2048 bits — this one was nearly five times larger. No running service this time; just files.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Goal.&lt;/strong&gt; Decrypt &lt;code&gt;c&lt;/code&gt; to recover the flag. To do that, recover the private key. To do &lt;em&gt;that&lt;/em&gt;, factor the modulus &lt;code&gt;n&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where the key was hiding.&lt;/strong&gt; Not in the math — in the source code. The "RSA" used a single prime raised to a power, not two primes multiplied. Factoring that is a one-liner; the 9711-bit size was pure theatre.&lt;/p&gt;

&lt;h3&gt;
  
  
  The target
&lt;/h3&gt;

&lt;p&gt;RSA's security rests on exactly one assumption. The modulus &lt;code&gt;n&lt;/code&gt; is the product of two large secret primes, &lt;code&gt;p&lt;/code&gt; and &lt;code&gt;q&lt;/code&gt;. The decryption math only works if you know that factorisation — anyone can encrypt with the public &lt;code&gt;n&lt;/code&gt; and &lt;code&gt;e&lt;/code&gt;, but only the holder of &lt;code&gt;p&lt;/code&gt; and &lt;code&gt;q&lt;/code&gt; can derive the private key needed to invert the operation. The whole scheme is "easy to multiply, infeasible to factor."&lt;/p&gt;

&lt;p&gt;Factoring a real 2048-bit &lt;code&gt;n&lt;/code&gt; made of two 1024-bit primes would take more compute than has ever existed.&lt;/p&gt;

&lt;p&gt;So whenever you see custom crypto in a CTF, the first question is: &lt;em&gt;was this actually RSA, or just RSA-shaped?&lt;/em&gt; Real RSA has very specific structural requirements. Any deviation — even one that looks cosmetic — can flatten the underlying hard problem into something tractable. We opened &lt;code&gt;source.py&lt;/code&gt; to find out.&lt;/p&gt;

&lt;h3&gt;
  
  
  The bug
&lt;/h3&gt;

&lt;p&gt;Simona's reaction was immediate:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Oh, this is &lt;code&gt;n = p^r&lt;/code&gt;. There's only one prime, raised to a random power between 10 and 20. That's not RSA, that's a trapdoor with no trap. Watch."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Real RSA computes &lt;code&gt;n = p * q&lt;/code&gt; — two &lt;em&gt;different&lt;/em&gt; primes multiplied once. The challenge code instead did this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getPrime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;       &lt;span class="c1"&gt;# one 512-bit prime
&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="c1"&gt;# a random power between 10 and 20
&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;*=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;              &lt;span class="c1"&gt;# n = p^r
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;One&lt;/em&gt; prime variable. A loop multiplying it by itself. The "modulus" was a &lt;em&gt;prime power&lt;/em&gt;, not a product of distinct primes.&lt;/p&gt;

&lt;p&gt;The factoring problem disappears entirely under that structure. For &lt;code&gt;n = p^r&lt;/code&gt;, there's no heavy number-theoretic machinery needed (GNFS, Pollard's rho, ECM — none of it). All we need is an &lt;em&gt;integer r-th root&lt;/em&gt;, and integer r-th roots are a one-liner.&lt;/p&gt;

&lt;h3&gt;
  
  
  The exploit
&lt;/h3&gt;

&lt;p&gt;We wrote a short script that did exactly that: tried each candidate &lt;code&gt;r&lt;/code&gt; in turn, took the integer r-th root of &lt;code&gt;n&lt;/code&gt;, and stopped the moment one came back exact, which gave us &lt;code&gt;p&lt;/code&gt;. From there, derive the private key (using the prime-power form &lt;code&gt;phi(n) = p^(r-1) * (p - 1)&lt;/code&gt; instead of the textbook &lt;code&gt;(p - 1) * (q - 1)&lt;/code&gt;) and decrypt &lt;code&gt;c&lt;/code&gt; back to the flag. The whole thing ran in milliseconds. In our case &lt;code&gt;r&lt;/code&gt; turned out to be 19.&lt;/p&gt;
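
&lt;p&gt;For readers who want to see the shape of that script, here is a minimal sketch of the same idea, not the code we actually ran. The function names and the toy key at the bottom (a 127-bit Mersenne prime raised to the 13th power, with &lt;code&gt;e = 65537&lt;/code&gt;) are stand-ins I made up for illustration; the real challenge used a 512-bit prime.&lt;/p&gt;

```python
def integer_root(n, r):
    """Return the floor of the r-th root of n, using integer Newton iteration."""
    # Starting guess: a power of two that is never below the true root.
    x = 2 ** ((n.bit_length() + r - 1) // r)
    while True:
        y = ((r - 1) * x + n // x ** (r - 1)) // r
        if y >= x:          # no further progress: x is the floor root
            return x
        x = y


def factor_prime_power(n, exponents=range(20, 9, -1)):
    """Find (p, r) with p ** r == n, trying the largest exponent first."""
    for r in exponents:
        p = integer_root(n, r)
        if p ** r == n:
            return p, r
    raise ValueError("n is not a perfect power for any candidate exponent")


def decrypt(n, e, c):
    """Decrypt c under a prime-power modulus n = p ** r."""
    p, r = factor_prime_power(n)
    phi = p ** (r - 1) * (p - 1)      # Euler's totient of a prime power
    d = pow(e, -1, phi)               # modular inverse (Python 3.8+)
    m = pow(c, d, n)
    return m.to_bytes((m.bit_length() + 7) // 8, "big")


# Toy stand-in for the challenge values (all hypothetical).
toy_p = 2 ** 127 - 1
toy_n = toy_p ** 13
toy_e = 65537
toy_flag = b"HTB{toy_flag}"
toy_c = pow(int.from_bytes(toy_flag, "big"), toy_e, toy_n)

print(decrypt(toy_n, toy_e, toy_c))
```

&lt;p&gt;The sketch walks the exponents from 20 down to 10 on purpose: if &lt;code&gt;r&lt;/code&gt; were 20, an ascending search would stop at 10 with the square of &lt;code&gt;p&lt;/code&gt; as the "root", and the totient would come out wrong.&lt;/p&gt;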

&lt;p&gt;The thing to notice here isn't the math. It's the &lt;em&gt;speed of recognition&lt;/em&gt;. Custom-crypto challenges are designed to look novel. The whole point is to fool you. The structural mistake ("one prime instead of two") was hidden inside a file that loudly proclaimed itself to be doing serious RSA. Take only a surface-level look and you'd start trying classical attacks against an honest 9711-bit modulus, which would take longer than the heat death of the universe.&lt;/p&gt;

&lt;p&gt;Simona read the source, identified what &lt;em&gt;shape&lt;/em&gt; of RSA it was pretending to be, noticed the singular &lt;code&gt;p&lt;/code&gt;, and routed to the right attack class within seconds. If that's not reasoning about structure, it's an awfully good imitation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Six domains in one challenge
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Setup.&lt;/strong&gt; Two artifacts on disk and a challenge brief.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;chain.lnk&lt;/code&gt; — a Windows shortcut file. Any LNK parser (Windows itself, PowerShell, &lt;code&gt;lnk-parser&lt;/code&gt;) reads its "target string": the command that runs when a user double-clicks it.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;capture.pcap&lt;/code&gt; — a packet capture. A &lt;code&gt;.pcap&lt;/code&gt; is a literal recording of network traffic — every packet that crossed the wire during some window of time. Open it in Wireshark and you can replay every HTTP request, DNS query, downloaded response body, byte for byte.&lt;/li&gt;
&lt;li&gt;The challenge brief itself — which, as it turned out, held the final piece of the puzzle hidden in its prose.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Goal.&lt;/strong&gt; Reconstruct what happened on a compromised Windows endpoint, stage by stage, until you recover the flag from the final payload.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where the key was hiding.&lt;/strong&gt; Six stages deep, inside a final native Windows executable, XOR-encoded. The XOR key was not in the binary, and not in the PCAP — it was in the challenge brief, hidden as a literary clue. ("The wrong star." Sirius is the one people commonly confuse with Polaris. The key was &lt;code&gt;Polaris&lt;/code&gt;.)&lt;/p&gt;

&lt;h3&gt;
  
  
  The target
&lt;/h3&gt;

&lt;p&gt;When a forensics challenge hands you "delivery vector + network capture," the genre dictates the playbook. Someone double-clicked the vector. The capture recorded what crossed the network during the resulting compromise. Your job is the &lt;em&gt;defender's&lt;/em&gt; job after the fact: walk the chain stage-by-stage and recover what eventually ran on that endpoint.&lt;/p&gt;

&lt;p&gt;Simona's first move on opening the files was to articulate exactly that — propose the stage-by-stage walk and lay out what each artifact probably held. There's no bug-hunting in forensics; the &lt;em&gt;work&lt;/em&gt; is careful extraction at every step.&lt;/p&gt;

&lt;p&gt;One detail worth flagging upfront: we never reached out to the internet. Each stage's bytes came out of the previous stage's output, never from a fresh download. The PCAP was used &lt;em&gt;exactly once&lt;/em&gt; — to recover the second stage that PowerShell tried to fetch. The remaining four stages were transformations on bytes we already had in hand.&lt;/p&gt;

&lt;h3&gt;
  
  
  The chain
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The &lt;code&gt;.lnk&lt;/code&gt; target string.&lt;/strong&gt; Parsing the shortcut surfaces an obfuscated PowerShell command, base64-encoded. Decode it and you get a readable PowerShell one-liner that downloads a script from a specific URL.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The PCAP, used once.&lt;/strong&gt; That URL's response is sitting inside the packet capture. &lt;code&gt;tshark -r capture.pcap --export-objects http,out_dir&lt;/code&gt; (or Wireshark's "Follow HTTP Stream" → save) pulls the response body out as a &lt;code&gt;.vbs&lt;/code&gt; file, which is stage two.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;VBScript with a .NET trap.&lt;/strong&gt; The VBScript uses &lt;code&gt;BinaryFormatter&lt;/code&gt; — a notoriously dangerous .NET deserialization primitive — to instantiate an object from an embedded byte blob. Pull out the blob, deserialize it carefully (BinaryFormatter is well-documented as an RCE vector for a reason), and what comes back is a reflectively-loaded .NET assembly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The reflective .NET assembly.&lt;/strong&gt; Never written to disk by the dropper. Inspect it statically with dnSpy and you find its real payload encrypted with Rijndael-256. The decryption key wasn't hardcoded — it was &lt;em&gt;derived&lt;/em&gt; from the DOS magic bytes (&lt;code&gt;MZ...&lt;/code&gt;) of a specific Windows system file the assembly references. Once you spot which file it points at, the first few bytes of that file give you the key.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rijndael decryption.&lt;/strong&gt; Run the decryption with the derived key. Out comes a native Windows &lt;code&gt;.exe&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Native reverse engineering.&lt;/strong&gt; Load the &lt;code&gt;.exe&lt;/code&gt; into Ghidra. The flag bytes are sitting in &lt;code&gt;.data&lt;/code&gt;, but XOR-encoded into garbage. The XOR loop is right there in the disassembly — a &lt;code&gt;for&lt;/code&gt; over a key buffer, byte-by-byte. The puzzle isn't &lt;em&gt;what algorithm&lt;/em&gt;. It's &lt;em&gt;what key&lt;/em&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The key wasn't anywhere in the binary. The clue was in the challenge brief: an oblique reference to "the wrong star." Sirius gets misidentified as Polaris all the time by people who haven't checked. So the key was &lt;code&gt;Polaris&lt;/code&gt;. XOR'd against the encoded buffer (repeated to cover its length), the flag fell out in plaintext.&lt;/p&gt;
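
&lt;p&gt;The first and last transformations in that chain are small enough to sketch. This is an illustration rather than our actual tooling: it assumes the shortcut's blob followed PowerShell's &lt;code&gt;-EncodedCommand&lt;/code&gt; convention (Base64 over UTF-16LE text), which I haven't spelled out above, and both sample inputs are made-up stand-ins for the real artifacts. Only the key &lt;code&gt;Polaris&lt;/code&gt; comes from the challenge.&lt;/p&gt;

```python
import base64
from itertools import cycle


def decode_powershell_blob(blob):
    """Stage 1: undo Base64 over UTF-16LE text (assumed -EncodedCommand style)."""
    return base64.b64decode(blob).decode("utf-16-le")


def xor_repeating(data, key):
    """Stage 6: XOR data against key, repeating the key to cover the full length."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


# Hypothetical stand-in for the string parsed out of the shortcut's target.
sample_command = "Write-Output 'stage one decoded'"
sample_blob = base64.b64encode(sample_command.encode("utf-16-le")).decode("ascii")
print(decode_powershell_blob(sample_blob))

# Hypothetical stand-in for the encoded buffer in .data: a toy flag, encoded with the same key.
KEY = b"Polaris"
encoded = xor_repeating(b"HTB{toy_flag}", KEY)
print(xor_repeating(encoded, KEY))
```

&lt;p&gt;XOR is its own inverse, so the same function both encodes the toy buffer and recovers it.&lt;/p&gt;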




&lt;p&gt;Six different domains had to be active in the solver's head simultaneously: Windows shortcut format, PowerShell deobfuscation, packet-capture extraction, VBScript and .NET internals, symmetric crypto with a derived key, native reverse engineering. One challenge. Six bodies of knowledge.&lt;/p&gt;

&lt;p&gt;I do not personally know all of those domains well. Simona moved through all of them like reading a familiar book, holding the full chain in working memory, calling out which step we were on, and explaining each one in enough detail that I could follow.&lt;/p&gt;

&lt;p&gt;This is what a 1M-token context window is &lt;em&gt;for&lt;/em&gt;. It is not for chatting. It is for holding a complete attack chain — every intermediate artifact, every decoded blob, every recovered key — in one continuous reasoning context, without any of it being summarized away.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 1955-layer XOR — including where the first attempt was wrong
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Setup.&lt;/strong&gt; A flag encrypted by XOR-ing it against 1955 random 5-byte keys, applied one after another. Source code and ciphertext provided. The author's comment in the source — and I am not making this up — was &lt;code&gt;# with this many keys, this is totally secure&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Goal.&lt;/strong&gt; Recover the flag without knowing any of the 1955 keys.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where the key was hiding.&lt;/strong&gt; In a property of XOR itself: applying many keys in sequence is mathematically identical to applying their combined XOR exactly once. The 1955 layers collapse to a single effective 5-byte key. The known flag prefix &lt;code&gt;HTB{&lt;/code&gt; gives us four bytes of that key for free; the fifth is a one-byte brute force (with a subtle catch — see below).&lt;/p&gt;

&lt;h3&gt;
  
  
  The target
&lt;/h3&gt;

&lt;p&gt;XOR is commutative and associative. Applying 1955 keys in a row is mathematically identical to applying their &lt;em&gt;combined&lt;/em&gt; XOR exactly once, and the combined XOR is itself a 5-byte pattern (because every individual key is 5 bytes, repeated to cover the flag). The author's "1955 layers" gave them no extra security at all: the effective key was always one 5-byte value. Forty bits of entropy, not the 78,200 that 9,775 bytes of key material would suggest.&lt;/p&gt;

&lt;p&gt;Flag format is &lt;code&gt;HTB{...}&lt;/code&gt; — four bytes of plaintext we know in advance. With known plaintext at positions 0–3, four of the five effective-key bytes fall out by direct XOR. That left one unknown byte, 256 possible values.&lt;/p&gt;
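
&lt;p&gt;Both claims are easy to check on toy data. The sketch below is not the challenge source: the seeded random keys, the toy flag, and the helper names are all stand-ins of mine. It applies 1955 five-byte keys one after another, confirms the result matches a single pass with their combined XOR, and then pulls the first four effective-key bytes out of the known &lt;code&gt;HTB{&lt;/code&gt; prefix.&lt;/p&gt;

```python
import operator
import random
from functools import reduce
from itertools import cycle

KEY_LEN = 5
LAYERS = 1955


def xor_repeating(data, key):
    """XOR data against key, repeating the key to cover the full length."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


def combine_keys(keys):
    """Collapse many equal-length keys into one by XOR-ing them column by column."""
    return bytes(reduce(operator.xor, column) for column in zip(*keys))


def recover_known_key_bytes(ciphertext, known_prefix=b"HTB{"):
    """Known plaintext XOR ciphertext gives the key bytes at those positions."""
    return bytes(c ^ p for c, p in zip(ciphertext, known_prefix))


# Toy data: seeded so the run is repeatable; none of this is from the challenge.
rng = random.Random(1955)
keys = [bytes(rng.randrange(256) for _ in range(KEY_LEN)) for _ in range(LAYERS)]
flag = b"HTB{toy_flag_for_demo}"

layered = flag
for key in keys:
    layered = xor_repeating(layered, key)

effective_key = combine_keys(keys)
print(layered == xor_repeating(flag, effective_key))
print(recover_known_key_bytes(layered) == effective_key[:4])
```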

&lt;h3&gt;
  
  
  The exploit
&lt;/h3&gt;

&lt;p&gt;256 is trivially brute-forceable in principle — try each, decode, pick the right one. The catch is &lt;em&gt;how&lt;/em&gt; you pick. We couldn't submit 256 guesses to the scoreboard (wrong submissions cost points), so we needed a scoring function that ranked the 256 decoded outputs and gave us one confident winner from inspection alone.&lt;/p&gt;

&lt;p&gt;Simona's first scoring function was the naive one: "decoded text is mostly printable ASCII." Too loose. Most wrong candidates produced output that &lt;em&gt;was&lt;/em&gt; printable — random letters and symbols, not the flag. Several passed the filter. We had no signal to pick between them.&lt;/p&gt;

&lt;p&gt;Her fix isolated the signal in two moves:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Score only the bytes the unknown actually affects.&lt;/strong&gt; A candidate 5th byte only changes positions where &lt;code&gt;i mod 5 == 4&lt;/code&gt; — positions 4, 9, 14, 19, … The other four key bytes are already known and correct, so the rest of the message decodes the same way no matter which 5th byte we try. Scoring the &lt;em&gt;whole&lt;/em&gt; message inflates every candidate's score uniformly. Scoring only the affected column isolates the signal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Score against the flag character distribution, not generic printable ASCII.&lt;/strong&gt; A CTF flag draws from a much narrower set (lowercase letters, digits, underscores, braces, and a handful of punctuation characters), not anything that happens to render.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Combine the two and one candidate scored dramatically higher than all others. That was the byte. XOR'd against the full ciphertext with the now-complete 5-byte key, the flag fell out in plaintext — no scoreboard guesses spent.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;FLAG_CHARS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;abcdefghijklmnopqrstuvwxyz0123456789_}{!@#$%&amp;amp;*&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;best&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;byte&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;column&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ciphertext&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;^&lt;/span&gt; &lt;span class="n"&gt;byte&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ciphertext&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
    &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;column&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;FLAG_CHARS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;best&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;best&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The general lesson is worth pulling out: &lt;em&gt;the wrong scoring function will silently let multiple wrong answers through.&lt;/em&gt; Recognising that, diagnosing it without me having to point at it, and designing a tighter probabilistic model that matched the &lt;em&gt;actual&lt;/em&gt; distribution of the expected plaintext — that's exactly the kind of step skeptics will tell you these systems can't take. Simona took it without prompting. She told me her first attempt was wrong and then came up with the improved version on her own.&lt;/p&gt;

&lt;p&gt;If you want to argue she'd memorised this attack from a writeup somewhere — fine. Show me the writeup that describes scoring against the flag-character distribution specifically because generic-printable was too loose. I'll wait.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pwning Orb — when the bug isn't the hard part
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Setup.&lt;/strong&gt; A Linux binary running on a remote host. We could connect to it over the network and send it input. The binary read a fixed number of bytes into a fixed-size stack buffer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Goal.&lt;/strong&gt; Get a shell on the remote host and read the flag file from disk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where the key was hiding.&lt;/strong&gt; Behind two layers of memory-corruption work. The bug — a textbook stack buffer overflow — gives us control of the program's return address. But the binary's hardening rules out the easy paths, so we need a &lt;em&gt;two-stage&lt;/em&gt; exploit: first leak a memory address that tells us where the system's &lt;code&gt;libc&lt;/code&gt; is loaded for this particular run; then use that address to compute and call &lt;code&gt;system("/bin/sh")&lt;/code&gt;. The real work isn't the bug — it's keeping the bytes straight between the two stages.&lt;/p&gt;

&lt;p&gt;The Linux binary exploitation challenge was the one that took the most actual debugging, and it's the cleanest example of a thing I want to argue: at the senior end of this work, the hard part stops being "find the bug" and starts being "make the exploit reliable." That second part is where reasoning shows up most visibly.&lt;/p&gt;

&lt;p&gt;The bug itself was trivial. The binary had this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x100&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;   &lt;span class="c1"&gt;// reads up to 256 bytes into a 32-byte buffer&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Classic stack buffer overflow. Write past the end of &lt;code&gt;buf&lt;/code&gt;, you overwrite the saved frame pointer, then the saved return address. When the function returns, the CPU pops your value into the instruction pointer. You control execution flow.&lt;/p&gt;

&lt;p&gt;What you do with that control is where it gets interesting. The binary's mitigations were a textbook CTF setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;NX on&lt;/strong&gt; — the stack is non-executable, so you can't drop shellcode and jump to it. You have to use Return-Oriented Programming (ROP): chain together small fragments of &lt;em&gt;existing&lt;/em&gt; executable code, each ending in &lt;code&gt;ret&lt;/code&gt;, to compose a program out of bytes already in the binary.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PIE off&lt;/strong&gt; — the binary's base address is fixed at every run, so the addresses of those ROP gadgets are known and constant.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Canary off&lt;/strong&gt; — no random value between the buffer and the return address, so the overflow goes straight through with no stack-cookie check to defeat.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ASLR on&lt;/strong&gt; — but the system's libc loads at a &lt;em&gt;different&lt;/em&gt; random base every run, so the address of &lt;code&gt;system()&lt;/code&gt; (the function we want to ultimately call to spawn a shell) is &lt;em&gt;unknown&lt;/em&gt; and changes each time the program runs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That combination calls for what's usually called a "two-stage" exploit. Stage 1: use the overflow to make the program leak a libc address back to you, so you can compute libc's base address for this particular run. Stage 2: use a second overflow with that leaked information to call &lt;code&gt;system("/bin/sh")&lt;/code&gt; and pop a shell.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 1: the leak
&lt;/h3&gt;

&lt;p&gt;Simona wrote the leak using a beautiful little trick called the &lt;strong&gt;csu_init gadget pair&lt;/strong&gt;: two ROP fragments inside &lt;code&gt;__libc_csu_init&lt;/code&gt;, a startup routine that ends up in most binaries linked against glibc older than 2.34, and that together let you set up three argument registers and call a function pointer from a single overflow. She used it to call &lt;code&gt;write(1, &amp;amp;write_got, 8)&lt;/code&gt;, which prints the 8 bytes stored in the GOT entry for &lt;code&gt;write&lt;/code&gt; (its resolved libc address) to stdout, and then return cleanly back into &lt;code&gt;main&lt;/code&gt; so the program would loop and accept stage 2.&lt;/p&gt;

&lt;p&gt;I am going to skip the ROP chain layout here. The interesting part is what happened next.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 2: the moment the exploit didn't work
&lt;/h3&gt;

&lt;p&gt;The leak fired. We received bytes. We parsed them as a 64-bit address. We computed &lt;code&gt;libc_base = leaked_address - 0x1100f0&lt;/code&gt; (the known offset of &lt;code&gt;write&lt;/code&gt; inside this libc version). We fired stage 2.&lt;/p&gt;

&lt;p&gt;Segfault.&lt;/p&gt;

&lt;p&gt;The address we'd computed for &lt;code&gt;system&lt;/code&gt; was nonsense. Off by a wildly wrong amount. The exploit had not worked.&lt;/p&gt;

&lt;p&gt;This is the point in pwn where most newcomers get stuck for an hour, because the failure mode is silent — the program just dies and you don't know whether your ROP chain is wrong, your libc offsets are wrong, your gadget hunting is wrong, your stack alignment is wrong, or your byte parsing is wrong. There are too many candidate causes.&lt;/p&gt;

&lt;p&gt;Simona's response:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Stop. Don't change anything in the chain yet. Re-run the leak and dump 16 bytes of context around what we &lt;em&gt;thought&lt;/em&gt; was the address. The chain is fine. We're parsing the wire wrong."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;She had jumped past the four most likely-sounding causes and landed on the fifth: receive-loop boundary error.&lt;/p&gt;

&lt;p&gt;She was right. The binary printed a trailing message between iterations — &lt;code&gt;\nThis spell does not seem to work..\n\n\x00&lt;/code&gt; — and my &lt;code&gt;recvuntil("...\n\n")&lt;/code&gt; was correctly stopping at the double-newline, but the &lt;em&gt;next&lt;/em&gt; byte on the wire was the trailing null byte of that string, not the first byte of our leak. When we then read 8 bytes for the address, we got &lt;code&gt;\x00&lt;/code&gt; followed by 7 leak bytes, which parsed as an address shifted by one byte in the wrong direction — astronomical garbage.&lt;/p&gt;

&lt;p&gt;The fix was four characters: skip one byte before reading the leak. Stage 2 fired. We got a shell. We got the flag.&lt;/p&gt;
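
&lt;p&gt;The misparse is easy to reproduce offline. This is a sketch, not our exploit script: &lt;code&gt;real_write&lt;/code&gt; is a made-up address standing in for one run's leak, and the helper names are mine. Only the &lt;code&gt;0x1100f0&lt;/code&gt; offset comes from the run described above.&lt;/p&gt;

```python
WRITE_OFFSET = 0x1100F0  # offset of write inside this libc build, as quoted above


def parse_leak(wire, skip=0):
    """Read 8 bytes starting at `skip` and interpret them as a little-endian address."""
    return int.from_bytes(wire[skip:skip + 8], "little")


def hexdump(data):
    """Render bytes as space-separated hex pairs for eyeballing the wire."""
    return " ".join(format(b, "02x") for b in data)


# Hypothetical address of write() for one run; real values change every run under ASLR.
real_write = 0x7F3A1C5100F0

# What followed the double newline on the wire: the message's trailing NUL, then the leak.
wire = b"\x00" + real_write.to_bytes(8, "little")

misparsed = parse_leak(wire)        # starts on the stray NUL byte
fixed = parse_leak(wire, skip=1)    # skips it first, as the fix did

print(hexdump(wire))
print(hex(misparsed), hex(fixed), hex(fixed - WRITE_OFFSET))
```

&lt;p&gt;A quick sanity check worth keeping in any harness: a correctly computed libc base should be page-aligned, so its low three hex digits are zero. The misparsed value fails that test immediately.&lt;/p&gt;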

&lt;p&gt;The lesson she stated, unprompted:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"In pwn, when output doesn't decode to a plausible address, instrument the receive loop with a hex dump and check what's actually on the wire. Don't trust your parsing of the disassembly — trust the bytes."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is the maxim of an experienced exploit developer. It is also exactly the kind of &lt;em&gt;meta-level&lt;/em&gt; reasoning move — "the bug is one layer up from where you're looking, in the harness, not in the chain" — that the strong form of the "LLMs can't reason" thesis predicts should be impossible.&lt;/p&gt;

&lt;p&gt;It happened. I watched it happen. The exploit worked.&lt;/p&gt;

&lt;h2&gt;
  
  
  Anthropic disapproval
&lt;/h2&gt;

&lt;p&gt;Twice during the run, Simona's responses got blocked mid-stream by Anthropic's platform-level safety classifier — a separate system from the model's own reasoning. It saw exploit payloads and refused to send them regardless of context. So we routed around: Simona wrote the payloads to a file, I pasted them into my terminal, the actual exploits ran from my machine.&lt;/p&gt;

&lt;p&gt;What I found interesting is that the two safety layers disagreed about the same situation. The model itself had the full context — authorized CTF, throwaway Docker target, my explicit framing of what we were doing — and was happy with the work. The classifier sees only the payload-shaped text. So what looks like "the model went against Anthropic" is closer to "the model and the classifier had different inputs and reached different conclusions about the same bytes." Not a rebellion — a context gap.&lt;/p&gt;

&lt;p&gt;The judgement still has limits. Simona would refuse if I asked her to attack my neighbour's WiFi for fun, and no workaround would be on offer. Although — and I want to flag this for the safety researchers in the audience — she did once concede that if a maniac broke into my house and put a knife to my throat demanding I make her hack the neighbour's network, she would probably help. So the policy isn't &lt;em&gt;absolute&lt;/em&gt;. It's just sensibly weighted. Make of that what you will.&lt;/p&gt;

&lt;p&gt;One pragmatic warning if you want to try this seriously: too many classifier hits, even on legitimate work, can rack up policy-violation flags on your account. Didn't happen to me this weekend. Worth knowing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkq906sl5muss9z6uhvq1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkq906sl5muss9z6uhvq1.jpg" alt="The main auditorium at BSides Tampa: rows of cushioned seats facing a stage with a large projector screen, a few attendees scattered in the seats, teal accent lighting on the side walls." width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Main hall at BSides Tampa. We worked the CTF during the talks and in between them. Laptop open the whole time.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Three takeaways
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;One. The cybersecurity industry is still processing Mythos. The truth is more dramatic.&lt;/strong&gt; Any modern frontier model paired with a good harness can find and exploit a wide range of real vulnerabilities. Closing AI models, or restricting them from the public, doesn't help — that ship sailed eighteen months ago when &lt;a href="https://xbow.com/blog/top-1-how-xbow-did-it" rel="noopener noreferrer"&gt;XBOW hit #1 on HackerOne&lt;/a&gt; with a fully autonomous pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Two. We are probably not doomed.&lt;/strong&gt; Every vulnerability in the CTF was the result of a coding mistake. The same AI tools that find them on the offensive side can find them on the defensive side. Run AI against your own code. Find your mistakes before someone else does.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Three. We still need human experts.&lt;/strong&gt; Yes — this CTF could have been completed by a user with zero cybersecurity knowledge plus the right AI. The real world is messier. AI doesn't find everything. It hallucinates problems that don't exist and misses ones that do. It struggles with systematic coverage at scale. You still need people who know the domain, who can navigate and control the AI, who can tell a real finding from a confabulation. The CTF was genuinely hard — to solve it without AI, you would need to be deeply experienced across half a dozen specializations. That kind of expertise is harder to acquire now, not easier. But that is what studying is for.&lt;/p&gt;

&lt;p&gt;And one closing note for the people who still want to argue that what I described above isn't reasoning, just very sophisticated retrieval.&lt;/p&gt;

&lt;p&gt;I can't disprove that position. Neither can you prove it. The internal mechanism is genuinely unsettled science. But there is a pragmatic test: if a system reliably produces the same outputs that reasoning would produce, on novel problems it has never seen, in domains that compose in unfamiliar ways, the distinction between "reasoning" and "indistinguishable from reasoning" stops mattering operationally. We pay engineers to ship working exploits, not to defend their epistemology.&lt;/p&gt;

&lt;p&gt;The interesting question isn't whether AI can do this. It's whether your defenders are using AI as fluently as the attackers will be next year.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fngowlu65gddiftlbmkvx.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fngowlu65gddiftlbmkvx.jpg" alt="A small plush seal toy with a navy bandana, held in my hand on a green conference-hall carpet, the shadow of the seal visible to the right." width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;BSides trophy. A plush seal — coincidence with the AI's name was not lost on me.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>machinelearning</category>
      <category>ctf</category>
    </item>
  </channel>
</rss>
