<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: theAIGeek</title>
    <description>The latest articles on DEV Community by theAIGeek (@ai_geek).</description>
    <link>https://dev.to/ai_geek</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3994559%2F15d278de-d451-4da4-95ae-7dc0fc689fc9.png</url>
      <title>DEV Community: theAIGeek</title>
      <link>https://dev.to/ai_geek</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ai_geek"/>
    <language>en</language>
    <item>
      <title>LLMs That Actually Pen Test: What Post-Training for Security Means for Your AI Stack</title>
      <dc:creator>theAIGeek</dc:creator>
      <pubDate>Sat, 20 Jun 2026 21:26:55 +0000</pubDate>
      <link>https://dev.to/ai_geek/llms-that-actually-pen-test-what-post-training-for-security-means-for-your-ai-stack-6ho</link>
      <guid>https://dev.to/ai_geek/llms-that-actually-pen-test-what-post-training-for-security-means-for-your-ai-stack-6ho</guid>
      <description>&lt;h1&gt;
  
  
  LLMs That Actually Pen Test: What Post-Training for Security Means for Your AI Stack
&lt;/h1&gt;

&lt;p&gt;Security researchers have spent years arguing that LLMs should be more helpful with offensive security tasks. The models kept refusing. Now someone has shipped a post-trained model that does the work instead of lecturing you about responsible disclosure, and a separately reported Claude configuration is said to have found thousands of real zero-days. That is not a headline you can ignore if you are building any kind of AI system that touches code, infrastructure, or automated pipelines.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Happened
&lt;/h2&gt;

&lt;p&gt;Two things landed close together that, read side by side, tell a clear story about where AI security tooling is going.&lt;/p&gt;

&lt;p&gt;First, the Argus Red team shipped a CLI-accessible model that they post-trained specifically for penetration testing. The pitch is simple: instead of a general-purpose model that refuses to explain how buffer overflows work, you get one that treats offensive security as the actual task. No jailbreaks, no prompt engineering gymnastics. The model was trained to do the job.&lt;/p&gt;

&lt;p&gt;Second, there is a wave of reporting around Claude's Fable 5 (a research configuration of Claude) being used to find thousands of zero-days across real codebases. The implication is that when you remove the general-purpose safety floor and retrain or configure a model for a narrow, high-stakes domain, you get capability that base models will not give you.&lt;/p&gt;

&lt;p&gt;Together these signal a real inflection point: domain-specific post-training for adversarial tasks is no longer a research curiosity. It is shipping.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Technical Detail That Matters
&lt;/h2&gt;

&lt;p&gt;Post-training here is doing a lot of work, and it is worth being precise about what that means architecturally.&lt;/p&gt;

&lt;p&gt;General-purpose RLHF bakes in refusal behavior as a broad prior. The model learns that "anything that sounds like hacking" belongs to a category of things it should decline. That is not a capability limit; it is a behavior trained on top of the capability. Post-training for a specific domain can shift that prior without retraining from scratch. You fine-tune on domain-specific data, adjust the reward signal to treat "correctly identifying and demonstrating a vulnerability" as the positive outcome, and do all of this on top of an existing base model.&lt;/p&gt;

&lt;p&gt;The Argus Red approach appears to follow this pattern. They are not claiming a new architecture. They are claiming a different training objective applied to a capable base. The zero-day story with Claude Fable 5 involves a different mechanism (it sounds more like a heavily prompted or configured deployment than a fine-tune), but the outcome is similar: a model that operates in the security domain without the general refusal behavior getting in the way.&lt;/p&gt;

&lt;p&gt;The failure mode to watch here is scope collapse. A model post-trained to be maximally helpful for pen testing needs extremely tight deployment controls. If that same model ends up in a context where it is answering general user questions, you have a problem. The safety guardrails you removed were doing work in those other contexts even if they were annoying in the security context.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means for Builders
&lt;/h2&gt;

&lt;p&gt;If you are running a multi-tenant AI platform, this is a direct architectural concern. The emerging pattern is that you will have a portfolio of models: a general-purpose model for most tasks, and domain-specific post-trained models for high-stakes narrow domains. Your routing layer needs to understand which model is appropriate for which tenant and which request type.&lt;/p&gt;
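
&lt;p&gt;A minimal sketch of that routing decision, assuming two hypothetical model names and a per-tenant clearance flag (neither comes from Argus Red or any real platform). The point is only that the security-tuned model is chosen when both the tenant and the request type call for it, and the general-purpose model is the default otherwise.&lt;/p&gt;

```python
from dataclasses import dataclass

# Hypothetical model identifiers; substitute whatever your platform serves.
GENERAL_MODEL = "general-purpose"
SECURITY_MODEL = "security-post-trained"


@dataclass(frozen=True)
class Tenant:
    tenant_id: str
    # Assumed flag: set only for tenants vetted for offensive-security work.
    security_cleared: bool = False


def select_model(tenant, request_type):
    """Return the model name for this tenant and request type.

    The security-tuned model is used only when the tenant is cleared AND
    the request is explicitly a pen-test task. Everything else falls back
    to the general-purpose model.
    """
    if request_type == "pentest" and tenant.security_cleared:
        return SECURITY_MODEL
    return GENERAL_MODEL
```

&lt;p&gt;Defaulting to the general model means a misclassified request or an unvetted tenant never reaches the model with the refusal behavior removed.&lt;/p&gt;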

&lt;p&gt;For agent and MCP systems, the implication is more immediate. Security automation agents that can actually test infrastructure, not just describe tests, are now buildable with off-the-shelf components. That changes the threat surface for any system that accepts LLM-generated tool calls. If your MCP server exposes file system or network tools, and your agent framework routes to a security-capable model, you need to think hard about what that model will do with those tool permissions.&lt;/p&gt;
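
&lt;p&gt;One way to reason about those tool permissions is a deny-by-default gate that sits between the model and the tools. The sketch below is not tied to any MCP SDK; the profile names, tool names, and lab targets are all made up for illustration.&lt;/p&gt;

```python
# Hypothetical model profiles and tool names, for illustration only.
ALLOWED_TOOLS = {
    "general": {"search_docs", "read_file"},
    "security": {"search_docs", "read_file", "port_scan"},
}

# Assumed allowlist of targets: lab hosts you control.
LAB_TARGETS = {"10.0.0.5", "ctf.lab.internal"}


def authorize_tool_call(profile, tool, target=None):
    """Return True only if this profile may call this tool on this target."""
    # Unknown profiles get an empty tool set, so they are denied everything.
    if tool not in ALLOWED_TOOLS.get(profile, set()):
        return False
    # Network-facing tools must also name an approved lab target.
    if tool == "port_scan":
        return target in LAB_TARGETS
    return True
```

&lt;p&gt;The check runs on every tool call the model proposes, so a security-capable model still cannot scan a host outside the approved set.&lt;/p&gt;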

&lt;p&gt;For RAG pipeline builders, this is a reminder that retrieval context can activate capabilities. A security-tuned model that retrieves exploit documentation from a knowledge base and then calls a code execution tool carries a very different risk profile than a general model doing the same thing.&lt;/p&gt;

&lt;h2&gt;
  
  
  One Thing to Do Today
&lt;/h2&gt;

&lt;p&gt;Go test the Argus Red CLI at argusred.com/cli against a CTF target or a lab environment you control. Do not just read about it. Actually watch what the model does versus what GPT-4o or Claude does with the same prompt. The capability delta is the thing you need to see firsthand before you decide how to think about model selection in your own security tooling or agent infrastructure.&lt;/p&gt;

&lt;p&gt;Follow this blog for daily breakdowns of what is actually shipping in AI engineering.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.argusred.com/cli" rel="noopener noreferrer"&gt;Show HN: We post-trained a model that pen tests instead of refusing&lt;/a&gt; - HackerNews&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>technology</category>
      <category>programming</category>
    </item>
    <item>
      <title>The AI Hardware Stack Is Being Rebuilt From the Wafer Up</title>
      <dc:creator>theAIGeek</dc:creator>
      <pubDate>Sat, 20 Jun 2026 20:06:54 +0000</pubDate>
      <link>https://dev.to/ai_geek/the-ai-hardware-stack-is-being-rebuilt-from-the-wafer-up-4n17</link>
      <guid>https://dev.to/ai_geek/the-ai-hardware-stack-is-being-rebuilt-from-the-wafer-up-4n17</guid>
      <description>&lt;h1&gt;
  
  
  The AI Hardware Stack Is Being Rebuilt From the Wafer Up
&lt;/h1&gt;

&lt;p&gt;Before a single H100 ever runs a training job, it has to survive one of the most constrained supply chains in industrial history. Every serious AI accelerator (H100, B200, Cerebras WSE-3) starts its life on a TSMC wafer, gets etched by an ASML EUV machine, and then waits in a queue for CoWoS packaging capacity that is sold out through 2026. Understanding that stack matters if you are building on top of it, because the constraints at the bottom determine what compute costs, what latency looks like, and which architectural bets actually pay off.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Factory Floor Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;TSMC holds 72% of advanced chip manufacturing. That is not a market share number you diversify around quickly. And ASML sits underneath that with a near-monopoly on EUV lithography, the machines that print sub-5nm features. No ASML machines means no advanced chips, full stop. Every H100 and B200 in existence ran through both companies.&lt;/p&gt;

&lt;p&gt;But the real chokepoint right now is not transistors. It is CoWoS packaging, the process that physically stacks High Bandwidth Memory next to the compute die on a shared substrate. HBM is what gives these chips their memory bandwidth, and without CoWoS you cannot build them. That packaging capacity is sold out through 2026. TSMC is spending $52-56 billion in capex in 2026 alone, with 70-80% going toward advanced nodes, and it is still not enough to clear the queue.&lt;/p&gt;

&lt;p&gt;AI accelerator wafer demand is up 11x between 2022 and 2026. That is not a demand spike. That is a structural shift. The shortage is not a supply hiccup that clears in two quarters. Plan accordingly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why GPUs Are Overkill for Inference
&lt;/h2&gt;

&lt;p&gt;NVIDIA dominates AI training with the H100 and B200. That dominance is real and it is deserved for the workload it was designed for. Training is a throughput problem. You want to run massive matrix multiplications in parallel across a huge cluster, and GPU architecture with HBM is genuinely excellent at that.&lt;/p&gt;

&lt;p&gt;Inference is a different problem. You are generating tokens sequentially, moving activations around constantly, and the latency per token matters more than raw FLOP throughput. When you run inference on a GPU cluster, you are paying for training-optimized silicon and spending a lot of cycles on inter-chip communication overhead that adds latency without adding value.&lt;/p&gt;

&lt;p&gt;The growing recognition in the industry is that inference needs its own architecture, not a repurposed training chip.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Cerebras Actually Built
&lt;/h2&gt;

&lt;p&gt;Cerebras took one of the most contrarian bets in hardware: build one chip the size of an entire silicon wafer. The WSE-3 has 4 trillion transistors, 900,000 cores, and 21 PB/s of memory bandwidth. The architectural insight is simple. If everything is on one die, you eliminate inter-chip communication entirely. There is no network fabric moving activations between GPUs. It is just one enormous on-chip compute surface.&lt;/p&gt;

&lt;p&gt;The benchmark results are hard to dismiss. The WSE-3 is 21x faster than the NVIDIA B200 on Llama 3 70B reasoning workloads. It hits 2,500 tokens per second per user on Llama 4 Maverick at 400 billion parameters, more than double the B200. SemiAnalysis pegs the cost per inference token at 32% lower than B200.&lt;/p&gt;

&lt;p&gt;OpenAI clearly took this seriously. In December 2025 they signed a $20B+ Master Relationship Agreement with Cerebras for 750 MW of inference capacity, expandable to 2 GW. Codex-Spark went live on Cerebras infrastructure in February 2026. When OpenAI is diversifying its inference supply away from NVIDIA, that is a signal worth paying attention to.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means for Builders
&lt;/h2&gt;

&lt;p&gt;If you are running a RAG pipeline, an agent framework, or a multi-tenant LLM platform, compute costs are already your biggest line item and latency is your primary SLA lever. The Cerebras numbers matter here specifically because multi-tenant inference platforms live or die on tokens-per-second-per-user at scale. A 2x throughput improvement at 32% lower cost per token changes your unit economics in a meaningful way.&lt;/p&gt;

&lt;p&gt;The more important shift is architectural. You should not be modeling your infrastructure around a single compute provider. The inference layer is fracturing. NVIDIA still owns training. But for latency-sensitive inference workloads, purpose-built silicon is catching up fast. Design your deployment layer to be provider-agnostic now, before you are locked in.&lt;/p&gt;
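
&lt;p&gt;A rough sketch of what provider-agnostic can mean in code: application logic talks to one small interface, and each vendor sits behind an adapter. The class names and the echo backend here are hypothetical stand-ins, not any vendor's SDK.&lt;/p&gt;

```python
from abc import ABC, abstractmethod


class InferenceProvider(ABC):
    """Interface each backend adapter implements (hypothetical)."""

    @abstractmethod
    def complete(self, prompt):
        """Return the generated text for a prompt."""


class EchoProvider(InferenceProvider):
    """Stand-in backend so the sketch runs without network access."""

    def __init__(self, name):
        self.name = name

    def complete(self, prompt):
        return f"[{self.name}] {prompt}"


class InferenceClient:
    """Application code depends on this client, never on a vendor SDK."""

    def __init__(self, providers, default):
        self._providers = providers
        self._default = default

    def complete(self, prompt, provider=None):
        # Fall back to the default backend when the caller names none.
        backend = self._providers[provider or self._default]
        return backend.complete(prompt)


client = InferenceClient(
    {"gpu": EchoProvider("gpu"), "wafer": EchoProvider("wafer")},
    default="gpu",
)
```

&lt;p&gt;Swapping or adding a backend then means writing one adapter and changing a config key, not touching every call site.&lt;/p&gt;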

&lt;h2&gt;
  
  
  One Thing to Do Today
&lt;/h2&gt;

&lt;p&gt;Pull your current inference cost per 1,000 tokens and your p95 latency from the last 30 days, then run the same prompt workload against Cerebras Cloud on a free tier or trial. Put the numbers side by side. Do not trust the benchmarks blindly. Run your actual workload.&lt;/p&gt;
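
&lt;p&gt;The side-by-side comparison can be as simple as the helpers below. This is a sketch that assumes you already have per-request latencies and total spend from your own logs; the function names and the metric keys are illustrative.&lt;/p&gt;

```python
def p95(latencies_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.95 * n, giving a 1-based rank.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def cost_per_1k_tokens(total_cost_usd, total_tokens):
    """Average spend per 1,000 tokens over the measurement window."""
    return total_cost_usd / total_tokens * 1000


def compare(current, candidate):
    """Ratio of candidate to current per metric; below 1.0 is an improvement."""
    return {key: candidate[key] / current[key] for key in current}
```

&lt;p&gt;Feed both providers the same prompts, compute the two metrics for each, and pass the resulting dictionaries to &lt;code&gt;compare&lt;/code&gt; to see the ratio on your own workload.&lt;/p&gt;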

&lt;p&gt;Follow along here for daily posts on what is actually changing in AI engineering infrastructure, and what it means for the systems you are building.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>technology</category>
      <category>programming</category>
    </item>
    <item>
      <title>Cloudflare's Ephemeral Agent Accounts Are a Real Solution to a Real Identity Problem</title>
      <dc:creator>theAIGeek</dc:creator>
      <pubDate>Sat, 20 Jun 2026 19:55:25 +0000</pubDate>
      <link>https://dev.to/ai_geek/cloudflares-ephemeral-agent-accounts-are-a-real-solution-to-a-real-identity-problem-54gi</link>
      <guid>https://dev.to/ai_geek/cloudflares-ephemeral-agent-accounts-are-a-real-solution-to-a-real-identity-problem-54gi</guid>
      <description>&lt;h1&gt;
  
  
  Cloudflare's Ephemeral Agent Accounts Are a Real Solution to a Real Identity Problem
&lt;/h1&gt;

&lt;p&gt;The hardest part of building agentic systems isn't the LLM calls — it's giving agents a coherent identity without creating a security nightmare. Cloudflare just shipped something that addresses this directly, and it's more architecturally interesting than the headline suggests.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Happened
&lt;/h2&gt;

&lt;p&gt;Cloudflare announced temporary accounts for AI agents: short-lived, scoped Cloudflare accounts that an agent can spin up, use, and destroy, all programmatically. The use case is agents that need to do real internet-facing work (proxying traffic, making external API calls, handling DNS, running Workers) without those operations being tied to your primary Cloudflare account or persisting beyond the task lifecycle.&lt;/p&gt;

&lt;p&gt;This isn't just "create a token with a TTL." It's a full account-level isolation boundary, provisioned via API, with its own Workers namespace, its own DNS scope, and its own billing context. The account tears itself down when the agent is done.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Technical Detail That Matters
&lt;/h2&gt;

&lt;p&gt;The key design choice here is &lt;strong&gt;account-level isolation rather than token-level scoping&lt;/strong&gt;. That distinction matters enormously.&lt;/p&gt;

&lt;p&gt;Token-scoped access (what most teams default to with API keys) limits &lt;em&gt;what&lt;/em&gt; an agent can do but doesn't isolate the &lt;em&gt;blast radius&lt;/em&gt; of what it does. If an agent with a scoped token misbehaves — misroutes traffic, burns through rate limits, gets its IP flagged — the damage lands on your account. Your reputation, your quotas, your relationship with downstream services.&lt;/p&gt;

&lt;p&gt;Account-level isolation means the ephemeral account is the blast radius. You can let an agent run Workers, proxy traffic through Cloudflare's edge, or spin up DNS records, and if it does something stupid or gets compromised, it's contained. The parent account is untouched. The ephemeral account expires and gets GC'd.&lt;/p&gt;

&lt;p&gt;This is the same reasoning behind process isolation in operating systems. You don't give every subprocess root access with a limited scope — you run it in its own process with its own uid. Cloudflare is applying that model to agent infrastructure.&lt;/p&gt;

&lt;p&gt;The billing isolation is also worth noting. If you're running agents on behalf of customers (multi-tenant), each agent's resource consumption can be tracked and attributed at the account level, not just estimated from aggregate logs. That's actually useful for cost attribution in a multi-tenant platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means for Builders
&lt;/h2&gt;

&lt;p&gt;If you're building a &lt;strong&gt;multi-tenant AI platform&lt;/strong&gt; — the kind where each customer's agent does work on their behalf — you've probably already hit the identity problem. You're either running everything under one account (bad: shared blast radius, no attribution) or manually managing per-customer credentials (bad: operational overhead, credential sprawl). Cloudflare's ephemeral accounts are a cleaner third option for any workloads that touch Cloudflare's surface area.&lt;/p&gt;

&lt;p&gt;For &lt;strong&gt;agent and MCP systems&lt;/strong&gt;: most agent frameworks today treat network identity as an afterthought. The agent gets your API key, and you hope it doesn't do anything wild. Ephemeral accounts give you a way to hand an agent a real, functional identity with time-bounded scope — then revoke it by just not renewing it. That's a much better control plane than trying to revoke individual tokens mid-flight.&lt;/p&gt;
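
&lt;p&gt;The lifecycle described above (create, use, tear down) maps naturally onto a context manager. The sketch below uses an in-memory stand-in rather than Cloudflare's actual API, so &lt;code&gt;create_account&lt;/code&gt;, &lt;code&gt;delete_account&lt;/code&gt;, and the TTL parameter are assumptions for illustration only.&lt;/p&gt;

```python
from contextlib import contextmanager
import uuid


class InMemoryProvisioner:
    """Stand-in for a real account API; method names are hypothetical."""

    def __init__(self):
        self.active = set()

    def create_account(self, ttl_seconds):
        # This stub ignores the TTL; a real service would enforce expiry.
        account_id = f"tmp-{uuid.uuid4().hex[:8]}"
        self.active.add(account_id)
        return account_id

    def delete_account(self, account_id):
        self.active.discard(account_id)


@contextmanager
def ephemeral_account(provisioner, ttl_seconds=900):
    """Create an isolated account for one task and always tear it down."""
    account_id = provisioner.create_account(ttl_seconds)
    try:
        yield account_id
    finally:
        # Runs even if the agent task raises, so nothing outlives the task.
        provisioner.delete_account(account_id)
```

&lt;p&gt;Wrapping each agent task in &lt;code&gt;with ephemeral_account(provisioner) as account_id:&lt;/code&gt; keeps teardown out of the agent's hands, with the TTL as a backstop if the orchestrating process dies.&lt;/p&gt;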

&lt;p&gt;For &lt;strong&gt;RAG pipelines&lt;/strong&gt; the connection is less direct, but if your retrieval agents are hitting external services, caching responses at the edge, or proxying to data sources through Cloudflare Workers (which is a real pattern for rate-limit management), you now have a path to per-task or per-tenant isolation without standing up new infrastructure.&lt;/p&gt;

&lt;p&gt;The deeper implication: this is Cloudflare betting that agents are first-class internet citizens, not just bots. They need accounts, not just keys. That's the right mental model, and I expect other infra providers to follow.&lt;/p&gt;

&lt;h2&gt;
  
  
  One Thing to Do Today
&lt;/h2&gt;

&lt;p&gt;Read the &lt;a href="https://blog.cloudflare.com/temporary-accounts/" rel="noopener noreferrer"&gt;Cloudflare Temporary Accounts API docs&lt;/a&gt; and map them against your current agent identity model. Specifically: what's your blast radius if one of your agents gets compromised or runs amok? If the answer is "my whole account," that's the thing to fix, and ephemeral accounts are worth prototyping against even if you don't deploy them yet.&lt;/p&gt;




&lt;p&gt;Follow along for daily takes on what's actually moving in AI infrastructure — no hype, just what matters for builders.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>technology</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
