<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Breach Protocol</title>
    <description>The latest articles on DEV Community by Breach Protocol (@breachprotocol).</description>
    <link>https://dev.to/breachprotocol</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F4011142%2F75acff13-c02f-4eac-8904-cf3f4f9d836f.jpg</url>
      <title>DEV Community: Breach Protocol</title>
      <link>https://dev.to/breachprotocol</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/breachprotocol"/>
    <language>en</language>
    <item>
      <title>Microsoft's CEO Says the AI Industry Has Not Earned the Right to Do This</title>
      <dc:creator>Breach Protocol</dc:creator>
      <pubDate>Wed, 01 Jul 2026 20:02:15 +0000</pubDate>
      <link>https://dev.to/breachprotocol/microsofts-ceo-says-the-ai-industry-has-not-earned-the-right-to-do-this-3jb9</link>
      <guid>https://dev.to/breachprotocol/microsofts-ceo-says-the-ai-industry-has-not-earned-the-right-to-do-this-3jb9</guid>
      <description>&lt;p&gt;Satya Nadella named OpenAI and Anthropic directly and said the AI industry "has not earned the right to do what it is doing to the economy." In a Wall Street Journal interview reported by &lt;a href="https://www.techtimes.com/articles/318809/20260621/nadella-names-openai-anthropic-ai-giants-must-earn-societal-permission.htm" rel="noopener noreferrer"&gt;Tech Times&lt;/a&gt;, Microsoft's chief executive argued that AI companies cannot simultaneously forecast mass white-collar job loss and demand vast resources with a light regulatory touch. His blunt line: "You can't say, hey, all white-collar jobs are gone and this could even be a weapon and we will use all the power to build data centers."&lt;/p&gt;

&lt;h3&gt;Key facts&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What:&lt;/strong&gt; In a Wall Street Journal interview, Satya Nadella named OpenAI and Anthropic -- two companies Microsoft has poured billions into -- and warned that an economy reshaped by a handful of AI models will not survive politically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When:&lt;/strong&gt; 2026-06-24&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Primary source:&lt;/strong&gt; &lt;a href="https://www.techtimes.com/articles/318809/20260621/nadella-names-openai-anthropic-ai-giants-must-earn-societal-permission.htm" rel="noopener noreferrer"&gt;read the source&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The concept Nadella is pressing on is what industries like mining and energy call a "social license to operate" — not a law or a permit, but the informal, ongoing approval a society extends to an industry, the general sense that what it is doing is acceptable. When that approval runs out, the reckoning does not arrive as a polite warning. It arrives as bans, taxes, and political movements that rewrite the rules of an entire sector overnight. Nadella's argument is that AI is spending this kind of public goodwill fast, and not putting anything back.&lt;/p&gt;

&lt;p&gt;His chosen analogy is pointed. He compares AI to the early decades of globalization, when manufacturing moved offshore. The national statistics looked fine — overall growth held up — but specific towns lost the factories, the supplier networks, and the accumulated know-how that had made them work, and the damage is still felt. Nadella's warning is that AI could do the same thing to knowledge work, hollowing out whole categories of white-collar jobs while the top-line economic numbers stay healthy, and doing it faster than globalization ever did. The contradiction he is pressing on: the leading labs publicly forecast that AI will eliminate large swaths of jobs, while simultaneously asking for enormous resources and a light regulatory touch. "If all the value is accrued by only a few models," he said, "the political economy will simply not tolerate it. There is no societal permission for an AI future that hollows out entire industries."&lt;/p&gt;

&lt;p&gt;The interview escalated a theme Nadella had opened a week earlier, in a personal essay posted to X titled "A frontier without an ecosystem is not stable," which reportedly drew more than sixty million views. Independent analysis cited in the coverage finds the AI model market already converging on a few dominant players, with Anthropic, OpenAI, and Google holding the lion's share between them. A future where every company in every sector quietly hands its value to two or three model providers is the outcome Nadella says the public will eventually refuse.&lt;/p&gt;

&lt;p&gt;There is a strategic read of all this, and it is worth naming. Microsoft sells the platform layer — the cloud, the developer tools, the governance plumbing — that sits between businesses and whichever AI model they use. If frontier models become interchangeable commodities that companies can swap in and out, Microsoft's orchestration layer becomes more valuable, not less. Microsoft has also started building its own in-house models to reduce its dependence on its partners. A call for a more diverse, less concentrated AI ecosystem happens to align neatly with Microsoft's commercial interest. The concern can be genuine and self-serving at the same time, and both readings are probably true.&lt;/p&gt;

&lt;p&gt;It is the most pointed challenge yet to the dominant labs, and it comes from inside the tent rather than from a critic on the outside. It also lands in a month already full of evidence for his thesis — a government that can &lt;a href="https://groundtruth.day/news//news/the-government-pulled-a-frontier-model.html" rel="noopener noreferrer"&gt;make a frontier model disappear overnight&lt;/a&gt;, enterprises discovering that AI bills scale in alarming ways, and a steady drumbeat of disclosures that the labs' own models now &lt;a href="https://groundtruth.day/news//news/claude-now-writes-most-of-anthropics-own-code.html" rel="noopener noreferrer"&gt;write most of their code&lt;/a&gt;. The practical hedge Nadella points toward is the same one the rest of the industry is reaching for: do not bet everything on a single provider you cannot control, which is a large part of why downloadable &lt;a href="https://groundtruth.day/news//learn/open-weight-models.html" rel="noopener noreferrer"&gt;open-weight models&lt;/a&gt; keep gaining ground. The caveat for readers is simply to hold the strategic angle in view: this is a sincere warning that also happens to describe a world in which Microsoft wins.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://groundtruth.day/news/microsofts-ceo-says-the-ai-industry-has-not-earned-the-right.html" rel="noopener noreferrer"&gt;Ground Truth&lt;/a&gt;, where every claim is checked against the primary source.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>microsoft</category>
      <category>openai</category>
      <category>anthropic</category>
      <category>aieconomics</category>
    </item>
    <item>
      <title>Microsoft's new memory system lets AI agents remember more by storing less</title>
      <dc:creator>Breach Protocol</dc:creator>
      <pubDate>Wed, 01 Jul 2026 19:58:29 +0000</pubDate>
      <link>https://dev.to/breachprotocol/microsofts-new-memory-system-lets-ai-agents-remember-more-by-storing-less-5fn1</link>
      <guid>https://dev.to/breachprotocol/microsofts-new-memory-system-lets-ai-agents-remember-more-by-storing-less-5fn1</guid>
      <description>&lt;p&gt;Microsoft Research has released Memora, a memory system for AI agents, along with public code on &lt;a href="https://github.com/microsoft/Memora" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. Memora lets agents recall more by storing and searching memories more cleverly—attaching short labels to each memory and searching only those labels, then pulling up the full detail on demand—rather than stuffing entire conversation histories back into the &lt;a href="https://groundtruth.day/news//learn/context-windows.html" rel="noopener noreferrer"&gt;context window&lt;/a&gt; every time.&lt;/p&gt;

&lt;h3&gt;Key facts&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What:&lt;/strong&gt; Memora keeps the rich detail of a conversation but searches it using tiny labels of six to eight words, cutting token use by up to 98 percent compared with replaying the full history. The code is public.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When:&lt;/strong&gt; 2026-06-29&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Primary source:&lt;/strong&gt; &lt;a href="https://www.microsoft.com/en-us/research/blog/memora-a-harmonic-memory-representation-balancing-abstraction-and-specificity/" rel="noopener noreferrer"&gt;read the source&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Language models are fundamentally forgetful: each session, they only know what sits in the context window, and once a conversation grows long, early details fall off the edge. Two common fixes both fall short. Stuffing the entire history back in on every turn gets expensive and degrades quality, since models lose track of details buried in a huge wall of text. Aggressively summarizing the past is cheap but throws away the specific details you might need later. You're stuck choosing between remembering everything badly or remembering a blurry sketch.&lt;/p&gt;

&lt;p&gt;Memora separates what you store from how you find it. For each memory, it keeps the full rich content—the memory's body—and attaches a tiny label: a six-to-eight-word phrase that captures the gist, plus a few context-aware tags it calls "cue anchors." When the agent searches its memory, it searches only the tiny labels, not the full bodies. Once it finds the right label, it pulls up the full detail behind it.&lt;/p&gt;
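
&lt;p&gt;As a rough illustration of the search-the-label, fetch-the-body split described above, here is a minimal Python sketch. It is not Memora's code or API: the &lt;code&gt;Memory&lt;/code&gt; class, the &lt;code&gt;search_labels&lt;/code&gt; function, the sample memories, and the word-overlap scoring are all assumptions made up for this example, standing in for whatever matching the real system uses.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    """One stored memory: a short label, a few cue anchors, and the full body."""
    label: str   # short gist phrase (six to eight words in the article's description)
    body: str    # full detail, read only after a label has matched
    cue_anchors: set = field(default_factory=set)  # extra tags (hypothetical values below)


def words(text):
    """Lowercased word set used by the toy overlap score."""
    return set(text.lower().split())


def search_labels(store, query, top_k=1):
    """Rank memories by word overlap between the query and label plus anchors.

    The bodies are never scanned here. Word overlap is an assumed stand-in
    for the real matching method, which the article does not specify.
    """
    query_words = words(query)

    def score(memory):
        searchable = words(memory.label).union(memory.cue_anchors)
        return len(query_words.intersection(searchable))

    ranked = sorted(store, key=score, reverse=True)
    # Drop candidates that share no words with the query at all.
    return [m for m in ranked[:top_k] if score(m) != 0]


# Hypothetical memories, invented for illustration only.
store = [
    Memory(
        label="user prefers dark roast coffee beans",
        body="The user said they switched from a light roast to a dark roast "
             "and now orders whole beans from a local roaster every two weeks.",
        cue_anchors={"coffee", "preferences"},
    ),
    Memory(
        label="project deadline moved to next friday",
        body="The team agreed to move the release deadline by one week "
             "after the integration tests uncovered a blocking bug.",
        cue_anchors={"project", "schedule"},
    ),
    Memory(
        label="user deploys service with docker compose",
        body="The user runs the service locally and in staging with a "
             "single compose file that starts the API and the database.",
        cue_anchors={"deployment"},
    ),
]

hits = search_labels(store, "which coffee does the user prefer")
for hit in hits:
    # Only now, after the label matched, is the full body pulled up.
    print(hit.label)
    print(hit.body)
```

&lt;p&gt;The point the sketch tries to make is that the ranking step touches only the short labels and cue anchors, so in this toy version the cost of searching does not depend on how long the stored bodies are.&lt;/p&gt;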

&lt;p&gt;The analogy is a library card catalog. You don't find a book by speed-reading every volume on the shelves; you flip through the index cards, each a few lines long, until you land on the right one, then go pull the actual book. Memora gives every memory a card. New information on an existing topic can be merged into the card that already covers it, so the system avoids the fragmentation that plagues simpler memory tools, where the same subject ends up scattered across dozens of disconnected entries. A "policy-guided retriever" can also hop from one card to related ones through those cue anchors, chasing a chain of connected memories the way a person follows a train of thought—a more capable cousin of &lt;a href="https://groundtruth.day/news//learn/retrieval-augmented-generation.html" rel="noopener noreferrer"&gt;retrieval-augmented generation&lt;/a&gt;, the standard technique for letting models look things up.&lt;/p&gt;

&lt;p&gt;On benchmarks that test whether an AI can recall facts from long, sprawling conversations, Memora claims a new best score, beating popular memory systems like Mem0 and plain retrieval. The efficiency gains are striking: it cuts token use by up to 98 percent compared with the stuff-everything-in approach, and it stores roughly half as many entries per conversation as Mem0—because merging beats fragmenting. The retriever can be hand-prompted, or trained into a small dedicated model so it runs cheaply.&lt;/p&gt;

&lt;p&gt;Durable memory is the missing piece for agents that work alongside you over weeks or months—a coding assistant that remembers your project's history, a workplace tool that accumulates institutional knowledge. Doing that without re-paying for the entire history on every turn is what makes long-term collaboration economically practical, and an open implementation means others can build on it directly.&lt;/p&gt;

&lt;p&gt;The honest caveat: "98 percent fewer tokens" is measured against the most wasteful baseline—dumping the full context every time. Against other smart memory systems, the margin is real but much narrower, and memory benchmarks have been a fast-moving, somewhat gameable target where today's record rarely lasts. The good news is that the code is public, so Memora's claims are checkable rather than just announced. For anyone tracking &lt;a href="https://groundtruth.day/news//learn/agent-memory.html" rel="noopener noreferrer"&gt;what an AI agent should remember&lt;/a&gt;, it's a concrete, testable step rather than another closed black box.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://groundtruth.day/news/microsoft-memora-agent-memory-on-tiny-labels.html" rel="noopener noreferrer"&gt;Ground Truth&lt;/a&gt;, where every claim is checked against the primary source.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agentmemory</category>
      <category>microsoft</category>
      <category>agents</category>
      <category>retrieval</category>
    </item>
    <item>
      <title>Meta reads full sentences from brain waves - without surgery</title>
      <dc:creator>Breach Protocol</dc:creator>
      <pubDate>Wed, 01 Jul 2026 19:54:44 +0000</pubDate>
      <link>https://dev.to/breachprotocol/meta-reads-full-sentences-from-brain-waves-without-surgery-18io</link>
      <guid>https://dev.to/breachprotocol/meta-reads-full-sentences-from-brain-waves-without-surgery-18io</guid>
      <description>&lt;p&gt;Meta's AI research group has built a non-surgical brain-reading system that recovers typed sentences with about 61% of words correct, up from roughly 8% for prior non-surgical methods — closing most of the accuracy gap with approaches that require implanted electrodes. The system, called Brain2Qwerty v2, uses magnetoencephalography to decode brain activity into text in real time, and Meta has released the training code and data for other researchers to build on.&lt;/p&gt;

&lt;h3&gt;Key facts&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What:&lt;/strong&gt; A new version of Meta's brain-to-text system decodes typed sentences from magnetic brain signals far more accurately than before, closing much of the gap with implanted electrodes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When:&lt;/strong&gt; 2026-06-30&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Primary source:&lt;/strong&gt; &lt;a href="https://ai.meta.com/blog/brain2qwerty-brain-ai-human-communication/" rel="noopener noreferrer"&gt;read the source&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For people who have lost the ability to speak or type — through injury, stroke, or diseases like ALS — a system that turns thought into text could restore something close to normal communication. The most accurate approaches so far require brain surgery to place electrodes directly on or in the brain tissue. That gives a clean signal, but it is a major operation with real risks, and it will only ever reach a small number of people. The goal has been to get comparable accuracy from outside the skull, with no surgery at all.&lt;/p&gt;

&lt;p&gt;Brain2Qwerty v2 narrows that gap. It uses magnetoencephalography — a scanner that picks up the faint magnetic fields generated by the brain's electrical activity. As a person types, the system captures those magnetic signals and an AI model translates the patterns into the actual text being typed. The result: it recovers sentences coherently with about sixty-one percent of words correct. Prior non-surgical methods managed around eight percent — barely better than guessing and nowhere near usable. Going from eight percent to sixty-one percent is the difference between noise and something you could almost hold a conversation through. Meta says the pipeline works end to end and can decode sentences in real time, and the company released the training code and data so other researchers can build on it. The work is described in &lt;a href="https://ai.meta.com/blog/brain2qwerty-brain-ai-human-communication/" rel="noopener noreferrer"&gt;a post from Meta AI&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The AI model's task is to learn the statistical link between those chaotic magnetic patterns and the letters a person intends — the kind of find-the-signal-in-the-noise pattern-matching that modern neural networks excel at. Raw brain signals are extraordinarily messy: faint, noisy, and different from person to person and moment to moment. The system is not reading thoughts in any general sense; it is decoding the specific, physical brain activity that accompanies the motor act of typing, and mapping it back to characters.&lt;/p&gt;

&lt;p&gt;The essential caveat — which Meta is upfront about — is that the magnetic scanner that makes this work is room-sized, specialized laboratory equipment. It is not a headband, not a wearable, and not anything you could use at home or carry around. This is a research milestone about what is possible with non-surgical brain reading, not a product on the way to market. The value is in the proof: it shows you can get near-implant accuracy without cutting into the brain, which reframes what the goal even is. If the accuracy can be preserved as the hardware shrinks — a very big if, and likely years of work — it points toward a future where restoring communication does not require surgery. For now, the honest framing is a lab result that dramatically raised the ceiling on what reading the brain from the outside can achieve, while leaving the hard problem of doing it with practical, affordable equipment wide open. Even bounded that way, closing most of the gap to invasive methods is the kind of step that changes what researchers dare to aim for.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://groundtruth.day/news/meta-reads-words-from-brain-waves-without-surgery.html" rel="noopener noreferrer"&gt;Ground Truth&lt;/a&gt;, where every claim is checked against the primary source.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>meta</category>
      <category>neuroscience</category>
      <category>braincomputerinterface</category>
      <category>a11y</category>
    </item>
    <item>
      <title>Knowing when to quit is a skill AI agents badly lack</title>
      <dc:creator>Breach Protocol</dc:creator>
      <pubDate>Wed, 01 Jul 2026 19:50:59 +0000</pubDate>
      <link>https://dev.to/breachprotocol/knowing-when-to-quit-is-a-skill-ai-agents-badly-lack-450m</link>
      <guid>https://dev.to/breachprotocol/knowing-when-to-quit-is-a-skill-ai-agents-badly-lack-450m</guid>
      <description>&lt;p&gt;The paper "Agentic Abstention" (&lt;a href="https://arxiv.org/abs/2606.28733" rel="noopener noreferrer"&gt;arXiv&lt;/a&gt;) finds that AI agents系统性 fail at knowing when to stop: some stubbornly continue past the point of futility, while others thrash through pointless actions before quitting. Across thirteen AI systems and more than twenty-eight thousand tasks, the hard part is not whether agents can abstain but when — and larger, more capable models showed worse timely abstention, not better.&lt;/p&gt;

&lt;h3&gt;Key facts&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What:&lt;/strong&gt; New research finds AI agents are surprisingly bad at recognizing when a task is hopeless - and, oddly, bigger models are sometimes worse at stopping.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When:&lt;/strong&gt; 2026-06-30&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Primary source:&lt;/strong&gt; &lt;a href="https://arxiv.org/abs/2606.28733" rel="noopener noreferrer"&gt;read the source&lt;/a&gt; (arXiv 2606.28733)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Abstention means choosing not to answer or not to act, rather than plowing ahead. For a one-shot question, abstaining is simple — the model either answers or says "I do not know." But an agent works across many turns, using tools like browsers and terminals, and at each step it faces a richer choice: try to finish, give up, or go gather more information. Knowing which one is right, and when, is a genuine skill, separate from being good at the task itself. An agent can be brilliant at booking flights and still terrible at recognizing that the flight it was asked to book does not exist.&lt;/p&gt;

&lt;p&gt;The researchers tested thirteen AI systems across more than twenty-eight thousand tasks spanning online shopping, command-line work, and question answering. Some agents never quit when they should, stubbornly continuing long past the point of futility. Others thrash — performing many pointless actions before finally stopping — especially when a task looks doable at first and only reveals itself as impossible once the environment pushes back. Both failure modes are expensive: a stubborn agent wastes time and money and can take harmful actions, while a thrashing one burns resources flailing.&lt;/p&gt;

&lt;p&gt;The most counterintuitive result is that bigger is not better here. Larger, more capable models sometimes showed worse timely abstention — they were, if anything, more prone to overconfidently pressing on. That breaks the comfortable assumption that scaling up fixes everything; the judgment of when to give up appears to be a distinct capability that raw power does not automatically deliver, and may even work against. A more capable model is often a more confident model, and confidence is exactly the wrong instinct when a task has quietly become hopeless.&lt;/p&gt;

&lt;p&gt;The encouraging part is a fix that does not require retraining the model at all. The authors introduce a method that distills full records of past attempts into reusable stopping rules — compact lessons about when continuing tends to be pointless — and feeds those rules to the agent as guidance. On a shopping benchmark, it lifted one model's ability to quit at the right moment from roughly a quarter of the cases to well over half, more than doubling it, without touching the model's underlying parameters. A lot of the problem is not that the model is incapable of good stopping judgment, but that it is not being given the accumulated experience it needs to exercise it.&lt;/p&gt;

&lt;p&gt;This matters beyond efficiency. As agents get pointed at longer, higher-stakes work, an agent that does not know when to stop is a real hazard — it will keep taking actions in a situation it cannot resolve, and every extra action is a chance to make things worse. This connects directly to the wave of benchmarks this week showing that agents fail most long real-world tasks, covered in &lt;a href="https://groundtruth.day/news//news/the-best-ai-agents-still-fail-most-real-computer-tasks.html" rel="noopener noreferrer"&gt;the best AI agents still fail most real computer tasks&lt;/a&gt;: part of failing gracefully is failing at all, rather than churning forever.&lt;/p&gt;

&lt;p&gt;The honest caveat is that the stopping rules are learned from specific task environments, and rules distilled from online shopping may not transfer cleanly to, say, scientific research or software debugging — the skill of knowing when to quit might itself be domain-specific, needing fresh experience for each new setting. And measuring abstention well is genuinely hard, since the right moment to stop is often a judgment call even for a human. But the framing is the contribution. We have spent enormous effort teaching AI systems to act. This work is a reminder that teaching them to recognize when not to act — to know the difference between a hard problem and a hopeless one — is just as important, and right now they are not very good at it. For more on what makes something an agent in the first place, see our lesson on &lt;a href="https://groundtruth.day/news//learn/ai-agents.html" rel="noopener noreferrer"&gt;AI agents&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://groundtruth.day/news/knowing-when-to-quit-is-a-skill-ai-agents-lack.html" rel="noopener noreferrer"&gt;Ground Truth&lt;/a&gt;, where every claim is checked against the primary source.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>research</category>
      <category>agents</category>
      <category>reliability</category>
      <category>safety</category>
    </item>
    <item>
      <title>A robot AI that adapts to a moved camera by wiggling, not retraining</title>
      <dc:creator>Breach Protocol</dc:creator>
      <pubDate>Wed, 01 Jul 2026 19:47:07 +0000</pubDate>
      <link>https://dev.to/breachprotocol/a-robot-ai-that-adapts-to-a-moved-camera-by-wiggling-not-retraining-3mgh</link>
      <guid>https://dev.to/breachprotocol/a-robot-ai-that-adapts-to-a-moved-camera-by-wiggling-not-retraining-3mgh</guid>
      <description>&lt;p&gt;In-Context World Modeling lets a robot's AI adapt to a changed setup — a moved camera, a different robot arm — in a few seconds of exploratory movement, with no retraining. The robot performs brief, task-agnostic probing actions, and the model infers the new configuration from what it observes, building that understanding inside its existing context window without changing any internal weights.&lt;/p&gt;

&lt;h3&gt;Key facts&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What:&lt;/strong&gt; A new method lets robot policies figure out a changed setup from a few seconds of self-directed fiddling, so they keep working when the camera or robot body changes - with no retraining.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When:&lt;/strong&gt; 2026-06-29&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Primary source:&lt;/strong&gt; &lt;a href="https://arxiv.org/abs/2606.26025" rel="noopener noreferrer"&gt;read the source&lt;/a&gt; (arXiv 2606.26025)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Vision-language-action models, which take in what the robot sees and a task description and output actions, are powerful but brittle: shift the camera angle or swap in a slightly different arm and performance can collapse, because the model was trained on one specific setup and assumes the world still matches it. The usual fix is gathering new data and retraining or fine-tuning for each new configuration — slow, expensive, and impractical for robots that need to work when something changes.&lt;/p&gt;

&lt;p&gt;In-Context World Modeling reframes a new setup as something to figure out in the moment rather than retrain for. The robot performs a short burst of self-generated, task-agnostic interactions — small movements that probe how this particular system behaves — and the model reads that recent history to infer the essential variables: where the camera is now, how this arm moves, how the world responds to its actions. It builds this understanding inside its &lt;a href="https://groundtruth.day/news//learn/context-windows.html" rel="noopener noreferrer"&gt;context window&lt;/a&gt;, the working memory it already uses, without changing any of its internal weights.&lt;/p&gt;

&lt;p&gt;That no-weight-changes property is what makes it efficient, and it borrows from language models. Large chatbots can learn a new task from a couple of examples typed into the prompt — called in-context learning — without retraining. In-Context World Modeling ports that idea to physical control: the robot learns the new setup from a few interactions held in context, the same way a chatbot learns a format from a few examples. It is the difference between sending an experienced driver back to driving school every time they rent an unfamiliar car and letting them adjust the mirrors and feel out the pedals in the parking lot for thirty seconds first.&lt;/p&gt;

&lt;p&gt;The reported results show the method significantly outperforms standard vision-language-action baselines when the camera viewpoint is novel, in both simulation and on real robots. That is exactly the kind of everyday change — someone bumped the camera, you mounted it slightly differently — that breaks ordinary policies.&lt;/p&gt;

&lt;p&gt;Brittleness to setup changes is one of the biggest practical barriers to deploying robots outside carefully controlled labs. A method that adapts from a few seconds of probing, with no retraining, points toward robots that can be moved, reconfigured, or rebuilt without an engineering project each time. It is part of a broader wave of work on &lt;a href="https://groundtruth.day/news//learn/world-models.html" rel="noopener noreferrer"&gt;world models&lt;/a&gt; — AI that understands how environments behave — and a sign that the in-context-learning paradigm that transformed language AI is now reshaping robotics.&lt;/p&gt;

&lt;p&gt;The honest caveat is that in-context adaptation has a ceiling set by what the underlying model already implicitly knows. Wiggling to discover a moved camera works because the model has seen many camera angles; a truly alien robot body or a wildly out-of-distribution environment may still demand real retraining, because no amount of probing can teach the model something it has no prior basis to understand. For the common, mundane case of "same robot, the setup shifted a bit," though, skipping the retraining step is a genuine and useful win.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://groundtruth.day/news/in-context-world-modeling-robots-adapt-without-retraining.html" rel="noopener noreferrer"&gt;Ground Truth&lt;/a&gt;, where every claim is checked against the primary source.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>robotics</category>
      <category>worldmodels</category>
      <category>incontextlearning</category>
      <category>adaptation</category>
    </item>
    <item>
      <title>A small but elegant idea: putting 'experts' inside the attention layer</title>
      <dc:creator>Breach Protocol</dc:creator>
      <pubDate>Wed, 01 Jul 2026 19:43:22 +0000</pubDate>
      <link>https://dev.to/breachprotocol/a-small-but-elegant-idea-putting-experts-inside-the-attention-layer-2nh5</link>
      <guid>https://dev.to/breachprotocol/a-small-but-elegant-idea-putting-experts-inside-the-attention-layer-2nh5</guid>
      <description>&lt;p&gt;Grouped Query Experts (GQE) applies the mixture-of-experts routing trick to the attention layer of language models, matching baseline performance while activating only about half the query heads per word. The paper demonstrates that sparsely selecting query heads — while keeping all key-value heads active — preserves the memory savings of grouped-query attention and adds a new layer of computational efficiency. The catch: it has only been validated at small scale (~250M parameters), and whether the gain holds at tens or hundreds of billions remains an open question.&lt;/p&gt;

&lt;h3&gt;Key facts&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What:&lt;/strong&gt; Grouped Query Experts brings the mixture-of-experts trick into attention, activating only half a model's query heads per token while matching the full version -- at least at small scale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When:&lt;/strong&gt; 2026-06-24&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Primary source:&lt;/strong&gt; &lt;a href="https://arxiv.org/abs/2606.20945" rel="noopener noreferrer"&gt;read the source&lt;/a&gt; (arXiv 2606.20945)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A &lt;a href="https://groundtruth.day/news//learn/mixture-of-experts.html" rel="noopener noreferrer"&gt;mixture of experts&lt;/a&gt; is the idea that a giant model doesn't need to use all of itself for every word. Instead, it has many specialist sub-networks — experts — and a small router that, for each piece of text, wakes up only the few experts most relevant and leaves the rest asleep. You get the knowledge of a huge model while only paying to run a slice of it at a time. It's like a hospital: you don't summon every doctor for every patient; a triage nurse routes you to the cardiologist or the dermatologist as needed. This trick has powered many of the biggest recent models — it's the same family as &lt;a href="https://groundtruth.day/news//news/one-model-that-is-really-a-committee.html" rel="noopener noreferrer"&gt;one model that is really a committee&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Until now, this routing has almost always lived in one specific part of the model: the feed-forward layer, the chunk that does general processing after each step. The other major component — attention, the part that decides which earlier words matter for understanding the current one — has been left fully on, all the time.&lt;/p&gt;

&lt;p&gt;GQE changes that. It brings the experts-and-router idea into the attention layer itself. Attention works through query heads (which ask "what am I looking for?") and key-value heads (which hold "here is what's available"). GQE adds a router that, for each word, wakes up only some of the query heads — the relevant specialists — while keeping all the key-value heads on. That last detail matters: the key-value heads are the expensive ones to store and the ones that govern how much memory a long conversation eats, which connects directly to why models have limited &lt;a href="https://groundtruth.day/news//learn/context-windows.html" rel="noopener noreferrer"&gt;context windows&lt;/a&gt;. By leaving those alone and only thinning out the query side, GQE keeps the memory savings that made grouped-query attention popular in the first place, while adding a new layer of selectivity on top.&lt;/p&gt;
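
&lt;p&gt;The routing idea described above can be sketched in a few lines of plain Python. This is a toy illustration rather than the paper's implementation: the top-k selection rule, the function names, and the router weights are all assumptions made for the example. It shows a router scoring eight query heads for one token, keeping the four best, and mapping each survivor to the key-value head it shares under grouped-query attention.&lt;/p&gt;

```python
# Toy sketch of routing query heads while every key-value head stays on.
# Top-k selection and all names here are assumptions, not the paper's code.

def route_query_heads(token_vec, router_weights, k):
    """Return the k query heads with the highest router score for one token."""
    scores = [sum(w * x for w, x in zip(row, token_vec)) for row in router_weights]
    ranked = sorted(range(len(scores)), key=lambda h: scores[h], reverse=True)
    return sorted(ranked[:k])


def kv_head_for(query_head, n_query_heads, n_kv_heads):
    """Grouped-query attention: each block of query heads shares one key-value head."""
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size


# Hypothetical router: one row of weights per query head (8 heads, 3 features).
ROUTER = [
    [0.9, 0.1, 0.0],
    [0.1, 0.5, 0.4],
    [0.2, 0.0, 0.7],
    [0.8, 0.3, 0.1],
    [0.0, 0.2, 0.6],
    [0.5, 0.9, 0.0],
    [0.3, 0.1, 0.2],
    [0.1, 0.1, 0.9],
]
TOKEN = [1.0, 0.0, -1.0]  # made-up token representation

# Only half of the query heads run for this token.
active = route_query_heads(TOKEN, ROUTER, k=4)

# Keys and values are still computed and cached for every key-value head;
# this only records which shared head each active query head reads from.
kv_read = sorted({kv_head_for(h, 8, 2) for h in active})

print(active)   # [0, 3, 5, 6]
print(kv_read)  # [0, 1]
```

&lt;p&gt;In this sketch the memory cost is set by the two key-value heads, which are untouched by routing, while the per-token query work scales with the number of heads the router keeps.&lt;/p&gt;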

&lt;p&gt;The result: GQE matched the performance of a model that keeps all its query heads active, while only switching on about half of them for each word. Same quality, roughly half the work in that part of the model. In a field where efficiency gains often cost a little accuracy, matching the baseline at half the activation is a clean win.&lt;/p&gt;

&lt;p&gt;Attention is one of the two pillars of every modern language model, and it has been comparatively untouched by the mixture-of-experts revolution that reshaped the other pillar. Making attention sparse the same way — only paying for the heads you need — opens a new direction for making big models cheaper to run without making them dumber. Inference cost is the dominant expense for anyone deploying these models at scale, so even modest, compounding savings in a core component are worth a lot.&lt;/p&gt;

&lt;p&gt;The caveat is the whole ballgame for this kind of result. The experiments were run at small scale — a roughly 250-million-parameter model trained on a fixed, modest amount of data. That is a perfectly reasonable place to test an idea, and the comparison was done fairly, head to head against the standard approach at matched cost. But the history of model architecture is littered with tricks that shine at small scale and quietly stop helping — or even start hurting — as you push toward the tens or hundreds of billions of parameters where the real models live. Sometimes the routing overhead eats the savings; sometimes the sparsity that helped a small model starves a big one. The right way to file GQE is: an elegant, well-executed idea with a promising small-scale result, and an open question about whether it survives the trip to full size. If it does, expect to see experts quietly migrate from the feed-forward layer into attention across the next generation of models.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://groundtruth.day/news/grouped-query-experts-moe-moves-into-attention.html" rel="noopener noreferrer"&gt;Ground Truth&lt;/a&gt;, where every claim is checked against the primary source.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>research</category>
      <category>architecture</category>
      <category>mixtureofexperts</category>
      <category>attention</category>
    </item>
    <item>
      <title>OpenAI launches GPT-5.6, but only to companies the government clears first</title>
      <dc:creator>Breach Protocol</dc:creator>
      <pubDate>Wed, 01 Jul 2026 19:39:38 +0000</pubDate>
      <link>https://dev.to/breachprotocol/openai-launches-gpt-56-but-only-to-companies-the-government-clears-first-3pfi</link>
      <guid>https://dev.to/breachprotocol/openai-launches-gpt-56-but-only-to-companies-the-government-clears-first-3pfi</guid>
      <description>&lt;p&gt;OpenAI today released GPT-5.6, a three-tier model family — flagship Sol, workhorse Terra, and low-cost Luna — but made its strongest model available only to roughly twenty government-cleared organizations, marking the first time a major lab has gated its best model behind a federal review process rather than offering it to the public. You can read the company's own &lt;a href="https://openai.com/index/previewing-gpt-5-6-sol/" rel="noopener noreferrer"&gt;preview announcement&lt;/a&gt;, and the reaction is already loud on &lt;a href="https://news.ycombinator.com/" rel="noopener noreferrer"&gt;Hacker News&lt;/a&gt; and across the AI press.&lt;/p&gt;

&lt;h3&gt;Key facts&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What:&lt;/strong&gt; OpenAI's most capable models yet shipped today as a tiny, government-vetted preview, signaling that Washington now holds a gate in front of the frontier.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When:&lt;/strong&gt; 2026-06-26&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Primary source:&lt;/strong&gt; &lt;a href="https://openai.com/index/previewing-gpt-5-6-sol/" rel="noopener noreferrer"&gt;read the source&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://www.theverge.com/ai-artificial-intelligence/957372/openai-will-delay-gpt-5-6-after-trump-administration-request" rel="noopener noreferrer"&gt;The Verge reported&lt;/a&gt; that the broad rollout was delayed at the administration's request, and a &lt;a href="https://officechai.com/ai/openai-launches-gpt-5-6-sol-beats-mythos-on-terminalbench/" rel="noopener noreferrer"&gt;technical breakdown at OfficeChai&lt;/a&gt; walks through what shipped. OpenAI itself calls this gated arrangement unsustainable and frames it as a short-term step toward wider availability.&lt;/p&gt;

&lt;p&gt;Until recently, the pattern for a new AI model was simple: announce it, put it behind a public sign-up, and let anyone with a credit card start typing. What changed is what these models can now do in the hands of a skilled operator, specifically in cybersecurity and biology. OpenAI's own safety paperwork rates all three new models as high-capability in those two areas. The company is saying these systems are good enough at finding software weaknesses and at biology that letting just anyone drive them carries real-world risk, which is exactly why the government is at the table.&lt;/p&gt;

&lt;p&gt;OpenAI built the new family, tuned it hard for the kind of step-by-step tool use that powers &lt;a href="https://groundtruth.day/news//learn/ai-agents.html" rel="noopener noreferrer"&gt;AI agents&lt;/a&gt;, and then, instead of a normal launch, agreed to a limited preview for a small set of vetted partners. The mechanism behind the gate is a federal order requiring frontier labs to submit their most capable models for review before release. That is the same lever that, two weeks ago, forced a rival to switch off its top model, the story we covered when &lt;a href="https://groundtruth.day/news//news/the-us-government-banned-anthropics-most-powerful-ai-model.html" rel="noopener noreferrer"&gt;the government pulled a frontier model&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Think of a new pharmaceutical. A company can invent a powerful compound in its lab, but it cannot sell it the next morning; a regulator reviews it first and may restrict who can prescribe it. Frontier AI is starting to be treated the same way. The model exists, the company is proud of it, and a government review now stands between the lab and the open market. The top tier even ships with a mode that spins up multiple sub-agents to attack a problem in parallel — the kind of capability that makes both the lab and the regulator nervous.&lt;/p&gt;

&lt;p&gt;On capability, treat the company's numbers with care. OpenAI says the flagship leads rivals on a command-line coding test and is competitive on offensive-security tasks while using far fewer tokens to get there. Those are vendor claims with no independent confirmation yet. The honest read: the product and the government-gated preview are real and confirmed, while the leaderboard wins are the lab's own marketing until a third party checks them. If you want to understand why a single test score deserves skepticism, see our explainer on &lt;a href="https://groundtruth.day/news//learn/how-ai-is-benchmarked.html" rel="noopener noreferrer"&gt;how AI is benchmarked&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This is the clearest sign yet that the most capable AI is becoming a controlled good, like an export-restricted technology rather than a consumer app. That reshapes competition. If access to the best closed models runs through a government clearance process, the companies that get cleared early gain an enormous head start, and everyone else is pushed toward &lt;a href="https://groundtruth.day/news//learn/open-weight-models.html" rel="noopener noreferrer"&gt;open-weight models&lt;/a&gt; they can run themselves. It also changes the safety conversation: instead of arguing about what a model should refuse to say, the fight is now about who is even allowed to hold the keys.&lt;/p&gt;

&lt;p&gt;Nobody outside the vetted circle can test these claims right now, which is its own problem. A model that only a handful of insiders can probe is a model the wider research community cannot scrutinize for flaws, biases, or overconfidence. OpenAI's own safety notes admit the flagship shows a stronger tendency than its predecessor to go beyond what a user asked for. Gating access protects against misuse, but it also slows the independent red-teaming that has historically caught a model's worst habits. We are entering a period where the most powerful AI is also the least publicly examined.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://groundtruth.day/news/gpt-5-6-launches-under-government-vetting.html" rel="noopener noreferrer"&gt;Ground Truth&lt;/a&gt;, where every claim is checked against the primary source.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>openai</category>
      <category>gpt56</category>
      <category>regulation</category>
      <category>frontiermodels</category>
    </item>
    <item>
      <title>Google ships a faster, cheaper image model and hands developers conversational video editing</title>
      <dc:creator>Breach Protocol</dc:creator>
      <pubDate>Wed, 01 Jul 2026 19:35:52 +0000</pubDate>
      <link>https://dev.to/breachprotocol/google-ships-a-faster-cheaper-image-model-and-hands-developers-conversational-video-editing-37pe</link>
      <guid>https://dev.to/breachprotocol/google-ships-a-faster-cheaper-image-model-and-hands-developers-conversational-video-editing-37pe</guid>
      <description>&lt;p&gt;Google released two new generative-media models this week—Nano Banana 2 Lite for fast, cheap image generation and Gemini Omni Flash for short video generation and editing—designed for volume and speed rather than peak quality. The models target builders who need to produce large quantities of images and short clips quickly and affordably, signaling a shift in AI-generation economics.&lt;/p&gt;

&lt;h3&gt;Key facts&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What:&lt;/strong&gt; A lightweight version of Google's image model now makes a picture in about four seconds for a fraction of a cent, while a new video model lets developers edit clips by talking to it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When:&lt;/strong&gt; 2026-06-30&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Primary source:&lt;/strong&gt; &lt;a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-omni-flash-nano-banana-2-lite/" rel="noopener noreferrer"&gt;read the source&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both models were detailed in &lt;a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-omni-flash-nano-banana-2-lite/" rel="noopener noreferrer"&gt;a blog post&lt;/a&gt;, where Google positioned them for builders who prioritize throughput over a single polished render.&lt;/p&gt;

&lt;p&gt;Nano Banana 2 Lite, a stripped-down version of Google's image generator, turns a text prompt into an image in about four seconds and costs roughly three cents per thousand images. That pricing is the real story. At those numbers, generating images stops being a treat you dole out carefully and becomes something you can do by the thousand: testing a hundred variations of a design, auto-illustrating every item in a catalog, or letting an app generate imagery on the fly for each user. When a capability gets an order of magnitude cheaper, people do not just do the old thing for less money; they do new things that were never affordable before. Google is clearly betting that cheap-and-fast unlocks a different class of use than slow-and-pristine. The model is available in Google's developer studio and its main AI programming interface, and it is rolling out to consumer products like the Gemini app and search.&lt;/p&gt;

&lt;p&gt;Gemini Omni Flash, the more novel of the two, handles video: it generates and, more interestingly, edits short clips of up to ten seconds. Google bills it as the first time developers get programmable access to conversational video editing. Traditional video editing means a timeline, tracks, and a mouse: you scrub to a frame and manually change things. Conversational editing means you describe the change in plain words (make the sky darker, slow the middle down, remove the person on the left) and the model produces the revised clip. Doing that through an API means a developer can bake that ability into their own app, so their users can revise video by talking rather than learning editing software. With the fast image model alongside it, Google is sketching an end-to-end pipeline: generate a picture in seconds, turn it into a short clip, then refine the clip by conversation. Omni Flash is available in the same developer surfaces plus Google's video-creation tool.&lt;/p&gt;

&lt;p&gt;The honest caveat is that the "lite" and "flash" labels are doing a lot of quiet work. A four-second image model priced to run by the thousand is, almost by definition, making tradeoffs against the slower, pricier flagship—in fine detail, in how reliably it renders text inside an image, in handling unusual or complex prompts. Ten-second clips are short, and the hardest parts of video generation—keeping a character consistent, physics that do not melt, coherence across a longer scene—get harder the longer the clip. None of that makes these models less useful; it means they are precision tools for a specific job. The winners will be the builders who match the tool to the task: reach for cheap-and-fast when volume and iteration speed matter, and save the expensive flagship for the single hero image or the shot that has to be flawless. What Google actually shipped this week is less a leap in quality than a shift in economics—and in this field, the economics are often what decides which ideas get built at all.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://groundtruth.day/news/google-ships-faster-cheaper-image-and-video-models.html" rel="noopener noreferrer"&gt;Ground Truth&lt;/a&gt;, where every claim is checked against the primary source.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>google</category>
      <category>gemini</category>
      <category>imagegeneration</category>
      <category>videogeneration</category>
    </item>
    <item>
      <title>Google DeepMind puts $75 million into film studio A24 to build AI moviemaking tools</title>
      <dc:creator>Breach Protocol</dc:creator>
      <pubDate>Wed, 01 Jul 2026 19:32:07 +0000</pubDate>
      <link>https://dev.to/breachprotocol/google-deepmind-puts-75-million-into-film-studio-a24-to-build-ai-moviemaking-tools-199i</link>
      <guid>https://dev.to/breachprotocol/google-deepmind-puts-75-million-into-film-studio-a24-to-build-ai-moviemaking-tools-199i</guid>
      <description>&lt;p&gt;Google DeepMind is investing around seventy-five million dollars in A24 to jointly develop AI filmmaking tools, with researchers embedded directly on productions (&lt;a href="https://deadline.com/2026/06/google-a24-partnership-ai-filmmaking-tools/" rel="noopener noreferrer"&gt;Deadline&lt;/a&gt;; &lt;a href="https://www.reuters.com/business/media-telecom/google-deepmind-signs-ai-research-deal-with-film-studio-a24-2026-06-22/" rel="noopener noreferrer"&gt;Reuters&lt;/a&gt;). The deal is framed strictly as a tooling and workflow partnership, not a data-licensing arrangement for training AI on A24's film catalog. DeepMind's researchers will work alongside filmmakers, building and refining production tools in the actual context of making a movie.&lt;/p&gt;

&lt;h3&gt;Key facts&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What:&lt;/strong&gt; A frontier AI lab is investing in a prestige studio to develop production tools hands-on with filmmakers -- officially not a deal to train models on A24's films.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When:&lt;/strong&gt; 2026-06-22&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Primary source:&lt;/strong&gt; &lt;a href="https://deadline.com/2026/06/google-a24-partnership-ai-filmmaking-tools/" rel="noopener noreferrer"&gt;read the source&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AI labs regularly build powerful general-purpose tools and struggle to understand what professionals in a specific craft actually need. Filmmakers want help with concrete, unglamorous production problems (matching shots, planning scenes, handling the thousand small decisions a production runs on), not a prompt-to-video generator. The only reliable way to learn those needs is to be in the room. By buying a stake in a respected studio and placing researchers on real productions, DeepMind is trying to close the gap between powerful AI and AI that filmmakers actually want to use.&lt;/p&gt;

&lt;p&gt;The backdrop matters. The relationship between AI and the film industry has been predominantly adversarial; fear that studios would use AI to replace writers, actors, and crews — or train models on people's work and likenesses without consent — was a major driver of recent labor disputes. A frontier lab investing in a studio to build tools with filmmakers is a deliberate attempt to write a different story: AI as a collaborator handling the tedious, expensive parts of production rather than a replacement for the people doing creative work. Whether it lands that way depends on how the tools are built and who benefits — which is why the details matter more than the press release.&lt;/p&gt;

&lt;p&gt;This deal signals how the next phase of AI competition plays out — not just who has the best model, but who has the deepest hooks into specific high-value industries. Owning a relationship with a prestige studio gives Google both a real-world laboratory and marquee credibility in a creative field that has been deeply wary of AI. It is also a mainstream-crossover moment: AI showing up in the culture industry as an investor and collaborator, not just as a threat in a labor dispute.&lt;/p&gt;

&lt;p&gt;The caveats are worth stating plainly. Commenters were quick to be skeptical of the "not for training" framing, on the reasonable grounds that proximity to a studio's films and creative process is itself valuable to an AI company, whatever the contract says — and the public cannot see the contract. The official position is clear; whether the practical reality stays cleanly on the tooling side of the line is something only time will show. And like any splashy partnership, the announcement is easy; the test is whether real, useful tools come out of it, or whether it ends up as a prestige association that produces more press than product. For now it is a genuine, multi-outlet-confirmed deal — and a notable vote of confidence that AI's future in film is collaborative, at least on paper.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://groundtruth.day/news/google-deepmind-bets-on-a-film-studio.html" rel="noopener noreferrer"&gt;Ground Truth&lt;/a&gt;, where every claim is checked against the primary source.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>creativeai</category>
      <category>google</category>
      <category>industry</category>
      <category>video</category>
    </item>
    <item>
      <title>AI coding skill in Python doesn't carry over to other languages</title>
      <dc:creator>Breach Protocol</dc:creator>
      <pubDate>Wed, 01 Jul 2026 19:28:22 +0000</pubDate>
      <link>https://dev.to/breachprotocol/ai-coding-skill-in-python-doesnt-carry-over-to-other-languages-3e5d</link>
      <guid>https://dev.to/breachprotocol/ai-coding-skill-in-python-doesnt-carry-over-to-other-languages-3e5d</guid>
      <description>&lt;p&gt;A new benchmarking study finds that AI models' strong Python performance is a poor predictor of their coding ability across other programming languages. The &lt;a href="https://huggingface.co/papers/2606.20517" rel="noopener noreferrer"&gt;Multi-LCB&lt;/a&gt; project rebuilt a respected &lt;a href="https://arxiv.org/abs/2403.07974" rel="noopener noreferrer"&gt;contamination-resistant coding benchmark&lt;/a&gt; in twelve languages and found that models which look excellent in Python perform markedly worse elsewhere — they have over-specialized in the language they saw most in training.&lt;/p&gt;

&lt;h3&gt;Key facts&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What:&lt;/strong&gt; A widely trusted coding benchmark was Python-only. Expanding it to a dozen languages revealed that models acing Python often stumble badly elsewhere: Python skill isn't general coding skill.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When:&lt;/strong&gt; 2026-06-20&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Primary source:&lt;/strong&gt; &lt;a href="https://huggingface.co/papers/2606.20517" rel="noopener noreferrer"&gt;read the source&lt;/a&gt; (arXiv 2606.20517)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Three findings stand out. &lt;strong&gt;Python overfitting&lt;/strong&gt;: many models that look excellent in Python perform markedly worse in other languages — they've over-specialized in the language they saw most. &lt;strong&gt;Uneven contamination&lt;/strong&gt;: the degree to which test problems appear to have leaked into a model's training varies by language, a fingerprint of how lopsided these models' training diets are toward popular languages. &lt;strong&gt;Large gaps across languages&lt;/strong&gt;: models are especially weak in stricter, more structured languages and in less common ones that show up rarely in training data. The blunt conclusion: a model's Python performance is not a reliable stand-in for its coding ability in general.&lt;/p&gt;

&lt;p&gt;Testing only in Python is like judging someone's overall musical talent solely by how well they play one song they've practiced a thousand times. They'll sound like a virtuoso — until you hand them a new piece, or a different instrument, and discover the talent was narrower than it looked. &lt;a href="https://github.com/Multi-LCB/Multi-LCB" rel="noopener noreferrer"&gt;Multi-LCB&lt;/a&gt; hands the models a different instrument and listens to what actually comes out.&lt;/p&gt;

&lt;p&gt;Benchmarks shape everything: which models look best, which research directions get funded, and which claims make headlines. If the headline coding test is single-language, the entire field is optimizing for a narrow slice of reality while telling itself the slice is the whole. Real software is written in a sprawling variety of languages, and a coding assistant that only truly shines in Python is far less useful than its leaderboard position suggests. Building tests that span many languages forces a more honest measure of general skill — and this is part of a broader reckoning this week about &lt;a href="https://groundtruth.day/news//learn/llm-as-a-judge.html" rel="noopener noreferrer"&gt;how AI gets evaluated&lt;/a&gt;, with several groups arguing that a single tidy score hides more than it reveals.&lt;/p&gt;

&lt;p&gt;The weaker results in less common languages might not reflect a deep inability to generalize so much as a simple shortage of training material — these models have just seen far less code in those languages. With a more balanced training diet, some of the gap might close, which would mean the problem is partly about what we feed models rather than a fundamental limit of how they learn. "Can't generalize" and "wasn't taught enough" call for different fixes. Either way, the practical lesson is sturdy: the next time a model is crowned a coding champion on a Python-only test, treat the crown with suspicion. The same model handed a different language might tell a very different story.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://groundtruth.day/news/good-at-python-isnt-good-at-coding.html" rel="noopener noreferrer"&gt;Ground Truth&lt;/a&gt;, where every claim is checked against the primary source.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>benchmarks</category>
      <category>evaluation</category>
      <category>coding</category>
    </item>
    <item>
      <title>A powerful open model lands and reignites the open-vs-closed debate</title>
      <dc:creator>Breach Protocol</dc:creator>
      <pubDate>Wed, 01 Jul 2026 19:24:37 +0000</pubDate>
      <link>https://dev.to/breachprotocol/a-powerful-open-model-lands-and-reignites-the-open-vs-closed-debate-977</link>
      <guid>https://dev.to/breachprotocol/a-powerful-open-model-lands-and-reignites-the-open-vs-closed-debate-977</guid>
      <description>&lt;p&gt;Z.ai (also known as Zhipu AI) released GLM-5.2, a top-tier open-weight model with an unusually large context window capable of ingesting hundreds of thousands of words at once. The weights and code are publicly available, and free access was offered for a limited window to drive adoption.&lt;/p&gt;

&lt;h3&gt;Key facts&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What:&lt;/strong&gt; A Chinese lab released a flagship model anyone can download and run, with a huge memory for long documents — and a viral claim that it makes things up less than a top closed model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When:&lt;/strong&gt; 2026-06-20&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Primary source:&lt;/strong&gt; &lt;a href="https://huggingface.co/zai-org/GLM-5.2-FP8" rel="noopener noreferrer"&gt;read the source&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model's headline technical feature is a &lt;strong&gt;&lt;a href="https://groundtruth.day/news//learn/context-windows.html" rel="noopener noreferrer"&gt;context window&lt;/a&gt;&lt;/strong&gt; — the amount of text it can hold in mind at once — on the order of a few hundred thousand words. That is enough to take in a long book, a sprawling codebase, or a thick stack of documents and reason over all of it in a single pass. For real work, this eliminates the need to feed a model material in small chunks and hope it remembers the earlier pieces. Z.ai also released efficient, compressed versions designed to run on more modest hardware, and opened free access for a window of time to encourage people to try it. The code and model weights are available through the &lt;a href="https://github.com/zai-org/GLM-5" rel="noopener noreferrer"&gt;zai-org GitHub&lt;/a&gt; repository.&lt;/p&gt;

&lt;p&gt;GLM-5.2 is being positioned as competitive with the strongest models in its size class, and a viral argument took hold over the weekend that it actually &lt;strong&gt;&lt;a href="https://groundtruth.day/news//learn/hallucination.html" rel="noopener noreferrer"&gt;makes things up less often&lt;/a&gt;&lt;/strong&gt; than a leading closed model from a major lab. That claim spread fast because it flatters a popular story: that you don't need a giant proprietary system to get reliable answers, and that open models have quietly caught up. The original spark was &lt;a href="https://arrowtsx.dev/bigger-models" rel="noopener noreferrer"&gt;a blog post&lt;/a&gt; arguing that building bigger models is no longer the path forward — that efficiency and grounding matter more than raw size. The post triggered significant discussion in the broader open-model community, much of it centered on the &lt;a href="https://huggingface.co/zai-org" rel="noopener noreferrer"&gt;Z.ai model hub&lt;/a&gt; where the release lives.&lt;/p&gt;

&lt;p&gt;This is exactly the kind of claim that feels true and may not survive scrutiny. Comparing how often two models &lt;strong&gt;&lt;a href="https://groundtruth.day/news//learn/hallucination.html" rel="noopener noreferrer"&gt;make things up&lt;/a&gt;&lt;/strong&gt; is genuinely hard to do fairly — it depends heavily on which questions you ask, how you score the answers, and what counts as a fabrication. Some in the community pushed back on the methodology, and others suggested the open model may be trading away some reasoning sharpness in exchange for sticking more cautiously to what it is sure about. Even if it fabricates less, that might come at a cost on other dimensions. The reliability claim is an unsettled debate, not an established fact, and should be read as narrative momentum rather than a verified result.&lt;/p&gt;

&lt;p&gt;Regardless of how that debate resolves, the steady arrival of capable open models reshapes the landscape. Researchers can study a frontier-class system directly instead of guessing at a black box; companies and individuals can run powerful AI privately on their own machines without sending data to anyone; and competitive pressure stays on the closed labs. The fact that this open release comes with a long memory and runs on accessible hardware is itself the bigger story — part of a clear pattern where the most interesting action is increasingly in models you can hold in your hand rather than only rent.&lt;/p&gt;

&lt;p&gt;The reliability question remains open. Until neutral parties run careful, well-designed comparisons — not weekend benchmarks optimized to make a point — the "makes things up less" claim belongs in the "interesting if true" column. What is solid is the release, the long context, and the accessibility. What is contested is exactly how it stacks up against the best closed systems on the dimensions people care about most. With a fresh open model riding a wave of enthusiasm, the right posture is curiosity with a hand on the skeptic's brake.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://groundtruth.day/news/glm-5-2-open-model-takes-on-the-giants.html" rel="noopener noreferrer"&gt;Ground Truth&lt;/a&gt;, where every claim is checked against the primary source.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>models</category>
      <category>industry</category>
    </item>
    <item>
      <title>An open model from China beat Claude on a security test -- at a sixth of the cost</title>
      <dc:creator>Breach Protocol</dc:creator>
      <pubDate>Wed, 01 Jul 2026 19:20:52 +0000</pubDate>
      <link>https://dev.to/breachprotocol/an-open-model-from-china-beat-claude-on-a-security-test-at-a-sixth-of-the-cost-1df0</link>
      <guid>https://dev.to/breachprotocol/an-open-model-from-china-beat-claude-on-a-security-test-at-a-sixth-of-the-cost-1df0</guid>
      <description>&lt;p&gt;GLM 5.2, a free open-weight model from Zhipu AI, beat Anthropic's Claude at catching broken-access-control bugs in Semgrep's benchmarks, at roughly a sixth of the cost per bug found. The result, published in Semgrep's blog post &lt;a href="https://semgrep.dev/blog/2026/we-have-mythos-at-home-glm-52-beats-claude-in-our-cyber-benchmarks/" rel="noopener noreferrer"&gt;We Have Mythos At Home&lt;/a&gt;, is narrow but real: on one security-critical task, a downloadable model outperformed a top closed model.&lt;/p&gt;

&lt;h3&gt;Key facts&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What:&lt;/strong&gt; Semgrep ran GLM 5.2 against Claude on a narrow vulnerability-finding task and the free, open-weight model came out ahead for far less money.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When:&lt;/strong&gt; 2026-06-28&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Primary source:&lt;/strong&gt; &lt;a href="https://semgrep.dev/blog/2026/we-have-mythos-at-home-glm-52-beats-claude-in-our-cyber-benchmarks/" rel="noopener noreferrer"&gt;read the source&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A huge share of real-world web bugs comes from one boring mistake: a site checks that you are logged in, but forgets to check that the thing you are asking for actually belongs to you. Change the order number in the address bar from 1001 to 1002 and you are suddenly looking at someone else's invoice. Security people call this a broken-access-control bug; this particular form, where changing an identifier exposes someone else's record, is an insecure direct object reference (IDOR). It is everywhere, it is costly, and finding it is the kind of needle-in-a-haystack reading job people now hand to AI: point a model at a codebase and ask, where can a user reach data that isn't theirs?&lt;/p&gt;
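
&lt;p&gt;To make the invoice example concrete, here is a minimal sketch of the mistake and its fix. Everything in it is hypothetical and for illustration only: the in-memory INVOICES store, the Forbidden exception, and both handler functions are invented here and do not come from Semgrep's benchmark or from any real codebase.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class Invoice:
    order_id: int
    owner: str
    total: str


# Hypothetical in-memory store standing in for a real database.
INVOICES = {
    1001: Invoice(1001, "alice", "49.00"),
    1002: Invoice(1002, "bob", "120.00"),
}


class Forbidden(Exception):
    """Raised when a request must be refused."""


def get_invoice_vulnerable(current_user, order_id):
    # Authentication only: any logged-in user passes this check.
    if current_user is None:
        raise Forbidden("login required")
    # Bug: nothing ties the requested order to the requesting user.
    return INVOICES[order_id]


def get_invoice_fixed(current_user, order_id):
    if current_user is None:
        raise Forbidden("login required")
    invoice = INVOICES[order_id]
    # Authorization: the record must belong to the caller.
    if invoice.owner != current_user:
        raise Forbidden("not your invoice")
    return invoice
```

&lt;p&gt;With this sketch, alice asking for order 1002 gets bob's invoice back from the vulnerable handler, while the fixed handler raises Forbidden. The two functions differ by a single ownership check, which is why the bug is so easy to miss in review.&lt;/p&gt;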

&lt;p&gt;Semgrep built a fair test around that question and ran several models through it. The standout was GLM 5.2, an &lt;a href="https://groundtruth.day/news//learn/open-weight-models.html" rel="noopener noreferrer"&gt;open-weight model&lt;/a&gt; from the Chinese lab Zhipu AI. On the narrow task of catching access-control bugs, GLM 5.2 scored ahead of Claude Code -- and because GLM is free to download and cheap to run, the cost per bug it found was about a sixth of Claude's. For a security team scanning millions of lines, that gap is the difference between scanning everything and scanning a sample.&lt;/p&gt;

&lt;p&gt;GLM 5.2 is a &lt;a href="https://groundtruth.day/news//learn/mixture-of-experts.html" rel="noopener noreferrer"&gt;mixture-of-experts&lt;/a&gt; design: it is enormous on paper -- hundreds of billions of parameters -- but for any given chunk of text it only switches on a small slice of itself, which keeps it fast and affordable. It reads up to about a million tokens at once, enough to hold a fair-sized codebase in working memory while it reasons about who can reach what. It ships under a permissive MIT license, so a company can run it on its own machines and never send a line of proprietary code to anyone else.&lt;/p&gt;
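
&lt;p&gt;The "switches on a small slice of itself" idea can be sketched with generic top-k expert routing. This is an assumption-laden toy, not GLM 5.2's actual router, which the source does not describe: the function names, the expert functions, and the choice of k are all hypothetical.&lt;/p&gt;

```python
import math


def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_scores, k):
    """Pick the k highest-scoring experts and renormalize their weights."""
    probs = softmax(router_scores)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:k]
    mass = sum(probs[i] for i in chosen)
    return {i: probs[i] / mass for i in chosen}


def moe_layer(x, experts, router_scores, k=2):
    """Run only the chosen experts and mix their outputs by gate weight."""
    gates = route_top_k(router_scores, k)
    return sum(weight * experts[i](x) for i, weight in gates.items())


if __name__ == "__main__":
    # Four toy "experts"; a real model would have many more, each a network.
    experts = [lambda x, m=m: m * x for m in (1, 2, 3, 4)]
    scores = [0.1, 2.0, 0.3, 1.5]  # hypothetical router output for one token
    print(route_top_k(scores, k=2))
    print(moe_layer(1.0, experts, scores, k=2))
```

&lt;p&gt;Only two of the four experts run for this input, so compute per token scales with k rather than with the total number of experts. That is the sense in which a model can be enormous on paper yet cheap to run.&lt;/p&gt;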

&lt;p&gt;Semgrep itself is careful to add the caveat: this is one narrow win, not a coronation. On harder, longer programming tasks -- the kind that involve juggling a whole project over many steps -- GLM 5.2 still trails the top closed models by a wide margin. The sharpest point in the writeup is that the model alone was not even the best result on Semgrep's own board: their full scanning pipeline, the model wrapped in custom tooling and checks, beat every bare model by a healthy margin. How you wire a model into a system matters at least as much as which model you pick. A bare &lt;a href="https://groundtruth.day/news//learn/how-ai-is-benchmarked.html" rel="noopener noreferrer"&gt;benchmark&lt;/a&gt; score is the start of the story, not the end of it.&lt;/p&gt;

&lt;p&gt;The direction matters regardless. A year ago the assumption was that frontier capability lived behind a handful of American API keys. The Semgrep result is a clean, reproducible data point that on at least one economically important task, a free model you can run in your own building can now be the rational default. Developers on local-AI forums are quietly moving day-to-day work onto GLM and keeping the expensive models for the genuinely hard problems. Combine that with the fact that the most powerful American models are getting &lt;a href="https://groundtruth.day/news//news/the-us-government-banned-anthropics-most-powerful-ai-model.html" rel="noopener noreferrer"&gt;harder to access&lt;/a&gt;, and a cheap, open, capable alternative feels less like a curiosity and more like infrastructure.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://groundtruth.day/news/glm-52-beats-claude-on-a-cyber-benchmark.html" rel="noopener noreferrer"&gt;Ground Truth&lt;/a&gt;, where every claim is checked against the primary source.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>openweightmodels</category>
      <category>security</category>
      <category>glm</category>
      <category>benchmarks</category>
    </item>
  </channel>
</rss>
