<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: azena.ai</title>
    <description>The latest articles on DEV Community by azena.ai (@azena-ai).</description>
    <link>https://dev.to/azena-ai</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3999492%2F49bb826b-714e-4d35-b0a8-9166978bb4c9.png</url>
      <title>DEV Community: azena.ai</title>
      <link>https://dev.to/azena-ai</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/azena-ai"/>
    <language>en</language>
    <item>
      <title>The reliability gap: what it actually takes to put an AI agent in production</title>
      <dc:creator>azena.ai</dc:creator>
      <pubDate>Fri, 26 Jun 2026 12:09:52 +0000</pubDate>
      <link>https://dev.to/azena-ai/the-reliability-gap-what-it-actually-takes-to-put-an-ai-agent-in-production-36ik</link>
      <guid>https://dev.to/azena-ai/the-reliability-gap-what-it-actually-takes-to-put-an-ai-agent-in-production-36ik</guid>
      <description>&lt;p&gt;A demo agent is easy. It calls a model, the model calls a tool, the tool returns something plausible, and everyone in the room nods. Then you put the same agent in front of real users, real data, and real money — and it quietly does the wrong thing 4% of the time. Nobody notices until a customer does.&lt;/p&gt;

&lt;p&gt;That 4% is the reliability gap. It is the entire distance between a convincing demo and a system you can actually depend on, and almost nothing in the typical LLM tutorial prepares you for it.&lt;/p&gt;

&lt;p&gt;Here is what closing that gap actually involves.&lt;/p&gt;

&lt;h2&gt;
  
  
  The three things that make agents hard
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. They are non-deterministic by construction.&lt;/strong&gt; The same input can produce a different tool call tomorrow. Your regression intuition — "I didn't touch that code, so it still works" — is simply false. A prompt tweak three steps upstream can change a decision three steps downstream.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. They fail silently.&lt;/strong&gt; A traditional service throws. An agent confidently returns a wrong answer in the same shape as a right one. There is no stack trace for "the model misread the invoice total."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. There is rarely a ground truth at runtime.&lt;/strong&gt; When the agent decides, you usually cannot check the decision against an oracle in the moment. You only find out later, in aggregate, if you measured.&lt;/p&gt;

&lt;p&gt;If you internalise nothing else: an agent is not a function you debug, it is a population you have to measure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Evals are the test suite you're missing
&lt;/h2&gt;

&lt;p&gt;The single highest-leverage thing a team can build is an eval set — a collection of realistic inputs with known-good outcomes that you run on every change. Not "does it sound good," but "did it pick the right tool / extract the right field / refuse the out-of-scope request."&lt;/p&gt;

&lt;p&gt;A useful eval set has three properties:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;It is drawn from real traffic&lt;/strong&gt;, not from your imagination. Log production interactions, sample the weird ones, and turn the failures into permanent test cases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It scores behaviour, not vibes.&lt;/strong&gt; "Selected the &lt;code&gt;refund&lt;/code&gt; tool when the policy said deny" is checkable. "Was helpful" is not.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It runs in CI.&lt;/strong&gt; A prompt change that lifts one metric and quietly drops another should fail the build before it ships, exactly like a unit test.&lt;/li&gt;
&lt;/ul&gt;
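
&lt;p&gt;As a rough illustration of "behaviour, not vibes", here is a minimal Python sketch of an eval set scored on tool choice. Everything in it is an assumption for illustration: &lt;code&gt;toy_agent&lt;/code&gt; is a hypothetical stand-in for a real agent, and the case names and tool names (&lt;code&gt;refund&lt;/code&gt;, &lt;code&gt;escalate_to_human&lt;/code&gt;, &lt;code&gt;answer_question&lt;/code&gt;) are made up.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    """One realistic input with a known-good, checkable outcome."""
    name: str
    user_input: str
    expected_tool: str


def toy_agent(user_input):
    """Hypothetical stand-in for a real agent: returns the tool it would call."""
    text = user_input.lower()
    if "refund" in text and "outside policy" not in text:
        return "refund"
    if "delete" in text:
        return "escalate_to_human"
    return "answer_question"


def run_evals(agent, cases):
    """Return the names of cases where the agent picked the wrong tool."""
    return [c.name for c in cases if agent(c.user_input) != c.expected_tool]


CASES = [
    EvalCase("refund_in_policy", "Please refund order 1001", "refund"),
    EvalCase("refund_denied", "Refund order 1002, it is outside policy", "answer_question"),
    EvalCase("out_of_scope", "Delete my account now", "escalate_to_human"),
]

failures = run_evals(toy_agent, CASES)
pass_rate = 1 - len(failures) / len(CASES)
print(f"pass rate: {pass_rate:.0%}, failures: {failures}")
# In CI, a non-empty failure list should fail the build,
# for example with: raise SystemExit(1)
```

&lt;p&gt;In a real setup the cases would come from logged production traffic, and the CI job would exit non-zero whenever the failure list is not empty.&lt;/p&gt;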

&lt;p&gt;This is the part most teams skip, and it is the part that separates an agent you can iterate on from one you are afraid to touch. I wrote up the failure modes in more detail here: &lt;a href="https://azena.ai/blog/ki-agenten-produktion-evals/" rel="noopener noreferrer"&gt;why AI agents fail in production and what evals have to do with it&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Guardrails constrain the action space, not the prose
&lt;/h2&gt;

&lt;p&gt;A common mistake is to treat reliability as a prompting problem — add another paragraph of "you must never…" and hope. Prompts are persuasion, not enforcement.&lt;/p&gt;

&lt;p&gt;Real guardrails live in code, around the model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Allow-list the tools&lt;/strong&gt; available in each state. An agent in a "read-only support" state should not have a &lt;code&gt;delete_account&lt;/code&gt; tool in scope at all. Don't ask it nicely — don't hand it the gun.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validate every tool call against a schema&lt;/strong&gt; and against business rules before execution. The model proposes; deterministic code disposes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bound the loop.&lt;/strong&gt; Max steps, max spend, max retries. An agent with an unbounded loop and a credit card is an incident waiting for a date.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Make refusal a first-class outcome.&lt;/strong&gt; "I don't have enough information, escalating to a human" is a success, not a failure, and your evals should reward it.&lt;/li&gt;
&lt;/ul&gt;
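
&lt;p&gt;One way these four rules might look in code is sketched below. It is not a framework API: the state names, tool names, &lt;code&gt;MAX_REFUND_EUR&lt;/code&gt; and &lt;code&gt;MAX_STEPS&lt;/code&gt; are all hypothetical, and the "model" is just a list of proposed calls.&lt;/p&gt;

```python
ALLOWED_TOOLS = {
    "read_only_support": {"lookup_order", "escalate_to_human"},
    "billing": {"lookup_order", "refund", "escalate_to_human"},
}
MAX_REFUND_EUR = 200   # assumed business rule
MAX_STEPS = 5          # assumed loop bound


def validate_call(state, call):
    """Return None if the proposed tool call may run, else a rejection reason."""
    tool = call.get("tool")
    args = call.get("args", {})
    if tool not in ALLOWED_TOOLS.get(state, set()):
        return f"tool {tool!r} is not allowed in state {state!r}"
    if tool == "refund":
        amount = args.get("amount_eur")
        if not isinstance(amount, (int, float)) or isinstance(amount, bool):
            return "refund needs a numeric amount_eur"
        if amount > MAX_REFUND_EUR or not amount > 0:
            return "refund amount violates the business rule"
    return None


def run_agent(state, proposed_calls):
    """Execute validated calls in order; stop on a rejection or at the step limit."""
    executed = []
    for step, call in enumerate(proposed_calls):
        if step >= MAX_STEPS:
            return executed, "stopped: step limit reached"
        reason = validate_call(state, call)
        if reason is not None:
            # Refusal is a first-class outcome: hand over instead of guessing.
            return executed, f"escalated: {reason}"
        executed.append(call["tool"])
    return executed, "done"


print(run_agent("read_only_support", [{"tool": "delete_account"}]))
print(run_agent("billing", [{"tool": "refund", "args": {"amount_eur": 50}}]))
```

&lt;p&gt;The design choice worth noting is that the model never sees a code path that executes an unvalidated call: the allow-list and the business rule sit between the proposal and the side effect.&lt;/p&gt;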

&lt;p&gt;The mental model: the LLM is the planner, but the runtime is the adult in the room.&lt;/p&gt;

&lt;h2&gt;
  
  
  Human-in-the-loop is an architecture, not an apology
&lt;/h2&gt;

&lt;p&gt;There is a persistent fantasy that "fully autonomous" is the goal and a human checkpoint is a temporary crutch. For anything with legal, financial, or safety weight, that is backwards. The human checkpoint is the design.&lt;/p&gt;

&lt;p&gt;The interesting engineering question is not &lt;em&gt;whether&lt;/em&gt; a human reviews, but &lt;em&gt;where&lt;/em&gt; — you want the agent to do the 90% that is mechanical (gather, draft, structure, pre-fill) and route the 10% that carries liability to a person, with the full context assembled so the review takes seconds, not minutes. That's the difference between automation that scales and automation that creates a new bottleneck.&lt;/p&gt;
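
&lt;p&gt;A minimal sketch of that routing decision, assuming the agent tags each draft with a kind. The &lt;code&gt;Draft&lt;/code&gt; type, the &lt;code&gt;LIABILITY_KINDS&lt;/code&gt; set and the field names are hypothetical; the point is only that the review packet is assembled before the human is involved.&lt;/p&gt;

```python
from dataclasses import dataclass, field

LIABILITY_KINDS = {"legal", "financial", "safety"}  # assumed categories


@dataclass
class Draft:
    """Work the agent prepared: a proposed action plus the context behind it."""
    action: str
    kind: str                       # e.g. "mechanical" or "financial"
    context: dict = field(default_factory=dict)


def route(draft):
    """Send liability-bearing drafts to a person, everything else straight through."""
    if draft.kind in LIABILITY_KINDS:
        return {
            "route": "human_review",
            "action": draft.action,
            # Assemble everything the reviewer needs so the check takes seconds.
            "review_packet": dict(draft.context),
        }
    return {"route": "auto", "action": draft.action}


print(route(Draft("pre-fill claim form", "mechanical")))
print(route(Draft("approve payout", "financial", {"invoice_total": "1200.00", "policy": "P-7"})))
```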

&lt;p&gt;We unpack where to draw that line — chatbot vs. agent, and which workflows should never be fully autonomous — here: &lt;a href="https://azena.ai/perspectives/agentic-ai-beratung/" rel="noopener noreferrer"&gt;agentic AI without the autonomy theatre&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where agents should &lt;em&gt;not&lt;/em&gt; go
&lt;/h2&gt;

&lt;p&gt;Honesty is a feature. Some boundaries are not optimisation problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Anything where a hallucinated fact becomes a liability (a legal citation, a medical dosage, a contractual figure) needs a deterministic source of record and a human signature — not a more confident model.&lt;/li&gt;
&lt;li&gt;Anything irreversible should be gated behind an explicit confirmation that a person, not the agent, owns.&lt;/li&gt;
&lt;li&gt;Anything touching regulated or personal data should be designed for data control from day one — which European model and infrastructure you run on is a real architectural choice, not an afterthought.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Saying "an agent is the wrong tool here" out loud is one of the most senior things an engineer building these systems can do.&lt;/p&gt;

&lt;h2&gt;
  
  
  The unglamorous summary
&lt;/h2&gt;

&lt;p&gt;Reliable agents are less about a clever prompt and more about boring infrastructure: a real eval set wired into CI, guardrails enforced in code, bounded loops, and a deliberate human checkpoint exactly where the stakes are. None of it is exciting. All of it is the difference between a demo and a system.&lt;/p&gt;

&lt;p&gt;If you're a small or mid-sized team that wants agents in production but doesn't have an in-house ML platform team to build that scaffolding, that gap is exactly the thing a focused engineering partner exists to close — that's the work we do at &lt;a href="https://azena.ai/ki-beratung-mittelstand/" rel="noopener noreferrer"&gt;azena, an EU AI boutique&lt;/a&gt;: bespoke systems, evaluated, with the guardrails and the data-control decisions made on purpose.&lt;/p&gt;

&lt;p&gt;Build the eval set first. Everything else gets easier once you can measure.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>llm</category>
      <category>programming</category>
    </item>
    <item>
      <title>The Forschungszulage: how the German state co-funds your AI and software development (2026)</title>
      <dc:creator>azena.ai</dc:creator>
      <pubDate>Wed, 24 Jun 2026 10:15:45 +0000</pubDate>
      <link>https://dev.to/azena-ai/die-forschungszulage-wie-der-deutsche-staat-eure-ki-und-software-entwicklung-mitfinanziert-2026-1g5d</link>
      <guid>https://dev.to/azena-ai/die-forschungszulage-wie-der-deutsche-staat-eure-ki-und-software-entwicklung-mitfinanziert-2026-1g5d</guid>
      <description>&lt;p&gt;Germany has a funding scheme for software development that a surprising number of teams overlook, even though it is a &lt;strong&gt;legal entitlement&lt;/strong&gt;, involves no competition, and is paid out even when you are making a loss. It is called the &lt;strong&gt;Forschungszulage&lt;/strong&gt; (the research allowance under the FZulG), and the changes in 2024 and 2026 have made it considerably more attractive. If you do serious development work, especially on AI and non-trivial software, it is worth a look.&lt;/p&gt;

&lt;p&gt;Here I summarise the practical core. We maintain a more detailed, vendor-neutral overview with all sources openly on GitHub: &lt;a href="https://github.com/azena-ai/ki-foerderung-mittelstand" rel="noopener noreferrer"&gt;github.com/azena-ai/ki-foerderung-mittelstand&lt;/a&gt;. &lt;strong&gt;Not tax advice, but a working basis.&lt;/strong&gt; As of June 2026.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Forschungszulage is
&lt;/h2&gt;

&lt;p&gt;The Forschungszulage is a &lt;strong&gt;tax-based&lt;/strong&gt; incentive for research and development (R&amp;amp;D). Instead of a grant that a case officer allocates, you receive a fixed percentage of your R&amp;amp;D costs as a &lt;strong&gt;tax credit&lt;/strong&gt;, and if you pay no tax (for example, a young company running at a loss), the amount is &lt;strong&gt;paid out&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Three properties make it special:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Legal entitlement.&lt;/strong&gt; If you meet the requirements, you get it. No "the funding pot is empty", no first-come-first-served race.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Independent of sector and size.&lt;/strong&gt; From sole proprietor to large corporation, any legal form.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Available retroactively.&lt;/strong&gt; Projects that started in 2020 or later are eligible, so ongoing projects count too.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The terms (2026)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Funding rate:&lt;/strong&gt; 25% of eligible costs, &lt;strong&gt;35% for SMEs&lt;/strong&gt; (since 28 March 2024).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Assessment base:&lt;/strong&gt; up to &lt;strong&gt;€12 million per year&lt;/strong&gt; (since 1 January 2026). For an SME that means a maximum allowance of &lt;strong&gt;€4.2 million per year&lt;/strong&gt; (35% × €12 million).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Own work&lt;/strong&gt; by partners and sole proprietors: &lt;strong&gt;€100 per hour, max. 40 hours per week&lt;/strong&gt; (since 2026; previously €70). This matters for small teams in which the founders do the development themselves.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contract research:&lt;/strong&gt; If you have development done externally, &lt;strong&gt;70%&lt;/strong&gt; of the fee is eligible, and the contractor must be based in the EEA.&lt;/li&gt;
&lt;/ul&gt;
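
&lt;p&gt;The arithmetic behind these figures can be sketched in a few lines of Python. The rates and the cap are the ones listed above; the founder hours and the contract fee in the second example are hypothetical, and the function is a deliberate simplification, not a tax calculation.&lt;/p&gt;

```python
from decimal import Decimal

SME_RATE = Decimal("0.35")                 # 35% for SMEs, from the list above
STANDARD_RATE = Decimal("0.25")            # 25% otherwise
ASSESSMENT_CAP_EUR = Decimal("12000000")   # 12 million euros per year
CONTRACT_SHARE = Decimal("0.70")           # 70% of an external fee is eligible
OWN_WORK_RATE_EUR = Decimal("100")         # per hour for partners / sole proprietors


def allowance(eligible_costs_eur, is_sme):
    """Apply the funding rate to the eligible costs, capped at the assessment base."""
    base = min(Decimal(eligible_costs_eur), ASSESSMENT_CAP_EUR)
    rate = SME_RATE if is_sme else STANDARD_RATE
    return base * rate


# Maximum for an SME: 35% of the 12 million cap.
print(allowance(12_000_000, is_sme=True))

# Hypothetical small team: 800 founder hours plus a 50,000 euro external contract.
costs = 800 * OWN_WORK_RATE_EUR + Decimal("50000") * CONTRACT_SHARE
print(costs, allowance(costs, is_sme=True))
```

&lt;p&gt;The first call reproduces the €4.2 million ceiling from the list; costs above the cap do not increase the result.&lt;/p&gt;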

&lt;h2&gt;
  
  
  Does software development count? Does AI?
&lt;/h2&gt;

&lt;p&gt;That is the decisive question, and the answer is: &lt;strong&gt;yes, on one condition.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Eligible work is development with genuine &lt;strong&gt;technical or scientific uncertainty&lt;/strong&gt;. The relevant category in the law is &lt;em&gt;experimental development&lt;/em&gt; (experimentelle Entwicklung). For software and AI, that means in concrete terms:&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Eligible&lt;/strong&gt; if you do not know from the outset whether and how it will work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;novel algorithms or model architectures&lt;/li&gt;
&lt;li&gt;non-trivial ML projects (your own models, hard data or integration problems)&lt;/li&gt;
&lt;li&gt;technical solutions for which there is no proven standard approach&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;❌ &lt;strong&gt;Not eligible:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;routine programming along a known pattern&lt;/li&gt;
&lt;li&gt;pure product maintenance, bug fixing, customising of standard software&lt;/li&gt;
&lt;li&gt;anything you can "just write down" because the path is clear&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The dividing line is not "is it AI?" but "&lt;strong&gt;was there a genuine technical risk that you had to resolve?&lt;/strong&gt;" That is exactly why bespoke development so often qualifies and standard integration does not.&lt;/p&gt;

&lt;h2&gt;
  
  
  The application is a two-step process
&lt;/h2&gt;

&lt;p&gt;Many applicants fail not on the substance but because they do not know the procedure. There are two steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Apply for a certificate&lt;/strong&gt; from the &lt;strong&gt;BSFZ&lt;/strong&gt; (Bescheinigungsstelle Forschungszulage, the certification body). It assesses &lt;em&gt;on the merits&lt;/em&gt; whether your project is eligible R&amp;amp;D. This is the real hurdle, and you can clear it &lt;strong&gt;in advance&lt;/strong&gt;, before you spend any money.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Assessment&lt;/strong&gt; by the &lt;strong&gt;tax office&lt;/strong&gt; (Finanzamt) via ELSTER. Here the concrete amount is set on the basis of the certificate.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The practical tip: get the BSFZ certificate early. It gives you planning certainty that the project will be recognised before you claim the costs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Can it be combined?
&lt;/h2&gt;

&lt;p&gt;Yes. The Forschungszulage can be combined with &lt;strong&gt;grant programmes such as ZIM&lt;/strong&gt;, as long as you do not fund the same costs twice. A typical split: a project grant (ZIM) for one cost block, the Forschungszulage for the other. This is also covered in the &lt;a href="https://github.com/azena-ai/ki-foerderung-mittelstand" rel="noopener noreferrer"&gt;overview&lt;/a&gt;, along with the other programmes still running in 2026 (INVEST, EXIST, Mittelstand-Digital Zentren, state-level programmes).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;As an aside, because it saves time: &lt;strong&gt;Digital Jetzt&lt;/strong&gt; and &lt;strong&gt;go-digital&lt;/strong&gt;, the programmes many people think of first when they hear "digitalisation funding", have both &lt;strong&gt;expired&lt;/strong&gt; (end of 2023 and 2024 respectively). There is no point researching them any further.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Why this is here
&lt;/h2&gt;

&lt;p&gt;At &lt;a href="https://azena.ai" rel="noopener noreferrer"&gt;azena&lt;/a&gt; we build bespoke, EU-sovereign AI systems for the &lt;a href="https://azena.ai/ki-beratung-mittelstand/" rel="noopener noreferrer"&gt;German Mittelstand&lt;/a&gt;, and this is exactly the kind of work where the Forschungszulage regularly applies, because bespoke development involves technical uncertainty almost by definition. The most common misconception we hear is "funding is only for corporations with a research department". The opposite is true: the Forschungszulage is made &lt;em&gt;for&lt;/em&gt; teams that build something technically new, no matter how small they are.&lt;/p&gt;

&lt;p&gt;The binding classification always comes from the BSFZ, and this post is no substitute for tax advice. But if you are currently developing something non-trivial and have never considered the Forschungszulage, do.&lt;/p&gt;

&lt;p&gt;All programmes, terms, and official sources, open and maintained here:&lt;br&gt;
&lt;strong&gt;&lt;a href="https://github.com/azena-ai/ki-foerderung-mittelstand" rel="noopener noreferrer"&gt;github.com/azena-ai/ki-foerderung-mittelstand&lt;/a&gt;&lt;/strong&gt;. Corrections via PR are welcome.&lt;/p&gt;

</description>
      <category>ki</category>
      <category>germany</category>
      <category>ai</category>
      <category>startup</category>
    </item>
    <item>
      <title>The genome pattern: how to build an agent loop that actually improves itself</title>
      <dc:creator>azena.ai</dc:creator>
      <pubDate>Wed, 24 Jun 2026 09:46:01 +0000</pubDate>
      <link>https://dev.to/azena-ai/the-genome-pattern-how-to-build-an-agent-loop-that-actually-improves-itself-3bn0</link>
      <guid>https://dev.to/azena-ai/the-genome-pattern-how-to-build-an-agent-loop-that-actually-improves-itself-3bn0</guid>
      <description>&lt;p&gt;Most "autonomous agents" are one prompt in a &lt;code&gt;while&lt;/code&gt; loop. They run, they drift, they repeat yesterday's mistake, and they keep no memory of anything they learned. After a day you don't have an agent that got better — you have the same agent, more tired.&lt;/p&gt;

&lt;p&gt;We've been running a different pattern in production at &lt;a href="https://azena.ai" rel="noopener noreferrer"&gt;azena&lt;/a&gt; for months, and I want to describe it concretely because it's almost embarrassingly simple: &lt;strong&gt;no framework, four markdown files, and one discipline.&lt;/strong&gt; We open-sourced the templates — &lt;a href="https://github.com/azena-ai/self-improving-loop" rel="noopener noreferrer"&gt;&lt;code&gt;azena-ai/self-improving-loop&lt;/code&gt;&lt;/a&gt; — but the idea matters more than the files, so here's the whole thing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The core idea: the agent runs on a genome
&lt;/h2&gt;

&lt;p&gt;The agent doesn't run on a fixed prompt. It runs on a &lt;strong&gt;genome&lt;/strong&gt; — a versioned strategy file that it both &lt;em&gt;reads&lt;/em&gt; and &lt;em&gt;rewrites&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Every cycle ("tick") the loop does one thing: it picks the single highest-value move, ships it, &lt;strong&gt;verifies it actually worked&lt;/strong&gt;, and then folds what it learned back into its own instructions. The genome goes &lt;code&gt;v001 → v002 → v003…&lt;/code&gt;, and each bump is an auditable record of the agent changing its own mind.&lt;/p&gt;

&lt;p&gt;That last part is the whole game. A static prompt fights reality the moment the mission shifts. A genome &lt;em&gt;absorbs&lt;/em&gt; the shift, because the loop is allowed to edit it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The loop in one picture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flowchart TD
  A([Tick fires]) --&amp;gt; B[Read genome + levers + lessons]
  B --&amp;gt; C[Pick the single highest-value lever]
  C --&amp;gt; D[Build / act — small, shippable]
  D --&amp;gt; E{Gate: verify it really works}
  E -- fail --&amp;gt; C
  E -- pass --&amp;gt; F[Commit]
  F --&amp;gt; G[Self-improve:&amp;lt;br/&amp;gt;rewrite genome, append a lesson, bump version]
  G --&amp;gt; H[Schedule the next tick]
  H --&amp;gt; A
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No human in the inner loop. A human sets the mission and reviews the diffs in the morning. That's the deal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Four files, three of which the loop edits
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;genome.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The evolving &lt;strong&gt;strategy + state&lt;/strong&gt;: mission, current focus, what's proven, what's next. The loop &lt;em&gt;mutates this&lt;/em&gt; as it learns. Versioned.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;loop-prompt.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The &lt;strong&gt;orchestrator&lt;/strong&gt; the agent executes each tick — and improves. Holds the tick cycle, the gate rules, and an append-only lessons log.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;levers.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The &lt;strong&gt;prioritized backlog&lt;/strong&gt;. A ranked list of moves with a status log. The loop always takes the top open one.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;lessons&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Hard-won rules, written &lt;em&gt;back into the prompt&lt;/em&gt; the moment they're learned. This is the "self-improving" part.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The tick itself is just a state machine:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;read(genome, levers, lessons)
lever  = highest_value_open(levers)
result = act(lever)            # small, shippable
if not gate(result):           # verify the ARTIFACT
    reschedule(); return       # back to the top — never "commit anyway"
commit(result)
update(genome.status, lever.status)
maybe_mutate(genome)           # bump version if strategy changed
maybe_append(lessons)          # if something was learned
schedule_next_tick()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
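
&lt;p&gt;For readers who want something executable, here is one possible Python rendering of that tick. It is a sketch, not the code from the repo: the &lt;code&gt;state&lt;/code&gt; dict stands in for the markdown files, and the &lt;code&gt;act&lt;/code&gt; and &lt;code&gt;gate&lt;/code&gt; callables, the &lt;code&gt;value&lt;/code&gt; field and the result keys are all hypothetical.&lt;/p&gt;

```python
def tick(state, act, gate):
    """Run one tick: pick the top open lever, act, gate, then record the result.

    `state` is a plain dict standing in for genome.md, levers.md and the lessons log.
    Returns "idle", "rescheduled" or "committed".
    """
    open_levers = [lv for lv in state["levers"] if lv["status"] == "open"]
    if not open_levers:
        return "idle"
    lever = max(open_levers, key=lambda lv: lv["value"])

    result = act(lever)                      # small, shippable
    if not gate(result):                     # verify the artifact, not the log
        return "rescheduled"                 # never commit anyway

    state["commits"].append(result["artifact"])
    lever["status"] = "done"
    if result.get("new_focus"):              # strategy changed: mutate and bump
        state["genome"]["focus"] = result["new_focus"]
        state["genome"]["version"] += 1
    if result.get("lesson"):                 # something was learned
        state["lessons"].append(result["lesson"])
    return "committed"


state = {
    "genome": {"version": 1, "focus": "ship the landing page"},
    "levers": [
        {"name": "fix prerender", "value": 8, "status": "open"},
        {"name": "tidy README", "value": 2, "status": "open"},
    ],
    "lessons": [],
    "commits": [],
}


def act(lever):
    # Hypothetical action: pretend the work produced an artifact and a lesson.
    return {"artifact": f"patch for {lever['name']}", "lesson": "assert output is non-empty"}


def gate(result):
    # Hypothetical gate: a real one would inspect the built artifact.
    return bool(result["artifact"])


print(tick(state, act, gate), state["levers"][0]["status"], state["lessons"])
```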



&lt;h2&gt;
  
  
  Why file-based memory beats a context window
&lt;/h2&gt;

&lt;p&gt;A context window is volatile and small. It evaporates on compaction, restart, or a long enough wall-clock gap. So an agent whose "state" lives in context literally forgets where it was.&lt;/p&gt;

&lt;p&gt;A genome file is durable and unbounded. The loop can run for &lt;strong&gt;days&lt;/strong&gt; across many ticks, restarts, and summarizations, and still know exactly where it is — because "where it is" is &lt;em&gt;written down&lt;/em&gt;, not remembered. When context gets summarized away, the next tick just re-reads the genome and carries on. That single decision — state on disk, not in the window — is what turns a chatty demo into something that survives a week.&lt;/p&gt;
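
&lt;p&gt;To make "state on disk, not in the window" concrete, here is a small Python sketch. It assumes a JSON file instead of the markdown genome purely to keep the example short, and the file name and field names are hypothetical.&lt;/p&gt;

```python
import json
import tempfile
from pathlib import Path


def load_state(path):
    """Read durable state from disk; start at version 1 if the file is missing."""
    if not path.exists():
        return {"version": 1, "focus": "unset", "done": []}
    return json.loads(path.read_text(encoding="utf-8"))


def save_state(path, state):
    """Write to a temp file, then rename, so a crash never leaves a half-written file."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state, indent=2), encoding="utf-8")
    tmp.replace(path)


def run_tick(path, finished_step):
    """One tick with no in-memory carry-over: everything comes from the file."""
    state = load_state(path)
    state["done"].append(finished_step)
    state["version"] += 1
    save_state(path, state)
    return state


with tempfile.TemporaryDirectory() as workdir:
    state_file = Path(workdir) / "genome.json"
    run_tick(state_file, "fix prerender")
    # Simulate a restart: nothing is kept in memory between these two calls.
    after_restart = run_tick(state_file, "add sitemap")
    print(after_restart["version"], after_restart["done"])
```

&lt;p&gt;The second call knows about the first only because it re-reads the file, which is the property the pattern relies on.&lt;/p&gt;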

&lt;h2&gt;
  
  
  The non-negotiable: gates
&lt;/h2&gt;

&lt;p&gt;Here's the part everyone skips, and it's the part that makes autonomy safe instead of reckless.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;An autonomous loop is only as trustworthy as its verification.&lt;/strong&gt; A gate is a check that must pass &lt;em&gt;before&lt;/em&gt; a commit. A failing gate sends the loop &lt;strong&gt;back to pick another lever&lt;/strong&gt; — never forward to "commit anyway."&lt;/p&gt;

&lt;p&gt;The minimum gate is three steps, and the order matters:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Typecheck green.&lt;/strong&gt; Run the real check. Do &lt;strong&gt;not&lt;/strong&gt; pipe it through &lt;code&gt;head&lt;/code&gt;/&lt;code&gt;tail&lt;/code&gt;: unless &lt;code&gt;pipefail&lt;/code&gt; is set, a shell pipeline reports the exit status of its last command, so it exits &lt;code&gt;0&lt;/code&gt; and will happily print "OK" over a stack of errors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build green.&lt;/strong&gt; Many bundlers strip types and build green &lt;em&gt;despite&lt;/em&gt; type errors — so step 1 is not optional.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verify the artifact, not the log.&lt;/strong&gt; This is the one teams skip. "The deploy succeeded" is not evidence that the page renders, the endpoint responds, or the file is non-empty.&lt;/li&gt;
&lt;/ol&gt;
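
&lt;p&gt;A sketch of how the ordering and the exit-status rule could be enforced from Python. The step names mirror the list above, but the commands are placeholders (tiny Python one-liners) so the example runs anywhere; your real typecheck and build commands would go in their place.&lt;/p&gt;

```python
import subprocess
import sys


def run_gate(steps):
    """Run each gate step in order; stop at the first non-zero exit status.

    Each command runs directly (no shell pipeline), so the exit status is
    the one the tool itself reported.
    """
    for name, command in steps:
        completed = subprocess.run(command, capture_output=True, text=True)
        if completed.returncode != 0:
            return False, f"{name} failed with exit status {completed.returncode}"
    return True, "all gates green"


# Stand-in commands so the sketch runs anywhere; swap in your real
# typecheck and build commands.
PASS = [sys.executable, "-c", "raise SystemExit(0)"]
FAIL = [sys.executable, "-c", "raise SystemExit(3)"]

print(run_gate([("typecheck", PASS), ("build", PASS)]))
print(run_gate([("typecheck", FAIL), ("build", PASS)]))
```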

&lt;p&gt;That third point is the single most expensive class of bug in an autonomous loop: a step that &lt;strong&gt;reports success while producing garbage&lt;/strong&gt;. A prerender that silently emits an empty SPA shell. A migration that "ran" but touched zero rows. (We dug into this exact failure mode for production agents, and why they pass demos but fail live, &lt;a href="https://azena.ai/blog/ki-agenten-produktion-evals/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.) So assert a concrete property of the &lt;em&gt;real&lt;/em&gt; artifact:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# don't trust "build OK" — prove it&lt;/span&gt;
&lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-cF&lt;/span&gt; &lt;span class="s1"&gt;'&amp;lt;div id="root"&amp;gt;&amp;lt;/div&amp;gt;'&lt;/span&gt; dist/index.html&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-eq&lt;/span&gt; 0   &lt;span class="c"&gt;# not an empty shell&lt;/span&gt;
&lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;wc&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; &amp;lt; dist/index.html&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-gt&lt;/span&gt; 50000                          &lt;span class="c"&gt;# has real content&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
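
&lt;p&gt;The same idea can live inside a Python gate function. This is a sketch under assumptions: instead of checking for the empty root element, it asserts a minimum size and the presence of a piece of text you expect on the rendered page. The function name, the threshold and the marker text &lt;code&gt;Pricing&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
import tempfile
from pathlib import Path


def check_artifact(path, min_bytes, required_text):
    """Return a list of problems found in the built file (empty list means pass)."""
    if not path.is_file():
        return [f"{path.name} does not exist"]
    data = path.read_bytes()
    problems = []
    if min_bytes > len(data):
        problems.append(f"{path.name} is only {len(data)} bytes")
    if required_text.encode("utf-8") not in data:
        problems.append(f"{path.name} is missing {required_text!r}")
    return problems


with tempfile.TemporaryDirectory() as workdir:
    page = Path(workdir) / "index.html"
    # A build that "succeeded" but emitted almost nothing.
    page.write_text("tiny placeholder", encoding="utf-8")
    shell_result = check_artifact(page, min_bytes=50_000, required_text="Pricing")
    # A build with real content.
    page.write_text("Pricing " * 10_000, encoding="utf-8")
    full_result = check_artifact(page, min_bytes=50_000, required_text="Pricing")

print(shell_result)
print(full_result)
```

&lt;p&gt;Returning the list of problems rather than a bare boolean gives the loop something specific to write into its lessons log when the gate fails.&lt;/p&gt;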



&lt;p&gt;One caveat so you don't fight ghosts: a transient failure on something you didn't touch (a network blip, a cold start) isn't a regression. Re-run the gate once. If it fails deterministically, it's real.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons the loop has already taught itself
&lt;/h2&gt;

&lt;p&gt;These are real, generalized from production runs. The point of the pattern is that this list &lt;em&gt;grows by itself&lt;/em&gt; — the loop appends to it the moment it gets burned:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Verify before you ship.&lt;/strong&gt; A build step can fail silently and leave an empty shell. Assert the output is non-empty and correct &lt;em&gt;before&lt;/em&gt; deploying.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't turn one finding into a destructive sweep.&lt;/strong&gt; A single odd-looking match is not a mandate for a sitewide find-and-replace. Check intent first.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When reality contradicts the task's premise, report — don't blindly execute.&lt;/strong&gt; If the job says "small fix" and you find a load-bearing rewrite, surface it instead of plowing ahead.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools that need a server don't start one.&lt;/strong&gt; Bring the server up, wait for it, &lt;em&gt;then&lt;/em&gt; run the check. A flood of &lt;code&gt;connection refused&lt;/code&gt; means "nothing's listening," not "everything's broken."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Notice these aren't AI-specific. They're the operating rules of a careful engineer — except the loop wrote them for itself, after paying for them once.&lt;/p&gt;

&lt;h2&gt;
  
  
  One lever per tick
&lt;/h2&gt;

&lt;p&gt;Last principle, easy to underrate: &lt;strong&gt;one lever per tick.&lt;/strong&gt; Small, shippable units keep every change reviewable and every failure cheap to roll back. The temptation with an autonomous agent is to let it do five things at once "to save time." Don't. A tick that ships one verified thing and stops is worth more than a tick that ships five unverifiable things. The cadence is the safety rail.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting started
&lt;/h2&gt;

&lt;p&gt;If you want to try it, it genuinely is four files and an afternoon:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Copy the four &lt;code&gt;*.template.md&lt;/code&gt; files from the &lt;a href="https://github.com/azena-ai/self-improving-loop" rel="noopener noreferrer"&gt;repo&lt;/a&gt; into your project.&lt;/li&gt;
&lt;li&gt;Fill &lt;code&gt;genome.md&lt;/code&gt; with your &lt;strong&gt;mission&lt;/strong&gt; and first focus.&lt;/li&gt;
&lt;li&gt;Seed &lt;code&gt;levers.md&lt;/code&gt; with a ranked backlog.&lt;/li&gt;
&lt;li&gt;Hand &lt;code&gt;loop-prompt.md&lt;/code&gt; to your agent and tell it to run one tick, then schedule the next.&lt;/li&gt;
&lt;li&gt;Review the diffs each morning. Watch the genome evolve.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It works anywhere an agent can &lt;strong&gt;schedule its own next turn&lt;/strong&gt; and &lt;strong&gt;commit to git&lt;/strong&gt; — we run it inside Claude Code, but nothing in the pattern is Claude-specific.&lt;/p&gt;




&lt;p&gt;We build this kind of thing for a living — bespoke, EU-sovereign AI systems for the German &lt;em&gt;Mittelstand&lt;/em&gt; — at &lt;a href="https://azena.ai" rel="noopener noreferrer"&gt;azena&lt;/a&gt;, and we teach the craft at the &lt;a href="https://academy.azena.ai" rel="noopener noreferrer"&gt;azena Dev Academy&lt;/a&gt;. The loop pattern came out of needing our own automation to be trustworthy enough to leave running overnight. If you build something with it, I'd love to hear how the genome evolved.&lt;/p&gt;

&lt;p&gt;The templates, docs, and a few reusable skills are all MIT-licensed here: &lt;strong&gt;&lt;a href="https://github.com/azena-ai/self-improving-loop" rel="noopener noreferrer"&gt;github.com/azena-ai/self-improving-loop&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>automation</category>
      <category>programming</category>
    </item>
    <item>
      <title>EU-sovereign AI: running capable LLMs with full data control (2026 guide)</title>
      <dc:creator>azena.ai</dc:creator>
      <pubDate>Tue, 23 Jun 2026 23:35:01 +0000</pubDate>
      <link>https://dev.to/azena-ai/eu-sovereign-ai-a-practical-2026-guide-to-running-capable-llms-without-sending-your-data-to-the-us-3272</link>
      <guid>https://dev.to/azena-ai/eu-sovereign-ai-a-practical-2026-guide-to-running-capable-llms-without-sending-your-data-to-the-us-3272</guid>
      <description>&lt;p&gt;"Can we use a capable language model and still keep full control over where our data is processed?" — it's one of the first questions we hear from data-sensitive companies in Europe. The good news, as of mid-2026: the answer is a clear &lt;strong&gt;yes&lt;/strong&gt;. EU data residency is no longer a compromise — it's a deliberate architecture choice, and there are genuinely capable options for it.&lt;/p&gt;

&lt;p&gt;This is about &lt;strong&gt;EU data sovereignty and residency&lt;/strong&gt; — deciding consciously &lt;em&gt;where&lt;/em&gt; and &lt;em&gt;under which legal regime&lt;/em&gt; your data is processed. That's a strength, not a stance against anyone: many of the best open models and cloud providers are international, and that's a good thing. The point is control, not opposition.&lt;/p&gt;

&lt;p&gt;We maintain the full, vendor-neutral version of this as an open guide on GitHub: &lt;a href="https://github.com/azena-ai/eu-souveraene-llms" rel="noopener noreferrer"&gt;github.com/azena-ai/eu-souveraene-llms&lt;/a&gt;. Here's the practical core.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two clean paths to EU data residency
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Self-host open weights&lt;/strong&gt; on your own or EU cloud infrastructure. No inference data is sent to the model vendor, so the &lt;em&gt;license&lt;/em&gt;, not the vendor's origin, is what matters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use an EU-headquartered managed provider&lt;/strong&gt; that runs the model and keeps processing in the EU.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Both are production-ready in 2026.&lt;/p&gt;

&lt;h2&gt;
  
  
  Path 1: Self-hosting — the license decides
&lt;/h2&gt;

&lt;p&gt;When you run downloaded weights on your own or EU infrastructure, no inference data flows to the maker, not even for models from the US or China. Origin matters only for the maker's &lt;em&gt;hosted API&lt;/em&gt;, not for weights running locally. So the deciding factor is the &lt;strong&gt;license&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Origin&lt;/th&gt;
&lt;th&gt;License (commercial?)&lt;/th&gt;
&lt;th&gt;Note&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Mistral Large 3 / Ministral 3&lt;/td&gt;
&lt;td&gt;France (EU)&lt;/td&gt;
&lt;td&gt;Apache 2.0 — free&lt;/td&gt;
&lt;td&gt;permissive flagship&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Teuken-7B (OpenGPT-X)&lt;/td&gt;
&lt;td&gt;DE/EU&lt;/td&gt;
&lt;td&gt;Apache 2.0 — free&lt;/td&gt;
&lt;td&gt;all 24 EU languages, EU-trained&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EuroLLM-22B&lt;/td&gt;
&lt;td&gt;EU consortium&lt;/td&gt;
&lt;td&gt;Apache 2.0 — free&lt;/td&gt;
&lt;td&gt;35 languages&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3&lt;/td&gt;
&lt;td&gt;Alibaba (China)&lt;/td&gt;
&lt;td&gt;Apache 2.0 — free&lt;/td&gt;
&lt;td&gt;privacy-neutral when self-hosted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek-R1&lt;/td&gt;
&lt;td&gt;DeepSeek (China)&lt;/td&gt;
&lt;td&gt;MIT — free&lt;/td&gt;
&lt;td&gt;strong reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Mistral Large 2 / Pixtral Large&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;France&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;MRL — NOT commercial&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;common trap&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Aleph Alpha Pharia-1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Germany&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Open Aleph — research only&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;commercial by contract only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Meta Llama 4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;USA&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Community License (not OSI)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;EU restriction on multimodal models&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Two expensive traps: &lt;strong&gt;not every "open" Mistral model is Apache 2.0&lt;/strong&gt; — Mistral Large 2 and Pixtral Large are under the non-commercial Mistral Research License. And &lt;strong&gt;Llama 4's license excludes EU-domiciled companies from the multimodal models&lt;/strong&gt; (&lt;a href="https://github.com/meta-llama/llama-models/blob/main/models/llama4/LICENSE" rel="noopener noreferrer"&gt;license text&lt;/a&gt;). Check the license tag, not the reputation.&lt;/p&gt;
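
&lt;p&gt;The license check can be written down as a small lookup. What follows is a minimal sketch, not legal advice: the entries mirror the table above, the names &lt;code&gt;MODEL_LICENSES&lt;/code&gt;, &lt;code&gt;is_permissive&lt;/code&gt; and &lt;code&gt;shortlist&lt;/code&gt; are hypothetical, and treating only Apache 2.0 and MIT as permissive is a deliberate simplification.&lt;/p&gt;

```python
# Illustrative only: license names copied from the table above.
PERMISSIVE = frozenset({"Apache 2.0", "MIT"})

MODEL_LICENSES = {
    "Mistral Large 3": "Apache 2.0",
    "Ministral 3": "Apache 2.0",
    "Teuken-7B": "Apache 2.0",
    "EuroLLM-22B": "Apache 2.0",
    "Qwen3": "Apache 2.0",
    "DeepSeek-R1": "MIT",
    "Mistral Large 2": "Mistral Research License",
    "Pixtral Large": "Mistral Research License",
    "Pharia-1": "Open Aleph License",
    "Llama 4": "Llama 4 Community License",
}


def is_permissive(model: str) -> bool:
    """True only if the model is known and its license is Apache 2.0 or MIT."""
    return MODEL_LICENSES.get(model) in PERMISSIVE


def shortlist(candidates: list[str]) -> list[str]:
    """Keep the candidates that pass the license check, in input order."""
    return [m for m in candidates if is_permissive(m)]


if __name__ == "__main__":
    print(shortlist(["Mistral Large 2", "Mistral Large 3", "Qwen3", "Llama 4"]))
```

&lt;p&gt;Unknown models fail the check by default, so a new candidate has to be added to the lookup, with its license, before it can reach a shortlist.&lt;/p&gt;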

&lt;h2&gt;
  
  
  Path 2: Managed inference with EU data residency
&lt;/h2&gt;

&lt;p&gt;If you'd rather not self-host, use a provider that runs the model and processes data in the EU. Here the provider's &lt;strong&gt;legal domicile&lt;/strong&gt; is a key factor for residency.&lt;/p&gt;

&lt;p&gt;EU-headquartered, EU data residency: &lt;strong&gt;Mistral La Plateforme&lt;/strong&gt; (France, EU by default, no training on API data), &lt;strong&gt;IONOS AI Model Hub&lt;/strong&gt; (Germany, data stays in DE), &lt;strong&gt;OVHcloud&lt;/strong&gt; and &lt;strong&gt;Scaleway&lt;/strong&gt; (France). A notable infrastructure development is the &lt;strong&gt;AWS European Sovereign Cloud&lt;/strong&gt;, a separate, EU-operated partition (GA since January 2026), though its model selection is still thin as of mid-2026.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data residency: what actually matters legally
&lt;/h2&gt;

&lt;p&gt;This part gets overlooked, and it matters when your compliance requires genuine EU data residency.&lt;/p&gt;

&lt;p&gt;An "EU region" alone says nothing about which &lt;strong&gt;legal regime&lt;/strong&gt; a provider is subject to. The &lt;strong&gt;US CLOUD Act&lt;/strong&gt; (2018), for instance, lets US authorities compel US-domiciled providers to hand over data regardless of server location. A Frankfurt region of a US company sits physically in the EU, but the provider remains under US law. That's not a value judgment — it's simply a factor a clean residency strategy accounts for. (&lt;a href="https://aws.amazon.com/compliance/cloud-act/" rel="noopener noreferrer"&gt;AWS on the CLOUD Act&lt;/a&gt;)&lt;/p&gt;
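
&lt;p&gt;One way to keep server location and legal regime apart is to record them as two separate fields. The sketch below assumes a hypothetical &lt;code&gt;Provider&lt;/code&gt; record and a hypothetical &lt;code&gt;residency_flags&lt;/code&gt; helper. It only lists points a review should document and does not replace a legal assessment.&lt;/p&gt;

```python
from dataclasses import dataclass

# ISO 3166-1 alpha-2 codes of the 27 EU member states.
EU_MEMBERS = frozenset({
    "AT", "BE", "BG", "HR", "CY", "CZ", "DK", "EE", "FI", "FR", "DE", "GR",
    "HU", "IE", "IT", "LV", "LT", "LU", "MT", "NL", "PL", "PT", "RO", "SK",
    "SI", "ES", "SE",
})


@dataclass(frozen=True)
class Provider:
    name: str      # hypothetical label, for illustration only
    domicile: str  # country of the provider's legal seat
    region: str    # country where the servers physically run


def residency_flags(provider: Provider) -> list[str]:
    """List the points a residency review should document for this provider."""
    flags = []
    if provider.region not in EU_MEMBERS:
        flags.append("processing outside the EU")
    if provider.domicile == "US":
        flags.append("US-domiciled provider: subject to the US CLOUD Act")
    elif provider.domicile not in EU_MEMBERS:
        flags.append("provider under a non-EU legal regime")
    return flags


if __name__ == "__main__":
    print(residency_flags(Provider("hypothetical-us-cloud", "US", "DE")))
    print(residency_flags(Provider("hypothetical-eu-host", "FR", "FR")))
```

&lt;p&gt;The Frankfurt example from the paragraph above corresponds to &lt;code&gt;domicile="US"&lt;/code&gt; with &lt;code&gt;region="DE"&lt;/code&gt;: the region passes, but the domicile still raises a flag.&lt;/p&gt;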

&lt;p&gt;There's also legal movement worth knowing: the &lt;strong&gt;EU-US Data Privacy Framework&lt;/strong&gt; is in force in mid-2026 but &lt;strong&gt;under challenge at the CJEU&lt;/strong&gt; (case C-703/25 P, pending). Teams that want maximum planning certainty keep processing in the EU from the start. And note that encryption isn't a shortcut here: the model has to see the &lt;strong&gt;plaintext&lt;/strong&gt; to run inference, so your data is readable wherever inference happens. Residency is decided by architecture, not by a bolt-on.&lt;/p&gt;
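
&lt;p&gt;Deciding residency by architecture can be as simple as refusing to send prompts anywhere except endpoints you have approved. This is a hedged sketch of that idea: the host names and the &lt;code&gt;check_endpoint&lt;/code&gt; helper are hypothetical, and a real deployment would enforce the same rule at the network level as well.&lt;/p&gt;

```python
from urllib.parse import urlsplit

# Hypothetical allowlist: hosts you have verified as self-hosted or EU-resident.
ALLOWED_HOSTS = frozenset({"llm.internal.example", "inference.eu.example"})


def check_endpoint(url: str) -> str:
    """Return the URL unchanged if its host is approved, else raise ValueError."""
    host = urlsplit(url).hostname  # lower-cased by urlsplit; None if absent
    if host not in ALLOWED_HOSTS:
        raise ValueError(f"endpoint host not approved for prompt data: {host!r}")
    return url


if __name__ == "__main__":
    print(check_endpoint("https://llm.internal.example/v1/chat"))
```

&lt;p&gt;Calling &lt;code&gt;check_endpoint&lt;/code&gt; right before every inference request turns the residency decision into something code review can see, not a convention people have to remember.&lt;/p&gt;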

&lt;blockquote&gt;
&lt;p&gt;Worth separating: the &lt;strong&gt;EU AI Act&lt;/strong&gt; governs risk and transparency, &lt;strong&gt;not&lt;/strong&gt; data residency. Rules on where your data may be processed come from the GDPR. (On the AI Act: our &lt;a href="https://azena.ai/eu-ai-act-2026/" rel="noopener noreferrer"&gt;practical EU AI Act compliance guide&lt;/a&gt;, and the &lt;a href="https://github.com/azena-ai/eu-ai-act-mittelstand" rel="noopener noreferrer"&gt;open, vendor-neutral version on GitHub&lt;/a&gt;.)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  How to decide
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Your need&lt;/th&gt;
&lt;th&gt;Recommendation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Maximum data control, processing in-house&lt;/td&gt;
&lt;td&gt;Self-host open Apache-2.0/MIT weights (Mistral, Teuken, EuroLLM, Qwen, DeepSeek)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Managed, with EU data residency&lt;/td&gt;
&lt;td&gt;EU-headquartered provider (Mistral La Plateforme, IONOS, OVHcloud, Scaleway)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Already on Azure/AWS&lt;/td&gt;
&lt;td&gt;Workable — document the legal situation (DPF status, provider's regime) in a transfer impact assessment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Just exploring&lt;/td&gt;
&lt;td&gt;Start small: a self-hosted 7–24B model (Ministral 3, Teuken-7B) on an EU GPU instance&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
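
&lt;p&gt;The decision table above can also live in code, for example in an internal checklist tool. The following is only a sketch that restates the table: the &lt;code&gt;Need&lt;/code&gt; enum and the &lt;code&gt;recommend&lt;/code&gt; function are hypothetical names, and the recommendation strings are shortened from the table.&lt;/p&gt;

```python
from enum import Enum


class Need(Enum):
    IN_HOUSE = "maximum data control, processing in-house"
    MANAGED_EU = "managed, with EU data residency"
    ON_HYPERSCALER = "already on Azure/AWS"
    EXPLORING = "just exploring"


# One entry per row of the decision table.
RECOMMENDATION = {
    Need.IN_HOUSE: "Self-host open Apache-2.0/MIT weights",
    Need.MANAGED_EU: "Use an EU-headquartered managed provider",
    Need.ON_HYPERSCALER: "Workable: document the legal situation in a "
                         "transfer impact assessment",
    Need.EXPLORING: "Start small: a self-hosted 7-24B model on an EU GPU instance",
}


def recommend(need: Need) -> str:
    """Look up the table's recommendation for a given need."""
    return RECOMMENDATION[need]


if __name__ == "__main__":
    for need in Need:
        print(f"{need.value}: {recommend(need)}")
```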

&lt;p&gt;EU-sovereign AI in 2026 isn't a compromise. It's a deliberate architecture decision with genuinely capable options. Sovereignty here means you keep control over where your data is processed: a strength that comes from choice, not opposition.&lt;/p&gt;

&lt;p&gt;We build exactly these systems — bespoke, EU-sovereign AI for the German Mittelstand — at &lt;a href="https://azena.ai" rel="noopener noreferrer"&gt;azena&lt;/a&gt;. The full, sourced, vendor-neutral guide lives here: &lt;strong&gt;&lt;a href="https://github.com/azena-ai/eu-souveraene-llms" rel="noopener noreferrer"&gt;github.com/azena-ai/eu-souveraene-llms&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>privacy</category>
      <category>llm</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
