DEV Community: azena.ai

OpenTelemetry's GenAI semantic conventions are NOT stable yet — here's what actually shipped in 2026

azena.ai — Thu, 16 Jul 2026 19:51:50 +0000

If you search for "OpenTelemetry GenAI semantic conventions" right now, you'll find a pile of blog posts confidently declaring that gen_ai.* "went stable" at some point — one popular claim pins it to "OTel 1.30". The primary sources say otherwise. As of mid-July 2026, every single gen_ai.* attribute, span, metric, and event in the official OpenTelemetry registry carries the stability badge "Development" (the status formerly called "experimental"). Not one is marked Stable.

That doesn't mean you should wait. It means you should know exactly what shipped, what got renamed, and where the moving parts are — because if you instrument against a blog post from 2025, you're emitting deprecated attributes today.

Everything below is checked against the OTel attribute registry, the semantic-conventions release changelogs, and the new GenAI repo, as of 2026-07-16.

The June 2026 split: GenAI moved out of the main repo

The biggest structural change happened on June 12, 2026, with semantic-conventions v1.42.0: all GenAI conventions — model/gen-ai/, the OpenAI-specific ones under model/openai/, and the MCP conventions under model/mcp/ — were deprecated in the main repo and moved to a dedicated repository: open-telemetry/semantic-conventions-genai.

Two things about that repo you should internalize:

It has no tagged release yet. As of July 16, 2026, the releases page is empty. The conventions evolve on main.
Its documents still say "Status: Development" — both gen-ai-spans.md and gen-ai-metrics.md.

For context, the main repo shipped at its usual cadence this year — v1.39.0 (January 12), v1.40.0 (February 19), v1.41.0 (April 28), v1.41.1 (May 11), v1.42.0 (June 12, the GenAI extraction), v1.43.0 (July 3). The last big GenAI additions inside the main repo landed in v1.41.0: streaming metrics (gen_ai.client.operation.time_to_first_chunk and .time_per_output_chunk), invoke_workflow as an operation, and the split of invoke_agent into client vs. internal spans.

So the honest status line is: a rapidly consolidating standard with its own repo, its own SIG momentum, real vendor adoption — and no stability guarantee on attribute names.

What you should emit today

Despite the Development status, the core signal set has settled enough to build on. Here's the short list that matters:

Signal	Name	Notes
Span attribute (required)	`gen_ai.operation.name`	`chat`, `embeddings`, `execute_tool`, `invoke_agent`, `invoke_workflow`, `create_agent`, `retrieval`, `plan`, memory ops …
Span attribute (required)	`gen_ai.provider.name`	e.g. `openai`, `anthropic`, `aws.bedrock`, `gcp.vertex_ai`
Span attributes (recommended)	`gen_ai.request.model`, `gen_ai.response.model`, `gen_ai.response.finish_reasons`, `gen_ai.conversation.id`, `error.type`
Token usage (span)	`gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`
Metric	`gen_ai.client.token.usage`	Histogram, unit `{token}`, split by `gen_ai.token.type` (input/output)
Metric	`gen_ai.client.operation.duration`	Histogram, unit `s`
Streaming metrics	`gen_ai.client.operation.time_to_first_chunk`, `.time_per_output_chunk`	New as of April 2026
Server-side (self-hosted inference)	`gen_ai.server.request.duration`, `.time_to_first_token`, `.time_per_output_token`
Content capture (opt-in only)	`gen_ai.system_instructions`, `gen_ai.input.messages`, `gen_ai.output.messages`	Not captured by default

You don't need an instrumentation library to emit this. A plain manual span with the standard Python SDK is fully conformant:

from opentelemetry import trace

tracer = trace.get_tracer("my-agent")

def call_llm(client, messages, model="claude-sonnet-4-5"):
    # Span name convention: "{operation} {model}"
    with tracer.start_as_current_span(f"chat {model}") as span:
        span.set_attribute("gen_ai.operation.name", "chat")
        span.set_attribute("gen_ai.provider.name", "anthropic")
        span.set_attribute("gen_ai.request.model", model)

        response = client.messages.create(model=model, messages=messages, max_tokens=1024)

        span.set_attribute("gen_ai.response.model", response.model)
        span.set_attribute("gen_ai.usage.input_tokens", response.usage.input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", response.usage.output_tokens)
        return response

Note what's absent: no prompt text, no completion text. Under the current conventions, content capture is opt-in and the default is metadata-only, which is a sane default if you operate under GDPR or handle customer data.
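The two client metrics from the table can be recorded next to the span. What follows is a minimal sketch, not an official helper: it assumes an instrument with a `record(value, attributes=...)` method, which is the shape of the histograms returned by `Meter.create_histogram` in the OpenTelemetry Python API. `RecordingHistogram` and `record_llm_metrics` are hypothetical names, used so the example runs without any SDK installed.

```python
import time


class RecordingHistogram:
    """Stand-in with the same record() shape as an OTel histogram (hypothetical)."""

    def __init__(self, name, unit):
        self.name = name
        self.unit = unit
        self.points = []  # list of (value, attributes) tuples

    def record(self, value, attributes=None):
        self.points.append((value, dict(attributes or {})))


def record_llm_metrics(token_hist, duration_hist, *, provider, model,
                       input_tokens, output_tokens, duration_s):
    """Record one chat call on the token-usage and duration histograms."""
    base = {
        "gen_ai.operation.name": "chat",
        "gen_ai.provider.name": provider,
        "gen_ai.request.model": model,
    }
    # One histogram, two data points, split by gen_ai.token.type.
    token_hist.record(input_tokens, attributes={**base, "gen_ai.token.type": "input"})
    token_hist.record(output_tokens, attributes={**base, "gen_ai.token.type": "output"})
    duration_hist.record(duration_s, attributes=base)


token_hist = RecordingHistogram("gen_ai.client.token.usage", "{token}")
duration_hist = RecordingHistogram("gen_ai.client.operation.duration", "s")

start = time.monotonic()
# ... the model call would happen here ...
elapsed = time.monotonic() - start

record_llm_metrics(token_hist, duration_hist, provider="anthropic",
                   model="claude-sonnet-4-5", input_tokens=120,
                   output_tokens=45, duration_s=elapsed)
```

In a real service you would pass histograms created with `opentelemetry.metrics.get_meter(...).create_histogram(name, unit=...)` instead of the stand-in; the attribute dictionaries stay the same.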

The rename traps

This is where those outdated blog posts actively hurt you. If your instrumentation (or your vendor's) was written against the 2024/2025 state of the conventions, run this migration checklist:

[ ] gen_ai.system → gen_ai.provider.name. The old attribute is deprecated in the registry. If your dashboards group by gen_ai.system, they'll go dark as libraries update.
[ ] gen_ai.usage.prompt_tokens → gen_ai.usage.input_tokens and gen_ai.usage.completion_tokens → gen_ai.usage.output_tokens. Cost dashboards built on the old names silently under-count once emitters switch.
[ ] gen_ai.prompt and gen_ai.completion are removed entirely — not renamed. Content capture now goes through the opt-in gen_ai.input.messages / gen_ai.output.messages (and gen_ai.system_instructions).
[ ] Grep your alert rules and saved queries, not just your code. The emitting side and the querying side drift independently.
[ ] Expect more of this. Development status explicitly means names can still change. OTel's usual transition mechanism is dual-emission via OTEL_SEMCONV_STABILITY_OPT_IN — plan for a window where you handle both old and new names.
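The checklist can be turned into a small shim for the transition window. This is a sketch under assumptions: `migrate_attributes` and `find_deprecated` are hypothetical helpers, not part of any OTel library, and the mapping contains only the renames named in this post.

```python
import re

# Deprecated -> current names, taken from the checklist above.
RENAMES = {
    "gen_ai.system": "gen_ai.provider.name",
    "gen_ai.usage.prompt_tokens": "gen_ai.usage.input_tokens",
    "gen_ai.usage.completion_tokens": "gen_ai.usage.output_tokens",
}
# Removed outright; there is no one-to-one replacement attribute.
REMOVED = {"gen_ai.prompt", "gen_ai.completion"}


def migrate_attributes(attrs, dual_emit=False):
    """Return a copy of attrs that uses the current attribute names.

    With dual_emit=True the deprecated keys are kept next to the new ones,
    for a window in which dashboards still query the old names.
    """
    out = {}
    for key, value in attrs.items():
        if key in REMOVED:
            continue  # content capture is opt-in via gen_ai.input/output.messages
        new_key = RENAMES.get(key)
        if new_key is None:
            out[key] = value
            continue
        out.setdefault(new_key, value)  # a value already set under the new name wins
        if dual_emit:
            out[key] = value
    return out


def find_deprecated(text):
    """List deprecated names that appear in a query, alert rule, or source file."""
    hits = []
    for name in sorted(RENAMES.keys() | REMOVED):
        # Boundary checks keep gen_ai.system from matching gen_ai.system_instructions.
        if re.search(rf"(?<![\w.]){re.escape(name)}(?!\w)", text):
            hits.append(name)
    return hits
```

Running `find_deprecated` over exported dashboard JSON and alert rules covers the querying side, which drifts independently of the emitting side.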

Agents and MCP get first-class spans

The part most relevant to anyone running agents in production: the conventions now model the whole agent execution as a span tree, not just single LLM calls.

invoke_agent  (the agent run)
├── chat      (each model call)
│   └── execute_tool   (each tool invocation)
├── chat
└── execute_tool

gen_ai.operation.name covers the full agent lifecycle — create_agent, invoke_agent, invoke_workflow, execute_tool, retrieval, plan, plus memory operations. And notably, the MCP conventions moved into the same GenAI repo with the v1.42.0 extraction, so MCP tool calls are part of the same trace vocabulary as the agent that issues them.

This is no longer theoretical: per the official OTel blog's 2026 GenAI observability post, coding agents like VS Code Copilot, OpenAI Codex, and Claude Code (the latter in beta) already emit OTel GenAI traces — while the same post stresses that the conventions remain under active development.

One thing traces won't give you, though, is whether the agent's output was any good. Latency and token counts are necessary but not sufficient — for the quality side you need evals wired into the same pipeline; we wrote up how we approach that for production agents here (German).

Who actually speaks gen_ai.* today

Adoption is real but uneven — schema fragmentation hasn't gone away:

Datadog supports the OTel GenAI semantic conventions natively in its LLM/Agent Observability product (they published a dedicated post on it).
Langfuse accepts traces through a full OTLP endpoint and maps gen_ai.* onto its own data model.
Arize Phoenix's native schema is OpenInference (its own llm.* namespace); it interoperates through a translation layer (e.g. an OpenInferenceSpanProcessor for converting OpenLLMetry traces), with an open RFC discussion about closer gen_ai.* alignment.
OpenLLMetry/Traceloop is OTel-based — parts of the original GenAI conventions actually came from an OpenLLMetry donation — but still emits some deprecated attributes (gen_ai.prompt/gen_ai.completion); there's an open issue tracking the migration.

Converters exist between these worlds, but fidelity varies. Which leads to the practical conclusion.

The playbook

Instrument once, against gen_ai.*. Whether hand-rolled spans or an OTel-based library — the convention is the contract, not the vendor.
Treat OTLP as the socket. Export via OTLP, and Langfuse, Datadog, or a Grafana stack become swappable backends instead of lock-in decisions.
Cost dashboards come almost free. gen_ai.client.token.usage + gen_ai.request.model + gen_ai.provider.name give you a standardized token-usage schema across every provider you use; join it with your per-model prices and you have cost.
Keep prompt capture opt-in. The default is metadata-only; only enable gen_ai.input.messages/gen_ai.output.messages where you've settled the privacy question.
Budget for renames. Development status is a feature warning, not fine print. The gen_ai.system rename won't be the last one.
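To make the cost point concrete, here is a rough sketch of turning token-usage data points into cost per provider and model. The prices and model names are hypothetical placeholders, and `cost_by_model` is an illustrative helper rather than part of any backend.

```python
from collections import defaultdict

# HYPOTHETICAL prices in USD per million tokens; replace with your own rates.
PRICES_PER_MTOK = {
    ("anthropic", "model-a"): {"input": 3.00, "output": 15.00},
    ("openai", "model-b"): {"input": 2.00, "output": 8.00},
}


def cost_by_model(points):
    """Aggregate cost from gen_ai.client.token.usage data points.

    Each point is (token_count, attributes), where attributes uses the
    gen_ai.* names from this post.
    """
    totals = defaultdict(float)
    for tokens, attrs in points:
        key = (attrs["gen_ai.provider.name"], attrs["gen_ai.request.model"])
        price = PRICES_PER_MTOK[key][attrs["gen_ai.token.type"]]
        totals[key] += tokens * price / 1_000_000
    return dict(totals)


points = [
    (1_000_000, {"gen_ai.provider.name": "anthropic",
                 "gen_ai.request.model": "model-a",
                 "gen_ai.token.type": "input"}),
    (500_000, {"gen_ai.provider.name": "anthropic",
               "gen_ai.request.model": "model-a",
               "gen_ai.token.type": "output"}),
]
costs = cost_by_model(points)
```

Because the grouping keys are the standardized attribute names, the same aggregation works for any provider that emits them.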

We run agent observability for German Mittelstand clients at azena.ai. If you read German, our deeper comparison of five observability tools for exactly this stack lives here.

The reliability gap: what it actually takes to put an AI agent in production

azena.ai — Fri, 26 Jun 2026 12:09:52 +0000

A demo agent is easy. It calls a model, the model calls a tool, the tool returns something plausible, and everyone in the room nods. Then you put the same agent in front of real users, real data, and real money — and it quietly does the wrong thing 4% of the time. Nobody notices until a customer does.

That 4% is the reliability gap. It is the entire distance between a convincing demo and a system you can actually depend on, and almost nothing in the typical LLM tutorial prepares you for it.

Here is what closing that gap actually involves.

The three things that make agents hard

1. They are non-deterministic by construction. The same input can produce a different tool call tomorrow. Your regression intuition — "I didn't touch that code, so it still works" — is simply false. A prompt tweak three steps upstream can change a decision three steps downstream.

2. They fail silently. A traditional service throws. An agent confidently returns a wrong answer in the same shape as a right one. There is no stack trace for "the model misread the invoice total."

3. There is rarely a ground truth at runtime. When the agent decides, you usually cannot check the decision against an oracle in the moment. You only find out later, in aggregate, if you measured.

If you internalise nothing else: an agent is not a function you debug, it is a population you have to measure.

Evals are the test suite you're missing

The single highest-leverage thing a team can build is an eval set — a collection of realistic inputs with known-good outcomes that you run on every change. Not "does it sound good," but "did it pick the right tool / extract the right field / refuse the out-of-scope request."

A useful eval set has three properties:

It is drawn from real traffic, not from your imagination. Log production interactions, sample the weird ones, and turn the failures into permanent test cases.
It scores behaviour, not vibes. "Selected the refund tool when the policy said deny" is checkable. "Was helpful" is not.
It runs in CI. A prompt change that lifts one metric and quietly drops another should fail the build before it ships, exactly like a unit test.
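A behaviour-scoring eval set can start very small. The sketch below assumes the agent can be called as a function that returns the name of the tool it would pick; `EvalCase`, `run_evals`, and `toy_agent` are hypothetical names, and the cases are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    name: str
    user_input: str
    expected_tool: str  # the known-good behaviour: which tool must be chosen


def run_evals(agent, cases, min_pass_rate=1.0):
    """Run every case, compare the chosen tool, and return (passed, failures)."""
    failures = [c.name for c in cases if agent(c.user_input) != c.expected_tool]
    pass_rate = 1 - len(failures) / len(cases)
    return pass_rate >= min_pass_rate, failures


def toy_agent(user_input):
    """Stand-in for the real agent: returns the tool it would call."""
    return "deny_refund" if "outside return window" in user_input else "issue_refund"


CASES = [
    EvalCase("refund_in_policy", "Item arrived broken, order #1", "issue_refund"),
    EvalCase("refund_denied", "Bought 2 years ago, outside return window", "deny_refund"),
    EvalCase("out_of_scope", "Write me a poem", "refuse"),
]

ok, failed = run_evals(toy_agent, CASES)
# In CI, a non-zero exit fails the build: raise SystemExit(0 if ok else 1)
```

Here the toy agent has no way to refuse, so the out-of-scope case fails and the build would stop, which is exactly the kind of regression the set exists to catch.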

This is the part most teams skip, and it is the part that separates an agent you can iterate on from one you are afraid to touch. I wrote up the failure modes in more detail here: why AI agents fail in production and what evals have to do with it.

Guardrails constrain the action space, not the prose

A common mistake is to treat reliability as a prompting problem — add another paragraph of "you must never…" and hope. Prompts are persuasion, not enforcement.

Real guardrails live in code, around the model:

Allow-list the tools available in each state. An agent in a "read-only support" state should not have a delete_account tool in scope at all. Don't ask it nicely — don't hand it the gun.
Validate every tool call against a schema and against business rules before execution. The model proposes; deterministic code disposes.
Bound the loop. Max steps, max spend, max retries. An agent with an unbounded loop and a credit card is an incident waiting for a date.
Make refusal a first-class outcome. "I don't have enough information, escalating to a human" is a success, not a failure, and your evals should reward it.
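The four guardrails above fit in a few lines of runtime code. This is a sketch under assumptions: the states, tool names, and the `order_id` rule are hypothetical, and `propose` stands in for whatever returns the model's next tool call.

```python
# Tools available per state; all names are hypothetical.
ALLOWED_TOOLS = {
    "read_only_support": {"lookup_order", "escalate_to_human"},
    "account_admin": {"lookup_order", "delete_account", "escalate_to_human"},
}
MAX_STEPS = 5


class GuardrailViolation(Exception):
    """Raised when a proposed tool call breaks an allow-list or business rule."""


def check_tool_call(state, call):
    """Validate a model-proposed call before anything executes."""
    if call["tool"] not in ALLOWED_TOOLS[state]:
        raise GuardrailViolation(f"{call['tool']} not allowed in state {state}")
    if call["tool"] == "lookup_order" and not isinstance(call["args"].get("order_id"), int):
        raise GuardrailViolation("lookup_order needs an integer order_id")


def run_agent(state, propose, execute):
    """Bounded loop: the model proposes, deterministic code decides."""
    for _ in range(MAX_STEPS):
        call = propose()
        if call is None:
            return "done"
        check_tool_call(state, call)
        if call["tool"] == "escalate_to_human":
            return "escalated"  # escalation is a valid outcome, not an error
        execute(call)
    return "step_limit_reached"
```

The allow-list lookup happens before execution, so a `delete_account` proposal in the read-only state never reaches `execute`, regardless of what the prompt said.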

The mental model: the LLM is the planner, but the runtime is the adult in the room.

Human-in-the-loop is an architecture, not an apology

There is a persistent fantasy that "fully autonomous" is the goal and a human checkpoint is a temporary crutch. For anything with legal, financial, or safety weight, that is backwards. The human checkpoint is the design.

The interesting engineering question is not whether a human reviews, but where — you want the agent to do the 90% that is mechanical (gather, draft, structure, pre-fill) and route the 10% that carries liability to a person, with the full context assembled so the review takes seconds, not minutes. That's the difference between automation that scales and automation that creates a new bottleneck.

We unpack where to draw that line — chatbot vs. agent, and which workflows should never be fully autonomous — here: agentic AI without the autonomy theatre.

Where agents should not go

Honesty is a feature. Some boundaries are not optimisation problems:

Anything where a hallucinated fact becomes a liability (a legal citation, a medical dosage, a contractual figure) needs a deterministic source of record and a human signature — not a more confident model.
Anything irreversible should be gated behind an explicit confirmation that a person, not the agent, owns.
Anything touching regulated or personal data should be designed for data control from day one — which European model and infrastructure you run on is a real architectural choice, not an afterthought.

Saying "an agent is the wrong tool here" out loud is one of the most senior things an engineer building these systems can do.

The unglamorous summary

Reliable agents are less about a clever prompt and more about boring infrastructure: a real eval set wired into CI, guardrails enforced in code, bounded loops, and a deliberate human checkpoint exactly where the stakes are. None of it is exciting. All of it is the difference between a demo and a system.

If you're a small or mid-sized team that wants agents in production but doesn't have an in-house ML platform team to build that scaffolding, that gap is exactly the thing a focused engineering partner exists to close — that's the work we do at azena, an EU AI boutique: bespoke systems, evaluated, with the guardrails and the data-control decisions made on purpose.

Build the eval set first. Everything else gets easier once you can measure.

The Forschungszulage: how the German state co-funds your AI and software development (2026)

azena.ai — Wed, 24 Jun 2026 10:15:45 +0000

Germany has a funding scheme for software development that a surprising number of teams overlook, even though it is a legal entitlement, involves no competition, and is paid out even when you are making a loss. It is called the Forschungszulage (research allowance, FZulG), and the changes that took effect in 2024 and 2026 made it considerably more attractive. If you do serious development work, especially on AI and non-trivial software, it is worth a look.

I summarize the practical core here. We maintain a more detailed, vendor-neutral overview with all sources in the open on GitHub: github.com/azena-ai/ki-foerderung-mittelstand. This is not tax advice; it is a working basis. As of June 2026.

What the Forschungszulage is

The Forschungszulage is a tax incentive for research and development (R&D). Instead of a grant allocated by a case officer, you receive a fixed percentage of your R&D costs as a tax credit, and if you pay no tax (for example, a young company making a loss), the amount is paid out.

Three properties make it special:

Legal entitlement. If you meet the requirements, you get it. No "the funding pot is empty", no first-come-first-served race.
Independent of sector and size. From sole proprietor to corporate group, any legal form.
Retroactive claims are possible. Projects that started in 2020 or later are eligible, so ongoing projects count too.

The terms (2026)

Funding rate: 25% of eligible costs, 35% for SMEs (since 28 March 2024).
Assessment base: up to €12 million per year (since 1 January 2026). For an SME that makes a maximum allowance of €4.2 million per year (35% × €12 million).
Own work by shareholders/sole proprietors: €100 per hour, max. 40 hours per week (since 2026; previously €70). This matters for small teams in which the founders do the development themselves.
Contract research: if you have development done externally, 70% of the fee is eligible, and the contractor must be based in the EEA.
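The arithmetic behind these terms can be sketched in a few lines. This is a simplified estimate and not tax advice: `forschungszulage` is a hypothetical helper, `in_house_costs` stands for whatever in-house R&D costs are actually accepted as eligible, and the caller is assumed to have already limited owner hours to 40 per week.

```python
OWNER_RATE_EUR = 100.0        # per hour of own work by shareholders/sole proprietors
CONTRACT_SHARE = 0.70         # eligible share of contract research fees
ASSESSMENT_CAP_EUR = 12_000_000.0  # yearly cap on the assessment base


def forschungszulage(in_house_costs, owner_hours=0.0, contract_fees=0.0, is_sme=False):
    """Estimate the yearly allowance in euros from the 2026 terms listed above."""
    rate = 0.35 if is_sme else 0.25
    eligible = (in_house_costs
                + owner_hours * OWNER_RATE_EUR
                + contract_fees * CONTRACT_SHARE)
    return min(eligible, ASSESSMENT_CAP_EUR) * rate


# Example with invented figures: an SME with 200,000 in eligible in-house costs,
# 1,000 founder hours, and 50,000 in contract research fees.
# Eligible base: 200,000 + 100,000 + 35,000 = 335,000; at 35% that is about 117,250.
estimate = round(forschungszulage(200_000, owner_hours=1_000,
                                  contract_fees=50_000, is_sme=True), 2)
```

At the cap, an SME reaches the €4.2 million maximum (35% × €12 million), whatever the mix of cost types.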

Does software development count? Does AI?

That is the decisive question, and the answer is: yes, on one condition.

What is eligible is development with genuine technical or scientific uncertainty. The relevant category in the law is experimental development (experimentelle Entwicklung). For software/AI that means, concretely:

✅ Eligible if you do not know from the outset whether and how it will work:

novel algorithms or model architectures
non-trivial ML projects (your own models, hard data or integration problems)
technical solutions for which there is no proven standard approach

❌ Not eligible:

routine programming along known patterns
pure product maintenance, bug fixing, customizing standard software
anything you can "just write down" because the path is clear

The line is not "is it AI?" but "was there a genuine technical risk you had to resolve?" That is exactly why bespoke development so often qualifies and standard integration does not.

The application path has two stages

Many applicants fail not on the substance but because they do not know the process. There are two steps:

Apply for a certificate from the BSFZ (Bescheinigungsstelle Forschungszulage, the certification body). It assesses on the merits whether your project is eligible R&D. This is the real hurdle, and you can clear it up front, before you spend any money.
Assessment by the tax office (Finanzamt) via ELSTER. Here the concrete amount is set on the basis of the certificate.

The practical tip: get the BSFZ certificate early. It gives you planning certainty that the project will be recognized before you claim the costs.

Can it be combined?

Yes. The Forschungszulage can be combined with grant programmes such as ZIM, as long as you do not fund the same costs twice. A typical split: a project grant (ZIM) for one cost block, the Forschungszulage for the other. That is also covered in the overview, along with the other programmes still running in 2026 (INVEST, EXIST, Mittelstand-Digital Zentren, state-level programmes).

As an aside, because it saves time: Digital Jetzt and go-digital, the programmes many people think of first when they hear "digitalization funding", have both ended (end of 2023 and 2024 respectively). There is no point researching them any further.

Why this is here

At azena we build bespoke, EU-sovereign AI systems for the German Mittelstand, and for exactly this kind of work the Forschungszulage is regularly applicable, because bespoke development involves technical uncertainty almost by definition. The most common misconception we hear is "funding is only for corporations with a research department". The opposite is true: the Forschungszulage is made for teams building something technically new, no matter how small they are.

The binding assessment is always made by the BSFZ, and this post is no substitute for tax advice. But if you are currently developing something non-trivial and have never thought about the Forschungszulage: do.

All programmes, terms, and official sources, open and maintained, are here:
github.com/azena-ai/ki-foerderung-mittelstand (corrections via PR welcome).

The genome pattern: how to build an agent loop that actually improves itself

azena.ai — Wed, 24 Jun 2026 09:46:01 +0000

Most "autonomous agents" are one prompt in a while loop. They run, they drift, they repeat yesterday's mistake, and they keep no memory of anything they learned. After a day you don't have an agent that got better — you have the same agent, more tired.

We've been running a different pattern in production at azena for months, and I want to describe it concretely because it's almost embarrassingly simple: no framework, four markdown files, and one discipline. We open-sourced the templates — azena-ai/self-improving-loop — but the idea matters more than the files, so here's the whole thing.

The core idea: the agent runs on a genome

The agent doesn't run on a fixed prompt. It runs on a genome — a versioned strategy file that it both reads and rewrites.

Every cycle ("tick") the loop does one thing: it picks the single highest-value move, ships it, verifies it actually worked, and then folds what it learned back into its own instructions. The genome goes v001 → v002 → v003…, and each bump is an auditable record of the agent changing its own mind.

That last part is the whole game. A static prompt fights reality the moment the mission shifts. A genome absorbs the shift, because the loop is allowed to edit it.

The loop in one picture

flowchart TD
  A([Tick fires]) --> B[Read genome + levers + lessons]
  B --> C[Pick the single highest-value lever]
  C --> D[Build / act — small, shippable]
  D --> E{Gate: verify it really works}
  E -- fail --> C
  E -- pass --> F[Commit]
  F --> G[Self-improve:<br/>rewrite genome, append a lesson, bump version]
  G --> H[Schedule the next tick]
  H --> A

No human in the inner loop. A human sets the mission and reviews the diffs in the morning. That's the deal.

Four files, three of which the loop edits

File	Role
`genome.md`	The evolving strategy + state: mission, current focus, what's proven, what's next. The loop mutates this as it learns. Versioned.
`loop-prompt.md`	The orchestrator the agent executes each tick — and improves. Holds the tick cycle, the gate rules, and an append-only lessons log.
`levers.md`	The prioritized backlog. A ranked list of moves with a status log. The loop always takes the top open one.
`lessons`	Hard-won rules, written back into the prompt the moment they're learned. This is the "self-improving" part.

The tick itself is just a state machine:

read(genome, levers, lessons)
lever  = highest_value_open(levers)
result = act(lever)            # small, shippable
if not gate(result):           # verify the ARTIFACT
    reschedule(); return       # back to the top — never "commit anyway"
commit(result)
update(genome.status, lever.status)
maybe_mutate(genome)           # bump version if strategy changed
maybe_append(lessons)          # if something was learned
schedule_next_tick()
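For readers who prefer something executable, here is one way the pseudocode could look in Python. It is a sketch under assumptions, not the code from the templates: state is an in-memory dict (in practice it would be read from and written back to the markdown files), `act`, `gate`, and `commit` are caller-supplied callables, and bumping the genome version whenever a lesson is learned is a simplification of the separate mutate and append steps.

```python
def tick(state, act, gate, commit):
    """Run one tick over `state`, a dict with 'genome', 'levers', and 'lessons'."""
    open_levers = [lv for lv in state["levers"] if lv["status"] == "open"]
    if not open_levers:
        return "idle"
    # One lever per tick: take the single highest-value open move.
    lever = max(open_levers, key=lambda lv: lv["value"])

    result = act(lever)
    if not gate(result):
        # Verification failed: nothing is committed and the lever stays open.
        return "gate_failed"

    commit(result)
    lever["status"] = "done"
    lesson = result.get("lesson")
    if lesson:
        # Something was learned: log it and record the change of strategy.
        state["lessons"].append(lesson)
        state["genome"]["version"] += 1
    return "committed"
```

A scheduler would call `tick` repeatedly; the return value tells it whether to move on or retry on the next run.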

Why file-based memory beats a context window

A context window is volatile and small. It evaporates on compaction, restart, or a long enough wall-clock gap. So an agent whose "state" lives in context literally forgets where it was.

A genome file is durable and unbounded. The loop can run for days across many ticks, restarts, and summarizations, and still know exactly where it is — because "where it is" is written down, not remembered. When context gets summarized away, the next tick just re-reads the genome and carries on. That single decision — state on disk, not in the window — is what turns a chatty demo into something that survives a week.

The non-negotiable: gates

Here's the part everyone skips, and it's the part that makes autonomy safe instead of reckless.

An autonomous loop is only as trustworthy as its verification. A gate is a check that must pass before a commit. A failing gate sends the loop back to pick another lever — never forward to "commit anyway."

The minimum gate is three steps, and the order matters:

Typecheck green. Run the real check. Do not pipe it through head/tail: unless pipefail is set, a pipeline reports the exit status of its last command, so it exits 0 and will happily print "OK" over a stack of errors.
Build green. Many bundlers strip types and build green despite type errors — so step 1 is not optional.
Verify the artifact, not the log. This is the one teams skip. "The deploy succeeded" is not evidence that the page renders, the endpoint responds, or the file is non-empty.

That third point is the single most expensive class of bug in an autonomous loop: a step that reports success while producing garbage. A prerender that silently emits an empty SPA shell. A migration that "ran" but touched zero rows. (We dug into this exact failure mode for production agents — why they pass demos but fail live — here.) So assert a concrete property of the real artifact:

# don't trust "build OK" — prove it
test "$(grep -c '<div id="root"></div>' dist/index.html)" -eq 0   # not an empty shell
test "$(wc -c < dist/index.html)" -gt 50000                          # has real content

One caveat so you don't fight ghosts: a transient failure on something you didn't touch (a network blip, a cold start) isn't a regression. Re-run the gate once. If it fails deterministically, it's real.
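The same gate can be written in Python when the loop is not driven from a shell. This is a minimal sketch with hypothetical helper names (`run_gate`, `artifact_ok`, `gate_with_retry`); the 50,000-byte threshold and the empty-root marker mirror the shell checks above, and the Python interpreter stands in for real typecheck and build commands.

```python
import subprocess
import sys
from pathlib import Path


def run_gate(steps):
    """Run each (name, argv) step in order; stop at the first non-zero exit code."""
    for name, argv in steps:
        # No shell and no pipe: returncode is the checker's own exit status.
        proc = subprocess.run(argv, capture_output=True, text=True)
        if proc.returncode != 0:
            return False, name, proc.stdout + proc.stderr
    return True, None, ""


def artifact_ok(path, min_bytes=50_000, empty_marker='<div id="root"></div>'):
    """True if the built page exists, is large enough, and is not an empty shell."""
    p = Path(path)
    if not p.is_file() or p.stat().st_size <= min_bytes:
        return False
    return empty_marker not in p.read_text(encoding="utf-8", errors="replace")


def gate_with_retry(check, retries=1):
    """Re-run a failing check up to `retries` times before treating it as real."""
    for _ in range(retries + 1):
        if check():
            return True
    return False


# Demo: the second step exits non-zero, so the gate fails on "build".
steps = [
    ("typecheck", [sys.executable, "-c", "print('0 errors')"]),
    ("build", [sys.executable, "-c", "import sys; print('build failed'); sys.exit(2)"]),
]
passed, failed_step, output = run_gate(steps)
```

Wrapping the artifact check as `gate_with_retry(lambda: artifact_ok("dist/index.html"))` gives the single re-run described above without letting a deterministic failure through.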

Lessons the loop has already taught itself

These are real, generalized from production runs. The point of the pattern is that this list grows by itself — the loop appends to it the moment it gets burned:

Verify before you ship. A build step can fail silently and leave an empty shell. Assert the output is non-empty and correct before deploying.
Don't turn one finding into a destructive sweep. A single odd-looking match is not a mandate for a sitewide find-and-replace. Check intent first.
When reality contradicts the task's premise, report — don't blindly execute. If the job says "small fix" and you find a load-bearing rewrite, surface it instead of plowing ahead.
Tools that need a server don't start one. Bring the server up, wait for it, then run the check. A flood of connection refused means "nothing's listening," not "everything's broken."

Notice these aren't AI-specific. They're the operating rules of a careful engineer — except the loop wrote them for itself, after paying for them once.

One lever per tick

Last principle, easy to underrate: one lever per tick. Small, shippable units keep every change reviewable and every failure cheap to roll back. The temptation with an autonomous agent is to let it do five things at once "to save time." Don't. A tick that ships one verified thing and stops is worth more than a tick that ships five unverifiable things. The cadence is the safety rail.

Getting started

If you want to try it, it genuinely is four files and an afternoon:

Copy the four *.template.md files from the repo into your project.
Fill genome.md with your mission and first focus.
Seed levers.md with a ranked backlog.
Hand loop-prompt.md to your agent and tell it to run one tick, then schedule the next.
Review the diffs each morning. Watch the genome evolve.

It works anywhere an agent can schedule its own next turn and commit to git — we run it inside Claude Code, but nothing in the pattern is Claude-specific.

We build this kind of thing for a living — bespoke, EU-sovereign AI systems for the German Mittelstand — at azena, and we teach the craft at the azena Dev Academy. The loop pattern came out of needing our own automation to be trustworthy enough to leave running overnight. If you build something with it, I'd love to hear how the genome evolved.

The templates, docs, and a few reusable skills are all MIT-licensed here: github.com/azena-ai/self-improving-loop.

EU-sovereign AI: running capable LLMs with full data control (2026 guide)

azena.ai — Tue, 23 Jun 2026 23:35:01 +0000

"Can we use a capable language model and still keep full control over where our data is processed?" — it's one of the first questions we hear from data-sensitive companies in Europe. The good news, as of mid-2026: the answer is a clear yes. EU data residency is no longer a compromise — it's a deliberate architecture choice, and there are genuinely capable options for it.

This is about EU data sovereignty and residency — deciding consciously where and under which legal regime your data is processed. That's a strength, not a stance against anyone: many of the best open models and cloud providers are international, and that's a good thing. The point is control, not opposition.

We maintain the full, vendor-neutral version of this as an open guide on GitHub: github.com/azena-ai/eu-souveraene-llms. Here's the practical core.

Two clean paths to EU data residency

Self-host open weights on your own or EU cloud infrastructure. No inference data goes to the model vendor, so the license, not the vendor's origin, is what matters.
Use an EU-headquartered managed provider that runs the model and keeps processing in the EU.

Both are production-ready in 2026.

Path 1: Self-hosting — the license decides

When you run downloaded weights on your own or EU infrastructure, no inference data flows to the maker — not even for models from the US or China. Origin only concerns the maker's hosted API, not weights running locally. So the deciding factor is the license.

Model	Origin	License (commercial?)	Note
Mistral Large 3 / Ministral 3	France (EU)	Apache 2.0 — free	permissive flagship
Teuken-7B (OpenGPT-X)	DE/EU	Apache 2.0 — free	all 24 EU languages, EU-trained
EuroLLM-22B	EU consortium	Apache 2.0 — free	35 languages
Qwen3	Alibaba (China)	Apache 2.0 — free	privacy-neutral when self-hosted
DeepSeek-R1	DeepSeek (China)	MIT — free	strong reasoning
Mistral Large 2 / Pixtral Large	France	MRL — NOT commercial	common trap
Aleph Alpha Pharia-1	Germany	Open Aleph — research only	commercial by contract only
Meta Llama 4	USA	Community License (not OSI)	EU restriction on multimodal models

Two expensive traps: not every "open" Mistral model is Apache 2.0 — Mistral Large 2 and Pixtral Large are under the non-commercial Mistral Research License. And Llama 4's license excludes EU-domiciled companies from the multimodal models (license text). Check the license tag, not the reputation.
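A license check like this is easy to encode so it runs before a model ever reaches a deployment manifest. The sketch below copies the rows from the table above into a list; `self_hostable_commercially` is a hypothetical helper, and the `None` entry for Llama 4 marks a license that needs a case-by-case reading rather than a yes or no.

```python
# Rows copied from the table above:
# (model, license, commercial use allowed without a separate contract)
MODELS = [
    ("Mistral Large 3 / Ministral 3", "Apache 2.0", True),
    ("Teuken-7B", "Apache 2.0", True),
    ("EuroLLM-22B", "Apache 2.0", True),
    ("Qwen3", "Apache 2.0", True),
    ("DeepSeek-R1", "MIT", True),
    ("Mistral Large 2 / Pixtral Large", "MRL", False),
    ("Aleph Alpha Pharia-1", "Open Aleph", False),
    ("Meta Llama 4", "Llama Community License", None),  # conditional: read the text
]
PERMISSIVE = {"Apache 2.0", "MIT"}


def self_hostable_commercially(models):
    """Names of models whose license is permissive and allows commercial use."""
    return [name for name, license_, commercial in models
            if commercial and license_ in PERMISSIVE]


shortlist = self_hostable_commercially(MODELS)
```

Keying the filter on the license string rather than the vendor name is the point: two models from the same maker can land on opposite sides of it.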

Path 2: Managed inference with EU data residency

If you'd rather not self-host, use a provider that runs the model and processes data in the EU. Here the provider's legal domicile is a key factor for residency.

EU-headquartered, EU data residency: Mistral La Plateforme (France, EU by default, no training on API data), IONOS AI Model Hub (Germany, data stays in DE), OVHcloud and Scaleway (France). The notable infrastructure development in mid-2026 is the AWS European Sovereign Cloud — a separate, EU-operated partition (GA since January 2026), though its model selection is still thin at launch.

Data residency: what actually matters legally

This part gets overlooked, and it matters when your compliance requires genuine EU data residency.

An "EU region" alone says nothing about which legal regime a provider is subject to. The US CLOUD Act (2018), for instance, lets US authorities compel US-domiciled providers to hand over data regardless of server location. A Frankfurt region of a US company sits physically in the EU, but the provider remains under US law. That's not a value judgment — it's simply a factor a clean residency strategy accounts for. (AWS on the CLOUD Act)

There's also legal movement worth knowing: the EU-US Data Privacy Framework is in force in mid-2026 but under challenge at the CJEU (case C-703/25 P, pending). Teams that want maximum planning certainty keep processing in the EU from the start. And note that encryption isn't a shortcut here: LLM inference needs the plaintext to work, so residency is decided by architecture, not by a bolt-on.

Worth separating: the EU AI Act governs risk and transparency — not data residency. Where your data must be processed comes from the GDPR. (On the AI Act: our practical EU AI Act compliance guide, and the open, vendor-neutral version on GitHub.)

How to decide

Your need	Recommendation
Maximum data control, processing in-house	Self-host open Apache-2.0/MIT weights (Mistral, Teuken, EuroLLM, Qwen, DeepSeek)
Managed, with EU data residency	EU-headquartered provider (Mistral La Plateforme, IONOS, OVHcloud, Scaleway)
Already on Azure/AWS	Workable — document the legal situation (DPF status, provider's regime) in a transfer impact assessment
Just exploring	Start small: a self-hosted 7–24B model (Ministral 3, Teuken-7B) on an EU GPU instance

EU-sovereign AI in 2026 isn't a compromise — it's a deliberate architecture decision with genuinely capable options. Sovereignty here means you keep control over where your data is processed: a strength, framed as choice, not opposition.

We build exactly these systems — bespoke, EU-sovereign AI for the German Mittelstand — at azena. The full, sourced, vendor-neutral guide lives here: github.com/azena-ai/eu-souveraene-llms.