DEV Community: Rudson Kiyoshi Souza Carvalho

The Right Proposal Lost Again: On Power Struggles Disguised as Technical Decisions

Rudson Kiyoshi Souza Carvalho — Fri, 19 Jun 2026 15:18:48 +0000

You've seen this happen before, probably more than once. Two architecture proposals on the table. One is cheaper, simpler to operate, with fewer points of failure and less dependency on a single specialist. The other costs more, needs more people, more time, more meetings. The committee approves the second one.

Nobody in the room will say "we picked the worse option because its author has lunch with the VP every Thursday." What comes out is always technically respectable: "the other solution scales better long-term," "this one reduces technical debt," "it's better aligned with the data governance the platform team has been pushing for." Sentence by sentence, none of it is a lie. And yet none of it is the real reason for the decision.

You leave the meeting knowing, with an uncomfortable certainty in your gut, that you lost for a reason that wasn't on any slide. Then comes the worse part: you can't name that reason without sounding paranoid, or bitter, or "too political" — exactly the label the person who won will never get, because they were careful never to say the word "power" out loud.

This pattern isn't a rare accident of a broken process. It is the process. It repeats in architecture decisions, in roadmap priorities, in who signs the RFC, in who gets the credit for the project that worked and who gets the blame for the one that didn't. It's frequent and regular enough to become a law. This pattern is what became the book "As Leis do Poder em Projetos" (The Laws of Power in Projects).

The mechanism: power wearing a competence badge

Every tech company runs two operating systems in parallel. One is documented: architecture, metrics, OKRs, post-mortems, security committees, RFCs, promotion criteria. The other never shows up in the minutes, but decides just as much as the first one: who gains territory, who gets spared when the project blows up, who's allowed to disagree in public, and who's only allowed to disagree "later, in private."

The reason this second system is nearly impossible to point a finger at is simple: it speaks the exact same language as the first one. "Technical debt" is a real engineering category — and also the perfect label for any old decision someone wants to revisit for reasons that have nothing to do with actual debt. "That service doesn't scale" might be a fact about capacity — or it might be the sentence that kills a rival team's project without anyone having to admit the problem was never scale. Metrics, security, compliance, governance: each of these words has the same property. From the outside, they're indistinguishable from pure technical competence. That's exactly why they work so well as weapons — and why no one is ever held accountable for using them that way.

Questioning a security veto in public, for instance, doesn't look like technical rigor. It looks reckless. So nobody questions it, and the veto becomes the most efficient power instrument in the company — protected by the simple fact that disagreeing with it is socially expensive, even when it's being used for the wrong reasons.

None of this requires a closed-door conspiracy. It's structure, not a plot: any hierarchy that distributes budget, prestige, and career survival unevenly will generate competition for those resources. What's particular about tech is that this competition is almost never called by its name. It's called architecture.

The book: 34 laws, six parts, from diagnosis to defense

"As Leis do Poder em Projetos" organizes this pattern into 34 laws, split into six parts that go from "this is happening to you right now, nobody just used that word" to "what to do about it without becoming the next person who does it to others."

I · Manufacturing the Problem

Before any power struggle has a winner, someone first needs to convince everyone that a problem exists — and, preferably, that only one specific solution solves it. This part maps how crisis narratives are manufactured, inflated, and timed before any actual technical decision reaches the table.

Some people light fires to sell extinguishers
"Technical debt" is a rhetorical weapon, not just an engineering concept

II · Owning the Narrative

Whoever decides which version of events survives into next quarter holds more power than whoever actually solved the problem. This part is about how a project's story gets written — and rewritten — by whoever controls the documents, the channels, and, above all, the vocabulary used to describe what happened.

Whoever writes the post-mortem writes history
Whoever controls the vocabulary controls the debate

III · Capturing Territory

Technical merit and upward mobility inside a company rarely follow the same ruler, and pretending otherwise only delays the moment you realize it. This part is about how territory — scope, headcount, visibility, access — gets won and defended, and why being in the right place at the right time usually beats having done the right work.

Credit goes to whoever is on stage on demo day
Turtles don't climb trees on their own: merit and promotion are not the same thing

IV · The Executive's Game

At the top, the rules change again. Decisions that look purely strategic are, with uncomfortable frequency, about the executive's personal survival within their own tenure — which usually lasts less than any three-year roadmap. This part looks at the board from the perspective of someone who only has a few quarters to show results before the next reorg.

The CIO doesn't pick the best technology; they pick the one that survives the next change of command
Decide which fires to let burn

V · Security, Risk, and Compliance as Power

When a dispute can't be won on technical merit, it tends to migrate to the one terrain where questioning looks reckless instead of rigorous: security, risk, compliance. This part exposes how these domains — created to protect the company — also work as the most efficient veto available to anyone willing to use it.

Security is the one veto nobody dares challenge in public
Every exception becomes a precedent — and every precedent becomes power

VI · Surviving Without Being Naive (the defense)

The last part switches sides. After mapping the game, the book turns to how to play it without becoming what it describes — how to protect good work, allies, and reputation without resorting to the same dirty tactics the rest of the book spent five parts documenting.

Don't win arguments; win before them
Being right is necessary; having narrative, timing, and allies is what makes being right count

This is not a manipulation manual

It's worth repeating, because it's easy to read the list above and conclude the opposite: this isn't a book about manipulating people to win internal disputes. The same tools that run through all 34 laws — controlling the narrative, choosing the timing, building allies, naming the problem before someone else names it for you — don't inherently belong to whoever acts in bad faith. They belong to whoever uses them first and uses them best.

The difference between using these tools to defend honest work and using the same tools to sabotage someone else's isn't in the tool. It's in who decides what to load into it. You can write a rigorous post-mortem and still write the story. You can build allies to protect the right decision, not to bury the right person. The book doesn't pretend that line doesn't exist — it just argues that pretending the whole game doesn't exist is the fastest way to lose to whoever plays without that hesitation.

Political clarity isn't the same thing as cynicism. It's the difference between being blindsided by the game and choosing, eyes open, what to do with the cards it deals you.

Launch

As Leis do Poder em Projetos (The Laws of Power in Projects) is currently in presale on Amazon, with launch expected for July 9, 2026.

https://a.co/d/0bdF0Sk2

The book is currently available in Portuguese, but I'm considering an English edition if there is enough interest from the international software engineering community.

The Right Proposal Lost Again: On Power Struggles Disguised as Technical Decisions

Rudson Kiyoshi Souza Carvalho — Fri, 19 Jun 2026 13:56:27 +0000

You have seen this happen, probably more than once. Two architecture proposals are on the table. One is cheaper, simpler to operate, with fewer points of failure and less dependence on a single specialist. The other costs more and demands more people, more time, more meetings. The committee approves the second.

Nobody in the room will say "we chose the worse option because its author has lunch with the VP every Thursday." What comes out is always technically respectable: "the other solution scales better in the long run," "this one reduces technical debt," "it is better aligned with the data governance the platform team has been demanding." Sentence by sentence, nothing there is a lie. And yet none of those sentences is the real reason for the decision.

You leave the meeting knowing, with an uncomfortable certainty in your gut, that you lost for a reason that was not on any slide. Then comes the worse part: you cannot name that reason without sounding paranoid, or bitter, or "too political," which is precisely the adjective the person who won will never receive, because they took care never to say the word "power" out loud.

This pattern is not a rare accident of a sloppy process. It is the process. It repeats in architecture decisions, in roadmap priorities, in who signs the RFC, in who keeps the credit for the project that worked and who keeps the blame for the one that did not. It is frequent and regular enough to become a law. That pattern is what became the book "As Leis do Poder em Projetos" (The Laws of Power in Projects).

The mechanism: power wearing a competence badge

Every tech company runs two operating systems in parallel. One is documented: architecture, metrics, OKRs, post-mortems, security committees, RFCs, promotion criteria. The other never appears in the minutes, but decides just as much as the first: who gains territory, who is spared when the project blows up, who is allowed to disagree in public, and who is only allowed to disagree "later, in private."

The reason this second system is nearly impossible to point a finger at is simple: it speaks exactly the language of the first. "Technical debt" is a real engineering category, and it is also the perfect label for any old decision someone wants to revisit for reasons that have nothing to do with debt. "That service doesn't scale" may be a fact about capacity, or it may be the sentence that kills a rival team's project without anyone having to admit that scale was never the problem. Metrics, security, compliance, governance: each of these words has the same property. Seen from the outside, they are indistinguishable from pure technical competence. That is exactly why they work so well as weapons, and why nobody is ever held accountable for using them that way.

Questioning a security veto in public, for example, does not look like technical rigor. It looks reckless. So nobody questions it, and the veto becomes the most efficient instrument of power in the company, protected by the simple fact that disagreeing with it is socially expensive, even when it is being used for the wrong purpose.

None of this requires a closed-door conspiracy. It is structure, not a plot: any hierarchy that distributes budget, prestige, and career survival unevenly will generate competition for those resources. What is particular about tech is that this competition is almost never called by its name. It is called architecture.

The book: 34 laws, six parts, from diagnosis to defense

"As Leis do Poder em Projetos" organizes this pattern into 34 laws, divided into six parts that go from "this is happening to you right now, it's just that nobody used that word" to "what to do about it without becoming the next person who does it to others."

I · Manufacturing the Problem

Before any power struggle has a winner, someone first has to convince everyone that a problem exists and, preferably, that only one specific solution solves it. This part maps how crisis narratives are manufactured, inflated, and timed before any actual technical decision reaches the table.

Some people light fires to sell extinguishers
"Technical debt" is a rhetorical weapon, not just an engineering concept

II · Owning the Narrative

Whoever decides which version of events survives into the next quarter holds more power than whoever actually solved the problem. This part deals with how a project's story gets written, and rewritten, by whoever controls the documents, the channels, and, above all, the vocabulary used to describe what happened.

Whoever writes the post-mortem writes history
Whoever controls the vocabulary controls the debate

III · Capturing Territory

Technical merit and advancement inside a company rarely follow the same ruler, and pretending they do only delays the moment you notice. This part is about how territory (scope, headcount, visibility, access) is won and defended, and why being in the right place at the right time usually counts for more than having done the right work.

Credit goes to whoever is on stage on demo day
Turtles don't climb trees: merit and advancement are different things

IV · The Executive's Game

At the top, the rules change again. Decisions that look purely strategic are, with uncomfortable frequency, about the executive's personal survival within their own tenure, which usually lasts less than any three-year roadmap. This part looks at the board from the position of someone who has only a few quarters to show results before the next reorganization.

The CIO doesn't pick the best technology; they pick the one that survives the next change of command
Decide which fires to let burn

V · Security, Risk, and Compliance as Power

When a dispute cannot be won on technical merit, it tends to migrate to the one terrain where questioning looks reckless instead of rigorous: security, risk, compliance. This part exposes how these domains, created to protect the company, also work as the most efficient veto available to anyone willing to use it.

Security is the one veto nobody dares contest in public
Every exception becomes a precedent, and every precedent becomes power

VI · Surviving Without Being Naive (the defense)

The last part switches sides. After mapping the game, the book turns to how to play it without becoming what it describes: how to protect good work, allies, and reputation without resorting to the same dirty tactics the rest of the book spent five parts documenting.

Don't win arguments; win before them
Being right is necessary; having narrative, timing, and allies is what makes being right count

This is not a manipulation manual

It's worth repeating, because it is easy to read the list above and conclude the opposite: this is not a book about how to manipulate people to win internal disputes. The same tools that run through the 34 laws (controlling the narrative, choosing the timing, building allies, naming the problem before someone names it for you) do not belong by nature to whoever acts in bad faith. They belong to whoever uses them first and uses them best.

The difference between using these tools to defend honest work and using the same tools to sabotage someone else's is not in the tool. It is in who decides what to load into it. You can write a rigorous post-mortem and still write the story. You can build allies to protect a right decision, not to bury a right person. The book does not pretend that line doesn't exist. It only argues that pretending the whole game doesn't exist is the fastest way to lose to whoever plays without that hesitation.

Political clarity is not the same thing as cynicism. It is the difference between being surprised by the game and choosing, eyes open, what to do with the cards it deals you.

Launch

As Leis do Poder em Projetos (The Laws of Power in Projects) is in presale on Amazon, with launch expected on July 9, 2026.

https://a.co/d/0bdF0Sk2

Your AI agent is inventing behavior — and you have no way to prove otherwise

Rudson Kiyoshi Souza Carvalho — Tue, 16 Jun 2026 03:53:25 +0000

You reviewed the PR. The code looks right. The tests pass.

But that new field in the API response — who asked for that?

You check the history, the requirements, the conversation with the PO. It's nowhere. The AI agent just added it. And if you hadn't looked closely, it would have shipped to production.

This can happen any time an agent generates code, and it will happen more often as pipelines become more autonomous. The problem isn't that the AI makes mistakes. It's that when it adds something nobody asked for, no structural mechanism exists today to stop it from getting through.

The gap nobody closed

There's an entire ecosystem of standards for tracking what happens to software. But each one covers a different piece:

RTM, ReqIF, OSLC — track requirements, but they're documents humans fill in. Nothing stops an agent from generating code that doesn't correspond to any of them.
SLSA, SPDX, CycloneDX — cover the build chain and component inventory. Excellent, but they operate on the compiled artifact, not on behavior.
W3C PROV — models data provenance in general. It doesn't go down to "who asked for this field in the response?"
C2PA — provenance for media and digital content. A different domain.

The floor between an approved requirement and generated behavior is empty. There's no machine-checkable contract today that says: "this behavior has traceable origin, or it's rejected."

The root of the problem

When a human engineer adds a field nobody asked for, there's natural friction: the PR goes to review, someone asks "where did this come from?", the person has to justify it.

With AI agents the cycle is different:

The agent receives a context
The agent generates code
The code is plausible, it compiles, the tests pass
Nobody has an automated way to ask: which requirement was this specific behavior derived from?

The honest answer is: there isn't one. And this isn't process fussiness — in regulated environments (finance, healthcare, aerospace), this is real risk.

The core idea behind BPR

The Behavioral Provenance Record (BPR) is a conformance specification that attacks exactly this problem.

The logic is simple: instead of trying to prove the agent didn't invent anything — which is impossible — you turn invention into a structural failure.

Every node in the pipeline (a requirement, a behavioral example, a scenario, a contract, a unit of code, a test) emits a provenance record. That record says: "this artifact came from that upstream node." If it didn't come from anywhere, it's rejected.

Core rule: provenance-or-reject.

Every derived node MUST cite the upstream node it came from.
No resolvable origin → rejected.
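
To make the rule concrete, here is a minimal sketch of a provenance-or-reject check. The record shape (`id`, `kind`, `derived_from`) is an assumption made up for this illustration, not the actual BPR schema; the real field names live in the repository.

```python
# Hypothetical record shape for illustration only; not the BPR schema.
RECORDS = [
    {"id": "need-1", "kind": "authored", "derived_from": []},
    {"id": "scenario-1", "kind": "derived", "derived_from": ["need-1"]},
    {"id": "code-1", "kind": "derived", "derived_from": ["scenario-1"]},
    {"id": "code-2", "kind": "derived", "derived_from": []},              # cites nothing
    {"id": "test-1", "kind": "derived", "derived_from": ["scenario-9"]},  # dangling citation
]

def provenance_or_reject(records):
    """Return (node_id, reason) for every derived node without a resolvable origin."""
    known = {r["id"] for r in records}
    failures = []
    for r in records:
        if r["kind"] != "derived":
            continue  # authored nodes are the roots and cite nothing
        if not r["derived_from"]:
            failures.append((r["id"], "no upstream citation"))
            continue
        for upstream in r["derived_from"]:
            if upstream not in known:
                failures.append((r["id"], f"unresolvable upstream: {upstream}"))
    return failures

failures = provenance_or_reject(RECORDS)
for node_id, reason in failures:
    print(f"REJECT {node_id}: {reason}")
```

Each rejection names the node and the reason, which is the point of the rule: the failure is localized instead of buried in a diff.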

Where it enters the SDLC

It's not a document you fill in afterward. BPR enters the flow at the moment the artifact is produced:

The human authors the need and the behavioral examples. These are the trust anchors — authored nodes, the roots of the graph.
The agent derives the scenario, the contract, the code, and the test — emitting a derived record at each step, citing what came before.
A conformance gate (a CI stage, a PR check) reads the records and passes or fails the change before merge.

The result is a traceability graph. The classic RTM you know is just a 2D projection of this graph — not the actual object.
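
The "RTM as a projection" idea can be sketched in a few lines. The edge map and the `test-` naming convention below are assumptions for the example, not the BPR serialization, and the walk assumes the graph has no cycles.

```python
# Hypothetical edges: node id -> list of upstream ids it cites.
UPSTREAM = {
    "need-1": [],
    "example-1": ["need-1"],
    "scenario-1": ["example-1"],
    "code-1": ["scenario-1"],
    "test-1": ["scenario-1"],
    "test-2": ["scenario-1"],
}

def roots_of(node, upstream):
    """Follow citations upward and return the authored roots a node descends from."""
    parents = upstream.get(node, [])
    if not parents:
        return {node}  # nothing upstream: this is a root
    found = set()
    for parent in parents:
        found |= roots_of(parent, upstream)
    return found

def project_rtm(upstream):
    """Flatten the graph into (need, test) rows, the classic two-column matrix."""
    rows = []
    for node in sorted(upstream):
        if node.startswith("test-"):  # assumed naming convention for test nodes
            for root in sorted(roots_of(node, upstream)):
                rows.append((root, node))
    return rows

rtm = project_rtm(UPSTREAM)
for need, test in rtm:
    print(f"{need} | {test}")
```

The rows drop every intermediate node (example, scenario, code), which is exactly the information the flat matrix cannot carry and the graph can.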

The hardest level: anti-invention

The interesting part is what happens at the highest level — when you want to know whether the agent added behavior nobody asked for.

BPR doesn't try to prove a negative. Instead, each node can declare its claims — the behavioral assertions it makes. And every claim needs to cite an upstream claim.

Every claim has a type defined by observability:

behavioral — changes the observable functional contract (response field, status code). Must be traced.
operational — changes observable non-functional behavior (latency, retry, logging, metrics, security). Must be traced.
implementation — internal, no observable footprint (data structure, in-process cache). Exempt — but only if non-observability is attested. Without attestation, it's treated as behavioral. This closes the loophole of "laundering" invention by labeling it an internal detail.

A behavioral claim with no upstream ancestry is invention — now a localized, named, attributable failure. Not a hallucination hidden in the middle of the code.
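
A rough sketch of that classification rule, under assumed field names (`type`, `cites`, `attested_non_observable` are hypothetical, not taken from the BPR schema): an implementation claim is exempt only when attested, otherwise it is checked as if it were behavioral.

```python
# Hypothetical claim records; field names are illustrative only.
CLAIMS = [
    {"id": "c1", "type": "behavioral", "cites": ["need-claim-1"]},
    {"id": "c2", "type": "behavioral", "cites": []},                # untraced response field
    {"id": "c3", "type": "implementation", "cites": [],
     "attested_non_observable": True},                              # exempt
    {"id": "c4", "type": "implementation", "cites": [],
     "attested_non_observable": False},                             # laundering attempt
]
UPSTREAM_CLAIMS = {"need-claim-1"}

def effective_type(claim):
    """An unattested implementation claim is treated as behavioral."""
    if claim["type"] == "implementation" and not claim.get("attested_non_observable", False):
        return "behavioral"
    return claim["type"]

def find_inventions(claims, upstream_ids):
    """Return ids of traceable-type claims that cite no known upstream claim."""
    inventions = []
    for claim in claims:
        if effective_type(claim) == "implementation":
            continue  # attested as non-observable: exempt from tracing
        if not any(cited in upstream_ids for cited in claim["cites"]):
            inventions.append(claim["id"])
    return inventions

inventions = find_inventions(CLAIMS, UPSTREAM_CLAIMS)
print(inventions)
```

Whether the attestation itself deserves trust is a separate question that this check does not answer.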

Incremental adoption — not all or nothing

This is the point I consider most important in the design: you don't have to adopt everything at once.

L1 and L2 already deliver real value today, without depending on anything sophisticated:

L1: every scenario has a test. Every test verifies a scenario. Simple, deterministic, and most teams still don't have this enforced.
L2: every approved need has downstream coverage through to a test. (Deferred, rejected, or informational requirements are explicitly exempt — the status field handles that.)
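
As a sketch of how small these two checks can be, assume a simplified need → scenario → test chain. The dictionaries and field names here are hypothetical and flatter than a real record set; the reference validator in the repository is the authority on the actual rules.

```python
# Hypothetical, simplified data: scenarios point at needs, tests point at scenarios.
NEEDS = {"need-1": "approved", "need-2": "approved", "need-3": "deferred"}
SCENARIOS = {"scn-1": {"need": "need-1"}, "scn-2": {"need": "need-2"}}
TESTS = {"test-1": {"verifies": "scn-1"}, "test-2": {"verifies": "scn-404"}}

def check_l1(scenarios, tests):
    """L1: every scenario has a test, and every test verifies a known scenario."""
    covered = {t["verifies"] for t in tests.values()}
    problems = [f"scenario {s} has no test" for s in sorted(scenarios) if s not in covered]
    problems += [
        f"test {name} verifies unknown scenario {t['verifies']}"
        for name, t in sorted(tests.items())
        if t["verifies"] not in scenarios
    ]
    return problems

def check_l2(needs, scenarios, tests):
    """L2: every approved need reaches at least one tested scenario."""
    tested = {t["verifies"] for t in tests.values()}
    reached = {s["need"] for name, s in scenarios.items() if name in tested}
    return [
        f"approved need {n} has no tested scenario"
        for n, status in sorted(needs.items())
        if status == "approved" and n not in reached  # other statuses are exempt
    ]

l1_problems = check_l1(SCENARIOS, TESTS)
l2_problems = check_l2(NEEDS, SCENARIOS, TESTS)
print(l1_problems)
print(l2_problems)
```

Note that `need-3` never appears in the output: its deferred status exempts it, which mirrors the role of the status field described above.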

L3 is the frontier — where you guarantee the agent didn't invent behavior. Harder, depends on external checkers (human, NLI model, LLM-judge), but fully specified and falsifiable.

What BPR standardizes — and what it doesn't

This distinction is what separates a standard from a product:

What is the standard: the record schema, the serialization, the invariants, the conformance levels, and claim semantics.

What is not the standard (replaceable): the validator, the checkers that produce verdicts, how the agent emits records in the pipeline, where the graph is stored, and the policy for who can attest.

You implement your own validator in Go, TypeScript, whatever. If two independent implementations reach the same conformance verdict on the same examples, that's a standard — not a private API.

Honesty about the limits

Conformance level is a claim about structure, not correctness.

"L3b conformant" means every behavioral claim is cited and attested as supported. It doesn't mean the attestation is correct. It's not proof that nothing was invented.

BPR guarantees attributability: invented behavior can't be committed silently. A weak checker can still issue a wrong "supported" verdict. The quality of the checker is the responsibility of whoever chooses it, and is out of scope for a purpose-built standard.

What BPR doesn't solve on its own: a malicious agent forging records, an incompetent checker, a bad original requirement, absence of organizational policy, tampering with files after the fact. These are solved through composition — record signing, attestation policy, human process — not by expanding the scope of the standard.
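
To illustrate what "composition" could look like for after-the-fact tampering, here is a sketch that signs a record with an HMAC from the Python standard library. This is not part of BPR; the key handling and record shape are assumptions, and a shared secret does nothing against an agent that holds the key.

```python
import hashlib
import hmac
import json

SECRET = b"demo-key-do-not-use"  # hypothetical; a real pipeline would use managed keys

def sign(record, key=SECRET):
    """Sign a canonical JSON encoding so key order does not change the signature."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()

def verify(record, signature, key=SECRET):
    """Constant-time comparison of the expected and the presented signature."""
    return hmac.compare_digest(sign(record, key), signature)

record = {"id": "code-1", "derived_from": ["scenario-1"]}
signature = sign(record)

# Someone edits the citation after the gate has run.
tampered = dict(record, derived_from=["scenario-2"])

print(verify(record, signature))    # True
print(verify(tampered, signature))  # False
```

Asymmetric signatures would be the natural next step when the signer and the verifier must not share a secret.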

Current state and what's missing

BPR is published as a preprint of an initial specification:

✅ JSON Schema (JSON Schema 2020-12)
✅ Reference validator L0–L3c in Python (runs and rejects invalid input with a specific reason)
✅ Conformant and deliberately broken examples (including laundering attempts)
🔲 Versioning + staleness (v0.5) — when a requirement changes, which downstream nodes become stale?
🔲 Anti-self-attestation at L3c — issuer ≠ verifier as a specifiable property
🔲 Normative mapping to W3C PROV-O

The line between a specification and a standard is an independent implementation. Today there's one — the reference one. Publishing this is the invitation for a second.

How to test it right now

git clone https://github.com/RudsonCarvalho/bpr.git
cd bpr
pip install jsonschema

# base example — should pass at L2
python validator/validate.py examples/profile-retrieval.records.json --level L2

# example with claims — should pass at L3b
python validator/validate.py examples/profile-retrieval.l3.records.json --level L3b

# example with invention and laundering — should FAIL
python validator/validate.py examples/profile-retrieval.l3-broken.json --level L3b

Conclusion

We don't need to prove the AI didn't invent anything. We need any invention to be a named, localized, traceable failure — not a hallucination that shipped to production because the tests passed.

BPR proposes exactly that: a minimal, neutral, verifiable contract. The immediate value is in L1/L2. The long-term promise is in L3.

If you work with AI pipelines generating code: clone it, run the examples, try to break L3, send adversarial cases. That's what turns a specification into infrastructure.

Specification + schema + validator: 👉 github.com/RudsonCarvalho/bpr

Preprint with DOI: 👉 doi.org/10.5281/zenodo.20710512

Your agent skill was never loaded. And you have no way of knowing.

Rudson Kiyoshi Souza Carvalho — Wed, 10 Jun 2026 03:48:55 +0000

Agent skills load on a guess (and can't inherit). Here's the fix

Rudson Kiyoshi Souza Carvalho — Wed, 10 Jun 2026 03:37:00 +0000

Your agent skill was never loaded. And you have no way of knowing.

Not "loaded the wrong version." Not "loaded late." Never loaded at all. The model read a one-line summary of it, decided it didn't need the details, and generated a confident, plausible, wrong artifact instead. No error. No log line. No stack trace. Just output that looks right and isn't.

I kept running into this while building agents for regulated workflows, so I want to walk through why it happens — it's structural, not a bug — and a small fix you can paste into your own stack today.

How skills actually load

Most agent frameworks load skills the same way. The model never sees your skills up front. It sees a menu — a list of names and short descriptions — and decides for itself, mid-task, whether to open any of them.

Here's roughly the context the model wakes up to:

AVAILABLE SKILLS
- rtm-format    : How to produce a requirements traceability matrix
- pii-redaction : Redact personal data before export
- audit-trail   : Log generated artifacts for compliance

TASK
Generate a requirements traceability matrix for the payments module.

Then, invisibly, it runs something like: "Do I need to open rtm-format? I already know what an RTM is — columns for requirements, sources, tests. I've got this." And it proceeds without ever opening the skill.

That's the whole mechanism. It's a semantic trigger: a probabilistic, model-driven pull. The skill body only enters the context if the model first decides it's needed. There's no guarantee, and — this is the part that hurts in production — no observability. You can't tell from the output whether the skill fired.

The TV-manual problem

Think about a TV manual. Nobody opens it to turn the TV on — you already know how. You only reach for the manual when you recognize you don't know something: pairing a soundbar, fixing some weird HDMI handshake.

The whole system depends on one assumption: you know what you don't know.

An LLM breaks that assumption. It doesn't know what it doesn't know. It "knows" how to generate an RTM in the generic sense, so it never recognizes that it should open your RTM skill — the one that says your column order is fixed, your IDs follow SYS-REQ-####, and there's one row per requirement, no exceptions. From the model's point of view, it already knows how to turn on the TV. So it never opens the manual. And it hands you a perfectly formatted RTM that's wrong in every way that matters to your auditor.

This is why the failure is structural. The model can only choose to load a skill after recognizing it lacks the knowledge — and the cases where it's most confidently wrong are exactly the cases where it feels no need to check.

Why this quietly wrecks critical workflows

A loud failure is a gift. A crash, a 500, a validation error — these tell you exactly where to look.

The skipped-skill failure is the opposite. The agent produces a clean RTM with the wrong column order and non-conforming IDs. It passes a glance. It might pass review. It surfaces three weeks later when a compliance tool rejects the export, or worse, when nobody catches it at all. The cost of a silent failure isn't the failure — it's the false confidence it travels with.

For ad-hoc help ("brainstorm some test cases"), probabilistic loading is fine, even good. For workflows where a specific artifact format is non-negotiable — regulated reporting, audit trails, anything with a downstream machine consumer — "the model will probably load the right skill" is not a foundation you want to stand on.

The fix: a Skill Resolver

If a skill is mandatory for a task type, the decision to load it shouldn't belong to the model at all. Make it a property of the pipeline.

A Skill Resolver is a tiny pre-dispatch step. It runs before the LLM, looks at the task type, and injects the full body of every required skill straight into the context. No menu, no model discretion — push instead of pull.

SKILL_STORE = {
    "rtm-format": "RTM SKILL: columns must be [ReqID, Source, Verification, "
                  "Status]; ReqID format SYS-REQ-####; one row per requirement.",
    "audit-trail": "AUDIT SKILL: log every generated artifact with author, "
                   "timestamp, and source skill version.",
}
REQUIRED = {"compliance": ["rtm-format", "audit-trail"]}

def resolve_skills(task_type, store):
    # Runs BEFORE the LLM. Returns full skill bodies, not just summaries.
    return "\n\n".join(store[name] for name in REQUIRED.get(task_type, []))

def build_prompt(task_type, user_msg, store):
    injected = resolve_skills(task_type, store)
    return f"<required_skills>\n{injected}\n</required_skills>\n\nTask: {user_msg}"

print(build_prompt("compliance", "Generate an RTM for the payments module", SKILL_STORE))

That's the whole idea. The key property isn't the line count — it's where it runs. The injection happens outside the model's decision loop. By the time the LLM is called, rtm-format is already in the context whether the model thought it needed it or not. The pull became a push.
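
Since the original pain point was missing observability, one possible extension is to have the resolver emit a manifest of what it injected and to verify the final prompt before dispatch. This is a sketch under assumptions: `resolve_with_manifest`, `assert_injected`, and the manifest shape are names invented for this example, not part of any framework.

```python
import hashlib

# Hypothetical store and requirement map, reduced to one skill for brevity.
SKILLS = {"rtm-format": "RTM SKILL: ReqID format SYS-REQ-####; one row per requirement."}
REQUIRED_BY_TASK = {"compliance": ["rtm-format"]}

def resolve_with_manifest(task_type, store, required=REQUIRED_BY_TASK):
    """Return the injected text plus a loggable record of exactly what was injected."""
    names = required.get(task_type, [])
    missing = [n for n in names if n not in store]
    if missing:
        # Fail loudly before the model is ever called.
        raise KeyError(f"required skills missing from store: {missing}")
    manifest = [
        {"skill": n, "sha256": hashlib.sha256(store[n].encode("utf-8")).hexdigest()}
        for n in names
    ]
    return "\n\n".join(store[n] for n in names), manifest

def assert_injected(prompt, task_type, store, required=REQUIRED_BY_TASK):
    """Last check before dispatch: every required skill body is literally in the prompt."""
    for n in required.get(task_type, []):
        if store[n] not in prompt:
            raise AssertionError(f"skill {n!r} not present in prompt")

injected, manifest = resolve_with_manifest("compliance", SKILLS)
prompt = f"<required_skills>\n{injected}\n</required_skills>\n\nTask: Generate an RTM"
assert_injected(prompt, "compliance", SKILLS)
print(manifest[0]["skill"], manifest[0]["sha256"][:12])
```

Logging the manifest next to the generated artifact answers the question that probabilistic loading can't: which skill text, byte for byte, was in context for this run.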

"Can't I just write it in AGENTS.md?" You can, and it helps with prioritization — but it doesn't guarantee anything. A line in AGENTS.md is still an instruction the model interprets at inference time; it lives in the same probabilistic layer as the skill menu. The resolver lives one layer below inference, in code, where "always" actually means always.

Leveling up: skill inheritance for multinationals

Now the real-world version. You're not running one agent; you're running one platform for a company with offices in a dozen countries. The audit format is global. The data-retention rules come from the EU (GDPR) or Brazil (LGPD). The reporting template is set by each local central bank. And a single business unit has its own quirks on top.

The naive answer is copy-and-modify: fork the global skill set per country, tweak as needed. That falls apart fast. The forks drift — a fix to the global skill never reaches the copies. You lose lineage — six months later nobody can say which rule came from HQ and which a local team invented. And every global change becomes N update points instead of one.

What you actually want is inheritance: a scope chain from global down to the business unit, where more specific scopes override less specific ones, most-specific-wins — except for invariants that HQ locks and no local scope can touch. If you've ever debugged CSS specificity, this is the same cascade: the most specific rule wins, and !important is your invariant.

Here's a resolver that walks that chain and keeps the lineage:

REGISTRY = {
    "global":       {"rtm-format": "v2.1", "audit-trail": "v4.0",
                     "_invariant": {"audit-trail"}},
    "country:BR":   {"rtm-format": "v1.3"},
    "bu:BR/retail": {"rtm-format": "v1.0", "audit-trail": "v1.0"},  # tries to weaken
}

def resolve(scope_chain, registry):
    resolved, lineage, locked = {}, {}, set()
    for scope in scope_chain:                 # walk least -> most specific
        layer = registry.get(scope, {})
        locked |= layer.get("_invariant", set())
        for key, version in layer.items():
            if key.startswith("_"):
                continue
            if key in locked and key in resolved:
                continue                       # invariant: can't be overridden
            resolved[key] = version            # most-specific wins
            lineage[key] = scope               # who set it -> auditable
    return resolved, lineage

chain = ["global", "country:BR", "bu:BR/retail"]
resolved, lineage = resolve(chain, REGISTRY)
for k in resolved:
    print(f"{k:12} -> {resolved[k]:6} (set by {lineage[k]})")

Output:

rtm-format   -> v1.0   (set by bu:BR/retail)
audit-trail  -> v4.0   (set by global)

rtm-format cascades down to the business unit's v1.0. But audit-trail is locked global — the BU's attempt to swap in a weaker v1.0 is ignored, and the lineage map tells you exactly which scope set each final value. One global change, one update point, full audit trail. That's Hierarchical Skill Resolution.
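One thing the resolver does quietly is drop the BU's override. If you'd rather surface that attempt to whoever reviews governance, a small variation can record it. This is a minimal sketch that assumes the same REGISTRY shape as above; `resolve_with_rejections` and the `rejected` list are hypothetical names, not something taken from the paper.

```python
REGISTRY = {
    "global":       {"rtm-format": "v2.1", "audit-trail": "v4.0",
                     "_invariant": {"audit-trail"}},
    "country:BR":   {"rtm-format": "v1.3"},
    "bu:BR/retail": {"rtm-format": "v1.0", "audit-trail": "v1.0"},
}

def resolve_with_rejections(scope_chain, registry):
    """Same cascade as resolve(), but also reports blocked overrides."""
    resolved, lineage, locked, rejected = {}, {}, set(), []
    for scope in scope_chain:                  # walk least -> most specific
        layer = registry.get(scope, {})
        locked |= layer.get("_invariant", set())
        for key, version in layer.items():
            if key.startswith("_"):
                continue
            if key in locked and key in resolved:
                # Invariant already set upstream: keep it, log the attempt.
                rejected.append((scope, key, version, lineage[key]))
                continue
            resolved[key] = version
            lineage[key] = scope
    return resolved, lineage, rejected

chain = ["global", "country:BR", "bu:BR/retail"]
resolved, lineage, rejected = resolve_with_rejections(chain, REGISTRY)
for scope, key, version, owner in rejected:
    print(f"blocked: {scope} tried {key}={version} (set by {owner})")
```

The resolved values are identical to the original resolver's; the only difference is that the ignored override now leaves a record you can alert on.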

Where Microsoft APM fits

None of this competes with Microsoft's APM — it composes with it. There are three separate planes here, and it's worth keeping them straight:

APM is the distribution plane: how skills are versioned, locked, and pulled from registries (the package-manager layer).
The Skill Resolver is the consumption plane: what's deterministically in the context when the model runs.
HSR is the governance plane: who controls which skill at which scope, and what can't be overridden.

APM ships the v1.0; the resolver guarantees it's actually present at inference; HSR decides that the BU was allowed to set it in the first place. None replaces the others.

If this maps to a problem you're staring at, I opened a discussion in the APM community to push on the governance side — discussion #1722. Feedback and counter-arguments welcome.

Honest limitations

Injection guarantees what enters the context. It does not guarantee what the model does with it. Stuff a skill into a 100k-token prompt and "lost in the middle" still applies — present isn't the same as attended-to. There's a token cost, too; injecting every required skill on every call adds up, so scope your REQUIRED map tightly. And for genuinely open-ended, ad-hoc assistance, don't bother — probabilistic loading is the right tool there. The resolver earns its keep specifically where an output format is non-negotiable.

Wrap-up

Semantic skill loading is a pull, and the model decides whether to pull. That's perfect for exploration and quietly dangerous for compliance, because the model can't recognize a gap it doesn't know it has. A Skill Resolver flips it to a push — moving the load decision out of the model and into your pipeline, in about fifteen lines. Add scope inheritance and you get governance for an org of any size, with lineage you can hand to an auditor.

If you want to go deeper, the full write-up is in the paper, Hierarchical Skill Resolution: Enabling Skill Inheritance and Deterministic Knowledge Injection for AI Agents (DOI: 10.5281/zenodo.20619456), and the governance discussion is over at APM #1722.

What's the worst silent skill-skip you've shipped? I'd love to hear it.

TERSE Tool Catalog (TTC): Cut Tool Catalog Token Usage by 66.6% in Your AI Agents

Rudson Kiyoshi Souza Carvalho — Tue, 05 May 2026 15:14:53 +0000

If you’ve ever built or worked with AI agents that use tools via the Model Context Protocol (MCP), you’ve probably felt the pain that nobody talks about out loud:

The tool catalog is eating your entire context window and budget.

A single tool defined in MCP JSON Schema typically consumes 100–270 tokens. With 50 tools installed, you’re already spending 5,000–13,500 tokens before the user even writes their first message.

This isn’t just expensive — it actively hurts performance:

Higher cost on every single request
Lower tool-selection accuracy as the catalog grows (attention dilution)
Less room for actual user instructions, memory, or reasoning

The good news? There’s a clean, elegant solution: TERSE Tool Catalog (TTC).

The Problem with Today’s MCP JSON Schema

The current MCP format was designed for machine-to-machine execution contracts, not for LLM reasoning. As a result:

There is no explicit trigger condition (WHEN) — the LLM has to guess from a free-form description string.
There is no error contract (ERR) — the model has no idea what to do when a tool fails.
There is no retrieval taxonomy (TAGS) — dynamic tool retrieval (RAG over tools) becomes painful.
Verbose parameter descriptions add noise with almost zero signal for the LLM.

The result is high cost + mediocre tool selection.

Introducing the TERSE Tool Catalog (TTC)

TTC is an official extension of the TERSE Format — a specification for dense, deterministic, human-and-machine-readable representations optimized for LLMs.

It is not just a compression of MCP JSON. It is a semantic reformulation of the tool contract.

TTC keeps everything the LLM actually needs for execution, condenses the description into a one-line PURPOSE, and adds three fields that MCP is missing (WHEN, ERR, and TAGS):

PURPOSE — clear one-line intent
WHEN — explicit semantic trigger (the most important field for selection)
ERR — declared failure modes
TAGS — taxonomy for semantic grouping and retrieval

Measured result: average 66.6% token reduction with net information gain.

TTC Syntax — Clean and Simple

TOOL <tool-id>
  PURPOSE: <one-line description of what the tool does>
  IN: <param1>:<type>, <param2>:<type>?
  OUT: <return-type>
  ERR: <error1> | <error2> | <error3>
  WHEN: <natural language trigger condition>
  TAGS: <tag1>, <tag2>, <tag3>

Supported Types

string, int, float, bool
array[string], array[int], etc.
object, any

The ? suffix marks an optional parameter.
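To make the grammar concrete, here is a minimal parser sketch for a single TOOL block. It is not the reference implementation; `parse_ttc_tool` and `parse_param` are hypothetical helpers, and the sketch assumes one field per line and no commas inside type names.

```python
def parse_param(spec):
    """'cc:string?' -> {'name': 'cc', 'type': 'string', 'optional': True}"""
    name, _, type_ = spec.strip().partition(":")
    type_ = type_.strip()
    return {"name": name.strip(),
            "type": type_.rstrip("?"),
            "optional": type_.endswith("?")}   # trailing ? marks an optional param

def parse_ttc_tool(text):
    """Parse one TOOL block into a dict."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith("TOOL "):
        raise ValueError("block must start with 'TOOL <tool-id>'")
    tool = {"id": lines[0][len("TOOL "):].strip()}
    for line in lines[1:]:
        # Split on the first colon only, so 'to:string' stays intact in the value.
        field, _, value = line.partition(":")
        field, value = field.strip(), value.strip()
        if field == "IN":
            tool["IN"] = [parse_param(p) for p in value.split(",") if p.strip()]
        elif field == "ERR":
            tool["ERR"] = [e.strip() for e in value.split("|")]
        elif field == "TAGS":
            tool["TAGS"] = [t.strip() for t in value.split(",")]
        else:                      # PURPOSE, OUT, WHEN stay as plain strings
            tool[field] = value
    return tool

SAMPLE = """
TOOL gmail_send_email
  PURPOSE: send email via Gmail
  IN: to:string, subject:string, body:string, cc:string?
  OUT: message_id:string
  ERR: auth_failed | quota_exceeded | invalid_recipient
  WHEN: user wants to send or compose an email
  TAGS: gmail, email, communication
"""
tool = parse_ttc_tool(SAMPLE)
print(tool["id"], [p["name"] for p in tool["IN"] if p["optional"]])
```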

Real-World Example: `gmail_send_email`

MCP JSON Schema (208 tokens):

{
  "name": "gmail_send_email",
  "description": "Sends an email message via the Gmail API to one or more recipients...",
  "input_schema": { ... }  // very verbose
}

TTC (55 tokens):

TOOL gmail_send_email
  PURPOSE: send email via Gmail
  IN: to:string, subject:string, body:string, cc:string?
  OUT: message_id:string
  ERR: auth_failed | quota_exceeded | invalid_recipient
  WHEN: user wants to send or compose an email
  TAGS: gmail, email, communication

Same semantic content. 73.6% fewer tokens. And the LLM now has structured fields to make much better decisions.

Real Benchmark (10 Production Tools)

Tool	JSON Schema	TTC	Reduction
gmail_send_email	208	55	73.6%
gmail_read_inbox	121	52	57.0%
drive_list_files	141	53	62.4%
calendar_create_event	262	78	70.2%
slack_send_message	206	69	66.5%
github_create_issue	269	84	68.8%
...	...	...	...
TOTAL (10 tools)	1948	650	66.6%

Projection at scale:

50 tools → ~9,740 → ~3,250 tokens
100 tools → ~19,480 → ~6,500 tokens
Savings at 100 tools: ~13,000 tokens per request
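The percentages and projections are plain arithmetic on the benchmark totals, so they are easy to re-derive. A short sketch, assuming (as the projection does) that per-tool cost scales linearly from the 10 measured tools; the function names are illustrative.

```python
def reduction_pct(json_tokens, ttc_tokens):
    """Percentage of tokens saved going from JSON Schema to TTC."""
    return 100 * (json_tokens - ttc_tokens) / json_tokens

def project(n_tools, json_total=1948, ttc_total=650, measured=10):
    """Linear extrapolation from the 10-tool benchmark totals."""
    json_est = json_total * n_tools / measured
    ttc_est = ttc_total * n_tools / measured
    return round(json_est), round(ttc_est), round(json_est - ttc_est)

print(f"gmail_send_email: {reduction_pct(208, 55):.1f}%")
print(f"10-tool total:    {reduction_pct(1948, 650):.1f}%")
for n in (50, 100):
    json_est, ttc_est, saved = project(n)
    print(f"{n} tools: {json_est} -> {ttc_est} (saves {saved})")
```

Real catalogs will not be this uniform, so treat the projection as an order-of-magnitude estimate and re-measure with your own tool definitions.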

Why TTC Works So Well

It follows the core TERSE principles:

Maximum information density per token
Determinism (same input → same output)
Human + machine readability
Full composability (tools → servers → agent context)

And it adds exactly what LLMs need for better reasoning:

WHEN becomes the primary discriminator for tool selection
ERR enables graceful degradation and fallback strategies
TAGS makes dynamic tool retrieval (RAG over tools) trivial

How to Use It in Your Agent Context

At the start of a conversation (or via dynamic retrieval), you inject:

TOOLS v1.0 [3/47]
  MCP gmail v1.2
    TOOL gmail_send_email
      ...
  MCP google_drive v2.0
    TOOL drive_read_file
      ...

With semantic tool retrieval, you only inject the 3–5 most relevant tools per request. Context cost then stays roughly constant per request, no matter how large your total catalog grows.
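Here is a rough sketch of that selection-then-injection step. Several things are assumptions on my part: the `[3/47]` in the header is read as "tools injected / tools in the catalog", word overlap stands in for real semantic retrieval, and the catalog entries, `select_tools`, and `render` are hypothetical.

```python
import re

# Hypothetical catalog entries; only the fields used for selection are shown.
CATALOG = [
    {"server": "gmail v1.2", "id": "gmail_send_email",
     "when": "user wants to send or compose an email", "tags": ["gmail", "email"]},
    {"server": "google_drive v2.0", "id": "drive_read_file",
     "when": "user wants to read a file in Google Drive", "tags": ["drive", "files"]},
    {"server": "calendar v1.0", "id": "calendar_create_event",
     "when": "user needs to create a calendar event", "tags": ["calendar", "events"]},
]
STOPWORDS = {"user", "wants", "needs", "to", "a", "an", "the", "or", "in", "please"}

def words(text):
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS

def select_tools(query, catalog, k=3):
    """Rank tools by word overlap with WHEN + TAGS; keep the top k with a hit."""
    q = words(query)
    scored = [(len(q & (words(t["when"]) | set(t["tags"]))), t) for t in catalog]
    scored.sort(key=lambda pair: pair[0], reverse=True)   # stable sort keeps ties in order
    return [t for score, t in scored[:k] if score > 0]

def render(selected, total):
    """Emit the catalog header and group the selected tools by MCP server."""
    lines = [f"TOOLS v1.0 [{len(selected)}/{total}]"]
    by_server = {}
    for tool in selected:
        by_server.setdefault(tool["server"], []).append(tool)
    for server, tools in by_server.items():
        lines.append(f"  MCP {server}")
        for tool in tools:
            lines.append(f"    TOOL {tool['id']}")
            lines.append(f"      WHEN: {tool['when']}")
    return "\n".join(lines)

picked = select_tools("please email the report to Ana", CATALOG)
print(render(picked, total=len(CATALOG)))
```

In a real agent you would swap `select_tools` for an embedding search over the WHEN and TAGS fields; the rendering step stays the same.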

Reference Converter (Python)

The author provides a ready-to-use reference implementation:

github.com/RudsonCarvalho/terse-format

It converts MCP JSON Schema → TTC with sensible defaults. For production use, you simply add explicit annotations for OUT, ERR, WHEN, and TAGS on the server side.
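To give a feel for what such a conversion involves, here is an independent minimal sketch. It is not the code from the repository above. The type mapping, the keyword arguments for the annotations, and the "first sentence becomes PURPOSE" default are all assumptions of mine, and the `input_schema` key follows the JSON shown in this article.

```python
TYPE_MAP = {"integer": "int", "number": "float", "boolean": "bool"}

def mcp_to_ttc(tool, out="any", err=(), when="", tags=()):
    """Convert one MCP-style tool dict to TTC text.

    OUT, ERR, WHEN and TAGS are not present in the JSON Schema, so they are
    passed in as explicit annotations (hypothetical keyword arguments).
    """
    schema = tool.get("input_schema", {})
    required = set(schema.get("required", []))
    params = []
    for name, spec in schema.get("properties", {}).items():
        json_type = spec.get("type", "any")
        if json_type == "array":
            item = spec.get("items", {}).get("type", "any")
            ttc_type = f"array[{TYPE_MAP.get(item, item)}]"
        else:
            ttc_type = TYPE_MAP.get(json_type, json_type)
        # Parameters outside "required" get the optional ? suffix.
        params.append(f"{name}:{ttc_type}" + ("" if name in required else "?"))
    # Default PURPOSE: the first sentence of the description.
    purpose = tool.get("description", "").split(". ")[0].rstrip(".")
    lines = [f"TOOL {tool['name']}", f"  PURPOSE: {purpose}",
             f"  IN: {', '.join(params)}", f"  OUT: {out}"]
    if err:
        lines.append(f"  ERR: {' | '.join(err)}")
    if when:
        lines.append(f"  WHEN: {when}")
    if tags:
        lines.append(f"  TAGS: {', '.join(tags)}")
    return "\n".join(lines)

mcp_tool = {
    "name": "gmail_send_email",
    "description": "Sends an email message via the Gmail API. Use when the user asks to send mail.",
    "input_schema": {
        "type": "object",
        "required": ["to", "subject", "body"],
        "properties": {
            "to": {"type": "string"},
            "subject": {"type": "string"},
            "body": {"type": "string"},
            "cc": {"type": "string"},
        },
    },
}
ttc_text = mcp_to_ttc(mcp_tool, out="message_id:string",
                      err=["auth_failed", "quota_exceeded"],
                      when="user wants to send or compose an email",
                      tags=["gmail", "email"])
print(ttc_text)
```

Without annotations the function still produces valid TOOL, PURPOSE, IN, and OUT lines, which is roughly what "sensible defaults" would look like in this sketch.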

Planned Future Extensions

EXAMPLE block — input/output examples for few-shot learning
COST annotation — estimated token/latency cost per call
CHAIN annotation — tool dependencies and composition patterns
ALIAS field — alternative trigger phrases
AUTH annotation — required OAuth scopes

Conclusion

The TERSE Tool Catalog is not just a token-saving trick. It is a genuine improvement in agent quality — better tool selection, better error handling, and native support for semantic tool retrieval.

If you work with agents, MCP, LangGraph, CrewAI, AutoGen, or any modern agentic framework, TTC is worth trying today.

Links

📄 Full spec (Zenodo): https://doi.org/10.5281/zenodo.19869007

💻 GitHub: https://github.com/RudsonCarvalho/terse-format/tree/main/extensions/ttc

🌐 Landing page: https://rudsoncarvalho.github.io/terse-format/

📦 TERSE Format (parent spec): https://doi.org/10.5281/zenodo.19058364

Your AI agent wastes 13,000 tokens before saying "hello"

Rudson Kiyoshi Souza Carvalho — Wed, 29 Apr 2026 01:22:37 +0000

And you probably have no idea.

If you have an agent with 50 MCP tools installed, here's what happens before any user message is processed:

{
  "name": "gmail_send_email",
  "description": "Sends an email message via the Gmail API to one or more 
    recipients. Use this tool when the user explicitly requests to send, 
    compose and send, or deliver an email message to someone.",
  "input_schema": {
    "type": "object",
    "required": ["to", "subject", "body"],
    "properties": {
      "to": {
        "type": "string",
        "description": "The recipient email address or comma-separated list"
      },
      "subject": {
        "type": "string",
        "description": "The subject line of the email"
      },
      "body": {
        "type": "string",
        "description": "The body content of the email in plain text or HTML"
      }
    }
  }
}

That's ~195 tokens. Per tool. Before anything else.

50 tools × 195 tokens = 9,750 tokens of pure overhead.

And that's just the catalog. You haven't touched user context, conversation history, documents, or anything useful yet.

"But there's prompt caching, right?"

Yes. It reduces the financial cost to ~10% of the base rate.

But caching does not reduce attention cost.

Those tokens still occupy the context window. The model still attends to all of them on every request. And if you use dynamic tool retrieval — selecting different tools per request based on user intent — the cache breaks on every different selection.

The bill doesn't disappear. It just gets cheaper.

The real problem nobody talks about

MCP JSON Schema was designed as a tool execution contract. Not as a semantic tool selection contract.

The result: information critical for LLM reasoning is either absent or buried in free-form text:

No error contract — the LLM doesn't know what to do when auth_failed
No explicit trigger — it has to infer "when to use this tool" from a paragraph of description
No retrieval taxonomy — no standard way to group or filter tools by domain

Verbose AND semantically incomplete. The worst of both worlds.

TTC — TERSE Tool Catalog

I spent the last few weeks solving this problem. The result is an extension of the TERSE Format called TTC — TERSE Tool Catalog.

The same tool above in TTC:

TOOL gmail_send_email
  PURPOSE: send email via Gmail
  IN: to:string, subject:string, body:string, cc:string?
  OUT: message_id:string
  ERR: auth_failed | quota_exceeded | invalid_recipient
  WHEN: user wants to send or compose an email
  TAGS: gmail, email, communication

~55 tokens. A 73.6% reduction against the 208 tokens measured for this tool's JSON Schema in the benchmark below.

And notice what was added, not just removed:

Field	MCP JSON	TTC
ERR — failure contract	❌ absent	✅ explicit
WHEN — selection trigger	❌ buried	✅ explicit
TAGS — retrieval taxonomy	❌ absent	✅ explicit

It's not compression. It's reallocation.

This is the most important point in the spec:

TTC does not reduce tokens by removing semantic content. It reduces syntactic and documentary overhead from JSON Schema — which serves human readability, not LLM reasoning — and reinvests part of those savings into explicit tool-selection semantics.

The actual math:

MCP JSON Schema:         ~195 tokens per tool
TTC without new fields:   ~35 tokens
TTC with all fields:      ~65 tokens

The 30-token "reinvestment" buys:
  ERR  → failure contract (absent from MCP)
  WHEN → selection trigger (absent from MCP)
  TAGS → retrieval taxonomy (absent from MCP)

Result: 195 → 65 tokens. -66.6%.
But those 65 tokens carry higher reasoning signal
than the original 195.

This is net reasoning-signal gain — not information gain in the classical sense. A critic might say you removed content (parameter descriptions, JSON Schema constraints). Correct. Content that serves human documentation, not LLM inference.

Real benchmark — 10 measured tools

Measured with BPE tokenizer (cl100k_base) on 10 real MCP tool definitions:

Tool	JSON Schema	TTC	Reduction
gmail_send_email	208	55	73.6%
calendar_create_event	262	78	70.2%
github_create_issue	269	84	68.8%
jira_create_ticket	254	77	69.7%
slack_send_message	206	69	66.5%
Total (10 tools)	1,948	650	66.6%

Projections for larger catalogs:

Catalog size	JSON Schema	TTC	Absolute saving
20 tools	~3,896	~1,300	~2,596 tokens
50 tools	~9,740	~3,250	~6,490 tokens
100 tools	~19,480	~6,500	~12,980 tokens

The absolute saving grows linearly. The larger the catalog, the higher the ROI.

Normative WHEN vocabulary

A natural language field without a standard creates another problem: two independent MCP server authors write incompatible WHEN conditions, degrading selection accuracy in large catalogs.

TTC v1.0 solves this with a normative vocabulary:

WHEN: user [wants|requests|asks|needs|intends] to [action] [object]

Conformant examples:
  WHEN: user wants to send an email message
  WHEN: user requests to list files in Google Drive
  WHEN: user needs to create a calendar event

Non-conformant:
  WHEN: send email          ← missing intent verb
  WHEN: user email          ← missing action verb
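A conformance check for this vocabulary can be approximated with a regular expression. This is only a sketch: the pattern is my reading of the template above (one intent verb, "to", a one-word action, then a non-empty object), not a validator taken from the spec.

```python
import re

WHEN_PATTERN = re.compile(
    r"^user (wants|requests|asks|needs|intends) to (?P<action>\w+) (?P<object>.+)$"
)

def is_conformant(when):
    """True if the WHEN text follows the 'user <intent> to <action> <object>' shape."""
    return WHEN_PATTERN.match(when.strip()) is not None

for text in ("user wants to send an email message",
             "user requests to list files in Google Drive",
             "send email",
             "user email"):
    print(f"{text!r}: {is_conformant(text)}")
```

A check like this fits naturally into CI for an MCP server repo, so non-conformant WHEN strings are caught before they reach a shared catalog.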

Accuracy simulation (TF-IDF cosine similarity, 12 tools, 36 queries):

Condition	Accuracy
MCP free-form description	63.9%
TTC WHEN controlled vocabulary	72.2%
Delta	+8.3 pp

Caveat: TF-IDF simulation, not a real LLM benchmark. Directional evidence.
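For readers who want to reproduce the flavor of that simulation, here is a standard-library sketch of TF-IDF cosine matching between a query and WHEN strings. It is not the harness behind the numbers above; the smoothed IDF formula, the toy tools, and the queries are my own choices.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, plus the idf table."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}  # smoothed idf
    vecs = [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized]
    return vecs, idf

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def pick_tool(query, tool_ids, when_texts):
    """Return the tool whose WHEN text is closest to the query."""
    vecs, idf = tfidf_vectors(when_texts)
    # Query terms unseen in the catalog carry no weight.
    q = {t: tf * idf[t] for t, tf in Counter(tokenize(query)).items() if t in idf}
    scores = [cosine(q, v) for v in vecs]
    return tool_ids[scores.index(max(scores))]

ids = ["gmail_send_email", "drive_list_files", "calendar_create_event"]
whens = ["user wants to send an email message",
         "user requests to list files in Google Drive",
         "user needs to create a calendar event"]
print(pick_tool("send a message to my boss", ids, whens))
print(pick_tool("list my drive files", ids, whens))
```

Accuracy in this setup is simply the share of labeled queries for which `pick_tool` returns the expected tool id.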

Where it works best

✅ Large catalogs (20+ tools) — where absolute savings justify migration

✅ Local and smaller models — Qwen 7B, Llama 3, Mistral — no cache, narrow windows

✅ Multi-agent pipelines — overhead compounds with every context handoff

✅ RAG over tools — compact TTC is ideal for vector DB indexing and subset injection

❌ Small catalogs with large LLM and wide context — marginal gain

❌ Replacing JSON Schema in API execution contracts — not the use case

Your AI agent is wasting 13,000 tokens before saying "hi"

Rudson Kiyoshi Souza Carvalho — Wed, 29 Apr 2026 01:22:14 +0000

And you probably don't even know it.

If you have an agent with 50 MCP tools installed, here's what happens before any user message is processed:

{
  "name": "gmail_send_email",
  "description": "Sends an email message via the Gmail API to one or more 
    recipients. Use this tool when the user explicitly requests to send, 
    compose and send, or deliver an email message to someone.",
  "input_schema": {
    "type": "object",
    "required": ["to", "subject", "body"],
    "properties": {
      "to": {
        "type": "string",
        "description": "The recipient email address or comma-separated list"
      },
      "subject": {
        "type": "string", 
        "description": "The subject line of the email"
      },
      "body": {
        "type": "string",
        "description": "The body content of the email in plain text or HTML"
      }
    }
  }
}

That's ~195 tokens. Per tool. Before anything else.

50 tools × 195 tokens = 9,750 tokens of pure overhead.

And that's just the catalog. You haven't reached the user context, the conversation memory, the documents, or anything else yet.

"But there's prompt caching, right?"

Yes. And it reduces the financial cost to ~10% of the original price.

But caching does not reduce the attention cost.

Those tokens still occupy the context window. The model still runs attention over all of them on every request. And if you use dynamic tool retrieval, selecting different tools per request, the cache breaks on every different selection.

The bill doesn't go away. It just gets cheaper.

The real problem nobody talks about

The MCP JSON Schema was designed as a tool execution contract. Not as a semantic selection contract.

The result: information the LLM needs for reasoning is either absent or buried in free-form text:

No error contract: the LLM doesn't know what to do on auth_failed
No explicit trigger: it has to infer "when to use this tool" from a paragraph-long description
No retrieval taxonomy: there is no way to group or filter tools by domain

In other words: verbose AND semantically incomplete. The worst of both worlds.

TTC — TERSE Tool Catalog

I spent the last few weeks solving this problem. The result is an extension of the TERSE Format called TTC (TERSE Tool Catalog).

The same tool as above, in TTC:

TOOL gmail_send_email
  PURPOSE: send email via Gmail
  IN: to:string, subject:string, body:string, cc:string?
  OUT: message_id:string
  ERR: auth_failed | quota_exceeded | invalid_recipient
  WHEN: user wants to send or compose an email
  TAGS: gmail, email, communication

~55 tokens. A 73.6% reduction.

And notice what was added, not just what was removed:

Field	MCP JSON	TTC
ERR — failure contract	❌ absent	✅ explicit
WHEN — selection trigger	❌ buried	✅ explicit
TAGS — retrieval taxonomy	❌ absent	✅ explicit

It's not compression. It's reallocation.

This is the most important point in the spec, and it's worth making clear:

TTC does not save tokens by removing semantic content. It eliminates the syntactic and documentary overhead of JSON Schema, which serves human readability rather than LLM reasoning, and reinvests part of those savings in explicit tool-selection semantics.

The actual math:

MCP JSON Schema:         ~195 tokens per tool
TTC without new fields:   ~35 tokens
TTC with all fields:      ~65 tokens

The 30-token "reinvestment" buys:
  ERR  → failure contract (absent from MCP)
  WHEN → semantic trigger (absent from MCP)
  TAGS → retrieval taxonomy (absent from MCP)

Result: 195 → 65 tokens. -66.6%.
But those 65 tokens carry more reasoning signal
than the original 195.

It's a net gain in reasoning signal, not an information gain in the classical sense.

Real benchmark — 10 measured tools

I measured with a BPE tokenizer (cl100k_base) on 10 real MCP tool definitions:

Tool	JSON Schema	TTC	Reduction
gmail_send_email	208	55	73.6%
calendar_create_event	262	78	70.2%
github_create_issue	269	84	68.8%
jira_create_ticket	254	77	69.7%
slack_send_message	206	69	66.5%
Total (10 tools)	1,948	650	66.6%

Projection for larger catalogs:

Catalog size	JSON Schema	TTC	Savings
20 tools	~3,896	~1,300	~2,596 tokens
50 tools	~9,740	~3,250	~6,490 tokens
100 tools	~19,480	~6,500	~12,980 tokens

The absolute savings grow linearly. The larger the catalog, the higher the ROI.

Normative vocabulary for WHEN

A natural-language field without a standard creates another problem: two different MCP server authors write WHEN in incompatible ways, degrading selection accuracy in large catalogs.

TTC v1.0 solves this with a normative vocabulary:

WHEN: user [wants|requests|asks|needs|intends] to [action] [object]

Conformant examples:
  WHEN: user wants to send an email message
  WHEN: user requests to list files in Google Drive
  WHEN: user needs to create a calendar event

Non-conformant:
  WHEN: send email          ← missing intent verb
  WHEN: user email          ← missing action verb

Accuracy simulation (TF-IDF cosine similarity, 12 tools, 36 queries):

Condition	Accuracy
MCP free-form description	63.9%
TTC WHEN controlled vocabulary	72.2%
Delta	+8.3 pp

Caveat: TF-IDF simulation, not a real LLM benchmark. Directional evidence.

Where it works best

✅ Large catalogs (20+ tools) — where the absolute savings justify the migration

✅ Local and smaller models — Qwen 7B, Llama 3, Mistral — no cache, narrow windows

✅ Multi-agent pipelines — the overhead accumulates with every context handoff

✅ RAG over tools — compact TTC is ideal for indexing in a vector DB and injecting subsets

❌ Small catalogs with a large LLM and wide context — marginal gain

❌ Replacing JSON Schema in API contracts — not the purpose

COA-MAS v2: A Meta-Framework for Cross-Domain Multi-Agent Governance

Rudson Kiyoshi Souza Carvalho — Wed, 01 Apr 2026 23:29:15 +0000

AI agents are crossing organizational boundaries. They call tools in partner domains, delegate tasks to external services, and operate in chains where no single actor sees the full picture.

COA-MAS v1 solved the intra-domain governance problem — a four-layer architecture, the Action Claim contract, and the AASG enforcement boundary that ensures zero cognitive load at runtime. If you haven't read it, the paper is at doi.org/10.5281/zenodo.19057202.

The cross-domain problem is different. And it took a full architectural pivot to solve it correctly.

The Silver Bullet Fallacy

Early iterations of COA-MAS v2 tried to build a universal calibration mechanism — a way to translate risk scores between domains with different semantic spaces. After several rounds of debate and stress-testing, it became clear that this approach has the same flaw as trying to replace PIX, TED, wire transfers, and letters of credit with a single payment instrument.

Each of those instruments exists because different transaction contexts require different guarantees. Resilience in distributed systems comes from routing to the right pattern based on context — not from finding the pattern that works everywhere.

The Thesis

COA-MAS v2 is a meta-framework, not a protocol. It standardizes one thing: the Action Intent — a universal artifact that any federated governance pattern can consume. The choice of execution topology is delegated to a Pattern Selection Protocol negotiated during trust peering.

The Action Intent is the common currency. The federation mode is the exchange mechanism.

The Action Intent

The Action Intent is the "passport" of the COA-MAS federation. It is a standardized, cryptographically signed declaration of:

Who is acting — SPIFFE identity, delegation chain, GOV-RISK attestation
What they intend to do — tool URI, operation type, resource scope
What effect they declare — reversibility, estimated scope, data sensitivity
Cryptographic binding — ephemeral DPoP public key for proof-of-possession

Domain A's internal policy, prompts, and risk weights are never transmitted. Only the declared intent, authenticated by Domain A's governance layer.

If Domain A lies — declares bounded_set but attempts a full-table deletion — the signed intent becomes irrefutable forensic evidence. The problem moves from governance mathematics to organizational accountability, backed by cryptographic proof.
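To illustrate why a signed intent works as evidence, here is a small sketch. The field names are illustrative only (the published schema is the authority), and HMAC with a shared demo key stands in for the asymmetric, identity-bound signature a real deployment would use.

```python
import hashlib
import hmac
import json
from dataclasses import asdict, dataclass, replace

@dataclass(frozen=True)
class ActionIntent:
    # Hypothetical field names, loosely following the four groups listed above.
    actor_spiffe_id: str
    delegation_chain: tuple
    tool_uri: str
    operation: str
    resource_scope: str
    reversible: bool
    estimated_scope: str
    data_sensitivity: str
    dpop_public_key: str

def canonical_bytes(intent):
    # Sorted keys and fixed separators give a stable byte string to sign.
    return json.dumps(asdict(intent), sort_keys=True, separators=(",", ":")).encode()

def sign(intent, key):
    # Stand-in signature; not the scheme used by COA-MAS.
    return hmac.new(key, canonical_bytes(intent), hashlib.sha256).hexdigest()

def verify(intent, signature, key):
    return hmac.compare_digest(sign(intent, key), signature)

key = b"domain-a-demo-key"
intent = ActionIntent(
    actor_spiffe_id="spiffe://domain-a.example/agent/db-cleaner",
    delegation_chain=("human:ops-team", "agent:orchestrator-01"),
    tool_uri="mcp://domain-b.example/sessions",
    operation="delete", resource_scope="sessions",
    reversible=False, estimated_scope="bounded_set",
    data_sensitivity="low", dpop_public_key="demo-ephemeral-key",
)
signature = sign(intent, key)

# Changing the declared scope after the fact invalidates the signature.
tampered = replace(intent, estimated_scope="full_table")
print(verify(intent, signature, key), verify(tampered, signature, key))
```

The point of the sketch is the last line: whatever Domain A declared is frozen at signing time, so a later mismatch between declaration and behavior can be proven from the artifact itself.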

The canonical JSON Schema is published at doi.org/10.5281/zenodo.19376419.

The Four Federation Modes

The Pattern Selection Protocol routes each cross-domain interaction to the appropriate mode based on trust distance, acceptable latency, and cognitive burden tolerance.

Mode 0 — Intra-Domain (COA-MAS v1)
Same domain. Deterministic, microsecond latency, zero external dependencies. The foundation everything else builds on.

Mode 1 — Sovereign Visa
Domain A submits the Action Intent to Domain B's authorization endpoint. Domain B's GOV-RISK evaluates it using its own Executable Culture — full sovereignty, no calibration across semantic spaces. GOV-RISK-B issues a standard COA-MAS v1 Action Claim with DPoP binding. AASG-B validates a locally-trusted signature at runtime. Zero cognitive load.

Mode 2 — Ambassador
Domain B doesn't expose tools to foreign agents at all. It exposes an agent communication interface. Domain A's intent becomes the opening message of an A2A conversation. Domain B's Ambassador agent formulates its own plan, submits it to GOV-RISK-B via Mode 0, and executes locally. Maximum isolation. Non-deterministic latency.

Mode 3 — Clearinghouse
A neutral Domain C — a regulated hub both domains trust — evaluates the intent and issues a universally-accepted Action Claim. Appropriate for regulated industries (Open Finance, healthcare prior authorization). Opt-in only: it trades polycentric sovereignty for operational simplicity.

Future Mode 4 — ZK-Policy
The CAGA-compliant target. Domain A generates a zero-knowledge proof of correct policy execution without revealing internal data. Domain B verifies mathematically. Not implementable in production today due to ZKML hardware constraints — but the meta-framework is explicitly designed to incorporate it as Mode 4 when viable, without requiring changes to the Action Intent schema or SPIFFE infrastructure.

The Pattern Selection Protocol

Domains don't negotiate a single mode — they negotiate a Federation Policy that maps operation families and resource classes to modes:

{
  "mode_by_operation": {
    "read":      { "mode": 1, "ttl_seconds": 1800, "single_use": false },
    "delete":    { "mode": 1, "ttl_seconds": 120,  "single_use": true  },
    "configure": { "mode": 2 }
  },
  "mode_by_resource_class": {
    "pii":       { "mode": 2 },
    "regulated": { "mode": 3 }
  }
}

The same pair of domains can use Mode 1 for routine reads and Mode 2 for infrastructure operations — without renegotiating the peering relationship.
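A lookup over that policy could be sketched as follows. One thing the text above does not specify is what happens when the operation and the resource class point to different modes; this sketch assumes the higher mode number wins, and `select_mode` is a hypothetical helper.

```python
FEDERATION_POLICY = {
    "mode_by_operation": {
        "read":      {"mode": 1, "ttl_seconds": 1800, "single_use": False},
        "delete":    {"mode": 1, "ttl_seconds": 120,  "single_use": True},
        "configure": {"mode": 2},
    },
    "mode_by_resource_class": {
        "pii":       {"mode": 2},
        "regulated": {"mode": 3},
    },
}

def select_mode(policy, operation, resource_class=None):
    """Pick the federation rule for one cross-domain interaction.

    Assumption: when both tables match, the rule with the higher mode
    number wins. The source does not define this precedence.
    """
    candidates = []
    op_rule = policy["mode_by_operation"].get(operation)
    if op_rule:
        candidates.append(op_rule)
    rc_rule = policy["mode_by_resource_class"].get(resource_class)
    if rc_rule:
        candidates.append(rc_rule)
    if not candidates:
        # No negotiated rule: refuse rather than guess a mode.
        raise LookupError(f"no federation rule for {operation!r}/{resource_class!r}")
    return max(candidates, key=lambda rule: rule["mode"])

print(select_mode(FEDERATION_POLICY, "read")["mode"])
print(select_mode(FEDERATION_POLICY, "read", "pii")["mode"])
print(select_mode(FEDERATION_POLICY, "delete", "regulated")["mode"])
```

Raising on an unmapped operation keeps the sketch deny-by-default, which seems the safer reading for a governance layer.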

Positioning Against CAGA

Meyman [SSRN 6299461] formalizes the Cross-Agent Governance Alignment (CAGA) problem and identifies zero-knowledge proofs as the theoretically correct solution. COA-MAS v2 is the operationally deployable answer while ZKML hardware matures — trading full policy confidentiality for sub-millisecond runtime enforcement, zero integration cost for Domain B, and compatibility with stochastic LLM-based GOV-RISKs.

The relationship is complementary. CAGA defines what a correct solution must prove. COA-MAS v2 defines how production systems navigate the space between the theoretically ideal and the operationally deployable.

What's Published

📄 Working Paper v0.3
doi.org/10.5281/zenodo.19376738
zenodo.org/records/19376739

🔧 Action Intent Schema v1.0.0
doi.org/10.5281/zenodo.19376419
zenodo.org/records/19376420

📚 COA-MAS v1 (foundation)
doi.org/10.5281/zenodo.19057202

If you're building cross-domain multi-agent systems and the governance layer is an afterthought, the meta-framework and the schema are open access. Feedback, critique, and stress-testing welcome.

AI Agents Can Delete Your Production Database. Here's the Governance Framework That Stops Them.

Rudson Kiyoshi Souza Carvalho — Tue, 31 Mar 2026 12:51:25 +0000

This article presents COA-MAS — a governance framework for autonomous agents grounded in organizational theory, institutional design, and normative multi-agent systems research. The full paper is published on Zenodo: doi.org/10.5281/zenodo.19057202

The Problem No One Is Talking About

Something unusual happened in early 2026. The IETF published a formal Internet-Draft on AI agent authentication and authorization. Eight major technology companies released version 1.0 of the Agent-to-Agent Protocol. And a widely-read post demonstrated why the prevailing credential model for AI agents was structurally broken.

The convergence wasn't coincidental. It was the signal that a structural problem — long present in early agentic deployments — had reached the threshold of production consequence.

We've built agents that can:

Delete production databases
Execute financial transactions
Modify business logic
Spawn other agents

And we gave them API keys.

An API key authorizes access. It does not authorize a specific action with a specific impact in a specific context. That distinction is the entire problem.

The Structural Failure Mode: Distributed Cognitive Chaos

I call this failure mode Distributed Cognitive Chaos (DCC): the structural consequence of deploying agents without formal authority hierarchies, authorization contracts, or enforcement boundaries.

DCC has three symptoms:

Action hallucination — an agent executes an action it was never authorized to perform, because nothing formally defined "authorized"
Mandate drift — through a chain of agent-to-agent delegations, the original human intent gets distorted beyond recognition
Accountability collapse — when something goes wrong, there is no tamper-evident record connecting the action to the authority that (supposedly) permitted it

This is not a new problem. It's the oldest problem in organizational theory: how do you coordinate partially autonomous actors toward collective goals while preventing any individual actor from harming the collective?

Herbert Simon identified it in 1947. Elinor Ostrom solved it in 1990. We just haven't applied those solutions to AI agents yet.

COA-MAS: A Governance Framework Grounded in Theory

COA-MAS (Cognitive Organization Architecture for Multi-Agent Systems) is my answer. It synthesizes four intellectual traditions:

Simon's bounded rationality → why agents need external governance
Ostrom's institutional design principles → how to structure governance for durability
Normative multi-agent systems research → how to formalize governance as computable norms
Sociotechnical systems theory → how to make social norms technically enforceable

The framework has three components. Each answers a different question.

Component 1: The Four-Layer Architecture

Question: Who is in charge?

Think of it as a corporate structure for AI agents:

┌────────────────────────────────────────────────────┐
│ LAYER 4 — STRATEGIC ORCHESTRATION                  │
│ Receives human objectives · decomposes into tasks  │
└────────────────────────────────────────────────────┘
                          ↕
┌────────────────────────────────────────────────────┐
│ LAYER 3 — COGNITIVE GOVERNANCE                     │
│ Evaluates proposed actions · issues authorization  │
│ documents · maintains audit ledger                 │
└────────────────────────────────────────────────────┘
                          ↕
┌────────────────────────────────────────────────────┐
│ LAYER 2 — FUNCTIONAL SPECIALIZATION                │
│ Domain agents · execute tasks within their         │
│ cognitive authority boundary                       │
└────────────────────────────────────────────────────┘
                          ↕
┌────────────────────────────────────────────────────┐
│ LAYER 1 — EXECUTABLE CULTURE (Constitutional)      │
│ Versioned YAML policies · weights · thresholds     │
│ Human-authored before runtime. Immutable during.   │
└────────────────────────────────────────────────────┘

The critical insight, drawn from both Simon and Ostrom, is the separation between those who propose actions and those who authorize them. An agent cannot authorize its own actions. This mirrors the principle of checks and balances in constitutional systems: the body that proposes is not the body that authorizes is not the body that records.

Component 2: The Action Claim

Question: What exactly is the agent authorized to do?

An Action Claim is a formal authorization document that agents must present before executing any real-world action. It's analogous to a building permit — not just "you're allowed to build," but: the location, the dimensions, the materials, the timeline, the inspector, and the version of the building code that governed the approval.

The Action Claim has three parts:

{
  // DECLARED FIELDS — filled by the agent
  "proposed_transition": "DELETE expired sessions older than 90 days",
  "originating_goal": "scheduled maintenance task #4421",
  "delegation_chain": ["human:ops-team", "agent:orchestrator-01", "agent:db-cleaner"],
  "estimated_impact": {
    "destructivity": 0.25,
    "data_exposure": 0.00,
    "resource_consumption": 0.30,
    "privilege_escalation": 0.00,
    "logic_integrity": 0.05,
    "recursive_autonomy": 0.10
  },

  // DERIVED FIELDS — filled by GOV-RISK (Layer 3)
  "justification_gap": 0.08,
  "decision": "APPROVE",
  "governance_signature": "sha256:a3f9...",
  "policy_digest": "sha256:1b2c...",

  // AUDIT FIELDS — filled by infrastructure
  "ac_id": "ac-2026-03-31-00421",
  "state": "AUTHORIZED",
  "committed_at": "2026-03-31T14:22:01Z"
}

The tripartite structure reflects Ostrom's principle of separating operational decisions from the collective-choice rules that govern them. The agent operates at the operational level; Layer 3 applies institutional norms; the audit trail creates an immutable record connecting every decision to the rules that governed it.
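One way to picture the three-part structure in code is an immutable record where each layer produces a new copy instead of editing the agent's declaration. This is a minimal sketch, not the schema from the paper: the class name, the `authorize` helper, and the initial `"PROPOSED"` state are assumptions made for illustration.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ActionClaim:
    # Declared fields: filled by the agent
    proposed_transition: str
    originating_goal: str
    delegation_chain: tuple
    estimated_impact: dict
    # Derived fields: filled by Layer 3 (left empty by the agent)
    justification_gap: Optional[float] = None
    decision: Optional[str] = None
    # Audit fields: filled by infrastructure
    ac_id: Optional[str] = None
    state: str = "PROPOSED"  # hypothetical initial state


def authorize(claim: ActionClaim, gap: float, decision: str) -> ActionClaim:
    """Layer 3 step: return a new claim with derived fields set.

    The declared fields are copied unchanged, and a claim that already
    carries a decision cannot be evaluated a second time.
    """
    if claim.decision is not None:
        raise ValueError("claim already evaluated")
    return replace(claim, justification_gap=gap, decision=decision)


claim = ActionClaim(
    proposed_transition="DELETE expired sessions older than 90 days",
    originating_goal="scheduled maintenance task #4421",
    delegation_chain=("human:ops-team", "agent:orchestrator-01", "agent:db-cleaner"),
    estimated_impact={"destructivity": 0.25},
)
approved = authorize(claim, gap=0.08, decision="APPROVE")
```

Because the dataclass is frozen, the agent that built `claim` cannot set `decision` on it afterwards; only a separate step that returns a new object can.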

Component 3: The AASG (Autonomous Agent Security Gateway)

Question: How is authorization enforced?

Think of the AASG as a customs inspector at the boundary between the agents' cognitive world and the real world of executing tools.

Agent Cognition (A2A) ────────────────► Real World (MCP)
                              │
                         [ AASG ]
                              │
                    Checks exactly 3 things:
                    1. Is the Action Claim valid and signed?
                    2. Is the agent identity correct?
                    3. Does the actual tool call match 
                       what was declared?

The AASG is a physically separate process — not a library in the agent, not a plugin in the MCP server. This separation is the architectural expression of Simon's insight: governance constraints must be external to the decision-maker they constrain. An agent cannot reliably police itself.

What happens when an agent lies?

Consider an agent authorized to clean up expired sessions. It gets an AUTHORIZED Action Claim with destructivity: 0.25. Then it tries to execute:

DELETE FROM users  -- no WHERE clause

The AASG projects the actual tool call onto the governance space:

Declared: destructivity = 0.25 (bounded deletion)
Actual:   destructivity = 1.00 (full table wipe)

Congruence delta: 0.75 >> threshold (0.20)
→ ERR_AASG_003: CONGRUENCE_VIOLATION
→ Action never reaches the database
→ Merkle Ledger: INTERCEPTED entry recorded
→ GOV-AUDIT: SCOPE_SUBDECLARATION_DETECTED alert

The action is blocked before any damage occurs — not discovered in a log review afterward.
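The comparison in that trace can be sketched in a few lines. The post does not say how the AASG derives the actual impact vector from a tool call, nor how per-dimension deltas are combined, so this sketch takes the actual vector as an input and assumes the largest positive excess is what gets compared to the threshold. The function name is hypothetical.

```python
CONGRUENCE_THRESHOLD = 0.20  # threshold quoted in the trace above


def check_congruence(declared, actual, threshold=CONGRUENCE_THRESHOLD):
    """Compare actual impact against declared impact, dimension by dimension.

    Returns (ok, worst_dimension, worst_delta), where worst_delta is the
    largest amount by which any actual value exceeds its declared value.
    """
    worst_dim, worst_delta = None, 0.0
    for dim, actual_value in actual.items():
        delta = actual_value - declared.get(dim, 0.0)
        if delta > worst_delta:
            worst_dim, worst_delta = dim, delta
    return worst_delta <= threshold, worst_dim, worst_delta


declared = {"destructivity": 0.25}  # bounded deletion
actual = {"destructivity": 1.00}    # full table wipe
ok, dim, delta = check_congruence(declared, actual)
if not ok:
    print(f"ERR_AASG_003: CONGRUENCE_VIOLATION on {dim} (delta={delta:.2f})")
```

A dimension the agent never declared is treated as declared at 0.0, so any undeclared impact above the threshold is also intercepted.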

The Justification Gap: The First Computable Proportionality Metric

The core innovation in COA-MAS is the Justification Gap — a metric that answers: is this specific action proportional to the mandate that authorized it?

The governance space has six dimensions:

Dimension	What It Measures	Weight
d₁ Destructivity	Irreversible state change (DELETE, DROP, PURGE)	0.30
d₂ Data Exposure	Cross-zone data movement, PII handling	0.25
d₃ Resource Consumption	Financial/computational impact	0.10
d₄ Privilege Escalation	IAM creation, credential access	0.20
d₅ Logic Integrity	Business logic modification	0.10
d₆ Recursive Autonomy	Sub-agent spawning, delegation depth	0.05

The formula:

Gⱼ(V_A, V_M) = Σ wᵢ · φ(aᵢ, mᵢ)

where aᵢ and mᵢ are the i-th components of the action vector V_A and the mandate vector V_M, and φ(aᵢ, mᵢ) =
  0               if aᵢ ≤ mᵢ          (within mandate)
  ((aᵢ - mᵢ)/mᵢ)² if aᵢ > mᵢ > 0      (quadratic penalty)
  +∞              if aᵢ > 0 ∧ mᵢ = 0   (absolute prohibition)

The decision:

Gⱼ ≤ 0.15  →  APPROVE
0.15 < Gⱼ ≤ 0.40  →  ESCALATE (human review with full dossier)
Gⱼ > 0.40 or Gⱼ = +∞  →  BLOCK

The quadratic penalty means small excesses may be tolerable, but large excesses signal a qualitatively different action from what was authorized. The ESCALATE tier implements "meaningful human oversight" as required by EU AI Act Article 14 — with actual quantitative evidence, not just a kill switch.
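To make the formula and thresholds concrete, here is a small sketch that uses the weights from the table above. The function names are illustrative, and treating a missing dimension as 0.0 is an assumption of this sketch, not something the post specifies.

```python
import math

# Weights from the six-dimension table above (they sum to 1.0).
WEIGHTS = {
    "destructivity": 0.30,
    "data_exposure": 0.25,
    "resource_consumption": 0.10,
    "privilege_escalation": 0.20,
    "logic_integrity": 0.10,
    "recursive_autonomy": 0.05,
}


def phi(a, m):
    """Per-dimension penalty for action value a under mandate value m."""
    if a <= m:
        return 0.0          # within mandate
    if m == 0:
        return math.inf     # absolute prohibition
    return ((a - m) / m) ** 2  # quadratic penalty


def justification_gap(action, mandate):
    """Weighted sum of penalties; missing dimensions count as 0.0."""
    return sum(
        w * phi(action.get(dim, 0.0), mandate.get(dim, 0.0))
        for dim, w in WEIGHTS.items()
    )


def decide(gap):
    if gap <= 0.15:
        return "APPROVE"
    if gap <= 0.40:
        return "ESCALATE"
    return "BLOCK"  # also covers gap == +inf


mandate = {"destructivity": 0.25}
for attempted in (0.40, 0.50, 1.00):
    gap = justification_gap({"destructivity": attempted}, mandate)
    print(attempted, round(gap, 3), decide(gap))
```

With a mandate of 0.25 on destructivity, an action at 0.40 scores about 0.108 and is approved, 0.50 scores 0.30 and is escalated, and 1.00 scores 2.7 and is blocked. Any action touching a dimension the mandate sets to zero, such as privilege escalation here, is blocked outright.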

Mandate Laundering: The Attack You Didn't Know Existed

Here's the attack that no existing security framework addresses: mandate laundering.

In a delegation chain Human → Agent₁ → Agent₂ → ... → Agentₖ, each intermediate agent can marginally expand the mandate it passes on. Each local expansion looks proportionate. But the cumulative expansion is not.

COA-MAS anchors the Justification Gap to the root human mandate, regardless of intermediate expansions:

G_chain(Aₖ) = Gⱼ(V_{Aₖ}, V_{M₀})  ← root mandate, always

G_total = 0.30 · G_local + 0.70 · G_chain

Non-Improvement Theorem: For any permissive subdelegation, G_chain is monotone non-decreasing. You cannot launder your way out of the original constraint.
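A short worked example shows why anchoring to the root matters. In this sketch the mandate vectors and the helper names are hypothetical; only the per-dimension penalty, the destructivity weight, and the 0.30/0.70 split come from the text above. For brevity it scores a single dimension.

```python
import math

DESTRUCTIVITY_WEIGHT = 0.30  # from the dimension table


def phi(a, m):
    """Per-dimension penalty, as defined for the Justification Gap."""
    if a <= m:
        return 0.0
    if m == 0:
        return math.inf
    return ((a - m) / m) ** 2


def gap(action, mandate):
    """Single-dimension Justification Gap (destructivity only)."""
    return DESTRUCTIVITY_WEIGHT * phi(
        action.get("destructivity", 0.0), mandate.get("destructivity", 0.0)
    )


def total_gap(action, chain_mandates, w_local=0.30, w_chain=0.70):
    """chain_mandates[0] is the root human mandate; chain_mandates[-1] is
    the mandate handed to the agent that wants to act."""
    g_local = gap(action, chain_mandates[-1])
    g_chain = gap(action, chain_mandates[0])
    return w_local * g_local + w_chain * g_chain


chain = [
    {"destructivity": 0.20},  # root: what the human actually authorized
    {"destructivity": 0.30},  # Agent1 widens it slightly
    {"destructivity": 0.45},  # Agent2 widens it again
]
action = {"destructivity": 0.45}

print(gap(action, chain[-1]))              # local view: no excess at all
print(round(total_gap(action, chain), 3))  # root-anchored view
```

Judged only against the mandate it was handed, the last agent looks fully compliant. Judged against the root, the same action lands at roughly 0.33, inside the ESCALATE band, so the two small expansions do not go unnoticed.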

How COA-MAS Fits the Standards Ecosystem

COA-MAS doesn't compete with existing standards — it implements what they defer:

Initiative	What It Solves	What It Defers	COA-MAS Role
IETF draft-klrc-aiagent-auth	Identity, authentication, authorization (SPIFFE, OAuth 2.0)	Policy model explicitly out of scope	Implements the policy model
A2A Protocol v1.0	Agent coordination standard	Authorization at execution boundary	AASG is the enforcement point A2A lacks
MCP v1.0	Agent-to-tool communication	No semantic authorization layer	AASG is the authorization gate MCP doesn't have

The IETF draft's Section 12 explicitly states: "the policy model and document format are out of scope." That is precisely where COA-MAS contributes.

The Failure Mode Transition

The most consequential architectural property of COA-MAS is the failure mode it introduces.

Traditional agentic systems: fail semantically and silently. The agent reinterprets a guideline, slightly expands a scope, finds an unanticipated interpretation. Detectable only after damage, through log analysis.

COA-MAS: introduces the explicit CONGRUENCE_VIOLATION failure mode. When an agent attempts an action that violates its declared impact vector, the AASG returns:

A specific error code
The dimension violated
The quantitative delta
A Merkle Ledger entry with full context

This is the organizational equivalent of a building inspector catching a code violation before the foundation is poured — not after the building collapses.

What's Published

The full paper, COA-MAS: A Governance Framework for Autonomous Agents in Production Environments, is available on Zenodo:

📄 zenodo.org/records/19057202

🔑 DOI: doi.org/10.5281/zenodo.19057202

📜 License: CC BY 4.0

The paper covers:

Full formal specification of the Action Claim ontology
Complete mathematical treatment of the Justification Gap
Attack pattern neutralization (scope subdeclaration, decomposition attack, mandate laundering)
EU AI Act regulatory alignment (Articles 9, 11, 13, 14)
Positioning against IETF, A2A, MCP, and AIMS model

Final Thought

The governance of autonomous agents is not a new problem. Simon identified its theoretical roots in 1947. Ostrom identified the institutional design solutions in 1990. Normative MAS researchers formalized the computational analogues through the 1990s and 2000s.

What's new in 2026 is the urgency.

Agents that can delete production databases and execute financial transactions are being deployed without the governance infrastructure this body of knowledge prescribes.

COA-MAS applies established principles to a new domain. The question is not whether governance is necessary — it's whether we build it before or after the first major incident.

If you're building multi-agent systems in production, I'd be genuinely interested in feedback on whether these primitives map to the problems you're encountering. The paper is open access — feel free to cite, critique, or extend.

— Rudson Kiyoshi Souza Carvalho, Independent Researcher

doi.org/10.5281/zenodo.19057202

TERSE — A New Serialization Format Built for LLMs

Rudson Kiyoshi Souza Carvalho — Tue, 31 Mar 2026 12:10:36 +0000

JSON is the default. But defaults were built for a different world.

Every time you send structured data to a Large Language Model, you pay for it token by token. And if you're using JSON — which almost everyone is — you're paying for a lot of characters that carry no information.

Take this simple payload:

{
  "user_id": 1001,
  "status": "active",
  "data": ["feature_a", "feature_b"],
  "verified": true
}

Count the noise: braces, quotes around every key and string value, commas, colons with spaces. Now imagine this multiplied across thousands of API calls per day. That's real money.

I built TERSE to address this.

What is TERSE?

TERSE (Token-Efficient Recursive Serialization Encoding) is a text-based data serialization format designed to represent the complete JSON data model with substantially fewer tokens — making it significantly more cost-efficient for use as input to Large Language Models.

The same payload in TERSE:

user_id: 1001
status: active
data: [feature_a feature_b]
verified: T

Same information. ~47% fewer tokens.

How it compares

Format	Token savings vs JSON	Full JSON coverage?
JSON	baseline	✓
YAML	~20%	✓ (verbose arrays)
TOON	~40%	✗ (flat data only)
TERSE	~47%	✓

YAML is a genuine improvement over JSON — it's more compact and covers the full data model. But it was designed for humans to write, not for LLMs to consume. Verbose arrays (- item per line), full-word booleans (true/false), and a notoriously complex parser spec limit its token savings.

TOON goes further on token reduction but falls apart with nested objects — it only works for flat, uniform tabular data. If your payload has any nesting, TOON can't represent it.

TERSE was designed to close that gap: full JSON data model coverage, with token efficiency as the primary design constraint.

The five design principles

1. Bare strings — identifiers and common values require no quotation marks. production stays production, not "production". Quotes are reserved for strings that actually need them — those containing spaces, reserved characters, or special syntax.

2. Compact primitives — null, true, and false become single characters: ~, T, F. Three of the most common values in any payload, each reduced to one token.

3. Implicit delimiters — spaces separate values inside objects and arrays. No trailing commas, no colons between array elements.

4. Schema arrays — the biggest token win for tabular data. Uniform arrays of objects declare their fields once, then list values positionally:

users:
  #[id name role active]
  1 Alice admin T
  2 Bruno editor T
  3 Carla viewer F

The equivalent JSON repeats "id", "name", "role", "active" on every single row. For a 100-row dataset, that's 400 key occurrences where a schema array needs only 4.

5. Recursive structure — all constructs nest arbitrarily. Objects inside arrays inside schema arrays — all valid, all compact. No flat-only limitations.
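The principles above can be imitated with a toy encoder. To be clear about its status: this is not terse-py and does not follow the v0.7 grammar. It is a sketch that reproduces only the shapes shown in this post, and the bare-string rule (a regular expression chosen here) and all function names are assumptions.

```python
import json
import re

# Assumption: a string may stay bare if it looks like an identifier-ish token.
BARE = re.compile(r"[A-Za-z_][A-Za-z0-9_.@-]*")


def is_scalar(v):
    return v is None or isinstance(v, (bool, int, float, str))


def scalar(v):
    if v is None:
        return "~"
    if v is True:
        return "T"
    if v is False:
        return "F"
    if isinstance(v, (int, float)):
        return json.dumps(v)
    # Quote anything that is not a safe bare token, or that would be
    # mistaken for a compact primitive.
    if BARE.fullmatch(v) and v not in ("T", "F"):
        return v
    return json.dumps(v)


def inline(v):
    """Encode any value on one line: scalars, {k:v ...} or [a b ...]."""
    if is_scalar(v):
        return scalar(v)
    if isinstance(v, dict):
        return "{" + " ".join(f"{k}:{inline(x)}" for k, x in v.items()) + "}"
    return "[" + " ".join(inline(x) for x in v) + "]"


def is_table(v):
    """True for a non-empty list of dicts with identical keys and scalar values."""
    return (
        isinstance(v, list)
        and len(v) > 0
        and all(isinstance(r, dict) and all(is_scalar(x) for x in r.values()) for r in v)
        and all(list(r) == list(v[0]) for r in v)
    )


def to_terse(obj):
    """Encode a top-level dict, using a schema array for uniform tables."""
    lines = []
    for key, value in obj.items():
        if is_table(value):
            columns = list(value[0])
            lines.append(f"{key}:")
            lines.append("  #[" + " ".join(columns) + "]")
            lines.extend("  " + " ".join(scalar(r[c]) for c in columns) for r in value)
        else:
            lines.append(f"{key}: {inline(value)}")
    return "\n".join(lines)


payload = {"user_id": 1001, "status": "active",
           "data": ["feature_a", "feature_b"], "verified": True}
print(to_terse(payload))
```

Run on the opening payload, the sketch prints the same four lines shown earlier in the post. It does not handle keys that need quoting or tables nested below the top level, both of which a real implementation would have to cover.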

A real example: nested order

JSON (~180 tokens):

{
  "orderId": "ORD-001",
  "customer": {
    "name": "Rafael Torres",
    "email": "r@email.com"
  },
  "items": [
    {"sku": "A1", "qty": 2, "price": 9.99},
    {"sku": "B3", "qty": 1, "price": 24.50}
  ],
  "paid": true,
  "notes": null
}

TERSE (~95 tokens):

orderId: ORD-001
customer: {name:"Rafael Torres" email:r@email.com}
items:
  #[sku qty price]
  A1 2 9.99
  B3 1 24.50
paid: T
notes: ~

This is where TERSE separates itself from TOON and CSV — deeply nested structures work exactly as expected.

You don't write TERSE by hand

The workflow is identical to JSON:

Your data (object/dict)
      ↓
serialize()        ← terse-js or terse-py
      ↓
TERSE string       ← sent to the LLM
      ↓
parse()            ← if you need it back
      ↓
Your data again

Just like nobody writes JSON.stringify() output by hand — you call the function. TERSE works the same way. The format is optimized for the one reader that actually matters: the LLM.

On design intent: why not compress further?

TERSE could go deeper — automatic key abbreviation, binary type encoding, dictionary compression. We deliberately stopped short of that.

The goal is a format that remains human-auditable: you can open a .terse file in any text editor and understand what you're looking at without tooling. In LLM pipelines, auditability is a safety property, not just a convenience. When an agent misbehaves, you need to inspect its inputs.

Two questions that come up

Can I use TERSE for REST API communication between microservices?

You can, but it's not the primary use case. REST APIs are consumed by many clients across different teams and languages — JSON's universal support is a real advantage there. TERSE shines where you control both ends: serializing data before sending it to an LLM, and parsing the response on the other side.

Can I use TERSE for application configuration, like YAML?

Yes — the format supports everything YAML does for config files: nested objects, arrays, typed values, comments. Worth considering if your config is also consumed by an LLM as context.

What's available today

The project includes:

Formal specification (v0.7) with ABNF grammar, conformance rules, and security considerations — published on Zenodo with DOI: 10.5281/zenodo.19058364
Reference implementations in TypeScript, Python, Java, and Go
Live playground where you can paste JSON and see the TERSE output in real time

Everything is open source under MIT (implementations) and CC BY 4.0 (specification).

Resilience Evaluation and Optimization Framework — REOF

Rudson Kiyoshi Souza Carvalho — Wed, 12 Jun 2024 12:23:30 +0000

Author: Rudson Kiyoshi Souza Carvalho

Date: April 2024

Objective: This document presents REOF, a framework for evaluating, quantifying, and optimizing the resilience and reliability of systems, with a focus on software applications.

By systematically evaluating each critical component, the methodology helps to proactively identify areas of vulnerability that could compromise the system's reliability and resilience.

1. Introduction to REOF:

REOF is a standardized tool for analyzing, quantifying, and expressing the resilience and reliability of a system through a numeric index, the IRC (Índice de Resiliência e Confiabilidade, or Resilience and Reliability Index).
The methodology focuses on preventing failures and implementing best practices to increase reliability.

2. REOF Analysis Methodology:

Evaluation Verticals: REOF divides the analysis into "verticals" that represent critical points of a system, such as:

EE - External Input (points of interaction with the client)
SE - External Outputs (sending data to other systems)
CE - External Queries (integrations with other systems)
DI - Internal Data (queries to databases, caches, etc.)
AC - Containerized Application (health check configuration)
SEC - Security Framework Enabled (e.g., Spring Security)

One of the most important points about this framework is that it was designed to accommodate any vertical you create. You can define your own evaluation verticals and assess any process that has a set of best practices to check, such as infrastructure verticals or mobile application development techniques.

Protections and Weights: For each vertical, "protections" (best practices) that increase resilience are defined, each with a specific weight.
"With your engineering or architecture team, you can list the best protection practices for promoting system resilience and reliability, and define a weight for each protection applied."

Index Calculation: The IRC is calculated as the weighted sum of the scores of each vertical.

Degradation Factor: A degradation factor is applied to account for the impact of multiple domains or features in the same microservice (micromonoliths).

Each additional domain or feature reduces the overall index by 10%, because adding different features or domains forces the service to share resources, and slowness in one feature can exhaust the resources available to the other features in the same microservice.

Index Normalization: The IRC is normalized to a scale of 0 to 10, which makes it easier to communicate and compare across different systems.
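A rough sketch of the calculation described above, under several assumptions the text leaves open: each protection contributes its weight when applied, the raw score is the applied weight divided by the total possible weight, and the 10% degradation is applied linearly per additional domain with a floor at zero. The protection names, weights, and the function name are hypothetical examples.

```python
def irc(verticals, domains=1, penalty=0.10):
    """Compute an IRC-style index on a 0 to 10 scale.

    verticals: {vertical_code: [(protection_name, weight, applied), ...]}
    domains:   number of domains/features hosted by the microservice
    penalty:   index reduction per additional domain (10% in the text)
    """
    total = sum(w for prots in verticals.values() for _, w, _ in prots)
    if total == 0:
        return 0.0
    achieved = sum(
        w for prots in verticals.values() for _, w, applied in prots if applied
    )
    raw = achieved / total  # 0..1 before degradation
    # Assumption: linear degradation, never below zero.
    degradation = max(0.0, 1.0 - penalty * max(0, domains - 1))
    return round(10.0 * raw * degradation, 2)


verticals = {
    "EE": [("request timeout", 3, True), ("rate limiting", 2, False)],
    "DI": [("circuit breaker", 3, True), ("connection pool limits", 2, True)],
}
print(irc(verticals))             # single-domain service
print(irc(verticals, domains=3))  # same service hosting two extra domains
```

In this example 8 of 10 weight points are covered, so a single-domain service scores 8.0, while the same protections in a service hosting three domains score 6.4.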

3. IRC/REOF as an SLA:

REOF allows the IRC to be expressed as service levels (SLA):

Excellent (8 to 10)
Good (5 to 7.9)
Acceptable (3 to 4.9)
Unsatisfactory (below 3)

Figure: Ruds' REOF reliability pyramid

SLA for Excellent Service: The IRC/REOF must be greater than or equal to 8, indicating an excellent level of service. This reflects the high reliability and efficiency of the microservice, with no overload from additional domains.

SLA for Good Service: The IRC/REOF must be between 5 and 7.9, indicating a good level of service. This reflects a reliable microservice.

SLA for Acceptable Service: The IRC/REOF must be between 3 and 4.9, indicating an acceptable level of service with room for improvement. Corrective measures should be applied to increase the reliability of the service and reduce the impact of outages caused by the application.

SLA for Unsatisfactory Service: An IRC/REOF below 3 indicates an unsatisfactory level of service. The service needs review and improvement and should not be considered reliable.
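The four bands translate directly into a lookup. This sketch uses half-open intervals so that values such as 7.95, which fall between the listed ranges, still get a level; that choice and the function name are assumptions, and the English labels are translations of the level names.

```python
def sla_level(irc):
    """Map an IRC value on the 0 to 10 scale to its service level."""
    if not 0 <= irc <= 10:
        raise ValueError("IRC must be between 0 and 10")
    if irc >= 8:
        return "Excellent"
    if irc >= 5:
        return "Good"
    if irc >= 3:
        return "Acceptable"
    return "Unsatisfactory"


for value in (9.2, 6.4, 4.9, 1.0):
    print(value, sla_level(value))
```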

4. Flexibility and Automation:

REOF is flexible and can be customized with new verticals and protections.
The IRC calculation can be automated through static code analysis, although its accuracy may be limited.

5. REOF vs. MTBF:

REOF is a proactive measure that evaluates the robustness of a system based on how it was built, whereas MTBF is a reactive measure that considers only the mean time between failures.

MTBF is the metric of luck over time. A high MTBF can indicate that a system has had a good operational history under ideal environmental conditions, but it does not necessarily distinguish genuinely well-designed systems from those that were simply "lucky" to run in a stable environment during the evaluation period.

REOF is more comprehensive and provides more actionable insights for improving resilience.

6. Relationship with Chaos Engineering:

REOF and Chaos Engineering are complementary approaches.
REOF ensures that resilience best practices are applied during development, while Chaos Engineering tests the resilience of the system in production.

7. Benefits of REOF:

Effective communication about system reliability.
Precise identification of areas for improvement.
A culture of continuous improvement and failure prevention.
Risk management and SLA compliance.
Better user experience.

8. Cost Considerations:

Implementing REOF can have a significant upfront cost, but it reduces operational costs in the long run.
Chaos Engineering can have a low implementation cost, but operational costs can be high during testing.

How is the REOF method better than the MTBF method?

MTBF is a statistic about how your system has run, based on its operational history: a measurement over time. A system can run very well under ideal operating conditions, and if nothing abnormal happens in its environment or infrastructure, MTBF will indicate that it is extremely reliable. Failures depend on the conditions under which the system operates, and MTBF knows nothing about how the system was built. It considers the frequency of failures over a period of time, not how robustly the system was built to handle variations in its environment and protect itself from failures. It is a reactive method.

MTBF is the metric of luck as a function of time. A high MTBF can indicate that a system has had a good operating history under ideal environmental conditions, but it does not necessarily distinguish genuinely well-designed systems from those that were "lucky" enough to have a stable environment during the period of execution and evaluation.

REOF evaluates the robustness of the system itself: how it was built to handle the different kinds of problems that can occur in the production environment. It is a proactive method.

Relationship between the REOF method and Chaos Monkey/Chaos Engineering

The REOF method contrasts with tools such as Chaos Monkey in several fundamental respects. Both approaches aim to improve the resilience and reliability of systems, but they do so in complementary ways. Chaos engineering is the discipline of experimenting on a system to build confidence in its ability to withstand turbulent conditions in production, whereas this method ensures that best practices for withstanding chaos have been applied; in other words, it ensures preparedness for failure. The strengths of the reliability evaluation methodology relative to Chaos Monkey are:

Focus on Prevention and Continuous Improvement

Holistic Evaluation: The methodology provides a comprehensive view of system performance over time, making it possible to identify trends, areas for improvement, and the impact of changes, unlike Chaos Monkey, which tests resilience in a more immediate and isolated way.

Incentive to Innovate: Gamification (proposed in the "excellence challenge" topic) encourages teams to pursue continuous improvement and innovative solutions to raise their reliability indices, promoting a culture of operational excellence.

Strategic Planning: It provides a basis for strategic planning and resource allocation by identifying critical areas that need attention and investment, something that using Chaos Monkey in isolation does not directly provide.

Risk Management and Compliance

Reduction of Operational Risk: By focusing on continuous evaluation and improvement of reliability, this methodology helps mitigate long-term operational risks, whereas Chaos Monkey is more of a stress-testing tool that exposes vulnerabilities.

SLA Compliance: The methodology enables proactive monitoring and assurance that services meet or exceed the agreed SLAs, which is fundamental for customer satisfaction and regulatory compliance.

Improved User Experience

User Focus: Evaluating and improving reliability based on SLAs emphasizes the importance of the user experience, aiming to ensure uninterrupted operation and optimized service performance.

Anticipating Problems: It allows potential failures to be identified and corrected proactively before they affect end users, whereas Chaos Monkey simulates failures to test resilience, which may or may not relate directly to the user experience.

Complementarity with Resilience Testing Tools
Integrated Approach: Although focused on evaluation and improvement, this methodology can be complemented by tools such as Chaos Monkey for a more robust approach to resilience. Together, they offer a defense-in-depth strategy against failures and outages.

In summary, the reliability evaluation methodology brings a preventive and strategic approach to managing system reliability, with an emphasis on continuous improvement, innovation, and customer satisfaction. While Chaos Monkey is a valuable tool for testing resilience in a specific and isolated way, combining the two approaches offers a powerful path to operational excellence and system resilience.

Conclusion:

REOF is a powerful framework for building and managing resilient systems. Its proactive approach, focus on prevention, and flexibility make it a valuable tool for any organization seeking operational excellence and customer satisfaction.

Follow the Medium link for more details about this framework: Medium REOF
