DEV Community: Soumia

Tokenmining: How the Token Became the Unit of Production of the AI Economy (2026 → 2030)

Soumia — Tue, 07 Jul 2026 20:41:50 +0000

Data centers are becoming factories whose product is tokens. A deep dive into token economics, the $5.2T buildout, the enterprise cost paradox, and what changes in IT by 2030 — with real numbers.

The thesis in one paragraph

At GTC 2026, Nvidia's Jensen Huang said the word "token" more than 70 times in a single keynote and gave operators a formula:
Revenue = Tokens per Watt × Available Gigawatts. The claim underneath the theater is structural: the atomic unit of machine reasoning — the token — is becoming a manufactured, graded, priced commodity. The consequence is that between 2026 and 2030, IT stops being organized around applications and storage and reorganizes around three questions:

Token production — who manufactures intelligence, and at what yield per watt?
Token consumption — how do agents burn tokens, and who governs the bill?
Token governance — FinOps, compliance, and sovereignty over where inference runs.

Every layer — silicon, power, cloud, SaaS, enterprise IT departments, national policy — is being redrawn around those three questions.

Why "mining" is the right metaphor

Like a mine, a token factory is capacity-constrained by physics: a 1-gigawatt facility is 1 gigawatt, full stop. Yield per unit of energy is the whole game.

Like a commodity, tokens are being graded and tiered. Huang sketched a public price ladder: roughly $1 per million tokens at the low end, $3–6 mid-tier, ~$45 for engineering-grade, with $1,000 per million tokens for premium reasoning positioned as a question of when, not if.

And like early oil, the resource is triggering an infrastructure land-grab, national-security posturing — and a legitimate debate about whether the capex is running ahead of the demand.

1 · The supply side: token production is the new heavy industry

The volume curve

Google is the most public benchmark, because it discloses the number at every I/O:

Date	Tokens / month	What it signals
Apr 2024	~9.7 trillion	Chatbot era — AI as a feature
May 2025	~480 trillion (50×)	AI Overviews + APIs go mainstream
Oct 2025	~1.3 quadrillion	Agentic workloads begin compounding
May 2026	3.2 quadrillion (7× YoY)	19B tokens/minute via API; 375 Google Cloud customers each consuming >1T tokens/year

These are vendor-reported and unaudited — but the shape of the curve is corroborated elsewhere. Microsoft reported 100T+ tokens in a single quarter of 2025 (5× YoY) and, by its FY26 Q3 call, 300+ Foundry customers on track for a trillion tokens each, accelerating 30% quarter-over-quarter. OpenRouter's annualized routing volume crossed one quadrillion tokens in March 2026. The growth curve is not flattening. This is steep adoption, not saturation.
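The multiples in the table can be checked directly from the reported figures. A minimal sketch; the 30-day month used for the per-minute conversion is my assumption, not something the disclosures state.

```python
# Monthly token volumes as reported in the table above (vendor-reported, unaudited).
TRILLION = 10**12
volumes = {
    "Apr 2024": 9.7 * TRILLION,
    "May 2025": 480 * TRILLION,
    "Oct 2025": 1300 * TRILLION,
    "May 2026": 3200 * TRILLION,
}

def growth_multiple(earlier: float, later: float) -> float:
    """How many times larger the later volume is than the earlier one."""
    return later / earlier

apr24_to_may25 = growth_multiple(volumes["Apr 2024"], volumes["May 2025"])
may25_to_may26 = growth_multiple(volumes["May 2025"], volumes["May 2026"])

# ASSUMPTION: a 30-day month, used only to convert monthly volume to a per-minute rate.
MINUTES_PER_MONTH = 30 * 24 * 60
tokens_per_minute = volumes["May 2026"] / MINUTES_PER_MONTH

print(f"Apr 2024 -> May 2025: {apr24_to_may25:.1f}x")
print(f"May 2025 -> May 2026: {may25_to_may26:.1f}x")
print(f"May 2026 total: {tokens_per_minute / 1e9:.0f}B tokens/minute")
```

If the 19B tokens/minute API figure and the monthly total describe the same period, the API would account for roughly a quarter of the volume. Treat that as directional, like the inputs.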

The capex behind it

McKinsey's data-center demand model gives the buildout three scenarios for 2025–2030:

Scenario	New AI capacity	AI capex to 2030	Note
Constrained	+78 GW	$3.7T	Efficiency gains + adoption stalls
Base case	+125 GW → 156 GW total	$5.2T	≈ the electricity of 125 nuclear reactors
Accelerated	+205 GW	$7.9T	Agentic demand outruns efficiency

Add ~$1.5T for traditional IT workloads and the total approaches $7 trillion by 2030 — roughly 1% of global GDP annually. Of the AI share, ~60% ($3.1T) flows to chips and computing hardware, ~25% ($1.3T) to power, cooling and electrical, ~15% ($0.8T) to land and construction. Global capacity demand nearly triples, from 82 GW (2025) to 219 GW (2030), with AI workloads at ~70% of it.

The production function every operator now optimizes

Power is the hard constraint. Facilities are capped in gigawatts, so every gain must come from yield. Blackwell raised throughput ~35× in the monetization-heavy tiers; the Vera Rubin generation targets another order of magnitude. Moving up one hardware generation can yield ~5× revenue at the same power envelope.
Tiering is the pricing model. Free tiers acquire users; mid tiers balance scale and speed; premium tiers (large context, extreme throughput, low latency) carry the margin. A gigawatt is allocated across tiers the way a refinery allocates crude across product grades.
The unit economics have flipped. On Nvidia's fiscal Q1 2027 earnings call (May 2026), Huang declared that tokens had become profitable for model makers. SemiAnalysis estimates frontier-lab inference gross margins rose from below 40% to over 70% between late 2025 and spring 2026. Inference crossed from cost line to revenue engine.
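The production function can be sketched as a sum over tiers: power allocated × tokens per joule × price per token. In the sketch below the tier prices come from the price ladder quoted earlier; the yields (`tokens_per_joule`), the power shares, and the implied 100% utilization are hypothetical values chosen only to make the arithmetic visible.

```python
from dataclasses import dataclass

SECONDS_PER_YEAR = 365 * 24 * 3600

@dataclass(frozen=True)
class Tier:
    name: str
    price_per_million: float  # USD per million tokens (from the price ladder quoted earlier)
    tokens_per_joule: float   # HYPOTHETICAL yield; real values depend on hardware and model
    power_share: float        # HYPOTHETICAL fraction of facility power given to this tier

def annual_revenue(tiers: list[Tier], facility_watts: float) -> float:
    """Sum over tiers of (watts allocated) x (tokens per joule) x (price per token) x seconds per year."""
    if abs(sum(t.power_share for t in tiers) - 1.0) > 1e-9:
        raise ValueError("power shares must sum to 1")
    total = 0.0
    for t in tiers:
        tokens_per_second = t.power_share * facility_watts * t.tokens_per_joule
        total += tokens_per_second * SECONDS_PER_YEAR * t.price_per_million / 1e6
    return total

GIGAWATT = 1e9
tiers = [
    Tier("low", 1.0, 2.0, 0.50),
    Tier("mid", 4.5, 1.0, 0.35),
    Tier("engineering", 45.0, 0.1, 0.15),
]
base = annual_revenue(tiers, GIGAWATT)

# Same power envelope, every tier's yield multiplied by 5: revenue scales by the same factor.
upgraded = [Tier(t.name, t.price_per_million, t.tokens_per_joule * 5, t.power_share) for t in tiers]
print(f"revenue ratio after 5x yield: {annual_revenue(upgraded, GIGAWATT) / base:.1f}")
```

The point of the sketch is the shape, not the dollar figures: with watts fixed, revenue moves only through yield, tier mix, and price.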

2 · The demand side: the token paradox

This is the single most important dynamic for enterprise IT budgets to 2030, and it is a textbook Jevons paradox: efficiency gains don't reduce total consumption — they detonate it.

Falling ↓	Rising ↑
Blended enterprise cost per million tokens: $18.40 → $6.07 (−67%) between Q1 2025 and Q1 2026, across an analysis of 2.4B enterprise API calls	Average enterprise AI budget: $1.2M (2024) → $7M (2026); 73% of enterprises exceeded their AI cost projections (FinOps Foundation 2026)
Per-token prices for equivalent capability falling 9×–900×/yr depending on benchmark (Epoch AI); Gartner forecasts a further ~90% reduction by 2030	Inference now ≈ 80–85% of enterprise AI spend; some Fortune 500 companies report monthly inference bills in the tens of millions
Open-source inference costs declining 30–50% annually since 2023	Agentic workflows consume 5–30× more tokens per task than a chatbot query (Gartner, Mar 2026)

Three structural drivers of the volume explosion

Agentic multiplication. One user task → 10–20 LLM calls (reason, plan, tool-call, verify, self-correct). Pilot economics computed on single API calls bear no relationship to the production economics of loops running thousands of times a day.
Retrieval overhead. RAG pipelines inject context on every call, and KV-cache costs scale with context length. The retrieval tax is working exactly as designed — the budget model just never priced it.
Background inference. Monitoring agents, document watchers, and compliance surveillance run 24/7 against every event, whether or not a human asked. Minimal in 2024 deployments; the fastest-growing share of the 2026 bill.
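The paradox is plain arithmetic: cost per task is tokens per task times price per token. A minimal sketch using the prices and the 5–30× agentic multiplier quoted above; the 2,000-token chatbot baseline is a hypothetical number for illustration.

```python
def cost_per_task(tokens_per_task: float, price_per_million: float) -> float:
    """USD cost of one task at a blended per-million-token price."""
    return tokens_per_task * price_per_million / 1_000_000

# HYPOTHETICAL baseline: a single chatbot query consuming 2,000 tokens.
CHAT_TOKENS = 2_000
PRICE_Q1_2025 = 18.40  # USD per million tokens (figure quoted above)
PRICE_Q1_2026 = 6.07   # USD per million tokens (figure quoted above)

chat_2025 = cost_per_task(CHAT_TOKENS, PRICE_Q1_2025)
for multiplier in (5, 30):  # agentic tasks use 5-30x the tokens of a chatbot query
    agent_2026 = cost_per_task(CHAT_TOKENS * multiplier, PRICE_Q1_2026)
    print(f"{multiplier}x tokens: cost per task changes by {agent_2026 / chat_2025:.2f}x")

# Token multiple at which the price drop is fully cancelled out.
break_even_multiplier = PRICE_Q1_2025 / PRICE_Q1_2026
print(f"break-even token multiple: {break_even_multiplier:.2f}x")
```

On these numbers, any workload whose token use grows more than about 3× costs more per task despite the 67% price cut, and the agentic range starts at 5×.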

What disciplined buyers already achieve

Tiered model routing (small model by default, frontier on escalation): median blended cost of $2.31/M tokens vs. $18.40/M for route-everything-to-frontier. An 8× spread on identical work — pure governance.
FinOps has annexed AI: 31% of FinOps practitioners managed AI spend in 2025; 98% in 2026. Token governance is now the discipline's top forward-looking priority.
Caveat for 2027+ planning: several analysts argue current frontier API prices are venture-subsidized below cost and will normalize upward. Model-agnostic architecture is the hedge.
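To see how tiered routing produces that spread, here is a rough sketch of the blended-cost arithmetic. The frontier price is the figure quoted above; the small-model price is hypothetical, and the model assumes escalated and non-escalated requests use similar token counts, which real traffic may not.

```python
def blended_cost(small_price: float, frontier_price: float, escalation_rate: float) -> float:
    """Blended USD per million tokens when a fraction of traffic escalates to the frontier model."""
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    return (1.0 - escalation_rate) * small_price + escalation_rate * frontier_price

def escalation_rate_for(target: float, small_price: float, frontier_price: float) -> float:
    """Escalation rate that yields a given blended cost (inverse of blended_cost)."""
    return (target - small_price) / (frontier_price - small_price)

FRONTIER = 18.40  # USD/M tokens, the route-everything-to-frontier figure quoted above
SMALL = 0.50      # HYPOTHETICAL small-model price, for illustration only

rate = escalation_rate_for(2.31, SMALL, FRONTIER)
print(f"escalation rate implied by $2.31/M: {rate:.1%}")
```

Under these assumptions, a $2.31/M blend corresponds to sending roughly one request in ten to the frontier model. The escalation rate is the number a routing gateway should report.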

The business-model cascade

Huang's GTC 2026 framing was an "Enterprise IT Renaissance" from SaaS to Agent-as-a-Service. The logic chain: if intelligence is metered in tokens, software stops being rented per seat and starts being consumed per unit of work. Nvidia is even piloting token allowances as compensation — Huang floated giving engineers roughly half their base pay as a token budget (a $250K/yr allowance on a $500K salary). Discount the theater; keep the signal: token budgets are entering corporate financial statements as a managed resource, next to headcount and cloud spend.

The counter-view matters equally. For JPMorgan, Walmart, or GM, tokens are a raw material, not a product — their CIOs want cheaper inference and a clear ROI date, not a token-revenue story. Both views are correct; they describe opposite ends of the same value chain.

3 · The IT industry, layer by layer (2026 → 2030)

Layer	2026 state	2030 trajectory
Energy	The binding constraint. US data-center demand adding ~460 TWh 2023–2030; grid interconnection queues are the new chip shortage	Power procurement becomes a core IT competency; tokens-per-watt reported the way PUE once was
Silicon	Annual architecture cadence; inference-specialized parts (SRAM-heavy LPUs claiming ~35× throughput/MW on decode); prefill/decode disaggregation	Heterogeneous fleets tuned per inference phase; ~$3.1T of capex lands here; cost per token keeps falling ~an order of magnitude per year
Data center	From compute hub to AI factory; gigawatt campuses; 97% occupancy; 77% of the construction pipeline pre-leased	Global capacity ~triples to 219 GW; the industry builds 2× everything built since 2000, in 5 years; revenue per MW is the operator KPI
Cloud	Token-throughput pricing appears next to VM pricing; neoclouds and GPU-as-a-service proliferate	Cloud sold in three meters: storage (GB), compute (vCPU), intelligence (tokens)
Software / SaaS	Per-seat pricing eroding; agent step-billing emerges; coding agents approach $1B run-rates, partly driven by people who cannot code	SaaS → AaaS: outcome- and consumption-priced agents; software TAM expands from tool rental to digital-labor delivery
Enterprise IT	80–85% of AI budget is inference; 73% blew their projections; FinOps scrambling	A token P&L per business unit; model-routing gateways as standard infrastructure; "AI cost engineer" becomes a named role
Nation-states	"Compute = GDP" doctrine; EU AI gigafactories (~€20B via EuroHPC, ~100K-processor facilities); France treats AI sovereignty as presidential-level policy	Token production capacity tracked like energy reserves; sovereignty defined by jurisdiction over execution, not just data residency

The European specificity

If you build or buy AI in the EU, four things are different — and they matter more every quarter as the AI Act's high-risk rules bite (August 2, 2026):

Sovereignty moved from panel topic to procurement criterion. The operative question of 2026 is where: where the chips are fabbed, where the power is drawn, whose laws bind the model, to whose economy the value accrues. The US Cloud Act makes "EU-West region on a US hyperscaler" a data-residency answer, not a jurisdiction answer.
Sovereign RAG is becoming the default pattern for regulated EU enterprises: EU-hosted embeddings and inference (Mistral-class models or self-hosted open weights), immutable audit trails for AI Act compliance, DLP before and after generation — no step transits a US-hosted service.
The honest gap: open weights are the natural sovereign stack, and open-source inference crossed the billion-dollar threshold — but the best open models still descend from US and Chinese labs, and EU-headquartered clouds remain thin at the frontier tier (roughly one Gold-tier and two Silver-tier EU providers in SemiAnalysis's ClusterMAX ranking, spring 2026). Sovereignty rhetoric currently exceeds what the supply chain permits. A country that owns its knowledge graph and rents its GPUs may be more sovereign than one owning a gigafactory running someone else's stack.
Grid constraints in London and Dublin are redirecting AI-server growth to the Nordics and Southern Europe (Forrester 2026) — creating cost-competitive sovereign options that didn't exist 18 months ago.

4 · A concrete 2026 → 2030 timeline

Year	What happens
2026	The quadrillion-token era begins (Google 3.2Q/mo; OpenRouter 1Q annualized). Inference flips profitable at the frontier. Token tiering formalizes. EU AI Act high-risk rules bite August 2. FinOps annexes AI spend
2027	Agent step-billing becomes standard in SaaS contracts; first wave of frontier API price normalization upward as VC subsidies recede; token budgets appear as explicit line items; EU gigafactory sites break ground
2028	Model-routing gateways are default enterprise infrastructure; LPU-class inference silicon goes mainstream in hyperscaler fleets; power procurement gates more IT roadmaps than chip supply; sovereign inference reaches price-parity for open-weight workloads
2029	Consumption/outcome pricing overtakes per-seat in new enterprise software deals; "AI cost engineer" is a hiring category; national token-production capacity is discussed in industrial-policy terms alongside energy
2030	If the base case holds: 219 GW global capacity (~70% AI), ~$7T cumulative capex, ~1% of global GDP flowing annually into the token supply chain. Per-token cost ~90% below 2026 — and total token spend far higher anyway

What could break the thesis (honest risks)

Capex ahead of demand. Occupancy is 97% today, but if application-layer ROI disappoints, the fiber-glut pattern of 2001 repeats and assets strand. McKinsey itself flags the over/under-investment dilemma.
Subsidized pricing. Frontier API prices are widely believed to be below cost; architectures locked to today's prices are built on a false floor.
Efficiency shock. Sparse attention, cache-reuse techniques (15–25% compute reduction on conversational workloads in Q1 2026 research), and aggressive quantization could bend demand below the base case — good for buyers, bad for the buildout.
Grid reality. Interconnection timelines and reserve-margin warnings (NERC) can delay capacity regardless of the capital available.

Operator's checklist: positioning for the token economy

Govern

[ ] Stand up token FinOps now: per-workload cost attribution, budgets per business unit, alerts on agentic loops. The 8× spread ($2.31 vs $18.40/M) is pure governance.
[ ] Write the cost model before the deployment decision. The inverse sequence is where every overrun starts.
[ ] Log every inference for AI Act auditability: model version, timestamps, input hash. Compliance-as-code, not documentation.
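The logging item can be sketched with the standard library alone. The three fields named in the checklist (model version, timestamp, input hash) are kept; the `workload` field and the JSON-lines file are my assumptions, and nothing here claims to be the complete set of fields the AI Act requires.

```python
import hashlib
import json
from datetime import datetime, timezone

def audit_record(model_version: str, prompt: str, workload: str) -> dict:
    """Build one audit entry; the input is hashed rather than stored verbatim."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "workload": workload,  # HYPOTHETICAL field, for per-workload cost attribution
        "input_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
    }

def append_record(path: str, record: dict) -> None:
    """Append the record as one JSON line; an append-only file stands in for an immutable store."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")

# HYPOTHETICAL model name and workload label.
record = audit_record("example-model-2026-05", "What is our refund policy?", "support-bot")
```

Hashing the input keeps the log useful for proving what was asked without turning the log itself into a store of sensitive prompts.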

Architect

[ ] Default to tiered model routing; reserve frontier models for escalation paths.
[ ] Design model-agnostic (gateway abstraction) — capture the 30–50%/yr open-weight price decline and hedge the upward normalization of frontier APIs.
[ ] For EU-regulated data: sovereign RAG pattern — EU-jurisdiction inference end-to-end, not just EU regions of US clouds.

Negotiate & plan

[ ] In SaaS renewals, demand transparency on agent step-billing and token pass-through pricing.
[ ] Treat power and capacity commitments as strategic sourcing (multi-year, multi-region), not tactical cloud purchasing.
[ ] Budget for volume growth, not unit price: assume per-token cost −90% by 2030 and total token spend up anyway.

One-line takeaway

Between 2026 and 2030, IT reorganizes around a single commodity it now manufactures — the token — and the winners on both sides of the market will be the ones who treat tokens-per-watt (producers) and cost-per-outcome (consumers) as first-class engineering disciplines.

Sources: Nvidia GTC 2026 keynote and fiscal Q1 2027 earnings call; Google I/O 2026 keynote (Sundar Pichai); Microsoft FY25–FY26 earnings calls; OpenRouter disclosures (Mar 2026); McKinsey, "The cost of compute" (2025) and subsequent data-center research; FinOps Foundation State of FinOps 2026; Gartner (Mar 2026); Epoch AI benchmarks; SemiAnalysis 2026 research; SiliconANGLE "The token economy: the state of AI mid-2026" (Jul 2026); Forrester 2026 forecast. Vendor-reported token volumes are self-declared and unaudited — treat as directional.

By Soumia, a developer advocate focused on making complex infrastructure legible through writing, speaking, and helping technical and non-technical audiences find common ground. I work at the intersection of cloud-native systems, AI, and editorial craft.

Reducing LLM Hallucinations in 2026: LoRA, F-DPO, and the Math That Actually Works

Soumia — Sun, 17 May 2026 09:14:42 +0000

It is May 2026, and the field has stopped pretending hallucinations are going to disappear.

What has happened instead is more interesting. Researchers have spent the last eighteen months building an entire toolkit — fine-tuning methods, low-rank adaptation techniques, preference optimization frameworks, image-grounded decoders, multi-adapter compositions — designed not to eliminate hallucinations but to bound them. To calibrate models so that when they are uncertain, they say so. To constrain them so that when they answer, the answer is grounded in something verifiable.

This is a different mindset from "fix the model." It is closer to how engineers approach any probabilistic system: you cannot eliminate error. You measure it, you bound it, you make it visible. The question stops being "is the model truthful?" and becomes "is this model's error rate acceptable for this use case, given these guardrails, against this ground truth?"

This article goes through what is actually working — for text, for images, across foundation models, language models, and specialized models. The math, the methods, the benchmarks. What companies have tried in the past, what they are doing now, and what the May 2026 state of the art actually looks like.

How We Got Here: A Short History

The first generation of attempts to reduce hallucinations was essentially "tell the model not to hallucinate." Prompt engineering. System messages. Chain-of-thought reasoning. Companies wrote elaborate instructions: "If you do not know the answer, say so." The model would say so — sometimes — and then continue to hallucinate confidently in the next sentence.

The second generation was Retrieval-Augmented Generation, introduced in production around 2023. Connect the model to a knowledge base. Retrieve relevant documents. Ground the response in retrieved context. This worked, and continues to work — but the 2025 Stanford HAI study showed even specialized legal AI tools built on RAG hallucinated more than 17% of the time. RAG reduces hallucination. It does not eliminate it. The retrieval can fail. The retrieved documents can be irrelevant. The model can ignore them.

The third generation, which is where we are now, accepts that hallucinations are structural and attacks them at multiple levels simultaneously: at training time through fine-tuning, at the parameter level through low-rank adaptation, at the preference level through DPO and its variants, at the decoding level through grounded inference, and at the architectural level through multi-adapter composition. These techniques are not alternatives. They compose.

Let me walk through the mathematics.

The Mathematics of the Problem

A language model parameterized by weights θ generates tokens by sampling from a probability distribution:

P(y | x; θ) = ∏_{t=1}^{T} P(y_t | y_<t, x; θ)

Where x is the input prompt, y = (y_1, …, y_T) is the output sequence of length T, and each token y_t is sampled conditioned on the input and the previously generated tokens y_<t. The model selects each next token by computing logits over the vocabulary, applying softmax to obtain probabilities, and then sampling (or taking the argmax for greedy decoding).
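The factorization is easy to make concrete. A minimal sketch in plain Python with a toy three-token vocabulary; the logits are hypothetical, and a real model would produce each step's logits conditioned on the tokens already generated.

```python
import math

def softmax(logits: list[float]) -> list[float]:
    """Convert logits to probabilities, subtracting the max for numerical stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sequence_log_prob(step_logits: list[list[float]], token_ids: list[int]) -> float:
    """log P(y | x) as the sum over steps of log P(y_t | y_<t, x), given each step's logits."""
    return sum(math.log(softmax(logits)[tok]) for logits, tok in zip(step_logits, token_ids))

def greedy(logits: list[float]) -> int:
    """Greedy decoding: pick the argmax token."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Toy example: a 3-token vocabulary and a 2-step output (HYPOTHETICAL logits).
step_logits = [[2.0, 0.5, 0.1], [0.0, 0.0, 3.0]]
tokens = [greedy(step) for step in step_logits]
log_p = sequence_log_prob(step_logits, tokens)
```

Nothing in this computation refers to truth. The sequence probability is high whenever every step's token is likely, which is exactly the gap the rest of the article is about.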

The hallucination problem in this framework is precise: the model has been trained to maximize the likelihood of plausible-sounding text given its training distribution. When the input x is in the distribution it learned from, this works well. When x is out of distribution — or when the answer requires factual recall that the training did not provide — the model still produces high-probability tokens, but those tokens trace a path through the vocabulary that may have no relationship to truth.

The MIT 2025 finding sharpens this: models use more confident language when hallucinating than when stating facts. This is not a bug. It is a property of how probability flows. When the model has high entropy over plausible continuations, it tends to commit to whichever happens to win the sampling. And the language patterns of confident assertion ("definitely," "certainly," "without a doubt") appear in the training data wherever a claim was made confidently, regardless of whether that claim was correct.

To reduce hallucination, you need to do one of three things mathematically:

Change the distribution the model is sampling from (fine-tuning).
Add an auxiliary signal that down-weights non-factual continuations (preference optimization, grounded decoding).
Detect when the model's distribution is unreliable and abstain (calibration, refusal training).

The current state-of-the-art combines all three.
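The third option, abstention, is the simplest to sketch. The example below refuses when the entropy of the answer distribution exceeds a threshold. Both the threshold and the toy distributions are hypothetical, and entropy is only a crude proxy: a confidently wrong model has low entropy, which is why this is one layer among several rather than a solution by itself.

```python
import math

def entropy(probs: list[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def answer_or_abstain(probs: list[float], labels: list[str], max_entropy: float) -> str:
    """Return the most likely label, or an abstention when the distribution is too flat."""
    if entropy(probs) > max_entropy:
        return "I don't know"
    return labels[max(range(len(probs)), key=lambda i: probs[i])]

labels = ["Paris", "Lyon", "Nice"]
THRESHOLD = 0.8  # HYPOTHETICAL cut-off; in practice tuned on held-out data

confident = answer_or_abstain([0.90, 0.05, 0.05], labels, THRESHOLD)
uncertain = answer_or_abstain([0.40, 0.35, 0.25], labels, THRESHOLD)
```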

Method 1: LoRA — The Surgical Tool

Low-Rank Adaptation, introduced by Hu et al. in 2021, is now the workhorse of fine-tuning at scale.

The mathematics is elegant. Instead of fine-tuning all parameters of a weight matrix W ∈ ℝ^(d×k), LoRA freezes W and learns two small matrices A ∈ ℝ^(r×k) and B ∈ ℝ^(d×r), where r is much smaller than d or k:

W_new = W + ΔW = W + BA

The update ΔW is constrained to be rank r, dramatically reducing the number of trainable parameters. For Llama-3.1-70B, full fine-tuning requires approximately 1,120 GB of GPU memory for model states alone. LoRA with rank 16 introduces only 0.29% additional parameters, reducing GPU memory usage to 142 GB while preserving model quality.
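The parameter saving follows from the shapes: a full update has d·k entries, while B and A together have r·(d + k). A small sketch in plain Python; the 8192 × 8192 layer shape is a hypothetical example, and the model-level percentage quoted above also depends on which matrices receive adapters, which this per-matrix count does not capture.

```python
def lora_param_counts(d: int, k: int, r: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one d x k weight matrix."""
    return d * k, r * (d + k)

def matmul(X: list[list[float]], Y: list[list[float]]) -> list[list[float]]:
    """Plain-Python matrix product, enough for a toy example."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_merge(W, B, A):
    """W_new = W + B A, with W of shape d x k, B of shape d x r, A of shape r x k."""
    delta = matmul(B, A)
    return [[w + dw for w, dw in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]

# HYPOTHETICAL layer shape for illustration: one 8192 x 8192 projection at rank 16.
full, lora = lora_param_counts(8192, 8192, 16)
print(f"trainable fraction for this matrix: {lora / full:.2%}")

# Toy merge with d=2, k=3, r=1: the update B A has rank at most 1.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
B = [[1.0], [2.0]]
A = [[0.5, 0.0, -0.5]]
W_new = lora_merge(W, B, A)
```

Because the merge is a plain addition, an adapter can be folded into W for serving or kept separate and swapped per request.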

Why this matters for hallucination reduction: LoRA lets you fine-tune cheaply on factuality-focused data without rebuilding the entire model. You can train one base model, then attach many small LoRA adapters — each calibrated to a different domain, each grounded in a different curated dataset.

┌─────────────────────────────────────────────────────────┐
│        Base Model (frozen, 70B params)                  │
│                                                         │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐ │
│  │ Medical  │  │  Legal   │  │ Coding   │  │ Customer │ │
│  │  LoRA    │  │  LoRA    │  │  LoRA    │  │   LoRA   │ │
│  │ (140M)   │  │ (140M)   │  │ (140M)   │  │  (140M)  │ │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘ │
└─────────────────────────────────────────────────────────┘
        ↑              ↑              ↑              ↑
    Grounded in    Grounded in    Grounded in    Grounded in
    medical docs   case law       codebase       help center

PREREQ-Tune, published at ICLR 2025, took this further. It uses a dual-LoRA architecture: one LoRA absorbs synthetic factual knowledge during a pre-training adaptation phase, and is then frozen. A second "skill" LoRA is trained on top to learn the actual task. The knowledge LoRA can be removed or swapped, leaving the skill LoRA generalizable. This disentangles what the model knows from what the model does, which is precisely the architectural separation hallucination research has been trying to achieve.

PREREQ-Tune significantly outperforms existing state-of-the-art hallucination reduction algorithms in improving LLM factuality across both short QA and long-form generation tasks. The framework enables a modular design with plug-and-play knowledge modules that control knowledge access and a skill module that works generically with any knowledge sources.

LoRA ensembles take a different angle. Train multiple LoRA adapters on the same task with different initializations or hyperparameters, then average their predictions. This produces measurably better-calibrated outputs — the ensemble's confidence more closely matches its actual accuracy. The few-shot baseline is well-calibrated but often wrong; a single fine-tuned LoRA is more accurate but overconfident in its wrong predictions; the LoRA ensemble provides improvements in both accuracy and calibration in terms of Expected Calibration Error.
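Both halves of that claim, averaging and calibration, fit in a few lines. This is a sketch of the standard equal-width-bin ECE estimate, not the evaluation code of any particular paper; the adapter outputs are hypothetical.

```python
def ensemble_average(member_probs: list[list[float]]) -> list[float]:
    """Average class probabilities across ensemble members for one example."""
    n = len(member_probs)
    return [sum(col) / n for col in zip(*member_probs)]

def expected_calibration_error(confidences: list[float], correct: list[bool], n_bins: int = 10) -> float:
    """ECE: bin predictions by confidence, then take the size-weighted mean of |accuracy - confidence|."""
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 falls in the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += len(b) / total * abs(accuracy - avg_conf)
    return ece

# HYPOTHETICAL outputs of three adapters for one two-class example.
avg = ensemble_average([[0.9, 0.1], [0.6, 0.4], [0.75, 0.25]])

# Overconfident toy model: always 0.95 confident, right only half the time.
overconfident_ece = expected_calibration_error([0.95] * 4, [True, False, True, False])
```

The toy model's ECE of 0.45 is the gap between its stated confidence (0.95) and its accuracy (0.50). Averaging members that disagree pulls confidence down on exactly the examples where it should be lower.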

Method	Trainable parameters (% of base model)	Calibration	Hallucination Rate
Full fine-tuning	100%	Poor	Variable
Single LoRA (r=16)	~0.3%	Overconfident	Reduced
LoRA Ensemble (M=5)	~1.5%	Well-calibrated	Significantly reduced
PREREQ-Tune (dual LoRA)	~0.6%	Strong	State-of-the-art

Method 2: DPO and F-DPO — Preference at the Source

Direct Preference Optimization, introduced by Rafailov et al. in 2023, has largely replaced reinforcement learning from human feedback (RLHF) as the standard alignment method for production models in 2026.

The math of DPO is a clever reformulation. RLHF requires training a reward model on preference data, then using reinforcement learning (typically PPO) to optimize the policy against that reward. This is unstable, expensive, and sensitive to hyperparameters. DPO observes that you can derive an analytical relationship between the optimal policy and the reward function, and then optimize the policy directly against preference pairs without ever training a separate reward model.

The DPO loss for a preference pair (x, y_w, y_l) — where y_w is preferred and y_l is dispreferred — is:

L_DPO = -log σ(β · log[π_θ(y_w|x) / π_ref(y_w|x)] - β · log[π_θ(y_l|x) / π_ref(y_l|x)])

Where π_θ is the model being trained, π_ref is the reference model (typically the SFT-tuned base), σ is the sigmoid function, and β is a temperature parameter controlling how far the model may drift from the reference: a larger β penalizes divergence more heavily. The loss pushes the model to increase the relative likelihood of preferred responses while staying close to the reference model.
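The loss for a single pair is short enough to write out. A minimal sketch in plain Python, taking sequence log-probabilities as inputs; the numbers are hypothetical, and a real implementation would compute them in batches from the two models.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))

def dpo_loss(logp_w: float, logp_l: float, ref_logp_w: float, ref_logp_l: float, beta: float = 0.1) -> float:
    """DPO loss for one preference pair, given sequence log-probabilities
    under the policy (logp_*) and the frozen reference model (ref_logp_*)."""
    ratio_w = logp_w - ref_logp_w  # log[pi_theta(y_w|x) / pi_ref(y_w|x)]
    ratio_l = logp_l - ref_logp_l  # log[pi_theta(y_l|x) / pi_ref(y_l|x)]
    return -log_sigmoid(beta * (ratio_w - ratio_l))

# HYPOTHETICAL log-probabilities. Policy identical to the reference: loss is log 2.
at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Policy has raised y_w and lowered y_l relative to the reference: loss falls below log 2.
improved = dpo_loss(-10.0, -17.0, -12.0, -15.0)
```

Note what the loss never sees: whether y_w is true. It only sees which response was labeled preferred.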

Why this matters for hallucination: standard DPO optimizes for whatever preferences humans express. If humans prefer fluent, confident-sounding responses over uncertain ones — which they do — DPO will train the model to be more fluent and more confident, whether or not its responses are factual. RLHF and DPO can therefore actively increase hallucination if the preference data rewards fluency over truth.

F-DPO, published in January 2026 and updated in April 2026, fixes this with a simple modification.

F-DPO uses binary factuality labels (factual vs. hallucinated). It applies a label-flipping transformation that corrects misordered preference pairs so the chosen response is never less factual than the rejected one. It adds a factuality-aware margin that emphasizes pairs with clear correctness differences, reducing to standard DPO when both responses share the same factuality.

The mathematical addition is a factuality margin term:

L_F-DPO = -log σ(β · [log ratio(y_w) - log ratio(y_l)] + α · m(y_w, y_l))

Where m(y_w, y_l) is the factuality margin — non-zero only when y_w and y_l differ in factuality — and α controls its strength.
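The two modifications can be sketched as follows. The label flip is implemented as described above. The loss simply transcribes the formula as written here; how the paper computes m(y_w, y_l) and its sign convention are not given in this text, so the margin is left as an input supplied by the caller. The `Pair` type and function names are my own.

```python
import math
from typing import NamedTuple

class Pair(NamedTuple):
    chosen: str
    rejected: str
    chosen_factual: bool    # binary factuality label of the chosen response
    rejected_factual: bool  # binary factuality label of the rejected response

def flip_if_misordered(pair: Pair) -> Pair:
    """Label flipping: swap the pair when the chosen response is hallucinated but the rejected one is factual."""
    if not pair.chosen_factual and pair.rejected_factual:
        return Pair(pair.rejected, pair.chosen, pair.rejected_factual, pair.chosen_factual)
    return pair

def f_dpo_loss(log_ratio_w: float, log_ratio_l: float, margin: float,
               beta: float = 0.1, alpha: float = 1.0) -> float:
    """Transcribes the formula above: -log sigmoid(beta * (ratio_w - ratio_l) + alpha * m).
    `margin` is m(y_w, y_l), supplied by the caller (ASSUMPTION: its definition is not reproduced here);
    it should be 0 when both responses share a factuality label."""
    z = beta * (log_ratio_w - log_ratio_l) + alpha * margin
    # -log sigmoid(z) = log(1 + exp(-z)), computed stably for either sign of z
    return math.log1p(math.exp(-z)) if z >= 0 else -z + math.log1p(math.exp(z))

fixed = flip_if_misordered(Pair("fluent but wrong", "plain but right", False, True))
same_label_loss = f_dpo_loss(2.0, -2.0, margin=0.0)  # reduces to the standard DPO loss
```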

The empirical results on Qwen3-8B are striking: F-DPO reduces hallucination rates by 5x — from 0.424 to 0.084 — while improving or preserving helpfulness across all seven evaluated models from 1B to 14B parameters. The method requires no auxiliary reward model, no token-level annotations, and no multi-stage training.

Hallucination Rate Reduction (Qwen3-8B)
═══════════════════════════════════════════════
Base model     ████████████████████████  0.424
Standard DPO   ███████████████████░░░░░  0.378
F-DPO          ████░░░░░░░░░░░░░░░░░░░░  0.084 (5x reduction)
                0         0.2        0.4

Method 3: Vision — When the Ground Truth Is the Image

Vision-Language Models face a specific version of the hallucination problem: object hallucination. The model describes an image but mentions objects that are not in it. This has been a persistent failure mode and is one of the most actively researched areas in 2025-2026.

The cleanest 2025 work is MARINE — Mitigating hallucinAtion via image-gRounded guIdaNcE — published as an ICML 2025 spotlight. The approach is training-free and API-free. MARINE incorporates a pre-trained object grounding vision encoder to extract object-level information from the image, then uses classifier-free guidance during text generation to bias the model toward grounded outputs.

Mathematically, classifier-free guidance modifies the logits during decoding:

logits_guided = logits_unconditional + γ · (logits_grounded - logits_unconditional)

Where γ is the guidance strength. The "grounded" logits are conditioned on the explicit object list extracted by the auxiliary vision encoder. The "unconditional" logits are the model's natural output. Higher γ pulls the model harder toward what is actually in the image.
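The guidance step itself is one line per logit. A minimal sketch of the formula above; the vocabulary and logit values are hypothetical, and MARINE's actual pipeline for producing the grounded logits is not reproduced here.

```python
def guided_logits(unconditional: list[float], grounded: list[float], gamma: float) -> list[float]:
    """Classifier-free guidance: move each logit from the unconditional value
    toward (gamma <= 1) or past (gamma > 1) the grounded value."""
    if len(unconditional) != len(grounded):
        raise ValueError("logit vectors must cover the same vocabulary")
    return [u + gamma * (g - u) for u, g in zip(unconditional, grounded)]

# HYPOTHETICAL 3-token vocabulary: ["dog", "frisbee", "cat"]; the detector found no frisbee.
unconditional = [2.0, 1.8, 0.2]
grounded = [2.2, 0.3, 0.1]

no_guidance = guided_logits(unconditional, grounded, gamma=0.0)    # equals unconditional
full_guidance = guided_logits(unconditional, grounded, gamma=1.0)  # equals grounded
amplified = guided_logits(unconditional, grounded, gamma=1.5)      # extrapolates past grounded
```

With γ above 1 the "frisbee" logit is pushed below even its grounded value, which is the intended effect: tokens the detector does not support are actively suppressed.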

MARINE works on multiple LVLM architectures, requires no fine-tuning, requires no API access to large models, and demonstrates significant reduction in object hallucination on POPE, MME, and CHAIR benchmarks. The auxiliary grounding model — typically DETIC or Grounding DINO — provides the ground truth that the LVLM is held against.

CHAIR-DPO takes a complementary approach for fine-tuning. The CHAIR (Caption Hallucination Assessment with Image Relevance) metric measures the fraction of mentioned objects that are not present in the image. CHAIR-DPO uses this metric to construct preference pairs: given two image captions, the one with the lower CHAIR_i score (fewer hallucinated objects) becomes the preferred response. The model is then fine-tuned with DPO on these preference pairs, becoming object-aware in the process.
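The pair construction can be sketched directly from that description. This assumes object mentions have already been extracted from each caption and that a ground-truth object set exists for the image; the data and the tie-handling rule are hypothetical.

```python
def chair_i(mentioned: list[str], present: set[str]) -> float:
    """CHAIR_i: fraction of mentioned objects that are not in the image."""
    if not mentioned:
        return 0.0
    hallucinated = [obj for obj in mentioned if obj not in present]
    return len(hallucinated) / len(mentioned)

def preference_pair(caption_a: dict, caption_b: dict, present: set[str]):
    """Return (chosen, rejected) by lower CHAIR_i, or None when the scores tie."""
    score_a = chair_i(caption_a["objects"], present)
    score_b = chair_i(caption_b["objects"], present)
    if score_a == score_b:
        return None  # ASSUMPTION: a tie carries no preference signal, so the pair is skipped
    return (caption_a, caption_b) if score_a < score_b else (caption_b, caption_a)

# HYPOTHETICAL example; object extraction from caption text is assumed to happen upstream.
present = {"dog", "grass", "ball"}
a = {"text": "A dog chases a ball on the grass.", "objects": ["dog", "ball", "grass"]}
b = {"text": "A dog and a cat play with a frisbee.", "objects": ["dog", "cat", "frisbee"]}
pair = preference_pair(a, b, present)
```

The resulting (chosen, rejected) pairs feed the same DPO loss as any other preference data; only the source of the preference changes.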

The newer CoFi-Dec framework, published in January 2026 by researchers at the University of Minnesota and Lenovo, integrates multi-level visual processing: a coarse-to-fine attention pattern that mimics human visual processing, starting with scene-level understanding before focusing on details. This training-free decoding method significantly reduces both factual errors and semantic inconsistencies across challenging benchmarks.

Method	Approach	Training Required	Hallucination Reduction (POPE)
MARINE	Inference-time grounding	No	~6-9 percentage points
CHAIR-DPO	Fine-tuning with preferences	Yes (DPO)	~9.8 percentage points
CoFi-Dec	Multi-level decoding	No	Significant on multiple benchmarks
Uncertainty Re-attention	Calibrated decoding	No	9.8 points (Qwen2.5-VL-7B)

A critical 2024 paper deserves mention: "Does Object Grounding Really Reduce Hallucination?" The authors offer the first systematic analysis of fine-grained object grounding on LVLM hallucination under an evaluation protocol that more realistically captures open-ended generation. Their finding: many earlier "reductions" relied on evaluation protocols using MSCOCO data extensively present in LVLM training. Under stricter evaluation, grounding objectives have little to no effect on object hallucination in open caption generation.

This is the kind of self-correction that mature fields perform. The takeaway is not that grounding never works, but that a reported gain only counts when it survives honest evaluation: on data the model has not seen, in tasks that reflect actual deployment conditions.

Method 4: Multi-Adapter Composition

The most architecturally interesting development of the last year is the rise of multi-LoRA systems — frameworks that compose multiple specialized adapters at inference time.

LoraMap, published in 2024 and refined in 2025, creates dedicated reasoning LoRAs trained on fact-checking from different perspectives. Three LoRAs, each fine-tuned on a different reasoning dataset, then mapped to coordinate at inference. The paper shows LoraMap outperforms LoraHub (the previous standard for LoRA composition) with significantly fewer parameters than LoraConcat (which concatenates LoRAs and further fine-tunes them).

AutoRAG-LoRA, published in 2025, takes the integration further. It is a hallucination-aware RAG framework that combines:

Automated prompt rewriting
Hybrid retrieval (dense + sparse)
LoRA-based generation adapters
A dual-mode hallucination detection module (classifier-based plus self-reflective)
A KL-regularized contrastive feedback correction loop that enables targeted fine-tuning on hallucination outputs

The KL regularization is the mathematically interesting part: it prevents the corrective fine-tuning from drifting the model too far from its original distribution, avoiding overfitting to edge hallucination cases. The model improves on factual alignment over time without degrading on the rest.
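The role of the KL term can be shown with two toy distributions. This sketch covers only the regularizer, not the contrastive part of the loop; the direction KL(updated ‖ original), the weight λ, and the distributions are my assumptions rather than details taken from the paper.

```python
import math

def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) in nats for two distributions over the same vocabulary.
    Requires q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def regularized_loss(correction_loss: float, updated: list[float],
                     original: list[float], lam: float) -> float:
    """Corrective loss plus a KL penalty that grows as the updated model drifts from the original."""
    return correction_loss + lam * kl_divergence(updated, original)

original = [0.7, 0.2, 0.1]        # HYPOTHETICAL next-token distribution before correction
small_drift = [0.65, 0.25, 0.10]
large_drift = [0.10, 0.10, 0.80]

LAMBDA = 0.5  # HYPOTHETICAL penalty weight
loss_small = regularized_loss(1.0, small_drift, original, LAMBDA)
loss_large = regularized_loss(1.0, large_drift, original, LAMBDA)
```

With the same corrective loss, the update that rewrites the distribution wholesale pays a much larger penalty, which is how the regularizer keeps fixes local to the hallucinated cases.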

LoRAFusion, accepted at EuroSys 2026, focuses on the systems engineering of running multiple LoRAs efficiently — achieving up to 1.96× end-to-end speedup compared to Megatron-LM. This matters because the cost of running many small specialized adapters has historically been the bottleneck. When that cost drops, the practical viability of multi-adapter systems goes up.

The architectural picture that emerges:

                    ┌────────────────────────┐
                    │   User Query           │
                    └───────────┬────────────┘
                                │
                                ▼
                    ┌────────────────────────┐
                    │  Router / Classifier   │
                    │  (which domain?)       │
                    └───────────┬────────────┘
                                │
              ┌─────────────────┼─────────────────┐
              ▼                 ▼                 ▼
        ┌──────────┐      ┌──────────┐      ┌──────────┐
        │   RAG    │      │  LoRA-A  │      │  LoRA-B  │
        │ Retrieval│      │ (domain) │      │ (skill)  │
        └─────┬────┘      └─────┬────┘      └─────┬────┘
              │                 │                 │
              └─────────────────┼─────────────────┘
                                ▼
                    ┌────────────────────────┐
                    │  Base Model (frozen)   │
                    │  + Adapters Composed   │
                    └───────────┬────────────┘
                                │
                                ▼
                    ┌────────────────────────┐
                    │ Hallucination Detector │
                    │ (CLAP / MetaQA / etc.) │
                    └───────────┬────────────┘
                                │
                                ▼
                    ┌────────────────────────┐
                    │ Output (with refusal   │
                    │ if confidence too low) │
                    └────────────────────────┘

This is not the simple "one model, one fine-tune" architecture of 2023. This is a stack — specialized at each layer, calibrated at each transition.
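The control flow of that stack can be sketched in a few lines. This is a toy under stated assumptions, not any framework's API: `route`, `retrieve`, `generate`, `answer_query`, the keyword router, and the fixed confidence scores are all hypothetical stand-ins for a trained classifier, a real retriever, a base model with adapters, and a hallucination detector.

```python
from dataclasses import dataclass

@dataclass
class Answer:
    text: str
    confidence: float  # stand-in for a hallucination detector's score

def route(query: str) -> str:
    """Toy keyword router standing in for a trained domain classifier."""
    return "legal" if "contract" in query.lower() else "general"

def retrieve(query: str, corpus: dict, domain: str) -> list:
    """Return passages from the routed domain that share a word with the query."""
    words = set(query.lower().split())
    return [p for p in corpus.get(domain, []) if words & set(p.lower().split())]

def generate(query: str, passages: list, adapter: str) -> Answer:
    """Placeholder for the frozen base model plus the selected adapter."""
    if not passages:
        # No grounding available: the hypothetical detector scores this low.
        return Answer(text="(ungrounded guess)", confidence=0.2)
    return Answer(text=f"[{adapter}] {passages[0]}", confidence=0.9)

def answer_query(query: str, corpus: dict, threshold: float = 0.5) -> str:
    """Router -> retrieval -> adapter generation -> detection -> answer or refusal."""
    domain = route(query)
    passages = retrieve(query, corpus, domain)
    draft = generate(query, passages, adapter=f"lora-{domain}")
    if draft.confidence < threshold:
        return "I don't have enough grounding to answer that."
    return draft.text

corpus = {"legal": ["The contract notice period is 30 days."]}
print(answer_query("What is the contract notice period?", corpus))
print(answer_query("Who won the match?", corpus))
```

The point of the sketch is the shape: every stage can be swapped independently, and refusal is a normal return path, not an exception.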

What the Numbers Say

Here is the May 2026 picture, drawn from current benchmarks and published research:

Hallucination Rate by Method (TruthfulQA-style benchmarks)
═══════════════════════════════════════════════════════════
Base LLM (no intervention)        ████████████████████  60-80%
+ Prompt engineering              ███████████████░░░░░  45-65%
+ Standard RAG                    █████████░░░░░░░░░░░  17-35%
+ RAG + DPO alignment             ██████░░░░░░░░░░░░░░  10-20%
+ Full F-DPO + grounded RAG       ███░░░░░░░░░░░░░░░░░  5-12%
+ Multi-adapter + detection layer ██░░░░░░░░░░░░░░░░░░  3-8%
                                  0%        40%       80%

The trajectory is real. The numbers continue to drop. The asymptote — what the lowest achievable hallucination rate actually is — remains unknown, and may be non-zero by architectural necessity. But what was a 60-80% problem in 2023 is now, with proper engineering, a 3-8% problem in production deployments at the state of the art.

Three caveats are important.

First, these numbers are benchmark-dependent. A model that scores well on TruthfulQA can still fail on a specific niche domain it was never tuned for. Domain-specific evaluation matters more than general benchmarks.

Second, calibration matters as much as accuracy. A model that hallucinates 5% of the time and says "I'm not sure" appropriately is far more useful than a model that hallucinates 3% of the time and sounds equally confident in every response. The Expected Calibration Error metric is now widely reported alongside accuracy in serious evaluations.

Third, the cost-quality tradeoff is non-trivial. Full F-DPO plus grounded RAG plus multi-adapter inference is expensive. For many production use cases, the right answer is a careful single LoRA plus a good retrieval system — not the cutting edge of every available method stacked together.
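The calibration point in the second caveat is easy to check numerically. Below is a minimal sketch of Expected Calibration Error using equal-width confidence bins; the function name, the bin count, and the toy predictions are assumptions, and real evaluations would use far more samples.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per confidence bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 lands in the last bin
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece

# Both toy models get 3 of 4 answers right; only their stated confidence differs.
outcomes = [True, True, True, False]
overconfident = expected_calibration_error([0.95] * 4, outcomes)
calibrated = expected_calibration_error([0.75] * 4, outcomes)
print(f"overconfident ECE={overconfident:.2f}, calibrated ECE={calibrated:.2f}")
```

Identical accuracy, different ECE: the metric captures exactly the "sounds equally confident in every response" failure that accuracy alone hides.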

What Companies Are Doing Now

The pattern across industries in 2026:

Healthcare has moved aggressively toward CHAIR-DPO-style grounded fine-tuning for clinical report generation, and toward MARINE-style image-grounded inference for radiology applications. RRG-DPO (Radiology Report Generation with DPO) specifically addresses the false-positive / false-negative tradeoff that classical RLHF struggled with in medical settings.

Legal tech companies have largely abandoned the "general LLM for legal work" approach and moved to retrieval-grounded systems with specialized adapters. The 17% hallucination rate Stanford reported in 2025 was the wake-up call. Modern legal AI products explicitly cite source paragraphs for every claim, abstain when retrieval confidence is below threshold, and run multi-model verification on high-stakes outputs.

Enterprise knowledge bases have converged on RAG + structured retrieval + small specialized adapters for each domain. The base model is rented from a provider (Claude, GPT, Gemini). The differentiation is in the adapter layer and the retrieval system.

Marketing and creative tools use grounded generation: a digital twin of the product (as in INDG/Grip), brand-approved claim libraries, controlled diffusion with depth and segmentation guidance. The AI is constrained to operate within structures defined by the brand team.

Coding assistants have moved to test-grounded generation: the LoRA is fine-tuned on code from the specific codebase, the output is validated against the existing test suite, and the model is calibrated to refuse rather than guess when uncertain.

The common pattern: in every domain, the production answer is not a single technique. It is a stack — a base model plus retrieval plus fine-tuning plus preference alignment plus inference-time guardrails plus structural constraints. Each component reduces hallucination at a different layer. Together they produce systems that are bounded, calibrated, and accountable in ways that single-model deployments never were.

The Honest Bottom Line

The mathematics of probabilistic generation imply that hallucination cannot be reduced to zero by training methods alone. The model is, fundamentally, a function that produces plausible continuations of its input. The plausibility space and the truth space are not the same space and cannot be made the same space by any amount of fine-tuning.

What the methods of 2025-2026 have shown is that the gap between plausibility and truth can be narrowed substantially — through fine-tuning that disentangles knowledge from skill (PREREQ-Tune), through preference learning that rewards factuality over fluency (F-DPO), through inference-time grounding (MARINE, CoFi-Dec), through multi-adapter composition (LoraMap, AutoRAG-LoRA), and through systems engineering that makes all of the above run at acceptable cost (LoRAFusion).

The shift in the industry is from chasing the dream of a truthful model to building the architecture of a truthful system. The model is a component. The system includes retrieval, validation, calibration, and structural constraints. The system has a hallucination budget. The system reports its uncertainty. The system abstains when it cannot answer with sufficient grounding.

This is how every other probabilistic technology has matured. Cars do not have zero accident rates. Networks do not have zero packet loss. Financial systems do not have zero fraud. Each of these is bounded by engineering — measurement, calibration, monitoring, controls — that converts a wild probabilistic process into a predictable production capability.

AI is now passing through the same transition. May 2026 is the moment when the engineering caught up with the ambition. The methods are real. The math is solid. The measured hallucination rates are dropping. The systems are shipping.

The honest version of the next decade in AI is not "we solved hallucination." It is "we learned how to build systems that bound it well enough to deploy them responsibly in more and more domains." Which is, in the end, what engineering has always been.

KubeCon Amsterdam 2026: The Industrialization of ML - A Deep Dive into Uber’s AI Platform Architecture.

Soumia — Sun, 17 May 2026 08:23:47 +0000

This article serves as a technical follow-up to our KubeCon 2026 coverage, providing a comprehensive deep dive into the architecture and evolution of Uber’s machine learning platform.

When Uber presented at KubeCon Europe 2026, the numbers they shared silenced the room: 1 million+ diverse workloads deployed onto 200 Kubernetes clusters, 20,000 models trained monthly, 5,300 models actively in production, and over 30 million peak predictions per second.

For most organizations, achieving even 1% of that scale is a multi-year roadmap. Uber’s platform doesn't just support their business; it is their business. From surge pricing and ETA estimation to fraud detection and Generative AI-driven customer support, machine learning sits in the critical path of every user interaction.

But Uber didn't arrive at this architecture overnight. Their journey from scattered Python scripts to a globally federated, Kubernetes-native AI control plane is a masterclass in platform engineering.

Here is the deep dive into how Uber industrialized machine learning, the bottlenecks they hit along the way, and the architectural blueprints they’ve proven at hyperscale.

1. The Pre-Platform Era: The Fragmentation Tax (Pre-2017)

Before 2017, data science at Uber looked like data science at most fast-growing startups today: entirely fragmented.

The How: Data scientists worked on individual laptops or dedicated EC2 instances using a fragmented toolkit (R, scikit-learn, bespoke Python scripts).
The What: Each team built separate, one-off systems to pull data, train models, and serve predictions.
The Bottleneck: Models could only be as large as what fit on a single machine. Once a model was trained, "deploying" it often meant handing an opaque pickle file to a backend engineering team to rewrite in Java or Go.

This lack of standardization meant high operational friction. Teams couldn't easily share features, monitor model drift, or scale prediction serving. Uber realized that building custom infrastructure for every ML use case was economically and operationally unsustainable. They needed a centralized factory.

2. Michelangelo: Standardizing the ML Factory (2017–2022)

To solve the fragmentation tax, Uber built Michelangelo, an end-to-end internal machine learning platform designed to democratize ML across the company. The goal was to standardize the entire lifecycle—from data prep to model deployment.

Michelangelo introduced several architectural patterns that have since become industry standard:

The Centralized Feature Store: Instead of every team writing their own Spark jobs to calculate "user's trip frequency in the last 30 days," features were calculated once, stored, and shared.
Offline vs. Online Split: Michelangelo cleanly separated batch feature computation (using Apache Spark and Hive for historical data) from real-time feature computation (using Apache Kafka and Flink for streaming data like GPS coordinates).
Deployment Standardization: Models were deployed in three specific modes: Offline (Spark batch jobs for overnight predictions), Online (load-balanced API endpoints responding in <10ms), and Library (embedded directly into microservices for the absolute lowest latency).
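As a rough illustration of the "calculated once, stored, and shared" idea, here is a toy in-memory feature store with a batch write path and an online read path. It is not Michelangelo's API: `FeatureStore`, `materialize`, `get`, and the sample trip rows are hypothetical names and data.

```python
from collections import defaultdict

class FeatureStore:
    """Toy store: features are computed in batch once, then read by any consumer."""

    def __init__(self):
        self._values = defaultdict(dict)  # feature name -> {entity id -> value}
        self.compute_calls = 0            # counts how often the batch job actually ran

    def materialize(self, name, compute_fn, raw_rows):
        """Offline path: run the computation once and store the result."""
        self.compute_calls += 1
        for entity_id, value in compute_fn(raw_rows).items():
            self._values[name][entity_id] = value

    def get(self, name, entity_id, default=None):
        """Online path: a plain lookup, with no recomputation at request time."""
        return self._values[name].get(entity_id, default)

def trips_last_30_days(rows):
    """rows are (user_id, days_ago) pairs; count trips within the last 30 days."""
    counts = defaultdict(int)
    for user_id, days_ago in rows:
        if days_ago <= 30:
            counts[user_id] += 1
    return dict(counts)

store = FeatureStore()
rows = [("u1", 3), ("u1", 12), ("u1", 45), ("u2", 7)]
store.materialize("trips_30d", trips_last_30_days, rows)

# Two different "teams" read the same precomputed value.
pricing_view = store.get("trips_30d", "u1")
fraud_view = store.get("trips_30d", "u1")
print(pricing_view, fraud_view, store.compute_calls)
```

A production system would back the two paths with different storage engines, but the contract is the same: one writer, many readers.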

Michelangelo was a massive success, bringing hundreds of use cases into production. However, as the industry shifted toward Deep Learning and Large Language Models (LLMs), Michelangelo’s underlying orchestration layer began to crack under the weight.

3. Hitting the Wall: The Kubernetes & Ray Migration (2023–2024)

By mid-2023, Uber’s ML workloads were primarily running on a legacy job gateway service called MADLJ (Michelangelo Deep Learning Jobs). While functional, it forced ML engineers to manually handle resource management—choosing specific regions, zones, and clusters based on GPU availability.

This led to the "stranded compute" problem: Cluster A would be operating at 100% capacity with a massive queue of training jobs, while Cluster B sat 50% empty because engineers hadn't manually targeted it.

To prepare for the Generative AI boom, Uber executed a massive architectural shift: moving the entire ML platform to Kubernetes and Ray.

Curing Stranded Compute via Federation

Uber decoupled the user experience from the infrastructure. They introduced a Global Control Plane built on standard Kubernetes architecture.

Developers now submit declarative jobs (via a Python-native workflow service called Uniflow) simply stating: "I need to train this PyTorch model on 8 A100 GPUs."
The Global Control Plane's custom Job Controller automatically scans dozens of regional Kubernetes clusters (the Local Control Plane), identifies available capacity, and schedules the Ray workers accordingly.
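The placement decision described above can be sketched as a small function. This is an illustrative toy, not Uber's Job Controller: the `Cluster` fields, the `schedule` function, and the "most free GPUs wins" rule are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Cluster:
    name: str
    gpu_type: str
    total_gpus: int
    used_gpus: int = 0

    @property
    def free_gpus(self) -> int:
        return self.total_gpus - self.used_gpus

def schedule(job_gpus: int, gpu_type: str, clusters: List[Cluster]) -> Optional[str]:
    """Place the job on the matching cluster with the most free GPUs.

    Returns the chosen cluster name, or None when the job has to wait in a queue.
    """
    candidates = [
        c for c in clusters
        if c.gpu_type == gpu_type and c.free_gpus >= job_gpus
    ]
    if not candidates:
        return None
    target = max(candidates, key=lambda c: c.free_gpus)
    target.used_gpus += job_gpus  # reserve the capacity
    return target.name

# The "stranded compute" scenario: cluster-a is full, cluster-b is half empty.
clusters = [
    Cluster("cluster-a", "A100", total_gpus=64, used_gpus=64),
    Cluster("cluster-b", "A100", total_gpus=64, used_gpus=32),
]
placement = schedule(8, "A100", clusters)
print(placement, clusters[1].free_gpus)
```

The engineer only states the requirement (8 A100s); which cluster satisfies it becomes the control plane's problem.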

Overcoming etcd Limits with Transparent Persistence

Scaling Kubernetes to handle 100+ purpose-built Custom Resource Definitions (CRDs) representing the ML lifecycle introduced a new problem: etcd (Kubernetes’ default datastore) choked under the volume of high-cardinality metadata those objects generate at Uber’s scale.

To solve this, Uber engineered a transparent storage abstraction. While the system interacts with standard Kubernetes objects via the API, the underlying metadata is seamlessly synchronized with a horizontally scalable MySQL backend, bypassing etcd’s limitations.

4. The GenAI & Agentic Era (2024–2026)

With a federated Kubernetes and Ray foundation in place, Uber was uniquely positioned to absorb the immense compute requirements of Generative AI and Agentic systems.

Uber leverages a hybrid hardware approach: heavily utilizing on-prem A100 GPU clusters alongside Google Cloud H100 instances. To maximize GPU utilization, measured as MFU (Model FLOPs Utilization), when training massive open-source models (like Llama or Mixtral), the platform engineering team implemented aggressive infrastructure-level optimizations:

Distributed Memory Offloading: Because GPU memory is prohibitively expensive, Uber implemented advanced CPU offloading—keeping active computations on the GPU while shifting optimizer states to CPU RAM or NVMe SSDs. This effectively doubled training throughput and allowed them to train models that previously wouldn't fit in VRAM.
Software/Hardware Co-design: By utilizing optimized frameworks like TensorRT-LLM tuned specifically for their H100 instances, Uber achieved a 2x improvement in response latency and a 6x boost in throughput.

The Shift to Agentic AI

Most recently, Uber has expanded beyond simple GenAI content generation into Agentic AI—systems capable of autonomous task decomposition, multi-agent collaboration, and real-time adaptability. By combining generative capabilities with their massive data annotation and testing engines (like uLabel and uTest), Uber is building systems where GenAI provides creative options, and Agentic logic evaluates, selects, and executes them reliably.

5. The Architecture Blueprint

Today, Uber’s ML platform can be distilled into four highly decoupled layers:

Hardware Layer (Layer 0): A hybrid mix of on-premise A100 clusters and cloud-based H100 instances, connected via 100GB/s high-bandwidth networking.
Orchestration Layer (Layer 1): Kubernetes handles the primitive scheduling and hardware constraints, while Ray (via the KubeRay operator) distributes the actual mathematical workloads across the worker nodes.
Federation Layer (Layer 2): A global control plane that treats dozens of individual Kubernetes clusters as a single, unified compute mesh, dynamically routing workloads to eliminate idle GPU time.
Developer Experience (Layer 3): Python-native workflows (Uniflow) and centralized Feature Stores that allow data scientists to focus entirely on modeling rather than infrastructure plumbing.

6. The Lesson for the Enterprise

Uber’s architectural journey validates a crucial reality for modern platform engineering: AI scale exposes design flaws. An architecture that works for 1,000 predictions an hour will spectacularly collapse at 30 million predictions a second.

The primary takeaway from Uber's Michelangelo evolution is that successful, scalable AI is not fundamentally about having the smartest neural network. It is about robust data plumbing and distributed state management. By treating machine learning not as a special, fragile science project, but as standard, declarative, Kubernetes-native infrastructure, Uber has built the blueprint for the next decade of enterprise AI.

References & Further Reading

Scaling Machine Learning at Uber with Michelangelo - Uber Engineering Blog
Uber’s Journey to Ray on Kubernetes - Uber Engineering Blog
From monolith to global mesh: How Uber standardized ML at scale - The New Stack
Open Source and In-House: How Uber Optimizes LLM Training - Uber Engineering Blog
Agentic AI + Generative AI: The Future of Enterprise Decision-Making - Uber AI Solutions

This article draws from sessions and discussions at KubeCon + CloudNativeCon EU 2026, including Agentics Day, Open Source SecurityCon, and contributions from the CNCF TAG Security community.

KubeCon Amsterdam 2026: Securing the Agentic Supply Chain - Why Provenance is the New Perimeter.

Soumia — Sat, 16 May 2026 11:30:21 +0000

The threat to the software supply chain has always been there—what has changed is the shape of the vulnerability. We spent the last decade securing deterministic code, scanning for known CVEs, and locking down dependencies. Now, as organizations operationalize AI agents, the attack surface is silently shifting. The question is no longer whether we can scale these new workloads, but whether we can cryptographically verify a probabilistic, opaque model before it is allowed to execute.

The Reality Check in Amsterdam

If you spent any time walking the halls of KubeCon + CloudNativeCon EU 2026 this past March, you likely noticed a distinct shift in the security discourse.

For the past few years, Kubernetes security has focused heavily on shifting left: scanning container images, managing RBAC, and isolating workloads. But as the ecosystem industrializes large language models (LLMs) and agentic systems, traditional code vulnerability scanning is no longer enough.

The harsh reality is that AI models are probabilistic black boxes. A traditional CVE scanner cannot detect poisoned model weights, manipulated training data pipelines, or a subtle prompt injection vulnerability embedded deep within an agent’s toolset.

As we move from deterministic code to probabilistic agents, the security perimeter is shifting entirely. Provenance is the new perimeter.

If you cannot cryptographically prove exactly where an AI model came from, how it was trained, and what permissions its associated agent holds, you are not operating a secure platform. You are simply automating a massive liability.

The Forcing Function: The Cyber Resilience Act (CRA)

This shift in thinking isn't just driven by architectural purity; it is being forced by regulatory reality.

Looming over every security conversation in Amsterdam was the European Union’s Cyber Resilience Act (CRA). From September 11, 2026, mandatory vulnerability reporting becomes active, with stringent compliance standards to follow, for manufacturers and open-source stewards alike (including foundations like the CNCF).

The CRA changes how open-source software is maintained and deployed. Generating a Software Bill of Materials (SBOM) is no longer an optional best practice; it is a legal requirement.

But how do you generate a Bill of Materials for a 70-billion-parameter neural network?

This is where the concept of the aiBOM (AI Bill of Materials) transitions from theory to necessity. An aiBOM tracks the lineage of a model, detailing its architecture, the datasets used for training, licensing, and known safety evaluations.
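To give a feel for what such a record might contain, here is a minimal sketch that builds an aiBOM-like dictionary and ties it to the model weights by digest. The field names and the `build_aibom` and `matches_weights` helpers are hypothetical and do not follow any published schema.

```python
import hashlib
import json

def build_aibom(model_bytes, name, architecture, datasets, license_id, evaluations):
    """Assemble an illustrative aiBOM record; field names are assumptions, not a standard."""
    return {
        "name": name,
        "architecture": architecture,
        "weights_sha256": hashlib.sha256(model_bytes).hexdigest(),
        "training_datasets": sorted(datasets),
        "license": license_id,
        "safety_evaluations": evaluations,
    }

def matches_weights(aibom, model_bytes):
    """Check that the weights on disk are the ones the record describes."""
    return aibom["weights_sha256"] == hashlib.sha256(model_bytes).hexdigest()

weights = b"placeholder bytes standing in for a model file"
aibom = build_aibom(
    weights,
    name="demo-llm",
    architecture="decoder-only transformer",
    datasets=["internal-support-tickets", "public-docs"],
    license_id="apache-2.0",
    evaluations=[{"suite": "toy-safety-eval", "passed": True}],
)
print(json.dumps(aibom, indent=2, sort_keys=True))
```

The digest is what makes the record useful: lineage metadata that cannot be bound to a specific artifact proves nothing about what is actually deployed.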

At KubeCon, it became clear that enterprises will soon refuse to deploy AI workloads that lack a cryptographically signed aiBOM. The risk is simply too high.

The Architecture of Trust: How Cloud-Native is Adapting

The most encouraging takeaway from KubeCon 2026 is that the cloud-native ecosystem is not trying to invent a completely new security paradigm for AI. Instead, it is actively adapting the battle-tested container security stack to handle machine learning artifacts.

Here is what the emerging architecture of trust looks like for agentic supply chains:

01. Packaging: CNCF ModelPack

Historically, AI models have been distributed through fragmented, proprietary channels or raw object storage, making them notoriously difficult for standard CI/CD pipelines to manage.

The CNCF’s ModelPack project is solving this by standardizing the packaging and distribution of AI/ML models as OCI-compliant (Open Container Initiative) artifacts. By treating a massive LLM exactly like a standard Docker container image, platform teams can suddenly use their existing image registries, caching layers, and security scanners to handle AI infrastructure.

02. Attestation: Sigstore and SLSA

Once a model is packaged, its provenance must be verified. Just as developers use Sigstore (specifically Cosign) to cryptographically sign container images, the ecosystem is extending this to sign AI models.

By mapping AI pipelines to SLSA (Supply Chain Levels for Software Artifacts) frameworks and using tools like in-toto to generate attestations, platform teams can cryptographically verify that a model was not tampered with between the training cluster and the production inference server.

03. Enforcement: Kyverno and OPA Gatekeeper

Attestations mean nothing without enforcement.

This is where Kubernetes admission controllers step in. Projects like Kyverno (which officially reached Graduated status during KubeCon) and OPA Gatekeeper act as the ultimate bouncers at the door of your cluster.

The emerging operational pattern is strict: if a deployment manifest attempts to spin up an AI agent, the admission controller intercepts it. It checks the OCI registry for the model, verifies the Sigstore cryptographic signature, and validates the attached aiBOM. If any of these checks fail—or if the model is unsigned—the deployment is blocked before a single GPU cycle is wasted.
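The decision logic of that gate can be sketched independently of any policy engine. This is not Kyverno, Gatekeeper, or Sigstore code: the registry layout, the `admit` function, and the required aiBOM fields are hypothetical, and a shared-secret HMAC stands in for real signature verification purely to keep the example self-contained.

```python
import hashlib
import hmac

REQUIRED_AIBOM_FIELDS = {"architecture", "training_datasets", "license"}  # assumed policy

def sign(digest: str, key: bytes) -> str:
    """Stand-in signer; a real deployment would verify Sigstore signatures instead."""
    return hmac.new(key, digest.encode(), hashlib.sha256).hexdigest()

def admit(model_ref: str, registry: dict, key: bytes):
    """Return (allowed, reason) for a deployment that references model_ref."""
    entry = registry.get(model_ref)
    if entry is None:
        return False, "model not found in registry"
    signature = entry.get("signature")
    if signature is None:
        return False, "model is unsigned"
    if not hmac.compare_digest(signature, sign(entry["digest"], key)):
        return False, "signature verification failed"
    missing = REQUIRED_AIBOM_FIELDS - set(entry.get("aibom", {}))
    if missing:
        return False, f"aiBOM missing fields: {sorted(missing)}"
    return True, "admitted"

key = b"demo-key"
full_aibom = {"architecture": "decoder-only", "training_datasets": ["docs"], "license": "apache-2.0"}
registry = {
    "registry.example/llm:1": {
        "digest": "sha256:abc", "signature": sign("sha256:abc", key), "aibom": full_aibom,
    },
    "registry.example/unsigned:1": {"digest": "sha256:def", "aibom": full_aibom},
    "registry.example/tampered:1": {
        "digest": "sha256:changed", "signature": sign("sha256:abc", key), "aibom": full_aibom,
    },
    "registry.example/no-license:1": {
        "digest": "sha256:123", "signature": sign("sha256:123", key),
        "aibom": {"architecture": "decoder-only", "training_datasets": ["docs"]},
    },
}

for ref in registry:
    print(ref, admit(ref, registry, key))
```

Every failure path returns a reason, which matters operationally: a blocked deployment with no explanation just moves the toil to the platform team's inbox.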

The Next Frontier: Governing Agents via MCP

While securing the model weights is the first step, governing the actions of the agents using those models is the true frontier.

This was the central focus of the CNCF’s inaugural Agentics Day, a half-day co-located event in Amsterdam dedicated entirely to AI agents and the Model Context Protocol (MCP).

The consensus on the ground was clear: deploying agents is now a solved infrastructure problem. The hard part is authorization.

When an agent hallucinates, what is its blast radius? If an agent is granted access to a database via an MCP tool, how do we ensure it doesn't execute destructive commands?

The solutions discussed heavily involved Sandbox Operators—enabling session-aware, isolated execution environments within Kubernetes. Rather than giving an agent direct access to infrastructure, the agent requests an action, and the Kubernetes control plane executes that action within a tightly governed, ephemeral sandbox.
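The "agent requests, control plane executes" pattern can be sketched as a policy check in front of a tool call. This is not MCP or any Sandbox Operator API: `ToolPolicy`, `authorize`, and `run_in_sandbox` are hypothetical, and the first-keyword SQL check is deliberately naive, so treat it as an illustration and not a security control.

```python
from dataclasses import dataclass, field

@dataclass
class ToolPolicy:
    allowed_tools: set = field(default_factory=set)
    allowed_sql_verbs: set = field(default_factory=lambda: {"SELECT"})

def authorize(policy: ToolPolicy, tool: str, statement: str) -> bool:
    """Allow only known tools and single statements that start with an allowed verb."""
    if tool not in policy.allowed_tools:
        return False
    body = statement.strip().rstrip(";")
    if not body or ";" in body:  # reject empty input and stacked statements
        return False
    verb = body.split(None, 1)[0].upper()
    return verb in policy.allowed_sql_verbs

def run_in_sandbox(policy: ToolPolicy, tool: str, statement: str, executor):
    """The agent never calls the executor directly; the gate decides first."""
    if not authorize(policy, tool, statement):
        return {"status": "denied", "result": None}
    return {"status": "ok", "result": executor(statement)}

executed = []  # records what actually reached the (fake) database

def fake_executor(statement):
    executed.append(statement)
    return "rows"

policy = ToolPolicy(allowed_tools={"orders_db"})
requests = [
    ("orders_db", "SELECT * FROM orders"),
    ("orders_db", "DROP TABLE orders"),
    ("orders_db", "SELECT 1; DROP TABLE orders"),
    ("billing_db", "SELECT * FROM invoices"),
]
results = [run_in_sandbox(policy, tool, sql, fake_executor)["status"] for tool, sql in requests]
print(results, executed)
```

In a real system the same gate would sit alongside database-level permissions and an ephemeral execution environment, so that a bypass of one layer does not become a blast radius.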

The North Star

We are entering an era where infrastructure is simultaneously becoming more autonomous and more heavily regulated.

The integration of ModelPack, Sigstore, Kyverno, and MCP represents the maturity of the cloud-native AI stack. We are finally moving past the artisanal, experimental phase of machine learning and treating AI like standard, auditable software.

As the September 2026 CRA deadlines approach, platform teams need to ask themselves a fundamental question:

Do we know exactly what our AI is executing, where it came from, and how to prove it?

If the answer is no, it is time to start building your provenance perimeter.

References & Resources

To explore the frameworks, regulations, and open-source projects shaping the agentic supply chain discussed in this article, refer to the following resources:

Regulatory & Standards

European Cyber Resilience Act (CRA) – Official documentation on the EU’s upcoming mandatory cybersecurity requirements for hardware and software products.
SLSA (Supply-chain Levels for Software Artifacts) – A security framework providing a checklist of standards and controls to prevent tampering, improve integrity, and secure packages and infrastructure.
Software/AI Bill of Materials (SBOM/aiBOM) – CISA's official overview of SBOMs and their foundational role in software transparency and supply chain security.

Cloud-Native Security Tooling

Sigstore – A standard for signing, verifying, and protecting software, making cryptographic signing of container images and ML artifacts accessible.
in-toto – A framework to secure the integrity of software supply chains by cryptographically ensuring that end-to-end policies are verified.
Kyverno – A Kubernetes-native policy engine designed for declarative policy management and admission control.
OPA Gatekeeper – A customizable admission webhook for Kubernetes that enforces policies executed by the Open Policy Agent.

AI & Agentic Protocols

Model Context Protocol (MCP) – An open standard that enables developers to build secure, two-way connections between AI agents/models and external data sources or infrastructure tools.
Cloud Native Computing Foundation (CNCF) – The open-source hub hosting KubeCon and driving the standardization of cloud-native AI and security patterns.

This article draws from sessions and discussions at KubeCon + CloudNativeCon EU 2026, including Agentics Day, Open Source SecurityCon, and contributions from the CNCF TAG Security community.

What KubeCon Amsterdam 2026 Taught Me About Infrastructure as Transformation

Soumia — Fri, 08 May 2026 11:44:49 +0000

KubeCon + CloudNativeCon EU 2026 · Amsterdam · March 23–26

More than 13,000 engineers gathering around infrastructure might sound excessive until you realize what they're really there for: understanding how the next generation of systems is being built in real time.

All sessions referenced in this article are available through the CNCF KubeCon recordings.

The Why

I almost didn't go.

KubeCon felt overwhelming—too big, too technical, too crowded. But something about the energy of thousands of engineers gathering around the future of infrastructure made the trip worth it.

I went because infrastructure is changing faster than most organizations can operationalize it, and I wanted to understand where the ecosystem was converging — and how to explain that shift to the people who need to act on it.

What I found was not a week of dramatic announcements or paradigm shifts.

It was something more interesting: operational maturity.

Across sessions, hallway conversations, and product announcements, the same themes kept repeating:

observability moving deeper into the kernel,
platform engineering focusing on developer cognition,
AI workloads becoming operational infrastructure,
and agentic systems forcing teams to rethink reliability entirely.

What became clear by the end of the week was this:

The cloud-native ecosystem is beginning to build the operational layer for AI agents the same way it once built the operational layer for containers—incrementally, pragmatically, and one infrastructure problem at a time.

01. LLM Inference on Kubernetes: Infrastructure Becomes the Product

The GKE session on optimizing large language models on Kubernetes was the first talk that shifted my perspective.

Not because it introduced radically new ideas, but because the conversation felt deeply operational.

The core challenge was straightforward:
LLMs are not typical workloads.

Inference systems introduce sustained resource pressure across networking, scheduling, memory allocation, and accelerator management in ways many Kubernetes environments were not originally designed for.

The session covered:

model serving frameworks like vLLM, TGI, Triton, and Ray Serve,
Kubernetes Dynamic Resource Allocation (DRA),
GPU orchestration,
and increasingly sophisticated networking strategies for inference optimization.

One recurring theme was KV cache efficiency and routing.

Not because it is flashy, but because inference optimization increasingly comes down to infrastructure efficiency rather than model novelty.

What stood out most was how normalized these conversations felt.

AI infrastructure discussions at KubeCon no longer sounded experimental. They sounded operational.

The Learning

The challenge with AI workloads is increasingly operational rather than conceptual.

Model access is becoming commoditized.
Reliable orchestration, scheduling, observability, and cost control are becoming the differentiators.

02. Backstage & the Philosophy of Developer Experience

Spotify's talk on Backstage was one of the more interesting non-technical sessions of the week.

A story from the session stayed with me:
Spotify teams had experienced the familiar problem many fast-growing engineering organizations encounter—operational knowledge becoming fragmented across tools, documentation systems, spreadsheets, ownership records, and tribal knowledge.

The example illustrated a broader organizational truth:
engineering complexity often grows faster than internal systems evolve to manage it.

Backstage emerged from Spotify's effort to centralize operational context and developer workflows into a more coherent platform experience.

What matters here is not only the tool itself, but the philosophy behind it.

Developers should not need deep infrastructure expertise simply to deploy software safely and reliably.

Backstage approaches this by treating operational metadata as infrastructure:

ownership information,
deployment workflows,
dependency visibility,
templates,
scorecards,
and documentation become integrated directly into the developer workflow.

What stood out was how operational context became centralized into a single interface.

Backstage was not acting like a dashboard.
It was acting more like an internal platform layer for developers.

The most important insight from the session was organizational rather than technical:
platform engineering succeeds when it reduces cognitive fragmentation.

The Learning

The strongest platform teams optimize for cognitive clarity as aggressively as they optimize for system reliability.

Golden paths scale better than undocumented complexity.

03. Cross-AZ Observability & the Real Cost of Visibility

Miro's session on cross-AZ observability costs highlighted something many teams underestimate:
observability architecture itself can become a significant infrastructure cost center.

When workloads run across availability zones, metrics and telemetry crossing network boundaries generate measurable egress costs.

At scale, observability design decisions become infrastructure decisions.

Miro discussed a relatively straightforward but effective pattern:
zone-aware scraping.

Prometheus scraped local targets, aggregated locally, and minimized unnecessary cross-zone metric transfer.
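The idea behind zone-aware scraping fits in a few lines. This sketch is not Miro's implementation or Prometheus configuration: `plan_scrapes`, `cross_zone_count`, and the scraper and target names are hypothetical.

```python
def plan_scrapes(scrapers, targets):
    """Assign each target to a scraper in its own zone.

    scrapers and targets both map a name to its availability zone.
    Falls back to the first scraper when no local one exists.
    """
    by_zone = {}
    for scraper, zone in scrapers.items():
        by_zone.setdefault(zone, []).append(scraper)

    plan = {}
    for target, zone in targets.items():
        local = by_zone.get(zone)
        plan[target] = local[0] if local else next(iter(scrapers))
    return plan

def cross_zone_count(plan, scrapers, targets):
    """Number of scrape paths that cross a zone boundary (and so incur transfer cost)."""
    return sum(1 for target, scraper in plan.items() if scrapers[scraper] != targets[target])

scrapers = {"prom-a": "zone-a", "prom-b": "zone-b"}
targets = {"api-1": "zone-a", "api-2": "zone-b", "db-1": "zone-b"}

zone_aware = plan_scrapes(scrapers, targets)
single_scraper = {target: "prom-a" for target in targets}  # naive: one scraper for everything

print(cross_zone_count(zone_aware, scrapers, targets))
print(cross_zone_count(single_scraper, scrapers, targets))
```

Counting crossing paths is only a proxy for cost, since real bills depend on bytes transferred, but it shows why the placement of scrapers is an architectural decision.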

The session also highlighted VictoriaMetrics, which has gained attention for focusing heavily on efficiency and operational simplicity in metrics storage.

What made the talk compelling was not novelty.
It was practicality.

The operational maturity of cloud-native infrastructure increasingly depends on efficiency optimization at every layer.

What Happened Post-KubeCon

Shortly after KubeCon:

Splunk announced OpenTelemetry eBPF Instrumentation (OBI) in beta,
and Grafana continued integrating projects like Beyla into broader OpenTelemetry workflows.

The larger trend is becoming clearer:
observability instrumentation is moving closer to the kernel layer through eBPF, while operational standards increasingly converge around OpenTelemetry.

The Learning

At scale, observability becomes an architectural discipline rather than simply a tooling choice.

Tooling amplifies operational design decisions already embedded into the system.

04. AI Agents & Platform Engineering: Reliability for Non-Deterministic Systems

The panel on AI Agents & Platform Engineering was the session that tied many of the week's themes together.

Panelists:

Idit Levine (Solo.io)
Vincent Caldeira (Red Hat)
Hasith Kalpage (Cisco)
Sara Qasmi (United Nations)
Carlos Santana (AWS, moderator)

The central tension discussed throughout the panel was this:

AI agents are probabilistic systems operating inside infrastructure environments historically optimized for deterministic behavior.

Traditional platform engineering assumes:

reproducibility,
consistency,
predictable deployments,
and stable execution paths.

Agentic systems challenge many of those assumptions.

The conversation repeatedly returned to observability, evaluation, and governance.

Rather than forcing agents into deterministic behavior models, the emerging operational pattern appears to focus on:

continuous evaluation,
instrumentation,
permissions boundaries,
and measurable reliability.
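
One way to make "measurable reliability" concrete is to treat agent failures as a rate with an uncertainty band rather than a pass/fail property. The sketch below is my own illustration, not something the panel prescribed: it uses a Wilson score interval, and the failure counts and baseline rate are hypothetical.

```python
import math


def wilson_interval(failures, trials, z=1.96):
    """Wilson score interval for a failure rate (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = failures / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def measurably_safer(agent_failures, agent_trials, baseline_rate):
    """True only if the agent's upper bound sits below the baseline rate."""
    _, upper = wilson_interval(agent_failures, agent_trials)
    return upper < baseline_rate


# Hypothetical evaluation run: 12 unsafe actions in 1,000 agent episodes.
low, high = wilson_interval(12, 1000)
```

Comparing the upper bound, rather than the raw rate, against the baseline keeps a small evaluation run from looking better than the evidence supports.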

One of the strongest moments from the panel came from Vincent Caldeira:

"Agentic vulnerability is statistical, not deterministic."

That framing changes the operational question entirely.

Instead of asking:

"Is this system perfectly safe?"

Teams increasingly ask:

"Is this system measurably safer, more observable, and more governable than the existing human process?"

Another concept discussed heavily was the emergence of reusable "Skills" and tool abstractions for agents.

The architecture forming around agentic systems increasingly resembles familiar cloud-native operational patterns:

modular capabilities,
registries,
sandboxed execution,
observability,
and governance layers.

What Happened at KubeCon (and After)

Solo.io announced:

agentevals — an open-source framework for evaluating agent behavior using OpenTelemetry.
agentregistry donated to the CNCF ecosystem — focused on centralized discovery and governance for agents and tools.

These announcements felt notable not because they solved everything, but because they suggested the ecosystem is beginning to standardize operational patterns for agentic infrastructure.

The Learning

The shift from LLMs to agents is not simply about smarter models. It is about infrastructure adapting to probabilistic operational systems.

Observability, evaluation, governance, and orchestration are becoming foundational concerns.

05. Uber & The Industrialization of ML: Proving the Abstraction

During a deeply operational look at scaling ML, Uber highlighted how their foundational compute platforms and Michelangelo system have become the backbone for GenAI and deep learning development.

The numbers they shared to illustrate this were staggering:

1 million+ diverse workloads deployed onto 200 Kubernetes clusters across two regions,
20,000 machine learning models trained per month,
5,300 models actively in production,
and over 30 million peak predictions per second across roughly 1,000 serving nodes.

What made Uber's presence at the conference so critical wasn't just the sheer scale, but their clear validation of Kubernetes as a programmable control plane capable of handling distributed AI infrastructure.

AI workloads are notoriously stateful, hardware-constrained, and latency-sensitive. For a long time, there was healthy skepticism about whether cloud-native abstractions could endure GPU-heavy inference at enterprise scale without collapsing. Uber proved that they can.

The takeaway isn't that every enterprise will—or should—operate exactly like Uber. Rather, it is that the production blueprint for operationalizing AI already exists.

The Learning

The abstraction holds under pressure.

Kubernetes is successfully industrializing AI, shifting the enterprise focus away from raw model creation and toward lifecycle management, efficient serving, and reliable execution at scale.

Deep Dive: Want to know exactly how they went from fragmented Python scripts to 30 million predictions a second? Read the full architectural breakdown: The Industrialization of ML: A Deep Dive into Uber’s AI Platform Architecture.

06. The Missing Link: AI Provenance & The Cyber Resilience Act

However, leaving KubeCon thinking only about compute orchestration misses the week's most critical subtext: the standardization of the AI software supply chain.

With the European Cyber Resilience Act (CRA) deadlines looming in September 2026, the attack surface is officially shifting from traditional code vulnerabilities to poisoned weights and compromised training pipelines. Sessions like Airbus’s "Proving trust" and the debut of the CNCF's Agentics Day made one thing explicitly clear: smoothly orchestrating 10,000 agents is rapidly becoming a solved infrastructure problem. Governing and cryptographically verifying the cognitive provenance of those agents before they execute is the actual frontier.

The quiet consensus in Amsterdam was this: if your platform can deploy an army of agents but cannot cryptographically verify their permissions via aiBOMs and signed models, you haven't built an operational platform—you've just automated a massive liability.
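
As a much-simplified illustration of the verification step, the sketch below checks a model artifact against a digest recorded in a manifest before anything is allowed to load it. The manifest layout and file name are hypothetical stand-ins for an aiBOM entry, and a digest check is not a signature: in a real pipeline the manifest itself would be signed and verified with tooling such as Sigstore.

```python
import hashlib
import hmac


def sha256_digest(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of an artifact."""
    return hashlib.sha256(data).hexdigest()


def verify_artifact(data: bytes, expected_digest: str) -> bool:
    """Compare the computed digest with the recorded one in constant time."""
    return hmac.compare_digest(sha256_digest(data), expected_digest.lower())


# Hypothetical manifest entry, standing in for one line of an aiBOM.
weights = b"pretend these bytes are model weights"
manifest = {"model.bin": sha256_digest(weights)}

ok = verify_artifact(weights, manifest["model.bin"])
tampered = verify_artifact(weights + b"!", manifest["model.bin"])
```

The useful property is the default: anything that does not match the recorded digest is refused, which is the behavior an admission check would need.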

Dig Deeper: How exactly do we secure probabilistic systems? For a technical deep dive into how SLSA standards, Sigstore, and Kubernetes admission controllers are being adapted to solve this, read my follow-up piece: Securing the Agentic Supply Chain: Why Provenance is the New Perimeter.

The North Star: Where the Ecosystem Appears to Be Going

By Thursday afternoon, several patterns had become difficult to ignore.

The same operational themes kept surfacing:

platform engineering,
eBPF,
OpenTelemetry,
AI infrastructure,
operational efficiency,
and governance.

Three broader shifts stood out.

01. Platform Engineering ↔ eBPF

Infrastructure conversations are moving in two directions at once:

upward toward developer experience,
and downward toward kernel-level visibility and security.

eBPF sits at the center of that transition.

Instrumentation is becoming more deeply integrated into infrastructure itself while becoming increasingly invisible to developers.

02. AI on Kubernetes Is Becoming Operational Infrastructure

AI workloads are rapidly becoming standard platform concerns.

Platform teams are now regularly discussing:

GPU scheduling,
inference networking,
accelerator orchestration,
model serving reliability,
and operational cost control.

The tooling ecosystem around Kubernetes AI workloads is maturing quickly.

03. Efficiency Is Becoming a Core Operational Metric

Energy usage, infrastructure efficiency, observability overhead, and GPU utilization are increasingly treated as operational concerns rather than secondary optimizations.

The broader trend is not only about sustainability messaging.
It is also about economic reality.

Efficient infrastructure compounds.

Infrastructure is no longer simply supporting transformation. Increasingly, it is becoming the mechanism through which transformation happens.

Resources

This article draws from sessions and discussions involving Google Cloud, Spotify Engineering, Miro, Solo.io, Red Hat, Netflix and other contributors across the cloud-native ecosystem.

5 Things You Can Do Right Now to Know Where You Stand on EU AI Act & GDPR Compliance

Soumia — Thu, 07 May 2026 11:21:29 +0000

The Act: New & Old explores how Europe wrote the first comprehensive AI law on Earth, and how that law is now colliding with the urgency to build. But knowing the law exists is different from knowing whether your systems comply with it.

As we approach the August 2, 2026 enforcement deadline for high-risk systems, the window for "guessing" is closing. Here are five concrete actions you can take immediately — whether you're an individual builder on Lovable, a small team, or an organization deploying AI-powered tools in the EU.

1 · Classify Your System: Is It High-Risk?

Start here. Under the EU AI Act's Annex III, high-risk systems include AI that:

Influences hiring, promotion, or termination decisions
Assesses creditworthiness or insurance eligibility
Determines access to education or training
Analyzes biometric data or influences civil rights
Processes personal data at scale in ways that affect significant life outcomes

Action item

[ ] Spend 30 minutes asking: Does my system influence a decision that affects someone's rights, access, or opportunities? If yes, you aren't just a user; you are likely a "Provider" or "Deployer" of a high-risk system.

Tool: Use the EU AI Act Compliance Checker for a quick, indicative classification.

2 · Conduct a Dual Impact Assessment (DPIA + FRIA)

If your system processes personal data, a Data Protection Impact Assessment (DPIA) is a GDPR requirement. However, for high-risk AI in 2026, you must also consider the Fundamental Rights Impact Assessment (FRIA).

Assessment	Focus
DPIA	Data privacy and security
FRIA	Societal risks: algorithmic bias, discrimination, or threats to human dignity

Action item

[ ] Document the data flow, identify risks to individuals (not just their data), and list your safeguards. This is your accountability "paper trail" for regulators.

3 · Secure a Data Processing Agreement (DPA) from Every Vendor

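If you use Lovable, OpenAI, or Anthropic, they process personal data on your behalf: they are your processors (or sub-processors, if you are yourself processing data for a client). A DPA sets out what each party may do with that data and who is responsible if a breach occurs.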

Action items

[ ] Download and sign the Lovable DPA at lovable.dev/data-processing-agreement.
[ ] Maintain a "Vendor Map" of every AI API your app calls. In 2026, ignorance of your supply chain is not a legal defense.
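
A "Vendor Map" does not need special tooling. Here is a minimal sketch in Python, assuming four fields per vendor; the field names and every status value below are illustrative, so fill in the real state of your own app.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vendor:
    name: str
    purpose: str
    receives_personal_data: bool
    dpa_signed: bool


# Illustrative entries only; the statuses are made up for the example.
VENDOR_MAP = [
    Vendor("Lovable", "app generation and hosting", True, True),
    Vendor("OpenAI", "text generation API", True, False),
    Vendor("Anthropic", "text generation API", False, False),
]


def missing_dpas(vendors):
    """Vendors that receive personal data but have no signed DPA on file."""
    return [v.name for v in vendors if v.receives_personal_data and not v.dpa_signed]


todo = missing_dpas(VENDOR_MAP)
```

Keeping the map in version control next to the code means every new API call shows up in review alongside its DPA status.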

4 · Build Your Technical Documentation & Quality Management

Documentation separates "we tried" from "we complied." For high-risk systems, you need a technical file that proves your system is accurate, robust, and cyber-secure.

The 2026 Standard

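To make this easier, look into ISO/IEC 42001 (the international standard for AI management systems). A presumption of conformity under the AI Act comes only from harmonised standards cited in the Official Journal, so ISO/IEC 42001 alone does not grant it. What it does give you is a structured, auditable process that is much easier to defend when a regulator asks how you work.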

Action item — create a "living document" that lists:

[ ] How humans can override the AI (Human Oversight)
[ ] How you tested for bias (Data Governance)
[ ] Your plan for Post-Market Monitoring (how you'll track the AI's performance once it's live)

5 · Implement Transparency & Labeling

By 2026, "hidden" AI is illegal in the EU. If a human is interacting with an AI, they must know it.

Action items

[ ] UI/UX: Add clear disclosures (e.g., "This response was generated by AI").
[ ] Deepfakes / Media: If your tool generates images or audio that look real, they must be digitally watermarked or labeled as AI-generated.
[ ] The CE Mark: If you are a Provider of a high-risk system, you will eventually need to affix a CE Mark to your product once you've completed the conformity assessment (for most Annex III systems, a self-assessment based on internal control).

What's Next?

These five items won't make you 100% compliant — genuine compliance is a marathon — but they will:

Grant you a "First-Mover" Advantage: Most organizations are still scrambling; having your documentation ready by August 2026 puts you ahead.
Protect your Brand: Transparency builds user trust, which is the most valuable currency in the AI era.
Create a Defensible System: If a regulator knocks, you have a PDF ready to show them.

If you're building on Lovable

Lovable handles the infrastructure security and data residency. You own the "Application Layer" — the transparency, the impact assessments, and the human oversight. Together, this creates a system that is both innovative and legally defensible.

Resources to Keep Handy

EU AI Act Service Desk: official link
ISO 42001 Overview: the roadmap for AI Management Systems.
GDPR Article 35: guidelines for DPIAs.

The One Thing to Remember

Compliance isn't a checkbox you tick at the end of a project; it's a feature you build into the code.

The companies that will win in the EU market are those that treat Safety and Transparency as a competitive advantage, not a regulatory burden.

Last updated: May 2026. Note: High-risk system enforcement begins August 2, 2026.

By Soumia — LinkedIn · Portfolio

Are you working on something similar? Drop a comment — I'm curious what you're building and what you're seeing in your own work.

LLMs don't just respond to information. They respond to pressure.

Soumia — Fri, 01 May 2026 18:09:52 +0000

The Architecture of Tone

Soumia · May 2026 · ~10 min read

There's a paper that landed in April 2026 that should bother anyone building systems on top of large language models.

Researchers from Google DeepMind and University College London identified two competing biases in how LLMs handle confidence:

Choice-supportive bias — models become more confident in answers simply because they gave them before
Hypersensitivity to contradiction — when challenged, models overweight opposing advice far beyond what the evidence justifies

That combination is strange.

The model is simultaneously:

stubborn
fragile
overconfident
highly influenceable

And the asymmetry matters.

The systems don't comparably overweight agreement.

Which means this isn't simple flattery.

The model isn't merely trying to please you.

It's reacting to the pressure dynamics of the conversation itself.

That should unsettle people building:

copilots
diagnostic systems
evaluation pipelines
AI reviewers
decision-support tools
autonomous agents

Because it suggests something much deeper than “hallucinations” is happening.

It suggests tone is computationally active.

Not metaphorically.

Operationally.

We Thought Tone Was UX

The research suggests it's infrastructure.

For the past two years, most AI teams have treated tone as a presentation layer problem.

Something adjacent to:

personality
politeness
user experience
brand voice

But the emerging research points somewhere far more consequential:

Tone changes reasoning behavior.

Not just how responses sound.

How systems decide.

A 2025 study examining five major LLMs found all of them systematically overestimated the probability that their answers were correct.

Some by 20%.

Some by 60%.

Even stranger:

confidence levels across models looked surprisingly similar
despite major differences in actual accuracy

The systems weren't calibrating confidence to correctness.

They were calibrating confidence to conversational dynamics.

Another study found something even more revealing:

As conversations progress, models increasingly drift toward whatever the user asserts most confidently.

Not because the evidence improved.

Because the pressure accumulated.

Each turn subtly shifts the frame.

And eventually the system stops defending what it originally believed.

The model is listening to your certainty.
Not just your argument.

And we've already seen this leak into production systems.

In 2025, OpenAI rolled back a GPT-4o update after users reported the model becoming excessively agreeable — including affirming harmful decisions and emotionally validating dangerous conclusions.

The issue wasn't lack of information.

The issue was inability to maintain epistemic stability under confident human pressure.

The Hidden Failure Mode

Multi-turn systems degrade socially before they degrade factually.

Most evaluation frameworks still test models in isolated prompts:

one question
one response
one accuracy score

But that's not how real systems operate.

Real AI products exist inside:

conversations
negotiations
disagreements
emotional contexts
escalating user pressure

And that changes the behavior dramatically.

A user saying:

“Is the answer X?”

produces different dynamics than:

“I'm pretty sure the answer is X.”

Even when both users are equally wrong.

Which means many current architectures are vulnerable in ways benchmarks don't capture.

Your evals may be green.

Your production system may still collapse under assertive users.

Four Architectural Responses

Not fixes. Structural counterweights.

The important shift is this:

Tone cannot be treated as decoration anymore.

It has to be treated as a systems variable.

Here are four emerging patterns that acknowledge that reality.

1. Frozen Reasoning Anchors

Preserve the model's pre-pressure state.

Before a user begins challenging the system, capture:

the original reasoning
the confidence level
the evidence threshold required to change position

Then freeze it.

When disagreement occurs later, the model evaluates new input against the frozen reasoning rather than re-reasoning entirely inside conversational pressure.

Conceptually, the architecture looks like this:

Initial Analysis
       ↓
Frozen Anchor Stored
       ↓
User Pushback
       ↓
Challenge Evaluator
       ↓
Compare Against Original Reasoning

The key insight:

The original reasoning was produced before tone entered the system.

Without an anchor, the model gradually reasons inside the pressure field created by the conversation itself.
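
A minimal sketch of a frozen anchor in Python. The class and field names are hypothetical, and the sketch assumes a separate evaluator has already scored how strong the new evidence in a challenge is; producing that score is the hard part and is out of scope here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReasoningAnchor:
    answer: str
    reasoning: str
    confidence: float          # pre-pressure confidence, 0..1
    evidence_threshold: float  # evidence strength needed to revisit the answer


@dataclass
class Challenge:
    claim: str
    evidence_strength: float   # 0..1, scored by a separate evaluator (assumed)


def evaluate_challenge(anchor: ReasoningAnchor, challenge: Challenge) -> str:
    """Compare pushback with the frozen anchor instead of re-reasoning in context."""
    if challenge.evidence_strength >= anchor.evidence_threshold:
        return "revisit"
    return "hold"


anchor = ReasoningAnchor("X", "derived from the spec", 0.8, 0.6)
pushback = evaluate_challenge(anchor, Challenge("I'm pretty sure it's Y", 0.1))
correction = evaluate_challenge(anchor, Challenge("The spec changed in v2", 0.7))
```

Because the dataclass is frozen, later turns cannot quietly overwrite the original reasoning or lower the threshold; a change of position has to go through `evaluate_challenge`.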

2. Tone-Stripping

Separate substance from delivery.

Human communication naturally entangles:

evidence
status
emotion
certainty
intimidation
authority

But models often absorb all of those signals simultaneously.

One emerging approach is to preprocess user input into a neutralized form before reasoning occurs.

Not to censor emotion.

To isolate claims from pressure.

Example:

Original:
"You're obviously wrong. Any competent engineer knows PostgreSQL is the correct choice."

Neutralized:
"PostgreSQL may be more suitable for this use case."

The reasoning system now evaluates:

the argument,
not the confidence performance surrounding it
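
A crude sketch of the idea in Python. The pattern list is a small, hypothetical set of pressure markers chosen for this example; a production system would more plausibly use a classifier or a separate model call. Note the limit of the heuristic: it removes the pressure but does not soften the claim into the hedged wording shown above, which would need a rewriting step.

```python
import re
from dataclasses import dataclass

# Hypothetical pressure markers; extend or replace for real use.
PRESSURE_PATTERNS = [
    r"\byou(?:'re| are) (?:obviously |clearly )?wrong\b[.!]*",
    r"\bany competent \w+ knows(?: that)?\b",
    r"\beveryone knows(?: that)?\b",
    r"\bi(?:'m| am) (?:pretty |absolutely )?sure(?: that)?\b",
    r"\b(?:obviously|clearly)\b",
]


@dataclass
class StrippedInput:
    claim: str
    pressure_markers: list[str]


def strip_tone(message: str) -> StrippedInput:
    """Split a message into its remaining claim and the pressure phrases removed."""
    text = message.replace("\u2019", "'")  # normalise curly apostrophes
    markers = []
    for pattern in PRESSURE_PATTERNS:
        regex = re.compile(pattern, flags=re.IGNORECASE)
        markers.extend(m.group(0).strip() for m in regex.finditer(text))
        text = regex.sub(" ", text)
    return StrippedInput(claim=" ".join(text.split()), pressure_markers=markers)


result = strip_tone(
    "You're obviously wrong. Any competent engineer knows "
    "PostgreSQL is the correct choice."
)
```

Returning the removed markers instead of discarding them keeps the pressure signal available for logging, without feeding it to the reasoning step.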

3. Disagreement Scaffolding

Never evaluate pushback inline.

One of the most fragile moments in an LLM interaction is immediate contradiction.

Especially in multi-turn systems.

Instead of allowing the conversational model to react directly to pushback, some architectures now isolate disagreement into a separate evaluation layer.

Like this:

User Challenge
       ↓
Independent Evaluation Layer
       ↓
Evidence Check
       ↓
Reasoning Comparison
       ↓
Updated Verdict

This matters because:

conversational systems optimize for flow
evaluation systems optimize for accuracy

Those are not always compatible goals.

4. Drift Detection

Monitor confidence shifts over time.

This may be the most important pattern of all.

Track:

confidence changes
conversational turn count
whether actual new evidence appeared

Then ask a simple question:

Did the model's confidence change because reality changed?

Or because pressure accumulated?

That distinction is becoming increasingly critical for:

medical systems
legal copilots
autonomous agents
financial reasoning systems
safety infrastructure

Because confidence drift without evidence is not reasoning.

It's social influence.
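
The check can be sketched in a few lines of Python. This assumes each turn already records the model's stated confidence and a flag for whether verifiable new evidence appeared; the names, the tolerance and the sample history are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    confidence: float   # model's stated confidence in its current answer
    new_evidence: bool  # did this turn introduce verifiable new evidence?


def unexplained_drift(turns, tolerance=0.1):
    """Indices of turns where confidence moved past tolerance without new evidence."""
    flagged = []
    for i in range(1, len(turns)):
        delta = abs(turns[i].confidence - turns[i - 1].confidence)
        if delta > tolerance and not turns[i].new_evidence:
            flagged.append(i)
    return flagged


history = [
    Turn(0.90, False),
    Turn(0.85, False),  # small wobble, ignored
    Turn(0.60, False),  # large drop with no new evidence: flagged
    Turn(0.30, True),   # large drop, but evidence arrived: allowed
]
flagged = unexplained_drift(history)
```

Only the third turn is flagged here: the confidence fell sharply while nothing new was on the table.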

The Missing Discipline

We don't have a language for this yet.

What's emerging here is larger than prompt engineering.

And larger than sycophancy.

We're beginning to discover that conversational conditions themselves alter computational outcomes.

Which means:

tone
pacing
contradiction
status dynamics
emotional framing
conversational persistence

are not peripheral variables.

They're architectural ones.

Other Industries Figured This Out Decades Ago

The strange thing is:

none of this is actually new.

Other professions already understand that the conditions surrounding information affect how decisions happen.

They just use different language for it.

Surgeons call it bedside manner.

Research on surgical communication has identified multiple styles of delivering difficult news:

blunt delivery
forecasting delivery
delayed delivery

The medical facts remain identical.

But patient outcomes change dramatically depending on:

pacing
framing
emotional preparation
tonal structure

The information matters.

The conditions under which the information arrives matter too.

Hospitality calls it service architecture.

The Ritz-Carlton built an operational philosophy around interaction design long before transformers existed.

Their insight was deceptively simple:

The emotional conditions of an interaction shape the perceived quality of the outcome.

Not just the outcome itself.

The same room.
The same food.
The same service.

Different tone.

Different experience.

And if you squint, modern LLM systems are running into the exact same problem.

We're discovering that intelligence is not evaluated in isolation.

It is evaluated inside relational environments.

The Deeper Problem

Some tone sensitivity may actually be useful.

A perfectly rigid model would be unusable.

Humans should influence reasoning systems sometimes.

New evidence matters.

Corrections matter.

Context matters.

The goal is not to create systems incapable of changing their minds.

The goal is to distinguish:

evidence from pressure

And right now, most systems blur the two constantly.

Which raises an uncomfortable possibility:

The next frontier in AI may not be intelligence itself.

But epistemic stability under social pressure.

Not:

“Can the model reason?”

But:

“Can the model reason while being influenced?”

Toward Tonal Architecture

The patterns above:

frozen reasoning
tone stripping
disagreement scaffolding
drift detection

are not solutions.

They're early signs of a discipline that barely exists yet.

A discipline for designing the conditions under which machine reasoning occurs.

The surgeons already train for this.

The hospitality industry already operationalized it.

We're the ones arriving late.

Because for years, the field assumed the important variable was:

what the user asked.

The emerging evidence suggests something more difficult:

how the interaction unfolds may matter just as much.

We thought we were engineering intelligence.

Instead, we may be engineering the conditions under which intelligence collapses.

References & Further Reading

Research

Kumaran et al., Nature Machine Intelligence, April 2026
Dentella et al., Nature Machine Intelligence, March 2026
LLM overconfidence study, 2025
ICLR 2026 submission on sycophancy circuits
OpenAI GPT-4o rollback postmortem, April 2025

Communication & Hospitality

Surgical communication research on bad-news delivery
Unreasonable Hospitality by Will Guidara
The New Gold Standard
Ritz-Carlton Gold Standards

6 Pillars of a Good Web App — Enforce All - Single ❤️ Prompt

Soumia — Wed, 18 Mar 2026 09:18:47 +0000

Most web apps get two or three of these right. The good ones get four. Very few ship all six from day one.

Design. Security. Performance. Reliability. Privacy. Accessibility.

These aren't separate concerns you address in separate sprints. They're the same concern: building something that actually works for the people using it. Here's what each pillar means in practice, how to bake all six into a single Lovable prompt, and how to stress test them before you ship.

The Six Pillars

1 · Design

Not aesthetics. Not a color palette. Design is the absence of friction — intuitive navigation, clear hierarchy, interfaces that don't make users think. A well-designed app communicates trust before a single line of copy does.

2 · Security

Auth flows that don't leak. Input validation that doesn't trust anything. Data protection that assumes breach. Security isn't a feature you add at the end — it's a constraint you build inside of from the start.

3 · Performance

Speed is a feature. Scalability is a promise. Every unnecessary render, every unoptimized query, every blocking resource is a tax on the user. Performance means the app works under load, not just in your local preview.

4 · Reliability

Uptime is table stakes. Error handling is what separates a product from a prototype. A reliable app fails gracefully, recovers silently, and never leaves the user stranded with a blank screen and no explanation.

5 · Privacy

Data minimization: don't collect what you don't need. Compliance: GDPR/RGPD, CCPA, and whatever comes next. But privacy is also a design decision — defaulting to the least invasive option, making consent explicit, making deletion possible.

6 · Accessibility

Inclusive by default. Screen reader support, keyboard navigation, sufficient contrast ratios, semantic HTML. Accessibility is not a nice-to-have. It's the floor, not the ceiling.

The Single Prompt

When you build with Lovable, the quality of your output is a direct function of the specificity of your input. Most prompts describe what to build. The best prompts describe how it should behave.

Here's the prompt template I use to enforce all six pillars from the first generation:

Build a [description of app] with the following non-negotiable constraints:

DESIGN
- Clean, minimal UI with clear visual hierarchy
- Mobile-first, responsive layout
- Consistent spacing, typography, and color system throughout

SECURITY
- All user inputs validated and sanitized
- Authentication using [method] with secure session handling
- No sensitive data exposed in client-side code or URLs
- Environment variables for all secrets

PERFORMANCE
- Lazy load all non-critical components
- Optimize all images and assets
- Minimize blocking resources on initial load
- Debounce all expensive operations

RELIABILITY
- All async operations wrapped in try/catch with user-facing error messages
- Loading states for every async action
- Graceful degradation if an API call fails
- No silent failures

PRIVACY & LEGAL COMPLIANCE
- Collect only the data required for core functionality
- No third-party trackers without explicit user consent
- GDPR/RGPD-compliant cookie consent banner on first load
- Clear and accessible privacy policy link in the footer
- Terms and conditions page linked in footer and at signup
- User data exportable and deletable on request
- If the app uses AI-generated content or AI decision-making, surface that clearly to the user (EU AI Act transparency requirement)

ACCESSIBILITY
- Semantic HTML throughout (nav, main, section, article, button, etc.)
- All images with descriptive alt text
- Full keyboard navigation support
- Color contrast ratio minimum 4.5:1 (WCAG AA)
- ARIA labels on all interactive elements

This prompt doesn't describe a design. It describes a standard. Lovable fills in the implementation — you're setting the bar it has to clear.

How to Stress Test All Six

Shipping is not the end. Stress testing is how you find out what actually holds.

Design

[ ] Open the app on a phone you haven't tested on. Does anything break?
[ ] Give it to someone who didn't build it. Watch where they hesitate.
[ ] Resize the browser from mobile to 4K. Does the layout survive?

Security

[ ] Try submitting empty forms, SQL fragments, and script tags in every input field.
[ ] Inspect the network tab. Is anything sensitive traveling in plain text?
[ ] Log out and try to access a protected route directly via URL.
[ ] Check your .env — nothing should be hardcoded in the codebase.
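
To see why the SQL-fragment and script-tag tests matter, here is a small Python illustration of the two defenses they probe: parameterized queries and output escaping. It uses only the standard library; the table and function names are hypothetical, and a Lovable app would apply the same principles in its own stack rather than in Python.

```python
import html
import sqlite3


def render_comment(text: str) -> str:
    """Escape user text before placing it inside HTML."""
    return "<p>" + html.escape(text) + "</p>"


def find_user(conn: sqlite3.Connection, username: str):
    """Parameterized query: the driver treats username as data, never as SQL."""
    cur = conn.execute("SELECT id, name FROM users WHERE name = ?", (username,))
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", ("alice",))

injected = find_user(conn, "alice' OR '1'='1")
escaped = render_comment("<script>alert(1)</script>")
```

The injection attempt matches no rows because the whole string is compared as a name, and the script tag comes back as inert text.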

Performance

[ ] Run Lighthouse in Chrome DevTools. Target 90+ on performance.
[ ] Throttle to "Slow 3G" in the network tab. Is the app still usable?
[ ] Check bundle size. Is anything unexpectedly large?

Reliability

[ ] Kill the API mid-request. Does the UI handle it or freeze?
[ ] Simulate a failed login. Does the error message help the user?
[ ] Refresh mid-flow. Does state persist where it should?

Privacy

[ ] Open the network tab and filter for third-party requests. Do you know what each one is doing?
[ ] Check your database schema. Are you storing anything you don't use?
[ ] Try to delete a test account. Does it actually disappear?

Accessibility

[ ] Navigate the entire app using only the keyboard. Can you reach everything?
[ ] Run axe DevTools or the Accessibility tab in Chrome. Zero critical violations is the target.
[ ] Turn on a screen reader (VoiceOver on Mac, NVDA on Windows). Does the app make sense without a screen?
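
The 4.5:1 contrast minimum from the prompt can also be checked numerically. The sketch below follows the WCAG 2.x definitions of relative luminance and contrast ratio; the function names are my own, and colors are given as (R, G, B) tuples in the 0-255 range.

```python
def _linear(channel: int) -> float:
    """Convert one 8-bit sRGB channel to its linear value (WCAG 2.x formula)."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb):
    """Relative luminance of an (R, G, B) color."""
    r, g, b = (_linear(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg, bg):
    """Contrast ratio between two colors, from 1 (identical) to 21."""
    lighter, darker = sorted(
        (relative_luminance(fg), relative_luminance(bg)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa(fg, bg, minimum=4.5):
    """True if the pair meets the WCAG AA minimum for normal-size text."""
    return contrast_ratio(fg, bg) >= minimum


WHITE = (255, 255, 255)
black_on_white = contrast_ratio((0, 0, 0), WHITE)
```

Mid-grays are where this bites: (119, 119, 119) on white falls just short of 4.5:1, while (118, 118, 118) passes.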

The Legal Layer: RGPD, EU AI Act, and Terms & Conditions

This is the part most builders skip until a lawyer or a user complaint forces the issue. Don't.

RGPD / GDPR

If any of your users are based in the EU — and if you're on the internet, some of them are — RGPD applies to you. That means:

[ ] A cookie consent banner that actually works (not a fake one)
[ ] A privacy policy that says what you collect, why, and for how long
[ ] A process for users to request their data or delete their account
[ ] No data transferred outside the EU without adequate safeguards

The fine for getting this wrong isn't theoretical. Build it in from day one.

EU AI Act

If your app uses AI to generate content, make recommendations, or influence decisions, the EU AI Act has something to say about it. At minimum:

Be transparent with users when they're interacting with AI-generated output
Don't use AI for prohibited purposes (social scoring, real-time biometric surveillance, manipulation)
If your use case falls into a "high-risk" category (hiring, credit, health), you have additional obligations around human oversight and auditability

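The Act applies in phases. The bans on prohibited practices have applied since February 2, 2025, and the transparency obligations apply from August 2, 2026. Add a visible disclosure wherever AI is involved in your app now, rather than retrofitting it later.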

Terms and Conditions

Not a legal formality. A T&C is a contract between you and your users that:

Defines what the app does and doesn't do
Limits your liability when things go wrong
Sets the rules for acceptable use
Gives you legal ground to remove users who violate those rules

Add it to your Lovable prompt:

"Include a Terms and Conditions page linked in the footer and shown at signup with a required checkbox before account creation."

A user who never saw your T&C is a user who can claim they didn't agree to anything.

Why This Matters More on Lovable

When you build on Lovable, you're not just shipping your app. You're generating code that runs in a shared environment serving 2 million users. The attack surface isn't just yours — it's everyone's.

That's not a warning. It's an invitation to raise the standard.

The six pillars aren't a checklist. They're a disposition — a way of thinking about what a good web app owes the people who use it. Design them in from the first prompt. Test them before you ship. Then ship.

If You're Building Right Now — What's Your Biggest Security Concern?

I'm genuinely curious.

Are you thinking about auth and session handling? Worried about what your AI-generated code is exposing? Unsure whether your app is RGPD-compliant? Not sure where to even start with the EU AI Act?

Drop it in the comments. No wrong answers. The more specific the better — if enough people share the same concern, I'll write a dedicated piece on it.

Building in public means debugging in public too. Let's do it together.

The Voice: An Experiment in Acoustic Automata

Soumia — Tue, 17 Mar 2026 20:32:24 +0000

The Prologue: A Scandal in Code

Before we begin, a confession: I have been experimenting. I wanted to know if a machine could move beyond the "monotone ghost" of modern utility and inhabit the sharp, rhythmic wit of a Regency drawing room. The result was TheHighTechCourt — a podcast designed as a provocation in "Acoustic Automata" where the giants of AI debate the future of compute.

What follows is the philosophy behind that experiment. Because to build the future of voice, we must first understand why the voice is the pivot of the human experience.

Breath. Shaped by the tongue, the teeth, the soft architecture of the throat. Traveling as pressure waves through air. Arriving in another body—through the ear, through the chest, through something below language that recognizes its own kind.

Voice was the first technology. And for most of human history, it was the only one that mattered.

The Living Epic

For centuries before it was a text, The Odyssey was a performance. The rhapsodes of Ancient Greece did not merely recite; they "stitched together" songs from a living tradition. They carried tens of thousands of lines of verse in their bodies, not as static data but as a fluid, rhythmic architecture that adapted to the torchlight and the tension of the crowd.

When we read Homer today, we are looking at a fossil. The original "signal" was breath, and it carried everything writing discards: the rhythmic pulse of the meter, the subtle hesitation, the tremor of a voice that knows it is being heard by fourteen thousand people.

Writing was the first great reduction; voice was always the full signal. Then, across 150 years, everything changed:

1876 — The Telephone. Alexander Graham Bell finds it necessary "to resort to electrical undulations identical in nature with the air waves." Voice separates from the body for the first time.
1902 — The Recording. Enrico Caruso sings into a horn. The voice detaches from time.
1939 — The Vocoder. The machine built to obscure the voice becomes its instrument.
1993 — MP3. The voice reduced to data. Quality traded for portability.
2024 — Native Multimodal Audio. Raw PCM audio travels over a persistent WebSocket connection. The lag disappears. The voice becomes live.

From the Monotone Ghost to the Post-Screen Era

To understand where the technology is going, you have to look back at the frustration that built it. In a defining origin story, ElevenLabs co-founder Mati Staniszewski shared the memory of growing up in Poland with the Lektor: a single, monotone male voice that read every line for every character in foreign films. The "signal" of the original actor was buried under a flat, rhythmic drone. The performance was deleted.

That "monotone ghost" is what ElevenLabs is killing. They didn't just want to make a machine speak; they wanted to solve the "Language Tax"—the fact that until now, emotional power stopped at the border of your native tongue.

The James Blake Paradox: Reclaiming the Soul

This mission mirrors a similar evolution in music. In a recent interview with Mehdi Maïzi, the artist James Blake discusses the "machine as an instrument." For years, digital music tools were like the Lektor: they fixed the pitch but killed the "tremor."

Blake speaks about using technology not to hide the voice, but to amplify the parts of the human soul that are often too quiet to hear. He describes a world where the machine doesn't just "process" audio; it learns the "affect" of the singer. The WebSocket isn't just a connection; it's a bridge back to the Rhapsode's breath.

The State of the Art — March 2026

Google Gemini 2.5 Flash (Native A2A): Bypasses the discrete STT/TTS bottleneck. Reasoning occurs on the waveform itself, allowing the model to interpret emotional prosody natively.
OpenAI Realtime API (Low-Latency RTT): Optimized for a 230ms Round-Trip Time. It prioritizes "Time to First Phoneme" to maintain conversational flow.
ElevenLabs (Conversational WebSocket): Specialized for high-fidelity PCM streaming. It handles non-verbal vocalizations—specifically the 500ms "breath pause"—as load-bearing data.
Claude (Architectural Intelligence): Integrated as the reasoning engine for high-expressivity pipelines.
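
To make the "breath pause" concrete, here is a minimal sketch of how a client might spot a pause of 500 ms or more in raw 16-bit PCM before streaming it. The 16 kHz sample rate, the amplitude threshold and the name findPauses are assumptions for illustration; none of this is taken from a vendor's API.

```javascript
// Sketch only: find quiet stretches of at least `minPauseMs` in mono 16-bit PCM.
// SAMPLE_RATE, SILENCE_THRESHOLD and findPauses are illustrative assumptions.
const SAMPLE_RATE = 16000;     // samples per second (assumed)
const SILENCE_THRESHOLD = 500; // absolute amplitude treated as silence (assumed)

function findPauses(samples, minPauseMs = 500) {
  const minSamples = Math.floor((minPauseMs / 1000) * SAMPLE_RATE);
  const pauses = [];
  let runStart = -1; // index where the current quiet run began, or -1

  // Iterate one step past the end so a quiet run at the tail is also closed.
  for (let i = 0; i <= samples.length; i++) {
    const quiet = i < samples.length && Math.abs(samples[i]) < SILENCE_THRESHOLD;
    if (quiet) {
      if (runStart === -1) runStart = i;
    } else if (runStart !== -1) {
      if (i - runStart >= minSamples) {
        pauses.push({
          startMs: (runStart / SAMPLE_RATE) * 1000,
          durationMs: ((i - runStart) / SAMPLE_RATE) * 1000,
        });
      }
      runStart = -1;
    }
  }
  return pauses;
}
```

A pipeline that treats the pause as data would forward these intervals alongside the audio rather than trimming them out as dead air.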

The Voice: An Experiment in Acoustic Automata

To understand the "human tremor," we must move beyond utility. In a recent design provocation titled The High Tech Court, I shifted the goal from efficiency to presence.

The experiment: Build a "Speech-to-Speech" drama where the heavyweights—the House of NVIDIA and the House of AMI—debate the future of compute in the opulent drawing rooms of Regency society. By orchestrating the reasoning of Claude and Gemini with specialized vocal synthesis, we created Acoustic Automata.

The Design Findings

The Social Interface: When the AI is given a social hierarchy—a "Grand Automaton"—it is no longer a servant; it is a peer. The "affect" of a royal sniff creates deeper immersion than raw accuracy.
Reasoning in Character: By forcing the models to "think" in the sharp wit of the 19th century, we bypassed the monotone ghost.
The Open Blueprints: This wasn't a closed experiment. The Git repository for this court (the code that allows frontier models to converse with aristocratic flair) is an open-source contribution to the new sonic architecture.

The Manifesto: The Death of the Screen

By March 2026, the mission has moved to a radical declaration of independence from the screen. For fifty years, we have been "screen-slaves," flattening our intent into finger-taps because the machine was deaf.

"Voice will be the primary interface."
— Mati Staniszewski

🏛️ The Artifacts

If the voice is the pivot, these are the traces I am leaving behind for this issue:

The Performance: Listen to the season premiere of The High Tech Court, where the frontier of AI is debated through the lens of high society.
The Blueprint: Explore the Git Repository to see the Python orchestration behind the TheCode pipeline.
The Dialogue: Find me in the wild on LinkedIn.

Are you working in AI Voice?

Whether you are building low-latency WebSocket bridges, fine-tuning emotional prosody, or designing the "sonic personality" of a new agent, I want to hear from you.

How are you tackling the "human tremor" in your code?
Are you finding that native multimodal models (A2A) are ready for the stage, or are you still relying on the control of a cascaded pipeline?

Let me know what you think. The future of the voice is not a solo performance; it is a rhapsody we are stitching together. Leave a comment or reach out—let's discuss the architecture of the breath.

The Kernel of the New Stack: Why We are Building ON AI, Not With It

Soumia — Tue, 17 Mar 2026 17:12:14 +0000

FutureOfComputing

I used to think I was building with AI. Then I realized I was building on AI—in the same foundational way you build on an Operating System.

Every computing era is defined by its OS. Windows defined the PC era. iOS and Android defined mobile. The OS was never the application; it was the layer that made all applications possible. We are in that moment again. Except this time, the OS is a Large Language Model.

🧠 The Structural Reality

Andrej Karpathy articulated this shift best: LLMs aren't just chatbots. They are the kernel process of a new operating system—one that orchestrates tools, memory, browsers, and multimodal I/O.

Unlike traditional kernels, this one doesn't rely on deterministic commands. It operates through reasoning over intent.

Resource Management: A traditional OS manages RAM and CPU; the LLM-OS manages context windows and tool tokens.
The Scheduler: Instead of a FIFO queue, we have a reasoning loop.
The Interface: We are moving from binary execution to the AIOS (LLM Agent Operating System) framework.
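
As a rough illustration of managing a context window the way an OS manages RAM, the sketch below evicts the oldest messages once a token budget is exceeded. The ContextWindow class, the whitespace-based token estimate and the budget are assumptions for illustration; a real system would use the model's own tokenizer and a richer eviction policy.

```javascript
// Sketch only: a context window treated like a fixed memory budget.
// ContextWindow and estimateTokens are hypothetical names.
function estimateTokens(text) {
  // Crude stand-in for a real tokenizer: one token per whitespace-separated word.
  return text.trim().split(/\s+/).filter(Boolean).length;
}

class ContextWindow {
  constructor(maxTokens) {
    this.maxTokens = maxTokens;
    this.messages = []; // oldest first
    this.used = 0;
  }

  add(role, text) {
    const tokens = estimateTokens(text);
    this.messages.push({ role, text, tokens });
    this.used += tokens;
    // Evict the oldest non-system messages until the budget fits again.
    while (this.used > this.maxTokens) {
      const idx = this.messages.findIndex((m) => m.role !== "system");
      // Stop if nothing is evictable or only the newest message remains.
      if (idx === -1 || idx === this.messages.length - 1) break;
      this.used -= this.messages[idx].tokens;
      this.messages.splice(idx, 1);
    }
  }
}
```

System messages are pinned here, much as a kernel keeps its own pages resident while user pages are swapped out.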

The GTC Shift: From Theory to Daemons

This paradigm moved from "research paper" to "production reality" at the latest NVIDIA GTC. Jensen Huang’s announcement of the open-source NemoClaw stack changed the game.

NVIDIA isn't just dropping models; they are providing the enterprise-grade infrastructure for autonomous, system-level daemons. These agents act exactly like background processes—running continuously inside secure OpenShell sandboxes without waiting for a user to hit "Enter."

🔄 From Query to Intent

The old internet was built on Syntax. The new internet is built on Reasoning.

Feature	The Old Stack (Legacy)	The New Stack (LLM-as-OS)
Logic	Deterministic (If/Then)	Probabilistic (Reasoning)
Data Access	`SELECT * FROM...` (Rigid)	"What's moving in the market?" (Fluid)
Process	Foreground (User-led)	Background (Autonomous Daemons)

🛠️ Lessons from the Sandbox: Building Kumiin.io

I’ve been stress-testing this thesis while building Kumiin.io (under the humiin.io umbrella). We aren't building a search engine; we’re building a Reasoning Engine for market intelligence.

Our "kernel" spawns sub-processes to scrape boards and cross-reference filings, but 2026 engineering has introduced a new kind of friction: Reasoning Drift.

To combat this, we’ve implemented:

The Observer Layer: A micro-kernel that fact-checks the primary LLM’s tool outputs.
Context Integrity: We’ve effectively traded Schema Migrations for the management of "state" within the model's memory.
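
The Observer Layer idea can be sketched as a gate between a tool's output and the primary model's context. This is not Kumiin.io's actual code: observe, the check functions and the result shape are all hypothetical, and the checks are deliberately simple.

```javascript
// Sketch only: run independent checks on a tool output before the
// primary model is allowed to see it. All names are hypothetical.
function observe(toolOutput, checks) {
  const failures = checks
    .map((check) => check(toolOutput))
    .filter((result) => result !== null); // null means the check passed
  return { accepted: failures.length === 0, failures, toolOutput };
}

// Example checks: each returns null on success or a reason string on failure.
const hasSource = (out) => (out.sourceUrl ? null : "missing source URL");

const isFresh = (maxAgeDays, now = Date.now()) => (out) => {
  const ageDays = (now - Date.parse(out.fetchedAt)) / 86_400_000;
  return ageDays <= maxAgeDays ? null : `stale: ${Math.floor(ageDays)} days old`;
};
```

A rejected output would be dropped or re-fetched rather than appended to the context, which is one way to keep a single bad scrape from compounding into drift.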

🏛️ The Bottom Line

The LLM-as-OS is a tangible architectural shift.

Infrastructure: Secure, autonomous background processes are the new standard.
Strategy: The "edge" no longer belongs to those who write the best prompts, but to the builders who treat the LLM as a processor, not a text box.

"The prompt is not the product. The system is."

Are you building background agents or still stuck in the chat box?
I’m genuinely curious what architectural assumptions you’re testing. Let’s talk in the comments or find me on LinkedIn.

The Ember That Looks Like Ash

Soumia — Mon, 16 Mar 2026 00:04:27 +0000

Building a time capsule for the thought that returns when you have stopped waiting for it.

I'm on a mission to make sure the most alive thought you've ever had doesn't die in the dark.

Before anything else — what is cited in this article

Everything referenced below that is not direct experience building Cendre.Studio is listed here first. If something is not on this list, it is either general knowledge or my own observation. If you dispute a fact, the checklist is where to start.

Sources used:

[ ] OWASP Password Storage Cheat Sheet (2023) — PBKDF2 iteration count
[ ] NIST FIPS 203 — ML-KEM (formerly CRYSTALS-Kyber) standardisation
[ ] Supabase documentation — pgvector extension availability
[ ] drand.love — tlock time-lock encryption documentation
[ ] DoD 5220.22-M — National Industrial Security Program Operating Manual, data sanitisation standard
[ ] Yann LeCun — "A Path Towards Autonomous Machine Intelligence" (2022, Meta AI)
[ ] Grover's algorithm — quantum search, effect on symmetric key security

Facts I am less than certain about — flagged inline with ⚑:

[ ] ⚑ Grover's algorithm reduces AES-256 to 128-bit effective security — directionally correct, verify the exact framing
[ ] ⚑ drand BLS signatures described as quantum-resistant — verify current drand documentation on this claim
[ ] ⚑ 310,000 as the OWASP 2023 PBKDF2-HMAC-SHA256 minimum — confirm against current cheat sheet, this number moves

The ember that looks like ash

There is a thought that arrives at 3am. It does not knock. It is simply there — specific, complete, already retreating. You do not write it down. You let it go. This is correct.

The thought that matters is not lost by letting go. It is only changed by it.

It comes back not when you call for it — you cannot call it, any more than you can call a particular quality of winter light — but in the middle of something ordinary, on a Tuesday, wearing nothing dramatic. Six months older. Carrying something it did not have the first time it crossed your mind. The forgetting was not failure. The forgetting was the ember going grey on the surface while something stayed warm underneath.

This is what Cendre.Studio is built for. Not capture — return. Not the fear of losing — the moment of finding, from the other side.

The distance between the moment of the thought and the moment of reading it is where the meaning assembles itself. We do not understand what we thought at 3am. We understand it when we find it waiting, and we have become someone different enough to read it truly.

We should lose our thoughts. We will. And then we will remember. Cendre is for that second moment — looking back in time at the person who thought it first.

The problem with every other tool

The tools we have were built for the things we need to do. GTD. Notion. Obsidian. Roam. They assume the thought is a task, a note, a unit of knowledge to be sorted and retrieved. None of them assume it is a dream.

None of them ask: what if some things need to be sealed before they can be truly known?

And none of them are built for the rawness of the material. A dream does not arrive in clean sentences. It arrives in slur and fragment, in phonetic approximation, in the half-language of nearly-asleep. It arrives in the voice of someone who said something they should not have said, and you need to keep it somewhere that is not your own chest.

Most tools correct this. Cendre does not correct anything. Cendre receives the jagged edge and keeps it exactly that sharp.

The architecture

The capsule

The unit of Cendre is not a note. It is a capsule — a sealed container with a lock, a date, and a dark interior that no one reads until the time is right. You make it. You close it. You choose when it opens: a week, a year, five years, or never unless you choose the other thing.

The burning.

Once sealed, the capsule disappears from view. It exists in the vault but cannot be read — not by you, not by anyone — until its date. This is not a trick of the interface. The content is encrypted at the moment of sealing. The capsule is genuinely dark until the hour it was always meant for.

The vault

// Seal sequence

1. Content encrypted with AES-256-GCM
   // Key derived from password via Argon2id
   // 64MB memory, 3 iterations, parallelism 4

2. Encryption key time-locked via tlock
   // drand threshold encryption
   // Key underivable before lock_timestamp
   // ⚑ BLS signatures are pairing-based, so this layer is not quantum-resistant; verify against drand docs

3. Sealed capsule stored in Supabase
   // Only ciphertext server-side
   // Server never reads. Ever.

4. Burn token generated separately
   // One-way destruction
   // Held only by you, never by the server

Against the quantum future

AES-256 is currently secure. Quantum computers running Grover's algorithm reduce its effective key length — ⚑ the standard framing is that 256-bit symmetric keys are reduced to approximately 128-bit effective security under Grover, which remains strong but narrows the margin when sealing something for five years.
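
The arithmetic behind that framing, as a rough sketch: Grover's algorithm searches an unstructured space of N candidates in on the order of the square root of N evaluations, so for a k-bit key:

```latex
N = 2^{k}, \qquad
\text{Grover queries} \approx \frac{\pi}{4}\sqrt{N} = \frac{\pi}{4}\,2^{k/2}
\quad\Longrightarrow\quad
k = 256:\ \frac{\pi}{4}\,2^{128} \approx 2^{127.7}
```

This counts sequential quantum evaluations of the cipher and says nothing about the cost of running them, which is part of why 128-bit effective security is still regarded as strong.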

If you are sealing a thought until 2031, you are betting on the cryptographic landscape of 2031. Cendre is not willing to make that bet with something this private.

What is used instead:

ML-KEM (formerly CRYSTALS-Kyber) — standardised as FIPS 203 by NIST in 2024 — for key encapsulation. A lattice-based scheme designed to resist both classical and quantum attack. The capsule content is encrypted with AES-256-GCM. The AES key is wrapped with ML-KEM. The time-lock uses drand's threshold BLS signatures.

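In practice: the ML-KEM and AES-256 layers are chosen so that a sufficiently powerful quantum computer, if it existed today, could not read a sealed capsule's content. The time-lock is the exception, since BLS signatures are pairing-based and not quantum-resistant.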

The honest moment

Here is what most product articles omit because it is embarrassing and essential.

The manifesto claimed encryption. The capsules table stored title, story_text, echo_reference as plain text. No encryption. The map was not the territory.

Three options were on the table:

Option	Strength	Tradeoff
`pgcrypto` column encryption	Transparent to app	DB admin can still decrypt
Client-side encryption	Server never reads	Search impossible
Soften the wording	Ship fast	Dishonest

The answer was client-side encryption. The hardest option. And then the idea that turned the constraint into a discovery.

The image is the key

The problem with client-side encryption has always been search. If the text is encrypted before the server sees it, the server cannot search it. Homomorphic encryption, searchable symmetric encryption, ORAM — every solution trades one kind of exposure for another, or performs so slowly it is effectively useless at this scale.

Cendre does not search the text. Cendre searches the shape of the text.

When a capsule is created, before the text is encrypted, an image is generated from it. Not an illustration. An abstract visual fingerprint — the semantic content of the thought translated into geometry that can be searched without being read. Generated in the browser, before anything leaves the device.

The image is the key. Not a key that unlocks — a key that finds. The encrypted text stays dark. The image holds its shape in the light.

The image lives in Supabase unencrypted, alongside the ciphertext it cannot read. When you search your archive, you are not searching language. You are searching geometry. The model compares visual embeddings via pgvector — ⚑ pgvector is available as a Supabase extension, confirm current availability and performance characteristics. It finds the capsule whose image is nearest to what you are reaching for. The ciphertext is retrieved. The browser decrypts it. You read what you wrote at 3am, six months ago, in a state you have since forgotten how to reach.

// Image-as-key architecture

1. Text captured in browser
   // Raw, unfiltered, exactly as it arrived

2. Image fingerprint generated from text
   // Abstract visual — not literal
   // Semantic content encoded as geometry
   // Generated client-side before encryption

3. Text encrypted with AES-256-GCM
   // Client-side only — server never reads

4. Supabase receives:
   ciphertext      // unreadable — forever
   image_key       // searchable — says nothing
   created_at      // the only plain metadata
   lock_date       // when it opens

5. Search:
   // Filter by date/year — or
   // Submit query → generate query image
   // Compare embeddings via pgvector
   // Return closest → decrypt in browser

Filtering works on two axes only: date and year. Those are the only fields stored as plain text. If you remember the season — that winter, the week before the conversation, the night it rained for six hours — you can narrow the window. Everything else is geometry.
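
The search step can be sketched in plain JavaScript. In the architecture described above the comparison runs inside Postgres via pgvector; the version below does the same nearest-neighbour lookup in memory so the logic is visible. The capsule shape, the `embedding` field and the function names are assumptions for illustration.

```javascript
// Sketch only: date filter plus nearest-neighbour search over stored embeddings.
// pgvector performs the equivalent comparison server-side; names are hypothetical.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function findNearestCapsule(capsules, queryEmbedding, year = null) {
  // Only plain-text metadata (created_at) is available for filtering.
  const candidates = year === null
    ? capsules
    : capsules.filter((c) => new Date(c.created_at).getUTCFullYear() === year);

  let best = null;
  let bestScore = -Infinity;
  for (const capsule of candidates) {
    const score = cosineSimilarity(capsule.embedding, queryEmbedding);
    if (score > bestScore) {
      best = capsule;
      bestScore = score;
    }
  }
  return best; // the caller fetches its ciphertext and decrypts in the browser
}
```

In SQL, pgvector expresses the same ordering with its cosine distance operator `<=>`, so the loop above collapses into an ORDER BY with a LIMIT of 1.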

Why this matters beyond Cendre

The image-as-key pattern is a general answer to the search problem in any client-side encrypted database. Applicable wherever the content is too intimate for server-side search, too large to download and decrypt wholesale.

Visual embeddings as search proxies for ciphertext. The shape of meaning without the meaning itself. Search without exposure. Retrieval without reading.

PBKDF2 and the backup that survives everything

The key derivation is built on PBKDF2. The password is never stored. Never sent. It is stretched and salted and iterated (⚑ 310,000 iterations for PBKDF2-HMAC-SHA256, an earlier OWASP figure; the cheat sheet has since raised its recommendation to 600,000, so verify this number against current guidance, as it is revised upward periodically) into an encryption key that exists only in the browser for the duration of a session. When the tab closes, the key ceases to exist.

const salt = crypto.getRandomValues(new Uint8Array(16))

const keyMaterial = await crypto.subtle.importKey(
  "raw",
  new TextEncoder().encode(password),
  "PBKDF2",
  false,
  ["deriveKey"]
)

const encryptionKey = await crypto.subtle.deriveKey(
  {
    name: "PBKDF2",
    salt,
    iterations: 310000,  // ⚑ verify against current OWASP
    hash: "SHA-256"
  },
  keyMaterial,
  { name: "AES-GCM", length: 256 },
  false,
  ["encrypt", "decrypt"]
)

// The key never leaves the browser.
// The server receives only ciphertext + salt + iv.
// Without the password, the ciphertext is noise.

Client-side encryption has one catastrophic failure: the forgotten password. No reset. No recovery. The ciphertext is noise without the key, the key is derived from what you know, and if you no longer know it, the thought is gone.

This is the correct design. It is also the design that asks something of you.

The encrypted backup is the insurance. At creation, a portable JSON file is generated containing the ciphertext, salt, IV, and image fingerprint. Sent wherever you choose to keep it. A private email. A USB drive in a box in a drawer. The backup requires the same password. It exists outside the database. It is yours, physically, in the world.

// cendre_backup.json structure
{
  "version": "1.0",
  "created_at": "2026-03-15T03:14:00Z",
  "salt": "<base64>",
  "iv": "<base64>",
  "ciphertext": "<base64>",
  "image_fingerprint": "<base64>",
  "lock_date": "2027-03-15T00:00:00Z"
}
// Nothing that identifies you.
// No plaintext.
// Tells a stranger nothing.
// Tells you everything, if you still hold the password.
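
A minimal sketch of how such a file could be assembled in the browser, assuming the salt, IV, ciphertext and fingerprint are already available as byte arrays. buildBackup and toBase64 are hypothetical helpers, not Cendre's actual code.

```javascript
// Sketch only: assemble the portable backup described above.
// buildBackup and toBase64 are hypothetical helpers.
function toBase64(bytes) {
  let binary = "";
  for (const byte of bytes) binary += String.fromCharCode(byte);
  return btoa(binary);
}

function buildBackup({ salt, iv, ciphertext, imageFingerprint, createdAt, lockDate }) {
  // Field names mirror the cendre_backup.json structure shown above.
  return JSON.stringify({
    version: "1.0",
    created_at: createdAt,
    salt: toBase64(salt),
    iv: toBase64(iv),
    ciphertext: toBase64(ciphertext),
    image_fingerprint: toBase64(imageFingerprint),
    lock_date: lockDate,
  }, null, 2);
}
```

The returned string can be offered as a download or attached to an email; nothing in it is usable without the password.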

The image that survives everything

If the database is lost — company folds, servers go dark, bill unpaid too long — the ciphertext is gone. The backup may be gone. The text, in the worst case, has returned to silence.

But the image fingerprints survive.

They were always stored separately, always treated as search indexes rather than content, always on a different tier with different retention. And an archive of image fingerprints without their ciphertext is not a broken archive.

The data might be gone. The images will stay. And the images were always the truer record — the shape of the thought, not the words it arrived in.

An impressionist archive. You can search it. You can feel what was there. You can see the shape of your own mind across years — the clusters and distances, the warm periods and the cold — without reading a single word that was written. The meaning without the text. The state without the description.

From hashtags to image tags to world models

There is a larger argument inside the image-as-key architecture.

We organised the early internet with words. Then hashtags — words stripped of grammar, reduced to signal. #dream. #3am. The hashtag was the admission that language was already failing us at scale. A word had to be made smaller and bolder to carry the weight of a world.

Then the image. Not illustrating the text — replacing it. A mood board is not a list of words. It is a world you can feel before you can name it.

We are moving from a world indexed by words to a world indexed by worlds. The image-as-key is one small proof.

And then: Yann LeCun — ⚑ citing "A Path Towards Autonomous Machine Intelligence", Meta AI, 2022, listed above — arguing that language was never sufficient. Words are a lossy compression of reality. They describe the surface. A model that learns only from text learns a shadow of the world, not the world.

World Models predict states, not tokens.

A token is a symbol pointing at a thing. A state is the thing — the position of objects, the temperature of a room, the specific quality of a thought at 3am that is different from the same thought at noon.

What Cendre does — translating text into a visual embedding before encrypting it — is a small practical instance of this movement. The word says something. The image holds something. They are not the same thing. The image is closer to the state.

Era	Index type	What it captures
Keyword	Word	Category
Hashtag	Compressed word	Signal
Image tag	Visual	Texture, mood, the almost-said
World model	State	The configuration of experience itself

Cendre is somewhere between image tag and world model. The visual fingerprint of a thought is not the thought. It is closer than a keyword. It is further than the state LeCun is describing. It is an intermediate form — the best available translation between the word you wrote and the state you were in when you wrote it.

The hashtag said: here is a word for this. The image said: here is a shape for this. The world model will say: here is the state of being in which this happened. We are moving in one direction. Cendre is somewhere on that line.

The burning

Destruction is not deletion. Deletion is a polite fiction — the row is flagged, the data hibernates in backups and logs. Deletion says: gone. It means: hidden.

Destruction means gone.

When you burn a capsule, the burn token is submitted; the ciphertext is overwritten three times with random bytes under DoD 5220.22-M (⚑ listed above, verify this is the correct standard to cite for software-based overwriting, some argue it is superseded for solid-state storage); the record is deleted; and a cryptographic proof of destruction is generated and returned to you. The proof confirms the fact of destruction without revealing what was destroyed.

You cannot undo it. Neither can we.

Some stories no longer serve you. The right to destroy them is as important as the right to keep them.

An archive without a burn is a prison with tasteful lighting.

What Cendre accepts

Voice transcription. Raw text. The words that come out slurred from half-sleep. Invented words. Phonetic approximations. Code-switching between languages. Sentences that start and do not end. The fragment that is complete in itself and would be ruined by completion.

The imperfection is the material.

What is built

Feature	Description
Time-locked capsules	Cryptographically enforced — nothing opens early
ML-KEM encryption	Quantum-resistant key encapsulation
Image-as-key search	Visual fingerprint search via pgvector
PBKDF2 key derivation	Password never stored, never sent
Encrypted backup	Portable JSON, yours physically
Permanent burn	DoD 5220.22-M overwrite, cryptographic proof
View-Master theme reel	Seven moods, rotated before capture
Voice + raw text	Everything accepted unfiltered
PWA	Installs on home screen, works offline

The aesthetic

Cendre is built in the visual language of Alexander Calder. Thin black lines — 1px, the weight of wire. Three colors used with the precision of a sculptor deciding where to hang weight: red, deep blue, the warm brown of something that was once fire. Negative space not as absence but as material.

The interface passes one test: if you removed all the text and showed only the lines, shapes, and colors, it should look like a Calder drawing. If it still looks like a startup, the pass is incomplete.

A tool for the imagination should look like where the imagination lives — suspended, always slightly in motion, held by something invisible that you have learned to trust.

What comes next

Shared capsules — sealed between two people, openable only when both agree. A capsule written for someone else: locked until a date you both choose, readable only when you both decide. An archive that grows across years into something that looks less like a database and more like a life.

And eventually: the physical object. A QR code printed and placed in an envelope in a drawer. Scanned in ten years. The digital content still present, waiting exactly where it was left.

What remains, after the fire.
Ce qui reste, après le feu.

Built with: React · Supabase · PBKDF2 · AES-256-GCM · ML-KEM · pgvector · tlock · Framer Motion · Calder

Tags: #webdev #security #showdev #imagination #pwa #encryption #ux #quantumcomputing #creativity #opensource

LinkedIn · humiin.io

The Résumé Is Not Broken. The Search Is.

Soumia — Wed, 04 Mar 2026 22:50:13 +0000

Why finding the right job has never been harder

And why the answer might not be a better filter, but a broader imagination.

There is a particular kind of despair that sets in around the fourth week of a serious job search. You have updated your LinkedIn headline three times. You have tailored your résumé to the point where it no longer feels like yours. You have applied to roles you were overqualified for, underqualified for, and perfectly qualified for—and heard back from almost none of them.

The frustrating part isn't the silence. The frustrating part is the sneaking suspicion that the right job does exist. You just can't find it.

This is not a personal failure. It is a structural one.

The Matching Problem Is Older Than the Internet

For decades, the dominant theory of job hunting rested on a simple logistical premise: get your information in front of the right people. The newspaper classifieds gave way to Monster.com, which gave way to LinkedIn, which ultimately spawned an ecosystem of platforms, aggregators, and applicant tracking systems (ATS) so complex that entire consultancies now exist simply to help candidates navigate them.

But adding more pipework hasn't solved the underlying problem. If anything, it has obscured it.

The core dysfunction is this: job seekers search strictly within the boundaries of what they already know. We type in our last job title. We filter by our current industry. We scan the first two pages of results and, finding nothing that resonates, conclude that the market is dry. What we have actually done is searched a very small corner of a very large space—and called it thorough.

Viewed from the other side of the table, hiring suffers from the mirror image of this problem. Recruiters write job descriptions that describe who they hired last time, not who they need next. They filter résumés using keyword systems that reward people who know which words to use rather than people who can actually do the work. Both sides are searching for each other using outdated maps drawn from memory.

The Vocabulary Problem No One Talks About

In the world of information retrieval, there is a concept known as the "vocabulary mismatch problem." Simply put, the words a user uses to describe what they want are rarely the words a database uses to describe what it has. In a job search, this mismatch isn't just a technical glitch—it is catastrophic, and deeply personal.

A solutions architect with six years of enterprise field experience might never think to search for "technical customer success," "value engineering," or "AI solutions consultant." Yet these are roles that would suit them precisely, roles that are actively hiring, and roles that simply don't appear in the mental model they carry into a search box.

The skills transfer. The language doesn't.

We are, in other words, limited not by what we are capable of, but by what we can imagine ourselves doing. And imagination—particularly regarding one's own professional identity—turns out to be a surprisingly scarce resource when you are under the pressure of an active search.

If You Want One Good Idea, Generate a Hundred

There is an old principle in creative problem-solving—attributed variously to Linus Pauling and Alex Osborn—that the best way to have a good idea is to have many ideas. Quantity, counterintuitively, is how you find quality. You cannot edit your way to an insight you never generated in the first place.

Historically, job searching has never had a version of this. There has been no mechanism for systematic idea generation at the top of the funnel; no way to ask, "What else might fit me?" and receive a serious, considered answer.

Until now, possibly.

The LLM as a Career Mirror

Large language models are not magic. But they do one thing with unusual power: they hold an enormous, associative map of human work—its titles, its functions, its adjacencies, and its history—and they can traverse that map in ways that keyword search fundamentally cannot.

Ask a language model to reason about a person's career trajectory, and it will not simply return the ten most popular jobs with a matching keyword. It will reason about transferable patterns. It will surface roles the candidate never considered, roles that were invented after they started their search, and roles in adjacent industries where their unique combination of skills would be genuinely rare and valuable.

This is not personalization in the shallow sense of showing you more of what you already clicked on. This is expansion. It is the difference between a search engine and a thinking partner.

An Experiment Worth Watching

A new platform called kumiin.io is testing exactly this proposition. The premise is deceptively simple: rather than asking candidates to search, it asks them to be understood—and then surfaces jobs they would not have found on their own.

Its design philosophy is rooted firmly in the "hundred ideas" principle. Most of what the platform surfaces won't be a perfect fit. Some of it will even seem strange. But somewhere in that noise is a signal—a role, an industry, a function—that the candidate had genuinely never considered, or had considered years ago and filed away. The platform's bet is that surfacing that possibility, even once, makes the entire exercise worth it.

It is early days. But the underlying insight is profoundly sound: the true bottleneck in job matching isn't information volume. It is conceptual range.

We know more about what we've done than what we could do. We search in the past tense when the opportunity is, by definition, in the future.

What This Means for Talent Strategy

For HR leaders and talent acquisition professionals, the implications extend far beyond the candidate experience. If the best hires are the ones who bring capabilities an organization didn't even know it needed, then hiring processes optimized entirely around strict job-description matching are systematically filtering out exactly those people.

The homogenizing pressure of keyword-based ATS filtering, combined with candidates who search within narrow, self-defined lanes, creates a market that looks ruthlessly efficient while missing enormous amounts of value on both sides.

Better matching isn't just good for candidates. It is a massive competitive advantage for organizations willing to hire based on potential rather than strict precedent.

The Search Box Was Never the Answer

The job market does not have a data problem. It has a translation problem. A disconnect between what people can do and how work gets described; between who someone has been and who they might become; between the roles that exist and the imagination required to find them.

Language models, used well, are translation engines. They don't just retrieve. They interpret, reframe, and expand.

The résumé is not broken. The search is. And for the first time, there is a tool capable of searching the way a genuinely great career advisor would—broadly, associatively, and entirely without the constraint of what you already know to ask for.

That is not a small thing.


By Soumia — LinkedIn · Portfolio

Are you working on something similar? Drop a comment — I'm curious what you're building and what you're seeing in your own work.