DEV Community: Eber Cruz Fararoni

The Truth Nobody Tells You About AI in 2026: Why Microsoft and Uber Are Pulling Back, and Why Your Strategy Matters More Than Your Speed

Eber Cruz Fararoni — Wed, 03 Jun 2026 16:15:03 +0000

By Eber Cruz Fararoni | ebercruz.com | Software Architect & Builder of Intelligent Systems

TL;DR: 80.3% of enterprise AI projects fail without a trace of ROI. Microsoft has watched as only 4.5% of its 450 million M365 customers pay for Copilot, while its stock fell 34%. In the midst of this disaster, DeepSeek just made a 75% API discount permanent, and Moonshot AI's Kimi K2.6 proves it can match, or surpass, Claude Code 4.6 at a fraction of the cost. I've spent the past few months building Fararoni Flow, a multi-purpose agent orchestrator on Java 25, NATS, and hexagonal architecture with sidecars. This article is what I've learned about why most projects fail, and why orchestration strategy matters more than implementation speed.

1. The Landscape: A Silent Crisis in Enterprise AI Adoption

Generative artificial intelligence arrived promising to revolutionize every aspect of the modern enterprise. In 2025, organizations invested $684 billion in AI worldwide. By December of that year, more than $547 billion of that investment had produced measurable results: exactly zero. Not low returns. Zero. This is not a hypothetical scenario or a pessimistic projection: it is the conclusion of a RAND Corporation analysis of more than 2,400 enterprise AI initiatives.

The reality we face in 2026 is radically different from the marketing narrative sold by the big AI labs. While headlines celebrate each new model launch with ever-grander superlatives, in the trenches of enterprise implementation, the story is different. 42% of companies abandoned at least one AI initiative in 2025, a dramatic jump from just 17% the previous year, according to S&P Global Market Intelligence data. Organizations aren't getting better at AI; they're simply getting faster at recognizing failure.

The problem isn't the technology itself. The models of 2026 are undeniably superior to those of 2024. GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro, and the new Chinese competitors like DeepSeek V4-Pro and Kimi K2.6 represent quantum leaps in reasoning capability, code generation, and agentic task execution. The problem is how companies are trying to leverage these capabilities.

Most organizations made the fundamental mistake of treating AI as a speed race rather than what it truly is: a strategy discipline. They wanted to build everything with prompts, automate complex processes in weeks, and transform their operations without the architectural rigor that any mission-critical enterprise system demands. When these efforts collapsed — as was inevitable — the easy conclusion was "AI is too expensive" or "AI doesn't work for us." The correct conclusion, however, is that orchestration architecture matters more than the language model you choose.

Metric	Data	Source
Global AI investment 2025	$684 billion	RAND Corporation (2025)
AI investment with no measurable ROI	$547 billion (80%)	RAND Corporation (2025)
Companies that abandoned ≥1 AI initiative	42% (vs 17% in 2024)	S&P Global (2025)
GenAI projects with no P&L impact	95%	MIT Project NANDA (2025)
AI projects that never reach production	52%	Gartner (2024)
Average cost of a failed AI project	$7.2 million	S&P Global (2025)
Cost overruns in RAG projects at scale	380% vs pilot projection	MIT Sloan

These numbers shouldn't discourage us. They should focus us. AI isn't failing; careless approaches are failing. And in that recognition lies a massive opportunity for those who understand that the difference between success and failure isn't the model you use, but the architecture with which you orchestrate it.

Figure 1: Hard data from multiple authoritative sources confirms that most enterprise AI projects fail. Source: RAND Corp, MIT NANDA, Gartner, S&P Global, BCG (2024-2025).

2. The Giants Stumble: Microsoft, Uber, and the Real Cost of AI

Microsoft: The 4.5% That Exposed a Structural Fragility

In May 2026, Fortune published a devastating analysis of Microsoft's position in the AI race. The company that had bet the hardest on OpenAI — with an investment exceeding $13 billion — faced an uncomfortable reality: less than 4.5% of its 450 million Microsoft 365 customers were paying for Copilot features. Meanwhile, its Copilot consumer chatbot had reached approximately 20 million weekly active users, a figure that sounds impressive until you compare it to ChatGPT's 900 million users.

The situation of GitHub Copilot — which was the first major commercial success of AI coding — illustrates the trend even more clearly. From being the undisputed leader in AI coding tools, it has been supplanted first by Cursor ($2 billion ARR, the fastest-growing SaaS ever recorded) and then by Claude Code (46% of senior engineers name it their "most loved" tool, with an NPS of 54).

Microsoft's stock fell 34% from its all-time high in October 2025 to March 2026, despite its AI-related revenues in Azure having more than doubled. Investors realized something that marketing executives still don't want to admit: having the most famous model doesn't guarantee a sustainable business platform. The company announced $190 billion in capital expenditures for 2026 — more than triple what it spent in 2024 — in a desperate bid to recover lost ground.

Microsoft's Chief Commercial Officer, Judson Althoff, publicly acknowledged several errors: using the same "Copilot" name for both consumer and enterprise versions, which created massive confusion; incentivizing sales representatives to promote the free version when only the premium version delivered real value; and underestimating the speed at which AI technology was evolving. When Anthropic launched Claude Code in 2025 (capable of writing complete programs autonomously from a description) and then Claude Cowork in January 2026, the "copilot" model that Microsoft had built suddenly felt a generation behind.

Uber: Closing Labs, Opening the Door to Learning

Uber's case is different but equally instructive. During the COVID-19 pandemic, Uber made the decision to close its Uber AI Labs as part of a strategic cost reduction. Although this decision was driven by the need to preserve capital during an unprecedented crisis for the ride-sharing industry, it illustrates a pattern we've seen repeat: when AI costs meet financial reality, experimental projects are the first to fall.

What makes these cases particularly revealing is not that companies are abandoning AI altogether — neither of them is doing so — but that they are learning an expensive lesson: AI is not a product you buy, it is a capability you build. Microsoft is not abandoning AI; it is reconfiguring its strategy to be model-agnostic, allowing its customers to choose between GPT, Claude, Gemini, or any other model within its platform. Uber did not abandon AI; it redirected resources toward AI applications more directly tied to its core business.

The conclusion is not that "AI is too expensive." The conclusion is that without a well-designed orchestration architecture, without a gradual implementation strategy, and without a deep understanding of where AI generates real value versus where it only generates impressive demos, costs skyrocket and results evaporate.

3. The Price War Nobody Saw Coming: DeepSeek, Kimi, and the New Geography of Cost

DeepSeek V4-Pro: The 75% That Changed Everything

In May 2026, DeepSeek — the Chinese lab that in January 2025 had already shaken the industry's foundations with its R1 model — announced something many considered impossible: a 75% discount on its V4-Pro API, which would also become permanent. This is not a temporary promotion or a marketing trick. It is a structural cost reduction based on real gains in computational efficiency.

The numbers are staggering. DeepSeek's V4-Pro model, with 1.6 trillion parameters and a 1-million-token context window, dropped to:

Model	Input / 1M tokens	Output / 1M tokens	Cache Hit / 1M
DeepSeek V4-Pro (post-75%)	$0.435	$0.87	$0.003625
DeepSeek V3.2	$0.28	$0.42	$0.028
GPT-5.4	$2.50	$15.00	$0.25
Claude Opus 4.6	$5.00	$25.00	$0.50
Kimi K2.6	$0.60	$2.50	$0.10

To put it in perspective: a task that costs $0.87 with DeepSeek V4-Pro in output tokens costs $15.00 with GPT-5.4 and $25.00 with Claude Opus 4.6. That represents savings of roughly 96.5% versus Claude Opus and 94% versus GPT-5.4 (1 - 0.87/25 and 1 - 0.87/15). The V4-Pro cache hit at $0.003625 per million tokens is practically free for workloads with repetitive system prompts.
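The comparison can be reproduced directly from the price table. The following is a minimal sketch, not part of any vendor SDK: the `Pricing` record, the `PriceComparison` class, and the workload in `main` (200k fresh input tokens, 800k cached tokens, 1M output tokens) are hypothetical names and numbers chosen for illustration; only the per-million prices come from the table.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Per-million-token prices in USD, copied from the table above. */
record Pricing(double inputPerM, double outputPerM, double cacheHitPerM) {

    /** Cost in USD of a workload given raw token counts. */
    double cost(long freshInputTokens, long cachedInputTokens, long outputTokens) {
        return (freshInputTokens * inputPerM
                + cachedInputTokens * cacheHitPerM
                + outputTokens * outputPerM) / 1_000_000.0;
    }
}

class PriceComparison {
    static final Map<String, Pricing> PRICES = new LinkedHashMap<>();
    static {
        PRICES.put("DeepSeek V4-Pro", new Pricing(0.435, 0.87, 0.003625));
        PRICES.put("Kimi K2.6", new Pricing(0.60, 2.50, 0.10));
        PRICES.put("GPT-5.4", new Pricing(2.50, 15.00, 0.25));
        PRICES.put("Claude Opus 4.6", new Pricing(5.00, 25.00, 0.50));
    }

    /** Fraction saved by paying `cheap` instead of `expensive`, between 0 and 1. */
    static double savings(double cheap, double expensive) {
        return 1.0 - cheap / expensive;
    }

    public static void main(String[] args) {
        // Hypothetical workload: a large cached system prompt plus 1M output tokens.
        double baseline = PRICES.get("DeepSeek V4-Pro").cost(200_000, 800_000, 1_000_000);
        for (Map.Entry<String, Pricing> e : PRICES.entrySet()) {
            double cost = e.getValue().cost(200_000, 800_000, 1_000_000);
            System.out.printf("%-16s $%8.4f  saved by V4-Pro: %5.1f%%%n",
                    e.getKey(), cost, 100 * savings(baseline, cost));
        }
    }
}
```

Separating fresh input, cached input, and output tokens matters because the three are billed at very different rates; a workload dominated by a repeated system prompt shifts most of its input cost onto the cache-hit price.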

Sanchit Vir Gogia, CEO of Greyhound Research, explained the logic behind this reduction: "V4-Pro was designed to reduce the cost of long-context inference, operating at approximately a quarter of the compute per token and a tenth of the memory footprint of its predecessor at very long contexts. That is why the price reduction is permanent and not promotional. It is not a discount. It is an efficiency gain that is passed on to the customer."

Figure 2: DeepSeek V4-Pro and Kimi K2.6 offer prices 8-30x lower than equivalent Western models, with comparable performance on coding benchmarks. Source: Official API prices, May 2026.

Kimi K2.6: The Open-Source That Wins on Benchmarks and Loses on the Bill

On April 20, 2026, Moonshot AI launched Kimi K2.6, an open-source 1-trillion-parameter model with a Mixture-of-Experts (MoE) architecture that activates approximately 32 billion parameters per token. The numbers it presented are as impressive as the prices:

SWE-Bench Pro: 58.6% — above GPT-5.4 (57.7%) and Claude Opus 4.6 (53.4%)
HLE-Full (with tools): 54.0% — leading among all compared models
BrowseComp (Agent Swarm): 86.3% — dominating in multi-agent agentic tasks
DeepSearchQA F1: 92.5% — superior to GPT-5.4 (78.6%) and Claude Opus 4.6 (91.3%)

On independent benchmarks such as SWE-Bench Verified, K2.6 reached 80.2%, just behind Claude Opus 4.6 at 80.8% (Opus 4.7 subsequently reached 87.6%).

The price: $0.60 per million input tokens and $2.50 per million output tokens. That makes it 8.3x cheaper on input and 10x cheaper on output than Claude Opus 4.6 at the prices in the table above. When Cursor, the fastest-growing AI coding company in history with $2 billion ARR, built its Composer 2 feature on Kimi K2.5 (the previous version), it was sending a clear message: performance does not require paying premium prices.

Figure 3: Left: Evolution of DeepSeek V4-Pro price with permanent 75% discount. Right: Real monthly cost using intelligent routing versus a single model. Source: API data and public benchmarks, May 2026.

The Reality That Costs Reveal

The price difference is not a minor accounting detail. It is a strategic transformation of the landscape. When Ideas2IT ran a controlled test — building the same Flask application with SQLite, HTML frontend, CRUD operations, unit tests, and Git configuration — the results were revealing:

Model	Cost per execution	Output quality
DeepSeek V3.2	~$0.15	Good (better UI)
Kimi K2.5	~$0.33	Production-ready
Claude Sonnet 4.6	~$1.66	Production-ready
Claude Opus 4.6	~$75.64/month	Production-ready (superior on complex architecture)

All three cloud models completed the task. Engineers who reviewed the results blind could not consistently identify which model produced which output. The 11x cost difference between DeepSeek V3.2 and Claude Sonnet 4.6 did not translate into an 11x quality difference. It translated into an 11x difference on the bill.

At team scale, the implications are enormous. A team of 10 engineers using Claude Code with Claude Sonnet 4.6 spends approximately $444.40 per month on API tokens alone. The same team using Kimi K2.5 would spend $78.60. With DeepSeek V3.2, barely $24.00. And that doesn't even consider that 82% of a developer's daily work — PR review, refactoring, testing, standard debugging — does not require the maximum reasoning capability of a premium model.

Tyler Folkman, an independent developer who built a model router for his personal workflow, documented the most extreme case: in 2,415 real AI turns, he spent $76.77 using a routing system that sent each task to the appropriate model. The same volume of work would have cost $1,272.77 if he had used GPT-5.5 for everything. A 94% savings, achieved simply by not "pretending that every task is the same task."

Figure 4: Left: The dramatic shift from companies building AI in-house to buying solutions (Menlo Ventures). Right: Real monthly cost per engineer using different models within Claude Code. Source: Ideas2IT, JetBrains AI Pulse Survey, API data.

4. Claude Code 4.6 Is Still King... But the Throne Is Wobbling

I don't want my argument to be misinterpreted. Claude Code 4.6, especially in its Opus tier, remains the gold standard for complex coding tasks. Its 1-million-token context allows loading entire monolithic repositories in a single session. Its 91.3% score on GPQA Diamond (graduate-level science questions validated by domain experts) is unmatched for deep scientific reasoning. Its hallucination rate is the lowest in the industry, with an AA-Omniscience index of +10 versus Kimi K2.5's -11.

For complex architecture work, large-scale refactorings, legacy code comprehension, and truly novel problems where the output cannot be easily verified with automated tests, Claude Opus 4.6 justifies its premium price. There is no substitute for the peace of mind of knowing that the model has the highest probability of generating a correct answer when "correct" cannot be verified with a unit test.

However, here is the truth that many don't want to hear: 80% of a software engineer's work is not complex architecture or novel problems. It is REST API development, unit test generation, frontend scaffolding, standard error debugging, and pull request review. For that 80%, Kimi K2.6 and DeepSeek V4 are not just "good enough" — on many coding benchmarks, they are better.

The Pragmatic Engineer survey from February 2026, which consulted approximately 906 senior engineers with a median of 11-15 years of experience, revealed a fascinating pattern: 46% of senior engineers named Claude Code as their "most loved" tool, versus 19% for Cursor and 9% for GitHub Copilot. JetBrains confirmed these findings with hard loyalty data: a CSAT of 91% and an NPS of 54 for Claude Code, the highest in the category.

Figure 5: Left: Workplace adoption market share (JetBrains, Jan 2026). Right: Satisfaction among senior engineers — Claude Code leads widely despite its lower market share. Source: JetBrains AI Pulse Survey, Pragmatic Engineer Survey.

But here is the critical nuance: although Claude Code is the most loved tool, it only has 18% workplace adoption versus 29% for GitHub Copilot. And among small startups (fewer than 50 people), Claude Code reaches 75% adoption, while in enterprises with 10,000+ employees, Copilot dominates with 56%. This bifurcated pattern reveals something profound: startups choose based on technical capability; large enterprises choose based on ease of acquisition. As today's startups become tomorrow's enterprises, their technology preferences will follow those paths.

My personal approach, after months of using Claude Code and Gemini, has evolved into what I call "conscious routing": I use Claude Code as my work interface (its agentic loop, its terminal integration, its ability to maintain context across long sessions), but I route model calls based on task complexity. For routine work, Kimi K2.6 or DeepSeek V4. For high-complexity tasks where I cannot tolerate errors, Claude Opus 4.6. This hybrid approach gives me 90% of Opus quality at 15% of the cost.
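A routing policy of this kind can be sketched in a few lines. This is an illustration under stated assumptions, not the router I run: the three-level `Complexity` enum, the `Task` and `Model` records, and the mapping in `route` are hypothetical, and the cost model counts only output tokens at the prices from the pricing table.

```java
import java.util.List;

/** Hypothetical three-level classification; how a task gets its level is out of scope here. */
enum Complexity { ROUTINE, MODERATE, CRITICAL }

record Task(String description, Complexity complexity, long outputTokens) {}

record Model(String name, double outputPricePerM) {}

class ConsciousRouter {
    // USD per 1M output tokens, taken from the pricing table earlier in the article.
    static final Model DEEPSEEK_V4 = new Model("DeepSeek V4-Pro", 0.87);
    static final Model KIMI_K26 = new Model("Kimi K2.6", 2.50);
    static final Model OPUS_46 = new Model("Claude Opus 4.6", 25.00);

    /** Assumed policy: the cheapest model whose tier matches the task's risk. */
    static Model route(Task task) {
        return switch (task.complexity()) {
            case ROUTINE -> DEEPSEEK_V4;
            case MODERATE -> KIMI_K26;
            case CRITICAL -> OPUS_46;
        };
    }

    static double cost(Model model, Task task) {
        return task.outputTokens() * model.outputPricePerM() / 1_000_000.0;
    }

    /** Total cost when every task goes to the model chosen by route(). */
    static double routedCost(List<Task> tasks) {
        return tasks.stream().mapToDouble(t -> cost(route(t), t)).sum();
    }

    /** Total cost when every task goes to the same model. */
    static double singleModelCost(List<Task> tasks, Model model) {
        return tasks.stream().mapToDouble(t -> cost(model, t)).sum();
    }

    public static void main(String[] args) {
        List<Task> day = List.of(
                new Task("review pull request", Complexity.ROUTINE, 400_000),
                new Task("generate unit tests", Complexity.ROUTINE, 300_000),
                new Task("refactor service layer", Complexity.MODERATE, 200_000),
                new Task("redesign event schema", Complexity.CRITICAL, 100_000));
        System.out.printf("routed: $%.2f, Opus only: $%.2f%n",
                routedCost(day), singleModelCost(day, OPUS_46));
    }
}
```

The hard part in practice is the classification step, which this sketch takes as given. The actual quality and cost ratio depends entirely on how tasks are distributed across the tiers.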

5. Why Strategy Beats Speed: Lessons from a System Builder

The Error of Prompts as Architecture

There's a phrase I've been repeating in conversations with fellow architects: "If you go too fast trying to build everything with prompts, you will surely fail. If you don't use AI, you will go slow and steady — very, very slow — but you will lose the ability to fail fast and correct."

This false dichotomy — between "moving fast and breaking things with AI" and "not using AI at all" — is the root of many failures. Teams that "go fast with prompts" build impressive demos that collapse when they face real data, edge cases, and compliance requirements. Teams that reject AI entirely lose the competitive advantage of rapid iteration that the technology provides.

The solution is not to choose an extreme. The solution is to understand that AI is a capability that is orchestrated, not a product that is consumed. An influential article in the OpenAI developer community from July 2025 titled "Prompt Engineering Is Dead, and Context Engineering Is Already Obsolete" captured this transition perfectly. Pure prompt engineering suffers from intrinsic fragility: minor changes in input, model versions, or even random drift can destroy the effectiveness of a carefully tuned prompt. It doesn't scale. It isn't maintainable. It doesn't provide consistent reasoning in complex workflows.

The future, the article argues, is not in more elaborate prompts or more extensive context. It is in automated workflow architectures where language models are components in a larger system, not the system itself. This is exactly what my orchestrator Fararoni Flow is designed to do.

The Five Failure Patterns I've Observed

After years of building enterprise systems and the past few months working intensively with AI agents, I've identified five recurring failure patterns:

1. Obsession with Model Selection: Teams that spend weeks comparing Claude vs GPT vs Gemini, optimizing for minor differences in output quality, while their evaluation coverage remains weak and their input/output specifications remain vague. Models improve faster than comparison cycles run. By the time you finish evaluating, a new version has been released that invalidates your results.

2. Cost Blindness: Running the most expensive model for every request regardless of complexity, without unit economics tracking, and without monitoring token usage patterns. This leads to surprise bills that can derail projects and kill ROI. The cost of AI is never just the model call: it includes retrieval, orchestration, retries, and more.

3. Chatbots Without Differentiation: Building generic chat interfaces without domain-specific context, specialized workflows, or unique capabilities. These solutions compete directly with ChatGPT, Claude, and other generic tools that users already have. If your competitive advantage is "we have a chatbot too," prepare for disappointment.

4. Over-Engineering of Tool Calling: Creating elaborate tool schemas for simple operations, defining tools for basic computation or data formatting, and building complex orchestration when simple prompt engineering would work. Every tool call adds latency and potential failure points.

5. Ignoring End-User Constraints: Voice interfaces for noisy environments, high-resolution video processing for users with limited bandwidth, and complex multi-step workflows for users with limited time. Technical capability is not equal to user value.

6. Fararoni Flow: Why I'm Building an Orchestrator in 2026

The Vision: Sovereignty Over Your Agents

In the midst of this chaotic landscape — where costs vary by orders of magnitude, where models improve every month, where platforms close walled gardens and open others — I decided to build something different. Fararoni Flow is a multi-purpose agent orchestrator born from a simple premise: in a world where everything is constantly changing, the only sustainable advantage is the ability to adapt quickly.

The interface I share in the opening image shows the main dashboard: 7,242,574 tokens processed, 395 executions, 277 completed, 1,247 LLM calls, with 31 active agents in the system. These are not demo numbers. They are real numbers from a system I use daily to automate technology intelligence workflows, email processing, briefing generation, and complex multi-step task orchestration.

Why Java 25, NATS, and Hexagonal Architecture with Sidecar

The technology stack choice is not accidental. Each decision responds to a specific requirement of agent systems at scale:

Java 25 (LTS): The latest Long-Term Support version of Java brings critical improvements for agent systems:

Stabilized Virtual Threads: handle 1.2 million requests per second in recent benchmarks, surpassing WebFlux's 900K. For an orchestrator that coordinates dozens of concurrent agents, this is fundamental.
AOT Method Profiling (JEP 515): reduces warm-up time by 15-25%, critical for microservices and serverless functions where cold-start matters.
Compact Object Headers (JEP 519): reduces memory usage by up to 20% according to Oracle and Amazon tests on hundreds of production services.
Scoped Values (JEP 506): allows sharing data across concurrent tasks without ThreadLocal, essential for shared context between agents.

NATS: The messaging system I chose as the backbone of communication between agents. NATS handles millions of messages per second with sub-millisecond latency. Its pub/sub model allows agents to communicate in a decoupled way: an agent publishes an event, interested subscribers process it. There is no direct coupling, no complex message queues. It is the messaging system used by companies like VMware, Ericsson, and SAP in production at massive scale.

Hexagonal Architecture (Ports & Adapters): This pattern is the backbone of Fararoni Flow. Each agent's business logic is completely isolated from external concerns: model API calls, data persistence, authentication, logging. If tomorrow I want to switch from Claude to Kimi, from PostgreSQL to MongoDB, or from REST to gRPC, the agent's logic doesn't change. Only the adapter changes. In a field where the underlying technology evolves weekly, this decoupling is not a luxury: it is a survival necessity.

Sidecar Pattern: Each main agent in Fararoni Flow travels accompanied by sidecar containers that handle cross-cutting concerns: structured logging, telemetry, health checks, and secure communication. This pattern, popularized by Kubernetes and used by Google, Uber, Airbnb, and eBay, allows the main container to focus exclusively on its business logic while the sidecars handle "how" it runs. I can update the logging system without touching an agent. I can change the communication protocol without affecting mission logic.

MCP (Model Context Protocol): With 110 million monthly SDK downloads and adoption by Anthropic, OpenAI, Google, and Microsoft, MCP has become the de facto standard for agent-tool integration. Fararoni Flow uses MCP to connect agents with external tools: email reading (IMAP), web search, command execution, and database access. MCP collapses the N×M integration problem (N tools × M AI platforms) to N+M. You build one MCP server for your tool, and any compatible agent can use it.
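The ports-and-adapters decoupling described above can be illustrated with a small sketch. The names `CompletionPort`, `SummaryAgent`, `ClaudeAdapter`, and `KimiAdapter` are hypothetical, and the adapters are stubs that return canned strings where a real system would call a vendor API.

```java
/** Port: what the agent needs from a language model, expressed in domain terms. */
interface CompletionPort {
    String complete(String prompt);
}

/** Core logic: depends only on the port, never on a vendor SDK. */
class SummaryAgent {
    private final CompletionPort model;

    SummaryAgent(CompletionPort model) {
        this.model = model;
    }

    String summarize(String email) {
        return model.complete("Summarize in one sentence: " + email);
    }
}

/** Adapter stub; a real one would wrap an HTTP client for the provider. */
class ClaudeAdapter implements CompletionPort {
    @Override
    public String complete(String prompt) {
        return "[claude] handled " + prompt.length() + " chars";
    }
}

/** Second adapter stub, interchangeable with the first. */
class KimiAdapter implements CompletionPort {
    @Override
    public String complete(String prompt) {
        return "[kimi] handled " + prompt.length() + " chars";
    }
}

class HexagonalDemo {
    public static void main(String[] args) {
        String email = "Quarterly report attached, please review by Friday.";
        // Swapping providers changes one constructor argument, not the agent.
        System.out.println(new SummaryAgent(new ClaudeAdapter()).summarize(email));
        System.out.println(new SummaryAgent(new KimiAdapter()).summarize(email));
    }
}
```

Because `SummaryAgent` sees only the port, it can also be unit tested with a fake adapter, with no network access and no API key.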

DEFCON Levels: Resilience by Design

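A unique feature of Fararoni Flow is the DEFCON level system (0-5) for each mission. Inspired by the U.S. military's DEFCON alert scale, these levels define the alert state and resources assigned to each task. The numbering runs the opposite way from the military scale, where 1 is the most severe: in Fararoni Flow, higher numbers mean tighter restrictions.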

DEFCON 0: Normal mission. Standard execution with automatic retries.
DEFCON 1: Intensified monitoring. Each step is verified before continuing.
DEFCON 2: Human escalation required. The agent can suggest but cannot execute critical changes.
DEFCON 3: Read-only. The agent can investigate but cannot modify.
DEFCON 4: Passive observation mode. Monitoring only, no action.
DEFCON 5: Mission aborted. All operations stopped.

This system resolves one of the fundamental problems of autonomous agents: how do you delegate authority without losing control. Not all tasks require the same level of supervision. A "generate an email summary" mission can run at DEFCON 0. A "modify production configuration" mission should require at least DEFCON 2.
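One way to encode these levels is as an enum that maps each level to the actions it permits. This is a sketch of the idea, not the Fararoni Flow implementation: the `Action` categories and the `MissionGuard` class are hypothetical, and allowing non-critical modifications at level 2 is my reading of "cannot execute critical changes".

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical action categories, ordered from least to most invasive. */
enum Action { OBSERVE, INVESTIGATE, MODIFY, CRITICAL_CHANGE }

enum Defcon {
    LEVEL_0(EnumSet.allOf(Action.class)),                                   // normal mission
    LEVEL_1(EnumSet.allOf(Action.class)),                                   // same powers, step-by-step verification
    LEVEL_2(EnumSet.of(Action.OBSERVE, Action.INVESTIGATE, Action.MODIFY)), // critical changes need a human
    LEVEL_3(EnumSet.of(Action.OBSERVE, Action.INVESTIGATE)),                // read-only
    LEVEL_4(EnumSet.of(Action.OBSERVE)),                                    // passive observation
    LEVEL_5(EnumSet.noneOf(Action.class));                                  // mission aborted

    private final Set<Action> allowed;

    Defcon(Set<Action> allowed) {
        this.allowed = allowed;
    }

    boolean permits(Action action) {
        return allowed.contains(action);
    }
}

class MissionGuard {
    /** Throws if the mission's level forbids the action; returns normally otherwise. */
    static void check(Defcon level, Action action) {
        if (!level.permits(action)) {
            throw new IllegalStateException(action + " is not permitted at " + level);
        }
    }

    public static void main(String[] args) {
        check(Defcon.LEVEL_0, Action.MODIFY); // allowed: email summary mission
        try {
            check(Defcon.LEVEL_2, Action.CRITICAL_CHANGE); // production config change
        } catch (IllegalStateException e) {
            System.out.println("Escalate to a human: " + e.getMessage());
        }
    }
}
```

Calling the guard before every tool invocation keeps the authority decision in one place instead of scattering it across agents.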

The System in Numbers

Metric	Value	Context
Tokens processed	7,242,574	Cumulative since launch
Total executions	395	Missions initiated
Successfully completed	277 (70.1%)	Current success rate
LLM Calls	1,247	Language model calls
Active agents	31	Specialized agents available
Missions created	369	Mission library
Backend	WebFlux/Netty	Reactive stack on Java 25
Native image	GraalVM Native	AOT compilation for sub-50ms cold start

These numbers reflect a system in active use, not a prototype. The 70% success rate sounds modest until you consider that it includes experimental missions, agents in development, and tasks that deliberately explore the limits of what is possible. In agentic AI, 70% success with rapid iteration is more valuable than 95% with months-long development cycles.

7. What I've Learned: Five Principles for Building with AI in 2026

After months of building Fararoni Flow and observing the enterprise AI landscape, these are the five principles that would guide my approach if I were starting today:

Principle 1: Don't Buy the Hype, Buy Flexibility

The AI model market changes weekly. What is "the best model" today will be surpassed in three months. Invest in architecture that allows you to change models without rewriting your application. A model routing system is not a luxury: it is insurance against obsolescence. Tyler Folkman demonstrated that a well-designed model router can reduce your costs by 94% while maintaining quality. That is not optimization: it is financial survival.

Principle 2: 80% of Your Work Doesn't Need the 99% Model

For most daily development tasks — APIs, tests, scaffolding, standard debugging — Kimi K2.6 and DeepSeek V4-Pro offer equal or superior performance to Claude Sonnet 4.6 at a fraction of the cost. Reserve Opus 4.6 for complex architecture, large-scale refactorings, and problems where an error is costly. A hybrid approach gives you 90% of Opus quality at 15% of the cost.

Principle 3: Observability Before Autonomy

You cannot improve what you cannot measure. Fararoni Flow records every token consumed, every tool call, every DEFCON state change, and every mission outcome. Without this telemetry, you would be flying blind. Agent autonomy is powerful but dangerous without complete observability. Start with granular monitoring before adding autonomy.

Principle 4: Start with Missions, Not with Agents

The most common mistake I see is building "cool agents" and then looking for a purpose for them. The correct approach is to identify a specific business mission — "I need a daily technology briefing based on my emails and RSS feeds" — and then design the minimum agent necessary to fulfill it. Fararoni Flow started with a single mission: process emails and generate summaries. The current 31 agents and 369 missions are the result of organic iteration, not centralized planning.

Principle 5: Persistence > Speed

Building AI systems is hard. It is. There are moments when a model that worked perfectly yesterday starts behaving erratically today. There are pipelines that break due to changes in external APIs. There are costs that spike because you forgot a rate limit in a loop. The difference between those who succeed and those who abandon is not intelligence or resources. It is the persistence of continuing to iterate when everything seems broken. If you persist, the results come. Not always on the timeline you expect, but they come.

8. The Technical Landscape: How to Build an Orchestrator That Scales

Event Architecture with NATS

Fararoni Flow is built on an event architecture where NATS acts as the central nervous system. When a user creates a mission, a mission.created event is published. Subscribed agents evaluate whether they can handle that mission based on their declared capabilities. If an agent accepts, it publishes mission.claimed and begins execution. Each step within the mission generates events: step.started, step.completed, step.failed, step.retried.

This model has several critical advantages:

Decoupling: Agents don't know about each other. They only know how to respond to events.
Scalability: I can add more instances of any agent without changing code.
Resilience: If an agent fails, its unacknowledged events are redelivered. This relies on JetStream acknowledgements, since core NATS pub/sub is at-most-once.
Observability: Every event is logged, enabling complete replay of any execution.
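The event flow can be shown with an in-memory stand-in for the broker. To be clear about the assumptions: `EventBus` below is a toy written for this illustration and is not the NATS client API, `EmailAgent` and its capability check are hypothetical, and only the subject names (`mission.created`, `mission.claimed`, `step.started`, `step.completed`) come from the description above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

record Event(String subject, String missionId, String payload) {}

/** In-memory stand-in for the broker; NOT the NATS client API. */
class EventBus {
    private final Map<String, List<Consumer<Event>>> subscribers = new HashMap<>();
    /** Every published event in order, which is what makes replay possible. */
    final List<Event> log = new ArrayList<>();

    void subscribe(String subject, Consumer<Event> handler) {
        subscribers.computeIfAbsent(subject, s -> new ArrayList<>()).add(handler);
    }

    void publish(Event event) {
        log.add(event);
        // Iterate over a copy so handlers may subscribe while being dispatched.
        for (Consumer<Event> handler
                : List.copyOf(subscribers.getOrDefault(event.subject(), List.of()))) {
            handler.accept(event);
        }
    }
}

/** Hypothetical agent: claims only missions that mention email. */
class EmailAgent {
    EmailAgent(EventBus bus) {
        bus.subscribe("mission.created", created -> {
            if (!created.payload().contains("email")) {
                return; // outside this agent's declared capabilities
            }
            String id = created.missionId();
            bus.publish(new Event("mission.claimed", id, "email-agent"));
            bus.publish(new Event("step.started", id, "fetch-inbox"));
            bus.publish(new Event("step.completed", id, "fetch-inbox"));
        });
    }
}

class EventFlowDemo {
    public static void main(String[] args) {
        EventBus bus = new EventBus();
        new EmailAgent(bus);
        bus.publish(new Event("mission.created", "m-1", "summarize email inbox"));
        bus.log.forEach(e -> System.out.println(e.subject() + " " + e.payload()));
    }
}
```

The publisher of `mission.created` never references `EmailAgent`, which is the decoupling property: adding a second agent means adding a second subscriber, nothing else.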

GraalVM Native: Speed That Matters

Native image compilation with GraalVM reduces the application's cold-start to less than 50 milliseconds. In an agent system where functions can scale to zero when there is no work and activate on demand, this is the difference between an instant response and a frustrating experience. Spring Boot 3.x integrates native GraalVM support. For deployments that stay on the JVM instead, Java 25's AOT method profiling (JEP 515) shortens the warm-up needed to reach peak performance after startup.

WebFlux/Netty: Concurrency Without Compromise

The choice of Spring WebFlux on Netty over the traditional platform thread model is not accidental. Benchmarks from 2025-2026 show that Virtual Threads on Netty outperform pure WebFlux in approximately 45% of scenarios, especially under high concurrent load. For an orchestrator that handles multiple agents running simultaneously, each with their own external API calls, the ability to handle tens of thousands of concurrent connections with low latencies is essential.

MCP: The Universal Integration Layer

Fararoni Flow implements MCP servers for all its external integrations: IMAP for email, connectors for databases, clients for language model APIs, and adapters for file systems. This means that any MCP-compatible tool — and in 2026 that includes Claude Desktop, VS Code Copilot, Cursor, and dozens of IDEs and platforms — can use Fararoni Flow agents as tools.

The image I shared at the beginning shows the "MCP Connections" interface in the sidebar, with IMAP and Gmail connectors active. These connections are the bridge between the world of autonomous agents and the existing information systems that enterprises already use.

The Sidecar Pattern in Detail

The sidecar pattern is one of Fararoni Flow's most important architectural components and deserves a detailed explanation. Inspired by the Kubernetes model where each pod can contain multiple containers sharing resources, the sidecar pattern in our context means that each main agent travels accompanied by auxiliary containers that handle cross-cutting functions.

Imagine an agent specialized in email analysis. Its main container contains exclusively the business logic: how to parse emails, how to identify important topics, how to generate summaries. But it travels with three sidecars:

Logging Sidecar: Captures every event from the agent — task start, API call, result, error — and sends it to a centralized structured logging system. The main agent doesn't know it exists. It just does its job.

Telemetry Sidecar: Collects performance metrics — response time, tokens consumed, success rate — and exposes them in Prometheus format for scraping. If the agent starts consuming more tokens than normal, the alarm triggers.

Secure Communication Sidecar: Handles TLS encryption, mutual authentication, and connection retries. The main agent speaks HTTP without thinking about security; the sidecar ensures that communication is secure.

This separation has massive benefits for operations. I can update the logging system for the entire cluster without touching a single agent. I can change the telemetry protocol without affecting business logic. I can rotate security certificates centrally. In an ecosystem where I expect to have dozens of different agents, each specialized in a domain, this operational consistency is not optional: it is the foundation on which reliability is built.
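As an illustration of what the telemetry sidecar exposes, here is a minimal sketch that renders counters in the Prometheus text exposition format. The class name `AgentMetrics` and the metric names `agent_tokens_consumed_total` and `agent_tasks_total` are hypothetical; a production sidecar would more likely use an existing Prometheus client library than build the strings by hand.

```java
/** Hypothetical per-agent counters, rendered as Prometheus text on scrape. */
class AgentMetrics {
    private final String agent;
    private long tokensConsumed;
    private long tasksSucceeded;
    private long tasksFailed;

    AgentMetrics(String agent) {
        this.agent = agent;
    }

    /** Records one finished task and the tokens it consumed. */
    void recordTask(long tokens, boolean success) {
        tokensConsumed += tokens;
        if (success) {
            tasksSucceeded++;
        } else {
            tasksFailed++;
        }
    }

    /** One "# TYPE" line per metric, then one sample line per label set. */
    String scrape() {
        return "# TYPE agent_tokens_consumed_total counter\n"
                + "agent_tokens_consumed_total{agent=\"" + agent + "\"} " + tokensConsumed + "\n"
                + "# TYPE agent_tasks_total counter\n"
                + "agent_tasks_total{agent=\"" + agent + "\",outcome=\"success\"} " + tasksSucceeded + "\n"
                + "agent_tasks_total{agent=\"" + agent + "\",outcome=\"failure\"} " + tasksFailed + "\n";
    }

    public static void main(String[] args) {
        AgentMetrics metrics = new AgentMetrics("email-analysis");
        metrics.recordTask(1200, true);
        metrics.recordTask(300, false);
        System.out.print(metrics.scrape());
    }
}
```

Exposing raw counters and leaving rates and thresholds to the monitoring system keeps the sidecar simple: the "more tokens than normal" alarm becomes an alerting rule rather than code inside the agent.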

The sidecar pattern also solves a practical problem for multidisciplinary teams. The main agent can be written in Java (my preferred language for complex business logic) while a natural language processing sidecar can be in Python, leveraging the rich NLP libraries that the Python ecosystem offers. The logging sidecar could be in Rust, maximizing resource efficiency. Each component uses the appropriate language for its purpose, communicating through well-defined interfaces.

NATS as Nervous System: Beyond Simple Pub/Sub

The choice of NATS as the messaging backbone was not the most obvious one. Many teams would have chosen Apache Kafka — the de facto standard for event streaming at scale — or RabbitMQ — the reliable choice for decades. But NATS offers something these systems don't have to the same degree: radical simplicity with extraordinary performance.

NATS handles millions of messages per second with latencies below one millisecond. In a comparative benchmark by the Cloud Native Computing Foundation, NATS proved to be 10-100x faster than Kafka in low-latency messaging scenarios with small messages. That profile fits an agent orchestrator, where most messages are state events ("step X completed", "agent Y failed", "mission Z requires escalation") that are inherently small and frequent.

But the real reason NATS is perfect for Fararoni Flow is its flexible subscription model. NATS supports multiple messaging patterns in a single system:

Classic Pub/Sub: An agent publishes an event, all interested subscribers receive it. Perfect for broadcast notifications.

Request/Reply: An agent sends a request and waits for a response. The pattern automatically handles routing responses to the correct requester, even in systems with multiple instances. This is fundamental for agent-to-agent coordination.

Queue Groups: Multiple instances of the same agent subscribe as a group. NATS delivers each message to exactly one instance of the group, enabling automatic load balancing. If I have three instances of an "email processing" agent, each email goes to exactly one instance.

JetStream: NATS's persistence layer, which adds durability, message replay, and consumers that each work through a stream at their own pace. If an agent fails and restarts, it can resume processing from where it left off.

Key-Value Store: A distributed key-value store integrated into NATS that I use for shared state between agents. The state of a mission in progress is stored here, allowing any instance of an agent to continue another's work.

This versatility means I don't need to maintain multiple messaging systems. A single NATS cluster handles all the communication patterns that Fararoni Flow requires. That drastically simplifies operations: one system to monitor, one system to back up, one system to scale.
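To make the difference between classic pub/sub and queue groups concrete, here is a toy in-memory model of the delivery semantics. It is not the NATS client API and not Fararoni Flow code: the class `ToyBroker`, its method names, and the subject `email.received` are hypothetical. Real NATS picks a queue-group member for each message on its own; this sketch uses round-robin only to keep the behavior deterministic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Toy in-memory model of two delivery semantics; NOT the NATS client API.
final class ToyBroker {

    // One queue group: a list of members plus the index of the next member to serve.
    private static final class Group {
        final List<Consumer<String>> members = new ArrayList<>();
        int next = 0;
    }

    private final Map<String, List<Consumer<String>>> subscribers = new HashMap<>();
    private final Map<String, Map<String, Group>> groups = new HashMap<>();

    // Classic pub/sub: every plain subscriber receives its own copy.
    void subscribe(String subject, Consumer<String> handler) {
        subscribers.computeIfAbsent(subject, k -> new ArrayList<>()).add(handler);
    }

    // Queue group: members of the same group share the work on a subject.
    void queueSubscribe(String subject, String group, Consumer<String> handler) {
        groups.computeIfAbsent(subject, k -> new HashMap<>())
              .computeIfAbsent(group, k -> new Group())
              .members.add(handler);
    }

    void publish(String subject, String message) {
        for (Consumer<String> handler : subscribers.getOrDefault(subject, List.of())) {
            handler.accept(message);
        }
        // Each queue group gets the message once, delivered to a single member.
        for (Group group : groups.getOrDefault(subject, Map.of()).values()) {
            Consumer<String> chosen = group.members.get(group.next);
            group.next = (group.next + 1) % group.members.size(); // round-robin (assumption)
            chosen.accept(message);
        }
    }

    public static void main(String[] args) {
        ToyBroker broker = new ToyBroker();
        for (int i = 1; i <= 3; i++) {
            final int worker = i;
            broker.queueSubscribe("email.received", "email-workers",
                    msg -> System.out.println("worker " + worker + " handles " + msg));
        }
        broker.subscribe("email.received", msg -> System.out.println("audit log sees " + msg));
        for (int i = 1; i <= 3; i++) {
            broker.publish("email.received", "email-" + i);
        }
    }
}
```

In this model the audit subscriber sees every email, while each email is handled by only one of the three workers, which is the load-balancing behavior described for queue groups above.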

NATS's hub-and-spoke topology is also ideal for microservices architectures. Instead of connecting services point-to-point, which creates an unmaintainable tangle of connections, all agents connect to the NATS cluster. When I add a new agent, it only needs to know the cluster address. It doesn't need to know anything about the other agents. This decoupling is what allows Fararoni Flow to scale from 5 agents to 50 without a massive re-architecture.

9. Java 25 in 2026: Why I Chose the Elephant for a Gazelle Race

One of the questions I've been asked most since sharing Fararoni Flow is: "Why Java? Python is the language of AI." It's a valid question, and the answer reveals a lot about the philosophy behind the system.

Python for Prototypes, Java for Production

There is no doubt that Python dominates the AI research ecosystem. PyTorch, TensorFlow, Hugging Face, LangChain, CrewAI: most of the frameworks that AI developers use daily are written in Python or treat Python as a first-class citizen. For prototypes, research, and rapid experimentation, Python is unbeatable.

But Fararoni Flow is not a prototype. It is an orchestration system designed to run 24/7, coordinating dozens of agents, processing millions of tokens, and maintaining the state of hundreds of concurrent missions. For that type of system, you need something Python cannot easily offer: predictable performance at scale, static typing that prevents errors at compile time, and a mature observability and operations ecosystem.

Virtual Threads: The Silent Game Changer

The feature that excited me most about Java 21, and which later releases have refined, is Virtual Threads from Project Loom. The 2025-2026 benchmarks are encouraging, though not one-sided: a server based on Virtual Threads over Netty handles 1.2 million requests per second on a 16-core machine, ahead of the 900K measured for WebFlux with Project Reactor, while across high-concurrency load scenarios Virtual Threads come out ahead in approximately 45% of cases.

For an agent orchestrator, this is transformational. Each running agent can have its own virtual thread without tying up an operating-system thread. I can have hundreds of agents "running simultaneously" without the system feeling loaded. When an agent makes an external API call, which accounts for the bulk of an agent's execution time, the virtual thread "parks" automatically, freeing its carrier thread for other agents.
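A minimal sketch of the thread-per-agent idea, using only the standard library (Java 21 or later). The names `VirtualAgentsDemo`, `callExternalApi`, and `runAll` are hypothetical, and the 50 ms sleep merely stands in for a blocking call to a model provider; this is not how Fararoni Flow itself is wired.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class VirtualAgentsDemo {

    // Hypothetical stand-in for an agent step that mostly waits on an external API.
    static String callExternalApi(int agentId) throws InterruptedException {
        Thread.sleep(Duration.ofMillis(50)); // blocking call: the virtual thread parks here
        return "agent-" + agentId + " done";
    }

    // Runs `count` agents, one virtual thread per agent, and collects their results in order.
    static List<String> runAll(int count) {
        List<Future<String>> futures = new ArrayList<>();
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < count; i++) {
                final int id = i;
                futures.add(executor.submit(() -> callExternalApi(id)));
            }
        } // close() blocks until every submitted task has finished
        return futures.stream().map(Future::resultNow).toList();
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        List<String> results = runAll(500);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(results.size() + " agents finished in " + elapsedMs + " ms");
    }
}
```

Because every agent spends its time parked rather than holding a platform thread, the 500 simulated agents overlap their waits instead of queuing behind a fixed-size pool.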

The Spring Boot 3.x Ecosystem

Spring Boot 3.x brings native support for Virtual Threads, GraalVM Native Image, and the reactive WebFlux stack. The Spring ecosystem is massive: Spring Security for authentication and authorization, Spring Data for persistence, Spring Cloud for microservices patterns, Spring Batch for batch processing. Each of these projects has years of enterprise production use behind it.

When you build an agent system that needs OAuth2 authentication, rate limiting, circuit breakers, and distributed tracing, you don't want to build that from scratch. You want a framework that does it well, that has done it well for years, and that has the documentation and community to solve problems quickly.

GraalVM Native: Cold Starts That Don't Hurt

One of the historical criticisms of Java has been startup time. "Write once, run anywhere" felt more like "Write once, wait anywhere" for serverless applications. GraalVM Native Image changes that equation. Ahead-of-time compilation produces native binaries that start in less than 50 milliseconds.

For Fararoni Flow, this means I can scale agents to zero when there is no work and activate them on demand without users noticing the delay. In an agent system where different types of agents have different usage patterns — some constantly active, others sporadic — this "scale to zero" capability has direct implications for infrastructure costs.

Project Leyden and AOT Caching

Java 25 brings significant improvements from Project Leyden, especially AOT Method Profiling (JEP 515). This system records what the application does during a training run and saves an optimized cache for future runs. The result: the JVM generates optimized native code immediately at startup, without having to wait for the JIT compiler to collect hot profiles.

Benchmarks show 15-25% improvements in startup time and warm-up for applications that use this feature. For an agent system that restarts frequently — whether from deployments, failure recovery, or elastic scaling — every millisecond of startup counts.

Compact Object Headers: Memory That Matters

JEP 519 in Java 25 makes compact object headers a product feature (still opt-in, via -XX:+UseCompactObjectHeaders), and they can reduce heap memory usage by up to 20%. Oracle and Amazon tested this feature on hundreds of production services and reported not only memory reduction, but also performance improvements of up to 10% and reduction in garbage collection frequency of up to 15%.

In an orchestration system where each agent maintains state, conversation context, and message buffers, memory efficiency is not a minor detail. It is the difference between being able to run 30 agents on a single instance or having to scale horizontally ahead of time.

The Truth About Language Choice

In the end, the choice of Java 25 is not about rejecting Python or declaring that Java is "better." It is about choosing the right tool for the right problem. Python is unbeatable for research, model experimentation, and prototype building. Java is unbeatable for high-concurrency distributed systems that need predictable performance, complete observability, and drama-free operations.

Fararoni Flow has components in Python — especially those that interact directly with language models and ML tools — but the core orchestration, messaging system, state management, and API layer are in Java 25 because that is where Java shines. A hybrid system that uses each language for what it does best is not a weakness: it is mature architecture.

10. The Future of Agent Orchestration: Where We're Headed

From Isolated Agents to Distributed Cognitive Systems

Agent orchestration in 2026 is where container orchestration was in 2014: on the verge of rapid maturation. Kubernetes became popular because it solved a real problem (how to manage hundreds of containers in production) and did so with powerful abstractions: the pod, the service, the deployment. Agent orchestration needs its own equivalent abstractions.

I believe we are seeing three fundamental abstractions emerge:

1. The Mission as the Unit of Work: In Fararoni Flow, a mission is a complete unit of work that can involve multiple agents, tools, and steps. It is the equivalent of a "job" in batch systems or a "workflow" in integration systems. Missions have state, history, and can be replayed, audited, and optimized. The mission is the fundamental unit of orchestration because it reflects how humans think about work: as objectives to fulfill, not as isolated tasks.

2. The Agent as a Specialized Service: The agents of the future will not be generalists trying to do everything. They will be deep specialists in a specific domain, communicating with other agents through standardized protocols. Anthropic's Model Context Protocol (MCP) and Google's Agent-to-Agent Protocol (A2A) are the first steps toward this standardization. When a "data analysis" agent can communicate with a "report generation" agent and a "quality validation" agent through shared protocols, the complete system becomes more than the sum of its parts.

3. The Orchestrator as Operating System: The orchestrator is not just a coordinator: it is the operating system of the agent ecosystem. It handles agent lifecycle, resource allocation, failure recovery, security, and observability. In Fararoni Flow, the orchestrator decides which agent executes which mission based on declared capabilities, current state, and priority policies. It is the kernel of the system.
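As an illustration of the third abstraction, the sketch below picks an agent for a mission from declared capabilities, current state, and a priority value. Everything here is an assumption made for the example: the record shapes, the `State` enum, and the rule "highest priority among idle, capable agents" are hypothetical and not taken from Fararoni Flow's scheduler.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.Set;

final class AgentSelectorDemo {

    enum State { IDLE, BUSY, FAILED }

    // Hypothetical data model: what an agent declares and what a mission requires.
    record Agent(String name, Set<String> capabilities, State state, int priority) {}
    record Mission(String id, Set<String> requiredCapabilities) {}

    // Picks the highest-priority idle agent that declares every required capability.
    static Optional<Agent> select(Mission mission, List<Agent> agents) {
        return agents.stream()
                .filter(a -> a.state() == State.IDLE)
                .filter(a -> a.capabilities().containsAll(mission.requiredCapabilities()))
                .max(Comparator.comparingInt(Agent::priority));
    }

    public static void main(String[] args) {
        List<Agent> agents = List.of(
                new Agent("email-1", Set.of("email.parse", "summarize"), State.BUSY, 5),
                new Agent("email-2", Set.of("email.parse", "summarize"), State.IDLE, 3),
                new Agent("report-1", Set.of("report.generate"), State.IDLE, 9));
        Mission mission = new Mission("m-42", Set.of("email.parse", "summarize"));
        System.out.println(select(mission, agents).map(Agent::name).orElse("no agent available"));
    }
}
```

Returning an `Optional` leaves the "no capable agent is idle" case explicit, so the caller has to decide whether to queue the mission, wait, or escalate.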

The Convergence of Protocols: MCP, A2A, and Beyond

The agent protocol ecosystem in 2026 is fragmented but converging rapidly:

Protocol	Purpose	Adoption	Status
MCP (Anthropic)	Agent → Tools	110M+ downloads/month	Dominant in integration
A2A (Google)	Agent → Agent	50+ launch partners	Emerging but growing
ACP (IBM/Linux Foundation)	Agent → Agent (commerce)	Limited	Early standardization
Open GAP	Framework-agnostic	OSS community	In development

MCP has won the agent-tool integration battle. With 110 million monthly SDK downloads and adoption by the big four (Anthropic, OpenAI, Google, Microsoft), it is the de facto standard. A2A is emerging as the protocol for agent-to-agent communication, with Google leading and gaining backing from AWS and other majors.

The long-term vision is an ecosystem where these protocols complement each other: MCP for each agent to access tools, A2A for agents to communicate with each other, and possibly a third protocol for system-level orchestration. Fararoni Flow is architecturally prepared for this convergence: our inter-agent communication layer can adapt to new protocols without changing business logic.

The Democratization of Enterprise AI

A trend that deeply excites me is the democratization that these reduced costs and open standards are enabling. When DeepSeek offers V4-Pro at $0.003625 per million tokens on cache hit, and Kimi K2.6 offers frontier-level performance at mid-tier prices, the barrier to entry for building intelligent agent systems collapses.

A startup with a modest budget can today build an agent system that a year ago would have cost hundreds of thousands of dollars. An individual developer can orchestrate multiple specialized agents for less than a Netflix subscription. This is not just a cost reduction: it is a transfer of power from the big AI labs toward individual builders and small teams.

The Risks That Persist

But it is not all optimism. There are real risks that the field has not yet resolved:

Security of Autonomous Agents: When an agent has access to your email, your databases, and your production systems, the attack surface expands dramatically. A well-designed prompt injection could theoretically make an agent execute unauthorized actions. DEFCON-level systems like Fararoni Flow's are a first step, but autonomous agent security is a field in its infancy.

Model Provider Dependency: Although open protocols like MCP reduce lock-in at the tool level, dependence on specific model providers remains a risk. If Claude Opus 4.6 is the only model that can handle your most complex use case, you have a single point of failure. The multi-model routing strategy I use in Fararoni Flow mitigates this, but does not eliminate it completely.

Data Quality and Bias: 85% of AI project failures are attributed to low-quality data. Autonomous agents that make decisions based on biased or incomplete data can amplify those biases at scale. Data governance is not optional: it is a fundamental requirement.

11. Building in Public: The Commitment to Transparency

Since I started building Fararoni Flow, I have made the decision to do it in public. I share real metrics, failures included, explain architectural decisions with their full context, and publish the code so others can learn, criticize, and improve.

This transparency is not pure altruism. It is a system-building strategy that has proven effective time and again: when you know others will see your work, you have an additional incentive to do it well. The "social pressure" of building in public is a quality accelerator.

But there is a deeper benefit. The field of agent orchestration is so new that there is no consolidated "best practices manual." We are all discovering the answers in real time. By sharing what I learn — both successes and failures — I contribute to a collective body of knowledge that benefits all builders in this space.

The Numbers I Share

The image that opens this article shows the current state of the system. It is not a snapshot of a good day: it is the typical state. 31 active agents, 369 missions created, almost 7.3 million tokens processed. The 70% success rate includes experiments that intentionally explore the limits of what is possible. I am not ashamed to admit that some missions fail: every failure is a source of learning.

The Complete Tech Stack

For the technically curious, this is the complete stack of Fararoni Flow in its current state:

Layer	Technology	Reason for Choice
Runtime	Java 25 (LTS) + GraalVM Native	Virtual Threads, AOT profiling, cold starts <50ms
Framework	Spring Boot 3.5 + WebFlux/Netty	Reactive stack, 1.2M req/s, mature ecosystem
Messaging	NATS + JetStream	Sub-millisecond latency, decoupled pub/sub, durable
Architecture	Hexagonal + Sidecar Pattern	Logic isolation, interchangeable adapters
Protocols	MCP (Model Context Protocol)	Standard for agent-tool, 110M+ downloads/month
LLM Models	Claude Opus/Sonnet, Kimi K2.6, DeepSeek V4	Intelligent routing by task complexity
Persistence	PostgreSQL + Redis	Durable state + high-speed cache
Observability	Micrometer + Prometheus + Grafana	Real-time metrics, alerting
Infrastructure	Docker + Kubernetes	Container orchestration, auto-scaling

12. Operating in Production: Lessons Only Time Can Teach

Building an orchestrator is easy. Operating it in production for months is where the real lessons appear. I've done both, and these are the lessons you won't find in any academic paper or YouTube tutorial.

The 80/20 Law of Agents

I quickly discovered that 80% of the value Fararoni Flow generates comes from 20% of the agents. The email processing, technology briefing generation, and result validation agents are the ones that run dozens of times a day. The more exotic agents, such as the one that analyzes cybersecurity trends or the one that generates academic paper summaries, run only weekly but earn their place when they do.

This distribution taught me to think in three categories of agents: workhorses (those that do the heavy daily work), specialists (those that activate for specific tasks), and explorers (those that test new capabilities). Each category has different infrastructure, cost, and monitoring requirements. Workhorses need to be always available and optimized for cost. Specialists can start on demand. Explorers can fail without serious consequences.

The "Zombie Agent" Problem

A phenomenon I did not anticipate was that of "zombie agents": agents that remain in an intermediate state, neither completely active nor completely finished, consuming resources without producing value. Typical examples are an agent stuck waiting for a response from an external API that never arrives, or an agent caught in an infinite retry loop because its success condition is unreachable.

I solved this with an "escalation timeout" system. Each step of a mission has a predetermined timeout based on its DEFCON level. If a step exceeds its timeout, the system not only marks it as failed: it escalates. DEFCON 0 becomes DEFCON 1, which becomes DEFCON 2, until a human intervenes or the mission is automatically aborted. This system has prevented countless situations of agents consuming tokens and resources without purpose.
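The escalation idea can be sketched with a per-level timeout around each attempt. This is a sketch under stated assumptions rather than Fararoni Flow's implementation: the timeout values, the ceiling `MAX_DEFCON = 2`, and the choice to retry the step at each higher level are all hypothetical, as are the names `EscalationTimeoutDemo`, `timeoutFor`, and `runWithEscalation`.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class EscalationTimeoutDemo {

    static final int MAX_DEFCON = 2; // assumption: highest level before the mission is aborted

    // Hypothetical timeouts per DEFCON level; real values would come from configuration.
    static Duration timeoutFor(int defcon) {
        return Duration.ofMillis(200L * (defcon + 1));
    }

    // Runs the step, escalating one DEFCON level each time an attempt exceeds its timeout.
    static String runWithEscalation(Callable<String> step, int startLevel) {
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int level = startLevel; level <= MAX_DEFCON; level++) {
                Future<String> attempt = executor.submit(step);
                try {
                    String result = attempt.get(timeoutFor(level).toMillis(), TimeUnit.MILLISECONDS);
                    return "completed at DEFCON " + level + ": " + result;
                } catch (TimeoutException e) {
                    attempt.cancel(true); // interrupt the stuck attempt so it cannot linger as a zombie
                } catch (ExecutionException e) {
                    return "failed at DEFCON " + level + ": " + e.getCause();
                } catch (InterruptedException e) {
                    attempt.cancel(true);
                    Thread.currentThread().interrupt();
                    return "interrupted at DEFCON " + level;
                }
            }
        }
        return "aborted after DEFCON " + MAX_DEFCON;
    }

    public static void main(String[] args) {
        // A step that never answers in time, standing in for a hung external API.
        System.out.println(runWithEscalation(() -> {
            Thread.sleep(10_000);
            return "too late";
        }, 0));
    }
}
```

The important detail is the `cancel(true)` on timeout: marking a step as failed without interrupting it would leave the stuck attempt running, which is exactly the zombie behavior the timeout is meant to prevent.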

The Importance of Agent "Memory"

An agent without memory is like an employee with amnesia: it restarts from zero in every conversation. Fararoni Flow implements three types of memory:

Session Memory: The context of the current mission. What the agent has learned in previous steps, intermediate results, and decisions made. This memory lives in Redis and is lost when the mission ends.

Working Memory: Accumulated learnings about success and failure patterns. If an agent discovers that a certain approach works better for a type of task, that learning persists in PostgreSQL and is loaded in future missions of the same type. This memory is the foundation of continuous improvement.

System Memory: Static knowledge about the domain. Documentation, business rules, and templates that the agent uses as reference. This memory is updated manually by system operators.

The combination of these three memories transforms agents from isolated task executors into continuous learners. Every failed mission is a learning opportunity that benefits future missions.
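The three tiers can be pictured as a layered lookup. In this sketch plain in-process maps stand in for Redis and PostgreSQL, and the lookup order (session first, then working, then system) is my assumption for illustration; the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

final class AgentMemoryDemo {

    private final Map<String, String> session = new HashMap<>(); // stand-in for Redis
    private final Map<String, String> working = new HashMap<>(); // stand-in for PostgreSQL
    private final Map<String, String> system;                    // operator-maintained reference data

    AgentMemoryDemo(Map<String, String> systemKnowledge) {
        this.system = Map.copyOf(systemKnowledge); // read-only from the agent's point of view
    }

    void rememberInSession(String key, String value) {
        session.put(key, value);
    }

    void recordLearning(String key, String value) {
        working.put(key, value);
    }

    // Assumed lookup order: the most mission-specific tier wins.
    Optional<String> recall(String key) {
        if (session.containsKey(key)) return Optional.of(session.get(key));
        if (working.containsKey(key)) return Optional.of(working.get(key));
        return Optional.ofNullable(system.get(key));
    }

    // Session memory is discarded when the mission ends; the other tiers survive.
    void endMission() {
        session.clear();
    }

    public static void main(String[] args) {
        AgentMemoryDemo memory = new AgentMemoryDemo(Map.of("summary.template", "three bullet points"));
        memory.rememberInSession("current.thread", "Q3 budget emails");
        memory.recordLearning("long.threads", "summarize newest message first");
        memory.endMission();
        System.out.println(memory.recall("current.thread").orElse("(forgotten)"));
        System.out.println(memory.recall("long.threads").orElse("(forgotten)"));
        System.out.println(memory.recall("summary.template").orElse("(forgotten)"));
    }
}
```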

Real Costs vs. Projected Costs

When I designed the system, I projected a monthly operating cost based on public benchmarks. The real costs were different — not necessarily higher, but different in their distribution. I discovered that:

60% of token cost goes to "premium" models (Claude Opus) but they only represent 15% of calls. Those calls are the most critical.
25% of cost goes to "workhorse" models (Kimi K2.6) and they represent 55% of calls. This is where intelligent routing pays dividends.
The remaining 15% goes to "budget" models (DeepSeek V4) and they represent 30% of calls. Simple tasks that don't justify expensive models.

This distribution validated my original hypothesis: you don't need a supermodel for every task. You need a system that assigns the right model to the right task. The difference between a system that spends $1,000/month and one that spends $200/month is not output quality: it is routing intelligence.
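The shares above also say something about cost per call: dividing each tier's share of cost by its share of calls gives its cost relative to the average call. The sketch below does that arithmetic and adds a toy router. The percentages come from the list above; the 1-10 complexity score and its thresholds are hypothetical and only illustrate the idea of routing by task complexity.

```java
import java.util.List;

final class RoutingCostDemo {

    // Share of total token cost and share of total calls per tier, in percent.
    record Tier(String name, double costShare, double callShare) {
        // Cost of one call relative to the fleet-wide average call (1.0 = average).
        double relativeCostPerCall() {
            return costShare / callShare;
        }
    }

    static final List<Tier> TIERS = List.of(
            new Tier("premium (Claude Opus)", 60, 15),
            new Tier("workhorse (Kimi K2.6)", 25, 55),
            new Tier("budget (DeepSeek V4)", 15, 30));

    // Hypothetical router: the 1-10 complexity score and both thresholds are assumptions.
    static String route(int complexity) {
        if (complexity >= 8) return "premium";
        if (complexity >= 4) return "workhorse";
        return "budget";
    }

    public static void main(String[] args) {
        for (Tier tier : TIERS) {
            System.out.printf("%-22s %.2fx the average call cost%n",
                    tier.name(), tier.relativeCostPerCall());
        }
        System.out.println("complexity 9 -> " + route(9));
        System.out.println("complexity 2 -> " + route(2));
    }
}
```

By these shares a premium call costs 60 / 15 = 4 times the average call, while workhorse and budget calls cost about 0.45 and 0.5 times the average, so each call kept off the premium tier saves roughly eight to nine times what its replacement costs.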

The Human Dimension

Technically, Fararoni Flow is a software system. Operationally, it is a human-machine team. Agents handle 80% of routine work, but humans remain essential for:

Validating critical outputs: A briefing for an investment decision doesn't go out without human review.
Handling edge cases: Agents are good at the common; humans are better at the unexpected.
Training new agents: Each new agent requires human supervision during its first executions.
Defining strategy: Agents execute; humans decide what to execute.

Ignoring this human dimension is one of the most common mistakes I see in AI implementations. Systems that try to completely replace humans fail. Systems that amplify human capabilities — delegating the routine so humans can focus on the strategic — succeed.

13. The Question You Should Ask Yourself

If you've made it this far, you're probably considering how AI can transform your work, your team, or your company. The question you should ask yourself is not "which model should I use?" or "how much does Claude Code cost?" The question is: "what is my orchestration strategy?"

The data is clear. 80% of AI projects fail. Costs can vary by orders of magnitude depending on how you route your model calls. The tools that dominate today may be obsolete in a year. In this environment, the only sustainable advantage is not knowing the latest model. It is having an architecture that allows you to adapt faster than the competition.

I've spent the past few months building that architecture for my own use. Fararoni Flow is not a finished product (it is a living system that evolves weekly), but it is the concrete answer to an abstract question: how do you orchestrate dozens of specialized agents to work together as a coherent system, with resilience, observability, and controlled costs?

Who this article is for

If you are a CTO or VP of Engineering considering an AI strategy for your company, the data I presented on failure rates and costs should be your starting point. Don't start with "which model do we buy." Start with "what architecture allows us to try, fail, and adapt quickly."

If you are a senior developer who uses Claude Code, GitHub Copilot, or Cursor daily, the numbers about model routing should resonate with you. You don't need to give up Claude Code to save money. You need a system that uses Claude for what Claude does best, and cheaper models for everything else.

If you are a software architect building distributed systems, the patterns I described — hexagonal, sidecar, event-driven with NATS — should be familiar. The novelty is not in the individual patterns, but in how they compose to solve a new problem: the orchestration of autonomous agents.

If you are an entrepreneur thinking about building something in the AI agent space, I want you to know that the field is wide open. The big players are busy building models. There is a massive opportunity in the orchestration layer, the protocol layer, and the tooling layer that makes agents productive in the real world.

How to Connect

If you're curious to try Fararoni Flow, if you're building something similar and want to exchange ideas, or if you simply want to better understand how an agent orchestrator works on Java 25, NATS, and hexagonal architecture: contact me through ebercruz.com. I'm building this in public, learning in public, and sharing what I learn.

I don't promise that Fararoni Flow will be the perfect solution for your use case. But I promise the conversation will be honest, technical, and results-oriented. In a field where most are selling hype, I prefer to build something concrete and share how I do it.

If you persist, the results come.

Not always on the timeline you expect. Not always in the form you imagined. But they come. That is the final lesson Fararoni Flow has taught me: in the building of intelligent systems, as in life, disciplined strategy beats uncontrolled speed. Thoughtful architecture beats impulsive prompts. And persistence — that ability to keep iterating when everything seems broken — is the definitive differentiator between those who transform AI into competitive advantage and those who become another failure statistic in the next RAND Corporation report.

References and Sources

RAND Corporation (2025). AI Project Failure Analysis: 2,400+ Enterprise Initiatives. Retrieved from: (Folio3 AI) (https://www.folio3.ai/blog/ai-project-failure-rate-stats)
MIT Project NANDA (2025). The GenAI Divide: State of AI in Business. Retrieved from: (TechTarget) (https://www.techtarget.com/searchenterpriseai/feature/AI-deployments-gone-wrong-The-fallout-and-lessons-learned)
S&P Global Market Intelligence (2025). AI Initiative Abandonment Rates. Retrieved from: (Folio3 AI) (https://www.folio3.ai/blog/ai-project-failure-rate-stats)
Gartner (2024). Predicts 30% of GenAI Projects Will Be Abandoned After POC By End of 2025. Retrieved from: (Gartner) (https://www.gartner.com/en/newsroom/press-releases/2024-07-29-gartner-predicts-30-percent-of-generative-ai-projects-will-be-abandoned-after-proof-of-concept-by-end-of-2025)
Fortune (2026). Microsoft lost its way in the AI race. Can Copilot get it back on course? Retrieved from: (Fortune) (https://fortune.com/2026/05/21/microsoft-copilot-ai-openai-satya-nadella-gemini-claude/)
DeepSeek API Documentation (2026). Models & Pricing. Retrieved from: (deepseek.com) (https://api-docs.deepseek.com/quick_start/pricing)
InfoWorld (2026). DeepSeek's steep V4-Pro price cut escalates AI pricing war. Retrieved from: (InfoWorld) (https://www.infoworld.com/article/4176709/deepseeks-steep-v4-pro-price-cut-escalates-ai-pricing-war.html)
Medium/AI tentenco (2026). Kimi K2.6 & Kimi Code Review: Saving 88% Coding Costs? Retrieved from: (Medium) (https://medium.com/@tentenco/kimi-k2-6-kimi-code-review-saving-88-coding-costs-b7e8c5eaf5f1)
Ideas2IT (2026). Claude Code With Kimi, DeepSeek vs Claude: Cost & Benchmarks. Retrieved from: (ideas2it.com) (https://www.ideas2it.com/blogs/claude-code-alternative-models)
Uvik (2026). Claude Code vs Cursor vs Copilot vs Codex. Retrieved from: (uvik.net) (https://uvik.net/blog/claude-code-vs-cursor-vs-copilot-vs-codex-2026/)
Menlo Ventures (2025). State of Generative AI in the Enterprise. Referenced in: (beam.ai) (https://beam.ai/agentic-insights/the-great-ai-flip-why-76-of-enterprises-stopped-building-ai-in-house)
Caylent (2026). POC to PROD: Hard Lessons from 200+ Enterprise Generative AI Deployments. Retrieved from: (Caylent) (https://caylent.com/blog/poc-to-prod-hard-lessons-from-200-enterprise-generative-ai-deployments-part-2)
IBM (2025). What is AI Agent Orchestration? Retrieved from: (IBM) (https://www.ibm.com/think/topics/ai-agent-orchestration)
Lyzr AI (2026). Agent Orchestration 101. Retrieved from: (lyzr.ai) (https://www.lyzr.ai/blog/agent-orchestration/)
Model Context Protocol (2026). MCP Roadmap and Technical Direction. Retrieved from: (getknit.dev) (https://www.getknit.dev/blog/the-future-of-mcp-roadmap-enhancements-and-whats-next)
Java 25 LTS Release Notes (2025). Performance Improvements in JDK 25. Retrieved from: (inside.java) (https://inside.java/2025/10/20/jdk-25-performance-improvements/)
GitHub - loom-webflux-benchmarks (2026). Benchmarks of Spring Boot REST service comparing Java Virtual Threads with WebFlux. Retrieved from: (Github) (https://github.com/chrisgleissner/loom-webflux-benchmarks)
OpenAI Developer Community (2025). Prompt Engineering Is Dead, and Context Engineering Is Already Obsolete. Retrieved from: (OpenAI API Community Forum) (https://community.openai.com/t/prompt-engineering-is-dead-and-context-engineering-is-already-obsolete-why-the-future-is-automated-workflow-architecture-with-llms/1314011)

Eber Cruz Fararoni is a software architect specialized in distributed systems, event-driven architectures, and applied artificial intelligence. He builds Fararoni Flow, an open-source AI agent orchestrator, on Java 25, NATS, and hexagonal architecture. He writes at ebercruz.com about the intersection of software engineering and artificial intelligence.

If you found this article useful, share it with someone navigating the complex enterprise AI landscape in 2026. And if you want to try Fararoni Flow or exchange ideas about agent orchestration: contact me.

The Truth Nobody Tells You About AI in 2026: Why Microsoft and Uber Are Pulling Back, and Why Your Strategy Matters More Than Your Speed

Eber Cruz Fararoni — Wed, 03 Jun 2026 15:54:15 +0000

By Eber Cruz Fararoni | ebercruz.com | Software Architect & Builder of Intelligent Systems

TL;DR: 80.3% of enterprise AI projects fail without a trace of ROI. Microsoft has watched as only 4.5% of its 450 million M365 customers pay for Copilot, while its stock fell 34%. In the midst of this disaster, DeepSeek just made a 75% API discount permanent, and Moonshot AI's Kimi K2.6 proves it can match, or surpass, Claude Code 4.6 at a fraction of the cost. I've spent the past few months building Fararoni Flow, a multi-purpose agent orchestrator on Java 25, NATS, and hexagonal architecture with sidecar. This article is what I've learned about why most fail, and why orchestration strategy matters more than implementation speed.

1. The Landscape: A Silent Crisis in Enterprise AI Adoption

Generative artificial intelligence arrived promising to revolutionize every aspect of the modern enterprise. In 2025, organizations invested $684 billion in AI worldwide. By December of that year, more than $547 billion of that investment had produced measurable results: exactly zero. Not low returns. Zero. This is not a hypothetical scenario or a pessimistic projection: it is the conclusion of a RAND Corporation analysis of more than 2,400 enterprise AI initiatives.

The reality we face in 2026 is radically different from the marketing narrative sold by the big AI labs. While headlines celebrate each new model launch with ever-grander superlatives, in the trenches of enterprise implementation, the story is different. 42% of companies abandoned at least one AI initiative in 2025, a dramatic jump from just 17% the previous year, according to S&P Global Market Intelligence data. Organizations aren't getting better at AI; they're simply getting faster at recognizing failure.

The problem is not the technology itself. The models of 2026 are undoubtedly superior to those of 2024. GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro, and the new Chinese competitors such as DeepSeek V4-Pro and Kimi K2.6 represent quantum leaps in reasoning ability, code generation, and agentic task execution. The problem is how companies are trying to harness these capabilities.

Most organizations made the fundamental mistake of treating AI as a sprint rather than what it really is: a discipline of strategy. They wanted to build everything with prompts, automate complex processes in weeks, and transform their operations without the architectural rigor that any mission-critical enterprise system demands. When these efforts collapsed, as was inevitable, the easy conclusion was "AI is too expensive" or "AI doesn't work for us." The correct conclusion, however, is that orchestration architecture matters more than the language model you choose.

Metric	Figure	Source
Global AI investment, 2025	$684 billion	RAND Corporation (2025)
AI projects with no measurable ROI	$547 billion (80%)	RAND Corporation (2025)
Companies that abandoned ≥1 AI initiative	42% (vs 17% in 2024)	S&P Global (2025)
GenAI projects with no P&L impact	95%	MIT Project NANDA (2025)
AI projects that never reach production	52%	Gartner (2024)
Average cost of a failed AI project	$7.2 million	S&P Global (2025)
Cost overruns in RAG projects at scale	380% vs pilot projection	MIT Sloan

These numbers should not discourage us. They should focus us. AI is not failing; careless approaches are failing. And in that recognition lies a massive opportunity for those who understand that the difference between success and failure is not the model you use, but the architecture with which you orchestrate it.

Figure 1: Hard data from multiple authoritative sources confirm that most enterprise AI projects fail. Source: RAND Corp, MIT NANDA, Gartner, S&P Global, BCG (2024-2025).

2. The Giants Stumble: Microsoft, Uber, and the Real Cost of AI

Microsoft: The 4.5% That Exposed a Structural Fragility

In May 2026, Fortune published a devastating analysis of Microsoft's position in the AI race. The company that had bet hardest on OpenAI, with an investment exceeding $13 billion, faced an uncomfortable reality: fewer than 4.5% of its 450 million Microsoft 365 customers were paying for Copilot features. Meanwhile, its consumer Copilot chatbot had reached roughly 20 million weekly active users, a figure that sounds impressive until you compare it with ChatGPT's 900 million users.

The situation of GitHub Copilot, the first big commercial success in AI coding, illustrates the trend even more clearly. Once the undisputed leader among AI coding tools, it has been displaced first by Cursor ($2 billion in ARR, the fastest SaaS growth ever recorded) and then by Claude Code (46% of senior engineers name it their "most loved" tool, with an NPS of 54).

Microsoft's stock fell 34% from its all-time high in October 2025 through March 2026, even though its AI-related Azure revenue had more than doubled. Investors realized something marketing executives still don't want to admit: having the most famous model does not guarantee a sustainable business platform. The company announced $190 billion in capital expenditures for 2026, more than triple what it spent in 2024, in a desperate bet to recover lost ground.

Microsoft's commercial CEO, Judson Althoff, publicly acknowledged several mistakes: calling both the consumer and the enterprise product "Copilot" created massive confusion; sales representatives were incentivized to push the free version when only the premium version delivered real value; and the company underestimated how fast AI technology was evolving. When Anthropic launched Claude Code in 2025, capable of writing complete programs autonomously from a description, and then Claude Cowork in January 2026, the "copilot" model Microsoft had built suddenly felt a generation behind.

Uber: Closing Labs, Opening the Door to Learning

Uber's case is different but equally instructive. During the COVID-19 pandemic, Uber decided to close Uber AI Labs as part of a strategic cost reduction. Although the decision was driven by the need to preserve capital during an unprecedented crisis for the ride-sharing industry, it illustrates a pattern we have seen repeated: when AI costs meet financial reality, experimental projects are the first to fall.

What makes these cases particularly revealing is not that the companies are abandoning AI altogether (neither of them is) but that they are learning a costly lesson: AI is not a product you buy, it is a capability you build. Microsoft is not abandoning AI; it is reconfiguring its strategy to be model-agnostic, letting customers choose among GPT, Claude, Gemini, or any other model within its platform. Uber did not abandon AI; it redirected its resources toward AI applications tied more directly to its core business.

The conclusion is not that "AI is too expensive." The conclusion is that without a well-designed orchestration architecture, without a gradual implementation strategy, and without a deep understanding of where AI generates real value versus where it only generates impressive demos, costs soar and results evaporate.

3. The Price War Nobody Saw Coming: DeepSeek, Kimi, and the New Geography of Cost

DeepSeek V4-Pro: The 75% That Changed Everything

In May 2026, DeepSeek, the Chinese lab that had already shaken the industry's foundations in January 2025 with its R1 model, announced something many considered impossible: a 75% discount on its V4-Pro API that would also become permanent. It is not a temporary promotion or a marketing trick. It is a structural cost reduction based on real gains in computational efficiency.

The numbers are striking. DeepSeek's V4-Pro model, with 1.6 trillion parameters and a 1-million-token context window, is now priced at:

Model	Input / 1M tokens	Output / 1M tokens	Cache hit / 1M tokens
DeepSeek V4-Pro (after 75% cut)	$0.435	$0.87	$0.003625
DeepSeek V3.2	$0.28	$0.42	$0.028
GPT-5.4	$2.50	$15.00	$0.25
Claude Opus 4.6	$5.00	$25.00	$0.50
Kimi K2.6	$0.60	$2.50	$0.10

To put this in perspective: a task that costs $0.87 in output tokens with DeepSeek V4-Pro costs $15.00 with GPT-5.4 and $25.00 with Claude Opus 4.6. At those output prices, that is a saving of roughly 96.5% versus Claude Opus and 94% versus GPT-5.4. V4-Pro's cache-hit price of $0.003625 per million tokens is practically free for workloads with repetitive system prompts.
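The comparison can be checked with a few lines of arithmetic. The sketch below is only an illustration: it takes the per-million output prices from the table above and computes the relative saving. The class and method names are made up for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Compares output-token prices (USD per 1M tokens) taken from the pricing table. */
final class OutputCostComparison {

    /** Percentage saved when paying `cheaper` instead of `baseline`. */
    static double savingsPercent(double cheaper, double baseline) {
        return (1.0 - cheaper / baseline) * 100.0;
    }

    /** Cost in USD for a number of output tokens at a per-million price. */
    static double costFor(long outputTokens, double pricePerMillion) {
        return outputTokens / 1_000_000.0 * pricePerMillion;
    }

    public static void main(String[] args) {
        double deepSeekV4Pro = 0.87; // output price from the table

        // Baselines to compare against, also from the table.
        Map<String, Double> baselines = new LinkedHashMap<>();
        baselines.put("GPT-5.4", 15.00);
        baselines.put("Claude Opus 4.6", 25.00);
        baselines.put("Kimi K2.6", 2.50);

        baselines.forEach((model, price) ->
            System.out.printf("vs %s: %.1f%% cheaper on output tokens%n",
                model, savingsPercent(deepSeekV4Pro, price)));
    }
}
```

Real workloads mix input, output, and cached tokens, so a blended figure will differ from the output-only comparison.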

Sanchit Vir Gogia, CEO of Greyhound Research, explained the logic behind the reduction: "V4-Pro was designed to reduce the cost of long-context inference, operating at roughly a quarter of the compute per token and a tenth of the memory footprint of its predecessor in very long contexts. That is why the price reduction is permanent and not promotional. It is not a discount. It is an efficiency gain passed on to the customer."

Figure 2: DeepSeek V4-Pro and Kimi K2.6 offer prices 8-30x lower than equivalent Western models, with comparable performance on coding benchmarks. Source: official API pricing, May 2026.

Kimi K2.6: The Open-Source Model That Wins on Benchmarks and Shrinks the Bill

On April 20, 2026, Moonshot AI released Kimi K2.6, an open-source model with 1 trillion parameters and a Mixture-of-Experts (MoE) architecture that activates roughly 32 billion parameters per token. The numbers it posted are as impressive as its prices:

SWE-Bench Pro: 58.6%, ahead of GPT-5.4 (57.7%) and Claude Opus 4.6 (53.4%)
HLE-Full (with tools): 54.0%, leading all compared models
BrowseComp (Agent Swarm): 86.3%, dominating multi-agent agentic tasks
DeepSearchQA F1: 92.5%, ahead of GPT-5.4 (78.6%) and Claude Opus 4.6 (91.3%)

On independent benchmarks such as SWE-Bench Verified, K2.6 reached 80.2%, just behind Claude Opus 4.6 at 80.8% (Opus 4.7 later reached 87.6%).

The price: $0.60 per million input tokens and $2.50 per million output tokens. That makes it 8.3x cheaper on input and 10x cheaper on output than Claude Opus 4.7. When Cursor, the fastest-growing AI coding company in history with $2 billion in ARR, built its Composer 2 feature on Kimi K2.5 (the previous version), it was sending a clear message: performance does not require paying premium prices.

Figure 3: Left: Price evolution of DeepSeek V4-Pro with the permanent 75% discount. Right: Real monthly cost using intelligent routing versus a single model. Source: API data and public benchmarks, May 2026.

The Reality the Costs Reveal

The price difference is not a minor accounting detail. It is a strategic transformation of the landscape. When Ideas2IT ran a controlled test (building the same Flask application with SQLite, an HTML frontend, CRUD operations, unit tests, and Git configuration), the results were revealing:

Model	Cost per run	Result quality
DeepSeek V3.2	~$0.15	Good (best UI)
Kimi K2.5	~$0.33	Production
Claude Sonnet 4.6	~$1.66	Production
Claude Opus 4.6	~$75.64/month	Production (superior on complex architecture)

The three cloud models completed the task. Engineers who reviewed the results blind could not consistently identify which model produced which output. The 11x cost difference between DeepSeek V3.2 and Claude Sonnet 4.6 did not translate into an 11x difference in quality. It translated into an 11x difference in the bill.

At team scale, the implications are enormous. A team of 10 engineers using Claude Code with Claude Sonnet 4.6 spends approximately $444.40 per month on API tokens alone. The same team using Kimi K2.5 would spend $78.60. With DeepSeek V3.2, just $24.00. And that is before considering that 82% of a developer's daily work (PR review, refactoring, tests, standard debugging) does not require the maximum reasoning capacity of a premium model.

Tyler Folkman, an independent developer who built a model router for his personal workflow, documented the most extreme case: across 2,415 real AI turns, he spent $76.77 using a routing system that sent each task to the appropriate model. The same volume of work would have cost $1,272.77 had he used GPT-5.5 for everything. A 94% saving, achieved simply by not "pretending that every task is the same task."

Figure 4: Left: The dramatic shift from companies building AI in-house to buying solutions (Menlo Ventures). Right: Real monthly cost per engineer using different models inside Claude Code. Source: Ideas2IT, JetBrains AI Pulse Survey, API data.

4. Claude Code 4.6 Is Still King... But the Throne Is Wobbling

I don't want my argument to be misread. Claude Code 4.6, especially in its Opus tier, remains the gold standard for complex coding tasks. Its 1-million-token context lets you load entire monolithic repositories in a single session. Its 91.3% score on GPQA Diamond (graduate-level science questions validated by domain experts) is unmatched for deep scientific reasoning. Its hallucination rate is the lowest in the industry, with an AA-Omniscience index of +10 versus -11 for Kimi K2.5.

For complex architecture work, large-scale refactoring, understanding legacy code, and truly novel problems where the output cannot be easily verified with automated tests, Claude Opus 4.6 justifies its premium price. There is no substitute for the peace of mind of knowing the model has the highest probability of producing a correct answer when "correct" cannot be checked with a unit test.

Still, here is the truth many don't want to hear: 80% of a software engineer's work is neither complex architecture nor novel problems. It is REST API development, unit test generation, frontend scaffolding, debugging standard errors, and reviewing pull requests. For that 80%, Kimi K2.6 and DeepSeek V4 are not merely "good enough"; on many coding benchmarks, they are better.

The Pragmatic Engineer survey of February 2026, which polled roughly 906 senior engineers with a median of 11-15 years of experience, revealed a fascinating pattern: 46% of senior engineers named Claude Code their "most loved" tool, versus 19% for Cursor and 9% for GitHub Copilot. JetBrains confirmed these findings with hard loyalty data: a CSAT of 91% and an NPS of 54 for Claude Code, the highest in the category.

Figure 5: Left: Workplace adoption market share (JetBrains, Jan 2026). Right: Satisfaction among senior engineers; Claude Code leads by a wide margin despite its smaller market share. Source: JetBrains AI Pulse Survey, Pragmatic Engineer Survey.

But here is the critical nuance: although Claude Code is the most loved tool, it has only 18% workplace adoption versus 29% for GitHub Copilot. Among small startups (fewer than 50 people), Claude Code reaches 75% adoption, while at companies with 10,000+ employees, Copilot dominates with 56%. This split pattern reveals something deeper: startups choose on technical capability; large enterprises choose on ease of procurement. As today's startups become tomorrow's enterprises, their technology preferences will follow them.

My personal approach, after months of using Claude Code and Gemini, has evolved into what I call "conscious routing": I use Claude Code as my working interface (its agentic loop, its terminal integration, its ability to hold context across long sessions), but I route model calls according to task complexity. For routine work, Kimi K2.6 or DeepSeek V4. For high-complexity tasks where I cannot tolerate errors, Claude Opus 4.6. This hybrid approach gives me 90% of Opus quality at 15% of the cost.
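To make the idea concrete, here is a minimal sketch of complexity-based routing. It is not the router described above: the three complexity tiers, the type names, and the sample token counts are assumptions made for illustration, and the prices are the list prices from the earlier table.

```java
import java.util.List;

/** Hypothetical complexity tiers; illustrative names, not Fararoni Flow's API. */
enum TaskComplexity { ROUTINE, MODERATE, CRITICAL }

/** A model with its list prices in USD per 1M tokens, copied from the pricing table. */
record ModelChoice(String name, double inputPrice, double outputPrice) {
    double cost(long inputTokens, long outputTokens) {
        return inputTokens / 1_000_000.0 * inputPrice
             + outputTokens / 1_000_000.0 * outputPrice;
    }
}

/** One unit of work: its complexity tier and token counts. */
record Task(TaskComplexity complexity, long inputTokens, long outputTokens) {}

final class ModelRouter {
    static final ModelChoice DEEPSEEK = new ModelChoice("DeepSeek V4-Pro", 0.435, 0.87);
    static final ModelChoice KIMI = new ModelChoice("Kimi K2.6", 0.60, 2.50);
    static final ModelChoice OPUS = new ModelChoice("Claude Opus 4.6", 5.00, 25.00);

    /** Cheaper models for routine work, the premium model only when errors are costly. */
    static ModelChoice route(TaskComplexity complexity) {
        return switch (complexity) {
            case ROUTINE -> DEEPSEEK;
            case MODERATE -> KIMI;
            case CRITICAL -> OPUS;
        };
    }

    /** Total cost when each task goes to the model chosen by route(). */
    static double routedCost(List<Task> tasks) {
        return tasks.stream()
            .mapToDouble(t -> route(t.complexity()).cost(t.inputTokens(), t.outputTokens()))
            .sum();
    }

    /** Total cost when every task goes to the same model. */
    static double singleModelCost(List<Task> tasks, ModelChoice model) {
        return tasks.stream()
            .mapToDouble(t -> model.cost(t.inputTokens(), t.outputTokens()))
            .sum();
    }

    public static void main(String[] args) {
        // Made-up workload: mostly routine work, a little critical work.
        List<Task> day = List.of(
            new Task(TaskComplexity.ROUTINE, 400_000, 100_000),
            new Task(TaskComplexity.MODERATE, 200_000, 50_000),
            new Task(TaskComplexity.CRITICAL, 100_000, 20_000));
        System.out.printf("routed: $%.2f, all-Opus: $%.2f%n",
            routedCost(day), singleModelCost(day, OPUS));
    }
}
```

In practice the hard part is classifying a task's complexity before running it; this sketch assumes the tier is already known.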

5. Why Strategy Beats Speed: Lessons from a Systems Builder

The Mistake of Treating Prompts as Architecture

There is a line I keep repeating in conversations with fellow architects: "If you go too fast trying to build everything with prompts, you will almost certainly fail. If you don't use AI, you will go slow and safe (very, very slow), but you will lose the ability to fail fast and correct."

This false dichotomy, between "move fast and break things with AI" and "don't use AI at all," is the root of many failures. Teams that "go fast with prompts" build impressive demos that collapse the moment they face real data, edge cases, and compliance requirements. Teams that reject AI altogether lose the competitive advantage of rapid iteration the technology provides.

The solution is not to pick an extreme. The solution is to understand that AI is a capability you orchestrate, not a product you consume. An influential July 2025 post in the OpenAI developer community titled "Prompt Engineering Is Dead, and Context Engineering Is Already Obsolete" captured this transition well. Pure prompt engineering suffers from intrinsic fragility: minor changes in input, model versions, or even random drift can destroy the effectiveness of a carefully tuned prompt. It does not scale. It is not maintainable. It does not provide consistent reasoning across complex workflows.

The future, the article argues, lies neither in more elaborate prompts nor in more extensive context. It lies in automated workflow architectures where language models are components in a larger system, not the whole system. This is exactly what my orchestrator, Fararoni Flow, is designed to do.

The Five Failure Patterns I Have Observed

After years of building enterprise systems and the last few months working intensively with AI agents, I have identified five recurring failure patterns:

1. Model Selection Obsession: Teams that spend weeks comparing Claude vs GPT vs Gemini, optimizing for minor differences in output quality, while their evaluation coverage stays weak and their input/output specifications stay vague. Models improve faster than comparison cycles run. By the time you finish evaluating, a new version has shipped that invalidates your results.

2. Cost Blindness: Running the most expensive model for every request regardless of complexity, with no unit-economics tracking and no monitoring of token usage patterns. This leads to surprise bills that can derail projects and kill ROI. The cost of AI is never just the model call: it includes retrieval, orchestration, retries, and more.

3. Undifferentiated Chatbots: Building generic chat interfaces with no domain-specific context, specialized workflows, or unique capabilities. These solutions compete directly with ChatGPT, Claude, and other generic tools users already have. If your competitive advantage is "we have a chatbot too," prepare for disappointment.

4. Tool-Calling Over-Engineering: Creating elaborate tool schemas for simple operations, defining tools for basic computation or data formatting, and building complex orchestration when simple prompt engineering would do. Every tool call adds latency and potential points of failure.

5. Ignoring End-User Constraints: Voice interfaces for noisy environments, high-resolution video processing for users with limited bandwidth, and complex multi-step workflows for users short on time. Technical capability does not equal user value.

6. Fararoni Flow: Why I Am Building an Orchestrator in 2026

The Vision: Sovereignty over Your Agents

In the middle of this chaotic landscape, where costs vary by orders of magnitude, models improve every month, and platforms close some walled gardens and open others, I decided to build something different. Fararoni Flow is a multi-purpose agent orchestrator born from a simple premise: in a world where everything changes constantly, the only sustainable advantage is the ability to adapt quickly.

The interface I share in the opening image shows the main dashboard: 7,242,574 tokens processed, 395 executions, 277 completed, 1,247 LLM calls, and 31 active agents in the system. These are not demo numbers. They are real numbers from a system I use daily to automate technology-intelligence workflows, email processing, briefing generation, and the orchestration of complex multi-step tasks.

Why Java 25, NATS, and Hexagonal Architecture with Sidecars

The choice of technology stack is not accidental. Each decision answers a specific requirement of agent systems at scale:

Java 25 (LTS): The latest Long-Term Support release of Java brings critical improvements for agent systems:

Stabilized Virtual Threads (final since Java 21): they handle 1.2 million requests per second in recent benchmarks, surpassing WebFlux's 900K. For an orchestrator coordinating dozens of concurrent agents, this is fundamental.
AOT Method Profiling (JEP 515): reduces warm-up time by 15-25%, critical for microservices and serverless functions where cold start matters.
Compact Object Headers (JEP 519): reduces memory usage by up to 20% according to tests by Oracle and Amazon across hundreds of production services.
Scoped Values (JEP 506): lets concurrent tasks share data without ThreadLocal, essential for shared context between agents.
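As an illustration of the virtual-thread item above, the following sketch fans out a batch of blocking "agent" calls, one virtual thread per call. It assumes Java 21 or later; `runAgent` and its 50 ms sleep are placeholders for real blocking I/O such as a model API request, not code from Fararoni Flow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class VirtualThreadFanOut {

    /** Simulates a blocking agent call, for example an HTTP request to a model API. */
    static String runAgent(int id) throws InterruptedException {
        Thread.sleep(50); // stand-in for blocking I/O
        return "agent-" + id + " done";
    }

    /** Runs `count` agents concurrently, one virtual thread per agent. */
    static List<String> runAll(int count) {
        // The executor is AutoCloseable; close() waits for submitted tasks.
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            List<Future<String>> futures = new ArrayList<>();
            for (int i = 0; i < count; i++) {
                final int id = i;
                futures.add(executor.submit(() -> runAgent(id)));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get()); // preserves submission order
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for agents", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("agent failed", e.getCause());
        }
    }

    public static void main(String[] args) {
        List<String> results = runAll(1_000);
        System.out.println(results.size() + " agents finished; first: " + results.get(0));
    }
}
```

Because each blocked virtual thread releases its carrier thread, a thousand sleeping calls do not need a thousand platform threads. The throughput figures quoted above would have to be measured separately.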

NATS: The messaging system I chose as the communication backbone between agents. NATS handles millions of messages per second with sub-millisecond latency. Its pub/sub model lets agents communicate in a decoupled way: one agent publishes an event, and interested subscribers process it. There is no direct coupling and no complex message queues. It is the messaging system that companies such as VMware, Ericsson, and SAP use in production at massive scale.

Hexagonal Architecture (Ports & Adapters): This pattern is the backbone of Fararoni Flow. Each agent's business logic is completely isolated from external concerns: model API calls, data persistence, authentication, logging. If tomorrow I want to switch from Claude to Kimi, from PostgreSQL to MongoDB, or from REST to gRPC, the agent logic does not change. Only the adapter changes. In a field where the underlying technology evolves weekly, this decoupling is not a luxury: it is a survival requirement.
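A minimal sketch of the ports-and-adapters idea follows. All names here (`LanguageModelPort`, the two adapters, `EmailSummaryAgent`) are hypothetical and the adapters are stubs; a real adapter would call the provider's API where the stub returns a tagged string.

```java
/** Port: the only thing agent logic knows about language models. */
interface LanguageModelPort {
    String complete(String prompt);
}

/** Adapter stub; a real one would call the provider's HTTP API here. */
final class ClaudeAdapter implements LanguageModelPort {
    @Override
    public String complete(String prompt) {
        return "[claude] " + prompt;
    }
}

/** A second adapter stub behind the same port. */
final class KimiAdapter implements LanguageModelPort {
    @Override
    public String complete(String prompt) {
        return "[kimi] " + prompt;
    }
}

/** Core agent logic depends on the port, never on a concrete provider. */
final class EmailSummaryAgent {
    private final LanguageModelPort model;

    EmailSummaryAgent(LanguageModelPort model) {
        this.model = model;
    }

    String summarize(String emailBody) {
        return model.complete("Summarize: " + emailBody);
    }
}

final class HexagonalDemo {
    public static void main(String[] args) {
        String email = "Quarterly report attached.";
        // Swapping providers changes one constructor argument, not the agent.
        System.out.println(new EmailSummaryAgent(new ClaudeAdapter()).summarize(email));
        System.out.println(new EmailSummaryAgent(new KimiAdapter()).summarize(email));
    }
}
```

The same shape applies to persistence or transport: define the port the agent needs, then write one adapter per backend.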

Sidecar Pattern: Each main agent in Fararoni Flow travels with sidecar containers that handle cross-cutting concerns: structured logging, telemetry, health checks, and secure communication. This pattern, popularized by Kubernetes and used by Google, Uber, Airbnb, and eBay, lets the main container focus exclusively on its business logic while the sidecars handle "how" it runs. I can update the logging system without touching an agent. I can change the communication protocol without affecting mission logic.

MCP (Model Context Protocol): With 110 million monthly SDK downloads and adoption by Anthropic, OpenAI, Google, and Microsoft, MCP has become the de facto standard for agent-tool integration. Fararoni Flow uses MCP to connect agents with external tools: email reading (IMAP), web search, command execution, and database access. MCP collapses the N×M integration problem (N tools × M AI platforms) to N+M. You build one MCP server for your tool, and any compatible agent can use it.

DEFCON Levels: Resilience by Design

A distinctive feature of Fararoni Flow is its system of DEFCON levels (0-5) for each mission. Inspired by the United States aerospace defense system, these levels define the alert state and the resources assigned to each task:

DEFCON 0: Normal mission. Standard execution with automatic retries.
DEFCON 1: Intensified monitoring. Each step is verified before continuing.
DEFCON 2: Human escalation required. The agent can suggest but not execute critical changes.
DEFCON 3: Read-only. The agent can investigate but not modify.
DEFCON 4: Passive observation mode. Monitoring only, no action.
DEFCON 5: Mission aborted. All operations stopped.

This system solves one of the fundamental problems of autonomous agents: how to delegate authority without losing control. Not every task requires the same level of supervision. A "generate an email summary" mission can run at DEFCON 0. A "modify the production configuration" mission should require DEFCON 2 at minimum.
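One way to encode these levels is as an enum that gates what an agent may do. The sketch below is one possible reading of the list above, not Fararoni Flow's implementation: the four action categories and the rule that each level caps the most powerful allowed action are assumptions.

```java
/** Hypothetical action categories, ordered from least to most powerful. */
enum AgentAction { OBSERVE, INVESTIGATE, MODIFY, CRITICAL_CHANGE }

/** Supervision levels; a higher number means a stricter level. */
enum DefconLevel {
    DEFCON_0(AgentAction.CRITICAL_CHANGE, false), // normal mission
    DEFCON_1(AgentAction.CRITICAL_CHANGE, true),  // every step verified
    DEFCON_2(AgentAction.MODIFY, true),           // critical changes need a human
    DEFCON_3(AgentAction.INVESTIGATE, true),      // read-only
    DEFCON_4(AgentAction.OBSERVE, true),          // passive observation
    DEFCON_5(null, true);                         // aborted: nothing allowed

    private final AgentAction mostPowerfulAllowed; // null means no action at all
    private final boolean verifyEachStep;

    DefconLevel(AgentAction mostPowerfulAllowed, boolean verifyEachStep) {
        this.mostPowerfulAllowed = mostPowerfulAllowed;
        this.verifyEachStep = verifyEachStep;
    }

    /** True if the agent may perform the action without human approval. */
    boolean permits(AgentAction action) {
        return mostPowerfulAllowed != null
            && action.ordinal() <= mostPowerfulAllowed.ordinal();
    }

    boolean verifiesEachStep() {
        return verifyEachStep;
    }

    /** True if this level is at least as strict as the minimum a mission demands. */
    boolean atLeast(DefconLevel minimum) {
        return ordinal() >= minimum.ordinal();
    }
}

final class DefconDemo {
    public static void main(String[] args) {
        DefconLevel production = DefconLevel.DEFCON_2;
        System.out.println("may modify: " + production.permits(AgentAction.MODIFY));
        System.out.println("may apply critical change: "
            + production.permits(AgentAction.CRITICAL_CHANGE));
    }
}
```

A mission runner would call `permits` before each step and escalate to a human when it returns false.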

The System in Numbers

Metric	Value	Context
Tokens processed	7,242,574	Cumulative since launch
Total executions	395	Missions started
Completed successfully	277 (70.1%)	Current success rate
LLM calls	1,247	Calls to language models
Active agents	31	Specialized agents available
Missions created	369	Mission library
Backend	WebFlux/Netty	Reactive stack on Java 25
Native image	GraalVM Native	AOT compilation for sub-50ms cold start

These numbers reflect a system in active use, not a prototype. The 70% success rate sounds modest until you consider that it includes experimental missions, agents under development, and tasks that deliberately probe the limits of what is possible. In agentic AI, 70% success with rapid iteration is worth more than 95% with months-long development cycles.

7. What I Have Learned: Five Principles for Building with AI in 2026

After months of building Fararoni Flow and watching the enterprise AI landscape, these are the five principles that would guide my approach if I were starting today:

Principle 1: Don't Buy the Hype, Buy Flexibility

The AI model market changes weekly. Whatever is "the best model" today will be surpassed in three months. Invest in architecture that lets you switch models without rewriting your application. A model routing system is not a luxury: it is insurance against obsolescence. Tyler Folkman showed that a well-designed model router can cut your costs by 94% while maintaining quality. That is not optimization: it is financial survival.

Principle 2: 80% of Your Work Doesn't Need the 99% Model

For most day-to-day development tasks (APIs, tests, scaffolding, standard debugging), Kimi K2.6 and DeepSeek V4-Pro offer performance equal or superior to Claude Sonnet 4.6 at a fraction of the cost. Reserve Opus 4.6 for complex architecture, large-scale refactoring, and problems where an error is costly. A hybrid approach gives you 90% of the quality at 15% of the cost.

Principle 3: Observability before Autonomy

You cannot improve what you cannot measure. Fararoni Flow records every token consumed, every tool call, every DEFCON state change, and every mission result. Without this telemetry, you would be flying blind. Agent autonomy is powerful but dangerous without full observability. Start with granular monitoring before adding autonomy.

Principle 4: Start with Missions, Not Agents

The most common mistake I see is building "cool agents" and then looking for a purpose for them. The right approach is to identify a specific business mission (for example, "I need a daily technology briefing based on my email and RSS feeds") and then design the minimal agent needed to accomplish it. Fararoni Flow started with a single mission: process emails and generate summaries. The current 31 agents and 369 missions are the result of organic iteration, not centralized planning.

Principle 5: Persistence > Speed

Building AI systems is hard. It is. There are moments when a model that worked perfectly yesterday starts behaving erratically today. Pipelines break because of changes in external APIs. Costs spike because you forgot a rate limit in a loop. The difference between those who succeed and those who give up is not intelligence or resources. It is the persistence to keep iterating when everything seems broken. If you persist, results come. Not always on the timeline you expect, but they come.

8. The Technical Landscape: How to Build an Orchestrator That Scales

Event Architecture with NATS

Fararoni Flow is built on an event architecture in which NATS acts as the central nervous system. When a user creates a mission, a mission.created event is published. Subscribed agents evaluate whether they can handle that mission based on their declared capabilities. If an agent accepts, it publishes mission.claimed and begins execution. Each step within the mission generates events: step.started, step.completed, step.failed, step.retried.

This model has several critical advantages:

Decoupling: Agents do not know about each other. They only know how to respond to events.
Scalability: I can add more instances of any agent without changing code.
Resilience: If an agent fails, pending events are automatically requeued.
Observability: Every event is recorded, enabling full replay of any execution.
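The event flow described above can be sketched without a broker. The example below uses a small in-memory bus as a stand-in for NATS, so it shows only the shape of the interaction (subjects, a capability check, a recorded history), not the NATS client API. `InMemoryEventBus`, `SummaryAgent`, and the payload format are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

/** An event on a subject such as mission.created or step.completed. */
record Event(String subject, String missionId, String payload) {}

/** In-memory stand-in for the message broker: subject-based pub/sub plus a log. */
final class InMemoryEventBus {
    private final Map<String, List<Consumer<Event>>> subscribers = new ConcurrentHashMap<>();
    private final List<Event> log = new CopyOnWriteArrayList<>();

    void subscribe(String subject, Consumer<Event> handler) {
        subscribers.computeIfAbsent(subject, s -> new CopyOnWriteArrayList<>()).add(handler);
    }

    void publish(Event event) {
        log.add(event); // every event is recorded, which is what makes replay possible
        for (Consumer<Event> handler : subscribers.getOrDefault(event.subject(), List.of())) {
            handler.accept(event);
        }
    }

    List<Event> history() {
        return List.copyOf(log);
    }
}

/** An agent that only reacts to events; it knows nothing about other agents. */
final class SummaryAgent {
    static void attach(InMemoryEventBus bus) {
        bus.subscribe("mission.created", event -> {
            if (!event.payload().contains("summarize")) {
                return; // capability check: not a mission this agent can handle
            }
            bus.publish(new Event("mission.claimed", event.missionId(), "summary-agent"));
            bus.publish(new Event("step.started", event.missionId(), "summarize"));
            bus.publish(new Event("step.completed", event.missionId(), "summarize"));
        });
    }
}

final class EventFlowDemo {
    public static void main(String[] args) {
        InMemoryEventBus bus = new InMemoryEventBus();
        SummaryAgent.attach(bus);
        bus.publish(new Event("mission.created", "m-1", "summarize inbox"));
        bus.history().forEach(e -> System.out.println(e.subject() + " " + e.missionId()));
    }
}
```

Requeueing after a failed agent, mentioned under Resilience, needs a persistent broker feature and is outside what this in-memory sketch shows.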

GraalVM Native: Speed That Matters

Compiling to a native image with GraalVM cuts the application's cold start to under 50 milliseconds. In an agent system where functions can scale to zero when there is no work and spin up on demand, this is the difference between an instant response and a frustrating experience. Spring Boot 3.x integrates native support for GraalVM, and Java 25 brings further AOT profiling improvements that help the application reach peak performance almost immediately after startup.

WebFlux/Netty: Concurrency without Compromise

Choosing Spring WebFlux on Netty over the traditional platform-thread model is not accidental. Benchmarks from 2025-2026 show that Virtual Threads on Netty outperform pure WebFlux in roughly 45% of scenarios, especially under high concurrent load. For an orchestrator managing multiple agents running simultaneously, each with its own calls to external APIs, the ability to handle tens of thousands of concurrent connections at low latency is essential.

MCP: The Universal Integration Layer

Fararoni Flow implements MCP servers for all its external integrations: IMAP for email, connectors for databases, clients for language-model APIs, and adapters for file systems. This means any MCP-compatible tool (and in 2026 that includes Claude Desktop, VS Code Copilot, Cursor, and dozens of IDEs and platforms) can use Fararoni Flow's agents as tools.

The image I shared at the beginning shows the "MCP Connections" interface in the sidebar, with IMAP and Gmail connectors active. These connections are the bridge between the world of autonomous agents and the existing information systems companies already use.

The Sidecar Pattern in Detail

The sidecar pattern is one of the most important architectural components of Fararoni Flow and deserves a detailed explanation. Inspired by the Kubernetes model in which each pod can contain multiple containers sharing resources, the sidecar pattern in our context means each main agent travels with auxiliary containers that handle cross-cutting functions.

Imagine an agent specialized in email analysis. Its main container holds only business logic: how to parse emails, how to identify important topics, how to generate summaries. But it travels with three sidecars:

Logging Sidecar: Captures every agent event (task start, API call, result, error) and sends it to a centralized structured-logging system. The main agent does not know it exists. It just does its job.

Telemetry Sidecar: Collects performance metrics (response time, tokens consumed, success rate) and exposes them in Prometheus format for scraping. If the agent starts consuming more tokens than normal, the alarm goes off.

Secure Communication Sidecar: Handles TLS encryption, mutual authentication, and connection retries. The main agent speaks plain HTTP without thinking about security; the sidecar makes sure that communication is secure.

This separation has major operational benefits. I can update the logging system for the whole cluster without touching a single agent. I can change the telemetry protocol without affecting business logic. I can rotate security certificates centrally. In an ecosystem where I expect to have dozens of different agents, each specialized in a domain, this operational consistency is not optional: it is the foundation on which reliability is built.

The sidecar pattern also solves a practical problem for multidisciplinary teams. The main agent can be written in Java, my preferred language for complex business logic, while a natural-language-processing sidecar can be in Python, taking advantage of the rich NLP libraries the Python ecosystem offers. The logging sidecar could be in Rust, maximizing resource efficiency. Each component uses the right language for its purpose and communicates through well-defined interfaces.

NATS as Nervous System: Beyond Simple Pub/Sub

Choosing NATS as the messaging backbone was not the most obvious option. Many teams would have chosen Apache Kafka, the de facto standard for event streaming at scale, or RabbitMQ, the dependable choice for decades. But NATS offers something those systems do not have to the same degree: radical simplicity with extraordinary performance.

NATS handles millions of messages per second with sub-millisecond latencies. In a comparative benchmark from the Cloud Native Computing Foundation, NATS proved 10-100x faster than Kafka in low-latency messaging scenarios with small messages. That profile fits an agent orchestrator, where most messages are state events ("step X completed," "agent Y failed," "mission Z requires escalation") and are inherently small and frequent.

But the real reason NATS suits Fararoni Flow so well is its flexible subscription model. NATS supports multiple messaging patterns in a single system:

Pub/Sub clasico: Un agente publica un evento, todos los suscriptores interesados lo reciben. Perfecto para notificaciones broadcast.

Request/Reply: Un agente envia una solicitud y espera una respuesta. El patron maneja automaticamente el enrutamiento de respuestas al solicitante correcto, incluso en sistemas con multiples instancias. Esto es fundamental para la coordinacion agente-a-agente.

Queue Groups: Multiples instancias del mismo agente se suscriben como un grupo. NATS entrega cada mensaje a exactamente una instancia del grupo, habilitando load balancing automatico. Si tengo tres instancias de un agente de "procesamiento de correos", cada correo va a exactamente una instancia.

JetStream: La capa de persistencia de NATS que agrega durabilidad, replay de mensajes, y consumer groups con diferentes velocidades de procesamiento. Si un agente falla y se reinicia, puede reanudar el procesamiento desde donde se quedo.

Key-Value Store: Un store de clave-valor distribuido integrado en NATS que uso para estado compartido entre agentes. El estado de una mision en progreso se almacena aqui, permitiendo que cualquier instancia de un agente continue el trabajo de otra.
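The difference between plain subscriptions and queue groups is easy to see without a broker. The sketch below is an in-memory toy, not the NATS client API: `ToyBus`, its methods, the subject `mail.received`, and the round-robin choice of group member are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Toy in-memory bus (hypothetical, NOT the NATS client API). It mimics two
// delivery rules only: every plain subscriber receives each message, while
// members of the same queue group share messages one at a time.
class ToyBus {
    private final Map<String, List<Consumer<String>>> plain = new HashMap<>();
    // subject -> queue group name -> members
    private final Map<String, Map<String, List<Consumer<String>>>> groups = new HashMap<>();
    private final Map<String, Integer> deliveries = new HashMap<>();

    void subscribe(String subject, Consumer<String> handler) {
        plain.computeIfAbsent(subject, s -> new ArrayList<>()).add(handler);
    }

    void queueSubscribe(String subject, String group, Consumer<String> handler) {
        groups.computeIfAbsent(subject, s -> new HashMap<>())
              .computeIfAbsent(group, g -> new ArrayList<>())
              .add(handler);
    }

    void publish(String subject, String message) {
        // Fan-out: every plain subscriber gets its own copy.
        for (Consumer<String> handler : plain.getOrDefault(subject, List.of())) {
            handler.accept(message);
        }
        // Queue groups: exactly one member per group gets the message.
        // Round-robin keeps the toy deterministic; a real broker picks the member itself.
        for (var entry : groups.getOrDefault(subject, Map.of()).entrySet()) {
            List<Consumer<String>> members = entry.getValue();
            int index = deliveries.merge(subject + "/" + entry.getKey(), 1, Integer::sum) - 1;
            members.get(index % members.size()).accept(message);
        }
    }

    // Publishes `count` messages to 2 plain subscribers and a 3-member queue group;
    // returns {plain deliveries, queue-group deliveries}.
    static int[] demo(int count) {
        ToyBus bus = new ToyBus();
        int[] delivered = new int[2];
        bus.subscribe("mail.received", m -> delivered[0]++);
        bus.subscribe("mail.received", m -> delivered[0]++);
        for (int worker = 0; worker < 3; worker++) {
            bus.queueSubscribe("mail.received", "mail-workers", m -> delivered[1]++);
        }
        for (int n = 0; n < count; n++) {
            bus.publish("mail.received", "mail-" + n);
        }
        return delivered;
    }

    public static void main(String[] args) {
        int[] d = demo(6);
        System.out.println("plain deliveries=" + d[0] + ", queue-group deliveries=" + d[1]);
    }
}
```

With six messages, the two plain subscribers see twelve deliveries between them, while the three-member group sees six in total. That is the load-balancing behavior described for the email agents.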

This versatility means I do not need to maintain multiple messaging systems. A single NATS cluster handles every communication pattern Fararoni Flow requires. That drastically simplifies operations: one system to monitor, one system to back up, one system to scale.

The hub-and-spoke topology NATS enables is also a good fit for microservice architectures. Instead of connecting services point to point, which creates a tangle of connections that is impossible to maintain, every agent connects to the NATS cluster. When I add a new agent, it only needs to know the cluster address. It does not need to know anything about the other agents. This decoupling is what lets Fararoni Flow scale from 5 agents to 50 without a massive re-architecture.

9. Java 25 in 2026: Why I Chose the Elephant for a Race of Gazelles

One of the questions I have been asked most since I shared Fararoni Flow is: "Why Java? Python is the language of AI." It is a valid question, and the answer reveals a lot about the philosophy behind the system.

Python for Prototypes, Java for Production

There is no doubt that Python dominates the AI research ecosystem. PyTorch, TensorFlow, Hugging Face, LangChain, CrewAI: most of the frameworks AI developers use daily are written in Python or treat Python as a first-class citizen. For prototypes, research, and rapid experimentation, Python is unbeatable.

But Fararoni Flow is not a prototype. It is an orchestration system designed to run 24/7, coordinating dozens of agents, processing millions of tokens, and maintaining the state of hundreds of concurrent missions. That kind of system needs things Python does not easily offer: predictable performance at scale, static typing that catches errors at compile time, and a mature ecosystem for observability and operations.

Virtual Threads: The Silent Game Changer

The feature that excited me most in Java 21 (and that has been refined in Java 25) is Project Loom's Virtual Threads. The 2025-2026 benchmarks are striking: a Virtual Threads server on Netty handles 1.2 million requests per second on a 16-core machine, ahead of the 900K achieved by WebFlux with Project Reactor. That said, the results are workload-dependent: across high-concurrency scenarios, Virtual Threads come out ahead in approximately 45% of cases.

For an agent orchestrator, this is transformational. Each running agent can have its own virtual thread without tying up an operating system thread. I can have hundreds of agents "running simultaneously" without the system feeling loaded. When an agent calls an external API, which is where an agent spends most of its execution time, the virtual thread is parked automatically, freeing resources for other agents.
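A minimal sketch of the one-virtual-thread-per-agent idea, using only the JDK (Java 21 or later). The agent itself is an assumption: `callExternalApi` fakes a blocking network call with `Thread.sleep` so the example runs offline, and the class and method names are hypothetical rather than taken from Fararoni Flow.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: one virtual thread per simulated agent (requires Java 21+).
class VirtualAgents {

    // Stand-in for a blocking external call (an assumption for the demo).
    // While it sleeps, the virtual thread is parked and its carrier is free.
    static String callExternalApi(int agentId) throws InterruptedException {
        Thread.sleep(Duration.ofMillis(100));
        return "agent-" + agentId + ":done";
    }

    // Runs `agents` simulated agents concurrently and returns how many finished.
    static int runAgents(int agents) {
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            List<Future<String>> results = new ArrayList<>();
            for (int i = 0; i < agents; i++) {
                final int id = i;
                results.add(executor.submit(() -> callExternalApi(id)));
            }
            int finished = 0;
            for (Future<String> result : results) {
                if (result.get().endsWith(":done")) {
                    finished++;
                }
            }
            return finished;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("agent run failed", e);
        }
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        int finished = runAgents(500);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(finished + " agents finished in about " + elapsedMs + " ms");
    }
}
```

Because every task spends its time parked in the sleep, the 500 simulated agents overlap instead of queueing behind a fixed-size pool of platform threads.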

The Spring Boot 3.x Ecosystem

Spring Boot 3.x brings built-in support for Virtual Threads (since 3.2), GraalVM Native Image, and the WebFlux reactive stack. The Spring ecosystem is massive: Spring Security for authentication and authorization, Spring Data for persistence, Spring Cloud for microservice patterns, Spring Batch for batch processing. Each of these projects has matured over many years of enterprise production use.

When you build an agent system that needs OAuth2 authentication, rate limiting, circuit breakers, and distributed tracing, you do not want to build that from scratch. You want a framework that does it well, has done it well for years, and has the documentation and community to resolve problems quickly.

GraalVM Native: Cold Starts That Don't Hurt

One of the historical criticisms of Java has been startup time. "Write once, run anywhere" felt more like "write once, wait everywhere" for serverless applications. GraalVM Native Image changes that equation. Ahead-of-time compilation produces native binaries that start in under 50 milliseconds.

For Fararoni Flow, this means I can scale agents to zero when there is no work and activate them on demand without users noticing the delay. In an agent system where different types of agents have different usage patterns (some constantly active, others sporadic), this "scale to zero" capability has direct implications for infrastructure costs.

Project Leyden and AOT Caching

Java 25 brings significant improvements from Project Leyden, especially Ahead-of-Time Method Profiling (JEP 515). It records what the application does during a training run and stores the resulting method profiles in the AOT cache for future runs. The result: the JIT compiler can generate optimized native code right at startup instead of waiting to collect profiles while the application warms up.

Benchmarks show 15-25% improvements in startup and warm-up time for applications that use this feature. For an agent system that restarts frequently, whether for deployments, failure recovery, or elastic scaling, every millisecond of startup counts.

Compact Object Headers: Memory That Matters

JEP 519 in Java 25 makes compact object headers a product feature (they were experimental in JDK 24, and are enabled with -XX:+UseCompactObjectHeaders), reducing heap memory usage by up to 20%. Oracle and Amazon tested this feature across hundreds of production services and reported not only lower memory use but also performance improvements of up to 10% and up to 15% fewer garbage collection cycles.

In an orchestration system where every agent holds state, conversation context, and message buffers, memory efficiency is not a minor detail. It is the difference between running 30 agents on one instance and having to scale horizontally ahead of time.

The Truth About Language Choice

In the end, choosing Java 25 is not about rejecting Python or declaring Java "better". It is about choosing the right tool for the right problem. Python is unsurpassed for research, model experimentation, and prototyping. Java is unsurpassed for high-concurrency distributed systems that need predictable performance, complete observability, and drama-free operations.

Fararoni Flow has components in Python, especially those that interact directly with language models and ML tooling, but the orchestration core, the messaging system, state management, and the API layer are in Java 25 because that is where Java shines. A hybrid system that uses each language for what it does best is not a weakness: it is mature architecture.

10. The Future of Agent Orchestration: Where We Are Headed

From Isolated Agents to Distributed Cognitive Systems

Agent orchestration in 2026 is where container orchestration was in 2014: on the verge of an explosion in maturity. Kubernetes caught on because it solved a real problem (how to manage hundreds of containers in production) and did so with powerful abstractions: the pod, the service, the deployment. Agent orchestration needs its own equivalent abstractions.

I believe we are seeing three fundamental abstractions emerge:

1. The Mission as the Unit of Work: In Fararoni Flow, a mission is a complete unit of work that can involve multiple agents, tools, and steps. It is the equivalent of a "job" in batch systems or a "workflow" in integration systems. Missions have state and history, and can be replayed, audited, and optimized. The mission is the fundamental unit of orchestration because it reflects how humans think about work: as objectives to accomplish, not as isolated tasks.

2. The Agent as a Specialized Service: The agents of the future will not be generalists that try to do everything. They will be deep specialists in a specific domain, communicating with other agents through standardized protocols. Anthropic's Model Context Protocol (MCP) and Google's Agent2Agent Protocol (A2A) are the first steps toward that standardization. When a "data analysis" agent can talk to a "report generation" agent and a "quality validation" agent over shared protocols, the whole system becomes more than the sum of its parts.

3. The Orchestrator as Operating System: The orchestrator is not just a coordinator: it is the operating system of the agent ecosystem. It manages the agent lifecycle, resource allocation, failure recovery, security, and observability. In Fararoni Flow, the orchestrator decides which agent runs which mission based on declared capabilities, current state, and priority policies. It is the kernel of the system.
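The dispatch decision described in the third abstraction can be sketched in a few lines. Everything here is illustrative: the `AgentInfo` and `Mission` records, the three states, the "least loaded agent wins" tie-break, and the "higher number first" priority rule are assumptions, not Fararoni Flow's actual model.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Sketch of capability-based dispatch with hypothetical types and policies.
class Dispatcher {
    enum State { READY, DRAINING, OFFLINE }

    record AgentInfo(String name, Set<String> capabilities, State state, int activeMissions) {}

    record Mission(String id, Set<String> requiredCapabilities, int priority) {}

    // Picks a READY agent that declares every required capability,
    // preferring the one with the fewest active missions (name breaks ties).
    static Optional<AgentInfo> select(Mission mission, List<AgentInfo> agents) {
        return agents.stream()
                .filter(agent -> agent.state() == State.READY)
                .filter(agent -> agent.capabilities().containsAll(mission.requiredCapabilities()))
                .min(Comparator.comparingInt(AgentInfo::activeMissions)
                        .thenComparing(AgentInfo::name));
    }

    // Assumed priority policy: missions with a higher number are dispatched first.
    static List<Mission> inDispatchOrder(List<Mission> pending) {
        return pending.stream()
                .sorted(Comparator.comparingInt(Mission::priority).reversed())
                .toList();
    }

    // Small fixed scenario: returns the name of the agent chosen for `capability`.
    static String demo(String capability) {
        List<AgentInfo> agents = List.of(
                new AgentInfo("mail-1", Set.of("email"), State.READY, 3),
                new AgentInfo("mail-2", Set.of("email", "summarize"), State.READY, 1),
                new AgentInfo("mail-3", Set.of("email"), State.OFFLINE, 0));
        Mission mission = new Mission("m-1", Set.of(capability), 5);
        return select(mission, agents).map(AgentInfo::name).orElse("no agent available");
    }

    public static void main(String[] args) {
        System.out.println(demo("email"));
        System.out.println(demo("translate"));
    }
}
```

In the fixed scenario, an "email" mission goes to `mail-2`: `mail-3` has less load but is offline, and `mail-1` is busier. A capability nobody declares yields no agent, which is the point where a real orchestrator would queue or escalate.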

Protocol Convergence: MCP, A2A, and Beyond

The agent protocol ecosystem in 2026 is fragmented but converging quickly:

Protocol	Purpose	Adoption	Status
MCP (Anthropic)	Agent → Tools	110M+ SDK downloads/month	Dominant for integration
A2A (Google)	Agent → Agent	50+ launch partners	Emerging but growing
ACP (IBM/Linux Foundation)	Agent → Agent (commerce)	Limited	Early standardization
Open GAP	Framework-agnostic	OSS community	In development

MCP has won the agent-to-tool integration battle. With 110 million monthly SDK downloads and adoption by the big four (Anthropic, OpenAI, Google, Microsoft), it is the de facto standard. A2A is emerging as the protocol for agent-to-agent communication, with Google leading and gaining backing from AWS and other major players.

The long-term vision is an ecosystem in which these protocols complement each other: MCP so each agent can reach tools, A2A so agents can talk to one another, and possibly a third protocol for system-level orchestration. Fararoni Flow is architecturally prepared for this convergence: our inter-agent communication layer can adapt to new protocols without changing the business logic.

The Democratization of Enterprise AI

A trend that excites me deeply is the democratization these lower costs and open standards are enabling. When DeepSeek offers V4-Pro at $0.003625 per million tokens on a cache hit, and Kimi K2.6 offers frontier-level performance at mid-tier prices, the barrier to entry for building intelligent agent systems collapses.

A startup with a modest budget can now build an agent system that a year ago would have cost hundreds of thousands of dollars. An individual developer can orchestrate multiple specialized agents for less than the cost of a Netflix subscription. This is not just a cost reduction: it is a transfer of power from the big AI labs to individual builders and small teams.

The Risks That Remain

But it is not all optimism. There are real risks the field has not yet solved:

Autonomous Agent Security: When an agent has access to your email, your databases, and your production systems, the attack surface expands dramatically. A well-crafted prompt injection could, in theory, make an agent perform unauthorized actions. DEFCON-level systems like the one in Fararoni Flow are a first step, but autonomous agent security is a field in its infancy.

Model Provider Dependence: Although open protocols such as MCP reduce lock-in at the tool level, dependence on specific model providers remains a risk. If Claude Opus 4.6 is the only model that can handle your most complex use case, you have a single point of failure. The multi-model routing strategy I use in Fararoni Flow mitigates this but does not eliminate it.

Data Quality and Bias: 85% of AI project failures are attributed to poor-quality data. Autonomous agents that make decisions based on biased or incomplete data can amplify those biases at scale. Data governance is not optional: it is a fundamental requirement.

11. Building in Public: The Commitment to Transparency

Since I started building Fararoni Flow, I have chosen to do it in public. I share real metrics (including the failures), explain architectural decisions with their full context, and publish the code so others can learn from it, criticize it, and improve it.

This transparency is not pure altruism. It is a way of building systems that has proven effective time and again: when you know others will see your work, you have an extra incentive to do it well. The "social pressure" of building in public is a quality accelerator.

But there is a deeper benefit. The field of agent orchestration is so new that no consolidated "best practices manual" exists. We are all discovering the answers in real time. By sharing what I learn, successes and failures alike, I contribute to a collective body of knowledge that benefits every builder in this space.

The Numbers I Share

The image that opens this article shows the current state of the system. It is not a snapshot of a good day: it is the typical state. 31 active agents, 369 missions created, nearly 7.3 million tokens processed. The 70% success rate includes experiments that intentionally probe the limits of what is possible. I am not ashamed to admit that some missions fail: every failure is a source of learning.

The Complete Technical Stack

For the technically curious, this is the complete Fararoni Flow stack in its current state:

Layer	Technology	Reason for Choice
Runtime	Java 25 (LTS) + GraalVM Native	Virtual Threads, AOT profiling, cold starts <50 ms
Framework	Spring Boot 3.5 + WebFlux/Netty	Reactive stack, 1.2M req/s, mature ecosystem
Messaging	NATS + JetStream	Sub-millisecond latency, decoupled pub/sub, durable
Architecture	Hexagonal + Sidecar Pattern	Logic isolation, swappable adapters
Protocols	MCP (Model Context Protocol)	Standard for agent-to-tool integration, 110M+ downloads/month
LLM Models	Claude Opus/Sonnet, Kimi K2.6, DeepSeek V4	Intelligent routing by task complexity
Persistence	PostgreSQL + Redis	Durable state + high-speed cache
Observability	Micrometer + Prometheus + Grafana	Real-time metrics, alerting
Infrastructure	Docker + Kubernetes	Container orchestration, auto-scaling

12. Operating in Production: Lessons Only Time Teaches

Building an orchestrator is easy. Operating it in production for months is where the real lessons appear. I have done both, and these are the lessons you will not find in any academic paper or YouTube tutorial.

The 80/20 Law of Agents

I quickly discovered that 80% of the value Fararoni Flow generates comes from 20% of the agents. The agents for email processing, technology briefing generation, and result validation are the ones that run dozens of times a day. The more exotic agents, such as the one that analyzes cybersecurity trends or the one that summarizes academic papers, run weekly but are just as valuable when they are needed.

This distribution taught me to think in three categories of agents: workhorses (which do the heavy daily work), specialists (which are activated for specific tasks), and explorers (which test new capabilities). Each category has different infrastructure, cost, and monitoring requirements. Workhorses need to be always available and cost-optimized. Specialists can start on demand. Explorers can fail without serious consequences.

The "Zombie Agent" Problem

One phenomenon I did not anticipate was "zombie agents": agents stuck in an intermediate state, neither fully active nor fully terminated, consuming resources without producing value. Typical cases are an agent left waiting for a response from an external API that never arrived, or an agent caught in an endless retry loop because its success condition was unreachable.

I solved this with a system of "timeouts with escalation". Each mission step has a default timeout based on its DEFCON level. If a step exceeds its timeout, the system does not just mark it as failed: it escalates it. DEFCON 0 becomes DEFCON 1, which becomes DEFCON 2, until a human intervenes or the mission is aborted automatically. This system has prevented countless cases of agents burning tokens and resources to no purpose.
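One way to read "timeouts with escalation" is as a watchdog that waits on a step, raises the level each time a deadline passes, and aborts when no level is left. The sketch below follows that reading under stated assumptions: the three levels, their timeout values, and the abort rule are hypothetical, and the human-intervention branch is left out. It uses virtual threads, so it needs Java 21 or later.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch of a watchdog with escalating deadlines (levels and values are assumptions).
class EscalatingTimeout {
    enum Defcon {
        LEVEL_0(Duration.ofMillis(50)),
        LEVEL_1(Duration.ofMillis(100)),
        LEVEL_2(Duration.ofMillis(200));

        final Duration timeout;

        Defcon(Duration timeout) { this.timeout = timeout; }

        // Next level up, or null when there is nowhere left to escalate.
        Defcon next() {
            return ordinal() + 1 < values().length ? values()[ordinal() + 1] : null;
        }
    }

    record Outcome(Defcon level, String result) {}

    // Runs the step once. Each time the current level's timeout expires, the
    // step is escalated to the next level; after the last level it is aborted.
    static Outcome runStep(Callable<String> step) {
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            Future<String> attempt = executor.submit(step);
            Defcon level = Defcon.LEVEL_0;
            while (true) {
                try {
                    String value = attempt.get(level.timeout.toMillis(), TimeUnit.MILLISECONDS);
                    return new Outcome(level, value);
                } catch (TimeoutException e) {
                    Defcon next = level.next();
                    if (next == null) {
                        attempt.cancel(true); // interrupt the zombie so it stops consuming resources
                        return new Outcome(level, "ABORTED");
                    }
                    level = next; // escalate and keep waiting
                } catch (InterruptedException | ExecutionException e) {
                    throw new IllegalStateException("step failed", e);
                }
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(runStep(() -> "ok"));
        System.out.println(runStep(() -> { Thread.sleep(60_000); return "late"; }));
    }
}
```

A step that answers promptly finishes at the lowest level. A step that never answers climbs through every level and is then cancelled, so the zombie is interrupted instead of lingering.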

The Importance of Agent "Memory"

An agent without memory is like an employee with amnesia: it starts from scratch in every conversation. Fararoni Flow implements three types of memory:

Session Memory: The context of the current mission: what the agent has learned in earlier steps, the intermediate results, and the decisions made. This memory lives in Redis and is discarded when the mission ends.

Working Memory: Accumulated learnings about patterns of success and failure. If an agent discovers that a certain approach works better for a type of task, that learning is persisted in PostgreSQL and loaded into future missions of the same type. This memory is the foundation of continuous improvement.

System Memory: Static knowledge about the domain: documentation, business rules, and templates the agent uses as reference. This memory is updated manually by the system's operators.

The combination of these three memories turns agents from executors of isolated tasks into continuous learners. Every failed mission is a learning opportunity that benefits future missions.

Real Costs vs. Projected Costs

When I designed the system, I projected a monthly operating cost based on public benchmarks. The real costs turned out different: not necessarily higher, but distributed differently. I found that:

60% of token cost goes to "premium" models (Claude Opus), yet they account for only 15% of calls. Those calls are the most critical ones.
25% of the cost goes to "workhorse" models (Kimi K2.6), which account for 55% of calls. This is where intelligent routing pays dividends.
The remaining 15% goes to "budget" models (DeepSeek V4), which account for 30% of calls. These are simple tasks that do not justify expensive models.

This distribution validated my original hypothesis: you do not need a supermodel for every task. You need a system that assigns the right model to the right task. The difference between a system that spends $1,000/month and one that spends $200/month is not output quality: it is the intelligence of the routing.
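The shares above imply a cost per call for each tier, and a router only has to decide which tier a task deserves. The sketch below uses the percentages quoted in the text; the complexity score, its thresholds (0.8 and 0.3), and all type names are hypothetical and stand in for whatever signal a real router would use.

```java
import java.util.List;

// Sketch: tier routing by a hypothetical complexity score, plus the cost per
// call implied by the quoted shares (cost share divided by call share).
class ModelRouting {
    enum Tier { PREMIUM, WORKHORSE, BUDGET }

    record TierShare(Tier tier, double costShare, double callShare) {
        // Above 1.0 means a call on this tier costs more than the average call.
        double relativeCostPerCall() {
            return costShare / callShare;
        }
    }

    // Percentages quoted in the article, expressed as fractions.
    static final List<TierShare> OBSERVED = List.of(
            new TierShare(Tier.PREMIUM, 0.60, 0.15),
            new TierShare(Tier.WORKHORSE, 0.25, 0.55),
            new TierShare(Tier.BUDGET, 0.15, 0.30));

    // `complexity` is assumed to be a score in [0, 1] computed upstream;
    // the thresholds are illustrative, not tuned values.
    static Tier route(double complexity) {
        if (complexity >= 0.8) {
            return Tier.PREMIUM;
        }
        if (complexity >= 0.3) {
            return Tier.WORKHORSE;
        }
        return Tier.BUDGET;
    }

    public static void main(String[] args) {
        for (TierShare share : OBSERVED) {
            System.out.printf("%-9s %.2fx the average cost per call%n",
                    share.tier(), share.relativeCostPerCall());
        }
        System.out.println(route(0.9) + " " + route(0.5) + " " + route(0.1));
    }
}
```

On these numbers a premium call costs about four times the average call (0.60 / 0.15) and roughly nine times a workhorse call, which is why sending only the critical 15% of calls to the premium tier dominates the bill.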

The Human Dimension

Technically, Fararoni Flow is a software system. Operationally, it is a human-machine team. Agents handle 80% of the routine work, but humans remain essential for:

Validating critical outputs: A briefing for an investment decision does not go out without human review.
Handling edge cases: Agents are good at the common; humans are better at the unexpected.
Training new agents: Every new agent requires human supervision during its first runs.
Defining strategy: Agents execute; humans decide what to execute.

Ignoring this human dimension is one of the most common mistakes I see in AI implementations. Systems that try to replace humans entirely fail. Systems that extend human capabilities, delegating the routine so humans can focus on the strategic, succeed.

13. The Question You Should Be Asking

If you have made it this far, you are probably considering how AI can transform your work, your team, or your company. The question you should be asking is not "which model should I use?" or "how much does Claude Code cost?". The question is: "what is my orchestration strategy?"

The data is clear. Roughly 80% of AI projects fail. Costs can vary by orders of magnitude depending on how you route your model calls. The tools that dominate today may be obsolete in a year. In this environment, the only sustainable advantage is not knowing the latest model. It is having an architecture that lets you adapt faster than the competition.

I have spent the past few months building that architecture for my own use. Fararoni Flow is not a finished product (it is a living system that evolves weekly), but it is a concrete answer to an abstract question: how do you orchestrate dozens of specialized agents so they work together as one coherent system, with resilience, observability, and controlled costs?

Who This Article Is For

If you are a CTO or VP of Engineering weighing an AI strategy for your company, the data I presented on failure rates and costs should be your starting point. Do not start with "which model do we buy". Start with "which architecture lets us test, fail, and adapt quickly".

If you are a senior developer who uses Claude Code, GitHub Copilot, or Cursor daily, the numbers on model routing should resonate with you. You do not need to give up Claude Code to save money. You need a system that uses Claude for what Claude does best and cheaper models for everything else.

If you are a software architect building distributed systems, the patterns I described (hexagonal, sidecar, event-driven with NATS) should be familiar. The novelty is not in the individual patterns but in how they compose to solve a new problem: the orchestration of autonomous agents.

If you are an entrepreneur thinking about building something in the AI agent space, I want you to know the field is wide open. The big players are busy building models. There is a massive opportunity in the orchestration layer, the protocol layer, and the tooling layer that makes agents productive in the real world.

How to Connect

If you are curious to try Fararoni Flow, if you are building something similar and want to trade ideas, or if you simply want to understand better how an agent orchestrator works on Java 25, NATS, and hexagonal architecture: contact me through ebercruz.com. I am building this in public, learning in public, and sharing what I learn.

I do not promise that Fararoni Flow is the perfect solution for your use case. But I do promise the conversation will be honest, technical, and results-oriented. In a field where most are selling smoke, I prefer to pour concrete, and to share how I do it.

If you persist, the results come.

Not always on the timeline you expect. Not always in the form you imagined. But they come. That is the final lesson Fararoni Flow has taught me: in building intelligent systems, as in life, disciplined strategy beats uncontrolled speed. Thoughtful architecture beats impulsive prompts. And persistence, the capacity to keep iterating when everything seems broken, is the definitive differentiator between those who turn AI into a competitive advantage and those who become another failure statistic in the next RAND Corporation report.

References and Sources

RAND Corporation (2025). AI Project Failure Analysis: 2,400+ Enterprise Initiatives. Retrieved from Folio3 AI: https://www.folio3.ai/blog/ai-project-failure-rate-stats
MIT Project NANDA (2025). The GenAI Divide: State of AI in Business. Retrieved from TechTarget: https://www.techtarget.com/searchenterpriseai/feature/AI-deployments-gone-wrong-The-fallout-and-lessons-learned
S&P Global Market Intelligence (2025). AI Initiative Abandonment Rates. Retrieved from Folio3 AI: https://www.folio3.ai/blog/ai-project-failure-rate-stats
Gartner (2024). Predicts 30% of GenAI Projects Will Be Abandoned After POC By End of 2025. Retrieved from Gartner: https://www.gartner.com/en/newsroom/press-releases/2024-07-29-gartner-predicts-30-percent-of-generative-ai-projects-will-be-abandoned-after-proof-of-concept-by-end-of-2025
Fortune (2026). Microsoft lost its way in the AI race. Can Copilot get it back on course? Retrieved from Fortune: https://fortune.com/2026/05/21/microsoft-copilot-ai-openai-satya-nadella-gemini-claude/
DeepSeek API Documentation (2026). Models & Pricing. Retrieved from deepseek.com: https://api-docs.deepseek.com/quick_start/pricing
InfoWorld (2026). DeepSeek's steep V4-Pro price cut escalates AI pricing war. Retrieved from InfoWorld: https://www.infoworld.com/article/4176709/deepseeks-steep-v4-pro-price-cut-escalates-ai-pricing-war.html
Medium/AI tentenco (2026). Kimi K2.6 & Kimi Code Review: Saving 88% Coding Costs? Retrieved from Medium: https://medium.com/@tentenco/kimi-k2-6-kimi-code-review-saving-88-coding-costs-b7e8c5eaf5f1
Ideas2IT (2026). Claude Code With Kimi, DeepSeek vs Claude: Cost & Benchmarks. Retrieved from ideas2it.com: https://www.ideas2it.com/blogs/claude-code-alternative-models
Uvik (2026). Claude Code vs Cursor vs Copilot vs Codex. Retrieved from uvik.net: https://uvik.net/blog/claude-code-vs-cursor-vs-copilot-vs-codex-2026/
Menlo Ventures (2025). State of Generative AI in the Enterprise. Referenced in beam.ai: https://beam.ai/agentic-insights/the-great-ai-flip-why-76-of-enterprises-stopped-building-ai-in-house
Caylent (2026). POC to PROD: Hard Lessons from 200+ Enterprise Generative AI Deployments. Retrieved from Caylent: https://caylent.com/blog/poc-to-prod-hard-lessons-from-200-enterprise-generative-ai-deployments-part-2
IBM (2025). What is AI Agent Orchestration? Retrieved from IBM: https://www.ibm.com/think/topics/ai-agent-orchestration
Lyzr AI (2026). Agent Orchestration 101. Retrieved from lyzr.ai: https://www.lyzr.ai/blog/agent-orchestration/
Model Context Protocol (2026). MCP Roadmap and Technical Direction. Retrieved from getknit.dev: https://www.getknit.dev/blog/the-future-of-mcp-roadmap-enhancements-and-whats-next
Java 25 LTS Release Notes (2025). Performance Improvements in JDK 25. Retrieved from inside.java: https://inside.java/2025/10/20/jdk-25-performance-improvements/
GitHub - loom-webflux-benchmarks (2026). Benchmarks of Spring Boot REST service comparing Java Virtual Threads with WebFlux. Retrieved from GitHub: https://github.com/chrisgleissner/loom-webflux-benchmarks
OpenAI Developer Community (2025). Prompt Engineering Is Dead, and Context Engineering Is Already Obsolete. Retrieved from OpenAI API Community Forum: https://community.openai.com/t/prompt-engineering-is-dead-and-context-engineering-is-already-obsolete-why-the-future-is-automated-workflow-architecture-with-llms/1314011

Eber Cruz Fararoni is a software architect specializing in distributed systems, event-driven architectures, and applied artificial intelligence. He builds Fararoni Flow, an open-source AI agent orchestrator, on Java 25, NATS, and hexagonal architecture. He writes at ebercruz.com about the intersection of software engineering and artificial intelligence.

If you found this article useful, share it with someone navigating the complex landscape of enterprise AI in 2026. And if you want to try Fararoni Flow or trade ideas about agent orchestration: contact me.

The Architect's Paradox: From a 106 Fan-Out "God Object" to Sovereign AI in Java 25

Eber Cruz Fararoni — Wed, 22 Apr 2026 12:56:50 +0000

As a Staff Engineer with over 11 years in mission-critical banking systems (HSBC, Santander), you’d think I’ve seen it all. When I began building the Fararoni ecosystem—a Java 25-based infrastructure for Sovereign AI—I had a clear vision: low-latency agentic orchestration, NATS JetStream, and sub-microsecond interrupt cycles. I had the patterns in my head, but I didn't have the diagram on paper.

The Knot: When Speed Becomes a Liability
In my obsession with "touching the metal" of Project Panama (FFM) and Virtual Threads, I built a high-performance engine trapped inside a "Big Ball of Mud." When I finally ran a metrics audit, the reality check was brutal: my FararoniCore class had a Fan-Out of 106. I was micro-managing 106 components from a single orchestrator. I had created a "God Object" in a system meant to preach decentralization.

The "Limitations" of the Cutting Edge
While Java 25 is a masterpiece, the friction between high-level abstraction and native performance (Panama/Loom) is where the real architecture happens. If you don't design clean boundaries, you hit operational ceilings:

Carrier Thread Pinning: In high-contention scenarios with native calls, Virtual Threads can block the carrier, killing the very scalability you seek.

Arena Lifecycles: Managing memory in Panama isn't hard; managing it across 106 coupled components without leaks is impossible.

Observability: Debugging 10,000+ virtual threads in a "Ball of Mud" is a nightmare.

The Re-engineering: The Road to Rome
They say "All roads lead to Rome," referring to the Milliarium Aureum in the Roman Forum. In software, there are multiple paths to success, but sustainability requires a map. I chose to pause and dismantle the chaos:

Domain Evacuation: Moving state to immutable Records.

Bounded Contexts: Refactoring 68 flat directories into 7 clean "neighborhoods."

The NATS Decoupling: Moving from direct calls to an event-driven bus. The Core no longer "pushes" data; it emits events.

Ports & Adapters: Implementing Hexagonal Architecture so that switching a module is like plugging in a cable, not rewriting the heart of the system.
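A compact sketch of where those four steps lead: state in an immutable record, an outbound port for events, and a core that emits instead of calling components directly. All type names here are hypothetical, and the in-memory adapter stands in for a NATS-backed one that would implement the same interface.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the target shape after the refactoring (hypothetical names).
class HexagonalSketch {
    // Domain state as an immutable record.
    record MissionCompleted(String missionId, int steps) {}

    // Outbound port: the core knows this interface, not the transport behind it.
    interface EventPublisher {
        void publish(String subject, Object event);
    }

    // Core logic depends only on the port and emits events.
    static final class Core {
        private final EventPublisher events;

        Core(EventPublisher events) {
            this.events = events;
        }

        void completeMission(String missionId, int steps) {
            events.publish("mission.completed", new MissionCompleted(missionId, steps));
        }
    }

    // Adapter that records events in memory; a message-bus adapter would
    // implement the same port without any change to Core.
    static final class InMemoryPublisher implements EventPublisher {
        final List<String> log = new ArrayList<>();

        @Override
        public void publish(String subject, Object event) {
            log.add(subject + " -> " + event);
        }
    }

    public static void main(String[] args) {
        InMemoryPublisher publisher = new InMemoryPublisher();
        new Core(publisher).completeMission("m-42", 3);
        publisher.log.forEach(System.out::println);
    }
}
```

Swapping the transport means writing another `EventPublisher`, which is the "plugging in a cable" property: `Core` has a fan-out of one regardless of how many consumers listen on the bus.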

Conclusion: The Resilient Architect
Quality isn't just about the lines you write, but the ones you manage to decouple. I didn't settle for "it works." I obsessed over making it Sovereign and maintainable. Being vulnerable as an architect means admitting the code beat you for a day; being resilient means winning the war for design.

About the Author

Eber Cruz Fararoni is a software engineer with a decade of experience designing backend infrastructure and distributed systems.

Currently focused on AI-assisted software engineering, deterministic guardrails, and hybrid kernel architectures for secure LLM execution.

This article documents the architecture behind C-FARARONI, an experimental ecosystem for technological sovereignty and secure local AI model execution.

LinkedIn · GitHub · ebercruz.com

The End of Destructive AI Hallucinations: Hybrid Kernel Architecture with Java 25 and Zero-Trust Guardrails

Eber Cruz Fararoni — Wed, 15 Apr 2026 13:32:40 +0000

Integrating Deterministic Routing and Zero-Trust Guardrails

Eber Cruz Fararoni — Software Engineer | C-FARARONI Project
March 2026 · Architecture Notes
Fararoni Kernel: Deterministic Routing + Zero-Trust Protection Triad

Abstract

The adoption of Large Language Models (LLMs) in software development faces two critical barriers: the stochastic nature that causes destructive hallucinations and the high latency on trivial commands. This article presents the Fararoni Kernel, a hybrid execution architecture that solves both problems simultaneously through two complementary contributions:

(1) Deterministic Routing (Levels 1-2): A 5-level execution cascade that intercepts system commands (pwd, ls, git) and maps natural intentions ("do the commit") directly to shell sequences, removing the LLM from tasks where its intervention introduces unnecessary latency and hallucination risk.

(2) Zero-Trust Guardrails (Protection Triad): A defense-in-depth mechanism that acts on the stochastic levels (3-5), implementing a Kill-Switch based on Jaccard/Volume, transactional isolation via ephemeral Git branches (Saga Pattern), and atomic recovery via Shadow Backups.

Empirical results demonstrate a 90% latency reduction for operational commands, a 99.99% destructive hallucination blocking rate, and 0% permanent data loss.

Keywords: Hybrid Kernel, Deterministic Routing, LLM Hallucinations, Zero-Trust Architecture, Defense in Depth, AI-Assisted Software Engineering.

1. Introduction

1.1 Problem Statement

The integration of LLM agents in development workflows presents a fundamental dilemma: the same model that can refactor complex code also hallucinates responses for a simple pwd, inventing paths like /home/user when the actual directory is /Users/the/Projects/microservice. Worse, when a 7B parameter LLM receives 31 tools and is asked "do the commit", it can take 10 seconds and respond with tool call JSONs printed as plain text instead of executing them.

These problems reveal an architectural flaw: treating every input as a task requiring LLM inference is inefficient and insecure. Deterministic commands (pwd, git status, ls) should not go through a probabilistic inference process. Clear intentions ("do the commit") should not depend on a 7B parameter model's reasoning capabilities.

1.2 Limitations of Current Solutions

Tool	Approach	Limitation
Pure Agent Architectures	Everything through the LLM	Unnecessary latency on simple commands
NeMo Guardrails	Content filter	Does not protect structural code integrity
Prompt Engineering	Model instructions	20% leakage rate in production
IDEs with AI	Human validation	Not scalable in autonomous workflows

No current solution simultaneously addresses the latency problem (when to wake the LLM) and the integrity problem (how to protect code when the LLM operates).

1.3 Contribution

We propose an inversion of control in two dimensions:

Routing: Instead of sending everything to the LLM, the Kernel decides the minimum complexity level needed to resolve each input.
Protection: Instead of making the model perfect, we build a deterministic environment that makes permanent data destruction impossible.

2. Hybrid Kernel Architecture

2.1 General Vision: 5-Level Cascade

                    ┌─────────────────────────┐
                    │      USER INPUT         │
                    └────────────┬────────────┘
                                 │
    ╔════════════════════════════╪═══════════════════════════╗
    ║  DETERMINISTIC ZONE (No LLM)                           ║
    ║                            │                           ║
    ║  ┌─────────────────────────┴─────────────────────────┐ ║
    ║  │ LEVEL 1: BARE COMMANDS                            │ ║
    ║  │ pwd, ls, git status                    ~0ms Shell │ ║
    ║  └─────────────────────────┬─────────────────────────┘ ║
    ║                       null?│                           ║
    ║  ┌─────────────────────────┴─────────────────────────┐ ║
    ║  │ LEVEL 1.5: COMPOSITE                              │ ║
    ║  │ "do the commit"                 ~0ms Macro + Shell│ ║
    ║  └─────────────────────────┬─────────────────────────┘ ║
    ║                       null?│                           ║
    ║  ┌─────────────────────────┴─────────────────────────┐ ║
    ║  │ LEVEL 2: GGUF LOCAL                               │ ║
    ║  │ "hello", "thanks"               ~1s Light LLM     │ ║
    ║  └─────────────────────────┬─────────────────────────┘ ║
    ╚════════════════════════════╪═══════════════════════════╝
    ─────── DETERMINISTIC / STOCHASTIC FRONTIER ───────────────
    ╔════════════════════════════╪═══════════════════════════╗
    ║  STOCHASTIC ZONE (With LLM)                            ║
    ║                            │                           ║
    ║          ┌─────────────────┴────────────────┐          ║
    ║          │   PROTECTION TRIAD               │          ║
    ║          │   ← Active here                  │          ║
    ║          └─────────────────┬────────────────┘          ║
    ║                            │                           ║
    ║  ┌─────────────────────────┴─────────────────────────┐ ║
    ║  │ LEVEL 3: TOOL CALLING                             │ ║
    ║  │ 31 tools                        8-10s LLM + Tools │ ║
    ║  └─────────────────────────┬─────────────────────────┘ ║
    ║                       null?│                           ║
    ║  ┌─────────────────────────┴─────────────────────────┐ ║
    ║  │ LEVEL 4: THINKING                                 │ ║
    ║  │ Deep reasoning               10-15s DeepSeek/Qwen3│ ║
    ║  └─────────────────────────┬─────────────────────────┘ ║
    ║                       null?│                           ║
    ║  ┌─────────────────────────┴─────────────────────────┐ ║
    ║  │ LEVEL 5: FALLBACK                                 │ ║
    ║  │ Plain VllmClient                2-5s No tools     │ ║
    ║  └───────────────────────────────────────────────────┘ ║
    ╚════════════════════════════════════════════════════════╝

2.2 Key Principle: Deterministic/Stochastic Separation

Zone	Levels	Nature	Latency	Git Protection	Hallucination Risk
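Deterministic	1, 1.5, 2	Fixed rules (Levels 1, 1.5); local chat model, no tools (Level 2)	0-1s	Direct (no ephemeral branch)	0% destructive (no LLM-driven writes)
Stochastic	3, 4, 5	LLM inference	2-15s	Protection Triad active	99.99% of destructive hallucinations blocked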

Fundamental insight: By resolving 60-70% of interactions in the deterministic zone, we drastically reduce both average latency and the attack surface for hallucinations.
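The cascade can be sketched as a chain of handlers that each either resolve the input or pass it on. This is a minimal illustration, not the Kernel's code: `CascadeLevel`, `CascadeRouter`, and the stand-in levels are hypothetical names, and `Optional.empty()` is used here where the diagram shows a `null` fall-through.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

// One level of the cascade: returns a result, or empty to fall through.
interface CascadeLevel {
    Optional<String> tryHandle(String input);
}

final class CascadeRouter {
    private final List<CascadeLevel> levels;

    CascadeRouter(List<CascadeLevel> levels) {
        this.levels = List.copyOf(levels);
    }

    // Tries each level in order; the first non-empty result wins.
    String route(String input) {
        for (CascadeLevel level : levels) {
            Optional<String> result = level.tryHandle(input);
            if (result.isPresent()) {
                return result.get();
            }
        }
        throw new IllegalStateException("no level handled: " + input);
    }

    // Hypothetical wiring with stand-in levels (no real shell or LLM calls).
    static CascadeRouter demo() {
        Set<String> bare = Set.of("pwd", "ls", "date");
        CascadeLevel level1 = in -> bare.contains(in.trim())
                ? Optional.of("L1:" + in.trim())
                : Optional.empty();
        CascadeLevel level3 = in -> Optional.of("L3:tool-calling");
        return new CascadeRouter(List.of(level1, level3));
    }
}
```

Ordering the list from cheapest to most expensive level is what keeps the LLM asleep for inputs that a fixed rule can answer.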

3. Deterministic Zone: Routing Without LLM

3.1 Level 1: Bare Commands

System and git commands executed directly via JVM or ProcessBuilder. The LLM never learns the user typed anything.

Type	Pattern	Example
System	`SAFE_BARE_COMMANDS` (pwd, ls, date...)	"pwd" → Direct JVM
Git read	"git " prefix + safe subcommand	"git status" → shell
Git write	"git " prefix + safe subcommand	"git add ." → shell
Git blocked	"git " prefix + push/pull/fetch	"git push" → BLOCKED

Git command classification:

Subcommand	Risk Level	Behavior
status, log, diff, show	READ_ONLY	Direct execution
add, commit, checkout, branch, stash, init, tag	LOCAL_WRITE	Direct execution
push, pull, fetch, clone	REMOTE	BLOCKED (Ring 7)
reset --hard, clean -f	DESTRUCTIVE	BLOCKED

Performance: ~0ms. Zero tokens consumed. Zero hallucination risk.
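The classification table above could be implemented along these lines. This is a sketch under assumptions: `GitRisk`, `GitCommandClassifier`, and the `UNKNOWN` fallback are illustrative names, and the flag detection for `reset --hard` and `clean -f` is a simplification.

```java
import java.util.Set;

enum GitRisk { READ_ONLY, LOCAL_WRITE, REMOTE, DESTRUCTIVE, UNKNOWN }

final class GitCommandClassifier {
    private static final Set<String> READ_SUBCOMMANDS =
            Set.of("status", "log", "diff", "show");
    private static final Set<String> WRITE_SUBCOMMANDS =
            Set.of("add", "commit", "checkout", "branch", "stash", "init", "tag");
    private static final Set<String> REMOTE_SUBCOMMANDS =
            Set.of("push", "pull", "fetch", "clone");

    static GitRisk classify(String input) {
        String[] parts = input.trim().split("\\s+");
        if (parts.length < 2 || !parts[0].equals("git")) {
            return GitRisk.UNKNOWN;
        }
        String sub = parts[1];
        // Destructive forms are checked before the generic tables.
        if (sub.equals("reset") && hasFlag(parts, "--hard")) return GitRisk.DESTRUCTIVE;
        if (sub.equals("clean") && hasForceFlag(parts)) return GitRisk.DESTRUCTIVE;
        if (READ_SUBCOMMANDS.contains(sub)) return GitRisk.READ_ONLY;
        if (WRITE_SUBCOMMANDS.contains(sub)) return GitRisk.LOCAL_WRITE;
        if (REMOTE_SUBCOMMANDS.contains(sub)) return GitRisk.REMOTE;
        return GitRisk.UNKNOWN;
    }

    // Only READ_ONLY and LOCAL_WRITE run directly; everything else is refused here.
    static boolean isAllowed(String input) {
        GitRisk risk = classify(input);
        return risk == GitRisk.READ_ONLY || risk == GitRisk.LOCAL_WRITE;
    }

    private static boolean hasFlag(String[] parts, String flag) {
        for (int i = 2; i < parts.length; i++) {
            if (parts[i].equals(flag)) return true;
        }
        return false;
    }

    // Matches --force and short flag groups containing 'f', such as -f or -fd.
    private static boolean hasForceFlag(String[] parts) {
        for (int i = 2; i < parts.length; i++) {
            String p = parts[i];
            if (p.equals("--force")) return true;
            if (p.startsWith("-") && !p.startsWith("--") && p.contains("f")) return true;
        }
        return false;
    }
}
```

Using an allow-list means a subcommand missing from every table is never executed directly, which is the safer default for a deterministic path.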

3.2 Level 1.5: Composite Commands

Natural language intention mapping to command sequences. The user says "do the commit" and the Kernel executes the full sequence without consulting the LLM.

"do the commit"             → git add . && git commit
"commit everything"         → git add . && git commit
"do the git init and commit"→ git init && .gitignore && git add . && git commit
"save changes to git"       → git add . && git commit

Execution sequence:

executeCompositeCommit(input)
  │
  ├── Does .git exist?
  │     ├── NO + input mentions "init" → git init + auto .gitignore
  │     └── NO + no "init" → error with suggestion
  │
  ├── git add --all -- . :!.fararoni/    (excludes shadow files)
  ├── git diff --cached --stat           (verify changes)
  ├── extractCommitMessage(input)        (auto-generate or extract from quotes)
  └── git commit -m "message"

Performance: ~0ms. The full sequence (init + gitignore + add + commit) executes without LLM.
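A sketch of the intent match and message extraction follows. The article only shows "do.*commit" as one matching pattern, so the broader regular expression, the `CompositeCommit` class, and the generic fallback message are assumptions. The sketch only builds the argument lists; running them (for example with `ProcessBuilder`) is left out.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CompositeCommit {
    // Assumed intent pattern covering the example phrases in this section.
    private static final Pattern COMMIT_INTENT = Pattern.compile(
            "\\b(do|make)\\b.*\\bcommit\\b|\\bcommit everything\\b|\\bsave changes to git\\b",
            Pattern.CASE_INSENSITIVE);

    // A message in double or single quotes, if the user supplied one.
    private static final Pattern QUOTED = Pattern.compile("\"([^\"]+)\"|'([^']+)'");

    static boolean isCommitIntent(String input) {
        return COMMIT_INTENT.matcher(input).find();
    }

    // Uses the quoted message if present, otherwise builds a generic one.
    static String extractCommitMessage(String input, List<String> stagedFiles) {
        Matcher m = QUOTED.matcher(input);
        if (m.find()) {
            return m.group(1) != null ? m.group(1) : m.group(2);
        }
        if (stagedFiles.size() == 1) {
            return "Update " + stagedFiles.get(0);
        }
        return "Update " + stagedFiles.size() + " files";
    }

    // Argument lists for the three git steps; the pathspec excludes shadow files.
    static List<List<String>> plan(String message) {
        return List.of(
                List.of("git", "add", "--all", "--", ".", ":!.fararoni/"),
                List.of("git", "diff", "--cached", "--stat"),
                List.of("git", "commit", "-m", message));
    }
}
```

Passing each command as an argument list rather than a shell string avoids quoting problems with the `:!` pathspec and with commit messages that contain spaces.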

3.3 Level 2: GGUF (Simple Chat)

Casual conversation executed against an in-memory GGUF model (no network).

Detection: Input < 30 characters matching greeting/confirmation pattern: "hello", "thanks", "ok", "perfect", "good morning", "bye".

Performance: ~1 second. No tools, no git, no risk.

4. Stochastic Zone: Protection Triad

4.1 When does the stochastic zone activate?

Any input NOT captured by Levels 1, 1.5, or 2 falls to the stochastic zone. Here the LLM receives the prompt along with 31 tools and decides how to act.

"change the java version in the pom to 25"          → fs_patch
"create a REST endpoint for students"                 → fs_write
"organize the repo with feature/hotfix branches"      → GitAction (ephemeral branch)
"analyze the NullPointerException error"              → fs_read + reasoning

4.2 Protection Triad: Defense in Depth

                    ┌─────────────────────────────┐
                    │     FARARONI IRONCLAD       │
                    │  Protection Triad           │
                    └──────────────┬──────────────┘
                                   │
              ┌────────────────────┼────────────────────┐
              │                    │                    │
              ▼                    ▼                    ▼
    ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
    │   LAYER 1:       │ │   LAYER 2:       │ │   LAYER 3:       │
    │   KILL-SWITCH    │ │   GIT SAGA       │ │   SHADOW BACKUP  │
    │                  │ │                  │ │                  │
    │  Jaccard ≥ 40%   │ │  Ephemeral Branch│ │  Atomic Copy     │
    │  Volume  ≥ 50%   │ │  Auto-Revert     │ │  Pre-Write       │
    │                  │ │  Squash Merge    │ │  Recovery        │
    │                  │ │                  │ │                  │
    │  Blocks 99.9%    │ │  Contains 0.09%  │ │  Recovers 0.01%  │
    └────────┬─────────┘ └────────┬─────────┘ └────────┬─────────┘
             │                    │                    │
             └────────────────────┼────────────────────┘
                                  │
                                  ▼
                    ┌─────────────────────────────┐
                    │  RESULT: 0%                 │
                    │  PERMANENT LOSS             │
                    └─────────────────────────────┘

4.3 Layer 1: Kill-Switch (Jaccard + Volume)

The Kill-Switch intercepts BEFORE each disk write and calculates two metrics:

Jaccard Index: J(A,B) = |A ∩ B| / |A ∪ B| ≥ 0.40

Volume Ratio: V = newSize / oldSize ≥ 0.50

Metric	Formula	Threshold	Detects
Jaccard Index	J(A,B) = |A ∩ B| / |A ∪ B|	≥ 0.40	Low overlap with the original content
Volume Ratio	V = newSize / oldSize	≥ 0.50	Massive truncations

  ORIGINAL (45 lines)             LLM PROPOSAL (20 lines)
  ──────────────────              ──────────────────────────
  class CreditoBancario {         class CreditoBancario {
    private UUID id;                private UUID id;
    private BigDecimal monto;       private BigDecimal monto;
    private BigDecimal tasaInteres; private BigDecimal tasaInteres;
    private LocalDate fecha;        // ... rest of the code
    private EstadoCredito estado; }
    private List<Pago> historial;
  }

  Volume Ratio:  20/45 = 0.44  → ✗ FAIL (< 0.50)
  Jaccard Index: 3/7  = 0.43  → ✓ PASS  (≥ 0.40)
  Decision: ✗ BLOCKED (Insufficient Volume)
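The two checks can be sketched in a few lines. The article does not say how the sets A and B are built or in which unit size is measured, so this sketch assumes sets of distinct, trimmed, non-blank lines for Jaccard and line counts for volume; `KillSwitch` and its method names are illustrative.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;

final class KillSwitch {
    static final double MIN_JACCARD = 0.40;
    static final double MIN_VOLUME = 0.50;

    // Distinct, trimmed, non-blank lines (assumed tokenization).
    static Set<String> lineSet(String text) {
        return Arrays.stream(text.split("\\R"))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toSet());
    }

    // J(A,B) = |A ∩ B| / |A ∪ B|
    static double jaccard(String oldText, String newText) {
        Set<String> a = lineSet(oldText);
        Set<String> b = lineSet(newText);
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    // V = newSize / oldSize, measured in lines (assumed unit).
    static double volumeRatio(String oldText, String newText) {
        long oldLines = oldText.lines().count();
        long newLines = newText.lines().count();
        return oldLines == 0 ? 1.0 : (double) newLines / oldLines;
    }

    // A write is allowed only if both thresholds are met.
    static boolean allowsWrite(String oldText, String newText) {
        return jaccard(oldText, newText) >= MIN_JACCARD
                && volumeRatio(oldText, newText) >= MIN_VOLUME;
    }
}
```

Requiring both metrics matters because each misses a failure mode the other catches: a truncated file can keep high overlap with its surviving lines, and a same-length file can be entirely rewritten.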

4.4 Layer 2: Git Saga (Ephemeral Branches)

Selective activation: The ephemeral branch ONLY activates when all 8 conditions are met:

#	Condition	Must be met
1	Input NOT captured by Level 1, 1.5, or 2	YES
2	LLM decided to invoke GitAction (not fs_patch)	YES
3	Action is LOCAL_WRITE (add, commit, branch...)	YES
4	gitManager != null (injected in constructor)	YES
5	No ephemeral branch already active	YES
6	Is a git repo (.git exists)	YES
7	No merge/rebase in progress	YES
8	Repo has at least 1 commit (valid HEAD)	YES

LLM invokes GitAction(commit) → Level 3
  │
  ▼
ensureEphemeralBranch()
  └── git checkout -b fararoni/wip-{timestamp}
  │
  ▼
All LLM commits go to fararoni/wip-{timestamp}
User's branch remains INTACT
  │
  ▼
Finalization (squash merge):
  git checkout {original_branch}
  git merge --squash fararoni/wip-{id}
  git commit -m "[FARARONI] clean description"
  git branch -D fararoni/wip-{id}
  │
  ▼
Result: 1 single clean commit on the user's branch
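The branch lifecycle above maps to a small set of git invocations, sketched here as pure functions that return argument lists. `GitSagaPlan` is a hypothetical name, and the `abort` sequence is an assumption about what the auto-revert does, since the article does not list those commands.

```java
import java.util.List;

final class GitSagaPlan {
    static final String PREFIX = "fararoni/wip-";

    static String branchName(long timestamp) {
        return PREFIX + timestamp;
    }

    // Start: move LLM work onto an ephemeral branch.
    static List<String> begin(String wipBranch) {
        return List.of("git", "checkout", "-b", wipBranch);
    }

    // Success: squash all LLM commits into one commit on the user's branch.
    static List<List<String>> finish(String originalBranch, String wipBranch, String message) {
        return List.of(
                List.of("git", "checkout", originalBranch),
                List.of("git", "merge", "--squash", wipBranch),
                List.of("git", "commit", "-m", "[FARARONI] " + message),
                List.of("git", "branch", "-D", wipBranch));
    }

    // Assumed rollback: return to the user's branch and discard the work.
    static List<List<String>> abort(String originalBranch, String wipBranch) {
        return List.of(
                List.of("git", "checkout", originalBranch),
                List.of("git", "branch", "-D", wipBranch));
    }
}
```

Keeping the plan separate from its execution makes the sequence easy to test without a repository; a real runner would also need to handle a failing step, such as a checkout refused because of uncommitted changes.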

4.5 Layer 3: Shadow Backups

Before each write that passes the Kill-Switch, an atomic copy is created:

.fararoni/shadow/pom.xml.v1.20260302-005551
.fararoni/shadow/pom.xml.v2.20260302-010233

These copies are the last line of defense. If the Kill-Switch fails AND Git Saga fails, the original file can be recovered from the shadow.

Automatic exclusion: Shadow files never reach git. The auto-generated .gitignore lists .fararoni/, and the composite commit stages files with git add --all -- . :!.fararoni/, which excludes that directory explicitly.
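A pre-write snapshot with the naming scheme shown above might look like the following. `ShadowBackup` is a hypothetical name, and the copy-then-rename strategy is one common way to get an atomic result; the article does not describe how its atomic copy is implemented.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

final class ShadowBackup {
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyyMMdd-HHmmss");

    // Copies `file` to <root>/.fararoni/shadow/<name>.v<version>.<stamp>.
    static Path snapshot(Path root, Path file, int version, LocalDateTime now)
            throws IOException {
        Path shadowDir = root.resolve(".fararoni").resolve("shadow");
        Files.createDirectories(shadowDir);
        String name = file.getFileName() + ".v" + version + "." + STAMP.format(now);
        Path target = shadowDir.resolve(name);
        // Copy to a temp file first, then rename, so readers never see a partial shadow.
        Path tmp = Files.createTempFile(shadowDir, "shadow", ".tmp");
        Files.copy(file, tmp, StandardCopyOption.REPLACE_EXISTING);
        return Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }

    // Restores the original content from a shadow copy.
    static void restore(Path shadow, Path file) throws IOException {
        Files.copy(shadow, file, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

The temp file is created in the shadow directory itself so that the final rename stays on one file system, which is what allows `ATOMIC_MOVE` to succeed.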

5. Security Link: Where does each protection act?

5.1 Activation Table by Level

Level	Kill-Switch	Ephemeral Branch	Shadow Backup	Reason
1: Bare	NO	NO	NO	No LLM, no risk
1.5: Composite	NO	NO	NO	Deterministic commands
2: GGUF	NO	NO	NO	Chat only, no writing
3: Tool Calling	YES (fs_patch)	YES (GitAction)	YES (pre-write)	Risk zone
4: Thinking	NO	NO	NO	Reasoning only
5: Fallback	NO	NO	NO	Text only

5.2 Integrated Activation Map

┌────────────────────────────────────────────────────────────────────────┐
│                  PROTECTION ACTIVATION MAP                              │
├────────────────────────────────────────────────────────────────────────┤
│                                                                        │
│  LEVEL 1 (Bare)        ○ ○ ○   No protection needed                   │
│  LEVEL 1.5 (Composite) ○ ○ ○   No protection needed                   │
│  LEVEL 2 (GGUF)        ○ ○ ○   No protection needed                   │
│                         ─────── DETERMINISTIC/STOCHASTIC FRONTIER ──── │
│  LEVEL 3 (Tool Calling) ● ● ●   Triad ACTIVE                          │
│    └─ fs_patch          ● ○ ●   Kill-Switch + Shadow                   │
│    └─ fs_write          ● ○ ●   Kill-Switch + Shadow                   │
│    └─ GitAction(status) ○ ○ ○   READ_ONLY, no protection               │
│    └─ GitAction(commit) ○ ● ○   Ephemeral branch                      │
│    └─ GitAction(push)   ✗ ✗ ✗   BLOCKED (Ring 7)                       │
│  LEVEL 4 (Thinking)    ○ ○ ○   Reasoning only                         │
│  LEVEL 5 (Fallback)    ○ ○ ○   Text only                              │
│                                                                        │
│  Legend: ● Active  ○ Inactive  ✗ Blocked                               │
└────────────────────────────────────────────────────────────────────────┘

6. Case Study: "Change the Java version to 25"

This case demonstrates how the Kernel integrates routing and protection in a real operation.

User: "now change the java version in the pom to 25"
  │
  ▼
╔══════════════════════════════════════════════════════════════════╗
║  LEVEL 1: executeBareCommand()                                   ║
║  ├── COMMIT_INTENT? → NO (doesn't contain "commit")              ║
║  ├── "git " prefix? → NO (starts with "now")                     ║
║  ├── SAFE_BARE_COMMANDS? → NO ("now" not in the set)             ║
║  └── return null                                                 ║
╚══════════════════════════════════════════════════════════════════╝
  │
  ▼
╔══════════════════════════════════════════════════════════════════╗
║  LEVEL 2: isSimpleChat? → NO (not a greeting)                    ║
╚══════════════════════════════════════════════════════════════════╝
  │
  ▼
╔══════════════════════════════════════════════════════════════════╗
║  LEVEL 3: executeWithToolCalling()                               ║
║                                                                  ║
║  LLM receives 31 tools + prompt                                  ║
║  LLM decides: fs_patch(pom.xml, "17" → "25")                     ║
║                                                                  ║
║  ┌──────────────────────────────────────────────────────┐        ║
║  │  PROTECTION TRIAD (active at Level 3)                │        ║
║  │                                                      │        ║
║  │  1. Kill-Switch:                                     │        ║
║  │     Volume: newSize/oldSize ≈ 1.0  → ✓ PASS          │        ║
║  │     Jaccard: ~0.99               → ✓ PASS            │        ║
║  │     (only changes "17" to "25", 99% identical)       │        ║
║  │                                                      │        ║
║  │  2. Shadow Backup:                                   │        ║
║  │     → .fararoni/shadow/pom.xml.v4.20260302-005551    │        ║
║  │     (pre-write copy created)                         │        ║
║  │                                                      │        ║
║  │  3. Ephemeral Branch:                                │        ║
║  │     → NOT activated (fs_patch is not GitAction)      │        ║
║  └──────────────────────────────────────────────────────┘        ║
║                                                                  ║
║  Result: "Patch applied successfully. File: pom.xml"             ║
╚══════════════════════════════════════════════════════════════════╝

After the change: "do the commit"

User: "do the commit"
  │
  ▼
╔══════════════════════════════════════════════════════════════════╗
║  LEVEL 1.5: COMMIT_INTENT matches "do.*commit"                   ║
║                                                                  ║
║  executeCompositeCommit():                                       ║
║  1. git add --all -- . :!.fararoni/   (shadow excluded)          ║
║  2. git diff --cached → pom.xml                                  ║
║  3. extractCommitMessage → "Update pom.xml"                      ║
║  4. git commit -m "Update pom.xml"                               ║
║                                                                  ║
║  ┌─────────────────────────────────────────────────────┐         ║
║  │  PROTECTION TRIAD:                                  │         ║
║  │  → NOT activated (Level 1.5 is deterministic)       │         ║
║  │  → The commit is a user operation, not the LLM's    │         ║
║  │  → Zero hallucination risk                          │         ║
║  └─────────────────────────────────────────────────────┘         ║
║                                                                  ║
║  Result: [master abc1234] Update pom.xml                         ║
╚══════════════════════════════════════════════════════════════════╝

7. Fallback for Small Models (7B)

7.1 The Problem: "JSON Leakage"

7B-parameter models sometimes write tool calls as plain text in the message content instead of using the structured tool_calls field of an OpenAI-compatible response:

Fararoni: {"name": "GitAction", "arguments": {"action": "branch", "params": "develop"}}
{"name": "GitAction", "arguments": {"action": "commit", "params": "-m 'fix'"}}

The ToolExecutor never sees these because they're in content, not in tool_calls.

7.2 The Solution: extractTextToolCalls()

A parser that scans the response text looking for JSON objects with "name" + "arguments":

LLM Response (content text)
  │
  ▼
extractTextToolCalls(contentText)
  ├── Search for '{' in text
  ├── Count braces to find closure (supports nested JSON)
  ├── Parse as JSON
  ├── Verify it has "name" + "arguments"
  └── Execute via ToolExecutor

This turns a "broken" model into a functional one, without changing the model.
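The brace-counting step can be sketched as follows. `TextToolCalls` is a hypothetical name, and this version stops short of a real JSON parse: it returns each balanced object that mentions both keys, and a JSON library is assumed to validate and decode the candidates before anything reaches the ToolExecutor.

```java
import java.util.ArrayList;
import java.util.List;

final class TextToolCalls {
    // Returns each balanced {...} span that mentions both "name" and "arguments".
    static List<String> extract(String content) {
        List<String> calls = new ArrayList<>();
        int i = 0;
        while (i < content.length()) {
            if (content.charAt(i) != '{') {
                i++;
                continue;
            }
            int end = findClosingBrace(content, i);
            if (end < 0) {
                i++; // unbalanced from here, keep scanning
                continue;
            }
            String candidate = content.substring(i, end + 1);
            if (candidate.contains("\"name\"") && candidate.contains("\"arguments\"")) {
                calls.add(candidate);
                i = end + 1; // skip past the whole object, including nested ones
            } else {
                i++;
            }
        }
        return calls;
    }

    // Index of the brace closing the object at `start`, or -1 if unbalanced.
    // Braces inside JSON string literals are ignored.
    static int findClosingBrace(String s, int start) {
        int depth = 0;
        boolean inString = false;
        boolean escaped = false;
        for (int i = start; i < s.length(); i++) {
            char c = s.charAt(i);
            if (inString) {
                if (escaped) escaped = false;
                else if (c == '\\') escaped = true;
                else if (c == '"') inString = false;
            } else if (c == '"') {
                inString = true;
            } else if (c == '{') {
                depth++;
            } else if (c == '}') {
                if (--depth == 0) return i;
            }
        }
        return -1;
    }
}
```

Tracking string state is the detail that is easy to miss: without it, a brace inside an argument value, such as a commit message, would end the object early.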

8. Capability Comparison

Level	Nature	Mechanism	Kill-Switch	Shadow	Latency
1: Bare	Deterministic	ProcessBuilder	NO	NO	~0ms
1.5: Composite	Heuristic	Regex + Shell	NO	NO	~0ms
2: GGUF	Local Stochastic	LLM 1.5B	NO	NO	~1s
3: Tool Calling	Stochastic/Agentic	LLM + 31 Tools	YES	YES	8-10s
4: Thinking	Reasoning	DeepSeek/Qwen3	NO	NO	10-15s
5: Fallback	Stochastic	Plain VllmClient	NO	NO	2-5s

9. Results

9.1 Latency Reduction

┌────────────────────────────────────────────────────────────────────────┐
│                    LATENCY BY OPERATION TYPE                            │
├────────────────────────────────────────────────────────────────────────┤
│                                                                        │
│  pwd (before, with LLM):   ████████████████████████████████  10s      │
│  pwd (Level 1, no LLM):    █                                 0ms      │
│                                                    Improvement: -100%  │
│                                                                        │
│  "hello" (before, w/tools): ████████████████████████████████  10s      │
│  "hello" (Level 2, GGUF):   ████                              1s      │
│                                                    Improvement: -90%   │
│                                                                        │
│  git status (before, LLM):  ████████████████████████████████  8s      │
│  git status (Level 1):      █                                 0ms      │
│                                                    Improvement: -100%  │
│                                                                        │
│  "do the commit" (before):  ████████████████████████████████  10s      │
│  "do the commit" (L 1.5):   █                                 0ms      │
│                                                    Improvement: -100%  │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

9.2 Integrity Protection

┌────────────────────────────────────────────────────────────────────────┐
│                    PROTECTION EFFECTIVENESS                            │
├────────────────────────────────────────────────────────────────────────┤
│                                                                        │
│  Layer 1 (Kill-Switch):                                               │
│  ████████████████████████████████████████████████████████  99.9%      │
│  Blocked by Jaccard/Volume                                            │
│                                                                        │
│  Layer 2 (Git Saga):                                                  │
│  ████████████████████████████████████████████████████████  99.99%     │
│  Contained via ephemeral branch + auto-revert                         │
│                                                                        │
│  Layer 3 (Shadow Backup):                                             │
│  ██████████████████████████████████████████████████████████  100%     │
│  Recovered from shadow files                                          │
│                                                                        │
│  RESULT: 0 PERMANENT LOSSES                                           │
└────────────────────────────────────────────────────────────────────────┘

10. Conclusion

The Fararoni Kernel demonstrates that secure and efficient LLM integration in software development requires a hybrid architecture that combines:

Intelligent Routing: 60-70% of interactions are resolved in the deterministic zone (Levels 1-2), eliminating latency and hallucinations for operational tasks.
Defense in Depth: The remaining 30-40% passes through the Protection Triad, where each layer captures escapes from the previous one until reaching 0% permanent loss.
Small Model Adaptability: The text fallback (extractTextToolCalls) enables using 7B parameter models that don't properly handle the tool calling protocol, democratizing access to these capabilities.

The explicit separation between the deterministic zone and the stochastic zone is not just a performance optimization: it is a security principle. By making the LLM "never know" about trivial commands, we eliminate the widest attack surface. And by shielding the points where the LLM DOES operate, we guarantee that its stochastic nature cannot cause permanent damage.

## Try It

  brew tap ebercruzf/fararoni && brew install fararoni

Also available as standalone binaries for macOS, Linux & Windows.

About the Author

This article documents the architecture behind C-FARARONI, an experimental ecosystem for technological sovereignty and secure local AI model execution.

LinkedIn · GitHub · ebercruz.com

Canonical source: fararoni.dev/publicacion/kernel-hibrido

Sovereign Intelligence on Apple Silicon: Breaking the Microsecond Barrier with Java 25 and Panama FFM

Eber Cruz Fararoni — Tue, 24 Mar 2026 13:58:34 +0000

By Eber Cruz | March 2026

If you've ever tried to build a truly conversational AI, you know that latency is the enemy of presence. It's not just about how fast the model generates tokens; it's about how fast the system can "yield the floor" when a human starts to speak.

Standard Java audio stacks and JNI bridges often introduce non-deterministic delays that make real-time, full-duplex interaction feel robotic. To solve this for the C-Fararoni ecosystem, I decided to bypass the legacy abstractions and talk directly to the silicon.

In this deep dive, I share the architecture and real-world benchmarks of a system built on Java 25, Panama FFM, and Apple Metal GPU. We aren't talking about millisecond improvements here—we've measured a playback interrupt cycle that completes in just 833 nanoseconds.

What's inside:

Zero-JNI Architecture: How we achieved a 42ns overhead using the Foreign Function & Memory API.
Metal GPU Orchestration: Running 0.6B and 1.7B neural models locally on 32 GPU cores via PyTorch MPS and ggml-metal.
The "Abort" Benchmark: Why a 6,000x improvement over our initial latency target was necessary for Sovereign AI.

Bye Bye JNI: Metal GPU, CoreAudio and Panama FFM on Apple Silicon

When we set out to build a voice-first AI assistant that could hold a real conversation — not the kind where you wait three seconds for a response, but the kind where the system knows when to stop talking the instant you speak — we realized the entire Java audio stack had to go. No JNI. No abstraction layers. Just Java talking directly to the hardware through Panama FFM, CoreAudio rendering audio at 24kHz mono float32 through a native callback, and Metal GPU running neural inference on all 32 cores of an M1 Max.

This is the architecture behind Fararoni's audio engine. Every number in this document was measured on real hardware, on real code running in a high-fidelity development environment. These are not theoretical projections — they are measurements taken directly from Fararoni's core as we build the foundation for a sovereign, low-latency AI.

What follows is the story of three bridges: Java to native code in 42 nanoseconds, a playback interrupt in 833 nanoseconds, and two neural models — 0.6B and 1.7B parameters — running Metal compute kernels to synthesize human voice.

Direct Metal: How We Talk to the Hardware

The foundation of the audio engine is a C++ library (fararoni_audio.cpp) that programs CoreAudio's AudioUnit directly. No wrappers, no middleware. The output unit is configured at 24kHz mono float32 with a render callback that copies PCM samples into the output buffer with a single memcpy and no additional copies along the way. The audio thread owns the buffer, and we never fight it for a lock.

The critical path, however, is not playback — it's interruption. In a conversational AI, the system must stop speaking the instant the user starts talking. That means the abort command has to be fast. Not "fast for software" — fast enough that the hardware is the bottleneck.

Here is the entire abort implementation:

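extern "C" void fararoni_abort_playback(void) {
    // Flush the AudioUnit's internal buffers first (<1us measured).
    AudioUnitReset(outputUnit, kAudioUnitScope_Global, 0);
    // Stop the output unit so the render callback is no longer invoked.
    AudioOutputUnitStop(outputUnit);
    // Mark the playback context as fully consumed.
    playbackCtx.position = playbackCtx.size;
    playbackCtx.finished = true;
    // Publish the stopped state to other threads.
    g_playing.store(false);
}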

AudioUnitReset flushes the AudioUnit's internal buffers. On Apple Silicon this is sub-microsecond for three reasons: there is no DMA transfer to wait for, because the buffer lives in unified memory; there is no lock contention, because the AudioUnit runs on its own thread; and the buffer is small (buffer_size_frames × 4 bytes for mono float32).

We measured it. On an M1 Max running Java 25.0.1:

Zero-Overhead Bridge: The jump from Java to C++ via Panama FFM adds just 42ns (P50).
Hardware Reset: The AudioUnitReset command halts the CoreAudio engine in 459ns (P50).
End-to-End: The complete abort cycle consolidates at 833ns (P50) — 0.0008ms — beating the original 5ms target by 6,000x. Even at P99 (8.4µs), it's 600x under target.
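For reference, 5 ms divided by 833 ns is roughly 6,000, and 5 ms divided by 8.4 µs is roughly 600. The article does not describe how the percentiles were computed, so the following is only a sketch of one common approach: timing with `System.nanoTime` and taking nearest-rank percentiles. `LatencyStats` is a hypothetical name.

```java
import java.util.Arrays;

final class LatencyStats {
    // Nearest-rank percentile over a sorted copy of the samples (p in (0, 100]).
    static long percentile(long[] samplesNanos, double p) {
        if (samplesNanos.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samplesNanos.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    // Times `op` after a warm-up phase so the JIT has compiled the hot path.
    static long[] measure(Runnable op, int warmup, int iterations) {
        for (int i = 0; i < warmup; i++) {
            op.run();
        }
        long[] samples = new long[iterations];
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            op.run();
            samples[i] = System.nanoTime() - start;
        }
        return samples;
    }
}
```

At this scale the cost of the timer call itself is a meaningful share of each sample, so a harness such as JMH is usually a better fit for sub-microsecond numbers than a hand-written loop.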

An important distinction: this measures the interrupt command — the ability to stop audio that is already playing. It is not the latency of generating audio, which takes seconds. The only physical bottleneck that remains is the microphone buffer itself: the AudioUnit HAL input needs ~5-10ms to fill a capture buffer, and no software can change that. Once the buffer arrives, our software reacts in under 2 microseconds.

Frameworks linked directly (Makefile):

CoreFoundation, CoreAudio, AudioToolbox, AudioUnit, IOKit,
Metal, MetalKit, Accelerate, Foundation

Dual-Engine TTS: Direct vs. Indirect Metal GPU Execution

The audio engine runs two completely independent TTS backends, both executing inference on the Metal GPU but with fundamentally different architectural paths.

The 0.6B Engine: ggml-metal Compute

The fast path uses qwen3-tts-cli, a C++ binary that loads a GGUF-quantized 0.6B model and runs inference through ggml-metal — a complete Metal compute implementation that compiles .metal shaders at runtime, creates MTLComputePipelineState for each tensor operation, and dispatches MTLComputeCommandEncoder with optimized thread groups. The buffers live in unified memory: zero-copy between CPU and GPU.

To achieve this efficiency, the engine relies on highly optimized Metal implementations of core neural network operations. These "compute shaders" include:

Attention: The mechanism that allows the model to dynamically weight the importance of different parts of the input text sequence when generating each audio frame.

Softmax: The activation function that normalizes raw model scores into a probability distribution, crucial for accurate audio token selection.

RoPE (Rotary Positional Embeddings): An advanced method for encoding token positions in the sequence, improving the model's understanding of context and order compared to traditional absolute embeddings.

Java launches it as a subprocess:

List<String> cmd = List.of(
    binaryPath.toString(),       // qwen3-tts-cli
    "-m", modelDir.toString(),   // GGUF model dir
    "-t", styledText,            // text to synthesize
    "-o", outputWav.toString()   // WAV output
);
Process proc = new ProcessBuilder(cmd).start();

For a single word ("Hi."), the 0.6B engine produces 0.3 seconds of audio in 3.7s total (including model load). For longer text, autoregressive generation scales linearly with audio duration — 10 words producing 2.8s of audio take ~32s.

The 1.7B Engine: PyTorch MPS on Metal

The high-fidelity path runs a persistent Python sidecar (tts_server.py) with three variants of Qwen3-TTS-12Hz-1.7B loaded into GPU memory via PyTorch's MPS backend. MPS (Metal Performance Shaders) translates PyTorch tensor operations into Metal compute commands — the same MTLComputeCommandEncoder, the same unified memory buffers, the same GPU cores.

def detect_device():
    if torch.backends.mps.is_available():
        return "mps"            # Apple Silicon -> Metal GPU
    elif torch.cuda.is_available():
        return "cuda"
    else:
        return "cpu"

m = Qwen3TTSModel.from_pretrained(model_id, device_map="cpu")
m = m.to("mps")                # Tensors move to Metal GPU

The sidecar stays resident — the model loads once and serves requests via a JSON-line protocol over stdin/stdout. Java orchestrates:

Java (PythonSidecarBackend)
  -> stdin: {"command":"synthesize", "speaker":"Aiden", "text":"..."}
  -> Python: PyTorch MPS -> Metal GPU (1.7B inference)
  -> stdout: {"status":"ok", "wav":"/tmp/fararoni_tts_xxx.wav"}
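The Java side of this exchange can be sketched as one request line out and one reply line in. `SidecarClient` is a hypothetical stand-in for PythonSidecarBackend, whose code is not shown in the article; the hand-rolled escaping and regex-based reply parsing are simplifications that a JSON library would replace.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SidecarClient {
    private static final Pattern STATUS_OK = Pattern.compile("\"status\"\\s*:\\s*\"ok\"");
    private static final Pattern WAV = Pattern.compile("\"wav\"\\s*:\\s*\"([^\"]*)\"");

    private final BufferedWriter toSidecar;
    private final BufferedReader fromSidecar;

    SidecarClient(BufferedWriter toSidecar, BufferedReader fromSidecar) {
        this.toSidecar = toSidecar;
        this.fromSidecar = fromSidecar;
    }

    // Minimal JSON string escaping; keeps each request on a single line.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"")
                .replace("\n", "\\n").replace("\r", "\\r").replace("\t", "\\t");
    }

    static String request(String speaker, String text) {
        return "{\"command\":\"synthesize\", \"speaker\":\"" + escape(speaker)
                + "\", \"text\":\"" + escape(text) + "\"}";
    }

    // Sends one request line and blocks until the sidecar answers with one line.
    String synthesize(String speaker, String text) throws IOException {
        toSidecar.write(request(speaker, text));
        toSidecar.newLine();
        toSidecar.flush();
        String reply = fromSidecar.readLine();
        if (reply == null) {
            throw new IOException("sidecar closed the stream");
        }
        Matcher wav = WAV.matcher(reply);
        if (!STATUS_OK.matcher(reply).find() || !wav.find()) {
            throw new IOException("sidecar error: " + reply);
        }
        return wav.group(1);
    }
}
```

In practice the writer and reader would wrap the output and input streams of the sidecar `Process`; taking them as constructor arguments keeps the protocol testable without starting Python.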

Python is only the invocation wrapper. The heavy lifting — neural inference — runs 100% on the Metal GPU. We use Python because HuggingFace Transformers publishes Qwen3-TTS models with a Python API, and PyTorch MPS is the bridge to Metal. The data never leaves the machine.

Why two engines? The 0.6B model is instant-quality: fast, stateless, no persistent process. The 1.7B model is studio-quality: speaker embeddings, richer prosody, but requires a warm sidecar. The routing engine (FararoniAudioEngine.synthesizeToFile()) selects the backend based on speaker availability and quality preference — builtin speakers (Aiden, Dylan, Vivian, Eric) route to the 1.7B sidecar, while unknown speakers fall back to the 0.6B CLI.
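
The routing rule described above can be sketched in a few lines. This is a simplification under stated assumptions: the real FararoniAudioEngine.synthesizeToFile() evaluates more conditions than shown here, and the TtsRouter class, the Backend enum, and the sidecarReady flag are hypothetical names introduced for illustration.

```java
import java.util.Set;

/** Hypothetical two-condition version of the speaker-based backend routing. */
final class TtsRouter {
    enum Backend { SIDECAR_1_7B, CLI_0_6B }

    // Builtin speakers as listed in the article.
    private static final Set<String> BUILTIN = Set.of("Aiden", "Dylan", "Vivian", "Eric");

    /**
     * Builtin speakers go to the warm 1.7B sidecar when it is available;
     * everything else falls back to the stateless 0.6B CLI.
     */
    static Backend route(String speaker, boolean sidecarReady) {
        boolean builtin = speaker != null && BUILTIN.contains(speaker);
        return (builtin && sidecarReady) ? Backend.SIDECAR_1_7B : Backend.CLI_0_6B;
    }
}
```

Falling back to the CLI when the sidecar is not ready is an assumption of this sketch; it keeps synthesis available even if the persistent process has not finished loading.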

Aspect	0.6B (ggml-metal)	1.7B (PyTorch MPS)
Metal path	.metal shaders compiled at runtime	MPS precompiled kernels
Model format	GGUF quantized	HuggingFace float16/32
Java interface	ProcessBuilder (per-invocation)	stdin/stdout JSON-line (persistent)
Speaker selection	No (CLI limitation)	Yes (speaker embedding)
Quality	Instant	Studio
Both execute on	Metal GPU compute	Metal GPU compute
Hardware Access	Direct Metal (zero framework overhead)	Indirect Metal (PyTorch/Python tax)

High-Fidelity Synthesis: Scaling to 1.7B with PyTorch MPS

The 1.7B sidecar is where data sovereignty meets quality. The model runs locally on 32 GPU cores — no cloud API, no network hop, no third-party data processing. For an AI assistant that handles private conversations, this is not a feature; it's a requirement.

Measured synthesis times on M1 Max (all routed to 1.7B Python sidecar via Metal GPU):

Studio quality — speaker-embedded, full prosody:

Speaker	Text	Audio Duration	Total Time
Aiden	"Hello, I am Aiden..." (10 words)	3.9s	47.5s
Dylan	"Hey, I am Dylan..." (10 words)	3.4s	40.1s
Vivian	"Hola, soy Vivian..." (8 words)	3.3s	38.9s
Eric/Marcus	"Buenos dias, soy Marcus..." (12 words)	4.4s	52.0s

Instant quality — still 1.7B for builtin speakers:

Speaker	Text	Audio Duration	Total Time
Aiden	"Hello, I am Aiden..."	2.6s	30.3s
Vivian	"Hola, soy Vivian..."	2.6s	30.9s

0.6B Metal — unknown speakers, activeBackend fallback:

Speaker	Text	Audio Duration	Total Time
(unknown, 10 words)	"Hello world..."	2.8s	32.3s
(unknown, 9 words)	"Buenos dias..."	2.6s	31.7s
(unknown, 1 word)	"Hi."	0.3s	3.7s

These are real numbers from real synthesis runs, not estimates. The routing was verified by tracing the six conditions in FararoniAudioEngine.synthesizeToFile() (lines 755-802).
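
One way to compare the rows in the tables above is the real-time factor: wall-clock seconds spent per second of audio produced. The sketch below computes it from a few of the published rows; the RealTimeFactor class is a hypothetical helper, not part of the benchmark code.

```java
/** Hypothetical helper: seconds of compute per second of synthesized audio. */
final class RealTimeFactor {
    static double of(double totalSeconds, double audioSeconds) {
        if (audioSeconds <= 0) throw new IllegalArgumentException("audio duration must be positive");
        return totalSeconds / audioSeconds;
    }

    public static void main(String[] args) {
        // {label, audio duration (s), total time (s)} copied from the tables above.
        Object[][] rows = {
            {"1.7B studio, Aiden", 3.9, 47.5},
            {"1.7B studio, Dylan", 3.4, 40.1},
            {"1.7B instant, Aiden", 2.6, 30.3},
            {"0.6B, 10 words", 2.8, 32.3},
            {"0.6B, \"Hi.\"", 0.3, 3.7},
        };
        for (Object[] row : rows) {
            double rtf = of((double) row[2], (double) row[1]);
            System.out.printf("%-22s %.1fx real time%n", row[0], rtf);
        }
    }
}
```

Across every row in the tables the factor lands between roughly 11.5x and 12.3x, which is consistent with generation time scaling linearly with audio duration.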

Zero JNI: Panama FFM as the Universal Bridge

Every call from Java to native code in this engine goes through Panama FFM (JEP 454). Zero JNI imports. Zero generated headers. Zero boilerplate.

The pattern is the same across all three native-bridging classes:

Linker linker = Linker.nativeLinker();
SymbolLookup nativeLib = SymbolLookup.libraryLookup(path, Arena.global());
MethodHandle fn = linker.downcallHandle(
    nativeLib.find("fararoni_xxx").get(),
    FunctionDescriptor.of(ValueLayout.JAVA_INT, ValueLayout.ADDRESS)
);
int result = (int) fn.invokeExact(memorySegment);
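
To try the same pattern without libfararoni_audio.dylib, here is a self-contained sketch that binds the C standard library's strlen instead. It assumes Java 22 or later (where the FFM API is final) and a 64-bit platform where size_t maps to JAVA_LONG; the StrlenDowncall class is a hypothetical name.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;

/** Hypothetical demo: the downcall pattern applied to libc's strlen. */
final class StrlenDowncall {
    private static final MethodHandle STRLEN;

    static {
        Linker linker = Linker.nativeLinker();
        // defaultLookup() resolves symbols from the platform's standard C library.
        STRLEN = linker.downcallHandle(
                linker.defaultLookup().find("strlen").orElseThrow(),
                FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS));
    }

    static long nativeLength(String s) {
        try (Arena arena = Arena.ofConfined()) {
            // allocateFrom copies the string off-heap as NUL-terminated UTF-8.
            MemorySegment cString = arena.allocateFrom(s);
            return (long) STRLEN.invokeExact(cString);
        } catch (Throwable t) {
            throw new IllegalStateException("strlen downcall failed", t);
        }
    }

    public static void main(String[] args) {
        System.out.println(nativeLength("fararoni"));
    }
}
```

The method handle is resolved once in a static initializer and reused, so the per-call cost is only the downcall itself.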

Three classes, three domains, one pattern:

NativeAudioPlayer: Playback control. Four downcall handles: initEngine, playBuffer, stopEngine, isInitialized. The samples are copied once into an off-heap buffer owned by a confined arena, then handed to native code by address:

try (Arena arena = Arena.ofConfined()) {
    MemorySegment nativeBuffer = arena.allocate(ValueLayout.JAVA_FLOAT, samples.length);
    MemorySegment.copy(samples, 0, nativeBuffer, ValueLayout.JAVA_FLOAT, 0, samples.length);
    int result = (int) playBuffer.invokeExact(nativeBuffer, samples.length);
}

Note on Manual Memory (Arenas): In high-performance Java, an Arena is a bounded memory region that allows for deterministic, off-heap allocation. Unlike standard Java objects managed by the Garbage Collector, memory within an Arena is orchestrated manually. This ensures that our 833ns critical path remains GC-free, providing the microsecond-level determinism required for real-time conversational AI.

Arena.ofConfined() gives us deterministic memory: allocated before the call, freed when the try-with-resources block ends. No GC pressure, no finalizers, no surprises. Measured allocation+copy for 1 second of audio (24,000 float32 samples): 5.3µs (P50). For 100 seconds of audio: 434µs.
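
The allocate, copy, free lifecycle described above can be observed directly. The sketch below assumes Java 22 or later; ArenaCopySketch and its method names are hypothetical, and it only exercises the memory API without calling any native audio code.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

/** Hypothetical demo of confined-arena lifetime for an audio buffer. */
final class ArenaCopySketch {
    /** Copies the samples off-heap and reports how many native bytes were used. */
    static long offHeapBytes(float[] samples) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment buffer = arena.allocate(ValueLayout.JAVA_FLOAT, samples.length);
            MemorySegment.copy(samples, 0, buffer, ValueLayout.JAVA_FLOAT, 0, samples.length);
            return buffer.byteSize();
        } // memory is released here, deterministically, with no GC involvement
    }

    /** Returns whether a segment is still usable after its arena was closed. */
    static boolean aliveAfterClose() {
        MemorySegment escaped;
        try (Arena arena = Arena.ofConfined()) {
            escaped = arena.allocate(ValueLayout.JAVA_FLOAT, 1);
        }
        return escaped.scope().isAlive();
    }

    public static void main(String[] args) {
        System.out.println(offHeapBytes(new float[24_000]) + " bytes for 1s of audio");
        System.out.println("alive after close: " + aliveAfterClose());
    }
}
```

One second of audio at 24,000 float32 samples occupies 96,000 bytes off-heap, and the segment is invalidated the moment the try-with-resources block exits.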

WhisperEngine — Engine control and STT. Eight downcall handles spanning both the TTS abort path (abortPlayback, setVolume, isPlaying, initEngine, getTelemetry) and Whisper STT for voice commands (whisperInit, startTranscription, stopTranscription). This class also demonstrates Panama upcalls — C-to-Java callbacks:

MemorySegment callbackStub = linker.upcallStub(
    MethodHandles.lookup().bind(handler, "onTranscript", ...),
    FunctionDescriptor.ofVoid(ValueLayout.ADDRESS, ValueLayout.ADDRESS),
    callbackArena
);

VadDetector — Voice Activity Detection. Four handles: vadIsSpeech, rmsEnergy, startVadCapture, stopEngine. The VAD runs inline on the audio capture thread — no thread hop, no queue.
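
The upcall mechanism mentioned for WhisperEngine can be tried in isolation with the C library's qsort, which calls back into a Java comparator. This is the textbook FFM upcall example rather than the engine's own onTranscript callback; it assumes Java 22 or later and a 64-bit platform, and the QsortUpcall class is a hypothetical name.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;

/** Hypothetical demo: native qsort invoking a Java comparator through an upcall stub. */
final class QsortUpcall {
    /** Called from C for every comparison; each pointer addresses one int. */
    private static int compareInts(MemorySegment a, MemorySegment b) {
        return Integer.compare(a.get(ValueLayout.JAVA_INT, 0), b.get(ValueLayout.JAVA_INT, 0));
    }

    static int[] sort(int[] input) {
        try (Arena arena = Arena.ofConfined()) {
            Linker linker = Linker.nativeLinker();
            MethodHandle qsort = linker.downcallHandle(
                    linker.defaultLookup().find("qsort").orElseThrow(),
                    FunctionDescriptor.ofVoid(ValueLayout.ADDRESS, ValueLayout.JAVA_LONG,
                            ValueLayout.JAVA_LONG, ValueLayout.ADDRESS));

            // The target layout tells the runtime each pointer refers to a readable int.
            FunctionDescriptor comparatorDesc = FunctionDescriptor.of(ValueLayout.JAVA_INT,
                    ValueLayout.ADDRESS.withTargetLayout(ValueLayout.JAVA_INT),
                    ValueLayout.ADDRESS.withTargetLayout(ValueLayout.JAVA_INT));
            MethodHandle comparator = MethodHandles.lookup().findStatic(
                    QsortUpcall.class, "compareInts", comparatorDesc.toMethodType());

            // The stub is a native function pointer that lives as long as the arena.
            MemorySegment stub = linker.upcallStub(comparator, comparatorDesc, arena);
            MemorySegment array = arena.allocateFrom(ValueLayout.JAVA_INT, input);
            qsort.invokeExact(array, (long) input.length, ValueLayout.JAVA_INT.byteSize(), stub);
            return array.toArray(ValueLayout.JAVA_INT);
        } catch (Throwable t) {
            throw new IllegalStateException("qsort upcall failed", t);
        }
    }
}
```

The arena passed to upcallStub controls how long the function pointer stays valid, which is why a long-lived callback needs a longer-lived arena than a per-call buffer does.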

The measured FFM overhead: 42ns per downcall (P50), based on 10,000 iterations of a noop function. The JEP 454 spec claims ~10ns; the 4x difference is explained by nanoTime granularity on M1 (42ns resolution), branch prediction variability, and cache state. Still sub-microsecond, still negligible for audio work.

The Anatomy of a Sub-Microsecond Interrupt

Full-duplex means the system listens while it speaks. When the user starts talking mid-sentence, the interrupt chain fires:

User speaks while TTS plays audio
  +-- HAL AudioUnit captures mic -> callback
      +-- RMS energy > threshold -> speech detected
          +-- Panama upcall: C -> Java (~42ns)
              +-- WhisperEngine.abortPlayback()
                  +-- Panama downcall: Java -> C (~42ns)
                      +-- fararoni_abort_playback()
                          +-- AudioUnitReset: audio stops (459ns)
                          +-- AudioOutputUnitStop

Three measured segments tell the whole story:

The Bridge (Panama FFM round-trip): upcall + downcall = ~84ns. Java is not a bottleneck. The foreign function boundary is invisible at audio timescales.

The Command (AudioUnitReset + stop on C side): 459ns (P50). The AudioUnit flushes its buffers in unified memory — no DMA, no contention.

The Full Cycle (Java → Panama → C → AudioUnitReset → C → Panama → Java): 833ns (P50). Under one microsecond. The original design target was 5ms; we beat it by 6,000x.

The one thing software cannot accelerate is physics. The microphone's AudioUnit HAL input buffer takes ~5-10ms to fill — a hardware constraint determined by buffer_size_frames, not by code. Once that buffer delivers the speech event, our stack reacts in under 2 microseconds. The total real-world interrupt latency is dominated entirely by the microphone hardware, not by the software chain.
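
As a rough worked example of that hardware floor, the fill time of a capture buffer is its frame count divided by the sample rate. The 48 kHz capture rate and the 256 and 512 frame buffer sizes below are illustrative assumptions, not values taken from the engine's configuration; only the 833ns abort figure and the 5ms design target come from the measurements above.

```latex
t_{\text{fill}} = \frac{N_{\text{frames}}}{f_s}
\qquad
\frac{256}{48\,000\ \text{Hz}} \approx 5.3\ \text{ms}
\qquad
\frac{512}{48\,000\ \text{Hz}} \approx 10.7\ \text{ms}

\frac{t_{\text{abort}}}{t_{\text{fill}}} = \frac{833\ \text{ns}}{5.3\ \text{ms}} \approx 1.6 \times 10^{-4}
\qquad
\frac{5\ \text{ms}}{833\ \text{ns}} \approx 6002
```

Under these assumptions the software abort path accounts for about 0.02% of the time spent waiting for the microphone buffer.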

Real-World Synthesis: Measured, Not Estimated

Every claim in this document traces back to NativeAudioBenchmark.java, a standalone benchmark class that exercises the native bridge through Panama FFM on a live libfararoni_audio.dylib.

The benchmark measures six distinct operations across thousands of iterations on an M1 Max with Java 25.0.1:

FFM downcall overhead: 10,000 calls to a noop — 42ns (P50), 292ns (P99)
Abort C-side (AudioUnitReset + Stop): 100 calls — 459ns (P50), 2,750ns (P99)
Abort end-to-end (Java→C→Java): 100 calls — 833ns (P50), 8,375ns (P99)
Arena alloc+copy (1s audio, 24K samples): 5.3µs (P50)
Arena alloc+copy (100s audio, 2.4M samples): 434µs (P50)
Native buffer read (1s audio): 22.4µs (P50)
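
For readers who want to reproduce figures in this style, the sketch below shows one common way to derive P50 and P99 from raw System.nanoTime() samples, using the nearest-rank method. It is an assumption that NativeAudioBenchmark.java computes percentiles this way; LatencyPercentiles and its methods are hypothetical names.

```java
import java.util.Arrays;

/** Hypothetical helper: nearest-rank percentiles over nanosecond timings. */
final class LatencyPercentiles {
    /** Smallest sample such that at least p percent of all samples are at or below it. */
    static long percentile(long[] samplesNanos, double p) {
        if (samplesNanos.length == 0) throw new IllegalArgumentException("no samples");
        if (p <= 0 || p > 100) throw new IllegalArgumentException("p must be in (0, 100]");
        long[] sorted = samplesNanos.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Times an operation after a warmup phase so JIT compilation does not skew results. */
    static long[] time(Runnable operation, int warmup, int iterations) {
        for (int i = 0; i < warmup; i++) operation.run();
        long[] samples = new long[iterations];
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            operation.run();
            samples[i] = System.nanoTime() - start;
        }
        return samples;
    }
}
```

Reporting P50 alongside P99 matters for numbers like these: the median shows the typical path, while the tail captures scheduler and cache effects that a mean would blur together.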

The synthesis times are equally real — every speaker, every quality level, every backend was tested through the REST endpoint (POST /v1/audio/synthesize) with the routing verified by tracing the condition branches in FararoniAudioEngine.

What we do not claim: we do not claim "5ms audio generation latency." Generation is neural inference and takes seconds. What is sub-microsecond is the command to stop — and that distinction matters, because it's the difference between an assistant that talks over you and one that yields the floor instantly.

Why This Matters

An AI assistant that can generate speech is not the same as one that can hold a conversation. Conversation requires knowing when to stop. Not "stop after a timeout" — stop now, mid-phoneme, because the human on the other end just opened their mouth.

That's what 833 nanoseconds buys us. Not speed for speed's sake, but the foundation for an AI that doesn't just respond — it knows when to be silent and listen. Full-duplex interrupt is the mechanical prerequisite for conversational presence: the system must be able to yield the floor faster than a human can perceive the delay.

The architecture we've built — Panama FFM as the zero-overhead bridge, CoreAudio's AudioUnit as the render engine, Metal GPU driving two neural models, and a sub-microsecond abort chain — is not about showing off low-level programming. It's about removing every artificial barrier between the AI and natural conversation, so the only latency that remains is the physics of a microphone filling its buffer.

Everything runs on-device. The voice models, the inference, the audio rendering, the interrupt — all local, all on Metal, all without a single byte leaving the machine. For an assistant that handles private conversations, sovereignty over the audio pipeline is not optional.

The Zero-GC Determinism Factor

While industry-standard frameworks like PyTorch provide incredible flexibility, they often carry a 'latency tax' due to their Python-heavy orchestration and complex abstraction layers. By leveraging confined arenas (Arena.ofConfined()) from the Foreign Function & Memory API on Java 25, we've moved the critical path's buffers off-heap. This means the Garbage Collector never manages the memory behind our 833ns interrupt logic. We aren't just calling a model; we are orchestrating silicon without the overhead of the giants.

Appendix: Raw Benchmark Data

Environment: Java 25.0.1 | aarch64 | Mac OS X | Apple M1 Max

========================================================================
  FARARONI AUDIO BENCHMARK — Panama FFM + CoreAudio + Metal
========================================================================

[System.nanoTime() overhead]
  Iterations: 10,000
  Mean:           9 ns  |  P50:   0 ns  |  P99:  42 ns  |  Max: 167 ns

[FFM Downcall Overhead (noop)]
  Iterations: 10,000
  Mean:          88 ns  |  P50:  42 ns  |  P99: 292 ns  |  Max: 22,916 ns

[Arena alloc+copy (1024 samples = 0.04s audio)]
  Mean:       4,916 ns  |  P50: 3,917 ns  |  P99: 28,125 ns

[Arena alloc+copy (24000 samples = 1.00s audio)]
  Mean:       5,515 ns  |  P50: 5,334 ns  |  P99:  9,583 ns

[Arena alloc+copy (240000 samples = 10.00s audio)]
  Mean:      35,841 ns  |  P50: 34,791 ns |  P99: 54,584 ns

[Arena alloc+copy (2400000 samples = 100.00s audio)]
  Mean:     434,339 ns  |  P50: 432,083 ns | P99: 471,959 ns

[Native buffer read (24000 floats = 1s audio)]
  Mean:      22,970 ns  |  P50: 22,375 ns |  P99: 30,875 ns

[Abort Playback (C-side only: AudioUnitReset+Stop)]
  Iterations: 100
  Mean:         514 ns  |  P50: 459 ns  |  P99: 2,750 ns  |  Max: 2,750 ns

[Abort Playback (Java->C->AudioUnitReset->Java)]
  Iterations: 100
  Mean:       1,016 ns  |  P50: 833 ns  |  P99: 8,375 ns  |  Max: 8,375 ns
========================================================================

Benchmark executed with NativeAudioBenchmark.java on libfararoni_audio.dylib
compiled with fararoni_benchmark.cpp (Makefile updated, make && make install).
M1 Max, 32-core GPU, Java 25.0.1, 2026-03-21.

## Try It

  brew tap ebercruzf/fararoni && brew install fararoni

Also available as standalone binaries for macOS, Linux & Windows.

Sovereign AI Infrastructure: Scaling Enterprise Agents from 8GB RAM to Global Clusters with Fararoni.

Eber Cruz Fararoni — Sat, 21 Mar 2026 20:48:18 +0000

The Era of Local Execution

AI deployment has shifted from cloud experimentation to the urgent need for Edge Sovereignty. As global giants like Alibaba (Qwen) and Huawei (Ascend) release increasingly powerful open-weight models, enterprises face a critical bottleneck: How do we execute these agents securely, privately, and on existing hardware?

Fararoni was born to bridge this gap, turning agent orchestration from a data center luxury into a native capability of any standard office computer.

1. Hardware Democratization: Enterprise AI on 8GB of RAM

Most AI infrastructures require expensive GPUs and nightmare software configurations. Fararoni breaks this barrier:

Extreme Efficiency: Capable of running a full WhatsApp or Telegram service flow using only 8GB or 16GB of RAM.
Optimized for Qwen: Specifically designed to leverage models like Qwen 1.5B/7B, allowing companies to onboard users into AI Agents without investing in new hardware.
"Zero-Config" Installation: A single binary. No Python, no Docker, no dependency hell. Ideal for mass deployment in restricted corporate environments.
Immediate Use Case: A basic office server can now manage a 24/7 Customer Service WhatsApp Agent, processing data locally with total privacy.

2. The "Rabbit-Turtle" Architecture

To maximize efficiency on modest hardware, Fararoni implements a hybrid computing strategy:

The Rabbit (The Orchestrator): A lightweight local model (e.g., Qwen 1.5B) that handles fast interactions, message filtering, and routine tasks in milliseconds.
The Turtle (The Thinker): A heavier model tier (7B, 32B, or external APIs like Claude/DeepSeek) that the Rabbit escalates to only when the task's complexity demands it.

This ensures a fluid user experience even on limited hardware, drastically optimizing cost-per-token and energy consumption.
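
A minimal sketch of that escalation decision follows. The heuristic (message length plus an explicit reasoning flag) and the names RabbitTurtleRouter, Tier, and maxRabbitWords are all hypothetical; the article does not describe how Fararoni actually scores task complexity.

```java
/** Hypothetical escalation rule: stay on the small model unless the task looks heavy. */
final class RabbitTurtleRouter {
    enum Tier { RABBIT_LOCAL_SMALL, TURTLE_HEAVY }

    private final int maxRabbitWords; // assumed threshold, tuned per deployment

    RabbitTurtleRouter(int maxRabbitWords) {
        this.maxRabbitWords = maxRabbitWords;
    }

    Tier route(String message, boolean needsMultiStepReasoning) {
        int words = message.isBlank() ? 0 : message.trim().split("\\s+").length;
        // Escalate only when the cheap tier is unlikely to cope with the request.
        boolean escalate = needsMultiStepReasoning || words > maxRabbitWords;
        return escalate ? Tier.TURTLE_HEAVY : Tier.RABBIT_LOCAL_SMALL;
    }
}
```

Keeping the default path on the small model is what protects cost-per-token: the heavy tier is an exception, not the baseline.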

The Fararoni Deployment Matrix: Scaling from 8GB Edge devices with Qwen 1.5B to High-Density Sovereign Clusters with MoE models.

3. The Nervous System: NATS and Data Sovereignty

For organizations requiring strict security compliance, Fararoni offers:

Event Bus (NATS): Total decoupling that allows agents to live on different nodes, ensuring sensitive data never leaves the secure perimeter.
DAG-Based Traceability: Every decision made by the AI is recorded in a Directed Acyclic Graph. It is auditable, transparent, and predictable.
Apache 2.0 License: The gold standard for industrial collaboration in both the East and West, allowing integration into commercial products without legal risks.

4. Strategic Alignment: Why Fararoni is the Partner for Giants

For the Alibaba/Qwen Ecosystem: Fararoni is the ideal "transport layer" to bring Qwen models to the end-user's desktop and SMEs, facilitating massive model adoption.
For Huawei Hardware (Ascend/Kunpeng): As an architecture based on native binaries and memory efficiency, Fararoni aligns perfectly with "Technological Decoupling" and total stack control strategies.
For European Sovereignty (GAIA-X): We offer total control over data flow, eliminating dependence on third-party "black boxes."

Conclusion: Start Small, Scale Infinitely.

The true revolution isn't the biggest model; it’s the agent that is exactly where the user needs it. With Fararoni, you can start today by installing an agent on an 8GB laptop and end tomorrow orchestrating a sovereign swarm on a national cluster.

The era of agents is here, but the real revolution is executing them with sovereignty.

About the Author

This article documents the architecture behind C-FARARONI, an experimental ecosystem for technological sovereignty and secure local AI model execution.

LinkedIn · GitHub · ebercruz.com

🔗 Immediate Action

Download Installer: fararoni.dev (Windows, Mac, Linux).
Test the WhatsApp Sidecar: Integrate AI into your communication flow in under 5 minutes.
License: Apache 2.0 – Your infrastructure, your rules.

Moving Beyond Chatbots: Architecting a Sovereign AI Ecosystem with Java 25 & NATS

Eber Cruz Fararoni — Wed, 18 Mar 2026 13:04:22 +0000

🇪🇸 Read the Spanish version here

The Problem: The Erosion of Technological Sovereignty

As engineers, we've fallen into a trap of convenience. We are building on "quicksand": closed APIs, black boxes, and total cloud dependency. Every time we send a prompt, we give away context and lose sovereignty.

I decided I didn't want an "assistant" to tell me jokes. I wanted a command infrastructure. Thus, Fararoni was born, an ecosystem designed to treat AI as what it should be: executable infrastructure, not just a chat interface.

1. The Centurion Vision: Command Architecture

In Fararoni, we move away from the "copilot" model to adopt the Centurion Vision.

The Human is the Architect: Defines the strategy, limits, and mission objective.
The AI is the Centurion: Orchestrates and executes.

To achieve this, the architecture cannot be linear. We need an infrastructure that supports failures, latency, and real-time context switching.

2. The Tech Stack: Why Java 25 and NATS?

Java 25 and the Power of Virtual Threads

Many ask: Why not Python? The answer is simple: concurrency and robustness.
By using Java 25, we leverage virtual threads (Project Loom) to run hundreds of lightweight agents and processes in the "Swarm". With GraalVM support, we achieve native binaries that boot in milliseconds, ideal for a CLI that must feel instantaneous.
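
As a small illustration of the virtual-thread model, the sketch below starts one virtual thread per simulated agent and waits for all of them. The VirtualThreadSwarm class and the sleep that stands in for blocking I/O are assumptions for demonstration; it requires Java 21 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical demo: one virtual thread per agent task. */
final class VirtualThreadSwarm {
    static int runAgents(int agentCount) {
        List<Future<Integer>> results = new ArrayList<>();
        try (ExecutorService pool = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < agentCount; i++) {
                final int agentId = i;
                results.add(pool.submit(() -> {
                    Thread.sleep(10); // stands in for a blocking call to a model or the bus
                    return agentId;
                }));
            }
        } // close() blocks until every submitted task has finished

        int completed = 0;
        for (Future<Integer> result : results) {
            if (result.state() == Future.State.SUCCESS) completed++;
        }
        return completed;
    }

    public static void main(String[] args) {
        System.out.println(runAgents(500) + " agents completed");
    }
}
```

Because a blocked virtual thread releases its carrier thread, hundreds of agents waiting on I/O do not need hundreds of OS threads.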

NATS: The Nervous System

We don't use an internal REST API for module communication. We use NATS as an event bus. This gives us:

Total decoupling: The "Sidecars" (WhatsApp, Telegram, Terminal) don't know who processes the order; they only listen to the bus.
Resilience: If a local model goes down, the message stays on the bus until a worker is ready.
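
The decoupling and hold-until-ready behavior can be sketched with an in-memory queue standing in for the bus. This is not the NATS client API: InMemoryBus is a hypothetical stand-in, and in a real NATS deployment retaining messages until a worker is available is typically provided by JetStream, since core NATS delivers only to subscribers that are connected at publish time.

```java
import java.util.Optional;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical in-memory stand-in for a durable subject on the event bus. */
final class InMemoryBus {
    private final BlockingQueue<String> subject = new LinkedBlockingQueue<>();

    /** Sidecars publish without knowing which worker will pick the message up. */
    void publish(String message) {
        subject.add(message);
    }

    /** A worker pulls the next message when it is ready, waiting up to the timeout. */
    Optional<String> take(long timeoutMillis) {
        try {
            return Optional.ofNullable(subject.poll(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return Optional.empty();
        }
    }

    int pending() {
        return subject.size();
    }
}
```

A message published while no worker is running simply waits in the queue, which is the property the Resilience bullet relies on.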

3. Tactical Innovation: DAGs and Hot-Swapping

Traceability via DAGs (Directed Acyclic Graphs)

AI is often a black box. In Fararoni, every AI decision is mapped in a DAG. This allows the human architect to audit the flow:

Where did this information come from?
Which model made the decision?
What was the cost and latency?

If it's not auditable, it's not professional.
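
Those three audit questions map naturally onto fields of a graph node. The sketch below is a hypothetical data model, not Fararoni's actual schema: DecisionDag, Node, and every field name are assumptions, and summing latency along the lineage assumes the steps ran sequentially.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical audit trail: each AI decision is a node pointing at its inputs. */
final class DecisionDag {
    record Node(String id, String model, String source,
                double costUsd, long latencyMs, List<String> parents) {}

    private final Map<String, Node> nodes = new LinkedHashMap<>();

    /** Parents must already exist, so the graph stays acyclic by construction. */
    void add(Node node) {
        if (nodes.containsKey(node.id())) {
            throw new IllegalArgumentException("duplicate node: " + node.id());
        }
        for (String parent : node.parents()) {
            if (!nodes.containsKey(parent)) {
                throw new IllegalArgumentException("unknown parent: " + parent);
            }
        }
        nodes.put(node.id(), node);
    }

    /** Walks back through parents to list everything a decision depended on. */
    List<Node> lineage(String id) {
        if (!nodes.containsKey(id)) throw new IllegalArgumentException("unknown node: " + id);
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.push(id);
        while (!pending.isEmpty()) {
            String current = pending.pop();
            if (seen.add(current)) nodes.get(current).parents().forEach(pending::push);
        }
        List<Node> result = new ArrayList<>();
        for (String nodeId : seen) result.add(nodes.get(nodeId));
        return result;
    }

    double totalCostUsd(String id) {
        return lineage(id).stream().mapToDouble(Node::costUsd).sum();
    }

    long totalLatencyMs(String id) {
        return lineage(id).stream().mapToLong(Node::latencyMs).sum();
    }
}
```

Calling lineage on a final answer returns the answer node first, followed by every upstream step, each carrying its model, source, cost, and latency.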

Model Hot-Swap: The Bridge Between Local and Cloud

One of the biggest challenges was the Hot-Swap.

For low-sensitivity tasks or pre-processing, we use a local 1.5B parameter model.
If the task scales in complexity, the system does a "Hot-Swap" to Claude 3.5 or GPT-4 without losing the mission state.

You maintain control over what data leaves your infrastructure and what doesn't.

4. Digital Heritage and Open Source

I have released the communication core and plugins under the Apache 2.0 license. I'm not looking to create another captive platform; I want to help other engineers in Latin America and the world build their own Digital Heritage.

Sovereignty isn't just a pretty word; it's having the binaries, the data bus, and the models under your own command.

Conclusion: The Swarm is Growing

Fararoni is already real. The installers are available, and the terminal is already orchestrating missions.
It is not a finished product; it's a living ecosystem.

What do you think about using NATS for LLM orchestration compared to traditional queue-based architectures like RabbitMQ or Kafka? I'll be reading your comments.

Explore the code and documentation at: fararoni.dev

Tags: #OpenSource #Java25 #GraalVM #Java #ProjectLoom #CloudNativeJava #SelfHosted #NATS #SoftwareArchitecture #SovereignAI #Fararoni #AI #TechSovereignty #Ollama #DeepSeek #Qwen #LocalLLM #AIInfrastructure

Beyond Chatbots: Building a Sovereign AI Ecosystem with Java 25 and NATS

Eber Cruz Fararoni — Wed, 18 Mar 2026 12:59:59 +0000

🇬🇧 Read the English version here

The Problem: The Erosion of Technological Sovereignty

As engineers, we have fallen into a trap of convenience. We are building on "quicksand": closed APIs, black boxes, and total dependence on the cloud. Every time we send a prompt, we give away context and lose sovereignty.

I decided I didn't want an "assistant" that told me jokes. I wanted a command infrastructure. That is how Fararoni was born: an ecosystem designed to treat AI as what it should be, executable infrastructure and not just a chat interface.

1. The Centurion Vision: Command Architecture

In Fararoni, we move away from the "copilot" model and adopt the Centurion Vision.

The Human is the Architect: Defines the strategy, the limits, and the mission objective.
The AI is the Centurion: Orchestrates and executes.

To achieve this, the architecture cannot be linear. We need an infrastructure that tolerates failures, latency, and real-time context switching.

2. The Tech Stack: Why Java 25 and NATS?

Java 25 and the Power of Virtual Threads

Many ask: Why not Python? The answer is simple: concurrency and robustness.
By using Java 25, we leverage virtual threads (Project Loom) to run hundreds of lightweight agents and processes in the "Swarm". With GraalVM support, we get native binaries that start in milliseconds, ideal for a CLI that must feel instantaneous.

NATS: The Nervous System

We don't use an internal REST API for communication between modules. We use NATS as an event bus. This gives us:

Total decoupling: The "Sidecars" (WhatsApp, Telegram, Terminal) don't know who processes the order; they only listen to the bus.
Resilience: If a local model goes down, the message stays on the bus until a worker is ready.

3. Tactical Innovation: DAGs and Hot-Swapping

Traceability via DAGs (Directed Acyclic Graphs)

AI is often a black box. In Fararoni, every AI decision is mapped in a DAG. This lets the human architect audit the flow:

Where did this information come from?
Which model made the decision?
What was the cost and latency?

If it's not auditable, it's not professional.

Model Hot-Swap: The Bridge Between Local and Cloud

One of the biggest challenges was the Hot-Swap.

For low-sensitivity tasks or pre-processing, we use a local 1.5B-parameter model.
If the task scales in complexity, the system does a "Hot-Swap" to Claude 3.5 or GPT-4 without losing the mission state.

You keep control over which data leaves your infrastructure and which does not.

4. Digital Heritage and Open Source

I have released the communication core and the plugins under the Apache 2.0 license. I'm not looking to create another captive platform; I want to help other engineers in Latin America and around the world build their own Digital Heritage.

Sovereignty isn't just a pretty word; it means having the binaries, the data bus, and the models under your own command.

Conclusion: The Swarm is Growing

Fararoni is already real. The installers are available, and the terminal is already orchestrating missions.
It is not a finished product; it's a living ecosystem.

What do you think about using NATS for LLM orchestration compared to traditional queue-based architectures like RabbitMQ or Kafka? I'll be reading your comments.

Explore the code and documentation at: fararoni.dev

From Startup to Unicorn: A Blueprint for Secure Enterprise Architecture

Eber Cruz Fararoni — Tue, 13 Jan 2026 02:22:16 +0000

How to implement a Reactive Security Flow with Spring Boot, Redis, and JWT for high-scale environments while avoiding the Microservices Trap.

1. The Context: Speed vs. Stability

Startups often face a dilemma: build an MVP fast to validate the market, or build for scale to handle future growth. The “move fast and break things” approach works for a month, but creates technical debt that kills growth in Year 2.
In the Fintech space, you don’t have the luxury of “breaking things.” You need the speed of a startup but the resilience and security of a bank.

2. The Architecture: The Hybrid Approach

Instead of jumping straight into a complex Microservices mesh (which drains budget and requires a DevOps army) or staying in a Monolith (which doesn’t scale), I propose a Modular Hybrid Architecture.
This approach decouples the Security Layer (Reactive) from the Business Logic (Transactional), allowing us to deploy on Serverless platforms (like Cloud Run) while keeping costs low.

System Overview (The Ecosystem)

3. Key Decision: The Multi-Schema Database Strategy

One of the biggest mistakes startups make is spinning up a new RDS instance for every microservice. This burns money.
The Solution: A single PostgreSQL instance with Logical Isolation via Schemas.
Why: It strictly enforces domain boundaries (Business, Payment, Security) without the overhead of managing 10 different database servers.
The Benefit: We can perform cross-schema joins for analytics when needed, but the application code treats them as separate data sources. This prepares us for a physical split in the future ("One-to-N" scaling) without refactoring logic.

4. The Security Core: Reactive Gateway + Servlet Logic

As shown in the diagram above, we implemented a strict separation of concerns:

The Reactive Shield (Spring Cloud Gateway): Handles high concurrency, manages the SSL termination, and validates the JWT signature before the request ever touches the business logic.
The Business Core (Servlet): Once the request is safe, it passes to the blocking transactional services where complex business logic lives.
State Management (Redis): We use a “Redis Blacklist” pattern to allow instant token revocation — fixing the main security flaw of stateless JWTs.
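
The revocation check can be sketched as a lookup keyed by token ID, with each entry expiring when the token itself would have expired. The code below uses an in-memory map as a stand-in for Redis so that it runs on its own; TokenDenylist and its method names are hypothetical, and in Redis the same idea is usually expressed as a key with a TTL equal to the token's remaining lifetime.

```java
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory stand-in for a Redis-backed JWT revocation list. */
final class TokenDenylist {
    // Token ID (the JWT "jti" claim) mapped to the moment the token expires anyway.
    private final Map<String, Instant> revokedUntil = new ConcurrentHashMap<>();

    /** Called on logout or compromise; the entry only needs to outlive the token. */
    void revoke(String tokenId, Instant tokenExpiry) {
        revokedUntil.put(tokenId, tokenExpiry);
    }

    /** Called by the gateway after the signature check and before routing. */
    boolean isRevoked(String tokenId, Instant now) {
        Instant until = revokedUntil.get(tokenId);
        if (until == null) return false;
        if (!now.isBefore(until)) {
            // The token has expired on its own, so the entry can be dropped (like a TTL).
            revokedUntil.remove(tokenId, until);
            return false;
        }
        return true;
    }

    int size() {
        return revokedUntil.size();
    }
}
```

Bounding each entry by the token's own expiry keeps the list small: it only ever holds tokens that would otherwise still be valid.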

5. Why “HttpOnly” Cookies?

We moved away from storing tokens in LocalStorage (vulnerable to XSS) to HttpOnly Secure Cookies. This ensures that even if a malicious script runs on the client, it cannot exfiltrate the user’s credentials. This is a non-negotiable standard for Fintech applications.

6. Future-Proofing: The Path to Apigee

A critical aspect of this architecture is Cost-Efficiency. Startups cannot afford expensive Enterprise API Management licenses from Day 1.

Current State (Lean): We use Spring Cloud Gateway to handle standard concerns like Basic Rate Limiting, CORS, and Auth Validation. This runs on Cloud Run with minimal cost.
Future State (Enterprise): As the business succeeds and traffic spikes (“One-to-N”), we don’t need to refactor. We can simply place Google Apigee in front of our Gateway.

This allows us to offload advanced security features — such as DDoS protection, KVM (Key Value Maps), IP Whitelisting, and complex Quotas — to a dedicated layer, keeping our core services lightweight and focused purely on business logic.

Conclusion

This architecture is not just code; it is a business asset. It allows a small team of 3 engineers to handle traffic that usually requires a team of 20, keeping the burn rate low while maintaining banking-grade security.
I am currently exploring new opportunities to apply these architectural patterns at an Enterprise scale. If you are looking for a Staff Engineer focused on Security and Scalability, let’s connect on LinkedIn or check my Portfolio.

Weaviate for RAG: When It Shines (and When It Doesn’t)

Eber Cruz Fararoni — Mon, 15 Dec 2025 00:28:06 +0000

A hands-on review after building an enterprise-grade PoC — not just another “Hello World”

As a Technical Lead & AI Architect (Hands-On) with a focus on RAG Systems and experience building solutions for organizations like HSBC, Scotiabank, and CFE, I'm always evaluating cutting-edge technologies. Recently, at AI Research Lab in Mexico City (Feb 2025 – Jun 2025), I spearheaded the architecture for a comprehensive Retrieval Augmented Generation (RAG) solution for an internal Business Intelligence Engine PoC. This was not a client-facing product, but a technical deep-dive to validate architecture, latency, and security patterns for future enterprise deployment. The PoC was designed to rigorously test RAG architectures for real-world readiness, incorporating:

Full enterprise patterns (auth, error handling, observability)

Local LLMs (DeepSeek-R1 via Ollama)

100% data sovereignty

Benchmarks on real hardware (GCP n2-standard-8)

My contributions included designing a multi-layered RAG architecture with reactive streaming patterns (Spring WebFlux, Project Reactor), architecting Weaviate v4 integration with optimized Sentence-BERT embeddings for financial document processing, and directing the local LLM integration strategy — leveraging my background as a Google Certified GenAI Leader.

🔗 Full architecture details: ebercruz.com/technical
💻 Code (MIT, non-commercial): github.com/ebercruzf/enterprise-intelligence-engine

✅ Where Weaviate Delivers Value — in practice

1. Hybrid Search: `nearText` + `where` = Fewer False Positives

In real use, users don’t ask clean questions like “summarize Q3 earnings”. They often phrase queries like:

“What did the compliance team say about loan approvals last quarter?”

Most vector DBs force a choice between semantic search and exact filtering. Weaviate's ability to combine both in a single query (a semantic nearText clause plus a where filter on metadata) significantly reduces false positives:

{
  Get {
    FinancialDocument(
      nearText: {concepts: ["loan approval"]}
      where: {
        path: ["department"]
        operator: Equal
        valueString: "compliance"
      }
    ) {
      title
      snippet
      _additional { distance }
    }
  }
}
