DEV Community: Ricardo Lara

FinancialClaw: making OpenClaw useful for personal finance

Ricardo Lara — Fri, 03 Apr 2026 21:50:37 +0000

We often talk about AI agents as if their greatest value lies in understanding natural language. But understanding isn't enough. An agent starts becoming truly useful when it can help with concrete tasks, reduce friction, and do so consistently.

FinancialClaw was born from exactly that idea. I wanted OpenClaw to do more than just chat about personal finance — I wanted it to help me manage it: log expenses, record income, handle recurring payments, and query summaries without relying on memory, scattered notes, or repetitive manual steps. From the start, the project took a clear direction: a personal tool with local persistence, designed for daily use, and with multi-currency support.

What's interesting is that this usefulness didn't come simply from adding more features. It emerged from combining natural language with clear rules, predictable operations, and local storage. In other words: let the agent interpret the intent, but don't let it improvise the logic that actually matters.

The real problem

Managing personal finances doesn't usually fail because it's hard to understand. It fails because of friction.

Logging expenses feels tedious. Recording income gets postponed. Recurring payments are forgotten. And when you want to know how much you've spent this month or what income you've received, you end up piecing it together from different places.

That was exactly what I wanted to avoid with FinancialClaw. I wasn't interested in building another tool that just talked about finances or answered generic questions. I wanted something capable of turning a conversation into a useful action: log an expense, record income, mark a payment, or query a summary — without breaking the flow.

What makes FinancialClaw useful

FinancialClaw's usefulness isn't about sounding smart. It's about making everyday tasks easier to execute.

Logging an expense should be quick. That's why FinancialClaw lets you do it manually or by scanning receipts. The idea wasn't just to capture data, but to bring the recording closer to the moment things actually happen.

The same applies to income. I didn't want income entries to end up as loose notes, but as part of a history that could later be queried in useful ways. Separating the definition of an income source from its actual receipt made it possible to model that flow better: expecting an income is one thing; recording when it arrived, how much, and on what date is another.

Then there was the problem of repetition. Subscriptions, services, installments, and periodic payments are part of real life. If a financial tool doesn't help with that, it falls short very quickly. That's why support for recurring expenses was an important part of the project from early on.

And of course, storing data isn't enough. Real usefulness appears when you can later ask how much you spent this month, what pending transactions you have, or what income you've received — and get answers based on persisted data, calculated consistently.

Where an agent alone falls short

This is where an idea emerged that I find increasingly important in agentic systems: an agent can interpret intentions, but it shouldn't improvise critical logic.

In FinancialClaw, that means the agent can recognize that the user wants to log an expense or request a summary. But it shouldn't be left to decide, case by case, how to validate a date, how to calculate a period, or how to format a result. That part needs to be predictable.

This was one of the clearest lessons from the project. If models are variable by nature, then the way to make them useful for sensitive tasks isn't to ask them to improvise better, but to support them with explicit rules, validations, and well-defined operations. In this case, that translated into data validation, parameterized queries, clear calculations, and consistent results.

And this matters even more in personal finance. Here, usefulness depends on trust. If the same question produces inconsistent results, or if an invalid date gets saved without an error, the tool loses value very quickly.
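As an illustration of the "invalid date gets saved without an error" case, here is a minimal sketch of a strict date check. The function name `parseIsoDate` and the YYYY-MM-DD input format are assumptions made for the example, not FinancialClaw's actual code.

```typescript
// Hypothetical helper: parseIsoDate is not part of FinancialClaw's published API.
// It accepts only real calendar dates written as YYYY-MM-DD.
function parseIsoDate(input: string): Date {
  const match = /^(\d{4})-(\d{2})-(\d{2})$/.exec(input);
  if (!match) {
    throw new Error(`Invalid date format (expected YYYY-MM-DD): ${input}`);
  }
  const year = Number(match[1]);
  const month = Number(match[2]);
  const day = Number(match[3]);

  // Date.UTC silently rolls an impossible date such as 2026-02-30 over into
  // March, so compare the round-tripped components against the input.
  const date = new Date(Date.UTC(year, month - 1, day));
  if (
    date.getUTCFullYear() !== year ||
    date.getUTCMonth() !== month - 1 ||
    date.getUTCDate() !== day
  ) {
    throw new Error(`Date does not exist in the calendar: ${input}`);
  }
  return date;
}
```

The point of the round-trip comparison is that the date "looks correct" to a regular expression but is rejected before it can be persisted.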

What it took to make it truly usable

One of the biggest takeaways from this project is that building something useful on top of an agent isn't just about programming the core logic.

You also have to solve everything else: how it gets installed, how it persists data, how it's configured, how it integrates well with the agent's actual flow, and how to prevent the experience from becoming fragile. There were important decisions early on, like multi-currency support and using XXX as a placeholder for a currency not yet configured. That helped avoid unnecessary assumptions and made the initial setup process clearer.
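The placeholder idea can be sketched as follows. ISO 4217 reserves XXX for "no currency", which makes it a natural marker for "not configured yet". The `Settings` shape and the `resolveCurrency` function are hypothetical names used only for this example, and refusing to record amounts until a currency is set is an assumed policy, not confirmed behavior.

```typescript
// "XXX" is the placeholder the article describes; everything else here is
// a hypothetical sketch, not FinancialClaw's real configuration code.
const UNCONFIGURED_CURRENCY = "XXX";

interface Settings {
  defaultCurrency: string;
}

// Returns the currency to use for a transaction, or fails loudly instead of
// silently assuming one when the default is still the placeholder.
function resolveCurrency(settings: Settings, explicit?: string): string {
  const code = (explicit ?? settings.defaultCurrency).toUpperCase();
  if (!/^[A-Z]{3}$/.test(code)) {
    throw new Error(`Not a three-letter currency code: ${code}`);
  }
  if (code === UNCONFIGURED_CURRENCY) {
    throw new Error("No currency configured yet: set a default currency first");
  }
  return code;
}
```

With this shape, a fresh install cannot quietly record amounts in a currency the user never chose.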

During development, quieter but very important problems also surfaced: validations that existed in types but not at runtime, dates that looked correct but weren't, installation steps that could break the experience, and configuration details that directly affected the tool's real usefulness. Fixing those was key because a financial tool stops being useful the moment it starts accepting ambiguous or incorrect data, or when using it requires more effort than it saves.
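To make "validations that existed in types but not at runtime" concrete, here is a small sketch. The `ExpenseInput` shape and the category list are invented for the example and are not FinancialClaw's real schema.

```typescript
// Hypothetical input shape for an expense-logging tool.
const CATEGORIES = ["food", "housing", "transport", "other"] as const;
type Category = (typeof CATEGORIES)[number];

interface ExpenseInput {
  amount: number;
  category: Category;
}

// The compiler trusts ExpenseInput, but a tool call coming from an agent
// arrives as untyped JSON, so every field is re-checked at runtime.
function parseExpenseInput(raw: unknown): ExpenseInput {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("Expense input must be an object");
  }
  const { amount, category } = raw as Record<string, unknown>;
  if (typeof amount !== "number" || !Number.isFinite(amount) || amount <= 0) {
    throw new Error("amount must be a positive, finite number");
  }
  if (
    typeof category !== "string" ||
    (CATEGORIES as readonly string[]).indexOf(category) === -1
  ) {
    throw new Error(`category must be one of: ${CATEGORIES.join(", ")}`);
  }
  return { amount, category: category as Category };
}
```

A type annotation alone would have accepted `{ amount: "12", category: "fod" }` the moment it was cast; the runtime check is what actually stops it.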

What I learned

FinancialClaw left me with a fairly simple idea: an agent's usefulness isn't just about what it understands, but about what it lets you do with less friction and more confidence.

It also left me with something else. In domains with state, clear rules, and real consequences, the agent shouldn't improvise everything. It works better when it interprets the intent but relies on a more predictable layer to validate, persist, calculate, and return consistent results.

That's why, rather than seeing FinancialClaw just as an OpenClaw extension, I prefer to see it as proof of something more interesting: that an agentic system starts becoming truly useful when conversation stops being the destination and becomes a practical way to operate software.


FinancialClaw: making OpenClaw useful for personal finance

Ricardo Lara — Fri, 03 Apr 2026 21:46:38 +0000

We often talk about AI agents as if their greatest value lay in understanding natural language. But understanding is not enough. An agent starts being truly useful when it can help with concrete tasks, reduce friction, and do so consistently.

FinancialClaw was born from exactly that idea. I wanted OpenClaw not only to chat about personal finance, but to help me manage it: log expenses, record income, handle recurring payments, and query summaries without relying on memory, scattered notes, or repetitive manual steps. From the start, the project took a clear direction: a personal tool with local persistence, designed for daily use and with multi-currency support.

What's interesting is that this usefulness did not appear simply by adding new features. It appeared by combining natural language with clear rules, predictable operations, and local storage. In other words: let the agent interpret the intent, but don't let it improvise the logic that actually matters.

The real problem

Keeping track of personal finances doesn't usually fail because it's hard to understand. It fails because of friction.

Logging expenses feels like a chore. Recording income gets postponed. Recurring payments are forgotten. And when you want to know how much you've spent this month or what income you've received, you end up reconstructing everything from different places.

That was exactly what I wanted to avoid with FinancialClaw. I wasn't interested in building another tool that just talked about finances or answered generic questions. I wanted something capable of turning a conversation into a useful action: log an expense, record income, mark a payment, or query a summary without breaking the flow.

What makes FinancialClaw useful

FinancialClaw's usefulness isn't about sounding smart, but about making everyday tasks easier to execute.

Logging an expense should be quick. That's why FinancialClaw lets you do it manually or with the help of receipts. The idea wasn't just to capture data, but to bring the recording closer to the moment things actually happen.

The same goes for income. I didn't want income entries to end up as loose notes, but as part of a history that could later be queried in useful ways. Separating the definition of an income from its actual receipt made it possible to model that flow better: expecting an income is one thing; recording when it arrived, how much arrived, and on what date is another.

There was also the problem of repetition. Subscriptions, services, installments, and periodic payments are part of real life. If a financial tool doesn't help with that, it falls short very quickly. That's why support for recurring expenses was an important part of the project from early on.

And, of course, storing data isn't enough. Real usefulness appears when you can later ask how much you spent this month, what pending transactions you have, or what income you've received, and get answers based on persisted data, calculated consistently.

Where an agent alone falls short

This is where an idea emerged that I find increasingly important in agentic systems: an agent can interpret intentions, but it shouldn't improvise critical logic.

In FinancialClaw, that means the agent can recognize that the user wants to log an expense or request a summary. But it shouldn't be left to decide, case by case, how to validate a date, how to calculate a period, or how to format a result. That part needs to be predictable.

That was one of the clearest lessons from the project. If models are variable by nature, then the way to make them useful for sensitive tasks isn't to ask them to improvise better, but to support them with explicit rules, validations, and well-defined operations. In this case, that translated into data validation, parameterized queries, clear calculations, and consistent results.

And that matters even more in personal finance. Here, usefulness depends on trust. If the same question produces inconsistent results, or if an invalid date gets saved without an error, the tool loses value very quickly.

What it took to make it truly usable

One of the biggest takeaways from this project is that building something useful on top of an agent isn't just about programming the core logic.

You also have to solve everything else: how it gets installed, how it persists data, how it's configured, how it integrates with the agent's actual flow, and how to keep the experience from becoming fragile. There were important decisions early on, like multi-currency support and using XXX as a placeholder for a currency not yet configured. That helped avoid unnecessary assumptions and made the initial setup clearer.

During development, quieter but very important problems also surfaced: validations that existed in types but not at runtime, dates that looked correct but weren't, installation steps that could break the experience, and configuration details that directly affected the tool's real usefulness. Fixing those was key because a financial tool stops being useful the moment it starts accepting ambiguous or incorrect data, or when using it requires more effort than it saves.

What I learned

FinancialClaw left me with a fairly simple idea: an agent's usefulness isn't just about what it understands, but about what it lets you do with less friction and more confidence.

It also left me with something else. In domains with state, clear rules, and real consequences, the agent shouldn't improvise everything. It works better when it interprets the intent but relies on a more predictable layer to validate, persist, calculate, and return consistent results.

That's why, rather than seeing FinancialClaw just as an OpenClaw extension, I prefer to see it as proof of something more interesting: that an agentic system starts becoming truly useful when conversation stops being the destination and becomes a practical way to operate software.


Part 2: what changed when I stopped treating my multi-agent system as an idea and started running it for real

Ricardo Lara — Sat, 28 Mar 2026 03:04:16 +0000

In the first part, I explained why I ended up building a multi-agent flow instead of continuing to push everything into a single conversation. The idea still made sense: separating responsibilities, using different models depending on the phase, and keeping human approval before implementation gave me more order, better cost, and less noise.

But at that point I was still solving the conceptual problem.

This second stage was different. It was no longer about defending the idea, but about actually running it. And that was where the problems appeared that you don't see in a diagram or in a good narrative: permissions, AI systems that don't behave the same way, processes that need a real interactive terminal, configuration that ages badly, and orchestration decisions that sound good on paper but fall short in practice.

The biggest change: I stopped thinking in terms of a pipeline and started thinking in terms of a runtime

I think the best way to explain this evolution is this: agentflow stopped being just a way to organize prompts, files, and steps, and started becoming an explicit runtime.

That changed how I saw it quite a bit.

Before, the configuration mostly described what had to be generated. Now it describes how each role runs: which provider it uses, which model, which effort level, which sandbox, and which prompt governs it. It is no longer just a tool to "assemble a flow", but a base for actually running roles with more control.

That change matters because it made me see something that wasn't so clear before: separating roles well is not enough. You also need the runtime that makes that separation viable.

Real execution was what showed where the gap was

The clearest finding of this stage was quite simple and quite brutal: the agents could not write files autonomously.

Not because the design was missing. The sandbox field already existed and the intent was well defined. The problem was more uncomfortable: the Claude adapter was not translating that intent into the real CLI flags. So the system ran, but it got blocked asking for permission on every write.

That was one of those moments that bring you back down to earth. Because that is when you understand that "the system runs" and "the system works" are not the same thing.

The fix was direct, but the lesson mattered more than the fix. For Claude Code, workspace-write had to be translated into --dangerously-skip-permissions and read-only into --permission-mode plan. Codex already handled that side better with --sandbox workspace-write. OpenCode, on the other hand, still has a more structural limitation because its CLI does not expose an equivalent flag.

You don't discover that problem by refining prompts. You discover it by running the system.

It also became clear to me that orchestrating is not the same as delegating

Another thing this stage made very clear was that the orchestrator I had in mind was still not closing the last mile well.

In theory, agentflow run already existed and already had sequencing logic. But in practice, when Claude Code, Codex, or OpenCode took part in a real session, that command was not enough. The bootstrap skills were too shallow. They basically delegated and that was it. They did not give enough context to decide when to stop, which steps to run, how to handle the review loop, or when to ask for human approval.

That was when something became evident that now seems obvious to me: there is no single correct mode of orchestration.

CLI mode makes sense for automation, CI, and deterministic execution. But an interactive session needs something else. It needs the agent to have the judgment to classify the task, present a plan, wait for approval before implementing, and decide how to move forward based on context. Trying to make a single mechanism serve both automation and CI and an interactive session with judgment and human approval created more friction than it resolved.

Not every task deserves the full pipeline

Another improvement that seems really important to me in this stage was finally making the classifier real.

In part 1 the intuition was already there that not every task should cost the same. But it was still more of a thesis than a real capability of the system.

Now there is a role that classifies complexity as small, medium, or large, and that changes the flow. A small change does not have to go through the whole ceremony. A larger task does justify the full pipeline, a review loop, and possible model adjustments. In addition, if the project comes from an old configuration and does not yet have that role, the system does not break: it falls back to a heuristic and keeps working.

I like this because it moves the optimization to the right place. Not after spending time and tokens, but before.

Put more simply: it makes no sense to treat a small bug as if it were a complex migration.

Providers are not interchangeable

This stage also helped me a lot to let go of a simplification that is tempting in the abstract: thinking that all AI providers are more or less the same.

They are not.

When I talk about an AI provider, I mean the external system that executes a specific task within the flow. It can be Claude, Codex, or another. It is, basically, the service I delegate work to in one phase of the process.

And when I took this into real execution, it became clear that those providers do not all behave the same way. Permissions change, the way they integrate changes, how they handle processes changes, and even how they expect to be run changes.

In some cases, moreover, it is not enough to launch a command and wait for a response. Some tools need to run inside a real interactive terminal, as if they were opened directly in the console. That is usually called a TTY, but in plain language it means this: the tool needs a "real" console to work properly.

That is what led me to use different strategies depending on the provider. For some cases, pipe-based execution worked well. For Codex, on the other hand, I ended up needing a real PTY with node-pty, because its interface can fail or hang if it does not run in a truly interactive terminal.

It seems like a minor detail, but it is not. Because working with agents is not only working with text: it is also working with processes, permissions, terminals, and real errors. And if that is not designed well, the system feels fragile even if the idea is good.

Several useful improvements were not flashy, but they were necessary

There were also less striking improvements that matter quite a bit more than they seem.

One was to stop depending on a rigid testRunner in the config. That kind of field ages badly. You change the project, change the runner, or change the stack, and you end up carrying a stale instruction. It seemed much better to let the tester detect it from the project itself when it is not defined.

These are not flashy changes, but they are the kind that make a tool stop feeling rigid.

Not everything is closed yet, but these are already the right kind of problems

I don't want to tell this stage as if everything had turned out perfect, because it wouldn't be true.

There are still open items. The test suite passes, but it still does not deeply cover the whole runtime-first contract, especially the real execution of adapters, agent run, and the classifier. The documentation is now much more aligned with the current runtime, but OpenCode still has a real limitation with the sandbox that does not depend only on agentflow.

But, honestly, those already seem like healthy problems to me.

Because I am no longer debating whether the idea makes sense. I am no longer in the stage of justifying the thesis. Now I am in the stage of closing concrete gaps: compatibility, documentation, robustness, and execution consistency.

And I much prefer being there.

What this part 2 really left me with

If part 1 was about why a multi-agent system made more sense than one giant conversation, this part 2 is about something else: what happens when that idea leaves the page and meets reality.

That was where the real gaps appeared: permissions, effective orchestration, complexity, providers that do not behave the same, processes, fragile defaults, and traces.

The original idea did not fall apart. In fact, for me it came out stronger.

But now I see it more completely: a multi-agent architecture is not ready just because it looks good in the design. It is ready when it can actually run without breaking on basic things.

Closing

The first version taught me to separate responsibilities.

This second stage forced me to build the runtime that makes that separation viable.

And real execution ended up teaching me the most important thing: between "this runs" and "this works the way it should" there is a large distance. That distance is not closed with more theory. It is closed by running, observing where it fails, and correcting with concrete changes.

That, for me, is what really counts in this second part.


Part 2: what changed when I stopped treating my multi-agent system as an idea and started running it for real

Ricardo Lara — Sat, 28 Mar 2026 03:02:00 +0000

In the first part, I explained why I ended up building a multi-agent flow instead of continuing to push everything into a single conversation. The idea still made sense: separating responsibilities, using different models depending on the phase, and keeping human approval before implementation gave me more order, better cost control, and less noise.

But at that stage I was still solving the conceptual problem.

This second stage was different. It was no longer about defending the idea, but about actually running it. And that was where the problems appeared that do not show up in a diagram or in a strong narrative: permissions, AI systems that do not behave the same way, processes that need a real interactive terminal, configuration that ages badly, and orchestration decisions that sound good on paper but do not hold up in practice.

The biggest change: I stopped thinking in terms of a pipeline and started thinking in terms of a runtime

I think the best way to explain this evolution is this: agentflow stopped being just a way to organize prompts, files, and steps, and started becoming an explicit runtime.

That changed the way I saw it.

Before, the configuration described more of what had to be generated. Now it describes how each role runs: which provider it uses, which model, which effort level, which sandbox, and which prompt governs it. It is no longer just a tool to assemble a flow, but a base for running real roles with more control.
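A per-role runtime configuration along these lines could be typed roughly as below. The five fields come from the list above; the field names, the allowed effort values, the model names, and the prompt paths are all placeholders invented for the sketch, not agentflow's actual schema.

```typescript
// Hypothetical schema: names and values are illustrative assumptions.
type Provider = "claude" | "codex" | "opencode";
type Sandbox = "workspace-write" | "read-only";
type Effort = "low" | "medium" | "high"; // assumed scale

interface RoleConfig {
  provider: Provider; // which external tool runs the role
  model: string;      // model identifier passed to that tool
  effort: Effort;     // how much reasoning effort to request
  sandbox: Sandbox;   // what the role may do to the workspace
  prompt: string;     // path to the prompt that governs the role
}

type RuntimeConfig = Record<string, RoleConfig>;

const exampleConfig: RuntimeConfig = {
  planner: {
    provider: "claude",
    model: "example-planning-model",
    effort: "high",
    sandbox: "read-only",
    prompt: "prompts/planner.md",
  },
  implementer: {
    provider: "codex",
    model: "example-coding-model",
    effort: "medium",
    sandbox: "workspace-write",
    prompt: "prompts/implementer.md",
  },
};

// Lists the roles that are allowed to modify files.
function rolesThatCanWrite(config: RuntimeConfig): string[] {
  return Object.keys(config).filter(
    (name) => config[name]?.sandbox === "workspace-write",
  );
}
```

Describing execution per role, rather than only the files to generate, is what lets a planner stay read-only while an implementer gets write access.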

That shift matters because it made me see something that was not fully clear before: separating roles well is not enough. You also need the runtime that makes that separation viable.

Real execution was what showed where the gap actually was

The clearest finding of this stage was simple and brutal: the agents could not write files autonomously.

Not because the design was missing. The sandbox field already existed and the intent was correct. The problem was more uncomfortable: the Claude adapter was not translating that intent into the real CLI flags. So the system ran, but it kept getting blocked asking for permission on every write.

That was one of those moments that bring you back down to earth. Because that is when you understand that a system running and a system working are not the same thing.

The fix was direct, but the lesson mattered more than the fix. For Claude Code I had to translate workspace-write into --dangerously-skip-permissions and read-only into --permission-mode plan. Codex already handled that side more cleanly with --sandbox workspace-write. OpenCode, on the other hand, still has a more structural limitation because its CLI does not expose an equivalent flag.
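The translation step can be sketched as a small mapping function. The flags for Claude Code and the Codex `--sandbox workspace-write` flag are the ones named above; the function name, the return shape, and passing `read-only` through Codex's `--sandbox` flag are assumptions made for this example.

```typescript
type Provider = "claude" | "codex" | "opencode";
type Sandbox = "workspace-write" | "read-only";

interface SandboxFlags {
  flags: string[];
  warning?: string; // set when the intent cannot be enforced through the CLI
}

// Hypothetical adapter helper: turns the declared sandbox intent into
// provider-specific CLI arguments.
function sandboxFlags(provider: Provider, sandbox: Sandbox): SandboxFlags {
  switch (provider) {
    case "claude":
      return sandbox === "workspace-write"
        ? { flags: ["--dangerously-skip-permissions"] }
        : { flags: ["--permission-mode", "plan"] };
    case "codex":
      // Only "--sandbox workspace-write" is confirmed by the article;
      // reusing the same flag for read-only is an assumption.
      return { flags: ["--sandbox", sandbox] };
    case "opencode":
      // No equivalent flag is exposed, so surface the gap instead of hiding it.
      return {
        flags: [],
        warning: `opencode: sandbox "${sandbox}" cannot be enforced via CLI flags`,
      };
  }
}
```

Returning a warning for OpenCode keeps the limitation visible to whoever runs the flow, rather than letting a configured sandbox silently do nothing.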

You do not discover that problem by refining prompts. You discover it by running the system.

It also became clear that orchestration is not the same as delegation

Another thing this stage made very clear was that the orchestrator I had in mind was still not closing the last mile well enough.

In theory, agentflow run already existed and already had sequencing logic. But in practice, when Claude Code, Codex, or OpenCode were participating in a real session, that command was not enough. The bootstrap skills were too shallow. They mostly delegated and stopped there. They did not provide enough context to decide when to stop, which steps to run, how to handle the review loop, or when to ask for human approval.

That was when I realized something that now feels self-evident: there is no single correct mode of orchestration.

CLI mode makes sense for automation, CI, and deterministic execution. But an interactive session needs something else. It needs the agent to have judgment to classify the task, present a plan, wait for approval before implementing, and decide how to move forward based on context. Trying to force the same mechanism to work both for automation and CI and for an interactive session with judgment and human approval created more friction than it removed.

Not every task deserves the full pipeline

Another improvement that feels genuinely important in this stage was finally turning the classifier into a real capability.

In part 1, the intuition was already there: not every task should cost the same. But it was still more of a thesis than a real system capability.

Now there is a role that classifies complexity as small, medium, or large, and that changes the flow. A small change does not need to go through the full ceremony. A larger task does justify the complete pipeline, a review loop, and possible model adjustments. And if a project comes from an older configuration and does not yet have that role, the system does not break: it falls back to a heuristic path and keeps running.
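A rough sketch of that routing follows, including the fallback for projects that have no classifier role. The keyword list, the word-count thresholds, and the choice of which steps each size runs are invented for illustration; they are not agentflow's real heuristic.

```typescript
type Complexity = "small" | "medium" | "large";
type Classifier = (task: string) => Complexity;

// Assumed fallback heuristic: keywords and length thresholds are arbitrary.
function heuristicClassify(task: string): Complexity {
  const words = task.trim().split(/\s+/).filter((w) => w.length > 0).length;
  if (/\b(migrat|refactor|redesign)/i.test(task) || words > 120) return "large";
  if (words > 30) return "medium";
  return "small";
}

// Assumed mapping from complexity to the steps worth paying for.
const STEPS: Record<Complexity, string[]> = {
  small: ["implementation", "testing"],
  medium: ["planning", "approval", "implementation", "testing"],
  large: [
    "planning",
    "approval",
    "implementation",
    "review",
    "testing",
    "documentation",
  ],
};

// Uses the configured classifier role when there is one; otherwise falls
// back to the heuristic so older configurations keep working.
function planSteps(
  task: string,
  classifier?: Classifier,
): { complexity: Complexity; steps: string[] } {
  const complexity = (classifier ?? heuristicClassify)(task);
  return { complexity, steps: STEPS[complexity] };
}
```

For example, `planSteps("Fix typo in README")` takes the short path, while a task mentioning a migration is routed through the full pipeline before any tokens are spent on implementation.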

I like this because it moves optimization to the right place. Not after spending time and tokens, but before.

Put more simply: it makes no sense to treat a small bug like a complex migration.

Providers are not interchangeable

This stage also helped me let go of a simplification that is tempting in the abstract: thinking that all AI providers are more or less the same.

They are not.

When I talk about an AI provider, I mean the external system that executes a specific task inside the flow. It can be Claude, Codex, or something else. It is basically the service I delegate work to in one phase of the process.

And once I pushed this into real execution, it became clear that these providers do not behave in the same way. Permissions change, integration styles change, process handling changes, and even the way they expect to be run changes.

In some cases, it is also not enough to launch a command and wait for a response. Some tools need to run inside a real interactive terminal, as if they were opened directly in the console. That is usually called a TTY, but in plain language it means this: the tool needs a real terminal to work properly.

That is what pushed me toward different execution strategies depending on the provider. For some cases, a pipe-based execution worked fine. For Codex, I ended up needing a real PTY with node-pty, because its interface can fail or hang if it does not run in a genuinely interactive terminal.

It sounds like a minor detail, but it is not. Because working with agents is not only about working with text. It is also about processes, permissions, terminals, and real errors. And if that is not designed well, the whole system feels fragile even if the core idea is strong.

Several useful improvements were not flashy, but they were necessary

There were also less visible improvements that mattered more than they seem.

One was stopping the dependency on a rigid testRunner field in the config. That kind of field ages badly. You change the project, change the runner, or change the stack, and you end up carrying stale instructions. It felt much better to let the tester detect that from the project itself when the field is not defined.
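Detection from the project itself might look something like this for a Node.js project. The function name, the lookup order, and the list of recognized runners are assumptions for the sketch; the article does not describe how the tester actually detects the runner.

```typescript
// Minimal view of the package.json fields this sketch inspects.
interface PackageJson {
  scripts?: Record<string, string>;
  dependencies?: Record<string, string>;
  devDependencies?: Record<string, string>;
}

// Assumed set of known runners and the command used to invoke each.
const RUNNER_COMMANDS: Array<[string, string]> = [
  ["vitest", "npx vitest run"],
  ["jest", "npx jest"],
  ["mocha", "npx mocha"],
];

// Hypothetical helper: an explicit config value wins; otherwise infer the
// command from the project, and return undefined when nothing is found.
function detectTestCommand(
  pkg: PackageJson,
  configured?: string,
): string | undefined {
  if (configured) return configured;

  // A real "test" script is the project's own statement of how to run tests.
  // The default placeholder written by `npm init` does not count.
  const testScript = pkg.scripts?.test;
  if (testScript && testScript.indexOf("no test specified") === -1) {
    return "npm test";
  }

  const deps = { ...pkg.dependencies, ...pkg.devDependencies };
  for (const [name, command] of RUNNER_COMMANDS) {
    if (name in deps) return command;
  }
  return undefined;
}
```

Because the config field stays optional, a project that switches runners does not have to remember to update a second place.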

These are not flashy changes, but they are the kind that make a tool stop feeling rigid.

Not everything is closed yet, but these are the right problems to have

I do not want to describe this stage as if everything were already perfect, because that would not be true.

There are still open gaps. The test suite passes, but it still does not cover the runtime-first contract deeply enough, especially around real adapter execution, agent run, and the classifier. The documentation is already much more aligned with the current runtime, but OpenCode still has a real sandbox limitation that does not depend only on agentflow.

But honestly, those already feel like healthy problems.

Because I am no longer debating whether the idea makes sense. I am no longer in the phase of defending the thesis. I am now in the phase of closing concrete gaps: compatibility, documentation, robustness, and execution consistency.

And I strongly prefer being there.

What part 2 really left me with

If part 1 was about why a multi-agent system made more sense than one giant conversation, part 2 is about something else: what happens when that idea leaves the page and meets reality.

That was where the real gaps showed up: permissions, effective orchestration, complexity management, providers that do not behave the same, processes, fragile defaults, and traces.

The original idea did not collapse. If anything, it came out stronger.

But now I see it more completely: a multi-agent architecture is not ready just because it looks good in the design. It is ready when it can actually run without breaking on basic things.

Closing

The first version taught me how to separate responsibilities.

This second stage forced me to build the runtime that makes that separation viable.

And real execution ended up teaching me the most important thing: between "this runs" and "this works the way it should" there is a large distance. That distance is not closed with more theory. It is closed by running the system, observing where it fails, and correcting it with concrete changes.

That, for me, is what this second part is really about.


Why I Started Splitting Planning, Implementation, Testing, and Documentation in AI Workflows

Ricardo Lara — Sat, 21 Mar 2026 16:49:32 +0000

While testing different AI coding tools, I kept running into two recurring problems.

The first one was cost. Using the same model to plan, implement, review, test, and document does not make much sense. Not every stage requires the same level of reasoning.

The second problem was more important: when a single agent does everything inside one long conversation, context starts to get polluted. It begins to mix decisions, forget constraints, and lose precision as it moves from one phase to the next.

To address that, I built agentflow, a CLI for setting up a multi-agent workflow where each stage has a clear responsibility:

planning
approval
implementation
review
testing
documentation

The goal is not just to split tasks. It is also to give each phase a cleaner context window and use the right model for the right kind of work.
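The stage sequence above can be sketched as a loop in which each stage only sees the previous stage's output and a rejected plan stops everything. This is a simplified illustration, not agentflow's implementation: the function signatures are hypothetical, and real stages would call different models rather than plain functions.

```typescript
const STAGES = [
  "planning",
  "approval",
  "implementation",
  "review",
  "testing",
  "documentation",
] as const;
type Stage = (typeof STAGES)[number];

interface StageResult {
  stage: Stage;
  output: string;
}

// runStage stands in for "call the model assigned to this stage";
// approve stands in for the human approval step.
function runPipeline(
  task: string,
  runStage: (stage: Stage, input: string) => string,
  approve: (plan: string) => boolean,
): StageResult[] {
  const results: StageResult[] = [];
  let input = task;
  for (const stage of STAGES) {
    if (stage === "approval") {
      const approved = approve(input);
      results.push({ stage, output: approved ? "approved" : "rejected" });
      if (!approved) return results; // nothing is implemented without approval
      continue; // the approved plan remains the input for implementation
    }
    // Each stage receives only the previous output, not the whole history.
    input = runStage(stage, input);
    results.push({ stage, output: input });
  }
  return results;
}
```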

What stood out to me during testing was that the real value was not only the potential cost savings, but the consistency of the process. As the workflow matured, test coverage improved, manual intervention dropped, and some issues started getting caught earlier in the pipeline.

What interests me most about this approach is not “making AI code on its own.” It is designing a better process around it.

If you want the full write-up, you can read the original post here:

Full post:
https://ricardolara.dev/es/blog/inteligencia-artificial-multiagente/

npm package:
https://www.npmjs.com/package/@riclara/agentflow

If you work with Claude Code, Codex, or similar tooling, I would love to hear your feedback.

How I am using a multi-agent pipeline to make AI-assisted development more consistent

Ricardo Lara — Sat, 21 Mar 2026 16:43:33 +0000

While testing different AI coding tools, I kept running into two recurring problems.

The first was cost: using the same model to plan, implement, review, test, and document does not make much sense. Not every stage requires the same level of reasoning.

The second was more important: when a single agent does everything inside one long conversation, the context gets polluted. It starts to mix decisions, forget constraints, and lose precision between phases.

To solve that, I built agentflow, a CLI for setting up a multi-agent flow where each stage has a clear responsibility:

planning
approval
implementation
review
testing
documentation

The idea is not just to distribute tasks. It is also to work with a cleaner context in each phase and use the right model for the type of work.

Something that caught my attention during testing was that the value was not only in the possible cost savings, but in the consistency of the process. As the flow matured, test coverage improved, manual intervention dropped, and errors surfaced that were worth catching earlier.

What interests me most about this approach is not "making AI code on its own", but designing a better process around it.

If you want to read the full explanation, here is the original post:

Full post:
https://ricardolara.dev/es/blog/inteligencia-artificial-multiagente/

npm package:
https://www.npmjs.com/package/@riclara/agentflow

If you work with Claude Code, Codex, or similar tooling, I would really like to hear your feedback.