DEV Community: Vicente Junior

I Built a WhatsApp Finance Agent in OpenClaw. Migrating to Hermes Taught Me What "Self-Improving" Actually Means.

Vicente Junior — Sat, 23 May 2026 22:15:53 +0000

This is a submission for the Hermes Agent Challenge: Write About Hermes Agent

TL;DR

I built Finn — a personal finance agent for WhatsApp — on OpenClaw, in TypeScript, with a single agent and six tools.

Migrating to Hermes Agent isn't a framework swap. It's a paradigm shift from "an agent that executes" to "an agent that learns."

The hermes claw migrate command makes the OpenClaw → Hermes path official. The gains are real (persistent memory, skill-based abstractions, multi-platform gateway), and so are the tradeoffs.

Meet Finn

A few months ago I built Finn, a personal finance assistant that lives entirely inside WhatsApp. No app to install, no dashboard to remember — just a chat:

You:  spent 45 on lunch
Finn: $45 · Food · Mastercard · Me · 2026-04-22 — confirm? ✅
You:  sim
Finn: ✅ Saved!

Under the hood, Finn is an OpenClaw plugin written in TypeScript. It runs a single agent with a gpt-4.1 tool-use loop (capped at 5 iterations), persists transactions to Supabase, parses credit card PDFs (text-based and OCR via gpt-4o), and transcribes voice notes through Whisper. It speaks both Portuguese and English depending on the last message it received.

I shipped it for the OpenClaw Challenge 2026. The full demo, code samples, and architecture walkthrough are in my submission post, "Finn 💰 — A Personal Finance Assistant That Lives in WhatsApp". It works. I use it every day.

And then I read the Hermes Agent docs.

This post is about what happens when you take a working production agent built on OpenClaw and ask: what would this look like in Hermes, and is the upgrade worth it?

What Finn Already Does Well in OpenClaw

Before I criticize anything, let me be fair to the framework I picked.

OpenClaw gave me a clean abstraction that mapped naturally to "personal agent on WhatsApp":

A before_dispatch hook that intercepts incoming messages and routes them through my plugin.
A simple tool-use loop — define a JSON schema, expose handler functions, let the model call them. Max 5 iterations per turn keeps cost and latency predictable.
Native WhatsApp gateway through OpenClaw's connector model.
Phone whitelist at the gateway level, before my code even sees the message.
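
A minimal sketch of what a capped tool-use loop like this can look like. The types and the `runToolLoop` name are hypothetical and stand in for the real OpenClaw and OpenAI calls, which are not shown in this post; only the 5-iteration cap comes from Finn itself.

```typescript
// Hypothetical types for illustration; not OpenClaw's or OpenAI's actual API.
type ToolCall = { name: string; args: Record<string, unknown> };
type ModelTurn = { reply?: string; toolCall?: ToolCall };
type Model = (history: string[]) => Promise<ModelTurn>;
type ToolHandler = (args: Record<string, unknown>) => Promise<string>;

const MAX_ITERATIONS = 5;

async function runToolLoop(
  model: Model,
  tools: Record<string, ToolHandler>,
  userMessage: string,
): Promise<string> {
  const history: string[] = [`user: ${userMessage}`];
  for (let i = 0; i < MAX_ITERATIONS; i++) {
    const turn = await model(history);
    // No tool requested: the model has produced its final answer.
    if (!turn.toolCall) return turn.reply ?? "";
    const handler = tools[turn.toolCall.name];
    const result = handler
      ? await handler(turn.toolCall.args)
      : `error: unknown tool ${turn.toolCall.name}`;
    // Feed the tool result back so the next model call can use it.
    history.push(`tool ${turn.toolCall.name}: ${result}`);
  }
  // Cap reached: stop instead of letting cost and latency grow unbounded.
  return "Sorry, I could not finish that request.";
}
```

The cap is the whole point: a model that keeps asking for tools gets cut off after five calls, so the worst-case cost of a single WhatsApp message is bounded.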

The six tools I exposed to the agent map directly to the things a finance assistant needs to do:

Tool	Purpose
`save_transaction`	Persist a confirmed expense, income, or card payment
`query_spending`	Query totals and breakdowns from the database
`save_bulk_transactions`	Bulk-save invoice items from a PDF import
`save_bank_statement`	Bulk-save bank statement rows
`update_transaction`	Edit a saved record (with confirmation)
`delete_transaction`	Delete a saved record (with confirmation)

Every mutation requires explicit user confirmation before the tool is called. The whole architecture fits on one page:

WhatsApp
    │
    ▼
OpenClaw Gateway  ──(before_dispatch hook)──▶  Finn Plugin (TypeScript)
                                                  │
                            ┌─────────────────────┤
                            ▼                     ▼
                       Text / Audio              PDF
                            │                     │
                       runAgent()           Custom parsers
                       (gpt-4.1 loop)      (text + OCR fallback)
                            │                     │
                            └──────────┬──────────┘
                                       ▼
                                  Supabase

It's a clean design. I'm proud of it. But the longer I used Finn, the longer my wishlist got.

The Walls I Hit

After a few weeks of daily use, three things started bothering me:

1. Finn forgets everything between conversations.
When I say "roxinho" I mean my Nubank card. When I say "feira" I mean the grocery store. Today, those mappings live in the system prompt. They don't grow. If next week I start using "Itubinho" for Itaú, I have to edit the prompt and redeploy. The agent is not learning; I am.

2. Finn only lives on WhatsApp.
If I want the same assistant on Telegram (where my parents are), on Discord (where my work crew lives), or on Signal (where I keep some chats), I have to build a new connector each time, or fork the plugin and reinstall it elsewhere.

3. Every new capability is a code change.
The "saldo-diff algorithm" I built for Bradesco statements is custom code, deployed via my own deploy.sh script. If I want a Nubank statement parser tomorrow, that's a new TypeScript file, a new test suite, a new deploy. There is no abstraction layer between "I figured out a new way to do something" and "I shipped TypeScript to production."

These aren't OpenClaw's fault, really. They are the limits of building on a framework that thinks of agents as executors of tools — the agent's job is to pick the right tool, with the right arguments, in the right order, and stop.

What I wanted was an agent that accumulated knowledge the way I do. That was when Hermes started looking interesting.

The Shift Hermes Makes: From Tools to Skills

Here is the core conceptual change.

In OpenClaw, the agent's capability surface is a list of tools. Each tool is a function with a JSON schema. The agent picks one, calls it, observes the result, and decides what to do next.

In Hermes, the agent's capability surface is a list of skills. A skill is a markdown file — SKILL.md — with YAML frontmatter and instructions. The agent loads skills on demand, using progressive disclosure: at level zero it only sees skill names and short descriptions (around 3k tokens for the whole catalog), at level one it loads the full skill content, at level two it can pull additional reference files inside the skill's directory. Tools still exist (Hermes ships 70+ of them across 28 toolsets), but skills are the primary abstraction for how the agent works on a problem.
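
To make "level zero" concrete, here is a rough sketch of building a name-plus-description catalog from SKILL.md files. This is not Hermes code (Hermes core is Python); `summarizeSkill` and `buildCatalog` are hypothetical names, and the frontmatter parse is deliberately naive where a real loader would use a YAML parser.

```typescript
type SkillSummary = { name: string; description: string };

// Extract only `name` and `description` from the YAML frontmatter of a SKILL.md.
// Naive line-based parse: top-level `key: value` lines only, nested keys are ignored.
function summarizeSkill(markdown: string): SkillSummary | null {
  const match = markdown.match(/^---\r?\n([\s\S]*?)\r?\n---/);
  if (!match) return null;
  const fields = new Map<string, string>();
  for (const line of match[1].split(/\r?\n/)) {
    const kv = line.match(/^([A-Za-z_]+):\s*(.+)$/);
    if (kv) fields.set(kv[1], kv[2].trim());
  }
  const name = fields.get("name");
  const description = fields.get("description");
  return name && description ? { name, description } : null;
}

// Level zero: the short catalog shown to the agent before any full skill body is loaded.
function buildCatalog(skills: string[]): string {
  return skills
    .map(summarizeSkill)
    .filter((s): s is SkillSummary => s !== null)
    .map((s) => `- ${s.name}: ${s.description}`)
    .join("\n");
}
```

Only the one-line summaries reach the context window at this level; the procedure, pitfalls, and helper files stay on disk until the agent decides a skill is relevant.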

This is what Finn's save_transaction tool would look like as a Hermes skill:

---
name: log-expense
description: Log a personal expense to the finance database with smart defaults
version: 1.0.0
metadata:
  hermes:
    tags: [finance, personal, daily]
    category: finance
    requires_toolsets: [terminal]
---

# Log Expense Skill

## When to Use
The user mentions a purchase, payment, or expense, with or without a category.
Examples: "spent 45 on lunch", "paid Netflix 55.90", "almoço 35 ontem".

## Defaults
- Card: Mastercard (unless user names another)
- Cost center: Me (unless user names another person)
- Date: today (unless user names another date)
- Never ask about defaults — apply them silently.

## Procedure
1. Parse amount, description, and any explicit metadata from the message.
2. Apply defaults to anything not specified.
3. Echo the parsed transaction back to the user for confirmation.
4. On confirmation ("sim", "yes", "✅"), write to Supabase via the terminal tool.
5. Reply with a success message including the new balance for the category.

## Pitfalls
- Do not trust relative dates ("last Tuesday") without confirming the actual date.
- Round currency to 2 decimal places when displaying.
- Watch for duplicate detection warnings before saving.

## Verification
The success message includes the saved record ID and the running category total.

That is the entire "save_transaction" capability, expressed as procedural knowledge instead of a function signature. There's no TypeScript to compile, no schema to maintain in two places, no deploy. If next month I figure out that "Wednesday is my coffee day, default the category to Coffee on Wednesdays unless overridden," I add three lines to the markdown file. Done.

But the bigger shift isn't the syntax. It is that the agent itself can write and update these skills through a tool called skill_manage. This is the agent's procedural memory: when it figures out a non-trivial workflow that worked, it can save the approach as a new skill for next time. Hermes will offer to crystallize what it learned into a reusable skill at three moments: after completing a complex task with 5+ tool calls, after the user corrects its approach, or after it hits errors and finds the working path.

OpenClaw has nothing equivalent. If I want Finn to learn, I have to learn first, then update the prompt.

Memory: From "Vocabulary Hard-coded in Prompt" to "Memory the Agent Curates"

Finn's vocabulary mappings live in prompts.ts. They are static. To update them I edit a file, build, and redeploy.

Hermes has three memory primitives that change this completely:

SOUL.md — the personality file. Loaded first into every system prompt. This is "who the agent is."
USER.md — what the agent knows about the user. Updated by the agent over time. This is where "roxinho means Nubank, feira means grocery store" would naturally land.
MEMORY.md — operational notes the agent curates for itself, with periodic nudges to consolidate.

On top of those, Hermes has session storage in SQLite with FTS5 full-text search, so the agent can run keyword searches over prior conversations (FTS5 matches indexed terms, not meaning). It also integrates Honcho for dialectic user modeling, a deeper layer that builds an evolving model of who you are across sessions.

The contrast with Finn-as-it-stands-today is sharp. In OpenClaw I am the memory. In Hermes the agent maintains its own.
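
The difference is easiest to see if vocabulary is treated as data that can grow at runtime instead of text baked into a prompt. The sketch below is only an illustration of that idea: `learnAlias` and `resolveTerms` are hypothetical helpers, and this is not how Hermes stores or reads USER.md.

```typescript
// Hypothetical shape: aliases as data the agent can append to at runtime.
type Aliases = Map<string, string>;

function learnAlias(aliases: Aliases, term: string, meaning: string): void {
  aliases.set(term.toLowerCase(), meaning);
}

// Replace any known alias in the message with its meaning; unknown words pass through.
function resolveTerms(aliases: Aliases, message: string): string {
  return message
    .split(/\s+/)
    .map((word) => aliases.get(word.toLowerCase()) ?? word)
    .join(" ");
}

const aliases: Aliases = new Map([["roxinho", "Nubank"]]);
console.log(resolveTerms(aliases, "45 lunch roxinho"));

// A new nickname shows up next week: no prompt edit, no redeploy.
learnAlias(aliases, "Itubinho", "Itaú");
console.log(resolveTerms(aliases, "20 uber itubinho"));
```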

Multi-Platform for Free

Finn lives on WhatsApp because that's the connector I built around. If I want it on Telegram, I write a new connector.

Hermes ships with native gateways for over 20 platforms out of the box — WhatsApp, Telegram, Discord, Slack, Signal, Matrix, Mattermost, Email, SMS, Microsoft Teams, Google Chat, and more. Same agent, same skills, same memory. One Hermes process can serve all of them simultaneously, with per-platform session isolation and unified user authorization.

For Finn, that means my custom plugin code for WhatsApp routing, my deploy script, my session management — those layers either disappear or get absorbed into the gateway's configuration. The work I did to integrate one platform stops being a feature and becomes table stakes.

Side-by-Side: Finn on OpenClaw vs Finn on Hermes

Concern	Finn on OpenClaw (today)	Finn on Hermes (proposed)
Agent abstraction	One agent, 6 tool functions	One agent, N skills (markdown)
Adding a capability	New TypeScript file + deploy	New `SKILL.md` file
Vocabulary ("roxinho = Nu")	Static in `prompts.ts`	Lives in `USER.md`, updated by agent
Cross-session context	Stateless per conversation	SQLite + FTS5, persistent across sessions
Self-improvement	None (I edit prompts)	`skill_manage` lets the agent create/update skills
Platforms	WhatsApp only (custom connector)	20+ native gateways
PDF parsing	Custom TS parsers (`parse-invoice.ts`, OCR fallback)	Skill with helper scripts under `~/.hermes/skills/finn-finance/scripts/`
Scheduled tasks	None (would require new infra)	First-class cron, deliverable to any platform
Language	TypeScript	Python (Hermes core) + markdown skills
Migration	—	`hermes claw migrate` (official command)

The last row is the one I want to underline. The Hermes CLI ships with a dedicated hermes claw migrate command that moves settings, memories, skills, and API keys from an OpenClaw setup directly to Hermes. That is not a hint that the projects are related. That is an official upgrade path.

Where OpenClaw Is Still the Right Choice

I want to be honest here, because most framework comparison posts pretend the winner is universal.

OpenClaw is still the right tool when:

Your team is a TypeScript team. Hermes core is Python. If your stack and your hires are TypeScript-first, the cognitive switch and the deploy story matter. A working OpenClaw plugin in your team's primary language can beat a "better" framework in a language nobody loves.
You need a small, focused, single-purpose agent and you want maximum determinism. A 5-iteration tool-use loop is easy to reason about, easy to debug, easy to put limits on. Hermes can do this too, but it has more layers between you and the model.
You don't need cross-platform reach. If WhatsApp is the only surface you'll ever need, the multi-platform gateway in Hermes is overhead you'll never use.
You don't want the agent rewriting things on its own. skill_manage and agent-curated memory are powerful, but they mean the agent's behavior surface evolves over time. If you need a behavior you wrote on Monday to be exactly the same on Friday, the more static OpenClaw plugin model is easier to audit.

OpenClaw didn't get worse. My ambitions for Finn outgrew its scope.

What You Give Up Choosing Hermes

The honest tradeoffs:

Less deterministic execution. Skills load dynamically; subagents can spawn at runtime via delegate_task. Most of the time, that's fine. For some compliance-critical paths, the static OpenClaw flow is easier to defend in a review.
Maturity gap. Hermes Agent is at v0.10.0. OpenClaw has been running personal agents for longer. Production readiness is a real consideration.
Language switch. Hermes is Python-first. My Finn codebase is TypeScript. The skills layer is markdown so it's portable, but custom helper scripts and tooling integrations would need to be rewritten or wrapped.
Lock-in to the Hermes mental model. Once your agent has accumulated dozens of learned skills over months, porting that institutional knowledge to a third framework is not trivial. The agent's procedural memory is an asset and a form of coupling.

The Migration Path Is Official

Here is what I appreciate about this specific migration: it is not a vague "you could probably do it" path. It is a documented, supported, one-command operation. The Hermes CLI ships with:

hermes claw migrate

This moves settings, memories, skills, and API keys from an OpenClaw installation into Hermes. The two projects share enough conceptual DNA that the migration is real, not aspirational.

My plan for Finn isn't a rewrite. It is a phased migration:

Phase 1 — Run side by side. Install Hermes locally. Create a log-expense skill that mirrors Finn's save_transaction tool. Point it at a sandbox Supabase. Use it on a second WhatsApp number for a week.

Phase 2 — Migrate the simple paths first. save_transaction, query_spending, update_transaction, delete_transaction are good candidates. They are mostly business logic plus a database call. Each becomes one SKILL.md.

Phase 3 — Tackle the hard cases. The PDF parsers — especially the saldo-diff algorithm for Bradesco statements — are real engineering. They become helper scripts under ~/.hermes/skills/finn-finance/scripts/, invoked from the skill via the terminal tool. The skill itself documents when and how to use the script; the Python or TypeScript helper does the actual parsing.

Phase 4 — Decommission OpenClaw. Only after Phase 3 is stable, point the production WhatsApp number at Hermes. Keep the OpenClaw plugin around for a rollback window. Eventually retire it.

The point isn't speed. The point is to not break the thing I use daily.

Closing

When I built Finn I wasn't thinking about whether OpenClaw was the right framework. I was thinking about whether I could ship a working WhatsApp finance agent in a few weekends. The framework choice was downstream of that goal, and OpenClaw made the answer "yes."

Hermes is asking a different question: what does an agent look like when the agent itself participates in its own evolution, through skills it writes, memory it curates, and knowledge it accumulates across sessions and platforms?

For a personal finance assistant that I expect to live with me for years, that question is more interesting than the question that brought me to OpenClaw. The migration is on my list.

If you have a working OpenClaw agent and you haven't read the Hermes Agent docs yet, do it before your next "I should add X to it" moment. You might end up writing one markdown file instead of one new TypeScript file.

I gave Gemini 3.5 Flash a CVE-fix PR to review. It found another bug in the same file.

Vicente Junior — Fri, 22 May 2026 22:40:14 +0000

This is a submission for the Google I/O Writing Challenge

Across 3 real production PRs, I asked Gemini 3.5 Flash to do a code review. The model — announced this week at Google I/O 2026 — caught 3 legitimate bugs, hallucinated 0, in roughly 4 seconds per PR. The middle PR was the patch for a known security vulnerability in Fastify (CVE-2026-25223, a validation-bypass). The model flagged a second, unrelated regex bug in the exact file being patched.

Here's what I learned building a code-review agent in about 2 hours with Google's new model.

Why I tested this

At the I/O keynote, Sundar Pichai pitched Gemini 3.5 Flash as "frontier intelligence combined with action" — optimized for agentic coding and long-horizon tasks. Code review is the perfect stress test: it requires reasoning about code semantics, cross-file context, and judgment about what matters.

Reading another 50 hype threads on X felt pointless. So I built the smallest possible agent that could actually use the model on real code, ran it on three concrete PRs, and counted what it got right, what it made up, and what it missed.

The architecture

Three stages, ~80 lines of TypeScript, runs on Node 20+:

INPUT                  PROCESSING                       OUTPUT
─────                  ──────────                       ──────
owner/repo#N    →      1. fetch the .diff URL      →    stdout (colored summary)
                       2. truncate if > 150k chars      out/{slug}.json
                       3. build prompt + schema         out/{slug}.md
                       4. Gemini 3.5 Flash call
                       5. Zod-parse the response

No GitHub token (public PRs use the unauthenticated .diff URL). No octokit. No frameworks. Just the new @google/genai SDK with structured output.
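
Steps 1 and 2 can be sketched roughly as follows. The function names and the truncation marker are my own for illustration and may differ from the repo; the 150k-character limit and the unauthenticated .diff URL are the ones described above.

```typescript
const MAX_DIFF_CHARS = 150_000;

// Turn "owner/repo#123" into the public .diff URL for that pull request.
function diffUrl(ref: string): string {
  const m = ref.match(/^([\w.-]+)\/([\w.-]+)#(\d+)$/);
  if (!m) throw new Error(`expected owner/repo#N, got "${ref}"`);
  const [, owner, repo, num] = m;
  return `https://github.com/${owner}/${repo}/pull/${num}.diff`;
}

// Keep the prompt bounded; the marker tells the model the diff is incomplete.
function truncateDiff(diff: string, limit: number = MAX_DIFF_CHARS): string {
  return diff.length > limit
    ? diff.slice(0, limit) + "\n[diff truncated]"
    : diff;
}

// Uses the global fetch available in Node 18+; no token needed for public PRs.
async function fetchDiff(ref: string): Promise<string> {
  const res = await fetch(diffUrl(ref));
  if (!res.ok) throw new Error(`GitHub returned ${res.status} for ${ref}`);
  return truncateDiff(await res.text());
}
```

Calling `fetchDiff("fastify/fastify#6414")` would then hand the (possibly truncated) diff straight to the review step.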

The core

The heart of the pipeline is a single review() function — pass it a diff, get back a typed array of issues:

import { GoogleGenAI } from "@google/genai";
import { z } from "zod";
import { zodToJsonSchema } from "zod-to-json-schema";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY! });

const IssueSchema = z.object({
  file: z.string(),
  line: z.number().nullable(),
  severity: z.enum(["low", "medium", "high", "critical"]),
  category: z.enum(["bug", "security", "performance", "style", "logic", "maintainability"]),
  message: z.string(),
  suggestion: z.string().nullable(),
});

const ReviewSchema = z.object({
  summary: z.string(),
  issues: z.array(IssueSchema),
});

const SYSTEM_PROMPT = `You are a senior code reviewer. Analyze the unified git
diff below and produce a JSON review.

Rules:
- Flag REAL issues only — no nitpicks, no style preferences.
- Prefer fewer, higher-quality issues over volume.
- Each "message" must explain WHY it matters (impact, not just observation).
- If you cannot see enough context to be sure, lower the severity.

Return the full review as JSON matching the provided schema.`;

async function review(diff: string) {
  const res = await ai.models.generateContent({
    model: "gemini-3.5-flash",
    contents: `${SYSTEM_PROMPT}\n\n--- DIFF ---\n${diff}`,
    config: {
      responseMimeType: "application/json",
      responseJsonSchema: zodToJsonSchema(ReviewSchema),
    },
  });
  return ReviewSchema.parse(JSON.parse(res.text ?? "{}"));
}

A few details worth flagging:

Model string: "gemini-3.5-flash". GA since May 19, 2026.
Structured output: use responseJsonSchema (not the older responseSchema). It validates against the Zod-derived schema and returns conformant JSON. No regex-parsing the response, no try/catch for malformed output.
No temperature tuning: Google explicitly recommends not setting temperature, top_p, or top_k on the 3.5 family — the model handles sampling internally.

Full repo at the end. Now the interesting part.

The three PRs

I picked PRs with very different shapes to see how the model behaved across contexts.

PR	Type	Lines	Why
express#6190	Small refactor	~10	Baseline: clean code, no real issues
fastify#6414	Security-sensitive	+398 / −147	The patch for CVE-2026-25223
express#6100	Small refactor	~15	Different file, different style

Final scorecard

PR #1 (express#6190):    +0  −0   Model agreed: no issues
PR #2 (fastify#6414):    +3  −0   3 hits, 0 hallucinations
PR #3 (express#6100):    +0  −0   Model agreed: no issues
──────────────────────────────────────────────────────────────
Total:                   +3  −0   Zero false positives.

What it caught — the headline

PR #2 is the one that mattered. Fastify pull #6414 rewrote the entire content-type parser to fix a security flaw (CVE-2026-25223) where attackers could bypass body validation by appending a tab character to Content-Type (e.g. application/json\tx). The fix introduced a new ContentType class and replaced the old loose string-matching logic.

This is exactly the kind of high-stakes, security-sensitive refactor where an automated reviewer either earns its place or doesn't.

The model flagged three issues. Here's each one, verified against the actual code.

Hit 1: inconsistent variable use in `existingParser`

MEDIUM · logic — The existingParser method checks contentType === "application/json" and this.customParsers.has(contentType) using the original contentType string instead of the newly calculated, normalized ct variable.

Looking at the new code in lib/content-type-parser.js:

ContentTypeParser.prototype.existingParser = function (contentType) {
  if (typeof contentType === 'string') {
    const ct = new ContentType(contentType).toString()
    if (contentType === 'application/json' && this.customParsers.has(contentType)) {
      return this.customParsers.get(ct).fn !== this[kDefaultJsonParse]
    }
    if (contentType === 'text/plain' && this.customParsers.has(contentType)) {
      return this.customParsers.get(ct).fn !== defaultPlainTextParser
    }
  }
  return this.hasParser(contentType)
}

The model is right. ct is the normalized version, but the conditional guards still test the raw contentType. Since customParsers only holds normalized keys (see line 85: this.customParsers.set(normalizedContentType, parser)), any header with a different case or trailing parameters silently skips the fast path. Subtle, easy to miss in review.
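
A stripped-down reproduction of the shape, not Fastify's code: `normalize` is a hypothetical stand-in for `new ContentType(raw).toString()`, assumed here to lowercase the media type and drop parameters, and the map holds a placeholder string instead of a real parser.

```typescript
// Hypothetical stand-in for the patch's normalization step.
function normalize(raw: string): string {
  return raw.split(";")[0].trim().toLowerCase();
}

// Keys are stored normalized, as in the patched parser registry.
const customParsers = new Map<string, string>([["application/json", "custom"]]);

// Mirrors the flagged shape: the guard tests the raw header string.
function hasCustomJsonRaw(raw: string): boolean {
  return raw === "application/json" && customParsers.has(raw);
}

// Guard and lookup both use the normalized key.
function hasCustomJsonNormalized(raw: string): boolean {
  const ct = normalize(raw);
  return ct === "application/json" && customParsers.has(ct);
}

const header = "Application/JSON; charset=utf-8";
console.log(hasCustomJsonRaw(header));        // false: the raw guard never matches
console.log(hasCustomJsonNormalized(header)); // true
```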

Hit 2: a regex missing its end anchor

HIGH · security — The subtypeNameReg regular expression is missing a trailing $ anchor. Consequently, any string starting with a valid subtype will match successfully.

This one is the headline. In the brand new file lib/content-type.js, the patch defines two parallel regexes:

const typeNameReg     = /^[\w!#$%&'*+.^`|~-]+$/      // has $
const subtypeNameReg  = /^[\w!#$%&'*+.^`|~-]+\s*/    // no $

The subtype regex anchors at the start but not at the end. Inputs like application/json/extra pass the validation gate where they shouldn't. In a PR whose entire purpose is fixing a validation-bypass CVE, a senior reviewer would put this in red on the first pass. The model put it in HIGH on the first pass.

I am not claiming this is itself exploitable at the same severity as the original CVE — the downstream parsers may not be reachable in a way that materializes the bug. But the pattern is exactly the class of issue that did materialize as CVE-2026-25223. Pattern-recognition of dangerous shapes is half of what code review is.
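
The behavior is easy to check in isolation. The two regexes below are copied from the patch as quoted above; `anchoredSubtypeReg` is my own hypothetical variant with the `$` added, not necessarily the fix Fastify shipped, and how the parser splits type from subtype before this check is not shown here.

```typescript
const typeNameReg = /^[\w!#$%&'*+.^`|~-]+$/;          // anchored at both ends
const subtypeNameReg = /^[\w!#$%&'*+.^`|~-]+\s*/;     // start anchor only
const anchoredSubtypeReg = /^[\w!#$%&'*+.^`|~-]+\s*$/; // hypothetical anchored variant

const samples = ["json", "json/extra", "json\tx"];
for (const s of samples) {
  console.log(
    JSON.stringify(s),
    "unanchored:", subtypeNameReg.test(s),
    "anchored:", anchoredSubtypeReg.test(s),
  );
}
// "json"        unanchored: true  anchored: true
// "json/extra"  unanchored: true  anchored: false
// "json\tx"     unanchored: true  anchored: false
```

Without the end anchor, the regex only proves that the string starts with a valid subtype, which is a much weaker claim than "the string is a valid subtype".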

Hit 3: stateful global regex

MEDIUM · bug — The keyValuePairsReg regex is defined globally with the /g flag. Because of this, it is stateful and relies on lastIndex. If parsing throws an exception or future modifications exit the loop early, lastIndex will not reset to 0.

Confirmed at the top of lib/content-type.js:

const keyValuePairsReg = /([\w!#$%&'*+.^`|~-]+)=([^;]*)/gm

Used inside a class constructor with .exec() in a loop. In healthy execution, lastIndex resets to 0 when exec returns null. But the failure mode — exception inside the loop body, or any future break — silently corrupts every subsequent parse for the lifetime of the process. The model's suggested fix (use matchAll instead) is exactly the JavaScript-idiomatic answer.

This is a latent footgun, not a live bug. MEDIUM is arguably too high a severity for it. But it's a real thing the model saw.
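
Here is a small reproduction of the failure mode, using a simplified pattern instead of the one in the patch. `firstPairStateful` simulates the "early exit" case by calling `.exec()` once and returning; `pairs` shows the `matchAll` alternative. Both function names are hypothetical.

```typescript
const statefulReg = /([\w-]+)=([^;]*)/g;

// Returns the first key only, leaving lastIndex wherever exec stopped.
function firstPairStateful(input: string): string | null {
  const m = statefulReg.exec(input);
  return m ? m[1] : null;
}

console.log(firstPairStateful("a=1; b=2")); // "a"
console.log(statefulReg.lastIndex);         // 3: state left over from the first input
console.log(firstPairStateful("c=3"));      // null: the search started at index 3

// matchAll iterates over an internal copy of the regex, so the shared
// regex's lastIndex is never advanced by a parse.
const sharedReg = /([\w-]+)=([^;]*)/g;
function pairs(input: string): Array<[string, string]> {
  return Array.from(
    input.matchAll(sharedReg),
    (m) => [m[1], m[2]] as [string, string],
  );
}

console.log(pairs("a=1; b=2")); // [["a", "1"], ["b", "2"]]
console.log(pairs("c=3"));      // [["c", "3"]]
```

One caveat: `matchAll` starts from whatever `lastIndex` the regex already has, so it only helps if nothing else calls `.exec()` or `.test()` on the same global regex.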

What it didn't catch — the honest part

Two failure modes worth being honest about.

Cross-file context. The model only sees the diff. It can't tell whether a function called by the changed code is safe, whether a removed branch was load-bearing somewhere else, or whether tests actually cover the new behavior. For PR #6414 in particular, the upstream callers of the new ContentType class are not in the diff, and the model never reasoned about them.

Severity calibration is rough. The regex-without-anchor is HIGH. The stateful /g is MEDIUM. In practice the gap between them should probably be wider: the regex one is a clear pattern with security relevance, while the global-regex one is a latent footgun that is unlikely to fire. Junior-reviewer instincts.

I also can't conclusively measure what the model missed without reviewing every comment thread on the PR by hand. The merged commit went through multiple rounds of feedback (commits like "address feedback", "refactor algorithm", "appease coverage"), so reviewers did catch things. I'd need another afternoon to know how many of those were in-diff issues a tool could have seen and how many were broader design decisions.

What I'd actually use this for

Three takeaways after running this on real code:

It earns a place as a first-layer pre-review. Specifically: PRs that touch parsers, validators, or anything that consumes external input. The cost is around $0.003 per PR. The cost of not running it is shipping a regex without an anchor on a security-sensitive code path.
It does not replace human reviewers. It cannot reason about distributed state, concurrency, transactions, or anything that requires understanding multiple files in concert.
Hallucination rate was zero in this sample — but the sample is tiny. The literature on similar models suggests false positives in the 15-25% range on real-world PRs. Three out of three being valid is great but is not a benchmark.

The 80 lines of TypeScript that produced this run are on GitHub. Two things that are non-obvious about the setup:

@google/genai v2 uses responseJsonSchema, not responseSchema. Easy to get wrong if you're translating tutorial code from an older Gemini.
Public GitHub PRs expose a .diff endpoint that requires no auth. You don't need octokit for an MVP.

If you try it on PRs with shapes I didn't test — concurrency-heavy, multi-file, generated code — tell me what you find. The interesting question is where the model breaks, not where it works.

Built and tested in May 2026 with Gemini 3.5 Flash, GA two days before publication.

Finn 💰 — A Personal Finance Assistant That Lives in WhatsApp

Vicente Junior — Sat, 25 Apr 2026 15:53:31 +0000

This is a submission for the OpenClaw Challenge.

What I Built

Finn is a personal finance assistant that lives entirely in WhatsApp. No app to install, no dashboard to remember to open — you just message Finn the way you'd message a friend.

The problem Finn solves is a real one: most personal finance tools require you to change your habits (log into an app, categorize manually, remember later). Finn fits into a habit you already have — checking WhatsApp.

What it can do:

📝 Log expenses and income via text or voice note (Whisper transcription)
📄 Parse credit card PDF invoices — send the PDF, review the itemized list, confirm, saved
🏦 Parse bank account statement PDFs — transactions extracted via a saldo-diff algorithm that works around garbled PDF text
🔍 Query spending — by period, category, card, or cost center (cross-source: credit cards + bank account + manual entries combined)
✏️ Edit or delete any saved record by just describing it
🔒 Phone whitelist — only your number gets a response

Demo

All four scenarios below are real conversations with a running Finn instance.

🟢 Demo 1 — Log a Single Expense

The most common interaction: user mentions an amount and a description. Finn classifies the category and card automatically, shows a confirmation line, and only saves after approval.

You:   189 on lunch
Finn:  $189 · Food · Mastercard · Me · 2026-04-22 — confirm? ✅

You:   yeah
Finn:  ✅ Saved! Anything else?

If the description is ambiguous, Finn asks exactly one clarifying question:

You:   spent 80 today
Finn:  What was it for?

You:   pharmacy
Finn:  $80 · Pharmacy · Mastercard · Me · 2026-04-22 — confirm? ✅

You:   yes
Finn:  ✅ Saved!

Voice notes work the same way — Finn echoes the transcription before acting:

You:   🎙️ [audio: "spent ninety at the grocery store today"]
Finn:  🎙️ _"spent ninety at the grocery store today"_

       $90 · Groceries · Mastercard · Me · 2026-04-22 — confirm? ✅

You:   yes
Finn:  ✅ Saved!

✏️ Demo 2 — Change Something

You can correct any saved record in plain language. Finn finds the most recent matching transaction, shows what will change, and waits for confirmation.

Change the amount:

You:   the pharmacy one was actually 95, not 80
Finn:  Update $80 → $95 · Pharmacy · Mastercard · Me · 2026-04-22 — confirm? ✅

You:   yes
Finn:  ✅ Updated!

Change the category:

You:   that lunch from yesterday, mark it as Entertainment
Finn:  Update Food → Entertainment · $189 · Mastercard · Me · 2026-04-22 — confirm? ✅

You:   yes
Finn:  ✅ Updated!

Delete a record:

You:   delete the $90 grocery entry
Finn:  Delete $90 · Groceries · Mastercard · Me · 2026-04-22 — are you sure? 🗑️

You:   yes
Finn:  ✅ Deleted.

📄 Demo 3 — Import a Credit Card PDF Invoice

You forward the invoice PDF to the WhatsApp chat. Finn parses it, shows you every line, resolves unknown categories, and bulk-saves on one "yes".

You:   [sends Mastercard invoice PDF]

Finn:  📄 Mastercard · Due 2026-04-10 · $2,340.00 · 18 items (16 new · 2 skipped)

       📊 Invoice Summary
       Purchases: $2,680.00 · Financed balance (credit): -$520.00 · Interest: +$180.00 · Total due: $2,340.00

       1. ✅ UBER*TRIP
          Mar 12 · $18.50 · Transport · Me

       2. ✅ NETFLIX.COM
          Mar 13 · $15.99 · Entertainment · Me

       3. ✅ AMAZON PRIME
          Mar 14 · $42.30 · Shopping · Me (intl · USD 7.99)

       4. ❓ SQ*DOWNTOWN MKT
          Mar 15 · $34.00 · ❓ · Me

       5. ✅ SHELL GAS STATION
          Mar 17 · $95.00 · Transport · Me

       6. ✅ DELTA AIRLINES 01/03
          Mar 19 · $320.00 · Travel · Me
       ...

       ❓ What category for: #4 SQ*DOWNTOWN MKT ($34.00)?

You:   Groceries

Finn:  Got it. Save all 16 items? (2 duplicates will be skipped)

You:   yes

Finn:  ✅ Saved 16 transactions · Total $2,340.00
       Transport: $210.50 · Shopping: $380.00 · Food: $195.00 · Entertainment: $95.99 · Travel: $320.00 · Other: $1,138.51

The "Invoice Summary" line reconciles the invoice math: Purchases + Interest + Financed balance = Total due. Finn stores the interest and financed balance as signed rows so SELECT SUM(amount) always equals the invoice total.
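
The reconciliation for the invoice above works out like this. The sketch stores amounts as integer cents to avoid floating-point drift; that storage choice and the row shape are assumptions for illustration, not a description of Finn's actual schema.

```typescript
// Hypothetical row shape; amounts are in cents so sums are exact.
type InvoiceRow = { label: string; cents: number };

function sumCents(rows: InvoiceRow[]): number {
  return rows.reduce((total, row) => total + row.cents, 0);
}

// Figures from the invoice summary above: interest and the financed
// balance are stored as signed rows alongside the purchases.
const invoiceRows: InvoiceRow[] = [
  { label: "Purchases", cents: 268_000 },                 // $2,680.00
  { label: "Interest", cents: 18_000 },                   // +$180.00
  { label: "Financed balance (credit)", cents: -52_000 }, // -$520.00
];

const totalDueCents = 234_000; // $2,340.00
console.log(sumCents(invoiceRows) === totalDueCents); // true
```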

For image-based (scanned) PDFs, Finn automatically falls back to GPT-4.1 Vision OCR — same output, no extra steps for the user.

🔍 Demo 4 — Query Spending

Finn always queries the database — it never answers from memory or inference.

Total this month:

You:   how much did I spend this month?
Finn:  📊 All spending — last 60 days
       Total: $6,234.18

       Shopping:     $1,402.10
       Food:           $534.90
       Transport:      $487.20
       Entertainment:  $211.70
       Travel:         $420.00
       Other:        $3,178.28

By category:

You:   breakdown of Transport
Finn:  🚗 Transport — last 60 days
       $487.20 across 8 transactions

       Mar 15 · Uber              $34.90  (Mastercard)
       Mar 20 · Posto Ipiranga   $180.00  (Mastercard)
       Apr 01 · Uber              $28.50  (Mastercard)
       Apr 05 · Shell Gas        $155.00  (Nu)
       Apr 10 · Parking           $22.00  (Bradesco)
       ...

Cross-source (credit card + bank account):

You:   did my card payment hit the bank account?
Finn:  Yes — Mar 31 bank debit of $4,782.24 matches your Mastercard invoice exactly. ✅

A specific period:

You:   how much did I spend on Food in March?
Finn:  🍽️ Food — March 2026
       $312.40 across 9 transactions

       Highest: $89.90 at a restaurant on Mar 22
       Daily average: $10.08
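The "last 60 days" label in the replies above reflects a rolling window rather than a calendar month, while an explicit month such as "in March" maps to the calendar month itself. A minimal sketch of both ranges and of the daily average follows. The function names are hypothetical, and dates are handled in UTC for simplicity.

```typescript
type DateRange = { from: string; to: string }; // ISO dates, both inclusive

const DAY_MS = 24 * 60 * 60 * 1000;
const toIsoDate = (d: Date): string => d.toISOString().slice(0, 10);

// "This month": a rolling window that ends today (60 days back by default).
function rollingWindow(today: Date, days = 60): DateRange {
  return {
    from: toIsoDate(new Date(today.getTime() - days * DAY_MS)),
    to: toIsoDate(today),
  };
}

// An explicit month: the calendar month. `month` is 1-12.
function calendarMonth(year: number, month: number): DateRange {
  const first = new Date(Date.UTC(year, month - 1, 1));
  const last = new Date(Date.UTC(year, month, 0)); // day 0 of next month = last day of this one
  return { from: toIsoDate(first), to: toIsoDate(last) };
}

// Average per calendar day in the range, rounded to cents.
function dailyAverage(total: number, range: DateRange): number {
  const days = (Date.parse(range.to) - Date.parse(range.from)) / DAY_MS + 1;
  return Math.round((total / days) * 100) / 100;
}
```

With these definitions, `dailyAverage(312.40, calendarMonth(2026, 3))` divides by the 31 days of March and yields 10.08, which matches the figure in the reply above.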

How I Used OpenClaw

The entire agent is an OpenClaw plugin registered as a before_dispatch hook. Every WhatsApp message — text, voice, or PDF — passes through Finn before OpenClaw does anything else.

WhatsApp → OpenClaw gateway → before_dispatch hook → Finn plugin → OpenAI gpt-4.1 → Supabase

Plugin Registration

// openclaw.plugin.json
{
  "name": "finance-agent",
  "version": "1.0.0",
  "hooks": ["before_dispatch"]
}

// plugin/src/index.ts
api.on('before_dispatch', async (event, ctx) => {
  const phone = ctx.senderId ?? event.senderId

  // Phone whitelist — only the owner gets responses
  const allowedPhones = process.env.ALLOWED_PHONES?.split(',').map(p => p.trim()) ?? []
  if (allowedPhones.length > 0 && !allowedPhones.includes(phone)) {
    return { handled: true, text: '' }  // silent ignore for unknown numbers
  }

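  // `message` and `mediaType` are assumed to be read from the event earlier in the handler (omitted from this excerpt)
  const result = await runAgent({ phone, message, mediaType })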
  return { handled: true, text: result.reply }
})

The Agent Loop

The core is a tool-use loop over gpt-4.1 with six tools and a maximum of 5 iterations:

Tool	Purpose
`save_transaction`	Persist a confirmed expense or income entry
`query_spending`	Query totals, breakdowns, history from Supabase
`save_bulk_transactions`	Bulk-save confirmed invoice items from a PDF
`save_bank_statement`	Bulk-save confirmed bank statement rows
`update_transaction`	Edit a saved record after confirmation
`delete_transaction`	Delete a record after confirmation

The system prompt defines a strict state machine: the LLM is instructed never to call save_transaction without an explicit user confirmation. The confirmation always uses a canonical format:

$45 · Food · Mastercard · Me · 2026-04-22 — confirm? ✅

Preventing LLM Hallucination on Queries

Even with a well-crafted prompt saying "always call query_spending", the model would sometimes answer "you spent $X on Transport" by inferring from a recently-parsed PDF in context — instead of querying the database. The fix: force tool_choice on the first iteration for any spending question:

const SPENDING_Q_RE = /how much|breakdown|what did I spend/i
let toolChoice = SPENDING_Q_RE.test(userText)
  ? { type: 'function', function: { name: 'query_spending' } }
  : 'auto'
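To show where the forced choice sits in the loop, here is a minimal sketch under stated assumptions. `CallModel` and `RunTool` are hypothetical stand-ins for the OpenAI call and the tool dispatcher, and the message history is simplified to strings. The point is that the forced `tool_choice` applies only to the first iteration and then falls back to `'auto'`, so the model can still produce a final answer.

```typescript
type ToolChoice = 'auto' | { type: 'function'; function: { name: string } };
type ModelTurn =
  | { kind: 'tool_call'; name: string; args: unknown }
  | { kind: 'final'; text: string };

// Hypothetical abstractions, injected so the loop can run without a network.
type CallModel = (history: string[], toolChoice: ToolChoice) => Promise<ModelTurn>;
type RunTool = (name: string, args: unknown) => Promise<string>;

const SPENDING_Q_RE = /how much|breakdown|what did I spend/i;
const MAX_ITERATIONS = 5;

function initialToolChoice(userText: string): ToolChoice {
  return SPENDING_Q_RE.test(userText)
    ? { type: 'function', function: { name: 'query_spending' } }
    : 'auto';
}

async function runLoop(userText: string, callModel: CallModel, runTool: RunTool): Promise<string> {
  const history = [`user: ${userText}`];
  let toolChoice = initialToolChoice(userText);

  for (let i = 0; i < MAX_ITERATIONS; i++) {
    const turn = await callModel(history, toolChoice);
    // Force the tool on the first call only; later calls must be free to answer.
    toolChoice = 'auto';
    if (turn.kind === 'final') return turn.text;
    const output = await runTool(turn.name, turn.args);
    history.push(`tool ${turn.name}: ${output}`);
  }
  // Fallback wording is illustrative.
  return 'Sorry, I could not finish that request.';
}
```

Passing the model and tool runner in as functions keeps the loop testable with fakes, which makes it easy to check that the first call is forced and the second is not.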

PDF Invoice Pipeline

When a PDF arrives, the plugin routes it before the LLM ever sees it:

if (pdfText.trim().length < 100) {
  // Scanned/image-based PDF → GPT-4.1 Vision OCR
  invoice = await parseInvoiceOcr(pdfToImages(pdfBuffer))
} else if (/Extrato de:.*Agência/i.test(pdfText)) {
  // Bank statement
  stmt = parseStatementBradesco(pdfText)
} else {
  // Text-based credit card invoice
  invoice = parseInvoice(pdfText)
}
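The routing decision above can be isolated as a pure function, which makes it easy to unit-test without real PDFs. This is a sketch: the `PdfKind` labels and the function name are hypothetical, while the length threshold and the regex are taken from the snippet above.

```typescript
type PdfKind = 'scanned' | 'bank_statement' | 'card_invoice';

const MIN_TEXT_LENGTH = 100; // below this, assume the PDF is image-based
const BANK_STATEMENT_RE = /Extrato de:.*Agência/i;

// Classify a PDF from its extracted text alone, before any LLM call.
function classifyPdf(pdfText: string): PdfKind {
  if (pdfText.trim().length < MIN_TEXT_LENGTH) return 'scanned';
  if (BANK_STATEMENT_RE.test(pdfText)) return 'bank_statement';
  return 'card_invoice';
}
```

One detail worth keeping in mind: without the `s` flag, `.` does not match newlines, so the pattern only fires when "Extrato de:" and "Agência" land on the same extracted line.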

What I Learned

1. PDF parsing is harder than it looks. The text extraction from pdf-parse is reliable for prose but unreliable for table columns: numbers get concatenated with adjacent reference codes. The saldo-diff (balance-difference) approach was a counterintuitive fix: instead of parsing the value I want, compute it from context.

2. "This month" is not a calendar concept for credit cards. A purchase on March 8 appears on an April invoice — so a filter of date >= April 1 would miss it. Finn uses a 60-day rolling window for "this month" queries to cover the billing cycle lag.

3. Forcing tool_choice prevents silent hallucination. The model reliably answers from database queries when forced, and sometimes "just knows" from context when not forced. Both answers look correct, but only the first is backed by what is actually stored in the database.

4. WhatsApp as an interface has a real adoption advantage. In my experience, the friction of opening a dedicated finance app is the main reason people stop using one. A chat interface that's already open all day has effectively zero switching cost.

5. Security in layers. OpenClaw's allowFrom whitelist blocks at the gateway level. ALLOWED_PHONES adds an application-level check. Supabase rows are scoped by phone with RLS. Each layer is independent — if one fails, the others still hold.
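One way to read the saldo-diff idea in point 1 is that each transaction amount is derived from the change in the statement's running balance ("saldo") rather than parsed from the amount column. The sketch below works under that assumption. The row shapes and names are hypothetical, and it presumes the balance column extracts cleanly even when the amount column does not.

```typescript
// Hypothetical shape of what a statement parser extracts per row.
type StatementLine = { date: string; description: string; balanceCents: number };
type Transaction = { date: string; description: string; amountCents: number };

// Derive each amount as the difference between consecutive running balances.
function amountsFromBalances(openingBalanceCents: number, lines: StatementLine[]): Transaction[] {
  let previous = openingBalanceCents;
  return lines.map(({ date, description, balanceCents }) => {
    const amountCents = balanceCents - previous; // negative = debit, positive = credit
    previous = balanceCents;
    return { date, description, amountCents };
  });
}

// Illustrative data only.
const derived = amountsFromBalances(100000, [
  { date: '2026-03-15', description: 'Uber', balanceCents: 96510 },
  { date: '2026-03-20', description: 'Posto Ipiranga', balanceCents: 78510 },
  { date: '2026-03-25', description: 'Salary', balanceCents: 128510 },
]);
```

The trade-off of this approach is that one misread balance corrupts two adjacent amounts, so a check that the derived amounts sum to the closing balance minus the opening balance is a sensible companion.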

Tech Stack

Layer	Technology
Channel	WhatsApp via OpenClaw
Runtime	TypeScript, Node.js 20
Framework	OpenClaw (`before_dispatch` hook)
LLM	OpenAI gpt-4.1 (tool-use loop)
Database	Supabase (PostgreSQL + Row Level Security)
PDF parsing	`pdf-parse` + custom text parsers
Vision OCR	GPT-4.1 Vision (scanned PDFs)
Audio	OpenAI Whisper
Deployment	pm2 on a VPS

GitHub: github.com/vicente-r-junior/finn

Notion Life Review OS — Log your day to Notion from WhatsApp using AI

Vicente Junior — Sun, 29 Mar 2026 12:26:25 +0000

This is a submission for the Notion MCP Challenge

What I Built

Notion Life Review OS is a WhatsApp assistant that captures your day and organizes everything in your own Notion workspace — from a single message.

You send something like:

"Worked on the API integration today. Need to present to the client next Thursday. Also figured out why our Redis connection was dropping."

It extracts a task, a project, a learning, and your mood. Asks you to confirm. Saves everything to the right Notion database. No forms. No clicking. No friction.

The core idea is simple: your day lives in WhatsApp already. You're already typing there. So why open another tool?

It also works the other way. Ask it anything:

"What tasks are due this week?"
"What did I learn this week?"

And you can manage your Notion schema directly from WhatsApp — even via voice:

"Add a column called Who, select type, to the Tasks table"

The new field is available on the very next message.

One thing I really liked about how this came together: the project and task structure is completely generic. You can use it for work — a project called "API Backend" with tasks like "Deploy to production". But it works just as well for a grocery list — project "Supermarket", tasks "milk, eggs, bread". Or a personal to-do list. The system doesn't care. It just captures what you tell it and puts it in the right place.

Video Demo

Show me the code

github.com/vicente-r-junior/notion-life-review-os

Full setup instructions in the README.

How I Used Notion MCP

Notion MCP is the backbone of the entire system. Every single interaction with Notion goes through it — no direct API calls anywhere.

Reading schema at startup

When the app boots, it calls API-retrieve-a-database and API-retrieve-a-data-source for each of the 5 databases. The schemas get cached in Redis and injected directly into the GPT-4o system prompt — so the agent knows what fields exist, what types they are, and which ones are required, without any extra calls per message.

Writing data

When the user confirms, the app calls API-post-page for each item — daily log, tasks, projects, learnings. This part is pure deterministic Python, not an LLM. The write step is too important to leave non-deterministic.

Querying data

For questions like "what tasks are due this week?", the agent uses API-query-data-source with structured filters built from natural language. It resolves dates, applies status filters, and formats the answer for WhatsApp.
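As an illustration of turning "this week" into a structured filter, here is a minimal sketch. The project itself is Python; TypeScript is used here only to match the other code in this feed. The `Due` property name and the Monday-to-Sunday week are assumptions, and dates are computed in UTC, whereas a real deployment would apply the configured timezone. The filter shape follows Notion's compound date filter format.

```typescript
type DateFilter = { property: string; date: { on_or_after?: string; on_or_before?: string } };
type CompoundFilter = { and: DateFilter[] };

const MS_PER_DAY = 24 * 60 * 60 * 1000;
const isoDate = (d: Date): string => d.toISOString().slice(0, 10);

// Monday-to-Sunday week containing `today` (UTC).
function weekRange(today: Date): { start: string; end: string } {
  const daysSinceMonday = (today.getUTCDay() + 6) % 7; // getUTCDay(): 0 = Sunday
  const monday = new Date(
    Date.UTC(today.getUTCFullYear(), today.getUTCMonth(), today.getUTCDate() - daysSinceMonday),
  );
  const sunday = new Date(monday.getTime() + 6 * MS_PER_DAY);
  return { start: isoDate(monday), end: isoDate(sunday) };
}

// Build a filter for pages whose date property falls within this week.
function dueThisWeekFilter(today: Date, property = 'Due'): CompoundFilter {
  const { start, end } = weekRange(today);
  return {
    and: [
      { property, date: { on_or_after: start } },
      { property, date: { on_or_before: end } },
    ],
  };
}
```

Keeping the date arithmetic in plain code and leaving only the intent ("due this week") to the model fits the same split the post describes for writes: flexible language in, deterministic structure out.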

Updating schema dynamically

When the user asks to add a column — even via voice — the app calls API-update-a-data-source. The Redis cache refreshes immediately and the system prompt is rebuilt. The new field is available on the next message.

Bulk updates

For things like "set all tasks Who to Vicente", the app queries first, shows a confirmation with the affected records, then calls API-patch-page for each one.

Architecture

WhatsApp → Evolution API → FastAPI webhook
                               ↓
                     Intent classifier (GPT-4o-mini)
                               ↓
           ┌───────────────────┼──────────────────┐
           ↓                   ↓                  ↓
 Conversational agent     Query agent      Add column flow
      (GPT-4o)             (GPT-4o)        (GPT-4o-mini)
           ↓                   ↓                  ↓
      SAVE_PAYLOAD        Notion MCP         Notion MCP
           ↓               (query)          (update schema)
    User confirms
           ↓
     Notion Writer
     (pure Python)
           ↓
       Notion MCP
      (write pages)

One conversational agent instead of a pipeline.
I started with separate extractor, matcher, and confirmation agents. It was complex and fragile. A single GPT-4o call with Redis conversation history turned out to be simpler, faster, and much easier to debug. The agent holds the full context of the conversation and knows when it has enough information to produce a SAVE_PAYLOAD.

The write step is never an LLM.
The notion_writer is pure Python calling Notion MCP directly. Every property format is handled explicitly. Giving an LLM direct write access to your Notion is asking for trouble.

Schema-aware prompts.
The agent knows your exact Notion schema at all times. Custom fields like Who, Priority, or Energy are injected into the system prompt dynamically. If a field is marked required, the agent asks for it before saving — no partial records.
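A sketch of the two pieces described here: rendering a cached schema into a prompt section, and checking a payload for missing required fields before saving. The project is written in Python; this TypeScript version only illustrates the idea. The `FieldSchema` shape, the function names, and the way a field is flagged as required are all assumptions.

```typescript
// Hypothetical cached schema entry; how "required" is recorded is project-specific.
type FieldSchema = { name: string; type: string; required: boolean };

// Render one database's schema as text for the system prompt.
function schemaPromptSection(dbName: string, schema: FieldSchema[]): string {
  const lines = schema.map(f => `- ${f.name} (${f.type})${f.required ? ' [required]' : ''}`);
  return [`Database: ${dbName}`, ...lines].join('\n');
}

// Names of required fields that are absent or empty in the payload.
function missingRequired(schema: FieldSchema[], payload: Record<string, unknown>): string[] {
  return schema
    .filter(f => f.required)
    .filter(f => {
      const value = payload[f.name];
      return value === undefined || value === null || value === '';
    })
    .map(f => f.name);
}

const tasksSchema: FieldSchema[] = [
  { name: 'Name', type: 'title', required: true },
  { name: 'Who', type: 'select', required: true },
  { name: 'Priority', type: 'select', required: false },
];
```

For example, `missingRequired(tasksSchema, { Name: 'Deploy to production' })` returns `['Who']`, which is the cue for the agent to ask a follow-up question instead of saving a partial record.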

Stack

Layer	Technology
Backend	Python 3.12 + FastAPI
AI	OpenAI GPT-4o + Whisper
Notion interface	Notion MCP (`mcp/notion` Docker image)
WhatsApp bridge	Evolution API
Session + cache	Redis 7
Infrastructure	Docker Compose on Hostinger VPS

Setup

1. Clone and configure

git clone https://github.com/vicente-r-junior/notion-life-review-os.git
cd notion-life-review-os
cp .env.example .env

2. Create 5 Notion databases

Inside a parent page called Life Review OS, create:

Daily Logs
Tasks
Projects
Learnings
Weekly Reports

Copy each database ID into .env. Connect your Notion integration to the parent page — it propagates to all children automatically.

3. Configure .env

OPENAI_API_KEY=sk-...
NOTION_API_KEY=secret_...
NOTION_DB_DAILY_LOGS=...
NOTION_DB_TASKS=...
NOTION_DB_PROJECTS=...
NOTION_DB_LEARNINGS=...
NOTION_DB_WEEKLY_REPORTS=...
MCP_AUTH_TOKEN=any-random-string
EVOLUTION_API_URL=http://your-evolution-api:8080
EVOLUTION_API_KEY=...
EVOLUTION_INSTANCE=your-instance-name
WHATSAPP_NUMBER=5511999999999
REDIS_URL=redis://app-redis:6379
TIMEZONE=America/Sao_Paulo

4. Start

docker compose up -d

Point your Evolution API webhook to http://your-server:8000/webhook and you're live.

What I Learned

MCP response parsing trips you up the first time. Every Notion MCP response is SSE-wrapped JSON inside a content array. Once you have the unwrapping pattern it's trivial — but it's not obvious when you first hit it.

One agent beats a pipeline. I built the multi-agent version first. Extractor, matcher, confirmation, writer — each doing one thing. It looked clean on paper and was a nightmare in practice. Replacing it with a single conversational GPT-4o call and Redis history was the best decision I made on this project.

The write step should never be an LLM. Flexible conversation on the way in, deterministic code on the way out. That's the pattern that worked.

Prompt design is the real work. Getting the agent to always include SAVE_PAYLOAD when there's actionable content, never say "done" without confirming, correctly handle corrections mid-conversation — that's where most of the iteration went. The code was the easy part.

Redis for everything. Session state, schema cache, idempotency keys, conversation history — all in Redis with TTLs. No separate database needed. Cleanup is automatic.
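For the SSE-wrapped responses mentioned in the first takeaway, the unwrapping pattern might look like the sketch below. The wire shape is an assumption: an SSE body whose `data:` line carries a JSON-RPC envelope, with the Notion payload serialized as a JSON string in `result.content[0].text`. The project's actual helper is Python and may differ, and the function name here is hypothetical.

```typescript
// Unwrap an SSE-framed MCP tool response into the inner JSON payload.
// Assumes a single `data:` line and a text item at result.content[0].
function unwrapMcpResponse(raw: string): unknown {
  const dataLine = raw.split(/\r?\n/).find(line => line.startsWith('data:'));
  if (!dataLine) throw new Error('No SSE data line found');

  // First layer: the JSON-RPC envelope carried by the SSE event.
  const envelope = JSON.parse(dataLine.slice('data:'.length).trim());
  const first = envelope?.result?.content?.[0];
  if (!first || typeof first.text !== 'string') {
    throw new Error('No text content in MCP result');
  }

  // Second layer: the payload itself, serialized as a JSON string.
  return JSON.parse(first.text);
}
```

The two `JSON.parse` calls are the part that is easy to miss: the outer one decodes the envelope, the inner one decodes the payload that the envelope carries as text.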
