DEV Community

Aniket Hingane
Building an Intelligent Expense Categorization Rule Engine with Python

Title Animation

How I Automated Financial Rule Generation Using Reflexive AI Agents

TL;DR

In my experience building various data pipelines, I observed that writing static rule engines for text processing—like categorizing bank transactions—is tedious and brittle. In this experimental article, I share a PoC I built: an autonomous, self-improving "Reflexive Agent" in Python. The agent acts as both developer and tester, writing an initial categorization script, evaluating its accuracy against a mock bank dataset in a sandboxed namespace, and iteratively rewriting its own code based on the failure metrics until it hits 95% accuracy. I walk through my entire process, architecture, and code, showing how you can leverage Large Language Models (LLMs) not just to generate code once, but to iterate and self-heal automatically. The complete code for this experiment is available on my GitHub repository.

Introduction

Whenever I review my personal bank statements at the end of the month, I am continually frustrated by how messy the transaction descriptions are. From cryptic merchant IDs like TARGET T-1234 to vague processor wrappers like DOORDASH *BURGER, standardizing these into clean categories (Groceries, Dining, Transportation) is arguably one of the most annoying chores in personal finance.

For years, my intuition was to write a massive regex mapping layer, essentially a giant switch/case statement of hardcoded rules. But in my experience, this approach scales horribly: as soon as a new merchant appears, or a payment processor changes its naming convention, the rules break.

Recently, I thought about the concept of Self-Improving Agents—specifically, the "Reflexion" pattern in AI. Instead of using an LLM to categorize the transactions directly (which is slow, expensive, and risks exposing private data to an API), what if I built an agent that writes the categorization logic for me? What if it tests its own logic against a validation set, looks at where it failed, and refines the Python script until it works perfectly?

I decided to run an experiment and build exactly that: ExpenseAnalyzer-AI. In this article, I will walk you through my exact thought process and coding journey as I built a self-healing, code-writing agent that solves a very real, very annoying personal finance problem.

Title Diagram

What's This Article About?

This experimental article is fundamentally about bridging the gap between theoretical multi-agent reasoning and practical automation. I am focusing entirely on an architecture called a "Self-Improving Agent."

In doing so, I will demonstrate how you can encapsulate an LLM inside a continuous feedback loop. You aren't just giving the LLM a prompt and hoping for the best. Instead, you're giving it a "compiler error" or a "unit test failure," forcing it to reason about why the code it just wrote failed to categorize UBER EATS properly, and asking it to submit a patch.

Throughout this extensive write-up, I'll detail exactly how the execution environment works, how I evaluate the LLM-generated code inside a scoped Python namespace (a lightweight sandbox), and how the LLM eventually converges on a highly accurate rule engine.
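Before any implementation detail, the whole pattern compresses into a schematic loop. This is a sketch of the shape of the system, not the repo's literal code: `generate`, `evaluate`, and `reflect` are placeholders for the Agent's generation step, the Environment's evaluation step, and the Agent's reflection step built later in the article.

```python
# Schematic of the reflexive loop: generate once, then alternate
# evaluate -> reflect until accuracy clears the target or we run out
# of iterations.
def reflexive_loop(generate, evaluate, reflect, target=0.95, max_iters=5):
    code = generate()
    result = evaluate(code)
    for _ in range(max_iters):
        if result["accuracy"] >= target:
            break
        code = reflect(code, result)
        result = evaluate(code)
    return code, result
```

Everything that follows is just filling in those three callables.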

Tech Stack

To keep things lightweight, purely logical, and highly replicable, I built this PoC using:

  1. Python 3.12+: The core execution environment for both the agent runner and the generated logic.
  2. OpenAI API (GPT-4o / GPT-4o-mini): The "brain" of the operation that generates and critiques the code.
  3. Python exec() Sandbox: A localized namespace execution environment used to blindly evaluate the LLM-generated code without polluting the main application scope.
  4. Mermaid.js: Used visually throughout this article to map out the agent behaviors and architecture flows.

Why Read It?

If you are a developer, data engineer, or AI enthusiast, you've likely played with code-generation tools like Copilot or ChatGPT. But I think there's a massive difference between assisted code generation and autonomous code generation.

You should read this piece if you want to understand how to move past single-turn prompt engineering into continuous, test-driven generative loops. My approach shifts the paradigm: instead of deploying the LLM to production to do the standard task, you deploy the LLM in your build pipeline to generate the deterministic business logic that will eventually run in production. This drastically reduces runtime latency, cuts API costs to zero in production, and guarantees deterministic execution for sensitive financial data.

The Theory Behind Self-Improving Agents

Before diving into the code, I think it's vital to discuss the theory. The "Reflexion" paper (Shinn et al., 2023) introduced a pattern where language models are given an environment to interact with. If they perform an action and fail, they are asked to generate a verbal reflection on why they failed, and they carry that reflection in short-term memory into the next iteration.

I applied this exactly to my PoC.

System Architecture

When my LLM generates a function, it doesn't know if CHEVRON 0009 should be 'Transportation' or 'Groceries'. It guesses randomly based on its pre-trained weights. When I execute the generated code, my validation framework inherently knows the ground truth. By returning the error ("Failed on CHEVRON 0009. Expected: Transportation, Got: Groceries"), I am providing the model with exact constraints. The LLM then reasons: "Ah, Chevron is a gas station. I should add a rule checking for 'CHEVRON' and map it to Transportation."

This mirrors the test-driven development (TDD) cycle that human developers use every single day.

Deep Dive: Architecture

The anatomy of this project relies on two distinct modules: the Environment and the Agent.

  1. The Environment: Think of this as the uncompromising judge. It holds the validation dataset (the bank transactions and their correct labels). It knows how to safely parse stringified Python code, compile it into an executable function, map over the dataset, and calculate the accuracy.
  2. The Agent: This is the LLM wrapper. It possesses the exact system prompts necessary to solicit code from the OpenAI API. It orchestrates the HTTP requests and maintains the loop state.

When these two interact, it creates a powerful dynamic. I observed that by decoupling the evaluation from the generation, I could arbitrarily swap out the dataset or the LLM provider without breaking the system's core loop.
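That decoupling can be made explicit with structural interfaces. This is my own sketch using `typing.Protocol` (the actual repo just uses plain classes): any object exposing these method signatures can be dropped into the loop, which is exactly what makes swapping the dataset or the LLM provider painless.

```python
# Structural interfaces for the two modules. The loop only depends on
# these signatures, never on a concrete Environment or Agent class.
from typing import Protocol, runtime_checkable

@runtime_checkable
class Evaluator(Protocol):
    def evaluate(self, code: str) -> dict: ...

@runtime_checkable
class CodeAgent(Protocol):
    def generate_initial_code(self) -> str: ...
    def reflect_and_improve(self, previous_code: str, evaluation: dict) -> str: ...
```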

Let's Design

To visualize how these pieces orchestrate the self-improving loop, I put together a sequence diagram. Notice how the developer (me) effectively steps back after initializing the loop.

Agent Communication Flow

The loop is self-contained. The boundary condition to break the loop is a predefined accuracy threshold. In my case, I set it to 95%. The flowchart below outlines the exact decision tree evaluated at every tick of the loop.

Workflow/Process Flow

Let’s Get Cooking

Now, let's look at the actual code I wrote to make this happen. I will split the code into logical blocks and explain my reasoning beneath each.

1. Mocking the Dataset

First, I needed ground truth data. I created a hardcoded dictionary simulating exported CSV data from a typical banking portal.

import json
import traceback
import sys
from typing import List, Dict
import os
from openai import OpenAI

# Mock dataset of bank transactions and their correct categories
DATASET = [
    {"desc": "UBER   *TRIP", "true_category": "Transportation"},
    {"desc": "TARGET T-1234", "true_category": "Shopping"},
    {"desc": "NETFLIX.COM", "true_category": "Entertainment"},
    {"desc": "STARBUCKS STORE", "true_category": "Dining"},
    {"desc": "AMZN Mktp US", "true_category": "Shopping"},
    {"desc": "PAYROLL INC * SALARY", "true_category": "Income"},
    {"desc": "SHELL OIL 123", "true_category": "Transportation"},
    {"desc": "SPOTIFY PREMIUM", "true_category": "Entertainment"},
    {"desc": "THE HOME DEPOT", "true_category": "Shopping"},
    {"desc": "UBER EATS", "true_category": "Dining"},
    {"desc": "COMCAST CABLE", "true_category": "Utilities"},
    {"desc": "PG&E ENERGY", "true_category": "Utilities"},
    {"desc": "DOORDASH *BURGER", "true_category": "Dining"},
    {"desc": "WHOLEFDS SFO", "true_category": "Groceries"},
    {"desc": "TRADER JOE'S", "true_category": "Groceries"},
    {"desc": "LYFT *RIDE", "true_category": "Transportation"},
    {"desc": "CHEVRON 0009", "true_category": "Transportation"},
    {"desc": "HULU DIGITAL", "true_category": "Entertainment"},
    {"desc": "MCDONALD'S", "true_category": "Dining"},
    {"desc": "APPLE.COM/BILL", "true_category": "Entertainment"}
]

In my opinion, building this dataset is the most critical manual step. The quality of your validation set dictates the quality of the final generated code. If the validation set includes contrasting edge cases (like how UBER *TRIP is Transportation but UBER EATS is Dining), it forces the LLM to write highly specific, resilient regex rules to differentiate them.
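One quick sanity check worth running on a dataset like this (a hypothetical helper, not part of the repo): count the examples per category, because a category with zero or one example gives the agent almost no failure signal to learn its rule from.

```python
# Count examples per category with collections.Counter.
from collections import Counter

# A small slice of the article's dataset, inlined so this snippet stands alone.
DATASET = [
    {"desc": "UBER   *TRIP", "true_category": "Transportation"},
    {"desc": "LYFT *RIDE", "true_category": "Transportation"},
    {"desc": "UBER EATS", "true_category": "Dining"},
    {"desc": "NETFLIX.COM", "true_category": "Entertainment"},
]

counts = Counter(item["true_category"] for item in DATASET)
```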

2. Building the Execution Environment

Next, I wrote the Environment class. This is where the magic (and primary danger) happens.

class Environment:
    """Executes the generated code and evaluates it against the dataset."""
    def evaluate(self, code: str) -> dict:
        # Define a safe namespace for execution
        namespace = {}
        try:
            exec(code, namespace)
        except Exception:
            return {"accuracy": 0.0, "error": f"Compilation Error: {traceback.format_exc()}", "failures": []}

        if 'categorize_transaction' not in namespace:
             return {"accuracy": 0.0, "error": "Function 'categorize_transaction' was not defined.", "failures": []}

        categorize_fn = namespace['categorize_transaction']

        correct = 0
        failures = []
        for item in DATASET:
            desc = item["desc"]
            expected = item["true_category"]
            try:
                predicted = categorize_fn(desc)
                if predicted == expected:
                    correct += 1
                else:
                    failures.append(f"Transaction: '{desc}', Expected: '{expected}', Got: '{predicted}'")
            except Exception as e:
                failures.append(f"Runtime error on '{desc}': {e}")

        accuracy = correct / len(DATASET)
        return {
            "accuracy": accuracy,
            "error": None,
            "failures": failures
        }

I used Python's built-in exec() function, passing in a scoped namespace dictionary. To be clear, this is scope isolation rather than a true security sandbox: it ensures the LLM's dynamically generated function doesn't overwrite critical variables in my orchestration loop, but it will not stop genuinely malicious code. Even in an experimental PoC like this, that isolation is worth having, and I cover the harder security question in the edge cases section below.
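One cheap hardening step beyond a bare namespace dict is to strip the builtins the generated code can see. This is my own sketch, not in the repo, and it raises the bar rather than providing a real security boundary; container-level isolation is still the right answer for untrusted code.

```python
# exec() with an explicit, minimal __builtins__ mapping. Without __import__,
# generated code like "import os" fails immediately at runtime.
SAFE_BUILTINS = {"len": len, "str": str, "any": any, "all": all}

def run_in_sandbox(code: str) -> dict:
    namespace = {"__builtins__": SAFE_BUILTINS}
    exec(code, namespace)
    return namespace

# Defining and calling a plain function still works fine:
ns = run_in_sandbox("def categorize_transaction(d):\n    return 'Shopping'")
```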

After retrieving the executed categorize_transaction function reference, I simply map it over the dataset. If the code crashes or returns the wrong string, I append the context to a failures array. This array acts as the direct feedback mechanism for the AI.

3. The Self-Improving Agent

With the sandbox built, I created the LLM interface.

class SelfImprovingAgent:
    def __init__(self):
        self.client = OpenAI() if os.getenv("OPENAI_API_KEY") else None

    def generate_initial_code(self) -> str:
        prompt = '''Write a Python function `categorize_transaction(description: str) -> str` that takes a bank transaction description and returns its category.
Categories should ideally be: Transportation, Shopping, Entertainment, Dining, Income, Utilities, Groceries.
Use basic string matching or regex. Return ONLY the python code block, no markdown formatting.
'''
        if not self.client:
            # Dummy initial code fallback if API key is missing
            return "def categorize_transaction(description):\n    return 'Shopping'"

        print("[Agent] Writing initial ruleset...")
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "system", "content": "You are a Python programming expert."},
                      {"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content.replace("```python", "").replace("```", "").strip()

The generate_initial_code function is basic zero-shot generation. From my experience with LLMs, it will likely output a naive solution, maybe getting 30% of the dataset right by guessing common strings like "NETFLIX" or "UBER".

The real heavy lifting happens in the reflection step:

    def reflect_and_improve(self, previous_code: str, evaluation: dict) -> str:
        failures_text = "\n".join(evaluation['failures'][:10])
        prompt = f"""You previously wrote this categorization function:

{previous_code}

It achieved an accuracy of {evaluation['accuracy']*100:.1f}%.
Here are some failed examples:
{failures_text}

Update the python function `categorize_transaction(description: str) -> str` to fix these mistakes. 
Return ONLY the python code, no markdown block."""

        if not self.client:
            # Dummy V2 resolution
            return "def categorize_transaction(desc):\n    desc = desc.upper()\n    if 'UBER' in desc and 'EATS' not in desc: return 'Transportation'\n    if 'TARGET' in desc or 'HOME' in desc or 'AMZN' in desc: return 'Shopping'\n    # (...) truncated for brevity \n    return 'Other'"

        print(f"[Agent] Reflecting on failures and writing V-Next...")
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "system", "content": "You are an expert software developer fixing a bug."},
                      {"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content.replace("```python", "").replace("```", "").strip()

When I wrote this, I realized I had to limit the failures_text array (e.g., [:10]). Pumping too many failures into the context window can completely overwhelm the LLM, leading to hallucinations or over-indexing on extremely rare edge cases. By giving it 10 examples at a time, the model incrementally patches the logic, much like how a developer squashes bugs one ticket at a time.

4. Running The Loop

Finally, the orchestration layer brings it all together.

def run_loop():
    print("--------------------------------------------------")
    print("  INITIALIZING SELF-IMPROVING EXPENSE AGENT       ")
    print("--------------------------------------------------")
    agent = SelfImprovingAgent()
    env = Environment()

    code = agent.generate_initial_code()
    iteration = 1
    max_iterations = 5

    while iteration <= max_iterations:
        print(f"\n>>> Iteration {iteration}")
        print("Evaluating generated code...")
        eval_result = env.evaluate(code)
        acc = eval_result["accuracy"]
        print(f"Accuracy: {acc*100:.1f}%")

        if acc >= 0.95:
            print("\nSUCCESS! Agent reached target accuracy.")
            print("\n===========================================")
            print("FINAL GENERATED CODE:")
            print("===========================================")
            print(code)

            # Print ASCII Table for Statistics
            print("\n+---------------------------------+")
            print("| FINAL STATISTICS                |")
            print("+---------------------------------+")
            print(f"| Iterations    : {iteration:<15} |")
            print(f"| Target        : 95.0%           |")
            print(f"| Final Accuracy: {acc*100:.1f}%          |")
            print("+---------------------------------+")
            break

        print(f"Failed on {len(eval_result['failures'])} examples. Going back to reflection...")
        code = agent.reflect_and_improve(code, eval_result)
        iteration += 1

    if iteration > max_iterations:
        print("\nMax iterations reached without hitting target.")

if __name__ == "__main__":
    run_loop()

The loop is a simple while bounded by max_iterations = 5. I set a hard cap because self-improving agents can easily devolve into oscillating accuracy (fixing one bug breaks another rule, back and forth indefinitely), and a hard limit prevents runaway API costs.
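One guard I'd add against that oscillation (a hypothetical extension, not in the current loop): record every iteration's score and, when the cap is hit, fall back to the best-scoring version rather than the last one, so a late regression never discards accuracy an earlier iteration earned.

```python
# Track (accuracy, code) per iteration and recover the best version.
def best_version(history):
    """history: list of (accuracy, code) tuples, one per iteration."""
    return max(history, key=lambda pair: pair[0])

best_acc, best_code = best_version([
    (0.15, "v1"),
    (0.80, "v2"),
    (0.65, "v3"),  # regression: v2 should win, not the latest v3
])
```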

Let's Setup

If you want to run my experimental repository for yourself, here is exactly how I set it up on my local Mac terminal. Step-by-step details can be found at: https://github.com/aniket-work/ExpenseAnalyzer-AI

  1. Clone my GitHub repository.
  2. Ensure you have a modern Python 3 interpreter (I developed against 3.12).
  3. Set up a virtual environment: python3 -m venv venv and source venv/bin/activate.
  4. Install the dependencies via pip (pip install openai).
  5. Export your OpenAI token: export OPENAI_API_KEY="sk-...".

Let's Run

The moment of truth. When I ran python main.py in my terminal, the magic truly clicked. The terminal output was fantastic. To visualize what happens over the span of roughly 10 seconds, I generated this animation:

(As seen in the primary simulation GIF at the top of the article)

In the very first iteration, the LLM usually scores around 15% accuracy. It typically hardcodes a single return statement broadly guessing the category, completely missing nuanced merchants like DOORDASH *BURGER or WHOLEFDS SFO.

By Iteration 2, seeing the explicit failure list, it starts importing re (regex) or writing heavy if/elif blocks. By Iteration 3, it catches overlapping constraints (like differentiating UBER EATS from UBER *TRIP) and crosses the 95% boundary flawlessly.
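To make that concrete, the kind of rule the agent converges on looks roughly like this (an illustrative reconstruction, not the model's literal output). The key insight it has to discover is rule ordering: the specific pattern must be checked before the generic one.

```python
import re

# Order matters: "UBER EATS" must be tested before the generic "UBER" rule,
# otherwise every Uber Eats order gets filed under Transportation.
def categorize_transaction(description: str) -> str:
    desc = description.upper()
    if re.search(r"UBER\s*EATS|DOORDASH", desc):
        return "Dining"
    if re.search(r"UBER|LYFT|CHEVRON|SHELL", desc):
        return "Transportation"
    return "Other"
```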

Advanced Concepts & Edge Cases

While testing this, I observed a few critical edge cases that researchers and developers must account for:

  • Overfitting: The model isn't learning how to write better software generally. It is learning how to write software that perfectly satisfies the benchmark dataset you provide. If your validation set only contains 20 transactions, the model might just write if desc == "UBER *TRIP": return "Transportation", effectively creating a fragile lookup table instead of robust heuristics. I mitigated this in larger offline tests by explicitly prompting the model: "Use generic regex that will not overfit to string literals."
  • Sandbox Security: Never, ever use raw exec() on production infrastructure without extreme containerized isolation (like an AWS Lambda function with severed network access). A hallucinating LLM could accidentally generate import os; os.system("rm -rf /"). While unlikely, it underscores the need for tight memory and permissions boundaries.
  • Cost Scaling: Iterative reflection can quickly burn through tokens if your validation dataset is huge (e.g., tens of thousands of failure strings appended to the prompt). Stratified sampling of errors helps keep the context window small.
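Here is a minimal sketch of that stratified sampling (my own illustrative helper, not from the repo): cap the feedback at a fixed number of failures, but round-robin across expected categories so no single category dominates the prompt.

```python
from collections import defaultdict

def stratified_sample(failures, limit=10):
    """failures: list of (expected_category, message) tuples."""
    by_cat = defaultdict(list)
    for cat, msg in failures:
        by_cat[cat].append(msg)
    sampled, i = [], 0
    while len(sampled) < limit:
        progressed = False
        # Take the i-th failure from each category in turn.
        for msgs in by_cat.values():
            if i < len(msgs):
                sampled.append(msgs[i])
                progressed = True
                if len(sampled) == limit:
                    return sampled
        if not progressed:  # every category exhausted
            break
        i += 1
    return sampled
```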

Ethics and Future Roadmap

From an ethical viewpoint, replacing human judgment entirely in sensitive areas like personal finance classification could lead to dangerous misinterpretations (e.g., categorizing a medical expense as entertainment). The beauty of this pattern is that it outputs deterministic Python code. I can review the resulting Python script myself, ensure it's not doing anything biased or stupid, and commit it to my source code repository.

In my opinion, this represents the safest way to leverage AI: using it to write the code, allowing a human to review the code, and then executing the code manually.

As for the future roadmap of this specific PoC, I am considering expanding the agent loop to handle automated unit testing. Rather than mapping through a dictionary, the LLM could actually rewrite pytest files concurrently with the categorization logic, ensuring both the tests and the code evolve synchronously.

Closing Thoughts

Building ExpenseAnalyzer-AI taught me that Large Language Models are profoundly more capable when you give them time to think, reflect, and fix their own mistakes. I observed that shifting the computation cost from production inference to build-time generation yields faster, more reliable, and fully transparent business applications.

I highly encourage you to try running this loop on your own datasets. There's an undeniable "wow" factor when you see the terminal flicker as the script evaluates its own failure and flawlessly rewrites its codebase seconds later.


Disclaimer

The views and opinions expressed here are solely my own and do not represent the views, positions, or opinions of my employer or any organization I am affiliated with. The content is based on my personal experience and experimentation and may be incomplete or incorrect. Any errors or misinterpretations are unintentional, and I apologize in advance if any statements are misunderstood or misrepresented.
