Ilya Ploskovitov

My AI Stopped "Guessing" and Started "Thinking": Implementing a Planning & Reasoning Architecture

In previous articles, I talked about how I generate tests using LLMs, parse Swagger schemas, and fight against hardcoded data. But "naked" LLM generation has a fundamental problem: it is linear. The model often tries to guess the next step without understanding the big picture.

Yesterday, I deployed the biggest architectural update since I started development: the Planning and Reasoning system.

Now, Debuggo doesn't just "write code." It acts like a Senior QA: first, it analyzes requirements, assesses risks, decomposes the task into subtasks, and only then begins to act.

I want to show you "under the hood" how this works and, most importantly, honestly compare: did it actually get faster?

The Problem: Why Does AI Get Lost?

Previously, if I asked: "Create a group, add a user to it, verify the table, and delete the group," the AI often lost context halfway through the test. It might forget the ID of the created group by the time it needed to delete it, or start clicking on elements that hadn't loaded yet.

I needed the AI to "stop and think" before pushing buttons.

The Solution: Agentic Architecture

I implemented a multi-layer system based on the ReAct (Reasoning + Acting) pattern and state machines.

Here is what the test generation architecture looks like now:

graph TD
    A[Test Case] --> B[Planning Agent]
    B --> C{Analysis}
    C --> D[Complexity & Risks]
    C --> E[Dependencies]
    C --> F[Subtasks]
    D & E & F --> G[Execution Plan]
    G --> H[Reasoning Loop]
    H --> I[Step Generation]

1. Planning Agent: The Brain of the Operation

Before generating a single step, the Planning Agent launches. It performs a static analysis of the future test.

Complexity Score

The agent calculates a complexity score for the test, from 0 to 100:

  • Number of steps × 10 (max 50 points)
  • Diversity of actions (click, type, hover) × 5 (max 30 points)
  • Test type (API adds +20 points)

If the complexity is above 80, the system automatically switches to "High Alert" mode (stricter validation and more frequent DOM checks).
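
To make the formula concrete, here is a minimal sketch of how such a score could be computed. The function and field names are my assumptions for illustration, not Debuggo's actual internals.

```python
# Rough sketch of the complexity scoring described above. The data shape
# (a list of step dicts with an "action" key) is an illustrative assumption.
def complexity_score(steps: list[dict], test_type: str) -> int:
    step_points = min(len(steps) * 10, 50)             # number of steps x 10, capped at 50
    action_kinds = {s["action"] for s in steps}         # e.g. {"click", "type", "hover"}
    diversity_points = min(len(action_kinds) * 5, 30)   # action diversity x 5, capped at 30
    type_points = 20 if test_type == "api" else 0       # API tests add +20
    return step_points + diversity_points + type_points

HIGH_ALERT_THRESHOLD = 80  # above this: stricter validation, more frequent DOM checks

steps = [{"action": "navigate"}, {"action": "type"}, {"action": "click"}]
print(complexity_score(steps, test_type="ui"))  # 3*10 + 3*5 + 0 = 45
```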

Decomposition into Subtasks

Instead of a "wall of text" with 15 steps, the system breaks the test into logical blocks. Example from the system logs:

┌─────────────────────────────────────────────────────────┐
│ Subtask 1: Fill out Group Creation Form                 │
├─────────────────────────────────────────────────────────┤
│ Steps: [1-5] | Actions: navigate → type → click         │
└─────────────────────────────────────────────────────────┘
          ↓
┌─────────────────────────────────────────────────────────┐
│ Subtask 2: Verify Result                                │
├─────────────────────────────────────────────────────────┤
│ Steps: [9-12] | Actions: wait → assert                  │
└─────────────────────────────────────────────────────────┘

This allows the AI to focus on a specific micro-goal without losing the overall context.
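
For reference, the decomposed plan can be thought of as a small data structure along these lines. This is a sketch with assumed names, not Debuggo's real schema.

```python
from dataclasses import dataclass, field

@dataclass
class Subtask:
    title: str               # e.g. "Fill out Group Creation Form"
    steps: tuple[int, int]   # inclusive step range, e.g. (1, 5)
    actions: list[str]       # e.g. ["navigate", "type", "click"]

@dataclass
class ExecutionPlan:
    subtasks: list[Subtask] = field(default_factory=list)
    confidence: float = 0.0  # plan-level confidence, 0..1

plan = ExecutionPlan(
    subtasks=[
        Subtask("Fill out Group Creation Form", (1, 5), ["navigate", "type", "click"]),
        Subtask("Verify Result", (9, 12), ["wait", "assert"]),
    ],
    confidence=0.78,
)
```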

2. Reasoning System: The ReAct Pattern

The most interesting part happens during the generation process. I abandoned direct prompting in favor of a Reasoning Loop (Thought → Action → Observation).

Now, every step goes through a cycle like this:

Turn 2: Planning
├─ Thought: "I need to split the test into 4 subtasks"
├─ Action: "Create execution plan"
├─ Observation: "4 subtasks, 78% confidence"
└─ Confidence: 0.88

The system literally "talks to itself" (saving this conversation to the DB), making decisions based on a Confidence Score. If confidence drops below 0.75, the system pauses to look for an alternative path.
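
Boiled down to code, the loop looks roughly like this. The `think` and `act` callables stand in for the real LLM call and step executor; the 0.75 threshold comes from the description above. It's a sketch of the pattern, not Debuggo's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

CONFIDENCE_THRESHOLD = 0.75   # below this, pause and look for an alternative path

@dataclass
class Turn:
    thought: str
    action: str
    confidence: float
    observation: str = ""

def reasoning_loop(
    think: Callable[[List[Turn]], Turn],   # wraps the LLM: produces Thought + Action + confidence
    act: Callable[[str], str],             # executes the Action, returns an Observation
    max_turns: int = 10,
) -> List[Turn]:
    history: List[Turn] = []
    for _ in range(max_turns):
        turn = think(history)
        if turn.confidence < CONFIDENCE_THRESHOLD:
            # Low confidence: re-think with the rejected turn in context.
            turn = think(history + [turn])
        turn.observation = act(turn.action)    # Thought -> Action -> Observation
        history.append(turn)                   # this "conversation" is what gets saved to the DB
        if turn.observation == "done":
            break
    return history
```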

3. Self-Healing (Error Recovery)

Even with a solid plan, things can go wrong. I implemented a State Machine that handles failures on the fly.

For example, if the AI gets a selector_not_found error, it triggers the MODIFY strategy (sketched below):

  1. The Agent re-analyzes the HTML.
  2. Finds an alternative anchor (e.g., text instead of ID).
  3. Generates a new selector.
  4. Updates the step and retries.
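
A stripped-down sketch of that MODIFY path, with the HTML analysis and selector generation passed in as callables. The names are assumptions for illustration, not Debuggo's API.

```python
from typing import Callable

def modify_strategy(
    step: dict,                                   # the failing step, e.g. {"action": "click", "selector": "#save"}
    get_html: Callable[[], str],                  # fetch the current page HTML
    find_selector: Callable[[str, dict], str],    # pick an alternative anchor (e.g. text instead of ID)
    run_step: Callable[[dict], bool],             # re-run the step, True on success
    max_retries: int = 3,
) -> bool:
    """Recovery path for selector_not_found: re-analyze, re-anchor, retry."""
    for _ in range(max_retries):
        html = get_html()                             # 1. re-analyze the HTML
        step["selector"] = find_selector(html, step)  # 2-3. alternative anchor -> new selector
        if run_step(step):                            # 4. update the step and retry
            return True
    return False
```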

Real Benchmarks: The Cost of "Thinking"

Implementing agents isn't free. "Thinking" takes time. I decided to check if it was worth it by comparing the generation of the exact same tests before and after implementing Reasoning.

The results were unexpected.

Test 1: Simple (EULA Popup)
Goal: Login and accept the agreement.

  • Before Reasoning (Linear): 00:58 (4 steps)
  • After Reasoning (Agent): 03:22 (5 steps)
  • Verdict: 📉 Slower.

The system spent time planning a simple task. However, it automatically added a 5th step: verifying that we are actually on the homepage after accepting the EULA. Previously, this step was skipped.
Takeaway: Slower, but the test became more reliable.

Test 2: Medium (E2E User Creation)
Goal: Create an admin, logout, login as the new admin.

  • Before Reasoning (Linear): 06:38 (20 steps)
  • After Reasoning (Agent): 09:56 (20 steps)
  • Verdict: 😐 Overhead.

The number of steps didn't change. The linear model handled it fine, while the Agent spent an extra 3 minutes "thinking" and checking dependencies. This is the honest price of architecture.

Test 3: Complex (Download Template)
Goal: Find a specific template deep in the menu, download it, verify the list.

This is where the magic happened.

  • Before Reasoning (Linear): 23:38 (39 steps!)
  • After Reasoning (Agent): 08:11 (12 steps!)
  • Verdict: 🚀 3x Faster and Cleaner.

Why the difference? Without planning, the old model got "lost." It clicked in the wrong places, went back, and tried again, generating 39 steps of garbage and errors. The new model built a plan first, understood the direct path, and did everything in 12 steps.

Main Takeaway

Yes, on simple tests we see a dip in generation speed (the overhead of the extra LLM calls). But on complex scenarios, where a standard AI starts "hallucinating" and walking in circles, the Planning Agent saves tens of minutes and produces a clean, optimal test.

The AI doesn't get lost anymore.

Current version metrics:

  • Plan Confidence: > 0.75
  • Error Rate: < 5%
  • Recovery Success: > 80%

I'm continuing to monitor this architecture. If you have complex cases that usually break test generators, I'd be happy if you tried them in Debuggo.
