Pascal CESCATO

AgentKit: How Efficient Laziness Fixes Fragile LLM Workflows

How I stopped debugging JSON parsing errors and started shipping features

TL;DR

I hate debugging the same JSON parsing error twice. AgentKit lets you define LLM agents declaratively (YAML + Pydantic), so validation happens automatically. One day of setup saved me 3+ hours/week of debugging. That's efficient laziness in action.


📚 This is Part 2 of my "Efficient Laziness" series.


The Story: When Your Newsletter Pipeline Becomes a Debugging Hell

I run a weekly tech newsletter. The pipeline should be simple:

  1. Fetch articles from Wallabag
  2. Clean HTML → Markdown
  3. AI analysis (summary, scoring, keywords, categorization)
  4. Insert into PostgreSQL
  5. Generate HTML newsletter
  6. Send automatically

The reality? Step 3 — the "AI-powered" part — was a textbook example of automated inefficiency.
Every Monday, the pipeline failed for one of these entirely predictable reasons:
❌ JSON malformed → "Invalid syntax at line 42" (manual retry required)
❌ Missing "keywords" field → "Prompt tweaked, workflow rerun" (rinse, repeat)
❌ Score = 11 instead of max 10 → "Validation node added" (because of course)
❌ Category misspelled → "Category validator built" (yet another node)
Time spent debugging and patching n8n workflows: 2–3 hours.
Time spent on actual work: Statistically insignificant.

I wasn't building features. I was building safety nets for unpredictable LLM outputs.

My tolerance for waste: Zero.


The Question: What If Validation Was the System's Job, Not Mine?

Before touching any code, I asked myself THE question (sound familiar from my database design article?):

"What am I actually trying to do here?"

Not "call an API and hope for the best."

Answer: Get a validated ContentAnalysisOutput from markdown_content input. Period.

Everything else—retries, JSON parsing, validation—is plumbing. Plumbing I shouldn't have to build myself.

Let’s see what that looks like in code →

Enter AgentKit: Declarative Agents (or, How I Got Lazy the Right Way)

The obvious solution?
AgentKit — currently in open beta, because of course someone finally formalized this — operates on a principle so simple it’s almost insulting:
You declare what you need. The system handles how.
(Revolutionary, I know.)

Why this works:

  • Core concept: No more babysitting JSON. No more "hope the LLM complies this time."
  • Implementation: YAML + Pydantic. Because if your contract isn’t machine-enforced, it’s just a wishlist.

Before (n8n imperative workflow chaos):

[AI Node] → [Parse JSON Node] 
  → IF valid JSON
    → [Validate Schema Node]
      → IF valid schema
        → ✅ Continue
      → ELSE ❌ Log error, retry with different prompt
  → ELSE ❌ Log error, retry with explicit "return valid JSON" instruction

Result: 12 nodes, 3 branches, unmaintainable — or, at best, a maintenance nightmare.
(Because nothing says "scalable" like hardcoding validation logic in a GUI.)

After (AgentKit declarative agent):

agent:
  name: content_analyzer
  output_schema: ContentAnalysisOutput  # Pydantic model
  max_retries: 2
  # System handles validation & retry automatically

(No GUI. No drag-and-drop. Just a contract. Enforced.)

Result: 1 YAML file. Validation happens automatically. Retries handled by the runner.

That's efficient laziness: I defined the contract once, the system enforces it forever.

From Paper to YAML: The Agent Contract

Just like with database modeling, I started with paper. Not code. Questions first:

  • What are my inputs? → markdown_content: string
  • What do I need back? → Structured scoring + summary + keywords + category
  • What constraints? → Each criterion has its own cap (0-3 at most), at least 3 keywords, total score ≤ 10

Then I translated that into a Pydantic model (the "schema"):

from pydantic import BaseModel, conint, Field
from typing import List

class ContentAnalysisOutput(BaseModel):
    smb_applicability: conint(ge=0, le=3)  # ge = greater/equal, le = less/equal
    automation_potential: conint(ge=0, le=2)
    economic_value: conint(ge=0, le=2)
    open_source: conint(ge=0, le=2)
    innovation: conint(ge=0, le=1)
    total_score: conint(ge=0, le=10)
    summary: str
    keywords: List[str] = Field(..., min_length=3)  # at least 3 keywords (use min_items on Pydantic v1)
    category: str

Why Pydantic?

  • Automatic validation (no custom code)
  • Clear error messages ("expected int 0-3, got 11")
  • Type hints your IDE understands
  • Serialization/deserialization built-in

This model is my contract. Any LLM output that doesn't match this contract gets automatically rejected and retried.
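
Here's the contract doing its job, with a deliberately broken payload (made up for illustration; in the real pipeline this is exactly the kind of output that triggers a retry):

from pydantic import ValidationError

bad_output = {
    "smb_applicability": 11,   # out of range: must be 0-3
    "automation_potential": 2,
    "economic_value": 2,
    "open_source": 2,
    "innovation": 1,
    "total_score": 10,
    "summary": "Some summary",
    "keywords": ["AI"],        # too short: at least 3 required
    "category": "AI & Infrastructure",
}

try:
    ContentAnalysisOutput(**bad_output)
except ValidationError as e:
    print(e)  # pinpoints every violation: field name, constraint, received value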

The Agent Definition: YAML as the Single Source of Truth

Here's the complete agent definition:

agent:
  name: content_analyzer
  description: >
    Analyzes web content to generate business-oriented summaries
    with normalized scoring and classification.

  inputs:
    - markdown_content: string

  model:
    provider: variable  # Swap OpenAI/Mistral/Llama easily
    model_name: variable
    temperature: 0.1
    response_format: json

  prompt: |
    You are a technology analyst specializing in open-source solutions.
    Analyze the following content and provide a structured evaluation.

    Scoring criteria (STRICT):
      - SMB applicability: 0-3
      - Automation potential: 0-2
      - Economic value: 0-2
      - Open-source: 0-2
      - Innovation: 0-1
      - TOTAL must be ≤ 10

    Expected JSON format:
    {
      "smb_applicability": 0,
      "automation_potential": 0,
      "economic_value": 0,
      "open_source": 0,
      "innovation": 0,
      "total_score": 0,
      "summary": "...",
      "keywords": ["...", "...", "..."],
      "category": "..."
    }

    Content to analyze:
    {{markdown_content}}

Why YAML?

  • Same reason I use paper before coding: think once, write once
  • Git-friendly (version control, diffs, rollbacks)
  • Human-readable (non-devs can review prompts)
  • Language-agnostic (runs anywhere: Python, Node, Go...)
  • Already standard for infrastructure (Docker, K8s, Terraform)

Building the Runner: 150 Lines to Never Debug JSON Again

Since AgentKit isn't fully released, I built a minimal Python runner. Core principle:

The runner handles ALL the annoying stuff I hate doing manually.

Core execution loop with automatic retry:

from jinja2 import Template
import json

def run_agent(agent_config: dict, variables: dict, max_retries: int = 2):
    """
    Execute agent with automatic validation & retry

    This is the 'efficient laziness' in action:
    - Template rendering: automatic
    - JSON parsing errors: automatic retry
    - Schema validation: automatic via Pydantic
    - Error logging: structured

    I never touch this code. It just works.
    """
    template = Template(agent_config["agent"]["prompt"])
    rendered = template.render(**variables)

    for attempt in range(1, max_retries + 1):
        try:
            # Call LLM: call_llm is a thin wrapper around your provider's client (OpenRouter, Mistral, OpenAI...)
            raw_response = call_llm(rendered)

            # Parse JSON (can fail)
            parsed = json.loads(raw_response)

            # Validate with Pydantic (can fail)
            validated = ContentAnalysisOutput(**parsed)

            # Success! Return validated data
            return {
                "success": True, 
                "output": validated.dict(), 
                "attempt": attempt
            }

        except json.JSONDecodeError:
            # LLM returned invalid JSON → retry with correction prompt
            print(f"⚠️ Invalid JSON (attempt {attempt}) → auto-retrying")
            rendered = (
                f"The previous response was invalid JSON. "
                f"Please fix it and return ONLY valid JSON:\n{raw_response}"
            )

        except Exception as e:
            # Validation failed (wrong type, out of range, etc.)
            if attempt == max_retries:
                return {"success": False, "error": str(e)}
            print(f"❌ Validation error (attempt {attempt}): {e} → retrying")

    # All attempts exhausted (e.g. invalid JSON on the final try)
    return {"success": False, "error": "Max retries exceeded: LLM never returned valid JSON"}

What this does for me:
✅ Template rendering (Jinja2 variables)
✅ JSON parsing with automatic retry
✅ Schema validation with clear error messages
✅ Structured logging
✅ Retry logic with correction prompts

What I never have to do again:
❌ Build custom validation nodes
❌ Debug "why is this field missing?"
❌ Add retry logic for the 50th time
❌ Parse error messages manually
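
Standalone usage is equally boring, which is the point (a quick sketch; it assumes the agent file lives at agents/content_analyzer.yaml, the layout used in the next section):

import yaml

# Load the declarative agent definition and run it against one article
with open("agents/content_analyzer.yaml") as f:
    agent_config = yaml.safe_load(f)

result = run_agent(agent_config, {"markdown_content": "# Milvus Lite\n..."})

if result["success"]:
    print(result["output"]["total_score"], result["output"]["keywords"])
else:
    print("Agent failed after retries:", result["error"])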

Exposing via FastAPI: One Endpoint, Infinite Agents

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import yaml

app = FastAPI()

class AgentRequest(BaseModel):
    agent_name: str
    markdown_content: str

@app.post("/analyze")
def analyze(req: AgentRequest):
    """
    Universal agent executor

    Add a new agent? Drop a YAML file in agents/.
    No code changes. No deployments. Just works.

    That's what I call scaling through laziness.
    """
    # Load agent config
    try:
        with open(f"agents/{req.agent_name}.yaml") as f:
            agent_config = yaml.safe_load(f)
    except FileNotFoundError:
        raise HTTPException(status_code=404, detail=f"Agent '{req.agent_name}' not found")

    # Run agent (validation happens automatically)
    result = run_agent(
        agent_config, 
        {"markdown_content": req.markdown_content}
    )

    if not result["success"]:
        raise HTTPException(status_code=500, detail=result["error"])

    return result
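Calling it from anywhere in the pipeline is one POST (a sketch; it assumes the app above runs locally on uvicorn's default port):

import requests

# e.g. started with: uvicorn main:app --port 8000 (module name is illustrative)
resp = requests.post(
    "http://localhost:8000/analyze",
    json={
        "agent_name": "content_analyzer",
        "markdown_content": "# Milvus Lite\nA lightweight, self-hosted vector database...",
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["output"]["total_score"])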

Example response:

{
  "success": true,
  "output": {
    "smb_applicability": 3,
    "automation_potential": 2,
    "economic_value": 2,
    "open_source": 2,
    "innovation": 1,
    "total_score": 10,
    "summary": "Milvus Lite simplifies local vector database deployment...",
    "keywords": ["Milvus", "vector database", "LLM", "self-hosted"],
    "category": "AI & Infrastructure"
  },
  "attempt": 1
}

What This Saved Me (The ROI of Efficient Laziness)

Time Investment:

  • Initial setup: 1 day (YAML agent + runner + FastAPI)
  • Per new agent: 30 minutes (just write YAML)

Time Saved (Weekly):

  • Before: 3-4 hours debugging JSON errors, validation issues, retry logic
  • After: 0 hours (system handles it)
  • ROI: 12-16 hours/month saved

But More Importantly:

Mental peace.

I don't wake up Monday morning to find my newsletter broken because an LLM decided to return "keywords": "AI, automation" (string) instead of "keywords": ["AI", "automation"] (array).

The system catches it. Retries. Logs the attempt. Works.

That's the real win: I freed my brain from babysitting unpredictable LLM outputs.

A Real Example: Adding Payment Tracking (Wait, Wrong Article)

Actually, let me give you the RIGHT example:

Three weeks after deploying this, my client asked:

"Can we add sentiment analysis to the content scoring?"

My answer: "Give me 1 hour."

Why 1 hour and not 3?

Because the structure was already there:

  1. Copy content_analyzer.yaml → content_analyzer_v2.yaml
  2. Add sentiment: str to Pydantic model
  3. Update prompt to include sentiment scoring
  4. Deploy (just drop the YAML file, no code changes)

Zero refactoring. Zero debugging. Zero broken workflows.
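
For the record, the schema half of that change can be as small as an inherited class (the V2 name and the free-text sentiment field are how I'd sketch it; constrain it to an enum if you prefer):

class ContentAnalysisOutputV2(ContentAnalysisOutput):
    # The only addition the client asked for; everything else is inherited
    sentiment: str  # e.g. "positive", "neutral", "negative"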

If I'd built this the n8n way with 12 validation nodes, it would've been 2 days of:

  • Adding sentiment node
  • Adding validation for sentiment
  • Testing all branches
  • Fixing edge cases
  • Praying nothing else broke

Loved "efficient laziness" in action here? Share your debugging horror stories or time-saving hacks below – let’s shape the next steps together! Check out my intro to CTEs on my newly Astro-migrated blog CTE : la clause WITH que les ORM ignorent (mais que vous devriez connaître), and stay tuned for my 3-part English series on dev.to starting next week, where I’ll apply this mindset to SQL optimization. Follow me to catch it all!**

Why This Approach Should Be the Norm (But Isn't)

Let's be honest: most developers code first, validate later (if at all).

I've always done the opposite. Not out of virtue, but out of pure efficient laziness: I hate fixing the same bug twice.

My Philosophy (Again): "Never Do Twice What Can Be Done Once"

  • Spend 1 day on declarative agents vs 3 hours/week debugging? Obvious choice.
  • Define schema once vs validate manually 100 times? No debate.
  • Write YAML once vs maintain 12 n8n nodes? Crystal clear.

This "laziness" has always saved me time, energy, and sanity.

Why Is This Rarely Done?

Same reasons as database modeling:

  • Pressure to ship fast: "We'll add validation later" (we never do)
  • Belief that prompts will magically work: They won't. LLMs are creative, not consistent.
  • Lack of tooling: AgentKit is new. Most people don't know this approach exists.

But What About Exploratory Projects?

Fair question. "What if I'm just testing an idea and don't know the schema yet?"

Here's my take: even a 15-minute Pydantic model forces you to ask:

  • What fields do I actually need?
  • What are valid ranges?
  • What's required vs optional?

Without this, you code "by feeling," and that's where waste begins—even in MVP phase.

A minimal schema isn't the enemy of exploration. It's what makes exploration efficient instead of chaotic.

The 4-Step Method (Applicable to Any LLM Project)

Same method as database design, different domain:

Step 1: Identify Your Contract

Ask: "What output do I need, every single time, no exceptions?"

Not "what would be nice to have." What's non-negotiable.

Step 2: Define the Schema (Pydantic)

Write your output model. Include:

  • Required fields
  • Type constraints (conint, constr, etc.)
  • Validation rules (min_length, ge, le)

Step 3: Write the Agent in YAML

  • Clear prompt with examples
  • Input variables
  • Reference to your schema
  • Retry config

Step 4: Test, Don't Trust

Run 10-20 real inputs. Watch what fails. Adjust schema OR prompt, never both at once.
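
Concretely, "test, don't trust" can be a ten-line loop rather than a framework (a sketch; samples/ is a hypothetical folder of real articles saved as Markdown):

from pathlib import Path

failures = []
for path in Path("samples").glob("*.md"):  # 10-20 real inputs
    result = run_agent(agent_config, {"markdown_content": path.read_text()})
    if not result["success"]:
        failures.append((path.name, result["error"]))

print(f"{len(failures)} failed")
for name, error in failures:
    print(f"  {name}: {error}")  # then adjust schema OR prompt, never both at once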

Scaling: When You Have 40 Agents

Problem I'm facing now: 40 agents × 2000-token prompts = unmaintainable YAML files

Solution (work in progress): Modular composition

agent:
  name: content_analyzer
  prompt_parts:
    - role: "{{include('prompts/base/analyst_role.md')}}"
    - criteria: "{{include('prompts/scoring/criteria_v2.md')}}"
    - output_format: "{{include('prompts/formats/json_scoring.md')}}"
    - input: "Content to analyze:\n{{markdown_content}}"
  output_schema: "{{load('schemas/content_analysis_v3.json')}}"
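The include() helper isn't something my runner has yet; one way I'm prototyping it is a Jinja2 template global (a sketch, assuming prompt fragments live under agents/):

from pathlib import Path
from jinja2 import Environment

PROMPT_ROOT = Path("agents")  # assumed layout: agents/prompts/..., agents/schemas/...

env = Environment()
# Expose include() to templates so prompt_parts can pull in shared fragments
env.globals["include"] = lambda rel_path: (PROMPT_ROOT / rel_path).read_text()

template = env.from_string("{{ include('prompts/base/analyst_role.md') }}")
print(template.render())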

Why this matters:

  • Update scoring criteria in ONE place → applies to all agents
  • Version control prompts independently
  • A/B test prompt variations
  • Reuse role definitions across agents

I'm building this now. Next article may cover the implementation.

What This Enables: The Bigger Picture

AgentKit represents the same shift we've seen elsewhere in tech:

From imperative ("how") to declarative ("what")

Domain            Before                                     After
Containers        "Install X, configure Y, run Z"            Dockerfile
Infrastructure    "Click these buttons in AWS console"       terraform apply
Orchestration     "Deploy pod 1, then pod 2, then..."        kubectl apply -f config.yaml
LLM Agents        "Call API, parse, validate, retry..."      agents/analyzer.yaml

The pattern:

  1. Define desired state
  2. System makes it happen
  3. You never touch the plumbing again

That's not just elegant. It's efficient laziness at scale.

Limitations (or: "Why This Isn’t Perfect, Just Less Terrible")

⚠️ Cost: Retries Aren't Free

If 30% of your requests need 1 retry, you're paying +30% API costs.

Mitigation:

  • Write better prompts. (Or accept that LLMs are like interns: they need supervision.)
  • Implement exponential backoff. (Because spamming the API is so 2023.)
  • Monitor retry rates. (And weep quietly over your bill.)

My take: I’d rather pay 30% more and sleep than save money and debug at 3 AM. (Life’s too short for JSON parsing errors. And bad coffee.)
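
If you do keep retries, the backoff itself is a three-line change inside run_agent's retry loop (a sketch; tune the base delay to your provider's rate limits):

import time

# Inside the except blocks, before looping to the next attempt:
delay = 2 ** attempt  # 2s, 4s, 8s...
print(f"⏳ Backing off {delay}s before attempt {attempt + 1}")
time.sleep(delay)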

⚠️ Security: YAML Execution Needs Hardening

Current implementation is a proof-of-concept. Production needs:
✅ Agent name validation (prevent path traversal: ../../etc/passwd)
✅ Prompt injection detection
✅ Rate limiting per agent (Because someone will try to DoS your FastAPI endpoint. Probably you, at 2 AM.)
✅ API key rotation
✅ Signed agent files (Trust no one. Not even your future self.)

Don't run this in production without hardening.
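
As a starting point, the path traversal item is a one-regex fix in the FastAPI endpoint above (a minimal sketch; tighten as needed):

import re
from fastapi import HTTPException

SAFE_AGENT_NAME = re.compile(r"^[a-z0-9_]+$")  # no dots, no slashes, no surprises

def validate_agent_name(agent_name: str) -> str:
    if not SAFE_AGENT_NAME.fullmatch(agent_name):
        raise HTTPException(status_code=400, detail="Invalid agent name")
    return agent_name

# In the endpoint: open(f"agents/{validate_agent_name(req.agent_name)}.yaml")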

⚠️ State Management: This Is Stateless

Complex workflows (multi-step, conditional logic) need orchestration:

  • Temporal
  • Argo Workflows
  • Airflow

AgentKit handles single-agent execution beautifully. Multi-agent workflows? Different problem.


Implementation Checklist

If you want to try this:

Week 1: Core Setup
☐ Define 1-2 critical agents in YAML
☐ Build minimal runner (~150 lines)
☐ Add Pydantic validation
☐ Test with real inputs (10-20 examples)

Week 2: Production-Ready
☐ Add retry logic with exponential backoff
☐ Expose via FastAPI
☐ Add authentication (API keys)
☐ Implement structured logging
☐ Set up monitoring (retry rates, latency, costs)

Week 3: Hardening
☐ Input validation (sanitize agent names)
☐ Rate limiting per agent/user
☐ Error alerting (Slack/email when agents fail repeatedly)
☐ Documentation for your team

Month 2: Scaling
☐ Modular prompt composition (includes)
☐ A/B testing framework for prompts
☐ Cost tracking per agent
☐ Multi-agent orchestration (if needed)


The Bigger Question: What's Your Time Worth?

Here's the efficient laziness calculation:

Option A: Keep debugging manually

  • Time: 3 hours/week × 52 weeks = 156 hours/year
  • Mental cost: High (unpredictable failures)
  • Scalability: Terrible (more agents = more debugging)

Option B: Build declarative system

  • Time: 1 day setup + 30 min per new agent
  • Mental cost: Low (system handles validation)
  • Scalability: Excellent (add agents by dropping YAML files)

Break-even point: ~3 weeks

After that, it's pure gain.


Conclusion: Think Once, Validate Forever

The actual secret? Stop treating LLM outputs like artisanal handcrafted prose and start treating them like database transactions:

  1. Define the schema. (Yes, before writing prompts.)
  2. Enforce it. (No, "mostly valid" isn’t valid.)
  3. Profit. (Or at least stop debugging at 2 AM.)

Why this works:

  • Databases figured this out in the 1970s. LLMs are just late to the party.
  • Side effects of this approach:
    • Fewer surprises. (Shocking.)
    • More time for actual work. (Imagine that.)
    • A system that fails predictably instead of creatively.

Final thought: If you’re still manually validating LLM outputs, ask yourself: "Do I enjoy suffering, or did I just not automate this yet?" (Hint: It’s the latter. Fix it.)

In my case: ContentAnalysisOutput from markdown_content. Everything else flows from that contract.
And this small discipline—writing a Pydantic model before writing prompts—saved me 12-16 hours/month.
The Efficient Laziness Manifesto:
Never debug the same JSON parsing error twice.
Define the contract once.
Let the system enforce it forever.


For Developers:

Adopt "efficient laziness" with LLM agents:

  1. Contract first (Pydantic schema)
  2. Agent definition (YAML)
  3. Universal runner (handles plumbing)
  4. Never touch validation logic again

For Teams Using LLMs:

If you're integrating AI into workflows, ask:

  • Do we have schemas for AI outputs?
  • Do we validate automatically or manually?
  • How much time do we spend debugging AI responses?

A well-structured agent system means:
✅ Fewer bugs
✅ Less maintenance
✅ Easier iteration
✅ Predictable costs
✅ Peace of mind


Resources

🔗 AgentKit Docs (when available)
🐍 My GitHub repo (replace with your link)
📝 Pydantic Documentation
🚀 FastAPI Documentation


Discussion

Questions for the community:

  1. How much time do you spend debugging LLM outputs?
  2. Do you validate AI responses automatically or manually?
  3. What's your approach to retry logic?
  4. Would declarative agents change how you build AI features?

Drop your thoughts below 👇 Especially if you've faced the "invalid JSON at 2am" pain.


This is the second article in my "Efficient Laziness" series. The efficiently lazy developer never codes the same thing twice. His motto: think once, well, and move on.

If you found this useful, follow me for more deep dives on AI architecture and automation. I also run a weekly tech newsletter on open-source solutions.
