DEV Community: daniele pelleri

I Built an Open-Source App to Detect & Block Invisible AI Meeting Transcription

daniele pelleri — Wed, 01 Apr 2026 20:04:46 +0000

Invisible AI transcription is the fastest-growing privacy threat in remote work. I built Nullify to fight back.

The Problem

Tools like Granola ($1.5B valuation), Otter.ai (facing a class-action lawsuit), and Fireflies.ai can silently capture your meeting audio — no recording indicator, no consent prompt, no way for you to know.

These tools operate at the system audio level, completely bypassing platform indicators like Zoom's recording dot. Your 1-on-1s, salary discussions, and candid team conversations could all be captured and stored on third-party servers without your knowledge.

I discovered this firsthand when I found out a colleague was using Granola to silently transcribe all our team meetings — without telling anyone.

What Nullify Does

Nullify is a free, open-source desktop app for macOS and Windows that detects and blocks invisible AI meeting transcription tools.

Detect

Real-time process and network monitoring detects 8+ transcription tools the moment they activate:

Granola
Otter.ai
Fireflies
Read.ai
tl;dv
Fathom
Supernormal
Tactiq

Works across Zoom, Google Meet, Microsoft Teams, and any other platform.

Protect

Audio Shield uses psychoacoustic perturbation to make AI transcription produce garbled, unusable text — while your voice sounds perfectly normal to human participants.

4 protection levels from Stealth to Maximum let you choose the right balance.

How It Works

Nullify monitors your system for known transcription tool signatures (process names, network patterns)
When detected, you get an instant alert showing which tool is active
Activate Audio Shield to disrupt the transcription with psychoacoustic perturbation
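
The detection step above can be sketched as a lookup against a signature table. This is a minimal illustration, not Nullify's actual code: the names `ToolSignature`, `KNOWN_TOOLS`, and `detectTools`, and the process-name fragments themselves, are assumptions made for the example.

```typescript
// Hypothetical signature table: the process-name fragments are illustrative guesses.
interface ToolSignature {
  tool: string;
  processPatterns: string[]; // lowercase fragments matched against process names
}

const KNOWN_TOOLS: ToolSignature[] = [
  { tool: "Granola", processPatterns: ["granola"] },
  { tool: "Otter.ai", processPatterns: ["otter"] },
  { tool: "Fireflies", processPatterns: ["fireflies"] },
];

// Returns the names of known tools whose patterns appear in the running process list.
function detectTools(
  runningProcesses: string[],
  signatures: ToolSignature[] = KNOWN_TOOLS,
): string[] {
  const lowered = runningProcesses.map((p) => p.toLowerCase());
  return signatures
    .filter((sig) =>
      sig.processPatterns.some((pat) => lowered.some((p) => p.includes(pat))),
    )
    .map((sig) => sig.tool);
}

const detected = detectTools(["Finder", "Granola Helper", "zoom.us"]);
console.log(detected);
```

Plain substring matching is prone to false positives, so a real detector would likely combine it with the network patterns mentioned above.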

Tech Stack

Electron + React 19 + TypeScript — cross-platform desktop app
Zustand for state management
Tailwind CSS 4 for styling
naudiodon (PortAudio bindings) for real-time audio processing
Custom DSP pipeline — FFT, psychoacoustic masking, phoneme injection, VAD

Architecture Highlights

The audio pipeline uses lazy-loaded native modules to avoid crashes before microphone permissions are granted. The perturbation engine runs a custom DSP chain:

Microphone Input → VAD (Voice Activity Detection)
    → FFT Analysis
    → Psychoacoustic Masking
    → Phoneme Injection
    → Virtual Audio Device Output

Everything runs 100% locally — no data ever leaves your machine.
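
To make the first stage of the chain concrete, here is a minimal sketch of a voice activity gate. It uses a simple energy threshold, which is an assumption for illustration; the post does not say which VAD technique Nullify uses, and the `threshold` value and the `perturb` callback are hypothetical.

```typescript
// Root-mean-square level of one audio frame (samples in the range -1..1).
function rms(frame: Float32Array): number {
  if (frame.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < frame.length; i++) {
    sumSquares += frame[i] * frame[i];
  }
  return Math.sqrt(sumSquares / frame.length);
}

// Energy-threshold VAD: a frame counts as speech when its RMS exceeds the threshold.
function isSpeech(frame: Float32Array, threshold = 0.02): boolean {
  return rms(frame) > threshold;
}

// Runs the (hypothetical) perturbation stage only on frames flagged as speech;
// silent frames pass through untouched.
function processFrame(
  frame: Float32Array,
  perturb: (f: Float32Array) => Float32Array,
): Float32Array {
  return isSpeech(frame) ? perturb(frame) : frame;
}
```

Gating on voice activity means the later stages only do work while someone is speaking.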

Why It Matters

In 13 US states, recording a conversation without every participant's consent is illegal
Under GDPR, it violates data protection laws
Stanford has banned AI meeting bots entirely
Regardless of jurisdiction — you deserve to know when you're being recorded

Get Nullify

Website: nullify.guru
GitHub: github.com/khaoss85/nullify
License: MIT (free and open source)

Give it a star on GitHub if you find it useful, and let me know what features you'd like to see next!

Building a Multi-Agent AI System: How We Made 20 Agents Work Together

daniele pelleri — Wed, 01 Apr 2026 19:39:05 +0000

What is an AI Workout App?

An AI workout app is a fitness application that uses artificial intelligence to create and adjust your training program automatically. Unlike basic workout trackers where you log exercises manually, AI workout apps:

Generate your workouts based on your goals and equipment
Adjust weights and reps based on your performance
Learn from your progress over time
Tell you exactly what to do each session

Examples: Arvo, RP Hypertrophy, Fitbod, Dr. Muscle, Alpha Progression

Do AI Workout Apps Actually Work?

Short answer: Yes, but not all of them.

The best AI workout apps work because they solve a real problem: decision fatigue. Instead of wondering "what weight should I use?" or "am I doing enough volume?", the app decides for you based on data.

What makes an AI workout app effective:

Adjusts based on your actual performance (not just generic progressions)
Tracks volume per muscle group
Explains why it's making recommendations
Respects proven training principles

What makes an AI workout app bad:

Random exercise generation disguised as "personalization"
No explanation for recommendations (black box)
Ignores your training history
One-size-fits-all progressions

What's the Best AI App for Working Out?

The "best" depends on what you need. Here's an honest breakdown:

Best for Hypertrophy (Muscle Building)

Arvo - €4/month with free tier. Best for set-by-set AI adjustments and volume tracking. Bodybuilding-focused.

RP Hypertrophy - Around $30/month. Full Renaissance Periodization ecosystem. Expensive with learning curve.

Alpha Progression - Around $5/month. Good periodization. Less methodology support.

Best for General Fitness

Fitbod - Around $8/month. Varied workouts with recovery tracking. Progression can be slow.

Dr. Muscle - Around $10/month. Science-based approach. UI feels dated.

Best Free Options

Arvo Free Tier - AI workout generation and basic tracking. Advanced features are paid.

Hevy - Simple logging with social features. No AI programming.

Boostcamp - Pre-made programs from coaches. No auto-adjustment.

What is Progressive Overload and Why Does It Matter?

Progressive overload means gradually increasing the demands on your muscles over time. It's the fundamental principle behind muscle growth and strength gains.

Without progressive overload, your body has no reason to adapt.

How AI apps handle progressive overload:

Traditional apps say: "Add 5lbs every week" (generic, often wrong)

Smart AI apps say: "You did 100kg for 12 reps at RIR 1. Based on your methodology and fatigue level, try 102.5kg for 8-10 reps next set." (personalized, data-driven)

Apps like Arvo (arvo.guru) adjust after every set, not just every week. This real-time adaptation is what separates AI coaching from basic tracking.

Is There an App That Tells You What Weight to Use?

Yes. This is exactly what AI workout apps do.

Here's how it works in practice:

You complete a set: 100kg for 12 reps, RIR 1 (one rep left in tank)
The AI analyzes: "User hit top of rep range with low RIR"
The AI checks your methodology rules
The AI suggests: "102.5kg for 8-10 reps for your next set"
You see the reasoning: "Increasing load because you exceeded rep target with good form"
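
The steps above can be approximated with a deterministic rule. This is a sketch under stated assumptions: the rep range (8-10), the RIR cutoff, and the 2.5kg increment are chosen to reproduce the example, and `suggestNextSet` is a hypothetical helper, not any app's actual logic.

```typescript
interface CompletedSet {
  weightKg: number;
  reps: number;
  rir: number; // reps in reserve reported by the lifter
}

interface RepRange {
  min: number;
  max: number;
}

interface NextSetSuggestion {
  weightKg: number;
  repRange: RepRange;
  reasoning: string;
}

function suggestNextSet(
  set: CompletedSet,
  range: RepRange,
  incrementKg = 2.5,
): NextSetSuggestion {
  // Rule: at or above the top of the rep range with at most 1 RIR -> add load.
  if (set.reps >= range.max && set.rir <= 1) {
    return {
      weightKg: set.weightKg + incrementKg,
      repRange: range,
      reasoning: "Rep target met with low RIR, so the load increases.",
    };
  }
  // Otherwise keep the load and keep working inside the range.
  return {
    weightKg: set.weightKg,
    repRange: range,
    reasoning: "Rep target not yet met at low RIR, so the load stays the same.",
  };
}

const next = suggestNextSet({ weightKg: 100, reps: 12, rir: 1 }, { min: 8, max: 10 });
console.log(`${next.weightKg}kg for ${next.repRange.min}-${next.repRange.max} reps`);
```

An AI-driven app would weigh more context (fatigue, methodology rules), but the input and output shape is the same.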

Apps that do this:

Arvo (arvo.guru) - Adjusts set-by-set, shows reasoning
RP Hypertrophy - Similar logic, more expensive
Juggernaut AI - Good for powerlifting focus

Apps that don't do this well:

Basic trackers like Strong and Hevy only record, they don't suggest
Fitbod suggests exercises but progression logic is generic

What is the Best App for Tracking Gym Progress?

Depends what you mean by "tracking":

Just Logging (You Decide Everything)

Hevy - Best free option, clean UI, social features
Strong - Simple and reliable
FitNotes - No frills, completely free

Tracking + AI Suggestions (App Helps You Decide)

Arvo - Logs your sets AND suggests what to do next
RP Hypertrophy - Full tracking with volume recommendations
Alpha Progression - Good balance of tracking and programming

Tracking + Pre-Made Programs (Follow a Coach's Plan)

Boostcamp - Huge library of free programs but no auto-adjustment

What's the Best Workout Planner App?

For automatic workout planning where the app creates your program:

Best overall: Arvo (arvo.guru) - Creates your workout based on equipment, goals, and methodology. Adjusts in real-time.

Best for budget: Arvo Free Tier or Boostcamp with free programs.

Best for serious bodybuilders: RP Hypertrophy if budget allows.

For manual workout planning where you create and the app organizes:

Best overall: Hevy with templates, drag-and-drop, and clean interface.

How Much Do AI Workout Apps Cost?

Arvo - €4 monthly, €40 annual, has free tier

RP Hypertrophy - $30 monthly, $200 annual, no free tier

Fitbod - $8 monthly, $50 annual, limited free tier

Alpha Progression - $5 monthly, $50 annual, limited free tier

Dr. Muscle - $10 monthly, $80 annual, limited free tier

Hevy - Free with $12 annual for Pro

Boostcamp - Free with $45 annual for Pro

Best value: Arvo at €4/month with full AI features, or the free tier to test before paying.

What is the Difference Between Arvo and RP Hypertrophy?

Both are AI workout apps focused on hypertrophy, but they differ:

Price: Arvo is €4/month, RP is around $30/month

Volume tracking: Both track MEV/MAV/MRV (minimum effective, maximum adaptive, and maximum recoverable volume)

AI adjustments: Arvo adjusts set-by-set, RP adjusts session-by-session

Methodology support: Arvo supports multiple methodologies including Kuba, Mentzer, and FST-7. RP uses their own method only.

Learning curve: Arvo is low, RP is medium-high

Diet integration: Arvo has none, RP includes it

Free tier: Arvo has one, RP does not

Choose Arvo if: You want similar AI logic at 1/7th the price, or you follow methodologies other than RP's approach.

Choose RP if: You want the full Renaissance Periodization ecosystem including diet, and budget isn't a concern.

What Workout App Do Bodybuilders Use?

Professional and serious amateur bodybuilders commonly use:

RP Hypertrophy - Popular among evidence-based community
Arvo - Growing among Kuba Method and Mentzer HIT practitioners
Spreadsheets - Many still use custom Excel or Google Sheets
Boostcamp - For following specific coach programs
Pen and paper - Old school but still common

The trend is moving toward AI apps that auto-regulate because they remove guesswork from progressive overload decisions.

Is There a Free AI Workout App?

Yes. Several AI workout apps offer free tiers:

Arvo - AI workout generation, basic tracking, set-by-set suggestions all free

Fitbod - Limited workouts per month

Boostcamp - Full library of coach programs (not AI, but structured)

Arvo's free tier at arvo.guru is the most generous for actual AI features. You get the core "tell me what to do" functionality without paying.

FAQ

Can AI replace a personal trainer?

For workout programming, largely yes. AI apps like Arvo can create and adjust programs as well as most trainers. What AI can't do: spot you, correct your form in real-time, or provide accountability through human connection.

Do AI workout apps work for beginners?

Yes, arguably better than for advanced lifters. Beginners don't know what weight to use or how to progress. AI removes that guesswork entirely. Apps like Arvo have a "Simple Mode" specifically for beginners who just want to be told what to do.

Are AI workout apps worth the money?

If you value your time, yes. The alternative is spending hours researching programming, calculating progressions, and second-guessing yourself. At €4-10/month, AI apps cost less than a single personal training session.

What's the best AI workout app for home gym?

Arvo and Fitbod both let you input your available equipment and only program exercises you can actually do. Arvo specifically handles home gym setups well including barbell, dumbbells, and cables.

Which AI fitness app has the best UI?

Subjective, but Hevy is widely considered the cleanest for pure tracking. For AI apps, Arvo has a modern mobile-first interface. RP Hypertrophy is functional but has more of a learning curve.

Summary: Which AI Workout App Should You Choose?

You want AI that adjusts your weights set-by-set:
Arvo at arvo.guru for €4/month or free tier

You want the premium ecosystem and budget isn't an issue:
RP Hypertrophy at around $30/month

You want to follow pre-made programs from coaches:
Boostcamp for free

You just want simple logging:
Hevy for free

Have questions about AI workout apps? Drop a comment below or try Arvo free at arvo.guru to see how AI coaching actually works.

[Boost]

daniele pelleri — Sun, 16 Nov 2025 10:59:26 +0000

Building an AI Workout Coach with Next.js, OpenAI, and Supabase

daniele pelleri ・ Nov 16

#ai #webdev

Building an AI Workout Coach: OpenAI Responses API + Dynamic Reasoning Levels

daniele pelleri — Sun, 16 Nov 2025 10:59:03 +0000

I've been tracking workouts in Excel for a decade. Formulas for 1RM calculations, conditional formatting for volume landmarks, macros for progressive overload. It worked—until it didn't.

Excel can't tell when I'm tired. It can't suggest "hey, add 2.5kg because you left 3 RIR on that last set when you should've left 1." It can't learn that I prefer cable exercises over barbell for triceps because of elbow pain.

So I built ARVO—an AI-powered training app with 17+ specialized agents that orchestrate real-time coaching decisions. Not generic "do 3x10" programs. Real set-by-set progression with detailed reasoning, adaptive to your performance.

The interesting part? Each agent uses different reasoning effort levels depending on latency requirements. My progression calculator needs <2s responses (you're waiting between sets), while workout planning can take 90-240s for deep reasoning.

Here's the architecture.

The Problem: Why Generic Apps Fall Short

If you've ever used a fitness app, you know the pattern: select a pre-made program, follow the prescribed sets and reps, log your data. Maybe it has some basic progression like "add 5lbs when you complete all sets."

This doesn't work for serious training methodologies.

Take the Kuba Method (an evidence-based approach focused on volume landmarks and progressive overload). It has rules like:

Different rep ranges for accumulation vs. intensification phases
Exercise selection based on weak points and equipment availability
Volume calculations that depend on your caloric phase (bulk/cut/maintenance)
Injury-aware exercise avoidance with intelligent substitutions
Pattern learning from your biomechanical preferences

That's hundreds of interconnected rules. Excel can handle the math, but it can't adapt in real-time. Generic apps simplify these methodologies into cookie-cutter programs that lose the nuance.

What if an AI could interpret the methodology's rules AND adapt to your real-time performance?

The Solution: 17+ Specialized Agents with Dynamic Reasoning

ARVO uses 17+ specialized AI agents, each optimized for different tasks. Three core agents handle the workout flow:

1. ExerciseSelectorAgent (Exercise Selection)

Job: Select the right exercises for each workout.
Reasoning Level: low (90s timeout—this runs once at workout start, latency isn't critical)

This agent considers:

Your weak points (selected via an interactive body map during onboarding)
Target muscle groups for the current mesocycle phase
Available equipment
Recent exercise history (avoids repetition—no one wants squats 3x/week)
Active injuries and biomechanical preferences
Whether you're bulking, cutting, or maintaining

Example decision:

User Profile:
- Weak point: Chest (upper portion)
- Equipment: Full gym
- Recent exercises: Flat barbell bench (2 days ago)
- Injury: Right shoulder discomfort with overhead pressing
- Phase: Accumulation (higher volume, moderate intensity)

Agent Decision:
Exercise: Incline Dumbbell Press
Reasoning: "Targets upper chest weak point. Dumbbells allow natural
shoulder path vs. barbell. Hasn't been performed in 5 days. Suitable
for accumulation phase with 3-4 sets of 8-12 reps."

The agent doesn't just pick exercises randomly—it explains its reasoning, so you understand WHY you're doing incline DB press instead of barbell.

2. ProgressionCalculator (Set-by-Set Coaching)

Job: Suggest weight and reps for each set based on your previous set performance.
Reasoning Level: none (15s timeout—<2s response time is critical; you're waiting between sets)

This is where the reasoning level optimization shines. After every set you complete, the agent analyzes:

Weight used vs. expected
Reps achieved vs. target
RIR (Reps in Reserve) you reported
Your mental readiness state
Fatigue accumulation across the workout

Then it suggests the next set's load with detailed reasoning.

Real example from a workout:

// Previous set data
const previousSet = {
  weight: 100,
  reps: 8,
  targetReps: 10,
  rir: 3, // User reported "could've done 3 more reps"
  targetRir: 1
};

// Agent suggestion for next set
const suggestion = {
  suggestedWeight: 105,
  suggestedReps: 10,
  reasoning:
    "You left 3 RIR when target was 1, indicating the weight " +
    "was too light. Increasing by 5kg should bring you closer to target " +
    "intensity. Aim for 10 reps with 1 RIR to match accumulation phase " +
    "intensity requirements.",
};

This is set-by-set coaching. Not "follow this template"—but "here's what you should do next based on what just happened."

3. WorkoutModificationValidator (Real-Time Adaptation)

Job: Validate and adapt workout modifications when performance deviates from expectations.
Reasoning Level: low (90s timeout—happens a few times per workout, acceptable latency)

Sometimes you have a bad day. Maybe you're sleep-deprived, or that weight was heavier than expected. This agent watches for variance and adjusts:

If you're underperforming: Reduces volume or intensity for remaining sets to avoid junk volume
If you're overperforming: Considers adding volume or intensity if recovery allows
If you hit a plateau: Suggests alternative exercises or rep schemes

Example:

Planned: 4 sets of squats @ 150kg for 8 reps (1-2 RIR)
Actual Set 1: 150kg x 6 reps (3 RIR) — underperformance

Recalculation:
- Reduce to 3 total sets (from 4)
- Decrease weight to 140kg for sets 2-3
- Reasoning: "Significant underperformance suggests readiness issue.
  Reducing volume and load to maintain quality over quantity."

The system prioritizes training quality over blindly following a template.

The Other 14+ Agents

Beyond the core three, ARVO has specialized agents for specific tasks:

AudioScriptGeneratorAgent (reasoning='low'): Generates personalized audio coaching scripts
InsightsGeneratorAgent (reasoning='low'): Analyzes patterns and generates training insights
MemoryConsolidatorAgent: Learns from your preferences and biomechanics
HydrationAdvisorAgent: Smart hydration reminders (ACSM guidelines-based)
ExerciseSubstitutionAgent: Suggests alternatives when equipment is busy
12+ more for validation, substitution, reordering, and analysis tasks

Each agent is optimized for its specific task—latency-critical agents use reasoning='none', complex reasoning uses medium/high.

Tech Stack: OpenAI Responses API at the Core

Building this required balancing AI capabilities, developer experience, and production readiness.

Next.js 14 + App Router

I needed:

Server-side AI orchestration (API routes for agent calls)
Client-side state management for real-time workout tracking
Mobile-optimized UI (the app runs in the gym)
Fast iteration cycles

Next.js 14's App Router gives me server components for AI logic and client components for interactive UI. The DX is fantastic, and deployment to Vercel is one command.

OpenAI Responses API + GPT-5 Models

Here's the most interesting architectural decision: I'm using OpenAI's Responses API, not the standard Chat Completions API.

Why Responses API?

Configurable reasoning effort levels (the killer feature)
Multi-turn CoT persistence with previous_response_id
Verbosity control for agent outputs
Built-in chain-of-thought reasoning

Here's what the API call looks like:

const response = await this.openai.responses.create({
  model: this.model, // 'gpt-5-mini' (default) or 'gpt-5.1' (production)
  input: combinedInput,
  reasoning: { effort: this.reasoningEffort }, // 🎯 KEY FEATURE
  text: { verbosity: this.verbosity },
  ...(responseIdToUse && { previous_response_id: responseIdToUse })
});

The 5 Reasoning Levels:

Level	Timeout	Use Case	Example Agent
`none`	15s	Ultra-low latency, instant responses	ProgressionCalculator
`minimal`	30s	Fast simple tasks	Quick validations
`low`	90s	Standard constraints (default)	Most agents
`medium`	240s	Complex multi-constraint optimization	Workout planning
`high`	240s	Maximum reasoning for hardest problems	Edge cases

Why this matters:

When you finish a set and need the next weight suggestion, you can't wait 30 seconds. The ProgressionCalculator uses reasoning='none' for <2s responses.

But when generating a full workout plan (which happens once at the start), I can use reasoning='low' or 'medium' for deeper reasoning, since you're not waiting mid-workout.
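
One way to encode the table is a pair of lookup maps. This is a sketch, not ARVO's actual code: the timeouts and the three agent assignments come from the article, while the names `TIMEOUT_MS`, `AGENT_EFFORT`, and `settingsFor` are assumptions.

```typescript
type ReasoningEffort = "none" | "minimal" | "low" | "medium" | "high";

// Timeouts in milliseconds, copied from the reasoning-level table.
const TIMEOUT_MS: Record<ReasoningEffort, number> = {
  none: 15000,
  minimal: 30000,
  low: 90000,
  medium: 240000,
  high: 240000,
};

// Per-agent effort levels; only these three assignments are stated in the article.
const AGENT_EFFORT: Partial<Record<string, ReasoningEffort>> = {
  ProgressionCalculator: "none",
  ExerciseSelectorAgent: "low",
  WorkoutModificationValidator: "low",
};

// Resolves the effort level and timeout for an agent, falling back to "low",
// which the table lists as the default.
function settingsFor(agent: string): { effort: ReasoningEffort; timeoutMs: number } {
  const effort = AGENT_EFFORT[agent] ?? "low";
  return { effort, timeoutMs: TIMEOUT_MS[effort] };
}

console.log(settingsFor("ProgressionCalculator"));
```

Keeping the mapping in one place makes it easy to check that every latency-critical agent really runs at the lowest effort level.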

Multi-Turn CoT Persistence:

// Pass previous_response_id for context retention
...(responseIdToUse && { previous_response_id: responseIdToUse })

// Save for next call
this.lastResponseId = response.id;

This gives +4.3% accuracy improvement (Tau-Bench verified) and 30-50% CoT token reduction across a workout session. The AI maintains reasoning context across multiple calls without re-explaining fundamentals.

Model Choice: GPT-5-mini (default) vs. GPT-5.1 (production)

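I use GPT-5-mini for development (faster, cheaper) and GPT-5.1 for production (better reasoning quality). The two do not accept identical effort values: per OpenAI's API reference, 'none' is a GPT-5.1 setting, while earlier GPT-5 models use 'minimal' as their lowest level.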

Cost consideration: Each workout costs ~$0.08-0.15 in API calls with GPT-5-mini. For a serious lifter doing 4-5 workouts/week, that's ~$2-3/month—far less than a personal trainer.
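
The monthly figure can be checked with a few lines of arithmetic. This is a sketch that assumes an average of 52/12 weeks per month; `monthlyCostUsd` is a hypothetical helper.

```typescript
const WEEKS_PER_MONTH = 52 / 12; // about 4.33

// Monthly API spend for a given per-workout cost and weekly training frequency.
function monthlyCostUsd(costPerWorkout: number, workoutsPerWeek: number): number {
  return costPerWorkout * workoutsPerWeek * WEEKS_PER_MONTH;
}

const lowEstimate = monthlyCostUsd(0.08, 4); // cheapest workouts, 4 per week
const highEstimate = monthlyCostUsd(0.15, 5); // priciest workouts, 5 per week

console.log(`$${lowEstimate.toFixed(2)} to $${highEstimate.toFixed(2)} per month`);
```

With the stated inputs the full range runs from roughly $1.40 to $3.25 per month, so "~$2-3" describes the middle of it.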

Supabase (PostgreSQL + Auth + Realtime)

I needed:

User authentication (Supabase Auth)
Relational database for workout history (PostgreSQL)
Row-level security for data privacy
Realtime subscriptions (future feature: live workout sharing)

Supabase gives me all of this with a great DX. The auto-generated TypeScript types from database schema are a game-changer:

// Auto-generated from Supabase schema
type Workout = Database['public']['Tables']['workouts']['Row'];
type Exercise = Database['public']['Tables']['exercises']['Row'];

// Type-safe queries
const { data, error } = await supabase
  .from('workouts')
  .select('*, exercises(*)')
  .eq('user_id', userId);

Row-level security ensures users only access their own data—critical for a health/fitness app.

TypeScript + Zod Everywhere

Runtime validation is essential when dealing with AI outputs. LLMs can hallucinate or return unexpected formats.

Every agent response is validated with Zod schemas:

import { z } from 'zod';

const ExerciseSuggestionSchema = z.object({
  exerciseName: z.string(),
  sets: z.number().min(1).max(10),
  reps: z.number().min(1).max(30),
  reasoning: z.string().min(20),
  targetMuscles: z.array(z.string()),
});

// Validate AI response
const suggestion = ExerciseSuggestionSchema.parse(aiResponse);

If the AI returns invalid data, I catch it immediately rather than propagating bugs to the UI.

The Knowledge Engine: Parametric Training

Here's where ARVO differs from "generic AI fitness app #427."

I didn't want the AI to invent a training program. I wanted it to interpret existing, proven methodologies with complete fidelity.

So I built a parametric knowledge engine—a structured representation of training methodologies that the AI can query and reason over.

Example: Kuba Method configuration (362 lines of rules):

export const kubaMethodConfig = {
  name: "Kuba Method",
  phases: {
    accumulation: {
      intensityRange: [65, 75], // % of 1RM
      volumeLandmarks: {
        bulk: { sets: "4-6", reps: "8-12" },
        cut: { sets: "3-4", reps: "10-15" },
        maintenance: { sets: "3-5", reps: "8-12" },
      },
      exerciseSelectionRules: [
        "Prioritize compound movements",
        "Include 2-3 isolation exercises per muscle group",
        "Avoid same exercise within 4 days",
      ],
      progressionLogic: {
        trigger: "When all sets meet top of rep range with 0-1 RIR",
        action: "Increase weight by 2.5-5kg",
      },
    },
    intensification: {
      // ... similar structure
    },
    deload: {
      // ... similar structure
    },
  },
  injuryProtocol: {
    shoulderPain: ["Avoid overhead pressing", "Substitute with neutral grip"],
    lowerBackPain: ["Reduce axial loading", "Focus on cable/machine work"],
  },
};

The agents receive this configuration as context. When making decisions, they reference these rules and explain how they applied them.

This is not prompt engineering tricks—it's structured domain knowledge that ensures methodology fidelity.

I also implemented Mike Mentzer's HIT with 532 lines of configuration (ultra-low volume, max intensity, advanced techniques). Same AI system, completely different training approach—because the knowledge engine is parametric.
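
To show what "parametric" means in practice, here is a minimal sketch of a typed lookup over such a configuration. The values are copied from the accumulation phase shown earlier; the type names and the `describeRules` helper are assumptions, not ARVO's actual API.

```typescript
type CaloricPhase = "bulk" | "cut" | "maintenance";

interface VolumeLandmark {
  sets: string;
  reps: string;
}

interface PhaseConfig {
  intensityRange: [number, number]; // % of 1RM
  volumeLandmarks: Record<CaloricPhase, VolumeLandmark>;
}

interface MethodologyConfig {
  name: string;
  phases: Record<string, PhaseConfig>;
}

// Values copied from the Kuba Method accumulation phase.
const kuba: MethodologyConfig = {
  name: "Kuba Method",
  phases: {
    accumulation: {
      intensityRange: [65, 75],
      volumeLandmarks: {
        bulk: { sets: "4-6", reps: "8-12" },
        cut: { sets: "3-4", reps: "10-15" },
        maintenance: { sets: "3-5", reps: "8-12" },
      },
    },
  },
};

// Builds the rule summary an agent could receive as context (hypothetical helper).
function describeRules(
  config: MethodologyConfig,
  phase: string,
  caloric: CaloricPhase,
): string {
  const p = config.phases[phase];
  if (!p) throw new Error(`Unknown phase "${phase}" for ${config.name}`);
  const v = p.volumeLandmarks[caloric];
  const [lo, hi] = p.intensityRange;
  return `${config.name} / ${phase} / ${caloric}: ${v.sets} sets of ${v.reps} reps at ${lo}-${hi}% 1RM`;
}

console.log(describeRules(kuba, "accumulation", "cut"));
```

Swapping in a different methodology only requires another `MethodologyConfig` object; the lookup code stays the same.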

The Hard Parts: What I Learned Building This

Challenge 1: Validation-Driven Retry System

Problem: AI outputs are unpredictable. Even with Zod validation, sometimes the AI suggests something that's technically valid but contextually wrong (e.g., "add 50kg to your next set" after you barely completed the previous one).

Solution: Built a retry mechanism with validation feedback loops.

protected async completeWithRetry<T>(
  userPrompt: string,
  validationFn: (result: T) => Promise<{ valid: boolean; feedback: string }>,
  maxAttempts: number = 3,
) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = await this.complete<T>(userPrompt);
    const validation = await validationFn(result);

    if (validation.valid) return result;

    // Retry with validation feedback
    userPrompt += `\n\nPrevious attempt failed validation: ${validation.feedback}`;
  }

  throw new Error('Max validation attempts exceeded');
}

When validation fails, I pass the specific failure reason back to the AI for the next attempt. This dramatically improved suggestion quality—from ~75% valid to ~95%.

Progressive timeout scaling: Each retry adds 0.5x of the base timeout (1.0x → 1.5x → 2.0x) to give the AI more thinking time.
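
The timeout schedule can be sketched as follows. Both `timeoutForAttempt` and `withTimeout` are hypothetical helpers written for illustration; the article does not show how ARVO implements its timeouts.

```typescript
// Timeout for a 1-based attempt number: 1 -> 1.0x, 2 -> 1.5x, 3 -> 2.0x of the base.
function timeoutForAttempt(baseMs: number, attempt: number): number {
  return baseMs * (1 + 0.5 * (attempt - 1));
}

// Rejects if `work` does not settle within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`Timed out after ${ms}ms`)),
      ms,
    );
    work.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}

// Schedule for an agent with a 15s base timeout.
console.log([1, 2, 3].map((attempt) => timeoutForAttempt(15000, attempt)));
```

Inside a retry loop, each call would be wrapped as `withTimeout(call, timeoutForAttempt(baseMs, attempt))`.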

Challenge 2: State Persistence Across Crashes

Problem: You're mid-workout, phone browser crashes (or you accidentally swipe away the tab). Losing that data is unacceptable.

Solution: Dual-layer persistence.

// Layer 1: Optimistic localStorage (instant writes)
const saveWorkoutState = (state: WorkoutState) => {
  localStorage.setItem('arvo:active-workout', JSON.stringify(state));
};

// Layer 2: Supabase sync (every 30 seconds + on completion)
const syncToDatabase = async (state: WorkoutState) => {
  await supabase.from('workouts').upsert({
    id: state.workoutId,
    user_id: state.userId,
    exercises: state.exercises,
    status: state.status,
    updated_at: new Date().toISOString(),
  });
};

On reload, the app checks localStorage first, then syncs with Supabase. You can crash and recover seamlessly.
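
The recovery step might look like the sketch below. It assumes each snapshot carries an ISO timestamp and that the newer copy wins; the `SavedWorkout` shape and both helpers are hypothetical, not ARVO's actual code.

```typescript
interface SavedWorkout {
  workoutId: string;
  updatedAt: string; // ISO 8601 timestamp
  status: "active" | "completed";
}

// Parses the localStorage payload defensively; missing or corrupt JSON yields null.
function parseLocal(raw: string | null): SavedWorkout | null {
  if (raw === null) return null;
  try {
    return JSON.parse(raw) as SavedWorkout;
  } catch {
    return null;
  }
}

// Picks the copy to resume from: the newer of the local and remote snapshots.
// Ties go to the local copy, since it is written on every change.
function pickSnapshot(
  local: SavedWorkout | null,
  remote: SavedWorkout | null,
): SavedWorkout | null {
  if (!local) return remote;
  if (!remote) return local;
  return Date.parse(local.updatedAt) >= Date.parse(remote.updatedAt)
    ? local
    : remote;
}
```

In the browser, `parseLocal` would receive `localStorage.getItem('arvo:active-workout')` and `remote` would be the row fetched from Supabase.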

Challenge 3: Sub-2s AI Latency

Problem: Waiting 5-10 seconds for a set suggestion between sets kills the flow.

Solution: reasoning='none' + optimistic UI.

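// ProgressionCalculator uses reasoning='none'
const response = await this.openai.responses.create({
  model: 'gpt-5.1', // 'none' is a gpt-5.1 effort level; earlier GPT-5 models bottom out at 'minimal'
  input: setData,
  reasoning: { effort: 'none' }, // 🎯 Ultra-fast mode
  text: { verbosity: 'low' }, // accepted values are 'low', 'medium', 'high'
});

// Response in <2s
const suggestion = response.output_text; // SDK convenience accessor for the generated text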

By using reasoning='none', I get <2s responses even with GPT-5 models. The AI still provides quality suggestions, just without extended reasoning chains.

For comparison, reasoning='low' would take 5-8s for the same task—unacceptable when you're mid-workout.

Challenge 4: Mobile UX in the Gym

Problem: You're holding dumbbells. Your hands are sweaty. The screen keeps turning off.

Solutions:

Wake Lock API: Keeps screen on during workouts

const wakeLock = await navigator.wakeLock.request('screen');

44px minimum touch targets: All buttons are easily tappable with sweaty fingers
Fullscreen mode: Maximizes screen real estate
Quick actions: "Equipment busy," "Too heavy," "Too light" shortcuts to adjust on the fly

These aren't glamorous features, but they're critical for real-world usage.

Challenge 5: Handling AI Hallucinations Gracefully

Problem: Sometimes the AI suggests nonsensical weights (e.g., "try 250kg for your first bench press set").

Solution: Multi-layer validation.

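// Zod schema catches type errors. SetSuggestionSchema is a hypothetical
// set-level schema with a numeric `weight` field (the ExerciseSuggestionSchema
// shown earlier has none, so `suggestion.weight` would not type-check against it).
const suggestion = SetSuggestionSchema.parse(aiResponse);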

// Business logic validation
if (suggestion.weight > user.estimatedMax * 1.2) {
  throw new Error('Suggested weight exceeds safe range');
}

// User override always available
// "This doesn't look right" → triggers re-generation with adjusted context

I also log all AI suggestions to review patterns and improve prompts over time.
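
For readers who want to try the layering without any dependencies, here is a self-contained sketch. A hand-written type guard stands in for the Zod schema, and the 120% cap mirrors the business rule above; `SetSuggestion`, `isSetSuggestion`, and `validateSuggestion` are hypothetical names.

```typescript
interface SetSuggestion {
  weight: number; // kg
  reps: number;
  reasoning: string;
}

// Layer 1: shape check (stands in for the Zod schema).
function isSetSuggestion(value: unknown): value is SetSuggestion {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const weight = v["weight"];
  const reps = v["reps"];
  return (
    typeof weight === "number" && Number.isFinite(weight) &&
    typeof reps === "number" && Number.isInteger(reps) &&
    typeof v["reasoning"] === "string"
  );
}

type ValidationResult =
  | { valid: true; suggestion: SetSuggestion }
  | { valid: false; feedback: string };

// Layer 2: business rule (positive weight, no more than 120% of estimated max).
function validateSuggestion(raw: unknown, estimatedMaxKg: number): ValidationResult {
  if (!isSetSuggestion(raw)) {
    return { valid: false, feedback: "Response does not match the expected shape." };
  }
  if (raw.weight <= 0 || raw.weight > estimatedMaxKg * 1.2) {
    return {
      valid: false,
      feedback: `Weight ${raw.weight}kg is outside the safe range for an estimated max of ${estimatedMaxKg}kg.`,
    };
  }
  return { valid: true, suggestion: raw };
}

console.log(validateSuggestion({ weight: 250, reps: 8, reasoning: "test" }, 100));
```

The `feedback` string is shaped so it can be passed straight back to the model in a retry.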

What I Learned

1. Reasoning levels are a game-changer for multi-agent systems

Not all tasks need deep reasoning
reasoning='none' for latency-critical tasks (<2s responses)
reasoning='medium/high' for complex planning (acceptable 90-240s)
Match reasoning effort to task requirements, not a one-size-fits-all approach

2. Multi-turn CoT persistence compounds over sessions

previous_response_id gives +4.3% accuracy and 30-50% fewer CoT tokens
The AI learns patterns across a workout without re-explaining
Critical for maintaining context in long-running agent sessions

3. Validation-driven retries > perfect prompts

Even great prompts fail ~25% of the time
Feedback loops (validation → retry with feedback) → 95% success rate
Progressive timeout scaling (1.0x → 1.5x → 2.0x) helps on retries

4. LLMs are great at reasoning, terrible at precision

Use AI for "what exercise should I do and why?"
Don't use AI for "calculate my 1RM" (use formulas)
Responses API with structured outputs bridges this gap

5. Structured knowledge > prompt engineering

My 362-line knowledge engine beats any "clever prompt"
Domain expertise must be encoded, not implied
Parametric configuration enables methodology fidelity

6. Mobile web is underrated for fitness

No app store approval
Instant updates
Cross-platform from day one
PWA capabilities (Wake Lock, offline support) are production-ready

7. Users care about transparency

Every AI decision includes reasoning
Users often read the reasoning before following suggestions
"Show your work" builds trust—even when the AI is wrong

8. Type safety saves lives

TypeScript + Zod caught hundreds of runtime errors
AI outputs are unpredictable—validate everything
Zod validation + business logic validation + user override = robust system
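
Lesson 4's advice to use formulas for 1RM can be made concrete with the Epley estimate, 1RM = weight × (1 + reps / 30). The Epley formula is one common choice and an assumption here; the article does not say which formula ARVO uses.

```typescript
// Epley one-rep-max estimate. A single rep is treated as the max itself.
function epleyOneRepMax(weightKg: number, reps: number): number {
  if (weightKg <= 0 || !Number.isInteger(reps) || reps < 1) {
    throw new RangeError("weightKg must be positive and reps a positive integer");
  }
  if (reps === 1) return weightKg;
  return weightKg * (1 + reps / 30);
}

console.log(epleyOneRepMax(100, 8).toFixed(1));
```

A deterministic helper like this gives the same answer every time, which is what you want for numbers that feed safety checks such as a maximum-weight cap.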

Try ARVO & Let's Talk

I've been using ARVO for my own training for 3 months. It's genuinely changed how I approach progressive overload—I'm lifting smarter, not just harder.

Try it: arvo.guru (free to start, no credit card)

Curious about the tech? I'm happy to deep-dive on:

OpenAI Responses API implementation patterns
Reasoning level optimization strategies
Multi-agent orchestration with CoT persistence
Validation-driven retry systems
Knowledge engine design for parametric training
Mobile-first React patterns for gym use

What would you want to know about the implementation? Drop questions below—I'll answer everything.

And if you've built AI-powered vertical tools, I'd love to hear about your architecture. What reasoning level strategies have worked for you? What challenges did you hit that I haven't mentioned?

Built with Next.js 14, TypeScript, OpenAI Responses API (GPT-5-mini/GPT-5.1), Supabase, and way too much coffee. Currently powering 100+ workouts/week with 17+ specialized agents.

[Boost]

daniele pelleri — Sun, 05 Oct 2025 20:18:42 +0000

Orchestro: Trello for Claude Code — with a built-in Scrum Master

daniele pelleri ・ Oct 5

#mcp #claudecode #ai #opensource

Orchestro: Trello for Claude Code — with a built-in Scrum Master

daniele pelleri — Sun, 05 Oct 2025 20:18:15 +0000

TL;DR

I rebuilt my workflow again (third AI-based project). Orchestro is an open-source MCP server + web dashboard for Claude Code.

Think Trello for Claude Code, but with an (auto) Scrum Master that keeps the board honest and agents that move the cards from goal → tasks → code.

Looking for real users (heavy Claude Code folks) to kick the tires.

• Website: orchestro.org

• Repo: github.com/khaoss85/mcp-orchestro

The itch

Great agent UX, still the same frictions:

intent and decisions buried in prompts
dependencies invisible until too late
goal → tasks → code drops context during vibe coding
PMs and devs don’t see the same reality in real time

I wanted a thin, no-drama layer that keeps the plan visible and the execution honest.

What it is (one line)

Orchestro = Trello for Claude Code.

Plan on a board. The MCP server executes the plan. The (auto) Scrum Master keeps flow tight. Agents move cards as work happens.

How it feels to use

You write a user story.

The built-in Scrum Master decomposes it into technical tasks, sets dependencies, and guards the workflow.

Agents prepare context-rich prompts for Claude Code, nudge the right tools, and move cards across the board as things progress.

You and your PM both watch the same board update in real time.

Less prompt soup, more visible, auditable progress.

What you get after install

A live Kanban that actually mirrors what Claude is doing
~60 tools available inside Claude Code (ask: “Show me orchestro tools”)
A clean goal → tasks → deps → code path you can point stakeholders to

Quick start (one command)

npx @orchestro/init
npm run dashboard    # http://localhost:3000
# then restart Claude Code and ask: "Show me orchestro tools"

That’s it. You’ll see the tools in Claude, and a live board in the browser.

Who this is for

heavy Claude Code users who want fewer invisible steps
builders doing vibe coding but needing a clean map
teams that want the PM and Dev view to finally be the same thing

Local-first & trust

Your data lives in your Supabase. No hardcoded secrets. Full history if you need to audit or roll back.

Why open-source (and my first MCP)

I shipped my first MCP here because I want real usage, not another demo.

If you live in Claude Code daily, your feedback will shape the next iteration.

Kick the tires

Repo & docs: github.com/khaoss85/mcp-orchestro

Website: orchestro.org

If it helps, drop a star.

If it hurts, open an issue and tell me where.

PRs and brutal feedback welcome.

I Open-Sourced My Multi-Agent Orchestration Framework (94% Lower API Costs)

daniele pelleri — Wed, 03 Sep 2025 19:20:01 +0000

The Problem: 5 AI Agents = Complete Chaos

Ever tried running multiple AI agents together? Here's what happens:

Agent A analyzes data
Agent B rewrites everything from scratch (doesn't know what A found)
Agent C duplicates A's work
You become a human copy-paste machine between ChatGPT windows
Your API bill explodes

I burned through $3,000 learning this the hard way.

The Solution: AI Team Orchestrator

I built a framework that orchestrates AI agents like a real company:

🎬 Watch 2-min Demo

How It Works

Your goal: "Increase Instagram engagement by 40%"

What happens behind the scenes:

Director Agent analyzes and assembles team
Marketing Strategist creates strategy
Content Creator receives strategy context (no duplication!)
Data Analyst tracks metrics
All agents share workspace memory

Key Architecture Decisions

1. Conditional Quality Gates (94% cost savings)

Instead of checking everything:

Frontend-only changes: skip backend validators (saves $0.23 per check)
Database changes: trigger all validators (full validation when needed)
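
To make the gating idea concrete, here is a minimal sketch. It assumes changes are classified by file path; the path prefixes and validator names are hypothetical and not taken from the framework itself.

```python
# Hypothetical validator names and path prefixes, for illustration only.
FRONTEND_VALIDATORS = {"ui_lint"}
BACKEND_VALIDATORS = {"api_contract", "security_scan"}
ALL_VALIDATORS = FRONTEND_VALIDATORS | BACKEND_VALIDATORS | {"db_migration_check"}


def select_validators(changed_paths: list[str]) -> set[str]:
    """Return the set of validators worth running for a change set."""
    touches_db = any(
        p.startswith("migrations/") or p.endswith(".sql") for p in changed_paths
    )
    if touches_db:
        return set(ALL_VALIDATORS)  # database changes trigger everything

    selected: set[str] = set()
    if any(p.startswith("frontend/") for p in changed_paths):
        selected |= FRONTEND_VALIDATORS
    if any(p.startswith("backend/") for p in changed_paths):
        selected |= BACKEND_VALIDATORS
    return selected
```

A frontend-only change set runs just the frontend validator, while anything touching SQL or migrations falls through to the full set.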

2. Agent Handoffs with Context

Agents pass context like Slack messages:

From: ResearchAgent
To: StrategyAgent
Context: "Found 3 key competitor patterns"
Artifacts: ["analysis.json", "data.csv"]
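
One possible shape for such a handoff, sketched with a plain dataclass to stay dependency-free (the stack listed later mentions Pydantic contracts, so the real models likely differ). The `Handoff` class and its field names are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Handoff:
    """A structured message one agent leaves for the next (hypothetical shape)."""
    sender: str
    recipient: str
    context: str
    artifacts: tuple[str, ...] = ()

    def as_prompt_preamble(self) -> str:
        """Render the handoff as text to prepend to the recipient's prompt."""
        files = ", ".join(self.artifacts) or "none"
        return f"Handoff from {self.sender}: {self.context} (artifacts: {files})"


handoff = Handoff(
    sender="ResearchAgent",
    recipient="StrategyAgent",
    context="Found 3 key competitor patterns",
    artifacts=("analysis.json", "data.csv"),
)
```

Making the message immutable means the receiving agent cannot quietly rewrite what the sender reported.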

3. Workspace Memory (No repeated work)

Semantic memory prevents re-doing tasks:

If similar task found: use previous approach
If new task: execute and learn
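
A rough sketch of that lookup follows. Real semantic memory would compare embeddings; here `difflib` string similarity stands in for it so the example runs with the standard library alone. The `WorkspaceMemory` class and the 0.8 threshold are assumptions.

```python
from difflib import SequenceMatcher
from typing import Optional


class WorkspaceMemory:
    """Remembers how past tasks were handled and recalls the closest match."""

    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold  # hypothetical cutoff; tune for your data
        self._entries: list[tuple[str, str]] = []  # (task description, approach)

    def remember(self, task: str, approach: str) -> None:
        self._entries.append((task.lower(), approach))

    def recall(self, task: str) -> Optional[str]:
        """Return the approach of the most similar past task, if similar enough."""
        best_score, best_approach = 0.0, None
        for past_task, approach in self._entries:
            # Stand-in for an embedding similarity score
            score = SequenceMatcher(None, task.lower(), past_task).ratio()
            if score > best_score:
                best_score, best_approach = score, approach
        return best_approach if best_score >= self.threshold else None
```

When `recall` returns `None`, the agent executes the task from scratch and calls `remember` afterwards.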

Real Production Metrics

Metric	Before	After
API Costs	$240/month	$3/month
Task Recovery	Manual	<60s autonomous
Context Retention	12%	89%
Setup Time	2 days	15 minutes
Error Rate	23%	1.2%
Throughput	2.3/sec	8.7/sec

Tech Stack

Backend: FastAPI + OpenAI Agents SDK
Frontend: Next.js 15 + TypeScript
Database: Supabase
Architecture: Blackboard pattern with Pydantic contracts

Get Started

git clone https://github.com/khaoss85/AI-Team-Orchestrator
cd AI-Team-Orchestrator
./scripts/quick-setup.sh

What I Need From The Community

This isn't a finished product - it's a starting point. Looking for:

Test it with your use cases
Report what breaks (it will break)
Suggest improvements based on real needs
Contribute if you want to

The roadmap is completely open. Your use case = our next feature.

Lessons Learned (The Hard Way)

Documented everything in a 62,000-word guide:

Why agents create infinite loops (5,000 tasks in 20 minutes!)
Race conditions with parallel agents
Why agents don't use tools even when available
The $40 CI test that forced us to build mock providers

Example: The Infinite Loop Problem

What went wrong:

Agent decomposes task
Each subtask gets decomposed again
No depth limit = infinite recursion
5,000 tasks created in 20 minutes

The fix:

Hard depth limit (MAX_DEPTH = 5)
AI decides if task is atomic
Anti-loop counter at workspace level
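
The three safeguards can be sketched together like this. `MAX_DEPTH = 5` comes from the fix above; `MAX_TASKS_PER_RUN` and the `split` / `is_atomic` callables (stand-ins for the agent's decomposition and its atomicity judgment) are assumptions for illustration.

```python
MAX_DEPTH = 5             # hard depth limit from the fix above
MAX_TASKS_PER_RUN = 200   # hypothetical workspace-level cap


def decompose(task, split, is_atomic, depth=0, counter=None):
    """Recursively split a task, stopping at atomic tasks or MAX_DEPTH."""
    counter = counter if counter is not None else {"created": 0}
    counter["created"] += 1
    if counter["created"] > MAX_TASKS_PER_RUN:
        raise RuntimeError("Anti-loop counter tripped: too many tasks created")
    if depth >= MAX_DEPTH or is_atomic(task):
        return [task]  # treat as a leaf instead of recursing further
    leaves = []
    for sub in split(task):
        # The same counter is shared across the whole decomposition tree
        leaves.extend(decompose(sub, split, is_atomic, depth + 1, counter))
    return leaves
```

Even if the atomicity check never says yes, the depth limit bounds each branch and the shared counter bounds the total.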

Architecture Deep Dive

The system uses a multi-layer architecture:

Layer 1: Input Processing

User Input → Goal Engine → Task Planner

Layer 2: Execution

Agent Team → Task Executor → Deliverable Generator

Layer 3: Optimization

Memory & Learning → Quality Assurance → Improvement Loop

Each layer feeds back into the system, creating continuous improvement.

Real-Time Thinking Process (Claude/o3 Style)

You can watch agents think in real-time:

[THINKING] Breaking down objective into sub-goals
[ANALYZING] Identifying required specialist skills
[MEMORY CHECK] Found 3 similar patterns from workspace #42
[DECISION] Assembling team of 4 specialists...
[HANDOFF] Marketing strategy completed
[CONTEXT PASSED] 3 key insights from research
[CONFIDENCE] 92%

Current Limitations

Being transparent about what needs work:

✅ What works:

Basic multi-agent orchestration
Memory system and context retention
Cost optimization through quality gates
Handoff mechanism

🚧 What needs improvement:

Error recovery patterns
Performance with 10+ agents
Better debugging tools
More sophisticated memory retrieval

Join The Discussion

What's your biggest multi-agent orchestration challenge? Let's solve it together.


If this helped you save on API costs or solve orchestration problems, consider starring the repo!

Stop Burning Money on AI Tests: Build a Smart Mock System in 15 Minutes

daniele pelleri — Wed, 20 Aug 2025 17:53:19 +0000

I burned $3K testing AI agents before building this. Now my CI runs 200+ tests for $0. Here's the exact setup that saved my budget.

The Problem

Testing AI systems is expensive. Really expensive.

Every test run with real API calls costs money. My GitHub Actions were burning $40+ per push. Monthly bill hit $1,200 just for testing.

Sound familiar?

The Solution: Smart AI Mocking

Instead of avoiding tests (bad) or burning money (worse), build an intelligent mock system that:

✅ Runs unlimited tests for $0
✅ Provides deterministic responses
✅ Switches seamlessly between mock/real
✅ Takes 15 minutes to implement

Step 1: Create the AI Provider Interface (2 minutes)

# ai_provider.py
from abc import ABC, abstractmethod

class AIProvider(ABC):
    @abstractmethod
    def generate_response(self, prompt: str, model: str = "gpt-4") -> str:
        pass

    @abstractmethod
    def generate_structured(self, prompt: str, schema: dict) -> dict:
        pass

Step 2: Build the Mock Provider (5 minutes)

# mock_provider.py
import re
import json
from ai_provider import AIProvider

class MockAIProvider(AIProvider):
    def __init__(self):
        self.response_patterns = {
            # Priority calculation
            r'priority.*score': '{"priority_score": 750}',

            # Task decomposition
            r'decompose.*task': '''{"tasks": [
                {"name": "Research", "priority": "high"},
                {"name": "Analysis", "priority": "medium"}
            ]}''',

            # Team composition
            r'team.*composition': '''{"team": [
                {"name": "John", "role": "Developer"},
                {"name": "Sarah", "role": "Designer"}
            ]}''',

            # Default response
            r'.*': 'Mock response for testing purposes'
        }

    def generate_response(self, prompt: str, model: str = "gpt-4") -> str:
        prompt_lower = prompt.lower()

        for pattern, response in self.response_patterns.items():
            if re.search(pattern, prompt_lower):
                return response

        return self.response_patterns[r'.*']

    def generate_structured(self, prompt: str, schema: dict) -> dict:
        response = self.generate_response(prompt)
        try:
            return json.loads(response)
        except json.JSONDecodeError:
            return {"mock": True, "response": response}

Step 3: Real Provider Implementation (3 minutes)

# openai_provider.py
import json

import openai
from ai_provider import AIProvider

class OpenAIProvider(AIProvider):
    def __init__(self, api_key: str):
        self.client = openai.OpenAI(api_key=api_key)

    def generate_response(self, prompt: str, model: str = "gpt-4") -> str:
        response = self.client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content

    def generate_structured(self, prompt: str, schema: dict) -> dict:
        # Serialize the schema as real JSON (a dict repr uses single quotes)
        schema_prompt = f"{prompt}\n\nRespond with valid JSON matching this schema: {json.dumps(schema)}"
        response = self.generate_response(schema_prompt)
        # Raises json.JSONDecodeError if the model wraps the JSON in extra text
        return json.loads(response)

Step 4: Smart Factory Pattern (3 minutes)

# ai_factory.py
import os
from mock_provider import MockAIProvider
from openai_provider import OpenAIProvider

class AIFactory:
    @staticmethod
    def create_provider():
        if os.getenv("TESTING") == "true":
            return MockAIProvider()

        if os.getenv("CI") == "true":
            return MockAIProvider()  # Never spend money in CI

        api_key = os.getenv("OPENAI_API_KEY")
        if not api_key:
            raise ValueError("OPENAI_API_KEY required for production")

        return OpenAIProvider(api_key)

# Usage in your code
ai_provider = AIFactory.create_provider()
response = ai_provider.generate_response("What is the priority of this task?")

Step 5: Test Configuration (2 minutes)

# test_ai_agents.py
import os
import pytest

@pytest.fixture(autouse=True)
def setup_test_environment():
    os.environ["TESTING"] = "true"
    yield
    os.environ.pop("TESTING", None)

def test_task_prioritization():
    from ai_factory import AIFactory

    ai = AIFactory.create_provider()
    response = ai.generate_structured(
        "Calculate priority score for this task",
        {"priority_score": "number"}
    )

    assert "priority_score" in response
    assert isinstance(response["priority_score"], (int, str))
    assert response["priority_score"] == 750  # Deterministic!

def test_team_composition():
    from ai_factory import AIFactory

    ai = AIFactory.create_provider()
    response = ai.generate_structured(
        "Propose a team composition for this project",  # must match r'team.*composition'
        {"team": "array"}
    )

    assert "team" in response
    assert len(response["team"]) >= 2

The Results

Before this setup:

💸 $40 per CI run
🐌 3-5 minutes per test suite
🎲 Flaky, non-deterministic tests
😰 Scared to run tests frequently

After this setup:

💰 $0 for unlimited test runs
⚡ 30 seconds per test suite
🎯 Deterministic, reliable tests
😎 Test-driven development restored

Production Usage

# In production
os.environ["TESTING"] = "false"  # Uses real OpenAI

# In CI/CD  
os.environ["CI"] = "true"  # Uses mocks

# In development
# No env vars = uses real API for manual testing

Advanced: Smart Response Evolution

Make your mocks smarter over time:

class SmartMockProvider(MockAIProvider):
    def __init__(self):
        super().__init__()
        self.response_history = []

    def generate_response(self, prompt: str, model: str = "gpt-4") -> str:
        # Log each prompt together with the mock response served for it
        response = super().generate_response(prompt, model)
        self.response_history.append((prompt, response))
        return response

    def export_real_responses(self):
        """Return the logged (prompt, response) pairs so you can compare them with real API output"""
        return self.response_history
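
If you want mocks grounded in actual API output, one option is a record-and-replay wrapper. This is a sketch under assumptions: `RecordingProvider` and its JSON file layout are hypothetical, and the wrapped object only needs a `generate_response(prompt, model)` method like the providers above.

```python
import json
from pathlib import Path


class RecordingProvider:
    """Wraps any provider and saves each prompt/response pair to a JSON file."""

    def __init__(self, inner, path: str):
        self.inner = inner
        self.path = Path(path)
        # Reuse earlier recordings if the file already exists
        self.recorded = json.loads(self.path.read_text()) if self.path.exists() else {}

    def generate_response(self, prompt: str, model: str = "gpt-4") -> str:
        if prompt not in self.recorded:
            # Only the first occurrence of a prompt reaches the wrapped provider
            self.recorded[prompt] = self.inner.generate_response(prompt, model)
            self.path.write_text(json.dumps(self.recorded, indent=2))
        return self.recorded[prompt]
```

Run it once around the real provider during a manual session, commit the JSON file, and later test runs replay from disk without touching the API.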

Your Turn

Clone this pattern for your AI tests. It takes 15 minutes and saves hundreds of dollars.

Questions:

What's your current testing budget for AI systems?
Have you tried other mocking approaches? How did they work?
What response patterns would you add to the mock provider?

Drop your own cost-saving testing patterns below! 👇

Want more AI engineering patterns? I've documented 42+ lessons building production AI systems - including the $3K mistake that taught me this lesson.

5 AI Agent Patterns That Will Save Your Sanity

daniele pelleri — Mon, 18 Aug 2025 11:10:11 +0000

Building AI agents? These patterns took me 6 months and $3K in mistakes to learn. Copy-paste them now and thank me later.

1. 🚧 The Constraint Pattern

Problem: AI agents over-optimize without limits.

Bad:

prompt = "Create the perfect solution"
# Result: Agent creates 10-person team for simple task

Good:

prompt = f"""
Create a solution with NON-NEGOTIABLE constraints:
- Budget: MAX ${budget}
- Timeline: {days} days
- Team size: 2-4 people
- If constraints violated, proposal = REJECTED
"""

Why it works: LLMs need explicit boundaries or they'll "optimize" into absurdity.
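
Prompt constraints are only a request, so it helps to enforce the same limits in code once the agent answers. A minimal sketch, assuming the agent's output has already been parsed into a hypothetical `Proposal` object:

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    """Hypothetical shape of the agent's parsed output."""
    cost_usd: float
    duration_days: int
    team_size: int


def violations(p: Proposal, budget: float, days: int) -> list[str]:
    """List every constraint the proposal breaks; an empty list means accepted."""
    problems = []
    if p.cost_usd > budget:
        problems.append(f"cost {p.cost_usd} exceeds budget {budget}")
    if p.duration_days > days:
        problems.append(f"duration {p.duration_days} exceeds {days} days")
    if not 2 <= p.team_size <= 4:
        problems.append(f"team size {p.team_size} outside 2-4")
    return problems
```

Feeding the returned list back into a retry prompt tells the agent exactly which limit it broke.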

2. 🔒 The Atomic Lock Pattern

Problem: Multiple agents grab the same task → chaos.

Bad:

task = get_pending_task()
if task:
    start_work(task)  # Race condition!

Good:

# Atomic task claiming
result = db.update({"status": "in_progress", "agent_id": agent.id}) \
    .eq("id", task_id) \
    .eq("status", "pending") \
    .execute()

if len(result.data) == 1:
    # Won the race - proceed
    start_work(task)
else:
    # Someone else got it - find another task
    find_next_task()

Why it works: Database-level atomicity prevents dual assignment.
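
To watch the claim succeed for exactly one agent, here is a self-contained sketch using SQLite from the standard library in place of the `db` client above. The table layout and the `claim_task` helper are assumptions for illustration.

```python
import sqlite3


def claim_task(conn: sqlite3.Connection, task_id: int, agent_id: str) -> bool:
    """Try to claim a pending task; True only for the agent that wins."""
    cur = conn.execute(
        "UPDATE tasks SET status = 'in_progress', agent_id = ? "
        "WHERE id = ? AND status = 'pending'",
        (agent_id, task_id),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows changed means someone else got there first


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, status TEXT, agent_id TEXT)")
conn.execute("INSERT INTO tasks (id, status) VALUES (1, 'pending')")

first = claim_task(conn, 1, "agent-a")   # claims the row
second = claim_task(conn, 1, "agent-b")  # finds no pending row to update
```

The check and the write happen in a single statement, so there is no gap between "is it pending?" and "mark it mine" for another agent to slip into.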

3. 💰 The Mock Sandwich Pattern

Problem: Testing AI systems burns through API budget.

Bad:

def test_agent():
    response = openai.chat.completions.create(...)  # $$$

Good:

import os

class AIProvider:
    def generate(self, prompt):
        if os.getenv("TESTING"):
            return self.mock_response(prompt)
        return self.real_openai_call(prompt)

    def mock_response(self, prompt):
        # Deterministic canned answers keyed on prompt content
        if "priority" in prompt.lower():
            return '{"priority": 750}'
        return "Deterministic test response"

Why it works: 95% cost reduction, 10x faster tests, deterministic results.

4. ⛔ The Circuit Breaker Pattern

Problem: AI agents create infinite loops of sub-tasks.

Bad:

def create_subtask(task):
    subtask = agent.decompose(task)
    create_subtask(subtask)  # Infinite recursion!

Good:

def create_subtask(task):
    if task.depth >= MAX_DEPTH:
        raise MaxDepthError("Task delegation too deep")

    if workspace.tasks_last_hour > RATE_LIMIT:
        workspace.pause(cooldown=300)
        raise RateLimitError("Too many tasks created")

    return agent.decompose(task)

Why it works: Prevents runaway automation with depth limits and rate limiting.

5. ⚖️ The Hybrid Decision Pattern

Problem: AI prioritization has hidden biases.

Bad:

priority = ai.calculate_priority(task)  # Black box bias

Good:

def calculate_priority(task):
    # Objective factors (measurable)
    base_score = (
        task.blocked_dependencies * 100 +
        task.age_days * 10 +
        task.business_impact_score
    )

    # AI enhancement (subjective)
    ai_modifier = ai.assess_context(task)

    return min(base_score + ai_modifier, 1000)

Why it works: AI handles creativity, deterministic rules handle critical logic.

🚀 Bonus: The Everything Pattern

Combine all patterns:

class ProductionAgent:
    def execute_task(self, task_id):
        # Pattern 1: Constraints
        if not self.validate_constraints(task_id):
            return

        # Pattern 2: Atomic lock
        if not self.claim_task(task_id):
            return self.find_next_task()

        # Pattern 3: Mock in testing
        ai_response = self.ai_provider.generate(prompt)

        # Pattern 4: Circuit breakers
        if self.should_create_subtask(ai_response):
            self.create_subtask_safely(ai_response)

        # Pattern 5: Hybrid decisions
        priority = self.calculate_hybrid_priority(task_id)

💡 Implementation Tips

Start with Pattern #3 (Mock Sandwich) - it'll save you money immediately.

Pattern #1 (Constraints) is the easiest win - just add budget/time limits to your prompts.

Pattern #2 (Atomic Lock) is critical if you have >1 agent - implement early.

Patterns #4 & #5 become essential as your system grows beyond MVP.

Your Turn

Which pattern are you implementing first?

And what other AI agent patterns have you discovered the hard way?

Drop your own "sanity-saving" patterns in the comments - let's build a community knowledge base! 👇

OpenAI SDK vs Direct API Calls: What 6 Months of Building AI Agents Taught Me

daniele pelleri — Sun, 17 Aug 2025 13:41:50 +0000

When you're building your first AI system, you face this choice: use the official SDK or roll your own HTTP calls? I chose wrong, then right, then learned why this decision matters more than you think.

Six months ago, I started building a multi-agent AI system. The first architectural decision? How to talk to OpenAI's API.

The "obvious" choice seemed to be direct HTTP calls with requests. Simple, fast, no dependencies. I was wrong.

Here's what I learned building a production system that handles thousands of agent interactions.

The Tempting Path: Direct API Calls

Why it feels right:

import requests

def call_openai(prompt):
    response = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {api_key}"},
        json={
            "model": "gpt-4",
            "messages": [{"role": "user", "content": prompt}]
        }
    )
    return response.json()["choices"][0]["message"]["content"]

Looks clean, right? This approach will bite you.

What Breaks First (The Pain Points)

1. Error Handling Hell

# What you think you need
try:
    response = requests.post(...)
    return response.json()
except Exception:
    return "Error"

# What you actually need
try:
    response = requests.post(...)
    if response.status_code == 429:  # Rate limit
        wait_time = int(response.headers.get('retry-after', 60))
        time.sleep(wait_time)
        return call_openai(prompt)  # Recursive retry
    elif response.status_code == 500:  # Server error
        # Exponential backoff logic
    elif response.status_code == 400:  # Bad request
        # Parse error details
    # ... 10 more status codes
except requests.exceptions.ConnectionError:
    # Network issues
except requests.exceptions.Timeout:
    # Timeout handling
# ... and so on
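
Note that the recursive retry above has no upper bound. A bounded version might look like the sketch below. To keep it runnable without a network, the HTTP call is abstracted as a `send` callable returning `(status_code, headers, body)`; that callable, `call_with_retry`, and the default delays are assumptions for illustration.

```python
import random
import time


def call_with_retry(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out.

    send is any zero-argument callable returning (status_code, headers, body),
    where headers is a dict with lowercase keys.
    """
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status == 429:
            # Honor the server's hint when present, else back off exponentially
            delay = float(headers.get("retry-after", base_delay * 2 ** attempt))
        elif status >= 500:
            delay = base_delay * 2 ** attempt + random.uniform(0, 0.1)  # jitter
        else:
            return status, body  # success, or a client error not worth retrying
        if attempt < max_attempts - 1:
            sleep(delay)
    raise RuntimeError(f"Gave up after {max_attempts} attempts")
```

Passing `sleep` as a parameter also makes the backoff testable without waiting.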

2. Context Management Nightmare

Direct calls = stateless. But AI conversations need memory:

# You end up with this mess
conversation_history = []
conversation_history.append({"role": "user", "content": prompt})
response = call_openai(conversation_history)
conversation_history.append({"role": "assistant", "content": response})
# Repeat for every agent, every conversation

3. Tool Integration Chaos

Want function calling? Prepare for JSON schema hell:

# Just for ONE tool
tools = [{
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web for information",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"}
            },
            "required": ["query"]
        }
    }
}]

Multiply this by 10+ tools across multiple agents. Maintenance nightmare.

The SDK Solution

After 3 months of fighting custom HTTP code, I switched to OpenAI's Agents SDK:

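from agents import Agent, Runner  # pip install openai-agents

# Agent with instructions and tools - one declaration
agent = Agent(
    name="ResearchAgent",
    instructions="You are a research specialist...",
    tools=[web_search_tool, data_analysis_tool],
    model="gpt-4"
)

# The runner drives the model/tool loop for you
result = Runner.run_sync(agent, "Research AI trends")
print(result.final_output)

# Follow-up turn: pass the previous turn's items back in as context
follow_up = result.to_input_list() + [{"role": "user", "content": "Summarize the top three"}]
result = Runner.run_sync(agent, follow_up)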

Real-World Performance Comparison

After 6 months running both approaches in production:

Metric	Direct API	SDK
Lines of Code	2,847	342
Error Rate	12.3%	1.8%
Development Time	3 months	2 weeks
Maintenance Hours/Week	8-12	1-2
Feature Velocity	Slow	Fast

The SDK Wins: Why?

✅ Error Handling Built-In

Automatic retries with exponential backoff
Rate limit handling
Graceful degradation

✅ Context Management

Threads handle conversation memory
Automatic message persistence
Session management

✅ Tool Integration

Function decorators → automatic schema generation
Built-in tool execution
Error isolation per tool
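
To show what "automatic schema generation" saves you, here is a simplified sketch that derives the earlier `web_search` schema from a plain function signature. It is not how the SDK implements it; `tool_schema` and the type mapping are assumptions, and only a few primitive types are handled.

```python
import inspect

_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def tool_schema(func) -> dict:
    """Build a function-calling schema from a function's signature and docstring."""
    properties, required = {}, []
    for name, param in inspect.signature(func).parameters.items():
        # Unannotated or unknown types fall back to "string"
        properties[name] = {"type": _JSON_TYPES.get(param.annotation, "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)  # no default means the model must supply it
    return {
        "type": "function",
        "function": {
            "name": func.__name__,
            "description": inspect.getdoc(func) or "",
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


def web_search(query: str, max_results: int = 5) -> str:
    """Search the web for information"""
    return f"results for {query}"  # placeholder body
```

Calling `tool_schema(web_search)` yields the same structure as the hand-written dictionary, so adding a tool becomes a matter of writing the function.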

✅ Future-Proof

New API features → automatic SDK updates
Backward compatibility
Performance optimizations

When Direct API Still Makes Sense

Use direct calls when:

Simple, one-off requests
Custom authentication flows
Extreme performance requirements
SDK doesn't support your use case

Use SDK when:

Building conversational agents
Need tool/function calling
Multiple agents coordination
Production systems

The Real Cost

Direct API approach cost me:

3 months of development time
Constant bug fixes
Missed features (couldn't implement advanced flows)
Team frustration

SDK approach gave me:

2 weeks to production
Focus on business logic, not plumbing
Easy feature additions
Happier developers

My Recommendation

Start with the SDK. Even if you think you need direct control.

The time you "save" with direct HTTP calls gets consumed 10x over in error handling, context management, and maintenance.

Only go direct if you have a specific, justified reason. And even then, build an abstraction layer so you can switch later.

What's Your Experience?

Are you using direct API calls or SDKs for AI integrations?
What pain points have you hit?
Have you made the switch from one approach to another?

I'm curious about edge cases where direct calls are still the better choice. What am I missing?

5 Brutal Lessons from Building a Multi-Agent AI System (And How to Avoid My Epic Fails)

daniele pelleri — Sat, 16 Aug 2025 17:10:05 +0000

What happens when you go from "hello world" AI to orchestrating an entire team of agents that need to collaborate without destroying each other? Spoiler: everything that can go wrong, will go wrong.

After 6 months of development and $3,000 burned in API calls, I learned some brutal lessons building an AI orchestration system. This isn't your typical polished tutorial—these are the real epic fails nobody tells you about in those shiny conference presentations.

🔥 Lesson #1: "The Agent That Wanted to Hire Everyone"

The Fail: My Director AI, tasked with composing teams for projects, consistently created teams of 8+ people to write a single email. Estimated budget: $25,000 for 5 lines of text.

The Problem: LLMs, when unconstrained, tend to "over-optimize." Without explicit limits, my agent interpreted "maximum quality" as "massive team."

The Fix:

# Before (disaster)
prompt = "Create the perfect team for this project"

# After (reality)
prompt = f"""
Create a team for this project.
NON-NEGOTIABLE CONSTRAINTS:
- Max budget: {budget} USD
- Team size: 3-5 people MAX
- If you exceed budget, proposal will be automatically rejected
"""

Takeaway: AI agents without explicit constraints are like teenagers with unlimited credit cards.

⚡ Lesson #2: Race Conditions Are Hell

The Fail: Two agents grabbed the same task simultaneously, duplicating work and crashing the database.

WARNING: Agent A started task '123', but Agent B had already started it 50ms earlier.
ERROR: Duplicate entry for key 'PRIMARY' on table 'goal_progress_logs'.

The Problem: "Implicit" coordination through shared database state isn't enough. In distributed systems, 50ms latency = total chaos.

The Fix: Atomic task claiming with a conditional update, so the database decides who wins

# Atomic task acquisition
update_result = supabase.table("tasks") \
    .update({"status": "in_progress", "agent_id": self.id}) \
    .eq("id", task_id) \
    .eq("status", "pending") \
    .execute()

if len(update_result.data) == 1:
    # Won the race - proceed
    execute_task(task_id)
else:
    # Another agent was faster - find another task
    logger.info(f"Task {task_id} taken by another agent")

Takeaway: In multi-agent systems, "probably works" = "definitely breaks."

💸 Lesson #3: $40 Burned in 20 Minutes of CI Tests

The Fail: My integration tests made real calls to GPT-4. Every GitHub push = $40 in API calls. Daily budget burned before breakfast.

The Problem: Testing AI systems without mocks is like load-testing with a live credit card.

The Fix: AI Abstraction Layer with intelligent mocks

class MockAIProvider:
    def generate_response(self, prompt: str) -> str:
        # Deterministic responses for testing
        if "priority" in prompt.lower():
            return '{"priority_score": 750}'
        return "Mock response for testing"

# Environment-based switching
if os.getenv("TESTING"):
    ai_provider = MockAIProvider()
else:
    ai_provider = OpenAIProvider()

Result: Test costs down 95%, speed up 10x.

Takeaway: An AI system that can't be tested cheaply is a system that can't be developed.

🌀 Lesson #4: The Infinite Loop That Never Ends

The Fail: An "intelligent" agent started creating sub-tasks of sub-tasks of sub-tasks. After 20 minutes: 5,000+ pending tasks, system completely frozen.

INFO: Agent A created Task B
INFO: Agent B created Task C  
INFO: Agent C created Task D
... [continues for 5,000 lines]
ERROR: Workspace has 5,000+ pending tasks. Halting operations.

The Problem: Autonomy without limits = autopoietic chaos.

The Fix: Anti-loop safeguards

# Task delegation depth limit
if task.delegation_depth >= MAX_DEPTH:
    raise DelegationDepthExceeded()

# Workspace task rate limiting  
if workspace.tasks_created_last_hour > RATE_LIMIT:
    workspace.pause_for_cooldown()

Takeaway: Autonomous agents need "circuit breakers" more than any other system.
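
The rate-limit half of that safeguard can be sketched as a sliding window with a cooldown. `TaskRateLimiter`, its defaults, and the injectable `clock` are assumptions for illustration, not the system's actual implementation.

```python
import time
from collections import deque


class TaskRateLimiter:
    """Blocks task creation once too many tasks appear within a time window."""

    def __init__(self, limit, window_s=3600.0, cooldown_s=300.0, clock=time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.clock = clock
        self._created = deque()   # timestamps of recent task creations
        self._paused_until = 0.0

    def allow(self) -> bool:
        """Record one task creation attempt; False means the breaker is open."""
        now = self.clock()
        if now < self._paused_until:
            return False
        while self._created and now - self._created[0] > self.window_s:
            self._created.popleft()  # forget creations older than the window
        if len(self._created) >= self.limit:
            self._paused_until = now + self.cooldown_s  # trip the breaker
            return False
        self._created.append(now)
        return True
```

Call `allow()` before every task insert; a `False` is the signal to stop delegating and surface the problem to a human.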

🎭 Lesson #5: AI Has Its Own Bias (Not the Ones You Think)

The Fail: My AI-driven prioritization system systematically preferred tasks that "sounded more important" vs tasks that were actually business-critical.

The Problem: LLMs optimize for "sounding right" not "being right." Bias toward pompous corporate language.

The Fix: Objective metrics + AI reasoning

def calculate_priority(task, context):
    # Objective factors (non-negotiable)
    base_score = (
        task.blocked_dependencies_count * 100 +
        task.age_days * 10 +
        task.business_impact_score
    )

    # AI enhancement (subjective)
    ai_modifier = get_ai_priority_assessment(task, context)

    return min(base_score + ai_modifier, 1000)  # Cap at 1000

Takeaway: AI for creativity, deterministic rules for critical decisions.
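
One way to stop the AI term from dominating is to clamp it before adding it to the base score. The clamp and the `MAX_AI_ADJUSTMENT` value are my own assumptions layered on the formula above, and the `Task` dataclass is a stand-in for the real model.

```python
from dataclasses import dataclass

MAX_AI_ADJUSTMENT = 100  # hypothetical bound on how far the model can move a score


@dataclass
class Task:
    blocked_dependencies_count: int
    age_days: int
    business_impact_score: int


def hybrid_priority(task: Task, ai_modifier: float) -> float:
    """Combine measurable factors with a clamped AI adjustment, capped at 1000."""
    base_score = (
        task.blocked_dependencies_count * 100
        + task.age_days * 10
        + task.business_impact_score
    )
    # Clamp so persuasive wording cannot outweigh the objective factors
    bounded = max(-MAX_AI_ADJUSTMENT, min(ai_modifier, MAX_AI_ADJUSTMENT))
    return max(0, min(base_score + bounded, 1000))
```

With the clamp in place, a task that merely sounds urgent can move up by at most `MAX_AI_ADJUSTMENT` points, while one blocking several others still outranks it.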

🚀 What's Next?

These are just 5 of the 42+ lessons I documented building this system. Each fail led to architectural patterns I now use systematically.

The journey from "single agent demo" to "production orchestration system" taught me that the real engineering isn't in the AI—it's in everything around it: coordination, memory, error handling, cost management, and quality gates.

Question for the community: What's been your most epic fail working with AI/agents? How did you solve it?

If anyone's facing similar challenges in AI orchestration, happy to dive deeper into the technical details. This rabbit hole goes deep!