How Session Replay + Online Evals Revealed How My Holiday Pet App Actually Works
Original article published on December 17, 2025.
I added LaunchDarkly observability to my Christmas-play pet casting app thinking I'd catch bugs. Instead, I unwrapped the perfect gift 🎁. Session replay shows me WHAT users do, and online evaluations show me IF my model made the right casting decision with real-time accuracy scores. Together, they're like milk 🥛 and cookies 🍪 - each good alone, but magical together for production AI monitoring.
See the App in Action
Discovery #1: Users' 40-second patience threshold
I decided to use session replay to measure the average time it took users to get through each step of the AI casting process. Session replay is LaunchDarkly's tool that records user interactions in your app - every click, hover, and page navigation - so you can watch exactly what users experience in real time.
The complete AI casting process takes 30-45 seconds: personality analysis (2-3s), role matching (1-2s), DALL-E 3 costume generation (25-35s), and evaluation scoring (2-3s). That's a long time to stare at a loading spinner wondering if something broke.
What are progress steps?
Progress steps are UI elements I added to the app - not terminal commands or backend processes, but visual indicators in the web interface that show users which phase of the AI generation is currently running. They appear as a simple list on the loading screen and update in real time as each AI task completes, starting automatically when the user clicks "Get My Role!"
Session replay revealed:
| Time elapsed | WITHOUT progress steps (n=20 early sessions) | WITH progress steps (n=30 after adding them) |
| --- | --- | --- |
| 0-10 seconds | 20/20 still watching (100%) | 30/30 still watching (100%) |
| 10-20 seconds | 18/20 still watching (90%) | 29/30 still watching (97%) |
| 20-30 seconds | 14/20 still watching (70%) - rage clicks begin | 25/30 still watching (83%) |
| 30-40 seconds | 9/20 still watching (45%) - tab switching detected | 23/30 still watching (77%) |
| 40+ seconds | 7/20 still watching (35% stay) | 24/30 still watching (80% stay!) |

Critical Discovery: Progress steps more than DOUBLED the completion rate (35% → 80%).
This made the difference:
Clear progress steps:
- Step 1: AI Casting Decision
- Step 2: Generating Costume Image (10-30s)
- Step 3: Evaluation

As each completes:
- ✅ Step 1: AI Casting Decision
- Step 2: Generating Costume Image (10-30s)
- Step 3: Evaluation
Session replay showed users hovering over the back button at 25 seconds, then relaxing when they saw "Step 2: Generating Costume Image (10-30s)." The moment they understood DALL-E was creating their pet's costume (not the app freezing), they were willing to wait. Clear progress indicators transform anxiety into patience.
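For reference, here's a minimal sketch of how progress indicators like these can be wired up. The element IDs, API paths, and timings in the comments are illustrative placeholders, not the app's actual code:

```javascript
// Minimal sketch: update the on-screen progress steps as each AI phase completes.
// The element IDs and API paths below are illustrative, not the app's real ones.
function markStepDone(stepNumber) {
  const el = document.getElementById(`step-${stepNumber}`);
  if (el) el.textContent = `✅ ${el.textContent}`;
}

async function callApi(path, payload) {
  const res = await fetch(path, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload),
  });
  return res.json();
}

async function runCastingPipeline(quizAnswers, photoUrl) {
  const role = await callApi('/api/casting-decision', { quizAnswers, photoUrl }); // ~2-5s
  markStepDone(1);

  const image = await callApi('/api/costume-image', { role });                    // DALL-E 3, ~25-35s
  markStepDone(2);

  const evaluation = await callApi('/api/evaluate', { role, image });             // ~2-3s
  markStepDone(3);

  return { role, image, evaluation };
}
```

The key design choice: each indicator flips the moment its backend call resolves, so the long DALL-E wait is framed as "Step 2 in progress" rather than a frozen spinner.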
Discovery #2: Observability + online evaluations give the complete picture
Session replay shows user behavior and experience. Online evaluations expose AI output quality through accuracy scoring. Together, they form a solid strategy for AI observability.
To see this in action, let's take a closer look at an example.
Example: The speed-running corgi owner
In this scenario, a user blazes through the entire pet app setup, from the initial quiz to the final results, in record time. So fast, in fact, that instead of leading to a favorable outcome, the speed ended up killing the quality.
Session Replay Showed:
- Quiz completed in 8 seconds (world record) - they clicked the first option for every question
- Skipped photo upload entirely
- Waited the full 31 seconds for processing
- Got their result: "Sheep"
- Started rage clicking on the sheep image immediately
- Left the site without saving or sharing
Why did their energetic corgi get cast as a sheep? The rushed quiz responses created a contradictory personality profile that confused the AI. Without a photo to provide visual context, the model defaulted to its safest, most generic casting choice.
Online Evaluation Results:
- Evaluation Score: 38/100 ❌
- Reasoning: "Costume contains unsafe elements: eyeliner, ribbons"
- Wait, what? The AI suggested face paint and ribbons; the evaluation said NO
Online evaluations use a model-agnostic evaluation (MAE) - an AI agent that evaluates other AI outputs for quality, safety, or accuracy. The out-of-the-box evaluation judge is overly cautious about physical safety. For the scenario above, the evaluation commented:
- "Costume includes eyeliner which could be harmful to pets" (It's a DALL-E image!)
- "Ribbons pose entanglement risk"
- "Bells are a choking hazard" (It's AI-generated art!)
About 40% of low scores are actually the evaluation being overprotective about imaginary safety issues, not bad casting.
Speed-runners get generic roles AND the evaluation writes safety warnings about digital costumes. Users see these low scores and think the app doesn't work well.
But speed-running isn't the whole story. To truly understand the relationship between user engagement and AI quality, we need to see the flip side: the perfect user, one who gives the AI everything it needs to succeed. What happens when a user takes their time and engages thoughtfully with every step?
Example: The perfect match
Session Replay Showed:
- 45 seconds on quiz (reading each option)
- Uploaded photo, waited for processing
- Spent 2 minutes on results page
- Downloaded image multiple times
Online Evaluation Results:
- Evaluation Score: 96/100 ⭐⭐⭐⭐⭐
- Reasoning: "Personality perfectly matches role archetype"
- Photo bonus: "Visual traits enhanced casting accuracy"
Time invested = Quality received. The AI rewards thoughtfulness.
Discovery #3: The photo upload comedy gold mine
Session replay revealed what photos people ACTUALLY upload. Without it, you'd never know that one in three photo uploads is problematic, and you'd be flying blind on whether to add validation or trust your model.
Example: The surprising photo upload analysis
Session Replay Showed:
Photo Upload Analysis (n=18 who uploaded):
- 12 (67%) Normal pet photos
- 2 (11%) Screenshots of pet photos on their phone
- 1 (6%) Multiple pets in one photo (chaos)
- 1 (6%) Blurry "pet in motion" disaster
- 1 (6%) Stock photo of their breed (cheater!)
Despite 33% problematic inputs, evaluation scores remained high (87-91/100). The AI is remarkably resilient.
Example: When "bad" photos produce great results
My Favorite Session: Someone uploaded a photo of their cat mid-yawn. The AI vision model described it as "displaying fierce predatory behavior." The cat was cast as a "Protective Father." Evaluation score: 91/100. The owner downloaded it immediately.
The Winner: Someone's hamster photo that was 90% cage bars. The AI somehow extracted "small fuzzy creature behind geometric patterns" and cast it as "Shepherd" because "clearly experienced at navigating barriers." Evaluation score: 87/100.
Without session replay, you'd only see evaluation scores and think "the AI is working well." But session replay reveals users are uploading screenshots and blurry photos—input quality issues that could justify adding photo validation.
However, the high evaluation scores prove the AI handles imperfect real-world data gracefully. This insight saved me from over-engineering photo validation that would have slowed down the user experience for minimal quality gains.
Session replay + online evaluations together answered the question "Should I add photo validation?" The answer: No. Trust the model's resilience and keep the experience frictionless.
The magic formula: Why this combo works (and what surprised me)
Without Observability:
- "The app seems slow" → ¯\(ツ)/¯
- "We have 20 visitors but 7 completions" → Where do they drop?
With Session Replay ONLY:
- "User got sheep and rage clicked; maybe left angry" → Was this a bad match?
With Model-Agnostic Evaluation ONLY:
- "Evaluation: 22/100 - Eyeliner unsafe for pets" → How did the user react?
- "Evaluation: 96/100 - Perfect match!" → How did this compare to the image they uploaded?
With BOTH:
"User rushed, got sheep with ribbons, evaluation panicked about safety"
→ The OOTB evaluation treats image generation prompts like real costume instructions"40% of low scores are costume safety, not bad matching"
→ Need custom evaluation criteria (coming soon!)"Users might think low score = bad casting, but it's often = protective evaluation"
→ Would benefit from custom evaluation criteria to avoid this confusion
The evaluation thinks we're putting actual ribbons on actual cats. It doesn't realize these are AI-generated images. So when the casting suggests "sparkly collar with bells," the evaluation judge practically calls animal services.
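The fix I keep pointing at is custom evaluation criteria. As a rough sketch (the prompt wording below is illustrative, not the app's actual configuration), custom judge instructions would simply tell the evaluator it's scoring an AI-generated image rather than a physical costume:

```javascript
// Hypothetical custom judge instructions for the online evaluation.
// The key change: make it explicit that the "costume" is a generated image,
// so the judge scores casting fit instead of physical safety.
const castingJudgePrompt = `
You are evaluating a pet casting decision for a Christmas-play app.
The costume exists only as an AI-generated illustration; no physical items
(ribbons, bells, face paint) ever touch a real animal, so do not deduct
points for physical safety hazards.
Score 0-100 on how well the assigned role matches the pet's personality
profile and, if a photo was provided, its visual traits. Briefly explain why.
`;
```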
Now that you've seen what's possible when you combine user behavior tracking with AI quality scoring, let's walk through how to add this same observability magic to your own multi-modal AI app.
Your turn: See the complete picture
Want to add this observability magic to your own app? Here's how:
1. Install the packages
npm install @launchdarkly/observability
npm install @launchdarkly/session-replay
2. Initialize with observability
import { initialize } from 'launchdarkly-js-client-sdk';
import Observability from '@launchdarkly/observability';
import SessionReplay from '@launchdarkly/session-replay';

// Your LaunchDarkly client-side ID and the user/context for the current visitor
const clientId = 'your-client-side-id';
const user = { kind: 'user', key: 'example-user-key' };

const ldClient = initialize(clientId, user, {
  plugins: [
    new Observability(),
    new SessionReplay({
      // 'strict' masks all data on the page - see
      // https://launchdarkly.com/docs/sdk/features/session-replay-config#expand-javascript-code-sample
      privacySetting: 'strict'
    })
  ]
});
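Once the client is initialized, it can also help to record a custom event when a casting run finishes, so outcomes line up with replays and evaluation scores in the dashboard. A minimal sketch using the standard `ldClient.track()` call - the event name and payload shape are my own, not something the SDK prescribes:

```javascript
// Record the casting outcome as a custom event so it can be correlated
// with session replays and evaluation scores. Event name/payload are illustrative.
ldClient.track('casting-completed', {
  role: 'Shepherd',        // role the AI assigned
  evaluationScore: 87,     // online evaluation score for this output
  photoUploaded: true      // whether the user supplied a pet photo
});
```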
3. Configure online evaluations in the dashboard
- Create your AI Config in LaunchDarkly
- Enable automatic accuracy scoring
- Set the accuracy weight to 100%
- Monitor your AI outputs in production with real-time evaluation scores
4. Connect the dots
Session replay shows you:
- Where users drop off
- What confuses them
- When they rage click
- How long they wait
Online evaluations show you:
- AI decision accuracy scores
- Why certain outputs scored low
- Pattern of good vs bad castings
- Safety concerns (even for pixels!)
Together they reveal the complete story of your AI app.
Resources to get started:
- Full Implementation Guide - See how this pet app implements both features
- Session Replay Tutorial - Official LaunchDarkly guide for detecting user frustration
- When to Add Online Evals - Learn when and how to implement AI evaluation
The real magic is in having observability AND online evaluations.
Try it yourself
Cast your pet: https://scarlett-critter-casting.onrender.com/
See your evaluation score ⭐. Understand why your cat is a shepherd and your dog is an angel. The AI has spoken, and now you can see exactly how much to trust it!
Ready to add AI observability to your multi-modal agents?
Don't let your AI operate in the dark this holiday season. Get complete visibility into your multi-modal AI systems with LaunchDarkly's online evaluations and session replay.
Get started: Sign up for a free trial → Create your first AI Config → Enable session replay and online evaluations → Ship with confidence.