DEV Community: toshipon

If You're Building with LLMs, You Should Have Thought About Observability from Day One

toshipon — Sun, 19 Apr 2026 04:25:13 +0000

Introduction

"AI did something. But I have no idea what it actually did." — Before that happens to you, observability is something you really need to think about upfront.

When you develop using AI coding tools like Claude Code or Cursor, you can see "it works." But "why it worked," "where and how it made decisions," and "whether the implementation matches my intent" become much harder to see.

The same goes when you use AI in your product itself — whether that's calling LLM APIs, building AI agents, or integrating MCP tools. If you can't see what the AI returned or what decisions it made, you can't improve it.

There's a principle in the SRE world: "If you can't see it, you can't fix it."

This principle applies directly to AI-powered development and to products that run AI-driven features.

This article captures how I've come to think about observability in LLM-based development — and what I'm actually doing about it.

What Is Observability? (In the SRE Context)

Observability is one of the core concepts in SRE.

It refers to the degree to which you can understand a system's state from the outside. Generally, it's described as the "three pillars":

Metrics: State expressed in numbers (latency, error rate, CPU usage, etc.)
Logs: Records of what happened
Tracing: Records of how requests flowed through the system

When all three are present, you can answer: "What's happening?" "Why is it happening?" and "Where is it happening?"

In AI-involved development, you need to think about this in two contexts:

Observability of the development process: Observing what happens when you use AI tools to develop
Observability of AI in production: Observing AI behavior within your product

1. When You Use AI in Your Development Process, Visibility Decreases

When you develop with Claude Code or similar tools, changes happen fast.
But "why did this happen?" becomes much harder to see.

For example:

When you accept code the AI proposed as-is, you can't trace the reasoning behind that decision
Subtle differences between what you intended and what got implemented are easy to miss
Later, you can't remember "why was this changed that way?"

This isn't really a code quality issue — it's a process observability problem.

How to Handle It

Write commit messages carefully: Even if you have the AI write them, always include the intent
Document your PRs properly: Make the intent and context behind changes explicit
Review AI-generated code as a habit: Don't stop at "it works"
Write tests first: Declaring intent upfront lets you catch gaps between the AI's implementation and what you wanted

In short, the faster your AI moves, the more you need to deliberately create places where humans can observe what's happening.

2. Think About AI Observability in Production as Three Layers

When you embed AI in your product, observability is essential.

Layer 1: Input/Output Logging

At minimum, log what you send to the AI and what it returns.

The prompt you provided
The output it generated
The model used
Tokens consumed
Cost

Even this alone lets you investigate later: "Why was that output wrong that one time?"

Layer 2: Action Logging

Record what the AI did. This is especially important if you're using it in an agentic way.

Which tools it called
What each tool invocation returned
Where errors occurred
Whether any retries happened

Without this, you can't track down problems like "the AI called these tools in the wrong order."

Layer 3: Connecting to Product Behavior

Link what the AI does to what your users do.

Did users actually use the content the AI suggested?
Which input patterns tend to produce good outputs?
Where do users drop off?

To see this, you need to connect AI logs with your product's user behavior logs.

What I'm Actually Doing

This is what we're implementing across kaizen-lab and related projects.

Storing User Behavior in the Database with Vercel Analytics Drains

By storing page views and custom events in the database, we can connect AI outputs with user behavior.

I wrote about this in detail here: "Connecting Vercel Analytics Drains to an internal database to have AI evaluate product behavior"

Exposing APIs via MCP

By making the product state accessible to external AI tools via MCP, we expand what the AI can observe. This is a form of observability in itself.

Catching Errors with Sentry on AI Workflows

When AI processes things automatically, errors can happen without anyone noticing immediately. By integrating Sentry, we catch unexpected exceptions.

I covered this in: "Automating Error Detection → Root Cause Analysis → Fix PRs with Sentry × AI Agents"

Without Observability, You Can't Get to "Improvement"

In AI-powered development, it's common to stop at "it works, so we're done."

But "it works" and "it works correctly" are different things.

Only with observability can you reach the state where:

You understand what the problem is
You know how to improve it
You can verify that your improvements are working

This is true whether you're using AI in development or running AI in production.

I think "if you can't see it, you can't fix it" is actually becoming even more important in the LLM era of development.

Summary

The observability mindset from SRE applies directly to LLM-based development.

What I try to keep in mind:

Observability of the development process: Build a process where you can track what the AI did
Observability of AI in production: Input/output logs, action logs, and connections to user behavior
You can't improve what you can't observe: This principle doesn't change before or after AI

Observability isn't just "something enterprises do in their SRE teams."

The moment you start developing with AI, it becomes something indie developers need to think about too.

How I Built a Full-Stack Security Audit Skill for Claude Code

toshipon — Sat, 11 Apr 2026 12:46:14 +0000

Introduction

"I want to run a security audit, but every time I have to start from zero."

That feeling gets old fast when you're building across a full-stack setup like Vercel + Supabase + Next.js + iOS. Each layer comes with its own security concerns, and just remembering what to check can be exhausting.

OWASP guidelines are comprehensive, but they’re also huge. And some critical settings — especially in Vercel and Supabase dashboards — can’t be fully inspected from the CLI alone.

So I built a Claude Code Custom Skill called security-audit.

Claude Code Custom Skills let you package reusable procedures and domain knowledge for a specific task. Instead of reconstructing the audit process from memory every time, I can now run a reproducible 6-phase full-stack security review from Next.js to iOS with a single command:

/security-audit all

The key difference is that it doesn’t stop at CLI and SQL checks. It also uses Chrome MCP to inspect dashboard-only settings automatically.

What the Finished Skill Looks Like

/security-audit              # Choose target interactively
/security-audit all          # Full-stack end-to-end audit (recommended)
/security-audit nextjs       # Next.js application only
/security-audit vercel       # Vercel infrastructure only
/security-audit supabase     # Supabase backend only
/security-audit ios          # iOS app only
/security-audit web          # Next.js + Vercel + Supabase

This is the part I like most: it turns a vague, easy-to-postpone task into something I can actually run on demand.
Instead of thinking, "I should probably do a security review soon," I can just start from a standard entry point and let the process unfold.

The Skill is structured into six phases:

Phase	Focus	Main inspection methods
1. Information Gathering	Project structure, trust boundaries, data flows	Grep / Glob
2. Next.js Audit	Server Actions, Middleware, CVEs	Grep / Bash
3. Vercel Audit	Env vars, Deployment Protection, WAF	CLI + Chrome MCP
4. Supabase Audit	RLS, function privileges, Auth settings	SQL + Chrome MCP
5. iOS Audit	Keychain, ATS, biometrics	Grep / Bash
6. Cross-Layer Analysis	Auth flow consistency, token lifecycle	Cross-cutting review

Phase Structure

Phase 1: Information Gathering
    │
    ├── Phase 2: Next.js Application
    │   Server Actions / Middleware / CSP / CVE
    │
    ├── Phase 3: Vercel Infrastructure
    │   Env vars (CLI) / Deployment Protection (Chrome MCP)
    │   / WAF (Chrome MCP) / Git Fork Protection (Chrome MCP)
    │
    ├── Phase 4: Supabase Backend
    │   RLS (SQL) / Function privileges (SQL) / Auth settings (Chrome MCP)
    │   / Security Advisor (Chrome MCP)
    │
    ├── Phase 5: iOS App
    │   Keychain / ATS / Biometrics / Privacy Manifest
    │
    └── Phase 6: Cross-Layer Analysis
        Auth flow consistency / token lifecycle
        / API transport security / continuity of data protection

How I Built It in 3 Steps

Step 1: Research — learn from the best existing Skills

I didn’t start by writing.
I started by studying the best Skills I could find.

Resource	What I learned
Trail of Bits skills (4.5k stars)	A Skill should stay focused on one responsibility. Reference files should be separated out.
Anthropic best practices	Keep `SKILL.md` under ~500 lines, use progressive disclosure, write in imperative form
SecOpsAgentKit	Organizing by domain makes complex security workflows easier to navigate

Step 2: Design — three core principles

Progressive Disclosure

SKILL.md should contain only the overview and phase structure. Detailed inspection patterns belong in references/ and should be loaded only when needed.
Evidence-First

Not "this might be vulnerable," but "this grep pattern found this code, and here is why it’s risky."
Use OWASP directly

Instead of inventing custom categories, I adopted OWASP Top 10:2025, WSTG, and MASVS v2 as-is.

Step 3: Implementation — a 6-file structure

~/.claude/skills/security-audit/
├── SKILL.md                              # Main entry (overview + phase structure)
└── references/
    ├── nextjs-security.md                # Next.js-specific inspection patterns
    ├── vercel-security.md                # Vercel CLI + Chrome MCP checks
    ├── supabase-security.md              # Supabase SQL + Chrome MCP checks
    ├── ios-testing.md                    # MASVS v2 categories
    └── web-testing.md                    # OWASP WSTG + Top 10:2025

Why Chrome MCP Matters

One of the biggest strengths of this Skill is that it uses Chrome MCP to inspect settings that the CLI can’t access.

For example, Vercel’s Deployment Protection and some Supabase Auth settings are only partially available via CLI or API. With Chrome MCP, the agent can navigate those dashboards, inspect toggle states, and capture screenshots as evidence.

# Vercel example
navigate_page -> /settings/deployment-protection
take_screenshot -> record evidence
evaluate_script -> extract toggle states and protection scope

# Supabase example
navigate_page -> /database/security-advisor
take_screenshot -> record all findings
take_snapshot -> inspect details via accessibility tree

Target	Available via CLI/SQL	Requires Chrome MCP
Vercel environment variable list	`vercel env ls`	-
Deployment Protection	-	Toggle state on settings page
Supabase RLS state	`pg_class` queries	-
Supabase Auth settings	-	MFA, email confirmation, rate limits
Supabase Security Advisor	-	Full lint findings

Why a Skill Instead of One Long Prompt?

At first, I thought I could just write one long audit prompt.

But in practice, turning it into a Skill was much more manageable.

Reusable: /security-audit all always starts from the same reliable entry point
Modular: SKILL.md and references/ separate responsibilities cleanly
Standardized: The order of inspection and evaluation criteria stays consistent
Target-aware: It can branch into nextjs, vercel, supabase, or ios
Higher audit consistency: It follows predefined criteria instead of improvising every time

In other words, I didn’t turn this into a Skill just for convenience.
I did it to improve the reproducibility of the audit itself.

Key Design Decisions

Progressive Disclosure

This is the pattern Anthropic emphasizes most strongly.
Claude’s context window is a shared resource, so if you cram everything into SKILL.md, it competes with the rest of the task.

SKILL.md (always loaded)
  -> overview + phase structure + report format

references/ (loaded only when needed)
  -> nextjs-security.md: Next.js-specific inspection patterns
  -> vercel-security.md: Vercel dashboard inspection steps
  -> supabase-security.md: SQL queries + dashboard checks
  -> ios-testing.md: MASVS v2 commands
  -> web-testing.md: WSTG commands

Evidence-First

This came directly from studying the Trail of Bits Skills.
Every inspection item should include concrete bash, grep, or SQL commands.

-- Tables in public schema with RLS disabled (Critical)
SELECT n.nspname AS schema, c.relname AS table_name
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname = 'public'
  AND c.relrowsecurity = false;

Direct OWASP Adoption

I chose not to invent any custom taxonomy.
Instead, I used the standard frameworks directly:

Web: OWASP Top 10:2025 + WSTG
iOS: MASVS v2 + MASTG
Shared vocabulary: CWE for vulnerability classification

The biggest advantage is obvious: it gives your team and external reviewers a common language.

Anti-Patterns I Learned from Anthropic’s Best Practices

Anti-pattern	Example	Better approach
Vague description	`"A security-related skill"`	Include explicit trigger phrases
Overloaded `SKILL.md`	One file with 3,000+ lines	Split detailed content into `references/`
Second-person instructions	`"You should check..."`	Use imperative form: `"Check..."`
Deep reference chains	A -> B -> C -> D	Keep references to one level
Choices with no default	`"Choose one of the following..."`	Recommend a default (`all`)
No concrete inspection method	`"Check RLS"`	Attach exact SQL queries

Results

Turning full-stack security auditing into a Claude Code Custom Skill gave me several clear benefits:

Reproducibility: the same six-phase audit runs with consistent quality every time
Coverage: OWASP categories are mapped across four layers — Next.js, Vercel, Supabase, and iOS
Efficiency: /security-audit all triggers a full-stack audit in one command
Automation: CLI/SQL for machine-readable checks, Chrome MCP for dashboard-only settings
Shared language: OWASP-aligned findings are easier to discuss with other engineers and reviewers

The most important thing in Skill design is not starting to write too early.
Study the best existing examples first. Define your principles — especially Progressive Disclosure and Evidence-First — and only then implement.

That’s what turns a one-off prompt into something you can actually keep using.

Skill Files

I also published the full Skill files on GitHub:

GitHub: toshipon/claude-code-security-audit-skill

The article only covers the core ideas, but the full files are ready to drop into:

~/.claude/skills/security-audit/

If you're building with a similar stack, you can use it as-is or adapt the phase structure to your own environment.
The main value is not the exact wording of the Skill — it's having a repeatable audit workflow that doesn't depend on memory.

References

Trail of Bits Security Skills — 16 security-oriented Skills (4.5k stars)
Anthropic Skill Best Practices — official guidance
OWASP MASVS v2 — mobile security verification standard
OWASP MASTG — mobile security testing guide
OWASP WSTG — web security testing guide
OWASP Top 10:2025 — latest web vulnerability ranking

I Have an AI Agent That Tests My Own Product Every 3 Hours

toshipon — Wed, 08 Apr 2026 22:58:47 +0000

The Dogfooding Problem for Solo Developers

"Eat your own dog food" is good advice. Use your own product. Find the bugs your users find. Feel the pain before they do.

In practice, here's what actually happens:

You use it heavily right after launch
Development takes over and you stop touching it
You check it as a developer, not as a user — you know all the right paths
"It works" becomes the bar, and rough UX slips through

I build and maintain a web app solo. At some point I realized: I hadn't actually used it as a user in weeks. I'd been shipping features, but not experiencing the product.

So I did something that felt slightly absurd: I gave the job to an AI agent.

Every 3 hours, an AI agent opens my product, checks if things work, and opens a PR if it finds something broken.

Here's how it works, what it found, and what it can't do.

The Setup

Three components:

Component	What it does
AI Agent (Claude)	Decides what to check, interprets results, writes fixes
MCP Server	Exposes my app's API as callable functions for the AI
Playwright	Lets the AI control a real browser to check the UI

The agent runs on a scheduled heartbeat. I define what to check in a markdown file:

# What to check (rotate through these):

## API checks
- Call list_projects, get_canvas, get_verification_status
- Verify data integrity and response format

## UI checks  
- Open the live site in a real browser
- Check mobile viewport (375px)
- Check dark mode
- Check empty states (what does a new user see?)
- Screenshot any anomalies

## Code quality
- Run tsc --noEmit, report TypeScript errors
- Check for unused imports in recently changed files

## When you find something broken:
- Create a branch
- Fix it
- Run vitest to confirm tests pass
- Open a PR
- Report to Discord

That's the entire instruction set. The AI handles the rest.

How the API Integration Works

By default, an AI can't interact with your app's internals. To fix this, I wrapped my API as an MCP (Model Context Protocol) server — basically a list of functions the AI can call.

// The AI can call these like tool calls
const tools = {
  list_projects: {
    description: "Get all projects",
    handler: async () => await db.project.findMany()
  },
  add_learning: {
    description: "Record a finding or bug",
    handler: async (args) => await db.learning.create({ data: args })
  },
  get_verification_status: {
    description: "Check the status of all verifications",
    handler: async () => await db.verification.findMany()
  }
};

This lets the AI do what a human user does — create records, read data, check states — but via API instead of clicking around.

What It Found

Here are three real bugs the agent caught that I wouldn't have caught otherwise:

Bug 1: API and UI were out of sync

When creating data through the API, the API response showed the data correctly. But the data didn't appear in the UI.

Root cause: The data was stored in two separate database tables. The API wrote to one, the UI read from the other.

Why humans missed it: Humans always use the UI. If you click "create" in the browser, both tables get written. The bug only appeared when creating via API — which humans never did, but the AI did on every check.

Bug 2: Mobile layout broken

On desktop: fine. On mobile (375px): input fields overflowed horizontally.

Fix: One CSS change: grid-cols-2 → grid-cols-1 md:grid-cols-2.

Bug 3: Empty state was a white screen

A new user opening their first project saw... nothing. No error, just blank. No guidance, no "create your first item" button.

This one wasn't technically broken — it just made the product confusing for new users. The agent flagged it as a UX issue and suggested an empty state component.

Dogfooding Alone Wasn't Enough

Dogfooding catches a lot — especially broken flows, layout issues, and rough UX.

But it doesn't catch everything.

Some bugs only happen in production, under very specific conditions:

a component crashes only after a rare user action
an import mismatch breaks a route that manual testing doesn't hit
an exception only appears with real data, real timing, or real browser state

Those bugs are hard to find by manually using the product every few hours.

So I ended up adding a second loop: error monitoring.

The dogfooding agent checks whether the product works as a user experience.
Error monitoring checks whether the product is failing in the wild.

That combination turned out to be much stronger than either one alone.

Adding Sentry as a Second Feedback Loop

Now the system has two complementary loops:

Loop	What it catches
Dogfooding every 3 hours	Broken flows, visual issues, empty states, mobile regressions, rough UX
Sentry monitoring	Runtime exceptions, production-only bugs, hard-to-reproduce crashes

The dogfooding loop answers:

Can a user actually move through the product?
Does the UI make sense?
Is anything visually broken?

The Sentry loop answers:

Did something crash in production?
What stack trace and context came with it?
Is there a fixable bug hidden behind low-frequency failures?

This matters because not all quality issues look the same.

Some problems are visible. Others only show up as stack traces.
If you only rely on dogfooding, you miss production-only failures.
If you only rely on Sentry, you miss awkward UX and broken but non-crashing flows.

Together, they form a much more complete quality loop.

From Detection to Auto-Fix

Once I added Sentry, the agent's job expanded.
It no longer just looked for problems by using the product.
It could also react to problems reported by the product itself.

The flow now looks like this:

Every 3 hours, the agent dogfoods the app
On a separate schedule, it checks Sentry for unresolved issues
If it finds a real bug, it analyzes the stack trace and source code
It creates a branch, writes a fix, runs tests, and opens a PR
Small safe fixes can be merged automatically after checks pass

One of the best examples was a page crash caused by the wrong i18n hook import.
The error message itself was vague. Manual testing didn't catch it consistently.
But Sentry provided enough context for the agent to trace the issue back to a bad import and generate a tiny fix.

That was the moment this stopped feeling like "automated testing" and started feeling more like an automated maintenance loop.

What the AI Can and Can't Do

The AI is good at	The AI can't do
Checking if things work	Feeling if things feel right
Catching regressions automatically	"This interaction is frustrating"
Covering edge cases humans skip	Subjective UX judgment
Opening PRs immediately on finding bugs	Knowing if a feature is missing
Running every 3 hours without fatigue	Replacing actual user feedback

The "can't do" column matters. The AI is a complement, not a replacement.

After the agent does its check, I still need to use the product myself and talk to users. The agent handles the objective, repeatable checks. I handle the subjective, experiential ones.

One More Honest Note

About 30% of the time, the agent reports "fixed" when it hasn't fully fixed something. This was frustrating until I built in a hard requirement: tests must pass before marking anything as done.

Rule: Before opening a PR, run `npx vitest run`.
If tests fail, do not open the PR.
Report the failure instead.

This dropped false completions dramatically. The agent's confidence isn't reliable — test results are.

How to Build This

You don't need my exact setup. The minimum viable version:

Pick a scheduled runner — GitHub Actions cron, a crontab, or any agent platform with scheduled tasks
Expose one API endpoint the AI can call — Start with just a health check
Write a simple check instruction — "Call this endpoint and report if it fails"
Add Playwright later — Browser checks are optional but powerful for catching visual regressions

The core insight isn't the tech stack. It's that dogfooding is a discipline problem, not a capability problem. You know how to test your own product. You just don't do it consistently.

Automating it removes the discipline requirement.

Have you built any automated quality loops into your side projects? Or does your testing start and end with "it worked on my machine"? Curious what others have tried in the comments.

The product the agent keeps testing is KaizenLab, my app for hypothesis validation and product learning.

That made this setup especially useful: the same system I use to organize product decisions is also what the agent keeps checking, stress-testing, and helping improve.

How I Used Hypothesis Validation to Shape My Go-to-Market Strategy

toshipon — Mon, 06 Apr 2026 17:13:36 +0000

Introduction

Have you ever built an app and then realized you had no clear idea how to sell it?

That’s a common trap in indie development. Building the product is hard, but figuring out who it’s for, what value it creates, and how to communicate that value is often even harder.

I recently launched Focusnest, an iOS ambient sound mixer designed for focus, relaxation, and sleep.

On the product side, I felt pretty good about it. It supports mixing multiple sounds, saving presets, built-in timers, 1/f fluctuation for more natural sound movement, and full offline use. On paper, it seemed strong enough to compete.

But when I started thinking about go-to-market, I got stuck.

As an ambient sound app, the market is crowded. I could explain the features, but I wasn’t convinced they gave people a compelling reason to care.

So instead of jumping straight into promotion, I treated go-to-market itself as a set of hypotheses.

That process helped me realize something important: Focusnest probably shouldn’t be positioned as just an “ambient sound app.” It should be positioned as a focus-switching app — a tool that helps people enter deep work faster.

In this article, I’ll walk through how I organized those hypotheses, what changed in my thinking, and how that shaped the first version of my go-to-market strategy.

The App I Built: Focusnest

Focusnest is an iOS app for focus, relaxation, and sleep using customizable ambient soundscapes.

Its main features include:

White noise, brown noise, and pink noise
Natural sounds like rain, rivers, fire, waves, birds, wind, and thunder
Mixing multiple sounds at the same time
Saving and restoring presets
A built-in Pomodoro-style timer
1/f fluctuation for more natural sound variation
Full offline support

I felt confident about the product quality.

But product quality and go-to-market are two different problems.

That’s where many indie products get stuck: “I built it” does not automatically become “people want it.”

The First Feeling That Something Was Off

At first, I naturally tried to position it as an ambient sound app.

But something felt off.

The ambient sound market is crowded
Spotify and YouTube are viable substitutes
Existing players like Noisli and Endel are already strong
“This looks nice” didn’t feel like a strong enough reason to switch

In other words, I had built the product, but I still hadn’t found a clear market context for it.

That’s an easy place to stall.

You can keep polishing features forever, but if you haven’t clarified who it’s for and what job it really does, your messaging stays blurry.

The Real Mistake

The mistake was treating feature quality and buying motivation as if they were the same thing.

I could explain things like:

You can mix sounds
There’s a timer
It uses 1/f fluctuation

Those are features.

But features alone don’t answer the question:

Why would someone want this now?

What users actually want is not the feature itself, but the change in state the product gives them.

So I Organized the Problem as Hypotheses

At that point, I used KaizenLab, a hypothesis validation tool, to structure my thinking.

Instead of treating go-to-market like a vague marketing task, I treated it like a set of testable assumptions.

Not:

“How do I promote this?”

But:

Who is this most likely to resonate with?
Are users really looking for “sound,” or are they looking for something else?
What makes this meaningfully different from substitutes?
Which channel is most likely to create early traction?
What framing makes the value obvious?

A rough summary looked like this:

Area	Hypothesis
Target	Remote workers, students, and creators who struggle to switch into focus mode
Core problem	They are not looking for “nice sounds.” They are looking for a trigger to start work or study
Differentiation	Not the number of sounds, but the focus-onboarding experience created by mixing, presets, and timer integration
Channels	X, Zenn, Qiita, App Store ASO
Messaging	Use cases are stronger than feature lists

Once I made it visible, “the market feels crowded” stopped being a vague concern.

It became a clearer strategic question:

Who is this for, what is it really helping them do, and through which channel should I explain that first?

The Key Insight: It’s Not Really an Ambient Sound App

This was the biggest shift.

At first, I saw Focusnest as:

Before

An ambient sound app
A relaxation app
A noise playback app

But after organizing the hypotheses, a different framing emerged.

After

A tool that reduces the friction of entering focus mode
A portable deep work environment
A personal switch for starting work or study

That change in wording is not just branding.

It changes:

who the competitors are,
who the message resonates with,
and what kinds of content I should create.

For example, saying:

“You can mix 16 different sounds”

is much weaker than saying:

“Start coding faster with a rain + brown noise preset”
“Recreate your ideal focus environment with one tap”
“Shorten the ritual it takes to enter work mode”

The second version gives people a concrete reason to care.

The Marketing Hypotheses I Came Away With

After organizing everything, a few practical hypotheses stood out.

Hypothesis 1: The first users are probably not “people who love ambient sound”

The first people most likely to care may be knowledge workers who struggle with context switching.

For example:

Remote workers
Engineers
Students
Creators

These people are less interested in sound for its own sake.

They care about entering a focused state more easily.

Hypothesis 2: Use-case messaging will outperform feature messaging

Feature messaging still matters.

Things like offline support, 1/f fluctuation, and the number of sound sources are useful.

But on their own, they don’t create urgency.

Use cases probably work better:

for coding
for studying
for reading
for relaxing before sleep
for starting work in the morning

That kind of framing makes the product feel immediately usable.

Hypothesis 3: Story-driven distribution is better than paid acquisition at this stage

At this point, I don’t think paid ads are the right first move.

What seems more promising is building context first.

That likely means channels like:

use-case posts on X
developer and validation stories on Zenn / Qiita
improving App Store screenshots and description copy

In particular, I think the story of why I built it and how I’m figuring out how to sell it is more interesting than simple promotion.

The Initial Actions I Decided to Take

Once the hypotheses were clearer, the next actions also became clearer.

The goal is not to do everything at once.
It’s to test small moves and observe what resonates.

1. Shift the positioning from “ambient sound app” to “focus-switching app”

This affects everything:

product description
App Store copy
screenshots
social posts

The wording should consistently emphasize entering focus faster, not just listening to sounds.

2. Create use-case-based posts on X

For example:

“Rain + brown noise is my coding preset”
“I use one tap to switch into work mode every morning”
“Different presets for focus, relaxation, and sleep”

These are stronger than generic feature announcements.

3. Turn the development and GTM thinking into content

Not just “I built an app,” but:

“I used hypothesis validation to figure out how to bring it to market.”

That kind of content does three things at once:

it promotes Focusnest,
it shares a practical process other builders can learn from,
and it strengthens my own identity as someone who builds products through validation, not just intuition.

4. Delay Product Hunt for now

Product Hunt is attractive, but I don’t think it’s the right first move yet.

Before trying to launch broadly, I want stronger clarity on:

who this is really for,
what messaging works,
and which channels give early traction.

In this case, building context first seems more valuable than chasing a big launch too early.

What I Learned

The biggest lesson was simple:

Building and selling are different jobs.

A good product is not enough.

The same product can feel irrelevant or compelling depending on:

how you frame it,
who you frame it for,
and what context you place it in.

At first, I saw Focusnest as an ambient sound app.

But once I organized the go-to-market hypotheses, I realized its real value was reducing the cost of entering a focused state.

That clarity alone made the next moves much easier.

It also made something else clearer:

not just what I should do next, but what I shouldn’t do yet.

For example, instead of rushing into ads or Product Hunt, I now think it makes more sense to first build context, messaging, and initial traction.

Conclusion

A lot of indie builders hit the same wall:

They spend all their energy building, then get stuck at “Now how do I sell this?”

When that happens, it may be faster to stop adding tactics and start organizing the problem as hypotheses.

Questions like:

Who is this really for?
What problem does it actually solve?
What is meaningfully different from substitutes?
In what context does the value become obvious?

Once those become clearer, both your messaging and your product page tend to improve.

If you’ve built something but still don’t know how to bring it to market, it may help to treat go-to-market as a validation problem too.

I used KaizenLab to organize these hypotheses.

It worked not only for product ideas, but also for shaping go-to-market thinking around an actual indie product.

A Feature I Never Planned Emerged From Persona Interviews — Here's Exactly How

toshipon — Thu, 02 Apr 2026 16:03:21 +0000

The Feature That Wasn't in the Design Doc

When I started building BJJ Techniques — a BJJ (Brazilian Jiu-Jitsu) technique learning app for iOS — I had a clear vision: a searchable database of techniques, organized by position and category, with step-by-step instructions and YouTube videos.

The "Technique Tree" — a visual map showing how techniques connect and flow into each other — was not in that design doc. Not even close.

It emerged entirely from persona interviews.

Here's exactly how that happened, including the specific research I used to make those interviews actually work.

About the App

BJJ Techniques is an iOS app for learning Brazilian Jiu-Jitsu techniques systematically (available on the App Store).

Key features:

Technique Library — Search techniques by category: submissions, sweeps, guard passes, and more
Technique Detail Pages — Overview, step-by-step breakdowns, YouTube videos, and related techniques in one place
Technique Tree — Visualize how techniques connect from any starting position
Learning Paths — Structured weekly curriculum for white belts through early blue belts

The Tool I Used: KaizenLab

Before getting into the personas, a quick note on the workflow.

I run all my hypothesis validation in KaizenLab — a web app I built myself to operationalize the lean hypothesis testing methodology from Toshiaki Ichitani's book Build the Right Thing Right.

The core idea: before writing code, define your hypotheses explicitly, design experiments to test them, and record what you learn — in a structured way that builds up over time. KaizenLab handles hypothesis canvases, persona management, AI pseudo-interview simulation, and validation cycle tracking, all in the browser. It also has MCP (Model Context Protocol) integration so AI agents can operate it directly.

Everything in this article — the personas, the interviews, the feature decision — was run through KaizenLab. I'm writing this both as a case study in persona-driven validation and as a real-world test of the tool I'm building.

Three Personas, Three Real Frustrations

Most indie hackers I know create personas like this: "User A, 25-35, tech-savvy, wants X." Useful, but shallow. The responses you get from shallow personas are shallow too.

I created three personas with significantly more depth:

Tanaka Shota, 28, IT engineer, white belt

Frustrations:

Forgets technique names and steps right after learning them
YouTube search gives fragmented, disconnected information
Feels bad asking senior students the same questions repeatedly

Goals:

Learn BJJ techniques systematically
Improve the quality of twice-weekly training sessions
Reach competition level

Sato Misaki, 34, marketing manager (reduced hours), female white belt

Frustrations:

Trains only once a week, progress feels too slow
Most tutorial videos feature male practitioners with strength-based approaches — unclear if techniques work for her body type
Doesn't know what to prioritize learning

Goals:

Maximize limited training time
Find techniques that work for smaller practitioners
Understand what's most important to learn right now

Suzuki Daisuke, 42, sales manager, blue belt (also coaches beginners)

Frustrations:

Feels like fundamentals are shaky despite his rank
Gets confused by techniques named in English, Portuguese, and Japanese
Can't rely on physical dominance — technique precision is critical at his age

Goals:

Fill gaps in fundamental technique knowledge
Organize options by position
Build a personal game plan

These aren't marketing archetypes. Each one has specific contradictions, specific constraints, and specific contexts that change what they actually want from an app.

KaizenLab's persona management view — three personas organized as cards, each with goals, frustrations, and psychological state dimensions.

The Research That Made Interviews Work

Here's where it gets interesting.

I use KaizenLab's AI pseudo-interview feature to simulate conversations with personas before talking to real users. The point is to stress-test your questions and spot weak assumptions early — before wasting anyone's time.

But I found that standard AI personas give obvious, shallow answers. "Yes, that feature would be useful." "I'd like better search." These are useless.

What changed the quality dramatically was applying principles from the HumanLM research paper from Stanford (2026), which studied how to make AI-simulated participants produce more realistic, human-like responses.

The key insight from HumanLM: surface attributes aren't enough. You need to model psychological state dimensions.

KaizenLab's persona editor has dedicated fields for all three:

1. Stance — What's their position on specific topics?

Suzuki's stance on new tools: "I've been doing BJJ for years.
I'll try a new app if someone I respect recommends it,
but I won't pay for something I haven't validated myself."

2. Emotional tendencies — How do they respond emotionally?

Sato's tendencies: "Gets discouraged when progress feels
invisible. Motivated by visible milestones. Anxious about
being the only woman who doesn't understand something."

3. Communication style — How do they express needs?

Tanaka's style: "Direct, specific, data-oriented. Won't say
'I want a feature' but will say 'I tried to look up
arm triangle yesterday and spent 20 minutes cross-referencing
three different YouTube videos.'"

When you add these dimensions to a persona, the AI stops giving generic answers. Sato doesn't say "I want a learning path." She says "I have 45 minutes before I need to pick up my kid, and I need to know exactly which two techniques to drill today." That's a different design requirement entirely.

What the Interviews Actually Found

I ran AI pseudo-interviews with all three personas, asking them to describe how they currently learn and track BJJ techniques.

The surprising finding: all three independently requested some form of technique tree or learning path. A feature I had not planned and had no intention of building.

KaizenLab's interview results view — insights extracted from AI pseudo-interviews across all three personas, automatically organized by theme.

But they wanted completely different things.

Tanaka (white belt, IT background)

He wanted an RPG-style skill tree — a branching diagram starting from positions, with unlockable nodes. Closed guard → armbar OR sweep → mount → choke.

"Feeling of progression. Like I know where I am and what unlocks next."

This is a learned pattern from gaming and online learning platforms. He wanted the same dopamine loop applied to martial arts.

Sato (female white belt, time-constrained)

She wanted a learning path integrated into the technique database. Not a map of everything — a filtered view of only what's relevant for her level right now.

"Show me the 5 techniques that matter most for where I am. Lock everything else. I don't want to see what I'm not ready for."

This is a completely different mental model from Tanaka's. He wants the full map with fog of war. She wants a guided tour.

Suzuki (blue belt, coaching role)

He wanted a custom game plan builder — select from the full technique library to build a personal map of his game. Multiple plans: one for Gi, one for No-Gi, one for competition.

"When I'm coaching a white belt, I want to show them my game plan and say 'start here.' Not a generic beginner curriculum."

Different again. He already knows the techniques. He wants a tool for organizing and communicating his approach.

Why Multiple Personas Converge = High Confidence

Here's the validation principle that made me confident enough to build this:

When multiple personas independently surface the same underlying need — even if they describe it differently — that's a strong signal.

Tanaka, Sato, and Suzuki each came from different places:

Different experience levels
Different learning constraints
Different use cases (self-study vs. coaching)
Different mental models (gaming vs. workflow vs. curriculum)

But all three had the same core problem: no way to see how techniques relate to each other and where they stand within that structure.

If only Tanaka had mentioned it, I might have dismissed it as one person's gaming preference. If only Suzuki mentioned it, I might have assumed it was a niche need for advanced practitioners.

Three independent hits, three different angles, same underlying gap.

That's when I decided to build it.

The Feature That Emerged

The Technique Tree I ended up designing has three layers, corresponding to the three personas' needs:

Phase 1 (free, all users): Position-based technique map

Tap a position → see branching submissions, sweeps, passes, escapes
Synced with learning progress (mastered = color, not yet = gray)
Uses existing relatedTechniqueIds and counterTechniqueIds data

Phase 2 (premium): Custom game plan builder

Select from the full technique library to build your personal map
Save multiple plans (Gi / No-Gi / competition)
Share with training partners

The existing data structure already supported this. The connections between techniques were defined. I just hadn't built a UI that surfaced them visually.

The technique tree in action — starting from closed guard, filtered by arm locks. Mastered techniques appear in color; unlearned ones in gray.

What I'd Have Built Without This Process

A searchable database with filters.

Which is fine. But it's what every BJJ app already has. The techniques would have been well-organized and the search would have been solid. Users would have used it, found a specific technique, watched the linked YouTube video, and moved on.

The Technique Tree creates something different: a reason to explore the app as a system rather than a reference lookup. It's the feature most likely to drive retention — coming back to the app not just when you forget a technique name, but to understand how your game is developing.

I didn't think of this myself. Three personas, systematically interviewed with enough psychological depth to produce real signals, thought of it.

The Process in Practice

If you want to run this for your own product:

1. Build personas with psychological state dimensions, not just demographics

For each persona, define:

Their stance on specific topics relevant to your product
Their emotional tendencies (what motivates them, what discourages them)
Their communication style (how they express needs — directly? through frustration? through workarounds?)

2. Run AI pseudo-interviews before real ones

Use the psychological state dimensions to prompt the AI to respond as the persona, not as a helpful assistant. If the answers feel generic, your persona lacks depth.

3. Listen for convergence across personas

One persona mentioning a need = interesting. Two personas = worth investigating. Three personas from different segments = build it.

4. Pay attention to how they describe the need, not just what they want

Tanaka, Sato, and Suzuki all wanted "technique tree," but their descriptions revealed three different product requirements. The surface request was the same. The underlying need was the same. But the specific solution for each was different.

That distinction is what turns user research into good product design.

Have you had a feature emerge from user research that you never would have designed yourself? Or do you usually build from your own intuition? I'd be curious to hear in the comments.

I do this kind of persona-driven validation in KaizenLab — the tool I built specifically for managing hypothesis validation cycles.

References

BJJ Techniques — App Store — The app built using this validation process
HumanLM: Large Language Models as Simulated Participants — Stanford, 2026
Why I Wasted 6 Months Building the Wrong Product — Series Part 1
I Spent 3 Months Building a SaaS — Then AI Did the Same Thing in One Prompt — Series Part 2
KaizenLab — Hypothesis validation cycle management

Why an SRE Engineer Built a Product Validation Tool — Bringing Observability Thinking to Product Development

toshipon — Sun, 29 Mar 2026 01:30:48 +0000

"Why Would an SRE Build a Product Tool?"

I get asked this a lot.

By day, I'm an SRE engineer at a fintech company. Terraform, AWS, Azure, Kubernetes — my job is keeping systems reliable. I think in dashboards, alerts, and incident response.

But when I started building side projects, something felt deeply wrong.

Infrastructure has observability. Product decisions don't.

We use Datadog and Grafana to visualize system state as a matter of course. But "why did we build this feature?" and "was that decision correct?" — there's no dashboard for that. No alerts. No traces.

That gap is what led me to build a hypothesis validation tool. And it turns out, SRE thinking translates surprisingly well to product development.

The Observability Gap in Product Development

The Three Pillars — Reframed

In SRE, we think about observability through three pillars:

Pillar	In Infrastructure	In Product Development
Metrics	CPU, memory, response time	KPIs, usage rates, conversion
Logs	Access logs, error logs	Decision logs, validation results
Traces	Request processing paths	Hypothesis → Experiment → Learning → Next Action

In infrastructure, we never accept "we don't know what's happening" as a state. We set up alerts, build dashboards, write runbooks for incident response.

But in product development? "Why we built this feature" is lost within six months. Code preserves what was built, but never why it was built.

ADRs for Architecture, But What About Product Decisions?

If you're an engineer, you might use ADRs (Architecture Decision Records) to document technical choices:

# ADR-001: Use Supabase for Database

## Status: Accepted

## Context
Minimize backend costs for a side project

## Decision
Adopt Supabase (PostgreSQL + Auth + RLS)

## Rationale
- More SQL flexibility than Firebase
- RLS handles security at the database layer
- Free tier is sufficient for indie projects

ADRs capture technical decisions. But they don't capture "the evidence that convinced us this feature was worth building in the first place."

That's the gap. And it's exactly the kind of gap that makes an SRE uncomfortable.

3 SRE Concepts That Changed How I Build Products

1. SLOs → Validation Success Criteria

In SRE, you define SLOs (Service Level Objectives) before you set up monitoring. "99th percentile response time < 200ms" — the quantitative bar comes first.

Applied to product development, this means defining success criteria before running any experiment.

Hypothesis: "Users struggle with tracking hypothesis validation"
Success Criteria: 3 out of 5 interviewees recognize this as a problem
Method: Semi-structured interviews

This sounds obvious, but most indie hackers (myself included, before) skip it. We run experiments and then decide after the fact whether the results were "good enough." That's like deploying a service without defining SLOs and then arguing about whether the error rate is acceptable.

Define the bar first. Then measure against it.

2. Incident Response → Pivot Decisions

SRE incident response has clear escalation rules:

Sev 1: Assemble the response team immediately
Sev 2: Handle during business hours
Sev 3: Address in the next sprint

I applied the same structure to product validation results:

Validation Result	Response
Validated (high confidence)	Continue — move to implementation
Validated (low confidence)	Investigate — plan additional experiments
Invalidated	Pivot or kill — change direction or stop

The key insight: don't make pivot decisions emotionally. "I spent weeks on this hypothesis, so it must be right" is the product equivalent of ignoring alerts because you don't want to get paged. SREs respond to alerts based on rules, not feelings. Product decisions should work the same way.

I wrote in my last post about spending 3 months building a SaaS that AI made obsolete. If I'd had these rules, I would have killed it in week 3 when the early signals were already there.

3. Runbooks → Validation Playbooks

SREs document incident response procedures as runbooks. When something breaks at 3 AM, you don't want to figure out the steps from scratch.

Same principle for hypothesis validation:

## Problem Validation Playbook

### Prep
1. Review hypothesis canvas — identify core assumptions
2. Define target persona
3. Set success criteria (e.g., 3/5 recognize the problem)

### Execute
1. Pre-test interview questions with AI simulation
2. Run 5 semi-structured interviews
3. Record key findings and direct quotes

### Decide
1. Compare results against success criteria
2. Record learnings
3. Make decision: Continue / Pivot / Kill

With a runbook, you don't panic during an incident. With a validation playbook, you don't freeze when it's time to decide whether your product idea is worth pursuing.

The Career Angle: Why This Combination Is Rare

SRE engineers who think about product validation are uncommon. Product managers who think in terms of observability are also uncommon. The intersection is almost empty.

If you're an engineer considering side projects or a career shift toward product:

Your reliability thinking is an asset — you already know how to define measurable targets and respond to data
Your operational discipline transfers — runbooks, escalation rules, and blameless post-mortems all have product equivalents
Your bias toward measurement is exactly what product development needs — too many product decisions are made on vibes

The gap isn't your skills. The gap is recognizing that the mental models you already use at work apply directly to building products.

What I Do Now

I built these SRE-inspired workflows into my own validation process, and eventually into a tool called KaizenLab to keep myself honest. But the tool matters less than the mindset.

If infrastructure deserves observability, so do your product decisions.

Next time you're about to start a side project, try this: before writing any code, write a validation runbook. Define your SLOs — I mean, success criteria. Set up your "alerts" — the signals that tell you to pivot or kill.

You already know how to do this. You just haven't applied it to products yet.

Are you an engineer who's applied technical thinking to product development? Or a PM who's borrowed concepts from SRE? I'd love to hear how these worlds collide in the comments.

I Spent 3 Months Building a SaaS — Then AI Did the Same Thing in One Prompt

toshipon — Thu, 26 Mar 2026 12:49:48 +0000

The Moment It Hit Me

I'd been heads-down for three months building a real estate investment simulator. It was a proper SaaS — loan calculators, renovation cost modeling, rental income projections, cash flow scenarios for old Japanese houses (kominka). I had Stripe integration, a Pro plan at ¥2,980/month, the works.

Then one evening, I watched someone type "simulate the rental yield for a 6-room property in Kamakura, purchase price ¥25M, renovation ¥8M, rent ¥65,000/room" into Claude — and get back a detailed cash flow breakdown in about 10 seconds.

Three months of my work. One prompt.

What I Built and Why

I'm an SRE engineer by day, and I'd gotten into real estate investing on the side — specifically old Japanese houses (kominka) that you can convert into rental apartments. The math is complex: you're juggling purchase price, renovation costs per room, loan terms, vacancy rates, property tax, management fees, and a dozen other variables.

I kept building the same spreadsheet over and over for each property I evaluated. So I thought: why not turn this into a product? Other investors must have the same pain.

I spent three months building it. Feature after feature — multiple property comparison, scenario modeling, loan amortization charts, break-even analysis. I even built dark mode. (Every indie hacker's favorite procrastination feature.)

Here's where I made my first mistake: I kept adding features without talking to users. The UI got complex. Really complex. And I had no idea which features actually mattered because I'd never validated with anyone except myself. When you're the developer AND the only user, everything feels essential.

Then Stripe rejected my payment integration. That stung, but looking back, it was the universe trying to tell me something.

What Happened Next

Around the same time, AI models got seriously good at financial analysis. Claude, ChatGPT — they could all handle multi-variable real estate calculations conversationally. You describe a property, ask your questions, and get answers. No UI to learn, no subscription to pay for.

The "SaaS is Dead" narrative started picking up steam in indie hacker circles. And while I think that take is overblown for most categories, for calculation-heavy tools with no network effects or proprietary data? It hit close to home.

My simulator was essentially a structured UI for math that a language model could do on the fly. The only "advantage" was a nice interface — but even that was debatable, since my UI had gotten too complex for its own good.

The Question I Should Have Asked

Before writing a single line of code, I should have asked:

"Can AI do this well enough that a dedicated service adds no unique value?"

I never even considered it. In 2025, when I started building, AI-as-calculator wasn't as obvious. But the trajectory was clear if I'd been paying attention. And more importantly, there's a broader version of this question that every indie hacker should ask:

"What makes this worth being a product instead of a prompt, a script, or a spreadsheet?"

If the answer is "a nicer UI" — that's not enough anymore.

The AI Replacement Test

Here's what I do now before building anything. It takes about an hour.

Step 1: Try to Replace It with AI (15 minutes)

Open Claude, ChatGPT, or whatever model you prefer. Describe your product's core use case as a prompt. Be specific.

If the AI produces 80%+ of the value your product would deliver — stop. Your product needs a fundamentally different value proposition, or it shouldn't exist as a product.

For my simulator, the AI nailed the math. It couldn't save scenarios across sessions or generate comparison charts, but honestly? Most users would be fine with copy-pasting into a spreadsheet.

Step 2: Identify Your "Moat Against AI" (15 minutes)

Ask yourself what your product does that AI can't replicate:

Proprietary data — Do you have data the model doesn't? (e.g., real-time pricing, user-generated datasets)
Network effects — Does it get better with more users? (e.g., marketplace, community)
Workflow integration — Does it plug into a system where copy-pasting AI output would be painful? (e.g., CI/CD, CRM)
Compliance/trust — Does the domain require auditability, consistency, or certification that AI can't guarantee? (e.g., medical, legal, financial reporting)
Collaboration — Do multiple people need to work on it together in real-time?

If you can't check at least one of these — you're building a nice wrapper around something AI gives away for free.

Step 3: Ask 5 People (30 minutes)

Not "would you use this?" — that question is useless. Everyone says yes.

Instead, ask:

"How do you handle [this problem] today?"
"Have you tried asking ChatGPT/Claude to do this?"
"What was missing from the AI's answer?"

If they haven't tried AI for this yet, suggest they try it right then. Watch their reaction. If they say "oh wow, this is good enough" — you have your answer.

Step 4: Write Down Your Hypothesis Before Building

Write one sentence:

"People will pay for [my product] instead of using AI because [specific reason]."

If you can't fill in that blank convincingly — don't build it yet. Validate the "[specific reason]" part first.

What I Do Now

That experience — three months of building something that AI made redundant — changed how I approach every new project. I now validate hypotheses before writing code. I decompose ideas into testable assumptions and kill the ones that don't hold up.

I actually built a tool to manage this process for myself: KaizenLab. But honestly, even a notebook works. The tool doesn't matter. The discipline does.

The Uncomfortable Truth

The hardest part of this story isn't that AI replaced my product. It's that I could have figured this out in an afternoon if I'd been willing to question my own idea.

I didn't want to test the hypothesis because I was afraid the answer would be "don't build it." And I was right to be afraid — that was the answer. But finding that out in an afternoon is infinitely better than finding it out after three months.

If you're an indie hacker reading this: before your next npx create-next-app, spend one hour on the AI Replacement Test. It might save you three months.

Or it might confirm that your idea is genuinely defensible — and then you'll build with way more confidence.

Either way, you win.

Have you had a project disrupted by AI? Or found a way to build something AI can't easily replace? I'd love to hear your story in the comments.