Pratyay Banerjee
Kaizen — Let your focus follow you!

This is a submission for the Built with Google Gemini: Writing Challenge.


I lose my train of thought a lot. Not in the dramatic, movie-character way. More like: I open a tab to look up a CSS property, and twenty minutes later I'm reading about the history of Egyptian obelisks 🗿. I have ADHD. My co-builder Sandipan does too. We've both tried the usual focus apps — the ones that block websites or guilt-trip you with timers. They made us feel worse, not better.

Around New Year's, we decided that our resolution for 2026 would be to actually fix this problem — not with another blocker or pomodoro clone, but with something that genuinely understands how attention works. That motivation landed us at the Commit To Change: An AI Agents Hackathon 2026 hosted by Encode Club, and that's where Kaizen was born.

Video ▶️


The problem 😤

If you write code or study on the web, you've lived this moment. A documentation tab leads to a blog post, then a video, then a forum thread. Somewhere between the scrolls, the thread of your original question frays. Minutes later, you know you saw something useful, but you can't recall where — or what.

"Distraction is the modern poverty. Focus is the new wealth." — James Clear

Traditional productivity tools treat this as a discipline problem. They block sites, enforce timers, and punish distraction. For people with ADHD or anyone who naturally multitasks, these tools feel adversarial. Focus isn't binary, and distraction isn't a moral failing.

| Traditional Approach | Kaizen Approach |
| --- | --- |
| 🟠 Blocks websites entirely | 🟢 Supportive pulse nudges — zero blocking |
| 🟠 Binary "focused" / "distracted" state | 🟢 Granular cognitive attention sensing |
| 🟠 Punishes distraction | 🟢 Understands attention patterns and gently guides |
| 🟠 No understanding of what you're learning | 🟢 Tracks reading, images, audio, video — builds context |
| 🟠 Cloud-locked data silos | 🟢 Privacy-first, PII-anonymized AI |

We wanted something that understands where your attention actually goes and gently helps you stay on track — without locking you out of anything.


What is Kaizen? 🦄

Kaizen is a multi-agent browser extension paired with a full web dashboard that tracks your cognitive attention in real time. It doesn't just monitor which websites you open — it watches which paragraphs you read, where you hover, how you scroll, how long you stay on specific content, and whether you're actually engaged or just skimming.

🔗 Try it out here: https://kaizen.apps.sandipan.dev

That attention signal feeds into a coordinated multi-agent AI system powered by Google Gemini — specifically gemini-2.5-flash-lite as the system default, with gemini-2.5-flash and gemini-2.5-pro available for deeper reasoning. The codebase is also future-proofed with sorting priority for gemini-3 family models (including gemini-3.1-flash) as they become available through the Google AI API.

The name comes from the Japanese concept of continuous improvement (改善). Small, steady gains. That's the whole philosophy — not perfection, just awareness.

Codebase / App Repository 🔗

GitHub: anikvox/kaizen (https://github.com/anikvox/kaizen) · kaizen.apps.sandipan.dev · Focus that Follows You

A privacy-first browser extension that tracks where your attention actually goes and gently helps you stay on track — without blocking content or enforcing rigid workflows. Built by CS students with ADHD who wanted a tool that understands attention patterns, not one that locks you out.

Screenshots: left, the extension side panel with focus tracking and a growing tree; right, the Focus Guardian's gentle nudge when you drift.

Features 🎠

  • 🧠 Cognitive Attention Tracking — tracks where your mind actually settles across text, images, audio & YouTube
  • 🤖 Multi-Agent AI System — four coordinated agents (Focus Guardian, Chat, Focus Clustering, Mental Health) powered by Gemini
  • 💬 Agentic Co-Pilot Chat — tool-calling assistant that synthesizes your reading sessions with context-aware insights
  • 🌊 Supportive Pulse Nudges — gentle reminders when you drift, never blocking — with self-calibrating sensitivity
  • 📝 Knowledge Quizzes — auto-generated verification quizzes from your actual browsing content
  • 📊 Cognitive Analytics Dashboard — attention entropy, browsing fragmentation, late-night patterns over 7–90 day windows
  • 🌱 Growing Plant Gamification — a virtual plant that grows with your focus time
  • 🔐 Privacy-First Engine — PII anonymization, encrypted API keys (AES-256-GCM), GDPR-compliant with full data export/deletion
  • 🔭 Full Opik Observability — every LLM call, tool invocation, and agent decision traced end-to-end


How Gemini powers the system ⚡

"We used Gemini" tells you nothing. So let me be specific about how deeply it's woven into every layer of Kaizen.

Gemini is the system default provider throughout Kaizen, integrated via the Vercel AI SDK (@ai-sdk/google v3.0.22) alongside the direct Google SDK (@google/genai v1.40.0). Every agent, every summarization call, every quiz — Gemini handles it unless the user explicitly switches to another provider (Anthropic Claude or OpenAI GPT-4 are available as alternatives).

Here's the core provider resolution logic:

// service.ts — LLM Provider Resolution
export class LLMService {
  getProvider(): LLMProvider {
    // 1. Check user's custom provider + encrypted API key
    if (this.settings?.llmProvider) {
      const provider = this.tryCreateUserProvider();
      if (provider) return provider;
    }
    // 2. Fall back to system Gemini
    return this.createSystemProvider(); // → gemini-2.5-flash-lite
  }
}

// models.ts — System Defaults
export const SYSTEM_DEFAULT_PROVIDER: LLMProviderType = "gemini";
export const SYSTEM_DEFAULT_MODEL = "gemini-2.5-flash-lite";

And the GeminiProvider class wraps the Vercel AI SDK with full tool-calling, multimodal content (text + base64 images), and streaming support:

// providers/gemini.ts
export class GeminiProvider implements LLMProvider {
  readonly providerType = "gemini" as const;

  constructor(config: LLMProviderConfig) {
    this.google = createGoogleGenerativeAI({
      apiKey: config.apiKey,
    });
  }

  async generate(options: LLMGenerateOptions): Promise<LLMResponse> {
    const result = await generateText({
      model: this.google(this.model),
      system: options.systemPrompt,
      messages,
      tools: options.tools,
      experimental_telemetry: getTelemetrySettings({
        name: `gemini-${this.model}`,
        userId: this.userId,
      }),
    });
    // Extract toolCalls and toolResults from response...
  }

  async stream(options: LLMStreamOptions): Promise<void> {
    const result = streamText({
      model: this.google(this.model),
      system: options.systemPrompt,
      messages,
      tools: options.tools,
    });
    for await (const chunk of result.textStream) {
      await options.callbacks.onToken(chunk, fullContent);
    }
  }
}

The four agents 🤖

Kaizen isn't a ChatGPT wrapper with a focus timer bolted on. It's a coordinated multi-agent system where each agent has a specific job, its own set of tools, and its own Gemini-powered decision loop.

| Agent | What it does | How Gemini is used |
| --- | --- | --- |
| 🛡️ Focus Guardian | Monitors your browsing every 60 seconds. Detects doomscrolling, distraction, and focus drift. Sends nudges when confidence is high enough. | Gemini analyzes 15 minutes of activity context (domain switches, dwell times, social media ratio) and returns a structured JSON decision at temperature: 0.1 for consistency. |
| 💬 Chat Agent | Conversational AI with tool-calling. You can ask "what was I reading about today?" and it queries your actual attention data. | Gemini streams responses via streamText() with up to 5 agentic steps. It autonomously selects from 11 tools to ground answers in real data. |
| 🎯 Focus Agent | Clusters your attention into focus sessions. Figures out what topics you're working on and tracks evolution. | Gemini runs an agentic loop (up to 10 iterations) calling tools like create_focus, merge_focuses, update_focus to organize attention data into coherent sessions. |
| 🧘 Mental Health Agent | Generates cognitive wellness reports — fragmentation, sleep patterns, media balance, quiz retention. | Gemini runs another agentic loop with specialized tools (analyze_sleep_patterns, analyze_focus_quality, analyze_media_balance, think_aloud) and produces a full report in supportive, non-clinical language. |

Gemini temperature tuning across tasks 🌡️

Different tasks need different levels of creativity. We tuned Gemini's temperature for each use case:

// config.ts — LLM Configuration Presets
export const LLM_CONFIG = {
  decision:        { temperature: 0.1, maxTokens: 10   },  // Should we nudge? Yes/no.
  summarization:   { temperature: 0.3, maxTokens: 200  },  // Factual, deterministic
  focusAnalysis:   { temperature: 0.3, maxTokens: 50   },  // Concise clustering
  imageDescription:{ temperature: 0.3, maxTokens: 150  },  // Vision captions
  titleGeneration: { temperature: 0.7, maxTokens: 20   },  // Creative but short
  agent:           { temperature: 0.7, maxTokens: 4096 },  // Chat — balanced
  quizGeneration:  { temperature: 0.9, maxTokens: 2000 },  // We *want* variety!
};

At 0.1, Gemini is disciplined — it gives consistent nudge decisions. At 0.9, it generates creative quiz question phrasing without going off the rails. That predictability across the temperature range was one of the reasons we kept Gemini as the default over other providers.

Tool-calling in practice 🔧

The Chat Agent's tool-calling is where Gemini's structured output really shines. When you ask "what have I been focusing on?", here's what actually happens:

User message arrives
  → Gemini evaluates available tools
  → Selects: get_active_focus
  → Tool executes Prisma query against PostgreSQL
  → Results returned to Gemini
  → Gemini composes a response grounded in your data
  → Response streamed back via SSE

The 11 tools available to the Chat Agent:

  • get_attention_data — recent text/image/audio/YouTube attention
  • get_active_website — what tab you're on right now
  • get_active_focus — your current focus topics
  • search_browsing_history — search past activity
  • get_reading_activity — reading session data
  • get_youtube_history — YouTube watch history
  • get_focus_history — past focus sessions
  • get_current_time — current time in user's timezone
  • get_current_weather — weather at user's location
  • set_user_location — remember location (geocoding via OpenMeteo)
  • set_translation_language — language preferences

Gemini picks which tools to call, interprets the results, and sometimes chains multiple tool calls in a single turn. We capped it at 5 steps per message to prevent runaway loops. Here's the actual execution from agent.ts:

// chat/agent.ts — Agentic Chat Execution
const result = streamText({
  model: provider(modelId),
  system: systemPrompt,  // Fetched from Opik prompt library
  messages: coreMessages, // Supports multimodal (text + images)
  tools,
  maxSteps: 5,
  onStepFinish: (step) => {
    // Create Opik span for each tool call
    if (step.toolCalls && step.toolCalls.length > 0) {
      for (const toolCall of step.toolCalls) {
        const toolSpan = trace?.span({
          name: `tool:${toolCall.toolName}`,
          type: "tool",
          input: { args: toolCall.args },
        });
        toolSpan?.end({ result: /* tool output */ });
      }
    }
  },
});

We tested Gemini, Claude, and GPT-4 for this pipeline. Gemini's tool selection was the most reliable for our use case — it rarely picked the wrong tool or returned malformed tool calls across 11 different tool schemas. That's why it became the default.

Multimodal attention — Gemini Vision 👁️

When you linger on an image while browsing, the extension tracks your hover duration and confidence score. If you're actually paying attention, Kaizen sends the image as base64-encoded data directly to Gemini for caption generation:

// providers/gemini.ts — Multimodal content formatting
private formatUserContent(content: LLMMessageContent) {
  return content.map((part) => {
    if (part.type === "image") {
      return {
        type: "image" as const,
        image: `data:${part.mimeType};base64,${part.data}`,
      };
    }
    return { type: "text" as const, text: part.text };
  });
}

This means the Chat Agent can later tell you "you were looking at a diagram of TCP handshakes" instead of just "you visited a networking article." The image summaries + text summaries together form Kaizen's memory layer.

Quiz generation from real attention 📝

When you hit "Generate Quiz," a pg-boss background job fires. The Quiz Agent pulls your recent attention data, feeds it to Gemini at temperature: 0.9, and generates 10 multiple-choice questions based on what you've been reading. A content hash prevents duplicate questions across sessions. The quiz stays valid for 24 hours.
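The deduplication idea is simple enough to sketch. Something like this captures it (the helper names here are illustrative, not Kaizen's actual code):

```typescript
import { createHash } from "node:crypto";

// Hypothetical sketch: derive a stable hash of the attention text that
// seeded a quiz, so regenerating from identical content becomes a no-op.
export function contentHash(attentionText: string): string {
  // Normalize whitespace and case so trivial formatting changes
  // don't defeat the deduplication.
  const normalized = attentionText.replace(/\s+/g, " ").trim().toLowerCase();
  return createHash("sha256").update(normalized).digest("hex");
}

export function needsNewQuiz(text: string, lastStoredHash: string | null): boolean {
  return contentHash(text) !== lastStoredHash;
}
```

The background job can compare the new hash against the one stored on the last quiz row and skip generation entirely when they match.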

This is probably the feature I'm most proud of. Passive reading becomes active recall, and you didn't have to do anything extra. You just browsed normally, and now there's a quiz waiting for you. 🎯

Focus Guardian — the self-learning nudge engine 🛡️

The Focus Guardian runs autonomously, analyzing your last 15 minutes of activity. Here's what actually happens in the decision loop (from focus-agent.ts):

// agent/focus-agent.ts — Focus Guardian Decision
const prompt = `${promptData.content}

RECENT ACTIVITY (last 15 minutes):
- Domains visited: ${context.recentDomains.join(", ")}
- Number of different sites: ${context.domainSwitchCount}
- Average time per page: ${Math.round(context.averageDwellTime / 1000)}s
- Social media/entertainment time: ${Math.round(context.socialMediaTime / 1000)}s
- Reading time (estimated): ${Math.round(context.readingTime / 1000)}s
- Has active focus: ${context.hasActiveFocus ? `Yes (${context.focusTopics.join(", ")})` : "No"}

USER FEEDBACK HISTORY:
- False positive rate: ${(feedback.falsePositiveRate * 100).toFixed(0)}%
- Acknowledged rate: ${(feedback.acknowledgedRate * 100).toFixed(0)}%
- Sensitivity setting: ${sensitivity}`;

const response = await provider.generate({
  messages: [{ role: "user", content: prompt }],
});

Nudge types: doomscroll, distraction, break, focus_drift, encouragement, and all_clear. There's a configurable cooldown between nudges so it never feels like nagging.
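The cooldown gate itself is just a timestamp comparison. A minimal sketch (the constant and function names are invented for illustration; the real cooldown is user-configurable):

```typescript
// Illustrative cooldown gate: suppress a nudge if one fired recently.
export const NUDGE_COOLDOWN_MS = 10 * 60 * 1000; // e.g. 10 minutes

export function canNudge(lastNudgeAt: number | null, now = Date.now()): boolean {
  if (lastNudgeAt === null) return true; // never nudged this session
  return now - lastNudgeAt >= NUDGE_COOLDOWN_MS;
}
```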

The system self-calibrates. Every nudge records whether you acknowledged it, dismissed it, or marked it as a false positive:

// Sensitivity auto-adjustment from user feedback
if (response === "false_positive") {
  newSensitivity = Math.max(0.1, newSensitivity - 0.05); // fewer nudges
} else if (response === "acknowledged") {
  newSensitivity = Math.min(0.9, newSensitivity + 0.02); // nudge was helpful
}

Over time, the agent learns your patterns. If it keeps getting it wrong, it backs off. If it's on point, it stays the course.


The technical stack ⚙️

Everything runs on a TypeScript monorepo (pnpm workspaces):

kaizen/
├── apps/
│   ├── api/          # Hono backend — agents, data ingestion, SSE
│   ├── extension/    # Plasmo browser extension — attention sensors
│   └── web/          # Next.js dashboard — analytics, chat, settings
├── packages/
│   ├── api-client/   # Shared typed API client
│   └── ui/           # Shared component library
└── docker-compose.yml
| Layer | What we used |
| --- | --- |
| Runtime | Node.js 22+ |
| Backend | Hono v4.6.14, Prisma ORM v6.2.1, PostgreSQL 16 |
| Job Queue | pg-boss v12 (single-concurrency, resource-aware) |
| Real-time | Custom SSE (Server-Sent Events) for cross-device sync |
| Auth | Clerk v1.21.4 (web), device token handshake (extension) |
| AI | Google Gemini via Vercel AI SDK v6.0.77 (@ai-sdk/google + @google/genai) |
| Observability | Comet Opik v1.0.6 — tracing, prompt library, anonymizers |
| Extension | Plasmo, React, TypeScript |
| Dashboard | Next.js 15, Tailwind CSS, Lucide Icons |
| Encryption | AES-256-GCM for API key storage |
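For the curious, here's roughly what an AES-256-GCM round-trip for API key storage looks like in Node. This is a sketch with invented function names, not our exact code:

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Illustrative AES-256-GCM round-trip for storing a user's API key.
// `masterKey` must be 32 bytes; in production it would come from an
// environment secret or KMS, never from source code.
export function encryptApiKey(plain: string, masterKey: Buffer): string {
  const iv = randomBytes(12); // GCM's standard 96-bit nonce
  const cipher = createCipheriv("aes-256-gcm", masterKey, iv);
  const ciphertext = Buffer.concat([cipher.update(plain, "utf8"), cipher.final()]);
  const tag = cipher.getAuthTag(); // integrity tag, checked on decrypt
  // Persist iv + tag + ciphertext together as one opaque string.
  return Buffer.concat([iv, tag, ciphertext]).toString("base64");
}

export function decryptApiKey(stored: string, masterKey: Buffer): string {
  const raw = Buffer.from(stored, "base64");
  const iv = raw.subarray(0, 12);
  const tag = raw.subarray(12, 28);
  const ciphertext = raw.subarray(28);
  const decipher = createDecipheriv("aes-256-gcm", masterKey, iv);
  decipher.setAuthTag(tag); // throws on tamper
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString("utf8");
}
```

The GCM auth tag is the important part: a tampered ciphertext fails decryption instead of silently yielding garbage.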

Attention sensors 📡

The extension runs separate monitors for different content types:

| Sensor | File | What it tracks |
| --- | --- | --- |
| 📖 Text | monitor-text.ts | Paragraphs read, words processed, reading progress, sustained attention duration |
| 🖼️ Image | monitor-image.ts | Hover duration, confidence score → triggers Gemini Vision for caption generation |
| 🔊 Audio | monitor-audio.ts | Playback duration, active listening time |
| 📺 YouTube | background scripts | Watch time, captions ingestion, video context |

Each sensor generates a confidence score (0–100) based on hover duration, scroll velocity, and viewport position. A quick skim doesn't count as learning. Sustained attention does.
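As a toy illustration of that idea (the weights and saturation points below are invented for this sketch; the real sensors are more involved):

```typescript
// Toy attention-confidence score (0–100). Weights are illustrative only.
interface AttentionSignals {
  hoverMs: number;          // how long the cursor/viewport lingered
  scrollVelocity: number;   // px/s — fast scrolling suggests skimming
  viewportCoverage: number; // 0–1, fraction of the element visible
}

export function confidenceScore(s: AttentionSignals): number {
  const hover = Math.min(s.hoverMs / 5000, 1);   // saturates at 5 seconds
  const calm = 1 / (1 + s.scrollVelocity / 500); // penalize fast scrolling
  const visible = Math.max(0, Math.min(s.viewportCoverage, 1));
  return Math.round(100 * (0.5 * hover + 0.3 * calm + 0.2 * visible));
}
```

Slow scrolling over a fully visible paragraph scores high; a 200ms flyby at high scroll speed scores near zero, so it never counts as learning.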

Database schema highlights 🗄️

// Core attention tracking
TextAttention    → text, wordsRead, confidence, timestamp
ImageAttention   → src, alt, hoverDuration, summary (AI-generated)
AudioAttention   → playbackDuration, activeTime
YoutubeAttention → captions, activeWatchTime

// Agentic features
Focus            → item, keywords[], isActive, lastActivityAt
AgentNudge       → type, message, confidence, reasoning, response
Pulse            → userId, message (short nudges)

// Quiz system
Quiz             → questions (JSON), contentHash (deduplication)
QuizAnswer       → selectedIndex, isCorrect
QuizResult       → totalQuestions, correctAnswers

// User settings (encrypted API keys)
UserSettings     → geminiApiKeyEncrypted, llmProvider, llmModel

Real-time SSE events 📡

Custom Server-Sent Events sync state across browser extension + dashboard:

  • pomodoro-tick — Timer updates
  • chat-message-created/updated — Chat streaming
  • active-tab-changed — Tab context sync
  • focus-changed — Focus session state
  • settings-updated — Cross-device settings sync
  • pulses-updated — Nudge notifications
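Under the hood an SSE event is just a couple of text lines. A framework-agnostic sketch of the frame format (our actual implementation sits behind Hono):

```typescript
// Minimal SSE frame formatter: each event becomes an "event:" line,
// a "data:" line with JSON, and a blank-line terminator.
export function sseFrame(event: string, data: unknown): string {
  return `event: ${event}\ndata: ${JSON.stringify(data)}\n\n`;
}

// e.g. pushing a nudge notification to every connected client:
// res.write(sseFrame("pulses-updated", { count: 1 }));
```

The browser side just does `new EventSource(url)` and attaches a listener per event name, which is what makes the extension and dashboard stay in sync with so little machinery.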


Observability with Opik 🔭

We integrated Comet Opik for full observability across the entire agent system. This turned out to be one of the best decisions we made — you can't evaluate what you can't see.

What we instrumented

🔗 Tracing — Every LLM call, every tool invocation, every agent decision is traced end-to-end. Traces are grouped by thread ID so you can follow the full decision flow:

// telemetry.ts — Opik Trace Hierarchy
const trace = client.trace({
  name: options.name,
  input: options.input ? anonymizeInput(options.input) : undefined,
  metadata: { ...options.metadata, environment: process.env.NODE_ENV },
  tags: options.tags || ["kaizen"],
  threadId: options.threadId,
});

// Nested spans for each step
const span = trace.span({
  name: "tool:get_attention_data",
  type: "tool",
  input: anonymizeInput({ args: toolCall.args }),
});
span.update({ output: processedOutput, endTime: new Date() });

The resulting trace hierarchy looks like:

Trace: chat-agent
├── Span: streamText [type: llm]
│   ├── Span: tool:get_active_website [type: tool]
│   ├── Span: tool:get_attention_data [type: tool]
│   └── Span: tool:search_browsing_history [type: tool]
└── Span: followUp-streamText [type: llm]

📚 Prompt Library — All 11 system prompts live in Opik under named entries, fetched fresh on every call with local fallbacks:

// prompt-provider.ts — Opik-first, local fallback
export async function getPromptWithMetadata(name: PromptName) {
  if (isOpikPromptsEnabled()) {
    const opikPrompt = await getPromptFromOpik(name);
    if (opikPrompt?.content) {
      return { content: opikPrompt.content, source: "opik",
               promptVersion: opikPrompt.commit };
    }
  }
  return { content: LOCAL_PROMPT_MAP[name], source: "local" };
}

Prompt names in Opik: kaizen-chat-agent, kaizen-focus-guardian, kaizen-focus-agent, kaizen-quiz-generation, kaizen-mental-health-agent, kaizen-text-summarization, kaizen-image-summarization, kaizen-individual-image, kaizen-title-generation, kaizen-focus-analysis, kaizen-chat.

This let us iterate on prompts without redeploying code. We'd see a bad nudge in a trace, tweak the prompt in Opik's dashboard, and the fix was live immediately.

🔒 Anonymizers — Before anything gets logged to Opik, we strip PII using @cdssnc/sanitize-pii:

// anonymizer.ts — PII Protection
function isSensitiveKey(key: string): boolean {
  const sensitivePatterns = [
    /^userId$/i, /password/i, /secret/i,
    /token/i, /api[_-]?key/i, /auth/i,
    /credential/i, /private[_-]?key/i,
  ];
  return sensitivePatterns.some((pattern) => pattern.test(key));
}

// User inputs → anonymized. LLM outputs → preserved for debugging.
export function anonymizeInput<T>(data: T): T { return anonymizeData(data); }
export function anonymizeOutput<T>(data: T): T { /* only redact sensitive keys */ }

🛡️ Guardrails — Agents validate inputs before tool execution. The Focus Guardian only fires a nudge when confidence exceeds a dynamically adjusted threshold. The Chat Agent validates tool arguments before running Prisma queries.
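A guardrail of that shape can be sketched as a plain argument check that runs before the query does (the tool shape and limits below are invented for illustration):

```typescript
// Illustrative guardrail: validate tool arguments before they reach Prisma.
interface SearchArgs {
  query: string;
  limit: number;
}

export function validateSearchArgs(raw: unknown): SearchArgs {
  const o = raw as Record<string, unknown>;
  // Reject empty or absurdly long queries before touching the database.
  if (typeof o?.query !== "string" || o.query.length === 0 || o.query.length > 500) {
    throw new Error("invalid query");
  }
  // Clamp the result count to a sane window, defaulting when omitted.
  const limit = typeof o.limit === "number" ? Math.floor(o.limit) : 20;
  if (limit < 1 || limit > 100) throw new Error("invalid limit");
  return { query: o.query, limit };
}
```

If validation throws, the tool returns an error message to the model instead of executing, and the agent can recover in its next step.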

Why Opik mattered 🎯

Early on, the Focus Guardian was nudging people during legitimate deep dives. Someone would be reading a 30-minute technical article, and the agent would flag it as distraction because the domain switching pattern looked similar to aimless browsing.

Without Opik, we'd have said "the AI is dumb" and guessed at fixes.

With tracing, we could pull up the exact decision chain: here's the 15 minutes of context the agent saw, here's the domain switch count, here's the confidence score, here's the prompt, here's the output. The problem was obvious — the prompt didn't have a strong enough signal for sustained single-topic browsing. We tweaked the prompt in Opik, the fix deployed without a code change, and false positives dropped.

That cycle — trace the failure → find the root cause → fix the prompt → verify in production — happened dozens of times.


Our honest Gemini feedback 🗣️

What worked well ✅

Tool-calling was reliable. We tested Gemini, Claude, and GPT-4 for our agent pipelines, and Gemini's structured output parsing was the most consistent for our use case. The Chat Agent makes autonomous tool selections across 11 different tools, and Gemini rarely picked the wrong one or returned malformed tool calls. This is why it became our system default.

The million-token context window was a real advantage. Gemini 2.5 Flash and Pro both support 1M token context windows. For the Focus Agent's clustering loop — which sometimes processes hours of attention data across many topics — having that headroom meant we didn't have to aggressively truncate context. We could pass in a richer activity history and get better clustering decisions.

Temperature control behaved predictably. From 0.1 for binary decisions to 0.9 for quiz generation, Gemini responded consistently. At 0.1 it was disciplined; at 0.9 it got creative without going off the rails. That predictability across the full range was a real win.

Multimodal input worked out of the box. We send base64-encoded images directly to Gemini for caption generation. The quality of image descriptions was good enough that the Chat Agent could later reference them meaningfully. No separate vision pipeline needed.

The model fetcher dynamically discovers new models. We use @google/genai to fetch the live model list from the Gemini API (filtering for generateContent support), with sorting priority baked in for the upcoming gemini-3 family — so when gemini-3.1-flash lands, Kaizen will pick it up automatically:

// model-fetcher.ts — Dynamic model discovery
const response = await client.models.list();
for (const model of modelList) {
  if (model.name && actions.includes("generateContent")) {
    models.push({ id: modelId, contextWindow: model.inputTokenLimit, ... });
  }
}
// Sort with priority: gemini-2.5-flash > flash-lite > pro > gemini-3.x

Where we hit friction ⚠️

Multi-turn coherence degraded over long conversations. After 10+ turns with interleaved tool calls, the Chat Agent would sometimes lose track of earlier context or repeat information. We partially fixed this by injecting a conversation summary into the system prompt, but it meant extra token usage. Not unique to Gemini, but noticeable.
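The shape of that fix looks roughly like this (a sketch: in the real flow the summary text comes from an LLM call, and here truncated snippets stand in for it):

```typescript
// Illustrative history compaction: fold older turns into a summary hint
// that gets injected into the system prompt, keeping only recent turns
// as full messages.
export interface Turn {
  role: "user" | "assistant";
  content: string;
}

export function compactHistory(
  turns: Turn[],
  keepRecent = 6
): { summaryHint: string; recent: Turn[] } {
  if (turns.length <= keepRecent) return { summaryHint: "", recent: turns };
  const older = turns.slice(0, turns.length - keepRecent);
  // Stand-in for an LLM-generated summary of the older turns.
  const summaryHint = older
    .map((t) => `${t.role}: ${t.content.slice(0, 40)}`)
    .join(" | ");
  return { summaryHint, recent: turns.slice(-keepRecent) };
}
```

The summary costs extra tokens up front but keeps each call's message list bounded, which is the trade-off mentioned above.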

Streaming with tool calls needed careful handling. When Gemini decides mid-stream to call a tool, the handoff between text chunks and tool-call events required state management in our SSE layer. The Vercel AI SDK abstracted most of it, but edge cases (tool call at the very start, multiple rapid tool calls) needed explicit handling.

Occasional overconfidence in Focus Guardian decisions. At temperature: 0.1, when the Focus Guardian is wrong, it's confidently wrong. A few times it classified focused research (lots of Stack Overflow tabs) as aimless browsing. The fix was better prompting + the feedback loop, not a model change.

Two Google SDKs with slightly different interfaces. @google/genai (for model discovery) and @ai-sdk/google (for inference via Vercel AI SDK) have different API surfaces. We ended up using both, which added minor complexity.


What we learned 🙌

Building for attention requires restraint. The hardest design decisions weren't technical. They were about when not to act. Our early Focus Guardian nudged aggressively — it felt like a backseat driver. The lesson: if your tool annoys people, they'll uninstall it. Being right isn't enough; you have to be right at the right moment.

Agents need structure, not freedom. We initially gave the Chat Agent broad instructions. The results were inconsistent. What worked was constraining each agent to a narrow job with specific tools and clear decision boundaries. The Focus Guardian doesn't chat. The Chat Agent doesn't nudge. Specialization + coordination > generalization.

Observability isn't optional for agent systems. Without Opik traces, we'd still be guessing why nudges misfired. We stopped treating the AI as a black box and started treating it like any other system component with logs and metrics.

The real product is the quiet moments. Nobody remembers the quiz that worked perfectly. They remember the time the extension stayed silent for 45 minutes during a Wikipedia deep dive they genuinely cared about — and then gently reminded them about the assignment they'd originally set out to work on. Getting those moments right took dozens of prompt iterations and hundreds of traced decisions. 🌱

Gemini as a default provider was the right call. After benchmarking all three providers, Gemini's combination of reliable tool-calling, 1M context window, and consistent temperature behavior made it the best fit. Our system makes potentially dozens of Gemini calls per user per hour — attention summaries, focus clustering, guardian checks — and reliability at that volume mattered more than peak performance on any single call.


What's next? 🚀

We're continuing to develop Kaizen and planning the next release cycle:

  • Spaced repetition — surfacing what you read at the moment you're most likely to forget it
  • 🕸️ Topic relationship mapping — showing how things you learn connect across sessions
  • Better batching — optimized Gemini call grouping during long browsing sessions
  • 📤 Export to note-taking tools — so learning doesn't stay trapped in the extension
  • 👥 Shareable study threads — lightweight collaboration for shared focus sessions
  • 🔮 gemini-3.1-flash migration — our model fetcher already has gemini-3 priority sorting; when the next-gen Flash models drop, we'll benchmark and potentially make it the new default

For Gemini specifically, we're interested in structured output (JSON mode) for agent responses. Right now we parse freeform text from several agent pipelines, and guaranteed JSON would let us simplify those parsing layers.

End notes 🙌🏻

We built Kaizen because we needed it ourselves. We're CS students who struggle with ADHD. Traditional blockers felt like punishment. Our New Year's resolution was to build the tool we wished existed — one that doesn't lock you out, doesn't judge, just watches where your attention goes, learns your patterns, and gently, continuously helps you get better. That's what kaizen (改善) means — continuous improvement. One focus session at a time.

Huge thanks to DEV and MLH for hosting this writing challenge, and to the Google Gemini team for building models that actually hold up under real multi-agent workloads! 🙌

Permissive License ⚖️

MIT License
