Pratyay Banerjee
Kaizen — Let your focus follow you!

This is a submission for the Built with Google Gemini: Writing Challenge.


I lose my train of thought a lot. Not in the dramatic, movie-character way. More like: I open a tab to look up a CSS property, and twenty minutes later I'm reading about the history of Egyptian obelisks 🗿. I have ADHD. My co-builder Sandipan does too. We've both tried the usual focus apps — the ones that block websites or guilt-trip you with timers. They made us feel worse, not better.

Around New Year's, we decided that our resolution for 2026 would be to actually fix this problem — not with another blocker or pomodoro clone, but with something that genuinely understands how attention works. That motivation landed us at the Commit To Change: An AI Agents Hackathon 2026 hosted by Encode Club, and that's where Kaizen was born.

Video ▶️


The problem 😤

If you write code or study on the web, you've lived this moment. A documentation tab leads to a blog post, then a video, then a forum thread. Somewhere between the scrolls, the thread of your original question frays. Minutes later, you know you saw something useful, but you can't recall where — or what.

"Distraction is the modern poverty. Focus is the new wealth." — James Clear

Traditional productivity tools treat this as a discipline problem. They block sites, enforce timers, and punish distraction. For people with ADHD or anyone who naturally multitasks, these tools feel adversarial. Focus isn't binary, and distraction isn't a moral failing.

| Traditional Approach | Kaizen Approach |
| --- | --- |
| 🟠 Blocks websites entirely | 🟢 Supportive pulse nudges — zero blocking |
| 🟠 Binary "focused" / "distracted" state | 🟢 Granular cognitive attention sensing |
| 🟠 Punishes distraction | 🟢 Understands attention patterns and gently guides |
| 🟠 No understanding of what you're learning | 🟢 Tracks reading, images, audio, video — builds context |
| 🟠 Cloud-locked data silos | 🟢 Privacy-first, PII-anonymized AI |

We wanted something that understands where your attention actually goes and gently helps you stay on track — without locking you out of anything.


What is Kaizen? 🦄

Kaizen is a multi-agent browser extension paired with a full web dashboard that tracks your cognitive attention in real time. It doesn't just monitor which websites you open — it watches which paragraphs you read, where you hover, how you scroll, how long you stay on specific content, and whether you're actually engaged or just skimming.

🔗 Try it out here: https://kaizen.apps.sandipan.dev

That attention signal feeds into a coordinated multi-agent AI system powered by Google Gemini — specifically gemini-2.5-flash-lite as the system default, with gemini-2.5-flash and gemini-2.5-pro available for deeper reasoning. The codebase is also future-proofed with sorting priority for gemini-3 family models (including gemini-3.1-flash) as they become available through the Google AI API.

The name comes from the Japanese concept of continuous improvement (改善). Small, steady gains. That's the whole philosophy — not perfection, just awareness.

Codebase / App Repository 🔗

GitHub: anikvox/kaizen (https://github.com/anikvox/kaizen) · kaizen.apps.sandipan.dev · Focus that Follows You

A privacy-first browser extension that tracks where your attention actually goes and gently helps you stay on track — without blocking content or enforcing rigid workflows. Built by CS students with ADHD who wanted a tool that understands attention patterns, not one that locks you out.

Screenshots: left, the extension side panel with focus tracking and a growing tree; right, the Focus Guardian's gentle nudge when you drift.

Features 🎠

  • 🧠 Cognitive Attention Tracking — tracks where your mind actually settles across text, images, audio & YouTube
  • 🤖 Multi-Agent AI System — four coordinated agents (Focus Guardian, Chat, Focus Clustering, Mental Health) powered by Gemini
  • 💬 Agentic Co-Pilot Chat — tool-calling assistant that synthesizes your reading sessions with context-aware insights
  • 🌊 Supportive Pulse Nudges — gentle reminders when you drift, never blocking — with self-calibrating sensitivity
  • 📝 Knowledge Quizzes — auto-generated verification quizzes from your actual browsing content
  • 📊 Cognitive Analytics Dashboard — attention entropy, browsing fragmentation, late-night patterns over 7–90 day windows
  • 🌱 Growing Plant Gamification — a virtual plant that grows with your focus time
  • 🔐 Privacy-First Engine — PII anonymization, encrypted API keys (AES-256-GCM), GDPR-compliant with full data export/deletion
  • 🔭 Full Opik Observability — every LLM call, tool invocation, and agent decision traced end-to-end


How Gemini powers the system ⚡

"We used Gemini" tells you nothing. So let me be specific about how deeply it's woven into every layer of Kaizen.

Gemini is the system default provider throughout Kaizen, integrated via the Vercel AI SDK (@ai-sdk/google v3.0.22) alongside the direct Google SDK (@google/genai v1.40.0). Every agent, every summarization call, every quiz — Gemini handles it unless the user explicitly switches to another provider (Anthropic Claude or OpenAI GPT-4 are available as alternatives).

Here's the core provider resolution logic:

// service.ts — LLM Provider Resolution
export class LLMService {
  getProvider(): LLMProvider {
    // 1. Check user's custom provider + encrypted API key
    if (this.settings?.llmProvider) {
      const provider = this.tryCreateUserProvider();
      if (provider) return provider;
    }
    // 2. Fall back to system Gemini
    return this.createSystemProvider(); // → gemini-2.5-flash-lite
  }
}

// models.ts — System Defaults
export const SYSTEM_DEFAULT_PROVIDER: LLMProviderType = "gemini";
export const SYSTEM_DEFAULT_MODEL = "gemini-2.5-flash-lite";

And the GeminiProvider class wraps the Vercel AI SDK with full tool-calling, multimodal content (text + base64 images), and streaming support:

// providers/gemini.ts
export class GeminiProvider implements LLMProvider {
  readonly providerType = "gemini" as const;

  constructor(config: LLMProviderConfig) {
    this.google = createGoogleGenerativeAI({
      apiKey: config.apiKey,
    });
  }

  async generate(options: LLMGenerateOptions): Promise<LLMResponse> {
    const result = await generateText({
      model: this.google(this.model),
      system: options.systemPrompt,
      messages,
      tools: options.tools,
      experimental_telemetry: getTelemetrySettings({
        name: `gemini-${this.model}`,
        userId: this.userId,
      }),
    });
    // Extract toolCalls and toolResults from response...
  }

  async stream(options: LLMStreamOptions): Promise<void> {
    const result = streamText({
      model: this.google(this.model),
      system: options.systemPrompt,
      messages,
      tools: options.tools,
    });
    for await (const chunk of result.textStream) {
      await options.callbacks.onToken(chunk, fullContent);
    }
  }
}

The four agents 🤖

Kaizen isn't a ChatGPT wrapper with a focus timer bolted on. It's a coordinated multi-agent system where each agent has a specific job, its own set of tools, and its own Gemini-powered decision loop.

| Agent | What it does | How Gemini is used |
| --- | --- | --- |
| 🛡️ Focus Guardian | Monitors your browsing every 60 seconds. Detects doomscrolling, distraction, and focus drift. Sends nudges when confidence is high enough. | Gemini analyzes 15 minutes of activity context (domain switches, dwell times, social media ratio) and returns a structured JSON decision at temperature: 0.1 for consistency. |
| 💬 Chat Agent | Conversational AI with tool-calling. You can ask "what was I reading about today?" and it queries your actual attention data. | Gemini streams responses via streamText() with up to 5 agentic steps. It autonomously selects from 11 tools to ground answers in real data. |
| 🎯 Focus Agent | Clusters your attention into focus sessions. Figures out what topics you're working on and tracks evolution. | Gemini runs an agentic loop (up to 10 iterations) calling tools like create_focus, merge_focuses, update_focus to organize attention data into coherent sessions. |
| 🧘 Mental Health Agent | Generates cognitive wellness reports — fragmentation, sleep patterns, media balance, quiz retention. | Gemini runs another agentic loop with specialized tools (analyze_sleep_patterns, analyze_focus_quality, analyze_media_balance, think_aloud) and produces a full report in supportive, non-clinical language. |

Gemini temperature tuning across tasks 🌡️

Different tasks need different levels of creativity. We tuned Gemini's temperature for each use case:

// config.ts — LLM Configuration Presets
export const LLM_CONFIG = {
  decision:        { temperature: 0.1, maxTokens: 10   },  // Should we nudge? Yes/no.
  summarization:   { temperature: 0.3, maxTokens: 200  },  // Factual, deterministic
  focusAnalysis:   { temperature: 0.3, maxTokens: 50   },  // Concise clustering
  imageDescription:{ temperature: 0.3, maxTokens: 150  },  // Vision captions
  titleGeneration: { temperature: 0.7, maxTokens: 20   },  // Creative but short
  agent:           { temperature: 0.7, maxTokens: 4096 },  // Chat — balanced
  quizGeneration:  { temperature: 0.9, maxTokens: 2000 },  // We *want* variety!
};

At 0.1, Gemini is disciplined — it gives consistent nudge decisions. At 0.9, it generates creative quiz question phrasing without going off the rails. That predictability across the temperature range was one of the reasons we kept Gemini as the default over other providers.

Tool-calling in practice 🔧

The Chat Agent's tool-calling is where Gemini's structured output really shines. When you ask "what have I been focusing on?", here's what actually happens:

User message arrives
  → Gemini evaluates available tools
  → Selects: get_active_focus
  → Tool executes Prisma query against PostgreSQL
  → Results returned to Gemini
  → Gemini composes a response grounded in your data
  → Response streamed back via SSE

The 11 tools available to the Chat Agent:

  • get_attention_data — recent text/image/audio/YouTube attention
  • get_active_website — what tab you're on right now
  • get_active_focus — your current focus topics
  • search_browsing_history — search past activity
  • get_reading_activity — reading session data
  • get_youtube_history — YouTube watch history
  • get_focus_history — past focus sessions
  • get_current_time — current time in user's timezone
  • get_current_weather — weather at user's location
  • set_user_location — remember location (geocoding via OpenMeteo)
  • set_translation_language — language preferences

Gemini picks which tools to call, interprets the results, and sometimes chains multiple tool calls in a single turn. We capped it at 5 steps per message to prevent runaway loops. Here's the actual execution from agent.ts:

// chat/agent.ts — Agentic Chat Execution
const result = streamText({
  model: provider(modelId),
  system: systemPrompt,  // Fetched from Opik prompt library
  messages: coreMessages, // Supports multimodal (text + images)
  tools,
  maxSteps: 5,
  onStepFinish: (step) => {
    // Create Opik span for each tool call
    if (step.toolCalls && step.toolCalls.length > 0) {
      for (const toolCall of step.toolCalls) {
        const toolSpan = trace?.span({
          name: `tool:${toolCall.toolName}`,
          type: "tool",
          input: { args: toolCall.args },
        });
        toolSpan?.end({ result: /* tool output */ });
      }
    }
  },
});

We tested Gemini, Claude, and GPT-4 for this pipeline. Gemini's tool selection was the most reliable for our use case — it rarely picked the wrong tool or returned malformed tool calls across 11 different tool schemas. That's why it became the default.

Multimodal attention — Gemini Vision 👁️

When you linger on an image while browsing, the extension tracks your hover duration and confidence score. If you're actually paying attention, Kaizen sends the image as base64-encoded data directly to Gemini for caption generation:

// providers/gemini.ts — Multimodal content formatting
private formatUserContent(content: LLMMessageContent) {
  return content.map((part) => {
    if (part.type === "image") {
      return {
        type: "image" as const,
        image: `data:${part.mimeType};base64,${part.data}`,
      };
    }
    return { type: "text" as const, text: part.text };
  });
}

This means the Chat Agent can later tell you "you were looking at a diagram of TCP handshakes" instead of just "you visited a networking article." The image summaries + text summaries together form Kaizen's memory layer.

Quiz generation from real attention 📝

When you hit "Generate Quiz," a pg-boss background job fires. The Quiz Agent pulls your recent attention data, feeds it to Gemini at temperature: 0.9, and generates 10 multiple-choice questions based on what you've been reading. A content hash prevents duplicate questions across sessions. The quiz stays valid for 24 hours.
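The deduplication idea is simple enough to sketch. Something like this captures it (the helper names here are illustrative, not Kaizen's actual code):

```typescript
import { createHash } from "node:crypto";

// Hypothetical sketch: derive a stable hash of the attention text that
// seeded a quiz, so regenerating from identical content becomes a no-op.
export function contentHash(attentionText: string): string {
  // Normalize whitespace and case so trivial formatting changes
  // don't defeat the deduplication.
  const normalized = attentionText.replace(/\s+/g, " ").trim().toLowerCase();
  return createHash("sha256").update(normalized).digest("hex");
}

export function needsNewQuiz(text: string, lastStoredHash: string | null): boolean {
  return contentHash(text) !== lastStoredHash;
}
```

The background job can compare the new hash against the one stored on the last quiz row and skip generation entirely when they match.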

This is probably the feature I'm most proud of. Passive reading becomes active recall, and you didn't have to do anything extra. You just browsed normally, and now there's a quiz waiting for you. 🎯

Focus Guardian — the self-learning nudge engine 🛡️

The Focus Guardian runs autonomously, analyzing your last 15 minutes of activity. Here's what actually happens in the decision loop (from focus-agent.ts):

// agent/focus-agent.ts — Focus Guardian Decision
const prompt = `${promptData.content}

RECENT ACTIVITY (last 15 minutes):
- Domains visited: ${context.recentDomains.join(", ")}
- Number of different sites: ${context.domainSwitchCount}
- Average time per page: ${Math.round(context.averageDwellTime / 1000)}s
- Social media/entertainment time: ${Math.round(context.socialMediaTime / 1000)}s
- Reading time (estimated): ${Math.round(context.readingTime / 1000)}s
- Has active focus: ${context.hasActiveFocus ? `Yes (${context.focusTopics.join(", ")})` : "No"}

USER FEEDBACK HISTORY:
- False positive rate: ${(feedback.falsePositiveRate * 100).toFixed(0)}%
- Acknowledged rate: ${(feedback.acknowledgedRate * 100).toFixed(0)}%
- Sensitivity setting: ${sensitivity}`;

const response = await provider.generate({
  messages: [{ role: "user", content: prompt }],
});

Nudge types: doomscroll, distraction, break, focus_drift, encouragement, and all_clear. There's a configurable cooldown between nudges so it never feels like nagging.
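The cooldown gate itself is just a timestamp comparison. A minimal sketch (the constant and function names are invented for illustration; the real cooldown is user-configurable):

```typescript
// Illustrative cooldown gate: suppress a nudge if one fired recently.
export const NUDGE_COOLDOWN_MS = 10 * 60 * 1000; // e.g. 10 minutes

export function canNudge(lastNudgeAt: number | null, now = Date.now()): boolean {
  if (lastNudgeAt === null) return true; // never nudged this session
  return now - lastNudgeAt >= NUDGE_COOLDOWN_MS;
}
```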

The system self-calibrates. Every nudge records whether you acknowledged it, dismissed it, or marked it as a false positive:

// Sensitivity auto-adjustment from user feedback
if (response === "false_positive") {
  newSensitivity = Math.max(0.1, newSensitivity - 0.05); // fewer nudges
} else if (response === "acknowledged") {
  newSensitivity = Math.min(0.9, newSensitivity + 0.02); // nudge was helpful
}

Over time, the agent learns your patterns. If it keeps getting it wrong, it backs off. If it's on point, it stays the course.


The technical stack ⚙️

Everything runs on a TypeScript monorepo (pnpm workspaces):

kaizen/
├── apps/
│   ├── api/          # Hono backend — agents, data ingestion, SSE
│   ├── extension/    # Plasmo browser extension — attention sensors
│   └── web/          # Next.js dashboard — analytics, chat, settings
├── packages/
│   ├── api-client/   # Shared typed API client
│   └── ui/           # Shared component library
└── docker-compose.yml
| Layer | What we used |
| --- | --- |
| Runtime | Node.js 22+ |
| Backend | Hono v4.6.14, Prisma ORM v6.2.1, PostgreSQL 16 |
| Job Queue | pg-boss v12 (single-concurrency, resource-aware) |
| Real-time | Custom SSE (Server-Sent Events) for cross-device sync |
| Auth | Clerk v1.21.4 (web), device token handshake (extension) |
| AI | Google Gemini via Vercel AI SDK v6.0.77 (@ai-sdk/google + @google/genai) |
| Observability | Comet Opik v1.0.6 — tracing, prompt library, anonymizers |
| Extension | Plasmo, React, TypeScript |
| Dashboard | Next.js 15, Tailwind CSS, Lucide Icons |
| Encryption | AES-256-GCM for API key storage |
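For the curious, here's roughly what an AES-256-GCM round-trip for API key storage looks like in Node. This is a sketch with invented function names, not our exact code:

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Illustrative AES-256-GCM round-trip for storing a user's API key.
// `masterKey` must be 32 bytes; in production it would come from an
// environment secret or KMS, never from source code.
export function encryptApiKey(plain: string, masterKey: Buffer): string {
  const iv = randomBytes(12); // GCM's standard 96-bit nonce
  const cipher = createCipheriv("aes-256-gcm", masterKey, iv);
  const ciphertext = Buffer.concat([cipher.update(plain, "utf8"), cipher.final()]);
  const tag = cipher.getAuthTag(); // integrity tag, checked on decrypt
  // Persist iv + tag + ciphertext together as one opaque string.
  return Buffer.concat([iv, tag, ciphertext]).toString("base64");
}

export function decryptApiKey(stored: string, masterKey: Buffer): string {
  const raw = Buffer.from(stored, "base64");
  const iv = raw.subarray(0, 12);
  const tag = raw.subarray(12, 28);
  const ciphertext = raw.subarray(28);
  const decipher = createDecipheriv("aes-256-gcm", masterKey, iv);
  decipher.setAuthTag(tag); // throws on tamper
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString("utf8");
}
```

The GCM auth tag is the important part: a tampered ciphertext fails decryption instead of silently yielding garbage.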

Attention sensors 📡

The extension runs separate monitors for different content types:

| Sensor | File | What it tracks |
| --- | --- | --- |
| 📖 Text | monitor-text.ts | Paragraphs read, words processed, reading progress, sustained attention duration |
| 🖼️ Image | monitor-image.ts | Hover duration, confidence score → triggers Gemini Vision for caption generation |
| 🔊 Audio | monitor-audio.ts | Playback duration, active listening time |
| 📺 YouTube | background scripts | Watch time, captions ingestion, video context |

Each sensor generates a confidence score (0–100) based on hover duration, scroll velocity, and viewport position. A quick skim doesn't count as learning. Sustained attention does.
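As a toy illustration of that idea (the weights and saturation points below are invented for this sketch; the real sensors are more involved):

```typescript
// Toy attention-confidence score (0–100). Weights are illustrative only.
interface AttentionSignals {
  hoverMs: number;          // how long the cursor/viewport lingered
  scrollVelocity: number;   // px/s — fast scrolling suggests skimming
  viewportCoverage: number; // 0–1, fraction of the element visible
}

export function confidenceScore(s: AttentionSignals): number {
  const hover = Math.min(s.hoverMs / 5000, 1);   // saturates at 5 seconds
  const calm = 1 / (1 + s.scrollVelocity / 500); // penalize fast scrolling
  const visible = Math.max(0, Math.min(s.viewportCoverage, 1));
  return Math.round(100 * (0.5 * hover + 0.3 * calm + 0.2 * visible));
}
```

Slow scrolling over a fully visible paragraph scores high; a 200ms flyby at high scroll speed scores near zero, so it never counts as learning.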

Database schema highlights 🗄️

// Core attention tracking
TextAttention    → text, wordsRead, confidence, timestamp
ImageAttention   → src, alt, hoverDuration, summary (AI-generated)
AudioAttention   → playbackDuration, activeTime
YoutubeAttention → captions, activeWatchTime

// Agentic features
Focus            → item, keywords[], isActive, lastActivityAt
AgentNudge       → type, message, confidence, reasoning, response
Pulse            → userId, message (short nudges)

// Quiz system
Quiz             → questions (JSON), contentHash (deduplication)
QuizAnswer       → selectedIndex, isCorrect
QuizResult       → totalQuestions, correctAnswers

// User settings (encrypted API keys)
UserSettings     → geminiApiKeyEncrypted, llmProvider, llmModel

Real-time SSE events 📡

Custom Server-Sent Events sync state across browser extension + dashboard:

  • pomodoro-tick — Timer updates
  • chat-message-created/updated — Chat streaming
  • active-tab-changed — Tab context sync
  • focus-changed — Focus session state
  • settings-updated — Cross-device settings sync
  • pulses-updated — Nudge notifications
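Under the hood an SSE event is just a couple of text lines. A framework-agnostic sketch of the frame format (our actual implementation sits behind Hono):

```typescript
// Minimal SSE frame formatter: each event becomes an "event:" line,
// a "data:" line with JSON, and a blank-line terminator.
export function sseFrame(event: string, data: unknown): string {
  return `event: ${event}\ndata: ${JSON.stringify(data)}\n\n`;
}

// e.g. pushing a nudge notification to every connected client:
// res.write(sseFrame("pulses-updated", { count: 1 }));
```

The browser side just does `new EventSource(url)` and attaches a listener per event name, which is what makes the extension and dashboard stay in sync with so little machinery.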


Observability with Opik 🔭

We integrated Comet Opik for full observability across the entire agent system. This turned out to be one of the best decisions we made — you can't evaluate what you can't see.

What we instrumented

🔗 Tracing — Every LLM call, every tool invocation, every agent decision is traced end-to-end. Traces are grouped by thread ID so you can follow the full decision flow:

// telemetry.ts — Opik Trace Hierarchy
const trace = client.trace({
  name: options.name,
  input: options.input ? anonymizeInput(options.input) : undefined,
  metadata: { ...options.metadata, environment: process.env.NODE_ENV },
  tags: options.tags || ["kaizen"],
  threadId: options.threadId,
});

// Nested spans for each step
const span = trace.span({
  name: "tool:get_attention_data",
  type: "tool",
  input: anonymizeInput({ args: toolCall.args }),
});
span.update({ output: processedOutput, endTime: new Date() });

The resulting trace hierarchy looks like:

Trace: chat-agent
├── Span: streamText [type: llm]
│   ├── Span: tool:get_active_website [type: tool]
│   ├── Span: tool:get_attention_data [type: tool]
│   └── Span: tool:search_browsing_history [type: tool]
└── Span: followUp-streamText [type: llm]

📚 Prompt Library — All 11 system prompts live in Opik under named entries, fetched fresh on every call with local fallbacks:

// prompt-provider.ts — Opik-first, local fallback
export async function getPromptWithMetadata(name: PromptName) {
  if (isOpikPromptsEnabled()) {
    const opikPrompt = await getPromptFromOpik(name);
    if (opikPrompt?.content) {
      return { content: opikPrompt.content, source: "opik",
               promptVersion: opikPrompt.commit };
    }
  }
  return { content: LOCAL_PROMPT_MAP[name], source: "local" };
}

Prompt names in Opik: kaizen-chat-agent, kaizen-focus-guardian, kaizen-focus-agent, kaizen-quiz-generation, kaizen-mental-health-agent, kaizen-text-summarization, kaizen-image-summarization, kaizen-individual-image, kaizen-title-generation, kaizen-focus-analysis, kaizen-chat.

This let us iterate on prompts without redeploying code. We'd see a bad nudge in a trace, tweak the prompt in Opik's dashboard, and the fix was live immediately.

🔒 Anonymizers — Before anything gets logged to Opik, we strip PII using @cdssnc/sanitize-pii:

// anonymizer.ts — PII Protection
function isSensitiveKey(key: string): boolean {
  const sensitivePatterns = [
    /^userId$/i, /password/i, /secret/i,
    /token/i, /api[_-]?key/i, /auth/i,
    /credential/i, /private[_-]?key/i,
  ];
  return sensitivePatterns.some((pattern) => pattern.test(key));
}

// User inputs → anonymized. LLM outputs → preserved for debugging.
export function anonymizeInput<T>(data: T): T { return anonymizeData(data); }
export function anonymizeOutput<T>(data: T): T { /* only redact sensitive keys */ }

🛡️ Guardrails — Agents validate inputs before tool execution. The Focus Guardian only fires a nudge when confidence exceeds a dynamically adjusted threshold. The Chat Agent validates tool arguments before running Prisma queries.
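A guardrail of that shape can be sketched as a plain argument check that runs before the query does (the tool shape and limits below are invented for illustration):

```typescript
// Illustrative guardrail: validate tool arguments before they reach Prisma.
interface SearchArgs {
  query: string;
  limit: number;
}

export function validateSearchArgs(raw: unknown): SearchArgs {
  const o = raw as Record<string, unknown>;
  // Reject empty or absurdly long queries before touching the database.
  if (typeof o?.query !== "string" || o.query.length === 0 || o.query.length > 500) {
    throw new Error("invalid query");
  }
  // Clamp the result count to a sane window, defaulting when omitted.
  const limit = typeof o.limit === "number" ? Math.floor(o.limit) : 20;
  if (limit < 1 || limit > 100) throw new Error("invalid limit");
  return { query: o.query, limit };
}
```

If validation throws, the tool returns an error message to the model instead of executing, and the agent can recover in its next step.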

Why Opik mattered 🎯

Early on, the Focus Guardian was nudging people during legitimate deep dives. Someone would be reading a 30-minute technical article, and the agent would flag it as distraction because the domain switching pattern looked similar to aimless browsing.

Without Opik, we'd have said "the AI is dumb" and guessed at fixes.

With tracing, we could pull up the exact decision chain: here's the 15 minutes of context the agent saw, here's the domain switch count, here's the confidence score, here's the prompt, here's the output. The problem was obvious — the prompt didn't have a strong enough signal for sustained single-topic browsing. We tweaked the prompt in Opik, the fix deployed without a code change, and false positives dropped.

That cycle — trace the failure → find the root cause → fix the prompt → verify in production — happened dozens of times.


Our honest Gemini feedback 🗣️

What worked well ✅

Tool-calling was reliable. We tested Gemini, Claude, and GPT-4 for our agent pipelines, and Gemini's structured output parsing was the most consistent for our use case. The Chat Agent makes autonomous tool selections across 11 different tools, and Gemini rarely picked the wrong one or returned malformed tool calls. This is why it became our system default.

The million-token context window was a real advantage. Gemini 2.5 Flash and Pro both support 1M token context windows. For the Focus Agent's clustering loop — which sometimes processes hours of attention data across many topics — having that headroom meant we didn't have to aggressively truncate context. We could pass in a richer activity history and get better clustering decisions.

Temperature control behaved predictably. From 0.1 for binary decisions to 0.9 for quiz generation, Gemini responded consistently. At 0.1 it was disciplined; at 0.9 it got creative without going off the rails. That predictability across the full range was a real win.

Multimodal input worked out of the box. We send base64-encoded images directly to Gemini for caption generation. The quality of image descriptions was good enough that the Chat Agent could later reference them meaningfully. No separate vision pipeline needed.

The model fetcher dynamically discovers new models. We use @google/genai to fetch the live model list from the Gemini API (filtering for generateContent support), with sorting priority baked in for the upcoming gemini-3 family — so when gemini-3.1-flash lands, Kaizen will pick it up automatically:

// model-fetcher.ts — Dynamic model discovery
const response = await client.models.list();
for (const model of modelList) {
  if (model.name && actions.includes("generateContent")) {
    models.push({ id: modelId, contextWindow: model.inputTokenLimit, ... });
  }
}
// Sort with priority: gemini-2.5-flash > flash-lite > pro > gemini-3.x

Where we hit friction ⚠️

Multi-turn coherence degraded over long conversations. After 10+ turns with interleaved tool calls, the Chat Agent would sometimes lose track of earlier context or repeat information. We partially fixed this by injecting a conversation summary into the system prompt, but it meant extra token usage. Not unique to Gemini, but noticeable.
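The shape of that fix looks roughly like this (a sketch: in the real flow the summary text comes from an LLM call, and here truncated snippets stand in for it):

```typescript
// Illustrative history compaction: fold older turns into a summary hint
// that gets injected into the system prompt, keeping only recent turns
// as full messages.
export interface Turn {
  role: "user" | "assistant";
  content: string;
}

export function compactHistory(
  turns: Turn[],
  keepRecent = 6
): { summaryHint: string; recent: Turn[] } {
  if (turns.length <= keepRecent) return { summaryHint: "", recent: turns };
  const older = turns.slice(0, turns.length - keepRecent);
  // Stand-in for an LLM-generated summary of the older turns.
  const summaryHint = older
    .map((t) => `${t.role}: ${t.content.slice(0, 40)}`)
    .join(" | ");
  return { summaryHint, recent: turns.slice(-keepRecent) };
}
```

The summary costs extra tokens up front but keeps each call's message list bounded, which is the trade-off mentioned above.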

Streaming with tool calls needed careful handling. When Gemini decides mid-stream to call a tool, the handoff between text chunks and tool-call events required state management in our SSE layer. The Vercel AI SDK abstracted most of it, but edge cases (tool call at the very start, multiple rapid tool calls) needed explicit handling.

Occasional overconfidence in Focus Guardian decisions. At temperature: 0.1, when the Focus Guardian is wrong, it's confidently wrong. A few times it classified focused research (lots of Stack Overflow tabs) as aimless browsing. The fix was better prompting + the feedback loop, not a model change.

Two Google SDKs with slightly different interfaces. @google/genai (for model discovery) and @ai-sdk/google (for inference via Vercel AI SDK) have different API surfaces. We ended up using both, which added minor complexity.


What we learned 🙌

Building for attention requires restraint. The hardest design decisions weren't technical. They were about when not to act. Our early Focus Guardian nudged aggressively — it felt like a backseat driver. The lesson: if your tool annoys people, they'll uninstall it. Being right isn't enough; you have to be right at the right moment.

Agents need structure, not freedom. We initially gave the Chat Agent broad instructions. The results were inconsistent. What worked was constraining each agent to a narrow job with specific tools and clear decision boundaries. The Focus Guardian doesn't chat. The Chat Agent doesn't nudge. Specialization + coordination > generalization.

Observability isn't optional for agent systems. Without Opik traces, we'd still be guessing why nudges misfired. We stopped treating the AI as a black box and started treating it like any other system component with logs and metrics.

The real product is the quiet moments. Nobody remembers the quiz that worked perfectly. They remember the time the extension stayed silent for 45 minutes during a Wikipedia deep dive they genuinely cared about — and then gently reminded them about the assignment they'd originally set out to work on. Getting those moments right took dozens of prompt iterations and hundreds of traced decisions. 🌱

Gemini as a default provider was the right call. After benchmarking all three providers, Gemini's combination of reliable tool-calling, 1M context window, and consistent temperature behavior made it the best fit. Our system makes potentially dozens of Gemini calls per user per hour — attention summaries, focus clustering, guardian checks — and reliability at that volume mattered more than peak performance on any single call.


What's next? 🚀

We're continuing to develop Kaizen and planning the next release cycle:

  • Spaced repetition — surfacing what you read at the moment you're most likely to forget it
  • 🕸️ Topic relationship mapping — showing how things you learn connect across sessions
  • Better batching — optimized Gemini call grouping during long browsing sessions
  • 📤 Export to note-taking tools — so learning doesn't stay trapped in the extension
  • 👥 Shareable study threads — lightweight collaboration for shared focus sessions
  • 🔮 gemini-3.1-flash migration — our model fetcher already has gemini-3 priority sorting; when the next-gen Flash models drop, we'll benchmark and potentially make it the new default

For Gemini specifically, we're interested in structured output (JSON mode) for agent responses. Right now we parse freeform text from several agent pipelines, and guaranteed JSON would let us simplify those parsing layers.

End notes 🙌🏻

We built Kaizen because we needed it ourselves. We're CS students who struggle with ADHD. Traditional blockers felt like punishment. Our New Year's resolution was to build the tool we wished existed — one that doesn't lock you out, doesn't judge, just watches where your attention goes, learns your patterns, and gently, continuously helps you get better. That's what kaizen (改善) means — continuous improvement. One focus session at a time.

Huge thanks to DEV and MLH for hosting this writing challenge, and to the Google Gemini team for building models that actually hold up under real multi-agent workloads! 🙌

Permissive License ⚖️

MIT License
