DEV Community: Nguyen Phuc Hai

How I Practice System Design with AI (URL Shortener Walkthrough)

Nguyen Phuc Hai — Thu, 21 May 2026 14:00:00 +0000

I've been doing system design interviews for years - both as a candidate and as an interviewer. The hardest part to practice alone is not the knowledge. It's the process: starting from requirements, running the numbers, evaluating multiple options before committing, and making explicit trade-off arguments at the end.

Reading about it helps. Actually working through it is different.

A few months ago I built a structured multi-step AI plan that walks through a complete system design session end-to-end. I've been using it to practice against different systems. This post shows exactly what it produces, using a URL shortener as the example.

The problem with single-prompt system design

The obvious move when practicing system design with AI is something like:

"You are a senior engineer. Design a URL shortener. Cover requirements, architecture, database schema, caching, and scalability."

You get back something that looks thorough. Redis for caching. PostgreSQL for storage. Load balancers. CDN. The answer ticks the boxes.

But try pushing on it. Why Redis specifically and not Memcached? What QPS number drove that decision? Why 302 and not 301? Why that partitioning key?

The AI cannot answer because it did not derive these choices from anything - it pattern-matched from the thousands of URL shortener articles it has seen. The output sounds right because the training data sounds right.

The other problem: a single prompt collapses the entire design process into one shot. Real system design is sequential. You cannot choose an architecture before running the numbers. You cannot make a trade-off argument before evaluating the alternatives. Skipping the steps does not make the output wrong; it makes the reasoning invisible.

A different approach: chain the steps

System design interviews have a well-known format for good reason. You are expected to work through specific phases in order: clarify requirements, estimate scale, propose and compare architecture options, draw the high-level design, deep-dive into the critical components, and close with trade-offs. Skipping phases or doing them out of order is a signal that you do not have a structured approach.

The insight is that this format maps directly onto a multi-step AI workflow. Instead of one big prompt, you instruct the AI to follow the same interview structure, one phase at a time, where each phase builds on the output of the previous one.

I structured the workflow as 7 sequential steps that mirror the formal system design interview format:

Step	Interview phase	What it does
1 - Requirements	Clarify requirements	Clarifies and completes the requirements, fills in missing NFR defaults, states assumptions
2 - Back-of-envelope	Estimate scale	Derives traffic, storage, bandwidth, and cache estimates with arithmetic
3 - Architecture options	Propose options	Proposes 2-3 options with pros/cons, recommends one, produces a comparison diagram
4 - High-level design	High-level design	Component overview, data flow, full architecture diagram
5 - Deep-dive	Deep-dive	Data model, API design, scalability strategies, failure modes table
6 - Trade-offs	Trade-offs	Decision table, known limitations, future improvements
7 - Final doc	-	Assembles everything into a single coherent document

Each step prompt references prior outputs using {{step_id}} placeholders. By the time the failure modes table is written, the AI knows the exact QPS numbers, the dominant bottleneck, and which architecture was chosen - and why. Nothing is invented in isolation.

I built this using Askimo Plans, which lets you define multi-step AI sessions in YAML. But the structure itself is what matters - you could implement the same chain in any tool that passes context between steps.
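To make the chaining idea concrete, here is a minimal sketch of a runner that substitutes {{step_id}} placeholders with earlier outputs. The `Step` type, `renderPrompt`, and `runPlan` are hypothetical names for illustration, not Askimo's plan schema, and `model` stands in for whatever LLM call you use.

```kotlin
// Hypothetical types for illustration; this is not Askimo's plan schema.
data class Step(val id: String, val promptTemplate: String)

private val placeholder = Regex("""\{\{([A-Za-z0-9_-]+)\}\}""")

// Replace each {{step_id}} with the output that step produced earlier.
fun renderPrompt(template: String, outputs: Map<String, String>): String =
    placeholder.replace(template) { match ->
        val id = match.groupValues[1]
        outputs[id] ?: error("Step '$id' has not produced output yet")
    }

// Run steps in order; `model` stands in for whatever LLM call you use.
fun runPlan(steps: List<Step>, model: (String) -> String): Map<String, String> {
    val outputs = LinkedHashMap<String, String>()
    for (step in steps) {
        outputs[step.id] = model(renderPrompt(step.promptTemplate, outputs))
    }
    return outputs
}
```

Because a step can only reference outputs that already exist, a template that points at a later step fails fast instead of silently sending an unfilled placeholder to the model.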

Walking through the URL shortener

I provided these inputs:

System: URL Shortener
Functional requirements: shorten URL, redirect to original, custom aliases, click analytics
Non-functional: 99.99% availability, redirect latency < 10ms p99
Scale hints: 500M users, 50M DAU. 200M new URLs/day (2,300 writes/sec avg)

The requirements step does not design anything. It restates what you gave it precisely, fills in missing non-functional defaults, and marks explicit assumptions. One thing worth noting from this run: my input just said "redirect." The AI flagged that 301 is permanent - browsers cache it, so users only hit your service once and analytics stop working. It surfaced 302 as the correct choice. Small detail, real architectural consequence, and the right place to catch it.
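As a small illustration of the 301 vs 302 point, the sketch below builds the redirect response as plain data. `RedirectResponse` and `buildRedirect` are hypothetical names, and a real service would set these values through its HTTP framework.

```kotlin
// Hypothetical response type for illustration.
data class RedirectResponse(val status: Int, val headers: Map<String, String>)

// 302 Found is temporary, so clients come back to the service on each click;
// 301 Moved Permanently may be cached by the browser, which hides later clicks.
fun buildRedirect(longUrl: String, trackClicks: Boolean): RedirectResponse {
    val status = if (trackClicks) 302 else 301
    return RedirectResponse(status, mapOf("Location" to longUrl))
}
```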

The back-of-envelope step is the one most people skip in practice and get destroyed on in interviews. For this system: 35K peak redirect QPS, 915 GB for URL records, 365 TB for analytics over 5 years, 183 GB cache for hot URLs. That 365 TB number immediately rules out storing analytics in the same database as URLs. The 1000:1 read/write ratio flags this as read-heavy where almost everything depends on cache hit rate. These derived numbers drive every architecture decision that follows.
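The first conversion in any estimate of this kind is daily volume to per-second rate. A minimal sketch, using only the 200M new URLs/day input from above; `peakFactor` is a hypothetical parameter you would have to choose yourself, since the inputs do not state one.

```kotlin
const val SECONDS_PER_DAY = 86_400L

// Average events per second for a given daily volume.
fun averagePerSecond(perDay: Long): Double = perDay.toDouble() / SECONDS_PER_DAY

// Peak rate under an assumed peak-to-average multiplier (hypothetical input).
fun peakPerSecond(perDay: Long, peakFactor: Double): Double =
    averagePerSecond(perDay) * peakFactor
```

`averagePerSecond(200_000_000L)` comes out at roughly 2,315, which is where the "2,300 writes/sec avg" in the inputs comes from.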

The architecture options step takes those numbers and proposes 2-3 distinct designs with pros/cons and a Mermaid comparison diagram. For this system it landed on a read/write split with async analytics: stateless redirect service backed by Redis and read replicas, separate write service, analytics into Kafka then ClickHouse. The recommendation is justified by the numbers, not by what sounds architecturally fashionable.

The high-level design, deep-dive, and trade-offs steps each build on what came before. Every sizing decision in the architecture diagram traces back to the estimates. The failure modes table references the bottlenecks identified two steps earlier. The trade-off table - the step most candidates skip - has every row traceable to a prior decision: 302 vs 301 from requirements, Redis over Memcached from the 183 GB cache estimate, Kafka over direct ClickHouse write from the < 10ms p99 redirect SLA.

The design is a starting point, not a dead end

Once the plan finishes, Askimo keeps the full context of the entire run in memory. You can keep the conversation going with follow-up questions and the AI already knows everything it produced.

For example, the high-level design above is deliberately cloud-agnostic. If you want to deploy it on AWS, you can ask:

"Map this architecture to AWS services"

And because the AI has the full context - the 35K peak QPS, the 183 GB Redis cluster, the Kafka pipeline, the ClickHouse analytics store - the response is not generic. It translates each specific component: ElastiCache for Redis, MSK for Kafka, ALB + ECS Fargate for the redirect service, RDS Aurora with read replicas for PostgreSQL, and suggests whether ClickHouse on EC2 or an alternative like Redshift makes more sense given the 365 TB analytics volume.

Same idea for other follow-ups:

"How would I run this on GCP?" - gets you Cloud Memorystore, Pub/Sub, Cloud Run, Cloud SQL
"What would a Kubernetes deployment look like for the redirect service?" - gets you HPA config, resource limits, liveness probes tuned to the latency SLA
"Estimate the monthly AWS cost for this design at 35K QPS" - gets you a rough cost breakdown per service

The context the plan built is what makes these answers useful. Without the prior chain, you get generic cloud mapping. With it, you get something specific to the system you actually designed.

Why the chained context matters

The trade-off table above is not invented. Every row traces back to a decision made in a previous step:

302 vs 301 came from the requirements step, where the AI flagged that click analytics requires 302. A 301 causes browsers to cache the redirect and stop hitting your service, so analytics data stops flowing.
NoSQL over RDBMS follows from the back-of-envelope step. 50 TB of URL mappings and 500K peak QPS rule out a manually sharded PostgreSQL cluster. The NoSQL choice is the direct consequence of those numbers, not a default assumption.
ALB over API Gateway for the read path follows from the back-of-envelope. At 10B redirects per day, API Gateway's per-request billing becomes cost-prohibitive. The estimate made that visible before any architecture was drawn.
Async analytics via Kafka follows from the < 10ms p99 redirect SLA in requirements. Writing 115K click events per second synchronously on the redirect hot path would blow the latency budget. Kafka decouples the write so the redirect service returns immediately.
DynamoDB Atomic Counters for ID generation follows from the collision guarantee in requirements, combined with the NoSQL-only architecture chosen in the previous step. No ZooKeeper cluster required.
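The async analytics decision is easier to see in code. The sketch below uses a bounded in-memory queue as a stand-in for a Kafka producer buffer, which is an assumption for illustration only; `ClickEvent`, `ClickBuffer`, and `handleRedirect` are hypothetical names. The point is that the redirect path enqueues without blocking and returns.

```kotlin
import java.util.concurrent.ArrayBlockingQueue

data class ClickEvent(val shortCode: String, val timestampMillis: Long)

// In-memory stand-in for a Kafka producer buffer (assumption for illustration).
class ClickBuffer(capacity: Int) {
    private val queue = ArrayBlockingQueue<ClickEvent>(capacity)

    // Non-blocking: returns false and drops the event when the buffer is full,
    // so the redirect path never waits on analytics.
    fun record(event: ClickEvent): Boolean = queue.offer(event)

    // Called by a background consumer, off the redirect hot path.
    fun drain(max: Int): List<ClickEvent> {
        val batch = ArrayList<ClickEvent>(max)
        queue.drainTo(batch, max)
        return batch
    }
}

// Hot path: look up the target, enqueue the click, return immediately.
fun handleRedirect(shortCode: String, urls: Map<String, String>, clicks: ClickBuffer): String? {
    val target = urls[shortCode] ?: return null
    clicks.record(ClickEvent(shortCode, System.currentTimeMillis()))
    return target
}
```

Dropping events when the buffer is full is one possible policy; it trades analytics completeness for redirect latency, which matches the priority the SLA sets.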

A single prompt cannot produce this because there is no prior context to reference. The AI just picks whatever sounds reasonable for a URL shortener. The multi-step approach forces the conclusions to be derived, not guessed.

The plan structure (for the curious)

Behind this session is a YAML file with 7 steps. Each step has a single clear goal and receives the prior outputs it needs as context via {{step_id}} references.

Step	Goal	Key inputs
`requirements`	Clarify, complete, and de-ambiguate the inputs	Your functional/NFR/scale hints
`back-of-envelope`	Derive traffic, storage, bandwidth, and cache estimates with arithmetic	`{{requirements}}`
`architecture-options`	Propose 2-3 distinct options with pros/cons and a Mermaid comparison diagram	`{{requirements}}`, `{{back-of-envelope}}`
`high-level`	Component overview, data flow, full architecture diagram	`{{requirements}}`, `{{back-of-envelope}}`, `{{architecture-options}}`
`deep-dive`	Data model, API design, failure modes table, sequence diagrams	`{{high-level}}`
`tradeoffs`	Decision table, known limitations, future improvements	`{{requirements}}`, `{{high-level}}`, `{{deep-dive}}`
`final`	Assemble all outputs into a single shareable document	All prior steps

The {{step_id}} references are what make the chain work. The back-of-envelope step does not re-read your inputs - it reads the clarified requirements from step 1. The architecture options step does not guess - it sees the numbers from step 2 and the requirements from step 1. Each step does one thing.
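One way to sanity-check a plan like this is to encode the step table as data and verify that every reference points backwards. In the sketch below the step ids and inputs are copied from the table; the `PlanStep` type and `forwardReferences` function are hypothetical and not part of Askimo.

```kotlin
// Step ids and inputs copied from the table above; the types are hypothetical.
data class PlanStep(val id: String, val inputs: List<String>)

val systemDesignPlan = listOf(
    PlanStep("requirements", emptyList()),
    PlanStep("back-of-envelope", listOf("requirements")),
    PlanStep("architecture-options", listOf("requirements", "back-of-envelope")),
    PlanStep("high-level", listOf("requirements", "back-of-envelope", "architecture-options")),
    PlanStep("deep-dive", listOf("high-level")),
    PlanStep("tradeoffs", listOf("requirements", "high-level", "deep-dive")),
    PlanStep(
        "final",
        listOf("requirements", "back-of-envelope", "architecture-options", "high-level", "deep-dive", "tradeoffs"),
    ),
)

// Returns every reference that points at a step that has not run yet.
fun forwardReferences(plan: List<PlanStep>): List<String> {
    val seen = HashSet<String>()
    val problems = ArrayList<String>()
    for (step in plan) {
        step.inputs.filterNot { it in seen }.forEach { problems += "${step.id} -> $it" }
        seen += step.id
    }
    return problems
}
```

An empty result means the steps can run strictly in the listed order, which is the property the chain relies on.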

The full YAML is on GitHub. You can copy it, load it into any Askimo instance, or use it as a template for your own plan variants.

Adapting it to other systems

I've run this against a few different systems now:

Real-time chat - the architecture options step centres on WebSocket vs SSE vs long-polling trade-offs; message ordering and delivery guarantees dominate the deep-dive
Ride-sharing platform - matching latency becomes the dominant constraint; the back-of-envelope gets interesting when you factor in geospatial indexing
Video streaming - storage and CDN costs utterly dominate everything; the trade-offs around pre-encoding vs adaptive bitrate streaming are where the depth is

The plan structure stays the same. The conclusions change based on what the numbers show.

Trying it yourself

The plan is built into Askimo if you want to run it as-is. You fill in system name, requirements, and scale hints, then step through the session with whatever AI provider you use (OpenAI, Claude, Gemini, or a local model via Ollama).

You can also adapt the YAML to add steps - a cost estimation step after back-of-envelope, a security review step, a migration plan step. The plan editor has an AI generator that writes the YAML from a plain-English description if you do not want to write it by hand.

The full source for the system design plan is available on GitHub. Contributions welcome.

Building a Desktop AI Chat App for ChatGPT, Claude, Gemini & Ollama

Nguyen Phuc Hai — Tue, 17 Feb 2026 00:54:28 +0000

Learn how to build an open-source desktop AI chat client that connects multiple AI providers in one application. This technical guide covers Kotlin Compose architecture, streaming responses, RAG implementation, and production patterns.

The Problem: Using ChatGPT, Claude, Gemini Means Opening Multiple Apps

As developers, we've discovered that different AI models excel at different tasks. ChatGPT is great for general conversation and brainstorming. Claude works well for coding questions and technical analysis. Gemini handles multimodal tasks with images and documents. Ollama gives you free, unlimited access to open-source models without subscription limits.

But here's the frustration: Each of these requires a different web application, a different account, a different browser tab.

The modern AI workflow reality:

ChatGPT web app open for general questions
Claude web app open for coding help
Google AI Studio open for multimodal tasks
Ollama command line running locally for experimentation without API costs
Constant context switching between different interfaces, keyboard shortcuts, and UX patterns
Fragmented conversation history - your coding discussion with Claude is separate from your general brainstorming in ChatGPT
No unified search - can't search across all your AI conversations in one place
Multiple subscriptions - managing different payment plans and free tier limits

What if you could have one desktop application that works with ChatGPT, Claude, Gemini, Ollama, and any other AI provider - letting you choose the best model for each task without switching apps?

That's why we built Askimo - an open-source desktop AI chat client built with Kotlin and Compose for Desktop. You can download it for free for macOS, Windows, and Linux.

Why Desktop? Why Not Another Web App?

Before diving into the technical implementation, let's address the elephant in the room: Why build a desktop app in 2026?

Desktop Advantages for AI Chat

Zero Infrastructure: Just download and run. No server to set up, no deployment, no hosting costs. Open the app and start chatting.
Persistent State: Desktop apps don't lose state when you close a tab. Chat in up to 20 tabs simultaneously - more than enough for any workflow - and they all stay exactly where you left them.
True Privacy: Local-first architecture means conversations never leave your machine unless you explicitly send them to an AI provider.
Native Performance: No browser overhead. Direct access to system resources for faster rendering and lower memory usage (50-300 MB vs 500+ MB for browser tabs).
Offline Capability: Read past conversations, search history, and manage projects - all without internet.
System Integration: Deep OS integration for keyboard shortcuts, native notifications, and file system access.

Why Kotlin + Compose for Desktop?

Modern declarative UI with Compose's reactive paradigm
Shared code between CLI and desktop modules
Coroutines for elegant async/concurrent programming
Type safety that prevents entire classes of runtime errors
Mature ecosystem with LangChain4j for AI integration

Strategic advantage: Code reuse for mobile apps

Choosing Kotlin and Compose Multiplatform gives us a significant long-term benefit: when we expand to mobile (iOS/Android), we can reuse 60-80% of our codebase.

The same business logic that powers the desktop app can power mobile:

Session management - Same conversation state management across all platforms
AI provider integrations - OpenAI, Anthropic, Ollama clients work identically
Streaming handling - The concurrent stream management we built for desktop works on mobile
Database layer - SQLite-based storage runs on all platforms
Markdown rendering - Custom renderer works on iOS and Android without changes
RAG pipeline - Document processing and embedding logic is platform-agnostic

Only the UI layer needs platform-specific adaptation - and even there, Compose Multiplatform lets us share UI components with platform-specific tweaks.
Compare this to the web app path:

Web → Mobile means rebuilding everything in Swift/Kotlin or using slower hybrid frameworks
Desktop Electron → Mobile means completely separate codebases
Native from the start → Future mobile apps share the same proven, battle-tested core

And this isn't some experimental tech we're betting on - Compose Multiplatform is already battle-tested in production by companies like JetBrains and Netflix. So when we decide to ship mobile apps, we won't be starting from scratch. All the tricky stuff - session management, streaming handlers, RAG pipelines - will already work. We'll just need to adapt the UI.

TL;DR: Desktop first, but with mobile in our back pocket for later.

The Trade-offs: More Effort, Better Control

Let's be honest: building a native desktop app requires significantly more effort than a web app.

When you build for the web, the browser gives you useful tools for free:

Markdown rendering - Just use a library like marked.js and let the browser's HTML engine handle it
Syntax highlighting - Drop in Prism.js or highlight.js
Charts and visualizations - Chart.js, D3.js, countless options
File handling - Browser APIs abstract the complexity
Cross-platform rendering - Write once, runs everywhere with the same look

For a native desktop app, we had to build all of this ourselves:

1. Custom Markdown Rendering

Implemented a CommonMark parser in Kotlin
Built custom rendering logic for code blocks, tables, lists
Created syntax highlighting integration for 50+ programming languages
No browser HTML engine to fall back on

2. Platform-Specific Challenges

File system access differs on macOS, Windows, and Linux
Window management and keyboard shortcuts need OS-specific handling
Native menus and notifications require platform adapters
Different packaging systems for each OS (DMG, MSI, DEB/RPM)

3. Custom UI Components

Built chart rendering using Compose Canvas APIs
Implemented custom text editors with syntax highlighting
Created scrollable containers with proper touch/mouse handling
Designed responsive layouts without CSS flexbox

4. Resource Management

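Explicit memory budgeting for long-running processes
Thread pool sizing for concurrent AI streams
Database connection pooling
No browser tab lifecycle to reclaim resources for us (the JVM still collects garbage, but caches and sessions have to be bounded explicitly)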

So why go through all this extra effort?

We believe the benefits are worth it for this specific use case:

1. Performance & Resource Efficiency

50-300 MB memory usage vs 500+ MB for equivalent web apps
1.5-3 second startup vs 5-10 seconds for web-based alternatives
Direct system access - no browser overhead for file operations
Efficient rendering - only what changed, not full DOM diffing

2. User Control & Privacy

Complete local storage - users truly own their data
No cloud dependencies for core functionality
Encrypted local database - conversations never leave the machine (learn more about Askimo's security features)
No telemetry or tracking by default - users control everything

3. Long-term Strategic Benefits

Local tool integration - direct access to file system, terminal, development tools
Offline-first - full functionality without internet (except AI API calls)
System integration - global keyboard shortcuts, menu bar presence, system notifications
Future extensibility - can integrate with OS-level features (Spotlight search, Quick Look, etc.)

4. Better UX for AI Workflows

Instant search - local SQLite queries are 10-100x faster than cloud-based search
Reliable state - no session timeouts, no lost tabs, no connection drops
Multi-tab workflows - handle up to 20 concurrent conversations without browser memory bloat (see desktop features)
Consistent experience - same UI across all platforms, not dependent on browser quirks

The web browser is well-suited for content consumption, but for productivity tools that handle sensitive data and need deep system integration, native apps have the edge on privacy, performance, and OS access.

For Askimo specifically, the ability to:

Store thousands of conversations locally with instant search
Switch AI providers without page reloads or state loss
Work with multiple AI platforms - from cloud services like ChatGPT, Claude, and Gemini to local Ollama models
Access project files and web content directly for RAG context
Work offline for reviewing past conversations

...made the extra development effort a worthwhile investment.

Bottom line: If you're building a simple content-focused app, choose web. If you're building a productivity tool that needs privacy, performance, and deep system integration, the native desktop path - despite its challenges - delivers better long-term value for users.

Architecture Overview

Askimo uses a provider-agnostic architecture that abstracts AI models behind a common interface. Here's the high-level structure:

┌─────────────────────────────────────────┐
│     Compose Desktop UI Layer            │
│   (ViewModels + Reactive State)         │
└──────────────┬──────────────────────────┘
               │
┌──────────────▼──────────────────────────┐
│      Session Management Layer           │
│   (up to 20 tabs, LRU cache)            │
└──────────────┬──────────────────────────┘
               │
┌──────────────▼──────────────────────────┐
│    Provider Abstraction Layer           │
│   ChatModelFactory<T: ProviderSettings> │
└──────────────┬──────────────────────────┘
               │
       ┌───────┴───────┐
       │               │
┌──────▼─────┐   ┌────▼──────┐
│  OpenAI    │   │  Ollama   │  ...
│  Factory   │   │  Factory  │
└────────────┘   └───────────┘
       │               │
┌──────▼───────────────▼──────────────────┐
│         LangChain4j Integration         │
│  (Streaming, Memory, RAG, Tools)        │
└─────────────────────────────────────────┘

Core Implementation: Provider Abstraction

The heart of Askimo's multi-provider support is the ChatModelFactory interface. This is how we achieve provider independence. You can see all supported AI providers and their configuration in the documentation.

1. ChatModelFactory Interface

interface ChatModelFactory<T : ProviderSettings> {
    // List available models for this provider
    fun availableModels(settings: T): List<String>

    // Identify which provider this factory creates
    fun getProvider(): ModelProvider

    // Default configuration for this provider
    fun defaultSettings(): T

    // Create a chat client instance
    fun create(
        sessionId: String? = null,
        model: String,
        settings: T,
        retriever: ContentRetriever? = null,
        executionMode: ExecutionMode,
        chatMemory: ChatMemory? = null,
    ): ChatClient

    // Create cheap utility client for classification tasks
    fun createUtilityClient(settings: T): ChatClient
}

Key Design Decisions:

Generic type parameter <T: ProviderSettings> - Each factory specifies its own settings type, ensuring type safety at compile time
ContentRetriever for RAG - Optional parameter enables Retrieval-Augmented Generation for file/project context
ChatMemory injection - Conversation history managed externally but injected at creation time
ExecutionMode awareness - Different behavior for CLI vs Desktop (e.g., tools disabled in desktop)
Utility client for background tasks - createUtilityClient() returns a cheap, fast model for tasks that don't need the most powerful AI

Why createUtilityClient?

Many AI workflows involve tasks that don't require expensive, state-of-the-art models. Examples include:

Memory summarization:

Condensing old conversation messages into summaries
A simple task that GPT-3.5-turbo handles just as well as GPT-4
Running hundreds of times during long conversations
Using GPT-4 would cost 10-20x more with no quality benefit

Intent classification:

Deciding "should we use RAG for this query?" → YES/NO
Validating "is this a question?" → YES/NO
Simple binary decisions that don't need advanced reasoning

The trade-off:

Cloud providers (OpenAI, Anthropic, Google): Use a cheaper model (e.g., GPT-3.5-turbo costs ~$0.001/1K tokens vs GPT-4's ~$0.03/1K tokens)
Local providers (Ollama, LM Studio): Use the same model (no API costs, so no benefit to switching)

// Example: OpenAI implementation (other ChatModelFactory members omitted for brevity)
class OpenAiChatModelFactory : ChatModelFactory<OpenAiSettings> {
    override fun createUtilityClient(settings: OpenAiSettings): ChatClient {
        // Use GPT-3.5-turbo for cheap background tasks
        return create(
            sessionId = null,
            model = "gpt-3.5-turbo",  // Cheap model for utility tasks
            settings = settings,
            retriever = null,
            executionMode = ExecutionMode.DESKTOP,
            chatMemory = null
        )
    }
}

// Example: Ollama implementation (other ChatModelFactory members omitted for brevity)
class OllamaChatModelFactory : ChatModelFactory<OllamaSettings> {
    override fun createUtilityClient(settings: OllamaSettings): ChatClient {
        // Local models have no API cost, use the same model
        return create(
            sessionId = null,
            model = settings.defaultModel,  // Same model, no cost difference
            settings = settings,
            retriever = null,
            executionMode = ExecutionMode.DESKTOP,
            chatMemory = null
        )
    }
}

Real-world impact:

A user with 100 conversations averaging 200 messages each triggers ~100 summarization calls
With GPT-4: ~$6-10 in API costs for background tasks
With GPT-3.5-turbo utility client: ~$0.30-0.50 in API costs
20x cost reduction for the same functionality

This pattern keeps the AI experience responsive and affordable without compromising the quality of user-facing responses.

2. ProviderSettings Interface

Each provider has its own settings class:

interface ProviderSettings {
    val defaultModel: String

    // Human-readable description (masks sensitive data)
    fun describe(): List<String>

    // Configurable fields for UI
    fun getFields(): List<SettingField>

    // Update a field and return new instance (immutable pattern)
    fun updateField(fieldName: String, value: String): ProviderSettings

    // Validate settings are ready for use
    fun validate(): Boolean

    // Help text when validation fails
    fun getSetupHelpText(messageResolver: (String) -> String): String
}
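The "update a field and return new instance" contract maps naturally onto Kotlin's data class `copy`. A minimal sketch follows, assuming a made-up `DemoSettings` class with illustrative field names; it is not one of Askimo's settings classes and does not implement the full interface.

```kotlin
// Hypothetical settings class; field names are illustrative, not Askimo's.
data class DemoSettings(
    val apiKey: String = "",
    val defaultModel: String = "demo-model",
) {
    // Returns a modified copy; the receiver is left untouched.
    fun updateField(fieldName: String, value: String): DemoSettings = when (fieldName) {
        "apiKey" -> copy(apiKey = value)
        "defaultModel" -> copy(defaultModel = value)
        else -> this
    }

    // Human-readable summary that never prints the full key.
    fun describe(): List<String> = listOf(
        "model: $defaultModel",
        "apiKey: ${mask(apiKey)}",
    )

    fun validate(): Boolean = apiKey.isNotBlank()

    // Shows at most the last four characters of the secret.
    private fun mask(secret: String): String =
        if (secret.isEmpty()) "(not set)" else "****" + secret.takeLast(4)
}
```

Returning a new instance means a settings screen can hold the old value until the user confirms, and a failed validation never leaves a half-edited object behind.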

3. Example: OpenAI Provider Implementation

Here's how we implement the OpenAI/ChatGPT provider:

@Serializable
data class OpenAiSettings(
    override var apiKey: String = "",
    override val defaultModel: String = "gpt-4o",
    var baseUrl: String = "https://api.openai.com/v1",
    override var presets: Presets = Presets(Style.BALANCED),
) : ProviderSettings, HasApiKey, HasBaseUrl {
    // describe(), getFields(), and updateField() omitted for brevity

    override fun validate(): Boolean = apiKey.isNotBlank()

    override fun getSetupHelpText(messageResolver: (String) -> String): String {
        return """
            OpenAI requires an API key to use.
            1. Get your API key: https://platform.openai.com/api-keys
            2. Configure it with: :set-param api_key YOUR_KEY
        """.trimIndent()
    }
}

Streaming AI Responses: Managing Multiple Concurrent Conversations

One of Askimo's key features: handling up to 20 simultaneous AI conversations, each with its own streaming response thread.

The Challenge

Each active conversation needs a dedicated thread for streaming AI responses
Memory must be bounded to prevent resource exhaustion
Inactive sessions should be cached but unloaded from memory
Thread-safe state management across concurrent operations

Our Approach

We use Kotlin's StateFlow for reactive state management and Coroutines for concurrent streaming:

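import kotlinx.coroutines.flow.MutableStateFlow
import kotlinx.coroutines.flow.StateFlow
import kotlinx.coroutines.flow.asStateFlow
import kotlinx.coroutines.flow.catch
import kotlinx.coroutines.launch

class ChatViewModel(
    private val sessionId: String,
    private val chatService: ChatService,
) {
    // UI-facing state: exposed read-only, mutated only inside the view model
    private val _isStreaming = MutableStateFlow(false)
    val isStreaming: StateFlow<Boolean> = _isStreaming.asStateFlow()

    private val _messages = MutableStateFlow<List<Message>>(emptyList())
    val messages: StateFlow<List<Message>> = _messages.asStateFlow()

    fun sendMessage(content: String) {
        // viewModelScope is a CoroutineScope tied to this view model's lifecycle (declaration not shown)
        viewModelScope.launch {
            _isStreaming.value = true

            try {
                chatService.streamResponse(content)
                    // catch only sees failures from the upstream flow, not from collect
                    .catch { error -> handleStreamingError(error) }
                    .collect { chunk -> appendToLastMessage(chunk) }
            } finally {
                // Runs on success, failure, and cancellation
                _isStreaming.value = false
            }
        }
    }
}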

Key benefits:

Reactive UI updates - Compose automatically recomposes when StateFlow changes
Thread-safe - StateFlow can be read and updated safely from multiple coroutines
Backpressure handling - StateFlow conflates rapid updates, so a slow UI collector only sees the latest value
Automatic cleanup - Coroutines cancelled when ViewModel disposed

Session Management

We maintain up to 20 active sessions in memory with LRU-style eviction:

Sessions only created when first accessed (lazy initialization)
Inactive sessions automatically cleaned up when limit reached
Active streaming sessions are never evicted
Mutex-protected state for thread safety

This keeps memory usage bounded (~50-300 MB total) while supporting real-world workflows.
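The eviction rules above can be sketched with an access-ordered `LinkedHashMap`. This is a simplified illustration, not Askimo's implementation: `Session` and `SessionCache` are hypothetical, and a `ReentrantLock` stands in for whatever mutex the real code uses.

```kotlin
import java.util.concurrent.locks.ReentrantLock
import kotlin.concurrent.withLock

// Hypothetical session type; only the flag the eviction rule needs.
class Session(val id: String) {
    @Volatile var isStreaming: Boolean = false
}

class SessionCache(private val maxSessions: Int = 20) {
    private val lock = ReentrantLock()

    // accessOrder = true keeps entries ordered from least to most recently used.
    private val sessions = LinkedHashMap<String, Session>(16, 0.75f, true)

    // Lazily creates the session on first access, then evicts if over the limit.
    fun getOrCreate(id: String): Session = lock.withLock {
        val session = sessions.getOrPut(id) { Session(id) }
        evictIfNeeded(keep = id)
        session
    }

    fun contains(id: String): Boolean = lock.withLock { sessions.containsKey(id) }

    private fun evictIfNeeded(keep: String) {
        val iterator = sessions.entries.iterator()
        while (sessions.size > maxSessions && iterator.hasNext()) {
            val entry = iterator.next()
            // Never evict a session that is streaming or the one just requested.
            if (!entry.value.isStreaming && entry.key != keep) iterator.remove()
        }
    }
}
```

If every cached session is streaming, this sketch lets the cache grow past the limit instead of interrupting a response.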

Error Recovery: Preserving Partial AI Responses

AI APIs can fail at any moment - network issues, rate limits, timeouts. Most chat apps lose everything when this happens. Askimo preserves partial responses.

The Problem

When an AI streaming call fails:

You've already received 500 words of a 1000-word response
The API connection drops
Standard implementations discard everything

Our Solution: Incremental Persistence

private suspend fun handleStreamingError(error: Throwable) {
    // Get the partial content we've accumulated so far
    val partialContent = getCurrentAccumulatedContent()

    if (partialContent.isNotEmpty()) {
        // Save what we have
        val partialMessage = Message(
            content = "$partialContent\n\n[Response interrupted: ${error.message ?: "unknown error"}]",
            role = Role.ASSISTANT,
            timestamp = Clock.System.now(),
            isError = true
        )

        // Replace the temporary streaming message with saved version
        replaceTemporaryMessage(partialMessage)

        // Persist to database immediately
        messageRepository.save(sessionId, partialMessage)
    }

    // Notify user with non-intrusive error indicator
    eventBus.publish(StreamingErrorEvent(sessionId, error))
}

User Experience:

✅ Partial responses are preserved and saved
✅ Clear visual indication that response was interrupted
✅ Resume capability - Users can retry from the partial state
✅ No data loss - Everything is persisted immediately

Project-Based Context: RAG for Your Documents

One useful feature we added: point it at your documents and ask questions. Whether it's code, PDFs, Microsoft Office files, OpenOffice documents, or web pages - Askimo can understand and answer questions about your content. Learn more about Askimo's RAG capabilities.

Architecture: Content Retrieval

// User attaches a project folder or documents
val project = Project(
    name = "my-project",
    knowledgeSources = listOf(
        FileSystemSource("/path/to/documents"),  // PDFs, Office docs, text files
        FileSystemSource("/path/to/codebase/src"),  // Source code
        UrlSource("https://docs.example.com")  // Web documentation
    )
)

// When user sends a message in this project's session:
val retriever = createContentRetriever(project)

val chatClient = factory.create(
    sessionId = sessionId,
    model = "gpt-4o",
    settings = openAiSettings,
    retriever = retriever,  // RAG enabled!
    executionMode = ExecutionMode.DESKTOP,
    chatMemory = conversationMemory
)

How RAG Works in Askimo

Askimo supports a wide range of document formats:

Office Documents: Microsoft Word (.docx), Excel (.xlsx), PowerPoint (.pptx)
OpenOffice: Writer (.odt), Calc (.ods), Impress (.odp)
PDFs: Extracts text content from PDF files
Code: All programming languages and text-based formats
Web Pages: Crawl and index documentation sites

The RAG Pipeline:

Ingestion: Documents are parsed, chunked, and embedded when project is created
Query-time retrieval: User's question is embedded and similar chunks retrieved
Context injection: Retrieved chunks are added to the prompt automatically
Response: AI answers using both conversation history AND document context
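
To make the chunking part of the ingestion step concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification, not Askimo's code: the function name `chunkText` is hypothetical, and it cuts on raw character offsets, whereas a real splitter would usually prefer paragraph and sentence boundaries.

```kotlin
// Hypothetical helper: splits text into windows of at most `chunkSize` characters,
// where each window repeats the last `overlap` characters of the previous one.
fun chunkText(text: String, chunkSize: Int = 500, overlap: Int = 50): List<String> {
    require(chunkSize > 0) { "chunkSize must be positive" }
    require(overlap in 0 until chunkSize) { "overlap must be smaller than chunkSize" }

    val step = chunkSize - overlap
    val chunks = mutableListOf<String>()
    var start = 0
    while (start < text.length) {
        val end = minOf(start + chunkSize, text.length)
        chunks.add(text.substring(start, end))
        if (end == text.length) break  // the last window reached the end of the text
        start += step
    }
    return chunks
}
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.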

Hybrid Search: JVector + Lucene

We chose a hybrid content retriever that combines two complementary search strategies:

1. Vector Search (JVector) - Semantic similarity

Finds content that's conceptually related to the query
Example: Query "error handling" matches "exception management" even without exact words
Uses embeddings to capture meaning, not just keywords

2. Keyword Search (Lucene) - Exact term matching

Finds content with specific terms, names, or identifiers
Example: Query "UserRepository.findById" finds exact method references
Critical for code, API names, and technical terms

Why hybrid?

Neither approach alone is sufficient:

Vector-only: Misses exact matches (class names, function signatures, specific error codes)
Keyword-only: Misses semantic relationships (synonyms, paraphrased concepts, related ideas)

The hybrid retriever combines both using Reciprocal Rank Fusion (RRF) - a proven algorithm that merges ranked lists:

class HybridContentRetriever(
    private val vectorRetriever: ContentRetriever,
    private val keywordRetriever: ContentRetriever,
    private val maxResults: Int,
    private val k: Int = 60  // Standard RRF constant
) : ContentRetriever {

    override fun retrieve(query: Query): List<Content> {
        val vectorResults = vectorRetriever.retrieve(query)
        val keywordResults = keywordRetriever.retrieve(query)

        // Merge using Reciprocal Rank Fusion, passing k so the constant above is actually applied
        return reciprocalRankFusion(vectorResults, keywordResults, k)
            .take(maxResults)
    }
}

How RRF works:

For each document, calculate a fusion score based on its rank in each list:

RRF_score(doc) = Σ 1 / (k + rank_i)

Where:

k = 60 (standard constant that balances the contribution from different retrievers)
rank_i is the position of the document in retriever i's results (1st = rank 1, 2nd = rank 2, etc.)

Example:

A document ranked #1 in vector search and #3 in keyword search:
- Vector score: 1/(60+1) ≈ 0.0164
- Keyword score: 1/(60+3) ≈ 0.0159
- Total RRF score: ≈ 0.0323
A document ranked #1 in both:
- Vector score: 1/(60+1) ≈ 0.0164
- Keyword score: 1/(60+1) ≈ 0.0164
- Total RRF score: ≈ 0.0328 (only slightly higher: with k = 60, neighboring ranks differ marginally)
A document ranked #1 in vector but not found in keyword:
- Vector score: 1/(60+1) ≈ 0.0164
- Keyword score: 0
- Total RRF score: ≈ 0.0164 (about half the score of documents found in both)
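
The formula can be sketched in a few lines of Kotlin. This is an illustrative version, not the helper Askimo uses: the name `fuseByReciprocalRank` is hypothetical, and it works on any item type (for example document IDs) rather than on LangChain4j `Content` objects.

```kotlin
// Hypothetical helper: fuses several ranked lists with Reciprocal Rank Fusion.
// Each list is ordered best-first; an item missing from a list contributes 0 for it.
fun <T> fuseByReciprocalRank(rankedLists: List<List<T>>, k: Int = 60): List<Pair<T, Double>> {
    val scores = LinkedHashMap<T, Double>()
    for (ranked in rankedLists) {
        ranked.forEachIndexed { index, item ->
            val rank = index + 1  // ranks are 1-based
            scores[item] = (scores[item] ?: 0.0) + 1.0 / (k + rank)
        }
    }
    // Highest fused score first
    return scores.entries
        .sortedByDescending { it.value }
        .map { it.key to it.value }
}
```

For instance, fusing a vector ranking `["A", "B", "C"]` with a keyword ranking `["B", "D", "A"]` puts B first (ranks 2 and 1), then A (ranks 1 and 3), followed by D and C, which each appear in only one list.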

Why RRF is better than weighted averaging:

Rank-based, not score-based: Different retrievers produce incomparable scores. RRF only cares about relative ranking.
Robust to failures: If one retriever fails, we gracefully fall back to the other
Rewards consensus: Documents appearing in both lists naturally get higher scores
Well-researched: RRF is a proven algorithm used in information retrieval research

Real-world impact:

A user asks "how to fix null pointer" → finds both "NullPointerException" (keyword) and "defensive null checks" (semantic)
A user asks about "database queries" → finds both "SQL" (keyword) and "data access patterns" (semantic)
More accurate retrieval = better AI answers grounded in your actual documents

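The implementation uses LangChain4j's RAG components. A simplified version of the vector retriever setup:

fun createContentRetriever(project: Project): ContentRetriever {
    val embeddingStore = InMemoryEmbeddingStore<TextSegment>()
    val embeddingModel = createEmbeddingModel()

    // Index all knowledge sources
    project.knowledgeSources.forEach { source ->
        val documents = loadDocuments(source)
        // Chunks of up to 500 characters with a 50-character overlap
        val segments = DocumentSplitters.recursive(500, 50).splitAll(documents)
        // embedAll returns a Response wrapper; addAll takes embeddings first, then segments
        val embeddings = embeddingModel.embedAll(segments).content()
        embeddingStore.addAll(embeddings, segments)
    }

    // LangChain4j exposes a builder here, not a constructor with named arguments
    return EmbeddingStoreContentRetriever.builder()
        .embeddingStore(embeddingStore)
        .embeddingModel(embeddingModel)
        .maxResults(5)
        .minScore(0.7)
        .build()
}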

Memory Management: Token-Aware Conversation History

The Token Problem

Here's something many users don't realize: Every time you send a message to an AI model, the entire conversation history goes with it.

When you ask ChatGPT or Claude a question, the API call looks like this:

{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "What is Python?"},
    {"role": "assistant", "content": "Python is a programming language..."},
    {"role": "user", "content": "How do I install it?"},
    {"role": "assistant", "content": "You can install Python by..."},
    {"role": "user", "content": "Show me a hello world example"}
  ]
}

Notice the pattern? Every previous message is sent again. This is how AI models maintain context - they don't actually "remember" your conversation. Each request is stateless, so you must resend the entire history for the model to understand what you're talking about.

The consequences:

Token consumption per request grows linearly with conversation length, so the cumulative total grows quadratically:
- Message 1: ~100 tokens sent
- Message 10: ~1,000 tokens sent (all previous messages)
- Message 50: ~5,000+ tokens sent
- Message 100: ~10,000+ tokens sent (approaching the limits of smaller models!)
API costs increase: You pay for every token sent, so the total cost of a conversation grows roughly quadratically with its length
Context limits: Most models have token limits (4K-128K depending on the model). Once you hit the limit, you can't continue the conversation without removing history
Performance degradation: Larger context windows slow down response times
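
The growth pattern is easy to check with a small calculation. This sketch assumes every message is about 100 tokens, matching the rough figures above; the function names are made up for illustration.

```kotlin
// Assumption: every message costs ~100 tokens and the full history is resent each turn.
fun tokensSentOnTurn(turn: Int, tokensPerMessage: Int = 100): Long =
    turn.toLong() * tokensPerMessage

// Total tokens sent across all requests up to and including `turns`.
fun cumulativeTokensSent(turns: Int, tokensPerMessage: Int = 100): Long =
    (1..turns).sumOf { tokensSentOnTurn(it, tokensPerMessage) }
```

Under these assumptions, request 100 alone sends 10,000 tokens, but the conversation as a whole has sent 505,000 tokens by that point. The per-request size is linear; the bill is what grows quadratically.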

Askimo's solution: Auto-summarize old messages while keeping recent ones to maintain conversational flow.

Token-Aware Memory with Intelligent Summarization

The key insight: You don't need the entire history, just enough context.

Most conversations follow a natural pattern - the most recent exchanges are what matter for understanding the current question. Earlier messages provide background context, but you rarely need word-for-word accuracy from 50 messages ago. What you need is:

Recent messages in full - The last 50-60% of conversation for immediate context and continuity
Historical overview - A structured summary of earlier messages capturing key facts, decisions, and topics
System instructions preserved - Original prompts and setup never discarded

Think of it like a work meeting - you don't replay the entire 2-hour discussion. You recap the key decisions from the first hour, then dive into the details of the most recent discussion.

Askimo's approach:

Summarize old messages - Condense the oldest 45% into a structured summary with key facts and topics
Keep recent messages intact - Preserve the remaining 55% for immediate conversational context
Never touch system messages - Instructions are always preserved
Run asynchronously - Doesn't block user interaction

class TokenAwareSummarizingMemory(
    private val appContext: AppContext,
    private val sessionId: String,
    private val summarizationThreshold: Double = 0.6  // Trigger at 60% of max tokens
) : ChatMemory {

    // Maximum tokens: 40% of model's context window (dynamically calculated)
    private val maxTokens: Int
        get() = (ModelContextSizeCache.get(currentModel) * 0.4).toInt()

    override fun add(message: ChatMessage) {
        messages.add(message)
        persistToDatabase()

        val totalTokens = estimateTotalTokens()
        val threshold = (maxTokens * summarizationThreshold).toInt()

        if (totalTokens > threshold && !summarizationInProgress) {
            triggerAsyncSummarization()  // Non-blocking
        }
    }

    override fun messages(): List<ChatMessage> = buildList {
        // Structured summary as system message (if exists)
        structuredSummary?.let { add(SystemMessage.from(it)) }

        // Recent conversation messages
        messages.forEach { add(it.toChatMessage()) }
    }
}

// Structured summary format
@Serializable
data class ConversationSummary(
    val keyFacts: Map<String, String>,
    val mainTopics: List<String>,
    val recentContext: String
)
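
To see what the two percentages in the class above mean in absolute terms, here is a small sketch. The function names are hypothetical, and it uses integer percentages instead of doubles to avoid floating-point rounding surprises.

```kotlin
// 40% of the model's context window is reserved as the conversation memory budget.
fun memoryBudgetTokens(contextWindow: Int, budgetPercent: Int = 40): Int =
    contextWindow * budgetPercent / 100

// Summarization is triggered once usage passes 60% of that budget.
fun summarizationTriggerTokens(
    contextWindow: Int,
    budgetPercent: Int = 40,
    thresholdPercent: Int = 60
): Int = memoryBudgetTokens(contextWindow, budgetPercent) * thresholdPercent / 100
```

For a 128,000-token model this gives a budget of 51,200 tokens and a trigger point of 30,720 tokens, which is 24% of the full context window. The rest stays free for retrieved documents, the system prompt, and the model's response.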

What this achieves:

Before (100 messages, ~15,000 tokens):
[Message 1] [Message 2] ... [Message 98] [Message 99] [Message 100]
❌ Exceeds token limit, API call fails

After summarization (~9,000 tokens):
[Summary of messages 1-45] [Message 46] ... [Message 99] [Message 100]
✅ Under token limit, context preserved, conversation continues smoothly

Implementation details:

Summarizes the oldest 45% of conversation messages when the token threshold (60% of max) is reached
System messages (instructions) are never summarized or removed - they're preserved indefinitely
Runs asynchronously so it doesn't block the user's interaction
Falls back to extractive summary if AI-powered summarization fails
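
The selection rule in the first two points can be sketched as a pure function. The types and names here (`Role`, `Msg`, `SplitResult`, `splitForSummarization`) are illustrative assumptions, not Askimo's actual classes.

```kotlin
enum class Role { SYSTEM, USER, ASSISTANT }

data class Msg(val role: Role, val text: String)

data class SplitResult(
    val system: List<Msg>,       // always preserved as-is
    val toSummarize: List<Msg>,  // oldest part of the conversation
    val keep: List<Msg>          // recent messages kept verbatim
)

// Splits a history so that the oldest `summarizePercent` of non-system messages
// are handed to the summarizer and everything else stays untouched.
fun splitForSummarization(messages: List<Msg>, summarizePercent: Int = 45): SplitResult {
    require(summarizePercent in 0..100) { "summarizePercent must be between 0 and 100" }
    val (system, conversation) = messages.partition { it.role == Role.SYSTEM }
    val cut = conversation.size * summarizePercent / 100
    return SplitResult(system, conversation.take(cut), conversation.drop(cut))
}
```

With one system message and 100 conversation messages, this yields 45 messages to summarize and 55 to keep, while the system message stays outside both groups.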

Real-world example:

Imagine a 100-message conversation about building a React app that hits the token limit:

Messages 1-45: Initial planning, architecture decisions, setup questions, debugging
Messages 46-100: Recent implementation and current discussion

Without summarization: All 100 messages sent = ~15,000 tokens ❌ Exceeds limit
With summarization:

Structured summary of messages 1-45: ~800 tokens
Messages 46-100 (55 messages): ~8,250 tokens
Total: ~9,050 tokens (~40% reduction, under the limit ✅)
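
The arithmetic behind these numbers, as a quick sketch. It assumes a uniform ~150 tokens per message (15,000 tokens / 100 messages); `Savings` and `estimateSavings` are hypothetical names.

```kotlin
data class Savings(val before: Int, val after: Int) {
    val reductionPercent: Double
        get() = 100.0 * (before - after) / before
}

// Compares sending the full history with sending a summary plus the recent messages.
fun estimateSavings(
    totalMessages: Int,
    summarizedMessages: Int,
    tokensPerMessage: Int,
    summaryTokens: Int
): Savings {
    val before = totalMessages * tokensPerMessage
    val after = summaryTokens + (totalMessages - summarizedMessages) * tokensPerMessage
    return Savings(before, after)
}
```

Calling `estimateSavings(100, 45, 150, 800)` gives 15,000 tokens before and 9,050 after, a reduction of just under 40%.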

Why keep the majority of recent messages intact?

The AI needs immediate context to understand:

What you discussed in the last 50+ messages
The current flow and direction of conversation
Recent code examples, error messages, or specific questions
Continuity between related topics

A structured summary with key facts like "User is building a React app with TypeScript, discussed routing and API integration" provides useful background context. But the AI needs the actual recent messages to understand nuanced questions like:

"So should I use try-catch or error boundaries?" (referring to your error handling discussion 10 messages ago)
"Can you show me the implementation for the second approach?" (referring to two options discussed recently)
"What was that library you mentioned earlier?" (needs the actual message where the library was named)

The 45/55 split strikes the right balance:

45% oldest messages → Summarized into key facts and topics (compressed by roughly 88% in the example above: ~6,750 tokens down to ~800)
55% recent messages → Kept verbatim for full conversational context
System messages → Always preserved (these are instructions, not conversation)

This approach ensures the AI has both:

Condensed historical context - What the conversation has been about overall
Full recent detail - The nuanced back-and-forth needed to continue naturally

Benefits:

✅ 30-50% token reduction - Meaningful API cost savings over time
✅ Long-running conversations - History is compacted before the token limit is reached, so chats can keep going
✅ Structured summaries - AI extracts key facts and topics, not just truncation
✅ Transparent to users - Happens asynchronously in the background
✅ Robust fallback - If AI summarization fails, uses extractive summary
✅ Dynamic limits - Automatically adjusts based on model's context window (40% allocation)
✅ Smart preservation - System messages (instructions) are never removed

Why this matters:

No manual intervention - Summarization happens transparently when the 60% threshold is reached
Cost optimization - Reducing 30-50% of tokens adds up over hundreds of conversations
Better context quality - Structured summaries preserve key facts and topics, removing conversational noise
Persistence - Memory is saved to database, survives app restarts
Async operation - Summarization runs in the background with a 60-second timeout, so it doesn't block user interaction

Performance Insights: Managing Multiple AI Platforms in One Desktop App

Building a desktop app that manages multiple concurrent AI conversations taught us important lessons about resource management. Here's what we learned about the performance trade-offs:

Memory Usage Patterns

A typical desktop AI chat application's memory footprint consists of:

Base application layer (~50 MB)

JVM runtime overhead
Compose Desktop UI framework
Core application state

Per-session overhead (~2-5 MB each)

Each conversation needs its own ViewModel instance
State management (messages list, streaming state, settings)
With 20 concurrent sessions: ~40-100 MB additional

Conversation history caching (~5-10 MB per 100 messages)

Messages are kept in memory for active sessions
Lazy loading from SQLite for inactive sessions
A power user with 20 tabs × 100 messages each ≈ 100-200 MB

RAG embedding stores (varies by project size)

Small project (500 files): ~50 MB
Medium project (5,000 files): ~200-500 MB
Large project (20,000+ files): 1+ GB

Total memory range: 50-300 MB for typical usage (excluding large RAG projects and AI model memory).
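
These figures can be combined into a back-of-the-envelope estimator. This is only a sketch built from the ranges quoted above; it excludes RAG embedding stores, assumes history scales linearly with message count, and the names `MbRange` and `estimateFootprint` are invented for illustration.

```kotlin
data class MbRange(val min: Int, val max: Int) {
    operator fun plus(other: MbRange) = MbRange(min + other.min, max + other.max)
    operator fun times(n: Int) = MbRange(min * n, max * n)
}

// Rough footprint in MB: base app + per-session state + cached history.
fun estimateFootprint(sessions: Int, messagesPerSession: Int): MbRange {
    val base = MbRange(50, 50)                 // JVM, Compose Desktop, core state
    val perSession = MbRange(2, 5) * sessions  // ViewModel and state per session
    val totalMessages = sessions * messagesPerSession
    // 5-10 MB per 100 cached messages
    val history = MbRange(5 * totalMessages / 100, 10 * totalMessages / 100)
    return base + perSession + history
}
```

Five active sessions with 100 messages each come out at roughly 85-125 MB under these assumptions.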

Why These Numbers Matter

Compared to web-based alternatives:

Web apps in browser tabs: 200-500 MB per tab (browser overhead included)
Our approach: 2-5 MB per session (no browser overhead)
Trade-off: We had to build custom rendering, but by these figures gained roughly 40-250x better per-session memory efficiency

Startup time trade-offs:

Cold start: 1.5-3 seconds (loading JVM + Compose Desktop)
Web apps: ~1 second initial load, but 3-5 seconds for full interactivity
Electron alternatives: 5-10 seconds (loading Chromium)
Learning: Desktop app initialization is competitive once you account for full interactivity

Database Performance

SQLite for local message storage:

Write latency: <10ms per message (includes indexing)
Full-text search: <50ms across 10,000+ messages
No network round-trip delays like cloud-based alternatives

Why local-first matters:

Zero API latency for message retrieval
Works fully offline for history browsing
No sync conflicts or version issues

Concurrency Limits

Why we cap at 20 concurrent sessions:

Each streaming session holds an open HTTP connection
Memory grows linearly with active sessions
UI remains responsive up to ~30 tabs, but 20 is a comfortable limit
Real-world usage: Most users have 3-8 active conversations

The lesson: Hard limits prevent resource exhaustion. Better to cap explicitly than let the system degrade unpredictably.
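
An explicit cap can be as simple as a guarded set of active session IDs. The class below is a minimal sketch with hypothetical names (`SessionLimiter`, `tryOpen`), not the code Askimo ships.

```kotlin
// Hypothetical limiter: refuses to open a new session once the cap is reached.
class SessionLimiter(private val maxSessions: Int = 20) {
    private val active = mutableSetOf<String>()

    // Returns true if the session is (or already was) active, false if the cap blocks it.
    @Synchronized
    fun tryOpen(sessionId: String): Boolean {
        if (sessionId in active) return true
        if (active.size >= maxSessions) return false
        active.add(sessionId)
        return true
    }

    @Synchronized
    fun close(sessionId: String) {
        active.remove(sessionId)
    }

    @Synchronized
    fun activeCount(): Int = active.size
}
```

Returning `false` rather than throwing lets the UI show a friendly "close a tab first" message instead of handling an exception.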

Rendering Performance

Compose Desktop maintains 60 FPS because:

Only re-renders changed UI components (reactive architecture)
Streaming updates are throttled to prevent overwhelming the UI thread
Message virtualization for long conversation lists
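
As an illustration of the throttling point, here is one way to batch streamed tokens so the UI is updated at most once per interval. It is a sketch under assumptions: the class name, the 50 ms default, and the injectable clock are all hypothetical, and it is not thread-safe as written.

```kotlin
// Hypothetical throttler: buffers streamed tokens and hands them to `render`
// at most once every `minIntervalMs`. Call onComplete() to flush the tail.
class StreamThrottler(
    private val minIntervalMs: Long = 50,
    private val clock: () -> Long = { System.currentTimeMillis() },
    private val render: (String) -> Unit
) {
    private val buffer = StringBuilder()
    private var lastFlushAt: Long? = null

    fun onToken(token: String) {
        buffer.append(token)
        val now = clock()
        val last = lastFlushAt
        // The first token is rendered immediately; later ones wait for the interval.
        if (last == null || now - last >= minIntervalMs) flush(now)
    }

    fun onComplete() {
        flush(clock())
    }

    private fun flush(now: Long) {
        if (buffer.isEmpty()) return
        render(buffer.toString())  // passes only the text buffered since the last flush
        buffer.setLength(0)
        lastFlushAt = now
    }
}
```

Injecting the clock keeps the class testable without sleeping, and the explicit `onComplete()` call makes sure the last few tokens are never left in the buffer.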

Trade-off we made:

Custom markdown renderer required significant effort
But we gained full control over rendering performance and caching

Key Takeaways for Desktop AI Application Development

Memory management is crucial - With multiple concurrent sessions, every MB counts. Lazy loading and LRU eviction prevented unbounded growth.
Local-first architecture pays off - SQLite message storage gives us instant search and offline access without cloud sync complexity.
Async everywhere - Kotlin coroutines made concurrent streaming manageable. Every blocking operation runs in a background dispatcher.
Cap resources explicitly - 20 concurrent sessions is a reasonable limit that prevents degradation while supporting real workflows.
Desktop overhead is acceptable - The 1.5-3s startup time and 50MB base memory are worthwhile for the privacy, performance, and offline benefits.
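
For readers unfamiliar with LRU eviction, a minimal version can be built on the JDK's `LinkedHashMap` in access-order mode. The wrapper class below is a generic sketch, not Askimo's cache; what it would hold (for example cached message lists keyed by session ID) is an assumption.

```kotlin
// Minimal LRU cache: once `capacity` is exceeded, the least recently used entry is dropped.
class LruCache<K, V>(private val capacity: Int) {
    // accessOrder = true keeps entries ordered from least to most recently accessed
    private val map = object : LinkedHashMap<K, V>(16, 0.75f, true) {
        override fun removeEldestEntry(eldest: MutableMap.MutableEntry<K, V>): Boolean =
            size > capacity
    }

    fun get(key: K): V? = map[key]

    fun put(key: K, value: V) {
        map[key] = value
    }

    // Keys ordered from least to most recently used
    fun keys(): List<K> = map.keys.toList()
}
```

With a capacity of 2, putting `a` and `b`, reading `a`, and then putting `c` evicts `b`, because `a` was touched more recently.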

Try It Yourself

Askimo is open source (AGPLv3) and available now:

🌐 Website: https://askimo.chat
💻 GitHub: github.com/haiphucnguyen/askimo
📥 Download: Get installers for macOS, Windows, Linux
📖 Docs: Complete documentation and setup guides

Related resources:

Installation guides for macOS, Windows, and Linux

What's Next?

We're actively developing:

Voice input/output - Hands-free conversations with speech-to-text and text-to-speech support
Plugin system - Extensible architecture for custom integrations:
- Custom RAG material sources - Integrate with Confluence, Notion, Google Drive, databases, or any data source
- MCP (Model Context Protocol) integrations - Connect AI models to external tools and services
- Custom AI providers - Add support for new AI services without modifying core code
Team features - Share prompts, custom directives, and RAG projects across your organization
Mobile companion app - iOS and Android apps using Kotlin Multiplatform to reuse 60-80% of the desktop codebase

Want to contribute? Check out our CONTRIBUTING.md - we welcome PRs for new providers, features, and bug fixes!

Found this helpful? ⭐ Star Askimo on GitHub and try it for your own AI workflows!

This article showcases production patterns from Askimo, an AGPLv3-licensed desktop AI chat application built with Kotlin and Compose for Desktop. All code examples are simplified from the actual implementation available on GitHub.

Inside Askimo: My Daily Journey with an AI CLI

Nguyen Phuc Hai — Thu, 04 Sep 2025 16:00:00 +0000

Inside Askimo

When I first started tinkering with Askimo, I wasn’t trying to create a big project. I just wanted something simple to make my day easier. I live in the terminal and bounce between AI tools—OpenAI for some things, Ollama locally, Copilot at work. Switching between them felt clunky, and being tied to one vendor didn’t make sense.

Then it clicked: what if I had one CLI that could talk to all of them, and let me automate the boring parts? Not just cross-platform, but repeatable. I often need to run the same command with different inputs—a set of messages, a list of files, variations of a prompt—pipe data in, script it, and reuse it later. That’s how Askimo began.

A Tool I Actually Use Every Day

Askimo isn’t just a side project that I work on in my spare time - it has become something I rely on daily. I use it to summarize long documents, generate quick drafts, or even suggest names for functions when I’m stuck. Because it lives in the terminal, it feels natural - just another command, like git or docker.

I didn’t build Askimo for show. I built it for myself first. But once it became part of my routine, I realized it might be useful for others too.

What Askimo Can Do (Right Now)

Even though it’s still early, Askimo already fits neatly into my workflow:

Runs everywhere - Homebrew on macOS/Linux, binaries for Windows, or Docker if I don’t want to install anything.
Feels consistent - the same commands work whether I’m on my laptop or a server.
Local file context - I can ask questions about a file in my project.
Multiple providers - I can switch between OpenAI, Ollama, Gemini, or xAI without leaving the CLI.

These weren’t “features” I brainstormed - they were gaps I ran into while working. Each one exists because I personally needed it.

A Journey of Learning

Askimo has also been my way of learning how to apply AI, not just read about it. Building it forced me to experiment: to test prompts, to break things, to see where AI adds value and where it doesn’t.

I’ve come to realize that AI doesn’t replace my work - it extends it. Sometimes it saves me from tedious repetition. Other times, it pushes me to think differently about automation. Each step of Askimo’s development has been a reflection of how I’m learning to work with AI rather than around it.

Getting Started

If you want to try it out, installation is simple - Homebrew, binaries, or Docker. I keep the instructions here:
👉 Installation Guide

What’s Next

I’ve got plenty of ideas for where to take it:

Chaining commands into more powerful workflows.
Custom commands - I can turn repeated prompts into shortcuts, so I don’t waste time retyping.
Indexing projects so Askimo understands my real workspace - source code, database schemas/migrations, configuration files, API specs, docs, and even build logs - not just isolated files.

The vision isn’t just a CLI for chat - I want Askimo to grow into a programmable AI environment that feels at home in the terminal.

Moving Forward

What excites me most isn’t just the tool itself, but what it represents. Askimo started as a weekend hack, but it’s grown into both a part of my daily workflow and a mirror of my own journey learning to apply AI.

For me, it’s proof that AI can be practical, lightweight, and personal. And as I keep building, I’ll keep sharing the lessons I learn along the way - because Askimo isn’t just about what AI can do, it’s about how we, as developers, can shape it into something that fits naturally into our work.

If you try Askimo, I’d love to hear how it fits into your routine.

I’ve made the project open source because I believe tools like this get better when they’re shaped by a community, not just by one developer’s perspective. If you’re curious, want to contribute, or simply want to star the project to follow its progress, you can find it here:
👉 Askimo on GitHub

Askimo: An Open-Source Command-Line AI Assistant

Nguyen Phuc Hai — Tue, 19 Aug 2025 16:00:00 +0000

Over the last two weeks, I’ve been working on a side project called Askimo — a command-line AI assistant that I’ve released under the MIT license.

It started from a simple need: I use AI a lot in my daily work — OpenAI, Claude, Ollama, Copilot. Many of the tasks are small and repetitive, like:

Generating release notes from commits
Summarizing logs
Updating a GitHub owner email, and so on

I wanted a tool that could:

Switch between providers quickly
Automate repetitive tasks in a creative way
Be customized for my workflow

Why build another CLI AI tool?

There are already some great tools out there. But I decided to build my own for a few reasons:

Learning → I wanted to explore GraalVM for cross-platform native images and get hands-on with LangChain4j to experiment with system messages, tokens, memory, and prompt tuning.
Control → Having my own tool means I can set the pace, customize features for my workflow, and extend it however I want.
Openness → Askimo is MIT licensed, with a pluggable design that makes it easy to support both closed APIs and open-source models like Ollama.

What Askimo can do today

Streaming chat in the terminal (interactive REPL)
Pipe execution: cat logs.txt | askimo "summarize this"
Multiple AI providers: currently OpenAI and Ollama, with a pluggable design to add more (e.g., Gemini)
Simple web chat page for those who prefer not to use the terminal (though CLI is where automation really shines)

What’s next?

Askimo is still very early. I’m exploring:

Adding more providers (both open-source and hosted APIs)
Custom functions for repetitive tasks (e.g., release notes, log analysis, ticket triage)
Extending the plugin system so the community can add their own commands and providers

Contributions welcome 🙌

I built Askimo mainly for myself, but I’d love to see how others use it. Every contribution helps — whether it’s opening issues, suggesting features, or sending a PR.

Repo: https://github.com/haiphucnguyen/askimo

If you’re interested in AI at the terminal, or just want to tinker with GraalVM and LangChain4j, give it a try. And if you like the idea, a ⭐️ would mean a lot.