<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nguyen Phuc Hai</title>
    <description>The latest articles on DEV Community by Nguyen Phuc Hai (@nguyen_phuchai_b01cae130).</description>
    <link>https://dev.to/nguyen_phuchai_b01cae130</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1621371%2F68ee1056-2f11-44e3-bc41-7a84c12935a7.jpg</url>
      <title>DEV Community: Nguyen Phuc Hai</title>
      <link>https://dev.to/nguyen_phuchai_b01cae130</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nguyen_phuchai_b01cae130"/>
    <language>en</language>
    <item>
      <title>How I Practice System Design with AI (URL Shortener Walkthrough)</title>
      <dc:creator>Nguyen Phuc Hai</dc:creator>
      <pubDate>Thu, 21 May 2026 14:00:00 +0000</pubDate>
      <link>https://dev.to/nguyen_phuchai_b01cae130/how-i-practice-system-design-with-ai-url-shortener-walkthrough-1mmf</link>
      <guid>https://dev.to/nguyen_phuchai_b01cae130/how-i-practice-system-design-with-ai-url-shortener-walkthrough-1mmf</guid>
      <description>&lt;p&gt;I've been doing system design interviews for years - both as a candidate and as an interviewer. The hardest part to practice alone is not the knowledge. It's the process: starting from requirements, running the numbers, evaluating multiple options before committing, and making explicit trade-off arguments at the end.&lt;/p&gt;

&lt;p&gt;Reading about it helps. Actually working through it is different.&lt;/p&gt;

&lt;p&gt;A few months ago I built a structured multi-step AI plan that walks through a complete system design session end-to-end. I've been using it to practice against different systems. This post shows exactly what it produces, using a URL shortener as the example.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem with single-prompt system design
&lt;/h2&gt;

&lt;p&gt;The obvious move when practicing system design with AI is something like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"You are a senior engineer. Design a URL shortener. Cover requirements, architecture, database schema, caching, and scalability."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You get back something that looks thorough. Redis for caching. PostgreSQL for storage. Load balancers. CDN. The answer ticks the boxes.&lt;/p&gt;

&lt;p&gt;But try pushing on it. &lt;em&gt;Why Redis specifically and not Memcached? What QPS number drove that decision? Why 302 and not 301? Why that partitioning key?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The AI cannot answer because it did not derive these choices from anything - it pattern-matched from the thousands of URL shortener articles it has seen. The output sounds right because the training data sounds right.&lt;/p&gt;

&lt;p&gt;The other problem: a single prompt collapses the entire design process into one shot. Real system design is sequential. You cannot choose an architecture before running the numbers. You cannot make a trade-off argument before evaluating the alternatives. Skipping the steps does not make the output wrong; it makes the reasoning invisible.&lt;/p&gt;

&lt;h2&gt;
  
  
  A different approach: chain the steps
&lt;/h2&gt;

&lt;p&gt;System design interviews have a well-known format for good reason. You are expected to work through specific phases in order: clarify requirements, estimate scale, propose and compare architecture options, draw the high-level design, deep-dive into the critical components, and close with trade-offs. Skipping phases or doing them out of order is a signal that you do not have a structured approach.&lt;/p&gt;

&lt;p&gt;The insight is that this format maps directly onto a multi-step AI workflow. Instead of one big prompt, you instruct the AI to follow the same interview structure, one phase at a time, where each phase builds on the output of the previous one.&lt;/p&gt;

&lt;p&gt;I structured the workflow as 7 sequential steps that mirror the formal system design interview format:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Interview phase&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 - Requirements&lt;/td&gt;
&lt;td&gt;Clarify requirements&lt;/td&gt;
&lt;td&gt;Clarifies and completes the requirements, fills in missing NFR defaults, states assumptions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2 - Back-of-envelope&lt;/td&gt;
&lt;td&gt;Estimate scale&lt;/td&gt;
&lt;td&gt;Derives traffic, storage, bandwidth, and cache estimates with arithmetic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3 - Architecture options&lt;/td&gt;
&lt;td&gt;Propose options&lt;/td&gt;
&lt;td&gt;Proposes 2-3 options with pros/cons, recommends one, produces a comparison diagram&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4 - High-level design&lt;/td&gt;
&lt;td&gt;High-level design&lt;/td&gt;
&lt;td&gt;Component overview, data flow, full architecture diagram&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5 - Deep-dive&lt;/td&gt;
&lt;td&gt;Deep-dive&lt;/td&gt;
&lt;td&gt;Data model, API design, scalability strategies, failure modes table&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6 - Trade-offs&lt;/td&gt;
&lt;td&gt;Trade-offs&lt;/td&gt;
&lt;td&gt;Decision table, known limitations, future improvements&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7 - Final doc&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;Assembles everything into a single coherent document&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each step prompt references prior outputs using &lt;code&gt;{{step_id}}&lt;/code&gt; placeholders. By the time the failure modes table is written, the AI knows the exact QPS numbers, the dominant bottleneck, and which architecture was chosen - and why. Nothing is invented in isolation.&lt;/p&gt;

&lt;p&gt;I built this using &lt;a href="https://askimo.chat" rel="noopener noreferrer"&gt;Askimo Plans&lt;/a&gt;, which lets you define multi-step AI sessions in YAML. Here is a short demo of how the plan runs:&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/VyzZEUKOCV0?start=6"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;But the structure itself is what matters - you could implement the same chain in any tool that passes context between steps.&lt;/p&gt;
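
&lt;p&gt;As a rough illustration of what "passing context between steps" means, here is a minimal Python sketch. It is not how Askimo implements plans: the step prompts, the &lt;code&gt;render&lt;/code&gt; and &lt;code&gt;run_chain&lt;/code&gt; helpers, and the &lt;code&gt;fake_model&lt;/code&gt; stand-in are all hypothetical. The only idea taken from the plan is that each prompt can reference earlier outputs through &lt;code&gt;{{step_id}}&lt;/code&gt; placeholders.&lt;/p&gt;

```python
import re

# Hypothetical step definitions: each prompt may reference earlier step ids.
STEPS = [
    ("requirements", "Clarify these inputs: {{inputs}}"),
    ("back-of-envelope", "Estimate scale from: {{requirements}}"),
    ("architecture-options",
     "Propose options given {{requirements}} and {{back-of-envelope}}"),
]

PLACEHOLDER = re.compile(r"\{\{([a-z-]+)\}\}")


def render(template, outputs):
    """Replace each {{step_id}} with the stored output of that step."""
    # A reference to a step that has not run yet raises KeyError.
    return PLACEHOLDER.sub(lambda m: outputs[m.group(1)], template)


def run_chain(steps, inputs, ask_model):
    """Run steps in order, feeding each one the outputs produced so far."""
    outputs = {"inputs": inputs}
    for step_id, template in steps:
        prompt = render(template, outputs)
        outputs[step_id] = ask_model(prompt)
    return outputs


def fake_model(prompt):
    # Stand-in for a real provider call; it only reports the prompt length.
    return "[answer to %d chars]" % len(prompt)


results = run_chain(STEPS, "URL shortener, 50M DAU", fake_model)
```

&lt;p&gt;Swapping &lt;code&gt;fake_model&lt;/code&gt; for a real provider call is the only change needed to try the chain against an actual model.&lt;/p&gt;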

&lt;h2&gt;
  
  
  Walking through the URL shortener
&lt;/h2&gt;

&lt;p&gt;I provided these inputs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;System: &lt;em&gt;URL Shortener&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Functional requirements: shorten URL, redirect to original, custom aliases, click analytics&lt;/li&gt;
&lt;li&gt;Non-functional: 99.99% availability, redirect latency &amp;lt; 10ms p99&lt;/li&gt;
&lt;li&gt;Scale hints: 500M users, 50M DAU. 200M new URLs/day (2,300 writes/sec avg)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The requirements step does not design anything. It restates what you gave it precisely, fills in missing non-functional defaults, and marks explicit assumptions. One thing worth noting from this run: my input just said "redirect." The AI flagged that 301 is permanent - browsers cache it, so users only hit your service once and analytics stop working. It surfaced 302 as the correct choice. Small detail, real architectural consequence, and the right place to catch it.&lt;/p&gt;

&lt;p&gt;The back-of-envelope step is the one most people skip in practice and get destroyed on in interviews. For this system, it produced: 35K peak redirect QPS, 915 GB for URL records, 365 TB for analytics over 5 years, and a 183 GB cache for hot URLs. That 365 TB number immediately rules out storing analytics in the same database as URLs. The 1000:1 read/write ratio marks this as a read-heavy system in which almost everything depends on cache hit rate. These derived numbers drive every architecture decision that follows.&lt;/p&gt;
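
&lt;p&gt;The arithmetic behind estimates like these is short enough to sketch. The inputs below are illustrative assumptions, not values taken from the plan's output: 1M new URLs/day, a 1000:1 read/write ratio, a 3x peak factor, 500 bytes per URL record, 200 bytes per click event, 5 years of retention, and a 20% hot set. They were picked only because they land close to the figures quoted above (roughly 34.7K peak QPS, 912.5 GB of URL records, 365 TB of click events, and a 182.5 GB hot set).&lt;/p&gt;

```python
# All inputs are illustrative assumptions, not values from the plan's output.
NEW_URLS_PER_DAY = 1_000_000
READ_WRITE_RATIO = 1000        # redirects per new URL
PEAK_FACTOR = 3                # peak traffic relative to the daily average
RECORD_BYTES = 500             # one stored URL mapping
CLICK_EVENT_BYTES = 200        # one analytics event
RETENTION_DAYS = 5 * 365
HOT_FRACTION = 0.20            # share of URL data kept in cache
SECONDS_PER_DAY = 86_400

# Traffic: daily redirects, then average and peak queries per second.
redirects_per_day = NEW_URLS_PER_DAY * READ_WRITE_RATIO
avg_redirect_qps = redirects_per_day / SECONDS_PER_DAY
peak_redirect_qps = avg_redirect_qps * PEAK_FACTOR

# Storage: URL records in GB, click events in TB, hot set in GB.
url_storage_gb = NEW_URLS_PER_DAY * RETENTION_DAYS * RECORD_BYTES / 1e9
analytics_tb = redirects_per_day * RETENTION_DAYS * CLICK_EVENT_BYTES / 1e12
cache_gb = url_storage_gb * HOT_FRACTION

print("peak redirect QPS:", round(peak_redirect_qps))
print("URL records (GB):", round(url_storage_gb, 1))
print("analytics (TB):", round(analytics_tb, 1))
print("cache (GB):", round(cache_gb, 1))
```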

&lt;p&gt;The architecture options step takes those numbers and proposes 2-3 distinct designs with pros/cons and a Mermaid comparison diagram. For this system it landed on a read/write split with async analytics: stateless redirect service backed by Redis and read replicas, separate write service, analytics into Kafka then ClickHouse. The recommendation is justified by the numbers, not by what sounds architecturally fashionable.&lt;/p&gt;

&lt;p&gt;The high-level design, deep-dive, and trade-offs steps each build on what came before. Every sizing decision in the architecture diagram traces back to the estimates. The failure modes table references the bottlenecks identified two steps earlier. The trade-off table - the step most candidates skip - has every row traceable to a prior decision: 302 vs 301 from requirements, Redis over Memcached from the 183 GB cache estimate, Kafka over direct ClickHouse write from the &amp;lt; 10ms p99 redirect SLA.&lt;/p&gt;

&lt;h2&gt;
  
  
  The design is a starting point, not a dead end
&lt;/h2&gt;

&lt;p&gt;Once the plan finishes, Askimo keeps the full context of the entire run in memory. You can keep the conversation going with follow-up questions and the AI already knows everything it produced.&lt;/p&gt;

&lt;p&gt;For example, the high-level design above is deliberately cloud-agnostic. If you want to deploy it on AWS, you can ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Map this architecture to AWS services"&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And because the AI has the full context - the 35K peak QPS, the 183 GB Redis cluster, the Kafka pipeline, the ClickHouse analytics store - the response is not generic. It maps each specific component (ElastiCache for Redis, MSK for Kafka, ALB + ECS Fargate for the redirect service, RDS Aurora with read replicas for PostgreSQL) and suggests whether ClickHouse on EC2 or an alternative like Redshift makes more sense given the 365 TB analytics volume.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbng19soq6ecylsy8x4i1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbng19soq6ecylsy8x4i1.png" alt=" " width="800" height="479"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Same idea for other follow-ups:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;"How would I run this on GCP?"&lt;/em&gt; - gets you Cloud Memorystore, Pub/Sub, Cloud Run, Cloud SQL&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;"What would a Kubernetes deployment look like for the redirect service?"&lt;/em&gt; - gets you HPA config, resource limits, liveness probes tuned to the latency SLA&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;"Estimate the monthly AWS cost for this design at 35K QPS"&lt;/em&gt; - gets you a rough cost breakdown per service&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The context the plan built is what makes these answers useful. Without the prior chain, you get generic cloud mapping. With it, you get something specific to the system you actually designed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the chained context matters
&lt;/h2&gt;

&lt;p&gt;The trade-off table the plan produces is not invented. Every row traces back to a decision made in a previous step:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;302 vs 301&lt;/strong&gt; came from the requirements step, where the AI flagged that click analytics requires 302. A 301 causes browsers to cache the redirect and stop hitting your service, so analytics data stops flowing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NoSQL over RDBMS&lt;/strong&gt; follows from the back-of-envelope step. 50 TB of URL mappings and 500K peak QPS rule out a manually sharded PostgreSQL cluster. The NoSQL choice is the direct consequence of those numbers, not a default assumption.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ALB over API Gateway for the read path&lt;/strong&gt; follows from the back-of-envelope. At 10B redirects per day, API Gateway's per-request billing becomes cost-prohibitive. The estimate made that visible before any architecture was drawn.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Async analytics via Kafka&lt;/strong&gt; follows from the &amp;lt; 10ms p99 redirect SLA in requirements. Writing 115K click events per second synchronously on the redirect hot path would blow the latency budget. Kafka decouples the write so the redirect service returns immediately.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DynamoDB Atomic Counters for ID generation&lt;/strong&gt; follows from the collision guarantee in requirements, combined with the NoSQL-only architecture chosen in the previous step. No ZooKeeper cluster required.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A single prompt cannot produce this because there is no prior context to reference. The AI just picks whatever sounds reasonable for a URL shortener. The multi-step approach forces the conclusions to be derived, not guessed.&lt;/p&gt;
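
&lt;p&gt;To make the 302 and async analytics rows concrete, here is a small Python sketch of the pattern. It assumes an in-process &lt;code&gt;queue.Queue&lt;/code&gt; as a stand-in for Kafka and a dictionary as a stand-in for the URL store; the function names and the sample short code are hypothetical. The point is only that the redirect path enqueues the click and returns, while counting happens elsewhere.&lt;/p&gt;

```python
import queue
import threading

# Hypothetical in-memory stand-ins for the URL store and the event pipeline.
URLS = {"abc123": "https://example.com/some/long/path"}
click_events = queue.Queue()
click_counts = {}


def handle_redirect(short_code):
    """Hot path: look up the target, enqueue a click event, return at once."""
    target = URLS.get(short_code)
    if target is None:
        return 404, {}
    click_events.put(short_code)  # does not block on an unbounded queue
    # A 302 is not cached permanently, so repeat clicks reach this function.
    return 302, {"Location": target}


def analytics_worker():
    """Off the hot path: drain events and aggregate them."""
    while True:
        code = click_events.get()
        if code is None:  # sentinel value stops the worker
            break
        click_counts[code] = click_counts.get(code, 0) + 1


worker = threading.Thread(target=analytics_worker)
worker.start()
responses = [handle_redirect("abc123") for _ in range(3)]
click_events.put(None)
worker.join()
print(responses[0], click_counts)
```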

&lt;h2&gt;
  
  
  The plan structure (for the curious)
&lt;/h2&gt;

&lt;p&gt;Behind this session is a YAML file with 7 steps. Each step has a single clear goal and receives the outputs of all prior steps as context via &lt;code&gt;{{step_id}}&lt;/code&gt; references.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Goal&lt;/th&gt;
&lt;th&gt;Key inputs&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;requirements&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Clarify, complete, and de-ambiguate the inputs&lt;/td&gt;
&lt;td&gt;Your functional/NFR/scale hints&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;back-of-envelope&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Derive traffic, storage, bandwidth, and cache estimates with arithmetic&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{{requirements}}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;architecture-options&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Propose 2-3 distinct options with pros/cons and a Mermaid comparison diagram&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;{{requirements}}&lt;/code&gt;, &lt;code&gt;{{back-of-envelope}}&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;high-level&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Component overview, data flow, full architecture diagram&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;{{requirements}}&lt;/code&gt;, &lt;code&gt;{{back-of-envelope}}&lt;/code&gt;, &lt;code&gt;{{architecture-options}}&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;deep-dive&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Data model, API design, failure modes table, sequence diagrams&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{{high-level}}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;tradeoffs&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Decision table, known limitations, future improvements&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;{{requirements}}&lt;/code&gt;, &lt;code&gt;{{high-level}}&lt;/code&gt;, &lt;code&gt;{{deep-dive}}&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;final&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Assemble all outputs into a single shareable document&lt;/td&gt;
&lt;td&gt;All prior steps&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;{{step_id}}&lt;/code&gt; references are what make the chain work. The back-of-envelope step does not re-read your inputs - it reads the clarified requirements from step 1. The architecture options step does not guess - it sees the numbers from step 2 and the requirements from step 1. Each step does one thing.&lt;/p&gt;
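
&lt;p&gt;One property worth checking in any chain like this is that no step reads an output that has not been produced yet. The Python sketch below encodes the step ids and key inputs from the table and looks for forward references. The &lt;code&gt;forward_references&lt;/code&gt; helper is hypothetical and is not part of Askimo; "All prior steps" for &lt;code&gt;final&lt;/code&gt; is expanded by hand.&lt;/p&gt;

```python
# Step ids and their referenced inputs, copied from the table above.
PLAN = [
    ("requirements", []),
    ("back-of-envelope", ["requirements"]),
    ("architecture-options", ["requirements", "back-of-envelope"]),
    ("high-level",
     ["requirements", "back-of-envelope", "architecture-options"]),
    ("deep-dive", ["high-level"]),
    ("tradeoffs", ["requirements", "high-level", "deep-dive"]),
    ("final", ["requirements", "back-of-envelope", "architecture-options",
               "high-level", "deep-dive", "tradeoffs"]),
]


def forward_references(plan):
    """Return (step, ref) pairs where a step reads an output not yet produced."""
    seen = set()
    problems = []
    for step_id, refs in plan:
        for ref in refs:
            if ref not in seen:
                problems.append((step_id, ref))
        seen.add(step_id)
    return problems


print(forward_references(PLAN))
```

&lt;p&gt;An empty result means every reference points backwards, which is what lets the steps run strictly in order.&lt;/p&gt;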

&lt;p&gt;The full YAML is &lt;a href="https://github.com/askimo-ai/askimo/blob/main/shared/src/main/resources/plans/system-design.yml" rel="noopener noreferrer"&gt;on GitHub&lt;/a&gt;. You can copy it, load it into any Askimo instance, or use it as a template for your own plan variants.&lt;/p&gt;

&lt;h2&gt;
  
  
  Adapting it to other systems
&lt;/h2&gt;

&lt;p&gt;I've run this against a few different systems now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Real-time chat&lt;/strong&gt; - the architecture options step centres on WebSocket vs SSE vs long-polling trade-offs; message ordering and delivery guarantees dominate the deep-dive&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ride-sharing platform&lt;/strong&gt; - matching latency becomes the dominant constraint; the back-of-envelope gets interesting when you factor in geospatial indexing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Video streaming&lt;/strong&gt; - storage and CDN costs utterly dominate everything; the trade-offs around pre-encoding vs adaptive bitrate streaming are where the depth is&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The plan structure stays the same. The conclusions change based on what the numbers show.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trying it yourself
&lt;/h2&gt;

&lt;p&gt;The plan is built into &lt;a href="https://askimo.chat" rel="noopener noreferrer"&gt;Askimo&lt;/a&gt; if you want to run it as-is. You fill in system name, requirements, and scale hints, then step through the session with whatever AI provider you use (OpenAI, Claude, Gemini, or a local model via Ollama).&lt;/p&gt;

&lt;p&gt;You can also adapt the YAML to add steps - a cost estimation step after back-of-envelope, a security review step, a migration plan step. The plan editor has an AI generator that writes the YAML from a plain-English description if you do not want to write it by hand.&lt;/p&gt;

&lt;p&gt;The full source for the system design plan is available on &lt;a href="https://github.com/askimo-ai/askimo" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. Contributions welcome.&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>architecture</category>
      <category>ai</category>
      <category>interview</category>
    </item>
    <item>
      <title>Building a Desktop AI Chat App for ChatGPT, Claude, Gemini &amp; Ollama</title>
      <dc:creator>Nguyen Phuc Hai</dc:creator>
      <pubDate>Tue, 17 Feb 2026 00:54:28 +0000</pubDate>
      <link>https://dev.to/nguyen_phuchai_b01cae130/building-a-desktop-ai-chat-app-for-chatgpt-claude-gemini-ollama-1abh</link>
      <guid>https://dev.to/nguyen_phuchai_b01cae130/building-a-desktop-ai-chat-app-for-chatgpt-claude-gemini-ollama-1abh</guid>
      <description>&lt;p&gt;Learn how to build an open-source desktop AI chat client that connects multiple AI providers in one application. This technical guide covers Kotlin Compose architecture, streaming responses, RAG implementation, and production patterns.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: Using ChatGPT, Claude, Gemini Means Opening Multiple Apps
&lt;/h2&gt;

&lt;p&gt;As developers, we've discovered that different AI models excel at different tasks. &lt;a href="https://askimo.chat/app/openai/" rel="noopener noreferrer"&gt;ChatGPT&lt;/a&gt; is great for general conversation and brainstorming. &lt;a href="https://askimo.chat/app/claude/" rel="noopener noreferrer"&gt;Claude&lt;/a&gt; works well for coding questions and technical analysis. &lt;a href="https://askimo.chat/app/gemini/" rel="noopener noreferrer"&gt;Gemini&lt;/a&gt; handles multimodal tasks with images and documents. &lt;a href="https://askimo.chat/app/ollama/" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt; gives you free, unlimited access to open-source models without subscription limits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But here's the frustration:&lt;/strong&gt; Each of these requires a different web application, a different account, a different browser tab.&lt;/p&gt;

&lt;p&gt;The modern AI workflow reality:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ChatGPT web app&lt;/strong&gt; open for general questions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude web app&lt;/strong&gt; open for coding help&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google AI Studio&lt;/strong&gt; open for multimodal tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama command line&lt;/strong&gt; running locally for experimentation without API costs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Constant context switching&lt;/strong&gt; between different interfaces, keyboard shortcuts, and UX patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fragmented conversation history&lt;/strong&gt; - your coding discussion with Claude is separate from your general brainstorming in ChatGPT&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No unified search&lt;/strong&gt; - can't search across all your AI conversations in one place&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple subscriptions&lt;/strong&gt; - managing different payment plans and free tier limits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What if you could have &lt;strong&gt;one desktop application&lt;/strong&gt; that works with ChatGPT, Claude, Gemini, Ollama, and any other AI provider - letting you choose the best model for each task without switching apps?&lt;/p&gt;

&lt;p&gt;That's why we built &lt;strong&gt;&lt;a href="https://askimo.chat/app/" rel="noopener noreferrer"&gt;Askimo&lt;/a&gt;&lt;/strong&gt; - an open-source desktop AI chat client built with Kotlin and Compose for Desktop. You can &lt;a href="https://askimo.chat/download/" rel="noopener noreferrer"&gt;download it for free&lt;/a&gt; for macOS, Windows, and Linux.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Desktop? Why Not Another Web App?
&lt;/h2&gt;

&lt;p&gt;Before diving into the technical implementation, let's address the elephant in the room: &lt;strong&gt;Why build a desktop app in 2026?&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Desktop Advantages for AI Chat
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Zero Infrastructure&lt;/strong&gt;: Just download and run. No server to set up, no deployment, no hosting costs. Open the app and start chatting.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Persistent State&lt;/strong&gt;: Desktop apps don't lose state when you close a tab. Chat in up to 20 tabs simultaneously - more than enough for any workflow - and they all stay exactly where you left them.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;True Privacy&lt;/strong&gt;: Local-first architecture means conversations never leave your machine unless you explicitly send them to an AI provider.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Native Performance&lt;/strong&gt;: No browser overhead. Direct access to system resources for faster rendering and lower memory usage (50-300 MB vs 500+ MB for browser tabs).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Offline Capability&lt;/strong&gt;: Read past conversations, search history, and manage projects - all without internet.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;System Integration&lt;/strong&gt;: Deep OS integration for keyboard shortcuts, native notifications, and file system access.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Why Kotlin + Compose for Desktop?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Modern declarative UI&lt;/strong&gt; with Compose's reactive paradigm&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared code&lt;/strong&gt; between &lt;a href="https://askimo.chat/cli/" rel="noopener noreferrer"&gt;CLI&lt;/a&gt; and desktop modules&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coroutines&lt;/strong&gt; for elegant async/concurrent programming&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Type safety&lt;/strong&gt; that prevents entire classes of runtime errors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mature ecosystem&lt;/strong&gt; with LangChain4j for AI integration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Strategic advantage: Code reuse for mobile apps&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Choosing Kotlin and Compose Multiplatform gives us a significant long-term benefit: &lt;strong&gt;when we expand to mobile (iOS/Android), we can reuse 60-80% of our codebase.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The same business logic that powers the desktop app can power mobile:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Session management&lt;/strong&gt; - Same conversation state management across all platforms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI provider integrations&lt;/strong&gt; - OpenAI, Anthropic, Ollama clients work identically&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming handling&lt;/strong&gt; - The concurrent stream management we built for desktop works on mobile&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database layer&lt;/strong&gt; - SQLite-based storage runs on all platforms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Markdown rendering&lt;/strong&gt; - Custom renderer works on iOS and Android without changes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAG pipeline&lt;/strong&gt; - Document processing and embedding logic is platform-agnostic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Only the UI layer needs platform-specific adaptation - and even there, Compose Multiplatform lets us share UI components with platform-specific tweaks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compare this to the web app path:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Web → Mobile means rebuilding everything in Swift/Kotlin or using slower hybrid frameworks&lt;/li&gt;
&lt;li&gt;Desktop Electron → Mobile means completely separate codebases&lt;/li&gt;
&lt;li&gt;Native from the start → Future mobile apps share the same proven, battle-tested core&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And this isn't some experimental tech we're betting on - Compose Multiplatform is already battle-tested in production by companies like JetBrains and Netflix. So when we decide to ship mobile apps, we won't be starting from scratch. All the tricky stuff - session management, streaming handlers, RAG pipelines - will already work. We'll just need to adapt the UI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Desktop first, but with mobile in our back pocket for later.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Trade-offs: More Effort, Better Control
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Let's be honest: building a native desktop app requires significantly more effort than a web app.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When you build for the web, the browser gives you useful tools for free:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Markdown rendering&lt;/strong&gt; - Just use a library like &lt;code&gt;marked.js&lt;/code&gt; and let the browser's HTML engine handle it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Syntax highlighting&lt;/strong&gt; - Drop in Prism.js or highlight.js&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Charts and visualizations&lt;/strong&gt; - Chart.js, D3.js, countless options&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File handling&lt;/strong&gt; - Browser APIs abstract the complexity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-platform rendering&lt;/strong&gt; - Write once, runs everywhere with the same look&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For a native desktop app, we had to build all of this ourselves:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Custom Markdown Rendering&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Implemented a CommonMark parser in Kotlin&lt;/li&gt;
&lt;li&gt;Built custom rendering logic for code blocks, tables, lists&lt;/li&gt;
&lt;li&gt;Created syntax highlighting integration for 50+ programming languages&lt;/li&gt;
&lt;li&gt;No browser HTML engine to fall back on&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Platform-Specific Challenges&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;File system access differs on macOS, Windows, and Linux&lt;/li&gt;
&lt;li&gt;Window management and keyboard shortcuts need OS-specific handling&lt;/li&gt;
&lt;li&gt;Native menus and notifications require platform adapters&lt;/li&gt;
&lt;li&gt;Different packaging systems for each OS (DMG, MSI, DEB/RPM)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Custom UI Components&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Built chart rendering using Compose Canvas APIs&lt;/li&gt;
&lt;li&gt;Implemented custom text editors with syntax highlighting&lt;/li&gt;
&lt;li&gt;Created scrollable containers with proper touch/mouse handling&lt;/li&gt;
&lt;li&gt;Designed responsive layouts without CSS flexbox&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Resource Management&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Manual memory management for long-running processes&lt;/li&gt;
&lt;li&gt;Thread pool sizing for concurrent AI streams&lt;/li&gt;
&lt;li&gt;Database connection pooling&lt;/li&gt;
&lt;li&gt;No browser garbage collection to rely on&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;So why go through all this extra effort?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We believe the benefits are worth it for this specific use case:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Performance &amp;amp; Resource Efficiency&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;50-300 MB memory usage&lt;/strong&gt; vs 500+ MB for equivalent web apps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1.5-3 second startup&lt;/strong&gt; vs 5-10 seconds for web-based alternatives&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Direct system access&lt;/strong&gt; - no browser overhead for file operations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Efficient rendering&lt;/strong&gt; - only what changed, not full DOM diffing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. User Control &amp;amp; Privacy&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Complete local storage&lt;/strong&gt; - users truly own their data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No cloud dependencies&lt;/strong&gt; for core functionality&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Encrypted local database&lt;/strong&gt; - conversations never leave the machine (learn more about &lt;a href="https://askimo.chat/security/" rel="noopener noreferrer"&gt;Askimo's security features&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No telemetry or tracking&lt;/strong&gt; by default - users control everything&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Long-term Strategic Benefits&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Local tool integration&lt;/strong&gt; - direct access to file system, terminal, development tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Offline-first&lt;/strong&gt; - full functionality without internet (except AI API calls)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;System integration&lt;/strong&gt; - global keyboard shortcuts, menu bar presence, system notifications&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Future extensibility&lt;/strong&gt; - can integrate with OS-level features (Spotlight search, Quick Look, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Better UX for AI Workflows&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Instant search&lt;/strong&gt; - local SQLite queries are 10-100x faster than cloud-based search&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliable state&lt;/strong&gt; - no session timeouts, no lost tabs, no connection drops&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-tab workflows&lt;/strong&gt; - handle up to 20 concurrent conversations without browser memory bloat (see &lt;a href="https://askimo.chat/docs/desktop/features/" rel="noopener noreferrer"&gt;desktop features&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistent experience&lt;/strong&gt; - same UI across all platforms, not dependent on browser quirks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The web browser is well-suited for content consumption, but for productivity tools handling sensitive data and requiring deep system integration, native apps offer advantages.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For Askimo specifically, the ability to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Store thousands of conversations locally with instant search&lt;/li&gt;
&lt;li&gt;Switch AI providers without page reloads or state loss&lt;/li&gt;
&lt;li&gt;Work with multiple AI platforms - from cloud services like ChatGPT, Claude, and Gemini to &lt;a href="https://askimo.chat/docs/desktop/providers/ollama/" rel="noopener noreferrer"&gt;local Ollama models&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Access project files and web content directly for RAG context&lt;/li&gt;
&lt;li&gt;Work offline for reviewing past conversations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;...made the extra development effort a worthwhile investment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bottom line:&lt;/strong&gt; If you're building a simple content-focused app, choose web. If you're building a productivity tool that needs privacy, performance, and deep system integration, the native desktop path - despite its challenges - delivers better long-term value for users.&lt;/p&gt;


&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;Askimo uses a &lt;strong&gt;provider-agnostic architecture&lt;/strong&gt; that abstracts AI models behind a common interface. Here's the high-level structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────┐
│     Compose Desktop UI Layer            │
│   (ViewModels + Reactive State)         │
└──────────────┬──────────────────────────┘
               │
┌──────────────▼──────────────────────────┐
│      Session Management Layer           │
│   (up to 20 tabs, LRU cache)            │
└──────────────┬──────────────────────────┘
               │
┌──────────────▼──────────────────────────┐
│    Provider Abstraction Layer           │
│   ChatModelFactory&amp;lt;T: ProviderSettings&amp;gt; │
└──────────────┬──────────────────────────┘
               │
       ┌───────┴───────┐
       │               │
┌──────▼─────┐   ┌────▼──────┐
│  OpenAI    │   │  Ollama   │  ...
│  Factory   │   │  Factory  │
└────────────┘   └───────────┘
       │               │
┌──────▼───────────────▼──────────────────┐
│         LangChain4j Integration         │
│  (Streaming, Memory, RAG, Tools)        │
└─────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Core Implementation: Provider Abstraction
&lt;/h2&gt;

&lt;p&gt;The heart of Askimo's multi-provider support is the &lt;code&gt;ChatModelFactory&lt;/code&gt; interface. This is how we achieve provider independence. You can see all &lt;a href="https://askimo.chat/docs/desktop/ai-providers/" rel="noopener noreferrer"&gt;supported AI providers&lt;/a&gt; and their configuration in the documentation.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. ChatModelFactory Interface
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="kd"&gt;interface&lt;/span&gt; &lt;span class="nc"&gt;ChatModelFactory&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;T&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ProviderSettings&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// List available models for this provider&lt;/span&gt;
    &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;availableModels&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;T&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;

    &lt;span class="c1"&gt;// Identify which provider this factory creates&lt;/span&gt;
    &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;getProvider&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="nc"&gt;ModelProvider&lt;/span&gt;

    &lt;span class="c1"&gt;// Default configuration for this provider&lt;/span&gt;
    &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;defaultSettings&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="nc"&gt;T&lt;/span&gt;

    &lt;span class="c1"&gt;// Create a chat client instance&lt;/span&gt;
    &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;sessionId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ContentRetriever&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;executionMode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ExecutionMode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;chatMemory&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ChatMemory&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nc"&gt;ChatClient&lt;/span&gt;

    &lt;span class="c1"&gt;// Create cheap utility client for classification tasks&lt;/span&gt;
    &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;createUtilityClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;T&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nc"&gt;ChatClient&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key Design Decisions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Generic type parameter &lt;code&gt;&amp;lt;T: ProviderSettings&amp;gt;&lt;/code&gt;&lt;/strong&gt; - Each factory specifies its own settings type, ensuring type safety at compile time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ContentRetriever for RAG&lt;/strong&gt; - Optional parameter enables Retrieval-Augmented Generation for file/project context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ChatMemory injection&lt;/strong&gt; - Conversation history managed externally but injected at creation time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ExecutionMode awareness&lt;/strong&gt; - Different behavior for CLI vs Desktop (e.g., tools disabled in desktop)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Utility client for background tasks&lt;/strong&gt; - &lt;code&gt;createUtilityClient()&lt;/code&gt; returns a cheap, fast model for tasks that don't need the most powerful AI&lt;/li&gt;
&lt;/ul&gt;
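
&lt;p&gt;To make the first design decision concrete, here is a minimal sketch of how factories could be looked up by provider. It is not Askimo's actual code: &lt;code&gt;DemoFactory&lt;/code&gt;, &lt;code&gt;DemoFactoryRegistry&lt;/code&gt;, the two settings classes, and the &lt;code&gt;llama3&lt;/code&gt; model name are hypothetical stand-ins, and the interface is trimmed to the two methods the lookup needs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;// Simplified stand-ins for the article's types; every Demo* name is hypothetical.
enum class ModelProvider { OPENAI, OLLAMA }

interface ProviderSettings { val defaultModel: String }

data class DemoOpenAiSettings(
    val apiKey: String = "",
    override val defaultModel: String = "gpt-4o",
) : ProviderSettings

data class DemoOllamaSettings(
    override val defaultModel: String = "llama3", // assumed model name
) : ProviderSettings

// Trimmed-down factory: only the parts needed to show the lookup.
interface DemoFactory&amp;lt;T : ProviderSettings&amp;gt; {
    fun getProvider(): ModelProvider
    fun defaultSettings(): T
}

class DemoOpenAiFactory : DemoFactory&amp;lt;DemoOpenAiSettings&amp;gt; {
    override fun getProvider() = ModelProvider.OPENAI
    override fun defaultSettings() = DemoOpenAiSettings()
}

class DemoOllamaFactory : DemoFactory&amp;lt;DemoOllamaSettings&amp;gt; {
    override fun getProvider() = ModelProvider.OLLAMA
    override fun defaultSettings() = DemoOllamaSettings()
}

// Registry keyed by provider, so callers never name a concrete factory.
object DemoFactoryRegistry {
    private val factories = mutableMapOf&amp;lt;ModelProvider, DemoFactory&amp;lt;*&amp;gt;&amp;gt;()

    fun register(factory: DemoFactory&amp;lt;*&amp;gt;) {
        factories[factory.getProvider()] = factory
    }

    // Returns null when no factory is registered for the provider.
    fun defaultModelFor(provider: ModelProvider): String? =
        factories[provider]?.defaultSettings()?.defaultModel
}

fun main() {
    DemoFactoryRegistry.register(DemoOpenAiFactory())
    DemoFactoryRegistry.register(DemoOllamaFactory())
    println(DemoFactoryRegistry.defaultModelFor(ModelProvider.OLLAMA))
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Running &lt;code&gt;main&lt;/code&gt; prints &lt;code&gt;llama3&lt;/code&gt;. Because each factory is bound to its own settings type, passing &lt;code&gt;DemoOllamaSettings&lt;/code&gt; to the OpenAI factory would be rejected by the compiler rather than failing at runtime.&lt;/p&gt;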

&lt;h3&gt;
  
  
  Why createUtilityClient?
&lt;/h3&gt;

&lt;p&gt;Many AI workflows involve tasks that don't require expensive, state-of-the-art models. Examples include:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory summarization:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Condensing old conversation messages into summaries&lt;/li&gt;
&lt;li&gt;A simple task that GPT-3.5-turbo handles about as well as GPT-4&lt;/li&gt;
&lt;li&gt;Running repeatedly during long conversations&lt;/li&gt;
&lt;li&gt;Using GPT-4 would cost roughly 20-30x more with little quality benefit&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Intent classification:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deciding "should we use RAG for this query?" → YES/NO&lt;/li&gt;
&lt;li&gt;Validating "is this a question?" → YES/NO&lt;/li&gt;
&lt;li&gt;Simple binary decisions that don't need advanced reasoning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The trade-off:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cloud providers (OpenAI, Anthropic, Google):&lt;/strong&gt; Use a cheaper model (e.g., GPT-3.5-turbo costs ~$0.001/1K tokens vs GPT-4's ~$0.03/1K tokens)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local providers (Ollama, LM Studio):&lt;/strong&gt; Use the same model (no API costs, so no benefit to switching)
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Example: OpenAI implementation&lt;/span&gt;
&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;OpenAiChatModelFactory&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ChatModelFactory&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;OpenAiSettings&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;createUtilityClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;OpenAiSettings&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nc"&gt;ChatClient&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Use GPT-3.5-turbo for cheap background tasks&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;sessionId&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"gpt-3.5-turbo"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// Cheap model for utility tasks&lt;/span&gt;
            &lt;span class="n"&gt;settings&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;retriever&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;executionMode&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ExecutionMode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DESKTOP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;chatMemory&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Example: Ollama implementation&lt;/span&gt;
&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;OllamaChatModelFactory&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ChatModelFactory&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;OllamaSettings&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;createUtilityClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;OllamaSettings&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nc"&gt;ChatClient&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Local models have no API cost, use the same model&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;sessionId&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;defaultModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// Same model, no cost difference&lt;/span&gt;
            &lt;span class="n"&gt;settings&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;retriever&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;executionMode&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ExecutionMode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DESKTOP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;chatMemory&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Real-world impact:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A user with 100 conversations averaging 200 messages each triggers ~100 summarization calls&lt;/li&gt;
&lt;li&gt;With GPT-4: ~$6-10 in API costs for background tasks&lt;/li&gt;
&lt;li&gt;With GPT-3.5-turbo utility client: ~$0.30-0.50 in API costs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;20x cost reduction&lt;/strong&gt; for the same functionality&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This pattern keeps the AI experience responsive and affordable without compromising the quality of user-facing responses.&lt;/p&gt;
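
&lt;p&gt;The estimate above can be reproduced with a few lines of arithmetic. This is a rough sketch: the per-1K-token prices are the approximate figures quoted earlier in this section, while the 3,000 tokens per summarization call and the &lt;code&gt;backgroundCostUsd&lt;/code&gt; helper are assumptions made for illustration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;// Cost in USD of `calls` requests of `tokensPerCall` tokens each,
// at a given price per 1K tokens.
fun backgroundCostUsd(calls: Int, tokensPerCall: Int, pricePer1kUsd: Double): Double =
    calls * (tokensPerCall / 1000.0) * pricePer1kUsd

fun main() {
    val calls = 100            // ~100 summarization calls, as in the list above
    val tokensPerCall = 3_000  // assumption: not stated in the article

    val premium = backgroundCostUsd(calls, tokensPerCall, 0.03)   // approximate GPT-4 price
    val utility = backgroundCostUsd(calls, tokensPerCall, 0.001)  // approximate GPT-3.5-turbo price

    println("premium model: %.2f USD".format(premium))
    println("utility model: %.2f USD".format(utility))
    println("ratio: %.0fx".format(premium / utility))
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With these inputs the premium model comes to about 9 USD and the utility model to about 0.30 USD. The exact multiple depends on the prices and the input/output token mix you plug in, so treat the figures as order-of-magnitude estimates.&lt;/p&gt;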

&lt;h3&gt;
  
  
  2. ProviderSettings Interface
&lt;/h3&gt;

&lt;p&gt;Each provider has its own settings class:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="kd"&gt;interface&lt;/span&gt; &lt;span class="nc"&gt;ProviderSettings&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;defaultModel&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;

    &lt;span class="c1"&gt;// Human-readable description (masks sensitive data)&lt;/span&gt;
    &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;

    &lt;span class="c1"&gt;// Configurable fields for UI&lt;/span&gt;
    &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;getFields&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;SettingField&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;

    &lt;span class="c1"&gt;// Update a field and return new instance (immutable pattern)&lt;/span&gt;
    &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;updateField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fieldName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nc"&gt;ProviderSettings&lt;/span&gt;

    &lt;span class="c1"&gt;// Validate settings are ready for use&lt;/span&gt;
    &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;validate&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="nc"&gt;Boolean&lt;/span&gt;

    &lt;span class="c1"&gt;// Help text when validation fails&lt;/span&gt;
    &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;getSetupHelpText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messageResolver&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
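
&lt;p&gt;Two of these methods are easier to picture with an example: &lt;code&gt;describe()&lt;/code&gt; masking sensitive values and &lt;code&gt;updateField()&lt;/code&gt; returning a new instance. The sketch below uses a hypothetical &lt;code&gt;DemoSettings&lt;/code&gt; class and an assumed masking rule (show only the last four characters); it illustrates the pattern and is not the real implementation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;// Hypothetical settings class; only the two behaviours discussed here are shown.
data class DemoSettings(
    val apiKey: String = "",
    val defaultModel: String = "gpt-4o",
) {
    // Mask the key so it can be displayed without leaking it.
    fun describe(): List&amp;lt;String&amp;gt; {
        val shown = if (apiKey.isBlank()) "(not set)" else "****" + apiKey.takeLast(4)
        return listOf("apiKey: $shown", "defaultModel: $defaultModel")
    }

    // Return a modified copy instead of mutating this instance.
    fun updateField(fieldName: String, value: String): DemoSettings = when (fieldName) {
        "apiKey" -&amp;gt; copy(apiKey = value)
        "defaultModel" -&amp;gt; copy(defaultModel = value)
        else -&amp;gt; this // unknown fields are ignored in this sketch
    }
}

fun main() {
    val original = DemoSettings()
    val updated = original.updateField("apiKey", "sk-demo-1234")
    println(original.describe()) // the original instance is unchanged
    println(updated.describe())
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The first line printed shows the key as &lt;code&gt;(not set)&lt;/code&gt; and the second as &lt;code&gt;****1234&lt;/code&gt;. Returning a copy means a settings object already handed to a running chat client never changes underneath it.&lt;/p&gt;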



&lt;h3&gt;
  
  
  3. Example: OpenAI Provider Implementation
&lt;/h3&gt;

&lt;p&gt;Here's how we implement the &lt;a href="https://askimo.chat/docs/desktop/providers/openai/" rel="noopener noreferrer"&gt;OpenAI/ChatGPT provider&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Serializable&lt;/span&gt;
&lt;span class="kd"&gt;data class&lt;/span&gt; &lt;span class="nc"&gt;OpenAiSettings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="py"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;defaultModel&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"gpt-4o"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="py"&gt;baseUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https://api.openai.com/v1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="py"&gt;presets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Presets&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Presets&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Style&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;BALANCED&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ProviderSettings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;HasApiKey&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;HasBaseUrl&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

    &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;validate&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="nc"&gt;Boolean&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isNotBlank&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;getSetupHelpText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messageResolver&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s"&gt;"""
            OpenAI requires an API key to use.
            1. Get your API key: https://platform.openai.com/api-keys
            2. Configure it with: :set-param api_key YOUR_KEY
        """&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trimIndent&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Streaming AI Responses: Managing Multiple Concurrent Conversations
&lt;/h2&gt;

&lt;p&gt;One of Askimo's key features: &lt;strong&gt;handling up to 20 simultaneous AI conversations&lt;/strong&gt;, each streaming its own response concurrently.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Challenge
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Each active conversation needs its own concurrent task for streaming AI responses&lt;/li&gt;
&lt;li&gt;Memory must be bounded to prevent resource exhaustion&lt;/li&gt;
&lt;li&gt;Inactive sessions should be cached but unloaded from memory&lt;/li&gt;
&lt;li&gt;Thread-safe state management across concurrent operations&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Our Approach
&lt;/h3&gt;

&lt;p&gt;We use Kotlin's &lt;strong&gt;StateFlow&lt;/strong&gt; for reactive state management and &lt;strong&gt;Coroutines&lt;/strong&gt; for concurrent streaming:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ChatViewModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;sessionId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;chatService&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ChatService&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;_isStreaming&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MutableStateFlow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;false&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;isStreaming&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;StateFlow&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Boolean&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_isStreaming&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;asStateFlow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;_messages&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MutableStateFlow&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&amp;gt;(&lt;/span&gt;&lt;span class="nf"&gt;emptyList&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;StateFlow&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;asStateFlow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;sendMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;viewModelScope&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;launch&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;_isStreaming&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;

            &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;chatService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;streamResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;error&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;handleStreamingError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
                    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;collect&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;appendToLastMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;finally&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;_isStreaming&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reactive UI updates&lt;/strong&gt; - Compose automatically recomposes when StateFlow changes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thread-safe&lt;/strong&gt; - StateFlow can be read and updated safely from multiple coroutines&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backpressure handling&lt;/strong&gt; - StateFlow conflates values, so a slow collector only sees the latest state and rapid updates won't overwhelm the UI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic cleanup&lt;/strong&gt; - Coroutines are cancelled when the ViewModel is disposed&lt;/li&gt;
&lt;/ul&gt;
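
&lt;p&gt;The view model above calls &lt;code&gt;appendToLastMessage&lt;/code&gt; without showing it. One way such a helper could be written is with &lt;code&gt;MutableStateFlow.update&lt;/code&gt; from kotlinx.coroutines, which applies a read-modify-write atomically. &lt;code&gt;DemoMessage&lt;/code&gt; and &lt;code&gt;DemoMessageStore&lt;/code&gt; are hypothetical names, and this is a sketch of the technique rather than Askimo's code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;import kotlinx.coroutines.flow.MutableStateFlow
import kotlinx.coroutines.flow.StateFlow
import kotlinx.coroutines.flow.asStateFlow
import kotlinx.coroutines.flow.update

// Hypothetical message type; the article's Message class is not shown in full.
data class DemoMessage(val role: String, val content: String)

class DemoMessageStore {
    private val _messages = MutableStateFlow&amp;lt;List&amp;lt;DemoMessage&amp;gt;&amp;gt;(emptyList())
    val messages: StateFlow&amp;lt;List&amp;lt;DemoMessage&amp;gt;&amp;gt; = _messages.asStateFlow()

    // Add an empty assistant message that streamed chunks will fill in.
    fun startAssistantMessage() =
        _messages.update { it + DemoMessage("assistant", "") }

    // update {} re-runs the lambda if another writer got in first,
    // so concurrent appends are not lost.
    fun appendToLastMessage(chunk: String) = _messages.update { current -&amp;gt;
        if (current.isEmpty()) {
            current
        } else {
            val last = current.last()
            current.dropLast(1) + last.copy(content = last.content + chunk)
        }
    }
}

fun main() {
    val store = DemoMessageStore()
    store.startAssistantMessage()
    listOf("Hel", "lo").forEach(store::appendToLastMessage)
    println(store.messages.value.last().content)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Running &lt;code&gt;main&lt;/code&gt; prints &lt;code&gt;Hello&lt;/code&gt;. Each update replaces the list with a new immutable one, which is what lets Compose detect the change and recompose.&lt;/p&gt;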

&lt;h3&gt;
  
  
  Session Management
&lt;/h3&gt;

&lt;p&gt;We maintain up to 20 active sessions in memory with LRU-style eviction:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sessions only created when first accessed (lazy initialization)&lt;/li&gt;
&lt;li&gt;Inactive sessions automatically cleaned up when limit reached&lt;/li&gt;
&lt;li&gt;Active streaming sessions are never evicted&lt;/li&gt;
&lt;li&gt;Mutex-protected state for thread safety&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This keeps memory usage bounded (~50-300 MB total) while supporting real-world workflows.&lt;/p&gt;
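
&lt;p&gt;The eviction rules above can be sketched with an access-ordered &lt;code&gt;LinkedHashMap&lt;/code&gt;. To stay within the standard library, this sketch guards the map with a JVM &lt;code&gt;ReentrantLock&lt;/code&gt; instead of the coroutine mutex mentioned above. &lt;code&gt;DemoSession&lt;/code&gt; and &lt;code&gt;DemoSessionCache&lt;/code&gt; are hypothetical names, and letting the cache grow when every session is streaming is an assumption, since that case is not described here.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;import java.util.concurrent.locks.ReentrantLock
import kotlin.concurrent.withLock

// Hypothetical session type; isStreaming marks sessions that must not be evicted.
class DemoSession(val id: String) {
    @Volatile var isStreaming: Boolean = false
}

class DemoSessionCache(private val maxSessions: Int = 20) {
    private val lock = ReentrantLock()

    // accessOrder = true keeps entries ordered from least to most recently used.
    private val sessions = LinkedHashMap&amp;lt;String, DemoSession&amp;gt;(16, 0.75f, true)

    // Lazily create the session on first access, evicting an idle one if needed.
    fun getOrCreate(id: String): DemoSession = lock.withLock {
        sessions[id] ?: run {
            if (sessions.size &amp;gt;= maxSessions) evictOneIdle()
            DemoSession(id).also { sessions[id] = it }
        }
    }

    fun contains(id: String): Boolean = lock.withLock { sessions.containsKey(id) }

    // Remove the least recently used session that is not streaming.
    // If every session is streaming, nothing is evicted (assumed behaviour).
    private fun evictOneIdle() {
        val victim = sessions.entries.firstOrNull { !it.value.isStreaming }?.key
        if (victim != null) sessions.remove(victim)
    }
}

fun main() {
    val cache = DemoSessionCache(maxSessions = 2)
    cache.getOrCreate("a").isStreaming = true // oldest, but protected
    cache.getOrCreate("b")
    cache.getOrCreate("c")                    // evicts "b", not "a"
    println(listOf("a", "b", "c").map(cache::contains))
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Running &lt;code&gt;main&lt;/code&gt; prints &lt;code&gt;[true, false, true]&lt;/code&gt;: session &lt;code&gt;a&lt;/code&gt; is the least recently used but survives because it is streaming, so &lt;code&gt;b&lt;/code&gt; is evicted instead.&lt;/p&gt;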




&lt;h2&gt;
  
  
  Error Recovery: Preserving Partial AI Responses
&lt;/h2&gt;

&lt;p&gt;AI APIs can fail at any moment - network issues, rate limits, timeouts. Many chat apps discard the in-progress response when this happens. Askimo preserves partial responses.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Problem
&lt;/h3&gt;

&lt;p&gt;When an AI streaming call fails:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You've already received 500 words of a 1000-word response&lt;/li&gt;
&lt;li&gt;The API connection drops&lt;/li&gt;
&lt;li&gt;Standard implementations discard everything&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Our Solution: Incremental Persistence
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;suspend&lt;/span&gt; &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;handleStreamingError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Throwable&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Get the partial content we've accumulated so far&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;partialContent&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getCurrentAccumulatedContent&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;partialContent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isNotEmpty&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Save what we have&lt;/span&gt;
        &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;partialMessage&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;partialContent&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt; &lt;span class="s"&gt;"\n\n[Response interrupted: ${error.message}]"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;role&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ASSISTANT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;timestamp&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Clock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;isError&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;// Replace the temporary streaming message with saved version&lt;/span&gt;
        &lt;span class="nf"&gt;replaceTemporaryMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;partialMessage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;// Persist to database immediately&lt;/span&gt;
        &lt;span class="n"&gt;messageRepository&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sessionId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;partialMessage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Notify user with non-intrusive error indicator&lt;/span&gt;
    &lt;span class="n"&gt;eventBus&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;publish&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;StreamingErrorEvent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sessionId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;User Experience:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Partial responses are &lt;strong&gt;preserved and saved&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;✅ Clear &lt;strong&gt;visual indication&lt;/strong&gt; that response was interrupted&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Resume capability&lt;/strong&gt; - Users can retry from the partial state&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;No data loss&lt;/strong&gt; - Everything is persisted immediately&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Project-Based Context: RAG for Your Documents
&lt;/h2&gt;

&lt;p&gt;One useful feature we added: &lt;strong&gt;point it at your documents and ask questions&lt;/strong&gt;. Whether it's code, PDFs, Microsoft Office files, OpenOffice documents, or web pages - Askimo can understand and answer questions about your content. Learn more about &lt;a href="https://askimo.chat/docs/desktop/rag/" rel="noopener noreferrer"&gt;Askimo's RAG capabilities&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture: Content Retrieval
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="c1"&gt;// User attaches a project folder or documents&lt;/span&gt;
&lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;project&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Project&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"my-project"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;knowledgeSources&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;listOf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nc"&gt;FileSystemSource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/path/to/documents"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;// PDFs, Office docs, text files&lt;/span&gt;
        &lt;span class="nc"&gt;FileSystemSource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/path/to/codebase/src"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;// Source code&lt;/span&gt;
        &lt;span class="nc"&gt;UrlSource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"https://docs.example.com"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;// Web documentation&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// When user sends a message in this project's session:&lt;/span&gt;
&lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;retriever&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;createContentRetriever&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;chatClient&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;factory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;sessionId&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sessionId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"gpt-4o"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;settings&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openAiSettings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;retriever&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// RAG enabled!&lt;/span&gt;
    &lt;span class="n"&gt;executionMode&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ExecutionMode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DESKTOP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chatMemory&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conversationMemory&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  How RAG Works in Askimo
&lt;/h3&gt;

&lt;p&gt;Askimo supports a wide range of document formats:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Office Documents&lt;/strong&gt;: Microsoft Word (.docx), Excel (.xlsx), PowerPoint (.pptx)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenOffice&lt;/strong&gt;: Writer (.odt), Calc (.ods), Impress (.odp)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PDFs&lt;/strong&gt;: Extracts text content from PDF files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code&lt;/strong&gt;: All programming languages and text-based formats&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Web Pages&lt;/strong&gt;: Crawl and index documentation sites&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The RAG Pipeline:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Ingestion&lt;/strong&gt;: Documents are parsed, chunked, and embedded when the project is created&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query-time retrieval&lt;/strong&gt;: The user's question is embedded and the most similar chunks are retrieved&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context injection&lt;/strong&gt;: Retrieved chunks are added to the prompt automatically&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Response&lt;/strong&gt;: The model answers using both the conversation history and the document context&lt;/li&gt;
&lt;/ol&gt;
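
&lt;p&gt;To make step 3 concrete, here is a minimal sketch of what context injection amounts to. Askimo does this automatically; the function name &lt;code&gt;buildAugmentedPrompt&lt;/code&gt; and the prompt wording are hypothetical and only illustrate the idea:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;// Hypothetical helper, not Askimo's actual code: prepend retrieved chunks to the question.
fun buildAugmentedPrompt(question: String, chunks: List&amp;lt;String&amp;gt;): String {
    if (chunks.isEmpty()) return question  // nothing retrieved: send the question unchanged

    // Number each chunk so the model can tell the snippets apart
    val context = chunks
        .mapIndexed { i, chunk -&amp;gt; "[${i + 1}] $chunk" }
        .joinToString("\n")

    return "Answer using the context below.\n\nContext:\n$context\n\nQuestion: $question"
}

fun main() {
    val prompt = buildAugmentedPrompt(
        question = "How are errors handled?",
        chunks = listOf(
            "Streaming errors save the partial message.",
            "Errors are published on the event bus."
        )
    )
    println(prompt)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The model never sees the documents themselves, only the handful of chunks that retrieval selected, which is why retrieval quality matters so much.&lt;/p&gt;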

&lt;h3&gt;
  
  
  Hybrid Search: JVector + Lucene
&lt;/h3&gt;

&lt;p&gt;We chose a &lt;strong&gt;hybrid content retriever&lt;/strong&gt; that combines two complementary search strategies:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Vector Search (JVector)&lt;/strong&gt; - Semantic similarity&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Finds content that's conceptually related to the query&lt;/li&gt;
&lt;li&gt;Example: Query "error handling" matches "exception management" even without exact words&lt;/li&gt;
&lt;li&gt;Uses embeddings to capture meaning, not just keywords&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Keyword Search (Lucene)&lt;/strong&gt; - Exact term matching&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Finds content with specific terms, names, or identifiers&lt;/li&gt;
&lt;li&gt;Example: Query "UserRepository.findById" finds exact method references&lt;/li&gt;
&lt;li&gt;Critical for code, API names, and technical terms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why hybrid?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Neither approach alone is sufficient:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Vector-only&lt;/strong&gt;: Misses exact matches (class names, function signatures, specific error codes)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keyword-only&lt;/strong&gt;: Misses semantic relationships (synonyms, paraphrased concepts, related ideas)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The hybrid retriever combines both using &lt;strong&gt;Reciprocal Rank Fusion (RRF)&lt;/strong&gt; - a proven algorithm that merges ranked lists:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;HybridContentRetriever&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;vectorRetriever&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ContentRetriever&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;keywordRetriever&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ContentRetriever&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;maxResults&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Int&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;  &lt;span class="c1"&gt;// Standard RRF constant&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ContentRetriever&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

    &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Query&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Content&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;vectorResults&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vectorRetriever&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;keywordResults&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;keywordRetriever&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;// Merge using Reciprocal Rank Fusion&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;reciprocalRankFusion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vectorResults&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keywordResults&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;take&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;maxResults&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;How RRF works:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For each document, calculate a fusion score based on its rank in each list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RRF_score(doc) = Σ 1 / (k + rank_i)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;k = 60&lt;/code&gt; (the commonly used default; a larger &lt;code&gt;k&lt;/code&gt; flattens the gap between top-ranked and lower-ranked documents)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;rank_i&lt;/code&gt; is the position of the document in retriever i's results (1st = rank 1, 2nd = rank 2, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;A document ranked &lt;strong&gt;#1 in vector search&lt;/strong&gt; and &lt;strong&gt;#3 in keyword search&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vector score: 1/(60+1) ≈ 0.0164&lt;/li&gt;
&lt;li&gt;Keyword score: 1/(60+3) ≈ 0.0159&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total RRF score: ≈ 0.0323&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;A document ranked &lt;strong&gt;#1 in both&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vector score: 1/(60+1) ≈ 0.0164&lt;/li&gt;
&lt;li&gt;Keyword score: 1/(60+1) ≈ 0.0164&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total RRF score: ≈ 0.0328&lt;/strong&gt; (only slightly higher than the #1/#3 document, because scores fall off slowly near the top of each list)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;A document ranked &lt;strong&gt;#1 in vector&lt;/strong&gt; but &lt;strong&gt;not found in keyword&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vector score: 1/(60+1) ≈ 0.0164&lt;/li&gt;
&lt;li&gt;Keyword score: 0&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total RRF score: ≈ 0.0164&lt;/strong&gt; (about half the score of a document found in both)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
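
&lt;p&gt;The scoring rule is small enough to sketch in full. The version below is an illustration, not Askimo's actual &lt;code&gt;reciprocalRankFusion&lt;/code&gt;: it works on plain document IDs instead of &lt;code&gt;Content&lt;/code&gt; objects, and the signature is an assumption:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;// Hypothetical sketch of Reciprocal Rank Fusion over document IDs (not Askimo's actual code).
// Each inner list is ordered best-first and holds each ID at most once;
// a document absent from a list contributes 0 for that list.
fun reciprocalRankFusion(
    rankings: List&amp;lt;List&amp;lt;String&amp;gt;&amp;gt;,
    k: Int = 60
): List&amp;lt;Pair&amp;lt;String, Double&amp;gt;&amp;gt; {
    val scores = LinkedHashMap&amp;lt;String, Double&amp;gt;()
    for (ranking in rankings) {
        ranking.forEachIndexed { index, docId -&amp;gt;
            val rank = index + 1  // ranks are 1-based
            scores[docId] = (scores[docId] ?: 0.0) + 1.0 / (k + rank)
        }
    }
    // Highest fused score first
    return scores.entries
        .sortedByDescending { it.value }
        .map { it.key to it.value }
}

fun main() {
    val vectorResults = listOf("A", "B", "C")   // A is #1 in vector search
    val keywordResults = listOf("B", "D", "A")  // A is #3 in keyword search

    val fused = reciprocalRankFusion(listOf(vectorResults, keywordResults))
    println(fused.map { it.first })  // [B, A, D, C]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;B comes out on top because it sits near the top of both lists, A follows closely, and D and C, which each appear in only one list, land at the bottom.&lt;/p&gt;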

&lt;p&gt;&lt;strong&gt;Why RRF is better than weighted averaging:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Rank-based, not score-based&lt;/strong&gt;: Different retrievers produce incomparable scores. RRF only cares about relative ranking.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Robust to failures&lt;/strong&gt;: If one retriever returns no results (or fails and is treated as an empty list), the fused ranking reduces to the other retriever's ranking&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rewards consensus&lt;/strong&gt;: Documents appearing in both lists naturally get higher scores&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Well-researched&lt;/strong&gt;: RRF was introduced by Cormack, Clarke, and Büttcher (SIGIR 2009) and is a common way to combine rankings in hybrid search&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Real-world impact:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A question like "how to fix null pointer" finds both "NullPointerException" (keyword) and "defensive null checks" (semantic)&lt;/li&gt;
&lt;li&gt;A question about "database queries" finds both "SQL" (keyword) and "data access patterns" (semantic)&lt;/li&gt;
&lt;li&gt;More accurate retrieval = better AI answers grounded in your actual documents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The vector side of the implementation uses LangChain4j's RAG components (simplified here with an in-memory store):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;createContentRetriever&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Project&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nc"&gt;ContentRetriever&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;embeddingStore&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;InMemoryEmbeddingStore&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;TextSegment&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;()&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;embeddingModel&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;createEmbeddingModel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;// Index all knowledge sources&lt;/span&gt;
    &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;knowledgeSources&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forEach&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;source&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt;
        &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;documents&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;loadDocuments&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;// splitAll accepts the whole list of loaded documents&lt;/span&gt;
        &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;segments&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DocumentSplitters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;recursive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;splitAll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;// embedAll returns a Response wrapper; addAll expects embeddings first, then segments&lt;/span&gt;
        &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;embeddings&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;embeddingModel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embedAll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;segments&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;content&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;embeddingStore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addAll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;segments&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// LangChain4j is a Java library, so configure the retriever through its builder&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;EmbeddingStoreContentRetriever&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embeddingStore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embeddingStore&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embeddingModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embeddingModel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;maxResults&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;minScore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Memory Management: Token-Aware Conversation History
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Token Problem
&lt;/h3&gt;

&lt;p&gt;Here's something many users don't realize: &lt;strong&gt;Every time you send a message to an AI model, the entire conversation history goes with it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By the third question in a conversation with ChatGPT or Claude, the underlying API request looks roughly like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"messages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"system"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"You are a helpful assistant"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"What is Python?"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"assistant"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Python is a programming language..."&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"How do I install it?"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"assistant"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"You can install Python by..."&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Show me a hello world example"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice the pattern? &lt;strong&gt;Every previous message is sent again.&lt;/strong&gt; This is how AI models maintain context - they don't actually "remember" your conversation. Each request is stateless, so you must resend the entire history for the model to understand what you're talking about.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The consequences:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Token consumption grows with every message&lt;/strong&gt; (each request grows linearly, so the running total grows quadratically):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Message 1: ~100 tokens sent&lt;/li&gt;
&lt;li&gt;Message 10: ~1,000 tokens sent (all previous messages)&lt;/li&gt;
&lt;li&gt;Message 50: ~5,000+ tokens sent&lt;/li&gt;
&lt;li&gt;Message 100: ~10,000+ tokens sent (already past a 4K or 8K context window)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;API costs increase&lt;/strong&gt;: You pay for every token sent, so the total cost of a conversation grows roughly quadratically with its length&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Context limits&lt;/strong&gt;: Most models have token limits (4K-128K depending on the model). Once you hit the limit, you can't continue the conversation without removing history&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Performance degradation&lt;/strong&gt;: Longer prompts take longer to process, which slows down response times&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
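
&lt;p&gt;The arithmetic behind the first two points can be sketched quickly. This assumes every message is about 100 tokens, which is a simplification for illustration; the function names are hypothetical:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;// Assumption for illustration: every message is about the same size.
fun tokensSentOnRequest(n: Int, tokensPerMessage: Int = 100): Int =
    n * tokensPerMessage  // request n resends all n messages so far

fun cumulativeTokensSent(n: Int, tokensPerMessage: Int = 100): Long =
    tokensPerMessage.toLong() * n * (n + 1) / 2  // (1 + 2 + ... + n) messages in total

fun main() {
    for (n in listOf(1, 10, 50, 100)) {
        println("request $n: ${tokensSentOnRequest(n)} tokens, running total ${cumulativeTokensSent(n)}")
    }
    // request 1: 100 tokens, running total 100
    // request 10: 1000 tokens, running total 5500
    // request 50: 5000 tokens, running total 127500
    // request 100: 10000 tokens, running total 505000
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Each individual request grows linearly, but the running total, which is what you are billed for, grows with the square of the conversation length.&lt;/p&gt;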

&lt;p&gt;&lt;strong&gt;Askimo's solution:&lt;/strong&gt; Auto-summarize old messages while keeping recent ones to maintain conversational flow.&lt;/p&gt;

&lt;h3&gt;
  
  
  Token-Aware Memory with Intelligent Summarization
&lt;/h3&gt;

&lt;p&gt;The key insight: &lt;strong&gt;You don't need the entire history, just enough context.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most conversations follow a natural pattern - the most recent exchanges are what matter for understanding the current question. Earlier messages provide background context, but you rarely need word-for-word accuracy from 50 messages ago. What you need is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Recent messages in full&lt;/strong&gt; - The last 50-60% of the conversation for immediate context and continuity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Historical overview&lt;/strong&gt; - A structured summary of earlier messages capturing key facts, decisions, and topics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;System instructions preserved&lt;/strong&gt; - Original prompts and setup never discarded&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it like a work meeting - you don't replay the entire 2-hour discussion. You recap the key decisions from the first hour, then dive into the details of the most recent discussion.&lt;/p&gt;

&lt;p&gt;Askimo's approach:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Summarize old messages&lt;/strong&gt; - Condense the oldest 45% into a structured summary with key facts and topics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep recent messages intact&lt;/strong&gt; - Preserve the remaining 55% for immediate conversational context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Never touch system messages&lt;/strong&gt; - Instructions are always preserved&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run asynchronously&lt;/strong&gt; - Doesn't block user interaction
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TokenAwareSummarizingMemory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;appContext&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;AppContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;sessionId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;summarizationThreshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Double&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.6&lt;/span&gt;  &lt;span class="c1"&gt;// Trigger at 60% of max tokens&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ChatMemory&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

    &lt;span class="c1"&gt;// Maximum tokens: 40% of model's context window (dynamically calculated)&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;maxTokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Int&lt;/span&gt;
        &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ModelContextSizeCache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;currentModel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toInt&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ChatMessage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;persistToDatabase&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;totalTokens&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;estimateTotalTokens&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;threshold&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;maxTokens&lt;/span&gt; &lt;span class="p"&gt;*&lt;/span&gt; &lt;span class="n"&gt;summarizationThreshold&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toInt&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;totalTokens&lt;/span&gt; &lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="p"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;!&lt;/span&gt;&lt;span class="n"&gt;summarizationInProgress&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nf"&gt;triggerAsyncSummarization&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;// Non-blocking&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;ChatMessage&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;buildList&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Structured summary as system message (if exists)&lt;/span&gt;
        &lt;span class="n"&gt;structuredSummary&lt;/span&gt;&lt;span class="o"&gt;?.&lt;/span&gt;&lt;span class="nf"&gt;let&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;SystemMessage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;// Recent conversation messages&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forEach&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toChatMessage&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Structured summary format&lt;/span&gt;
&lt;span class="nd"&gt;@Serializable&lt;/span&gt;
&lt;span class="kd"&gt;data class&lt;/span&gt; &lt;span class="nc"&gt;ConversationSummary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;keyFacts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;,&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;mainTopics&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;,&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;recentContext&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What this achieves:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Before (100 messages, ~15,000 tokens):
[Message 1] [Message 2] ... [Message 98] [Message 99] [Message 100]
❌ Exceeds token limit, API call fails

After summarization (~9,050 tokens):
[Summary of messages 1-45] [Message 46] ... [Message 99] [Message 100]
✅ Under token limit, context preserved, conversation continues smoothly
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Implementation details:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Summarizes the oldest &lt;strong&gt;45% of conversation messages&lt;/strong&gt; when the token threshold (60% of max) is reached&lt;/li&gt;
&lt;li&gt;System messages (instructions) are &lt;strong&gt;never summarized or removed&lt;/strong&gt; - they're preserved indefinitely&lt;/li&gt;
&lt;li&gt;Runs &lt;strong&gt;asynchronously&lt;/strong&gt; so it doesn't block the user's interaction&lt;/li&gt;
&lt;li&gt;Falls back to &lt;strong&gt;extractive summary&lt;/strong&gt; if AI-powered summarization fails&lt;/li&gt;
&lt;/ul&gt;
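
&lt;p&gt;The trigger and the split described above can be sketched in a few lines of Kotlin. This is a minimal illustration, not the Askimo implementation: the function and constant names are hypothetical, and the 16,000-token budget in the usage example is an assumed figure chosen only to make the 60% threshold easy to see.&lt;/p&gt;

```kotlin
// Hypothetical names; only the 60% threshold and the 45% share come from the article.
const val THRESHOLD_PERCENT = 60   // summarize once this share of the token budget is used
const val SUMMARIZE_PERCENT = 45   // share of the oldest conversation messages to summarize

// True once the conversation has used at least 60% of the token budget.
fun shouldSummarize(usedTokens: Int, maxTokens: Int): Boolean =
    usedTokens * 100 >= maxTokens * THRESHOLD_PERCENT

// Number of oldest conversation messages to fold into the summary.
fun messagesToSummarize(conversationMessageCount: Int): Int =
    conversationMessageCount * SUMMARIZE_PERCENT / 100

fun main() {
    val messages = (1..100).map { "Message $it" }
    val cut = messagesToSummarize(messages.size)
    val toSummarize = messages.take(cut)    // oldest 45 messages
    val keptVerbatim = messages.drop(cut)   // most recent 55 messages
    println("summarize ${toSummarize.size}, keep ${keptVerbatim.size}")
    // Assumed budget of 16,000 tokens: 9,600 used is exactly 60%.
    println(shouldSummarize(usedTokens = 9_600, maxTokens = 16_000))
}
```

&lt;p&gt;Integer percentages are used instead of floating-point fractions so the cut point is exact. System messages would be filtered out before counting, since they are never summarized.&lt;/p&gt;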

&lt;p&gt;&lt;strong&gt;Real-world example:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Imagine a 100-message conversation about building a React app that hits the token limit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Messages 1-45&lt;/strong&gt;: Initial planning, architecture decisions, setup questions, debugging&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Messages 46-100&lt;/strong&gt;: Recent implementation and current discussion&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Without summarization:&lt;/strong&gt; All 100 messages sent = ~15,000 tokens ❌ Exceeds limit&lt;br&gt;
&lt;strong&gt;With summarization:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Structured summary of messages 1-45: ~800 tokens&lt;/li&gt;
&lt;li&gt;Messages 46-100 (55 messages): ~8,250 tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total: ~9,050 tokens&lt;/strong&gt; (~40% reduction, under the limit ✅)&lt;/li&gt;
&lt;/ul&gt;
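
&lt;p&gt;The arithmetic behind that total is easy to check. The sketch below uses only the token counts quoted in this example; the helper name &lt;code&gt;reductionPercent&lt;/code&gt; is made up for illustration.&lt;/p&gt;

```kotlin
import kotlin.math.roundToInt

// Percentage of tokens saved, rounded to the nearest whole percent.
fun reductionPercent(before: Int, after: Int): Int =
    ((before - after) * 100.0 / before).roundToInt()

fun main() {
    val before = 15_000           // all 100 messages sent as-is
    val summary = 800             // structured summary of messages 1-45
    val recent = 8_250            // messages 46-100 kept verbatim
    val after = summary + recent  // 9,050 tokens
    println("after = $after tokens, reduction = ${reductionPercent(before, after)}%")
}
```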

&lt;p&gt;&lt;strong&gt;Why keep the majority of recent messages intact?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The AI needs &lt;strong&gt;immediate context&lt;/strong&gt; to understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What you discussed in the last 50+ messages&lt;/li&gt;
&lt;li&gt;The current flow and direction of conversation&lt;/li&gt;
&lt;li&gt;Recent code examples, error messages, or specific questions&lt;/li&gt;
&lt;li&gt;Continuity between related topics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A structured summary with key facts like "User is building a React app with TypeScript, discussed routing and API integration" provides useful background context. But the AI needs the actual recent messages to understand nuanced questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"So should I use try-catch or error boundaries?" (referring to your error handling discussion 10 messages ago)&lt;/li&gt;
&lt;li&gt;"Can you show me the implementation for the second approach?" (referring to two options discussed recently)&lt;/li&gt;
&lt;li&gt;"What was that library you mentioned earlier?" (needs the actual message where the library was named)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The 45/55 split&lt;/strong&gt; strikes the right balance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;45% oldest messages&lt;/strong&gt; → Summarized into key facts and topics (compressed by roughly 88% in the example above, from ~6,750 tokens to ~800)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;55% recent messages&lt;/strong&gt; → Kept verbatim for full conversational context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;System messages&lt;/strong&gt; → Always preserved (these are instructions, not conversation)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach ensures the AI has both:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Condensed historical context&lt;/strong&gt; - What the conversation has been about overall&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full recent detail&lt;/strong&gt; - The nuanced back-and-forth needed to continue naturally&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ &lt;strong&gt;30-50% token reduction&lt;/strong&gt; - Meaningful API cost savings over time&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Long-running conversations&lt;/strong&gt; - Older history is compressed before the token limit is reached, so chats can keep going&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Structured summaries&lt;/strong&gt; - AI extracts key facts and topics, not just truncation&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Transparent to users&lt;/strong&gt; - Happens asynchronously in the background&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Robust fallback&lt;/strong&gt; - If AI summarization fails, uses extractive summary&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Dynamic limits&lt;/strong&gt; - Automatically adjusts based on model's context window (40% allocation)&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Smart preservation&lt;/strong&gt; - System messages (instructions) are never removed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No manual intervention&lt;/strong&gt; - Summarization happens transparently when the 60% threshold is reached&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost optimization&lt;/strong&gt; - Reducing 30-50% of tokens adds up over hundreds of conversations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Better context quality&lt;/strong&gt; - Structured summaries preserve key facts and topics, removing conversational noise&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persistence&lt;/strong&gt; - Memory is saved to database, survives app restarts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Async operation&lt;/strong&gt; - Summarization runs in the background with a 60-second timeout, so a slow or stuck call never blocks user interaction&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Performance Insights: Managing Multiple AI Platforms in One Desktop App
&lt;/h2&gt;

&lt;p&gt;Building a desktop app that manages multiple concurrent AI conversations taught us important lessons about resource management. Here's what we learned about the performance trade-offs:&lt;/p&gt;

&lt;h3&gt;
  
  
  Memory Usage Patterns
&lt;/h3&gt;

&lt;p&gt;A typical desktop AI chat application's memory footprint consists of:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Base application layer (~50 MB)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JVM runtime overhead&lt;/li&gt;
&lt;li&gt;Compose Desktop UI framework&lt;/li&gt;
&lt;li&gt;Core application state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Per-session overhead (~2-5 MB each)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each conversation needs its own ViewModel instance&lt;/li&gt;
&lt;li&gt;State management (messages list, streaming state, settings)&lt;/li&gt;
&lt;li&gt;With 20 concurrent sessions: ~40-100 MB additional&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Conversation history caching (~5-10 MB per 100 messages)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Messages are kept in memory for active sessions&lt;/li&gt;
&lt;li&gt;Lazy loading from SQLite for inactive sessions&lt;/li&gt;
&lt;li&gt;A power user with 20 tabs × 100 messages each ≈ 100-200 MB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;RAG embedding stores (varies by project size)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Small project (500 files): ~50 MB&lt;/li&gt;
&lt;li&gt;Medium project (5,000 files): ~200-500 MB&lt;/li&gt;
&lt;li&gt;Large project (20,000+ files): 1+ GB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Total memory range: 50-300 MB&lt;/strong&gt; for typical usage (excluding large RAG projects and AI model memory).&lt;/p&gt;
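
&lt;p&gt;To see how these components add up, here is a back-of-the-envelope estimator. It is only a sketch built from the rough figures listed above (50 MB base, 2-5 MB per session, 5-10 MB of history per 100 messages); the type and function names are hypothetical, and RAG stores are left out.&lt;/p&gt;

```kotlin
// Rough estimate in MB; figures are the article's approximations, names are illustrative.
data class MemoryEstimate(val lowMb: Int, val highMb: Int)

fun estimateMemory(sessions: Int, messagesPerSession: Int): MemoryEstimate {
    val baseMb = 50  // JVM runtime, Compose Desktop, core state
    // Per-session overhead is 2-5 MB; history is 5-10 MB per 100 messages.
    val low = baseMb + sessions * 2 + sessions * messagesPerSession * 5 / 100
    val high = baseMb + sessions * 5 + sessions * messagesPerSession * 10 / 100
    return MemoryEstimate(low, high)
}

fun main() {
    println(estimateMemory(sessions = 5, messagesPerSession = 100))   // a typical user
    println(estimateMemory(sessions = 20, messagesPerSession = 100))  // a power user at the cap
}
```

&lt;p&gt;With these inputs a five-session user lands at 85-125 MB, while twenty full sessions reach 190-350 MB, which is above the typical range quoted above.&lt;/p&gt;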

&lt;h3&gt;
  
  
  Why These Numbers Matter
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Compared to web-based alternatives:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Web apps in browser tabs: 200-500 MB &lt;strong&gt;per tab&lt;/strong&gt; (browser overhead included)&lt;/li&gt;
&lt;li&gt;Our approach: 2-5 MB per session (no browser overhead)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trade-off&lt;/strong&gt;: We had to build custom rendering, but gained 10-100x better per-session memory efficiency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Startup time trade-offs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cold start: 1.5-3 seconds (loading JVM + Compose Desktop)&lt;/li&gt;
&lt;li&gt;Web apps: ~1 second initial load, but 3-5 seconds for full interactivity&lt;/li&gt;
&lt;li&gt;Electron alternatives: 5-10 seconds (loading Chromium)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Learning&lt;/strong&gt;: Desktop app initialization is competitive once you account for full interactivity&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Database Performance
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;SQLite for local message storage:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Write latency: &amp;lt;10ms per message (includes indexing)&lt;/li&gt;
&lt;li&gt;Full-text search: &amp;lt;50ms across 10,000+ messages&lt;/li&gt;
&lt;li&gt;No network round-trip delays like cloud-based alternatives&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why local-first matters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Zero API latency for message retrieval&lt;/li&gt;
&lt;li&gt;Works fully offline for history browsing&lt;/li&gt;
&lt;li&gt;No sync conflicts or version issues&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Concurrency Limits
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why we cap at 20 concurrent sessions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each streaming session holds an open HTTP connection&lt;/li&gt;
&lt;li&gt;Memory grows linearly with active sessions&lt;/li&gt;
&lt;li&gt;UI remains responsive up to ~30 tabs, but 20 is a comfortable limit&lt;/li&gt;
&lt;li&gt;Real-world usage: Most users have 3-8 active conversations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The lesson:&lt;/strong&gt; Hard limits prevent resource exhaustion. Better to cap explicitly than let the system degrade unpredictably.&lt;/p&gt;
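
&lt;p&gt;An explicit cap can be as simple as a counter that refuses new sessions once the limit is reached. The class below is a hypothetical sketch, not Askimo's code; it is also not thread-safe, so a real version would need synchronization or an atomic counter.&lt;/p&gt;

```kotlin
const val MAX_SESSIONS = 20  // the cap discussed above

// Hypothetical limiter: reserves a slot per open session and rejects requests at the cap.
class SessionLimiter(private val maxSessions: Int = MAX_SESSIONS) {
    var openCount = 0
        private set

    // Returns true and reserves a slot, or false when the cap is already reached.
    fun tryOpen(): Boolean {
        if (openCount >= maxSessions) return false
        openCount++
        return true
    }

    // Releases a slot when a conversation tab is closed.
    fun close() {
        if (openCount > 0) openCount--
    }
}

fun main() {
    val limiter = SessionLimiter()
    val opened = (1..25).count { limiter.tryOpen() }
    println("opened $opened of 25 requested sessions")
    limiter.close()                 // closing one tab frees a slot
    println(limiter.tryOpen())      // so the next request succeeds
}
```

&lt;p&gt;Rejecting the 21st session up front gives the UI a clear point at which to tell the user why a new tab cannot be opened.&lt;/p&gt;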

&lt;h3&gt;
  
  
  Rendering Performance
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Compose Desktop maintains 60 FPS&lt;/strong&gt; because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only re-renders changed UI components (reactive architecture)&lt;/li&gt;
&lt;li&gt;Streaming updates are throttled to prevent overwhelming the UI thread&lt;/li&gt;
&lt;li&gt;Message virtualization for long conversation lists&lt;/li&gt;
&lt;/ul&gt;
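
&lt;p&gt;One common way to throttle streaming updates is to let a UI refresh through at most once per frame interval. The sketch below assumes that approach and a 16 ms interval (about one frame at 60 FPS); the class name is hypothetical and the article does not say how Askimo implements its throttling.&lt;/p&gt;

```kotlin
// Hypothetical throttle: accepts an update at most once per minIntervalMs.
class UpdateThrottle(private val minIntervalMs: Long) {
    private var lastEmitMs: Long? = null

    // Returns true if enough time has passed since the last accepted update.
    fun shouldEmit(nowMs: Long): Boolean {
        val last = lastEmitMs
        if (last != null) {
            if (minIntervalMs > nowMs - last) return false
        }
        lastEmitMs = nowMs
        return true
    }
}

fun main() {
    val throttle = UpdateThrottle(minIntervalMs = 16)
    // Simulated arrival times (ms) of streamed chunks.
    val arrivals = listOf(0L, 3L, 9L, 16L, 20L, 35L)
    val rendered = arrivals.filter { throttle.shouldEmit(it) }
    println(rendered)  // chunks that trigger a UI refresh
}
```

&lt;p&gt;Skipped chunks are still appended to the message buffer; only the redraw is deferred. A real implementation would also flush once when the stream ends so the final chunk is always shown.&lt;/p&gt;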

&lt;p&gt;&lt;strong&gt;Trade-off we made:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Custom markdown renderer required significant effort&lt;/li&gt;
&lt;li&gt;But we gained full control over rendering performance and caching&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Key Takeaways for Desktop AI Application Development
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memory management is crucial&lt;/strong&gt; - With multiple concurrent sessions, every MB counts. Lazy loading and LRU eviction prevented unbounded growth.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Local-first architecture pays off&lt;/strong&gt; - SQLite message storage gives us instant search and offline access without cloud sync complexity.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Async everywhere&lt;/strong&gt; - Kotlin coroutines made concurrent streaming manageable. Every blocking operation runs in a background dispatcher.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cap resources explicitly&lt;/strong&gt; - 20 concurrent sessions is a reasonable limit that prevents degradation while supporting real workflows.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Desktop overhead is acceptable&lt;/strong&gt; - The 1.5-3 second startup time and ~50 MB base memory are worthwhile for the privacy, performance, and offline benefits.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Askimo is open source (AGPLv3) and available now:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🌐 &lt;strong&gt;Website&lt;/strong&gt;: &lt;a href="https://askimo.chat" rel="noopener noreferrer"&gt;https://askimo.chat&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;💻 &lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/haiphucnguyen/askimo" rel="noopener noreferrer"&gt;github.com/haiphucnguyen/askimo&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📥 &lt;strong&gt;Download&lt;/strong&gt;: &lt;a href="https://askimo.chat/download/" rel="noopener noreferrer"&gt;Get installers for macOS, Windows, Linux&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📖 &lt;strong&gt;Docs&lt;/strong&gt;: &lt;a href="https://askimo.chat/docs/" rel="noopener noreferrer"&gt;Complete documentation and setup guides&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Related resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://askimo.chat/docs/desktop/installation/" rel="noopener noreferrer"&gt;Installation guides&lt;/a&gt; for &lt;a href="https://askimo.chat/docs/desktop/installation/macos/" rel="noopener noreferrer"&gt;macOS&lt;/a&gt;, &lt;a href="https://askimo.chat/docs/desktop/installation/windows/" rel="noopener noreferrer"&gt;Windows&lt;/a&gt;, and &lt;a href="https://askimo.chat/docs/desktop/installation/linux/" rel="noopener noreferrer"&gt;Linux&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;We're actively developing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Voice input/output&lt;/strong&gt; - Hands-free conversations with speech-to-text and text-to-speech support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plugin system&lt;/strong&gt; - Extensible architecture for custom integrations:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Custom RAG material sources&lt;/strong&gt; - Integrate with Confluence, Notion, Google Drive, databases, or any data source&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP (Model Context Protocol) integrations&lt;/strong&gt; - Connect AI models to external tools and services&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom AI providers&lt;/strong&gt; - Add support for new AI services without modifying core code&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Team features&lt;/strong&gt; - Share prompts, custom directives, and RAG projects across your organization&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Mobile companion app&lt;/strong&gt; - iOS and Android apps using Kotlin Multiplatform to reuse 60-80% of the desktop codebase&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Want to contribute?&lt;/strong&gt; Check out our &lt;a href="https://github.com/haiphucnguyen/askimo/blob/main/CONTRIBUTING.md" rel="noopener noreferrer"&gt;CONTRIBUTING.md&lt;/a&gt; - we welcome PRs for new providers, features, and bug fixes!&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Found this helpful?&lt;/strong&gt; ⭐ &lt;a href="https://github.com/haiphucnguyen/askimo" rel="noopener noreferrer"&gt;Star Askimo on GitHub&lt;/a&gt; and try it for your own AI workflows!&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article showcases production patterns from &lt;a href="https://askimo.chat" rel="noopener noreferrer"&gt;Askimo&lt;/a&gt;, an AGPLv3-licensed desktop AI chat application built with Kotlin and Compose for Desktop. All code examples are simplified from the actual implementation available on GitHub.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>openai</category>
      <category>kotlin</category>
      <category>ai</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Inside Askimo: My Daily Journey with an AI CLI</title>
      <dc:creator>Nguyen Phuc Hai</dc:creator>
      <pubDate>Thu, 04 Sep 2025 16:00:00 +0000</pubDate>
      <link>https://dev.to/nguyen_phuchai_b01cae130/inside-askimo-my-daily-journey-with-an-ai-cli-534l</link>
      <guid>https://dev.to/nguyen_phuchai_b01cae130/inside-askimo-my-daily-journey-with-an-ai-cli-534l</guid>
      <description>&lt;h2&gt;
  
  
  Inside Askimo
&lt;/h2&gt;

&lt;p&gt;When I first started tinkering with Askimo, I wasn’t trying to create a big project. I just wanted something simple to make my day easier. I live in the terminal and bounce between AI tools—OpenAI for some things, Ollama locally, Copilot at work. Switching between them felt clunky, and being tied to one vendor didn’t make sense.&lt;/p&gt;

&lt;p&gt;Then it clicked: what if I had one CLI that could talk to all of them, and let me automate the boring parts? Not just cross-platform, but repeatable. I often need to run the same command with different inputs—a set of messages, a list of files, variations of a prompt—pipe data in, script it, and reuse it later. That’s how Askimo began.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Tool I Actually Use Every Day
&lt;/h2&gt;

&lt;p&gt;Askimo isn’t just a side project that I work on in spare time - it has become something I rely on daily. I use it to summarize long documents, generate quick drafts, or even suggest names for functions when I’m stuck. Because it lives in the terminal, it feels natural - just another command, like git or docker.&lt;/p&gt;

&lt;p&gt;I didn’t build Askimo for show. I built it for myself first. But once it became part of my routine, I realized it might be useful for others too.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Askimo Can Do (Right Now)
&lt;/h2&gt;

&lt;p&gt;Even though it’s still early, Askimo already fits neatly into my workflow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Runs everywhere - Homebrew on macOS/Linux, binaries for Windows, or Docker if I don’t want to install anything.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Feels consistent - the same commands work whether I’m on my laptop or a server.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Local file context - I can ask questions about a file in my project.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multiple providers - I can switch between OpenAI, Ollama, Gemini, or X AI without leaving the CLI.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These weren’t “features” I brainstormed - they were gaps I ran into while working. Each one exists because I personally needed it.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Journey of Learning
&lt;/h2&gt;

&lt;p&gt;Askimo has also been my way of learning how to apply AI, not just read about it. Building it forced me to experiment: to test prompts, to break things, to see where AI adds value and where it doesn’t.&lt;/p&gt;

&lt;p&gt;I’ve come to realize that AI doesn’t replace my work - it extends it. Sometimes it saves me from tedious repetition. Other times, it pushes me to think differently about automation. Each step of Askimo’s development has been a reflection of how I’m learning to work with AI rather than around it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;If you want to try it out, installation is simple - Homebrew, binaries, or Docker. I keep the instructions here:&lt;br&gt;
&lt;a href="https://haiphucnguyen.github.io/askimo/installation/" rel="noopener noreferrer"&gt;👉 Installation Guide&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s Next
&lt;/h2&gt;

&lt;p&gt;I’ve got plenty of ideas for where to take it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Chaining commands into more powerful workflows.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Custom commands - turning repeated prompts into shortcuts, so I don’t waste time retyping.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Indexing projects so Askimo understands my real workspace - source code, database schemas/migrations, configuration files, API specs, docs, and even build logs - not just isolated files.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The vision isn’t just a CLI for chat - I want Askimo to grow into a programmable AI environment that feels at home in the terminal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Moving Forward
&lt;/h2&gt;

&lt;p&gt;What excites me most isn’t just the tool itself, but what it represents. Askimo started as a weekend hack, but it’s grown into both a part of my daily workflow and a mirror of my own journey learning to apply AI.&lt;/p&gt;

&lt;p&gt;For me, it’s proof that AI can be practical, lightweight, and personal. And as I keep building, I’ll keep sharing the lessons I learn along the way - because Askimo isn’t just about what AI can do, it’s about how we, as developers, can shape it into something that fits naturally into our work.&lt;/p&gt;

&lt;p&gt;If you try Askimo, I’d love to hear how it fits into your routine.&lt;/p&gt;

&lt;p&gt;I’ve made the project open source because I believe tools like this get better when they’re shaped by a community, not just by one developer’s perspective. If you’re curious, want to contribute, or simply want to star the project to follow its progress, you can find it here:&lt;br&gt;
&lt;a href="https://github.com/haiphucnguyen/askimo" rel="noopener noreferrer"&gt;👉 Askimo on GitHub&lt;/a&gt;&lt;/p&gt;

</description>
      <category>cli</category>
      <category>ai</category>
      <category>opensource</category>
      <category>openai</category>
    </item>
    <item>
      <title>Askimo: An Open-Source Command-Line AI Assistant</title>
      <dc:creator>Nguyen Phuc Hai</dc:creator>
      <pubDate>Tue, 19 Aug 2025 16:00:00 +0000</pubDate>
      <link>https://dev.to/nguyen_phuchai_b01cae130/askimo-an-open-source-command-line-ai-assistant-3dnj</link>
      <guid>https://dev.to/nguyen_phuchai_b01cae130/askimo-an-open-source-command-line-ai-assistant-3dnj</guid>
      <description>&lt;p&gt;Over the last two weeks, I’ve been working on a side project called Askimo — a command-line AI assistant that I’ve released under the MIT license.&lt;/p&gt;

&lt;p&gt;It started from a simple need: I use AI a lot in my daily work — OpenAI, Claude, Ollama, Copilot. Many of the tasks are small and repetitive, like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Generating release notes from commits&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Summarizing logs&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Updating a GitHub owner email, and so on&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I wanted a tool that could:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Switch between providers quickly&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Automate repetitive tasks in a creative way&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Be customized for my workflow&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Why build another CLI AI tool?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are already some great tools out there. But I decided to build my own for a few reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Learning → I wanted to explore GraalVM for cross-platform native images and get hands-on with LangChain4j to experiment with system messages, tokens, memory, and prompt tuning.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Control → Having my own tool means I can set the pace, customize features for my workflow, and extend it however I want.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Openness → Askimo is MIT licensed, with a pluggable design that makes it easy to support both closed APIs and open-source models like Ollama.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;What Askimo can do today&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Streaming chat in the terminal (interactive REPL)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Pipe execution: &lt;code&gt;cat logs.txt | askimo "summarize this"&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multiple AI providers: currently OpenAI and Ollama, with a pluggable design to add more (e.g., Gemini)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Simple web chat page for those who prefer not to use the terminal (though CLI is where automation really shines)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;What’s next?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Askimo is still very early. I’m exploring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Adding more providers (both open-source and hosted APIs)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Custom functions for repetitive tasks (e.g., release notes, log analysis, ticket triage)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Extending the plugin system so the community can add their own commands and providers&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Contributions welcome 🙌&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I built Askimo mainly for myself, but I’d love to see how others use it. Every contribution helps — whether it’s opening issues, suggesting features, or sending a PR.&lt;/p&gt;

&lt;p&gt;Repo: &lt;a href="https://github.com/haiphucnguyen/askimo" rel="noopener noreferrer"&gt;https://github.com/haiphucnguyen/askimo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you’re interested in AI at the terminal, or just want to tinker with GraalVM and LangChain4j, give it a try. And if you like the idea, a ⭐️ would mean a lot.&lt;/p&gt;

</description>
      <category>cli</category>
      <category>opensource</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
