<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shashank Kumar Singh</title>
    <description>The latest articles on DEV Community by Shashank Kumar Singh (@bitshank2338).</description>
    <link>https://dev.to/bitshank2338</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3915859%2Fc4cbcf69-dc76-44ef-8171-efa70efb8666.jpeg</url>
      <title>DEV Community: Shashank Kumar Singh</title>
      <link>https://dev.to/bitshank2338</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/bitshank2338"/>
    <language>en</language>
    <item>
      <title>Why I chose Ollama over cloud AI for my screen-reading assistant (and what I learned)</title>
      <dc:creator>Shashank Kumar Singh</dc:creator>
      <pubDate>Wed, 06 May 2026 11:38:43 +0000</pubDate>
      <link>https://dev.to/bitshank2338/why-i-chose-ollama-over-cloud-ai-for-my-screen-reading-assistant-and-what-i-learned-171m</link>
      <guid>https://dev.to/bitshank2338/why-i-chose-ollama-over-cloud-ai-for-my-screen-reading-assistant-and-what-i-learned-171m</guid>
      <description>&lt;p&gt;When I started building Clicky — a Windows AI assistant that reads your screen and answers out loud — I had to make a fundamental choice: cloud AI or local AI?&lt;/p&gt;

&lt;p&gt;I chose local (Ollama). Here's exactly why, and what I learned the hard way.&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/WYY9yJHDaEU"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;The problem with cloud AI for a screen assistant&lt;/h2&gt;

&lt;p&gt;Clicky's core loop is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Take a screenshot of your screen&lt;/li&gt;
&lt;li&gt;Record your voice question&lt;/li&gt;
&lt;li&gt;Send both to an LLM&lt;/li&gt;
&lt;li&gt;Speak the answer back&lt;/li&gt;
&lt;/ol&gt;
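
&lt;p&gt;In code, the loop is just those four calls chained together. Here's a stubbed sketch; every helper below is a placeholder for the real capture, speech-to-text, LLM, and TTS pieces covered later and in the companion posts:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio

# placeholder helpers — stand-ins for the real mss capture, Whisper STT,
# Ollama call, and edge-tts playback
def take_screenshot() -&amp;gt; str: return "screen.jpg"
def record_question() -&amp;gt; str: return "what does this error mean?"
async def ask_llm(image: str, question: str) -&amp;gt; str: return "stub answer"
async def speak(text: str) -&amp;gt; None: print(text)

async def core_loop():
    image = take_screenshot()                # 1. screenshot
    question = record_question()             # 2. voice question
    answer = await ask_llm(image, question)  # 3. send both to the LLM
    await speak(answer)                      # 4. speak the answer back

asyncio.run(core_loop())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;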

&lt;p&gt;Step 3 is where cloud AI becomes a problem. You're sending a &lt;strong&gt;screenshot of your screen&lt;/strong&gt; to a remote server every single time someone presses the hotkey.&lt;/p&gt;

&lt;p&gt;Think about what's on your screen:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Passwords in password managers&lt;/li&gt;
&lt;li&gt;Emails with sensitive info&lt;/li&gt;
&lt;li&gt;Code with API keys&lt;/li&gt;
&lt;li&gt;Personal documents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I wasn't comfortable sending that to OpenAI servers (or any server) by default. And I couldn't expect users to be comfortable with it either.&lt;/p&gt;

&lt;h2&gt;What Ollama actually gives you&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://ollama.ai" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt; lets you run LLMs locally: pull a model, run it on your own GPU/CPU, and responses never leave your machine.&lt;/p&gt;

&lt;p&gt;For Clicky, this means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero privacy risk&lt;/strong&gt; — screenshots stay on your PC&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero cost per query&lt;/strong&gt; — no API bills&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero internet required&lt;/strong&gt; — works offline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero rate limits&lt;/strong&gt; — query as fast as your hardware allows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The tradeoff: you need a decent GPU and ~4–8GB RAM for a usable model. But most modern gaming PCs qualify.&lt;/p&gt;

&lt;h2&gt;Models I tested&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;VRAM needed&lt;/th&gt;
&lt;th&gt;Quality&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.1 8B&lt;/td&gt;
&lt;td&gt;6GB&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mistral 7B&lt;/td&gt;
&lt;td&gt;5GB&lt;/td&gt;
&lt;td&gt;Great&lt;/td&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Phi-3 mini&lt;/td&gt;
&lt;td&gt;3GB&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;Very fast&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.2 3B&lt;/td&gt;
&lt;td&gt;2GB&lt;/td&gt;
&lt;td&gt;OK&lt;/td&gt;
&lt;td&gt;Blazing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I use Llama 3.1 8B as the default. It handles screen content, code, and general questions well.&lt;/p&gt;

&lt;h2&gt;The Ollama HTTP API is dead simple&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ask_ollama&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;screenshot_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;screenshot_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;img_b64&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;b64encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama3.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;images&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;img_b64&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434/api/chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. No SDK, no auth, no API key. Just a POST to localhost.&lt;/p&gt;
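
&lt;p&gt;And calling it is one line. A usage sketch, assuming the &lt;code&gt;ask_ollama&lt;/code&gt; helper above, a running Ollama server, and a placeholder screenshot path:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio

# "screen.png" is a stand-in; Clicky passes its own capture here
answer = asyncio.run(ask_ollama("screen.png", "What does this error mean?"))
print(answer)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;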

&lt;h2&gt;When I kept cloud AI as a fallback&lt;/h2&gt;

&lt;p&gt;Despite going local-first, I built a &lt;code&gt;CompanionManager&lt;/code&gt; class that routes to different providers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CompanionManager&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_provider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ollama&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;OllamaProvider&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;OpenAIProvider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;openai_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;ClaudeProvider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;claude_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;OllamaProvider&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# default
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Power users who want GPT-4V or Claude's vision can switch in settings. But the default is Ollama, zero config.&lt;/p&gt;

&lt;h2&gt;What I learned&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Ollama model management is the real UX problem.&lt;/strong&gt;&lt;br&gt;
Users don't want to open a terminal and run &lt;code&gt;ollama pull llama3.1&lt;/code&gt;. I had to build a model-picker UI that shows which models are installed and lets you pull new ones from inside Clicky.&lt;/p&gt;
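
&lt;p&gt;The API side of that UI is small. A sketch of listing installed models via Ollama's standard &lt;code&gt;/api/tags&lt;/code&gt; endpoint (pulling new ones goes through &lt;code&gt;/api/pull&lt;/code&gt; the same way):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import httpx

def installed_models() -&amp;gt; list[str]:
    # Ollama lists every pulled model at /api/tags
    r = httpx.get("http://localhost:11434/api/tags", timeout=5)
    r.raise_for_status()
    return [m["name"] for m in r.json().get("models", [])]

print(installed_models())  # e.g. ['llama3.1:latest', 'phi3:mini']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;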

&lt;p&gt;&lt;strong&gt;2. First response latency matters more than throughput.&lt;/strong&gt;&lt;br&gt;
For a voice assistant, you want the first token fast. Streaming responses (even over localhost) made Clicky feel much snappier.&lt;/p&gt;
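
&lt;p&gt;Here's roughly what that looks like against Ollama's chat endpoint. With &lt;code&gt;"stream": true&lt;/code&gt; the server emits one JSON object per line, so tokens can be handed to the TTS as they arrive (a sketch reusing the payload shape from &lt;code&gt;ask_ollama&lt;/code&gt; above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import httpx

async def stream_ollama(payload: dict):
    # same payload as ask_ollama, but streamed line by line
    payload = {**payload, "stream": True}
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream("POST", "http://localhost:11434/api/chat",
                                 json=payload) as response:
            async for line in response.aiter_lines():
                if not line:
                    continue
                chunk = json.loads(line)
                if chunk.get("done"):
                    break
                yield chunk["message"]["content"]

# usage: async for token in stream_ollama(payload): feed token to the TTS
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;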

&lt;p&gt;&lt;strong&gt;3. GPU vs CPU is night and day.&lt;/strong&gt;&lt;br&gt;
On CPU only, Llama 3.1 8B takes 15–30 seconds. On a GTX 1080, it's 2–4 seconds. I added a GPU detection check and recommend CPU-only users use Phi-3 mini instead.&lt;/p&gt;
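
&lt;p&gt;The detection itself can be as simple as probing for &lt;code&gt;nvidia-smi&lt;/code&gt;. A sketch, not Clicky's exact check (AMD and Intel GPUs would need their own probes, and the model tags are assumptions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import shutil
import subprocess

def has_nvidia_gpu() -&amp;gt; bool:
    # nvidia-smi on PATH and exiting cleanly is a decent proxy for a usable GPU
    if shutil.which("nvidia-smi") is None:
        return False
    return subprocess.run(["nvidia-smi"], capture_output=True).returncode == 0

default_model = "llama3.1" if has_nvidia_gpu() else "phi3:mini"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;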

&lt;p&gt;&lt;strong&gt;4. Context window size limits what you can do with screenshots.&lt;/strong&gt;&lt;br&gt;
A 1080p screenshot as base64 JPEG at q=70 is ~200–400KB. Some models handle this fine; others choke. I compress to 768px max width before encoding.&lt;/p&gt;
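
&lt;p&gt;The downscaling step is plain Pillow, roughly like this (width and quality match the numbers above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import base64
import io

from PIL import Image

def encode_for_llm(path: str, max_width: int = 768) -&amp;gt; str:
    img = Image.open(path).convert("RGB")
    if img.width &amp;gt; max_width:
        # cap the width, keep the aspect ratio
        img = img.resize((max_width, round(img.height * max_width / img.width)),
                         Image.LANCZOS)
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=70)
    return base64.b64encode(buf.getvalue()).decode()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;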

&lt;h2&gt;The verdict&lt;/h2&gt;

&lt;p&gt;For a privacy-sensitive app like a screen assistant, &lt;strong&gt;local AI is the right default&lt;/strong&gt;. Cloud AI makes sense as an opt-in for users who want more power and don't mind the tradeoff.&lt;/p&gt;

&lt;p&gt;Ollama made this possible without asking users to do anything complicated. Install Clicky, install Ollama, pull a model — done.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Download Clicky free (Windows):&lt;/strong&gt; &lt;a href="https://clicky.foo" rel="noopener noreferrer"&gt;https://clicky.foo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Article 1 (what it does):&lt;/strong&gt; &lt;a href="https://dev.to/bitshank2338/i-built-a-free-ai-that-reads-my-screen-and-answers-out-loud-no-api-key-runs-offline-with-ollama-15np"&gt;https://dev.to/bitshank2338/i-built-a-free-ai-that-reads-my-screen-and-answers-out-loud-no-api-key-runs-offline-with-ollama-15np&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Article 2 (full technical breakdown):&lt;/strong&gt; &lt;a href="https://dev.to/bitshank2338/how-i-built-a-screen-aware-ai-assistant-in-python-full-stack-breakdown-pyqt6-whisper-ollama-1354"&gt;https://dev.to/bitshank2338/how-i-built-a-screen-aware-ai-assistant-in-python-full-stack-breakdown-pyqt6-whisper-ollama-1354&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What model are you running locally? Let me know in the comments.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>showdev</category>
      <category>opensource</category>
    </item>
    <item>
      <title>How I built a screen-aware AI assistant in Python – full stack breakdown (PyQt6 + Whisper + Ollama)</title>
      <dc:creator>Shashank Kumar Singh</dc:creator>
      <pubDate>Wed, 06 May 2026 11:35:22 +0000</pubDate>
      <link>https://dev.to/bitshank2338/how-i-built-a-screen-aware-ai-assistant-in-python-full-stack-breakdown-pyqt6-whisper-ollama-1354</link>
      <guid>https://dev.to/bitshank2338/how-i-built-a-screen-aware-ai-assistant-in-python-full-stack-breakdown-pyqt6-whisper-ollama-1354</guid>
      <description>&lt;p&gt;Three months ago I started building Clicky — a Windows AI assistant that reads your screen and answers out loud. Here's the full technical breakdown of every piece.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; PyQt6 system tray → Ctrl+Alt+Space hotkey → screenshot + Whisper STT → Ollama/OpenAI/Claude → edge-tts speaks answer back. Open source, free, no API key needed.&lt;/p&gt;

&lt;h2&gt;Architecture overview&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User presses Ctrl+Alt+Space
        ↓
GlobalHotkey listener (pynput)
        ↓
Screenshot all monitors (mss)
        ↓
Whisper.cpp transcribes audio
        ↓
CompanionManager routes to AI provider
        ↓
Ollama (local) / OpenAI / Claude / Copilot
        ↓
edge-tts speaks answer + arrow overlay on screen
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;1. System tray + hotkey (PyQt6 + pynput)&lt;/h2&gt;

&lt;p&gt;The app lives in the system tray — no window, zero friction.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pynput&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;keyboard&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_activate&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;QMetaObject&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invokeMethod&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;companion&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;start_listening&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Qt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;QueuedConnection&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;hotkey&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;keyboard&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GlobalHotKeys&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;ctrl&amp;gt;+&amp;lt;alt&amp;gt;+&amp;lt;space&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;on_activate&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;hotkey&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key trick: &lt;code&gt;QMetaObject.invokeMethod&lt;/code&gt; with &lt;code&gt;Qt.ConnectionType.QueuedConnection&lt;/code&gt; — this crosses the thread boundary safely from pynput's background thread into Qt's main thread.&lt;/p&gt;
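
&lt;p&gt;The signal-based alternative works the same way; emitting a &lt;code&gt;pyqtSignal&lt;/code&gt; from the listener thread gets queued onto the main thread automatically. A minimal sketch (not Clicky's actual wiring, and it assumes a running &lt;code&gt;QApplication&lt;/code&gt; event loop):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from PyQt6.QtCore import QObject, pyqtSignal

class HotkeyBridge(QObject):
    # cross-thread signal: emitted on pynput's thread, delivered on Qt's
    activated = pyqtSignal()

bridge = HotkeyBridge()
bridge.activated.connect(lambda: print("hotkey pressed"))  # runs on the Qt side

def on_activate():
    bridge.activated.emit()  # safe to call from the background thread
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;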

&lt;h2&gt;2. Screen capture (mss)&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mss&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;PIL&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;capture_all_screens&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;mss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mss&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sct&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;monitor&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sct&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;monitors&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:]:&lt;/span&gt;  &lt;span class="c1"&gt;# skip monitor[0] (all combined)
&lt;/span&gt;            &lt;span class="n"&gt;shot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sct&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;grab&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;monitor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;frombytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RGB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bgra&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;raw&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BGRX&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="c1"&gt;# encode as JPEG base64 for vision API
&lt;/span&gt;            &lt;span class="nb"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;BytesIO&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;JPEG&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quality&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;75&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;b64encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getvalue&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Quality 75 JPEG keeps the payload under API limits while preserving readability.&lt;/p&gt;

&lt;h2&gt;3. Speech-to-text (Whisper.cpp)&lt;/h2&gt;

&lt;p&gt;I use the &lt;code&gt;whisper-cpp&lt;/code&gt; Python bindings — runs on CPU, no GPU needed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;whispercpp&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Whisper&lt;/span&gt;

&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Whisper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;base.en&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transcribe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transcribe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extract_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;base.en&lt;/code&gt; model is 142MB and transcribes ~10s of audio in ~2s on a mid-range CPU. Fast enough to feel instant.&lt;/p&gt;

&lt;h2&gt;4. AI provider routing&lt;/h2&gt;

&lt;p&gt;This was the trickiest part — supporting 4 providers with one interface:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CompanionManager&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_provider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;match&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;provider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="n"&gt;case&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ollama&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;OllamaProvider&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;case&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;OpenAIProvider&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;case&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;ClaudeProvider&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;case&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;copilot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;GitHubCopilotProvider&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;screenshots&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;provider&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_provider&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="c1"&gt;# Identity questions skip screenshots (avoids vision API refusals)
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;is_identity_question&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;screenshots&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;screenshots&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;is_identity_question()&lt;/code&gt; filter was a fun challenge — vision APIs refuse to identify people in images. So I detect "who is X" patterns with regex and strip the screenshots before sending.&lt;/p&gt;
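
&lt;p&gt;A simplified version of that filter (the production pattern list is longer):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

# toy pattern list; Clicky's real filter covers many more phrasings
IDENTITY_RE = re.compile(
    r"\b(who\s+is|who's|whose|identify|name\s+th(is|at)\s+person)\b",
    re.IGNORECASE,
)

def is_identity_question(question: str) -&amp;gt; bool:
    return bool(IDENTITY_RE.search(question))

assert is_identity_question("Who is this person on my screen?")
assert not is_identity_question("What does this stack trace mean?")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;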

&lt;h2&gt;5. Local AI with Ollama&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5vl:3b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# vision model
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;images&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434/api/chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;6. Text-to-speech (edge-tts)&lt;/h2&gt;

&lt;p&gt;Microsoft's neural TTS — free, no API key, sounds great:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;edge_tts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;speak&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;communicate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;edge_tts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Communicate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;voice&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en-US-AriaNeural&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;communicate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/tmp/response.mp3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# play with pygame
&lt;/span&gt;    &lt;span class="n"&gt;pygame&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mixer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;music&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/tmp/response.mp3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;pygame&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mixer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;music&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;play&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;7. Packaging (PyInstaller + Inno Setup)&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# build.bat&lt;/span&gt;
pyinstaller clicky.spec &lt;span class="nt"&gt;--noconfirm&lt;/span&gt;
&lt;span class="c"&gt;# Then Inno Setup builds Setup-Clicky.exe&lt;/span&gt;
iscc installer.iss
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;.spec&lt;/code&gt; file needs explicit hidden imports for everything dynamic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;hiddenimports&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ai.ollama_bootstrap&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ui.setup_wizard&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;whispercpp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Lessons learned&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Thread safety with PyQt6 is non-negotiable&lt;/strong&gt; — never call Qt UI methods from background threads. Use &lt;code&gt;QMetaObject.invokeMethod&lt;/code&gt; or signals.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Whisper base.en is the sweet spot&lt;/strong&gt; — tiny is too inaccurate, small is too slow on CPU.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vision APIs hate face identification&lt;/strong&gt; — build the filter early, not after your first support ticket.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyInstaller + Ollama = packaging nightmare&lt;/strong&gt; — Ollama runs as a separate process, not bundled. The setup wizard that auto-installs it saved countless support issues.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;What's next&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;macOS port (the hotkey system is the main blocker)&lt;/li&gt;
&lt;li&gt;Plugin system for custom AI skills&lt;/li&gt;
&lt;li&gt;Sub-200ms response time optimization&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;The full source is on GitHub and it's MIT licensed. If you build something on top of it, let me know.&lt;/p&gt;

&lt;p&gt;📥 Download: &lt;a href="https://clicky.foo" rel="noopener noreferrer"&gt;https://clicky.foo&lt;/a&gt;&lt;br&gt;
⭐ GitHub: &lt;a href="https://github.com/Bitshank-2338/clicky-windows" rel="noopener noreferrer"&gt;https://github.com/Bitshank-2338/clicky-windows&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>showdev</category>
      <category>opensource</category>
    </item>
    <item>
      <title>I built a free AI that reads my screen and answers out loud – no API key, runs offline with Ollama</title>
      <dc:creator>Shashank Kumar Singh</dc:creator>
      <pubDate>Wed, 06 May 2026 11:15:16 +0000</pubDate>
      <link>https://dev.to/bitshank2338/i-built-a-free-ai-that-reads-my-screen-and-answers-out-loud-no-api-key-runs-offline-with-ollama-15np</link>
      <guid>https://dev.to/bitshank2338/i-built-a-free-ai-that-reads-my-screen-and-answers-out-loud-no-api-key-runs-offline-with-ollama-15np</guid>
      <description>&lt;p&gt;I kept hitting ChatGPT's $20/mo wall. So I built my own.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Clicky&lt;/strong&gt; is a free, open-source AI assistant for Windows that floats next to your cursor, reads your screen, and speaks the answer back out loud. No copy-paste. No alt-tab. No subscription.&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/WYY9yJHDaEU"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;The problem I was solving&lt;/h2&gt;

&lt;p&gt;Every time I hit an API limit or saw that $20/mo charge, I thought — I have a decent laptop, Ollama runs locally, why am I paying for this?&lt;/p&gt;

&lt;p&gt;But existing local AI setups require you to open a terminal, type prompts, copy-paste context... it's friction. I wanted something that &lt;em&gt;just worked&lt;/em&gt; — press a hotkey, ask a question, get an answer.&lt;/p&gt;

&lt;h2&gt;How Clicky works&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Press &lt;strong&gt;Ctrl+Alt+Space&lt;/strong&gt; anywhere on Windows&lt;/li&gt;
&lt;li&gt;It captures all your screens and starts listening&lt;/li&gt;
&lt;li&gt;You ask your question out loud&lt;/li&gt;
&lt;li&gt;It reads the screen context, queries the AI, and &lt;strong&gt;speaks the answer back&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's the entire flow. No window switching. No typing.&lt;/p&gt;

&lt;h2&gt;Tech stack&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Python + PyQt6&lt;/strong&gt; — system tray app, always running in background&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Whisper.cpp&lt;/strong&gt; — fast local speech-to-text (runs on CPU)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama&lt;/strong&gt; — local LLM inference (llama3.2:3b for text, qwen2.5vl:3b for vision)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;edge-tts&lt;/strong&gt; — Microsoft's neural TTS, free and offline-capable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyInstaller&lt;/strong&gt; — packages everything into a single Windows .exe&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inno Setup&lt;/strong&gt; — builds the one-click installer&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;AI provider flexibility&lt;/h2&gt;

&lt;p&gt;Clicky isn't locked to Ollama. It supports:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;Requires&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Ollama (local)&lt;/td&gt;
&lt;td&gt;Free forever&lt;/td&gt;
&lt;td&gt;~4GB RAM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI GPT-4o&lt;/td&gt;
&lt;td&gt;Pay per use&lt;/td&gt;
&lt;td&gt;API key&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anthropic Claude&lt;/td&gt;
&lt;td&gt;Pay per use&lt;/td&gt;
&lt;td&gt;API key&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitHub Copilot&lt;/td&gt;
&lt;td&gt;Free with Student Pack&lt;/td&gt;
&lt;td&gt;GitHub account&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;What it can do&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Read whatever is on your screen (YouTube, code, emails, docs, anything)&lt;/li&gt;
&lt;li&gt;Search the web and answer instantly&lt;/li&gt;
&lt;li&gt;Point at exact UI elements with a glowing arrow overlay&lt;/li&gt;
&lt;li&gt;Click buttons on your behalf just by asking&lt;/li&gt;
&lt;li&gt;Answer questions about code you're looking at without copy-pasting&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Setup in v1.1.1&lt;/h2&gt;

&lt;p&gt;The latest version ships with a &lt;strong&gt;first-run setup wizard&lt;/strong&gt; — it detects if Ollama is missing, downloads and installs it automatically, then pulls the right models. One click from installer to working AI.&lt;/p&gt;
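
&lt;p&gt;Under the hood, the wizard's two checks are straightforward. A sketch against Ollama's standard local API (the actual wizard code differs, and it also handles downloading the Ollama installer itself):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import httpx

def ollama_running() -&amp;gt; bool:
    # a running Ollama daemon answers on port 11434
    try:
        return httpx.get("http://localhost:11434/api/version",
                         timeout=2).status_code == 200
    except httpx.HTTPError:
        return False

def pull_model(name: str = "llama3.2:3b") -&amp;gt; None:
    # /api/pull streams JSON progress events line by line
    with httpx.stream("POST", "http://localhost:11434/api/pull",
                      json={"name": name}, timeout=None) as r:
        for line in r.iter_lines():
            print(line)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;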

&lt;h2&gt;Links&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;📥 &lt;strong&gt;Download free&lt;/strong&gt;: &lt;a href="https://clicky.foo" rel="noopener noreferrer"&gt;https://clicky.foo&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;⭐ &lt;strong&gt;GitHub (open source)&lt;/strong&gt;: &lt;a href="https://github.com/Bitshank-2338/clicky-windows" rel="noopener noreferrer"&gt;https://github.com/Bitshank-2338/clicky-windows&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🐦 &lt;strong&gt;Twitter&lt;/strong&gt;: &lt;a href="https://x.com/iamshashank_dev" rel="noopener noreferrer"&gt;https://x.com/iamshashank_dev&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Happy to answer questions about the Ollama integration, the PyQt6 architecture, or the PyInstaller packaging pipeline. AMA.&lt;/p&gt;

</description>
      <category>showdev</category>
      <category>ai</category>
      <category>opensource</category>
      <category>python</category>
    </item>
  </channel>
</rss>
