When I started building Clicky — a Windows AI assistant that reads your screen and answers out loud — I had to make a fundamental choice: cloud AI or local AI?
I chose local (Ollama). Here's exactly why, and what I learned the hard way.
## The problem with cloud AI for a screen assistant
Clicky's core loop is:
1. Take a screenshot of your screen
2. Record your voice question
3. Send both to an LLM
4. Speak the answer back
Step 3 is where cloud AI becomes a problem. You're sending a screenshot of your screen to a remote server every single time someone presses the hotkey.
Think about what's on your screen:
- Passwords in password managers
- Emails with sensitive info
- Code with API keys
- Personal documents
I wasn't comfortable sending that to OpenAI servers (or any server) by default. And I couldn't expect users to be comfortable with it either.
## What Ollama actually gives you
Ollama lets you run LLMs locally: pull a model, it runs on your GPU or CPU, and responses never leave your machine.
For Clicky, this means:
- Zero privacy risk — screenshots stay on your PC
- Zero cost per query — no API bills
- Zero internet required — works offline
- Zero rate limits — query as fast as your hardware allows
The tradeoff: you need a decent GPU and ~4–8GB RAM for a usable model. But most modern gaming PCs qualify.
## Models I tested
| Model | VRAM needed | Quality | Speed |
|---|---|---|---|
| Llama 3.1 8B | 6GB | Excellent | Fast |
| Mistral 7B | 5GB | Great | Fast |
| Phi-3 mini | 3GB | Good | Very fast |
| Llama 3.2 3B | 2GB | OK | Blazing |
I use Llama 3.1 8B as the default. It handles screen content, code, and general questions well.
## The Ollama HTTP API is dead simple
```python
import httpx
import base64


async def ask_ollama(screenshot_path: str, question: str) -> str:
    with open(screenshot_path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode()

    payload = {
        "model": "llama3.1",
        "messages": [
            {
                "role": "user",
                "content": question,
                "images": [img_b64]
            }
        ],
        "stream": False
    }

    async with httpx.AsyncClient(timeout=30) as client:
        response = await client.post(
            "http://localhost:11434/api/chat",
            json=payload
        )
        return response.json()["message"]["content"]
```
That's it. No SDK, no auth, no API key. Just a POST to localhost.
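For context, here's how you might call it in a quick test — the screenshot path and question are just placeholders:

```python
import asyncio

# Hypothetical smoke test for the function above.
answer = asyncio.run(ask_ollama("screenshot.png", "What's on my screen right now?"))
print(answer)
```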
## When I kept cloud AI as a fallback
Despite going local-first, I built a `CompanionManager` class that routes to different providers:
```python
class CompanionManager:
    def get_provider(self):
        if self.settings.provider == "ollama":
            return OllamaProvider()
        elif self.settings.provider == "openai":
            return OpenAIProvider(api_key=self.settings.openai_key)
        elif self.settings.provider == "claude":
            return ClaudeProvider(api_key=self.settings.claude_key)
        else:
            return OllamaProvider()  # default
```
Power users who want GPT-4V or Claude's vision can switch in settings. But the default is Ollama, zero config.
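Under the hood, the providers only need to agree on one method. Here's a minimal sketch of what that shared contract could look like — the `Provider` base class, the `ask` method name, and its signature are my illustration, not Clicky's actual interface:

```python
from abc import ABC, abstractmethod

import httpx


class Provider(ABC):
    @abstractmethod
    async def ask(self, question: str, image_b64: str | None = None) -> str:
        """Answer a question, optionally grounded in a screenshot."""


class OllamaProvider(Provider):
    async def ask(self, question: str, image_b64: str | None = None) -> str:
        message = {"role": "user", "content": question}
        if image_b64:
            message["images"] = [image_b64]
        async with httpx.AsyncClient(timeout=60) as client:
            r = await client.post(
                "http://localhost:11434/api/chat",
                json={"model": "llama3.1", "messages": [message], "stream": False},
            )
            return r.json()["message"]["content"]
```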
## What I learned
1. Ollama model management is the real UX problem.
Users don't want to open a terminal and run `ollama pull llama3.1`. I had to build a model-picker UI that shows which models are installed and lets you pull new ones from inside Clicky.
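Discovering what's already installed is the easy part: Ollama exposes a `GET /api/tags` endpoint that lists local models. A minimal sketch of the check behind a model picker (the helper name is mine):

```python
import httpx


def list_installed_models(base_url: str = "http://localhost:11434") -> list[str]:
    # /api/tags returns every model already pulled to this machine.
    response = httpx.get(f"{base_url}/api/tags", timeout=5)
    response.raise_for_status()
    return [m["name"] for m in response.json().get("models", [])]


print(list_installed_models())  # e.g. ['llama3.1:latest', 'phi3:mini']
```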
2. First response latency matters more than throughput.
For a voice assistant, you want the first token fast. Streaming responses (even over localhost) made Clicky feel much snappier.
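With Ollama, streaming is just `"stream": True` on the same endpoint; the server then emits one JSON object per line. A rough sketch of how that looks with httpx (not Clicky's exact code):

```python
import json

import httpx


async def stream_answer(question: str) -> None:
    payload = {
        "model": "llama3.1",
        "messages": [{"role": "user", "content": question}],
        "stream": True,  # one JSON chunk per line as tokens arrive
    }
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream(
            "POST", "http://localhost:11434/api/chat", json=payload
        ) as resp:
            async for line in resp.aiter_lines():
                if not line:
                    continue
                chunk = json.loads(line)
                if chunk.get("done"):
                    break
                # Hand each fragment to the TTS/UI immediately instead of waiting.
                print(chunk["message"]["content"], end="", flush=True)
```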
3. GPU vs CPU is night and day.
On CPU only, Llama 3.1 8B takes 15–30 seconds. On a GTX 1080, it's 2–4 seconds. I added a GPU detection check and recommend CPU-only users use Phi-3 mini instead.
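The detection itself can be as simple as checking whether `nvidia-smi` runs — a crude sketch, and not necessarily how Clicky does it:

```python
import shutil
import subprocess


def has_nvidia_gpu() -> bool:
    # Crude check: nvidia-smi is on PATH and exits cleanly.
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        subprocess.run(["nvidia-smi"], capture_output=True, check=True, timeout=5)
        return True
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return False


recommended_model = "llama3.1" if has_nvidia_gpu() else "phi3:mini"
```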
4. Context window size limits what you can do with screenshots.
A 1080p screenshot as a base64 JPEG at quality 70 is ~200–400 KB. Some models handle this fine; others choke. I compress to 768px max width before encoding.
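The downscale-and-encode step is a few lines with Pillow — roughly like this, though the exact parameters in Clicky may differ:

```python
import base64
import io

from PIL import Image


def encode_screenshot(path: str, max_width: int = 768, quality: int = 70) -> str:
    img = Image.open(path).convert("RGB")
    if img.width > max_width:
        # Cap the width, keep the aspect ratio.
        new_height = round(img.height * max_width / img.width)
        img = img.resize((max_width, new_height), Image.LANCZOS)
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    return base64.b64encode(buf.getvalue()).decode()
```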
## The verdict
For a privacy-sensitive app like a screen assistant, local AI is the right default. Cloud AI makes sense as an opt-in for users who want more power and don't mind the tradeoff.
Ollama made this possible without asking users to do anything complicated. Install Clicky, install Ollama, pull a model — done.
Download Clicky free (Windows): https://clicky.foo
Article 1 (what it does): https://dev.to/bitshank2338/i-built-a-free-ai-that-reads-my-screen-and-answers-out-loud-no-api-key-runs-offline-with-ollama-15np
Article 2 (full technical breakdown): https://dev.to/bitshank2338/how-i-built-a-screen-aware-ai-assistant-in-python-full-stack-breakdown-pyqt6-whisper-ollama-1354
What model are you running locally? Let me know in the comments.