Harish Kotra (he/him)
Building Screenshot Sage: A Local AI That Can Understand Your Screenshots

AI vision models are becoming incredibly powerful—but most tools send your screenshots and images to the cloud.

For developers working with sensitive data, internal dashboards, or proprietary code, that’s a major concern.

So we built Screenshot Sage: a privacy-first AI assistant that lets you chat with your screenshots locally.

The entire system runs on your machine using a multimodal vision model and a simple Python stack.

No cloud APIs.
No data leaving your computer.

In this article, we’ll walk through:

  • The idea behind Screenshot Sage
  • The architecture
  • The technology stack
  • How the system works
  • Key implementation details
  • Code snippets you can use to build your own

What Screenshot Sage Does

Screenshot Sage allows you to:

  • Upload a screenshot
  • Ask questions about the image
  • Get explanations in natural language

The AI can analyze:

  • UI screens
  • charts
  • dashboards
  • terminal output
  • error messages
  • code snippets
  • documents

Example questions you can ask:

  • Why did this Python program crash?
  • What trends do you see in this chart?
  • What does this error message mean?
  • How could this UI be improved?

And because the model runs locally, your screenshots remain private.


System Architecture

The architecture is intentionally simple.


Components

Frontend

Gradio provides the browser UI where users:

  • upload screenshots
  • ask questions
  • view streaming responses

Agent Layer

Agno handles prompt orchestration and communication with the model.

Model Server

LM Studio runs a local inference server exposing an OpenAI-compatible API.

Vision Model

A multimodal model analyzes both the screenshot and the user’s question.


Technology Stack

Layer             Technology
----------------  -----------
Agent Framework   Agno
UI Framework      Gradio
Language          Python
Model Runtime     LM Studio
Vision Model      Qwen3-VL-8B

Why We Chose This Stack

Agno

Agno provides a clean way to manage agents and model calls without writing raw OpenAI client code.

It simplifies:

  • prompts
  • multimodal inputs
  • model interaction

Gradio

Gradio makes it extremely easy to build interactive ML interfaces with minimal code.

We get:

  • drag and drop uploads
  • chat UI
  • streaming responses

LM Studio

LM Studio allows you to run local models and expose them via a simple API.

The interface is compatible with OpenAI-style requests, which means most libraries work out of the box.
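Because LM Studio speaks the OpenAI chat-completions schema, the request it receives has a predictable shape. Here is a sketch of how such a multimodal payload is assembled (the endpoint, port, and model id are the defaults used in this project; adjust them to your LM Studio setup):

```python
import json

# Default local endpoint exposed by LM Studio in this setup.
LM_STUDIO_URL = "http://localhost:1234/v1/chat/completions"

def build_payload(question: str, image_data_url: str) -> dict:
    """Assemble an OpenAI-style chat request: one user message that
    carries both the text question and the Base64 image data URL."""
    return {
        "model": "qwen3-vl-8b",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_data_url}},
                ],
            }
        ],
        "stream": True,
    }

payload = build_payload("What does this error mean?", "data:image/png;base64,AAAA")
print(json.dumps(payload, indent=2))
```

Any client library that can produce this shape (the OpenAI SDK, Agno, or plain `requests`) can talk to the local server.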


How Screenshot Processing Works

When a user uploads an image, the system performs the following steps:

  1. The image is uploaded through the Gradio interface
  2. The image file is converted into Base64
  3. The Base64 image is sent to the agent
  4. The agent sends the image and prompt to LM Studio
  5. The vision model analyzes the image
  6. The response streams back to the UI

This entire pipeline runs locally.
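The six steps above can be sketched end-to-end as one function. The agent here is any object with a `run(message, images=...)` method, as in the Agno snippet later; the `EchoAgent` stub is purely illustrative so the pipeline can be exercised without a model server:

```python
import base64

def ask_about_screenshot(agent, image_path: str, question: str) -> str:
    """Run the full local pipeline: read the image, Base64-encode it,
    wrap it in a data URL, and hand image + question to the agent."""
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode()
    data_url = f"data:image/png;base64,{encoded}"
    return agent.run(question, images=[{"url": data_url}])

# Stub standing in for the real Agno agent, for illustration only.
class EchoAgent:
    def run(self, message, images):
        return f"Got {len(images)} image(s) and question: {message}"
```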


Creating the Agent

The agent is responsible for interacting with the model.

```python
from agno.agent import Agent
from agno.models.openai import OpenAIChat

def create_agent():
    # LM Studio exposes an OpenAI-compatible endpoint on localhost;
    # the api_key value is a placeholder required by the client.
    model = OpenAIChat(
        id="qwen3-vl-8b",
        base_url="http://localhost:1234/v1",
        api_key="lm-studio"
    )

    agent = Agent(
        model=model,
        instructions="""
You are an expert visual assistant.

Analyze screenshots and answer user questions clearly.
Explain errors, UI elements, charts, and code snippets.
"""
    )

    return agent
```

This agent communicates directly with LM Studio.


Encoding Images for the Model

The model expects images to be provided as Base64 data URLs.

```python
import base64

def encode_image(path):
    # Read the raw image bytes and return them as a Base64 string.
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()
```

We then prefix the encoded string to form a data URL:

```
data:image/png;base64,...
```
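A small helper can combine both steps. The `mimetypes` lookup is an addition for screenshots that are not PNGs; the fallback to `image/png` is an assumption:

```python
import base64
import mimetypes

def image_to_data_url(path: str) -> str:
    """Base64-encode an image file and wrap it in a data URL.
    Falls back to image/png when the MIME type cannot be guessed."""
    mime, _ = mimetypes.guess_type(path)
    encoded = base64.b64encode(open(path, "rb").read()).decode()
    return f"data:{mime or 'image/png'};base64,{encoded}"
```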

Sending the Image to the Model

Once encoded, the image is included in the request.

```python
response = agent.run(
    message,
    images=[{"url": image_data_url}]
)
```

The model receives both:

  • the image
  • the user prompt

This allows it to reason over the visual content.


Streaming Responses

To improve user experience, we stream the model response word-by-word.

```python
partial = ""
for word in answer.split():
    partial += word + " "
    # Rewrite the last assistant message and yield the updated history;
    # Gradio renders each yield as an incremental chat update.
    history[-1]["content"] = partial
    yield history
```

This creates the effect of the AI typing its response in real time.


The Gradio Interface

The UI is built using Gradio Blocks.

Key components include:

  • image uploader
  • chatbot interface
  • prompt input
  • send button
  • copy answer button

Example component:

```python
chatbot = gr.Chatbot(
    height=450,
    type="messages",
    autoscroll=True
)
```

Gradio automatically handles:

  • UI rendering
  • websocket communication
  • live updates

Why Local AI Matters

Running AI locally has several advantages:

Privacy

Your screenshots never leave your machine.

Cost

No API costs.

Offline capability

Works without internet.

Control

You can swap models or customize prompts freely.


Future Improvements

Here are some features we plan to add:

Region selection

Ask questions about a specific part of the screenshot.

Multi-image comparison

Compare two screenshots.

OCR export

Extract all text from screenshots.

Chat history

Save conversations locally.


Screenshot Sage demonstrates how easy it has become to build powerful AI tools using local models.

With just a few components:

  • a multimodal model
  • a lightweight UI
  • a simple agent framework

you can create applications that previously required complex cloud infrastructure.

As local AI models continue to improve, tools like this will become increasingly common.

And the best part?
Your data stays with you.


Check out the GitHub repo.
