Harish Kotra (he/him)
Building Screenshot Sage: A Local AI That Can Understand Your Screenshots

AI vision models are becoming incredibly powerful—but most tools send your screenshots and images to the cloud.

For developers working with sensitive data, internal dashboards, or proprietary code, that’s a major concern.

So we built Screenshot Sage: a privacy-first AI assistant that lets you chat with your screenshots locally.

The entire system runs on your machine using a multimodal vision model and a simple Python stack.

No cloud APIs.
No data leaving your computer.

In this article, we’ll walk through:

  • The idea behind Screenshot Sage
  • The architecture
  • The technology stack
  • How the system works
  • Key implementation details
  • Code snippets you can use to build your own

What Screenshot Sage Does

Screenshot Sage allows you to:

  • Upload a screenshot
  • Ask questions about the image
  • Get explanations in natural language

The AI can analyze:

  • UI screens
  • charts
  • dashboards
  • terminal output
  • error messages
  • code snippets
  • documents

Example questions you can ask:

  • Why did this Python program crash?
  • What trends do you see in this chart?
  • What does this error message mean?
  • How could this UI be improved?

And because the model runs locally, your screenshots remain private.


System Architecture

The architecture is intentionally simple.


Components

Frontend

Gradio provides the browser UI where users:

  • upload screenshots
  • ask questions
  • view streaming responses

Agent Layer

Agno handles prompt orchestration and communication with the model.

Model Server

LM Studio runs a local inference server exposing an OpenAI-compatible API.

Vision Model

A multimodal model analyzes both the screenshot and the user’s question.


Technology Stack

Layer             Technology
----------------  -----------
Agent Framework   Agno
UI Framework      Gradio
Language          Python
Model Runtime     LM Studio
Vision Model      Qwen3-VL-8B

Why We Chose This Stack

Agno

Agno provides a clean way to manage agents and model calls without writing raw OpenAI client code.

It simplifies:

  • prompts
  • multimodal inputs
  • model interaction

Gradio

Gradio makes it extremely easy to build interactive ML interfaces with minimal code.

We get:

  • drag and drop uploads
  • chat UI
  • streaming responses

LM Studio

LM Studio allows you to run local models and expose them via a simple API.

The interface is compatible with OpenAI-style requests, which means most libraries work out of the box.
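Because LM Studio speaks the OpenAI chat-completions schema, the request it receives has a predictable shape. Here is a sketch of how such a multimodal payload is assembled (the endpoint, port, and model id are the defaults used in this project; adjust them to your LM Studio setup):

```python
import json

# Default local endpoint exposed by LM Studio in this setup.
LM_STUDIO_URL = "http://localhost:1234/v1/chat/completions"

def build_payload(question: str, image_data_url: str) -> dict:
    """Assemble an OpenAI-style chat request: one user message that
    carries both the text question and the Base64 image data URL."""
    return {
        "model": "qwen3-vl-8b",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_data_url}},
                ],
            }
        ],
        "stream": True,
    }

payload = build_payload("What does this error mean?", "data:image/png;base64,AAAA")
print(json.dumps(payload, indent=2))
```

Any client library that can produce this shape (the OpenAI SDK, Agno, or plain `requests`) can talk to the local server.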


How Screenshot Processing Works

When a user uploads an image, the system performs the following steps:

  1. The image is uploaded through the Gradio interface
  2. The image file is converted into Base64
  3. The Base64 image is sent to the agent
  4. The agent sends the image and prompt to LM Studio
  5. The vision model analyzes the image
  6. The response streams back to the UI

This entire pipeline runs locally.
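The six steps above can be sketched end-to-end as one function. The agent here is any object with a `run(message, images=...)` method, as in the Agno snippet later; the `EchoAgent` stub is purely illustrative so the pipeline can be exercised without a model server:

```python
import base64

def ask_about_screenshot(agent, image_path: str, question: str) -> str:
    """Run the full local pipeline: read the image, Base64-encode it,
    wrap it in a data URL, and hand image + question to the agent."""
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode()
    data_url = f"data:image/png;base64,{encoded}"
    return agent.run(question, images=[{"url": data_url}])

# Stub standing in for the real Agno agent, for illustration only.
class EchoAgent:
    def run(self, message, images):
        return f"Got {len(images)} image(s) and question: {message}"
```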


Creating the Agent

The agent is responsible for interacting with the model.

```python
from agno.agent import Agent
from agno.models.openai import OpenAIChat

def create_agent():
    # LM Studio exposes an OpenAI-compatible endpoint on localhost;
    # the api_key value is a placeholder required by the client.
    model = OpenAIChat(
        id="qwen3-vl-8b",
        base_url="http://localhost:1234/v1",
        api_key="lm-studio"
    )

    agent = Agent(
        model=model,
        instructions="""
You are an expert visual assistant.

Analyze screenshots and answer user questions clearly.
Explain errors, UI elements, charts, and code snippets.
"""
    )

    return agent
```

This agent communicates directly with LM Studio.


Encoding Images for the Model

The model expects images to be provided as Base64 data URLs.

```python
import base64

def encode_image(path):
    # Read the raw image bytes and return them as a Base64 string.
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()
```

We then prefix the encoded string to form a data URL:

```
data:image/png;base64,...
```
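A small helper can combine both steps. The `mimetypes` lookup is an addition for screenshots that are not PNGs; the fallback to `image/png` is an assumption:

```python
import base64
import mimetypes

def image_to_data_url(path: str) -> str:
    """Base64-encode an image file and wrap it in a data URL.
    Falls back to image/png when the MIME type cannot be guessed."""
    mime, _ = mimetypes.guess_type(path)
    encoded = base64.b64encode(open(path, "rb").read()).decode()
    return f"data:{mime or 'image/png'};base64,{encoded}"
```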

Sending the Image to the Model

Once encoded, the image is included in the request.

```python
response = agent.run(
    message,
    images=[{"url": image_data_url}]
)
```

The model receives both:

  • the image
  • the user prompt

This allows it to reason over the visual content.


Streaming Responses

To improve user experience, we stream the model response word-by-word.

```python
partial = ""
for word in answer.split():
    partial += word + " "
    # Rewrite the last assistant message and yield the updated history;
    # Gradio renders each yield as an incremental chat update.
    history[-1]["content"] = partial
    yield history
```

This creates the effect of the AI typing its response in real time.


The Gradio Interface

The UI is built using Gradio Blocks.

Key components include:

  • image uploader
  • chatbot interface
  • prompt input
  • send button
  • copy answer button

Example component:

```python
chatbot = gr.Chatbot(
    height=450,
    type="messages",
    autoscroll=True
)
```

Gradio automatically handles:

  • UI rendering
  • websocket communication
  • live updates

Why Local AI Matters

Running AI locally has several advantages:

Privacy

Your screenshots never leave your machine.

Cost

No API costs.

Offline capability

Works without internet.

Control

You can swap models or customize prompts freely.


Future Improvements

Here are some features we plan to add:

Region selection

Ask questions about a specific part of the screenshot.

Multi-image comparison

Compare two screenshots.

OCR export

Extract all text from screenshots.

Chat history

Save conversations locally.


Screenshot Sage demonstrates how easy it has become to build powerful AI tools using local models.

With just a few components:

  • a multimodal model
  • a lightweight UI
  • a simple agent framework

you can create applications that previously required complex cloud infrastructure.

As local AI models continue to improve, tools like this will become increasingly common.

And the best part?
Your data stays with you.


Check out the GitHub repo.
