AI vision models are becoming incredibly powerful—but most tools send your screenshots and images to the cloud.
For developers working with sensitive data, internal dashboards, or proprietary code, that’s a major concern.
So we built Screenshot Sage: a privacy-first AI assistant that lets you chat with your screenshots locally.
The entire system runs on your machine using a multimodal vision model and a simple Python stack.
No cloud APIs.
No data leaving your computer.
In this article, we’ll walk through:
- The idea behind Screenshot Sage
- The architecture
- The technology stack
- How the system works
- Key implementation details
- Code snippets you can use to build your own
What Screenshot Sage Does
Screenshot Sage allows you to:
- Upload a screenshot
- Ask questions about the image
- Get explanations in natural language
The AI can analyze:
- UI screens
- charts
- dashboards
- terminal output
- error messages
- code snippets
- documents
Example questions you can ask:
- "Why did this Python program crash?"
- "What trends do you see in this chart?"
- "What does this error message mean?"
- "How could this UI be improved?"
And because the model runs locally, your screenshots remain private.
System Architecture
The architecture is intentionally simple.
Components
Frontend
Gradio provides the browser UI where users:
- upload screenshots
- ask questions
- view streaming responses
Agent Layer
Agno handles prompt orchestration and communication with the model.
Model Server
LM Studio runs a local inference server exposing an OpenAI-compatible API.
Vision Model
A multimodal model analyzes both the screenshot and the user’s question.
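Putting the four components together, a request flows through the system roughly like this:

```
Browser ──▶ Gradio UI ──▶ Agno agent ──▶ LM Studio (OpenAI-compatible API) ──▶ Qwen3-VL-8B
   ▲                                                                             │
   └──────────────────────────── streamed answer ◀───────────────────────────────┘
```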
Technology Stack
| Layer | Technology |
|---|---|
| Agent Framework | Agno |
| UI Framework | Gradio |
| Language | Python |
| Model Runtime | LM Studio |
| Vision Model | Qwen3-VL-8B |
Why We Chose This Stack
Agno
Agno provides a clean way to manage agents and model calls without writing raw OpenAI client code.
It simplifies:
- prompts
- multimodal inputs
- model interaction
Gradio
Gradio makes it extremely easy to build interactive ML interfaces with minimal code.
We get:
- drag and drop uploads
- chat UI
- streaming responses
LM Studio
LM Studio allows you to run local models and expose them via a simple API.
The interface is compatible with OpenAI-style requests, which means most libraries work out of the box.
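As a sketch of what that compatibility means, here is the kind of JSON payload you could POST to LM Studio's `/v1/chat/completions` endpoint. The port, model name, and data URL are assumptions based on LM Studio's defaults and this project's setup; adjust them to match what you have loaded.

```python
# Sketch: an OpenAI-style chat completion request for LM Studio.
# Endpoint, port, and model name are assumptions (LM Studio defaults).
LMSTUDIO_URL = "http://localhost:1234/v1/chat/completions"

payload = {
    "model": "qwen3-vl-8b",  # must match the model loaded in LM Studio
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What does this error message mean?"},
                # The screenshot travels inline as a Base64 data URL:
                {"type": "image_url",
                 "image_url": {"url": "data:image/png;base64,..."}},
            ],
        }
    ],
    "stream": True,  # ask the server to stream tokens back
}
```

Because the shape matches the OpenAI Chat Completions API, the official `openai` Python client (or Agno, as used here) can talk to LM Studio just by pointing `base_url` at localhost.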
How Screenshot Processing Works
When a user uploads an image, the system performs the following steps:
1. The image is uploaded through the Gradio interface
2. The image file is converted to Base64
3. The Base64 image is sent to the agent
4. The agent sends the image and prompt to LM Studio
5. The vision model analyzes the image
6. The response streams back to the UI
This entire pipeline runs locally.
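The steps above can be sketched as one function. Here `run_model` is a hypothetical callable standing in for the agent call described later, so the wiring can be shown without a running server:

```python
import base64

def answer_screenshot(image_path, question, run_model):
    """Sketch of the local pipeline: encode the screenshot, hand it to the
    model together with the question, and return the answer.

    `run_model(question, image_data_url)` is a placeholder for the agent
    call covered below; it is not a real Agno API.
    """
    # Step 2: convert the uploaded file into Base64
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()

    # Step 3: wrap it as a data URL the vision model understands
    data_url = f"data:image/png;base64,{b64}"

    # Steps 4-6: send image + prompt to the model and return its answer
    return run_model(question, data_url)
```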
Creating the Agent
The agent is responsible for interacting with the model.
```python
from agno.agent import Agent
from agno.models.openai import OpenAIChat

def create_agent():
    model = OpenAIChat(
        id="qwen3-vl-8b",
        base_url="http://localhost:1234/v1",
        api_key="lm-studio",
    )
    agent = Agent(
        model=model,
        instructions="""
        You are an expert visual assistant.
        Analyze screenshots and answer user questions clearly.
        Explain errors, UI elements, charts, and code snippets.
        """,
    )
    return agent
```
This agent communicates directly with LM Studio.
Encoding Images for the Model
The model expects images to be provided as Base64 data URLs.
```python
import base64

def encode_image(path):
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()
```
We then convert it into a data URL:
`data:image/png;base64,...`
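Concretely, a small helper along these lines does the wrapping (the name `to_data_url` and the `mime` parameter are ours, for illustration):

```python
import base64

def to_data_url(path, mime="image/png"):
    # Read the image bytes and wrap them in a data URL the model accepts.
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return f"data:{mime};base64,{b64}"
```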
Sending the Image to the Model
Once encoded, the image is included in the request.
```python
response = agent.run(
    message,
    images=[{"url": image_data_url}],
)
```
The model receives both:
- the image
- the user prompt
This allows it to reason over the visual content.
Streaming Responses
To improve user experience, we stream the model response word-by-word.
```python
for word in answer.split():
    partial += word + " "
    history[-1]["content"] = partial
    yield history
```
This creates the effect of the AI typing its response in real time.
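Extracted into a standalone generator, the same idea looks like this. Note this simulates typing by re-splitting the finished answer; it is not LM Studio's native token streaming:

```python
def stream_words(answer):
    """Yield progressively longer prefixes of the answer, one word at a
    time, simulating a typing effect for the chat UI."""
    partial = ""
    for word in answer.split():
        partial += word + " "
        yield partial
```

Gradio re-renders the chatbot on every `yield`, so each partial string appears as a live update. For true token-level streaming you would instead consume the chunks the model server streams back.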
The Gradio Interface
The UI is built using Gradio Blocks.
Key components include:
- image uploader
- chatbot interface
- prompt input
- send button
- copy answer button
Example component:
```python
chatbot = gr.Chatbot(
    height=450,
    type="messages",
    autoscroll=True,
)
```
Gradio automatically handles:
- UI rendering
- websocket communication
- live updates
Why Local AI Matters
Running AI locally has several advantages:
Privacy
Your screenshots never leave your machine.
Cost
No API costs.
Offline capability
Works without internet.
Control
You can swap models or customize prompts freely.
Future Improvements
Here are some features we plan to add:
Region selection
Ask questions about a specific part of the screenshot.
Multi-image comparison
Compare two screenshots.
OCR export
Extract all text from screenshots.
Chat history
Save conversations locally.
Screenshot Sage demonstrates how easy it has become to build powerful AI tools using local models.
With just a few components:
- a multimodal model
- a lightweight UI
- a simple agent framework
you can create applications that previously required complex cloud infrastructure.
As local AI models continue to improve, tools like this will become increasingly common.
And the best part?
Your data stays with you.
Check out the GitHub repo.