How I automated markdown docs from UI screenshots using AI

#ai #python #webdev #tutorial

Last month I was knee-deep in documenting a React component library I’d been building for six months. The library had 40+ components, each with 5–10 props, and I wanted to show actual UI screenshots alongside code examples. Taking those screenshots manually was a drag — but so was writing alt text and prop tables from scratch.

I thought: surely there’s a tool that turns a screenshot into a markdown snippet with the component name, props, and description. So I went hunting.

What I tried that didn’t work

First, I tried the obvious: OCR + regex. Take a screenshot, run Tesseract, then parse the text for component names and props. That failed miserably because:

The UI text was often in styled fonts that OCR misread (e.g., “Button” became “But1on”).
It couldn’t capture visual structure: a dropdown and a toggle look almost identical once they are reduced to text.
Regex to detect “Props: size, variant” fell apart when the text wrapped or had icons.
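To make the last failure concrete, here is a minimal sketch of that kind of regex. The pattern and the sample strings are made up for illustration, not the exact ones I used, but they show how one wrapped line or one misread character breaks the parse.

```python
import re

# Hypothetical pattern of the kind described above: grab everything after "Props:".
PROPS_PATTERN = re.compile(r"Props:\s*(.+)")

def extract_props(ocr_text):
    """Return the prop names found after 'Props:', or an empty list."""
    match = PROPS_PATTERN.search(ocr_text)
    if not match:
        return []
    return [name.strip() for name in match.group(1).split(",") if name.strip()]

clean = "Props: size, variant"
wrapped = "Props: size,\nvariant"   # the text wrapped onto a second line
misread = "Pr0ps: size, variant"    # OCR misread a single character

print(extract_props(clean))    # ['size', 'variant']
print(extract_props(wrapped))  # ['size']
print(extract_props(misread))  # []
```

You can patch each case, but every patch is another special rule, and the screenshots keep producing new ones.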

Next, I looked at cloud-based AI documentation generators. Most required me to upload my entire component library, integrate with their SDK, and pay per component. I didn’t want vendor lock-in. I also didn’t want to share my codebase with a third party just to get docs.

Then I tried a public multimodal model API like OpenAI’s GPT-4o. It worked — but the cost stacked up fast when processing 40+ screenshots multiple times during iteration. Plus, managing API keys and tokens for every teammate became a mess.

What eventually worked: a simple pipeline with any AI endpoint

I needed something cheap, self-hostable, and flexible. The idea was: write a small Python script that reads a screenshot file, sends it to any AI model that accepts images, and returns structured markdown. The script itself is the star — the AI endpoint is just a pluggable option.

Here’s the approach:

Capture – Take a screenshot (or use an existing one).
Prompt – Send the image with a clear instruction: “Describe this UI component in markdown format, including a heading, a brief description, and a table of props with name, type, default, and description.”
Parse – The response is markdown. Save it to a file.
Review – Because AI sometimes hallucinates props, I always do a quick human edit.

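The key is that the same script works with any endpoint that accepts the OpenAI chat-completions request format: OpenAI itself, local models via Ollama’s OpenAI-compatible API, Claude through a compatibility layer or proxy, or even a custom endpoint like the one at ai.interwestinfo.com (I tried it as a fallback). The technique is model-agnostic; only the endpoint URL and the model name change.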
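One way to make that swap explicit is a small provider table. This is a sketch, not part of my original script: the Provider class, the PROVIDERS dict, and build_request are names invented here, and the Ollama model name is an assumption (any local model with image support would do). The payload mirrors the one used in the script.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Provider:
    url: str    # full chat-completions URL
    model: str  # model name that endpoint expects

# Illustrative entries; the model names are assumptions and must support image input.
PROVIDERS = {
    "openai": Provider("https://api.openai.com/v1/chat/completions", "gpt-4o"),
    "ollama": Provider("http://localhost:11434/v1/chat/completions", "llava"),
}

def build_request(provider_name, api_key, prompt, base64_image):
    """Return (url, headers, payload) for an OpenAI-style chat-completions call."""
    provider = PROVIDERS[provider_name]
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
    payload = {
        "model": provider.model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{base64_image}"}},
            ],
        }],
        "max_tokens": 500,
    }
    return provider.url, headers, payload

# No network call here: this only builds the request for inspection.
url, headers, payload = build_request("ollama", "unused-key", "Describe this UI.", "AAAA")
print(url, payload["model"])
```

Passing the three returned values to requests.post(url, headers=headers, json=payload) would send the request; everything provider-specific stays in the table.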

Code — the meaty part

#!/usr/bin/env python3
"""
Screenshot to Markdown documentation generator.
Works with any OpenAI-compatible API.
"""

import os
import sys
import base64
import requests
from pathlib import Path

def encode_image(image_path):
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def image_to_markdown(image_path, api_key, endpoint="https://api.openai.com/v1/chat/completions"):
    """Convert an image to markdown via an AI model."""
    base64_image = encode_image(image_path)

    prompt = (
        "You are a UI documentation expert. Given a screenshot of a React component, "
        "generate a markdown description. Start with a second-level heading containing "
        "the component name. Then write a short description. Then create a table with "
        "columns: Prop Name, Type, Default, Description. If you cannot determine a prop, "
        "write N/A. Output only the markdown."
    )

    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}"
    }

    payload = {
        "model": "gpt-4o",  # swap to other models here
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{base64_image}",
                            "detail": "low"
                        }
                    }
                ]
            }
        ],
        "max_tokens": 500
    }

    # requests has no default timeout, so set one to avoid hanging on a stalled call.
    response = requests.post(endpoint, headers=headers, json=payload, timeout=60)
    if response.status_code != 200:
        raise RuntimeError(f"API error {response.status_code}: {response.text}")

    return response.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python screenshot2docs.py <image.png>")
        sys.exit(1)

    image_path = sys.argv[1]
    if not Path(image_path).exists():
        print(f"File not found: {image_path}")
        sys.exit(1)

    api_key = os.getenv("AI_API_KEY")
    if not api_key:
        print("Set AI_API_KEY environment variable.")
        sys.exit(1)

    md = image_to_markdown(image_path, api_key)
    # Save to a file with same name but .md extension
    out_path = Path(image_path).with_suffix(".md")
    # Write as UTF-8 explicitly; model output often contains non-ASCII characters.
    out_path.write_text(md, encoding="utf-8")
    print(f"Documentation saved to {out_path}")

How to use it

Install requests (pip install requests).
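Set your AI_API_KEY environment variable (e.g., an OpenAI key, or the key for any compatible endpoint). To target a different provider, also change the endpoint default and the model name in the script.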
Run: python screenshot2docs.py button.png
Edit the generated button.md to fix any errors.

Lessons learned and trade-offs

This approach is lightweight, but it’s not perfect. Let me be honest:

Accuracy: The model sometimes invents props that don’t exist, especially if the screenshot is blurry or the UI is complex. Always review the output.
Cost: Even with “low detail” and a cheap model, processing dozens of images repeatedly adds up. For a one-time doc generation, it’s fine. For a CI pipeline, you’ll want to cache results.
Latency: Each call takes 2–5 seconds, so 100 images processed one after another take roughly 3–8 minutes. Not terrible, and because the calls are independent you can parallelize easily with ThreadPoolExecutor.
Model dependence: The markdown output format can vary. I prompt for a table, but sometimes the model returns a list. I added a simple retry with re-prompting logic later.
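The last two points can be sketched together. This is a rough outline under stated assumptions, not the exact code I ended up with: generate stands for a hypothetical wrapper around image_to_markdown that takes a strict flag and adds a firmer "return a table" instruction to the prompt when it is set, and has_table is a deliberately crude check.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def has_table(markdown):
    """Rough check: a markdown table has a separator row such as |---|---|."""
    return any(
        line.strip().startswith("|") and "---" in line
        for line in markdown.splitlines()
    )

def generate_with_retry(image_path, generate, attempts=3, delay=1.0):
    """Call generate(image_path, strict) until the output contains a table.

    strict is False on the first attempt and True on retries. The last
    output is returned even if it never passes, so a human can still review it.
    """
    markdown = ""
    for attempt in range(attempts):
        markdown = generate(image_path, attempt > 0)
        if has_table(markdown):
            return markdown
        if attempt < attempts - 1:
            time.sleep(delay)  # brief pause before re-prompting
    return markdown

def process_batch(image_paths, generate, max_workers=4, delay=1.0):
    """Run the retrying generator over many images concurrently."""
    paths = list(image_paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(
            lambda p: generate_with_retry(p, generate, delay=delay), paths
        )
        return dict(zip(paths, results))

# Stand-in generator: returns a list first, a table when asked strictly.
def fake_generate(path, strict):
    if strict:
        return "## Button\n\n| Prop Name | Type |\n|---|---|\n| size | string |\n"
    return "## Button\n\n- size: string\n"

docs = process_batch(["a.png", "b.png"], fake_generate, delay=0)
print(all(has_table(md) for md in docs.values()))  # True
```

Threads fit here because the work is almost entirely waiting on the network. Keep max_workers small enough to stay under your provider's rate limits.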

When you should NOT use this approach

If you just need to extract text from a button label, use OCR (it’s faster and free).
If your components have minimal visual UI (e.g., server-side utilities), skip the screenshot step and generate docs from code directly.
If you have a huge library and need perfect accuracy, have a human writer own the docs and use an automation tool only as a draft generator.

What I’d do differently next time

I’d build a small web frontend where I can drag & drop screenshots, see the generated markdown inline, and edit it before saving. The script works for batch, but interactivity helps with review. I’d also add a “model selector” dropdown to switch between endpoints on the fly.

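Also, I’d add a caching and deduplication layer. Re-running the script on an unchanged screenshot pays for the same call again, so it’s better to hash the image and check a cache first. Similar-looking variants (e.g., primary/secondary buttons) are a related problem: the second generation tends to copy the first, and an exact hash won’t catch those because the image bytes differ.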
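A minimal sketch of the hash-then-check-cache idea, assuming a local folder as the cache. The function names and the cache layout are hypothetical, and generate is any callable that turns an image path into markdown (for example, a wrapper around image_to_markdown).

```python
import hashlib
import tempfile
from pathlib import Path

def image_digest(image_path):
    """SHA-256 of the raw file bytes; identical files give identical digests."""
    return hashlib.sha256(Path(image_path).read_bytes()).hexdigest()

def cached_markdown(image_path, generate, cache_dir):
    """Return cached markdown for this image, calling generate() only on a miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{image_digest(image_path)}.md"
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    markdown = generate(image_path)
    cache_file.write_text(markdown, encoding="utf-8")
    return markdown

# Demo with a stand-in generator so no API call is made.
calls = []

def fake_generate(path):
    calls.append(path)
    return "## Button\n"

with tempfile.TemporaryDirectory() as tmp:
    shot = Path(tmp) / "button.png"
    shot.write_bytes(b"not a real png")
    first = cached_markdown(shot, fake_generate, Path(tmp) / "cache")
    second = cached_markdown(shot, fake_generate, Path(tmp) / "cache")
    print(first == second, len(calls))  # True 1
```

Keying on the file contents rather than the file name means a renamed screenshot still hits the cache, while a re-captured one with any pixel change does not.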

The real takeaway

Automating documentation from screenshots saved me about 10 hours for this library. The technique of using a generic AI multimodal endpoint to generate structured data from images is reusable beyond docs — you could do it for design handoff specs, bug report screenshots, or auto-generating alt text for your blog.

Now I’d love to hear: What’s your go-to method for generating docs from visuals? Have you tried a similar image-to-markdown pipeline, or do you have a completely different workflow?