This article shows a clean, copy-pasteable Python example for analyzing an architecture diagram image using a GPT multimodal model. You’ll get:
- A minimal, production-oriented code snippet (cleaned of comments / clutter)
- A line-by-line explanation of how the code works
- Prompt-engineering tips for accurate, structured output
- Important gotchas (image size, encoding, costs, privacy, and error handling)
- Suggestions for parsing and validating responses
The snippet below assumes you have a configured client object that wraps your OpenAI-compatible API (Azure OpenAI or OpenAI). Replace the model string with the multimodal model name available to you (e.g., gpt-5, gpt-4o, or the provider-specific variant).
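If you don't have a clients module yet, it can be as small as the sketch below, assuming the official openai Python package (for Azure OpenAI, swap in openai.AzureOpenAI with your endpoint, API version, and deployment name):

```python
# clients.py: a minimal sketch using the official openai package (pip install openai).
# For Azure OpenAI, use openai.AzureOpenAI with azure_endpoint and api_version instead.
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])  # keep credentials in env vars, not code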
Clean, copy-pasteable Python code
```python
import base64
from io import BytesIO

from PIL import Image

from clients import client

IMAGE_SRC = "./assets/banking.png"
MODEL = "gpt-4o"  # replace with your multimodal model name if different


def encode_image_pil(img: Image.Image) -> str:
    # Serialize the image in its native format (falling back to PNG) and base64-encode it.
    buffered = BytesIO()
    fmt = img.format or "PNG"
    img.save(buffered, format=fmt)
    return base64.b64encode(buffered.getvalue()).decode("utf-8")


def analyze_image_pil(img: Image.Image, model: str = MODEL, temperature: float = 0.0) -> str:
    encoded = encode_image_pil(img)
    mime = (img.format or "PNG").lower()  # keep the MIME type consistent with the encoded bytes
    prompt_text = (
        "Analyze the architecture diagram in the image in comprehensive detail. "
        "1) Identify every distinct block and name it as a service/module. For each, describe its role. "
        "2) For every connection, list the exact data types/names that flow (include direction). "
        "3) List any text visible in the image verbatim. "
        "4) Describe the full system behaviour and highlight bottlenecks or single points of failure. "
        "Return the analysis as JSON with keys: blocks[], connections[], visible_text[], summary, issues[]."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt_text},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/{mime};base64,{encoded}", "detail": "high"},
                    },
                ],
            }
        ],
        temperature=temperature,
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    img = Image.open(IMAGE_SRC)
    result = analyze_image_pil(img)
    print(result)
```
What the code does — step-by-step
1. Imports and constants
- from clients import client — client is your pre-configured OpenAI/Azure client. Keep credentials/config outside this file.
- base64, BytesIO, and PIL.Image are used to convert an in-memory image into a base64 data URI suitable for embedding into the chat request.
- IMAGE_SRC and MODEL are simple constants you can change to point to a different file or to select a different model.
2. encode_image_pil
- Accepts a PIL.Image.Image instance.
- Uses a BytesIO buffer and img.save(...) to serialize the image into its native or fallback format (PNG).
- Base64-encodes the bytes and returns a UTF-8 string.
- Why: A data: URL (data:image/png;base64,<...>) is a portable way to include an image inline in the request. Some providers accept binary uploads — check your client API if you prefer attachments instead of base64.
3. analyze_image_pil
- Builds a clear, structured prompt asking for:
- Identification of blocks (services/modules)
- Exact data flows with data names and direction
- Verbatim visible text from the image
- System summary and issues/bottlenecks
- The function sends a single chat request that contains both a text item and an image_url item where the image_url is a data: URI containing the base64 image.
- temperature is set to 0.0 by default to maximize determinism for structural analysis. For more creative, higher-level interpretation you can raise it.
- Return: response.choices[0].message.content, the assistant's reply as a string (the surrounding response object varies slightly by provider/SDK).
4. main usage
- Opens the image file with PIL and calls analyze_image_pil.
- Prints the model response.
Prompt engineering & output structure (why we ask for JSON)
Asking the model to return JSON with explicit keys (blocks, connections, visible_text, summary, issues) helps you:
- Parse programmatically and validate results.
- Enforce consistent structure across different images.
- Avoid missed details (the model can be instructed to flag explicitly, e.g. with a null or empty list, when it can't fill a key).
Example JSON structure you should ask for:
```json
{
  "blocks": [
    {
      "id": "AuthService",
      "label": "Auth Service",
      "role": "Issue JWT tokens",
      "data_in": ["credentials(username, password)"],
      "data_out": ["jwt_token"]
    }
  ],
  "connections": [
    {
      "from": "Frontend",
      "to": "AuthService",
      "data": ["login_request { username, password }"]
    }
  ],
  "visible_text": ["User Login", "DB: Postgres"],
  "summary": "...",
  "issues": ["AuthService is a single point of failure"]
}
```
If you need stricter parsing, use your provider's JSON mode if it offers one, or ask the model to always return the JSON and nothing else (no prose, no code fences). Then validate with json.loads().
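For example, OpenAI's Chat Completions API supports a JSON mode via response_format; check whether your provider and model support it before relying on it. A sketch:

```python
# JSON mode constrains the model to emit syntactically valid JSON.
# Note: OpenAI requires the word "JSON" to appear in the prompt when using json_object.
response = client.chat.completions.create(
    model=MODEL,
    response_format={"type": "json_object"},
    temperature=0.0,
    messages=[
        {"role": "user", "content": [
            {"type": "text", "text": prompt_text},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encoded}"}},
        ]},
    ],
)
```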
Key gotchas & best practices
1) Image size and base64 length
- Base64 increases payload size by ~33%. Very large images can blow up request size or exceed API limits.
- Best practice: resize or compress images (downscale while preserving readability) before encoding. Use PNG for diagrams, JPEG for photos where acceptable.
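A minimal downscaling helper might look like this (the 2048px cap is illustrative; tune it to your provider's limits and the diagram's legibility). Call it before encode_image_pil so the base64 payload stays small:

```python
from PIL import Image

def downscale_for_analysis(img: Image.Image, max_side: int = 2048) -> Image.Image:
    """Shrink the image so its longest side is at most max_side, preserving aspect ratio."""
    if max(img.size) <= max_side:
        return img
    scale = max_side / max(img.size)
    # Note: resize() returns an image with format=None, so encoding falls back to PNG.
    return img.resize((int(img.width * scale), int(img.height * scale)), Image.LANCZOS)
```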
2) Format detection
- The code uses img.format if available and otherwise falls back to PNG, deriving the data URI's MIME type the same way so the two stay consistent. Images created in memory (e.g., after a resize()) have no format and will be saved as PNG.
3) Model selection
- Use the multimodal model provided by your vendor. Replace MODEL value with the latest multimodal model name you have access to (e.g., gpt-5 or a vendor-specific alias).
- Some models are better at text extraction (OCR), others at structural reasoning; test and tune.
4) Temperature & determinism
- temperature=0.0 yields deterministic, conservative output (good for structured extraction).
- Higher temperature gives more creative / speculative answers, useful for brainstorming but not for precise extraction.
5) Costs & rate limits
- Multimodal models are often more expensive and may have different quotas. Batch requests and cache results where possible.
- For bulk image processing, implement retry/backoff and monitor billing.
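A simple wrapper with exponential backoff and jitter could look like this (catching Exception is for brevity; in practice, narrow it to your SDK's rate-limit and timeout errors):

```python
import random
import time

def analyze_with_retries(img, max_attempts: int = 5):
    for attempt in range(max_attempts):
        try:
            return analyze_image_pil(img)
        except Exception:  # narrow to e.g. openai.RateLimitError / openai.APITimeoutError
            if attempt == max_attempts - 1:
                raise
            time.sleep((2 ** attempt) + random.random())  # 1s, 2s, 4s, ... plus jitter
```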
6) Privacy & security
- Don’t send PII or sensitive diagrams to public APIs unless permitted by policy. If the image contains sensitive data, sanitize or host on a private, audited endpoint.
- Consider on-premise or enterprise offerings if compliance is required.
7) OCR vs structural interpretation
- If you only need text in the image, dedicated OCR tools may be cheaper and more deterministic.
- For structural interpretation (blocks, flows, architectural reasoning) multimodal LLMs offer richer semantic outputs.
8) Data URI vs binary uploads
- data: URIs are simple and portable but large. Check your client library for binary upload methods (multipart/form-data or separate image input fields) — those are more efficient.
9) Validation & fallback
- Always validate JSON returned by the model. The model may return malformed JSON if the prompt isn’t strict.
- Add retries or a secondary prompt to reformat output if parsing fails.
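A fallback "reformat" prompt can often rescue almost-valid output. A rough sketch (the prompt wording is illustrative):

```python
def reformat_to_json(raw: str) -> str:
    """Ask the model to repair its own output; use only when json.loads() fails."""
    response = client.chat.completions.create(
        model=MODEL,
        temperature=0.0,
        messages=[{
            "role": "user",
            "content": (
                "Rewrite the following analysis as valid JSON with keys "
                "blocks, connections, visible_text, summary, issues. "
                "Return only the JSON, with no prose and no code fences.\n\n" + raw
            ),
        }],
    )
    return response.choices[0].message.content
```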
Advanced tips
1) Use schema enforcement
- Ask the model to validate its own JSON before returning, or run a JSON schema validator locally after receiving the reply.
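Locally, a lightweight check with the third-party jsonschema package might look like this (the schema is a minimal sketch; tighten it to your needs):

```python
import json

from jsonschema import ValidationError, validate  # pip install jsonschema

ANALYSIS_SCHEMA = {
    "type": "object",
    "required": ["blocks", "connections", "visible_text", "summary", "issues"],
    "properties": {
        "blocks": {"type": "array"},
        "connections": {"type": "array"},
        "visible_text": {"type": "array"},
        "summary": {"type": "string"},
        "issues": {"type": "array"},
    },
}

parsed = json.loads(raw)  # raw: the model's reply string
try:
    validate(instance=parsed, schema=ANALYSIS_SCHEMA)
except ValidationError as exc:
    raise ValueError(f"Model output failed schema validation: {exc.message}")
```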
2) Two-step pipeline for extreme reliability
- First, ask the model to extract visible text and primitive OCR (low temperature).
- Second, feed the extracted text + simplified structural image to a second prompt for deeper reasoning (bottlenecks, verification).
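A sketch of that pipeline, with both prompts illustrative rather than definitive:

```python
def two_step_analysis(img: Image.Image) -> str:
    encoded = encode_image_pil(img)
    image_part = {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encoded}"}}

    # Pass 1: verbatim text extraction only (narrow task, temperature 0).
    labels = client.chat.completions.create(
        model=MODEL,
        temperature=0.0,
        messages=[{"role": "user", "content": [
            {"type": "text", "text": "List every piece of text visible in this image, verbatim, one item per line."},
            image_part,
        ]}],
    ).choices[0].message.content

    # Pass 2: structural reasoning, grounded in the extracted labels.
    return client.chat.completions.create(
        model=MODEL,
        temperature=0.0,
        messages=[{"role": "user", "content": [
            {"type": "text", "text": (
                "These labels were extracted from the diagram:\n" + labels + "\n\n"
                "Using the image and these labels, identify blocks, connections, "
                "bottlenecks, and single points of failure. Return JSON with keys: "
                "blocks[], connections[], visible_text[], summary, issues[]."
            )},
            image_part,
        ]}],
    ).choices[0].message.content
```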
3) Human-in-the-loop verification
- For critical architecture diagrams, present the structured output to a human reviewer who can correct or confirm before taking automated actions.
Example: parse and validate JSON safely
```python
import json

raw = analyze_image_pil(img)

# The model might return markdown-wrapped JSON; strip the code fences if needed
if isinstance(raw, str):
    raw = raw.strip()
    if raw.startswith("```"):
        raw = raw.split("```")[1]                # keep the content between the first pair of fences
        raw = raw.removeprefix("json").strip()   # drop an optional language tag

try:
    parsed = json.loads(raw)
except json.JSONDecodeError:
    raise ValueError("Model returned non-JSON output. Consider tightening your prompt.")
```
Final checklist before productionizing
- Limit image resolution to what’s necessary for readability.
- Use deterministic prompts + temperature=0.0 for extraction tasks.
- Request JSON and validate locally.
- Protect sensitive images (don’t leak PII).
- Monitor API usage & costs.
- Add retries, exponential backoff, and rate-limiting.
Conclusion
Multimodal GPT models make it straightforward to extract structured architecture insights from diagrams, but reliable production usage requires careful image handling, explicit prompt engineering, and robust validation. Use the cleaned code above as a starting point; iterate on prompts, model choice, and data handling to reach the reliability your product needs.