DEV Community

AI Engine

Posted on • Originally published at ai-engine.net

Extract Text from Screenshots with an OCR API

Screenshots are everywhere in developer workflows. Error logs from a terminal, metrics from a dashboard, text from a chat conversation, UI copy from a design mockup. The text inside those images is useful, but it's trapped in pixels. An OCR API can extract it in a single HTTP call.

This tutorial uses the OCR Wizard API to pull text out of screenshots with Python.

Want to test it? Try the OCR Wizard API on your own screenshots.


Why Not Tesseract?

Tesseract is the go-to open-source OCR engine, but it struggles with screenshots. Colored backgrounds, UI elements, and non-standard fonts confuse it. Some developers add GPT on top just to clean up Tesseract's noisy output. That's two API calls, a local install, and extra latency. A cloud OCR API handles screenshots natively: send the image, get back clean text.

Extracting Text in Python

import requests

url = "https://ocr-wizard.p.rapidapi.com/ocr"
headers = {
    "x-rapidapi-host": "ocr-wizard.p.rapidapi.com",
    "x-rapidapi-key": "YOUR_API_KEY",
}

# Upload the screenshot as multipart form data
with open("screenshot.png", "rb") as f:
    response = requests.post(url, headers=headers, files={"image": f}, timeout=30)
response.raise_for_status()  # fail fast on auth or rate-limit errors

data = response.json()
print(data["body"]["fullText"])

Real output from a terminal screenshot

$ python3 app.py
Processing 847 images from /data/uploads...
Batch 1/9: 100 images processed (12.3s)
Batch 2/9: 100 images processed (11.8s)
Batch 3/9: 100 images processed (13.1s)
Traceback (most recent call last):
  File "app.py", line 42, in process_batch
    result = api_client.analyze(image_path)
  File "client.py", line 118, in analyze
    response.raise_for_status()
requests.exceptions.HTTPError: 429 Too Many
Requests: Rate limit exceeded. Retry after 60s
ERROR: Batch 4/9 failed at image 312/847
Total processed: 312/847 (36.8%)
Elapsed: 37.2s | ETA: unknown

Every word captured: traceback, error code (429), file names, line numbers, progress stats. No noise, no missing characters.
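That 429 is also worth handling in your own batch pipeline. Here is a minimal retry sketch; the `post_with_retry` helper, its backoff defaults, and the `FakeResp` stub are illustrative, not part of the OCR Wizard API:

```python
import time

def post_with_retry(send, max_retries=3, default_wait=60):
    """Call send() (any function returning a response-like object);
    retry when it reports HTTP 429, honoring the Retry-After header."""
    for attempt in range(max_retries + 1):
        resp = send()
        if resp.status_code != 429:
            return resp
        if attempt < max_retries:
            wait = int(resp.headers.get("Retry-After", default_wait))
            time.sleep(wait)
    return resp

# Stub demonstrating the flow: first call is rate-limited, second succeeds.
class FakeResp:
    def __init__(self, status_code, headers=None):
        self.status_code = status_code
        self.headers = headers or {}

calls = iter([FakeResp(429, {"Retry-After": "0"}), FakeResp(200)])
resp = post_with_retry(lambda: next(calls))
print(resp.status_code)  # 200
```

In production you would pass a lambda wrapping the `requests.post` call from the example above.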

Testing with cURL

curl -X POST \
  'https://ocr-wizard.p.rapidapi.com/ocr' \
  -H 'x-rapidapi-host: ocr-wizard.p.rapidapi.com' \
  -H 'x-rapidapi-key: YOUR_API_KEY' \
  -F 'image=@screenshot.png'

Handling Different Screenshot Types

  • Terminal and error logs - High contrast, monospaced text. Line breaks preserved, so you can parse stack traces or grep for error codes
  • Dashboards and analytics - Numbers mixed with labels and charts. The API extracts text and skips graphical parts
  • Chat conversations - Slack, Discord, WhatsApp. Usernames, timestamps, message bodies in reading order
  • UI mockups - Figma designs, web pages. Extract button labels and headings for QA spec verification
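Because line breaks survive for terminal screenshots, plain regexes work on the extracted text. A small sketch, using an excerpt of the output shown earlier as the sample string:

```python
import re

# Sample OCR output, excerpted from the terminal screenshot above
ocr_text = """Traceback (most recent call last):
  File "app.py", line 42, in process_batch
    result = api_client.analyze(image_path)
requests.exceptions.HTTPError: 429 Too Many
Requests: Rate limit exceeded. Retry after 60s
ERROR: Batch 4/9 failed at image 312/847"""

# Pull the HTTP status code and the stack-trace frames
status = re.search(r"HTTPError: (\d{3})", ocr_text).group(1)
frames = re.findall(r'File "([^"]+)", line (\d+)', ocr_text)

print(status)  # 429
print(frames)  # [('app.py', '42')]
```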

See the full tutorial with JavaScript examples and a QA automation pipeline in the complete screenshot OCR guide.


Structuring Extracted Text with GPT

OCR gives you raw text. Sometimes you need structured data. Combine it with GPT-4o mini to go from pixels to JSON in two API calls.

import requests
from openai import OpenAI

# Step 1: OCR the screenshot
ocr_url = "https://ocr-wizard.p.rapidapi.com/ocr"
ocr_headers = {
    "x-rapidapi-host": "ocr-wizard.p.rapidapi.com",
    "x-rapidapi-key": "YOUR_API_KEY",
}

with open("dashboard_screenshot.png", "rb") as f:
    ocr_response = requests.post(ocr_url, headers=ocr_headers, files={"image": f}, timeout=30)
ocr_response.raise_for_status()

raw_text = ocr_response.json()["body"]["fullText"]

# Step 2: Structure with GPT-4o mini
client = OpenAI()
completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Extract structured data from the following text. Return valid JSON only."},
        {"role": "user", "content": raw_text},
    ],
)

print(completion.choices[0].message.content)

Real GPT-4o mini output from a dashboard screenshot

{
  "monthly_revenue": { "amount": "$12,450", "growth_rate": "+18.3%" },
  "active_users": { "count": 3201, "growth_rate": "+7.2%" },
  "conversion_rate": "4.2%",
  "avg_response_time": { "time": "245ms", "change": "+12ms" },
  "top_pages": [
    { "page": "/pricing", "views": 8421, "bounce_rate": "32%", "avg_time": "2m 15s" },
    { "page": "/blog/ocr-guide", "views": 5102, "bounce_rate": "45%", "avg_time": "4m 30s" },
    { "page": "/apis/face-analyzer", "views": 3887, "bounce_rate": "28%", "avg_time": "1m 48s" },
    { "page": "/signup", "views": 2654, "bounce_rate": "18%", "avg_time": "3m 02s" }
  ]
}

GPT paired each metric with its value, converted the table into an array, and typed numbers as integers. The same approach works for error logs:

{
  "command": "python3 app.py",
  "error_type": "HTTPError",
  "error_message": "429 Too Many Requests: Rate limit exceeded. Retry after 60s",
  "file": "client.py",
  "line": 118,
  "progress": { "total_processed": 312, "total_images": 847, "elapsed_time": "37.2s" },
  "traceback": [
    { "file": "app.py", "line": 42, "function": "process_batch" },
    { "file": "client.py", "line": 118, "function": "analyze" }
  ]
}

The difference from approaches that use GPT to clean up bad Tesseract output: here the OCR result is already accurate. GPT adds structure, not quality.
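One practical caveat: even with a "JSON only" instruction, model replies sometimes arrive wrapped in a markdown fence, so parse defensively before feeding the result downstream. A minimal sketch (the `parse_model_json` helper is mine, not part of the OpenAI SDK):

```python
import json
import re

def parse_model_json(reply: str) -> dict:
    """Strip an optional ```json fence, then parse the payload."""
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", reply.strip())
    return json.loads(cleaned)

# Works whether or not the model added a fence
reply = '```json\n{"error_type": "HTTPError", "line": 118}\n```'
data = parse_model_json(reply)
print(data["line"])  # 118
```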


QA Automation with Screenshot OCR

One of the strongest use cases: take a Playwright screenshot and verify visible text directly.

from playwright.sync_api import sync_playwright
import requests

OCR_URL = "https://ocr-wizard.p.rapidapi.com/ocr"
OCR_HEADERS = {
    "x-rapidapi-host": "ocr-wizard.p.rapidapi.com",
    "x-rapidapi-key": "YOUR_API_KEY",
}

def get_page_text(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        page.screenshot(path="/tmp/page.png", full_page=True)
        browser.close()

    with open("/tmp/page.png", "rb") as f:
        resp = requests.post(OCR_URL, headers=OCR_HEADERS, files={"image": f}, timeout=30)
    resp.raise_for_status()
    return resp.json()["body"]["fullText"]

text = get_page_text("https://myapp.com/dashboard")
assert "Welcome back" in text
assert "0 errors" in text

This catches visual regressions that DOM-based tests miss: text hidden by CSS, overlapping elements, or content rendered by JavaScript.
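When a test checks several strings, a small helper that reports every missing one at once saves rerun cycles. A sketch; the `assert_visible` name is mine, not a Playwright API:

```python
def assert_visible(page_text: str, expected: list[str]) -> None:
    """Fail with one message listing every expected string not found."""
    missing = [s for s in expected if s not in page_text]
    if missing:
        raise AssertionError(f"Not visible on page: {missing}")

page_text = "Welcome back, Ada. 0 errors in the last 24h."
assert_visible(page_text, ["Welcome back", "0 errors"])  # passes silently

try:
    assert_visible(page_text, ["Sign out"])
except AssertionError as e:
    print(e)  # Not visible on page: ['Sign out']
```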

Tips

  • Use PNG for screenshots (lossless). JPG compression adds artifacts that reduce OCR accuracy
  • Crop before sending if you only need text from one region
  • Use the annotations array for layout-aware extraction (word-level bounding boxes)
  • Multi-language detection works automatically; check the detectedLanguage field in the response

Read the full guide with comparison tables and detailed tips on ai-engine.net.
