"I Can't Read This" - When Claude Refuses Your Screenshot
These days, throwing error screenshots at Claude or Codex for debugging is pretty standard practice. Your terminal output is trapped in an environment where you can't copy-paste, so you screenshot it and ask the AI, "What's going on here?" We all do it.
But what if the screenshot is slightly rotated, and the AI's response becomes completely useless?
Photos of monitors taken on phones. Whiteboard diagrams captured on iPads. Images end up in all sorts of orientations. You might assume, "It's AI, surely it can handle a little rotation." But for VLMs (Vision Language Models), image orientation is far more critical than you'd think.
Here's a good way to think about it: VLMs have great eyesight but a stiff neck. They can read a properly oriented image flawlessly, but hand them an upside-down image and their reading comprehension drops to kindergarten level. How far does it drop exactly? We ran the experiment to find out.
Experiment Design
Here's how we set it up:
- Test images: 12 (text / charts / code / mixed content)
- Rotation patterns: 0°, 90°, 180°, 270°
- Total conditions: 12 images × 4 rotations = 48 conditions
- Models compared: Claude 3.5 Sonnet vs GPT-4o
- Metrics: Text extraction accuracy + keyword match rate
Each image was rotated in all four orientations and fed to each model with the same prompt for text extraction. The upright image (0°) result was treated as ground truth.
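The post doesn't show the scoring code, so here's a minimal sketch of what the two metrics could look like, assuming a sequence-similarity ratio for extraction accuracy and case-insensitive substring hits for keyword matching (both helper names are made up):

```python
from difflib import SequenceMatcher


def text_accuracy(extracted: str, ground_truth: str) -> float:
    """Similarity between extracted text and the 0° ground-truth extraction."""
    return SequenceMatcher(None, extracted, ground_truth).ratio()


def keyword_match_rate(extracted: str, keywords: list[str]) -> float:
    """Fraction of expected keywords that appear in the extracted text."""
    hits = sum(1 for kw in keywords if kw.lower() in extracted.lower())
    return hits / len(keywords)
```

Exact-match scoring would be too brittle here, since models often paraphrase or reorder lines even at 0°; a similarity ratio degrades gracefully instead.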
Results: 180° Destroys Everything
Text Extraction Accuracy
| Rotation | Claude | GPT-4o |
|---|---|---|
| 0° | 97.0% | 97.0% |
| 90° | 39.5% | 95.0% |
| 180° | 27.9% | 24.9% |
| 270° | 44.5% | 91.9% |
Keyword Match Rate
| Rotation | Claude | GPT-4o |
|---|---|---|
| 0° | 94.3% | 94.3% |
| 90° | 50.5% | 94.3% |
| 180° | 22.9% | 14.8% |
| 270° | 62.4% | 100% |
The numbers speak for themselves.
180° rotation (upside-down) is catastrophic for both models. Text extraction accuracy drops to 25-28%, a roughly 70-point nosedive from 97% at the correct orientation. Keyword match rate falls to 15-23%, essentially "I didn't read a thing" territory.
Sideways Rotation: A Tale of Two Models
The 90°/270° results are where things get interesting.
- GPT-4o: Maintains 90%+ accuracy on sideways images. Barely affected.
- Claude: Drops to 40-50%. Roughly half its normal accuracy.
GPT-4o is remarkably robust against sideways rotation. However, flip the image upside-down (180°) and even GPT-4o's keyword match rate plummets to 14.8%. Both models share the same pattern: "I can tilt my head to read sideways, but upside-down? No chance." Not unlike humans, really.
Heatmaps: Visualizing the Accuracy Drop
Here's Claude's accuracy degradation pattern. You can clearly see the collapse at 90° and 270°:
GPT-4o holds strong against sideways rotation, but 180° is equally devastating:
Degradation by Content Type
The type of image content also affects how badly rotation hurts performance:
Text-heavy images suffered the most from rotation, while charts and code snippets were relatively more resilient. This tells us that spatial pattern recognition plays a major role in how VLMs "read" text.
Why Does Rotation Tank Accuracy?
You might be thinking, "This much accuracy loss from just rotating an image? Something doesn't add up." Humans struggle with upside-down text too, but not from 97% down to 28%. The answer lies in VLM architecture.
Patch Splitting + Position Embedding
VLMs split input images into small patches (e.g., 14×14 pixels) and assign each patch a position embedding. These embeddings are learned during training and fixed at inference.
This means the model learned with the assumption that "the top-left patch represents the top-left of the image." When an image is rotated 180°, what's actually bottom-right content gets processed as "top-left." It's like being handed an upside-down map and asked, "Which way is north?"
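To make that concrete, here's a toy illustration of how a 180° rotation scrambles the patch-to-position mapping (the 4×4 grid and row-major indexing stand in for a real ViT-style encoder):

```python
import numpy as np

# Toy 4x4 grid of patch indices, assigned row-major the way a
# ViT-style encoder assigns position embeddings
patches = np.arange(16).reshape(4, 4)

# A 180° rotation moves the bottom-right patch's content into the
# slot that receives position embedding #0 (top-left)
rotated = np.rot90(patches, k=2)

print(patches[0, 0], rotated[0, 0])  # → 0 15
```

Every patch ends up paired with the position embedding of its diagonal opposite, which is exactly the "upside-down map" situation described above.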
Training Data Bias
VLM training data is almost 100% upright images. Web-crawled images, scanned books, datasets... they're virtually all correctly oriented. So the model is optimized under a strong prior that images are upright.
Text Recognition is Deeply Spatial
Character recognition is fundamentally spatial pattern matching. The letter "A" is recognized by a vertex at the top and two legs at the bottom. Rotate it 180° and it starts looking like "V", and OCR-like processing breaks down at a fundamental level.
"I Know It's Rotated" But Still Can't Read It
Here's the fascinating part: modern VLMs (like Claude Sonnet 4) actually know the image is rotated. They'll even say, "The image appears to be upside-down, which makes it difficult to read."
A human who notices upside-down text can mentally rotate it and read along. But a VLM, even after recognizing the rotation, continues processing patches under the upright assumption. The ability to identify the problem and the ability to fix the problem are completely decoupled.
In other words, telling a VLM "just correct the rotation yourself and read it" doesn't work. The information is already mangled at the patch-splitting stage, and no amount of text-level prompting can undo that. That's why preprocessing before you send the image matters.
Side Note: Color Recognition Is Unaffected
Interestingly, when we tested with a "name the dominant colors in this image" task, rotation had virtually no impact on accuracy. Color is a per-pixel feature that doesn't depend on spatial position, so it's naturally rotation-invariant. VLMs aren't bad at "rotated images" per se; they're bad at "spatial pattern recognition on rotated images."
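This rotation invariance is easy to verify numerically; here's a quick check with a random RGB array (synthetic data, not one of the test images):

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64, 3))

# Rotation only permutes pixel positions; the multiset of pixel values
# is unchanged, so any per-pixel statistic (mean color, histogram) survives
upright_mean = img.mean(axis=(0, 1))
flipped_mean = np.rot90(img, k=2).mean(axis=(0, 1))

assert np.allclose(upright_mean, flipped_mean)
```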
The Preprocessing Fix
Now that we understand the problem, let's fix it. Here's a preprocessing function that corrects image orientation before sending it to a VLM.
Two Approaches
| Method | How It Works | Best For |
|---|---|---|
| EXIF | Reads the EXIF Orientation tag and rotates accordingly | Smartphone photos (with EXIF data) |
| Entropy | Analyzes spatial text patterns to estimate orientation | Scanned images, screenshots (no EXIF) |
Smartphone photos contain EXIF metadata, so we try that first. For images without EXIF (scans, screenshots), we fall back to the entropy-based method.
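That fallback decision can be sketched as a small helper (hypothetical, not part of the module below; it only checks whether the EXIF Orientation tag 0x0112 is present):

```python
from PIL import Image


def pick_method(image_path: str) -> str:
    """Decide which correction method to use for an image.

    Smartphone photos usually carry the EXIF Orientation tag (0x0112);
    screenshots and scans usually don't.
    """
    exif = Image.open(image_path).getexif()
    return "exif" if exif.get(0x0112) is not None else "entropy"
```

The return value can be passed straight to the `correct_image(path, method=...)` function shown below.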
Implementation
Here's the complete, ready-to-use preprocessing module. Dependencies are just Pillow and NumPy.
```python
#!/usr/bin/env python3
"""Image orientation correction preprocessing functions"""
import numpy as np
from PIL import Image


def auto_orient_exif(image_path: str) -> Image.Image:
    """EXIF-based orientation correction.

    Corrects image orientation based on the EXIF Orientation tag.
    Returns the image as-is if no EXIF data is present.
    (Pillow ships ImageOps.exif_transpose() for the same job;
    this version spells the tag handling out.)
    """
    img = Image.open(image_path)
    exif = img.getexif()
    if not exif:
        return img
    # Orientation tag = 0x0112
    orientation = exif.get(0x0112)
    if orientation is None:
        return img
    transforms = {
        2: Image.FLIP_LEFT_RIGHT,
        3: Image.ROTATE_180,
        4: Image.FLIP_TOP_BOTTOM,
        5: [Image.FLIP_LEFT_RIGHT, Image.ROTATE_90],
        6: Image.ROTATE_270,
        7: [Image.FLIP_LEFT_RIGHT, Image.ROTATE_270],
        8: Image.ROTATE_90,
    }
    transform = transforms.get(orientation)
    if transform is None:
        return img
    if isinstance(transform, list):
        for t in transform:
            img = img.transpose(t)
    else:
        img = img.transpose(transform)
    return img


def auto_orient_entropy(image_path: str) -> Image.Image:
    """Entropy + edge analysis based orientation estimation.

    Exploits the fact that text lines in documents run horizontally:
    scores each of the four candidate rotations for "uprightness"
    and applies the winner.
    """
    img = Image.open(image_path)
    arr = np.array(img.convert("L").resize((400, 400)))
    best_angle = 0
    best_score = -1.0
    # Candidate angles are clockwise rotations applied to the test array
    # (np.rot90 with k=1 is 90° counter-clockwise, so k=3 is 90° clockwise)
    for angle in [0, 90, 180, 270]:
        if angle == 0:
            rotated = arr
        elif angle == 90:
            rotated = np.rot90(arr, k=3)  # 90° clockwise
        elif angle == 180:
            rotated = np.rot90(arr, k=2)
        else:
            rotated = np.rot90(arr, k=1)  # 270° clockwise
        score = _compute_text_orientation_score(rotated)
        if score > best_score:
            best_score = score
            best_angle = angle
    if best_angle == 0:
        return img
    # Apply the winning clockwise rotation to the original image.
    # PIL's rotate() is counter-clockwise, hence the negated angle.
    return img.rotate(-best_angle, expand=True)


def _compute_text_orientation_score(arr: np.ndarray) -> float:
    """Compute an "uprightness" score for a text image.

    Combines the following features:
    1. Horizontal line dominance (text lines run horizontally)
    2. Row variance patterns (alternating text lines and whitespace)
    3. Higher edge density at the top (titles/headers)
    4. Brighter bottom than top (documents trail off into whitespace)
    """
    # Sobel-like gradients: diff along x responds to vertical strokes,
    # diff along y responds to horizontal line structure
    x_grad = np.abs(np.diff(arr.astype(float), axis=1))
    y_grad = np.abs(np.diff(arr.astype(float), axis=0))
    h = arr.shape[0]
    score = 0.0
    # Feature 1: horizontal line dominance
    total = np.sum(x_grad) + np.sum(y_grad)
    if total > 0:
        score += (np.sum(y_grad) / total) * 30
    # Feature 2: row regularity (text rows alternating with whitespace)
    row_means = np.mean(arr, axis=1)
    score += min(np.var(np.diff(row_means)) / 100, 30)
    # Feature 3: top-heaviness (titles/headers cluster at the top)
    top_edge_density = np.mean(y_grad[:h // 3, :])
    bottom_edge_density = np.mean(y_grad[2 * h // 3:, :])
    if top_edge_density > bottom_edge_density:
        score += 20
    # Feature 4: whitespace gradient (bottom brighter than top)
    if np.mean(arr[3 * h // 4:, :]) > np.mean(arr[:h // 4, :]):
        score += 10
    return score


def correct_image(image_path: str, method: str = "entropy") -> Image.Image:
    """Unified correction function.

    Args:
        image_path: Path to the image
        method: Correction method ("exif" or "entropy")

    Returns:
        Corrected image
    """
    methods = {
        "exif": auto_orient_exif,
        "entropy": auto_orient_entropy,
    }
    func = methods.get(method)
    if func is None:
        raise ValueError(f"Unknown method: {method}. Choose from {list(methods)}")
    return func(image_path)


def correct_and_save(image_path: str, output_path: str, method: str = "entropy") -> str:
    """Correct orientation and save."""
    img = correct_image(image_path, method)
    img.save(output_path)
    return output_path
```
Usage
```python
from orientation_preprocess import correct_image, correct_and_save

# Images with EXIF data (smartphone photos, etc.)
img = correct_image("photo.jpg", method="exif")

# Images without EXIF (screenshots, scans, etc.)
img = correct_image("scan.png", method="entropy")

# Correct and save to a new file
correct_and_save("input.png", "output.png", method="entropy")
```
Plugging It Into Your VLM Workflow
In practice, it looks like this:
```python
from orientation_preprocess import correct_and_save
import anthropic

# Step 1: Correct image orientation
correct_and_save("receipt.jpg", "receipt_corrected.jpg", method="exif")

# Step 2: Send the corrected image to the VLM
client = anthropic.Anthropic()
# ... proceed with your normal API call
```
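If you want to flesh out that elided call, a payload builder following the Anthropic Messages API image format might look like this (the helper name is made up; model choice and prompt are yours):

```python
import base64


def build_ocr_request(image_path: str, prompt: str) -> list[dict]:
    """Build a Messages-API-style payload: one base64 image + one text prompt."""
    with open(image_path, "rb") as f:
        data = base64.standard_b64encode(f.read()).decode("utf-8")
    return [{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64",
                        "media_type": "image/jpeg",
                        "data": data}},
            {"type": "text", "text": prompt},
        ],
    }]
```

Pass the result straight to `client.messages.create(model=..., max_tokens=1024, messages=build_ocr_request("receipt_corrected.jpg", "Extract all text."))`.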
One line of preprocessing. That's all it takes to get your 97% accuracy back.
Before and After Preprocessing
Takeaways
Here's what the experiment showed:
- 180° rotation (upside-down) is catastrophic: Both models drop to 25-28% accuracy
- Rotation tolerance varies by model: GPT-4o handles sideways images well (90%+), while Claude drops to ~40%
- One line of preprocessing fixes it: Correct orientation before the API call and accuracy is restored
VLMs are impressively capable at "seeing," but they're built on the assumption that you'll show them things right-side up. Great eyesight, stiff neck. Fortunately, this blind spot is trivially easy to work around with a bit of preprocessing.
If you're throwing error screenshots at AI for debugging, or feeding smartphone photos into a VLM pipeline, consider adding that one-line orientation fix. It could be the difference between "I can't read this" and "Here's your bug."
If you're interested in practical techniques for working with Claude Code, check out my book Practical Claude Code on Amazon.
Originally published in Japanese on Qiita.