"I Can't Read This" - When Claude Refuses Your Screenshot
These days, throwing error screenshots at Claude or Codex for debugging is pretty standard practice. Your terminal output is trapped in an environment where you can't copy-paste, so you screenshot it and ask the AI, "What's going on here?" We all do it.
But what if the screenshot is slightly rotated, and the AI's response becomes completely useless?
Photos of monitors taken on phones. Whiteboard diagrams captured on iPads. Images end up in all sorts of orientations. You might assume, "It's AI, surely it can handle a little rotation." But for VLMs (Vision Language Models), image orientation is far more critical than you'd think.
Here's a good way to think about it: VLMs have great eyesight but a stiff neck. They can read a properly oriented image flawlessly, but hand them an upside-down image and their reading comprehension drops to kindergarten level. How far does it drop exactly? We ran the experiment to find out.
Experiment Design
Here's how we set it up:
- Test images: 12 (text / charts / code / mixed content)
- Rotation patterns: 0°, 90°, 180°, 270°
- Total conditions: 12 images × 4 rotations = 48 conditions
- Models compared: Claude 3.5 Sonnet vs GPT-4o
- Metrics: Text extraction accuracy + keyword match rate
Each image was rotated in all four orientations and fed to each model with the same prompt for text extraction. The upright image (0°) result was treated as ground truth.
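The post doesn't show the scoring code, so here's a minimal sketch of what the two metrics could look like, assuming a sequence-similarity ratio for extraction accuracy and case-insensitive substring hits for keyword matching (both helper names are made up):

```python
from difflib import SequenceMatcher


def text_accuracy(extracted: str, ground_truth: str) -> float:
    """Similarity between extracted text and the 0° ground-truth extraction."""
    return SequenceMatcher(None, extracted, ground_truth).ratio()


def keyword_match_rate(extracted: str, keywords: list[str]) -> float:
    """Fraction of expected keywords that appear in the extracted text."""
    hits = sum(1 for kw in keywords if kw.lower() in extracted.lower())
    return hits / len(keywords)
```

Exact-match scoring would be too brittle here, since models often paraphrase or reorder lines even at 0°; a similarity ratio degrades gracefully instead.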
Results: 180° Destroys Everything
Text Extraction Accuracy
| Rotation | Claude | GPT-4o |
|---|---|---|
| 0° | 97.0% | 97.0% |
| 90° | 39.5% | 95.0% |
| 180° | 27.9% | 24.9% |
| 270° | 44.5% | 91.9% |
Keyword Match Rate
| Rotation | Claude | GPT-4o |
|---|---|---|
| 0° | 94.3% | 94.3% |
| 90° | 50.5% | 94.3% |
| 180° | 22.9% | 14.8% |
| 270° | 62.4% | 100% |
The numbers speak for themselves.
180° rotation (upside-down) is catastrophic for both models. Text extraction accuracy drops to 25-28%, a roughly 70-point nosedive from 97% at the correct orientation. Keyword match rate falls to 15-23%, essentially "I didn't read a thing" territory.
Sideways Rotation: A Tale of Two Models
The 90°/270° results are where things get interesting.
- GPT-4o: Maintains 90%+ accuracy on sideways images. Barely affected.
- Claude: Drops to 40-50%. Roughly half its normal accuracy.
GPT-4o is remarkably robust against sideways rotation. However, flip the image upside-down (180°) and even GPT-4o's keyword match rate plummets to 14.8%. Both models share the same pattern: "I can tilt my head to read sideways, but upside-down? No chance." Not unlike humans, really.
Heatmaps: Visualizing the Accuracy Drop
Here's Claude's accuracy degradation pattern. You can clearly see the collapse at 90° and 270°:
GPT-4o holds strong against sideways rotation, but 180° is equally devastating:
Degradation by Content Type
The type of image content also affects how badly rotation hurts performance:
Text-heavy images suffered the most from rotation, while charts and code snippets were relatively more resilient. This tells us that spatial pattern recognition plays a major role in how VLMs "read" text.
Why Does Rotation Tank Accuracy?
You might be thinking, "This much accuracy loss from just rotating an image? Something doesn't add up." Humans struggle with upside-down text too, but not from 97% down to 28%. The answer lies in VLM architecture.
Patch Splitting + Position Embedding
VLMs split input images into small patches (e.g., 14×14 pixels) and assign each patch a position embedding. These embeddings are learned during training and fixed at inference.
This means the model learned with the assumption that "the top-left patch represents the top-left of the image." When an image is rotated 180°, what's actually bottom-right content gets processed as "top-left." It's like being handed an upside-down map and asked, "Which way is north?"
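To make that concrete, here's a toy illustration of how a 180° rotation scrambles the patch-to-position mapping (the 4×4 grid and row-major indexing stand in for a real ViT-style encoder):

```python
import numpy as np

# Toy 4x4 grid of patch indices, assigned row-major the way a
# ViT-style encoder assigns position embeddings
patches = np.arange(16).reshape(4, 4)

# A 180° rotation moves the bottom-right patch's content into the
# slot that receives position embedding #0 (top-left)
rotated = np.rot90(patches, k=2)

print(patches[0, 0], rotated[0, 0])  # → 0 15
```

Every patch ends up paired with the position embedding of its diagonal opposite, which is exactly the "upside-down map" situation described above.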
Training Data Bias
VLM training data is almost 100% upright images. Web-crawled images, scanned books, datasets... they're virtually all correctly oriented. So the model is optimized under a strong prior that images are upright.
Text Recognition is Deeply Spatial
Character recognition is fundamentally spatial pattern matching. The letter "A" is recognized by a vertex at the top and two legs at the bottom. Rotate it 180° and it starts looking like "V", and OCR-like processing breaks down at a fundamental level.
"I Know It's Rotated" But Still Can't Read It
Here's the fascinating part: modern VLMs (like Claude Sonnet 4) actually know the image is rotated. They'll even say, "The image appears to be upside-down, which makes it difficult to read."
A human who notices upside-down text can mentally rotate it and read along. But a VLM, even after recognizing the rotation, continues processing patches under the upright assumption. The ability to identify the problem and the ability to fix the problem are completely decoupled.
In other words, telling a VLM "just correct the rotation yourself and read it" doesn't work. The information is already mangled at the patch-splitting stage, and no amount of text-level prompting can undo that. That's why preprocessing before you send the image matters.
Side Note: Color Recognition Is Unaffected
Interestingly, when we tested with a "name the dominant colors in this image" task, rotation had virtually no impact on accuracy. Color is a per-pixel feature that doesn't depend on spatial position, so it's naturally rotation-invariant. VLMs aren't bad at "rotated images" per se; they're bad at "spatial pattern recognition on rotated images."
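This rotation invariance is easy to verify numerically; here's a quick check with a random RGB array (synthetic data, not one of the test images):

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64, 3))

# Rotation only permutes pixel positions; the multiset of pixel values
# is unchanged, so any per-pixel statistic (mean color, histogram) survives
upright_mean = img.mean(axis=(0, 1))
flipped_mean = np.rot90(img, k=2).mean(axis=(0, 1))

assert np.allclose(upright_mean, flipped_mean)
```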
The Preprocessing Fix
Now that we understand the problem, let's fix it. Here's a preprocessing function that corrects image orientation before sending it to a VLM.
Two Approaches
| Method | How It Works | Best For |
|---|---|---|
| EXIF | Reads the EXIF Orientation tag and rotates accordingly | Smartphone photos (with EXIF data) |
| Entropy | Analyzes spatial text patterns to estimate orientation | Scanned images, screenshots (no EXIF) |
Smartphone photos contain EXIF metadata, so we try that first. For images without EXIF (scans, screenshots), we fall back to the entropy-based method.
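That fallback decision can be sketched as a small helper (hypothetical, not part of the module below; it only checks whether the EXIF Orientation tag 0x0112 is present):

```python
from PIL import Image


def pick_method(image_path: str) -> str:
    """Decide which correction method to use for an image.

    Smartphone photos usually carry the EXIF Orientation tag (0x0112);
    screenshots and scans usually don't.
    """
    exif = Image.open(image_path).getexif()
    return "exif" if exif.get(0x0112) is not None else "entropy"
```

The return value can be passed straight to the `correct_image(path, method=...)` function shown below.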
Implementation
Here's the complete, ready-to-use preprocessing module. Dependencies are just Pillow and NumPy.
```python
#!/usr/bin/env python3
"""Image orientation correction preprocessing functions"""
import numpy as np
from PIL import Image


def auto_orient_exif(image_path: str) -> Image.Image:
    """EXIF-based orientation correction.

    Corrects image orientation based on the EXIF Orientation tag.
    Returns the image as-is if no EXIF data is present.
    (Pillow ships ImageOps.exif_transpose() for the same job;
    this version spells the tag handling out.)
    """
    img = Image.open(image_path)
    exif = img.getexif()
    if not exif:
        return img
    # Orientation tag = 0x0112
    orientation = exif.get(0x0112)
    if orientation is None:
        return img
    transforms = {
        2: Image.FLIP_LEFT_RIGHT,
        3: Image.ROTATE_180,
        4: Image.FLIP_TOP_BOTTOM,
        5: [Image.FLIP_LEFT_RIGHT, Image.ROTATE_90],
        6: Image.ROTATE_270,
        7: [Image.FLIP_LEFT_RIGHT, Image.ROTATE_270],
        8: Image.ROTATE_90,
    }
    transform = transforms.get(orientation)
    if transform is None:
        return img
    if isinstance(transform, list):
        for t in transform:
            img = img.transpose(t)
    else:
        img = img.transpose(transform)
    return img


def auto_orient_entropy(image_path: str) -> Image.Image:
    """Entropy + edge analysis based orientation estimation.

    Exploits the fact that text lines in documents run horizontally:
    scores each of the four candidate rotations for "uprightness"
    and applies the winner.
    """
    img = Image.open(image_path)
    arr = np.array(img.convert("L").resize((400, 400)))
    best_angle = 0
    best_score = -1.0
    # Candidate angles are clockwise rotations applied to the test array
    # (np.rot90 with k=1 is 90° counter-clockwise, so k=3 is 90° clockwise)
    for angle in [0, 90, 180, 270]:
        if angle == 0:
            rotated = arr
        elif angle == 90:
            rotated = np.rot90(arr, k=3)  # 90° clockwise
        elif angle == 180:
            rotated = np.rot90(arr, k=2)
        else:
            rotated = np.rot90(arr, k=1)  # 270° clockwise
        score = _compute_text_orientation_score(rotated)
        if score > best_score:
            best_score = score
            best_angle = angle
    if best_angle == 0:
        return img
    # Apply the winning clockwise rotation to the original image.
    # PIL's rotate() is counter-clockwise, hence the negated angle.
    return img.rotate(-best_angle, expand=True)


def _compute_text_orientation_score(arr: np.ndarray) -> float:
    """Compute an "uprightness" score for a text image.

    Combines the following features:
    1. Horizontal line dominance (text lines run horizontally)
    2. Row variance patterns (alternating text lines and whitespace)
    3. Higher edge density at the top (titles/headers)
    4. Brighter bottom than top (documents trail off into whitespace)
    """
    # Sobel-like gradients: diff along x responds to vertical strokes,
    # diff along y responds to horizontal line structure
    x_grad = np.abs(np.diff(arr.astype(float), axis=1))
    y_grad = np.abs(np.diff(arr.astype(float), axis=0))
    h = arr.shape[0]
    score = 0.0
    # Feature 1: horizontal line dominance
    total = np.sum(x_grad) + np.sum(y_grad)
    if total > 0:
        score += (np.sum(y_grad) / total) * 30
    # Feature 2: row regularity (text rows alternating with whitespace)
    row_means = np.mean(arr, axis=1)
    score += min(np.var(np.diff(row_means)) / 100, 30)
    # Feature 3: top-heaviness (titles/headers cluster at the top)
    top_edge_density = np.mean(y_grad[:h // 3, :])
    bottom_edge_density = np.mean(y_grad[2 * h // 3:, :])
    if top_edge_density > bottom_edge_density:
        score += 20
    # Feature 4: whitespace gradient (bottom brighter than top)
    if np.mean(arr[3 * h // 4:, :]) > np.mean(arr[:h // 4, :]):
        score += 10
    return score


def correct_image(image_path: str, method: str = "entropy") -> Image.Image:
    """Unified correction function.

    Args:
        image_path: Path to the image
        method: Correction method ("exif" or "entropy")

    Returns:
        Corrected image
    """
    methods = {
        "exif": auto_orient_exif,
        "entropy": auto_orient_entropy,
    }
    func = methods.get(method)
    if func is None:
        raise ValueError(f"Unknown method: {method}. Choose from {list(methods)}")
    return func(image_path)


def correct_and_save(image_path: str, output_path: str, method: str = "entropy") -> str:
    """Correct orientation and save."""
    img = correct_image(image_path, method)
    img.save(output_path)
    return output_path
```
Usage
```python
from orientation_preprocess import correct_image, correct_and_save

# Images with EXIF data (smartphone photos, etc.)
img = correct_image("photo.jpg", method="exif")

# Images without EXIF (screenshots, scans, etc.)
img = correct_image("scan.png", method="entropy")

# Correct and save to a new file
correct_and_save("input.png", "output.png", method="entropy")
```
Plugging It Into Your VLM Workflow
In practice, it looks like this:
```python
from orientation_preprocess import correct_and_save
import anthropic

# Step 1: Correct image orientation
correct_and_save("receipt.jpg", "receipt_corrected.jpg", method="exif")

# Step 2: Send the corrected image to the VLM
client = anthropic.Anthropic()
# ... proceed with your normal API call
```
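If you want to flesh out that elided call, a payload builder following the Anthropic Messages API image format might look like this (the helper name is made up; model choice and prompt are yours):

```python
import base64


def build_ocr_request(image_path: str, prompt: str) -> list[dict]:
    """Build a Messages-API-style payload: one base64 image + one text prompt."""
    with open(image_path, "rb") as f:
        data = base64.standard_b64encode(f.read()).decode("utf-8")
    return [{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64",
                        "media_type": "image/jpeg",
                        "data": data}},
            {"type": "text", "text": prompt},
        ],
    }]
```

Pass the result straight to `client.messages.create(model=..., max_tokens=1024, messages=build_ocr_request("receipt_corrected.jpg", "Extract all text."))`.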
One line of preprocessing. That's all it takes to get your 97% accuracy back.
Before and After Preprocessing
Takeaways
Here's what the experiment showed:
- 180° rotation (upside-down) is catastrophic: Both models drop to 25-28% accuracy
- Rotation tolerance varies by model: GPT-4o handles sideways images well (90%+), while Claude drops to ~40%
- One line of preprocessing fixes it: Correct orientation before the API call and accuracy is restored
VLMs are impressively capable at "seeing," but they're built on the assumption that you'll show them things right-side up. Great eyesight, stiff neck. Fortunately, this blind spot is trivially easy to work around with a bit of preprocessing.
If you're throwing error screenshots at AI for debugging, or feeding smartphone photos into a VLM pipeline, consider adding that one-line orientation fix. It could be the difference between "I can't read this" and "Here's your bug."
If you're interested in practical techniques for working with Claude Code, check out my book Practical Claude Code on Amazon.
Originally published in Japanese on Qiita.