You took a screenshot of an important chat conversation. You open it later and the text is a blurry, pixelated mess — compressed to death by WhatsApp or Messenger. You can barely make out what was said.
Or you snapped a photo of a receipt for expenses. Two weeks later the thermal print has faded so much that the total amount is unreadable.
We've all been there. The information is there — it's just not legible anymore.
Sure, you could throw the image into a generic AI upscaler. But here's the problem: those tools are trained on photos of faces, landscapes, and buildings. When they see text, they treat it like a texture and smear it. The letters come out looking melted, warped, or replaced with alien-looking symbols that aren't even real characters.
What you need is a text-specialized AI — one trained specifically on typography, handwriting, and printed letters. That's what we're building in this guide.
By the end of this article, you'll have a complete, production-ready implementation for an AI Text Enhancer that:
- Reconstructs blurry, pixelated letters into sharp, readable text
- Removes JPEG artifacts from compressed screenshots
- Restores contrast on faded receipt photos
- Works on handwritten notes, book pages, scanned documents, and more
- Runs in the browser with no install, no watermark, and no account
Let's build it.
Table of Contents
- Why Generic Enhancers Fail on Text
- System Architecture
- The API Contract
- Frontend: React Upload + Before/After Preview
- Backend: Python + FastAPI
- The Text Clarity Model
- End-to-End Flow
- Error Handling
- Optimization Strategy
- Training Your Own Text Clarity Model
- Real-World Use Cases
- Wrapping Up
1. Why Generic Enhancers Fail on Text
Before we write any code, it's worth understanding why we need a specialized model. Can't we just use Real-ESRGAN or Gemini Nano Banana and call it a day?
No. And here's why.
General-purpose image enhancers are trained on datasets like DIV2K — a collection of high-quality photographs. The loss functions optimize for perceptual quality across natural images: smooth skin tones, blue skies, green trees. Text is a tiny fraction of the training data, if it appears at all.
When these models encounter text, one of two things happens:
The letters melt. The model applies the same smoothing it uses for skin and skies. Sharp letter edges become soft blobs. Curves lose their definition. The text looks like it was left out in the rain.
The letters hallucinate. The model tries to "enhance" what it thinks is a texture pattern and generates new, fake character-like shapes that aren't real letters. You get alien text — shapes that look like writing but aren't readable in any language.
A text-specialized model solves this by being trained on text image pairs. It learns the manifold of real letters — the strokes, curves, serifs, and spacing that make text text. When it reconstructs a blurry letter, it pulls from a space of real character shapes, not a space of natural image textures.
This is the same principle behind OCR engines, but instead of outputting recognized text strings, we output a reconstructed image where the text is visually sharp and readable.
2. System Architecture
Here's the high-level flow:
[React Frontend]
↓ User uploads image (JPEG/PNG/WebP)
↓ Selects enhancement model (Text Clarity / Standard)
↓ POST /api/text/enhance
[FastAPI Backend]
↓ Validates file (size, format)
↓ Converts to PIL Image → resizes if > 2048px
↓ Loads text-clarity RCAN model (cached)
↓ Processes image in 256px tiles
↓ Applies post-processing (contrast boost / artifact removal)
↓ Returns enhanced PNG
[React Frontend]
↓ Renders before/after split comparison
↓ User downloads watermark-free result
The key design decisions:
- Tile-based inference — large images are processed in overlapping 256px tiles to avoid GPU OOM
- Model caching — the RCAN model loads once at startup, not per request
- Mode-specific post-processing — receipts get contrast boost, screenshots get artifact removal
- No watermark — the output is clean PNG, no branding overlaid
3. The API Contract
One endpoint. Simple.
POST /api/text/enhance
Content-Type: multipart/form-data
Parameters:
file: image file (JPEG, PNG, WebP) — max 5MB
model: enhancement mode — "text-clarity" | "standard" | "receipt" | "screenshot"
Response:
image/png binary
| Mode | What it does | When to use |
|---|---|---|
| text-clarity | Reconstructs blurry/pixelated letters | Default: screenshots, book pages, documents |
| standard | General image enhancement | Photos of people, landscapes (not text) |
| receipt | Text Clarity + contrast boost | Faded thermal receipts, invoices |
| screenshot | Text Clarity + JPEG artifact removal | WhatsApp/Messenger compressed screenshots |
4. Frontend: React Upload + Before/After Preview
The frontend handles four things: file upload, model selection, API call, and before/after display.
The Component
import { useState, useRef } from "react";
export default function TextEnhancer() {
const [image, setImage] = useState(null);
const [model, setModel] = useState("text-clarity");
const [resultUrl, setResultUrl] = useState(null);
const [loading, setLoading] = useState(false);
const [error, setError] = useState(null);
const fileInputRef = useRef(null);
function handleFileChange(e) {
const file = e.target.files[0];
if (!file) return;
if (file.size > 5 * 1024 * 1024) {
setError("File too large. Max 5MB.");
return;
}
setImage(file);
setError(null);
}
async function enhanceText() {
if (!image) return;
setLoading(true);
setError(null);
const form = new FormData();
form.append("file", image);
form.append("model", model);
try {
const res = await fetch("/api/text/enhance", {
method: "POST",
body: form,
});
if (!res.ok) {
const err = await res.json();
throw new Error(err.detail || "Enhancement failed");
}
const blob = await res.blob();
setResultUrl(URL.createObjectURL(blob));
} catch (e) {
setError(e.message);
} finally {
setLoading(false);
}
}
return (
<div className="text-enhancer">
{/* Drop zone */}
<div
className="drop-zone"
onDragOver={(e) => e.preventDefault()}
onDrop={(e) => {
e.preventDefault();
handleFileChange({ target: { files: e.dataTransfer.files } });
}}
onClick={() => fileInputRef.current?.click()}
>
<input
ref={fileInputRef}
type="file"
accept="image/jpeg,image/png,image/webp"
onChange={handleFileChange}
hidden
/>
<p>Drop your blurry text image here</p>
<p className="hint">JPEG, PNG, WebP — max 5MB</p>
</div>
{/* Model selector */}
<div className="model-selector">
<label>
<input
type="radio"
value="text-clarity"
checked={model === "text-clarity"}
onChange={(e) => setModel(e.target.value)}
/>
Text Clarity (recommended for text)
</label>
<label>
<input
type="radio"
value="standard"
checked={model === "standard"}
onChange={(e) => setModel(e.target.value)}
/>
Standard (for photos)
</label>
</div>
{/* Enhance button */}
<button onClick={enhanceText} disabled={loading || !image}>
{loading ? "Enhancing text..." : "Enhance Text"}
</button>
{/* Error display */}
{error && <div className="error">{error}</div>}
{/* Before / After result */}
{resultUrl && (
<div className="result">
<img src={URL.createObjectURL(image)} alt="Before enhancement" />
<img src={resultUrl} alt="After AI text enhancement" />
<a href={resultUrl} download="enhanced-text.png">
Download
</a>
</div>
)}
</div>
);
}
What the frontend is responsible for
| Task | How |
|---|---|
| Upload image | Drag & drop zone + hidden file input |
| Validate file | Check size (5MB max); the input's accept attribute restricts picker formats |
| Send to backend | FormData POST with file + model |
| Show result | Side-by-side before/after images |
| Download | Blob URL, no watermark added |
5. Backend: Python + FastAPI
The backend does the heavy lifting: validation, preprocessing, model inference, and post-processing.
Project Structure
backend/
  main.py                    # FastAPI app + endpoint
  enhancer/
    text_modes.py            # Mode configurations
    text_model_client.py     # Model loading + inference
    image_utils.py           # PIL preprocessing utilities
Install Dependencies
pip install fastapi uvicorn python-multipart pillow torch torchvision
- Pillow — image preprocessing and format validation
- torch + torchvision — model inference
- python-multipart — multipart form data parsing for file uploads
Mode Configuration
enhancer/text_modes.py
TEXT_MODES = {
"text-clarity": {
"instruction": "Reconstruct blurry and pixelated letters into sharp, readable text. Preserve the background.",
"model": "text-clarity-rcan",
"tile_size": 256,
"scale": 2,
},
"standard": {
"instruction": "General image enhancement. Improve quality and resolution.",
"model": "real-esrgan-general",
"tile_size": 512,
"scale": 4,
},
"receipt": {
"instruction": "Restore faded thermal print. Maximize contrast between text and paper background.",
"model": "text-clarity-rcan",
"tile_size": 256,
"scale": 2,
"post_process": "contrast_boost",
},
"screenshot": {
"instruction": "Remove JPEG artifacts around text in compressed screenshots. Rebuild letter shapes.",
"model": "text-clarity-rcan",
"tile_size": 256,
"scale": 2,
"post_process": "artifact_removal",
},
}
Each mode specifies which model to use, the tile size for inference, the upscale factor, and an optional post-processing step.
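One detail worth knowing: the enhancement function later in this guide looks modes up with `TEXT_MODES.get(mode, TEXT_MODES["text-clarity"])`, so an unknown or misspelled mode silently falls back to the default. A minimal sketch of a stricter lookup, assuming you would rather reject typos than guess. The `resolve_mode` helper and the trimmed copy of `TEXT_MODES` are illustrative and not part of the original code:

```python
# Trimmed, illustrative copy of the mode table so the sketch runs on its own.
TEXT_MODES = {
    "text-clarity": {"model": "text-clarity-rcan", "tile_size": 256, "scale": 2},
    "standard": {"model": "real-esrgan-general", "tile_size": 512, "scale": 4},
    "receipt": {"model": "text-clarity-rcan", "tile_size": 256, "scale": 2,
                "post_process": "contrast_boost"},
    "screenshot": {"model": "text-clarity-rcan", "tile_size": 256, "scale": 2,
                   "post_process": "artifact_removal"},
}


def resolve_mode(mode: str) -> dict:
    """Return the config for a mode, raising ValueError on unknown names.

    Hypothetical helper: the guide's code falls back to "text-clarity" instead.
    """
    try:
        return TEXT_MODES[mode]
    except KeyError:
        valid = ", ".join(sorted(TEXT_MODES))
        raise ValueError(f"Unknown mode {mode!r}. Valid modes: {valid}") from None
```

In the endpoint, a `ValueError` from this helper could be turned into a 400 response so the client learns which modes exist.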
Image Preprocessing Utilities
enhancer/image_utils.py
from PIL import Image, ImageEnhance, ImageFilter
import io

MAX_DIMENSION = 2048

def load_image_as_pil(bytes_data):
    """Load raw bytes into a PIL Image (RGB)."""
    return Image.open(io.BytesIO(bytes_data)).convert("RGB")

def to_bytes(pil_img, format="PNG"):
    """Convert PIL Image to raw bytes."""
    output = io.BytesIO()
    pil_img.save(output, format=format)
    return output.getvalue()

def resize_if_needed(pil_img, max_dim=MAX_DIMENSION):
    """Resize image if any dimension exceeds max_dim to keep inference fast."""
    w, h = pil_img.size
    if max(w, h) > max_dim:
        ratio = max_dim / max(w, h)
        new_size = (int(w * ratio), int(h * ratio))
        return pil_img.resize(new_size, Image.LANCZOS)
    return pil_img

def boost_contrast(pil_img, factor=1.4):
    """Post-process: boost contrast for faded receipt text."""
    enhancer = ImageEnhance.Contrast(pil_img)
    return enhancer.enhance(factor)

def remove_jpeg_artifacts(pil_img):
    """Post-process: light median filter to clean JPEG blocks around text."""
    return pil_img.filter(ImageFilter.MedianFilter(size=3))
The Model Client
enhancer/text_model_client.py
This is where the magic happens. The model is loaded once at startup and cached. Inference runs in tiles to handle large images without OOM.
import torch
import io
from PIL import Image
from .text_modes import TEXT_MODES
from .image_utils import resize_if_needed, boost_contrast, remove_jpeg_artifacts

_model_cache = {}

def get_model(model_name: str):
    """Load and cache model; avoids a 3-5s reload per request."""
    if model_name not in _model_cache:
        if model_name == "text-clarity-rcan":
            model = load_text_clarity_model()
        else:
            # load_general_model() is not shown in this guide; it is expected
            # to return the general-purpose (Real-ESRGAN) model.
            model = load_general_model()
        _model_cache[model_name] = model
    return _model_cache[model_name]

def load_text_clarity_model():
    """
    Load the text-clarity RCAN model.
    Fine-tuned on pairs of:
    - Blurry/pixelated text images (input)
    - Sharp, readable text images (target)
    Training data: screenshots, receipts, book pages,
    scanned documents, handwritten notes.
    """
    from models.rcan import RCAN
    model = RCAN(scale=2, n_resgroups=10, n_resblocks=6)
    model.load_state_dict(
        torch.load("weights/text-clarity-rcan.pth", map_location="cpu")
    )
    model.eval()
    return model

def enhance_text_with_ai(img_bytes: bytes, mode: str):
    # Unknown modes fall back to "text-clarity"
    config = TEXT_MODES.get(mode, TEXT_MODES["text-clarity"])
    pil_img = Image.open(io.BytesIO(img_bytes)).convert("RGB")
    pil_img = resize_if_needed(pil_img)
    model = get_model(config["model"])
    # pil_to_tensor / tensor_to_pil are not defined in this guide; they are
    # assumed to convert between PIL images and (1, 3, H, W) float tensors.
    img_tensor = pil_to_tensor(pil_img)
    with torch.no_grad():
        enhanced_tensor = process_with_tiling(
            model, img_tensor, tile_size=config["tile_size"]
        )
    enhanced_pil = tensor_to_pil(enhanced_tensor)
    # Mode-specific post-processing
    post = config.get("post_process")
    if post == "contrast_boost":
        enhanced_pil = boost_contrast(enhanced_pil, factor=1.4)
    elif post == "artifact_removal":
        enhanced_pil = remove_jpeg_artifacts(enhanced_pil)
    output = io.BytesIO()
    enhanced_pil.save(output, format="PNG")
    return output.getvalue()
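def process_with_tiling(model, tensor, tile_size=256):
    """
    Process large images in overlapping tiles to avoid OOM.
    Tiles overlap by 1/8 of the tile size; overlapping output pixels are averaged.
    """
    _, _, h, w = tensor.shape
    if h <= tile_size and w <= tile_size:
        return model(tensor)
    overlap = tile_size // 8
    stride = tile_size - overlap
    output = weight = None
    for y in range(0, h, stride):
        for x in range(0, w, stride):
            y_end = min(y + tile_size, h)
            x_end = min(x + tile_size, w)
            enhanced_tile = model(tensor[:, :, y:y_end, x:x_end])
            if output is None:
                # The model upscales, so size the output canvas from the first
                # tile's actual scale factor instead of the input size.
                scale = enhanced_tile.shape[-1] // (x_end - x)
                b, c = enhanced_tile.shape[:2]
                output = enhanced_tile.new_zeros((b, c, h * scale, w * scale))
                weight = enhanced_tile.new_zeros((1, 1, h * scale, w * scale))
            ys, xs = y * scale, x * scale
            output[:, :, ys:y_end * scale, xs:x_end * scale] += enhanced_tile
            weight[:, :, ys:y_end * scale, xs:x_end * scale] += 1
    # Every pixel is covered by at least one tile, so weight is never zero.
    return output / weight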
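The `pil_to_tensor` and `tensor_to_pil` helpers called in `enhance_text_with_ai` are not shown in the guide. A minimal sketch of what they could look like, assuming the model expects a float tensor of shape (1, 3, H, W) with values in [0, 1]. The value range and channel layout are assumptions; match whatever your checkpoint was trained with:

```python
import numpy as np
import torch
from PIL import Image


def pil_to_tensor(pil_img: Image.Image) -> torch.Tensor:
    """RGB PIL image -> float32 tensor of shape (1, 3, H, W) in [0, 1]."""
    arr = np.asarray(pil_img.convert("RGB"), dtype=np.float32) / 255.0  # (H, W, 3)
    return torch.from_numpy(arr).permute(2, 0, 1).unsqueeze(0).contiguous()


def tensor_to_pil(tensor: torch.Tensor) -> Image.Image:
    """(1, 3, H, W) float tensor -> RGB PIL image, clamping to [0, 1] first."""
    arr = tensor.squeeze(0).clamp(0.0, 1.0).permute(1, 2, 0).cpu().numpy()
    return Image.fromarray((arr * 255.0).round().astype(np.uint8))
```

Clamping before the conversion matters because a super-resolution model can overshoot slightly past 0 or 1 around sharp edges, and casting those values straight to `uint8` would wrap around.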
The FastAPI Endpoint
main.py
import io

from fastapi import FastAPI, File, Form, UploadFile, Response, HTTPException
from PIL import Image

from enhancer.text_model_client import enhance_text_with_ai

app = FastAPI()

ALLOWED_FORMATS = {"JPEG", "PNG", "WEBP"}
MAX_FILE_SIZE = 5 * 1024 * 1024  # 5MB

@app.post("/api/text/enhance")
async def enhance_text_image(
    file: UploadFile = File(...),
    model: str = Form("text-clarity")
):
    if not file:
        raise HTTPException(400, "Image file is required")

    img_bytes = await file.read()
    if len(img_bytes) > MAX_FILE_SIZE:
        raise HTTPException(400, "File too large. Max 5MB.")

    # Validate image format. Read .format from the freshly opened image:
    # load_image_as_pil() calls convert("RGB"), which returns a copy whose
    # .format is None, so checking it there would reject every upload.
    try:
        detected_format = Image.open(io.BytesIO(img_bytes)).format
    except Exception:
        raise HTTPException(400, "Invalid image file")
    if detected_format not in ALLOWED_FORMATS:
        raise HTTPException(400, f"Unsupported format: {detected_format}")

    # Run enhancement
    try:
        result_bytes = enhance_text_with_ai(img_bytes, model)
    except Exception as e:
        raise HTTPException(500, f"Enhancement failed: {str(e)}")

    return Response(content=result_bytes, media_type="image/png")
6. The Text Clarity Model
The core of the system is the text-clarity RCAN model. RCAN (Residual Channel Attention Network) is a super-resolution architecture that uses channel attention to focus on the most important feature channels — which, for text, are the ones that encode stroke edges and character shapes.
Why RCAN over Real-ESRGAN?
| Feature | Real-ESRGAN (general) | RCAN (text-tuned) |
|---|---|---|
| Training data | Natural photos (DIV2K) | Text image pairs |
| Letter sharpness | Low — letters melt | High — letters stay crisp |
| Background preservation | Good | Good |
| Artifact handling | Adds artifacts to text | Removes artifacts from text |
| Inference speed | Slower (4x scale) | Faster (2x scale) |
The text-clarity RCAN is fine-tuned from a base RCAN checkpoint using paired text images. The base model already understands edges and textures; the fine-tuning shifts its output space toward real letter shapes.
Model loading strategy
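The model is loaded once per process and cached in _model_cache. As written, get_model loads lazily, so the first request that needs a model pays the load cost; call get_model("text-clarity-rcan") during application startup if you want it ready before traffic arrives. The caching itself is critical: loading a PyTorch model from disk takes 3–5 seconds, and paying that on every request would make the API unusably slow.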
7. End-to-End Flow
Here's what happens when a user clicks "Enhance Text":
1. User drops a blurry screenshot onto the upload zone
2. React validates file size (≤5MB); the file picker's accept attribute limits formats to JPEG/PNG/WebP
3. React POSTs FormData (file + model) to /api/text/enhance
4. FastAPI receives the file:
a. Validates size and format with PIL
b. Converts to PIL Image, resizes if > 2048px
c. Looks up model config from TEXT_MODES
d. Gets cached RCAN model from _model_cache
e. Converts PIL → tensor
f. Processes in 256px overlapping tiles
g. Converts tensor → PIL
h. Applies post-processing (contrast boost / artifact removal)
i. Saves as PNG bytes
5. FastAPI returns PNG binary response
6. React creates a Blob URL from the response
7. React renders before/after images side by side
8. User clicks Download → gets watermark-free PNG
Total processing time: under 15 seconds for a typical phone screenshot.
8. Error Handling
Production-grade error handling covers five scenarios:
No file uploaded
if not file:
    raise HTTPException(400, "Image file is required")
File too large
if len(img_bytes) > MAX_FILE_SIZE:
    raise HTTPException(400, "File too large. Max 5MB.")
Unsupported format
# Read .format before any convert("RGB"); converted copies have format None.
detected_format = Image.open(io.BytesIO(img_bytes)).format
if detected_format not in ALLOWED_FORMATS:
    raise HTTPException(400, f"Unsupported format: {detected_format}")
Model inference failure
try:
    result_bytes = enhance_text_with_ai(img_bytes, model)
except torch.cuda.OutOfMemoryError:  # requires `import torch` in main.py
    raise HTTPException(503, "Image too complex. Try a smaller crop.")
except Exception as e:
    raise HTTPException(500, f"Enhancement failed: {str(e)}")
Frontend timeout (30s abort)
const controller = new AbortController();
const timeout = setTimeout(() => controller.abort(), 30000);
const res = await fetch("/api/text/enhance", {
method: "POST",
body: form,
signal: controller.signal,
});
clearTimeout(timeout);
9. Optimization Strategy
Tile-based inference
Large images are split into overlapping 256px tiles. This lets the model run on GPU-constrained servers without crashing on big screenshots. Tiles overlap by 1/8 (32px) to blend edges seamlessly — you should not see any seam lines in the output.
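To make the tiling concrete, this sketch lists the tile windows along one axis for a given image size, using the same stride as `process_with_tiling` (tile size minus overlap). `tile_windows` is an illustrative helper name, not part of the guide's code:

```python
def tile_windows(length: int, tile_size: int = 256):
    """Return (start, end) windows along one axis, overlapping by tile_size // 8."""
    overlap = tile_size // 8      # 32px for 256px tiles
    stride = tile_size - overlap  # 224px between tile starts
    return [
        (start, min(start + tile_size, length))
        for start in range(0, length, stride)
    ]


print(tile_windows(600))  # prints: [(0, 256), (224, 480), (448, 600)]
```

Each neighbouring pair of windows shares a 32px band, and the final window is simply cut off at the image edge.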
Model caching
The RCAN model is loaded once at startup and reused across requests. Loading the model per request would add 3–5 seconds of overhead to every single API call.
Image pre-resizing
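Images whose longest side exceeds 2048px are resized before inference. Inference cost scales with pixel count, so halving each side (for example, 4096px down to 2048px) cuts the work by roughly 4x. The quality loss is minimal because the model upscales the text back during enhancement anyway.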
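A quick check of the arithmetic, mirroring `resize_if_needed` without needing Pillow (`resized_dims` is an illustrative name). A 4096×3072 photo shrinks to 2048×1536, a quarter of the pixels; inputs closer to the limit save less, and images within it are untouched:

```python
def resized_dims(w: int, h: int, max_dim: int = 2048):
    """Mirror resize_if_needed: scale so the longest side equals max_dim."""
    longest = max(w, h)
    if longest <= max_dim:
        return w, h
    ratio = max_dim / longest
    return int(w * ratio), int(h * ratio)


w, h = 4096, 3072
new_w, new_h = resized_dims(w, h)
print((new_w, new_h), (w * h) / (new_w * new_h))  # prints: (2048, 1536) 4.0
```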
Post-processing by mode
| Mode | Post-processing | Why |
|---|---|---|
| text-clarity | None | Model output is already clean |
| receipt | Contrast boost (1.4x) | Faded thermal print needs extra contrast |
| screenshot | Median filter (3px) | Removes JPEG block artifacts |
Response caching (optional)
For repeated uploads of the same image, hash the file + mode and cache the result:
import hashlib

def get_cache_key(img_bytes, mode):
    return hashlib.md5(img_bytes + mode.encode()).hexdigest()
Cache in Redis for 1 hour. This is optional — most users enhance each image once.
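A minimal in-process sketch of the same idea, assuming a single worker and using a plain dict with expiry times in place of Redis. `ResultCache` is a hypothetical helper, and SHA-256 replaces MD5 here only out of habit; either works for a cache key:

```python
import hashlib
import time


def get_cache_key(img_bytes: bytes, mode: str) -> str:
    """Hash mode and image bytes together; the separator avoids ambiguous joins."""
    digest = hashlib.sha256()
    digest.update(mode.encode("utf-8"))
    digest.update(b"\x00")
    digest.update(img_bytes)
    return digest.hexdigest()


class ResultCache:
    """Tiny TTL cache: stores (expiry_time, value) per key, drops expired entries on read."""

    def __init__(self, ttl_seconds: float = 3600.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}

    def get(self, key: str):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]
            return None
        return value

    def put(self, key: str, value: bytes) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)
```

With several worker processes each one would hold its own copy, which is the main reason to reach for a shared store like Redis instead.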
10. Training Your Own Text Clarity Model
This is the part that makes or breaks the tool. The model is only as good as its training data.
Building the training pairs
You need pairs of (blurry input, sharp target) images. Here's how to build them:
| Input (blurry) | Target (sharp) |
|---|---|
| Screenshot compressed by WhatsApp | Original uncompressed screenshot |
| Photo of faded receipt | High-contrast scan of same receipt |
| Out-of-focus book page photo | In-focus photo of same page |
| Pixelated scanned document | Clean scan at native resolution |
| Blurred handwritten note | Sharp photo of same note |
Data augmentation: simulating degradation
You don't need thousands of real blurry photos. Take sharp text images and degrade them synthetically:
from PIL import Image, ImageFilter, ImageEnhance
import random
import io

def degrade_image(pil_img):
    """Simulate real-world text image degradation."""
    # JPEG has no alpha channel, so normalize to RGB first
    pil_img = pil_img.convert("RGB")
    # Random Gaussian blur (1-4px)
    blur_radius = random.uniform(1, 4)
    pil_img = pil_img.filter(ImageFilter.GaussianBlur(radius=blur_radius))
    # Random JPEG compression (quality 20-50)
    quality = random.randint(20, 50)
    buffer = io.BytesIO()
    pil_img.save(buffer, format="JPEG", quality=quality)
    buffer.seek(0)
    pil_img = Image.open(buffer).convert("RGB")  # force decoding now
    # Random contrast reduction
    if random.random() > 0.5:
        enhancer = ImageEnhance.Contrast(pil_img)
        pil_img = enhancer.enhance(random.uniform(0.5, 0.8))
    return pil_img
This covers the three most common text degradation patterns: blur, compression, and contrast loss.
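One gap worth flagging: the model in this guide upscales 2x, so each training input generally needs to be half the resolution of its target, and `degrade_image` on its own keeps the size unchanged. A sketch of building such a pair, assuming a 2x model and bicubic downsampling. Both `make_training_pair` and the crop-to-even-size step are illustrative choices, not the author's pipeline:

```python
from PIL import Image


def make_training_pair(sharp_img: Image.Image, scale: int = 2):
    """Return (low_res_input, high_res_target) for a `scale`x super-resolution model."""
    sharp_img = sharp_img.convert("RGB")
    w, h = sharp_img.size
    # Crop so both sides divide evenly by the scale factor.
    w, h = w - w % scale, h - h % scale
    target = sharp_img.crop((0, 0, w, h))
    low_res = target.resize((w // scale, h // scale), Image.BICUBIC)
    return low_res, target
```

The low-resolution image can then be passed through `degrade_image` to add blur, compression, and contrast loss on top of the downsampling.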
Loss function
Don't just use L2 loss — it produces blurry outputs. Use a weighted combination:
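import torch.nn as nn

class TextEnhancementLoss(nn.Module):
    def __init__(self, ssim_fn, perceptual_fn):
        """
        ssim_fn(pred, target) -> SSIM score in [0, 1] (higher is better).
        perceptual_fn(pred, target) -> feature-space distance (e.g. VGG-based).
        Neither is defined in this guide; pass in implementations of your choice.
        """
        super().__init__()
        self.l1 = nn.L1Loss()
        self.ssim = ssim_fn
        self.perceptual = perceptual_fn
        self.ssim_weight = 0.2  # Structural similarity
        self.l1_weight = 0.7  # Pixel accuracy for letter shapes
        self.perceptual_weight = 0.1  # VGG features for visual quality

    def forward(self, pred, target):
        l1_loss = self.l1(pred, target)
        ssim_loss = 1 - self.ssim(pred, target)
        perceptual_loss = self.perceptual(pred, target)
        return (
            self.l1_weight * l1_loss
            + self.ssim_weight * ssim_loss
            + self.perceptual_weight * perceptual_loss
        )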
- L1 loss (70%): ensures pixel-level accuracy for letter shapes. L1 tends to give sharper edges than L2 because L2's quadratic penalty pushes the model toward the average of plausible outputs, which shows up as blur.
- SSIM loss (20%) — structural similarity preserves stroke continuity. A letter "a" should look like an "a", not a blob.
- Perceptual loss (10%) — VGG feature matching for overall visual quality. Keeps the output looking natural, not artificially sharpened.
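Written out, the combined objective from the class above is, with $\hat{y}$ the prediction, $y$ the target, and $N$ the number of pixel values:

```latex
\mathcal{L}(\hat{y}, y)
  = 0.7 \cdot \tfrac{1}{N}\lVert \hat{y} - y \rVert_1
  + 0.2 \cdot \bigl(1 - \mathrm{SSIM}(\hat{y}, y)\bigr)
  + 0.1 \cdot \mathcal{L}_{\mathrm{perc}}(\hat{y}, y)
```

As a worked example with made-up values, a mean absolute error of 0.05, an SSIM of 0.9, and a perceptual distance of 0.3 give 0.7 × 0.05 + 0.2 × 0.1 + 0.1 × 0.3 = 0.035 + 0.02 + 0.03 = 0.085.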
11. Real-World Use Cases
This text enhancer works on a wide range of real-world scenarios:
| Use case | Example | Best mode |
|---|---|---|
| Chat screenshots | WhatsApp, Messenger, iMessage compressed text | screenshot |
| Receipt photos | Faded thermal print on crumpled paper | receipt |
| Book page photos | Small print in photos of old books | text-clarity |
| Scanned documents | Invoices, contracts, IDs converted to images | text-clarity |
| Handwritten notes | Photos of handwritten letters and forms | text-clarity |
| Memes | Compressed meme text with JPEG artifacts | screenshot |
| Product labels | Tiny text on packaging and ingredient lists | text-clarity |
| Traffic signs | Distance shots of road and store signs | text-clarity |
12. Wrapping Up
We built a complete AI text enhancer that:
- Frontend: React with drag & drop upload, model selector, and before/after preview
- Backend: Python + FastAPI with file validation, tile-based inference, and mode-specific post-processing
- Model: RCAN fine-tuned on text image pairs with L1 + SSIM + perceptual loss
- Production-ready: Error handling, model caching, image pre-resizing, and response caching
The key takeaway from this guide: don't use a general image enhancer for text. The training data matters more than the architecture. A small RCAN fine-tuned on text pairs will outperform a massive Real-ESRGAN trained on photos — every single time.
If you want to try the finished product, check out the AI Text Enhancer — it's free, no watermark, no account needed, and runs in your browser.
Found this guide helpful? Have questions about the implementation? Drop a comment below — I'm happy to help.