Daniyal khan

Posted on Jul 3

AI Text Enhancer – Full Technical Implementation Guide to Deblur Text in Images

#ai #machinelearning #programming #tutorial

You took a screenshot of an important chat conversation. You open it later and the text is a blurry, pixelated mess — compressed to death by WhatsApp or Messenger. You can barely make out what was said.

Or you snapped a photo of a receipt for expenses. Two weeks later the thermal print has faded so much that the total amount is unreadable.

We've all been there. The information is there — it's just not legible anymore.

Sure, you could throw the image into a generic AI upscaler. But here's the problem: those tools are trained on photos of faces, landscapes, and buildings. When they see text, they treat it like a texture and smear it. The letters come out looking melted, warped, or replaced with alien-looking symbols that aren't even real characters.

What you need is a text-specialized AI — one trained specifically on typography, handwriting, and printed letters. That's what we're building in this guide.

By the end of this article, you'll have a complete, production-ready implementation for an AI Text Enhancer that:

Reconstructs blurry, pixelated letters into sharp, readable text
Removes JPEG artifacts from compressed screenshots
Restores contrast on faded receipt photos
Works on handwritten notes, book pages, scanned documents, and more
Works from the browser with no install, no watermark, and no account (inference runs on the backend)

Let's build it.

Why Generic Enhancers Fail on Text
System Architecture
The API Contract
Frontend: React Upload + Before/After Preview
Backend: Python + FastAPI
The Text Clarity Model
End-to-End Flow
Error Handling
Optimization Strategy
Training Your Own Text Clarity Model
Real-World Use Cases
Wrapping Up

1. Why Generic Enhancers Fail on Text

Before we write any code, it's worth understanding why we need a specialized model. Can't we just use Real-ESRGAN or Gemini Nano Banana and call it a day?

No. And here's why.

General-purpose image enhancers are trained on datasets like DIV2K — a collection of high-quality photographs. The loss functions optimize for perceptual quality across natural images: smooth skin tones, blue skies, green trees. Text is a tiny fraction of the training data, if it appears at all.

When these models encounter text, one of two things happens:

The letters melt. The model applies the same smoothing it uses for skin and skies. Sharp letter edges become soft blobs. Curves lose their definition. The text looks like it was left out in the rain.
The letters hallucinate. The model tries to "enhance" what it thinks is a texture pattern and generates new, fake character-like shapes that aren't real letters. You get alien text — shapes that look like writing but aren't readable in any language.

A text-specialized model solves this by being trained on text image pairs. It learns the manifold of real letters — the strokes, curves, serifs, and spacing that make text text. When it reconstructs a blurry letter, it pulls from a space of real character shapes, not a space of natural image textures.

This is the same principle behind OCR engines, but instead of outputting recognized text strings, we output a reconstructed image where the text is visually sharp and readable.

2. System Architecture

Here's the high-level flow:

[React Frontend]
  ↓ User uploads image (JPEG/PNG/WebP)
  ↓ Selects enhancement model (Text Clarity / Standard)
  ↓ POST /api/text/enhance

[FastAPI Backend]
  ↓ Validates file (size, format)
  ↓ Converts to PIL Image → resizes if > 2048px
  ↓ Loads text-clarity RCAN model (cached)
  ↓ Processes image in 256px tiles
  ↓ Applies post-processing (contrast boost / artifact removal)
  ↓ Returns enhanced PNG

[React Frontend]
  ↓ Renders before/after split comparison
  ↓ User downloads watermark-free result

The key design decisions:

Tile-based inference — large images are processed in overlapping 256px tiles to avoid GPU OOM
Model caching — the RCAN model loads once at startup, not per request
Mode-specific post-processing — receipts get contrast boost, screenshots get artifact removal
No watermark — the output is clean PNG, no branding overlaid
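
To make the tiling decision concrete, here is a minimal sketch of how the tile boxes could be laid out for an image. The function name tile_boxes, the 32px default overlap, and the rule of stopping as soon as a tile reaches the image edge are assumptions for illustration, not the backend code shown later.

```python
def tile_boxes(width, height, tile_size=256, overlap=32):
    """Return (left, top, right, bottom) boxes that cover the image.

    Neighbouring tiles share `overlap` pixels. Tiles on the right and
    bottom edges are clipped to the image size.
    """
    if overlap >= tile_size:
        raise ValueError("overlap must be smaller than tile_size")
    stride = tile_size - overlap

    def starts(length):
        # Step by `stride` until a tile reaches the far edge.
        positions = []
        pos = 0
        while True:
            positions.append(pos)
            if pos + tile_size >= length:
                return positions
            pos += stride

    return [
        (x, y, min(x + tile_size, width), min(y + tile_size, height))
        for y in starts(height)
        for x in starts(width)
    ]


boxes = tile_boxes(600, 300)
print(len(boxes))
print(boxes[0], boxes[-1])
```

An image that already fits inside one tile yields a single box, so small crops skip the tiling overhead entirely.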

3. The API Contract

One endpoint. Simple.

POST /api/text/enhance
Content-Type: multipart/form-data

Parameters:
  file:   image file (JPEG, PNG, WebP) — max 5MB
  model:  enhancement mode — "text-clarity" | "standard" | "receipt" | "screenshot"

Response:
  image/png binary

Mode	What it does	When to use
`text-clarity`	Reconstructs blurry/pixelated letters	Default — screenshots, book pages, documents
`standard`	General image enhancement	Photos of people, landscapes (not text)
`receipt`	Text Clarity + contrast boost	Faded thermal receipts, invoices
`screenshot`	Text Clarity + JPEG artifact removal	WhatsApp/Messenger compressed screenshots
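
For testing the contract without the React frontend, a request can be assembled with the Python standard library alone. This is a sketch under assumptions: the helper names build_multipart and build_enhance_request, the input.png filename, and the localhost address are all illustrative and not part of the original API description.

```python
import urllib.request
import uuid


def build_multipart(fields, file_field, filename, file_bytes, content_type):
    """Encode text fields plus one file as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            (
                f"--{boundary}\r\n"
                f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
                f"{value}\r\n"
            ).encode()
        )
    parts.append(
        (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{file_field}"; '
            f'filename="{filename}"\r\n'
            f"Content-Type: {content_type}\r\n\r\n"
        ).encode()
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_enhance_request(base_url, image_bytes, mode="text-clarity"):
    """Build (but do not send) the POST request for /api/text/enhance."""
    body, header = build_multipart(
        {"model": mode}, "file", "input.png", image_bytes, "image/png"
    )
    return urllib.request.Request(
        base_url + "/api/text/enhance",
        data=body,
        headers={"Content-Type": header},
        method="POST",
    )


# Sending it needs a running backend; http://localhost:8000 is an assumed address.
# with urllib.request.urlopen(build_enhance_request("http://localhost:8000", data)) as resp:
#     png_bytes = resp.read()
```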

4. Frontend: React Upload + Before/After Preview

The frontend handles four things: file upload, model selection, API call, and before/after display.

The Component

import { useState, useRef } from "react";

export default function TextEnhancer() {
  const [image, setImage] = useState(null);
  const [model, setModel] = useState("text-clarity");
  const [resultUrl, setResultUrl] = useState(null);
  const [loading, setLoading] = useState(false);
  const [error, setError] = useState(null);
  const fileInputRef = useRef(null);

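  function handleFileChange(e) {
    const file = e.target.files[0];
    if (!file) return;
    // Dropped files bypass the input's accept attribute, so check the type here too
    if (!["image/jpeg", "image/png", "image/webp"].includes(file.type)) {
      setError("Unsupported format. Use JPEG, PNG, or WebP.");
      return;
    }
    if (file.size > 5 * 1024 * 1024) {
      setError("File too large. Max 5MB.");
      return;
    }
    setImage(file);
    setResultUrl(null); // clear the previous result so it is not shown next to a new file
    setError(null);
  }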

  async function enhanceText() {
    if (!image) return;
    setLoading(true);
    setError(null);

    const form = new FormData();
    form.append("file", image);
    form.append("model", model);

    try {
      const res = await fetch("/api/text/enhance", {
        method: "POST",
        body: form,
      });

      if (!res.ok) {
        const err = await res.json();
        throw new Error(err.detail || "Enhancement failed");
      }

      const blob = await res.blob();
      setResultUrl(URL.createObjectURL(blob));
    } catch (e) {
      setError(e.message);
    } finally {
      setLoading(false);
    }
  }

  return (
    <div className="text-enhancer">
      {/* Drop zone */}
      <div
        className="drop-zone"
        onDragOver={(e) => e.preventDefault()}
        onDrop={(e) => {
          e.preventDefault();
          handleFileChange({ target: { files: e.dataTransfer.files } });
        }}
        onClick={() => fileInputRef.current?.click()}
      >
        <input
          ref={fileInputRef}
          type="file"
          accept="image/jpeg,image/png,image/webp"
          onChange={handleFileChange}
          hidden
        />
        <p>Drop your blurry text image here</p>
        <p className="hint">JPEG, PNG, WebP — max 5MB</p>
      </div>

      {/* Model selector */}
      <div className="model-selector">
        <label>
          <input
            type="radio"
            value="text-clarity"
            checked={model === "text-clarity"}
            onChange={(e) => setModel(e.target.value)}
          />
          Text Clarity (recommended for text)
        </label>
        <label>
          <input
            type="radio"
            value="standard"
            checked={model === "standard"}
            onChange={(e) => setModel(e.target.value)}
          />
          Standard (for photos)
        </label>
      </div>

      {/* Enhance button */}
      <button onClick={enhanceText} disabled={loading || !image}>
        {loading ? "Enhancing text..." : "Enhance Text"}
      </button>

      {/* Error display */}
      {error && <div className="error">{error}</div>}

      {/* Before / After result */}
      {resultUrl && (
        <div className="result">
          <img src={URL.createObjectURL(image)} alt="Before enhancement" />
          <img src={resultUrl} alt="After AI text enhancement" />
          <a href={resultUrl} download="enhanced-text.png">
            Download
          </a>
        </div>
      )}
    </div>
  );
}

What the frontend is responsible for

Task	How
Upload image	Drag & drop zone + hidden file input
Validate file	Check size (5MB max) and format
Send to backend	FormData POST with file + model
Show result	Side-by-side before/after images
Download	Blob URL, no watermark added

5. Backend: Python + FastAPI

The backend does the heavy lifting: validation, preprocessing, model inference, and post-processing.

Project Structure

backend/
  main.py                  # FastAPI app + endpoint
  enhancer/
    text_modes.py          # Mode configurations
    text_model_client.py   # Model loading + inference
    image_utils.py         # PIL preprocessing utilities

Install Dependencies

pip install fastapi uvicorn python-multipart pillow torch torchvision

Pillow — image preprocessing and format validation
torch + torchvision — model inference
python-multipart — multipart form data parsing for file uploads

Mode Configuration

enhancer/text_modes.py

TEXT_MODES = {
    "text-clarity": {
        "instruction": "Reconstruct blurry and pixelated letters into sharp, readable text. Preserve the background.",
        "model": "text-clarity-rcan",
        "tile_size": 256,
        "scale": 2,
    },
    "standard": {
        "instruction": "General image enhancement. Improve quality and resolution.",
        "model": "real-esrgan-general",
        "tile_size": 512,
        "scale": 4,
    },
    "receipt": {
        "instruction": "Restore faded thermal print. Maximize contrast between text and paper background.",
        "model": "text-clarity-rcan",
        "tile_size": 256,
        "scale": 2,
        "post_process": "contrast_boost",
    },
    "screenshot": {
        "instruction": "Remove JPEG artifacts around text in compressed screenshots. Rebuild letter shapes.",
        "model": "text-clarity-rcan",
        "tile_size": 256,
        "scale": 2,
        "post_process": "artifact_removal",
    },
}

Each mode specifies which model to use, the tile size for inference, the upscale factor, and an optional post-processing step.
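
Because some modes omit post_process, it can help to resolve a mode name into a complete config in one place and to reject unknown names explicitly. The sketch below is one possible approach; resolve_mode and DEFAULTS are hypothetical names, and the dictionary is a trimmed copy of the configuration above so the example runs on its own.

```python
DEFAULTS = {"post_process": None}

# Trimmed copy of the configuration above, so the sketch is self-contained.
TEXT_MODES = {
    "text-clarity": {"model": "text-clarity-rcan", "tile_size": 256, "scale": 2},
    "receipt": {
        "model": "text-clarity-rcan",
        "tile_size": 256,
        "scale": 2,
        "post_process": "contrast_boost",
    },
}


def resolve_mode(mode):
    """Return the full config for `mode`, or raise ValueError if it is unknown."""
    if mode not in TEXT_MODES:
        allowed = ", ".join(sorted(TEXT_MODES))
        raise ValueError(f"Unknown mode {mode!r}. Expected one of: {allowed}")
    # Mode-specific keys override the defaults.
    return {**DEFAULTS, **TEXT_MODES[mode]}


print(resolve_mode("receipt"))
```

Raising on an unknown mode is a design choice: the alternative, silently falling back to the default mode, hides typos in client code.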

Image Preprocessing Utilities

enhancer/image_utils.py

from PIL import Image, ImageEnhance, ImageFilter
import io

MAX_DIMENSION = 2048

def load_image_as_pil(bytes_data):
    """Load raw bytes into a PIL Image (RGB)."""
    return Image.open(io.BytesIO(bytes_data)).convert("RGB")

def to_bytes(pil_img, format="PNG"):
    """Convert PIL Image to raw bytes."""
    output = io.BytesIO()
    pil_img.save(output, format=format)
    return output.getvalue()

def resize_if_needed(pil_img, max_dim=MAX_DIMENSION):
    """Resize image if any dimension exceeds max_dim to keep inference fast."""
    w, h = pil_img.size
    if max(w, h) > max_dim:
        ratio = max_dim / max(w, h)
        new_size = (int(w * ratio), int(h * ratio))
        return pil_img.resize(new_size, Image.LANCZOS)
    return pil_img

def boost_contrast(pil_img, factor=1.4):
    """Post-process: boost contrast for faded receipt text."""
    enhancer = ImageEnhance.Contrast(pil_img)
    return enhancer.enhance(factor)

def remove_jpeg_artifacts(pil_img):
    """Post-process: light median filter to clean JPEG blocks around text."""
    return pil_img.filter(ImageFilter.MedianFilter(size=3))

The Model Client

enhancer/text_model_client.py

This is where the magic happens. The model is loaded on first use and cached. Inference runs in tiles to handle large images without OOM.

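import io

import torch
from PIL import Image
from torchvision.transforms.functional import to_pil_image, to_tensor

from .text_modes import TEXT_MODES
from .image_utils import resize_if_needed, boost_contrast, remove_jpeg_artifacts

_model_cache = {}

def pil_to_tensor(pil_img):
    """PIL RGB image -> float tensor of shape (1, 3, H, W) with values in [0, 1]."""
    # Assumes the model expects inputs in [0, 1]; adjust if yours was trained on 0-255.
    return to_tensor(pil_img).unsqueeze(0)

def tensor_to_pil(tensor):
    """Float tensor of shape (1, 3, H, W) -> PIL RGB image."""
    return to_pil_image(tensor.squeeze(0).clamp(0, 1).cpu())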

def get_model(model_name: str):
    """Load and cache model — avoids 3-5s reload per request."""
    if model_name not in _model_cache:
        if model_name == "text-clarity-rcan":
            model = load_text_clarity_model()
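        else:
            # load_general_model() is not defined in this guide; supply your own
            # loader for the "real-esrgan-general" weights.
            model = load_general_model()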
        _model_cache[model_name] = model
    return _model_cache[model_name]

def load_text_clarity_model():
    """
    Load the text-clarity RCAN model.
    Fine-tuned on pairs of:
      - Blurry/pixelated text images (input)
      - Sharp, readable text images (target)
    Training data: screenshots, receipts, book pages,
    scanned documents, handwritten notes.
    """
    from models.rcan import RCAN
    model = RCAN(scale=2, n_resgroups=10, n_resblocks=6)
    model.load_state_dict(
        torch.load("weights/text-clarity-rcan.pth", map_location="cpu")
    )
    model.eval()
    return model

def enhance_text_with_ai(img_bytes: bytes, mode: str):
    config = TEXT_MODES.get(mode, TEXT_MODES["text-clarity"])
    pil_img = Image.open(io.BytesIO(img_bytes)).convert("RGB")
    pil_img = resize_if_needed(pil_img)

    model = get_model(config["model"])
    img_tensor = pil_to_tensor(pil_img)

    with torch.no_grad():
        enhanced_tensor = process_with_tiling(
            model, img_tensor, tile_size=config["tile_size"]
        )

    enhanced_pil = tensor_to_pil(enhanced_tensor)

    # Mode-specific post-processing
    post = config.get("post_process")
    if post == "contrast_boost":
        enhanced_pil = boost_contrast(enhanced_pil, factor=1.4)
    elif post == "artifact_removal":
        enhanced_pil = remove_jpeg_artifacts(enhanced_pil)

    output = io.BytesIO()
    enhanced_pil.save(output, format="PNG")
    return output.getvalue()

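def process_with_tiling(model, tensor, tile_size=256):
    """
    Process large images in overlapping tiles to avoid OOM.
    Tiles overlap by 1/8 of the tile size; overlapping output pixels are averaged.
    """
    _, _, h, w = tensor.shape
    if h <= tile_size and w <= tile_size:
        return model(tensor)

    overlap = tile_size // 8
    stride = tile_size - overlap
    output = None  # allocated after the first tile, once the upscale factor is known
    weight = None

    for y in range(0, h, stride):
        for x in range(0, w, stride):
            y_end = min(y + tile_size, h)
            x_end = min(x + tile_size, w)
            tile = tensor[:, :, y:y_end, x:x_end]
            enhanced_tile = model(tile)

            if output is None:
                # The model upscales, so the output canvas is `scale` times the input size
                scale = enhanced_tile.shape[-1] // tile.shape[-1]
                b, c = enhanced_tile.shape[:2]
                output = enhanced_tile.new_zeros((b, c, h * scale, w * scale))
                weight = torch.zeros_like(output)

            # Accumulate the tile and count how many tiles touched each pixel
            output[:, :, y * scale:y_end * scale, x * scale:x_end * scale] += enhanced_tile
            weight[:, :, y * scale:y_end * scale, x * scale:x_end * scale] += 1

            if x_end == w:
                break  # this row is fully covered
        if y_end == h:
            break  # the image is fully covered

    return output / weight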

The FastAPI Endpoint

main.py

import io

from fastapi import FastAPI, File, Form, UploadFile, Response, HTTPException
from PIL import Image
from enhancer.text_model_client import enhance_text_with_ai

app = FastAPI()

ALLOWED_FORMATS = {"JPEG", "PNG", "WEBP"}
MAX_FILE_SIZE = 5 * 1024 * 1024  # 5MB

@app.post("/api/text/enhance")
async def enhance_text_image(
    file: UploadFile = File(...),
    model: str = Form("text-clarity")
):
    if not file:
        raise HTTPException(400, "Image file is required")

    img_bytes = await file.read()

    if len(img_bytes) > MAX_FILE_SIZE:
        raise HTTPException(400, "File too large. Max 5MB.")

    # Validate image format. Read .format from the freshly opened image:
    # it is None on the copy returned by .convert("RGB").
    try:
        pil_img = Image.open(io.BytesIO(img_bytes))
        pil_img.verify()  # raises if the file is truncated or corrupt
    except Exception:
        raise HTTPException(400, "Invalid image file")
    # Checked outside the try block so this HTTPException is not swallowed
    if pil_img.format not in ALLOWED_FORMATS:
        raise HTTPException(400, f"Unsupported format: {pil_img.format}")

    # Run enhancement
    try:
        result_bytes = enhance_text_with_ai(img_bytes, model)
    except Exception as e:
        raise HTTPException(500, f"Enhancement failed: {str(e)}")

    return Response(content=result_bytes, media_type="image/png")
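
A cheap pre-check before handing the bytes to Pillow is to look at the file signature. The sketch below is an optional addition rather than part of the endpoint above; sniff_image_format is a hypothetical helper name, and passing this check does not prove that the rest of the file decodes.

```python
def sniff_image_format(data):
    """Return "JPEG", "PNG", "WEBP", or None based on the file's leading bytes."""
    if data.startswith(b"\xff\xd8\xff"):
        return "JPEG"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "PNG"
    # WebP is a RIFF container: "RIFF", a 4-byte size, then "WEBP"
    if len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "WEBP"
    return None


print(sniff_image_format(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8))
print(sniff_image_format(b"GIF89a"))
```

Checking the signature catches renamed files (for example a GIF saved with a .png extension) before any decoding work is done.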

6. The Text Clarity Model

The core of the system is the text-clarity RCAN model. RCAN (Residual Channel Attention Network) is a super-resolution architecture that uses channel attention to focus on the most important feature channels — which, for text, are the ones that encode stroke edges and character shapes.

Why RCAN over Real-ESRGAN?

Feature	Real-ESRGAN (general)	RCAN (text-tuned)
Training data	Natural photos (DIV2K)	Text image pairs
Letter sharpness	Low — letters melt	High — letters stay crisp
Background preservation	Good	Good
Artifact handling	Adds artifacts to text	Removes artifacts from text
Inference speed	Slower (4x scale)	Faster (2x scale)

The text-clarity RCAN is fine-tuned from a base RCAN checkpoint using paired text images. The base model already understands edges and textures; the fine-tuning shifts its output space toward real letter shapes.

Model loading strategy

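The model is loaded the first time it is requested and then cached in _model_cache. This is critical: loading a PyTorch model from disk takes 3–5 seconds, so loading it per request would make the API unusably slow. To avoid paying that cost on the first user request, call get_model() once at process startup.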

7. End-to-End Flow

Here's what happens when a user clicks "Enhance Text":

1. User drops a blurry screenshot onto the upload zone
2. React validates file size (≤5MB) and format (JPEG/PNG/WebP)
3. React POSTs FormData (file + model) to /api/text/enhance
4. FastAPI receives the file:
   a. Validates size and format with PIL
   b. Converts to PIL Image, resizes if > 2048px
   c. Looks up model config from TEXT_MODES
   d. Gets cached RCAN model from _model_cache
   e. Converts PIL → tensor
   f. Processes in 256px overlapping tiles
   g. Converts tensor → PIL
   h. Applies post-processing (contrast boost / artifact removal)
   i. Saves as PNG bytes
5. FastAPI returns PNG binary response
6. React creates a Blob URL from the response
7. React renders before/after images side by side
8. User clicks Download → gets watermark-free PNG

Total processing time: under 15 seconds for a typical phone screenshot.

8. Error Handling

Production-grade error handling covers five scenarios:

No file uploaded

if not file:
    raise HTTPException(400, "Image file is required")

File too large

if len(img_bytes) > MAX_FILE_SIZE:
    raise HTTPException(400, "File too large. Max 5MB.")

Unsupported format

if pil_img.format not in ALLOWED_FORMATS:
    raise HTTPException(400, f"Unsupported format: {pil_img.format}")

Model inference failure

try:
    result_bytes = enhance_text_with_ai(img_bytes, model)
except torch.cuda.OutOfMemoryError:
    raise HTTPException(503, "Image too complex. Try a smaller crop.")
except Exception as e:
    raise HTTPException(500, f"Enhancement failed: {str(e)}")

Frontend timeout (30s abort)

const controller = new AbortController();
const timeout = setTimeout(() => controller.abort(), 30000);

let res;
try {
  res = await fetch("/api/text/enhance", {
    method: "POST",
    body: form,
    signal: controller.signal,
  });
} finally {
  // Clear the timer whether the request succeeded or failed
  clearTimeout(timeout);
}

9. Optimization Strategy

Tile-based inference

Large images are split into overlapping 256px tiles. This lets the model run on GPU-constrained servers without crashing on big screenshots. Tiles overlap by 1/8 (32px) to blend edges seamlessly — you should not see any seam lines in the output.
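
One simple way to combine overlapping tiles is to average every output value that more than one tile covers. The one-dimensional sketch below shows the idea on plain lists; blend_tiles_1d is a hypothetical helper, and plain averaging is an assumption here (a linear ramp across the overlap would give a smoother transition).

```python
def blend_tiles_1d(length, tiles):
    """Average overlapping tiles back into one row of `length` values.

    `tiles` is a list of (start, values) pairs. Where tiles overlap, each
    output position is the mean of every tile value that covers it.
    """
    total = [0.0] * length
    count = [0] * length
    for start, values in tiles:
        for offset, value in enumerate(values):
            total[start + offset] += value
            count[start + offset] += 1
    if 0 in count:
        raise ValueError("tiles do not cover the whole row")
    return [t / c for t, c in zip(total, count)]


# Two tiles of width 4 that overlap on positions 2 and 3.
row = blend_tiles_1d(6, [(0, [10, 10, 10, 10]), (2, [20, 20, 20, 20])])
print(row)  # [10.0, 10.0, 15.0, 15.0, 20.0, 20.0]
```

The same accumulate-and-divide pattern extends to two dimensions by keeping a sum canvas and a count canvas of the output size.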

Model caching

The RCAN model is loaded once at startup and reused across requests. Loading the model per request would add 3–5 seconds of overhead to every single API call.

Image pre-resizing

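Images whose longest side exceeds 2048px are resized before inference. Inference cost grows roughly with pixel count, so an image twice that size in each dimension is processed about 4x faster. The quality loss is usually small because the model upscales the text again during enhancement.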
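
The size of the saving depends on how large the upload was. The sketch below mirrors the arithmetic of resize_if_needed and reports how many times fewer pixels the model sees; it assumes inference cost is roughly proportional to pixel count, and the function names are illustrative.

```python
def resized_dimensions(width, height, max_dim=2048):
    """Mirror resize_if_needed: scale so the longest side is at most max_dim."""
    longest = max(width, height)
    if longest <= max_dim:
        return width, height
    ratio = max_dim / longest
    return int(width * ratio), int(height * ratio)


def pixel_reduction(width, height, max_dim=2048):
    """How many times fewer pixels the model sees after pre-resizing."""
    new_w, new_h = resized_dimensions(width, height, max_dim)
    return (width * height) / (new_w * new_h)


print(resized_dimensions(4096, 3072))  # (2048, 1536)
print(pixel_reduction(4096, 3072))     # 4.0
print(pixel_reduction(1920, 1080))     # 1.0, already under the limit
```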

Post-processing by mode

Mode	Post-processing	Why
`text-clarity`	None	Model output is already clean
`receipt`	Contrast boost (1.4x)	Faded thermal print needs extra contrast
`screenshot`	Median filter (3px)	Removes JPEG block artifacts
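
To see what a 1.4x contrast boost does numerically, the sketch below stretches each grayscale value away from the mean and clips to 0-255. Pillow's ImageEnhance.Contrast is built on the same idea of blending with the mean gray level, but this is a simplified illustration on a flat list, not Pillow's implementation.

```python
def contrast_boost(pixels, factor=1.4):
    """Scale each grayscale value's distance from the mean by `factor`.

    Results are rounded and clipped to the 0-255 range.
    """
    mean = sum(pixels) / len(pixels)
    boosted = []
    for value in pixels:
        scaled = mean + factor * (value - mean)
        boosted.append(max(0, min(255, round(scaled))))
    return boosted


# Faded print: "ink" at 120, "paper" at 200, only 80 levels apart.
faded = [120, 120, 200, 200]
print(contrast_boost(faded))  # [104, 104, 216, 216]
```

The gap between ink and paper grows from 80 to 112 levels, which is the 1.4x factor applied around the mean of 160.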

Response caching (optional)

For repeated uploads of the same image, hash the file + mode and cache the result:

import hashlib

def get_cache_key(img_bytes, mode):
    return hashlib.md5(img_bytes + mode.encode()).hexdigest()

Cache in Redis for 1 hour. This is optional — most users enhance each image once.
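
Before bringing in Redis, the same idea can be tried with an in-memory cache. The sketch below is an assumption-laden stand-in: TTLCache, cache_key, and enhance_cached are hypothetical names, it uses SHA-256 rather than the MD5 shown above, and it only works within a single process.

```python
import hashlib
import time


class TTLCache:
    """Tiny in-memory cache whose entries expire after `ttl_seconds`."""

    def __init__(self, ttl_seconds=3600, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # drop stale entries lazily
            return None
        return value

    def set(self, key, value):
        self._store[key] = (self.clock() + self.ttl, value)


def cache_key(img_bytes, mode):
    """Combine the image bytes and the mode into one hex digest."""
    digest = hashlib.sha256(img_bytes)
    digest.update(b"|" + mode.encode())
    return digest.hexdigest()


def enhance_cached(cache, img_bytes, mode, enhance_fn):
    """Return a cached result if present; otherwise compute and store it."""
    key = cache_key(img_bytes, mode)
    result = cache.get(key)
    if result is None:
        result = enhance_fn(img_bytes, mode)
        cache.set(key, result)
    return result
```

In the backend, enhance_fn would be enhance_text_with_ai; the injectable clock exists only to make expiry easy to test.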

10. Training Your Own Text Clarity Model

This is the part that makes or breaks the tool. The model is only as good as its training data.

Building the training pairs

You need pairs of (blurry input, sharp target) images. Here's how to build them:

Input (blurry)	Target (sharp)
Screenshot compressed by WhatsApp	Original uncompressed screenshot
Photo of faded receipt	High-contrast scan of same receipt
Out-of-focus book page photo	In-focus photo of same page
Pixelated scanned document	Clean scan at native resolution
Blurred handwritten note	Sharp photo of same note

Data augmentation: simulating degradation

You don't need thousands of real blurry photos. Take sharp text images and degrade them synthetically:

from PIL import Image, ImageFilter, ImageEnhance
import random
import io

def degrade_image(pil_img):
    """Simulate real-world text image degradation."""
    # Work in RGB: palette images cannot be filtered and JPEG cannot store alpha
    pil_img = pil_img.convert("RGB")

    # Random Gaussian blur (radius 1-4px)
    blur_radius = random.uniform(1, 4)
    pil_img = pil_img.filter(ImageFilter.GaussianBlur(radius=blur_radius))

    # Random JPEG compression (quality 20-50)
    quality = random.randint(20, 50)
    buffer = io.BytesIO()
    pil_img.save(buffer, format="JPEG", quality=quality)
    buffer.seek(0)  # rewind so the JPEG is read back from the start
    pil_img = Image.open(buffer)

    # Random contrast reduction
    if random.random() > 0.5:
        enhancer = ImageEnhance.Contrast(pil_img)
        pil_img = enhancer.enhance(random.uniform(0.5, 0.8))

    return pil_img

This covers the three most common text degradation patterns: blur, compression, and contrast loss.

Loss function

Don't just use L2 loss — it produces blurry outputs. Use a weighted combination:

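import torch.nn as nn

class TextEnhancementLoss(nn.Module):
    def __init__(self, ssim_fn, perceptual_fn):
        """
        ssim_fn(pred, target) -> SSIM score, 1.0 for identical images
            (for example, ssim from the pytorch_msssim package).
        perceptual_fn(pred, target) -> distance between VGG feature maps.
        Both are supplied by the caller; this guide does not define them.
        """
        super().__init__()
        self.l1 = nn.L1Loss()
        self.ssim = ssim_fn
        self.perceptual = perceptual_fn
        self.ssim_weight = 0.2   # Structural similarity
        self.l1_weight = 0.7     # Pixel accuracy for letter shapes
        self.perceptual_weight = 0.1  # VGG features for visual quality

    def forward(self, pred, target):
        l1_loss = self.l1(pred, target)
        ssim_loss = 1 - self.ssim(pred, target)
        perceptual_loss = self.perceptual(pred, target)
        return (
            self.l1_weight * l1_loss
            + self.ssim_weight * ssim_loss
            + self.perceptual_weight * perceptual_loss
        )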

L1 loss (70%) — ensures pixel-level accuracy for letter shapes. L1 is better than L2 for text because it doesn't over-penalize sharp edges.
SSIM loss (20%) — structural similarity preserves stroke continuity. A letter "a" should look like an "a", not a blob.
Perceptual loss (10%) — VGG feature matching for overall visual quality. Keeps the output looking natural, not artificially sharpened.
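
To give a feel for the SSIM term, the sketch below computes a simplified single-window SSIM on two short rows of grayscale values. It is an illustration only: global_ssim is a hypothetical helper, and real training code would use a windowed, differentiable SSIM on tensors.

```python
def global_ssim(x, y, max_value=255.0):
    """Single-window SSIM between two equal-length lists of pixel values."""
    if len(x) != len(y) or not x:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    c1 = (0.01 * max_value) ** 2  # stabilisers from the standard SSIM definition
    c2 = (0.03 * max_value) ** 2
    return ((2 * mean_x * mean_y + c1) * (2 * cov + c2)) / (
        (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2)
    )


sharp = [0, 0, 255, 255, 0, 0]      # crisp stroke edge
blurry = [0, 64, 191, 191, 64, 0]   # same stroke, smeared

print(global_ssim(sharp, sharp))       # identical inputs score 1
print(1 - global_ssim(sharp, blurry))  # the SSIM loss term for the blurry row
```

Both rows have the same average brightness, so the loss here comes entirely from the change in structure, which is the property the SSIM term is meant to penalize.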

11. Real-World Use Cases

This text enhancer works on a wide range of real-world scenarios:

Use case	Example	Best mode
Chat screenshots	WhatsApp, Messenger, iMessage compressed text	`screenshot`
Receipt photos	Faded thermal print on crumpled paper	`receipt`
Book page photos	Small print in photos of old books	`text-clarity`
Scanned documents	Invoices, contracts, IDs converted to images	`text-clarity`
Handwritten notes	Photos of handwritten letters and forms	`text-clarity`
Memes	Compressed meme text with JPEG artifacts	`screenshot`
Product labels	Tiny text on packaging and ingredient lists	`text-clarity`
Traffic signs	Distance shots of road and store signs	`text-clarity`

12. Wrapping Up

We built a complete AI text enhancer:

Frontend: React with drag & drop upload, model selector, and before/after preview
Backend: Python + FastAPI with file validation, tile-based inference, and mode-specific post-processing
Model: RCAN fine-tuned on text image pairs with L1 + SSIM + perceptual loss
Production-ready: Error handling, model caching, image pre-resizing, and response caching

The key takeaway from this guide: don't use a general image enhancer for text. The training data matters more than the architecture. A small RCAN fine-tuned on text pairs will outperform a massive Real-ESRGAN trained on photos — every single time.

If you want to try the finished product, check out the AI Text Enhancer — it's free, no watermark, no account needed, and runs in your browser.

Found this guide helpful? Have questions about the implementation? Drop a comment below — I'm happy to help.

Top comments (1)

Bhavin Sheth • Jul 3

Nice breakdown. I’ve seen general upscalers make screenshots look worse, so using a text-focused model is a much smarter approach for documents and receipts.
