Daniyal khan

AI Text Enhancer – Full Technical Implementation Guide to Deblur Text in Images

You took a screenshot of an important chat conversation. You open it later and the text is a blurry, pixelated mess — compressed to death by WhatsApp or Messenger. You can barely make out what was said.

Or you snapped a photo of a receipt for expenses. Two weeks later the thermal print has faded so much that the total amount is unreadable.

We've all been there. The information is there — it's just not legible anymore.

Sure, you could throw the image into a generic AI upscaler. But here's the problem: those tools are trained on photos of faces, landscapes, and buildings. When they see text, they treat it like a texture and smear it. The letters come out looking melted, warped, or replaced with alien-looking symbols that aren't even real characters.

What you need is a text-specialized AI — one trained specifically on typography, handwriting, and printed letters. That's what we're building in this guide.

By the end of this article, you'll have a complete, production-ready implementation for an AI Text Enhancer that:

  • Reconstructs blurry, pixelated letters into sharp, readable text
  • Removes JPEG artifacts from compressed screenshots
  • Restores contrast on faded receipt photos
  • Works on handwritten notes, book pages, scanned documents, and more
  • Runs in the browser with no install, no watermark, and no account

Let's build it.


Table of Contents

  1. Why Generic Enhancers Fail on Text
  2. System Architecture
  3. The API Contract
  4. Frontend: React Upload + Before/After Preview
  5. Backend: Python + FastAPI
  6. The Text Clarity Model
  7. End-to-End Flow
  8. Error Handling
  9. Optimization Strategy
  10. Training Your Own Text Clarity Model
  11. Real-World Use Cases
  12. Wrapping Up

1. Why Generic Enhancers Fail on Text

Before we write any code, it's worth understanding why we need a specialized model. Can't we just use Real-ESRGAN or Gemini Nano Banana and call it a day?

No. And here's why.

General-purpose image enhancers are trained on datasets like DIV2K — a collection of high-quality photographs. The loss functions optimize for perceptual quality across natural images: smooth skin tones, blue skies, green trees. Text is a tiny fraction of the training data, if it appears at all.

When these models encounter text, one of two things happens:

  1. The letters melt. The model applies the same smoothing it uses for skin and skies. Sharp letter edges become soft blobs. Curves lose their definition. The text looks like it was left out in the rain.

  2. The letters hallucinate. The model tries to "enhance" what it thinks is a texture pattern and generates new, fake character-like shapes that aren't real letters. You get alien text — shapes that look like writing but aren't readable in any language.

A text-specialized model solves this by being trained on text image pairs. It learns the manifold of real letters — the strokes, curves, serifs, and spacing that make text text. When it reconstructs a blurry letter, it pulls from a space of real character shapes, not a space of natural image textures.

This is the same principle behind OCR engines, but instead of outputting recognized text strings, we output a reconstructed image where the text is visually sharp and readable.


2. System Architecture

Here's the high-level flow:

[React Frontend]
  ↓ User uploads image (JPEG/PNG/WebP)
  ↓ Selects enhancement model (Text Clarity / Standard)
  ↓ POST /api/text/enhance

[FastAPI Backend]
  ↓ Validates file (size, format)
  ↓ Converts to PIL Image → resizes if > 2048px
  ↓ Loads text-clarity RCAN model (cached)
  ↓ Processes image in 256px tiles
  ↓ Applies post-processing (contrast boost / artifact removal)
  ↓ Returns enhanced PNG

[React Frontend]
  ↓ Renders before/after split comparison
  ↓ User downloads watermark-free result

The key design decisions:

  • Tile-based inference: large images are processed in overlapping 256px tiles to avoid running out of memory
  • Model caching: the RCAN model is loaded once and reused, not reloaded per request
  • Mode-specific post-processing: receipts get a contrast boost, screenshots get artifact removal
  • No watermark: the output is a clean PNG with no branding overlaid

3. The API Contract

One endpoint. Simple.

POST /api/text/enhance
Content-Type: multipart/form-data

Parameters:
  file:   image file (JPEG, PNG, WebP) — max 5MB
  model:  enhancement mode — "text-clarity" | "standard" | "receipt" | "screenshot"

Response:
  image/png binary

| Mode | What it does | When to use |
| --- | --- | --- |
| text-clarity | Reconstructs blurry/pixelated letters | Default: screenshots, book pages, documents |
| standard | General image enhancement | Photos of people, landscapes (not text) |
| receipt | Text Clarity + contrast boost | Faded thermal receipts, invoices |
| screenshot | Text Clarity + JPEG artifact removal | WhatsApp/Messenger compressed screenshots |
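
To make the contract concrete, here is a minimal client sketch using only the Python standard library. The helper names `build_multipart` and `enhance` are illustrative and not part of the guide's codebase; the field names `file` and `model` come from the contract above.

```python
import os
import uuid
import urllib.request

def build_multipart(file_name, file_bytes, mime_type, model="text-clarity"):
    """Encode the `model` and `file` form fields as multipart/form-data."""
    boundary = uuid.uuid4().hex
    model_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
    ).encode()
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{file_name}"\r\n'
        f"Content-Type: {mime_type}\r\n\r\n"
    ).encode()
    closing = f"--{boundary}--\r\n".encode()
    body = model_part + file_header + file_bytes + b"\r\n" + closing
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return body, headers

def enhance(url, path, mime_type="image/jpeg", model="text-clarity"):
    """POST an image to the endpoint and return the enhanced PNG bytes."""
    with open(path, "rb") as f:
        body, headers = build_multipart(
            os.path.basename(path), f.read(), mime_type, model
        )
    req = urllib.request.Request(url, data=body, headers=headers, method="POST")
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read()
```

Calling `enhance("http://localhost:8000/api/text/enhance", "receipt.jpg", model="receipt")` would return PNG bytes, assuming the server runs locally on port 8000.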

4. Frontend: React Upload + Before/After Preview

The frontend handles four things: file upload, model selection, API call, and before/after display.

The Component

import { useState, useRef } from "react";

export default function TextEnhancer() {
  const [image, setImage] = useState(null);
  const [model, setModel] = useState("text-clarity");
  const [resultUrl, setResultUrl] = useState(null);
  const [loading, setLoading] = useState(false);
  const [error, setError] = useState(null);
  const fileInputRef = useRef(null);

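  function handleFileChange(e) {
    const file = e.target.files[0];
    if (!file) return;
    // Dropped files bypass the input's accept attribute, so check the type here too
    if (!["image/jpeg", "image/png", "image/webp"].includes(file.type)) {
      setError("Unsupported format. Use JPEG, PNG, or WebP.");
      return;
    }
    if (file.size > 5 * 1024 * 1024) {
      setError("File too large. Max 5MB.");
      return;
    }
    setImage(file);
    setResultUrl(null); // Clear the previous result so before/after stay in sync
    setError(null);
  }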

  async function enhanceText() {
    if (!image) return;
    setLoading(true);
    setError(null);

    const form = new FormData();
    form.append("file", image);
    form.append("model", model);

    try {
      const res = await fetch("/api/text/enhance", {
        method: "POST",
        body: form,
      });

      if (!res.ok) {
        const err = await res.json();
        throw new Error(err.detail || "Enhancement failed");
      }

      const blob = await res.blob();
      setResultUrl(URL.createObjectURL(blob));
    } catch (e) {
      setError(e.message);
    } finally {
      setLoading(false);
    }
  }

  return (
    <div className="text-enhancer">
      {/* Drop zone */}
      <div
        className="drop-zone"
        onDragOver={(e) => e.preventDefault()}
        onDrop={(e) => {
          e.preventDefault();
          handleFileChange({ target: { files: e.dataTransfer.files } });
        }}
        onClick={() => fileInputRef.current?.click()}
      >
        <input
          ref={fileInputRef}
          type="file"
          accept="image/jpeg,image/png,image/webp"
          onChange={handleFileChange}
          hidden
        />
        <p>Drop your blurry text image here</p>
        <p className="hint">JPEG, PNG, WebP — max 5MB</p>
      </div>

      {/* Model selector */}
      <div className="model-selector">
        <label>
          <input
            type="radio"
            value="text-clarity"
            checked={model === "text-clarity"}
            onChange={(e) => setModel(e.target.value)}
          />
          Text Clarity (recommended for text)
        </label>
        <label>
          <input
            type="radio"
            value="standard"
            checked={model === "standard"}
            onChange={(e) => setModel(e.target.value)}
          />
          Standard (for photos)
        </label>
      </div>

      {/* Enhance button */}
      <button onClick={enhanceText} disabled={loading || !image}>
        {loading ? "Enhancing text..." : "Enhance Text"}
      </button>

      {/* Error display */}
      {error && <div className="error">{error}</div>}

      {/* Before / After result */}
      {resultUrl && (
        <div className="result">
          <img src={URL.createObjectURL(image)} alt="Before enhancement" />
          <img src={resultUrl} alt="After AI text enhancement" />
          <a href={resultUrl} download="enhanced-text.png">
            Download
          </a>
        </div>
      )}
    </div>
  );
}

What the frontend is responsible for

| Task | How |
| --- | --- |
| Upload image | Drag & drop zone + hidden file input |
| Validate file | Check size (5MB max) and format |
| Send to backend | FormData POST with file + model |
| Show result | Side-by-side before/after images |
| Download | Blob URL, no watermark added |

5. Backend: Python + FastAPI

The backend does the heavy lifting: validation, preprocessing, model inference, and post-processing.

Project Structure

backend/
  main.py                  # FastAPI app + endpoint
  enhancer/
    text_modes.py          # Mode configurations
    text_model_client.py   # Model loading + inference
    image_utils.py         # PIL preprocessing utilities

Install Dependencies

pip install fastapi uvicorn python-multipart pillow torch torchvision

  • Pillow: image preprocessing and format validation
  • torch + torchvision: model inference
  • python-multipart: multipart form data parsing for file uploads

Mode Configuration

enhancer/text_modes.py

TEXT_MODES = {
    "text-clarity": {
        "instruction": "Reconstruct blurry and pixelated letters into sharp, readable text. Preserve the background.",
        "model": "text-clarity-rcan",
        "tile_size": 256,
        "scale": 2,
    },
    "standard": {
        "instruction": "General image enhancement. Improve quality and resolution.",
        "model": "real-esrgan-general",
        "tile_size": 512,
        "scale": 4,
    },
    "receipt": {
        "instruction": "Restore faded thermal print. Maximize contrast between text and paper background.",
        "model": "text-clarity-rcan",
        "tile_size": 256,
        "scale": 2,
        "post_process": "contrast_boost",
    },
    "screenshot": {
        "instruction": "Remove JPEG artifacts around text in compressed screenshots. Rebuild letter shapes.",
        "model": "text-clarity-rcan",
        "tile_size": 256,
        "scale": 2,
        "post_process": "artifact_removal",
    },
}

Each mode specifies which model to use, the tile size for inference, the upscale factor, and an optional post-processing step.
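
The mode lookup can also be made strict, so an unknown mode is reported instead of silently falling back to a default. The sketch below is one way to do that; `resolve_mode` is a hypothetical helper, and the dictionary is a trimmed copy of `TEXT_MODES` so the example runs on its own.

```python
# Trimmed copy of TEXT_MODES (instruction strings omitted) so the sketch is self-contained
TEXT_MODES = {
    "text-clarity": {"model": "text-clarity-rcan", "tile_size": 256, "scale": 2},
    "standard": {"model": "real-esrgan-general", "tile_size": 512, "scale": 4},
    "receipt": {
        "model": "text-clarity-rcan",
        "tile_size": 256,
        "scale": 2,
        "post_process": "contrast_boost",
    },
    "screenshot": {
        "model": "text-clarity-rcan",
        "tile_size": 256,
        "scale": 2,
        "post_process": "artifact_removal",
    },
}

def resolve_mode(mode):
    """Return the config for `mode`, or raise ValueError listing the valid modes."""
    try:
        return TEXT_MODES[mode]
    except KeyError:
        valid = ", ".join(sorted(TEXT_MODES))
        raise ValueError(f"Unknown mode {mode!r}. Valid modes: {valid}") from None
```

In the endpoint, a `ValueError` from this helper could be translated into a 400 response so the client learns about the typo.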

Image Preprocessing Utilities

enhancer/image_utils.py

from PIL import Image, ImageEnhance, ImageFilter
import io

MAX_DIMENSION = 2048

def load_image_as_pil(bytes_data):
    """Load raw bytes into a PIL Image (RGB)."""
    return Image.open(io.BytesIO(bytes_data)).convert("RGB")

def to_bytes(pil_img, format="PNG"):
    """Convert PIL Image to raw bytes."""
    output = io.BytesIO()
    pil_img.save(output, format=format)
    return output.getvalue()

def resize_if_needed(pil_img, max_dim=MAX_DIMENSION):
    """Resize image if any dimension exceeds max_dim to keep inference fast."""
    w, h = pil_img.size
    if max(w, h) > max_dim:
        ratio = max_dim / max(w, h)
        new_size = (int(w * ratio), int(h * ratio))
        return pil_img.resize(new_size, Image.LANCZOS)
    return pil_img

def boost_contrast(pil_img, factor=1.4):
    """Post-process: boost contrast for faded receipt text."""
    enhancer = ImageEnhance.Contrast(pil_img)
    return enhancer.enhance(factor)

def remove_jpeg_artifacts(pil_img):
    """Post-process: light median filter to clean JPEG blocks around text."""
    return pil_img.filter(ImageFilter.MedianFilter(size=3))

The Model Client

enhancer/text_model_client.py

This is where the magic happens. The model is loaded on first use and then cached, so later requests skip the load. Inference runs in tiles to handle large images without running out of memory.

import torch
import io
from PIL import Image
from .text_modes import TEXT_MODES
from .image_utils import resize_if_needed, boost_contrast, remove_jpeg_artifacts

_model_cache = {}

def get_model(model_name: str):
    """Load and cache model — avoids 3-5s reload per request."""
    if model_name not in _model_cache:
        if model_name == "text-clarity-rcan":
            model = load_text_clarity_model()
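        else:
            # load_general_model() is not shown in this guide; it is assumed
            # to load the general-purpose weights in the same fashion.
            model = load_general_model()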
        _model_cache[model_name] = model
    return _model_cache[model_name]

def load_text_clarity_model():
    """
    Load the text-clarity RCAN model.
    Fine-tuned on pairs of:
      - Blurry/pixelated text images (input)
      - Sharp, readable text images (target)
    Training data: screenshots, receipts, book pages,
    scanned documents, handwritten notes.
    """
    from models.rcan import RCAN
    model = RCAN(scale=2, n_resgroups=10, n_resblocks=6)
    model.load_state_dict(
        torch.load("weights/text-clarity-rcan.pth", map_location="cpu")
    )
    model.eval()
    return model

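from torchvision.transforms.functional import to_pil_image, to_tensor

def pil_to_tensor(pil_img):
    """PIL RGB image -> float tensor of shape (1, 3, H, W), values in [0, 1].
    Assumes the model was trained on inputs in that range."""
    return to_tensor(pil_img).unsqueeze(0)

def tensor_to_pil(tensor):
    """Float tensor of shape (1, 3, H, W) -> PIL RGB image."""
    return to_pil_image(tensor.squeeze(0).clamp(0, 1).cpu())

def enhance_text_with_ai(img_bytes: bytes, mode: str):
    config = TEXT_MODES.get(mode, TEXT_MODES["text-clarity"])
    pil_img = Image.open(io.BytesIO(img_bytes)).convert("RGB")
    pil_img = resize_if_needed(pil_img)

    model = get_model(config["model"])
    img_tensor = pil_to_tensor(pil_img)

    with torch.no_grad():
        enhanced_tensor = process_with_tiling(
            model, img_tensor, tile_size=config["tile_size"], scale=config["scale"]
        )

    enhanced_pil = tensor_to_pil(enhanced_tensor)

    # Mode-specific post-processing
    post = config.get("post_process")
    if post == "contrast_boost":
        enhanced_pil = boost_contrast(enhanced_pil, factor=1.4)
    elif post == "artifact_removal":
        enhanced_pil = remove_jpeg_artifacts(enhanced_pil)

    output = io.BytesIO()
    enhanced_pil.save(output, format="PNG")
    return output.getvalue()

def process_with_tiling(model, tensor, tile_size=256, scale=2):
    """
    Process large images in overlapping tiles to avoid OOM.
    Tiles overlap by 1/8 of the tile size; pixels covered by more than
    one tile are averaged so that tile borders do not show as seams.
    The model upscales by `scale`, so the output canvas is scaled too.
    """
    b, c, h, w = tensor.shape
    if h <= tile_size and w <= tile_size:
        return model(tensor)

    overlap = tile_size // 8
    stride = tile_size - overlap
    output = tensor.new_zeros(b, c, h * scale, w * scale)
    weight = tensor.new_zeros(1, 1, h * scale, w * scale)

    for y in range(0, h, stride):
        for x in range(0, w, stride):
            y_end = min(y + tile_size, h)
            x_end = min(x + tile_size, w)
            enhanced_tile = model(tensor[:, :, y:y_end, x:x_end])
            # Map the tile's input coordinates to output coordinates
            ys = slice(y * scale, y_end * scale)
            xs = slice(x * scale, x_end * scale)
            output[:, :, ys, xs] += enhanced_tile
            weight[:, :, ys, xs] += 1

    # Every pixel is covered by at least one tile, so weight is never zero
    return output / weight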

The FastAPI Endpoint

main.py

import io

from fastapi import FastAPI, File, Form, UploadFile, Response, HTTPException
from PIL import Image
from enhancer.text_model_client import enhance_text_with_ai

app = FastAPI()

ALLOWED_FORMATS = {"JPEG", "PNG", "WEBP"}
MAX_FILE_SIZE = 5 * 1024 * 1024  # 5MB

@app.post("/api/text/enhance")
async def enhance_text_image(
    file: UploadFile = File(...),
    model: str = Form("text-clarity")
):
    if not file:
        raise HTTPException(400, "Image file is required")

    img_bytes = await file.read()

    if len(img_bytes) > MAX_FILE_SIZE:
        raise HTTPException(400, "File too large. Max 5MB.")

    # Validate image format. Read .format from the image as opened:
    # convert("RGB") returns a new image whose .format is None.
    try:
        img_format = Image.open(io.BytesIO(img_bytes)).format
    except Exception:
        raise HTTPException(400, "Invalid image file")
    # Raised outside the try block so it is not swallowed by `except Exception`
    if img_format not in ALLOWED_FORMATS:
        raise HTTPException(400, f"Unsupported format: {img_format}")

    # Run enhancement
    try:
        result_bytes = enhance_text_with_ai(img_bytes, model)
    except Exception as e:
        raise HTTPException(500, f"Enhancement failed: {str(e)}")

    return Response(content=result_bytes, media_type="image/png")

6. The Text Clarity Model

The core of the system is the text-clarity RCAN model. RCAN (Residual Channel Attention Network) is a super-resolution architecture that uses channel attention to focus on the most important feature channels — which, for text, are the ones that encode stroke edges and character shapes.

Why RCAN over Real-ESRGAN?

| Feature | Real-ESRGAN (general) | RCAN (text-tuned) |
| --- | --- | --- |
| Training data | Natural photos (DIV2K) | Text image pairs |
| Letter sharpness | Low: letters melt | High: letters stay crisp |
| Background preservation | Good | Good |
| Artifact handling | Adds artifacts to text | Removes artifacts from text |
| Inference speed | Slower (4x scale) | Faster (2x scale) |

The text-clarity RCAN is fine-tuned from a base RCAN checkpoint using paired text images. The base model already understands edges and textures; the fine-tuning shifts its output space toward real letter shapes.

Model loading strategy

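The model is loaded the first time it is requested and then kept in _model_cache for the life of the process. This is critical: loading a PyTorch model from disk takes 3–5 seconds. If you load it per request, your API will be unusably slow. To keep the first user from paying that cost, call get_model() once when the server starts.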
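
A sketch of that caching pattern, with a lock added so two concurrent first requests cannot both trigger a load. This is an illustration rather than the guide's actual module: `get_model` here takes the loader as an argument, and `warm_up` is a hypothetical helper meant to be called from a server startup hook.

```python
import threading

_model_cache = {}
_cache_lock = threading.Lock()

def get_model(model_name, loader):
    """Return the cached model, calling `loader` at most once per name."""
    with _cache_lock:
        if model_name not in _model_cache:
            _model_cache[model_name] = loader()
        return _model_cache[model_name]

def warm_up(loaders):
    """Eagerly load every model, e.g. when the server process starts."""
    for name, loader in loaders.items():
        get_model(name, loader)
```

Holding the lock while loading keeps the sketch simple. The trade-off is that a slow load of one model briefly blocks lookups of the others.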


7. End-to-End Flow

Here's what happens when a user clicks "Enhance Text":

1. User drops a blurry screenshot onto the upload zone
2. React validates file size (≤5MB) and format (JPEG/PNG/WebP)
3. React POSTs FormData (file + model) to /api/text/enhance
4. FastAPI receives the file:
   a. Validates size and format with PIL
   b. Converts to PIL Image, resizes if > 2048px
   c. Looks up model config from TEXT_MODES
   d. Gets cached RCAN model from _model_cache
   e. Converts PIL → tensor
   f. Processes in 256px overlapping tiles
   g. Converts tensor → PIL
   h. Applies post-processing (contrast boost / artifact removal)
   i. Saves as PNG bytes
5. FastAPI returns PNG binary response
6. React creates a Blob URL from the response
7. React renders before/after images side by side
8. User clicks Download → gets watermark-free PNG

Total processing time: under 15 seconds for a typical phone screenshot.


8. Error Handling

Production-grade error handling covers five scenarios:

No file uploaded

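# With File(...), FastAPI already rejects a request that has no file field
# (422) before this handler runs; the check below is a defensive fallback.
if not file:
    raise HTTPException(400, "Image file is required")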

File too large

if len(img_bytes) > MAX_FILE_SIZE:
    raise HTTPException(400, "File too large. Max 5MB.")

Unsupported format

# Read .format from the image as opened; after convert("RGB") it is None
img_format = Image.open(io.BytesIO(img_bytes)).format
if img_format not in ALLOWED_FORMATS:
    raise HTTPException(400, f"Unsupported format: {img_format}")
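
As a cheap pre-check before handing bytes to PIL, the three allowed formats can be recognized from their leading bytes. The helper below is a sketch (`sniff_format` is an illustrative name); it only inspects signatures and does not prove the rest of the file decodes.

```python
def sniff_format(data: bytes):
    """Guess JPEG/PNG/WEBP from the file signature; return None if unknown."""
    if data[:3] == b"\xff\xd8\xff":
        return "JPEG"
    if data[:8] == b"\x89PNG\r\n\x1a\n":
        return "PNG"
    # WebP is a RIFF container: "RIFF", 4 size bytes, then "WEBP"
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "WEBP"
    return None
```

A result of `None` can be rejected right away with a 400 response, without decoding anything.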

Model inference failure

try:
    result_bytes = enhance_text_with_ai(img_bytes, model)
# Needs `import torch`; only raised when inference runs on a CUDA GPU
except torch.cuda.OutOfMemoryError:
    raise HTTPException(503, "Image too complex. Try a smaller crop.")
except Exception as e:
    raise HTTPException(500, f"Enhancement failed: {str(e)}")

Frontend timeout (30s abort)

const controller = new AbortController();
const timeout = setTimeout(() => controller.abort(), 30000);

try {
  const res = await fetch("/api/text/enhance", {
    method: "POST",
    body: form,
    signal: controller.signal,
  });
  // ...handle res as in enhanceText()
} finally {
  // Clear the timer whether the request succeeded, failed, or was aborted
  clearTimeout(timeout);
}

9. Optimization Strategy

Tile-based inference

Large images are split into overlapping 256px tiles. This lets the model run on memory-constrained servers without crashing on big screenshots. Tiles overlap by 1/8 of the tile size (32px). For the overlap to hide seams, the pixels shared by neighboring tiles should be blended (for example, averaged) rather than overwritten by whichever tile is processed last.
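
The tile layout itself is plain arithmetic and can be checked without a model. The sketch below computes where tiles start along one axis and the resulting boxes; `tile_starts` and `tile_boxes` are illustrative names. Unlike a fixed-stride loop, it stops as soon as a tile reaches the image edge, which avoids a redundant sliver tile at the end.

```python
def tile_starts(length, tile_size=256, overlap=32):
    """Start offsets along one axis so that tiles cover [0, length)."""
    stride = tile_size - overlap
    starts = []
    pos = 0
    while True:
        starts.append(pos)
        if pos + tile_size >= length:  # this tile already reaches the edge
            break
        pos += stride
    return starts

def tile_boxes(h, w, tile_size=256, overlap=32):
    """List of (y0, y1, x0, x1) boxes, clamped to the image bounds."""
    return [
        (y, min(y + tile_size, h), x, min(x + tile_size, w))
        for y in tile_starts(h, tile_size, overlap)
        for x in tile_starts(w, tile_size, overlap)
    ]
```

With these defaults, a 2048px side gives 9 start positions, so a 2048x2048 image is processed as 81 tiles.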

Model caching

The RCAN model is loaded once and reused across requests. Loading the model per request would add 3–5 seconds of overhead to every single API call.

Image pre-resizing

Images whose longest side exceeds 2048px are resized before inference. Processing cost scales with pixel count, so shrinking a 4096px image to 2048px cuts the work by about 4x. The quality loss is usually small, since the model upscales the text again during enhancement.
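
The arithmetic behind that estimate, as a small sketch. `resized_dims` mirrors the resize rule used in this guide (scale the longest side down to 2048px), and `pixel_reduction` is a hypothetical helper that reports how many times fewer pixels the model has to process.

```python
def resized_dims(w, h, max_dim=2048):
    """Dimensions after scaling the longest side down to max_dim."""
    longest = max(w, h)
    if longest <= max_dim:
        return w, h
    ratio = max_dim / longest
    return int(w * ratio), int(h * ratio)

def pixel_reduction(w, h, max_dim=2048):
    """Factor by which the pixel count shrinks after resizing."""
    new_w, new_h = resized_dims(w, h, max_dim)
    return (w * h) / (new_w * new_h)

print(resized_dims(4096, 3072))     # (2048, 1536)
print(pixel_reduction(4096, 3072))  # 4.0
print(pixel_reduction(1080, 1920))  # 1.0, already within the limit
```

The saving only applies to images above the limit. A typical 1080x1920 phone screenshot passes through unchanged.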

Post-processing by mode

| Mode | Post-processing | Why |
| --- | --- | --- |
| text-clarity | None | Model output is already clean |
| receipt | Contrast boost (1.4x) | Faded thermal print needs extra contrast |
| screenshot | Median filter (3px) | Removes JPEG block artifacts |

Response caching (optional)

For repeated uploads of the same image, hash the file + mode and cache the result:

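import hashlib

def get_cache_key(img_bytes, mode):
    # SHA-256 rather than MD5: uploads are user-controlled, and a crafted
    # collision would serve one user's result for another image.
    # The NUL byte keeps the mode and the image bytes unambiguous.
    return hashlib.sha256(mode.encode() + b"\0" + img_bytes).hexdigest()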

Cache in Redis for 1 hour. This is optional — most users enhance each image once.
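
To show the flow without a Redis dependency, here is an in-memory stand-in with the same one-hour expiry. Everything in it is illustrative: `TTLCache`, `cache_key`, and `enhance_cached` are hypothetical names, the key uses SHA-256 over mode plus image bytes, and `enhance_fn` stands for whatever function performs the enhancement.

```python
import hashlib
import time

class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds=3600, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired: drop the entry and report a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (self.clock() + self.ttl, value)

def cache_key(img_bytes, mode):
    """Hash of mode + image bytes; the NUL byte separates the two parts."""
    return hashlib.sha256(mode.encode() + b"\0" + img_bytes).hexdigest()

def enhance_cached(cache, img_bytes, mode, enhance_fn):
    """Return a cached result if present, otherwise compute and store it."""
    key = cache_key(img_bytes, mode)
    cached = cache.get(key)
    if cached is not None:
        return cached
    result = enhance_fn(img_bytes, mode)
    cache.set(key, result)
    return result
```

A process-local dictionary is not shared between worker processes, which is the main reason to move to Redis once more than one worker is running.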


10. Training Your Own Text Clarity Model

This is the part that makes or breaks the tool. The model is only as good as its training data.

Building the training pairs

You need pairs of (blurry input, sharp target) images. Here's how to build them:

| Input (blurry) | Target (sharp) |
| --- | --- |
| Screenshot compressed by WhatsApp | Original uncompressed screenshot |
| Photo of faded receipt | High-contrast scan of same receipt |
| Out-of-focus book page photo | In-focus photo of same page |
| Pixelated scanned document | Clean scan at native resolution |
| Blurred handwritten note | Sharp photo of same note |

Data augmentation: simulating degradation

You don't need thousands of real blurry photos. Take sharp text images and degrade them synthetically:

from PIL import Image, ImageFilter, ImageEnhance
import random
import io

def degrade_image(pil_img):
    """Simulate real-world text image degradation."""
    # JPEG has no alpha channel, so normalize to RGB first
    pil_img = pil_img.convert("RGB")

    # Random Gaussian blur (radius 1-4px)
    blur_radius = random.uniform(1, 4)
    pil_img = pil_img.filter(ImageFilter.GaussianBlur(radius=blur_radius))

    # Random JPEG compression (quality 20-50): round-trip through an in-memory file
    quality = random.randint(20, 50)
    buffer = io.BytesIO()
    pil_img.save(buffer, format="JPEG", quality=quality)
    buffer.seek(0)
    pil_img = Image.open(buffer).convert("RGB")  # convert() forces the lazy decode

    # Random contrast reduction (applied to about half of the samples)
    if random.random() > 0.5:
        enhancer = ImageEnhance.Contrast(pil_img)
        pil_img = enhancer.enhance(random.uniform(0.5, 0.8))

    return pil_img

This covers the three most common text degradation patterns: blur, compression, and contrast loss.
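
When building a dataset, it can help to sample the degradation parameters separately from applying them, so every (sharp, degraded) pair can be regenerated from a seed. The sketch below does only the sampling, using the same ranges as above; `DegradationParams`, `sample_params`, and `params_for_dataset` are hypothetical names.

```python
import random
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class DegradationParams:
    blur_radius: float                # Gaussian blur radius in pixels
    jpeg_quality: int                 # JPEG quality setting
    contrast_factor: Optional[float]  # None means contrast is left unchanged

def sample_params(rng):
    """Draw one set of degradation parameters from the given RNG."""
    blur = rng.uniform(1, 4)
    quality = rng.randint(20, 50)
    contrast = rng.uniform(0.5, 0.8) if rng.random() > 0.5 else None
    return DegradationParams(blur, quality, contrast)

def params_for_dataset(n, seed=0):
    """Reproducible parameters for n training images."""
    rng = random.Random(seed)
    return [sample_params(rng) for _ in range(n)]
```

Storing the parameters next to each pair also makes it possible to check later which kinds of degradation the model handles worst.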

Loss function

Don't just use L2 loss — it produces blurry outputs. Use a weighted combination:

import torch.nn as nn

class TextEnhancementLoss(nn.Module):
    def __init__(self, ssim_fn, perceptual_fn):
        """
        ssim_fn(pred, target): returns an SSIM score, where 1.0 means identical.
        perceptual_fn(pred, target): returns a feature-space distance,
        for example computed on VGG activations.
        Both callables are supplied by the caller; this guide does not define them.
        """
        super().__init__()
        self.l1 = nn.L1Loss()
        self.ssim_fn = ssim_fn
        self.perceptual_fn = perceptual_fn
        self.l1_weight = 0.7          # Pixel accuracy for letter shapes
        self.ssim_weight = 0.2        # Structural similarity
        self.perceptual_weight = 0.1  # VGG features for visual quality

    def forward(self, pred, target):
        l1_loss = self.l1(pred, target)
        ssim_loss = 1 - self.ssim_fn(pred, target)
        perceptual_loss = self.perceptual_fn(pred, target)
        return (
            self.l1_weight * l1_loss
            + self.ssim_weight * ssim_loss
            + self.perceptual_weight * perceptual_loss
        )

  • L1 loss (70%): ensures pixel-level accuracy for letter shapes. L1 suits text better than L2 because L2 penalizes large errors quadratically, which pushes the model toward averaged, soft edges.
  • SSIM loss (20%): structural similarity preserves stroke continuity. A letter "a" should look like an "a", not a blob.
  • Perceptual loss (10%): VGG feature matching for overall visual quality. Keeps the output looking natural, not artificially sharpened.
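
To see how the three terms combine, here is the weighted sum written out for scalar values. The input numbers are made up for illustration, and `combined_loss` is a hypothetical helper, not part of the training code.

```python
def combined_loss(l1, ssim_score, perceptual, weights=(0.7, 0.2, 0.1)):
    """Weighted sum of L1, (1 - SSIM), and perceptual terms."""
    w_l1, w_ssim, w_perc = weights
    return w_l1 * l1 + w_ssim * (1.0 - ssim_score) + w_perc * perceptual

# Example: L1 = 0.1, SSIM = 0.9, perceptual distance = 0.5
# 0.7 * 0.1 + 0.2 * (1 - 0.9) + 0.1 * 0.5 = 0.07 + 0.02 + 0.05
total = combined_loss(0.1, 0.9, 0.5)
print(round(total, 4))  # 0.14
```

The weights only express relative importance if the three terms are on comparable scales. A perceptual distance that is much larger in magnitude than L1 can dominate the sum despite its 10% weight.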

11. Real-World Use Cases

This text enhancer works on a wide range of real-world scenarios:

| Use case | Example | Best mode |
| --- | --- | --- |
| Chat screenshots | WhatsApp, Messenger, iMessage compressed text | screenshot |
| Receipt photos | Faded thermal print on crumpled paper | receipt |
| Book page photos | Small print in photos of old books | text-clarity |
| Scanned documents | Invoices, contracts, IDs converted to images | text-clarity |
| Handwritten notes | Photos of handwritten letters and forms | text-clarity |
| Memes | Compressed meme text with JPEG artifacts | screenshot |
| Product labels | Tiny text on packaging and ingredient lists | text-clarity |
| Traffic signs | Distance shots of road and store signs | text-clarity |

12. Wrapping Up

We built a complete AI text enhancer that:

  • Frontend: React with drag & drop upload, model selector, and before/after preview
  • Backend: Python + FastAPI with file validation, tile-based inference, and mode-specific post-processing
  • Model: RCAN fine-tuned on text image pairs with L1 + SSIM + perceptual loss
  • Production-ready: Error handling, model caching, image pre-resizing, and response caching

The key takeaway from this guide: don't use a general image enhancer for text. The training data matters more than the architecture. On text, a small RCAN fine-tuned on text pairs should outperform a general-purpose Real-ESRGAN trained on photos.

If you want to try the finished product, check out the AI Text Enhancer — it's free, no watermark, no account needed, and runs in your browser.


Found this guide helpful? Have questions about the implementation? Drop a comment below — I'm happy to help.
