SEN LLC
The Right-Sized Image Service: A FastAPI Watermarker in ~110 MB

A single-purpose FastAPI service that stamps a text or image watermark on uploaded photos. Two endpoints, format preservation as a contract, ~110 MB Alpine image, no ImageMagick, no SaaS bill. The interesting part is the alpha compositing, and the interesting decision is where this thing should live in your architecture.

Image watermarking is one of those requirements that looks trivial and quietly becomes load-bearing. Photography marketplaces want a copyright stamp on every listing. Document workflows overlay "DRAFT" on previews. Print-on-demand stores mark preview renders so people don't screenshot and bypass checkout. Job portals tag resumes with their brand before emailing them to recruiters. None of these features are hard, but all of them have a habit of being written inline in the main app, and then they never leave.

This article is about lifting that logic into a small, boring, self-hosted FastAPI service and being deliberate about what to include and exclude. Two endpoints, Pillow, a bundled font, and a multi-stage Alpine Dockerfile that comes in at around 110 MB. Forty-three tests. The design brief was "something a photography site operator would deploy and forget about," and the constraint-setting is the bulk of what I'd like to talk through.

🔗 GitHub: https://github.com/sen-ltd/watermark-api


Problem: where does watermarking belong?

A common first answer is "it's just a Pillow call, put it in the request handler." It is just a Pillow call. That's exactly why it's worth extracting. Every feature that gets written inline because it's trivial joins the pile of trivial things you can't easily change later. When the CTO asks why half the marketplace is down because a corrupt JPEG upload took down the Django workers, "it was just a Pillow call" is not the story you want to tell.

The alternatives I considered:

  • Inline in the app. Zero ops overhead, but now your request worker is blocked on image decoding, you own the CVE response for Pillow in your main service, and any memory spike from a hostile 1 GB PNG affects everyone. Pillow is safe and fast, but "safe" is not the same as "isolated."
  • ImageMagick via convert or wand. Wand is thin wrapper around ImageMagick. ImageMagick has a long and documented history of CVEs, mostly around its maximalist format support. You get everything β€” GIF, HEIF, TIFF variants, PSD β€” and you also get everything. The Docker image usually lands above 200 MB before you start adding code.
  • Commercial image APIs (Cloudinary, Imgix, Filestack, etc.). Great products. Wrong product. They're built for "manage my entire image pipeline" and they price and DSL themselves for that. Paying per request to put six pixels of text on a photo is a bad deal, and the vendor lock-in hurts later.
  • Lambda + Sharp (the Node option). Popular and cheap, but you're now running a Node build in your Python shop and managing cold-start behavior for something that doesn't need any of that.

The right shape for "team of engineers who already know their infra" is usually a small, long-running, single-purpose HTTP service. One endpoint category, one library, one Dockerfile, zero cloud vendors. That's where this lives.

Design

A few decisions fall out immediately once you commit to that shape.

Pillow, not ImageMagick. Pillow supports JPEG, PNG, and WebP natively on Alpine via libjpeg-turbo, zlib, and libwebp. It doesn't need shell-outs, doesn't care about argument escaping, and has an Image.alpha_composite primitive that does exactly what watermarking needs. The resulting container is half the size of an ImageMagick one and the CVE surface is significantly smaller.

Two endpoints, not one. A watermark is either text or an image β€” different inputs, different knobs, different validation. Trying to merge them into a single POST /watermark that accepts either a text or an overlay field means every request has to carry a mode flag and every error message has to guess what the caller meant. Two endpoints (/watermark/text and /watermark/image) keep the OpenAPI spec honest and documentation trivial.

Format preservation as a contract. This is the one promise worth making up front: if you send me a JPEG, you get a JPEG back; PNG returns PNG; WebP returns WebP. Nothing transcodes behind your back, nothing surprises the caller. It sounds obvious; most image SaaS products don't honor it unless you explicitly pass a format parameter.
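As a concrete sketch of that contract (my own illustration; the repo's actual `_save_to_bytes` may differ in details, and the quality values here are assumptions):

```python
import io

from PIL import Image

# Illustrative per-format save parameters; the quality values are
# assumptions, not the repo's settings.
SAVE_PARAMS = {"JPEG": {"quality": 90}, "PNG": {}, "WEBP": {"quality": 90}}

def save_preserving_format(image: Image.Image, fmt: str) -> bytes:
    # The output format is whatever was sniffed on the way in;
    # it is never re-guessed from the pixels.
    if fmt == "JPEG" and image.mode == "RGBA":
        image = image.convert("RGB")  # JPEG has no alpha channel
    buf = io.BytesIO()
    image.save(buf, format=fmt, **SAVE_PARAMS[fmt])
    return buf.getvalue()
```

The key point is that `fmt` travels alongside the bytes from sniffing to saving; nothing in between gets to change it.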

Content-type by magic bytes, not MIME. The client's Content-Type header is a hint, not evidence. The server sniffs the first 12 bytes to confirm JPEG / PNG / WebP signatures. A client that lies about its MIME type, or that genuinely doesn't know, gets the same 415 as a real attack, which is the right answer.

MAX_UPLOAD_MB as memory protection. Pillow decodes images entirely into memory. A 50 MB PNG full of noise can balloon to hundreds of megabytes of pixel data. A hard upload limit (default 10 MB, env-configurable) is how we keep the process from being a DoS target. This is documented in the 413 error path, not hidden.
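A minimal sketch of that guard (names hypothetical; in the service itself the read happens inside `_read_image_upload`): read the stream in chunks and reject as soon as the running total crosses the limit, so an oversized body is never fully buffered before rejection.

```python
MAX_UPLOAD_MB = 10  # env-configurable in the real service

class UploadTooLarge(Exception):
    """Mapped to an HTTP 413 at the route layer."""

def read_capped(chunks, max_bytes=MAX_UPLOAD_MB * 1024 * 1024):
    # Accumulate chunks, bailing out the moment the cap is exceeded.
    buf = bytearray()
    for chunk in chunks:
        buf.extend(chunk)
        if len(buf) > max_bytes:
            raise UploadTooLarge(f"upload exceeds {max_bytes} bytes")
    return bytes(buf)
```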

Bundling DejaVuSans. This is the deploy-time decision I want to pull out and talk about. Pillow's text rendering needs a TrueType font file at a real path. You have three options:

  1. Ask users to mount a font volume or set FONT_PATH. Flexible, but now everybody configures it, and "text renders as boxes" becomes a support ticket that could have been prevented.
  2. Ship a font inside the image. Opinionated, slightly bigger image, zero-config for users. font-dejavu on Alpine adds about 2 MB.
  3. Embed a bitmap font. Pillow's ImageFont.load_default() works without any TTF, but it ignores font size, which defeats the point of font_size_pct.

I picked option 2. For a "deploy and forget" service the right tradeoff is "it just works," and 2 MB is well inside the budget. The cost is that the service only speaks DejaVu. That's a tradeoff I'm willing to defend: multi-font support is a feature, and adding it later is cleaner than ripping out a half-baked font-loader.

Alpha compositing is actually the interesting part. Everything else in this service is plumbing: multipart parsing, format detection, HTTP shapes. The one step that deserves thought is how the watermark meets the image, and for text that means: draw the glyphs onto a transparent RGBA layer, set the fill alpha to opacity * 255, then Image.alpha_composite that layer over the base. That gives you a clean, anti-aliased blend regardless of the underlying image's content. Do it wrong (just paste with no alpha handling) and you get chunky solid boxes under the text.

Implementation highlights

A few of the load-bearing snippets.

Text watermark with alpha compositing

This is apply_text_watermark in src/watermark_api/watermark.py, trimmed for clarity:

from typing import Tuple

from PIL import Image, ImageDraw

def apply_text_watermark(
    image: Image.Image,
    *,
    text: str,
    position: Position = "bottom-right",
    opacity: float = 0.5,
    color: Tuple[int, int, int] = (255, 255, 255),
    font_size_pct: float = 3.0,
    padding_pct: float = 2.0,
) -> Image.Image:
    orig_mode = image.mode
    base = image.convert("RGBA")
    width, height = base.size

    pixel_size = max(8, int(height * font_size_pct / 100.0))
    padding = max(0, int(min(width, height) * padding_pct / 100.0))
    font = _load_font(pixel_size)

    layer = Image.new("RGBA", base.size, (0, 0, 0, 0))
    draw = ImageDraw.Draw(layer)
    bbox = draw.textbbox((0, 0), text, font=font)
    text_w = bbox[2] - bbox[0]
    text_h = bbox[3] - bbox[1]
    x, y = _position_xy(base.size, (text_w, text_h), position, padding)

    alpha = int(round(opacity * 255))
    fill = (color[0], color[1], color[2], alpha)
    draw.text((x - bbox[0], y - bbox[1]), text, font=font, fill=fill)

    composed = Image.alpha_composite(base, layer)

    if orig_mode == "RGBA":
        return composed
    flat = Image.new(orig_mode, composed.size, (255, 255, 255))
    flat.paste(composed, mask=composed.split()[3])
    return flat

Things to notice:

  • The text layer is a fresh RGBA image exactly the size of the base. We draw into that and composite once. Drawing straight onto the base would skip alpha and give a binary on/off mark.
  • draw.textbbox((0, 0), text, font=font) replaces the older textsize API and correctly accounts for glyphs with non-zero left/top bearings. Without the (-bbox[0], -bbox[1]) offset, narrow-baselined fonts end up shifted.
  • _load_font tries a list of well-known TTF paths and, in last resort, falls back to Pillow's embedded bitmap font so import watermark_api.watermark works even on a laptop with no fonts at all. In the Docker image, /usr/share/fonts/TTF/DejaVuSans.ttf is always present.
  • The final mode handling is the RGBA-vs-JPEG bit I'll explain in "tradeoffs" below.

Magic-byte content-type check

In validators.py:

def sniff_image(data: bytes) -> ImageKind:
    if len(data) < 12:
        raise UnsupportedImageError("payload too short to be an image")

    if data[:3] == b"\xff\xd8\xff":
        return JPEG
    if data[:8] == b"\x89PNG\r\n\x1a\n":
        return PNG
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return WEBP

    raise UnsupportedImageError(
        "payload does not match JPEG, PNG, or WebP magic bytes"
    )

WebP is the interesting one: it's a RIFF container, so the magic is RIFF at offset 0 and WEBP at offset 8, with a little-endian file-size field between them. Checking both offsets rules out plain RIFF/AVI/WAV files that happen to start with the same four bytes.
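You can see why both offsets matter with a pair of hand-assembled headers (illustration only; real files carry an actual chunk size and payload):

```python
import struct

# A RIFF header is a four-byte tag, a little-endian chunk size,
# then the form type at offset 8.
webp = b"RIFF" + struct.pack("<I", 4) + b"WEBP"
wav = b"RIFF" + struct.pack("<I", 4) + b"WAVE"

# Both files pass the offset-0 check...
assert webp[:4] == wav[:4] == b"RIFF"
# ...and only the offset-8 form type tells a WebP apart from a WAV.
assert webp[8:12] == b"WEBP"
assert wav[8:12] != b"WEBP"
```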

Route handler

The FastAPI handler in main.py does validation, calls the pure function, streams the result:

@app.post("/watermark/text")
async def watermark_text(
    file: UploadFile = File(...),
    text: str = Form(..., max_length=200),
    position: str = Form("bottom-right"),
    opacity: float = Form(0.5, ge=0.0, le=1.0),
    color: str = Form("#ffffff"),
    font_size_pct: float = Form(3.0, ge=1.0, le=20.0),
    padding_pct: float = Form(2.0, ge=0.0, le=20.0),
) -> StreamingResponse:
    max_bytes = _max_upload_bytes()

    err = _validate_position(position)
    if err is not None:
        return err
    try:
        rgb = parse_hex_color(color)
    except InvalidColorError as exc:
        return _error(422, "invalid_color", str(exc))

    err, data, kind = await _read_image_upload(file, max_bytes)
    if err is not None:
        return err

    image = Image.open(io.BytesIO(data))
    image.load()
    out = apply_text_watermark(
        image, text=text, position=position, opacity=opacity,
        color=rgb, font_size_pct=font_size_pct, padding_pct=padding_pct,
    )
    body = _save_to_bytes(out, kind)
    return StreamingResponse(io.BytesIO(body), media_type=kind.mime)

The handler does three things: validate, call the pure watermark function, write the result. FastAPI's Form(..., ge=..., le=...) gives us automatic 422 responses for numeric fields out of range, which covers most of the "user typo" surface with no extra code.

The pure function in watermark.py has no FastAPI imports, no HTTP, no file IO; it's Image -> Image. That makes the unit tests boring in a good way: generate a fixture image with Pillow, call the function, assert something about the result. No TestClient, no anyio, no mocking.
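For illustration, a test of that shape. The stand-in function below is simplified so the snippet is self-contained; the real tests import `apply_text_watermark` from `watermark_api.watermark` instead:

```python
from PIL import Image, ImageDraw, ImageFont

# Simplified stand-in for the real function (not the repo's code):
# composite white text at the given opacity, round-trip the mode.
def apply_text_watermark(image, *, text, opacity=0.5):
    orig_mode = image.mode
    base = image.convert("RGBA")
    layer = Image.new("RGBA", base.size, (0, 0, 0, 0))
    ImageDraw.Draw(layer).text(
        (10, 10), text,
        font=ImageFont.load_default(),
        fill=(255, 255, 255, int(opacity * 255)),
    )
    composed = Image.alpha_composite(base, layer)
    return composed if orig_mode == "RGBA" else composed.convert(orig_mode)

def test_watermark_keeps_mode_and_changes_pixels():
    base = Image.new("RGB", (400, 300), "steelblue")
    out = apply_text_watermark(base, text="DRAFT", opacity=0.8)
    assert out.mode == base.mode            # mode round-trips
    assert out.size == base.size            # no resizing
    assert out.tobytes() != base.tobytes()  # something was drawn
```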

Tradeoffs and learnings

RGBA compositing into a JPEG output. JPEG has no alpha channel. The text layer is RGBA (it has to be; that's how we get opacity), so when the input is JPEG we need to flatten back to RGB before saving. Pillow's .convert("RGB") silently throws the alpha channel away, which looks terrible on anti-aliased text: you get harsh halos on every glyph edge. The fix is explicit: create a new RGB image with a white background, then paste the RGBA composite with the alpha channel as the mask. That gives clean edges.
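A one-pixel illustration of the difference, using the kind of partially covered pixel anti-aliasing leaves on a dark glyph edge:

```python
from PIL import Image

# A 50%-alpha black pixel: an anti-aliased edge of a dark glyph.
edge = Image.new("RGBA", (1, 1), (0, 0, 0, 128))

# Wrong: convert() discards the alpha channel outright, so the
# half-covered pixel comes out solid black.
wrong = edge.convert("RGB")
print(wrong.getpixel((0, 0)))  # (0, 0, 0)

# Right: flatten against an explicit white background, using the
# alpha channel as the paste mask, so the pixel blends to mid-gray.
flat = Image.new("RGB", edge.size, (255, 255, 255))
flat.paste(edge, mask=edge.split()[3])
print(flat.getpixel((0, 0)))  # roughly mid-gray
```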

There's a real decision buried in that white background: it's the "best default," not "always right." A watermark applied to a photo with a mostly-white sky looks perfect; one applied to pure-black night sky has cleaner edges if you flatten against black. I chose white because the most common input (product photos, document scans, portraits) lives closer to white than to black. A future version could detect the local mean color at the watermark position and flatten against that, but "flatten against local mean" is the kind of premature cleverness that earns you a 2 AM page when someone uploads a four-pixel image.

Right-to-left text. Pillow renders glyphs left-to-right in Unicode logical order, which is wrong for Arabic and Hebrew. Those scripts need BiDi reordering and joining-form selection before you hand the string to ImageDraw. The real solution is python-bidi + arabic-reshaper, and it's deliberately not in this version. The honest answer is "pre-process the text in your app if you need RTL," and the limitation is documented in the README.

Long text overflow. There's no wrapping. A 200-character text= value at default font size will happily draw past the image edge, or past the left edge if the position is bottom-right. Options: clamp text to a fraction of image width, auto-scale font size to fit, or wrap to multiple lines. I picked none of them. Captions are not essays, and every "smart" fallback hides the real issue ("you sent a huge string to a watermark API"). The 200-character limit on text at the route level is the safety net; anything within that length is the caller's problem to size correctly.

Percentage vs absolute pixels for font size. font_size_pct scales with the image, which means a watermark on a 400 px thumbnail and a 4000 px print both look proportionally correct. Absolute pixels would have been simpler to code and harder to use: you'd need a separate call per image size tier. The downside is that users reading the code the first time have to stop and think about what "3%" means. I think that's a fair price for proportional output.

Single-position vs stripe/tile. Commercial watermarking services often offer "diagonal tile" or "horizontal stripe" modes for anti-piracy. This service supports exactly five positions. Tiling is possible with Pillow (for y in range(...): for x in range(...): draw.text(...)) but it's a separate feature with its own knobs (tile spacing, rotation, stagger, the whole form). YAGNI until someone asks.
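For the curious, a rough sketch of what that omitted tile mode could look like; the helper name and every default here are my illustration, not anything in the repo:

```python
from PIL import Image, ImageDraw, ImageFont

# Hypothetical tile mode: draw the text once on a small transparent
# tile, rotate it, then stamp the tile on a grid across the image.
def tile_watermark(base, text, *, opacity=0.3, spacing=200, angle=30):
    rgba = base.convert("RGBA")
    tile = Image.new("RGBA", (spacing, spacing), (0, 0, 0, 0))
    ImageDraw.Draw(tile).text(
        (10, spacing // 2), text,
        font=ImageFont.load_default(),
        fill=(255, 255, 255, int(opacity * 255)),
    )
    tile = tile.rotate(angle)  # expand=False keeps the tile square
    layer = Image.new("RGBA", rgba.size, (0, 0, 0, 0))
    for y in range(0, rgba.height, spacing):
        for x in range(0, rgba.width, spacing):
            layer.paste(tile, (x, y), tile)  # paste crops at the edges
    return Image.alpha_composite(rgba, layer).convert(base.mode)
```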

Pillow's Resampling.LANCZOS vs BICUBIC. The image-overlay path resizes the overlay with LANCZOS. It's slower than BICUBIC but produces noticeably cleaner edges on logo-style overlays. For watermarks the quality gain is worth the ~2x resize cost.

Try it in 30 seconds

docker build -t watermark-api .
docker run --rm -d -p 8000:8000 --name wm watermark-api

# Make a test image
python3 -c "from PIL import Image; Image.new('RGB', (800,600), 'steelblue').save('/tmp/photo.jpg','JPEG',quality=85)"

# Watermark it
curl -sS -F "file=@/tmp/photo.jpg" \
         -F "text=© SEN 2026" \
         -F "position=bottom-right" \
         -F "opacity=0.7" \
         -F "color=#ffffff" \
         http://localhost:8000/watermark/text -o /tmp/photo-wm.jpg

file /tmp/photo-wm.jpg
# JPEG image data, baseline, precision 8, 800x600, components 3

docker stop wm

Swagger UI is at http://localhost:8000/docs. Health probe is GET /health. Logs are one JSON line per request on stdout. That's the whole surface area.


The repo is part of SEN LLC's 100-public-repos portfolio. The through-line is "right-sized backends": small, boring, single-purpose HTTP services that a team can read in one sitting, deploy once, and forget. This one is entry #110.
