SEN LLC

When the Library IS the Product: A 200-Line FastAPI Wrapper for wordcloud

Word clouds keep getting requested. Not by me — by dashboards, by reports, by blog post headers. The wordcloud Python package solves the hard part completely. This entry is about the case for putting a thin HTTP wrapper around a library you didn't write, and being honest about the tradeoffs.

🔗 GitHub: https://github.com/sen-ltd/wordcloud-api


The problem that refuses to die

Every few months somebody asks for a word cloud. A PM wants one for a quarterly report. A marketing team wants one for a blog post header. Someone building an analytics dashboard wants one for a "popular topics" widget. Academics still put them in slide decks.

They're a meme at this point — "word clouds are bad data viz" — but that argument has been lost for a decade. People like how they look. They keep asking.

On the Python side the answer has been the same since 2013: Andreas Müller's wordcloud package. It's good. It handles layout, font sizing by frequency, masking, colormaps, the whole thing. You don't need to reimplement any of it.

So the problem isn't "how do I generate a word cloud." The problem is "how do I make sure the eighth team in my company that needs one doesn't have to add numpy, matplotlib, and Pillow to their Go service."

The service argument

The rule I use: if a capability is independently scalable, language-agnostic, and has a clean input/output boundary, it earns a service. Word clouds are three-for-three:

  • Independently scalable: rendering is pure CPU work. No shared state.
  • Language-agnostic: your consumer might be Python, might be Node, might be a Bash script that curls from a cron job.
  • Clean boundary: text → PNG bytes. You couldn't write a simpler contract.

The cost of the service is that you have to run it. The cost of not running it is that every consumer carries the dependency weight. Which is…

The elephant

I need to be upfront about this because it's the single biggest thing anyone should know before deploying this:

The resulting Docker image is ~213 MB. matplotlib alone accounts for about 150 MB of that.

wordcloud depends on matplotlib because it uses matplotlib.cm to look up colormaps (viridis, plasma, magma, etc.). It also depends on numpy (for frequency arrays) and Pillow (for the actual drawing). On Alpine Linux, prebuilt musl wheels for that stack are spotty, so expect parts of it to compile from source during the image build.

Can you strip matplotlib? Not without forking wordcloud and replacing the colormap lookup with a hand-rolled table. I considered it. I decided no — the whole point of this exercise is "when the library IS the product, don't reinvent it." If you want a smaller image, fork and maintain. I want a thin wrapper.
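For scale, here is roughly what that hand-rolled table would entail. This is an illustrative two-stop gradient only: the endpoint colors are placeholders I chose, not actual viridis samples, and none of this is code from the repo.

```python
# Illustrative stand-in for a colormap lookup: a linear gradient between
# two RGB stops. Real viridis is a table of 256 sampled colors; the two
# stops below are viridis-ish placeholders, not real data.
def lerp_color(start: tuple, end: tuple, t: float) -> tuple:
    """Linearly interpolate between two RGB triples, t in [0, 1]."""
    return tuple(round(s + (e - s) * t) for s, e in zip(start, end))

def toy_colormap(t: float) -> tuple:
    # Placeholder endpoints: dark purple to yellow-green.
    t = max(0.0, min(1.0, t))
    return lerp_color((68, 1, 84), (253, 231, 37), t)
```

Multiply that by every colormap the API advertises and you have a maintenance burden, which is exactly the argument for keeping matplotlib.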

The service ships at 213 MB. That's the honest number. Plan accordingly.

The design

Two endpoints, one pure module, one FastAPI module. That's it.

src/wordcloud_api/
├── generator.py   # pure: options + text (or frequencies) → PNG bytes
├── models.py      # pydantic request/response shapes + limits
├── main.py        # FastAPI routes, content-type validation, error shaping
└── logging.py     # small ASGI middleware emitting one JSON line per request

The split that matters is generator.py vs main.py. The generator knows nothing about HTTP; it takes a dataclass of options and returns bytes. That's the testable core.

# generator.py (abridged)
from dataclasses import dataclass
from typing import Optional
import io
from wordcloud import WordCloud

class GeneratorError(ValueError):
    pass

@dataclass(frozen=True)
class RenderOptions:
    width: int = 800
    height: int = 400
    background: str = "white"
    colormap: str = "viridis"
    max_words: int = 200
    stopwords: Optional[frozenset[str]] = None

def render_from_text(text: str, options: RenderOptions) -> bytes:
    if not text.strip():
        raise GeneratorError("text must contain at least one non-whitespace character")

    wc = _build_wordcloud(options)
    try:
        wc.generate(text)
    except ValueError as exc:
        # wordcloud raises ValueError("We need at least 1 word to plot ...")
        # if the caller's text is all stopwords.
        raise GeneratorError(str(exc)) from exc
    return _to_png_bytes(wc)

Nothing clever. The interesting part is how many edge cases wordcloud already handles, which means the wrapper is boring in the good way.
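The first thing the tests assert about those bytes is that they really are a PNG. The signature check itself needs nothing but the eight magic bytes; in the real suite this wraps the output of render_from_text, while here a canned byte string stands in for the renderer:

```python
# PNG streams begin with a fixed 8-byte signature. Asserting on it is a
# cheap smoke test that the renderer emitted an actual image, not an
# error page or empty buffer.
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"

def looks_like_png(data: bytes) -> bool:
    return data[:8] == PNG_MAGIC

# In tests/test_generator.py this would be:
#   assert looks_like_png(render_from_text("hello world", RenderOptions()))
```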

The bug I walked into

There is one gotcha, and I want to mention it because it cost me a test cycle:

wordcloud validates the colormap name eagerly (at WordCloud(...) construction time), but it validates the background_color name lazily (deferred to WordCloud.to_image()).

That asymmetry exists because matplotlib.cm.get_cmap is called during construction to build the color function, but the background color is only touched when Pillow finally paints the canvas. Pillow raises its own ValueError("unknown color specifier") — a different error class from a different library, through a totally different stack frame.

If you're wrapping library exceptions into your own error class (I am), you have to catch ValueError in both places. My first pass only caught it in the constructor, and the test for "reject bad background color" failed with a raw Pillow stack trace instead of a clean GeneratorError.

def _to_png_bytes(wc: WordCloud) -> bytes:
    try:
        image = wc.to_image()
    except ValueError as exc:
        # Pillow raises this for unknown background color strings;
        # wordcloud defers the lookup until render time.
        raise GeneratorError(str(exc)) from exc
    buf = io.BytesIO()
    image.save(buf, format="PNG", optimize=True)
    return buf.getvalue()

Tiny fix. The lesson is bigger: when you wrap a library, your exception firewall needs to cover every layer the library is built on top of. wordcloud inherits error shapes from matplotlib, numpy, and Pillow. Any of them can leak through.
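One way to keep that firewall from being copy-pasted around every call site is a small decorator. This is a sketch of the pattern, not code from the repo; the names are mine:

```python
import functools

def firewall(*leaky_types, into):
    """Re-raise any of `leaky_types` as the service's own error class,
    keeping the original exception as __cause__ for logging."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except leaky_types as exc:
                raise into(str(exc)) from exc
        return wrapper
    return decorate

class GeneratorError(ValueError):
    pass

# Example: any ValueError or TypeError escaping this function comes out
# as a GeneratorError, regardless of which library layer raised it.
@firewall(ValueError, TypeError, into=GeneratorError)
def risky(x):
    if not isinstance(x, str):
        raise TypeError("not a string")
    return x
```

The tradeoff is that a blanket catch can mask genuine bugs in your own code, so keep the caught types as narrow as the underlying libraries allow.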

The frequencies endpoint — the escape hatch

Here is the one concession the API makes to reality: wordcloud tokenizes on whitespace. That works for English, Spanish, French, German, Russian, Arabic β€” anything with spaces between words. It does not work for Japanese, Chinese, Thai, or any other language that doesn't use a space delimiter.
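The failure mode is easy to see with plain str.split, a stdlib stand-in for whitespace tokenization:

```python
english = "the quick brown fox"
japanese = "素早い茶色の狐が犬を飛び越える"  # no spaces between words

print(english.split())   # four tokens, one per word
print(japanese.split())  # the entire sentence comes back as a single "word"
```

Feed that single-token result to a frequency counter and you get a word cloud with exactly one giant word.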

So the service has a second endpoint:

@app.post("/wordcloud/frequencies")
async def wordcloud_from_frequencies(request: Request):
    bad_ct = _require_json(request)
    if bad_ct is not None:
        return bad_ct

    body_or_err = await _read_body_with_limit(request)
    if isinstance(body_or_err, JSONResponse):
        return body_or_err

    try:
        payload = json.loads(body_or_err or b"{}")
    except json.JSONDecodeError as exc:
        return _error(422, "invalid_json", str(exc))
    req = FrequenciesRequest.model_validate(payload)

    opts = RenderOptions(
        width=req.width,
        height=req.height,
        background=req.background,
        colormap=req.colormap,
        max_words=req.max_words,
    )

    try:
        png = render_from_frequencies(req.frequencies, opts)
    except GeneratorError as exc:
        return _error(422, "render_error", str(exc))

    return StreamingResponse(io.BytesIO(png), media_type="image/png")

The client sends {"frequencies": {"word": 10, "other": 5}} directly, bypassing tokenization entirely. Tokenize however you want — MeCab, SudachiPy, jieba, PyThaiNLP, your own regex, whatever fits your corpus. In a previous entry in this series I built furigana-api using SudachiPy for Japanese, and slug-jp for romanization. Either of those pairs naturally with this endpoint: tokenize upstream, POST the counts, get a PNG back.

(Important caveat: the default font wordcloud bundles is DejaVu, which has no CJK glyphs. Pass CJK words and they render as empty boxes. You need to mount a CJK TTF — Noto Sans CJK is the canonical choice — for actual visible Japanese or Chinese. That's documented in the README and is left out of the default image deliberately, because a full Noto font pack would add another ~120 MB.)
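Upstream, the counting side is just collections.Counter. A sketch of building the request payload, with a hard-coded token list standing in for whatever MeCab or jieba would emit:

```python
import json
from collections import Counter

# Tokens as a CJK-aware tokenizer might emit them; hard-coded here so
# the sketch stays dependency-free.
tokens = ["python", "rust", "python", "go", "python", "rust"]

payload = {
    "frequencies": dict(Counter(tokens)),
    "colormap": "plasma",
}
body = json.dumps(payload).encode("utf-8")
# POST `body` to /wordcloud/frequencies with
# Content-Type: application/json (urllib.request, httpx, requests, ...)
# and write the image/png response bytes to disk.
```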

Route + pydantic validation

The route reads raw bytes, parses JSON manually (so we can return a clean invalid_json 422 instead of FastAPI's auto-generated noise), and runs pydantic validation with explicit limits:

# models.py
MAX_TEXT_LEN = 50 * 1024  # 50 KB
MAX_WORDS_LIMIT = 500
MAX_SIDE = 2000
MIN_SIDE = 50
MAX_FREQ_ENTRIES = 5000

class WordCloudRequest(BaseModel):
    text: str = Field(..., min_length=1, max_length=MAX_TEXT_LEN)
    width: int = Field(800, ge=MIN_SIDE, le=MAX_SIDE)
    height: int = Field(400, ge=MIN_SIDE, le=MAX_SIDE)
    background: str = Field("white", min_length=1, max_length=32)
    colormap: str = Field("viridis", min_length=1, max_length=32)
    max_words: int = Field(200, ge=1, le=MAX_WORDS_LIMIT)
    stopwords: Optional[list[str]] = Field(default=None, max_length=2000)

The limits aren't arbitrary:

  • 50 KB of text because beyond that you're not visualizing vocabulary anymore, you're DoS-ing the wordcloud layout algorithm (it's O(max_words × canvas_area) in the worst case).
  • 2000 px per side because 2000 × 2000 × 4 bytes per pixel = 16 MB of raw canvas. That's a reasonable ceiling before someone thinks they'll render a 4K desktop wallpaper through a JSON API.
  • 500 max_words because the layout gets visually illegible past a couple hundred anyway; this is a "you're clearly misusing this" gate, not a performance gate.

There's also a 1 MB raw body limit enforced before pydantic, so an adversary sending 500 MB of JSON gets a 413 after the first megabyte instead of eating RAM.
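The cap can be enforced while the body streams in rather than after it has all arrived. A sketch over a generic async chunk iterator (Starlette's request.stream() has this shape); the function and exception names here are my assumptions, not the repo's:

```python
import asyncio

class BodyTooLarge(Exception):
    pass

async def read_body_capped(chunks, limit: int) -> bytes:
    """Accumulate chunks until `limit` is exceeded, then bail out
    without buffering the rest of the body."""
    buf = bytearray()
    async for chunk in chunks:
        buf.extend(chunk)
        if len(buf) > limit:
            raise BodyTooLarge(f"body exceeds {limit} bytes")
    return bytes(buf)

async def fake_stream(parts):
    # Stand-in for an ASGI request body stream.
    for part in parts:
        yield part
```

In the route, catching BodyTooLarge and returning a 413 gives exactly the "stop after the first megabyte" behavior described above.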

Tests

33 test cases across two files:

tests/test_generator.py (15 cases) hits the pure module:

  • PNG magic-byte check on the output
  • Whitespace-only text → GeneratorError
  • All-stopwords text → GeneratorError (wordcloud raises "need at least 1 word")
  • Custom colormap happy path
  • Unknown colormap → error
  • Custom background happy path
  • Unknown background → error (this is the bug I mentioned above)
  • Stopwords filtering works
  • Frequencies happy path
  • Empty frequencies → error
  • Zero / negative weights are silently filtered, but all-zero raises
  • Non-numeric weight raises
  • Blank-key entries ignored
  • Japanese-ish romaji keys work (the bytes come out; the visual is limited by font)
  • max_words respected

tests/test_api.py (18 cases) hits the HTTP layer via TestClient:

  • /health shape and version
  • / HTML landing page is served and mentions both endpoints
  • /docs served (Swagger UI)
  • /wordcloud happy path + PNG magic byte verification
  • Custom-options happy path
  • Stopwords filter at the API layer
  • Empty text → 422
  • Oversize text → 413 or 422 (either is acceptable for "too big")
  • Oversize width/height → 422
  • Oversize max_words → 422
  • Bad colormap → 422
  • Wrong Content-Type → 415
  • Malformed JSON → 422
  • /wordcloud/frequencies happy path
  • Empty frequencies → 422
  • Too many frequency entries (> 5000) → 422

33 passing on the host, 33 passing in the Alpine container.

Try it in 30 seconds

docker build -t wordcloud-api https://github.com/sen-ltd/wordcloud-api.git
docker run --rm -p 8000:8000 wordcloud-api
curl -X POST http://localhost:8000/wordcloud \
  -H "Content-Type: application/json" \
  -d '{"text": "the quick brown fox jumps over the lazy dog the brown fox runs fast the lazy dog sleeps the quick dog chases the slow fox"}' \
  -o wc.png
curl -X POST http://localhost:8000/wordcloud/frequencies \
  -H "Content-Type: application/json" \
  -d '{"frequencies": {"python": 10, "rust": 7, "go": 5, "javascript": 8, "typescript": 6}, "colormap": "plasma"}' \
  -o tech.png
curl http://localhost:8000/health
# {"status":"ok","version":"0.1.0","default_colormap":"viridis"}

Full OpenAPI schema at http://localhost:8000/docs.

When this is a bad fit

  • You need SVG output. wordcloud only does raster. There is no SVG path. If you need crisp print-quality output, look at amueller/word_cloud#svg and be prepared to do work.
  • You need animations or interactivity. This is a PNG endpoint. It's a PNG endpoint.
  • You care about sub-100 MB container size. You're not going to get there without forking upstream and ripping matplotlib out. The dependency chain is what it is.
  • You need CJK out of the box. Mount a Noto CJK font and rebuild; that's another ~120 MB.
  • Your corpus is tiny. A dict comprehension and a PIL.ImageDraw call is cheaper than a service.

The meta-point

Not every portfolio entry needs to contain a novel algorithm. Sometimes the contribution is operational — taking a good library, sticking a boring HTTP interface on it, writing limits that match real-world use, being honest about the weight it pulls in, and exposing the one escape hatch (the frequencies endpoint) that matters.

That's this repo. 200-ish lines of real code, 33 tests, one Dockerfile, and an explicit acknowledgment that the interesting engineering happened somewhere else.


Entry #128 in a 100+ portfolio series by SEN LLC. Related entries:

  • furigana-api — morphological analysis for Japanese text; pair with /wordcloud/frequencies
  • slug-jp — Japanese → romaji slugs
  • watermark-api — another FastAPI + Pillow microservice

Feedback welcome.
