
Om Prakash

Posted on • Originally published at pixelapi.dev

Image Captioning API: Auto-Generate Alt Text and Descriptions


Most product catalogues, content feeds, and media libraries have one quiet shame: thousands of images with empty alt="" attributes, no search metadata, and no human-readable description. Writing them by hand does not scale. Here is the endpoint that makes that problem go away.

Today we are launching POST /v1/image/caption — a single endpoint that turns any public image URL into a caption tuned for the job you actually have: accessibility alt-tags, SEO descriptions, or full paragraph-length narration.

What it does

Send a public image URL. Get back text. That is the whole shape of the API.

What makes it useful is the style parameter. The same endpoint can produce three very different outputs from the same image, depending on what you are building:

  • concise (default) — tight, alt-text-shaped output. The kind of single sentence you want sitting inside an alt attribute. Screen-reader friendly, no fluff, no marketing language.
  • seo — keyword-rich descriptions written for indexing. Think product page meta-descriptions, image search optimisation, structured-data fields. Longer than alt text, denser with terms a search engine actually picks up.
  • detailed — paragraph-length narration. For accessibility contexts where someone genuinely needs to understand the image, not just identify it. Also useful when you want a content-moderation reviewer to triage uploads without opening every preview.
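To make the split concrete, here is a small request-builder sketch in Python. `build_caption_request` is a hypothetical helper, not part of any SDK; the field names, defaults, and ranges come from the request table in this post.

```python
import json

# Hypothetical helper -- not an official SDK, just the documented
# request shape for POST /v1/image/caption expressed as code.
def build_caption_request(image_url, style="concise", max_tokens=64):
    if style not in ("concise", "detailed", "seo"):
        raise ValueError(f"unknown style: {style!r}")
    if not 32 <= max_tokens <= 256:
        raise ValueError("max_tokens must be within 32-256")
    return {"image_url": image_url, "style": style, "max_tokens": max_tokens}

# Same image, three jobs: alt text, meta description, full narration.
for style in ("concise", "seo", "detailed"):
    print(json.dumps(build_caption_request("https://example.com/source.jpg", style=style)))
```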

The full request shape is small:

Field        Required   Default   Notes
image_url    yes        -         Public URL of the image
style        no         concise   One of concise, detailed, seo
max_tokens   no         64        Length cap, range 32–256

That is the entire surface. No model selection, no temperature knobs, no system-prompt plumbing. You picked an endpoint called "caption" — we figured the job out so you do not have to.

max_tokens is a hard ceiling, not a target. A concise request with max_tokens: 256 will not waste tokens producing fluff; the style governs length, the cap is just there to protect you from runaway output in edge cases (extremely busy images, weird aspect ratios, etc.).

Why we built it

Image captioning is one of those features where the gap between "I could build this in a weekend" and "I have a production-grade pipeline serving 10k catalogue images a night" is enormous.

The weekend version is a Jupyter notebook calling someone's hosted model. It works on the demo image. It falls over the moment you point it at a real product feed: the URLs are sometimes 404, sometimes a redirect, sometimes a 50MB raw camera dump. Half the captions sound like art-gallery placards ("a serene composition featuring…") when you wanted alt text. The other half are three words long when you wanted a paragraph. You spend a week writing prompt scaffolding, retry logic, content filters, and length normalisation, and you still have not shipped the actual feature you set out to build.

The production version requires a team. We have already built that team's worth of work. The API in front of it is what we are shipping.

The other thing we kept hearing: one caption style does not fit every job. The alt-text you want on <img> tags is not the description you want in your og:image meta tag, and neither of those is what you want feeding into a moderation reviewer queue. Most APIs make you pick one shape and live with it, then post-process the output to fake the other two. We exposed the three styles directly because that is how the work actually splits in real codebases.

The angle, in plain terms: a purpose-built captioning endpoint, three styles in one call, priced flat-rate per request, no token accounting on your side.

Quickstart

Get an API key from the dashboard, then:

curl -X POST https://api.pixelapi.dev/v1/image/caption \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"image_url": "https://example.com/source.jpg", "style": "seo"}'

The response is JSON with a caption string. That is it.
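A minimal sketch of what that looks like; the `caption` field is the documented one, and the example text is invented:

```json
{
  "caption": "A red ceramic mug on a wooden desk next to an open laptop."
}
```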

Same call in Python with requests:

import requests

response = requests.post(
    "https://api.pixelapi.dev/v1/image/caption",
    headers={
        "Authorization": "Bearer YOUR_API_KEY",
        "Content-Type": "application/json",
    },
    json={
        "image_url": "https://example.com/source.jpg",
        "style": "seo",
        "max_tokens": 128,
    },
    timeout=30,
)

response.raise_for_status()
caption = response.json()["caption"]
print(caption)

A few notes worth flagging up front:

  • The image_url must be publicly reachable. If your images live behind a signed-URL CDN, generate a short-lived signed URL and pass that. We fetch the bytes server-side; we cannot reach a URL that requires your session cookie.
  • style is the lever you reach for first. If a caption feels off, change the style before you start fiddling with max_tokens. The style governs register (alt-text voice vs. SEO voice vs. narration voice); the token cap only governs length.
  • Set a sensible client timeout. 30 seconds is comfortable for a single call; if you are batching, see the use-case section below for parallelism guidance.

If you are wiring this into a job queue, treat each call as independent. There is no statefulness across requests — captioning image A does not influence the caption for image B. That makes parallelism trivial: fan out as many concurrent requests as your account's rate limit allows.
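A fan-out sketch along those lines, using only the standard library. The endpoint and headers match the quickstart above; the helper names and the default worker count are mine, so treat this as a template rather than reference code.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

API_URL = "https://api.pixelapi.dev/v1/image/caption"
API_KEY = "YOUR_API_KEY"

def caption_one(image_url, style="concise", post=None):
    """Caption a single image. `post` is injectable for testing;
    by default the request goes out over HTTP with urllib."""
    body = {"image_url": image_url, "style": style}
    if post is not None:
        return post(body)["caption"]
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode(),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)["caption"]

def caption_all(image_urls, workers=8, post=None):
    # Calls are stateless and independent, so a plain thread pool works.
    # Keep `workers` at or below your account's rate limit.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda u: caption_one(u, post=post), image_urls))
```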

Use cases

Bulk-generate alt text for an ecommerce catalogue (10k+ products)

This is the canonical job. You have a product database, every row has one or more image URLs, and the alt_text column is either empty or full of the SKU. Lighthouse is screaming. Your accessibility consultant has filed a report. Someone on the marketing side has noticed that Google Image search is sending you nothing.

The shape of the migration is straightforward: read product images out of the database in batches, fan out parallel calls to /v1/image/caption with style: "concise" for the on-page alt attribute, then a second pass with style: "seo" for the meta-description and structured-data fields. Two calls per product, sixteen credits apiece, and the entire 10k catalogue gets done overnight. The next morning you re-run Lighthouse and the accessibility score has moved 20+ points without anyone writing a single sentence by hand. The SEO improvement is harder to measure on day one but shows up in the indexing reports a few weeks later, when image-search referrals start trending up. This is the use case that pays for the integration on its own.
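The two-pass sweep can be sketched as a plain function. The `products` iterable and the `caption` callable stand in for your own data layer and API client respectively; nothing here is an official SDK.

```python
def backfill_catalogue(products, caption):
    """Two-pass caption backfill for a product catalogue.

    products: iterable of (product_id, image_url) pairs.
    caption:  callable(image_url, style) -> caption string, i.e. your
              client for POST /v1/image/caption.
    Returns {product_id: {"alt_text": ..., "seo_description": ...}}.
    """
    results = {}
    for product_id, image_url in products:
        results[product_id] = {
            # Pass 1: concise, for the on-page alt attribute.
            "alt_text": caption(image_url, "concise"),
            # Pass 2: seo, for meta-descriptions and structured data.
            "seo_description": caption(image_url, "seo"),
        }
    return results
```

In practice you would drive the `caption` callable through whatever concurrency your rate limit allows and write `results` back to the alt_text column in batches.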

Auto-tag user uploads in an image-sharing app for search

User-generated content has a tagging problem: nobody tags their own uploads properly. They write "IMG_4823" as the title, no description, no tags, and then complain that search inside your app does not work. You cannot force users to write metadata — you have tried, the conversion drops every time you add a required field.

Solve it on the server. When an upload finishes, kick off a background job that hits /v1/image/caption with style: "detailed" to get a full sentence or two describing what is in the image. Index that text into your existing search engine — Postgres full-text, Meilisearch, Elastic, whatever you already have. Now "sunset over a mountain lake" finds the user upload that the user titled "IMG_4823.jpg". The user did nothing extra; their content became searchable. You can also display the detailed caption as default alt-text on the image so screen-reader users get a real description, not a filename. One API call, two product wins.
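As a sketch, the background job is a few lines. `caption` is your API client and `index` is whatever write path your search engine exposes; both names are placeholders.

```python
def on_upload_complete(image_id, public_url, caption, index):
    """Run after an upload finishes, off the request path.

    caption: callable(image_url, style) -> caption string.
    index:   callable(doc_id, text) -> writes into your search engine
             (Postgres full-text, Meilisearch, Elastic, ...).
    """
    text = caption(public_url, "detailed")
    index(image_id, text)   # makes "sunset over a mountain lake" findable
    return text             # reusable as the image's default alt text
```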

Build a moderation queue with auto-described preview labels

If you run a moderation queue, you know the bottleneck: a human reviewer has to open each preview, register what is in it, decide, and move on. The "register what is in it" step is where seconds add up across thousands of items. Anything that can be turned into a one-line label before the reviewer's eyes land on it is a direct throughput win.

Hit /v1/image/caption with style: "concise" on every flagged upload as it enters the queue. The caption becomes a sortable, filterable label next to the thumbnail. Reviewers can now scan the queue text-first and only open previews where the caption is ambiguous or the policy call is genuinely hard. You are not replacing the human — image moderation is a human-judgement job and should stay that way — you are removing the obvious cases from their cognitive load. Reviewers go faster, the queue burns down sooner, and the borderline cases get more attention because the easy ones got triaged on the way in.
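A triage sketch, assuming queue items are dicts with an `image_url` key and that `caption` is your API client; the ambiguity heuristic here is purely illustrative and you would tune it to your own policy.

```python
def label_queue(queue, caption, ambiguous_terms=("unclear", "blurry", "partially")):
    """Attach a concise caption to each flagged item so reviewers can
    scan text-first, and flag items whose label hits an ambiguity term
    so only those previews need opening."""
    labeled = []
    for item in queue:
        item = dict(item)  # do not mutate the caller's queue
        item["label"] = caption(item["image_url"], "concise")
        item["needs_preview"] = any(
            term in item["label"].lower() for term in ambiguous_terms
        )
        labeled.append(item)
    return labeled
```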

Pricing

Flat rate per call. No token accounting on your side, no surprise overages on long detailed captions vs. short alt-text — every call costs the same regardless of which style you pick.

  • Credits per call: 8
  • Price (INR): ₹0.0054 per call
  • Price (USD): $0.00007 per call

To put that in catalogue terms: a 10,000-image alt-text run costs roughly ₹54 / $0.70. Two passes (one concise for alt-text, one seo for meta-descriptions) on the same catalogue is roughly ₹108 / $1.40. That is comfortably below the cost of a single hour of an engineer writing alt text by hand, for the entire catalogue.

A 100k-image media library indexed for natural-language search costs roughly ₹540 / $7.00 as a one-time pass, plus whatever your incremental ingestion rate adds — usually rounding error.
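The arithmetic above, as a tiny helper for your own catalogue sizes; the constants are the published rates from this post.

```python
CREDITS_PER_CALL = 8
INR_PER_CALL = 0.0054
USD_PER_CALL = 0.00007

def run_cost(n_images, passes=1):
    """Credits and approximate price for captioning n_images,
    optionally with multiple style passes per image."""
    calls = n_images * passes
    return {
        "calls": calls,
        "credits": calls * CREDITS_PER_CALL,
        "inr": round(calls * INR_PER_CALL, 2),
        "usd": round(calls * USD_PER_CALL, 2),
    }

print(run_cost(10_000, passes=2))  # the two-pass catalogue example
```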

There is no minimum, no monthly commitment, no per-seat pricing. Top up credits, make calls, the meter runs down by 8 each request. If a request fails (we return a non-2xx), no credits are deducted.

Try it

Grab an API key and start captioning:

  • Dashboard: pixelapi.dev/dashboard — sign up, generate a key, top up credits.
  • Docs: pixelapi.dev/docs — full reference for /v1/image/caption including all request fields, response shape, error codes, and rate limits.

If you are migrating off a hand-rolled captioning script or an older general-purpose vision API, the easiest way to evaluate is to run a hundred of your real images through all three styles and eyeball the output. Pick the one that fits your surface, wire it in, ship it. The whole integration is a single HTTP call — there is genuinely nothing else to learn.

Build something useful with it. We will keep making the captions better underneath; you will not have to change a line of code when we do.
