FlareCanary


DALL·E shuts down May 12 — the gpt-image-1 migration isn't the drop-in swap it looks like

OpenAI is shutting down dall-e-2 and dall-e-3 on May 12, 2026. After that date, requests to /v1/images/generations with either model string will stop working. The recommended replacements are gpt-image-1 and gpt-image-1-mini.

On paper this is a one-line change: swap the model name, ship. In practice, the request and response shapes are different enough that a naive swap breaks clients that worked against DALL·E for the last two years.

What the shutdown looks like

OpenAI's deprecation language: "deprecated models will no longer be accessible" after the shutdown date. Translated: your POST /v1/images/generations with "model": "dall-e-3" will return an error, not a fallback to the new model.

Expected response shape:

{
  "error": {
    "message": "The model `dall-e-3` has been deprecated. Learn more: https://platform.openai.com/docs/deprecations",
    "type": "invalid_request_error",
    "code": "model_not_found"
  }
}

No grace period, no auto-upgrade. The endpoint itself still exists — /v1/images/generations is alive and serves gpt-image-1. Only the old model IDs are gone.
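
If you'd rather catch this in a preflight check than in a user-facing 500, the models endpoint is enough. A minimal sketch with the Python openai SDK (v1.x assumed; image_model_available is just an illustrative helper name):

from openai import NotFoundError, OpenAI

client = OpenAI()

def image_model_available(model_id: str) -> bool:
    # Retrieving a retired model ID raises NotFoundError.
    try:
        client.models.retrieve(model_id)
        return True
    except NotFoundError:
        return False

# Run this in CI or at boot: fail loudly before the first image request does.
assert image_model_available("gpt-image-1")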

Where dall-e-3 is probably pinned in your code

The model string is usually spread across more surfaces than you'd expect:

  • Environment variables: OPENAI_IMAGE_MODEL, DALLE_MODEL, IMAGE_MODEL. Check every environment, not just prod.
  • SDK wrappers: the Python openai SDK's client.images.generate(model=...) call. LangChain's DallEAPIWrapper. Vercel AI SDK image helpers. LiteLLM routers.
  • Hardcoded defaults: it's common to see model or "dall-e-3" as a fallback expression when the caller doesn't pass a model. That fallback becomes a runtime error on May 12 (a one-line fix follows this list).
  • Tests and fixtures: VCR cassettes, recorded responses, snapshot tests. These pass locally but the behavior they capture is about to change.
  • Documentation and onboarding code: if your docs show "model": "dall-e-3" as an example, update them before users copy it into fresh projects.
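
For the hardcoded-default case, the cheapest durable fix is one env-var-backed constant that every call site imports, as sketched below (the default value is whatever you migrate to; gpt-image-1 here is an assumption):

import os

# Single source of truth: swapping models becomes a config change, not a code hunt.
IMAGE_MODEL = os.getenv("OPENAI_IMAGE_MODEL", "gpt-image-1")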

The five-minute audit:

git grep -n "dall-e-3\|dall-e-2\|dalle-3\|dalle-2"

The request shape changed

A DALL·E 3 call looks like this:

{
  "model": "dall-e-3",
  "prompt": "a watercolor fox",
  "n": 1,
  "size": "1024x1024",
  "quality": "hd",
  "style": "vivid",
  "response_format": "url"
}

The gpt-image-1 equivalent rejects two of those fields outright and changed the accepted values of two more:

  • response_format is gone. DALL·E returned hosted image URLs by default. gpt-image-1 returns base64-encoded image bytes in b64_json, always (PNG by default; see output_format below). If your client reads response.data[0].url, it will be None. You need to decode the bytes and either upload them to your own storage or serve them inline.
  • style is gone. The vivid / natural distinction doesn't exist. Prompt-engineer the style into the text instead.
  • quality values changed. DALL·E 3 used standard / hd. gpt-image-1 uses low / medium / high / auto. A literal "hd" in the request will be rejected.
  • size values changed. DALL·E 3 supported 1024x1024 | 1792x1024 | 1024x1792. gpt-image-1 supports 1024x1024 | 1024x1536 | 1536x1024 | auto. Landscape and portrait are 1536×1024 and 1024×1536, not 1792×1024 and 1024×1792.

New parameters you probably want to set explicitly: output_format (png / jpeg / webp), output_compression (for jpeg/webp), and moderation (low / auto).
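
Put together, a minimal sketch of the migrated call with the Python SDK, under the field changes above (the prompt wording, filename, and quality choice are illustrative):

import base64
from openai import OpenAI

client = OpenAI()

result = client.images.generate(
    model="gpt-image-1",
    prompt="a watercolor fox, vivid saturated style",  # style folded into the prompt
    n=1,
    size="1024x1024",
    quality="medium",        # was "hd" / "standard"
    output_format="png",
    # no response_format, no style
)

image_bytes = base64.b64decode(result.data[0].b64_json)
with open("fox.png", "wb") as f:  # or upload to your own object storage
    f.write(image_bytes)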

The cost model flipped

This is the migration gotcha nobody flags in the deprecation notice.

DALL·E 3 was billed per image: $0.040 for standard 1024×1024, $0.080 for HD. You could forecast cost by counting images.

gpt-image-1 is billed per token: input text tokens, input image tokens (for edits), and output image tokens. A single medium-quality 1024×1024 generation is roughly 1,000 output image tokens at $40 per million, so around $0.04. The high setting is several times that. gpt-image-1-mini is cheaper but still token-billed.

If your current cost forecast is images_per_month × $0.04, that forecast is wrong after May 12 in ways that depend on your prompt length and quality setting. Re-model before you switch, not after the first invoice.
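
Back-of-envelope, using only the numbers quoted above (the token count and per-token price are assumptions taken from this post, not live pricing; input text tokens are ignored, which stops being safe with long prompts):

def monthly_image_cost(images: int, output_tokens_per_image: int = 1_000,
                       usd_per_million_output_tokens: float = 40.0) -> float:
    # Token-billed estimate; ignores input text and input image tokens.
    return images * output_tokens_per_image * usd_per_million_output_tokens / 1e6

print(monthly_image_cost(10_000))  # ~$400/month at medium quality

At medium quality that lands near the old $0.04-per-image math; swap in a high-quality token count and the same volume costs several times more.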

The pattern this fits

OpenAI's recent deprecation cadence:

  • October 2024: gpt-3.5-turbo-0301, gpt-3.5-turbo-0613 retired
  • June 2025: gpt-4-0314, gpt-4-32k-0314 retired
  • January 2026: text-moderation-007 moved to legacy
  • May 12, 2026: dall-e-2, dall-e-3 retired
  • Announced: Assistants API sunset, targeted for August 26, 2026

Every one of these was announced with at least 60 days of notice. Every one of them broke production for teams that didn't see the notice. The pattern isn't a surprise attack — it's that nobody instruments the provider's public surface (models list, deprecation page, error shapes) as a monitored contract.

What actually catches this

A dependency on a third-party API is a silent coupling. The provider changes the shape of what they return, or stops accepting a model ID, and your code — unchanged — starts failing. CI doesn't catch it because CI runs against fixtures or mocks. Unit tests don't catch it because the SDK still type-checks.

Two things catch it:

  1. Integration tests that hit the real API on a schedule (nightly is enough), against every model ID and request shape you rely on. When the shape drifts, the test turns red.
  2. Monitoring the provider's models list and documented error shapes as a diffable contract. GET https://api.openai.com/v1/models tells you which IDs are live. When dall-e-3 stops appearing there, you want the alert before the next deploy. A minimal sketch of this check follows the list.
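
A sketch of the second approach, assuming the requests library and a daily cron job (known_models.json and the print-as-alert are placeholders):

import json
import os
from pathlib import Path

import requests

SNAPSHOT = Path("known_models.json")  # wherever your snapshot lives

resp = requests.get(
    "https://api.openai.com/v1/models",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    timeout=30,
)
resp.raise_for_status()
live = {m["id"] for m in resp.json()["data"]}

previous = set(json.loads(SNAPSHOT.read_text())) if SNAPSHOT.exists() else live
removed = previous - live
if removed:
    # Replace print with whatever actually pages you: Slack webhook, PagerDuty.
    print(f"ALERT: model IDs removed from /v1/models: {sorted(removed)}")

SNAPSHOT.write_text(json.dumps(sorted(live)))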

Either approach works. What doesn't work is hoping the deprecation email lands in an inbox someone reads.

Minimum-viable fix for this one

  1. git grep for every DALL·E model ID across every repo that talks to OpenAI.
  2. Replace with gpt-image-1 (or gpt-image-1-mini if cost-sensitive).
  3. Remove response_format, style, and any quality: "hd" / quality: "standard" from the request body (the shim after this list handles these mechanically).
  4. Update response-handling code to decode b64_json instead of fetching url. Add your own storage step if your product needed hosted URLs.
  5. Re-forecast cost against the token-billed model. Run a sample batch through gpt-image-1 at your expected quality setting and multiply out.
  6. Check that the old model IDs aren't still in staging, demo apps, or onboarding example code.
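
Steps 2 and 3 can be one mechanical shim while you migrate call sites, as sketched below (the standard-to-medium quality mapping is a judgment call, not an official equivalence):

# Maps a DALL·E 3 request body onto gpt-image-1. A sketch, not exhaustive.
_QUALITY = {"hd": "high", "standard": "medium"}
_SIZE = {"1792x1024": "1536x1024", "1024x1792": "1024x1536"}

def migrate_image_params(params: dict) -> dict:
    p = dict(params)
    p["model"] = "gpt-image-1"
    p.pop("response_format", None)   # b64_json is the only output now
    style = p.pop("style", None)     # fold the old vivid/natural hint into the prompt
    if style:
        p["prompt"] = f"{p['prompt']}, {style} style"
    if p.get("quality") in _QUALITY:
        p["quality"] = _QUALITY[p["quality"]]
    if p.get("size") in _SIZE:
        p["size"] = _SIZE[p["size"]]
    return p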

If this is the third or fourth time a provider change has broken production without warning, the problem isn't this specific deprecation. It's that you're finding out from alerts instead of ahead of time.


FlareCanary monitors REST APIs and MCP servers for schema drift — including models-list and error-shape changes from AI providers. Free tier covers 5 endpoints with daily checks.
