Codex 5.4 vs 5.5 pricing and quality

#programming #ai

You can get results very close to GPT 5.5 by using GPT 5.4 with a highly detailed prompt.

I ran a small test to check this. I had both GPT 5.4 and GPT 5.5 summarize the same technical content at four prompt detail levels (Low, Medium, High, XHigh), which gave 8 outputs in total. Then I asked ChatGPT to rank all 8 outputs blind, without giving it any scoring categories or guidelines, so my own preferences wouldn’t influence the result.
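For anyone who wants to repeat the setup, here is a minimal sketch of the 2 x 4 grid and the blinding step. It only builds the list of runs and hides which model produced what. The `Output A` style labels, the `build_blind_grid` name, and the seed are my own placeholders, not something from the original test, and no model is actually called here.

```python
import itertools
import random

MODELS = ["GPT 5.4", "GPT 5.5"]
LEVELS = ["Low", "Medium", "High", "XHigh"]


def build_blind_grid(seed=0):
    """Return anonymous labels plus a private key mapping label -> (model, level).

    The labels are what the judge sees; the key stays with the experimenter.
    """
    # Every model paired with every prompt detail level: 2 x 4 = 8 runs.
    configs = list(itertools.product(MODELS, LEVELS))

    # Shuffle so label order carries no hint about model or level.
    rng = random.Random(seed)
    rng.shuffle(configs)

    # Hypothetical labelling scheme: "Output A" .. "Output H".
    key = {f"Output {chr(ord('A') + i)}": cfg for i, cfg in enumerate(configs)}
    return list(key), key


labels, key = build_blind_grid(seed=42)
print(labels)  # what the judge is shown
```

Only `labels` (and the matching texts) would go to the judge. `key` is used afterwards to map the ranking back to model and detail level.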

Here’s how it turned out:

Rankings (1 = best):

1. GPT 5.5 XHigh: 9.4/10
   Best overall balance of technical depth, accuracy, and framing.
2. GPT 5.4 XHigh: 9.0/10
   Extremely close to the top. Clean, well-structured, and strong.
3. GPT 5.4 High: 8.7/10
   Solid and grounded, with good references to the source material.
4. GPT 5.5 Medium: 8.5/10
5. GPT 5.5 High: 8.5/10
   Both clear and reliable (tied on score).
6. GPT 5.5 Low: 8.3/10
   Held up surprisingly well for a lighter prompt.
7. GPT 5.4 Medium: 8.0/10
8. GPT 5.4 Low: 7.6/10

Main takeaway:
Once you go all-in on prompt detail (XHigh), the gap between 5.4 and 5.5 shrinks to 0.4 points (9.0 vs 9.4), down from 0.7 at Low (7.6 vs 8.3). That makes GPT 5.4 with a detailed prompt a practical, lower-cost option without losing much quality.
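To make the gap concrete, here is a small sketch that computes the 5.5 minus 5.4 score difference at each detail level. The scores are copied from the ranking above; the `gap_by_level` helper is just an illustrative name.

```python
# Scores as reported in the ranking above (judge score out of 10).
SCORES = {
    ("GPT 5.5", "XHigh"): 9.4,
    ("GPT 5.4", "XHigh"): 9.0,
    ("GPT 5.4", "High"): 8.7,
    ("GPT 5.5", "Medium"): 8.5,
    ("GPT 5.5", "High"): 8.5,
    ("GPT 5.5", "Low"): 8.3,
    ("GPT 5.4", "Medium"): 8.0,
    ("GPT 5.4", "Low"): 7.6,
}
LEVELS = ["Low", "Medium", "High", "XHigh"]


def gap_by_level(scores):
    """Score of GPT 5.5 minus score of GPT 5.4 at each prompt detail level."""
    return {
        level: round(scores[("GPT 5.5", level)] - scores[("GPT 5.4", level)], 1)
        for level in LEVELS
    }


gaps = gap_by_level(SCORES)
for level, gap in gaps.items():
    print(f"{level:<6} {gap:+.1f}")
```

This prints +0.7 for Low, +0.5 for Medium, -0.2 for High, and +0.4 for XHigh. The gap does not shrink steadily: at High, GPT 5.4 actually scored above GPT 5.5 in this single run.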

Top comments (1)

Nazar Boyko • Jun 21

The takeaway that more detail closes most of the gap matches what I'd expect, and it's a handy way to save on cost. The bit I'd hold loosely is the judge. Asking a model to rank eight outputs, some of them its own, leans on whatever style that model already likes, so the scores aren't quite neutral. The spread is also tight, from 7.6 to 9.4, tight enough that a re-run could reshuffle the middle. If you ran the same ranking twice and it landed the same way, that would make the headline a lot stronger.
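One way to check whether a re-run "landed the same way" is a rank correlation between the two orderings. Below is a rough sketch using Kendall's tau, written in plain Python so it has no dependencies. The `kendall_tau` helper is illustrative, it assumes neither ranking contains ties, and the only real data is the first run's order from the post. There is no second run to compare against yet, so the demo uses the same list and its reverse purely to show the two extremes.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two orderings of the same items (no ties).

    Returns 1.0 for identical orderings and -1.0 for exactly reversed ones.
    """
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    if pos_a.keys() != pos_b.keys():
        raise ValueError("both rankings must contain the same items")

    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        # Positive product: both rankings order x and y the same way.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Order from the post. The two 8.5 entries are tied on score, so their
# relative order here is simply the order in which they were listed.
RUN_1 = [
    "5.5 XHigh", "5.4 XHigh", "5.4 High", "5.5 Medium",
    "5.5 High", "5.5 Low", "5.4 Medium", "5.4 Low",
]

print(kendall_tau(RUN_1, RUN_1))        # identical ordering
print(kendall_tau(RUN_1, RUN_1[::-1]))  # fully reversed ordering
```

With a real second ranking in place of the reversed list, a value close to 1.0 would suggest the judge's ordering is stable, while a noticeably lower value would mean the middle of the table is reshuffling between runs.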