GPT-5.4 dropped on March 5th. The internet did its thing: a thousand hot takes, a hundred "this changes everything" threads, the usual cycle.
I want to try something different. I want to actually explain what this number means, because I've spent the last few weeks thinking about it and I think most of the coverage missed the point entirely.
Start Here: What Is a Token?
Before we get to a million of them, let's be precise about what we're counting.
A token is not a word. It's closer to a syllable: a chunk of text that the model processes as a single unit. "Running" might be one token. "Unbelievable" might be two. Numbers, punctuation, spaces: those are tokens too. As a rough rule of thumb, 1,000 tokens is about 750 words.
So when OpenAI says 1,000,000 tokens, that's roughly 750,000 words of context that GPT-5.4 can hold in its working memory at once.
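The 1,000-tokens-per-750-words rule of thumb is easy to turn into a back-of-envelope calculator. This is a sketch of that heuristic only; a real BPE tokenizer will count differently, and the function names here are my own, not anything from an OpenAI library.

```python
# Rough token estimate from the ~1,000-tokens-per-750-words rule of thumb.
# A real tokenizer will differ; this is only a back-of-envelope check.

def estimate_tokens(text: str, words_per_token: float = 0.75) -> int:
    """Estimate token count from word count using the ~0.75 words/token heuristic."""
    word_count = len(text.split())
    return round(word_count / words_per_token)

def fits_in_window(text: str, window: int = 1_000_000) -> bool:
    """Would this text fit in a 1M-token context window, by the rough estimate?"""
    return estimate_tokens(text) <= window

# The Harry Potter series is ~1.08M words: 1_080_000 / 0.75 is ~1.44M
# estimated tokens, so by this heuristic it would NOT fit whole.
```

For precise counts you'd use the model's actual tokenizer rather than this word-count heuristic, but the heuristic is close enough for deciding whether a corpus is in the right ballpark.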
750,000 words.
The entire Harry Potter series is about 1.08 million words. So GPT-5.4 can hold roughly two-thirds of the Harry Potter series in its head, and reason across all of it at the same time. Not read it once and forget. Hold it. Actively.
That's what we're talking about.
The Problem Previous Models Had That Nobody Talked About Enough
Every LLM before this had a context window problem. It was real and it was annoying and most people who used these tools professionally hit it constantly.
Here's how it showed up in practice. You'd load a long document into GPT-4. Ask questions about section 7. It would answer well. Ask a question that required connecting section 2 to section 14 to the appendix — and it would hallucinate, hedge, or quietly ignore the parts it couldn't hold anymore.
The technical term is "lost in the middle." Research from Stanford in 2023 showed that language models are dramatically better at using information from the beginning and end of their context window than information buried in the middle. Give a model a 100-page document and the crucial fact is on page 50? It might as well not be there.
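"Lost in the middle" is usually measured with needle-in-a-haystack probes: bury one fact at a chosen depth in filler text and check whether the model can retrieve it. Here's a minimal sketch of how such a probe is constructed; the model call itself is omitted, and the filler and needle strings are obviously my own placeholders.

```python
# Minimal needle-in-a-haystack probe: bury one fact at a chosen depth in
# filler text, then ask the model to retrieve it. This builds the prompts
# only; the actual model call is left out.

FILLER = "The quick brown fox jumps over the lazy dog. "
NEEDLE = "The secret launch code is 7403. "

def build_haystack(total_sentences: int, depth: float) -> str:
    """Place NEEDLE at `depth` (0.0 = start, 1.0 = end) among filler sentences."""
    position = int(total_sentences * depth)
    sentences = [FILLER] * total_sentences
    sentences.insert(position, NEEDLE)
    return "".join(sentences)

# Sweep depths: the Stanford result says retrieval is worst near the middle.
probes = {d: build_haystack(1000, d) for d in (0.0, 0.25, 0.5, 0.75, 1.0)}
```

Running the same question against each depth, and each context length, is how you'd check whether a model's attention actually holds up across a long window rather than just at its edges.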
A 1M token window doesn't automatically solve "lost in the middle." But it changes the shape of the problem. When you have so much room that you don't have to make hard choices about what to include, the retrieval architecture can be more deliberate. OpenAI has clearly been working on this: early testing suggests GPT-5.4 holds its attention across long documents more reliably than its predecessors.
Not perfectly. But more reliably.
The Benchmark That Actually Matters Here
The headline score is 83% on something called GDPVal.
Most AI benchmarks are kind of useless for normal people to interpret. "Achieved 94% on MMLU": okay, but what does that mean? MMLU tests knowledge recall. That's a library, not a thinker.
GDPVal is different in an important way. It measures performance on tasks that have real economic value — the kind of work actual companies pay actual humans to do. Legal analysis. Financial modeling. Code review. Research synthesis. The tasks that make up knowledge work.
83% on that benchmark puts GPT-5.4 at or above human expert level.
I want to be careful here, because "human expert level" is doing a lot of work in that sentence. Human experts vary enormously. The benchmark comparison is against a specific sample. There are categories within it where GPT-5.4 is weaker. This is not me saying "AI is now as good as a lawyer."
But 83% on GDPVal is not a number you can handwave away either. The previous best score on that benchmark was in the mid-70s. This is a real jump. And the combination of that reasoning capability with a million token context window changes what you can actually ask the model to do.
What This Actually Changes, Concretely
Let me give you three specific things that become possible now that weren't practical before.
First: Whole-document legal review.
Previously, if you had a 400-page merger agreement and you wanted an AI to flag every clause that created liability exposure across the entire document, you'd have to chunk it. Feed it in sections. Hope the model's understanding of section 3 was still active when it got to section 287. Hope it could connect the indemnification clause to the arbitration clause two hundred pages later.
Now you can feed the whole thing. Ask for a coherent analysis across the entire document. Get something that treats it as a single artifact rather than 30 disconnected chunks.
This is not replacing lawyers. It's changing what the first pass of due diligence looks like.
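For concreteness, the chunking workaround described above looks something like this. The chunk and overlap sizes are illustrative, measured in words; real pipelines tune them per tokenizer and task.

```python
# The old workaround: split a long document into overlapping chunks small
# enough for the context window. Overlap reduces (but doesn't eliminate)
# the chance a clause gets severed from the language that modifies it.
# Sizes here are illustrative, measured in words.

def chunk_document(words: list[str], chunk_size: int = 3000,
                   overlap: int = 300) -> list[list[str]]:
    """Split a word list into chunks of `chunk_size` sharing `overlap` words."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break
    return chunks

# A 400-page agreement (~120,000 words) becomes ~45 chunks at these settings;
# cross-references 200 pages apart never share a chunk.
```

That last comment is the whole problem in miniature: no amount of overlap connects an indemnification clause to an arbitration clause 200 pages away, which is exactly what a single 1M-token pass can do.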
Second: Codebase-level reasoning.
Most serious software projects have codebases with hundreds of thousands of lines of code. Asking an AI to refactor a function is useful. Asking it to understand how a change in this module propagates to that service, which interacts with those three APIs, which was written by a team that made these specific assumptions: that requires the model to hold the entire codebase in context.
GPT-4 could do this for small projects. GPT-5.4 can do this for medium-to-large ones. The senior engineers I've talked to about this are the ones who seem most shaken by it, honestly. They're the people whose value has always been knowing the whole system. That's what just got cheaper.
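Whether a given codebase fits in a 1M-token window is itself a back-of-envelope calculation. This sketch uses the common ~4-characters-per-token rule of thumb for code, which is an approximation, not a measured constant; the function names and file extensions are my own choices.

```python
# Back-of-envelope check: does a codebase fit in a 1M-token window?
# Uses the rough ~4-characters-per-token heuristic for source code;
# real tokenizers vary by language and coding style.

from pathlib import Path

CHARS_PER_TOKEN = 4  # rough heuristic, not a measured constant

def estimate_repo_tokens(root: str,
                         extensions: tuple[str, ...] = (".py", ".ts", ".go")) -> int:
    """Sum estimated tokens across source files under `root`."""
    total_chars = 0
    for path in Path(root).rglob("*"):
        if path.suffix in extensions and path.is_file():
            total_chars += len(path.read_text(errors="ignore"))
    return total_chars // CHARS_PER_TOKEN

def fits_in_context(root: str, window: int = 1_000_000) -> bool:
    return estimate_repo_tokens(root) <= window

# ~300,000 lines at ~40 chars/line is ~12M chars, or ~3M tokens: still too
# big to inline whole. A medium service (~100k lines) lands near 1M tokens.
```

By this arithmetic, "medium-to-large" codebases sitting at or under a few hundred thousand lines are exactly the boundary the new window moves.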
Third: Longitudinal research synthesis.
Say you want to understand what the academic literature on, I don't know, attention mechanisms in transformer models actually says — not the popular summaries, the actual papers, the debates, the contradictions, the methodological criticisms. That might be 200 papers. 300. You can now feed GPT-5.4 a significant chunk of a research literature and ask it to synthesize, identify contradictions, find the questions nobody's answering.
Literature reviews take PhD students months. I'm not saying AI does this as well as a careful human scholar. I'm saying the gap is smaller than it was and getting smaller.
The Part I'm Genuinely Uncertain About
I want to be honest about something.
Every time a new model drops, there's a wave of "this changes everything" content and then a wave of "actually it's not that impressive" content. Both are usually wrong in specific ways. The first wave overstates; the second wave tests the model on tasks it's not designed for and concludes it's overhyped.
GPT-5.4 is genuinely impressive. The context window increase is real and useful. The GDPVal score is real and meaningful.
But I keep coming back to one thing: context window and raw capability are different axes. A model can hold a million tokens and still reason poorly about what's in them. Retrieval is not understanding. Holding information in context is not the same as having good judgment about what matters.
The question I don't think anyone has a clean answer to yet is: as the context window grows, does the quality of attention across that window hold up? Does having access to a 700,000-word document actually lead to better reasoning, or does it lead to the model confidently using irrelevant information it wouldn't have encountered in a smaller window?
Early evidence suggests GPT-5.4 handles this better than expected. But this is still an open question in a real way, and I'd be suspicious of anyone who tells you with confidence how it fully plays out.
What I Actually Did With It
I want to give you something concrete because I think "here is what this model scored on benchmarks" is only so useful.
I took the last three years of quarterly earnings calls from five tech companies (Meta, Microsoft, Google, Amazon, Apple), transcribed and timestamped: roughly 900,000 words total. Fed the whole corpus into GPT-5.4 and asked it to identify every instance where an executive made a specific forward-looking claim about AI investment returns, then tell me how those claims aged quarter over quarter.
The result was not perfect. It missed a few things. It occasionally confused two executives with similar speaking styles. There were two moments where it summarized a position as slightly stronger than what was actually said.
But it also identified a pattern I hadn't noticed: Google's Sundar Pichai has been notably more hedged about AI monetization timelines in his language since Q3 2024, while simultaneously becoming more specific about infrastructure numbers. That contrast (confident about capex, careful about revenue) is not something I would have easily spotted reading these calls one at a time over three years.
That's the kind of thing that's now practical to do in an afternoon.
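The mechanical part of that afternoon is mostly corpus assembly. Here's a sketch of what it looks like: transcripts are assumed to already exist as plain text, the model call itself is omitted, and the prompt wording and tag format are my own, not a reproduction of what I'm told the author ran.

```python
# Sketch of the corpus assembly for a cross-transcript analysis. Tagging
# each transcript with company and quarter lets the model's answers be
# traced back to a source. The actual model call is omitted.

def assemble_corpus(transcripts: dict[tuple[str, str], str]) -> str:
    """Join {(company, quarter): text} into one tagged corpus string."""
    sections = []
    for (company, quarter), text in sorted(transcripts.items()):
        sections.append(f"=== {company} {quarter} ===\n{text}")
    return "\n\n".join(sections)

PROMPT = (
    "Across the transcripts below, identify every specific forward-looking "
    "claim an executive made about AI investment returns, and describe how "
    "each claim changed quarter over quarter. Cite the company and quarter."
)

corpus = assemble_corpus({
    ("Meta", "2024-Q3"): "…transcript text…",
    ("Microsoft", "2024-Q3"): "…transcript text…",
})
full_prompt = PROMPT + "\n\n" + corpus
```

The sorted, tagged layout matters more than it looks: asking the model to cite company and quarter in its answer is what makes its claims checkable against the source afterward.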
The Honest Conclusion
I don't think the world ended when GPT-5.4 launched. The world didn't end when GPT-4 launched either. These things accumulate rather than rupture.
But I do think this release is quietly significant in a way that the noise around it obscures. It's not significant because of the context window number. It's significant because of what that number makes possible in combination with the capability level: a tool you can have a coherent analytical conversation with about very large bodies of information now exists, is accessible, and is getting cheaper every month.
That matters for some jobs more than others. It matters a lot for jobs that are essentially about synthesizing large amounts of text into insight: analysts, researchers, lawyers, certain kinds of writers, certain kinds of consultants.
If your job is that kind of job, I think the honest thing to say is: pay attention. Not because GPT-5.4 is coming for you tomorrow. But because the version of this that comes out in 18 months is going to be a lot better than this one, and this one is already genuinely useful.
The context window expanding is almost beside the point. The capability sitting behind it is the part worth watching.
If this was useful, clap 50 times — it literally takes two seconds and it changes who Medium shows this to. And if you want the next one in your inbox, subscribe below.