A sysadmin's retrospective on three decades of GPU computing — from texture-memory hacks to CUDA monoculture, from demoscene dreams to enterprise myopia.
(Disclosure: I used Claude Sonnet to help with the writing and structuring)
My backstory: I learned to code before most people owned a computer
It was the early 1980s. I was five. My parents put a computer in the house. Back then there were only three kids at my elementary school with something that could pass for a computer. I don't know if they understood what they were starting. I certainly didn't. I just knew that if you talked to the machine in the right language, it would do things. That was enough.
By the time I was a teenager / young adult, "the right language" had expanded considerably. I was a hacker. Pick your hat color, I wore most of them: black, white, grey, whatever the situation called for. The demoscene pulled me in hard, because of course it did. If you've ever watched a 64KB executable push geometry and music and impossible effects through hardware that technically couldn't do it, you understand. The whole scene was about understanding limits deeply enough to find workarounds.
Find the constraint, find the exploit, make the hardware do something it wasn't designed to do: that mindset is what makes the GPU story so interesting to me personally. I watched the whole arc unfold in real time. I was rooting for it, yet it was a very frustrating "told you so" experience. I saw it coming, and I could have gotten rich off it if I'd been paying a different kind of attention. My money went to books, the early-internet phone bill, and hardware. PC, Sparc, SGI, PA-RISC, you name it.
Fixed pipelines and the first dream
Early GPUs had fixed pipelines. You couldn't program them — you could configure them. There was a rasterizer, there were texture units, there was a fixed path from geometry to pixel and you worked within it. For games, fine. For anything else, nope.
I dreamed about programmable GPUs the way some people dream about cars or houses. The raw parallelism was right there. Hundreds of tiny processors doing the same thing in parallel. If you could just tell them what to do rather than configure what they already knew how to do, you'd have a computer unlike anything available outside a supercomputer budget.
Then came shaders. Vertex shaders, pixel shaders. Programmable, finally ... for graphics. The operations were graphics-shaped. The inputs were graphics-shaped. The outputs landed on screen.
But there was a trick.
The texture memory hack: GPGPU before GPGPU existed
Here's the hack: a texture, at the hardware level, is just a 2D array of values that the GPU knows how to read very fast. If you squint hard enough, it's memory. And if it's memory, you can write your input data into it, run a shader that does arithmetic on those values, and read the result back out.
We were abusing the rendering pipeline as a compute pipeline. The outputs were technically "pixels" but they were actually numbers. This was genuinely useful. It was also genuinely painful. You had to think about your problem in terms of texture coordinates and fragment operations. Debugging was... creative.
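To make that concrete, here's a minimal sketch of the idea in Python with moderngl, a convenience library the era didn't have (back then it was raw OpenGL or DirectX host code, but the shape of the trick was the same). Two input arrays go in as floating-point textures, a fragment shader adds them, and the "image" you read back is just the element-wise sum. Sizes and names are illustrative, not from the original.

```python
# Hedged sketch: GPGPU through the rendering pipeline, in the spirit of the
# pre-CUDA texture hack. moderngl stands in for the raw OpenGL host code of
# the era; array sizes and names are illustrative.
import numpy as np
import moderngl

W = H = 256
a = np.random.rand(H, W).astype('f4')   # "input data" disguised as textures
b = np.random.rand(H, W).astype('f4')

ctx = moderngl.create_standalone_context()

prog = ctx.program(
    vertex_shader="""
        #version 330
        in vec2 in_pos;
        out vec2 v_uv;
        void main() {
            v_uv = in_pos * 0.5 + 0.5;          // map the quad to texture coords
            gl_Position = vec4(in_pos, 0.0, 1.0);
        }
    """,
    fragment_shader="""
        #version 330
        uniform sampler2D u_a;
        uniform sampler2D u_b;
        in vec2 v_uv;
        out float f_sum;
        void main() {
            // The "pixel" we output is really just arithmetic on the inputs.
            f_sum = texture(u_a, v_uv).r + texture(u_b, v_uv).r;
        }
    """,
)

tex_a = ctx.texture((W, H), 1, a.tobytes(), dtype='f4')
tex_b = ctx.texture((W, H), 1, b.tobytes(), dtype='f4')
tex_a.use(location=0)
tex_b.use(location=1)
prog['u_a'].value = 0
prog['u_b'].value = 1

# Render target: another float texture that will hold the result.
out_tex = ctx.texture((W, H), 1, dtype='f4')
fbo = ctx.framebuffer(color_attachments=[out_tex])
fbo.use()

# Full-screen quad, so the fragment shader runs once per output element.
quad = np.array([-1, -1, 1, -1, -1, 1, 1, 1], dtype='f4')
vbo = ctx.buffer(quad.tobytes())
vao = ctx.vertex_array(prog, [(vbo, '2f', 'in_pos')])
vao.render(moderngl.TRIANGLE_STRIP)

# Read the "pixels" back: they are just numbers.
result = np.frombuffer(fbo.read(components=1, dtype='f4'), dtype='f4').reshape(H, W)
assert np.allclose(result, a + b)
```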
But it worked. And a small community of people knew it worked, and we were waiting for a proper framework.
GPGPU arrives. Poorly. But it arrives.
The mid-2000s brought actual GPGPU frameworks. The hardware was getting more general. The abstractions were... less good. But the intent was there: use the GPU for compute, not just graphics.
The problem was that none of it was clean. You were still fighting against the graphics origins of the hardware. The programming models were awkward. It worked if you were determined enough and the problem was parallel enough but it was not something you'd hand to a sane programmer and expect sane results.
The world was not yet paying attention. That was fine. The world often isn't paying attention to the things that will matter most in fifteen years.
CUDA
Nvidia shipped CUDA and I had mixed feelings.
On one hand: finally.
Finally a coherent, well-designed abstraction for GPU compute. C-like syntax. Proper memory model. Actual documentation, books. A mental model that mapped to the hardware without requiring you to pretend you were drawing triangles. You wrote kernels, you launched them with a grid of threads, the hardware ran them in parallel. It made sense.
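Here's a hedged sketch of that mental model, written with Numba's CUDA bindings so it stays in Python like the other sketches in this post (the real thing is CUDA C, but the kernel-plus-grid structure is identical). It assumes an Nvidia GPU and the CUDA toolkit are installed; names and sizes are illustrative.

```python
# Hedged sketch of the CUDA mental model: write a kernel, launch it over a
# grid of threads, let the hardware run them in parallel.
import numpy as np
from numba import cuda

@cuda.jit
def add_kernel(a, b, out):
    i = cuda.grid(1)              # this thread's global index in the grid
    if i < out.size:              # guard: the grid may overshoot the array
        out[i] = a[i] + b[i]

n = 1_000_000
a = np.arange(n, dtype=np.float32)
b = 2.0 * a
out = np.zeros_like(a)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
add_kernel[blocks, threads_per_block](a, b, out)   # launch the grid
cuda.synchronize()

assert np.allclose(out, a + b)
```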
On the other hand: proprietary.
Nvidia hardware only, Nvidia toolchain, Nvidia ecosystem. Every piece of research, every library, every optimization trick you learned was locked to one vendor. That bothered me then and it bothers me now.
It worked on any hardware, as long as that hardware was Nvidia: from the cheapest gaming GPU to enterprise-grade clusters. And that's how you win an ecosystem. Developers first. Enterprise later.
OpenCL: the trillion-dollar mistake (in hindsight)
A vendor-neutral, open standard for GPU compute. Multiple hardware targets, portable code, no lock-in. Technically sound. Standards-body blessed.
Delivered too late and then abandoned in place.
I learned CUDA, I learned OpenCL. I focused on OpenCL because it was the right thing to do.
This is the part that still frustrates me: they released it and sat on their hands. Khronos published the spec. AMD and Intel and others said nice things about it. And then Nvidia kept shipping libraries, kept improving the developer experience, kept making CUDA the path of least resistance for anyone who wanted to do serious compute. OpenCL got maintenance updates. CUDA got cuDNN. That's not a fair fight.
The mistake wasn't making OpenCL. OpenCL existing is fine and maybe still matters for certain use cases. The mistake was treating "we published a standard" as equivalent to "we built an ecosystem." Those are not the same thing. An ecosystem is libraries, tutorials, Stack Overflow answers, framework integrations, debugging tools, profilers, and ten years of muscle memory in the research community. A standard is a document.
"OpenCL was the right answer to the wrong question. 'We published a standard' is not the same as 'we built an ecosystem.' Nvidia understood this. The standards committee did not."
AMD's catastrophic misread of how ecosystems actually work
ROCm launched. By that point, CUDA had a massive head start and the entire research community had built on top of it. ROCm was technically a reasonable effort. The problem was the strategy.
AMD decided that GPU compute was an enterprise product. Datacenter cards, enterprise support contracts, "we will certify this specific configuration" logic. Consumer hardware was excluded. The Radeon gaming cards that hundreds of thousands of developers and researchers actually owned? Not supported. Or barely supported. Or supported if you were willing to spend significant effort making it work, which is not the same as supported.
This was a catastrophic misread of how software ecosystems form.
Software ecosystems do not trickle down from enterprise to developers. They trickle up from developers and hobbyists and students and researchers to enterprise.
Every major platform that achieved ecosystem dominance did it the same way: make it easy for individuals to build things, let those individuals become the infrastructure of the larger world, and watch enterprise follow because the talent and the libraries and the knowledge are all already there.
The APU angle: the advantage AMD won't use
Here's the particular irony. AMD, unlike Nvidia, makes CPUs with integrated graphics. Ryzen APUs sit in laptops, mini PCs, entry-level desktops... all with GPU compute capability baked in. Every person who buys an AMD laptop has, in principle, a ROCm-capable device.
In principle. Because in practice, ROCm support for integrated AMD graphics, even on the high-end APUs (the AMD AI Max something something), is still, as of 2026, charitably described as "partially works if you try hard enough". As for low- and mid-range APUs: forget it.
For learning purposes, it doesn't need to be fast and it doesn't need to be useful. It needs to be cheap, to compile, and to run. AMD APUs can't even do that.
AMD has a massive potential developer base: millions of machines with their GPU silicon already in people's hands, already bought, already sitting on desks. And they are failing to use it.
They're still focusing on datacenter wins and workstation certifications and "enterprise-grade" positioning, still treating "works reliably on hobbyist hardware" as a nice-to-have rather than the entire game.
Nvidia doesn't have integrated graphics. Nvidia has no path to the casual developer market except selling them a gamer GPU. And yet Nvidia owns the developer market, because CUDA works on the GPU you already bought for gaming, and it compiles and runs.
Where this ends
Nvidia won the AI war because they bet on developers first while AMD bet on enterprise. They made CUDA clean, they made the libraries comprehensive, they made "get started" genuinely easy, and they did this for years before the deep learning explosion made GPU compute a trillion-dollar market. By the time everyone realized how important this infrastructure was, every other competitor had already left the field.
And now that the competition is trying to catch up on compute and developer frameworks, Nvidia has already won the next war: GPU interconnect.
AMD is playing catch-up on software with a fundamental strategic instinct that keeps pointing them toward the wrong end of the adoption curve. Ecosystems are self-reinforcing: more developers means more libraries means more tutorials means more developers.
Intel, for what it's worth, is in a similar position. They have the hardware, they have the silicon, and their framework and library stack is way better than AMD's. But they also have the "Intel Graveyard": a disastrous track record.
Intel has a notorious habit of developing groundbreaking hardware, only to ruthlessly execute the division the moment quarterly margins dip.
If you buy an Arc Pro B70 today, you get incredible hardware specs for the price. The risk is that in 18 months, if Intel’s core manufacturing business continues to hemorrhage cash, executive leadership might simply declare the Arc GPU division a "non-core asset." They could quietly reallocate the driver development team, leaving you holding a $1,000 brick that won't compile the next major version of PyTorch.
Given that track record, does the massive upfront cost saving of a 32GB B70 outweigh the risk of Intel pulling the plug on OneAPI support in a couple of years, or is the predictability of the Nvidia tax still the safer long-term bet?
Could this change? Yes. It would require AMD to treat consumer hardware support as a first-class priority, aggressively activate the APU developer base, fund the library ecosystem directly, and do all of this for five to eight years before expecting results. They're starting to understand some of this. They're probably starting too late, and they're definitely still not going far enough.
Meanwhile I'm running CUDA on Linux, the RTX is doing what it's told, and somewhere in a drawer there's a twenty-year-old version of me who understood all of this and still didn't buy the stock.
Such is life.
Apple accidentally enters the chat
Everything so far has been about companies that were trying to win GPU compute. Nvidia built CUDA because they wanted to own compute. AMD built ROCm because they had to respond. The whole story is intentional moves in a deliberate game.
Apple is the exception. Apple built Apple Silicon to solve a completely different problem: they wanted to stop paying Intel, stop being held hostage to Intel's roadmap, and build a laptop that didn't need a fan while lasting all day on battery. More specifically, the M1 was fundamentally a scaled-up iPhone chip — they took the architecture that powered iPads and iPhones and figured out how to bring it to desktop without destroying its power efficiency.
The unified memory architecture was a power efficiency play. It was Apple deciding that moving data between separate CPU and GPU memory pools over a bus was slow and wasteful and they could do better by putting everything on the same die with the same pool.
They were right. And three years later, the exact architectural feature Apple built to make video rendering faster — massive memory bandwidth, vast memory capacity, no PCIe bottleneck — turned out to be the exact hardware profile required to run massive language models locally. Apple stumbled backwards into a goldmine.
The critical insight is memory capacity. A discrete GPU is constrained by its VRAM. You can have a 500W H100 with 80GB of HBM and it's an extraordinary machine — but it's 80GB, and a model that doesn't fit requires multi-GPU infrastructure, which is expensive and complex.
Apple sells 256GB of unified memory in a metal cube on your desk, instead of a massive, noisy, hot, power-hungry GPU cluster. For a price, yes, but still vastly cheaper than 256GB of GPU VRAM.
And the software side is catching up, which is worth being precise about: MLX with its Metal backend, plus a growing library of tools that actually use the Neural Engine. The trickle-up dynamic again, except Apple didn't have to do anything strategic to trigger the first wave. They just shipped hardware with good properties and developers noticed.
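As a taste of what that looks like in practice, here's a hedged MLX sketch; the sizes are arbitrary, and it assumes an Apple Silicon Mac with the mlx package installed. The point is what's missing: there is no device-transfer step, because the arrays already live in the same unified memory pool the CPU uses.

```python
# Hedged sketch: MLX on Apple Silicon. Arrays live in unified memory, so
# there is no explicit host-to-device copy; the Metal backend just runs
# the computation. Sizes are illustrative.
import mlx.core as mx

a = mx.random.normal((4096, 4096))
b = mx.random.normal((4096, 4096))

c = a @ b          # queued lazily, executed on the GPU via Metal
mx.eval(c)         # force evaluation
print(c.shape, c.dtype)
```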
It works, from a cheap Mac mini to the ludicrously priced, memory-packed Mac Studio. And yet... still worth it.
A $5,000 Mac Studio is mathematically absurd for a standard consumer desktop. Apple charges a predatory markup on unified memory configurations and everyone in the industry knows it. But if you are an independent AI developer looking at $30,000 worth of enterprise Nvidia accelerators as the alternative, or the years-long waitlist and cloud bills that come with not having local hardware, the Apple tax suddenly looks like a bargain. Especially with the current memory price trend.
About renting a GPU instead of owning the hardware at home: I don't know about you, but it stresses me out to watch the bill climb while I'm busy reading the documentation (or watching that random YouTube video with a thick Hindi accent for the fifth time, because its author is apparently the only person on the planet who understands the cryptic debug log).
For inference, Apple hardware offers real capability. Not theoretical. Not "works if you squint".
The limitation is equally real: training models on Apple Silicon is not competitive with CUDA hardware. But. It. Works. (Yes, I'm looking at you, AMD.)
This isn't really a story about hardware. It never was.
The hardware was always roughly comparable. AMD has made competitive GPUs for years. Intel makes competitive silicon. Apple Silicon benchmarks are genuinely impressive. At almost every point in this history, the gap in raw compute between competitors was survivable. Companies have closed bigger gaps.
What nobody closed was the software gap. And the software gap wasn't built by Nvidia's marketing department or their enterprise sales team. It was built by every grad student who learned CUDA because it was the thing that worked on their gaming card. By every tutorial that assumed CUDA. By every framework that added CUDA support first and "maybe ROCm later." By every Stack Overflow answer that solved a CUDA problem and got indexed by Google and found by the next person with the same problem.
Ecosystems are built by individuals who just want to get something working. Nvidia understood this, consciously or not, from day one.
This isn't even new. Microsoft won the enterprise desktop back in the day by letting everyone pirate their Office suite, and by offering students whatever software they asked for as long as it was part of their university curriculum. Those students then got their first jobs knowing Excel, not Lotus 1-2-3.
The pattern is always the same: make it work for the person with one GPU in their bedroom, and eventually you own the datacenter.
AMD skipped the bedroom. Intel is trying to prove it won't pull the plug before it gets there. Apple wandered in through the window without knowing there was a door.
And Nvidia? Nvidia is busy winning the next war: GPU interconnect, NVLink, the infrastructure layer that determines how clusters scale.
Because they got the developers first, we now have an entire Nvidia economy, complete with geopolitical conflict and IP/trade wars, while everyone else is still fighting over the framework layer.
I've spent forty-odd years watching technology markets, and the consistent lesson is this: the companies that win developer mindshare early win everything later. Not because developers are the end customer. Because developers build the world the end customers live in.
Forty years later, I still don't own any stock. I still buy books and hardware instead. I'm still coding in my bedroom; the only thing that's changed is that I now have to share that budget with a motorcycle as well.