DEV Community: Swarit Shukla

Nvidia Vs Everyone: Why everyone is building their own accelerators

Swarit Shukla — Mon, 04 May 2026 06:41:50 +0000

There has been a race in the AI industry where companies are now preferring their own custom accelerators to train AI models and generate inference from them. But why? Why all of a sudden is everyone ditching Nvidia and either building their own accelerators or just buying from the ones that make them? For example, Anthropic uses Google’s TPUs (Tensor Processing Units), Amazon is offering their own Trainium chips, and OpenAI is hinting at making their own custom accelerators. The answer is fairly simple, and it’s not because they don’t want Nvidia to dominate over the market; it’s because their accelerator chips are faster than those Nvidia GPUs. So, let’s discuss this in detail.

Why is everyone building their own accelerators?
Nvidia has a reputation of making the best accelerated computing chips in the world, and for a long time everyone used them, but in recent years we are seeing a shift from GPUs to ASICs (Application-Specific Integrated Circuits).

Once an AI model has been trained, running and constantly generating inference from it using GPUs is not very economical. The cloud service providers and hyperscalers are looking for custom chips that strip away the unnecessary components of GPUs, handling only a specific work, hence reducing the TCO (Total Cost of Ownership).

The current data centers are power-hungry; they consume multiple gigawatts of power. Many CEOs, including Sam Altman, say that power is the ultimate bottleneck for AI. So to minimize power consumption, companies are turning to ASIC chips which increase performance per watt.

Companies want control on their hardware; relying solely on Nvidia means fighting for the GPUs and waiting in the queue. By having their own chips, they are simply reducing their dependency on Nvidia.

The difference between general purpose accelerators and application-specific accelerators
You can say that AI models are just mathematical equations, so when we say transformer architecture, we are just referring to a sequence of mathematical operations that approximately looks like:

Multiply matrices (weights and input)
Add some numbers (bias)
Run softmax (a mathematical function)
Move to the next layer
How a general purpose accelerator handles it:
A general purpose accelerator doesn’t have any idea what a transformer architecture looks like; it just knows how to do basic maths. So it fetches the instructions to follow, it looks like:

Check the memory for instructions (it says multiply)
Get number a and b from the memory
Send them to the calculator unit
Save the answer back to the memory
Check the memory for the next instruction
Most of the energy is spent moving data between memory and compute, including reading instructions.

How an ASIC handles it:
We already know what maths is coming in the sequence so we don’t need to fetch instructions and move data in and out of the memory, so they take the physical logic gates that multiply matrices and connect their output to the logic gates that do softmax.

In ASIC chips there is no instruction reading. The data enters and the laws of physics force it to go through a set sequence that generates the output.

It’s like an automated car wash, you just take the car in and a clean car comes out. No interventions are being made, no directions are given, just the car goes in and comes out clean.

How is Nvidia different from others?
Nvidia’s biggest moat isn’t their chip designs, it’s “CUDA” and “SUPPLY CHAIN”.

CUDA stands for Compute Unified Device Architecture. It’s a parallel computing platform that enables developers to use any CUDA-enabled GPU as a general purpose computing device. Millions of developers write software for CUDA; if you wrote a piece of software and it works on your Nvidia GPU, then it will surely work on any NVIDIA GPU.

So if you are a startup writing code for Nvidia GPUs, you instantly unlock the deployment for every cloud service because every cloud service company in the world runs Nvidia GPUs.

While watching the Jensen Huang podcast with Dwarkesh Patel, I realized that Nvidia has invested heavily to eradicate the bottlenecks. They determine bottlenecks in the supply chain prior, and go all in to make sure that it doesn’t bother them.

They have invested heavily in CoWoS packaging technology, silicon photonics (interconnects using light), and HBM (High Bandwidth Memory). Nvidia handles the upstream very well so that it can stay ahead of the competition.

Why are Nvidia’s chips still better than ASIC chips?
When you compare an ASIC with accelerators, on paper surely you will see ASICs outperforming Nvidia’s accelerators, but in real life that’s not the case. Nvidia accelerators give the best performance per watt in today’s time.

Nvidia doesn’t just sell the accelerators and rest, they have a lot of high-performing software engineers that continuously optimize the lowest-level code that runs on the accelerators. Nvidia helps the companies to optimize their code so they can run super fast on their GPUs.

Jensen, while on the podcast with Dwarkesh Patel, gave a very good analogy. He said anyone can drive an F1 car, but to push it to its absolute limits you need to have a lot of expertise. He was referring to F1 cars as his accelerators here; they send their best software engineers to the AI labs and optimize their software so that it runs best on Nvidia hardware.

Here is the most important point: the bottleneck is not the chip, it’s the network. Modern AI models are not trained and run using a single chip; they are trained and run on multiple chips working together. So it doesn’t matter if your chip is super fast because most of the time it will be sitting idle and waiting for the data to arrive. Nvidia recognized this bottleneck a long time ago and acquired Mellanox, a company that builds high-speed networking. Nvidia builds NVLink switches and InfiniBand networking fabrics that connect them.

I would encourage you to think of Nvidia not as a chip design company but as a “DATACENTER MAKING COMPANY”.

Why isn’t Nvidia making ASIC chips?
So here is the real question: if NVIDIA can design the best chips and has got the best networks, then why aren’t they making ASIC chips themselves?

To answer this question let me ask you a different question. We know that ASIC chips are hardwired to follow a specific sequence of maths and give the output, but what if the fundamental architecture behind the AI models (which is Transformers) changes? The answer is those ASIC chips instantly go from accelerators to paperweights; they will not be able to perform computation because the architecture has been changed and the chips are hardwired to perform computation for a specific architecture.

The total addressable market (TAM) is bigger for general purpose accelerators. If NVIDIA builds ASICs, then the ones who will be buying them would be the companies who run Large Language Models. But by making general purpose accelerators, Nvidia can sell them to:

LLM-centric companies for running AI models
Medical companies for drug discovery by simulating molecular physics
Tesla for training their self-driving cars
The weather companies to predict the weather
So why go for a small one when you can get a bigger piece of the pie?

It’s not like Nvidia simply ignores this ASICs rise, they acknowledge it. If you study their GPUs closely you will find they have something called tensor cores and transformer engines; these sections are designed to make the Transformer-specific calculations fast, so NVIDIA gets an edge here.

What if someone comes with something truly revolutionary? It’s simple, they acquire them. That’s what happened when the startup “GROQ” came up with LPUs (Language Processing Units) which reduced the cost of inference by a lot and sped up the time it took for inference; they didn’t waste a single second to acquire their core technology and talent.

I hope this article gives you good information about the current landscape of AI hardware. Thanks for reading :)
By Swarit Shukla

The Elegance of MoE: How Gemma 4's 26B Model Runs Like a 4B Model

Swarit Shukla — Sun, 12 Apr 2026 06:57:28 +0000

Google recently dropped its new family of open-source AI models, Gemma 4, but the variant that truly captured my interest is Gemma-4-26B-A4B-IT. The question is: how can a 26 billion parameter model only activate 4 billion parameters at a time? This is where the elegance lies. By only activating 4 billion parameters, it reduces the cost of compute a lot. So what’s the magic behind this? It turns out it uses a clever architecture called MoE (Mixture of Experts) that lets the model choose experts, and hence it only activates 4 billion parameters, making it extremely fast and compute-efficient.

A Mixture of Experts model is not a giant monolith. Internally, it is divided into experts (for example, 128). Experts specialize in different fields like coding, physics, calculus, and literature. So instead of using a giant neural network, it uses smaller expert neural networks. Note that these experts are not predefined—the neural network learns this itself during backpropagation.

Dense models vs Mixture of Experts
Traditional dense models differ from Mixture of Experts. In a dense model, each input token is fed to all of the parameters—but not in the case of MoE.

MoE uses a router that assigns a token to only the top k experts (usually 2 or 8). The router (which is a neural network) takes in the token as input and then computes probabilities for all the experts. The top two with the highest probability are assigned that token.

So at a time, only four billion parameters are activated, and the remaining 22 billion sit idle.

The restaurant analogy
Think of it this way—there are two restaurants:

Dense restaurant
You place an order.
In a dense restaurant, that order gets passed to every chef, and every chef works on it. It doesn’t matter if the order is for pasta—even the dessert chef will work on it. After every chef works on the dish, they create a delicious pasta.
MoE restaurant
This is where the router—the manager of experts—comes into play.

In an MoE, instead of the order directly going to the chefs, it first goes to the manager. The manager then decides which two chefs will work on the dish. If the dish ordered is pasta, then the two chefs working on it would be:

The Vegetable Chef (Entremetier): Boils the starch (the pasta noodles)
The Sauce Chef (Saucier): Cooks the hot, savory meat sauce to pour over the top
Together, they create a delicious pasta without making all the chefs in the restaurant work on it. It’s as good as the one made by a dense restaurant, but with fewer chefs involved.

(The idea for this analogy came from The Bear show—it’s an amazing show, by the way. Check it out.)

Total vs Active parameters
Total parameters – This represents the amount of diverse knowledge an LLM has. Let’s say a model has 128 experts and 26 billion parameters. Those 26 billion parameters are spread across 128 experts in their fields. Some are good at math, some at literature, and they might also go niche—like an expert in pop culture, movies, and music.

Active parameters – This represents the compute cost of the model. So if a model has 26 billion parameters but only activates 4 billion at a time, the model’s compute cost and response time become that of a 4 billion parameter model.

The vRAM twist
Even though the model becomes extremely efficient at generating inference and reduces the compute cost significantly, there’s still the angle of vRAM.

It doesn’t matter if the model activates only 4 billion parameters at a time—you still have to load the whole 26 billion parameters into your vRAM. So even though your low-end hardware can run the model efficiently, to load the model you will still need enough vRAM, which may force you to use a high-end system.

It might give you fast responses and consume less energy, but you will still need a powerful device.

An intuitive demonstration
Let’s say you input the sequence: “Indian cuisine is very…“ and the LLM has to complete it.

Input – The token “Indian” arrives
Router’s evaluation – Based on mathematical evaluation of the token “Indian”, the router selects the top 2 experts, which could be the ones that specialize in geography (#34) and food (#87). (Modern LLMs consist of multiple layers stacked together, so the router assigns the token to different experts repeatedly as it goes deeper into the architecture.)
Computation – Only the parameters in experts #34 and #87 get activated and used for computation, while the remaining parameters stay idle
Repetition – The model repeats the process, but this time the router might choose completely different experts based on the next token
The History
It might seem like a very novel idea to most people, but the actual concept was introduced more than three decades ago, in 1991, by Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and the “Godfather of AI,” Geoffrey E. Hinton. The paper was titled “Adaptive Mixtures of Local Experts.”

The modern implementation of this idea came in 2017. The paper, titled “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer,” was written by Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean (the Google Brain team). This paper further refined the idea by introducing the concept of sparsity—it forced the neural network to activate only a small number of parameters at a time, making them highly efficient.

by Swarit Shukla