CPU vs GPU vs TPU vs NPU — What's Actually Different?

#ai #machinelearning #hardware #beginners

If you've ever shopped for a laptop or read a chip announcement, you've seen these four acronyms thrown around like they're interchangeable. They're not. Each one is a processor, but each was built to solve a different kind of math problem efficiently. Here's the difference, explained with real examples of where you'd actually run into each one.

The short version

CPU (Central Processing Unit): the generalist. It can run any kind of code and is at its best on sequential, branch-heavy work, but it is inefficient at massively parallel math.
GPU (Graphics Processing Unit): the parallel workhorse. Built for doing thousands of simple operations at once.
TPU (Tensor Processing Unit): Google's custom chip, built specifically to accelerate the matrix math behind neural networks.
NPU (Neural Processing Unit): a small, power-efficient accelerator built into phone and laptop chips to run AI tasks locally without draining your battery.

Now let's go one level deeper.

CPU: the all-purpose brain

A CPU has a small number of powerful cores (typically 4 to 16 in consumer devices) designed to execute instructions one after another extremely fast, while juggling many different kinds of tasks. It's optimized for sequential logic, branching decisions, and low-latency response — the kind of work where step 2 depends on the result of step 1.

This is why your CPU handles your operating system, runs your web browser, manages file systems, and executes business logic in your backend code. It's flexible enough to do almost anything, but that flexibility comes at a cost: it's not efficient at doing the same simple operation millions of times in parallel.

Example use case: Running a Node.js API server, compiling code, querying a database, or running an Excel spreadsheet with complex formulas. Any task that's logic-heavy and sequential.
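
To make the "step 2 depends on the result of step 1" idea concrete, here is a minimal Python sketch. The function name and the choice of the Collatz rule are illustrative assumptions, not a workload from this article. Every iteration branches on the current value and needs the previous result, so adding more cores would not make this loop finish sooner.

```python
def collatz_steps(n: int) -> int:
    """Count how many steps n takes to reach 1 under the Collatz rule.

    Each iteration depends on the value produced by the previous one,
    so the loop is inherently sequential.
    """
    steps = 0
    while n != 1:
        # Branching decision: the next value depends on the current one.
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps


print(collatz_steps(6))  # prints 8
```

This is the shape of work a CPU core is built for: a chain of dependent, branch-heavy steps executed with low latency.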

GPU: built for doing one thing, a million times at once

GPUs were originally built to render graphics, and rendering a frame means computing the color of millions of pixels independently and simultaneously. That requirement led to a very different architecture from the CPU: instead of a few powerful cores, a GPU has thousands of smaller, simpler cores that execute the same instruction across different pieces of data at the same time. This style of parallelism is known as SIMD (Single Instruction, Multiple Data); NVIDIA describes its GPU variant as SIMT (Single Instruction, Multiple Threads).
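
As a rough illustration of "same instruction, different data", the sketch below applies one operation to every pixel value in a tiny frame. It runs on the CPU in plain Python; the function name and the brightness operation are assumptions made for the example. The point is only that no pixel's result depends on any other pixel's, which is what lets a GPU hand each one to a different core.

```python
def brighten(pixels: list[int], amount: int) -> list[int]:
    """Apply the same brightness increase to every pixel value (0-255)."""
    # Each output element depends only on its own input element, so all
    # of them could be computed at the same time on parallel hardware.
    return [min(255, p + amount) for p in pixels]


frame = [0, 100, 200, 250]
print(brighten(frame, 10))  # prints [10, 110, 210, 255]
```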

It turns out that the math behind training neural networks — mostly matrix multiplications and additions — looks a lot like rendering pixels: lots of small, independent, repetitive operations. That's why GPUs became the default hardware for deep learning, even though that's not what they were originally designed for.
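
A plain-Python sketch shows why matrix multiplication parallelizes so well. The helper name is hypothetical, and real frameworks dispatch this to optimized GPU kernels rather than nested loops. Each output cell is its own dot product and never reads another output cell, so all of them can be computed independently.

```python
def matmul(a: list[list[float]], b: list[list[float]]) -> list[list[float]]:
    """Multiply matrix a (rows x inner) by matrix b (inner x cols)."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        # Cell (i, j) is the dot product of row i of a and column j of b.
        # It shares inputs with other cells but not results.
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # prints [[19, 22], [43, 50]]
```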

Example use case: Training a deep learning model in PyTorch or TensorFlow on an NVIDIA RTX or A100 card, rendering 3D scenes in a game engine, or running large-scale scientific simulations (fluid dynamics, weather modeling).

TPU: a chip designed specifically for neural network math

A TPU is Google's answer to the question "what if we stopped repurposing graphics hardware and built a chip purely for neural network workloads?" TPUs are Application-Specific Integrated Circuits (ASICs), meaning they're not general-purpose at all — they're wired at the hardware level to do one thing extremely fast: matrix multiplication, the core operation in both training and running neural networks.

The core component is a systolic array, a hardware design that long predates the TPU, in which data flows through a grid of processing units so that each value is reused by many units after a single read. That drastically reduces the memory movement that normally slows down matrix math. The tradeoff is flexibility: TPUs aren't great at general computing or even at every type of machine learning model, but for large-scale transformer and deep learning workloads, they can outperform GPUs on both speed and power efficiency.
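
The data flow in a systolic array is easier to see in a toy simulation. The sketch below is a simplified "output-stationary" variant written for illustration; it is an assumption about one way such a grid can work, not a description of how Google's TPU is actually wired. Each cell keeps a running sum, passes its operand from matrix A to the right and its operand from matrix B downward on every cycle, and only the cells on the left and top edges ever read from the input matrices.

```python
def systolic_matmul(a, b):
    """Simulate an n x p grid of multiply-accumulate cells computing a @ b.

    Returns the result matrix and the number of cycles simulated.
    """
    n, m, p = len(a), len(b), len(b[0])
    acc = [[0] * p for _ in range(n)]    # running sum held in each cell
    a_reg = [[0] * p for _ in range(n)]  # A operand currently in each cell
    b_reg = [[0] * p for _ in range(n)]  # B operand currently in each cell
    cycles = n + m + p - 2               # time for the last operands to meet

    for t in range(cycles):
        # Every cell hands its A operand to the right and B operand down.
        for i in range(n):
            for j in range(p - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        for j in range(p):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
        # Edge cells read fresh inputs; row i and column j are delayed by
        # i and j cycles so that matching operands arrive together.
        for i in range(n):
            k = t - i
            a_reg[i][0] = a[i][k] if 0 <= k < m else 0
        for j in range(p):
            k = t - j
            b_reg[0][j] = b[k][j] if 0 <= k < m else 0
        # Every cell multiplies its two operands and accumulates.
        for i in range(n):
            for j in range(p):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc, cycles


result, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result, cycles)  # prints [[19, 22], [43, 50]] 4
```

Notice that each input value is read from the matrices exactly once and then reused as it moves from cell to cell, which is the memory-traffic saving described above.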

The TPUs described here aren't something you buy and plug into a desktop; they live in Google's data centers and are accessed via Google Cloud or Colab.

Example use case: Training large language models or vision transformers at scale on Google Cloud TPU pods, or running inference for Google services like Search ranking, Translate, and Photos' image recognition.

NPU: AI acceleration that fits in your pocket

An NPU is also purpose-built for neural network math, similar in spirit to a TPU, but designed with a completely different goal: efficiency at a tiny power and size budget, rather than raw throughput in a data center. NPUs are now standard in modern smartphones, laptops, and even some smart cameras, where they run AI features locally on-device instead of sending data to the cloud.

Running inference locally on an NPU means lower latency (no network round-trip), better privacy (your data can stay on the device), and typically much lower power draw than running the same task on a CPU or GPU.
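
The latency argument comes down to simple arithmetic, sketched below. The function name and all of the millisecond values are hypothetical placeholders, not measurements. A cloud request pays for the network round-trip on top of the server's compute time, so a local chip can win even when it is slower than the server at the inference itself.

```python
def local_is_faster(local_ms: float, server_ms: float, round_trip_ms: float) -> bool:
    """Compare on-device latency with cloud latency for one inference.

    local_ms:      time for the device to run the model (assumed value)
    server_ms:     time for the server to run the model (assumed value)
    round_trip_ms: network time to send the request and get the reply
    """
    cloud_total_ms = server_ms + round_trip_ms
    return local_ms < cloud_total_ms


# Placeholder numbers: the device is 6x slower at the math, but still
# responds sooner because it skips the network entirely.
print(local_is_faster(local_ms=30, server_ms=5, round_trip_ms=80))  # prints True
```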

Example use case: Face ID-style face unlock, real-time background blur in a video call, and on-device voice transcription. These are the "AI" features marketed alongside the NPUs in recent chips, such as Apple's Neural Engine, Qualcomm's Hexagon NPU, or Intel's AI Boost, which also power local features like live captions, image search in your photo gallery, and background noise removal without sending anything to a server.

Putting them side by side

Chip	Designed for	Compute units	Where you'll find it
CPU	Sequential, general-purpose logic	Few, powerful	Every computer and server
GPU	Massive parallel computation	Thousands, simple	Gaming PCs, ML training rigs, data centers
TPU	Large-scale neural network matrix math	Specialized systolic arrays	Google Cloud, Colab
NPU	Efficient, low-power AI inference	Specialized, small footprint	Phones, laptops, edge devices

Why this matters for you as a developer

If you're training a model from scratch, you're choosing between a GPU (flexible, widely supported, available on every major cloud) and a TPU (often faster and cheaper at scale for workloads that suit it, but locked into Google's ecosystem and a narrower set of supported frameworks). If you're shipping a mobile app with an on-device AI feature, you're targeting the NPU through frameworks like Core ML, NNAPI, or ONNX Runtime, so the feature runs fast and doesn't kill the battery. And everything else, such as your web server, your database, and your build pipeline, is still squarely CPU territory.
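
The rule of thumb in the paragraph above can be written down as a small lookup. This is only a sketch of this article's guidance: the function name and the workload labels are made up for the example, and a real decision would also weigh cost, availability, and framework support.

```python
def candidate_chips(workload: str) -> list[str]:
    """Map a broad workload category to the chips worth considering."""
    if workload == "train_model":
        # Training from scratch: a choice between GPU and TPU.
        return ["GPU", "TPU"]
    if workload == "on_device_ai":
        # AI features shipped inside a mobile or laptop app.
        return ["NPU"]
    # Web servers, databases, build pipelines, and everything else.
    return ["CPU"]


for workload in ("train_model", "on_device_ai", "web_server"):
    print(workload, "->", candidate_chips(workload))
```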

None of these chips replaced the others; they specialized. Understanding which one fits which job is increasingly part of writing efficient software, not just a hardware trivia question.