Isaac Lyman

Posted on Jan 5, 2018

This is how Meltdown works

#exploit #bug #security #explainlikeimfive

Meltdown and Spectre are the two latest exploits throwing the tech world for a loop. They have a lot in common with each other; both depend on built-in features of your computer's processor.

After doing some reading, I think I understand the Meltdown exploit well enough to explain it in layman's terms. If any of this is incorrect, please comment below and correct me.

As you know, the CPU (or processor) is the brain of the computer. It performs lots of simple instructions very quickly, like adding numbers together and reading memory.

Suppose your CPU is an employee at a fast-food counter. People line up in front of the counter and order items off the menu, then pay with their credit card, then go to the pickup line and get their food. The process is very simple. If they don't pay, they don't get their food.

(Similarly, the CPU accepts instructions from a process, makes sure the process has the permissions necessary to execute those instructions, then executes them and returns the results to the process.)

The fast-food place prides itself on FAST service. In fact, it will lose a lot of customers if any other restaurant is faster than it. So it starts taking shortcuts. For example, as soon as you tell the cashier what food you want, they send your order to the back and the cooks start making it, even before you hand over your credit card. They cook it and put it on a tray in the back, ready to deliver to you when you get to the pickup line. This speeds things up a lot. Of course, if you can't pay, they have to cancel the order and throw away the food, which is a waste. But most people pay right away, so it's not a big problem, and the gained efficiency of getting started sooner definitely makes up for it.

(Your CPU does this. It executes certain instructions while it's still checking the process's permissions, and doing these in parallel (called "speculative execution") saves it some time. If the process doesn't have the correct permissions, the CPU cancels the instructions and dumps the results.)
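Here's a rough sketch of that idea in Python. It's a toy simulation, not how a real CPU is built: the `ToyCPU` class, its `load` method, and the `wasted_work` counter are all made up for illustration. The only point is the ordering: the work happens first, and the permission check decides afterwards whether the result gets handed over or thrown away.

```python
# Toy model of speculative execution. This is a simulation, not real CPU
# behavior. All names here (ToyCPU, load, wasted_work) are hypothetical.

class ToyCPU:
    def __init__(self, memory, allowed):
        self.memory = memory      # address -> value
        self.allowed = allowed    # addresses the current process may read
        self.wasted_work = 0      # speculative results that were thrown away

    def load(self, address):
        # Step 1: start the work right away (the cooks start cooking).
        speculative_result = self.memory[address]
        # Step 2: the permission check finishes afterwards (the card is swiped).
        if address not in self.allowed:
            # Check failed: discard the result (throw the food away).
            self.wasted_work += 1
            raise PermissionError(f"access to {address:#x} denied")
        # Check passed: the result was already ready, so no time was lost.
        return speculative_result


cpu = ToyCPU(memory={0x10: 42, 0x20: 99}, allowed={0x10})
print(cpu.load(0x10))          # permitted read
try:
    cpu.load(0x20)             # forbidden read
except PermissionError as err:
    print("denied:", err)
print("discarded results:", cpu.wasted_work)
```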

Then the restaurant gets a bright idea: if you don't pay, instead of dumping the food, they'll leave it on its tray in the back for up to 5 minutes. If someone else puts in the same order before the 5 minutes are up, they'll give them the food from your cancelled order. Less food waste, quicker service. Win-win.

(Your CPU does this too. It has its own cache where it keeps data that instructions have recently fetched from memory, and speculative instructions can fill that cache before the CPU even knows if the process has the right permissions.)

One day, the restaurant comes up with an incredible innovation: the ability to copy food digitally. Once the food is made, it can be given to several customers at a time without re-making it from scratch: just copy it onto each customer's tray. The R&D department hasn't figured out how to keep the food hot and fresh for longer than 5 minutes, but if one customer orders a #6 combo, everyone else who orders a #6 combo within the next 5 minutes will get a copy of the same meal, and they'll be able to get it super quick, since it doesn't have to be made from scratch. It's super efficient. Profits are soaring.

(I know the metaphor is getting thin at this point, but that's how data works. Once the CPU has something in the cache, it can hand it out to as many processes as needed until that part of the cache gets overwritten or dumped.)

The restaurant's final innovation is all about privacy. Ever been too embarrassed to ask for three large orders of fries, all for yourself? Worry no more: thanks to a very creative system of white-noise machines, thick curtains, and whispers, you're now able to order and eat your food without any other customer knowing what you've got.

(Private communication between a process and a CPU is very important. It prevents programs from spying on each other. You wouldn't want your Solitaire game to be reading your Outlook password, would you?)

You're a fast-food hacker, and you want to figure out how to manipulate the restaurant in order to find out what someone else has ordered. You dedicate a lot of time and thought to this.

First, you discover that if you order food that's already been made in the last 5 minutes, you get it a lot faster than if they have to make it fresh for you. So by using a stopwatch, you can determine whether it's been ordered by someone else recently. In fact, if you keep an eye on the counter, you can tell whether your order is made fresh or just digitally copied, even if your credit card gets declined.

(The point of the CPU cache is to speed things up. If the same data is needed multiple times, getting it from the cache is much faster than fetching it from main memory all over again. And even if a process doesn't have the right permissions, it can ask the CPU for some data, time the interaction, and find out whether that data is in the CPU cache or not.)
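The stopwatch trick can be sketched in a few lines of Python. This is a toy model under stated assumptions: the latencies `HIT_CYCLES` and `MISS_CYCLES` are invented numbers rather than measurements, and `ToyCache` and `was_cached` are hypothetical names. A real attacker would time actual memory reads, but the logic is the same: fast means "somebody already asked for this".

```python
# Toy cache with made-up latencies (in "cycles"). Not measured on real hardware.
HIT_CYCLES = 4      # assumed cost of a cache hit
MISS_CYCLES = 200   # assumed cost of going to main memory


class ToyCache:
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}   # address -> cached value

    def read(self, address):
        """Return (value, cycles_taken)."""
        if address in self.lines:
            return self.lines[address], HIT_CYCLES
        value = self.memory[address]
        self.lines[address] = value     # keep a copy for next time
        return value, MISS_CYCLES


def was_cached(cache, address, threshold=50):
    """Guess whether address was already cached by timing one read."""
    _, cycles = cache.read(address)
    return cycles < threshold


cache = ToyCache(memory={0x100: "fries", 0x200: "burger"})
cache.read(0x100)                  # someone else "orders" 0x100 first
print(was_cached(cache, 0x100))    # True: came back fast
print(was_cached(cache, 0x200))    # False: had to be fetched
```

Note that probing is destructive in this model: once `was_cached` has read an address, that address is in the cache too, so each slot can only be checked once per attempt.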

From this you hatch your plan. You go to the restaurant and pick a victim: another customer who's already eating, and you want to find out what they ordered. You approach the cashier, claim to be that other customer, and ask if you can modify your order. You say you accidentally ordered the menu item previous to the one you wanted; can they please change your order to the next menu item? Out of courtesy, they say yes and send the order to the kitchen, but ask to see your credit card to verify your identity. When it turns out that you're lying about being the other customer, the cashier cancels the order -- but it's already been cooked; it's on a tray in the back. At this point, all you have to do is order each item on the menu, one at a time, stopwatch in hand, until one of them comes back fast enough to indicate that it was already made. And then you know: the other customer ordered the menu item previous to that one. For example, if #6 comes back very quickly, you know the other customer ordered a #5.

(CPUs can be instructed to look up data based on other data (for example, "look up the memory address 0xFFF; it contains another memory address; return the data stored at that memory address"). This is the final piece of the puzzle. Meltdown asks the CPU to look up a piece of data based on other data it doesn't have access to. The CPU rejects the request, but it still does the calculation and puts the second piece of data in its cache. Then Meltdown just needs to request a bunch of memory addresses, and whichever one comes back really fast is the other part of the equation (in our example, it's the memory address that was stored in the original, forbidden memory address). Repeating this trick lets Meltdown read anything in memory. This is very bad.)
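Putting the pieces together, here's a toy simulation of the whole plan in Python. It only models the cache side effect with ordinary Python objects and cannot read any real memory. Every name in it (`ToyCPU`, `read_kernel_then_touch`, `time_probe`, `leak_byte`) is hypothetical, and the latencies are invented. One assumption worth flagging: instead of chasing a stored address, this sketch uses the forbidden byte as an index into a probe array the attacker owns, which is how the published attack is usually described.

```python
# Toy simulation of the Meltdown idea. It models the cache side effect in
# plain Python and cannot read real memory. Every name here is hypothetical.
HIT_CYCLES, MISS_CYCLES = 4, 200   # assumed latencies, not measurements
PAGE = 4096                        # spacing between probe slots


class ToyCPU:
    def __init__(self, kernel_memory):
        self.kernel_memory = kernel_memory   # address -> secret byte (0..255)
        self.cached = set()                  # probe offsets currently cached

    def read_kernel_then_touch(self, kernel_address):
        """Attacker's instruction pair: read a forbidden byte, then use it
        as an index into the attacker's own probe array."""
        secret = self.kernel_memory[kernel_address]   # runs speculatively
        self.cached.add(secret * PAGE)                # side effect stays behind
        raise PermissionError("kernel read denied")   # check arrives too late

    def time_probe(self, offset):
        """Cycles needed to read one slot of the probe array."""
        if offset in self.cached:
            return HIT_CYCLES
        self.cached.add(offset)
        return MISS_CYCLES


def leak_byte(cpu, kernel_address):
    cpu.cached.clear()                    # start with a cold probe array
    try:
        cpu.read_kernel_then_touch(kernel_address)
    except PermissionError:
        pass                              # the attacker expects the denial
    timings = [cpu.time_probe(value * PAGE) for value in range(256)]
    return timings.index(min(timings))    # the fast slot reveals the byte


cpu = ToyCPU(kernel_memory={0xFFF: ord("K")})
print(chr(leak_byte(cpu, 0xFFF)))   # K
```

In fast-food terms, `read_kernel_then_touch` is the fake order modification, and the loop over 256 slots is ordering every item on the menu with a stopwatch in hand.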

That's Meltdown. And you are affected by it. And you should update your browsers and your OS immediately to get the patch.

In retrospect, maybe this would have been easier to explain without the fast-food metaphor...but hey, I had fun.

Let me know if this didn't make sense to you, or if you've got anything to add.

Top comments (26)

Isaac Lyman • Jan 5 '18

I know I should be terrified by the implications of a low-level exploit like this, but I'm mostly just impressed by how clever humans are. I mean, it's amazing to me that someone came up with this, even if the only valid uses for it are nefarious.

Ben Halpern • Jan 6 '18

"In retrospect, maybe this would have been easier to explain without the fast-food metaphor...but hey, I had fun."

I had a lot of fun reading it.

Paula Hasstenteufel • Mar 14 '18 • Edited

Me too, and even though the metaphor ran thin at some point, it delivered a good base to begin abstracting from there. 10/10 would recommend.

Though now I feel like I want some fries :/

Tobias • Jan 6 '18 • Edited

Man, I must say that as a fresh CS student, this really helped me understand this problem. I was curious about it, but almost all the examples used technical language that I don't fully know (yet).
Thank you, sir!

edA‑qa mort‑ora‑y • Jan 6 '18

This is a nice description of how the attack works. Good work!

I guess it should be noted that this is a defect in Intel's chip design; it's not the way CPUs are supposed to work. AMD indicated they don't do speculative execution without permission checks. And I'm guessing no future Intel chip will either.

Perhaps this gives somebody the idea that chips should get faster again instead of cleverer. There are bound to be more faults in all these magical chip mechanisms. :(

Isaac Lyman • Jan 6 '18

Thanks, Ed! (Is "Ed" good? I'm embarrassed to admit I can't figure out how to parse your name.)

It sounds like you understand this stuff pretty well. Any chance you can answer a couple questions for me?
1) Is all this speculative execution and permission checking stuff literally hardwired? Or is it software on the CPU that could theoretically be updated without replacing the whole chip?
2) The patches are OS- and browser-based (which is a little unfair, but I digress). How do they prevent the exploit?

Jean-Daniel • Jan 8 '18

While your analogy is fine for introducing the Meltdown concept, it misses an important part: the distinction between user and kernel space, and virtual memory.

On modern OSes, when you try to access a memory address, that address is a virtual address and has to be translated to a physical address first. The map that converts virtual addresses to physical addresses is stored in a dedicated piece of hardware (the TLB, which is part of the processor). Each process has its own map.
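A minimal sketch of that translation step, in Python. The names (`ToyMMU`, `translate`, `flush_tlb`) and the 4096-byte page size are assumptions for illustration. The sketch also assumes the TLB acts as a small cache sitting in front of a per-process page table, which is one common way to picture it.

```python
# Toy virtual-to-physical translation with a TLB in front of a page table.
# Names and sizes are illustrative assumptions.
PAGE_SIZE = 4096


class ToyMMU:
    def __init__(self, page_table):
        self.page_table = page_table   # virtual page number -> physical frame
        self.tlb = {}                  # small cache of recent translations
        self.tlb_misses = 0

    def translate(self, virtual_address):
        page, offset = divmod(virtual_address, PAGE_SIZE)
        if page not in self.tlb:
            self.tlb_misses += 1
            if page not in self.page_table:
                raise LookupError(f"page fault at {virtual_address:#x}")
            self.tlb[page] = self.page_table[page]   # slow page-table lookup
        return self.tlb[page] * PAGE_SIZE + offset

    def flush_tlb(self):
        """Forget all cached translations, e.g. when the mapping changes."""
        self.tlb.clear()


mmu = ToyMMU(page_table={0: 7, 1: 3})
print(hex(mmu.translate(0x0010)))   # 0x7010
print(hex(mmu.translate(0x1004)))   # 0x3004
print(mmu.tlb_misses)               # 2
```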

Every time the active process changes on the CPU, the kernel has to flush the TLB to load the new process's mapping. Today, as an optimization, all major OSes copy the kernel mapping into each process at launch, so when a process calls a kernel function, the kernel doesn't have to flush the TLB and load its own mapping.
The CPU is designed to know which part of the mapping is kernel memory and which part is process memory, so when a process tries to access kernel memory, the CPU denies the access.
As seen with Meltdown, this check is performed too late: the access is denied after the memory has already been loaded.

The patch adopted by OSes to mitigate the issue is to separate the kernel memory map from the process map. When a process tries to access kernel memory, the speculative execution fails to map it to a physical address (as the kernel map is no longer present) and raises an exception instead of loading the actual kernel memory.
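The effect of that separation can be sketched as a toy model in Python. The function names (`build_tables`, `speculative_load`) and the `KERNEL_PAGE` constant are hypothetical, and the model deliberately ignores details of the real patches, such as whatever small part of the kernel must stay mapped so the process can still enter it.

```python
# Toy model of kernel page-table isolation. Hypothetical names throughout.
KERNEL_PAGE = 0xFFFF   # pretend virtual page that holds kernel data


def build_tables(user_pages, kernel_pages, isolated):
    """Return (user_mode_table, kernel_mode_table) for one process."""
    kernel_mode = {**user_pages, **kernel_pages}
    # Before the patch, user mode shares the full table and relies on a
    # permission check. After the patch, kernel pages are simply absent.
    user_mode = dict(user_pages) if isolated else kernel_mode
    return user_mode, kernel_mode


def speculative_load(table, page, physical_memory):
    """What a speculative load can reach, ignoring the permission check."""
    frame = table.get(page)
    if frame is None:
        return None                 # no mapping: nothing to load or cache
    return physical_memory[frame]


physical_memory = {1: "user data", 9: "kernel secret"}
user_pages, kernel_pages = {0: 1}, {KERNEL_PAGE: 9}

old_user, _ = build_tables(user_pages, kernel_pages, isolated=False)
new_user, _ = build_tables(user_pages, kernel_pages, isolated=True)
print(speculative_load(old_user, KERNEL_PAGE, physical_memory))  # kernel secret
print(speculative_load(new_user, KERNEL_PAGE, physical_memory))  # None
```

With the shared table, the speculative load reaches the secret even though the permission check would later deny it. With the isolated table there is no translation to follow, so there is nothing to leave behind in the cache.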

This patch has a performance cost, as it forces OSes to revert a useful optimization. Fortunately, for some time now, CPUs have provided features that optimize TLB usage and avoid flushing and reloading the mapping when the active process changes (by storing more than one process map and tagging each with a context identifier), so the performance cost should be small enough to be invisible to most users.
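To see why tagging helps, here is a toy tagged TLB in Python. `TaggedTLB` and its `walks` counter are made-up names, and the model assumes the TLB never runs out of room. Each cached translation is keyed by both a context identifier and a page, so switching between two mappings doesn't require throwing anything away.

```python
# Toy TLB tagged with a context identifier. Illustrative only.
class TaggedTLB:
    def __init__(self):
        self.entries = {}     # (context_id, virtual_page) -> physical frame
        self.walks = 0        # slow page-table lookups performed

    def translate(self, context_id, page, page_table):
        key = (context_id, page)
        if key not in self.entries:
            self.walks += 1
            self.entries[key] = page_table[page]
        return self.entries[key]


tables = {1: {0: 10}, 2: {0: 20}}   # context id -> that context's page table
tlb = TaggedTLB()
for ctx in (1, 2, 1, 2):            # switch back and forth, no flush
    tlb.translate(ctx, 0, tables[ctx])
print(tlb.walks)                    # 2
```

Without the tag, the same four switches would need a flush each time and therefore four slow lookups instead of two.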

edA‑qa mort‑ora‑y • Jan 6 '18

I'm not sure I know much more than that tidbit about AMD. :D

I believe the Linux patch is something called Kernel Page Table Isolation, which isolates the kernel memory even more. In the fast-food analogy, I think that'd be like moving the kitchen to a different building: the customers can't see anything even if they poke around.

I'm presuming the affected code on Intel is hardware as they indicated they can't release a microcode patch for it. That would seem to imply that a lot of it is hardware, but there is updatable code also at play.

I saw an LLVM patch that could also help mitigate some of the issues, but the details weren't entirely clear: I don't know if this means something in user-land could help, or if they are compiling kernel bits with this patch.

The exact details of all this are still a bit cloudy; I believe the full information hasn't been released yet.

Skyler • Jan 6 '18

Excellent write-up. My question is: how easy is this to exploit? Will it be easy for hacker groups to take advantage of, or will it require people with advanced skills (something only a very small minority would be able to do)?

Ben Halpern • Jan 7 '18

Exploiting this from scratch seems quite complicated. Exploiting it via some abstraction made available on the web seems like it could be pretty straightforward, unfortunately.