DEV Community

Arunabh Gupta
DP Isn’t Just About Big-O: How Cache Misses Killed My Knapsack

“I had a correct O(n·x) DP solution. Constraints were within limits. Still… TLE.”

I'm sure you've said or heard this phrase plenty of times. The same happened to me. I was trying to solve the Book Shop problem on CSES, a knapsack problem. I had my logic sorted out, but when I submitted the code, I kept getting TLE. I checked it multiple times and tried sorting and other tricks to squeeze out better results, but nothing worked. Memory wasn't the issue either; the limit is generous for this problem. So what on earth happened here?

I would strongly suggest trying this problem first before reading ahead.

The DP that should have passed

Let's first discuss an approach for this problem. At any index of the prices array, you have exactly two options: take that book or skip it.

So, if we define a 2D dp array for this problem, the state dp[i][k] (where i is the index and k is the remaining budget) is the maximum value obtainable using books i ... n-1 with budget k.

Therefore the dp state would become

dp[i][k] = max(dp[i+1][k-h[i]]+s[i], dp[i+1][k])

Algorithmically, nothing seems to be wrong.

( I would suggest watching some knapsack tutorials if you have difficulty understanding this. You can refer to this video for a better explanation, although his approach is a bit different: he thinks in prefix-array terms, whereas my approach is derived from a suffix-array point of view. )

The mistake I didn’t know was a mistake

My initial code was something like this:

int dp[1002][100002];
int solve(int n, int x, int* h, int* s) {

    // base case: only the last book is available
    for(int k=0; k<=x; k++){
        dp[n-1][k] = k>=h[n-1] ? s[n-1] : 0;
    }

    // note the loop order: k (budget) outside, i (book index) inside
    for(int k=0; k<=x; k++){
        for(int i=n-2; i>=0; i--){
            if(k>=h[i]){
                dp[i][k] = max(dp[i+1][k-h[i]]+s[i], dp[i+1][k]);
            }
            else{
                dp[i][k] = dp[i+1][k];
            }
        }
    }

    return dp[0][x];
}

This looks fine, right?
Let me break this down a little bit. I am looping through all the k from 0 -> x and for every k, I am looping through the array, updating the dp array using the same logic we discussed before.

The looping seems fine, but it was still slow and I kept getting a TLE.

The missing concept: memory access

When we write code, it’s easy to think that accessing an array is a constant-time operation.
In theory, it is.

In practice, where the data lives matters more than the operation itself.

Modern CPUs are extremely fast. So fast, in fact, that reading from main memory (RAM) would slow them down constantly. To avoid this, CPUs use caches — small, very fast memory layers placed between the CPU and RAM.

Roughly speaking, access times look like this:

Location        Approximate cost
CPU register    1 cycle
L1 cache        ~4 cycles
L2 cache        ~12 cycles
L3 cache        ~40 cycles
RAM             200–300 cycles

This means that a single memory access from RAM can cost as much as hundreds of CPU instructions.

Cache lines: why memory is fetched in chunks

Another important detail:
The CPU never fetches a single integer from memory.

Instead, it fetches a cache line, typically 64 bytes at a time.

For an int array, that’s:

64 bytes ÷ 4 bytes per int = 16 integers

So when you access:

dp[i][k]

the CPU actually loads:

dp[i][k], dp[i][k+1], dp[i][k+2], ..., dp[i][k+15]

Whether you use them or not.

Why sequential access is fast

If your code accesses memory sequentially, like this:

dp[i][0], dp[i][1], dp[i][2], dp[i][3], ...

then:

  • One cache line is loaded
  • All 16 integers are used
  • The next access is already in cache
  • The CPU runs at full speed

This is why simple loops over arrays are incredibly fast.

Why skipping memory is slow

Now consider this access pattern:

dp[n-2][k]
dp[n-3][k]
dp[n-4][k]

Because C++ stores 2D arrays in row-major order, each of these accesses is far apart in memory—often hundreds of kilobytes.

What happens in this case:

  • A cache line is loaded
  • Only one integer from it is used
  • The remaining 15 integers are wasted
  • The CPU has to fetch another cache line from RAM
  • The CPU stalls for hundreds of cycles

In other words, I was paying the cost of memory access 16 times more often than necessary.

At small input sizes, this doesn’t matter.
At 100 million DP transitions, it absolutely does.

You can also use the analogy of a book to get a better feel for this. Imagine reading word 1 from page 1, then word 1 from page 2, then word 1 from page 3, and so on. That's very inefficient, right? Instead, you'd read all of page 1, then page 2, then page 3, and so on.

Why “skipping memory” is disastrous

At this point, it's clear that memory is fetched in cache lines, not as individual elements. So the real problem appears when we don't completely use what we have fetched.

So we know that a cache line holds 16 integers. If my code accesses memory sequentially, all 16 integers get used. If my code jumps around in memory, only 1 integer is used and the other 15 are wasted.

This seemingly minor difference is enough to push a theoretically correct solution into TLE.

Let's understand this in a bit more detail.

One cache miss vs many cache misses

Consider two access patterns over an array of integers.

Pattern 1: Sequential access
a[0], a[1], a[2], a[3], ...

What happens:

  • One cache line is loaded
  • 16 integers become available
  • The CPU uses all of them
  • Cache miss every 16 accesses

Pattern 2: Skipping access
a[0], a[100000], a[200000], ...

What happens:

  • A cache line is loaded
  • Only one integer is used
  • The rest are wasted
  • Cache miss every single access

Both patterns perform the same number of array reads.
But one causes 16× more cache misses. At scale, that difference alone can wreck real-world performance.

The fix (almost anticlimactic)

We just need to flip the loops inside out: fix the index i in the outer loop and run k across the entire row in the inner loop. This way we traverse memory row by row, so consecutive accesses hit the same cache line instead of missing on nearly every single access. This is much more efficient than the first version.

for(int i=n-2; i>=0; i--){
    for(int k=x; k>=0; k--){
        if(k>=h[i]){
            dp[i][k] = max(dp[i+1][k-h[i]]+s[i], dp[i+1][k]);
        }
        else{
            dp[i][k] = dp[i+1][k];
        }
    }
}

How to identify this mistake in other problems

By now, you might be thinking:
“How do I even know when to flip loops?”
“How do I decide which loop goes outside and which goes inside?”

The answer lies in DP state dependency — not in optimization tricks.

Let’s revisit the DP transition we used.

Step 1: Identify the dependency in the DP formula

From the knapsack recurrence, we need:

dp[i+1][k]
dp[i+1][k - h[i]]

So we can summarize the dependency as:

dp[i][k]  depends on  dp[i+1][*]

This is the only dependency that matters.

Step 2: Translate dependency into an ordering constraint

In simple terms, this means:
To compute values for row i, the entire row i+1 must already be computed.

Now ask yourself a very direct question:
"How do I compute the table so that row i+1 is already available when I compute row i?"

There is only one correct answer:
i must go from n-1 down to 0.

Which immediately gives us:

for (int i = n - 1; i >= 0; i--)

This loop order is not an optimization — it is required by correctness.

Step 3: What about the inner loop?

Once we are inside a fixed row i, we are computing:

dp[i][0], dp[i][1], dp[i][2], ..., dp[i][x]

Now ask the next crucial question:
Does dp[i][k] depend on dp[i][k-1]?
The answer is no.
It depends only on:

  • dp[i+1][k]
  • dp[i+1][k - h[i]]

Which means:

  • All k values inside the same row are independent
  • They can be computed in any order
  • As long as row i+1 already exists

Step 4: The natural loop structure

Putting this together, the correct loop order becomes:

for (int i = n - 2; i >= 0; i--) {   // dependency loop
    for (int k = 0; k <= x; k++) {  // free loop
        compute dp[i][k];
    }
}

This loop order matches the DP logic itself, not just performance concerns.

A crucial mental rule to avoid this confusion

The dimension that appears in the DP dependency must be the outer loop.

Look at the formula again:

dp[i][k] = f(dp[i+1][...])
  • the row index changes: the right-hand side uses i+1
  • the budget index never refers back to the same row: every k on the right lives in row i+1

So we can say:

  • i controls the correctness order
  • k is a secondary, free dimension

Which gives us the rule:

for (i ...)        // dependency dimension
    for (k ...)    // free dimension

Once you start thinking in these terms, “flipping loops” stops being a trick and becomes a natural consequence of the DP definition.

Conclusion

In the end, I came to a realization: Big-O notation only tells you how many operations are performed. It says nothing about how fast memory is accessed during those operations.
Once n × x gets large enough, cache behavior matters more than anything else.
