“I had a correct O(n·x) DP solution. Constraints were within limits. Still… TLE.”
I'm sure you've said or heard this phrase plenty of times before. The same happened to me while I was trying to solve the Book Shop problem on CSES, a classic knapsack problem. I had my logic sorted out, but when I submitted the code I kept getting TLE. I checked it multiple times and tried sorting and other tricks to squeeze out better results, but nothing worked. Memory wasn't the issue either, since the problem gives more than enough of it. So what the hell happened here?
I would strongly suggest trying this problem yourself before reading ahead.
The DP that should have passed
Let's first discuss an approach for this problem. At any index of the price array you have exactly two options: take that book or skip it.
So if we define a 2D dp array, the state dp[i][k] ( where i is the index and k is the remaining budget ) is the maximum total value obtainable using books i ... n-1 with budget k.
Therefore the transition becomes
dp[i][k] = max(dp[i+1][k-h[i]]+s[i], dp[i+1][k])
Algorithmically, nothing seems to be wrong.
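As a quick sanity check with made-up numbers: let n = 2, x = 8, prices h = [4, 5] and pages s = [10, 12]. The base row gives dp[1][k] = 12 for k >= 5 and 0 otherwise. Then

dp[0][8] = max(dp[1][8-4] + 10, dp[1][8]) = max(0 + 10, 12) = 12

which is correct: both books together cost 9 > 8, so the best we can do is take the second book alone.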
( I would suggest watching some knapsack tutorials if you have difficulty understanding this. You can refer to this video for a better explanation, although his approach is a bit different: he thinks in prefix-array terms, whereas my approach is derived from a suffix-array point of view. )
The mistake I didn’t know was a mistake
My initial code was something like this:
#include <bits/stdc++.h>   // for max(); competitive-programming style
using namespace std;

int dp[1002][100002];   // global: zero-initialized and far too big for the stack

int solve(int n, int x, int* h, int* s) {
    // Base case: with only the last book available, take it iff it fits.
    for (int k = 0; k <= x; k++) {
        dp[n-1][k] = (k >= h[n-1]) ? s[n-1] : 0;
    }
    // k in the outer loop, i in the inner loop -- this is the slow version.
    for (int k = 0; k <= x; k++) {
        for (int i = n-2; i >= 0; i--) {
            if (k >= h[i]) {
                dp[i][k] = max(dp[i+1][k - h[i]] + s[i], dp[i+1][k]);
            } else {
                dp[i][k] = dp[i+1][k];
            }
        }
    }
    return dp[0][x];
}
This looks fine, right?
Let me break this down a little. I am looping over every k from 0 to x, and for each k I am walking through the whole array, filling the dp table with the transition we discussed above.
The looping logic is fine, but it was still slow, and I was getting a TLE.
The missing concept: memory access
When we write code, it’s easy to think that accessing an array is a constant-time operation.
In theory, it is.
In practice, where the data lives matters more than the operation itself.
Modern CPUs are extremely fast. So fast, in fact, that reading from main memory (RAM) would slow them down constantly. To avoid this, CPUs use caches — small, very fast memory layers placed between the CPU and RAM.
Roughly speaking, access times look like this:
| Location | Approximate cost |
|---|---|
| CPU register | 1 cycle |
| L1 cache | ~4 cycles |
| L2 cache | ~12 cycles |
| L3 cache | ~40 cycles |
| RAM | 200–300 cycles |
This means that a single memory access from RAM can cost as much as hundreds of CPU instructions.
Cache lines: why memory is fetched in chunks
Another important detail:
The CPU never fetches a single integer from memory.
Instead, it fetches a cache line, typically 64 bytes at a time.
For an int array, that’s:
64 bytes = 16 integers
So when you access:
dp[i][k]
the CPU actually loads the entire 64-byte cache line containing it, roughly:
dp[i][k], dp[i][k+1], dp[i][k+2], ..., dp[i][k+15]
whether you use them or not (the exact window depends on how the line is aligned).
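If you want to check the 64-byte figure on your own machine, C++17 exposes it. A minimal sketch: std::hardware_destructive_interference_size isn't shipped by every standard library yet, and on x86-64 the answer is typically just 64.

#include <cstddef>
#include <iostream>
#include <new>  // std::hardware_destructive_interference_size (C++17)

int main() {
    // Cache line size as reported by the standard library
    // (if your toolchain doesn't define this, assume 64 bytes).
    constexpr std::size_t line = std::hardware_destructive_interference_size;
    std::cout << "cache line: " << line << " bytes = "
              << line / sizeof(int) << " ints\n";   // typically 64 bytes = 16 ints
}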
Why sequential access is fast
If your code accesses memory sequentially, like this:
dp[i][0], dp[i][1], dp[i][2], dp[i][3], ...
then:
- One cache line is loaded
- All 16 integers are used
- The next access is already in cache
- The CPU runs at full speed
This is why simple loops over arrays are incredibly fast.
Why skipping memory is slow
Now consider this access pattern:
dp[n-2][k]
dp[n-3][k]
dp[n-4][k]
Because C++ stores 2D arrays in row-major order, each of these accesses is a whole row apart in memory: with x up to 10^5, that is 100002 × 4 bytes ≈ 400 KB between consecutive accesses.
What happens in this case:
- A cache line is loaded
- Only one integer from it is used
- The remaining 15 integers are wasted
- The CPU has to fetch another cache line from RAM
- The CPU stalls for hundreds of cycles
In other words, I was paying the cost of memory access 16 times more often than necessary.
At small input sizes, this doesn’t matter.
At 100 million DP transitions, it absolutely does.
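A rough back-of-envelope with my own numbers (ignoring prefetching and the CPU's ability to overlap misses): 10^8 transitions, each reading two values from row i+1, is about 2·10^8 memory reads. If nearly every one misses to RAM at ~250 cycles, that's around 5·10^10 cycles, more than 15 seconds on a ~3 GHz core. If only 1 in 16 reads misses, the miss cost drops to roughly 3·10^9 cycles, on the order of one second. That's exactly the difference between TLE and AC.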
You can also use a book analogy to get a feel for this: imagine reading word 1 from page 1, then word 1 from page 2, then word 1 from page 3, and so on. Very inefficient, right? Instead, you'd read all of page 1, then all of page 2, then page 3, and so on.
Why “skipping memory” is disastrous
At this point, it's clear that memory is fetched in cache lines, not as individual elements. So the real problem appears when we don't completely use what we have fetched.
We know a cache line holds 16 integers. If my code accesses memory sequentially, all 16 of them get used. If my code jumps around in memory, only 1 integer is used and the other 15 are wasted.
This seemingly minor difference is enough to turn a theoretically correct solution into a TLE.
Let's understand this in a bit more detail.
One cache miss vs many cache misses
Consider two access patterns over an array of integers.
Pattern 1: Sequential access
a[0], a[1], a[2], a[3], ...
What happens:
- One cache line is loaded
- 16 integers become available
- The CPU uses all of them
- Cache miss every 16 accesses
Pattern 2: Skipping access
a[0], a[100000], a[200000], ...
What happens:
- A cache line is loaded
- Only one integer is used
- The rest are wasted
- Cache miss every single access
Both patterns perform the same number of array reads.
But one of them causes 16× more cache misses, and at this scale that difference dominates the running time.
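To see this on real hardware, here is a toy benchmark I would write (my own sketch, not from the original post; absolute numbers vary by machine, but the strided sweep is typically several times slower):

#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const int ROWS = 4096, COLS = 4096;            // ~67 MB of ints, bigger than most L3 caches
    std::vector<int> a((size_t)ROWS * COLS, 1);

    long long sum = 0;
    auto t0 = std::chrono::steady_clock::now();
    // Sequential sweep: consecutive addresses, one cache miss per 16 ints.
    for (int i = 0; i < ROWS; i++)
        for (int k = 0; k < COLS; k++)
            sum += a[(size_t)i * COLS + k];
    auto t1 = std::chrono::steady_clock::now();
    // Strided sweep: each access jumps COLS * 4 = 16 KB, a new cache line every time.
    for (int k = 0; k < COLS; k++)
        for (int i = 0; i < ROWS; i++)
            sum += a[(size_t)i * COLS + k];
    auto t2 = std::chrono::steady_clock::now();

    std::chrono::duration<double, std::milli> seq = t1 - t0, strided = t2 - t1;
    std::printf("sequential: %.1f ms\nstrided:    %.1f ms\nchecksum:   %lld\n",
                seq.count(), strided.count(), sum);   // checksum keeps loops from being optimized away
}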
The fix (almost anticlimactic)
We just need to turn the loops inside out: fix the row index i in the outer loop and sweep over all k in the inner loop. Now we walk the table row by row, in the same order it sits in memory, instead of missing the cache on every single access. This is much more efficient than the first version. (The direction of the inner k loop doesn't matter here; more on that below.)
for (int i = n-2; i >= 0; i--) {
    for (int k = x; k >= 0; k--) {
        if (k >= h[i]) {
            dp[i][k] = max(dp[i+1][k - h[i]] + s[i], dp[i+1][k]);
        } else {
            dp[i][k] = dp[i+1][k];
        }
    }
}
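For completeness, here is how I would assemble the full solution (a sketch assuming the usual CSES Book Shop input format: n and x on the first line, then the n prices h, then the n page counts s):

#include <bits/stdc++.h>
using namespace std;

// Same table as before: global, so it is zero-initialized
// and doesn't blow the stack.
int dp[1002][100002];

int main() {
    ios_base::sync_with_stdio(false);
    cin.tie(nullptr);

    int n, x;
    cin >> n >> x;
    vector<int> h(n), s(n);
    for (int &v : h) cin >> v;   // prices
    for (int &v : s) cin >> v;   // pages

    // Base case: with only the last book, take it iff it fits.
    for (int k = 0; k <= x; k++)
        dp[n-1][k] = (k >= h[n-1]) ? s[n-1] : 0;

    // Cache-friendly order: i outer, k inner.
    for (int i = n-2; i >= 0; i--)
        for (int k = 0; k <= x; k++)
            dp[i][k] = (k >= h[i]) ? max(dp[i+1][k - h[i]] + s[i], dp[i+1][k])
                                   : dp[i+1][k];

    cout << dp[0][x] << "\n";
}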
How to identify this mistake in other problems
By now, you might be thinking:
“How do I even know when to flip loops?”
“How do I decide which loop goes outside and which goes inside?”
The answer lies in DP state dependency — not in optimization tricks.
Let’s revisit the DP transition we used.
Step 1: Identify the dependency in the DP formula
From the knapsack recurrence, we need:
dp[i+1][k]
dp[i+1][k - h[i]]
So we can summarize the dependency as:
dp[i][k] ← dp[i+1][*]
This is the only dependency that matters.
Step 2: Translate dependency into an ordering constraint
In simple terms, this means:
To compute values for row i, the entire row i+1 must already be computed.
Now ask yourself a very direct question:
"How do I compute the table so that row i+1 is already available when I compute row i?"
There is only one correct answer:
i must go from n-1 down to 0
Which immediately gives us:
for (int i = n - 1; i >= 0; i--)
This loop direction is not an optimization; it is required for correctness.
Step 3: What about the inner loop?
Once we are inside a fixed row i, we are computing:
dp[i][0], dp[i][1], dp[i][2], ..., dp[i][x]
Now ask the next crucial question:
Does dp[i][k] depend on dp[i][k-1]?
The answer is no.
It depends only on:
- dp[i+1][k]
- dp[i+1][k - h[i]]
Which means:
- All k values inside the same row are independent
- They can be computed in any order
- As long as row i+1 already exists
Step 4: The natural loop structure
Putting this together, the correct loop order becomes:
for (int i = n - 2; i >= 0; i--) {   // dependency loop
    for (int k = 0; k <= x; k++) {   // free loop
        compute dp[i][k];
    }
}
This loop order matches the DP logic itself, not just performance concerns.
A crucial mental rule to avoid this confusion
The dimension that appears in the DP dependency must be the outer loop.
Look at the formula again:
dp[i][k] = f(dp[i+1][...])
- i appears on the right-hand side
- k does not
So we can say:
- i controls the correctness order
- k is a secondary, free dimension
Which gives us the rule:
for (i ...) // dependency dimension
for (k ...) // free dimension
Once you start thinking in these terms, “flipping loops” stops being a trick and becomes a natural consequence of the DP definition.
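As one more illustration (my own addition, not from the original post): the same rule applied to the prefix formulation used in the video linked earlier. Define dp[i][k] as the best value using the first i books; the dependency becomes dp[i][k] = f(dp[i-1][...]), so i is still the dependency dimension, it just ascends.

// Prefix form: dp[i][k] = best value using books 0 .. i-1 with budget k.
// dp[0][*] is all zeros; the answer is dp[n][x].
for (int i = 1; i <= n; i++)          // dependency dimension
    for (int k = 0; k <= x; k++)      // free dimension
        dp[i][k] = (k >= h[i-1])
                 ? max(dp[i-1][k], dp[i-1][k - h[i-1]] + s[i-1])
                 : dp[i-1][k];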
Conclusion
In the end I came to a realization: Big-O notation only tells you how many operations are performed. It says nothing about how fast memory is accessed by those operations.
Once n × x gets large enough, cache behavior matters more than anything else.