Tala Amm

Posted on Jun 5

I Thought More Threads ALWAYS Improve Performance, My Assumption Was Wrong!

#multithreading #programming #productivity #beginners

I Thought More Threads Would Make Matrix Multiplication Faster. My Benchmarks Said Otherwise.

When I started implementing multithreaded matrix multiplication, I thought I already knew how the results would look.

More threads should mean better performance.

Right?

After all, if one thread performs the work in one second, then two threads should finish in roughly half a second, four threads should be even faster, and sixteen threads should be significantly better.

That intuition sounds reasonable.

My benchmarks told a different story.

In this article, I'll walk through the experiment, the implementations, the benchmark results, and the surprising lessons I learned about parallel programming, cache locality, and performance engineering.

The Goal

The objective was simple:

Implement matrix multiplication in C using POSIX threads and compare different workload distribution strategies.

I implemented four versions:

Sequential baseline
Row-wise parallel multiplication
Column-wise parallel multiplication
Block-based (tiled) parallel multiplication

Then I benchmarked them using several matrix sizes and thread counts.

The goal was not only to measure execution time, but to understand why certain approaches performed better than others.

Why Matrix Multiplication?

Matrix multiplication is one of the most common operations in:

Machine learning
Scientific computing
Computer graphics
Numerical simulation
High-performance computing

The classic algorithm performs a huge number of arithmetic operations.

For an N×N matrix:

The time complexity is: O(N³)

This makes it an excellent candidate for parallelization.

There is a lot of work to distribute among threads.

Or at least that's what I thought.

The Baseline

Before introducing threads, I implemented a traditional single-threaded matrix multiplication algorithm.

The baseline served two purposes:

Correctness reference
Performance comparison point

Every parallel implementation was verified against the baseline output.

This turned out to be more important than expected because one of my parallel implementations initially produced incorrect results despite appearing to run faster.

Performance is meaningless if the answers are wrong.

Strategy 1: Row-Wise Parallelization

The first parallel approach divided the output matrix rows among threads.

For example:

Thread 1 computes rows 0–249

Thread 2 computes rows 250–499

And so on.

This strategy has a nice property:

Each thread writes to a completely separate region of memory.

No locks. No mutexes.

No synchronization overhead.

It is conceptually simple and relatively cache-friendly.

I expected this implementation to perform well.

And it did.

Strategy 2: Column-Wise Parallelization

The second approach divided columns among threads.

At first glance, it seemed almost identical to row-wise decomposition.

The amount of mathematical work was nearly the same.

The number of threads was the same.

The matrix dimensions were the same.

So I expected similar performance.

I was wrong. Very wrong.

The column-wise implementation became one of the most interesting results in the entire project.

Strategy 3: Block-Based Parallelization

The final approach used tiled matrix multiplication.

Instead of processing entire rows or columns, matrices were divided into smaller blocks.

Each thread worked on groups of tiles.

The idea was to improve cache utilization.

Modern CPUs are incredibly fast at arithmetic.

What often slows programs down is moving data between memory and the processor.

By keeping smaller regions of matrices inside cache longer, block-based multiplication attempts to reduce expensive memory accesses.

This turned out to be the winning strategy.

Benchmark Setup

Hardware:

AMD Athlon Silver 7120U
2 CPU cores / 4 Logical Processors
WSL Ubuntu
GCC with -O2 optimization

Matrix sizes:

100×100
500×500
1000×1000

Thread counts:

Each experiment was repeated multiple times and average execution times were recorded. You can find them here

How Matrix Size Changed Everything

The thread-count experiments were only part of the story.

I also wanted to understand how each implementation behaved as the problem size increased.

To do this, I benchmarked three matrix dimensions:

100×100
500×500
1000×1000

The resulting execution-time graph for each 16 threads strategy, revealed a pattern that is common in high-performance computing but easy to overlook.

For small matrices, there was relatively little difference between implementations.

In some cases, the sequential baseline was actually faster than the multithreaded versions.

This might seem surprising at first.

After all, multiple threads should mean more computational power.

The explanation lies in overhead.

Creating threads, scheduling them, and coordinating their work introduces additional cost.

For small workloads, these costs represent a significant fraction of the total execution time.

The computational problem simply isn't large enough to benefit from parallelism.

As matrix size increased, however, the situation changed dramatically.

The amount of arithmetic work in matrix multiplication grows approximately with O(N³).

This means that increasing matrix dimensions has a much larger effect on runtime than many developers initially expect.

A 1000×1000 multiplication is not merely ten times larger than a 100×100 multiplication.

It requires roughly one thousand times more arithmetic operations.

As the workload increased, the overhead of thread management became relatively less important.

The actual computation began to dominate execution time.

This is where the advantages of parallel execution became visible.

The row-wise and block-based implementations began outperforming the sequential baseline by increasingly larger margins.

The block-based implementation showed the strongest scalability.

Its execution time increased more slowly than the other approaches as matrix size grew.

This suggests that its cache-friendly design becomes more valuable as the amount of data increases.

Interestingly, the column-wise implementation did not scale as effectively.

As matrix dimensions increased, its inefficient memory access patterns generated more cache misses and reduced overall performance.

The graph highlights an important principle of performance engineering:

Optimization strategies that appear insignificant on small datasets often become critical on large datasets.

This is why benchmarking only small inputs can be misleading.

A program that appears efficient for a 100×100 matrix may behave very differently when processing matrices that are ten or one hundred times larger.

The matrix-size experiments demonstrated that scalability is not only about adding threads.

It is also about ensuring that memory access patterns remain efficient as the workload grows.

In this project, the block-based implementation achieved the best balance between parallelism and memory efficiency, allowing it to scale more effectively than the other approaches.

The First Surprise: More Threads Didn't Always Help

One of the assumptions I made, as a beginner, about parallel programming is:

"More threads equals more speed."

The benchmark results immediately challenged that assumption.

For smaller matrices, multithreading often performed worse than the sequential implementation.

Why?

Because threads are not free.

Creating threads costs time.

Scheduling threads costs time.

Context switching costs time.

Managing threads introduces overhead.

When the workload is small, those costs can exceed the benefit of parallel execution.

The computation simply isn't large enough to justify the extra complexity.

The Second Surprise: Row-Wise and Column-Wise Were Not Equal

This was the most interesting result of the project.

Row-wise decomposition consistently outperformed column-wise decomposition.

At first this seemed strange.

Both implementations perform essentially the same arithmetic operations.

So why the difference?

The answer lies in memory layout.

Matrices were stored in row-major order.

That means:

A[0][0]
A[0][1]
A[0][2]
A[0][3]

which are adjacent in memory.

Rows are contiguous.

Columns are not.

When the row-wise implementation accessed matrix elements, it often read memory sequentially.

Processors love sequential memory access.

Caches become highly effective.

The column-wise implementation frequently jumped around memory. In physical RAM addresses, it would look like:

0x02, 0x06, 0x0A, 0x0E, ...

Because of the row-major layout, column elements are physically separated by a "stride" (the width of the row).

Those jumps cause more cache misses, and stalls.

Because the data brought into the cache is thrown away unused.

More cache misses meant more trips to slower memory RAM to fetch the next value.

The arithmetic was the same.

The memory behavior was not.

And memory behavior won.

The Third Surprise: Cache Locality Was More Important Than Thread Count

The best-performing implementation was not simply the one with the most threads.

It was the one with the best cache behavior.

The block-based algorithm consistently produced the strongest results.

The reason wasn't additional parallelism.

It was improved cache reuse.

The same arithmetic operations were being performed.

But the processor spent less time waiting for memory.

This was one of the most valuable lessons from the project.

Performance engineering is often about moving data efficiently rather than performing calculations faster.

Understanding Speedup

To evaluate parallel performance, I measured speedup.

Speedup is defined as:

Speedup = Sequential Time / Parallel Time

A speedup of:

1.0 means no improvement
2.0 means twice as fast
4.0 means four times as fast

The block-based implementation achieved the highest speedup.

However, another interesting pattern emerged.

The speedup did not scale linearly with thread count.

Doubling the number of threads did not double the performance.

This is a classic result in parallel computing.

Understanding Efficiency

After computing speedup, I calculated efficiency.

Efficiency = Speedup / Threads

This metric measures how effectively each thread contributes useful work.

One surprising observation was:

Execution time improved while efficiency decreased.

At first this seems contradictory. But it isn't.

The program was getting faster.

But each additional thread contributed less benefit than the previous one.

This phenomenon is called diminishing returns.

It appears in almost every parallel system.

Why Efficiency Drops

As more threads are added:

Scheduling overhead increases
Context switching increases
Cache contention increases
Memory bandwidth becomes shared

Eventually the overhead begins to dominate.

The program may still become faster.

Just not proportionally faster.

This explains why efficiency graphs often trend downward even when performance continues to improve.

Hardware Matters

One important limitation of my experiment was hardware.

The machine only had a small number of CPU cores.

Despite testing 8 and 16 threads, the processor could not execute all of them simultaneously.

The operating system had to continuously schedule and reschedule threads.

This limited scalability.

The results illustrate a fundamental principle:

Software parallelism cannot completely overcome hardware limitations, we still need some hardware support.

At some point, the hardware becomes the bottleneck.

Lessons Learned

This project taught me several lessons that I fully appreciate after running real benchmarks.

More threads do not guarantee better performance.
Memory access patterns matter enormously.
Cache locality can outweigh algorithmic differences.
Benchmarking often disproves intuition.
Correctness verification is essential.
Performance engineering is as much about memory as computation.
Smaller benchmarks create "Performance Illusions".
Parallelism introduces a "thread tax" (creation, context switching, and synchronization overhead).

Perhaps the biggest lesson was that performance optimization should be driven by measurements, not assumptions.

Several results contradicted my expectations.

Without benchmarks, I would have drawn the wrong conclusions.

Final Thoughts

Before this project, I viewed multithreading primarily as a way to divide work.

After this project, I see it differently.

The number of threads is only one piece of the puzzle.

Memory layout, cache behavior, workload distribution, and hardware characteristics can have a greater impact on performance than simply adding more workers.

The block-based implementation won not because it created more threads, but because it used memory more intelligently.

That was the most valuable takeaway from the entire experiment.

Sometimes the fastest code isn't the code doing the least work.

It's the code waiting the least for memory.

DEV Community