When Big O Lies

Kyrylo Rud

Since university days and throughout everyday documentation digging, Big O has been the de-facto standard for evaluating algorithms: from sorting and look-up in a collection to insertion operations or even simple palindrome checks (especially in interview live-coding sessions).

However, efficiency is not the same as "best to use in this case".
What does Big O actually measure, how does it scale, and what can it lie about? Let's dive deeper into the roots of Big O and into what can help developers choose algorithms smarter, especially when it comes to implementation, maintainability, and long-term support of algorithms written by hand rather than fetched from the standard library.

Disclaimer
This article focuses on practical understanding of algorithmic complexity for real-world engineering.
The explanation of asymptotic notation is intentionally simplified and does not cover all formal mathematical details (Θ, Ω, ω, etc.).
All performance comparisons assume typical hardware behavior and ignore machine-specific optimizations unless stated.
The small-N "impossible cases" are illustrative and not meant as practical recommendations.


What does Big O measure?

Big O is, foremost, a mathematical notation defined by the German mathematicians Paul Bachmann, Edmund Landau, and others, normally called "Bachmann-Landau notation" or simply "Big O". It has existed for more than 100 years already.

The "o" part comes from order of approximation, basically "how fast something [algorithm complexity] grows".
Grate, but where does the "big" come from? Actually, there are couple 'o' types.

Analysis of an algorithm can be done in various ways, and the "O" notation helps to classify them into:

  • Big O notation - the algorithm's runtime grows at most this fast (an upper bound);
  • Little-o notation - the algorithm's runtime always grows strictly slower than this;
  • Other notations such as Big Omega, Little Omega, etc. (not covered in this article).

Note: It is incorrect to state that the worst case of an algorithm is the same as little-o notation. Little-o has nothing to do with worst-case vs. best-case.
While Big O allows the runtime to reach the upper bound, little-o requires the runtime to eventually stay strictly below the stated bound. These statements might look confusing right now, but they will become clear a bit later.
So there is no practical way to express one in terms of the other, since they describe different things.

Note: However, to understand Big O more easily, it can be stated simply: the function does not grow faster than some constant multiplied by the stated bound g(n).
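For reference, the simplified formal statements behind these descriptions (with g(n) as the stated bound) are:

  • f(n) ∈ O(g(n)) ⇔ there exist constants c > 0 and n₀ such that f(n) ≤ c · g(n) for all n ≥ n₀;
  • f(n) ∈ o(g(n)) ⇔ for every constant c > 0 there exists an n₀ such that f(n) < c · g(n) for all n ≥ n₀.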

Basically, the impracticality of little-o (the difficulty of proving that an algorithm never reaches a certain bound) led to the popularity, wide adoption, and everyday usage of Big O in the programming world. However, this order of things is mostly relevant to software engineering; in mathematics the situation is different.

It allows us to understand the rate at which an algorithm's cost grows depending on the amount of input data. The key idea is: it is not a guarantee of actual algorithm speed, but rather a reference for how complex the algorithm becomes as the input size grows.

This is where the idea of time-complexity comes from: the scaling behavior of runtime as N increases.

Note: Big O hides constants, machine effects, memory patterns, branch prediction, etc. Some algorithms may execute (not work, but execute) faster despite the Big O analysis due to cache locality, compiler-level machine code optimizations, and similar factors.


Growth rate

Any article that explains the Big O usually references the following picture.

Big O - all graphs

This picture shows the function graphs of the common Big O growth rates. Most algorithms can be expressed through one of these common complexity classes (function graphs); however, the set is not limited to them, and a more precise growth rate can be expressed with a custom equation.

The following functions are de-facto standards for most common algorithms. Their origin comes from mathematics and from "natural" algorithmic patterns (natural meaning: what you would intuitively expect from the algorithm's structure):

  • O(n) - effectively one pass over each item of the data;
  • O(n²) - nested loops or matrix operations;
  • O(log n) - halving work, where each iteration removes half of the remaining data (e.g. binary search);
  • O(√n) - geometric search spaces (e.g., 2-D grids);
  • O(2ⁿ) - exploring all subsets;
  • O(n!) - exploring all permutations.

Note: While O(log n), O(ln n), O(log₁₀ n) as well as O(lg n) (a.k.a. O(log₂ n)) may differ in actual runtime, they all belong to the same complexity class.
It is more precise to say that halving work is O(lg n), or that merge sort is O(n lg n), but the difference will be explained later.

But what would be faster: jump search or binary search?
The natural answer is: binary search, of course, since the growth rate of O(log n) is lower than that of O(√n).

The problem is that, apart from the "implementation quality" of the algorithm (how clean the code is, whether there are redundant checks, etc.), there is always the K-factor.
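To make the question concrete, here is a minimal sketch of both searches over a sorted span (illustrative code with assumed names, not taken from any benchmark; both return the index of the value or -1). The K-factor discussed next hides in details like the extra sqrt call, the block bookkeeping, and the final linear scan:

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <span>

// O(log n): halve the remaining search range on every iteration.
int binary_search(std::span<const int> A, int value) {
  std::size_t lo = 0, hi = A.size();
  while (lo < hi) {
    std::size_t mid = lo + (hi - lo) / 2;
    if (A[mid] == value) return static_cast<int>(mid);
    if (A[mid] < value) lo = mid + 1;
    else                hi = mid;
  }
  return -1; // not found
}

// O(√n): jump ahead in blocks of ~√n, then scan the last block linearly.
int jump_search(std::span<const int> A, int value) {
  const std::size_t n = A.size();
  if (n == 0) return -1;
  const std::size_t step = std::max<std::size_t>(
      1, static_cast<std::size_t>(std::sqrt(static_cast<double>(n))));
  std::size_t prev = 0;
  std::size_t curr = step;
  while (curr < n && A[curr - 1] < value) {
    prev = curr;       // value is past this block, jump to the next one
    curr += step;
  }
  for (std::size_t i = prev; i < std::min(curr, n); ++i) {
    if (A[i] == value) return static_cast<int>(i);
  }
  return -1; // not found
}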


K-Factor and asymptotic analysis

When Big O is mentioned, it is never bound to the context of the input data (what is actually expected). The algorithm's complexity is discovered from the infinite POV (analyzed in the limit as the input size approaches infinity).

Any algorithm with complexity class O(log n) would be faster than O(√n). Right?

It may sound obvious that one function's growth rate is expected to be much higher than the other's.

Graphs for O(log n) and O(√n)

Analyzing an algorithm's complexity growth rate as the input size increases is asymptotic analysis. There are two practical perspectives for the Big O:

  • asymptotic (infinite-N behavior) - basically what is represented in documentation for most algorithms (e.g. quicksort is O(n log n));
  • pre-asymptotic (finite-N behavior) - rarely documented (because context matters: which algorithms are being compared, and which N-range is relevant), although it can be useful when comparing algorithms on specific datasets or limited input sizes.

Asymptotically speaking (analysis of a function in the limit as N → ∞), the growth of function A can be faster than the growth of function B - this is the case with all the "standard" growth rates (complexity classes) usually compared on graphs.

But what happens when the algorithm has constants? Well, this is where the K-factor kicks in, as well as pre-asymptotics!

I’m using the old K-factor naming here - a little habit carried over from early C/C++ and embedded style guides, where K was the go-to marker (prefix) for constants like kConstantValue.
Also, "C-factor" would sound confusingly close to the C programming language itself, so I am sticking with the traditional "K".

Let's take a look at O(√n) and O(log n) again, on the same picture as before (apart from the negative time complexity for logarithms where N < 1 - this is impractical for algorithms, but a function graph is a function graph).

Graphs for O(log n) and O(√n) in scale

We can clearly see that O(log n) is unconditionally "faster" (grows more slowly asymptotically), which is great. However, if there were some factor that affects it with a small constant, let’s say K = 2, then we would see a very different picture.

Graphs for O(log n) and O(√n) cross points

This drastically changes the situation: without the constant factor, these curves never intersect. Now, on the scale where N < 75, there are two crossover points, at approximately N ≈ 2 and N ≈ 74.
Therefore, in theory, on the range 2 < N < 74, the O(√n) algorithm would be more efficient than the O(2 log n) one, regardless of the asymptotic analysis.
This is pre-asymptotic behavior.
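Those crossover points are easy to reproduce numerically. Below is a throwaway sketch (assuming the plotted logarithm is the natural one, which matches the ≈ 2 and ≈ 74 figures above) that simply reports where 2·log(n) and √n trade places:

#include <cmath>
#include <cstdio>

int main() {
  bool sqrt_was_cheaper = false;
  for (int n = 1; n <= 100; ++n) {
    const double two_log = 2.0 * std::log(static_cast<double>(n)); // natural log
    const double root = std::sqrt(static_cast<double>(n));
    const bool sqrt_is_cheaper = root < two_log;
    if (sqrt_is_cheaper != sqrt_was_cheaper) {
      // Prints the first integer N after each crossover of the two curves.
      std::printf("trend flips near n = %d (2*log n = %.2f, sqrt n = %.2f)\n",
                  n, two_log, root);
      sqrt_was_cheaper = sqrt_is_cheaper;
    }
  }
  return 0;
}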

Note: The constant is usually greater than 2, as it may depend on various extra comparisons or operations inside the algorithm that improve it asymptotically.

Asymptotically, O(2 log n) is equal to O(log n) (same growth rate, i.e. the same complexity class; this is a one-way equation), since constants are not relevant in Big O analysis.

Such one-way equations (the name was suggested by Donald Knuth) are basically an abuse of notation, but they are widely used and contextually and semantically clear. The more correct notation would be 2 log n ∈ O(log n).

From the Big O perspective: constants never move a function to a different complexity class - ever. The notation O(2 log n) is technically redundant (the constant adds nothing asymptotically), but it is perfectly understandable.

Note: As mentioned earlier, O(log n), O(lg n), O(log₁₀ n), and even O(ln n) may differ in actual runtime but belong to the same complexity class.
This is exactly why, in asymptotic analysis, all logarithms are collapsed into the generic O(log n).
In the infinite-N limit, their differences reduce to constant factors, and constants disappear. This is why documentation almost always writes O(log n), even when the underlying behavior is specifically O(lg n), etc.
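Concretely, this is just the change-of-base identity at work: for example, log₂ n = log₁₀ n / log₁₀ 2 ≈ 3.32 · log₁₀ n. Switching the base only multiplies the function by a fixed constant, and Big O discards constants.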

On the large scale it might not matter, but in applied programming it is actually important - especially in performance management.

In practice, knowing how high the N-scale is expected on average (or consistently) can significantly save precious CPU cycles, even if this is counter-intuitive from the Big O perspective. Even a subtle complexity difference can play its role.

Note: This particular example is for demonstration only; in practice there are no pure O(√n) algorithms. Some constant is always present, so it never behaves as a "real" O(√n).


Zooming too close

Let's zoom in, maybe too much, into the region where N < 10.
At this scale, the complexity curves behave very differently from what might be expected. Some functions dip below zero (for example, log n becomes negative for 0 < n < 1), while others appear to cross each other or even briefly reverse their asymptotic complexity order. And, of course, a real algorithm never has zero cost even for small N, although the pure log n curve does touch zero at N = 1.

Big O - All Graphs, zoom in

For algorithms and Big O, N is always a positive integer, and meaningful values begin at N = 1, since it represents the size of the input data.

However, even though this small region is not practically relevant, it shows a trend: the shape of each curve hints at where crossover points may appear when constants (the K-factor) are introduced.

If the growth rates differ only slightly (for example, O(√n) vs O(log n)), a sufficiently large constant may shift the curves so that one algorithm becomes faster than the other for some realistic inputs, as we saw before. And even more so, as we are about to see.

All pre-asymptotic behavior starts from this region. It is helpful to explore it in a graphical calculator to understand the trends better. The K-factor may be small, but experimenting with graphs makes it much easier to visualize and remember how constants shift the curves.

Practical example

Let's have a look at sorting algorithms again, and probably the most iconic example of them: quick sort vs. insertion sort.

Quick sort has a reputation as the most optimized and fastest sorting algorithm for the common use-case (which does not exist, of course). The actual quirk, as we saw, is that being faster from the asymptotic POV does not mean being faster on the pre-asymptotic scale.

Let's implement quick sort and insertion sort in their simplest incarnations and see pre-asymptotic behavior in action.

Insertion Sort

The insertion sort algorithm is used in its loop-based version, not the recursive one, since the main difference between them is the extra stack frames in the recursive version. The loop version is the better fit for a C++ use-case (C++ is not a pure functional language).

Note: Since C++ is not a pure functional language, the compiler does not guarantee tail-call optimization. Therefore, a recursive implementation creates stack frames, while the iterative one avoids this overhead.

Also, the shift variation is used in this particular case rather than swap, to reduce the number of assignments.

algorithm insertion_sort(A, lo, hi) is
  for i := lo + 1 to hi do
    x := A[i]
    j := i
    while j > lo and A[j - 1] > x do
      A[j] := A[j - 1]
      j := j - 1
    A[j] := x

Source

Note: pseudo-code slightly formatted and stylized to align with the general style of the article.

A literal translation to C++ looks like this:

#include <cstddef>
#include <span>

void insertion_sort(std::span<int> A) {
    std::size_t i = 1;
    while (i < A.size()) {
        int x = A[i];
        std::size_t j = i;
        while (j > 0 && A[j - 1] > x) {
            A[j] = A[j - 1];
            j = j - 1;
        }
        A[j] = x;
        i = i + 1;
    }
}

Or slightly shorter and cleaner without changing the flow of the algorithm:

void insertion_sort(std::span<int> A) {
    for (std::size_t i = 1, size = A.size(); i < size; ++i) {
        int x = A[i];
        auto j = i;
        while (j > 0 && A[j - 1] > x) {
            A[j] = A[j - 1]; // Note: shift right, not swap!
            --j;
        }
        A[j] = x; // Note: place the saved value
    }
}

Note: using std::swap adds 2 extra assignments per inner-loop step compared to shifting (see the sketch after this list), because:

  • swap = 3 writes per inner-loop step;
  • shift = 1 write per inner-loop step.
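For comparison, a swap-based insertion sort (a common textbook variant, not the one used in the benchmark) would look like the sketch below; the three writes per step are hidden inside each std::swap:

#include <cstddef>
#include <span>
#include <utility> // std::swap

// Swap-based variant, for comparison only: each std::swap of two ints
// performs 3 assignments (via a temporary), while the shift version above
// performs a single write per inner-loop step.
void insertion_sort_swapping(std::span<int> A) {
    for (std::size_t i = 1; i < A.size(); ++i) {
        std::size_t j = i;
        while (j > 0 && A[j - 1] > A[j]) {
            std::swap(A[j - 1], A[j]);
            --j;
        }
    }
}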

Quick Sort

To keep things fair with respect to insertion sort, the Hoare-partition-style variation of quick sort is used.
It is more efficient, since it performs fewer swaps. It uses recursion, but it can be used "as-is" in the C++ implementation, since the average depth of the call stack is O(log n), with the worst case being O(n), while a recursive insertion sort's call-stack depth would always be O(n) (the iterative version above uses none).

algorithm quick_sort(A, lo, hi) is
  if lo >= 0 && lo < hi then
    lt, gt := partition(A, lo, hi)
    quick_sort(A, lo, lt - 1)
    quick_sort(A, gt + 1, hi)

algorithm partition(A, lo, hi) is
  pivot := A[(lo + hi) / 2]
  lt := lo
  eq := lo
  gt := hi
  while eq <= gt do
    if A[eq] < pivot then
      swap A[eq] with A[lt]
      lt := lt + 1
      eq := eq + 1
    else if A[eq] > pivot then
      swap A[eq] with A[gt]
      gt := gt - 1
    else
      eq := eq + 1
  return lt, gt

Source

Note: pseudo-code slightly formatted and stylized to align with the general style of the article.

Note: the partition function used here is technically the three-way partitioning approach (also known as the DNF partition). I originally referred to it as "Hoare partition", but strictly speaking Hoare’s classic scheme uses only two pointers and does not form an explicit middle region. The 3-way version is simply a more practical refinement often used in modern quicksorts.

The direct C++ implementation is virtually no different from the pseudocode:

#include <span>
#include <utility> // std::swap, std::pair

inline auto partition_span(std::span<int> A, int lo, int hi) {
  int pivot = A[(lo + hi) / 2];
  int lt = lo;
  int eq = lo;
  int gt = hi;

  while (eq <= gt) {
    if (A[eq] < pivot) {
      std::swap(A[eq], A[lt]);
      lt++; eq++;
    } else if (A[eq] > pivot) {
      std::swap(A[eq], A[gt]);
      gt--;
    } else {
      eq++;
    }
  }
  return std::pair{lt, gt};
}

inline void quick_sort_impl(std::span<int> A, int lo, int hi) {
  if (lo >= 0 && lo < hi) {
    auto [lt, gt] = partition_span(A, lo, hi);
    quick_sort_impl(A, lo, lt - 1);
    quick_sort_impl(A, gt + 1, hi);
  }
}

void quick_sort(std::span<int> A) {
  if (!A.empty()) {
    quick_sort_impl(A, 0, static_cast<int>(A.size() - 1));
  }
}

Benchmarking

The quick "toy" benchmark, sorting over ~300 arrays with random unique values (insertion and quick sort use the same input data), shows the following trend.

Quick sort vs. Insertion sort - reference lines overview

Adding the reference graphs to the picture shows the actual difference between real-life execution and the estimated Big O values. The difference is caused by the K-factor.

Note: The O(n²) and O(n log n) values shown are the average-case complexities of insertion sort and quick sort, respectively.

With an appropriate K-factor, the reference curves can be aligned closely with the empirical measurements. The constant is computed as K = med(Tₘ) / med(F[Nₘ]), where Nₘ is the set of the last m input sizes, Tₘ is the set of runtimes measured for those sizes (typically about 0.5% of the dataset is sufficient), F[Nₘ] = { F(n) | n ∈ Nₘ }, and F is the corresponding Big O complexity function.
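A minimal sketch of that K estimation (hypothetical names, not the benchmark's actual code; it assumes the last m input sizes and their measured runtimes have already been collected):

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Median of a (small) sample; takes a copy on purpose so the caller's data
// stays untouched.
double median(std::vector<double> v) {
  auto mid = v.begin() + static_cast<std::ptrdiff_t>(v.size() / 2);
  std::nth_element(v.begin(), mid, v.end());
  return *mid;
}

// Estimate K so that K * F(n) tracks the measured runtimes.
// `sizes` and `times` hold the last m benchmark points (same order);
// `f` is the complexity function, e.g. [](double n) { return n * std::log2(n); }.
template <typename F>
double estimate_k(const std::vector<double>& sizes,
                  const std::vector<double>& times, F f) {
  std::vector<double> reference;
  reference.reserve(sizes.size());
  for (double n : sizes) {
    reference.push_back(f(n));
  }
  return median(times) / median(reference);
}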

Quick sort vs. Insertion sort - reference lines corrected with K-factor

The general trend is aligned with asymptotic analysis: O(n log n) is faster than O(n²), as expected.

However, if we zoom in to the scale of N ≤ 60, we can clearly see that insertion sort is suddenly faster than quick sort, and the trend has flipped.

Quick sort vs. Insertion sort - reference lines corrected with K-factor, zoom in

Although the theoretical crossover point is expected to appear around N ≈ 40, the empirical data shows it closer to N ≈ 45. This small shift is completely normal and can come from things like machine-level optimizations, cache-locality effects, or simply the fact that insertion sort avoids recursion.

Ultimately, it doesn't change the bigger picture, but implementation details always affect the runtime performance.


Standard libraries are always moving in the direction of optimization. This is why so many of them almost never use pure quick sort.
Below is a part of the quick sort implementation in libc++ from the LLVM project v21.

// The main sorting function.  Implements introsort combined with other ideas:
//  - option of using block quick sort for partitioning,
//  - guarded and unguarded insertion sort for small lengths,
//  - Tuckey's ninther technique for computing the pivot,
//  - check on whether partition was not required.
// The implementation is partly based on Orson Peters' pattern-defeating
// quicksort, published at: <https://github.com/orlp/pdqsort>.
void __introsort(...) {
  // Upper bound for using insertion sort for sorting.
  _LIBCPP_CONSTEXPR difference_type __limit = 24;
  ...
  if (__len < __limit) {
    std::__insertion_sort(...);
  }
  ...
}

Source

The quick/insertion sort switch is, obviously, not the whole optimization. However, the selection logic (insertion vs. quick sort) is clearly there: use insertion sort for collections smaller than 24 elements.

Note: One of the reasons why, in the simple benchmark, insertion sort stays faster than quick sort up to a much larger threshold (~45 vs. 24) is that the example does not include challenges like cache locality, non-unique elements, or partially sorted data. Changing the dataset or the data distribution may significantly reduce the gap between insertion and quick sort.
For a broader overview of some of these effects, check: Global Scientific Journal - "Comparison between Quicksort, MergeSort and Insertion Sort".


But why is insertion sort faster than quick sort at small N? Well, this is not purely an asymptotic issue. There are other reasons for it, such as:

  • no recursion (as mentioned in the algorithm variation description);
  • contiguous memory;
  • extremely cache-friendly behavior;
  • stable branch prediction;
  • fewer instructions;
  • etc.

Note: The list of reasons why algorithm A can be faster than algorithm B is not limited to asymptotic/pre-asymptotic behavior, and is also influenced by the actual physical manifestation of the algorithm on specific hardware.
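To tie the pieces together, here is a minimal sketch of the same hybrid idea (not libc++'s actual code): fall back to insertion sort below a small threshold and recurse with quicksort above it. It reuses the insertion_sort and partition_span functions defined earlier; the threshold mirrors libc++'s __limit but is only an illustrative choice here.

#include <span>

// Hypothetical hybrid sort: insertion sort for small ranges, quicksort
// otherwise. Reuses insertion_sort() and partition_span() defined above.
constexpr int kSmallRangeLimit = 24; // illustrative, mirrors libc++'s __limit

void hybrid_sort_impl(std::span<int> A, int lo, int hi) {
  if (lo >= hi) {
    return;
  }
  if (hi - lo + 1 < kSmallRangeLimit) {
    insertion_sort(A.subspan(lo, hi - lo + 1)); // small range: O(n²) but tiny K
    return;
  }
  auto [lt, gt] = partition_span(A, lo, hi); // 3-way partition from earlier
  hybrid_sort_impl(A, lo, lt - 1);
  hybrid_sort_impl(A, gt + 1, hi);
}

void hybrid_sort(std::span<int> A) {
  if (!A.empty()) {
    hybrid_sort_impl(A, 0, static_cast<int>(A.size() - 1));
  }
}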


Key Takeaways

Big O can evaluate the complexity growth rate of any algorithm used in code. However, the asymptotic nature of documented complexities does not let us be sure which algorithm will be "faster" at a specific scale.

Standard library algorithms are usually optimized very well. But when an algorithm is custom-written for a specific application, not only pre-asymptotic behavior but also hardware optimization levels, CPU characteristics, and instruction patterns can significantly affect performance.

The best approach is not only to understand the K-factor of the algorithm, but also to understand where this algorithm is applied. This includes, but is not limited to, knowing the median/average/edge-case ranges of the N-scale.

But what is a "big enough" N? When will the asymptotics show the difference according to the Big O estimation? This purely depends on the specifics of the algorithm, the data distribution, and many other factors. As with the "quick/insertion sort" issue, the cross-point could be:

  • below N ≈ 45 in a simple benchmark;
  • below N ≈ 24 in the standard C++ library implementation;
  • above 10ᵐ, depending on the dataset, cache locality, machine code, etc.

The actual cross-point may vary noticeably - sometimes by several orders of magnitude.

Remember, Big O is neither a performance factor nor actual runtime. It is only the rate of complexity growth, always specified in the asymptotic sense (not from the pre-asymptotic POV). Accounting for this requires in-depth analysis during optimization, but it may yield relatively easy performance wins.
