Why brute-force counterexample search fails (and what to do instead)

#algorithms #python #ai #programming

A few weeks ago there was a headline going around that an AI system disproved a long-standing conjecture in discrete geometry by finding an explicit counterexample. Most of the discussion focused on the AI angle, which is fair enough. But the part that stuck with me was buried in the methods: the search space was astronomical, roughly larger than the number of atoms in the observable universe.

If you've ever tried to brute-force a combinatorial search — looking for a graph that breaks your heuristic, a configuration that violates an invariant, a counterexample to your own pet conjecture — you've watched your CPU spin for hours while making essentially zero progress. I spent three days last year trying to find a small graph that violated a property I was convinced was wrong. Ended up with a 14 GB log file and nothing useful.

This post is about why naive search collapses in these spaces, and what actually works.

The problem: combinatorial explosion is worse than it looks

Let's make it concrete. Say you want to find a counterexample involving binary configurations of length n. Even at a modest n = 200, the space is 2^200, which is around 1.6 × 10^60. At a billion checks per second on a single core, here's how long that takes:

# Time to enumerate 2^n configurations at 1e9 checks/sec
import math

def years_to_enumerate(n, checks_per_sec=1e9):
    seconds = (2 ** n) / checks_per_sec
    return seconds / (365.25 * 24 * 3600)

for n in [40, 60, 80, 100, 200]:
    print(f'n={n}: {years_to_enumerate(n):.2e} years')
# n=40 finishes in ~18 minutes
# n=60 takes ~36 years
# n=200 takes ~5e43 years. Heat death first.

The doubling sneaks up on you. Adding twenty bits to the input doesn't make the problem twenty times harder — it makes it a million times harder.

And here's the part most folks miss: even if you had infinite compute, random sampling wouldn't save you. In a lot of these problems, counterexamples aren't uniformly distributed. They live in tiny, structured regions of the space. Sampling uniformly is like trying to find a specific grain of sand by teleporting to random spots on Earth.

Root cause: you're treating structure as if it doesn't exist

When brute force fails, it's almost always because the problem has structure you're throwing away:

Symmetry: many configurations are equivalent under relabeling, rotation, or reflection. If your problem has a symmetry group of size k, you're doing k times more work than necessary.
Locality: small changes to a candidate usually produce small changes in whether it's a counterexample. Brute force ignores this gradient entirely.
Constraint propagation: if a partial configuration already violates a necessary condition, every extension of it is doomed. Brute force often re-discovers this from scratch billions of times.

Fix any one of these and the practical search space can shrink by many orders of magnitude.

What actually works

Three families of techniques, roughly in order of how much problem-specific work they require.

1. SAT/SMT solvers (if your problem is encodable)

If you can phrase the existence of a counterexample as a Boolean satisfiability problem, modern solvers will outperform anything you'd write by hand. They do clause learning, non-chronological backtracking, and variable activity heuristics that took the research community decades to get right. The Z3 docs are a reasonable starting point.

from z3 import Solver, Bool, Or, Not, sat

# Toy example: find an assignment that satisfies a tricky clause set
x = [Bool(f'x_{i}') for i in range(50)]
s = Solver()

# Encode the constraints that define 'this is a counterexample'
# Real problems have hundreds to millions of clauses
s.add(Or(x[0], Not(x[1]), x[2]))
s.add(Or(Not(x[0]), x[3]))
# ... more clauses encoding the conjecture's negation

if s.check() == sat:
    model = s.model()
    # model is your counterexample
    print([model[v] for v in x])
else:
    print('No counterexample exists for this encoding')

The catch: encoding is an art. A bad encoding can make a tractable problem intractable.

2. Local search with smart neighborhoods

When SAT encoding is awkward, simulated annealing or its cousins often work surprisingly well. The trick is defining a continuous-ish "badness" score even when your problem is discrete, so you have a gradient to follow.

import random, math

def simulated_annealing(initial, neighbors, score, steps=100000):
    # score: lower is closer to being a counterexample, 0 means found one
    current = initial
    current_score = score(current)
    best, best_score = current, current_score
    T = 1.0  # initial temperature

    for step in range(steps):
        # Cool down slowly. Too fast and you get stuck in local minima
        T = 1.0 * (0.9999 ** step)
        candidate = random.choice(neighbors(current))
        cand_score = score(candidate)
        delta = cand_score - current_score

        # Accept improvements always, accept worse moves probabilistically
        if delta < 0 or random.random() < math.exp(-delta / T):
            current, current_score = candidate, cand_score
            if current_score < best_score:
                best, best_score = current, current_score
                if best_score == 0:
                    return best  # counterexample found
    return best

I've used this on graph problems where SAT solvers timed out. It won't find a needle in a 10^200 haystack on its own, but combined with a decent scoring function it routinely beats anything you'd brute-force.

3. Learned search policies

This is the family that's been getting the recent headlines. The idea is to train a model — could be a neural net, could be a simpler policy — that predicts which moves in your search tree are most likely to lead to a counterexample, and biases the search accordingly. AlphaZero-style self-play with a math problem as the game.

This is heavy machinery and overkill for most day-to-day work. I haven't tested it thoroughly on my own problems, but the pattern showing up in published results is consistent: when the search space has hidden structure that's hard to encode by hand, learning the structure from search trajectories beats hand-tuned heuristics. The AlphaZero paper is still the cleanest exposition of the core idea.

Prevention tips for the next time you reach for brute force

Estimate the size of the space before writing the loop. If it's bigger than 2^40, brute force is not a plan.
Quotient by symmetry first. Canonicalize candidates (sort, hash, whatever) so you don't visit equivalent ones twice.
Add cheap necessary conditions as early pruning. If a partial candidate fails a necessary condition, prune the whole subtree.
Profile your scoring function. Most local-search runs spend 90%+ of their time in score(). Speeding it up by 10x is often easier than improving the algorithm.
Save your random seeds. When you do find something, you want to reproduce it.

The boring takeaway from all the recent AI-finds-counterexample news isn't that AI is magic. It's that search with structure beats search without it, and the gap grows fast as the space grows. You don't need a frontier model to apply that lesson — most of the wins come from the boring stuff: symmetry reduction, propagation, and a half-decent local search.