DEV Community: whomi928

Building a 2D Physics Engine From Scratch: SAT Collision Detection in C++ and SDL2

whomi928 — Wed, 01 Jul 2026 07:20:51 +0000

I wanted to actually understand collision detection — not "call the library function," but understand it well enough to write it myself. So I built a small 2D physics engine in C++ and SDL2, no external physics library, and picked the Separating Axis Theorem (SAT) as the core algorithm.

Here's how it works, and the gotchas I ran into turning "shapes are touching" into "shapes bounce correctly."

Why not just bounding circles?

Circle-vs-circle collision is one line: compare the distance between centers to the sum of the radii. Easy, but wrong the moment you want anything that isn't round — a square, a triangle, a pentagon. Once shapes have edges and orientation, you need real polygon-vs-polygon detection. That's what SAT gives you.
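For reference, the circle test described above can be sketched in a few lines. This is a minimal illustration, not code from the engine: the `Circle` struct and `circles_collide` name are hypothetical, and it compares squared distances so no square root is needed.

```cpp
#include <cassert>

// Hypothetical circle type; the engine itself may store this differently.
struct Circle {
    float x, y;    // center
    float radius;
};

// True when the circles touch or overlap.
// Squared distance between centers vs. squared sum of radii.
bool circles_collide(const Circle& a, const Circle& b) {
    assert(a.radius >= 0.0f && b.radius >= 0.0f);
    float dx = b.x - a.x;
    float dy = b.y - a.y;
    float r = a.radius + b.radius;
    return dx * dx + dy * dy <= r * r;
}
```

Nothing in this test knows about edges or rotation, which is exactly why it stops being useful once the shapes are polygons.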

The idea behind SAT

Two convex shapes are not colliding if you can find any single axis where their projections don't overlap. So the algorithm is:

For every edge of both shapes, take the perpendicular (the edge's normal) as a candidate separating axis.
Project every vertex of both shapes onto that axis.
If the projections leave a gap on even one axis, the shapes aren't colliding, and you can stop immediately.
If every axis shows overlap, the shapes are colliding, and the axis with the smallest overlap tells you the cheapest way to push them apart — the minimum translation vector.

Here's the core of it:

CollisionResult check_collision_sat(const vector<pair<float, float>>& shape1, const vector<pair<float, float>>& shape2) {
CollisionResult result = { false, INFINITY, 0.0f, 0.0f };
const vector<pair<float, float>>* shapes[] = { &shape1, &shape2 };

for (int s = 0; s < 2; s++) {
    const vector<pair<float, float>>& current_shape = *shapes[s];

    for (size_t i = 0; i < current_shape.size(); i++) {
        size_t next_i = (i + 1) % current_shape.size();
        float edge_x = current_shape[next_i].first - current_shape[i].first;
        float edge_y = current_shape[next_i].second - current_shape[i].second;

        // perpendicular to the edge = candidate separating axis
        float axis_x = -edge_y;
        float axis_y = edge_x;
        float len = vector_length(axis_x, axis_y);
        if (len == 0.0f) continue; // skip degenerate edges (duplicate vertices) to avoid dividing by zero
        axis_x /= len;
        axis_y /= len;

        // project both shapes onto this axis
        float minA = INFINITY, maxA = -INFINITY;
        for (const auto& p : shape1) {
            float projected = dot_product(p.first, p.second, axis_x, axis_y);
            minA = std::min(minA, projected);
            maxA = std::max(maxA, projected);
        }

        float minB = INFINITY, maxB = -INFINITY;
        for (const auto& p : shape2) {
            float projected = dot_product(p.first, p.second, axis_x, axis_y);
            minB = std::min(minB, projected);
            maxB = std::max(maxB, projected);
        }

        if (maxA < minB || maxB < minA) {
            return { false, 0.0f, 0.0f, 0.0f }; // found a gap — no collision
        }

        float overlap = std::min(maxA, maxB) - std::max(minA, minB);
        if (overlap < result.depth) {
            result.depth = overlap;
            result.normal_x = axis_x;
            result.normal_y = axis_y;
        }
    }
}

result.isColliding = true;
return result;

}

Looping over every edge of both shapes as candidate axes is what makes this work for arbitrary convex polygons, not just a fixed set of shape types.
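The snippet above leans on a `CollisionResult` type and two helpers, `vector_length` and `dot_product`, that aren't shown, and its unqualified `vector` and `pair` suggest a `using namespace std;` somewhere. Here is a minimal sketch of what those pieces could look like. The field order is inferred from the `{ false, INFINITY, 0.0f, 0.0f }` initializer, and `Polygon` and `project_onto_axis` are hypothetical names that just factor out the projection loop the function runs twice.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

using Polygon = std::vector<std::pair<float, float>>;  // hypothetical alias

// Field order assumed from the initializer used in the article's snippet.
struct CollisionResult {
    bool  isColliding;
    float depth;     // overlap along the axis of least penetration
    float normal_x;  // unit axis, x component
    float normal_y;  // unit axis, y component
};

// Euclidean length of the 2D vector (x, y).
float vector_length(float x, float y) {
    return std::sqrt(x * x + y * y);
}

// Dot product of (ax, ay) and (bx, by).
float dot_product(float ax, float ay, float bx, float by) {
    return ax * bx + ay * by;
}

// Projects every vertex onto a unit axis and returns {min, max}.
std::pair<float, float> project_onto_axis(const Polygon& shape,
                                          float axis_x, float axis_y) {
    assert(!shape.empty());
    float lo = INFINITY, hi = -INFINITY;
    for (const auto& p : shape) {
        float projected = dot_product(p.first, p.second, axis_x, axis_y);
        lo = std::min(lo, projected);
        hi = std::max(hi, projected);
    }
    return {lo, hi};
}
```

Two intervals `{minA, maxA}` and `{minB, maxB}` then overlap exactly when neither `maxA < minB` nor `maxB < minA` holds, which is the gap test in the main loop.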

Detection isn't the interesting part — response is

Knowing two shapes overlap doesn't move anything. You need two separate passes after that: positional correction (untangle the overlap) and velocity resolution (actually make them bounce). I originally tried to do both in one step and got jittery, sticky collisions — separating them fixed it.

if (result.isColliding) {
// SAT gives you AN axis, not necessarily pointing from shape1 -> shape2.
// If it's backwards, the correction pushes shapes the wrong way.
float dir_x = e2.x - e1.x;
float dir_y = e2.y - e1.y;
if ((dir_x * result.normal_x + dir_y * result.normal_y) < 0) {
result.normal_x *= -1;
result.normal_y *= -1;
}

// --- positional correction: split the push by mass, so the lighter shape moves more ---
float total_mass = e1.mass + e2.mass;
float ratio1 = e2.mass / total_mass;
float ratio2 = e1.mass / total_mass;
e1.x -= result.normal_x * (result.depth * ratio1);
e1.y -= result.normal_y * (result.depth * ratio1);
e2.x += result.normal_x * (result.depth * ratio2);
e2.y += result.normal_y * (result.depth * ratio2);

// --- impulse resolution: the actual bounce ---
float relVelX = e2.velocity_x - e1.velocity_x;
float relVelY = e2.velocity_y - e1.velocity_y;
float velAlongNormal = (relVelX * result.normal_x) + (relVelY * result.normal_y);

if (velAlongNormal > 0) continue; // already separating, do nothing

float impulse = -(1.0f + restitution) * velAlongNormal;
impulse /= (1.0f / e1.mass) + (1.0f / e2.mass);

float impulseX = impulse * result.normal_x;
float impulseY = impulse * result.normal_y;
e1.velocity_x -= impulseX / e1.mass;
e1.velocity_y -= impulseY / e1.mass;
e2.velocity_x += impulseX / e2.mass;
e2.velocity_y += impulseY / e2.mass;

}

Two things worth calling out:

The normal direction check. SAT hands you an axis of least overlap, but nothing guarantees it points from shape A toward shape B — it depends on which edge produced the smallest overlap. Skip the direction check and roughly half your collisions push objects into each other instead of apart. Took me a while to figure out why some collisions looked fine and others looked like objects were teleporting through each other.

Mass-weighted positional correction. Splitting the correction 50/50 regardless of mass looks wrong the moment you have a heavy shape and a light one: the light one should move more. Giving each shape a share of the push equal to the other shape's fraction of the total mass fixes that with barely any extra code.
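To sanity-check the impulse math, it helps to collapse it to one dimension, where the normal is just +1 pointing from the first body to the second. The sketch below reuses the same formula as the response code above. `Body1D` and `resolve_1d` are hypothetical names for illustration, not part of the engine.

```cpp
#include <cassert>

// Hypothetical 1D body: just mass and velocity along the collision normal.
struct Body1D {
    float mass;
    float velocity;
};

// Same impulse formula as the 2D response code, with the normal fixed at +1
// (pointing from a toward b).
void resolve_1d(Body1D& a, Body1D& b, float restitution) {
    assert(a.mass > 0.0f && b.mass > 0.0f);
    float velAlongNormal = b.velocity - a.velocity;
    if (velAlongNormal > 0.0f) return;  // already separating, do nothing

    float impulse = -(1.0f + restitution) * velAlongNormal;
    impulse /= (1.0f / a.mass) + (1.0f / b.mass);

    a.velocity -= impulse / a.mass;
    b.velocity += impulse / b.mass;
}
```

With equal masses and restitution 1.0, two bodies approaching at +2 and -2 leave at -2 and +2, so they swap velocities. With restitution 0.0 they both end at 0. In either case `a.mass * a.velocity + b.mass * b.velocity` is the same before and after, which is a quick check that the impulse is being applied symmetrically.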

Where it falls apart (for now)

Being honest about the current limits:

No broadphase. Every entity gets tested against every other entity, every frame: O(n²). Fine for a few dozen shapes on screen, not built to scale past that yet.
No real torque. Rotation is currently a cosmetic hack — velocity nudges the orientation angle so shapes tumble as they fall, but there's no actual angular momentum or moment of inertia driving it.
Energy creep at high restitution. Without sub-stepping, impulse resolution close to restitution = 1.0 can slowly add energy to the system over many bounces — a known trade-off of this style of solver.

What's next

Spatial partitioning (probably a simple grid/spatial hash before reaching for a quadtree), real angular velocity and torque so rotation is physically driven instead of faked, and then folding all of this into a small tile-based game engine that's currently sitting half-built on the side.
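As a rough idea of what the grid step might look like (this is not in the repo yet, and every name here is hypothetical): bucket each entity's bounding box into fixed-size cells, then only hand pairs that share a cell to the SAT test. The sketch uses an ordered `std::map` keyed by cell coordinates to keep it short. A real spatial hash would more likely use `std::unordered_map` with a hash of the cell coordinates.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <set>
#include <utility>
#include <vector>

// Hypothetical axis-aligned bounding box for one entity.
struct AABB {
    float min_x, min_y, max_x, max_y;
};

// Returns candidate pairs (i, j) with i < j whose boxes share a grid cell.
// Only these pairs would go on to the full SAT check.
std::set<std::pair<int, int>> broadphase_pairs(const std::vector<AABB>& boxes,
                                               float cell_size) {
    assert(cell_size > 0.0f);
    std::map<std::pair<int, int>, std::vector<int>> grid;

    for (int i = 0; i < static_cast<int>(boxes.size()); ++i) {
        const AABB& b = boxes[i];
        int x0 = static_cast<int>(std::floor(b.min_x / cell_size));
        int x1 = static_cast<int>(std::floor(b.max_x / cell_size));
        int y0 = static_cast<int>(std::floor(b.min_y / cell_size));
        int y1 = static_cast<int>(std::floor(b.max_y / cell_size));
        // Register the entity in every cell its box touches.
        for (int cx = x0; cx <= x1; ++cx) {
            for (int cy = y0; cy <= y1; ++cy) {
                grid[std::make_pair(cx, cy)].push_back(i);
            }
        }
    }

    std::set<std::pair<int, int>> pairs;  // set removes duplicates across cells
    for (const auto& cell : grid) {
        const std::vector<int>& ids = cell.second;  // ascending by construction
        for (std::size_t a = 0; a < ids.size(); ++a) {
            for (std::size_t b = a + 1; b < ids.size(); ++b) {
                pairs.emplace(ids[a], ids[b]);
            }
        }
    }
    return pairs;
}
```

The cell size is the tuning knob: somewhere around the size of a typical shape is a common starting point, since much smaller cells put each shape in many buckets and much larger ones drift back toward testing everything against everything.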

Code's up on GitHub if you want to poke at it or tell me what's wrong with my impulse math: https://github.com/whomi928/Physics-Engine.git

LinkedIn: www.linkedin.com/in/shaurya-aditya-0563a0377

I Built a Neural Network Inference Engine From Scratch in C++ (No PyTorch, No ONNX, Just AVX2)

whomi928 — Mon, 29 Jun 2026 02:30:40 +0000

Why does inference need a framework at all?

Every time I ran a tiny linear model through PyTorch, I felt like I was driving a go-kart with a jet engine strapped to it. The model was a few hundred KB. PyTorch's runtime was gigabytes. Somewhere between model(x) and the actual floating-point math, an entire universe of abstraction — autograd graphs, dispatch layers, tensor metadata — was quietly eating my CPU cycles.

So I asked a simple question: what does inference actually look like with nothing in the way?

That question turned into ML-model-loader — a bare-metal C++ inference engine that loads raw binary weights and runs forward passes directly on the CPU, using the same low-level techniques that power ggml and llama.cpp: cache-tiled GEMM, AVX2 SIMD intrinsics, and INT8 quantization.

No PyTorch. No ONNX Runtime. No GPU. Just C++, some pointer arithmetic, and a CPU that's faster than people give it credit for.

The architecture

The pipeline is intentionally minimal — two stages, one handoff:

[ Python Training (Colab) ]
          |
          | exports
          v
[ multi_model_weights.bin ]   (FP32 binary weight dump)
[ quantized_weights.bin   ]   (INT8 quantized weights)
          |
          | loaded by
          v
[ ML_loader_3.cpp ]
  ├── Weight loader (binary deserialization)
  ├── GEMM kernel (cache-tiled, AVX2)
  ├── INT8 quantization runtime
  └── Chrono benchmarking

Training stays in Python because there's no point reinventing backprop. But the moment the model is trained, it gets exported to a flat binary file — just layer dimensions followed by raw FP32 arrays — and from there, Python never touches the inference path again.
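To make "layer dimensions followed by raw FP32 arrays" concrete, here is a sketch of parsing one layer out of an in-memory copy of the file. The exact layout is an assumption, not taken from the repo: two native-endian `int32` values (rows, cols) followed by `rows * cols` row-major floats. `Layer` and `parse_layer` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

struct Layer {
    std::int32_t rows = 0;
    std::int32_t cols = 0;
    std::vector<float> weights;  // row-major, rows * cols values
};

// ASSUMED layout per layer: int32 rows, int32 cols, then rows*cols float32.
// Parses one layer starting at `offset` and advances `offset` past it.
// Returns false on truncated or invalid input.
bool parse_layer(const std::vector<unsigned char>& file, std::size_t& offset,
                 Layer& out) {
    static_assert(sizeof(float) == 4, "expects 32-bit floats");
    assert(offset <= file.size());

    std::int32_t dims[2];
    if (file.size() - offset < sizeof dims) return false;
    std::memcpy(dims, file.data() + offset, sizeof dims);
    if (dims[0] <= 0 || dims[1] <= 0) return false;

    std::size_t count = static_cast<std::size_t>(dims[0]) *
                        static_cast<std::size_t>(dims[1]);
    std::size_t bytes = count * sizeof(float);
    if (file.size() - offset - sizeof dims < bytes) return false;

    out.rows = dims[0];
    out.cols = dims[1];
    out.weights.resize(count);
    std::memcpy(out.weights.data(), file.data() + offset + sizeof dims, bytes);
    offset += sizeof dims + bytes;
    return true;
}
```

Calling `parse_layer` in a loop until it returns false would walk a multi-layer file. Using `memcpy` rather than casting the byte pointer keeps the reads valid regardless of how the buffer happens to be aligned.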

The actual bottleneck: it's not the math, it's the cache

A naive triple-nested-loop matrix multiply does O(N³) arithmetic, but on any model bigger than a toy example the real cost is what it does to your L1 cache. Walking down a column of a large row-major matrix lands on a different cache line at every step and evicts data you'll need again a few iterations later. The CPU spends more time waiting on memory than doing arithmetic.

The fix is cache tiling: instead of multiplying full rows and columns, you break the matrices into small blocks — roughly 64×64 in this engine — sized so a tile fits entirely inside L1 cache. The inner multiply loop then operates entirely on hot data, and cache misses during the GEMM operation basically disappear. This one change is usually the single biggest performance lever in CPU-bound inference, bigger than any individual instruction-level trick.
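A scalar sketch of the tiling idea, without any SIMD, looks roughly like this. It is an illustration under assumptions rather than the engine's kernel: matrices are row-major `std::vector<float>`, `gemm_tiled` is a hypothetical name, and the default tile edge of 64 simply mirrors the figure quoted above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// C (M x N) += A (M x K) * B (K x N); all matrices row-major.
void gemm_tiled(const std::vector<float>& A, const std::vector<float>& B,
                std::vector<float>& C, std::size_t M, std::size_t K,
                std::size_t N, std::size_t tile = 64) {
    assert(tile > 0);
    assert(A.size() == M * K && B.size() == K * N && C.size() == M * N);

    for (std::size_t i0 = 0; i0 < M; i0 += tile) {
        const std::size_t i1 = std::min(i0 + tile, M);
        for (std::size_t k0 = 0; k0 < K; k0 += tile) {
            const std::size_t k1 = std::min(k0 + tile, K);
            for (std::size_t j0 = 0; j0 < N; j0 += tile) {
                const std::size_t j1 = std::min(j0 + tile, N);
                // Inner loops touch only one tile of A, B and C at a time.
                for (std::size_t i = i0; i < i1; ++i) {
                    for (std::size_t k = k0; k < k1; ++k) {
                        const float a = A[i * K + k];
                        for (std::size_t j = j0; j < j1; ++j) {
                            C[i * N + j] += a * B[k * N + j];
                        }
                    }
                }
            }
        }
    }
}
```

The arithmetic is identical to the naive version. Only the order of memory accesses changes, and the `std::min` calls handle dimensions that are not multiples of the tile size. The innermost loop runs along rows of both `B` and `C`, so it reads and writes contiguous memory.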

Then: feeding the cores 8 floats at a time

Once memory stops being the bottleneck, the next lever is the ALU. Scalar code multiplies one float, adds one float, one instruction at a time. AVX2 lets you do better:

__m256 acc = _mm256_setzero_ps();
acc = _mm256_fmadd_ps(weight_vec, input_vec, acc); // 8 floats, fused multiply-add, one instruction

_mm256_fmadd_ps performs a fused multiply-add across 8 floats simultaneously. On paper that's an 8× speedup on the compute-bound inner loop — in practice you don't get the full 8× because memory bandwidth and tiling overhead eat into it, but it's still a massive win over scalar code. Combined with cache tiling, this is what took the FP32 forward pass down to roughly 8ms for a 10→512→512→128→10 network — no GPU involved.

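One detail that matters more than people expect: all weight buffers are allocated with _mm_malloc for 32-byte alignment. Aligned loads such as _mm256_load_ps fault on a misaligned address, and unaligned loads that straddle a cache line cost extra, so it's a one-line fix that's easy to forget. Buffers from _mm_malloc also have to be released with _mm_free, not free.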
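To show how the two-line fragment above fits into a full loop, here is a sketch of a dot product built around `_mm256_fmadd_ps`. It is not the engine's kernel. The function name `dot` is hypothetical, the AVX2 path only compiles when the build enables it (for example with `-mavx2 -mfma`), and a scalar loop handles the tail as well as builds without AVX2.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__AVX2__) && defined(__FMA__)
#include <immintrin.h>
#endif

// Dot product of two float arrays of length n.
float dot(const float* w, const float* x, std::size_t n) {
    assert(n == 0 || (w != nullptr && x != nullptr));
    std::size_t i = 0;
    float sum = 0.0f;
#if defined(__AVX2__) && defined(__FMA__)
    __m256 acc = _mm256_setzero_ps();
    for (; i + 8 <= n; i += 8) {
        // loadu works for any address; _mm256_load_ps would be valid only
        // if both pointers were 32-byte aligned.
        __m256 wv = _mm256_loadu_ps(w + i);
        __m256 xv = _mm256_loadu_ps(x + i);
        acc = _mm256_fmadd_ps(wv, xv, acc);  // acc += wv * xv in 8 lanes
    }
    alignas(32) float lanes[8];
    _mm256_store_ps(lanes, acc);  // aligned store: lanes is 32-byte aligned
    for (float v : lanes) sum += v;  // horizontal sum of the 8 partial sums
#endif
    for (; i < n; ++i) sum += w[i] * x[i];  // tail, or everything without AVX2
    return sum;
}
```

If the buffers are known to come from `_mm_malloc(bytes, 32)` and the loop always starts on a multiple of 8 floats, the two `_mm256_loadu_ps` calls could be swapped for `_mm256_load_ps`.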

Squeezing further: INT8 quantization

FP32 weights are 4 bytes per value. For large weight matrices, that's a lot of memory bandwidth spent just moving numbers around — and bandwidth, not compute, is often the real ceiling. Quantizing to INT8 cuts that 4×.

The scheme here is symmetric per-tensor quantization — about as simple as quantization gets:

scale = max(|W|) / 127
W_int8 = round(W / scale)

At inference time, the integer-quantized weights run through _mm256_madd_epi16, which multiplies pairs of 16-bit integers and sums adjacent products into 32-bit lanes, so the work happens on integer vectors instead of floats. The FP32 result is recovered by dequantizing after accumulation. That took the same network down to roughly 5ms, a meaningful drop on top of an already-fast FP32 path, mostly from the reduced memory traffic rather than from integer math being inherently faster here.
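The two-line scheme above translates into scalar C++ fairly directly. This is a sketch under assumptions, not the repo's code: `QuantizedTensor`, `quantize_symmetric` and `dequantize` are hypothetical names, values are clamped to [-127, 127], and an all-zero tensor falls back to a scale of 1 to avoid dividing by zero.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

struct QuantizedTensor {
    std::vector<std::int8_t> values;
    float scale;  // one scale for the whole tensor (per-tensor quantization)
};

// scale = max(|W|) / 127, W_int8 = round(W / scale)
QuantizedTensor quantize_symmetric(const std::vector<float>& w) {
    assert(!w.empty());
    float max_abs = 0.0f;
    for (float v : w) max_abs = std::max(max_abs, std::fabs(v));

    QuantizedTensor q;
    q.scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
    q.values.reserve(w.size());
    for (float v : w) {
        long r = std::lround(v / q.scale);
        r = std::max(-127L, std::min(127L, r));  // guard against rounding drift
        q.values.push_back(static_cast<std::int8_t>(r));
    }
    return q;
}

// Recovers an approximate FP32 value from one quantized weight.
float dequantize(std::int8_t q, float scale) {
    return static_cast<float>(q) * scale;
}
```

For weights `{-2.0, 0.0, 0.5, 2.0}` the scale is 2/127, and the quantized values come out as -127, 0, 32 and 127. Each recovered value is within half a scale step of the original, which is the accuracy cost traded for moving a quarter of the bytes.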

Model Architecture	Precision	Inference Time
10→512→512→128→10 (Linear NN)	FP32	~8ms
10→512→512→128→10 (Linear NN)	INT8 (quantized)	~5ms

(Benchmarked with std::chrono, CPU only.)

What I'd still call unfinished

This is deliberately a learning engine, not a production one, and the roadmap reflects that honestly:

Convolutional layers (2D GEMM tiling) — currently linear/fully-connected only
Multi-threading across tiles via std::thread or OpenMP — right now it's single-threaded, which leaves obvious performance on the table
ONNX import, so models don't need a custom binary export step
An ARM NEON port, since AVX2 ties this to x86-64 entirely

Try it yourself

git clone https://github.com/whomi928/ML-model-loader
cd ML-model-loader
g++ -O3 -mavx2 -mfma -o ML_loader_3 ML_loader_3.cpp
./ML_loader_3

You'll need a CPU with AVX2 (Intel Haswell/AMD Ryzen or newer) and multi_model_weights.bin sitting next to the binary — there's an included Colab notebook that trains a small linear network and exports the weights file if you want to generate your own.

If you've ever wanted to see what's actually happening underneath a model.forward() call — no autograd, no dispatch tables, just memory layout and instruction throughput — this is a fun rabbit hole to fall into. The repo's linked below, and the ggml/llama.cpp projects are worth a read if you want to see these same ideas taken much, much further.

Repo: github.com/whomi928/ML-model-loader
LinkedIn: www.linkedin.com/in/shaurya-aditya-0563a0377

Shaurya Aditya — B.Tech ECE, IIT BHU