DEV Community

Alex Towell
Alex Towell

Posted on • Originally published at metafunctor.com

Differentiation: Three Ways

A synthesis of three earlier posts, comparing forward-mode AD, reverse-mode AD, and numerical differentiation.

Computing derivatives shows up everywhere: optimization, machine learning, physics simulation, numerical analysis. This series has explored three distinct approaches:

  1. Forward-mode AD via dual numbers
  2. Reverse-mode AD via computational graphs
  3. Numerical differentiation via finite differences

Each has different strengths. The right choice depends on the shape of your problem.

The Landscape

Method Accuracy Cost for (f: \mathbb{R}^n \to \mathbb{R}) Cost for (f: \mathbb{R} \to \mathbb{R}^m) Memory
Forward AD Exact (O(n)) passes (O(1)) pass (O(1))
Reverse AD Exact (O(1)) pass (O(m)) passes (O(\text{ops}))
Finite Diff (O(h^p)) (O(n)) evaluations (O(n)) evaluations (O(1))

The key point: problem structure determines the best method.

Forward-Mode AD: Dual Numbers

Forward-mode AD extends numbers with an infinitesimal (\varepsilon) where (\varepsilon^2 = 0). The derivative falls out of the arithmetic for free:

// f(x) = x^3 - 3x + 1
// f'(x) = 3x^2 - 3

auto x = dual<double>::variable(2.0);  // x = 2, dx = 1
auto f = x*x*x - 3.0*x + 1.0;

std::cout << f.value() << "\n";       // 3.0
std::cout << f.derivative() << "\n";  // 9.0
Enter fullscreen mode Exit fullscreen mode

Strengths:

  • Simple implementation (operator overloading)
  • No memory overhead
  • Naturally composable for higher derivatives
  • Works with any function of overloaded operators

When to use:

  • Single input variable (or few inputs)
  • Computing Jacobian-vector products
  • Higher-order derivatives via nesting
  • Sensitivity analysis along one direction

Complexity: One forward pass per input variable. For f: R^n -> R^m, computing the full Jacobian requires n passes.

Reverse-Mode AD: Computational Graphs

Reverse-mode AD builds a computational graph during the forward pass, then propagates gradients backward via the chain rule:

auto f = [](https://metafunctor.com/const auto& x) {
    return sum(pow(x, 2.0));  // f(x) = sum(x^2)
};

auto df = grad(f);  // Returns gradient function
auto gradient = df(x);  // One backward pass for all partials
Enter fullscreen mode Exit fullscreen mode

Strengths:

  • O(1) backward passes regardless of input dimension
  • Powers modern deep learning (backpropagation)
  • Efficient for loss functions: f: R^n -> R

When to use:

  • Many inputs, scalar output (neural networks)
  • Computing vector-Jacobian products
  • Optimization where you need the full gradient

Complexity: One forward pass to build the graph, one backward pass to compute all gradients. Memory scales with the number of operations because you have to store intermediate values.

Numerical Differentiation: Finite Differences

Approximate the derivative using the limit definition:

// Central difference: f'(x) ~ (f(x+h) - f(x-h)) / 2h
double df = central_difference(f, x);
Enter fullscreen mode Exit fullscreen mode

Strengths:

  • Works with black-box functions
  • No special types required
  • Simple to implement and understand

Limitations:

  • Approximate (truncation error + round-off error)
  • Requires careful step size selection
  • O(n) function evaluations for gradient

When to use:

  • Black-box functions (no source code)
  • Quick approximations
  • Validating AD implementations

Decision Tree

Need derivatives?
|
+-- Source code available?
|   |
|   +-- Yes: Use AD
|   |   |
|   |   +-- f: R -> R^m (few inputs, many outputs)?
|   |   |   --> Forward-mode AD
|   |   |
|   |   +-- f: R^n -> R (many inputs, scalar output)?
|   |   |   --> Reverse-mode AD
|   |   |
|   |   +-- f: R^n -> R^m (both)?
|   |       --> Forward for m < n, Reverse for n < m
|   |
|   +-- No: Black-box function
|       --> Finite differences
|
+-- Validating correctness?
    --> Finite differences (ground truth)
Enter fullscreen mode Exit fullscreen mode

Side-by-Side Comparison

Consider f(x,y,z) = x^2 + y^2 + z^2, evaluated at (1, 2, 3).

Forward-Mode (3 passes for full gradient)

// Pass 1: differentiate w.r.t. x
auto dx = dual<double>::variable(1.0);
auto y1 = dual<double>::constant(2.0);
auto z1 = dual<double>::constant(3.0);
auto f1 = dx*dx + y1*y1 + z1*z1;
// f1.derivative() = 2 (df/dx)

// Pass 2: differentiate w.r.t. y
// Pass 3: differentiate w.r.t. z
Enter fullscreen mode Exit fullscreen mode

Reverse-Mode (1 pass for full gradient)

auto f = [](https://metafunctor.com/const auto& v) {
    return v(0)*v(0) + v(1)*v(1) + v(2)*v(2);
};
auto gradient = grad(f)(vector{1.0, 2.0, 3.0});
// gradient = {2, 4, 6} in one backward pass
Enter fullscreen mode Exit fullscreen mode

Finite Differences (3 evaluations for gradient)

// df/dx ~ (f(1+h,2,3) - f(1-h,2,3)) / 2h
// df/dy ~ (f(1,2+h,3) - f(1,2-h,3)) / 2h
// df/dz ~ (f(1,2,3+h) - f(1,2,3-h)) / 2h
auto gradient = finite_diff::gradient(f, {1.0, 2.0, 3.0});
Enter fullscreen mode Exit fullscreen mode

Combining Methods: Validation

The gold standard for testing AD: compare against finite differences.

auto ad_grad = grad(f)(x);
auto fd_grad = finite_diff::gradient(f, x);

// Should match within tolerance
REQUIRE(approx_equal(ad_grad, fd_grad, 1e-4, 1e-6));
Enter fullscreen mode Exit fullscreen mode

Finite differences are slow and approximate, but they don't lie. If your AD implementation disagrees with finite differences, the AD is wrong.

Complexity Analysis

For (f: \mathbb{R}^n \to \mathbb{R}^m) with (p) elementary operations:

Method Time Space Function Evals
Forward AD (O(np)) (O(1)) (n)
Reverse AD (O(mp)) (O(p)) (m)
Finite Diff (O(n)) (O(1)) (2n) (central)

The crossover point:

  • (n < m): Forward-mode wins
  • (n > m): Reverse-mode wins
  • (n \approx m): Either works; consider memory constraints

For neural networks with millions of parameters and scalar loss, reverse mode is the only practical choice.

Integration with Numerical Quadrature

One of my favorite applications: differentiation under the integral sign.

// Compute d/da of integral from 0 to 1 of a*sin(x) dx
using D = dual<double>;

D a = D::variable(2.0);
auto integrand = [&a](https://metafunctor.com/D x) { return a * sin(x); };

D result = integrate(integrand, D::constant(0.0), D::constant(1.0));

// result.value() = 2 * (1 - cos(1)) ~ 0.919
// result.derivative() = (1 - cos(1)) ~ 0.459
Enter fullscreen mode Exit fullscreen mode

Forward-mode AD composes naturally with generic numerical algorithms because dual numbers form a ring. They satisfy the algebraic requirements that the algorithms need. You don't have to do anything special.

Conclusion

Three methods, three use cases:

  1. Forward-mode AD: Simple, memory-efficient, ideal for few inputs
  2. Reverse-mode AD: Powers deep learning, ideal for many inputs
  3. Finite differences: Universal fallback, validation tool

The Stepanov perspective: each method exploits different structure. Forward AD uses the ring structure of dual numbers. Reverse AD uses the DAG structure of computation. Finite differences use only function evaluation. Understanding which structure your problem has tells you which method to reach for.

Further Reading

Top comments (0)