Jose Crespo, PhD

The Two Programming Styles of AI — and Why Everyone Uses the Wrong One

AI keeps crashing against the same walls

Yep, everybody, even Tesla, is using the same old math from two centuries ago, so it’s not surprising to see scenes like this all over YouTube when you rely on AI to drive your Tesla:

Apparently, not even Tesla - with its $1.4 trillion valuation and army of PhDs - knows about this math. Or maybe they do, and just enjoy watching their cars perform interpretive dance routines at 60 mph.

Either way, here’s the greatest hits compilation you’ve seen all over YouTube:

The Tesla Self-Driving Blooper Reel:

🎬 Phantom Braking - The car slams the brakes for a shadow. Because apparently, shadows are the #1 threat to highway safety in the 21st century.

🎬 The Surprise Party Turn - Takes curves at full speed, then goes “OH SHIT A CURVE!” and throws a mini-chicane out of nowhere. Comedy for everyone, except your neck.

🎬 The Seizure Shuffle - Steering adjustments so jerky you’d think the car is having an existential crisis. Left, right, left, right... it’s not driving, it’s vibrating down the highway.

🎬 The “Why Did It Do That?” - Does something so inexplicable that even the AI researchers watching the logs just shrug and mutter “gradient descent, probably.”

If this post is sparking your curiosity, you may enjoy exploring deeper analyses, geometric AI concepts, and ongoing research on my personal site:

👉 https://josecrespo.substack.com

The Fix That Nobody’s Using

Tesla could solve this - easily - by using second derivatives (Hessian-vector products, or HVP for the cool kids).

So could Google, Meta, OpenAI, and pretty much every company with an “AI Strategy” PowerPoint deck.

But they’re not. See the table below - notice a pattern?

Wait — These Are Different Problems, Right?

Not exactly. They are different symptoms, but the same disease.

They’re all using math that can answer “Which way should I go?”

but not “How sharply is this about to change?”

It’s like asking a GPS for directions but never checking if there’s a cliff ahead.

The Root Cause: Your Great-great-grandfather’s Calculus

As I said, in Tesla’s case the cars are reacting to what’s happening right now, not anticipating what’s about to happen.

It’s like playing chess by only looking at the current board position - no planning, no strategy, just “I see a piece, I move a piece.”

Chess players call this “beginner level.” Tesla calls it “Full Self-Driving.”

Ready for the diagnosis? Tesla engineers, like everyone else in Silicon Valley, are still using 19th-century limit-based calculus — the math equivalent of trying to stream Netflix on a telegraph machine.

Meanwhile, the solution has been sitting on the shelf for 60 years: dual/jet numbers.

Nobody thought to check the manual. Seriously, who bothers with that “wacko, exotic math” they don’t teach in university CS programs?

And yet, these hyperreal-related algebras (duals and jets) make second derivatives (HVP) a computationally trivial operation through the elegant composition of two first-order operators (JVP ∘ VJP).

Hold Up — Are You Telling Me…

that what’s computationally intractable with the traditional h-limit calculus so many Ivy-League courses treat as the gold standard is a trivial operation with dual/jet numbers, and that this could fix most of those damn curve-related problems in our current AI?

Yes. Exactly that.

And it gets worse.

The Hyperreal Revolution: Your Calculus Professor Never Told You This

The calculus you learned in college — the one that got you through differential equations, optimization theory, and machine learning courses — isn’t wrong. It’s just incomplete.

It’s like learning arithmetic but never being taught that multiplication is just repeated addition. You can still do math, but you’re doing it the hard way.

Here’s the specific problem:

Traditional calculus (the h-limit approach):

f'(x) = lim[h→0] (f(x+h) - f(x)) / h

This defines derivatives as limits — which means:

✅ Mathematically rigorous
✅ Great for proving theorems
❌ Computationally nightmarish for anything beyond first derivatives

Want the second derivative? Then you need

f''(x) = lim[h→0] (f'(x+h) - f'(x)) / h

But f'(x+h) itself requires computing

f'(x+h) = lim[h'→0] (f(x+h+h') - f(x+h)) / h'

So, summing up: either you end up with nested limits and two step sizes (h,h′) that interact unstably, or you resort to higher-order stencils that are exquisitely sensitive to step size and noise. In both cases you lose derivative structure, so two first-derivative passes (JVP → VJP) don’t compose into a true second derivative - you’re rebuilding guesses instead of carrying derivatives.

For a third derivative? Three nested limits, or even higher-order stencils.

For the k-th derivative: either nest k layers or use wider stencils - noise blows up as O(h^-k), truncation depends on stencil order, and you still lose derivative structure, so JVP→VJP won’t compose into HVP in an FD pipeline.
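
To make the compounding-error claim concrete, here is a minimal sketch in plain Python (my own toy example, not anyone’s production stack): it nests central differences to approximate f''(x) for sin(x), and on ordinary float64 hardware the error first shrinks and then blows up as h gets smaller.

import math

def fd(f, x, h):
    # central finite difference: approximate f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

def fd2(f, x, h):
    # nested finite differences: differentiate fd itself to approximate f''(x)
    return (fd(f, x + h, h) - fd(f, x - h, h)) / (2 * h)

exact = -math.sin(1.0)  # true f''(1) for f = sin

for h in (1e-2, 1e-4, 1e-6, 1e-8):
    approx = fd2(math.sin, 1.0, h)
    print(f"h={h:.0e}  f''~{approx:+.8f}  error={abs(approx - exact):.2e}")

Past a certain point, shrinking h makes things worse, because the numerator is pure cancellation noise - exactly the “rebuilding guesses instead of carrying derivatives” problem described above.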

So your self-driving car keeps crashing against sunset-lit walls.

And for GPT-5’s approximately 1.8 trillion parameters? Computational impossibility.

Sharp Readers Will Notice:

“Hold on, if we know the function f, can’t we just compute f’ and f’’ analytically? Why do we need any of this limit or dual number stuff?”

Great question!

Here’s why that doesn’t work for neural networks:

The Problem: Neural Networks Are Black Boxes
When you write a simple function, you can compute derivatives analytically:

# Simple case - analytic derivatives work fine
f(x) = x² + 3x + 5
f'(x) = 2x + 3      # Easy to derive by hand
f''(x) = 2          # Even easier
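
If you’d rather not trust the hand derivation, a symbolic tool reproduces it in two lines. A quick sketch with SymPy (my choice of tool here; the author brings up SymPy further down):

import sympy as sp

x = sp.symbols('x')
f = x**2 + 3*x + 5

print(sp.diff(f, x))     # 2*x + 3
print(sp.diff(f, x, 2))  # 2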

But a neural network with 1.8 trillion parameters looks like this:

f(x) = σ(W₁₇₅·σ(W₁₇₄·σ(...σ(W₂·σ(W₁·x))...)))


Where:

  • Each W is a matrix with billions of parameters
  • Each σ is a nonlinear activation function
  • There are hundreds of layers (GPT-style)
  • The composition is dynamically computed during runtime

You literally cannot write down the analytic form of f'(x) because:

  1. The function changes every time you update parameters (every training step)
  2. It's too large to express symbolically
  3. It contains billions of nested compositions


Why Traditional Calculus Fails Here

The h-limit formula:

f''(x) = lim[h→0] (f'(x+h) - f'(x)) / h


This requires you to evaluate f'(x+h), which means:

f'(x+h) = lim[h'→0] (f(x+h+h') - f(x+h)) / h'

And here’s the trap:

  • You can’t compute f' analytically (the function is too complex)
  • So you approximate it using finite differences (the h-limit)
  • Now you need f'(x+h) for the second derivative
  • So you approximate that using another finite difference (with step size h’)

Result: You’re approximating an approximation — errors compound catastrophically.

The skeptical reader might continue objecting:

“But can’t we use something like SymPy or Mathematica to compute derivatives symbolically?”

In theory, yes. In practice, we face a similar problem.

For a 1.8-trillion-parameter model:

  • The symbolic expression for f' would be larger than the model itself 👀
  • Computing it would take years
  • Storing it would require more memory than exists
  • Simplifying it would be computationally intractable

Example: Even for a tiny 3-layer network with 1000 neurons per layer:

  • Symbolic f' lands in the millions of terms.
  • Symbolic f'' jumps to the billions of terms.
  • Growth is combinatorial with depth/width; common-subexpression tricks don’t save you enough.

For hundreds of layers? Forget it.

Clear now?
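
You can feel this even at toy scale. Here is a sketch (my own illustration: SymPy, a deliberately tiny width-3, two-layer tanh network, nowhere near the 1000-neuron example above) that simply counts operations in the symbolic output and in a single derivative entry:

import sympy as sp

n = 3  # toy width; the real argument is about widths in the thousands
x = sp.Matrix(sp.symbols(f'x0:{n}'))
W1 = sp.Matrix(n, n, sp.symbols(f'w0:{n*n}'))
W2 = sp.Matrix(n, n, sp.symbols(f'v0:{n*n}'))

h1 = (W1 * x).applyfunc(sp.tanh)      # layer 1
y = (W2 * h1).applyfunc(sp.tanh)[0]   # one output neuron of layer 2

dy = sp.diff(y, x[0])                 # a single Jacobian entry, built symbolically
print("ops in y: ", sp.count_ops(y))
print("ops in dy:", sp.count_ops(dy))

Even here the derivative expression is noticeably bigger than the function itself; scale the width and depth up and building the expression at all becomes the non-starter.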

Let’s Bring Back Our Hyperreal Flavor for AI Computing

Let’s see what happens when hyperreals face the same scenarios:

What Dual/Jet Numbers Do Differently: Automatic Differentiation
Dual numbers don’t use limits at all. Instead, they:

  1. Encode the differentiation rules in the arithmetic
  2. Evaluate f with special numbers that carry derivative info
  3. Derivatives emerge through rule-following arithmetic

Jets generalize this. k-jets carry truncated Taylor lanes up to order k (nilpotent: ε^(k+1) = 0), so higher-order derivatives fall out in one pass.

Here’s the key: The calculus rules (power rule, chain rule, etc.) are built into the jet arithmetic operations, not applied symbolically! So you get all the advantages of an analytic solution without ever writing one down!

The Three Fundamental Differences
Calculus with Symbolic Rule Application (impractical at modern AI scale)
Process:

  1. Write down the function: f(x) = x³
  2. Recall the power rule: d/dx[xⁿ] = n·xⁿ⁻¹
  3. Apply it symbolically: f’(x) = 3x²
  4. Store both formulas separately

For neural networks: Must build the entire derivative expression — exponential memory explosion.

Traditional h-Limit Calculus: Numerical Approximation

Process:

  1. Choose a step size h (guesswork)
  2. Evaluate: (f(x+h) - f(x))/h
  3. Get an approximation with error

Problems:

  • Not exact (always has truncation or roundoff error)
  • Can’t compose cleanly
  • Breaks down at higher orders

Dual/Jet Numbers Algebra: Evaluation with Augmented Arithmetic (practical at modern AI scale)

Process:

  1. Extend the number system with ε where ε² = 0
  2. Evaluate f at (x + ε) using this arithmetic
  3. Derivatives appear as ε-coefficients automatically

For neural networks: No expression built — just evaluate once with special numbers. Linear memory scaling.

How It Actually Works: The Binomial Magic with dual numbers

Let’s see, as a toy example, how the power rule emerges without applying any calculus:

Example: compute derivative of f(x) = x³

Step 1: Evaluate at augmented input


f(x + ε) = (x + ε)³

Step 2: Expand using binomial theorem (combinatorics, not calculus)

(x + ε)³ = x³ + 3x²ε + 3xε² + ε³

Step 3: Apply nilpotent algebra (ε² = 0)

= x³ + 3x²ε + 0 + 0
= x³ + 3x²ε

Step 4: Read the dual number

x³ + 3x²ε = (x³) + ε·(3x²)
            ↑         ↑
         value   derivative

The derivative f’(x) = 3x² emerged through:

  1. Binomial expansion (algebra)
  2. Nilpotent simplification (ε² = 0)
  3. Coefficient reading

NOT through:

❌ Power rule application
❌ h-limit formula
❌ Symbolic differentiation

> You don’t apply the power rule — you let binomial expansion reveal it.
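
Here is the same four-step story as running code: a minimal dual-number sketch in plain Python (an illustration, not a library), where f'(x) = 3x² falls out of the multiplication rule with no power rule and no limits anywhere:

class Dual:
    """a + b·ε with ε² = 0: .a holds the value, .b holds the derivative."""
    def __init__(self, a, b=0.0):
        self.a, self.b = a, b

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.a + other.a, self.b + other.b)

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # (a + bε)(c + dε) = ac + (ad + bc)ε, because ε² = 0 kills the bd term
        return Dual(self.a * other.a, self.a * other.b + self.b * other.a)

def f(x):
    return x * x * x        # f(x) = x³, written as ordinary arithmetic

y = f(Dual(2.0, 1.0))       # seed: value 2, dx/dx = 1
print(y.a, y.b)             # 8.0 12.0  →  f(2) = 8, f'(2) = 3·2² = 12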

Why This Scales When Symbolic Differentiation Doesn’t

Symbolic Differentiation (Analytical):
With neural networks, you must build the expressions explicitly:

  • Layer 1 derivative: thousands of terms
  • Layer 2 derivative: millions of terms (combinatorial explosion)
  • Hundreds of layers: expression size grows exponentially in depth/width; even with common-subexpression elimination it becomes intractable to construct, store, or simplify. Memory required: more than all atoms in the universe 👀

Dual Number Evaluation:
Never builds expressions:

  • Each instrumented tensor stores value + ε·derivative
  • Memory: 2× base model (for k=1)
  • Or 3× base model with Jets (for k=2 with second derivative)

For GPT-5 (1.8T parameters):
k=1: ~14.4 TB → 18.0 TB (totally practical)
k=2: ~14.4 TB → 21.6 TB (fits on ~34 H100 nodes)

BUT WAIT — YOU’RE FLYING FIRST CLASS IN AI MATH

And there’s still more.
The algebra of dual/jet numbers lets you use composition of functions (yup, if you want to do yourself a favor and write real AI that works, learn category theory now!).

Here’s your genius move:

With composition of functions, we can get second derivatives for the price of a first derivative!!

Woah. 🤯

How? Just by using composition of functions — otherwise structurally impossible with limit-based calculus.

In Plain English: Why Composition Fails With h-Limits

Traditional calculus can’t do JVP∘VJP = HVP because:

  1. JVP via finite differences gives you a number (an approximation of f’(x)·v)
  2. That number has no derivative structure for VJP to differentiate
  3. You must start over with a new finite-difference approximation
  4. The operations don’t chain — each one discards the structure the next one needs

Dual numbers CAN do JVP∘VJP = HVP because:

  1. JVP with duals gives you a dual number (f(x), f'(x)·v)
  2. That dual number carries derivative structure in its ε-coefficient
  3. VJP can differentiate it directly by treating it as input
  4. The operations chain naturally — each preserves the structure the next needs

Dual numbers are algebraically closed under composition.
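
This is also exactly what modern autodiff frameworks expose. As a sketch of the forward-over-reverse composition, here it is with JAX’s standard grad and jvp (the toy loss below is my own stand-in, not anything from this article):

import jax
import jax.numpy as jnp

def loss(x):
    # stand-in scalar loss over a parameter vector
    return jnp.sum(jnp.sin(x) * x ** 2)

def hvp(f, x, v):
    # HVP as a composition: forward-mode JVP pushed through the reverse-mode gradient
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

x = jnp.arange(3.0)
v = jnp.ones_like(x)
print(hvp(loss, x, v))   # Hessian-vector product along v; the Hessian is never formed

Cost-wise this is on the order of one extra gradient evaluation, which is what “second derivatives for the price of a first derivative” means in practice.
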
The Practical Consequence
Here’s what the new paradigm can compute that the old one can’t:

Why This Is The Key To Fixing AI
Current AI (k=1 only):

  • Can answer: “Which direction should I go?”
  • Cannot answer: “How sharply is this direction changing?”
  • Result: Reactive, not anticipatory

With composition (JVP∘VJP):

  • Get second derivatives (HVP) for roughly the price of a first derivative
  • Can answer: “How sharply is this direction changing?”
  • Result: Anticipatory, not just reactive

With explicit k=3 jets (a quick sketch follows this list):

  • Get third derivatives for 3× the cost
  • Can verify topological consistency (winding numbers)
  • Result: Mathematically certified AI outputs
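
For the explicit k=3 jets mentioned in the list above, here is a minimal order-3 jet sketch in plain Python (truncated Taylor coefficients; my own illustration, not a production implementation) that reads f, f', f'', and f''' off a single evaluation:

class Jet3:
    """Truncated Taylor series a0 + a1·ε + a2·ε² + a3·ε³ with ε⁴ = 0."""
    def __init__(self, coeffs):
        self.c = list(coeffs)

    def __add__(self, other):
        return Jet3(a + b for a, b in zip(self.c, other.c))

    def __mul__(self, other):
        # Cauchy product truncated at order 3, because ε⁴ = 0
        out = [0.0] * 4
        for i in range(4):
            for j in range(4 - i):
                out[i + j] += self.c[i] * other.c[j]
        return Jet3(out)

def variable(x0):
    # seed the jet at x0: value x0, first-order coefficient 1
    return Jet3([x0, 1.0, 0.0, 0.0])

x = variable(2.0)
y = x * x * x + x                      # f(x) = x³ + x, one forward pass
f0, f1, f2, f3 = y.c[0], y.c[1], 2 * y.c[2], 6 * y.c[3]
print(f0, f1, f2, f3)                  # 10.0 13.0 12.0 6.0 → f, f', f'', f'''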

The Functors + Composition Advantage

And here’s why the hyperreal algebra matters:

Without it (finite differences):

  • Each derivative order requires starting from scratch
  • Errors accumulate with each nesting
  • No compositional structure to exploit

With it (dual/jet numbers):

  • Higher-order derivatives = compose lower-order operations
  • Exact (within floating-point)
  • Automatic (chain rule built into ε-arithmetic)

This is why:

✅ Dual/Jet numbers scale to hundreds of layers (linear memory)

✅ Composition works (JVP∘VJP = HVP automatically)

✅ Higher orders accessible with Jet numbers (k=3, k=4 feasible)

And why:

❌ Symbolic differentiation explodes (exponential expressions)
❌ Finite differences can’t compose (no functoriality)
❌ h-limit methods break at higher orders (error compounds)


SUMMING UP

The entire AI industry is stuck at first-order optimization because:

  1. They learned calculus as h-limits (doesn’t scale)
  2. They implement derivatives as finite differences (doesn’t compose)
  3. They never learned about Group Theory and Hyperreal Numbers (not in CS curricula)

Meanwhile:

  1. Dual numbers make derivatives algebraic objects (not approximations)
  2. Jets make higher orders linear in cost (not exponential)
  3. Functorial composition makes second derivatives cheap (JVP∘VJP)

The math to fix Tesla’s phantom braking, OpenAI’s hallucinations, and Meta’s moderation chaos has been sitting in textbooks since the 1960s.

Waiting for someone to connect the dots among the binomial theorem (~400 years old), nilpotent algebra (~150 years old), and functorial composition + hyperreals (~60 years old)...

...and apply them to the biggest unsolved problems in AI.

Now you know what Silicon Valley doesn’t and see what they cannot.

NOTE: In this article, “traditional calculus” means the finite-difference (h-limit) implementation used in practice — pick an h, approximate, repeat — not analytic/symbolic derivatives.

If this post has sparked your curiosity, you may enjoy exploring deeper analyses, geometric AI concepts, and ongoing research on my personal site:

👉 https://josecrespo.substack.com
