
Fixing A Bug in micrograd BackProp (As Explained by Karpathy)

Hi there! I'm Shrijith Venkatramana, founder of Hexmos. Right now, I'm building LiveAPI, a tool that makes generating API docs from your code ridiculously easy.

A Bug In Our Code

In the previous post, we got automatic gradient calculation going for the whole expression graph.

However, it has a tricky bug. Here's a sample program that invokes the bug:

a = Value(3.0, label='a')
b = a + a  ;  b.label = 'b'

b.backward()
draw_dot(b)

Buggy Graph

In the above, the forward pass looks alright:

b = a + a = 3 + 3 = 6

But think about the backward pass:

b = a + a
db/da = 1 + 1 = 2

The answer should be 2, but we get 1 as the value of a.grad.
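We can confirm this by printing the gradient directly; a quick check, assuming the full Value class from the previous post (with the buggy __add__ shown below):

a = Value(3.0, label='a')
b = a + a

b.backward()
print(a.grad)  # prints 1.0, but d(a + a)/da = 2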

The problem is in the __add__ operation of the Value class:

class Value:
  def __init__(self, data, _children=(), _op='', label=''):
    self.data = data
    self._prev = set(_children)
    self._op = _op
    self.label = label
    self.grad = 0.0
    self._backward = lambda: None # by default doesn't do anything (for a leaf
                                  # node for ex)

  def __repr__(self):
    return f"Value(data={self.data})"

  def __add__(self, other):
    out = Value(self.data + other.data, (self, other), '+')
    # (when b.backward() runs, out.grad here will already have been set to 1.0)

    # derivative of '+' is just distributing the grad of the output to inputs
    def backward():
      self.grad = 1.0 * out.grad  # for b = a + a, this sets a.grad to 1.0
      other.grad = 1.0 * out.grad # ...and then writes a.grad = 1.0 again

    out._backward = backward

    return out
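To make the overwrite concrete: in b = a + a, both self and other inside __add__ refer to the same object a, so the two assignments in backward() write to the same attribute. Here is a rough simulation of that aliasing in plain Python (Box is just a hypothetical stand-in for a Value holding a grad):

class Box:
    grad = 0.0          # stand-in for Value.grad

a = Box()
out_grad = 1.0          # gradient flowing in from b

self_, other = a, a     # for b = a + a, self and other are the same object
self_.grad = 1.0 * out_grad   # a.grad becomes 1.0
other.grad = 1.0 * out_grad   # a.grad is written with 1.0 again, not summed

print(a.grad)  # 1.0 (the first contribution was overwritten; it should be 2.0)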

Here is another example of the same bug:

a = Value(-2.0, label='a')
b = Value(3.0, label='b')

d = a * b   ;   d.label = 'd'
e = a + b   ;   e.label = 'e'
f = d * e   ;   f.label = 'f'

f.backward()
draw_dot(f)

Another Bug Example

We know that for the multiplication operation:

self.grad = other.data * out.grad

Applying this to f = d * e (with f.grad = 1):

d.grad = e.data * f.grad = 1 * 1 = 1

e.grad = d.data * f.grad = -6 * 1 = -6

So far, so good.

Let's look at the next stage. For d = a * b, the same rule gives:

b.grad = a.data * d.grad = -2 * 1 = -2

But if we consider the expression e = a + b, addition passes e.grad through to both inputs:

a.grad = e.grad = -6
b.grad = e.grad = -6

So we have a conflict: b.grad = -6 (from the addition) vs b.grad = -2 (from the multiplication).

The general problem is that when a Value is used multiple times in the expression graph, each use writes its own gradient into the same .grad attribute.

Whichever _backward runs last wins: first the addition's gradient may be written, and then in a later step the multiplication's gradient overwrites it, so one contribution is simply lost.
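For reference, here is what the gradients should be, computed by hand with the chain rule on f = (a * b) * (a + b). This is just a sanity check in plain Python, not part of the Value class:

a, b = -2.0, 3.0
d = a * b            # -6.0
e = a + b            #  1.0
f = d * e            # -6.0

# df/da = e * d(d)/da + d * d(e)/da = e*b + d*1
df_da = e * b + d * 1    # 1*3 + (-6)*1 = -3.0
# df/db = e * d(d)/db + d * d(e)/db = e*a + d*1
df_db = e * a + d * 1    # 1*(-2) + (-6)*1 = -8.0

print(df_da, df_db)      # -3.0 -8.0

So the correct gradients are a.grad = -3 and b.grad = -8, while the buggy version keeps only whichever contribution happens to be written last.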

Solving the Bug: Accumulate Gradients Rather Than Replacing Them

The Wikipedia page for the Chain Rule has a section on the multivariable case.

The gist of the solution is that when a value contributes to the output along multiple paths, its gradients must be accumulated (summed), rather than replaced.
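Concretely: if a variable x reaches the output L through several intermediate results u1, u2, ..., the multivariable chain rule adds up the contribution from every path:

dL/dx = (dL/du1) * (du1/dx) + (dL/du2) * (du2/dx) + ...

For b = a + a the two uses of a each contribute 1, so db/da = 1 + 1 = 2, which is exactly what accumulating with += implements.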

So, the new Value class is as follows, where each _backward accumulates (+=) rather than replaces gradients:

import math

class Value:
  def __init__(self, data, _children=(), _op='', label=''):
    self.data = data
    self._prev = set(_children)
    self._op = _op
    self.label = label
    self.grad = 0.0
    self._backward = lambda: None # by default doesn't do anything (for a leaf
                                  # node for ex)

  def __repr__(self):
    return f"Value(data={self.data})"

  def __add__(self, other):
    out = Value(self.data + other.data, (self, other), '+')

    # derivative of '+' is just distributing the grad of the output to inputs
    def backward():
      self.grad += 1.0 * out.grad
      other.grad += 1.0 * out.grad

    out._backward = backward

    return out

  def __mul__(self, other):
    out = Value(self.data * other.data, (self, other), '*')

    # derivative of `mul` is gradient of result multiplied by sibling's data
    def backward():
      self.grad += other.data * out.grad
      other.grad += self.data * out.grad

    out._backward = backward

    return out

  def tanh(self):
      x = self.data
      t = (math.exp(2*x) - 1) / (math.exp(2*x) + 1)
      out = Value(t, (self, ), 'tanh')

      # derivative of tanh = 1 - (tanh)^2
      def backward():
        self.grad += (1 - t**2) * out.grad

      out._backward = backward
      return out

  def backward(self):
    topo = []
    visited = set()
    def build_topo(v):
        if v not in visited:
            visited.add(v)
            for child in v._prev:
                build_topo(child)
            topo.append(v)
    build_topo(self)

    self.grad = 1.0
    for node in reversed(topo):
        node._backward()


Now the gradient calculations are correct:

Calc1

Calc2
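As a final sanity check, we can re-run both examples with the fixed class (a sketch; note that fresh Value objects are needed, since gradients now accumulate across repeated backward() calls):

a = Value(3.0, label='a')
b = a + a  ;  b.label = 'b'
b.backward()
print(a.grad)            # 2.0

a = Value(-2.0, label='a')
b = Value(3.0, label='b')
d = a * b  ;  d.label = 'd'
e = a + b  ;  e.label = 'e'
f = d * e  ;  f.label = 'f'
f.backward()
print(a.grad, b.grad)    # -3.0 -8.0, matching the hand-computed values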

Reference

Andrej Karpathy, "The spelled-out intro to neural networks and backpropagation: building micrograd" (YouTube)
