What if instead of evolving weights randomly, we could calculate exactly how to fix them? That's backprop — the algorithm that made deep learning possible.
Neuroevolution solved XOR by guessing — spawning thousands of random networks and keeping the lucky ones. It works, but it's slow. What if, after every wrong answer, we could trace the error backward through the network and tell each weight exactly how much to change?
That's backpropagation. Three steps, repeated thousands of times:
1. Forward pass — feed inputs through the network, get an output.
2. Measure error — how far off was the output from the right answer?
3. Backward pass — send the error backward, adjusting each weight by its share of the blame.
The key insight is the chain rule from calculus. Each layer passes blame to the layer before it, like a chain of "because" statements: the output was wrong because the last layer's weights were off, because the hidden layer fed it bad values, because the first layer's weights were off. Follow the chain, fix each link.
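The chain of "because" statements can be written out in miniature. Below is a hedged sketch with made-up values (one weight, two nested steps, not the article's demo network): the loss depends on the weight only through intermediate values, so each backward line multiplies in one local derivative.

```python
# Chain rule "blame passing" in miniature. The loss depends on w only
# through h and y, so the gradient is the product of the local derivatives.
# All names and numbers here are illustrative.

def forward(w, x):
    h = w * x          # "hidden" value
    y = h ** 2         # "output"
    return h, y

x, w, target = 2.0, 0.5, 3.0
h, y = forward(w, x)               # h = 1.0, y = 1.0
loss = (y - target) ** 2           # squared error = 4.0

# Backward pass: each line is one "because" link in the chain.
dloss_dy = 2 * (y - target)        # loss was wrong because y was off
dy_dh = 2 * h                      # y was off because h was off
dh_dw = x                          # h was off because w was off
dloss_dw = dloss_dy * dy_dh * dh_dw  # follow the chain: multiply the links
```

The product `dloss_dw` tells us exactly which way, and how strongly, to nudge `w` to shrink the loss.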
Imagine you're blindfolded on a hilly landscape. Your goal is to reach the lowest valley. You can't see, but you can feel which way the ground slopes under your feet. So you take a step downhill. Then another. Eventually you reach a low point. That's gradient descent.
The "landscape" is the loss function — a measure of how wrong the network is. The "slope" is the gradient — it tells us which direction to adjust each weight to reduce the error. The size of each step is the learning rate.
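The blindfolded walk fits in a few lines. A minimal sketch on a one-variable landscape (the quadratic and starting point are illustrative): feel the slope, step against it, repeat.

```python
# Gradient descent on a one-variable landscape: loss(w) = (w - 3)^2.
# The gradient is the slope underfoot; the learning rate is the step size.

def grad(w):
    return 2 * (w - 3)   # derivative of (w - 3)^2

w = 0.0                  # start somewhere on the hillside
lr = 0.1                 # learning rate
for _ in range(100):
    w -= lr * grad(w)    # take a step downhill

print(round(w, 4))       # ends near w = 3, the bottom of the valley
```

Real networks do exactly this, just with millions of `w`s updated at once.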
Try a very high learning rate and the ball overshoots, bouncing past the minimum. Try one that's too low and it barely moves. The sweet spot is a rate that descends quickly without overshooting. This tradeoff is one of the most important decisions in training any neural network.
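The same tradeoff shows up even on a one-variable bowl. A small sketch (rates and step counts are illustrative): run gradient descent on `loss(w) = w²` with three learning rates and compare where each lands.

```python
# Three learning rates on loss(w) = w^2, minimum at w = 0.
# Each step multiplies the error by (1 - 2 * lr), so the rate decides
# whether we crawl, converge, or bounce outward.

def run(lr, steps=20, w=1.0):
    for _ in range(steps):
        w -= lr * 2 * w     # gradient of w^2 is 2w
    return w

print(run(0.01))   # too low: after 20 steps, still most of the way from 0
print(run(0.4))    # just right: essentially at the minimum
print(run(1.1))    # too high: every step overshoots, error grows
```

With `lr = 1.1` each step lands farther from the minimum than it started, which is exactly the chaotic divergence the demo shows.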
Below is a tiny network: 2 inputs, 2 hidden neurons, 1 output. Step through the forward pass to see values flow left-to-right, then step through the backward pass to see gradients flow right-to-left. Watch how each weight receives its "blame" for the error.
| Node | Value | Gradient |
|---|---|---|

| Weight | Value | δ | New |
|---|---|---|---|
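Here is the same 2→2→1 network as straight-line code, a sketch with illustrative weights (not the demo's actual numbers): the forward pass flows left to right, the backward pass flows right to left, and every weight collects its share of the blame.

```python
import math

# A hand-sized 2→2→1 network: sigmoid hidden layer and output.
# Weights, input, and target below are illustrative.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x = [1.0, 0.0]                     # inputs
W1 = [[0.5, -0.3], [0.2, 0.8]]     # W1[j][i]: input i → hidden j
W2 = [0.7, -0.4]                   # hidden j → output
target = 1.0

# Forward pass: values flow left to right.
h = [sigmoid(sum(W1[j][i] * x[i] for i in range(2))) for j in range(2)]
y = sigmoid(sum(W2[j] * h[j] for j in range(2)))
loss = 0.5 * (y - target) ** 2

# Backward pass: gradients flow right to left.
delta_out = (y - target) * y * (1 - y)            # output's blame
grad_W2 = [delta_out * h[j] for j in range(2)]
delta_hid = [delta_out * W2[j] * h[j] * (1 - h[j]) for j in range(2)]
grad_W1 = [[delta_hid[j] * x[i] for i in range(2)] for j in range(2)]

# One gradient-descent step: each weight moves against its gradient.
lr = 0.5
W2 = [W2[j] - lr * grad_W2[j] for j in range(2)]
W1 = [[W1[j][i] - lr * grad_W1[j][i] for i in range(2)] for j in range(2)]
```

Run the forward pass again after the update and the loss is smaller: that single step downhill is all training is.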
The learning rate controls how big each weight update is. Below, three networks train on XOR simultaneously with different learning rates. Watch their loss curves — too small is slow, too large is chaotic, just right converges fast.
The perceptron couldn't solve XOR. Neuroevolution solved it by trial and error over many generations. Backpropagation solves it surgically — computing the exact gradient for every weight, every step. Watch a 2→4→1 network learn XOR from scratch, typically in under 1000 epochs.
| A | B | Expected | Output |
|---|---|---|---|
| 0 | 0 | 0 | — |
| 0 | 1 | 1 | — |
| 1 | 0 | 1 | — |
| 1 | 1 | 0 | — |
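The full training loop fits in a page. This is a sketch of backprop on the same 2→4→1 sigmoid shape as the demo; the seed, learning rate, and epoch count are illustrative choices, not the demo's settings.

```python
import math
import random

# Backprop on XOR with a 2→4→1 sigmoid network.
# Initialization, learning rate, and epoch count are illustrative.

random.seed(0)
H = 4
W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(H)]
b1 = [random.uniform(-1, 1) for _ in range(H)]
W2 = [random.uniform(-1, 1) for _ in range(H)]
b2 = 0.0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(x):
    h = [sigmoid(W1[j][0] * x[0] + W1[j][1] * x[1] + b1[j]) for j in range(H)]
    return sigmoid(sum(W2[j] * h[j] for j in range(H)) + b2), h

data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
err_before = sum((predict(x)[0] - t) ** 2 for x, t in data)

lr = 0.5
for epoch in range(5000):
    for x, t in data:
        y, h = predict(x)                 # forward pass
        d_out = (y - t) * y * (1 - y)     # output's blame
        for j in range(H):                # backward pass + weight update
            d_hid = d_out * W2[j] * h[j] * (1 - h[j])
            W2[j] -= lr * d_out * h[j]
            b1[j] -= lr * d_hid
            W1[j][0] -= lr * d_hid * x[0]
            W1[j][1] -= lr * d_hid * x[1]
        b2 -= lr * d_out

err_after = sum((predict(x)[0] - t) ** 2 for x, t in data)
for x, t in data:
    print(x, t, round(predict(x)[0], 3))
```

The total squared error drops as training proceeds; with a lucky initialization the outputs settle near 0 and 1 on the truth table above.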
Compute the gradient of the loss with respect to every weight, then adjust each weight to reduce the error. Uses the chain rule to pass blame backward through layers.
An optimization algorithm: repeatedly step in the direction that reduces the loss. The gradient tells you which way is "downhill" for every parameter simultaneously.
Measures how wrong the network is. Mean squared error, cross-entropy, etc. Training means minimizing this number. The landscape of the loss function determines how hard the problem is.
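Two common choices side by side, with made-up prediction values: mean squared error for regression-style targets, binary cross-entropy for probabilities.

```python
import math

# Two loss functions on the same (illustrative) prediction.
y_true = 1.0
y_pred = 0.8

mse = (y_pred - y_true) ** 2                     # squared error: 0.04
ce = -(y_true * math.log(y_pred)
       + (1 - y_true) * math.log(1 - y_pred))    # binary cross-entropy

print(mse, round(ce, 4))
```

Cross-entropy punishes confident wrong answers much more harshly than squared error, which is why it dominates in classification.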
The step size for gradient descent. Too large and you overshoot. Too small and you crawl. Modern optimizers like Adam adapt the rate automatically per-weight.