Video Transcript
Here, we tackle backpropagation, the core algorithm behind how neural networks
learn. After a quick recap for where we are, the first thing I’ll do is an intuitive
walkthrough for what the algorithm is actually doing, without any reference to the
formulas. Then, for those of you who do want to dive into the math, the next video goes into
the calculus underlying all this.
If you watched the last two videos or if you’re just jumping in with the appropriate
background, you know what a neural network is and how it feeds forward
information. Here, we’re doing the classic example of recognizing handwritten digits whose pixel
values get fed into the first layer of the network with 784 neurons. And I’ve been showing a network with two hidden layers having just 16 neurons each,
and an output layer of 10 neurons, indicating which digit the network is choosing as
its answer. I’m also expecting you to understand gradient descent as described in the last
video, and that what we mean by learning is that we wanna find which weights and biases
minimize a certain cost function.
As a quick reminder, for the cost of a single training example, what you do is take
the output that the network gives along with the output that you wanted it to
give. And you just add up the squares of the differences between each component. Doing this for all of your tens of thousands of training examples and averaging the
results, this gives you the total cost of the network. And as if that’s not enough to think about, as described in the last video, the thing
that we’re looking for is the negative gradient of this cost function, which tells
you how you need to change all of the weights and biases, all of these connections,
so as to most efficiently decrease the cost.
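To make that reminder concrete, here is a minimal sketch of the cost just described, assuming the outputs are NumPy arrays of 10 activations. The names example_cost and total_cost, and the activation values themselves, are made up for illustration.

```python
import numpy as np

def example_cost(output, target):
    """Cost of one training example: sum of squared differences per component."""
    return float(np.sum((output - target) ** 2))

def total_cost(outputs, targets):
    """Total cost: the average of the per-example costs."""
    return float(np.mean([example_cost(o, t) for o, t in zip(outputs, targets)]))

# Hypothetical output activations for an image of a 2 (values are made up).
output = np.array([0.5, 0.8, 0.2, 0.1, 0.0, 0.3, 0.1, 0.0, 0.2, 0.0])

# The output we wanted: only the neuron for the digit 2 is active.
target = np.zeros(10)
target[2] = 1.0

cost = example_cost(output, target)
```

A perfect output would give a cost of 0, and the further the activations drift from the target, the larger the cost gets.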
Backpropagation, the topic of this video, is an algorithm for computing that crazy
complicated gradient. And the one idea from the last video that I really want you to hold firmly in your
mind right now is that because thinking of the gradient vector as a direction in
13000 dimensions is, to put it lightly, beyond the scope of our imaginations,
there’s another way you can think about it. The magnitude of each component here is telling you how sensitive the cost function
is to each weight and bias.
For example, let’s say you go through the process I’m about to describe and you
compute the negative gradient. And the component associated with the weight on this edge here comes out to be 3.2
while the component associated with this edge here comes out as 0.1. The way you would interpret that is that the cost function is 32 times more
sensitive to changes in that first weight. So if you were to wiggle that value just a little bit, it’s gonna cause some change
to the cost, and that change is 32 times greater than what the same wiggle to that
second weight would give.
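Here is a small sketch of that wiggle test. The function toy_cost is a made-up stand-in, chosen only so that its sensitivities to the two weights have magnitudes 3.2 and 0.1. A real network's cost is far more complicated, but the interpretation is the same.

```python
def toy_cost(w1, w2):
    # Hypothetical cost whose sensitivities to w1 and w2 are 3.2 and 0.1.
    return 3.2 * w1 + 0.1 * w2

def wiggle_effect(cost, w1, w2, dw1=0.0, dw2=0.0):
    """Change in the cost caused by nudging the weights by (dw1, dw2)."""
    return cost(w1 + dw1, w2 + dw2) - cost(w1, w2)

eps = 1e-3  # the size of the wiggle

change_1 = wiggle_effect(toy_cost, 0.5, 0.5, dw1=eps)  # wiggle the first weight
change_2 = wiggle_effect(toy_cost, 0.5, 0.5, dw2=eps)  # same wiggle, second weight

ratio = change_1 / change_2  # how much more sensitive the cost is to the first weight
```

The ratio of the two changes comes out to about 32, which is just 3.2 divided by 0.1.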
Personally, when I was first learning about backpropagation, I think the most
confusing aspect was just the notation and the index chasing of it all. But once you unwrap what each part of this algorithm is really doing, each individual
effect that it’s having is actually pretty intuitive. It’s just that there’s a lot of little adjustments getting layered on top of each
other. So I’m gonna start things off here with a complete disregard for the notation and
just step through those effects that each training example is having on the weights
and biases.
Because the cost function involves averaging a certain cost per example over all the
tens of thousands of training examples, the way that we adjust the weights and
biases for a single gradient descent step also depends on every single example. Or, rather, in principle it should. But for computational efficiency, we’re going to do a little trick later to keep you
from needing to hit every single example for every single step. In any case, right now, all we’re gonna do is focus our attention on one single
example, this image of a two. What effect should this one training example have on how the weights and biases get
adjusted?
Let’s say we’re at a point where the network is not well-trained yet. So the activations in the output are gonna look pretty random, maybe something like
0.5, 0.8, 0.2, on and on. Now we can’t directly change those activations. We only have influence on the weights and biases. But it is helpful to keep track of which adjustments we wish should take place to
that output layer. And since we want it to classify the image as a two, we want that third value to get
nudged up, while all of the others get nudged down. Moreover, the sizes of these nudges should be proportional to how far away each
current value is from its target value.
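As a rough sketch of that wish list, you can take the difference between the target and the current output. The first three activation values are the ones mentioned above, the rest are made up, and the name desired_nudges is just for illustration.

```python
import numpy as np

# Output activations of a poorly trained network (values after the first
# three are made up for illustration).
output = np.array([0.5, 0.8, 0.2, 1.0, 0.4, 0.6, 1.0, 0.0, 0.2, 0.1])

# We want the image classified as a 2.
target = np.zeros(10)
target[2] = 1.0

# Positive means "nudge up", negative means "nudge down", and the size
# is proportional to how far each value is from its target.
desired_nudges = target - output
```

Here the nudge for the digit two neuron is +0.8, while the nudge for the digit eight neuron is only -0.2, since it is already close to where it should be.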
For example, the increase to that number two neuron’s activation is in a sense more
important than the decrease to the number eight neuron, which is already pretty
close to where it should be. So zooming in further, let’s focus just on this one neuron, the one whose activation
we wish to increase. Remember, that activation is defined as a certain weighted sum of all of the
activations in the previous layer plus a bias, which has all been plugged into
something like the sigmoid squishification function or a ReLU. So there are three different avenues that can team up together to help increase that
activation. You can increase the bias, you can increase the weights, and you can change the
activations from the previous layer.
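Here is a minimal sketch of one neuron and those three avenues, assuming a sigmoid and a previous layer of just three neurons. All of the numbers and the name neuron_activation are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron_activation(weights, prev_activations, bias):
    """Weighted sum of the previous layer plus a bias, squished by a sigmoid."""
    return sigmoid(np.dot(weights, prev_activations) + bias)

prev = np.array([0.9, 0.1, 0.5])   # activations in the previous layer
w = np.array([1.0, -2.0, 0.5])     # weights into this neuron
b = -0.5                           # bias of this neuron

base = neuron_activation(w, prev, b)

# Avenue 1: increase the bias.
more_bias = neuron_activation(w, prev, b + 0.1)

# Avenue 2: increase the weights (the previous activations are all positive).
more_weight = neuron_activation(w + 0.1, prev, b)

# Avenue 3: change the previous activations, brighter where the weight is
# positive and dimmer where it is negative.
changed_prev = neuron_activation(w, prev + 0.1 * np.sign(w), b)
```

Each of the three changes pushes the activation above its starting value.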
Focusing just on how the weights should be adjusted, notice how the weights actually
have differing levels of influence. The connections with the brightest neurons from the preceding layer have the biggest
effect, since those weights are multiplied by larger activation values. So if you were to increase one of those weights, it actually has a stronger influence
on the ultimate cost function than increasing the weights of connections with dimmer
neurons, at least as far as this one training example is concerned. Remember, when we talk about gradient descent, we don’t just care about whether each
component should get nudged up or down, we care about which ones give you the most
bang for your buck.
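To see that bang for your buck in numbers, here is a sketch for a single sigmoid neuron with a squared-difference cost. It leans on the chain rule, which is the subject of the next video, so treat the formula as an assumption for now. The values are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

prev = np.array([0.9, 0.1, 0.5])   # a bright, a dim, and a medium neuron
w = np.array([0.2, 0.2, 0.2])      # equal weights, so only brightness differs
b = 0.0
target = 1.0                       # we wish this neuron were more active

z = np.dot(w, prev) + b
a = sigmoid(z)

# Chain rule for cost = (a - target)^2:
dC_da = 2.0 * (a - target)   # how the cost changes with the activation
da_dz = a * (1.0 - a)        # derivative of the sigmoid
grad_w = dC_da * da_dz * prev  # each weight's sensitivity scales with prev

neg_grad = -grad_w  # the direction that decreases the cost
```

Every component of neg_grad is positive, so all three weights should go up, but the weight attached to the brightest neuron gets the biggest nudge.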
This, by the way, is at least somewhat reminiscent of the theory in neuroscience for
how biological networks of neurons learn. Hebbian theory, often summed up in the phrase “neurons that fire together wire
together”. Here, the biggest increases to weights, the biggest strengthening of connections,
happens between neurons which are the most active and the ones which we wish to
become more active. In a sense, the neurons that are firing while seeing a two, get more strongly linked
to those firing when thinking about a two.
To be clear, I really am not in a position to make statements one way or another
about whether artificial networks of neurons behave anything like biological
brains. And this fires-together-wire-together idea comes with a couple meaningful
asterisks. But, taken as a very loose analogy, I do find it interesting to note. Anyway, the third way that we can help increase this neuron’s activation is by
changing all the activations in the previous layer. Namely, if everything connected to that digit two neuron with a positive weight got
brighter, and if everything connected with a negative weight got dimmer, then that
digit two neuron would become more active.
And similar to the weight changes, you’re gonna get the most bang for your buck by
seeking changes that are proportional to the size of the corresponding weights. Now of course, we cannot directly influence those activations. We only have control over the weights and biases. But just as with the last layer, it’s helpful to just keep a note of what those
desired changes are. But keep in mind, zooming out one step here, this is only what that digit two output
neuron wants. Remember, we also want all of the other neurons in the last layer to become less
active. And each of those other output neurons has its own thoughts about what should happen
to that second-to-last layer.
So, the desire of this digit two neuron is added together with the desires of all the
other output neurons for what should happen to this second-to-last layer, again, in
proportion to the corresponding weights and in proportion to how much each of those
neurons needs to change. This right here is where the idea of propagating backwards comes in. By adding together all these desired effects, you basically get a list of nudges that
you want to happen to the second-to-last layer. And once you have those, you can recursively apply the same process to the relevant
weights and biases that determine those values, repeating the same process I just
walked through and moving backwards through the network.
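The adding-together step can be sketched as a single matrix product. This is a simplification: it weighs each output neuron's desire by the connecting weight and by how much that neuron needs to change, and it leaves out the extra factor that comes from the sigmoid or ReLU. The tiny layer sizes and all the values are made up.

```python
import numpy as np

# W[j, k] is the weight from neuron k in the second-to-last layer
# to output neuron j. Here: 2 output neurons, 3 neurons in the layer before.
W = np.array([[ 1.0, -0.5, 0.0],
              [-1.0,  2.0, 0.5]])

# How much each output neuron needs to change (up is positive).
output_nudges = np.array([0.8, -0.3])

# Each previous-layer neuron adds up the desires of every output neuron,
# each in proportion to the weight connecting them.
prev_nudges = W.T @ output_nudges
```

The first neuron ends up with a desired nudge of 1.1, since both output neurons agree it should get brighter. Those nudges then play the role of output_nudges for the layer before, and so on back through the network.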
And zooming out a bit further, remember that this is all just how a single training
example wishes to nudge each one of those weights and biases. If we only listen to what that two wanted, the network would ultimately be
incentivized just to classify all images as a two. So what you do is you go through the same backprop routine for every other training
example, recording how each of them would like to change the weights and the
biases. And you average together those desired changes.
This collection here of the averaged nudges to each weight and bias is, loosely
speaking, the negative gradient of the cost function referenced in the last video,
or at least something proportional to it. I say loosely speaking only because I have yet to get quantitatively precise about
those nudges. But if you understood every change that I just referenced, why some are
proportionally bigger than others, and how they all need to be added together, you
understand the mechanics for what backpropagation is actually doing.
By the way, in practice, it takes computers an extremely long time to add up the
influence of every single training example for every single gradient descent step. So here’s what’s commonly done instead. You randomly shuffle your training data and then divide it into a whole bunch of
minibatches, let’s say, each one having 100 training examples. Then you compute a step according to the minibatch. It’s not gonna be the actual gradient of the cost function, which depends on all of
the training data, not this tiny subset. So it’s not the most efficient step downhill. But each minibatch does give you a pretty good approximation. And more importantly, it gives you a significant computational speed up.
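Here is a minimal sketch of that shuffle-and-split routine. To keep it short, it uses a toy problem in place of a network: a single parameter w and a per-example cost of (w - x)^2. The name make_minibatches, the data, and the learning rate are all assumptions made for illustration.

```python
import numpy as np

def make_minibatches(num_examples, batch_size, rng):
    """Randomly shuffle the example indices, then split them into minibatches."""
    indices = rng.permutation(num_examples)
    return [indices[i:i + batch_size]
            for i in range(0, num_examples, batch_size)]

rng = np.random.default_rng(0)

# Toy training data: 1000 numbers scattered around 3.
data = rng.normal(loc=3.0, scale=1.0, size=1000)

batches = make_minibatches(len(data), 100, rng)

w = 0.0               # the single parameter being learned
learning_rate = 0.1   # step size (an arbitrary choice)

for batch in batches:
    # Average the per-example gradients 2 * (w - x) over this minibatch only.
    grad = np.mean(2.0 * (w - data[batch]))
    w -= learning_rate * grad
```

Each step only looks at 100 examples, so it is not the exact downhill direction, but after one pass through the minibatches w has moved most of the way toward the average of the data, which is where this toy cost is smallest.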
If you were to plot the trajectory of your network under the relevant cost surface,
it would be a little more like a drunk man stumbling aimlessly down a hill, but
taking quick steps, rather than a carefully-calculating man determining the exact
downhill direction of each step before taking a very slow and careful step in that
direction. This technique is referred to as stochastic gradient descent. There’s kind of a lot going on here. So let’s just sum it up for ourselves, shall we?
Backpropagation is the algorithm for determining how a single training example would
like to nudge the weights and biases, not just in terms of whether they should go up
or down, but in terms of what relative proportions of those changes cause the most
rapid decrease to the cost. A true gradient descent step would involve doing this for all your tens of thousands
of training examples and averaging the desired changes that you get. But that’s computationally slow. So instead, you randomly subdivide the data into these minibatches and compute each
step with respect to a minibatch. Repeatedly going through all of the minibatches and making these adjustments, you
will converge towards a local minimum of the cost function, which is to say, your
network is gonna end up doing a really good job on the training examples.
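For the curious, here is one way the whole routine for a single training example might look in code. It is only a sketch under some assumptions: a network with one small hidden layer rather than the 784-16-16-10 network from the video, sigmoid activations, the squared-difference cost, and chain rule formulas that have not been derived here. The names forward, cost, and backprop are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, x):
    """Feed x forward, returning the hidden and output activations."""
    W1, b1, W2, b2 = params
    a1 = sigmoid(W1 @ x + b1)
    a2 = sigmoid(W2 @ a1 + b2)
    return a1, a2

def cost(params, x, y):
    """Squared-difference cost of a single training example."""
    _, a2 = forward(params, x)
    return float(np.sum((a2 - y) ** 2))

def backprop(params, x, y):
    """Gradient of the single-example cost, in the same order as params."""
    W1, b1, W2, b2 = params
    a1, a2 = forward(params, x)
    # How much each output neuron "wants" to change, scaled by the sigmoid slope.
    delta2 = 2.0 * (a2 - y) * a2 * (1.0 - a2)
    # Propagate backwards: add up the desires through the weights.
    delta1 = (W2.T @ delta2) * a1 * (1.0 - a1)
    # Weight gradients are proportional to the activation feeding the weight.
    return [np.outer(delta1, x), delta1, np.outer(delta2, a1), delta2]

rng = np.random.default_rng(0)
params = [rng.normal(size=(4, 3)), rng.normal(size=4),
          rng.normal(size=(2, 4)), rng.normal(size=2)]
x = rng.random(3)           # a made-up input
y = np.array([1.0, 0.0])    # the output we wanted

grads = backprop(params, x, y)

# Sanity check: wiggle one weight and compare with the computed gradient.
eps = 1e-6
plus = [p.copy() for p in params]
minus = [p.copy() for p in params]
plus[0][0, 0] += eps
minus[0][0, 0] -= eps
numeric = (cost(plus, x, y) - cost(minus, x, y)) / (2.0 * eps)
```

The wiggle test at the end is the same sensitivity idea from earlier: nudge one weight, see how much the cost moves, and check that it matches what backprop computed. Stepping each parameter a small amount along the negative of grads is one gradient descent step for this single example.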
So with all of that said, every line of code that would go into implementing backprop
actually corresponds with something that you have now seen, at least in informal
terms. But sometimes knowing what the math does is only half the battle, and just
representing the damn thing is where it gets all muddled and confusing. So for those of you who do want to go deeper, the next video goes through the same
ideas that were just presented here but in terms of the underlying calculus, which
should hopefully make it a little more familiar as you see the topic in other
resources.
Before that, one thing worth emphasizing is that for this algorithm to work, and this
goes for all sorts of machine learning beyond just neural networks, you need a lot
of training data. In our case, one thing that makes handwritten digits such a nice example is that
there exists the MNIST database, with so many examples that have been labeled by
humans. So a common challenge that those of you working in machine learning will be familiar
with is just getting the labeled training data that you actually need, whether
that’s having people label tens of thousands of images or whatever other data type
you might be dealing with.