The hard assumption here is that you’ve watched part three, giving an intuitive
walkthrough of the backpropagation algorithm. Here, we get a little bit more formal and dive into the relevant calculus. It’s normal for this to be at least a little confusing. So the mantra to regularly pause and ponder certainly applies as much here as
anywhere else. Our main goal is to show how people in machine learning commonly think about the
chain rule from calculus in the context of networks, which has kind of a different
feel from how most introductory calculus courses approach the subject. For those of you uncomfortable with the relevant calculus, I do have a whole series
on the topic.
Let’s just start off with an extremely simple network, one where each layer has a
single neuron in it. So this particular network is determined by three weights and three biases. And our goal is to understand how sensitive the cost function is to these
variables. That way, we know which adjustments to those terms are gonna cause the most efficient
decrease to the cost function. And we’re just gonna focus on the connection between the last two neurons.
Let’s label the activation of that last neuron with a superscript 𝐿, indicating
which layer it’s in. So the activation of the previous neuron is 𝑎 𝐿 minus one. These are not exponents. They’re just a way of indexing what we’re talking about, since I wanna save
subscripts for different indices later on. Now, let’s say that the value we want this last activation to be, for a given
training example, is 𝑦. For example, 𝑦 might be zero or one. So the cost of this simple network for a single training example is (𝑎 𝐿 minus 𝑦)
squared. We’ll denote the cost of that one training example as 𝐶 zero.
As a reminder, this last activation is determined by a weight, which I’m gonna call
𝑤 𝐿, times the previous neuron’s activation plus some bias, which I’ll call 𝑏
𝐿. And then you pump that through some special nonlinear function like the sigmoid or
ReLU. It’s actually gonna make things easier for us if we give a special name to this
weighted sum, like 𝑧, with the same superscript as the relevant activations. So this is a lot of terms. And a way that you might conceptualize it is that the weight, the previous activation,
and the bias altogether are used to compute 𝑧, which in turn lets us compute 𝑎,
which, finally, along with the constant 𝑦, lets us compute the cost. And of course, 𝑎 𝐿 minus one is influenced by its own weight and bias and such. But, we’re not gonna focus on that right now.
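Written out with the notation above, that full dependency chain is just three short equations:

\[
z^L = w^L a^{L-1} + b^L, \qquad a^L = \sigma(z^L), \qquad C_0 = (a^L - y)^2
\]

Here \(\sigma\) stands for whichever nonlinear function you chose, like the sigmoid.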
Now all of these are just numbers, right? And it can be nice to think of each one as having its own little number line. Our first goal is to understand how sensitive the cost function is to small changes
in our weight, 𝑤 𝐿. Or phrased differently, what is the derivative of 𝐶 with respect to 𝑤 𝐿? When you see this 𝜕𝑤 term, think of it as meaning some tiny nudge to 𝑤, like a
change by 0.01. And think of this 𝜕𝐶 term as meaning whatever the resulting nudge to the cost
is. What we want is their ratio. Conceptually, this tiny nudge to 𝑤 𝐿 causes some nudge to 𝑧 𝐿, which in turn
causes some nudge to 𝑎 𝐿, which directly influences the cost.
So we break things up by first looking at the ratio of a tiny change to 𝑧 𝐿 to this
tiny change in 𝑤 𝐿. That is, the derivative of 𝑧 𝐿 with respect to 𝑤 𝐿. Likewise, you then consider the ratio of the change to 𝑎 𝐿 to the tiny change in 𝑧
𝐿 that caused it, as well as the ratio between the final nudge to 𝐶 and this
intermediate nudge to 𝑎 𝐿. This right here is the chain rule, where multiplying together these three ratios
gives us the sensitivity of 𝐶 to small changes in 𝑤 𝐿.
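As a formula, that chain rule statement reads:

\[
\frac{\partial C_0}{\partial w^L} = \frac{\partial z^L}{\partial w^L} \, \frac{\partial a^L}{\partial z^L} \, \frac{\partial C_0}{\partial a^L}
\]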
So on screen right now, there’s kind of a lot of symbols. Take a moment to just make sure it’s clear what they all are because now, we’re gonna
compute the relevant derivatives. The derivative of 𝐶 with respect to 𝑎 𝐿 works out to be two times (𝑎 𝐿 minus 𝑦). Notice, this means that its size is proportional to the difference between the
network’s output and the thing that we want it to be. So if that output was very different, even slight changes stand to have a big impact
on the final cost function. The derivative of 𝑎 𝐿 with respect to 𝑧 𝐿 is just the derivative of our sigmoid
function, or whatever nonlinearity you choose to use. And the derivative of 𝑧 𝐿 with respect to 𝑤 𝐿, in this case, comes out just to be
𝑎 𝐿 minus one.
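If it helps to see those three pieces as numbers rather than symbols, here is a minimal sketch in Python, assuming the sigmoid nonlinearity; the variable names and values are just for illustration, not from the video:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# One-neuron-per-layer chain: a_prev -> z -> a -> cost
a_prev = 0.6               # previous activation a^(L-1)
w, b, y = 1.5, 0.1, 1.0    # weight w^L, bias b^L, desired output y

z = w * a_prev + b         # z^L = w^L * a^(L-1) + b^L
a = sigmoid(z)             # a^L = sigmoid(z^L)

dC_da = 2 * (a - y)        # derivative of cost with respect to a^L: 2(a^L - y)
da_dz = sigmoid_prime(z)   # derivative of a^L with respect to z^L: sigmoid'(z^L)
dz_dw = a_prev             # derivative of z^L with respect to w^L: a^(L-1)

dC_dw = dz_dw * da_dz * dC_da   # chain rule: multiply the three ratios
```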
Now, I don’t know about you. But I think it’s easy to get stuck head-down in the formulas without taking a moment
to sit back and remind yourself of what they all actually mean. In the case of this last derivative, the amount that that small nudge to the weight
influenced the last layer depends on how strong the previous neuron is. Remember, this is where that “neurons that fire together wire together” idea comes
in. And all of this is the derivative with respect to 𝑤 𝐿, only of the cost for a
specific single training example. Since the full cost function involves averaging together all those costs across many
different training examples, its derivative requires averaging this expression that
we found over all training examples.
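As a sketch of that averaging step, under the same made-up scalar setup as before:

```python
import math

def grad_w_single(a_prev, w, b, y):
    # The chain rule product for one training example's cost
    z = w * a_prev + b
    a = 1.0 / (1.0 + math.exp(-z))
    return a_prev * (a * (1.0 - a)) * (2 * (a - y))

# Hypothetical training set of (a_prev, y) pairs
training = [(0.6, 1.0), (0.2, 0.0), (0.9, 1.0)]
w, b = 1.5, 0.1

# Derivative of the full cost: average of the per-example derivatives
dC_dw = sum(grad_w_single(a_prev, w, b, y) for a_prev, y in training) / len(training)
```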
And, of course, that is just one component of the gradient vector, which itself is
built up from the partial derivatives of the cost function with respect to all those
weights and biases. But even though that’s just one of the many partial derivatives we need, it’s more
than 50 percent of the work. The sensitivity to the bias, for example, is almost identical. We just need to change out this 𝜕𝑧/𝜕𝑤 term for a 𝜕𝑧/𝜕𝑏. And if you look at the relevant formula, that derivative comes out to be one. Also, and this is where the idea of propagating backwards comes in, you can see how
sensitive this cost function is to the activation of the previous layer. Namely, this initial derivative in the chain rule expression, the sensitivity of 𝑧
to the previous activation, comes out to be the weight, 𝑤 𝐿.
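Spelled out, those two variants of the chain rule are:

\[
\frac{\partial C_0}{\partial b^L} = \frac{\partial z^L}{\partial b^L}\,\frac{\partial a^L}{\partial z^L}\,\frac{\partial C_0}{\partial a^L}
\quad\text{with}\quad \frac{\partial z^L}{\partial b^L} = 1,
\]
\[
\frac{\partial C_0}{\partial a^{L-1}} = \frac{\partial z^L}{\partial a^{L-1}}\,\frac{\partial a^L}{\partial z^L}\,\frac{\partial C_0}{\partial a^L}
\quad\text{with}\quad \frac{\partial z^L}{\partial a^{L-1}} = w^L.
\]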
And again, even though we’re not gonna be able to directly influence that
previous-layer activation, it’s helpful to keep track of. Because now, we can just keep iterating the same chain rule idea backwards to see how
sensitive the cost function is to previous weights and previous biases. And you might think that this is an overly simple example, since all layers just have
one neuron, and that things are gonna get exponentially more complicated for a real
network. But honestly, not that much changes when we give the layers multiple neurons. Really, it’s just a few more indices to keep track of. Rather than the activation of a given layer simply being 𝑎 𝐿, it’s also gonna have
a subscript indicating which neuron of that layer it is. Let’s go ahead and use the letter 𝑘 to index the layer 𝐿 minus one and 𝑗 to index
the layer 𝐿.
For the cost, again, we look at what the desired output is. But this time, we add up the squares of the differences between these last layer
activations and the desired output. That is, you take a sum over (𝑎 𝐿 𝑗 minus 𝑦 𝑗) squared. Since there are a lot more weights, each one has to have a couple more indices to keep
track of where it is. So let’s call the weight of the edge connecting this 𝑘th neuron to the 𝑗th neuron
𝑤 𝐿 𝑗𝑘. Those indices might feel a little backwards at first. But it lines up with how you’d index the weight matrix that I talked about in the
part one video.
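A quick sketch of why that 𝑗𝑘 ordering lines up with the weight matrix: with 𝑗 as the row index and 𝑘 as the column index, the whole layer’s weighted sums become one matrix-vector product. The layer sizes and values here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

n_prev, n_last = 3, 2              # neurons in layers L-1 and L (made up)
a_prev = rng.random(n_prev)        # activations a^(L-1)_k
W = rng.random((n_last, n_prev))   # W[j, k] = weight from neuron k to neuron j
b = rng.random(n_last)             # biases b^L_j

z = W @ a_prev + b                 # z^L_j = sum over k of W[j, k] * a_prev[k] + b[j]
a = 1.0 / (1.0 + np.exp(-z))       # a^L_j = sigmoid(z^L_j)

y = np.array([1.0, 0.0])           # desired outputs y_j (made up)
C0 = np.sum((a - y) ** 2)          # cost: sum over j of (a^L_j - y_j)^2
```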
Just as before, it’s still nice to give a name to the relevant weighted sum, like 𝑧,
so that the activation of the last layer is just your special function, like the
sigmoid, applied to 𝑧. You can kinda see what I mean, right, where all of these are essentially the same
equations that we had before in the one-neuron-per-layer case. It’s just that it looks a little more complicated. And indeed, the chain-rule derivative expression describing how sensitive the cost is
to a specific weight looks essentially the same. I’ll leave it to you to pause and think about each of those terms if you want.
What does change here, though, is the derivative of the cost with respect to one of
the activations in the layer 𝐿 minus one. In this case, the difference is that the neuron influences the cost function through
multiple different paths. That is, on the one hand, it influences 𝑎 𝐿 zero, which plays a role in the cost
function. But it also has an influence on 𝑎 𝐿 one, which also plays a role in the cost
function. And you have to add those up. And that — well, that’s pretty much it. Once you know how sensitive the cost function is to the activations in this
second-to-last layer, you can just repeat the process for all the weights and biases
feeding into that layer.
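Spelled out, that "add those up" step is the one genuinely new formula in the multi-neuron case:

\[
\frac{\partial C_0}{\partial a^{L-1}_k} = \sum_j \frac{\partial z^L_j}{\partial a^{L-1}_k}\,\frac{\partial a^L_j}{\partial z^L_j}\,\frac{\partial C_0}{\partial a^L_j}
\]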
So pat yourself on the back! If all of this makes sense, you have now looked deep into the heart of
backpropagation, the workhorse behind how neural networks learn. These chain rule expressions give you the derivatives that determine each component
in the gradient that helps minimize the cost of the network by repeatedly stepping
downhill. Phew! If you sit back and think about all that, this is a lot of layers of complexity to
wrap your mind around. So don’t worry if it takes time for your mind to digest it all.