### Video Transcript

When I first learned about Taylor series, I definitely didn’t appreciate just how
important they are. But time and time again, they come up in math and physics and many fields of
engineering because they’re one of the most powerful tools that math has to offer
for approximating functions.

I think one of the first times this clicked for me as a student was not in a calculus
class but a physics class. We were studying a certain problem that had to do with the potential energy of a
pendulum. And for that, you need an expression for how high the weight of the pendulum is above
its lowest point. And when you work that out, it comes out to be proportional to one minus the cosine
of the angle between the pendulum and the vertical. Now the specifics of the problem we were trying to solve are beside the point
here. But what I’ll say is that this cosine function made the problem awkward and
unwieldy, and it made it less clear how pendulums relate to other oscillating phenomena. But if you approximate cos of 𝜃 as one minus 𝜃 squared over two, of all things,
everything just falls into place much more easily.
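If you want to see just how good that small-angle substitution is, here's a quick Python sketch (not part of the original video, just an illustration; the sample angles are arbitrary):

```python
import math

# Compare cos(theta) with the quadratic approximation 1 - theta**2 / 2
# for a few small pendulum angles, in radians.
for theta in [0.05, 0.1, 0.25, 0.5]:
    exact = math.cos(theta)
    approx = 1 - theta**2 / 2
    print(f"theta={theta}: cos={exact:.7f} approx={approx:.7f} error={abs(exact - approx):.1e}")
```

Even at half a radian (about 29 degrees), the two values agree to within roughly a quarter of a percent, which is why the substitution barely changes the physics for small swings.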

Now if you’ve never seen anything like this before, an approximation like that might
seem completely out of left field. I mean if you graph cos of 𝜃 along with this function, one minus 𝜃 squared over
two, they do seem rather close to each other, at least for small angles near
zero. But how would you even think to make this approximation? And how would you find that particular quadratic? The study of Taylor series is largely about taking nonpolynomial functions and
finding polynomials that approximate them near some input. And the motive here is that polynomials tend to be much easier to deal with than
other functions. They’re easier to compute, easier to take derivatives of, easier to integrate, just all
around more friendly.

So let’s take a look at that function, cos of 𝑥, and really take a moment to think
about how you might construct a quadratic approximation near 𝑥 equals zero. That is, among all of the possible polynomials that look like 𝑐 zero plus 𝑐 one
times 𝑥 plus 𝑐 two times 𝑥 squared for some choice of these constants 𝑐 zero, 𝑐
one, and 𝑐 two, find the one that most resembles cos of 𝑥 near 𝑥 equals zero,
whose graph kind of spoons with the graph of cos 𝑥 at that point.

Well, first of all, at the input zero, the value of cos of 𝑥 is one. So if our approximation is gonna be any good at all, it should also equal one at the
input 𝑥 equals zero. Plugging in zero just results in whatever 𝑐 zero is, so we can set that equal to
one. This leaves us free to choose constants 𝑐 one and 𝑐 two to make this approximation
as good as we can. But nothing we do with them is gonna change the fact that the polynomial equals one
at 𝑥 equals zero. Now it would also be good if our approximation had the same tangent slope as cos 𝑥
at this point of interest. Otherwise, the approximation drifts away from the cosine graph much faster than it
needs to.

The derivative of cosine is negative sine. And at 𝑥 equals zero, that equals zero, meaning the tangent line is perfectly
flat. On the other hand, when you work out the derivative of our quadratic, you get 𝑐 one
plus two times 𝑐 two times 𝑥. At 𝑥 equals zero, this just equals whatever we choose for 𝑐 one. So this constant 𝑐 one has complete control over the derivative of our approximation
around 𝑥 equals zero. Setting it equal to zero ensures that our approximation also has a flat tangent line
at this point. And this leaves us free to change 𝑐 two. But the value and the slope of our polynomial at 𝑥 equals zero are locked in place
to match that of cosine.

The final thing to take advantage of is the fact that the cosine graph curves
downward above 𝑥 equals zero. It has a negative second derivative. Or in other words, even though the rate of change is zero at that point, the rate of
change itself is decreasing around that point. Specifically, since its derivative is negative sin of 𝑥, its second derivative is
negative cos of 𝑥. And at 𝑥 equals zero, that equals negative one. Now in the same way that we wanted the derivative of our approximation to match that
of the cosine so that their values wouldn’t drift apart needlessly quickly, making
sure that their second derivatives match will ensure that they curve at the same
rate, that the slope of our polynomial doesn’t drift away from the slope of cos 𝑥
any more quickly than it needs to.

Pulling up the same derivative we had before and then taking its derivative, we see
that the second derivative of this polynomial is exactly two times 𝑐 two. So to make sure that this second derivative also equals negative one at 𝑥 equals
zero, two times 𝑐 two has to be negative one, meaning 𝑐 two itself should be negative one-half. And this gives us the approximation one plus zero 𝑥 minus one-half 𝑥 squared. To get a feel for how good it is, if you estimate, say, cos of 0.1 using this
polynomial, you’d estimate it to be 0.995, which is extremely close to the true value of cos of 0.1, about 0.9950042. It’s a really good approximation.
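The whole construction, matching the value, the slope, and the curvature, can also be mimicked numerically. Here's a sketch (an illustration I'm adding, with an arbitrary step size h) that estimates those three derivatives of cosine at zero using finite differences and recovers the constants:

```python
import math

def nth_derivative_at_zero(f, n, h=1e-3):
    # Central finite-difference estimate of the n-th derivative of f at 0.
    # (Just a numerical sketch; exact differentiation gives the same constants.)
    return sum((-1)**k * math.comb(n, k) * f((n / 2 - k) * h)
               for k in range(n + 1)) / h**n

c0 = nth_derivative_at_zero(math.cos, 0)      # value:       cos(0)   =  1
c1 = nth_derivative_at_zero(math.cos, 1)      # slope:      -sin(0)   =  0
c2 = nth_derivative_at_zero(math.cos, 2) / 2  # curvature:  -cos(0)/2 = -1/2
print(round(c0, 6), round(c1, 6), round(c2, 6))
```

The three constants come out to 1, 0, and negative one-half, matching the values pinned down above.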

Take a moment to reflect on what just happened. You had three degrees of freedom with this quadratic approximation, the constants 𝑐
zero, 𝑐 one, and 𝑐 two. 𝑐 zero was responsible for making sure that the output of the approximation matches
that of cos 𝑥 at 𝑥 equals zero. 𝑐 one was in charge of making sure that the derivatives match at that point. And 𝑐 two was responsible for making sure that the second derivatives match up. This ensures that the way your approximation changes as you move away from 𝑥 equals
zero and the way that the rate of change itself changes is as similar as possible to
the behavior of cos 𝑥, given the amount of control that you have.

You could give yourself more control by allowing more terms in your polynomial and
matching higher-order derivatives. For example, let’s say you added on the term 𝑐 three times 𝑥 cubed for some
constant 𝑐 three. Well in that case, if you take the third derivative of a cubic polynomial, anything
that’s quadratic or smaller goes to zero. And as for that last term, after three iterations of the power rule, it looks like
one times two times three times whatever 𝑐 three is. On the other hand, the third derivative of cos 𝑥 comes out to sin of 𝑥, which
equals zero at 𝑥 equals zero. So to make sure that the third derivatives match, the constant 𝑐 three should be
zero. Or in other words, not only is one minus one-half 𝑥 squared the best possible
quadratic approximation of cosine, it’s also the best possible cubic
approximation.

You can actually make an improvement by adding on a fourth-order term, 𝑐 four times
𝑥 to the fourth. The fourth derivative of cosine is actually itself, which equals one at 𝑥 equals
zero. And what’s the fourth derivative of our polynomial with this new term? Well, when you keep applying the power rule over and over, with those exponents all
hopping down in front, you end up with one times two times three times four times
𝑐 four, which is 24 times 𝑐 four. So if we want this to match the fourth derivative of cos 𝑥, which is one, 𝑐 four
has to be one over 24. And indeed, the polynomial one minus one-half 𝑥 squared plus one twenty-fourth times
𝑥 to the fourth, which looks like this, is a very close approximation for cos 𝑥
around 𝑥 equals zero.
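As a quick sanity check (again just an illustrative sketch, not something from the video), you can compare this fourth-order polynomial against the real cosine at a few inputs:

```python
import math

def cos_taylor4(x):
    # Fourth-order Taylor polynomial for cos(x) around 0.
    return 1 - x**2 / 2 + x**4 / 24

for x in [0.1, 0.5, 1.0]:
    print(f"x={x}: cos={math.cos(x):.7f} taylor4={cos_taylor4(x):.7f}")
```

Even a full radian away from zero the two agree to about two decimal places, and near zero they are practically indistinguishable.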

In any physics problem involving the cosine of a small angle, for example,
predictions would be almost unnoticeably different if you substituted this
polynomial for cos of 𝑥. Now, take a step back and notice a few things happening with this process. First of all, factorial terms come up very naturally in this process. When you take 𝑛 successive derivatives of the function 𝑥 to the 𝑛, letting the
power rule just keep cascading on down, what you’ll be left with is one times two
times three on and on and on up to whatever 𝑛 is. So you don’t simply set the coefficients of the polynomial equal to whatever
derivative you want, you have to divide by the appropriate factorial to cancel out
this effect.
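That cascading effect is easy to trace mechanically. Here's a small sketch that applies the power rule n times to x to the n, tracking just the coefficient and the exponent:

```python
import math

def n_derivatives_of_x_to_n(n):
    # Differentiate x**n a total of n times, using only the power rule.
    # Each step multiplies the coefficient by the current exponent.
    coeff, power = 1, n
    for _ in range(n):
        coeff *= power
        power -= 1
    return coeff  # power is now 0, so the result is the constant 1*2*...*n

print([n_derivatives_of_x_to_n(n) for n in range(1, 6)])  # → [1, 2, 6, 24, 120]
```

The result is exactly n factorial, which is why each Taylor coefficient gets divided by n factorial to cancel it out.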

For example, that 𝑥 to the fourth coefficient was the fourth derivative of cosine,
one, but divided by four factorial, 24. The second thing to notice is that adding on new terms, like this 𝑐 four times 𝑥 to
the fourth, doesn’t mess up what the old terms should be. And that’s really important. For example, the second derivative of this polynomial at 𝑥 equals zero is still
equal to two times the second coefficient, even after you introduce higher-order
terms. And it’s because we’re plugging in 𝑥 equals zero. So the second derivative of any higher-order term, which all include an 𝑥, will just
wash away. And the same goes for any other derivative, which is why each derivative of a
polynomial at 𝑥 equals zero is controlled by one and only one of the
coefficients.

If, instead, you were approximating near an input other than zero, like maybe 𝑥
equals 𝜋, in order to get the same effect, you would have to write your polynomial
in terms of powers of 𝑥 minus 𝜋 or whatever input you’re looking at. This makes it look noticeably more complicated. But all we’re doing is just making sure that the point 𝜋 looks and behaves like zero
so that plugging in 𝑥 equals 𝜋 is gonna result in a lot of nice cancelation that
leaves only one constant. And finally, on a more philosophical level, notice how what we’re doing here is
basically taking information about higher-order derivatives of a function at a
single point. And then translating that into information about the value of the function near that
point.

You can take as many derivatives of cosine as you want. It follows this nice cyclic pattern: cos of 𝑥, negative sin of 𝑥, negative cos,
sin, and then repeat. And the value of each one of these is easy to compute at 𝑥 equals zero. It gives this cyclic pattern: one, zero, negative one, zero, and then repeat. And knowing the values of all of those higher-order derivatives is a lot of
information about cos of 𝑥, even though it only involves plugging in a single
number, 𝑥 equals zero. So what we’re doing is leveraging that information to get an approximation around
this input. And you do it by creating a polynomial whose higher-order derivatives are designed to
match up with those of cosine, following the same one, zero, negative one, zero
cyclic pattern.

And to do that, you just make each coefficient of the polynomial follow that same
pattern. But you have to divide each one by the appropriate factorial. Like I mentioned before, this is what cancels out the cascading effects of many power
rule applications. The polynomials you get by stopping this process at any point are called Taylor
polynomials for cos of 𝑥. More generally, and hence more abstractly, if we were dealing with some other
function other than cosine, you would compute its derivative, its second derivative,
and so on, getting as many terms as you’d like. And you would evaluate each one of them at 𝑥 equals zero. Then for the polynomial approximation, the coefficient of each 𝑥 to the 𝑛 term
should be the value of the 𝑛th derivative of the function evaluated at zero but
divided by 𝑛 factorial.
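Written as code, that rule is essentially one line. Here's a hedged sketch (an illustration, not from the video) where `derivs` holds the n-th derivatives of the function evaluated at zero, which for cosine is the cycle one, zero, negative one, zero:

```python
import math

def taylor_at_zero(derivs, x):
    # Coefficient of x**n is the n-th derivative at 0, divided by n factorial.
    return sum(d / math.factorial(n) * x**n for n, d in enumerate(derivs))

cos_derivs = [1, 0, -1, 0, 1, 0, -1, 0]   # derivatives of cos evaluated at 0
print(taylor_at_zero(cos_derivs, 0.5), math.cos(0.5))
```

With just these eight derivative values, the polynomial matches cosine at 0.5 to better than six decimal places.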

And this whole rather abstract formula is something that you’ll likely see in any
text or any course that touches on Taylor polynomials. And when you see it, I want you to think to yourself that that constant term ensures
that the value of the polynomial matches with the value of 𝑓. The next term ensures that the slope of the polynomial matches the slope of the
function at 𝑥 equals zero. The next term ensures that the rate at which the slope changes is the same at that
point, and so on, depending on how many terms you want. And the more terms you choose, the closer the approximation. But the tradeoff is that the polynomial you’d get would be more complicated.

And to make things even more general, if you wanted to approximate near some input
other than zero, which we’ll call 𝑎, you would write this polynomial in terms of
powers of 𝑥 minus 𝑎. And you would evaluate all the derivatives of 𝑓 at that input, 𝑎. This is what Taylor polynomials look like in their fullest generality. Changing the value of 𝑎 changes where this approximation is hugging the original
function, where its higher-order derivatives will be equal to those of the original
function.
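For instance (a sketch I'm adding for illustration), approximating cosine around 𝑎 equals 𝜋 just means using the derivatives of cosine evaluated at 𝜋, which cycle negative one, zero, one, zero, together with powers of x minus 𝜋:

```python
import math

def cos_taylor_around_pi(x, terms=8):
    # Derivatives of cos evaluated at pi cycle -1, 0, 1, 0, ...
    derivs = [-1, 0, 1, 0] * ((terms + 3) // 4)
    return sum(derivs[n] / math.factorial(n) * (x - math.pi) ** n
               for n in range(terms))

# This polynomial hugs the cosine graph near pi rather than near zero:
print(cos_taylor_around_pi(3.0), math.cos(3.0))
```

Near 𝜋 the approximation is excellent, precisely because that's where the derivative information was gathered.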

One of the simplest meaningful examples of this is the function 𝑒 to the 𝑥, around
the input 𝑥 equals zero. Computing the derivatives is super nice, as nice as it gets, because the derivative
of 𝑒 to the 𝑥 is itself. So the second derivative is also 𝑒 to the 𝑥, as is its third, and so on. So at the point 𝑥 equals zero, all of these are equal to one. And what that means is our polynomial approximation should look like one plus one
times 𝑥 plus one over two times 𝑥 squared plus one over three factorial times 𝑥
cubed and so on, depending on how many terms you want. These are the Taylor polynomials for 𝑒 to the 𝑥.

Okay, so with that as a foundation, in the spirit of showing you just how connected
all the topics of calculus are, let me turn to something kind of fun, a completely
different way to understand this second-order term of the Taylor polynomials, but
geometrically. It’s related to the fundamental theorem of calculus, which I talked about in chapters
one and eight, if you need a quick refresher. Like we did in those videos, consider a function that gives the area under some graph
between a fixed left point and a variable right point. What we’re gonna do here is think about how to approximate this area function, not
the function for the graph itself like we’ve been doing before. Focusing on that area is what’s gonna make the second order term kind of pop out.

Remember, the fundamental theorem of calculus is that this graph itself represents
the derivative of the area function. And it’s because a slight nudge, d𝑥, to the right bound of the area gives a new bit
of area that’s approximately equal to the height of the graph times d𝑥. And that approximation is increasingly accurate for smaller and smaller choices of
d𝑥. But if you wanted to be more accurate about this change in area given some change in
𝑥 that isn’t meant to approach zero, you would have to take into account this
portion right here, which is approximately a triangle.

Let’s name the starting input 𝑎 and the nudged input above it 𝑥 so that that change
is 𝑥 minus 𝑎. The base of that little triangle is that change, 𝑥 minus 𝑎. And its height is the slope of the graph times 𝑥 minus 𝑎. Since this graph is the derivative of the area function, its slope is the second
derivative of the area function, evaluated at the input 𝑎. So the area of this triangle, one-half base times height is one-half times the second
derivative of this area function, evaluated at 𝑎, multiplied by 𝑥 minus 𝑎
squared. And this is exactly what you would see with a Taylor polynomial. If you knew the various derivative information about this area function at the point
𝑎, how would you approximate the area at the point 𝑥?

Well, you have to include all that area up to 𝑎, 𝑓 of 𝑎, plus the area of this
rectangle here, which is the first derivative times 𝑥 minus 𝑎, plus the area of
that little triangle, which is one-half times the second derivative times 𝑥 minus
𝑎 squared. I really like this because even though it looks a bit messy all written out, each one
of the terms has a very clear meaning that you can just point to on the diagram. If you wanted, we could call it an end here. And you would have a phenomenally useful tool for approximations with these Taylor
polynomials. But, if you’re thinking like a mathematician, one question you might ask is whether
or not it makes sense to never stop and just add infinitely many terms.

In math, an infinite sum is called a series. So even though one of these approximations with finitely many terms is called a
Taylor polynomial, adding all infinitely many terms gives what’s called a Taylor
series. You have to be really careful with the idea of an infinite series because it doesn’t
actually make sense to add infinitely many things. You can only hit the plus button on the calculator so many times. But if you have a series where adding more and more of the terms, which make sense at
each step, gets you increasingly close to some specific value, what you say is that
the series converges to that value. Or, if you’re comfortable extending the definition of equality to include this kind
of series convergence, you’d say that the series as a whole, this infinite sum,
equals the value that it’s converging to.

For example, look at the Taylor polynomial for 𝑒 to the 𝑥 and plug in some input
like 𝑥 equals one. As you add more and more polynomial terms, the total sum gets closer and closer to
the value 𝑒. So you say that this infinite series converges to the number 𝑒. Or, what’s saying the same thing, that it equals the number 𝑒. In fact, it turns out that if you plug in any other value of 𝑥, like 𝑥 equals two,
and look at the value of the higher and higher-order Taylor polynomials at this
value, they will converge towards 𝑒 to the 𝑥, which in this case is 𝑒
squared. And this is true for any input, no matter how far away from zero it is, even though
these Taylor polynomials are constructed only from derivative information gathered
at the input zero.

In a case like this, we say that 𝑒 to the 𝑥 equals its own Taylor series at all
inputs 𝑥, which is kind of a magical thing to have happen. And even though this is also true for a couple other important functions, things like
sine and cosine, sometimes these series only converge within a certain range around
the input whose derivative information you’re using. If you work out the Taylor series for the natural log of 𝑥 around the input 𝑥
equals one, which is built by evaluating the higher-order derivatives of the natural
log of 𝑥 at 𝑥 equals one, this is what it would look like. When you plug in an input between zero and two, adding more and more terms of this
series will indeed get you closer and closer to the natural log of that input.

But outside of that range, even by just a little bit, this series fails to approach
anything. As you add on more and more terms, the sum just kind of bounces back and forth
wildly. It does not, as you might expect, approach the natural log of that value, even though
the natural log of 𝑥 is perfectly well defined for inputs that are above two. In some sense, the derivative information of ln of 𝑥 at 𝑥 equals one doesn’t
propagate out that far. In a case like this where adding more terms of the series doesn’t approach anything,
you say that the series diverges. And that maximum distance between the input you’re approximating near and points
where the outputs of these polynomials actually do converge is called the radius of
convergence for the Taylor series.
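You can watch both behaviors numerically. The Taylor series of the natural log around one is the sum of negative one to the n plus one, times x minus one to the n, over n; here's a sketch of its partial sums inside and outside the radius of convergence:

```python
import math

def ln_taylor_partial(x, terms):
    # Partial sum of the Taylor series for ln(x) around 1:
    # (x-1) - (x-1)**2/2 + (x-1)**3/3 - ...
    return sum((-1) ** (n + 1) * (x - 1) ** n / n for n in range(1, terms + 1))

# Inside the radius of convergence (0 < x < 2), the sums settle toward ln(x):
print(ln_taylor_partial(1.5, 50), math.log(1.5))

# Outside it (x = 3), they bounce around wildly instead of approaching ln(3):
for terms in (5, 10, 15, 20):
    print(terms, ln_taylor_partial(3.0, terms))
```

Inside the radius, fifty terms pin down ln of 1.5 to machine precision; at x equals three, the partial sums swing with ever-growing magnitude.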

There remains more to learn about Taylor series. There are many use cases, tactics for placing bounds on the error of these
approximations, tests for understanding when series do and don’t converge. And for that matter, there remains more to learn about calculus as a whole and the
countless topics not touched by this series. The goal with these videos is to give you the fundamental intuitions that make you
feel confident and efficient in learning more on your own and potentially even
rediscovering more of the topic for yourself. In the case of Taylor series, the fundamental intuition to keep in mind as you
explore more of what’s out there is that they translate derivative information at a
single point to approximation information around that point.