# Pop Video: Taylor Series

Grant Sanderson • 3Blue1Brown • Boclips

Taylor Series

21:43

### Video Transcript

When I first learned about Taylor series, I definitely didn’t appreciate just how important they are. But time and time again, they come up in math and physics and many fields of engineering because they’re one of the most powerful tools that math has to offer for approximating functions.

I think one of the first times this clicked for me as a student was not in a calculus class but a physics class. We were studying a certain problem that had to do with the potential energy of a pendulum. And for that, you need an expression for how high the weight of the pendulum is above its lowest point. And when you work that out, it comes out to be proportional to one minus the cosine of the angle between the pendulum and the vertical. Now the specifics of the problem we were trying to solve are beside the point here. But what I’ll say is that this cosine function made the problem awkward and unwieldy. And it made it less clear how pendulums relate to other oscillating phenomena. But if you approximate cos of 𝜃 as one minus 𝜃 squared over two, of all things, everything just falls into place much more easily.

Now if you’ve never seen anything like this before, an approximation like that might seem completely out of left field. I mean if you graph cos of 𝜃 along with this function, one minus 𝜃 squared over two, they do seem rather close to each other, at least for small angles near zero. But how would you even think to make this approximation? And how would you find that particular quadratic? The study of Taylor series is largely about taking nonpolynomial functions and finding polynomials that approximate them near some input. And the motive here is that polynomials tend to be much easier to deal with than other functions. They’re easier to compute, easier to take derivatives, easier to integrate, just all around more friendly.
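To make that "rather close" concrete, here is a quick numerical sketch in Python (the function name `quad_approx` is just my own, for illustration) comparing cos 𝜃 with one minus 𝜃 squared over two at a few angles:

```python
import math

def quad_approx(theta):
    # The quadratic 1 - theta^2 / 2 from the pendulum problem
    return 1 - theta**2 / 2

# For small angles the two values are nearly indistinguishable
for theta in [0.0, 0.1, 0.5, 1.0]:
    print(f"theta={theta}: cos={math.cos(theta):.6f}, approx={quad_approx(theta):.6f}")
```

At 𝜃 equals 0.1 the two agree to about five decimal places; by 𝜃 equals one the gap is noticeable, which matches the idea that this approximation is meant for small angles near zero.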

So let’s take a look at that function, cos of 𝑥, and really take a moment to think about how you might construct a quadratic approximation near 𝑥 equals zero. That is, among all of the possible polynomials that look like 𝑐 zero plus 𝑐 one times 𝑥 plus 𝑐 two times 𝑥 squared for some choice of these constants 𝑐 zero, 𝑐 one, and 𝑐 two, find the one that most resembles cos of 𝑥 near 𝑥 equals zero, whose graph kind of spoons with the graph of cos 𝑥 at that point.

Well, first of all, at the input zero, the value of cos of 𝑥 is one. So if our approximation is gonna be any good at all, it should also equal one at the input 𝑥 equals zero. Plugging in zero just results in whatever 𝑐 zero is, so we can set that equal to one. This leaves us free to choose constants 𝑐 one and 𝑐 two to make this approximation as good as we can. But nothing we do with them is gonna change the fact that the polynomial equals one at 𝑥 equals zero. Now it would also be good if our approximation had the same tangent slope as cos 𝑥 at this point of interest. Otherwise, the approximation drifts away from the cosine graph much faster than it needs to.

The derivative of cosine is negative sine. And at 𝑥 equals zero, that equals zero, meaning the tangent line is perfectly flat. On the other hand, when you work out the derivative of our quadratic, you get 𝑐 one plus two times 𝑐 two times 𝑥. At 𝑥 equals zero, this just equals whatever we choose for 𝑐 one. So this constant 𝑐 one has complete control over the derivative of our approximation around 𝑥 equals zero. Setting it equal to zero ensures that our approximation also has a flat tangent line at this point. And this leaves us free to change 𝑐 two. But the value and the slope of our polynomial at 𝑥 equals zero are locked in place to match that of cosine.

The final thing to take advantage of is the fact that the cosine graph curves downward above 𝑥 equals zero. It has a negative second derivative. Or in other words, even though the rate of change is zero at that point, the rate of change itself is decreasing around that point. Specifically, since its derivative is negative sin of 𝑥, its second derivative is negative cos of 𝑥. And at 𝑥 equals zero, that equals negative one. Now in the same way that we wanted the derivative of our approximation to match that of the cosine so that their values wouldn’t drift apart needlessly quickly, making sure that their second derivatives match will ensure that they curve at the same rate, that the slope of our polynomial doesn’t drift away from the slope of cos 𝑥 any more quickly than it needs to.

Pulling up the same derivative we had before and then taking its derivative, we see that the second derivative of this polynomial is exactly two times 𝑐 two. So to make sure that this second derivative also equals negative one at 𝑥 equals zero, two times 𝑐 two has to be negative one, meaning 𝑐 two itself should be negative one-half. And this gives us the approximation one plus zero 𝑥 minus one-half 𝑥 squared. And to get a feel for how good it is, if you estimate, say, cos of 0.1 using this polynomial, you’d estimate it to be 0.995, while the true value of cos of 0.1 is 0.99500416…. It’s a really good approximation.
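Here is the same construction written out as a small Python sketch, with each constant computed from the matching condition it enforces (the variable names are illustrative):

```python
import math

# Match the value, slope, and curvature of cos at x = 0
c0 = math.cos(0.0)       # value:  cos(0)  = 1
c1 = -math.sin(0.0)      # slope:  -sin(0) = 0
c2 = -math.cos(0.0) / 2  # half the second derivative: -cos(0)/2 = -1/2

def p(x):
    # The quadratic approximation c0 + c1*x + c2*x^2
    return c0 + c1 * x + c2 * x**2

print(p(0.1))          # 0.995
print(math.cos(0.1))   # 0.9950041652780258
```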

Take a moment to reflect on what just happened. You had three degrees of freedom with this quadratic approximation, the constants 𝑐 zero, 𝑐 one, and 𝑐 two. 𝑐 zero was responsible for making sure that the output of the approximation matches that of cos 𝑥 at 𝑥 equals zero. 𝑐 one was in charge of making sure that the derivatives match at that point. And 𝑐 two was responsible for making sure that the second derivatives match up. This ensures that the way your approximation changes as you move away from 𝑥 equals zero and the way that the rate of change itself changes is as similar as possible to the behavior of cos 𝑥, given the amount of control that you have.

You could give yourself more control by allowing more terms in your polynomial and matching higher-order derivatives. For example, let’s say you added on the term 𝑐 three times 𝑥 cubed for some constant 𝑐 three. Well in that case, if you take the third derivative of a cubic polynomial, anything that’s quadratic or smaller goes to zero. And as for that last term, after three iterations of the power rule, it looks like one times two times three times whatever 𝑐 three is. On the other hand, the third derivative of cos 𝑥 comes out to sin of 𝑥, which equals zero at 𝑥 equals zero. So to make sure that the third derivatives match, the constant 𝑐 three should be zero. Or in other words, not only is one minus one-half 𝑥 squared the best possible quadratic approximation of cosine, it’s also the best possible cubic approximation.

You can actually make an improvement by adding on a fourth-order term, 𝑐 four times 𝑥 to the fourth. The fourth derivative of cosine is actually itself, which equals one at 𝑥 equals zero. And what’s the fourth derivative of our polynomial with this new term? Well, when you keep applying the power rule over and over, with those exponents all hopping down in front, you end up with one times two times three times four times 𝑐 four, which is 24 times 𝑐 four. So if we want this to match the fourth derivative of cos 𝑥, which is one, 𝑐 four has to be one over 24. And indeed, the polynomial one minus one-half 𝑥 squared plus one twenty-fourth times 𝑥 to the fourth, which looks like this, is a very close approximation for cos 𝑥 around 𝑥 equals zero.
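As a rough check of that claim (an illustrative sketch, not anything from the video itself), you can compare this fourth-order polynomial against the true cosine:

```python
import math

def p4(x):
    # 1 - x^2/2 + x^4/24: the fourth derivative of cos is cos itself,
    # which is 1 at x = 0, so the x^4 coefficient is 1/4! = 1/24
    return 1 - x**2 / 2 + x**4 / 24

for x in [0.5, 1.0]:
    print(x, p4(x), math.cos(x))
```

Even at 𝑥 equals one, a fairly large input for this purpose, the two values agree to about three decimal places.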

In any physics problem involving the cosine of a small angle, for example, predictions would be almost unnoticeably different if you substituted this polynomial for cos of 𝑥. Now, take a step back and notice a few things happening with this process. First of all, factorial terms come up very naturally in this process. When you take 𝑛 successive derivatives of the function 𝑥 to the 𝑛, letting the power rule just keep cascading on down, what you’ll be left with is one times two times three on and on and on up to whatever 𝑛 is. So you don’t simply set the coefficients of the polynomial equal to whatever derivative you want, you have to divide by the appropriate factorial to cancel out this effect.

For example, that 𝑥 to the fourth coefficient was the fourth derivative of cosine, one, but divided by four factorial, 24. The second thing to notice is that adding on new terms, like this 𝑐 four times 𝑥 to the fourth, doesn’t mess up what the old terms should be. And that’s really important. For example, the second derivative of this polynomial at 𝑥 equals zero is still equal to two times the second coefficient, even after you introduce higher-order terms. And it’s because we’re plugging in 𝑥 equals zero. So the second derivative of any higher-order term, which all include an 𝑥, will just wash away. And the same goes for any other derivative, which is why each derivative of a polynomial at 𝑥 equals zero is controlled by one and only one of the coefficients.

If, instead, you were approximating near an input other than zero, like maybe 𝑥 equals 𝜋, in order to get the same effect, you would have to write your polynomial in terms of powers of 𝑥 minus 𝜋 or whatever input you’re looking at. This makes it look noticeably more complicated. But all we’re doing is just making sure that the point 𝜋 looks and behaves like zero so that plugging in 𝑥 equals 𝜋 is gonna result in a lot of nice cancelation that leaves only one constant. And finally, on a more philosophical level, notice how what we’re doing here is basically taking information about higher-order derivatives of a function at a single point. And then translating that into information about the value of the function near that point.

You can take as many derivatives of cosine as you want. It follows this nice cyclic pattern: cos of 𝑥, negative sin of 𝑥, negative cos, sin, and then repeat. And the value of each one of these is easy to compute at 𝑥 equals zero. It gives this cyclic pattern: one, zero, negative one, zero, and then repeat. And knowing the values of all of those higher-order derivatives is a lot of information about cos of 𝑥, even though it only involves plugging in a single number, 𝑥 equals zero. So what we’re doing is leveraging that information to get an approximation around this input. And you do it by creating a polynomial whose higher-order derivatives are designed to match up with those of cosine, following the same one, zero, negative one, zero cyclic pattern.

And to do that, you just make each coefficient of the polynomial follow that same pattern. But you have to divide each one by the appropriate factorial. Like I mentioned before, this is what cancels out the cascading effects of many power rule applications. The polynomials you get by stopping this process at any point are called Taylor polynomials for cos of 𝑥. More generally, and hence more abstractly, if we were dealing with some other function other than cosine, you would compute its derivative, its second derivative, and so on, getting as many terms as you’d like. And you would evaluate each one of them at 𝑥 equals zero. Then for the polynomial approximation, the coefficient of each 𝑥 to the 𝑛 term should be the value of the 𝑛th derivative of the function evaluated at zero but divided by 𝑛 factorial.
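That recipe, derivative values divided by factorials, is short enough to sketch directly. Here the cyclic pattern one, zero, negative one, zero plays the role of the 𝑛th derivative of cosine at zero (the function name is my own):

```python
import math

def cos_taylor(x, n_terms):
    # Derivatives of cos at 0 cycle through 1, 0, -1, 0, ...
    # and the coefficient of x^n is that nth derivative divided by n!
    cycle = [1, 0, -1, 0]
    total = 0.0
    for n in range(n_terms):
        total += cycle[n % 4] * x**n / math.factorial(n)
    return total

print(cos_taylor(1.0, 10), math.cos(1.0))
```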

And this whole rather abstract formula is something that you’ll likely see in any text or any course that touches on Taylor polynomials. And when you see it, I want you to think to yourself that that constant term ensures that the value of the polynomial matches with the value of 𝑓. The next term ensures that the slope of the polynomial matches the slope of the function at 𝑥 equals zero. The next term ensures that the rate at which the slope changes is the same at that point, and so on, depending on how many terms you want. And the more terms you choose, the closer the approximation. But the tradeoff is that the polynomial you’d get would be more complicated.

And to make things even more general, if you wanted to approximate near some input other than zero, which we’ll call 𝑎, you would write this polynomial in terms of powers of 𝑥 minus 𝑎. And you would evaluate all the derivatives of 𝑓 at that input, 𝑎. This is what Taylor polynomials look like in their fullest generality. Changing the value of 𝑎 changes where this approximation is hugging the original function, where its higher-order derivatives will be equal to those of the original function.
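In this fullest generality, the same sketch only needs two changes: evaluate the derivatives at 𝑎, and build powers of 𝑥 minus 𝑎. For cosine, the derivatives at 𝑎 cycle through cos 𝑎, negative sin 𝑎, negative cos 𝑎, sin 𝑎 (again, an illustrative sketch with my own names):

```python
import math

def cos_taylor_at(x, a, n_terms):
    # Derivatives of cos evaluated at a, cycling with period four
    derivs = [math.cos(a), -math.sin(a), -math.cos(a), math.sin(a)]
    total = 0.0
    for n in range(n_terms):
        total += derivs[n % 4] * (x - a)**n / math.factorial(n)
    return total

# Approximate cos near pi, using only derivative information gathered at a = pi
print(cos_taylor_at(3.0, math.pi, 6), math.cos(3.0))
```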

One of the simplest meaningful examples of this is the function 𝑒 to the 𝑥, around the input 𝑥 equals zero. Computing the derivatives is super nice, as nice as it gets, because the derivative of 𝑒 to the 𝑥 is itself. So the second derivative is also 𝑒 to the 𝑥, as is its third, and so on. So at the point 𝑥 equals zero, all of these are equal to one. And what that means is our polynomial approximation should look like one plus one times 𝑥 plus one over two times 𝑥 squared plus one over three factorial times 𝑥 cubed and so on, depending on how many terms you want. These are the Taylor polynomials for 𝑒 to the 𝑥.
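Since every derivative at zero equals one, each coefficient is just one over 𝑛 factorial, and the Taylor polynomials for 𝑒 to the 𝑥 reduce to a one-line sum (sketch; the name is mine):

```python
import math

def exp_taylor(x, n_terms):
    # Every derivative of e^x at 0 equals 1, so the coefficient of x^n is 1/n!
    return sum(x**n / math.factorial(n) for n in range(n_terms))

print(exp_taylor(1.0, 12), math.e)
```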

Okay, so with that as a foundation, in the spirit of showing you just how connected all the topics of calculus are, let me turn to something kind of fun, a completely different way to understand this second-order term of the Taylor polynomials, but geometrically. It’s related to the fundamental theorem of calculus, which I talked about in chapters one and eight if you need a quick refresher. Like we did in those videos, consider a function that gives the area under some graph between a fixed left point and a variable right point. What we’re gonna do here is think about how to approximate this area function, not the function for the graph itself like we’ve been doing before. Focusing on that area is what’s gonna make the second-order term kind of pop out.

Remember, the fundamental theorem of calculus says that this graph itself represents the derivative of the area function. And it’s because a slight nudge, d𝑥, to the right bound of the area gives a new bit of area that’s approximately equal to the height of the graph times d𝑥. And that approximation is increasingly accurate for smaller and smaller choices of d𝑥. But if you wanted to be more accurate about this change in area given some change in 𝑥 that isn’t meant to approach zero, you would have to take into account this portion right here, which is approximately a triangle.

Let’s name the starting input 𝑎 and the nudged input above it 𝑥 so that that change is 𝑥 minus 𝑎. The base of that little triangle is that change, 𝑥 minus 𝑎. And its height is the slope of the graph times 𝑥 minus 𝑎. Since this graph is the derivative of the area function, its slope is the second derivative of the area function, evaluated at the input 𝑎. So the area of this triangle, one-half base times height is one-half times the second derivative of this area function, evaluated at 𝑎, multiplied by 𝑥 minus 𝑎 squared. And this is exactly what you would see with a Taylor polynomial. If you knew the various derivative information about this area function at the point 𝑎, how would you approximate the area at the point 𝑥?

Well, you have to include all that area up to 𝑎, 𝑓 of 𝑎, plus the area of this rectangle here, which is the first derivative times 𝑥 minus 𝑎, plus the area of that little triangle, which is one-half times the second derivative times 𝑥 minus 𝑎 squared. I really like this because even though it looks a bit messy all written out, each one of the terms has a very clear meaning that you can just point to on the diagram. If you wanted, we could call it an end here. And you would have a phenomenally useful tool for approximations with these Taylor polynomials. But, if you’re thinking like a mathematician, one question you might ask is whether or not it makes sense to never stop and just add infinitely many terms.
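One way to sanity-check this picture is with a concrete graph. Below, the graph is 𝑔 of 𝑡 equals 𝑡 squared, so the area function works out to 𝑥 cubed over three, and we can compare "area up to 𝑎, plus rectangle, plus triangle" against the exact area (an illustrative sketch; the particular function and numbers are my own choices):

```python
def g(t):
    # The graph whose area we track; its area function is A(x) = x^3 / 3
    return t**2

def g_prime(t):
    # Slope of the graph, i.e. the second derivative of the area function
    return 2 * t

def area_true(x):
    return x**3 / 3

def area_approx(x, a):
    # area up to a  +  rectangle g(a)*(x - a)  +  triangle (1/2)*g'(a)*(x - a)^2
    return area_true(a) + g(a) * (x - a) + 0.5 * g_prime(a) * (x - a)**2

print(area_approx(1.1, 1.0), area_true(1.1))
```

The two values agree to within a few ten-thousandths, and the leftover error is exactly the third-order term the quadratic picture ignores.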

In math, an infinite sum is called a series. So even though one of these approximations with finitely many terms is called a Taylor polynomial, adding all infinitely many terms gives what’s called a Taylor series. You have to be really careful with the idea of an infinite series because it doesn’t actually make sense to add infinitely many things. You can only hit the plus button on the calculator so many times. But if you have a series where adding more and more of the terms, which make sense at each step, gets you increasingly close to some specific value, what you say is that the series converges to that value. Or, if you’re comfortable extending the definition of equality to include this kind of series convergence, you’d say that the series as a whole, this infinite sum, equals the value that it’s converging to.

For example, look at the Taylor polynomial for 𝑒 to the 𝑥 and plug in some input like 𝑥 equals one. As you add more and more polynomial terms, the total sum gets closer and closer to the value 𝑒. So you say that this infinite series converges to the number 𝑒. Or, what’s saying the same thing, that it equals the number 𝑒. In fact, it turns out that if you plug in any other value of 𝑥, like 𝑥 equals two, and look at the value of the higher and higher-order Taylor polynomials at this value, they will converge towards 𝑒 to the 𝑥, which in this case is 𝑒 squared. And this is true for any input, no matter how far away from zero it is, even though these Taylor polynomials are constructed only from derivative information gathered at the input zero.
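You can watch that convergence happen numerically. Here are partial sums of the series at 𝑥 equals two marching toward 𝑒 squared (a sketch, with my own function name):

```python
import math

def exp_partial_sum(x, n_terms):
    # Partial sum of the Taylor series for e^x about 0
    return sum(x**k / math.factorial(k) for k in range(n_terms))

# The partial sums at x = 2 approach e^2, even though all the
# derivative information was gathered at x = 0
for n in [2, 5, 10, 15]:
    print(n, exp_partial_sum(2.0, n))
print(math.e**2)
```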

In a case like this, we say that 𝑒 to the 𝑥 equals its own Taylor series at all inputs 𝑥, which is kind of a magical thing to have happen. And even though this is also true for a couple other important functions, things like sine and cosine, sometimes these series only converge within a certain range around the input whose derivative information you’re using. If you work out the Taylor series for the natural log of 𝑥 around the input 𝑥 equals one, which is built by evaluating the higher-order derivatives of the natural log of 𝑥 at 𝑥 equals one, this is what it would look like. When you plug in an input between zero and two, adding more and more terms of this series will indeed get you closer and closer to the natural log of that input.

But outside of that range, even by just a little bit, this series fails to approach anything. As you add on more and more terms, the sum just kind of bounces back and forth wildly. It does not, as you might expect, approach the natural log of that value, even though the natural log of 𝑥 is perfectly well defined for inputs that are above two. In some sense, the derivative information of ln of 𝑥 at 𝑥 equals one doesn’t propagate out that far. In a case like this where adding more terms of the series doesn’t approach anything, you say that the series diverges. And that maximum distance between the input you’re approximating near and points where the outputs of these polynomials actually do converge is called the radius of convergence for the Taylor series.
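Both behaviors are easy to see numerically. The series in question is the sum of negative one to the 𝑛 plus one times 𝑥 minus one to the 𝑛 over 𝑛. Inside the radius of convergence, say at 𝑥 equals 1.5, the partial sums settle down on ln of 1.5; outside it, say at 𝑥 equals 2.5, consecutive partial sums jump around by larger and larger amounts (an illustrative sketch):

```python
import math

def ln_partial_sum(x, n_terms):
    # Partial sum of the Taylor series of ln(x) about x = 1:
    # sum over n of (-1)^(n+1) * (x - 1)^n / n
    return sum((-1)**(n + 1) * (x - 1)**n / n for n in range(1, n_terms + 1))

# Inside the radius of convergence (|x - 1| < 1): the sums home in on ln(1.5)
print(ln_partial_sum(1.5, 50), math.log(1.5))

# Outside it: the terms grow, so the sums bounce around wildly
print(ln_partial_sum(2.5, 20), ln_partial_sum(2.5, 21))
```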

There remains more to learn about Taylor series. There are many use cases, tactics for placing bounds on the error of these approximations, tests for understanding when series do and don’t converge. And for that matter, there remains more to learn about calculus as a whole and the countless topics not touched by this series. The goal with these videos is to give you the fundamental intuitions that make you feel confident and efficient in learning more on your own and potentially even rediscovering more of the topic for yourself. In the case of Taylor series, the fundamental intuition to keep in mind as you explore further is that they translate derivative information at a single point into approximation information around that point.