Gradient Descent Explained
Gradient descent is like trying to find your way down a dark mountain. You can't see where you're going, so you have to feel your way around: you take small steps in the direction that feels the most downhill, and if you keep going, you'll eventually find your way to the bottom. That's gradient descent. Let's get into it. Gradient descent is a common optimization algorithm used to train machine learning models and neural networks. These models learn from training data over time, and as they learn, they improve their accuracy.
Now, a neural network consists of connected neurons arranged in layers, and those layers have weights and biases that describe how we move through the network. We provide the neural network with labeled training data to determine what those weights and biases should be set to. So, for example, I could input a shape, say a handwritten squiggle, and the neural network could learn that this squiggle as input represents the output: the number three. After we train the neural network, we can provide it with more labeled data, like another squiggle, and see if it also correctly resolves that squiggle to the number six. If it gets some of these squiggles wrong, the weights and biases can be adjusted, and then we try again.
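To make the idea of weights and biases concrete, here is a minimal sketch of a forward pass through a tiny two-layer network in Python with NumPy. The layer sizes, the random weights, and the input values are made-up choices for illustration, not anything from the video.

```python
import numpy as np

def relu(x):
    # Simple non-linearity applied between layers
    return np.maximum(0, x)

# Made-up weights and biases for a tiny network: 4 inputs -> 3 hidden -> 2 outputs
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)

x = np.array([0.5, -1.2, 0.3, 0.8])   # one input example (e.g. pixel features)

hidden = relu(W1 @ x + b1)            # weights and biases decide how the signal flows
output = W2 @ hidden + b2             # scores for each possible label
print(output)
```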
Now, how can gradient descent help us here? Well, gradient descent is used to find the minimum of something called a cost function. So what is a cost function? It's a function that tells us how far off our predictions are from the actual values. The idea is that we want to minimize this cost function to get the best predictions. To do this, we take small steps in the direction that reduces the cost function the most.
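As a concrete example, here is a sketch of one common cost function, mean squared error, in Python. The prediction and actual values are made-up numbers; the video doesn't commit to any particular cost function.

```python
import numpy as np

def mean_squared_error(predictions, actuals):
    # Average squared distance between what the model predicted and the true values
    return np.mean((predictions - actuals) ** 2)

# Made-up example values
predictions = np.array([3.0, 6.0, 2.5])
actuals     = np.array([3.0, 5.0, 4.0])
print(mean_squared_error(predictions, actuals))  # larger value = worse predictions
```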
Think about this on a graph: we start at some point and keep going downhill, reducing our cost function as we go. The size of the steps that we take from one point to the next is called the learning rate.
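Here is a minimal sketch of that downhill walk in Python, using a simple one-variable cost function so the steps are easy to see. The cost function, the starting point, and the learning rate of 0.1 are all made-up choices for illustration.

```python
def cost(w):
    # A simple convex cost function with its minimum at w = 3
    return (w - 3) ** 2

def gradient(w):
    # Derivative of the cost function: tells us which direction is downhill
    return 2 * (w - 3)

w = 10.0             # arbitrary starting point on the "mountain"
learning_rate = 0.1  # size of each step we take downhill

for step in range(25):
    w -= learning_rate * gradient(w)  # step in the direction that reduces cost the most
    print(f"step {step:2d}: w = {w:.4f}, cost = {cost(w):.4f}")
```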
Let's think about another example: a neural network that, instead of dealing with squiggles, predicts how much a house will sell for. First we train the network on a labeled dataset. Let's say that data has some information like the location of the house, the size of the house, and then how much it sold for. With that, we can then test our model on new labeled data. So here's another example: we've got a house, its location, let's say zip code 27513, and its size, 3,000 square feet, and we put that into the neural network. So how much does this house sell for? Well, our neural network will make a forecast. It says: we think this sold for $300,000. And we compare that forecast to the actual sale price, which was $450,000.
Not a good guess; we have a large cost function. The weights and biases now need to be adjusted, and then the model can try again. Did it do any better over the entire labeled dataset, or did it do worse? That's what gradient descent can help us with.
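As a rough sketch of that adjust-and-try-again loop, here is one training step for a made-up linear model of house price in Python. Using only the size feature, and starting values chosen so the forecast comes out at $300,000 like in the example, is purely for illustration; a real model would also encode the location and be trained over many examples.

```python
# Made-up single-feature model: price = weight * size_in_sqft + bias
weight, bias = 50.0, 150_000.0   # arbitrary starting values
learning_rate = 1e-8             # tiny, because squared-error gradients are huge at this scale

size_sqft    = 3_000.0
actual_price = 450_000.0

prediction = weight * size_sqft + bias   # comes out to $300,000 with the values above
cost = (prediction - actual_price) ** 2  # large cost = bad guess

# Gradients of the squared error with respect to the weight and bias
grad_weight = 2 * (prediction - actual_price) * size_sqft
grad_bias   = 2 * (prediction - actual_price)

# Adjust the weight and bias a small step downhill, then the model can try again
weight -= learning_rate * grad_weight
bias   -= learning_rate * grad_bias
print(prediction, cost, weight, bias)
```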
Now, there are three types of gradient descent learning algorithms, so let's take a look at some of those. First of all, we've got batch gradient descent. This sums the error for each point in the training set, updating the model only after all the training examples have been evaluated, hence the term batch. In terms of computational efficiency, batch gradient descent rates highly, because we are doing things in one big batch. But what about processing time? Well, we can end up with long processing times using batch gradient descent, because with large training datasets it needs to hold all of that data in memory and process it.
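Here is a minimal sketch of batch gradient descent for a simple linear model in Python with NumPy; the dataset, model, and learning rate are all made-up for illustration. The key point is that the gradient is accumulated over every training example before a single update is made.

```python
import numpy as np

# Made-up training set: one feature per example, with a roughly linear target
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

weight, bias = 0.0, 0.0
learning_rate = 0.05

for epoch in range(500):
    predictions = weight * X + bias
    errors = predictions - y
    # Gradients averaged over the WHOLE training set before any update
    grad_weight = 2 * np.mean(errors * X)
    grad_bias   = 2 * np.mean(errors)
    # One update per pass over all the data, hence "batch"
    weight -= learning_rate * grad_weight
    bias   -= learning_rate * grad_bias

print(weight, bias)  # weight ends up close to 2, bias close to 0
```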
So that's batch. Another option is stochastic gradient descent, which evaluates each training example one at a time instead of in a batch. Since you only need to hold one training example in memory at once, it's easy to store, and you get individual updates much faster. So in terms of speed, that's fast.
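And a sketch of stochastic gradient descent on the same kind of made-up linear-model setup: the difference from the batch version is simply that the weight and bias are updated immediately after every single example.

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

weight, bias = 0.0, 0.0
learning_rate = 0.01
rng = np.random.default_rng(0)

for epoch in range(100):
    for i in rng.permutation(len(X)):   # visit examples in a random (stochastic) order
        error = (weight * X[i] + bias) - y[i]
        # Update right after evaluating ONE training example
        weight -= learning_rate * 2 * error * X[i]
        bias   -= learning_rate * 2 * error

print(weight, bias)
```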
But in terms of computational efficiency, stochastic gradient descent rates lower. Now, there is a happy medium, and that is called mini-batch gradient descent. Mini-batch gradient descent splits the training dataset into small batches and performs an update on each of those batches. That's a nice balance of computational efficiency and speed.
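Finally, a sketch of the mini-batch variant on the same kind of made-up setup; the batch size of 2 is an arbitrary choice for illustration. Each update uses a small slice of the data rather than one example or the whole set.

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

weight, bias = 0.0, 0.0
learning_rate = 0.01
batch_size = 2
rng = np.random.default_rng(0)

for epoch in range(200):
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]        # a small batch of examples
        errors = (weight * X[idx] + bias) - y[idx]
        # One update per mini-batch: averaged over the batch, not the whole dataset
        weight -= learning_rate * 2 * np.mean(errors * X[idx])
        bias   -= learning_rate * 2 * np.mean(errors)

print(weight, bias)
```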
Now, gradient descent does come with its own challenges. For example, it can struggle to find the global minimum in non-convex problems. Our earlier example was a nice convex problem with a clearly defined bottom: once the slope of the cost function is close to zero, or at zero, the model stops learning. But if we don't have a convex shape, we might have something like a saddle point instead, and that can mislead gradient descent into thinking it's at the bottom before it really is, even though the cost keeps going down further along. It's called a saddle shape because it kind of looks like a horse saddle. Another challenge is that in deeper neural networks, gradient descent can suffer from vanishing gradients or exploding gradients. Vanishing gradients are when the gradient is too small, so the earlier layers in the network learn more slowly than the later layers. Exploding gradients, on the other hand, are when the gradient is too large, and that can create an unstable model.
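To get a feel for vanishing and exploding gradients, here is a toy illustration in Python: the gradient that reaches an early layer is roughly a product of one derivative factor per layer between it and the output, so factors consistently below 1 shrink it toward zero and factors consistently above 1 blow it up. The per-layer factors of 0.5 and 1.5 and the depth of 30 layers are arbitrary numbers chosen only to show the effect.

```python
# Toy illustration: multiply one derivative factor per layer, as backpropagation does
num_layers = 30

vanishing = 1.0
exploding = 1.0
for _ in range(num_layers):
    vanishing *= 0.5   # each layer contributes a factor below 1
    exploding *= 1.5   # each layer contributes a factor above 1

print(f"factor 0.5 per layer over {num_layers} layers: {vanishing:.2e}")  # ~9.3e-10, early layers barely learn
print(f"factor 1.5 per layer over {num_layers} layers: {exploding:.2e}")  # ~1.9e+05, updates become unstable
```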
But look, despite those challenges, gradient descent is a powerful optimization algorithm, and it is commonly used to train machine learning models and neural networks today. It's a clever way to get you back down that mountain safely. If you have any questions, please drop us a line below, and if you want to see more videos like this in the future, please like and subscribe. Thanks for watching.