The gradient descent method is an important search strategy in machine learning. In this chapter, we explain the basic principle of gradient descent in detail and improve the algorithm step by step, so that we can understand the meaning of its various parameters, especially the learning rate.
We will also cover two variants, the stochastic gradient descent method and the mini-batch gradient descent method, to gain a comprehensive understanding of the gradient descent family.

## 6-1 What is the gradient descent method

• The derivative describes how much J changes when theta changes by one unit

• The derivative also gives a direction: it points in the direction in which J increases

• Not all functions have a unique extreme point (multivariate functions may have multiple local minima)
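To make the bullet points above concrete, here is a minimal sketch of gradient descent on a one-variable quadratic loss (the function, starting point, and constants are made up for illustration):

```python
def J(theta):
    # A simple convex loss with its minimum at theta = 2.5
    return (theta - 2.5) ** 2 - 1

def dJ(theta):
    # Derivative of J: points in the direction in which J increases
    return 2 * (theta - 2.5)

theta = 0.0      # starting point
eta = 0.1        # learning rate
epsilon = 1e-8   # stop when J barely changes between steps

while True:
    last_theta = theta
    theta = theta - eta * dJ(theta)   # step opposite to the derivative
    if abs(J(theta) - J(last_theta)) < epsilon:
        break

print(theta)  # close to the minimum at 2.5
```

Each step moves theta against the derivative, so J keeps decreasing until the change falls below epsilon.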

## 6-3 Gradient descent method in linear regression

Since the magnitude of the gradient would otherwise grow with the number of samples m, which is clearly unreasonable, we divide it by m so that the gradient is not affected by the sample size.
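A small sketch of this idea (the data below is fabricated): duplicating the data set doubles m, but the m-normalized gradient stays the same.

```python
import numpy as np

np.random.seed(42)

def dJ(theta, X_b, y):
    # Gradient of the MSE loss; dividing by len(y) removes the
    # dependence on the number of samples m
    return X_b.T.dot(X_b.dot(theta) - y) * 2.0 / len(y)

# A small data set, and the same data duplicated to double m
X_b = np.hstack([np.ones((100, 1)), np.random.random((100, 1))])
y = X_b.dot(np.array([3.0, 4.0]))
theta = np.zeros(2)

g_small = dJ(theta, X_b, y)
g_large = dJ(theta, np.vstack([X_b, X_b]), np.hstack([y, y]))
print(np.allclose(g_small, g_large))  # True: gradient unaffected by m
```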

## 6-6 Stochastic gradient descent method

• Idea borrowed from the simulated annealing algorithm: it imitates the annealing of solid matter in physics, exploiting the similarity between the annealing process and general optimization problems.
Starting from an initial temperature that keeps decreasing, and combined with probabilistic jumps, the search can randomly find the global optimum in the solution space
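In stochastic gradient descent this annealing idea shows up as a decaying learning rate: instead of a fixed eta, a schedule such as eta = t0 / (t + t1) is commonly used (the values t0 = 5 and t1 = 50 below are illustrative defaults, not prescribed by the text):

```python
def learning_rate(t, t0=5.0, t1=50.0):
    # Annealing-style schedule: eta shrinks as the iteration count t
    # grows, like the falling temperature in simulated annealing
    return t0 / (t + t1)

print(learning_rate(0))    # 0.1   -- large steps early on
print(learning_rate(950))  # 0.005 -- tiny steps near the end
```

The t1 term keeps the very first steps from being too large, while t0 scales the whole schedule.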

## 6-8 How to verify that the gradient is computed correctly? Debugging the gradient descent method

```python
import numpy as np

np.random.seed(666)
X = np.random.random(size=(1000, 10))

true_theta = np.arange(1, 12, dtype=float)  # 11 parameters: intercept + 10 features

X_b = np.hstack([np.ones((len(X), 1)), X])
y = X_b.dot(true_theta) + np.random.normal(size=1000)

print(X.shape)      # (1000, 10)
print(y.shape)      # (1000,)
print(true_theta)

def J(theta, X_b, y):
    """MSE loss function."""
    try:
        return np.sum((y - X_b.dot(theta)) ** 2) / len(X_b)
    except:
        return float("inf")

def dJ_math(theta, X_b, y):
    """Gradient from the closed-form mathematical formula."""
    return X_b.T.dot(X_b.dot(theta) - y) * 2.0 / len(y)

def dJ_debug(theta, X_b, y, epsilon=0.01):
    """Gradient approximated numerically, for debugging."""
    res = np.empty(len(theta))
    for i in range(len(theta)):
        theta_1 = theta.copy()
        theta_1[i] += epsilon
        theta_2 = theta.copy()
        theta_2[i] -= epsilon
        res[i] = (J(theta_1, X_b, y) - J(theta_2, X_b, y)) / (2 * epsilon)
    return res

def gradient_descent(dJ, X_b, y, initial_theta, eta, n_iters=1e4, epsilon=1e-8):
    theta = initial_theta
    i_iters = 0
    while i_iters < n_iters:
        gradient = dJ(theta, X_b, y)
        last_theta = theta
        theta = theta - eta * gradient
        if abs(J(theta, X_b, y) - J(last_theta, X_b, y)) < epsilon:
            break
        i_iters += 1
    return theta

X_b = np.hstack([np.ones((len(X), 1)), X])
initial_theta = np.zeros(X_b.shape[1])
eta = 0.01

theta = gradient_descent(dJ_debug, X_b, y, initial_theta, eta)
print(theta)

theta = gradient_descent(dJ_math, X_b, y, initial_theta, eta)
print(theta)
```

Tip:
dJ_debug verifies the gradient numerically, but it is slow. You can run it on a small subset of samples to get a trustworthy result, then derive the mathematical formula for the gradient and compare the two results.

dJ_debug does not depend on the specific form of the loss function J, so this numerical check works for any differentiable loss.
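A self-contained sketch of that comparison workflow on a small sample (the data and true parameters here are made up for illustration):

```python
import numpy as np

np.random.seed(666)
X_b = np.hstack([np.ones((20, 1)), np.random.random((20, 3))])  # small sample
y = X_b.dot(np.array([1.0, 2.0, 3.0, 4.0]))
theta = np.random.random(4)

def J(theta, X_b, y):
    return np.sum((y - X_b.dot(theta)) ** 2) / len(X_b)

def dJ_math(theta, X_b, y):
    # Candidate closed-form gradient to be checked
    return X_b.T.dot(X_b.dot(theta) - y) * 2.0 / len(y)

def dJ_debug(theta, X_b, y, epsilon=0.01):
    # Central-difference approximation, one coordinate at a time
    res = np.empty(len(theta))
    for i in range(len(theta)):
        t1, t2 = theta.copy(), theta.copy()
        t1[i] += epsilon
        t2[i] -= epsilon
        res[i] = (J(t1, X_b, y) - J(t2, X_b, y)) / (2 * epsilon)
    return res

print(np.allclose(dJ_math(theta, X_b, y), dJ_debug(theta, X_b, y)))  # True
```

If allclose returns False, the formula in dJ_math is wrong, since the numerical estimate is independent of it.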

## 6-9 More in-depth discussion of the gradient descent method

BGD: every step traverses the whole sample set, so the direction of steepest descent is exact each time; stable but slow.
SGD: every step looks at only one sample, so the descent direction is noisy and may even point the wrong way; fast but unstable.

MBGD: a compromise between the two extremes that uses k samples per step, where k becomes a hyperparameter.
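A minimal sketch of MBGD following these descriptions (the batch size k, the data, and the decaying learning-rate schedule are all illustrative assumptions):

```python
import numpy as np

np.random.seed(666)
X_b = np.hstack([np.ones((1000, 1)), np.random.random((1000, 2))])
y = X_b.dot(np.array([3.0, 1.0, 2.0])) + np.random.normal(0, 0.1, size=1000)

def dJ_batch(theta, X_batch, y_batch):
    # MSE gradient estimated on k samples only
    return X_batch.T.dot(X_batch.dot(theta) - y_batch) * 2.0 / len(y_batch)

def mbgd(X_b, y, k=32, n_epochs=50, t0=5.0, t1=50.0):
    theta = np.zeros(X_b.shape[1])
    m = len(y)
    t = 0
    for epoch in range(n_epochs):
        indexes = np.random.permutation(m)    # reshuffle every epoch
        for start in range(0, m, k):
            batch = indexes[start:start + k]  # k samples per step
            eta = t0 / (t + t1)               # decaying learning rate
            theta = theta - eta * dJ_batch(theta, X_b[batch], y[batch])
            t += 1
    return theta

theta = mbgd(X_b, y)
print(theta)  # approximately [3, 1, 2]
```

With k = 1 this reduces to SGD, and with k = m it reduces to BGD, which is why k interpolates between the two extremes.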

Summary: some of the related code ran only in VS Code and failed in Jupyter for the author, especially around the hstack() calls.

Posted on Mon, 04 Oct 2021 18:55:37 -0400 by Duke555