# Deep learning optimizer

## 1. Stochastic gradient descent method


Reference: parts of the code come from the author's own decision tree and random forest implementations.
The three most common variants of gradient descent are batch gradient descent (BGD), stochastic gradient descent (SGD), and mini-batch gradient descent (MBGD). They differ in how much data is used to compute the gradient of the objective function.
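
The difference between the three variants can be sketched with a single training loop in which only the batch size changes. This is a minimal illustration in plain Python, not the article's own code: the toy data set, the `train` and `gradient` helpers, and the learning rate are all assumptions chosen for the example.

```python
import random

def gradient(w, b, batch):
    """Mean gradient of 0.5 * (w*x + b - y)^2 over the samples in `batch`."""
    n = len(batch)
    gw = sum((w * x + b - y) * x for x, y in batch) / n
    gb = sum((w * x + b - y) for x, y in batch) / n
    return gw, gb

def train(data, batch_size, lr=0.1, epochs=500, seed=0):
    """Gradient descent where `batch_size` selects the variant:
    len(data) -> BGD, 1 -> SGD, anything in between -> MBGD."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit the samples in a new order each epoch
        for start in range(0, len(data), batch_size):
            gw, gb = gradient(w, b, data[start:start + batch_size])
            w -= lr * gw
            b -= lr * gb
    return w, b

# Toy data on the line y = 2x + 1 (illustrative values, not from the article)
data = [(k / 10.0, 2.0 * k / 10.0 + 1.0) for k in range(-10, 11)]

bgd = train(data, batch_size=len(data))  # one update per epoch
sgd = train(data, batch_size=1)          # one update per sample
mbgd = train(data, batch_size=4)         # one update per mini-batch
print(bgd, sgd, mbgd)
```

All three runs should recover a slope close to 2 and an intercept close to 1; they differ in how many parameter updates they make per pass over the data and in how noisy each update is.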

    # coding: utf-8
    # Let the loss be loss = (w + 1)^2, with w initialized to the constant 5.
    # Backpropagation finds the optimal w, i.e. the w that minimizes the loss.
    # Written against the TensorFlow 1.x graph API; the compat.v1 import
    # lets the same code run under TensorFlow 2.x.
    import tensorflow.compat.v1 as tf
    tf.disable_v2_behavior()

    # Define the parameter w to be optimized, with initial value 5
    w = tf.Variable(tf.constant(5, dtype=tf.float32))
    # Define the loss function; tf.square() squares every element of its input
    loss = tf.square(w + 1)
    # Define the backpropagation method.
    # NOTE: this line is missing from the original listing; plain gradient
    # descent with a learning rate of 0.2 is an assumption.
    train_step = tf.train.GradientDescentOptimizer(0.2).minimize(loss)
    # Create a session and train for 40 rounds
    with tf.Session() as sess:
        init_op = tf.global_variables_initializer()
        sess.run(init_op)  # initialize the variables
        for i in range(40):  # 40 rounds of training
            sess.run(train_step)       # one gradient descent step
            w_val = sess.run(w)        # current weight
            loss_val = sess.run(loss)  # current loss
            print("After %s steps: w is %f,   loss is %f." % (i, w_val, loss_val))
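
The same descent on loss = (w + 1)^2 can be traced by hand without TensorFlow, which makes the update rule explicit. This is a sketch under one assumption: the learning rate of 0.2 is chosen for illustration, since the listing does not state one. The helper name `minimize_square` is hypothetical.

```python
def minimize_square(w=5.0, lr=0.2, steps=40):
    """Gradient descent on loss = (w + 1)^2, starting from w = 5."""
    history = []
    for i in range(steps):
        grad = 2.0 * (w + 1.0)  # d/dw of (w + 1)^2
        w -= lr * grad          # move against the gradient
        history.append((i, w, (w + 1.0) ** 2))
    return history

history = minimize_square()
for i, w_val, loss_val in history[:3]:
    print("After %d steps: w is %f, loss is %f." % (i, w_val, loss_val))
# First line printed: After 0 steps: w is 2.600000, loss is 12.960000.
```

With this learning rate the distance from the optimum w = -1 shrinks by a factor of 0.6 per step, so after 40 steps w is essentially -1.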


## 2. Code analysis of gradient descent method with momentum

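    import numpy as np

    def BatchGradientDescentM(x, y, step=0.001, iter_count=500, beta=0.9):
        """Full-batch gradient descent with momentum; y must have shape (length, 1)."""
        length, features = x.shape  # number of samples and number of features
        # Append a column of ones to x so that the last weight acts as the bias
        data = np.column_stack((x, np.ones((length, 1))))
        w = np.zeros((features + 1, 1))  # initialize the weights
        v = np.zeros((features + 1, 1))  # initialize the momentum

        # Start iterating
        for i in range(iter_count):
            # Mean gradient of the squared error over the whole data set
            grad = np.sum((np.dot(data, w) - y) * data, axis=0).reshape((features + 1, 1)) / length
            # Momentum is an exponential moving average of the gradient.
            # Only the gradient is divided by length; dividing the whole
            # expression would also shrink the accumulated momentum.
            v = beta * v + (1 - beta) * grad
            # Update the parameters
            w -= step * v
        return w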

    def Momentum(x, y, step=0.01, iter_count=1000, batch_size=4, beta=0.9):
        """Mini-batch gradient descent with momentum; y must have shape (length, 1)."""
        length, features = x.shape
        # Append a column of ones to x so that the last weight acts as the bias
        data = np.column_stack((x, np.ones((length, 1))))
        w = np.zeros((features + 1, 1))
        v = np.zeros((features + 1, 1))
        start, end = 0, min(batch_size, length)

        # Start iterating
        for i in range(iter_count):
            xb, yb = data[start:end], y[start:end]
            # Mean gradient over the current mini-batch
            grad = np.sum((np.dot(xb, w) - yb) * xb, axis=0).reshape((features + 1, 1)) / (end - start)
            v = beta * v + (1 - beta) * grad
            w -= step * v
            # Move to the next mini-batch, wrapping around at the end of the data.
            # Deriving end from start keeps the slice non-empty after a wrap.
            start = (start + batch_size) % length  # % is the remainder operator
            end = min(start + batch_size, length)
        return w

    # Nesterov momentum
    def Nesterov(x, y, step=0.01, iter_count=1000, batch_size=4, beta=0.9):
        """Mini-batch gradient descent with Nesterov momentum; y must have shape (length, 1)."""
        length, features = x.shape
        data = np.column_stack((x, np.ones((length, 1))))
        w = np.zeros((features + 1, 1))
        v = np.zeros((features + 1, 1))
        start, end = 0, min(batch_size, length)
        for i in range(iter_count):
            xb, yb = data[start:end], y[start:end]
            # Take a provisional step along the current momentum (look-ahead position)
            w_temp = w - step * v
            # Evaluate the mini-batch gradient at the look-ahead position
            grad = np.sum((np.dot(xb, w_temp) - yb) * xb, axis=0).reshape((features + 1, 1)) / (end - start)
            v = beta * v + (1 - beta) * grad
            w -= step * v
            # Move to the next mini-batch, wrapping around at the end of the data
            start = (start + batch_size) % length
            end = min(start + batch_size, length)
        return w

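    if __name__ == "__main__":
        # Illustrative data only: the original listing does not define x and y.
        # 20 samples of y = 3*x + 2, with y shaped (length, 1).
        x = np.linspace(0, 1, 20).reshape(-1, 1)
        y = 3 * x + 2
        print(Momentum(x, y, batch_size=x.shape[0] - 1))
        print(Momentum(x, y, batch_size=5))
        print(Momentum(x, y, batch_size=1))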




## 3. SGD + Momentum

v_t is the momentum accumulated over all previous steps. During the gradient updates, the momentum term grows in dimensions where successive gradients point in the same direction, and shrinks in dimensions where the gradient direction keeps changing.

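    dx = computes_gradient(x)    # gradient of the loss at the current x
    v_t = rho * old_v - a * dx   # rho: momentum coefficient, a: learning rate
    x += v_t
    old_v = v_t                  # carry the velocity over to the next step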
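
The effect described above can be checked numerically by feeding the update rule a fixed sequence of gradients. This is a small sketch rather than the article's code: the helper `run_momentum` and the values rho = 0.9 and a = 0.1 are assumptions made for the example.

```python
def run_momentum(gradients, rho=0.9, a=0.1):
    """Apply v_t = rho * v_{t-1} - a * dx for a fixed sequence of gradients dx."""
    v = 0.0
    velocities = []
    for dx in gradients:
        v = rho * v - a * dx
        velocities.append(v)
    return velocities

same_direction = run_momentum([1.0] * 20)     # gradient always points the same way
alternating = run_momentum([1.0, -1.0] * 10)  # gradient flips sign on every step

print(abs(same_direction[-1]))  # velocity has built up well beyond a single step a
print(abs(alternating[-1]))     # velocity stays below a single step a
```

When the gradient keeps its direction, the velocity approaches a / (1 - rho), ten times the plain step here. When the gradient alternates, successive contributions largely cancel.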



## 4. Nesterov Accelerated Gradient (NAG)

NAG uses the momentum term γv_{t-1} to move the parameter θ. Computing θ - γv_{t-1} gives an approximation of the parameter's next position, and the gradient is evaluated at that approximate future position instead of the current one. Picture a ball rolling downhill: a ball that blindly follows the slope does not necessarily end up where we want. We would like a smarter ball that anticipates where it is heading, so that it slows down before the slope rises again. NAG improves the performance of RNNs on many tasks.
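
The look-ahead step can be written out on a one-dimensional problem. The sketch below assumes the common formulation v_t = γv_{t-1} + η∇J(θ - γv_{t-1}), θ = θ - v_t; the loss f(θ) = θ², the learning rate `lr` (η), and the function names are illustrative choices, not taken from the article.

```python
def grad(theta):
    """Gradient of the illustrative loss f(theta) = theta^2."""
    return 2.0 * theta

def nag(theta=5.0, gamma=0.9, lr=0.1, steps=100):
    """Nesterov accelerated gradient on a 1-D problem."""
    v = 0.0
    for _ in range(steps):
        lookahead = theta - gamma * v         # approximate future position
        v = gamma * v + lr * grad(lookahead)  # gradient taken at the look-ahead point
        theta -= v
    return theta

theta_final = nag()
print(theta_final)
```

Because the gradient is taken where the momentum is about to carry the parameter, the update is corrected before an overshoot happens rather than after it.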

This algorithm applies larger updates to infrequently updated (low-frequency) parameters and smaller updates to frequently updated (high-frequency) ones. It therefore performs well on sparse data and improves the robustness of SGD. For example, it has been used to recognize cats in YouTube videos and to train GloVe word embeddings, both of which depend on updating low-frequency features.

Recommended hyperparameter values:

β1 = 0.9, β2 = 0.999, ϵ = 10^-8
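
The values β1 = 0.9, β2 = 0.999 and ϵ = 10^-8 are the defaults commonly quoted for Adam, so the sketch below assumes that these settings refer to Adam and shows where each one enters the update. The test function, the learning rate of 0.1, and the helper name `adam_minimize` are illustrative assumptions.

```python
import math

def adam_minimize(grad, theta, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Minimal scalar Adam loop (a sketch, not a library implementation)."""
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g      # first moment: moving average of g
        v = beta2 * v + (1 - beta2) * g * g  # second moment: moving average of g^2
        m_hat = m / (1 - beta1 ** t)         # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)  # eps avoids division by zero
    return theta

# Minimize (w + 1)^2, whose gradient is 2 * (w + 1), starting from w = 5
w_adam = adam_minimize(lambda w: 2.0 * (w + 1.0), theta=5.0)
print(w_adam)
```

β1 controls how quickly the averaged gradient reacts, β2 controls how long large past gradients keep the step size small, and ϵ only matters when the second-moment estimate is close to zero.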

## 7. How to choose optimization algorithm

Overall, Adam is the best choice.

Many papers use plain SGD without momentum. Although SGD can reach a minimum, it takes longer than other algorithms and may get stuck at saddle points.

If we need faster convergence, or are training deeper and more complex neural networks, we should use an adaptive algorithm.

## 8. Second-order algorithms

### 1. Hessian matrix and Newton's method

#### 1. Hessian matrix of a convex function and Newton's method
The second derivatives of a multivariate function form its Hessian matrix, which is used in Newton's optimization method. Newton's method is an iterative method that comes in a first-order form, which uses the first derivative, and a second-order form, which also uses the second derivative. It has two main uses:

1. Finding the roots of an equation
2. Optimization

For cases where Newton's method cannot be applied effectively, many improved methods have been proposed. For example, quasi-Newton methods can be regarded as approximations of Newton's method. A quasi-Newton method needs only first derivatives; it does not have to compute the Hessian matrix or its inverse, so each iteration is much cheaper. Quasi-Newton methods are not covered in detail here; more advanced algorithms include DFP, BFGS, and L-BFGS, which you can look up on your own. In general, quasi-Newton methods address the main problems of Newton's method: expensive computation, difficult convergence, and convergence to local minima.
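
The two uses of Newton's method can be sketched in one dimension, where the Hessian reduces to the ordinary second derivative. The function names and the test problems below are illustrative assumptions. In n dimensions, the division by the second derivative becomes a linear solve with the Hessian matrix.

```python
def newton_root(f, df, x, steps=20):
    """Root finding: repeat x <- x - f(x) / f'(x)."""
    for _ in range(steps):
        x -= f(x) / df(x)
    return x

def newton_minimize(df, d2f, x, steps=20):
    """Optimization: a minimum of f is a root of f', so x <- x - f'(x) / f''(x)."""
    return newton_root(df, d2f, x, steps)

# Use 1: root of x^2 - 2, i.e. sqrt(2), starting from x = 1
root = newton_root(lambda x: x * x - 2.0, lambda x: 2.0 * x, 1.0)

# Use 2: minimum of the convex function (w + 1)^2, starting from w = 5.
# For a quadratic, a single Newton step lands exactly on the minimum.
w_min = newton_minimize(lambda w: 2.0 * (w + 1.0), lambda w: 2.0, 5.0, steps=1)

print(root, w_min)
```

The optimization form needs the second derivative at every step, which is the cost that quasi-Newton methods avoid by approximating it from first derivatives.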

## Second Order Optimization Made Practical

### 1. Summary

To narrow the gap between theoretical and practical optimization results, the researchers present a proof of concept for second-order optimization and show, through a series of important algorithmic and numerical improvements, that it can bring large gains on real deep models. Specifically, when training deep models, the second-order optimizer Shampoo makes efficient use of heterogeneous hardware consisting of multi-core CPUs and multiple accelerator units. In large-scale machine translation, image recognition, and other areas, it achieves strong performance, better than existing state-of-the-art first-order descent methods.

Posted on Thu, 18 Jun 2020 04:47:21 -0400 by mbaroz