Deep learning optimizer

1. Stochastic gradient descent

Deep learning optimizer summary

Reference: partial code from my own decision tree and random forest implementations
1. The three most common variants of gradient descent are BGD (batch), SGD (stochastic), and MBGD (mini-batch). The difference between them is how much data is used to compute the gradient of the objective function at each step; a minimal NumPy sketch after the TensorFlow example below illustrates the distinction.

#coding:utf-8
#Loss function: loss = (w+1)^2, with w initialized to the constant 5. Back propagation finds the optimal w, i.e. the value of w that minimizes the loss
import tensorflow as tf
#Define the parameter w to be optimized and initialize it to 5
w = tf.Variable(tf.constant(5, dtype=tf.float32))
#Define the loss function; tf.square() squares every element of its input
loss = tf.square(w+1)
#Define the back propagation method: gradient descent with learning rate 0.2
train_step = tf.train.GradientDescentOptimizer(0.2).minimize(loss)
#Create a session and train for 40 rounds
with tf.Session() as sess:
    init_op = tf.global_variables_initializer()
    sess.run(init_op)  #initialize variables
    for i in range(40):  #40 rounds of training
        sess.run(train_step)  #one training step
        w_val = sess.run(w)  #current weight
        loss_val = sess.run(loss)  #current loss
        print("After %s steps: w is %f, loss is %f." % (i, w_val, loss_val))
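To make the BGD/SGD/MBGD distinction from point 1 concrete, here is a minimal NumPy sketch (not part of the original post; gradient_step and its parameters are illustrative names). Passing batch_size=None uses the whole data set (BGD), batch_size=1 gives SGD, and anything in between is MBGD:

import numpy as np

def gradient_step(X, y, w, lr=0.01, batch_size=None):
    """One update of a linear-regression MSE loss; batch_size selects the GD variant."""
    n = X.shape[0]
    if batch_size is None or batch_size >= n:      # BGD: use every sample
        idx = np.arange(n)
    else:                                          # SGD (batch_size=1) or MBGD
        idx = np.random.choice(n, batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    grad = Xb.T @ (Xb @ w - yb) / len(idx)         # gradient of 0.5 * mean squared error
    return w - lr * grad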

2. Code analysis of gradient descent with momentum

import numpy as np

def BatchGradientDescentM(x, y, step=0.001, iter_count=500, beta=0.9):
    length, features = x.shape  #number of samples and number of features
    # Initialize the parameters and the momentum; append a bias column of ones to x
    data = np.column_stack((x, np.ones((length, 1))))
    w = np.zeros((features + 1, 1))  #initialize weights
    v = np.zeros((features + 1, 1))  #initialize momentum

    # Start iterating
    for i in range(iter_count):
        # Gradient over the full batch
        grad = np.sum((np.dot(data, w) - y) * data, axis=0).reshape((features + 1, 1)) / length
        # Update the momentum as an exponential moving average of gradients
        v = beta * v + (1 - beta) * grad
        # Update the parameters
        w -= step * v
    return w


def Momentum(x, y, step=0.01, iter_count=1000, batch_size=4, beta=0.9):
    length, features = x.shape
    # Initialize the parameters and the momentum; append a bias column of ones to x
    data = np.column_stack((x, np.ones((length, 1))))
    w = np.zeros((features + 1, 1))
    v = np.zeros((features + 1, 1))
    start, end = 0, batch_size

    # Start iterating
    for i in range(iter_count):
        batch = data[start:end]
        # Gradient over the current mini-batch
        grad = np.sum((np.dot(batch, w) - y[start:end]) * batch, axis=0).reshape((features + 1, 1)) / len(batch)
        v = beta * v + (1 - beta) * grad
        w -= step * v
        # Slide the mini-batch window, wrapping around the data set
        start = (start + batch_size) % length
        end = start + batch_size
    return w

## Nesterov momentum (NAG)
def Nesterov(x, y, step=0.01, iter_count=1000, batch_size=4, beta=0.9):
    length, features = x.shape
    data = np.column_stack((x, np.ones((length, 1))))
    w = np.zeros((features + 1, 1))
    v = np.zeros((features + 1, 1))
    start, end = 0, batch_size
    for i in range(iter_count):
        # Look ahead: provisionally update the parameters first
        w_temp = w - step * v
        batch = data[start:end]
        # Recompute the gradient at the look-ahead position and update the velocity
        grad = np.sum((np.dot(batch, w_temp) - y[start:end]) * batch, axis=0).reshape((features + 1, 1)) / len(batch)
        v = beta * v + (1 - beta) * grad
        w -= step * v
        start = (start + batch_size) % length
        end = start + batch_size
    return w

if __name__=="__main__":
    # Example data (not in the original post): a simple linear relationship
    x = np.random.rand(100, 3)
    y = np.dot(x, np.array([[2.0], [1.0], [3.0]])) + 0.5

    # Batch gradient descent
    print(Momentum(x, y, batch_size=(x.shape[0] - 1)))

    # Mini-batch gradient descent
    print(Momentum(x, y, batch_size=5))

    # Stochastic gradient descent
    print(Momentum(x, y, batch_size=1))


3. SGD + Momentum


v_t represents the momentum accumulated over all previous steps. During the gradient update, dimensions whose current gradient points in the same direction as the accumulated momentum have their momentum term increased, while dimensions whose gradient points in the opposite direction have it reduced.

dx = compute_gradient(x)
v_t = rho * old_v - learning_rate * dx  # rho is the momentum coefficient
x += v_t
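Written out as equations (a standard statement of the SGD-with-momentum update, matching the pseudocode above; it is given here for reference since the post's original formula images are missing):

$$v_t = \rho\, v_{t-1} - \eta\, \nabla_\theta J(\theta_t), \qquad \theta_{t+1} = \theta_t + v_t$$

where ρ is the momentum coefficient (typically 0.9) and η is the learning rate.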

4. Nesterov Accelerated Gradient (NAG)



The momentum term γv_{t-1} is used to update the parameter θ. Computing (θ - γv_{t-1}) gives an approximation of the parameter's future position, and the gradient is evaluated at that look-ahead position instead of the current one. The process can be pictured as a ball rolling downhill: plain momentum blindly follows the current slope, which is not always satisfactory, while NAG behaves like a smarter ball that anticipates where it is heading and slows down before the slope rises again. NAG makes RNNs perform better on many tasks.
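For reference (the original post showed this as an image), the standard NAG update is:

$$v_t = \gamma\, v_{t-1} + \eta\, \nabla_\theta J(\theta - \gamma v_{t-1}), \qquad \theta = \theta - v_t$$

where γ plays the same role as the momentum coefficient above.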

5. Adagrad (Adaptive Gradient Algorithm) and Adadelta

Adagrad adapts the learning rate to each parameter: it makes larger updates for infrequently updated (low-frequency) parameters and smaller updates for frequent ones. It therefore performs well on sparse data and improves the robustness of SGD; for example, it has been used to recognize cats in YouTube videos and to train GloVe word embeddings, tasks in which infrequent features need relatively large updates.
In effect, Adagrad imposes a constraint on the learning rate. Namely:
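The formula image is missing here; the standard Adagrad update, included for reference, divides the global learning rate by the root of the accumulated squared gradients:

$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}}\, g_{t,i}$$

where g_{t,i} is the current gradient for parameter i and G_t accumulates the squares of all past gradients, so frequently updated parameters receive smaller effective learning rates.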


6. Adam (Adaptive Moment Estimation)






Suggested hyperparameter values:

β1 = 0.9, β2 = 0.999, ε = 1e-8
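The update-rule images are missing above; for reference, the standard Adam update that these defaults apply to is:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$$

$$\hat m_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat v_t = \frac{v_t}{1-\beta_2^t}, \qquad \theta_{t+1} = \theta_t - \frac{\eta\, \hat m_t}{\sqrt{\hat v_t} + \epsilon}$$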

In practice, Adam often performs better than the other adaptive learning-rate methods.

7. How to choose an optimization algorithm

If the data is sparse, use one of the adaptive methods: Adagrad, Adadelta, RMSprop, or Adam.

RMSprop, Adadelta, and Adam behave similarly in many cases.

Adam adds bias correction and momentum to RMSprop.

As gradients become sparse, Adam outperforms RMSprop.

Overall, Adam is usually the best choice.

Many papers use plain SGD without momentum. Although SGD can reach a minimum, it takes longer than the other algorithms and may get trapped at saddle points.

If faster convergence is needed, or when training deeper and more complex neural networks, an adaptive algorithm should be used.
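As a concrete illustration (a sketch reusing the toy TF 1.x problem from section 1, not taken from the original post), switching from plain gradient descent to an adaptive optimizer is essentially a one-line change:

import tensorflow as tf

w = tf.Variable(tf.constant(5, dtype=tf.float32))
loss = tf.square(w + 1)
# Only this line differs from the plain-GD example in section 1:
train_step = tf.train.AdamOptimizer(learning_rate=0.2).minimize(loss)
# train_step = tf.train.RMSPropOptimizer(learning_rate=0.2).minimize(loss)  # RMSprop variant

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(40):
        sess.run(train_step)
    print(sess.run(w))  # should approach -1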

8. Second-order algorithms

1. Hessian matrix and Newton's method

1. Convex function


Hessian matrix and Newton's method
The matrix of second derivatives of a function is the Hessian matrix, which is widely used in Newton-type optimization methods. Newton's method is an iterative solution method that uses first- and second-order derivative information, and it is mainly used in two ways: 1. finding the roots of an equation; 2. optimization.
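In equation form (standard textbook statements, added for reference): for root finding, Newton's iteration is

$$x_{t+1} = x_t - \frac{f(x_t)}{f'(x_t)},$$

and for optimization the step uses the inverse Hessian:

$$\theta_{t+1} = \theta_t - H_f(\theta_t)^{-1}\, \nabla f(\theta_t).$$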


To address these problems, many improved methods have been proposed for situations where Newton's method cannot be applied effectively. Quasi-Newton methods, for example, can be viewed as approximations of Newton's method: they need only first derivatives and do not require computing the Hessian matrix or its inverse, so they can converge faster in practice. Quasi-Newton methods are not expanded on here; there are more advanced algorithms such as DFP, BFGS, and L-BFGS that you can look up and study on your own. Broadly speaking, quasi-Newton methods address the expensive computation, difficult convergence, and local-minimum issues of Newton's method.
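In practice quasi-Newton methods are rarely written by hand; here is a minimal sketch (assuming SciPy is installed) that applies its built-in L-BFGS implementation to the toy loss (w + 1)^2 from section 1:

import numpy as np
from scipy.optimize import minimize

# Toy loss from section 1: f(w) = (w + 1)^2, minimized at w = -1
f = lambda w: (w[0] + 1.0) ** 2
grad = lambda w: np.array([2.0 * (w[0] + 1.0)])

res = minimize(f, x0=np.array([5.0]), jac=grad, method='L-BFGS-B')
print(res.x)  # close to [-1.]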

Second Order Optimization Made Practical

1. Summary

To narrow the gap between theoretical and practical optimization results, the researchers say, the paper presents a proof of concept for second-order optimization and shows, through a series of important algorithmic and numerical techniques, that it can deliver large improvements on real deep models. Specifically, when training deep models, the second-order optimizer Shampoo can efficiently exploit heterogeneous hardware architectures composed of multi-core CPUs and multiple accelerator units. It has achieved very strong performance in large-scale machine translation, image recognition, and other areas, outperforming existing state-of-the-art first-order descent methods.
