## 1. Stochastic gradient descent method

Deep learning optimizer summary

Reference: part code of own decision tree and random forest

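The three most common variants of gradient descent are batch gradient descent (BGD), stochastic gradient descent (SGD), and mini-batch gradient descent (MBGD). They differ in how much data is used to compute the gradient of the objective function for each update: the whole training set, a single sample, or a small batch of samples.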
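As a minimal sketch of that difference, the following pure-Python example fits a one-parameter model `y ≈ w * x` with the same update rule and only changes how many samples feed each gradient. The function name `gradient_descent`, the toy data, and the hyperparameters are illustrative assumptions, not part of the original post.

```python
import random


def gradient_descent(xs, ys, batch_size, lr=0.05, epochs=200, seed=0):
    """Fit y ~ w * x by minimizing the mean squared error.

    batch_size == len(xs) -> batch gradient descent (BGD)
    batch_size == 1       -> stochastic gradient descent (SGD)
    anything in between   -> mini-batch gradient descent (MBGD)
    """
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of mean((w * x - y)^2) over the current batch only.
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad
    return w


# Toy data generated from y = 3x (noise-free, purely for illustration).
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [3.0 * x for x in xs]

w_bgd = gradient_descent(xs, ys, batch_size=len(xs))
w_sgd = gradient_descent(xs, ys, batch_size=1)
w_mbgd = gradient_descent(xs, ys, batch_size=2)
print(w_bgd, w_sgd, w_mbgd)
```

On this noise-free toy problem all three variants should end up near the same weight; on real data they trade gradient noise against the cost of each update.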

```python
# coding: utf-8
# Let the loss function be loss = (w + 1)^2 and the initial value of w be the constant 5.
# Backpropagation finds the optimal w, i.e. the w that minimizes the loss.
# Note: this example uses the TensorFlow 1.x API (tf.Session, tf.train).
import tensorflow as tf

# Define the parameter w to be optimized, with initial value 5
w = tf.Variable(tf.constant(5, dtype=tf.float32))
# Define the loss function; tf.square() squares every element of its input
loss = tf.square(w + 1)
# Define the backpropagation method (gradient descent with learning rate 0.2)
train_step = tf.train.GradientDescentOptimizer(0.2).minimize(loss)

# Create a session and train for 40 rounds
with tf.Session() as sess:
    init_op = tf.global_variables_initializer()  # variable initializer
    sess.run(init_op)
    for i in range(40):
        sess.run(train_step)        # one training step
        w_val = sess.run(w)         # current weight
        loss_val = sess.run(loss)   # current loss
        print("After %s steps: w is %f, loss is %f." % (i, w_val, loss_val))
```
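For this loss the gradient is available in closed form, d(loss)/dw = 2(w + 1), so the training loop above can be traced without TensorFlow. The sketch below is an assumed plain-Python equivalent that reuses the same starting value and learning rate; it is meant only to make the update rule explicit.

```python
w = 5.0              # same initial value as the TensorFlow example
learning_rate = 0.2  # same learning rate as the TensorFlow example

for i in range(40):
    grad = 2 * (w + 1)          # derivative of (w + 1)^2 with respect to w
    w -= learning_rate * grad   # one gradient descent step
    loss = (w + 1) ** 2
    print("After %s steps: w is %f, loss is %f." % (i, w, loss))
```

Each step multiplies the distance to the minimum, w + 1, by 1 - 2 * 0.2 = 0.6, so w approaches -1 geometrically.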

## 2. Code analysis of gradient descent method with momentum

```python
import numpy as np


def BatchGradientDescentM(x, y, step=0.001, iter_count=500, beta=0.9):
    """Full-batch gradient descent with momentum for linear regression."""
    length, features = x.shape                          # number of samples and features
    y = y.reshape(-1, 1)                                # make sure y is a column vector
    data = np.column_stack((x, np.ones((length, 1))))   # append a constant column for the bias
    w = np.zeros((features + 1, 1))                     # initialize weights
    v = np.zeros((features + 1, 1))                     # initialize momentum
    for _ in range(iter_count):
        # Mean gradient of the squared error over all samples
        grad = np.dot(data.T, np.dot(data, w) - y) / length
        v = beta * v + (1 - beta) * grad                # update momentum
        w -= step * v                                   # update parameters
    return w


def Momentum(x, y, step=0.01, iter_count=1000, batch_size=4, beta=0.9):
    """Mini-batch gradient descent with momentum."""
    length, features = x.shape
    y = y.reshape(-1, 1)
    data = np.column_stack((x, np.ones((length, 1))))
    w = np.zeros((features + 1, 1))
    v = np.zeros((features + 1, 1))
    start = 0
    for _ in range(iter_count):
        end = min(start + batch_size, length)           # never slice past the last sample
        xb, yb = data[start:end], y[start:end]
        grad = np.dot(xb.T, np.dot(xb, w) - yb) / (end - start)
        v = beta * v + (1 - beta) * grad
        w -= step * v
        start = end % length                            # wrap around to the first sample
    return w


def Nesterov(x, y, step=0.01, iter_count=1000, batch_size=4, beta=0.9):
    """Mini-batch gradient descent with Nesterov momentum."""
    length, features = x.shape
    y = y.reshape(-1, 1)
    data = np.column_stack((x, np.ones((length, 1))))
    w = np.zeros((features + 1, 1))
    v = np.zeros((features + 1, 1))
    start = 0
    for _ in range(iter_count):
        end = min(start + batch_size, length)
        xb, yb = data[start:end], y[start:end]
        w_temp = w - step * beta * v                    # look ahead along the momentum first
        # Recompute the gradient at the look-ahead point, then update the velocity
        grad = np.dot(xb.T, np.dot(xb, w_temp) - yb) / (end - start)
        v = beta * v + (1 - beta) * grad
        w -= step * v
        start = end % length
    return w


if __name__ == "__main__":
    # x and y are not defined in the original listing; synthetic data is assumed here.
    rng = np.random.default_rng(0)
    x = rng.normal(size=(20, 2))
    y = x @ np.array([[2.0], [-1.0]]) + 0.5
    # Batch gradient descent
    print(Momentum(x, y, batch_size=x.shape[0]))
    # Mini-batch gradient descent
    print(Momentum(x, y, batch_size=5))
    # Stochastic gradient descent
    print(Momentum(x, y, batch_size=1))
```

The Gamma function satisfies $\Gamma(n) = (n-1)! \quad \forall n \in \mathbb{N}$ and is defined through Euler's integral:

$$\Gamma(z) = \int_0^\infty t^{z-1} e^{-t}\,dt\,.$$
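Both statements about the Gamma function can be checked numerically. The sketch below uses the standard-library `math.gamma` and a simple midpoint rule; the helper name `gamma_integral`, the truncation point, and the step count are assumptions chosen for illustration.

```python
import math

# Gamma(n) = (n - 1)! for positive integers n.
for n in range(1, 8):
    assert math.isclose(math.gamma(n), math.factorial(n - 1))


def gamma_integral(z, upper=60.0, steps=200_000):
    """Approximate Euler's integral for Gamma(z), z >= 1, with the midpoint rule.

    The infinite upper limit is truncated at `upper`, where the integrand
    is already negligible for moderate z.
    """
    h = upper / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h                      # midpoint of the k-th subinterval
        total += t ** (z - 1) * math.exp(-t) * h
    return total


approx = gamma_integral(5)
print(approx, math.gamma(5))
```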

## 3. SGD + Momentum

$v_t$ is the momentum accumulated over all previous steps. During the update, the momentum term grows along dimensions where the current gradient points in the same direction as the previous ones, and shrinks along dimensions where the gradient changes direction. This speeds up progress along consistent directions and damps oscillation.

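```python
dx = compute_gradient(x)   # gradient at the current position
v = rho * v - alpha * dx   # rho: momentum coefficient, alpha: learning rate
x += v                     # move along the accumulated velocity
```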
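The update rule above can be turned into a small runnable sketch. The function name `sgd_momentum`, the quadratic test objective, and the default hyperparameters are assumptions for illustration; a real training loop would compute the gradient from a mini-batch instead.

```python
def sgd_momentum(grad_fn, x0, lr=0.1, rho=0.9, steps=500):
    """Minimize a 1-D function from its gradient using classical momentum."""
    x, v = x0, 0.0
    for _ in range(steps):
        dx = grad_fn(x)        # gradient at the current position
        v = rho * v - lr * dx  # decay the old velocity and add the new gradient step
        x += v                 # move along the accumulated velocity
    return x


# Illustrative objective f(x) = (x - 2)^2 with gradient 2 * (x - 2).
x_min = sgd_momentum(lambda x: 2 * (x - 2), x0=10.0)
print(x_min)
```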

## 4. Nesterov Accelerated Gradient (NAG)

The momentum term $\gamma v_{t-1}$ is used to update the parameter $\theta$. Computing $\theta - \gamma v_{t-1}$ gives an approximation of the parameter's next position, and NAG evaluates the gradient at that approximate future position rather than at the current one. Picture a ball rolling downhill: a ball that blindly follows the slope does not necessarily end up in a good place. We would like a smarter ball that anticipates where it is heading, so that it slows down before the slope starts to rise again. NAG improves the performance of RNNs on many tasks.
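A minimal sketch of this look-ahead update follows. The function name `nesterov_momentum`, the test objective, and the hyperparameter values are assumptions; the only point is that the gradient is taken at the look-ahead position instead of the current one.

```python
def nesterov_momentum(grad_fn, theta0, lr=0.1, gamma=0.9, steps=500):
    """Minimize a 1-D function from its gradient using Nesterov momentum."""
    theta, v = theta0, 0.0
    for _ in range(steps):
        lookahead = theta - gamma * v            # approximate future position
        v = gamma * v + lr * grad_fn(lookahead)  # gradient evaluated at the look-ahead point
        theta -= v
    return theta


# Illustrative objective f(theta) = (theta - 2)^2 with gradient 2 * (theta - 2).
theta_min = nesterov_momentum(lambda t: 2 * (t - 2), theta0=10.0)
print(theta_min)
```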

## 5. Adagrad (Adaptive Gradient Algorithm), Adadelta

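Adagrad adapts the learning rate per parameter, applying larger updates to infrequently updated (low-frequency) parameters and smaller updates to frequently updated (high-frequency) ones. It therefore performs well on sparse data and improves the robustness of SGD. For example, it has been used to recognize cats in YouTube videos and to train GloVe word embeddings, because rare features need larger updates than frequent ones.

Adagrad effectively places a constraint on the learning rate: each parameter's step is divided by the square root of its accumulated squared gradients, so the effective learning rate shrinks over time.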
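A minimal sketch of this learning-rate constraint is given below, assuming the common formulation in which each parameter keeps a running sum of its squared gradients. The function name `adagrad`, the two-parameter test objective, and the hyperparameters are illustrative assumptions.

```python
import math


def adagrad(grad_fn, theta0, lr=0.5, eps=1e-8, steps=2000):
    """Minimize a function of several parameters with per-parameter Adagrad steps."""
    theta = list(theta0)
    accum = [0.0] * len(theta)  # running sum of squared gradients, one per parameter
    for _ in range(steps):
        grads = grad_fn(theta)
        for i, g in enumerate(grads):
            accum[i] += g * g
            # The accumulated history scales the learning rate down for this parameter.
            theta[i] -= lr / (math.sqrt(accum[i]) + eps) * g
    return theta


# Illustrative objective f(a, b) = a^2 + 10 * b^2 with minimum at (0, 0).
result = adagrad(lambda t: [2 * t[0], 20 * t[1]], [3.0, -2.0])
print(result)
```

Because each gradient is divided by the root of its own history, the steep `b` direction and the shallow `a` direction take steps of comparable relative size, which is what makes the method attractive for sparse or badly scaled features.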

## 6. Adam: Adaptive Moment Estimation

Recommended hyperparameter values:

$\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$

In practice, Adam often compares favorably with other adaptive learning-rate methods.
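The following sketch shows where these hyperparameters enter the update, assuming the standard Adam formulation with bias-corrected first and second moment estimates. The function name `adam`, the learning rate, and the test objective are illustrative assumptions.

```python
import math


def adam(grad_fn, theta0, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=5000):
    """Minimize a 1-D function from its gradient using Adam."""
    theta = theta0
    m = 0.0  # first moment: moving average of gradients (momentum)
    v = 0.0  # second moment: moving average of squared gradients
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction for the zero initialization
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


# Illustrative objective f(theta) = (theta - 2)^2 with gradient 2 * (theta - 2).
theta_min = adam(lambda t: 2 * (t - 2), theta0=5.0)
print(theta_min)
```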

## 7. How to choose optimization algorithm

If the data is sparse, use an adaptive method such as Adagrad, Adadelta, RMSprop, or Adam.

RMSprop, Adadelta, and Adam behave similarly in many cases.

Adam adds bias correction and momentum to RMSprop.

As gradients become sparser, the bias correction helps Adam do slightly better than RMSprop.

Overall, Adam is often the best default choice.

Many papers use plain SGD without momentum. Although SGD can reach a minimum, it may take much longer than other algorithms and can get stuck at saddle points.

If you need faster convergence, or are training deeper and more complex neural networks, choose an adaptive algorithm.

## 8. Second-order algorithms

### 1. Hessian, Newton method

#### 1. Convex function

Hessian matrix and Newton method

The matrix of second derivatives of a multivariate function is the Hessian matrix, which is central to Newton's optimization method. Newton's method is an iterative method that is used in two main ways: 1. finding the root of an equation, which needs only first derivatives; 2. optimization, which also needs second derivatives (the Hessian), because it looks for a root of the gradient.
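Both uses can be sketched in one dimension, where the Hessian reduces to the ordinary second derivative. The function names `newton_root` and `newton_minimize` and the test functions are assumptions for illustration. Note that the optimization variant only finds a stationary point, so the starting point matters for non-convex functions.

```python
def newton_root(f, df, x0, tol=1e-12, max_iter=50):
    """Find x with f(x) = 0 using the update x <- x - f(x) / f'(x)."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / df(x)
        x -= step
        if abs(step) < tol:
            break
    return x


def newton_minimize(df, d2f, x0, tol=1e-12, max_iter=50):
    """Find a stationary point of f by applying Newton's root finding to f'."""
    return newton_root(df, d2f, x0, tol, max_iter)


# Root finding: solve x^2 - 2 = 0.
root = newton_root(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0)

# Optimization: f(x) = x^4 - 3x^2 + 2, f'(x) = 4x^3 - 6x, f''(x) = 12x^2 - 6.
x_min = newton_minimize(lambda x: 4 * x ** 3 - 6 * x,
                        lambda x: 12 * x * x - 6,
                        x0=2.0)
print(root, x_min)
```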

When Newton's method cannot be applied effectively, for example because computing and inverting the Hessian is too expensive, many improved methods have been proposed. Quasi-Newton methods can be regarded as approximations of Newton's method: they only need first derivatives and avoid computing the Hessian matrix and its inverse explicitly, so each iteration is much cheaper. Quasi-Newton methods are not covered in detail here; the more advanced algorithms in this family include DFP, BFGS, and L-BFGS, which you can look up on your own. In general, quasi-Newton methods address the expensive computation, convergence difficulties, and local-minimum problems of Newton's method.
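The idea of replacing the second derivative with an estimate built from first derivatives can be sketched in one dimension with the secant method, which multi-dimensional quasi-Newton methods such as BFGS generalize. This is a simplified illustration, not an implementation of DFP, BFGS, or L-BFGS; the function name `secant_minimize` and the test function are assumptions.

```python
def secant_minimize(df, x0, x1, tol=1e-12, max_iter=100):
    """1-D quasi-Newton sketch: estimate f'' from two successive gradient values."""
    g0, g1 = df(x0), df(x1)
    for _ in range(max_iter):
        if g1 == g0:
            break  # curvature estimate undefined, stop with the current point
        curvature = (g1 - g0) / (x1 - x0)  # finite-difference stand-in for f''
        x0, g0 = x1, g1                    # keep the previous point and gradient
        x1 = x1 - g1 / curvature           # Newton-like step with the estimated curvature
        g1 = df(x1)
        if abs(x1 - x0) < tol:
            break
    return x1


# Same objective as a Newton example would use: f(x) = x^4 - 3x^2 + 2, f'(x) = 4x^3 - 6x.
x_min = secant_minimize(lambda x: 4 * x ** 3 - 6 * x, x0=2.0, x1=1.8)
print(x_min)
```

Only the gradient `df` is ever evaluated; the curvature is inferred from how the gradient changed between the last two iterates.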

## Second Order Optimization Made Practical

### 1. Summary

According to the researchers, the paper aims to narrow the gap between the theoretical and practical results of optimization. It presents a proof of concept for second-order optimization and shows, through a series of important algorithmic and numerical improvements, that it can deliver large gains on real deep models. Specifically, when training deep models, the second-order optimizer Shampoo can make efficient use of heterogeneous hardware consisting of multi-core CPUs and multiple accelerator units. On large-scale tasks such as machine translation and image recognition, it achieves very strong performance, better than existing state-of-the-art first-order gradient descent methods.