
# 1. Role of the optimizer

The optimizer minimizes the loss function; that is, it finds optimal values for the parameters of the defined model. We start from initial values for the weights and biases, and in each iteration the optimizer updates the parameters in the direction it computes, until they stabilize near the optimum.
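As a minimal sketch of this loop, plain gradient descent on a hypothetical one-parameter loss J(w) = (w − 3)² (an illustrative choice, not from the text) looks like:

```python
def grad(w):
    # dJ/dw for the illustrative loss J(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

w = 0.0        # initial guess for the parameter
eta = 0.1      # learning rate
for _ in range(200):
    w = w - eta * grad(w)   # step in the negative-gradient direction

print(round(w, 4))  # settles at the minimum, w = 3
```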

# 2. Common optimizers

- `tf.train.Optimizer`
- `tf.train.GradientDescentOptimizer`
- `tf.train.AdadeltaOptimizer`
- `tf.train.AdagradOptimizer`
- `tf.train.AdagradDAOptimizer`
- `tf.train.MomentumOptimizer`
- `tf.train.AdamOptimizer`
- `tf.train.FtrlOptimizer`
- `tf.train.ProximalAdagradOptimizer`
- `tf.train.ProximalGradientDescentOptimizer`
- `tf.train.RMSPropOptimizer`

# 3. Comparison of various optimizers

### 3.1 Comparison of three gradient descent methods

Standard (batch) gradient descent:

Standard gradient descent first computes the total error over all samples, and then updates the weights according to that total error.

Stochastic gradient descent:

Stochastic gradient descent randomly picks a single sample, computes its error, and updates the weights from that one sample.

Mini-batch gradient descent:

Mini-batch gradient descent is a compromise between the two. Select a batch from the full sample set (for example, with 10,000 samples in total, randomly pick 100 as a batch), compute the total error over that batch, and update the weights accordingly.
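The three schemes differ only in which samples feed each gradient step. A toy comparison (the linear model, data, and sample sizes are illustrative assumptions, not from the text) on J(w) = mean((w·x − y)²) with y = 2x, whose optimum is w = 2:

```python
import random

random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(1000)]
ys = [2.0 * x for x in xs]   # perfect linear data, so the optimum is w = 2

def grad(w, idx):
    # gradient of the mean squared error over the chosen sample indices
    return sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in idx) / len(idx)

def train(pick_batch, steps=500, eta=0.5):
    w = 0.0
    for _ in range(steps):
        w -= eta * grad(w, pick_batch())
    return w

n = len(xs)
w_full = train(lambda: range(n))                      # standard: all samples
w_sgd  = train(lambda: [random.randrange(n)])         # stochastic: one sample
w_mini = train(lambda: random.sample(range(n), 100))  # mini-batch: 100 samples

print(w_full, w_sgd, w_mini)  # all three approach w = 2
```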

| Symbol | Meaning |
|---|---|
| W | Parameters to train |
| J(W) | Cost function |
| ∇wJ(W) | Gradient of the cost function |
| η | Learning rate |

### 3.2 SGD (stochastic gradient descent)
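Using the symbols from the table above, the SGD update at each step is:

```latex
W \leftarrow W - \eta \, \nabla_W J(W)
```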

### 3.3 Momentum

The current weight update is influenced by the previous update, much like the inertia of a ball rolling downhill. This momentum can speed up the descent.
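With γ the momentum coefficient (commonly around 0.9) and v_t the accumulated velocity, the update is commonly written as:

```latex
v_t = \gamma\, v_{t-1} + \eta\, \nabla_W J(W), \qquad
W \leftarrow W - v_t
```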

### 3.4 NAG (Nesterov accelerated gradient)

In TensorFlow, NAG and Momentum share the same class, tf.train.MomentumOptimizer; NAG is enabled with the `use_nesterov=True` argument. In plain Momentum, the little ball blindly follows the downhill gradient, which easily leads to mistakes, so we want a smarter ball: one that knows in advance where it is going, and that slows down as it approaches the bottom of the slope instead of rushing up the opposite side. The term γvt−1 is used to modify W: computing W − γvt−1 approximates the ball's next position, so we can evaluate the gradient at that look-ahead position and apply it at the current one.
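The look-ahead changes only where the gradient is evaluated, compared with plain Momentum:

```latex
v_t = \gamma\, v_{t-1} + \eta\, \nabla_W J(W - \gamma\, v_{t-1}), \qquad
W \leftarrow W - v_t
```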

### 3.5 Adagrad

Adagrad is an algorithm based on SGD. Its core idea is to use a smaller learning rate for parameters associated with frequent data and a larger learning rate for parameters associated with rare data. This makes it well suited to sparse data sets (for example, an image data set with 10,000 photos of dogs, 10,000 photos of cats, but only 100 photos of elephants).

Adagrad's main advantage is that the learning rate does not need to be tuned by hand; it adapts automatically. Its drawback is that as the number of iterations grows, the learning rate keeps shrinking and eventually tends to 0.
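The shrinking-rate behavior is easy to see in a toy run (the loss J(w) = (w − 3)² and the constants are illustrative assumptions): the accumulator G only ever grows, so the effective learning rate η/√(G + ε) can only shrink.

```python
import math

eta, eps = 0.5, 1e-8
w, G = 0.0, 0.0
rates = []
for t in range(1, 101):
    g = 2.0 * (w - 3.0)                  # gradient of the toy loss (w - 3)^2
    G += g * g                           # accumulated squared gradients
    w -= eta / math.sqrt(G + eps) * g    # Adagrad update
    if t in (1, 10, 100):
        rates.append(eta / math.sqrt(G + eps))

print(rates)  # each effective learning rate is smaller than the last
```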

### 3.6 RMSprop

RMS is short for root mean square.

RMSprop borrows Adagrad's idea, but instead of accumulating all past squared gradients, it keeps an exponentially decaying average of the squared gradients and uses its square root as the denominator of the learning rate. As a result, RMSprop avoids the ever-shrinking learning rate while still adapting it automatically, and usually works well.
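In the usual formulation, with decay rate γ (often 0.9), gradient g_t, and a small ε for numerical stability:

```latex
E[g^2]_t = \gamma\, E[g^2]_{t-1} + (1-\gamma)\, g_t^2, \qquad
W \leftarrow W - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}}\, g_t
```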

### 3.7 Adadelta

With Adadelta, we do not even need to set a default learning rate: Adadelta replaces it with a running average of past parameter updates and can achieve very good results without one.
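In the usual formulation, the learning rate η is replaced by the root mean square of recent parameter updates:

```latex
\Delta W_t = -\frac{\mathrm{RMS}[\Delta W]_{t-1}}{\mathrm{RMS}[g]_t}\, g_t, \qquad
W \leftarrow W + \Delta W_t
```

where RMS[x]_t = √(E[x²]_t + ε) is the decaying root-mean-square average used by RMSprop.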

### 3.8 Adam

Like Adadelta and RMSprop, Adam keeps an exponentially decaying average of past squared gradients, and it additionally keeps a decaying average of the past gradients themselves. After bias correction, the parameters are updated in a way similar to Adadelta and RMSprop.
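In the usual formulation, with decay rates β₁ and β₂ (commonly 0.9 and 0.999) and gradient g_t:

```latex
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \\
\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad
\hat{v}_t = \frac{v_t}{1-\beta_2^t} \\
W \leftarrow W - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
```

The hatted terms correct the bias toward zero that m_t and v_t have in early iterations, since both start at 0.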

# 4. Selection of optimizer

Choosing the best optimizer for a model, one that converges quickly and learns appropriate weights and biases, is something of an art.

Adaptive techniques (Adadelta, Adagrad, etc.) are good optimizers for complex neural network models and converge faster. In most cases Adam is probably the best choice; it performs better than the other adaptive techniques, though it is computationally more expensive. For sparse data sets (such as an image data set with 10,000 photos of dogs, 10,000 photos of cats, and only 100 photos of elephants), methods such as SGD, NAG, and Momentum are not the best choices; methods that adapt the learning rate are. An additional advantage is that no learning-rate tuning is needed: the default learning rate usually reaches a good solution.

# 5. Demonstration example

Taking handwritten digit recognition (MNIST) as an example, we design a simple neural network and compare the SGD and Adam optimizers.

Code modification:

```python
# Train with gradient descent
# train_step = tf.train.GradientDescentOptimizer(0.2).minimize(loss)
# Train with the Adam optimizer
train_step = tf.train.AdamOptimizer(1e-2).minimize(loss)
```

With SGD the learning rate is set to 0.2; with Adam it is set to 1e-2, i.e. 0.01.

Generally the learning rate used with Adam is fairly small, and 0.01 here is on the large side.

The complete code is as follows:

```python
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

# Load the data set
mnist = input_data.read_data_sets("MNIST_data", one_hot=True)
# Size of each batch
batch_size = 100
# Number of batches
n_batch = mnist.train.num_examples // batch_size

# Define two placeholders
x = tf.placeholder(tf.float32, [None, 784])
y = tf.placeholder(tf.float32, [None, 10])

# Create a simple neural network
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
prediction = tf.nn.softmax(tf.matmul(x, W) + b)

# Quadratic cost function
# loss = tf.reduce_mean(tf.square(y - prediction))
# Softmax cross-entropy cost function
# (note: strictly, `logits` should be the pre-softmax values tf.matmul(x, W) + b;
#  the softmax output is passed here to match the results reported below)
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=prediction))

# Train with gradient descent
# train_step = tf.train.GradientDescentOptimizer(0.2).minimize(loss)
# Train with the Adam optimizer
train_step = tf.train.AdamOptimizer(1e-2).minimize(loss)

# Initialize variables
init = tf.global_variables_initializer()

# Results are stored in a boolean list
# argmax returns the position of the maximum value in a 1-D tensor
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(prediction, 1))
# Accuracy
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

with tf.Session() as sess:
    sess.run(init)
    for epoch in range(21):
        for batch in range(n_batch):
            batch_xs, batch_ys = mnist.train.next_batch(batch_size)
            sess.run(train_step, feed_dict={x: batch_xs, y: batch_ys})
        acc = sess.run(accuracy, feed_dict={x: mnist.test.images, y: mnist.test.labels})
        print("Iter" + str(epoch) + ",Testing Accuracy" + str(acc))
```

Using SGD optimizer results:

```
Iter0,Testing Accuracy0.8243
Iter1,Testing Accuracy0.8882
Iter2,Testing Accuracy0.9009
Iter3,Testing Accuracy0.9052
Iter4,Testing Accuracy0.9089
Iter5,Testing Accuracy0.9103
Iter6,Testing Accuracy0.9111
Iter7,Testing Accuracy0.9128
Iter8,Testing Accuracy0.9145
Iter9,Testing Accuracy0.9158
Iter10,Testing Accuracy0.9192
Iter11,Testing Accuracy0.9181
Iter12,Testing Accuracy0.9184
Iter13,Testing Accuracy0.9191
Iter14,Testing Accuracy0.9195
Iter15,Testing Accuracy0.9196
Iter16,Testing Accuracy0.9212
Iter17,Testing Accuracy0.9213
Iter18,Testing Accuracy0.9204
Iter19,Testing Accuracy0.9213
Iter20,Testing Accuracy0.9211
```

Using the Adam optimizer results:

```
Iter0,Testing Accuracy0.9244
Iter1,Testing Accuracy0.9256
Iter2,Testing Accuracy0.9298
Iter3,Testing Accuracy0.9262
Iter4,Testing Accuracy0.9275
Iter5,Testing Accuracy0.926
Iter6,Testing Accuracy0.9296
Iter7,Testing Accuracy0.9313
Iter8,Testing Accuracy0.9293
Iter9,Testing Accuracy0.929
Iter10,Testing Accuracy0.9305
Iter11,Testing Accuracy0.9326
Iter12,Testing Accuracy0.9307
Iter13,Testing Accuracy0.9328
Iter14,Testing Accuracy0.9295
Iter15,Testing Accuracy0.9324
Iter16,Testing Accuracy0.9282
Iter17,Testing Accuracy0.9328
Iter18,Testing Accuracy0.9297
Iter19,Testing Accuracy0.9325
Iter20,Testing Accuracy0.9297
```

Result analysis

The results show that Adam converges noticeably faster than SGD: it reaches over 92% test accuracy after the first epoch, a level SGD only approaches after about ten epochs.