TensorFlow basics: comparison and selection of commonly used optimizers


1. Role of the optimizer

The optimizer minimizes the loss function, that is, it searches for the parameter values that make the loss of the defined model as small as possible. You start from initial values for the weights and biases. In each iteration the parameters are then updated in the direction the optimizer computes, and ideally they settle at the values that minimize the loss.

2. Common optimizers


3. Comparison of various optimizers

3.1 Comparison of three gradient descent methods

Standard (full-batch) gradient descent:
Computes the total error over all samples first, then updates the weights once based on that total error.
Stochastic gradient descent:
Picks one sample at random, computes its error, and updates the weights right away.
Mini-batch gradient descent:
A compromise between the two. A batch is drawn from the full sample set (for example, 100 samples chosen at random out of 10000 in total). The total error of that batch is computed, and the weights are updated based on it.
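
To make the difference concrete, here is a minimal sketch in plain Python (no TensorFlow). The toy data, the `train` helper and its default settings are all made up for illustration; the only thing that changes between the three runs is how many samples feed each weight update. For simplicity it shuffles the data once per epoch instead of drawing samples independently.

```python
import random

def mean_gradient(w, batch):
    """Mean gradient of the squared error 0.5 * (w * x - y) ** 2 over a batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

def train(samples, batch_size, lr=0.1, epochs=50, seed=0):
    """Fit y = w * x, making one weight update per batch of `batch_size` samples."""
    rng = random.Random(seed)
    data = list(samples)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit the samples in a random order each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            w -= lr * mean_gradient(w, batch)
    return w

# Hypothetical toy data drawn from y = 3x, so the ideal weight is 3.
samples = [(i / 10, 3 * i / 10) for i in range(1, 11)]

w_full = train(samples, batch_size=len(samples))  # standard: 1 update per epoch
w_sgd = train(samples, batch_size=1)              # stochastic: 10 updates per epoch
w_mini = train(samples, batch_size=5)             # mini-batch: 2 updates per epoch
print(w_full, w_sgd, w_mini)
```

On this noise-free toy problem, more updates per epoch simply means getting closer to 3 in the same number of epochs. On real, noisy data the single-sample updates also jump around more, which is why the mini-batch compromise is the usual choice.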

Symbols used below:
W: parameters to train
J(W): cost function
∇wJ(W): gradient of the cost function with respect to W
η: learning rate

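3.2 SGD (stochastic gradient descent)

Each step moves the parameters against the gradient of the cost function, scaled by the learning rate: W = W − η·∇wJ(W).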

3.3 Momentum

The current weight update is influenced by the previous weight update, much like the inertia of a ball rolling downhill. This lets the ball pick up speed on its way down.
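
A minimal sketch of the idea, assuming the common formulation v = γ·v + η·∇J(W), W = W − v, where γ is the momentum coefficient. The toy cost J(w) = 0.5·w² and all function names are assumptions made for this example.

```python
def grad(w):
    """Gradient of the toy cost J(w) = 0.5 * w ** 2 (an assumption for this sketch)."""
    return w

def run_sgd(w, lr, steps):
    """Plain gradient descent: every step depends only on the current gradient."""
    for _ in range(steps):
        w -= lr * grad(w)
    return w

def run_momentum(w, lr, gamma, steps):
    """Gradient descent with momentum: part of the last update is carried over."""
    v = 0.0  # previous update, the "velocity" of the ball
    for _ in range(steps):
        v = gamma * v + lr * grad(w)  # keep a fraction gamma of the last update
        w -= v
    return w

w_sgd = run_sgd(1.0, lr=0.01, steps=100)
w_momentum = run_momentum(1.0, lr=0.01, gamma=0.9, steps=100)
print(abs(w_sgd), abs(w_momentum))
```

With the same small learning rate and the same 100 steps, the momentum run ends much closer to the minimum at 0, because consecutive updates that point the same way add up.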

3.4 NAG(Nesterov accelerated gradient)

In TF, NAG shares a class with Momentum: tf.train.MomentumOptimizer, where it is switched on with the use_nesterov=True argument. With plain Momentum, the ball blindly follows the downhill gradient and easily overshoots, so we want a smarter ball. This ball knows in advance where it is heading, and it slows down as it approaches the bottom of the slope instead of rushing up the opposite side. The momentum term γ·v(t−1), where γ is the momentum coefficient and v(t−1) is the previous update, is going to be applied to W anyway, so W − γ·v(t−1) approximates the ball's next position. We can therefore compute the gradient at that look-ahead position and use it for the update at the current position.
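
The look-ahead step can be sketched in a few lines of plain Python. This follows the textbook formulation described above, not necessarily the exact arithmetic inside TensorFlow; the toy cost and the `nag_step` helper are assumptions for illustration.

```python
def grad(w):
    """Gradient of the toy cost J(w) = 0.5 * w ** 2 (an assumption for this sketch)."""
    return w

def nag_step(w, v, lr=0.1, gamma=0.9):
    """One Nesterov update: evaluate the gradient where momentum is about to take us."""
    lookahead = w - gamma * v             # approximate next position of the ball
    v = gamma * v + lr * grad(lookahead)  # gradient at the look-ahead point
    return w - v, v

# A single step from w = 1.0 with previous update v = 0.5.
w1, v1 = nag_step(1.0, 0.5)

# Repeated steps settle at the minimum w = 0.
w, v = 5.0, 0.0
for _ in range(200):
    w, v = nag_step(w, v)
print(w1, v1, w)
```

The only difference from plain Momentum is the argument passed to `grad`: the look-ahead position instead of the current one.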

3.5 Adagrad

Adagrad builds on SGD. Its core idea is to use a smaller learning rate for parameters tied to frequently occurring data and a larger learning rate for parameters tied to rare data. This makes it well suited to sparse data sets (such as a picture data set with 10000 photos of dogs, 10000 photos of cats and only 100 photos of elephants).
The main advantage of Adagrad is that the learning rate does not have to be tuned by hand; it adapts automatically. Its drawback is that the accumulated squared gradients only ever grow, so the effective learning rate gets lower and lower as the number of iterations increases and eventually tends to 0.
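
Both properties show up in a small sketch, assuming the usual per-parameter rule: accumulate g², then divide the learning rate by the square root of the accumulator. The gradient sequences and helper names are invented for the example.

```python
import math

def adagrad_step(w, g, accum, lr=0.1, eps=1e-8):
    """One Adagrad update for a single parameter; also returns the effective rate."""
    accum += g * g                        # sum of all squared gradients so far
    rate = lr / (math.sqrt(accum) + eps)  # shrinks as the accumulator grows
    return w - rate * g, accum, rate

def final_rate(gradients):
    """Effective learning rate after processing a whole gradient sequence."""
    w, accum, rate = 0.0, 0.0, None
    for g in gradients:
        w, accum, rate = adagrad_step(w, g, accum)
    return rate

# A parameter that receives a gradient on every step ("common" data) ...
frequent = final_rate([1.0] * 100)
# ... and one that receives a gradient only on every 10th step ("rare" data).
rare = final_rate([1.0 if t % 10 == 0 else 0.0 for t in range(100)])
print(frequent, rare)
```

After 100 steps the frequently updated parameter is down to roughly a tenth of the initial rate, while the rarely updated one keeps a rate about three times larger. Run it longer and both keep shrinking, which is exactly the drawback mentioned above.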

3.6 RMSprop

RMS stands for Root Mean Square.

RMSprop borrows Adagrad's idea, but instead of summing all past squared gradients it keeps a decaying average: the average of the squared gradients from the first t−1 steps is combined with the square of the current gradient, and the square root of the result is used as the denominator of the learning rate. Because old gradients fade out of the average, RMSprop does not suffer from an ever lower learning rate, yet it still adjusts the learning rate by itself, which often gives better results.
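
Here is a small side-by-side sketch of the two denominators, fed with a constant gradient of 1.0. The decay factor 0.9 and the helper names are assumptions chosen for the example.

```python
import math

def adagrad_rate(accum, g, lr=0.01, eps=1e-8):
    """Adagrad: the denominator is the root of the SUM of squared gradients."""
    accum += g * g
    return accum, lr / (math.sqrt(accum) + eps)

def rmsprop_rate(avg, g, lr=0.01, decay=0.9, eps=1e-8):
    """RMSprop: the denominator is the root of a DECAYING AVERAGE of squared gradients."""
    avg = decay * avg + (1 - decay) * g * g
    return avg, lr / (math.sqrt(avg) + eps)

accum, avg = 0.0, 0.0
for _ in range(1000):
    accum, rate_adagrad = adagrad_rate(accum, 1.0)
    avg, rate_rmsprop = rmsprop_rate(avg, 1.0)
print(rate_adagrad, rate_rmsprop)
```

After 1000 steps the Adagrad rate has dropped to a small fraction of the initial 0.01, while the RMSprop rate has levelled off at about 0.01.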

3.7 Adadelta

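With Adadelta, we do not even need to set a default learning rate. In its update rule the learning rate is replaced by a running average of the previous parameter updates, so it can work well without a hand-picked learning rate. Note that tf.train.AdadeltaOptimizer still accepts a learning_rate argument, which scales the computed update.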
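
A minimal sketch of the update without any learning rate, following the commonly published form of the rule. The constants rho=0.95 and eps=1e-6, the toy cost and the helper name are assumptions for this example; TensorFlow's implementation may differ in details.

```python
import math

def adadelta_step(w, g, avg_sq_grad, avg_sq_delta, rho=0.95, eps=1e-6):
    """One Adadelta update for a single parameter; no learning rate appears."""
    avg_sq_grad = rho * avg_sq_grad + (1 - rho) * g * g
    # Step size = RMS of past updates divided by RMS of past gradients.
    delta = -math.sqrt(avg_sq_delta + eps) / math.sqrt(avg_sq_grad + eps) * g
    avg_sq_delta = rho * avg_sq_delta + (1 - rho) * delta * delta
    return w + delta, avg_sq_grad, avg_sq_delta

w, avg_sq_grad, avg_sq_delta = 1.0, 0.0, 0.0
history = [w]
for _ in range(50):
    g = w  # gradient of the toy cost J(w) = 0.5 * w ** 2
    w, avg_sq_grad, avg_sq_delta = adadelta_step(w, g, avg_sq_grad, avg_sq_delta)
    history.append(w)
print(history[1], history[-1])
```

The steps start out tiny, because the running average of past updates starts at zero, and grow as that average builds up. On this toy cost w moves steadily toward the minimum at 0.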

3.8 Adam

Like Adadelta and RMSprop, Adam keeps an exponentially decaying average of past squared gradients, and it additionally keeps an exponentially decaying average of the past gradients themselves. After a bias correction of both averages, the parameters are updated in a way similar to Adadelta and RMSprop.
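
The two running averages and the bias correction can be sketched as follows. The default values mirror the commonly quoted ones (lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8); the toy cost and the `adam_step` helper are assumptions for this example.

```python
import math

def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter; t is the step number, starting at 1."""
    m = beta1 * m + (1 - beta1) * g      # decaying average of gradients
    v = beta2 * v + (1 - beta2) * g * g  # decaying average of squared gradients
    m_hat = m / (1 - beta1 ** t)         # bias correction: both averages start at 0
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = 1.0, 0.0, 0.0
first = None
for t in range(1, 101):
    g = 10.0 * w  # gradient of the toy cost J(w) = 5 * w ** 2
    w, m, v = adam_step(w, g, m, v, t)
    if t == 1:
        first = w
print(first, w)
```

Thanks to the bias correction, the very first step has a size of almost exactly lr, even though the gradient here is 10. The gradient's scale cancels out between the numerator and the denominator.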

4. Selection of optimizer

Choosing the best optimizer for a model, one that converges quickly and learns suitable weights and biases, is a skill.
Adaptive methods (Adadelta, Adagrad, etc.) suit complex neural network models and tend to converge faster. In most cases, Adam is probably the best choice. Adam often does better than the other adaptive methods, but it costs more, since it stores two running averages for every parameter. For sparse data sets (such as a picture data set with 10000 photos of dogs, 10000 photos of cats and only 100 photos of elephants), methods such as SGD, NAG and Momentum are not the best choice; a method that adapts the learning rate is. An additional advantage of the adaptive methods is that the learning rate usually needs little tuning, and good results can often be reached with the default learning rate.

5. Demonstration example

Taking handwritten digit recognition (MNIST) as an example, we build a simple neural network and compare the SGD and Adam optimizers.
Code modification:

#Training with gradient descent
#train_step = tf.train.GradientDescentOptimizer(0.2).minimize(loss)
#Training with the Adam optimizer
train_step = tf.train.AdamOptimizer(1e-2).minimize(loss)

With SGD, the learning rate is set to 0.2. With Adam, it is set to 0.01, that is, 1e-2.
Adam is generally used with a fairly small learning rate (the default in tf.train.AdamOptimizer is 0.001), so 0.01 is on the large side.
The complete code, written against the TensorFlow 1.x API, is as follows:

import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

#Load data set
mnist = input_data.read_data_sets("MNIST_data",one_hot=True)

#Size of each batch
batch_size = 100
#How many batches are there
n_batch = mnist.train.num_examples // batch_size

#Define two placeholders: input images and one-hot labels
x = tf.placeholder(tf.float32,[None,784])
y = tf.placeholder(tf.float32,[None,10])

#Create a simple neural network
W = tf.Variable(tf.zeros([784,10]))
b = tf.Variable(tf.zeros([10]))
logits = tf.matmul(x,W) + b
prediction = tf.nn.softmax(logits)

#Quadratic cost function
# loss = tf.reduce_mean(tf.square(y-prediction))
#Softmax cross entropy cost function; it applies softmax internally, so it takes the raw logits
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y,logits=logits))
#Training with gradient descent
#train_step = tf.train.GradientDescentOptimizer(0.2).minimize(loss)
#Training with the Adam optimizer
train_step = tf.train.AdamOptimizer(1e-2).minimize(loss)

#initialize variable
init = tf.global_variables_initializer()

#Compare true and predicted labels; the result is a Boolean tensor
#argmax returns the index of the largest value along axis 1
correct_prediction = tf.equal(tf.argmax(y,1),tf.argmax(prediction,1))
#Accuracy rate
accuracy = tf.reduce_mean(tf.cast(correct_prediction,tf.float32))

with tf.Session() as sess:
    #Variables must be initialized before they are used
    sess.run(init)
    for epoch in range(21):
        for batch in range(n_batch):
            batch_xs,batch_ys = mnist.train.next_batch(batch_size)
            sess.run(train_step,feed_dict = {x:batch_xs,y:batch_ys})
        acc = sess.run(accuracy,feed_dict = {x:mnist.test.images,y:mnist.test.labels})
        print("Iter"+str(epoch)+",Testing Accuracy"+str(acc))

Results with the SGD optimizer:

Iter0,Testing Accuracy0.8243
Iter1,Testing Accuracy0.8882
Iter2,Testing Accuracy0.9009
Iter3,Testing Accuracy0.9052
Iter4,Testing Accuracy0.9089
Iter5,Testing Accuracy0.9103
Iter6,Testing Accuracy0.9111
Iter7,Testing Accuracy0.9128
Iter8,Testing Accuracy0.9145
Iter9,Testing Accuracy0.9158
Iter10,Testing Accuracy0.9192
Iter11,Testing Accuracy0.9181
Iter12,Testing Accuracy0.9184
Iter13,Testing Accuracy0.9191
Iter14,Testing Accuracy0.9195
Iter15,Testing Accuracy0.9196
Iter16,Testing Accuracy0.9212
Iter17,Testing Accuracy0.9213
Iter18,Testing Accuracy0.9204
Iter19,Testing Accuracy0.9213
Iter20,Testing Accuracy0.9211

Results with the Adam optimizer:

Iter0,Testing Accuracy0.9244
Iter1,Testing Accuracy0.9256
Iter2,Testing Accuracy0.9298
Iter3,Testing Accuracy0.9262
Iter4,Testing Accuracy0.9275
Iter5,Testing Accuracy0.926
Iter6,Testing Accuracy0.9296
Iter7,Testing Accuracy0.9313
Iter8,Testing Accuracy0.9293
Iter9,Testing Accuracy0.929
Iter10,Testing Accuracy0.9305
Iter11,Testing Accuracy0.9326
Iter12,Testing Accuracy0.9307
Iter13,Testing Accuracy0.9328
Iter14,Testing Accuracy0.9295
Iter15,Testing Accuracy0.9324
Iter16,Testing Accuracy0.9282
Iter17,Testing Accuracy0.9328
Iter18,Testing Accuracy0.9297
Iter19,Testing Accuracy0.9325
Iter20,Testing Accuracy0.9297

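Result analysis
The results show that Adam converges noticeably faster than SGD. Adam already reaches a test accuracy of 0.9244 after the first epoch, a level SGD does not reach within 21 epochs (its best is 0.9213). Adam then fluctuates around 0.93, while SGD levels off at about 0.921.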
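
To put a number on "faster", the two accuracy lists above can be compared directly. This is just a small helper over the printed results; the 0.92 threshold and the function name are arbitrary choices for the example.

```python
# Test accuracies per epoch, copied from the two result listings above.
sgd = [0.8243, 0.8882, 0.9009, 0.9052, 0.9089, 0.9103, 0.9111,
       0.9128, 0.9145, 0.9158, 0.9192, 0.9181, 0.9184, 0.9191,
       0.9195, 0.9196, 0.9212, 0.9213, 0.9204, 0.9213, 0.9211]
adam = [0.9244, 0.9256, 0.9298, 0.9262, 0.9275, 0.926, 0.9296,
        0.9313, 0.9293, 0.929, 0.9305, 0.9326, 0.9307, 0.9328,
        0.9295, 0.9324, 0.9282, 0.9328, 0.9297, 0.9325, 0.9297]

def first_epoch_reaching(accuracies, threshold):
    """Index of the first epoch whose accuracy is at least `threshold`, else None."""
    for epoch, acc in enumerate(accuracies):
        if acc >= threshold:
            return epoch
    return None

print("SGD :", first_epoch_reaching(sgd, 0.92), max(sgd))
print("Adam:", first_epoch_reaching(adam, 0.92), max(adam))
```

SGD first passes 0.92 at Iter16, while Adam is already above it at Iter0.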

Posted on Tue, 11 Feb 2020 01:43:12 -0500 by prashanth0626