Analysis of the principles of optimizer.zero_grad(), loss.backward(), optimizer.step(), and lr_scheduler.step()

When training a model with PyTorch, four functions are usually called in order while iterating over the epochs: optimizer.zero_grad(), loss.backward(), optimizer.step(), and lr_scheduler.step(). A typical training loop looks like this:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

train_loader = DataLoader(
    train_dataset,        # train_dataset is a user-defined Dataset
    batch_size=2,
    shuffle=True
)
model = myModel()         # myModel is a user-defined nn.Module
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-6, momentum=0.9, weight_decay=2e-4)  #Set optimizer parameters
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.1)          #Learning rate schedule
for epoch in range(epochs):
    model.train()
    for i, (inputs, labels) in enumerate(train_loader):
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    lr_scheduler.step()
      

optimizer.zero_grad()   # reset the parameter gradients to zero
loss.backward()         # back-propagate to compute the gradient of each parameter
optimizer.step()        # update the parameters by gradient descent
lr_scheduler.step()     # update the learning rate according to the epoch count
The four functions are used to reset the gradients to zero (optimizer.zero_grad()), compute the gradient of each parameter (loss.backward()), update the parameters by gradient descent (optimizer.step()), and finally update the learning rate according to the number of training epochs (lr_scheduler.step()).

Next, analyze the four functions through the source code. Before that, explain the common parameter variables in the function.
param_groups: when the optimizer class is instantiated, a param_groups list is created. The list contains num_groups param_group dictionaries (num_groups depends on how many groups of parameters you pass in when defining the optimizer). For SGD, each param_group contains six key-value pairs: ['params', 'lr', 'momentum', 'dampening', 'weight_decay', 'nesterov'] (see the short sketch after this list).
params (iterable) - parameters w, b to be optimized or dict with parameter group defined
lr(float, optional) - learning rate
momentum(float, optional, default 0) - momentum factor
weight_decay(float, optional, default 0) - weight decay
dampening (float, optional, default 0) - dampening factor for momentum
nesterov (bool, optional) – use nesterov momentum (default: False)
param_group['params']: the list of model parameters passed in, i.e. the parameters passed in for that group when the Optimizer class is instantiated. If the parameters are not grouped, it is model.parameters() of the whole model. Each parameter is a torch.nn.parameter.Parameter object.
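A quick way to see this structure is to print the keys of each group. This is a minimal sketch reusing the SGD optimizer constructed in the training loop above; the exact key set depends on the PyTorch version (newer versions may add keys such as 'maximize').

for i, param_group in enumerate(optimizer.param_groups):
    # each param_group is a dict; 'params' holds the Parameter objects of that group
    print(i, sorted(param_group.keys()))
    # typically: ['dampening', 'lr', 'momentum', 'nesterov', 'params', 'weight_decay']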
1, Optimizer torch.optim
The class used to optimize model weights in pytorch is torch.optim.Optimizer. Other optimizers are subclasses based on the base class Optimizer. Let's talk about how to build an Optimizer object instance of a model.
All optimizers inherit from the torch.optim.Optimizer class.

class torch.optim.Optimizer(params, defaults)

params accepts two kinds of input, and the parameters can be passed in two ways.
The two kinds of input are:

  1. An iterable of torch.Tensor.
  2. An iterable of dicts, where in each dict the value of the "params" key must be an iterable of torch.Tensor; if more than one dict is passed in, they are collected into a list, one dict per parameter group.

The two ways to pass the parameters in are:
1. The weights of the model share a learning rate, and the model parameters are directly passed in.

self.optimizer = torch.optim.SGD(params=self.model.parameters(), lr=args.lr)
#In this way, all parameters of the model are adjusted with the same learning rate lr

2. Model weights do not share the same learning rate

#It is preferable to group the model parameters so that each group can use a different learning rate
def get_one_lr_params(self):
    modules = [self.backbone]
    for i in range(len(modules)):
        for m in modules[i].named_modules():
            if isinstance(m[1], nn.Conv2d) or isinstance(m[1], SynchronizedBatchNorm2d) \
                    or isinstance(m[1], nn.BatchNorm2d):
                for p in m[1].parameters():
                    if p.requires_grad:
                        yield p

def get_two_lr_params(self):
    modules = [self.aspp, self.decoder]
    for i in range(len(modules)):
        for m in modules[i].named_modules():
            if isinstance(m[1], nn.Conv2d) or isinstance(m[1], SynchronizedBatchNorm2d) \
                    or isinstance(m[1], nn.BatchNorm2d):
                for p in m[1].parameters():
                    if p.requires_grad:
                        yield p

#Build the list of parameter groups
def optim_parameters(self, args):
    return [{'params': self.get_one_lr_params(), 'lr': args.learning_rate},
            {'params': self.get_two_lr_params(), 'lr': 10 * args.learning_rate}]

#Pass the parameter groups in to build the optimizer (each group supplies its own 'lr')
self.optimizer = torch.optim.SGD(params=self.model.optim_parameters(args))

Dynamically adjust the learning rate of the optimizer
Each optimizer instance has an optimizer.param_groups attribute, which is a list whose elements are dicts, one per parameter group. To change the learning rate you only need to modify the 'lr' entry of each dict.
A function for dynamically adjusting the learning rate can be constructed as follows:

#Compute the learning rate (polynomial decay)
def lr_poly(base_lr, iter, max_iter, power):
    return base_lr * ((1 - float(iter) / max_iter) ** power)

#Adjust the learning rate of each parameter group
def adjust_learning_rate(optimizer, i_iter):
    lr = lr_poly(args.learning_rate, i_iter, args.num_steps, args.power)
    optimizer.param_groups[0]['lr'] = lr
    optimizer.param_groups[1]['lr'] = 10 * lr
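A minimal usage sketch, assuming an optimizer built from the two parameter groups above and the model / criterion / train_loader from the training loop at the top: the function is called once per iteration with the running iteration index.

for i_iter, (inputs, labels) in enumerate(train_loader):
    adjust_learning_rate(optimizer, i_iter)   # overwrite 'lr' in both param_groups
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()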

2, Loss function
The loss function estimates the inconsistency between the model's predicted value f(x) and the ground-truth value Y. It is a non-negative real-valued function, usually written L(Y, f(x)); the smaller the loss, the better the robustness of the model. The loss function is the core of the empirical risk function and an important part of the structural risk function. Basic usage:

criterion = LossCriterion()   #construct the loss criterion (e.g. nn.CrossEntropyLoss)
loss = criterion(x, y)        #compute the loss between the prediction x and the target y
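For example, a minimal sketch using nn.CrossEntropyLoss (the criterion from the training loop above); the shapes and values are illustrative:

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
logits = torch.randn(4, 10, requires_grad=True)   # predictions for a batch of 4 samples, 10 classes
targets = torch.tensor([1, 0, 3, 9])              # ground-truth class indices
loss = criterion(logits, targets)                 # scalar tensor
loss.backward()                                   # gradients now accumulate in logits.grad
print(loss.item())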

The following chapters will introduce the types and principles of loss functions.
3, Gradient zeroing optimizer.zero_grad()
optimizer.zero_grad() clears the gradients, i.e. it resets the derivative of the loss with respect to each weight to 0.

def zero_grad(self):
    for group in self.param_groups:
        for p in group['params']:
            if p.grad is not None:
                p.grad.detach_()
                p.grad.zero_()

The optimizer.zero_grad() function traverses all parameters of the model, i.e. the torch.nn.parameter.Parameter objects described in the overview above (the variable p in the loop). It first detaches each gradient from the back-propagation graph with p.grad.detach_(), and then sets the gradient of each parameter to 0 with p.grad.zero_(), so the gradient record of the previous step is cleared.

Because training usually uses mini-batches, the gradient should be cleared before calling backward(); if it is not cleared, PyTorch accumulates the gradient computed this time on top of the gradient computed last time.
Advantage of this accumulation behavior: when hardware limits rule out a larger batch size, optimizer.zero_grad() can be called only once every several batches, so the gradients of several small batches are accumulated before a single parameter update, imitating a larger batch size (see the sketch after the summary below).
Disadvantage of clearing the gradient on every batch: each update then reflects only a single small batch of data (one batch in, one gradient computed, one network update).
Summary
Normally, optimizer.zero_grad() is called once per batch to clear the parameter gradients;
alternatively, optimizer.zero_grad() can be called only once every several batches, which is equivalent to increasing the batch_size (gradient accumulation), as sketched below.
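A minimal gradient-accumulation sketch under the assumptions of the training loop at the top (model, criterion, optimizer, train_loader); accum_steps is an illustrative name:

accum_steps = 4  # one optimizer update every 4 small batches

optimizer.zero_grad()
for i, (inputs, labels) in enumerate(train_loader):
    outputs = model(inputs)
    loss = criterion(outputs, labels) / accum_steps  # scale so the accumulated gradient is an average
    loss.backward()                                  # gradients accumulate in p.grad
    if (i + 1) % accum_steps == 0:
        optimizer.step()                             # update with the accumulated gradient
        optimizer.zero_grad()                        # clear before the next accumulation window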
4, Back propagation loss.backward()
Back propagation in PyTorch (i.e. tensor.backward()) is realized through the autograd package, which automatically computes the corresponding gradients according to the mathematical operations performed on the tensor.

Specifically, torch.Tensor is the core class of the autograd package. If you set a tensor's requires_grad to True, autograd starts tracking all operations on that tensor.

If you then call tensor.backward(), all gradients are computed automatically and accumulated into the tensor's .grad attribute. If backward() is never called, the gradient stays None, which is why loss.backward() must be called before optimizer.step().
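A small sketch of this behavior (variable names are illustrative):

import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)  # start tracking operations on x
y = (x ** 2).sum()                                      # y = x1^2 + x2^2 + x3^2

print(x.grad)        # None: backward() has not been called yet
y.backward()         # compute dy/dx and accumulate it into x.grad
print(x.grad)        # tensor([2., 4., 6.]), i.e. 2 * x

y2 = (x ** 2).sum()
y2.backward()
print(x.grad)        # tensor([4., 8., 12.]): gradients accumulate until they are zeroed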
5, Parameter update optimizer.step()
Before looking at the source code, please understand the use of several parameters:
lr(float, optional) - learning rate
momentum(float, optional, default 0) - momentum factor
weight_decay(float, optional, default 0) - weight decay
momentum
The concept of momentum ("impulse") comes from mechanics and represents the cumulative effect of a force over time.
In plain gradient descent the update is x += v, where the update amount v of each step is v = -dx * lr, and dx is the first derivative of the objective function func(x) with respect to x.
With momentum, the update amount v of each step is the sum of this step's gradient-descent term -dx * lr and the previous update v multiplied by a factor momentum in [0, 1], i.e.

v = -dx * lr + v * momentum

When the direction of this step's gradient term -dx * lr is the same as the direction of the previous update v, the previous update accelerates the search in this step.
When the direction of this step's gradient term -dx * lr is opposite to the direction of the previous update v, the previous update slows the search down and dampens oscillation.
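A small numeric sketch of this update rule, using func(x) = x * x as the objective (plain Python, illustrative values; dampening and Nesterov momentum are ignored here):

def grad(x):            # dx for func(x) = x * x
    return 2 * x

x, v = 5.0, 0.0
lr, momentum = 0.1, 0.9
for step in range(5):
    v = -grad(x) * lr + v * momentum   # v = -dx * lr + v * momentum
    x += v                             # x += v
    print(step, round(x, 4), round(v, 4))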

weight_decay
In PyTorch's SGD, weight_decay adds an L2 penalty on the weights: before each update, weight_decay * x is added to the gradient, i.e. d_p = d_p + weight_decay * x (see the corresponding line in the source code below). This is equivalent to adding (weight_decay / 2) * ||x||^2 to the objective and pulls the weights toward zero, which helps reduce overfitting.

Weight decay should not be confused with learning-rate decay. Taking the objective func(x) = x * x with update x += v and v = -dx * lr as an example, one can also shrink lr as the iterations proceed so that the step size decreases and oscillation around the minimum is dampened, e.g.

lr_i = lr_start * 1.0 / (1.0 + decay * i)

The smaller decay is, the more slowly the learning rate decays (decay = 0 keeps it constant); the larger decay is, the faster it decays. That mechanism, however, is learning-rate scheduling, which is handled by lr_scheduler (section 6) rather than by the weight_decay argument.
class SGD(Optimizer):
    def __init__(self, params, lr=required, momentum=0, dampening=0,
                 weight_decay=0, nesterov=False):
        ...

    def __setstate__(self, state):
        super(SGD, self).__setstate__(state)
        for group in self.param_groups:
            group.setdefault('nesterov', False)

    def step(self, closure=None):
        loss = None
        if closure is not None:
            loss = closure()
        for group in self.param_groups:
            weight_decay = group['weight_decay']
            momentum = group['momentum']
            dampening = group['dampening']
            nesterov = group['nesterov']
            for p in group['params']:
                if p.grad is None:
                    continue
                d_p = p.grad.data
                if weight_decay != 0:
                    d_p = d_p.add(p.data, alpha=weight_decay)            # d_p = d_p + weight_decay * p
                if momentum != 0:
                    param_state = self.state[p]
                    if 'momentum_buffer' not in param_state:
                        buf = param_state['momentum_buffer'] = torch.zeros_like(p.data)
                        buf.mul_(momentum).add_(d_p)                     # first step: buf = d_p
                    else:
                        buf = param_state['momentum_buffer']
                        buf.mul_(momentum).add_(d_p, alpha=1 - dampening)  # buf = momentum*buf + (1-dampening)*d_p
                    if nesterov:
                        d_p = d_p.add(buf, alpha=momentum)               # d_p = d_p + momentum * buf
                    else:
                        d_p = buf
                p.data.add_(d_p, alpha=-group['lr'])                     # p = p - lr * d_p
        return loss
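As a sanity check, here is a minimal sketch (illustrative values) comparing one optimizer.step() of plain SGD (momentum = 0, with weight_decay) against the manual update p = p - lr * (grad + weight_decay * p):

import torch

lr, wd = 0.1, 0.01
p = torch.tensor([1.0, -2.0], requires_grad=True)
optimizer = torch.optim.SGD([p], lr=lr, weight_decay=wd)

loss = (p ** 2).sum()          # gradient of the loss is 2 * p
loss.backward()

expected = p.data - lr * (p.grad + wd * p.data)   # manual update rule, computed before step()
optimizer.step()
print(torch.allclose(p.data, expected))           # True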

6, Learning rate update lr_scheduler.step()
During training we need a mechanism to adjust the learning rate; for this we can use the torch.optim.lr_scheduler module, which provides several methods that adjust the learning rate according to the number of training epochs. Usually the learning rate is set to decrease gradually as the epochs increase, which gives better training results.
The following describes one adjustment policy: the StepLR mechanism.

class torch.optim.lr_scheduler.StepLR(optimizer, step_size, gamma=0.1, last_epoch=-1)

Parameters:
optimizer (Optimizer): the optimizer whose learning rate will be adjusted;
step_size (int): period of learning rate decay, i.e. the learning rate is updated once every step_size epochs;
gamma (float): multiplicative factor of learning rate decay;
last_epoch (int): the index of the last epoch. If training is resumed after being interrupted, this value equals the epoch of the loaded model. The default is -1, which means training starts from scratch.
Update policy:
Every step_size epochs, the learning rate is updated once:

new_lr = lr × gamma

where new_lr is the new learning rate and lr is the current learning rate; equivalently, after the n-th decay the learning rate is initial_lr × gamma^n.

Examples:

import torch
import torch.nn as nn
from torch.optim.lr_scheduler import StepLR

initial_lr = 0.1

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=3, kernel_size=3)

    def forward(self, x):
        pass

net_1 = Model()
optimizer_1 = torch.optim.Adam(net_1.parameters(), lr=initial_lr)
scheduler_1 = StepLR(optimizer_1, step_size=3, gamma=0.1)

print("Initial learning rate:", optimizer_1.defaults['lr'])
for epoch in range(1, 11):
    # train ...
    optimizer_1.zero_grad()
    optimizer_1.step()
    print("Epoch %d learning rate: %f" % (epoch, optimizer_1.param_groups[0]['lr']))
    scheduler_1.step()

Output is:

Initial learning rate: 0.1
Epoch 1 learning rate: 0.100000
Epoch 2 learning rate: 0.100000
Epoch 3 learning rate: 0.100000
Epoch 4 learning rate: 0.010000
Epoch 5 learning rate: 0.010000
Epoch 6 learning rate: 0.010000
Epoch 7 learning rate: 0.001000
Epoch 8 learning rate: 0.001000
Epoch 9 learning rate: 0.001000
Epoch 10 learning rate: 0.000100
