Python source code analysis: Optimizer class

Most of PyTorch's Optimizer class is implemented in Python; C++ is only used for the numerical computation, so we can keep following the code here.

Overview

The Optimizer class is the base class that every concrete optimizer class (SGD, Adam, and so on) inherits from.

Here, I'll take the SGD class as an example and walk through it from the bottom up.

There are only two important member variables in the Optimizer class: self.param_groups and self.state.

self.param_groups stores the model parameters together with the optimizer's own hyperparameters (such as the learning rate).

self.state stores the temporary per-parameter state needed during the update process. For example, in SGD with momentum each parameter needs a corresponding momentum buffer, and a single parameter may need more than one piece of temporary state. self.state is therefore a dictionary (a defaultdict) whose key-value pairs have the form parameter: dict.
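
To make this concrete, here is a quick hedged illustration (the variable names are just for demonstration): before the first call to step(), self.state is still empty.

import torch

w = torch.tensor([1.0], requires_grad=True)
opt = torch.optim.SGD([w], lr=0.01, momentum=0.9)

print(len(opt.param_groups))   # 1 group containing w plus lr, momentum, ...
print(dict(opt.state))         # {} -- only filled with momentum buffers after step()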

The only important method in the Optimizer class is add_param_group, which is used to initialize self.param_groups.

The initialization of self.state is left to each concrete optimizer class.

How is self.param_groups initialized?

self.param_groups is initialized in the Optimizer class's __init__ method.

Let's first look at the initialization method of the SGD class. It packages optimizer hyperparameters such as lr and momentum into the dictionary defaults and then passes them, together with the model parameters params, to the initialization method of the Optimizer class.

class SGD(Optimizer):
    def __init__(self, params, lr=required, momentum=0, dampening=0,
                 weight_decay=0, nesterov=False):
        if lr is not required and lr < 0.0:
            raise ValueError("Invalid learning rate: {}".format(lr))
        if momentum < 0.0:
            raise ValueError("Invalid momentum value: {}".format(momentum))
        if weight_decay < 0.0:
            raise ValueError("Invalid weight_decay value: {}".format(weight_decay))
        # Package the optimizer hyperparameters into defaults
        defaults = dict(lr=lr, momentum=momentum, dampening=dampening,
                        weight_decay=weight_decay, nesterov=nesterov)
        if nesterov and (momentum <= 0 or dampening != 0):
            raise ValueError("Nesterov momentum requires a momentum and zero dampening")
        super(SGD, self).__init__(params, defaults)

In the initialization method of the Optimizer class, defaults is simply stored as self.defaults. params is first converted to a list; if that list does not already contain dicts, it is wrapped into a single dict inside a list. self.add_param_group is then called on each of these dicts.

So far we still haven't seen how self.param_groups gets initialized, so we need to keep going into the add_param_group method.

Be careful to distinguish self.param_groups from the local variable param_groups here.

class Optimizer(object):
    def __init__(self, params, defaults):
        torch._C._log_api_usage_once("python.optimizer")
        self.defaults = defaults

        self._hook_for_profile()

        if isinstance(params, torch.Tensor):
            raise TypeError("params argument given to the optimizer should be "
                            "an iterable of Tensors or dicts, but got " +
                            torch.typename(params))
        #self.state initialization
        self.state = defaultdict(dict)
        self.param_groups = []
        param_groups = list(params)
        if len(param_groups) == 0:
            raise ValueError("optimizer got an empty parameter list")
        # Usually param_groups[0] is a Parameter (a Tensor), not a dict.
        # This checks whether params has already been wrapped into dicts.
        if not isinstance(param_groups[0], dict):
            param_groups = [{'params': param_groups}]
        # This loop iterates over the parameter groups, not over every individual parameter.
        for param_group in param_groups:
            # Here param_group is equivalent to {'params': param_groups}
            self.add_param_group(param_group)

add_param_group source code:

def add_param_group(self, param_group):
    r"""Add a param group to the :class:`Optimizer` s `param_groups`.

    This can be useful when fine tuning a pre-trained network as frozen layers can be made
    trainable and added to the :class:`Optimizer` as training progresses.

    Args:
        param_group (dict): Specifies what Tensors should be optimized along with group
        specific optimization options.
    """
    # The first part is just boundary-condition checking
    assert isinstance(param_group, dict), "param group must be a dict"
    params = param_group['params']
    if isinstance(params, torch.Tensor):
        param_group['params'] = [params]
    elif isinstance(params, set):
        raise TypeError('optimizer parameters need to be organized in ordered collections, but '
                        'the ordering of tensors in sets will change between runs. Please use a list instead.')
    else:
        param_group['params'] = list(params)

    for param in param_group['params']:
        if not isinstance(param, torch.Tensor):
            raise TypeError("optimizer can only optimize Tensors, "
                            "but one of the params is " + torch.typename(param))
        if not param.is_leaf:
            raise ValueError("can't optimize a non-leaf Tensor")
    # The actual initialization of self.param_groups starts here:
    # merge the optimizer defaults into param_group
    for name, default in self.defaults.items():
        if default is required and name not in param_group:
            raise ValueError("parameter group didn't specify a value of required optimization parameter " +
                                name)
        else:
            param_group.setdefault(name, default)

    params = param_group['params']
    if len(params) != len(set(params)):
        warnings.warn("optimizer contains a parameter group with duplicate parameters; "
                        "in future, this will cause an error; "
                        "see github.com/pytorch/pytorch/issues/40967 for more information", stacklevel=3)

    param_set = set()
    for group in self.param_groups:
        param_set.update(set(group['params']))

    if not param_set.isdisjoint(set(param_group['params'])):
        raise ValueError("some parameters appear in more than one parameter group")
    # Finally, the param_group dict is appended to the (initially empty) list
    # self.param_groups, which completes the initialization
    self.param_groups.append(param_group)

What add_param_group does to param_group is actually very simple: it merges the optimizer hyperparameters in self.defaults into param_group, and then appends param_group to self.param_groups.
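
The docstring above mentions fine-tuning as the typical use case for calling add_param_group directly. Here is a minimal sketch of that pattern; the module names (backbone, head) are purely illustrative.

import torch
import torch.nn as nn

backbone = nn.Linear(8, 8)   # pretend this is a frozen pre-trained backbone
head = nn.Linear(8, 2)       # a freshly initialized classifier head

# Start by optimizing only the head
optimizer = torch.optim.SGD(head.parameters(), lr=0.1, momentum=0.9)

# ...later in training, unfreeze the backbone and give it its own learning rate
optimizer.add_param_group({'params': backbone.parameters(), 'lr': 0.01})

print(len(optimizer.param_groups))   # 2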

Next, let's verify this with a practical example:

import torch
X = torch.tensor([1.0], requires_grad=True)
Y = torch.tensor([2.0], requires_grad=True)
optimizer = torch.optim.SGD([X, Y], lr=0.001)
print(optimizer.param_groups)
"""
Output results:
[
{
'params': [tensor([1.], requires_grad=True), tensor([2.], requires_grad=True)], 
'lr': 0.001, 
'momentum': 0, 
'dampening': 0, 
'weight_decay': 0, 
'nesterov': False
}
]
"""

How is self.state updated?

The initialization of self.param_groups described above is essentially complete. Next, consider the initialization and update of self.state, since self.state is updated on every iteration. The optimizer's update logic lives in the step method, but the Optimizer base class does not implement step; each subclass has to implement it itself. So I will again take SGD as an example to walk through the optimizer's update process.

Plain SGD does not need self.state at all; PyTorch's SGD only uses self.state when momentum is enabled. Momentum simply means storing past gradient information: compared with plain SGD, which updates based only on the current gradient, SGD with momentum updates based on the current plus past gradients and converges faster.
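
As a rough sketch of the update rule (assuming dampening = 0 and no weight decay; the values are illustrative), the momentum buffer and the parameter are updated as buf = momentum * buf + grad and param = param - lr * buf, mirroring the F.sgd code shown later:

import torch

lr, momentum = 0.1, 0.9
param = torch.tensor([1.0])
grad = torch.tensor([0.5])
buf = torch.zeros_like(param)    # momentum buffer carried over from the previous step

buf = momentum * buf + grad      # accumulate past + current gradient information
param = param - lr * buf         # parameter update
print(param)                     # tensor([0.9500])

PyTorch's actual SGD.step implementation follows: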

# torch.optim.SGD.step (the method body is shown here without the class-level indentation)
@torch.no_grad()
def step(self, closure=None):
    """Performs a single optimization step.

    Args:
        closure (callable, optional): A closure that reevaluates the model
            and returns the loss.
    """
    loss = None
    if closure is not None:
        with torch.enable_grad():
            loss = closure()

    for group in self.param_groups:
        # Parameters that have a gradient
        params_with_grad = []
        # The gradient of each such parameter
        d_p_list = []
        # The momentum buffer of each such parameter
        momentum_buffer_list = []
        # Weight decay (L2 regularization) coefficient
        weight_decay = group['weight_decay']
        # Momentum coefficient
        momentum = group['momentum']
        # Dampening factor (scales down the current gradient's contribution)
        dampening = group['dampening']
        nesterov = group['nesterov']
        # Learning rate
        lr = group['lr']
        # Fetch each parameter's momentum_buffer from self.state
        # to build momentum_buffer_list.
        # At this point each buffer contains only past gradient information.
        for p in group['params']:
            if p.grad is not None:
                params_with_grad.append(p)
                d_p_list.append(p.grad)
                # defaultdict: a missing key is automatically initialized to an empty dict
                state = self.state[p]
                if 'momentum_buffer' not in state:
                    momentum_buffer_list.append(None)
                else:
                    momentum_buffer_list.append(state['momentum_buffer'])
        # Update the parameters, and also update each momentum_buffer.
        # After this call, each momentum_buffer contains past + current gradient information.
        F.sgd(params_with_grad,
                d_p_list,
                momentum_buffer_list,
                weight_decay=weight_decay,
                momentum=momentum,
                lr=lr,
                dampening=dampening,
                nesterov=nesterov)
        # momentum_buffer_list was filled by appending references taken from self.state,
        # so a buffer created inside F.sgd (on the first iteration) exists only in the list,
        # not yet in self.state. The loop below writes the buffers back so that the next
        # iteration can read them from self.state again.
        # update momentum_buffers in state
        for p, momentum_buffer in zip(params_with_grad, momentum_buffer_list):
            # Each parameter has its own state dict;
            # here each parameter gets one momentum buffer.
            state = self.state[p]
            state['momentum_buffer'] = momentum_buffer

    return loss

Here you can see that the step method is decorated with @torch.no_grad(), because the leaf tensors (the parameters) need to be updated in place without autograd recording the operation. I have covered this point before, so I won't repeat it here.
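
A small hedged illustration of that point: updating a leaf tensor that requires grad in place outside of no_grad raises an error, while the same update is fine inside torch.no_grad().

import torch

w = torch.tensor([1.0], requires_grad=True)

try:
    w.add_(-0.1)           # in-place update while autograd is tracking the leaf
except RuntimeError as e:
    print(e)               # explains that a leaf requiring grad cannot be modified in place

with torch.no_grad():
    w.add_(-0.1)           # fine: no autograd history is recorded
print(w)                   # tensor([0.9000], requires_grad=True)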

The step method updates the parameters in three steps:

First, pull each momentum_buffer out of self.state and collect them into momentum_buffer_list.

Second, update the parameters; momentum_buffer_list is updated at the same time.

Third, write the updated momentum_buffer_list back into the momentum_buffer entries of self.state.
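
To see these three steps in action, here is a small hedged sketch: after a couple of steps with momentum enabled, self.state holds a momentum_buffer for the parameter (the expected value is worked out by hand from the update rule above).

import torch

w = torch.tensor([1.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1, momentum=0.9)

for _ in range(2):
    optimizer.zero_grad()
    loss = (w ** 2).sum()    # d(loss)/dw = 2w
    loss.backward()
    optimizer.step()

print(optimizer.state[w]['momentum_buffer'])   # tensor([3.4000]) = 0.9 * 2.0 + 1.6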

The real update work happens in F.sgd. I have annotated it below; have a look if you are interested.

def sgd(params: List[Tensor],
        d_p_list: List[Tensor],
        momentum_buffer_list: List[Optional[Tensor]],
        *,
        weight_decay: float,
        momentum: float,
        lr: float,
        dampening: float,
        nesterov: bool):
    r"""Functional API that performs SGD algorithm computation.

    See :class:`~torch.optim.SGD` for details.
    """
    #Traverse all parameters
    for i, param in enumerate(params):
        #Take out the gradient corresponding to the parameter
        d_p = d_p_list[i]
        # Add the weight-decay (L2 regularization) term to the gradient
        if weight_decay != 0:
            d_p = d_p.add(param, alpha=weight_decay)
        #Take out the momentum from the last iteration (prepare to update the momentum)
        if momentum != 0:
            buf = momentum_buffer_list[i]
            #Momentum initialization at the first iteration
            if buf is None:
                buf = torch.clone(d_p).detach()
                momentum_buffer_list[i] = buf
            # The momentum update is an in-place operation:
            # buf = buf * momentum + (1 - dampening) * d_p
            else:
                buf.mul_(momentum).add_(d_p, alpha=1 - dampening)
            # nesterov update method
            if nesterov:
                d_p = d_p.add(buf, alpha=momentum)
            # Update method of conventional momentum
            else:
                d_p = buf
        #Parameter update
        param.add_(d_p, alpha=-lr)
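
As a quick hand check of the final param.add_(d_p, alpha=-lr) line (with momentum = 0 the whole loop reduces to param -= lr * grad; the values are illustrative):

import torch

x = torch.tensor([1.0], requires_grad=True)
opt = torch.optim.SGD([x], lr=0.1)

loss = (x ** 2).sum()    # d(loss)/dx = 2x = 2.0
loss.backward()
opt.step()

print(x)                 # tensor([0.8000], requires_grad=True) = 1.0 - 0.1 * 2.0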

Summary

This post only covers the parts of the optimizer source code related to the update step. The Optimizer class has many other methods that I don't need at the moment, so I will leave them for another time.
