[Source code analysis] How PyTorch implements forward propagation (3) - specific implementation

0x00 summary

This series of about ten articles analyzes how PyTorch implements automatic differentiation. This article is the third part on forward propagation and covers the concrete implementation mechanism.

During back propagation, when the engine gets a tensor it needs to know:

- How to compute the gradient for this tensor, that is, where to find the function F that computes the gradient.
- The meta information of the input of F. The input of F is the tensor itself, so F needs to know the tensor's type, shape and device.
- Where the output of F should be propagated once the gradient is computed, that is, how to take the next step on the back propagation graph.

This article analyzes how this information is set up during forward propagation.

The previous articles in this series are linked as follows:

Automatic differentiation of deep learning tools (1)

Automatic differentiation of deep learning tools (2)

[Source code analysis] Automatic differentiation of deep learning tools (3) - example interpretation

[Source code analysis] How PyTorch implements forward propagation (1) - basic classes (1)

[Source code analysis] How PyTorch implements forward propagation (2) - basic classes (2)

0x01 computation graph

1.1 graph-related classes

A computation graph is a directed graph. Its nodes are operators or data (leaf nodes), and the arrows indicate the direction of data flow, from the input Node to the output Node. As the previous articles showed, three basic classes relate to the graph: Node, Edge and Engine (Engine is analyzed later).

Node is the class that represents an operation.
- Each Node receives 0 or more Variables and outputs 0 or more Variables. Nodes are connected by Edges, concretely through the Node's member variable next_edges_.
- Back propagation functions inherit from Node. For example, SubBackward0 inherits from Node.
Edge is essentially a pair (Node, input_nr).
- The Edge member variable std::shared_ptr<Node> function specifies the Node that this edge points to.
- The Edge member variable uint32_t input_nr specifies which input of function this edge feeds.
The Node member next_edges_ is a list of Edge instances. It represents the (other) Nodes that receive the return values of this Node instance, so next_edges_ is the link between Nodes. When the computation graph is executed, Variables flow along these edges.
Engine is the execution engine.
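The relationship between Node and Edge described above can be mimicked in a few lines of Python. This is only a minimal sketch and not PyTorch's implementation: the classes `Edge` and `Node` below are toy stand-ins, and the node names are hypothetical labels for the graph of Q = X - Z.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Edge:
    """Toy counterpart of Edge: a (Node, input_nr) pair."""
    function: Optional["Node"]  # the Node this edge points to
    input_nr: int               # which input of that Node the edge feeds


@dataclass
class Node:
    """Toy counterpart of Node: one backward operation."""
    name: str
    next_edges: List[Edge] = field(default_factory=list)


# Hypothetical graph for Q = X - Z, where X and Z both come from a pow.
pow_x = Node("PowBackward0(X)")
pow_z = Node("PowBackward0(Z)")
sub = Node("SubBackward0", [Edge(pow_x, 0), Edge(pow_z, 0)])

# Following next_edges from SubBackward0 reaches the two PowBackward0 nodes.
targets = [(e.function.name, e.input_nr) for e in sub.next_edges]
print(targets)
```

Walking `next_edges` from the last node is, in spirit, what the engine does during back propagation.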

1.2 dynamic graph

PyTorch uses dynamic computation graphs. "Dynamic" means that the back propagation graph is rebuilt on every iteration: each forward pass builds a new graph, and once the back propagation over it completes, the graph is destroyed and its memory is released. To use it again in a new round, it has to be built again from scratch. This lets users change the shape and size of the network between iterations.

The following code shows this behavior.

    import torch

    # First time: the forward pass builds the dynamic graph
    a = torch.tensor(2., requires_grad=True)
    b = torch.tensor(6., requires_grad=True)
    Q = 3*a**3 - b**2
    external_grad = torch.tensor(1.)
    Q.backward(gradient=external_grad)  # normal
    Q.backward(gradient=external_grad)  # RuntimeError

    # Second time: build the graph again
    a = torch.tensor(2., requires_grad=True)
    b = torch.tensor(6., requires_grad=True)
    Q = 3*a**3 - b**2
    external_grad = torch.tensor(1.)
    Q.backward(gradient=external_grad)  # normal

1.3 dynamic graph illustration

The official PyTorch animation of the dynamic graph gives an intuitive picture of the process.

To show it more clearly, we break the animation into steps.

The first is to declare some tensors.

Second, multiply the two matrices.

Multiply the other two matrices.

Then add the two multiplication results.

Add Tanh activation function.

Add loss function.

Back propagation, calculate the gradient.

As the steps show, the dynamic graph is constructed during the forward computation.

0x02 overall analysis

We refine the example code above so that every tensor in the computation graph is visible:

    import torch

    a = torch.tensor(2., requires_grad=True)
    b = torch.tensor(6., requires_grad=True)
    X = a ** 3
    Y = 3 * X
    Z = b ** 2
    Q = X - Z
    external_grad = torch.tensor(1.)
    Q.backward(gradient=external_grad)
    print(a.grad)

The runtime variables are as follows. Since Q = X - Z is a subtraction, the corresponding backward operation is SubBackward0:

    Q = tensor(-28., grad_fn=<SubBackward0>)
    X = tensor(8., grad_fn=<PowBackward0>)
    Y = tensor(24., grad_fn=<MulBackward0>)
    Z = tensor(36., grad_fn=<PowBackward0>)
    a = tensor(2., requires_grad=True)
    b = tensor(6., requires_grad=True)

We can compare this with a visual representation of the DAG. In the figure, the arrows point in the direction of the forward pass, and each node represents the backward function of an operation in the forward pass. The two blue leaf nodes represent our leaf tensors a and b.

At the code level, PyTorch does not explicitly construct a back propagation graph during forward propagation. Instead, it sets up several data structures that together form a virtual graph relationship; there is no dedicated graph data structure. In the forward propagation of each iteration, for Q = X - Z, the following operations are performed:

1) Enter the subtraction: the subtraction is dispatched to a device, where Q will be built.
2) First, set up how to back propagate: when the call is dispatched to VariableType, the autograd information of Q is built first.
- Build an instance of SubBackward0, the backward function of subtraction.
- Initialize next_edges_ and the other relevant members of the SubBackward0 instance. The value of next_edges_ comes from the forward inputs X and Z.
  - If an input Variable is a leaf node, its entry in next_edges_ comes from the input Variable's grad_accumulator_.
  - If an input Variable is a non-leaf node, its entry in next_edges_ comes from the input Variable's grad_fn_.
- Use the new Variable instance from step 3) (that is, the forward result Q) to initialize input_metadata_ of the SubBackward0 instance.
- At this point we know how to back propagate through Q, but only how to compute it; the computation is not yet associated with Q.
3) Then connect the forward computation with back propagation: the forward operation yields a new Variable, which is Q. The SubBackward0 instance from step 2) is used to initialize Q's autograd_meta_->grad_fn_ member. When Q is back propagated, the engine knows to use Q's autograd_meta_->grad_fn_ member, which is the SubBackward0 from step 2).
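The steps above can be sketched in plain Python. This is a toy model under stated assumptions, not PyTorch code: `ToyNode`, `ToyVariable`, `toy_gradient_edge` and `toy_sub` are hypothetical names, values are Python floats, and the recorded metadata is reduced to a type name.

```python
class ToyNode:
    """Toy stand-in for a backward Node such as SubBackward0."""

    def __init__(self, name):
        self.name = name
        self.next_edges = []      # list of (ToyNode or None, input_nr)
        self.input_metadata = []  # one entry per input of this backward node


class ToyVariable:
    """Toy stand-in for a Variable with its autograd metadata inlined."""

    def __init__(self, value, requires_grad=False, grad_fn=None):
        self.value = value
        self.requires_grad = requires_grad or grad_fn is not None
        self.grad_fn = grad_fn        # set only on non-leaf results
        self.grad_accumulator = None  # created on demand for leaves
        self.output_nr = 0


def toy_gradient_edge(var):
    if var.grad_fn is not None:       # non-leaf: the edge points to grad_fn
        return (var.grad_fn, var.output_nr)
    if not var.requires_grad:         # nothing to back propagate into
        return (None, 0)
    if var.grad_accumulator is None:  # leaf: the edge points to the accumulator
        var.grad_accumulator = ToyNode("AccumulateGrad")
    return (var.grad_accumulator, 0)


def toy_sub(x, z):
    grad_fn = None
    if x.requires_grad or z.requires_grad:
        # Step 2: build the backward node and its outgoing edges first.
        grad_fn = ToyNode("SubBackward0")
        grad_fn.next_edges = [toy_gradient_edge(x), toy_gradient_edge(z)]
    # Step 3: run the forward computation and attach grad_fn to the result.
    q = ToyVariable(x.value - z.value, grad_fn=grad_fn)
    if grad_fn is not None:
        # Record metadata about q (here only its Python type name).
        grad_fn.input_metadata.append(type(q.value).__name__)
    return q


x = ToyVariable(8.0, grad_fn=ToyNode("PowBackward0"))  # non-leaf input
b = ToyVariable(6.0, requires_grad=True)               # leaf input
q = toy_sub(x, b)
edge_names = [node.name for node, _ in q.grad_fn.next_edges]
print(q.value, q.grad_fn.name, edge_names)
```

The non-leaf input contributes its grad_fn to the edge list, while the leaf input contributes a gradient accumulator, which mirrors the two sub-cases of step 2).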

Roughly as shown in the figure below:

    +-----------------------+          +---------------------------------------+
    | Q                     |          | DifferentiableViewMeta                |
    |                       |          |                                       |
    |    autograd_meta_ +------------> |   grad_         grad_accumulator_     |
    |                       |          |                                       |
    +-----------------------+          |                                       |
                  +------------------------+ grad_fn_    output_nr_            |   Q: find out how to compute the gradient
                  |                    |                                       |
                  |                    +---------------------------------------+
                  v
    +-------------+------------+                 +----------------------+
    | SubBackward0             |                 |                      |
    |                          |                 | Compute the gradient |   How to compute the gradient
    |    apply +-------------------------------> |                      |
    |                          |                 +----------------------+
    |                          |
    |                          |       +-----------------------------------------------------+
    |    next_edges_ +---------------> | edge_list                                           |
    |                          |       |                                                     |
    |    other_scalar_type     |       | [(PowBackward0(self), 0), (PowBackward0(other), 0)] |   output
    |                          |       |                                                     |
    |    alpha                 |       +-----------------------------------------------------+
    |                          |
    |    self_scalar_type      |       +----------------------------------------+
    |                          |       |                                        |
    |    input_metadata_ +-----------> | [(type of Q, shape of Q, device of Q)] |   input
    |                          |       |                                        |
    +--------------------------+       +----------------------------------------+

Forward computation creates a series of Node instances in the computation graph, so we analyze these nodes first.

0x03 Node inheritance system

Let's start with the lowest node SubBackward0 in the figure above.

3.1 inheritance system

SubBackward0 is defined in torch/include/torch/csrc/autograd/generated/Functions.h.

    struct TORCH_API SubBackward0 : public TraceableFunction {
      using TraceableFunction::TraceableFunction;
      variable_list apply(variable_list&& grads) override;
      std::string name() const override { return "SubBackward0"; }
      void release_variables() override {
      }

      at::ScalarType other_scalar_type;
      at::Scalar alpha;
      at::ScalarType self_scalar_type;
    };

Let's look at the inheritance system of SubBackward0.

    class SubBackward0 : public TraceableFunction
    class TraceableFunction : public Node

    /// See Node::is_traceable() for definition.
    struct TraceableFunction : public Node {
      using Node::Node;
      bool is_traceable() final {
        return true;
      }
    };

Therefore, SubBackward0 is a Node type.

3.2 Node

We introduced Node earlier. The Node class represents an operation. Each Node receives 0 or more Variables and outputs 0 or more Variables. Nodes are connected by Edges, concretely through the Node's member variable next_edges_. Back propagation functions all inherit from Node.

An excerpt of the Node code follows:

    struct TORCH_API Node : std::enable_shared_from_this<Node> {

      /// Performs the `Node`'s actual operation.
      virtual variable_list apply(variable_list&& inputs) = 0;

      const uint64_t sequence_nr_;
      uint64_t topological_nr_ = 0;

      // The edges associated with the forward operator; they correspond to
      // the input variables of the forward pass.
      edge_list next_edges_;
      std::vector<std::unique_ptr<FunctionPreHook>> pre_hooks_;
      std::vector<std::unique_ptr<FunctionPostHook>> post_hooks_;
      at::SmallVector<InputMetadata, 2> input_metadata_;

      // operator() is overloaded here; its core is the call to apply()
      variable_list operator()(variable_list&& inputs) {
        bool pre_sampled = false;
        if (at::shouldRunRecordFunction(&pre_sampled)) {
          return apply(std::move(inputs));
        } else {
          return apply(std::move(inputs));
        }
      }
    };

As you can see, apply(variable_list&& inputs) is a pure virtual function that derived classes must implement. The apply function is the core of a Node and holds the execution logic of the back propagation computation. The apply function of each derived class is invoked through C++ polymorphism.

3.3 SubBackward0

The apply function of SubBackward0 is shown below, and you can follow its derivative computation. The code is located in torch/csrc/autograd/generated/Functions.cpp.

    variable_list SubBackward0::apply(variable_list&& grads) {
      IndexRangeGenerator gen;
      auto self_ix = gen.range(1);
      auto other_ix = gen.range(1);
      variable_list grad_inputs(gen.size());
      auto& grad = grads[0];
      bool any_grad_defined = any_variable_defined(grads);
      if (should_compute_output({ other_ix })) {
        // Compute the gradient for other
        auto grad_result = any_grad_defined ? (handle_r_to_c(other_scalar_type, -grad * alpha.conj())) : Tensor();
        copy_range(grad_inputs, other_ix, grad_result); // Copy the result into grad_inputs
      }
      if (should_compute_output({ self_ix })) {
        // Compute the gradient for self
        auto grad_result = any_grad_defined ? (handle_r_to_c(self_scalar_type, grad)) : Tensor();
        copy_range(grad_inputs, self_ix, grad_result); // Copy the result into grad_inputs
      }
      return grad_inputs; // Return grad_inputs
    }

We can verify this against the tools/autograd/derivatives.yaml file, which maps each forward operator to its backward formula. It can be understood as the table of atomic operations that the autograd engine looks up when it applies the chain rule in reverse. The relevant entries are below; they show that the derivative formulas of both addition and subtraction use handle_r_to_c.

    - name: add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor
      self: handle_r_to_c(self.scalar_type(), grad)
      other: handle_r_to_c(other.scalar_type(), maybe_multiply(grad, alpha.conj()))
      result: self_t + maybe_multiply(other_t, alpha)

    - name: sub.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor
      self: handle_r_to_c(self.scalar_type(), grad)
      other: handle_r_to_c(other.scalar_type(), -grad * alpha.conj())

handle_r_to_c is defined as follows. It converts a complex gradient back to a real one when the corresponding input is real.

    Tensor handle_r_to_c(ScalarType self_st, Tensor gradient_result) {
      if (!at::isComplexType(self_st) && gradient_result.is_complex()) {
        // R -> C
        return at::real(gradient_result);
      }
      return gradient_result;
    }
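The effect of handle_r_to_c can be illustrated with Python's built-in complex numbers. This is a simplified sketch under assumptions: a boolean flag `self_is_complex` replaces the ScalarType check, and plain Python numbers replace tensors.

```python
def handle_r_to_c(self_is_complex, gradient_result):
    """Toy version: drop the imaginary part when the input was real."""
    if not self_is_complex and isinstance(gradient_result, complex):
        # R -> C case: a real input fed a complex computation,
        # so only the real part of the gradient flows back.
        return gradient_result.real
    return gradient_result


real_input_grad = handle_r_to_c(False, 3 + 4j)    # real input, complex gradient
complex_input_grad = handle_r_to_c(True, 3 + 4j)  # complex input keeps the gradient
print(real_input_grad, complex_input_grad)
```

For purely real computations, such as the example in this article, the function simply returns the gradient unchanged.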

Use the code to confirm:

    a = torch.tensor(2., requires_grad=True)
    b = torch.tensor(6., requires_grad=True)
    Q = a - b
    external_grad = torch.tensor(1.)
    Q.backward(gradient=external_grad)

The runtime state is then as follows:

    a = tensor(2., requires_grad=True)
      T = tensor(2., grad_fn=<PermuteBackward>)
      data = tensor(2.)
      device = cpu
      dtype = torch.float32
      grad = tensor(1.)
      grad_fn = None

    b = tensor(6., requires_grad=True)
      T = tensor(6., grad_fn=<PermuteBackward>)
      data = tensor(6.)
      device = cpu
      dtype = torch.float32
      grad = tensor(-1.)
      grad_fn = None

    Q = tensor(-4., grad_fn=<SubBackward0>)
      T = tensor(-4., grad_fn=<PermuteBackward>)
      data = tensor(-4.)
      device = cpu
      dtype = torch.float32
      grad = None
      grad_fn = <SubBackward0 object at 0x7fb76e365438>
        metadata = {}
        next_functions =
          0 = (<AccumulateGrad object at 0x7fb76e344978>, 0)
          1 = (<AccumulateGrad object at 0x7fb76e3447b8>, 0)
          __len__ = 2
        requires_grad = True
      is_cuda = False
      is_leaf = False
      is_meta = False
      is_mkldnn = False
      is_mlc = False
      is_quantized = False
      is_sparse = False
      is_sparse_csr = False
      is_vulkan = False
      is_xpu = False
      layout = torch.strided
      name = None
      names = ()
      ndim = 0
      output_nr = 0
      requires_grad = True
      shape = torch.Size([])

Let's move on to a few other nodes.

3.4 PowBackward0

PowBackward0 is defined as follows.

    struct TORCH_API PowBackward0 : public TraceableFunction {
      using TraceableFunction::TraceableFunction;
      variable_list apply(variable_list&& grads) override;
      std::string name() const override { return "PowBackward0"; }
      void release_variables() override {
        std::lock_guard<std::mutex> lock(mutex_);
        self_.reset_data();
      }

      SavedVariable self_;
      at::Scalar exponent;
    };

    variable_list PowBackward0::apply(variable_list&& grads) {
      std::lock_guard<std::mutex> lock(mutex_);
      IndexRangeGenerator gen;
      auto self_ix = gen.range(1);
      variable_list grad_inputs(gen.size());
      auto& grad = grads[0];
      auto self = self_.unpack();
      bool any_grad_defined = any_variable_defined(grads);
      if (should_compute_output({ self_ix })) {
        auto grad_result = any_grad_defined ? (pow_backward(grad, self, exponent)) : Tensor();
        copy_range(grad_inputs, self_ix, grad_result);
      }
      return grad_inputs;
    }

In tools/autograd/derivatives.yaml we can see that pow_backward is used.

    - name: pow.Tensor_Scalar(Tensor self, Scalar exponent) -> Tensor
      self: pow_backward(grad, self, exponent)
      result: auto_element_wise

In the end, pow_backward also uses handle_r_to_c.

    Tensor pow_backward(Tensor grad, const Tensor & self, const Scalar & exponent) {
      if (exponent.equal(0.0)) {
        return at::zeros_like(self, LEGACY_CONTIGUOUS_MEMORY_FORMAT);
      } else {
        auto grad_lambda = [&](auto exp) { return grad * (exp * self.pow(exp - 1)).conj(); };
        Tensor out = (exponent.isComplex()) ? grad_lambda(exponent.toComplexDouble()) : grad_lambda(exponent.toDouble());
        return handle_r_to_c(self, out);
      }
    }
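To make the formulas concrete, the sketch below chains a scalar version of the sub rule from derivatives.yaml with a scalar version of pow_backward for the example of section 0x02 (X = a ** 3, Z = b ** 2, Q = X - Z). It assumes real-valued Python floats, so conj() and handle_r_to_c have no effect and are left out. The function names `sub_backward` and `pow_backward_scalar` are made up for this illustration.

```python
def sub_backward(grad, alpha=1):
    """Scalar form of the sub.Tensor rule: (grad for self, grad for other)."""
    return grad, -grad * alpha


def pow_backward_scalar(grad, self_value, exponent):
    """Scalar form of pow_backward for real inputs."""
    if exponent == 0:
        return 0.0
    return grad * (exponent * self_value ** (exponent - 1))


a, b = 2.0, 6.0
external_grad = 1.0

# Back propagate through Q = X - Z, then through X = a ** 3 and Z = b ** 2.
grad_X, grad_Z = sub_backward(external_grad)
a_grad = pow_backward_scalar(grad_X, a, 3)
b_grad = pow_backward_scalar(grad_Z, b, 2)
print(a_grad, b_grad)  # 12.0 -12.0
```

The results agree with the analytic derivatives dQ/da = 3a^2 and dQ/db = -2b at a = 2 and b = 6.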

3.5 MulBackward0

MulBackward0 is defined as follows.

    struct TORCH_API MulBackward0 : public TraceableFunction {
      using TraceableFunction::TraceableFunction;
      variable_list apply(variable_list&& grads) override;
      std::string name() const override { return "MulBackward0"; }
      void release_variables() override {
        std::lock_guard<std::mutex> lock(mutex_);
        self_.reset_data();
        other_.reset_data();
      }

      SavedVariable self_;
      at::ScalarType other_scalar_type;
      at::ScalarType self_scalar_type;
      SavedVariable other_;
    };

    variable_list MulBackward0::apply(variable_list&& grads) {
      std::lock_guard<std::mutex> lock(mutex_);
      IndexRangeGenerator gen;
      auto self_ix = gen.range(1);
      auto other_ix = gen.range(1);
      variable_list grad_inputs(gen.size());
      auto& grad = grads[0];
      auto self = self_.unpack();
      auto other = other_.unpack();
      bool any_grad_defined = any_variable_defined(grads);
      if (should_compute_output({ other_ix })) {
        auto grad_result = any_grad_defined ? (mul_tensor_backward(grad, self, other_scalar_type)) : Tensor();
        copy_range(grad_inputs, other_ix, grad_result);
      }
      if (should_compute_output({ self_ix })) {
        auto grad_result = any_grad_defined ? (mul_tensor_backward(grad, other, self_scalar_type)) : Tensor();
        copy_range(grad_inputs, self_ix, grad_result);
      }
      return grad_inputs;
    }

In tools/autograd/derivatives.yaml we can see that mul_tensor_backward is used.

    - name: mul.Tensor(Tensor self, Tensor other) -> Tensor
      self: mul_tensor_backward(grad, other, self.scalar_type())
      other: mul_tensor_backward(grad, self, other.scalar_type())
      result: other_t * self_p + self_t * other_p

In the end, mul_tensor_backward also uses handle_r_to_c.

    Tensor mul_tensor_backward(Tensor grad, Tensor other, ScalarType self_st) {
      auto out = grad * other.conj();
      return handle_r_to_c(self_st, out);
    }

3.6 PermuteBackward

PermuteBackward is not shown in the figure above, but it does exist: it appears in the runtime dump above as the grad_fn of each tensor's T attribute. PermuteBackward is defined as follows:

    struct TORCH_API PermuteBackward : public Node {
      using Node::Node;
      variable_list apply(variable_list&& grads) override;
      std::string name() const override { return "PermuteBackward"; }
      void release_variables() override {
      }

      std::vector<int64_t> dims;
    };

    variable_list PermuteBackward::apply(variable_list&& grads) {
      IndexRangeGenerator gen;
      auto self_ix = gen.range(1);
      variable_list grad_inputs(gen.size());
      auto& grad = grads[0];
      bool any_grad_defined = any_variable_defined(grads);
      if (should_compute_output({ self_ix })) {
        auto grad_result = any_grad_defined ? (permute_backwards(grad, dims)) : Tensor();
        copy_range(grad_inputs, self_ix, grad_result);
      }
      return grad_inputs;
    }

In tools/autograd/derivatives.yaml we can see that permute_backwards is used.

    - name: permute(Tensor(a) self, int[] dims) -> Tensor(a)
      self: permute_backwards(grad, dims)
      result: auto_linear

permute_backwards is defined in torch/csrc/autograd/FunctionsManual.cpp.

    Tensor permute_backwards(const Tensor & grad, IntArrayRef fwd_dims) {
      // invert the permutation
      auto ndims = fwd_dims.size();
      std::vector<int64_t> dims(ndims);
      for(const auto i : c10::irange(ndims)) {
        dims[at::maybe_wrap_dim(fwd_dims[i], ndims)] = i;
      }
      return grad.permute(dims);
    }
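The inversion loop in permute_backwards can be reproduced in plain Python. This is a sketch with assumed helper names (`invert_permutation` and `permute_shape` are not PyTorch functions); shapes are modeled as lists, and `d % ndims` plays the role of at::maybe_wrap_dim for negative dimensions.

```python
def invert_permutation(fwd_dims):
    """Return the permutation that undoes fwd_dims."""
    ndims = len(fwd_dims)
    dims = [0] * ndims
    for i, d in enumerate(fwd_dims):
        dims[d % ndims] = i  # d % ndims wraps negative dims such as -1
    return dims


def permute_shape(shape, dims):
    """Apply a permutation to a shape, like Tensor.permute does to sizes."""
    return [shape[d] for d in dims]


fwd = [2, 0, 1]
inv = invert_permutation(fwd)                   # [1, 2, 0]
shape = [4, 5, 6]
round_trip = permute_shape(permute_shape(shape, fwd), inv)
print(inv, round_trip)
```

Permuting the gradient with the inverse permutation restores the layout of the forward input, which is why the backward of permute is itself a permute.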

Next, we analyze the forward computation in detail to see how it builds these dependencies.

0x04 forward calculation

Due to space constraints, we jump directly to the core of the C++ world.

4.1 subtraction implementation

After several layers of dispatch, the subtraction call finally reaches torch/csrc/autograd/generated/VariableTypeEverything.cpp. PyTorch builds the autograd information in this function. Its overall logic is as follows:

1) The subtraction is dispatched to a device, where the Variable holding the forward result will be built.
2) When the call is dispatched to VariableType, the autograd information is built:
- Build an instance of SubBackward0, the backward function of subtraction, named grad_fn.
- Set the function used in the backward computation.
- Initialize next_edges_ and the other relevant members of the SubBackward0 instance. The value of next_edges_ comes from the forward input parameters.
  - If an input Variable is a leaf node, its entry in next_edges_ comes from the input Variable's grad_accumulator_.
  - If an input Variable is a non-leaf node, its entry in next_edges_ comes from the Variable's grad_fn_.
- Use the Variable instance from step 3) to initialize input_metadata_ of the SubBackward0 instance.
3) The forward operation yields a new Variable, result, which is built using Variable::Impl.
4) Set the computation history: use the SubBackward0 instance grad_fn from step 2) to initialize the autograd_meta_->grad_fn_ member of the Variable instance.
5) Return result. This is the result of the forward computation, that is, Q in our example.

The specific codes are as follows:

    m.impl("sub.Tensor",
           TORCH_FN(VariableType::sub_Tensor)
    );

    at::Tensor sub_Tensor(c10::DispatchKeySet ks, const at::Tensor & self, const at::Tensor & other, const at::Scalar & alpha) {
      auto& self_ = unpack(self, "self", 0);
      auto& other_ = unpack(other, "other", 1);
      auto _any_requires_grad = compute_requires_grad( self, other );

      (void)_any_requires_grad;
      auto _any_has_forward_grad_result = isFwGradDefined(self) || isFwGradDefined(other);
      (void)_any_has_forward_grad_result;
      std::shared_ptr<SubBackward0> grad_fn; // Build SubBackward0
      if (_any_requires_grad) {
        // Set the function used in the backward computation
        grad_fn = std::shared_ptr<SubBackward0>(new SubBackward0(), deleteNode);
        // Set the next edges from all input variables
        grad_fn->set_next_edges(collect_next_edges( self, other ));
        // Save the scalar types and alpha needed by the backward formula
        grad_fn->other_scalar_type = other.scalar_type();
        grad_fn->alpha = alpha;
        grad_fn->self_scalar_type = self.scalar_type();
      }
      #ifndef NDEBUG
      c10::optional<Storage> self__storage_saved =
        self_.has_storage() ? c10::optional<Storage>(self_.storage()) : c10::nullopt;
      c10::intrusive_ptr<TensorImpl> self__impl_saved;
      if (self_.defined()) self__impl_saved = self_.getIntrusivePtr();
      c10::optional<Storage> other__storage_saved =
        other_.has_storage() ? c10::optional<Storage>(other_.storage()) : c10::nullopt;
      c10::intrusive_ptr<TensorImpl> other__impl_saved;
      if (other_.defined()) other__impl_saved = other_.getIntrusivePtr();
      #endif
      auto _tmp = ([&]() {
        at::AutoDispatchBelowADInplaceOrView guard;
        // Forward computation
        return at::redispatch::sub(ks & c10::after_autograd_keyset, self_, other_, alpha);
      })();
      // Get the output of the forward computation
      auto result = std::move(_tmp);
      if (grad_fn) {
          // Bind the output variable to grad_fn, which contains the function that computes the gradient
          // Set the computation history
          set_history(flatten_tensor_args( result ), grad_fn);
      }
      if (_any_has_forward_grad_result) {
          auto self_t_raw = toNonOptFwGrad(self);
          auto self_t = self_t_raw.defined() ? self_t_raw : at::zeros_like(toNonOptTensor(self));
          auto other_t_raw = toNonOptFwGrad(other);
          auto other_t = other_t_raw.defined() ? other_t_raw : at::zeros_like(toNonOptTensor(other));
          auto result_new_fw_grad = self_t - maybe_multiply(other_t, alpha);
          if (result_new_fw_grad.defined()) {
            // The hardcoded 0 here will need to be updated once we support multiple levels.
            result._set_fw_grad(result_new_fw_grad, /* level */ 0, /* is_inplace_op */ false);
          }
      }
      return result;
    }

Let's analyze it piece by piece. We first look at the basic functions, then come back to sub_Tensor.

4.3 basic edge functions

We first introduce two functions related to building edges.

4.3.1 create_gradient_edge

The create_gradient_edge code is located in torch/csrc/autograd/function.h. Its purpose is:

Create an "edge" between a given "variable" and a "function", where the function is the gradient function of the variable (that is, the function that computes the gradient of the variable during back propagation).
This function sets the "grad_fn" attribute of the "variable".

create_gradient_edge assumes that the Variable is a new input to the gradient function, so its input_nr equals function->num_inputs(). In addition, it increments the number of inputs of the Node by one.

If you do not want to increment the num_inputs of the Node, use set_gradient_edge directly. Functionally, create_gradient_edge is approximately equivalent to variable.set_gradient_edge(function, function->add_input_metadata(variable.dispatch_type(), variable.sizes())).

    /// Create an `Edge` between the given `variable` and the `function`, which is
    /// assumed to be the gradient function of this variable (i.e. the function
    /// through which this variable is backpropagated during the backward pass).
    /// This sets the `grad_fn` property of the `variable`. This function assumes
    /// that the `Variable` is a new input to the gradient function and its
    /// `input_nr` thus equal to `function->num_inputs()`. Additionally, it
    /// increments the `Node`'s number of inputs by one. Approximately
    /// equivalent to `variable.set_gradient_edge(function,
    /// function->add_input_metadata(variable.dispatch_type(), variable.sizes()))`.
    /// If you don't want the `Node`'s `num_inputs` to be incremented, use
    /// `set_gradient_edge` directly.
    inline void create_gradient_edge(
        Variable& variable,
        std::shared_ptr<Node> function) {
      // Copy before move.
      const auto input_nr = function->add_input_metadata(variable);
      impl::set_gradient_edge(variable, {std::move(function), input_nr});
    }

4.3.2 set_gradient_edge

The set_gradient_edge code is located in torch/csrc/autograd/variable.cpp.

Configuring the history eventually ends up here. This is where the edge is actually used to configure how the tensor computes its gradient, by writing to the autograd_meta_ of the Variable. That is, the function gets the Tensor's autograd_meta_ and sets its grad_fn_ and output_nr_.

    void set_gradient_edge(const Variable& self, Edge edge) {
      auto* meta = materialize_autograd_meta(self);
      meta->grad_fn_ = std::move(edge.function); // Configure the gradient function
      meta->output_nr_ = edge.input_nr; // Configure which input of the gradient function this is
      // For views, make sure this new grad_fn_ is not overwritten unless it is necessary
      // in the VariableHooks::grad_fn below.
      // This logic is only relevant for custom autograd Functions for which multiple
      // operations can happen on a given Tensor before its gradient edge is set when
      // exiting the custom Function.
      auto diff_view_meta = get_view_autograd_meta(self);
      if (diff_view_meta && diff_view_meta->has_bw_view()) {
        diff_view_meta->set_attr_version(self._version());
      }
    }
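How the two functions cooperate for an operator with several outputs can be sketched as follows. This is a toy model, not the PyTorch API: `ToyNode`, `ToyVariable` and the Python functions below are hypothetical, the metadata is reduced to a shape tuple, and "SplitBackward" is just an illustrative label for a multi-output operator.

```python
class ToyNode:
    """Toy backward node that tracks metadata for each of its inputs."""

    def __init__(self, name):
        self.name = name
        self.input_metadata = []

    def add_input_metadata(self, meta):
        input_nr = len(self.input_metadata)  # index of the new input
        self.input_metadata.append(meta)     # num_inputs grows by one
        return input_nr


class ToyVariable:
    def __init__(self, shape):
        self.shape = shape
        self.grad_fn = None
        self.output_nr = 0


def set_gradient_edge(variable, function, input_nr):
    # Only writes the autograd metadata of the variable.
    variable.grad_fn = function
    variable.output_nr = input_nr


def create_gradient_edge(variable, function):
    # Registers the variable as a new input of the node, then sets the edge.
    input_nr = function.add_input_metadata(variable.shape)
    set_gradient_edge(variable, function, input_nr)


split_backward = ToyNode("SplitBackward")  # hypothetical multi-output operator
outputs = [ToyVariable((2,)), ToyVariable((3,))]
for out in outputs:
    create_gradient_edge(out, split_backward)

output_nrs = [out.output_nr for out in outputs]
print(output_nrs, len(split_backward.input_metadata))
```

Each forward output becomes one input of the backward node, which is why the output_nr of a Variable doubles as the input_nr of its gradient edge.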

The materialize_autograd_meta code is as follows. Its purpose is to obtain autograd_meta_ from the Tensor, creating it first if it does not exist.

    AutogradMeta* materialize_autograd_meta(const Variable& self) {
      TORCH_CHECK(self.defined(), "cannot call materialize_autograd_meta() on undefined tensor");
      auto p = self.unsafeGetTensorImpl();
      if (!p->autograd_meta()) {
        p->set_autograd_meta(std::make_unique<AutogradMeta>());
      }
      return get_autograd_meta(self);
    }

The get_view_autograd_meta code is as follows. It returns the DifferentiableViewMeta, or nullptr if the Variable is not a view.

    DifferentiableViewMeta* get_view_autograd_meta(const Variable& self) {
      // NB: return nullptr if self is not a view
      AutogradMeta* meta = get_autograd_meta(self);
      if (meta && meta->is_view_) {
        return static_cast<DifferentiableViewMeta*>(meta);
      } else {
        return nullptr;
      }
    }

4.4 building the network

We have analyzed SubBackward0 and the basic functions, so we now return to the implementation of sub_Tensor. The first task is to build the back propagation network.

First, build a SubBackward0 instance, grad_fn.
Second, configure grad_fn: collect_next_edges() collects the edges of the two input Variables of the sub operation, and set_next_edges stores them.
Then, run the forward computation and obtain its output.
Finally, add the output Variable to the history, binding the output Variable to grad_fn.

The following code keeps only the key parts of sub_Tensor.

    std::shared_ptr<SubBackward0> grad_fn;
    if (_any_requires_grad) {
      // The function used in the backward computation
      grad_fn = std::shared_ptr<SubBackward0>(new SubBackward0(), deleteNode);
      // Set the next edges from all input variables
      grad_fn->set_next_edges(collect_next_edges( self, other ));
      grad_fn->other_scalar_type = other.scalar_type();
      grad_fn->alpha = alpha;
      grad_fn->self_scalar_type = self.scalar_type();
    }

    auto _tmp = ([&]() {
      at::AutoDispatchBelowADInplaceOrView guard;
      // Forward computation
      return at::redispatch::sub(ks & c10::after_autograd_keyset, self_, other_, alpha);
    })();

    // Get the output of the forward computation
    auto result = std::move(_tmp);
    if (grad_fn) {
        // Bind the output variable to grad_fn, which contains the function that computes the gradient
        // Add this computation to the computation history
        set_history(flatten_tensor_args( result ), grad_fn);
    }

4.5 building edges

The key part of building the network is building the edges. Here the output edges of back propagation are configured (the output edges of SubBackward0 correspond to the two inputs of the forward subtraction). There are two steps:

Use collect_next_edges to collect the edges of the input parameters (tensors). These next edges are the gradient_edge() of the two input parameters self and other.
Use set_next_edges to attach the edges to the Node. Once the set_next_edges call completes, the next_edges_ member of the Node (of type std::vector<Edge>) is initialized.

4.5.1 obtaining edges

The collect_next_edges function obtains edges from the input variables. In effect, collect_next_edges gets the gradient_edge of self and other.

4.5.1.1 gradient_edge

The gradient_edge method returns an Edge instance for a Variable. The logic is as follows:

If a node has grad_fn:
- The node is an interior node (created internally by an operation).
- grad_fn_ is the gradient function of this Variable.
- An Edge is built from grad_fn_ and returned.
If a node does not have grad_fn:
- The node is a leaf node (created by the user).
- grad_accumulator_ is the gradient accumulator of the Variable, that is, an instance of the AccumulateGrad class (a Node subclass). PyTorch uses grad_accumulator_ to accumulate the gradients output to the Variable.
- An Edge is built from grad_accumulator_ and returned.

The code is as follows. Note that output_nr is the index of the current Variable among the outputs of the forward operation. For single-output operators such as add or mul, output_nr is generally 0, but for multi-output operators such as split, output_nr may be 0, 1, 2.

    Edge gradient_edge(const Variable& self) {
      // If grad_fn is null (as is the case for a leaf node), we instead
      // interpret the gradient function to be a gradient accumulator, which will
      // accumulate its inputs into the grad property of the variable. These
      // nodes get suppressed in some situations, see "suppress gradient
      // accumulation" below. Note that only variables which have `requires_grad =
      // True` can have gradient accumulators.

      // self.grad_fn() is called here, for example to get the SubBackward0 instance
      if (const auto& gradient = self.grad_fn()) { // This is an intermediate node; gradient is a Function
        // self.output_nr() says that this Edge is the n-th input of the function.
        // The n-th output in forward propagation is the n-th input in back propagation.
        return Edge(gradient, self.output_nr());
      } else {
        // This is a leaf node, so an AccumulateGrad is used.
        // 0 means that this Edge is the first input of the function.
        return Edge(grad_accumulator(self), 0);
      }
    }

4.5.1.2 gradient accumulator

One step deserves attention: the gradient_edge method contains the statement return Edge(grad_accumulator(self), 0). This statement triggers a call to Variable::grad_accumulator().

The first time a Variable calls this API, an AccumulateGrad is created and used to initialize the Variable's grad_accumulator_ member. The code is as follows:

    std::shared_ptr<Node> grad_accumulator(const Variable& self) {
      auto autograd_meta = get_autograd_meta(self);
      if (!autograd_meta) {
        return nullptr;
      }
      if (autograd_meta->grad_fn_) {
        throw std::logic_error(
            "grad_accumulator() should be only called on leaf Variables");
      }
      if (!autograd_meta->requires_grad_) {
        return nullptr;
      }

      std::lock_guard<std::mutex> lock(autograd_meta->mutex_);

      auto result = autograd_meta->grad_accumulator_.lock();
      if (result)
        return result;

      c10::raw::intrusive_ptr::incref(self.unsafeGetTensorImpl());
      auto intrusive_from_this = c10::intrusive_ptr<at::TensorImpl>::reclaim(self.unsafeGetTensorImpl());
      // An AccumulateGrad is created here and stored in grad_accumulator_
      result = std::make_shared<AccumulateGrad>(Variable(std::move(intrusive_from_this)));
      autograd_meta->grad_accumulator_ = result;
      return result;
    }
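The create-on-first-use pattern in grad_accumulator can be sketched with Python's weakref module, which plays the role of the weak pointer that the C++ code reads with lock(). This is a simplified, single-threaded sketch: the classes `Leaf` and `AccumulateGrad` below are toy stand-ins and the mutex is omitted.

```python
import weakref


class AccumulateGrad:
    """Toy accumulator node that keeps a strong reference to its variable."""

    def __init__(self, variable):
        self.variable = variable


class Leaf:
    """Toy leaf variable that holds only a weak reference to its accumulator."""

    def __init__(self, requires_grad=True):
        self.requires_grad = requires_grad
        self._grad_accumulator = None  # weakref.ref, set on first use


def grad_accumulator(leaf):
    if not leaf.requires_grad:
        return None
    # Equivalent of weak_ptr::lock(): reuse the accumulator if it is alive.
    ref = leaf._grad_accumulator
    existing = ref() if ref is not None else None
    if existing is not None:
        return existing
    # First call (or accumulator already gone): create and remember it weakly.
    result = AccumulateGrad(leaf)
    leaf._grad_accumulator = weakref.ref(result)
    return result


a = Leaf()
first = grad_accumulator(a)
second = grad_accumulator(a)
print(first is second, grad_accumulator(Leaf(requires_grad=False)))
```

Holding the accumulator weakly on the variable side avoids a reference cycle, since the accumulator itself keeps the variable alive.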

4.5.1.3 AccumulateGrad

The AccumulateGrad definition is located in torch/csrc/autograd/functions/accumulate_grad.h

struct TORCH_API AccumulateGrad : public Node {
  explicit AccumulateGrad(Variable variable_); // Must be built with a Variable

  variable_list apply(variable_list&& grads) override; // Receives a list of Variable instances

  Variable variable;
};

Its constructor is in torch/csrc/autograd/functions/accumulate_grad.cpp.

This creates a new AccumulateGrad object, using UINT64_MAX to initialize the Function's sequence_nr_ member.

AccumulateGrad::AccumulateGrad(Variable variable_)
    : Node(/*sequence_nr=*/UINT64_MAX), variable(std::move(variable_)) {
  add_input_metadata(variable);
}

4.5.1.4 collecting edges

collect_next_edges is where the edges are built: it collects the edges for all of the inputs.

/// Return the next edges of all the given variables, or tuples of variables.
template <typename... Variables>
edge_list collect_next_edges(Variables&&... variables) {
  detail::MakeNextFunctionList make;
  // gradient_edge will be called here
  // The value of the next_edges_ member comes from the input parameters of the forward pass
  make.apply(std::forward<Variables>(variables)...);
  return std::move(make.next_edges);
}

MakeNextFunctionList is defined as follows. A gradient_edge is built for each input when apply is called, which corresponds to the gradient_edge section discussed earlier.

struct MakeNextFunctionList : IterArgs<MakeNextFunctionList> {
  edge_list next_edges;
  using IterArgs<MakeNextFunctionList>::operator();
  void operator()(const Variable& variable) {
    if (variable.defined()) {
      next_edges.push_back(impl::gradient_edge(variable)); // Call gradient_edge
    } else {
      next_edges.emplace_back();
    }
  }
  void operator()(const c10::optional<Variable>& variable) {
    if (variable.has_value() && variable->defined()) {
      next_edges.push_back(impl::gradient_edge(*variable)); // Call gradient_edge
    } else {
      next_edges.emplace_back();
    }
  }
};
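As a rough illustration of what MakeNextFunctionList does, the sketch below walks over the inputs and emits one edge per input, using an empty edge for an undefined input. Everything here is a simplification for illustration: variables are plain dicts, backward functions are represented by their names, and None stands for an undefined variable.

```python
from typing import List, Optional, Tuple

ToyEdge = Tuple[Optional[str], int]  # (backward function name, input_nr)
EMPTY_EDGE: ToyEdge = (None, 0)      # what a default-constructed edge looks like here


def gradient_edge(var: dict) -> ToyEdge:
    # Intermediate result: point at its grad_fn; leaf: point at an accumulator.
    if var.get("grad_fn") is not None:
        return (var["grad_fn"], var.get("output_nr", 0))
    return ("AccumulateGrad", 0)


def collect_next_edges(*variables: Optional[dict]) -> List[ToyEdge]:
    next_edges = []
    for var in variables:
        if var is not None:
            next_edges.append(gradient_edge(var))
        else:
            # Undefined input: keep the slot so positions still line up.
            next_edges.append(EMPTY_EDGE)
    return next_edges


X = {"grad_fn": "PowBackward0", "output_nr": 0}
Z = {"grad_fn": "PowBackward0", "output_nr": 0}

print(collect_next_edges(X, Z))     # [('PowBackward0', 0), ('PowBackward0', 0)]
print(collect_next_edges(X, None))  # [('PowBackward0', 0), (None, 0)]
```

Keeping a placeholder for undefined inputs means the ith edge always corresponds to the ith forward input, which is what lets the backward function route its ith output gradient correctly.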

At this point we have the edge_list, but it has not yet been connected to SubBackward0.

+------------------------+      +----------------------+
| SubBackward0           |      |                      |
|                        |      | Compute the gradient |
|    apply  +-----------------> |                      |
|                        |      +----------------------+
|                        |
|                        |
|    next_edges_         |
|                        |
|    other_scalar_type   |
|                        |
|    alpha               |
|                        |
|    self_scalar_type    |
|                        |
|    input_metadata_     |
|                        |
+------------------------+

+-----------------------------------------------------+
| edge_list                                           |
|                                                     |
| [(MulBackward0(self), 0), (PowBackward0(other), 0)] |
|                                                     |
+-----------------------------------------------------+

4.5.2 configuring edges

After all the output edges are obtained, the next step is to set them on SubBackward0's next_edges_ member. Note that the value of next_edges_ comes from the input parameters of forward propagation.

void set_next_edges(edge_list&& next_edges) {
  next_edges_ = std::move(next_edges); // The edges are set here
  for(const auto& next_edge : next_edges_) {
    update_topological_nr(next_edge);
  }
}

update_topological_nr sets topological_nr_ based on the output edge:

void update_topological_nr(const Edge& edge) {
  Node* node = edge.function.get();
  if (node) {
    auto topo_nr = node->topological_nr();
    if (topological_nr_ <= topo_nr) {
      topological_nr_ = topo_nr + 1;
    }
  }
}
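The effect of update_topological_nr is that a node's topological number ends up one larger than the largest number among the nodes its edges point to. A minimal Python sketch of that rule, with a hypothetical ToyNode class and the graph of Q = a ** 3 - b ** 2 wired by hand, might look like this:

```python
class ToyNode:
    """Hypothetical backward node that tracks only edges and topological_nr."""

    def __init__(self, name):
        self.name = name
        self.topological_nr = 0
        self.next_edges = []

    def set_next_edges(self, next_edges):
        self.next_edges = list(next_edges)
        for function, _input_nr in self.next_edges:
            self._update_topological_nr(function)

    def _update_topological_nr(self, function):
        # Empty edges (function is None) do not affect the number.
        if function is not None and self.topological_nr <= function.topological_nr:
            self.topological_nr = function.topological_nr + 1


acc_a, acc_b = ToyNode("AccumulateGrad"), ToyNode("AccumulateGrad")
pow_a, pow_b = ToyNode("PowBackward0"), ToyNode("PowBackward0")
sub = ToyNode("SubBackward0")

pow_a.set_next_edges([(acc_a, 0)])
pow_b.set_next_edges([(acc_b, 0)])
sub.set_next_edges([(pow_a, 0), (pow_b, 0)])

print(acc_a.topological_nr, pow_a.topological_nr, sub.topological_nr)  # 0 1 2
```

In this sketch the number is the length of the longest path from a node down to a leaf accumulator, so a node closer to the final output never has a smaller number than a node it feeds gradients into.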

Combined with our example, it should be as shown in the figure below. The meaning of 0 in the figure below is as follows: (PowBackward0(other), 0) means that the calculation output of SubBackward0 is the first input of PowBackward0 (the original power operation has only one output).

+------------------------+      +----------------------+
| SubBackward0           |      |                      |
|                        |      | Compute the gradient |
|    apply  +-----------------> |                      |
|                        |      +----------------------+
|                        |
|                        |      +-----------------------------------------------------+
|    next_edges_  +-----------> | edge_list                                           |
|                        |      |                                                     |
|    other_scalar_type   |      | [(MulBackward0(self), 0), (PowBackward0(other), 0)] |
|                        |      |                                                     |
|    alpha               |      +-----------------------------------------------------+
|                        |
|    self_scalar_type    |
|                        |
|    input_metadata_     |
|                        |
+------------------------+

4.6 configuring history

Next comes configuring history. result is the forward propagation output computed by the preceding code. This step actually configures the input of back propagation and how the gradient is computed for that input.

if (grad_fn) { // grad_fn is std::shared_ptr<SubBackward0>
  // Bind the output variable to grad_fn; grad_fn contains the function that computes the gradient
  set_history(flatten_tensor_args( result ), grad_fn);
}

set_history adds the forward propagation results to history: it traverses the tensors in the result and adds each one to history. The key point is the call to the previously mentioned set_gradient_edge, which stores grad_fn (SubBackward0) in result.autograd_meta_->grad_fn_.

Recall the definition of the Tensor member variable grad_fn.

grad_fn: points to a Function object.

This Function object is used to calculate the gradient of the input during back propagation.
If the tensor is a non-leaf node, the Function is the backward Function that propagates toward the leaf nodes. For example, the Function corresponding to the O node in the example is MulBackward, that is, the backward Function of the multiplication operation;

Comparing the two, we can see that when result, the output of the forward operation, is used as input during back propagation, grad_fn_ is used to compute the gradient, and here that is SubBackward0. This is how back propagation is told how to compute the gradient for this input.

The set_history code is as follows:

inline void set_history(
    at::Tensor& variable,
    const std::shared_ptr<Node>& grad_fn) {
  if (variable.defined()) {
    // Add the output instance to grad_fn's input_metadata_; the output instance is the input during back propagation
    auto output_nr = grad_fn->add_input_metadata(variable);
    // Set grad_fn on the output instance result. The edge is configured here, and the edge is {grad_fn, output_nr}.
    // output_nr_ is assigned the "index of the current Variable's information in input_metadata_".
    impl::set_gradient_edge(variable, {grad_fn, output_nr});
  } else {
    // Set to undefined
    grad_fn->add_input_metadata(Node::undefined_input());
  }
}

inline void set_history(
    std::vector<Variable>&& variables,
    const std::shared_ptr<Node>& grad_fn) {
  for (auto& variable : variables) {
    set_history(variable, grad_fn); // Calls the function above
  }
}

4.6.1 configuring meta

When configuring history, the first step is to configure input_metadata_: the output instance result is added to input_metadata_. The output instance result is the input during back propagation.

4.6.1.1 input_metadata_

In the Node class, input_metadata_ has the following type:

at::SmallVector<InputMetadata, 2> input_metadata_;

InputMetadata is defined as follows:

struct InputMetadata {
  InputMetadata(const at::TensorOptions options, at::IntArrayRef shape, at::Device device)
  : options_{options}, shape_{shape}, device_{device} {
    stream_ = c10::impl::getDeviceGuardImpl(device_.type())->getStream(device_);
  }

  InputMetadata(const at::Tensor& t)
  : InputMetadata(t.options(), t.sizes(), t.device()) { }

private:
  const at::TensorOptions options_;
  at::DimVector shape_;
  at::Device device_ = at::kCPU;
  c10::Stream stream_ = c10::Stream(c10::Stream::Default::DEFAULT, device_);
};

4.6.1.2 configuring meta

The add_input_metadata method configures the metadata as follows:

/// Adds the type and shape metadata for a new input. Returns the index
/// of the new input.
uint32_t add_input_metadata(
    const at::TensorOptions& options,
    at::IntArrayRef shape,
    at::Device device) noexcept {
  uint32_t input_nr = input_metadata_.size();
  input_metadata_.emplace_back(options, shape, device);
  return input_nr;
}

After configuration, a new InputMetadata has been added to input_metadata_. Its content is part of the information (type, shape, device) of the output variable result, and its index in input_metadata_ is the output_nr_ stored in AutogradMeta.
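The relationship between add_input_metadata, output_nr_ and set_history can be sketched in a few lines of Python. ToyNode and ToyTensor are hypothetical classes for illustration only, the metadata is reduced to a (dtype, shape, device) tuple, and SplitLikeBackward is an invented two-output node used to show that the index grows with each output.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ToyNode:
    """Hypothetical backward node that records metadata for each of its inputs."""
    name: str
    input_metadata: List[Tuple[str, tuple, str]] = field(default_factory=list)

    def add_input_metadata(self, dtype: str, shape: tuple, device: str) -> int:
        input_nr = len(self.input_metadata)  # index of the new entry
        self.input_metadata.append((dtype, tuple(shape), device))
        return input_nr


@dataclass
class ToyTensor:
    """Hypothetical forward output with the two autograd fields that get set."""
    dtype: str
    shape: tuple
    device: str = "cpu"
    grad_fn: Optional[ToyNode] = None
    output_nr: int = 0


def set_history(outputs: List[ToyTensor], grad_fn: ToyNode) -> None:
    for out in outputs:
        # Step 1: record the output's metadata on the backward node.
        output_nr = grad_fn.add_input_metadata(out.dtype, out.shape, out.device)
        # Step 2: the toy counterpart of set_gradient_edge(variable, {grad_fn, output_nr}).
        out.grad_fn = grad_fn
        out.output_nr = output_nr


sub_backward = ToyNode("SubBackward0")
result = ToyTensor("float32", ())
set_history([result], sub_backward)
print(result.grad_fn.name, result.output_nr)  # SubBackward0 0
print(sub_backward.input_metadata)            # [('float32', (), 'cpu')]

# A made-up node with two forward outputs: each output gets its own index.
split_backward = ToyNode("SplitLikeBackward")
first, second = ToyTensor("float32", (2,)), ToyTensor("float32", (3,))
set_history([first, second], split_backward)
print(first.output_nr, second.output_nr)      # 0 1
```

The index returned by add_input_metadata is therefore both the position of the metadata inside the node and the output_nr_ remembered by the tensor, which is what ties a forward output to a specific backward input.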

Therefore, the memory layout at this point is roughly as follows:

               +-------------------------------------------------------------------------------------------------------------+
self  +--+     | sub_Tensor                                                                                                  |
         |     |                   +--------------------------+    +----------------------+                                  |
         +---->+                   |SubBackward0              |    |                      |                                  |
         |     |                   |                          |    | Compute the gradient |                                  |
other +--+     |  +--> grad_fn---> |    apply  +-----------------> |                      |                                  |
               |  |                |                          |    +----------------------+                                  |
               |  |                |                          |                                                              |
               |  |                |                          |    +-----------------------------------------------------+   |
               |  |                |    next_edges_  +-----------> | edge_list                                           |   |
               |  |                |                          |    |                                                     |   |
               |  |                |    other_scalar_type     |    | [(PowBackward0(self), 0), (PowBackward0(other), 0)] |   |
               |  |                |                          |    |                                                     |   |
               |  |                |    alpha                 |    +-----------------------------------------------------+   |
               |  |                |                          |                                                              |
               |  |                |    self_scalar_type      |    +------------------------------------------------------+  |
               |  |                |                          |    |                                                      |  |
               |  |                |    input_metadata_  +-------> | [(type of result, shape of result, device of result)]|  |
               |  |                |                          |    |                                                      |  |
               |  |                +--------------------------+    +------------------------------------------------------+  |
               |  |                                                                                                          |
               |  |                                                                                                          |
               |  |                +-----------------------+         +---------------------------------------+               |
               |  |                |result                 |         | DifferentiableViewMeta                |               |
               |  |                |                       |         |                                       |               |
               |  |                |    autograd_meta_ +-----------> |       grad_        grad_accumulator_  |               |
               |  |                |                       |         |                                       |               |
               |  |                +-----------------------+         |                                       |               |
               |  +--------------------------------------------------------- grad_fn_     output_nr_         |               |
               |                                                     |                                       |               |
               |                                                     +---------------------------------------+               |
               +-------------------------------------------------------------------------------------------------------------+


4.7 confirmation

We compare with the previous examples and continue to refine the example code to obtain:

import torch

a = torch.tensor(2., requires_grad=True)
b = torch.tensor(6., requires_grad=True)
X = a ** 3
Y = 3 * X
Z = b ** 2
Q = X - Z
external_grad = torch.tensor(1.)
Q.backward(gradient=external_grad)
print(a.grad)
print(b.grad)

Look at the runtime variables as follows. Since Q = X - Z is subtraction, the corresponding reverse operation is SubBackward0:

Q = tensor(-28., grad_fn=<SubBackward0>)
X = tensor(8., grad_fn=<PowBackward0>)
Y = tensor(24., grad_fn=<MulBackward0>)
Z = tensor(36., grad_fn=<PowBackward0>)
a = tensor(2., requires_grad=True)
b = tensor(6., requires_grad=True)

Let's take a closer look. Note that in (<PowBackward0 object at 0x00000177300F4688>, 0) the 0 means that this edge feeds the 0th input of PowBackward0, which corresponds to the 0th, and only, output of the forward power operation.

Q =
  grad_fn =
    next_functions =
      0 = (<PowBackward0 object at 0x00000177300F4688>, 0)
      1 = (<PowBackward0 object at 0x00000177300F46C8>, 0)
X =
  grad_fn =
    next_functions =
      0 = (<AccumulateGrad object at 0x00000177300F49C8>, 0)
Z =
  grad_fn =
    next_functions =
      0 = (<AccumulateGrad object at 0x00000177301003C8>, 0)
Y =
  grad_fn =
    next_functions =
      0 = (<PowBackward0 object at 0x0000017730100CC8>, 0)
      1 = (None, 0)

The corresponding brief diagram is:

Corresponding logic:

1. Call sub_Tensor with the self and other tensors as parameters.
2. Build a SubBackward0 with grad_fn = std::shared_ptr<SubBackward0>(new SubBackward0(), deleteNode);. The value of grad_fn's next_edges_ member comes from the forward propagation input parameters, namely self and other.
3. Use at::redispatch::sub for the forward calculation to get result.
4. Use set_history to set the calculation history. set_history consists of the following two parts.
5. output_nr = grad_fn->add_input_metadata(variable) adds the output instance to grad_fn's input_metadata_.
6. impl::set_gradient_edge(variable, {grad_fn, output_nr}) sets grad_fn on the attribute autograd_meta_->grad_fn_ of the output instance result.
7. Finally, result is returned.

As you can see, sub_Tensor configures the following for result:

How to know which backward calculation to call: result is the result of the forward calculation, and result has an autograd_meta_ member of type DifferentiableViewMeta. Among its grad_ and grad_fn_ members, grad_fn_ is the backward gradient function, and it points to SubBackward0.
How back propagation is computed: SubBackward0 is called to do the calculation.
Input of SubBackward0: the output result of the forward calculation (it will be used as the input variable during back propagation, which is what was recorded in SubBackward0.input_metadata_).
Output of SubBackward0: the next_edges_ that were built serve as the output edges of its back propagation.
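To see how this configuration is consumed, here is a deliberately simplified Python sketch that walks the next_edges_ of the example Q = a ** 3 - b ** 2 by hand. It is not how the PyTorch Engine works internally (the real Engine schedules nodes and sums gradients that arrive at the same input); ToyNode, ToyLeaf and run_backward are hypothetical names, and the derivative formulas are written directly as lambdas.

```python
class ToyNode:
    """Hypothetical backward node: a gradient function plus its outgoing edges."""

    def __init__(self, name, backward, next_edges=()):
        self.name = name
        self.backward = backward            # grad_output -> list of input gradients
        self.next_edges = list(next_edges)  # [(node or None, input_nr), ...]


class ToyLeaf:
    """Hypothetical leaf tensor with an accumulator node attached."""

    def __init__(self, value):
        self.value = value
        self.grad = 0.0
        # Stand-in for AccumulateGrad: adds the incoming gradient to .grad.
        self.accumulator = ToyNode("AccumulateGrad", self._accumulate)

    def _accumulate(self, grad):
        self.grad += grad
        return []  # an accumulator has no further edges


def run_backward(root, grad):
    # Plain stack-based walk; sufficient because this example graph is a tree.
    stack = [(root, grad)]
    while stack:
        node, grad_output = stack.pop()
        grad_inputs = node.backward(grad_output)
        for (next_node, _input_nr), grad_input in zip(node.next_edges, grad_inputs):
            if next_node is not None:
                stack.append((next_node, grad_input))


a, b = ToyLeaf(2.0), ToyLeaf(6.0)
# X = a ** 3, Z = b ** 2, Q = X - Z
pow_a = ToyNode("PowBackward0", lambda g: [g * 3 * a.value ** 2], [(a.accumulator, 0)])
pow_b = ToyNode("PowBackward0", lambda g: [g * 2 * b.value], [(b.accumulator, 0)])
sub = ToyNode("SubBackward0", lambda g: [g, -g], [(pow_a, 0), (pow_b, 0)])

run_backward(sub, 1.0)
print(a.grad, b.grad)  # 12.0 -12.0
```

The values agree with the analytic derivatives dQ/da = 3 * a ** 2 = 12 and dQ/db = -2 * b = -12 for a = 2 and b = 6, which is a quick way to check that the edges were wired in the right order.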

The logic diagram is as follows:

+---------------------------------------------------------------------------------------------------------------+ self +--+ | sub_Tensor +--------------------------+ +----------------------+ | | | |SubBackward0 | | | | +---->+ 2 | | | Compute the gradient | | | 1 | +-----> grad_fn +-----> | apply +-----------------> | | | other +--+ | | | | +----------------------+ | | | | | | | | | | +----------------------+ | | | | next_edges_ +-----------> | edge_list | | | | | | | | | | | | other_scalar_type | | self, other | | | | | | | | | | | | alpha | +----------------------+ | | | | | | | | | self_scalar_type | | | | | | | | | | input_metadata_ +------> [result] | | | | | ^ | | | +--------------------------+ | | | | | 5 | | | | | | | 3 result = at::redispatch::sub +--------------------------------------------------------+ | | | | | | | | | | + | | | | | output_nr = grad_fn+>add_input_metadata(variable) | | | | 4 set_history(result, grad_fn) +-------> | | | | | | impl::set_gradient_edge(variable,a)| | | | | + | | | +----------------------------+ | | | | | 6 | +--------------------------------------------------------+ | | | | | | +-----------------------+ | +-----------------------------------+ | 7 | | |result | | | DifferentiableViewMeta | | | | | | | | | <---+ | | | autograd_meta_ +---------------->+ | | | | | | | grad_ grad_accumulator_ | | | | | | | | | | | | +--------+grad_fn_ output_nr_ | | | | | | | | | +------------+----------+ +-----------------------------------+ | | | | +---------------------------------------------------------------------------------------------------------------+ | result | 7 v
