[Experience Sharing] Adding Layer-by-Layer Operators to PyTorch MLU on Cambricon Devices



  This tutorial shares the method of adding layer-by-layer operators to PyTorch MLU on Cambricon devices.

  In PyTorch MLU layer-by-layer mode, the basic unit of data transfer and storage between operators is the tensor. PyTorch MLU dispatches an operator to a specific device according to the device attribute of its input tensors. Take the abs() operator as an example: in the dispatch phase, the call is routed to the device-specific implementation based on the device attribute of input_tensor.
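  As a minimal illustration (not taken from the Catch source; the import path for xm is an assumption, since the article's own test code only uses xm.mlu_device() without showing the import), the same abs() call is routed to the CPU or the MLU implementation purely by the device attribute of its input tensor:

import torch
import torch_mlu.core.mlu_model as xm   # assumed import path exposing mlu_device()

x_cpu = torch.rand(2, 3)
y_cpu = torch.abs(x_cpu)              # device attribute is CPU, so the CPU kernel runs

x_mlu = x_cpu.to(xm.mlu_device())     # device attribute is now MLU
y_mlu = torch.abs(x_mlu)              # dispatched to the MLU implementation registered by Catch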

    Catch adds MLU operators through registration, which keeps it decoupled from the PyTorch source code. The specific steps for adding an MLU operator to Catch are described below.

1. Operator registration

    Register the operator in catch/torch_mlu/csrc/generated/aten_mlu_type_default.cpp:

.op(torch::RegisterOperators::options().schema("aten::add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor")  // NOLINT
  .impl_unboxedOnlyKernel<at::Tensor(const at::Tensor &, const at::Tensor &, at::Scalar), &AtenMluType::add>(at::TensorTypeId::MLUTensorId)
  .aliasAnalysis(c10::AliasAnalysisKind::FROM_SCHEMA))

2. Operator distribution

  AtenMluType and AtenMluCustomType are the operator entry points of the Catch module. The AtenMluType class mainly contains the standard operators of the framework, while the AtenMluCustomType class contains customized operators. Depending on the operator's attributes, add the corresponding operator declaration and implementation to either AtenMluType or AtenMluCustomType.

  • Standard operator distribution
    Add the operator declaration and implementation in catch/torch_mlu/csrc/aten/aten_mlu_type.h and catch/torch_mlu/csrc/aten/aten_mlu_type.cpp:
aten_mlu_type.h
static at::Tensor add(const at::Tensor& self, const at::Tensor& other, at::Scalar alpha);
aten_mlu_type.cpp
at::Tensor AtenMluType::add(const at::Tensor& self, const at::Tensor& other, at::Scalar alpha){
  return OP_DISPATCH(add, self, other, alpha);
}
  • Customized operator distribution

  For MLU-specific operators, add the operator declaration and implementation in catch/torch_mlu/csrc/aten/aten_mlu_type.h and catch/torch_mlu/csrc/aten/aten_mlu_custom_type.cpp:

aten_mlu_type.h
static at::Tensor linear(const at::Tensor& input,
                         const at::Tensor& weight,
                         const at::Tensor& bias,
                         const at::Tensor& q_scale,
                         const at::Tensor& q_mode);
aten_mlu_custom_type.cpp
at::Tensor AtenMluCustomType::linear(const at::Tensor& input,
                                     const at::Tensor& weight,
                                     const at::Tensor& bias,
                                     const at::Tensor& q_scale,
                                     const at::Tensor& q_mode){
    return OP_DISPATCH(linear, input, weight, bias, q_scale, q_mode);
}

3. Modify OpMethods base class

  From both AtenMluType and AtenMluCustomType, operators are further dispatched through OpMethods to either inference operators or training operators. Add the operator declaration and implementation in catch/torch_mlu/csrc/aten/operators/op_methods.h and catch/torch_mlu/csrc/aten/operators/op_methods.cpp. The implementation in OpMethods is the CPU implementation of the operator.

op_methods.h
virtual at::Tensor add(const at::Tensor& self, const at::Tensor& other, at::Scalar alpha);
op_methods.cpp
at::Tensor OpMethods::add(const at::Tensor& self,
                          const at::Tensor& other,
                          at::Scalar alpha){
   auto input_cpu = self.cpu();
   auto other_cpu = other.cpu();
   auto output = at::add(input_cpu, other_cpu, alpha);
   return output.to(at::Device(at::Device::Type::MLU));
}

4. Operator dispatch

  Add the inference operator declaration and implementation in catch/torch_mlu/csrc/aten/operators/cnml_ops.h and catch/torch_mlu/csrc/aten/operators/cnml_ops.cpp.

cnml_ops.h
at::Tensor add(const at::Tensor& self, const at::Tensor& other, at::Scalar alpha);
cnml_ops.cpp
at::Tensor CnmlOps::add(const at::Tensor& self, const at::Tensor& other, at::Scalar alpha){
  CNML_DISPATCH(add, cnml_add, self, other, alpha);  // The first parameter of the CNML_DISPATCH macro is the interface name, the second is the wrapper name, and the rest are the operator's arguments
}

5. Add wrapper

  A wrapper is the encapsulation of an operator kernel, and each operator has a corresponding wrapper. Taking the add operator as an example, add its wrapper as follows:

cnml_kernel.h
at::Tensor cnml_add(const at::Tensor& input, const at::Tensor& other, at::Scalar alpha);
add.cpp
at::Tensor cnml_add(const at::Tensor& input, const at::Tensor& other, at::Scalar alpha_scalar){
  TORCH_CHECK(input.dim() >= 0 || other.dim() >= 0, "dimension not support");
  at::Tensor input_ = input;
  at::Tensor other_ = other;
  auto alpha_data = alpha_scalar.to<scalar_t>();
  if(alpha_data != 1){
    // scale_t
    other_ = cnml::ops::cnml_scale(other_, alpha_data, 0);
  }
  if(other_.dim() < 1 && other_.device().type() == c10::DeviceType::CPU){
    auto other_scalar = other_.item();
    return cnml_add_internal(input_, other_scalar);   // Call kernel
  }
  if(input_.dim() < 1 && input_.device().type() == c10::DeviceType::CPU){
    auto input_scalar = input_.item();
    return cnml_add_internal(other_, input_scalar);   // Call kernel
  }
  
  bool broadcast = input_.sizes() != other_.sizes();
  if(broadcast){
    auto broadcast_size = at::infer_size(input.sizes(), other.sizes());
    at::Tensor broadcast1 = cnml::ops::cnml_expand(input_, broadcast_size, false);
    at::Tensor broadcast2 = cnml::ops::cnml_expand(other_, broadcast_size, false);
    return cnml_add_internal(broadcast1, broadcast2);  // Call kernel
  }
  return cnml_add_internal(input_, other_);  // Call kernel
}

6. Add kernel

  In the wrapper, the operator functionality is realized by calling the kernel; the example above calls cnml_add_internal. The specific implementation of the operator is completed mainly by calling the interfaces of the CNML library. The general CNML programming flow is: create CNML tensors for the inputs and output, create the operator, compile it, launch the computation on a queue, synchronize the queue, and destroy the operator.

    Following this flow, add the declaration and implementation of the kernel function in catch/torch_mlu/csrc/aten/operators/cnml/internal/cnml_internal.h and catch/torch_mlu/csrc/aten/operators/cnml/internal/add_internal.cpp.

cnml_internal.h
at::Tensor cnml_add_internal(const at::Tensor& input1, const at::Tensor& input2);
add_internal.cpp
at::Tensor cnml_add_internal(const at::Tensor& input1, const at::Tensor& input2){
  auto output = at::native::empty_like(input1);
  // prepare input cnml tensor
  auto* input1_impl = getMluTensorImpl(input1);  // Get MluTensorImpl
  auto input1_cnml = input1_impl->CreateCnmlTensor(
       CNML_TENSOR, toCnmlDataType(input1.dtype()));  // Type adaptation: toCnmlDataType()
       
  auto* input2_impl = getMluTensorImpl(input2);
  auto input2_cnml = input2_impl->CreateCnmlTensor(
      CNML_TENSOR, toCnmlDataType(input2.dtype()));
      
  // prepare output cnml tensor
  auto* output_impl = getMluTensorImpl(output);
  auto output_cnml = output_impl->CreateCnmlTensor(
      CNML_TENSOR, toCnmlDataType(output.dtype()));
      
  // End the execution flow if not MLU device
  CHECK_MLU_DEVICE(output);
  
  // setup operator
  cnmlBaseOp_t add_op;
  TORCH_CNML_CHECK(cnmlCreateAddOp(&add_op, input1_cnml, input2_cnml, output_cnml));
  
  // return to JIT if running mode is fuse
  CHECK_RETURN_TO_FUSE(add_op, output);
  
  // compile op
  TORCH_CNML_CHECK(cnmlCompileBaseOp(add_op, GET_CORE_VERSION, GET_CORE_NUMBER));
  
  auto queue = getCurQueue();
  TORCH_CNML_CHECK(cnmlComputeAddOpForward_V4(add_op,
                                              NULL,
                                              input1_impl->raw_mutable_data(),
                                              NULL,
                                              input2_impl->raw_mutable_data(),
                                              NULL,
                                              output_impl->raw_mutable_data(),
                                              queue,
                                              NULL));
   syncQueue(queue);
   TORCH_CNML_CHECK(cnmlDestroyBaseOp(&add_op));
   
  return output;
}
  • Handling operators not yet supported on the MLU

For operators that the MLU does not yet support, the input data is copied to the CPU, the corresponding CPU operator is called so that the computation runs on the CPU, and the result is finally copied back to the MLU. For the concrete implementation, see op_methods.cpp in catch/torch_mlu/csrc/aten/operators/.

op_methods.cpp
at::Tensor OpMethods::add(const at::Tensor& self,
                          const at::Tensor& other,
                          at::Scalar alpha){
  auto input_cpu = self.cpu();
  auto other_cpu = other.cpu();
  auto output = at::add(input_cpu, other_cpu, alpha);
  return output.to(at::Device(at::Device::Type::MLU));
}
  • If a newly added operator throws an exception during execution and there is no corresponding CPU implementation, the operation cannot fall back to the CPU.
  • Wrappers are usually named cnml_<operator name>, and kernels are generally named cnml_<operator name>_internal.

7. Operator test

  Write operator unit tests using the Python unittest module. During testing, provide the same parameters and input data, execute the operator on both the MLU and the CPU, and compare the two outputs. The MLU and CPU results may differ; generally, a relative error within 2% is acceptable.

def test_add(self):
  # "Tensor + Tensor" mode testing
  for shape1, shape2 in [((1,3,224,224),(1,3,224,224)),((2,30,80),(2,30,80)),((3,20),(3,20)),((10,),(10,))]:
    input1_cpu = torch.rand(shape1, dtype=torch.float)
    input2_cpu = torch.rand(shape2, dtype=torch.float)
    input1_mlu = input1_cpu.to(xm.mlu_device())
    input2_mlu = input2_cpu.to(xm.mlu_device())
    # Calculate on CPU
    output_cpu = input1_cpu + input2_cpu
    # Calculate on MLU
    output_mlu = input1_mlu + input2_mlu
    # Calculate the error of MLU and ensure that the relative error is within 2%
    self.assertTensorsEqual(output_cpu, output_mlu.cpu(), 0.02, use_MSE=True)
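
  The test method above normally lives inside a unittest.TestCase. Below is a minimal, self-contained sketch of how such a test file might be assembled; the torch_mlu import path, the class name, and the simplified assertTensorsEqual helper are assumptions for illustration only, since the Catch test utilities provide their own comparison helper.

import unittest
import torch
import torch_mlu.core.mlu_model as xm   # assumed import path exposing mlu_device()

class TestAddOp(unittest.TestCase):      # hypothetical class name
    def assertTensorsEqual(self, cpu_out, mlu_out, prec, use_MSE=False):
        # Relative-error comparison: MSE-based when use_MSE is True, max abs diff otherwise.
        diff = (cpu_out - mlu_out).float()
        denom = cpu_out.float().abs().mean().clamp(min=1e-12)
        err = diff.pow(2).mean().sqrt() / denom if use_MSE else diff.abs().max()
        self.assertLessEqual(err.item(), prec)

    def test_add(self):
        # Same structure as the example above, abridged to a single shape.
        a_cpu = torch.rand((2, 30, 80), dtype=torch.float)
        b_cpu = torch.rand((2, 30, 80), dtype=torch.float)
        out_cpu = a_cpu + b_cpu
        out_mlu = a_cpu.to(xm.mlu_device()) + b_cpu.to(xm.mlu_device())
        self.assertTensorsEqual(out_cpu, out_mlu.cpu(), 0.02, use_MSE=True)

if __name__ == "__main__":
    unittest.main()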

    The above described the method of adding layer-by-layer operators to PyTorch MLU on Cambricon devices, using the add() operator as a worked example. I hope this sharing is helpful to your learning.


Tags: Algorithm AI Pytorch Deep Learning microchip

Posted on Wed, 24 Nov 2021 06:57:11 -0500 by Nexus10