Reading the ONNX Runtime source: how model partitioning is implemented

Related reading:
Reading the ONNX Runtime source: overview of the model inference process
Reading the ONNX Runtime source: determining the serial execution order of model nodes

Preface

To run inference efficiently, a neural network inference engine should make the best use of whichever hardware devices on the host can compute fastest, and ONNX Runtime is no exception. ONNX Runtime currently supports a variety of devices, and mobile support is under development. A host may well have several devices at once. How does ONNX Runtime choose which device each part of the model runs on (that is, how does it partition the model)? How does it handle operations that a given runtime does not support (fallback)? How does it decide the priority of the different hardware runtimes? These questions are the subject of this article.

Terminology

To avoid confusion, let's fix some terms first. A host is a complete system made up of software and hardware, such as a computer or a mobile phone. A device is a computing unit such as a CPU or GPU; some devices are also called accelerators. The drivers and other software a device depends on go by many names. Some projects call them a runtime, for example SNPE (Snapdragon Neural Processing Engine); ONNX Runtime calls them an Execution Provider, presumably to avoid a clash with its own name. (Translating "provider" as "sponsor" might be even more fitting: it sponsors the computing power, ha ha.) Out of habit, this article simply calls them Providers.

Relevant files

onnxruntime\onnxruntime\core\framework\sequential_executor.cc
onnxruntime\onnxruntime\core\session\inference_session.cc
onnxruntime\onnxruntime\core\framework\graph_partitioner.cc
onnxruntime\onnxruntime\python\onnxruntime_pybind_state.cc

Main text

ONNX Runtime abstracts each kind of hardware together with the drivers it depends on. This abstraction, which unifies how the various hardware resources are called, is the Execution Provider. A concrete operation such as convolution or pooling is implemented as a kernel, and OpKernel is the base class of all operator implementations. The same operation is implemented differently for different Providers; convolution, for example, has one implementation for the CPU Provider and another for the Cuda Provider. Assigning a Provider to each node is, in effect, partitioning the model.
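To make this abstraction concrete, here is a minimal sketch, not ONNX Runtime's actual classes. It assumes a toy registry keyed by operator type and Provider name, so the same operation resolves to a different kernel depending on the Provider. ToyKernel, ToyRegistry, MakeToyRegistry, and the Provider strings are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for a kernel: a callable that returns a label
// describing which implementation ran. This is not ONNX Runtime's OpKernel.
using ToyKernel = std::function<std::string()>;

// Hypothetical registry keyed by (operator type, provider name).
class ToyRegistry {
 public:
  void Register(const std::string& op, const std::string& provider, ToyKernel kernel) {
    kernels_[std::make_pair(op, provider)] = std::move(kernel);
  }

  // Returns nullptr when the provider has no kernel for this operator.
  const ToyKernel* Find(const std::string& op, const std::string& provider) const {
    auto it = kernels_.find(std::make_pair(op, provider));
    return it == kernels_.end() ? nullptr : &it->second;
  }

 private:
  std::map<std::pair<std::string, std::string>, ToyKernel> kernels_;
};

// Builds a registry in which Conv exists for both providers but Pool only for CPU.
inline ToyRegistry MakeToyRegistry() {
  ToyRegistry registry;
  registry.Register("Conv", "CPU", [] { return std::string("conv on cpu"); });
  registry.Register("Conv", "CUDA", [] { return std::string("conv on cuda"); });
  registry.Register("Pool", "CPU", [] { return std::string("pool on cpu"); });
  return registry;
}
```

Looking up ("Conv", "CUDA") and ("Conv", "CPU") yields two different callables, while ("Pool", "CUDA") yields nullptr, which is the situation where a node has to be placed on another Provider.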
From the previous articles we know that running ONNX Runtime has three main stages: instantiation, initialization, and inference. When we call InferenceSession.run(), the call is delegated through several layers until the real inference happens in IExecutor::Execute(), which is overridden by the subclasses of IExecutor. IExecutor has two subclasses, SequentialExecutor and ParallelExecutor. Whichever subclass is used, it ultimately uses the node information to look up the node's OpKernel in SessionState. The code below shows this. Because the method bodies are long, the examples use // ...... to mark omitted code; to see the complete code, follow the file path given at the top of each example.

// onnxruntime\onnxruntime\core\framework\sequential_executor.cc#SequentialExecutor::Execute()
Status SequentialExecutor::Execute(const SessionState& session_state, const std::vector<int>& feed_mlvalue_idxs,
                                   const std::vector<OrtValue>& feeds, const std::vector<int>& fetch_mlvalue_idxs,
                                   std::vector<OrtValue>& fetches,
                                   const std::unordered_map<size_t, CustomAllocator>& fetch_allocators,
                                   const logging::Logger& logger) {

      // ......
      auto p_op_kernel = session_state.GetKernel(node_index);
      // .....
      // if a kernel has been added in the session state, it better be NON-null.
      if (p_op_kernel == nullptr)
        return ORT_MAKE_STATUS(ONNXRUNTIME, FAIL, "Got nullptr from GetKernel for node: ",
                             node.Name());

      // .....
      try {
        compute_status = p_op_kernel->Compute(&op_kernel_context);
      } catch (const std::exception& ex) {
        compute_status = ORT_MAKE_STATUS(ONNXRUNTIME, RUNTIME_EXCEPTION, ex.what());
      }

// .......
}

As the code shows, the OpKernel is obtained directly from SessionState.GetKernel() and then invoked in the next step without any conversion. In other words, by the time a node executes, its OpKernel has already been determined, so Provider selection does not happen at this stage. It must happen during instantiation or initialization, and initialization seems the more likely place.
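The point that execution only looks kernels up can be illustrated with a small sketch. It assumes a toy session state that maps a node index to a callable prepared earlier; ToySessionState, ToyExecute, and the string log are hypothetical and far simpler than the real SessionState and SequentialExecutor.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical kernel type: appends a line to a log when it runs.
using ToyCompute = std::function<void(std::vector<std::string>&)>;

// Hypothetical session state: node index -> kernel chosen during initialization.
struct ToySessionState {
  std::unordered_map<int, ToyCompute> kernels;

  // Returns nullptr if no kernel was stored for this node.
  const ToyCompute* GetKernel(int node_index) const {
    auto it = kernels.find(node_index);
    return it == kernels.end() ? nullptr : &it->second;
  }
};

// Runs the nodes in the given order. Returns false as soon as a node has no
// kernel, similar in spirit to the nullptr check in the excerpt above.
inline bool ToyExecute(const ToySessionState& state, const std::vector<int>& order,
                       std::vector<std::string>& log) {
  for (int node_index : order) {
    const ToyCompute* kernel = state.GetKernel(node_index);
    if (kernel == nullptr) return false;  // nothing is selected here, only looked up
    (*kernel)(log);
  }
  return true;
}
```

The executor in this sketch never decides where a node runs; that decision is already baked into whichever callable was stored for the node.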
So let's look at the initialization code, most of which lives in InferenceSession::Initialize(). This method is not new to us: as the earlier article on determining the serial execution order of model nodes showed, it also fixes the execution order of the model's nodes by calling SessionStateInitializer::CreatePlan(). Let's look at the method again:

// onnxruntime\onnxruntime\core\session\inference_session.cc#InferenceSession::Initialize()

common::Status InferenceSession::Initialize() {
  // ......
    SessionStateInitializer session_initializer(session_options_.enable_mem_pattern, model_location_, graph,
                                                *session_state_, execution_providers_, kernel_registry_manager_);

    // create SessionState for subgraphs as it's needed by the transformers
    ORT_RETURN_IF_ERROR_SESSIONID_(CreateSubgraphSessionState(graph, *session_state_));

    // apply any transformations to the main graph and any subgraphs
    ORT_RETURN_IF_ERROR_SESSIONID_(TransformGraph(graph, *graph_transformation_mgr_,
                                                  execution_providers_, kernel_registry_manager_,
                                                  insert_cast_transformer_,
                                                  *session_state_));

  // ......
    ORT_RETURN_IF_ERROR_SESSIONID_(session_initializer.CreatePlan(nullptr, nullptr, session_options_.execution_mode));

    // handle any subgraphs
    ORT_RETURN_IF_ERROR_SESSIONID_(InitializeSubgraphSessions(graph, *session_state_));
  // ......
}

From the execution code we already know that the OpKernel is taken from SessionState, so if we find where OpKernels are stored into SessionState, we will know how the runtime is chosen. In the code above, we can track down every place that touches SessionState, one by one. It is like police who have identified a drug buyer: by watching him, noting everyone he contacts, and checking them one by one, they can find the dealer who sells to him. We are not the authors, after all, so reading through the code one place at a time is all we can do. It turns out that choosing the Provider and obtaining the concrete OpKernel are separate steps: the Provider is determined in InferenceSession::TransformGraph(), and once it is determined, SessionStateInitializer::CreatePlan() initializes the corresponding OpKernel for each node.
Let's first look at InferenceSession::TransformGraph():

// onnxruntime\onnxruntime\core\session\inference_session.cc#InferenceSession::TransformGraph()
common::Status InferenceSession::TransformGraph(onnxruntime::Graph& graph,
                                                const onnxruntime::GraphTransformerManager& graph_transformer_mgr,
                                                const ExecutionProviders& providers,
                                                KernelRegistryManager& kernel_registry_manager,
                                                const InsertCastTransformer& insert_cast_transformer,
                                                SessionState& session_state) {
  // .......
  // Do partitioning based on execution providers' capability.
  GraphPartitioner partitioner(kernel_registry_manager, providers);
  ORT_RETURN_IF_ERROR_SESSIONID_(partitioner.Partition(graph, session_state.ExportDll(), session_state.GetMutableFuncMgr()));
  // apply transformers except default transformers
  // Default transformers are required for correctness and they are owned and run by inference session
  for (int i = static_cast<int>(TransformerLevel::Level1); i <= static_cast<int>(TransformerLevel::MaxLevel); i++) {
    ORT_RETURN_IF_ERROR_SESSIONID_(graph_transformer_mgr.ApplyTransformers(graph, static_cast<TransformerLevel>(i), *session_logger_));
  }
  // .......
}

InferenceSession::TransformGraph() delegates to GraphPartitioner::Partition(), which assigns a Provider to each node.

// onnxruntime\onnxruntime\core\framework\graph_partitioner.cc#GraphPartitioner::Partition()

Status GraphPartitioner::Partition(Graph& graph, bool export_dll, FuncManager& func_mgr) const {

// .......
for (auto& provider : providers_) {
    int count = 0;
    std::vector<Node*> nodes_need_compile;
    std::vector<std::unique_ptr<ComputeCapability>> capabilities =
        provider->GetCapability(graph_viewer, kernel_registry_mgr_.GetKernelRegistriesByProviderType(provider->Type()));
    for (auto& capability : capabilities) {
      Node* n = PlaceNode(graph, std::move(capability->sub_graph), kernel_registry_mgr_, provider->Type(), count);
      if (n != nullptr) {
        nodes_need_compile.push_back(n);
      }
    }

    if (!nodes_need_compile.empty()) {
      if (export_dll) {
        std::string dll_path;
        ORT_RETURN_IF_ERROR(provider->Compile(nodes_need_compile, dll_path));
        for (auto* node : nodes_need_compile)
          ORT_RETURN_IF_ERROR(func_mgr.AddFuncInfo(node->Name(), dll_path));
      } else {
        std::vector<NodeComputeInfo> node_compute_funcs;
        ORT_RETURN_IF_ERROR(provider->Compile(nodes_need_compile, node_compute_funcs));
        ORT_ENFORCE(node_compute_funcs.size() == nodes_need_compile.size(),
                    "Provider doesn't return correct number of compiled functions");
        for (size_t j = 0; j < nodes_need_compile.size(); j++)
          ORT_RETURN_IF_ERROR(func_mgr.AddFuncInfo(nodes_need_compile[j]->Name(), node_compute_funcs[j].compute_func,
                                                   node_compute_funcs[j].create_state_func,
                                                   node_compute_funcs[j].release_state_func));
      }
      for (auto* node : nodes_need_compile) {
        //prepare the func kernel
        KernelDefBuilder builder;
        BuildFusedKernelDef(builder, *node);
        ORT_RETURN_IF_ERROR(fused_kernel_registry->Register(
            builder, static_cast<KernelCreatePtrFn>([](const OpKernelInfo& info) -> OpKernel* { return new FunctionKernel(info); })));
      }
    }
  }
  // .......
}

In the code above, providers_ is an instance of ExecutionProviders, a reference to which is held by the InferenceSession. It is essentially a list (a std::vector) holding all Providers available for this run. GraphPartitioner::Partition() walks this list from beginning to end and asks each Provider, through provider->GetCapability(), what it can handle, that is, which nodes of this model it supports. The result is a list of the nodes supported by that Provider, and PlaceNode() then sets the execution provider type on each of those nodes, which is the string that identifies the Provider. Once the Providers have been asked, there may well be nodes that no dedicated accelerator can handle. Those nodes are handled by the CPU Provider, which is why the CPU Provider is required to support all ONNX operators.
An analogy: the nodes of the model are a group of students who have just taken the college entrance examination, and each university is a Provider. After the exam, the admissions system asks each university in turn: here is a group of students, which ones do you want? Each university admits the students who meet its requirements, according to its own resources and conditions, and sends them an offer. Once every university has been asked, some students will probably remain unadmitted. What happens to them? The "university of life" takes them in as a last resort. In ONNX Runtime, the CPU is that university of life.
The signature of GetCapability() is

std::vector<std::unique_ptr<ComputeCapability>>
IExecutionProvider::GetCapability(const onnxruntime::GraphViewer& graph,
                                  const std::vector<const KernelRegistry*>& kernel_registries) const

It takes a const GraphViewer& and a const std::vector<const KernelRegistry*>&, and returns a std::vector<std::unique_ptr<ComputeCapability>>. The indices of the supported nodes can be read from the returned list. The GraphViewer exposes the structure and node information of the model, including each node's type, index, and parameters. Each subclass of IExecutionProvider overrides this method as appropriate: it gets all nodes of the model through the GraphViewer and decides whether it supports some or all of them. If a node is supported, its index is wrapped in a ComputeCapability and returned to the caller in the list. IExecutionProvider supplies a default implementation that tries to claim every node for which the Provider has a registered kernel, without looking at the node's specific parameters. The CPU Provider uses this default implementation. Providers such as the Cuda Provider must also consider the node's parameters, not just its type, when deciding whether to support it; for example, the Cuda Provider only supports convolution with symmetric padding. The default implementation of GetCapability() is as follows:

// onnxruntime\onnxruntime\core\framework\execution_provider.cc#IExecutionProvider::GetCapability()
std::vector<std::unique_ptr<ComputeCapability>>
IExecutionProvider::GetCapability(const onnxruntime::GraphViewer& graph,
                                  const std::vector<const KernelRegistry*>& kernel_registries) const {
  std::vector<std::unique_ptr<ComputeCapability>> result;
  for (auto& node : graph.Nodes()) {
    for (auto registry : kernel_registries) {
      if (registry->TryFindKernel(node, Type()) != nullptr) {
        std::unique_ptr<IndexedSubGraph> sub_graph = onnxruntime::make_unique<IndexedSubGraph>();
        sub_graph->nodes.push_back(node.Index());
        result.push_back(onnxruntime::make_unique<ComputeCapability>(std::move(sub_graph)));
        break;
      }
    }
  }

  return result;
}
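As a contrast with the default implementation, the following is a hypothetical sketch of a Provider that also inspects node parameters before claiming a node. ToyNode, its padding fields, and ToyGpuCapability are invented for illustration and do not correspond to the real Cuda Provider code; only the symmetric padding rule is taken from the description above, and the choice of Relu as an always-supported operator is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical node: operator type plus the padding at the start and end of one axis.
struct ToyNode {
  std::size_t index;
  std::string op_type;
  int pad_begin;
  int pad_end;
};

// Returns the indices of the nodes this hypothetical GPU provider claims:
// every Relu, and Conv only when its padding is symmetric.
inline std::vector<std::size_t> ToyGpuCapability(const std::vector<ToyNode>& nodes) {
  std::vector<std::size_t> supported;
  for (const ToyNode& node : nodes) {
    if (node.op_type == "Relu") {
      supported.push_back(node.index);
    } else if (node.op_type == "Conv" && node.pad_begin == node.pad_end) {
      supported.push_back(node.index);
    }
    // Any other node is left unclaimed for a later provider to pick up.
  }
  return supported;
}
```

A Conv node with asymmetric padding is simply not reported, so it stays unassigned until a later Provider in the list claims it.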

There is a small catch, however. Since every Provider is asked in turn about all nodes of the same model, several Providers may declare that they can handle the same node. How is this conflict resolved? ONNX Runtime's solution is simple and blunt: first come, first served. If a node has already been claimed by one Provider and assigned to it, a later Provider's claim on that node is ignored. Look at the code below: an if condition ensures that a node that has already been assigned is not reassigned:

// onnxruntime\onnxruntime\core\framework\graph_partitioner.cc#PlaceNode()
static Node* PlaceNode(Graph& graph, std::unique_ptr<IndexedSubGraph> capability,
                       const KernelRegistryManager& kernel_registry_mgr, const std::string& provider_type, int& count) {
  if (nullptr == capability) {
    return nullptr;
  }

  if (nullptr == capability->GetMetaDef()) {
    // The <provider> can run a single node in the <graph> if not using meta-defs.
    // A fused kernel is not supported in this case.
    ORT_ENFORCE(1 == capability->nodes.size());

    auto* node = graph.GetNode(capability->nodes[0]);
    if (nullptr != node && node->GetExecutionProviderType().empty()) {
      // The node was not fused or assigned. Assign it to this <provider>.
      node->SetExecutionProviderType(provider_type);
    }
      // .......
  }
  // .......
}
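The first come, first served rule can be reduced to a few lines. This is a simplified sketch under the assumption that each Provider reports plain node indices and that fused sub-graphs are ignored; ToyProvider and ToyPartition are hypothetical names, not ONNX Runtime APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical provider description: a name plus the node indices it claims.
struct ToyProvider {
  std::string name;
  std::vector<std::size_t> claimed_nodes;
};

// Assigns each node to the first provider, in list order, that claims it.
// An empty string means "not assigned yet", like the empty provider type
// checked in PlaceNode() above.
inline std::vector<std::string> ToyPartition(std::size_t node_count,
                                             const std::vector<ToyProvider>& providers) {
  std::vector<std::string> assignment(node_count);
  for (const ToyProvider& provider : providers) {
    for (std::size_t index : provider.claimed_nodes) {
      if (index < node_count && assignment[index].empty()) {
        assignment[index] = provider.name;  // claims by later providers are ignored
      }
    }
  }
  return assignment;
}
```

With a GPU Provider listed first and claiming nodes 0 and 2, and a CPU Provider listed second and claiming every node, nodes 0 and 2 go to the GPU and the rest fall to the CPU.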

So how is it decided who comes first? There are two cases:

  1. The user specifies the Providers: the order is the one given by the user;
  2. The user does not specify any: the order is hard-coded by the ONNX Runtime developers, who have fixed the order of all supported Providers directly in the code. All supported Providers and their default order are shown below. As you can see, the CPU has the lowest priority, so it can act as the fallback for all other Providers:

// onnxruntime\onnxruntime\python\onnxruntime_pybind_state.cc#GetAllProviders()
// ordered by default priority. highest to lowest.
const std::vector<std::string>& GetAllProviders() {
  static std::vector<std::string> all_providers = {kTensorrtExecutionProvider, kCudaExecutionProvider, kDnnlExecutionProvider,
                                                   kNGraphExecutionProvider, kOpenVINOExecutionProvider, kNupharExecutionProvider,
                                                   kBrainSliceExecutionProvider, kCpuExecutionProvider};
  return all_providers;
}

The Provider registration process looks like this:

// onnxruntime\onnxruntime\python\onnxruntime_pybind_state.cc#InitializeSession()
void InitializeSession(InferenceSession* sess, const std::vector<std::string>& provider_types) {
  if (provider_types.empty()) {
    // use default registration priority.
    RegisterExecutionProviders(sess, GetAllProviders());
  } else {
    RegisterExecutionProviders(sess, provider_types);
  }
  OrtPybindThrowIfError(sess->Initialize());
}

In the first branch, the user has not specified any Provider, so all Providers supported on the local machine are registered in the default order. In the second branch, the user has specified Providers. Because ONNX Runtime requires the CPU to be available as the fallback, if the user's list does not include the CPU Provider, the system adds it automatically so that any operation can fall back to the CPU:

// onnxruntime\onnxruntime\core\session\inference_session.cc#InferenceSession::Initialize()
common::Status InferenceSession::Initialize() {
  // ......
    // Register default CPUExecutionProvider if user didn't provide it through the Register() calls
    if (!execution_providers_.Get(onnxruntime::kCpuExecutionProvider)) {
      LOGS(*session_logger_, INFO) << "Adding default CPU execution provider.";
      CPUExecutionProviderInfo epi{session_options_.enable_cpu_mem_arena};
      auto p_cpu_exec_provider = onnxruntime::make_unique<CPUExecutionProvider>(epi);
      ORT_RETURN_IF_ERROR_SESSIONID_(RegisterExecutionProvider(std::move(p_cpu_exec_provider)));
    }
  // ......
}
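Taken together, the two snippets above determine the final Provider order. The sketch below mimics that logic with plain strings; ToyResolveProviders and the Provider names are hypothetical, and it assumes, as in the Initialize() excerpt, that a missing CPU Provider is registered after everything else.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: returns the provider order a session would end up with.
// `requested` is the user's list; `defaults` is the built-in priority list.
inline std::vector<std::string> ToyResolveProviders(const std::vector<std::string>& requested,
                                                    const std::vector<std::string>& defaults) {
  // An empty request means "use the default registration priority".
  std::vector<std::string> result = requested.empty() ? defaults : requested;
  // Append the CPU provider last if it is missing, so it acts as the fallback.
  if (std::find(result.begin(), result.end(), std::string("CPU")) == result.end()) {
    result.push_back("CPU");
  }
  return result;
}
```

A request for only "CUDA" therefore resolves to "CUDA" followed by "CPU", while an empty request resolves to the default list unchanged, provided that list already ends with the CPU.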

Summary

The code examples above show how ONNX Runtime partitions a model:

  1. The supported Providers are registered in the order in which they appear in a list, which is either specified by the user or the default list. The earlier a Provider appears in the list, the higher its priority;
  2. ONNX Runtime calls each Provider's GetCapability() method in turn, in the priority order fixed in step 1, to ask what that Provider can handle. GetCapability() returns a list of all the nodes in the given model that the Provider can run, that is, sub-graphs of the model;
  3. The nodes in the list from step 2 are checked one by one. A node that has not already been assigned to a higher-priority Provider is assigned to the current Provider;
  4. Finally, nodes that no dedicated Provider supports fall back to the CPU.

Although implementations differ, many inference engines share the same basic ideas, partitioning among them. Qualcomm's Snapdragon Neural Processing Engine partitions offline: a dedicated compiler compiles the nodes that can run on a specific piece of hardware and writes the relevant information into a special model file. ONNX Runtime's approach is much simpler: it partitions the model directly at run time.



Posted on Fri, 13 Mar 2020 11:54:54 -0400 by cdm2