Modern C++ language features in DPC++

Ⅰ DPC++ introduction

DPC++ stands for Data Parallel C++. It is an open source project initiated by Intel to bring SYCL to LLVM and oneAPI. SYCL is a high-level programming model designed to improve programming productivity on various acceleration devices. In short, it is a cross-platform abstraction layer: users do not need to care about the specific accelerator underneath, and by writing unified, standard-conforming code they can run on a variety of platforms. It is fair to say that SYCL greatly improves the portability and programming productivity of heterogeneous computing code and has become an industry standard for heterogeneous computing. It is worth mentioning that SYCL is not an acronym. DPC++ is built on SYCL and modern C++, specifically on the C++17 standard.
The purpose of this article is to discuss how modern C++ language features are used in DPC++; it is a companion piece to my book Analysis of the Core Features of the Modern C++ Language, not an exploration of the principles of heterogeneous computing, which is a huge topic best left to domain experts.
For the experimental environment, I chose to install the Intel oneAPI Toolkit locally, since local tools are more convenient to use. However, if your hardware does not meet the requirements, you can register for DevCloud instead. DevCloud is a remote development environment provided by Intel that includes clusters of the latest Intel hardware and software.

Ⅱ DPC++ background

1. What is data parallel programming

Data parallel programming can be described both as a way of thinking and as a way of programming. Data is manipulated by a set of parallel processing units, each of which is a piece of hardware capable of computing on data. These processing units may exist on a single device or across multiple devices in a computer system. We specify the code that processes our data in the form of a kernel.
The kernel is an important concept in data parallel programming: it is the code that the processing units on a device execute. The term is used in SYCL, OpenCL, CUDA, and DPC++.

2. What is a heterogeneous system

A heterogeneous system is any system that contains multiple types of computing devices; for example, a system with both a CPU and a GPU is heterogeneous. There are now many such computing devices, including CPUs, GPUs, FPGAs, DSPs, ASICs, and AI chips. The emergence of heterogeneous systems brings a great challenge: each of the devices just mentioned has a different architecture and different characteristics, which leads to different programming and optimization requirements for each device. One motivation for developing DPC++ is to help solve this challenge.

3. Why heterogeneous systems are needed

Heterogeneous computing matters because computer architects have long been committed to limiting power consumption, reducing latency, and improving throughput. From 1990 to 2006, application performance improved because processor performance doubled every two to three years (mainly because the clock frequency doubled every two years). That era ended around 2006, when multi-core and many-core processors appeared. The shift to parallel architectures improved the performance of multitasking systems, but it did not improve the performance of most existing single applications without changes to their code. In this new era, accelerators such as GPUs became more popular than ever because they could accelerate applications more efficiently. This gave birth to the era of heterogeneous computing, with a large number of accelerators, each with its own specialized processing capabilities and its own programming model. By specializing the accelerator design, they can deliver higher performance on specific problems precisely because they do not have to handle every problem; this is a classic computer-architecture trade-off. It usually also means that accelerators support only a subset of the programming languages designed for general-purpose processors; in fact, in DPC++ only code written in a kernel can run on an accelerator.
Accelerator architectures can be divided into several broad categories that affect our choice of programming model and algorithm and how we use the accelerators efficiently. For example, CPUs are the best choice for general-purpose code, including scalar and decision-heavy code, and usually come with built-in vector accelerators. GPUs aim to accelerate vectors and the closely related tensors. DSPs accelerate specific mathematical operations with low latency, typically for processing the analog signals of mobile phones. AI accelerators are usually used to accelerate matrix operations, although some also accelerate graphs. FPGAs and ASICs are especially suitable for spatial computing.

4. Why use DPC++

On the one hand, DPC++ is portable, high-level, and non-proprietary, which matches the needs of modern heterogeneous computer architectures. On the other hand, it lets code for the host and for the computing devices share the same programming environment, namely modern C++. Finally, the future of computer architecture includes accelerators spanning scalar, vector, matrix, and spatial (SVMS) operation, and heterogeneous machines combining SVMS capabilities need to be supported; that support should cover highly complex programmable devices as well as fixed-function or specialized devices with limited programmability.

Ⅲ DPC++

Before discussing how modern C++ language features are used in DPC++, let's look at a complete program and test our experimental environment:

#include <CL/sycl.hpp>
#include <iostream>
constexpr int N = 16;
using namespace sycl;

// Device selector that prefers an Intel GPU over all other devices.
class IntelGPUSelector : public device_selector {
 public:
  int operator()(const device& Device) const override {
    const std::string DeviceName = Device.get_info<info::device::name>();
    // The higher the returned score, the higher the device's priority.
    return Device.is_gpu() && (DeviceName.find("Intel") != std::string::npos) ? 100 : 0;
  }
};

int main() {
  IntelGPUSelector d;
  queue q(d);
  // Allocate memory that is accessible from both the host and the device.
  int* data = malloc_shared<int>(N, q);
  q.parallel_for(N, [=](auto i) {
     data[i] = i;
   }).wait();
  for (int i = 0; i < N; i++) std::cout << data[i] << " ";
  free(data, q);  // sycl::free, not the free from the C runtime library
}

Compile and run the above code. If everything works, it should output:

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Let me briefly explain this code. sycl is the namespace for DPC++ entities, and using namespace sycl; opens that namespace to simplify the subsequent code. IntelGPUSelector is a device selector derived from device_selector, a pure virtual class with the pure virtual function int operator()(const device& Device) const that derived classes must implement. The function is invoked for every computing device found on the machine and returns the device's priority: the higher the returned number, the higher the priority. Here an Intel GPU is chosen as the preferred computing device. Note that the function is marked override to make clear that it intends to override a virtual function. The queue specifies where work is submitted, in this case the Intel GPU. The function template malloc_shared allocates memory that can be used on both the host and the device. The member function parallel_for launches the parallel computation. It is also worth noting that the free call here is sycl::free rather than the free from the C runtime library. The most obvious use of modern C++ syntax in this code is the argument passed to parallel_for,

[=](auto i) { data[i] = i; }

This is a lambda expression.

Ⅳ DPC++ and lambda expressions

If I had to pick the single most important modern C++ feature for DPC++, it would be lambda expressions, because in DPC++ kernel code generally appears in the form of a lambda expression. For example, the program above passes a lambda expression as an object to Intel's GPU device for computation. In that lambda expression, [] is the capture list, which can capture the values of variables in the enclosing scope; that is why the function body can use data[i]. The capture list [=] is followed by the parameter list (auto i). Note that the parameter type here uses the auto placeholder, i.e., we delegate the deduction of the parameter type to the compiler; we generally call such a lambda a generic lambda. If the C++20 standard is selected at compile time, we can also rewrite it as a generic lambda with template syntax:

[=]<typename T>(T i) { data[i] = i; }

The capture list of a lambda expression is very powerful. In addition to capturing by value, it can also capture by reference, for example:

[&](auto i) { data[i] = i; }

The above code captures references to the variables in the enclosing scope. However, note that since this code is handed off to the accelerator to execute, capturing by reference is not correct here and leads to a compilation error. Also, as a general rule, we recommend capturing variables selectively rather than capturing everything wholesale, for example:

[data](auto i) { data[i] = i; }

Of course, besides lambda expressions, we can also run code on the device in other forms, for example with a function object (functor):

struct AssignTest {
  // A member function template plays the role the generic lambda played above
  // (an abbreviated "auto" parameter here would require C++20).
  template <typename T>
  void operator()(T i) const { data_[i] = i; }
  int* data_;
};

AssignTest functor{data};
q.parallel_for(N, functor).wait();

However, this approach is clearly not as simple and direct as using a lambda expression.

Ⅴ DPC++ and generic capabilities

The reason parallel_for can flexibly accept arguments in so many forms is that parallel_for itself is a member function template:

template <typename KernelName = detail::auto_name, typename KernelType>
event parallel_for(range<1> NumWorkItems,
                   _KERNELFUNCPARAM(KernelFunc) _CODELOCPARAM(&CodeLoc)) {
  _CODELOCARG(&CodeLoc);
  return parallel_for_impl<KernelName>(NumWorkItems, KernelFunc, CodeLoc);
}

Here KernelFunc is the lambda expression or functor that was passed in, and KernelType is the type of KernelFunc.
If you trace the call chain downward from this code, you will find that the argument types are carried along by templates all the way to submit_impl:

sycld.dll!cl::sycl::queue::submit_impl
dpcpp.exe!cl::sycl::queue::submit
dpcpp.exe!cl::sycl::queue::parallel_for_impl
dpcpp.exe!cl::sycl::queue::parallel_for

This is because sycld.dll is a binary module: it cannot expose code in template form, so all types must be fixed. To solve this problem, cl::sycl::queue::submit_impl uses std::function:

event submit_impl(function_class<void(handler &)> CGH,
                    const detail::code_location &CodeLoc);
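
This type-erasure step can be shown in isolation. Below is a minimal sketch of the technique, using a hypothetical submit/submit_impl pair of my own (it is not the DPC++ source): the template side wraps an arbitrary callable in a lambda, and std::function carries it across the binary boundary with a fixed type.

#include <functional>
#include <iostream>

// "Binary side": compiled once into the library; no templates cross this line.
void submit_impl(std::function<void(int)> task) { task(16); }

// "Header side": a template visible to user code. It erases the concrete type
// of F by wrapping it in a lambda that std::function<void(int)> can store.
template <typename F>
void submit(F f) {
  submit_impl([f](int x) { f(x); });
}

int main() {
  submit([](int x) { std::cout << "got " << x << "\n"; });  // prints "got 16"
}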

The function template cl::sycl::queue::parallel_for_impl wraps KernelFunc in another lambda expression and passes that lambda as the function_class<void(handler&)>:

template <typename KernelName = detail::auto_name, typename KernelType,
            int Dims>
  event parallel_for_impl(
      range<Dims> NumWorkItems, KernelType KernelFunc,
      const detail::code_location &CodeLoc = detail::code_location::current()) {
    return submit(
        [&](handler &CGH) {
          CGH.template parallel_for<KernelName, KernelType>(NumWorkItems,
                                                            KernelFunc);
        },
        CodeLoc);
  }

Here function_class is std::function. Note that the call CGH.template parallel_for requires the template disambiguator; without it, the angle brackets would be parsed incorrectly as comparison operators. Through this series of techniques, DPC++ preserves the user's programming flexibility to the greatest possible extent.
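
The need for the template disambiguator can be demonstrated with a small self-contained sketch (my own illustration, assuming a hypothetical Handler type):

#include <iostream>

struct Handler {
  template <typename T>
  void parallel_for() { std::cout << "dispatched\n"; }
};

template <typename H>
void submit(H& cgh) {
  // cgh's type is a template parameter, so parallel_for is a dependent name;
  // without the "template" keyword, '<' would parse as operator less-than.
  cgh.template parallel_for<int>();
}

int main() {
  Handler h;
  submit(h);
}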

Ⅵ DPC++ and template deduction

DPC++ makes extensive use of the template deduction features introduced by the C++17 standard. Let's start exploring them with a small DPC++ example:

int main() {
  IntelGPUSelector d;
  queue q(d);
  std::vector<int> v1(N);
  std::array<int, N> v2;
  {  // Separate scope: the buffers write data back to v1 and v2 on destruction.
    buffer buf1(v1);  // class template arguments deduced from the container
    buffer buf2(v2);

    q.submit([&](handler& h) {
      accessor a1(buf1, h, write_only);
      accessor a2(buf2, h, write_only);
      h.parallel_for(N, [=](auto i) {
        a1[i] = i;
        a2[i] = i;
      });
    });
  }
  for (int i = 0; i < N; i++) std::cout << v1[i] << v2[i] << " ";
}

This code does not use malloc_shared to allocate memory; instead it uses buffer and accessor, where a buffer encapsulates data and an accessor accesses it. Let's take buffer as an example to analyze DPC++'s use of template deduction.
First, observe the two buffer instances: the arguments of their constructors are of type std::vector<int> and std::array<int, N> respectively. The constructors can be called this way not because buffer overloads constructors for these two types, but because its constructor is itself a template. This involves a new feature of the C++17 standard: class template argument deduction (CTAD). In the past, instantiating a class template required explicitly supplying the template arguments, otherwise compilation would fail; under the new standard, the template arguments of a class template can be deduced from the constructor arguments.
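
Here is a minimal sketch of CTAD using only standard library types (my own illustration):

#include <utility>
#include <vector>

int main() {
  std::pair p(1, 2.5);     // deduced as std::pair<int, double>
  std::vector v{1, 2, 3};  // deduced as std::vector<int>
  // Before C++17, both lines required explicit template arguments,
  // e.g. std::pair<int, double> p(1, 2.5);
}

With CTAD in mind, take a look at the buffer constructor: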

template <typename T, int dimensions = 1,
          typename AllocatorT = cl::sycl::buffer_allocator,
          typename = typename detail::enable_if_t<(dimensions > 0) &&
                                                  (dimensions <= 3)>>
class buffer {
public:
...
  template <class Container, int N = dimensions,
            typename = EnableIfOneDimension<N>,
            typename = EnableIfContiguous<Container>>
  buffer(Container &container, AllocatorT allocator,
         const property_list &propList = {})
      : Range(range<1>(container.size())) {
    impl = std::make_shared<detail::buffer_impl>(
        container.data(), get_count() * sizeof(T),
        detail::getNextPowerOfTwo(sizeof(T)), propList,
        make_unique_ptr<detail::SYCLMemObjAllocatorHolder<AllocatorT>>(
            allocator));
  }

  template <class Container, int N = dimensions,
            typename = EnableIfOneDimension<N>,
            typename = EnableIfContiguous<Container>>
  buffer(Container &container, const property_list &propList = {})
      : buffer(container, {}, propList) {}
...
};

The statement buffer buf1(v1); will invoke

buffer(Container &container, const property_list &propList = {})

Note that this constructor has no implementation code of its own; instead, it delegates construction to the constructor

buffer(Container &container, AllocatorT allocator, const property_list &propList = {})

A delegating constructor is a feature introduced by C++11: it allows one constructor to hand the work of construction over to another constructor (a minimal sketch of the feature appears below). Back to template deduction: here the compiler deduces Container as std::vector<int>, and dimensions is deduced as 1. The last two template parameters are used to check that the first two are valid, which is where a fair amount of template metaprogramming comes in.
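
Here is delegating construction in isolation (my own sketch, using a hypothetical Logger class):

#include <string>
#include <utility>

class Logger {
 public:
  // Target constructor: does the actual work.
  Logger(std::string name, int level) : name_(std::move(name)), level_(level) {}
  // Delegating constructor (C++11): hands construction off to the one above.
  explicit Logger(std::string name) : Logger(std::move(name), 0) {}

 private:
  std::string name_;
  int level_;
};

int main() {
  Logger l("kernel");  // runs the delegating constructor, then the target one
}

Now back to the metaprogramming that checks buffer's last two template parameters: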

template <int dims>
using EnableIfOneDimension = typename detail::enable_if_t<1 == dims>;

template <class Container>
using EnableIfContiguous =
    detail::void_t<detail::enable_if_t<std::is_convertible<
                       detail::remove_pointer_t<decltype(
                           std::declval<Container>().data())> (*)[],
                       const T (*)[]>::value>,
                   decltype(std::declval<Container>().size())>;

First of all, these are alias templates defined with using; their purpose is to check that dims is 1 and that the Container stores its elements contiguously. The first alias template is very simple: it directly checks whether dims is 1 (detail::enable_if_t is just std::enable_if_t). The second check, for contiguity, is a little more involved. In short, it checks whether a pointer to an array of the type returned by the container's data() member function can be converted to const T (*)[]. Two things are being verified here: first, that Container has a data() member function at all, and second, that a pointer to an array of the returned element type is convertible to const T (*)[]. Among the standard containers, only the contiguous ones provide a data() member function; the others fail because data() does not exist, for example:

no member named 'data' in 'std::list<int>'
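
The data()-detection part of this trick can be reproduced in isolation with std::void_t (a sketch, assuming C++17):

#include <list>
#include <type_traits>
#include <utility>
#include <vector>

// Detect whether T has a data() member function, via the void_t idiom.
template <typename T, typename = void>
struct has_data : std::false_type {};

template <typename T>
struct has_data<T, std::void_t<decltype(std::declval<T>().data())>>
    : std::true_type {};

static_assert(has_data<std::vector<int>>::value, "vector is contiguous");
static_assert(!has_data<std::list<int>>::value, "list has no data()");

int main() {}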

Readers who look at the code above carefully will notice another problem: nowhere does anything help the compiler deduce buffer's class template parameter T. It must be said that DPC++ exploits the new C++17 deduction features to the fullest: the code contains a user-defined deduction guide for exactly this purpose:

template <class Container>
buffer(Container &, const property_list & = {})
    ->buffer<typename Container::value_type, 1>;
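
The same technique in a minimal, self-contained form (my own sketch, not DPC++ code): the guide tells the compiler to deduce the element type T from the constructor argument's nested value_type.

#include <cstddef>
#include <vector>

template <typename T>
struct Wrapper {
  template <class Container>
  Wrapper(Container& c) : data(c.data()), size(c.size()) {}
  T* data;
  std::size_t size;
};

// User-defined deduction guide: deduce T from the container's value_type.
template <class Container>
Wrapper(Container&) -> Wrapper<typename Container::value_type>;

int main() {
  std::vector<int> v(16);
  Wrapper w(v);  // deduced as Wrapper<int>
}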

As these examples show, a user-defined deduction guide lets the programmer tell the compiler how to deduce a class template's parameters from the constructor arguments. Finally, note that in this example the buffer writes its cached data back to v1 and v2 only when it is destroyed, which is why a separate scope surrounds the buffers. The write-back happens in the destructor of buffer_impl:

~buffer_impl() {
  try {
    // Write the buffered data back to host memory on destruction.
    BaseT::updateHostMemory();
  } catch (...) {
  }
}

Ⅶ Summary

This article started from several simple DPC++ examples and step by step explored how DPC++ applies modern C++ language features, including lambda expressions, generics, and template deduction. Of course, DPC++ uses far more new features than these, and it is precisely these new features that let DPC++ accomplish work that was impossible before. This also reflects the trend of C++ development in recent years: more and more code bases are adopting the new features, some in very "magical" ways, and DPC++ is one of them. Merely reading the feature-related code in DPC++ is impressive enough, to say nothing of its code organization and underlying abstractions. A single article cannot cover all the modern C++ language features in DPC++, so, at the risk of praising my own wares, I recommend my book Analysis of the Core Features of the Modern C++ Language and the course "Modern C++: 42 Lectures"; after reading the book or taking the course, I believe you will have a much deeper understanding of modern C++.

References

1. DPC++ Part 1: An Introduction to the New Programming Model. https://simplecore-ger.intel.com/techdecoded/wp-content/uploads/sites/11/Webinar-Slides-DPC-Part-1-An-Introduction-to-the-New-Programming-Model-.pdf
2. Data Parallel C++: Mastering DPC++ for Programming of Heterogeneous Systems Using C++ and SYCL (preview). https://resource-cms.springernature.com/springer-cms/rest/v1/content/17382710/data/v1
3. Intel® DevCloud. https://software.intel.com/en-us/devcloud/oneapi
4. New, Open DPC++ Extensions Complement SYCL and C++. https://insidehpc.com/2020/06/new-open-dpc-extensions-complement-sycl-and-c/
