Implementation principles of ConcurrentArena, a memory allocator for RocksDB


Let's take a look at the memory allocator.

RocksDB, as a representative industrial implementation of the LSM tree, inevitably has heavy memory-allocation requirements because of the LSM tree's in-memory component, the memtable. (For background on how memory allocators evolved, see the earlier article on the technical evolution of memory allocator design; it explains why memory-hungry applications generally need a dedicated allocator.)

Moreover, the memtable is a key component on the write path: besides an efficient memtable index, the efficiency of memory allocation itself matters a great deal. Under RocksDB's write model, multiple threads may write to the same memtable concurrently, so efficient multi-core concurrent allocation for a single memtable is a hard requirement. And while improving performance, the allocator also has to keep memory fragmentation as low as possible.

Beyond the memtable, the memory allocator shows up in all kinds of allocation needs across RocksDB: data storage for the iterator system, block_cache contents (index blocks, filter/bloom-filter blocks, etc.), and RocksDB's own internal allocations. It covers essentially all of the engine's internal memory needs, so both its performance and its control over memory utilization have to meet very high standards.

Of course, RocksDB also supports different underlying memory allocators: the glibc allocator, jemalloc, and tcmalloc.

This article hopes to answer the following questions:

  • RocksDB's Arena allocator design: how it keeps performance high on multi-core machines while reducing memory fragmentation
  • The differences and connections between RocksDB's Arena allocator and other memory allocators
  • Why doesn't RocksDB simply use another memory allocator directly? Does it have to roll its own?

Related rocksdb source code: v6.25

Implementation principles of RocksDB ConcurrentArena

Basic architecture

Let's start with the basic architecture diagram.

Readers familiar with memory allocators should already be able to answer some of the questions above just from this figure.

The gray dotted outlines in the figure above mark the two main data structures RocksDB uses to implement ConcurrentArena:

  • Arena: this hardly needs introduction. Both glibc malloc and jemalloc use an arena as the back end of the allocator, requesting large blocks of memory directly from the OS and managing them as a memory pool for the application; in tcmalloc the equivalent is the back-end pagemap memory pool. In general, an arena is the data structure through which an application uniformly manages heap memory and free memory obtained from the system.
  • Shard: inside the ConcurrentArena class this is the member shards_, itself a core-local array. As shown in the figure above, each CPU core maintains one shard to serve the allocation requests of the threads currently running on that CPU. It plays the same role as the thread cache, or the per-CPU cache introduced in newer versions of tcmalloc.

Why do we need a core-local shard (or, in other allocators, something like a thread cache)? In a multi-core scenario, each CPU performs best when it works on data that is friendly to its own CPU cache. Accessing data already in the local cache takes only a few nanoseconds. But if the data is shared by all CPUs (and therefore needs locking), the cache-coherence protocol forces it back to memory whenever another CPU updates its own cached copy (so that nobody reads stale data), and every CPU ends up reloading it from memory again and again. A few nanoseconds of access latency turns into tens or even hundreds of nanoseconds, and performance suffers badly.

Therefore, letting each CPU/thread manage its own allocation requests both reduces lock overhead (it cannot be eliminated entirely, since a shard sometimes has to refill from the shared arena) and reduces CPU cache misses, which helps multi-core performance a great deal.
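To make the core-local idea concrete, here is a minimal sketch, assuming a fixed shard count and using a thread-id hash as a stand-in for the real CPU-id lookup; it is not RocksDB's actual CoreLocalArray, just the concept:

    #include <atomic>
    #include <cstddef>
    #include <functional>
    #include <mutex>
    #include <thread>
    #include <vector>

    // A minimal sketch of the core-local idea (NOT RocksDB's CoreLocalArray):
    // one shard per core, each with its own lock and free-space counter, so
    // threads running on different cores rarely touch the same cache lines.
    struct Shard {
      std::mutex mutex;                             // per-shard lock, rarely contended
      std::atomic<size_t> allocated_and_unused{0};  // free bytes left in this shard
      char* free_begin = nullptr;                   // bump pointer into the shard's block
    };

    class CoreLocalShards {
     public:
      explicit CoreLocalShards(size_t num_shards) : shards_(num_shards) {}

      // RocksDB picks the shard of the CPU the calling thread is running on; here we
      // approximate that with a hash of the thread id, which still gives each thread a
      // stable "home" shard and keeps cross-thread contention low.
      Shard& AccessForCurrentThread() {
        size_t idx = std::hash<std::thread::id>{}(std::this_thread::get_id());
        return shards_[idx % shards_.size()];
      }

     private:
      std::vector<Shard> shards_;  // RocksDB sizes this to the number of CPU cores
    };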

Memory allocation process

Allocation requests fall into three size classes: [0, 128KB), [128KB, 256KB), and [256KB, OS limit).

The general flow chart is as follows:

Large allocations fall straight through to the underlying Allocator, i.e., whichever allocator the current db binary was built against.

Small allocations are served first from each CPU's shard; if the shard does not have enough capacity, it refills itself from the arena.
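As a rough illustration of that dispatch (the enum, function name, and hard-coded thresholds below are made up for the example; the real decision and exact cut-offs live in AllocateImpl, quoted below):

    #include <cstddef>
    #include <cstdio>

    // Illustrative only: classify a request into the three paths described above,
    // using the default sizes quoted in this article (128KB shard block, 256KB
    // direct-to-arena cut-off).
    enum class AllocPath { kShardFastPath, kShardRefillFromArena, kArenaDirect };

    AllocPath ClassifyRequest(size_t bytes) {
      constexpr size_t kShardBlockSize = 128 * 1024;
      if (bytes < kShardBlockSize) return AllocPath::kShardFastPath;             // served by the shard
      if (bytes < 2 * kShardBlockSize) return AllocPath::kShardRefillFromArena;  // shard refills from arena
      return AllocPath::kArenaDirect;                                            // straight to the arena
    }

    int main() {
      std::printf("4KB   -> %d\n", static_cast<int>(ClassifyRequest(4 << 10)));
      std::printf("200KB -> %d\n", static_cast<int>(ClassifyRequest(200 << 10)));
      std::printf("1MB   -> %d\n", static_cast<int>(ClassifyRequest(1 << 20)));
    }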

  1. Small allocations (< 128KB): by default, requests under 128KB (options.arena_block_size is 1048576; divided by 8 this gives the shard_block_size_ maintained by each shard) are served directly from the free_begin_ region managed by the shard. (A standalone model of this bump-pointer scheme appears at the end of this section.)

    char* AllocateImpl(size_t bytes, bool force_arena, const Func& func) {
    		...
      // Select the currently running core and take out the shard previously bound to the cpu core
      Shard* s = shards_.AccessAtCore(cpu & (shards_.Size() - 1));
      if (!s->mutex.try_lock()) {
        s = Repick();
        s->mutex.lock();
      }
      // The shard granularity spinlock ensures that the memory allocation of the current thread from the current shard is atomic.
      std::unique_lock<SpinMutex> lock(s->mutex, std::adopt_lock);
    
      // Take the free memory size of the current shard
      size_t avail = s->allocated_and_unused_.load(std::memory_order_relaxed);
      ...
      s->allocated_and_unused_.store(avail - bytes, std::memory_order_relaxed);
    
      // If the requested size is a multiple of the pointer size (8 bytes), carve it off the
      // front of the free region at free_begin_ and advance the pointer (the common case,
      // since most allocations are pointer-sized multiples):
      // |++++++used+++++++|---current-bytes--|----unused----|
      //                   |
      //               free_begin_ --> move
      //
      // Otherwise carve it off the tail of the free region, so free_begin_ itself always
      // stays pointer-aligned (free_begin_ does not move; only the available size shrinks):
      // |++++++used+++++++|----unused----|---current-bytes--|
      //                   |
      //               free_begin_ stays put
      //
      
      char* rv;
      if ((bytes % sizeof(void*)) == 0) {
        // aligned allocation from the beginning
        rv = s->free_begin_;
        s->free_begin_ += bytes;
      } else {
        // unaligned from the end
        rv = s->free_begin_ + avail - bytes;
      }
      return rv;
    }
    
  2. Requests larger than the shard's free space: roughly 128KB – 256KB. If a request exceeds the shard's current block size, or exceeds the shard's allocated_and_unused_ (i.e., the shard no longer has enough free memory), the shard has to request a new block from the arena and replenish itself. The size of that new shard block is bounded, falling within [shard_block_size/2, shard_block_size*2).

    When refilling from the arena, a second, arena-level lock is required (there is one arena, and therefore one such lock, per memtable; how contended it is depends on how many threads are writing that memtable concurrently).

    // The func here is passed in externally, that is, the allocate function of arena
    // [this, bytes]() { return arena_.Allocate(bytes); }
    char* AllocateImpl(size_t bytes, bool force_arena, const Func& func) {
      ...
      size_t avail = s->allocated_and_unused_.load(std::memory_order_relaxed);
      // If the remaining memory of the current shard cannot meet the application requirements of bytes, it needs to be allocated from arena
      if (avail < bytes) {
        // reload
        // When allocating from arena, you need to have an arena granularity lock to ensure the atomicity of arena allocation
        std::lock_guard<SpinMutex> reload_lock(arena_mutex_);
    
        // arena allocated but unused memory
        auto exact = arena_allocated_and_unused_.load(std::memory_order_relaxed);
        assert(exact == arena_.AllocatedAndUnused());
    
        // If the arena still has allocated-but-unused memory large enough for this request
        // and we are still inside the arena's inline block, serve the request directly from
        // the arena and return that address to the caller; this reduces memory fragmentation.
        if (exact >= bytes && arena_.IsInInlineBlock()) {
          // If we haven't exhausted arena's inline block yet, allocate from arena
          // directly. This ensures that we'll do the first few small allocations
          // without allocating any blocks.
          // In particular this prevents empty memtables from using
          // disproportionately large amount of memory: a memtable allocates on
          // the order of 1 KB of memory when created; we wouldn't want to
          // allocate a full arena block (typically a few megabytes) for that,
          // especially if there are thousands of empty memtables.
          auto rv = func(); 
          Fixup();
          return rv;  // func() above already ran the arena allocation logic
        }
    
        // Otherwise the arena has no suitable pre-allocated free memory, so allocate a fresh
        // block from the arena and point the shard's free_begin_ at it, sized to satisfy the
        // caller's bytes. Requests >= 256KB were already routed straight to the arena before
        // reaching this point, so everything handled here is < 256KB.
        avail = exact >= shard_block_size_ / 2 && exact < shard_block_size_ * 2
          ? exact
          : shard_block_size_;
        s->free_begin_ = arena_.AllocateAligned(avail);
        Fixup();
      }
      ...
    }
    
  3. Large allocations (>= 256KB)

    Take the arena's spinlock and allocate via Arena::Allocate -> Arena::AllocateFallback -> Arena::AllocateNewBlock.

    char* AllocateImpl(size_t bytes, bool force_arena, const Func& func) {
      size_t cpu;
    
      // Go directly to the arena if the allocation is too large, or if
      // we've never needed to Repick() and the arena mutex is available
      // with no waiting.  This keeps the fragmentation penalty of
      // concurrency zero unless it might actually confer an advantage.
      std::unique_lock<SpinMutex> arena_lock(arena_mutex_, std::defer_lock);
      if (bytes > shard_block_size_ / 4 || force_arena ||
          ((cpu = tls_cpuid) == 0 &&
           !shards_.AccessAtCore(0)->allocated_and_unused_.load(
             std::memory_order_relaxed) &&
           arena_lock.try_lock())) {
        // If try_lock above did not acquire the arena lock, block until we get it
        if (!arena_lock.owns_lock()) {
          arena_lock.lock();
        }
        // func() is arena_.Allocate(bytes): the large allocation happens inside the arena
        auto rv = func();
        Fixup();
        return rv;
      }
      ...
    }
    
    // Arena::AllocateNewBlock allocation logic
    char* Arena::AllocateNewBlock(size_t block_bytes) {
      // Reserve space in `blocks_` before allocating memory via new.
      // Use `emplace_back()` instead of `reserve()` to let std::vector manage its
      // own memory and do fewer reallocations.
      //
      // - If `emplace_back` throws, no memory leaks because we haven't called `new`
      //   yet.
      // - If `new` throws, no memory leaks because the vector will be cleaned up
      //   via RAII.
      blocks_.emplace_back(nullptr);
    
      // Allocate the block with operator new[]; once allocated it is tracked by the
      // std::vector<char*> blocks_ member
      char* block = new char[block_bytes];
      size_t allocated_size;
    #ifdef ROCKSDB_MALLOC_USABLE_SIZE
      allocated_size = malloc_usable_size(block);
    #ifndef NDEBUG
      // It's hard to predict what malloc_usable_size() returns.
      // A callback can allow users to change the costed size.
      std::pair<size_t*, size_t*> pair(&allocated_size, &block_bytes);
      TEST_SYNC_POINT_CALLBACK("Arena::AllocateNewBlock:0", &pair);
    #endif  // NDEBUG
    #else
      allocated_size = block_bytes;
    #endif  // ROCKSDB_MALLOC_USABLE_SIZE
      blocks_memory_ += allocated_size;
      if (tracker_ != nullptr) {
        tracker_->Allocate(allocated_size);
      }
      blocks_.back() = block;
      return block;
    }
    

All allocations here are made on demand: memtables are written key by key, and the request sizes are arbitrary. A value of 128KB or more is already a very large value for RocksDB; most users in that situation would split the keys or use BlobDB.

So the small-value allocations of typical workloads are well covered. As for fragmentation, each shard manages its own memory and allocates on demand (some scenarios require 16-byte alignment), refilling from the arena only when it runs out, so fragmentation stays under control.
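To see the shard's bump-pointer trick in isolation, here is a small standalone model (ShardModel and its layout are invented for the example, not RocksDB code): requests whose size is a multiple of the pointer size are carved off the front of the free region, everything else off the back, so the front stays pointer-aligned.

    #include <cassert>
    #include <cstddef>
    #include <cstdio>

    // Standalone model of one shard's free region.
    struct ShardModel {
      char* free_begin_;  // start of the free region
      size_t avail_;      // bytes left in the free region

      char* Allocate(size_t bytes) {
        assert(bytes <= avail_);
        char* rv;
        if (bytes % sizeof(void*) == 0) {
          rv = free_begin_;                   // aligned: take from the front and advance
          free_begin_ += bytes;
        } else {
          rv = free_begin_ + avail_ - bytes;  // unaligned: take from the tail, front stays aligned
        }
        avail_ -= bytes;
        return rv;
      }
    };

    int main() {
      alignas(void*) static char block[1024];
      ShardModel s{block, sizeof(block)};
      char* a = s.Allocate(64);  // multiple of 8: carved off the front
      char* b = s.Allocate(13);  // not a multiple of 8: carved off the tail
      std::printf("a=%p b=%p free_begin=%p avail=%zu\n", static_cast<void*>(a),
                  static_cast<void*>(b), static_cast<void*>(s.free_begin_), s.avail_);
    }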

Memory release process

For memory release, the Arena data structure maintains an AllocTracker* tracker_, used to track the allocation records of the whole component.

This tracker mainly works together with the user-configured write_buffer_manager, which releases memory once usage exceeds the configured limit. Every time Arena::AllocateNewBlock requests memory, the requested size is recorded in the tracker so it can be checked against the write_buffer_manager limit.
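As a hedged illustration of this bookkeeping (the ToyWriteBufferManager / ToyAllocTracker classes and their methods below are invented for the example; they are not RocksDB's actual write_buffer_manager / AllocTracker interfaces):

    #include <atomic>
    #include <cstddef>

    // A manager that enforces a global memory limit across all memtables.
    class ToyWriteBufferManager {
     public:
      explicit ToyWriteBufferManager(size_t limit) : limit_(limit) {}
      void Reserve(size_t bytes) { used_.fetch_add(bytes, std::memory_order_relaxed); }
      void Free(size_t bytes) { used_.fetch_sub(bytes, std::memory_order_relaxed); }
      bool OverLimit() const { return used_.load(std::memory_order_relaxed) > limit_; }

     private:
      const size_t limit_;
      std::atomic<size_t> used_{0};
    };

    // A per-arena tracker: every new block is reported to the manager, and the whole
    // accounting is handed back in one shot when the arena goes away.
    class ToyAllocTracker {
     public:
      explicit ToyAllocTracker(ToyWriteBufferManager* wbm) : wbm_(wbm) {}

      void Allocate(size_t bytes) {  // called for every block the arena allocates
        tracked_ += bytes;
        if (wbm_) wbm_->Reserve(bytes);
      }

      void FreeMem() {               // called once, when the arena is destroyed
        if (wbm_) wbm_->Free(tracked_);
        tracked_ = 0;
      }

     private:
      ToyWriteBufferManager* wbm_;
      size_t tracked_ = 0;
    };

    int main() {
      ToyWriteBufferManager wbm(64 << 20);  // 64MB limit
      ToyAllocTracker tracker(&wbm);
      tracker.Allocate(1 << 20);            // arena allocated a 1MB block
      bool should_flush = wbm.OverLimit();  // a real manager would trigger memtable flushes here
      tracker.FreeMem();                    // arena destroyed: give the accounting back
      return should_flush ? 1 : 0;
    }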

Once the current memtable reaches its size threshold and its arena is eventually destructed, the memory the memtable holds is released, and the destructor reports it back through the tracker:

Arena::~Arena() {
  if (tracker_ != nullptr) {
    assert(tracker_->is_freed());
    tracker_->FreeMem(); // Report all the memory accounted for here back to the write_buffer_manager
  }
  // All previously allocated blocks live in std::vector<char*> blocks_; release each element here.
  // The actual release goes back to the underlying allocator (glibc/tcmalloc/jemalloc).
  for (const auto& block : blocks_) {
    delete[] block;
  }
  
  // munmap of some large page memory in the follow-up
	...
}

Differences and relations between the ConcurrentArena allocator and other memory allocators

This question is closely tied to the third question posed at the beginning of the article: why doesn't RocksDB use another memory allocator directly, instead of building an Arena on top of one?

Because RocksDB uses memory in a great many places, and they all sit on core paths.

RocksDB needs both very high allocation performance and tight control over memory: it wants to guarantee performance while actively managing how much memory it uses. General-purpose allocators do not offer that kind of active control; they only try to keep allocation fast and fragmentation low, and at best expose some statistics for inspecting memory usage. Any active memory management has to come from the user. So ConcurrentArena is simply an allocator customized for RocksDB, letting the engine manage its own memory usage efficiently and in a way that suits it; the actual allocations underneath still go through a traditional allocator, whether tcmalloc, jemalloc, or glibc malloc.

In terms of implementation, the design ideas are quite similar to other allocators, and high performance is achieved the same way: small allocations go through a thread cache or per-CPU cache so the CPU does not keep missing its cache; if MAP_HUGETLB is supported, huge pages are mmap'd directly; otherwise large blocks are simply requested from the underlying allocator.
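For reference, a minimal Linux-only sketch of requesting huge pages via mmap with MAP_HUGETLB (error handling and size rounding are trimmed; whether this path is taken in practice depends on platform support and configuration):

    #include <sys/mman.h>  // mmap, munmap, MAP_HUGETLB (Linux-specific)
    #include <cstddef>
    #include <cstdio>

    // Ask the kernel for a huge-page-backed anonymous mapping; returns nullptr on
    // failure (e.g. no huge pages reserved via /proc/sys/vm/nr_hugepages).
    char* AllocateHugePage(size_t bytes) {
      void* addr = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
      return addr == MAP_FAILED ? nullptr : static_cast<char*>(addr);
    }

    int main() {
      constexpr size_t kSize = 2 * 1024 * 1024;  // one 2MB huge page
      char* p = AllocateHugePage(kSize);
      std::printf("huge page mapping: %p\n", static_cast<void*>(p));
      if (p != nullptr) munmap(p, kSize);
    }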

Summary

Recently I have built up a rough understanding of the basic design and implementation of memory allocators. Their birth and evolution have been driven by ever-higher demands on performance and cost from both the underlying hardware and the systems built on top of them.

  • At first, OS-managed stack memory handled allocation automatically: simple and efficient. But we want to manage memory ourselves and control its size and lifetime instead of relying on automatic release, and stack memory cannot meet that need.
  • Then we tried mmap, but every allocation, large or small, traps into the kernel and costs a context switch, so performance falls short.
  • Next came the simplest bump allocator, essentially a user-space stack whose size and lifetime are controllable. But releasing is a problem: even if an object is already dead, its memory cannot be reclaimed until everything above it on the stack has been released.
  • Then memory was managed with a free list, a singly linked list: string the free chunks together, and on release just walk the list. Still a problem, again performance: objects of all different sizes are crammed into one list, and release efficiency is appalling.
  • So we got size buckets (size classes in today's tcmalloc/jemalloc): the lists are split by object size, with each size range getting its own list, so a release only has to scan the list for the matching range. Chasing the hardware's ultimate performance feels only natural (squeezing every logic gate out of the CPU), much like capitalists chasing ultimate profit by squeezing every employee's potential. But a linked list means random access and is unfriendly to the CPU cache.
  • Replace the linked list with an array and keep the size-bucket shape, and performance looks good. Meanwhile the hardware keeps chasing its own limits: multi-core architectures and hyper-threading. The earlier designs assumed a single core; to keep allocation fast on multi-core machines and avoid excessive cache misses, the thread cache appeared. But each thread cache manages its own memory, which can leave fragments: for example, the arena initially hands each thread 128K, but if only one or two threads allocate frequently while the others barely allocate at all, the memory given to the idle threads just sits there wasted.
  • (tcmalloc) So, to improve memory utilization and reduce fragmentation, the transfer cache was introduced, which can reclaim free memory from one thread cache and migrate it to threads with larger demands. Performance and utilization problems remain, though: if caches are partitioned per thread and a process has tens of thousands of threads, each thread cache can only manage a tiny amount of memory and has to interact constantly with the transfer cache and the central freelist, which is unbearably slow.
  • Hence the per-CPU cache: each logical processor (hyper-thread granularity) gets its own cache, and the threads running on it allocate from that cache. This design is also friendlier to arena management.
  • ...and then there are user-facing requirements: users have no idea how much memory the allocator has grabbed or how it is organized internally, so the allocator also has to expose stats...
  • ...

In short, hardware keeps pursuing its own ultimate performance with ever faster CPUs and ever more elaborate CPU architectures (NUMA today), and it will keep evolving, faster and stronger, for the foreseeable future.

Software developers, especially developers of infrastructure software, who want to squeeze the most out of the hardware need to know the underlying hardware stack as well as their own software stack. That is, of course, only if they are chasing ultimate hardware performance; if you are happy to use mmap as your memory allocator, there is certainly no problem with that :)
