Concurrent Garbage-Collecting/Compacting Memory Allocator

I’m developing an algorithm for concurrent heap garbage collection/compaction. It will be used in low-latency systems that need to scale well to many clients, e.g. web servers.

I have thought of an algorithm that might be suitable, but it has a few flaws. I’m not very good at describing algorithms, so please correct me if my description isn’t clear. Here it goes:

  • There are two heaps of equal size
  • There’s an object handle table, which contains the memory address and lock of each object
  • The index into the handle table is the handle of an object
  • Each heap has a linked list of all its objects, which stores the type and size of each object
  • Objects are copied from one heap to the other (a sketch of these structures follows this list)
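To make the description more concrete, here is a minimal sketch of how those structures might look in C++. All names here are mine, not part of the original design:

    #include <atomic>
    #include <cstddef>
    #include <cstdint>
    #include <mutex>

    using Handle = std::uint32_t;            // index into the handle table
    using TypeId = std::uint32_t;            // hypothetical type descriptor id

    struct ObjectHeader {                    // precedes each object in a heap
        TypeId        type;                  // type of the object
        std::size_t   size;                  // size of the object in bytes
        Handle        handle;                // back-reference to the handle slot
        ObjectHeader* next;                  // linked list of all objects in the heap
    };

    struct HandleEntry {                     // one slot per object in the handle table
        std::atomic<void*> address;          // current address of the object
        std::mutex         lock;             // held while the object is copied
    };

All references go through the table: the handle stays stable for the object’s lifetime, and only table[handle].address is swapped atomically when the object moves between heaps.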

The GC/compaction algorithm is designed to be incremental and concurrent. Here is the pseudocode, which runs in all threads from time to time; atomic operations are marked with // atomic.

    current = heapFrom->currentCopyObj;       // atomic
    heapFrom->currentCopyObj = current->Next; // atomic

    while (heapTo->currentCopyObj->freeSpaceToNextObj < current->size) // atomic
    {
        heapTo->currentCopyObj = heapTo->currentCopyObj->Next; // atomic
    }

    size = current->Size;
    oldAddress = current->Address;
    newAddress = heapTo->currentCopyObj->Address;

    handleTable->LockObj(current->Handle); // atomic

    memcpy(heapTo->currentCopyObj->Address, current->Address, current->Size);

    heapTo->InsertIntoObjectList(heapTo->currentCopyObj, current); // atomic
    heapFrom->RemoveFromObjectList(current);                       // atomic

    handleTable->SetHandleAddress(current->Handle, newAddress); // atomic

    handleTable->UnlockObj(current->Handle); // atomic

Objects are allocated with a sort of bump allocator at the end of each heap; handle allocation uses an O(1) scheme based on either a bump allocator or cached free handle slots. This should make allocation quite fast, theoretically O(1).
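As a sketch, the bump-allocation fast path could look like this (hypothetical names, not from the design above; a single atomic fetch_add is what makes it O(1) and safe under concurrency):

    #include <atomic>
    #include <cstddef>

    struct BumpHeap {
        std::atomic<char*> top;              // first free byte
        char*              end;              // end of the reserved range

        void* allocate(std::size_t size) {
            // Reserve [p, p + size) with one atomic fetch-add.
            char* p = top.fetch_add(static_cast<std::ptrdiff_t>(size));
            if (p + size > end)
                return nullptr;              // heap exhausted: time to collect
            return p;
        }
    };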

This algorithm has a few flaws though:

  • Object writes have to take the lock, while reads can run concurrently with copying
  • It does not achieve very good heap compaction
  • High memory overhead due to the handle table
  • Not very cache-friendly due to the handle table

Would this algorithm work? If it would, how could I solve some of the problems it has? Or is there a better algorithm that does not have these flaws?

memcpy() in a device/program with the jemalloc allocator does not crash

I managed to get a negative signed length value passed to memcpy():

    02-16 01:44:49.096  6423  6471 W bt_hci_packet_fragmenter: reassemble_and_dispatch reassemble_and_dispatch
    02-16 01:44:49.096  6423  6471 W bt_hci_packet_fragmenter: reassemble_and_dispatch partial_packet->offset 40 packet->len 304 HCI_ACL_PREAMBLE_SIZE 4
    02-16 01:44:49.096  6423  6471 W bt_hci_packet_fragmenter: reassemble_and_dispatch projected_offset 340 partial_packet->len 41
    02-16 01:44:49.096  6423  6471 W bt_hci_packet_fragmenter: reassemble_and_dispatch got packet which would exceed expected length of 41. Truncating.
    02-16 01:44:49.096  6423  6471 W bt_hci_packet_fragmenter: reassemble_and_dispatch memcpy packet->len 1 packet->offset 4 expr -3
    02-16 01:44:49.096  6423  6471 W bt_hci_packet_fragmenter: reassemble_and_dispatch partial_packet->data 0xacb14580 partial_packet->data + partial_packet->offset 0xacb145a8  packet->data 0xa553e110 packet->data + packet->offset 0xa553e114
    02-16 01:44:49.097  6423  6469 W bt_hci_packet_fragmenter: fragment_and_dispatch fragment_and_dispatch

In the example above, the memcpy size is -3. Because memcpy takes an unsigned size, this value is interpreted as an unsigned integer (4294967293 on a 32-bit system), so the copy should continue until it hits unmapped memory, page-faults, and the process should terminate.
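The signed-to-unsigned conversion is easy to reproduce in isolation. A minimal standalone snippet (hypothetical, not from the Bluetooth stack):

    #include <cstdio>

    int main() {
        int len = 1, offset = 4;      // values from the log line above
        size_t n = len - offset;      // -3 converted to size_t
        // On a 32-bit platform this prints 4294967293; memcpy(dst, src, n)
        // would then try to copy ~4 GiB and is expected to fault on an
        // unmapped page.
        std::printf("%zu\n", n);
        return 0;
    }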

My phone is 32-bit; maybe that’s why. The phone is a Samsung S3 Neo+, an ARM device, and it uses jemalloc, at least in my Android 9.0 tests.

¯\_(ツ)_/¯

Any ideas why memcpy does not crash the process here?

    memcpy(partial_packet->data + partial_packet->offset,
           packet->data + packet->offset,
           packet->len - packet->offset);

https://android.googlesource.com/platform/system/bt/+/3cb7149d8fed2d7d77ceaa95bf845224c4db3baf/hci/src/packet_fragmenter.cc#229
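For illustration, a check of this general shape ahead of the memcpy would prevent the wraparound (my sketch, not the actual AOSP patch):

    // Hypothetical guard; not the actual fix for CVE-2020-0022.
    // packet->len - packet->offset goes negative when the packet is shorter
    // than its offset, and wraps to ~4 GiB once converted to size_t, so
    // drop such packets before copying.
    if (packet->len < packet->offset) {
        return;  // malformed packet: log and drop instead of copying
    }
    memcpy(partial_packet->data + partial_packet->offset,
           packet->data + packet->offset,
           packet->len - packet->offset);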

More info also here:

https://github.com/marcinguy/CVE-2020-0022

Thanks,

Are there situations where activedefrag should be kept disabled in Redis 5 (with the Jemalloc allocator)?

Redis 4 added active memory defragmentation (source: release notes):

Active memory defragmentation. Redis is able to defragment the memory while online if the Jemalloc allocator is used (the default on Linux). Useful for workloads where the allocator cannot keep the fragmentation low enough, so the only possibility is for Redis and the allocator to collaborate in order to defragment the memory.

With Redis 5, the feature (now referred to as version 2) has been improved:

Source 1: tweet from Salvatore Sanfilippo, the Redis main developer

Active defragmentation version 2. Defragmenting the memory of a running server is black magic, but Oran Agra improved his past effort and now it works better than before. Very useful for long running workloads that tend to fragment Jemalloc.

Source 2: AWS announcement of Redis 5

One of the highlights of the previous release was the fact that Redis gained the capability to defragment the memory while online. The way it works is very clever: Redis scans the keyspace and, for each pointer, asks the allocator if moving it to a new address would help to reduce the fragmentation. This release ships with what can be called active defrag 2: It’s faster, smarter, and has lower latency. This feature is especially useful for workloads where the allocator cannot keep the fragmentation low enough, so the strategy is for both Redis and the allocator to cooperate. For this to work, the Jemalloc allocator has to be used. Luckily, it’s the default allocator on Linux.

Question: Assuming you are already using Jemalloc, is there any reason not to always set activedefrag yes?

Given that the alternative is to restart the instance to deal with fragmentation (which is highly problematic), and given that the overhead of activedefrag seems quite low from what I have seen so far, the option seems too useful to leave disabled by default.

Or are there any situations where it will harm performance?
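For reference, enabling it is a single directive, and its aggressiveness is tunable. The values below are the defaults from the self-documenting redis.conf shipped with Redis 5, as far as I recall them, so double-check against your version:

    activedefrag yes
    # Minimum amount of fragmentation waste before defrag kicks in
    active-defrag-ignore-bytes 100mb
    # Fragmentation percentage at which defrag starts / goes all-out
    active-defrag-threshold-lower 10
    active-defrag-threshold-upper 100
    # Minimal / maximal defrag effort, as a percentage of CPU
    active-defrag-cycle-min 5
    active-defrag-cycle-max 75

It can also be toggled at runtime with CONFIG SET activedefrag yes, which makes it easy to measure the latency impact on a live instance before committing to it.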

Simple and safe C++ pool allocator

I have written a simple pool allocator for C++ and I’m looking for ways to improve it in speed, usability, or safety. For example, I don’t know how to allocate a buffer larger than the pool size (one possible approach is sketched after the code below).

    #pragma once

    #include <limits>
    #include <new>      // placement new
    #include <utility>  // std::forward
    #include <vector>

    template<class T, size_t poolObjectCount = 100>
    class PoolAllocator
    {
    public:
        template<class U>
        struct rebind
        {
            typedef PoolAllocator<U> other;
        };

    private:
        const size_t m_poolSize = poolObjectCount * sizeof(T);
        std::vector<T*> m_pools;
        size_t m_nextObject = 0;

    public:
        PoolAllocator()
        {
            allocateNewPool();
        }

        PoolAllocator(const PoolAllocator& other) = delete; // Copy is deleted
        PoolAllocator(PoolAllocator& other) = delete;       // Copy is deleted

        PoolAllocator(PoolAllocator<T, poolObjectCount>&& other) noexcept
            : m_poolSize(std::move(other.m_poolSize)),
              m_pools(std::move(other.m_pools)), m_nextObject(other.m_nextObject)
        {}

        PoolAllocator(const PoolAllocator<T, poolObjectCount>&& other) noexcept
            : m_poolSize(std::move(other.m_poolSize)),
              m_pools(std::move(other.m_pools)), m_nextObject(other.m_nextObject)
        {}

        ~PoolAllocator()
        {
            while (!m_pools.empty())
            {
                T* backPool = m_pools.back();
                for (T* i = backPool + poolObjectCount - 1; i > backPool; i--)
                    i->~T();
                operator delete((void*)backPool);
                m_pools.pop_back();
            }
        }

        T* allocate(size_t objectCount)
        {
            if (objectCount >= poolObjectCount)
                throw "Cannot allocate array greater than pool size";
            if (m_nextObject + objectCount >= poolObjectCount)
                allocateNewPool();

            T* returnValue = m_pools.back() + m_nextObject;
            m_nextObject++;

            return returnValue;
        }

        template<class U = T>
        void construct(U* object)
        {
            new(object) U();
        }

        template<class U = T>
        void construct(U* object, const U& other)
        {
            new(object) U(other);
        }

        template<class U = T>
        void construct(U* object, U&& other)
        {
            new(object) U(other);
        }

        template<class U = T, class... ConstructorArguments>
        void construct(U* object, ConstructorArguments&&... constructorArguments)
        {
            new(object) U(std::forward<ConstructorArguments>(constructorArguments)...);
        }

    private:
        void allocateNewPool()
        {
            m_nextObject = 0;
            m_pools.emplace_back((T*)operator new(m_poolSize));
        }
    };
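On the specific point of buffers larger than the pool: one common approach is to fall back to the system allocator for oversized requests. A sketch against the class above (it also reserves the full run of objects rather than a single slot, which the posted allocate does not do):

    // Hypothetical variant of allocate(); oversized requests bypass the
    // pools and go straight to operator new.
    T* allocate(size_t objectCount)
    {
        if (objectCount > poolObjectCount)
            return static_cast<T*>(operator new(objectCount * sizeof(T)));
        if (m_nextObject + objectCount > poolObjectCount)
            allocateNewPool();

        T* returnValue = m_pools.back() + m_nextObject;
        m_nextObject += objectCount;   // reserve the whole run, not one slot
        return returnValue;
    }

A matching deallocate(T*, size_t) would then need to release oversized blocks with operator delete.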

Writing a custom, highly-specialized, special-purpose standard-compliant C++ allocator


Brief Preface

I recognize that there are many nuances and requirements for a standard-compatible allocator. There are a number of questions here covering a range of topics associated with allocators. I realize that the requirements set out by the standard are critical to ensuring that the allocator functions correctly in all cases, doesn’t leak memory, doesn’t cause undefined behaviour, etc. This is particularly true where the allocator is meant to be used (or at least, can be used) in a wide range of use cases, with a variety of underlying types and different standard containers, object sizes, etc.

In contrast, I have a very specific use case where I personally strictly control all of the conditions associated with its use, as I describe in detail below. Consequently, I believe that what I’ve done is perfectly acceptable given the highly-specific nature of what I’m trying to implement.

I’m hoping someone with far more experience and understanding than me can either confirm that the description below is acceptable or point out the problems (and, ideally, how to fix them too).

Overview / Specific Requirements

In a nutshell, I’m trying to write an allocator that is to be used within my own code and for a single, specific purpose:

  • I need “a few” std::vector<T> (T probably uint16_t), with a fixed (at runtime) number of elements. I’m benchmarking to determine the best tradeoff of performance/space for the exact integer type[1]
  • As noted, the number of elements is always the same, but it depends on some runtime configuration data passed to the application
  • The number of vectors is also either fixed or at least bounded. The exact number is handled by a library providing an implementation of parallel::for(execution::par_unseq, ...)
  • The vectors are constructed by me (i.e. so I know with certainty that they will always be constructed with N elements)

[1] The values of the vectors are used to conditionally copy a float from one of 2 vectors to a target: c[i] = rand_vec[i] < threshold ? a[i] : b[i], where a, b, c are contiguous arrays of float, rand_vec is the std::vector I’m trying to figure out here, and threshold is a single variable of type integer_tbd. The code compiles to SSE SIMD operations. I do not remember the details, but I believe this requires additional shifting instructions if the ints are smaller than the floats.
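For concreteness, the scalar form of that loop looks like this (a sketch using made-up names that match the description):

    #include <cstddef>
    #include <cstdint>

    // c[i] takes a[i] or b[i] depending on the random draw; with contiguous
    // float arrays, this select pattern is what vectorizes to SIMD blends.
    void conditional_copy(float* c, const float* a, const float* b,
                          const std::uint16_t* rand_vec, std::uint16_t threshold,
                          std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            c[i] = rand_vec[i] < threshold ? a[i] : b[i];
    }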

On this basis, I’ve written a very simple allocator, with a single static boost::lockfree::queue as the free-list. Given that I will construct the vectors myself and they will go out of scope when I’m finished with them, I know with certainty that all calls to alloc::deallocate(T*, size_t) will always return vectors of the same size, so I believe that I can simply push them back onto the queue without worrying about a pointer to a differently-sized allocation being pushed onto the free-list.

As noted in the code below, I’ve added in runtime tests for both the allocate and deallocate functions for now, while I’ve been confirming for myself that these situations cannot and will not occur. Again, I believe it is unquestionably safe to delete these runtime tests. Although some advice would be appreciated here too — considering the surrounding code, I think they should be handled adequately by the branch predictor so they don’t have a significant runtime cost (although without instrumenting, hard to say for 100% certain).

In a nutshell, as far as I can tell, everything here is completely within my control, completely deterministic in behaviour, and thus completely safe. This is also suggested by running the code under typical conditions: there are no segfaults, etc. I haven’t tried running with sanitizers yet; I was hoping to get some feedback and guidance before doing so.

I should point out that my code runs 2x faster compared to using std::allocator, which is at least qualitatively what I would expect.

CR_Vector_Allocator.hpp

    #include <boost/lockfree/queue.hpp>

    class CR_Vector_Allocator {

      using T = CR_Range_t; // probably uint16_t or uint32_t, set elsewhere.

    private:
      using free_list_type = boost::lockfree::queue<T*>;

      static free_list_type free_list;

    public:
      T* allocate(size_t);
      void deallocate(T* p, size_t) noexcept;

      using value_type = T;
      using pointer = T*;
      using reference = T&;

      template <class U> struct rebind { using other = CR_Vector_Allocator; };
    };

CR_Vector_Allocator.cc

    CR_Vector_Allocator::T* CR_Vector_Allocator::allocate(size_t n) {

      if (n <= 1)
        throw std::runtime_error("Unexpected number of elements to initialize: " +
                                 std::to_string(n));

      T* addr_;
      if (free_list.pop(addr_)) return addr_;

      addr_ = reinterpret_cast<T*>(std::malloc(n * sizeof(T)));
      return addr_;
    }

    void CR_Vector_Allocator::deallocate(T* p, size_t n) noexcept {
      if (n <= 1) // should never happen. but just in case, I don't want to leak
        free(p);
      else
        free_list.push(p);
    }

    CR_Vector_Allocator::free_list_type CR_Vector_Allocator::free_list;

It is used in the following manner:

    using CR_Vector_t = std::vector<uint16_t, CR_Vector_Allocator>;

    CR_Vector_t Generate_CR_Vector() {

      /* total_parameters is a member of the same class
         as this member function and is defined elsewhere */
      CR_Vector_t cr_vec(total_parameters);

      std::uniform_int_distribution<uint16_t> dist_;

      /* urng_ is a member variable of type std::mt19937_64 in the class */
      std::generate(cr_vec.begin(), cr_vec.end(), [this, &dist_]() {
        return dist_(this->urng_);
      });

      return cr_vec;
    }

    void Prepare_Next_Generation(...) {
      /*
          ...
       */
      using hpx::parallel::execution::par_unseq;
      hpx::parallel::for_loop_n(par_unseq, 0l, pop_size, [this](int64_t idx) {
        auto crossovers = Generate_CR_Vector();
        auto new_parameters = Generate_New_Parameters(/* ... */, std::move(crossovers));
      });
    }

Any feedback, guidance or rebukes would be greatly appreciated.
Thank you!!