Introduction to tuple_vector

tuple_vector is a data container that is designed to abstract and simplify the handling of a "structure of arrays" layout of data in memory. In particular, it mimics the interface of vector, including functionality to do inserts, erases, push_backs, and random-access. It also provides a RandomAccessIterator and corresponding functionality, making it compatible with most STL (and STL-esque) algorithms such as ranged-for loops, find_if, remove_if, or sort.

When used or applied properly, this container can improve performance of some algorithms through cache-coherent data accesses or allowing for sensible SIMD programming, while keeping the structure of a single container, to permit a developer to continue to use existing algorithms in STL and the like.

Review of "Structure of arrays" data layouts

When trying to improve the performance of some code, it can sometimes be desirable to transform how some data is stored in memory to be laid out not as an "array of structures", but as a "structure of arrays". That is, instead of storing a series of objects as a single contiguous chunk of memory, one or more data members are instead stored as separate chunks of memory that are handled and accessed in parallel to each other.

This can be beneficial in two primary respects:

  1. To improve the cache coherency of the data accesses, e.g. by utilizing more data that is loaded per cache line loaded from memory, and thereby reducing the amount of time waiting on memory accesses from off-CPU memory. This presentation from Mike Acton touches on this, among other things: https://www.youtube.com/watch?v=rX0ItVEVjHc

  2. To allow the data to be more easily loaded and utilized by SIMD kernels, by being able to load memory directly into a SIMD register. This is touched on in this presentation from Andreas Fredriksson for writing code with SIMD intrinsics: http://www.gdcvault.com/play/1022249/SIMD-at-Insomniac-Games-How ...and as well in this guide for writing performant ISPC kernels: https://ispc.github.io/perfguide.html

How TupleVecImpl works

tuple_vector inherits from TupleVecImpl, which provides the bulk of the functionality for those data containers. It manages the memory allocated, marshals data members to each array of memory, generates the necessary iterators, and so on.

When a tuple_vector is declared, it is alongside a list of types, or "tuple elements", indicating what data to store in the container, similar to how tuple operates. TupleVecImpl uses this list of tuple elements to then inherit from a series of TupleVecLeaf structures, which each have their own pointer to an array of their corresponding type in memory. When dereferencing the container, either to fetch a tuple of references or just fetching pointers to the memory, it is these pointers that are utilized or fetched.

While each TupleVecLeaf contains a pointer to its own block of memory, they are not individual memory allocations. When TupleVecImpl needs to grow its capacity, it calculates the total size needed for a single allocation, taking into account the number of objects for the container, the size of each tuple element's type, and the alignment requirements for each type. Pointers into the allocation for each tuple element are also determined at the same time, which are passed to each TupleVecLeaf. From there, many of the interactions with TupleVecImpl, to modify or access members of the container, then reference each TupleVecLeaf's data pointer in series, using parameter packs to repeat each operation for each parent TupleVecLeaf.

How tuple_vector's iterator works

TupleVecImpl provides a definition to an iterator type, TupleVecIter. As mentioned above, TupleVecIter provides all of the functionality to operate as a RandomAccessIterator. When it is dereferenced, it provides a tuple of references, similar to at() or operator[] on TupleVecImpl, as opposed to a reference of some other type. As well, a customization of move_iterator for TupleVecIter is provided, which will return a tuple of rvalue-references.

The way that TupleVecIter operates internally is to track an index into the container, as well as a copy of all of the TupleVecImpl's TupleVecLeaf pointers at the time of the iterator's construction. As a result, modifying the iterator involves just changing the index, and dereferencing the iterator into the tuple of references involves dereferencing each pointer with an offset specified by that index.

Of the various ways of handling the multitude of references, this tended to provide the best code-generation. For example, having a tuple of pointers that are collectively modified with each iterator modification resulted in the compiler not being able to accurately determine which pointers were relevant to the final output of some function, creating many redundant operations. Similarly, having the iterator refer to the source TupleVecImpl for the series of pointers often resulted in extra, unnecessary, data hops to the TupleVecImpl to repeatedly fetch data that was not practically mutable, but theoretically mutable. While this solution is the heaviest in terms of storage, the resulted assembly tends to be competitive with traditional structure-of-arrays setups.

How to work with tuple_vector, and where to use it

Put simply, tuple_vector can be used as a replacement for vector. For example, instead of declaring a structure and vector as:

struct Entity
{
	bool active;
	float lifetime;
	Vec3 position;
}
vector<Entity> entityVec;

...the tuple_vector equivalent of this can be defined as:

tuple_vector<bool, float, Vec3> entityVec;

In terms of how tuple_vector is modified and accessed, it has a similar featureset as vector, except where vector would accept or return a single value, it instead accepts or returns a tuple of values or unstructured series of equivalent arguments.

For example, the following functions can be used to access the data, either by fetching a tuple of references to a series of specific values, or the data pointers to the tuple elements:

tuple<bool&, float&, Vec3&> operator[](size_type)
tuple<bool&, float&, Vec3&> at(size_type)
tuple<bool&, float&, Vec3&> iterator::operator*()
tuple<bool&&, float&&, Vec3&&> move_iterator::operator*()
tuple<bool*, float*, Vec3*> data()

// extract the Ith tuple element pointer from the tuple_vector
template<size_type I>
T* get<I>()
// e.g. bool* get<0>(), float* get<1>(), and Vec3* get<2>()

// extract the tuple element pointer of type T from the tuple_vector
// note that this function can only be used if there is one instance
// of type T in the tuple_vector's elements
template<typename T>
T* get<T>()
// e.g. bool* get<bool>(), float* get<float>(), and Vec3* get<Vec3>()

And push_back(...) has the following overloads, accepting either values or tuples as needed.

tuple<bool&, float&, Vec3&> push_back()
push_back(const bool&, const float&, const Vec3&)
push_back(tuple<const bool&, const float&,const  Vec3&>)
push_back(bool&&, float&&, Vec3&&)
push_back(tuple<bool&&, float&&, Vec3&&>)		

...and so on, and so forth, for others like the constructor, insert(...), emplace(...), emplace_back(...), assign(...), and resize(...).

As well, note that the tuple types that are accepted or returned for tuple_vector<Ts...> have typedefs available in the case of not wanting to use automatic type deduction:

typedef eastl::tuple<Ts...> value_tuple;
typedef eastl::tuple<Ts&...> reference_tuple;
typedef eastl::tuple<const Ts&...> const_reference_tuple;
typedef eastl::tuple<Ts*...> ptr_tuple;
typedef eastl::tuple<const Ts*...> const_ptr_tuple;
typedef eastl::tuple<Ts&&...> rvalue_tuple;

With this, and the fact that the iterator type satisfies the RandomAccessIterator requirements, it is possible to use tuple_vector in most ways and manners that vector was previously used, with few structural differences.

However, even if not using it strictly as a replacement for vector, it is still useful as a tool for simplifying management of a traditional structure of arrays. That is, it is possible to use tuple_vector to just perform a single large memory allocation instead of a series of smaller memory allocations, by sizing the tuple_vector as needed, fetching the necessary pointers with data() or get<...>(), and carrying on normally.

One example where this can be utilized is with ISPC integration. Given the following ISPC function definition:

export void simple(uniform float vin[], uniform float vfactors[], uniform float vout[], uniform int size);

...which generates the following function prototype for C/C++ usage:

extern void simple(float* vin, float* vfactors, float* vout, int32_t size);

...this can be utilized with some raw float arrays:

float* vin = new float[NumElements];
float* vfactors = new float[NumElements];
float* vout = new float[NumElements];

// Initialize input buffer
for (int i = 0; i < NumElements; ++i)
{
	vin[i] = (float)i;
	vfactors[i] = (float)i / 2.0f;
}

// Call simple() function from simple.ispc file
simple(vin, vfactors, vout, NumElements);

delete vin;
delete vfactors;
delete vout;

or, with tuple_vector:

tuple_vector<float, float, float> simpleData(NumElements);
float* vin = simpleData.get<0>();
float* vfactors = simpleData.get<1>();
float* vout = simpleData.get<2>();

// Initialize input buffer
for (int i = 0; i < NumElements; ++i)
{
	vin[i] = (float)i;
	vfactors[i] = (float)i / 2.0f;
}

// Call simple() function from simple.ispc file
simple(vin, vfactors, vout, NumElements);

simpleData here only has a single memory allocation during its construction, instead of the three in the first example, and also automatically releases the memory when it falls out of scope.

It is possible to also skip a memory allocation entirely, in some circumstances. EASTL provides "fixed" counterparts of many data containers which allows for a data container to have an inlined buffer of memory. For example, eastl::vector<typename T> has the following counterpart:

eastl::fixed_vector<typename T, size_type nodeCount, bool enableOverflow = true>

This buffer allows for enough space to hold a nodeCount number of T objects, skipping any memory allocation at all, until the requested size becomes greater than nodeCount - assuming enableOverflow is True.

There is a similar counterpart to eastl::tuple_vector<typename... Ts> available as well:

eastl::fixed_tuple_vector<size_type nodeCount, bool enableOverflow, typename... Ts>

This does the similar legwork in creating an inlined buffer, and all of the functionality of tuple_vector otherwise is supported. Note the slight difference in declaration, though: nodeCount and enableOverflow are defined first, and enableOverflow is not a default parameter. This change arises out of restrictions surrounding variadic templates, in that they must be declared last, and cannot be mixed with default template parameters.

Lastly, eastl::vector and other EASTL data containers support custom Memory Allocator types, through their template parameters. For example, eastl::vector's full declaration is actually:

eastl::vector<typename T, typename AllocatorType = EASTLAllocatorType>

However, because such a default template parameter cannot be used with variadic templates, a separate type for tuple_vector is required for such a definition:

eastl::tuple_vector_alloc<typename AllocatorType, typename... Ts>

Note that tuple_vector uses EASTLAllocatorType as the allocator.

Performance comparisons/discussion

A small benchmark suite for tuple_vector is included when running the EASTLBenchmarks project. It provides the following output on a Core i7 3770k (Skylake) at 3.5GHz, with DDR3-1600 memory.

The tuple_vector benchmark cases compare total execution time of similar algorithms run against eastl::tuple_vector and std::vector, such as erasing or inserting elements, iterating through the array to find a specific element, sum all of the elements together via operator[] access, or just running eastl::sort on the data containers. More information about the EASTLBenchmarks suite can be found in EASTL/doc/EASTL Benchmarks.html

Benchmark STD execution time EASTL execution time Ratio
tuple_vector<AutoRefCount>/erase 1.7 ms 1.7 ms 1.00
tuple_vector<MovableType>/erase 104.6 ms 106.3 ms 0.98
tuple_vector<MovableType>/reallocate 1.3 ms 1.7 ms 0.77 -
tuple_vector<uint64>/erase 3.4 ms 3.5 ms 0.98
tuple_vector<uint64>/insert 3.4 ms 3.4 ms 0.99
tuple_vector<uint64>/iteration 56.3 us 81.4 us 0.69 -
tuple_vector<uint64>/operator[] 67.4 us 61.8 us 1.09
tuple_vector<uint64>/push_back 1.3 ms 818.3 us 1.53 +
tuple_vector<uint64>/sort 5.8 ms 7.3 ms 0.80
tuple_vector<uint64,Padding>/erase 34.7 ms 32.9 ms 1.05
tuple_vector<uint64,Padding>/insert 41.0 ms 32.6 ms 1.26
tuple_vector<uint64,Padding>/iteration 247.1 us 80.5 us 3.07 +
tuple_vector<uint64,Padding>/operator[] 695.7 us 81.1 us 8.58 +
tuple_vector<uint64,Padding>/push_back 10.0 ms 6.0 ms 1.67 +
tuple_vector<uint64,Padding>/sort 8.2 ms 10.1 ms 0.81
vector<AutoRefCount>/erase 1.3 ms 1.2 ms 1.05
vector<MovableType>/erase 104.4 ms 109.4 ms 0.95
vector<MovableType>/reallocate 1.5 ms 1.5 ms 0.95
vector<uint64>/erase 4.3 ms 3.6 ms 1.20
vector<uint64>/insert 4.8 ms 4.8 ms 1.01
vector<uint64>/iteration 71.5 us 77.3 us 0.92
vector<uint64>/operator[] 90.7 us 87.2 us 1.04
vector<uint64>/push_back 1.6 ms 1.2 ms 1.38 +
vector<uint64>/sort 7.7 ms 8.2 ms 0.93

First off, tuple_vector<uint64>'s performance versus std::vector<uint64> is comparable, as expected, as the tuple_vector's management for one type becomes very similar to just a regular vector. The major notable exception is the iteration case, which runs eastl::find_if. This performance differences is a consequence of the iterator design, and how it works with indices, not a direct pointer, so the code generation suffers slightly in this compute-bound scenario. This is worth noting as a demonstration of a case where falling back to pointer-based iteration by fetching the begin and end pointers of that tuple element may be preferable, instead of using the iterator constructs.

The set of tuple_vector<uint64,Padding> tests are more interesting. This is a comparison between a single std::vector with a structure containing a uint64 and 56 bytes of padding, and a tuple_vector with two elements: one for uint64 and one for 56 bytes of padding. The erase, insert, push_back, and sort cases all perform at a similar relative rate as they did in the tuple_vector<uint64> tests - demonstrating that operations that have to touch all of elements do not have a significant change in performance.

However, iteration and operator[] are very different, because those only access the uint64 member of both vector and tuple_vector to run some operation. The iteration test now runs 3x faster whereas before it ran 0.7x as fast, and operator[] runs 8.5x faster, instead of 1.1x. This demonstrates some of the utility of tuple_vector, in that these algorithms end up being limited by the CPU's compute capabilities, as opposed to being limited by how fast they can load memory in from DRAM.

In a series of other tests, generally speaking, tuple_vector tends to perform on par with manual management of multiple arrays in many algorithms and operations, often even generating the same code. It should be noted that significant degrees of inlining and optimization are required to get the most out of tuple_vector. Compared to accessing a series of arrays or vectors, tuple_vector does perform a multitude of extra trivial function calls internally in order to manage the various elements, or interact with eastl::tuple through its interface, so running in debug configurations can run significantly slower in some cases, e.g. sometimes running at 0.2x the speed compared to vector.

The problem of referencing tuple elements

This will be experienced shortly after using tuple_vector in most capacities, but it should be noted that the most significant drawback is that there is no way to symbolically reference each tuple element of the tuple_vector - much in the same way as tuple. For example, if translating a struct such as...

struct Entity
{
	float x, y, z;
	float lifetime;
};

...to tuple_vector, it will exist as:

tuple_vector<float, float, float, float> entityVec;

...and can only be accessed in a manner like entityVec.get<3>() to refer to the lifetime member. With existing tools, the only good alternatives are to encapsulate each float as a separate struct to give it unique typenames...

struct entityX { float val; };
struct entityY { float val; };
struct entityZ { float val; };
struct entityLifetime { float val; };

tuple_vector<entityX, entityY, entityZ, entityLifetime> entityVec;

...and then access each tuple element by typename like entityVec.get<entityLifetime>(); or, creating an enumerated value to replace the indices...

enum EntityTypeEnum
{
	entityX = 0,
	entityY = 1,
	entityZ = 2,
	entityLifetime = 3
};

tuple_vector<float, float, float, float> entityVec;

...and then access each tuple element by the enumerated value: entityVec.get<entityLifetime>().

Either way, there is a fairly significant maintenance and readability issue around this. This is arguably more severe than with tuple on its own because that is generally not intended for structures with long lifetime.

Ideally, if the language could be mutated to accommodate such a thing, it would be good to have some combination of typenames and symbolic names in the declaration, e.g. something like

tuple_vector<float x, float y, float z, float lifetime> entityVec;

and be able to reference the tuple elements not just by typename or index, but through their corresponding symbol, like entityVec.get<lifetime>(). Or, it may be interesting if the necessary get functions could be even automatically generated through a reflection system, e.g. entityVec.get_lifetime(). All of this remains a pipe dream for now.






Add Discussion as Guest

Log in