# ::<lexlblog>

## Is std::vector<bool> any good?

The [std::vector<>](https://en.cppreference.com/w/cpp/container/vector)
class is part of the C++ standard library (STL) and is
used in basically every C++ program. If you look at some C++ code and
don't find any vectors, chances are the programmer is actually writing C code
and just doesn't know (or doesn't care) that the file has the wrong extension.

Now, in principle a vector can hold elements of pretty much any type (since
it's a template), but you might know that
[std::vector<bool>](https://en.cppreference.com/w/cpp/container/vector_bool)
is actually a
special case. Instead of holding individual booleans - which typically occupy
at least a full byte each - it actually provides a variable-size bitset,
packing each element into a single bit. And
yes, there's also
[std::bitset](https://en.cppreference.com/w/cpp/utility/bitset)
if you don't need the resizing feature. But
mostly out of laziness, I usually use std::vector<bool>.

The question that comes to one's mind is: since std::vector<bool> obviously
requires some extra computations (bits cannot be addressed individually on
modern computers, so there **has** to be some bit-shifting and masking overhead
involved), can it still perform better than the other vector types?
Let's find out!

Test setup
================================================================================

In order to test the performance characteristics of std::vector versus
other types of vectors, I created the following super simple program that uses
a prime sieve:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ cpp linenumbers
#include <iostream>
#include <vector>

#ifndef VERSION
#define VERSION 1
#endif

int prime_sieve(int below)
{
#if VERSION==1
    std::vector<int> is_prime(below, 1);
#elif VERSION==2
    std::vector<char> is_prime(below, 1);
#elif VERSION==3
    std::vector<bool> is_prime(below, 1);
#endif
    int largest_prime = 0;

    for (int i = 2; i < below; i++) {
        if (is_prime[i] == 1) {
            largest_prime = i;

            int k = 2*i;
            while (k < below) {
                is_prime[k] = 0;
                k += i;
            }
        }
    }

    return largest_prime;
}

int main(int argc, char *argv[])
{
    int largest_prime = prime_sieve(100000000);
    std::cout << largest_prime << std::endl;

    return 0;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The code should speak for itself. Basically we're just getting the largest
prime number below a certain limit and printing it to stdout. The preprocessor
switch VERSION lets us change the element type of is_prime at compile time.
So I compiled the 3 different versions, with -O2, as you would for a
practical program:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ bash
$ g++ main.cpp -O2 -DVERSION=1 -o test1
$ g++ main.cpp -O2 -DVERSION=2 -o test2
$ g++ main.cpp -O2 -DVERSION=3 -o test3
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

And I created a little script that would run a program 10 times for me:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ bash linenumbers
#!/bin/bash

for i in {1..10}
do
    $1
done
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All I had to do now was to measure the execution time:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ bash
$ time ./run10x.sh ./test1
$ time ./run10x.sh ./test2
$ time ./run10x.sh ./test3
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I had two computers to run this test on:

- Ryzen 3500U laptop with 2400 MHz DDR4 RAM
- Ryzen 5800X desktop with 3200 MHz DDR4 RAM

On both computers the CPU governor was set to ondemand and turbo boost was
enabled.

Results
================================================================================

Here are the results I got:

 VERSION  | 5800X    | 3500U
----------|----------|----------
 1 (int)  | 14.037s  | 25.637s
 2 (char) | 8.552s   | 18.206s
 3 (bool) | 3.565s   | 13.362s
[Table [runtimes]: Execution times of 10 runs of the prime sieve programs.]

As you can see, the speedup from using std::vector<bool> is quite significant!
Without any changes to the algorithm itself we got a speedup by a factor of
~2x for the 3500U and a factor of ~4x for the 5800X!

The performance increase from replacing std::vector<int> with std::vector<char>
is also quite impressive. In this case we got a speedup of ~40% for the 3500U
and ~65% for the 5800X. All we really did was lay out our data in memory in a
more compact way. And we even introduced some computational overhead! So why is
std::vector<bool> so much faster?

Further analysis
================================================================================

In order to understand the machine's behaviour a little bit better, I used the
[perf](https://perf.wiki.kernel.org/index.php/Tutorial) command-line utility.
By executing the commands

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ bash
$ perf stat -e cycles,instructions,cache-misses ./test3
$ perf stat -e cycles,instructions,cache-misses ./test2
$ perf stat -e cycles,instructions,cache-misses ./test1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I was able to retrieve some information about the number of cycles it took the
CPU to finish computation, the total number of instructions that were executed,
and the number of cache-misses that happened along the way. Note that I ran
the programs in reverse order. That is because I was worried the CPU's
prediction mechanisms might learn some access patterns on-the-fly and thus make
it seem like bool and char are faster than int. So I worked from fastest
to slowest to avoid giving std::vector<bool> any advantage.
Here are my results:

VERSION  | cycles        | instructions  | cache-misses
----------|---------------|---------------|----------------
1 (int)  | 6.513.968.536 | 2.366.240.895 | 195.540.901
2 (char) | 4.121.029.300 | 1.966.167.499 | 132.289.762
3 (bool) | 1.699.089.745 | 4.943.353.593 | 79.010.612
[Table [perf_5800X]: Statistics created by perf on the 5800X.]

VERSION  | cycles        | instructions  | cache-misses
----------|---------------|---------------|----------------
1 (int)  | 9.518.495.798 | 2.659.774.890 | 190.209.877
2 (char) | 6.720.663.564 | 1.971.889.530 | 129.980.019
3 (bool) | 4.918.045.013 | 4.241.653.189 | 72.136.543
[Table [perf_3500U]: Statistics created by perf on the 3500U.]

While the specific numbers differ a bit between the two processors, we can still
conclude from both of them the following 3 things:

1. The number of cycles is highest for int, lower for char, and lowest for
bool. Since the number of CPU cycles is directly proportional to the
execution time ($T_{exec} = \frac{N_{cycles}}{f_{clk}}$), this is consistent
with the fact that bool was fastest, followed by char and then int.

2. The number of instructions the CPU actually executes is **highest**
for bool. We expected this, since addressing individual bits must create some
form of overhead (see above). However, the instruction count
is higher for int than it is for char. I'm not quite sure yet what's
going on here.

3. The number of cache-misses is **by far** the lowest for std::vector<bool>.
The char variant also produced significantly fewer cache-misses than the
int version. The number of cache-misses that std::vector<bool> produced
was just 40% (Ryzen 5800X) respectively 38% (Ryzen 3500U) of what
std::vector<int> produced, and just 60% (Ryzen 5800X) respectively 55%
(Ryzen 3500U) of what std::vector<char> did.

Just for completeness, here are the specs of the
[Ryzen 5800X](https://en.wikichip.org/wiki/amd/ryzen_7/5800x)
and the
[Ryzen 3500U](https://en.wikichip.org/wiki/amd/ryzen_5/3500u).
The 5800X has 32 MiB and the 3500U 4 MiB of L3 cache. Notably, the packed
bool sieve (~12.5 MB for 100,000,000 bits) fits entirely into the 5800X's L3
cache, while the int version (~400 MB) does not come close on either machine -
which may explain why the 5800X benefits so much more from the switch.

Conclusions
================================================================================

This little experiment has clearly shown that on a modern CPU the amount of
computational work in the form of instructions is not as important as the
layout of the data in memory. The way modern CPUs use caching means
that a mostly sequential, predictable data-access pattern brings much more
performance to the table than a program that does less computation itself,
but requires the CPU to frequently fetch data from far-away blocks of memory.

Realizing this has significant consequences for the types of algorithms we as
programmers use on a daily basis:

- Trees, lists, deques, ... - basically anything built from a chain/graph of
pointers and/or non-contiguous memory blocks should be considered slow, since
almost every data access will make the CPU jump around in memory.

- Anything sequential - basically all containers that use contiguous blocks of
memory, where data is accessed in a (mostly) linear fashion - can be considered
fast and performant, since the CPU can prefetch cache-lines with a minimal
number of cache-misses.

This can completely change how we tackle a programming problem. As an example:
it can be faster to run a program through a simple, unoptimized bytecode VM
than through a clever tree-walking interpreter, just because reading a linear
array of instructions is way faster than chasing pointers through a tree.
[See here.](https://benhoyt.com/writings/goawk-compiler-vm/)

In the future I will definitely look twice for ways to increase
data-density in my programs. Just as we've seen with std::vector<bool>, more
densely packed data might - even **with** overhead - perform significantly
better. Often, we don't really need all eight bits of a char. So
std::vector<bool> might just be a really handy alternative if you only need
to store a single bit per "element".

But I've almost never seen std::vector<bool> appear in the wild. So it seems
people **are** still underestimating its power.