
Is std::vector<bool> any good?

The [`std::vector<>`](https://en.cppreference.com/w/cpp/container/vector)
class is part of the C++ standard library and is
used in basically every C++ program. If you look at some C++ code and
don't find any vectors, chances are the programmer is actually writing C
and simply hasn't noticed the file has the wrong extension (or doesn't care).

Now, in principle a vector can hold elements of pretty much any type (since
it's a template), but you might know that
[`std::vector<bool>`](https://en.cppreference.com/w/cpp/container/vector_bool)
is actually a
special case. Instead of holding booleans - which typically occupy at least a
whole byte each in compiled code - it actually provides a variable-size bitset.
And yes, there's also
[`std::bitset`](https://en.cppreference.com/w/cpp/utility/bitset)
if you don't need the resizing feature. But
mostly out of laziness, I usually use `std::vector<bool>`.
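
Just as a quick illustration of that difference (a minimal sketch, nothing
more): `std::bitset` has its size baked in at compile time, while
`std::vector<bool>` offers the same packed storage but can be resized at
runtime:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ cpp
#include <bitset>
#include <iostream>
#include <vector>

int main()
{
    // std::bitset: the number of bits is fixed at compile time.
    std::bitset<1000> fixed_flags;
    fixed_flags[42] = true;

    // std::vector<bool>: a packed bitset that can grow at runtime.
    std::vector<bool> flags(1000, false);
    flags[42] = true;           // operator[] returns a proxy object, not a bool&
    flags.resize(2000, false);  // this is what std::bitset can't do

    std::cout << fixed_flags[42] << ' ' << flags[42] << '\n'; // prints "1 1"
    return 0;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
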

The question that comes to one's mind is: Since `std::vector<bool>` obviously
requires some extra computations (bits cannot be addressed individually in
modern computers, so there **has** to be some bit-shifting and other overhead
involved), can it perform better than the other vector types?
Let's find out!
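
To make that "overhead" claim concrete, here is a simplified sketch of what
accessing a single bit in a packed array involves. This is not the actual
implementation of `std::vector<bool>` in any particular standard library,
just the general idea:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ cpp
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

// Reading bit i means finding the word it lives in, building a mask, and
// shifting -- extra instructions that a plain array access doesn't need.
bool get_bit(const std::vector<std::uint64_t>& words, std::size_t i)
{
    std::uint64_t mask = std::uint64_t{1} << (i % 64);
    return (words[i / 64] & mask) != 0;
}

void set_bit(std::vector<std::uint64_t>& words, std::size_t i, bool value)
{
    std::uint64_t mask = std::uint64_t{1} << (i % 64);
    if (value) words[i / 64] |=  mask;
    else       words[i / 64] &= ~mask;
}

int main()
{
    std::vector<std::uint64_t> bits(2, 0); // room for 128 packed bits
    set_bit(bits, 100, true);
    std::cout << get_bit(bits, 100) << '\n'; // prints 1
    return 0;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
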


Test setup
================================================================================

In order to test the performance characteristics of `std::vector<bool>` versus
other types of vectors, I created the following super simple program that uses
a prime sieve:


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ cpp linenumbers
#include <iostream>
#include <vector>

#ifndef VERSION
#define VERSION 1
#endif

int prime_sieve(int below)
{
#if VERSION==1
    std::vector<int>  is_prime(below, 1);   // 4 bytes per flag
#elif VERSION==2
    std::vector<char> is_prime(below, 1);   // 1 byte per flag
#elif VERSION==3
    std::vector<bool> is_prime(below, 1);   // 1 bit per flag (packed)
#endif
    int largest_prime = 0;

    for (int i = 2; i < below; i++) {
        if (is_prime[i] == 1) {
            largest_prime = i;

            int k = 2*i;
            while (k < below) {
                is_prime[k] = 0;
                k += i;
            }
        }
    }

    return largest_prime;
}


int main(int argc, char *argv[])
{
    int largest_prime = prime_sieve(100000000);
    std::cout << largest_prime << std::endl;

    return 0;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


The code should speak for itself. Basically we're just computing the largest
prime number below a certain limit and printing it to `stdout`. The preprocessor
switch `VERSION` lets us change the type of `is_prime` at compile time.
So I compiled the 3 different versions, with `-O2`, as you would for a
practical program:


``````````````````````````````````````````````` bash
$ g++ main.cpp -O2 -DVERSION=1 -o test1
$ g++ main.cpp -O2 -DVERSION=2 -o test2
$ g++ main.cpp -O2 -DVERSION=3 -o test3
```````````````````````````````````````````````


And I created a little script that would run a program 10 times for me:


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ bash linenumbers
#!/bin/bash

for i in {1..10}; do
    "$1"
done
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


All I had to do now was measure the execution time:


``````````````````````````````````````````````` bash
$ time ./run10x.sh ./test1
$ time ./run10x.sh ./test2
$ time ./run10x.sh ./test3
```````````````````````````````````````````````


I had two computers to run this test on:

- Ryzen 3500U Laptop with 2400 MHz DDR4 RAM
- Ryzen 5800X Desktop with 3200 MHz DDR4 RAM

On both computers the CPU governor was set to `ondemand` and turbo boost
was enabled.


Results
================================================================================

Here are the results I've got:


 VERSION  | 5800X    | 3500U
----------|----------|----------
 1 (int)  | 14.037s  | 25.637s
 2 (char) | 8.552s   | 18.206s
 3 (bool) | 3.565s   | 13.362s
[Table [runtimes]: Execution times of 10 runs of the prime sieve programs.]


As you can see, the speedup from using `std::vector<bool>` is quite significant!
Without any changes to the algorithm itself we got a speedup of roughly
2x on the 3500U and roughly 4x on the 5800X!

The performance increase from replacing `std::vector<int>` with
`std::vector<char>` is also quite impressive. In this case we got a speedup of
~40% on the 3500U and ~65% on the 5800X.

All we really did was lay out the data in memory in a more compact way.
We even introduced some computational overhead! So why is
`std::vector<bool>` so much faster?


Further analysis
================================================================================

In order to understand the machine's behaviour a little bit better, I used the
[perf](https://perf.wiki.kernel.org/index.php/Tutorial) command-line utility.
By executing the commands


``````````````````````````````````````````````` bash
$ perf stat -e cycles,instructions,cache-misses ./test3
$ perf stat -e cycles,instructions,cache-misses ./test2
$ perf stat -e cycles,instructions,cache-misses ./test1
```````````````````````````````````````````````

I was able to retrieve the number of cycles it took the
CPU to finish the computation, the total number of instructions that were
executed, and the number of cache-misses that happened along the way. Note that
I ran the programs in reverse order. That is because I was worried the CPU's
prediction mechanisms might learn some access patterns on-the-fly and thus make
it seem like `bool` and `char` are faster than `int`. So I worked from fastest
to slowest to avoid giving `std::vector<bool>` any advantage.
Here are my results:


 VERSION  | cycles        | instructions  | cache-misses
----------|---------------|---------------|----------------
 1 (int)  | 6,513,968,536 | 2,366,240,895 | 195,540,901
 2 (char) | 4,121,029,300 | 1,966,167,499 | 132,289,762
 3 (bool) | 1,699,089,745 | 4,943,353,593 | 79,010,612
[Table [perf_5800X]: Statistics created by perf on the 5800X.]


 VERSION  | cycles        | instructions  | cache-misses
----------|---------------|---------------|----------------
 1 (int)  | 9,518,495,798 | 2,659,774,890 | 190,209,877
 2 (char) | 6,720,663,564 | 1,971,889,530 | 129,980,019
 3 (bool) | 4,918,045,013 | 4,241,653,189 | 72,136,543
[Table [perf_3500U]: Statistics created by perf on the 3500U.]


While the specific numbers differ a bit between the two processors, both
support the following three conclusions:

1. The number of cycles is highest for `int`, lower for `char`, and lowest for
   `bool`. Since the number of CPU cycles is directly proportional to the
   execution time ($ T_{exec} = \frac{N_{cycles}}{f_{clk}} $), this is
   consistent with `bool` being fastest, followed by `char` and `int` (see the
   back-of-the-envelope check after this list).

2. The number of instructions that the CPU actually executes is **highest**
   for `bool`. We expected this, since accessing individual bits must create
   some form of overhead (see above). However, the number of instructions
   is also higher for `int` than it is for `char`. I'm not quite sure yet
   what's going on there.

3. The number of cache-misses is **by far** the lowest for `std::vector<bool>`.
   The `char` variant also produced significantly fewer cache-misses than the
   `int` version. The number of cache-misses that `std::vector<bool>` produced
   was just 40% (Ryzen 5800X) respectively 38% (Ryzen 3500U) of what
   `std::vector<int>` produced, and just 60% (Ryzen 5800X) respectively 55%
   (Ryzen 3500U) of what `std::vector<char>` produced.
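
As a quick sanity check of point 1: the 5800X boosts up to 4.7 GHz, so the
~1.7 billion cycles of the `bool` version should take roughly

$$ T_{exec} = \frac{N_{cycles}}{f_{clk}} \approx \frac{1.699 \cdot 10^9}{4.7 \cdot 10^9\,\text{Hz}} \approx 0.36\,\text{s} $$

per run, which lines up nicely with the 3.565 s measured for 10 runs. (An
exact match isn't expected: the sustained clock sits below the maximum boost,
and the program does a bit of work outside the sieve loop.)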

Just for completeness, here are the specs of the
[Ryzen 5800X](https://en.wikichip.org/wiki/amd/ryzen_7/5800x)
and the
[Ryzen 3500U](https://en.wikichip.org/wiki/amd/ryzen_5/3500u).
The 5800X has 32 MiB and the 3500U 4 MiB of L3 cache.
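
Those cache sizes suggest a back-of-the-envelope footprint calculation. This
little sketch assumes the usual 4-byte `int` and a `std::vector<bool>`
implementation that packs 8 bits per byte:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ cpp
#include <cstddef>
#include <iostream>

int main()
{
    const std::size_t n = 100000000; // sieve size used in the benchmark

    // Rough heap footprint of the is_prime array per element type:
    std::cout << "int:  " << n * sizeof(int)  / 1e6 << " MB\n"  // ~400 MB
              << "char: " << n * sizeof(char) / 1e6 << " MB\n"  // ~100 MB
              << "bool: " << n / 8 / 1e6 << " MB\n";            // ~12.5 MB
    return 0;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Notably, the packed ~12.5 MB array fits entirely into the 5800X's 32 MiB of
L3, while none of the three variants comes close to fitting into the 3500U's
4 MiB - which would plausibly explain why the `bool` speedup is so much more
dramatic on the 5800X.
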


Conclusions
================================================================================

This little experiment has clearly shown that on a modern CPU the amount of
computational work in the form of instructions is not as important as the
layout of the data in memory. The way modern CPUs use caching means
that a mostly sequential, predictable data-access pattern brings much more
performance to the table than a program that does less computation itself,
but requires the CPU to frequently fetch data from far-away blocks of memory.

Realizing this has significant consequences for the types of algorithms we as
programmers use on a daily basis:

- Trees, lists, deques, ... - basically anything built from a chain/graph of
  pointers and/or non-contiguous memory blocks should be considered slow, since
  any access to the data will make the CPU jump around in memory.

- Anything sequential, so basically all containers that use contiguous blocks
  of memory where data is accessed in a (mostly) linear fashion, can be
  considered fast, since the CPU can prefetch cache-lines with a minimal
  number of cache-misses (see the sketch after this list).
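
If you want to observe this yourself, here is a minimal micro-benchmark
sketch: summing the same numbers from a contiguous `std::vector` and from a
node-based `std::list`. Run each variant under `time` or `perf stat` as above.
(A freshly built list still has relatively friendly node placement; long-lived,
frequently modified lists tend to look even worse.)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ cpp
#include <iostream>
#include <list>
#include <numeric>
#include <vector>

int main()
{
    const int n = 10000000;

    // Identical work, very different access patterns: the vector walks
    // memory linearly (prefetcher-friendly), the list chases one pointer
    // per element.
    std::vector<int> contiguous(n, 1);
    std::list<int>   pointer_chasing(n, 1);

    long sum_v = std::accumulate(contiguous.begin(), contiguous.end(), 0L);
    long sum_l = std::accumulate(pointer_chasing.begin(), pointer_chasing.end(), 0L);

    std::cout << sum_v << ' ' << sum_l << '\n';
    return 0;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
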

This can completely change how we tackle a programming problem. As an example:
it might be better to run a bunch of instructions through an unoptimized VM
than to use a clever tree-walking interpreter, just because reading a linear
array of instructions is so much faster.
[See here.](https://benhoyt.com/writings/goawk-compiler-vm/)
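
As a toy sketch of that idea (entirely made up, nothing to do with GoAWK's
actual design): a bytecode VM is in essence just a loop over a flat,
contiguous array of opcodes - exactly the access pattern the prefetcher likes.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ cpp
#include <cstdint>
#include <iostream>
#include <vector>

enum Op : std::uint8_t { INC, DOUBLE, HALT };

// Dispatch loop: reads the next opcode from a linear array and acts on it.
int run(const std::vector<Op>& code)
{
    int acc = 0;
    for (Op op : code) {
        switch (op) {
            case INC:    acc += 1; break;
            case DOUBLE: acc *= 2; break;
            case HALT:   return acc;
        }
    }
    return acc;
}

int main()
{
    std::vector<Op> program = { INC, DOUBLE, INC, HALT }; // (0+1)*2 + 1 = 3
    std::cout << run(program) << '\n'; // prints 3
    return 0;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
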

In the future I will definitely look twice for ways to increase
data-density in my programs. Just as we've seen with `std::vector<bool>`, more
densely packed data might - even **with** the extra computational overhead -
perform significantly better. Often we don't really need all the bits in a
`char`, so `std::vector<bool>` might just be a really handy alternative if you
only need to store a single bit per "element".

But I've almost never seen `std::vector<bool>` appear in the wild. So it seems
people **are** still underestimating its power.