Performance gap between vector<bool> and array

I was trying to solve a coding problem in C++ which counts the number of prime numbers less than a non-negative number `n`.

So I first came up with some code:

``````int countPrimes(int n) {
vector<bool> flag(n+1,1);
for(int i =2;i<n;i++)
{
if(flag[i]==1)
for(long j=i;i*j<n;j++)
flag[i*j]=0;
}
int result=0;
for(int i =2;i<n;i++)
result+=flag[i];
return result;
}
``````

which takes 88 ms and uses 8.6 MB of memory. Then I changed my code into:

``````int countPrimes(int n) {
// vector<bool> flag(n+1,1);
bool flag[n+1] ;
fill(flag,flag+n+1,true);
for(int i =2;i<n;i++)
{
if(flag[i]==1)
for(long j=i;i*j<n;j++)
flag[i*j]=0;
}
int result=0;
for(int i =2;i<n;i++)
result+=flag[i];
return result;
}
``````

which takes 28 ms and 9.9 MB. I don't really understand why there is such a performance gap in both the running time and memory consumption. I have read relative questions like this one and that one but I am still confused.

EDIT: I reduced the running time to 40 ms with 11.5 MB of memory after replacing `vector<bool>` with `vector<char>`.

• How are you compiling your code? What compiler options? Also; `vector<bool>` is special. – Jesper Juhl Apr 18 at 11:45
• be carfull, std::vector<bool> is a weird specialisation, more a bitset than a vector – Martin Morterol Apr 18 at 11:49
• you may gain some time changing one loop a bit: `for(int i = 2; i * i <n;i++)` since if `i * i >= n` then next loop does nothing. – Marek R Apr 18 at 12:20
• This doesn't address the question, but when you're dealing with boolean types, use `true` and `false` and not `1`. So: `vector<bool> flag(n+1, true);` and `if (flag[i])`. That doesn't affect the result, but it makes it much clearer what you're doing. – Pete Becker Apr 18 at 12:53
• If you are not already doing so, make sure to compile your code with compiler optimizations enabled before benchmarking. – Jesper Juhl Apr 18 at 14:22

`std::vector<bool>` isn't like any other vector. The documentation says:

`std::vector<bool>` is a possibly space-efficient specialization of `std::vector` for the type `bool`.

That's why it may use up less memory than an array, because it might represent multiple boolean values with one byte, like a bitset. It also explains the performance difference, since accessing it isn't as simple anymore. According to the documentation, it doesn't even have to store it as a contiguous array.

`std::vector<bool>` is special case. It is specialized template. Each value is stored in single bit, so bit operations are needed. This memory compact but has couple drawbacks (like no way to have a pointer to `bool` inside this container).

Now `bool flag[n+1];` compiler will usually allocate same memory in same manner as for `char flag[n+1];` and it will do that on stack, not on heap.

Now depending on page sizes, cache misses and `i` values one can be faster then other. It is hard to predict (for small `n` array will be faster, but for larger `n` result may change).

As an interesting experiment you can change `std::vector<bool>` to `std::vector<char>`. In this case you will have similar memory mapping as in case of array, but it will be located at heap not a stack.

• The performance differences between `std::vector<bool>` and `std::vector<char>` may vary (a lot) between different library implementations and different sizes of the vectors.

See e.g. those quick benches: clang++ / libc++(LLVM) vs. g++ / libstdc++(GNU).

• This: `bool flag[n+1];` declares a Variable Length Array, which (despites some performance advantages due to it beeing allocated in the stack) has never been part of the C++ standard, even if provided as an extension by some (C99 compliant) compilers.

• Another way to increase the performances could be to reduce the amount of calculations (and memory occupation) by considering only the odd numbers, given that all the primes except for 2 are odd.

If you can bare the less readable code, you could try to profile the following snippet.

``````int countPrimes(int n)
{
if ( n < 2 )
return 0;
// Sieve starting from 3 up to n, the number of odd number between 3 and n are
int sieve_size = n / 2 - 1;
std::vector<char> sieve(sieve_size);
int result = 1;  // 2 is a prime.

for (int i = 0; i < sieve_size; ++i)
{
if ( sieve[i] == 0 )
{
// It's a prime, no need to scan the vector again
++result;
// Some ugly transformations are needed, here
int prime = i * 2 + 3;
for ( int j = prime * 3, k = prime * 2; j <= n; j += k)
sieve[j / 2 - 1] = 1;
}
}

return result;
}
``````

Edit

As Peter Cordes noted in the comments, using an unsigned type for the variable `j`

the compiler can implement j/2 as cheaply as possible. C signed division by a power of 2 has different rounding semantics (for negative dividends) than a right shift, and compilers don't always propagate value-range proofs sufficiently to prove that j will always be non-negative.

It's also possible to reduce the number of candidates exploiting the fact that all primes (past 2 and 3) are one below or above a multiple of 6.

• Wow, 20 years after C99, and not even the most rudimentary VLAs... Looks like C++ is getting more and more behind C ;-) Afaik, there was once a proposal that would have made `int array[length];` legal in C++, but even that proposal did not allow for multidimensional arrays, neither on the stack (`int array2d[height][width]`) nor on the heap (`int (*array2d)[width] = new int[height][width];`). Apparently, that proposal was never approved, even though C goes as far as to allow the multidimensional case (`int (*array2d)[width] = malloc(height * sizeof(*array2d));`) since 20 years ago... – cmaster Apr 18 at 15:53
• @cmaster Well, that's debatable ;) – Bob__ Apr 18 at 16:02
• Yeah, I know. Especially since very few people actually understand the power of the C99 syntax - both of the top-scoring answers of the question you linked completely underestimate its value. Because, you see, C99 VLAs not just exist on the stack, they can also be allocated on the heap. I have already given an example of that, it's the last one. The real power comes from the fact that any C99 array type can have a runtime length, it can be pointed to or typedefed. And C99 does not care how many runtime length array types you nest, the indexing calculation comes out right. – cmaster Apr 18 at 20:09
• Long story short: I really enjoy working with C++ for the most part, but when it comes to manipulating multidimensional arrays of plain data, (images, volumes, simulations), I'll just fall back to C99 for its VLA syntax. – cmaster Apr 18 at 20:11
• gcc does show a difference there, because it's actually doing worse with signed `int`. Overall it makes faster code, but you can see that it using `sar` / `sub` / `movslq` to sign-extend back to 64-bit. I suggested using `size_t` re-extension to pointer width wouldn't be necessary. (Although zero-extension is free on x86-64; writing a 32-bit reg zero-extends into the 64-bit reg, but 64 lets it use `-1` in the addr mode.) Anyway, `unsigned long` does give an extra speedup with gcc better than `unsigned`. quick-bench.com/4t9DE9nsT-oQYSNo0leeyYZ8GvY. (`long int` also is the same.) – Peter Cordes Apr 19 at 9:38

I am getting different timings and memory usage than the ones mentioned in the question when compiling with `g++-7.4.0 -g -march=native -O2 -Wall` and running on a Ryzen 5 1600 CPU:

• `vector<bool>`: 0.038 seconds, 3344 KiB memory, IPC 3.16
• `vector<char>`: 0.048 seconds, 12004 KiB memory, IPC 1.52
• `bool[N]`: 0.050 seconds, 12644 KiB memory, IPC 1.69

Conclusion: `vector<bool>` is the fastest option because of its higher IPC (instructions per clock).

``````#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <vector>

size_t countPrimes(size_t n) {
std::vector<bool> flag(n+1,1);
//std::vector<char> flag(n+1,1);
//bool flag[n+1]; std::fill(flag,flag+n+1,true);
for(size_t i=2;i<n;i++) {
if(flag[i]==1) {
for(size_t j=i;i*j<n;j++) {
flag[i*j]=0;
}
}
}
size_t result=0;
for(size_t i=2;i<n;i++) {
result+=flag[i];
}
return result;
}

int main() {
{
const rlim_t kStackSize = 16*1024*1024;
struct rlimit rl;
int result = getrlimit(RLIMIT_STACK, &rl);
if(result != 0) abort();
if(rl.rlim_cur < kStackSize) {
rl.rlim_cur = kStackSize;
result = setrlimit(RLIMIT_STACK, &rl);
if(result != 0) abort();
}
}

printf("%zu\n", countPrimes(10e6));
return 0;
}
``````