Arrays, Part III: Vectorization

Inside a single core, most modern processors have special hardware that allows a Single Instruction to be applied in parallel to Multiple Data (SIMD). Often referred to as vector units, they were the main components of supercomputers in the 1980s. In order to speed up applications that process images, sound and video, Intel introduced them in Pentium processors in 1997 under the name MMX. They are now available in every Intel processor and in most ARM processors. Whether you are reading this on your PC, your tablet or your phone, your device has a vector unit. The maximum speedup you can obtain depends both upon the width of the vector extension and the size of the data type. SSE2 registers, introduced by Intel in 2001, are 128 bits wide. As they can store two doubles, one can expect a speedup by a factor of 2 with double precision floating point numbers. Such registers can also hold up to 16 chars, a type often used in image processing, where we can expect a speedup by a factor of 16.
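
The maximum expected speedup is thus simply the ratio between the register width and the width of the data type:

\[\text{maximum speedup} = \frac{\text{register width}}{\text{width of data type}}, \qquad \text{e.g. } \frac{128 \text{ bits}}{64 \text{ bits per double}} = 2.\]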

[Figure: SSE2 128-bit vector registers]

In 2011, Intel released its first processors with Advanced Vector Extensions (AVX), featuring 256-bit wide vector registers. The case study at the end of this post has been run on a MacBook Pro with an AVX2 processor. Therefore, one can expect to speed up numerical code by a factor of 4 on such a machine. We'll find out whether this processor can deliver such speedups in real tests.

Vectorizing loops

Let’s look at the following C code.

int main (int argc, char const *argv[])
{
    const int n = 128;
    double a[n];
    double b[n];
    double c[n];

    // Assume that some work has been done here on b and c

    for(int k = 0; k < n; ++k) {
        a[k] = b[k] + c[k];
    }
    
    // Work with a

    return 0;
}

If vector instructions are not used, the vector registers are used as shown below.

[Figure: AVX2 vector register, only one lane in use]

Only one addition is carried out at a time. But if we use vector instructions, the processor can fill the registers and carry out 4 double additions at a time. If the memory bandwidth is wide enough, one should expect a speedup by a factor of 4.

[Figure: AVX2 vector register, all four lanes in use]
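
To make this concrete, here is a hand-written sketch using AVX intrinsics of what the compiler generates for the loop above. This is only an illustration, not actual compiler output; since \(n = 128\) is a multiple of 4, no remainder loop is needed.

#include <immintrin.h>

void add_avx(const double* b, const double* c, double* a, int n) {
    for (int k = 0; k < n; k += 4) {
        __m256d vb = _mm256_loadu_pd(&b[k]); // load 4 doubles from b
        __m256d vc = _mm256_loadu_pd(&c[k]); // load 4 doubles from c
        __m256d va = _mm256_add_pd(vb, vc);  // 4 additions in one instruction
        _mm256_storeu_pd(&a[k], va);         // store 4 doubles into a
    }
}
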
Unfortunately, not all loops are good candidates for vectorization. To be vectorizable, loops must meet the following criteria:

  • Countable: The number of times the loop will be executed must be known before entering the loop. Therefore, the exit condition must not depend upon a variable that is modified during the loop. Statements that interrupt the flow, such as \(\verb+break+\), are not allowed.
  • Straight-line code: Vector instructions apply the same instruction to multiple data.
    • Therefore, \(\verb+if+\) statements are not allowed in a vectorized loop, as in the following example.
      // This code can't be vectorized
      #include <iostream>
      
      int main (int argc, char const *argv[])
      {
          const int n{ 128 };
          double a[n];
          double b[n];
          
          // Do some work with b
      
          for(int k = 0; k < n; ++k) {
              if (b[k] >= 0.0) {
                  a[k] = b[k];
              } else {
                  std::cout << "Another negative number" << std::endl;
              }
          }
          
          // Do some work with a
          
          return 0;
      }
      

      However, compilers might still vectorize branches without side effects using the following trick: both branches are executed by the processor, but only the result of the branch that should have been executed is kept. The following code gets vectorized by most compilers.

      // This code can be vectorized
      int main (int argc, char const *argv[])
      {
          const int n{ 128 };
          double a[n];
          double b[n];
          
          // Do some work with b
      
          for(int k = 0; k < n; ++k) {
              if (b[k] >= 0.0) {
                  a[k] = b[k];
              } else {
                  a[k] = -b[k];
              }
          }
          
          // Do some work with a
          
          return 0;
      }
      
    • In a nest of loops, only the innermost loop can be vectorized.
  • No vector dependence: We have seen that vector instructions are applied in parallel to multiple data. Therefore, the result of an operation at index \(k\) can't be used in an operation at index \(k+1\). As a consequence, the following code, which computes the Fibonacci sequence, can't get vectorized.
    // This code can't be vectorized
    #include <iostream>
    
    int main (int argc, char const *argv[])
    {
        const int n = 32;
        int fibo[n];
    
        fibo[0] = 0;
        fibo[1] = 1;
        for(int k = 2; k < n; ++k) {
            fibo[k] = fibo[k - 1] + fibo[k - 2];
        }
        
        std::cout << "Fibo[31] = " << fibo[n - 1] << std::end;
        
        return 0;
    }
    

    Another consequence is that function calls that cannot be inlined by the compiler are forbidden; those functions could have side effects which might lead to vector dependence. However, built-in mathematical functions such as \(\verb+pow+\), \(\verb+sqrt+\), \(\verb+cos+\) and \(\verb+exp+\) are allowed, as in the fragment below.
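
    For instance, a loop like the following fragment remains vectorizable; the compiler maps \(\verb+sqrt+\) to its SIMD counterpart (a minimal sketch, with \(\verb+a+\) and \(\verb+b+\) static arrays as in the examples above):

    #include <cmath>
    
    // This loop can be vectorized: sqrt has a SIMD equivalent
    for(int k = 0; k < n; ++k) {
        a[k] = std::sqrt(b[k]);
    }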

    There is a class of loops exhibiting a vector dependence, known as reductions, that can still be vectorized. For instance, the following summation gets vectorized by most compilers.

    int main (int argc, char const *argv[])
    {
        const int n{ 128 };
        int a[n];
        
        // Do some work with a
        
        int sum{ 0 };
        for(int k = 0; k < n; ++k) {
            sum += a[k];
        }
        
        // Do some work with sum
        
        return 0;
    }
    

    The compiler uses the following trick: on an AVX2 processor, it creates 4 variables \(\verb+sum0+\), \(\verb+sum1+\), …, \(\verb+sum3+\), initializes them to 0 and adds \(\verb+a[0]+\) to the first, up to \(\verb+a[3]+\) to the last, in one vector instruction. Then it continues with \(\verb+a[4]+\), …, \(\verb+a[7]+\). At the end of the loop, the variables \(\verb+sumX+\) are added together and the result is stored in \(\verb+sum+\). Such a trick can't be used with floating point numbers, as addition on those numbers is not associative. However, your compiler should have an option to carry out this transformation on floating point numbers if you can live with the small loss of accuracy.
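
    Conceptually, the transformation amounts to the following scalar sketch (an illustration of the idea, not actual compiler output; it assumes \(\verb+n+\) is a multiple of 4):

    // Four partial sums are accumulated in parallel and combined at the end
    int sum0{ 0 }, sum1{ 0 }, sum2{ 0 }, sum3{ 0 };
    for(int k = 0; k < n; k += 4) {
        sum0 += a[k];
        sum1 += a[k + 1];
        sum2 += a[k + 2];
        sum3 += a[k + 3];
    }
    int sum{ sum0 + sum1 + sum2 + sum3 };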

Aliasing in C and C++

Once you are happy with the performance of your code with static arrays in C, you start coding with dynamic arrays. One of the first functions you might write could be this one.

void add(double* a, double* b, int n) {
    for(int k = 0; k < n; ++k) {
        a[k] = b[k] + 1.0;
    }
}

Your old friend who lives next door, still listening to Joe Cocker on his old turntable, and still enjoying Fortran, wrote this code (with \(\verb+dp = 8+\) for double precision).

subroutine add(a, b)
    real(dp), dimension(:), intent(out) :: a
    real(dp), dimension(:), intent(in) :: b
    ! Local variables
    integer :: k
    
    do k = 1, size(a)
        a(k) = b(k) + 1.0_dp
    end do
end subroutine add

which runs 4 times faster than yours when compiled with a recent version of gcc. What happened?

You are a victim of pointer aliasing! Remember that in C, dynamic arrays are just pointers. Therefore, it could be that \(\verb+n = 4+\) with \(\verb+a+\) and \(\verb+b+\) pointing to the following memory locations.

[Figure: a and b pointing to overlapping memory locations]

Yes, \(\verb+b[1]+\) and \(\verb+a[0]+\) refer to the same location in memory. Now suppose that those memory locations contain, respectively: 0.0, 2.0, 4.0, 6.0, 8.0, so that the array \(\verb+a+\) contains 2.0, 4.0, 6.0, 8.0 and the array \(\verb+b+\) contains 0.0, 2.0, 4.0, 6.0. If you follow the algorithm implemented in the C file line by line, \(\verb+a+\) ends up with the following values: 1.0, 2.0, 3.0, 4.0. But if you store all the elements of \(\verb+b+\) in a vector register and add 1.0 to all of them, you end up with 1.0, 3.0, 5.0, 7.0, a completely different result. This difference is due to pointer aliasing; your old friend programming in Fortran does not have this problem, as variables are not allowed to alias in Fortran.
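
To make this concrete, here is a minimal sketch reproducing that layout (the buffer \(\verb+mem+\) is hypothetical, chosen to match the figure):

// b and a overlap: b[1] and a[0] refer to the same memory location
double mem[5] = { 0.0, 2.0, 4.0, 6.0, 8.0 };
double* b = &mem[0]; // b = {0.0, 2.0, 4.0, 6.0}
double* a = &mem[1]; // a = {2.0, 4.0, 6.0, 8.0}
add(a, b, 4);        // scalar, in-order execution yields a = {1.0, 2.0, 3.0, 4.0}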

Pointer aliasing was the main reason why Fortran programs could be faster than C programs before C99. In C99, the \(\verb+restrict+\) keyword was introduced to explicitly tell the compiler that some pointers do not alias. Therefore, the following function

void add(double* restrict a, double* b, int n) {
    for(int k = 0; k < n; ++k) {
        a[k] = b[k] + 1.0;
    }
}

gets vectorized. The keyword tells the compiler that no pointer other than \(\verb+a+\) can access its elements. Therefore \(\verb+b+\) is unaffected by the assignment in the loop, which can be vectorized. Unfortunately, the \(\verb+restrict+\) keyword does not exist in C++. However, compilers such as the Intel compilers provide support for such a keyword: you can use \(\verb+__restrict__+\) and compile with the \(\verb+-restrict+\) option.

You might have heard of the strict aliasing rule, which was introduced in ANSI C and in C++. It might help vectorization when mixed types are used. In a nutshell, it says that pointers of different types are not allowed to access the same memory location, unless one of them is a \(\verb+char*+\). As a consequence, a \(\verb+double*+\) and an \(\verb+int*+\) can't alias. This rule is of no use in our previous program, as \(\verb+a+\) and \(\verb+b+\) are both \(\verb+double*+\), but it might help in other cases. It is interesting to notice that many programmers refuse to follow this strict aliasing rule even though it is part of the C standard; therefore, their programs are not correct C and may break when a compiler exploits this rule. As major programs, such as the Linux kernel, are in this case, many compilers don't enforce strict aliasing by default. The Intel compiler is one of them, and you should use the \(\verb+-ansi-alias+\) option to allow the compiler to optimize the code using this rule.
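
As a hypothetical illustration of how the rule can help, consider the following function. Because a \(\verb+double*+\) and an \(\verb+int*+\) can't alias, the compiler may load \(\verb+*m+\) once outside the loop and vectorize without any run-time check:

// Under strict aliasing, writes through x can't modify *m,
// so *m can be hoisted out of the loop
void scale(double* x, const int* m, int n) {
    for(int k = 0; k < n; ++k) {
        x[k] *= *m;
    }
}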

Case study

In the following tests, we measure the number of “operations” carried out by the processor per second. What we call an “operation” is what is done in the inner loop: it usually consists of an addition and an assignment, or of multiplying 2 sets of numbers, adding them and taking their square root. The tests have been done with various vector lengths \(n = 8, 64, 10^3, 10^6, 10^8\), and the loops have been run \(n_{\textrm{loops}}\) times such that \(n \times n_{\textrm{loops}} = 10^9\). As the loops are run multiple times, we can expect that for all but the first outer iteration, the arrays are already in the cache when they fit in it (we say that the cache has been warmed up). A parameter (often named \(\verb+a+\)) is slightly changed in the outer loop so that the compiler does not optimize this loop away (Fortran compilers were quite smart in that respect). We plot the number of billions of “operations” per second; therefore, higher bars mean better performance.

Adding a number to all the elements of an array

In this test, a double is added to every element of an array. Here are the C and C++ versions of the code

void test1_c(double* x, int n, int nb_loops) {
    double a{ 0.0 };
    for (int i = 0; i < nb_loops; ++i) {
        a += 1.0;
        for (int k = 0; k < n; ++k) {
            x[k] += a;
        }
    }
}

void test1_cpp(std::vector<double>& x, int nb_loops) {
    double a{ 0.0 };
    for (int i = 0; i < nb_loops; ++i) {
        a += 1.0;
        for (std::size_t k = 0; k < x.size(); ++k) {
            x[k] += a;
        }
    }
}

and the Fortran code

subroutine test1_f(x, nb_loops)
    real(dp), dimension(:), intent(inout) :: x
    integer, intent(in) :: nb_loops
    ! local variables
    integer :: i, k
    real(dp) :: a

    a = 0.0_dp
    do i = 1, nb_loops
        a = a + 1.0_dp
        do k = 1, size(x)
            x(k) = x(k) + a
        end do
    end do
end subroutine test1_f

all compiled with the Intel compilers and the following options.

icpc -std=c++11 -Ofast -march=core-avx2 -ansi-alias
ifort -Ofast -march=core-avx2

We also generated timings for the C code without vectorization, using the \(\verb+-no-vec+\) option of the compiler. The results are presented below.

[Figure: billions of operations per second, adding a number to all the elements of an array]

We can clearly see in this plot that for \(n = 10^3\), the vectorized code is up to 3.5 times faster than the code without vectorization, quite close to the maximum theoretical speedup of 4. For \(n = 10^6\), the difference between the versions gets smaller: with such a size, an array of doubles takes 8 MB of memory and no longer fits in the cache of the processor. At that vector size, vectorization becomes useless, as the performance is limited by the memory bandwidth. One might be surprised that for \(n = 8\), vectorization slows down the code. A profiler shows that the program spends quite some time broadcasting the value of \(\verb+a+\) into the different lanes of the vector register; this overhead matters only for such tiny loops, so results should be better in real code.

Adding an array to another one

This second scenario consists of adding one array of doubles, plus a varying constant, to another array. As pointer aliasing might slow down the C code, we also provide a C version using the \(\verb+restrict+\) keyword.

void test2_c(double* x, double * y, int n, int nb_loops) {
    double a{ 0.0 };
    for (int i = 0; i < nb_loops; ++i) {
        a += 1.0;
        for (int k = 0; k < n; ++k) {
            x[k] += y[k] + a;
        }
    }
}

void test2_c_restrict(double* __restrict__ x, double* y, int n, int nb_loops) {
    double a{ 0.0 };
    for (int i = 0; i < nb_loops; ++i) {
        a += 1.0;
        for (int k = 0; k < n; ++k) {
            x[k] += y[k] + a;
        }
    }
}

void test2_cpp(std::vector<double>& x, const std::vector<double>& y,
        int nb_loops) {
    double a{ 0.0 };
    for (int i = 0; i < nb_loops; ++i) {
        a += 1.0;
        for (std::size_t k = 0; k < x.size(); ++k) {
            x[k] += y[k] + a;
        }
    }
}

and the Fortran version.

subroutine test2_f(x, y, nb_loops)
    real(dp), dimension(:), intent(inout) :: x
    real(dp), dimension(:), intent(in) :: y
    integer, intent(in) :: nb_loops
    ! local variables
    integer :: i, k
    real(dp) :: a

    a = 0.0_dp
    do i = 1, nb_loops
        a = a + 1.0_dp
        do k = 1, size(x)
            x(k) = x(k) + y(k) + a
        end do
    end do
end subroutine test2_f

The code has been compiled with the same options as in the first test. Here are the numbers.

[Figure: billions of operations per second, adding an array to another one]

As you can see, the vectorized code is up to 3.6 times faster (for \(n = 1000\)) than the non-vectorized one. For \(n \geq 10^6\), the arrays don't fit in the cache and the performance is again limited by the memory bandwidth; vectorization is useless and slows down the code. A very interesting observation is that the C code gets vectorized even without the \(\verb+restrict+\) keyword. In their latest version, the Intel compilers use the following trick: both standard and vectorized code are generated, and the arrays are checked at run time to see whether they alias or not. Therefore, pointer aliasing does not affect our code for \(n \geq 64\). But for \(n = 8\), the code using the \(\verb+restrict+\) keyword is faster, as the arrays don't have to be checked for aliasing, an operation whose cost can't be neglected when we just need to add 8 numbers. A conceptual sketch of this dispatch follows.
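
The sketch below is an idealized picture of the trick, not actual compiler output; the overlap test is shown as a plain pointer comparison for simplicity, and the function name \(\verb+add_dispatch+\) is hypothetical.

// Run-time aliasing check: vector code runs only when the
// arrays cannot overlap; otherwise fall back to scalar code.
void add_dispatch(double* x, double* y, double a, int n) {
    if (x + n <= y || y + n <= x) {
        // No overlap possible: this copy of the loop is vectorized.
        for (int k = 0; k < n; ++k) x[k] += y[k] + a;
    } else {
        // The arrays may alias: this copy stays scalar.
        for (int k = 0; k < n; ++k) x[k] += y[k] + a;
    }
}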

Solving a quadratic equation

In this last test, we solve the quadratic equation \(a_k x^2 + b_k x + c_k = 0\) for every \(k \in \{0, 1, 2, \ldots, n-1\}\). For \(\Delta_k = b_k^2 - 4 a_k c_k \geq 0\), the equation has 2 real solutions; for \(\Delta_k < 0\), it has no real solution. A dummy term \(d\) has been added to the discriminant, as the Fortran compiler was smart enough to kill the outer loop without it.
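
When \(\Delta_k \geq 0\), the two roots computed in each version of the code are

\[x_k^{\pm} = \frac{-b_k \pm \sqrt{\Delta_k}}{2 a_k}.\]

Here are the C and C++ versions of the code.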

void test3_c(double *a, double* b, double* c,
        double* x, double* y, bool* sol, int n, int nb_loops) {
    double d{ 0.0 };
    for (int i = 0; i < nb_loops; ++i) {
        d += 1.0;
        for (int k = 0; k < n; k++) {
            double discriminant{ d + b[k]*b[k] - 4.0 * a[k] * c[k] };
            if (discriminant >= 0.0) {
                double s{ sqrt(discriminant) };
                x[k] = (-b[k] + s) / (2 * a[k]);
                y[k] = (-b[k] - s) / (2 * a[k]);
                sol[k] = true;
            } else {
                sol[k] = false;
            }
        }
    }
}

void test3_c_restrict(double* a, double* b, double* c,
        double* __restrict__ x, double* __restrict__ y, bool* __restrict__ sol,
        int n, int nb_loops) {
    double d{ 0.0 };
    for (int i = 0; i < nb_loops; ++i) {
        d += 1.0;
        for (int k = 0; k < n; k++) {
            double discriminant{ d + b[k]*b[k] - 4.0 * a[k] * c[k] };
            if (discriminant >= 0.0) {
                double s{ sqrt(discriminant) };
                x[k] = (-b[k] + s) / (2 * a[k]);
                y[k] = (-b[k] - s) / (2 * a[k]);
                sol[k] = true;
            } else {
                sol[k] = false;
            }
        }
    }
}

void test3_cpp(const std::vector<double>& a, const std::vector<double>& b,
        const std::vector<double>& c, std::vector<double>& x,
        std::vector<double>& y, std::vector<bool>& sol, int nb_loops) {
    double d{ 0.0 };
    for (int i = 0; i < nb_loops; ++i) {
        d += 1.0;
        for (std::size_t k = 0; k < a.size(); k++) {
            double discriminant{ d + b[k]*b[k] - 4.0 * a[k] * c[k] };
            if (discriminant >= 0.0) {
                double s{ sqrt(discriminant) };
                x[k] = (-b[k] + s) / (2 * a[k]);
                y[k] = (-b[k] - s) / (2 * a[k]);
                sol[k] = true;
            } else {
                sol[k] = false;
            }
        }
    }
}

and the Fortran version.

subroutine test3_f(a, b, c, x, y, sol, nb_loops)
    real(dp), dimension(:), intent(in) :: a
    real(dp), dimension(:), intent(in) :: b
    real(dp), dimension(:), intent(in) :: c
    real(dp), dimension(:), intent(out) :: x
    real(dp), dimension(:), intent(out) :: y
    logical, dimension(:), intent(out) :: sol
    integer, intent(in) :: nb_loops
    ! local variables
    integer :: i, k
    real(dp) :: discriminant, s
    real(dp) :: d

    d = 0.0_dp
    do i = 1, nb_loops
        d = d + 1.0_dp
        do k = 1, size(a)
            discriminant = d + b(k)**2 - 4 * a(k) * c(k)
            if (discriminant >= 0) then
                s = sqrt(discriminant)
                x(k) = (-b(k) + s) / (2 * a(k))
                y(k) = (-b(k) - s) / (2 * a(k))
                sol(k) = .true.
            else
                sol(k) = .false.
            end if
        end do
    end do
end subroutine test3_f

Still using the same options for compilation, this time we get these numbers.

[Figure: billions of operations per second, solving a quadratic equation]

The C and the C++ versions do not get vectorized. The only versions that get vectorized are the C version with the \(\verb+restrict+\) keyword and the Fortran version, for which we get a speedup by a factor of 2.7. As the algorithm needs to compute a square root, an operation which is way slower than just adding two numbers, the memory bandwidth is no longer the limiting factor.

Conclusion

Vector instructions can speed up your numerical code and should be considered before moving to multithreaded programs. But bear in mind that your algorithm might be limited by the memory bandwidth of your CPU; in those cases, vectorization might become useless and even slow down your program. As far as languages are concerned, Fortran is still ahead of the pack in terms of simplicity, but C code using the \(\verb+restrict+\) keyword can remove the aliasing problem and run as fast as (and even slightly faster than) the Fortran version. We've learned that in C, modern compilers can generate vector instructions guarded by a run-time check for aliasing. This check has a cost for small arrays and is not always generated, as we have seen in our last test. Vectorizing compilers are evolving rapidly, so the results we got today might change in the future. Many bugs are still present in those compilers, and I have discarded some tests where the C++ version was faster than the C and Fortran versions because it was the only one to generate vector code; this is obviously a compiler bug, as the C++ version had exactly the same information as the other ones. When developing your own code, you should check the reports given by your compiler (\(\verb+-qopt-report=5+\) with the Intel compilers) and have a look at the generated assembly code.
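
For instance, with the flags used in this post (and \(\verb+test.cpp+\) as a placeholder file name):

icpc -std=c++11 -Ofast -march=core-avx2 -ansi-alias -qopt-report=5 test.cpp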
