Arrays, Part I: Memory management

In this post, we’ll talk about the different kinds of arrays available in Fortran, C and C++. They differ mostly in where they get the memory used to store their elements. If you don’t know what the stack and the heap are, let’s just say for now that the stack is a part of memory where objects must be deallocated in the reverse order of their allocation. The stack is limited in size, usually 8 MB. The heap is a region of memory where allocation and deallocation can happen in any order; it is only limited by the amount of RAM on your computer, usually 16 GB on a modern laptop. The constraints imposed on the stack allow allocation, deallocation and first access to be orders of magnitude faster than what you can expect from the heap.

Static arrays

An array is said to be static if the lifetime of its elements is restricted to the current scope and its size is known at compile time. Its elements are stored on the stack. Therefore, such arrays are extremely fast to allocate, and their elements are likely to be in the CPU cache the first time you access them.

Here is how they are created in Fortran

program main
    implicit none
    integer, parameter :: n = 10
    real, dimension(1:n) :: x

    x = 0.0
    write (*,*) x(1)
end program

in C

#include <stdio.h>

int main() {
    /* In C, a const int is not a constant expression, so we use an
       enum (or a #define) to keep the size known at compile time. */
    enum { n = 10 };
    double x[n] = { 0.0 };

    printf("%f\n", x[0]);
    return 0;
}

and C++

#include <array>
#include <iostream>

int main() {
    const int n{ 10 };
    std::array<double, n> x{};

    std::cout << x[0] << std::endl;
    return 0;
}

Automatic arrays

An array is said to be automatic if the lifetime of its elements is restricted to the current scope and its size is only known at run time. In Fortran, they are called automatic arrays, and their elements are stored on the stack or the heap depending upon their size and your compiler options. In C99, they are called variable length arrays (VLA) and are stored on the stack. They are not part of the C++11 standard, although some compilers support them as an extension.

Here is a Fortran function that, given an integer \(n\), reads \(n\) floating-point numbers from an input and returns their median.

function median(n)
    integer, intent(in) :: n
    real :: median
    real, dimension(1:n) :: x
    integer :: k

    do k = 1, n
        x(k) = read_number_from_input()
    end do
    median = median_of_array(x)
end function median

Below is the C version with Variable Length Arrays.

double median(int n) {
    double x[n];

    for (int k = 0; k < n; ++k) {
        x[k] = read_number_from_input();
    }
    return median_of_array(x, n);
}

Dynamic arrays

An array is said to be dynamic if the lifetime of its elements is managed by the programmer (or the standard library in case of C++) and its size is known at run time. Their elements are stored on the heap. Therefore, they are slower to allocate and accessing their elements for the first time will most likely trigger a cache miss.

Here is the previous function implemented in Fortran with dynamic arrays

function median(n)
    integer, intent(in) :: n
    real :: median
    real, allocatable, dimension(:) :: x
    integer :: k

    allocate(x(1:n))
    do k = 1, n
        x(k) = read_number_from_input()
    end do
    median = median_of_array(x)
end function median

its C version

double median(int n) {
    double *x = malloc(n * sizeof *x);

    for (int k = 0; k < n; ++k) {
        x[k] = read_number_from_input();
    }
    double m = median_of_array(x, n);
    free(x);
    return m;
}

and its C++ version

double median(int n) {
    std::vector<double> x(n);

    for (std::size_t k = 0; k < x.size(); ++k) {
        x[k] = read_number_from_input();
    }
    return median_of_array(x.data(), n);
}

Performance comparison on Euler’s method

In order to compare the performance of static and dynamic arrays, we solve the system of ordinary differential equations
$$\forall t\in\mathbb{R} \quad y_{0}'(t) = y_1(t), \quad y_{1}'(t) = y_2(t),\quad \ldots, \quad y_{n-2}'(t) = y_{n-1}(t), \quad y_{n-1}'(t) = y_0(t) $$
with the Cauchy condition
$$y_0(0) = 0, \quad y_1(0) = 1,\quad\ldots,\quad y_{n-1}(0)=n-1\, .$$
The goal of the algorithm is to get an estimate of \(y_0(1) + y_1(1) + \cdots + y_{n-1}(1)\). We’ll use the Euler method, choosing the number of time steps inversely proportional to \(n\), so the total number of elementary operations does not depend upon \(n\).
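Concretely, with \(y^{(i)}\) denoting the approximation of \(y(i\,\Delta t)\), one Euler step updates every component as
$$y_k^{(i+1)} = y_k^{(i)} + \Delta t \, y_{(k+1) \bmod n}^{(i)}, \qquad k = 0, \ldots, n-1\, .$$
Each step costs about \(n\) elementary operations, so taking nb_operations / n steps keeps the total work roughly constant whatever the value of \(n\).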

The implementation has been done in C++11 using two different methods. The first one allocates a static array dy_dt and deallocates it at every time step. The second one does the same with a dynamic array. Obviously, the right way to program this algorithm would be to allocate dy_dt once and for all and never destroy it, but this program is just a proxy for larger programs where allocation and deallocation at every time step are needed.

#include <array>
#include <vector>
#include <chrono>
#include <iostream>

int main () {
    const double t{ 1.0 };
    const int n{ 64 };
    const int nb_operations{ 1000000000 };
    const int nb_steps{ nb_operations / n };
    const double dt{ t / nb_steps };
    {
        std::array<double, n> y;
        for(int k = 0; k < n; ++k) {
            y[k] = static_cast<double>(k);
        }

        auto start_time = std::chrono::high_resolution_clock::now();
        for(int i = 0; i < nb_steps; ++i) {
            std::array<double, n> dy_dt;
            for(size_t k = 0; k < y.size(); ++k) {
                dy_dt[k] = y[k + 1 < y.size() ? k + 1 : 0];
            }
            for(size_t k = 0; k < y.size(); ++k) {
                y[k] += dt * dy_dt[k];
            }
        }
        auto end_time = std::chrono::high_resolution_clock::now();
        auto time = std::chrono::duration_cast<std::chrono::microseconds>(
                end_time - start_time).count();

        double sum{ 0.0 };
        for(size_t k = 0; k < y.size(); ++k) {
            sum += y[k];
        }
        std::cout << "Sum: " << sum << std::endl;
        std::cout << "Time static arrays: " <<
                static_cast<double>(time) / 1000000 << std::endl;
    }
    
    {
        std::vector<double> y( n );
        for(int k = 0; k < n; ++k) {
            y[k] = static_cast<double>(k);
        }

        auto start_time = std::chrono::high_resolution_clock::now();
        for(int i = 0; i < nb_steps; ++i) {
            std::vector<double> dy_dt( n );
            for(size_t k = 0; k < y.size(); ++k) {
                dy_dt[k] = y[k + 1 < y.size() ? k + 1 : 0];
            }
            for(size_t k = 0; k < y.size(); ++k) {
                y[k] += dt * dy_dt[k];
            }
        }
        auto end_time = std::chrono::high_resolution_clock::now();
        auto time = std::chrono::duration_cast<std::chrono::microseconds>(
                end_time - start_time).count();
        
        double sum{ 0.0 };
        for(size_t k = 0; k < y.size(); ++k) {
            sum += y[k];
        }
        std::cout << "Sum: " << sum << std::endl;
        std::cout << "Time dynamic arrays: " <<
                static_cast<double>(time) / 1000000 << std::endl;
    }
    return 0;
}

This program has been run for array sizes between \(n = 2\) and \(n = 8192\). It has been compiled with the Intel compiler on Linux with the following options

icpc -std=c++11 -Ofast -ansi-alias -xHost main.cpp -o main

Here are the timings.

[Figure: cpp-static-dynamic — timings of the C++ static and dynamic versions as a function of n]

Many interesting things could be said about this graph, but the most important one is that the implementation using static arrays is up to 16 times faster (for \(n = 6\)) than the one using dynamic arrays. Even if the gap between the two curves shrinks as \(n\) grows, for \(n = 8192\) the static version is still 25% faster than the dynamic one.

Please be aware that micro-benchmarking, as done here, might be misleading. In real applications, the allocation time for a dynamic array depends upon many factors such as memory fragmentation. In this specific test, each array is deallocated before the next one of the same size is allocated, which does not put much pressure on the memory allocator; things can get worse in real applications.

Let’s now have a look at the closest Fortran implementation I can get.

program main
    implicit none
    
    integer, parameter :: dp = kind(1.0d0)
    
    real(dp) :: time_static = 0.0_dp
    real(dp) :: time_dynamic = 0.0_dp
 
    real(dp), parameter :: t = 1.0_dp
    integer, parameter :: n = 64 
    integer, parameter :: nb_operations = 10**9
    integer, parameter :: nb_steps = nb_operations / n
    real(dp), parameter :: dt = t / nb_steps
    
    integer :: time_begin
    integer :: time_end
    integer :: count_rate
    
    real(dp), dimension(1:n) :: y
    real(dp), dimension(1:n) :: dy_dt
    real(dp), allocatable, dimension(:) :: y_dynamic
    real(dp), allocatable, dimension(:) :: dy_dt_dynamic
    real(dp) :: sum
    integer :: i, k
    
    do k = 1, n
        y(k) = real(k - 1, dp)
    end do
    call system_clock(time_begin, count_rate)
    do i = 1, nb_steps
        do k = 1, n
            dy_dt(k) = y(merge(k + 1, 1, k < n))
        end do
        do k = 1, n
            y(k) = y(k) + dt * dy_dt(k)
        end do
    end do
    call system_clock(time_end, count_rate)
    sum = 0.0_dp
    do k = 1, n
        sum = sum + y(k)
    end do
    time_static = real(time_end - time_begin, dp) / real(count_rate, dp)
    write (*,*) 'Sum: ', sum
    write (*,*) 'Time static arrays: ', time_static
    
    
    allocate(y_dynamic(1:n))
    do k = 1, n
        y_dynamic(k) = real(k - 1, dp)
    end do
    call system_clock(time_begin, count_rate)
    do i = 1, nb_steps
        allocate(dy_dt_dynamic(1:n))
        do k = 1, n
            dy_dt_dynamic(k) = y_dynamic(merge(k + 1, 1, k < n))
        end do
        do k = 1, n
            y_dynamic(k) = y_dynamic(k) + dt * dy_dt_dynamic(k)
        end do
        deallocate(dy_dt_dynamic)
    end do
    call system_clock(time_end, count_rate)
    sum = 0.0_dp
    do k = 1, n
        sum = sum + y_dynamic(k)
    end do
    time_dynamic = real(time_end - time_begin, dp) / real(count_rate, dp)
    write (*,*) 'Sum: ', sum
    write (*,*) 'Time dynamic arrays: ', time_dynamic
end program main

To be fair, this version is not completely equivalent to the C++ one for the static part. In the Fortran version, dy_dt is created once and for all and not allocated at every time step, because Fortran (before the block construct introduced in Fortran 2008) does not provide scopes inside a given procedure. But it should not make any difference, as allocating and deallocating memory on the stack only needs to change the value of a pointer and is therefore essentially instantaneous. The code has been compiled with the Intel compiler on the same system and the following options

ifort -Ofast -xHost main.f90 -o main

Here are the timings

[Figure: fortran-static-dynamic — timings of the Fortran static and dynamic versions as a function of n]

We observe the same behaviour as the C++ version.

As most of you have noticed, out of the box, the Fortran version outperforms the C++ version with dynamic arrays: it is up to three times faster. Here are both timings on the same graph.

[Figure: fortran-cpp-dynamic-bis — timings of the Fortran and C++ dynamic versions on the same graph]

We observe the same behaviour with the static version. Using gcc/gfortran also gives the same results.

In a nutshell, to get good performance in your application, whether you are using Fortran, C or C++, you can apply the following rules of thumb:

  • Use static arrays for small arrays whose size is known at compile time. Their allocation and deallocation only need to change the value of a pointer and are essentially instantaneous. They should not be used for large arrays, as they can cause a stack overflow which would terminate the program prematurely.
  • Use automatic arrays in Fortran or C for small arrays whose size is only known at run time. They’ll be as fast as static arrays.
  • Use dynamic arrays for large arrays. Their allocation and deallocation are not instantaneous and might take a fair amount of time when memory is fragmented. But with large arrays, the amount of work on the array usually becomes sizeable, so allocation and deallocation time becomes small compared to the main work.

As you can see, there is no easy solution for small C++ arrays whose size is not known at compile time. Using a memory pool is a solution to that problem, but that’s enough for today.

Next time, we’ll focus on a major debugging feature: bounds checking. See you in two weeks.

 
