In this post, we’ll talk about the different kind of arrays available in Fortran, C and C++. They differ mostly in where they get their memory from to store their elements. If you don’t know what the stack and the heap are, let’s just assume for now that the stack is a part of the memory where objects needs to be deallocated in the reverse order of their allocation. The stack is limited in size, usually 8 MB. The heap is a memory from where creation and destruction can happen in any order and which is only limited by the amount of RAM you have on your computer, usually 16 GB on a modern laptop. Limitations imposed on the stack allow allocation, deallocation and first access time to be orders of magnitude faster than what you can expect from the heap.

#### Static arrays

An array is said to be static if the lifetime of its elements is restricted to the current scope and its size is known at compile time. Their elements are stored on the stack. Therefore, they are extremely fast to allocate and their elements will be in the cache of your CPU as soon as they are created.

Here is how they are created in Fortran

```
program main
implicit none
integer, parameter :: n = 10
real, dimension(1:n) :: x
write (*,*) x(1)
end program
```

in C

```
#include "stdio.h"
int main() {
const int n = 10;
double x[n];
printf("%f", x[0]);
return 0;
}
```

and C++

```
#include <array>
#include <iostream>
int main() {
const int n{ 10 };
std::array<double, n> x;
std::cout << x[0] << std::endl;
return 0;
}
```

#### Automatic arrays

An array is said to be automatic if the lifetime of its elements is restricted to the current scope and its size is known at run time. In Fortran, they are called automatic arrays and their elements are stored on the stack or the heap depending upon their size and your compiler options. In C 99, they are called Variable Length Arrays (VLA) and are stored on the stack. They are not available in the C++ 2011 standard although some compilers support them.

Here is a Fortran function that, given an integer \(n\), loads \(n\) floating points from an input and returns the median of those numbers.

```
function median(n)
integer, intent(in) :: n
real :: median
real, dimension(1:n) :: x
do k = 1, n
x(k) = read_number_from_input()
end do
median = median_of_array(x)
end function median
```

Below is the C version with Variable Length Arrays.

```
int median(int n) {
double x[n];
for(int k = 0; k < n; ++k) {
x[k] = read_number_from_input();
}
return median_of_array(x, n);
}
```

#### Dynamic arrays

An array is said to be dynamic if the lifetime of its elements is managed by the programmer (or the standard library in case of C++) and its size is known at run time. Their elements are stored on the heap. Therefore, they are slower to allocate and accessing their elements for the first time will most likely trigger a cache miss.

Here is the previous function implemented in Fortran with dynamic arrays

```
function median(n)
integer, intent(in) :: n
real :: median
real, allocatable, dimension(:) :: x
allocate(x(1:n))
do k = 1, n
x(k) = read_number_from_input()
end do
median = median_of_array(x)
end function median
```

its C version

```
int median(int n) {
double *x = (double *) malloc(n * sizeof(double));
for(int k = 0; k < n; ++k) {
x[k] = read_number_from_input();
}
double median = median_of_array(x, n);
free(x);
return median;
}
```

and its C++ version

```
int median(int n) {
std::vector<double> x( n );
for(size_t k = 0; k < x.size(); ++k) {
x[k] = read_number_from_input();
}
return median_of_array(x, n);
}
```

#### Performance comparison on Euler’s method

In order to compare the performance of static and dynamic arrays, we decided to solve the ordinary differential equation

$$\forall t\in\mathbb{R} \quad y_{0}'(t) = y_1(t), \quad y_{1}'(t) = y_2(t),\quad \ldots, \quad y_{n-2}'(t) = y_{n-1}(t), \quad y_{n-1}'(t) = y_0(t) $$

with the Cauchy condition

$$y_0(0) = 0, \quad y_1(0) = 1,\quad\ldots,\quad y_{n-1}(0)=n-1\, .$$

The goal of the algorithm is to get an estimation of \(y_0(1) + y_1(1) + \cdots + y_{n-1}(1)\). We’ll use the Euler method with a time step

\(\Delta t = 1 / n\) so the number of elementary operations does not depend upon \(n\).

The implementation has been done in C++11 using two different methods. The first one allocates a static array \(\verb+dy_dt+\) and deallocates it for every time step. The second one does the same with a dynamic array. Obviously, the right way to program this algorithm would be to allocate \(\verb+dy_dt+\) once for all and never destroy it, but this program is just a proxy for larger programs where allocation and deallocation at every time step is needed.

```
#include <array>
#include <vector>
#include <chrono>
#include <iostream>
int main () {
const double t{ 1.0 };
const int n{ 64 };
const int nb_operations{ 1000000000 };
const int nb_steps{ nb_operations / n };
const double dt{ t / nb_steps };
{
std::array<double, n> y;
for(size_t k = 0; k < n; ++k) {
y[k] = static_cast<double>(k);
}
auto start_time = std::chrono::high_resolution_clock::now();
for(int i = 0; i < nb_steps; ++i) {
std::array<double, n> dy_dt;
for(size_t k = 0; k < y.size(); ++k) {
dy_dt[k] = y[k + 1 < y.size() ? k + 1 : 0];
}
for(size_t k = 0; k < y.size(); ++k) {
y[k] += dt * dy_dt[k];
}
}
auto end_time = std::chrono::high_resolution_clock::now();
auto time = std::chrono::duration_cast<std::chrono::microseconds>(
end_time - start_time).count();
double sum{ 0.0 };
for(size_t k = 0; k < y.size(); ++k) {
sum += y[k];
}
std::cout << "Sum: " << sum << std::endl;
std::cout << "Time static arrays: " <<
static_cast<double>(time) / 1000000 << std::endl;
}
{
std::vector<double> y( n );
for(size_t k = 0; k < n; ++k) {
y[k] = static_cast<double>(k);
}
auto start_time = std::chrono::high_resolution_clock::now();
for(int i = 0; i < nb_steps; ++i) {
std::vector<double> dy_dt( n );
for(size_t k = 0; k < y.size(); ++k) {
dy_dt[k] = y[k + 1 < y.size() ? k + 1 : 0];
}
for(size_t k = 0; k < y.size(); ++k) {
y[k] += dt * dy_dt[k];
}
}
auto end_time = std::chrono::high_resolution_clock::now();
auto time = std::chrono::duration_cast<std::chrono::microseconds>(
end_time - start_time).count();
double sum{ 0.0 };
for(size_t k = 0; k < y.size(); ++k) {
sum += y[k];
}
std::cout << "Sum: " << sum << std::endl;
std::cout << "Time dynamic arrays: " <<
static_cast<double>(time) / 1000000 << std::endl;
}
return 0;
}
```

This program has been run for arrays of size in between \(n = 2\), and \(n = 8192\). It has been compiled with the Intel compiler on Linux with the following options

```
icpc -std=c++11 -Ofast -ansi-alias -xHost main.cpp -o main
```

Here are the timings.

Many interesting things could be said about this graph. But the most important thing is that the implementation using static arrays is up to 16 times faster (for \(n = 6\)) than the one using dynamic arrays. Even if the gap in between the two curves shrinks as \(n\) grows, for \(n = 8192\) the static version is still 25% faster than the dynamic one.

Please be aware that micro-benchmarking, as done here, might be misleading. In real applications, the allocation time for a dynamic array will depend upon many factors such as memory fragmentation. In this specific test, they are deallocated before the next one of the same size is allocated. It does not put much pressure on memory allocation, and things can get worse in real applications.

Let’s now have a look at the closest Fortran implementation I can get.

```
program main
implicit none
integer, parameter :: dp = 8
real(dp) :: time_static = 0.0_dp
real(dp) :: time_dynamic = 0.0_dp
real(dp), parameter :: t = 1.0_dp
integer, parameter :: n = 64
integer, parameter :: nb_operations = 10**9
integer, parameter :: nb_steps = nb_operations / n
real(dp), parameter :: dt = t / nb_steps
integer :: time_begin
integer :: time_end
integer :: count_rate
real(dp), dimension(1:n) :: y
real(dp), dimension(1:n) :: dy_dt
real(dp), allocatable, dimension(:) :: y_dynamic
real(dp), allocatable, dimension(:) :: dy_dt_dynamic
real(dp) :: tmp
real(dp) :: sum
integer :: i, k, ios
do k = 1, n
y(k) = real(k - 1, dp)
end do
call system_clock(time_begin, count_rate)
do i = 1, nb_steps
do k = 1, n
dy_dt(k) = y(merge(k + 1, 0, k < n))
end do
do k = 1, n
y(k) = y(k) + dt * dy_dt(k)
end do
end do
call system_clock(time_end, count_rate)
sum = 0.0_dp
do k = 1, n
sum = sum + y(k)
end do
time_static = real(time_end - time_begin, dp) / real(count_rate, dp)
write (*,*) 'Sum: ', sum
write (*,*) 'Time static arrays: ', time_static
allocate(y_dynamic(1:n))
do k = 1, n
y_dynamic(k) = real(k - 1, dp)
end do
call system_clock(time_begin, count_rate)
do i = 1, nb_steps
allocate(dy_dt_dynamic(1:n))
do k = 1, n
dy_dt_dynamic(k) = y_dynamic(merge(k + 1, 0, k < n))
end do
do k = 1, n
y_dynamic(k) = y_dynamic(k) + dt * dy_dt_dynamic(k)
end do
deallocate(dy_dt_dynamic)
end do
call system_clock(time_end, count_rate)
sum = 0.0_dp
do k = 1, n
sum = sum + y_dynamic(k)
end do
time_dynamic = real(time_end - time_begin, dp) / real(count_rate, dp)
write (*,*) 'Sum: ', sum
write (*,*) 'Time dynamic arrays: ', time_dynamic
end program main
```

To be fair this version is not completely equivalent to the C++ one for the static part. In the Fortran version, \(\verb+dy_dt+\) is created once for all and not allocated at every time step. The reason for this is that Fortran does not provide scopes inside a given function. But it should not make any difference as allocating and deallocating memory on the stack only needs to change the value of a pointer and is therefore instantaneous. The code has been compiled with the Intel compiler on the same system and the following options

```
ifort -Ofast -xHost main.f90 -o main
```

Here are the timings

We observe the same behaviour as the C++ version.

As most of you have noticed, out of the box, the Fortran version outperforms the C++ version with dynamic arrays: it is up to three times faster. Here are both timings on the same graph.

We observe the same behaviour with the static version. Using gcc/gfortran also gives the same results.

In a nutshell, to get good performance in your application, wether your are using Fortran, C or C++, one can apply the following rule of thumb:

- Use
**static arrays**for**small arrays**whose size is known at compile time. Their allocation and deallocation only needs to change the value of a pointer and are instantaneous. They should not be used for large arrays as they can cause stack overflow which would terminate the program prematurely. - Use
**automatic arrays**in Fortran or C for**small arrays**whose size is only known at run time. They’ll be as fast as static arrays. - Use
**dynamic arrays**for**large arrays**. Their allocation an allocation are not instantaneous and might need a fair amount of time when memory is fragmented. But with large arrays, where the amount of work on the array usually becomes sizeable, allocation and deallocation time become small compared to the main work.

As you can see, there is no easy solution for small C++ arrays whose size is not know at compile time. Using a memory pool is solution to that problem but that’s enough for today.

Next time, we’ll focus on a major feature while debugging: **bound checking**. See you in two weeks.