OpenMP in the UM

What is OpenMP?

OpenMP is a directive-driven (i.e. code comment) set of instructions that tells the compiler to generate parallel threads (light-weight computational tasks) to split up a block of work. It only works within shared memory (i.e. within an IBM node) but is fully portable.

The practical result of OpenMP is that we have additional parallelism that we can utilise,

  * by using more processors, OR
  * by using SMT on IBM platforms. SMT gives each physical computational core an additional "virtual" core. Using this with OpenMP often allows a code to run significantly faster without using any additional computational resource.

The UM has traditionally been parallelised using a domain decomposition approach with explicit message passing. This uses processes (full computational tasks with high overheads) that work on their own chunk of data for long periods with occasional communications. OpenMP is typically finer grained, in that it is normally used to split up work at the level of a single loop!

OpenMP can be added to a code in a piecemeal fashion, and this is how we are approaching things in the UM. In the first pass we have inserted OpenMP for the most costly parts of the code (most of the dynamics and the most expensive physics), but we are gradually expanding its scope. As with all parallelism, a high level of parallel coverage is important for an efficient implementation.

Where Can I Use OpenMP?

Although it can be used in other ways, the main use of OpenMP is to give parallelism across independent iterations of a loop. This is done by giving different iterations of the loop to different threads to do the computation. For example, in a loop {{{DO i = 1, n}}} run with 2 threads, iterations {{{i = 1, n/2}}} might be done by thread 0 while iterations {{{i = n/2+1, n}}} are done at the same time by thread 1.
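
As a minimal sketch (not UM code), the program below prints which thread runs each iteration; it assumes the standard {{{omp_lib}}} module and a static schedule, which gives each thread one contiguous chunk of iterations, {{{

PROGRAM thread_split
  USE omp_lib
  IMPLICIT NONE
  INTEGER :: i, n

  n = 8
! Each thread reports the iterations it has been given.
!$OMP PARALLEL DO PRIVATE(i) SHARED(n) DEFAULT(NONE) SCHEDULE(STATIC)
  DO i = 1, n
    WRITE(*,*) 'iteration', i, 'run by thread', omp_get_thread_num()
  END DO
!$OMP END PARALLEL DO

END PROGRAM thread_split

}}}

Run with 2 threads this should report iterations 1 to 4 done by thread 0 and 5 to 8 by thread 1. The code must be compiled with the compiler's OpenMP flag (e.g. {{{-qsmp=omp}}} for IBM XL Fortran or {{{-fopenmp}}} for gfortran).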

In most cases it is fair to say that if a loop is vectorisable, it is also suitable for OpenMP. For example, the following code fragment is suitable for using OpenMP (because no one iteration of a loop depends on another), {{{

     DO i = 1, n
       a(i) = c * b(i)
       a(i) = sin(a(i))
     END DO

}}}

The loop below however is not suitable for OpenMP as there are dependencies between the iterations, {{{

     DO i = 2, n-1
       a(i) = 0.5 * (a(i-1) + a(i+1))
     END DO

}}}

If indirect addressing is used (e.g. {{{ a(ptr(j)) }}}) one needs to be particularly careful that the indirection gives a one-to-one mapping, so that no two threads ever update the same element. The most common case of indirect addressing in the UM is compressing an active sub-set of a larger array (such as land points in the surface scheme), and this is normally safe for use with OpenMP.
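
As an illustration, here is a hedged sketch of that kind of pattern (the names {{{land_index}}}, {{{t_land}}} and {{{t_full}}} are made up for this example, not actual UM variables). The loop is safe only because each value of {{{land_index}}} appears once, so no two threads write to the same element, {{{

SUBROUTINE scatter_land(t_land, t_full, land_index, land_pts, full_pts)
  IMPLICIT NONE
  INTEGER, INTENT(IN)    :: land_pts, full_pts
  INTEGER, INTENT(IN)    :: land_index(land_pts)  ! one-to-one: each index appears once
  REAL,    INTENT(IN)    :: t_land(land_pts)
  REAL,    INTENT(INOUT) :: t_full(full_pts)

  INTEGER :: l

! Each iteration writes to a distinct element of t_full, so iterations
! are independent and can safely be shared between threads.
!$OMP PARALLEL DO PRIVATE(l) SHARED(land_pts, land_index, t_land, t_full) DEFAULT(NONE)
  DO l = 1, land_pts
    t_full(land_index(l)) = t_land(l)
  END DO
!$OMP END PARALLEL DO

END SUBROUTINE scatter_land

}}}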

Collecting a running maximum or minimum over loop iterations is another common case of a non-independent loop. Care is needed here: either use OpenMP on a different loop (e.g. an inner nested loop) or get advice about other techniques that may make OpenMP possible in this situation.
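
One such technique, worth discussing with the experts before use, is OpenMP's {{{REDUCTION}}} clause. A minimal sketch with illustrative names (not UM code), {{{

SUBROUTINE field_max(a, n, a_max)
  IMPLICIT NONE
  INTEGER, INTENT(IN)  :: n
  REAL,    INTENT(IN)  :: a(n)
  REAL,    INTENT(OUT) :: a_max

  INTEGER :: i

  a_max = -HUGE(a_max)
! Each thread accumulates its own private maximum; OpenMP combines the
! private copies into a_max when the loop finishes.
!$OMP PARALLEL DO PRIVATE(i) SHARED(n, a) REDUCTION(MAX: a_max) DEFAULT(NONE)
  DO i = 1, n
    a_max = MAX(a_max, a(i))
  END DO
!$OMP END PARALLEL DO

END SUBROUTINE field_max

}}}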

It is common to re-arrange loops a little when implementing OpenMP. This might be to ensure the loop iterations are independent (see above!) or to maximise the work done by a set of parallel threads. The latter matters because there is an overhead in starting or stopping threads, so this should happen as rarely as possible. Maximising the work done by the threads means that the splitting of work into parallel portions should happen at as high a level in the code as possible. This is done in (for example) radiation, where the parallel loop is the main loop over segments at the control level.
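
As a sketch of what this means in practice (illustrative names, not UM code), the parallel loop below is placed on the outer loop over levels rather than the inner loop over points, so the parallel region is entered once for the whole nest rather than once per level, {{{

SUBROUTINE levels_loop(a, b, npts, nlevs)
  IMPLICIT NONE
  INTEGER, INTENT(IN)  :: npts, nlevs
  REAL,    INTENT(IN)  :: a(npts, nlevs)
  REAL,    INTENT(OUT) :: b(npts, nlevs)

  INTEGER :: i, k

! Parallelising the outer loop gives each thread a whole level's worth of
! work per iteration and keeps the thread start-up overhead to a minimum.
!$OMP PARALLEL DO PRIVATE(i, k) SHARED(npts, nlevs, a, b) DEFAULT(NONE)
  DO k = 1, nlevs
    DO i = 1, npts
      b(i, k) = SQRT(ABS(a(i, k)))
    END DO
  END DO
!$OMP END PARALLEL DO

END SUBROUTINE levels_loop

}}}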

Common OpenMP Directives

All OpenMP directives are comments that start with {{{!$OMP}}} - continuation lines start with {{{!$OMP&}}} (with the previous line ending in {{{&}}} as normal for Fortran 90).

  * {{{!$OMP PARALLEL}}} - starts up a parallel region i.e. the threads are started
  * {{{!$OMP END PARALLEL}}} - finishes the parallel region i.e. the threads are stopped
  * {{{!$OMP DO}}} - tells the threads to split the iterations of the following loop among themselves
  * {{{!$OMP END DO}}} - is the end of the loop. This may often have {{{nowait}}} added to the end, which means a thread does not wait for the other threads to finish the loop before moving on to other work.
  * {{{!$OMP SINGLE}}} and {{{!$OMP END SINGLE}}} - delineates a block of code to be executed by one thread only. No thread will continue past the end until all have reached it.
  * {{{!$OMP MASTER}}} and {{{!$OMP END MASTER}}} - delineates a block of code to be executed by the master thread only. Other threads won't wait and will carry on to the next bit of available work.

Note that if a parallel region only has one loop in it, the {{{!$OMP PARALLEL}}} and {{{!$OMP DO}}} can be combined into a {{{!$OMP PARALLEL DO}}}.

An {{{!$OMP PARALLEL}}} will require some additional clauses, primarily to describe how the data is managed and whether each variable is {{{PRIVATE}}} to a single thread (which means every thread gets its own temporary copy) or {{{SHARED}}} between threads. Getting the correct status for variables is key to ensuring a correct program. It is possible to declare a {{{DEFAULT}}} type for variables. Normally {{{DEFAULT(NONE)}}} is recommended, but if a very large variable list would be required then {{{DEFAULT(SHARED)}}} or {{{DEFAULT(PRIVATE)}}} may be more appropriate.
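
Putting these pieces together, a combined directive with its clauses continued onto a second line might look like the following sketch (illustrative names, not UM code), {{{

SUBROUTINE scale_field(a, b, c, n)
  IMPLICIT NONE
  INTEGER, INTENT(IN)  :: n
  REAL,    INTENT(IN)  :: b(n), c
  REAL,    INTENT(OUT) :: a(n)

  INTEGER :: i

! A combined PARALLEL DO with its clauses continued: the first directive
! line ends in & and the continuation line starts with !$OMP& .
!$OMP PARALLEL DO DEFAULT(NONE)                                                &
!$OMP& PRIVATE(i) SHARED(n, a, b, c)
  DO i = 1, n
    a(i) = c * b(i)
  END DO
!$OMP END PARALLEL DO

END SUBROUTINE scale_field

}}}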

The {{{!$OMP SINGLE}}} and {{{!$OMP MASTER}}} constructs are used to run a section of code within a parallel region on just one thread, avoiding all the overhead of shutting down the threads before the relevant bit of code and then starting them all back up again afterwards.
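
As a sketch of this (illustrative names, not UM code), the fragment below keeps the threads alive across two loops while just one thread writes a diagnostic in between, {{{

SUBROUTINE single_in_region(a, b, n)
  IMPLICIT NONE
  INTEGER, INTENT(IN)    :: n
  REAL,    INTENT(INOUT) :: a(n), b(n)

  INTEGER :: i

!$OMP PARALLEL DEFAULT(NONE) PRIVATE(i) SHARED(n, a, b)
!$OMP DO
  DO i = 1, n
    a(i) = a(i) + 1.0
  END DO
!$OMP END DO

! Only one thread writes the diagnostic; the others wait at END SINGLE
! rather than the whole parallel region being shut down and restarted.
!$OMP SINGLE
  WRITE(*,*) 'first loop done, a(1) =', a(1)
!$OMP END SINGLE

!$OMP DO
  DO i = 1, n
    b(i) = 2.0 * a(i)
  END DO
!$OMP END DO
!$OMP END PARALLEL

END SUBROUTINE single_in_region

}}}

With {{{MASTER}}} in place of {{{SINGLE}}}, the diagnostic would always be written by thread 0 and the other threads would not wait at the end of the block.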

There may also be {{{SCHEDULE}}} clauses; these determine how the work is split between the threads. They only affect the efficiency of the final parallel code, not its correctness, so they tend to be the domain of experts.

Shared or Private Variables?

Determining whether a variable should be shared or private is the trickiest part of coding with OpenMP and the major cause of problems. There are, however, a few recipes that will help with the most usual cases.

Loop control variables

The loop index will be {{{PRIVATE}}} and the loop extent will be {{{SHARED}}}. For example, the following is correct, {{{

!$OMP PARALLEL DO PRIVATE(i) SHARED(n) DEFAULT(NONE)
     DO i = 1, n
...
     END DO
!$OMP END PARALLEL DO

}}}

This example is a simple loop that will be parallelised with OpenMP.

Arrays

Arrays that either have pre-existing values required in the loop, or whose values are calculated in the loop with the parallel loop's control variable as one of their indices, should almost always be {{{SHARED}}}. For example, {{{

     a(:) = 3

!$OMP PARALLEL DO PRIVATE(i) SHARED(n, a, b) DEFAULT(NONE)
     DO i = 1, n
       b(i) = a(i) + i
     END DO
!$OMP END PARALLEL DO

}}}

In this example the {{{i}}} loop is again parallelised. {{{a}}} has pre-existing values that are required in the loop, so it should be {{{SHARED}}}. {{{b}}} is set in the loop and its index is {{{i}}}, the control variable for the parallel loop, so it is also {{{SHARED}}}.

Local temporary variables

A variable (scalar or array) that performs a temporary role within a loop but has not been initialised beforehand and whose value is not required afterwards should be {{{PRIVATE}}}. The following is an example, {{{

!$OMP PARALLEL DO PRIVATE(i, tmp) SHARED(n, a, b) DEFAULT(NONE)
     DO i = 1, n
       tmp = sqrt(a(i))
       b(i) = asin(tmp)
     END DO
!$OMP END PARALLEL DO

}}}

A more complex example is the following, {{{

!$OMP PARALLEL DO PRIVATE(i, j, tmp) SHARED(n, m, a, b, c) DEFAULT(NONE)
     DO i = 1, n
       DO j = 1, m
         tmp(j) = sqrt(a(i,j))
         b(i) = b(i) + asin(tmp(j))
       END DO
       c(i) = b(i) / REAL(m)
     END DO
!$OMP END PARALLEL DO

}}}

This case has a double nested loop. The {{{i}}} loop is again the parallelised one; the {{{j}}} loop is not parallelised and each thread will execute the whole {{{j}}} loop (for a different set of {{{i}}} values, of course). {{{tmp}}} is still {{{PRIVATE}}} as it is temporary to the parallelised {{{i}}} loop. {{{b}}} and {{{c}}} are {{{SHARED}}} as they have values set for the {{{i}}} index.


Other scalars

Usually a scalar that has a value before the loop which is required in the loop will be {{{SHARED}}}. If a scalar is assigned a value in the loop that is required outside of the loop, it is usually a sign that the loop iterations are not independent and you should ask for some expert help. A common case for this would be a loop that sums all the elements of an array into a scalar.
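
One technique that is often appropriate for such a sum (again, worth checking with the experts first) is a {{{REDUCTION}}} clause. A minimal sketch with illustrative names (not UM code), {{{

SUBROUTINE field_sum(a, n, total)
  IMPLICIT NONE
  INTEGER, INTENT(IN)  :: n
  REAL,    INTENT(IN)  :: a(n)
  REAL,    INTENT(OUT) :: total

  INTEGER :: i

  total = 0.0
! Each thread accumulates a private partial sum; OpenMP adds the partial
! sums into total at the end of the loop.
!$OMP PARALLEL DO PRIVATE(i) SHARED(n, a) REDUCTION(+: total) DEFAULT(NONE)
  DO i = 1, n
    total = total + a(i)
  END DO
!$OMP END PARALLEL DO

END SUBROUTINE field_sum

}}}

Note that the order in which the partial sums are combined depends on the number of threads, so this can affect bit-reproducibility of the results.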

If in doubt, first ask whether the variable needs to be read and written independently by each thread: if not, it should be {{{SHARED}}}; if so, it should be {{{PRIVATE}}}.

Testing OpenMP

The simplest way to test that OpenMP is working correctly is to run the code with 1 thread and then with 2 threads and SMT (options for SMT and thread numbers exist in the {{{Target Machine}}} and/or {{{Job submission, resources and re-submission pattern}}} UMUI panels). The results should be the same, although the SMT run should be quicker - you can put in UM TIMER calls to verify this or use output from the !DrHook facility. Both the 1 thread and 2 thread + SMT runs should also give the same results as the original code. Using extra threads without using SMT will change the total number of processors you are asking for and how computational tasks are distributed on the machine. This will probably change the results of the UM run if using a non-reproducible build.

Who Can Help?

The HPC Optimisation Team! Please come and have a chat with us if you are going to be changing code that already has OpenMP in it, or if you would like to add more OpenMP to code. We are happy to discuss what you might need to do and to help you along the way.

External Links

  * Some advice on using OpenMP with the Intel compiler [1]