OpenMP: Compiler Directive Standard
The material here is old, and hence some of it may be
outdated. Use the Google, Luke. Or at least the links below.
This material summarizes some of the implementation details for using OpenMP.
However, for B673 the most important take-home messages are:
- all variables are shared unless otherwise specified
- use guided self-scheduling whenever a scheduling choice is available
The OpenMP standard is a compiler-directive-driven parallel
programming system. It uses a fork-join model, and so relies on a
(logically) shared memory system and a global data model.
OpenMP is a consortium in which several vendors (SGI, IBM, DEC, etc.)
and the major compiler technology firms (including the Portland Group)
take part. The Fortran version of the standard
has been available for some time, and C implementations
are now appearing. Because of the complex ways that pointers and
loops can be used in C, the C version has taken longer to implement - unlike
Fortran, where pointer arithmetic is simply not supported.
OpenMP replaces the ANSI X3H5
effort, which became outdated and has not evolved much
in recent years - particularly to handle CC-NUMA architectures,
ones where the memory system is physically partitioned but presented
to the user as a logically single address space.
The home page for OpenMP is at
http://www.openmp.org, and
you should consult it for the C/C++ (or Fortran) version of the following notes.
The language
interface specifications are what
you should learn, and they have nifty vade mecums for C, C++, and Fortran
for quick reference. Also download the "examples" document, since it demonstrates
the usage of many of the specs.
OpenMP is a standard intended for Windows, Mac, and Unix
systems, and the participation of all major supercomputer and
HPC compiler vendors is likely to make it a standard in effect as well as in name.
The notes which follow address mainly the Fortran version because it is the
most commonly used language in scientific programming. Also, anyone under the
age of 40 is going to write new programs in C/C++, so you'll learn that
variant anyway.
Sentinels
Like all compiler directive systems, OpenMP uses a "sentinel" which
looks like a comment line to indicate a directive; in Fortran it is
!$OMP [directive]
while in C it takes the form
#pragma omp directive [clauses]
{ ... }
where the braces delimit the code block to which the directive applies.
Because Fortran77 does not have code block delimiters, in Fortran
you end a section with
!$OMP END [directive]
OpenMP also allows conditional compilation, so calls to OpenMP
runtime functions can be made invisible to compilers without OpenMP support.
The sentinel for that in Fortran is !$. For example,
!$ myrank = OMP_GET_THREAD_NUM()
will let you find your thread number - without having to write a
stub function for it on non-OpenMP systems.
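As a minimal sketch (the program and variable names here are invented for illustration), the same source file builds and runs both with and without OpenMP, because the !$ lines are plain comments to a non-OpenMP compiler:

program cond_demo
!$ use omp_lib                      ! only compiled when OpenMP is enabled
   implicit none
   integer :: myrank
   myrank = 0                       ! serial default on a non-OpenMP compiler
!$ myrank = omp_get_thread_num()    ! overrides the default under OpenMP
   print *, 'my rank is ', myrank
end program cond_demo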
Parallel Regions
All the action in OpenMP occurs in "parallel regions", which are
started/ended by
!$OMP PARALLEL [clause]
...
!$OMP END PARALLEL
although C/C++ uses braces to specify the end of a parallel region.
Until a parallel region is encountered, only one thread is running:
the master thread. On encountering the directive, a team of threads
is created. Then both the master and team members share the work in
the parallel region. On encountering the end of the parallel region,
an implicit barrier causes the threads to join, and only the
master thread continues after that point.
Nested parallel regions are also allowed, although by default on
encountering an inner parallel region a team of only one thread is
created. However, there are mechanisms (OMP_SET_NESTED()) for allowing
more threads to be created.
In that case, "master thread" refers to the one that invoked the (nested)
parallel region.
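A minimal sketch (program and variable names invented for illustration) of a single parallel region: each team member prints its rank, and after the implicit barrier only the master thread continues. The PRIVATE clause used here is explained under Data Scope below.

program region_demo
!$ use omp_lib
   implicit none
   integer :: myrank
   myrank = 0                       ! value used if compiled without OpenMP
!$OMP PARALLEL PRIVATE(myrank)
!$ myrank = omp_get_thread_num()    ! each team member gets its own copy
   print *, 'hello from thread ', myrank
!$OMP END PARALLEL
   print *, 'past the join; only the master runs this'
end program region_demo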
Worksharing Constructs
There are two basic ways to actually get parallel work done: parallel loops
and parallel sections. The first is declared via
!$OMP DO
...
!$OMP END DO
and it applies to the
next loop only.
The END DO directive is actually not required, but is good form especially
if you use the deprecated form of do-loops that don't end with "end do".
[And if you do, stop that. You're doing it wrong.]
The second has the form
!$OMP SECTIONS
block 1
!$OMP SECTION
block 2
...
!$OMP SECTION
block n
!$OMP END SECTIONS
and a thread is assigned in parallel to each section. There are two major
restrictions on the worksharing constructs:
- Like a BARRIER in MPI, all or none of the threads in a team must encounter
the construct.
- You cannot branch into or out of the construct.
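A minimal sketch of the SECTIONS form (array names and sizes are arbitrary): the two blocks are independent, so two different threads can initialize the two arrays at the same time.

program sections_demo
   implicit none
   real :: a(1000), b(1000)
!$OMP PARALLEL
!$OMP SECTIONS
   a = 1.0                          ! block 1: done by one thread
!$OMP SECTION
   b = 2.0                          ! block 2: done (possibly) by another thread
!$OMP END SECTIONS
!$OMP END PARALLEL
   print *, a(1) + b(1)
end program sections_demo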
Because it is often the case that you only want to parallelize a single
loop or section, there are also combined versions that declare both the parallel
region and the worksharing construct, e.g.,
!$OMP PARALLEL DO
...
!$OMP END PARALLEL DO
However, keep in mind that a team of threads is created on encountering a
parallel region, then synchronized and deleted on exiting it. It is
typically more efficient to keep the threads around if only a few operations
separate parallel regions.
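A rough sketch of both forms (names and sizes are arbitrary): the first loop uses the combined PARALLEL DO, while the second and third loops share a single parallel region with two DO constructs, so the team is created and joined only once.

program loop_demo
   implicit none
   integer, parameter :: n = 100000
   integer :: i
   real :: x(n), y(n)

!$OMP PARALLEL DO                   ! combined region + worksharing loop
   do i = 1, n
      x(i) = real(i)
   end do
!$OMP END PARALLEL DO

!$OMP PARALLEL                      ! one team reused for two loops
!$OMP DO
   do i = 1, n
      y(i) = 2.0 * x(i)
   end do
!$OMP END DO
!$OMP DO
   do i = 1, n
      y(i) = y(i) + x(i)
   end do
!$OMP END DO
!$OMP END PARALLEL

   print *, y(n)
end program loop_demo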
Synchronization
Because creating and destroying teams of threads may be
relatively costly, there are mechanisms to allow you to execute
some single-threaded sections of code without destroying the
team.
- !$OMP MASTER/END MASTER causes the code block inside to be
executed only by the master thread
- !$OMP SINGLE/END SINGLE causes the code block inside to be
executed by only one thread (but not necessarily the master thread)
- !$OMP CRITICAL/END CRITICAL causes the code block inside to be
executed by only one thread at a time - but typically, each
thread encountering the code block will eventually execute it. This is useful
for a queue model of work, where threads dequeue jobs from
the queue. To avoid race conditions, only one thread is allowed
to execute the dequeue at a time (see the sketch below).
- !$OMP ATOMIC causes the next assignment operation to be performed
atomically; that is, it will be completed before any other
operation is done. The ur-example of an atomic is n++; it requires
reading the value of n, adding 1 to it, and then writing the
incremented value back. The atomic directive guarantees that all
three of those operations are completed without interrupt and
without another thread trying to read the value of n while it is
being incremented or before it is written back to memory.
Other useful constructs for synchronization are ORDERED, which causes the
code inside to be executed only in sequential order (but not necessarily
by a single thread) and BARRIER which is an old friend from MPI. A barrier
is implicit at the end of most parallel work-sharing constructs.
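A rough sketch of the queue model (the job count and variable names are invented): CRITICAL protects the dequeue, and ATOMIC protects a simple shared counter update.

program sync_demo
   implicit none
   integer, parameter :: njobs = 100
   integer :: next_job, n_done, job
   next_job = 1
   n_done   = 0
!$OMP PARALLEL PRIVATE(job)
   do
!$OMP CRITICAL
      job = next_job                ! only one thread at a time dequeues
      next_job = next_job + 1
!$OMP END CRITICAL
      if (job > njobs) exit
      ! ... the work for this job would go here ...
!$OMP ATOMIC
      n_done = n_done + 1           ! read-add-write done indivisibly
   end do
!$OMP END PARALLEL
   print *, 'jobs completed: ', n_done
end program sync_demo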
Loop Scheduling
Suppose a parallel loop is encountered, with several iterations.
Should the system assign chunks of iterations to each thread, or
assign iterations cyclically to each thread? The first will entail
smaller overhead associated with dispatching and synchronizing the
work. The second will have a better chance at load balancing, since
the threads may enter the loop at different times, or may take
widely differing amounts of time to execute even identical code because
of interrupts, etc.
One of the "clauses" you can insert into a worksharing construct is
a schedule. The allowed ones are:
- SCHEDULE(STATIC,chunk): iterations are divided into pieces of size chunk
and statically assigned to threads in round-robin fashion.
- SCHEDULE(DYNAMIC,chunk): the same as STATIC, but now the pieces
are handed out dynamically at run time; when a thread finishes its
piece it goes back to the pool to see if there are others left to run.
This means one thread may perform much more work than the others,
if it gets its tasks done more quickly.
- SCHEDULE(GUIDED): the slickest of the methods. Chunk sizes
start out large and shrink as the loop proceeds, and the chunks are
handed out dynamically as in DYNAMIC. This way, initially each thread gets a large
amount of work, reducing the dispatch overhead. Towards the end,
pieces are smaller, allowing load balancing.
- SCHEDULE(RUNTIME): the schedule is determined at run time via the
OMP_SCHEDULE environment variable.
In order of preference, use GUIDED, DYNAMIC, and then STATIC.
Guided self-scheduling (GSS) is one of the most innovative ideas to emerge
from the parallelizing compiler community in the late 1980's, and has
now worked its way into optimizing compilers. DYNAMIC handles the problem
of when one thread gets interrupted or takes much longer to finish its
task - which is more common than you think.
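A rough sketch of attaching a schedule clause (the triangular inner loop is just a stand-in for work whose cost varies by iteration, which is where GUIDED or DYNAMIC pays off; the PRIVATE clause is covered in the next section):

program sched_demo
   implicit none
   integer, parameter :: n = 20000
   integer :: i, j
   real :: y(n)
!$OMP PARALLEL DO SCHEDULE(GUIDED) PRIVATE(j)
   do i = 1, n
      y(i) = 0.0
      do j = 1, i                   ! cost grows with i: unbalanced iterations
         y(i) = y(i) + 1.0/real(j)
      end do
   end do
!$OMP END PARALLEL DO
   print *, y(n)
end program sched_demo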
Data Scope
By default, all data is shared in your OpenMP-ized code.
However, when declaring a parallel region
other scopes can be specified for the variables. This is handy
for situations like the double-nested loop in HPF, where
the inner loop index had to be declared before
the outer loop could be parallelized. OpenMP environments include
- PRIVATE(list): the listed variables will have a private copy
made for each thread. This is like the HPF NEW() scope
directive.
- SHARED(list): the converse; the listed variables have a single
global version which is shared by all the threads. Note that accesses
are not implicitly protected; if more than one thread updates a shared
variable, you must synchronize the updates yourself (e.g., with CRITICAL
or ATOMIC).
- DEFAULT: this allows specifying shared or private as the default for all
variables in a parallel region; this can be overridden for
particular variables using PRIVATE or SHARED with lists.
- FIRSTPRIVATE(list): the same as private, but each thread's copy
of the variable is initialized to the value it had in the
serial part of the code preceding the parallel region
- LASTPRIVATE(list): the same as private, but after the end of
the construct the single remaining copy of each listed variable holds
the value assigned by the thread that executed what would have been
the last iteration (or last section) in a serial execution.
- REDUCTION(op:list): used for reduction operations like a dot product.
The specified variables appear in operations like
x = x op expression
where x is a single scalar, the operator is typically addition or multiplication,
and the carry-around dependency would normally inhibit parallelism
(see the sketch after this list).
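A rough sketch of a dot product using the REDUCTION clause (the array values are arbitrary): each thread accumulates a private partial sum, and the partial sums are combined into s at the join.

program dot_demo
   implicit none
   integer, parameter :: n = 100000
   integer :: i
   double precision :: x(n), y(n), s
   x = 1.0d0
   y = 2.0d0
   s = 0.0d0
!$OMP PARALLEL DO REDUCTION(+:s)
   do i = 1, n
      s = s + x(i)*y(i)             ! looks serial, but each thread has its own s
   end do
!$OMP END PARALLEL DO
   print *, 's = ', s               ! should equal 2*n
end program dot_demo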