OpenMP: Compiler Directive Standard
The material here is old, and hence some of it may be
outdated. Use the Google, Luke. Or at least the links below.
This material summarizes some of the implementation details for using OpenMP.
However, for B673 the most important take-home messages are:
- all variables are shared unless otherwise specified
- use guided self-scheduling whenever a scheduling choice is available
The OpenMP standard is a compiler-directive-driven parallel
programming system. It uses a fork-join model, and so relies on a
(logically) shared memory system and a global data model.
OpenMP is a consortium in which several vendors (SGI, IBM, DEC, etc.)
and the major compiler technology firms (including the Portland Group)
take part. The Fortran version of the standard
has been available for some time, and C implementations
are now appearing. Because of the complex ways that pointers and
loops can be used in C, the C version has taken longer to implement - unlike
Fortran, where pointer arithmetic is simply not supported.
OpenMP replaces the ANSI X3H5
effort, which became outdated and has not evolved much
in recent years - particularly to handle CC-NUMA architectures,
ones where the memory system is physically partitioned but presented
to the user as a logically single address space.
The home page for OpenMP is at
http://www.openmp.org, and
you should consult it for the C/C++ (or Fortran) version of the following notes.
The language
interface specifications are what
you should learn, and they have nifty vade mecums for C, C++, and Fortran
for quick reference. Also download the "examples" document, since it demonstrates
the usage of many of the specs.
OpenMP is a standard intended for Windows, Mac, and Unix
systems, and the participation of all major supercomputer and
HPC compiler vendors is likely to make it a standard in effect as well as in name.
The notes which follow address mainly the Fortran version because it is the
most commonly used language in scientific programming. Also, anyone under the
age of 40 is going to write new programs in C/C++, so you'll learn that
variant anyway.
Sentinels
Like all compiler directive systems, OpenMP uses a "sentinel" which
looks like a comment line to indicate a directive; in Fortran it is
!$OMP [directive]
while in C it takes the form
#pragma omp directive [clauses]
{ ... }
where the braces delimit the code block to which the directive applies.
Because Fortran77 does not have code block delimiters, in Fortran
you end a section with
!$OMP END [directive]
OpenMP also allows conditional compilation, so calls to OpenMP
runtime functions can be made invisible to compilers without OpenMP support.
The sentinel for that in Fortran is !$. For example,
!$ myrank = OMP_GET_THREAD_NUM()
will let you find your thread number - without having to write a
stub function for it on non-OpenMP systems.
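As a minimal sketch (the program and variable names here are invented for illustration), the same source file builds and runs both with and without OpenMP, because the !$ lines are plain comments to a non-OpenMP compiler:

program cond_demo
!$ use omp_lib                      ! only compiled when OpenMP is enabled
   implicit none
   integer :: myrank
   myrank = 0                       ! serial default on a non-OpenMP compiler
!$ myrank = omp_get_thread_num()    ! overrides the default under OpenMP
   print *, 'my rank is ', myrank
end program cond_demo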
Parallel Regions
All the action in OpenMP occurs in "parallel regions", which are
started/ended by
!$OMP PARALLEL [clause]
...
!$OMP END PARALLEL
although C/C++ uses braces to specify the end of a parallel region.
Until a parallel region is encountered, only one thread is running:
the master thread. On encountering the directive, a team of threads
is created. Then both the master and team members share the work in
the parallel region. On encountering the end of the parallel region,
an implicit barrier causes the threads to join, and only the
master thread continues after that point.
Nested parallel regions are also allowed, although by default on
encountering an inner parallel region a team of only one thread is
created. However, there are mechanisms (OMP_SET_NESTED()) for allowing
more threads to be created.
In that case, "master thread" refers to the one that invoked the (nested)
parallel region.
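A minimal sketch (program and variable names invented for illustration) of a single parallel region: each team member prints its rank, and after the implicit barrier only the master thread continues. The PRIVATE clause used here is explained under Data Scope below.

program region_demo
!$ use omp_lib
   implicit none
   integer :: myrank
   myrank = 0                       ! value used if compiled without OpenMP
!$OMP PARALLEL PRIVATE(myrank)
!$ myrank = omp_get_thread_num()    ! each team member gets its own copy
   print *, 'hello from thread ', myrank
!$OMP END PARALLEL
   print *, 'past the join; only the master runs this'
end program region_demo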
Worksharing Constructs
There are two basic ways to actually get parallel work done: parallel loops
and parallel sections. The first is declared via
!$OMP DO
...
!$OMP END DO
and it applies to the
next loop only.
The END DO directive is actually not required, but is good form especially
if you use the deprecated form of do-loops that don't end with "end do".
[And if you do, stop that. You're doing it wrong.]
The second has the form
!$OMP SECTIONS
block 1
!$OMP SECTION
block 2
...
!$OMP SECTION
block n
!$OMP END SECTIONS
and a thread is assigned in parallel to each section. There are two major
restrictions on the worksharing constructs:
- Like a BARRIER in MPI, all or none of the threads in a team must encounter
the construct.
- You cannot branch into or out of the construct.
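A minimal sketch of the SECTIONS form (array names and sizes are arbitrary): the two blocks are independent, so two different threads can initialize the two arrays at the same time.

program sections_demo
   implicit none
   real :: a(1000), b(1000)
!$OMP PARALLEL
!$OMP SECTIONS
   a = 1.0                          ! block 1: done by one thread
!$OMP SECTION
   b = 2.0                          ! block 2: done (possibly) by another thread
!$OMP END SECTIONS
!$OMP END PARALLEL
   print *, a(1) + b(1)
end program sections_demo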
Because it is often the case that you only want to parallelize a single
loop or section, there are also combined versions that declare both the parallel
region and the worksharing construct, e.g.,
!$OMP PARALLEL DO
...
!$OMP END PARALLEL DO
However, keep in mind that a team of threads is created on encountering a
parallel region, then synchronized and deleted on exiting it. It is
typically more efficient to keep the threads around if only a few operations
separate parallel regions.
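A rough sketch of both forms (names and sizes are arbitrary): the first loop uses the combined PARALLEL DO, while the second and third loops share a single parallel region with two DO constructs, so the team is created and joined only once.

program loop_demo
   implicit none
   integer, parameter :: n = 100000
   integer :: i
   real :: x(n), y(n)

!$OMP PARALLEL DO                   ! combined region + worksharing loop
   do i = 1, n
      x(i) = real(i)
   end do
!$OMP END PARALLEL DO

!$OMP PARALLEL                      ! one team reused for two loops
!$OMP DO
   do i = 1, n
      y(i) = 2.0 * x(i)
   end do
!$OMP END DO
!$OMP DO
   do i = 1, n
      y(i) = y(i) + x(i)
   end do
!$OMP END DO
!$OMP END PARALLEL

   print *, y(n)
end program loop_demo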
Synchronization
Because creating and destroying teams of threads may be
relatively costly, there are mechanisms to allow you to execute
some single-threaded sections of code without destroying the
team.
- !$OMP MASTER/END MASTER causes the code block inside to be
executed only by the master thread
- !$OMP SINGLE/END SINGLE causes the code block inside to be
executed by only one thread (but not necessarily the master thread)
- !$OMP CRITICAL/END CRITICAL causes the code block inside to be
executed by only one thread at a time - but typically, each
thread encountering the code block will eventually execute it. This is useful
for a queue model of work, where threads dequeue jobs from
the queue. To avoid race conditions, only one thread is allowed
to execute the dequeue at a time (see the sketch below).
- !$OMP ATOMIC causes the next assignment operation to be performed
atomically; that is, it will be completed before any other
operation is done. The ur-example of an atomic is n++; it requires
reading the value of n, adding 1 to it, and then writing the
incremented value back. The atomic directive guarantees that all
three of those operations are completed without interrupt and
without another thread trying to read the value of n while it is
being incremented or before it is written back to memory.
Other useful constructs for synchronization are ORDERED, which causes the
code inside to be executed only in sequential order (but not necessarily
by a single thread) and BARRIER which is an old friend from MPI. A barrier
is implicit at the end of most parallel work-sharing constructs.
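A rough sketch of the queue model (the job count and variable names are invented): CRITICAL protects the dequeue, and ATOMIC protects a simple shared counter update.

program sync_demo
   implicit none
   integer, parameter :: njobs = 100
   integer :: next_job, n_done, job
   next_job = 1
   n_done   = 0
!$OMP PARALLEL PRIVATE(job)
   do
!$OMP CRITICAL
      job = next_job                ! only one thread at a time dequeues
      next_job = next_job + 1
!$OMP END CRITICAL
      if (job > njobs) exit
      ! ... the work for this job would go here ...
!$OMP ATOMIC
      n_done = n_done + 1           ! read-add-write done indivisibly
   end do
!$OMP END PARALLEL
   print *, 'jobs completed: ', n_done
end program sync_demo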
Loop Scheduling
Suppose a parallel loop is encountered, with several iterations.
Should the system assign chunks of iterations to each thread, or
assign iterations cyclically to each thread? The first will entail
smaller overhead associated with dispatching and synchronizing the
work. The second will have a better chance at load balancing, since
the threads may enter the loop at different times, or may take
widely differing amounts of time to execute even identical code because
of interrupts, etc.
One of the "clauses" you can insert into a worksharing construct is
a schedule. The allowed ones are:
- SCHEDULE(STATIC,chunk): iterations are divided into pieces of size chunk
and statically assigned to threads in round-robin fashion.
- SCHEDULE(DYNAMIC,chunk): the same as STATIC, but now the pieces
are handed out dynamically at run time; when a thread finishes its
piece it goes back to the pool to see if there are others left to run.
This means one thread may perform much more work than the others,
if it gets its tasks done more quickly.
- SCHEDULE(GUIDED): the slickest of the methods. Chunk sizes
start out large and shrink as the loop proceeds, and the chunks are
handed out dynamically as in DYNAMIC. This way, initially each thread gets a large
amount of work, reducing the dispatch overhead. Towards the end,
pieces are smaller, allowing load balancing.
- SCHEDULE(RUNTIME): the schedule is determined at run time via the
OMP_SCHEDULE environment variable.
In order of preference, use GUIDED, DYNAMIC, and then STATIC.
Guided self-scheduling (GSS) is one of the most innovative ideas to emerge
from the parallelizing compiler community in the late 1980's, and has
now worked its way into optimizing compilers. DYNAMIC handles the problem
of when one thread gets interrupted or takes much longer to finish its
task - which is more common than you think.
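A rough sketch of attaching a schedule clause (the triangular inner loop is just a stand-in for work whose cost varies by iteration, which is where GUIDED or DYNAMIC pays off; the PRIVATE clause is covered in the next section):

program sched_demo
   implicit none
   integer, parameter :: n = 20000
   integer :: i, j
   real :: y(n)
!$OMP PARALLEL DO SCHEDULE(GUIDED) PRIVATE(j)
   do i = 1, n
      y(i) = 0.0
      do j = 1, i                   ! cost grows with i: unbalanced iterations
         y(i) = y(i) + 1.0/real(j)
      end do
   end do
!$OMP END PARALLEL DO
   print *, y(n)
end program sched_demo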
Data Scope
By default, all data is shared in your OpenMP-ized code.
However, when declaring a parallel region
other scopes can be specified for the variables. This is handy
for situations like the double-nested loop in HPF, where
the inner loop index had to be declared before
the outer loop could be parallelized. OpenMP environments include
- PRIVATE(list): the listed variables will have a private copy
made for each thread. This is like the HPF NEW() scope
directive.
- SHARED(list): the converse; the listed variables have a single
global version which is shared by all the threads. Note that accesses
are not implicitly protected; if more than one thread updates a shared
variable, you must synchronize the updates yourself (e.g., with CRITICAL
or ATOMIC).
- DEFAULT: this allows specifying shared or private as the default for all
variables in a parallel region; this can be overridden for
particular variables using PRIVATE or SHARED with lists.
- FIRSTPRIVATE(list): the same as private, but each thread's copy
of the variable is initialized to the value it had in the
serial part of the code preceding the parallel region
- LASTPRIVATE(list): the same as private, but after the end of
the construct the single remaining copy of each listed variable holds
the value assigned by the thread that executed what would have been
the last iteration (or last section) in a serial execution.
- REDUCTION(op:list): used for reduction operations like a dot product.
The specified variables appear in operations like
x = x op expression
where x is a single scalar, the operator is typically addition or multiplication,
and the carry-around dependency would normally inhibit parallelism
(see the sketch after this list).
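A rough sketch of a dot product using the REDUCTION clause (the array values are arbitrary): each thread accumulates a private partial sum, and the partial sums are combined into s at the join.

program dot_demo
   implicit none
   integer, parameter :: n = 100000
   integer :: i
   double precision :: x(n), y(n), s
   x = 1.0d0
   y = 2.0d0
   s = 0.0d0
!$OMP PARALLEL DO REDUCTION(+:s)
   do i = 1, n
      s = s + x(i)*y(i)             ! looks serial, but each thread has its own s
   end do
!$OMP END PARALLEL DO
   print *, 's = ', s               ! should equal 2*n
end program dot_demo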