HPC++Lib

The HPC++ Parallel Programming Library

Pete Beckman, Dennis Gannon, Todd Green, Elizabeth Johnson

$Id: HPC++Lib.htm,v 1.1 1998/06/23 00:18:41 ejohnson Exp $

Purpose

Portability and component reuse depend on well-designed abstractions and interfaces. HPC++Lib is a portable, parallel, object-oriented run-time system class library that defines a standard interface that is both powerful and easy to use. Its high-level interface gives the underlying implementation the freedom to optimize for particular architectures and for the changing state of the hardware. That interface includes support for allocating data structures from a shared region, copying shared data between unshared (distributed) regions of memory, basic synchronization primitives, collective functions such as broadcast and reduce, remote method invocation (RMI), and basic process and thread control for managing both task and data parallelism. HPC++Lib is also designed to accommodate tracing and profiling mechanisms that should be built into the run-time system from the ground up.

Our goal is to make HPC++Lib available on a wide variety of parallel architectures, from networks of workstations to the nation's largest supercomputers. While a group of scientists will choose to write parallel C++ programs that directly interface to the classes, abstractions, and functions in HPC++Lib, the library will also be used as the basis for other libraries, programming disciplines, and object frameworks that make programming parallel machines easier. This includes projects such as AMR++, the Parallel Standard Template Library, POOMA, P++, and PADRE.

Design Strategies

Quick Tour

Execution Model: An HPC++ program runs on a collection of nodes. A node is a physical compute resource that provides one or more contexts. A node can be a shared-memory multiprocessor (SMP), possibly connected to other SMPs via a network. Examples of a node include a single CPU desktop workstation, an SP2 "node", a dual-Pentium NT box, and a parallel supercomputer with cache-coherent distributed shared-memory support, such as the Convex SPP and the SGI Origin 2000. A context is a cache-coherent, possibly shared, virtual address space. A context may be accessible by several different threads of control. Often, a Unix process represents a context. Each node may have one or more threads within each of its contexts. Threads sharing a context may run in parallel, share memory, and be either preemptive or non-preemptive. These resources may be allocated initially at run-time, or for some architectures, dynamically during the execution of the program.

The simplest parallel program using HPC++Lib would consist of a set of threads executing within a single context. The threads would have access to a shared address space, and could run on one CPU, or run in parallel. This model of parallelism is very well suited for modest levels of parallelism, such as is available on a 32 node SGI Origin 2000. HPC++Lib also supports running programs across a large number of nodes, such as a cluster of Origin 2000s, or an SP2. For this style of programming, a "Single Program Multiple Data" (SPMD) approach is used. In SPMD mode, distributed data structures and explicit synchronization must be managed by the programmer.

Basic Components of HPC++:

  1. Global Pointers: The global pointer class provides the basic mechanism for accessing distributed objects. It is a proxy for a remote object. Remote objects can be copied or modified by dereferencing a global pointer. Global pointers may be passed between contexts. A global pointer may only point to objects allocated from a special "global data area". Most of the operations that can be performed on local pointers, including pointer arithmetic, are supported by global pointers.
  2. Threads: HPC++Lib supports a model of threads that is based on a Thread class which is, by design, similar to the Java thread system. Threads may be created and managed by the programmer. Threads exist within a context, share memory, and their implementation in HPC++Lib may be either preemptive or non-preemptive.
  3. Synchronization: A set of synchronization objects provides basic support for mutual exclusion, barrier, broadcast, reduction, etc. A Sync class provides objects with empty/full/peek functionality.
  4. Collective Operations: The HPCxx_Group class provides basic support for collective operations such as Reduce(), Broadcast(), and Barrier().
  5. Remote Invocation: In addition to copying and moving objects, global pointers can be used to invoke remote functions and member functions. If the global pointer references a remote object, a request is sent to the remote context asking that the function be executed by the owner of the object. Synchronous and asynchronous calls are available. Remote invocation provides the necessary substrate for building load-balancing libraries, remote data manipulation, and specialized data transfer and object access protocols.
  6. Tracing and Profiling: All of the classes within HPC++Lib should have native tracing and profiling capabilities. Since HPC++Lib is responsible for moving objects and spawning tasks, it is ideally suited to record those events, so they may be analyzed, visualized, or used in real-time for debugging. The TAU system from the University of Oregon used the tracing/profiling information from the prototype implementation of HPC++Lib used by pC++ to provide visualization for communication patterns, time spent in barriers, a break-point debugger, etc.

SPMD Mode

The Single Program Multiple Data (SPMD) model of execution is one of the standard models used in parallel scientific programming. Typically, parallel computers provide a mechanism for SPMD computation, where n copies of a single program begin execution at main(argc, argv). Often, in interactive mode, the run-time resources are specified with command line arguments such as "-procs 16". However, there is no standard; each vendor uses a different set of command line arguments. When using a batch mode scheduler, such as NQS, DJM, or LoadLeveler, the number of CPUs is usually specified via directives to the job scheduler. To make HPC++Lib programs portable and give them a single command line interface, a front-end script called hpc++run is provided. It is very similar to the mpirun script from MPICH.


hpc++run [hpc++ args] <user_executable> [user_args]

These are the currently defined arguments that may be passed to hpc++run:

-cf <filename> | -configfile <filename>
Specifies a configuration file. The configuration file can be used to create complex execution models. This includes the number of nodes, which machines will be used, and the number of initial contexts and threads for each individual node in the computation. The configuration file supersedes all other arguments.
-sv <server_url> | -server <server_url>
Specifies the server to contact for resource allocation. Rather than use a configuration file to allocate resources and determine where the program should run, it is often convenient to use a resource server, which can dynamically determine what resources will be made available and where.
-nn <n> | -numnodes <n>
Specifies the number of nodes to use during the execution of the program. The default number of nodes is 1.
-cx <n> | -contexts <n>
Specifies the number of contexts per node. Therefore, the total number of contexts is the number of nodes (nn) times the number of contexts per node (cx). The default number of contexts per node is 1.
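
For example, the following invocation (the executable name and the user argument are hypothetical) would start a.out on 4 nodes with 2 contexts per node, for a total of 8 contexts, and pass -myflag through to the user program:

hpc++run -nn 4 -cx 2 a.out -myflag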

At this point, it is important to understand that not all run-time values for the number of nodes, contexts, and threads may be supported by the SPMD program loader. For example, on the CM5, programs are loaded onto partitions; specifying "-nn 15" will likely cause a run-time error since it is not a power of two. However, the SPMD program loader will always execute at least one main(argc, argv) per allocated context. The interaction between run-time resource parameters, machine capabilities, and dynamic resource allocation after the program has begun is not easily simplified, and will always remain machine dependent.

The running processes are coordinated by the HPC++Lib initialization routine, which is invoked as the first call in the main program of each context. To simplify things, make the program more portable, and behave predictably on a variety of machines and SPMD loaders, it is suggested that no user code appear between the start of main(argc, argv) and the call to HPCxx_Init(). The initialization procedure strips off all arguments beginning with -hpcxx_, and fills in the object of type HPCxx_Group, which represents the initial SPMD group. Each collection of contexts and threads is represented by an HPCxx_Group object, with one representative copy in each context participating in the group. The HPCxx_Group object is similar to an MPI communicator - it provides an interface to build groups of contexts. Each context has a single representative object of type HPCxx_Context, as well as a representative object of type HPCxx_Node. Some of the important public interfaces for HPCxx_Group are shown below.

class HPCxx_Group {
public:
  // Construct a group from a HPCxx_Node object
  HPCxx_Group(const HPCxx_Node &node);
  // Construct a group from an array of HPCxx_Node objects
  HPCxx_Group(const HPCxx_Node *&nodes, int numNodes);

  HPCxx_Node *getNode();  // A pointer to my Node object
  HPCxx_Node *getNode(int nodeID); // A pointer to the node object
                                   // for any Node
  int getNumContexts(); // The total number of contexts in this group
  int getContextID();   // The unique ID for this context.
  int getGroupID();     // The unique ID for this group
  int getNumThreads();  // The number of threads in this context
                        // participating in group collective operations
  void setNumThreads(int c);
};

Here is the simplest parallel program written with HPC++Lib:

int main(int argc, char **argv) {
  HPCxx_Group SPMDgroup;
  if (HPCxx_Init(&argc, &argv, &SPMDgroup)) {
    fprintf(stderr,"Error in HPCxx_Init\n");
    exit(1);
  }
  // begin SPMD user program
  printf("Hello World from node %d, context %d\n",
    SPMDgroup.getNode()->getNodeID(), SPMDgroup.getContextID());

  // Close things down....
  HPCxx_exit();
}

Running this program could be done with:

hpc++run -numnodes 4 a.out

or, on some architectures where the program loader itself starts SPMD mode, the hpc++run driver is not strictly required and you could use:

a.out -hpcxx_numnodes 4

However, the latter form will not work on every architecture (as described above). The output from the program execution would probably look like this:

Hello World from node 3, context 3
Hello World from node 0, context 0
Hello World from node 2, context 2
Hello World from node 1, context 1

Interleaving of parallel output from the printf() function is not controlled by HPC++Lib. Most systems interleave output based on lines, but that is not guaranteed across platforms. Use of the HPCxx_Group class will be discussed in greater detail in the Collective Operations section.

Important Caveats

C++ constructors for global objects are run before main(argc, argv). A given constructor could be run n times, once for each context that the SPMD loader created, or only once, since on some machines the call to HPCxx_Init(), not the SPMD loader, creates the required contexts. This can lead to non-portable programs, not to mention a lot of confusion. Those constructors may not use any of the HPC++Lib calls or classes; until HPCxx_Init() completes, their use will generate an error. This includes thread and shared memory manipulation. Therefore, it is suggested that constructors for global objects not perform I/O, attempt to use HPC++Lib features, or initialize data that is context dependent. Additionally, data that is shared between contexts via global pointers (see below) must be dynamically heap allocated (not on the stack) after HPC++Lib initialization.
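
As a brief illustration of these caveats (the Counter class is hypothetical), the first declaration below is non-portable because its constructor runs before HPCxx_Init() and may execute a different number of times on different platforms; the second object is safe because it is heap allocated after initialization:

class Counter {
 public:
  Counter() { printf("constructed\n"); } // runs before HPCxx_Init();
                                         // may print once or n times
};

Counter badCounter;  // global object: constructor timing is non-portable

int main(int argc, char **argv) {
  HPCxx_Group g;
  HPCxx_Init(&argc, &argv, &g);
  // Safe: allocated after initialization. Use the HPCxx_Global
  // allocator (see below) if the object will be shared via global pointers.
  Counter *goodCounter = new Counter;
  ...
}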

Global Pointers

Central to distributed memory parallel computation are the abstractions for sharing and moving data. Most programmers are comfortable with a simplified near/far model, where data that is “near” is accessible using regular shared memory, and data that is “far” must be accessed via function calls that invoke put/get or send/recv semantics. However, with the power of C++, many of those old interfaces can be pushed down into the supporting run-time system, and replaced by an easier to use, yet more powerful C++ class library for manipulating global data structures.

In this portion of the run-time system, a C++ class library supplies the interface for basic allocation and access of remote-accessible data. Central to this concept is the “handle” by which the remote data is referenced. Global Pointers are proxy objects that are designed to behave like pointers to objects in other address spaces.

The notion of a global pointer is quite old. However, with the extensibility of C++, its semantics can be captured without language extensions or library calls at the user-code level. A global pointer to an object of type T is defined as a templated class of the form:

HPCxx_GlobalPtr<T> q;

Initializing a global pointer to an array of 100 doubles looks like:

HPCxx_GlobalPtr<double> q = new ("HPCxx_Global") double[100];

Notice that the array was allocated from a special HPCxx_Global area (see below for more details). Global pointers may be passed freely among different contexts. The global pointer provides a handle to objects that may be used from any other context. For objects of simple type, a global pointer can be dereferenced like any other pointer. The assignment operator performs a simple assignment if the global reference refers to a local object. Otherwise, the underlying run-time system is called to copy the right-hand side into the remote object. An analogous operation occurs when the cast operator is used. If the global pointer refers to a local object, that object is returned. If the pointer is to a remote object, the low-level data movement routines copy the value from the remote object to the local context.

For example, assignment and copy through a global pointer are shown below:

HPCxx_GlobalPtr<float> p = new ("HPCxx_Global") float;
*p = 3.14;         // A store through a GlobalPtr
float y = 2 - *p;  // A fetch via a GlobalPtr

Pointer arithmetic and the [ ] operator can be used on global pointers as you would with ordinary pointers. For large blocks of contiguous data, it would be too expensive to read or write them one at a time, so block read and write operations are provided.

void HPCxx_GlobalPtr<T>::read(T *buffer, int size);
void HPCxx_GlobalPtr<T>::write(T *buffer, int size);

This design allows the library user to employ standard C++ pointer reference/dereference to read and write remote values. Below, are some more examples of how to use global pointers.

HPCxx_GlobalPtr<double> q = new ("HPCxx_Global") double[100];
q[8] = 3.99;        // Use array notation for remote store
double x = q[9];    // Use array notation for remote read

*(q+10) = 32.6;     // Pointer arithmetic and remote store.
double z = *(q+12); // Pointer arithmetic and remote read

double b[100];      // Local declaration and allocation
q.write(b, 100);    // Write 100 values
q.read(b, 100);     // Read 100 values into local array

q = MyCoolFunction(q, 5.6); // Global pointer used as a function
                            // argument, and returned as a result

if (q.isLocal()) {    // Does the globalptr point to a local object?
  double *qLocal = q; // Use cast to extract the local pointer
}

Both the remote read and the remote write operations are blocking. The remote read waits until the data arrives from the distant context, and the remote write operation blocks until the source data has been read and may safely be modified. In the context of threads, this provides a convenient location to call yield(). A fetch through a global pointer may suspend the currently executing thread until the value arrives, at which time the scheduler can once again put the thread back into the ready queue, complete with the value requested. This can be an extremely powerful way to hide the latency of remote accesses.

User-defined Types

Unfortunately, using sizeof() for shallow copying of user-defined types only works for homogeneous systems. In general, alignment and representation differences between machines prevent an object from being copied byte-by-byte to another context. Another similarly complicated issue is properly invoking the constructor for byte-copied objects. Without compiler support or a preprocessor, the programmer must take on the responsibility for packing and unpacking objects into a form that can be sent to remote contexts.

To read and write user-defined types through a global pointer, the user must write pack and unpack friend functions for the class; these functions operate on an array of such objects. A simple example is given below.

class MyClass {
  T1 x;
  float y[100];
 public:
  friend void HPCxx_pack(HPCxx_Buffer *b, MyClass *, int count);
  friend void HPCxx_unpack(HPCxx_Buffer *b, MyClass *, int &count);
}; 
// The pack function for MyClass would look like this:
void HPCxx_pack(HPCxx_Buffer *b, MyClass *a, int count) {
  hpcxx_pack(b, count, 1);  // The first thing is a count of items
  for(int i = 0; i < count; i++) {
    hpcxx_pack(b, a[i].x, 1);    // pack the one item of type T1
    hpcxx_pack(b, a[i].y, 100);  // pack the 100 items of type float
  }
}

// The unpack function for MyClass
void HPCxx_unpack(HPCxx_Buffer *b, MyClass *a, int &count) {
  hpcxx_unpack(b, count, 1);  // fill in count with the # of items
  for(int i = 0; i < count; i++) {
    hpcxx_unpack(b, a[i].x, 1);    // extract one item of type T1
    hpcxx_unpack(b, a[i].y, 100);  // extract 100 items of type float
  }
}

These pack and unpack functions can be considered a type of remote constructor. For example, suppose an object contains a pointer to a local-storage buffer. The programmer could choose to write the pack function to create a global pointer that referenced the original storage, or to pack the buffer and then unpack the data into suitable storage on the remote context. The pack/unpack functions could also be coded to copy only those fields that are important. The pack/unpack friends will be called by the system to move the data across context boundaries.
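
Once the pack and unpack friends exist, a MyClass object can be moved through a global pointer with the same block read/write interface shown earlier. A minimal sketch (allocation and a one-element transfer only):

HPCxx_GlobalPtr<MyClass> gp = new ("HPCxx_Global") MyClass;

MyClass local;
gp.write(&local, 1); // HPCxx_pack marshals the object for transfer
gp.read(&local, 1);  // HPCxx_unpack rebuilds the object locally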

Object Allocation

Another important issue is what objects may have global pointers referencing them. Many computer architectures clearly delineate between objects in “shared space” and objects that are strictly local. This demarcation is reflected in the MPI 2 draft standard for one-sided communication. Remote Memory Access (RMA) extends MPI by allowing one process to specify all the communication parameters, both for the sending and receiving side. However, one-sided messages may only modify blocks of memory that have been allocated with MPI_MEM_ALLOC(). If the global pointer package is to be portable, the concept of shared and local memory regions must be reflected in the interface. Therefore, as mentioned earlier, special allocators (HPCxx_Global) are required for heap allocated objects that will have a global pointer referencing them. A C-style malloc() is also available, called hpcxx_sharedMalloc(). The C interface is helpful for allocating storage from C subroutines.
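
As a rough sketch (the exact signature of hpcxx_sharedMalloc() is assumed here to be malloc-like), the two allocation styles look like this:

// C++ style: placement new from the global data area
double *a = new ("HPCxx_Global") double[100];

// C style, convenient from C subroutines (assumed malloc-like signature)
double *b = (double *) hpcxx_sharedMalloc(100 * sizeof(double));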

Caveats

Remote reads or writes are not guaranteed to be atomic. In other words, should another thread of control begin to simultaneously modify the contents of an object while it is being fetched via a global pointer, the results are not predictable. Following the design principle that communication and synchronization should be explicit, users should ensure correctness with one of these strategies:

  1. Use algorithmic control. The design of the program may prevent race conditions.
  2. Use a synchronization class (see below). Mutual exclusion locks, barriers, etc., may be inserted to provide the required synchronization.

Threads

Every context already has at least one thread that represents the initial execution of the program. More threads may be created. The thread interface is based on the Java Thread library. More specifically, there are two basic classes that are used to instantiate a thread and give it work. Basic HPCxx_Thread objects encapsulate a thread and provide a private data space. Objects of class HPCxx_Runnable provide a convenient way for a set of threads to execute the member functions of a shared object. The interface is:

class HPCxx_Runnable { 
 public:
   virtual void run() = 0;
}; 

class HPCxx_Thread {
 public: 
  HPCxx_Thread(HPCxx_Runnable *);
  // Subclasses of Thread should override this method to
  // get useful behavior.
  virtual void run();
  void start();

  // Ceases execution of the thread.  Status is not currently used.
  static void stop(void *status);

  // Yields thread
  static void yield();
}; 

There are two ways to create a thread and give it work to do:

1) Subclass HPCxx_Runnable

class MyRunnable: public HPCxx_Runnable {
  char *x;
 public: 
  MyRunnable(char *c): x(c){}
  void run() { printf(x); }   // The run() function must be supplied
}; 

then create threads to execute the object

MyRunnable r("Hello World");
HPCxx_Thread *t1 = new HPCxx_Thread(&r);
HPCxx_Thread *t2 = new HPCxx_Thread(&r);
t1->start(); // launch the thread but don't block
t2->start();

The previous section of code will print "Hello WorldHello World". It should be noted that in this small example, it is possible for the main program to terminate prior to the completion of the two threads. This would signal an error condition. Techniques to ensure threads terminate before the main thread exits will be covered in the section on synchronization.

2) Subclass HPCxx_Thread

By subclassing the thread object, it is possible to provide thread-private data.

class MyThread: public HPCxx_Thread {
  char *x;
 public: 
  MyThread(char *y): HPCxx_Thread(NULL), x(y) {}
  void run() { printf(x); }
};

Create a thread and start execution with:

MyThread *t1 = new MyThread("Hello World");
t1->start();

Having both interfaces, HPCxx_Runnable and HPCxx_Thread, provides several advantages:

  1. Many threads may be given the same HPCxx_Runnable object.
  2. For loop transformations, the HPCxx_Runnable form is convenient because it is a pure abstract class, and HPCxx_Thread is not. A subclass of HPCxx_Runnable does not inherit any baggage, and since many threads can use it, the threads can elegantly parallelize loops (see the sketch after this list).
  3. It is identical to the Java interface, and those thread classes are now a de facto standard for this style of thread manipulation.
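
The sketch below illustrates point 2 (the LoopWork class and the counter scheme are illustrative only; the HPCxx_Mutex class is described in the Synchronization section). Several threads share one runnable object and pull loop iterations from a common counter:

class LoopWork : public HPCxx_Runnable {
  double *a;        // the array being processed
  int n;            // number of iterations
  int next;         // next unclaimed iteration
  HPCxx_Mutex lock; // protects 'next'
 public:
  LoopWork(double *a_, int n_) : a(a_), n(n_), next(0) {}
  void run() {
    while (1) {
      lock.lock();
      int i = next++;   // claim the next iteration
      lock.unlock();
      if (i >= n) break;
      a[i] *= 2.0;      // the loop body
    }
  }
};

// Many threads, one runnable object:
//   LoopWork work(data, 1000);
//   HPCxx_Thread *t1 = new HPCxx_Thread(&work);
//   HPCxx_Thread *t2 = new HPCxx_Thread(&work);
//   t1->start(); t2->start();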

Important Caveats

Threads may run in parallel, but are not required to be preemptive. Also, there is no way to bind processor resources to threads. An HPC++Lib implementation could use POSIX Threads (Pthreads) or a thread package such as AWESIME.

Synchronization

There are two types of synchronization mechanisms used in HPC++Lib: collective operator objects and primitive synchronization objects. The collective operations are based on the HPCxx_Group class. The group defines which contexts must participate in the operation.

Sync

A Sync<T> object is a write-once variable. It can be read many times, but only when it is "full". When it is created, it is empty. An attempt to read an empty Sync<T> variable will cause the thread to block until a value arrives and fills the Sync<T> object. At that time, the reader will get the value, and the thread may be scheduled for execution. Many readers can be waiting for a single Sync<T> object. All readers will be unblocked when a value is available. Readers and writers that come after the first write see it as a const value. CC++ provides a similar capability by extending C++ with a new declaration modifier.

The public interface for the Sync object is:

template<class T>
class HPCxx_Sync {
public:
  operator T();          // read a value
  operator =(const T &); // assign a value
  void read(T &);        // another form of read
  void write(const T &); // another form of writing

  bool peek(T &val); // Sneak a peek at the contents of the Sync
                     // without getting caught (blocking). If the
                     // return value is true, the caller snatched
                     // the value.  A false indicates the Sync is
                     // still empty.
};

SyncQ<T> provides a dual queue of values. Attempts to read the empty object cause the calling thread to block until an object is assigned. However, unlike the single Sync object, each read of the object must correspond to a distinct write. In other words, SyncQ maintains an ordered list of blocked threads waiting for values. When a value is written to the SyncQ object, one and only one thread is unblocked to read the value and continue execution. The ith thread waiting for a value will receive the ith value written to the object. If there are no threads waiting for incoming values, the values are internally queued until a read is performed.

The public interface to the SyncQ object is:

template<class T>
class HPCxx_SyncQ {
public:
  operator T();          // read a value
  operator =(const T &); // assign a value
  void read(T &);        // another form of read
  void write(const T &); // another form of writing

  bool peek(T &val); // Sneak a peek at the SyncQ
                     // without getting caught (blocking). If the
                     // return value is true, the caller snatched
                     // the value.  A false indicates the SyncQ is
                     // still empty.
  int length();  // how many values are queued up
  void waitAndCopy(T& data); // wait for a value, then copy without
                             // removing the value from the queue
};

Here is a producer consumer example:

class Producer : public HPCxx_Thread {
  HPCxx_SyncQ<int> &x;
 public:
  Producer(HPCxx_SyncQ<int> &y): HPCxx_Thread(NULL), x(y) {}
  void run() {
    printf("hi there\n");
    x = 1; // produce a value for x
  }
};

int main(int argc, char *argv[]) {
  HPCxx_Group g;
  HPCxx_Init(&argc, &argv, &g);
  HPCxx_SyncQ<int> a;
  Producer *t = new Producer(a);
  printf("start then wait for the thread to assign a value\n");
  t->start();  // The thread is off and running...
  int x = a;   // consume a value here, block if nothing is available.
  return x;    // The thread is done, and now I am too
}

Counting Semaphores

A counting semaphore is a variable with a simple integer count. It is useful for many synchronization problems, for example, having threads wait for some condition, such as the limit of the semaphore being reached.

class HPCxx_CSem {
 public:
  HPCxx_CSem();
  HPCxx_CSem(int initial_limit);

  void wait();    // wait (block) until the semaphore reaches its limit
  void incr();    // increment the value of the semaphore
  const int operator++(int); // another way to increment the semaphore
  int getCount(); // return the current counter value
  int setCount(int val); // set (reset) the current limit
};

I thread "join" operation looks like this:

class Worker: public HPCxx_Thread {
  HPCxx_CSem &c;
 public:
  Worker(HPCxx_CSem &c_): c(c_){}
  void run() { // work!
    c.incr();
  }
};
int main(int argc, char *argv[]) {
  HPCxx_Group g;
  HPCxx_Init(&argc, &argv, &g);
  HPCxx_CSem cs(NUMWORKERS);
  for(int i = 0; i < NUMWORKERS; i++) {
    Worker *w = new Worker(cs);
    w->start();
  }
  cs.wait(); // wait here for all workers to finish. 
  return 0;
}

Mutex Locks

HPC++Lib cannot support Java synchronized methods or CC++ atomic members, but a simple Mutex object with two functions, lock and unlock, provides the basic capability.

class HPCxx_Mutex {
 public:
  void lock();
  void unlock();
};

To provide a synchronized method that allows only one thread at a time to execute it:

class Myclass: public HPCxx_Runnable {
  HPCxx_Mutex L;
 public:
  void synchronized() {
    L.lock();
    ....
    L.unlock();
  }
};

Collective Operations

The HPCxx_Group object, as described earlier, is used to identify sets of threads and contexts that participate in collective operations like barriers. If the program is running on only one context, the HPCxx_Group object simply binds a collection of threads together. However if there are multiple contexts, then the collection of threads and contexts represented by the object becomes more complex, since different numbers of threads on each context may be a part of the group.

Barrier

The HPCxx_Barrier object provides a basic barrier synchronization primitive to threads and contexts. A barrier is a collective operation; each participating member will be blocked until all participating members have entered the function. Its constructor takes the group whose members will participate:

HPCxx_Barrier(HPCxx_Group &g);

By default, the HPCxx_Group object assumes there is only one thread per context.

int main(int argc, char *argv[]) {
  HPCxx_Group g;
  HPCxx_Init(&argc, &argv, &g);
  HPCxx_Barrier mybarrier(g);
  
  mybarrier();
  printf("Hey, Hey, the gang's all here\n");
  mybarrier();

  HPCxx_End();
}

The barrier code above completely ignores threads. Of course, in the example code we did not create any. Nevertheless, if we had, they would not have known about the barrier.

If we want threads to participate in barrier operations, we must inform the HPCxx_Group object to include them. This is accomplished with the setNumThreads() member function, demonstrated below:

int main(int argc, char *argv[]) {
 HPCxx_Group g;
 HPCxx_Init(&argc, &argv, &g);
 g.setNumThreads(24);
 HPCxx_Barrier mybarrier(g);
 ...

However, just knowing how many threads should arrive at the barrier is not enough. In a multi-threaded, multi-context environment, collective operations such as barriers can get very complicated. To help organize the collective functions, each thread must acquire a "key" from the barrier object, and then present that key to participate in the barrier. If the number of keys requested from the barrier object is greater than getNumThreads(), a run-time error occurs. Similarly, if the same key is presented twice to enter a barrier, before the barrier has completed, an error is signaled. An example is shown below.

class Worker : public HPCxx_Thread {
  int my_key;             // This is thread-private data
  HPCxx_Barrier &barrier; // This is thread-private data
 public:
  Worker(HPCxx_Barrier & b): barrier(b) {
    my_key = barrier.getKey(); // Get thread-local key
  }
  void run() {
    while (1) {
      // work hard, do iteration
      barrier(my_key);  // barrier with my group buddies
    }
  }
};

int main(int argc, char *argv[]) {
  HPCxx_Group group;
  HPCxx_Init(&argc, &argv, &group);
  group.setNumThreads(13);
  HPCxx_Barrier barrier(group);
  for(int i = 0; i < 13; i++) {
    Worker *w = new Worker(barrier);
    w->start();
  }
}

A thread can participate in more than one barrier group, and a barrier can be deallocated when it is no longer needed. The thread count of a group may be changed, a new barrier may be allocated, and a thread can request new keys.
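
A rough sketch of that sequence, using the member functions shown above (the numbers are arbitrary):

group.setNumThreads(8);                      // the group now expects 8 threads
HPCxx_Barrier *b = new HPCxx_Barrier(group); // allocate a new barrier
int key = b->getKey();                       // each thread requests a fresh key
(*b)(key);                                   // synchronize with the new group
delete b;                                    // deallocate when no longer needed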

Reductions

Other collective operations can be subclassed from HPCxx_Barrier. For example, for an integer addition reduction, the basic reduction class

class intAdd {
 public:
  int & operator()(int &x, int &y) { x += y; return x;}
};

can be used to create an object to sum one integer from each thread. The declaration is:

HPCxx_Reduct1<int, intAdd> r(group);

and it can be used in the threads as follows:

class Worker : public HPCxx_Thread {
  int my_key;
  HPCxx_Reduct1<int, intAdd> &add;
 public:
  Worker(HPCxx_Reduct1<int, intAdd> & a): add(a){
    my_key = add.getKey();
  }
  void run() {
    int x = 3.14 * my_key;
    int t = add(my_key, x, intAdd()); // global sum of x's
  }
};

The public definition of the reduction class is given by

template <class T, class Oper>
class HPCxx_Reduct1 : public HPCxx_Barrier {
 public:
  HPCxx_Reduct1(HPCxx_Group &);
  T operator()(int key, T &x, Oper op);
  T* destructive(int key, T *buffer, Oper op);
};

The operation can be invoked with the overloaded () operation as in the example above, or with the destructive() form which requires a user supplied buffer to hold the arguments and returns a pointer to the buffer that holds the result. That style avoids making copies of all the buffers modified in the computation. The reduction object is designed to be as efficient as possible, so it is implemented as a tree reduction; the binary operator must be associative.
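
A hedged sketch of the destructive() form with a large value type (the Vec and VecAdd classes are hypothetical; if the group spans multiple contexts, Vec would also need pack/unpack functions as described earlier):

class Vec {
 public:
  double v[1000];
};

class VecAdd {
 public:
  Vec & operator()(Vec &x, Vec &y) {
    for (int i = 0; i < 1000; i++) x.v[i] += y.v[i];
    return x;
  }
};

HPCxx_Reduct1<Vec, VecAdd> vsum(group);
// In each thread:
//   Vec mine;     // filled with this thread's contribution
//   Vec *result = vsum.destructive(my_key, &mine, VecAdd());
//   // the buffer may be overwritten; *result holds the global sum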

The destructive form is much faster if the size of the data type T is large. The multiple-argument form is declared as a template:

template < class R, class T1, class T2, ... TK , class Op1, class Op2 >
  class HPCxx_ReductK{
 public:
  HPCxx_ReductK(HPCxx_Group &);
  R & operator()(int key, T1, T2, ..., TK, Op1, Op2);
};

where K is 2, 3, 4, or 5 in the current implementation; Op1 returns a value of type R, and Op2 is an associative binary operator on type R.
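
For example, an inner product could be written with the two-argument form; this is a sketch only, with user-supplied Mult and Add classes in the style of intAdd above:

class Mult {
 public:
  double operator()(double x, double y) { return x * y; }
};

class Add {
 public:
  double & operator()(double &x, double &y) { x += y; return x; }
};

HPCxx_Reduct2<double, double, double, Mult, Add> dot(group);
// In each thread, with local values a_i and b_i:
//   double sum_of_products = dot(my_key, a_i, b_i, Mult(), Add());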

Broadcast

A synchronized broadcast of a value between a set of threads is accomplished with the operation

template < class T >
class HPCxx_Bcast {
 public:
  HPCxx_Bcast(HPCxx_Group &);
  T operator()(int key, T *x);
};

In this case, only one thread supplies a non-null pointer to the value, and all the others receive a copy of that value.
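
A sketch of its use (which thread supplies the value is up to the program):

HPCxx_Bcast<int> bcast(group);
// The supplying thread:
//   int seed = 42;
//   int copy = bcast(my_key, &seed);
// All other threads:
//   int copy = bcast(my_key, (int *) NULL);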

Multicast

A value in each thread can be concatenated into a vector of values by the collective multicast operation.

template < class T >
class HPCxx_Mcast {
 public:
  HPCxx_Mcast(HPCxx_Group &);
  T * operator()(int key, T &x);
};

In this case, the operator allocates an array of the appropriate size and copies the argument values into the array in "key" order.
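
A sketch of its use (the contribution value is arbitrary):

HPCxx_Mcast<double> gather(group);
// In each thread:
//   double mine = 3.14 * my_key;          // this thread's contribution
//   double *all = gather(my_key, mine);   // all[i] holds the value from key i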

Remote Invocation

For a user-defined class C with a member function,

class C {
 public:
  int foo(float, char);
};

the standard way to invoke the member through a pointer is with an expression of the form:

C *p; 
p->foo(3.14, 'x');

It is a bit more work to make the member function call through a global pointer. First, each member function that will be invoked through global pointers must be registered. An example is shown below.

int main(int argc, char *argv[]){
  HPCxx_Group g;
  HPCxx_Init(&argc, &argv, &g);
  int C_foo_id = hpcxx_register(&C::foo);

The key returned by hpcxx_register() uniquely identifies the member function within the calling context only. For the other contexts to be able to invoke the member function based on that key, the remote context must also register the member function, and since the keys are given out sequentially, all contexts should register functions in the same order.

To invoke the member function, there is a special function template

HPCxx_GlobalPtr<C> P;
...
int z = invoke(P, C_foo_id, 3.13, 'x');

Invoke will call C::foo(3.13, 'x') in the context that contains (owner computes) the object that P points to. The calling process will wait until the function returns. If you don't want to wait, the asynchronous invoke interface will allow the calling function to continue executing until the result is needed.

HPCxx_Sync<int> sz;
ainvoke(&sz, P, C_foo_id, 3.13, 'x');

.... // go do some work

int z = sz; // wait here.

It should be noted that it is not a good idea to pass pointers as argument values to invoke() or ainvoke(). However, it is completely legal to pass global pointers and return global pointers as results of remote member invocations.

Functions of the Global class

Ordinary functions can also be invoked remotely (they are viewed as members of the "Global" class). The HPCxx_Context object is used to register these global functions.

For example, to call a function on node "3" from node "0", the function must be registered on each node. (As with member functions, the order of the function registration determines the function identifier, so the functions must be registered in exactly the same order on each context.)

double fun(char x, int y);

int main(int argc, char *argv[]) {
  HPCxx_Group g;
  HPCxx_Init(&argc, &argv, &g);

  int fun_id = hpcxx_register(fun);
  // remote invocation of x = fun('z', 44);
  double x = hpcxx_invoke(g.context(3), fun_id, 'z', 44);

  // asynchronous invocation
  HPCxx_Sync<double> sx;
  hpcxx_ainvoke(&sx, g.context(3), fun_id, 'z', 44);
  double y = sx;  // wait here for the result
....
}