Portability and component reuse depend on well-designed abstractions and interfaces. HPC++Lib is a portable, parallel, object-oriented run-time system class library that defines a standard interface that is both powerful and easy to use. Its high-level interface gives the underlying implementation the freedom to optimize for particular architectures and for the changing state of the hardware. That interface includes support for allocating data structures from a shared region, copying shared data between unshared (distributed) regions of memory, basic synchronization primitives, collective functions such as broadcast and reduce, remote method invocation (RMI), and basic process and thread control for managing both task and data parallelism. HPC++Lib is also designed to accommodate the tracing and profiling mechanisms that should be built into a run-time system from the ground up.
Our goal is to make HPC++Lib available on a wide variety of parallel architectures, from networks of workstations to the nation's largest supercomputers. While some scientists will choose to write parallel C++ programs that interface directly to the classes, abstractions, and functions in HPC++Lib, the library will also be used as the basis for other libraries, programming disciplines, and object frameworks that make programming parallel machines easier. This includes projects such as AMR++, the Parallel Standard Template Library, POOMA, P++, and PADRE.
Execution Model: An HPC++ program runs on a collection of nodes. A node is a physical compute resource that provides one or more contexts. A node can be a shared-memory multiprocessor (SMP), possibly connected to other SMPs via a network. Examples of a node include a single-CPU desktop workstation, an SP2 "node", a dual-Pentium NT box, and a parallel supercomputer with cache-coherent distributed shared-memory support, such as the Convex SPP and the SGI Origin 2000. A context is a cache-coherent, possibly shared, virtual address space. A context may be accessible by several different threads of control. Often, a Unix process represents a context. Each node may have one or more threads within each of its contexts. Threads sharing a context may run in parallel, share memory, and be either preemptive or non-preemptive. These resources may be allocated initially at run-time, or for some architectures, dynamically during the execution of the program.
The simplest parallel program using HPC++Lib would consist of a set of threads executing within a single context. The threads would have access to a shared address space, and could run on one CPU or run in parallel. This model of parallelism is well suited to modest levels of parallelism, such as that available on a 32-node SGI Origin 2000. HPC++Lib also supports running programs across a large number of nodes, such as a cluster of Origin 2000s, or an SP2. For this style of programming, a "Single Program Multiple Data" (SPMD) approach is used. In SPMD mode, distributed data structures and explicit synchronization must be managed by the programmer.
Basic Components of HPC++:
The Single Program Multiple Data (SPMD) model of execution is one of the standard models used in parallel scientific programming. Typically, parallel computers provide a mechanism for SPMD computation, where n copies of a single program begin execution at main(argc, argv). Often, in interactive mode, the run-time resources are specified with command line arguments such as "-procs 16". However, there is no standard; each vendor uses a different set of command line arguments. When using a batch-mode scheduler, such as NQS, DJM, or LoadLeveler, the number of CPUs is usually specified via directives to the job scheduler. To make HPC++Lib programs portable and give them a single command line interface, a front-end script called hpc++run is provided. It is very similar to the mpirun script from MPICH.
hpc++run [hpc++ args] <user_executable> [user_args]
These are the currently defined arguments that may be passed to hpc++run.
At this point, it is important to understand that not all run-time values for the number of nodes, contexts, and threads may be supported by the SPMD program loader. For example, on the CM5, programs are loaded onto partitions; specifying "-nn 15" will likely cause a run-time error since it is not a power of two. However, the SPMD program loader will always execute at least one main(argc, argv) per allocated context. The interaction between run-time resource parameters, machine capabilities, and dynamic resource allocation after the program has begun is not easily simplified, and will always remain machine dependent.
The running processes are coordinated by the HPC++Lib initialization routine, which is invoked as the first call in the main program of each context. To simplify things, make the program more portable, and behave predictably on a variety of machines and SPMD loaders, it is suggested that no user code appear between main(argc, argv) and HPCxx_Init(). The initialization procedure strips off all arguments beginning with -hpcxx_, and fills in the object of type HPCxx_Group, which represents the initial SPMD group. Each collection of contexts and threads is represented by an HPCxx_Group object, with one representative copy in each context participating in the group. The HPCxx_Group object is similar to an MPI communicator - it provides an interface to build groups of contexts. Each context has a single representative object of type HPCxx_Context, as well as a representative object of type HPCxx_Node. Some of the important public interfaces for HPCxx_Group are shown below.
class HPCxx_Group {
public:
    // Construct a group from a HPCxx_Node object
    HPCxx_Group(const HPCxx_Node &node);
    // Construct a group from an array of HPCxx_Node objects
    HPCxx_Group(const HPCxx_Node *nodes, int numNodes);
    HPCxx_Node *getNode();           // A pointer to my Node object
    HPCxx_Node *getNode(int nodeID); // A pointer to the Node object
                                     // for any node
    int getNumContexts();  // The total number of contexts in this group
    int getContextID();    // The unique ID for this context
    int getGroupID();      // The unique ID for this group
    int getNumThreads();   // The number of threads in this context
                           // participating in group collective operations
    void setNumThreads(int c);
};
Here is the simplest parallel program written with HPC++Lib:
int main(int argc, char **argv) {
    HPCxx_Group SPMDgroup;
    if (HPCxx_Init(&argc, &argv, &SPMDgroup)) {
        fprintf(stderr, "Error in HPCxx_Init\n");
        exit(1);
    }
    // begin SPMD user program
    printf("Hello World from node %d, context %d\n",
           SPMDgroup.getNode()->getNodeID(), SPMDgroup.getContextID());
    // Close things down....
    HPCxx_Exit();
}
Running this program could be done with:
hpc++run -numnodes 4 a.out
or, on some architectures where the program loader itself starts SPMD mode, the hpc++run driver is not strictly required and you could use:
a.out -hpcxx_numnodes 4
However, the latter form will not work on every architecture (as described above). The output from the program execution would probably look like this:
Hello World from node 3, context 3
Hello World from node 0, context 0
Hello World from node 2, context 2
Hello World from node 1, context 1
Interleaving of parallel output from the printf() function is not controlled by HPC++Lib. Most systems interleave output on line boundaries, but that is not guaranteed across platforms. Use of the HPCxx_Group class will be discussed in greater detail in the Collective Operations section.
C++ constructors for global objects are run before main(argc, argv). A given constructor could be run n times, once for each context that the SPMD loader created, or only once, since on some machines the call to HPCxx_Init(), not the SPMD loader, creates the required contexts. This can lead to non-portable programs, not to mention a lot of confusion. Those constructors may not use any of the HPC++Lib calls or classes, since until HPCxx_Init() completes, their use will generate an error. This includes thread and shared memory manipulation. Therefore, it is suggested that constructors for global objects not perform I/O, attempt to use HPC++Lib features, or initialize data that is context dependent. Additionally, data that is shared between contexts via global pointers (see below) must be dynamically heap allocated (not on the stack) after HPC++Lib initialization.
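For example, here is a minimal sketch of the suggested pattern (the table size is illustrative):

int main(int argc, char **argv) {
    HPCxx_Group g;
    if (HPCxx_Init(&argc, &argv, &g)) {   // no user code before this call
        fprintf(stderr, "Error in HPCxx_Init\n");
        exit(1);
    }
    // Only after HPCxx_Init() completes: heap allocate the data that other
    // contexts will reach through global pointers (never on the stack)
    HPCxx_GlobalPtr<double> table = new ("HPCxx_Global") double[256];
    // ... SPMD user program ...
}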
Central to distributed memory parallel computation are the abstractions for sharing and moving data. Most programmers are comfortable with a simplified near/far model, where data that is near is accessible using regular shared memory, and data that is far must be accessed via function calls that invoke put/get or send/recv semantics. However, with the power of C++, many of those old interfaces can be pushed down into the supporting run-time system, and replaced by an easier to use, yet more powerful C++ class library for manipulating global data structures.
In this portion of the run-time system, a C++ class library supplies the interface for basic allocation and access of remote-accessible data. Central to this concept is the handle by which the remote data is referenced. Global Pointers are proxy objects that are designed to behave like pointers to objects in other address spaces.
The notion of a global pointer is quite old. However, with the extensibility of C++, its semantics can be captured without language extensions or library calls at the user-code level. A global pointer to an object of type T is defined as a templated class of the form:
HPCxx_GlobalPtr<T> q;
Initializing a global pointer to an array of 100 doubles looks like:
HPCxx_GlobalPtr<double> q = new ("HPCxx_Global") double[100];
Notice that the array was allocated from the special HPCxx_Global area (see below for more details). Global pointers may be passed freely among different contexts. The global pointer provides a handle to objects that may be used from any other context. For objects of simple type, a global pointer can be dereferenced like any other pointer. The assignment operator performs a simple assignment if the global reference refers to a local object. Otherwise, the underlying run-time system is called to copy the right-hand side into the remote object. An analogous operation occurs when the cast operator is used. If the global pointer refers to a local object, that object is returned. If the pointer is to a remote object, the low-level data movement routines copy the value from the remote object to the local context.
For example, assignment and a fetch through a global pointer are shown below:
HPCxx_GlobalPtr<float> p = new ("HPCxx_Global") float;
*p = 3.14;          // A store through a GlobalPtr
float y = 2 - *p;   // A fetch via a GlobalPtr
Pointer arithmetic and the [ ] operator can be used on global pointers as you would with ordinary pointers. For large blocks of contiguous data, it would be too expensive to read or write them one at a time, so block read and write operations are provided.
void HPCxx_GlobalPtr<T>::read(T *buffer, int size);
void HPCxx_GlobalPtr<T>::write(T *buffer, int size);
This design allows the library user to employ standard C++ pointer reference/dereference to read and write remote values. Below are some more examples of how to use global pointers.
HPCxx_GlobalPtr<double> q = new ("HPCxx_Global") double[100];
q[8] = 3.99;                  // Use array notation for a remote store
double x = q[9];              // Use array notation for a remote read
*(q+10) = 32.6;               // Pointer arithmetic and remote store
double z = *(q+12);           // Pointer arithmetic and remote read
double b[100];                // Local declaration and allocation
q.write(b, 100);              // Write 100 values
q.read(b, 100);               // Read 100 values into the local array
q = MyCoolFunction(q, 5.6);   // Global pointer used as a function
                              // argument, and returned
if (q.isLocal()) {            // Does the global pointer refer to a local object?
    double *qLocal = q;       // Use the cast to extract the local pointer
}
Both the remote read and the remote write operations are blocking. The remote read waits until the data arrives from the distant context, and the remote write blocks until the source buffer has been read and may safely be modified. In the context of threads, this provides a convenient place to call yield(). A fetch through a global pointer may suspend the currently executing thread until the value arrives, at which time the scheduler can put the thread back into the ready queue, complete with the requested value. This can be an extremely powerful way to hide the latency of remote memory access.
Unfortunately, using sizeof() for shallow copying of user-defined types only works for homogeneous systems. In general, alignment and representation differences between machines prevent an object from being copied byte-by-byte to another context. Another similarly complicated issue is properly invoking the constructor for byte-copied objects. Without compiler support or a preprocessor, the programmer must take on the responsibility for packing and unpacking objects into a form that can be sent to remote contexts.
To read and write user-defined types through a global pointer, the user must write pack and unpack friend functions for the class and an array of such objects. A simple example is given below.
class MyClass {
    T1 x;
    float y[100];
public:
    friend void HPCxx_pack(HPCxx_Buffer *b, MyClass *, int count);
    friend void HPCxx_unpack(HPCxx_Buffer *b, MyClass *, int &count);
};
// The pack function for MyClass would look like this:
void HPCxx_pack(HPCxx_Buffer *b, MyClass *a, int count) {
    hpcxx_pack(b, count, 1);          // The first thing is a count of items
    for (int i = 0; i < count; i++) {
        hpcxx_pack(b, a[i].x, 1);     // pack the one item of type T1
        hpcxx_pack(b, a[i].y, 100);   // pack the 100 items of type float
    }
}

// The unpack function for MyClass
void HPCxx_unpack(HPCxx_Buffer *b, MyClass *a, int &count) {
    hpcxx_unpack(b, count, 1);        // fill in count with the # of items
    for (int i = 0; i < count; i++) {
        hpcxx_unpack(b, a[i].x, 1);   // extract one item of type T1
        hpcxx_unpack(b, a[i].y, 100); // extract 100 items of type float
    }
}
These pack and unpack functions can be considered a type of remote constructor. For example, suppose an object contains a pointer to a local-storage buffer. The programmer could choose to write the pack function to create a global pointer that references the original storage, or to pack the buffer and then unpack the data into suitable storage on the remote context. The pack/unpack functions could also be coded to copy only those fields that are important. The pack/unpack friends will be called by the system to move the data across context boundaries.
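For instance, here is a hedged sketch of the deep-copy choice for a class that holds a pointer to a heap buffer (the Image class and its fields are illustrative, not part of HPC++Lib):

class Image {
    int n;           // number of pixels
    float *pixels;   // local heap storage
public:
    friend void HPCxx_pack(HPCxx_Buffer *b, Image *a, int count);
    friend void HPCxx_unpack(HPCxx_Buffer *b, Image *a, int &count);
};

void HPCxx_pack(HPCxx_Buffer *b, Image *a, int count) {
    hpcxx_pack(b, count, 1);
    for (int i = 0; i < count; i++) {
        hpcxx_pack(b, a[i].n, 1);            // pack the buffer length first
        hpcxx_pack(b, a[i].pixels, a[i].n);  // deep copy: pack the buffer contents
    }
}

void HPCxx_unpack(HPCxx_Buffer *b, Image *a, int &count) {
    hpcxx_unpack(b, count, 1);
    for (int i = 0; i < count; i++) {
        hpcxx_unpack(b, a[i].n, 1);
        a[i].pixels = new float[a[i].n];       // the "remote constructor" step:
        hpcxx_unpack(b, a[i].pixels, a[i].n);  // allocate suitable storage, then fill it
    }
}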
Another important issue is what objects may have global pointers referencing them. Many computer architectures clearly delineate between objects in shared space and objects that are strictly local. This demarcation is reflected in the MPI 2 draft standard for one-sided communication. Remote Memory Access (RMA) extends MPI by allowing one process to specify all the communication parameters, for both the sending and receiving side. However, one-sided messages may only modify blocks of memory that have been allocated with MPI_MEM_ALLOC(). If the global pointer package is to be portable, the concept of shared and local memory regions must be reflected in the interface. Therefore, as mentioned earlier, special allocators (HPCxx_Global) are required for heap allocated objects that will have a global pointer referencing them. A C-style malloc() is also available, called hpcxx_sharedMalloc(). The C interface is helpful for allocating storage from C subroutines.
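A short sketch of the C-style interface; the exact signature of hpcxx_sharedMalloc() is an assumption here:

/* assumed signature: void *hpcxx_sharedMalloc(size_t nbytes); */
double *buf = (double *) hpcxx_sharedMalloc(100 * sizeof(double));
/* buf now lies in the shared region, so a global pointer may reference it */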
Remote reads and writes are not guaranteed to be atomic. In other words, should another thread of control simultaneously modify the contents of an object while it is being fetched via a global pointer, the results are not predictable. Following the design principle that communication and synchronization should be explicit, users must ensure correctness themselves, for example with a phase discipline that separates writes from remote reads with explicit synchronization, as sketched below.
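A minimal sketch of the phase-discipline strategy, using the HPCxx_Barrier class described later (g is the SPMD group; compute() and q are illustrative):

HPCxx_GlobalPtr<double> p = new ("HPCxx_Global") double;
HPCxx_Barrier mybarrier(g);

*p = compute();   // write phase: only the owner modifies its object
mybarrier();      // collective barrier: all writes complete before any read
double v = *q;    // read phase: q refers to another context's object,
                  // which no thread is now modifying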
Every context already has at least one thread that represents the initial execution of the program. More threads may be created. The thread interface is based on the Java Thread library. More specifically, there are two basic classes that are used to instantiate a thread and give it work. Basic HPCxx_Thread objects encapsulate a thread and provide a private data space. Objects of class HPCxx_Runnable provide a convenient way for a set of threads to execute the member functions of a shared object. The interface is:
class HPCxx_Runnable {
public:
    virtual void run() = 0;
};

class HPCxx_Thread {
public:
    HPCxx_Thread(HPCxx_Runnable *);
    // Subclasses of Thread should override this method to
    // get useful behavior.
    virtual void run();
    void start();
    // Ceases execution of the thread. Status is not currently used.
    static void stop(void *status);
    // Yields the thread
    static void yield();
};
There are two ways to create a thread and give it work to do. The first is to implement the HPCxx_Runnable interface:
class MyRunnable : public HPCxx_Runnable {
    char *x;
public:
    MyRunnable(char *c) : x(c) {}
    void run() { printf(x); }   // The run() function must be supplied
};
then create threads to execute the object:
MyRunnable r("Hello World"); HPCxx_Thread *t1 = new HPCxx_Thread(&r); HPCxx_Thread *t2 = new HPCxx_Thread(&r); t1->start(); // launch the thread but don't block t2->start(); }
The preceding code will print "Hello WorldHello World". It should be noted that in this small example, it is possible for the main program to terminate prior to the completion of the two threads, which would signal an error condition. Techniques to ensure that threads terminate before the main thread exits will be covered in the section on synchronization.
The second way is to subclass the thread object, which also makes it possible to provide thread-private data.
class MyThread : public HPCxx_Thread {
    char *x;
public:
    MyThread(char *y) : HPCxx_Thread(NULL), x(y) {}
    void run() { printf(x); }
};
Create a thread, and start execution with:
MyThread *t1 = new MyThread("Hello World");
t1->start();
Having both interfaces, HPCxx_Runnable and HPCxx_Thread, provides flexibility: a single HPCxx_Runnable object can be shared by several threads, as shown above, while subclassing HPCxx_Thread gives each thread its own private data.
Threads may run in parallel, but are not required to be preemptive. Also, there is no way to bind processor resources to threads. An HPC++Lib implementation could use POSIX Threads (Pthreads) or a thread package such as AWESIME.
There are two types of synchronization mechanisms used in HPC++Lib: collective operator objects and primitive synchronization objects. The collective operations are based on the HPCxx_Group class. The group defines which contexts must participate in the operation.
A Sync<T> object is a write-once variable. It can be read many times, but only when it is "full". When it is created, it is empty. An attempt to read an empty Sync<T> variable blocks the thread until a value arrives and fills the Sync<T> object. At that time the reader gets the value, and the thread may be scheduled for execution. Many readers can be waiting for a single Sync<T> object; all of them are unblocked when a value becomes available. Readers and writers that come after the first write see it as a const value. CC++ provides a similar capability by extending C++ with a new declaration modifier.
The public interface for the Sync object is:
template <class T>
class HPCxx_Sync {
public:
    operator T();              // read a value
    void operator=(const T &); // assign a value
    void read(T &);            // another form of read
    void write(const T &);     // another form of writing
    bool peek(T &val);         // Sneak a peek at the contents of the Sync
                               // without getting caught (blocking). If the
                               // return value is true, the caller snatched
                               // the value. A false indicates the Sync is
                               // still empty.
};
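A brief usage sketch: the first write fills the Sync object, and every reader then sees the same value. In a real program the producer and the consumers would be separate threads; they are shown linearly here for clarity.

HPCxx_Sync<int> result;   // created empty

// producer thread:
result = 42;              // the single write fills the Sync

// consumer threads: each read blocks until the value is present
int v = result;           // via the cast operator
int w;
result.read(w);           // the equivalent read() form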
SyncQ<T> provides a queue of values. Attempts to read the empty object cause the calling thread to block until an object is assigned. However, unlike the single Sync object, each read of the object must correspond to a distinct write. In other words, SyncQ maintains an ordered list of blocked threads waiting for values. When a value is written to the SyncQ object, one and only one thread is unblocked to read the value and continue execution. The ith thread waiting for a value will receive the ith value written to the object. If there are no threads waiting for incoming values, the values are internally queued until a read is performed.
The public interface to the SyncQ object is:
template <class T>
class HPCxx_SyncQ {
public:
    operator T();              // read a value
    void operator=(const T &); // assign a value
    void read(T &);            // another form of read
    void write(const T &);     // another form of writing
    bool peek(T &val);         // Sneak a peek at the SyncQ
                               // without getting caught (blocking). If the
                               // return value is true, the caller snatched
                               // the value. A false indicates the SyncQ is
                               // still empty.
    int length();              // how many values are queued up
    void waitAndCopy(T& data); // wait for a value, then copy without
                               // removing the value from the queue
};
Here is a producer/consumer example:
class Producer : public HPCxx_Thread {
    HPCxx_SyncQ<int> &x;
public:
    Producer(HPCxx_SyncQ<int> &y) : HPCxx_Thread(NULL), x(y) {}
    void run() {
        printf("hi there\n");
        x = 1;   // produce a value for x
    }
};

int main(int argc, char *argv[]) {
    HPCxx_Group g;
    HPCxx_Init(&argc, &argv, &g);
    HPCxx_SyncQ<int> a;
    Producer *t = new Producer(a);
    printf("start then wait for the thread to assign a value\n");
    t->start();   // The thread is off and running...
    int x = a;    // consume a value here; block if nothing is available
    return x;     // The thread is done, and now I am too
}
A counting semaphore is a variable with a simple integer count. It is useful for many synchronization problems, for example getting a group of threads to wait for some condition (signaled when the limit of the semaphore is reached).
class HPCxx_CSem {
public:
    HPCxx_CSem();
    HPCxx_CSem(int initial_limit);
    void wait();               // wait (block) until the semaphore reaches its limit
    void incr();               // increment the value of the semaphore
    const int operator++(int); // another way to increment the semaphore
    int getCount();            // return the current counter value
    int setCount(int val);     // set (reset) the current limit
};
I thread "join" operation looks like this:
class Worker : public HPCxx_Thread {
    HPCxx_CSem &c;
public:
    Worker(HPCxx_CSem &c_) : HPCxx_Thread(NULL), c(c_) {}
    void run() {
        // work!
        c.incr();
    }
};
int main(int argc, char *argv[]) {
    HPCxx_Group g;
    HPCxx_Init(&argc, &argv, &g);
    HPCxx_CSem cs(NUMWORKERS);
    for (int i = 0; i < NUMWORKERS; i++) {
        Worker *w = new Worker(cs);
        w->start();
    }
    cs.wait();   // wait here for all workers to finish
    return 0;
}
Unlike Java, HPC++Lib cannot support synchronized methods or CC++ atomic members, but a simple Mutex object with two functions, lock and unlock, provides the basic capability.
class HPCxx_Mutex {
public:
    void lock();
    void unlock();
};
To provide a synchronized method that allows only one thread at a time to execute it:
class Myclass : public HPCxx_Runnable {
    HPCxx_Mutex L;
public:
    void synchronized() {
        L.lock();
        // ... critical section: only one thread at a time ...
        L.unlock();
    }
    void run() { synchronized(); }   // required by HPCxx_Runnable
};
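For instance, two threads sharing one such object will execute the critical section one at a time; a minimal sketch using the classes above:

Myclass shared;   // one object, shared by both threads
HPCxx_Thread *t1 = new HPCxx_Thread(&shared);
HPCxx_Thread *t2 = new HPCxx_Thread(&shared);
t1->start();      // each thread's run() calls synchronized(),
t2->start();      // so the critical section is never entered twice at once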
The HPCxx_Group object, as described earlier, is used to identify sets of threads and contexts that participate in collective operations like barriers. If the program is running on only one context, the HPCxx_Group object simply binds a collection of threads together. However if there are multiple contexts, then the collection of threads and contexts represented by the object becomes more complex, since different numbers of threads on each context may be a part of the group.
The HPCxx_Barrier object provides a basic barrier synchronization primitive to threads and contexts. A barrier is a collective operation; each participating member blocks until all participating members have entered the barrier. Its interface is shown below:
class HPCxx_Barrier {
public:
    HPCxx_Barrier(HPCxx_Group &g);
    void operator()();        // enter the barrier
    int getKey();             // obtain a key for per-thread participation
    void operator()(int key); // enter the barrier as thread "key" (see below)
};
By default, the HPCxx_Group object assumes there is only one thread per context.
int main(int argc, char *argv[]) {
    HPCxx_Group g;
    HPCxx_Init(&argc, &argv, &g);
    HPCxx_Barrier mybarrier(g);
    mybarrier();
    printf("Hey, Hey, the gang's all here\n");
    mybarrier();
    HPCxx_End();
}
The barrier code above completely ignores threads. Of course, the example code did not create any; if it had, however, they would not have known about the barrier.
If we want threads to participate in barrier operations, we must inform the HPCxx_Group object to include them. This is accomplished with the setNumThreads() member function, demonstrated below:
int main(int argc, char *argv[]) {
    HPCxx_Group g;
    HPCxx_Init(&argc, &argv, &g);
    g.setNumThreads(24);
    HPCxx_Barrier mybarrier(g);
    ...
However, just knowing how many threads should arrive at the barrier is not enough. In a multi-threaded, multi-context environment, collective operations such as barriers can get very complicated. To help organize the collective functions, each thread must acquire a "key" from the barrier object, and then present that key to participate in the barrier. If the number of keys requested from the barrier object is greater than getNumThreads(), a run-time error occurs. Similarly, if the same key is presented twice to enter a barrier before the barrier has completed, an error is signaled. An example is shown below.
class Worker : public HPCxx_Thread {
    int my_key;               // This is thread-private data
    HPCxx_Barrier &barrier;   // This is thread-private data
public:
    Worker(HPCxx_Barrier &b) : HPCxx_Thread(NULL), barrier(b) {
        my_key = barrier.getKey();   // Get a thread-local key
    }
    void run() {
        while (1) {
            // work hard, do an iteration
            barrier(my_key);   // barrier with my group buddies
        }
    }
};

int main(int argc, char *argv[]) {
    HPCxx_Group group;
    HPCxx_Init(&argc, &argv, &group);
    group.setNumThreads(13);
    HPCxx_Barrier barrier(group);
    for (int i = 0; i < 13; i++) {
        Worker *w = new Worker(barrier);
        w->start();
    }
}
A thread can participate in more than one barrier group, and a barrier can be deallocated when it is no longer needed. The thread count of a group may be changed, a new barrier may be allocated, and a thread can request new keys.
Other collective operations can be subclassed from HPCxx_Barrier. For example, for an integer addition reduction, the basic reduction class
class intAdd {
public:
    int & operator()(int &x, int &y) { x += y; return x; }
};
can be used to create an object to sum one integer from each thread. The declaration is:
HPCxx_Reduct1<int, intAdd> r(group);
and it can be used in the threads as follows:
class Worker : public HPCxx_Thread {
    int my_key;
    HPCxx_Reduct1<int, intAdd> &add;
public:
    Worker(HPCxx_Reduct1<int, intAdd> &a) : HPCxx_Thread(NULL), add(a) {
        my_key = add.getKey();
    }
    void run() {
        int x = 3 * my_key;                 // some per-thread value
        int t = add(my_key, x, intAdd());   // global sum of the x's
    }
};
The public definition of the reduction class is given by
template <class T, class Oper>
class HPCxx_Reduct1 : public HPCxx_Barrier {
public:
    HPCxx_Reduct1(HPCxx_Group &);
    T operator()(int key, T &x, Oper op);
    T* destructive(int key, T *buffer, Oper op);
};
The operation can be invoked with the overloaded () operation as in the example above, or with the destructive() form which requires a user supplied buffer to hold the arguments and returns a pointer to the buffer that holds the result. That style avoids making copies of all the buffers modified in the computation. The reduction object is designed to be as efficient as possible, so it is implemented as a tree reduction; the binary operator must be associative.
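A hedged sketch of the destructive form (the partialSum variable is illustrative):

HPCxx_Reduct1<int, intAdd> r(group);
int my_key = r.getKey();   // keys work as they do for barriers

int buffer = partialSum;   // this thread's contribution, reduced in place
int *result = r.destructive(my_key, &buffer, intAdd());
int total = *result;       // the global sum, without extra copies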
The destructive form is much faster if the size of the data type T is large. The multiple-argument form is declared as a template:
template <class R, class T1, class T2, ... TK, class Op1, class Op2>
class HPCxx_ReductK {
public:
    HPCxx_ReductK(HPCxx_Group &);
    R & operator()(int key, T1, T2, ..., TK, Op1, Op2);
};
where K is 2, 3, 4 or 5 in the current implementation, Op1 returns a value of type R, and Op2 is an associative binary operator on type R.
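As an illustration of the multiple-argument form with K = 2, the sketch below computes a dot product across threads: Op1 multiplies each thread's pair of values and Op2 sums the products. The Mult and Add classes are illustrative, and the sketch assumes the K = 2 instantiation is spelled HPCxx_Reduct2 and inherits getKey() from HPCxx_Barrier.

class Mult {   // Op1: combines one T1 and one T2 into an R
public:
    double operator()(double x, double y) { return x * y; }
};
class Add {    // Op2: an associative binary operator on R
public:
    double & operator()(double &x, double &y) { x += y; return x; }
};

HPCxx_Reduct2<double, double, double, Mult, Add> dot(group);
int my_key = dot.getKey();
double x = 1.5, y = 2.0;   // this thread's pair (illustrative values)
// each thread contributes one (x, y) pair; the result is the sum of the x*y's
double d = dot(my_key, x, y, Mult(), Add());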
A synchronized broadcast of a value between a set of threads is accomplished with the operation
template <class T>
class HPCxx_Bcast {
public:
    HPCxx_Bcast(HPCxx_Group &);
    T operator()(int key, T *x);
};
In this case, only one thread supplies a non-null pointer to the value, and all the others receive a copy of that value.
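A brief sketch (the root test is illustrative, and getKey() is assumed to be inherited from HPCxx_Barrier):

HPCxx_Bcast<int> bcast(group);
int my_key = bcast.getKey();

int v;
if (i_am_the_root) {          // exactly one thread supplies the value
    int value = 42;
    v = bcast(my_key, &value);
} else {
    v = bcast(my_key, NULL);  // everyone else passes a null pointer
}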
A value in each thread can be concatenated into a vector of values by the collective multicast operation.
template <class T>
class HPCxx_Mcast {
public:
    HPCxx_Mcast(HPCxx_Group &);
    T * operator()(int key, T &x);
};
In this case, the operator allocates an array of the appropriate size and copies the argument values into the array in "key" order.
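For example (the per-thread value is illustrative, and getKey() is assumed to be inherited from HPCxx_Barrier):

HPCxx_Mcast<double> gather(group);
int my_key = gather.getKey();

double myValue = 2.0 * my_key;           // this thread's contribution
double *all = gather(my_key, myValue);   // all[i] holds the value from key i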
For a user-defined class C with a member function,
class C {
public:
    int foo(float, char);
};
the standard way to invoke the member through a pointer is with an expression of the form:
C *p;
p->foo(3.14, 'x');
It is a bit more work to make the member function call through a global pointer. First, each member function that will be invoked through global pointers must be registered. An example is shown below.
int main(int argc, char *argv[]) {
    HPCxx_Group g;
    HPCxx_Init(&argc, &argv, &g);
    int C_foo_id = hpcxx_register(&C::foo);
The key returned by hpcxx_register() uniquely identifies the member function within the calling context only. For the other contexts to be able to invoke the member function based on that key, each remote context must also register the member function, and since the keys are given out sequentially, all contexts must register functions in the same order.
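For example, assuming a second member C::bar exists (illustrative), every context would execute the same registration sequence:

int C_foo_id = hpcxx_register(&C::foo);   // first key, identical in every context
int C_bar_id = hpcxx_register(&C::bar);   // second key (C::bar is illustrative)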
To invoke the member function, there is a special function template:
HPCxx_GlobalPtr<C> P;
...
int z = hpcxx_invoke(P, C_foo_id, 3.13, 'x');
hpcxx_invoke() will call C::foo(3.13, 'x') in the context that contains the object P points to (owner computes). The calling process will wait until the function returns. If you don't want to wait, the asynchronous interface, hpcxx_ainvoke(), allows the calling function to continue executing until the result is needed.
HPCxx_Sync<int> sz;
hpcxx_ainvoke(&sz, P, C_foo_id, 3.13, 'x');
....          // go do some work
int z = sz;   // wait here
It should be noted that it is not a good idea to pass ordinary pointers as argument values to hpcxx_invoke() or hpcxx_ainvoke(). However, it is completely legal to pass global pointers and to return global pointers as the results of remote member invocations.
Ordinary functions can also be invoked remotely (they are viewed as members of the "Global" class). The HPCxx_Context object is used to register these global functions.
For example, to call a function on node "3" from node "0", the function must be registered on each node. (As with member functions, the order of registration determines the function identifier, so the functions must be registered in exactly the same order on each context.)
double fun(char x, int y);

int main(int argc, char *argv[]) {
    HPCxx_Group g;
    HPCxx_Init(&argc, &argv, &g);
    int fun_id = hpcxx_register(fun);
    // remote invocation of x = fun('z', 44)
    double x = hpcxx_invoke(g.context(3), fun_id, 'z', 44);
    // asynchronous invocation
    HPCxx_Sync<double> sx;
    hpcxx_ainvoke(&sx, g.context(3), fun_id, 'z', 44);
    double y = sx;
    ....
}