A Proposal for PORTS1

The PORTS working group took the first step toward a common run-time specification for compilers by developing the PORTS0 thread interface. This document proposes a plan for developing a PORTS1 interface for communication, remote procedure call, clocks, and event logging.

The strategy used to develop the PORTS0 thread interface was successful because:

  - It started with the best standard available, pthreads.
  - A simple subset of the standard that could be implemented on a wide
    variety of architectures was selected.
  - Where pthreads lacked needed functionality, the PORTS group defined
    a new set of function interfaces.

The PORTS group should follow this same pattern while designing PORTS1.

Basic Definitions

The PORTS group should agree on some basic data structures and definitions for discussing the machine model. The group has discussed these constructs several times in the past. Below is a proposed formalization of those constructs.

// ********************************************************
// ********************************************************
// BASIC MACHINE DEFINITIONS:
// ********************************************************
// ********************************************************

// A basic machine address, with no type information.
typedef void* LocalPointer;

// An OwnerContext uniquely describes an address space.
class OwnerContext {
public:
    int nodenumber;
    int contextnumber;

    // Comparison of two OwnerContexts is done element-wise and
    // determines whether the two contexts refer to the same
    // address space.
    int operator==(OwnerContext C) {
        return ((C.nodenumber == nodenumber) &&
                (C.contextnumber == contextnumber));
    }
};

// Pointer to any physical memory location (no type info).
class GlobalPointer {
public:
    LocalPointer toLocalPointer() { return P; }
    LocalPointer P;
    OwnerContext O;
};

Communication

There are three basic machine communication paradigms reflected in popular commercial machines: shared memory, distributed memory with message passing, and distributed memory with network DMA. These three communication architectures should not be cast into a single universal model. Rather, a compiler designed to generate code for all three types of machines should understand these models and generate the most efficient code for communication between nodes. The compiler should be able to access the lowest-level standard communication protocol for a given architecture. The next three sections propose a PORTS1 interface for basic communication.

Shared Memory

The PORTS0 interface solved many issues for compilers using shared memory by providing interfaces for mutual exclusion, synchronization, and reentrant library functions. However, allocation of strictly shared and strictly local memory was overlooked. On some shared memory machines, shared memory pages must be specially allocated, and they often have a different access cost. Moreover, strictly local memory may use a different caching strategy. PORTS1 should include the following function interfaces:

ports1_SharedMalloc();
ports1_SharedFree();
ports1_LocalMalloc();
ports1_LocalFree();

Of course, on machines where there is only one interface for allocating and freeing memory, these interfaces will simply be macros for malloc() and free(), respectively. These function interfaces are also needed for machines communicating with network DMA (see below).
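The exact form of these allocation routines is still open. As a starting point for discussion, one possible sketch is shown below; the malloc()-style signatures and the usage fragment are only a suggestion, not part of the proposal itself.

// Possible signatures for the shared/local allocation interfaces
// (a suggestion only; the final parameter lists are to be decided).
#include <stdlib.h>

// Allocate and free memory visible to all threads of the program.
void* ports1_SharedMalloc(size_t nbytes);
void  ports1_SharedFree(void* ptr);

// Allocate and free memory that is strictly local to the calling node.
void* ports1_LocalMalloc(size_t nbytes);
void  ports1_LocalFree(void* ptr);

// On machines with a single memory interface these could reduce to:
//   #define ports1_SharedMalloc(n) malloc(n)
//   #define ports1_SharedFree(p)   free(p)
//   #define ports1_LocalMalloc(n)  malloc(n)
//   #define ports1_LocalFree(p)    free(p)

// Example use: a shared result buffer and a node-local scratch area.
//   double* results = (double*) ports1_SharedMalloc(1024 * sizeof(double));
//   double* scratch = (double*) ports1_LocalMalloc(256 * sizeof(double));
//   ...
//   ports1_LocalFree(scratch);
//   ports1_SharedFree(results);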
Message Passing

The high performance computing community has worked hard to develop a standard for message passing on distributed memory machines. Their effort, MPI, should form the basis for PORTS1 message passing, just as pthreads formed the basis for PORTS0. For example, IBM's MPI implementation on the SP2 is very efficient; the MPI variants of SEND() and RECV() have been tuned specifically for the SP2 communication switch. Compilers capable of scheduling communication in advance with matched send/recv pairs must be permitted direct access to those primitives for optimal performance. The tremendous efforts of the MPI community should be reflected in the PORTS1 interface.

As with pthreads, only a subset of the immense functionality of MPI will probably be needed for PORTS1. PORTS should assemble a small working group to identify the communication primitives from MPI that should be included in a minimal PORTS1 definition. Since MPI is not currently designed to send messages between user-level threads, the working group should also examine what constructs, like those found in the ICASE Chant package, could be provided in a future PORTS2 interface. Such an investigation should also explore possible connections between thread scheduling and message arrival. The PORTS1 MPI working group should circulate their proposed subset of MPI through the ports mailing list. At the next meeting of PORTS, the final list of functions should be decided.

Network DMA

Network DMA, or put and get for distributed memory, is provided by the Cray T3D and Meiko CS2. CMU also has an experimental low-latency put/get for the Intel Paragon. Other vendors are likely to provide put/get interfaces in the future. However, there is no clear standard for put/get functionality. PORTS1 must propose a generalized put/get interface until a standard gains widespread acceptance. Using the definition of GlobalPointer provided above, the following functions provide a network DMA interface for PORTS1:

// ********************************************************
// ********************************************************
// BASIC NETWORK DMA FUNCTIONS:
// ********************************************************
// ********************************************************

// ********************************************************
// PUT bytes into the location designated by the global pointer.
int ports1_StoreStart(GlobalPointer dst, LocalPointer *src,
                      int length, int *handle);

// ********************************************************
// GET bytes from the location designated by the global pointer.
int ports1_FetchStart(LocalPointer *dst, GlobalPointer src,
                      int length, int *handle);

// ********************************************************
// Check whether a "handle" has completed (non-blocking).
int ports1_Probe(int *handle);

// ********************************************************
// Wait (continue polling) on a "handle" until it has completed
// (blocking).
// Implementation suggestion: this should translate into the calling
// thread being moved out of the ready queue until the DMA completes.
void ports1_Wait(int *handle);

A small group of interested individuals should review the interface suggested above and circulate a draft for PORTS1 network DMA. At the next meeting of PORTS, the final list of functions should be decided.
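To illustrate how compiler-generated code might use this interface, the sketch below starts a put, overlaps the transfer with local computation, and then checks for completion. It assumes the declarations above, that ports1_Probe() returns nonzero once the handle's transfer has completed, and a hypothetical routine do_local_work(); none of these usage details are part of the proposal itself.

// Overlapped put using the proposed network DMA interface.
// Assumes the GlobalPointer/LocalPointer definitions and the
// ports1_* declarations given above.

extern void do_local_work();   // hypothetical local computation

void send_block(GlobalPointer remote, LocalPointer local_buf, int nbytes)
{
    int handle;

    // Start the DMA transfer; the call returns immediately.
    ports1_StoreStart(remote, &local_buf, nbytes, &handle);

    // Overlap communication with computation, polling for completion.
    // (Assumes ports1_Probe() returns nonzero once the transfer is done.)
    while (!ports1_Probe(&handle))
        do_local_work();

    // Alternatively, simply block; the calling thread would be moved
    // out of the ready queue until the DMA completes:
    //     ports1_Wait(&handle);
}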
Remote Procedure Call

PORTS1 needs an active message or remote procedure call function interface. Such an interface would provide the minimum functionality necessary to implement asynchronous behavior. While it has been suggested that MPI might provide active message routines in the future, the PORTS group cannot wait. PORTS1 should develop a simple, compact pair of routines for asynchronous communication.

The implementations of an RPC-like mechanism for shared memory, message passing, and network DMA machines will be very different. Some machines will use an event polling loop, while others will use interrupt-driven events. It is assumed that the argument buffer passed to the remote machine may be copied several times during transit. This is due to the basic asynchronous nature of RPC-style events and is a fundamental overhead. However, since the communication interfaces proposed above (shared memory, message passing, and network DMA) do not require extra buffer copies during transit, long argument buffers should be passed by reference, with a global pointer. The handler in the remote context can then fetch the argument buffer directly, without additional overhead. Using the definitions of OwnerContext and LocalPointer provided above, one possible function interface is shown below.

// ********************************************************
// ********************************************************
// BASIC RPC FUNCTIONALITY:
// ********************************************************
// ********************************************************

// ********************************************************
// Request that a handler be executed in the given OwnerContext.
int ports1_RemoteAction(OwnerContext C, int type,
                        LocalPointer *buffer, int length,
                        int *handle);

// ********************************************************
// The function that the run-time system executes after receiving
// a RemoteAction request from another context.
void ports1_RemoteActionHandler(OwnerContext asker, int type,
                                LocalPointer *buffer, int length);

Like the working groups described above, a group of people interested in the functionality and possible implementations of RPC-style communication should be formed. They should prepare a draft document and circulate the proposal via the mailing list. The function interfaces should be finalized at the next meeting of PORTS.

Clocks

While MPI does provide some clock functions, an interface for clocks is required that shared memory and network DMA machines can also use. A simple clock interface should be included in PORTS1. Since high-resolution, synchronized clocks are vital to performance monitoring, a PORTS1 clock implementation should use the best-resolution clocks available on the hardware.

// ********************************************************
// ********************************************************
// BASIC CLOCK FUNCTIONS:
// ********************************************************
// ********************************************************

// ********************************************************
// Return the current wall-clock time (in secs.) as a double.
// No correlation to the real time of day.
double ports1_WallClock();

// ********************************************************
// Return the current user (CPU) time (in secs.) as a double.
double ports1_UserClock();

// ********************************************************
// Resolution of the user timer in seconds.
double ports1_UserTick();

// ********************************************************
// Resolution of the wall timer in seconds.
double ports1_WallTick();
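As an illustration of how these clocks might be used for simple performance measurement, the fragment below times a hypothetical region of code and reports the elapsed time together with the wall-clock resolution. The routine region_of_interest() is a placeholder, not part of the proposal.

#include <stdio.h>

// Assumes the clock declarations given above.
extern void region_of_interest();   // hypothetical code being timed

void time_region()
{
    double start = ports1_WallClock();
    region_of_interest();
    double elapsed = ports1_WallClock() - start;

    // Report the resolution as well, so the reader can judge whether
    // the measurement is meaningful.
    printf("elapsed = %g s (wall-clock resolution = %g s)\n",
           elapsed, ports1_WallTick());
}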
Event Logging

Basic tracing and event logging should not be an afterthought, tacked onto a run-time specification at the end. Rather, PORTS1 should define basic tracing and profiling functions that can be quickly implemented and used throughout the run-time system in order to amass information about its behavior.

// ********************************************************
// ********************************************************
// BASIC TRACING AND PROFILING:
// ********************************************************
// ********************************************************

// Initialize tracing; must be called once per (physical) node.
void ports1_EvInit(char* filename);

// Trace an event and a parameter.
void ports1_Event(long int eventident, long int parameter);

// Terminate tracing; flush the event buffer if necessary.
void ports1_EvClose();

// Flush the event buffer on request.
void ports1_EvFlush();

Other Functions

PORTS1 will also require basic functions to describe and allocate processor and context resources at run time. Those functions will be added to the PORTS1 interface as needed to implement the interfaces described above.
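Finally, as an illustration of the tracing interface proposed above, the sketch below logs entry and exit events around a hypothetical barrier. The event identifiers and the trace file name are placeholders chosen for this example; the actual values would be defined by each implementation.

// Hypothetical use of the tracing interface.  Assumes the ports1_Ev*
// declarations given above; the event identifiers and file name are
// placeholders only.

#define EV_BARRIER_ENTER 100L
#define EV_BARRIER_EXIT  101L

void traced_barrier(long int barrier_id)
{
    ports1_Event(EV_BARRIER_ENTER, barrier_id);
    // ... perform the barrier operation ...
    ports1_Event(EV_BARRIER_EXIT, barrier_id);
}

void run_node()
{
    char tracefile[] = "ports1.trace";
    ports1_EvInit(tracefile);    // once per physical node
    traced_barrier(0L);
    ports1_EvClose();            // flush the buffer and close
}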