The contributions of the dissertation are:
The current system as described in this thesis (DSI V4.1) has been implemented sequentially. It includes a virtual machine assembler [Jes] implemented in Yacc/C, a host interface library (in C) and a set of DSI assembly modules comprising the DSI kernel and Daisy interpreter. A Daisy compiler, written in Daisy, translates Daisy code to DSI code.
I plan to fold the improvements described in this report back into the Butterfly implementation. Another possible target for a near-term port is a Silicon Graphics Challenge multiprocessor.
Some crude instrumentation is provided in the Butterfly implementation for event logging using the BBN gist utility. This has provided some limited preliminary quantitative data. Better instrumentation is needed for more accurate performance and analytic measurements for the experiments described above. Support for host-specific tools (e.g. gist) is useful, but a portable, cross-platform mechanism would be better.
Daisy shares a design philosophy with Scheme: a simple but expressive core language, without a lot of bells and whistles. The suggestions for improvement in this section reflect that spirit; they are meant to address limitations encountered in using Daisy, not to match other languages feature for feature. Solving these problems would also address some of the issues faced in implementing other languages on DSI (see section 9.2.2).
A second limitation I would like to address is the system-wide pauses caused by garbage collection. These pauses become unacceptable as heap sizes increase: interactive response time suffers, and a device handler may suffer data overruns if collection occurs during pending I/O and the host cannot provide sufficient buffering. Our system has the advantage of parallel collection in all phases, so that pause times scale with the heap size divided by the total number of processors. Nevertheless, this still allows for significant pauses. I would like to incorporate an incremental garbage collector [AAL88,Moo84,Kie86] to avoid the latency penalties incurred by the current collector. The incremental collector should preserve locality of data as in the current scheme, and should not migrate data or processes to other processor nodes.
Generational collectors are becoming commonplace in many strict applicative systems, but are likely to be ineffective for a lazy language. Generational collectors are designed around the principle that there are few references from older generations to newer generations; stores into old generations have to be monitored so that the references can be updated when newer generations are collected. This is generally true of strict languages, but lazy languages have many updates of old structures pointing to new structures. Indeed, suspending construction is predicated on this fact. Therefore, generational collection may not be a viable method for incremental collection on DSI.
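The remembered-set discipline underlying generational collection can be sketched as follows. This is a hypothetical, minimal model in Python (not DSI code; all names are mine): every store of a young object into an old one must be recorded so a minor collection can find it. Under suspending construction, the update that overwrites an old suspension with a freshly built value is exactly such a store, so nearly every convergence would hit the barrier.

```python
# Minimal sketch of a generational write barrier (hypothetical model).
class Cell:
    def __init__(self, value=None):
        self.value = value
        self.old = False          # promoted to the old generation?

remembered_set = set()            # old cells that point at young objects

def write(cell, value):
    """Store with a write barrier: record old->young pointers."""
    cell.value = value
    if cell.old and isinstance(value, Cell) and not value.old:
        remembered_set.add(cell)  # a minor GC must scan this cell

# A strict program rarely stores young data into old cells, so the set
# stays small.  Overwriting an old suspension with a new value -- the
# essence of suspending construction -- triggers the barrier on nearly
# every convergence, bloating the remembered set.
old_cell = Cell()
old_cell.old = True
write(old_cell, Cell("converged value"))
assert old_cell in remembered_set
```

The sketch makes the trade-off concrete: the barrier's cost is proportional to the frequency of old-to-young stores, which is precisely what a lazy language maximizes.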
DSI does provide a low-level exception mechanism (signals) that is compatible with an applicative solution. Signals establish an asynchronous event dependence between two processes without imposing state or control changes within those processes. A signal occurring in one process simply causes another process to be activated, presumably to handle that exception. An exception mechanism based on specifying exception dependences between expressions might fit naturally into this model.
The major obstacle to implementing this in DSI is that signals are a component of the architecture, not of processes; i.e. signals map between context windows (running processes), not suspensions. An example of how this is adapted to arbitrary processes is given by the use of signals in supporting demand-driven scheduling (see chapters 4 and 6). It might be possible to further generalize the mechanisms used for this purpose to support user-defined signals. Ideally, exception support could be introduced into the language without requiring additional state fields to be added to suspensions. However, some kernel support is probably necessary, so this can also be considered an enhancement to the DSI system.
Some potential issues to be resolved in this proposed language are whether the language should provide side-effects and what sort of I/O interfaces should be provided. If side-effects are included, some general-purpose visible synchronization constructs will need to be provided in the language. Also, the interaction of continuations across process boundaries is troublesome [KW90]. If side-effects are not provided, the language is free to be parallel wherever possible in its primitives (including speculatively) just like Daisy. As for I/O, the language could easily provide stream I/O based on the kernel's device interface, but standard Scheme also provides imperative I/O; providing a proper superset of the Scheme standard [CJR91] might be beneficial.
The major hurdle for implementing this type of language on DSI is the same one described above for providing exception handling in Daisy; namely, generalizing the signal mechanism from the architecture to arbitrary suspensions and user-defined signals.
Daisy's similarity to functional languages extends beyond laziness and lack of side-effects to include stream I/O and the use of error values. Lazy, side-effect-free programming also leads to similar styles of expressing data recursion, functional mappings and other functional programming hallmarks. At the same time, Daisy does not have many of the trappings of ``modern'' functional languages [Hud89] such as pattern matching, type inference, and algebraic or abstract data type constructors. In these areas Daisy remains very Lisp-like, using lambda forms, dynamic typing, and standard Lisp data types.
Internally, Daisy's evaluation model is also Lisp-like, using an environment-based closure model rather than the ubiquitous graph reduction, combinator or dataflow approaches common in functional languages. This point is significant, because the reduction model strongly influences the underlying view of hardware; DSI's virtual machine is fairly different from published descriptions of abstract machines for graph reduction.
This dissimilarity extends to the task model. Under graph reduction, a task is simply an available redex; this is just a pointer to a subgraph that can be reduced in parallel. The reduction machine will ``spark'' a parallel task [Jon87] by communicating this pointer to some other processor. In contrast, DSI's process decomposition occurs in list construction, not in reducing application redexes. Suspensions are implemented as fine-grained process control records rather than as graph redexes.
In summary, Daisy shares many semantic traits with functional languages due to its lazy semantics and lack of side-effects, but differs substantially in most other respects. To avoid confusion, I refer to Daisy as a quasi-functional or applicative language.
In contrast to STING, DSI is at the same time both lower-level and higher-level. It is lower-level in the sense that programming the DSI virtual machine amounts roughly to assembly programming, with registers and instructions that are very close to the actual machine, whereas STING's virtual machines, virtual processors, and so forth are all first-class objects that can be manipulated from within Scheme (in fact the system was written mostly in Scheme). DSI is higher-level in the sense that its kernel provides a more packaged solution to process control, with the kernel handling synchronization, demand-driven scheduling, useless task removal, and so forth.
Another very notable difference between STING and DSI is that DSI's processes are much finer-grained than STING's threads. Each STING thread contains a stack and a heap, and is associated (through virtual machines) with a protected address space. Thus STING's threads, while conceptually smaller than traditional operating system processes, are still fairly large-grained tasks. It is highly unlikely that one could implement suspending construction on STING, using threads for suspensions, in any efficient way, other than by re-implementing a finer-grained process package on top of it, which would obviate the purpose of using it.
DSI's low-level implementation is due to one of the original motivations of the project: to explore fine-grained list-multiprocessing at target levels. DSI's virtual machine is ``the machine'' and its kernel is the operating system on that machine. In contrast, STING is a relatively high-level environment that is implemented on stock multiprocessing operating systems in Scheme. DSI's kernel reflects the motivation to provide as much system control of parallelism as possible, so that the language implementor is simply responsible for identifying parallelism. STING's approach is to allow the programmer full control (and thus full responsibility) over process decomposition, mapping, etc.
That is about where the similarities end. The chare computation model is inherently distributed (although there is an implementation for shared memory machines). Chare programming consists of creating chares and sending messages between them. Shared data is handled on distributed processors by having the user program provide a callback routine to package any data that the kernel is about to pass to another processor, where it is unpacked. Pointers must be eliminated by the packaging procedure. On shared memory machines this is not required. The kernel can be configured to use one of several load-balancing procedures that migrate chares to neighboring processors and also to use one of several queueing strategies for messages.
It is not clear how speculative processing could be handled in the Chare kernel. Chares can be created with specific priorities, but termination is explicit (a chare kills itself). It is unclear from [KS88] whether or how chares are garbage collected (or for that matter, how storage management in general is handled).
In summary, the Chare system resembles DSI in that the process granularity is similar and both systems aim for total process resource management. Otherwise, the systems differ considerably, not only in computation model and style, but in the algorithms used for resource management.
Conceptually, both suspensions and futures represent fine-grained tasks. However, most implementation descriptions of a future's state indicate a larger granularity than that of suspensions. This is probably due to the historical fact that futures were designed primarily as an explicit eager construct, and thus can represent an arbitrarily large granularity, even though they are generally used in a fine-grained way. Suspensions, on the other hand, were intended from the outset to be a lazy object and to be created implicitly during list construction. Thus the suspension's design attempts to minimize the state profile. It is instructive to consider whether an efficient suspending construction type language could be built using delays, Multilisp's counterpart to suspensions.
Aside from granularity and scheduling semantics, futures and suspensions also differ operationally. Although they are mostly transparent in Scheme (most strict primitives will implicitly touch them), futures can be directly shared in a way that suspensions cannot. Non-strict operations are free to pass future pointers around as arguments, insert them into new structures, etc. This is because when the future is determined (their terminology) the value is stored in the future itself, until the garbage collector comes along and ``shorts out'' the value. Suspensions, on the other hand, overwrite their references upon converging. This means that suspension checking occurs earlier than future detection, which is generally tied into run-time type checking. Some implementations, such as MultiScheme, further differentiate between a placeholder (future) and task, actually separating the two entities into different objects, both of which may be manipulated within Scheme.
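The two update disciplines can be contrasted with a small model. This is a hypothetical Python sketch, not Multilisp or DSI source; the class and method names are mine. A future stores its value inside itself when determined, so shared pointers to it remain valid; a suspension instead overwrites the reference that points at it.

```python
# Sketch of the two update disciplines (hypothetical model).
class Future:
    def __init__(self):
        self.determined = False
        self.value = None
    def determine(self, v):        # value lives inside the future itself
        self.value, self.determined = v, True
    def touch(self):               # callers may still hold the future
        assert self.determined
        return self.value

class Pair:                        # a one-field "reference holder"
    def __init__(self, head):
        self.head = head

class Suspension:
    def __init__(self, thunk):
        self.thunk = thunk
    def converge(self, holder):    # overwrite the reference in place
        holder.head = self.thunk()

f = Future()
f.determine(42)
p = Pair(Suspension(lambda: 42))
p.head.converge(p)
# After convergence the suspension is gone from p entirely; a shared
# future would remain visible until the collector "shorts out" its value.
assert f.touch() == 42 and p.head == 42
```

The sketch shows why futures tolerate aliasing while suspensions do not: the suspension's identity disappears at convergence, so every live reference must be the one that gets overwritten.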
This more explicit manipulation of futures in Scheme is indicative of the language differences between Scheme and Daisy. Multilisp is a strict applicative order language (Scheme) with side-effects and explicit parallelism using futures. It also includes explicit synchronization constructs such as atomic replace, locks and semaphores. These features allow explicit process granularity, low-level heap mutation and process synchronization to be performed in Scheme, which aids the parallel Lisp implementor, since more of the system code can be written in Scheme itself. It also provides a great deal of flexibility for implementing other parallel languages on top of it, as in [JP92b], although one implicitly inherits Scheme's control structures, memory management and other artifacts with this approach.
Since these low-level hooks into the implementation exist, it is tempting to make them visible to the programmer, on the principle that it affords the programmer greater flexibility and capabilities. In fact, this is the hallmark of Lisp systems. However, this can also work to the language's detriment. It drags the language down toward the level of the implementation, exposing the gritty details of implementation and requiring the programmer to take more active control in resource management. In contrast, Daisy remains a fairly high-level language, without low-level hooks, and in doing so retains much flexibility in parallel scheduling at the language implementation level. DSI remains at a very low level, for fine control of the machine. In this light, parallel Scheme appears to be more of a middle-level language between DSI and Daisy.
This concludes my general comparison of parallel Scheme and Daisy/DSI. In the next few sections I highlight the differences in resource management between the various flavors of parallel Scheme and DSI. The information presented here is gleaned from various papers in the bibliography, cited here where appropriate.
Multilisp's memory management is based on a copying, incremental collector. Each processor has an oldspace and newspace; objects are allocated out of the processor's own newspace, which means there is no allocation synchronization and very high data locality for local processes. During garbage collection, the first processor to reach an object copies it to its own newspace, regardless of whose oldspace it was originally in. The author claims this may improve locality of reference because if the containing processor has dereferenced the object it will be entirely migrated to the copying node, which presumably still needs it. However, if there is more than one reference to an object there is no guarantee that the processor copying it is the one that will access it the most, so this may not be a valid assumption. Unlike Multilisp, DSI's allocation is distributed across processors.
Multilisp employs an ``unfair'' scheduler to constrain parallelism. Processes are placed on a local stack, oldest first; executing a future results in the parent task being stacked while the child continues to execute. When a processor runs out of tasks it looks around at other processors' stacks for a task to ``steal'' off the top. This arrangement limits task queue growth to the same order of magnitude as if the futures executed on the stack. Note, however, that this also encourages depth-first expansion of the process decomposition, since the processes higher in the process tree are buried on the stacks; these processes may be more likely to represent significant computations that should be migrated rather than leaf tasks. A good example of this is the quicksort program and other divide-and-conquer strategies. It is interesting to note that the Mul-T implementation (Multilisp's heir apparent) uses a queue for this structure. Halstead notes that under the LIFO scheduling strategy, once the system is saturated, futures complete before their parent task ever gets resumed; in essence, the future expression could have been evaluated on the stack. This may have motivated the work on lazy future creation [Moh90] used in Mul-T (see below).
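The unfair LIFO-plus-stealing arrangement can be sketched with a double-ended queue. This is a hypothetical Python model, not Multilisp code; the names are mine. Each processor pushes and pops at one end of its own deque, so a future's parent is buried while the child runs, while an idle processor steals from the opposite end, where the oldest (and likely largest) tasks sit.

```python
from collections import deque

# Sketch of Multilisp-style unfair scheduling (hypothetical model).
class Processor:
    def __init__(self):
        self.tasks = deque()

    def spawn_future(self, parent, child):
        self.tasks.append(parent)      # parent is stacked; child runs on
        return child

    def next_task(self):
        return self.tasks.pop()        # newest first: depth-first local work

    def steal_from(self, victim):
        return victim.tasks.popleft()  # oldest first: big subtrees migrate

p0, p1 = Processor(), Processor()
running = p0.spawn_future("sort whole list", "sort left half")
assert running == "sort left half"     # child continues on p0
assert p1.steal_from(p0) == "sort whole list"  # idle p1 takes the big task
```

In a divide-and-conquer program like quicksort, this is exactly the behavior the text describes: the local processor burrows depth-first while thieves pick up the large buried subcomputations.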
Miller's dissertation [Mil87] concentrates on the details of extending MIT Scheme to support futures and the other features of MultiScheme. The MultiScheme scheduler, written in Scheme and presented in the thesis, uses a set of primitive functions to pass tasks to the underlying multiprocessing engine for distribution. There are no details about the underlying resource management algorithms used in the implementation; the thesis essentially describes a different level of implementation and a different set of problems, making it difficult to draw comparisons to DSI here, although some issues such as task blocking and resumption are described.
Miller does describe the basis for speculative task recovery in MultiScheme. The system provides a special data type, weak cons cells, which have the property that the garbage collector will remove the car reference if no other references point to the same object. Thus, weak cons cells can be used to build hash tables, device lists, and other structures that must be specially collected in systems like DSI (see sections 5.3.2 and 7.3.3).
Weak cons cells are used to build the waiting (blocking) queues used for tasks waiting on the value of a future (placeholder in MultiScheme), so that any speculative task that has been dereferenced elsewhere will not be retained following a garbage collection. Osborne [Osb90] notes that the success of this approach depends on the frequency of garbage collection, although priority propagation was later added to downgrade speculative tasks between collections.
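The weak-reference discipline for speculative waiting queues can be illustrated directly with Python's `weakref` module; this is a hypothetical sketch of the MultiScheme idea, not its implementation, and the names are mine. A task blocked on a placeholder is held only weakly by the queue, so once nothing else references a speculative task, a garbage collection drops it instead of retaining it forever.

```python
import gc
import weakref

# Sketch of MultiScheme-style weak waiting queues (hypothetical model).
class Task:
    def __init__(self, name):
        self.name = name

waiting_queue = []                       # weak references to blocked tasks

def block_on_placeholder(task):
    waiting_queue.append(weakref.ref(task))

def live_waiters():
    """Tasks still reachable from somewhere other than the queue."""
    return [r() for r in waiting_queue if r() is not None]

t = Task("speculative branch")
block_on_placeholder(t)
assert len(live_waiters()) == 1          # still referenced elsewhere
del t                                    # last strong reference dropped
gc.collect()                             # recovery waits on a collection
assert live_waiters() == []              # speculative task not retained
```

The final `gc.collect()` line also makes Osborne's caveat concrete: reclamation happens only as often as collection runs, which is why priority propagation was added to downgrade speculative tasks between collections.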
MultiScheme is similar to DSI in its use of global interrupts (visible in Scheme) and barrier synchronization to organize garbage collection, although the garbage collection itself is encapsulated by a primitive function at the lower implementation level, and is not described. This example typifies the difference between Multilisp and Daisy, regarding low-level hooks into the implementation, described earlier.
Mul-T's run-time system, like most other implementations (except ours, it seems) queues blocked tasks into the task state of processes directly. No mention is made of whether weak pointers are used or speculative tasks are garbage collected, as in MultiScheme. The language does provide a grouping mechanism for debugging, and users can suspend entire groups of tasks and kill the group. Mul-T provides support for process-specific variables (i.e. global name spaces) by converting T's normal shallow binding to deep binding.
Mul-T uses a memory management scheme very similar to that described for Butterfly Portable Standard Lisp (see below). Chunks of memory are allocated from a global heap and then objects are suballocated out of the chunks. This reduces contention for the shared global pointer managing allocation in the heap as a whole, although large objects are allocated out of the heap directly. Since the Encore is a non-NUMA architecture (see sections 2.1 and 2.2.1, chapters 3 and 5), no locality considerations are made for heap allocation. Mul-T uses a parallel stop-and-copy garbage collector.
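The chunked allocation scheme can be sketched as follows. This is a hypothetical Python model of the Mul-T / Butterfly PSL style of allocation, not code from either system; the names and chunk size are mine. Processors take large chunks from the shared heap under a lock, then bump-allocate objects out of their private chunk with no synchronization, so contention on the global pointer is paid once per chunk rather than once per object.

```python
import threading

# Sketch of chunked suballocation (hypothetical model).
CHUNK = 1024                             # words per chunk (arbitrary)
heap_top = 0
heap_lock = threading.Lock()

def grab_chunk():
    """Take one chunk from the shared heap; the only synchronized step."""
    global heap_top
    with heap_lock:
        base, heap_top = heap_top, heap_top + CHUNK
    return [base, base + CHUNK]          # [next free word, limit]

def alloc(chunk, nwords):
    """Lock-free bump allocation within a processor's private chunk."""
    if chunk[0] + nwords > chunk[1]:
        chunk[:] = grab_chunk()          # private chunk exhausted
    addr = chunk[0]
    chunk[0] += nwords
    return addr

c = grab_chunk()
a = alloc(c, 2)
b = alloc(c, 3)
assert b == a + 2                        # consecutive bump-pointer addresses
```

Large objects would bypass `alloc` and take the lock directly, as the text notes for Mul-T.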
The most significant difference between DSI and Mul-T regarding allocation is that Mul-T always allocates futures locally, whereas DSI distributes suspension allocation, subject to the availability of remote suspensions. This primarily reflects the difference in load-balancing strategies (see below). For further comparison remarks to DSI's memory allocation scheme, see section 9.3.3 below.
Mul-T scheduling uses two queues per processor, one for new tasks and one for resumed blocked tasks. Blocked tasks are restarted on the processor on which they last executed, in an attempt to improve snoopy cache locality, but tasks may migrate around the system due to stealing [RHH85]. The scheduling priority is as follows:
DSI, like Mul-T, employs a multi-tier task scheduling algorithm, but the queuing strategies used in the two systems are quite different. DSI uses a load-sensitive allocation strategy for load-balancing rather than migrating tasks via stealing. An experiment is needed to determine whether this allocation strategy results in less overhead than task stealing, because processors do not have to spend time examining each other's task queues, or whether the overhead of our allocation strategy and inter-processor scheduling requests outweighs the cost of non-local context switching. DSI tasks always run locally and do not migrate, so cache locality should be good for DSI suspensions on a bus-based multiprocessor like the Encore.
Mul-T's per-processor task queues correspond closely with DSI's conservative task queue and conservative task stack (see chapter 6). DSI's use of a stack for local suspensions rather than a queue, plus the lack of stealing, may provide even more parallelism throttling than is available under Mul-T's queuing scheme. Note, however, that Mul-T uses a technique called lazy task creation [Moh90] to increase granularity of programs dynamically by inlining, and this probably reduces the need for kernel-based throttling. This technique is unlikely to be applicable to a lazy language like Daisy, which relies on suspension creation for laziness in addition to parallelism. DSI's suspensions are also likely smaller than futures.
In addition, DSI provides an additional speculative task queue and stack with corresponding lower priorities to keep speculative processes from disturbing conservative ones. DSI removes speculative tasks from the system implicitly. Kranz [DAKM89] provides no indication of how speculative computation is handled, if at all, in Mul-T. The techniques used in [Osb90] may be a possible direction for Mul-T in this regard.
Mul-T's I/O also shares similarities with DSI's device management. Mul-T uses a ``distinguished task'' dedicated to terminal I/O to avoid the problems of shared, parallel I/O; this corresponds roughly with DSI's input device manager (see chapter 7). Mul-T likely uses the imperative I/O model of Scheme, however, compared to DSI's stream I/O model; Mul-T's I/O efforts are motivated by integration with its task group debugging environment (see above). Mul-T also uses a distinguished ``exception handler task'' for each processor in the system devoted to coordinating task groups. In DSI, all exception handling is accomplished via separate processes using signals (see chapter 3); our ``distinguished exception handler task'' is the kernel. DSI signals are also used to coordinate I/O, tracing and debugging, garbage collection, suspension detection, etc.
Finally, Kranz [DAKM89] makes very good arguments for hardware support for tag-checking to accelerate future detection on stock hardware, validating similar remarks I made in section 2.1 and chapter 3.
The philosophy of the QLisp implementation seems to be oriented toward providing full programmer control over parallelism. QLisp provides functions to create, acquire, release and test locks. It uses catch and throw to explicitly kill other tasks; that is, when a processor executes a throw, all other tasks created within the corresponding catch are killed. This feature can be used to manage OR-parallel speculation.
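The catch/throw task-killing idiom can be modeled in a few lines. This is a hypothetical Python sketch of the behavior described for QLisp, not QLisp itself; the names are mine. All tasks spawned inside a catch are recorded; a throw delivers the winning value and kills the remaining siblings, giving a simple form of OR-parallel speculation where the first success cancels the rest.

```python
# Sketch of QLisp-style catch/throw task killing (hypothetical model).
class CatchScope:
    def __init__(self):
        self.tasks = []
        self.result = None

    def spawn(self, name):
        """Record a task created within this catch."""
        self.tasks.append({"name": name, "alive": True})

    def throw(self, value):
        """Deliver a result and kill every sibling task in the scope."""
        self.result = value
        for t in self.tasks:
            t["alive"] = False

scope = CatchScope()
scope.spawn("search left subtree")
scope.spawn("search right subtree")
scope.throw("found in left subtree")     # first success wins
assert scope.result == "found in left subtree"
assert all(not t["alive"] for t in scope.tasks)
```

Contrast this with DSI, where speculative tasks are removed implicitly by the system rather than killed explicitly by the program.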
Published details [GM84,GG88] are sketchy on the memory and process management used in QLisp. The language is implemented on an Alliant FX/8 multiprocessor (a non-NUMA design), which means QLisp probably does not micro-manage for locality purposes. The report states that processors contend for a shared lock on the ``free memory pool'' and that whichever processor exhausts the memory pool first performs a garbage collection while the other processors wait. The comparison remarks for DSI's memory management scheme versus QLisp are basically the same as for PSL, below.
A comparison of process management is difficult, given the lack of details on QLisp's design. However, the authors [GM84] describe the use of a single shared queue for distributing work among processors, hence the name, QLisp.
The memory management described in [MRSL88] has some resemblance to DSI's. It divides the heap into segments across all processors. Distributed allocation is accomplished by allocating chunks out of the global heap; suballocation occurs out of the chunk until it is exhausted, at which point another global chunk is allocated. Their scheme allocates a chunk at a time from shared global heap state variables. DSI also allocates in blocks, but each processor's allocation vector contains blocks for each processor so that suballocation is more distributed. Also, in our scheme blocks are requested directly from the remote processor and can only be donated by that processor; they are not allocated on demand from global shared heap variables. This supports our load-balancing scheme in which suspensions only execute locally (see section 5.2.2); under PSL, blocked tasks can be rescheduled on any processor. An interesting result in [MRSL88] supports our choice of distributed allocation of cells for the BBN Butterfly. They first tried a local allocation scheme, which fared poorly due to contention; a subsequent refinement to distributed round-robin allocation improved performance (see chapter 5).
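The contrast between PSL's shared-variable chunking and DSI's donation scheme can be sketched as follows. This is a hypothetical Python model of the DSI side as described above, not DSI code; the names and sizes are mine. Each processor keeps an allocation vector with one block per peer, and a block can only be obtained by an explicit request to its owning processor, not carved from shared global heap variables.

```python
# Sketch of DSI-style distributed block donation (hypothetical model).
class Node:
    def __init__(self, pid):
        self.pid = pid
        self.next_block = pid * 1000     # this node's own heap region
        self.alloc_vector = {}           # peer pid -> [next free, limit]

    def donate_block(self, size=100):
        """The owner hands out a block from its own region on request."""
        base, self.next_block = self.next_block, self.next_block + size
        return [base, base + size]

    def alloc_on(self, peer, nwords):
        """Allocate nwords in peer's memory via the allocation vector."""
        blk = self.alloc_vector.get(peer.pid)
        if blk is None or blk[0] + nwords > blk[1]:
            blk = peer.donate_block()    # explicit request to the owner
            self.alloc_vector[peer.pid] = blk
        addr = blk[0]
        blk[0] += nwords
        return addr

n0, n1 = Node(0), Node(1)
cell = n0.alloc_on(n1, 2)                # n0 allocates in n1's region
assert 1000 <= cell < 1100               # address lies in n1's memory
```

Because blocks come only from their owners, each processor controls what lands in its memory, which is what lets suspensions execute strictly locally.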
The garbage collector described in [MRSL88] is a monolithic, sequential (i.e. it runs on one processor) stop-and-copy design; DSI's mark-sweep collector is distributed and fully parallel in all phases. PSL uses a single global scheduling queue, which also handles load balancing; DSI uses distributed queues, executes suspensions locally, and relies on distributed allocation for load balancing. Butterfly PSL is very similar to Multilisp and similar comparison remarks apply (see section 9.3.3).