In the absence of specialized hardware, the next best thing may be to use a virtual machine. This is a common approach to the implementation of symbolic languages. A virtual machine provides a low-level abstraction of the actual hardware and a portable base for implementing a language (and, more recently, entire operating systems) across various target architectures. The level of abstraction provided by the virtual machine depends on the needs of the implementor, but generally speaking, the higher the level of abstraction, the less it remains a virtual machine and the more it becomes a native implementation of the language. DSI's virtual machine is quite low level, with registers and instructions that correspond closely to those on the target machine, as opposed to intermediate-level instructions that would be further compiled, such as in [FM90].
The special needs of a language or its computation model are often distilled into specific features or instructions in the virtual machine to facilitate and accelerate execution. This aspect of the virtual machine approach can identify architecture support that would be useful for accelerating language execution on actual hardware [Tra84,Veg84]. For example, dynamic tag-checking has long been considered a prime candidate for hardware support of Lisp programs [SH91]. Such a feature would be a relatively minor enhancement to current microprocessor designs. By some estimates it could increase performance of Lisp programs by as much as 35 percent, at a cost of an additional 2 percent of CPU logic [Joh90]. By analyzing the design of DSI's virtual machine we can determine those architecture features that would accelerate the execution of fine-grained parallel symbolic programs; in particular, programs based on the suspending construction model of execution. We describe the virtual machine in terms of its components: the register model, memory interface, instruction set, exception handling mechanism, and processor interconnection network. This provides a basis for comparison against their counterparts in current architectures.
Some of the important issues surrounding the design of the virtual machine are:
An important issue in parallel architecture design is the processor-memory interconnection model. Designs fall into two categories: shared or distributed memory. Although there are certainly examples of distributed symbolic language implementations, they are the exception rather than the norm. The extensive sharing relationships of most heap-oriented reduction models make a strong argument in favor of true shared memory. Lazy languages promote even more sharing, which further tips the scales toward shared memory. Thus we can assume shared memory, but that still leaves a lot of room for design variations.
We distinguish between two shared memory designs: those in which accesses to memory from any processor take constant time (disregarding interprocessor contention and caching) and those in which they do not. Multiprocessor architectures of the latter type are called Non-Uniform Memory Access (NUMA) designs [AG94]. NUMA architectures are generally more scalable, since they are built around a many-processor-to-many-memory interconnection network [LD88], as opposed to several processors attached to one or more memory banks on a bus, whose bandwidth limits scalability. DSI's kernel is designed to handle both types of architecture.
Symbolic languages have to reimplement their own resource management kernel because operating systems oriented toward traditional languages do not adequately address the resource management needs of symbolic languages. The differences can be summed up along memory, process and device management lines:
The traditional implementation of a kernel is a protected layer of code underlying every process, accessed through a function-call-style interface; data is passed between user process and kernel on the stack. Modern microkernels are stripped-down kernels that provide just the essential resource management described above, relegating all other services to external implementation (a RISC-like minimalist philosophy applied to operating system design). In addition, these microkernels are often implemented as separate processes rather than in the traditional way described above; programs interact with the kernel through a form of interprocess communication.
Another issue in multiprocessor kernel design is whether a single kernel controls all the processors in a sort of master-slave design, or whether the kernel is a reentrant, distributed design that runs on each processor. In the latter case, some mechanism is required to communicate between kernels on each processor so that they can operate as a whole to manage the machine.
Parallel memory management adds additional complexity to the problems of fine-grained allocation and reclamation. The most fundamental consideration in parallel memory management is the physical organization of memory; it affects all other implementation decisions. The distinction between NUMA and non-NUMA architectures is important because on NUMA architectures memory access patterns have a significant effect on performance. These concerns constrain the way the heap is allocated, how objects are allocated within the heap, and how processors cooperate to manage the heap. In this regard, there are two relevant considerations for NUMA architectures. One is the presence of network ``hot spots'' [PCYL87], where excess memory contention between processors can degrade performance (non-NUMA designs also suffer from memory contention, but it is mitigated by heavy use of coherent caching). A second concern with NUMA architectures is locality: by this we refer to a processor's memory accesses in relation to its distance from other memories and its position in the interconnection network. This concern is related to the hot-spot issue, because good locality minimizes contention, but it is also a more general concern about achieving good memory access times in a system where not all memory accesses are equal. For example, on the BBN Butterfly multiprocessor there is an average 4-to-1 difference in access times between a local memory access and a network memory access, even in the absence of contention. We would like to take advantage of locality where the opportunity presents itself, while still retaining the advantages of heap sharing. In a traditional system these concerns impact the programmer; in symbolic systems they impact the implementation of automatic storage allocation and reclamation schemes.
Here is a partial listing of issues which are addressed in chapter 5:
The issue of process granularity is related to process creation: if creation is explicit in the surface language (e.g. futures [RHH85]), the grain size is generally larger than if process creation is implicit, as in Daisy; in either case the processes are finer grained and less stateful than in traditional languages. This finer granularity leads to much more frequent context switching, which must be handled efficiently.
A related issue is the problem of controlling excessive parallelism. Once the machine is fully utilized it is counterproductive to schedule additional tasks. The additional parallelism causes higher peak memory usage and increased context switching, reducing throughput. If the language has eager semantics, one solution to the problem is to revert to in-lining the computation [Moh90], thus increasing the grain size. If the language is lazy, there may be less choice in reducing granularity, but also less parallelism to contend with. In either case, the kernel should provide a way to automatically control the growth of parallelism in the system. Similarly, parallel symbolic systems are expected to load balance themselves, so that all processors are kept busy. The primary issue here is whether the system migrates processes and if so, how this affects locality.
Two concerns that eventually confront a parallel language implementor are how to handle interprocess communication and synchronization. Interprocess communication for shared-memory symbolic languages is implicitly handled by heap sharing. Languages with explicit side-effects need to provide synchronization constructs (locks, mutexes, etc.) to arbitrate read/write access between processes for mutually shared structures; pure applicative languages like Daisy do not. However, another type of implicit synchronization is required even for applicative languages to handle the dependences that arise between processes. When a running process probes a suspension it sets up a dynamic dependence between the probing process and the suspension. The probing process cannot make any headway until its demand is satisfied. The kernel must transparently schedule the target suspension and handle the execution synchronization between the two processes (see section 1.3). Two broad strategies for this synchronization are possible: polling and blocking. The method used can have a large impact on performance; blocking synchronization, while generally considered more efficient, has inherent problems (see below) that require workarounds elsewhere in the system.
An idea that has gained momentum in symbolic languages in recent years is speculative computation [Bur85a,Bur85b]. Speculative parallelism refers to scheduling parallel tasks that may not be needed (on the chance that they might be) in order to increase available parallelism beyond that provided by ordinary conservative parallelism [Jon87]. For example, the two arms of a conditional might be scheduled in parallel with the evaluation of the predicate, so that the system has a head start on the overall result of the conditional regardless of the outcome of the predicate.
The presence of speculative processes complicates scheduling; it exacerbates the problem of controlling parallelism and introduces the problem of removing speculative tasks that have become useless, both from the schedule and from the heap. There is also the question of establishing the priority of conservative tasks over speculative tasks for resource allocation decisions; otherwise speculative tasks can reduce throughput for needed processes. Distinguishing between conservative and speculative tasks is complicated by heap sharing, which can lead to multiple perspectives on whether a task is actually needed, and by the fact that a task's status can change dynamically over the course of its life. Finally, systems with side effects must also decide whether to abort or to ``roll back'' side effects that are the result of useless speculation [Osb90].
One solution to these problems (assuming that the speculative tasks can be distinguished) is to spawn a ``termination process'' to kill them [GP81,Hem85]. Another approach is to implement a priority mechanism that dynamically upgrades or downgrades the status of a speculative process and its descendants [Osb90]. This is usually coupled with a suitably modified garbage collector that removes pointers to speculative tasks from the scheduling infrastructure once they have become unreferenced from the program's data structures [BH77,Mil87].
One of the problems with implementing the stream model is that most I/O is not inherently demand-driven, even though data may be accessed that way. I/O is, in fact, event-driven at the lowest level of implementation, and it is the responsibility of the kernel's device drivers to bridge the gap between the demand-driven access and the event-driven I/O. DSI's kernel uses the interrupt mechanism of the DSI virtual machine to handle event-driven I/O using a special type of device driver process. This process is unique in that it can be scheduled by a probe to release its data (demand-driven) or scheduled by an I/O event occurring on the system (event-driven) to service interrupts in a timely fashion.
DSI's device manager allows some types of I/O, like keyboard input, to be handled at a very low level. Keyboard input is implemented as a stream of raw characters. To share this stream among multiple processes it becomes necessary to split the stream on a demand-driven basis, and we describe how this can be accomplished, along with filtering and interleaving of prompts.
Another device problem that is common to symbolic languages is that of dangling descriptors. This refers to an I/O device that is left in an ``open'' state by a process that has become unreferenced. The problem may manifest itself in slightly different forms depending on the I/O model and constructs of the surface language, but it is common to all lazy language implementations, as well as to imperative languages that allow speculative computation. A common solution is to endow the garbage collector with the capability to ``close'' the device, possibly through a generic finalization facility [Mil87]. This problem is an artifact of an underlying imperative model of I/O and could be avoided using a functional file system [HW92].
The stream model of computing, while elegant, has its limitations. One of these is the difficulty of handling interactive terminal I/O, which must be synchronized by artificial strictness measures [HS88]. This problem may quickly become irrelevant with the advent of modern GUI windowing shells: the stream interface turns out to be very adept at handling a windowing interface, and DSI's approach uses a client-server model to handle character-based interaction with multiple input windows.
Lambda forms use a syntax reminiscent of the lambda calculus:
    \ formals . body

where formals is a single identifier or an arbitrarily bushy formal argument list, and body is a Daisy expression. Assignment can occur at top level only, and is specified by the form

    identifier = value

So, for example,

    add5 = \ x. add:[x 5]

defines the function add5.
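To make the application syntax concrete, here is a small hedged sketch; the identifier plus and the numeric results are ours, assuming add denotes ordinary addition:

    plus = \ [x y]. add:[x y]

    add5:7
    plus:[3 4]

Under these definitions, add5:7 reduces to add:[7 5], yielding 12, and plus:[3 4] yields 7, illustrating application to a single identifier and to a bushy formal list, respectively.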
Numbers are expressed in standard decimal or floating point notation. Literals (identifiers and strings) are expressed as an alphanumeric sequence beginning with a letter; optionally, double quotes can be used to quote literals with embedded spaces and other characters.
    let: [formals actuals body]

extends the environment by locally binding the formals to the actuals, both of which may be arbitrarily bushy structures; body is evaluated in the extended environment. The rec form is similar, but creates recursive bindings; it resembles Scheme's letrec, but is fully general, allowing data recursions. For example,

    rec: [L [0 ! map:[inc L]] L]

specifies the infinite list of integers 0, 1, 2, ...: the head of L is 0, and its tail is obtained by incrementing each element of L itself.
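As a further illustration of binding (the particular bindings below are ours), a let such as

    let: [[x y] [3 4] add:[x y]]

binds x to 3 and y to 4 and then evaluates add:[x y] in the extended environment, yielding 7.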
Many demand patterns allow parallelism. For example, the add operator referred to earlier is strict in both of its arguments, which could therefore be evaluated in parallel. In general, this is true of all strict Daisy primitives that take lists as arguments. This concurrency is possible because Daisy lacks side effects, a property that allows suspensions to execute in any order. Daisy does not explicitly specify whether most of its primitives actually coerce their arguments in parallel; that decision is left to the implementation.
Some Daisy primitives do imply concurrency semantics. seq, from above, implies sequential coercion of its argument. The only Daisy primitive that directly implies concurrency is set: it returns a new list of values corresponding to the order in which the suspensions in its argument list converge. This implies that the argument expressions will be evaluated concurrently, at least until the first one converges; if the tenth element of the result is requested, one can assume that the expressions will be evaluated concurrently at least until ten of them have converged, and so forth.
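As a hedged illustration (the particular argument expressions are ours), consider

    set:[add:[1 2] add:[3 4] add:[5 6]]

The result contains the three sums 3, 7 and 11, but in the order in which their suspensions converge: whichever addition finishes first supplies the first element of the result, and so on.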
Many Daisy primitives have potential concurrency. add is an example of conservative parallelism [Jon87]; Daisy has many more opportunities for speculative parallelism [Jon87].
For example, a speculative OR-parallelism can be specified by a form that returns true if any of its arguments are non-nil; similarly, a speculative AND-parallelism is a form that returns nil if any of its arguments are nil.
Even if is potentially (speculatively) parallel: if can speculatively schedule all its then, else, and subsequent (all but the first) predicate arguments, and evaluate the first predicate sequentially.
There is also abundant potential parallelism in the mapping primitives, which are higher-order: such a primitive returns a closure that maps a function over a list, and that mapping, of course, can be entirely parallel.
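For instance, using map and inc as they appear in the rec example above (the particular list is ours),

    map:[inc [1 2 3]]

yields the list 2, 3, 4, each element of which is an independent suspension that could be coerced concurrently.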
Many other potential sources of concurrency exist in Daisy; this section has only illustrated how the language affords abundant implicit parallelism. The DSI kernel is charged with managing this concurrency so that it does not overwhelm the machine; Daisy primitives can freely schedule their arguments without worrying about resource management issues. Chapter 6 explains how the DSI kernel throttles parallelism.
For this reason, Daisy also includes parallelism annotations. Recall from chapter 1 that we are not philosophically opposed to parallelism annotations, but rather to the inclusion of side effects, which limit opportunities for implicit parallelism.