2. A Primer

This chapter is an introduction to the topics discussed in the rest of the thesis. The sections below mirror the organization of chapters. Each section provides a short introduction to the topics within and raises some pertinent issues which are addressed in the chapters that follow.

    
2.1 Virtual Machine Issues

In the late 1970s and 1980s it was popular to design (and even to build) architectures tailored to a specific language. Today, the construction of hardware tailored to non-traditional programming languages is out of vogue. Economics, coupled with the rapid pace of RISC technology in the mainstream microprocessor industry, has discouraged the development of specialized hardware for symbolic processing (witness the slow demise of the Symbolics Lisp machine). The reasons outlined above apply doubly to multiprocessors: very few symbolic multiprocessing machines have ever been built, and fewer still are on the way. The prospects for parallel symbolic processing are still bright, however, because stock hardware turns out to be reasonably good at supporting the implementation of parallel symbolic languages.

In the absence of specialized hardware, the next best thing may be to use a virtual machine. This is a common approach to the implementation of symbolic languages. A virtual machine provides a low-level abstraction of the actual hardware and a portable base for implementing a language (and, more recently, entire operating systems) across various target architectures. The level of abstraction provided by the virtual machine depends on the needs of the implementor, but generally speaking, the higher the level of abstraction, the less it remains a virtual machine and the more it becomes a native implementation of the language. DSI's virtual machine is quite low-level, with registers and instructions that correspond closely to those on the target machine, as opposed to intermediate-level instructions that would be compiled further, as in [FM90].

The special needs of a language or its computation model are often distilled into specific features or instructions in the virtual machine to facilitate and accelerate execution. This aspect of the virtual machine approach can identify architecture support that would be useful for accelerating language execution on actual hardware [Tra84,Veg84]. For example, dynamic tag-checking has long been considered a prime candidate for hardware support of Lisp programs [SH91]. Such a feature would be a relatively minor enhancement to current microprocessor designs; by some estimates it could increase the performance of Lisp programs by as much as 35 percent at a cost of an additional 2 percent of CPU logic [Joh90]. By analyzing the design of DSI's virtual machine we can determine those architecture features that would accelerate the execution of fine-grained parallel symbolic programs; in particular, programs based on the suspending construction model of execution. We describe the virtual machine in terms of its components: the register model, memory interface, instruction set, exception handling mechanism, and processor interconnection network. This provides a basis for comparison with their counterparts in current architectures.

Some of the important issues surrounding the design of the virtual machine are:

    An important issue in parallel architecture design is the processor-memory interconnection model. Designs fall into two categories: shared or distributed memory. Although there are certainly examples of distributed symbolic language implementations, they are the exception rather than the norm. The extensive sharing relationships of most heap-oriented reduction models make a strong argument in favor of true shared memory. Lazy languages promote even more sharing, which further tips the scales toward shared memory. Thus we can assume shared memory, but that still leaves a lot of room for design variations.

We distinguish between two shared-memory designs: one in which accesses to memory from any processor take constant time (disregarding interprocessor contention and caching), and one in which they do not. Multiprocessor architectures of the latter type are called Non-Uniform Memory Access (NUMA) designs [AG94]. NUMA architectures are generally more scalable, since they are designed around a many-processor-to-many-memory network [LD88], as opposed to several processors attached to one or more memory banks on a bus, whose bandwidth limits scalability. DSI's kernel is designed to handle both types of architecture.

     
2.2 Kernel Issues

At the core of most parallel symbolic language implementations is a kernel that handles the low-level dynamic resource management for processes running on the system. This resource management typically includes the allocation and sharing of memory, processors and devices. This is also what an operating system provides; the distinction between a kernel and an operating system is that the latter also includes many services and administration tools for users, including a file system for managing disk space, quotas and user accounts, login sessions, etc.

Symbolic languages have to implement their own resource management kernel because operating systems oriented toward traditional languages do not adequately address the resource management needs of symbolic languages. The differences can be summed up along memory, process, and device management lines.

The traditional implementation of a kernel is a protected layer of code underlying every process, accessed through a function-call interface; data is passed between user process and kernel on the stack. Modern microkernels are stripped-down kernels providing just the essential resource management described above, relegating all other services to external implementation (a RISC-like minimalist philosophy applied to operating system design). In addition, these microkernels are often implemented as a separate process rather than in the traditional layered fashion described above. Programs interact with the kernel through a form of interprocess communication.

Another issue in multiprocessor kernel design is whether a single kernel controls all the processors in a sort of master-slave design, or whether the kernel is a reentrant, distributed design that runs on each processor. In the latter case, some mechanism is required to communicate between kernels on each processor so that they can operate as a whole to manage the machine.

    
2.2.1 Memory Management

Memory management in symbolic systems is fundamentally different from that in conventional languages and operating systems. In the latter, ``memory management'' is almost solely concerned with implementing demand-paged virtual memory; processes are required to do their own memory allocation and reclamation, although the system is expected to recover the physical memory used by a process when it terminates. In contrast, memory management for symbolic languages amounts to handling automatic storage allocation and reclamation. This may or may not be addressed in the context of virtual memory and protected address spaces, for reasons discussed in chapter 4. Allocation is an important issue for parallel symbolic languages because these languages allocate and recover memory in smaller granules and at a greater rate than traditional languages do. This is particularly true of pure applicative languages, which discourage the notion of state.

Parallel memory management compounds the problems of fine-grained allocation and reclamation. The most fundamental consideration in parallel memory management is the physical organization of memory; it affects all other implementation decisions. The distinction between NUMA and non-NUMA architectures is important because on NUMA architectures memory access patterns have a significant effect on performance. These concerns constrain the way that the heap is allocated, how objects are allocated within the heap, and how processors cooperate to manage the heap. In this regard, there are two relevant considerations for NUMA architectures.

One is the presence of network ``hot spots'' [PCYL87], where excess memory contention between processors can degrade performance (non-NUMA designs also suffer from memory contention, but it is mitigated by heavy use of coherent caching). A second concern with NUMA architectures is locality, by which we refer to a processor's memory accesses in relation to its distance from the various memories and its position in the interconnection network. This concern is related to the hot-spot issue, because good locality minimizes contention, but it is also a more general concern about achieving good memory access times in a system where not all memory accesses are equal. For example, on the BBN Butterfly multiprocessor there is an average 4-to-1 difference in access times between a local memory access and a network memory access, even in the absence of contention. We would like to take advantage of locality where the opportunity presents itself, while still retaining the advantages of heap sharing. In a traditional system these concerns fall on the programmer; in symbolic systems they fall on the implementation of the automatic storage allocation and reclamation schemes.

These issues, among others, are addressed in chapter 5.

   
2.2.2 Process Management

The second major area of support that a kernel provides is process management. The kernel provides system calls for process scheduling and termination, and manages processes over the course of their life-cycles. In eager parallel systems, process scheduling is implicitly associated with creation; in lazy languages like Daisy, scheduling is deferred until the suspension is probed (demand-driven scheduling) or until it is explicitly scheduled by a concurrent language primitive.

The issue of process granularity is related to process creation; if creation is explicit in the surface language (e.g. futures [RHH85]), the grain size is generally larger than if process creation is implicit, as in Daisy. In either case processes are finer-grained and less stateful than in traditional languages. This finer granularity leads to increased context switching, which must be handled efficiently.

A related issue is the problem of controlling excessive parallelism. Once the machine is fully utilized, it is counterproductive to schedule additional tasks; the extra parallelism causes higher peak memory usage and increased context switching, reducing throughput. If the language has eager semantics, one solution is to revert to in-lining the computation [Moh90], thus increasing the grain size. If the language is lazy, there may be less choice in reducing granularity, but also less parallelism to contend with. In either case, the kernel should provide a way to automatically control the growth of parallelism in the system. Similarly, parallel symbolic systems are expected to load-balance themselves, so that all processors are kept busy. The primary issue here is whether the system migrates processes and, if so, how this affects locality.

Two concerns that eventually confront a parallel language implementor are how to handle interprocess communication and synchronization. Interprocess communication for shared-memory symbolic languages is implicitly handled by heap sharing. Languages with explicit side-effects need to provide synchronization constructs (locks, mutexes, etc.) to arbitrate read/write access to mutually shared structures; pure applicative languages like Daisy do not. However, another type of implicit synchronization is required even for applicative languages, to handle the dependences that arise between processes. When a running process probes a suspension, it sets up a dynamic dependence between the probing process and the suspension; the probing process cannot make any headway until its demand is satisfied. The kernel must transparently schedule the target suspension and handle the execution synchronization between the two processes (see section 1.3). Two broad strategies for this synchronization are possible: polling and blocking. The method used can have a large impact on performance; blocking synchronization, while generally considered more efficient, has inherent problems (see below) that require workarounds elsewhere in the system.
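Anticipating the Daisy semantics of section 2.3, a strict primitive gives a simple sketch of such a dependence (expensive1 and expensive2 stand for hypothetical long-running expressions):

        add:[ expensive1 expensive2 ]

The bare list builds suspensions for the two expressions; the strict add probes them, and the kernel must schedule each suspension and synchronize the prober, by blocking or polling, until its operands converge.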

An idea that has gained momentum in symbolic languages in recent years is the concept of speculative computation [Bur85a,Bur85b]. Speculative parallelism refers to scheduling parallel tasks that may not be needed (on the chance that they might be) to increase available parallelism over that provided by regular conservative parallelism [Jon87]. For example, the two arms of a conditional might be scheduled in parallel with the evaluation of the predicate, so that the system has a head start on the overall result of the conditional regardless of the outcome of the predicate.

The presence of speculative processes complicates scheduling; it exacerbates the problem of controlling parallelism and introduces the problem of removing speculative tasks that have become useless, both from the schedule and from the heap. There is also the question of establishing the priority of conservative tasks over speculative tasks in resource allocation decisions; otherwise speculative tasks can reduce throughput for needed processes. Distinguishing between conservative and speculative tasks is complicated by heap sharing, which can lead to multiple perspectives on a task's actual need, and by the fact that a task's status can change dynamically over the course of its life. Finally, systems with side-effects must also decide whether to abort or to ``roll back'' side-effects that are the result of useless speculation [Osb90].

One solution to this problem (assuming that the speculative tasks can be distinguished) is to spawn a ``termination process'' to kill them [GP81,Hem85]. Another approach is to implement some kind of priority mechanism that dynamically upgrades or downgrades the status of a speculative process and its descendants [Osb90]. This is usually coupled with a suitably modified garbage collector that removes pointers to speculative tasks from the scheduling infrastructure once they have become unreferenced from the program's data structures [BH77,Mil87].

       
2.2.3 Device Management

The third major area of kernel support is device management. A number of lazy languages use a stream model of I/O [Hud89], in which data is read from or written to devices in streams. This model elegantly handles the interaction with demand-driven computation, which would be difficult with an imperative, temporal I/O interface such as that used in Lisp. In DSI this support is integrated into the kernel, so that a seamless stream I/O interface is available to any language implemented on the kernel without that language having to implement it itself.

  One of the problems with implementing the stream model is that most I/O is not inherently demand-driven, even though data may be accessed that way. I/O is, in fact, event-driven at the lowest level of implementation, and it is the responsibility of the kernel's device drivers to bridge the gap between the demand-driven access and the event-driven I/O. DSI's kernel uses the interrupt mechanism of the DSI virtual machine to handle event-driven I/O using a special type of device driver process. This process is unique in that it can be scheduled by a probe to release its data (demand-driven) or scheduled by an I/O event occurring on the system (event-driven) to service interrupts in a timely fashion.

DSI's device manager allows some types of I/O, like keyboard input, to be handled at a very low level. Keyboard input is implemented as a stream of raw characters. To share this stream among multiple processes, it becomes necessary to split the stream on a demand-driven basis; we describe how this can be accomplished, along with the filtering and interleaving of prompts.

Another device problem common to symbolic languages is that of dangling descriptors: an I/O device left in an ``open'' state by a process that has become unreferenced. The problem may manifest itself in slightly different forms depending on the I/O model and constructs of the surface language, but it is common to all lazy language implementations, as well as to imperative languages that allow speculative computation. A common solution is to endow the garbage collector with the capability to ``close'' the device, possibly through a generic finalization facility [Mil87]. The problem is an artifact of an underlying imperative model of I/O, and could be avoided using a functional file system [HW92].

The stream model of computing, while elegant, has its limitations. One of these is the difficulty of handling interactive terminal I/O, which must be synchronized by artificial strictness measures [HS88]. This problem may quickly become irrelevant with the advent of modern GUI windowing shells. The stream interface turns out to be very adept at handling a windowing interface; DSI's approach uses a client-server model to handle character-based interaction with multiple input windows.

     
2.3 Daisy

Although this thesis is primarily about how suspending construction-based parallelism is managed in our DSI implementation, it is helpful to understand how this parallelism arises in the surface language. In this section we introduce Daisy and describe how suspending list construction gives rise to parallel programs. This description is necessarily brief; for a fuller description of the language, see The Daisy Programming Manual [Joh89b]. Daisy is a descendant of pure Lisp and is a contemporary of Scheme. It is a statically-scoped, lazy, applicative language. It provides a few atomic objects, such as numbers and literals (atoms). Lists glue together atomic objects and other lists to create complex structures.

   
2.3.1 Syntax

Daisy's syntax is a departure from standard Lisp S-expressions, yet is still relatively simple. Brackets [ ] are used to denote list structure and, combined with ! (like Lisp's .), create dotted pairs. The default interpretation of a bare list is evaluation/list construction, not application; application is expressed by an infix colon. So, for example,

        mpy:[x y]

means to multiply the values of x and y. Expressions can be quoted with the hat character (^) to suppress evaluation.

Lambda forms use a syntax reminiscent of the lambda calculus:

        \ formals . body
where formals is a single identifier or an arbitrarily bushy formal argument list, and body is a Daisy expression. Assignment can occur at top level only, and is specified by the form
        identifier = value
So, for example,
        add5 = \ x. add:[x 5]
defines the function add5.
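The function is applied with the infix colon; for example,

        add5:3

binds x to 3 in the body and should evaluate to 8.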

Numbers are expressed in standard decimal or floating point notation. Literals (identifiers and strings) are expressed as an alphanumeric sequence beginning with a letter; optionally, double quotes can be used to quote literals with embedded spaces and other characters.

2.3.1.1 Special Forms

Daisy has two main local binding forms that extend the environment for the scope of the form.
        let: [formals actuals body]
extends the environment by locally binding the formals to actuals, both of which may be arbitrarily bushy structures; body is evaluated in the extended environment. The rec form is analogous, but creates recursive bindings. It is somewhat like Scheme's letrec, but is fully general, allowing data recursions. For example,
        rec: [L [0 ! map:[inc L]] L]
specifies the infinite list of integers 0, 1, 2, ....
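A concrete use of let in the same style (a small sketch, using the add primitive seen earlier):

        let: [[a b] [1 2] add:[a b]]

binds a to 1 and b to 2, and should evaluate to 3.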

2.3.1.2 An Example

Here is an example of how the Quicksort program might be written in Daisy.
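The sketch below assumes several helper operations: nil? (an emptiness test), le? (a numeric comparison), not, keep (a filter), and append (list concatenation). These names are illustrative, not confirmed Daisy primitives, and the multi-armed if form is described in section 2.3.2:

        quick = \L.
          if:[ nil?:L
               []
               let:[ [p ! R] L
                     append:[ quick: keep:[\x. le?:[x p] R]
                              [p ! quick: keep:[\x. not: le?:[x p] R]] ] ] ]

The let destructures the list into a pivot p and remainder R via !; since the recursive calls occur inside list construction, they are suspended and the two halves of the sort can be coerced in parallel under demand.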

   
2.3.2 Semantics

Daisy's laziness is a byproduct of list construction, not lazy interpretation or compilation [FW76a] (there are exceptions, such as the binding forms described above). The evaluation of a list of expressions yields a list of suspensions for evaluating those expressions; in essence, lists are suspended-process builders. Application is strict; the application of a primitive to a list applies a demand pattern to the list, coercing some of the suspensions in it. The demand pattern of a primitive depends on the semantics and implementation of the primitive. For example,

        seq: [ e1 e2 ... en ]

is a sequencer, which coerces the suspensions for e1 through en sequentially from left to right.

  Many demand-patterns allow parallelism. For example, the add operator referred to earlier is strict in both its arguments, which could be evaluated in parallel. In general, this is true of all strict Daisy primitives that take lists as arguments. This concurrency is possible because Daisy lacks side-effects, a property which allows suspensions to execute in any order. Daisy does not explicitly specify whether most of its primitives actually coerce arguments in parallel, since

1.
the definition of these operators does not imply concurrency, even though the opportunity exists, and
2.
this leaves the choice up to the implementation: on a sequential machine the arguments would not be coerced in parallel, but on a parallel machine they might be.

Some Daisy primitives do imply concurrency semantics. seq, from above, implies sequential coercion of its argument. The only Daisy primitive that directly implies concurrency is set:

        set: [ e1 e2 ... en ]

set returns a new list of values corresponding to the order in which the suspensions in its argument list converge. This implies that the expressions will be evaluated concurrently, at least until the first one converges. If the tenth element of the result is requested, one can assume that the expressions will be evaluated concurrently at least until ten of them have converged, and so forth.
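For instance, in the sketch below, slow and fast stand for hypothetical expressions with long and short running times, respectively:

        set: [ slow fast ]

The first element of the result should be the value of fast, since it converges first.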

Many Daisy primitives have potential concurrency. add is an example of conservative parallelism [Jon87]; Daisy has many more opportunities for speculative parallelism [Jon87]. For example, a speculative OR-parallelism can be specified by a primitive that returns true if any of its arguments are non-nil; similarly, a speculative AND-parallelism is specified by a primitive that returns nil if any of its arguments are nil.

Even if is potentially (speculatively) parallel: if can speculatively schedule all of its then, else, and subsequent (all but the first) predicate arguments, evaluating only the first predicate sequentially.
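Schematically, given the argument structure implied above (a sketch, with predicates p1 and p2, corresponding arms t1 and t2, and a final else arm e):

        if: [ p1 t1 p2 t2 e ]

Only p1 must be coerced immediately; t1, t2, e, and p2 are all candidates for speculative scheduling.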

There is also abundant potential parallelism in the mapping primitives. map, seen earlier, is an example of a higher-order primitive: it returns a closure that maps a function over a list. That mapping, of course, can be entirely parallel.

Many other potential sources of concurrency exist in Daisy; as this section illustrates, the language affords vast amounts of latent, implicit parallelism. The DSI kernel is charged with managing this concurrency so that it does not overwhelm the machine; Daisy primitives can freely schedule their arguments without worrying about resource management issues. Chapter 6 explains how the DSI kernel throttles parallelism.

   
2.3.3 Annotation

As we have seen, Daisy has large amounts of latent parallelism. Nevertheless, there are cases where it cannot be exploited due to the limitations of laziness and demand-driven computation. Chapter 8 explains how these situations can occur.

For this reason, Daisy also includes parallelism annotations. Recall from chapter 1 that we are not philosophically opposed to parallel annotation, but rather to the inclusion of side-effects that limit opportunities for implicit parallelism.

