Recall that shared memory systems present the user with a (logically) single address space. That means each processor accesses what appears to be a single memory system, rather than each addressing its own memory system.
Although shared memory machines do not require the programmer to explicitly partition data amongst processors, achieving good performance on them still requires some mental assignment of data to processes - for cache data locality if nothing else.
Shared memory programming approaches loosely fall into three categories:
Several libraries provide multithreading capabilities: Java threads, Solaris threads, Mach threads, P4 (Parmacs), NT threads. We will use POSIX standard threads, also called pthreads. For most of this course, "thread" can be taken to mean a pthread. Most vendors have built pthreads on top of one of their existing thread libraries - much the way MPI is built on a variety of native communication libraries. And like MPI, this provides a single API, portable between machines.
User level threads are used for all forms of concurrency. Places where they are particularly useful include
Because multiple threads can share a single process, it is important to know how process resources are allocated amongst them. The general rule is: everything possible is shared by the threads within a single process. Consider the figure below, which shows the memory layout of a typical Unix process with a main function and two other functions inside main(). The program code itself is typically called the "text" section of an assembly routine. Below that are any global variables. Above the text section are the two parts which can grow dynamically: the heap section, which malloc() or new use for dynamic memory management, and a stack, used for pushing down stack frames. These frames keep track of where the program is in possibly several layers of function calls. The figure has the program currently executing in function f1(), so both main() and f1() have stack frames on the stack.
In addition to the process in memory, the operating system keeps some associated data for the process:
Pthreads separates out the process data, which is accessible to and shared by all the threads, from the thread data, which is typically just what is needed to allow multiple control streams. That means a separate stack (since the threads may be in different layers or even different places in the program) and a separate set of registers. Automatic variables (ones allocated dynamically on entry) in the start function and its descendants have separate copies for each thread. All else is shared by the threads. This is shown in the next figure, which has two threads indicated in the process.
Several problems occur with this model. Among them:
This way of getting parallelism in a process should be compared with just using fork() to spawn off another process. The Unix fork() function
int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                   void *(*start_ftn)(void *), void *arg);

In this,
rtnval = pthread_create( ... );
if (rtnval != 0) {
    printf("Unable to create thread ... ");
    if (rtnval == EINVAL)
        printf("bad arguments to pthread_create()\n");
    else if (rtnval == EAGAIN)
        printf("not enough resources available for pthread_create()\n");
    exit(-1);
}

The error codes EINVAL and EAGAIN used above are defined in

#include <errno.h>

Depending on the system you are using, perror() may also be available. However, in general there is no guarantee that a thread will set the variable errno, or that perror() will be usable. So don't count on it; use the above code fragment's style of working with the returned value. Note that the above executes exit(-1), which kills the process - and hence all threads associated with that process. In some situations you may instead want to kill just the thread, or wait and try again later, etc.
A thread terminates when the end of the start function is reached and a return is executed from it, or when you explicitly call pthread_exit(void *val). The second form allows you to return a value from the thread.
The invoking thread can synchronize on the completion of the created thread by calling
pthread_join(pthread_t thread, void **val);

The calling thread waits until the specified thread terminates, and the value returned by that thread is stored in *val. By default, threads are joinable in this fashion. You can specify instead that a thread is detached, in which case its exit state and return value are not saved. This allows the OS to reclaim all resources associated with the detached thread as soon as it completes. After a thread is created it can be changed to the detached state via pthread_detach(thread). Another way is to do this at thread creation time using thread attributes, to be covered later.
The first example is a simple stride-1 dotproduct, using two threads to compute different segments of the vectors. A problem is the need to have a single argument to the thread starting function, so we have to pack up the usual three arguments into a single dp_args structure.
typedef struct {
    int length;
    double *x;
    double *y;
} dp_args;
...
double dotprod(int n, double x[], double y[])
{
    dp_args seg;
    pthread_t chunk[2];     /* will start only two threads */
    int retval = 0;
    double sum = 0.0;
    double *val;

    val = (double *) malloc(sizeof(double));

    seg.length = n/2;
    seg.x = x;
    seg.y = y;
    retval = pthread_create(&chunk[0], NULL, (void *) local_dot, (void *) &seg);
    if (retval != 0) {
        /* error handling here */
    }

    seg.length = n - n/2 + 1;
    seg.x = &x[n/2];
    seg.y = &y[n/2];
    retval = pthread_create(&chunk[1], NULL, (void *(*)(void *)) local_dot, (void *) &seg);
    if (retval != 0) {
        /* error handling here */
    }

    pthread_join(chunk[0], (void *) (&val));
    sum += *val;
    pthread_join(chunk[1], (void *) (&val));
    sum += *val;
    return sum;
}

double local_dot(dp_args *seg)
{
    int k;
    double sum = 0.0;
    double *x = seg->x;
    double *y = seg->y;
    int n = seg->length;

    for (k = 0; k < n; k++)
        sum += x[k]*y[k];
    return sum;
}

There is a major problem with the above code, which strikes at the heart of shared memory processing and is an especially easy mistake to make after dealing with distributed memory programming. Recall that everything not otherwise specified is shared. Above we have only one dp_args variable, seg. After creating the first thread, we start reloading that variable with the arguments for the second thread. However, the variable seg is shared by both threads, so it is possible that before the first thread reads its argument values from it, the main thread has already started changing them.
Actually, the above code has several errors which are easy to make. One is the "return" value from local_dot(). Pthreads requires the start function to take a single void * argument and return a void *, not a double. In that case, how do you get results back? Easy: since threads share memory, they can just write their results to a global variable - it will automatically be visible to the other threads, including the main one.
This is fixed by the second version of the dotproduct, which also sets up the computation for a general number of threads. The CHUNKSZ macro gives the maximum length of a vector segment to use, and MXSEGS is the maximum number of segments. Each segment's thread writes its result to a global array locsum.
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <errno.h>

#define FALSE 0
#define TRUE 1
#define CHUNKSZ 1000
#define MXSEGS 128

double locsum[MXSEGS];

typedef struct {
    int segment;      /* Segment number */
    int length;       /* Segment length */
    double *x;
    double *y;
} dp_args;

void local_dot(dp_args *seg)
{
    int k;
    double *x = seg->x;
    double *y = seg->y;
    double val = 0.0;

    for (k = 0; k < seg->length; k++)
        val += x[k]*y[k];
    locsum[seg->segment] = val;
}

double dotprod(int n, double x[], double y[])
{
    dp_args *seg;
    pthread_t *chunk;
    static int chunksize = CHUNKSZ;   /* Not a great way to do this */
    int k = 0;
    int retval;
    int start = 0;
    double sum = 0.0;
    int odd = FALSE;
    int nsegs = n/chunksize;

    /* -----------------------------------------------------*/
    /* Increase number of segments if chunksize does not     */
    /* evenly divide n, and allocate threads and dotproduct  */
    /* argument structures                                   */
    /* -----------------------------------------------------*/
    if (n%chunksize != 0) {nsegs++; odd = TRUE;}
    chunk = (pthread_t *) malloc(nsegs*sizeof(pthread_t));
    seg = (dp_args *) malloc(nsegs*sizeof(dp_args));
    if (seg == NULL || chunk == NULL) {
        printf("failure to allocate chunk/seg\n");
        exit(-1);
    }

    /* ------------------------------------------------------- */
    /* Spawn off nsegs threads to compute chunks of dotproduct  */
    /* ------------------------------------------------------- */
    for (k = 0; k < nsegs; k++) {
        /* ------------------------------------------- */
        /* Load up the dp_args object for k-th segment  */
        /* ------------------------------------------- */
        start = k*chunksize;
        seg[k].length = chunksize;
        if (odd == TRUE && k == nsegs-1) {
            seg[k].length = n - (nsegs-1)*chunksize;
        }
        seg[k].x = &x[start];
        seg[k].y = &y[start];
        seg[k].segment = k;

        /* ------------------------ */
        /* Try to create the thread  */
        /* ------------------------ */
        printf("Spawning thread %d \n", k);
        retval = pthread_create(&chunk[k], NULL,
                                (void *(*)(void *)) local_dot, (void *) &(seg[k]));
        if (retval != 0) {
            printf("Unable to create thread ... ");
            if (retval == EINVAL)
                printf("bad arguments to pthread_create()\n");
            else if (retval == EAGAIN)
                printf("not enough resources available for pthread_create()\n");
            exit(-1);
        }
    }

    /* --------------------------------------------*/
    /* Gather results from the different threads.  */
    /* --------------------------------------------*/
    for (k = 0; k < nsegs; k++) {
        pthread_join(chunk[k], NULL);   /* local_dot returns nothing; results are in locsum */
        sum += locsum[k];
    }
    free(chunk);
    free(seg);
    return sum;
}
The above is correct enough. However, in terms of performance there may be some problems. A minor one is the arbitrarily set "chunksize". That, however, can be determined fairly easily, and is something you should be able to answer after Exercise 6.
Another problem is that we wait for the threads to complete in order. However, that is not really necessary - since the variable "sum" in function dotprod() is shared, we can just have every thread add its contribution to it on completion. However, we have to make sure only one at a time does this, and that sum is not returned until all the threads complete their computations. These topics lead to thread synchronization.
How many threads are running in the above code? The answer is nsegs+1, not nsegs, because the main thread is also present. During the dotproduct computation, if there are nsegs or fewer physical processors available, the main thread will be competing with the other threads for processor resources - even though all it is doing is waiting for the other threads to complete. This is a recurring problem in thread programming, and it leads to examining thread programming models.