Load/Store Analysis
As the matrix-matrix multiply example shows, using
such techniques to achieve high performance can make a code dense and opaque. And as
pointed out before, optimizing compilers cannot always do the work, either because
the complexity of the code prevents it or because it requires changing the order of
floating point operations, which can lead to different numerical results.
However, many scientific computations can be decomposed into small kernel
operations, such as the matrix-matrix multiplication. Generic kernel examples include
- Dense linear algebra:
"dense" means the vectors and matrices have
mainly nonzero values, and it is generally better to use a 2D array
to store the matrices.
- Sparse linear algebra:
the opposite of "dense" where the vectors and matrices have most
of their entries equal to zero. In this case the data structures to
store them take advantage of the presence and/or location of the zeros.
- FFT:
fast Fourier transforms.
- STL:
the Standard Template Library, which includes several standard
data structures (stacks, queues, trees, etc.) and algorithms on them.
The prototype and original progenitor of kernel libraries is the BLAS for dense linear algebra.
The Basic Linear Algebra Subroutines
The BLAS (Basic Linear Algebra Subroutines) are
classified into three groups, depending on the type of
operands involved. Some examples are given for each level, but there are
many more than shown here. The convention is that all vectors
are n × 1 column vectors. A row vector is denoted and obtained by
using the transpose operator: y^T is 1 × n.
- BLAS Level 1: vector-vector operations
- Vector copy: y = x
- Vector update (saxpy): y = y + alpha*x
- Inner product: alpha = x^T y
- Two norm of a vector: alpha = sqrt(x^T x)
- Vector scaling: x = alpha*x
- BLAS Level 2: matrix-vector operations
- Matrix-vector product: y = beta*y + alpha*A*x
- Transpose matrix-vector product:
y = beta*y + alpha*A^T*x
- Rank-1 update: A = A + x*y^T.
- Triangular solve: Solve Lx = b or Ux = b for x,
where L is a lower triangular n × n matrix and b is n × 1.
- BLAS Level 3: matrix-matrix operations
- Matrix product: C = beta*C + alpha*A*B
- Triangular solve with multiple right-hand side
vectors: solve LX = B or UX = B for X, where L or U is n × n
triangular and B is n × k. This is mathematically the
same as solving the sequence of ordinary triangular solves
Lx_i = b_i, i = 1, ..., k, where
x_i is the ith column of X and b_i
is the ith column of B. Computationally, however, great
advantage can be taken of the block nature of the solves.
How do I know this? Load/store analysis.
Reminder:
the notation follows the Householder convention.
Feeling a little guilty about not recognizing the linear algebra
operations? If any of the above operations are unknown to you, check the
Matrix Manual
for any unrecognized linear algebra terminology.
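To make the three levels concrete, here is a hedged sketch of one call from each
level through the CBLAS interface. The routines daxpy, dgemv, and dgemm are the
standard double-precision BLAS names; the header name and link flags vary by
vendor, so treat the details as an assumption, not a recipe.

    #include <cblas.h>

    int main(void)
    {
        double x[3] = {1, 2, 3}, y[3] = {4, 5, 6};
        double A[9] = {1, 0, 0,  0, 1, 0,  0, 0, 1};  /* 3 x 3 identity, column-major */
        double B[9] = {2, 0, 0,  0, 2, 0,  0, 0, 2};
        double C[9] = {0};

        /* Level 1: y = y + 2*x */
        cblas_daxpy(3, 2.0, x, 1, y, 1);

        /* Level 2: y = 1.0*y + 1.0*A*x */
        cblas_dgemv(CblasColMajor, CblasNoTrans, 3, 3, 1.0, A, 3, x, 1, 1.0, y, 1);

        /* Level 3: C = 0.0*C + 1.0*A*B */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    3, 3, 3, 1.0, A, 3, B, 3, 0.0, C, 3);
        return 0;
    }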
Load/Store Analysis
Knowing which BLAS will get good performance harks back to the
basic idea that data movement needs to be minimized:
Whenever possible, use a BLAS routine with a lower ratio of memory
references to floating point operations.
That principle requires some explanation.
A standalone dotproduct operation cannot be turned into a
saxpy vector update. E.g., if the 2-norm of a vector x is needed, it is
sqrt(x^T x), and that toad won't turn into a prince no matter
who kisses it. Instead, seek opportunities to restate algorithms
so that the innermost loop(s) perform more favorable BLAS functions.
The two basic and one derived numbers for L/S analysis are:
- f = number of flops required for the operation
- m = minimal possible number of memory references for the
operation
- r = m/f = the ratio of the two, usually examined asymptotically as the problem size grows
Finding the first two is simple ....
Counting memory references
Defining the number of memory references could depend upon the particular code,
compiler, wind speed, and your shoe size. When analyzing the memory reference to
flop count ratio, just use the mathematical statement of the problem, not the
(nearly impossible to determine) memory references of a particular implementation on
a particular machine. This is equivalent to assuming that the machine has an
infinite number of registers and only one level of memory. In that case, just count
up all the data from the vectors and matrices in the problem (to get the number of
reads required), plus the amount of data in the result (for the number of writes
required). If the operation involves a lower triangular n × n matrix
L, then the minimal number of memory references does not include the zeros
in the upper triangular part, just the count for the lower triangular entries. The
idea is the minimal amount of data needed to represent the information required
mathematically, not what is done on any particular computer system.
tl;dr version:
Use the mathematical description of the kernel to count memory references, not any
particular implementation.
The only exceptions will be cases like sparse matrices, or compressed representations.
In those cases, still only count the minimal reads and writes needed.
Counting flops
Unlike memory references, the number of flops required can be extracted from
a code fragment, or sometimes found just by looking at the operation. However, always
count the minimal number of flops to do the operation without extra memory
being used. This is actually easy to do; the minimality restriction is not strict,
but meant to avoid stupid things. E.g., two floating point numbers can be multiplied
together (one flop) by first embedding each into the (1,1) entry of a 3000 × 3000 2D
array of zeros, multiplying together the two matrices stored in the order 3000
arrays (requiring 2*3000^3 ≈ 54 billion flops), then extracting the (1,1) entry of the product.
A dotproduct of two vectors of length n can be done using 2n - 1 flops, but the
usual implementation would use 2n flops. The -1 is not significant here, especially
when n is large. Minimality just means to exclude excessive unnecessary flops.
A code can always be written to artificially increase the number of references, or
do something really dumb, by adding in a lot of useless operations. This has
actually been done in some research projects that managed to get published, and it
can have a valid role in parallel computing where a trade-off exists between
inter-node communications and duplicating operations. In parallel computing, duplicating a
small computation on all processors is usually much faster than computing on one and then
sending the result to all of the other processors. So for the flop count, assume
the minimal number possible to implement the algorithm. That number can be derived
from a particular (but not stupid or fake) implementation.
Summary of counting memory references and flops:
Repeating yet again: in the ratio of m/f = memory references/flops,
the denominator can be found from a code or algorithm statement, while
the numerator can be found from a mathematical statement of the
operation. The intent is to get a lower bound on the ratio, not
necessarily an exact count.
A small sidenote:
"reduced order" methods like the Strassen algorithm for matrix-matrix multiplication
are excluded. Those algorithms give different results in finite precision
arithmetic. More deadly, for the few scientific applications that involve dense
matrices, the amount of memory is the limiting bottleneck and a library
implementation that increases memory usage is usually unacceptable. Regardless,
reduced order methods for matrix-matrix multiplication require O(n^2.81) flops
(Strassen) or asymptotically even fewer, instead of the usual O(n^3), but the
asymptotic load-store analysis still holds: the ratio r goes to 0 as n increases.
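For reference, the standard derivation of Strassen's exponent (a well-known
result, not specific to these notes): each step does 7 half-size multiplies plus
O(n^2) additions, so the flop count f(n) satisfies

\[
  f(n) = 7\, f(n/2) + O(n^2)
  \;\Longrightarrow\;
  f(n) = O\!\left(n^{\log_2 7}\right) \approx O(n^{2.81}),
\]

while the minimal memory references remain m = O(n^2), so r = m/f still goes
to 0 as n grows.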
Load/Store Analysis Examples
The process of performing a load/store analysis is best illustrated by
a few examples. However, you must try doing it on other examples
which are not beaten to death here.
Example 1: vector update y = y + alpha*x.
for k = 1:n
y(k) = y(k) + alpha*x(k)
end for
Count the minimal number
of memory references by looking at the mathematical operation. The vector
update requires reading one scalar (alpha) and two vectors, and writing
one vector (y) back to memory. So the number of memory references is
3n+1. Each iteration has an add and a multiply, giving
a ratio of (mem ref)/(flops) = (3n+1)/2n → 1.5 for large n.
In this case, the count of memory references from the mathematical operation
matches the count obtained from the above pseudo-code.
Since alpha is likely kept in a register (which even a brain-dead compiler
will do),
each iteration has two reads (x(k) and y(k)) and one write (y(k)), totaling
three memory references.
However, for level 2 and level 3 BLAS, using the pseudo-code statement of
an algorithm usually gives the wrong number. So don't do that.
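To close out Example 1, here is a hedged sketch of the vector update as a small
C function; the name daxpy and its signature are illustrative, not the official
BLAS interface. The counts from the analysis above are recorded in comments.

    /* Vector update (saxpy/daxpy): y = y + alpha*x.
     * Minimal memory references: read alpha (1), read x (n), read y (n),
     * write y (n), so m = 3n + 1.  Flops: one multiply and one add per
     * iteration, so f = 2n.  Ratio: (3n+1)/(2n) -> 3/2 as n grows. */
    void daxpy(int n, double alpha, const double *x, double *y)
    {
        for (int k = 0; k < n; k++)      /* alpha stays in a register */
            y[k] = y[k] + alpha * x[k];  /* 2 reads, 1 write, 2 flops */
    }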
Example 2: inner product alpha = x^T y.
alpha = 0.0
for k = 1:n
alpha = alpha + x(k)*y(k)
end for
This requires reading two vectors (x and y) of length n and writing one
scalar (alpha), adding up to 2n+1 memory references.
The loop performs 2n flops (minimally 2n-1, since the first addition to
zero is unneeded), giving a ratio of (mem ref)/(flops) = (2n+1)/(2n-1) → 1.
What is the ratio if x = y in the dotproduct?
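While you think about that, here is the inner product as a hedged C sketch
(again an illustrative signature, not the official BLAS one), with the counts
in comments:

    /* Inner product (dotproduct/ddot): alpha = x^T y.
     * Minimal memory references: read x (n), read y (n), write alpha (1),
     * so m = 2n + 1.  Flops: f = 2n - 1 minimally.  Ratio -> 1. */
    double ddot(int n, const double *x, const double *y)
    {
        double alpha = 0.0;          /* accumulator stays in a register */
        for (int k = 0; k < n; k++)
            alpha += x[k] * y[k];    /* 2 reads, 2 flops */
        return alpha;
    }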
Example 3: matrix-vector product y = y + Ax.
This can be implemented in at least two different ways:
Example 3, Version 1:
for k = 1:n
for i = 1:n
y(k) = y(k) + A(k,i) * x(i)
end for
end for
Example 3, Version 2:
for i = 1:n
for k = 1:n
y(k) = y(k) + A(k,i) * x(i)
end for
end for
Both versions perform exactly the same number of flops. Both versions also perform the
summations which give an entry y(k) in the exact same order, so they are identical
even in floating point arithmetic. Which one is preferable? Answering that first
requires answering the question "What are the kernel operations expressed by the
innermost loops?". Ignoring any row versus column storage issue for the array
containing the matrix A, which of the two versions is going to be faster? Why? How
much faster might it be? Answering these questions will indicate if it is worth the
effort to convert the algorithm from one to another, or if a different
implementation is possible that will do better than those two.
In class the two versions were analyzed and the innermost loop was daxpy for
one and dotproduct for the other. Which is which? You need to be able to answer
that question, so try it now.
The matrix-vector product itself is a BLAS-2 operation:
level 2 because it involves both a matrix (A) and vectors (x and y).
So it probably should not be implemented as a sequence
of calls to a BLAS-1 kernel like the two implementations above.
However, if the load-store ratio for the matrix-vector product is the
same as that of, e.g., a dotproduct, then implementing it using a sequence
of dotproducts is optimal. So ... need to find its ratio r.
Example 3, Version 3:
Matrix-vector product with update: y = y + A*x. This operation requires
- read A (n^2 memory reads)
- read x (n memory reads)
- read y (n memory reads)
- write y (n memory writes)
Adding up, the minimal number of memory references is n^2 + 3n.
Each entry of A*x requires 2n-1 flops (n multiplications and n-1 additions),
then adding the result to y requires another flop. So the total flops
needed are 2n^2. That gives
(mem ref)/flop = (n^2 + 3n)/(2n^2) → 1/2 asymptotically as n → ∞.
Compare this to the two level-1 BLAS versions for computing the matrix-vector product above; what does it
imply about the optimal way of doing it?
Does it consist of using one of those two, or some (as yet unknown) other
implementation?
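Whatever you conclude, in practice the whole operation is usually handed to a
tuned level-2 library routine rather than hand-written loops. A hedged sketch
of the call through the CBLAS interface (column-major storage assumed; the
wrapper name matvec_update is illustrative):

    #include <cblas.h>

    /* y = 1.0*A*x + 1.0*y done as a single BLAS-2 call (dgemv),
       leaving the loop ordering and blocking to the tuned library. */
    void matvec_update(int n, const double *A, const double *x, double *y)
    {
        cblas_dgemv(CblasColMajor, CblasNoTrans, n, n,
                    1.0, A, n,   /* alpha = 1, A with leading dimension n */
                    x, 1,        /* x with unit stride                    */
                    1.0, y, 1);  /* beta = 1: accumulate into y           */
    }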
Now go through the same process as above for
- the rank-1 update of a matrix: A = A + u*v^T, where A is an
n × n matrix and u, v are n × 1 vectors.
- solving a lower triangular system Lx = b for vector x, given
the n × n matrix L and the n × 1 vector b.
- rank-2 update of a matrix: A = A + x*y^T + u*v^T,
where x, y, u, and v are vectors of length n and A is an n × n matrix.
- matrix-matrix multiply: C = C + A*B, where the matrices involved are
all n × n.
Of all
the BLAS mentioned, which one is the biggest winner in terms
of (theoretically achievable) performance?
BLAS Example: Gaussian elimination
The immediate thought with this analysis is "who cares, I'll never want to multiply
large matrices together". And that is the correct way to think; what scientists and
engineers really need
to do is to solve differential equations, solve integral equations, etc. It is the
implementation
of those higher level goals that require BLAS operations.
Being able to identify BLAS operations in a code, and choosing the best one for
that operation, is probably the single most important way to bring
load/store analysis to bear in practice.
The load/store analysis is easiest for the linear algebra operations in the BLAS,
but the same methodology can be used just about anywhere in a code.
Load/store analysis tells you which BLAS are "better" than others, but spotting the
BLAS in a code takes some effort. As a general rule of thumb, when you have
- a single loop, identify it as some BLAS-1 operation
- doubly nested loops, look for it to be a BLAS-2 operation
- triply nested loops, look for a BLAS-3 operation
As an example, consider solving a linear system of equations
A * x = b ,
where A is n × n, x and b are n × 1.
A and b are given, and the vector x is what needs to be computed.
The hoary old solution is
Gaussian elimination.
The naive version of this (e.g., as found in
Numerical Recipes in C) is:
for r = 1:n-1
    for c = r+1:n
        A(c,r) = A(c,r)/A(r,r)      % compute the multipliers in column r
    end
    for c = r+1:n
        for i = r+1:n               % update row c of the trailing block
            A(c,i) = A(c,i) - A(c,r)*A(r,i)
        end
    end
end
What BLAS-1 operation does the innermost loop
perform? The innermost loop here
is the one on i, starting with "for i = r+1:n". Slightly harder to recognize
but more importantly: what is the BLAS-2 operation expressed
by the two nested inner loops starting with "for c = r+1:n"?
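Once you have worked out those answers, compare with the following hedged
sketch, which rewrites the elimination as calls to the CBLAS (assuming
column-major storage, 0-based indexing, and no pivoting; fair warning: the
routine names give the answers away). The wrapper name ge_blas is illustrative.

    #include <cblas.h>

    /* Factor A in place (n x n, column-major, leading dimension lda),
     * no pivoting -- the same algorithm as the loops above, with the
     * inner loops replaced by BLAS calls. */
    void ge_blas(int n, double *A, int lda)
    {
        for (int r = 0; r < n - 1; r++) {
            int m = n - r - 1;                 /* size of the trailing block */

            /* multipliers: A(r+1:n, r) = A(r+1:n, r) / A(r,r) */
            cblas_dscal(m, 1.0 / A[r + r*lda], &A[(r+1) + r*lda], 1);

            /* rank-1 update of the trailing submatrix:
             * A(r+1:n, r+1:n) = A(r+1:n, r+1:n) - A(r+1:n, r) * A(r, r+1:n) */
            cblas_dger(CblasColMajor, m, m, -1.0,
                       &A[(r+1) + r*lda], 1,          /* multiplier column  */
                       &A[r + (r+1)*lda], lda,        /* row r, stride lda  */
                       &A[(r+1) + (r+1)*lda], lda);   /* trailing block     */
        }
    }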