Load/Store Analysis
As the matrix-matrix multiply example shows, using
such techniques to achieve high performance can make a code dense and opaque. And as
pointed out before, optimizing compilers cannot always do the work, either because
the complexity of the code prevents it or because it requires changing the order of
floating point operations, which can lead to different numerical results.
However, many scientific computations can be decomposed into small kernel
operations, such as the matrix-matrix multiplication. Generic kernel examples include
- Dense linear algebra:
"dense" means the vectors and matrices have
mainly nonzero values, and it is generally better to use a 2D array
to store the matrices.
- Sparse linear algebra:
the opposite of "dense" where the vectors and matrices have most
of their entries equal to zero. In this case the data structures to
store them take advantage of the presence and/or location of the zeros.
- FFT:
fast Fourier transforms.
- STL:
the Standard Template Library, which includes several standard
data structures (stacks, queues, trees, etc.) and algorithms on them.
The prototype and original progenitor of kernel libraries is the BLAS for dense linear algebra.
The Basic Linear Algebra Subroutines
The BLAS (Basic Linear Algebra Subroutines) are
classified into three groups, depending on the type of
operands involved. Some examples are given for each level, but there are
many more than shown here. The convention is that all vectors
are n × 1 column vectors. A row vector is denoted and obtained by
using the transpose operator: y^T is 1 × n.
- BLAS Level 1: vector-vector operations
- Vector copy: y = x
- Vector update (saxpy): y = y + alpha*x
- Inner product: alpha = x^T y
- Two norm of a vector: alpha = sqrt(x^T x)
- Vector scaling: x = alpha*x
- BLAS Level 2: matrix-vector operations
- Matrix-vector product: y = beta*y + alpha*A*x
- Transpose matrix-vector product:
y = beta*y + alpha*A^T*x
- Rank-1 update: A = A + x*y^T.
- Triangular solve: Solve Lx = b or Ux = b for x,
where L is a lower triangular n × n matrix and b is n × 1.
- BLAS Level 3: matrix-matrix operations
- Matrix product: C = beta*C + alpha*A*B
- Triangular solve with multiple right-hand side
vectors: solve LX = B or UX = B for X, where L or U is n × n
triangular and B is n × k. This is mathematically the
same as solving the sequence of ordinary triangular solves
Lx_i = b_i, i = 1, ..., k, where
x_i is the ith column of X and b_i
is the ith column of B. Computationally, however, great
advantage can be taken of the block nature of the solves.
How do I know this? Load/store analysis.
Reminder:
the notation follows the Householder convention.
Feeling a little guilty about not recognizing the linear algebra
operations? If any of the above operations are unknown to you, check the
Matrix Manual
for any unrecognized linear algebra terminology.
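To make the three levels concrete, here is a hedged sketch of one call from each
level through the CBLAS interface. The routines daxpy, dgemv, and dgemm are the
standard double-precision BLAS names; the header name and link flags vary by
vendor, so treat the details as an assumption, not a recipe.

    #include <cblas.h>

    int main(void)
    {
        double x[3] = {1, 2, 3}, y[3] = {4, 5, 6};
        double A[9] = {1, 0, 0,  0, 1, 0,  0, 0, 1};  /* 3 x 3 identity, column-major */
        double B[9] = {2, 0, 0,  0, 2, 0,  0, 0, 2};
        double C[9] = {0};

        /* Level 1: y = y + 2*x */
        cblas_daxpy(3, 2.0, x, 1, y, 1);

        /* Level 2: y = 1.0*y + 1.0*A*x */
        cblas_dgemv(CblasColMajor, CblasNoTrans, 3, 3, 1.0, A, 3, x, 1, 1.0, y, 1);

        /* Level 3: C = 0.0*C + 1.0*A*B */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    3, 3, 3, 1.0, A, 3, B, 3, 0.0, C, 3);
        return 0;
    }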
Load/Store Analysis
Knowing which BLAS will get good performance harks back to the
basic idea that data movement needs to be minimized:
Whenever possible, use a BLAS routine with a lower ratio of memory
references to floating point operations.
That principle requires some explanation.
A standalone dotproduct operation cannot be turned into a
saxpy vector update. E.g., if the 2-norm of a vector x is needed, it is
sqrt(x^T x), and that toad won't turn into a prince no matter
who kisses it. Instead, seek opportunities to restate algorithms
so that the innermost loop(s) perform more favorable BLAS functions.
The two basic and one derived numbers for L/S analysis are:
- f = number of flops required for the operation
- m = minimal possible number of memory references for the
operation
- r = m/f = the ratio of the two, usually examined asymptotically as the problem size grows
Finding the first two is simple ....
Counting memory references
Defining the number of memory references could depend upon the particular code,
compiler, wind speed, and your shoe size. When analyzing the memory reference to
flop count ratio, just use the mathematical statement of the problem, not the
(nearly impossible to determine) memory references of a particular implementation on
a particular machine. This is equivalent to assuming that the machine has an
infinite number of registers and only one level of memory. In that case, just count
up all the data from the vectors and matrices in the problem (to get the number of
reads required), plus the amount of data in the result (for the number of writes
required). If the operation involves a lower triangular n × n matrix
L, then the minimal number of memory references does not include the zeros
in the upper triangular part, just the count for the lower triangular entries. The
idea is the minimal amount of data needed to represent the information required
mathematically, not what is done on any particular computer system.
tl;dr version:
Use the mathematical description of the kernel to count memory references, not any
particular implementation.
The only exceptions will be cases like sparse matrices, or compressed representations.
In those cases, still only count the minimal reads and writes needed.
Counting flops
Unlike memory references, the number of flops required can be extracted from
a code fragment, or sometimes found just by looking at the operation. However, always
count the minimal number of flops to do the operation without extra memory
being used. This is actually easy to do; the minimality restriction is not strict,
but meant to avoid stupid things. E.g., two floating point numbers can be multiplied
together (one flop) by first embedding each into the (1,1) entry of a 3000 × 3000 2D
array of zeros, multiplying together the two matrices stored in the order 3000
arrays (requiring 2*3000^3 ≈ 54 billion flops), then extracting the (1,1) entry of the product.
A dotproduct of two vectors of length n can be done using 2n - 1 flops, but the
usual implementation would use 2n flops. The -1 is not significant here, especially
when n is large. Minimality just means to exclude excessive unnecessary flops.
A code can always be written to artificially increase the number of references, or
do something really dumb, by adding in a lot of useless operations. This has
actually been done in some research projects that managed to get published, and it
can have a valid role in parallel computing where a trade-off exists between
inter-node communications and duplicating operations. In parallel computing, duplicating a
small computation on all processors is usually much faster than computing on one and then
sending the result to all of the other processors. So for the flop count, assume
the minimal number possible to implement the algorithm. That number can be derived
from a particular (but not stupid or fake) implementation.
Summary of counting memory references and flops:
Repeating yet again: in the ratio of m/f = memory references/flops,
the denominator can be found from a code or algorithm statement, while
the numerator can be found from a mathematical statement of the
operation. The intent is to get a lower bound on the ratio, not
necessarily an exact count.
A small sidenote:
"reduced order" methods like the Strassen algorithm for matrix-matrix multiplication
are excluded. Those algorithms give different results in finite precision
arithmetic. More deadly, for the few scientific applications that involve dense
matrices, the amount of memory is the limiting bottleneck and a library
implementation that increases memory usage is usually unacceptable. Regardless,
reduced order methods for matrix-matrix multiplication require O(n^2.81) flops
(Strassen) or asymptotically even fewer, instead of the usual O(n^3), but the
asymptotic load-store analysis still holds: the ratio r goes to 0 as n increases.
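For reference, the standard derivation of Strassen's exponent (a well-known
result, not specific to these notes): each step does 7 half-size multiplies plus
O(n^2) additions, so the flop count f(n) satisfies

\[
  f(n) = 7\, f(n/2) + O(n^2)
  \;\Longrightarrow\;
  f(n) = O\!\left(n^{\log_2 7}\right) \approx O(n^{2.81}),
\]

while the minimal memory references remain m = O(n^2), so r = m/f still goes
to 0 as n grows.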
Load/Store Analysis Examples
The process of performing a load/store analysis is best illustrated by
a few examples. However, you must try doing it on other examples
which are not beaten to death here.
Example 1: vector update y = y + alpha*x.
for k = 1:n
y(k) = y(k) + alpha*x(k)
end for
Count the minimal number
of memory references by looking at the mathematical operation. The vector
update requires reading one scalar (alpha) and two vectors, and writing
one vector (y) back to memory. So the number of memory references is
3n+1. Each iteration has an add and a multiply, giving
a ratio of (mem ref)/(flops) = (3n+1)/2n → 1.5 for large n.
In this case, the count of memory references from the mathematical operation
matches the count obtained from the above pseudo-code.
Since alpha is likely kept in a register (which even a brain-dead compiler
will do),
each iteration has two reads (x(k) and y(k)) and one write (y(k)), totaling
three memory references.
However, for level 2 and level 3 BLAS, using the pseudo-code statement of
an algorithm usually gives the wrong number. So don't do that.
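To close out Example 1, here is a hedged sketch of the vector update as a small
C function; the name daxpy and its signature are illustrative, not the official
BLAS interface. The counts from the analysis above are recorded in comments.

    /* Vector update (saxpy/daxpy): y = y + alpha*x.
     * Minimal memory references: read alpha (1), read x (n), read y (n),
     * write y (n), so m = 3n + 1.  Flops: one multiply and one add per
     * iteration, so f = 2n.  Ratio: (3n+1)/(2n) -> 3/2 as n grows. */
    void daxpy(int n, double alpha, const double *x, double *y)
    {
        for (int k = 0; k < n; k++)      /* alpha stays in a register */
            y[k] = y[k] + alpha * x[k];  /* 2 reads, 1 write, 2 flops */
    }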
Example 2: inner product alpha = x^T y.
alpha = 0.0
for k = 1:n
alpha = alpha + x(k)*y(k)
end for
This requires reading two vectors (x and y) of length n and writing one
scalar (alpha), adding up to 2n+1 memory references.
The loop performs 2n flops (minimally 2n-1, since the first addition to
zero is unneeded), giving a ratio of (mem ref)/(flops) = (2n+1)/(2n-1) → 1.
What is the ratio if x = y in the dotproduct?
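While you think about that, here is the inner product as a hedged C sketch
(again an illustrative signature, not the official BLAS one), with the counts
in comments:

    /* Inner product (dotproduct/ddot): alpha = x^T y.
     * Minimal memory references: read x (n), read y (n), write alpha (1),
     * so m = 2n + 1.  Flops: f = 2n - 1 minimally.  Ratio -> 1. */
    double ddot(int n, const double *x, const double *y)
    {
        double alpha = 0.0;          /* accumulator stays in a register */
        for (int k = 0; k < n; k++)
            alpha += x[k] * y[k];    /* 2 reads, 2 flops */
        return alpha;
    }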
Example 3: matrix-vector product y = y + Ax.
This can be implemented in at least two different ways:
Example 3, Version 1:
for k = 1:n
for i = 1:n
y(k) = y(k) + A(k,i) * x(i)
end for
end for
Example 3, Version 2:
for i = 1:n
for k = 1:n
y(k) = y(k) + A(k,i) * x(i)
end for
end for
Both versions perform exactly the same number of flops. Both versions also perform the
summations which give an entry y(k) in the exact same order, so they are identical
even in floating point arithmetic. Which one is preferable? Answering that first
requires answering the question "What are the kernel operations expressed by the
innermost loops?". Ignoring any row versus column storage issue for the array
containing the matrix A, which of the two versions is going to be faster? Why? How
much faster might it be? Answering these questions will indicate if it is worth the
effort to convert the algorithm from one to another, or if a different
implementation is possible that will do better than those two.
In class the two versions were analyzed and the innermost loop was daxpy for
one and dotproduct for the other. Which is which? You need to be able to answer
that question, so try it now.
The matrix-vector product itself is a BLAS-2 operation:
level 2 because it involves both a matrix (A) and vectors (x and y).
So it probably should not be implemented as a sequence
of calls to a BLAS-1 kernel like the two implementations above.
However, if the load-store ratio for the matrix-vector product is the
same as that of, e.g., a dotproduct, then implementing it using a sequence
of dotproducts is optimal. So ... need to find its ratio r.
Example 3, Version 3:
Matrix-vector product with update: y = y + A*x. This operation requires
- read A (n^2 memory reads)
- read x (n memory reads)
- read y (n memory reads)
- write y (n memory writes)
Adding up, the minimal number of memory references is n^2 + 3n.
Each entry of A*x requires 2n-1 flops (n multiplications and n-1 additions),
then adding the result to y requires another flop. So the total flops
needed are 2n^2. That gives
(mem ref)/flop = (n^2 + 3n)/(2n^2) → 1/2 asymptotically as n → ∞.
Compare this to the two level-1 BLAS versions for computing the matrix-vector product above; what does it
imply about the optimal way of doing it?
Does it consist of using one of those two, or some (as yet unknown) other
implementation?
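Whatever you conclude, in practice the whole operation is usually handed to a
tuned level-2 library routine rather than hand-written loops. A hedged sketch
of the call through the CBLAS interface (column-major storage assumed; the
wrapper name matvec_update is illustrative):

    #include <cblas.h>

    /* y = 1.0*A*x + 1.0*y done as a single BLAS-2 call (dgemv),
       leaving the loop ordering and blocking to the tuned library. */
    void matvec_update(int n, const double *A, const double *x, double *y)
    {
        cblas_dgemv(CblasColMajor, CblasNoTrans, n, n,
                    1.0, A, n,   /* alpha = 1, A with leading dimension n */
                    x, 1,        /* x with unit stride                    */
                    1.0, y, 1);  /* beta = 1: accumulate into y           */
    }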
Now go through the same process as above for
- the rank-1 update of a matrix: A = A + u*v^T, where A is an
n × n matrix and u, v are n × 1 vectors.
- solving a lower triangular system Lx = b for vector x, given
the n × n matrix L and the n × 1 vector b.
- rank-2 update of a matrix: A = A + x*y^T + u*v^T,
where x, y, u, and v are vectors of length n and A is an n × n matrix.
- matrix-matrix multiply: C = C + A*B, where the matrices involved are
all n × n.
Of all
the BLAS mentioned, which one is the biggest winner in terms
of (theoretically achievable) performance?
BLAS Example: Gaussian elimination
The immediate thought with this analysis is "who cares, I'll never want to multiply
large matrices together". And that is the correct way to think; what scientists and
engineers really need
to do is to solve differential equations, solve integral equations, etc. It is the
implementation
of those higher level goals that require BLAS operations.
Being able to identify BLAS operations in a code, and choosing the best one for
that operation, is probably the single most important way to bring
load/store analysis to bear in practice.
The load/store analysis is easiest for the linear algebra operations in the BLAS,
but the same methodology can be used just about anywhere in a code.
Load/store analysis tells you which BLAS are "better" than others, but spotting the
BLAS in a code takes some effort. As a general rule of thumb, when you have
- a single loop, identify it as some BLAS-1 operation
- doubly nested loops, look for it to be a BLAS-2 operation
- triply nested loops, look for a BLAS-3 operation
As an example, consider solving a linear system of equations
A * x = b ,
where A is n × n, x and b are n × 1.
A and b are given, and the vector x is what needs to be computed.
The hoary old solution is
Gaussian elimination.
The naive version of this (e.g., as found in
Numerical Recipes in C) is:
for r = 1:n-1
    for c = r+1:n
        A(c,r) = A(c,r)/A(r,r)      % compute the multipliers in column r
    end
    for c = r+1:n
        for i = r+1:n               % update row c of the trailing block
            A(c,i) = A(c,i) - A(c,r)*A(r,i)
        end
    end
end
What BLAS-1 operation does the innermost loop
perform? The innermost loop here
is the one on i, starting with "for i = r+1:n". Slightly harder to recognize
but more importantly: what is the BLAS-2 operation expressed
by the two nested inner loops starting with "for c = r+1:n"?
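Once you have worked out those answers, compare with the following hedged
sketch, which rewrites the elimination as calls to the CBLAS (assuming
column-major storage, 0-based indexing, and no pivoting; fair warning: the
routine names give the answers away). The wrapper name ge_blas is illustrative.

    #include <cblas.h>

    /* Factor A in place (n x n, column-major, leading dimension lda),
     * no pivoting -- the same algorithm as the loops above, with the
     * inner loops replaced by BLAS calls. */
    void ge_blas(int n, double *A, int lda)
    {
        for (int r = 0; r < n - 1; r++) {
            int m = n - r - 1;                 /* size of the trailing block */

            /* multipliers: A(r+1:n, r) = A(r+1:n, r) / A(r,r) */
            cblas_dscal(m, 1.0 / A[r + r*lda], &A[(r+1) + r*lda], 1);

            /* rank-1 update of the trailing submatrix:
             * A(r+1:n, r+1:n) = A(r+1:n, r+1:n) - A(r+1:n, r) * A(r, r+1:n) */
            cblas_dger(CblasColMajor, m, m, -1.0,
                       &A[(r+1) + r*lda], 1,          /* multiplier column  */
                       &A[r + (r+1)*lda], lda,        /* row r, stride lda  */
                       &A[(r+1) + (r+1)*lda], lda);   /* trailing block     */
        }
    }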