Final, Computer Science P573
Perform a load/store analysis for a lower triangular solve, both for
the single and multiple right-hand sides cases.
Single right-hand side case
The single right-hand side case is to solve
L*u = v
for the n × 1 vector
u, where L
is a lower triangular n × n matrix
and v is a given n × 1 vector.
Since L is lower triangular, it only requires storing
(n*(n+1))/2 doubles, while
u and v each require n doubles.
That makes getting the minimal possible number of memory references
easy.
One pseudo-code algorithm for solving such a lower triangular system is
for k = 1:n
u(k) = v(k)
for i = 1:k-1
u(k) = u(k) - L(k,i)*u(i)
end for
u(k) = u(k)/L(k,k)
end for
and that can be used to get the count of flops. From there, finding the
memory reference to flop ratio and its asymptotic value is easy.
Do so, then answer the following:
- A lower triangular solve with one right-hand side is a level-2 BLAS
operation (why?). So how does it compare to the other two such creatures
you've seen so far, matrix-vector multiply and rank-1 update? "Compare"
means in terms of which should run faster/slower.
- The algorithm above can actually be done using one array for both
u and v, which is the way it usually is done.
Does using overwriting this way change the load/store analysis?
- What level-1 BLAS operation is implemented by the innermost loop in the code
above? Does the load/store analysis imply that implementing it as a sequence
of those level-1 BLAS operations is near-optimal, or that there is a better
way?
- The inner and outer loops can be interchanged in the algorithm give
above, provided the division by diagonal elements L(k,k) is handled
properly. In that case, what vector operation does the innermost loop consist of?
Multiple right-hand side case
The major use (in terms of computational time and work) for triangular
solves is where the vectors u and v
are replaced by general n × p matrices U and V, where
generally p ≥ n. Sometimes this is also called the block
triangular solve.
An algorithm for doing this is to just do the single-right hand side case
p times, one for each column of V. So this means invoking
the pseudo-code above for j = 1:p, and solve
L*uj = vj
where uj is the j-th column of U
and vj is the j-th column of V.
At least doing so is suitable for getting a flop count for the multiple right-hand
side version of triangular solve, so ... do that.
- What level of BLAS is this operation?
- What is the load-store ratio for the multiple right-hand side triangular solve?
Keep in mind that we are assuming p ≥ n, so for the
asymptotic case letting n → ∞ suffices to
also drive p → ∞.
- What does this imply about the performance of the multiple right-hand side
lower triangular solve?
- What does your load/store analysis imply about
implementing the multiple right-hand side
version as indicated above, as a sequence of single right-hand side solves?
More!
At this point you should be able to make some tentative conjectures (or leaps of
faith) about the following:
- What if anything about the load/store analysis changes for solving an
upper triangular system U*x = y? What about a multiple
right-hand side upper triangular system?
- In LU factorization for solving a linear system of equations,
the lower triangular matrix L will be "unit lower
triangular", meaning the diagonal elements are all 1.0. In that case they are
not stored explicitly and are never referenced. That reduces the amount of
memory needed both theoretically and in practice. Does that change the
asymptotic load/store analysis results?
For this last set, make an educated guess by taking a step back and considering
what your previous computation of memory references and flops for triangular solves
showed. You should be able to carry out the details, but it's a good idea to be
able to be able to form an idea of what those details should show in advance.
Notes
- Allow yourself at least 4 hours for this assignment.
- This is the analytical crux of what P573 covers, so don't just breeze
through without fully understanding what is being asked, and what the answers
imply.
- Matlab code that implements four possible versions for the single
right-hand side case are in a tar file
or in a zip file. Both uncompress into a directory
named "lower".
Handin
Don't waste time trying to type in and format your answers MS Word, Latex, Lotus
Notes, or Incan quipu. It's OK if you have lots of time and just want to learn how to
typeset math in Latex (or tie knots in llama wool strings), but unless that's
something you need to do for research papers or a dissertation, it's a waste of your time.
You can just hand write it and then hand it to me, or put into my mailbox on the first
floor of Luddy Hall, or send it by campus mail.
As always, if you work with or talks to someone other than me about this, just make a
note of it on your handin.