Recall that LU factorization with partial pivoting gives PA = LU, where P is a permutation matrix, L is unit lower triangular, and U is upper triangular.
Given P, L, and U, solving Ax = b becomes solving LUx = Pb, since PAx = Pb. Three steps are used in solving for the vector x in LUx = Pb: form z = Pb, solve Ly = z by forward substitution, then solve Ux = y by back substitution.
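To make the three steps concrete, here is a minimal sketch in Python using SciPy. Note that scipy.linalg.lu returns factors with A = P L U, so its P is the transpose of the P in PA = LU above; the matrix and right-hand side are arbitrary illustrative data.

    import numpy as np
    from scipy.linalg import lu, solve_triangular

    A = np.array([[2., 1., 1.],
                  [4., 3., 3.],
                  [8., 7., 9.]])
    b = np.array([4., 6., 8.])

    P, L, U = lu(A)                          # A = P L U, i.e. P.T A = L U
    z = P.T @ b                              # step 1: apply the permutation
    y = solve_triangular(L, z, lower=True)   # step 2: forward solve L y = z
    x = solve_triangular(U, y)               # step 3: back solve U x = y
    assert np.allclose(A @ x, b)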
The most computationally intensive part of solving a linear system via LU is finding the factors. However, handling permutations and triangular solves is also required, so those are analyzed here.
The first way of storing a permutation matrix P is as a permutation vector p, where y = Px means y(i) = x(p(i)). For example,

    i:    1  2  3  4  5  6  7  8
    p(i): 3  7  5  8  4  1  2  6

gives y = (x3, x7, x5, x8, x4, x1, x2, x6)^T. When a permutation vector is used, y can be computed by
    for i = 1:n
        y(i) = x(p(i))
    end for i
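In NumPy that whole loop collapses to one fancy-indexing expression; a minimal sketch, with the example permutation shifted to 0-based indexing:

    import numpy as np

    p = np.array([3, 7, 5, 8, 4, 1, 2, 6]) - 1   # the example p, made 0-based
    x = np.arange(1.0, 9.0)                      # x = (1, 2, ..., 8)
    y = x[p]                                     # y(i) = x(p(i)), all i at once
    # y is (3, 7, 5, 8, 4, 1, 2, 6): component p(i) of x lands in slot i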
Written out in full, the corresponding permutation matrix is

            [ 0 0 1 0 0 0 0 0 ]
            [ 0 0 0 0 0 0 1 0 ]
            [ 0 0 0 0 1 0 0 0 ]
        P = [ 0 0 0 0 0 0 0 1 ]
            [ 0 0 0 1 0 0 0 0 ]
            [ 1 0 0 0 0 0 0 0 ]
            [ 0 1 0 0 0 0 0 0 ]
            [ 0 0 0 0 0 1 0 0 ]

and clearly no one in their right mind would represent it using an array with n^2 = 64 elements. OK, maybe you would for 8×8 matrices ... but for 12000×12000 matrices, you are entering territory where institutionalization is recommended.
The second way of storing a permutation matrix P is with a pivot vector, which is what LU factorization with partial pivoting produces. The data structure is an integer array piv of length n, which can be applied to a vector x to form y = Px using
    y = x
    for k = 1:n
        swap y(k) and y(piv(k)) in the vector y
    end for k
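A minimal sketch of applying a pivot vector by successive swaps (0-based here; apply_pivots is my name, not a library routine). This swap convention is the same one LAPACK's GETRF uses for its ipiv output, though LAPACK's indices are 1-based:

    import numpy as np

    def apply_pivots(x, piv):
        """Apply the swaps y(k) <-> y(piv(k)), k = 0..n-1, to a copy of x."""
        y = x.copy()
        for k, pk in enumerate(piv):
            y[k], y[pk] = y[pk], y[k]
        return y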
Storage: L can be stored by rows in what is called packed storage:
    L: [ l11 | l21 l22 | l31 l32 l33 | l41 l42 l43 l44 | ... ]

so that the matrix element lij is stored in the array location L(j + i(i−1)/2). However, for LU factorization both L and U need to be stored, and A is simply overwritten with them (this works since L is unit lower triangular, so its main diagonal need not be stored, leaving the main diagonal storage free for U). In that case lij is stored in A(i,j), which makes indexing into the array that stores L easier.
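A quick sketch of the packed indexing formula (1-based, matching the formula above; packed_index is a hypothetical helper):

    def packed_index(i, j):
        """1-based packed position of l(i,j), stored by rows: j + i(i-1)/2."""
        return j + i * (i - 1) // 2

    # row 1 holds 1 entry, row 2 holds 2, row 3 holds 3, ...
    assert [packed_index(i, j) for i in range(1, 4) for j in range(1, i + 1)] \
           == [1, 2, 3, 4, 5, 6]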
As an example, a particular lower triangular system is:
    [  2  0  0  0 ] (x1)   ( −2 )
    [  1  3  0  0 ] (x2)   (  2 )
    [ −3  2  5  0 ] (x3) = ( 15 )
    [  1  1  1  1 ] (x4)   (  0 )

Everybody would begin with 2*x1 = −2, giving x1 = −1, and then solve in order for x2, x3, .... Two possibilities occur next: plug in already-known values as you need them to solve for xi, or, when you find a value, plug it into all the remaining equations before going on to find the next value. This gives two algorithms:
    for i = 1:n
        for j = 1:i−1
            b(i) = b(i) − l(i,j)*x(j)
        end for j
        x(i) = b(i)/l(i,i)
    end for i
    for j = 1:n
        x(j) = b(j)/l(j,j)
        for i = j+1:n
            b(i) = b(i) − l(i,j)*x(j)
        end for i
    end for j
Algorithm 1 is sometimes called a row sweep, and Algorithm 2 a column sweep; a code sketch of both appears after the block versions below. Both have block versions, obtained by simply treating lij as an m × m block Lij. Then
    for i = 1:n
        for j = 1:i−1
            b(i) = b(i) − L(i,j)*x(j)
        end for j
        solve L(i,i) x(i) = b(i) for x(i)
    end for i
    for j = 1:n
        solve L(j,j) x(j) = b(j) for x(j)
        for i = j+1:n
            b(i) = b(i) − L(i,j)*x(j)
        end for i
    end for j
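To make the two sweeps concrete, here is a minimal sketch in Python (row_sweep and col_sweep are my names, not library routines), run on the 4×4 example above; both orderings produce x = (−1, 1, 2, −2):

    import numpy as np

    def row_sweep(L, b):
        n = len(b)
        x = np.empty(n)
        for i in range(n):
            # Algorithm 1: gather the already-known terms, then divide
            x[i] = (b[i] - L[i, :i] @ x[:i]) / L[i, i]
        return x

    def col_sweep(L, b):
        n = len(b)
        b = b.astype(float).copy()
        x = np.empty(n)
        for j in range(n):
            x[j] = b[j] / L[j, j]
            # Algorithm 2: push x(j) into all remaining equations
            b[j+1:] -= L[j+1:, j] * x[j]
        return x

    L = np.array([[ 2., 0., 0., 0.],
                  [ 1., 3., 0., 0.],
                  [-3., 2., 5., 0.],
                  [ 1., 1., 1., 1.]])
    b = np.array([-2., 2., 15., 0.])
    print(row_sweep(L, b), col_sweep(L, b))   # both: [-1.  1.  2. -2.]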
The minimal number of memory references is to read L and b and write x. This gives n(n+1)/2 + n + n = (n^2 + 5n)/2 memory references. The number of flops involved is sum_{j=1:n} [2(n−j) + 1] = n^2. So the ratio of memory references to flops is 1/2 + 5/(2n), a typical BLAS-2 number.
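A quick sanity check of that arithmetic (n = 8 is arbitrary):

    n = 8
    mem_refs = n * (n + 1) // 2 + n + n                # read L, read b, write x
    flops = sum(2 * (n - j) + 1 for j in range(1, n + 1))
    assert mem_refs == (n * n + 5 * n) // 2
    assert flops == n * n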
That also means the two block versions shown above are optimal implementations. Both are built on matrix-vector products, which themselves have a memory-reference-to-flop ratio of about 1/2. This is vitally important: the load/store analysis tells us there is no point in pursuing a more efficient version, because no such version exists.
Performing triangular solves is easy, and not worth spending a lot of time optimizing. The reason is simple: computing the LU factorization requires ≈ (2/3)n^3 flops, while the triangular solves require only ≈ 2n^2 (n^2 each for L and U). So optimization efforts should go for the low-hanging fruit first: the LU factorization. The final stages in getting a good LU will turn out to require triangular solves, but with multiple right-hand sides. That is, it will involve solving triangular systems of the form LX = V, where V is an n × p matrix (in blocked LU, X is a block of U). That operation is a level-3 BLAS routine (TRSM) and worth optimizing ... but we'll let the BLAS take care of that one.
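For the curious, a minimal sketch of that multiple-right-hand-side triangular solve in SciPy; this is exactly the operation the level-3 BLAS routine TRSM implements. The sizes here are arbitrary, and the diagonal shift just keeps the random L well conditioned:

    import numpy as np
    from scipy.linalg import solve_triangular

    rng = np.random.default_rng(0)
    n, p = 1000, 50
    L = np.tril(rng.standard_normal((n, n))) + n * np.eye(n)
    V = rng.standard_normal((n, p))
    X = solve_triangular(L, V, lower=True)   # solves L X = V, all p columns at once
    assert np.allclose(L @ X, V)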
Next: Pivoting, and a BLAS-3 LU factorization